neural networks collection 2

What do we use the confusion matrix for?

To calculate sensitivity, specificity and accuracy

ANNs Step 3

Training/Learning: ANNs need to learn and change over time by establishing appropriate connections between units: they increase the strength of appropriate connections and prune away inefficient ones (the more you train, the better it gets). Example: classification.

DM Subjective Interest

Subjective measures require a human with domain knowledge to provide measures: • Unexpected results contradicting a priori beliefs • Actionable • Expected results confirming hypothesis

target variable

The predefined attribute whose value is being predicted in a data analytical model

Network Topology

Variations include: • Arbitrary number of layers • Fewer hidden units than input units (causes in effect dimensionality reduction, equivalent to PCA) • Skip-layer connections • Fully/sparsely interconnected networks

Joint Distributions

The probability of every combination of values of all attributes. For binary attributes, the number of entries in the joint table doubles with every added attribute.

K Nearest Neighbors Classifier

Predict whichever label y gets the plurality (majority) of votes among the nearest neighbors

Soft Margin

We have effectively replaced the hard margin with a soft margin. The new optimization goal is to maximize the margin while penalizing points on the wrong side of the decision boundary.

Blocking Paths

When a path is blocked, no information can flow through it. If observing C blocks the path A-C-B, then once C is known there is no added value in observing A: B is conditionally independent of A given C.

multi-layer

multilayer perceptron (back propagation), competitive net

one-against-one (pairwise) classification

Train N(N-1)/2 binary classifiers for an N-way multiclass problem. Each receives the samples of a pair of classes from the original training set and must learn to distinguish these two classes. At prediction time, a voting scheme is applied: all N(N-1)/2 classifiers are applied to the new sample, and the class that gets the highest number of "+1" predictions is predicted by the combined classifier.
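
The voting scheme can be sketched as follows. This is a minimal illustration, not a trained system: `nearest_centre_winner` is a hypothetical stand-in for a trained binary classifier, and the class names and centres are invented for the example.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_predict(classes, pairwise_winner, sample):
    """Apply every pairwise classifier and return the class with the most votes."""
    votes = Counter()
    for a, b in combinations(classes, 2):
        votes[pairwise_winner(a, b, sample)] += 1
    return votes.most_common(1)[0][0]


def nearest_centre_winner(a, b, sample):
    # Hypothetical binary classifier: picks whichever of the two
    # class centres lies closer to the (one-dimensional) sample.
    centres = {"low": 0.0, "mid": 5.0, "high": 10.0}
    return a if abs(sample - centres[a]) <= abs(sample - centres[b]) else b


classes = ["low", "mid", "high"]
n_pairs = len(list(combinations(classes, 2)))  # N(N-1)/2 pairs
prediction = one_vs_one_predict(classes, nearest_centre_winner, 4.2)
```

With N = 3 classes there are three pairwise classifiers, and the sample 4.2 collects two of the three votes for "mid".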

Conditional Independence

x is conditionally independent of y given z if the probability distribution governing x is independent of the value of y given the value of z. Formally: P(X=x | Y=y, Z=z) = P(X=x | Z=z)

multilayer perceptron (back propagation) performance rule

y_j = f(Σ_i w_ij x_i), where f(x) = 1/(1 + e^(-x))
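
A minimal sketch of this rule for a single unit, assuming the weights and inputs are given as plain Python lists (the example values are invented):

```python
import math


def sigmoid(x):
    """Logistic activation f(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def unit_output(weights, inputs):
    """Output of one unit: the sigmoid of the weighted sum of its inputs."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)))


# The weighted sum here is 0.5 * 1.0 + (-0.5) * 1.0 = 0, so the output is f(0).
y = unit_output([0.5, -0.5], [1.0, 1.0])
```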

Analogue

____ circuits code in continuous changes in voltages, as do neurons in their sub-threshold state

neural networks

are software systems that can train themselves to make sense of the human world.

Hebbian Learning

basic idea: When an input and output tend to coincide, the strength of connection between input and output increases

Unsupervised Learning Tasks

- Outlier detection: Is this a 'normal' xi ? - Data visualization: What does the high-dimensional X look like? - Association rules: Which xij occur together? - Latent-factors: What 'parts' are the xi made from? - Ranking: Which are the most important xi ? - Clustering: What types of xi are there?

Unsupervised Learning Methods

- PCA - ICA - K Means Clustering - Spectral Clustering

Prior Probability

- Probability of encountering a class without observing any evidence - Can be generalized for the state of any random variable - Easily obtained from training data (i.e. counting)

C4.5

- Successor of ID3. - Multiway splits are used. - Statistical significant split pruning.

Normal distribution

- The Normal distribution has many useful properties. It is fully described by its mean and variance and is easy to use in calculations. - The good thing: given enough experiments, a Binomial distribution converges to a Normal distribution.

Neurons vs. Nodes

- number of inputs - input activity - excitatory/inhibitory - strength of synapse

Hamming networks: how to find which stored pattern is nearest to a new input pattern

- take maximum of the outputs - can be done using a Maxnet

objective of Kohonen layer

- to attach each output node to a stored pattern as represented by its weight vector - winner node at Kohonen layer is closest to the input vector - weight update rule: for each iteration, adjust the weight such that the weight vector of each node is as near as possible to the input sample vector for which that node is winner

Boosting D_t+1 calculation

D_t+1(i) = D_t(i) * e^(-alpha_t * y_i * h_t(x_i)) / Z_t, where Z_t is the normalization factor
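
A small sketch of this update, reading it as D_t(i) * exp(-alpha_t * y_i * h_t(x_i)) divided by a normalizer Z_t so that the new weights sum to 1. Labels and predictions are assumed to be in {-1, +1}; the value alpha = 0.5 and the data are invented for illustration.

```python
import math


def update_distribution(d, alpha, labels, predictions):
    """One boosting round: up-weight misclassified points, then normalize."""
    unnormalised = [
        w * math.exp(-alpha * y * h)
        for w, y, h in zip(d, labels, predictions)
    ]
    z = sum(unnormalised)  # Z_t, the normalization factor
    return [w / z for w in unnormalised]


d1 = [0.25, 0.25, 0.25, 0.25]   # uniform initial distribution (1/n)
labels = [1, 1, -1, -1]
predictions = [1, 1, -1, 1]     # the last point is misclassified
d2 = update_distribution(d1, 0.5, labels, predictions)
```

After the update, the misclassified point carries more weight than each correctly classified one, which is what makes the next weak learner focus on it.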

Instance Based Learning Pros

Always has good performance on training data; fast; simple

K-NN

Compute the nearest neighbors and assign the class by majority vote. Increasing the value of k makes the algorithm more resilient to noise in the data, although it can also cause some unwanted side effects.
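
A minimal sketch of the classifier, assuming Euclidean distance and a training set stored as (point, label) pairs. The data is invented; one point deliberately carries a noisy label to show the effect of k.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Label the query by majority vote among its k nearest training points."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


train = [
    ((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((0.2, 0.1), "a"),
    ((1.0, 1.0), "b"), ((0.9, 1.1), "b"),
    ((0.15, 0.15), "b"),  # noisy label sitting inside the "a" cluster
]
query = (0.16, 0.16)
label_k1 = knn_predict(train, query, 1)
label_k3 = knn_predict(train, query, 3)
```

With k = 1 the noisy neighbour decides the label; with k = 3 it is outvoted by the two surrounding "a" points.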

VC number and input dimensionality relationship

A linear classifier in d dimensions has VC dimension d + 1

Training error

fraction of training examples misclassified by h

inductive learning defn.

given a set of observations come up with a model, h, that describes them.

vigilance parameter

user can select vigilance parameter to control dissimilarity between members of the same cluster in ART

What is each column in a data frame?

variable with its own type e.g. integer, factor

Causes of inconsistent data

• Different data sources • Functional dependency violation (e.g., modify linked data)

Decision surfaces are

• Linear functions of x • Defined by (D-1) dimensional hyperplanes in the D dimensional input space.

PCA requires calculation of

• Mean of observed variables • Covariance of observed variables • Eigenvalue/eigenvector Computation of covariance matrix

HDF5

• Much more complex file format designed for scientific data handling • It can store heterogeneous and hierarchical organized data. • It has been designed for efficiency.

Sparse Kernel Methods

Kernel methods: • Must be evaluated on all training examples during testing • Must be evaluated on all pairs of patterns during training - Training takes a long time - Testing too - Memory intensive (both disk/RAM). Solution: sparse methods

PAC

Probably approximately correct

Tree Building Algorithms

C4.5, CART, CHAID

SVM Margin is defined as

The minimum distance between decision boundary and any sample of a class

Apriori algorithm

Apriori algorithm is a fast way of finding frequent itemsets

Statisticians

Classification

Decision Tree n-XOR

Expressible, but uses 2^n nodes (exponential)

Graph

G(V,E)

Decision Tree Regression

Splits are chosen by criteria such as variance reduction. Outputs an average, linear fit, or other numerical function.

Explain the holdout method

Split the data into training data and test data. Use the training data to create the model, and the test data to score its accuracy.

What is information gain in ID3?

The expected reduction in entropy
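
As a rough sketch, information gain can be computed as the entropy of the target before a split minus the weighted entropy of the subsets after it. Entropy in bits (log base 2) is an assumption here, and the `rows` table with its attribute names is invented for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the expected entropy after splitting."""
    before = entropy([row[target] for row in rows])
    after = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        after += len(subset) / len(rows) * entropy(subset)
    return before - after


rows = [
    {"windy": True, "warm": True, "play": "no"},
    {"windy": True, "warm": False, "play": "no"},
    {"windy": False, "warm": True, "play": "yes"},
    {"windy": False, "warm": False, "play": "yes"},
]
gain_windy = information_gain(rows, "windy", "play")
gain_warm = information_gain(rows, "warm", "play")
```

Here "windy" separates the classes perfectly, while "warm" tells us nothing, so ID3 would split on "windy".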

How has modern supervised learning improved?

deep learning

Dirty data

incomplete noisy inconsistent

noise

random, meaningless information that pollutes your data set

Machine learning resources

time, space, samples (or data)

one-against-all (one-against-rest) classification

train a single classifier per class, with the samples of that class as positive samples and all other samples as negatives (total classifiers for N classes) this strategy requires the base classifiers to produce a real-valued confidence score for its decision of that class, and the class with the highest confidence is chosen

accuracy

(TP + TN)/ (TP + TN + FP + FN)
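
A short sketch of how accuracy, sensitivity and specificity follow from the confusion matrix counts, assuming binary labels coded as 1 (positive) and 0 (negative). The example labels are invented.

```python
def confusion_counts(actual, predicted):
    """Return (TP, TN, FP, FN) for binary labels coded as 0 and 1."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, tn, fp, fn


def metrics(tp, tn, fp, fn):
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # true positive rate (recall)
        "specificity": tn / (tn + fp),   # true negative rate
    }


actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 0, 1, 0, 1]
counts = confusion_counts(actual, predicted)
scores = metrics(*counts)
```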

SLC Pros

deterministic, if doing in graph space can be solved as spanning tree problem

SVM: ||w||

Norm of the weight vector; the distance between the two margin hyperplanes is 2/||w||, so minimizing ||w|| maximizes the margin

gradient descent/hill climbing

find optimum by iteratively following the gradient/slope of the parameter space

Main difference between supervised and unsupervised learning?

supervised: labels known (classification) unsupervised: no labels (clustering)

Machine Learning

the extraction of knowledge from data based on algorithms created from training data Technique for making a computer produce better results by learning from past experiences.

SVM Kernel Trick

the original input space can be mapped to some higher dimensional feature space where the training set is separable

Define overfitting

the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably

NN Back Propagation

A computationally beneficial organization of the chain rule. Errors flow backwards, so the weights of the entire network can be changed to fit the output. Often reaches a local optimum.

DM Functionalities

Concept/Class description • Characterization • Discrimination • Frequent patterns/ Associations/ Correlations • Classification and Regression (Prediction) • Cluster analysis • Outlier analysis • Evolution analysis

Maximum Likelihood

Derivative

Measure of how fast the function value changes with the change of the argument. So if you have the function f(x)=x^2, you can compute its derivative and learn how fast f(x+t) changes for small enough t. This gives you knowledge about the basic dynamics of the function.

1st order Markov models

Restricted to encoding sequential correlation on previous element only

Deep learning methods

(Deep) Neural Networks • Convolutional Neural Networks • Restricted Boltzmann Machines/Deep Belief Networks • Recurrent Neural Networks

When do we use resampling?

Sample data is very small

Linearly-separable SVM

Satisfying solution (e.g. perceptron algorithm): finds a solution, not necessarily the 'best'. The best solution is the one that promises maximum generalizability.

Normal Density

By far the most (ab)used density function is the Normal or Gaussian density

Version Spaces

Contain true hypothesis in H Training set is subset of X with all x having c(x) Hypotheses consistent with examples

Cross validation

Separate training and testing set using folds that are iteratively checked

Weighted adjacency matrix W

Shows how a graph is connected

Training Perceptron rule

Single unit training, finds separating line Guarantees finite convergence for linearly separable

What are the main applications of artificial neural networks?

Statisticians: classification. Philosophers: mind vs. machines.

What data are in input variables?

can include both categorical and numeric data

Linear Separability

categories are linearly separable if one can categorize the examples perfectly by adding up and weighting the evidence from individual features

What function in R tells us the variable type?

class

What is ε often called?

irreducible error since it's not estimable

multilayer perceptron (back propagation) architecture

supervised, feedforward, dense, nonlinear, random, multi-layer, binary or real

Causes of incomplete data

• "Not applicable" data value when collected • Different considerations between the time when the data was collected and when it is analyzed. • Human/ hardware/ software problems

Machine Learning

Field of study that gives computers the ability to learn without being explicitly programmed.

Inductive Learning: m

Number of examples to train on

Hidden layer(s) can

- Have arbitrary number of nodes/units - Have arbitrary number of links from input nodes and to output nodes (or to next hidden layer) - There can be multiple hidden layers

SVM: Mercer Condition

Kernels act like distance or similarity functions

Unsupervised Learning

- We only have xi values, but no explicit target labels. - You want to do 'something' with them.

Units/Nodes

- neurons - activates by input value - activation passed on through connections

Give an example of Labelled data

Age: 30, 40, 50; Salary: 35,000, 40,000, 60,000

Eager Learner

Aims to learn the function behind a dataset

What is artificial intelligence ?

Computer systems that attempt to model and apply the intelligence of the human mind. Turing Test: a computer that can talk like a human.

Training algorithm

Given a model h with Solution Space S and a training set {X,Y}, a learning algorithm finds the solution that minimizes the cost function J(S)

Eigenvalue

Given an invertible matrix M, an eigenvalue equation can be found in terms of a set of orthogonal vectors v_i and scalars λ_i such that: M v_i = λ_i v_i

Expressiveness of MLP

Given enough hidden layer neurons nH, any continuous function from input to output can be expressed as a 3 layer network.

Mutual Information

Gives a measure of how 'close' two components of a joint distribution are to being independent

Minimum Error Rate

Goal is to minimise error rate

Decision Tree Preference Bias

Good splits at top, correct over incorrect, shorter trees

Rule confidence

confidence(A -> B) = P (B|A)

State 2 partition methods

cross validation and bootstrapping

What does supervised learning require?

massive labeled datasets labels usually come from humans

feedforward

perceptron, multilayer perceptron (back propagation), competitive net

Splitting Criteria for Classification: Indicators of node impurity

Pure node: all examples are of the same class. Maximal impurity: both classes are equally probable. Minimal impurity: all examples belong to the same class.
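
Two common impurity indicators, entropy and the Gini index, illustrate these extremes. This is a sketch with invented labels; entropy is measured in bits (log base 2) as an assumption.

```python
import math
from collections import Counter


def class_probabilities(labels):
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def entropy(labels):
    """Entropy impurity in bits: 0 for a pure node, 1 for a 50/50 binary node."""
    return -sum(p * math.log2(p) for p in class_probabilities(labels))


def gini(labels):
    """Gini impurity: 0 for a pure node, 0.5 for a 50/50 binary node."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))


pure = ["yes", "yes", "yes", "yes"]
mixed = ["yes", "yes", "no", "no"]
pure_scores = (entropy(pure), gini(pure))
mixed_scores = (entropy(mixed), gini(mixed))
```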

What is categorical data?

Qualitative data. Distinctness (= and ≠), e.g. names, gender, medical diagnosis. Order (<, <=, >, >=), e.g. levels of satisfaction, rating.

Rule support

support(A -> B) = P(A ∪ B), the probability that a transaction contains all items of both A and B

Boosting initial importance distribution

uniform (1/n)

What is sampling?

A remedy for unnecessarily fast data collection, e.g. sub-sampling (can create artifacts, destroy signal)

What is dimensionality reduction?

A remedy for unnecessarily many dimensions, e.g. principal component analysis

Search Methods

• Exhaustive • Greedy forward selection • Greedy backward elimination • Forward-backward approach

Evaluation Procedures

• For large datasets, a single split is usually sufficient. • For smaller datasets, rely on cross validation

Parametric methods

• Many methods learn parameters of prediction function (e.g. linear regression, ANNs) • After training, training set is discarded. • Prediction purely based on learned parameters and new data.

Simple data splits

- Fixed train, development and test sets - Bootstrapping - Cross-validation

Cross-validation

- Randomly split data into n folds and iteratively use one as test set - All data used to test, and almost all to train - Good for small sets
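
The fold logic can be sketched in a few lines. This version only produces index lists; the model training and scoring around it are left out, and the function name `k_fold_splits` and the fixed seed are choices made for this illustration.

```python
import random


def k_fold_splits(n_samples, n_folds, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # random split into folds
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_splits(n_samples=10, n_folds=5))
tested = sorted(idx for _, test in splits for idx in test)
```

Every sample lands in exactly one test fold, and each training set contains all the remaining samples.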

clustering

- determining a set of representative centroids/prototypes/cluster centers/reference vectors

Predicates

A logic statement, generally as boolean logic

What are the different steps of batch learning? Is it a practical method?

1) Initialize the weights. 2) Calculate the gradient for the whole dataset. 3) Update the weights. 4) Continue with 2) until it reaches a local minimum. It is often impractical because of the size of the datasets.

Decision Tree Expressiveness

All Boolean Functions

ID3 "best" attribute

Greatest information gain or greatest gini gain

Data Discretization

Grouping a possibly infinite space into a discrete set of possible values. For categorical data: • Super-categories. For real numbers: • Binning • Histogram analysis • Clustering

Inductive Learning: batch or online

Manner in which training examples are presented

Kurtosis

Measure of the tailedness of the distribution along a data dimension, i.e. how prone it is to outliers

Philosophers

Mind and machines

Boosting

Recursively trains models based on "hardness" of data points

Power of hypothesis space

largest set of inputs that hypothesis class can label in all ways

ockham's razor

prefer the simplest hypothesis consistent with data

Commonly used kernels

• Linear kernel • Polynomial kernel • Gaussian kernel (Gaussian kernel is probably the most frequently used kernel out there - Gaussian kernel maps to infinite feature space)

Rulesets

• Single rules are not the solution of the problem, they are members of rule sets • Rules in a rule set cooperate to solve the problem. Together they should cover the whole search space

Impossibility Theorem

No clustering scheme can achieve all three properties

Candidate

Possible target concept

Codebook Vector (CV)

- represents a Voronoi region - also called Voronoi centres - the set of CVs is a compressed form of information represented by all input data

retrieval in associative networks

- retrieval (or "recall") refers to the generation of an output pattern when an input is presented to network

Target Concept

Answer

Estimating hypothesis accuracy

Sample Error vs. True Error Confidence Intervals

Consistency

Shrinking intracluster distances and expanding intercluster distances does not change the clustering

Why is 1-NN not always best?

Real-world data is very noisy

Data Integration

• Entity identification problem • Redundancy detection • Correlation analysis • Detection and resolution of data value conflicts • e.g. weight units, in/exclusion of taxes

types of associations

- hetero-association - auto-association

What are the different steps of stochastic learning? What are its advantages/disadvantages?

1) Initialize the weights. 2) Calculate the loss for a single sample. 3) Calculate the gradients. 4) Update the weights. 5) Continue with 2) until it reaches the global minimum (eventually). Advantages: - Noisy, can hence escape from local minima. - Less computation per learning step and hence much faster. - Often results in better solutions. Disadvantages: - Wobbles around very strongly and hence might not converge. - Small computations do not use full capacity. - Conditions of convergence are hidden.
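
The steps above can be sketched for a one-variable linear model with squared loss. The model, learning rate, epoch count and noise-free data are all assumptions chosen to keep the example small; with noisy data the weights would keep wobbling around the optimum.

```python
import random


def sgd_linear_fit(xs, ys, lr=0.05, epochs=500, seed=0):
    """Fit y = w * x + b by updating on one sample at a time."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0                       # 1) initialize the weights
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            error = w * xs[i] + b - ys[i]  # 2) loss for a single sample is error**2
            grad_w = 2 * error * xs[i]     # 3) gradients of that loss
            grad_b = 2 * error
            w -= lr * grad_w               # 4) update the weights
            b -= lr * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]                  # generated by y = 2x + 1
w, b = sgd_linear_fit(xs, ys)
```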

prediction model (introduction)

= a mapping from known features to an unknown target - do not make decisions: prediction + threshold = decision

Inductive Learning: Epsilon

Accuracy to which target concept is approximated

Branching Factor

Branching factor of node at level L is equal to the number of branches it has to nodes at level L + 1

Complete Linkage Clustering

Distance between two clusters is maximum distance between observation in one cluster and observation in other cluster
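
A minimal sketch of this cluster distance, assuming Euclidean distance between observations (the points are invented):

```python
import math


def complete_linkage(cluster_a, cluster_b):
    """Largest pairwise distance between a point in one cluster and one in the other."""
    return max(math.dist(p, q) for p in cluster_a for q in cluster_b)


cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(3.0, 0.0), (4.0, 0.0)]
distance = complete_linkage(cluster_a, cluster_b)  # farthest pair: (0, 0) and (4, 0)
```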

Joint Entropy

H(x,y) = - SUM(P(x,y)logP(x,y))
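
A small sketch of this sum, assuming the joint distribution is given as a dictionary from (x, y) pairs to probabilities and that the logarithm is base 2 (entropy in bits). Zero-probability entries are skipped, following the usual convention 0 log 0 = 0.

```python
import math


def joint_entropy(joint):
    """H(X, Y) = -sum over (x, y) of P(x, y) * log2 P(x, y)."""
    return -sum(p * math.log2(p) for p in joint.values() if p > 0)


# Two independent fair binary variables: four equally likely outcomes.
uniform = {("a", 0): 0.25, ("a", 1): 0.25, ("b", 0): 0.25, ("b", 1): 0.25}
certain = {("a", 0): 1.0, ("a", 1): 0.0}
h_uniform = joint_entropy(uniform)
h_certain = joint_entropy(certain)
```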

MAP

Maximum a posteriori

How is deep neural network optimized?

Optimized through gradient descent (forward and backward passes, i.e. backpropagation) - Penalize complex solutions to avoid overfitting

Marginalization

P(x) = sum_y(P(x,y))

The Principle of Plurality

Plurality should not be posited without necessity.

Weak Relevance

There exists a set of features such that adding x_i to it improves bayes optimal classifier

Decision Tree Expression of Continuous Attributes

Use inequalities, also able to use same attribute again

What does predictive data mining include?

classification regression

Wrapping Evaluation Methods

hill climbing randomized optimization forward search backward search

hetero-association

mapping input vectors to output vectors that range over a different vector space, e.g. translating English words to Spanish words

classification

outcome is a discrete variable (typically <10 outcomes)

K nearest neighbor learning algorithm

store all training examples

What version of data frame do we use in this class?

tibble

supervised learning aims

to discover a function h(x) that approximates f(x), where h is a hypothesis

Preference Bias

what kind of hypothesis we prefer from H

perceptron learning rule

Δw_ij = k δ_j x_i, where δ_j = (t_j - y_j) for desired outputs t
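
The rule can be sketched for a single threshold unit learning the logical AND function. The step activation, the bias handled as an extra weight with constant input 1, the learning rate k = 1 and the epoch count are assumptions made for this illustration.

```python
def step(x):
    return 1 if x >= 0 else 0


def predict(weights, x):
    """Threshold unit; the last weight is a bias with constant input 1."""
    extended = list(x) + [1.0]
    return step(sum(w * xi for w, xi in zip(weights, extended)))


def train_perceptron(samples, k=1.0, epochs=20):
    weights = [0.0] * (len(samples[0][0]) + 1)
    for _ in range(epochs):
        for x, t in samples:
            delta = t - predict(weights, x)          # delta_j = t_j - y_j
            extended = list(x) + [1.0]
            # Delta w_ij = k * delta_j * x_i
            weights = [w + k * delta * xi for w, xi in zip(weights, extended)]
    return weights


and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights = train_perceptron(and_samples)
outputs = [predict(weights, x) for x, _ in and_samples]
```

Because AND is linearly separable, the weights stop changing after a few epochs and all four patterns are classified correctly.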

Bayes' Error

- The Bayes Error rate is the theoretical lowest possible error rate for any classifier on a given problem (dataset). - For real data, it is not possible to calculate the Bayes Error rate, although upper bounds can be given when certain assumptions on the data are made. - The Bayes Error functions mostly as a theoretical device in Machine Learning and Pattern Recognition research.

One-class SVM

- Unsupervised learning problem - Similar to probability density estimation - Instead of a pdf, goal is to find smooth boundary enclosing a region of high density of a class

Layers

- Input and output layers; input can come from the programmer or from other nodes in the system - Optional: one or more hidden layers

What is association?

- the task of mapping input patterns to target patterns ("attractors") - for instance, an associative memory may have to complete (or correct) an incomplete (or corrupted) pattern - unlike computer memories, no "address" is known for each pattern

vector quantization

- this is a task that applies unsupervised learning to divide an input space into several connected regions called Voronoi Regions, representing a quantization of the space - every point in the input space belongs to a region and is mapped to the corresponding CV

Know which scientific fields ANNs are used in.

- Computer scientists: information processing and learning, image classification, object detection and recognition - Statisticians: classification - Engineers: signal processing and automatic control - Physicists: statistical mechanics - Biologists: predicting protein shape from mRNA sequences, disorder diagnosis, personalized medicine - Philosophers: minds and machines - Cognitive scientists: models of thinking, learning, and cognition - Neurophysiologists: understanding sensory systems and memory; ANNs can be used to understand how visual information is represented in V2, V4 and higher levels of the visual hierarchy; in some studies, humans and ANNs can solve the same task; one study showed mice black-and-white movies while recording from regions in visual cortex - In short: used in lots of different fields

Know how ANNs compare to human brains.

-Composed of many units (similar to neurons) -Units are interconnected (similar to synapses) -Units occupy separate connected layers (similar to multiple brain regions in sensory pathways) -Units in deeper layers respond to more and more abstract information (similar to more complex receptive fields in "higher" cortical areas) -Require learning to perform tasks efficiently (similar to neural plasticity) -Through experience, ANN learn to recognize patterns

What can ANNs do?

- Facebook face recognition at 98% accuracy - Self-driving cars: detect important objects on the road, tell moving cars apart from cyclists and pedestrians, predict what objects will do, choose a path - Navigation: choosing the best route given current traffic conditions, finding landmark locations - Voice recognition (Siri, Skype, Android). Why were they struggling before? Each person's sounds are unique, humans speak in a continuous flowing manner, and "ice cream" vs. "I scream", "I" and "eye", and other language ambiguities make voice processing hard - Language: they can describe pictures (but they pay more attention to details)

What machines are still NOT good at:

-Learning from small number of examples and less practice -Solving multiple tasks simultaneously -Holding conversations -Active learning -Scene understanding -Language acquisition -Common sense -Feelings -Consciousness -Theory of Mind (understanding thought and intentions of others) -Learning-to-learn -Creativity

State the advantages of KNN

-Robust to 'noisy' data -Excellent if training data is large

Overfitting

A hypothesis is said to be overfit if its prediction performance on the training data is overoptimistic compared to that on unseen data. It presents itself in complicated decision boundaries that depend strongly on individual training examples.

connectionist network

A network of units (like neurons) that are connected to one another and transfer information between each other (like axons). Made up of input units, hidden units, and output units as well as connection weights.

What is a ROC curve?

A plot of sensitivity vs. (1 - specificity); points on the curve represent the different cut-off points used for testing positive.

What is a decision tree?

A popular supervised learning algorithm used in classification problems (it resembles a flowchart).

Turing test

A test to empirically determine whether a computer has achieved intelligence

Sparse coding

A type of coding that uses as small a number of active neurons as possible and provides another important design principle for engineers building artificial neural networks

Error Backpropagation

1. Apply input vector to network and propagate forward 2. Evaluate d(k) for all output units 3. Backpropagate d's to obtain d(j) for all hidden units 4. Evaluate error derivatives as:

Measures of classification accuracy

Classification error rate; cross validation; recall, precision, confusion matrix; Receiver Operating Characteristic (ROC) curves; two-alternative forced choice

Fishers LDA

Classification Problem. A dimension reducing technique. A set of parameters are used to project data x to a smaller dimension d'. The aim is to maximize the distance between the means of the two classes.

SVM: x_i (^T) x_j

Measure of similarity

Feature Selection

Method of reducing the complexity of a dataset by choosing the best features

Local minima

The smallest value of the function within a neighbourhood. There may be several, and a local minimum need not be the global minimum.

Support and Confidence

Are measures of pattern interestingness

Errors

Arise from world noise, electrical noise

CART

Classification And Regression Trees

Inductive Learning: complexity of H

Complexity of hypothesis class (High: overfit, Low: underfit)

Attribute subset selection

Feature selection. Feature selection is a form of dimensionality reduction in ML, hence the DM term 'dimensionality reduction' for manifold projection is problematic. Approaches: • Exact solution: infeasible • Greedy forward selection • Backward elimination • Forward-backward • Decision tree induction

Soft Clustering

For clusters that overlap with other clusters, each point has a percentage that it is of each cluster

recurrent

Hopfield net

Cost function - ℓ2 norm

In order to avoid over-fitting, one common approach is to add a penalty term to the cost function. A common choice is the ℓ2-norm penalty, which adds a term proportional to the sum of the squared weights to C0, where C0 is the unregularized cost.

How to increase P(h|D) using h?

Increase P(h) and P(D|h)

The Principle of Parsimony

It is pointless to do with more what is done with less.

What is the gradient of a function?

It is the "derivative" of a multi-variable function, which maps a vector onto a scalar. It is the vector that contains all partial derivatives of that function, and it points in the direction of steepest ascent of the function at that point.

Regression

ML task where T has a real-valued outcome on some continuous sub-space Examples: • Age estimation • Stock value prediction • Temperature prediction • Energy consumption prediction

Chain Rule

P(a,b) = P(a|b)P(b) or P(a,b) = P(b|a)P(a)

Stopping Criteria

Reaching a node with a pure sample is always possible but usually not desirable as it usually causes over-fitting.

ROC Curve

Receiver Operating Characteristic (ROC) curves plot TP vs FP rates

RELU

Rectified Linear Unit. New trend, responsible for a great deal of Deep Learning success. Advantages: • No 'vanishing gradient' problem for active (positive-input) units • Can model any positive real value • Can stimulate sparseness

Instance Reduction

Reduces the number of instances rather than attributes. Much more dangerous, as it risks changing the data distribution properties • Duplicate removal • Random sampling • Cluster sample • Stratified sampling

Numerosity Reduction

Reduces the number of instances rather than attributes. Much more dangerous, as it risks changing the data distribution properties. • Parametrization • Discretization • Sampling

What is pruning?

Reducing the size of the decision tree by getting rid of the deadweight.

Association Rules

Reflect items that are frequently found (purchased) together, i.e. they are frequent itemsets • Information that customers who buy beer also buy crisps is encoded as, e.g.: beer => crisps [support = 2%, confidence = 75%]
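
Support and confidence for such a rule can be computed directly from a list of transactions. This is a sketch on invented shopping baskets, reading support as the fraction of transactions that contain all items of the rule and confidence as P(B|A).

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """P(consequent | antecedent), estimated from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


transactions = [
    {"beer", "crisps"},
    {"beer", "crisps", "bread"},
    {"beer"},
    {"bread"},
    {"crisps", "bread"},
]
rule_support = support(transactions, {"beer", "crisps"})
rule_confidence = confidence(transactions, {"beer"}, {"crisps"})
```

In this toy data, two of five baskets contain both items, and two of the three beer baskets also contain crisps.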

NN Error Functions

Separate error functions are used for: - Regression - Binary classification - Multiple independent binary classifications - Multi-class classification (mutually exclusive)

Regularization

Regularization is a technique used in an attempt to solve the overfitting problem in statistical models.

Abstract Essence of ML

Representation + Evaluation + Optimisation

Bayesian Inference

Representing and reasoning with probabilities

Instance Based Learning

Retains all training data and infers new classifications/values from it

Tree components

Root node, branch, node, leaf node.

Step Activation Function

Switches from 0 to 1 at input 0 with infinite slope

Integrate-and-fire neurons

Silicon neurons that, like real neurons in the visual cortex, have the job of extracting information about the angles of lines and contrast boundaries in the retinal image

TPR - True Positive Rate - Recall

TP / actual positives = TP / (TP + FN)

Constrained Teacher

Teacher unable to give the sought after function immediately Must show what is relevant and irrelevant through examples

Intrinsic dimensionality

Subspace of data space that captures degrees of variability only, and is thus the most compact possible representation

Cross-validation criterion

Split training data in a number of folds. For each fold, train on all other folds and make predictions on the held-out test fold. Combine all predictions and calculate error. If error has gone down, continue splitting nodes, otherwise, stop

Emission Probabilities

Probabilities of observed variables

Lazy Learner

Remembers training data and only tries to apply it to unknown data when queried

Sample

Training set

h_ML

argmin(sum((d_i-h(x_i))^2)) (also sum of squared error)

types of numeric data in R

int double/numeric

Clustering Properties

Richness, Scale invariance, consistency

Ancestral sampling

is a simple sampling method well suited to PGNs

Drawback of avoiding overfitting method

data withheld for test set is not used for training

Restriction Bias

restricts hypothesis set (H) to probable answers

Autoassociative network

A type of network that stores patterns rather than merely pairs of items

Training Gradient descent/delta rule

Calculus so more robust, converges to local optimum

What is Artificial Intelligence

The study of computer systems that attempt to model and apply the intelligence of the human mind

Kernel trick

The key element of kernel methods is that they do not actually map features to this space; instead they return the similarity (inner product) between elements in this space. This implicit mapping is called the kernel trick.

Evaluation procedure for single split or cross validation

- For large datasets, a single split is usually sufficient. - For smaller datasets, rely on cross validation

Simple competitive learning

- Hamming net and Maxnet help to determine whose weight vector is nearest to an input pattern on more complex networks - also known as Kohonen Learning

A Bernoulli trial

- It is a trial with a binary outcome, for which the probability that the outcome is 1 equals p (think of a coin toss of an old warped coin with the probability of throwing heads being p). - A Bernoulli experiment is a number of Bernoulli trials performed after each other. These trials are i.i.d. by definition.

What are Learning Vector Quantizers

- LVQ uses winner-take-all network (clustering is used as preprocessing step) - class membership is known for each training pattern - learns the codebook vectors

Latent variables

- Latent variables are variables that are 'hidden' in the data. They are not directly observed, but must be inferred. - Clustering is one way of finding discrete latent variables in data.

Overfitting can occur when

- Learning is performed for too long (e.g. in Neural Networks) - The examples in the training set are not representative of all possible situations (is usually the case!) - Model parameters are adjusted to uninformative features in the training set that have no causal relation to the true underlying target function.

LDA

- Linear Discriminant Analysis - Most commonly used as dimensionality reduction technique in the pre-processing step for pattern-classification and machine learning applications. - The goal is to project a dataset onto a lower-dimensional space with good class-separability in order to avoid overfitting

Clustering Applications

- Market segmentation - Social Network Analysis - Vector quantization - Facial Point detection

Regularization

- Maximum likelihood generalization error (i.e. cross-validation) - Regularized error (penalize large weights) - Early stopping

All probability theory can be expressed in terms of two rules

- Product rule - Sum rule

Data Mining

- Quest to extract knowledge and/ or unknown interesting patterns from apparently unstructured data. aka Knowledge Discovery from Data (KDD) • Data mining bit of a misnomer - information/ knowledge is mined, not data.

Random Initialization

- Randomly Initialize K-Means clusters using actual instances as cluster centers - Run K-Means and store centers and final Cost function - Pick clusters of iteration with lowest Cost function as optimal solution - Most useful if K < 10
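
The restart procedure above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the helper names (`sq_dist`, `kmeans_once`, `kmeans_restarts`), the toy data, and the number of restarts are all assumptions made for the example.

```python
import random

def sq_dist(a, b):
    # Squared Euclidean distance between two points given as tuples.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_once(points, k, rng, max_iter=100):
    # Initialize the centers with actual instances, as the card suggests.
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            clusters[j].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = []
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centers.append(
                    tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
                )
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    # Cost function: sum of squared distances to the nearest center.
    cost = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, cost

def kmeans_restarts(points, k, n_restarts=20, seed=0):
    # Run K-Means several times and keep the run with the lowest cost.
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[1])

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, cost = kmeans_restarts(points, k=2)
```

Keeping only the lowest-cost run is what guards against a poor local minimum from an unlucky initialization.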

Fixed train, development and test sets

- Randomly split data into training, development, and test sets. - Does not make use of all data to train or test - Good for large datasets

Maximum margin classifiers

- This turns out to be a solution where decision boundary is determined by nearest points only - Minimal set of points spanning decision boundary sought - These points are called Support Vectors

Regression Trees

- Trained in a very similar way - Leaf nodes are now continuous values - the value at a leaf node is that assigned to a test example if it reaches it - Leaf node label assignment is e.g. mean value of its data sample Problem: nodes make hard decisions, which is particularly undesired in a regression problem, where a smooth function is sought.

Orthogonality

- Two vectors a and b are orthogonal if they're perpendicular - That is, if their inner product is 0: a · b = 0

Backpropagation

- Used to calculate derivatives of error function efficiently - Errors propagate backwards layer by layer

Three common ways to decide when to stop splitting decision tree

- Validation set - Cross-validation - Hypothesis testing (chi-squared statistic)

Batch Gradient descent

- Vanilla gradient descent, aka batch gradient descent - Make small change in weights that most rapidly improves task performance - Gradient descent computes the gradient of the cost function w.r.t. the parameters θ for the entire training dataset - Can be very slow - Intractable for datasets that don't fit in memory - Doesn't allow us to update our model online, i.e. with new examples on-the-fly. - Guaranteed (given a suitable learning rate) to converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces.
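
As a rough illustration of the points above, the sketch below fits a line y ≈ w·x + b by batch gradient descent on the mean squared error. The model, the learning rate, the epoch count, and the toy data are assumptions chosen for the example; the card itself does not prescribe them.

```python
def batch_gradient_descent(xs, ys, lr=0.05, epochs=2000):
    """Fit y ~ w*x + b by minimizing mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # The gradient is computed over the ENTIRE training set on every step,
        # which is what makes this "batch" gradient descent.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Small step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1
w, b = batch_gradient_descent(xs, ys)
```

Because every update touches all examples, the cost per step grows with the dataset size, which is the slowness the card refers to.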

Little's Synchronous Mode

- all nodes are updated simultaneously, at every time instant using *eqn* - cyclic behaviour may result when two nodes simultaneously update to move towards a different attractor

uses of Hopfield

- can be used to retrieve a stored pattern when a corrupted version of pattern is presented - can also be used to "complete" a pattern when parts of the pattern are missing

Hopfield networks

- commonly used for auto-association and optimization tasks - node values are iteratively updated based on its net weighted input at a given time - it is a fully connected symmetric network - weights are determined using Hebbian principle - typically undergoes many state transitions before reaching the stable state
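
A small sketch of these properties, assuming bipolar (+1/-1) node values, Hebbian weights with no self-connections, and asynchronous updates. The function names and the 6-node pattern are made up for illustration.

```python
def train_hebbian(patterns):
    # Hebbian rule: W[i][j] accumulates the correlation of nodes i and j
    # over all stored patterns. The result is symmetric, with a zero diagonal.
    n = len(patterns[0])
    W = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    W[i][j] += p[i] * p[j]
    return W

def recall(W, state, max_sweeps=10):
    # Asynchronous updates: one node at a time, based on its net weighted input.
    state = list(state)
    n = len(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            net = sum(W[i][j] * state[j] for j in range(n))
            new_value = 1 if net >= 0 else -1
            if new_value != state[i]:
                state[i] = new_value
                changed = True
        if not changed:  # stable state reached
            break
    return state

stored = [1, -1, 1, -1, 1, -1]
W = train_hebbian([stored])
noisy = [1, 1, 1, -1, 1, -1]  # second node corrupted
restored = recall(W, noisy)
```

With a single stored pattern and one corrupted node, the network settles back into the stored pattern after one sweep.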

learning in associative networks

- consists of encoding the desired (to be stored) patterns as a weight matrix (network)

Pruning: post-pruning

- fully grow tree - cut branches that "do not add much" (basically the opposite of tree growing, trace performance of tree on a 'fresh' set of data)

Feature extraction

- goal is to find the most important features (ie those with the highest variation in a given population) - side-effect is reduction of input dimensionality

error-correction in NI

- has low error-correction capabilities - not learning anything for changes in data, assigned from the training data - multiplying W with even a slightly corrupted input vector often results in a "spurious" output that differs from stored patterns

weights in hopfield network

- implicitly store the attractors (training samples) - also represent the correlations between node values for all attractor patterns - therefore, a large weight indicates a greater correlation between neighbouring node values

Why use a NN instead of using an array as a lookup table?

- it is parallel (independent of the number of entries) - it is fault tolerant (graceful degradation) - it is a neural model of memory - if set up properly, it can provide outputs for noisy inputs (as in character recognition, input-outputs are typical characters and hand-written or scanned inputs are noisy which need to be recognized)

negative correlation (w_lj is -ve and large)

- nodes l and j frequently have opposite ON/OFF values in attractor patterns

PART 1: CLASSIFIERS

- pattern recognition - diagnostic decisions EX: plants VS. vehicles

drawback of Hopfield

- performance of Hopfield network depends considerably on the number of target attractor patterns to be stored - "assumption of full connectivity" - a million weights for a thousand node network

storage capacity

- refers to the quantity of information that can be stored and retrieved without error: C = number of stored patterns / number of neurons in the network - depends on connection weights, stored patterns, and difference between stimulus patterns and the stored patterns - if not fully connected: C = number of stored patterns / number of connection weights in network

resonance in ART

- signals travel back and forth between the output and input layers until a match is discovered. if no match is found a new cluster is formed around the new input vector

Discrete Hopfield networks

- similar to the recurrent network - asynchronicity: at each time interval only one node's output changes other modes: non-synchronous and synchronous

BAM stability

- stability is not assured for synchronous computations - however, stability is assured with inter-layer asynchrony (nodes updated at discrete time) even if all the nodes in one layer change state simultaneously - this allows a greater amount of parallelism than Hopfield networks (in asynchronous HN only one node is updated at a time) - rate of stabilization depends on proximity of new input to stored pattern

Tests for comparing distributions

- t-test compares two distributions - ANOVA compares multiple distributions - If NULL-hypothesis is refuted, there are at least two distributions with significantly different means Does NOT tell you which they are!

What is the objective of associative networks?

- to model associative memory - the network memorizes its input-output pairs

Brain-State-in-a-Box (BSB)

- used for auto-association and can be extended to hetero-association with two or more layers of nodes - fully connected - # of nodes = dimensionality n of input data - simultaneous updates of nodes - values of nodes are continuous (belong to [-1,+1]) - node function is a ramp function - positive self activation

Approximation of Data distribution

- using fewer points in the same approximate areas to represent a distribution - clustering extracts only one point for each cluster

Hebb's observation (1949)

- when one cell repeatedly assists in firing another, the axon of the first cell develops synaptic knobs (or enlarges them if they already exist) in contact with the soma of the second cell - weight change, which monotonically increases if x belongs to {0,1} (use a decay factor that gradually reduces the weight) - if used in systems that use bipolar {+1,-1} signals, weights can increase and decrease

State the disadvantages of KNN

- Need to predetermine k - Ambiguous choice of distance metric - Computation cost is high

components of Human Intelligence:

-perception,self-awareness -learning -ability to use reason and logic -ability to write and speak clearly, use language -behavior in social situations -ability to recognize, understand and deal with people, objects, and symbols -ability to think on the spot and solve novel problems (intuition)

Creating a Tree Model in pseudo-code

1) Start from root node 2) For each variable find the best split - For nominal variables, consider splits of the type X=a, X=b, ... - For ordinal / numeric variables, consider splits of the type X ≤ a - Assess quality of split somehow (see below) 3) Compare best splits per variable across variables - Quality(split at salary) vs. Quality(split at age) vs. ... 4) Selecting best overall split gives two internal nodes 5) Repeat above for each new internal node 6) Continue until some stopping criterion is met

Pruning: pre-pruning

1) do not fully grow tree but stop early - min gain is below some threshold - max depth is reached 2) disadvantages - consider focal node only (horizon effect) - how to select parameters ( eg. max depth)?

Knowledge Discovery Process

1. Data cleaning - remove noise and inconsistencies 2. Data integration - combine data sources 3. Data selection - retrieve relevant data from db 4. Data transformation - aggregation etc. (cf. feature extraction) 5. Data mining - machine learning 6. Pattern Evaluation - identify truly interesting patterns 7. Knowledge representation - visualize and transfer new knowledge

Six general questions to decide on decision tree algorithm:

1. How many splits per node (properties binary or multi valued)? 2. Which property to test at each node? 3. When to declare a node to be leaf? 4. How to prune a tree that has become too large (and when is a tree too large)? 5. If a leaf node is impure, how to assign a category label? 6. How to deal with missing data?

There are three reasons to reduce the dimensionality of a feature set

1. Remove features that have no correlation with the target distribution 2. Remove/combine features that have redundant correlation with target distribution 3. Extract new features with a more direct correlation with target distribution.

Backward Elimination

1. Start with complete SF set (contains all original features) 2. Find feature that, when removed, reduces the filter score least 3. Remove feature from SF set 4. Repeat steps 2-3 until convergence

Forward Selection

1. Start with empty SF set and candidate set being all original features 2. Find feature with highest filter score 3. Remove feature from candidate set 4. Add feature to SF set 5. Repeat steps 2-4 until convergence
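
The forward-selection steps can be sketched as below. The card does not say what the filter score is or how convergence is detected, so both are assumptions here: `toy_score` is a hypothetical score (summed relevance minus a per-feature penalty), and the loop stops when adding the best candidate no longer improves the score.

```python
def forward_selection(features, score):
    selected = []                 # Step 1: empty SF set
    candidates = list(features)   # Step 1: all original features are candidates
    best = score(selected)
    while candidates:
        # Step 2: candidate whose addition yields the highest filter score.
        f = max(candidates, key=lambda c: score(selected + [c]))
        new_score = score(selected + [f])
        if new_score <= best:     # assumed convergence test: no improvement
            break
        candidates.remove(f)      # Step 3
        selected.append(f)        # Step 4
        best = new_score
    return selected

# Hypothetical per-feature relevance values, for illustration only.
relevance = {"age": 0.6, "salary": 0.5, "zip": 0.05, "id": 0.0}

def toy_score(subset):
    # Hypothetical filter score: total relevance minus a penalty per feature.
    return sum(relevance[f] for f in subset) - 0.1 * len(subset)

chosen = forward_selection(relevance, toy_score)
```

Backward elimination follows the same pattern in reverse: start from the full set and repeatedly drop the feature whose removal hurts the score least.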

Random forests

1. Very good performance (speed, accuracy) when abundant data is available. 2. Use bootstrapping/bagging to initialize each tree with different data. 3. Use only a subset of variables at each node. 4. Use a random optimization criterion at each node. 5. Project features on a random different manifold at each node.

alternative HA models

1. performs iterative auto association in input layer - generates a store pattern and feeds it to 2nd layer of hetero-association network 2. generates output from 1st layer using non-iterative node rule - feeds to output layer where it performs iterative auto-association -generates a stored output 3. bidirectional associative memory (BAM)

Self organized map phases

1. volatile phase - prototypes search for niches to move into 2. sober phase - consists of nodes settling into cluster centroids in the vicinity of positions found in the earlier phase - sober phase converges but the emergence of an ordered map depends on the result of the volatile phase

Backpropagation

A common method of training a neural net in which the initial system output is compared to the desired output, and the system is adjusted until the difference between the two is minimized. algorithm that propagates errors back through hidden layers to input

Define bias

A systematic error in the model.

Dropout

A very different approach to avoiding over-fitting is to use an approach called dropout. Here, the output of a randomly chosen subset of the neurons are temporarily set to zero during the training of a given mini-batch. This makes it so that the neurons cannot overly adapt to the output from prior layers as these are not always present. It has enjoyed wide-spread adoption and massive empirical evidence as to its usefulness.

K-medoids clustering

Addresses issue with quadratic error function (L2-norm, Euclidean norm) Replace L2 norm with any other dissimilarity measure (V...)

Neuronal networks

All real brains consist of highly interconnected ____ that need space and whose neurons need energy

PGNs are generative models

Allow us to sample from the probability distribution it defines

Principal Components Analysis

An eigenproblem that projects the data along the axis or axes of maximum variance. Aims to minimize L_2 error when moving dimensions and allow for best reconstruction

How to train ANN

An error function on the training set must be minimized. This is done by adjusting: - Weights connecting nodes. - Parameters of non-linear functions h(a).

Backprop is for:

Arbitrary feed-forward topology Differentiable nonlinear activation functions Broad class of error function

What are the similarities and differences between biological and artificial neurons?

Artificial: need more time, practice, and examples to learn

Curse of dimensionality

As feature number grows, amount of needed data to generalize increases exponentially

t-test

Assesses whether the means of two distributions are statistically different from each other

Naïve Bayes Cons

Assumes features are independent from each other

Decision Tree "Best" Attribute

Attribute which splits set as nearly in half as possible

ARFF

Attribute-Relation File Format

Cross validation error

Average of the error scores (equivalently, 1 minus the average accuracy) over all folds

Belief Network

Bayes Network, directed acyclic graph with probability tables with parent dependencies

z-score normalisation

Better terminology is zero-mean normalization • min-max normalization cannot cope with outliers, z-score normalization can. • Transforms all attributes to have zero mean and unit standard deviation. • Outliers are in the heavy-tail of the Gaussian. • Still a linear transformation of all data.
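
A minimal sketch of z-score normalization using only the standard library. Using the population standard deviation (`pstdev`) is an assumption; some texts use the sample standard deviation instead.

```python
from statistics import mean, pstdev

def z_score(values):
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumed convention)
    # Linear transformation: shift to zero mean, scale to unit standard deviation.
    # Note: this divides by zero if all values are identical.
    return [(v - mu) / sigma for v in values]

z = z_score([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

For this input the mean is 5 and the standard deviation is 2, so the value 9 maps to a z-score of 2.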

Sampling Theory Basics

Binomial and Normal Distributions Mean and Variance

NN Expressiveness

Boolean, continuous functions, arbitrary function ("jumps between continuous functions")

Noisy data - Clustering

Canceling noise by clustering - Cluster data into N groups - Replace original values by means of clusters OR: - Use to detect outliers

Noisy data - Regression

Canceling noise by regression: 1. Fit a parametric function to the data using minimization of e.g. least squares error 2. Replace original values by the parametric function value

Support Vector Machines

Chooses decision boundary with the greatest margin on either side

Filtering

Classifier does not inform the feature search and as such the data is "filtered" once and then handed off to learner

Maximum margin classifier

Classifier which is able to give an associated distance from the decision boundary for each example.

Locally weighted regression

Close K points are chosen and a line fit to them

Simpler Method for complex itemset

Closed frequent itemset: X is closed if there exists no super-set Y such that Y has the same support count as X Maximal frequent itemset: X is frequent, and there exist no supersets Y of X that are also frequent

Ensemble Learning

Combine rules to create complex rules

Hidden Unit Activation

Common functions are the unit step, the sigmoid (logistic), and tanh

Classification measures - Error Rate

Common performance measure for classification problems 1. Success: Instance's class is predicted correctly (True Positives (TP) / Negatives (TN)). 2. Error: Instance's class is predicted incorrectly (False Positives (FP) / Negatives (FN)). 3. False positives - Type I error. False Negative - Type II error. 4. Classification error rate: Proportion of instances misclassified over the whole set of instances.
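
The four outcome counts and the error rate can be computed as in the sketch below. The function names and the choice of `1` as the positive class are assumptions for the example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1   # success: positive predicted correctly
        elif p == positive:
            fp += 1   # Type I error
        elif t == positive:
            fn += 1   # Type II error
        else:
            tn += 1   # success: negative predicted correctly
    return tp, fp, tn, fn

def error_rate(tp, fp, tn, fn):
    # Proportion of misclassified instances over all instances.
    return (fp + fn) / (tp + fp + tn + fn)

y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
counts = confusion_counts(y_true, y_pred)
rate = error_rate(*counts)
```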

KNN with k=n (simple average)

Constant output

K-Means Issues

Convergence is guaranteed but not necessarily optimal - local minima likely to occur • Depends largely on initial values of µk. • Hard to define optimal number K. • K-means algorithm is expensive: requires Euclidean distance computations per iteration. • Each instance is discretely assigned to one cluster. • Euclidean distance is sensitive to outliers.

DM task primitives

DM task primitives form the basis for DM queries. DM primitives specify: • Set of task-relevant data to be mined • Kind of knowledge to be mined • Background knowledge to be used • Interestingness measures and thresholds for pattern evaluation • Representation for visualizing discovered patterns.

Basic Decision Tree

Decision trees apply a series of linear decisions, that often depend on only a single variable at a time. Such trees partition the input space into cuboid regions, gradually refining the level of detail of a decision until a leaf node has been reached, which provides the final predicted label.

What is Deep Learning?

Definition: • Hierarchical organization with more than one (non-linear) hidden layer in-between the input and the output variables • Output of one layer is the input of the next layer

Three ways of constructing new kernels

Direct from feature space mappings Proposing kernels directly Combination of existing (valid) kernels • multiplication by a constant • exponential of a kernel • sum of two kernels • product of two kernels • left/right multiplication by any function of x/x'

graceful degradation

Disruption of performance due to damage to a system that occurs only gradually as parts of the system are damaged. This occurs in some cases of brain damage and also when parts of a connectionist network are damaged.

Cost function - Euclidean distance

Distance measure between a pair of samples p and q in an n-dimensional feature space

Entropy algorithm

E(S) = - sum_v(p(v)*log(p(v)))
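
A direct transcription of the formula, assuming a base-2 logarithm (the card does not state the base) and the usual convention that terms with p(v) = 0 contribute nothing.

```python
from math import log2

def entropy(probs):
    # E(S) = -sum_v p(v) * log2(p(v)); zero-probability values are skipped.
    return -sum(p * log2(p) for p in probs if p > 0)

fair_coin = entropy([0.5, 0.5])
certain = entropy([1.0])
four_way = entropy([0.25, 0.25, 0.25, 0.25])
```

With base 2, a fair coin has 1 bit of entropy, and a uniform choice among four values has 2 bits, so the value exceeds 1 once there are more than two equally likely outcomes.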

Decision Tree Stopping Conditions

Early stopping Cross validate pruning weaker leaves

Confusion Matrix

Easy to see if the system is commonly mislabelling one class as another

Directed PGN

Edges have direction (Bayesian Network)

Choosing K

Elbow method • Visual inspection • 'Downstream' Analysis

Rule based learning

Equivalent in expression power to traditional (mono-thetic) decision trees, but with more flexibility • They produce rule sets as solutions, in the form of a set of IF... THEN rules

Bootstrapping

Estimating the sampling distribution of an estimator by resampling with replacement from the original sample.
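
A small sketch of the idea, using the sample mean as the estimator. The data values, the number of resamples, and the seed are arbitrary choices for illustration.

```python
import random
from statistics import mean, stdev

def bootstrap_means(sample, n_resamples=1000, seed=0):
    rng = random.Random(seed)
    n = len(sample)
    # Each resample draws n values WITH replacement from the original sample.
    return [mean(rng.choices(sample, k=n)) for _ in range(n_resamples)]

sample = [2.1, 2.5, 1.9, 3.0, 2.7, 2.2, 2.8, 2.4]
means = bootstrap_means(sample)
# The spread of the resampled means approximates the standard error of the mean.
std_error = stdev(means)
```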

Some variables are observed, others are hidden/latent

Example observed: Labels of a training set Example hidden: Learned weights of a model

Machine Learning broad definition

Field of study that gives computers the ability to learn without being explicitly programmed.

Concept

Function

Entropy

Helps rank attributes by how much they contribute information ranging from 0 (no info) to 1 (maximum info)

Sample complexity

How many teaching examples are needed for a learner to create a successful hypothesis (batch)

Computational complexity

How much computational effort is needed for a learner to converge

General classification problem

If classes are disjoint, i.e. each pattern belongs to one and only one class then input space is divided into decision regions separated by decision boundaries or surfaces

Cross-validation

In k-fold cross-validation, a dataset is split into k roughly equally sized partitions, such that each example is assigned to one and only one fold. At each iteration a hypothesis is learned using k-1 folds as the training set and predictions are made on the k'th fold. This is repeated until a prediction is made for all k folds, and an error rate for the entire dataset is obtained. Cross-validation maximises the amount of data available to train and test on, at the cost of increased time to perform the evaluation. • Test data segments of different folds should never overlap • Training and test data in the same fold should never overlap Error estimation can either be done per fold separately, or delayed by collating all predictions per fold.
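
The fold construction can be sketched as follows. This version assigns indices to folds round-robin without shuffling, which is a simplifying assumption; in practice the data is usually shuffled first.

```python
def k_fold_indices(n, k):
    # Assign each example index to exactly one fold (round-robin).
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test_idx in enumerate(folds):
        # Train on the other k-1 folds, test on fold i.
        train_idx = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(10, 3))
```

Every index appears in exactly one test fold, so collating the predictions of all folds gives one prediction per example.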

Binomial distribution

In probability theory and statistics, the binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments, each of which yields success with probability p.

SAMPLE ERROR

In statistics, sampling error is incurred when the statistical characteristics of a population are estimated from a subset, or sample, of that population.

i.i.d.

Independent and identically distributed random variables

K-Means Clustering

Informally, goal is to find groups of points that are close to each other but far from points in other groups • Each cluster is defined entirely and only by its centre, or mean value µk

Instances

Input

ID3

Iterative Dichotomiser version 3. Used for nominal, unordered input data only. Every split has a branching factor equal to the number of values the variable can take (e.g. bins of a discretized variable). The tree has as many levels as input variables.

Deep Learning

Involves developing the tools of critical thinking and applying them to whatever challenges you encounter now and in the future. - deep learning neural network architectures differ from normal ones bc they have more hidden layers - they also differ bc they can be trained in an UNSUPERVISED or SUPERVISED manner for both UN.. and SUP.. tasks

What is the Jacobian of a function?

It is the "derivative" of a multi-variable function, which maps a vector onto a vector. It is the matrix that contains all partial derivatives of that function and describes the slope of the function at that point.

Define KNN

K Nearest Neighbours uses the majority in k neighbours (training data) to make a prediction about the new test data
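
A minimal sketch of the majority vote, assuming squared Euclidean distance and a training set given as (feature vector, label) pairs. The names and the toy data are illustrative.

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label) pairs."""
    # Sort the training examples by distance to the query point.
    by_distance = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )
    # Majority (plurality) vote among the k nearest neighbours.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "A"), ((1.0, 2.0), "A"), ((2.0, 1.0), "A"),
         ((8.0, 8.0), "B"), ((9.0, 8.0), "B"), ((8.0, 9.0), "B")]
near_a = knn_predict(train, (1.5, 1.5))
near_b = knn_predict(train, (8.5, 8.5))
```

Note that every prediction scans the whole training set, which is why the computation cost of KNN is high at query time.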

Kernel methods

Kernel methods map a non-linearly separable input space to another space which hopefully is linearly separable • This space is usually higher dimensional, possibly infinitely • Even the 'non-linear' kernel methods essentially solve a linear optimization problem!!!!

Gaussian Laplacian

L = D - W

Wrapping

Learner informs feature search of effectiveness of set when finding optimal set. Takes model bias into account but very slow.

What can ANNs not do?

- Learning from small numbers of examples and less practice (ANN: 38 days vs. human: 2 hours) - Solving multiple tasks simultaneously - Holding conversations - Active learning (humans seek new information to gain knowledge) - Scene understanding - Language acquisition - Common sense - Feelings - Consciousness - Theory of mind (understanding thought and intentions of others) - Learning to learn (getting better at learning new tasks) - Creativity

Linear Regression Big-O

Learning takes time O(n) with space O(1) Querying is time O(1) and space O(1)

Explain cross validation

Leave one out method

Polynomial Regression

Match data to f(x) = c_0 + c_1 * x + c_2 * x^2 ... using mean squared error

ICA Examples

Natural Scenes -> edges Documents -> topics

Undirected PGN

No edge direction (Markov Random Field)

Instance Based Learning Cons

No generalization Overfits Lookup with same x could give multiple y's

Degrees of freedom of variability

Number of ways data can change/ number of separate transformations possible

Small set of SVs means that

Our solution is now sparse

Perceptron Algorithm

Perceptron is modeled after neurons in the brain. It has m input values (which correspond with the m features of the examples in the training set) and one output value. Each input value x_i is multiplied by a weight-factor w_i. If the sum of the products between the feature value and weight-factor is larger than zero, the perceptron is activated and 'fires' a signal (+1). Otherwise it is not activated. The weighted sum between the input-values and the weight-values, can mathematically be determined with the scalar-product <w, x>. To produce the behaviour of 'firing' a signal (+1) we can use the signum function sgn(); it maps the output to +1 if the input is positive, and it maps the output to -1 if the input is negative. Thus, this Perceptron can mathematically be modeled by the function y = sgn(b+ <w, x>). Here b is the bias, i.e. the default value when all feature values are zero.
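
The function y = sgn(b + <w, x>) can be written out directly. The card leaves sgn(0) unspecified; mapping an activation of exactly zero to -1 is an assumption here, as are the example weights, which were picked by hand to implement a logical AND.

```python
def perceptron_output(w, x, b):
    # Weighted sum of the inputs plus the bias: b + <w, x>.
    activation = b + sum(wi * xi for wi, xi in zip(w, x))
    # Fire (+1) only if the activation is larger than zero.
    return 1 if activation > 0 else -1

# Hypothetical weights and bias implementing AND on inputs in {0, 1}.
w, b = [1.0, 1.0], -1.5
outputs = {x: perceptron_output(w, x, b)
           for x in [(0, 0), (0, 1), (1, 0), (1, 1)]}
```

With all inputs at zero the activation equals the bias, which matches the card's description of b as the default value.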

Random Components Analysis

Randomly chooses the axes onto which the data is projected. Frustratingly, works well when preprocessing classification problems (by chance picks up some correlations)

Conditional Entropy

Randomness of y when given x: H(y|x) = -SUM_{x,y} P(x,y) log P(y|x). If x and y are independent (x ⊥ y): H(y|x) = H(y) and H(x,y) = H(x) + H(y)

Terminology Prediction Trees

Root node - contains all data Splitter/ branching node - "asks" a question, often binary Leaf node - no further splits, makes predictions

Significance level

Significance level α%: α times out of 100 you would find a statistically significant difference between the distributions even if there was none. It essentially defines our tolerance level. If the calculated t value is above the threshold chosen for statistical significance then the null hypothesis that the two groups do not differ is rejected in favor of the alternative hypothesis: the groups do differ.

What are the characteristics of a good mini-batch composition?

Since the network learns faster from unexpected and new examples and that samples from the same classes contain similar information, it is best to compose mini-batches with samples of different classes. The same sample may be re-used in different mini-batches, as in this circumstance the sample would create a different error surface. Re-using the exact same mini-batch should however be avoided.

Cost Function

Squared error cost function. J(S)

Physicists

Statistical mechanics

Gradient points in the direction of

Steepest Ascent

DM Objective Interest

Support: P(X U Y ) Percentage of transactions that a rule satisfies Confidence: P(Y | X) Degree of certainty of a detected association, i.e. the probability that a transaction containing X also contains Y
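
Both measures can be computed from a list of transactions, as sketched below. Representing transactions as Python sets and the item names are assumptions for the example.

```python
def support(transactions, itemset):
    # Fraction of transactions that contain every item in the itemset.
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, x, y):
    # P(Y | X) = support(X and Y together) / support(X).
    return support(transactions, set(x) | set(y)) / support(transactions, x)

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
rule_support = support(transactions, {"bread", "milk"})
rule_confidence = confidence(transactions, {"bread"}, {"milk"})
```

Here the rule bread => milk holds in 2 of 4 transactions (support 0.5), and in 2 of the 3 transactions that contain bread (confidence 2/3).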

SVMs seek a decision boundary

That maximizes the margin

If we calculate the rate of change of a function f with respect to a matrix W, what is the Jacobian of W?

The Jacobian of W is the matrix that contains the partial derivatives of f with respect to each of the variables contained in W.

Energy cost

The ____ of signaling - from one neuron to another - has probably been a major factor in the evolution of brains

Learning rules

The ____ that train networks do so by modifying the strength of the connections between the neurons, a common one being a rule that takes the output of the network to a given input pattern and compares it with the desired pattern

Curse of Dimensionality

The curse of dimensionality refers to how certain learning algorithms may perform poorly on high-dimensional data. First, it's very easy to overfit the training data, since many different hypotheses can describe the target label (in the case of supervised learning). In other words, we can easily express the target using the dimensions that we have. Second, we may need to increase the number of training examples exponentially to overcome the curse of dimensionality, and that may not be feasible. Third, in learning algorithms that depend on distance, like k-means for clustering or k nearest neighbors, all points can become far from each other, and it's difficult to interpret the distance between data points.

Define variance

The difference from one model to the next.

Complex Itemsets

The general rule procedure for finding frequent item sets would be: 1. Find all frequent itemsets 2. Generate strong association rules However, this is terribly costly, with the total number of item sets to be checked for 100 items being

Generalize

The neat thing about ANNs lies in their ability to ____ to input patterns they have never been exposed to in training

Define classification

The process of categorising something usually with a discrete value output (0 or 1)

Feedforward associator

The simplest form of ANN which has layers of interconnected input and output units

How to make it smarter?

Things that can and cannot be improved Ability to connect neurons = smarter, increase connections (units) ANN only learn what you program them to learn (trials is not training, just practicing, can't teach themselves, can't LEARN) THEY CAN'T LEARN We can study for quiz that can help us with a test, but ANN can't generalize to other situations Can't understand social cues

How to make trees compact?

To do so, we will seek to minimise impurity of data reaching descendent nodes

Training a network

Training a NN involves finding the parameters that minimize some error function Choice of activation function depends on the output variables: - Unity for regression - Logistic sigmoid for (multiple independent) binary classification - Softmax for exclusive (1-of-K) multiclass classification

Non-linearly Separable Problem

Usually problems aren't linearly separable (not even in feature space) 'Perfect' separation of training data classes would cause poor generalization due to massive overfitting

node

Labels

Values that h aims to predict Example: • Facial expressions of pain • Impact of diet on astronauts in space • Predictions of house prices

What trees are preferable?

We prefer simple, compact trees, following Occam's Razor

Acquiring emissions

Wide range of options to model ****** probabilities: - Discrete tables - Gaussians - Mixture of Gaussians - Neural Networks/RVMs etc to mode

response variable

a variable that measures an outcome or result of a study

h_MAP

argmin[-lg(P(D|h))-lg(P(h))] or minimization of (length(D|h)+length(h))

ANNS Step 2

brain is made up of billions of neurons and quadrillion of synapses and the more powerful brains have more connections and neuron between them and similarly ANN with more units, connections, and layers are "smarter" Hierarchy of visual processing: in the retina, neurons are receptive to points of light and darkness; in the primary visual cortex respond to faces, hands, all sorts of complex objects, both natural and manmade - ANN uses similar hierarchy of layers, info becomes more and more abstract at higher levels

Naïve Bayes Pros

cheap inference few parameters estimate parameters with labelled data connects inference and classification Empirically successful

What is classification?

classify an item as one of existing # of categories

K nearest neighbor prediction algorithm

classify the new example by finding the training example that's most similar; return corresponding label

What does descriptive data mining include?

clustering: id distinct grouping of data summarizing: avgs, associations, test stats anomaly detection: id unusual items or data pts

What is feature creation?

combine attributes to use e.g. includes polynomial functions of x in regression

sparse or dense

competitive net

Hopfield convergence

computation terminates since: - each node update decreases (lower-bounded) E - the number of possible states is finite - the number of possible node updates is limited - and node output values are bounded the continuous model generalizes the discrete model, but the size of the state space is larger and the energy function may have more local minima

Where do we store data in R?

data frame

topological vicinity

defined as topological distance D(t) between nodes - the neighbourhood contains nodes that are within a distance of D(t) from node j at time t, where D(t) decreases with time - D(t) does not refer to Euclidean distance, it refers to the length of path connecting two nodes

Degree matrix D

degrees d_i on the diagonal: D = (d_i δ_ij), i,j = 1,...,N (Kronecker delta δ_ij is 1 if i = j, 0 otherwise.)
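
A short sketch of building D from a weighted adjacency matrix W, together with the unnormalized Laplacian L = D - W that appears elsewhere in this set. Taking the degree of node i to be the sum of row i of W is the usual convention and is assumed here; the 3-node graph is made up.

```python
def degree_matrix(W):
    n = len(W)
    # d_i (the sum of row i) on the diagonal, zeros elsewhere.
    return [[sum(W[i]) if i == j else 0 for j in range(n)] for i in range(n)]

def laplacian(W):
    D = degree_matrix(W)
    n = len(W)
    return [[D[i][j] - W[i][j] for j in range(n)] for i in range(n)]

# Node 0 is connected to nodes 1 and 2 with unit weights.
W = [[0, 1, 1],
     [1, 0, 0],
     [1, 0, 0]]
D = degree_matrix(W)
L = laplacian(W)
```

Each row of L sums to zero, which is a quick sanity check on the construction.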

weight update function for Kohonen learning

delta_w = n(i-w) n = learning rate i = input w = weight
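
One step of this rule might look like the sketch below. Selecting the winner by smallest squared Euclidean distance and updating only the winner are assumptions based on the winner-take-all description of competitive learning; the function name and values are illustrative.

```python
def competitive_step(weights, x, lr=0.5):
    # Winner: the node whose weight vector is nearest to the input.
    winner = min(
        range(len(weights)),
        key=lambda j: sum((wi - xi) ** 2 for wi, xi in zip(weights[j], x)),
    )
    # delta_w = n(i - w), applied to the winning node only.
    weights[winner] = [wi + lr * (xi - wi) for wi, xi in zip(weights[winner], x)]
    return winner

weights = [[0.0, 0.0], [10.0, 10.0]]
winner = competitive_step(weights, [1.0, 2.0])
```

The winning weight vector moves a fraction `lr` of the way toward the input, so repeated presentations pull it toward the centre of the inputs it wins.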

Average Linkage Clustering

distance between two clusters is the mean distance between observations in one cluster and observations in the other cluster

canned phrases

everyday communication - greeting others, answering phone calls, replying "I'm fine, and yourself?"

types of categorical data in R

factor (categories) logical (T or F) character (words or sentences)

Richness

for any assignment of objects to clusters, there is some distance D such that P_0 returns that clustering

Boosting algorithm

for t in T: construct importance distribution at t find weak classifier h_t(x) w/ small error output H_final

True error

fraction of examples that would be misclassified on sample drawn from distribution

Single Linkage Clustering

given K, link clusters until only k clusters remain Link due to distance between clusters (nearest two points) O(n^3)

Optimizing NN weights strategies

gradient descent momentum higher order derivatives randomized optimization penalty for complexity

Spectral Clustering with Gaussian Laplacian Algorithm

i. Given a similarity matrix between data points, construct the weighted graph adjacency matrix W ii. Compute the Laplacian L iii. Compute the first k eigenvectors u_1,...,u_k of L: Lu = λu (Lu = λDu for the Shi-Malik algorithm, which is more commonly used) iv. Let U ∈ R^(N×k) be the matrix with the eigenvectors as columns v. Take the rows z_i ∈ R^(1×k), i = 1,...,N, of U vi. Cluster {z_i}, i = 1,...,N, using k-means into clusters C_1,...,C_k. Output: clusters A_1,...,A_k with A_i = {v_j | z_j ∈ C_i}.

Activation Functions

i. Sigmoid function ii. Hyperbolic tangent function iii. Rectified Linear Units - used for hidden layer neurons iv. Softmax function - used for output layer neurons

auto-association

input vectors and output vectors range over the same vector space, eg. character recognition, eliminating noise from pixel arrays

What are aspects of unsupervised learning?

labels are unknown / don't exist uncover unknown structure in the data infer hidden labels and # of groups dimension reduction visualization and exploratory analysis

binary classification

labels belong to only two classes

Splitting Criteria for Classification: measurement

let I(N) denote the impurity of some node N goodness of split is weighted mean decrease in impurity information gain (IG): IG(N) = I(N) - p^N1*I(N1) - p^N2*I(N2)... see notes
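
As an illustration, the sketch below computes the weighted mean decrease in impurity for one split, assuming entropy as the impurity measure I(N) (Gini would work the same way). The function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy impurity of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """I(parent) minus the size-weighted impurity of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 4 + ["no"] * 4
perfect_split = [["yes"] * 4, ["no"] * 4]                   # both children pure
useless_split = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]

print(information_gain(parent, perfect_split))
print(information_gain(parent, useless_split))
```

A split that yields pure children recovers all of the parent's impurity, while a split that leaves the class mix unchanged gains nothing.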

What are the values that factors take on one of?

levels

Sample Complexity and VC Dimension

m >= (1/Epsilon) * (8 * VC(H) * lg(13/Epsilon) + 4 lg(2/delta))

Haussler Theorem

m >= (1/Epsilon) * (ln(|H|) + ln(1 / delta)) If you know your epsilon and delta targets, sample a lot and you'll be fine as long as H is finite
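
A small sketch that evaluates this bound numerically for a finite hypothesis space. The function name and the example numbers are hypothetical; the result is rounded up because m counts whole examples.

```python
import math

def haussler_sample_bound(h_size, epsilon, delta):
    """Smallest integer m with m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / epsilon)

# Example: |H| = 1024 hypotheses, error target 0.1, failure probability 0.05.
print(haussler_sample_bound(1024, epsilon=0.1, delta=0.05))
```

The bound grows only logarithmically in |H| and 1/delta, but linearly in 1/epsilon.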

What is transformation?

make distribution symmetric or more Gaussian e.g. log-transform (may make interpretation more difficult)

How to make a data set linearly separable?

mapping data to a higher-dimensional space

What is the unexplainable variation from unmeasured X's?

measurement error ∈ "error term"

Usefulness

measures feature effect on particular predictor minimizing error given model/learner

Cognitive scientists

models of thinking, learning, and cognition

nonlinear

multilayer perceptron (back propagation), Hopfield net, competitive net

ICA Properties

mutually independent maximal mutual information bag of features

overfitting

occurs when the generated model is overly complex, and too closely tries to capture the idiosyncrasies in the data under study, instead of capturing the overall pattern. By doing so, the model will fail to accurately predict future (previously unseen) observations

For KNN should k be even or odd?

odd to ensure we always have a majority

Why is having a large margin good?

of all the possible linear functions, this one is the most robust to outliers and thus has the best generalization ability

gradient descent definition

operates by keeping track of the current state and changing that state to (hopefully) improve performance

single-layer

perceptron, Hopfield net

supervised, binary or real

perceptron, multilayer perceptron (back propagation)

random

perceptron, multilayer perceptron (back propagation), Hopfield net, competitive net

Bayes' Theorem

posterior = (likelihood x prior)/evidence
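
A worked example of the formula, using made-up numbers for a diagnostic test (all three probabilities are assumptions for illustration). The evidence term is obtained by summing over both hypotheses.

```python
def posterior(likelihood, prior, evidence):
    """Bayes' theorem: posterior = likelihood * prior / evidence."""
    return likelihood * prior / evidence

# Hypothetical numbers for a diagnostic test.
p_disease = 0.01            # prior P(disease)
p_pos_given_disease = 0.90  # likelihood P(positive | disease)
p_pos_given_healthy = 0.05  # P(positive | no disease)

# Evidence P(positive), by the law of total probability.
evidence = (p_pos_given_disease * p_disease
            + p_pos_given_healthy * (1 - p_disease))

p_disease_given_pos = posterior(p_pos_given_disease, p_disease, evidence)
print(round(p_disease_given_pos, 3))
```

Even with a fairly accurate test, the low prior keeps the posterior small.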

How do we test/predict/evaluate our data?

predict new outputs, given only input

What is the primary goal of supervised learning?

prediction model output as function of input

What are synonyms for input variables?

predictors independent variables covariates features regressors

image processing pipeline

preprocessing, feature extraction, train model, classify new image

What is numeric data?

quantitative discrete or continuous x / + - e.g. days until, hotter by, totals to, times more likely, twice as old

Entropy

randomness in data

What are synonyms for output variables?

response dependent variable class label (categorical) outcome

margin

the width that the boundary could be increased by before hitting a data point

Neuro-physiologists

understanding sensory systems and memory. ANNs can be used to understand how visual info is represented in V2, V4, and higher levels of the visual hierarchy; in some studies, humans and ANNs can solve the same task. Example: mice were shown black-and-white movies while regions in the visual cortex were recorded

Hopfield net architecture

unsupervised, recurrent, dense, nonlinear, random, single-layer, binary

How do we train our data?

use available input/output data to estimate this function

Where is unsupervised learning useful?

visualization and exploratory analysis

Hopfield net learning rule

wij=Σs g(yi[s])g(yj[s]) for i≠j, where g(yi[s])={1 if yi[s]=1, -1 if yi[s]=0} and the sum runs over each desired stable state s

perceptron performance rule

yj=f(Σiwijxi) where f(x)={1 if x≥θ, 0 else}
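
A minimal sketch of this performance rule for a single output unit. The weights and threshold below are hand-picked (not learned) so that the unit computes logical AND; they are assumptions for illustration.

```python
def perceptron_output(weights, inputs, theta):
    """Threshold unit: 1 if the weighted input sum reaches theta, else 0."""
    net = sum(w * x for w, x in zip(weights, inputs))
    return 1 if net >= theta else 0

# Hand-picked parameters that make the unit behave like logical AND.
and_weights, and_theta = [1.0, 1.0], 1.5

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron_output(and_weights, x, and_theta))
```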

competitive net performance rule

yj={1 if Σiwijxi "wins", 0 else}

multilayer perceptron (back propagation) learning rule

Δwij=kδjxi (k is the learning rate), where δj=(tj-yj)f'(netj) for output units and δj=(Σm δm wjm)f'(netj) for hidden units, summing over the units m that unit j feeds into; t are the desired outputs

competitive net learning rule

Δwij={k(xi-wij) for winning j, 0 else}
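
A sketch of one winner-take-all learning step, assuming the winner is the output node with the largest weighted input sum (as in the competitive net performance rule). The function name and toy weights are hypothetical.

```python
def competitive_step(weights, x, k=0.5):
    """Move only the winning node's weight vector toward input x; return the winner."""
    # Net input of every output node j: sum_i w_ij * x_i.
    nets = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights]
    winner = max(range(len(nets)), key=nets.__getitem__)
    # Delta w_ij = k * (x_i - w_ij) for the winner, 0 for all other nodes.
    weights[winner] = [w_i + k * (x_i - w_i) for w_i, x_i in zip(weights[winner], x)]
    return winner

weights = [[1.0, 0.0], [0.0, 1.0]]   # one weight vector per output node
winner = competitive_step(weights, [0.0, 2.0], k=0.5)
print(winner, weights)
```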

Evaluating Rules

• A good rule should not make mistakes and should cover as many examples as possible Complexity: Favour rules with simple predicates (Occam's Razor)

The simplest ANNs consist of

• A layer of D input nodes • A layer of hidden nodes • A layer of output nodes • Fully connected between layers

Frequent Itemset

• Absolute support of an itemset is its frequency count • Relative support is the frequency count of the itemset divided by the total size of the dataset

ANN feature selection

• Artificial Neural Networks can implicitly perform feature selection • A multi-layer neural network where the first hidden layer has fewer units (nodes) than the input layer • Called 'Auto-associative' networks

Scientists that use Artificial Neural Networks:

• Computer scientists (information processing and learning, image classification, object detection and recognition) • Statisticians (classification) • Engineers (signal processing and automatic control) • Physicists (statistical mechanics) • Biologists (predicting protein shape from mRNA sequences, disorder diagnostic, personalised medicine) • Philosophers (Minds and Machines) • Cognitive scientists (models of thinking, learning and cognition) • Neuro-physiologists (understanding sensory systems and memory)

Filter Scores

• Correlation • Mutual information • Entropy • Classification rate • Regression score

CFS

• Correlation based feature selection (CFS) selects features in a forward-selection manner. • Looks at each step at both correlation with target variable and already selected features.

DM query languages

• DM query language incorporates primitives • Allows flexible interaction with DM systems • Provides foundation for building user-friendly GUIs • Example: DMQL

Model Combination View

• Decision Trees combine a set of models (the nodes) • In any given point in space, only one model (node) is responsible for making predictions • Process of selecting which model to apply can be described as a sequential decision making process corresponding to the traversal of a binary tree

DM systems can be divided into types based on a number of variables

• Kinds of databases • Kinds of knowledge • Kinds of techniques • Target applications

PCA

• Manifold projection • Assumes Gaussian latent variables and Gaussian observed variable distribution • Linear-Gaussian dependence of the observed variables on the latent variables • Also known as Karhunen-Loève transform

Sparse kernel methods

• Must be evaluated on all training examples during testing • Must be evaluated on all pairs of patterns during training • Training takes a long time • Testing too • Memory intensive (both disk/RAM) Solution: sparse methods

DM integration with DBS/ Data Warehouses

• No coupling - DMS will not utilize any DB/DW system functionality • Loose coupling - Uses some DB/DW functionality, in particular data fetching/storing • Semi-tight coupling - In addition to loose coupling use sorting, indexing, aggregation, histogram analysis, multiway join, and statistics primitives available in DB/DW systems • Tight coupling

Mining frequent patterns

• One approach to data mining is to find sets of items that appear together frequently: frequent itemsets • To be frequent some minimum threshold of occurrence must be exceeded • Other frequent patterns of interest: frequent sequential patterns and frequent structured patterns

DM Types of data

• Relational databases • Data warehouses • Transactional databases • Object-relational databases • Temporal/sequence/time-series databases • Spatial and Spatio-temporal databases • Text & Multimedia databases • Heterogeneous & Legacy databases • Data streams

Output layer can be

• Single node for binary classification • Single node for regression • n nodes for multi-class classification

Memory-based methods

• Uses all training data in every prediction (e.g. kNN) • Becomes a kernel method if using a non-linear example comparison/ metric

ROBBINS-MONRO

•Addresses the slow update speed of the M-step in K-means •Uses linear regression (see lecture 1)

Which is the best linear function?

The one with the maximum margin

PCA Pros

Well studied Fast

nonsynchronous update

all even or odd numbered nodes are updated

Directed Acyclic Graphs (DAGs)

are the graph structure of Bayesian Networks: directed graphs with no cyclic path from any node back to itself

Polynomial Regression Drawbacks

as k-> number of data points, least squared error decreases but risk of overfit increases

1-NN

assign class of nearest neighbor

inductive learning

assume just the data: -cleaner and good base case -more complex mechanisms needed to reason with prior knowledge

Self-learning

certain bodily acts are linked with certain mental experiences on the basis of first person experience.

State the error formula for testing performance

error = number of errors / number of cases

What is regression?

find a function that predicts continuous data

parameter optimization strategies

grid search, gradient descent

Multi-layer perceptron (MLP)

i. Fully connected ii. Consist of 3 layers: 1. 1 Input Layer a. d - dimensional input x b. No neurons at input layer - each input unit simply emits input xi 2. 1+ Hidden Layer a. nH neurons in each hidden layer b. Each neuron uses a nonlinear activation function c. Weight wji indicates input to hidden layer - j = hidden layer, i = input layer 3. 1 Output Layer a. c neurons in output layer b. neurons use nonlinear activation functions that relate to the problem being solved

Epsilon exhausted version space

if and only if for all h in the version space the error of h is less than Epsilon

Strong Relevance

if removing x_i degrades bayes optimal classifier

What is the solution to samples with missing data?

impute missing data plug in good estimate for missing values

Filter Evaluation Methods

information gain variance entropy gini index non-redundant independent

Computer scientists:

information processing and learning, image classification, object detection and recognition

BSB node update rule

initial activation is steadily amplified by positive feedback until saturation

What are artificial neural network in a brain?

interconnected groups of nodes akin to the vast network of neurons in a brain

Conditional independence in PGN

is the PGN mechanism to show information in terms of interesting aspects of probability distributions

Forward Search

iterate through features, training learner and pick subset that performs the best. Continue until k features picked.

Infinite Hypothesis Spaces

linear separators artificial neural networks decision trees (continuous)

Expectation Maximization Properties

monotonically non-decreasing likelihood does not converge (practically does) Will not diverge Can get stuck Works with any distribution if EM solvable

What is aggregation?

multiple identical sensors, days to months e.g. sum, averaging

PCA Properties

mutually orthogonal maximal variance ordered features

Hopfield net performance rule

netj=Σi≠jwijyi, yj={1 if netj>0, 0 if netj<0, yj if netj=0}
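
The sketch below combines the Hopfield learning rule (weights summed over the stored states, with no self-connections) and this performance rule, assuming asynchronous updates in cyclic order. Function names and the stored pattern are hypothetical.

```python
def g(y):
    """Map binary outputs {0, 1} to bipolar values {-1, +1}."""
    return 1 if y == 1 else -1

def train_hopfield(patterns):
    """w_ij = sum over stored states s of g(y_i[s]) * g(y_j[s]), with w_ii = 0."""
    n = len(patterns[0])
    w = [[0] * n for _ in range(n)]
    for s in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += g(s[i]) * g(s[j])
    return w

def recall(w, state, max_sweeps=10):
    """Update nodes one at a time in cyclic order until no node changes."""
    y = list(state)
    for _ in range(max_sweeps):
        changed = False
        for j in range(len(y)):
            net = sum(w[i][j] * y[i] for i in range(len(y)) if i != j)
            new = 1 if net > 0 else 0 if net < 0 else y[j]
            if new != y[j]:
                y[j], changed = new, True
        if not changed:
            break
    return y

stored = [1, 0, 1, 0, 1, 0]
w = train_hopfield([stored])
noisy = [1, 0, 1, 0, 1, 1]          # last bit flipped
print(recall(w, noisy))
```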

neural networks

networks of nerve cells that integrate sensory input and motor output - parts, workings - characteristics

Hebbian learning

neurons that fire together wire together simultaneous activation of cells leads to pronounced increases in synaptic strength between those cells

asynchronous update

nodes to be updated may be selected in - cyclic order or - at random each node in the network must have the opportunity to change state - called "fairness" property for computer networks

grid search

not efficient; create nested for loops that iterate over all possible combinations of parameters
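
A minimal sketch of exhaustive grid search. `itertools.product` stands in for the nested for loops, and the scoring function is a made-up toy (a real one would train and validate a model); all names are hypothetical.

```python
import itertools

def grid_search(score, grid):
    """Evaluate score(**params) for every parameter combination; keep the best."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    # itertools.product is equivalent to one nested loop per parameter.
    for combo in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        s = score(**params)
        if s > best_score:
            best_params, best_score = params, s
    return best_params, best_score

def toy_score(a, b):
    """Toy objective, highest at a=2, b=0.1."""
    return -(a - 2) ** 2 - (b - 0.1) ** 2

best_params, best_score = grid_search(toy_score, {"a": [1, 2, 3], "b": [0.01, 0.1, 1.0]})
print(best_params, best_score)
```

The number of evaluations is the product of the grid sizes, which is why the method scales poorly with the number of parameters.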

What do we model in output variables?

numeric dependent variables: "regression" categorical dependent variables: "classification"

regression

outcome is continuous

non-iterative association

output pattern is generated from the input pattern in a single iteration 1. Hebb's law may be used to develop associative "matrix memories" or 2. Gradient descent can be applied to minimize recall error

Probability Theory Recap

p(x) = marginal distribution p(x,y) = joint distribution p(x|y) = conditional distribution

linear or nonlinear

perceptron

SVM: kernel types

polynomial, radial basis kernel (Gaussian), sigmoid

Biologists

predicting protein shape from mRNA sequences, disorder diagnostic, personalized medicine

Inductive Learning: 1 - delta

probability of successful learning

What is feature subset selection?

remove unnecessary variables e.g. redundant names and ids

Write down the formulas for the following: sensitivity, specificity and accuracy

sensitivity = TP/(TP + FN) specificity = TN/(TN + FP) accuracy = (TP + TN)/(TP + TN + FP + FN) TP = true positive FP = false positive TN = true negative FN = false negative
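
A short sketch using the standard definitions: sensitivity is the true positive rate TP/(TP+FN), specificity is the true negative rate TN/(TN+FP), and accuracy is the share of all cases classified correctly. The counts are made up for illustration.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),               # true positive rate
        "specificity": tn / (tn + fp),               # true negative rate
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

# Hypothetical counts: 50 actual positives, 50 actual negatives.
metrics = confusion_metrics(tp=40, fp=5, tn=45, fn=10)
print(metrics)
```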

Classification tree

set of (splitting) rules to recursively partition a data set. => min mixture of classes (impurity) within nodes A supervised learning technique that uses a structure similar to a tree to segment data according to known attributes to determine the value of a categorical target variable

regression tree

set of (splitting) rules to recursively partition a data set. => min variance of the response within nodes A tree that predicts values of a continuous outcome variable by splitting observations into groups via a sequence of hierarchical rules.

Itemsets

simply a set of items (cf set theory)

A Lattice/Trellis diagram visualizes

state transitions over time Also a good tool to visualize the optimal path through states (Viterbi Algorithm)

Computational Learning Theory

study of the design and analysis of machine learning algorithms

perceptron architecture

supervised, feedforward, dense, linear or nonlinear, random, single-layer, binary or real

feature weighting

the idea of assigning more weight to features known to be more important for some particular classification problem (e.g. color if classifying fruit)

Information Theory

the quantification, storage, and communication of information

Perceptron

the simplest neural network possible: a computational model of a single neuron - conceptually - formally

James McClelland & Connectionism theory

theory that memory is stored throughout the brain in connections between neurons, many of which can work together to process a single memory a type of information-processing approach that emphasizes the simultaneous activity of numerous interconnected processing units

What is a sign of supervised learning?

training data given

What is discretization?

transfer continuous into discrete e.g. rounding, converting numeric to categories

How do we query a variable in a tibble?

using $

Algorithms

very specific, step-by-step procedures for solving certain types of problems

Min-max normalization

• Enables cost-function minimization techniques to function properly, taking all attributes into equal account • Transforms all attributes to lie on the range [0, 1] or [-1, 1] • Linear transformation of all data
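
A minimal sketch of the linear rescaling for one attribute, assuming the target range is passed in as `new_min` and `new_max` (hypothetical parameter names). A constant attribute is mapped to `new_min` to avoid dividing by zero.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so min maps to new_min and max to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: no spread to rescale.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]

print(min_max_scale([10, 20, 30]))
print(min_max_scale([10, 20, 30], new_min=-1.0, new_max=1.0))
```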

Causes of noisy data (incorrect values)

• Faulty data collection instruments • Human or computer error at data entry • Errors in data transmission

Artificial Neural Nets

• Feed-forward neural network/Multilayer Perceptron one of many ANNs • We focus on the Multilayer Perceptron • Really multiple layers of logistic regression models

EM Algorithm issues

• Takes a long time • Often initialised using k-Means

Bayesian equation to sigmoid function

Substitute Divide through by numerator term Cancel common terms

Hypothesis Quality

- We want to know how well a machine learner, which learned the hypothesis as the approximation of the target function, performs in terms of correctly classifying novel, unseen examples - We want to assess the confidence that we can have in this classification measure

Naïve Bayes

Bayes net classifier that assumes all attributes are independent

Cost function - Manhattan or City block distance

Calculate the distance between real vectors using the sum of their absolute difference. Also called City Block Distance
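
For illustration, a direct implementation of this distance for two equal-length real vectors (the function name is hypothetical).

```python
def manhattan_distance(a, b):
    """Sum of absolute coordinate differences between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))

print(manhattan_distance((1, 2, 3), (4, 0, 3)))
```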

SVM: alpha_i

Importance of support vectors

EXTREME Dimensionality Case

In an extreme, degenerate case, if D > n, each example can be uniquely described by a set of feature values.

Regression

Learning a function that provides a continuous value

dense

perceptron, multilayer perceptron (back propagation), Hopfield net

Deep Learning

- Basically a Neural Network with Many hidden layers - Can be used for unsupervised learning and dimensionality reduction

Major DM prep tasks

- Data cleaning : Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies - Data integration : Integration of multiple databases, data cubes, or files - Data transformation : Normalisation and aggregation - Data reduction : Obtains reduced representation in volume but produces the same or similar analytical results - Data discretization : Part of data reduction but with particular importance, especially for numerical data

Good Old-Fashioned AI

- John Haugeland names it in his 1985 book, "Artificial Intelligence: The Very Idea" - inspired by the computer model - as good as its program (knowledge, rules)

Joint Probability

- Joint probability is the probability of encountering a particular class while simultaneously observing the value of one or more other random variables - Can be generalized for the state of any combination of random variables

Decision Tree Representation

- Tree with decision nodes representing attribute - edges represent decisions - leaves are output

Adaptive Resonance Theory

- allow the number of clusters to vary with problem size - adaptively adds new clusters as new nodes - binary values only - #inputs = size of vector #outputs = number of vectors - finished when the input returned (signal travels both ways) is close enough to original input - trains on same vectors multiple times - bidirectional

comparison FF vs bidirectional

- associative memory (feed forward only) will produce an output for an input even if it was not one of the patterns that were originally stored in W but it will not necessarily be one of the stored patterns - bidirectional memory will always produce one of the previously stored outputs, and in addition the input will have been modified to become the input that was paired with that output

applications of topological structure

- clustering: each weight vector represents a centroid - vector quantization: weight vectors represent codebook vector - approximating probability distribution: centroids in a given region is roughly proportional to the number of input vectors in that region

positive correlation (w_lj is +ve and large)

- nodes l and j frequently turn ON or OFF together in attractor patterns

weight vectors in SOM

- weight vectors often become ordered ie. topological neighbours become associated with weight vectors that are near each other in the input space

Hamming Networks

- weights on links from an input layer to an output layer represent components of stored input patterns - Hamming networks compute the "hamming distance", the number of differing bits of input and stored vectors - P output nodes can store P vectors each associated with a weight vector

accuracy may not be useful measure in cases where

1- There is a large class skew 2- There are differential misclassification costs - say, getting a positive wrong costs more than getting a negative wrong. 3- We are interested in a subset of high confidence predictions

K-Means Algorithm

1. Assign each xi to its closest mean. 2. Update the means based on assignment 3. Repeat until convergence
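
The three steps can be sketched as follows, assuming Euclidean distance and caller-supplied initial means (the card does not say how to initialize them). Names and toy points are hypothetical.

```python
import math

def k_means(points, means, max_iter=100):
    """Alternate assignment and mean updates until the means stop moving."""
    for _ in range(max_iter):
        # Step 1: assign each point to its closest mean.
        clusters = [[] for _ in means]
        for p in points:
            idx = min(range(len(means)), key=lambda k: math.dist(p, means[k]))
            clusters[idx].append(p)
        # Step 2: recompute each mean (an empty cluster keeps its old mean).
        new_means = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else m
            for cl, m in zip(clusters, means)
        ]
        # Step 3: repeat until convergence, i.e. the means no longer change.
        if new_means == means:
            break
        means = new_means
    return means, clusters

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
means, clusters = k_means(points, means=[(0.0, 0.0), (10.0, 10.0)])
print(means)
```

The result depends on the initial means, so in practice the procedure is often restarted from several initializations.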

Constrained Learner algorithm bounding mistakes

1. Assume each variable can be positive and negative 2. Given input, compute output 3. If wrong, set all positive variables that were 0 to absent and all negative variables that were 1 to absent. GOTO 2

Techniques for canceling out noise

1. Binning - First sort data, then distribute over local bins 2. Regression - Fit a parametric function to the data (e.g. linear or quadratic function) 3. Clustering

Evaluating RuleSets

A complete rule set should be good at classifying all the training examples Complexity: Favour rule sets with the minimal number of rules.

Well-posed Learning Problem

A computer program is said to learn from (E)xperience E with respect to some (T)ask T and some (P)erformance measure P, if its performance on T, as measured by P, improves with experience E.

Recurrent neural net

A more complex ANN that consists of a single layer where every unit is interconnected and all the units act as input and output

Kohonen as k-means

- uses stochastic gradient descent on the quantization error (see equation)

What operations work for ordered factors?

>, <, min, max

VC Shattering

Ability to label in all possible ways

Decision Tree learning

Pick best attribute Follow path Continue until answer node found

K-Means Clustering

Pick k centers at random Center claims closest points Recompute centers by averaging clustered points Repeat until convergence

What are critical points?

Points with zero slope (i.e. minimum, maximum, saddle point).

What is P(D)?

Probability of the Data given by |VS| / |H|

The true error of hypothesis h

Probability that it will misclassify a randomly drawn example from distribution D. However, we cannot measure the true error. We can only estimate it by observing the sample error eS

Scale Invariance

Scaling distances by a positive value does not change the clustering

Define sensitivity and specificity

Sensitivity: the proportion of people who really have the disease and who get a positive test result. Specificity: the proportion of people who do not have the disease and whose test results are negative.

Sequence Data

Sequence data is data that comes in a particular order Opposite of independent, identical distributed (i.i.d.) Strong correlation between subsequent elements - DNA - Time series - Facial Expressions - Speech Recognition - Weather Prediction - Action planning

Testing Set

Set for testing model

ANNs Step 1

Neurons in ANN are called units and they receive info from other units (like dendrites through neurons) then they integrate the inputs similar to IPSP and EPSP in real neurons. Each unit has preferred threshold, and if summed signals are greater than threshold, unit will pass info forward in network

Can an infinite H be PAC learnable?

Noisy data

Noise is a random error or variance in a measured variable

Bayes Rule

P(h|D) = P(D|h)P(h) / P(D)

Inductive Learning: Learner/Teacher

Paradigm where the learner is given information by the teacher who knows the correct answer

Hypothesis class

Set of all concepts

Kernel

Shortcut that helps us do certain calculation faster which otherwise would involve computations in higher dimensional space.

Data reduction

Should remove what's unnecessary, yet otherwise maintain the distribution and properties of the original data • Data cube aggregation • Attribute subset selection (feature selection) • Dimensionality reduction (manifold projection) • Numerosity reduction • Discretization

Engineers

Signal processing and automatic control

Mixture of Gaussians

Simple formulation: density model with richer representation than single Gaussian

F-measure

Comparing different approaches is difficult when using multiple evaluation measures (e.g. Recall and Precision) F-measure combines recall and precision into a single measure

What is a confusion matrix?

Confusion matrices are used for dealing with false positives and false negatives because some prediction vs reality test are worse than others. (Think HIV test)

edge

E. Partition graphs such that nodes in same cluster have large weights between them and between cluster weights are small

Decision Tree n-OR

Expressible, uses linear number of nodes

False Positive Rate

FP/actual negatives = FP/(TN + FP)

Feature Selection

Feature Selection returns a subset of original feature set. It does not extract new features. Benefits: • Features retain original meaning • After determining selected features, selection process is fast Disadvantages: • Cannot extract new features which have stronger correlation with target variable

Training LDA objective:

Find (i.e. learn) that minimizes some error function on the training set. Significant approaches: • Least squares • Fisher • Perceptron

The goal of data mining is to

Find interesting patterns!. An interesting pattern is: 1. Easily understood. 2. Valid on new data with some degree of certainty. 3. Potentially useful. 4. Novel.

Linear Discriminant Analysis

Find projection discriminating based on given labels. (Not really unsupervised)

Pattern Recognition

Finding patterns without experience. It's also called unsupervised learning.

argmax_h(P(h|D))

Finds the most probable hypothesis given the data

Forward-backward algorithm

First applies Forward selection and then filters redundant elements using backward elimination

Pruning

First fully train a tree, without stopping criterion After training, prune tree by eliminating pairs of leaf nodes for which the impurity penalty is small

Bayesian Learning Algorithm

For each h in H calculate P(h|D) and return the argmax (Not practical |H| is usually infinite)

SVM: kernel

Function used to find a hyperplane between classes, can project data into different representations that are more easily separable

Generalisation

Generalization is the desired property of a classifier to be able to predict the labels of unseen examples correctly. A hypothesis generalizes well if it can predict an example coming from the same distribution as the training examples well.

Perceptron training

Given examples, find weights that map inputs to outputs

Gradient

For a multidimensional function, the gradient points in the direction of the steepest increase (it is built from the partial derivatives). For example, for g(x,y) = -x+y^2 the gradient is (-1, 2y), so g grows fastest by decreasing x and moving y away from 0. This is the basis of gradient-based methods such as steepest descent, which steps in the opposite direction to minimize a function.

Discrete latent variables

Hidden variables that can take only a limited number of discrete values (e.g. gender or basic emotion).

unsupervised, binary

Hopfield net, competitive net

Mistake bounds

How many misclassifications can a learner make over an infinite run (online)

Artificial Neural Networks make interesting mistakes even after training:

Human brains are much more complex and have many more neurons so context is more apparent

Importance of cleaning data

If you have good data, the rest will follow

Multivariate Trees

Instead of monothetic decisions at each node, we can learn polythetic decisions. This can be done using many linear classifiers, but keep it simple!

Classification

ML task where T has a discrete set of outcomes. Often classification is binary. Examples: • face detection • smile detection • spam classification • hot/cold

Kernel methods

Map a non-linearly separable input space to another space which hopefully is linearly separable • This space is usually higher-dimensional, possibly infinite-dimensional • The key element is that they do not actually map features to this space; instead they return the similarity (inner product) between elements in this space • This implicit mapping is called the Kernel Trick

SVM: max(2/||w||) or min(1/2 * ||w||^2) or max(w(alpha))

Maximization of the length of margin

Independent Component Analysis

Maximizes the statistical independence of dataset. Assumes there are "hidden" variables that the data is a linear combination of. I(y_i,y_j)=0

K Nearest Neighbors Regression

Mean of y in nearest neighbors

Features/Attributes

Measurable values of variables that correlate with the label y Examples: • Sender domain in spam detection • Mouth corner location in smile detection • Temperature in forest fire prediction • Pixel value in face detection

Mutual information

Measure of reduction of entropy in variable x given a variable y: I(x,y) = H(x) - H(x|y)
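
A sketch that estimates mutual information from a list of observed (x, y) pairs, using the standard identities I(x,y) = H(x) - H(x|y) and H(x|y) = H(x,y) - H(y). Probabilities are plain empirical frequencies; the function names are hypothetical.

```python
import math
from collections import Counter

def entropy_from_counts(counts, total):
    """Entropy in bits of the empirical distribution given by counts."""
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def mutual_information(pairs):
    """I(x,y) = H(x) - H(x|y) = H(x) + H(y) - H(x,y), from observed pairs."""
    n = len(pairs)
    h_x = entropy_from_counts(Counter(x for x, _ in pairs).values(), n)
    h_y = entropy_from_counts(Counter(y for _, y in pairs).values(), n)
    h_xy = entropy_from_counts(Counter(pairs).values(), n)
    h_x_given_y = h_xy - h_y
    return h_x - h_x_given_y

dependent = [(0, 0), (1, 1)] * 2                  # y fully determines x
independent = [(0, 0), (0, 1), (1, 0), (1, 1)]    # x and y unrelated

print(mutual_information(dependent))
print(mutual_information(independent))
```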

Kullback-Leibler Divergence

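Measures the difference between two distributions p and q; it is zero when they are identical and grows as they diverge. Mutual information is the KL divergence between the joint p(x,y) and the product p(x)p(y), so it is zero when x and y are independent. D(p||q) = ∫ p(x) log(p(x)/q(x)) dx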

Linear Regression revisited

Model: - linear and additive relationship - random variation Model estimation: - free parameter beta, set beta to max fit Objective: - min loss function - loss: sum of squared residuals

Data quality measures

Multi-Dimensional Measure of Data Quality • A well-accepted multidimensional view: • Accuracy • Completeness • Consistency • Timeliness • Believability • Value added • Interpretability • Accessibility • Broad categories: • Intrinsic, contextual, representational, and accessibility

Constrained Learner

Must ask many questions of teacher to find correct function (about 2^k queries needed)

topological structure (eg. SOM)

- competitive learning requires inhibitory connections among all nodes at kohonen layer. topological requires that each node has excitatory connections to a small number of nodes - topology specified in terms of a neighbourhood relation among nodes - learning proceeds by moving the winner node as well as its neighbours towards the presented input sample

k-means clustering

- computes cluster centroids directly instead of making small updates to node positions

bidirectional associative memory (BAM)

- connections between input and output are bidirectional - no intra-layer connection - generates output at 2nd layer using non-iterative node update rule and a signum step function - generates 1st layer pattern to correspond to 2nd layer output using a similar update rule

centroid in neural networks

- each output node constitute a weight vector of that node, which represents the centroid of one cluster of input patterns

How do LVQs work

- each output node is associated with an arbitrary class label in the beginning - at end should be associated with approx. the number of training data in that class - initial weights are chosen randomly - learning rate decreases with time - helps the network converge to a state in which weight vectors are stable - when a new pattern is presented, if the winner node is the correct class then move winner node closer to pattern

Maxnet

- recurrent competitive one-layer network - used to determine which node has the highest initial activation - node function is f(net) = max(0, net) - "mutual inhibition factors" are less than 1/(number of nodes) - nodes update their outputs simultaneously; each node receives inhibitory inputs from other nodes via intra-layer connections - allows for greater parallelism in execution, since every computation is local to each node rather than centralized
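
A small sketch of the Maxnet iteration, assuming a self-weight of 1 and a single inhibition factor `epsilon` shared by all connections (the default `0.5 / n` is just one illustrative choice below 1/n):

```python
def maxnet(activations, epsilon=None, max_iters=1000):
    """Iterate mutual inhibition until at most one node stays active."""
    n = len(activations)
    if epsilon is None:
        epsilon = 0.5 / n  # assumption: any value below 1/n would do
    a = list(activations)
    for _ in range(max_iters):
        if sum(1 for x in a if x > 0) <= 1:
            break
        total = sum(a)
        # Simultaneous update: each node subtracts epsilon times the
        # summed outputs of all *other* nodes, then applies max(0, net).
        a = [max(0.0, x - epsilon * (total - x)) for x in a]
    return a

out = maxnet([0.5, 0.9, 1.0, 0.9])
winner = max(range(len(out)), key=out.__getitem__)
```

If two nodes share the highest initial activation, neither can suppress the other, which is why the loop is capped by `max_iters`.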

iterative association

- reducing error in generating the desired output - same as least-squares procedure using the Widrow-Hoff rule - good for hetero-association

signum function

-1 if net < 0; 0 if net = 0; +1 if net > 0

What are the different steps of mini-batch learning? What are its advantages/disadvantages?

1) Initialize the weights. 2) Calculate the loss for a subset (mini-batch) of the training samples. 3) Calculate the gradients. 4) Update the weights. 5) Continue with 2) until the loss converges (ideally close to the global minimum). Advantages: - Wobbles around slightly and can hence escape from local minima. - Can fully exhaust computational capacity. - Results in good solutions. Disadvantage: Slightly wobbles around the solution.
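
The steps above can be sketched for a one-variable linear model y = w*x + b with mean squared error. The model, learning rate, batch size and the function name `minibatch_gd` are assumptions chosen for illustration only:

```python
import random

def minibatch_gd(xs, ys, lr=0.05, batch_size=4, epochs=500, seed=0):
    """Mini-batch gradient descent for y = w*x + b under mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0                       # 1) initialize the weights
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)                  # new random batches every epoch
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            # 2) errors of the current model on this subset only
            errs = [(w * xs[i] + b) - ys[i] for i in batch]
            # 3) gradients of the batch mean squared error
            grad_w = 2 * sum(e * xs[i] for e, i in zip(errs, batch)) / len(batch)
            grad_b = 2 * sum(errs) / len(batch)
            # 4) update the weights
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b                           # 5) loop ended after a fixed budget

xs = [i / 10 for i in range(20)]
ys = [3.0 * x + 1.0 for x in xs]          # noise-free toy data
w, b = minibatch_gd(xs, ys)
```

On noisy data the estimates would keep wobbling around the solution, which is the disadvantage named on the card.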

Artificial Neural Network Structure Summary

1. Composed of many units (similar to neurons) 2. Units are interconnected (similar to synapses) 3. Units occupy separate connected layers (similar to multiple brain regions in sensory pathways) 4. Units in deeper layers respond to more and more abstract information (similar to more complex receptive fields in "higher" cortical areas) 5. Require learning to perform tasks efficiently (similar to neural plasticity) 6. Through experience, Artificial Neural Networks learn to recognize patterns.

prediction model (def)

A prediction model is the result of applying a supervised learning algorithm to a labeled data set which includes observations that are characterized by a set of features (aka attributes, independent variables) and a target(aka dependent, response) variable.

Perceptron Expressiveness

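Networks of perceptrons can represent all Boolean functions; a single perceptron can represent only linearly separable ones (e.g. AND, OR, NOT, but not XOR)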

Occam's Razor

All things being equal - the simplest explanation is the best

Content-addressable storage

Information that is stored in the weights of the connections, the same way that synapses change their strength during learning

Bagging

Learn rule over k subsets of data and combine into a single rule

Inductive Learning: Training example selection

Learner asks teacher, teacher gives examples, x chosen from distribution by nature, malicious teacher gives learner false information

Machine Learning

Learning from experience. In supervised learning, the experience E is the supervision (labeled examples).

Misclassification Impurity

Minimum probability that training example will be misclassified at node N

model development (def)

Model development (aka estimation, building) is the process in which the learning algorithm crafts the model in such a way that some measure of the agreement (aka fit; antonym: error, loss) between the model and the data is maximized. (Example: linear regression.)

Relevance Vector Machines

Model the typical points of a data set, rather than the atypical ones (à la density estimation), while remaining a (very) sparse representation (like a heat map) - returns a true posterior - naturally extends to multi-class classification - fewer parameters to tune

K-Means Properties

- error is monotonically non-increasing - each iteration is polynomial, O(kn) - finite number of iterations, O(k^n) - error decreases if ties are broken consistently - can get stuck

On-line Gradient Descent

On-line (or stochastic) gradient descent, also known as incremental gradient descent, updates the parameters one data point at a time. - Handles redundancy better. (Batch GD has redundancy) - Usually much faster than Batch GD. - SGD performs frequent updates with a high variance that cause the objective function to fluctuate heavily - Can deal with new data better. - Good chance of escaping local minima. However, when we slowly decrease the learning rate, SGD shows the same convergence behaviour as batch gradient descent

Creating a Tree Model

Recursive partitioning approach (greedy search strategy) - find the best possible split (most important variable?, optimal threshold for this variable?) - split tree accordingly and create two branches - recursive partitioning (repeat previous steps, find best possible split for each branch) - stop when no further improvement is possible 2 important decisions: 1) splitting rule: How to split a node? 2) Stopping rule: How to decide if a node is a leaf node?

Multiclass SVM

SVM is an inherently binary classifier. Two strategies to use SVMs for multiclass classification: - One-vs-all - One-vs-one Problems: - Ambiguities (both strategies) - Imbalanced training sets (one-vs-all)

Slack variable

Slack variables are introduced to solve the optimization problem by allowing some training data to be misclassified. Slack variables ξ_n >= 0 give a linear penalty to examples lying on the wrong side of the d.b. or inside the margin: ξ_n = 0 for a point on the correct side of the margin boundary, ξ_n = |t_n - y(x_n)| otherwise
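
A tiny sketch of this penalty, assuming labels t in {-1, +1}, a real-valued model output y(x), and the usual convention that the slack is 0 on or beyond the correct margin boundary and |t - y(x)| otherwise (the function name `slack` is illustrative):

```python
def slack(t, y):
    """Slack for one example: t is the label (+1 or -1), y the model output y(x)."""
    # Zero on or beyond the correct margin boundary (t*y >= 1),
    # |t - y| otherwise; for t in {-1, +1} this equals max(0, 1 - t*y).
    return 0.0 if t * y >= 1 else abs(t - y)

examples = [(1, 2.0), (1, 0.3), (-1, 0.5)]   # (label, model output)
penalties = [slack(t, y) for t, y in examples]
```

A slack between 0 and 1 means the point is inside the margin but still correctly classified; a slack above 1 means it is misclassified.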

Validation set Criterion

Split training data in a training set and a validation set (e.g. 66% training data and 34% validation data). Keep splitting nodes, using only the training data to learn decisions, until the error on the validation set stops going down.

Define supervised learning

Supervised learning is the machine learning task of learning a function that maps an input to an output based on labeled training data (example input-output pairs)

If we calculate the rate of change of a function f with respect to a vector v, what is the gradient of f with respect to v?

The gradient of f with respect to v is the vector that contains the partial derivatives of f with respect to each of the variables contained in v.

dependent variable

The outcome factor; the variable that may change in response to manipulations of the independent variable.

Explain the ID3 algorithm

This recursive algorithm uses entropy (a measure of class impurity computed from the class probabilities) to choose, at each node, the attribute with the highest information gain, and so builds a decision tree.

What is the training data used for?

To build (fit) a model, which is then evaluated on your test data

Tree varieties

Trees are called monothetic if one property/variable is considered at each node, polythetic otherwise

Backward Search

Try all combinations of n-1 features and throw out the feature whose exclusion gave the best performance. Continue until k features remain.
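
A sketch of this greedy elimination. The `score` callback (higher is better) is a hypothetical stand-in for whatever evaluation is used, e.g. validation accuracy of a model trained on the given feature subset:

```python
def backward_search(features, k, score):
    """Greedy backward elimination: drop one feature per round until k remain."""
    selected = list(features)
    while len(selected) > k:
        # Try every subset that leaves out exactly one feature and find the
        # feature whose exclusion gives the best-scoring subset.
        drop = max(selected,
                   key=lambda f: score([g for g in selected if g != f]))
        selected.remove(drop)
    return selected

# Toy scoring function (assumption): each feature has a fixed usefulness.
usefulness = {"a": 3, "b": 0, "c": 2, "d": 1}
toy_score = lambda subset: sum(usefulness[f] for f in subset)
kept = backward_search(["a", "b", "c", "d"], k=2, score=toy_score)
```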

Linearly Married

Two classes are not linearly separable

Convolutional Neural Networks

Type of feed-forward artificial neural network in which the connectivity pattern between its neurons is inspired by the organization of the animal visual cortex, whose individual neurons are arranged in such a way that they respond to overlapping regions tiling the visual field.

Feature Transformation

Preprocessing a set of features in order to create a new, often smaller, feature set. Attempts to retain as much relevant, useful information as possible

Redundancy

____ is reduced when efficient coding maximizes the amount of information represented in a pattern of spikes

Grandmothered

____ networks are overtrained, make no errors, and end up responding only to one type of input

Digital coding

____ uses 0's and 1's

matrix associative memory

a single weight matrix represents associative memory - corresponds to the weight vector of the ANN

What happens if you weight the distance (play with scale)?

reduce noise, improve classification

Comparing Hypotheses

t-test; Analysis of Variance (ANOVA) test

disadvantages of K-NN

The distance function behaves differently if feature values are scaled differently, so normalize all feature values; k-NN also does not perform well when the dimensionality of the feature space is very high

k-fold cross validation

divide data into k equal subsets, run learning k times (each time leave out one set for testing), avg. error rate of all k rounds is a better estimate of model accuracy
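
A sketch of the procedure, assuming contiguous folds over an already shuffled data set. The helper names and the `fit` / `error` callbacks are hypothetical placeholders for any learner and any error measure:

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k roughly equal folds over range(n)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

def cross_val_error(xs, ys, k, fit, error):
    """Average test error over k rounds, each leaving out one fold for testing."""
    errs = []
    for train, test in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errs.append(error(model, [xs[i] for i in test], [ys[i] for i in test]))
    return sum(errs) / k

# Toy learner (assumption): always predict the mean of the training targets.
fit_mean = lambda xs, ys: sum(ys) / len(ys)
mse = lambda m, xs, ys: sum((y - m) ** 2 for y in ys) / len(ys)
folds = list(k_fold_indices(10, 3))
avg_err = cross_val_error(list(range(10)), [2.0] * 10, 3, fit_mean, mse)
```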

How to avoid overfitting?

divide data into training/test; use only training set to train; test on test set

What does "describes" mean in inductive learning defn.

h models the observations well, and is likely to predict future observations well

NN Restriction Bias perceptron

half spaces

multiclass classification

labels belonging to three or more classes

Linear SVM (Support Vector Machine) classifier

the support vectors are the nearest points

What is scaling?

transform features to mean 0 and variance 1 e.g. Z-scores
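
A minimal sketch of z-score scaling for one feature, assuming the population standard deviation is used and is non-zero (the function name `z_scores` is illustrative):

```python
from statistics import mean, pstdev

def z_scores(values):
    """Rescale one feature to mean 0 and variance 1."""
    mu, sigma = mean(values), pstdev(values)  # assumes sigma != 0
    return [(v - mu) / sigma for v in values]

z = z_scores([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # mean 5, std 2
```

In practice the mean and standard deviation would be computed on the training set and reused for the test set.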

competitive net architecture

unsupervised, feedforward, sparse or dense, nonlinear, random, multi-layer, binary

Noisy data - Binning

Cancelling noise by binning: - Sort data - Create local groups (bins) of data - Replace original values by either the bin mean or the closest min/max value of the bin
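
A sketch of both replacement strategies, assuming equal-frequency bins of a fixed size (the function name `smooth_by_bins` and the `method` flag are illustrative):

```python
def smooth_by_bins(values, bin_size, method="mean"):
    """Smooth sorted data bin by bin, using bin means or bin boundaries."""
    data = sorted(values)                      # sort data
    out = []
    for start in range(0, len(data), bin_size):
        b = data[start:start + bin_size]       # one local group (bin)
        if method == "mean":
            out += [sum(b) / len(b)] * len(b)  # replace by the bin mean
        else:
            lo, hi = b[0], b[-1]               # replace by the closest min/max
            out += [lo if v - lo <= hi - v else hi for v in b]
    return out

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
by_mean = smooth_by_bins(prices, 3, "mean")
by_boundary = smooth_by_bins(prices, 3, "boundary")
```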

Data Transformation

Data transformation alters original data to make it more suitable for data mining. • Smoothing (noise cancellation) • Aggregation • Generalisation • Normalisation • Attribute/feature construction

What is a loss function?

It is a function of the network's parameters which measures the difference between the output of the network and the desired output. Training the network means minimizing the loss by modifying the parameters (weights).

Missing Attributes

It is common to have examples in your dataset with missing attributes/variables. One way of training a tree in the presence of missing attributes is to remove all data points with any missing attributes. A better method is to remove only the data points that miss the attribute required by the test being considered at a given node. This is a great benefit of trees (and in general of combined models).

Bayesian Learning

Learn most probable H given data and domain knowledge

Inductive Learning

Learning from examples

Nearest Neighbor Big-O

Learning is very fast, with time O(1) and space O(n). Querying is also fast, with time O(log n + k) and space O(1) (assuming the stored points are sorted)

Learning Rulesets

Learning rules sequentially, one at a time • Also known as separate-and-conquer Learning all rules together • Direct rule learning • Deriving rules from decision trees

What are some components of human intelligence?

Learning, reasoning and logic, behavior in social situations

Why Sample for Bayes net?

Less complex and almost as accurate as inference

cognitive aging

Lifelong process of gradual, ongoing, yet highly variable change in cognitive functions that occur as people get older that is not a disease or a quantifiable level of function

Linear Separability

Linearly separable data: • Datasets whose classes can be separated by linear decision surfaces • Implies no class-overlap • Classes can be divided by e.g. lines for 2D data or planes in 3D data

What is each row in a data frame?

observation with a value for each variable

Supervised Learning Methods

- Bayesian Classification - Perceptron Learning - Multilayer perceptrons - Fisher's LDA

Bayesian Classification Assumptions

- the densities are isotropic - priors are equal

ID3 algorithm

A <- best attribute; assign A as decision attribute for node; for each value of A, create a descendant of node; sort training examples to leaves; if examples are perfectly classified, stop; else iterate over leaves

Linear Regression

Match to mx+b function with minimum squared error (could be constant function f(x) = c)

Sigmoid Activation Function

Use sigmoid for activation to make the unit differentiable: sig(a) = 1/(1 + e^(-a))
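
A short sketch of the function and of its derivative, which is what makes it convenient for gradient-based training (the derivative identity sig'(a) = sig(a) * (1 - sig(a)) is a standard fact; the function names are illustrative):

```python
import math

def sigmoid(a):
    """sig(a) = 1 / (1 + e^(-a)); note math.exp overflows for very negative a."""
    return 1.0 / (1.0 + math.exp(-a))

def sigmoid_prime(a):
    """Derivative of the sigmoid, expressed through the sigmoid itself."""
    s = sigmoid(a)
    return s * (1.0 - s)

mid = sigmoid(0.0)
slope = sigmoid_prime(0.0)
```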

KNN with k=n (weighted average)

Variable output, closer points have more effect

Boosting alpha algorithm

alpha_t = 1/2 * ln((1-error_t)/error_t)
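
A direct sketch of this formula, assuming error_t is the weighted error of the weak learner in round t and lies strictly between 0 and 1 (the function name `boosting_alpha` is illustrative):

```python
import math

def boosting_alpha(error):
    """alpha_t = 1/2 * ln((1 - error_t) / error_t), for 0 < error_t < 1."""
    return 0.5 * math.log((1.0 - error) / error)

alphas = [boosting_alpha(e) for e in (0.1, 0.25, 0.5)]
```

A weak learner no better than chance (error 0.5) gets weight 0, and the weight grows as the error shrinks.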

KNN Preference Bias

locality (near points are similar to each other) smoothness (averaging) All features matter equally

NN Preference Bias

low complexity networks, low weights

Fishers Ratio rewritten

to make dependence on weight vector explicit

Classification

Choosing from set of classes that matches a set of inputs

Gain algorithm

Gain(S, A) = E(S) - sum_v (|S_v| / |S|) * E(S_v), where E is the entropy and S_v is the subset of S for which attribute A has value v
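
A sketch of entropy and information gain for one categorical attribute, assuming base-2 logarithms and weighting each subset by |S_v| / |S| (the function names are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    """E(S) = -sum_c p_c * log2(p_c) over the class proportions in S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """E(S) minus the size-weighted entropy of each subset S_v."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(values):
        subset = [l for x, l in zip(values, labels) if x == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

labels = [1, 1, 0, 0]
perfect = information_gain(["a", "a", "b", "b"], labels)   # splits classes cleanly
useless = information_gain(["a", "b", "a", "b"], labels)   # tells nothing about class
```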

NN Restriction Bias sigmoid

More complex, not much restriction

SVM: y = w(^T)x + b variables

y: classification label w: parameters of plane (hyperplane) b: moves plane in/out of origin

Neural Networks

Artificial representation of a neuron in mammal brains; has an activation function that provides a sign {+, -} based on the weighted sum of inputs. Each input is given a weight. Computes a separating line.
