ANN1

Do not forget to include terms with index zero

for the bias in all calculations of net.

Overtraining an ANN

The network performs very well on the data it was trained with but poorly on test data: it specializes to the training set, so generalization error increases.

NN Restriction Bias perceptron

half spaces

Deep learning methods

(Deep) Neural Networks • Convolutional Neural Networks • Restricted Boltzmann Machines/Deep Belief Networks • Recurrent Neural Networks

accuracy

(TP + TN)/ (TP + TN + FP + FN)

Clustering Applications

- Market segmentation - Social Network Analysis - Vector quantization - Facial Point detection

Sensitivity Analysis on ANN Models

-Common criticism: lack of explainability -The black box syndrome

Unsupervised Learning

-For clustering -Self organizing -Adaptive resonance theory

Discrete Unsupervised

1. ART-1 2. Carpenter/Grossberg

Association

1. Link Analysis 2. Sequence Analysis

Clustering

1. Outlier Analysis

Data Mining Tasks

1. Prediction 2. Association 3. Clustering

Discretionary Input

1. Supervised 2. Unsupervised

How much energy does the human brain consume?

10 watts

What operations work for ordered factors?

>, <, min, max

Kohonen's self-organizing feature map

A type of neural network model for machine learning

By training the system on multiple sets of input-output relationships, it can learn by...

Adjusting weights to correctly identify a specific set of inputs as a bird and other sets of inputs as a fish.

Neural Networks

Artificial representation of a neuron in mammal brains, has activation function that provides a sign {+, -} based on the weighted sum of inputs. Each input given a bias (weight). Computes separating line.

What are the similarities and differences between biological and artificial neurons?

Artificial neurons need more time, practice, and examples to learn

Tree Building Algorithms

C4.5, CART, CHAID

What does a silicon retina do?

Captures light and changes its output depending on changes in the light

Directed PGN

Edges have direction (Bayesian Network)

Choosing K

Elbow method • Visual inspection • 'Downstream' Analysis

False Positive Rate

FP / actual negatives = FP / (TN + FP)

Soft Clustering

For clusters that overlap with other clusters, each point has a percentage that it is of each cluster

Bayesian Learning Algorithm

For each h in H calculate P(h|D) and return the argmax (Not practical |H| is usually infinite)

i.i.d.

Independent and identically distributed random variables

Content-addressable storage

Information that is stored in the weights of the connections, the same way that synapses change their strength during learning

Deep Learning

Involves developing the tools of critical thinking and applying them to whatever challenges you encounter now and in the future. - Deep learning neural network architectures differ from normal ones because they have more hidden layers - They also differ because they can be trained in an unsupervised or supervised manner for both unsupervised and supervised tasks

What is the gradient of a function?

It is the "derivative" of a multi-variable function that maps a vector onto a scalar. It is the vector that contains all partial derivatives of that function, and it points in the direction of steepest ascent of the function at that point.

What is the Jacobian of a function?

It is the "derivative" of a multi-variable function that maps a vector onto a vector. It is the matrix that contains all first-order partial derivatives of that function: one row per output component, one column per input variable.

The perceptron model is a more general computational model than the McCulloch-Pitts neuron.

It takes an input, aggregates it (weighted sum) and returns 1 only if the aggregated sum is more than some threshold else returns 0.
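
A minimal sketch of the threshold unit described on this card. The function name, the weights, and the threshold value are illustrative assumptions (chosen so the unit behaves like an AND gate), not values from the source.

```python
def perceptron_output(inputs, weights, threshold):
    """Return 1 if the weighted sum of the inputs exceeds the threshold, else 0."""
    aggregated = sum(w * x for w, x in zip(weights, inputs))
    return 1 if aggregated > threshold else 0


# Hypothetical weights and threshold that make the unit act as an AND gate.
and_weights = [1.0, 1.0]
and_threshold = 1.5

# Truth table over the inputs (0,0), (0,1), (1,0), (1,1).
and_table = [
    perceptron_output([a, b], and_weights, and_threshold)
    for a in (0, 1)
    for b in (0, 1)
]
```

Only the input (1, 1) pushes the weighted sum past the threshold, so the unit fires for that case alone.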

Define KNN

K Nearest Neighbours uses the majority in k neighbours (training data) to make a prediction about the new test data
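
A small sketch of the majority vote, assuming Euclidean distance and a tiny made-up training set. The names `knn_predict`, `train_points` and `train_labels` are illustrative only.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Pair each training point's distance to the query with its label, nearest first.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up training data: three points of class "a" near the origin, two of class "b" near (1, 1).
train_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1)]
train_labels = ["a", "a", "a", "b", "b"]

prediction_near_origin = knn_predict(train_points, train_labels, (0.15, 0.15), k=3)
prediction_near_corner = knn_predict(train_points, train_labels, (0.95, 1.0), k=3)
```

Every prediction scans the whole training set, which is why kNN counts as a memory-based (lazy) method.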

Inductive Learning

Learning from examples

For each weight the change is:

Learning rate * error * activation * (1 - activation) * input. Error assigned to unit = sum of weighted errors of units fed
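
The two formulas on this card can be sketched as follows, assuming sigmoid units (which is where the activation * (1 - activation) factor comes from). Function names and the example numbers are hypothetical.

```python
def weight_change(learning_rate, error, activation, unit_input):
    """Weight change = learning rate * error * activation * (1 - activation) * input."""
    return learning_rate * error * activation * (1.0 - activation) * unit_input


def hidden_error(outgoing_weights, downstream_errors):
    """Error assigned to a unit = sum of the weighted errors of the units it feeds."""
    return sum(w * e for w, e in zip(outgoing_weights, downstream_errors))


# Made-up numbers purely to exercise the formulas.
delta_w = weight_change(learning_rate=0.5, error=0.2, activation=0.5, unit_input=1.0)
unit_error = hidden_error(outgoing_weights=[0.4, -0.2], downstream_errors=[0.1, 0.3])
```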

Learning Rulesets

Learning rules sequentially, one at a time • Also known as separate-and-conquer Learning all rules together • Direct rule learning • Deriving rules from decision trees

Linear Regression Big-O

Learning takes time O(n) with space O(1) Querying is time O(1) and space O(1)

Linear Separability

Linearly separable data: • Datasets whose classes can be separated by linear decision surfaces • Implies no class-overlap • Classes can be divided by e.g. lines for 2D data or planes in 3D data

Inductive Learning: batch or online

Manner in which training examples are presented

Association Rules

Reflect items that are frequently found (purchased) together, i.e. they are frequent itemsets • Information that customers who buy beer also buy crisps is encoded as, e.g.: beer ⇒ crisps [support = 2%, confidence = 75%]

NN Error Functions

Regression: - Binary classification - Multiple independent binary Classification: - Multi-class classification (mutually exclusive):

Abstract Essence of ML

Representation + Evaluation + Optimisation

TPR - True Positive Rate - Recall

TP / actual positives = TP / (TP + FN)
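
The rate formulas on these cards (accuracy, false positive rate, true positive rate) can be sketched as one small function. The function name and the confusion-matrix counts are made up for illustration.

```python
def classification_rates(tp, tn, fp, fn):
    """Compute accuracy, true positive rate (recall) and false positive rate from counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "tpr": tp / (tp + fn),  # TP / actual positives
        "fpr": fp / (fp + tn),  # FP / actual negatives
    }


# Hypothetical confusion-matrix counts: 50 actual positives, 50 actual negatives.
rates = classification_rates(tp=40, tn=45, fp=5, fn=10)
```

Note the parentheses in the denominators: TP / (TP + FN) is not the same as TP / TP + FN.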

SVMs seek a decision boundary

That maximizes the margin

Energy cost

The ____ of signaling - from one neuron to another - has probably been a major factor in the evolution of brains

Feedforward associator

The simplest form of ANN which has layers of interconnected input and output units

Feed forward associates

The simplest form of artificial neural networks that contains layers of interconnected input and output units

Why can't ANNs operate in real time?

They are simulated on digital computers, and the simulation takes time.

Explain the ID3 algorithm

This recursive algorithm builds a decision tree by using the entropy function (computed from the probabilities of the class values) to choose, at each node, the attribute whose split reduces entropy the most.
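
A minimal sketch of the entropy calculation that ID3 relies on, assuming the usual information-gain criterion for ranking attributes. The helper names and the toy labels are assumptions for illustration; this is only the split-scoring step, not the full recursive algorithm.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(labels, attribute_values):
    """Entropy of the labels minus the weighted entropy after splitting on an attribute."""
    total = len(labels)
    remainder = 0.0
    for value in set(attribute_values):
        subset = [l for l, v in zip(labels, attribute_values) if v == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


# Toy data: one attribute separates the classes perfectly, the other not at all.
labels = ["yes", "yes", "no", "no"]
perfect_split = ["sunny", "sunny", "rain", "rain"]
useless_split = ["hot", "cold", "hot", "cold"]

gain_perfect = information_gain(labels, perfect_split)
gain_useless = information_gain(labels, useless_split)
```

ID3 would pick the attribute with the larger gain, split on it, and recurse on each subset.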

Forward propagation

This step is called forward-propagation, because the calculation flow is going in the natural forward direction from the input -> through the neural network -> to the output.

Where do neurons (biological or silicon) send their messages and information?

To targeted areas/neurons

Sample

Training set

Simple Perceptron

Used to classify patterns said to be linearly separable

Signal Transduction Step 1

When a pre-synaptic neuron is activated, an electrical signal is sent down the axon toward the terminal buttons. (Electrical conduction)

feedforward networks

While it is typical to consider this type of network structure as related to feedforward networks where the nodes are visualized as being attached, this type of architecture is fundamentally different in arrangement and motivation.

Grandmothered

____ networks are overtrained, make no errors, and end up responding only to one type of input

Digital coding

____ uses 0's and 1's

What is unsupervised learning?

a kind of learning where the system is trained to discover statistically salient features of the input population

Net Output of Node

activation value of node multiplied by the weight of the link

backpropagation rule

actual output is determined by computing the outputs of units for each hidden layer

Carbamazepine

blocks Na channels

Lamotrigine

blocks Na channels

Perceptron

A perceptron is not the sigmoid neuron we use in ANNs or any deep learning networks today.

Ketone bodies

by-product of the lipid metabolic pathway after the fat is converted to energy

What function in R tells us the variable type?

class

Rule confidence

confidence(A -> B) = P (B|A)

topological vicinity

defined as topological distance D(t) between nodes - the neighbourhood contains nodes that are within a distance of D(t) from node j at time t, where D(t) decreases with time - D(t) does not refer to Euclidean distance, it refers to the length of path connecting two nodes

applications

face recognition online, self-driving cars, voice replication

lesson 3

feedback projections can help to direct attention

What is regression?

finding a function that predicts continuous data

partial seizure

involve only a portion of the brain at the onset

Generalized seizure

involve the whole brain; consciousness is usually lost

What is ε often called?

irreducible error since it's not estimable

perceptron limitations

First, the output values of a perceptron can take on only one of two values (0 or 1) due to the hard-limit transfer function

Ancestral sampling

is a simple sampling method well suited to PGNs

Inductive Learning: 1 - delta

probability of successful learning

face fusiform area

responds to faces

Linear SVM (Support Vector Machine) classifier

the support vectors are the nearest points

How do we query a variable in a tibble?

using $

Thresholds are neither continuous nor differentiable,

whereas the sigmoid function is both continuous and differentiable

Memory-based methods

• Uses all training data in every prediction (e.g. kNN) • Becomes a kernel method if using a non-linear example comparison/ metric

ROBBINS-MONRO

•Addresses the slow update speed of the M-step in K-means •Uses linear regression (see lecture 1)

Each weight is updated using the increment

Δw_i = −γ ∂E/∂w_i for i = 1, ..., ℓ, where γ represents a learning constant, i.e., a proportionality parameter which defines the step length of each iteration in the negative gradient direction.

Calculate gradient

∇E = (∂E/∂w_1, ∂E/∂w_2, ..., ∂E/∂w_ℓ)
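
A rough sketch of gradient descent with the update Δw_i = −γ ∂E/∂w_i. It assumes a made-up quadratic error function and estimates the partial derivatives numerically; a real network would obtain them by backpropagation. All names are illustrative.

```python
def numerical_gradient(error_fn, weights, eps=1e-6):
    """Estimate the gradient (all partial derivatives of E) by central differences."""
    grad = []
    for i in range(len(weights)):
        up = list(weights)
        up[i] += eps
        down = list(weights)
        down[i] -= eps
        grad.append((error_fn(up) - error_fn(down)) / (2 * eps))
    return grad


def gradient_step(error_fn, weights, gamma):
    """Apply w_i <- w_i - gamma * dE/dw_i to every weight."""
    grad = numerical_gradient(error_fn, weights)
    return [w - gamma * g for w, g in zip(weights, grad)]


def quadratic_error(w):
    """Hypothetical error surface with its single minimum at w = (1, -2)."""
    return (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2


weights = [0.0, 0.0]
for _ in range(200):
    weights = gradient_step(quadratic_error, weights, gamma=0.1)
```

With γ = 0.1 the distance to the minimum shrinks by a constant factor each step; a much larger γ would overshoot and oscillate.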

One-class SVM

- Unsupervised learning problem - Similar to probability density estimation - Instead of a pdf, goal is to find smooth boundary enclosing a region of high density of a class

Backpropagation

- Used to calculate derivatives of error function efficiently - Errors propagate backwards layer by layer

Three common ways to decide when to stop splitting decision tree

- Validation set - Cross-validation - Hypothesis testing (chi-squared statistic)

Connectionist Models Of Mind: Visual Object Identification

- Visual features (lines etc.) - Letter identification (the letter itself)

Occam's Razor

All things being equal - the simplest explanation is the best

Decision Tree Stopping Conditions

Early stopping • Cross-validation • Pruning weaker leaves

Confusion Matrix

Easy to see if the system is commonly mislabelling one class as another

Energy Minimization

Energy = measure of task performance error

Why use a sigmoidal activation function?

Firstly, people wanted to use a function that they could use mathematics to understand. Thresholds are neither continuous nor differentiable, whereas the sigmoid function is both continuous and differentiable. It looks very similar to a threshold with rounded edges

Of course we only use a training method if we do not know an easier way of calculating the weights required for success

For example, it would be foolish to train a neural network to learn the Boolean function for majority voting.

Mutual information

Measure of the reduction of entropy in variable x given a variable y: I(x,y) = H(x) - H(x|y)
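
A short sketch of mutual information computed from a joint probability table, using the standard identities I(x,y) = H(x) - H(x|y) and H(x|y) = H(x,y) - H(y). The function names and the two example tables are assumptions for illustration.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def mutual_information(joint):
    """joint[i][j] = P(x=i, y=j). Returns I(x,y) = H(x) - H(x|y)."""
    px = [sum(row) for row in joint]        # marginal P(x)
    py = [sum(col) for col in zip(*joint)]  # marginal P(y)
    h_xy = entropy(p for row in joint for p in row)
    h_x_given_y = h_xy - entropy(py)
    return entropy(px) - h_x_given_y


# Independent variables share no information; identical variables share all of it.
independent = [[0.25, 0.25], [0.25, 0.25]]
identical = [[0.5, 0.0], [0.0, 0.5]]

mi_independent = mutual_information(independent)
mi_identical = mutual_information(identical)
```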

SVM: x_i (^T) x_j

Measure of similarity

Kurtosis

Measurement of noisiness of data dimension

model development (def)

Model development (aka estimation, building) is the process in which the learning algorithm crafts the model in such a way that some measure of the agreement (aka fit; antonym: error, loss) between the model and the data is maximized. (linear regression!)

K-Means Properties

• Monotonically non-increasing in error • Each iteration is polynomial: O(kn) • Finite number of iterations: O(k^n) • Error decreases if ties are broken consistently • Can get stuck

Adaptive Resonance Theory

- allow the number of clusters to vary with problem size - adaptively adds new clusters as new nodes - binary values only - #inputs = size of vector #outputs = number of vectors - finished when the input returned (signal travels both ways) is close enough to original input - trains on same vectors multiple times - bidirectional

comparison FF vs bidirectional

- associative memory (feed forward only) will produce an output for an input even if it was not one of the patterns that were originally stored in W but it will not necessarily be one of the stored patterns - bidirectional memory will always produce one of the previously stored outputs, and in addition the input will have been modified to become the input that was paired with that output

applications of topological structure

- clustering: each weight vector represents a centroid - vector quantization: weight vectors represent codebook vectors - approximating probability distribution: the number of centroids in a given region is roughly proportional to the number of input vectors in that region

Hopfield networks

- commonly used for auto-association and optimization tasks - node values are iteratively updated based on its net weighted input at a given time - it is a fully connected symmetric network - weights are determined using Hebbian principle - typically undergoes many state transitions before reaching the stable state

components of Human Intelligence:

-perception,self-awareness -learning -ability to use reason and logic -ability to write and speak clearly, use language -behavior in social situations -ability to recognize, understand and deal with people, objects, and symbols -ability to think on the spot and solve novel problems (intuition)

What are the advantages of neural networks?

-prediction accuracy is generally high -robust, works when training examples contain errors -output may be discrete, real-valued, or a vector of several discrete or real-valued attributes -fast evaluation of the learned target function

What happens while the epoch produces an error? (6)

1. Continually check the next input 2. Err = T - O 3. T is the value that we need produced 4. O is the output 5. Err shows by how much the output of O differs from training value T 6. If Err is not 0 then we need to change the weights

There are three reasons to reduce the dimensionality of a feature set

1. Remove features that have no correlation with the target distribution 2. Remove/combine features that have redundant correlation with target distribution 3. Extract new features with a more direct correlation with target distribution.

backpropagation and automatic differentiation

Backpropagation is a special case of a more general technique called automatic differentiation.

Define supervised learning

Supervised learning is the machine learning task of a learning that maps an input to an output based on training data

Complex Itemsets

The general rule procedure for finding frequent item sets would be: 1. Find all frequent itemsets 2. Generate strong association rules However, this is terribly costly, with the total number of item sets to be checked for 100 items being

In Oja's rule what's the limiting term and what does it do?

The limiting term is r V^2 w_j. This ensures that |w_j| is <= 1.

The locations of the neurons so tuned

The locations of the neurons so tuned (i.e. the winning neurons) become ordered and a meaningful coordinate system for the input features is created on the lattice. The SOM thus forms the required topographic map of the input patterns.

Self-organization networks

The main difference between them and conventional models is that the correct output cannot be defined a priori, and therefore a numerical measure of the magnitude of the mapping error cannot be used.

Support Vector Learning

The minimal representative subset of the available data is used to calculate the synaptic weights of the neurons

SVM Margin is defined as

The minimum distance between decision boundary and any sample of a class

Generalize

The neat thing about ANNs lies in their ability to ____ to input patterns they have never been exposed to in training

The Forward Pass 1

To begin, let's see what the neural network currently predicts given the weights and biases above and inputs of 0.05 and 0.10. To do this we'll feed those inputs forward through the network

Redundancy

____ is reduced when efficient coding maximizes the amount of information represented in a pattern of spikes

we get

a = σ(net)
(i) = 1/(1 + exp(-net))
(ii) = 1/(1 + exp(-Σ_{h=0}^{H} w_h a_h))
(iii) = 1/(1 + exp(-Σ_{h=0}^{H} w_h (1/(1 + exp(-net_h)))))
(iv) = 1/(1 + exp(-Σ_{h=0}^{H} w_h (1/(1 + exp(-Σ_{i=0}^{N} w_{ih} x_i)))))

Layers

a row of nodes. Input and output layers

matrix associative memory

a single weight matrix represents associative memory - corresponds to the weight vector of the ANN

Back-propagation

a supervised learning algorithm the network produces an output which is compared to the known desired output weights are then modified to reduce error

A few 'loose ends' need to be considered: in kohonen layer

a) How do we choose initial classes? There are a number of ways of doing this. Choosing weights (vectors) randomly is one way, and this is often followed by a slightly different learning strategy for a short time. One 'tweak' is that rather than adjusting just one unit, we adjust the losers in the other directions. An alternative is to start with a large learning rate (η) and reduce this as learning progresses.

Boosting alpha algorithm

alpha_t = 1/2 * ln((1-error_t)/error_t)
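
A minimal sketch of the alpha formula on this card. The function name is an assumption; `error_t` is the weighted error of the weak classifier in round t and must lie strictly between 0 and 1.

```python
import math


def boosting_alpha(error_t):
    """alpha_t = 1/2 * ln((1 - error_t) / error_t)."""
    return 0.5 * math.log((1.0 - error_t) / error_t)


alpha_chance = boosting_alpha(0.5)   # a coin-flip classifier
alpha_decent = boosting_alpha(0.25)  # better than chance
alpha_strong = boosting_alpha(0.1)   # much better than chance
```

A classifier no better than chance gets zero weight in the final vote, and the weight grows as the error shrinks.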

Artificial Neural Network

an artificial neural network is composed of many artificial neurons that are linked together according to a specific network architecture. The objective of the neural network is to transform the inputs into meaningful outputs.

step 2

create many units and many layers and connect them

State 2 partition methods

cross validation and bootstrapping

One particularly interesting class of unsupervised system is based on competitive learning, in which the output neurons compete amongst themselves to be activated, with the result that only one is activated at any one time

This activated neuron is called a winner-takes-all neuron or simply the winning neuron. Such competition can be induced/implemented by having lateral inhibition connections (negative feedback paths) between the neurons. The result is that the neurons are forced to organise themselves. For obvious reasons, such a network is called a Self Organizing Map (SOM)

Rock-Mine Network (Task)

design a computerized sonar system to differentiate echoes returned from rocks vs. those returned from mines. Sonar system sends a wave to determine whether object is a rock or mine

for MLPs and backprop and perhaps simple perceptrons how is the change in the weights determined?

determined by the partial derivative of the cost function with respect to the weight vector.

disadvantages of K-NN

the distance function behaves differently when feature values are scaled differently, so all feature values should be normalized; K-NN also does not perform well when the dimensionality of the feature space is very high

k-fold cross validation

divide data into k equal subsets, run learning k times (each time leave out one set for testing), avg. error rate of all k rounds is a better estimate of model accuracy
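
The procedure on this card can be sketched as an index splitter plus an averaging loop. `error_for_fold` is a hypothetical callback standing in for "train on the training indices, return the error rate on the test indices"; no particular learner is implied.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k nearly equal folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validated_error(n_samples, k, error_for_fold):
    """Average the error returned by `error_for_fold(train, test)` over all k rounds."""
    errors = [error_for_fold(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(errors) / len(errors)


folds = list(k_fold_indices(10, 5))
# Placeholder callback that pretends every round has a 20% error rate.
mean_error = cross_validated_error(10, 5, lambda train, test: 0.2)
```

In practice the data would usually be shuffled before splitting; that step is left out here to keep the sketch deterministic.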

How to avoid overfitting?

divide data into training/test; use only training set to train; test on test set

KNN Preference Bias

locality (near points are similar to each other) smoothness (averaging) All features matter equally

What is the loss in backpropagation?

loss = absolute value of (desired - actual).

Ketogenic Diet

low carb, low protein, high fat designed to induce a continuous state of ketosis

Input data characteristics

• Enough examples of desired outputs are needed for the ANN to adequately generalize from the data • Must adequately represent all possible conditions • Must not exclude variables that drastically affect the relationships in the data • Must include a good representation of the desired output

State the error formula for testing performance

error = number of errors / number of cases

canned phrases

everyday communication - greeting others, answering phone calls, replying "I'm fine, and yourself?"

Perceptrons

explain how simple Perceptrons (such as those implementing a NOT, AND, NAND and OR gates) can be designed

types of categorical data in R

factor (categories) logical (T or F) character (words or sentences)

Myelin Sheath

fatty substance that is deposited around the axon of some cells that speeds conduction

Boosting algorithm

for t in T:
    construct importance distribution at t
    find weak classifier h_t(x) with small error
output H_final

The power of backpropagation stems

from its training algorithm as we shall see below

inductive learning defn.

given a set of observations come up with a model, h, that describes them.

What does "describes" mean in inductive learning defn.

h models the observations well, and is likely to predict future observations well

What is the solution to a backpropagation neural network?

The combination of weights which minimizes the error function is considered to be a solution of the learning problem

overfitting

occurs when the generated model is overly complex and tries too closely to capture the idiosyncrasies in the data under study, instead of capturing the overall pattern. By doing so, the model will fail to accurately predict future (previously unseen) observations

Entropy

randomness in data

RNS system

seizure prediction/disruption

lower layers

sensitive to basic features such as edges and their orientations

Main difference between supervised and unsupervised learning?

supervised: labels known (classification) unsupervised: no labels (clustering)

Rule support

support(A -> B) = P(A ∪ B)

the forward pass 4

Here's the output for o_1:
net_{o1} = w_5 * out_{h1} + w_6 * out_{h2} + b_2 * 1
net_{o1} = 0.4 * 0.593269992 + 0.45 * 0.596884378 + 0.6 * 1 = 1.105905967
out_{o1} = \frac{1}{1+e^{-net_{o1}}} = \frac{1}{1+e^{-1.105905967}} = 0.75136507
And carrying out the same process for o_2 we get: out_{o2} = 0.772928465
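
The arithmetic for o_1 can be checked with a few lines of Python. The numbers are the ones quoted on this card; the variable names simply mirror its notation, and the weights and bias feeding o_2 are not given here, so only o_1 is reproduced.

```python
import math


def sigmoid(net):
    """Logistic activation: 1 / (1 + e^(-net))."""
    return 1.0 / (1.0 + math.exp(-net))


# Hidden-layer outputs, weights and bias as quoted on the card.
out_h1, out_h2 = 0.593269992, 0.596884378
w5, w6, b2 = 0.4, 0.45, 0.6

net_o1 = w5 * out_h1 + w6 * out_h2 + b2 * 1
out_o1 = sigmoid(net_o1)
```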

Weights may get different corrections depending upon their input. In the Perceptron case it is easy to assign 'credit' (or rather blame)

when things go wrong because we know what the error is and we know what caused the error - weights that had non-zero inputs

perceptron learning rule

Δw_ij = k δ_j x_i where δ_j = (t_j - y_j) for desired outputs t
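
A small sketch of this learning rule for a single output unit, trained on the OR gate. It assumes a hard threshold at zero, all weights starting at zero, and a bias handled as the weight with index zero on a constant input of 1. The function names, k = 0.1 and the epoch count are illustrative choices.

```python
def predict(weights, x):
    """Threshold unit: fire (1) if the weighted sum, including the bias term, is positive."""
    extended = [1.0] + list(x)  # index 0 carries the bias input
    return 1 if sum(w * xi for w, xi in zip(weights, extended)) > 0 else 0


def train_perceptron(samples, targets, k=0.1, epochs=20):
    """Apply delta_w_i = k * (t - y) * x_i after every training example."""
    weights = [0.0] * (len(samples[0]) + 1)
    for _ in range(epochs):
        for x, t in zip(samples, targets):
            delta = t - predict(weights, x)
            extended = [1.0] + list(x)
            weights = [w + k * delta * xi for w, xi in zip(weights, extended)]
    return weights


or_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
or_targets = [0, 1, 1, 1]
or_weights = train_perceptron(or_inputs, or_targets)
or_predictions = [predict(or_weights, x) for x in or_inputs]
```

Only weights whose input was non-zero are changed by an error, which is the credit (or blame) assignment the neighbouring card describes.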

gives us an expression for this correction which, for a single output unit in our notation is:

ε = (t - a)a(1 - a) the error on output

Causes of incomplete data

• "Not applicable" data value when collected • Different considerations between the time when the data was collected and when it is analyzed. • Human/ hardware/ software problems

Evaluating Rules

• A good rule should not make mistakes and should cover as many examples as possible Complexity: Favour rules with simple predicates (Occam's Razor)

Mining frequent patterns

• One approach to data mining is to find sets of items that appear together frequently: frequent itemsets • To be frequent some minimum threshold of occurrence must be exceeded • Other frequent patterns of interest: ____ frequent sequential patterns ____ frequent structured patterns

DM Types of data

• Relational databases • Data warehouses • Transactional databases • Object-relational databases • Temporal/sequence/time-series databases • Spatial and Spatio-temporal databases • Text & Multimedia databases • Heterogeneous & Legacy databases • Data streams

prediction model (introduction)

= a mapping from known features to an unknown target - do not make decisions: prediction + threshold = decision

What are attractors?

These are stored patterns in the configuration space and they attract the network to one of these states. When the network is stable it should have converged to one of these attractors.

Where do integrate-and-fire neurons act, and what are they capable of doing?

They act in their subthreshold region, even though they are capable of switching their voltages to go past the threshold

What are artificial neural networks used for, and how are they observed?

They are used to study learning and memory, they are observed on computers, and they consist of simple units connected in networks

what's the significance of the cost function?

it determines the proximity of the output of the network to the desired state.

Forward Search

iterate through features, training learner and pick subset that performs the best. Continue until k features picked.

If we differentiate σ(net) with respect to net we get

dσ(net)/dnet = exp(-net) (1 + exp(-net))^(-2) = σ(net)(1 - σ(net)) = a(1 - a)
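
The identity dσ/dnet = σ(net)(1 - σ(net)) can be sanity-checked numerically. This is a hedged sketch that compares the closed form with a central finite difference at a few arbitrary points; the function names are illustrative.

```python
import math


def sigmoid(net):
    """Logistic function sigma(net) = 1 / (1 + exp(-net))."""
    return 1.0 / (1.0 + math.exp(-net))


def sigmoid_derivative(net):
    """Closed form: a * (1 - a) with a = sigma(net)."""
    a = sigmoid(net)
    return a * (1.0 - a)


def finite_difference(net, eps=1e-6):
    """Central-difference estimate of the derivative of sigma at net."""
    return (sigmoid(net + eps) - sigmoid(net - eps)) / (2 * eps)


# Largest disagreement between the two over a handful of sample points.
max_gap = max(
    abs(sigmoid_derivative(n) - finite_difference(n)) for n in (-2.0, 0.0, 0.5, 3.0)
)
```

The derivative peaks at 0.25 when net = 0 and shrinks towards zero for large |net|.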

Distributed Processing

each unit can receive and transmit multiple inputs and outputs. (ex- one neuron)

Rock-Mine Network (Input Layer)

echo profile

Benefits of Electrical Conduction

electrical conduction is what gives the nervous system the rapid response

Passive Electrical Conduction

electrical potentials and their intensity diminish with distance traveled

Neurons communicate via

electrical synapses, chemical synapses, and extrasynaptic neurotransmission.

Modular brain hypothesis/specialization

After LGN and V1, different areas of brain are designated to recognize and process particular categories of visual information

Give an example of Labelled data

Age: 30, 40, 50
Salary: 35,000, 40,000, 60,000

Decision Tree Expressiveness

All Boolean Functions

Neuronal networks

All real brains consist of highly interconnected ____ that need space and whose neurons need energy

Bootstrapping

Estimating the sampling distribution of an estimator by resampling with replacement from the original sample.
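
A minimal sketch of the bootstrap for one estimator, the sample mean. The data, the number of resamples and the fixed seed are arbitrary assumptions made so the example is reproducible.

```python
import random
import statistics


def bootstrap_means(sample, n_resamples=1000, seed=0):
    """Resample with replacement and return the mean of each resample."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Each resample has the same size as the original sample.
        resample = [rng.choice(sample) for _ in sample]
        means.append(statistics.mean(resample))
    return means


sample = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]  # made-up observations
means = bootstrap_means(sample)
# The spread of the bootstrap means estimates the standard error of the sample mean.
standard_error = statistics.stdev(means)
```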

Some variables are observed, others are hidden/latent

Example observed: Labels of a training set Example hidden: Learned weights of a model

The output units:

The output layer might represent the object Thus a certain set of features (inputs) might map on to a specific object

Examples of feed forward networks

The perceptron and adaline

Simple Perceptron

The perceptron is a single layer feed-forward neural network.

Local minima

The smallest value of the function within a neighbourhood. There may be several of them, and a local minimum is not necessarily the global minimum.

What do we model in output variables?

numeric dependent variables: "regression" categorical dependent variables: "classification"

What is each row in a data frame?

observation with a value for each variable

SVM: kernel types

polynomial, radial basis kernel (Gaussian), sigmoid

Bayes' Theorem

posterior = (likelihood x prior)/evidence

Biologists

predicting protein shape from mRNA sequences, disorder diagnostic, personalized medicine

What is the primary goal of supervised learning?

prediction model output as function of input

What are synonyms for input variables?

predictors independent variables covariates features regressors

ockham's razor

prefer the simplest hypothesis consistent with data

Feature Transformation

preprocessing features set of features in order to create a new, often smaller, feature set. Attempts to retain as much relevant, useful information as possible

image processing pipeline

preprocessing, feature extraction, train model, classify new image

What are synonyms for output variables?

response dependent variable class label (categorical) outcome

lesson 2

there are specialized areas in the brain and we can train ANNs to compare to human brain activity

It is simpler (for the time being at least) to think of inputs

just one hidden layer and an output unit

What is each column in a data frame?

variable with its own type e.g. integer, factor

CFS

• Correlation based feature selection (CFS) selects features in a forward-selection manner. • Looks at each step at both correlation with target variable and already selected features.

DM query languages

• DM query language incorporates primitives • Allows flexible interaction with DM systems • Provides foundation for building user-friendly GUIs • Example: DMQL

Recurrent

-Network signal travel in both directions

Learning Rate

-Parameter that controls how fast the learning takes place -Too high: Jumps back and forth -Too low: Takes too long to get to solution

State the advantages of KNN

-Robust to 'noisy' data -Excellent if training data is large

SOM

-Self Organizing Maps -Introduced by Finnish professor Teuvo Kohonen -Applied to clustering type problems -Each node has set of weights corresponding to input values -When a set of input values comes in, the best matching unit (BMU) is identified

Feed Forward

-Simple Perceptron -Multilayer Perceptron

ANNSoftware

-Stand Alone: NeuroSolutions, BrainMaker, NeuralWare, NeuroShell -Part of Data Mining Suite: PASW, SAS, Statistical Data Miner

What tasks are to be solved by an artificial neural network?

-controlling the movements of a robot based on self-perception and other information -deciding the category of potential food items in an artificial world -recognizing a visual object -predicting where a moving object goes, when a robot wants to catch it

How do Neural Networks Learn?

-learning by adaptation -at the neural level the learning happens by changing of the synaptic strengths, eliminating some synapses and building new ones -synchronous activation increases the synaptic strength -asynchronous activation decreases the synaptic strength

What are the disadvantages of neural networks?

-long training time -difficult to understand the learned function -not easy to incorporate domain knowledge

what is adaptive resonance theory?

...

What are the different steps of batch learning? Is it a practical method?

1) Initialize the weights. 2) Calculate the gradient for the whole dataset. 3) Update the weights. 4) Continue with 2) until it reaches a local minimum. It is often impractical because of the size of the datasets.

What did Hebb do and in what year? (3)

1. 1949 2. First learning rule 3. Weights automatically adjusted

Predicates

A logic statement, generally as boolean logic

How many connections are in the human brain?

A million billion

Recurrent neural net

A more complex ANN that consists of a single layer where every unit is interconnected and all the units act as input and output

simple choice is defining φ(i, k) = 1

A simple choice is defining φ(i, k) = 1 for all units i in a neighborhood of radius r of unit k and φ(i, k) = 0 for all other units. We will later discuss the problems that can arise when some kinds of neighborhood functions are chosen
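
A sketch of this step neighbourhood function in use. It assumes the units sit on a one-dimensional chain, scalar weights, and the usual Kohonen-style update w_i <- w_i + η φ(i, k)(x - w_i); the card itself gives neither the lattice nor the update rule, so treat both as assumptions.

```python
def neighborhood(i, k, radius):
    """phi(i, k) = 1 for units within `radius` of unit k on a 1-D chain, else 0."""
    return 1 if abs(i - k) <= radius else 0


def update_weights(weights, x, winner, radius, eta):
    """Move every unit inside the winner's neighbourhood a fraction eta towards input x."""
    return [
        w + eta * neighborhood(i, winner, radius) * (x - w)
        for i, w in enumerate(weights)
    ]


# Five units with made-up scalar weights; unit 2 is taken as the winner for input 0.5.
weights = [0.0, 0.2, 0.4, 0.6, 0.8]
updated = update_weights(weights, x=0.5, winner=2, radius=1, eta=0.5)
```

Units 1 to 3 move towards the input while units 0 and 4 stay where they are.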

What is the structure of a recurrent neural network?

A single layer where every unit is interconnected and all units act as input and output.

Define bias

A systematic error in the model.

Turing test

A test to empirically determine whether a computer has achieved intelligence

Eager Learner

Aims to learn the function behind a dataset

Support Vector Machines

Chooses decision boundary with the greatest margin on either side

K-Means Issues

Convergence is guaranteed but not necessarily optimal - local minima likely to occur • Depends largely on initial values of μ_k. • Hard to define optimal number K. • K-means algorithm is expensive: requires Euclidean distance computations per iteration. • Each instance is discretely assigned to one cluster. • Euclidean distance is sensitive to outliers.

The self-organization process involves four major components

Cooperation: The winning neuron determines the spatial location of a topological neighbourhood of excited neurons, thereby providing the basis for cooperation among neighbouring neurons. Adaptation: The excited neurons decrease their individual values of the discriminant function in relation to the input pattern through suitable adjustment of the associated connection weights, such that the response of the winning neuron to the subsequent application of a similar input pattern is enhanced.

How do we recognize a good learning rate in terms of cross entropy and accuracy?

Cross entropy decreases steadily and accuracy increases steadily.

Data Transformation

Data transformation alters original data to make it more suitable for data mining. • Smoothing (noise cancellation) • Aggregation • Generalisation • Normalisation • Attribute/feature construction

Basic Decision Tree

Decision trees apply a series of linear decisions, that often depend on only a single variable at a time. Such trees partition the input space into cuboid regions, gradually refining the level of detail of a decision until a leaf node has been reached, which provides the final predicted label.

What general differences are there between shallow and deep ANNs?

Deeper models tend to perform better as they add more parameters, while shallow models start to overfit.

What is Deep Learning?

Definition: • Hierarchical organization with more than one (non-linear) hidden layer in-between the input and the output variables • Output of one layer is the input of the next layer

ML

Maximum Likelihood

Inductive Learning: m

Number of examples to train on

The Backwards Pass 2

Output Layer Consider w_5. We want to know how much a change in w_5 affects the total error, aka \frac{\partial E_{total}}{\partial w_{5}}
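
The derivative above is usually expanded with the chain rule. The sketch below assumes that w_5 feeds an output neuron called o1, whose net input and output are written net_{o1} and out_{o1}; those names are assumptions, since the card does not give the network diagram.

```latex
\frac{\partial E_{total}}{\partial w_{5}}
  = \frac{\partial E_{total}}{\partial out_{o1}}
    \cdot \frac{\partial out_{o1}}{\partial net_{o1}}
    \cdot \frac{\partial net_{o1}}{\partial w_{5}}
```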

Chain Rule

P(a,b) = P(a|b)P(b) or P(a,b) = P(b|a)P(a)

Bayes Rule

P(h|D) = P(D|h)P(h) / P(D)

Marginalization

P(x) = sum_y(P(x,y))
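
The chain rule, Bayes rule and marginalization cards can be checked against each other on a small table. This is a sketch only: the joint distribution below is made up for illustration, and all function names are hypothetical.

```python
# Hypothetical joint distribution P(x, y) over two binary variables.
joint = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}


def marginal_x(x):
    # Marginalization: P(x) = sum_y P(x, y)
    return sum(p for (xi, _), p in joint.items() if xi == x)


def marginal_y(y):
    # Marginalization: P(y) = sum_x P(x, y)
    return sum(p for (_, yi), p in joint.items() if yi == y)


def cond_x_given_y(x, y):
    # Chain rule rearranged: P(x|y) = P(x, y) / P(y)
    return joint[(x, y)] / marginal_y(y)


def cond_y_given_x(y, x):
    # Chain rule rearranged: P(y|x) = P(x, y) / P(x)
    return joint[(x, y)] / marginal_x(x)


def bayes_x_given_y(x, y):
    # Bayes rule: P(x|y) = P(y|x) P(x) / P(y)
    return cond_y_given_x(y, x) * marginal_x(x) / marginal_y(y)
```

Both routes to P(x|y) should agree up to floating-point rounding.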

Neural Computing

Pattern recognition methodology for machine learning

Terminology Prediction Trees

Root node - contains all data Splitter/ branching node - "asks" a question, often binary Leaf node - no further splits, makes predictions

Inductive Learning: Learner/Teacher

Paradigm where the learner is given information by the teacher who knows the correct answer

Multiplexed

Process where several messages can be carried through the same wire

Therapeutic stimulation

RNS delivered biphasic pulses from 0.5-12 mA, 40-1000 seconds, 1-333Hz Seizure detector, predictor, disruptor

Conditional Entropy

Randomness of y when given x: H(y|x) = -Σ_{x,y} P(x,y) log P(y|x). If x and y are independent, H(y|x) = H(y) and H(x,y) = H(x) + H(y).
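
A small numerical sketch of conditional entropy, assuming base-2 logarithms and a joint distribution stored as a dictionary. The function names and the example distributions are illustrative. It also checks the stated property that H(y|x) equals H(y) when x and y are independent.

```python
import math


def entropy(dist):
    """H = -sum p log2 p for a dict mapping outcome -> probability."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def conditional_entropy(joint):
    """H(y|x) = -sum_{x,y} P(x,y) log2 P(y|x); joint maps (x, y) -> P(x, y)."""
    px = {}
    for (x, _), p in joint.items():
        px[x] = px.get(x, 0.0) + p
    return -sum(p * math.log2(p / px[x]) for (x, _), p in joint.items() if p > 0)


# Independent case: P(x, y) = P(x) P(y).
p_x = {0: 0.5, 1: 0.5}
p_y = {0: 0.25, 1: 0.75}
independent = {(x, y): p_x[x] * p_y[y] for x in p_x for y in p_y}

# Fully dependent case: y is a copy of x, so knowing x removes all uncertainty.
copied = {(0, 0): 0.5, (1, 1): 0.5}
```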

ROC Curve

Receiver Operating Characteristic (ROC) curves plot TP vs FP rates

Multiclass SVM

SVM is an inherently binary classifier. Two strategies to use SVMs for multiclass classification: - One-vs-all - One-vs-one Problems: - Ambiguities (both strategies) - Imbalanced training sets (one-vs-all)

Estimating hypothesis accuracy

Sample Error vs. True Error Confidence Intervals

step 5 training kohonen

Step 5: In the hidden competitive layer, the distance between the weight vector and the current input vector is calculated (Eq. 6.1): D_j = \sum_{i=1}^{K} (x_i - w_{ij})^{2} (6.1), where K is the number of the hidden neurons and w_{ij} is the weight of the synapse that joins the ith neuron of the input layer with the jth neuron of the Kohonen layer.

Synapse

Switch or signal

To make our notation simple we consider only one output unit and use net and a for its net and activation respectively.

The H hidden units are labelled 1 ... H with net_h and a_h being their respective nets and activations.

If we calculate the rate of change of a function f with respect to a matrix W, what is the Jacobian of W?

The Jacobian of W is the matrix that contains the partial derivatives of f with respect to each of the variables contained in W.

Define variance

The difference from one model to the next.

What is information gain in ID3?

The expected reduction in entropy
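
A minimal sketch of that calculation, assuming examples are stored as dictionaries of attribute values with a parallel list of class labels. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy before the split minus the weighted entropy after it."""
    n = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


rows = [
    {"windy": True, "season": "summer"},
    {"windy": True, "season": "summer"},
    {"windy": False, "season": "summer"},
    {"windy": False, "season": "summer"},
]
labels = ["no", "no", "yes", "yes"]
```

Here `windy` separates the classes perfectly, so it removes all of the entropy, while `season` takes a single value and removes none.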

Which is the best linear function?

The one with the maximum margin

dependent variable

The outcome factor; the variable that may change in response to manipulations of the independent variable.

What is sparse coding?

Using as small a number of active neurons as possible to conserve energy.

Acquiring emissions

Wide range of options to model ****** probabilities: - Discrete tables - Gaussians - Mixture of Gaussians - Neural Networks/RVMs etc to mode

Learning can be perceived as what?

an optimisation process

sparse or dense

competitive net

noise

random, meaningless information that pollutes your data set

A Lattice/Trellis diagram visualizes

state transitions over time. Also a good tool to visualize the optimal path through states (Viterbi Algorithm)

Preference Bias

what kind of hypothesis we prefer from H

The most intuitive loss function is simply loss =

(Desired output - actual output). However, this loss function returns positive values when the network undershoots (prediction < desired output), and negative values when the network overshoots (prediction > desired output)

Major DM prep tasks

- Data cleaning : Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies - Data integration : Integration of multiple databases, data cubes, or files - Data transformation : Normalisation and aggregation - Data reduction : Obtains reduced representation in volume but produces the same or similar analytical results - Data discretization : Part of data reduction but with particular importance, especially for numerical data

The artificial brain we will describe is excellent at pattern recognition (even with noisy inputs), applications include:

- Face recognition - Climate forecasting - Economic forecasting - Checking your credit card purchases - SIRI

Evaluation procedure for single split or cross validation

- For large datasets, a single split is usually sufficient. - For smaller datasets, rely on cross validation

Simple competitive learning

- Hamming net and Maxnet help to determine whose weight vector is nearest to an input pattern on more complex networks - also known as Kohonen Learning

Hidden layer(s) can

- Have arbitrary number of nodes/units - Have arbitrary number of links from input nodes and to output nodes (or to next hidden layer) - There can be multiple hidden layers

Instead of a standard font, what if the numbers were written in various types of handwriting?

- Input layer - Hidden layer - Output layer

A Bernoulli trial

- It is a trial with a binary outcome, for which the probability that the outcome is 1 equals p (think of a coin toss of an old warped coin with the probability of throwing heads being p). - A Bernoulli experiment is a number of Bernoulli trials performed after each other. These trials are i.i.d. by definition.

Good Old-Fashioned AI

- John Haugeland named it in his 1985 book, "Artificial Intelligence: The Very Idea" - inspired by the computer model - as good as its program (knowledge, rules)

Joint Probability

- Joint probability is the probability of encountering a particular class while simultaneously observing the value of one or more other random variables - Can be generalized for the state of any combination of random variables

What are Learning Vector Quantizers

- LVQ uses winner-take-all network (clustering is used as preprocessing step) - class membership is known for each training pattern - learns the codebook vectors

Latent variables

- Latent variables are variables that are 'hidden' in the data. They are not directly observed, but must be inferred. - Clustering is one way of finding discrete latent variables in data.

Overfitting can occur when

- Learning is performed for too long (e.g. in Neural Networks) - The examples in the training set are not representative of all possible situations (is usually the case!) - Model parameters are adjusted to uninformative features in the training set that have no causal relation to the true underlying target function.

complex partial

memory, awareness or consciousness are impaired

Classical Computers

not very brain like. Serial processing.

Unsupervised Learning

"Any learning technique that has as its purpose to group or cluster items, objects, or individuals"

Strong Positive Weight

+1. Strongly increasing chance of firing- excitatory

Deep Learning

- Basically a Neural Network with Many hidden layers - Can be used for unsupervised learning and dimensionality reduction

Supervised Learning Methods

- Bayesian Classification - Perceptron Learning - Multilayer perceptrons - Fishers LDA

Connectionist Models Of Mind: Animal identification

- Conceptual features (feathers, wings, etc.) - Animal (the animal itself)

Simple data splits

- Fixed train, development and test sets - Bootstrapping - Cross-validation

LDA

- Linear Discriminant Analysis - Most commonly used as a dimensionality reduction technique in the pre-processing step for pattern-classification and machine learning applications. - The goal is to project a dataset onto a lower-dimensional space with good class-separability in order to avoid overfitting

Regularization

- Maximum likelihood generalization error (i.e. cross-validation) - Regularized error (penalize large weights) - Early stopping

Unsupervised Learning Methods

- PCA - ICA - K Means Clustering - Spectral Clustering

Little's Synchronous Mode

- all nodes are updated simultaneously, at every time instant using *eqn* - cyclic behaviour may result when two nodes simultaneously update to move towards a different attractor

Bayesian Classification Assumptions

- the densities are isotropic - priors are equal

Advantages of ANN

-Able to deal with highly nonlinear relationships -Not prone to restricting normality or independence assumption -Can handle a variety of problem types -Usually provides better results -Handles both numerical and categorical variables

Disadvantages of ANN

-Black box solutions that lack explainability -It is hard to find optimal values for large numbers of network parameters -Hard to handle large numbers of variables -Training may take a long time for large data sets, may require case sampling

Sensitivity Analysis cont..

-Conducted on a trained ANN -The inputs are perturbed while the relative change on the output is measured -Results illustrate the relative importance of input variables

Momentum

-Counterbalancing parameter aimed at slowing down the learning process

Supervised Learning

-For prediction -Back propagation

SOM Algorithm

1. Initialize each node's weights 2. Present a randomly selected input vector to the lattice 3. Determine most resembling (winning) node 4. Determine the neighboring nodes 5. Adjust the winning and neighboring nodes 6. Repeat steps 2-5 until a stopping criterion is reached

alternative HA models

1. performs iterative auto association in input layer - generates a store pattern and feeds it to 2nd layer of hetero-association network 2. generates output from 1st layer using non-iterative node rule - feeds to output layer where it performs iterative auto-association -generates a stored output 3. bidirectional associative memory (BAM)

For a network of 1000 units how many patterns can be retrieved before complications occur?

150

How many patterns can a 1000 unit recurrent neural network house error-free?

150

Here is a sketch algorithm for training the network: steps 3 and 4

3) Normalise the initial classes. 4) Until we have 'done enough' repeat the following: i) choose a training example x, (at random or systematically) ii) find the class wc closest to the chosen example iii) update wc by adding η(x - wc ) to it and then normalising the result.
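
Step 4 can be sketched as a single training step. This assumes that "closest" means smallest squared Euclidean distance, which the card does not specify. The function names and the example vectors are illustrative.

```python
import math


def normalise(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    return [c / length for c in v]


def closest_class(classes, x):
    """Index of the class vector nearest to x (squared Euclidean distance, an assumption)."""
    return min(
        range(len(classes)),
        key=lambda c: sum((xi - wi) ** 2 for xi, wi in zip(x, classes[c])),
    )


def train_step(classes, x, eta):
    """Steps ii) and iii): find the winner w_c, add eta * (x - w_c), then normalise."""
    c = closest_class(classes, x)
    updated = [wi + eta * (xi - wi) for xi, wi in zip(x, classes[c])]
    classes[c] = normalise(updated)
    return c


# Step 3: normalise the initial classes.
classes = [normalise([1.0, 0.0]), normalise([0.0, 1.0])]
chosen = train_step(classes, [0.6, 0.8], eta=0.5)
```

Only the winning class vector moves; the others are left untouched.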

How many kilometres of 'wires' are in the human brain?

3.2 million

What is the brains energy consumption spent on?

50-80% on the conduction of action potentials, the rest in manufacturing and maintenance.

ID3 algorithm

A <- Best attribute Assign A as decision attribute for node For each value of A create descendent of node Sort training examples to leaves If examples perfectly classified stop else iterate over leaves

Artificial Neural Network

A family of models inspired by biological neural networks. A system of interconnected neurons which exchange very basic messages between each other. These networks tend to be dynamic and adaptive (capable of learning).

What is the simplest form of an ANN?

A feedforward associator, with layers of interconnected input and output units.

Artificial Neural Network Definition

A mathematical/computational system made up of units that are interconnected with similar organization as in biological neural network

They attempt to simulate cognitive processes by programming a computer to function like...

A matrix of neuron-like connections

backpropagation drawbacks

A multilayer neural network requires many repeated presentations of the input patterns, for which the weights need to be adjusted before the network is able to settle down into an optimal solution

perceptron and linear separability

A perceptron is more specifically a linear classification algorithm, because it uses a line to determine an input's class. If we draw that line on a plot, we call that line a decision boundary

What is a ROC curve?

A plot of sensitivity vs (1 - specificity); points on the curve represent different cut-off points used for testing positive.

Sparse coding

A type of coding that uses as small a number of active neurons as possible and provides another important design principle for engineers building artificial neural networks

Autoassociative network

A type of network that stores patterns rather than merely pairs of items

lesson 1

ANNs use a similar hierarchy of layers, which was not programmed

VC Shattering

Ability to label in all possible ways

Inductive Learning: Epsilon

Accuracy to which target concept is approximated

backpropagation and learning

Although we can have as many inputs and outputs and as many hidden layers as we like, it is simpler (for the time being at least) to think of inputs, just one hidden layer and an output unit. What makes it a backpropagation network is the algorithm used for learning

Instance Based Learning Pros

Always has good performance on training data Fast Simple

Target Concept

Answer

Apriori algorithm

Apriori algorithm is a fast way of finding frequent itemsets

Backprop is for:

Arbitrary feed-forward topology Differentiable nonlinear activation functions Broad class of error function

backpropagation and delta rule and chain rule

Backpropagation is a generalization of the delta rule to multi-layered feedforward networks, made possible by using the chain rule to iteratively compute gradients for each layer

Nucleus

Central Processor

Learning algorithm of kohonen layer

Consider the problem of charting an n-dimensional space using a one-dimensional chain of Kohonen units. The units are all arranged in sequence and are numbered from 1 to m (Figure 15.4). Each unit receives the n-dimensional input x and computes the corresponding excitation

KNN with k=n (simple average)

Constant output

Cost function - Euclidean distance

Distance measure between a pair of samples p and q in an n-dimensional feature space

Feature Selection

Feature Selection returns a subset of original feature set. It does not extract new features. Benefits: • Features retain original meaning • After determining selected features, selection process is fast Disadvantages: • Cannot extract new features which have stronger correlation with target variable

Attribute subset selection

Feature selection Feature selection is a form of dimensionality reduction in ML, hence the DM term 'dimensionality reduction' for manifold projection is problematic. Approaches: • Exact solution infeasible • Greedy forward selection • Backward elimination • Forward-backward • Decision tree induction

Most Popular Neural Network Architecture

Feed forward multilayer perceptron with back propagation learning algorithm -Use for classification and regression

Concept

Function

Training algorithm

Given a model h with Solution Space S and a training set {X,Y}, a learning algorithm finds the solution that minimizes the cost function J(S)

Eigenvalue

Given an invertible matrix M, an eigenvalue equation can be found in terms of a set of orthogonal vectors v_i and scalars λ_i such that: M v_i = λ_i v_i

Expressiveness of MLP

Given enough hidden layer neurons nH, any continuous function from input to output can be expressed as a 3 layer network.

Perceptron training

Given examples, find weights that map inputs to outputs

Nearest Neighbor Big-O

Learning is very fast with time O(1) and space O(n) Querying is also fast with time lg(n) + k and space O(1)

Importance of cleaning data

If you have good data, the rest will follow

Question: Where is the concept represented when it is distributed like this? PART IV

Imagine what a patient's symptoms would be if they lost a portion of the network? For example, some of the higher inputs that code for more complex representations? Prosopagnosia? Or some of the inputs or hidden units that contain emotional information? Autism? What is the nature of the information coded in a hidden unit?

SVM: alpha_i

Importance of support vectors

EXTREME Dimensionality Case

In an extreme, degenerate case, if D > n, each example can be uniquely described by a set of feature values.

Remember, of course, to update weight 0, the bias - it is easy to forget

In fact, this is the most updated weight as it is always updated when an error occurs.

Cost function - ℓ2 norm

In order to avoid over-fitting, one common approach is to add a penalty term to the cost function. Common choices are the ℓ2-norm, given as: Where C0 is the unregularized cost
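
A hedged sketch of such a penalty. It assumes the widely used form C = C0 + (λ/2) Σ w², where the regularization strength λ (`lam` below) and the 1/2 scaling are assumptions rather than something stated on the card.

```python
def l2_regularized_cost(c0, weights, lam):
    """Unregularized cost c0 plus an l2 penalty on the weights.

    lam is a hypothetical regularization strength; the 0.5 factor is one
    common convention, not necessarily the one the original formula used.
    """
    return c0 + 0.5 * lam * sum(w * w for w in weights)


cost = l2_regularized_cost(1.0, [1.0, -2.0], lam=0.1)
```

Large weights raise the cost, so minimizing it pushes the weights toward smaller values.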

Binomial distribution

In probability theory and statistics, the binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments, each of which yields success with probability p.

SAMPLE ERROR

In statistics, sampling error is incurred when the statistical characteristics of a population are estimated from a subset, or sample, of that population.

value of bias

In this case the value of net is the value of bias plus w times the sum of the other inputs. We can write this as: net = bias + w \sum_{i=1}^{N} x_i

In the case of the rectified linear unit (ReLU) function, what if the backpropagation step causes the unit to map all possible inputs to a negative drive?

In this case, the ReLU gate never fires: the output and derivative are 0. The error cannot flow through the ReLU. Its weights won't be updated anymore. The ReLU is "dead". This can be avoided by using a small learning rate and slightly positively initialized biases.

How to increase P(h|D) using h?

Increase P(h) and P(D|h)

Multivariate Trees

Instead of monothetic decisions at each node, we can learn polythetic decisions. This can be done using many linear classifiers, but keep it simple!

SVM: Mercer Condition

Kernels act like distance or similarity functions

Kernel methods

Map a non-linearly separable input space to another space which hopefully is linearly separable • This space is usually higher-dimensional, possibly infinitely • The key element is that they do not actually map features to this space, instead they return the distance between elements in this space • This implicit mapping is called the Kernel Trick

Polynomial Regression

Match data to f(x) = c_0 + c_1 * x + c_2 * x^2 ... using mean squared error

MAP

Maximum a posteriori

Constrained Learner

Must ask many questions of teacher to find correct function (about 2^k queries needed)

Degrees of freedom of variability

Number of ways data can change/ number of separate transformations possible

How does our brain do pattern detection so well when the inputs can be so different?

Object constancy

MLFFNS

Of course, during the training process, all weights may affect each other. When we wish to distinguish between MLFFN whose units are threshold units from those whose units have a sigmoid activation function we use MLFFNT for the former and MLFFNS for the latter

The Backwards Pass 1

Our goal with backpropagation is to update each of the weights in the network so that they cause the actual output to be closer the target output, thereby minimizing the error for each output neuron and the network as a whole

Small set of SVs means that

Our solution is now sparse

Step Activation Function

Switches from 0 to 1 at time 0 with infinite slope

What does a backpropagation network look for?

The backpropagation algorithm looks for the minimum of the error function in weight space using the method of gradient descent

Question: Where is the concept represented when it is distributed like this?

The concept is distributed throughout the network. It is not in any one node but resides within the overall architecture - that is, in which nodes are active, to what extent they are active and the weights between nodes. This is what is referred to as a distributed representation. Note that the same nodes can contribute to multiple representations!

The examples and initial units in part c) above were chosen both to make your calculations easy and also to allow you more easily to draw a diagram showing the training. How would initial classes be chosen in a real application

The final part (e) was based on bookwork, in that it asked how initial classes are chosen in applications. Valid points to be made include the fact that there is no guaranteed method, that random choices are often made and that even the number of classes is something for experimentation. Most candidates were able to state this.

target variable

The predefined attribute whose value is being predicted in a data analytical model

Joint Distributions

The probability of an output given all previous inputs, increases by factor of two for every added attribute for binary outputs

Define classification

The process of categorising something usually with a discrete value output (0 or 1)

The stages of the SOM algorithm can be summarised as follows:

1. Initialization - Choose random values for the initial weight vectors w_j. 2. Sampling - Draw a sample training input vector x from the input space. 3. Matching - Find the winning neuron I(x) with weight vector closest to input vector. 4. Updating - Apply the weight update equation \Delta w_{ji} = \eta(t) \, T_{j,I(x)}(t) \, (x_i - w_{ji}). 5. Continuation - keep returning to step 2 until the feature map stops changing.
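
The matching and updating stages can be sketched for a one-dimensional chain of units. Two things here are assumptions not given on the card: the neighbourhood function T is taken to be a Gaussian of the lattice distance to the winner, with width `sigma`, and the learning rate `eta` is held fixed rather than decayed over time. All function and parameter names are illustrative.

```python
import math


def winner(weights, x):
    """Matching: index of the unit whose weight vector is closest to x."""
    return min(
        range(len(weights)),
        key=lambda j: sum((xi - wi) ** 2 for xi, wi in zip(x, weights[j])),
    )


def som_step(weights, x, eta, sigma):
    """Updating: move every unit toward x, scaled by its neighbourhood value."""
    win = winner(weights, x)
    for j, w in enumerate(weights):
        # Gaussian neighbourhood over lattice positions (an assumed choice of T).
        t = math.exp(-((j - win) ** 2) / (2 * sigma ** 2))
        for i in range(len(w)):
            w[i] += eta * t * (x[i] - w[i])
    return win


weights = [[0.0], [1.0], [2.0]]
win = som_step(weights, [1.0], eta=0.5, sigma=1.0)
```

Units near the winner on the lattice are pulled toward the input more strongly than distant ones, which is what produces the topological ordering.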

What is Artificial Intelligence

The study of computer systems that attempt to model and apply the intelligence of the human mind

when the system fires

The system fires only when Σ_{i=1}^{n} w_i P_i ≥ θ, where θ is the threshold of the computing unit at the output.

How to make it smarter?

Things that can and cannot be improved Ability to connect neurons = smarter, increase connections (units) ANN only learn what you program them to learn (trials is not training, just practicing, can't teach themselves, can't LEARN) THEY CAN'T LEARN We can study for quiz that can help us with a test, but ANN can't generalize to other situations Can't understand social cues

Name the condition for the strengthening of synaptic connection between two cells

This occurs when both cells are simultaneously activated.

Perceptron learning

Training a single threshold unit turns out to be almost trivial

What is supervised learning?

Training by providing an input/output set and using the actual output of the network to calculate an error vector and correcting input weights accordingly.

What is an epoch?

Training set fed into neural network

Where we have used bold for the term that is about to be replaced.

Training the network is slightly more complicated and we won't derive the expressions used;

What are silicon neurons built of?

Transistors.

Backward Search

Try combinations of n-1 features, throw out excluded feature of best performance. Continue until k features remain.

Linearly Married

Two classes are not linearly separable

Convolutional Neural Networks

Type of feed-forward artificial neural network in which the connectivity pattern between its neurons is inspired by the organization of the animal visual cortex, whose individual neurons are arranged in such a way that they respond to overlapping regions tiling the visual field.

Decision Tree Expression of Continuous Attributes

Use inequalities, also able to use same attribute again

Calculating the Total Error

We can now calculate the error for each output neuron using the squared error function and sum them to get the total error: E_{total} = \sum \frac{1}{2}(target - output)^{2}
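
The sum above translates directly into code. This is a sketch with made-up target and output values; the function name is illustrative.

```python
def total_error(targets, outputs):
    """E_total = sum over output neurons of 1/2 * (target - output)^2."""
    return sum(0.5 * (t - o) ** 2 for t, o in zip(targets, outputs))


# Two output neurons, each off by 0.5 from its target.
example_error = total_error([1.0, 0.0], [0.5, 0.5])
```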

Once the connectionist models of mind has learned to correctly identify these inputs...

We can present a new input and see how it categorizes it.

The hidden units:

We can think of these as intermediary stages in information processing (feature extraction)

What is an example of how an ANN learns?

An input pattern is presented and the output the network produces is compared with the desired pattern. When there is an error, the weights are adjusted to reduce it until the error is minimal, but this happens slowly.

Are learning rate and mini-batch size dependent?

Yes they are. Smaller mini-batches: smaller learning rate. As a result of this, cross entropy decreases (almost) linearly and accuracy reaches an early plateau. Larger mini-batches: larger learning rate. As a result of this, cross entropy explodes (possibly falls and zigzags) and accuracy is zigzagging heavily.

What is the hidden layer?

You don't know what is happening

The inputs units:

You specify what the inputs are For example, you might let each input unit represent a specific feature that is either present or not in the stimulus Thus one stimulus might be a certain set of features

What is an epoch?

a single pass through the training data set

Analogue

____ circuits code in continuous changes in voltages, as do neurons in their sub-threshold state

What is the ultimate objective of training?

To obtain a set of weights that classifies almost all the records in the training data correctly. Steps: -initialize weights with random values -for each unit -compute the net input to the unit as a linear combination of all the inputs to the unit -compute the output value using the activation function -compute the error -update the weights and the bias

response variable

a variable that measures an outcome or result of a study

A backpropagation neural network is a feed forward, two or more layered network of units which have sigmoidal

activation preceded by a 'summing block' that calculates the net of the unit

three layers for CPN;

an input layer that reads input patterns from the training set and forwards them to the network, a hidden layer that works in a competitive fashion and associates each input pattern with one of the hidden units and the output layer which is trained via a teaching algorithm that tries to minimize the mean square error (MSE) between the actual network output and the desired output associated with the current input vector

Kohonen network

an unsupervised network with two layers, these nets are also called feature maps

Correcting the weight to the output unit is necessary, and this is done by adding a correction Δwh to wh for each hidden unit

and adding a correction Δwih to wih to each weight from the inputs to the hidden units.

binary classifier

binary classifier is a function which can decide whether or not an input, represented by a vector of numbers, belongs to some specific class.

What does predictive data mining include?

classification regression

What is classification?

classify an item as one of existing # of categories

K nearest neighbor prediction algorithm

classify the new example by finding the training example that's most similar; return corresponding label

Polynomial Regression Drawbacks

as k-> number of data points, least squared error decreases but risk of overfit increases

Describe the architecture of a Kohonen Grossberg network, giving a clearly labelled diagram showing its components and an explanation of the action of each layer

asked for a description of the architecture of such networks and required a diagram for full marks. Good answers included a clearly labelled diagram with a note on each component. The question also required an explanation of the action of the two independent layers. On the whole, candidates seemed to understand the rationale for each layer but were less well able to explain this for the Grossberg layer. The concept of 'winner takes all' is an important feature of the Kohonen layer and answers that omitted to mention this lost marks.

backpropagation definition

backpropagation neural network is a feed forward, two or more layered network of units which have sigmoidal activation preceded by a 'summing block' that calculates the net of the unit.

Hebbian Learning

basic idea: When an input and output tend to coincide, the strength of connection between input and output increases

Naïve Bayes Pros

cheap inference few parameters estimate parameters with labelled data connects inference and classification Empirically successful

Neurotransmitters

chemical signals that are released by one neuron and affect the properties of other neurons

Connectionism (Mind as ANN)

cognitive activity is modeled on a computer in terms of the connections of many simple neuron-like structures, known as artificial neural networks, capable of learning, graceful degradation, and parallel, distributed processing

What is feature creation?

combine attributes to use e.g. includes polynomial functions of x in regression

higher-level layers

complex features or even whole objects tend to emerge

Depth and strip leads

composed of platinum and iridium

Thus working out the gradient of the activation requires little extra

computation beyond working out the activation itself.
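
For the sigmoid this works because the derivative can be written in terms of the activation itself: if a = f(net), then f'(net) = a(1 - a). A short sketch, with illustrative function names:

```python
import math


def sigmoid(net):
    """Sigmoidal activation f(net) = 1 / (1 + e^(-net))."""
    return 1.0 / (1.0 + math.exp(-net))


def sigmoid_gradient_from_activation(a):
    """Gradient of the sigmoid, reusing the already computed activation a."""
    return a * (1.0 - a)


a = sigmoid(0.3)
gradient = sigmoid_gradient_from_activation(a)

# Central finite difference as an independent check of the gradient.
h = 1e-6
numeric = (sigmoid(0.3 + h) - sigmoid(0.3 - h)) / (2 * h)
```

One multiplication and one subtraction are all that is needed once the activation is known.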

K-NN

compute the nearest neighbors and assign the class by majority vote. increasing the value of k makes the algorithm more resilient to noise in the data, although it can also cause some unwanted side effects
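
A minimal sketch of that procedure, assuming squared Euclidean distance and a plain majority vote. The function name and the toy data are hypothetical.

```python
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Label the query by majority vote among its k nearest training points."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], query)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


train_points = [(0, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
train_labels = ["a", "a", "b", "b", "b"]

near_a_small_k = knn_predict(train_points, train_labels, (0.2, 0.1), k=3)
near_a_large_k = knn_predict(train_points, train_labels, (0.2, 0.1), k=5)
```

With k = 3 the query takes the label of its two close neighbours. With k = 5 every training point votes, so the larger class wins regardless of where the query lies, which is one example of the side effects of a large k.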

character of sigmoid function

continuous

For us they are just input values. It is worth noting here that we use the term

epoch to mean one cycle of learning through the whole training set.

Training error

fraction of training examples misclassified by h

Ketogenic long term side effects

high cholesterol, kidney stones, vitamin and minerals

Resection surgery Side effects

infection, hemorrhage, possible loss of recent memory function, transient speech deficits

Filter Evaluation Methods

information gain variance entropy gini index non-redundant independent

Computer scientists:

information processing and learning, image classification, object detection and recognition

Levetiracetam

inhibits glutamate

supervised learning

mode of ANN learning where the training data contains complete information about the characteristics of the data and desired outcomes

Expectation Maximization Properties

monotonically non-decreasing likelihood does not converge (practically does) Will not diverge Can get stuck Works with any distribution if EM solvable

nonlinear

multilayer perceptron (back propagation), Hopfield net, competitive net

multi-layer

multilayer perceptron (back propagation), competitive net

Parallel Processing

multiple computing units perform the calculations simultaneously (ex- memory system and language system)

What is aggregation?

multiple identical sensors, days to months e.g. sum, averaging

deep artificial neural networks

taking inspiration from the human brain, neural networks are systems that can train themselves to make sense of the human world

Two-layer feed-forward neural networks made of units with sigmoidal activation are universal in the sense

that any computable function can be represented by one of these networks

modular brain hypothesis

specialized areas in the brain that selectively respond to certain kinds of objects

Transcranial Magnetic stimulation

stimulating coil held over a subject's head and as a brief pulse of current is passed through it, a magnetic field is generated that passes through the subject's scalp and skull GABA inhibition

K nearest neighbor learning algorithm

store all training examples

What version of data frame do we use in this class?

tibble

Neural Network Applications

used to match complicated, vague, or incomplete data patterns; used for pattern recognition, interpretation, prediction diagnosis, planning, monitoring, debugging, repair, instruction and control

testing data

used to test the network after training

Training data

used to train the network

Conditional Independence

x is conditionally independent of y given z if the probability distribution governing x is independent of the value of y given the value of z. Algorithmically: P(X=x|Y=y,Z=z) = P(X=x|Z=z)

multilayer perceptron (back propagation) performance rule

yj=f(Σiwijxi) where f(x)=1/(1+e^(-x))

perceptron performance rule

yj=f(Σiwijxi) where f(x)={1 if x≥θ, 0 else}

competitive net performance rule

yj={1 if Σiwijxi "wins", 0 else}
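The three performance rules above can be compared side by side in a short sketch. The function names, the example weights, and the reading of "wins" as "has the largest net input" are illustrative assumptions, not taken from the cards.

```python
import math

def net_input(weights, inputs):
    """Weighted sum Σ_i w_ij * x_i for a single unit j."""
    return sum(w * x for w, x in zip(weights, inputs))

def mlp_unit(weights, inputs):
    """Multilayer perceptron unit: logistic sigmoid of the net input."""
    return 1.0 / (1.0 + math.exp(-net_input(weights, inputs)))

def perceptron_unit(weights, inputs, theta):
    """Perceptron unit: outputs 1 when the net input reaches threshold theta."""
    return 1 if net_input(weights, inputs) >= theta else 0

def competitive_layer(weight_rows, inputs):
    """Competitive layer: only the winning unit outputs 1.
    Assumption: the winner is the unit with the largest net input
    (ties go to the first such unit)."""
    nets = [net_input(row, inputs) for row in weight_rows]
    winner = nets.index(max(nets))
    return [1 if j == winner else 0 for j in range(len(nets))]

x = [1.0, 0.0]
sig_out = mlp_unit([0.0, 0.0], x)                          # net input is 0
step_out = perceptron_unit([0.6, 0.4], x, theta=0.5)       # net input is 0.6
comp_out = competitive_layer([[0.2, 0.9], [0.7, 0.1]], x)  # net inputs 0.2 and 0.7
```

With a net input of 0 the sigmoid unit sits at its midpoint, while the threshold and competitive units produce hard 0/1 outputs.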

Note that you will see lots of different looking formulae in the literature

you need to get used to this as no standard is yet agreed.

C4.5

- Successor of ID3. - Multiway splits are used. - Statistical significant split pruning.

Bayes' Error

- The Bayes Error rate is the theoretical lowest possible error rate for a given classifier and a given problem (dataset). - For real data, it is not possible to calculate the Bayes Error rate, although upper bounds can be given when certain assumptions on the data are made. - The Bayes Error functions mostly as a theoretical device in Machine Learning and Pattern Recognition research.

Normal distribution

- The Normal distribution has many useful properties. It is fully described by its mean and variance and is easy to use in calculations. - The good thing: given enough experiments, a Binomial distribution converges to a Normal distribution.

Orthogonality

- Two vectors a and b are orthogonal if they're perpendicular - That is, if their inner product is 0: a · b = 0

Batch Gradient descent

- Vanilla gradient descent, aka batch gradient descent - Makes the small change in weights that most rapidly improves task performance - Computes the gradient of the cost function w.r.t. the parameters θ for the entire training dataset - Can be very slow - Intractable for datasets that don't fit in memory - Doesn't allow us to update our model online, i.e. with new examples on-the-fly - Guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces.
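As a minimal sketch of batch gradient descent, the example below fits a straight line by mean squared error, a convex error surface. The model, the data, the learning rate and the fixed number of epochs are all assumptions chosen for illustration.

```python
def batch_gradient_descent(xs, ys, lr=0.1, epochs=500):
    """Fit y ≈ w*x + b by minimising mean squared error.
    Every update uses the gradient over the ENTIRE training set."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the cost w.r.t. the parameters, summed over all examples.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Small step in the direction that most rapidly reduces the cost.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]          # generated from y = 2x + 1
w, b = batch_gradient_descent(xs, ys)
```

Because each step needs a full pass over the data, the cost per update grows with the dataset size, which is the slowness the card refers to.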

Unsupervised Learning

- We only have xi values, but no explicit target labels. - You want to do 'something' with them.

Hypothesis Quality

- We want to know how well a machine learner, which learned the hypothesis h as an approximation of the target function f, performs in terms of correctly classifying novel, unseen examples - We want to assess the confidence that we can have in this classification measure

Layers

- input, output ==> input can come from programmer or other nodes from system ==> optional: greater than or equal to 1 hidden layer

Codebook Vector (CV)

- represents a Voronoi region - also called Voronoi centres - the set of CVs is a compressed form of information represented by all input data

retrieval in associative networks

- retrieval (or "recall") refers to the generation of an output pattern when an input is presented to network

resonance in ART

- signals travel back and forth between the output and input layers until a match is discovered. if no match is found a new cluster is formed around the new input vector

Discrete Hopfield networks

- similar to the recurrent network - asynchronicity: at each time interval only one node's output changes other modes: non-synchronous and synchronous

BAM stability

- stability is not assured for synchronous computations - however, stability is assured with inter-layer asynchrony (nodes updated at discrete time) even if all the nodes in one layer change state simultaneously - this allows a greater amount of parallelism than Hopfield networks (in asynchronous HN only one node is updated at a time) - rate of stabilization depends on proximity of new input to stored pattern

Tests for comparing distributions

- t-test compares two distributions - ANOVA compares multiple distributions - If the null hypothesis is rejected, at least two distributions have significantly different means. It does NOT tell you which ones!

Hamming networks: how to find which stored pattern is nearest to a new input pattern

- take maximum of the outputs - can be done using a Maxnet

What is association?

- the task of mapping input patterns to target patterns ("attractors") - for instance, an associative memory may have to complete (or correct) an incomplete (or corrupted) pattern - unlike computer memories, no "address" is known for each pattern

vector quantization

- this is a task that applies unsupervised learning to divide an input space into several connected regions called Voronoi Regions, representing a quantization of the space - every point in the input space belongs to a region and is mapped to the corresponding CV

objective of Kohonen layer

- to attach each output node to a stored pattern as represented by its weight vector - winner node at Kohonen layer is closest to the input vector - weight update rule: for each iteration, adjust the weight such that the weight vector of each node is as near as possible to the input sample vector for which that node is winner

Hebb's observation (1949)

- when one cell repeatedly assists in firing another, the axon of the first cell develops synaptic knobs (or enlarges them if they already exist) in contact with the soma of the second cell - weight change, which monotonically increases if x belongs to {0,1} (use a decay factor that gradually reduces the weight) - if used in systems that use bipolar [+1,-1] signals, weights can increase and decrease

signum function

-1 if net<0 0 if net=0 +1 if net>0

Strong Negative Weight

-1. Strongly inhibiting firing

Know how ANNs compare to human brains.

-Composed of many units (similar to neurons) -Units are interconnected (similar to synapses) -Units occupy separate connected layers (similar to multiple brain regions in sensory pathways) -Units in deeper layers respond to more and more abstract information (similar to more complex receptive fields in "higher" cortical areas) -Require learning to perform tasks efficiently (similar to neural plasticity) -Through experience, ANN learn to recognize patterns

Applications of SOM

-Customer segmentation -Bibliographic classification -Image browsing system -Medical Diagnosis -Seismic Activity -Speech Recognition -Data Compression -Environmental Modeling

Testing a Trained ANN Model

-Data split into two parts 1. Training (80%) 2. Testing (20%) -K Fold cross validation 1.Less bias 2. Time consuming

Neural Network Selection

-Driven by the task it is intended to address (classification, regression, etc)

What can ANNS do?

-Facebook face recognition at 98% accuracy -Self driving cars: detect important objects on the road, tell moving cars apart from cyclists and pedestrians, predict what objects will do, choose a path -Navigation: choosing the best route given current traffic conditions, finding landmark locations -Why were they struggling before? Each person's sounds are unique, humans speak in a continuous flowing manner, "ice cream" v. "I scream", "I" and "eye", and other language ambiguities make it hard for voice processing -Voice recognition (Siri, Skype, Android) -Language: they can describe pictures (but they pay more attention to details)

Hopfield Networks

-Introduced by John Hopfield -Highly interconnected neurons -Applies to solving computational problems

Controlling the Speed of Learning

-Learning Rate -Momentum

What machines are still NOT good at:

-Learning from small number of examples and less practice -Solving multiple tasks simultaneously -Holding conversations -Active learning -Scene understanding -Language acquisition -Common sense -Feelings -Consciousness -Theory of Mind (understanding thought and intentions of others) -Learning-to-learn -Creativity

Transfer Function

-Linear function -Sigmoid (logical activation) function -Tangent hyperbolic function

State the disadvantages of KNN

-Need to predetermine k -Ambiguous choice of distance metric -Computation cost is high

Elements of ANN

-Neurons -Processing Elements: Inputs, outputs, Connection Weights, Summation Function Network: Hidden layers, Parallel Processing

What are the different steps of stochastic learning? What are its advantages/disadvantages?

1) Initialize the weights. 2) Calculate the loss for a single sample. 3) Calculate the gradients. 4) Update the weights. 5) Continue with 2) until it reaches the global minimum (eventually). Advantages: - Noisy, can hence escape from local minima. - Less computation per learning step and hence much faster. - Often results in better solutions. Disadvantages: - Wobbles around very strongly and hence might not converge. - Small computations do not use full capacity. - Conditions of convergence are hidden.

What are the different steps of mini-batch learning? What are its advantages/disadvantages?

1) Initialize the weights. 2) Calculate the loss for a subset of the training samples. 3) Calculate the gradients. 4) Update the weights. 5) Continue with 2) until it reaches the global minimum (eventually). Advantages: - Wobbles around slightly and can hence escape from local minima. - Can fully exhaust computational capacity. - Results in good solutions. Disadvantage: Slightly wobbles around the solution.
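The mini-batch steps above can be sketched as follows. The linear model, the squared-error loss, the data, the seed, and the fixed number of epochs (used in place of "until it reaches the minimum") are assumptions for illustration. Setting the hypothetical `batch_size` parameter to 1 gives the stochastic (single-sample) variant.

```python
import random

def minibatch_sgd(xs, ys, batch_size=2, lr=0.05, epochs=2000, seed=0):
    """Fit y ≈ w*x + b with mini-batch gradient descent."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0                          # 1) initialise the weights
    data = list(zip(xs, ys))
    for _ in range(epochs):
        rng.shuffle(data)                    # visit samples in a new order each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # 2) squared-error loss on the subset, 3) its gradients
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * grad_w                 # 4) update the weights
            b -= lr * grad_b
    return w, b                              # 5) stop after a fixed number of epochs

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]                    # generated from y = 2x + 1
w, b = minibatch_sgd(xs, ys)
```

Each update only touches `batch_size` samples, so the gradient is a noisy estimate of the full-batch gradient; that noise is the "wobble" the card mentions.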

How do we find a good learning rate?

1) Start with large learning rate. 2) Divide learning rate by 2 (or 5). 3) Retrain the ANN and validate the performance. 4) Continue with 2) if performance increased.

Backpropagation

A common method of training a neural net in which the initial system output is compared to the desired output, and the system is adjusted until the difference between the two is minimized. The algorithm propagates errors back through the hidden layers toward the input.

Evaluating RuleSets

A complete rule set should be good at classifying all the training examples Complexity: Favour rule sets with the minimal number of rules.

Recurrent Neural net

A complex form of artificial neural network where there is one layer and all the units are interconnected and serve as both input and output. This form is able to store patterns.

Connectome

A comprehensive map of neural diagrams in a brain (like the wiring diagram)

Well-posed Learning Problem

A computer program is said to learn from (E)xperience E with respect to some (T)ask T and some (P)erformance measure P, if its performance on T, as measured by P, improves with experience E.

Overfitting

A hypothesis is said to be overfit if its prediction performance on the training data is overoptimistic compared to that on unseen data. It presents itself in complicated decision boundaries that depend strongly on individual training examples.

connectionist network

A network of units (like neurons) that are connected to one another and transfer information between each other (like axons). Made up of input units, hidden units, and output units as well as connection weights.

A neural network with real inputs

A neural network with real inputs computes a function f defined from an input space A to an output space B.

What is a decision tree?

A popular supervised learning algorithm used in classifications problems. (flowchart)

prediction model (def)

A prediction model is the result of applying a supervised learning algorithm to a labeled data set which includes observations that are characterized by a set of features (aka attributes, independent variables) and a target(aka dependent, response) variable.

Artificial Neural Network Learning

A process by which a neural network learns the underlying relationship between input and outputs, or just among the inputs.

Dropout

A very different approach to avoiding over-fitting is to use an approach called dropout. Here, the output of a randomly chosen subset of the neurons are temporarily set to zero during the training of a given mini-batch. This makes it so that the neurons cannot overly adapt to the output from prior layers as these are not always present. It has enjoyed wide-spread adoption and massive empirical evidence as to its usefulness.
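A minimal sketch of the dropout idea described above: each activation is independently set to zero with a given probability. The function name, the `rate` parameter and the example values are assumptions; practical implementations usually also rescale the surviving activations, which this sketch deliberately omits.

```python
import random

def dropout(activations, rate, rng):
    """Zero each activation independently with probability `rate`.
    Intended for use during training only; at test time all units stay active."""
    return [0.0 if rng.random() < rate else a for a in activations]

rng = random.Random(0)
layer_output = [0.2, 0.9, 0.5, 0.7, 0.1, 0.4]
dropped = dropout(layer_output, rate=0.5, rng=rng)   # a fresh random mask per mini-batch
```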

What happened during 1950's and 60's? (3)

1. Perceptron 2. Learn to reach expected outputs 3. Learning linked to thinking

How to train a neural network? (4)

1. Process of trial and error to find the weights for a function 2. Find weights by starting randomly 3. Try different inputs separately 4. Gradually adjust until target output achieved

What is deep learning? (2)

1. Provide lots of data 2. Concepts discovered by itself

perceptron learning algorithm

1. Select a random sample from the training set as input 2. If the classification is correct, do nothing 3. If the classification is incorrect, modify the weight vector w using w_i = w_i + η d(n) x_i(n). Repeat this procedure until the entire training set is classified correctly. Desired output: d(n) = { +1 if x(n) ∈ set A, −1 if x(n) ∈ set B }
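The perceptron learning rule, w_i ← w_i + η·d·x_i applied on misclassified samples, can be sketched as below. Several details are assumptions made for the example: samples are swept in a fixed order rather than drawn at random, a bias weight is stored at index zero with a constant input of 1, a net input of exactly 0 is classified as +1, and a `max_epochs` guard is added in case the data is not linearly separable.

```python
def train_perceptron(samples, eta=1.0, max_epochs=100):
    """samples: list of (x, d) pairs, x a list of inputs, d in {+1, -1}."""
    n_inputs = len(samples[0][0])
    w = [0.0] * (n_inputs + 1)               # w[0] is the bias weight
    for _ in range(max_epochs):
        errors = 0
        for x, d in samples:
            xb = [1.0] + list(x)             # prepend the constant bias input
            net = sum(wi * xi for wi, xi in zip(w, xb))
            y = 1 if net >= 0 else -1
            if y != d:                       # update only on a misclassification
                w = [wi + eta * d * xi for wi, xi in zip(w, xb)]
                errors += 1
        if errors == 0:                      # whole training set classified correctly
            break
    return w

def predict(w, x):
    net = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if net >= 0 else -1

# Logical AND as a linearly separable example: set A = {(1, 1)}, set B = the rest.
and_samples = [([0, 0], -1), ([0, 1], -1), ([1, 0], -1), ([1, 1], 1)]
w_and = train_perceptron(and_samples)
```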

How does a neural network actually work? (3)

1. Signals move between neurones 2. If sum of inputs >= threshold neurone fires 3. Long term pattern used to learn

Discrete Supervised

1. Simple Hopfield 2. Outerproduct AM 3. Hamming Net

Summation Function

1. Single Neuron 2. Multiple Neurons

Backward Elimination

1. Start with complete SF set (contains all original features) 2. Find feature that, when removed, reduces the filter score least 3. Remove feature from SF set 4. Repeat steps 2-3 until convergence

Forward Selection

1. Start with empty SF set and candidate set being all original features 2. Find feature with highest filter score 3. Remove feature from candidate set 4. Add feature to SF set 5. Repeat steps 2-4 until convergence

Architectures

1. Supervised: Recurrent, Feedforward 2. Unsupervised: Estimator, Extractor

Random forests

1. Very good performance (speed, accuracy) when abundant data is available. 2. Use bootstrapping/bagging to initialize each tree with different data. 3. Use only a subset of variables at each node. 4. Use a random optimization criterion at each node. 5. Project features on a random different manifold at each node.

At what point can you stop training the network? (3)

1. When the value of mean squared error is small enough 2. Ideal value is 0 but as long as it is close it is ok 3. Could take too long to actually get to 0

Example of a function that is not linear separable? (3)

1. XOR 2. Can't be represented on a graph 3. Can't be a single layer

Self organized map phases

1. volatile phase - prototypes search for niches to move into 2. sober phase - consists of nodes settling into cluster centroids in the vicinity of positions found in the earlier phase - sober phase converges but the emergence of an ordered map depends on the result of the volatile phase

How many nerve cells are in the human brain?

100 billion

What did McCulloch & Pitts do and in what year? (5)

1. 1943 2. First neural network 3. Simple inputs combined 4. Weights were fixed 5. Uses a threshold

History of Neural Networks? (5)

1. 1943 - McCulloch & Pitts 2. 1949 - Hebb 3. 1950's&60's - perceptron 4. 1969 - ANN winter 5. Mid 1980's - Parker and LeCun

When was the ANN winter and why? (2)

1. 1969 2. Can't learn non-linear separable functions with perceptron

What logic functions can be represented using McCulloch and Pitts models? (4)

1. AND 2. OR 3. AND NOT (not second input) 4. XOR

Continuous Unsupervised

1. ART-3 2. SOFM (SOM) 3. Other clustering

Clustering (NN)

1. Adaptive Resonance Theory 2. SOM

Error Backpropagation

1. Apply input vector to network and propagate forward 2. Evaluate d(k) for all output units 3. Backpropagate d's to obtain d(j) for all hidden units 4. Evaluate error derivatives as:

Training algorithm stage 2:kohonen

1. Apply normalised input vector x and its corresponding output vector y, to inputs A and B respectively. 2. Determine winning node in the Kohonen layer. 3. Update weights on the connections from the winning node to the output unit - wi(t+1) = wi(t) + (yi - wi) 4. Repeat steps 1 through 3 until all vectors of all classes map to satisfactory outputs.

Training algorithm stage 1:kohonen

1. Apply normalised input vector x to input A. 2. Determine winning node in the Kohonen layer. 3. Update winning node's weight vector - w(t+1) = w(t) + (x - w) 4. Repeat steps 1 through 3 until all vectors have been processed. 5. Repeat steps 1 to 4 until all input vectors have been learned.

What are the two weightings that affect outcome? (2)

1. Excitatory - positive 2. Inhibitory - negative

What are the uses of machine learning? (5)

1. Facial recognition 2. Predicting stock prices 3. Handwriting recognition 4. Speech recognition 5. Game playing

Neural Network Architectures

1. Feed Forward 2. Recurrent 3. Associative Memory 4. Probabilistic 5. Self organizing feature maps 6. Hopfield networks

Two main network structures

1. Feed-Forward Network 2. Recurrent Network

Regression (NN)

1. Feedforward 2. Radial Basis

Classification (NN)

1. Feedforward 2. Radial Basis 3. Probabilistic

What is the purpose of the threshold? (3)

1. Fixed threshold 2. Takes the weighted sum of the inputs 3. Neurone will fire if Y >= threshold

What is a linearly separable function? (3)

1. Function that can be separated 2. Can be represented in a single layer 3. Can be visualised on a graph

How to represent AND? (3)

1. Give X1 a weight of 1 2. Give X2 a weight of 1 3. Set threshold to 2

How to represent AND NOT? (3)

1. Give X1 a weight of 2 2. Give X2 a weight of -1 3. Set threshold to 2

How to represent OR? (3)

1. Give X1 a weight of 2 2. Give X2 a weight of 2 3. Set threshold to 2
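The weight and threshold settings on the AND, AND NOT and OR cards can be checked with a small threshold-unit sketch. The function and variable names are illustrative assumptions; the weights and thresholds are the ones given on the cards.

```python
def mp_neuron(inputs, weights, threshold):
    """McCulloch-Pitts style unit: fires (1) if the weighted sum >= threshold."""
    y = sum(w * x for w, x in zip(weights, inputs))
    return 1 if y >= threshold else 0

# (weights for X1 and X2, threshold) as given on the cards.
GATES = {
    "AND":     ([1, 1], 2),
    "AND NOT": ([2, -1], 2),
    "OR":      ([2, 2], 2),
}

# Outputs for the inputs (0,0), (0,1), (1,0), (1,1), in that order.
truth_tables = {
    name: [mp_neuron([x1, x2], w, t) for x1 in (0, 1) for x2 in (0, 1)]
    for name, (w, t) in GATES.items()
}
```

Each gate needs only a single unit, which matches the cards' point that linearly separable functions can be represented in a single layer.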

Association (NN)

1. Hopfield Networks

Six general questions to decide on decision tree algorithm:

1. How many splits per node (properties binary or multi valued)? 2. Which property to test at each node? 3. When to declare a node to be leaf? 4. How to prune a tree that has become too large (and when is a tree too large)? 5. If a leaf node is impure, how to assign a category label? 6. How to deal with missing data?

What is the learning rate? (2)

1. How quickly the network converges 2. Uses experimentation to set the value

Backpropagation Learning

1. Initialize the weights with random values 2. Read in the inputs and the desired outputs 3. Compute the actual output 4. Compute the error 5. Change the weights by working backward 6. Repeat steps 2-5 until weights stabilize

What is the classic definition of AI? (2)

1. Make agents to model different functions 2. Put agents together

What did Parker and LeCun do and when? (3)

1. Mid 1980's 2. Multilayer networks discovered 3. Solved problem caused by non-linear separable functions

What is overfitting? (3)

1. Model describes random error and noise 2. Not the underlying relationship 3. Model can be too complex with too many parameters

Auto associative network

Achieved by complex search of store pattern

According to the "spreading activation" idea, you are likely to be especially fast in naming the word "cat" if it is preceded by the word "dog" because: - The words "cat" and "dog" are similar in structure (i.e., consonant - vowel - consonant) and frequency - It is easy to create an image of a cat and a dog - Activation spreads between semantically related concepts - Cats are inferior creatures and should never be mentioned before dogs

Activation spreads between semantically related concepts

What is analogue coding?

Analogue circuits code in continuous changes in voltages, as do neurons in their sub-threshold state.

Image classification

Analyzes parts of image and similarities to known animals/things/etc., requires MANY examples

Support and Confidence

Are measures of pattern interestingness

Errors

Arise from world noise, electrical noise

ANN

Artificial Neural Network

What does ANN stand for?

Artificial Neural Network

Curse of dimensionality

As feature number grows, amount of needed data to generalize increases exponentially

t-test

Assesses whether the means of two distributions are statistically different from each other

Naïve Bayes Cons

Assumes features are independent from each other

Loss function

At this stage, on the one hand, we have the actual output of the randomly initialised neural network. On the other hand, we have the desired output we would like the network to learn. The loss function measures the difference between the two.

Decision Tree "Best" Attribute

Attribute which splits set as nearly in half as possible

ARFF

Attribute-Relation File Format

Cross validation error

Average of the per-fold error scores (1 - accuracy) across all folds

Backpropagation networks

Backpropagation is a method used in artificial neural networks to calculate a gradient that is needed in the calculation of the weights to be used in the network

backpropagation algorithm

Backpropagation is probably the most widely used training method for feed-forward artificial neural networks, so we are going to see how it works.

Backpropagation

Backpropagation is shorthand for "the backward propagation of errors,"

Steps of the backpropagation algorithm 2

Backpropagation: the constant 1 is fed into the output unit and the network is run backwards. Incoming information to a node is added and the result is multiplied by the value stored in the left part of the unit. The result is transmitted to the left of the unit. The result collected at the input unit is the derivative of the network function with respect to x.

What kind of techniques do analogue calculations provide?

Basic calculus techniques

Belief Network

Bayes Network, directed acyclic graph with probability tables with parent dependencies

Naïve Bayes

Bayes net classifier that assumes all attributes are independent

Why do we use efficient coding?

Because spike coding is too energetically expensive

z-score normalisation

Better terminology is zero-mean normalization • min-max normalization cannot cope with outliers, z-score normalization can. • Transforms all attributes to have zero mean and unit standard deviation. • Outliers are in the heavy-tail of the Gaussian. • Still a linear transformation of all data.
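A small sketch contrasting the two normalisations mentioned above. The function names and sample values are assumptions; the z-score version uses the population standard deviation, and both functions assume the values are not all identical.

```python
import statistics

def z_score(values):
    """Linear transform to zero mean and unit standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)          # population standard deviation
    return [(v - mean) / std for v in values]

def min_max(values):
    """Linear transform onto [0, 1]; a single outlier sets the whole scale."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # mean 5, standard deviation 2
z = z_score(values)
mm = min_max(values)
```

Both transforms are linear, so the ordering and relative spacing of the data are preserved; only the offset and scale change.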

Sampling Theory Basics

Binomial and Normal Distributions Mean and Variance

Levels of modeling range from ___________________ (Ex would be Google Maps) to _______________________ (Intense traffic/topographical map)

Biophysical realism, Idealized model

NN Expressiveness

Boolean, continuous functions, arbitrary function ("jumps between continuous functions")

Branching Factor

Branching factor of node at level L is equal to the number of branches it has to nodes at level L + 1

Layers of ANN

Built of units that connect with each other, the more layers, the smarter the system

How do silicon retinas work?

By connecting two silicon neurons that extract information such as angles and lines

Normal Density

By far the most (ab)used density function is the Normal or Gaussian density

How is an associative memory formed?

By modifying the strength between layers so when input pattern is presented the store pattern associated with inputted pattern is retrieved

Cost function - Manhattan or City block distance

Calculate the distance between real vectors using the sum of their absolute difference. Also called City Block Distance
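As a minimal sketch of the definition above (the function name and example vectors are assumptions):

```python
def manhattan_distance(a, b):
    """Sum of absolute coordinate differences between two real vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))

d = manhattan_distance([1, 2, 3], [4, 0, 3])   # |1-4| + |2-0| + |3-3|
```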

Training Gradient descent/delta rule

Calculus so more robust, converges to local optimum

Noisy data - Clustering

Canceling noise by clustering - Cluster data into N groups - Replace original values by means of clusters OR: - Use to detect outliers

Noisy data - Regression

Canceling noise by regression: 1. Fit a parametric function to the data using minimization of e.g. least squares error 2. Replace original values by the parametric function value

Noisy data - Binning

Cancelling noise by binning: - Sort data - Create local groups of data - Replace original values by: ______ The bin mean ______ The closest min/max value of the bin

The connectionist model is NOT trying to mimic neurons. Instead, it is trying to...

Capture information processing and information storage capabilities of our minds.

Classification

Choosing from set of classes that matches a set of inputs

CART

Classification And Regression Trees

Measures of classification accuracy

Classification error rate; cross-validation; recall, precision and the confusion matrix; receiver operating characteristic (ROC) curves; two-alternative forced choice

Fishers LDA

Classification Problem. A dimension reducing technique. A set of parameters are used to project data x to a smaller dimension d'. The aim is to maximize the distance between the means of the two classes.

Filtering

Classifier does not inform the feature search and as such the data is "filtered" once and then handed off to learner

Maximum margin classifier

Classifier which is able to give an associated distance from the decision boundary for each example.

Locally weighted regression

Close K points are chosen and a line fit to them

Simpler Method for complex itemset

Closed frequent itemset: X is closed if there exists no super-set Y such that Y has the same support count as X Maximal frequent itemset: X is frequent, and there exist no supersets Y of X that are also frequent

Statisticians

Classification

Ensemble Learning

Combine rules to create complex rules

Hidden Unit Activation

Common activation functions are the unit step, the sigmoid (logistic) and tanh

Classification measures - Error Rate

Common performance measure for classification problems 1. Success: Instance's class is predicted correctly (True Positives (TP) / Negatives (TN)). 2. Error: Instance's class is predicted incorrectly (False Positives (FP) / Negatives (FN)). 3. False positives - Type I error. False Negative - Type II error. 4. Classification error rate: Proportion of instances misclassified over the whole set of instances.

F-measure

Comparing different approaches is difficult when using multiple evaluation measures (e.g. Recall and Precision) F-measure combines recall and precision into a single measure
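A sketch of how these measures follow from the four confusion-matrix counts. The function name and the example counts are made up for illustration, and the F-measure shown is the common F1 variant (harmonic mean of precision and recall), which is an assumption about which F-measure the card means.

```python
def classification_measures(tp, fp, fn, tn):
    """Derive common measures from true/false positive/negative counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,      # proportion misclassified
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

m = classification_measures(tp=8, fp=2, fn=4, tn=6)
```

Because F1 is a harmonic mean, it stays low unless precision and recall are both high, which is what makes it usable as a single comparison number.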

Inductive Learning: complexity of H

Complexity of hypothesis class (High: overfit, Low: underfit)

Analogue

Compound with similar molecule structure to another compound

NN Back Propagation

Computationally beneficial organization of the chain rule; errors flow backwards; able to change the weights of the entire network to figure out the output; often reaches a local optimum

Neural Networks

Computer Technology that attempts to build computers that will operate like a human brain

What is artificial intelligence ?

Computer systems that attempt to model and apply the intelligence of the human mind. Turing Test: a computer that can talk like a human

DM Functionalities

Concept/Class description • Characterization • Discrimination • Frequent patterns/ Associations/ Correlations • Classification and Regression (Prediction) • Cluster analysis • Outlier analysis • Evolution analysis

What is a confusion matrix?

Confusion matrices are used for dealing with false positives and false negatives, because some mismatches between prediction and reality are worse than others. (Think of an HIV test.)

Version Spaces

Contain true hypothesis in H Training set is subset of X with all x having c(x) Hypotheses consistent with examples

DM task primitives

DM task primitives forms the basis for DM queries. DM primitives specify: • Set of task-relevant data to be mined • Kind of knowledge to be mined • Background knowledge to be used • Interestingness measures and thresholds for pattern evaluation • Representation for visualizing discovered patterns.

Boosting D_t+1 calculation

D_t(i)*e^(-alpha_t*y_i*h_t(x_i))/Z_t
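A sketch of the boosting distribution update, D_{t+1}(i) = D_t(i)·exp(-alpha_t·y_i·h_t(x_i)) / Z_t, where Z_t is the normaliser that makes the new weights sum to 1. Labels and predictions are assumed to be in {+1, -1}; the function name, the value of alpha and the example data are illustrative assumptions.

```python
import math

def update_distribution(d, alpha, labels, predictions):
    """Return D_{t+1} given D_t, alpha_t, labels y_i and predictions h_t(x_i)."""
    unnormalised = [
        di * math.exp(-alpha * y * h)        # y*h is +1 if correct, -1 if wrong
        for di, y, h in zip(d, labels, predictions)
    ]
    z = sum(unnormalised)                    # Z_t
    return [u / z for u in unnormalised]

d_t = [0.25, 0.25, 0.25, 0.25]
labels = [1, 1, -1, -1]
predictions = [1, -1, -1, -1]                # the second example is misclassified
d_next = update_distribution(d_t, alpha=0.5, labels=labels, predictions=predictions)
```

Misclassified examples gain weight and correctly classified ones lose weight, so the next weak learner concentrates on the hard cases.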

Three ways of constructing new kernels

Direct from feature space mappings Proposing kernels directly Combination of existing (valid) kernels • multiplication by a constant • exponential of a kernel • sum of two kernels • product of two kernels • left/right multiplication by any function of x/x'

How are neurones connected together?

Directed weighted paths

ANN's can't...

Discern thoughts and intentions Solve multiple tasks at the same time Common sense Feelings Learn from few examples Learn actively

graceful degradation

Disruption of performance due to damage to a system that occurs only gradually as parts of the system are damaged. This occurs in some cases of brain damage and also when parts of a connectionist network are damaged.

Complete Linkage Clustering

Distance between two clusters is maximum distance between observation in one cluster and observation in other cluster
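A minimal sketch of the complete-linkage distance between two clusters. Euclidean distance between observations, the function name and the example points are assumptions.

```python
import math

def complete_linkage(cluster_a, cluster_b):
    """Largest pairwise distance between a point in A and a point in B."""
    return max(math.dist(p, q) for p in cluster_a for q in cluster_b)

a = [(0.0, 0.0), (1.0, 0.0)]
b = [(3.0, 0.0), (5.0, 0.0)]
dist_ab = complete_linkage(a, b)             # farthest pair is (0,0) and (5,0)
```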

Gain algorithm

E(S) - sum_v(abs(S_v)/abs(S) * E(S_v))
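A sketch of information gain: the entropy of the whole set S minus the size-weighted entropy of the subsets S_v obtained by splitting on an attribute. Base-2 logarithms, the function names and the toy data are assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """E(S) - sum over values v of |S_v|/|S| * E(S_v)."""
    n = len(labels)
    subsets = {}
    for v, y in zip(attribute_values, labels):
        subsets.setdefault(v, []).append(y)  # group labels by attribute value
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

labels = ["yes", "yes", "no", "no"]
gain_perfect = information_gain(labels, ["a", "a", "b", "b"])   # split separates the classes
gain_useless = information_gain(labels, ["a", "b", "a", "b"])   # split tells us nothing
```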

edge

E. Partition graphs such that nodes in same cluster have large weights between them and between cluster weights are small

error function used in backpropagation network

E = 1/2 Σ_i ||o_i − t_i||^2

Neuro feedback therapy

EEG is shown to patient and they try to change certain aspects of their mental states

Rule based learning

Equivalent in expression power to traditional (mono-thetic) decision trees, but with more flexibility • They produce rule sets as solutions, in the form of a set of IF... THEN rules

Units of ANN

Equivalent of neurons, receive information, integrate inputs, and pass information forward if threshold is reached

Connectionist models are networks of "neuron-like" nodes that can learn from BLANK.

Experience

Decision Tree n-XOR

Expressible, but uses 2^n nodes (exponential)

Decision Tree n-OR

Expressible, uses linear number of nodes

Architectures - Unsupervised

Estimator: SOFM (SOM) Extractor: ART-1, ART-2

Applications of ANN's

Face recognition, self-driving cars, voice recognition, automated conversation

Name two neural network topologies

Feed forward topology (data flow is in the forward direction only; no feedback) and Recurrent neural networks(feedback present)

Steps of the backpropagation algorithm 1

Feed-forward:the input x is fed into the network. The primitive functions at the nodes and their derivatives are evaluated at each node. The derivatives are stored.

Machine Learning

Field of study that gives computers the ability to learn without being explicitly programmed.

Machine Learning broad definition

Field of study that gives computers the ability to learn without being explicitly programmed.

Training LDA objective:

Find (i.e. learn) that minimizes some error function on the training set. Significant approaches: • Least squares • Fisher • Perceptron

The goal of data mining is to

Find interesting patterns!. An interesting pattern is: 1. Easily understood. 2. Valid on new data with some degree of certainty. 3. Potentially useful. 4. Novel.

Linear Discriminant Analysis

Find projection discriminating based on given labels. (Not really unsupervised)

Pattern Recognition

Finding patterns without experience. It's also called unsupervised learning.

argmax_h(P(h|D))

Finds the most probable hypothesis given the data

Forward-backward algorithm

First applies Forward selection and then filters redundant elements using backward elimination

Pruning

First fully train a tree, without stopping criterion After training, prune tree by eliminating pairs of leaf nodes for which the impurity penalty is small

We can assess how well the network learns by hearing how well it pronounces the training words and then, critically, by hearing how well it generalizes to new text.

First recording 38" (new learning) Second recording 3'20" (after 10,000 learning trials) Third recording 5' (new words not yet learned)

Why use a sigmoidal activation function?

Firstly, people wanted to use a function that they could use mathematics to understand.

You will see how this works from the following sketch of the algorithm

Forward pass: calculate the output of the network by forward propagating the outputs of one layer to the inputs of the next.

SVM: kernel

Function used to find a hyperplane between classes; can project data into different representations that are more easily separable

Activation Functions

Function which takes the total input and produces an output for the node given some threshold.

Graph

G(V,E)

For example, pronounce:

GHOTI

Generalisation

Generalization is the desired property of a classifier to be able to predict the labels of unseen examples correctly. A hypothesis generalizes well if it can predict an example coming from the same distribution as the training examples well.

Supervised Learning

Given training data consisting of pairs of inputs/outputs, find a function which correctly matches them

Mutual Information

Gives a measure of how 'close' two components of a joint distribution are to being independent

Minimum Error Rate

Goal is to minimise error rate

Decision Tree Preference Bias

Good splits at top, correct over incorrect, shorter trees

Gradient

For a multidimensional function, the gradient gives the direction of the greatest change in value (it is built from the partial derivatives). For example, given g(x,y) = -x + y^2, the gradient is (-1, 2y), so to increase g it is better to decrease x and, for positive y, to increase y, with y having the stronger effect as it grows. This is the basis of gradient-based methods such as the steepest descent technique.
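
The example can be checked with a minimal Python sketch. The helper names (`g`, `grad_g`, `numeric_grad`), the test point (1, 2) and the step size 0.1 are illustrative assumptions, not part of the original card.

```python
def g(x, y):
    # Example function from the card: g(x, y) = -x + y^2
    return -x + y ** 2

def grad_g(x, y):
    # Analytic partial derivatives: dg/dx = -1, dg/dy = 2y
    return (-1.0, 2.0 * y)

def numeric_grad(f, x, y, h=1e-6):
    # Central-difference approximation of both partial derivatives
    dx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return (dx, dy)

point = (1.0, 2.0)
analytic = grad_g(*point)
numeric = numeric_grad(g, *point)

# One steepest-descent step moves against the gradient to reduce g
step = 0.1
new_point = (point[0] - step * analytic[0], point[1] - step * analytic[1])
print(analytic, numeric, g(*point), g(*new_point))
```

Moving against the gradient lowers g, while moving along it would raise g.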

ID3 "best" attribute

Greatest information gain or greatest gini gain

Data Discretization

Grouping a possibly infinite space into a discrete set of possible values. For categorical data: • Super-categories. For real numbers: • Binning • Histogram analysis • Clustering

Joint Entropy

H(x,y) = - SUM(P(x,y)logP(x,y))
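
A small Python sketch of this formula, assuming a discrete joint distribution given as a table of probabilities and a base-2 logarithm (the card does not state the base). The function name `joint_entropy` is illustrative.

```python
import math

def joint_entropy(joint):
    """H(X, Y) = -sum over x, y of P(x, y) * log2 P(x, y).

    Cells with zero probability are skipped, i.e. they contribute 0.
    """
    return -sum(p * math.log2(p) for row in joint for p in row if p > 0)

# Two independent fair coins: four equally likely outcomes
uniform = [[0.25, 0.25],
           [0.25, 0.25]]
h_uniform = joint_entropy(uniform)

# A fully determined outcome carries no uncertainty
deterministic = [[1.0, 0.0],
                 [0.0, 0.0]]
h_deterministic = joint_entropy(deterministic)
print(h_uniform, h_deterministic)
```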

Feedback projections

Help to direct attention (different kinds of hats in different locations in pictures, Where's Waldo, etc.)

Entropy

Helps rank attributes by how much they contribute information ranging from 0 (no info) to 1 (maximum info)

the forward pass 3

Here's how we calculate the total net input for h_1: net_{h1} = w_1 * i_1 + w_2 * i_2 + b_1 * 1 net_{h1} = 0.15 * 0.05 + 0.2 * 0.1 + 0.35 * 1 = 0.3775 We then squash it using the logistic function to get the output of h_1: out_{h1} = \frac{1}{1+e^{-net_{h1}}} = \frac{1}{1+e^{-0.3775}} = 0.593269992 Carrying out the same process for h_2 we get: out_{h2} = 0.596884378
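
The arithmetic for h_1 can be reproduced with a few lines of Python. This is only a sketch using the numbers quoted on the card; the variable names mirror the card's symbols, and the weights into h_2 are not given here, so only h_1 is computed.

```python
import math

def logistic(z):
    # Logistic "squashing" function 1 / (1 + e^(-z))
    return 1.0 / (1.0 + math.exp(-z))

# Values quoted on the card
i1, i2 = 0.05, 0.10   # inputs
w1, w2 = 0.15, 0.20   # weights into h_1
b1 = 0.35             # bias weight (its input is fixed at 1)

net_h1 = w1 * i1 + w2 * i2 + b1 * 1
out_h1 = logistic(net_h1)
print(net_h1, out_h1)
```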

What sort of information is represented in the hidden units?

Hidden units contain an intermediate abstract representation of the numbers.

Discrete latent variables

Hidden variables that can take only a limited number of discrete values (e.g. gender or basic emotion).

recurrent

Hopfield net

unsupervised, binary

Hopfield net, competitive net

Challenges for building a brain

How does our brain do pattern detection so well when the inputs can be so different?

Mistake bounds

How many misclassifications can a learner make over an infinite run (online)

Sample complexity

How many teaching examples are needed for a learner to create a successful hypothesis (batch)

Computational complexity

How much computational effort is needed for a learner to converge

Artificial Neural Networks make interesting mistakes even after training:

Human brains are much more complex and have many more neurons so context is more apparent

Deep Blue

IBM computer that defeats Garry Kasparov in 1997

Question: Where is the concept represented when it is distributed like this? PART III

If a few connections are lost in a connectionist system it continues to work pretty well, though perhaps "fuzzier." This is called "graceful degradation," and is similar to what occurs with a patient's memory after brain injury. Details may be lost, the system may function more "noisily," but it will not fail until a large proportion of the connections are lost.

General classification problem

If classes are disjoint, i.e. each pattern belongs to one and only one class then input space is divided into decision regions separated by decision boundaries or surfaces

How do we tackle class imbalances?

If the differences are not very large: draw balanced mini-batches. Sparse classes are then shown more often. If the differences are grouped in different classes: draw balanced mini-batches from different subgroups. It then shows larger groups more often. For very unbalanced classes: weight the loss. The loss for misclassified samples of small classes is then increased.

The Competitive Process

If the input space is D dimensional (i.e. there are D input units) we can write the input patterns as x = {xi : i = 1, ..., D} and the connection weights between the input units i and the neurons j in the computation layer can be written wj = {wji : j = 1, ..., N; i = 1, ..., D} where N is the total number of neurons.

MLFFN

If we combine units in a way that ensures that there is no recurrence, that is there are no 'loops' (no output of a unit can affect its input), even indirectly, then we obtain what we call a multi-layered feed-forward network or MLFFN for short

Question 2: Design a network of threshold units which has 4 inputs and which outputs a one if and only if an odd number of these inputs are one

If we connect all the inputs that need to be 1 with connections with weight 1 and we have a threshold (- bias remember!) that is one less than the number of 1s input, then we get the required response. However, we need to ensure that the inputs that we want to be zero are in fact zero. This is quite easy, as we simply connect them to the unit with a negative weight. If any input that should be 0 is not, or any input that should be 1 is not, then the unit will output 0. The result is shown in Figure 4

The sigmoid has another useful property

If we differentiate σ(net) with respect to net we get: \frac{d\sigma(net)}{d\,net} = \exp(-net) (1 + \exp(-net))^{-2} = \sigma(net)(1 - \sigma(net)) = a(1 - a)
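
The identity dσ/dnet = σ(net)(1 - σ(net)) can be checked numerically. The following is a minimal sketch; the function names and the sample points are illustrative choices.

```python
import math

def sigma(net):
    # Sigmoid activation
    return 1.0 / (1.0 + math.exp(-net))

def sigma_prime(net):
    # Closed form from the card: a * (1 - a) with a = sigma(net)
    a = sigma(net)
    return a * (1.0 - a)

def numeric_derivative(f, x, h=1e-6):
    # Central-difference estimate of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

checks = [(net, sigma_prime(net), numeric_derivative(sigma, net))
          for net in (-2.0, 0.0, 0.5, 3.0)]
for net, closed_form, estimate in checks:
    print(net, closed_form, estimate)
```

This property is convenient in backpropagation because the derivative is obtained from the activation a that the forward pass has already computed.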

Letter Recognition

Imagine an input layer that was arranged as a 20 x 20 matrix. Imagine if I wrote a number on it. This could be coded as turning some inputs on. Different numbers would have different input patterns.

Cross-validation

In k-fold cross-validation, a dataset is split into k roughly equally sized partitions, such that each example is assigned to one and only one fold. At each iteration a hypothesis is learned using k-1 folds as the training set and predictions are made on the k'th fold. This is repeated until a prediction is made for all k folds, and an error rate for the entire dataset is obtained. Cross-validation maximises the amount of data available to train and test on, at the cost of increased time to perform the evaluation. • Training data segments between different folds should never overlap • Training and test data in the same fold should never overlap. Error estimation can either be done per fold separately, or delayed by collating all predictions per fold.
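
The splitting step can be sketched in Python as follows. This is an illustrative helper (`k_fold_indices` is a made-up name) that only produces index sets, without shuffling; training and scoring a model on each split is left out.

```python
def k_fold_indices(n_examples, k):
    """Yield (train_idx, test_idx) pairs so that every example
    lands in exactly one test fold."""
    indices = list(range(n_examples))
    # Spread the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx)
```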

K-Means Clustering

Informally, goal is to find groups of points that are close to each other but far from points in other groups • Each cluster is defined entirely and only by its centre, or mean value µk

The self-organization process involves four major components

Initialization: All the connection weights are initialized with small random values. Competition: For each input pattern, the neurons compute their respective values of a discriminant function which provides the basis for competition. The particular neuron with the smallest value of the discriminant function is declared the winner

Instances

Input

Dendrites

Inputs

Artificial Neuron

Inputs: single numeric attribute. Weights: relative strength of influence on the output. Neuron (Summation): computes the weighted sum of all the input elements. Transfer Function: converts the inner activation level to the output. Outputs: solution represented as a number.

What property relating to errors do we require of neural networks

Insensitivity to small errors in the input pattern.

ID3

Iterative Dichotomiser version 3. Used for nominal, unordered input data only. Every split has a branching factor equal to the number of values the variable can take (e.g. bins of a discretized variable). The tree has as many levels as input variables.

What is a loss function?

It is a function of the network's parameters which measures the difference between the output of the network and the desired output. Training the network means minimizing the loss by modifying the parameters (weights).

Missing Attributes

It is common to have examples in your dataset with missing attributes/variables. One way of training a tree in the presence of missing attributes is to remove all data points with any missing attributes. A better method is to remove only those data points that miss a required attribute when considering the test for a given node for a given attribute. This is a great benefit of trees (and in general of combined models).

The Principle of Parsimony

It is pointless to do with more what is done with less.

How is information stored in an artificial neural network?

It is stored in the weights of the connections, where each neuron is dumb and only responds to the weighted input.

How is using efficient coding helpful?

It maximizes the amount of information we observe through the patterns of spikes by reducing redundancy.

A single perceptron can only be used to implement linearly separable functions

It takes both real and boolean inputs and associates a set of weights to them, along with a bias (the threshold thing I mentioned above).

State Oja's rule for unsupervised learning

It's based on Hebbian learning, which states that the weights are modified by a term proportional to the product of the input and output.

K Nearest Neighbors Design Considerations

K, distance metric, voting algorithm, weighted mean

Kernel methods

Kernel methods map a non-linearly separable input space to another space which hopefully is linearly separable • This space is usually higher dimensional, possibly infinite-dimensional • Even the 'non-linear' kernel methods essentially solve a linear optimization problem!

Kohonen learning uses a neighborhood function φ

Kohonen learning uses a neighborhood function φ, whose value φ(i, k) represents the strength of the coupling between unit i and unit k during the training process

most popular model of self-organizing networks

Kohonen networks

Kohonen networks

Kohonen networks learn to create maps of the input space in a self-organizing way

Gaussian Laplacian

L = D - W
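
A short Python sketch of this relation, reading W as a weighted adjacency matrix and D as the diagonal degree matrix whose entries are the row sums of W (that reading of D is an assumption, since the card does not define it). The three-node graph is a made-up example.

```python
def graph_laplacian(W):
    """Return L = D - W, where D is diagonal with D[i][i] = sum of row i of W."""
    n = len(W)
    degrees = [sum(row) for row in W]
    return [[(degrees[i] if i == j else 0) - W[i][j] for j in range(n)]
            for i in range(n)]

# Node 0 is connected to nodes 1 and 2; nodes 1 and 2 are not connected
W = [[0, 1, 1],
     [1, 0, 0],
     [1, 0, 0]]
L = graph_laplacian(W)
for row in L:
    print(row)
```

Each row of L sums to zero, which follows directly from the definition of D.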

Bayesian Learning

Learn most probable H given data and domain knowledge

Bagging

Learn rule over k subsets of data and combine into a single rule

Inductive Learning: Training example selection

Learner asks teacher, teacher gives examples, x chosen from distribution by nature, malicious teacher gives learner false information

Wrapping

Learner informs feature search of effectiveness of set when finding optimal set. Takes model bias into account but very slow.

Regression

Learning a function that provides a continuous value

Perceptrons and parallel processing

Learning can only be implemented by modifying the connection pattern of the network and the thresholds of the units, but this is necessarily more complex than just adjusting numerical parameters

Machine Learning

Learning from experience. It's also called supervised learning, where experience E is the supervision.

What are some components of human intelligence?

Learning, reason and logic, behavior of social situations

Explain cross validation

Leave one out method

Why Sample for Bayes net?

Less complex and almost as accurate as inference

cognitive aging

Lifelong process of gradual, ongoing, yet highly variable change in cognitive functions that occur as people get older that is not a disease or a quantifiable level of function

Classification

ML task where T has a discrete set of outcomes. Often classification is binary. Examples: • face detection • smile detection • spam classification • hot/cold

Regression

ML task where T has a real-valued outcome on some continuous sub-space Examples: • Age estimation • Stock value prediction • Temperature prediction • Energy consumption prediction

What is the difference between MLP and RBF?

MLP: multi-layer perceptron RBF: radial basis function Classification: -MLPs separate classes via hyperplanes -RBFs separate classes via hyperspheres Learning: -MLPs use distributed learning -RBFs use localized learning -RBFs train faster Structure: -MLPs have one or more hidden layers -RBFs have only one layer -RBFs require more hidden neurons => curse of dimensionality

Linear Regression

Match to mx+b function with minimum squared error (could be constant function f(x) = c)

SVM: max(2/||w||) or min(1/2 * ||w||^2) or max(w(alpha))

Maximization of the length of margin

Independent Component Analysis

Maximizes the statistical independence of dataset. Assumes there are "hidden" variables that the data is a linear combination of. I(y_i,y_j)=0

K Nearest Neighbors Regression

Mean of y in nearest neighbors
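
A minimal sketch of this rule for one-dimensional inputs, assuming absolute difference as the distance and an unweighted mean. The function name and the toy data are illustrative.

```python
def knn_regress(train, query, k):
    """Predict y for `query` as the mean y of the k nearest training points.

    `train` is a list of (x, y) pairs with scalar x.
    """
    nearest = sorted(train, key=lambda point: abs(point[0] - query))[:k]
    return sum(y for _, y in nearest) / k

train = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (10.0, 50.0)]
pred = knn_regress(train, 2.1, k=3)
print(pred)
```

The distant point at x = 10 is ignored because it is not among the three nearest neighbours.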

Features/Attributes

Measurable values of variables that correlate with the label y Examples: • Sender domain in spam detection • Mouth corner location in smile detection • Temperature in forest fire prediction • Pixel value in face detection

Derivative

Measure of how fast the function value changes with a change of the argument. So if you have the function f(x)=x^2 you can compute its derivative and learn how fast f(x+t) changes for small enough t. This tells you about the basic dynamics of the function.

Kullback-Leibler Divergence

Measures the difference between 2 distributions, based on mutual information. When x and y are independent they converge; as x and y become dependent they diverge. D(p||q) = Integral(p(x) log(p(x)/q(x)) dx)
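
For discrete distributions the integral becomes a sum. The sketch below assumes two probability lists over the same outcomes, with q(x) > 0 wherever p(x) > 0, and uses the natural logarithm; the name `kl_divergence` and the sample distributions are illustrative.

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum over x of p(x) * log(p(x) / q(x)).

    Terms with p(x) = 0 contribute 0 and are skipped.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)
d_pp = kl_divergence(p, p)
print(d_pq, d_qp, d_pp)
```

The two directions give different values, so the divergence is not symmetric, and it is zero when the distributions coincide.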

Feature Selection

Method of reducing the complexity of a dataset by choosing the best features

Philosophers

Mind and machines

Misclassification Impurity

Minimum probability that training example will be misclassified at node N

Layers of ANN

Mirror V1/V2/etc., lower layers notice edges and orientation, higher layers see whole objects

Relevance Vector Machines

Model the typical points of a data set, rather than atypical ones (a la density estimation), while remaining a (very) sparse (like heat map) representation. Returns a true posterior. Naturally extends to multi-classification. Fewer parameters to tune.

Linear Regression revisited

Model: - linear and additive relationship - random variation Model estimation: - free parameter beta, set beta to max fit Objective: - min loss function - loss: sum of R square

Go

More complex game that AI's compete in later on

NN Restriction Bias sigmoid

More complex, not much restriction

Data quality measures

Multi-Dimensional Measure of Data Quality • A well-accepted multidimensional view: • Accuracy • Completeness • Consistency • Timeliness • Believability • Value added • Interpretability • Accessibility • Broad categories: • Intrinsic, contextual, representational, and accessibility

what are the capability of multilayer network

Multi-layered networks are capable of performing just about any linear or nonlinear computation, and can approximate any reasonable function arbitrarily well.

An artificial neural network can be trained to generate speech from visual inputs.

NETtalk

ICA Examples

Natural Scenes -> edges Documents -> topics

backpropagation drawbacks

Network paralysis occurs when the weights are adjusted to very large values during training, large weights can force most of the units to operate at extreme values, in a region where the derivative of the activation function is very small.

What is neural computing?

Neural computing is the study of networks of adaptable units which, through a process of learning from examples, store experience and knowledge and make it available for use

What did Carver Mead call the translation of neurobiology into technology?

Neuromorphic engineering

Integrate and fire neurons

Neurons built of transistors that add up all the inputs and information they receive, coded as voltages, and fire an action potential if the voltage reaches the threshold

These models capture interesting properties of cognitive processing BUT they are not meant to function in the same way that...

Neurons do

Communication Among Neurons

Neurons form networks through which nerve impulses travel. Communication is carried out via electrical conduction and chemical transmission

ANNs Step 1

Neurons in ANN are called units and they receive info from other units (like dendrites through neurons) then they integrate the inputs similar to IPSP and EPSP in real neurons. Each unit has preferred threshold, and if summed signals are greater than threshold, unit will pass info forward in network

Signal Transduction Step 3

Neurotransmitters bind to receptors on the dendrites or the cell body of a neuron. This binding creates small electrical potentials, called synaptic potentials, that can be excitatory or inhibitory (Chemical transmission)

Can an infinite H be PAC learnable?

No

Impossibility Theorem

No clustering scheme can achieve all three properties

Reinforcement Learning

No data given. Agent interacts with the environment calculating cost of actions

Undirected PGN

No edge direction (Markov Random Field)

Instance Based Learning Cons

No generalization Overfits Lookup with same x could give multiple y's

Noisy data

Noise is a random error or variance in a measured variable

Why is it important that both training set and classes are normalised?

Normalisation is required so that all the inputs are at a comparable range Some candidates forgot that normalisation is an essential step in the working of the Kohonen network. It allows one to use the inner product over the distance making the calculations quicker. The omission of this fact seems to suggest that candidates are not looking at questions as a whole. A previous part of the question should have served as a reminder to candidates about the importance of normalisation

NETtalk exhibited strong global regularities (i.e., it detected the consistent patterns whereby sounds were influenced by surrounding letters), but also a large number of more specialized rules and exceptional cases.

Note that it did this without those rules ever being explicitly coded?

On-line Gradient Descent

On-line (or stochastic) gradient descent, also known as incremental gradient descent, updates the parameters one data point at a time. - Handles redundancy better. (Batch GD has redundancy) - Usually much faster than Batch GD. - SGD performs frequent updates with a high variance that cause the objective function to fluctuate heavily - Can deal with new data better. - Good chance of escaping local minima. However, when we slowly decrease the learning rate, SGD shows the same convergence behaviour as batch gradient descent.
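
The one-point-at-a-time update can be sketched in Python for a simple case. Everything specific here is an assumption for illustration: a linear model y ≈ m*x + b, a squared-error loss, a fixed learning rate of 0.1, and noise-free toy data generated from y = 2x + 1.

```python
import random

def sgd_linear_fit(data, lr=0.1, epochs=500, seed=0):
    """Fit y ≈ m*x + b, updating m and b after every single data point."""
    rng = random.Random(seed)
    m, b = 0.0, 0.0
    points = list(data)
    for _ in range(epochs):
        rng.shuffle(points)            # visit the points in random order
        for x, y in points:
            error = (m * x + b) - y    # prediction error on this one point
            m -= lr * error * x        # gradient of 0.5 * error^2 w.r.t. m
            b -= lr * error            # gradient of 0.5 * error^2 w.r.t. b
    return m, b

# Toy data on the line y = 2x + 1, with x in [0, 1]
data = [(x / 10, 2.0 * (x / 10) + 1.0) for x in range(11)]
m, b = sgd_linear_fit(data)
print(m, b)
```

Batch gradient descent would instead average the gradient over all eleven points before making a single update.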

Connectionist Models Of Mind

One might train the network with the following set of trials adjusting weights based on performance on each trial

How is deep neural network optimized?

Optimized through gradient descent! (Forward-Backward algorithm) - Penalize complex solutions to avoid overfitting

Connectionist Models Of Mind are otherwise known as...

Parallel Distributed Processing or Artificial neural networks

d) Draw a labelled diagram of the network from part c), showing the two initial units and the two examples. Mark on it the positions of the units after training with each of the examples

Part d) threw some candidates. These individuals did not remember that the action of training a Kohonen network is to move the units towards cluster centres. The diagram required was simply one where the units' pre- and post-training positions were marked and some sort of notation used to show this movement.

Perceptron Algorithm

Perceptron is modeled after neurons in the brain. It has m input values (which correspond with the m features of the examples in the training set) and one output value. Each input value x_i is multiplied by a weight-factor w_i. If the sum of the products between the feature value and weight-factor is larger than zero, the perceptron is activated and 'fires' a signal (+1). Otherwise it is not activated. The weighted sum between the input-values and the weight-values, can mathematically be determined with the scalar-product <w, x>. To produce the behaviour of 'firing' a signal (+1) we can use the signum function sgn(); it maps the output to +1 if the input is positive, and it maps the output to -1 if the input is negative. Thus, this Perceptron can mathematically be modeled by the function y = sgn(b+ <w, x>). Here b is the bias, i.e. the default value when all feature values are zero.
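
The function y = sgn(b + <w, x>) can be written directly in Python. In this sketch the weights and bias are hypothetical values chosen so that the unit behaves like a logical AND, and mapping an input of exactly zero to -1 is an assumption, since the card only covers positive and negative inputs.

```python
def sgn(z):
    # +1 for positive input, -1 otherwise (the z == 0 case is an assumption)
    return 1 if z > 0 else -1

def perceptron_output(w, x, b):
    # y = sgn(b + <w, x>), with <w, x> the scalar product of weights and inputs
    return sgn(b + sum(wi * xi for wi, xi in zip(w, x)))

# Hypothetical parameters: fires only when both inputs are 1
w, b = [1.0, 1.0], -1.5
inputs = ([0, 0], [0, 1], [1, 0], [1, 1])
outputs = [perceptron_output(w, x, b) for x in inputs]
print(outputs)
```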

Random Components Analysis

Randomly chooses axes onto which the data is projected. Frustratingly, it works well when preprocessing classification problems (by chance it picks up some correlations)

Stopping Criteria

Reaching a node with a pure sample is always possible but usually not desirable as it usually causes over-fitting.

Caveats Regarding Connectionist Models

Real neurons are more complex than the simple digital neurons used in these computer models. Real neurons are not identical. The surface of the brain, the cortex, contains highly structured layers of different neurons serving different functions. And, localization of both neurons, types of neurons, and function occurs in the nervous system. Damage to particular regions disrupts the ability to perceive phonemes, to experience the correct color of objects, to perceive faces, to calculate, and so on. The learning rule used in these models is not biologically plausible.

RELU

Rectified Linear Unit New trend, responsible for great deal of Deep Learning success. Advantages: • No 'vanishing gradient' problem • Can model any positive real value • Can stimulate sparseness

Architectures - Supervised

Recurrent: Hopfield. Feedforward: Nonlinear v Linear, Backpropagation, MP perceptron, Boltzmann

Creating a Tree Model

Recursive partitioning approach (greedy search strategy) - find the best possible split (most important variable?, optimal threshold for this variable?) - split tree accordingly and create two branches - recursive partitioning (repeat previous steps, find best possible split for each branch) - stop when no further improvement is possible 2 important decisions: 1) splitting rule: How to split a node? 2) Stopping rule: How to decide if a node is a leaf node?

Boosting

Recursively trains models based on "hardness" of data points

Instance Reduction

Reduces the number of instances rather than attributes. Much more dangerous, as it risks changing the data distribution properties • Duplicate removal • Random sampling • Cluster sample • Stratified sampling

Numerosity Reduction

Reduces the number of instances rather than attributes. Much more dangerous, as it risks changing the data distribution properties. • Parametrization • Discretization • Sampling

What is pruning?

Reducing the size of the decision tree by getting rid of the deadweight.

Regularization

Regularization is a technique used in an attempt to solve the overfitting problem in statistical models.

Lazy Learner

Remembers training data and only tries to apply it to unknown data when queried

Training ANN

Requires many, many examples to establish correct and faster connections from unit to unit

Automated Seizure Prediction/Disruption

Requires surgery to implant electrodes and pulse generator

Deep Brain Electrical Stimulation

Requires surgery to implant stimulating electrodes amplitude and duration of stimulation can be adjusted by doctor or patient

1st order Markov models

Restricted to encoding sequential correlation on previous element only

Instance Based Learning

Retains all training data and infers new classifications/values from it

Clustering Properties

Richness, Scale invariance, consistency

Tree components

Root node, branch, node, leaf node.

DM Objective Interest

Support: P(X U Y ) Percentage of transactions that a rule satisfies Confidence: P(Y | X) Degree of certainty of a detected association, i.e. the probability that a transaction containing X also contains Y
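
Both measures can be computed from a list of transactions. The sketch below treats each transaction as a set of items; the helper names and the four toy transactions are made up for illustration.

```python
def support(transactions, items):
    # Fraction of transactions that contain every item in `items`
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, x, y):
    # P(Y | X): among transactions containing X, the fraction also containing Y
    return support(transactions, set(x) | set(y)) / support(transactions, x)

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
s = support(transactions, {"bread", "milk"})        # rule bread -> milk
c = confidence(transactions, {"bread"}, {"milk"})
print(s, c)
```

Here the rule bread -> milk holds in two of the four transactions, and in two of the three transactions that contain bread.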

Describe the algorithm which is used to produce a state transition table of a Hopfield network given its weights.

Suppose that we calculate the sum of the nets of all units: S = \sum_{u=0}^{N} net_u. If a unit has an activation of 0, it changes to 1 on firing if its net is non negative. Similarly if a unit has an activation of 1 it changes to 0 on firing if its net is negative. Looking at the contribution of such a unit to S we can see that a unit changing from 0 to 1 causes S to increase by net_k and one that changes from 1 to 0 also increases S by its net. In both cases S increases because of the change. Hopfield noticed this property and consequently the analogy between S and energy in physics. If we define Energy, E = -0.5 S we have the proposition that a change of state is never accompanied by an increase in energy. The factor of 0.5 in the formula is there to simplify its derivation. For calculations we can make the sum simpler by using the fact that w_{ij} = w_{ji} and so we can rewrite E as E = -\sum_{u=0}^{N} a_u \sum_{v<u} w_{uv} a_v. We can summarise one property of a Hopfield net as being that its Energy never increases and in fact decreases whenever a unit changes from a 1 to a 0. Clearly this implies a number of things: a. A Hopfield net has states which, if entered, cannot be left - as well as being called sink or stable states, such states are sometimes called energy wells to conform to the usage of physics. We have seen this in the example above. b. For an energy well, the summation block of each unit that has value 1 is non negative and the summation blocks of each unit that has value 0 is negative.
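
The claim that the energy never increases can be tried out on a small example. This Python sketch uses the rewritten form of E (a sum over pairs v < u) and the firing rule from the card. The weight matrix, the starting state and the firing order are made-up values, and treating unit 0 as a unit clamped at 1 that carries the biases is an assumption in line with the bias convention used elsewhere in this set.

```python
def net_input(W, a, k):
    # Weighted sum into unit k from all other units (unit 0 supplies the bias)
    return sum(W[k][v] * a[v] for v in range(len(a)) if v != k)

def energy(W, a):
    # E = -sum_u a_u * sum_{v<u} w_uv * a_v, valid for symmetric weights
    return -sum(a[u] * sum(W[u][v] * a[v] for v in range(u))
                for u in range(len(a)))

def fire(W, a, k):
    # Unit k becomes 1 if its net is non-negative, otherwise 0
    new_state = list(a)
    new_state[k] = 1 if net_input(W, a, k) >= 0 else 0
    return new_state

# Symmetric weights with a zero diagonal; row/column 0 holds the biases
W = [[0, -1, 1, -1],
     [-1, 0, 2, -3],
     [1, 2, 0, 1],
     [-1, -3, 1, 0]]
state = [1, 1, 0, 1]            # unit 0 is clamped at 1 and never fired
energies = [energy(W, state)]
for k in (1, 2, 3, 1, 2, 3):    # fire the ordinary units one at a time
    state = fire(W, state, k)
    energies.append(energy(W, state))
print(energies, state)
```

Once a full sweep leaves every unit unchanged, the net has reached one of the stable states (energy wells) described above.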

Drawing a line in a plane How about the inverse of this: given a straight line graph, can we build a unit that separates the plane along the line?

Suppose that we have a line given by y = mx + c. We can see that this can be written as mx - y + c = 0 so that setting bias = c, v = m and w = -1 will provide the required weights. Not all straight lines, however, can be written as y = mx + c: for example, a vertical line cannot be so written. However, it can be written in the form represented by units. The vertical line through x = c can be written as c - x + 0y = 0, that is bias = c, v = -1 and w = 0.
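
These weight choices can be tried out with a small Python sketch. It assumes a threshold unit that outputs 1 when bias + v*x + w*y is non-negative and 0 otherwise; the function names and the sample line y = 2x + 1 are illustrative.

```python
def unit_output(bias, v, w, x, y):
    # Threshold unit: 1 on one side of the line (and on it), 0 on the other
    return 1 if bias + v * x + w * y >= 0 else 0

def weights_for_line(m, c):
    # y = m*x + c  rewritten as  m*x - y + c = 0  ->  bias = c, v = m, w = -1
    return c, m, -1.0

def weights_for_vertical(c):
    # x = c  rewritten as  c - x + 0*y = 0  ->  bias = c, v = -1, w = 0
    return c, -1.0, 0.0

bias, v, w = weights_for_line(2.0, 1.0)        # the line y = 2x + 1
below = unit_output(bias, v, w, 0.0, 0.0)      # (0, 0) lies below the line
above = unit_output(bias, v, w, 0.0, 3.0)      # (0, 3) lies above the line

vb, vv, vw = weights_for_vertical(2.0)         # the vertical line x = 2
left = unit_output(vb, vv, vw, 1.0, 5.0)
right = unit_output(vb, vv, vw, 3.0, 5.0)
print(below, above, left, right)
```

Negating all three weights describes the same line but swaps which side produces the output 1.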

Kernel

Shortcut that helps us do certain calculation faster which otherwise would involve computations in higher dimensional space.

Data reduction

Should remove what's unnecessary, yet otherwise maintain the distribution and properties of the original data • Data cube aggregation • Attribute subset selection (feature selection) • Dimensionality reduction (manifold projection) • Numerosity reduction • Discretization

Weighted adjacency matrix W

Shows how a graph is connected

Consistency

Shrinking intracluster distances and expanding intercluster distances does not change the clustering

Engineers

Signal processing and automatic control

Significance level

Significance level α%: α times out of 100 you would find a statistically significant difference between the distributions even if there was none. It essentially defines our tolerance level. If the calculated t value is above the threshold chosen for statistical significance then the null hypothesis that the two groups do not differ is rejected in favor of the alternative hypothesis: the groups do differ.

Why is the making of silicon neurons not applicable?

Silicon chips and circuits are two dimensional, whereas the communication between biological neurons in a brain occurs in three dimensions; therefore the communication between silicon neurons won't work

What does multiplexed mean?

Silicon neurons can carry many different impulses along the same wire.

Integrate-and-fire neurons

Silicon neurons that, like real neurons in the visual cortex, have the job of extracting information about the angles of lines and contrast boundaries in the retinal image

Mixture of Gaussians

Simple formulation: density model with richer representation than single Gaussian

What is an artificial neural network?

Simulating the brains neural networks on a computer

What are the characteristics of a good mini-batch composition?

Since the network learns faster from unexpected and new examples and that samples from the same classes contain similar information, it is best to compose mini-batches with samples of different classes. The same sample may be re-used in different mini-batches, as in this circumstance the sample would create a different error surface. Re-using the exact same mini-batch should however be avoided.

Training Perceptron rule

Single unit training, finds separating line Guarantees finite convergence for linearly separable

Slack variable

Slack variables are introduced to solve the optimization problem by allowing some training data to be misclassified. Slack variables ξ_n >= 0 give a linear penalty to examples lying on the wrong side of the decision boundary (d.b.): ξ_n = 0 for a point on the correct side of the d.b., and ξ_n = |t_n - y(x_n)| otherwise

What is a Self Organizing Map?

So far we have looked at networks with supervised training techniques, in which there is a target output for each input pattern, and the network learns to produce the required outputs.

The backpropagation algorithm is used to find a local minimum of the error function

T. The network is initialized with randomly chosen weights. The gradient of the error function is computed and used to correct the initial weights.

Constrained Teacher

Teacher unable to give the sought after function immediately Must show what is relevant and irrelevant through examples

Signal Transduction Step 2

Terminal buttons secrete neurotransmitters into the synaptic cleft. (Chemical transmission)

In mathematical formulae reference to unit 0 is just a shorthand for taking into account any biases that the units have

The N inputs are likewise labelled 0 ... N to include the bias. The weights are wh from the hth hidden unit to the output and wih from input i to hidden unit h

Can you explain the last statement of the paragraph above?

The Pocket algorithm just adds a step of remembering good sets of weights and how good they were. There are a number of variations on this idea

Learning rules

The ____ that train networks do so by modifying the strength of the connections between the neurons, a common one being a rule that takes the output of the network to a given input pattern and compares it with the desired pattern

Physicists

Statistical mechanics

What are the main application of Artificial neural networks

Statisticians - classification Philosophers - mind v. machine

Gradient points in the direction of

Steepest Ascent

Training of the weights from the input to the hidden nodes is given as follows

Step 1: The synaptic weights of the network between the input and the Kohonen layer are set to small random values in the interval [0, 1].

step 2 training kohonen

Step 2: A vector pair (x, y) of the training set is selected at random.

step 4 training kohonen

Step 4: The normalized input vector is sent to the network

step 6 training kohonen

Step 6: The winner neuron is identified as the one with the shortest distance

step 7 training kohonen

Step 7: The synaptic weights between the winner neuron and the neurons of the input layer are adjusted according to equation (6.2): W_wi(t+1) = W_wi(t) + α(t)(X_i - W_wi(t)), where α(t) is the learning rate at step t.
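
Winner selection and the weight update can be sketched together in Python. This assumes Euclidean distance for choosing the winner, updates the winner only (no neighbourhood function), and uses a constant value for the learning-rate factor, called `rate` here; the weights and the input vector are made up.

```python
import math

def winner_index(weights, x):
    # The winner is the unit whose weight vector is closest to the input
    return min(range(len(weights)), key=lambda j: math.dist(weights[j], x))

def update_winner(weights, x, rate):
    # Move the winner's weights a fraction `rate` of the way towards x:
    # w_new = w_old + rate * (x - w_old)
    j = winner_index(weights, x)
    weights[j] = [w + rate * (xi - w) for w, xi in zip(weights[j], x)]
    return j

weights = [[0.2, 0.8],
           [0.9, 0.1]]
x = [1.0, 0.0]
j = update_winner(weights, x, rate=0.5)
print(j, weights)
```

Repeating this over many inputs pulls each unit towards the centre of the cluster of inputs it wins.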

Learning:

Stimulation with an input results in a pattern of activation throughout the network. Activation changes the weights of the connections between units. Programming the computer to adjust the system's weights, called "back-propagation," permits the program to "learn". Training consists of numerous learning trials.

What is optimization done with respect to?

The approximation error measure

What does an autoassociative network enable the recurrent neural network to do?

Store patterns rather than merely pairs of items.

DM Subjective Interest

Subjective measures require a human with domain knowledge to provide measures: • Unexpected results contradicting a priori beliefs • Actionable • Expected results confirming hypothesis

Intrinsic dimensionality

Subspace of data space that captures degrees of variability only, and is thus the most compact possible representation

Bayesian equation to sigmoid function

Substitute Divide through by numerator term Cancel common terms

Resection surgery typical target areas

Supplementary motor area (SMA) hippocampus amygdala frontal lobe areas

Cross-validation criterion

Split training data in a number of folds. For each fold, train on all other folds and make predictions on the held-out test fold. Combine all predictions and calculate error. If error has gone down, continue splitting nodes, otherwise, stop

Validation set criterion

Split the training data into a training set and a validation set (e.g. 66% training data and 34% validation data). Keep splitting nodes, using only the training data to learn decisions, until the error on the validation set stops going down.

Decision Tree Regression

Splits on attributes using a measure such as variance. Outputs an average, linear fit, or other numerical function.

Explain the holdout method

Splitting data into training data and test data. Use the training data to create the model. Use the test data to score the accuracy.
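
A small sketch of the holdout method, assuming a 70/30 split and a toy threshold model. The helper names `holdout_split` and `accuracy`, the split ratio, and the data are all hypothetical choices for illustration.

```python
import random

def holdout_split(data, test_fraction=0.3, seed=0):
    """Shuffle a copy of the data and split it into training and test parts."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    n_test = int(round(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]

def accuracy(model, test_rows):
    """Fraction of (x, label) pairs that the model labels correctly."""
    correct = sum(1 for x, label in test_rows if model(x) == label)
    return correct / len(test_rows)

# Toy data (hypothetical): the label is 1 when x >= 5
data = [(x, int(x >= 5)) for x in range(10)]
train, test = holdout_split(data)

# "Training": use the smallest x labelled 1 in the training set as the threshold
threshold = min(x for x, label in train if label == 1)
model = lambda x: int(x >= threshold)
test_accuracy = accuracy(model, test)
```

The test rows are never seen while choosing `threshold`, which is what makes the score an estimate of performance on new data.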

Cost Function

Squared error cost function. J(S)

Curse of Dimensionality

The curse of dimensionality refers to how certain learning algorithms may perform poorly on high-dimensional data. First, it is very easy to overfit the training data, since many different assumptions can describe the target label (in the case of supervised learning); in other words, we can easily express the target using the dimensions that we have. Second, we may need to increase the amount of training data exponentially to overcome the curse of dimensionality, and that may not be feasible. Third, in learning algorithms that depend on distance, like k-means for clustering or k-nearest neighbors, all points can become far from each other, and it is difficult to interpret the distance between data points.

Could a network learn the patterns/rules involved in correctly pronouncing letters?

The task was to train a network to produce the proper phoneme (output) for a given string of letters (input.) The inputs were strings of 7 letters with the task being to map the central letter to a specific phoneme (the outputs were fed to a voice synthesizer so we can hear how well the network performs.)

supervised backpropagation learning

The goal of any supervised learning algorithm is to find a function that best maps a set of inputs to their correct output.

limitation of back propagation learning

The gradient descent algorithm is generally very slow because it requires small learning rates for stable learning. The momentum variation is usually faster than simple gradient descent, since it allows higher learning rates while maintaining stability, but it is still too slow for many practical applications.

If we calculate the rate of change of a function f with respect to a vector v, what is the gradient of v?

The gradient of v is the vector that contains the partial derivatives of f with respect to each of the variables contained in v.

Kohonen's model

The grid of computing elements allows us to identify the immediate neighbors of a unit

idea to train the system

The idea is to train the system to recognize certain input patterns in the connection region, which in turn leads to the appropriate path through the connections to the reaction layer

Kernel trick

The key element of kernel methods is that they do not actually map features to this space; instead, they return the inner product (a measure of similarity) between elements in this space. This implicit mapping is called the kernel trick.

What are the characteristics of the learning constant?

The learning constant c: - has a value between 0 and 1 - close to 0, the initial weights have more influence (slow to change) - close to 1, the weights are most sensitive to the most recent interaction - an adaptive approach is to set c high initially and decrease it over time

What is ANN inspired by?

The learning processes that take place in biological systems

The pronunciation of a letter is heavily influenced by...

The letters surrounding it.

Connections between the inputs:

The lines connect each layer. There are weights associated with each connection.

Could a neural network learn to identify numbers despite the different writing styles?

The network does learn to "recognise" written numbers.

Signal Transduction Step 5

The new electrical impulse is actively propagated with no signal loss in the transmission of the electrical signal

In unsupervised, competitive learning, what determines the winning unit? (Hint: grandmother cell)

The unit with the largest net input is the winner

Weak Relevance

There exists a set of features such that adding x_i to it improves the Bayes optimal classifier

If there were no error in the output then there would be no need for training.

Therefore there must be an element (x, t) of the training set for which the calculated output is incorrect

What do transistors do in 'neurons'?

They act like the cell membrane of actual neurons. Additional transistors provide conductances that emulate the voltage- and time-dependent current flows of real ion channels.

How do integrate-and-fire 'neurons' work?

They add up the weighted inputs, coded as voltages, that are arriving at their synapses, and only 'fire' an action potential if the voltage reaches a set threshold.

How can the synaptic strength be modified in artificial neural networks?

They can be derived by applying mathematical optimization methods; learning tasks can be reformulated as function approximation tasks; neural networks can be considered as nonlinear function approximating tools where the parameters of the networks should be found by applying optimization methods.

What happens when silicon neurons are multiplexed?

They imitate the connectivity of the brain

What can additional transistors provide?

They provide conductances to imitate the voltage- and time-dependent current flow of ion channels

Tree varieties

Trees are called monothetic if one property/variable is considered at each node, polythetic otherwise

Sigmoid Activation Function

Use a sigmoid activation to make the unit differentiable: sig(a) = 1/(1 + e^(-a))
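
As a short illustration of why differentiability matters, the sketch below implements the sigmoid and its derivative. The closed form sig'(a) = sig(a)(1 - sig(a)) is a standard identity rather than something stated on the card, and the function names are illustrative.

```python
import math

def sigmoid(a):
    """Logistic sigmoid: sig(a) = 1 / (1 + e^(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

def sigmoid_derivative(a):
    """Derivative of the sigmoid, using sig'(a) = sig(a) * (1 - sig(a))."""
    s = sigmoid(a)
    return s * (1.0 - s)

# The slope is largest at a = 0 and shrinks as |a| grows
slope_at_zero = sigmoid_derivative(0.0)
slope_far_out = sigmoid_derivative(6.0)
```

A step function has no useful gradient, whereas this derivative is defined everywhere, which is what gradient-based training relies on.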

Non-linearly Separable Problem

Usually problems aren't linearly separable (not even in feature space). 'Perfect' separation of training data classes would cause poor generalization due to massive overfitting.

node

V

Labels

Values that h aims to predict. Examples: • Facial expressions of pain • Impact of diet on astronauts in space • Predictions of house prices

KNN with k=n (weighted average)

Variable output, closer points have more effect

Network Topology

Variations include: • Arbitrary number of layers • Fewer hidden units than input units (causes in effect dimensionality reduction, equivalent to PCA) • Skip-layer connections • Fully/sparsely interconnected networks

K Nearest Neighbors Classifier

Predict whichever label y wins the plurality (majority) vote among the nearest neighbors
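
A minimal sketch of the voting idea, assuming Euclidean distance and a default of k = 3. The function name `knn_classify` and the toy points are hypothetical.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Return the most common label among the k training points nearest to query."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Two well-separated toy clusters (hypothetical data)
train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]

label_near_origin = knn_classify(train, (0.5, 0.5))
label_far_corner = knn_classify(train, (5.5, 5.5))
```

Setting `k=1` reduces this to the 1-NN rule of assigning the class of the single nearest neighbor.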

The Forward Pass 2

We figure out the total net input to each hidden layer neuron, squash the total net input using an activation function (here we use the logistic function), then repeat the process with the output layer neurons.

Perceptron Learning Algorithm

We have a "training set", which is a set of input vectors used to train the perceptron. ● During training both wi and θ (bias) are modified; for convenience, let w0 = θ and x0 = 1 ● Let η, the learning rate, be a small positive number (small steps lessen the possibility of destroying correct classifications) ● Initialise wi to some values

Soft Margin

We have effectively replaced the hard margin with a soft margin. The new optimization goal is maximizing the margin while penalizing points on the wrong side of the decision boundary.

Organization of the Mapping

We have points x in the input space mapping to points I(x) in the output space: Each point I in the output space will map to a corresponding point w(I) in the input space.

Backpropagation networks

We have used a threshold or step function as the activation function of choice

unsupervised training, self organizing

We now turn to unsupervised training, in which the networks learn to form their own classifications of the training data without external help. To do this we have to assume that class membership is broadly defined by the input patterns sharing common features, and that the network will be able to identify those features across the range of input patterns.

What trees are preferable?

We prefer simple, compact trees, following Occam's Razor

Kohonen Networks

We shall concentrate on the particular kind of SOM known as a Kohonen Network. This SOM has a feed-forward structure with a single computational layer arranged in rows and columns. Each neuron is fully connected to all the source nodes in the input layer.

Forward propagate

We start from the input we have, pass it through the network layers, and calculate the actual output of the model in a straightforward manner.

PCA Pros

Well studied; fast

Blocking Paths

When a path is blocked, no information can flow through it. This means that if observing C blocks a path A-C-B, there is no added value in also observing A, and B is fully determined by C.

Are artificial neural networks able to generalize to input patterns they have never seen before and notice regularities, and are they capable of retrieving stored patterns even when input patterns are noisy or corrupted?

Yes. They generalize to new input patterns, notice regularities in them, are tolerant of noise, and are capable of retrieving stored patterns even when the input patterns are noisy or corrupted.

Question: Where is the concept represented when it is distributed like this? PART II

With a distributed representation the concept can contain multiple features from different modalities and emotional tone. A trusting voice? Visualizing the face of someone on the radio?

How do we adjust the biases for ReLU?

With batch normalization.

State the differences between the Kohonen rule and the competitive learning rule

With the competitive rule only the weights on the inputs to the winning unit are updated, but with the Kohonen rule the weights on the inputs to the winning unit and to proximal units are updated.

a classification algorithm

a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector

programmer

a computer with specialized software and a wand that communicates with the pulse generator

fourth layer is used to normalize

a fourth layer is used to normalize the input vectors, but this normalization can be easily performed by the application before these vectors are sent to the Kohonen layer

Lyapunov Exponent

a mathematical algorithm used to measure the amount of chaos in a dataset. Will decrease when a seizure happens.

COUNTER PROPAGATION NEURAL NETWORK I

a means to combine an unsupervised Kohonen layer with a teachable output layer known as the Grossberg layer

Threshold Value

a minimum electrical value required for the neuron to fire. If the sum is equal to or greater than this value, the neuron will fire. If it is less, the neuron will not fire

learning rule

a procedure of a training algorithm for modifying the weights and biases of a network

ketosis

a state of metabolism when the liver excessively converts fat into fatty acids and ketone bodies

Basis Function

a summation of all of the inputs a node receives. The amount of stimulation a node receives

self-organizing map (SOM) or self-organizing feature map (SOFM)

a type of artificial neural network (ANN) that is trained using unsupervised learning to produce a low-dimensional (typically two-dimensional), discretized representation of the input space of the training samples, called a map, and is therefore a method to do dimensionality reduction. Self-organizing maps differ from other artificial neural networks as they apply competitive learning as opposed to error-correction learning (such as backpropagation with gradient descent), and in the sense that they use a neighborhood function to preserve the topological properties of the input space

These formulae give us a means of designing units which, in order to output a 1 on firing:

a. require all inputs to be a 1 (that is an AND gate) by setting w = 1 and bias = -N (the number of inputs) b. require at least one input to be 1, by setting w = 1 and bias = -1 (this is an inclusive OR gate) c. require at least a certain number of inputs to be 1, again setting w = 1 and bias = - the number of inputs that we want to be on. This represents a sort of 'voting' circuit d. require at most M inputs to be 1 by setting w = -1 and bias = M e. require at least one of the inputs not to be 1 by setting w = -1 and bias = N - 1 (this is the NAND gate)
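
The five designs (a) to (e) can be checked with a small threshold unit. This sketch assumes the unit fires when the net input, bias included, is greater than or equal to zero; that firing convention and the function names are assumptions made for illustration.

```python
def threshold_unit(inputs, w, bias):
    """Fire (output 1) when bias + w * (number of active inputs) is at least zero."""
    net = bias + sum(w * x for x in inputs)
    return 1 if net >= 0 else 0

def and_gate(inputs):                 # (a) all inputs must be 1
    return threshold_unit(inputs, w=1, bias=-len(inputs))

def or_gate(inputs):                  # (b) at least one input is 1
    return threshold_unit(inputs, w=1, bias=-1)

def at_least(inputs, k):              # (c) 'voting': at least k inputs are 1
    return threshold_unit(inputs, w=1, bias=-k)

def at_most(inputs, m):               # (d) at most m inputs are 1
    return threshold_unit(inputs, w=-1, bias=m)

def nand_gate(inputs):                # (e) at least one input is not 1
    return threshold_unit(inputs, w=-1, bias=len(inputs) - 1)
```

For example, `and_gate([1, 1, 1])` gives a net input of 3 - 3 = 0 and fires, while `and_gate([1, 0, 1])` gives 2 - 3 = -1 and does not.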

nonsynchronous update

all even or odd numbered nodes are updated

sigmoid function

an S-shaped mathematical curve that is often used to describe the activation function of a neuron over time

Backpropagation and error

an error is computed at the output and distributed backwards throughout the network's layers. It is commonly used to train deep neural networks.

AEDs

anti-epileptic drugs First line of treatment

Directed Acyclic Graphs (DAGs)

are the graph structure underlying Bayesian networks, meaning there are no cyclic paths from any node back to itself

neural networks

are software systems that can train themselves to make sense of the human world.

h_ML

argmin(sum((d_i-h(x_i))^2)) (also sum of squared error)

h_MAP

argmin[-lg(P(D|h))-lg(P(h))] or minimization of (length(D|h)+length(h))

1-NN

assign class of nearest neighbor

inductive learning

assume just the data: -cleaner and good base case -more complex mechanisms needed to reason with prior knowledge

A few 'loose ends' need to be considered: in kohonen layer

b) How many units (classes) do we use? This is not an easy question and again a number of heuristics (rules of thumb) have been tried. One might start with just a few units and after training see what happens if others are added 'in between'. One can see if the resulting network is better or worse. Alternatively, one might start with ample units and, again after training, see what happens if one or more is removed.

A few 'loose ends' need to be considered: in kohonen layer

c) Finally what does 'done enough' mean? This might be how long learning has been going for, or how many times around the loop we have gone - or better it might mean that the performance of the network is 'more than adequate'. Often, however, it means that the classes are not changing much and so the return for extra effort seems not to be worthwhile

activation function

calculates whether the strength of the input is great enough to meet the threshold

Activation Function

calculates whether the strength of the input is great enough to meet the threshold. Compares total input of node to the threshold value of that node. Determines whether the node will fire

Perceptrons as weighted threshold elements

called the classical perceptron and the model analyzed by Minsky and Papert the perceptron

neurons

called units and they receive information from other units

What data are in input variables?

can include both categorical and numeric data

Limitations of TMS

cannot stimulate deeper brain tissue; cannot determine which neuronal areas are being stimulated; high-frequency pulses can overheat the coil

Synaptic Plasticity

capacity for change. Connection can become stronger or weaker

Axons

carry information away from the cell body towards the synaptic cleft

Linear Separability

categories are linearly separable if one can categorize the examples perfectly by adding up and weighting the evidence from individual features

Self-learning

certain bodily acts are linked with certain mental experiences on the basis of first person experience.

state the Kohonen learning rule for neural networks

change in weight: Δw_ij = r * NF(i, i*) * (x_j - w_ij) for all i, j, where NF is the neighbourhood function and NF(i, i*) = 1 when i = i*
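
The rule can be sketched with units arranged on a one-dimensional grid. The card only requires NF(i, i*) = 1 at the winner, so the Gaussian neighbourhood used here is an assumed choice, as are the names `neighbourhood` and `kohonen_update` and the sample weights.

```python
import math

def neighbourhood(i, winner, width=1.0):
    """Gaussian neighbourhood on a 1-D grid (assumed form); equals 1 when i == winner."""
    return math.exp(-((i - winner) ** 2) / (2 * width ** 2))

def kohonen_update(weights, x, r=0.5, width=1.0):
    """Apply delta w_ij = r * NF(i, i*) * (x_j - w_ij) to every unit i."""
    winner = min(range(len(weights)), key=lambda i: math.dist(weights[i], x))
    for i, w in enumerate(weights):
        nf = neighbourhood(i, winner, width)
        weights[i] = [wij + r * nf * (xj - wij) for wij, xj in zip(w, x)]
    return winner

weights = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
winner = kohonen_update(weights, x=[1.0, 0.8])
```

Unit 2 wins and moves the most; its grid neighbour (unit 1) moves less and unit 0 least. Replacing the neighbourhood with a function that is 1 at the winner and 0 elsewhere gives the plain competitive learning rule.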

state the competitive learning rule

change in weight = r(input - weight of winning unit)

Benefits of Chemical Transmission

chemical transmission takes place in the synapses between neurons, enabling nerve impulses to be transmitted from one neuron to the next. Gives the brain the flexibility that is required for learning

What does descriptive data mining include?

clustering: identify distinct groupings of data; summarizing: averages, associations, test statistics; anomaly detection: identify unusual items or data points

Hopfield convergence

computation terminates since: - each node update decreases the (lower-bounded) energy E - the number of possible states is finite - the number of possible node updates is limited - node output values are bounded. The continuous model generalizes the discrete model, but the size of the state space is larger and the energy function may have more local minima.

Structure of a Neuron

consists of a cell body, dendrites, and axon. Different neurons can have different forms. Forms of the components may vary

Nucleus

contains genetic code and is involved in protein synthesis

d) A single threshold unit has two inputs, ?a and ?b which have initial weights of 0.25 and 0.5 respectively. If the initial value of the bias is -0.75, calculate i) the activation when presented with the input of (0.5, 0.1) with target 1 ii) the updated weights after each round of training twice with this input.

d) candidates were given the initial weights and bias of a single two-input threshold unit. The question simply required them to calculate both forward and backward passes. Good answers showed the calculations involved and, noting that no learning rate was specified, chose a sensible value for this. Unfortunately, some candidates used a sigmoid unit while others omitted the bias or failed to update it, despite threshold unit and bias value being clearly given in the question. This use of the sigmoid shows a lack of understanding of the basic differences between backpropagation networks and Perceptron networks.

VC number and input dimensionality relationship

d-D : VC d+1

Where do we store data in R?

data frame

Why is 1-NN not always best?

real-world data is very noisy

Drawback of avoiding overfitting method

data withheld for test set is not used for training

Inhibitory Potential

decreases the probability of the firing of a neuron. Negative

How has modern supervised learning improved?

deep learning

definition of perception

define the terms: threshold units, step units, step activation, extended truth table, clamped, threshold, bipolar activation and recurrent

Graceful Degradation

degraded input, problems with links, or damage to nodes typically does not bankrupt the system. Often leads to impaired, but interpretable, performance

Degree matrix D

degrees d_i on the diagonal: D = (d_i δ_ij)_{i,j=1,...,N} (the Kronecker delta δ_ij is 1 if i = j, 0 otherwise)

weight update function for Kohonen learning

delta_w = n(i - w), where n = learning rate, i = input, w = weight

what multilayer networks

describe the architecture of the ANN called a multi-layered feedforward network (MLFFN) • explain the difference between MLFFNS and MLFFNT • carry out simple hand simulations of MLFFNTs • build an MLFFNT corresponding to a given arbitrary truth table • discuss the limitations and possible applications of MLFFNTs

SLC Pros

deterministic, if doing in graph space can be solved as spanning tree problem

Limitations of ANN

difficult to trace the steps; not possible to check intermediate computations; requires a large number of training examples

SVM: ||w||

distance between two hyper-planes (margin)

Average Linkage Clustering

the distance between two clusters is the mean distance between an observation in one cluster and an observation in the other cluster

Interconnectivity in the Brain

distributed processing- every neuron receives information from several thousand other neurons and can output to a wide number of neurons. Parallel processing

gradient descent/hill climbing

find optimum by iteratively following the gradient/slope of the parameter space

Richness

for any assignment of objects to clusters, there is some distance D such that P_0 returns that clustering

stimulation of the supplementary motor area

for motor seizures initiated in this zone Simultaneous biphasic pulses 30 Hz, 3.0V

Stimulation of the hippocampus

for the control of temporal lobe seizures 1 min trains of Lilly (biphasic) with 4 min rest 130 Hz, 450microSeconds, 400-600microAmps

True error

fraction of examples that would be misclassified on sample drawn from distribution

Single Linkage Clustering

given k, link clusters until only k clusters remain. Link according to the distance between clusters (nearest two points). O(n^3)

Optimizing NN weights strategies

gradient descent, momentum, higher order derivatives, randomized optimization, penalty for complexity

parameter optimization strategies

grid search, gradient descent

kohonen layer network

The hidden layer is a Kohonen network with unsupervised learning and the output layer is a Grossberg (outstar) layer fully connected to the hidden layer. The output layer is trained by the Widrow-Hoff rule. Allows the output of a pattern rather than a simple category number. ... Training is a two-stage procedure.

Wrapping Evaluation Methods

hill climbing, randomized optimization, forward search, backward search

A Perceptron unit is often described as representing 'a line in the plane'. Explain how this description applies. Your answer must include a diagram

how weights, including the bias, are used to form the activation through the threshold function to split the plane into two. It is essential to relate the equation of the line in the plane with the equation for threshold activation. Good answers also mentioned that the line is for two input units and more inputs required higher dimensions to illustrate what is going on. The question asked for a diagram and marks were deducted if this was omitted.

Using the four relations that define net and activation:

i) a = σ(net) = 1/(1 + exp(-net)) ii) net = Σ_{h=0}^{H} w_h a_h iii) a_h = σ(net_h) = 1/(1 + exp(-net_h)) iv) net_h = Σ_{i=0}^{N} w_{ih} x_i

Multi-layer perceptron (MLP)

i. Fully connected ii. Consist of 3 layers: 1. 1 Input Layer a. d - dimensional input x b. No neurons at input layer - each input unit simply emits input xi 2. 1+ Hidden Layer a. nH neurons in each hidden layer b. Each neuron uses a nonlinear activation function c. Weight wji indicates input to hidden layer - j = hidden layer, i = input layer 3. 1 Output Layer a. c neurons in output layer b. neurons use nonlinear activation functions that relate to the problem being solved

Spectral Clustering with Gaussian Laplacian Algorithm

i. Given a similarity matrix between data points, construct the weighted graph adjacency matrix W. ii. Compute the Laplacian L. iii. Compute the first k eigenvectors u_1, ..., u_k of L: Lu = λu (Lu = λDu for the Shi-Malik algorithm, which is more commonly used). Let U ∈ R^{N×k} be the matrix with the eigenvectors as columns. Take the rows z_i ∈ R^{1×k}, i = 1, ..., N, of U. Cluster {z_i}_{i=1,...,N} using k-means into clusters C_1, ..., C_k. Output: clusters A_1, ..., A_k with A_i = {v_j | z_j ∈ C_i}.

Activation Functions

i. Sigmoid function ii. Hyperbolic tangent function iii. Rectified Linear Units - used for hidden layer neurons iv. Softmax function - used for output layer neurons

Perceptron Theorem

if a linear discriminant exists that can separate the classes without error, the training procedure is guaranteed to find that line or plane

Epsilon exhausted version space

if and only if for all h in the version space the error of h is less than Epsilon

Strong Relevance

if removing x_i degrades bayes optimal classifier

preferred threshold

if summed signals are greater than the threshold, the unit will pass information forward in the network

Using the four relations that define net and activation:

i) a = σ(net) = 1/(1 + exp(-net)) ii) net = Σ_{h=0}^{H} w_h a_h iii) a_h = σ(net_h) = 1/(1 + exp(-net_h)) iv) net_h = Σ_{i=0}^{N} w_{ih} x_i, we get a = σ(net) = 1/(1 + exp(-net)) (by i) = 1/(1 + exp(-Σ_{h=0}^{H} w_h a_h)) (by ii) = 1/(1 + exp(-Σ_{h=0}^{H} w_h (1/(1 + exp(-net_h))))) (by iii) = 1/(1 + exp(-Σ_{h=0}^{H} w_h (1/(1 + exp(-Σ_{i=0}^{N} w_{ih} x_i))))) (by iv)
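
The nested expression is just the forward pass of a two-layer network written out in full. The sketch below evaluates it step by step, assuming the index-zero terms carry the biases (x_0 = 1 and a_0 = 1); the function names and weight layout are hypothetical.

```python
import math

def sigma(net):
    """Logistic activation: 1 / (1 + exp(-net))."""
    return 1.0 / (1.0 + math.exp(-net))

def forward(x, w_hidden, w_out):
    """Evaluate relations iv, iii, ii, i in that order.

    w_hidden[h - 1][i] holds w_ih for hidden unit h (i = 0 is its bias weight);
    w_out[h] holds w_h (h = 0 is the output bias weight).
    """
    x = [1.0] + list(x)                 # x_0 = 1 carries the hidden-layer bias
    a_hidden = [1.0]                    # a_0 = 1 carries the output bias
    for w_h in w_hidden:
        net_h = sum(w * xi for w, xi in zip(w_h, x))      # relation iv
        a_hidden.append(sigma(net_h))                     # relation iii
    net = sum(w * a for w, a in zip(w_out, a_hidden))     # relation ii
    return sigma(net)                                     # relation i

# One hidden unit with all-zero weights gives a_1 = 0.5, so net = 0 + 2 * 0.5 = 1
a = forward([0.3, 0.7], w_hidden=[[0.0, 0.0, 0.0]], w_out=[0.0, 2.0])
```

Here `a` equals σ(1), which matches substituting the same numbers into the fully expanded formula.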

What is the solution to samples with missing data?

impute missing data: plug in a good estimate for missing values

What are the similarities between biological neural networks and artificial neural networks?

in biological synapses the contribution of signals depends on the strength of the synaptic connection

Regarding the training process of the counter-propagation network, it can be described as a two-stage procedure;

in the first stage, the process updates the weights of the synapses between the input and the Kohonen layer, while in the second stage the weights of the synapses between the Kohonen and the Grossberg layer are updated.

Dirty data

incomplete, noisy, inconsistent

Valproic Acid

increases GABA which inhibits neuronal firing

Excitatory Potential

increases the probability of the firing of a neuron. Positive

BSB node update rule

initial activation is steadily amplified by positive feedback until saturation

Basis Function (forms)

initial input into a node- sensory layer input- entered by a programmer. Input from the environment- input from other nodes

What is a perceptron?

initial proposal of connectionist networks

auto-association

input vectors and output vectors range over the same vector space, eg. character recognition, eliminating noise from pixel arrays

ANNs Inspiration

inspired by the computing power of the brain. About 100 billion neurons.

types of numeric data in R

int, double/numeric

What are artificial neural network in a brain?

interconnected groups of nodes akin to the vast network of neurons in a brain

perceptron

is an algorithm for supervised learning of binary classifiers

Conditional independence in PGN

is the PGN mechanism to show information in terms of interesting aspects of probability distributions

What makes it a backpropagation network

is the algorithm used for learning

What are aspects of unsupervised learning?

labels are unknown / don't exist; uncover unknown structure in the data; infer hidden labels and number of groups; dimension reduction; visualization and exploratory analysis

binary classification

labels belong to only two classes

multiclass classification

labels belonging to three or more classes

Power of hypothesis space

largest set of inputs that hypothesis class can label in all ways

Splitting Criteria for Classification: measurement

Let I(N) denote the impurity of some node N. The goodness of a split is the weighted mean decrease in impurity. Information gain (IG): IG(N) = I(N) - p_{N1}*I(N1) - p_{N2}*I(N2)... see notes
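
A short sketch of the information gain formula, assuming entropy as the impurity measure I(N) and taking each weight p as the fraction of the parent's examples that fall in that child. Both choices are assumptions, as are the function names.

```python
import math
from collections import Counter

def entropy(labels):
    """Impurity I(N) measured as Shannon entropy in bits (assumed impurity measure)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(parent, children):
    """IG(N) = I(N) minus the size-weighted impurity of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["a"] * 4 + ["b"] * 4
perfect_split = information_gain(parent, [["a"] * 4, ["b"] * 4])
useless_split = information_gain(parent, [["a", "a", "b", "b"], ["a", "a", "b", "b"]])
```

A split into two pure children recovers the full 1 bit of impurity, while a split that leaves the class mix unchanged gains nothing.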

We learn the weights, we get the function.

let's use a perceptron to learn an OR function.
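
A minimal sketch of a perceptron learning OR, assuming a step activation that fires when the net input is at least zero, zero initial weights, and a learning rate of 0.1. These settings and the function names are illustrative assumptions; the bias is handled as w0 with x0 = 1.

```python
def step(net):
    """Threshold activation: 1 when the net input is at least zero."""
    return 1 if net >= 0 else 0

def train_perceptron(examples, eta=0.1, max_epochs=100):
    """Repeat passes over the data, correcting weights on each misclassified example."""
    n_inputs = len(examples[0][0])
    w = [0.0] * (n_inputs + 1)            # w[0] is the bias weight, paired with x0 = 1
    for _ in range(max_epochs):
        errors = 0
        for inputs, target in examples:
            x = [1] + list(inputs)
            activation = step(sum(wi * xi for wi, xi in zip(w, x)))
            if activation != target:
                errors += 1
                # correction: eta * (target - activation) * input
                w = [wi + eta * (target - activation) * xi for wi, xi in zip(w, x)]
        if errors == 0:                   # every example classified correctly: stop
            break
    return w

or_examples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w = train_perceptron(or_examples)
```

Because OR is linearly separable, the loop ends with weights that classify all four rows correctly.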

What are the values that factors take on one of?

levels

Infinite Hypothesis Spaces

linear separators, artificial neural networks, decision trees (continuous)

What are some applications of ANN?

loan approvals, law enforcement, character recognition, image compression, self-driving cars, football play predictor

NN Preference Bias

low complexity networks, low weights

Sample Complexity and VC Dimension

m >= (1/Epsilon) * (8 * VC(H) * lg(13/Epsilon) + 4 lg(2/delta))

Haussler Theorem

m >= (1/Epsilon) * (ln(|H|) + ln(1 / delta)). If you know your epsilon and delta targets, sample a lot and you'll be fine as long as H is discrete.
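
The two sample-complexity bounds can be evaluated directly. In this sketch the function names are hypothetical, and lg is read as the base-2 logarithm, which is an assumption about the card's notation.

```python
import math

def haussler_sample_bound(h_size, epsilon, delta):
    """Smallest integer m with m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return math.ceil((1.0 / epsilon) * (math.log(h_size) + math.log(1.0 / delta)))

def vc_sample_bound(vc_dim, epsilon, delta):
    """Smallest integer m with m >= (1/eps) * (8*VC(H)*lg(13/eps) + 4*lg(2/delta))."""
    bound = (1.0 / epsilon) * (8 * vc_dim * math.log2(13.0 / epsilon)
                               + 4 * math.log2(2.0 / delta))
    return math.ceil(bound)

# Example values (hypothetical): |H| = 10, epsilon = 0.1, delta = 0.2
m_finite = haussler_sample_bound(10, epsilon=0.1, delta=0.2)
m_vc = vc_sample_bound(3, epsilon=0.1, delta=0.2)
```

With these example values the finite-hypothesis bound works out to 10 * ln(50), which rounds up to 40 samples.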

Neurons

main job is to transmit information

What is transformation?

make distribution symmetric or more Gaussian e.g. log-transform (may make interpretation more difficult)

How to make a data set linearly separable?

mapping data to a higher-dimensional space

hetero-association

mapping input vectors to output vectors that range over a different vector space, e.g. translating English words to Spanish words

What does supervised learning require?

massive labeled datasets; labels usually come from humans

learning curve

a plot of mean squared error versus training time; a rising learning curve is bad

What is the unexplainable variation from unmeasured X's?

measurement error: ε, the "error term"

Usefulness

measures feature effect on particular predictor minimizing error given model/learner

simple partial

memory, awareness and consciousness are not impaired

method of backpropagation requirement

method requires computation of the gradient of the error function at each iteration step, we must guarantee the continuity and differentiability of the error function

unsupervised learning

mode of learning where the training data only contain inputs that describe the characteristics of the variables or patterns. The ANN clusters the data.

Cognitive scientists

models of thinking, learning, and cognition

pulse generator

monitors a patients ECoG and triggers electrical stimulation when needed

ICA Properties

mutually independent, maximal mutual information, bag of features

PCA Properties

mutually orthogonal, maximal variance, ordered features

Perceptron learning rule

In machine learning, the perceptron is an algorithm for supervised learning of binary classifiers (functions that can decide whether an input, represented by a vector of numbers, belongs to some specific class or not)

Side effects of ketogenic diet- short term

nausea, diarrhea, vomiting, hypoglycemia

Side effects of AEDs

nausea, drowsiness, confusion, dizziness, skin rash, weight gain, liver damage

For the hidden layer of a two-layer network and

net = Σ_{h=0}^{H} w_h a_h and a = σ(net) = 1/(1 + exp(-net))

Hopfield net performance rule

net_j = Σ_{i≠j} w_ij y_i; y_j = 1 if net_j > 0, 0 if net_j < 0, and y_j (unchanged) if net_j = 0
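
A small sketch of this update rule recovering a stored pattern from a noisy one. The rule itself follows the card; the Hebbian-style weight construction from bipolar-coded patterns is an assumption added so the example has weights to work with, and all names are hypothetical.

```python
def hopfield_weights(patterns):
    """Assumed storage rule: w_ij = sum over patterns of (2p_i - 1)(2p_j - 1), zero diagonal."""
    n = len(patterns[0])
    w = [[0] * n for _ in range(n)]
    for p in patterns:
        b = [2 * v - 1 for v in p]           # map 0/1 to -1/+1
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += b[i] * b[j]
    return w

def recall(w, state, max_sweeps=10):
    """Update nodes one at a time in cyclic order until no node changes."""
    y = list(state)
    n = len(y)
    for _ in range(max_sweeps):
        changed = False
        for j in range(n):
            net = sum(w[i][j] * y[i] for i in range(n) if i != j)
            new = 1 if net > 0 else 0 if net < 0 else y[j]
            if new != y[j]:
                y[j] = new
                changed = True
        if not changed:
            break
    return y

stored = [1, 0, 1, 1, 0, 1]
w = hopfield_weights([stored])
noisy = [1, 1, 1, 1, 0, 1]                   # second bit flipped
recovered = recall(w, noisy)
```

The flipped bit receives a negative net input and is reset to 0, after which a full sweep changes nothing and the network has settled on the stored pattern.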

Backpropagation is probably the most used feed-forward artificial neural

network and so we are going to see how this works

Architecture of an ANN

network composed of nodes and links with one or more layers

neural networks

networks of nerve cells that integrate sensory input and motor output - parts, workings - characteristics

Post-Synaptic Neuron

neuron after the synapse

Pre-Synaptic Neuron

neuron before the synapse

Hebbian learning

neurons that fire together wire together: simultaneous activation of cells leads to pronounced increases in synaptic strength between those cells

After minimizing this function for the training set,

new unknown input patterns are presented to the network and we expect it to interpolate. The network must recognize whether a new input vector is similar to learned patterns and produce a similar output

asynchronous update

nodes to be updated may be selected - in cyclic order or - at random. Each node in the network must have the opportunity to change state; this is called the "fairness" property for computer networks.

grid search

not efficient; create nested for loops that iterate over all possible combinations of parameters
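
A sketch of exhaustive grid search. It uses `itertools.product`, which visits the same combinations as a set of nested for loops; the `grid_search` helper, the parameter names `lr` and `k`, and the scoring function are all hypothetical.

```python
import itertools

def grid_search(score, grid):
    """Try every combination of parameter values and keep the best-scoring one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    # Equivalent to one nested for loop per parameter
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        current = score(**params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score

def toy_score(lr, k):
    """Hypothetical score that peaks at lr = 0.1 and k = 3."""
    return -((lr - 0.1) ** 2) - (k - 3) ** 2

grid = {"lr": [0.01, 0.1, 1.0], "k": [1, 3, 5]}
best_params, best_score = grid_search(toy_score, grid)
```

The cost is the product of the list lengths (9 evaluations here), which is why the method becomes inefficient as parameters are added.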

For KNN should k be even or odd?

odd to ensure we always have a majority

Why is having a large margin good?

of all the possible linear functions, this one is the most robust to outliers and thus has the best generalization ability

Stimulation of the centro-median nucleus

of the thalamus in the control of intractable generalized seizures and atypical absence seizures 2 hour daily 1 min trains of Lilly (biphasic) with 4 min rest 130 Hz, 450microSeconds, 400-600microAmps

How can learning process be stopped in backpropagation rule?

on the basis of the average gradient value. If the average gradient value falls below a preset threshold value, the process may be stopped.

Serial Processing

one computation at a time. First step must be completed before the next step can be started.

gradient descent definition

operates by keeping track of the current state and changing that state to (hopefully) improve performance

calculate the total error for the function

For example, the target output for o_1 is 0.01 but the neural network outputs 0.75136507, therefore its error is: E_{o1} = \frac{1}{2}(target_{o1} - out_{o1})^{2} = \frac{1}{2}(0.01 - 0.75136507)^{2} = 0.274811083. Repeating this process for o_2 (remembering that the target is 0.99) we get: E_{o2} = 0.023560026. The total error for the neural network is the sum of these errors: E_{total} = E_{o1} + E_{o2} = 0.274811083 + 0.023560026 = 0.298371109.
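
The arithmetic can be reproduced in a few lines. The function name `squared_error` is illustrative; the value for E_{o2} is taken from the text as given, since the network's output for o_2 is not stated.

```python
def squared_error(target, output):
    """Per-output error: E = 1/2 * (target - output)^2."""
    return 0.5 * (target - output) ** 2

e_o1 = squared_error(0.01, 0.75136507)
e_o2 = 0.023560026           # quoted value; the o_2 output itself is not given
e_total = e_o1 + e_o2
```

Rounded to nine decimal places, `e_o1` and `e_total` agree with the figures in the worked example.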

If the unit correctly classifies all training examples then stop,

otherwise add a correction of η(target - activation)*input to each of the weights before testing the next example.

classification

outcome is a discrete variable (typically <10 outcomes)

regression

outcome is continuous

non-iterative association

output pattern is generated from the input pattern in a single iteration: 1. Hebb's law may be used to develop associative "matrix memories", or 2. gradient descent can be applied to minimize recall error

Note that if there is more than one output, it is easier to treat each

output separately as they have no means of communication

Probability Theory Recap

p(x) = marginal distribution p(x,y) = joint distribution p(x|y) = conditional distribution

Terminal Buttons

part of the axon. Secrete neurotransmitters and are positioned opposite the dendrites of other neurons

Cell Body (Soma)

part of the neuron that contains the nucleus and other organelles

Hyperbaric oxygen therapy

patient enters a small sealed chamber in which the oxygen level is increased

linear or nonlinear

perceptron

single-layer

perceptron, Hopfield net

supervised, binary or real

perceptron, multilayer perceptron (back propagation)

dense

perceptron, multilayer perceptron (back propagation), Hopfield net

random

perceptron, multilayer perceptron (back propagation), Hopfield net, competitive net

feedforward

perceptron, multilayer perceptron (back propagation), competitive net

ANN's can be used to...

predict, classify, cluster

How do we test/predict/evaluate our data?

predict new outputs, given only input

Splitting Criteria for Classification: Indicators of node impurity

pure node: all examples are of the same class; maximal impurity: both classes are equally probable; minimal impurity: all examples belong to the same class

What is categorical data?

qualitative distinctness: = and not= e.g. names, gender, medical diagnosis order: <, <=, >, >= e.g. levels of satisfaction, rating

What is numeric data?

quantitative discrete or continuous x / + - e.g. days until, hotter by, totals to, times more likely, twice as old

Dendrites

receive information from other neurons in close proximity and relay that information to the cell body. Treelike structure

Function of Neurons

receive, transmit, and communicate information throughout the brain

What happens if you weight the distance (play with scale)?

reduce noise, improve classification

Dilantin

reduces electrical conductance and stabilizes the inactive state of voltage-gated Na channels

Pruning

reducing the size of a network after it is trained

Resection Surgery

removal of epileptogenic tissue requires long-term EEG monitoring

What is feature subset selection?

remove unnecessary variables e.g. redundant names and ids

Alteration Process

repeated use causes increased strength. Lack of use causes decrease in strength. This deals with the strength of connections

Neurological Learning

requires synaptic plasticity and involves an alteration of a neural pathway or set of such pathways

idea of perceptron

if we repeatedly present inputs to a perceptron and change the perceptron weights and biases according to the error, the perceptron will eventually find weight and bias values that solve the problem, given that the perceptron can solve it. Each traverse through all of the training input and target vectors is called a pass

Restriction Bias

restricts hypothesis set (H) to probable answers

Rock-Mine Network (Output Layer)

rock or mine

avoiding overtraining

rotate the training sets and the test sets; early stopping; small networks; pruning; large training data sets

The derivative of the sigmoid with respect to x

s(x)(1 − s(x)).
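
A quick numerical check of this identity, as a sketch: the helper names and the central finite-difference comparison are assumptions added for illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_prime(x: float) -> float:
    """Derivative via the identity s'(x) = s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def numeric_derivative(f, x: float, h: float = 1e-6) -> float:
    """Central finite difference, used only to sanity-check the identity."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


for x in (-2.0, 0.0, 0.5, 3.0):
    print(x, sigmoid_prime(x), numeric_derivative(sigmoid, x))
```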

there are 2 types of CPN

They are: 1) Full counterpropagation network 2) Forward-only counterpropagation network. It is useful for rapid prototyping of systems

Write down the formulas for the following: sensitivity, specificity and accuracy

sensitivity = TP/(TP + FN) specificity = TN/(TN + FP) accuracy = (TP + TN)/(TP + TN + FP + FN) TP = true positive FP = false positive TN = true negative FN = false negative
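
As a sketch, the standard definitions (sensitivity = TP/(TP + FN), specificity = TN/(TN + FP), accuracy = (TP + TN)/(TP + TN + FP + FN)) can be computed from confusion-matrix counts. The function name and the example counts are made up for illustration.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # share of actual positives found
        "specificity": tn / (tn + fp),  # share of actual negatives found
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


# Hypothetical counts, chosen only to exercise the formulas.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
print(metrics)
```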

Input Layer

sensory layer. Simultaneous activity of all of the nodes in the input layer is a representation of the input stimulus

Classification tree

set of (splitting) rules to recursively partition a data set. => min mixture of classes (impurity) within nodes A supervised learning technique that uses a structure similar to a tree to segment data according to known attributes to determine the value of a categorical target variable

regression tree

set of (splitting) rules to recursively partition a data set. => min variance of the response within nodes A tree that predicts values of a continuous outcome variable by splitting observations into groups via a sequence of hierarchical rules.

step 1

simulate a neuron

Synaptic Cleft

small gap between neurons in which neurotransmitters are released. Do not touch. Electrical energy must be converted to chemical energy in the form of neurotransmitters

If a unit feeds more than one successor unit then there will be a contribution from each successor

so that εwh becomes a sum over all successors. We will not need this formula but you need to know that it exists

Kohonen learning

start : The n-dimensional weight vectors w1, w2,..., wm of the m computing units are selected at random. An initial radius r, a learning constant η, and a neighborhood function φ are selected. step 1 : Select an input vector ξ using the desired probability distribution over the input space. step 2 : The unit k with the maximum excitation is selected (that is, for which the distance between wi and ξ is minimal, i = 1,...,m). step 3 : The weight vectors are updated using the neighborhood function and the update rule wi ← wi + ηφ(i, k)(ξ − wi), for i = 1, . . . , m. step 4 : Stop if the maximum number of iterations has been reached; otherwise modify η and φ as scheduled and continue with step 1

formal definition of the associative memory problem

store a set of k patterns such that when presented with a new pattern Pi the network responds by producing a stored pattern which most resembles Pi

Computational Learning Theory

study of the design and analysis of machine learning algorithms

perceptron architecture

supervised, feedforward, dense, linear or nonlinear, random, single-layer, binary or real

multilayer perceptron (back propagation) architecture

supervised, feedforward, dense, nonlinear, random, multi-layer, binary or real

Output Layer

system's solution to a task

Comparing Hypotheses

t-test; Analysis of Variance (ANOVA) test

character of function in perceptron

the composite function produced by interconnected perceptrons is discontinuous

Links

the connections among nodes. Allow nodes to send and receive signals from other nodes. Functional equivalent to synapse. Have weights

Bayesian Learning

the distribution of the neural network parameters is learned

Machine Learning

the extraction of knowledge from data based on algorithms created from training data Technique for making a computer produce better results by learning from past experiences.

Calculating the output of the network is a simple process as we feed forward the inputs through

the hidden layer and then through the output layer

feature weighting

the idea of assigning more weight to features known to be more important for some particular classification problem (e.g. color if classifying fruit)

Nodes

the input and output units in an ANN. Function to transmit information throughout the system. Can receive multiple inputs that are summed to create a net input. Functional equivalent to neuron. Can output to multiple nodes. Have threshold values

step 3 training kohonen

the inputs x of the selected training pattern are selected

Biological neural learning and artificial neural learning both happen by what?

the modification of the synaptic strength

backpropagation motivation

the motivation for backpropagation is to train a multi-layered neural network such that it can learn the appropriate internal representations to allow it to learn any arbitrary mapping of input to output.

Learning as Optimization

the objective of adapting the responses on the basis of the information received from the environment is to achieve a better state. To achieve closer to the optimal state.

SVM Kernel Trick

the original input space can be mapped to some higher-dimensional feature space where the training set is separable

the perceptron is an algorithm

the perceptron is an algorithm for learning a binary classifier: a function that maps its input $\mathbf{x}$ (a real-valued vector) to an output value $f(\mathbf{x})$ (a single binary value).

Define overfitting

the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably

Information Theory

the quantification, storage, and communication of information

the function used in back propagation network

the sigmoid activation function s(net) = 1/(1 + exp(-net))

However, for our current purposes we need something different

the sigmoid activation function s(net) = 1/(1 + exp(-net)).

Perceptron

the simplest neural network possible: a computational model of a single neuron - conceptually - formally

Activation Value

the strength of the signal of the node. Basis function is the same as the activation value if the threshold has been met

artificial intelligence

the study of computer systems that attempt to model and apply the intelligence of the human mind

Signal Transduction Step 4

the synaptic potentials are passively diffused. This results in the summation of their excitatory and inhibitory effects (Passive electrical conduction)

Pre and Post Synaptic Neurons

the two neurons that form the synapse. Naming reflects the direction of information flow

margin

the width that the boundary could be increased by before hitting a data point

What do artificial neurons and neural networks try to imitate?

the working mechanisms of their biological counterparts

James McClelland & Connectionism theory

theory that memory is stored throughout the brain in connections between neurons, many of which can work together to process a single memory a type of information-processing approach that emphasizes the simultaneous activity of numerous interconnected processing units

Rationale for ANN

they attempt to model human intuition by simulating the physical process upon which intuition is based; theoretically capable of producing a proper response to a given problem even when the information is noisy or incomplete or when no set procedure exists for solving the problem

Take care not to confuse our (and others') use of x here with a coordinate - although inputs might represent coordinates

they can represent any characteristic that we are interested in.

We use backpropagation for supervised learning when we have a set of training data that we are trying to model

to approximate the function or classifier that the training set represents

supervised learning aims

to discover a function h(x) that approximates f(x), where h is a hypothesis

Fishers Ratio rewritten

to make dependence on weight vector explicit

parahippocampal place area

to places

one-against-one (pairwise) classification

train N(N-1)/2 binary classifiers for an N-way multiclass problem. Each receives the samples of a pair of classes from the original training set and must learn to distinguish these two classes. At prediction time, a voting scheme is applied: all N(N-1)/2 classifiers are applied to the new sample, and the class that got the highest number of "+1" predictions is predicted by the combined classifier
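
A small sketch of the bookkeeping only: it enumerates the N(N-1)/2 class pairs and applies majority voting to per-pair winners. The binary classifiers themselves are not modelled; `pairwise_tasks` and `vote` are illustrative names.

```python
from collections import Counter
from itertools import combinations


def pairwise_tasks(classes):
    """One binary task per unordered pair of classes: N(N-1)/2 in total."""
    return list(combinations(classes, 2))


def vote(pair_winners):
    """Return the class that won the most pairwise contests."""
    return Counter(pair_winners).most_common(1)[0][0]


classes = ["a", "b", "c", "d"]
tasks = pairwise_tasks(classes)
print(len(tasks), tasks)

# Hypothetical winners, one per pairwise classifier, for a single new sample.
predicted = vote(["a", "c", "a", "c", "b", "c"])
```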

one-against-all (one-against-rest) classification

train a single classifier per class, with the samples of that class as positive samples and all other samples as negatives (N classifiers in total for N classes). This strategy requires the base classifiers to produce a real-valued confidence score for their decision, and the class with the highest confidence is chosen

step 3

training and learning, increase the strength of appropriate connections and prune away inefficient connections

What is discretization?

transfer continuous into discrete e.g. rounding, converting numeric to categories

What is scaling?

transform features to mean 0 and variance 1 e.g. Z-scores
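
A minimal sketch of z-score scaling for a single feature, assuming the population standard deviation is used (some tools use the sample standard deviation instead). The data values are made up.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Shift to mean 0 and scale to variance 1 (population standard deviation)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


raw = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean 5, population std 2
scaled = z_scores(raw)
print(scaled)
```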

Neuro-physiologists-

understanding sensory systems and memory ANN can be used to understand how visual info is represented in V2 and V4 and higher levels of visual hierarchy; some studies, humans and ANN can solve same task Showed mice black and white movies while recording regions in visual cortex

Boosting initial importance distribution

uniform (1/n)

For a backpropagation neural network,

units usually have nets that add (Σ) and sigmoid activations, so that net_h = Σ_{i=0}^{N} w_ih x_i and a_h = σ(net_h) = 1/(1 + exp(-net_h))
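
A sketch of one such unit, assuming the index-zero term is a constant input x_0 = 1 carrying the bias weight and the activation is σ(net) = 1/(1 + exp(-net)). The function name and the example weights are illustrative.

```python
import math


def unit_activation(weights, inputs):
    """a_h = sigmoid(net_h), with net_h = sum over i >= 0 of w_ih * x_i."""
    x = [1.0] + list(inputs)  # x_0 = 1 so weights[0] acts as the bias
    net = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-net))


# net = 0.5 * 1 + 1.0 * 0.5 = 1.0
a_h = unit_activation([0.5, 1.0], [0.5])
print(a_h)
```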

What is sampling?

unnecessarily fast data collection e.g. sub-sampling (can create artifacts, destroy signal)

What is dimensionality reduction?

unnecessarily many dimensions e.g. principal component analysis

competitive net architecture

unsupervised, feedforward, sparse or dense, nonlinear, random, multi-layer, binary

Hopfield net architecture

unsupervised, recurrent, dense, nonlinear, random, single-layer, binary

How do we train our data?

use available input/output data to estimate this function

Benefits of ANN

used for pattern recognition, learning, classification, generalization, and interpretation of incomplete and noisy data lends human problem-solving characteristics to machine learning robust flexible fast and easy to maintain powerful

Resection surgery rationale

used for seizures not responsive to AEDs remove epileptogenic tissue without disturbing normal brain function

cross validation dataset

used to actively test the network during training, such that training can be stopped before the network is overtrained

vigilance parameter

user can select vigilance parameter to control dissimilarity between members of the same cluster in ART

Algorithms

very specific, step-by-step procedures for solving certain types of problems

Where is unsupervised learning useful?

visualization and exploratory analysis

A two unit Kohonen network has initial classes (1,1) and (-1,-1). Showing your working, train the network with an example of (0,1) followed by an example of (1,0) each with a learning rate of 0.5.

was designed to make calculations reasonably easy. Some did not remember to normalise the inputs and the units and some did not recognise that neither (1, 1) nor (-1, -1) are normalised
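
One way to work the question is sketched below. It assumes winner-only updates (no neighbourhood), Euclidean distance to pick the winner, and re-normalising the winning unit after each update; the card only says that inputs and units must be normalised, so the last point is an assumption.

```python
import math


def normalise(v):
    length = math.hypot(*v)
    return tuple(c / length for c in v)


def train_step(units, example, eta):
    """Move the closest unit towards the normalised example; return its index."""
    x = normalise(example)
    winner = min(range(len(units)), key=lambda i: math.dist(units[i], x))
    moved = tuple(w + eta * (xi - w) for w, xi in zip(units[winner], x))
    units[winner] = normalise(moved)  # assumption: keep units at unit length
    return winner


units = [normalise((1, 1)), normalise((-1, -1))]
winners = [train_step(units, example, eta=0.5) for example in [(0, 1), (1, 0)]]
print(winners, units)
```

Under these assumptions the first unit wins both examples and the second unit never moves.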

the result of backpropagation

we expect to find a minimum of the error function, where ∇E = 0.

What's going on above is that we defined a few conditions (the weighted sum has to be more than or equal to 0 when the output is 1) based on the OR function output for various sets of inputs,

we solved for weights based on those conditions and we got a line that perfectly separates positive inputs from those of negative.

Weights

weight between two nodes equals the strength of the connection between the nodes. Magnitude is typically between 0 and 1; the sign can be positive or negative.

for MCPs the new weight equals..?

weight_new = weight_old + change

The training set consists of a number of (input, output) pairs

which we label (?i , ti ) to emphasise that the x-components are inputs or (xi , ti ) when wanting to be more explicit.

Hopfield net learning rule

w_ij = Σ_s g(y_i[s]) g(y_j[s]) for i ≠ j, where g(y_i[s]) = 1 if y_i[s] = 1 and -1 if y_i[s] = 0, summed over each desired stable state s
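
A minimal sketch of this rule, reading it as a sum over the stored states s with no self-connections (i ≠ j). The synchronous recall loop and the rule of keeping a node's value when its net input is zero are assumptions for illustration.

```python
def to_bipolar(pattern):
    """g(y): map a binary value 1/0 to +1/-1."""
    return [1 if bit == 1 else -1 for bit in pattern]


def hopfield_weights(patterns):
    n = len(patterns[0])
    w = [[0] * n for _ in range(n)]
    for pattern in patterns:
        g = to_bipolar(pattern)
        for i in range(n):
            for j in range(n):
                if i != j:  # no self-connections
                    w[i][j] += g[i] * g[j]
    return w


def recall(w, pattern, max_steps=10):
    state = to_bipolar(pattern)
    for _ in range(max_steps):
        nets = [sum(wij * sj for wij, sj in zip(row, state)) for row in w]
        # Sign of the net input; a zero net keeps the node's current value.
        new_state = [1 if net > 0 else -1 if net < 0 else s
                     for net, s in zip(nets, state)]
        if new_state == state:
            break
        state = new_state
    return [1 if s == 1 else 0 for s in state]


stored = [1, 0, 1, 0]
w = hopfield_weights([stored])
print(recall(w, [1, 0, 1, 1]))  # corrupted copy of the stored pattern
```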

SVM: y = w(^T)x + b variables

y: classification label w: parameters of plane (hyperplane) b: moves plane in/out of origin

multilayer perceptron (back propagation) learning rule

Δw_ij = k δ_j x_i, where δ_j = (t_j - y_j) f'(net_j) for output units and δ_j = (Σ_k δ_k w_jk) f'(net_j) for hidden units; t denotes the desired outputs

competitive net learning rule

Δwij={k(xi-wij) for winning j, 0 else}

The simplest ANNs consist of

• A layer of D input nodes • A layer of hidden nodes • A layer of output nodes • Fully connected between layers

Frequent Itemset

• Absolute support of an itemset is its frequency count • Relative support is the frequency count of the itemset divided by the total size of the dataset

ANN feature selection

• Artificial Neural Networks can implicitly perform feature selection • A multi-layer neural network where the first hidden layer has fewer units (nodes) than the input layer • Called 'Auto-associative' networks

Scientists that use Artificial Neural Networks:

• Computer scientists (information processing and learning, image classification, object detection and recognition) • Statisticians (classification) • Engineers (signal processing and autonomic control) • Physicists (statistical mechanics) • Biologists (predicting protein shape from mRNA sequences, disorder diagnostic, personalised medicine) • Philosophers (Minds and Machines) • Cognitive scientists (models of thinking, learning and cognition) • Neuro-physiologists (understanding sensory systems and memory)

Model Combination View

• Decision Trees combine a set of models (the nodes) • In any given point in space, only one model (node) is responsible for making predictions • Process of selecting which model to apply can be described as a sequential decision making process corresponding to the traversal of a binary tree

Causes of inconsistent data

• Different data sources • Functional dependency violation (e.g., modify linked data)

Min-max normalization

• Enables cost-function minimization techniques to function properly, taking all attributes into equal account • Transforms all attributes to lie on the range [0, 1] or [-1, 1] • Linear transformation of all data
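
A sketch of the linear transformation for one attribute, assuming the attribute is not constant (otherwise the range is zero and the division fails). The function and parameter names are illustrative.

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they span [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    scale = (new_max - new_min) / (old_max - old_min)
    return [new_min + (v - old_min) * scale for v in values]


print(min_max([10, 20, 30]))             # onto [0, 1]
print(min_max([10, 20, 30], -1.0, 1.0))  # onto [-1, 1]
```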

Data Integration

• Entity identification problem • Redundancy detection • Correlation analysis • Detection and resolution of data value conflicts • e.g. weight units, in/exclusion of taxes

Search Methods

• Exhaustive • Greedy forward selection • Greedy backward elimination • Forward-backward approach

Causes of noisy data (incorrect values)

• Faulty data collection instruments • Human or computer error at data entry • Errors in data transmission

Artificial Neural Nets

• Feed-forward neural network/Multilayer Perceptron one of many ANNs • We focus on the Multilayer Perceptron • Really multiple layers of logistic regression models

Evaluation Procedures

• For large datasets, a single split is usually sufficient. • For smaller datasets, rely on cross validation

DM systems can be divided into types based on a number of variables

• Kinds of databases • Kinds of knowledge • Kinds of techniques • Target applications

Decision surfaces are

• Linear functions of x • Defined by (D-1) dimensional hyperplanes in the D dimensional input space.

Commonly used kernels

• Linear kernel • Polynomial kernel • Gaussian kernel (Gaussian kernel is probably the most frequently used kernel out there - Gaussian kernel maps to infinite feature space)

PCA

• Manifold projection • Assumes Gaussian latent variables and Gaussian observed variable distribution • Linear-Gaussian dependence of the observed variables on the latent variables • Also known as Karhunen-Loève transform

Parametric methods

• Many methods learn parameters of prediction function (e.g. linear regression, ANNs) • After training, training set is discarded. • Prediction purely based on learned parameters and new data.

PCA requires calculation of

• Mean of observed variables • Covariance of observed variables • Eigenvalue/eigenvector Computation of covariance matrix

HDF5

• Much more complex file format designed for scientific data handling • It can store heterogeneous and hierarchical organized data. • It has been designed for efficiency.

Sparse Kernel Methods

• Must be evaluated on all training examples during testing • Must be evaluated on all pairs of patterns during training - Training takes a long time - Testing too - Memory intensive (both disk/ RAM) Solution: sparse methods

DM integration with DBS/ Data Warehouses

• No coupling - DMS will not utilize any DB/DW system functionality • Loose coupling - Uses some DB/DW functionality, in particular data fetching/storing • Semi-tight coupling - In addition to loose coupling use sorting, indexing, aggregation, histogram analysis, multiway join, and statistics primitives available in DB/DW systems • Tight coupling

Output layer can be

• Single node for binary classification • Single node for regression • n nodes for multi-class classification

Rulesets

• Single rules are not the solution of the problem, they are members of rule sets • Rules in a rule set cooperate to solve the problem. Together they should cover the whole search space

EM Algorithm issues

• Takes a long time • Often initialised using k-Means

A single neuron can solve a _______________________ problem

linearly separable

Entropy algorithm

E(S) = - sum_v(p(v)*log(p(v)))
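
A short sketch of the formula, assuming log base 2 (entropy in bits) and class probabilities p(v) estimated by counting labels in S.

```python
import math
from collections import Counter


def entropy(labels) -> float:
    """E(S) = -sum over values v of p(v) * log2(p(v))."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


print(entropy(["a", "a", "b", "b"]))  # evenly mixed two-class set
print(entropy(["a", "a", "a", "a"]))  # pure set
```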

Cross validation

Separate training and testing set using folds that are iteratively checked

Machine learning resources

time, space, samples (or data)

Bayesian Inference

Representing and reasoning with probabilities

Topamax

increases GAB

Itemsets

simply a set of items (cf set theory)

What is a sign of supervised learning?

training data given

Filter Scores

• Correlation • Mutual information Entropy • Classification rate • Regression score

An extended truth table.

?a Net Σ Activation a = A(Σ ) = T (0,0,1)(Σ ) 0 0 1 1 w if w < 0 then 0 else 1

Unsupervised Learning Tasks

- Outlier detection: Is this a 'normal' xi ? - Data visualization: What does the high-dimensional X look like? - Association rules: Which xij occur together? - Latent-factors: What 'parts' are the xi made from? - Ranking: Which are the most important xi ? - Clustering: What types of xi are there?

Prior Probability

- Probability of encountering a class without observing any evidence - Can be generalized for the state of any random variable - Easily obtained from training data (i.e. counting)

All probability theory can be expressed in terms of two rules

- Product rule - Sum rule

Data Mining

- Quest to extract knowledge and/ or unknown interesting patterns from apparently unstructured data. aka Knowledge Discovery from Data (KDD) • Data mining bit of a misnomer - information/ knowledge is mined, not data.

Random Initialization

- Randomly Initialize K-Means clusters using actual instances as cluster centers - Run K-Means and store centers and final Cost function - Pick clusters of iteration with lowest Cost function as optimal solution - Most useful if K < 10

Cross-validation

- Randomly split data into n folds and iteratively use one as test set - All data used to test, and almost all to train - Good for small sets

Fixed train, development and test sets

- Randomly split data into training, development, and test sets. - Does not make use of all data to train or test - Good for large datasets

storage capacity

- refers to the quantity of information that can be stored and retrieved without error C = number of stored patterns / number of neurons in the network - depends on connection weights, stored patterns, and difference between stimulus patterns and the stored patterns - if not fully connected, C = number of stored patterns / number of connection weights in network

A simplified connectionist model consists of what three parts?

- The input units - The hidden units - The output units

Maximum margin classifiers

- This turns out to be a solution where decision boundary is determined by nearest points only - Minimal set of points spanning decision boundary sought - These points are called Support Vectors

Regression Trees

- Trained in a very similar way - Leaf nodes are now continuous values - the value at a leaf node is that assigned to a test example if it reaches it - Leaf node label assignment is e.g. mean value of its data sample Problem: nodes make hard decisions, which is particularly undesired in a regression problem, where a smooth function is sought.

Decision Tree Representation

- Tree with decision nodes representing attribute - edges represent decisions - leaves are output

uses of Hopfield

- can be used to retrieve a stored pattern when a corrupted version of pattern is presented - can also be used to "complete" a pattern when parts of the pattern are missing

topological structure (eg. SOM)

- competitive learning requires inhibitory connections among all nodes at kohonen layer. topological requires that each node has excitatory connections to a small number of nodes - topology specified in terms of a neighbourhood relation among nodes - learning proceeds by moving the winner node as well as its neighbours towards the presented input sample

k-means clustering

- computes cluster centroids directly instead of making small updates to node positions

bidirectional associative memory (BAM)

- connections between input and output are bidirectional - no intra-layer connection - generates output at 2nd layer using non-iterative node update rule and a signum step function - generates 1st layer pattern to correspond to 2nd layer output using a similar update rule

learning in associative networks

- consists of encoding the desired (to be stored) patterns as a weight matrix (network)

The parts of a neural network? (4)

1. Dendrites - set of inputs 2. Axon - single output 3. Synapses - resistances/weights 4. Neurone - processing element

clustering

- determining a set of representative centroids/prototypes/cluster centers/reference vectors

centroid in neural networks

- each output node constitute a weight vector of that node, which represents the centroid of one cluster of input patterns

How do LVQs work

- each output node is associated with an arbitrary class label in the beginning - at end should be associated with approx. the number of training data in that class - initial weights are chosen randomly - learning rate decreases with time - helps the network converge to a state in which weight vectors are stable - when a new pattern is presented, if the winner node is the correct class then move winner node closer to pattern

Pruning: post-pruning

- fully grow tree - cut branches that "do not add much" (basically the opposite of tree growing, trace performance of tree on a 'fresh' set of data)

Feature extraction

- goal is to find the most important features (ie those with the highest variation in a given population) - side-effect is reduction of input dimensionality

error-correction in NI

- has low error-correction capabilities - not learning anything for changes in data, assigned from the training data - multiply W with even a slightly corrupted input vector often results in a "spurious" output that differs from stored patterns

types of associations

- hetero-association - auto-association

weights in hopfield network

- implicitly store the attractors (training samples) - also represent the correlations between node values for all attractor patterns - therefore, a large weight indicates a greater correlation between neighbouring node values

Why use a NN instead of using an array as a lookup table?

- it is parallel (independent of the number of entries) - it is fault tolerant (graceful degradation) - it is a neural model of memory - if set up properly, it can provide outputs for noisy inputs (as in character recognition, input-outputs are typical characters and hand-written or scanned inputs are noisy which need to be recognized)

Maxnet

- recurrent competitive one-layer network - used to determine which node has the highest initial activation - node function is f(net) = max(0,net) - "mutual inhibition factors" are less than 1/# nodes - nodes update their outputs simultaneously. each node receive inhibitory inputs from other nodes via intra-layer connections - allows for greater parallelism in execution, since every computation is local to each node rather than centralized
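
A sketch of the Maxnet iteration described above, assuming every node inhibits every other node with the same factor epsilon (chosen below 1 / number of nodes) and that iteration stops once at most one node is still active.

```python
def maxnet(activations, epsilon, max_steps=1000):
    """Iterate mutual inhibition until at most one activation stays positive."""
    a = list(activations)
    for _ in range(max_steps):
        if sum(1 for v in a if v > 0) <= 1:
            break
        total = sum(a)
        # Simultaneous update: f(net) = max(0, net), inhibited by all other nodes.
        a = [max(0.0, v - epsilon * (total - v)) for v in a]
    return a


result = maxnet([0.5, 0.9, 0.3], epsilon=0.2)  # 0.2 < 1/3
winner = result.index(max(result))
print(winner, result)
```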

Learning Algorithms

1. Discrete/Binary Input 2. Continuous Input

Units/Nodes

- neurons - activates by input value - activation passed on through connections

negative correlation (w_lj is -ve and large)

- nodes l and j frequently have opposite ON/OFF values in attractor patterns

positive correlation (w_lj is +ve and large)

- nodes l and j frequently turn ON or OFF together in attractor patterns

Neurons vs. Nodes

- number of inputs - input activity - excitatory/inhibitory - strength of synapse

PART 1: CLASSIFIERS

- pattern recognition - diagnostic decisions EX: plants VS. vehicles

drawback of Hopfield

- performance of Hopfield network depends considerably on the number of target attractor patterns to be stored - "assumption of full connectivity" - a million weights for a thousand node network

iterative association

- reducing error in generating the desired output - same as least square procedure using Widrow-Hoff rule - good for hetero-association

What is the objective of associative networks?

- to model associative memory - the network memorizes its input-output pairs

Brain-State-in-a-Box (BSB)

- used for auto-association and can be extended to hetero-association with two or more layers of nodes - fully connected - # of nodes = dimensionality n of input data - simultaneous updates of nodes - values of nodes are continuous (belong to [-1,+1]) - node function is a ramp function - positive self activation

Kohonen as k-means

- uses stochastic gradient descent on the quantization error (see equation)

Approximation of Data distribution

- using fewer points in the same approximate areas to represent a distribution - clustering extracts only one point for each cluster

weight vectors in SOM

- weight vectors often become ordered ie. topological neighbours become associated with weight vectors that are near each other in the input space

Hamming Networks

- weights on links from an input layer to an output layer represent components of stored input patterns - Hamming networks compute the "hamming distance", the number of differing bits of input and stored vectors - P output nodes can store P vectors each associated with a weight vector

How can artificial neural networks separate the feature space?

..

Here is a sketch algorithm for training the network. Steps 1 and 2

1) Normalise the training set. 2) Choose a set of initial classes - that is, points in the space that our objects are from. We think of these initial classes as vectors of weights - making the analogy with neural networks. We identify these classes with their vectors and call them 'units' of the network.

Creating a Tree Model in pseudo-code

1) Start from root node 2) For each variable find the best split - For nominal variables, consider splits of the type X=a, X=b, ... - For ordinal / numeric variables, consider splits of the type X ≤ a - - Assess quality of split somehow (see below) 3) Compare best splits per variable across variables - Quality(split at salary) vs. Quality(split at age) vs. ... 4) Selecting best overall split gives two internal nodes 5) Repeat above for each new internal node 6) Continue until some stopping criterion is met

Pruning: pre-pruning

1) do not fully grow tree but stop early - min gain is below some threshold - max depth is reached 2) disadvantages - consider focal node only (horizon effect) - how to select parameters ( eg. max depth)?

accuracy may not be useful measure in cases where

1- There is a large class skew 2- There are differential misclassification costs - say, getting a positive wrong costs more than getting a negative wrong. 3- We are interested in a subset of high confidence predictions

Knowledge Discovery Process

1. Data cleaning - remove noise and inconsistencies 2. Data integration - combine data sources 3. Data selection - retrieve relevant data from db 4. Data transformation - aggregation etc. (cf. feature extraction) 5. Data mining - machine learning 6. Pattern Evaluation - identify truly interesting patterns 7. Knowledge representation - visualize and transfer new knowledge

K-Means Algorithm

1. Assign each xi to its closest mean. 2. Update the means based on assignment 3. Repeat until convergence
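
The three steps can be sketched in plain Python. Initial means are passed in explicitly so the run is deterministic; Euclidean distance and the rule of keeping an empty cluster's old mean are assumptions.

```python
import math


def k_means(points, means, max_iters=100):
    means = [tuple(m) for m in means]
    clusters = []
    for _ in range(max_iters):
        # Step 1: assign each point to its closest mean.
        clusters = [[] for _ in means]
        for p in points:
            nearest = min(range(len(means)), key=lambda k: math.dist(p, means[k]))
            clusters[nearest].append(p)
        # Step 2: update each mean from its assigned points.
        new_means = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else m
            for cluster, m in zip(clusters, means)
        ]
        # Step 3: repeat until the means stop changing.
        if new_means == means:
            break
        means = new_means
    return means, clusters


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
final_means, final_clusters = k_means(points, means=[(0, 0), (10, 10)])
print(final_means)
```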

Constrained Learner algorithm bounding mistakes

1. Assume each variable is present in both positive and negated form 2. Given input, compute output 3. If wrong, set all positive variables that were 0 to absent and all negative variables that were 1 to absent. Go to 2
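The procedure can be sketched as follows, assuming the target concept is a conjunction of literals over n binary variables and that updates only happen on misclassified positive examples. The name `learn_conjunction` and the set-based representation are assumptions made for this illustration.

```python
def learn_conjunction(examples, n):
    """Mistake-driven learner for a conjunction of literals.

    pos holds indices whose positive literal is still in the hypothesis,
    neg holds indices whose negated literal is still in the hypothesis.
    """
    pos = set(range(n))
    neg = set(range(n))
    mistakes = 0

    def predict(x):
        return all(x[i] == 1 for i in pos) and all(x[i] == 0 for i in neg)

    for x, label in examples:
        if predict(x) != label:
            mistakes += 1
            if label:
                # Drop every literal that this positive example contradicts.
                pos -= {i for i in range(n) if x[i] == 0}
                neg -= {i for i in range(n) if x[i] == 1}
    return pos, neg, mistakes


# Hypothetical target: x0 AND NOT x2
data = [((1, 1, 0), True), ((1, 0, 0), True), ((0, 1, 0), False), ((1, 1, 1), False)]
print(learn_conjunction(data, 3))
```

Under these assumptions the first mistake removes n literals and each later mistake removes at least one, so the number of mistakes is at most n + 1.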

This form of map, known as a topographic map, has two important properties:

1. At each stage of representation, or processing, each piece of incoming information is kept in its proper context/neighbourhood. 2. Neurons dealing with closely related pieces of information are kept close together so that they can interact via short synaptic connections.

Techniques for canceling out noise

1. Binning - First sort data, then distribute over local bins 2. Regression - Fit a parametric function to the data (e.g. linear or quadratic function) 3. Clustering

How to measure performance? (2)

1. Calculate the mean squared error 2. $\frac{1}{n}\sum (T - O)^2$, where T is the target, O the output and n the number of examples
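As a small worked example, the measure can be computed like this. The function name `mean_squared_error` is hypothetical, and the targets and outputs are made-up numbers.

```python
def mean_squared_error(targets, outputs):
    """Average of the squared differences between targets T and outputs O."""
    n = len(targets)
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / n


# Two of the three outputs are off by 1, so the error is (0 + 1 + 1) / 3.
print(mean_squared_error([1, 0, 1], [1, 1, 0]))
```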

Prediction

1. Classification 2. Regression

Development Process of Neural Network

1. Collect, Organize, Format Data 2. Separate Data into training, validation, and testing sets 3. Decide on a network structure 4. Select a learning algorithm 5. Set network parameters 6. Initialize weights and start training 7. Stop training, freeze 8. Test the trained network 9. Deploy the network for use on unknown new cases

How to represent XOR? (5)

1. Combine the other networks 2. Use AND NOT to do X1 AND NOT X2 3. Use AND NOT to do X2 AND NOT X1 4. OR them to get the answer 5. This will have a hidden layer
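The construction can be sketched with threshold units that output 1 when net is greater than or equal to zero. The specific weights and biases below are one possible choice, not the only one, and the function names are hypothetical.

```python
def threshold_unit(inputs, weights, bias):
    """Output 1 when bias + weighted sum of inputs is >= 0, else 0."""
    net = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if net >= 0 else 0


def and_not(a, b):
    # a AND NOT b: only the input (a=1, b=0) reaches net = 0.
    return threshold_unit([a, b], [1, -1], -1)


def or_unit(a, b):
    return threshold_unit([a, b], [1, 1], -1)


def xor(x1, x2):
    # Hidden layer: the two AND NOT units. Output layer: OR.
    h1 = and_not(x1, x2)
    h2 = and_not(x2, x1)
    return or_unit(h1, h2)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor(a, b))
```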

Artificial Neural Network Structure Summary

1. Composed of many units (similar to neurons) 2. Units are interconnected (similar to synapses) 3. Units occupy separate connected layers (similar to multiple brain regions in sensory pathways) 4. Units in deeper layers respond to more and more abstract information (similar to more complex receptive fields in "higher" cortical areas) 5. Require learning to perform tasks efficiently (similar to neural plasticity) 6. Through experience, Artificial Neural Networks learn to recognize patterns.

Supervised Learning Process (3 Steps)

1. Compute temporary outputs 2. Compare outputs with desired targets 3. Adjust the weights and repeat the process

What is machine learning? (2)

1. Computer given example data with output 2. Computer learns to improve

What is special about machine learning? (2)

1. Computers automatically improve performance 2. Learn from experience

Continuous Input Supervised

1. Delta Rule 2. Gradient Descent 3. Competitive Learning 4. Neocognitron 5. Perceptron

neuroscience

ANNs have been used to understand how visual info is represented in V2 V4 and higher levels of visual hierarchy

How to change the weights in a neural network?

Add to the current weight the learning rate multiplied by the input value and Err: new weight = current weight + learning rate × input × Err
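A minimal sketch of this update for a single unit, assuming Err is the target minus the actual output. The function name `update_weights` and the default learning rate are assumptions for illustration.

```python
def update_weights(weights, inputs, target, output, learning_rate=0.1):
    """Return new weights: w + learning_rate * input * Err for each weight."""
    err = target - output  # assumption: Err = target - output
    return [w + learning_rate * x * err for w, x in zip(weights, inputs)]


# The unit output 0 but the target was 1, so weights on active inputs grow.
print(update_weights([0.0, 0.5, -0.2], [1, 1, 0], target=1, output=0))
```

Weights attached to inputs that are 0 stay unchanged, and nothing changes when the output already matches the target.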

K-medoids clustering

Addresses the issue with the quadratic error function (L2 norm, Euclidean norm). Replaces the L2 norm with any other dissimilarity measure (V...)

Perceptron Expressiveness

All Boolean Functions

PGNs are generative models

They allow us to sample from the probability distribution they define

Complex pattern recognition (learning) can result from what is a complex system of very simple binary (neuron-like) units. What is an example of this? PART II

An artificial neural network can be trained to generate speech from visual inputs.

Complex pattern recognition (learning) can result from what is a complex system of very simple binary (neuron-like) units. What is an example of this?

An artificial neural network can be trained to recognize written letters.

How does a feedforward associator work?

An associative memory is encoded by modifying the strength of connections between the layers so that when an input pattern is presented, the stored pattern associated with that pattern is retrieved.

Principal Components Analysis

An eigenproblem that projects the data onto the axis or axes of maximum variance. It aims to minimize L_2 error when moving to fewer dimensions and allows for the best reconstruction.

How to train ANN

An error function on the training set must be minimized. This is done by adjusting: - Weights connecting nodes. - Parameters of non-linear functions h(a).

Decision Tree learning

Pick the best attribute, follow the path, and continue until an answer node is found

K-Means Clustering

Pick k centers at random. Each center claims its closest points. Recompute the centers by averaging the clustered points. Repeat until convergence.
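The loop can be sketched in plain Python as below, assuming squared Euclidean distance and points given as tuples. The name `kmeans`, the `seed` parameter and the iteration cap are assumptions for this illustration.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Return (centers, clusters) for a list of equal-length tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # pick k centers at random
    clusters = []
    for _ in range(iterations):
        # Each center claims its closest points.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])),
            )
            clusters[nearest].append(p)
        # Recompute each center as the average of its points.
        new_centers = []
        for j, members in enumerate(clusters):
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep a center that claimed nothing
        if new_centers == centers:
            break  # convergence: the centers stopped moving
        centers = new_centers
    return centers, clusters


pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(pts, 2)
print(sorted(centers))
```

The result can depend on the random initial centers, so in general several restarts may be worth trying.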

The Principle of Plurality

Plurality should not be posited without necessity.

What are critical points?

Points with zero slope (i.e. minimum, maximum, saddle point).

Candidate

Possible target concept

Emission Probabilities

Probabilities of observed variables

What is P(D)?

Probability of the Data given by |VS| / |H|

The true error of hypothesis h

Probability that it will misclassify a randomly drawn example from distribution D. However, we cannot measure the true error; we can only estimate it by observing the sample error e_S

PAC

Probably approximately correct

When do we use resampling?

Sample data is very small

Linearly-separable SVM

Satisfying solution (e.g. perceptron algorithm): finds a solution, not necessarily the 'best'. The best solution is the one that promises maximum generalizability

Scale Invariance

Scaling distances by a positive value does not change the clustering

Examples of real world problems where neural networks are used

Search engines, self-driving cars, facial recognition, Google Translate, etc.

perceptron limitations

Second, perceptrons can only classify linearly separable sets of vectors. If a straight line or a plane can be drawn to separate the input vectors into their correct categories, the input vectors are linearly separable.

Define sensitivity and specificity

Sensitivity: the proportion of people who really have the disease and whose test results are positive. Specificity: the proportion of people who do not have the disease and whose test results are negative.

Sequence Data

Sequence data is data that comes in a particular order. It is the opposite of independent and identically distributed (i.i.d.) data: there is strong correlation between subsequent elements. Examples: DNA, time series, facial expressions, speech recognition, weather prediction, action planning

Testing Set

Set for testing model

Hypothesis class

Set of all concepts

Compare and contrast Backpropagation networks with Perceptron networks in terms of: a) Types of unit normally used. b) Training algorithm. c) Input and output achievable. d) Ease of use. e) Range of applications. Your answers should describe the effects of these differences.

This question asked candidates to compare and contrast two network types: backpropagation networks and Perceptron networks. Overall it was the best answered question: around 50 per cent of candidates attempted it. The question listed five criteria for comparison and each of these carried a maximum of five marks. Not all candidates were able to put their knowledge down on paper in a coherent and logical manner but most attempts were good. Candidates need to know the different types of unit normally used in all of the network types that we deal with. In this case, these are sigmoidal and threshold units. Naming these is not enough. Each different type of neural network has its own training algorithm and full answers would give formulae and details about how these work. Though there were some very good answers, there were some that left out details of the algorithms and thus lost many marks. There were three major points to be noted. • Some inputs may be discrete while others are continuous, but this is not a feature of the network. Output values are, however, and threshold units can only output zero or one though it is possible to define other types. • The beauty of Perceptrons is their ease of use in that the learning algorithm is guaranteed to converge if a suitable weight set exists. This is of course countered by the fact that continuous output is not achievable. Backpropagation achieves more power but at the cost of not having guaranteed convergence. Thus the range of applications tends to be much richer for backpropagation than for Perceptron networks. • Candidates were advised that their answers should describe the effects of these differences and most were able to answer in these terms. It is vital that in a compare and contrast question both similarities and differences between networks are explored. It is not sufficient just to

Explain how a Hopfield network might be used to solve a travelling salesman problem. Your answer should show how the problem is coded into the weights of the network

This question concerned one of the iconic applications of Hopfield nets, that is, in finding solutions to optimisation problems. Weights are set up so that the network's energy is minimised by finding the solution of a problem and then from a random starting point allowing the network to minimise energy

Why is initializing the weights with larger values not a good way around the vanishing gradients problem?

This would just cause the derivative of the logistic function to shrink alongside the gradient. One good alternative would be to pre-train the network.

perceptron three or more input

Three or more inputs. It is important that you have some facility with working out the outputs, given inputs and weights (examination questions often ask you to do this), so let us look at some more examples. The diagram below represents a unit with three or more inputs. The 'dotted' input '...' represents 0 or more edges, allowing for an unspecified number of extra inputs

What is the training data used for?

To build a model for your test data

What do we use the confusion matrix for?

To calculate sensitivity, specificity and accuracy
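For illustration, the three measures can be computed from the four cells of a binary confusion matrix. The function name `confusion_metrics` and the counts are made up for this sketch.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # positives correctly identified
        "specificity": tn / (tn + fp),  # negatives correctly identified
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


# Hypothetical counts: 50 people with the disease, 50 without.
print(confusion_metrics(tp=40, tn=45, fp=5, fn=10))
```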

How to make trees compact?

To do so, we will seek to minimise impurity of data reaching descendent nodes

Training a network

Training a NN involves finding the parameters that minimize some error function. The choice of activation function depends on the output variables: - Unity for regression - Logistic sigmoid for (multiple independent) binary classification - Softmax for exclusive (1-of-K) multiclass classification
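A short sketch of the three output activations named above, written in plain Python. The function names are hypothetical, and subtracting the maximum inside `softmax` is a common numerical-stability step rather than something the card states.

```python
import math


def unity(a):
    """Identity output for regression: the activation is passed through."""
    return a


def logistic(a):
    """Logistic sigmoid for an independent binary output, in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))


def softmax(activations):
    """Softmax for 1-of-K classification: non-negative outputs summing to 1."""
    m = max(activations)  # shift for numerical stability
    exps = [math.exp(a - m) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]


print(unity(2.5), logistic(0.0), softmax([1.0, 2.0, 3.0]))
```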

What is true regarding backpropagation rule?

a) it is also called generalized delta rule b) error in output is propagated backwards only to determine weight updates c) there is no feedback of signal at any stage d) all of the mentioned

What are general limitations of back propagation rule?

a) local minima problem b) slow convergence c) scaling

What are the general tasks that are performed with backpropagation algorithm?

a) pattern mapping b) function approximation c) prediction

How much energy consumption is used by action potentials in synaptic transmission? And what is the rest used for?

50 to 80%. The rest is used for manufacturing and maintenance

Is an initialization of 0 for all the weights a good idea?

Although such an initialization is clearly unbiased, it would cause all the outputs, hence all gradients and therefore all weight updates to be identical.

a very simple network

Although this is a very simple network, it allows us to introduce the concept of an 'extended' truth table. An extended truth table is a truth table that has entries that evaluate to 0 or 1 but these entries could be variables or even expressions. Because all the inputs are binary, we can use an extended truth table to show how inputs map to outputs.

net and bias

Before we move on, let us look at what we can make with units of the type where $net = bias + w \sum_{i=1}^{N} ?_i$

How do we find the right depth and width of an ANN? What are the limitations?

By (1) expanding, testing and comparing performance, and by (2) finding balance between over- and underfitting. The limitations are (1) the information in the data and (2) the computational power.

Regarding the distribution of the data, how can we speed up learning?

By centering the data around 0 by subtracting the mean of each dimension of the training data. This would place the data around the undecided state of the activation function and hence speed up the learning.
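Centering can be sketched as below, with the data given as a list of rows. The function name `center` is hypothetical; returning the means is a design choice so that the same shift can later be applied to test data.

```python
def center(rows):
    """Subtract the per-dimension mean of the training data from every row."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    return centered, means


train = [[1.0, 10.0], [3.0, 20.0]]
centered, means = center(train)
print(centered, means)
```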

How do networks learn?

By modifying the strength of their connections.

multilayer networks

Such networks are multi-layered as we can give each unit a layer number. Starting at the input layer, which we call layer zero (but remember that in this course, we never draw these as units), we give each unit a layer number which is one greater than the maximum of the layer numbers of all those units that feed it.

ANNs Step 2

The brain is made up of billions of neurons and quadrillions of synapses, and more powerful brains have more neurons and more connections between them. Similarly, ANNs with more units, connections, and layers are "smarter". Hierarchy of visual processing: in the retina, neurons are receptive to points of light and darkness; in the primary visual cortex they respond to oriented lines and edges; in higher visual areas they respond to faces, hands, and all sorts of complex objects, both natural and manmade. ANNs use a similar hierarchy of layers, in which info becomes more and more abstract at higher levels

Know which scientific fields ANNs are used in.

--Computer scientists: information processing and learning, image classification, object detection and recognition
--Statisticians: classification
--Engineers: signal processing and automatic control
--Physicists: statistical mechanics
--Biologists: predicting protein shape from mRNA sequences, disorder diagnosis, personalized medicine
--Philosophers: minds and machines
--Cognitive scientists: models of thinking, learning, and cognition
--Neurophysiologists: understanding sensory systems and memory
-ANNs can be used to understand how visual info is represented in V2, V4 and higher levels of the visual hierarchy; in some studies, humans and ANNs can solve the same task
-Showed mice black and white movies while recording regions in visual cortex
--Used in lots of different fields

On what parameter(s) does the stability of the gradients depend?

It depends on (1) the derivative of the activation function and (2) the values of the weights.

On what parameter(s) does the network's update rule depend?

It depends on the gradient.

On what parameter(s) does the network's "training speed" depend?

It depends on the size of the gradient (large gradient = high training speed).

'feed forward'

It is 'feed forward' because we can think of each layer as feeding forward to subsequent layers - there is no feedback.

What are the advantages/disadvantages of ELU?

It saturates for negative values only, which makes it noise robust with no penalty for highly positive values, and its mean activation is close to 0 (the gradients for the biases are stable). It is however expensive to compute.

What can ANNs not do?

- Learning from small numbers of examples and less practice (ANN: 38 days vs. human: 2 hours)
- Solving multiple tasks simultaneously
- Holding conversations
- Active learning (humans seek new information to gain knowledge)
- Scene understanding
- Language acquisition
- Common sense
- Feelings
- Consciousness
- Theory of mind (understanding thoughts and intentions of others)
- Learning to learn (getting better at learning new tasks)
- Creativity

Sparse coding

Neural coding based on the pattern of activity in small groups of neurons. (energy efficient)

the 'input' at the bias

Notice that we have set the 'input' at the bias to zero. This means that the bias has no effect on the calculations as whatever its value the result of multiplying by zero will be zero. We say 'no bias' when the bias has no effect. Also notice that there is just one input, ?a.

bias

Now we can see that -bias is acting as a (variable) threshold that the rest of the sum must equal or exceed before the output can become 1. When you read around the subject you will see threshold being mentioned, and you now know that this is just minus the bias, which in turn is the name of a weight connected to an input that is always 1

If our nerve impulses are slower than computers, how do our cells connect to each other?

Our brain has parallel networks that have folded cells and connect to each other.

Content addressable storage

Processed knowledge residing in network itself, no separate system

percentage of a computational task

The answer to this problem in some ways resembles the speedup problem in parallel processing, in which we ask what percentage of a computational task can be parallelized and what percentage is inherently sequential.

function of classical perceptron?

The classical perceptron is in fact a whole network for the solution of certain pattern recognition problems

Describe the "vanishing gradients" problem.

The gradients decrease exponentially (and, eventually, vanish) as the error is backpropagated through the layers. As a result, the first layer does not learn and keeps its random initial values: the first layer "kills" all of the signal. This is especially a problem for very deep networks.
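The exponential decrease can be illustrated with a deliberately simplified model: a chain of single logistic units in which every layer has the same weight and the same pre-activation. Both simplifications, and the name `gradient_scale`, are assumptions made only for this sketch.

```python
import math


def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))


def gradient_scale(num_layers, weight=1.0, pre_activation=0.0):
    """Factor by which the error is scaled after passing back through
    num_layers layers: each layer multiplies by weight * logistic'(a)."""
    s = logistic(pre_activation)
    per_layer = weight * s * (1.0 - s)  # logistic derivative is at most 0.25
    return per_layer ** num_layers


for depth in (1, 5, 10, 20):
    print(depth, gradient_scale(depth))
```

With weight 1 the factor shrinks by at least a factor of 4 per layer, since the logistic derivative never exceeds 0.25.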

What have neural engineers done to focus on speed rather than strength for silicon neurons?

They use the strategy of analogue and not digital coding

How do Analogue circuits work?

They work by changing their voltages continuously like in the rising phase of an action potential.

One or two inputs

To make our calculations easier we will for the time being restrict ourselves to looking at a single unit whose inputs are all binary and whose activation function is the threshold (T(<0,0,1)) given above, so that the outputs are also either 0 or 1. We call units with threshold activations threshold units. You may also see them called step units, as a step is another way of describing a threshold.

ANNs Step 3

Training/learning: ANNs need to learn and change over time by establishing appropriate connections between units, increasing the strength of appropriate connections and pruning away inefficient ones (the more you train, the better it gets). Example: classification

Sparse coding provides information for engineers building neural networks (true or false)?

True

net

We have been able to do this because the inputs are either 0 or 1 and all the weights are the same. Remember that the activation is 1 if net is greater than or equal to zero, so we can write the condition for our unit to have activation 1 as: net = bias + wM ≥ 0 or as M ≥ -bias/w and bias ≥ -wM
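A quick check of the condition, reading M as the number of inputs that are 1 and assuming w is positive so that dividing by w keeps the inequality direction. The function name `activation` and the example values are hypothetical.

```python
def activation(bias, w, m):
    """Threshold unit with equal weights w and m inputs set to 1."""
    net = bias + w * m
    return 1 if net >= 0 else 0


# With bias = -2 and w = 1 the unit needs M >= -bias/w = 2 active inputs.
print([activation(-2, 1, m) for m in range(4)])
```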

What do we understand by "appropriate" biases? When and in what order should they be learned? What effect does it have on learning?

We need the biases to shift the data into an undecided state (regarding the activation function). They need to be learned before the main learning starts, sequentially, starting from the second layer. This should slow down early learning.

How do we prevent the dying ReLU problem from happening?

We prevent it from happening either by using a small learning rate and slightly positively initialized biases, or by using a leaky ReLU instead of ReLU.
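The difference between the two activations can be sketched as follows. The slope of 0.01 for the negative side is a commonly used value and an assumption here, as are the function names.

```python
def relu(x):
    return max(0.0, x)


def relu_grad(x):
    # Zero gradient for negative inputs: no error flows back.
    return 1.0 if x > 0 else 0.0


def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x


def leaky_relu_grad(x, slope=0.01):
    # A small non-zero gradient keeps the unit trainable for negative inputs.
    return 1.0 if x > 0 else slope


print(relu_grad(-3.0), leaky_relu_grad(-3.0))
```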

We use capital N for the number of inputs

We use capital N for the number of inputs. When we introduced the notation of the Figures, we said that it is often useful to add to the external inputs a special 'input' which is fixed at the value 1. We called the weight associated with this input the bias of the unit and now we are including it in the discussion. Previously we set the input to 0 but now it is clamped, that is fixed at 1. Notice that the bias does not count in the N inputs - one often finds it thought of as input 0 but, unfortunately, it is also called input N+1 by some authors

What are the advantages/disadvantages of ReLU compared to tanh in very deep networks?

While tanh is subject to the vanishing/exploding gradients problem (if the network has not been pre-trained) and its activation and derivative are expensive to compute, ReLU makes the network trainable without pre-training and has a very simple activation and derivative. However, the mean activation of tanh is 0, while that of ReLU is > 0: for ReLU the biases need to be adjusted first. Also, if a backpropagation step causes all possible inputs to be mapped to a negative drive, then the ReLU gate (unlike the tanh one) never fires, as both the output and the derivative are 0. The error then cannot flow through the ReLU and the weights won't be updated anymore: the ReLU is "dead".

What is a perceptron?

The perceptron is an algorithm for supervised learning of binary classifiers.

Neuromorphic engineering

translation of neurobiology into technology

