Week 8 Reading
Kohonen learning rule (Kohonen, 1982).
A rule that is the core of the self-organizing map (SOM), an unsupervised neural network model
What are some common termination conditions for training a NN?
1. For all weights wij, the difference between the old and new values |wij(new) - wij(old)| is less than some specified threshold. 2. The error (e.g., misclassification rate) is less than some specified threshold. 3. The pre-specified number of iterations has been completed
What are the types of interlayer connections?
1. Fully Connected 2. Partially Connected 3. Bi-Directional 4. Hierarchical 5. Resonance
What are some design considerations for RBF networks?
1. Instead of weights, it uses width and height parameters for the connections between the input and hidden layers. 2. The transfer function in the output layer is usually linear with trainable weights
Softmax Activation Function
1. It is used when we want to represent a probability distribution over n discrete classes, and is most often used as the output of a classifier. 2. It can, rarely, be used in a hidden layer, if the node is designed to choose between one of n different options for some internal input. 3. It can be viewed as a generalized sigmoid function, which represents a probability distribution over a binary class variable. 4. It works well with a maximum log-likelihood objective function, but less well with many other objective functions, especially those that do not use a log.
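A minimal numpy sketch of softmax; subtracting the maximum before exponentiating is a standard numerical-stability trick, not something from the reading:

    import numpy as np

    def softmax(z):
        # Shift by the max for numerical stability (exp of large values overflows).
        e = np.exp(z - np.max(z))
        return e / e.sum()

    scores = np.array([2.0, 1.0, 0.1])
    probs = softmax(scores)
    print(probs)        # approx [0.659 0.242 0.099]
    print(probs.sum())  # 1.0 -- a probability distribution over the 3 classes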
Limitations of ReLU Functions
1. Non-differentiable at zero: values close to zero may give inconsistent or intractable results. 2. Non-zero centered: this creates asymmetry around the data (only positive values are handled), leading to uneven handling of the data. 3. Unbounded: the output value has no limit and can lead to computational issues when large values are passed through. 4. Dying ReLU problem: when the learning rate is too high, ReLU neurons can become inactive and "die."
What are data preparation considerations for training a NN model?
1. Sample the Input Data Source 2. Create Partitioned Data Sets 3. Perform Group Processing 4. Use Only the Important Variables 5. Data Transformations and Filtering Outliers 6. Imputing Missing Values 7. Use Other Modeling Nodes
Radial basis function (RBF)
1. Its interpretation relies more on geometry than biology. 2. Its training method is different because, in addition to optimizing the weights used to combine the outputs of the nodes, the nodes themselves have parameters that can be optimized.
The advantages of the ReLU function are
1. Allows for faster and more effective training of deep neural architectures on large and complex datasets 2. Sparse activation of only about 50% of the units in a neural network (as negative units are eliminated) 3. More plausible and one-sided, compared to the anti-symmetry of tanh 4. Efficient gradient propagation, which means no vanishing or exploding gradient problems 5. Efficient computation using only comparison, addition, or multiplication 6. Scales well
What is a Neural Network?
A class of powerful, flexible, general-purpose techniques readily applied to prediction, estimation, and classification problems.
What is a loss function used for?
A network is trained by minimizing a loss function
How does a network with high momentum respond to new training examples that want to reverse the weights?
A network with high momentum responds slowly.
Generalized learning rule, developed by Rumelhart et al. (1988).
A rule designed to work for networks with non-linear activation functions and hidden layers. The weights are adjusted using backpropagation.
Delta rule (Widrow-Hoff), also called the Least Mean Squares (LMS) method.
A rule that says, for a given input vector, the output vector is compared to the correct or desired answer. The rule updates the weights to minimize the error between the actual output and the desired output. It works best for NNs with linear activation functions and no hidden layers; it does not work well with hidden layers.
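A minimal sketch of the delta (LMS) update for a single linear unit; the learning rate and example values are illustrative:

    import numpy as np

    def delta_rule_update(w, x, d, lr=0.1):
        # y = actual output of a linear unit; d = desired output.
        y = np.dot(w, x)
        error = d - y
        # Adjust each weight in proportion to the error and its input.
        return w + lr * error * x

    w = np.zeros(2)
    w = delta_rule_update(w, x=np.array([1.0, 0.5]), d=1.0)
    print(w)  # [0.1  0.05]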
What is a gradient?
A vector that stores the partial derivatives of a multivariable function. It helps us calculate the slope at a specific point on a curve for functions with multiple independent variables. To calculate this more complex slope, we isolate each variable to determine how it impacts the output on its own: we iterate through the variables and take the derivative of the function while holding all the others constant. Each iteration produces a partial derivative, which we store in the gradient.
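As a concrete illustration (my own example, not from the reading), the gradient of f(x, y) = x^2 + 3y can be approximated by isolating one variable at a time:

    def f(x, y):
        return x**2 + 3 * y

    def gradient(f, x, y, h=1e-6):
        # Hold y constant to get df/dx, hold x constant to get df/dy.
        dfdx = (f(x + h, y) - f(x, y)) / h
        dfdy = (f(x, y + h) - f(x, y)) / h
        return (dfdx, dfdy)

    print(gradient(f, 2.0, 1.0))  # approx (4.0, 3.0): df/dx = 2x, df/dy = 3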
What is a combination function?
A function that combines all the output values from predecessor nodes into a single value.
What are different learning rules for adjusting weights?
Hebbian, Delta, Generalized Learning, Kohonen
Are neural networks the same or different than deep learning?
Different: deep learning has a deeper, more complex network structure.
Feed forward
Does not use feedback loops or connections within the same layer; signals flow only from one layer to the next.
Fully Connected interlayer connection
Each node of layer k is connected to all nodes of layer (k+1), where k = 1 is the input layer
Partially Connected interlayer connection
Each node of layer k is connected to some but not all nodes of layer (k+1)
Hierarchical interlayer connection
Feed Forward connections between nodes of adjacent levels only
Resonance interlayer connection
For the given pair of layers with a Bi-directional connection, messages are sent across the connection until a certain condition is realized.
Chain rule refresher
Forward propagation can be viewed as a long series of nested equations. If you think of feed forward this way, then backpropagation is merely an application of the chain rule to find the derivatives of the cost with respect to any variable in the nested equation.
Error in Perceptron
In the Perceptron Learning Rule, the predicted output is compared with the known output. If it does not match, the error is propagated backward to allow weight adjustment to happen.
What are some common loss functions?
MAE, MSE, MAPE, and MSLE; hinge and its variants; cross-entropy and its variants; log-cosh; cosine proximity; Poisson.
Bi-Directional interlayer connection
Nodes of layer (k+r) may receive input from nodes of layer k and nodes of layer k may receive input from any node of layer (k+r)
What enables you to distinguish between the two linearly separable classes +1 and -1?
The Perceptron algorithm, which draws a linear decision boundary.
What are the two types of Perceptrons?
Single layer and Multilayer.
What is an output layer?
The active nodes of this layer combine and modify the data to produce the one or more output values the network delivers to the environment.
What is the chain rule?
The chain rule is a formula for calculating the derivatives of composite functions. Composite functions are functions built from other functions nested inside one another.
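A small worked example (mine, not from the reading): for h(x) = (3x + 1)^2, the outer function is u^2 and the inner function is u = 3x + 1, so the chain rule gives h'(x) = 2(3x + 1) * 3. A numerical check:

    def h(x):
        return (3 * x + 1) ** 2       # outer: u**2, inner: u = 3x + 1

    def h_prime(x):
        return 2 * (3 * x + 1) * 3    # chain rule: outer'(inner) * inner'

    eps = 1e-6
    x = 2.0
    numeric = (h(x + eps) - h(x)) / eps
    print(h_prime(x), numeric)  # both approx 42.0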
What is an input layer?
The nodes of this layer are passive, meaning they do not modify the data. They receive a single value on their input, and duplicate the value to their multiple outputs.
A rectifier or ReLU (Rectified Linear Unit) is a commonly used activation function.
This function allows one to eliminate negative units in an ANN. This is the most popular activation function used in deep neural networks.
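A one-line numpy sketch of ReLU, f(x) = max(0, x), showing how negative units are eliminated:

    import numpy as np

    def relu(x):
        # Negative inputs are zeroed out; positive inputs pass through unchanged.
        return np.maximum(0, x)

    x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    print(relu(x))  # [0.  0.  0.  0.5 2. ]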
Feedback NN
Uses feedback loops. Very successful in modeling problems with a time sequence.
What are two combination functions introduced in the reading?
linear combination functions and radial combination functions.
What is a link?
A link (i, j) from node i to node j has a weight Wji; node j also has a bias weight W0j.
In mathematics, the softmax, or normalized exponential, function is
a generalization of the logistic function that squashes a K-dimensional vector of arbitrary real values to a K-dimensional vector of real values in the range (0, 1) that add up to 1.
What is a neuron?
A mathematical function conceived as a model of biological neurons. Artificial neurons are the elementary units of an artificial neural network.
Quickprop algorithm
A more statistical approach to training a neural network: it tests a few different sets of weights and then guesses where the optimum is.
In probability theory, the output of Softmax function represents
a probability distribution over K different outcomes.
Hebbian rule
a rule that indicates the connections between two neurons might be strengthened if the neurons fire at the same time, and might be weakened if they fire at different times.
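A minimal sketch of a Hebbian-style update in its classic form, delta_w = lr * x * y; the numbers are illustrative:

    def hebbian_update(w, x, y, lr=0.1):
        # x = presynaptic activity, y = postsynaptic activity.
        # Same sign ("fire together") -> weight grows; opposite sign -> weight shrinks.
        return w + lr * x * y

    print(hebbian_update(0.5, x=1.0, y=1.0))   # 0.6  (strengthened)
    print(hebbian_update(0.5, x=1.0, y=-1.0))  # 0.4  (weakened)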
A Perceptron is
an algorithm for supervised learning of binary classifiers.
Single layer Perceptrons can learn only
linearly separable patterns.
Radial combination functions
compute the squared Euclidean distance between the vector of weights and the vector of values feeding into the node and then multiply by the squared bias value (the bias acts as a scale factor or inverse width).
What is a linear combination function?
Computes a "straight line" combination of the weights and the values feeding into the node and then adds the bias value (the bias W0j acts like an intercept).
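A minimal numpy sketch of the two combination functions as defined above; the function names and example numbers are mine:

    import numpy as np

    def linear_combination(w, x, bias):
        # "Straight line" combination: w . x + bias (bias acts like an intercept).
        return np.dot(w, x) + bias

    def radial_combination(w, x, bias):
        # Squared Euclidean distance between weights and inputs,
        # scaled by the squared bias (bias acts as an inverse width).
        return bias**2 * np.sum((w - x) ** 2)

    w = np.array([1.0, 2.0]); x = np.array([0.5, 1.5])
    print(linear_combination(w, x, bias=0.1))  # 3.6
    print(radial_combination(w, x, bias=2.0))  # 2.0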
What is an activation (or transfer) function?
Decides whether the values produced by an upstream layer's inputs will activate an output connection, i.e., whether a neuron is "fired" (activated) or not. It maps input values to a desired range (e.g., between 0 and 1 or -1 and 1).
The Perceptron algorithm learns the weights for the input signals in order to
draw a linear decision boundary.
Multilayer Perceptrons, or feedforward neural networks with two or more layers,
have greater processing power.
Types of activation functions include
the sign, step, and sigmoid functions.
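An illustrative sketch of those three functions in Python (numpy):

    import numpy as np

    def sign_fn(x):
        return np.sign(x)               # -1, 0, or +1

    def step_fn(x):
        return np.where(x >= 0, 1, 0)   # 0/1 threshold at zero

    def sigmoid(x):
        return 1 / (1 + np.exp(-x))     # squashes inputs into (0, 1)

    x = np.array([-2.0, 0.0, 2.0])
    print(sign_fn(x), step_fn(x), sigmoid(x))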
Simulated annealing
Injects randomness into hill climbing.
How does back-propagation work?
Inputs are propagated forward through the network, layer by layer, until they reach the output layer. An error value for each neuron in the output layer is calculated from the loss function, then propagated backward until each neuron has an associated error value.
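A compact numpy sketch of one forward and backward pass for a network with a single hidden layer, sigmoid activations, and a squared-error loss. The layer sizes, learning rate, and variable names are illustrative assumptions, not from the reading:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(4, 3))            # 4 examples, 3 inputs
    y = np.array([[0.], [1.], [1.], [0.]])
    W1 = rng.normal(size=(3, 5))           # input -> hidden weights
    W2 = rng.normal(size=(5, 1))           # hidden -> output weights
    lr = 0.5

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    # Forward pass: propagate inputs layer by layer to the output layer.
    h = sigmoid(X @ W1)
    y_hat = sigmoid(h @ W2)

    # Backward pass: error at the output, then propagated back to the hidden layer.
    delta_out = (y_hat - y) * y_hat * (1 - y_hat)   # output-layer error
    delta_hid = (delta_out @ W2.T) * h * (1 - h)    # hidden-layer error

    # Gradient-descent weight updates (one training step).
    W2 -= lr * h.T @ delta_out
    W1 -= lr * X.T @ delta_hid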
In a NN, a network is trained by
minimizing a loss function.
Hill climbing
One approach to finding optima.
What is the goal of the back-propagation algorithm?
To propagate error backward until each neuron has an associated error value.
What does "radial" refer to in RBF?
It refers to the fact that all inputs at the same distance from a node's position produce the same output.
Training rate
Governs how quickly the weights change.
What are learning rules?
Rules used to update the weights between the interconnected neurons.
Neural networks work best when their inputs are?
Small numbers that are standardized.
What is the best approach to the learning rate?
Start big and decrease it slowly.
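One common way to realize "start big and decrease it slowly" is a decay schedule; this simple inverse-decay sketch (constants made up) is one option, not something prescribed by the reading:

    def decayed_lr(initial_lr, epoch, decay=0.05):
        # Learning rate shrinks smoothly as training proceeds.
        return initial_lr / (1 + decay * epoch)

    for epoch in (0, 10, 50, 100):
        print(epoch, round(decayed_lr(0.5, epoch), 3))  # 0.5, 0.333, 0.143, 0.083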
Generalized delta rule
A technique for adjusting weights.
Sensitivity analysis
Indicates how opaque a network is by showing the relative importance of the inputs to the results.
Perceptron Learning Rule states that
the algorithm would automatically learn the optimal weight coefficients.
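A minimal sketch of perceptron learning in Python: whenever an example is misclassified, the weights are nudged toward the correct class. The dataset, learning rate, and epoch count are illustrative:

    import numpy as np

    def perceptron_train(X, y, lr=0.1, epochs=10):
        # y in {-1, +1}; a bias term is folded in by appending 1 to each input.
        Xb = np.hstack([X, np.ones((len(X), 1))])
        w = np.zeros(Xb.shape[1])
        for _ in range(epochs):
            for xi, yi in zip(Xb, y):
                if np.sign(xi @ w) != yi:   # misclassified
                    w += lr * yi * xi       # nudge the boundary toward xi
        return w

    # AND-like, linearly separable data.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
    y = np.array([-1, -1, -1, 1])
    print(perceptron_train(X, y))  # weights defining a linear decision boundary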
What is a bias?
The bias acts like an intercept in a linear combination function and as a scale factor (or inverse width) in a radial combination function.
For iterative algorithms (topic models, NNs), an iterative method is called convergent if
the corresponding sequence converges for given initial approximations (as the iterations proceed, the output gets closer and closer to a specific value).
Momentum
The tendency for the weights inside each unit to keep changing in the direction they are headed. Each weight remembers whether it is getting bigger or smaller, and momentum tries to keep it headed in that direction.
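A sketch of a standard momentum-style update (coefficients illustrative): a velocity term remembers the recent direction of change, so a gradient that suddenly reverses only slowly turns the weight around:

    def momentum_step(w, velocity, grad, lr=0.1, beta=0.9):
        # Velocity accumulates past updates; with high beta (momentum),
        # a new gradient pointing the other way only slowly reverses it.
        velocity = beta * velocity - lr * grad
        return w + velocity, velocity

    w, v = 0.0, 0.0
    for grad in (1.0, 1.0, -1.0):   # last gradient tries to reverse direction
        w, v = momentum_step(w, v, grad)
        print(round(w, 3), round(v, 3))  # w keeps moving the old way at first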
What is a hidden layer?
There may be one or more hidden layers. Each hidden-layer node receives input values (say, xi) from input or hidden nodes and communicates its output to either hidden or output nodes.
What is a loss function used for?
Used to measure the inconsistency between the predicted value (ŷ) and the actual label (y). It is a non-negative value, and the robustness of the model increases as the value of the loss function decreases.
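For example, mean squared error, one of the loss functions listed earlier, is non-negative and shrinks as predictions approach the labels (a minimal sketch; the numbers are illustrative):

    import numpy as np

    def mse(y, y_hat):
        # Non-negative; 0 only when every prediction matches its label.
        return np.mean((y - y_hat) ** 2)

    y = np.array([1.0, 0.0, 1.0])
    print(mse(y, np.array([0.9, 0.2, 0.8])))  # 0.03  (good fit)
    print(mse(y, np.array([0.0, 1.0, 0.0])))  # 1.0   (poor fit)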