Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization (deeplearning.ai Coursera 2/5)
Epoch
Full pass through all training data
Underfitting
High Bias
Local Optimum Non-Problem
A local optimum is very unlikely in a very high-dimensional space, because ALL dimensions must be at a minimum simultaneously where the derivative is 0.0. Normally at least one dimension is in a saddle state, where you can roll off.
Weight decay
AKA L2 Regularization
Bias
Measured by accuracy against the training data (high bias = poor training accuracy)
L2 Regularization
Add a term to the loss function proportional to the squared norm of the weights, so optimization favors smaller weights as well as lower error.
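A minimal NumPy sketch of the regularized cost (function and variable names are illustrative, not from the course):

```python
import numpy as np

def l2_regularized_cost(cross_entropy_cost, weights, lam, m):
    """Add lambda/(2m) times the sum of squared weights to the base cost."""
    l2_term = (lam / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)
    return cross_entropy_cost + l2_term
```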
Gradient descent with momentum
Instead of using dW directly to update the weights, use its exponential moving average.
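A single-parameter sketch of the update (names are illustrative):

```python
def momentum_update(W, dW, v_dW, lr=0.01, beta=0.9):
    """Step with the exponential moving average of dW instead of raw dW."""
    v_dW = beta * v_dW + (1 - beta) * dW   # moving average of the gradients
    W = W - lr * v_dW                      # update uses the average
    return W, v_dW
```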
Xavier initialisation
Scale the random initial weights by sqrt(1/n), where n is the size of the previous layer; recommended when using tanh activations.
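A minimal sketch of this scaling in NumPy (the function name is illustrative):

```python
import numpy as np

def xavier_init(n_in, n_out):
    """Xavier scaling for tanh: initial weights with variance 1/n_in.
    (For ReLU layers, He initialization uses sqrt(2.0 / n_in) instead.)"""
    return np.random.randn(n_out, n_in) * np.sqrt(1.0 / n_in)
```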
Frobenius Norm
The square root of the sum of squares of all elements in a matrix, e.g. sqrt(sum_i sum_j W[i][j]^2) for a layer of weights. (L2 regularization penalizes the squared Frobenius norm, i.e. the plain sum of squares.)
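A quick NumPy check (illustrative values):

```python
import numpy as np

W = np.array([[1.0, 2.0],
              [3.0, 4.0]])
frob = np.sqrt(np.sum(W ** 2))   # sqrt(1 + 4 + 9 + 16) = sqrt(30) ~= 5.477
assert np.isclose(frob, np.linalg.norm(W, 'fro'))
```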
Overfitting
High Variance, Low Bias
Dev set
Holdout set to evaluate models on as you are iteratively tuning the model
Batch Normalization Parameters
z' = gamma * z_normalized + beta. gamma and beta can be learned through backprop like any other weight.
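A minimal forward-pass sketch, assuming Z holds pre-activations with examples as columns (names are illustrative):

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize each unit's pre-activations over the minibatch, then rescale."""
    mu = np.mean(Z, axis=1, keepdims=True)    # per-unit mean over the batch
    var = np.var(Z, axis=1, keepdims=True)    # per-unit variance over the batch
    Z_norm = (Z - mu) / np.sqrt(var + eps)    # mean 0, variance 1
    return gamma * Z_norm + beta              # gamma/beta are learned by backprop
```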
Regularization Parameter
Hyperparameter lambda, which multiplies the regularization term (the sum of squared weights for L2, the sum of absolute values for L1) in the loss function.
L2 Regularization Intuition
If the network is bigger than necessary, backprop can automatically (practically) zero out some weights, so the effective network is only as big as needed and no bigger.
Variance
Difference between Train and Dev set accuracy
High Variance Fixes
1) More Data 2) Regularization 3) Different Architecture
Adam optimizer
RMSprop combined with momentum. Works well in practice.
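A single-parameter sketch of the update, assuming t counts steps from 1 (names are illustrative):

```python
import numpy as np

def adam_update(W, dW, v, s, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: momentum (v) plus RMSprop (s), with bias correction."""
    v = beta1 * v + (1 - beta1) * dW         # moving average of gradients
    s = beta2 * s + (1 - beta2) * dW ** 2    # moving average of squared gradients
    v_hat = v / (1 - beta1 ** t)             # correct startup bias (t >= 1)
    s_hat = s / (1 - beta2 ** t)
    W = W - lr * v_hat / (np.sqrt(s_hat) + eps)
    return W, v, s
```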
Bayes Error
Best possible error for a data set
Data sets split with very big data
98 / 1 / 1 for Train/Dev/Test if you have 1 million examples
Early Stopping
Plot both train and dev set error over training time, then re-run and stop at the point where they diverge (divergence is where overfitting starts).
High Bias fixes
1) Bigger Network, 2) Train longer 3) Different Architecture
Accuracy improvement order
1) Fix high bias, look at training data accuracy 2) Fix high variance, look at dev data accuracy
Normalizing inputs intuition
By putting inputs on a common scale, the cost surface becomes more symmetric (round rather than elongated contours), so gradient descent can take bigger steps toward the optimum and the model trains faster.
Softmax Layer Purpose
Compute P(C | X) for all possible classes, so you can pick the highest as your classification.
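A minimal sketch for a single score vector (the max-shift is a standard numerical-stability trick, not course-specific):

```python
import numpy as np

def softmax(z):
    """Map raw scores z to probabilities P(class | x) that sum to 1."""
    e = np.exp(z - np.max(z))   # subtract the max so exp() can't overflow
    return e / np.sum(e)
```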
RMSprop intuition
Dampens oscillations so you take smaller steps more directly at the optimum. Can use larger learning rate.
Test set
Data set for final, unbiased, evaluation of a model
Learning Rate Decay
Decrease learning rate slowly over the epochs. Big steps early, small steps as you get close to optimum.
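One schedule from the course, as a sketch (argument names are illustrative):

```python
def decayed_learning_rate(alpha0, decay_rate, epoch_num):
    """alpha0 / (1 + decay_rate * epoch_num): big steps early, small steps late."""
    return alpha0 / (1 + decay_rate * epoch_num)

# e.g. alpha0=0.2, decay_rate=1.0 -> 0.2, 0.1, 0.067, 0.05, ... over epochs 0,1,2,3
```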
Which data sets must be from the same distribution?
Dev and Test. (Training data might be the cheap data)
Minibatch training
Load only n << m training examples at a time and perform a full weight update using just those.
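A minimal sketch of building shuffled minibatches, assuming examples are stored as columns of X (course convention):

```python
import numpy as np

def minibatches(X, Y, batch_size=64, seed=0):
    """Shuffle the m examples, then yield consecutive slices of size n << m."""
    m = X.shape[1]                                  # examples as columns
    perm = np.random.default_rng(seed).permutation(m)
    X, Y = X[:, perm], Y[:, perm]
    for k in range(0, m, batch_size):
        yield X[:, k:k + batch_size], Y[:, k:k + batch_size]
```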
Batch Normalization
Normalize activations output from a layer for the benefit of the next layer.
BatchNorm Intuition
Not only makes a layer's activation inputs more consistent at a given point in training, it also makes deeper layers more resilient to weight updates in shallower layers, which would otherwise create a moving target (similar to the data set itself changing).
Early Stopping problem
Now you are trying to minimize bias and variance at the same time with one knob, so you can no longer tune them independently.
Dropout Regularization
Randomly remove nodes (and links) during training
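A sketch of the "inverted dropout" version taught in the course (names are illustrative):

```python
import numpy as np

def dropout_forward(A, keep_prob=0.8, seed=None):
    """Zero out random activations, scale survivors by 1/keep_prob."""
    rng = np.random.default_rng(seed)
    mask = rng.random(A.shape) < keep_prob   # which nodes survive this pass
    return (A * mask) / keep_prob            # keeps expected activations unchanged
```

At test time nothing is dropped and no scaling is needed, because the 1/keep_prob factor already matched the training-time expectations.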
RMSprop
Root mean square prop. Divide the weight update by the square root of an exponential moving average of the squared gradients (dW²).
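A single-parameter sketch (names are illustrative):

```python
import numpy as np

def rmsprop_update(W, dW, s_dW, lr=0.001, beta=0.999, eps=1e-8):
    """Divide the step by the root of the moving average of squared gradients."""
    s_dW = beta * s_dW + (1 - beta) * dW ** 2   # moving average of dW^2
    W = W - lr * dW / (np.sqrt(s_dW) + eps)     # eps avoids division by zero
    return W, s_dW
```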
Why treat a metric as Satisficing?
So that there is only one true 'optimization' metric, and the rest are pass/fail.
Exploding/vanishing gradients
The final prediction of a deep network is effectively a product of all the weights multiplied together, which becomes exponentially large if the weights are >1 and exponentially small if <1. Gradients suffer the same effect during backprop.
Dropout Regularization Intuition
The network can't rely on any one node, so it ends up spreading the work across more weights while keeping each of them smaller.
Regularization effect on activation functions
Smaller weights mean z() is likely to stay close to zero, where tanh and similar activations are approximately linear. So while the units don't zero out completely, they reduce toward linear models (which are much simpler and overfit less).
Normalizing inputs
Transform by subtracting the mean and dividing by the standard deviation, so that the result has mean zero and variance one.
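A minimal sketch, with features as rows and examples as columns (course convention); reuse the same mu and sigma on the test set:

```python
import numpy as np

def normalize_inputs(X):
    """Per-feature zero mean and unit variance."""
    mu = np.mean(X, axis=1, keepdims=True)
    sigma = np.std(X, axis=1, keepdims=True)
    return (X - mu) / sigma, mu, sigma   # keep mu/sigma to transform test data
```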
Weight initialization scaling
Scale the random initial weights so each unit's summed input has variance ≈ 1 (e.g. multiply by sqrt(1/n) or sqrt(2/n)); this keeps activations from growing or shrinking exponentially with depth, avoiding exploding/vanishing gradients.
Why momentum works
Gradient components orthogonal to the true direction of progress oscillate from minibatch to minibatch and get dampened by the averaging, while components in the right direction are shared across minibatches and get reinforced.
Hyperparameter Sweep Search
Use random sampling, not grid search, so you get more distinct values per hyperparameter. This works because many hyperparameters turn out to be unimportant to tune, and random search gives you more samples of the important ones.
L1 Regularization
Uses the L1 norm (the sum of absolute values) of the weights as the regularization term in the loss function, which drives many weights to exactly zero (a sparse model).
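A sketch of the L1 term, mirroring the L2 cost sketch above (names are illustrative):

```python
import numpy as np

def l1_term(weights, lam, m):
    """lambda/m times the sum of absolute weight values, added to the cost."""
    return (lam / m) * sum(np.sum(np.abs(W)) for W in weights)
```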
