Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization (deeplearning.ai Coursera 2/5)

Epoch

Full pass through all training data

Underfitting

High Bias

Local Optimum NonProblem

A local optimum is very unlikely in a very high dimensional space, because every dimension must be at a minimum simultaneously when the derivative is zero. Normally at least one dimension is in a saddle state, where you can roll off.

Weight decay

AKA L2 Regularization

Bias

Accuracy against Training data

L2 Regularization

Add a loss-function term for the norm of the weights (so the optimizer favors smaller weights as well as lower error).
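A minimal NumPy sketch of the cost with the L2 term added (function and variable names are illustrative, not from the course):

```python
import numpy as np

def l2_regularized_cost(base_cost, weights, lam, m):
    """Add (lambda / 2m) * sum of squared weights to the base cost.
    `weights` is a list of weight matrices, one per layer."""
    l2_term = (lam / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)
    return base_cost + l2_term
```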

Gradient descent with momentum

Instead of using dW directly to update weights, use its exponential moving average.
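A sketch of one momentum update step, assuming the usual hyperparameter names (beta ≈ 0.9):

```python
import numpy as np

def momentum_update(W, dW, v, lr=0.01, beta=0.9):
    """v is the exponential moving average of past gradients dW."""
    v = beta * v + (1 - beta) * dW   # update the EMA
    W = W - lr * v                   # step using the EMA, not raw dW
    return W, v
```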

Xavier initialisation

Special random weight scaling (variance 1/n of the previous layer) when using tanh.
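A sketch of the tanh variant (scale standard-normal weights by sqrt(1/n_in)); the function name is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_out, n_in):
    """Random weights scaled so each unit's input has variance ~1/n_in."""
    return rng.standard_normal((n_out, n_in)) * np.sqrt(1.0 / n_in)
```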

Frobenius Norm

Square root of the sum of squares of all elements in a matrix, e.g. sqrt(sum_i sum_j W[i][j]^2) for a layer of weights. (The course uses the squared Frobenius norm as the L2 regularization term.)
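Computed directly in NumPy (equivalent to `np.linalg.norm(W, 'fro')`):

```python
import numpy as np

W = np.array([[3.0, 0.0],
              [0.0, 4.0]])
frob = np.sqrt(np.sum(np.square(W)))  # sqrt(9 + 16) = 5.0
```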

Overfitting

High Variance, Low Bias

Dev set

Holdout set to evaluate models on as you are iteratively tuning the model

Batch Normalization Parameters

z' = gamma * z_normalized + beta. gamma and beta can be learned through backprop like any other weight.
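A sketch of the forward pass for one layer's pre-activations, with gamma and beta as the learnable parameters (training-time statistics only; running averages for inference are omitted):

```python
import numpy as np

def batch_norm(z, gamma, beta, eps=1e-8):
    """Normalize z across the batch (axis 0), then rescale with gamma/beta."""
    mu = z.mean(axis=0)
    var = z.var(axis=0)
    z_norm = (z - mu) / np.sqrt(var + eps)
    return gamma * z_norm + beta
```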

Regularization Parameter

Hyperparameter lambda, which scales the weight-norm term added to the loss function.

L2 Regularization Intuition

If the network is bigger than necessary, backprop can (practically) zero out some weights so that the network is effectively only as big as needed.

Variance

Difference between Train and Dev set accuracy

High Variance Fixes

1) More Data 2) Regularization 3) Different Architecture

Adam optimizer

RMSprop combined with momentum. Works well in practice.
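A sketch of one Adam step with the standard hyperparameter defaults and bias correction (t is the step count, starting at 1):

```python
import numpy as np

def adam_update(W, dW, v, s, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """Momentum EMA (v) plus RMSprop EMA of squared gradients (s)."""
    v = b1 * v + (1 - b1) * dW            # momentum term
    s = b2 * s + (1 - b2) * dW ** 2       # RMSprop term
    v_hat = v / (1 - b1 ** t)             # bias correction
    s_hat = s / (1 - b2 ** t)
    W = W - lr * v_hat / (np.sqrt(s_hat) + eps)
    return W, v, s
```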

Bayes Error

Best possible error for a data set

Data sets split with very big data

98 / 1 / 1 for Train/Dev/Test if you have 1 million examples

Early Stopping

Plot both train and dev set error over training time, then re-run and stop where they diverge (divergence is where overfitting starts).

High Bias fixes

1) Bigger Network, 2) Train longer 3) Different Architecture

Accuracy improvement order

1) Fix high bias, look at training data accuracy 2) Fix high variance, look at dev data accuracy

Normalizing inputs intuition

By putting inputs on a common scale, weights can be initialized on a common scale and backprop has an easier time propagating errors, so the model trains faster.

Softmax Layer Purpose

Compute P(C | X) for all possible classes, so you can pick the highest as your classification.
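A numerically stable softmax sketch (subtracting the max before exponentiating is a standard trick, not course-specific):

```python
import numpy as np

def softmax(z):
    """Turn a vector of logits into probabilities that sum to 1."""
    e = np.exp(z - np.max(z))  # shift for numerical stability
    return e / e.sum()
```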

RMSprop intuition

Dampens oscillations so you take smaller steps more directly toward the optimum. Can use a larger learning rate.

Test set

Data set for final, unbiased, evaluation of a model

Learning Rate Decay

Decrease learning rate slowly over the epochs. Big steps early, small steps as you get close to optimum.
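One decay schedule from the course, lr = lr0 / (1 + decay_rate * epoch):

```python
def decayed_lr(lr0, decay_rate, epoch):
    """Learning rate shrinks as the epoch number grows."""
    return lr0 / (1 + decay_rate * epoch)
```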

Which data sets must be from the same distribution?

Dev and Test. (Training data might be the cheap data)

Minibatch training

Load only n << m training examples at a time, and perform full weight updates on just those.
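A sketch of splitting a shuffled data set into minibatches, assuming examples are stacked as columns (the course's convention):

```python
import numpy as np

def minibatches(X, Y, batch_size, seed=0):
    """Shuffle the m examples (columns), then yield slices of size batch_size.
    The last batch may be smaller if batch_size does not divide m."""
    m = X.shape[1]
    perm = np.random.default_rng(seed).permutation(m)
    X, Y = X[:, perm], Y[:, perm]
    for k in range(0, m, batch_size):
        yield X[:, k:k + batch_size], Y[:, k:k + batch_size]
```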

Batch Normalization

Normalize activations output from a layer for the benefit of the next layer.

BatchNorm Intuition

Not only makes activation inputs more consistent at a point in time, but also makes deeper layers more resilient as shallower layers' weights update and create a moving target (similar to the data distribution shifting).

Early Stopping problem

Now you are trying to minimize bias and variance at the same time.

Dropout Regularization

Randomly remove nodes (and links) during training
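A sketch of inverted dropout on one layer's activations (dividing by keep_prob keeps the expected activation unchanged, so nothing changes at test time):

```python
import numpy as np

def inverted_dropout(a, keep_prob, rng):
    """Zero each node with probability 1-keep_prob; scale the survivors up."""
    mask = rng.random(a.shape) < keep_prob
    return (a * mask) / keep_prob
```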

RMSprop

Root mean squared prop. Divide weight updates by the square root of an exponential moving average of dW squared.
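A sketch of one RMSprop step (eps guards against division by zero):

```python
import numpy as np

def rmsprop_update(W, dW, s, lr=0.001, beta=0.9, eps=1e-8):
    """s is the EMA of dW**2; dividing by its sqrt damps oscillating directions."""
    s = beta * s + (1 - beta) * dW ** 2
    W = W - lr * dW / (np.sqrt(s) + eps)
    return W, s
```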

Why treat a metric as Satisficing?

So that there is only one true 'optimization' metric, and the rest are pass/fail.

Exploding/vanishing gradients

The output of a deep network involves a product of all the layers' weights; that product grows exponentially large if the weights are >1, or shrinks exponentially small if <1.

Dropout Regularization Intuition

The network can't rely on any one node, so it ends up making use of more weights but makes them smaller.

Regularization effect on activation functions

The smaller weights mean that z() is likely to be close to zero, where tanh and similar activations are nearly linear. So while units don't zero out completely, they reduce toward linear models (which are much simpler and less prone to overfitting).

Normalizing inputs

Transform by subtracting the mean and dividing by the standard deviation, so that the mean is zero and the variance is one.
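In NumPy, with examples stacked as columns (one row per feature):

```python
import numpy as np

def normalize(X):
    """Subtract each feature's mean and divide by its standard deviation."""
    mu = X.mean(axis=1, keepdims=True)
    sigma = X.std(axis=1, keepdims=True)
    return (X - mu) / sigma
```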

Weight initialization scaling

Scale the random initial weights so that each unit's input z has variance ≈ 1; the weights are then neither systematically above nor below 1, which mitigates exploding/vanishing gradients.

Why momentum works

Updates orthogonal to true optimization progress (noise from one minibatch) are dampened by other minibatches'. Progress in the right direction is shared by many minibatches, which gets reinforced.

Hyperparameter Sweep Search

Use random sampling, not grid search, so you get more distinct values per hyperparameter. This works because many hyperparameters turn out to be unimportant to tune, so the important ones get more samples.
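A sketch of randomly sampling a learning rate on a log scale (the bounds here are illustrative, not from the course), so each decade is equally likely:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_learning_rate(low=1e-4, high=1e-1):
    """Sample uniformly in log10 space, then exponentiate back."""
    r = rng.uniform(np.log10(low), np.log10(high))
    return 10.0 ** r
```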

L1 Regularization

Uses the sum of absolute values (the L1 norm) of the weights as the regularization term in the loss function, which drives many weights to exactly zero (a sparse model).
