Hyperparameter tuning, Regularization and Optimization (2/5)
What can you do if your algorithm has a high variance?
- Get more data.
- Try regularization.
- Try a different model that is suitable for your data.
Names of gradient descent techniques by mini-batch size:
(mini-batch size = m) ==> Batch gradient descent
(mini-batch size = 1) ==> Stochastic gradient descent (SGD)
(mini-batch size between 1 and m) ==> Mini-batch gradient descent
Guidelines for choosing mini-batch size:
- If you have a small training set (< 2000 examples), use batch gradient descent.
- Make it a power of 2 (because of the way computer memory is laid out and accessed, your code sometimes runs faster if your mini-batch size is a power of 2): 64, 128, 256, 512, 1024, ...
- Make sure that the mini-batch fits in CPU/GPU memory.
- A helper for the splitting step is sketched below.
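A minimal sketch of splitting a dataset into shuffled mini-batches (the name random_mini_batches and the (n_x, m) / (1, m) shapes are assumptions, not course code):

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle (X, Y) column-wise and cut into mini-batches of batch_size."""
    np.random.seed(seed)
    m = X.shape[1]                                  # number of examples
    perm = np.random.permutation(m)                 # random example order
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    return [(X_shuf[:, k:k + batch_size], Y_shuf[:, k:k + batch_size])
            for k in range(0, m, batch_size)]       # last batch may be smaller

X, Y = np.random.randn(5, 1000), np.random.randn(1, 1000)  # stand-in data
batches = random_mini_batches(X, Y, batch_size=64)
print(len(batches), batches[0][0].shape)  # 16 batches, first of shape (5, 64)
```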
What is early stopping?
- In this regularization technique we plot the training set cost and the cross validation (dev) set cost together for each iteration. At some iteration the dev set cost will stop decreasing and start increasing, while the training cost keeps decreasing.
- We pick the point at which the training set error and dev set error are best (lowest training cost with lowest dev cost).
- We take the parameters at that point as the best parameters.
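A minimal runnable sketch of the idea; training_step and compute_cost are hypothetical stand-ins (a toy 1-D problem, not course code):

```python
import copy

def training_step(params):        # hypothetical: one optimization step
    return params - 0.01

def compute_cost(params):         # hypothetical: dev cost, minimized at params = 3
    return (params - 3.0) ** 2

params = 10.0
best_dev_cost, best_params = float('inf'), None
for i in range(2000):
    params = training_step(params)
    dev_cost = compute_cost(params)
    if dev_cost < best_dev_cost:  # remember the point with the lowest dev cost
        best_dev_cost, best_params = dev_cost, copy.deepcopy(params)
print(best_params)  # ~3.0: the early-stopped parameters
```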
What is dropout regularization?
- Simply put, dropout refers to ignoring a randomly chosen set of units (i.e. neurons) during the training phase. By "ignoring", I mean these units are not considered during a particular forward or backward pass.
- More technically, at each training stage individual nodes are either dropped out of the net with probability 1-p or kept with probability p, so that a reduced network is left; incoming and outgoing edges to a dropped-out node are also removed.
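A minimal sketch of the inverted-dropout variant on one layer's activations (A here is a stand-in activation matrix; keep_prob plays the role of p):

```python
import numpy as np

A = np.random.randn(4, 5)                 # stand-in for a layer's activations
keep_prob = 0.8                           # p: probability of keeping a unit
D = np.random.rand(*A.shape) < keep_prob  # random mask, True where a unit is kept
A = A * D                                 # zero out the dropped units
A = A / keep_prob                         # scale up so the expected value is unchanged
```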
What can you do if your algorithm has high bias?
- Try to make your NN bigger (size of hidden units, number of layers).
- Try a different model that is suitable for your data.
- Try to run it longer.
- Try different (advanced) optimization algorithms.
What are Bayesian optimization methods and how do they differ from grid search or random search?
- Bayesian methods differ from random or grid search in that they use past evaluation results to choose the next values to evaluate.
- The concept is: limit expensive evaluations of the objective function by choosing the next input values based on those that have done well in the past.
Writing and running programs in TensorFlow has the following steps:
1) Create Tensors (variables) that are not yet executed/evaluated.
2) Write operations between those Tensors.
3) Initialize your Tensors.
4) Create a Session.
5) Run the Session. This will run the operations you'd written above.
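A minimal sketch of these steps using the TensorFlow 1.x API (the version used in the course):

```python
import tensorflow as tf  # TensorFlow 1.x

# 1-2) create tensors and write an operation between them (nothing runs yet)
a = tf.constant(2, name='a')
b = tf.constant(10, name='b')
c = tf.multiply(a, b)

# 3-5) initialize, create a session, and run it
init = tf.global_variables_initializer()  # a no-op here (constants only), shown for the pattern
with tf.Session() as sess:
    sess.run(init)
    print(sess.run(c))  # 20
```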
How to implement gradient checking?
1) First compute "gradapprox" using a small value of epsilon (10^-7): gradapprox = (J(theta + epsilon) - J(theta - epsilon)) / (2 * epsilon)
2) Then compute the gradient using backward propagation, and store the result in a variable "grad".
3) Finally, compute the relative difference between "gradapprox" and "grad" using the following formula: difference = ||grad - gradapprox||_2 / (||grad||_2 + ||gradapprox||_2)
4) If this difference is small (say less than 10^-7), you can be quite confident that you have computed your gradient correctly. Otherwise, there may be a mistake in the gradient computation.
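A minimal sketch of these steps for a 1-D parameter vector (J is a hypothetical cost function of the parameters; grad is the backprop gradient you want to verify):

```python
import numpy as np

def gradient_check(J, theta, grad, epsilon=1e-7):
    """Compare the backprop gradient `grad` against a centered finite difference."""
    gradapprox = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus = theta.copy()
        theta_plus[i] += epsilon
        theta_minus = theta.copy()
        theta_minus[i] -= epsilon
        gradapprox[i] = (J(theta_plus) - J(theta_minus)) / (2 * epsilon)
    difference = (np.linalg.norm(grad - gradapprox)
                  / (np.linalg.norm(grad) + np.linalg.norm(gradapprox)))
    return difference  # < 1e-7 suggests the gradient is correct

# example: J(theta) = sum(theta^2), so dJ/dtheta = 2*theta
theta = np.array([1.0, -2.0, 3.0])
print(gradient_check(lambda th: np.sum(th ** 2), theta, 2 * theta))  # ~1e-10
```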
Steps to normalize the inputs:
1) Get the mean of the training set: mean = (1/m) * sum(x(i))
2) Subtract the mean from each input: X = X - mean. This makes your inputs centered around 0.
3) Get the variance of the training set (after mean subtraction): variance = (1/m) * sum(x(i)^2)
4) Normalize the variance: X /= variance
5) Apply these steps to the training, dev, and test sets (but using the mean and variance of the train set).
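A minimal sketch with stand-in (n_x, m) arrays; note it divides by the standard deviation (the square root of the variance) so the result has unit variance:

```python
import numpy as np

X_train = np.random.randn(3, 100) * 5 + 2   # stand-ins for real data, shape (n_x, m)
X_dev = np.random.randn(3, 20) * 5 + 2
X_test = np.random.randn(3, 20) * 5 + 2

mean = np.mean(X_train, axis=1, keepdims=True)           # train-set mean per feature
X_train = X_train - mean                                  # center around 0
variance = np.mean(X_train ** 2, axis=1, keepdims=True)   # variance after centering
X_train = X_train / np.sqrt(variance + 1e-8)              # scale to unit variance

# reuse the TRAIN mean/variance for the dev and test sets
X_dev = (X_dev - mean) / np.sqrt(variance + 1e-8)
X_test = (X_test - mean) / np.sqrt(variance + 1e-8)
```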
What is a disadvantage of dropout regularization and how to solve it ?
A downside of dropout is that the cost function J is no longer well defined, so it is hard to debug (to plot J by iteration). To solve that, turn off dropout (set all the keep_probs to 1), run the code and check that J decreases monotonically, and then turn dropout back on.
Explain the concept of Exponentially weighted averages
A moving average (rolling average or running average) is a calculation to analyze data points by creating a series of averages of different subsets of the full data set. Now let's compute the exponentially weighted averages:
V0 = 0
V1 = 0.9 * V0 + 0.1 * theta(1) = 4   # 0.9 and 0.1 are hyperparameters
V2 = 0.9 * V1 + 0.1 * theta(2) = 8.5
V3 = 0.9 * V2 + 0.1 * theta(3) = 12.15
...
General equation: V(t) = beta * V(t-1) + (1 - beta) * theta(t)
If we plot this, it will represent averages over approximately 1 / (1 - beta) entries:
- beta = 0.9 will average the last 10 entries
- beta = 0.98 will average the last 50 entries
- beta = 0.5 will average the last 2 entries
The best beta for our case is between 0.9 and 0.98.
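A minimal sketch reproducing the numbers above (the theta values 40, 49, 45 are hypothetical daily temperatures chosen to match the example):

```python
import numpy as np

def ewma(theta, beta=0.9):
    """Exponentially weighted average of a sequence."""
    v, out = 0.0, []
    for x in theta:
        v = beta * v + (1 - beta) * x
        out.append(v)
    return np.array(out)

print(ewma([40, 49, 45]))  # [ 4.    8.5  12.15]
```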
What is data augmentation?
Data augmentation is a strategy that enables practitioners to significantly increase the diversity of data available for training models, without actually collecting new data. Data augmentation techniques such as cropping, padding, and horizontal flipping are commonly used to train large neural networks.
Why does dropout regularization work?
Dropout forces a neural network to learn more robust features that are useful in conjunction with many different random subsets of the other neurons
What is gradient checking?
Gradient checking is a technique that approximates the gradients and is very helpful for finding errors in your backpropagation implementation, but it is much slower than computing gradients with backpropagation (so use it only for debugging).
The main difference between Xavier Initialization and He Initialization:
He initialization works better for layers with ReLU activation. Xavier initialization works better for layers with sigmoid activation.
Why normalize the inputs?
If we don't normalize the inputs, our cost function will be deep and inconsistent in shape (elongated), and optimizing it will take a long time. But if we normalize, the opposite occurs: the shape of the cost function will be consistent (more symmetric, like a circle in the 2D example) and we can use a larger learning rate alpha, so the optimization will be faster.
Explain batch normalization
In the rise of deep learning, one of the most important ideas has been an algorithm called batch normalization, created by two researchers, Sergey Ioffe and Christian Szegedy. Batch normalization speeds up learning. Before, we normalized the input by subtracting the mean and dividing by the variance; this helped a lot with the shape of the cost function and with reaching the minimum point faster. There are some debates in the deep learning literature about whether you should normalize values before the activation function, Z[l], or after applying the activation function, A[l]. In practice, normalizing Z[l] is done much more often, and that is what Andrew Ng presents.
Algorithm: given Z[l] = [z(1), ..., z(m)], for i = 1 to m (for each input):
- Compute mean = (1/m) * sum(z[i])
- Compute variance = (1/m) * sum((z[i] - mean)^2)
- Then Z_norm[i] = (z[i] - mean) / np.sqrt(variance + epsilon) (add epsilon for numerical stability in case variance = 0). This forces the inputs to a distribution with zero mean and variance of 1.
- Then Z_tilde[i] = gamma * Z_norm[i] + beta. This makes the inputs belong to another distribution (with another mean and variance). gamma and beta are learnable parameters of the model, letting the NN learn the distribution of the outputs. Note: if gamma = sqrt(variance + epsilon) and beta = mean, then Z_tilde[i] = z[i].
- Then apply the activation: A(Z_tilde[i])
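A minimal sketch of the forward pass of this algorithm for one layer (the name batchnorm_forward and the example values are illustrative, not course code):

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, epsilon=1e-8):
    """Batch-normalize pre-activations Z of shape (n_units, m) across the batch."""
    mean = np.mean(Z, axis=1, keepdims=True)
    variance = np.var(Z, axis=1, keepdims=True)
    Z_norm = (Z - mean) / np.sqrt(variance + epsilon)  # zero mean, unit variance
    return gamma * Z_norm + beta                       # learnable rescale and shift

Z = np.random.randn(4, 8) * 3 + 1                      # stand-in pre-activations
gamma, beta = np.ones((4, 1)), np.zeros((4, 1))
print(batchnorm_forward(Z, gamma, beta).mean(axis=1))  # ~0 for each unit
```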
Formula for exponential decay schedule
It has the mathematical form lr = lr0 * e^(−kt), where lr0 (the initial learning rate) and k are hyperparameters and t is the iteration number.
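A minimal sketch with hypothetical values for lr0, k, and t:

```python
import numpy as np

def exp_decay_lr(lr0, k, t):
    """Exponentially decayed learning rate at iteration t."""
    return lr0 * np.exp(-k * t)

print(exp_decay_lr(lr0=0.1, k=0.01, t=100))  # 0.1 * e^-1 ≈ 0.0368
```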
The normal cost function that we want to minimize is:
J(w,b) = (1/m) * Sum(L(y(i),y'(i)))
The L1 regularization version:
J(w,b) = (1/m) * Sum(L(y(i),y'(i))) + (lambda/2m) * Sum(|w[i]|)
The L2 regularization version:
J(w,b) = (1/m) * Sum(L(y(i),y'(i))) + (lambda/2m) * Sum(|w[i]|^2)
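A minimal sketch of computing the L2-regularized cost; the cross-entropy cost is passed in as a value, and the parameters dict layout mirrors the He-initialization snippet below (these names are assumptions, not course code):

```python
import numpy as np

def l2_regularized_cost(cross_entropy_cost, parameters, lambd, m, L):
    """Add (lambda / 2m) * sum of squared weights over all L layers to the cost."""
    l2_penalty = sum(np.sum(np.square(parameters['W' + str(l)]))
                     for l in range(1, L + 1))
    return cross_entropy_cost + (lambd / (2 * m)) * l2_penalty

params = {'W1': np.random.randn(4, 3), 'W2': np.random.randn(1, 4)}  # stand-ins
print(l2_regularized_cost(0.5, params, lambd=0.7, m=100, L=2))
```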
How to compute He initialization for ReLU activation functions
L = len(layers_dims) - 1  # integer representing the number of layers
for l in range(1, L + 1):
    parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * np.sqrt(2 / layers_dims[l-1])
    parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
What is L2-regularization actually doing?
L2-regularization relies on the assumption that a model with small weights is simpler than a model with large weights. Thus, by penalizing the square values of the weights in the cost function you drive all the weights to smaller values. It becomes too costly for the cost to have large weights! This leads to a smoother model in which the output changes more slowly as the input changes.
Some important hyperparameters are:
- Learning rate
- Momentum beta
- Mini-batch size
- Number of hidden units
- Number of layers
- Learning rate decay
- Regularization lambda
- Activation functions
- Adam beta1, beta2 & epsilon
What is SoftMax regression?
SoftMax regression (or multinomial logistic regression) is a generalization of logistic regression to the case where we want to handle multiple classes. In logistic regression we assumed that the labels were binary: y(i) ∈ {0, 1}. We used such a classifier to distinguish between two kinds of hand-written digits. Softmax regression allows us to handle y(i) ∈ {1, ..., K}, where K is the number of classes.
Explain adam optimization
Stands for Adaptive Moment Estimation. Adam optimization and RMSprop are among the optimization algorithms that have worked very well with a lot of NN architectures. Adam simply puts RMSprop and momentum together.
Hyperparameters for Adam:
- Learning rate: needs to be tuned.
- beta1: parameter of the momentum term - 0.9 is recommended by default.
- beta2: parameter of the RMSprop term - 0.999 is recommended by default.
- epsilon: 10^-8 is recommended by default.
Pseudo code:
vdW = 0, vdb = 0
sdW = 0, sdb = 0
on iteration t:
    # can be mini-batch or batch gradient descent
    compute dW, db on current mini-batch
    vdW = (beta1 * vdW) + (1 - beta1) * dW    # momentum
    vdb = (beta1 * vdb) + (1 - beta1) * db    # momentum
    sdW = (beta2 * sdW) + (1 - beta2) * dW^2  # RMSprop
    sdb = (beta2 * sdb) + (1 - beta2) * db^2  # RMSprop
    vdW = vdW / (1 - beta1^t)  # fixing bias
    vdb = vdb / (1 - beta1^t)  # fixing bias
    sdW = sdW / (1 - beta2^t)  # fixing bias
    sdb = sdb / (1 - beta2^t)  # fixing bias
    W = W - learning_rate * vdW / (sqrt(sdW) + epsilon)
    b = b - learning_rate * vdb / (sqrt(sdb) + epsilon)
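A minimal runnable sketch of one Adam step for a single parameter array (the name adam_step and the example values are illustrative):

```python
import numpy as np

def adam_step(param, grad, v, s, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; v and s carry the momentum and RMSprop moving averages."""
    v = beta1 * v + (1 - beta1) * grad
    s = beta2 * s + (1 - beta2) * grad ** 2
    v_hat = v / (1 - beta1 ** t)  # bias correction
    s_hat = s / (1 - beta2 ** t)  # bias correction
    param = param - lr * v_hat / (np.sqrt(s_hat) + eps)
    return param, v, s

W = np.zeros((2, 2))
vdW, sdW = np.zeros_like(W), np.zeros_like(W)
W, vdW, sdW = adam_step(W, np.ones_like(W), vdW, sdW, t=1)
print(W)  # each entry moved by ~ -0.001 (one learning-rate-sized step)
```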
Explain RMSprop
Stands for Root Mean Square prop.
Pseudo code:
sdW = 0, sdb = 0
on iteration t:
    # can be mini-batch or batch gradient descent
    compute dW, db on current mini-batch
    sdW = (beta * sdW) + (1 - beta) * dW^2  # squaring is element-wise
    sdb = (beta * sdb) + (1 - beta) * db^2  # squaring is element-wise
    W = W - learning_rate * dW / sqrt(sdW + epsilon)
    b = b - learning_rate * db / sqrt(sdb + epsilon)
RMSprop makes the cost function move more slowly in the direction with large oscillations (the vertical direction in the classic elongated-contours example) and faster in the flat, horizontal direction.
The cross validation set must come from the same distribution as the ...
Test set
Explain gradient descent with momentum
The momentum algorithm almost always works faster than standard gradient descent. The simple idea is to calculate the exponentially weighted averages of your gradients and then update your weights with those averaged values.
Pseudo code:
vdW = 0, vdb = 0
on iteration t:
    # can be mini-batch or batch gradient descent
    compute dW, db on current mini-batch
    vdW = beta * vdW + (1 - beta) * dW
    vdb = beta * vdb + (1 - beta) * db
    W = W - learning_rate * vdW
    b = b - learning_rate * vdb
beta is another hyperparameter. beta = 0.9 is very common and works very well in most cases.
Why is the exponential moving average useful for gradient descent?
The reason why exponentially weighted averages are useful for further optimizing the gradient descent algorithm is that they can give different weights to recent data points (theta) based on the value of beta. If beta is high (around 0.9), it smooths out the averages of skewed data points (oscillations, in gradient descent terminology). This reduces oscillations in gradient descent and hence makes the path towards the minimum faster and smoother.
Why do we use Xavier Initialization?
This is one of the best partial solutions to vanishing/exploding gradients; it helps gradients not to vanish or explode too quickly.
What does it mean for λ to be large or small in L2 regularization?
When choosing a lambda value, the goal is to strike the right balance between simplicity and training-data fit:
- If your lambda value is too high, your model will be simple, but you run the risk of underfitting your data. Your model won't learn enough about the training data to make useful predictions.
- If your lambda value is too low, your model will be more complex, and you run the risk of overfitting your data. Your model will learn too much about the particularities of the training data, and won't be able to generalize to new data.
What is "Xavier Initialization / Glorot initialization"?
Xavier Initialization initializes the weights in your network by drawing them from a distribution with zero mean and a specific variance, Var(wi) = 1 / fan_in where fan_in is the number of incoming neurons.
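A minimal sketch for a single layer (the layer sizes n_in and n_out are hypothetical):

```python
import numpy as np

n_in, n_out = 64, 32  # incoming and outgoing units for this layer (hypothetical)
W = np.random.randn(n_out, n_in) * np.sqrt(1.0 / n_in)  # Var(w_i) = 1 / fan_in
b = np.zeros((n_out, 1))
print(W.var())  # ~ 1/64 ≈ 0.0156
```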
Softmax activation equations:
t = e^(Z[L]) # shape (C, m)
A[L] = e^(Z[L]) / sum(t) # shape (C, m); sum(t) sums t over the classes for each example (shape (1, m))
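A minimal numpy sketch of these equations (subtracting the per-column max is a standard numerical-stability trick, not part of the equations above):

```python
import numpy as np

def softmax(Z):
    """Column-wise softmax for Z of shape (C, m)."""
    t = np.exp(Z - np.max(Z, axis=0, keepdims=True))  # stabilized exponentials
    return t / np.sum(t, axis=0, keepdims=True)       # normalize over the C classes

Z = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]])    # C=3 classes, m=2 examples
print(softmax(Z).sum(axis=0))                          # each column sums to 1
```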
Define L1 matrix norm:
||W|| = Sum(|w[i,j]|) # sum of absolute values of all w
Define L2 matrix norm:
||W||^2 = Sum(|w[i,j]|^2) # sum of all w squared
L2 Also can be defined as
||W||^2 = W.T * W if W is a vector
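A quick numerical check of this identity with hypothetical values:

```python
import numpy as np

w = np.array([[1.0], [2.0], [3.0]])  # a column vector
print((w.T @ w).item())              # 14.0
print(np.sum(np.square(w)))          # 14.0, the same value as ||w||^2
```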