Coursera NN Course2Week1 - Setting up your optimization problem
Gradient checking
For each component i of theta, compute d theta approx [i] = (J(theta_1, ..., theta_i + epsilon, ...) minus J(theta_1, ..., theta_i - epsilon, ...)) divided by 2 epsilon, using the two-sided difference, and then check whether the vector d theta approx is approximately equal to d theta from backprop.
Xavier Initialization
If you're using a TanH activation function, it's better to use the constant 1 instead of 2, so the variance is 1 over n^[l-1] rather than 2 over n^[l-1], and you multiply your randomly initialized weights by the square root of that. This square root term replaces the earlier one, and you use it when your activation function is TanH.
Implement gradient checking to make sure the implementation of backprop is correct.
Let's take the function f and replot it here, and remember this is f of theta equals theta cubed, and let's again start off with some value of theta, let's say theta equals 1. Now instead of just nudging theta to the right to get theta plus epsilon, we're going to nudge it to the right and nudge it to the left to get theta minus epsilon as well as theta plus epsilon. So this is 1, this is 1.01, this is 0.99, where, again, epsilon is the same as before, 0.01. It turns out that rather than taking this little triangle and computing the height over the width, you can get a much better estimate of the gradient if you take the point f of theta minus epsilon and the point f of theta plus epsilon, and instead compute the height over the width of this bigger triangle. So for technical reasons which I won't go into, the height over the width of this bigger green triangle gives you a much better approximation to the derivative at theta. And rather than taking just the lower triangle in the upper right, it's as if you have two triangles, one on the upper right and one on the lower left, and you're kind of taking both of them into account by using this bigger green triangle. So rather than a one-sided difference, you're taking a two-sided difference.
Remember Regularization
Remember your regularization term if you're using regularization. So if your cost function is J of theta equals 1 over m times the sum of your losses, plus the regularization term lambda over 2m times the sum over l of the Frobenius norm of W[l] squared, then that is the definition of J, and you should have that d theta is the gradient of J with respect to theta including this regularization term. So just remember to include that term.
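A rough sketch of what that cost looks like in code, assuming L2 regularization with hyperparameter lambd and the weight matrices kept in a list (the function name and arguments are placeholders, not from the lecture):

    import numpy as np

    def cost_with_regularization(losses, weight_matrices, lambd, m):
        # J = (1/m) * sum of losses  +  (lambda / 2m) * sum over l of ||W[l]||_F^2
        cross_entropy_cost = np.sum(losses) / m
        l2_cost = (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weight_matrices)
        return cross_entropy_cost + l2_cost

Grad check should call this full cost, and d theta should include the derivative of the L2 term as well.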
Gradient Checking - implementation notes
1) Don't use in training 2) If the algorithm fails grad check, look at the components to try to identify the bug 3) Remember regularization 4) Grad check doesn't work with dropout 5) Run at random initialization; perhaps again after some training
Normalizing inputs steps
1) Subtract mean 2) Normalize variances
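A minimal numpy sketch of these two steps, assuming X has shape (number of features, m) with examples in columns:

    import numpy as np

    def normalize_inputs(X):
        mu = np.mean(X, axis=1, keepdims=True)           # 1) subtract the mean
        X = X - mu
        sigma2 = np.mean(X ** 2, axis=1, keepdims=True)  # 2) normalize the variances
        X = X / np.sqrt(sigma2)
        return X, mu, sigma2

Use the same mu and sigma2 to normalize your test set, so train and test go through the same transformation.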
Run at random initialization
It's not impossible that your implementation of backprop is correct when w and b are close to 0, so at random initialization, but that as you run gradient descent and w and b become bigger, the implementation gets more inaccurate when w and b become large. So one thing you could do, I don't do this very often, but one thing you could do is run grad check at random initialization, then train the network for a while so that w and b have some time to wander away from 0, from your small random initial values, and then run grad check again after you've trained for some number of iterations.
Two-sided difference
By taking a two sided difference, you can numerically verify whether or not a function g, g of theta that someone else gives you is a correct implementation of the derivative of a function f.
Contrast - Gradient checking formula
If we were to use the one-sided difference formula, f of theta plus epsilon minus f of theta, divided by epsilon, then the error is on the order of epsilon. When epsilon is a number less than 1, epsilon is actually much bigger than epsilon squared, which is why the one-sided formula is a much less accurate approximation than the two-sided formula. That is why, when doing gradient checking, we use the two-sided difference: you compute f of theta plus epsilon minus f of theta minus epsilon and then divide by 2 epsilon, rather than the one-sided difference, which is less accurate. The takeaway is that the two-sided difference formula is much more accurate.
Weight Initialization if using a TanH activation function
If you are using a TanH activation function, then there's a paper that shows that instead of using the constant 2 it's better to use the constant 1, so the variance is 1 over n^[l-1] instead of 2 over n^[l-1], and you multiply by the square root of that. So this square root term replaces the previous one, and you use it if you're using a TanH activation function. This is called Xavier initialization.
Exploding gradients
For the sake of simplicity, let's say we're using an activation function g of z equals z, so a linear activation function, and let's ignore b, so b[l] equals zero. In that case you can show that the output, y-hat, will be W[L] times W[L-1] times W[L-2], dot dot dot, down to W[3] W[2] W[1] times x. If you want to just check the math: W[1] times x is going to be z[1], because b is equal to zero, so z[1] is equal to W[1] times x plus b, which is zero. Then a[1] is equal to g of z[1], but because we use a linear activation function this is just equal to z[1], so this first term, W[1] x, is equal to a[1]. By the same reasoning, W[2] times W[1] times x is equal to a[2], because that's g of z[2], which is g of W[2] times a[1], and so on, until the product of all these matrices gives you y-hat, not y. Now, let's say that each of your weight matrices W[l] is just a little bit larger than one times the identity, say the 2 by 2 matrix [[1.5, 0], [0, 1.5]]. Technically, the last one has different dimensions, so say this holds for all the rest of the weight matrices. Then y-hat will be, ignoring that last matrix with different dimensions, this [[1.5, 0], [0, 1.5]] matrix to the power of L minus 1, times x, because we assume that each one of these matrices is equal to 1.5 times the identity matrix. And so y-hat will be essentially 1.5 to the power of L minus 1 times x, and if L is large, for a very deep neural network, y-hat will be very large. In fact, it grows exponentially, like 1.5 to the number of layers, and so if you have a very deep neural network, the value of y-hat will explode. The intuition is that if the weights W are all just a little bit bigger than one, or just a little bit bigger than the identity matrix, then with a very deep network the activations can explode.
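A tiny numpy sketch of this argument, assuming linear activations, zero biases, and every W[l] equal to 1.5 times the 2 by 2 identity (the same loop with 0.5 illustrates the vanishing case in the note below):

    import numpy as np

    x = np.array([[1.0], [1.0]])
    W = 1.5 * np.eye(2)              # each W[l] a little bigger than the identity
    a = x
    for l in range(50):              # 50 layers with linear activation and b = 0
        a = W @ a                    # a[l] = W[l] a[l-1]
    print(a)                         # on the order of 1.5**50 -- the activations explode
    # Replace 1.5 with 0.5 and the same loop gives values around 0.5**50: they vanish instead.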
Check if vectors are approximately equal to each other
I would compute the distance between these two vectors, d theta approx minus d theta, so just the L2 norm of this.
Vanishing gradients
If we replace this with 0.5, so something less than 1, then this becomes 0.5 to the power of L; the matrix product becomes 0.5 to the L minus 1 times x, again ignoring W[L]. So if each of your weight matrices is less than 1, then, say x1 and x2 are both 1, the activations will be one half, one half, then one fourth, one fourth, one eighth, one eighth, and so on, until this becomes 1 over 2 to the L. So the activation values will decrease exponentially as a function of the depth, as a function of the number of layers L of the network. So in a very deep network, the activations end up decreasing exponentially.
Cost function for Normalized inputs
If you have more spherical contours, then wherever you start, gradient descent can pretty much go straight to the minimum. You can take much larger steps with gradient descent rather than needing to oscillate around like in the case of the unnormalized cost function. Of course, in practice w is a high-dimensional vector, so trying to plot this in 2D doesn't convey all the intuitions correctly. But the rough intuition is that your cost function will be more round and easier to optimize when your features are all on similar scales: not one feature from 1 to 1000 and another from 0 to 1, but mostly from minus 1 to 1, or with variances of about the same size as each other. That just makes your cost function J easier and faster to optimize. In practice, if one feature, say x1, ranges from 0 to 1, x2 ranges from minus 1 to 1, and x3 ranges from 1 to 2, these are fairly similar ranges, so this will work just fine.
Cost function for Unnormalized inputs
If you're running gradient descent on this cost function, you might have to use a very small learning rate, because if you start over here, gradient descent might need a lot of steps to oscillate back and forth before it finally finds its way to the minimum.
Cost function formula
Is a measure of "how good" a neural network did with respect to its given training sample and the expected output. It also may depend on variables such as weights and biases. A cost function is a single value, not a vector, because it rates how good the neural network did as a whole.
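For reference, the cost the course uses is the average of the per-example losses (written here without the regularization term):

    J(w, b) = (1/m) * sum over i from 1 to m of L(y-hat(i), y(i))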
The formal definition of a derivative is for very small values of epsilon
Is equal to f of theta plus epsilon minus f of theta minus epsilon, over 2 epsilon. And the formal definition of the derivative is the limit of exactly that formula as epsilon goes to 0. It turns out that for a nonzero value of epsilon, you can show that the error of this approximation is on the order of epsilon squared, and remember epsilon is a very small number. So if epsilon is 0.01, which it is here, then epsilon squared is 0.0001. The big O notation means the error is actually some constant times this, but this is actually exactly our approximation error, so the big O constant happens to be 1.
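The order-epsilon-squared claim follows from the Taylor expansions of f around theta (a standard argument, not spelled out in these notes):

    f(theta + epsilon) = f(theta) + epsilon*f'(theta) + (epsilon^2/2)*f''(theta) + O(epsilon^3)
    f(theta - epsilon) = f(theta) - epsilon*f'(theta) + (epsilon^2/2)*f''(theta) + O(epsilon^3)

    Subtracting and dividing by 2*epsilon:
    (f(theta + epsilon) - f(theta - epsilon)) / (2*epsilon) = f'(theta) + O(epsilon^2)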
Vanishing gradients solution
It turns out that a partial solution to this, which doesn't solve it entirely but helps a lot, is a better or more careful choice of the random initialization for your neural network.
Single neuron example - ReLU
It turns out that if you're using a ReLU activation function, then rather than 1 over n, setting the variance to 2 over n works a little bit better. So you often see that initialization if you're using a ReLU activation function, so if g[l](z) is ReLU(z). And, depending on how familiar you are with random variables, it turns out that taking a Gaussian random variable and multiplying it by the square root of this sets the variance to be 2 over n. The reason I went from n to n^[l-1] is that in this example, with logistic regression, we had n input features, but in the more general case layer l would have n^[l-1] inputs to each of its units. So if the input features or activations are roughly mean 0 and variance 1, then this would cause z to also take on a similar scale. This doesn't solve, but it definitely helps reduce, the vanishing and exploding gradients problem, because it's trying to set each of the weight matrices W so that it's not too much bigger than 1 and not too much less than 1, so it doesn't explode or vanish too quickly. There are also some other variants; the version we just described assumes a ReLU activation function.
if an algorithm fails grad check
Look at the components, look at the individual components, and try to identify the bug. If d theta approx is very far from d theta, what I would do is look at the different values of i to see which values of d theta approx are really very different from the values of d theta. So for example, if you find that the components that are very far off all correspond to db[l] for some layer or for some layers, but the components for dW are quite close, then, remembering that different components of theta correspond to different components of b and W, maybe you find that the bug is in how you're computing db, the derivative with respect to the parameters b. And vice versa: if you find that the values of d theta approx that are very far from d theta all came from dW, or from dW in a certain layer, then that might help you hone in on the location of the bug. This doesn't always let you identify the bug right away, but sometimes it helps give you some guesses about where to track down the bug.
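One way to hunt for the offending components, sketched in numpy; the labels list (e.g. 'dW1', 'db1', ...) is hypothetical bookkeeping you would maintain while building the giant vector, not something from the lecture:

    import numpy as np

    def worst_components(d_theta_approx, d_theta, labels, k=10):
        # Return the k components where backprop and the numerical approximation disagree most.
        gaps = np.abs(d_theta_approx - d_theta).ravel()
        order = np.argsort(gaps)[::-1][:k]
        return [(labels[i], gaps[i]) for i in order]   # e.g. if they all say 'db3', suspect db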
Check vectors: d theta approx ≈ d theta
Notice there's no square on top, so this is the sum of the squares of the elements of the difference, and then you take a square root, so you get the Euclidean distance. Then, just to normalize by the lengths of these vectors, divide by the norm of d theta approx plus the norm of d theta, just the Euclidean lengths of these vectors. The role of the denominator is, in case any of these vectors are really small or really large, to turn this formula into a ratio. When I implement this in practice, I use epsilon equals maybe 10 to the minus 7. With this value of epsilon, if you find that this formula gives you a value like 10 to the minus 7 or smaller, then that's great: it means that your derivative approximation is very likely correct, since this is just a very small value. If it's maybe on the order of 10 to the minus 5, I would take a careful look. Maybe this is okay, but I might double-check the components of this vector and make sure that none of them are too large; if some of the components of this difference are very large, then maybe you have a bug somewhere. And if this formula gives you something on the order of 10 to the minus 3, then I would be much more concerned that maybe there's a bug somewhere. You should really be getting values much smaller than 10 to the minus 3; if it's any bigger than that, I would be quite concerned, I would be seriously worried that there might be a bug. I would then look at the individual components of theta to see if there's a specific value of i for which d theta approx i is very different from d theta i, and use that to try to track down whether or not some of your derivative computations might be incorrect. And if after some amount of debugging it finally ends up being this kind of very small value, then you probably have a correct implementation.
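A minimal sketch of the check described here, assuming d_theta_approx and d_theta are the two concatenated vectors:

    import numpy as np

    def grad_check_difference(d_theta_approx, d_theta):
        # ||d_theta_approx - d_theta||_2 / (||d_theta_approx||_2 + ||d_theta||_2)
        numerator = np.linalg.norm(d_theta_approx - d_theta)
        denominator = np.linalg.norm(d_theta_approx) + np.linalg.norm(d_theta)
        return numerator / denominator

    # Rough guide: <= 1e-7 great, around 1e-5 take a careful look, >= 1e-3 probably a bug.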
When do you use grad check? In training or during debug.
Only during debugging! Computing d theta approx i for all the values of i is a very slow computation. So to implement gradient descent, you'd just use backprop to compute d theta, and it's only when you're debugging that you would compute d theta approx to make sure it's close to d theta. Once you've done that, turn off grad check; don't run it during every iteration of gradient descent, because that's just much too slow.
Example of initializing the weights for a single neuron
For a single neuron, you might input four features x1 through x4, then you have some a = g(z), and you end up with some y-hat. Later on, for a deeper net, these inputs will be the activations of some layer a[l], but for now let's just call them x. So z is going to be equal to w1x1 + w2x2 + ... + wnxn, and let's set b = 0, so let's just ignore b for now. In order to make z not blow up and not become too small, you notice that the larger n is, the smaller you want each wi to be, because z is the sum of the wixi, and if you're adding up a lot of these terms you want each of them to be smaller. One reasonable thing to do would be to set the variance of wi to be equal to 1 over n, where n is the number of input features going into the neuron. So in practice, what you can do is set the weight matrix W for a certain layer to be np.random.randn of whatever the shape of that matrix is, times the square root of 1 over the number of features fed into each neuron, which is n^[l-1], because that's the number of units feeding into each of the units in layer l.
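A minimal numpy sketch of this initialization for one layer; the concrete sizes are just placeholder assumptions:

    import numpy as np

    n_prev, n_l = 4, 1                                         # n^[l-1] inputs into n^[l] units
    W = np.random.randn(n_l, n_prev) * np.sqrt(1.0 / n_prev)   # variance of each weight is 1/n^[l-1]
    b = np.zeros((n_l, 1))
    # For ReLU layers, the earlier note suggests np.sqrt(2.0 / n_prev) instead (variance 2/n^[l-1]).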
Gradient checking - numerical approximation
This point here is f of theta plus epsilon. This point here is f of theta minus epsilon. So the height of this big green triangle is f of theta plus epsilon minus f of theta minus epsilon, and the width, this is 1 epsilon, this is 2 epsilon, so the width of this green triangle is 2 epsilon. So the height over the width is going to be, first, the height, f of theta plus epsilon minus f of theta minus epsilon, divided by the width, which is 2 epsilon, as written down here. And this should hopefully be close to g of theta. So plug in the values, remembering f of theta is theta cubed: theta plus epsilon is 1.01, so you take the cube of that, minus the cube of 0.99, divided by 2 times 0.01. Feel free to pause and check this on a calculator; you should get that this is 3.0001. Whereas from the previous slide we saw that g of theta was 3 theta squared, so when theta is 1 that's 3, and these two values are actually very close to each other. The approximation error is now 0.0001. Whereas on the previous slide, when we took the one-sided difference, just f of theta plus epsilon minus f of theta over epsilon, we had gotten 3.0301, so the approximation error was 0.03 rather than 0.0001. With this two-sided way of approximating the derivative, you find that this is extremely close to 3, and so this gives you a much greater confidence that g of theta is probably a correct implementation of the derivative of f.
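You can reproduce these numbers with a few lines of Python (f(theta) = theta cubed, theta = 1, epsilon = 0.01):

    f = lambda theta: theta ** 3
    g = lambda theta: 3 * theta ** 2                             # the claimed derivative
    theta, eps = 1.0, 0.01

    two_sided = (f(theta + eps) - f(theta - eps)) / (2 * eps)    # 3.0001
    one_sided = (f(theta + eps) - f(theta)) / eps                # 3.0301
    print(two_sided, one_sided, g(theta))                        # g(1) = 3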
Gradient checking steps
To debug, or to verify that your implementation of backprop is correct. Your neural network will have some set of parameters, W[1], b[1] and so on up to W[L], b[L]. To implement gradient checking: 1) The first thing you should do is take all your parameters and reshape them into a giant vector theta. So take W[1], which is a matrix, and reshape it into a vector; take all of the Ws and reshape them into vectors, and then concatenate all of these things, so that you have a giant vector theta. So whereas the cost function J was a function of the Ws and bs, you now have the cost function J being just a function of theta. 2) Next, with the W and b parameters ordered the same way, you can also take dW[1], db[1] and so on, and concatenate them into a big, giant vector d theta of the same dimension as theta. Same as before: reshape dW[1], which is a matrix, into a vector; db[1] is already a vector; and reshape all of the dWs, which are matrices, into vectors. Remember, dW[1] has the same dimension as W[1], and db[1] has the same dimension as b[1]. So with the same sort of reshaping and concatenation operation, you can reshape all of these derivatives into a giant vector d theta, which has the same dimension as theta.
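A rough sketch of the reshaping step, assuming the parameters are stored in a dict like {'W1': ..., 'b1': ..., ..., 'WL': ..., 'bL': ...} (that layout is an assumption, not from the lecture):

    import numpy as np

    def params_to_vector(params, L):
        # Concatenate W1, b1, ..., WL, bL into one giant column vector theta
        chunks = []
        for l in range(1, L + 1):
            chunks.append(params['W' + str(l)].reshape(-1, 1))
            chunks.append(params['b' + str(l)].reshape(-1, 1))
        return np.concatenate(chunks, axis=0)

    # Applying the same function to the gradients {'W1': dW1, 'b1': db1, ...} gives d theta.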
Gradient Checking
When you implement back propagation, there's a test called gradient checking that can really help you make sure that your implementation of backprop is correct.
Does gradient checking run slower?
When you use this two-sided method for gradient checking and numerically approximating derivatives, it turns out to run twice as slow as using a one-sided difference. In practice, I think it's worth it to use this method because it's just much more accurate.
Exploding gradients vs Vanishing gradients (quick explanation)
When you're training a very deep network your derivatives or your slopes can sometimes get either very, very big or very, very small, maybe even exponentially small, and this makes training difficult.
Does grad check work with dropout?
Grad check doesn't work with dropout, because in every iteration dropout is randomly eliminating different subsets of the hidden units, so there isn't an easy-to-compute cost function J that dropout is doing gradient descent on. It turns out that dropout can be viewed as optimizing some cost function J, but it's a cost function defined by summing over all the exponentially many subsets of nodes that could be eliminated in any iteration. So the cost function J is very difficult to compute; you're just sampling the cost function every time you eliminate a different random subset of nodes when you use dropout. So it's difficult to use grad check to double-check your computation with dropout. What I usually do is implement grad check without dropout: if you want, you can set keep-prob in dropout to be equal to 1.0, run grad check, and then turn on dropout and hope that the implementation of dropout is correct. Alternatively, you could fix the pattern of nodes dropped and verify that grad check for that pattern is correct, but in practice I don't usually do that. The recommendation is: turn off dropout, use grad check to double-check that your algorithm is at least correct without dropout, and then turn on dropout.