CS231N Midterm
Formula for calculating output width/height
(W−F+2P)/S+1
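A minimal sketch of this formula in Python (conv_output_size is a hypothetical helper name, not from the course code):
def conv_output_size(W, F, P, S):
    # W: input width/height, F: filter size, P: zero-padding, S: stride
    out = (W - F + 2 * P) / S + 1
    assert out == int(out), "filter does not tile the padded input evenly"
    return int(out)
# e.g. a 7x7 input with a 3x3 filter, pad 0, stride 1 gives conv_output_size(7, 3, 0, 1) == 5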
Given a random weight matrix and N classes to predict, what should we expect the average softmax loss to be?
-log(1/N) = log(N); e.g. for CIFAR-10 (N=10), roughly 2.30
with a regularization penalty, we can never achieve a loss of ___
0 (only occurs in pathological setting of W=0)
3 problems with sigmoid
1.) saturated neurons kill gradients, 2.) outputs are not zero-centered, 3.) exp() is computationally expensive
(context: CIFAR-10) a single weight matrix W is effectively evaluating ___ classifiers in parallel, where each is a _____
10 separate; row in W
modern convolutional neural nets contain on the order of ____ parameters and are made up of approximately ____ layers
100 million; 10-20
What is the number of learnable parameters for a 5x5 filter operating on an RGB image?
5x5x3 + 1 for bias = 76
Why is it problematic that the outputs of the sigmoid neuron are not zero-centered?
Because if the data coming into a neuron is always positive, then during backpropagation the gradients on all of the weights w will be either all positive or all negative, creating an undesirable zig-zagging dynamic in the gradient updates for the weights
Why is centered numerical gradient less good at avoiding kinks?
The centered formula evaluates the function at x−h and x+h, spanning an interval of 2h rather than h; this coarser (larger) effective step size is more likely to cross a kink and give a less accurate numerical approximation
kinks and gradient check accuracy
Consider gradient checking the ReLU function at x=-1e6. The gradient here is zero, but a numerical gradient might compute a non-zero gradient because f(x+h) might cross over the kink and introduce a non-zero contribution.
What's the big downside of L-BFGS?
Even after we eliminate the memory concerns, a large downside of a naive application of L-BFGS is that it must be computed over the entire training set, which could contain millions of examples. Unlike mini-batch SGD, getting L-BFGS to work on mini-batches is more tricky and an active area of research.
bias trick
Extend every vector x such that it always holds the constant 1 - a default bias dimension. Now the new score function simplifies to f(x) = Wx
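A minimal NumPy sketch of the trick; the toy vector values and the 10-class weight matrix are assumptions for illustration:
import numpy as np
x = np.array([56.0, 231.0, 24.0])    # toy input vector (assumed values)
x_ext = np.append(x, 1.0)            # append a constant 1: the bias dimension
W = np.random.randn(10, x_ext.size)  # last column of W now plays the role of the bias b
scores = W.dot(x_ext)                # f(x) = Wx, no separate bias term needed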
T/F: the decision boundary of a KNN classifier is linear.
False
convert conv layer to FC
For any CONV layer there is an FC layer that implements the same forward function. The weight matrix would be a large matrix that is mostly zero except for at certain blocks (due to local connectivity) where the weights in many of the blocks are equal (due to parameter sharing).
What does it mean when a ReLU neuron "dies"?
A large gradient flowing through a ReLU neuron can cause the weights to update in such a way that the neuron will never activate on any datapoint again. In practice, you may find that as much as 40% of your network can be "dead" (i.e. neurons that never activate across the entire training dataset) if the learning rate is set too high
convert any FC layer to CONV
For example, an FC layer with K=4096 that is looking at some input volume of size 7×7×512 can be equivalently expressed as a CONV layer with F=7, P=0, S=1, K=4096. In other words, we are setting the filter size to be exactly the size of the input volume, and hence the output will simply be 1×1×4096 since only a single depth column "fits" across the input volume, giving an identical result to the initial FC layer.
Why is it nice to be able to convert FC layers to CONV layers?
For example, if a 224x224 image gives a volume of size [7x7x512] (i.e. a reduction by 32), then forwarding an image of size 384x384 through the converted architecture would give the equivalent volume of size [12x12x512], since 384/32 = 12. Naturally, forwarding the converted ConvNet a single time is much more efficient than iterating the original ConvNet over all those 36 locations, since the 36 evaluations share computation. This trick is often used in practice to get better performance: for example, it is common to resize an image to make it bigger, use a converted ConvNet to evaluate the class scores at many spatial positions, and then average the class scores.
What is the intuition for annealing the learning rate over time?
A good intuition to have in mind is that with a high learning rate, the system contains too much kinetic energy and the parameter vector bounces around chaotically, unable to settle down into the deeper, but narrower, parts of the loss function.
Why do we put "probabilities" in quotes when talking about the softmax classifier?
How peaky or diffuse these probabilities are depends directly on the regularization strength λ - which you are in charge of as input to the system. Hence, the probabilities computed by the Softmax classifier are better thought of as confidences where, similar to the SVM, the ordering of the scores is interpretable, but the absolute numbers (or their differences) technically are not.
Why does one need to be careful when initializing the weights of the sigmoid neuron?
If the initial weights are too large then most neurons will become saturated and the network will not be able to learn
In what case is having a similar train and test error not a good thing?
If your regularization weight (λ) is extreme, your model will tend toward zero weights and give the same (bad) predictions on both train and test
why use padding?
In addition to the aforementioned benefit of keeping the spatial sizes constant after CONV, doing this actually improves performance. If the CONV layers were to not zero-pad the inputs and only perform valid convolutions, then the size of the volumes would reduce by a small amount after each CONV, and the information at the borders would be "washed away" too quickly.
Where do you insert batchnorm layers?
In the implementation, applying this technique usually amounts to inserting the BatchNorm layer immediately after fully connected layers (or convolutional layers), and before non-linearities.
How does adding training data contribute to the bias/variance tradeoff?
It doesn't trade anything off: more training data reduces variance without increasing bias, so it's simply better.
Difference between L1 and L2
Compared to L1, L2 punishes large differences between two vectors much more heavily, since it squares the per-dimension differences instead of summing their absolute values
If we use standard deviation calculated over all pixels, will subtracting by mean and dividing by standard deviation change the performance of the KNN classifier?
No, because the standard deviation scaling will amount to a scalar multiplied by the L1 distance, so the list of neighbors will not change
how to convert total number of values to GB
Once you have a rough estimate of the total number of values (for activations, gradients, and misc), the number should be converted to size in GB. Take the number of values, multiply by 4 to get the raw number of bytes (since every floating point is 4 bytes, or maybe by 8 for double precision), and then divide by 1024 multiple times to get the amount of memory in KB, MB, and finally GB. If your network doesn't fit, a common heuristic to "make it fit" is to decrease the batch size, since most of the memory is usually consumed by the activations.
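A minimal sketch of the conversion, assuming an illustrative made-up count of 93 million float32 values:
num_values = 93e6             # assumed total count of float32 values (activations + params + misc)
bytes_total = num_values * 4  # 4 bytes per float32 (use 8 for double precision)
kb = bytes_total / 1024
mb = kb / 1024
gb = mb / 1024                # roughly 0.35 GB for this made-up example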
Set padding to ____ when stride is 1 to preserve input volume and output volume size
P=(F−1)/2
If we calculate standard deviation separately for each pixel, will subtracting by mean and dividing by standard deviation change the performance of the KNN classifier?
Potentially. As different dimensions are scaled differently, performance could change
How could we edit L2 distance to be more efficient but get the same results from the NN function?
Remove the square root (square root is a monotonic function, preserves ordering)
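A minimal NumPy sketch under assumed toy shapes, showing that dropping the square root leaves the nearest neighbor unchanged:
import numpy as np
Xtr = np.random.randn(50, 3072)   # assumed toy training set (num_train x D)
x_te = np.random.randn(3072)      # one flattened test image
# squared L2 distances; skipping np.sqrt preserves ordering, so argmin is unchanged
sq_dists = np.sum((Xtr - x_te) ** 2, axis=1)
nearest = np.argmin(sq_dists)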
momentum update
The optimization process can then be seen as equivalent to the process of simulating the parameter vector (i.e. a particle) as rolling on the landscape. Since the force on the particle is related to the gradient of potential energy, the force felt by the particle is precisely the (negative) gradient of the loss function. Instead, the physics view suggests an update in which the gradient only directly influences the velocity, which in turn has an effect on the position.
Why do we not want to use the number of neurons to control overfitting?
The subtle reason behind this is that smaller networks are harder to train with local methods such as Gradient Descent: It's clear that their loss functions have relatively few local minima, but it turns out that many of these minima are easier to converge to, and that they are bad (i.e. with high loss). Conversely, bigger neural networks contain significantly more local minima, but these minima turn out to be much better in terms of their actual loss.
What would happen to our 2D classifier "lines" if we had no bias?
All of the lines would be forced to cross the origin, since the bias is what enables translation away from the origin
Relationship between Δ and λ
They control the same tradeoff: tradeoff between data loss and regularization loss in the objective. The magnitude of W has a direct effect on the scores (and their differences), so we can scale W up or down to meet any required margin. Therefore, the only real tradeoff is how large we allow the weights to grow through regularization strength λ
Compare the train and test error of a 1-nn to a 5-nn classifier.
Train error of 1-nn will always be zero, test error of a 1-nn may be better or worse than a 5-nn
What neuron should I use?
Use the ReLU non-linearity, be careful with your learning rates and possibly monitor the fraction of "dead" units in a network. If this concerns you, give Leaky ReLU or Maxout a try. Never use sigmoid. Try tanh, but expect it to work worse than ReLU/Maxout
What if we wanted to efficiently apply an original ConvNet over the image but at a stride smaller than 32 pixels?
We could achieve this with multiple forward passes. For example, note that if we wanted to use a stride of 16 pixels we could do so by combining the volumes received by forwarding the converted ConvNet twice: First over the original image and second over the image but with the image shifted spatially by 16 pixels along both width and height.
hierarchical softmax
When the set of labels is very large, it may be helpful to use hierarchical softmax. Each label is then represented as a path along the tree, and a Softmax classifier is trained at every node of the tree to disambiguate between the left and right branch. The structure of the tree strongly impacts the performance and is generally problem-dependent.
what is the problem of efficiency with the numerical update?
You may have noticed that evaluating the numerical gradient has complexity linear in the number of parameters. In our example we had 30730 parameters in total and therefore had to perform 30,731 evaluations of the loss function to evaluate the gradient and to perform only a single parameter update. This problem only gets worse, since modern Neural Networks can easily have tens of millions of parameters.
What do you do when val accuracy tracks training accuracy well?
Your model capacity may not be high enough: increase number of parameters
range of tanh nonlinearity
[-1,1]
range of sigmoid nonlinearity
[0,1]
L-BFGS
an optimization method that seeks to approximate the inverse Hessian
approximate nearest neighbor (ANN) algorithms
accelerate nearest neighbor lookup, trading off correctness of neighbor retrieval with its space/time complexity during retrieval
dilated convolutions
allow you to grow the receptive field much faster while using fewer conv layers
the amount of "wiggle" in the loss is related to
batch size
Why are our gradients noisy with batched SGD?
because they come from minibatches
If the number of hyperparameters is large, you might use _____ validation splits
bigger
adagrad code
cache += dx**2
x += - learning_rate * dx / (np.sqrt(cache) + eps)
RMSProp
cache = decay_rate * cache + (1 - decay_rate) * dx**2
x += - learning_rate * dx / (np.sqrt(cache) + eps)
dead ReLU
Consider the case where your affine layer outputs all zeros (or only negative values) for every input, so the ReLU output is all zeros as well. Then d(relu)/d(affine) = 0, so dL/d(affine) = 0 too, and no update gets performed for that neuron
What is the mu parameter in the momentum update?
consistent with notion of "friction"; dampens the velocity and reduces kinetic energy of the system
with softmax, we replace hinge loss with ____
cross entropy loss
any ____ can act as a gate through which gradients can backpropagate
differentiable function
source of randomness / uncertainty for regularization
dropout, stochastic depth, drop connect
L1 distance
d₁(I₁,I₂) = ∑ₚ |I₁ᵖ − I₂ᵖ|, i.e. the sum of absolute differences over all pixels p
template matching interpretation
each row of W corresponds to a "template" (or prototype) for each of the classes
unlike Adagrad, the gradient updates of RMSProp do not get ___
monotonically smaller; the leaky cache keeps the effective per-parameter learning rate from shrinking toward zero
in the softmax classifier, the ___ remains unchanged, but we now interpret these as ____
function mapping f(x; W) = Wx; unnormalized log probabilities for each class
nearest neighbor classifier
given a training set of 50,000 labeled images, label each of the 10,000 test images by predicting the label of its closest training image
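A minimal NumPy sketch of the prediction step; the training set size, dimensionality, and the L1 distance choice here are toy assumptions:
import numpy as np
Xtr = np.random.randn(500, 3072)          # toy flattened training images
ytr = np.random.randint(0, 10, size=500)  # their labels
x_te = np.random.randn(3072)              # one flattened test image
dists = np.sum(np.abs(Xtr - x_te), axis=1)  # L1 distance to every training image
pred = ytr[np.argmin(dists)]                # predict the label of the closest one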
burn-in time
gradchecking only after the network is allowed to learn a little, so that you don't gradcheck at pathological edge cases
What does adagrad do in practice?
weights that receive high gradients will have their effective learning rate reduced, while weights that receive small or infrequent updates will have their effective learning rate increased
SVM loss "wants" the correct class for each image to have a score that is _____
higher than the incorrect classes by some fixed margin
Why do we not initialize every neuron to have zero weights?
if every neuron in the network computes the same output, then they will also all compute the same gradients during backpropagation and undergo the exact same parameter updates.
GoogLeNet used what two innovative features?
inception modules, average pooling
combat overfitting
increase data quantity/quality, impose extra constraints, introduce randomness/uncertainty
k-nearest neighbor classifier
instead of finding the single closest image in the training set, we will find the top k closest images, and have them vote on the label of the test image
why is it not common to regularize bias parameters?
it is not common to regularize the bias parameters because they do not interact with the data through multiplicative interactions, and therefore do not have the interpretation of controlling the influence of a data dimension on the final objective
why is L2 loss harder to optimize than a stable loss like Softmax?
it requires a very fragile and specific property from the network: to output exactly one correct value for each input (and its augmentations)
drawbacks of sigmoid function
it saturates and kills gradients - when the sigmoid saturates at either tail of 0 or 1, the gradient at these regions is almost zero, and will kill the gradient flow.
in practice it is often the case that 3-layer neural networks will outperform 2-layer nets, but how about deeper networks?
it usually does not help that much more
cross-validation
iterate over different validation sets and average performance across them
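A minimal sketch of splitting into folds with NumPy; the fold count is arbitrary and train_and_evaluate is a hypothetical helper, not a real API:
import numpy as np
X = np.random.randn(100, 3072)           # toy data
y = np.random.randint(0, 10, size=100)   # toy labels
num_folds = 5
X_folds = np.array_split(X, num_folds)
y_folds = np.array_split(y, num_folds)
accuracies = []
for i in range(num_folds):
    X_val, y_val = X_folds[i], y_folds[i]                  # hold out fold i for validation
    X_tr = np.concatenate(X_folds[:i] + X_folds[i + 1:])   # train on the rest
    y_tr = np.concatenate(y_folds[:i] + y_folds[i + 1:])
    # accuracies.append(train_and_evaluate(X_tr, y_tr, X_val, y_val))  # hypothetical helper
# final estimate would be np.mean(accuracies)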
dropout
keeps a neuron active with probability p
PCA
keeps dimensions which contain the most variance
Compared to the softmax classifier, the SVM classifier has a more ___ objective
local
softmax classifier is the generalization of ____
logistic classification to multiple classes
hinge loss
the threshold-at-zero function max(0, −); for the SVM, each incorrect class j contributes max(0, s_j − s_correct + Δ)
three common forms of data preprocessing
mean subtraction, normalization, PCA/whitening
largest bottlenecks for convnets
memory
Why do people usually avoid cross-validation?
more computationally expensive
if examples in training set were not correlated, would mini-batching work?
no
how long does it take to train vs. test a nearest neighbors classifier
no time at all to train, but very expensive to test
Batchnorm: why is it possible to force activations throughout a network to take on a unit gaussian distribution at the beginning of training?
normalization is a simple differentiable operation
problems with ReLU
not zero-centered output, and gradients of 0 when x< 0
three major sources of memory strain for convnets
number of activations, parameters, miscellaneous (image data batches, augmented images, etc.)
why is l2 loss less robust?
outliers can introduce huge gradients
increased stride means smaller ___
output volumes spatially
inverted dropout
perform scaling at train time, leave test time untouched
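A minimal sketch of inverted dropout for one layer; p and the toy activation shape are assumptions:
import numpy as np
p = 0.5                      # probability of keeping a unit active (assumed)
h = np.random.randn(4, 100)  # toy activations for one layer
# train time: drop units and scale by 1/p now, so expected activations match test time
mask = (np.random.rand(*h.shape) < p) / p
h_train = h * mask
# test time: activations are left untouched
h_test = h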
what could cause a dead ReLU?
poor weight initialization, or a learning rate that is set too high
function of pooling layer
progressively reduce the spatial size of the representation to reduce the amount of parameters and computation in the network, and hence to also control overfitting
Why should you turn off regularization and check data loss alone first?
regularization loss may overwhelm data loss and mask an incorrect implementation of data loss
the recommended heuristic to initialize each neuron's weight vector is
scale by 1/sqrt(n), where n is the number of inputs
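A minimal sketch of this heuristic; the fan-in and layer width below are made-up values:
import numpy as np
n, num_neurons = 512, 256                         # assumed fan-in and layer width
W = np.random.randn(n, num_neurons) / np.sqrt(n)  # calibrate the variance by the fan-in
b = np.zeros(num_neurons)                         # biases can simply start at zero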
gradient clipping
scale gradient if norm is too big
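A minimal sketch of clipping by global norm; the threshold value is an assumption:
import numpy as np
max_norm = 5.0                    # assumed clipping threshold
dx = np.random.randn(1000) * 10   # toy gradient
norm = np.linalg.norm(dx)
if norm > max_norm:
    dx = dx * (max_norm / norm)   # rescale so the gradient norm equals max_norm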
L2 penalty prefers ______ weight vectors
smaller and more diffuse
higher values of k in the KNN classifier have what effect?
smoothing effect that is more resistant to outliers
Notice that if one of the inputs to the multiply gate is very small and the other is very big, then the multiply gate will do ___
something slightly unintuitive: it will assign a relatively huge gradient to the small input and a tiny gradient to the large input
during optimization, L1 regularization leads weight vectors to become ___
sparse
three common forms of learning rate decay
step decay, exponential decay, 1/t decay
extreme case where mini-batch consists of one example
stochastic gradient descent
whitening
takes the data into the eigenbasis and divides every dimension by its eigenvalue to normalize the scale
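A sketch of the standard PCA/whitening recipe in NumPy, assuming X is an [N x D] data matrix (the 1e-5 term just avoids division by zero):
import numpy as np
X = np.random.randn(100, 50)         # toy data matrix, one example per row
X -= np.mean(X, axis=0)              # zero-center the data
cov = np.dot(X.T, X) / X.shape[0]    # covariance matrix
U, S, V = np.linalg.svd(cov)         # eigenbasis via SVD
Xrot = np.dot(X, U)                  # decorrelate: project data into the eigenbasis
Xwhite = Xrot / np.sqrt(S + 1e-5)    # divide each dimension by the sqrt of its eigenvalue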
How do the tanh nonlinearity and the sigmoid nonlinearity compare in practice?
tanh is always preferred, since it is zero-centered
Seeing all tanh activation neurons outputting 0 or all completely saturated at -1 or 1 would imply what?
that something is off with weight initialization
accuracy
the fraction of predictions that were correct
in linear classifiers, the scale of the data has an effect on ___
the magnitude of the gradient for the weights
Subtracting the mean from the dataset does not change L1 distance because ____
the mean will cancel out, since both the train and test images are preprocessed with the same mean subtraction
The distribution of outputs from a randomly initialized neuron has a variance that grows with ____
the number of inputs
bug with hinge loss function
the set of parameters W that correctly classify every example is not unique; any multiple of these parameters λW where λ>1 would also give zero loss
T/F: smaller networks can be preferred if the data is not complex enough to prevent overfitting
False: there are many other preferred ways to prevent overfitting in Neural Networks (such as L2 regularization, dropout, input noise). In practice, it is always better to use these methods to control overfitting instead of the number of neurons.
loss function is a measure of ___
unhappiness with our current model
code for momentum update
v = mu * v - learning_rate * dx # integrate velocity
x += v # integrate position
Challenges in computer vision
viewpoint variation, scale variation, deformation, occlusion, illumination condition, background clutter, intra-class variation
it's sufficient for symmetry breaking to initialize all weights to 0, provided that biases are random
yes