CS231N Midterm
Formula for calculating output width/height
(W−F+2P)/S+1
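A minimal sketch of this formula in Python (conv_output_size is a hypothetical helper name, not from the course code):
def conv_output_size(W, F, P, S):
    # W: input width/height, F: filter size, P: zero-padding, S: stride
    out = (W - F + 2 * P) / S + 1
    assert out == int(out), "filter does not tile the padded input evenly"
    return int(out)
# e.g. a 7x7 input with a 3x3 filter, pad 0, stride 1 gives conv_output_size(7, 3, 0, 1) == 5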
Given a random weight matrix and N classes to predict, what should we expect the average softmax loss to be?
-log(1/N) = log(N); e.g. for CIFAR-10 (N=10), roughly 2.30
with a regularization penalty, we can never achieve a loss of ___
0 (only occurs in pathological setting of W=0)
3 problems with sigmoid
1.) saturated neurons kill gradients, 2.) outputs are not zero-centered, 3.) exp() is computationally expensive
(context: CIFAR-10) a single weight matrix W is effectively evaluating ___ classifiers in parallel, where each is a _____
10 separate; row in W
modern convolutional neural nets contain on the order of ____ parameters and are made up of approximately ____ layers
100 million; 10-20
What is the number of learnable parameters for a 5x5 filter operating on an RGB image?
5x5x3 + 1 for bias = 76
Why is it problematic that the outputs of the sigmoid neuron are not zero-centered?
Because if the data coming into a neuron is always positive, then during backpropagation the gradients on all of the weights w will be either all positive or all negative, creating an undesirable zig-zagging dynamic in the gradient updates for the weights
Why is centered numerical gradient less good at avoiding kinks?
The centered formula evaluates the function at x−h and x+h, spanning an interval of 2h rather than h; this coarser (larger) effective step size is more likely to cross a kink and give a less accurate numerical approximation
kinks and gradient check accuracy
Consider gradient checking the ReLU function at x=-1e6. The gradient here is zero, but a numerical gradient might compute a non-zero gradient because f(x+h) might cross over the kink and introduce a non-zero contribution.
What's the big downside of L-BFGS?
Even after we eliminate the memory concerns, a large downside of a naive application of L-BFGS is that it must be computed over the entire training set, which could contain millions of examples. Unlike mini-batch SGD, getting L-BFGS to work on mini-batches is more tricky and an active area of research.
bias trick
Extend every vector x such that it always holds the constant 1 - a default bias dimension. Now the new score function simplifies to f(x) = Wx
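A minimal NumPy sketch of the trick; the toy vector values and the 10-class weight matrix are assumptions for illustration:
import numpy as np
x = np.array([56.0, 231.0, 24.0])    # toy input vector (assumed values)
x_ext = np.append(x, 1.0)            # append a constant 1: the bias dimension
W = np.random.randn(10, x_ext.size)  # last column of W now plays the role of the bias b
scores = W.dot(x_ext)                # f(x) = Wx, no separate bias term needed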
T/F: the decision boundary of a KNN classifier is linear.
False
convert conv layer to FC
For any CONV layer there is an FC layer that implements the same forward function. The weight matrix would be a large matrix that is mostly zero except for at certain blocks (due to local connectivity) where the weights in many of the blocks are equal (due to parameter sharing).
What does it mean when a ReLU neuron "dies"?
A large gradient flowing through a ReLU neuron can cause the weights to update in such a way that the neuron will never activate on any datapoint again. In practice, you may find that as much as 40% of your network can be "dead" (i.e. neurons that never activate across the entire training dataset) if the learning rate is set too high
convert any FC layer to CONV
For example, an FC layer with K=4096 that is looking at some input volume of size 7×7×512 can be equivalently expressed as a CONV layer with F=7, P=0, S=1, K=4096. In other words, we are setting the filter size to be exactly the size of the input volume, and hence the output will simply be 1×1×4096 since only a single depth column "fits" across the input volume, giving an identical result to the initial FC layer.
Why is it nice to be able to convert FC layers to CONV layers?
For example, if a 224x224 image gives a volume of size [7x7x512] (i.e. a reduction by 32), then forwarding an image of size 384x384 through the converted architecture would give the equivalent volume of size [12x12x512], since 384/32 = 12. Naturally, forwarding the converted ConvNet a single time is much more efficient than iterating the original ConvNet over all those 36 locations, since the 36 evaluations share computation. This trick is often used in practice to get better performance: for example, it is common to resize an image to make it bigger, use a converted ConvNet to evaluate the class scores at many spatial positions, and then average the class scores.
What is the intuition for annealing the learning rate over time?
A good intuition to have in mind is that with a high learning rate, the system contains too much kinetic energy and the parameter vector bounces around chaotically, unable to settle down into the deeper, but narrower, parts of the loss function.
Why do we put "probabilities" in quotes when talking about the softmax classifier?
How peaky or diffuse these probabilities are depends directly on the regularization strength λ - which you are in charge of as input to the system. Hence, the probabilities computed by the Softmax classifier are better thought of as confidences where, similar to the SVM, the ordering of the scores is interpretable, but the absolute numbers (or their differences) technically are not.
Why does one need to be careful when initializing the weights of the sigmoid neuron?
If the initial weights are too large then most neurons will become saturated and the network will not be able to learn
In what case is having a similar train and test error not a good thing?
If your regularization weight (λ) is extreme, your model will tend toward zero weights and give the same (bad) predictions on both train and test
why use padding?
In addition to the aforementioned benefit of keeping the spatial sizes constant after CONV, doing this actually improves performance. If the CONV layers were to not zero-pad the inputs and only perform valid convolutions, then the size of the volumes would reduce by a small amount after each CONV, and the information at the borders would be "washed away" too quickly.
Where do you insert batchnorm layers?
In the implementation, applying this technique usually amounts to inserting the BatchNorm layer immediately after fully connected layers (or convolutional layers), and before non-linearities.
How does adding training data contribute to the bias/variance tradeoff?
It doesn't trade anything off: more training data reduces variance without increasing bias, so it's simply better.
Difference between L1 and L2
Compared to L1, L2 punishes large differences between two vectors much more heavily, since it squares the per-dimension differences instead of summing their absolute values
If we use standard deviation calculated over all pixels, will subtracting by mean and dividing by standard deviation change the performance of the KNN classifier?
No, because the standard deviation scaling will amount to a scalar multiplied by the L1 distance, so the list of neighbors will not change
how to convert total number of values to GB
Once you have a rough estimate of the total number of values (for activations, gradients, and misc), the number should be converted to size in GB. Take the number of values, multiply by 4 to get the raw number of bytes (since every floating point is 4 bytes, or maybe by 8 for double precision), and then divide by 1024 multiple times to get the amount of memory in KB, MB, and finally GB. If your network doesn't fit, a common heuristic to "make it fit" is to decrease the batch size, since most of the memory is usually consumed by the activations.
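A minimal sketch of the conversion, assuming an illustrative made-up count of 93 million float32 values:
num_values = 93e6             # assumed total count of float32 values (activations + params + misc)
bytes_total = num_values * 4  # 4 bytes per float32 (use 8 for double precision)
kb = bytes_total / 1024
mb = kb / 1024
gb = mb / 1024                # roughly 0.35 GB for this made-up example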
Set padding to ____ when stride is 1 to preserve input volume and output volume size
P=(F−1)/2
If we calculate standard deviation separately for each pixel, will subtracting by mean and dividing by standard deviation change the performance of the KNN classifier?
Potentially. As different dimensions are scaled differently, performance could change
How could we edit L2 distance to be more efficient but get the same results from the NN function?
Remove the square root (square root is a monotonic function, preserves ordering)
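A minimal NumPy sketch under assumed toy shapes, showing that dropping the square root leaves the nearest neighbor unchanged:
import numpy as np
Xtr = np.random.randn(50, 3072)   # assumed toy training set (num_train x D)
x_te = np.random.randn(3072)      # one flattened test image
# squared L2 distances; skipping np.sqrt preserves ordering, so argmin is unchanged
sq_dists = np.sum((Xtr - x_te) ** 2, axis=1)
nearest = np.argmin(sq_dists)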
momentum update
The optimization process can then be seen as equivalent to the process of simulating the parameter vector (i.e. a particle) as rolling on the landscape. Since the force on the particle is related to the gradient of potential energy, the force felt by the particle is precisely the (negative) gradient of the loss function. Instead, the physics view suggests an update in which the gradient only directly influences the velocity, which in turn has an effect on the position.
Why do we not want to use the number of neurons to control overfitting?
The subtle reason behind this is that smaller networks are harder to train with local methods such as Gradient Descent: It's clear that their loss functions have relatively few local minima, but it turns out that many of these minima are easier to converge to, and that they are bad (i.e. with high loss). Conversely, bigger neural networks contain significantly more local minima, but these minima turn out to be much better in terms of their actual loss.
What would happen to our 2D classifier "lines" if we had no bias?
All of the lines would be forced to cross the origin, since the bias is what enables translation away from the origin
Relationship between Δ and λ
They control the same tradeoff: tradeoff between data loss and regularization loss in the objective. The magnitude of W has a direct effect on the scores (and their differences), so we can scale W up or down to meet any required margin. Therefore, the only real tradeoff is how large we allow the weights to grow through regularization strength λ
Compare the train and test error of a 1-nn to a 5-nn classifier.
Train error of 1-nn will always be zero, test error of a 1-nn may be better or worse than a 5-nn
What neuron should I use?
Use the ReLU non-linearity, be careful with your learning rates and possibly monitor the fraction of "dead" units in a network. If this concerns you, give Leaky ReLU or Maxout a try. Never use sigmoid. Try tanh, but expect it to work worse than ReLU/Maxout
What if we wanted to efficiently apply an original ConvNet over the image but at a stride smaller than 32 pixels?
We could achieve this with multiple forward passes. For example, note that if we wanted to use a stride of 16 pixels we could do so by combining the volumes received by forwarding the converted ConvNet twice: First over the original image and second over the image but with the image shifted spatially by 16 pixels along both width and height.
hierarchical softmax
When the set of labels is very large, it may be helpful to use hierarchical softmax. Each label is then represented as a path along the tree, and a Softmax classifier is trained at every node of the tree to disambiguate between the left and right branch. The structure of the tree strongly impacts the performance and is generally problem-dependent.
what is the problem of efficiency with the numerical update?
You may have noticed that evaluating the numerical gradient has complexity linear in the number of parameters. In our example we had 30730 parameters in total and therefore had to perform 30,731 evaluations of the loss function to evaluate the gradient and to perform only a single parameter update. This problem only gets worse, since modern Neural Networks can easily have tens of millions of parameters.
What do you do when val accuracy tracks training accuracy well?
Your model capacity may not be high enough: increase number of parameters
range of tanh nonlinearity
[-1,1]
range of sigmoid nonlinearity
[0,1]
L-BFGS
an optimization method that seeks to approximate the inverse Hessian
approximate nearest neighbor (ANN) algorithms
accelerate nearest neighbor lookup, trading off correctness of neighbor retrieval with its space/time complexity during retrieval
dilated convolutions
allow you to grow the receptive field much faster while using fewer conv layers
the amount of "wiggle" in the loss is related to
batch size
Why are our gradients noisy with batched SGD?
because they come from minibatches
If the number of hyperparameters is large, you might use _____ validation splits
bigger
adagrad code
cache += dx**2
x += - learning_rate * dx / (np.sqrt(cache) + eps)
RMSProp
cache = decay_rate * cache + (1 - decay_rate) * dx**2
x += - learning_rate * dx / (np.sqrt(cache) + eps)
dead ReLU
Consider the case where your affine layer outputs all zeros (or only negative values) for every input, so the ReLU output is all zeros as well. Then d(relu)/d(affine) = 0, so dL/d(affine) = 0 too, and no update gets performed for that neuron
What is the mu parameter in the momentum update?
consistent with notion of "friction"; dampens the velocity and reduces kinetic energy of the system
with softmax, we replace hinge loss with ____
cross entropy loss
any ____ can act as a gate through which gradients can backpropagate
differentiable function
source of randomness / uncertainty for regularization
dropout, stochastic depth, drop connect
L1 distance
d₁(I₁,I₂) = ∑ₚ |I₁ᵖ − I₂ᵖ|, i.e. the sum of absolute differences over all pixels p
template matching interpretation
each row of W corresponds to a "template" (or prototype) for each of the classes
unlike Adagrad, the gradient updates of RMSProp do not get ___
monotonically smaller; the leaky cache keeps the effective per-parameter learning rate from shrinking toward zero
in the softmax classifier, the ___ remains unchanged, but we now interpret these as ____
function mapping f(x; W) = Wx; unnormalized log probabilities for each class
nearest neighbor classifier
given a training set of 50,000 labeled images, label each of the 10,000 test images by predicting the label of its closest training image
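A minimal NumPy sketch of the prediction step; the training set size, dimensionality, and the L1 distance choice here are toy assumptions:
import numpy as np
Xtr = np.random.randn(500, 3072)          # toy flattened training images
ytr = np.random.randint(0, 10, size=500)  # their labels
x_te = np.random.randn(3072)              # one flattened test image
dists = np.sum(np.abs(Xtr - x_te), axis=1)  # L1 distance to every training image
pred = ytr[np.argmin(dists)]                # predict the label of the closest one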
burn-in time
gradchecking only after the network is allowed to learn a little, so that you don't gradcheck at pathological edge cases
What does adagrad do in practice?
weights that receive high gradients will have their effective learning rate reduced, while weights that receive small or infrequent updates will have their effective learning rate increased
SVM loss "wants" the correct class for each image to have a score that is _____
higher than the incorrect classes by some fixed margin
Why do we not initialize every neuron to have zero weights?
if every neuron in the network computes the same output, then they will also all compute the same gradients during backpropagation and undergo the exact same parameter updates.
GoogLeNet used what two innovative features?
inception modules, average pooling
combat overfitting
increase data quantity/quality, impose extra constraints, introduce randomness/uncertainty
k-nearest neighbor classifier
instead of finding the single closest image in the training set, we will find the top k closest images, and have them vote on the label of the test image
why is it not common to regularize bias parameters?
it is not common to regularize the bias parameters because they do not interact with the data through multiplicative interactions, and therefore do not have the interpretation of controlling the influence of a data dimension on the final objective
why is L2 loss harder to optimize than a stable loss like Softmax?
it requires a very fragile and specific property from the network: to output exactly one correct value for each input (and its augmentations)
drawbacks of sigmoid function
it saturates and kills gradients - when the sigmoid saturates at either tail of 0 or 1, the gradient at these regions is almost zero, and will kill the gradient flow.
in practice it is often the case that 3-layer neural networks will outperform 2-layer nets, but how about deeper networks?
it usually does not help that much more
cross-validation
iterate over different validation sets and average performance across them
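A minimal sketch of splitting into folds with NumPy; the fold count is arbitrary and train_and_evaluate is a hypothetical helper, not a real API:
import numpy as np
X = np.random.randn(100, 3072)           # toy data
y = np.random.randint(0, 10, size=100)   # toy labels
num_folds = 5
X_folds = np.array_split(X, num_folds)
y_folds = np.array_split(y, num_folds)
accuracies = []
for i in range(num_folds):
    X_val, y_val = X_folds[i], y_folds[i]                  # hold out fold i for validation
    X_tr = np.concatenate(X_folds[:i] + X_folds[i + 1:])   # train on the rest
    y_tr = np.concatenate(y_folds[:i] + y_folds[i + 1:])
    # accuracies.append(train_and_evaluate(X_tr, y_tr, X_val, y_val))  # hypothetical helper
# final estimate would be np.mean(accuracies)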
dropout
keeps a neuron active with probability p
PCA
keeps dimensions which contain the most variance
Compared to the softmax classifier, the SVM classifier has a more ___ objective
local
softmax classifier is the generalization of ____
logistic classification to multiple classes
hinge loss
the threshold-at-zero function max(0, −); for the SVM, each incorrect class j contributes max(0, s_j − s_correct + Δ)
three common forms of data preprocessing
mean subtraction, normalization, PCA/whitening
largest bottlenecks for convnets
memory
Why do people usually avoid cross-validation?
more computationally expensive
if examples in training set were not correlated, would mini-batching work?
no
how long does it take to train vs. test a nearest neighbors classifier
no time at all to train, but very expensive to test
Batchnorm: why is it possible to force activations throughout a network to take on a unit gaussian distribution at the beginning of training?
normalization is a simple differentiable operation
problems with ReLU
not zero-centered output, and gradients of 0 when x< 0
three major sources of memory strain for convnets
number of activations, parameters, miscellaneous (image data batches, augmented images, etc.)
why is l2 loss less robust?
outliers can introduce huge gradients
increased stride means smaller ___
output volumes spatially
inverted dropout
perform scaling at train time, leave test time untouched
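A minimal sketch of inverted dropout for one layer; p and the toy activation shape are assumptions:
import numpy as np
p = 0.5                      # probability of keeping a unit active (assumed)
h = np.random.randn(4, 100)  # toy activations for one layer
# train time: drop units and scale by 1/p now, so expected activations match test time
mask = (np.random.rand(*h.shape) < p) / p
h_train = h * mask
# test time: activations are left untouched
h_test = h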
what could cause a dead ReLU?
poor weight initialization, or a learning rate that is set too high
function of pooling layer
progressively reduce the spatial size of the representation to reduce the amount of parameters and computation in the network, and hence to also control overfitting
Why should you turn off regularization and check data loss alone first?
regularization loss may overwhelm data loss and mask an incorrect implementation of data loss
the recommended heuristic to initialize each neuron's weight vector is
scale by 1/sqrt(n), where n is the number of inputs
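A minimal sketch of this heuristic; the fan-in and layer width below are made-up values:
import numpy as np
n, num_neurons = 512, 256                         # assumed fan-in and layer width
W = np.random.randn(n, num_neurons) / np.sqrt(n)  # calibrate the variance by the fan-in
b = np.zeros(num_neurons)                         # biases can simply start at zero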
gradient clipping
scale gradient if norm is too big
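A minimal sketch of clipping by global norm; the threshold value is an assumption:
import numpy as np
max_norm = 5.0                    # assumed clipping threshold
dx = np.random.randn(1000) * 10   # toy gradient
norm = np.linalg.norm(dx)
if norm > max_norm:
    dx = dx * (max_norm / norm)   # rescale so the gradient norm equals max_norm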
L2 penalty prefers ______ weight vectors
smaller and more diffuse
higher values of k in the KNN classifier have what effect?
smoothing effect that is more resistant to outliers
Notice that if one of the inputs to the multiply gate is very small and the other is very big, then the multiply gate will do ___
something slightly unintuitive: it will assign a relatively huge gradient to the small input and a tiny gradient to the large input
during optimization, L1 regularization leads weight vectors to become ___
sparse
three common forms of learning rate decay
step decay, exponential decay, 1/t decay
extreme case where mini-batch consists of one example
stochastic gradient descent
whitening
takes the data into the eigenbasis and divides every dimension by its eigenvalue to normalize the scale
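A sketch of the standard PCA/whitening recipe in NumPy, assuming X is an [N x D] data matrix (the 1e-5 term just avoids division by zero):
import numpy as np
X = np.random.randn(100, 50)         # toy data matrix, one example per row
X -= np.mean(X, axis=0)              # zero-center the data
cov = np.dot(X.T, X) / X.shape[0]    # covariance matrix
U, S, V = np.linalg.svd(cov)         # eigenbasis via SVD
Xrot = np.dot(X, U)                  # decorrelate: project data into the eigenbasis
Xwhite = Xrot / np.sqrt(S + 1e-5)    # divide each dimension by the sqrt of its eigenvalue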
How do the tanh nonlinearity and the sigmoid nonlinearity compare in practice?
tanh is always preferred, since it is zero-centered
Seeing all tanh activation neurons outputting 0 or all completely saturated at -1 or 1 would imply what?
that something is off with weight initialization
accuracy
the fraction of predictions that were correct
in linear classifiers, the scale of the data has an effect on ___
the magnitude of the gradient for the weights
Subtracting the mean from the dataset does not change L1 distance because ____
the mean will cancel out, since both the train and test images are preprocessed with the same mean subtraction
The distribution of outputs from a randomly initialized neuron has a variance that grows with ____
the number of inputs
bug with hinge loss function
the set of parameters W that correctly classify every example is not unique; any multiple of these parameters λW where λ>1 would also give zero loss
T/F: smaller networks can be preferred if the data is not complex enough to prevent overfitting
False: there are many other preferred ways to prevent overfitting in Neural Networks (such as L2 regularization, dropout, input noise). In practice, it is always better to use these methods to control overfitting instead of the number of neurons.
loss function is a measure of ___
unhappiness with our current model
code for momentum update
v = mu * v - learning_rate * dx # integrate velocity
x += v # integrate position
Challenges in computer vision
viewpoint variation, scale variation, deformation, occlusion, illumination condition, background clutter, intra-class variation
it's sufficient for symmetry breaking to initialize all weights to 0, provided that biases are random
yes