Machine Learning Exam 2

Suppose X1 is 0 and X2 is 1. The network below implements XNOR, with activation f(x) = 0 for x < 0 and f(x) = 1 for x >= 0. What will be the output? a) 0 b) 1

a) 0

Let us assume we implement an AND function with a single neuron. Truth table (X1, X2, X1 AND X2): 0 0 0 / 0 1 0 / 1 0 0 / 1 1 1. The neuron takes a +1 bias input along with x1 and x2. a) Bias = -1.5, w1 = 1, w2 = 1 b) Bias = 1.5, w1 = 2, w2 = 2 c) Bias = 1, w1 = 1.5, w2 = 1.5 d) none of the above

a) Bias = -1.5, w1 = 1, w2 =1

t/f A multiple-layer neural network with linear activation functions is equivalent to one single-layer perceptron that uses the same error function on the output layer and has the same number of inputs.

true

t/f A neural network with multiple hidden layers and sigmoid nodes can form non-linear decision boundaries.

true

Suppose you have 5 convolutional kernels of size 7 x 7 with zero padding and stride 1 in the first layer of a convolutional neural network. You pass an input of dimension 224 x 224 x 3 through this layer. What are the dimensions of the data which the next layer will receive? A) 217 x 217 x 3 B) 217 x 217 x 8 C) 218 x 218 x 5 D) 220 x 220 x 7

Output size = (W - F + 2P)/S + 1 = (224 - 7 + 2×0)/1 + 1 = 218. The depth equals the number of filters (5), so the next layer receives 218 x 218 x 5 => C)
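
A minimal sketch (assuming square inputs and filters, symmetric padding) that checks the output-size formula in Python:

# Hypothetical helper illustrating O = (W - F + 2P)/S + 1 for a conv layer
def conv_output_size(W, F, P=0, S=1):
    return (W - F + 2 * P) // S + 1

side = conv_output_size(224, 7, P=0, S=1)  # 218
print(side, side, 5)                       # depth = number of filters -> 218 x 218 x 5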

Batch Normalization is helpful because (a) It normalizes (changes) all the input before sending it to the next layer (b) It returns back the normalized mean and standard deviation of weights (c) It is a very efficient backpropagation technique (d) None of these

(a) It normalizes (changes) all the input before sending it to the next layer

Which of the following gives non-linearity to a neural network? (a) Stochastic Gradient Descent. (b) Rectified Linear Unit (c) Convolution function (d) None of the above

(b) Rectified Linear Unit

For a classification task, instead of random weight initializations in a neural network, we set all the weights to zero. Which of the following statements is true? (a) There will not be any problem and the neural network will train properly (b) The neural network will train but all the neurons will end up recognizing the same thing (c) The neural network will not train as there is no net gradient change (d) None of these

(b) The neural network will train but all the neurons will end up recognizing the same thing

In a neural network, knowing the weight and bias of each neuron is the most important step. If you can somehow get the correct value of weight and bias for each neuron, you can approximate any function. What would be the best way to approach this? (a) Assign random values and hope they are correct (b) Search every possible combination of weights and biases till you get the best value (c) Iteratively check that after assigning a value how far you are from the best values, and slightly change the assigned values to make them better (d) None of these

(c) Iteratively check that after assigning a value how far you are from the best values, and slightly change the assigned values to make them better

What are the steps for using a gradient descent algorithm in training neural networks? 1. Calculate error between the actual value and the predicted value 2. Reiterate until you find the best weights of the network 3. Pass an input through the network and get values from the output layer 4. Initialize random weight and bias 5. Go to each neuron which contributes to the error and change its respective values to reduce the error. (a) 1, 2, 3, 4, 5 (b) 5, 4, 3, 2, 1 (c) 3, 2, 1, 5, 4 (d) 4, 3, 1, 5, 2

(d) 4, 3, 1, 5, 2

What are the steps for using a gradient descent algorithm in training neural networks? 1. Calculate error between the actual value and the predicted value 2. Reiterate until you find the best weights of the network 3. Pass an input through the network and get values from the output layer 4. Initialize random weight and bias 5. Go to each neuron which contributes to the error and change its respective values to reduce the error. (a) 1, 2, 3, 4, 5 (b) 3, 2, 1, 5, 4 (c) 5, 4, 3, 2, 1 (d) 4, 3, 1, 5, 2

(d) 4, 3, 1, 5, 2

Which of the following is NOT correct about ReLU (Rectified Linear Unit) activation function? (a) Does not saturate in the positive region. (b) Computationally effective comparing to sigmoid and Tanh activation functions (c) Mostly converge faster than sigmoid and Tanh activation functions (d) Zero-centered output.

(d) Zero-centered output.

t/f A perceptron is guaranteed to perfectly learn a given linearly separable function within a finite number of training steps.

true

What value would be in place of the question mark? A convolution filter is applied to the input below; the marked window is the top-left 3 x 3 patch.
Input (5 x 5):
1 1 1 0 0
0 1 1 1 0
0 0 1 1 1
0 0 1 1 0
0 1 1 0 0
Filter (3 x 3):
1 0 1
0 1 0
1 0 1
A) 3 B) 4 C) 5 D) 6

1×1 + 1×0 + 1×1 + 0×0 + 1×1 + 1×0 + 0×1 + 0×0 + 1×1 = 1 + 0 + 1 + 0 + 1 + 0 + 0 + 0 + 1 = 4 => B) 4
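
As a sketch (assuming the 5 x 5 input and 3 x 3 filter reconstructed above), the same window computation in NumPy:

import numpy as np

window = np.array([[1, 1, 1],
                   [0, 1, 1],
                   [0, 0, 1]])   # top-left 3x3 patch of the input
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])
print(int(np.sum(window * kernel)))  # element-wise multiply and sum -> 4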

[10 pts] Forward propagation: Compute the input and output of each unit of the following neural network by filling in the empty boxes. see exam 2 2018

2, .88, -1.26, .22, .5, .62

Forward propagation: Compute the input and output of each unit of the following neural network by filling in the empty boxes. look at exam 2_2017

2, 0.5, .99, .62, -1.26, .28

Below is a diagram of a small convolutional neural network that converts a 61x61 image into 4 output values. The network has the following layers/operations from input to output: convolution with 3 filters (layer#1), convolution with 3 filters (layer#2), max pooling, finally a fully-connected layer, For this network we will NOT be using any bias/offset parameters. b) [5 points] How many weights do we need to learn for the entire network?

3×5×5 + 3×5×5 + 3×7×7×4 = 75 + 75 + 588 = 738
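
A sketch of the same count in Python (using the per-layer sizes assumed in the answer above, with no bias terms):

conv1 = 3 * 5 * 5          # 3 filters of 5x5 -> 75
conv2 = 3 * 5 * 5          # 3 filters of 5x5 -> 75
fc    = 3 * 7 * 7 * 4      # 3x7x7 pooled volume fully connected to 4 outputs -> 588
print(conv1 + conv2 + fc)  # 738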

Below is a diagram of a small convolutional neural network that converts a 61x61 image into 4 output values. The network has the following layers/operations from input to output: convolution with 3 filters (layer#1), convolution with 3 filters (layer#2), max pooling, finally a fully-connected layer, For this network we will NOT be using any bias/offset parameters. a) [5 points] How many weights in the convolutional layer # 1 do we need to learn?

3 × 5 × 5 = 75

What are the steps for the gradient descent algorithm? 1) calculate error between actual and predicted value 2) reiterate until you find the best weights 3) pass an input through the network and get values from the output layer 4) initialize random weight and bias 5) go to each neuron which contributes to the error and change its respective values to reduce the error

4, 3, 1, 5, 2

Below is a diagram of a small convolutional neural network that converts a 13x13 image into 4 output values. The network has the following layers/operations from input to output: convolution with 3 filters, max pooling, ReLU, and finally a fully-connected layer, For this network we will not be using any bias/offset parameters. b) How many weights do we need to learn for the entire network?

Convolutional layer: 4×4×3 = 48; fully-connected layer: 3×5×5×4 = 300; total = 348

Below is a diagram of a small convolutional neural network that converts a 13x13 image into 4 output values. The network has the following layers/operations from input to output: convolution with 3 filters, max pooling, ReLU, and finally a fully-connected layer, For this network we will not be using any bias/offset parameters. a) How many weights in the convolutional layer do we need to learn?

4x4x3=48

t/f The number of neurons in the output layer should match the number of classes (where the number of classes is greater than 2) in a supervised learning task.

false

[5 points] Consider the following two Multilayer Perceptrons Networks, where all of the layers use linear activation functions. (a) [1pt] Give one advantage of Network A over Network B.

• A is more expressive than B
• A has fewer units or is easier to implement

Which of the following options is correct for the below-mentioned techniques?
1. AdaGrad uses first order differentiation
2. L-BFGS uses second order differentiation
3. AdaGrad uses second order differentiation
4. L-BFGS uses first order differentiation
A) 1 and 2 B) 3 and 4 C) 1 and 4 D) 2 and 3

A) 1 and 2

XNOR network. Activation function: f(x) = 0 for x < 0, f(x) = 1 for x >= 0. Suppose X1 is 0 and X2 is 1, what will be the output for the above neural network? A. 0 B. 1

A. 0
Output of a1: f(0.5×1 + (-1)×0 + (-1)×1) = f(-0.5) = 0
Output of a2: f(-1.5×1 + 1×0 + 1×1) = f(-0.5) = 0
Output of a3: f(-0.5×1 + 1×0 + 1×0) = f(-0.5) = 0

X1 X2 | X1 AND X2
0  0  | 0
0  1  | 0
1  0  | 0
1  1  | 1
The activation function of our neuron is denoted as: f(x) = 0 for x < 0, f(x) = 1 for x >= 0. What would be the weights and bias? (Hint: for which values of w1, w2 and b does our neuron implement an AND function?) A. Bias = -1.5, w1 = 1, w2 = 1 B. Bias = 1.5, w1 = 2, w2 = 2 C. Bias = 1, w1 = 1.5, w2 = 1.5 D. None of these

A.
f(-1.5×1 + 1×0 + 1×0) = f(-1.5) = 0
f(-1.5×1 + 1×0 + 1×1) = f(-0.5) = 0
f(-1.5×1 + 1×1 + 1×0) = f(-0.5) = 0
f(-1.5×1 + 1×1 + 1×1) = f(0.5) = 1
Therefore option A is correct
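
A small sketch of this AND neuron (assuming the step activation and option A's weights):

def step(x):
    return 1 if x >= 0 else 0

def and_neuron(x1, x2, bias=-1.5, w1=1, w2=1):
    return step(bias + w1 * x1 + w2 * x2)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, and_neuron(x1, x2))  # 0 0 0 / 0 1 0 / 1 0 0 / 1 1 1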

What is a dead unit in a neural network? A. A unit which doesn't get updated during training by any of its neighbours B. A unit which does not respond completely to any of the training patterns C. The unit which produces the biggest sum-squared error D. None of these

A. A unit which doesn't get updated during training by any of its neighbours

Which of the following is true about model capacity (where model capacity means the ability of neural network to approximate complex functions) ? A. As number of hidden layers increase, model capacity increases B. As dropout ratio increases, model capacity increases C. As learning rate increases, model capacity increases D. None of these

A. As number of hidden layers increase, model capacity increases

[5 points] For a fully-connected deep network with one hidden layer, increasing the number of hidden units should have what effect on bias and variance? Explain briefly.

Adding more hidden units should decrease bias and increase variance. In general, more complicated models will result in lower bias but larger variance, and adding more hidden units certainly makes the model more complex. See the Lecture 2 slides on the "rule of thumb" for bias and variance

t/f Having max pooling layer in between two convolutional layers, always decrease the number of the parameters(weights).

false

t/f If you increase the number of hidden layers in a Multi Layer Perceptron neural networks (fully connected networks), the classification error of test data always decreases.

false

What if we use a learning rate that's too large? A. Network will converge B. Network will not converge C. Can't Say

B. Network will not converge Option B is correct because the error rate would become erratic and explode.

Which of the following statement is the best description of early stopping? A. Train the network until a local minimum in the error function is reached B. Simulate the network on a test dataset after every epoch of training. Stop training when the generalization error starts to increase C. Add a momentum term to the weight update in the Generalized Delta Rule, so that training converges more quickly D. A faster version of backpropagation, such as the `Quickprop' algorithm

B. Simulate the network on a test dataset after every epoch of training. Stop training when the generalization error starts to increase

Which gradient technique is more advantageous when the data is too big to handle in RAM simultaneously? A. Full Batch Gradient Descent B. Stochastic Gradient Descent

B. Stochastic Gradient Descent

For a classification task, instead of random weight initializations in a neural network, we set all the weights to zero. Which of the following statements is true? A. There will not be any problem and the neural network will train properly B. The neural network will train but all the neurons will end up recognizing the same thing C. The neural network will not train as there is no net gradient change D. None of these

B. The neural network will train but all the neurons will end up recognizing the same thing

t/f In a neural network, Data Augmentation, Dropout, Regularization all deal with overfitting.

true

t/f Suppose a convolutional neural network is trained on MNIST dataset (handwritten digits dataset). This trained model is then given a completely BLACK image as an input (Zero Input). The output probabilities for this input would be ZERO for all classes.

false

t/f In a neural network, Dropout, Regularization and Batch normalization all deal with overfitting

true

t/f It is possible to use 1x1 convolution filters

true

Suppose you have inputs as x, y, and z with values -2, 5, and -4 respectively. You have a neuron 'q' and neuron 'f' with functions: q = x + y f = q x z Graphical representation of the functions is as follows: (https://www.analyticsvidhya.com/blog/2017/01/must-know-questions-deep-learning/) Q7 What is the gradient of F with respect to x, y, and z? (HINT: To calculate gradient, you must find (df/dx), (df/dy) and (df/dz)) A. (-3,4,4) B. (4,4,3) C. (-4,-4,3) D. (3,-4,-4)

C. (-4, -4, 3)
q = x + y = 3, f = q·z = -12
dq/dx = 1, dq/dy = 1, df/dq = z = -4, df/dz = q = 3
df/dx = (df/dq)(dq/dx) = -4 × 1 = -4
df/dy = (df/dq)(dq/dy) = -4 × 1 = -4
So (df/dx, df/dy, df/dz) = (-4, -4, 3)
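
A sketch of the same chain-rule computation in Python (using the given values x = -2, y = 5, z = -4):

x, y, z = -2.0, 5.0, -4.0
q = x + y              # 3
f = q * z              # -12
df_dq, df_dz = z, q    # -4, 3
df_dx = df_dq * 1.0    # dq/dx = 1 -> -4
df_dy = df_dq * 1.0    # dq/dy = 1 -> -4
print(df_dx, df_dy, df_dz)  # -4.0 -4.0 3.0 -> option C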

t/f Using momentum with gradient descent helps to find solutions more quickly

true

Below is a diagram of a small convolutional neural network that converts a 61x61 image into 4 output values. The network has the following layers/operations from input to output: convolution with 3 filters (layer#1), convolution with 3 filters (layer#2), max pooling, finally a fully-connected layer, For this network we will NOT be using any bias/offset parameters. c) [5 points] What is the padding size (P) in convolutional layer # 1 and convolutional layer # 2?

Convolutional layer # 1: P = 0 Convolutional layer # 2: P = 2

t/f Suppose a convolutional neural network is trained on MNIST dataset (handwritten digits dataset). This trained model is then given a completely white image as an input. The output probabilities for this input would be equal for all classes

false

t/f When pooling layer is added in a convolutional neural network, translation invariance is preserved.

true

t/f A dead unit in a neural network is a unit which doesn't get updated during training by any of its neighbours

true

t/f Using mini-batch gradient descent, the model update frequency is higher than with batch gradient descent, which allows for more robust convergence and helps avoid local minima.

true

Which of the following options can be used to reduce overfitting in deep learning models?
1. Add more data
2. Use data augmentation
3. Use architecture that generalizes well
4. Add regularization
5. Reduce architectural complexity
A) 1, 2, 3 B) 1, 4, 5 C) 1, 3, 4, 5 D) All of these

D) All of these

What is the sequence of the following tasks in a perceptron? 1. Initialize weights of perceptron randomly 2. Go to the next batch of dataset 3. If the prediction does not match the output, change the weights 4. For a sample input, compute an output A. 1, 2, 3, 4 B. 4, 3, 2, 1 C. 3, 1, 2, 4 D. 1, 4, 3, 2

D. 1, 4, 3, 2

What are the steps for using a gradient descent algorithm? 1. Calculate error between the actual value and the predicted value 2. Reiterate until you find the best weights of network 3. Pass an input through the network and get values from output layer 4. Initialize random weight and bias 5. Go to each neuron which contributes to the error and change its respective values to reduce the error A. 1, 2, 3, 4, 5 B. 5, 4, 3, 2, 1 C. 3, 2, 1, 5, 4 D. 4, 3, 1, 5, 2

D. 4, 3, 1, 5, 2

In which neural net architecture, does weight sharing occur? A. Convolutional neural Network B. Recurrent Neural Network C. Fully Connected Neural Network D. Both A and B

D. Both A and B

t/f In backpropagation learning, we should start with a small learning rate parameter and slowly increase it during the learning process.

False

Backpropagation, describe how you will use backpropagation algorithm to train multilayer perceptron neural network (similar to this network). Consider mini-batch gradient descent algorithm

Given training set {(x1, y1), ..., (xn, yn)}
Initialize all Θ^(l) randomly (not to 0)
Loop  // each iteration is called an epoch
    Set Δ^(l)_ij = 0 for all l, i, j
    For each training instance (xi, yi):
        Set a^(1) = xi
        Compute {a^(2), ..., a^(L)} via forward propagation
        Compute δ^(L) = a^(L) - yi
        Compute errors {δ^(L-1), ..., δ^(2)}
        Accumulate gradients: Δ^(l)_ij = Δ^(l)_ij + a^(l)_j · δ^(l+1)_i
    Compute average regularized gradient:
        D^(l)_ij = (1/n) Δ^(l)_ij + λ Θ^(l)_ij   if j ≠ 0
        D^(l)_ij = (1/n) Δ^(l)_ij                 otherwise
    Update weights via gradient step: Θ^(l)_ij = Θ^(l)_ij - α D^(l)_ij
Until weights converge or max #epochs reached
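
A minimal runnable sketch of the same loop adapted to mini-batches (assumptions: one hidden layer, sigmoid activations, squared-error loss, toy data, no regularization; not the exam's exact network):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.random((100, 4))                               # toy inputs
Y = (X.sum(axis=1, keepdims=True) > 2).astype(float)   # toy targets

W1 = rng.normal(scale=0.5, size=(4, 8))                # random init, not zeros
W2 = rng.normal(scale=0.5, size=(8, 1))
alpha, batch_size = 0.5, 32

for epoch in range(200):                               # until max #epochs
    idx = rng.permutation(len(X))                      # shuffle; sample without replacement
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        a1 = X[b]
        a2 = sigmoid(a1 @ W1)                          # forward propagation
        a3 = sigmoid(a2 @ W2)
        d3 = (a3 - Y[b]) * a3 * (1 - a3)               # output-layer error
        d2 = (d3 @ W2.T) * a2 * (1 - a2)               # backpropagated hidden error
        W2 -= alpha * (a2.T @ d3) / len(b)             # gradient step per mini-batch
        W1 -= alpha * (a1.T @ d2) / len(b)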

c) [5 points] What is the role of pooling layers in convolutional neural networks?

• Invariance to small transformations
• Reduce the effect of noise and of shift or distortion
• Build the hierarchical feature representation

[10 points] Backpropagation, describe how you will use backpropagation algorithm to train multi-layer perceptron neural network (similar to this network). Consider mini-batch gradient descent algorithm

Same as the algorithm on slide 55, except you iterate over randomly picked, without-replacement mini-batch samples (32, 64, or 128, etc.). Update the weights the same way for each mini-batch. Keep iterating until all data has been visited (1 epoch).

Suppose we have a deep neural network model which was trained on a vehicle detection problem. The dataset consisted of images on cars and trucks and the aim was to detect name of the vehicle (the number of classes of vehicles are 10).Now you want to use this model on different dataset which has images of only Ford Mustangs (aka car) and the task is to locate the car in an image. Which of the following categories would be suitable for this type of problem? A) Fine tune only the last couple of layers and change the last layer (classification layer) to regression layer B) Freeze all the layers except the last, re-train the last layer C) Re-train the model for the new dataset D) None of these

Solution: (A)

The different components of the neuron are denoted as: x1, x2,..., xN: These are inputs to the neuron. These can either be the actual observations from input layer or an intermediate value from one of the hidden layers. w1, w2,...,wN: The Weight of each input. bi: Is termed as Bias units. These are constant values added to the input of the activation function corresponding to each weight. It works similar to an intercept term. a: Is termed as the activation of the neuron which can be represented as and y: is the output of the neuron Considering the above notations, will a line equation (y = mx + c) fall into the category of a neuron? A. Yes B. No

Solution: (A) A single neuron with no non-linearity can be considered as a linear regression function.

Q9. A neural network can be considered as multiple simple equations stacked together. Suppose we want to replicate the function for the below mentioned decision boundary. (https://www.analyticsvidhya.com/blog/2017/01/must-know-questions-deep-learning/) Q9 using two simple inputs h1, h2 What will be the final equation? A. (h1 AND NOT h2) OR (NOT h1 AND h2) B. (h1 OR NOT h2) AND (NOT h1 OR h2) C. (h1 AND h2) OR (h1 OR h2) D. None of these

Solution: (A) As you can see, combining h1 and h2 in an intelligent way can get you a complex equation easily. Refer Chapter 9 of this book

What is the technical difference between the vanilla backpropagation algorithm and the backpropagation through time (BPTT) algorithm? A) Unlike backprop, in BPTT we sum up gradients for the corresponding weight for each time step B) Unlike backprop, in BPTT we subtract gradients for the corresponding weight for each time step

Solution: (A) BPTT is used in context of recurrent neural networks. It works by summing up gradients for each time step

Which of the following techniques perform similar operations as dropout in a neural network? A. Bagging B. Boosting C. Stacking D. None of these

Solution: (A) Dropout can be seen as an extreme form of bagging in which each model is trained on a single case and each parameter of the model is very strongly regularized by sharing it with the corresponding parameter in all the other models. Refer (https://www.cs.toronto.edu/~hinton/absps/dropout.pdf)

Now let's revise the previous slides. We have learned that: -A neural network is a (crude) mathematical representation of a brain, which consists of smaller components called neurons. -Each neuron has an input, a processing function, and an output. - These neurons are stacked together to form a network, which can be used to approximate any function. -To get the best possible neural network, we can use techniques like gradient descent to update our neural network model. Given above is a description of a neural network. When does a neural network model become a deep learning model? A. When you add more hidden layers and increase depth of neural network B. When there is higher dimensionality of data C. When the problem is an image recognition problem D. None of these

Solution: (A) More depth means the network is deeper. There is no strict rule of how many layers are necessary to make a model deep, but still if there are more than 2 hidden layers, the model is said to be deep

Which of the following is a representation learning algorithm? A) Neural network B) Random Forest C) k-Nearest neighbor D) None of the above

Solution: (A) Neural network converts data in such a form that it would be better to solve the desired problem. This is called representation learning.

You are building a neural network where it gets input from the previous layer as well as from itself. https://www.analyticsvidhya.com/blog/2017/01/must-know-questions-deep-learning/ Q16 Which of the following architecture has feedback connections? A. Recurrent Neural network B. Convolutional Neural Network C. Restricted Boltzmann Machine D. None of these

Solution: (A) Option A is correct.

Can a neural network model the function (y=1/x)? A. Yes B. No

Solution: (A) Option A is true, because activation function can be reciprocal function.

A recurrent neural network can be unfolded into a fully-connected neural network with infinite length. A) TRUE B) FALSE

Solution: (A) A recurrent neuron can be thought of as a sequence of neurons over an infinite number of time steps.

The graph represents gradient flow of a four-hidden layer neural network which is trained using sigmoid activation function per epoch of training. The neural network suffers with the vanishing gradient problem. Q36: https://www.analyticsvidhya.com/blog/2017/01/must-know-questions-deep-learning/ Which of the following statements is true? A. Hidden layer 1 corresponds to D, Hidden layer 2 corresponds to C, Hidden layer 3 corresponds to B and Hidden layer 4 corresponds to A B. Hidden layer 1 corresponds to A, Hidden layer 2 corresponds to B, Hidden layer 3 corresponds to C and Hidden layer 4 corresponds to D

Solution: (A) This is a description of a vanishing gradient problem. As the backprop algorithm goes to starting layers, learning decreases.

Batch Normalization is helpful because A. It normalizes (changes) all the input before sending it to the next layer B. It returns back the normalized mean and standard deviation of weights C. It is a very efficient backpropagation technique D. None of these

Solution: (A) To read more about batch normalization, see refer this video (https://www.youtube.com/watch?v=Xogn6veSyxA)

When pooling layer is added in a convolutional neural network, translation in-variance is preserved. True or False? A. True B. False

Solution: (A) Translation invariance is induced when you use pooling.

There are many types of gradient descent algorithms. Two of the most notable ones are l-BFGS and SGD. l-BFGS is a second order gradient descent technique whereas SGD is a first order gradient descent technique. In which of the following scenarios would you prefer l-BFGS over SGD?
1. Data is sparse
2. Number of parameters of the neural network is small
A) Both 1 and 2 B) Only 1 C) Only 2 D) None of these

Solution: (A) l-BFGS works best for both of the scenarios.

Mini-Batch sizes when defining a neural network are preferred to be multiple of 2's such as 256 or 512. What is the reason behind it? A) Gradient descent optimizes best when you use an even number B) Parallelization of neural network is best when the memory is used optimally C) Losses are erratic when you don't use an even number D) None of these

Solution: (B)

Suppose we have one hidden layer neural network as shown above. The hidden layer in this network works as a dimensionality reductor. Now instead of using this hidden layer, we replace it with a dimensionality reduction technique such as PCA. Would the network that uses a dimensionality reduction technique always give same output as network with hidden layer? A. Yes B. No

Solution: (B) Because PCA works on correlated features, whereas hidden layers work on predictive capacity of features.

For an image recognition problem (recognizing a cat in a photo), which architecture of neural network would be better suited to solve the problem? A. Multi Layer Perceptron B. Convolutional Neural Network C. Recurrent Neural network D. Perceptron

Solution: (B) Convolutional Neural Network would be better suited for image related problems because of its inherent nature for taking into account changes in nearby locations of an image

"Convolutional Neural Networks can perform various types of transformation (rotations or scaling) in an input". Is the statement correct True or False? A. True B. False

Solution: (B) Data preprocessing steps (viz. rotation, scaling) are necessary before you give the data to the neural network, because the neural network cannot do it itself.

Suppose while training, you encounter this issue. The error suddenly increases after a couple of iterations. Q 40: https://www.analyticsvidhya.com/blog/2017/01/must-know-questions-deep-learning/ You determine that there must be a problem with the data. You plot the data and find that the original data is somewhat skewed and that may be causing the problem. What will you do to deal with this challenge? A. Normalize B. Apply PCA and then Normalize C. Take Log Transform of the data D. None of these

Solution: (B) First you would remove the correlations of the data and then zero center it.

Suppose we have a neural network with ReLU activation function. Let's say, we replace ReLu activations by linear activations. Would this new neural network be able to approximate an XNOR function? Note: The neural network was able to approximate XNOR function with activation function ReLu A) Yes B) No

Solution: (B) If ReLU activation is replaced by linear activation, the neural network loses its power to approximate non-linear function.

Increase in size of a convolutional kernel would necessarily increase the performance of a convolutional network. A. True B. False

Solution: (B) Increasing kernel size would not necessarily increase performance. This depends heavily on the dataset.

The number of neurons in the output layer should match the number of classes (Where the number of classes is greater than 2) in a supervised learning task. True or False? A. True B. False

Solution: (B) It depends on the output encoding. If it is one-hot encoding, then it's true. But you can have two outputs for four classes and take the binary values as the four classes (00, 01, 10, 11).

In the graph below, we observe that the error has many "ups and downs" Q42: https://www.analyticsvidhya.com/blog/2017/01/must-know-questions-deep-learning/ Should we be worried? A. Yes, because this means there is a problem with the learning rate of neural network. B. No, as long as there is a cumulative decrease in both training and validation error, we don't need to worry.

Solution: (B) Option B is correct. In order to decrease these "ups and downs" try to increase the batch size.

Sigmoid was the most commonly used activation function in neural network, until an issue was identified. The issue is that when the gradients are too large in positive or negative direction, the resulting gradients coming out of the activation function get squashed. This is called saturation of the neuron. That is why ReLU function was proposed, which kept the gradients same as before in the positive direction. A ReLU unit in neural network never gets saturated. A) TRUE B) FALSE

Solution: (B) ReLU can get saturated too. This can be on the negative side of x-axis.

Which of the following gives non-linearity to a neural network? A. Stochastic Gradient Descent B. Rectified Linear Unit C. Convolution function D. None of the above

Solution: (B) Rectified Linear unit is a non-linear activation function.

Y = ax^2 + bx + c (polynomial equation of degree 2) Can this equation be represented by a neural network of single hidden layer with linear threshold? A. Yes B. No

Solution: (B) The answer is no, because having a linear threshold restricts your neural network and, in simple terms, makes it equivalent to a linear transformation function.

Suppose a convolutional neural network is trained on ImageNet dataset (Object recognition dataset). This trained model is then given a completely white image as an input.The output probabilities for this input would be equal for all classes. True or False? A. True B. False

Solution: (B) There would be some neurons which do not activate for white pixels as input, so the class probabilities won't be equal.

First Order Gradient descent would not work correctly (i.e. may get stuck) in which of the following graphs? https://www.analyticsvidhya.com/blog/2017/01/must-know-questions-deep-learning/ Q19

Solution: (B) This is a classic example of saddle point problem of gradient descent

If you increase the number of hidden layers in a Multi Layer Perceptron, the classification error of test data always decreases. True or False? A. True B. False

Solution: (B) This is not always true. Overfitting may cause the error to increase.

The below graph shows the accuracy of a trained 3-layer convolutional neural network vs the number of parameters (i.e. number of feature kernels). https://www.analyticsvidhya.com/blog/2017/01/must-know-questions-deep-learning/ Q20 The trend suggests that as you increase the width of a neural network, the accuracy increases till a certain threshold value, and then starts decreasing. What could be the possible reason for this decrease? A. Even if number of kernels increase, only few of them are used for prediction B. As the number of kernels increase, the predictive power of neural network decrease C. As the number of kernels increase, they start to correlate with each other which in turn helps overfitting D. None of these

Solution: (C) As mentioned in option C, the possible reason could be kernel correlation.

There is a plateau at the start. This is happening because the neural network gets stuck at local minima before going on to global minima. To avoid this, which of the following strategy should work? A. Increase the number of parameters, as the network would not get stuck at local minima B. Decrease the learning rate by 10 times at the start and then use momentum C. Jitter the learning rate, i.e. change the learning rate for a few epochs D. None of these

Solution: (C) Option C can be used to take a neural network out of local minima in which it is stuck.

In a neural network, knowing the weight and bias of each neuron is the most important step. If you can somehow get the correct value of weight and bias for each neuron, you can approximate any function. What would be the best way to approach this? A. Assign random values and pray to God they are correct B. Search every possible combination of weights and biases till you get the best value C. Iteratively check that after assigning a value how far you are from the best values, and slightly change the assigned values to make them better D. None of these

Solution: (C) Option C is the description of gradient descent.

Given an n-character word, we want to predict which character would be the n+1th character in the sequence. For example, our input is "predictio" (which is a 9 character word) and we have to predict what would be the 10th character. Which neural network architecture would be suitable to complete this task? A) Fully-Connected Neural Network B) Convolutional Neural Network C) Recurrent Neural Network D) Restricted Boltzmann Machine

Solution: (C) Recurrent neural network works best for sequential data. Therefore, it would be best for the task.

While training a neural network for image recognition task, we plot the graph of training error and validation error for debugging. Q26: https://www.analyticsvidhya.com/blog/2017/04/40-questions-test-data-scientist-deep-learning/ What is the best place in the graph for early stopping? A) A B) B C) C D) D

Solution: (C) You would "early stop" where the model is most generalized. Therefore option C is correct.

Which of the following is a data augmentation technique used in image recognition tasks?
1. Horizontal flipping
2. Random cropping
3. Random scaling
4. Color jittering
5. Random translation
6. Random shearing
A) 1, 2, 4 B) 2, 3, 4, 5, 6 C) 1, 3, 5, 6 D) All of these

Solution: (D)

What are the factors to select the depth of neural network? 1.Type of neural network (eg. MLP, CNN etc) 2.Input data 3.Computation power, i.e. Hardware capabilities and software capabilities 4.Learning Rate 5.The output function to map A. 1, 2, 4, 5 B. 2, 3, 4, 5 C. 1, 3, 4, 5 D. All of these

Solution: (D) All of the above factors are important to select the depth of neural network

In a neural network, which of the following techniques is used to deal with overfitting? A. Dropout B. Regularization C. Batch Normalization D. All of these

Solution: (D) All of the techniques can be used to deal with overfitting.

Which of the following is a bottleneck for deep learning algorithm? A) Data related to the problem B) CPU to GPU communication C) GPU memory D) All of the above

Solution: (D) Along with having the knowledge of how to apply deep learning algorithms, you should also know the implementation details. Therefore you should know that all the above mentioned problems are a bottleneck for deep learning algorithm.

Suppose that you have to minimize the cost function by changing the parameters. Which of the following technique could be used for this? A. Exhaustive Search B. Random Search C. Bayesian Optimization D. Any of these

Solution: (D) Any of the above mentioned technique can be used to change parameters.

Consider the scenario. The problem you are trying to solve has a small amount of data. Fortunately, you have a pre-trained neural network that was trained on a similar problem. Which of the following methodologies would you choose to make use of this pre-trained network? A. Re-train the model for the new dataset B. Assess on every layer how the model performs and only select a few of them C. Fine tune the last couple of layers only D. Freeze all the layers except the last, re-train the last layer

Solution: (D) If the dataset is mostly similar, the best method would be to train only the last layer, as previous all layers work as feature extractors.

Which of the following is correct?
1. Dropout randomly masks the input weights to a neuron
2. Dropconnect randomly masks both input and output weights to a neuron
A) 1 is True and 2 is False B) 1 is False and 2 is True C) Both 1 and 2 are True D) Both 1 and 2 are False

Solution: (D) In dropout, whole neurons are dropped, so both the input and output weights of a dropped neuron are rendered useless, i.e. both are dropped; in dropconnect, individual connections are dropped, so only one of them is dropped. Hence both statements are false.

Instead of trying to achieve absolute zero error, we set a metric called bayes error which is the error we hope to achieve. What could be the reason for using bayes error? A. Input variables may not contain complete information about the output variable B. System (that creates input-output mapping) may be stochastic C. Limited training data D. All the above

Solution: (D) In reality achieving accurate prediction is a myth. So we should hope to achieve an "achievable result".

Suppose the input to a max-pooling layer is given below, and the pooling size of neurons in the layer is (3, 3).
3 4 5
4 5 6
5 6 7
What would be the output of this pooling layer? A) 3 B) 5 C) 5.5 D) 7

Solution: (D) Max pooling works as follows, it first takes the input using the pooling size we defined, and gives out the highest activated input.

In training a neural network, you notice that the loss does not decrease in the first few epochs. The reasons for this could be:
1. The learning rate is low
2. The regularization parameter is high
3. Stuck at a local minimum
What according to you are the probable reasons? A. 1 and 2 B. 2 and 3 C. 1 and 3 D. Any of these

Solution: (D) The problem can occur due to any of the reasons mentioned.

The network shown in Figure 1 is trained to recognize the characters H and T as shown below: T -> filled image, H -> unfilled image. What does the network output for input B? a) filled image b) unfilled image c) image with top and bottom filled d) could be a or b depending on the neural net

Solution: (D) Without knowing what are the weights and biases of a neural network, we cannot comment on what output it would give.

Which of the following is a decision boundary of Neural Network? Q41: https://www.analyticsvidhya.com/blog/2017/01/must-know-questions-deep-learning/ A) B B) A C) D D) C E) All of these

Solution: (E) A neural network is said to be a universal function approximator, so it can theoretically represent any decision boundary.

Given below is an input matrix of shape 7 x 7. What will be the output on applying max pooling of size 3 x 3 with a stride of 2?
1 2 4 1 4 0 1
0 0 1 6 1 5 5
1 4 4 5 1 4 1
4 1 5 1 6 5 0
1 0 6 5 1 1 8
2 3 1 8 5 8 1
0 9 1 2 3 1 4
a) 4 6 5 / 6 6 8 / 9 8 8  b) 4 5 5 / 6 6 8 / 9 8 6  c) 4 5 6 / 3 6 8 / 9 9 6  d) 4 3 3 / 3 3 3 / 4 4 4

Solution: A
Slide a 3 x 3 window over the input with a stride of 2 and take the maximum of each window:
max[1 2 4; 0 0 1; 1 4 4] = 4   max[4 1 4; 1 6 1; 4 5 1] = 6   max[4 0 1; 1 5 5; 1 4 1] = 5
max[1 4 4; 4 1 5; 1 0 6] = 6   max[4 5 1; 5 1 6; 6 5 1] = 6   max[1 4 1; 6 5 0; 1 1 8] = 8
max[1 0 6; 2 3 1; 0 9 1] = 9   max[6 5 1; 1 8 5; 1 2 3] = 8   max[1 1 8; 5 8 1; 3 1 4] = 8
Output: 4 6 5 / 6 6 8 / 9 8 8, so option (a) is the answer.
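
A sketch of the same pooling operation in NumPy (3 x 3 windows, stride 2), using the matrix above:

import numpy as np

I = np.array([[1,2,4,1,4,0,1],
              [0,0,1,6,1,5,5],
              [1,4,4,5,1,4,1],
              [4,1,5,1,6,5,0],
              [1,0,6,5,1,1,8],
              [2,3,1,8,5,8,1],
              [0,9,1,2,3,1,4]])
pool, stride = 3, 2
out = np.array([[I[r:r+pool, c:c+pool].max()
                 for c in range(0, I.shape[1] - pool + 1, stride)]
                for r in range(0, I.shape[0] - pool + 1, stride)])
print(out)  # [[4 6 5] [6 6 8] [9 8 8]]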

Dropout can be applied at visible layer of Neural Network model? A) TRUE B) FALSE

Solution: A
In the model architecture below, a new Dropout layer is added between the input (or visible) layer and the first hidden layer. The dropout rate is set to 20%, meaning one in 5 inputs will be randomly excluded from each update cycle.

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import SGD

def create_model():
    # create model with dropout on the visible (input) layer
    model = Sequential()
    model.add(Dropout(0.2, input_shape=(60,)))
    model.add(Dense(60, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # compile model
    sgd = SGD(lr=0.1)
    model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy'])
    return model

[True or False] Sentiment analysis using Deep Learning is a many-to one prediction task A) TRUE B) FALSE

Solution: A Option A is correct. This is because from a sequence of words, you have to predict whether the sentiment was positive or negative.

Gated Recurrent units can help prevent vanishing gradient problem in RNN. A) True B) False

Solution: A Option A is correct. This is because it has implicit memory to remember past behavior.

True/False: Changing Sigmoid activation to ReLu will help to get over the vanishing gradient issue? A) TRUE B) FALSE

Solution: A ReLU can help in solving vanishing gradient problem.

The number of nodes in the input layer is 10 and the hidden layer is 5. The maximum number of connections from the input layer to the hidden layer are A) 50 B) Less than 50 C) More than 50 D) It is an arbitrary value

Solution: A Since MLP is a fully connected directed graph, the number of connections are a multiple of number of nodes in input layer and hidden layer.

Which of the following functions can be used as an activation function in the output layer if we wish to predict the probabilities of n classes (p1, p2..pk) such that sum of p over all n equals to 1? A) Softmax B) ReLu C) Sigmoid D) Tanh

Solution: A The softmax function has the form p_i = e^{z_i} / Σ_j e^{z_j}, so the probabilities over all k classes sum to 1.
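
A sketch of the softmax function (with the usual max-shift for numerical stability):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())              # probabilities over the classes, summing to 1.0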

The input image has been converted into a matrix of size 28 X 28 and a kernel/filter of size 7 X 7 with a stride of 1. What will be the size of the convoluted matrix? A) 22 X 22 B) 21 X 21 C) 28 X 28 D) 7 X 7

Solution: A The size of the convoluted matrix is given by C = ((I - F + 2P)/S) + 1, where C is the size of the convoluted matrix, I is the size of the input matrix, F the size of the filter matrix and P the padding applied to the input matrix. Here P = 0, I = 28, F = 7 and S = 1. Therefore the answer is 22.

Which of the following neural network training challenge can be solved using batch normalization? A) Overfitting B) Restrict activations to become too high or low C) Training is too slow D) Both B and C E) All of the above

Solution: B Batch normalization normalizes the inputs to each layer, which restricts activations from becoming too high or low.

[True | False] In the neural network, every parameter can have their different learning rate. A) TRUE B) FALSE

Solution: A Yes, we can define the learning rate for each parameter and it can be different from other parameters.

I am working with a fully connected architecture having one hidden layer with 3 neurons and one output neuron to solve a binary classification challenge. Below is the structure of input and output: Input dataset: [ [1,0,1,0] , [1,0,1,1] , [0,1,0,1] ] Output: [ [1] , [1] , [0] ] To train the model, I have initialized all weights for the hidden and output layer with 1. Will the model be able to learn the pattern in the data? A) Yes B) No

Solution: B As all the weights of the neural network model are same, so all the neurons will try to do the same thing and the model will never converge.

[True or False] Backpropagation cannot be applied when using pooling layers A) TRUE B) FALSE

Solution: B Backpropagation can be applied to pooling layers too.

The red curve above denotes training accuracy with respect to each epoch in a deep learning algorithm. Both the green and blue curves denote validation accuracy. Which of these indicate overfitting? Q 25 : https://www.analyticsvidhya.com/blog/2017/08/skilltest-deep-learning/ A) Green Curve B) Blue Curve

Solution: B Blue curve shows overfitting, whereas green curve is generalized.

Question Context: Statement 1: It is possible to train a network well by initializing all the weights as 0 Statement 2: It is possible to train a network well by initializing biases as 0 Which of the statements given above is true? A) Statement 1 is true while Statement 2 is false B) Statement 2 is true while statement 1 is false C) Both statements are true D) Both statements are false

Solution: B Even if all the biases are zero, there is a chance that the neural network may learn. On the other hand, if all the weights are zero, the neural network may never learn to perform the task.

Is the data linearly separable? (The points form an XOR-like pattern: x o / o x) A) Yes B) No

Solution: B If you can draw a line or plane between the data points, it is said to be linearly separable.

Which of the following statement is true regrading dropout? 1: Dropout gives a way to approximate by combining many different architectures 2: Dropout demands high learning rates 3: Dropout can help preventing overfitting A) Both 1 and 2 B) Both 1 and 3 C) Both 2 and 3 D) All 1, 2 and 3

Solution: B Statements 1 and 3 are correct, statement 2 is not always true. Even after applying dropout and with low learning rate, a neural network can learn

In CNN, having max pooling always decrease the parameters? A) TRUE B) FALSE

Solution: B This is not always true. If we have a max pooling layer of pooling size as 1, the parameters would remain the same.

Given below is an input matrix named I, a kernel F and a convoluted matrix named C. Which of the following is the correct option for matrix C with stride = 2?
I:
1 0 0 1 1 0 1
0 0 1 1 1 0 1
1 1 1 0 1 0 1
1 0 1 0 1 1 0
0 1 1 0 0 1 1
0 1 1 1 0 1 1
F:
1 0 0
0 1 1
1 1 0
a) 4 4 3 3 3 / 4 2 3 3 2 / 3 3 3 1 3 / 3 4 2 3 2 / 4 3 3 2 4
b) 4 4 3 3 3 / 4 2 3 2 2 / 3 2 3 3 3 / 3 4 2 3 2 / 4 3 2 2 4
c) 4 3 3 / 3 3 3 / 4 3 4
d) 4 3 3 / 3 2 2 / 3 3 4

Solution: C Options (a) and (b) are automatically eliminated since their 5 x 5 size does not conform to the output size for a stride of 2. Upon calculation, option (c) is the correct answer.

Suppose you are using an early stopping mechanism with patience set to 2. At which epoch will the neural network model stop training?
epoch | train loss | valid loss
1     | 1.0        | 1.1
2     | 0.9        | 1.0
3     | 0.8        | 1.0
4     | 0.7        | 1.0
5     | 0.6        | 1.1
A) 2 B) 3 C) 4 D) 5

Solution: C As we have set patience as 2, the network will automatically stop training after epoch 4.
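
A sketch of patience-based early stopping on the validation losses above (patience = 2):

valid_losses = [1.1, 1.0, 1.0, 1.0, 1.1]   # epochs 1..5
patience, best, waited = 2, float("inf"), 0
for epoch, loss in enumerate(valid_losses, start=1):
    if loss < best:
        best, waited = loss, 0
    else:
        waited += 1
    if waited >= patience:
        print("stop after epoch", epoch)   # stops after epoch 4
        break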

Suppose there is an issue while training a neural network. The training loss/validation loss remains constant. What could be the possible reason? A) Architecture is not defined correctly B) Data given to the model is noisy C) Both of these

Solution: C Both architecture and data could be incorrect. Refer this article https://www.analyticsvidhya.com/blog/2017/07/debugging-neural-network-with-tensorboard/

Which of the following activation functions can't be used at the output layer to classify an image? A) sigmoid B) Tanh C) ReLU D) If(x>5,1,0) E) None of the above

Solution: C ReLU gives continuous output in range 0 to infinity. But in output layer, we want a finite range of values. So option C is correct.

For a binary classification problem, which of the following architecture would you choose? Q 23 : https://www.analyticsvidhya.com/blog/2017/08/skilltest-deep-learning/ 1) neural net with two hidden layers 2) neural net with one hidden layer a) 1 b) 2 c) any d) none

Solution: C We can either use one neuron as output for binary classification problem or two separate neurons.

Which of the following statements is true when you use 1×1 convolutions in a CNN? A) It can help in dimensionality reduction B) It can be used for feature pooling C) It suffers less overfitting due to small kernel size D) All of the above

Solution: D 1×1 convolutions are called bottleneck structure in CNN.

Which of the following are universal approximators? A) Kernel SVM B) Neural Networks C) Boosted Decision Trees D) All of the above

Solution: D All of the above methods can approximate any function.

In a simple MLP model with 8 neurons in the input layer, 5 neurons in the hidden layer and 1 neuron in the output layer. What is the size of the weight matrices between hidden output layer and input hidden layer? A) [1 X 5] , [5 X 8] B) [8 X 5] , [ 1 X 5] C) [8 X 5] , [5 X 1] D) [5 x 1] , [8 X 5]

Solution: D The size of the weight matrix between any layer 1 and layer 2 is given by [nodes in layer 1 X nodes in layer 2]
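
A sketch of the shapes for an 8-5-1 MLP in NumPy (weights are zero placeholders just to show dimensions):

import numpy as np

W_input_hidden  = np.zeros((8, 5))   # [nodes in layer 1 x nodes in layer 2]
W_hidden_output = np.zeros((5, 1))
x = np.zeros((1, 8))                 # one input sample
h = x @ W_input_hidden               # shape (1, 5)
y = h @ W_hidden_output              # shape (1, 1)
print(W_hidden_output.shape, W_input_hidden.shape)  # (5, 1) (8, 5) -> option D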

In which of the following applications can we use deep learning to solve the problem? A) Protein structure prediction B) Prediction of chemical reactions C) Detection of exotic particles D) All of these

Solution: D We can use neural network to approximate any function so it can theoretically be used to solve any problem

What steps can we take to prevent overfitting in a Neural Network? A) Data Augmentation B) Weight Sharing C) Early Stopping D) Dropout E) All of the above

Solution: E All of the above mentioned methods can help in preventing overfitting problem.

Below is a diagram of a small convolutional neural network that converts a 13x13 image into 4 output values. The network has the following layers/operations from input to output: convolution with 3 filters, max pooling, ReLU, and finally a fully-connected layer, For this network we will not be using any bias/offset parameters. True or false: A fully-connected neural network with the same size layers as the above network (13x13 → 3x10x10 → 3x5x5 → 4x1) can represent any classifier that the above convolutional network can represent.

True

Below is a diagram of a small convolutional neural network that converts a 61x61 image into 4 output values. The network has the following layers/operations from input to output: convolution with 3 filters (layer#1), convolution with 3 filters (layer#2), max pooling, finally a fully-connected layer, For this network we will NOT be using any bias/offset parameters. [5 points] True or false: A fully-connected neural network with the same size layers as the above network (61x61 → 3x29x29→ 3x15x15 → 3x7x7 → 4x1) can represent any classifier that the above convolutional network can represent.

True

t/f A neural network with multiple hidden layers and sigmoid nodes can form non-linear decision boundaries.

True

Suppose you want to redesign the AlexNet architecture to reduce the number of arithmetic operations required for each backprop update. Which one of these choices will reduce the number of arithmetic operations the most: a. Removing a convolutional layer b. Removing a fully connected layer c. Removing a pooling layer d. None of the above

a. Removing a convolutional layer

Increase in size of a convolutional kernel would necessarily increase the performance of a convolutional neural network. A) TRUE B) FALSE

b) False

Which of the following gives nonlinearity to a neural network? a) stochastic gradient descent b) ReLU c) multi-layers d) none

b) ReLU

In a neural network, knowing the weight and bias of each neuron is the most important step. If you can somehow get the correct value of weight and bias for each neuron, you can approximate any function. a) Assign random values and pray to God they are correct b) Search every possible combination of weights and biases till you get the best value c) Iteratively check that after assigning a value how far you are from the best values, and slightly change the assigned values to make them better d) None of these

c) Iteratively check that after assigning a value how far you are from the best values, and slightly change the assigned values to make them better

Which of the following is not true of convolutional neural networks (CNNs) for image analysis? a) Filters in earlier layers tend to include edge detectors b) Pooling layers reduce the spatial resolution of the image c) They have less parameters than fully connected networks with the same number of layers and the same numbers of neurons in each layer d) A CNN can be trained for unsupervised learning tasks, whereas an ordinary neural net cannot.

d) A CNN can be trained for unsupervised learning tasks, whereas an ordinary neural net cannot.

In a neural network, you notice that the loss does not decrease in the first few epochs. The reason could be 1) the learning rate is too low 2) the regularization parameter is high 3) stuck at a local minimum a) 1 and 2 b) 2 and 3 c) 1 and 3 d) any

d) any

The NN consists of many neurons; each neuron takes an input, processes it and gives an output. Which of the following statement(s) is correct? a) A neuron has a single input and a single output only b) a neuron has multiple inputs but only a single output c) a neuron has a single input but multiple outputs d) a neuron has multiple inputs and multiple outputs e) All of the above are valid

e) All of the above are valid

If you increase the number of hidden layers in a Multi Layer Perceptron, the classification error of test data always decreases. True or False?

false

t/f Suppose we have one hidden layer neural network as shown below. The hidden layer in this network works as a dimensionality reductor. Now instead of using this hidden layer, we replace it with a dimensionality reduction technique such as PCA. The network that uses a dimensionality reduction technique always give same output as network with hidden layer?

false

t/f The number of neurons in the output layer must match the number of classes (Where the number of classes is greater than 2) in a supervised learning task.

false

t/f weight sharing can occur in convolutional neural network or fully connected neural network (Multi-layer perceptron)

false

Below is a diagram of a small convolutional neural network that converts a 13x13 image into 4 output values. The network has the following layers/operations from input to output: convolution with 3 filters, max pooling, ReLU, and finally a fully-connected layer, For this network we will not be using any bias/offset parameters. c) What is the disadvantage of a fully-connected neural network compared to a convolutional neural network with the same size layers?

more parameters to train

Assume a simple MLP model with 3 neurons and inputs= 1,2,3. The weights to the input neurons are 4,5 and 6 respectively. Assume the activation function is a linear constant value of 3. What will be the output ? A) 32 B) 643 C) 96 D) 48

Output = activation × (in1×w1 + in2×w2 + in3×w3) = 3 × (1×4 + 2×5 + 3×6) = 3 × 32 = 96 => C) 96

[5 points] A convolutional neural network has 4 consecutive 3x3 convolutional layers with stride 1 and no pooling. How large is the support of (the set of image pixels which activate) a neuron in the 4th non-image layer of this network?

With a stride of 1, a 3x3 filter and no pooling, the "outer ring" of the image gets chopped off each time the convolution is applied, reducing the dimension from (n x n) to ((n-2) x (n-2)). Working backwards: 1x1 <- 3x3 <- 5x5 <- 7x7 <- 9x9. Thus, the support is 9x9 = 81 pixels.
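
A sketch of the receptive-field growth (each 3 x 3 convolution with stride 1 adds 2 pixels to the support):

size = 1                       # a single neuron in the 4th non-image layer
for _ in range(4):             # walk back through four 3x3, stride-1 conv layers
    size += 3 - 1
print(size, size * size)       # 9, 81 pixels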

can a neural network model the function y=1/x

yes

[5 points] Consider the following two Multilayer Perceptrons Networks, where all of the layers use linear activation functions (b)[1pt] Give one advantage of Network B over Network A.

• B has fewer connections, so it's less prone to overfitting • B has fewer connections, so backprop requires fewer operations • B has a bottleneck layer, so the network is forced to learn a compact representation (like an autoencoder)

