Machine Learning Exam


What if we use a learning rate that's too large? A. Network will converge B. Network will not converge C. Can't Say

B. Network will not converge. Option B is correct: with too large a learning rate the updates overshoot the minimum, so the error becomes erratic and can explode.
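A minimal sketch of this effect, using plain gradient descent on f(x) = x² (an illustrative function not taken from the question): with a small learning rate the iterates shrink toward the minimum, while a learning rate beyond the stable range makes each step overshoot and the value explodes.

```python
# Gradient descent on f(x) = x^2; its gradient is 2x.
def gradient_descent(lr, steps=10, x=1.0):
    for _ in range(steps):
        x = x - lr * 2 * x
    return x

print(gradient_descent(lr=0.1))   # converges toward 0
print(gradient_descent(lr=1.5))   # overshoots every step; |x| grows to 1024
```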

Which gradient technique is more advantageous when the data is too big to handle in RAM simultaneously? A. Full Batch Gradient Descent B. Stochastic Gradient Descent

B. Stochastic Gradient Descent
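A rough sketch of why stochastic/mini-batch gradient descent sidesteps the RAM limit: only one batch needs to be in memory at a time. The data here is synthesized inside the generator purely for illustration; in practice each batch would be read from disk.

```python
import numpy as np

rng = np.random.default_rng(0)

def stream_batches(n_samples, batch_size):
    # Stand-in for reading successive chunks of a large dataset from disk:
    # only one batch ever lives in memory at a time.
    for start in range(0, n_samples, batch_size):
        size = min(batch_size, n_samples - start)
        X = rng.standard_normal((size, 3))
        y = X @ np.array([1.0, -2.0, 0.5])
        yield X, y

w, lr = np.zeros(3), 0.05
for X, y in stream_batches(n_samples=20_000, batch_size=64):
    grad = 2 * X.T @ (X @ w - y) / len(y)   # mini-batch gradient of squared error
    w -= lr * grad                          # update uses only the current batch
print(w)                                    # moves toward the true weights [1, -2, 0.5]
```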

Which of the following options can be used to reduce overfitting in deep learning models? 1. Add more data 2. Use data augmentation 3. Use an architecture that generalizes well 4. Add regularization 5. Reduce architectural complexity A) 1, 2, 3 B) 1, 4, 5 C) 1, 3, 4, 5 D) All of these

D) All of these

The red curve above denotes training accuracy with respect to each epoch in a deep learning algorithm. Both the green and blue curves denote validation accuracy. Which of these indicates overfitting? (Figure: Q25, https://www.analyticsvidhya.com/blog/2017/08/skilltest-deep-learning/) A) Green Curve B) Blue Curve

Solution: B. The blue curve shows overfitting, whereas the green curve generalizes well.

Question Context: Statement 1: It is possible to train a network well by initializing all the weights as 0 Statement 2: It is possible to train a network well by initializing biases as 0 Which of the statements given above is true? A) Statement 1 is true while Statement 2 is false B) Statement 2 is true while statement 1 is false C) Both statements are true D) Both statements are false

Solution: B. Even if all the biases are zero, the neural network may still learn. On the other hand, if all the weights are zero, the neural network may never learn to perform the task, because every neuron receives the same gradient and the symmetry between neurons is never broken.

Suppose there is an issue while training a neural network. The training loss/validation loss remains constant. What could be the possible reason? A) Architecture is not defined correctly B) Data given to the model is noisy C) Both of these

Solution: C. Both the architecture and the data could be at fault. Refer to this article: https://www.analyticsvidhya.com/blog/2017/07/debugging-neural-network-with-tensorboard/

Which of the following are universal approximators? A) Kernel SVM B) Neural Networks C) Boosted Decision Trees D) All of the above

Solution: D All of the above methods can approximate any function.

In a neural network, you notice that the loss does not decrease in the first few epochs. The reason could be: 1) the learning rate is too low 2) the regularization parameter is high 3) the optimizer is stuck at a local minimum a) 1 and 2 b) 2 and 3 c) 1 and 3 d) any of these

d) any of these

t/f A multiple-layer neural network with linear activation functions is equivalent to one single-layer perceptron that uses the same error function on the output layer and has the same number of inputs.

true

t/f A neural network with multiple hidden layers and sigmoid nodes can form non-linear decision boundaries.

true

t/f A perceptron is guaranteed to perfectly learn a given linearly separable function within a finite number of training steps.

true
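A small sketch of the perceptron learning rule on a linearly separable function (AND is used here as an illustrative target; the learning rate and epoch limit are arbitrary). Per the perceptron convergence theorem, the mistake-driven updates reach zero errors after finitely many steps.

```python
# Train a single threshold perceptron on AND (a linearly separable function).
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b, lr = [0.0, 0.0], 0.0, 1.0

for epoch in range(20):
    errors = 0
    for (x1, x2), target in data:
        pred = 1 if w[0] * x1 + w[1] * x2 + b >= 0 else 0
        if pred != target:                       # perceptron rule: update only on mistakes
            w[0] += lr * (target - pred) * x1
            w[1] += lr * (target - pred) * x2
            b += lr * (target - pred)
            errors += 1
    if errors == 0:                              # every example classified correctly
        break
print(w, b, "- converged after", epoch + 1, "epochs")
```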

t/f Using mini-batch gradient descent, the model update frequency is higher than with batch gradient descent, which allows for more robust convergence and helps avoid local minima.

true

Can a neural network model the function y = 1/x?

yes

For a classification task, instead of random weight initializations in a neural network, we set all the weights to zero. Which of the following statements is true? A. There will not be any problem and the neural network will train properly B. The neural network will train but all the neurons will end up recognizing the same thing C. The neural network will not train as there is no net gradient change D. None of these

B. The neural network will train but all the neurons will end up recognizing the same thing

Which of the following is a representation learning algorithm? A) Neural network B) Random Forest C) k-Nearest neighbor D) None of the above

Solution: (A) A neural network transforms the data into a representation that makes the desired problem easier to solve. This is called representation learning.

Mini-batch sizes when defining a neural network are preferred to be multiples of 2, such as 256 or 512. What is the reason behind it? A) Gradient descent optimizes best when you use an even number B) Parallelization of the neural network is best when the memory is used optimally C) Losses are erratic when you don't use an even number D) None of these

Solution: (B)

Suppose we have a neural network with ReLU activation function. Let's say, we replace ReLu activations by linear activations. Would this new neural network be able to approximate an XNOR function? Note: The neural network was able to approximate XNOR function with activation function ReLu A) Yes B) No

Solution: (B) If ReLU activation is replaced by linear activation, the neural network loses its power to approximate non-linear function.

The number of neurons in the output layer should match the number of classes (Where the number of classes is greater than 2) in a supervised learning task. True or False? A. True B. False

Solution: (B) It depends on the output encoding. If it is one-hot encoding, then it's true. But you can also have two outputs for four classes and interpret the binary values as the four classes (00, 01, 10, 11).
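To make the encoding point concrete, here is a tiny illustration (the class labels 0-3 are hypothetical): four classes can be represented with four one-hot outputs, or with just two binary-coded outputs.

```python
# One-hot encoding: one output neuron per class.
one_hot = {0: [1, 0, 0, 0], 1: [0, 1, 0, 0], 2: [0, 0, 1, 0], 3: [0, 0, 0, 1]}
# Binary encoding: two output neurons cover four classes (00, 01, 10, 11).
binary = {0: [0, 0], 1: [0, 1], 2: [1, 0], 3: [1, 1]}
print(len(one_hot[0]), "outputs vs", len(binary[0]), "outputs for 4 classes")
```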

Y = ax^2 + bx + c (polynomial equation of degree 2) Can this equation be represented by a neural network of single hidden layer with linear threshold? A. Yes B. No

Solution: (B) The answer is no, because a linear threshold restricts the neural network: in simple terms, it reduces to a linear transformation, which cannot represent a degree-2 polynomial.

If you increase the number of hidden layers in a Multi Layer Perceptron, the classification error of test data always decreases. True or False? A. True B. False

Solution: (B) This is not always true. Overfitting may cause the error to increase.

Consider the scenario. The problem you are trying to solve has a small amount of data. Fortunately, you have a pre-trained neural network that was trained on a similar problem. Which of the following methodologies would you choose to make use of this pre-trained network? A. Re-train the model for the new dataset B. Assess on every layer how the model performs and only select a few of them C. Fine tune the last couple of layers only D. Freeze all the layers except the last, re-train the last layer

Solution: (D) If the new dataset is mostly similar to the original one, the best method is to freeze all layers and re-train only the last one, since all the previous layers work as feature extractors.

Instead of trying to achieve absolute zero error, we set a metric called bayes error which is the error we hope to achieve. What could be the reason for using bayes error? A. Input variables may not contain complete information about the output variable B. System (that creates input-output mapping) may be stochastic C. Limited training data D. All the above

Solution: (D) In reality, achieving perfectly accurate prediction is a myth, so we should aim for the best achievable error, i.e. the Bayes error.

In training a neural network, you notice that the loss does not decrease in the first few epochs. The reasons for this could be: 1. The learning rate is low 2. The regularization parameter is high 3. The optimizer is stuck at a local minimum What according to you are the probable reasons? A. 1 and 2 B. 2 and 3 C. 1 and 3 D. Any of these

Solution: (D) The problem can occur due to any of the reasons mentioned.

The network shown in Figure 1 (figure not reproduced here) is trained to recognize the characters H and T as shown below: input T produces the filled output pattern, and input H produces the unfilled output pattern. What does it output for input B? a) the filled pattern b) the unfilled pattern c) a pattern with top and bottom filled d) could be a) or b) depending on the neural net

Solution: (D) Without knowing the weights and biases of the neural network, we cannot say what output it would give.

True/False: Changing Sigmoid activation to ReLu will help to get over the vanishing gradient issue? A) TRUE B) FALSE

Solution: A ReLU can help in solving vanishing gradient problem.
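A quick numeric sketch of the difference (the input values are chosen arbitrarily): the sigmoid's derivative is at most 0.25 and shrinks toward 0 for large |x|, so multiplying many such factors through deep layers makes gradients vanish, whereas ReLU's derivative is exactly 1 in the positive region.

```python
import numpy as np

x = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
sig = 1.0 / (1.0 + np.exp(-x))
print(sig * (1 - sig))            # sigmoid gradient: <= 0.25, near 0 for large |x|
print((x > 0).astype(float))      # ReLU gradient: 1 for x > 0, 0 otherwise
```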

The number of nodes in the input layer is 10 and the hidden layer is 5. The maximum number of connections from the input layer to the hidden layer are A) 50 B) Less than 50 C) More than 50 D) It is an arbitrary value

Solution: A Since an MLP is a fully connected directed graph, the maximum number of connections is the product of the number of nodes in the input layer and in the hidden layer: 10 × 5 = 50.

Which of the following functions can be used as an activation function in the output layer if we wish to predict the probabilities of n classes (p1, p2..pk) such that sum of p over all n equals to 1? A) Softmax B) ReLu C) Sigmoid D) Tanh

Solution: A The softmax function has the form p_i = e^{x_i} / Σ_j e^{x_j}, so the outputs are non-negative and their sum over all k classes is 1.
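A minimal sketch of the softmax computation (subtracting the max is only for numerical stability and does not change the result):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))    # shift for numerical stability
    return z / z.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())                # non-negative probabilities that sum to 1.0
```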

Which of the following neural network training challenge can be solved using batch normalization? A) Overfitting B) Restrict activations to become too high or low C) Training is too slow D) Both B and C E) All of the above

Solution: D Batch normalization normalizes the activations of each layer, which keeps them from becoming too high or too low (B) and also speeds up training (C).

[True | False] In the neural network, every parameter can have their different learning rate. A) TRUE B) FALSE

Solution: A Yes, we can define the learning rate for each parameter and it can be different from other parameters.

I am working with a fully connected architecture having one hidden layer with 3 neurons and one output neuron to solve a binary classification challenge. Below is the structure of input and output: Input dataset: [ [1,0,1,0] , [1,0,1,1] , [0,1,0,1] ] Output: [ [1] , [1] , [0] ] To train the model, I have initialized all weights for the hidden and output layer with 1. Will the model be able to learn the pattern in the data? A) Yes B) No

Solution: B Since all the weights of the neural network model are the same, all the neurons will try to do the same thing and the model will never converge.
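A sketch of the symmetry problem using the inputs/outputs from the question and a tiny 4-3-1 network with sigmoid units (the sigmoid and the squared-error gradient are assumptions; the question does not specify them): with every weight set to 1, all hidden neurons compute the same value and receive the same gradient, so they can never become different.

```python
import numpy as np

X = np.array([[1, 0, 1, 0], [1, 0, 1, 1], [0, 1, 0, 1]], dtype=float)
y = np.array([[1.0], [1.0], [0.0]])

W1 = np.ones((4, 3))                      # input -> hidden, all weights = 1
W2 = np.ones((3, 1))                      # hidden -> output, all weights = 1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h = sigmoid(X @ W1)                       # all three hidden columns are identical
out = sigmoid(h @ W2)

d_out = (out - y) * out * (1 - out)       # gradient of squared error at the output
dW2 = h.T @ d_out
d_h = (d_out @ W2.T) * h * (1 - h)
dW1 = X.T @ d_h                           # one column of updates per hidden neuron

# Every hidden neuron gets exactly the same update, so they stay identical.
print(np.allclose(dW1[:, 0], dW1[:, 1]), np.allclose(dW1[:, 1], dW1[:, 2]))
```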

For a binary classification problem, which of the following architectures would you choose? (Figure: Q23, https://www.analyticsvidhya.com/blog/2017/08/skilltest-deep-learning/) 1) neural net with two hidden layers 2) neural net with one hidden layer a) 1 b) 2 c) any d) none

Solution: C Either architecture can solve a binary classification problem; likewise, we can use either a single output neuron or two separate output neurons.

In a simple MLP model with 8 neurons in the input layer, 5 neurons in the hidden layer and 1 neuron in the output layer. What is the size of the weight matrices between hidden output layer and input hidden layer? A) [1 X 5] , [5 X 8] B) [8 X 5] , [ 1 X 5] C) [8 X 5] , [5 X 1] D) [5 x 1] , [8 X 5]

Solution: D The size of the weight matrix between any layer 1 and layer 2 is given by [nodes in layer 1 X nodes in layer 2].
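A quick shape check in NumPy, following the [nodes in layer 1 × nodes in layer 2] convention from the answer (the tanh activation is just an illustrative choice):

```python
import numpy as np

x = np.random.randn(8)           # 8 input neurons
W_ih = np.random.randn(8, 5)     # input -> hidden weights: [8 x 5]
W_ho = np.random.randn(5, 1)     # hidden -> output weights: [5 x 1]

hidden = np.tanh(x @ W_ih)       # shape (5,)
output = hidden @ W_ho           # shape (1,)
print(W_ih.shape, W_ho.shape, hidden.shape, output.shape)
```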

Which of the following gives nonlinearity to a neural network? a) stochastic gradient descent b) ReLU c) multiple layers d) none

b) ReLU

A NN consists of many neurons; each neuron takes an input, processes it and gives an output. Which of the following statement(s) is correct? a) A neuron has a single input and a single output only b) A neuron has multiple inputs but only a single output c) A neuron has a single input but multiple outputs d) A neuron has multiple inputs and multiple outputs e) All of the above are valid

e) All of the above are valid

Which of the following gives non-linearity to a neural network? (a) Stochastic Gradient Descent. (b) Rectified Linear Unit (c) Convolution function (d) None of the above

(b) Rectified Linear Unit

Which of the following techniques perform similar operations as dropout in a neural network? A. Bagging B. Boosting C. Stacking D. None of these

Solution: (A) Dropout can be seen as an extreme form of bagging in which each model is trained on a single case and each parameter of the model is very strongly regularized by sharing it with the corresponding parameter in all the other models. Refer (https://www.cs.toronto.edu/~hinton/absps/dropout.pdf)
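A sketch of (inverted) dropout at training time, with an arbitrary keep probability of 0.5 and made-up activations: each forward pass randomly zeroes some units and rescales the rest, so every mini-batch effectively trains a different thinned sub-network, which is the loose analogy to bagging.

```python
import numpy as np

rng = np.random.default_rng(0)
activations = rng.standard_normal(8)      # hidden-layer activations (made up)
keep_prob = 0.5
mask = rng.random(8) < keep_prob          # randomly keep ~half of the units
dropped = activations * mask / keep_prob  # rescale so the expected value is unchanged
print(dropped)                            # zeroed units form a random "thinned" network
```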

Can a neural network model the function (y=1/x)? A. Yes B. No

Solution: (A) Option A is true, because the activation function can be a reciprocal function.

There is a plateau at the start. This is happening because the neural network gets stuck at local minima before going on to global minima. To avoid this, which of the following strategy should work? A. Increase the number of parameters, as the network would not get stuck at local minima B. Decrease the learning rate by 10 times at the start and then use momentum C. Jitter the learning rate, i.e. change the learning rate for a few epochs D. None of these

Solution: (C) Option C can be used to take a neural network out of local minima in which it is stuck.

In a neural network, knowing the weight and bias of each neuron is the most important step. If you can somehow get the correct value of weight and bias for each neuron, you can approximate any function. What would be the best way to approach this? A. Assign random values and pray to God they are correct B. Search every possible combination of weights and biases till you get the best value C. Iteratively check that after assigning a value how far you are from the best values, and slightly change the assigned values to make them better D. None of these

Solution: (C) Option C is the description of gradient descent.

Suppose that you have to minimize the cost function by changing the parameters. Which of the following technique could be used for this? A. Exhaustive Search B. Random Search C. Bayesian Optimization D. Any of these

Solution: (D) Any of the above-mentioned techniques could be used to change the parameters.

What steps can we take to prevent overfitting in a Neural Network? A) Data Augmentation B) Weight Sharing C) Early Stopping D) Dropout E) All of the above

Solution: E All of the above-mentioned methods can help prevent overfitting.

For a classification task, instead of random weight initializations in a neural network, we set all the weights to zero. Which of the following statements is true? (a) There will not be any problem and the neural network will train properly (b) The neural network will train but all the neurons will end up recognizing the same thing (c) The neural network will not train as there is no net gradient change (d) None of these

(b) The neural network will train but all the neurons will end up recognizing the same thing

In a neural network, knowing the weight and bias of each neuron is the most important step. If you can somehow get the correct value of weight and bias for each neuron, you can approximate any function. What would be the best way to approach this? (a) Assign random values and hope they are correct (b) Search every possible combination of weights and biases till you get the best value (c) Iteratively check that after assigning a value how far you are from the best values, and slightly change the assigned values to make them better (d) None of these

(c) Iteratively check that after assigning a value how far you are from the best values, and slightly change the assigned values to make them better

What are the steps for using a gradient descent algorithm in training neural networks? 1. Calculate error between the actual value and the predicted value 2. Reiterate until you find the best weights of network 3. Pass an input through the network and get values from output layer 4. Initialize random weight and bias 5. Go to each neurons which contributes to the error and change its respective values to reduce the error. (a) 1, 2, 3, 4, 5 (b) 5, 4, 3, 2, 1 (c) 3, 2, 1, 5, 4 (d) 4, 3, 1, 5, 2

(d) 4, 3, 1, 5, 2
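The 4-3-1-5-2 ordering written as a training loop for a single linear neuron with squared error (the data, learning rate, and iteration count are illustrative, not from the question):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y_true = X @ np.array([1.0, -1.0, 2.0])

w = rng.standard_normal(3)        # 4. initialize random weights (bias omitted)
lr = 0.05
for _ in range(200):              # 2. reiterate until the weights are good
    y_pred = X @ w                # 3. pass the input through the network
    error = y_pred - y_true       # 1. compute error between actual and predicted
    grad = 2 * X.T @ error / len(y_true)
    w -= lr * grad                # 5. change the weights to reduce the error
print(w)                          # close to the true weights [1, -1, 2]
```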

Which of the following is NOT correct about ReLU (Rectified Linear Unit) activation function? (a) Does not saturate in the positive region. (b) Computationally effective comparing to sigmoid and Tanh activation functions (c) Mostly converge faster than sigmoid and Tanh activation functions (d) Zero-centered output.

(d) Zero-centered output.

XNOR network (figure not reproduced here). Activation function: f(x) = 0 for x < 0; f(x) = 1 for x >= 0. Suppose X1 is 0 and X2 is 1. What will be the output of the above neural network? A. 0 B. 1

A. Output of a1: f(0.5×1 + (−1)×0 + (−1)×1) = f(−0.5) = 0. Output of a2: f(−1.5×1 + 1×0 + 1×1) = f(−0.5) = 0. Output of a3 (the network output): f(−0.5×1 + 1×0 + 1×0) = f(−0.5) = 0.
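The same forward pass written out as code, using the threshold activation and the weights/biases from the worked answer (the bias is written as the coefficient of the constant 1):

```python
def f(x):
    return 0 if x < 0 else 1          # threshold activation from the question

def xnor_net(x1, x2):
    a1 = f(0.5 * 1 + -1 * x1 + -1 * x2)
    a2 = f(-1.5 * 1 + 1 * x1 + 1 * x2)
    return f(-0.5 * 1 + 1 * a1 + 1 * a2)

print(xnor_net(0, 1))                                     # 0 -> answer A
print([xnor_net(a, b) for a in (0, 1) for b in (0, 1)])   # [1, 0, 0, 1] = XNOR
```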

X1  X2  X1 AND X2
0   0   0
0   1   0
1   0   0
1   1   1
The activation function of our neuron is denoted as: f(x) = 0 for x < 0; f(x) = 1 for x >= 0. What would be the weights and bias? (Hint: For which values of w1, w2 and b does our neuron implement an AND function?) A. Bias = -1.5, w1 = 1, w2 = 1 B. Bias = 1.5, w1 = 2, w2 = 2 C. Bias = 1, w1 = 1.5, w2 = 1.5 D. None of these

A. f(−1.5×1 + 1×0 + 1×0) = f(−1.5) = 0; f(−1.5×1 + 1×0 + 1×1) = f(−0.5) = 0; f(−1.5×1 + 1×1 + 1×0) = f(−0.5) = 0; f(−1.5×1 + 1×1 + 1×1) = f(0.5) = 1. This matches the AND truth table, therefore option A is correct.
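A quick check of option A against the whole truth table, using the same threshold activation as above:

```python
def f(x):
    return 0 if x < 0 else 1

def and_neuron(x1, x2, bias=-1.5, w1=1, w2=1):
    return f(bias + w1 * x1 + w2 * x2)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", and_neuron(x1, x2))   # output is 1 only for (1, 1)
```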

What is a dead unit in a neural network? A. A unit which doesn't get updated during training by any of its neighbours B. A unit which does not respond completely to any of the training patterns C. The unit which produces the biggest sum-squared error D. None of these

A. A unit which doesn't get updated during training by any of its neighbours

Which of the following is true about model capacity (where model capacity means the ability of neural network to approximate complex functions) ? A. As number of hidden layers increase, model capacity increases B. As dropout ratio increases, model capacity increases C. As learning rate increases, model capacity increases D. None of these

A. As number of hidden layers increase, model capacity increases

Which of the following statement is the best description of early stopping? A. Train the network until a local minimum in the error function is reached B. Simulate the network on a test dataset after every epoch of training. Stop training when the generalization error starts to increase C. Add a momentum term to the weight update in the Generalized Delta Rule, so that training converges more quickly D. A faster version of backpropagation, such as the `Quickprop' algorithm

B. Simulate the network on a test dataset after every epoch of training. Stop training when the generalization error starts to increase
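A small sketch of that stopping rule; the per-epoch validation errors below are made-up numbers standing in for real measurements, and the patience of 2 epochs is an arbitrary choice:

```python
# Hypothetical per-epoch validation errors standing in for real measurements.
val_errors = [0.90, 0.70, 0.55, 0.48, 0.45, 0.46, 0.49, 0.53, 0.60]

best, best_epoch, patience, bad = float("inf"), 0, 2, 0
for epoch, err in enumerate(val_errors):
    if err < best:
        best, best_epoch, bad = err, epoch, 0   # new best: keep these weights
    else:
        bad += 1
        if bad >= patience:                     # generalization error keeps rising
            break
print("stop at epoch", epoch, "- best model from epoch", best_epoch)
```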

What is the sequence of the following tasks in a perceptron? 1. Initialize weights of the perceptron randomly 2. Go to the next batch of the dataset 3. If the prediction does not match the output, change the weights 4. For a sample input, compute an output A. 1, 2, 3, 4 B. 4, 3, 2, 1 C. 3, 1, 2, 4 D. 1, 4, 3, 2

D. 1, 4, 3, 2

The different components of the neuron are denoted as: x1, x2, ..., xN: the inputs to the neuron, which can either be the actual observations from the input layer or an intermediate value from one of the hidden layers. w1, w2, ..., wN: the weight of each input. bi: termed the bias unit, a constant value added to the input of the activation function corresponding to each weight; it works like an intercept term. a: termed the activation of the neuron, which can be represented as a = f(Σ_i w_i·x_i + b). y: the output of the neuron. Considering the above notations, will a line equation (y = mx + c) fall into the category of a neuron? A. Yes B. No

Solution: (A) A single neuron with no non-linearity can be considered as a linear regression function.

While training a neural network for image recognition task, we plot the graph of training error and validation error for debugging. Q26: https://www.analyticsvidhya.com/blog/2017/04/40-questions-test-data-scientist-deep-learning/ What is the best place in the graph for early stopping? A) A B) B C) C D) D

Solution: (C) You would "early stop" at the point where the model is most generalized, i.e. where the validation error is lowest. Therefore option C is correct.

What are the factors to select the depth of neural network? 1.Type of neural network (eg. MLP, CNN etc) 2.Input data 3.Computation power, i.e. Hardware capabilities and software capabilities 4.Learning Rate 5.The output function to map A. 1, 2, 4, 5 B. 2, 3, 4, 5 C. 1, 3, 4, 5 D. All of these

Solution: (D) All of the above factors are important in selecting the depth of a neural network.

In a neural network, which of the following techniques is used to deal with overfitting? A. Dropout B. Regularization C. Batch Normalization D. All of these

Solution: (D) All of the techniques can be used to deal with overfitting.

In which of the following applications can we use deep learning to solve the problem? A) Protein structure prediction B) Prediction of chemical reactions C) Detection of exotic particles D) All of these

Solution: D We can use a neural network to approximate any function, so deep learning can theoretically be used to solve any of these problems.

If you increase the number of hidden layers in a Multi Layer Perceptron, the classification error of test data always decreases. True or False?

false

The number of neurons in the output layer should match the number of classes (where the number of classes is greater than 2) in a supervised learning task. True or False?

false

t/f If you increase the number of hidden layers in a Multi Layer Perceptron neural networks (fully connected networks), the classification error of test data always decreases.

false

t/f The number of neurons in the output layer must match the number of classes (Where the number of classes is greater than 2) in a supervised learning task.

false

Assume a simple MLP model with 3 neurons and inputs = 1, 2, 3. The weights to the input neurons are 4, 5 and 6 respectively. Assume the activation function is a linear constant value of 3. What will be the output? A) 32 B) 643 C) 96 D) 48

Output = 3 × (in1·w1 + in2·w2 + in3·w3) = 3 × (1×4 + 2×5 + 3×6) = 3 × 32 = 96, so C) 96.
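A one-line check of the arithmetic, reading "a linear constant value of 3" as in the worked answer, i.e. the activation multiplies the weighted sum by 3:

```python
inputs = [1, 2, 3]
weights = [4, 5, 6]
weighted_sum = sum(i * w for i, w in zip(inputs, weights))   # 4 + 10 + 18 = 32
print(3 * weighted_sum)                                      # 96 -> option C
```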

