Deep Learning Final Exam Questions and Answers

What function is NOT used to make a pooling layer for a CNN? A) max_pool() B) avg_pool() C) min_pool() D) all of these are used

C) min_pool()

The ________ activation function involves α (hyperparameter that determines how much the function "leaks") being learned during training instead of being a hyperparameter. A) leaky ReLU B) randomized leaky ReLU C) parametric leaky ReLU D) exponential linear unit (ELU)

C) parametric leaky ReLU
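
As a rough illustration of what "learning α" means, the sketch below writes leaky ReLU with α as an explicit argument and gives the derivative of the output with respect to α, which is what a training step would use to update it. The function names are hypothetical and not taken from any library.

```python
def leaky_relu(z, alpha=0.01):
    """Return z for positive inputs and alpha * z otherwise."""
    return z if z > 0 else alpha * z


def leaky_relu_grad_alpha(z):
    """Derivative of leaky_relu with respect to alpha (nonzero only for z <= 0)."""
    return 0.0 if z > 0 else z


# With alpha fixed, this is plain leaky ReLU; a parametric version would
# treat alpha as a trainable value and update it using the gradient above.
print(leaky_relu(2.0), leaky_relu(-2.0, alpha=0.1))
```

Setting alpha to 0 recovers ordinary ReLU, max(0, z).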

What method is called to restore a previously saved model? A) recall() B) revive() C) restore() D) remember()

C) restore()

Which architecture would be most likely to be used to identify objects in an image: a. RNN b. CNN

CNN

____ is the parallel computing platform and programming model used by TensorFlow to enable faster training with an NVIDIA GPU.

CUDA

What makes TensorFlow particularly suitable as a deep learning framework?

It computes gradients automatically from the computation graph, which helps with time complexity.

PCA identifies the axis that accounts for the __________ in the training set. A) least amount of variance B) largest amount of variance C) most uniformity D) most amount of data

B) largest amount of variance

What is the curse of dimensionality?

Having too many features slows down training speeds and increases the difficulty of finding a good solution.

How do you clear all nodes from the default graph? A) Restart the kernel/shell. B) Call tf.get_default_graph().clear() C) Call tf.reset_default_graph() D) A & B E) A & C

E) A & C

The LSTM model was developed to solve the problem of __________.

Exploding/vanishing gradients

T/F. A classical Perceptron is able to estimate class probabilities.

False

T/F. A convolutional layer must always be followed by a max pooling layer.

False

T/F. When training a regression model, you should use a softmax activation function on the final output to get probabilities.

False

True or false. A placeholder is a node which computes output based on data that it is fed during runtime.

False

True or False: The cost function of a logistic regression model is not convex

False

True or False: The higher the dimensionality of the word embeddings vectors, the higher will be the accuracy on various semantic analogy tasks

False

True/False: Adding more training data always increases the accuracy of linear regression.

False

True or False: One epoch of a network is equivalent to one iteration.

False (An epoch is one forward and backward pass over all training examples. An iteration is one forward and backward pass over a single batch.)
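
To make the epoch/iteration distinction concrete, here is a small sketch that counts iterations for a given dataset and batch size. The function name and the numbers are made up for illustration.

```python
import math


def iterations_per_epoch(n_examples, batch_size):
    """Number of batches needed to see every training example once."""
    return math.ceil(n_examples / batch_size)


n_epochs = 5
per_epoch = iterations_per_epoch(n_examples=1000, batch_size=32)
total_iterations = n_epochs * per_epoch
print(per_epoch, total_iterations)  # 32 160
```

The two counts coincide only when the batch contains the whole training set.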

T/F. Using keras we have more control of the network.

False. (Keras is a high-level API that runs on a backend such as TensorFlow; it hides low-level details, so it gives less fine-grained control than working with the backend directly.)

T/F. Numerical differentiation has a high accuracy.

False. Numerical differentiation has low accuracy, although it is trivial to implement.
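
A minimal sketch of why the method is both simple and imprecise, assuming a forward-difference approximation (other difference schemes exist). The test function x^3 and the step size are arbitrary choices for illustration.

```python
def numerical_derivative(f, x, eps=1e-5):
    """Forward-difference approximation of f'(x)."""
    return (f(x + eps) - f(x)) / eps


def cube(x):
    return x ** 3


approx = numerical_derivative(cube, 2.0)
exact = 3 * 2.0 ** 2  # analytic derivative of x^3 at x = 2
error = abs(approx - exact)
print(approx, exact, error)
```

The approximation needs only two function evaluations, but the result carries an error that depends on the step size eps.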

The flattening layer in CNN is called _______________

Fully connected layer

A popular optimization algorithm is known as ____ descent

Gradient

______ works first by measuring how each training instance linearly relates to its closest neighbors and then looking for a low dimensional representation of the training set where these local relationships are best preserved.

LLE

A part of a neural network that preserves some state across time steps is called __________.

Memory Cell

Bagging is sampling ______________ replacement. While pasting is sampling ____________ replacement.

with, without
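
The difference can be sketched with the standard library: `random.choices` draws with replacement (bagging) and `random.sample` draws without replacement (pasting). The helper names and the toy data are assumptions for illustration.

```python
import random


def bagging_sample(data, k, rng):
    """Draw k items WITH replacement; the same item may appear more than once."""
    return rng.choices(data, k=k)


def pasting_sample(data, k, rng):
    """Draw k items WITHOUT replacement; every drawn item is distinct."""
    return rng.sample(data, k)


rng = random.Random(42)
data = list(range(10))
bag = bagging_sample(data, 10, rng)
paste = pasting_sample(data, 10, rng)
print(sorted(bag))
print(sorted(paste))
```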

Which of the following is an example of a vector-to-sequence RNN? A) Locating pedestrians in a picture B) Speech to text C) Video captioning D) A and C E) B and C

A) Locating pedestrians in a picture

Word2vec models are ____ layer neural networks used to turn text into meaningful numeric data which deep neural nets can understand.

2

An LSTM cell has _________ memory vectors while a GRU cell has only __________.

2 (short-term and long-term); 1 (just memory)

What is the number of graph traversals required to compute all gradients in Reverse-mode autodiff?

2*n_output

Given an image whose dimensions are 30x30, and a kernel of size 5x5 with stride length 1 and no padding, how large will the convolved image be? Give the answer in the form WIDTHxHEIGHT, e.g. 13x13

26 x 26
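
The answer follows from the usual output-size formula for a convolution without dilation, floor((n + 2p - k) / s) + 1 along each axis. A minimal sketch; the helper name `conv_output_size` is made up for illustration.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Output length along one axis: floor((n + 2*padding - kernel) / stride) + 1."""
    if kernel > n + 2 * padding:
        raise ValueError("kernel does not fit inside the padded input")
    return (n + 2 * padding - kernel) // stride + 1


width = conv_output_size(30, kernel=5, stride=1, padding=0)
height = conv_output_size(30, kernel=5, stride=1, padding=0)
print(f"{width}x{height}")  # 26x26
```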

What can be used to avoid cluttering a neural network model? A) Variable scopes B) Model sorters C) Name scopes D) Neuron organizers

C) Name scopes

A TensorFlow program is generally divided into 2 parts. The first builds a computation graph. What is this phase called? A) Construction phase B) Execution phase C) Building phase D) Graphing phase

A) Construction phase

_______ regularization technique involves each neuron having a probability p of being temporarily "dropped" A) Dropout B) PCA C) Manifold learning D) Projection

A) Dropout
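
A rough sketch of dropout on a list of activations. It assumes the common "inverted dropout" variant, where surviving activations are scaled by 1 / (1 - p) during training; the card itself only states that each neuron is dropped with probability p. The function name is hypothetical.

```python
import random


def dropout(activations, p, rng, training=True):
    """Zero each activation with probability p; scale survivors by 1 / (1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(activations)  # dropout is disabled at inference time
    keep = 1.0 - p
    return [a / keep if rng.random() >= p else 0.0 for a in activations]


rng = random.Random(0)
out = dropout([1.0] * 1000, p=0.5, rng=rng)
dropped_fraction = out.count(0.0) / len(out)
print(dropped_fraction)
```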

Models often train better (on most systems) when mini batch sizes are in powers of ____, because of CPU and GPU memory architecture, speeding up the fetching of data from memory due to common pagefile sizes. A: 2 B: 3 C: 4 D: 5

A: 2

What is the output range for a hyperbolic tangent activation function (tanh(z) = 2σ(2z) - 1, where σ is the logistic function)? A) 0 to 1 B) -1 to 1 C) 0 to infinity D) -1 to infinity

B) -1 to 1
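
A quick numerical check of the identity between tanh and the logistic function, and of the output range, on a few arbitrary sample points. The helper names are made up for illustration.

```python
import math


def sigmoid(z):
    """Logistic function, with outputs in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh_via_sigmoid(z):
    """tanh written as a rescaled, shifted logistic function."""
    return 2.0 * sigmoid(2.0 * z) - 1.0


samples = [-5.0, -1.0, 0.0, 0.5, 3.0]
max_gap = max(abs(tanh_via_sigmoid(z) - math.tanh(z)) for z in samples)
in_range = all(-1.0 < tanh_via_sigmoid(z) < 1.0 for z in samples)
print(max_gap, in_range)
```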

What type of TensorFlow optimizer may be used to compute an optimal gradient? A) LossOptimizer() B) GradientDescentOptimizer() C) ErrorOptimizer() D) SquaredErrorOptimizer()

B) GradientDescentOptimizer()

What is the standard matrix factorization technique that can decompose the training set matrix into a dot product of 3 matrices, with the latter of these matrices containing the principal components? A) Principal Component Analysis (PCA) B) Singular Value Decomposition (SVD) C) Manifold Learning D) Kernel PCA

B) Singular Value Decomposition (SVD)

A CNN is faster to train than a DNN for which of the following reasons? 1. Partially connected layers 2. Reusability of weights 3. Both 1 and 2

3. Both 1 and 2

What are the two main operations of TensorFlow?

Construction of the computation graph and then the execution of the graph.

____ neural networks are the best type of neural network to use for object detection in images.

Convolutional

What is the derivative of the step function at 0? A) 1 B) -1 C) 0 D) undefined

D) undefined

How does TensorFlow treat dependencies in graph nodes?

TensorFlow automatically evaluates the nodes that a requested node depends on, but it does not reuse results across graph runs: a node is recomputed in each run even if an earlier run already evaluated it.

Training accuracy approaches 95%, but validation accuracy remains at 60%. Your model is likely ______.

Overfitting

How is a Perceptron trained?

Perceptrons are trained based on an algorithm that considers the error made by the network. The connections that lead to the wrong output are neglected and not reinforced.
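
A minimal sketch of error-driven Perceptron training, assuming the classic update rule w <- w + eta * (y - y_hat) * x and a step activation. The AND dataset, the learning rate, and the function names are illustrative choices, not taken from the source.

```python
def predict(weights, bias, x):
    """Step activation on the weighted sum of the inputs."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if z > 0 else 0


def train_perceptron(samples, labels, eta=1.0, epochs=20):
    """Update weights only when a prediction is wrong (error != 0)."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = y - predict(weights, bias, x)
            weights = [w + eta * error * xi for w, xi in zip(weights, x)]
            bias += eta * error
    return weights, bias


# Logical AND is linearly separable, so training settles on a solution.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 0, 1]
weights, bias = train_perceptron(X, y)
predictions = [predict(weights, bias, x) for x in X]
print(weights, bias, predictions)
```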

Which architecture would be most likely to be used to predict stock market fluctuations: a. RNN b. CNN

RNN

Fun(x) = max(0,x) is which activation function?

ReLU

What is the best output activation function to predict housing prices?

ReLU (housing prices are always positive) OR None (= Linear)

Applying a softmax function to the output logits of a classifier gives you values that can be interpreted as __________.

probabilities
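
A short sketch of softmax in plain Python, showing that the outputs are positive and sum to 1. Subtracting the maximum logit is a standard numerical-stability trick; the example logits are arbitrary.

```python
import math


def softmax(logits):
    """Convert raw scores into values that are positive and sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```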

What is the main benefit of doing complex matrix operations using TensorFlow versus computing them normally with, say, NumPy?

TensorFlow optimizes the calculations into C++ code and runs it on your GPU.

John decides to build a neural network to predict tomorrow's temperature in Fahrenheit. He decides to treat this as a regression problem and use ReLU as the output activation function. Has John made a mistake? a. Yes, this should be a classification problem. b. Yes, he should use a linear activation function. c. Yes, neural networks can only perform classification. d. No

b. Yes, he should use a linear activation function

Difference between supervised and unsupervised? a. Supervised learning requires the user to tell the machine what to do. b. Supervised learning requires the user to be present to make sure it runs correctly c. Supervised learning includes a desired outcome or solution, called labels. d. Supervised learning needs to communicate with other algorithms

c. Supervised learning includes a desired outcome or solution, called labels.

T/F. A convolutional neural network will usually have one (or more) fully connected layers at the end of the network.

True

T/F. Autoencoders typically have a symmetrical neural network structure.

True

T/F. CNNs are able to generalize much better than DNNs for image processing tasks such as classification, using fewer training examples.

True

T/F. It is common to initialize weight variables over a distribution, while initializing bias variables as a constant.

True

True or False: AdaBoost cannot be parallelized

True

True or False: Data augmentation helps with the training of the data.

True

True or False: Is the statement a_val = a.eval(session=sess) equivalent to a_val = sess.run(a)?

True

True or False: Neural networks are universal approximators. They can be used for all kinds of inputs and outputs.

True

True or False: SGD helps find the global minimum.

True

True or False: a tensor is an n-dimensional array (scalar, vector, matrix, etc)?

True

True or false, there exists a manifold for any finite dataset which is linearly separable. a. True b. False

True

True or false. Autoencoders are artificial neural networks capable of learning efficient representations of input data without supervision.

True

True/False: Increasing the training data size reduces chances of overfitting.

True

Rather than using all of your available data to train the network, you should set aside some data for both ________ and _______.

Validation, testing

PCA tries to maintain as much of _____________ in the dataset as possible

Variance

Identify the Activation function by the graph below [-1,1] a. ReLu b. Logistic c. Tanh d. Leaky ReLu

c. Tanh

In a DNN, when the Gradient Descent update leaves the lower layer connection weights virtually unchanged, and training never converges to a good solution, the ____ is occurring. a. Exploding gradients problem b. Overfitting c. Vanishing gradients problem d. Underfitting

c. Vanishing gradients problem

For the above dataset, does there exist a 3D manifold in which the dataset is linearly separable? a. Yes b. No

Yes (cone)

What would be the optimal number of input neurons in a conv net for a 32x32 image? a. 1024 input neurons b. 0 neurons c. 900 neurons d. over 9000 neurons

a. 1024 input neurons

What is TensorFlow? a. An open source library which executes Python graphs using optimized C++ code. b. A program that visualizes graphs built in Python c. A programming language designed for machine learning d. An open source library which executes C++ graphs using optimized Python code

a. An open source library which executes Python graphs using optimized C++ code.

In RNNs "Dynamic Unrolling Through Time" is used to a. Avoid 'out of memory' errors b. Improve the accuracy of the network c. Avoid the vanishing gradient problem d. Avoid the exploding gradient problem e. All of the above f. None of the above

a. Avoid 'out of memory' errors

What is overfitting? a. Making an overgeneralization of the dataset b. When the model fits exactly over the dataset c. When the dataset is more complex than the model d. When the model fits over the required statistic for an accurate prediction

a. Making an overgeneralization of the dataset

When initializing a neural network, weights should be set to: a. Random numbers b. 0 c. 0.5 d. 1

a. Random numbers

Which performance measure is more sensitive to outliers a. Root Mean Square Error b. Absolute Mean Error c. Absolute Root Error d. None of the Above

a. Root Mean Square Error
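
A small sketch comparing RMSE with mean absolute error on the same prediction errors, before and after adding one large outlier. The error values are invented for illustration.

```python
import math


def rmse(errors):
    """Root mean square error: squaring amplifies large errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def mae(errors):
    """Mean absolute error: every error contributes linearly."""
    return sum(abs(e) for e in errors) / len(errors)


errors = [1.0, -1.0, 1.0, -1.0]
with_outlier = errors + [10.0]

print(rmse(errors), mae(errors))
print(rmse(with_outlier), mae(with_outlier))
```

Both measures equal 1 without the outlier; once it is added, RMSE grows noticeably more than the mean absolute error.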

What is a training set and test set? a. Splitting the data set into instances to train on and instances to validate on b. A dataset to train on and a dataset to test on c. Training the dataset and testing the dataset d. Testing the dataset before training the dataset

a. Splitting the data set into instances to train on and instances to validate on

Machine Learning is: a. The science and art of programming computers so they can learn from data b. Machines organizing information on their own c. The advancement of technology as scientists continue research d. Learning the right way to use computers to benefit us

a. The science and art of programming computers so they can learn from data

What is a data pipeline? a. The stages data goes through in an algorithm before making predictions b. Feeding data into a GPU c. Transforming data into a dataset d. Finding where to gather accurate data

a. The stages data goes through in an algorithm before making predictions

Define Batch Learning. a. The system must be trained on all available data b. The system is able to learn incrementally c. The system learns from dataset to dataset d. The system is fed data at the will of the user

a. The system must be trained on all available data

Can you run more than one dataflow graph? a. yes, but you shouldn't - it requires multiple sessions and a lot of space b. No, it is not possible c. yes, and you should

a. yes, but you shouldn't - it requires multiple sessions and a lot of space

____________ are capable of randomly generating new data that looks very similar to training data

autoencoders

What does the following equation do? Theta = (X^T * X)^-1 * X^T * y a. Allow you to find the weight vector for classification without using gradient descent algorithm b. Allow you to find the weight vector for regression without using gradient descent algorithm c. Allow you to perform perceptron learning d. None of the Above

b. Allow you to find the weight vector for regression without using gradient descent algorithm
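
A minimal sketch of this closed-form solution for the special case of one feature plus a bias term, where X^T X is a 2x2 matrix that can be inverted by hand. The function name and the toy data (generated from y = 2x + 1) are assumptions for illustration.

```python
def normal_equation_1d(xs, ys):
    """Solve theta = (X^T X)^-1 X^T y for X with rows [1, x]."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx  # determinant of X^T X
    if det == 0:
        raise ValueError("X^T X is singular; all x values are identical")
    bias = (sxx * sy - sx * sxy) / det
    slope = (n * sxy - sx * sy) / det
    return bias, slope


bias, slope = normal_equation_1d([0, 1, 2, 3], [1, 3, 5, 7])
print(bias, slope)  # 1.0 2.0
```

No learning rate or iteration count is involved: the weights come from a single direct computation.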

What is the fastest type of gradient descent? a. Batch gradient descent b. Stochastic gradient descent c. Mini-batch gradient descent d. Hyper gradient descent

b. Stochastic gradient descent

Is the following dataset linearly separable? (Data in center and data in ring around first dataset)

no

In Ensemble learning, soft-voting is when... a. Your model looks at how each classifier is voting and chooses the class with the most number of votes b. Your model's results are taken thru a softmax layer to compute the class probabilities. c. Your model predicts the class with the highest class probability, averaged over all the individual classifiers. d. None of the above

c. Your model predicts the class with the highest class probability, averaged over all the individual classifiers.
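
A short sketch contrasting soft voting with hard (majority) voting. Each row holds one classifier's predicted probabilities for two classes; the numbers are invented so that the two schemes disagree, and the function names are hypothetical.

```python
from collections import Counter


def hard_vote(prob_rows):
    """Each classifier votes for its top class; the most common vote wins."""
    votes = [max(range(len(row)), key=row.__getitem__) for row in prob_rows]
    return Counter(votes).most_common(1)[0][0]


def soft_vote(prob_rows):
    """Average the class probabilities, then pick the highest average."""
    n_classes = len(prob_rows[0])
    averages = [
        sum(row[c] for row in prob_rows) / len(prob_rows)
        for c in range(n_classes)
    ]
    return max(range(n_classes), key=averages.__getitem__)


probs = [[0.9, 0.1], [0.4, 0.6], [0.45, 0.55]]
print(hard_vote(probs), soft_vote(probs))  # 1 0
```

Soft voting gives more weight to the confident first classifier, so it picks class 0 even though two of the three classifiers lean toward class 1.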

What is unsupervised learning? a. features of group explicitly stated b. number of groups may be known c. neither feature & nor number of groups is known d. none of the mentioned

c. neither feature & nor number of groups is known

A TensorFlow program is split into two parts: the _____ phase, which builds a computation graph, and the _____ phase, which runs it.

construction, execution

Name the library that does not support DNN frameworks. a. Theano b. TF c. Keras d. Scikit-learn

d. Scikit-learn

For a binary classification problem, how many output nodes should there be? a. 1 b. 2 c. 3 d. 1 or 2 e. Any number

d. 1 or 2

What is an epoch? a. An era of Humanity b. A clocking speed for computers c. A file type d. A full training cycle of a neural network

d. A full training cycle of a neural network

Representing a recurrent neural network against the time axis is known as: a. Backpropagation b. Stacking the network c. Temporal training d. Unrolling the network through time

d. Unrolling the network through time

What is underfitting? a. When the dataset is more complex than the model b. When the model fits under the required statistic for an accurate prediction c. When the dataset is less complex than the model d. When the model fits under the data points

d. When the model fits under the data points

Building an RNN using __________ instead of __________ offers several advantages.(State the function names)

dynamic_rnn(), static_rnn()

In a neural network, the layer(s) between the input and output layers are referred to as ______ layers

hidden

In Scikit-Learn's GridSearchCV, all you need to do is tell it which ________________ you want it to experiment with, and what values to try out, and it will evaluate all the possible combinations of _______________ values, using cross-validation.

hyperparameters, hyperparameter

Simulated Annealing is when you gradually reduce the ___________ _____________.

learning rate
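
A sketch of one way to reduce the learning rate gradually. The card does not say which schedule is meant; exponential decay, eta(t) = eta0 * 10^(-t / r), is assumed here as one common choice, and the function and parameter names are hypothetical.

```python
def exponential_schedule(eta0, step, decay_steps):
    """Learning rate that shrinks by a factor of 10 every decay_steps steps."""
    return eta0 * 10 ** (-step / decay_steps)


rates = [exponential_schedule(0.1, step, decay_steps=100) for step in (0, 50, 100, 200)]
print(rates)
```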

By the Perceptron Convergence Theorem, if the training instances are ________ ___________, then the Perceptron will converge to a solution.

linearly separable

A ___________________ normalization layer makes the neurons that most strongly activate inhibit neurons at the same location but in neighboring feature maps.

local response

_______ reduces clutter when dealing with more complex models such as neural networks which helps with _______.

name scopes, group related nodes

