Machine Learning Final
Assume your CNN has three layers: • Layer 1: six 5×5×3 filters with stride • Layer 2: six 5×5×6 filters with stride • Layer 3: five 8×8×6 filters. If you apply this network to a 32×32 pixel image with 3 channels (RGB) and one pixel of zero padding, what is the size of the output tensor?
10×10×5
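The spatial size after each layer follows standard convolution arithmetic, sketched below as a small helper. The stride values are missing from the question as it appears here, so the call uses hypothetical values and is not meant to reproduce the 10×10×5 answer. The output depth is simply the number of filters in the last layer, which is why the answer ends in 5.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Spatial size after a convolution: floor((n + 2*padding - kernel) / stride) + 1."""
    return (n + 2 * padding - kernel) // stride + 1

# Hypothetical values (NOT taken from the question): 32x32 input,
# 5x5 kernel, stride 1, one pixel of zero padding.
size = conv_output_size(32, kernel=5, stride=1, padding=1)
print(size)  # 30
```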
Which of the following is/are considered unsupervised learning? A. Autoencoder B. CNN C. LSTM D. PCA E. t-SNE
A. Autoencoder D. PCA E. t-SNE
Gradient clipping can be used to address which of the following problem(s) in deep neural networks? A. Exploding gradient B. Vanishing gradient C. Neutral gradient D. Noisy features E. Local minima F. Slow training
A. Exploding gradient
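A minimal sketch of clipping by norm, assuming the gradient is a flat list of floats (the function name is illustrative, not a library API). If the L2 norm exceeds a threshold, the whole vector is rescaled so its direction is kept but its length is capped.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale grads so that their L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)  # already small enough, leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]

# A gradient of norm 5 is scaled down to norm 1 (roughly [0.6, 0.8]).
clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(clipped)
```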
LSTM helps address which of the following limitations of a vanilla RNN? A. Exploding gradient B. Vanishing gradient C. Neutral gradient D. Noisy features E. Local minima F. Slow training
A. Exploding gradient B. Vanishing gradient
The weight decay hyperparameter in deep neural networks is closest to which of the following? A. L2 regularization B. L1 regularization C. Early stopping D. Dropout E. Ensemble learning
A. L2 regularization
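The connection can be checked on a single weight. The sketch below assumes plain SGD and an L2 penalty of (lam/2)·w²; under those assumptions the penalized gradient step and the "shrink the weight, then step" form give the same result. With adaptive optimizers the two are generally not identical.

```python
def sgd_step_l2(w, grad, lr, lam):
    """SGD step on loss + (lam/2) * w**2; the penalty adds lam * w to the gradient."""
    return w - lr * (grad + lam * w)

def sgd_step_weight_decay(w, grad, lr, lam):
    """SGD step with weight decay: shrink w by (1 - lr*lam), then apply the gradient."""
    return (1 - lr * lam) * w - lr * grad

# Hypothetical values for illustration only.
a = sgd_step_l2(w=2.0, grad=0.5, lr=0.1, lam=0.01)
b = sgd_step_weight_decay(w=2.0, grad=0.5, lr=0.1, lam=0.01)
print(a, b)
```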
In word2vec, during training, words are modeled as a function of their:
A. Semantic similarity B. Syntactic similarity C. Context D. Term-frequency E. Term-frequency inverse-document-frequency (tf-idf)
Which of the following is/are true about generative models? A. They model the joint distribution: P(class & sample) B. They model the conditional distribution: P(sample|class) C. Principal Component Analysis (PCA) is a generative model D. They can be used for regularization
A. They model the joint distribution: P(class & sample) B. They model the conditional distribution: P(sample|class)
The "one-hot" method of encoding a feature is useful: A. for encoding categorical features B. because it controls the cooling temperature in simulated annealing searches for optimal parameters C. for converting real valued features into integers D. because it allows ReLU activations to be used in deep networks
A. for encoding categorical features
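A minimal sketch of one-hot encoding, assuming a fixed, known list of categories (the color values are made up for illustration). Each category becomes a vector with a single 1, so no artificial ordering is imposed on the categories.

```python
def one_hot(value, categories):
    """Return a list with 1 at the position of `value` and 0 elsewhere."""
    return [1 if value == c else 0 for c in categories]

colors = ["red", "green", "blue"]  # hypothetical categorical feature
print(one_hot("green", colors))  # [0, 1, 0]
```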
Which of the following must be known to the agent in a reinforcement learning model? A. Initial state B. Current state C. Previous state D. Terminal state
B. Current state
Stochastic gradient descent (SGD) is a powerful optimization technique because: A. It can handle noise in the training data B. It can converge using only an appropriate approximation of the true gradient C. It is computationally more efficient and easier to implement in machine learning problems with large training sets D. It converges in fewer iterations than Newton's algorithm, which uses the Hessian of the loss/error function
B. It can converge using only an appropriate approximation of the true gradient C. It is computationally more efficient and easier to implement in machine learning problems with large training sets
What is the main characteristic of a variational autoencoder (VAE) compared to a vanilla (basic) autoencoder? A. Latent variables in VAE are characterized by Uniform densities B. Latent variables in VAE are characterized by Gaussian densities C. VAE has uniform prior probability D. VAE has Gaussian prior probability E. None of the above
B. Latent variables in VAE are characterized by Gaussian densities
An autoencoder is: A. similar to an autoimmune system in biological systems B. similar to data compression in image and speech processing C. used solely in autonomous systems such as self-driving cars and aircraft D. useful for finding lower dimensional features in a machine learning problem
B. similar to data compression in image and speech processing D. useful for finding lower dimensional features in a machine learning problem
You are using a CNN to determine whether an image is that of a dog, a cat, or a duck. These are the only possibilities. How many neurons should your final layer contain, and what final activation function should you use? A. 3, ReLU B. 1, Sigmoid C. 3, SoftMax D. 1, SoftMax E. 3, Sigmoid F. 1, ReLU
C. 3, SoftMax
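A minimal sketch of a softmax over three output units, with made-up logits for the dog/cat/duck classes. The outputs are non-negative and sum to 1, which fits a problem where exactly one class applies.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["dog", "cat", "duck"]
probs = softmax([2.0, 1.0, 0.1])  # hypothetical logits
predicted = classes[probs.index(max(probs))]
print(probs, predicted)
```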
What is the main advantage of using convolutional neural networks (CNNs) for image classification in comparison to using feed-forward networks? A. CNNs are easier to implement in comparison to feed-forward neural networks B. CNNs have more hyperparameters in comparison to feed-forward neural networks C. CNNs preserve spatial information, while feed-forward networks do not D. All of the above E. None of the above
C. CNNs preserve spatial information, while feed-forward networks do not
Which method typically should NOT be performed with batch normalization at the same time? A. Early stopping B. Weight decay C. Dropout D. L1 (LASSO) feature selection E. All of the above F. None of the above
C. Dropout
The number of nodes in the input layer is 10 and in the hidden layer is 5. The maximum number of connections from the input layer to the hidden layer is: A. Fewer than 50 B. More than 50 C. Exactly 50 D. It is an arbitrary value
C. Exactly 50
What is the simplest method to increase the generalizability of a language model? A. L1 regularization B. Backoff smoothing C. Laplace smoothing D. All of the above
C. Laplace smoothing
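A minimal sketch of Laplace (add-one) smoothing for a unigram model, using a tiny made-up corpus and vocabulary. Every word gets one extra count, so words never seen in training still receive non-zero probability.

```python
from collections import Counter

def laplace_unigram_probs(tokens, vocab):
    """P(w) = (count(w) + 1) / (N + V), with N tokens and V vocabulary words."""
    counts = Counter(tokens)
    n, v = len(tokens), len(vocab)
    return {w: (counts[w] + 1) / (n + v) for w in vocab}

tokens = ["the", "cat", "sat", "the"]   # hypothetical training text
vocab = {"the", "cat", "sat", "dog"}    # "dog" never appears in the text
probs = laplace_unigram_probs(tokens, vocab)
print(probs["dog"])  # 0.125, i.e. (0 + 1) / (4 + 4)
```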
Which of the techniques or concepts below is/are NOT used in backpropagation for deep neural networks, except possibly for validating correctness? A. Local gradients B. Chain rule C. Numerical gradient D. Recursion
C. Numerical gradient
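A minimal sketch of a gradient check, using a toy function f(x) = x³ chosen only for illustration. The numerical gradient is a central finite difference; it is too slow to train with but useful for confirming that an analytic gradient is implemented correctly.

```python
def f(x):
    return x ** 3

def analytic_grad(x):
    """Derivative of x**3 obtained by hand (what backpropagation would compute)."""
    return 3 * x ** 2

def numerical_grad(func, x, h=1e-5):
    """Central-difference approximation of the derivative of func at x."""
    return (func(x + h) - func(x - h)) / (2 * h)

x0 = 2.0
diff = abs(analytic_grad(x0) - numerical_grad(f, x0))
print(diff)  # should be very small if the analytic gradient is right
```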
Which of the following is/are true about PCA and t-SNE? A. Both PCA and t-SNE are linear B. Both PCA and t-SNE are nonlinear C. PCA is linear, t-SNE is nonlinear D. PCA is nonlinear, t-SNE is linear E. Their linearity/nonlinearity depends on the dataset
C. PCA is linear, t-SNE is nonlinear
Which non-linearity function is NOT used in the Long Short Term Memory (LSTM) unit? A. Sigmoid B. TanH C. ReLU D. None of the above
C. ReLU
Which deep learning architecture typically has the largest number of layers? A. VGG B. AlexNet C. ResNet D. GoogLeNet
C. ResNet
Which word representation approach works the best for infrequent words in a large corpus? A. One-hot vector B. Term frequency-inverse document frequency (tf-idf) C. Skip-grams (SG) D. Continuous bag of words (CBOW)
C. Skip-grams (SG)
Which classification model is different from the others? A. Conditional Markov model B. Conditional random field C. Support vector machine (SVM) D. Recurrent neural network
C. Support vector machine (SVM)
In training a CNN model, we observe that the validation loss starts increasing after 30 epochs, while the training loss keeps decreasing. Which of the following is/are appropriate as the next step? A. Decrease the learning rate B. Increase the learning rate C. Use data augmentation D. Use a different loss function E. Stop training at epoch 30 F. Add dropout
C. Use data augmentation E. Stop training at epoch 30 F. Add dropout
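Early stopping can be sketched as tracking the best validation loss and giving up after a few epochs without improvement. The loss values and the `patience` parameter below are hypothetical, chosen only to show the mechanics.

```python
def best_epoch(val_losses, patience=3):
    """Return the 1-based epoch with the lowest validation loss seen before
    `patience` consecutive epochs pass without improvement."""
    best, best_idx, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_idx, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # validation loss has stopped improving
    return best_idx

# Hypothetical validation curve: improves until epoch 3, then rises.
val_losses = [0.9, 0.7, 0.6, 0.65, 0.7, 0.8, 0.5]
stop_at = best_epoch(val_losses, patience=3)
print(stop_at)  # 3
```

Training halts at epoch 6 in this run, so the lower loss at epoch 7 is never observed; that is the usual trade-off when choosing a patience value.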
The ReLU activation function: A. is differentiable everywhere B. is bounded C. has a bounded derivative D. is the only way to find low dimensional features in a machine learning problem
C. has a bounded derivative
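A minimal sketch of ReLU and its derivative. The function itself grows without bound for positive inputs, while the derivative only takes the values 0 or 1. The derivative is undefined at exactly 0; returning 0 there is a common convention and an assumption of this sketch.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    """Derivative of ReLU; the value at x == 0 is set to 0 by convention."""
    return 1.0 if x > 0 else 0.0

xs = [-1e6, -1.0, 0.0, 1.0, 1e6]
print([relu(x) for x in xs])       # unbounded above
print([relu_grad(x) for x in xs])  # only 0.0 or 1.0
```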
Which type of layer has the fewest parameters in a CNN? A. A convolutional layer with ten 3×3 filters B. A convolutional layer with eight 5×5 filters C. A fully-connected layer from 20 hidden units to 4 output units D. A max-pooling layer that reduces a 10×10 matrix to a 5×5 matrix
D. A max-pooling layer that reduces a 10×10 matrix to a 5×5 matrix
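The counts can be worked out directly. The sketch below assumes a single input channel and one bias per filter or output unit, neither of which is stated in the question; the ranking does not depend on those assumptions, because max-pooling has no learnable parameters at all.

```python
def conv_params(num_filters, k, in_channels=1, bias=True):
    """Learnable parameters in a conv layer with k x k filters."""
    return num_filters * (k * k * in_channels + (1 if bias else 0))

def dense_params(n_in, n_out, bias=True):
    """Learnable parameters in a fully-connected layer."""
    return n_in * n_out + (n_out if bias else 0)

counts = {
    "A: ten 3x3 filters": conv_params(10, 3),     # 10 * (9 + 1) = 100
    "B: eight 5x5 filters": conv_params(8, 5),    # 8 * (25 + 1) = 208
    "C: dense 20 -> 4": dense_params(20, 4),      # 20 * 4 + 4 = 84
    "D: max-pooling": 0,                          # nothing to learn
}
print(min(counts, key=counts.get))  # D: max-pooling
```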
What is the most critical factor to improve training a text classification model on a large corpus of labeled documents? A. Selecting a non-parametric model B. Selecting a discriminative model C. Document/text preprocessing D. Feature engineering E. Collecting more labeled data
D. Feature engineering
Which natural language processing task does NOT rely on classification? A. Named entity recognition B. Sentiment analysis C. Spam detection D. Language modeling E. None of the above
D. Language modeling
Which of the following is/are the most common non-linear activation(s) used in deep neural networks? A. Sigmoid B. TanH C. Convolution D. ReLU E. All of the above
D. ReLU
You are using a CNN to determine whether an image contains a dog, a cat, and/or a duck, or none of these animals. These are the same classes as in the previous question, but now in a multi-label setting. That is, zero, one, or more class labels may be associated with each input example. How would you modify the last layer of your network from the previous question to model this problem? Please specify the number of units and the activation function. A. 3, ReLU B. 1, Sigmoid C. 3, SoftMax D. 1, SoftMax E. 3, Sigmoid F. 1, ReLU
E. 3, Sigmoid
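A minimal sketch of a multi-label output layer, with made-up logits and a hypothetical 0.5 decision threshold. Each of the three units gets its own sigmoid, so the scores are independent and any subset of the classes, including none, can be predicted.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_labels(logits, classes, threshold=0.5):
    """Return every class whose independent sigmoid score reaches the threshold."""
    return [c for z, c in zip(logits, classes) if sigmoid(z) >= threshold]

classes = ["dog", "cat", "duck"]
print(predict_labels([2.0, -1.0, 0.3], classes))    # ['dog', 'duck']
print(predict_labels([-2.0, -1.0, -3.0], classes))  # []
```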