Machine Learning Questions
Which of the following is a reasonable way to select the number of principal components k? (Note that m is the number of input examples.) -Choose k to be the smallest value so that at least 99% of the variance is retained. -Choose to use the elbow method. -Choose k to be 99% of m (i.e., k = 0.99*m, rounded to the nearest integer). -Choose k to be the largest value so that at least 99% of the variance is retained.
Choose k to be the smallest value so that at least 99% of the variance is retained.
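A minimal sketch of the "smallest k retaining 99% of the variance" rule. It assumes the per-component variances (eigenvalues of the covariance matrix) are already available; the helper name `smallest_k` and the sample values are hypothetical.

```python
def smallest_k(eigenvalues, retain=0.99):
    """Return the smallest k whose top-k components keep `retain` of the variance."""
    ordered = sorted(eigenvalues, reverse=True)  # largest variance first
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= retain:
            return k
    return len(ordered)

# Made-up variances for five components
variances = [60.0, 30.0, 9.5, 0.4, 0.1]
print(smallest_k(variances))  # 3
```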
Training a model using categorically labeled data to predict labels for new data is known as __________. -Feature extraction -Classification -Clustering -Regression
Classification
Pruning and early stopping in decision trees are used to -Combat overfitting. -None of the options. -Improve training error.
Combat overfitting.
Which function does logistic regression use to "squeeze" the real line to [0, 1]? -Absolute value function -Logistic function -Zero function
Logistic function
What is the meaning of "Kernelling" in SVM? -Finding a hyperplane in such a way that increases the dimensionality of a dataset -Mapping data into a higher dimensional space, in such a way that can change a linearly inseparable dataset into a linearly separable dataset -A function to reduce the dimensionality of a dataset in SVM
Mapping data into a higher dimensional space, in such a way that can change a linearly inseparable dataset into a linearly separable dataset
A common process for selecting a parameter like the optimal polynomial degree is: -Model estimation -Multiple regression -Bootstrapping -Minimizing test error -Minimizing validation error
Minimizing validation error
Which of the following is true for decision trees? -Model complexity increases with size of the data. -None of the options. -Model complexity increases with depth.
Model complexity increases with depth.
With a data set of 150 features and 20 classes, if we create a fully connected neural network with one hidden layer of 100 neurons, then how many parameters will be trained in total? -17120 -270 -300000 -17000
17120
If you use 3 filters of a size of 5*5 on an image of 28*28 pixels, what is the shape of the feature map assuming stride = 1? -28 * 28 * 3 -23 * 23 * 1 -24 * 24 * 1 -24 * 24 * 3 -23 * 23 * 3 -28 * 28 * 1
24 * 24 * 3
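The feature-map shape can be checked with the usual output-size formula, (size + 2*padding - kernel) // stride + 1. This sketch assumes a square image and kernel and no padding, which the question does not state but the answer implies; `conv_output_shape` is a hypothetical helper.

```python
def conv_output_shape(size, kernel, n_filters, stride=1, padding=0):
    """Shape (height, width, channels) of a conv layer's output for a square input."""
    side = (size + 2 * padding - kernel) // stride + 1
    return (side, side, n_filters)

print(conv_output_shape(28, 5, 3))  # (24, 24, 3)
```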
For the MNIST dataset, a convolutional neural network with the following layers is built: -1st Convolution layer: 25 filters with filter size 5*5 -1st Max Pooling layer: pool size 2*2 Flatten -Dense layer: 100 neurons -Output layer: 10 neurons What is the total number of parameters? -361,225 -361,625 -361,760 -361,360
361,760
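The total can be reproduced layer by layer. This sketch assumes a single-channel 28*28 input, stride 1, no padding, one bias per filter and per neuron, and no trainable parameters in the pooling and flatten layers; the helper names are hypothetical.

```python
def conv_params(kernel, in_channels, n_filters):
    # Each filter has kernel*kernel*in_channels weights plus one bias
    return (kernel * kernel * in_channels + 1) * n_filters

def dense_params(n_in, n_out):
    # Each output neuron has n_in weights plus one bias
    return (n_in + 1) * n_out

conv = conv_params(kernel=5, in_channels=1, n_filters=25)  # 650
side = (28 - 5 + 1) // 2           # 24 after the convolution, 12 after 2*2 pooling
flat = side * side * 25            # 3600 values after flattening
hidden = dense_params(flat, 100)   # 360100
output = dense_params(100, 10)     # 1010
total = conv + hidden + output
print(total)  # 361760
```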
Compute the number of parameters in a densely connected neural network on the MNIST dataset (784 input dimensions, 10 classes) with two hidden layers of dimension 500 each. -648010 -647000 -647010 -643000
648010
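For a densely connected network, each layer contributes (inputs + 1) * outputs parameters when biases are counted. A short sketch, with `dense_network_params` as a hypothetical helper:

```python
def dense_network_params(layer_sizes):
    """Count weights and biases of a fully connected network given its layer widths."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

print(dense_network_params([784, 500, 500, 10]))  # 648010
```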
What is a multi-class classifier? -A classifier that can predict a field with two discrete values, such as "Defaulter" or "Not Defaulter" -A classifier that can predict a field with multiple discrete values, such as "DrugA", "DrugX" or "DrugY" -A classifier that can predict multiple fields with many discrete values
-A classifier that can predict a field with multiple discrete values, such as "DrugA", "DrugX" or "DrugY"
A classifier is trained on an imbalanced multiclass dataset. After looking at the model's precision scores, you find that the micro averaging is much smaller than the macro averaging score. Which of the following is most likely happening? -The model is probably misclassifying the infrequent labels more than the frequent labels. -The model is probably misclassifying the frequent labels more than the infrequent labels.
The model is probably misclassifying the frequent labels more than the infrequent labels.
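Micro-averaged precision pools every prediction, so frequent classes dominate it, while macro-averaged precision weights each class equally. The sketch below uses made-up labels and treats the precision of a class that is never predicted as 0.0 (an assumption) to show which kind of error pushes micro below macro.

```python
from collections import Counter

def precisions(y_true, y_pred):
    """Return (micro, macro) precision for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    predicted = Counter(y_pred)
    per_class = [tp[c] / predicted[c] if predicted[c] else 0.0 for c in labels]
    macro = sum(per_class) / len(labels)
    micro = sum(tp.values()) / len(y_pred)
    return micro, macro

y_true = ["A"] * 90 + ["B"] * 5 + ["C"] * 5

# Frequent class A is often mislabeled as B; rare classes are all correct
pred_frequent_wrong = ["A"] * 40 + ["B"] * 50 + ["B"] * 5 + ["C"] * 5
# Frequent class A is all correct; rare classes are often mislabeled as A
pred_rare_wrong = ["A"] * 90 + ["A"] * 5 + ["A"] * 2 + ["C"] * 3

print(precisions(y_true, pred_frequent_wrong))  # micro is below macro
print(precisions(y_true, pred_rare_wrong))      # micro is above macro
```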
Let's say we have learned a decision tree on dataset D. Consider the split learned at the root of the decision tree. Which of the following is true if one of the data points in D is removed and we re-train the tree? -The split at the root will be different. -The split could be the same or could be different. -The split at the root will be exactly the same as before.
The split could be the same or could be different.
Convolutional layers in a neural network typically have fewer parameters than densely connected layers. -True -False
True
In a bag-of-words model with unigrams, removing stop words will reduce the number of features. -True -False
True
In order to train a logistic regression model, we find the weights that maximize the likelihood of the model. -True -False
True
PCA components are always orthogonal. -True -False
True
ReLU is a special case of Maxout. -True -False
True
Modeling the features of an unlabeled dataset to find hidden structure is known as ____________. -Supervised learning -Regression -Classification -Unsupervised learning
Unsupervised learning
When learning decision trees, smaller depth usually translates to lower training error. -True -False
False
Which of the following statements are correct? -A convolutional neural network is a special case of a fully connected neural network. -A convolutional neural network has more parameters than a fully connected neural network. -A fully connected neural network is a special case of a convolutional neural network. -A fully connected neural network has more parameters than a convolutional neural network.
-A convolutional neural network is a special case of a fully connected neural network. -A fully connected neural network has more parameters than a convolutional neural network.
Gradient descent/ascent is ____________. -A model for predicting a continuous variable -An algorithm for minimizing/maximizing a function -An approximation to simple linear regression -A modeling technique in machine learning -A theoretical statistical result
-An algorithm for minimizing/maximizing a function
What are the two most common supervised tasks? Select all that apply. -Reinforcement learning -Clustering -Classification -Regression
-Classification -Regression
Which type of dissimilarity calculation between clusters uses the longest distance between points in each cluster? -Complete-Linkage Clustering -Average-Linkage Clustering -Centroid Linkage Clustering -Single-Linkage Clustering
-Complete-Linkage Clustering
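A small sketch contrasting complete linkage (largest pairwise distance) with single linkage (smallest pairwise distance), using Euclidean distance and made-up points:

```python
from math import dist

def complete_linkage(cluster_a, cluster_b):
    """Largest distance between any point in one cluster and any point in the other."""
    return max(dist(p, q) for p in cluster_a for q in cluster_b)

def single_linkage(cluster_a, cluster_b):
    """Smallest distance between any point in one cluster and any point in the other."""
    return min(dist(p, q) for p in cluster_a for q in cluster_b)

cluster_a = [(0, 0), (1, 0)]
cluster_b = [(4, 0), (6, 0)]
print(complete_linkage(cluster_a, cluster_b))  # 6.0
print(single_linkage(cluster_a, cluster_b))    # 3.0
```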
In the context of L2 regularized logistic regression, which of the following occurs as we increase the L2 penalty λ? Choose all that apply. -Decision boundary becomes less complex -The classifier has lower variance -Some features are excluded from the classifier -Region of uncertainty becomes narrower, i.e., the classifier makes predictions with higher confidence. -Training error decreases -The L2 norm of the set of coefficients gets smaller
-Decision boundary becomes less complex -The classifier has lower variance -The L2 norm of the set of coefficients gets smaller
Suppose you train an SVM and find it overfits your training data. Which of these would be a reasonable next step? -Decrease C -Increase C
-Decrease C
Suppose you train an SVM and find it overfits your training data. Which of these would be a reasonable next step? Choose all that apply. -Increase γ. -Decrease γ. -Decrease C. -Increase C.
-Decrease γ. -Decrease C.
Which of the following are the characteristics of density-based clustering? -Density-based clustering algorithms are proper for arbitrary shape clusters. -Density-based clustering algorithms have no notion of outliers. -Density-based clustering algorithms locate regions of high density that are separated from one another by regions of low density.
-Density-based clustering algorithms are proper for arbitrary shape clusters. -Density-based clustering algorithms locate regions of high density that are separated from one another by regions of low density.
A false negative is always worse than a false positive. -True -False
-False
Adding drop-out to a neural network will increase the number of parameters. -True -False
-False
Any initialization of the centroids in k-means is just as good as any other. -True -False
-False
Bag-of-word models using bigrams completely ignore the order of words in a sentence. -True -False
-False
In the clustering evaluation process, the "elbow point" is where the rate of accuracy increases sharply when we run clustering multiple times, increasing k in each run. -True -False
-False
It is always optimal to add more features to a regression model. -True -False
-False
Larger first order derivative means the current location is far from the minima. -True -False
-False
The decision boundary learned by a neural network with tanh activation function will be piecewise linear. -True -False
-False
The model that best minimizes training error is the one that will perform best for the task of prediction on new data. -True -False
-False
Which of the following statements are true? Select all that apply. -For some datasets, the "right" or "correct" value of K (the number of clusters) can be ambiguous, and hard even for a human expert looking carefully at the data to decide. -Since K-Means is an unsupervised learning algorithm, it cannot overfit the data, and thus it is always better to have as large a number of clusters as is computationally feasible. -The standard way of initializing K-means is setting all the centroids locations to be equal to a vector of zeros. -If we are worried about K-means getting stuck in bad local optima, one way to ameliorate (reduce) this problem is if we try using multiple random initializations.
-For some datasets, the "right" or "correct" value of K (the number of clusters) can be ambiguous, and hard even for a human expert looking carefully at the data to decide. -If we are worried about K-means getting stuck in bad local optima, one way to ameliorate (reduce) this problem is if we try using multiple random initializations.
Suppose you have implemented regularized logistic regression to classify what object is in an image (i.e., to do object recognition). However, when you test your hypothesis on a new set of images, you find that it makes unacceptably large errors with its predictions on the new images. However, your hypothesis performs well (has low error) on the training set. Which of the following are promising steps to take? Choose all that apply. -Use fewer training examples. -Get more training examples. -Try using a smaller set of features. -Try adding polynomial features.
-Get more training examples. -Try using a smaller set of features.
Of the following examples, which would you address using an unsupervised learning algorithm? (Select all that apply.) -Given a dataset of patients diagnosed as either having diabetes or not, learn to classify new patients as having diabetes or not. -Given a database of customer data, automatically discover market segments and group customers into different market segments. -Given email labeled as spam/not spam, learn a spam filter. -Given a set of news articles found on the web, group them into a set of articles about the same story.
-Given a database of customer data, automatically discover market segments and group customers into different market segments. -Given a set of news articles found on the web, group them into a set of articles about the same story.
Which of the following statements are true? Check all that apply. -Given an input vector x, PCA compresses it to a lower-dimensional vector z. -Feature scaling is not useful for PCA, since the eigenvector calculation takes care of this automatically. -If the input features are on very different scales, it is a good idea to perform feature scaling before applying PCA. -PCA can be used only to reduce the dimensionality of data by 1 (such as 3D to 2D, or 2D to 1D).
-Given an input vector x, PCA compresses it to a lower-dimensional vector z. -If the input features are on very different scales, it is a good idea to perform feature scaling before applying PCA.
Which of the following loss functions is used by linear SVM? -Zero-one loss -Hinge loss -Log loss
-Hinge loss
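For reference, the hinge loss for a label y in {-1, +1} and a raw classifier score is max(0, 1 - y * score). A minimal sketch:

```python
def hinge_loss(y, score):
    """Hinge loss for a label y in {-1, +1} and a raw (unsquashed) score."""
    return max(0.0, 1.0 - y * score)

print(hinge_loss(+1, 2.0))  # 0.0  correct and outside the margin
print(hinge_loss(+1, 0.5))  # 0.5  correct but inside the margin
print(hinge_loss(-1, 0.5))  # 1.5  wrong side of the boundary
```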
Which of the following statements about learning rate in gradient descent is/are true (select all that apply). -The learning rate doesn't matter. -It's important to choose a very small learning rate. -If the learning rate is too small (but not zero) gradient descent may take a very long time to converge. -If the learning rate is too large gradient descent may not converge.
-If the learning rate is too small (but not zero) gradient descent may take a very long time to converge. -If the learning rate is too large gradient descent may not converge.
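Both effects show up on a toy problem. The sketch below minimizes f(x) = x^2 (gradient 2x) from x = 1.0; the function, step count, and learning rates are illustrative choices.

```python
def gradient_descent(lr, steps=50, x=1.0):
    """Run plain gradient descent on f(x) = x**2 and return the final x."""
    for _ in range(steps):
        x -= lr * 2 * x  # gradient of x**2 is 2x
    return x

print(abs(gradient_descent(0.001)))  # still close to 1: converging very slowly
print(abs(gradient_descent(0.1)))    # close to 0: converged
print(abs(gradient_descent(1.1)))    # very large: each step overshoots and diverges
```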
With a term's frequency and document frequency held fixed, its TF-IDF weight ______ as the total number of documents grows. -Increases -Decreases
-Increases
You are training a classification model with logistic regression. Which of the following statements are true? Choose all that apply. -Adding a new feature to the model always results in equal or better performance on examples not in the training set. -Introducing regularization to the model always results in equal or better performance on examples not in the training set. -Introducing regularization to the model always results in equal or better performance on the training set. -Adding many new features to the model makes it more likely to overfit the training set.
-Adding many new features to the model makes it more likely to overfit the training set.
In ridge regression, choosing a large penalty strength λ tends to lead to a model with (select all that apply). -High variance -Low variance -Low bias -High bias
-Low variance -High bias
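The shrinkage behind this answer is visible in the simplest case. For one feature and no intercept, minimizing sum((y - w*x)^2) + λ*w^2 gives w = sum(x*y) / (sum(x^2) + λ). The data below is made up, and `ridge_weight` is a hypothetical helper.

```python
def ridge_weight(xs, ys, lam):
    """Closed-form ridge solution for a single feature with no intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # exactly y = 2x
for lam in (0.0, 14.0, 126.0):
    # The weight shrinks toward 0 as the penalty grows: 2.0, 1.0, 0.2
    print(lam, ridge_weight(xs, ys, lam))
```

A larger λ pulls the estimate further from the least-squares fit (more bias) and makes it less sensitive to the particular training sample (less variance).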
If the features of Model 1 are a strict subset of those in Model 2, which model will usually have lowest training error? -Model 2 -It's impossible to tell with only this information. -Model 1
-Model 2
Which one of the following is not an activation function? -Relu -Tanh -Sigmoid -Polynomial
-Polynomial
In a future society, a machine is used to predict a crime before it occurs. If you were responsible for tuning this machine, what evaluation metric would you want to maximize to ensure no innocent people (people not about to commit a crime) are imprisoned (where crime is the positive label)? -Accuracy -Precision -AUC -F1 -Recall
-Precision
Which of the following statements apply to neural networks? -Work well when little training data is available -Fast to train on large datasets -Provide state-of-the-art performance in computer vision and audio analysis -Can learn arbitrarily complex functions -Have no hyper-parameters to tune
-Provide state-of-the-art performance in computer vision and audio analysis -Can learn arbitrarily complex functions
We are interested in reducing the number of false negatives. Which of the following metrics should we primarily look at? -Precision -Recall -Accuracy
-Recall
Training a model using labeled data where the labels are continuous quantities to predict labels for new data is known as __________. -Feature extraction -Regression -Clustering -Classification
-Regression
Which of the following is an example of clustering? -Accumulate data into groups based on labels -Creating a new representation of the data with fewer features -Compress elongated clouds of data into more spherical representations -Separate the data into distinct groups by similarity
-Separate the data into distinct groups by similarity
Selecting model complexity on test data: (select all that apply) -Allows you to avoid issues of overfitting to training data -Should never be done -Provides an overly optimistic assessment of performance of the resulting model -Is computationally inefficient
-Should never be done -Provides an overly optimistic assessment of performance of the resulting model
Which of the following is not an ensemble method? -Random forests -AdaBoost -Gradient boosted trees -Single decision trees
-Single decision trees
Training a model using labeled data and using this model to predict the labels for new data is known as ____________. -Clustering -Density estimation -Unsupervised learning -Supervised learning
-Supervised learning
Which of the following properties is not the reason behind using the convolutional neural network for image recognition? -The contents of an image do not change when the image is rotated. -Some patterns are much smaller than the whole image. -Subsampling the pixels will not change the object. -The same patterns may appear in different regions in an image.
-The contents of an image do not change when the image is rotated.
In a simple regression model, if you increase the input value by 1, then you expect the output to change by: -1 as well -The value of the slope parameter -It would be impossible to tell -The value of the intercept parameter
-The value of the slope parameter
Which of the following are good/recommended applications of PCA? Select all that apply. -To visualize high-dimensional data (by choosing k = 2 or k = 3). -Instead of using regularization, use PCA to reduce the number of features to reduce overfitting. -To reduce the dimension of the input data so as to speed up a learning algorithm. -To compress the data so it takes up less computer memory/disk space.
-To visualize high-dimensional data (by choosing k = 2 or k = 3). -To reduce the dimension of the input data so as to speed up a learning algorithm. -To compress the data so it takes up less computer memory/disk space.
You're running a company, and you want to develop learning algorithms to address each of two problems. -Problem One: You have a large inventory of identical items. You want to predict how many of these items will sell over the next three months. -Problem Two: You'd like software to examine individual customer accounts, and for each account decide if it has been hacked/compromised. Should you treat these as classification or as regression problems? -Treat both as regression problems. -Treat both as classification problems. -Treat problem one as a regression problem, problem two as a classification problem. -Treat problem one as a classification problem, problem two as a regression problem.
-Treat problem one as a regression problem, problem two as a classification problem.
AdaBoost focuses on data points it incorrectly predicted by increasing those weights in the data set. -True -False
-True
Which of the following generates the most features? -ngram_range = (1, 1) -ngram_range = (2, 4) -ngram_range = (1, 4) -ngram_range = (4, 4)
-ngram_range = (1, 4)
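A rough way to see why the widest range wins is to count distinct word n-grams for each range. The helper below is hypothetical and only mimics the inclusive (min_n, max_n) bounds; it does not reproduce scikit-learn's tokenization.

```python
def ngram_features(tokens, ngram_range):
    """Set of distinct word n-grams for every n in the inclusive range."""
    low, high = ngram_range
    return {tuple(tokens[i:i + n])
            for n in range(low, high + 1)
            for i in range(len(tokens) - n + 1)}

tokens = "the quick brown fox jumps".split()
for r in [(1, 1), (2, 4), (1, 4), (4, 4)]:
    print(r, len(ngram_features(tokens, r)))  # 5, 9, 14, 2
```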
Which of these could be an acceptable sequence of operations using scikit-learn to apply the k-nearest neighbors' classification method? -read_table, train_test_split, fit, KNeighborsClassifier, score -read_table, fit, train_test_split, KNeighborsClassifier, score -read_table, train_test_split, KNeighborsClassifier, fit, score -KNeighborsClassifier, train_test_split, fit, score, read_table
-read_table, train_test_split, KNeighborsClassifier, fit, score
It is often the case that false positives and false negatives incur different costs. In situations where false negatives cost much more than false positives, we should -require lower confidence level for positive predictions. -require higher confidence level for positive predictions.
-require lower confidence level for positive predictions.
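Lowering the decision threshold trades false negatives for false positives. The sketch below uses made-up scores and a hypothetical helper that counts both error types at a given threshold.

```python
def error_counts(y_true, scores, threshold):
    """Return (false_positives, false_negatives) when predicting 1 for score >= threshold."""
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    return fp, fn

y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.1]
print(error_counts(y_true, scores, 0.5))  # (1, 1)
print(error_counts(y_true, scores, 0.2))  # (2, 0)
```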
Which of the following is not a linear regression model? Hint: remember that a linear regression model is always linear in the parameters, but may use non-linear features. -y = w0 + w1 x^2 -y = w0 w1 + log(w1) x -y = w0 + w1 x -y = w0 + w1 log(x)
-y = w0 w1 + log(w1) x
Which of the following is calculated using backward pass? -Z -a -∂z/∂w -∂t/∂z
-∂t/∂z
Suppose there are two inputs to a neuron: 0.72 and 0.12. The corresponding weights are 3 and -1, and the bias parameter is -2. What is the output of this neuron using the sigmoid function as an activation function? Pick the closest answer. -0.51 -0.11 -0.85 -0.62
0.51
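The arithmetic can be checked directly: the weighted sum is 3*0.72 + (-1)*0.12 - 2 = 0.04, and the sigmoid of 0.04 is roughly 0.51. A short sketch with a hypothetical `neuron` helper:

```python
from math import exp

def neuron(inputs, weights, bias):
    """Sigmoid activation applied to the weighted sum of the inputs plus the bias."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + exp(-z))

out = neuron([0.72, 0.12], [3, -1], -2)
print(round(out, 2))  # 0.51
```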
Consider training a 1 vs. all multi-class classifier for the problem of digit recognition using logistic regression. There are 10 digits, thus there are 10 classes. How many logistic regression classifiers will we have to train? -1 -9 -10 -5
10
Gradient descent/ascent allows us to _______________. -Assess performance of a model on test data -Estimate model parameters from data -Predict a value based on a fitted function
Estimate model parameters from data
Each time we update the parameters using gradient descent, we obtain parameters that will make the loss smaller. -True -False
False
High classification accuracy always indicates a good classifier. -True -False
False
A simple model with few parameters is most likely to suffer from: -High Variance -High Bias
High Bias
How can you reduce the number of features in a text classification task with a bag-of-word representation? -Tf-idf rescaling. -Using higher n-grams (bigrams, trigrams). -Removing uncommon words (min_df).
Removing uncommon words (min_df)
Given a dataset with 10,000 observations and 50 features plus one label, what would be the dimensions of X_train, y_train, X_test, and y_test? Assume a train/test split of 75%/25%.
X_train: (7500, 50) y_train: (7500, ) X_test: (2500, 50) y_test: (2500, )
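The shapes follow from the row counts alone. This sketch computes them with a hypothetical helper rather than calling a library splitter, and assumes the test fraction divides the data evenly.

```python
def split_shapes(n_rows, n_features, test_fraction=0.25):
    """Shapes of X_train, y_train, X_test, y_test for a simple row-wise split."""
    n_test = round(n_rows * test_fraction)
    n_train = n_rows - n_test
    return {
        "X_train": (n_train, n_features),
        "y_train": (n_train,),
        "X_test": (n_test, n_features),
        "y_test": (n_test,),
    }

shapes = split_shapes(10_000, 50)
print(shapes)
```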