Machine Learning Midterm
Name one advantage of logistic regression over Naive Bayes and vice versa.
Naive Bayes makes assumptions about the class-conditional distribution P(x|y), whereas logistic regression doesn't. However, if those assumptions hold, Naive Bayes needs less data and converges faster to the same solution.
Why is it often impractical to use Newton's method to train a neural network?
Neural networks have so many parameters that the Hessian matrix is too large to compute and invert practically. Also, Newton's method is prone to getting stuck in local minima and saddle points.
Why does linear regression always converge after a single Newton step?
Newton's method minimizes a local quadratic approximation of the loss built from the Hessian. Linear regression's squared loss is exactly quadratic, so the approximation is exact and a single Newton step lands on the global minimum.
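As a quick check (a sketch, assuming the squared loss L(w) = (Xw - y)^T (Xw - y) with X^T X invertible), one Newton step from any starting point lands exactly on the closed-form least-squares solution:

```latex
\nabla L(w) = 2X^\top (Xw - y), \qquad H = \nabla^2 L(w) = 2X^\top X,
\\[4pt]
w_{\text{new}} = w - H^{-1}\nabla L(w)
              = w - (X^\top X)^{-1} X^\top (Xw - y)
              = (X^\top X)^{-1} X^\top y .
```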
Do decision trees return the same predictions as kNN?
No. Decision trees predict with axis-aligned rectangular regions, while kNN predicts from distance-based neighborhoods, so their decision boundaries generally differ.
Can decision trees be kernelized?
No, decision trees don't access data through inner products; instead they split on the features.
Does the ID3 algorithm find the globally optimal tree?
No, it is a greedy approximation of finding an optimal tree (it is myopic).
What assumptions do you make for Ridge regression? Name the procedure used to derive the respective loss function/closed form solution.
Ridge regression assumes the regression function is linear with additive Gaussian noise, and that w has a zero-mean Gaussian prior. MAP estimation is used to derive its loss function and closed-form solution.
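A minimal numpy sketch of the resulting closed form (assuming a design matrix X of shape n x d, labels y, and a user-chosen regularization strength lam; the names are illustrative):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """MAP solution under a zero-mean Gaussian prior: w = (X^T X + lam*I)^{-1} X^T y."""
    d = X.shape[1]
    # solve the linear system instead of forming an explicit inverse (more stable)
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# hypothetical usage
X = np.random.randn(100, 5)
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * np.random.randn(100)
w_hat = ridge_closed_form(X, y, lam=0.1)
```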
Why can stochastic gradient descent be better than traditional (batch) gradient descent when applied to neural networks?
SGD updates are much cheaper per step, and because they are noisy, SGD can jump out of shallow local minima more easily.
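A minimal sketch contrasting the two update rules (grad_fn(w, x, y) is a hypothetical per-example gradient; batch GD averages over all points each step, SGD uses one random point):

```python
import numpy as np

def batch_gd(w, X, y, grad_fn, lr=0.01, steps=1000):
    """Exact averaged gradient each step: smooth but expensive per update."""
    n = len(y)
    for _ in range(steps):
        g = sum(grad_fn(w, X[i], y[i]) for i in range(n)) / n
        w = w - lr * g
    return w

def sgd(w, X, y, grad_fn, lr=0.01, steps=1000):
    """One random sample per step: cheap, noisy updates that can escape shallow minima."""
    n = len(y)
    for _ in range(steps):
        i = np.random.randint(n)
        w = w - lr * grad_fn(w, X[i], y[i])
    return w
```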
Why shouldn't you incorporate bias as a constant feature when working with SVM's?
SVM's minimize wTw since we want the maximum margin hyperplane, which requires the bias to move freely. It doesn't make sense to constrain and minimize the bias term as part of w.
What are two assumptions that the Perceptron makes?
The Perceptron assumes that the data is linearly separable and binary.
Between the fully connected layer and convolutional layer with the same number of parameters, which does more arithmetic operations?
The convolutional layer, because weight sharing applies the same parameters at every location of the image, so each parameter is used in many arithmetic operations.
Why does kNN still perform well on high dimensional face images and handwritten digits?
The data is intrinsically low dimensional.
How does tree depth relate to the bias-variance tradeoff?
The deeper the tree, the higher the variance and lower the bias.
Suppose your features are age, number of years of education, and annual income in US dollars for an individual. Why is the Euclidean distance not a good measure of dissimilarity?
The features are on different scales. You could normalize each feature to [0, 1].
When bagging, how do we compute an unbiased estimator of the test error?
We use the out-of-bag error: for each training point, aggregate (e.g. majority-vote) only the classifiers whose bootstrap samples did not include that point, compute the loss of that aggregated prediction on the point, and average over all points.
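A sketch of the computation (assuming sklearn-style classifiers in models, with bags[j] holding the indices of the bootstrap sample classifier j was trained on; both names are illustrative):

```python
import numpy as np

def oob_error(models, bags, X, y):
    """Out-of-bag error: for each point, vote only among models that never saw it."""
    losses = []
    for i in range(len(y)):
        preds = [m.predict(X[i:i+1])[0]
                 for m, bag in zip(models, bags) if i not in bag]
        if not preds:           # point appeared in every bootstrap sample (rare)
            continue
        vote = max(set(preds), key=preds.count)   # majority vote of the OOB ensemble
        losses.append(vote != y[i])
    return np.mean(losses)
```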
What is a scenario in which Adaboost is not a good choice?
If a data set has a lot of label noise, Adaboost will overfit: the exponential loss keeps raising the weights of mislabeled points until the ensemble classifies them "correctly", i.e. it fits the noise.
Under what conditions on your training set will a full-depth CART tree obtain 0 training error?
If there are no training inputs with the same features but different labels
Why does adding more training data not always help reduce your testing error below a desired threshold?
If your training error is too high, the testing error can't go below the desired threshold since it is bounded below by the training error.
When will Lloyd's algorithm not output the optimal clustering for a given k?
Lloyd's algorithm only ever decreases the k-means objective from wherever it starts, so it converges to a local optimum rather than the global one. It is also sensitive to initialization.
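A minimal sketch of Lloyd's algorithm; both the assignment and update steps only ever lower the k-means objective, and the result depends on the random initialization:

```python
import numpy as np

def lloyds(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]   # result is sensitive to this choice
    for _ in range(iters):
        # assignment step: attach each point to its nearest center
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        # update step: move each center to the mean of its cluster
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```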
Name one advantage of logistic regression over SVM and vice versa.
Logistic regression has a probabilistic interpretation whereas SVM is more robust and generalizes well on testing data.
Imagine you build a KD-tree and label each leaf with the most common label amongst all training points that fall into this leaf. Why would this not be a desirable classifier?
KD-tree splits are chosen to divide the points evenly, not to produce pure leaves, so many leaves wouldn't be pure and the most common label isn't a good estimate.
What is the formula for computing the size of an output feature map for convolution?
(W - F + 2P) / S + 1, where W = input (image) size, F = filter size, P = padding, S = stride.
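A tiny helper for the formula (a sketch assuming square inputs and filters):

```python
def conv_output_size(W, F, P=0, S=1):
    """Output spatial size of a convolution: (W - F + 2P) / S + 1."""
    return (W - F + 2 * P) // S + 1

# e.g. a 32x32 image, 5x5 filter, padding 2, stride 1 -> 32x32 output
assert conv_output_size(32, 5, P=2, S=1) == 32
```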
Identify whether kNN or SVM would have lower test classification error in the following situations. 1. Classify text documents by topic 2. Given 15 measurements of vital signs, predict disease in a patient 3. Classify handwritten digits as 3 versus as 8
1. SVM (the data is high dimensional, so a linear classifier works well) 2. kNN (the data can have nonlinear boundaries) 3. kNN (the data is inherently low dimensional)
Name three reasons why random forests are such popular classifiers amongst practitioners.
1. RF only has two hyperparameters, m (# of bagged datasets) and k (# of subsampled features), which it is insensitive to. 2. Features can have different scales and units. 3. RF outputs meaningful probability estimates.
When is the Naive Bayes decision boundary identical to the Logistic Regression boundary?
1. The features are conditionally independent given the label 2. The two classes are drawn from Gaussian distributions with the same variance but different means
Why wouldn't we be able to use Newton's Method on a particular data set?
1. The data is flat in one dimension (the Hessian is not invertible) 2. The data is high dimensional (the Hessian is huge, which is expensive to compute)
What are some advantages of decision trees over kNN?
1. You don't have to store all the training points 2. Testing is fast with decision trees 3. You don't need to choose a distance metric
What are the assumptions of the following classifiers? 1. kNN 2. Naive Bayes 3. Logistic regression 4. SVM
1. kNN - similar points have similar labels 2. Naive Bayes - the features are conditionally independent given the label 3. Logistic regression - P(y_i|x_i) = 1 / (1 + e^(-y_i w^T x_i)) 4. SVM - the hard-margin version assumes the data is linearly separable. Both LR and SVM assume the labels are binary.
If a classifier obtains 0% training error, it cannot have 100% testing error. True or false?
False. The classifier could just memorize the training set.
True or false: k-means is a non-parametric algorithm.
False. There's always one mean vector for each of the k clusters (independent of input size).
True or false: SVM's maximize the margin between the training and testing data.
False. They maximize the margin between the training data and the separating hyperplane.
Identify the following classifiers as generative or discriminative. 1. kNN 2. SVM 3. Logistic regression 4. Perceptron 5. Naive Bayes
All but Naive Bayes are discriminative. NB is generative.
What is a downside of kNN as the data set size gets very large?
Applying kNN gets very slow, since every test point must be compared against all stored training points; in high dimensions the points also become nearly equidistant (curse of dimensionality).
In MAP, we find the maximizer of the posterior, so we need to find an expression for the posterior. True or false?
False. To maximize P(theta|D) we only need something proportional to it: P(theta|D) ∝ P(D|theta)P(theta), so the evidence P(D) can be ignored.
True or false: variance can only be reduced by picking a different model or by adding a regularizer term.
False. You can also reduce variance with bagging, by adding more training data, or by early stopping of gradient descent.
What happens in Adaboost if two training inputs (in a binary classification problem) are identical in features but have different labels?
Both points will obtain very high weights and eventually will dominate the training data set. Weak learners will no longer be able to classify the weighted data set with greater than 50% accuracy and the algorithm will stop.
If you were to use the "true" Bayesian way of machine learning, you would put a prior over the possible models and draw one randomly during training. True or false?
False. Rather than committing to one model, you would average predictions over all models, weighted by their posterior probability (drawing many models, not just one).
True or false: every function k that maps two feature vectors to a non-negative real number is a kernel.
False. k must correspond to an inner product in some feature space, i.e. every kernel matrix it produces must be symmetric and positive semi-definite.
Under what assumptions does the gradient descent update reduce loss?
Gradient descent reduces the loss when the step size is positive and sufficiently small and the gradient is nonzero, i.e. g(w)^T g(w) > 0.
Why is the test error typically higher than the validation error?
Because you repeatedly pick the model that works best on the validation set, the validation error becomes an optimistically biased estimate of the generalization error; the untouched test set gives a higher, more honest estimate.
True or false: logistic regression is a special case of Naive Bayes.
False. LR doesn't assume features are independent given the label. NB is generative, and LR is discriminative.
If the training data set is linearly separable, Naive Bayes will have zero training error. True or false?
False. NB only fits the distribution to the data and might sacrifice some points to do so.
True or false: In the classification setting, if we can't reduce impurity, we stop splitting
False. Take XOR, for example: no single split reduces impurity, yet further splits eventually separate the classes perfectly.
Assume there exists a vector w' that defines a hyperplane that perfectly separates your data. Let the Perceptron vector be w. As the algorithm proceeds, w -> w' and in the limit, we have w = w'. True or false?
False. The Perceptron converges to some separating hyperplane, but it need not be w'; once the training data is separated the updates stop, so w does not keep moving toward w'.
Say you run a classifier but one of the features is no longer available. Why would you prefer to use kNN over SVM or logistic regression?
During testing with kNN, you can compute distance with all other features and drop the missing one on the fly.
SGD (stochastic gradient descent) always finds the same solution when fitting parameters for neural networks.
False. Random initialization, the random order of samples, and the non-convex loss mean different runs generally reach different solutions.
The kNN algorithm can only be used for classification but not regression. True or false?
False, kNN can be used for regression by averaging the labels of the k nearest neighbors.
For boosting, we can learn multiple base classifiers in parallel.
False, you have to learn each base classifier sequentially.
True or false: If we ensemble too many classifiers, bagging will increase bias.
False, you'll just get diminishing returns; training and prediction slow down, but the error will not increase.
True or false: an effective way to fight a high bias scenario is to add training data.
False. With high bias the model underfits; adding data mainly raises the training error toward the test error and does not reduce the bias.
k means clustering and PCA require labelled data. True or false?
False. Both are examples of unsupervised learning, which seeks to find structure in unlabelled data.
Generalization error is just another word for validation error. True or false?
False. Generalization error is the expected error on new data from the same data distribution. Validation error is the empirical error on the validation set.
The order of the training points can affect the convergence of the gradient descent algorithm. True or false?
False. Gradient descent depends on a sum across training samples, which is independent of order.
True or false: in a linearly separable dataset, both linear SVM with hard and soft constraints will return the same solution.
False. Hard constraint SVM will never misclassify points, whereas soft constraints might depending on C (larger C means more strict, and vice versa).
If a data set is linearly separable, the Perceptron is guaranteed to converge in a finite number of updates. Otherwise, it sometimes converges, but there are no guarantees. True or false?
False. If the data is not linearly separable, the Perceptron will not converge.
True or false: in order for the gradient descent to converge, the loss function has to be convex and differentiable everywhere.
False. If the function is not convex, gradient descent can still converge, just to a local minimum rather than the global one.
The optimization of deep neural networks is a convex minimization problem.
False. It is non-convex because of the non-linear transition functions.
The Bayes optimal error is the best classification you could get if there were no noise. True or false?
False. It is the best classification error you could get if you know the distribution. The error is actually due to noise.
Increasing regularization tends to reduce the bias of your classifier. True or false?
False. It reduces the variance of your classifier. Often, it can increase bias when set too high.
True or false: the regularization constant should be chosen based on the test data set.
False. It should be chosen based on the validation set.
Why is Naive Bayes generative? What is its discriminative counterpart?
It estimates P(X, Y) = P(X|Y)P(Y). Logistic regression is the discriminative counterpart of Naive Bayes.
Does it take more time to train or apply kNN?
It takes more time to apply: training just memorizes the training points, but at test time kNN must compute the distance from the test point to every stored training point.
When is kernel SVM faster than primal SVM?
Kernel SVM is faster when the training set size n is small and the (feature-expanded) dimensionality is very high; the kernel trick avoids computing the O(d) inner products between the weight vector and the input vectors.
Logistic regression on linearly separable data and non-linearly separable data has what combination of low/high bias and variance?
Linearly separable: low bias and low variance. Non-linearly separable: high bias and low variance.
What are two sources of randomness when building a random forest?
One is sampling from the data with replacement (bagging). The other is subsampling k features before every split.
PCA solves what two problems given a data set?
PCA finds the directions of maximum variance in the data and gives a low-dimensional representation of the data (dimensionality reduction).
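A minimal sketch via the SVD of the centered data (q is the number of principal components to keep):

```python
import numpy as np

def pca(X, q):
    """Project X (n x d) onto its top-q directions of maximum variance."""
    Xc = X - X.mean(axis=0)                  # center first: structure, not bulk offset
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:q]                      # principal directions (rows)
    return Xc @ components.T, components     # low-dimensional representation, directions
```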
What is the difference between parametric and non-parametric algorithms?
Parametric algorithms have a fixed number of parameters. Non-parametric algorithms have a number of parameters that grow with the training set size.
Why is the SVM margin exactly 1 / norm(w)?
The hyperplane is invariant to rescaling (w, b), so we can fix the scale such that the closest points satisfy |w^T x_i + b| = 1; their distance to the hyperplane is then |w^T x_i + b| / ||w|| = 1 / ||w||.
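Spelled out under that scaling (a short derivation of the same fact):

```latex
\text{dist}(x_i, \text{hyperplane}) = \frac{|w^\top x_i + b|}{\|w\|_2}
\quad\Longrightarrow\quad
\text{margin} = \min_i \frac{|w^\top x_i + b|}{\|w\|_2} = \frac{1}{\|w\|_2}.
```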
For k-fold cross validation, describe the pros and cons as k --> n. When would you be most inclined to use k = n?
Pros: each model is trained on almost all of the data, so the error estimate improves. Cons: validation gets much slower since you must train k models. You should use k = n (leave-one-out) when you have a small dataset.
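A sketch of the k-fold loop (train_fn and loss_fn are hypothetical placeholders for fitting a model and evaluating it; with k = n each fold holds out a single point, i.e. leave-one-out):

```python
import numpy as np

def k_fold_cv(X, y, k, train_fn, loss_fn, seed=0):
    """Average held-out loss over k folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    losses = []
    for val in folds:
        train = np.setdiff1d(idx, val)
        model = train_fn(X[train], y[train])           # fit on the other k-1 folds
        losses.append(loss_fn(model, X[val], y[val]))  # evaluate on the held-out fold
    return np.mean(losses)
```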
Consider kernel (Ridge) regression with the RBF kernel. Why does the interpolant go to zero outside the region where there's data?
The RBF kernel decays to zero away from its center, so the interpolant, a weighted sum of kernels centered on the training points, goes to zero outside the region containing training data (think of it as a smart nearest-neighbor method).
What is one advantage of RELU (rectified linear units) over sigmoid functions?
ReLU activations are sparse (exact zeros for negative inputs) and have gradient 1 on the positive side, which reduces the likelihood of vanishing gradients.
kNN with small k and large k has what combination of low/high bias and variance?
Small k: low bias and high variance. Large k: high bias and low variance.
Name an advantage of squared loss over absolute loss and vice versa.
Squared loss is differentiable, whereas absolute loss is less sensitive to outliers.
Why do ML practitioners drop the learning rate during training?
Starting out with a large learning rate takes you quickly downhill and prevents you from getting stuck in sharp local minima. Switching to a smaller learning rate then lets you converge to the local minimum closest to the current weight position.
Imagine trying to use Adaboost on a full CART tree without depth limit. Although the code seems correct, it crashes in the very first round. What do you think is the problem?
The full-depth CART tree obtains 0 training error, so epsilon = 0 and the Adaboost step size alpha = 1/2 ln((1 - epsilon) / epsilon) requires a division by zero (an infinite step size), which crashes the code in the first round.
In neural networks, bagging can be performed without random subsampling of the data i.e. one trains m neural networks independently and ensembles their results. Can you explain why the subsampling is unnecessary in this case?
The random initialization and non-convexity of neural networks ensures independently trained models will end up in different local minima and obtain different results. The effect is similar to training on slightly different data sets.
In GBRT (gradient boosted regression trees), what values do we fit and try with each tree?
The residuals r_i = y_i - H(x_i), where H is the current ensemble.
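A sketch of the residual-fitting loop (assuming squared loss, for which the negative gradient is exactly the residual, and scikit-learn's DecisionTreeRegressor as the base learner):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbrt_fit(X, y, n_trees=100, lr=0.1, max_depth=3):
    """Each new tree is fit to the residuals y - H(x) of the current ensemble H."""
    trees, pred = [], np.zeros(len(y))
    for _ in range(n_trees):
        residuals = y - pred                                  # negative gradient of squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        pred += lr * tree.predict(X)                          # update the ensemble H
        trees.append(tree)
    return trees

def gbrt_predict(trees, X, lr=0.1):
    return lr * sum(t.predict(X) for t in trees)
```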
Imagine you apply the kNN classifier with Euclidean distance. Describe what happens if you scale one dimension of the input features by a large positive constant across all samples.
The scaled feature will dominate the distance metric, so the kNN distances are effectively measured only in that dimension.
What happens if you remove all the non-linear transition functions from a neural network?
The network collapses to a linear classifier (a composition of linear maps is linear), so its bias and training error increase.
Why are convolutional neural networks the natural choice for image/audio data?
They emphasize adjacency and have translation invariance.
What are some strategies to reduce bias and variance?
To reduce variance: try bagging, reduce model complexity, and add training data. To reduce bias: try boosting, increase model complexity, and add features.
True or false: If we remove all non-linearities (activation functions), a multi-layer perceptron becomes linear regression.
True
True or false: if a kernelized SVM correctly classifies all points, the dataset is linearly separable in the "feature expansion" space
True
If the Naive Bayes assumption holds, the Naive Bayes classifier becomes identical to the Bayes Optimal Classifier. True or false?
True.
True or false: you can change the noise term in the bias-variance trade-off by changing the feature representation of the data.
True. For instance, if all the features are removed, the error is only noise (which would be very large).
True or false: for Adagrad, each feature has its own learning rate.
True. Each parameter's effective learning rate is scaled by the inverse square root of its accumulated squared gradients: coordinates with small past gradients get larger effective learning rates, and vice versa.
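A sketch of the per-parameter update (g is the current gradient, cache the running sum of squared gradients; eps avoids division by zero):

```python
import numpy as np

def adagrad_step(w, g, cache, lr=0.01, eps=1e-8):
    """Per-coordinate learning rate: lr / sqrt(accumulated squared gradients)."""
    cache = cache + g ** 2                     # accumulate squared gradients per parameter
    w = w - lr * g / (np.sqrt(cache) + eps)    # small past gradients -> larger effective step
    return w, cache
```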
True or false: For logistic regression, Newton's method will return the global optimum.
True. The loss function of LR is convex.
True or false: a learned kernel SVM model (with RBF kernel) requires you to store some of the training data.
True. You need to store the support vectors.
What are some ways you can set hyperparameters?
Try cross validation, telescopic search, or grid/random search.
You have a fully connected MLP (multi-layer perceptron) with a configuration of 5 (inputs), 4, 6, and 1 (outputs). How many parameters does it have?
Each layer contributes (# inputs x # outputs) weights plus one bias per output unit: 5 x 4 + 4 x 6 + 6 x 1 + (4 + 6 + 1) = 20 + 24 + 6 + 11 = 61 parameters.
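A tiny helper reproducing the count (a sketch; layer sizes listed from input to output, one bias per non-input unit):

```python
def mlp_param_count(sizes):
    """Weights: in*out per consecutive layer pair; biases: one per unit after the input."""
    weights = sum(a * b for a, b in zip(sizes[:-1], sizes[1:]))
    biases = sum(sizes[1:])
    return weights + biases

assert mlp_param_count([5, 4, 6, 1]) == 61   # 20 + 24 + 6 weights, 4 + 6 + 1 biases
```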
Why do we remove the mean from the data before computing principal components?
We are trying to find structure within the data (not bulk properties), so we center it first.
One limitation of CART trees is that all splits are axis aligned. What is a scenario in which this behavior leads to bad performance?
When you have high-dimensional, sparse data (for example, text data with lots of zeros), where each axis-aligned split makes little progress, so the tree needs very many splits.
When we kernelize, what is the key substitution we must make in place of using the feature expansion directly?
Write the loss, the classifier, and w in terms of inner products between inputs, then replace each inner product x_i^T x_j with the kernel function k(x_i, x_j).
Imagine you are trying to fit points with a cubic function f(x) = a + bx + cx^2 + dx^3, but you only have code for linear regression. How do you achieve this goal with a) explicit feature expansion and b) kernelization?
You can map each input x to the 4-dimensional feature expansion (1, x, x^2, x^3) and run linear regression on it, or kernelize with a polynomial kernel of degree 3.
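A sketch of both routes (explicit expansion with ordinary least squares, versus kernel regression with the degree-3 polynomial kernel k(x, z) = (1 + xz)^3; the tiny ridge term is only for numerical stability):

```python
import numpy as np

x = np.linspace(-2, 2, 50)
y = 1 + 2 * x - x ** 2 + 0.5 * x ** 3 + 0.1 * np.random.randn(50)

# (a) explicit feature expansion: phi(x) = (1, x, x^2, x^3), then linear regression
Phi = np.stack([np.ones_like(x), x, x ** 2, x ** 3], axis=1)
a, b, c, d = np.linalg.lstsq(Phi, y, rcond=None)[0]

# (b) kernelization: degree-3 polynomial kernel, never forming phi explicitly
K = (1 + np.outer(x, x)) ** 3
alpha = np.linalg.solve(K + 1e-6 * np.eye(len(x)), y)
y_hat = K @ alpha            # predictions at the training inputs
```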
Explain why as n approaches infinity, theta_MAP approaches theta_MLE for the coin toss (modeled by the binomial distribution).
The prior pseudo-counts alpha - 1 and beta - 1 become negligible compared to the large counts n_H and n_T, so the MAP estimate converges to the MLE.
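The two estimators side by side (Beta(alpha, beta) prior on the heads probability, n_H heads and n_T tails):

```latex
\hat\theta_{\mathrm{MLE}} = \frac{n_H}{n_H + n_T},
\qquad
\hat\theta_{\mathrm{MAP}} = \frac{n_H + \alpha - 1}{n_H + n_T + \alpha + \beta - 2}
\;\longrightarrow\; \hat\theta_{\mathrm{MLE}} \quad \text{as } n_H + n_T \to \infty .
```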
With Naive Bayes, given c classes/labels, k categories per feature, and +1 smoothing, how many hallucinated samples are you adding to the entire training set?
ck
What is the relationship between kNN and the Bayes Optimal classifier?
As n goes to infinity, err(Bayes Optimal) <= err(1-NN) <= 2 * err(Bayes Optimal).
Name one advantage of kNN over logistic regression and vice versa.
kNN can have nonlinear boundaries whereas logistic regression must be linear. However, LR has a probabilistic interpretation.
Name three algorithms for which boosting will be ineffective.
kNN, full-depth decision trees, and kernel SVM's (all have low bias and high variance)
Let m be the number of support vectors of an SVM trained on n data points with RBF kernel. For a fixed n, imagine you increase the dimensionality d of the data until it becomes very large. How would you expect the ratio m / n to change as d >> 0?
m / n would approach 1 because of the curse of dimensionality. All the training points would become very far from each other and get very close to the decision boundary.