Machine Learning Midterm sp24

Identify whether kNN or SVM would have lower test classification error in the following situations. 1. Classify text documents by topic 2. Given 15 measurements of vital signs, predict disease in a patient 3. Classify handwritten digits as 3 versus as 8

1. SVM (the data is high dimensional, so a linear boundary works well) 2. kNN (the data can have nonlinear boundaries) 3. kNN (the data is inherently low dimensional)

When is the Naive Bayes decision boundary identical to the Logistic Regression boundary?

1. The features are conditionally independent given the label 2. The two classes are drawn from Gaussian distributions with the same (shared) variance but different means

Why wouldn't we be able to use Newton's Method on a particular data set?

1. The data is flat in one dimension (the Hessian is not invertible) 2. The data is high dimensional (the Hessian is huge, which is expensive to compute and invert)

What are the assumptions of the following classifiers? 1. kNN 2. Naive Bayes 3. Logistic regression 4. SVM

1. kNN - similar points have similar labels 2. Naive Bayes - the features are conditionally independent given the label 3. Logistic regression - P(y|x) = 1 / (1 + e^{-y w^T x}) 4. SVM - hard margin assumes the data is linearly separable. Both LR and SVM assume binary labels.

Identify the following classifiers as generative or discriminative. 1. kNN 2. SVM 3. Logistic regression 4. Perceptron 5. Naive Bayes

All but Naive Bayes are discriminative. NB is generative.

What is a downside of kNN as the data set size gets very large?

Applying kNN gets very slow, since every prediction must compute the distance to all n training points (and storing them all becomes expensive). In high dimensions the points also become nearly equidistant (the curse of dimensionality).

What happens in Adaboost if two training inputs (in a binary classification problem) are identical in features but have different labels?

Both points will obtain very high weights and eventually will dominate the training data set. Weak learners will no longer be able to classify the weighted data set with greater than 50% accuracy and the algorithm will stop.

Say you run a classifier but one of the features is no longer available. Why would you prefer to use kNN over SVM or logistic regression?

During testing with kNN, you can compute distance with all other features and drop the missing one on the fly.

Why is the test error typically higher than the validation error?

Each time you train on the training set and evaluate on the validation set, you pick the model that works best on the validation set, so the validation error is an optimistically biased (too low) estimate. The test set was never used for model selection, so its error is typically higher.

The kNN algorithm can only be used for classification but not regression. True or false?

False, kNN can be used for regression by averaging the labels of the k nearest neighbors.
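A minimal sketch of kNN regression with Euclidean distance (the function and variable names here are illustrative, not part of the card):

    import numpy as np

    def knn_regress(X_train, y_train, x_query, k=5):
        # Predict a real-valued label by averaging the labels of the k nearest neighbors.
        dists = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distance to every training point
        nearest = np.argsort(dists)[:k]                     # indices of the k closest points
        return y_train[nearest].mean()                      # average their real-valued labels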

True or false: an effective way to fight a high bias scenario is to add training data.

False. Adding training data mainly reduces variance; the training error typically increases while the high bias remains. To fight high bias, use a more complex model or add features.

k means clustering and PCA require labelled data. True or false?

False. Both are examples of unsupervised learning, which seeks to find structure in unlabelled data.

Generalization error is just another word for validation error. True or false?

False. Generalization error is the expected error on new data from the same data distribution. Validation error is the empirical error on the validation set.

The order of the training points can affect the convergence of the gradient descent algorithm. True or false?

False. Gradient descent depends on a sum across training samples, which is independent of order.

True or false: in a linearly separable dataset, both linear SVM with hard and soft constraints will return the same solution.

False. Hard-constraint SVM will never misclassify points, whereas the soft-constraint SVM might, depending on C (larger C penalizes slack more heavily, so the solution is closer to the hard-margin one, and vice versa).
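For reference, one standard way to write the soft-margin objective the C above refers to (with slack variables ξ_i):

\[
\min_{w,b,\xi}\ \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad y_i(w^\top x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0.
\]

As C grows, slack is penalized more heavily; on separable data the solution approaches the hard-margin one, while for finite C some margin violations may be traded for a larger margin.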

If a data set is linearly separable, the Perceptron is guaranteed to converge in a finite number of updates. Otherwise, it sometimes converges, but there are no guarantees. True or false?

False. If the data is not linearly separable, the Perceptron will not converge.

True or false: in order for the gradient descent to converge, the loss function has to be convex and differentiable everywhere.

False. If the function is not convex, gradient descent can still converge, just to a local minimum rather than the global one.

The Bayes optimal error is the best classification you could get if there were no noise. True or false?

False. It is the best classification error you could achieve if you knew the true data distribution; the remaining error is exactly the error due to noise.

Increasing regularization tends to reduce the bias of your classifier. True or false?

False. It reduces the variance of your classifier. Often, it can increase bias when set too high.

True or false: the regularization constant should be chosen based on the test data set.

False. It should be chosen based on the validation set.

True or false: logistic regression is a special case of Naive Bayes.

False. LR doesn't assume features are independent given the label. NB is generative, and LR is discriminative.

If the training data set is linearly separable, Naive Bayes will have zero training error. True or false?

False. NB only fits its assumed class-conditional distributions to the data and may sacrifice (misclassify) some points to do so, even when a separating hyperplane exists.

True or false: In the classification setting, if we can't reduce impurity, we stop splitting

False. Take XOR for example: no single split reduces impurity, yet splitting twice classifies the data perfectly, so we should not necessarily stop.

Assume there exists a vector w' that defines a hyperplane that perfectly separates your data. Let the Perceptron vector be w. As the algorithm proceeds, w -> w' and in the limit, we have w = w'. True or false?

False. The Perceptron converges to a separating hyperplane, but it could be different from w' (won't perfectly separate).

If a classifier obtains 0% training error, it cannot have 100% testing error. True or false?

False. The classifier could just memorize the training set (0% training error) and still misclassify every unseen test point.

True or false: k-means is a non-parametric algorithm.

False. There's always one mean vector for each of the k clusters (independent of input size).

True or false: SVM's maximize the margin between the training and testing data.

False. They maximize the margin between the training data and the separating hyperplane.

In MAP, we find the maximizer of the posterior, so we need to find an expression for the posterior. True or false?

False. To maximize P(theta|D) we only need P(D|theta)P(theta), since the evidence P(D) does not depend on theta.
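Concretely, the evidence term is a constant in θ and can be dropped:

\[
\hat{\theta}_{MAP} = \arg\max_{\theta} P(\theta \mid D)
= \arg\max_{\theta} \frac{P(D \mid \theta)\,P(\theta)}{P(D)}
= \arg\max_{\theta} P(D \mid \theta)\,P(\theta).
\]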

True or false: variance can only be reduced by picking a different model or by adding a regularizer term.

False. You can also try bagging, early stopping, or adding more training data.

If you were to use the "true" Bayesian way of machine learning, you would put a prior over the possible models and draw one randomly during training. True or false?

False. You would average predictions over all models weighted by their posterior probability (in practice, by drawing several models), not draw a single one.

True or false: every function k that maps two feature vectors to a non-negative real number is a kernel.

False. k must represent an inner product in some feature space (equivalently, its kernel matrix must be symmetric and positive semi-definite).

True or false: increasing regularization tends to reduce the bias of your classifier.

False; increasing regularization decreases model complexity, which can lead to greater bias.

Under what assumptions does the gradient descent update reduce loss?

Gradient descent assumes the step size is positive and sufficiently small, and that the gradient is nonzero, i.e. g(w)^T g(w) > 0.
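A one-line justification via the first-order Taylor expansion, writing ℓ for the loss and α for the step size (symbols not in the original card):

\[
\ell\big(w - \alpha\, g(w)\big) \approx \ell(w) - \alpha\, g(w)^\top g(w) < \ell(w)
\quad \text{whenever } \alpha > 0 \text{ is small enough and } g(w)^\top g(w) > 0.
\]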

What is a scenario in which Adaboost is not a good choice?

If a data set has a lot of label noise, Adaboost will overfit: the exponential loss keeps increasing the weights of mislabeled points until the ensemble classifies them "correctly".

Why does adding more training data not always help reduce your testing error below a desired threshold?

If your training error is already above the desired threshold, more data won't help: adding data mainly shrinks the gap between training and testing error, and the testing error is (roughly) bounded below by the training error.

Why is Naive Bayes generative? What is its discriminative counterpart?

It estimates P(X, Y) = P(X|Y)P(Y). Logistic regression is the discriminative counterpart of Naive Bayes.
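Concretely, the generative model recovers the posterior through Bayes' rule, whereas logistic regression models P(y|x) directly:

\[
P(y \mid x) = \frac{P(x \mid y)\,P(y)}{\sum_{y'} P(x \mid y')\,P(y')}.
\]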

Does it take more time to train or apply kNN?

It takes more time to apply: training just memorizes the training points, but each prediction must compute the Euclidean distance to all n of them.

Logistic regression on linearly separable data and non-linearly separable data has what combination of low/high bias and variance?

Linearly separable: low bias and low variance. Non-linearly separable: high bias and low variance.

When will Lloyd's algorithm not output the optimal clustering for a given k?

Lloyd's algorithm only improves a clustering locally; it converges to a local optimum rather than the global one and is sensitive to initialization, so a bad initialization can yield a suboptimal clustering for the given k.

Name one advantage of logistic regression over SVM and vice versa.

Logistic regression has a probabilistic interpretation, whereas SVM maximizes the margin and therefore tends to be more robust and generalize well on test data.

Gaussian Naive Bayes has what combination of bias/variance?

Low variance, high bias

Name one advantage of logistic regression over Naive Bayes and vice versa.

Naive Bayes makes assumptions about the distribution of P(x|y) whereas logistic regression doesn't. However, Naive Bayes converges faster to the same solution if those assumptions hold.

Why does linear regression always converge after a single Newton step?

Newton's Method minimizes a local quadratic approximation built from the Hessian. Linear regression's squared loss is exactly quadratic, so the approximation is exact and a single Newton step lands on the minimum.
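A short derivation, writing X for the design matrix and y for the targets (notation assumed, not from the card):

\[
\ell(w) = \|Xw - y\|^2, \qquad \nabla \ell(w) = 2X^\top(Xw - y), \qquad H = 2X^\top X,
\]
\[
w_{\text{new}} = w - H^{-1}\nabla\ell(w) = w - (X^\top X)^{-1}X^\top(Xw - y) = (X^\top X)^{-1}X^\top y,
\]

which is the least-squares solution, regardless of the starting point w.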

PCA solves what two problems given a data set?

PCA finds a projection that maximizes the retained variance and, equivalently, gives a low-dimensional representation of the data with minimal reconstruction error.
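A minimal sketch of PCA via the SVD (function and variable names are illustrative):

    import numpy as np

    def pca(X, n_components):
        # Remove the mean, then project onto the directions of maximum variance.
        Xc = X - X.mean(axis=0)                            # center the data first
        U, S, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt are the principal directions
        return Xc @ Vt[:n_components].T                    # low-dimensional representation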

What is the difference between parametric and non-parametric algorithms?

Parametric algorithms have a fixed number of parameters. Non-parametric algorithms have a number of parameters that grow with the training set size.

Consider kernel (Ridge) regression with the RBF kernel. Why does the interpolant go to zero outside the region where there's data?

The RBF kernel decays exponentially with distance, so far from all training points every kernel evaluation is approximately zero and the prediction sums to zero. RBF regression only works around the region where there is training data (think of it as smart nearest neighbors).

What assumptions do you make for Ridge regression? Name the procedure used to derive the respective loss function/closed form solution.

Ridge regression assumes the regression function is linear with additive Gaussian noise, and that w has a zero-mean Gaussian prior. MAP estimation is used to derive its loss function and closed-form solution.
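One common way to write the resulting objective and closed form, where λ collects the noise and prior variances (notation assumed, not from the card):

\[
\hat{w} = \arg\min_{w}\ \|Xw - y\|^2 + \lambda \|w\|^2, \qquad
\hat{w} = (X^\top X + \lambda I)^{-1} X^\top y.
\]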

Why can stochastic gradient descent be better than traditional (batch) gradient descent when applied to neural networks?

SGD can jump out of local minima more easily since it's noisy.

Why shouldn't you incorporate bias as a constant feature when working with SVM's?

SVMs minimize w^T w because we want the maximum-margin hyperplane, and that requires the bias to move freely; it doesn't make sense to constrain and shrink the bias term by folding it into w as a constant feature.

kNN with small k and large k has what combination of low/high bias and variance?

Small k: low bias and high variance. Large k: high bias and low variance.

Name an advantage of squared loss over absolute loss and vice versa.

Squared loss is differentiable, whereas absolute loss is less sensitive to outliers.

Why do ML practitioners drop the learning rate during training?

Starting out with a large learning rate prevents you from getting stuck in sharp local minima and takes you downhill quickly. Switching to a smaller learning rate then lets you converge to the local minimum nearest the current weights.

Suppose we perform leave-one-out cross validation on the training set and produce an estimate of the real test loss. Give an upper bound on the test loss in terms of n and k.

Test loss < k / n

What are two assumptions that the Perceptron makes?

The Perceptron assumes that the data is linearly separable and binary.

Why does kNN still perform well on high dimensional face images and handwritten digits?

The data is intrinsically low dimensional.

Suppose your features are age, number of years of education, and annual income in US dollars for an individual. Why is the Euclidean distance not a good measure of dissimilarity?

The features are on different scales. You could normalize each feature to [0, 1].
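A minimal min-max scaling sketch (names are illustrative):

    import numpy as np

    def min_max_scale(X):
        # Rescale each feature column to [0, 1] so no single feature dominates Euclidean distance.
        mins = X.min(axis=0)
        ranges = X.max(axis=0) - mins
        return (X - mins) / np.where(ranges == 0, 1, ranges)  # guard against constant columns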

Imagine you apply the kNN classifier with Euclidean distance. Describe what happens if you scale one dimension of the input features by a large positive constant across all samples.

The scaled feature will dominate the distance metric, so kNN will effectively measure distance only in that dimension.

What are some strategies to reduce bias and variance?

To reduce variance: try bagging, reduce model complexity, and add training data. To reduce bias: try boosting, increase model complexity, and add features.

True or false: as your training data set size, n, approaches infinity, the k-NN classifier is guaranteed to have an error no worse than twice the Bayes optimal error.

True

True or false: if a kernelized SVM correctly classifies all points, the dataset is linearly separable in the "feature expansion" space

True

If the Naive Bayes assumption holds, the Naive Bayes classifier becomes identical to the Bayes Optimal Classifier. True or false?

True.

True or false: can you change the noise term in the bias variance trade-off by changing the feature representation of the data?

True. For instance, if all the features are removed, the error is only noise (which would be very large).

True or false: for Adagrad, each feature has its own learning rate.

True. Features whose accumulated gradients are smaller get larger effective learning rates, and features with larger accumulated gradients get smaller ones.
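A minimal sketch of one Adagrad step (names and default values are illustrative):

    import numpy as np

    def adagrad_step(w, g, G, lr=0.1, eps=1e-8):
        # G accumulates squared gradients per coordinate, so coordinates with large
        # past gradients get smaller effective step sizes (and vice versa).
        G = G + g ** 2
        w = w - lr * g / (np.sqrt(G) + eps)
        return w, G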

True or false: For logistic regression, Newton's method will return the global optimum.

True. The loss function of LR is convex.

Why do we remove the mean from the data before computing principal components?

We are trying to find structure within the data (not bulk properties), so we center it first.

Why is the SVM margin exactly 1 / norm(w)?

We can fix the scale of (w, b) arbitrarily without changing the hyperplane, so we rescale until the closest points satisfy |w^T x_i + b| = 1; their distance to the hyperplane is then exactly 1 / norm(w).
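Concretely, the distance from a point x_i to the hyperplane is

\[
\frac{|w^\top x_i + b|}{\|w\|},
\]

so with the rescaling above the margin is exactly 1/||w||.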

Does SVM generalize better when we increase the number of training points? How about when we increase the number of support vectors?

Yes and no. More training points generally improve generalization, but a larger number of support vectors indicates a more complex fit (and a weaker leave-one-out error bound), so generalization tends to get worse.

Explain why as n approaches infinity, theta_MAP approaches theta_MLE for the coin toss (modeled by the binomial distribution).

The prior pseudo-counts alpha - 1 and beta - 1 become negligible compared to the large observed counts nH and nT.
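With a Beta(α, β) prior on the head probability θ:

\[
\hat{\theta}_{MAP} = \frac{n_H + \alpha - 1}{n_H + n_T + \alpha + \beta - 2}
\ \xrightarrow{\ n \to \infty\ }\ \frac{n_H}{n_H + n_T} = \hat{\theta}_{MLE}.
\]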

With Naive Bayes, given c classes/labels, k categories per feature, and +1 smoothing, how many hallucinated samples are you adding to the entire training set?

ck

What is the relationship between kNN and the Bayes Optimal classifier?

As n approaches infinity, err(Bayes Optimal) <= err(1-NN) <= 2 * err(Bayes Optimal).

Name one advantage of kNN over logistic regression and vice versa.

kNN can have nonlinear boundaries whereas logistic regression must be linear. However, LR has a probabilistic interpretation.

