Machine Learning Exam Qs


Undersampling and oversampling are ways to deal with imbalanced classes. Which is true?

- A: Oversampling leads to duplicate instances in your data.
- Oversampling is appropriate when the data scientist does not have enough instances of the rare class, so the number of rare events is increased by duplicating them.
- Undersampling: typically, events in the majority class are randomly deleted until it contains the same number of events as the minority class.
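
As a rough illustration, here is a minimal numpy sketch of random oversampling and undersampling; the names X, y and the helper functions are assumptions for illustration, not part of the original question.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_oversample(X, y, minority_label):
    """Duplicate randomly chosen minority instances until the classes are balanced."""
    minority = np.where(y == minority_label)[0]
    majority = np.where(y != minority_label)[0]
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    idx = np.concatenate([majority, minority, extra])   # contains duplicate instances
    return X[idx], y[idx]

def random_undersample(X, y, minority_label):
    """Randomly delete majority instances until the classes are balanced."""
    minority = np.where(y == minority_label)[0]
    keep = rng.choice(np.where(y != minority_label)[0], size=len(minority), replace=False)
    idx = np.concatenate([keep, minority])
    return X[idx], y[idx]
```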

What is the difference between classification and regression?

- Classification predicts an item from a finite set, regression predicts a numeric value. (BOTH are Supervised Learning)

Imagine a machine learning task where the instances are customers. You know the phone number for each customer and their occupation (one of seven categories). You're wondering how to turn these into features. Which is TRUE?

- For some models, you may want to turn the occupation into several numeric features.
- Whether to use the occupation directly or turn it into a numeric feature depends on the model.
- You can extract several useful categoric features from the phone number.

The bias/variance decomposition is a helpful tool. Which are TRUE:

- If we know that we have high variance, we can use ensembling methods to reduce it.
- We can estimate the bias/variance decomposition of our error if we have multiple datasets, or a way to resample data.
- If we know that we have high bias, we can use ensembling methods to reduce it.

least-squares loss (loss function)

- Assign -1 and 1 to the points and treat the problem as regression.
- In practice this doesn't usually work very well, and it is rarely used.

log loss (loss function)

- Requires a sigmoid function to be added to the output of the linear function. It assumes that the result is the probability of the positive class applying to the instance, and it maximizes the log likelihood of the classes given the model parameters.
- Practically, this boils down to minimizing the negative log likelihood of the correct class.
- This can also be derived from the cross-entropy between the true class distribution given by the data and the class distribution given by the model.
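
A minimal sketch of this loss for a linear model, assuming numpy arrays X and y (with y in {0, 1}) and weights w, b; these names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(w, b, X, y):
    """Negative log likelihood of the correct class under a linear model."""
    q = sigmoid(X @ w + b)   # sigmoid turns the linear output into P(positive class)
    return -np.mean(y * np.log(q) + (1 - y) * np.log(1 - q))
```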

zero one loss / the error (loss function)

- simply the number or proportion of misclassified examples. It's usually what we're interested in, but it doesn't give us a loss surface that is suitable for searching.

Imagine we have a naive Bayes classifier. In our dataset we have two binary features (categorical with two possible values) and two classes. How many pseudo-observations do we need to add if we want to apply Laplace smoothing?

4

Some models are built on the Markov assumption. What do we mean by this?

A word is conditionally dependent only on a finite number of words preceding it.

If somebody says: "There is a high probability that the mean height of Italian women is below 2 meters." Which is true?

A: A strict frequentist would consider this an improper use of the term probability.
- There is no probability associated with the true mean at all. It is simply an objective, determined value (which we don't know). The probability comes from sampling, and from computing the interval from a sample.

By what method do variational autoencoders avoid mode collapse?

A: By learning the latent representation of an instance through an "encoder" network.
- Benefit of autoencoders: we can train them on unlabeled data (which is cheap) and then use only a very small number of labeled examples to "annotate" the latent space. In other words, autoencoders are a great way to do semi-supervised learning.

How does dropout help with the overfitting problem?

A: By randomly disabling nodes in a neural network, to eliminate solutions that require highly specific configurations.
- Dropout is a very different regularization technique for large neural nets. During training, we simply remove hidden and input nodes (each with probability p) by setting their values to zero.
- Memorization (aka overfitting) often depends on multiple neurons firing together in specific combinations. Dropout prevents this by randomly turning them off.
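
A minimal sketch of a dropout layer (the "inverted dropout" variant, in which surviving activations are rescaled so their expected value stays the same); the names h, p and the function itself are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p=0.5, training=True):
    """Set each activation in h to zero with probability p during training."""
    if not training:
        return h                       # at test time, dropout is a no-op
    mask = rng.random(h.shape) >= p    # keep each node with probability 1 - p
    return h * mask / (1.0 - p)        # rescale so the expected activation is unchanged
```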

What problem, if it exists for a single model, cannot be solved by training an ensemble of such models?

A: High training time.
- Problems that CAN be addressed by training an ensemble: high variance, overfitting, high bias.

I have a dataset of politicians in the European parliament and which past laws they voted for and against. The record is incomplete, but I have some votes for every law and for every politician. I would like to predict, for future laws, which politicians will vote for and which will vote against. I plan to model this as a recommender system using matrix factorization. What is a possible issue?

A: I would have to deal with the cold start problem, because for the future laws I don't have any voting information.
- Cold start problem: when a new user joins Netflix, or a new movie is added to the database, we have no ratings for them, so the matrix factorization has nothing to build an embedding on.
- In this case we have to rely on implicit feedback and side information to suggest their first movies to users.
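
A rough sketch of the matrix-factorization idea on this voting data, assuming the votes are given as (politician, law, ±1) triples; the names and the plain gradient-descent loop are assumptions, not the course's exact recipe. Note that a future law never appears in the triples, so its embedding is never updated: the cold start problem.

```python
import numpy as np

rng = np.random.default_rng(0)

def factorize(votes, n_politicians, n_laws, k=5, lr=0.01, epochs=100):
    """votes: list of (politician, law, value) with value +1 (for) or -1 (against).
    Learns embeddings U (politicians) and V (laws) so that U[p] @ V[l] predicts the vote."""
    U = rng.normal(scale=0.1, size=(n_politicians, k))
    V = rng.normal(scale=0.1, size=(n_laws, k))
    for _ in range(epochs):
        for p, l, v in votes:
            err = U[p] @ V[l] - v       # squared-error gradient step on both embeddings
            U[p], V[l] = U[p] - lr * err * V[l], V[l] - lr * err * U[p]
    return U, V   # a law with no recorded votes keeps its random V[l]: cold start
```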

We are choosing a new basis for our data. We decide to use an orthonormal basis. What is the advantage of having an orthonormal basis?

A: It ensures that the inverse of the basis matrix is equal to its transpose.
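
A small numpy check of this property; the matrix Q and its construction via a QR decomposition are just for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))   # Q has orthonormal columns

# For an orthonormal basis matrix, the transpose equals the inverse,
# so changing back to the standard basis is just a cheap transpose.
assert np.allclose(Q.T @ Q, np.eye(4))
assert np.allclose(Q.T, np.linalg.inv(Q))
```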

The slides mention two ways to adapt a categoric feature for a classifier that only accepts numeric features: integer coding and one-hot coding. Which is true?

A: One-hot coding always turns one categoric feature into multiple numeric features.
- Integer coding gives us the same problem we had earlier: we are imposing a false ordering on unordered data (so one-hot coding is preferred).
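
For illustration, a small sketch of both codings on a made-up occupation feature (three categories instead of the seven in the question above):

```python
import numpy as np

occupations = ["nurse", "teacher", "farmer", "nurse"]
categories = sorted(set(occupations))

# Integer coding: one numeric feature, but it imposes a false order on the categories.
integer_coded = np.array([categories.index(o) for o in occupations])

# One-hot coding: one categoric feature becomes len(categories) numeric features.
one_hot = np.eye(len(categories))[integer_coded]
print(one_hot)   # each row has a single 1 in the column of its category
```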

Which is (primarily) a supervised machine learning method?

A: Support Vector Machines

In some machine learning settings it is said that we must make a trade-off between exploration and exploitation. What do we mean by this?

A: That an online algorithm needs to balance optimization of its expected reward with exploring to learn more about its environment.

Neural networks usually contain activation functions (non-linearity). What is their purpose?

A: They are applied after a linear transformation, so that the network can learn nonlinear functions.
- To create perceptrons that we can chain together in such a way that the result will be more expressive than any single perceptron could be, the simplest trick is to include a non-linearity, also called an activation function.
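
A minimal sketch of the idea, with assumed weight matrices W1, W2 and biases b1, b2:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)      # a common activation function / non-linearity

def two_layer_net(x, W1, b1, W2, b2):
    h = relu(W1 @ x + b1)          # linear transformation followed by a non-linearity
    return W2 @ h + b2             # without relu, this collapses to a single linear map
```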

The soft margin SVM loss is defined as a constrained optimization objective. We can rewrite this in two ways. Which is true? (also known as hinge loss)

A: We can rewrite using KKT multipliers. This allows us to use the kernel trick.
- Kernel trick: if you have an algorithm which operates only on the dot products of instances, you can substitute the dot product for a kernel function.
- The loss attempts to maximize the margin between the positive and negative points. It's also known as a maximum margin loss, or the hinge loss (since the error is fed through a maximum function, which looks like a hinge).
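
A minimal sketch of the hinge loss itself, assuming labels y in {-1, +1} and a linear model with parameters w, b (names are illustrative):

```python
import numpy as np

def hinge_loss(w, b, X, y):
    """Soft-margin SVM loss: points comfortably on the correct side of the margin
    contribute zero; the rest are penalized linearly (the 'hinge')."""
    margins = y * (X @ w + b)
    return np.mean(np.maximum(0.0, 1.0 - margins))
```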

Which of the following is NOT a method to prevent overfitting?

A: Boosting.
- Methods of preventing overfitting: bagging, dropout, L1 regularization.

We want to represent color videos in a deep learning system. Each is a series of frames, with each frame an RGB image. Which is the most natural representation for one such video?

As a 4-tensor.

One can choose between the likelihood function or the log likelihood function as a loss function. Which is usually preferred, and why?

Both result in a maximum at the same point in model space, but the log-likelihood is often easier to work with.
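
A tiny numerical illustration of why the log likelihood is easier to work with (the per-instance probabilities are made up):

```python
import numpy as np

probs = np.full(10_000, 0.9)              # assumed per-instance likelihoods

likelihood = np.prod(probs)               # underflows to 0.0 for a dataset this size
log_likelihood = np.sum(np.log(probs))    # the product becomes a well-behaved sum

print(likelihood, log_likelihood)         # 0.0 vs. roughly -1054
```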

We are considering using either gradient descent or random search for a problem. Which is true?

For both, which optimum they find depends on the initial starting point.
- Gradient descent works for continuous model spaces (and not for discrete model spaces).
- Random search works for discrete model spaces (such as trees).
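
A minimal sketch of random search (here on a continuous model space; the scoring function, step distribution, and names are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_search(score, start, steps=1000, scale=0.1):
    """Propose a small random step; keep it only if the score improves.
    Needs no gradient, so the same idea also works in discrete model spaces."""
    best = start
    for _ in range(steps):
        candidate = best + rng.normal(scale=scale, size=best.shape)
        if score(candidate) > score(best):
            best = candidate
    return best
```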

You want to search for a model in a discrete model space. Which search method is the least applicable?

Gradient descent

When using gradient descent to train a linear classifier, why don't we use accuracy as a loss function?

It causes the loss landscape to be flat almost everywhere
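
A tiny illustration (the data and weights are made up): small changes to the weights usually flip no predictions, so accuracy gives no gradient to follow.

```python
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0]])
y = np.array([1, 1, 0])

def accuracy(w):
    return np.mean((X @ w > 0).astype(int) == y)

# Nudging the weights leaves the accuracy unchanged: the surface is flat here.
print(accuracy(np.array([1.0, 1.0])), accuracy(np.array([1.01, 1.0])))
```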

Instead of making a training/validation split, we could also use cross-validation to search for good values for our hyperparameters. What is the advantage of this approach?

It is more data-efficient. Each example gets to be used as training data.
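
A minimal sketch of k-fold cross-validation, assuming a scikit-learn-style model object with fit and score methods (these names are assumptions):

```python
import numpy as np

def cross_validate(make_model, X, y, folds=5):
    """Each example is used for validation exactly once and for training in the other folds."""
    idx = np.random.default_rng(0).permutation(len(y))
    scores = []
    for val in np.array_split(idx, folds):
        train = np.setdiff1d(idx, val)
        model = make_model().fit(X[train], y[train])
        scores.append(model.score(X[val], y[val]))
    return np.mean(scores)
```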

We have a logistic regression model for a binary classification problem, which predicts class probabilities q. We compare these to the true class probabilities p, which are always 1 for the correct class and 0 for the incorrect class. The slides mention two loss functions for this purpose: logarithmic loss and binary cross-entropy. Which is true?

Log-loss is equal to the binary cross-entropy H(p, q).
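
A quick numerical check (the predicted probabilities 0.8/0.2 are made up):

```python
import numpy as np

p = np.array([1.0, 0.0])    # true distribution: all mass on the correct class
q = np.array([0.8, 0.2])    # model's predicted class probabilities

cross_entropy = -np.sum(p * np.log(q))   # H(p, q)
log_loss = -np.log(q[0])                 # minus the log probability of the correct class

assert np.isclose(cross_entropy, log_loss)   # identical when p is one-hot
```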

I am training a generator network to generate faces. I take a random sample, compare it to a randomly chosen image from the data, and backpropagate the error. When training is finished, all samples from the network look like the average over all faces in the dataset. What name do we have for this phenomenon?

Mode collapse: the many different modes (areas of high probability) of the data distribution end up being averaged ("collapsing") into a single point.

What is a valid reason to prefer gradient descent over random search?

My model is easily differentiable.

John is training a k-nearest neighbors classifier. He has split his data into a test and training set. To find a good value for k, he tries all values between 3 and 15 and checks the resulting accuracy on the test set. He then chooses the k which gave him the best accuracy for his final model. Is this correct?

No, he should instead use a validation set to search for a good hyperparameter.

What is the relation between an ROC curve and a coverage matrix?

Normalizing the axes of the coverage matrix gives an ROC curve.

Undersampling and oversampling are ways to deal with imbalanced classes. Which is true?

Oversampling leads to duplicate instances in your data.

Which is not a reason that the error is a poor loss function for training classifiers?

The loss takes too many different values.

Imagine a machine learning task where the instances are customers. You know the phone number for each customer and their occupation (one of seven categories). You're wondering how to turn these into features. Which is false?

The phone number is an integer, so you should use it as a numeric feature.

In support vector machines, how is the maximum margin hyperplane criterion (MMC) related to the support vectors?

The support vectors determine the hyperplane that satisfies the MMC.

Which property is common to both logistic regression and support vector machines?

They both focus mostly or only on the points closest to the decision boundary.

The bias/variance decomposition is a helpful tool. Which is false?

We can compute the bias/variance decomposition of our error for a single model trained on a single dataset.

We are training a classification model by gradient descent, and we want to figure out which learning rate to use, before comparing the model to other classifiers. We try five learning rate values, resulting in five different models. How do we choose among these five models?

We measure the accuracy of each model on the validation set.

When training a decision tree on only ______ features, there's no use in splitting again on a feature you've already split on.

categorical

To use ______ ______ on data with numeric features, we must choose a threshold value to split on, for every split.

decision trees

A lazy algorithm is a machine learning method that simply stores the data and refers back to it during evaluation, instead of training to establish a good model which can be stored independent of the data. Which of the following methods is a lazy algorithm?

k-nearest neighbors
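
A minimal sketch of why k-NN is "lazy" (the class and method names are illustrative):

```python
import numpy as np

class KNearestNeighbors:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        self.X, self.y = X, y          # lazy: training just stores the data
        return self

    def predict(self, x):
        dists = np.linalg.norm(self.X - x, axis=1)   # all the work happens at evaluation time
        nearest = np.argsort(dists)[:self.k]
        values, counts = np.unique(self.y[nearest], return_counts=True)
        return values[np.argmax(counts)]             # majority vote among the k nearest
```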

When training a decision tree on ____ features, it can often be useful to split on a feature you've already used before.

numeric

I want to predict prices of stocks, from a set of examples, based on two quantities: average price over the last month and the time of the year (both expressed as numbers). I create a scatterplot with the average price on the horizontal axis and the time of year on the vertical. I plot each stock in my dataset as a point in these axes. What have I drawn?

the feature space

I have a dataset of emails, with target classes ham and spam. I represent each email by three numbers: how often the word hello occurs, how often the word medicine occurs and how often the word meeting occurs. What are these three attributes called?

the features

