CS 434 Quizzes

The Naive Bayes assumption is that __________. a. the input features are conditionally independent given the class label. b. the input features are distributed according to a Normal distribution. c. the input features are independent. d. P(y|x) is proportional to P(x|y)P(y).

the input features are conditionally independent given the class label.

Explain what a decision boundary is in classification.

A decision boundary is the set of points in input space where a classifier assigns equal confidence to the classes, i.e., where the predicted label switches from one class to another. For example, in logistic regression it is the set of points where w^Tx = 0.
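
As a quick numerical illustration (hypothetical weight vector, not one from lecture), the sign of w^Tx tells us which side of the boundary a point falls on:

```python
import numpy as np

# Hypothetical 2D weight vector; any x with w^T x = 0 lies on the boundary.
w = np.array([1.0, -2.0])
for x in [np.array([2.0, 1.0]),    # w^T x = 0  -> on the decision boundary
          np.array([3.0, 1.0]),    # w^T x > 0  -> predicted as the positive class
          np.array([1.0, 1.0])]:   # w^T x < 0  -> predicted as the negative class
    print(x, np.dot(w, x))
```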

What role does a prior play in MAP inference?

A prior is a way for us to encode beliefs about the values of our parameters _before_ seeing any data.

Optimization Error

Error due to the difficulty of finding optimal models for a dataset during learning. Can be reduced with more computation dedicated to the search.

k Nearest Neighbor

Exemplar classifier with no probabilistic interpretation.

Each value in a probability density function must be less than 1. a. True b. False

False

When deriving logistic regression originally, we set up our task of learning the optimal line w as maximizing the likelihood of our data under a conditional Bernoulli likelihood model. When deriving multiclass logistic regression, we use the same formulation. a. True b. False

False. A Bernoulli likelihood only considers the y label to be 0 or 1. We used a Categorical likelihood which allowed the y label to be from a discrete set corresponding to our classes.

K-Fold Cross Validation can only be applied to k-Nearest Neighbors models. a. True b. False

False. K-Fold Cross Validation can be applied to any learning algorithm.

We can use any function to define a kernel. a. True b. False

False. Only certain functions can define a kernel. Specifically, ones which always result in positive semidefinite Gram matrices.

Computing a quadratic kernel requires quadratic computation in the dimensionality of the input vectors. a. True b. False

False. The quadratic kernel between a and b is (a^Tb+1)^2 and takes only linear time in the dimensionality.
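
A minimal sketch of why this is linear time (numpy, with our own helper name): the kernel only needs one O(d) dot product, even though the implicit feature space it corresponds to is roughly O(d^2)-dimensional.

```python
import numpy as np

def quadratic_kernel(a, b):
    # (a^T b + 1)^2: one O(d) dot product, then a scalar add and square.
    return (np.dot(a, b) + 1.0) ** 2

a = np.array([1.0, 2.0, 3.0])
b = np.array([0.5, -1.0, 2.0])
print(quadratic_kernel(a, b))  # (4.5 + 1)^2 = 30.25
```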

Bayes Error

Irreducible error inherent to the problem. We can't really fix this.

True Negative

Negative example that our model predicts as negative

What is meant by overfitting and underfitting?

Overfitting is when performance on test data is significantly worse than performance on training data. Underfitting is when performance on both train and test is poor.

What are recall and precision?

Recall is the fraction of positive examples correctly predicted as positive. Precision is the fraction of positive predictions that are actually positive examples.
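
A small worked example with made-up counts:

```python
# Made-up counts for illustration.
tp, fp, fn = 8, 4, 2
recall = tp / (tp + fn)       # 8 of 10 actual positives found -> 0.8
precision = tp / (tp + fp)    # 8 of 12 positive predictions correct -> ~0.667
print(recall, precision)
```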

You want to predict whether someone will like a particular music artist given a list of artists they already enjoy. You have a dataset of "liked artist lists", "like this new artist" pairs. This is an example of: a. Unsupervised Clustering b. Unsupervised Dimensionality Reduction c. Supervised Classification d. Supervised Regression

Supervised classification because we are trying to predict an outcome with a finite set of possible values (liked or not liked in this case). We have a dataset of "liked lists" and whether a new artist is liked as supervision.

What does the IID (independently and identically distributed) assumption assume about our data? a. That each data point is identical. b. That data points do not affect other data points and all data points are generated from the same probabilistic mechanism. c. That each data point comes from a different distribution.

That data points do not affect other data points and all data points are generated from the same probabilistic mechanism. The "independent" part implies that data points do not affect each other and the "identically distributed" bit implies they are generated from the same distribution.

Write the following equation as a vector operation involving column vectors x and y: ∑_{i=1}^{d} x_i y_i

This is a dot product and could be written <x, y> or x^Ty
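
For example, in numpy (made-up values):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
print(np.dot(x, y))  # 1*4 + 2*5 + 3*6 = 32
print(x @ y)         # the same dot product, written with the @ operator
```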

Classifying new points for an SVM requires computing dot products between support vectors and the new point. a. True b. False

True. Substituting the definition of the optimal weight vector into w^Tx+b shows this fact.

The Naïve Bayes assumption generally reduces the number of parameters we have to learn in our model. This is because we don't need to learn the full conditional joint distribution P(x1, x2, ..., xd | y) and can instead learn conditionals for individual input dimensions P(xi | y) ∀i. a. True b. False

True. The conditional independence assumption in Naïve Bayes lets us assume P(x1, x2, ..., xd | y) = P(x1 | y) ∗ P(x2 | y) ∗ ... ∗ P(xd | y) so we can then just learn the individual conditional (right) rather than the full conditional joint (left).
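
As a rough parameter count (assuming binary features and a binary label purely for illustration): the full conditional joint over d binary features needs about 2(2^d - 1) parameters, while Naive Bayes needs only about 2d.

```python
d = 10                        # number of binary input features (illustrative)
full_joint = 2 * (2**d - 1)   # a table over all 2^d feature combinations per class: 2046
naive_bayes = 2 * d           # one Bernoulli parameter per feature per class: 20
print(full_joint, naive_bayes)
```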

Parameters are parts of a model learned from data. Hyperparameters are settings for a model that are (generally) chosen by the machine learning practitioner. a. True b. False

True. Hyperparameters are set by the ML practitioner. Parameters are learned from data.

We can use kernel functions for any machine learning algorithm based only on dot products between feature vectors. a. True b. False

True. Kernels are a very general idea. We showed them for SVMs and Perceptron.

Which of the following algorithms can handle multiclass classification? a. k Nearest Neighbors b. Standard Logistic Regression c. Multinomial Logistic Regression d. Naive Bayes

k Nearest Neighbors and Naive Bayes can handle multiple classes in their standard definitions. However, for logistic regression we had to go through a new derivation in order to handle multiclass problems.

List three hyperparameters for the k-Nearest Neighbor algorithm.

k, the distance function, and the weighting function.
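
A toy sketch of where those hyperparameters appear in a prediction (our own minimal implementation; the unweighted majority vote plays the role of a uniform weighting function):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3,
                dist=lambda a, b: np.linalg.norm(a - b)):
    # k and the distance function are passed in; the (uniform) weighting
    # is the unweighted majority vote below.
    dists = [dist(xi, x) for xi in X_train]
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]
```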

Both discriminative and generative classifiers make decisions according to P(y | x) (where y is the output and x is the input). However, discriminative models learn _________, while generative models learn _____________.

1. P(y | x) directly 2. P(x | y) and P(y), then apply Bayes' Theorem. Generative models use Bayes' Theorem and make decisions according to argmax_y P(x | y)P(y).

Given random variables X, Y, and Z -- what does it mean for X and Y to be conditionally independent given Z? (Hint: Review the height/vocabulary/age example in lecture)

Conditioned on the value of Z, X and Y do not provide additional information about each other. That is to say, P(X, Y | Z) = P(X | Z) P(Y | Z).

Modelling Error

Error from a mismatch between our hypothesis set and the real function. Can be reduced with more expressive model classes.

Estimation Error

Error from learning a model from a finite dataset. Can be reduced with more data or with more data-efficient algorithms.

Linear regression by minimizing the sum of squared errors (SSE) is robust to outliers. a. True b. False

False. As we showed in lecture, a single outlier point is sufficient to dramatically change the solution to ordinary least squares.

The standard perceptron learning algorithm is guaranteed to converge even for non-linearly separable data. a. True b. False

False. It will continue to oscillate as zero error will never be reached.

Likelihood =

Likelihood = P(data|parameters)

Logistic Regression

Linear classifier with a probabilistic interpretation.

Perceptron

Linear classifier with no probabilistic interpretation.

Linear Regression

Linear regression model with a probabilistic interpretation.

What is Maximum Likelihood Estimation (MLE)?

Maximum Likelihood Estimation (MLE) is a way to fit parameters of a probabilistic model to data. In MLE, we assume some generative model of our data (i.e., a probabilistic model of how the data is produced) and then find parameters for that model that maximize the likelihood of our observed data. MLE is a general technique and we've now seen it applied to binary random variables (with a Bernoulli assumption), continuous values (with a Normal assumption), and in linear regression (a conditional Normal assumption).
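
A tiny worked example of the standard Bernoulli case (made-up coin flips): the MLE works out to the sample mean.

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 1, 0])  # made-up flips, 1 = heads
theta_hat = x.mean()  # maximizes sum_i [x_i log(theta) + (1 - x_i) log(1 - theta)]
print(theta_hat)      # 0.625
```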

False Positive

Negative example that our model predicts as positive

K-Nearest Neighbors is referred to as a _________ algorithm because no parameters are learned during training. a. Instance-based b. Non-Parametric c. Exemplar

Non-parametric, as it does not learn any parameters.

False Negative

Positive example that our model predicts as negative

True Positive

Positive example that our model predicts as positive

Posterior =

Posterior = P(parameters | data)

Prior =

Prior = P(parameters)

Both the SVM primal and dual formulations can be solved using __________. a. Gradient Descent b. Quadratic Program Solvers c. Matrix Inverse

Quadratic Program Solvers

What is regularization?

Regularization is a way to express prior belief about what the parameters of a model should be. It is typically used to express a belief that the model should be as simple as possible. There is a trade-off between the training error of a model and the simplicity of the model.

When deriving MLE estimates in lecture, we frequently would write out the likelihood as: L(θ) = P(D | θ) = ∏_{i=1}^{n} P(x_i | θ). What assumption allows us to write the probability of our dataset as a product of the probabilities of each data point? a. The IID Assumption b. The Gaussian Assumption c. The Linearity of Expectation

The IID assumption. Independence between datapoints lets us write the joint probability of the dataset as the product of independent probabilities. If A and B are independent random variables, P(A,B) = P(A)P(B)
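
In practice we also take logs, which turns this product into a sum, log L(θ) = ∑_{i=1}^{n} log P(x_i | θ), and that sum is what we differentiate when maximizing.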

What is a hypothesis set? a. The set of possible functions for a machine learning algorithm b. A set of likely outputs for an input c. A set of hypotheses we have about which hyperparameters will work well for a problem

The set of possible functions for a machine learning algorithm. For instance, all possible weights for a linear regression model.

What do the slack variables accomplish in the soft-margin SVM formulation?

The slack variables allow for violations of the margin in the constraints and are then penalized in the objective to minimize them.

What would the softmax function output for the input vector [5, -2, 10]^T?

[0.007, 0.000, 0.993]
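
This can be checked numerically (numpy sketch):

```python
import numpy as np

z = np.array([5.0, -2.0, 10.0])
softmax = np.exp(z - z.max()) / np.exp(z - z.max()).sum()  # shift by max for stability
print(np.round(softmax, 3))  # [0.007 0.    0.993]
```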

Bayes error is the error due to ______________. a. training a model from a finite dataset. b. the difficulty of optimization. c. inherent uncertainty in the problem. d. our model not being expressive enough.

inherent uncertainty in the problem. If two examples look identical except they have different outputs, no model can get them both correct.

In lecture, we fit the parameters of the logistic regression function by minimizing the _________________. a. log posterior by setting the derivative equal to zero. b. log prior by using gradient ascent. c. log likelihood by setting the derivative to zero. d. negative log likelihood using gradient descent.

negative log likelihood using gradient descent. When we computed the gradient of the negative log likelihood for the logistic regression model, we were met with a system of non-linear equations that didn't let us write a closed-form solution. Instead, we used the gradient expression to perform gradient descent to minimize the negative log likelihood.

When does σ(w^Tx) equal 0.5? And what does that imply about the decision boundary of a logistic regression model? Hint: σ is the logistic function.

σ(w^Tx) equals 0.5 when w^Tx equals 0. This suggests that logistic regression has a decision boundary (the set of locations where both classes are equally likely) defined by the line where w^Tx=0.
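
A one-line numerical check of σ(0) = 0.5:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
print(sigmoid(0.0))  # 0.5 -- points with w^T x = 0 sit exactly on the boundary
```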

What is a confusion matrix and what can you learn about a classifier by looking at one?

Confusion matrices keep track of a classifier's predictions relative to the true value of its inputs. For instance, the i, j'th entry in a confusion matrix counts the number of times an instance of class i is labeled class j by the classifier. Looking at a confusion matrix can show you which classes a classifier commonly mixes up.
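
A toy 3-class example with made-up counts, where rows are true classes and columns are predicted classes:

```python
import numpy as np

conf = np.array([[50,  2,  3],
                 [ 4, 40, 11],
                 [ 1,  9, 45]])
# The off-diagonal entry at row 1, column 2 (11) shows class 1 is often
# confused with class 2; the diagonal holds the correct predictions.
accuracy = np.trace(conf) / conf.sum()
print(accuracy)  # ~0.818
```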

Finding the optimal weight vector for a hard-margin SVM requires solving an unconstrained optimization problem. a. True b. False

False. The optimization includes constraints for the data to be correctly classified.

When the slack penalty hyperparameter C is set to zero, the soft-margin SVM reverts to a hard-margin SVM. a. True b. False

False. When the penalty is set to zero, the margin can grow infinitely large as incorrect predictions are not penalized in the objective.

The SVM dual formulation shows us that the optimal weight vector always depends on all data points. a. True b. False

False. While the weight vector is a weighted combination of all training points, points not within or on the margin have zero weight.

If a prior distribution A is a conjugate prior to likelihood distribution B, what can I say about the posterior distribution B*A?

The posterior will be in the same family of distributions as A -- which is the definition of a conjugate prior.

Tree classifiers, one-vs-all classifiers, and all-vs-all classifiers are schemes to let binary classification models work for multiple classes. a. True b. False

True. We can use any binary classifier in constructing one of these tree / one-vs-all / all-vs-all classifiers to make multiclass predictions.

Linear regression by minimizing the sum of squared error is equivalent to maximizing the likelihood of data under a linear model with Gaussian noise. a. True b. False

True. We spent the second half of the lecture proving this.

In supervised learning, each data point in the training set has both an input representation and a known output. In unsupervised learning, no known outputs are provided. a. True b. False

True. Supervised learning assumes the data comes with correct output values annotated.

A hard-margin linear SVM only has a solution when the data is linearly separable. a. True b. False

True. The constrained optimization only considers solutions where all data points are correctly classified.

How can our techniques for linear regression be used to fit non-linear functions of the input?

We can apply basis functions to augment our data with non-linear transforms. This results in the data matrix gaining additional columns. Running standard linear regression on this new augmented data matrix works fine because the output is still a linear function of the now non-linear representation.
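
A minimal numpy sketch under a made-up quadratic example: augment x with polynomial basis columns, then run ordinary least squares on the augmented matrix.

```python
import numpy as np

# Made-up 1D data roughly following a quadratic.
x = np.linspace(-2, 2, 50)
y = 1.0 + 0.5 * x - 2.0 * x**2 + 0.1 * np.random.randn(50)

# Basis expansion: the data matrix gains columns [1, x, x^2].
X = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares on the augmented data matrix.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)  # approximately [1.0, 0.5, -2.0]
```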

Over all possible datasets generated from the true model, an estimator with high bias but low variance will _______________. a. Produce very different values for different datasets but will be correct on average. b. produce similar estimates for most datasets, but will not perfectly predict the true parameters. c. Consistently produce low-error estimates.

produce similar estimates for most datasets, but will not perfectly predict the true parameters. Bias refers to how wrong the provided estimate is on average. Variance refers to how much the predicted estimate varies across different datasets.

