CS 4780 True False

For gradient descent, higher learning rates guarantee faster convergence times

False, higher learning rates can lead to divergence; see the sketch below
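A minimal sketch (my own illustration, not from the course) of this effect on f(w) = w^2: a small learning rate shrinks w toward the minimum at 0, while a rate above 1 makes the iterates grow in magnitude.

```python
# Gradient descent on f(w) = w^2, whose gradient is 2w.
# With lr = 0.1 the iterate decays toward 0; with lr = 1.1 its magnitude grows.
def gradient_descent(lr, steps=20, w=1.0):
    for _ in range(steps):
        w = w - lr * 2 * w   # update: w <- w - lr * f'(w)
    return w

print(gradient_descent(lr=0.1))   # ~0.01, converging
print(gradient_descent(lr=1.1))   # ~38 and growing with more steps, diverging
```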

For text data, categorical NB outperforms multinomial NB if n is small

False

Showing that w'w* always increases implies convergence

False, the norm of w could be growing as well; the convergence proof also needs an upper bound on the norm of w

The Beta-distribution is the conjugate prior of the Gaussian

False

The Naive Bayes assumption holds rarely, but is true for text documents

False

The kNN assumption never holds in high dimension

False

When the NB assumption does not hold, the algorithm cannot work

False

For Adagrad, we use the same learning rate for all features

False, Adagrad uses different automatically chosen learning rates for each feature
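A minimal sketch of a per-feature Adagrad-style update (the function and variable names here are mine, not the course's): each coordinate divides the base step size by the square root of its own accumulated squared gradients, so frequently updated features get smaller effective learning rates.

```python
import numpy as np

def adagrad_step(w, grad, hist, lr=0.1, eps=1e-8):
    """One Adagrad update: effective per-feature learning rate lr / sqrt(hist)."""
    hist = hist + grad ** 2                      # accumulate squared gradients per feature
    w = w - lr * grad / (np.sqrt(hist) + eps)    # features with large history move less
    return w, hist

w, hist = np.zeros(3), np.zeros(3)
w, hist = adagrad_step(w, grad=np.array([1.0, 0.1, 0.0]), hist=hist)
```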

Although MLE and MAP are different approaches to set model parameters theta, both of them consider theta to be random variables

False, only MAP treats theta as a random variable; MLE does not

A classifier trained on less training data is less likely to overfit

False, a small training data set is more likely to differ from the true distribution of the real data, causing overfitting

As the amount of training data increases, the training error goes down

False, as the amount of training data increases it becomes harder for the model to classify all of the points correctly, so training error tends to go up

Estimating P(y|x) directly is hard if x lies in low dimensional space

False, estimating it directly is easy in low dimensional space; it is in high dimensional space that it becomes difficult

If a classifier obtains 0% training error, it cannot have 100% testing error

False, extreme overfitting can result in this occurring

The order of the training points can affect the convergence of the gradient descent algorithm

False, gradient descent just depends on a sum across the training examples, which is independent of order

In order for gradient descent to converge, the loss function has to be convex and differentiable everywhere

False, if it is not convex, it will still converge, just to a local minimum

The perceptron always finds a separating hyperplane

False, if the data is not linearly separable, it will not
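A minimal sketch of the perceptron update loop (assuming labels in {-1, +1}; the names are mine): if the data is linearly separable the inner loop eventually makes no mistakes and the algorithm stops, otherwise it keeps cycling.

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """X: (n, d) array of inputs, y: labels in {-1, +1}. Returns a weight vector."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:   # misclassified (or exactly on the boundary)
                w += yi * xi              # perceptron update
                mistakes += 1
        if mistakes == 0:                 # a separating hyperplane has been found
            return w
    return w                              # no separator found within max_epochs
```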

When you split your data into train and test you have to make sure you always do the splitting uniformly at random

False, if there is a temporal component to your data you usually split by time

Newton's method diverges only if the Hessian matrix is not invertible

False, it can also diverge with invertible Hessian matrices

Logistic regression can be used with MLE but not MAP

False, it can be used with both

Naive Bayes can be used with MLE but not MAP

False, it can be used with both

During training, in a linearly separable data set, the perceptron algorithm never misclassifies the same input twice

False, it can iterate many times over the data set and get the same points wrong repeatedly

Newton's Method always converges but can be slow

False, it does not always converge

The fully Bayesian approach is never feasible

False, it is feasible in some cases, for example when a conjugate prior makes the posterior integral tractable

The higher dimensional the data, the less likely the perceptron is to converge

False, it is more likely that a separating hyperplane exists if the data is high dimensional, as points tend to be further apart

You have a biased coin and toss it n times, the MAP estimate w/ +1 smoothing of the probability of getting "head" is (num heads + 1) / (num total + 1)

False, it is (num heads + 1) / (num total + 2)
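As a worked equation, +1 smoothing corresponds to a Beta(2, 2) prior; with n_H heads out of n tosses the MAP estimate is

```latex
\hat{\theta}_{\mathrm{MAP}}
  = \arg\max_{\theta} P(\theta \mid D)
  = \frac{n_H + 1}{n + 2},
\qquad\text{whereas}\qquad
\hat{\theta}_{\mathrm{MLE}} = \frac{n_H}{n}.
```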

The Bayes optimal error is the best classification error you could get if there was no noise

False, it is the best classification error you could get if you knew the data distribution. This error is due to label uncertainty, which is actually noise

The kNN classifier assumes that data is high dimensional

False, kNN implicitly assumes low dimensional data; it breaks down in high dimensions

MLE stands for Maximum Likelihood Expectation

False, it stands for Maximum Likelihood Estimator

If no separating hyperplane exists, the perceptron may or may not converge

False, it will not converge

The error of the 1-NN classifier is at least as bad as 2x the error of the Bayes Optimal classifier, as n -> infinity

False, it's at most twice the Bayes optimal error

Logistic regression is a special case of Naive Bayes classifier

False, logistic regression does not assume features are independent given the label

Linear and Logistic Regression are both used for regression

False, logistic regression is used for classification, linear regression is used for regression

The perceptron and logistic regression both run forever when the data is not linearly separable

False, logistic regression will terminate

The ML algorithm takes in data and predicts features

False, machine learning takes in data in the form of features, and predicts labels on those features

Making assumptions in ML is "cheating"

False, the no free lunch theorem states that every machine learning algorithm has to make assumptions

If ML is done right, the data scientist has to make no choices

False, the no free lunch theorem states that every machine learning algorithm has to make assumptions

Regression is the setting with only two classes

False, in regression the label space is the set of all real numbers; the setting with two classes is binary classification

If the features are probabilistically dependent on each other, then the naive Bayes assumption cannot hold

False, the features could be conditionally independent, given the label

Generalization error is just another word for validation error

False, the generalization error is the expected error obtained on new data drawn from the same data distribution, the validation error is the empirical error on the validation set

The zero one loss can be minimized with the gradient descent algorithm

False, the zero one loss is piecewise constant, so its gradient is 0 almost everywhere (and undefined at the jumps); gradient descent gets no signal to move toward the minimum. See the equation below.
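As a short display equation (my notation, not from the set): the zero one loss is piecewise constant, so its gradient carries no descent direction:

```latex
\ell_{0/1}(\mathbf{w})
  = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\!\left[\operatorname{sign}(\mathbf{w}^{\top}\mathbf{x}_i) \neq y_i\right],
\qquad
\nabla_{\mathbf{w}}\,\ell_{0/1}(\mathbf{w}) = \mathbf{0}
\ \text{almost everywhere (undefined at the jumps).}
```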

In MLE, the parameter we want to learn is the random variable

False, the model parameter does not have a distribution associated with it

A "true bayesian" approach of learning will learn a point estimate of the model parameter

False, the model parameter needs to be integrated out

Assume there exists a vector w' that defines a hyperplane that perfectly separates your data. Let the perceptron vector be w, as the algorithm proceeds, the two vectors must converge, and in the limit, we have w = w'

False, the perceptron converges to a separating hyperplane, but it could be a different one than w'

In high dimensions randomly drawn points are all close

False, in high dimensions randomly drawn points tend to be far apart and nearly equidistant; this curse of dimensionality is what hurts kNN performance

One implication of the curse of dimensionality is that if you sample n data points uniformly at random within a hypercube of dimensionality d, all pairwise distances converge to 0 as n -> infinity

False, pairwise distances become concentrated around the average distance as d -> infinity; as d increases, points move further apart, not closer together. See the simulation sketch below.
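A minimal simulation sketch (assuming numpy; the helper name is mine): as d grows, the spread of pairwise distances between uniform points in the hypercube shrinks relative to their mean, i.e. the points become nearly equidistant rather than arbitrarily close.

```python
import numpy as np

def distance_spread(n=200, d=2, seed=0):
    """Std/mean of pairwise distances for n points drawn uniformly from [0, 1]^d."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n, d))
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(-1))
    dists = dists[np.triu_indices(n, k=1)]   # keep each pair once
    return dists.std() / dists.mean()

print(distance_spread(d=2))      # noticeable spread in low dimensions
print(distance_spread(d=1000))   # ratio near 0: distances concentrate around their mean
```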

As n -> infinity, the error of the perceptron classifier is at most twice the error of the Optimal Bayes Classifier (where n is the number of training samples)

False, this is true for 1-NN, not the perceptron

Logistic regression always produces a better result than with Naive Bayes

False, when n is small Naive Bayes will outperform logistic regression

In practice, people often create Train and Validation sets out of their original data D. Each point x in D, is placed either into train, validation, or both

False, you cannot place a point into both

If you were to use the "true" Bayesian way of machine learning you would put a prior over the possible models and draw one randomly during training

False, you would integrate over all possible models

The prior distribution is my belief of P(y|x) before I see data

False, the prior distribution is P(theta), my belief over the parameters theta before seeing any data

Adagrad excels when feature sparsity patterns vary

True

As the validation set becomes extremely large, the validation error approaches test error

True

Averaged across all data sets, all hypothesis classes are equally "bad"

True

Bayesian statistics can elevate parameters to random variables

True

For AdaGrad, each feature has its own learning rate

True

For binomial distributions, as n -> infinity, MLE becomes exact

True

For the perceptron learning algorithm, the order of data points does affect the number of misclassifications of each point

True

For text data categorical NB performs like multinomial NB as n -> infinity

True

Gradient descent takes the (locally) steepest step possible

True

If MAP is used with a prior that has non-zero support everywhere, it will converge to the same result as MLE as n -> infinity (where n is the number of training samples)

True

If data is drawn uniformly at random within a hypercube, increasing the dimensionality makes the data points more equidistant

True

If the Naive Bayes assumption holds, the Naive Bayes classifier becomes identical to the Bayes Optimal classifier

True

In high dimensions randomly drawn points are likely relatively close to an arbitrary hyperplane

True

In the binomial case MAP essentially "hallucinates" samples

True

In the categorical Naive Bayes each feature is modeled with its own "coin"

True

Instead of the likelihood P(Data|Theta) we can maximize the log likelihood

True

Logistic regression always produces a linear classifier

True

MAP estimation with a Beta prior to estimate the probability of a coin leading to heads, is identical to "hallucinating" coin toss results

True

MAP inference maximizes p(w|Data) whereas MLE maximizes P(Data;w) where w represents the model parameters

True

MAP returns the most likely parameters given the data and your prior

True

ML maximizes log[p(y|x;theta)] and MAP maximizes log[P(y|x, theta)] + log[P(theta)]

True
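Written out as display equations (standard notation, my rendering rather than the course's), the two objectives differ only by the log-prior term:

```latex
\hat{\theta}_{\mathrm{MLE}}
  = \arg\max_{\theta} \sum_{i=1}^{n} \log P(y_i \mid \mathbf{x}_i ;\, \theta),
\qquad
\hat{\theta}_{\mathrm{MAP}}
  = \arg\max_{\theta} \left[ \sum_{i=1}^{n} \log P(y_i \mid \mathbf{x}_i,\, \theta) + \log P(\theta) \right].
```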

MLE returns the parameters that make the data most likely

True

MLE works great for large n, but is bad for small n

True

Multinomial NB uses one "die" and each side corresponds with a feature

True

Naive Bayes holds when features are independent given the label

True

Newton's Method involves a matrix inversion

True
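For reference, a standard form of the Newton update (my notation, not copied from the course notes), which multiplies the gradient by the inverse of the Hessian:

```latex
\mathbf{w}_{t+1}
  = \mathbf{w}_t - H^{-1}\,\nabla \ell(\mathbf{w}_t),
\qquad
H = \nabla^{2} \ell(\mathbf{w}_t).
```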

One advantage of Adagrad is that each dimension effectively has its own separate learning rate, which is set automatically

True

Scaling the step-size by the inverse Hessian diagonal can be very effective in practice

True

The "fully Bayesian" approach marginalizes out all possible models

True

The Gaussian is the conjugate prior of the Gaussian

True

The k-NN classifier is not a linear classifier

True

The loss function measures the quality of a specific hypothesis in the hypothesis class

True

The multinomial Naive Bayes algorithm is a linear classifier

True

The perceptron converges whenever a separating hyperplane exists

True

With a good stepsize GD always converges but can be slow

True

Increasing k makes the decision boundary more smooth

True, as k increases the prediction averages over more neighbors, which smooths the decision boundary

The order of the training points can affect the training time of the perceptron algorithm

True, certain point orders will allow the algorithm to converge earlier or later than other orders

As n becomes very large, kNN becomes very accurate

True, it becomes more accurate

As n becomes very large, kNN becomes very slow

True, it gets slower because it must compute distances to all n training points at test time

The perceptron can only be used for classification

True, it requires binary labels in {-1, +1}

Logistic regression is a discriminative algorithm

True, it's a discriminative model, not a generative one

For logistic regression, Newton's method will return the global optimum

True, the logistic regression loss is convex

Let w* be a separating hyperplane, then w'w* increases with every update

True, each update adds y x to w, and y (x . w*) > 0 because w* separates the data, so the dot product w'w* grows with every update. See the derivation below.
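A one-line sketch of the standard argument (with gamma denoting the margin of w* on the training data): each update on a misclassified pair (x, y) adds at least gamma to the dot product:

```latex
\mathbf{w}_{t+1}^{\top}\mathbf{w}^{*}
  = (\mathbf{w}_t + y\,\mathbf{x})^{\top}\mathbf{w}^{*}
  = \mathbf{w}_t^{\top}\mathbf{w}^{*} + y\,(\mathbf{x}^{\top}\mathbf{w}^{*})
  \;\geq\; \mathbf{w}_t^{\top}\mathbf{w}^{*} + \gamma,
\qquad
\gamma = \min_i\, y_i\,(\mathbf{x}_i^{\top}\mathbf{w}^{*}) > 0.
```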

