CS 4780 True/False
For gradient descent, higher learning rates guarantee faster convergence times
False, higher learning rates can lead to divergence
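
A minimal sketch (using the toy loss f(w) = w^2, which is not from the course; the step sizes are illustrative) of why a learning rate that is too large makes gradient descent diverge while a smaller one converges:

    # Gradient descent on f(w) = w^2, whose gradient is 2w. With step size 1.5 each
    # update multiplies w by -2, so the iterates blow up; with 0.1 they shrink toward 0.
    def gradient_descent(step_size, w=1.0, steps=20):
        for _ in range(steps):
            w = w - step_size * 2 * w   # w <- w - eta * f'(w)
        return w

    print(gradient_descent(0.1))   # ~0.01: converges toward the minimizer w = 0
    print(gradient_descent(1.5))   # ~1e6: diverges, each step overshoots the minimum
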
For text data, categorical NB outperforms multinomial NB if n is small
False
Showing that w'w* always increases implies convergence
False, w'w* can increase simply because the norm of w grows; the convergence proof also needs an upper bound on ||w||
The Beta-distribution is the conjugate prior of the Gaussian
False, the Beta distribution is the conjugate prior of the Binomial; the Gaussian is its own conjugate prior
The Naive Bayes assumption holds rarely, but is true for text documents
False
The kNN assumption never holds in high dimension
False
When the NB assumption does not hold, the algorithm cannot work
False, Naive Bayes often still works well in practice even when the assumption is violated
For Adagrad, we use the same learning rate for all features
False, Adagrad uses different automatically chosen learning rates for each feature
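
A minimal sketch of the Adagrad update under its standard definition; eta and eps below are illustrative constants. Each coordinate divides the global step size by the square root of its own accumulated squared gradients, so frequently updated features take small steps while rare (sparse) features keep large ones:

    import numpy as np

    def adagrad_step(w, grad, grad_sq_sum, eta=0.1, eps=1e-8):
        """One Adagrad step. grad_sq_sum: running per-feature sum of squared gradients (init to zeros)."""
        grad_sq_sum = grad_sq_sum + grad ** 2              # per-feature gradient history
        w = w - eta * grad / (np.sqrt(grad_sq_sum) + eps)  # big history => small effective step size
        return w, grad_sq_sum
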
Although MLE and MAP are different approaches to set model parameters theta, both of them consider theta to be random variables
False, MLE does not; only MAP treats theta as a random variable
A classifier trained on less training data is less likely to overfit
False, a small training data set is more likely to differ from the true distribution of the real data, causing overfitting
As the amount of training data increases, the training error goes down
False, as the amount of training data increases, the training error typically goes up, because it becomes harder for the model to classify every point correctly
Estimating P(y|x) directly is hard if x lies in low dimensional space
False, estimating it directly is easy if it lies in low dimensional space, it's high dimensional space where it becomes difficult
If a classifier obtains 0% training error, it cannot have 100% testing error
False, extreme overfitting can result in this occurring
The order of the training points can affect the convergence of the gradient descent algorithm
False, gradient descent just depends on a sum across the training examples, which is independent of order
In order for gradient descent to converge, the loss function has to be convex and differentiable everywhere
False, if the loss is not convex, gradient descent can still converge, just to a local minimum
The perceptron always finds a separating hyperplane
False, if the data is not linearly separable, it will not
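
A minimal perceptron training loop (assuming labels in {-1, +1} and any bias folded into the features); it returns a separating hyperplane only if one exists, otherwise the pass cap below is hit and the last iterate is returned:

    import numpy as np

    def perceptron(X, y, max_passes=1000):
        """X: (n, d) features, y: labels in {-1, +1}."""
        w = np.zeros(X.shape[1])
        for _ in range(max_passes):
            mistakes = 0
            for xi, yi in zip(X, y):
                if yi * np.dot(w, xi) <= 0:   # misclassified (or on the boundary)
                    w = w + yi * xi           # perceptron update
                    mistakes += 1
            if mistakes == 0:                 # clean pass: w separates the training data
                return w
        return w                              # no separator found within the pass cap
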
When you split your data into train and test you have to make sure you always do the splitting uniformly at random
False, if there is a temporal component to your data you usually split by time
Newton's method diverges only if the Hessian matrix is not invertible
False, it can also diverge with an invertible Hessian matrix
Logistic regression can be used with MLE but not MAP
False, it can be used with both
Naive Bayes can be used with MLE but not MAP
False, it can be used with both
During training, in a linearly separable data set, the perceptron algorithm never misclassifies the same input twice
False, it can iterate many times over the data set and get the same points wrong repeatedly
Newton's Method always converges but can be slow
False, it does not always converge
The fully Bayesian approach is never feasible
False, it is feasible in some cases, e.g. when a conjugate prior makes the integral over the parameters tractable
The higher dimensional the data, the less likely the perceptron is to converge
False, it is more likely that a linear separating hyperplane exists when the data is high dimensional, as points tend to be further apart
You have a biased coin and toss it n times, the MAP estimate w/ +1 smoothing of the probability of getting "head" is (num heads + 1) / (num total + 1)
False, it is (num heads + 1) / (num total + 2)
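
A short derivation, assuming the standard Beta(alpha, beta) prior on the head probability and taking the posterior mode; with alpha = beta = 2 ("hallucinating" one head and one tail) this is exactly +1 smoothing:

\[
\hat{\theta}_{\text{MAP}}
  = \frac{n_H + \alpha - 1}{n_H + n_T + \alpha + \beta - 2}
  = \frac{n_H + 1}{n_H + n_T + 2}
  \quad \text{for } \alpha = \beta = 2,
\]

where n_H is the number of heads and n_T the number of tails.
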
The Bayes optimal error is the best classification error you could get if there was no noise
False, it is the best classification error you could get if you knew the data distribution. This error is due to label uncertainty, which is actually noise
The kNN classifier assumes that data is high dimensional
False, kNN works best with low dimensional data; in high dimensions the curse of dimensionality hurts it
MLE stands for Maximum Likelihood Expectation
False, it stands for Maximum Likelihood Estimator
If no separating hyperplane exists, the perceptron may or may not converge
False, it will not converge
The error of the 1-NN classifier is at least as bad as 2x the error of the Bayes Optimal classifier, as n -> infinity
False, it's at most twice the Bayes optimal error
Logistic regression is a special case of Naive Bayes classifier
False, logistic regression does not assume features are independent given the label
Linear and Logistic Regression are both used for regression
False, logistic regression is used for classification, linear regression is used for regression
The perceptron and logistic regression both run forever when the data is not linearly separable
False, logistic regression will terminate
The ML algorithm takes in data and predicts features
False, machine learning takes in data in the form of features, and predicts labels for those features
Making assumptions in ML is "cheating"
False, the no free lunch theorem states that every machine learning algorithm has to make assumptions
If ML is done right, the data scientist has to make no choices
False, no free lunch theorem states that every machine learning algorithm has to make assumptions
Regression is the setting with only two classes
False, in regression the label space is the set of all real numbers; the setting with only two classes is binary classification
If the features are probabilistically dependent on each other, then the naive Bayes assumption cannot hold
False, the features could be conditionally independent, given the label
Generalization error is just another word for validation error
False, the generalization error is the expected error obtained on new data drawn from the same data distribution, the validation error is the empirical error on the validation set
The zero one loss can be minimized with the gradient descent algorithm
False, the gradient of the zero-one loss is 0 almost everywhere (and undefined where the prediction flips sign), so gradient descent gets no signal and cannot lead the optimization solver to the minimum of the function
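
A tiny illustrative check (the point, label, and weight below are made up) that the zero-one loss gives gradient descent nothing to work with:

    import numpy as np

    # Zero-one loss of the 1-D linear classifier sign(w * x) on a single point (x, y).
    x, y = 2.0, 1.0
    loss = lambda w: float(np.sign(w * x) != y)

    w = -1.0                                             # misclassifies the point, loss = 1
    eps = 1e-6
    numeric_grad = (loss(w + eps) - loss(w - eps)) / (2 * eps)
    print(loss(w), numeric_grad)                         # 1.0 0.0 -- zero gradient, no descent direction
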
In MLE, the parameter we want to learn is the random variable
False, the model parameter does not have a distribution associated with it
A "true bayesian" approach of learning will learn a point estimate of the model parameter
False, the model parameter needs to be integrated out
Assume there exists a vector w' that defines a hyperplane that perfectly separates your data. Let the perceptron vector be w, as the algorithm proceeds, the two vectors must converge, and in the limit, we have w = w'
False, the perceptron converges to a separating hyperplane, but it could be a different one than w'
In high dimensions randomly drawn points are all close
False, in high dimensions randomly drawn points tend to be far apart (though roughly equidistant); this curse of dimensionality hurts kNN performance
One implication of the curse of dimensionality is that if you sample n data points uniformly at random within a hypercube of dimensionality d, all pairwise distances converge to 0 as n -> infinity
False, they stay concentrated around the average distance. As d increases, points move further apart, not closer together, so the pairwise distances do not converge to 0
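
A small simulation of this concentration effect (n and the dimensions below are arbitrary choices): the mean pairwise distance grows with d while the relative spread std/mean shrinks, i.e. the points become roughly equidistant:

    import numpy as np
    from scipy.spatial.distance import pdist

    rng = np.random.default_rng(0)
    n = 200
    for d in (2, 10, 100, 1000):
        X = rng.uniform(size=(n, d))   # n points uniform in the d-dimensional unit hypercube
        dists = pdist(X)               # all pairwise Euclidean distances
        print(d, round(dists.mean(), 2), round(dists.std() / dists.mean(), 3))
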
As n -> infinity, the error of the perceptron classifier is at most twice the error of the Optimal Bayes Classifier (where n is the number of training samples)
False, this is true for 1 NN
Logistic regression always produces a better result than with Naive Bayes
False, when n is small Naive Bayes will outperform logistic regression
In practice, people often create Train and Validation sets out of their original data D. Each point x in D, is placed either into train, validation, or both
False, you cannot place a point into both
If you were to use the "true" Bayesian way of machine learning you would put a prior over the possible models and draw one randomly during training
False, you would integrate over all possible models
The prior distribution is my belief of P(y|x) before I see data
False, the prior distribution is P(theta) over the parameters theta, before we see any data
Adagrad excels when feature sparsity patterns vary
True
As the validation set becomes extremely large, the validation error approaches test error
True
Averaged across all data sets, all hypothesis classes are equally "bad"
True
Bayesian statistics can elevate parameters to random variables
True
For AdaGrad, each feature has its own learning rate
True
For binomial distributions, as n -> infinity the ML estimate becomes exact
True
For perceptron learning algorithms, the order of the data points does affect the number of misclassifications of each point
True
For text data categorical NB performs like multinomial NB as n -> infinity
True
Gradient descent takes the (locally) steepest step possible
True
If MAP is used with a prior that has non-zero support everywhere, it will converge to the same result as MLE as n -> infinity (where n is the number of training samples)
True
If data is drawn uniformly at random within a hypercube, increasing the dimensionality makes the data points more equidistant
True
If the Naive Bayes assumption holds, the Naive Bayes classifier becomes identical to the Bayes Optimal classifier
True
In high dimensions randomly drawn points are likely relatively close to an arbitrary hyperplane
True
In the binomial case MAP essentially "hallucinates" samples
True
In the categorical Naive Bayes each feature is modeled with its own "coin"
True
Instead of the likelihood P(Data|Theta) we can maximize the log likelihood
True
Logistic regression always produces a linear classifier
True
MAP estimation with a Beta prior to estimate the probability of a coin leading to heads, is identical to "hallucinating" coin toss results
True
MAP inference maximizes p(w|Data) whereas MLE maximizes P(Data;w) where w represents the model parameters
True
MAP returns the most likely parameters given the data and your prior
True
ML maximizes log[p(y|x;theta)] and MAP maximizes log[P(y|x, theta)] + log[P(theta)]
True
MLE returns the parameters that make the data most likely
True
MLE works great for large n, but is bad for small n
True
Multinomial NB uses one "die" and each side corresponds with a feature
True
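
A minimal sketch of multinomial Naive Bayes with +1 (Laplace) smoothing on bag-of-words counts; the smoothed word probabilities for each class are the faces of that class's "die", and the resulting score is linear in the counts (which is also why the multinomial NB item further down calls it a linear classifier):

    import numpy as np

    def train_multinomial_nb(X, y, alpha=1.0):
        """X: (n, V) word-count matrix, y: class labels. Returns classes, log priors, log word probs."""
        classes = np.unique(y)
        log_prior = np.log(np.array([(y == c).mean() for c in classes]))
        counts = np.array([X[y == c].sum(axis=0) for c in classes])                 # word counts per class
        word_prob = (counts + alpha) / (counts + alpha).sum(axis=1, keepdims=True)  # smoothed "die" per class
        return classes, log_prior, np.log(word_prob)

    def predict_multinomial_nb(x, classes, log_prior, log_word_prob):
        # score_c = log P(y=c) + sum_w x_w * log P(w | y=c): linear in the count vector x
        return classes[np.argmax(log_prior + log_word_prob @ x)]
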
Naive Bayes holds when features are independent given the label
True
Newton's Method involves a matrix inversion
True
One advantage of Adagrad is that each dimension effectively has its own separate learning rate, which is set automatically
True
Scaling the step-size by the inverse Hessian diagonal can be very effective in practice
True
The "fully Bayesian" approach marginalizes out all possible models
True
The Gaussian is the conjugate prior of the Gaussian
True
The k-NN classifier is not a linear classifier
True
The loss function measures the quality of a specific hypothesis in the hypothesis class
True
The multinomial Naive Bayes algorithm is a linear classifier
True
The perceptron converges whenever a separating hyperplane exists
True
With a good stepsize GD always converges but can be slow
True
Increasing k makes the decision boundary more smooth
True, as k increases the prediction averages over more neighbors, which smooths the decision boundary
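
A minimal kNN prediction sketch (assuming Euclidean distance and labels in {-1, +1}); the majority vote over k neighbors is what gets smoother as k grows, and the per-query scan over all n training points is what makes kNN slow for large n:

    import numpy as np

    def knn_predict(x, X_train, y_train, k=3):
        """Majority vote of the k nearest training points; labels in {-1, +1}."""
        dists = np.linalg.norm(X_train - x, axis=1)   # distance to every training point, O(n) per query
        nearest = np.argsort(dists)[:k]               # indices of the k closest points
        vote = y_train[nearest].sum()
        return 1.0 if vote >= 0 else -1.0             # majority label, ties broken toward +1
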
The order of the training points can affect the training time of the perceptron algorithm
True, certain point orders will allow the algorithm to converge earlier or later than other orders
As n becomes very large, kNN becomes very accurate
True, as n grows its error approaches (at most) twice the Bayes optimal error
As n becomes very large, kNN becomes very slow
True, each prediction must compute distances to all n training points, so it gets slower
The perceptron can only be used for classification
True, it requires binary labels in {-1, +1}
Logistic regression is a discriminative algorithm
True, it's a discriminative model, not a generative one
For logistic regression, newton's method will return the global optimum
True, the loss function of logistic regression is convex, so Newton's method reaches the global optimum
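
A minimal sketch of Newton's method for (unregularized) logistic regression with labels in {0, 1}; because the negative log-likelihood is convex, the inverse-Hessian step heads toward the single global optimum, though in practice the Hessian may be near-singular and need regularization:

    import numpy as np

    def logistic_regression_newton(X, y, steps=10):
        """X: (n, d) features, y: labels in {0, 1}."""
        w = np.zeros(X.shape[1])
        for _ in range(steps):
            p = 1.0 / (1.0 + np.exp(-X @ w))               # predicted probabilities
            grad = X.T @ (p - y)                           # gradient of the negative log-likelihood
            hess = X.T @ (X * (p * (1 - p))[:, None])      # Hessian (positive semi-definite => convex)
            w = w - np.linalg.solve(hess, grad)            # Newton step: Hessian inverse times gradient
        return w
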
Let w* be a separating hyperplane, then w'w* increases with every update
True, each update adds yx to w, and y(x'w*) > 0 because w* separates the data, so the dot product w'w* increases with every update
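
A one-line version of the argument, assuming w* separates the data with margin gamma > 0, i.e. y_i (x_i' w*) >= gamma for every training point; after an update on a misclassified (x, y):

\[
w_{\text{new}}^{\top} w^{*}
  = (w + y\,x)^{\top} w^{*}
  = w^{\top} w^{*} + y\,(x^{\top} w^{*})
  \;\ge\; w^{\top} w^{*} + \gamma .
\]

As the False item near the top of this list notes, this growth alone does not prove convergence; the full proof also bounds ||w||^2 from above.
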