Machine Learning Quiz Answers
What are the odds (not probability) of rolling a 6 on a fair six-sided die?
0.2
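As a quick check: odds = p/(1−p) = (1/6)/(5/6) = 1/5 = 0.2.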
Select all operations you can perform with lin reg that will never decrease the R² on the training set. Adding features Removing features Scaling features by a positive value Scaling features by a negative value Scaling the features by any real-valued constant Scaling the target by a positive real-valued constant
Adding features; scaling features by a positive value; scaling features by a negative value; scaling the target by a positive real-valued constant
Select all statements that are true about cubic natural splines: they are differentiable; they are continuous; they are nonlinear; they have some linear segments
All of the above
Which is the most common method for estimating the confidence in estimated parameters (such as β in linear regression) of a machine learning method?
Bootstrapping
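A minimal sketch of the idea with NumPy (the data, true slope, and number of resamples are all illustrative): refit the model on many resampled datasets and use the spread of the refitted coefficient as a confidence estimate.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100
    x = rng.normal(size=n)
    y = 2.0 * x + rng.normal(size=n)              # synthetic data, true slope = 2

    slopes = []
    for _ in range(1000):
        idx = rng.integers(0, n, size=n)          # resample indices with replacement
        slopes.append(np.polyfit(x[idx], y[idx], 1)[0])  # least-squares slope on the resample

    print("bootstrap SE of the slope:", np.std(slopes))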
Which of the following is true about decision trees: Can be used for classification and regression A disadvantage is they are not scale invariant A disadvantage is that they can only be used for quantitative features An advantage is that they can conveniently accommodate any prior knowledge An advantage is that they can be interpreted
Can be used for classification and regression, they can be interpreted
If your data contains a categorical feature/predictor with multiple valid values, what would be a reasonable approach for using it in your machine learning algorithm?
Constructing an indicator/dummy variable
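A small illustration with pandas (the column values are made up); with drop_first=True, one level is dropped as the baseline, giving p−1 indicator columns for p categories:

    import pandas as pd

    color = pd.Series(["red", "green", "blue", "green"])
    # one indicator column per category, minus a baseline level
    print(pd.get_dummies(color, drop_first=True))   # columns: green, red (blue is baseline)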
Linear regression with cubic regression splines fits functions that are Equivalent to a single cubic function Linear Continuously differentiable Differentiable Continuous
Continuously differentiable, differentiable, continuous
What is a benefit of pruning a decision tree?
Decrease variance
Which of the following is true about decision trees Can only be used for regression Can only be used for qualitative features Can only work with centered and normalized features Usually branch using 2-feature conditions Easy to interpret
Easy to interpret
If QDA has a higher area under the ROC curve (AUC) than LDA, then
Either QDA or LDA can be preferable based on the relative costs of false positives or false negatives
True or False: ridge regression will never overfit because it is regularized
False
True or False: the primary disadvantage of Lasso is that it requires all predictors to have non-negative weights in the final model
False
True or false: RSS on the training set increases with the addition of features
False
True or false: using kernels in SVMs prevents overfitting
False
Which would be best suited for a recurrent neural net Learning the same problem multiple times Image recognition Unsupervised learning Genre detection Predicting tides
Genre detection
Which statement is true about using KNN for classification with k=1: Is common in practice because it is very fast Has a 0 error on the training set Has a 0 error on the test set Has a linear decision boundary Never overfits
Has a 0 error on the training set
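A quick demonstration with scikit-learn (iris is just a convenient dataset; this assumes no duplicate inputs with conflicting labels):

    from sklearn.datasets import load_iris
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)
    # each training point is its own nearest neighbor, so training accuracy is 1.0
    print(knn.score(X, y))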
Select which of the following are examples of prediction: Identifying patients as sick based on temperature, antibody levels, and presence of a cough; Developing a new process for evaluating loans based on the splits in a decision tree; Reviewing feature coefficients after performing log reg on whether or not a student passed a test; Labeling a student as likely to pass a test based on previous exam grades and class attendance
Identifying patients as sick, labeling a student as likely to pass
Select all valid reasons for reducing the number of features: Improve interpretability of parameters Increase flexibility Reduce error on the test set Reduce error on the training set
Improve interpretability of parameters, reduce error on the test set
How does QDA generalize LDA?
It does not assume that all classes share the same covariance matrix (LDA does)
Select all methods that are examples of unsupervised learning Decision trees LDA KNN K-Means PCA
K-means, PCA
Assume that the LDA classification error is 0.9 on the test set, then
LDA is worse than always predicting the most common class
Select all of the following machine learning methods that are generative: LDA QDA KNN Log Reg Lin Reg
LDA, QDA
Select all parametric classification methods: KNN LDA QDA Lin Reg Log Reg
LDA, QDA, Log Reg
If you have a small training set and a flexible classifier, what is the most likely appropriate strategy for model validation?
Leave-one-out cross-validation (LOOCV)
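A sketch with scikit-learn (the model and dataset are placeholders): LOOCV fits the model n times, each time holding out a single point, which makes the most of a small training set.

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    X, y = load_iris(return_X_y=True)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
    print(scores.mean())   # mean over n single-point folds = LOOCV accuracy estimate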
Assume that you have a problem with many features and you expect only a few of them to be important. Select all appropriate methods: Lin Reg Lasso Forward feature selection Ridge KNN
Lasso, forward feature selection
Select ALL methods that you would consider using if you have more features than data points Lin Reg Lasso Ridge Regression Bootstrap
Lasso, ridge regression
Select all ML methods for which scaling features will never change the training error or the test error: Lin Reg KNN LDA QDA Log Reg
Lin Reg, LDA, QDA, Log Reg
Select all that are true: Lin reg can fail if the data set is not iid Outlier data does not impact a linear regression model Linear regression can be used to fit a nonlinear function Linear regression is a generative model
Lin reg can fail if the data set is not iid; lin reg can be used to fit a nonlinear function
Which machine learning method when scaling features will never change the training error or the test error: Lin reg Lasso Ridge KNN
Linear regression
Select all classification methods that have linear classification boundaries in 2-class classification: Log reg LDA QDA KNN Lin Reg
Log Reg, LDA
Consider a neural network with a single output node, several input nodes, and no hidden layers. The single unit in this network uses a sigmoid activation function. If you train this network using the cross-entropy objective, which other machine learning method will make the most similar predictions? SVM Feed forward neural net LDA Decision tree trained using CART linear regression poisson regression Recurrent neural net Random forest Log reg
Log reg
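To see why: the unit outputs σ(β₀ + β₁X₁ + … + βₚXₚ) = 1/(1 + e^−(β₀+β₁X₁+…+βₚXₚ)), which is exactly the logistic regression model, and minimizing cross-entropy is the same as maximizing the logistic likelihood.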
Which of the following fit a separating hyperplane Log Reg KNN LDA Support vector classifiers Lin Reg
Log reg, LDA, support vector classifiers
Decision boundary is linear in the feature space in the following methods: SVM with a polynomial kernel Maximum margin classifier Single layer neural net Recurrent neural net Log reg LDA
Maximum margin classifier, log reg, LDA
If the ROC curve of method A is never below the ROC curve of method B, then on this dataset
Method A is no worse than Method B
It is a good practice to run LOO CV multiple times to get a better estimate of the desired parameter
No; LOOCV is deterministic, so repeating it produces the same estimate every time
Select the method for which the following statement is satisfied. If the coefficient βₙ for the feature Xₙ is smaller than the coefficient β₀ for the feature X₀, then Xₙ is less important: Lin Reg Log Reg KNN Naive Bayes LDA and QDA None of the Above
None of the above
Select all of the following machine learning methods which are generative: SVM Log Reg QDA Lasso LDA PCA
QDA, LDA
What is used to approximate the purity of a node in a regression tree? Gini index RSS Cross validation Confusion matrix ROC
RSS
Which of the following are examples of ensemble learning KNN Random forests Decision trees LOO Boosted decision trees
Random forests, boosted decision trees
What are valid reasons to reduce the number of features
Reduce overfitting, reduce computational complexity
Select all true statements: Forward stepwise feature selection is guaranteed to achieve the minimal error on the training set Regularization can be seen as a heuristic approach to feature selection CV is appropriate for determining the right number of features to use Regularization penalizes model complexity
Regularization can be seen as a heuristic approach to feature selection; CV is appropriate for determining the right number of features to use; regularization penalizes model complexity
What is one of the benefits of using the L1 norm in regularization of linear regression vs the L2 norm?
Results in sparse solutions
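A small comparison sketch (synthetic data; the alpha value and feature counts are arbitrary): L1 (Lasso) drives many coefficients exactly to zero, while L2 (Ridge) only shrinks them.

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 20))
    y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)   # only 2 of 20 features matter

    lasso = Lasso(alpha=0.1).fit(X, y)
    ridge = Ridge(alpha=0.1).fit(X, y)
    print("exact zeros, lasso:", int(np.sum(lasso.coef_ == 0)))   # most coefficients
    print("exact zeros, ridge:", int(np.sum(ridge.coef_ == 0)))   # typically none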
Which of the following is true Ridge regression shrinks regression coefficients towards 0 Lasso expands the regression coefficients towards infinity Ridge regression doesn't show any improvement over lin reg Ridge regression tends to assign non-zero coefficients to fewer features than best subset selection
Ridge shrinks coefficients towards 0
Which of the following are true: SVMs effectively eliminate the bias-variance trade-off The number of support vectors is independent of the kernel used SVMs are generalizations of the maximal margin classifier SVMs can fit nonlinear decision boundaries
SVMs are generalizations of the maximal margin classifier; SVMs can fit nonlinear decision boundaries
The overall strategy in bootstrapping is to
Sample with replacement
What is a valid reason for using boosting over a single decision tree?
Single decision trees cannot include predictive power from multiple, overlapping regions of the feature space
Recall
TP/(TP+FN)
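A tiny worked example (the labels are made up): with TP = 2 and FN = 1, recall is 2/(2+1) = 2/3.

    from sklearn.metrics import recall_score

    y_true = [1, 1, 1, 0, 0]
    y_pred = [1, 0, 1, 0, 1]              # TP = 2, FN = 1
    print(recall_score(y_true, y_pred))   # 2 / (2 + 1) ≈ 0.667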
True or false: a benefit of pruning decision trees is that it may help decrease variance
True
True or false: the purpose of random forests is to decorrelate trees when doing bagging
True
Which are true about hyperplanes: An appropriate interpretation is that it divides a p-dimensional space into three equal size partitions A line is a hyperplane in three dimensional space Hyperplanes can only be defined in 2-dimensions The following is a valid equation for a hyperplane in five-dimensional space β₀+β₁X₁+β₂X₂+β₃X₃+β₄X₄+β₅X₅=0 A line is a hyperplane in two dimensional space
The five-dimensional equation is a valid hyperplane; a line is a hyperplane in two-dimensional space
Decision trees can handle qualitative features ______
always
Which values usually increase when increasing the regularization coefficient in ridge regression from 0: bias variance training error test error
bias, training error
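A sketch of the training-error effect (synthetic data; the alpha grid is arbitrary): the penalty pulls the fit away from the least-squares solution, so training RSS can only grow.

    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 10))
    y = X @ rng.normal(size=10) + rng.normal(size=50)

    for alpha in [0.1, 1.0, 10.0, 100.0]:
        model = Ridge(alpha=alpha).fit(X, y)
        rss = float(np.sum((y - model.predict(X)) ** 2))
        print(alpha, round(rss, 2))   # training RSS grows as alpha grows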
Cubic regression splines are: continuously differentiable continuous piecewise linear piecewise constant
continuously differentiable, continuous
The principal components identified by PCA are eigenvectors of the covariance matrix always positive orthogonal always negative parallel
eigenvectors, orthogonal
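A quick numerical check (random correlated data): the rows of components_ are the principal directions, and their pairwise dot products form the identity matrix.

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 3))   # correlated features

    V = PCA().fit(X).components_     # rows: principal directions
    print(np.round(V @ V.T, 6))      # identity matrix -> orthogonal (orthonormal) directions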
Suppose that for a linear regression fit, the coefficient βₙ for the feature Xₙ is smaller than β₀ for the feature X₀. Then removing Xₙ would increase the prediction error ____ compared to removing X₀
either more or less
Logistic regression assumes a linear model of log odds and therefore
has a linear decision boundary between classes in the feature space
Data points with high leverage in simple linear regression are ones that
have a very different X (feature) value from other data points
Simple linear regression assumes that the target is a linear combination of feature values plus a noise term εₙ for each data point n. Select ALL statements describing what linear regression assumes about the noise εₙ: They are heteroscedastic They are homoscedastic They are independent (statistically) They are identically distributed They are normally distributed They are positive
homoscedastic, independent, identically distributed, normally distributed
Select all that are true about convolutional neural nets They are especially appropriate for image recognition They can only be used with ReLU as an activation function They are a type of recurrent neural net with constraints They are a type of feed forward neural network They were developed especially for understanding natural language
image recognition, feed forward neural net
QDA will never have a greater ______ than LDA
negative log-likelihood on the training set
Adding interaction features in linear regression will
never increase the training RSS
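A sketch of why (pure-noise target, so any improvement is in-sample only): OLS can always set the new interaction coefficients to zero, so training RSS cannot go up.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = rng.normal(size=100)              # pure noise target

    def train_rss(F):
        m = LinearRegression().fit(F, y)
        return float(np.sum((y - m.predict(F)) ** 2))

    Xi = PolynomialFeatures(interaction_only=True, include_bias=False).fit_transform(X)
    print(train_rss(X), train_rss(Xi))    # RSS with interactions is never larger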
Random variables X and Y are independent _______ their correlation coefficient is zero
only if
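A standard counterexample, simulated (the sample size is arbitrary): Y = X² is completely determined by X, yet their correlation is near zero, so zero correlation does not imply independence.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=100_000)
    y = x ** 2                       # dependent on x by construction
    print(np.corrcoef(x, y)[0, 1])   # approximately 0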
Using dummy variables for a qualitative feature with p classes adds___ new binary features
p-1
Select all statements that are true about the k-means algorithm It is randomized It automatically finds the number of clusters It minimizes the classification error Its output depends on whether the features are normalized It does not require features to be centered
randomized, output depends on normalization of features, does not require features to be centered
Bootstrapping constructs data sets by sampling ________
randomly with replacement
The first principal component in PCA is: the same direction as computed using total least squares the same direction as computed using linear regression the direction that minimizes the data's variance the direction that maximizes the data's variance
the same direction as computed using total least squares; the direction that maximizes the data's variance
In simple linear regression (Y = β₀ + β₁X), the R² statistic is equal to
the square of the correlation coefficient between X and Y
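A quick numerical confirmation (synthetic data):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score

    rng = np.random.default_rng(0)
    x = rng.normal(size=100)
    y = 3 * x + rng.normal(size=100)

    r = np.corrcoef(x, y)[0, 1]
    fit = LinearRegression().fit(x.reshape(-1, 1), y)
    print(r ** 2, r2_score(y, fit.predict(x.reshape(-1, 1))))   # the two values agree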
QDA will never have a higher error than LDA on the
training set
True or false: the ROC curve shows the true positive rate as a function of the false positive rate
true
True or false: the principal components identified by PCA are orthogonal
true
Select all that are true about PCA: PCA is falling out of favor to Lasso PCA is an unsupervised learning technique PCA seeks dimensions that minimize variance PCA expands the dimensionality of the data PCA is an effective technique for linear dimensionality reduction
unsupervised learning technique, effective technique for linear dimensionality reduction
What option should you take when an SVM with a polynomial kernel overfits on the training set: Use a polynomial kernel of a smaller degree Use a polynomial kernel of a larger degree Create more data by bootstrapping
Use a polynomial kernel of a smaller degree
Select reasons to use slack variables in SVMs and maximum margin classifiers: when classes are not separable; to use a nonlinear kernel; to handle nonlinear kernels; to increase the number of support vectors; to reduce sensitivity to outliers; to decrease the number of support vectors
when classes are not separable; to increase the number of support vectors; to reduce sensitivity to outliers