SML
A convolutional layer in a neural network can only appear directly after the input layer.
False
A marketing company wants to build a model for predicting the number of visitors to a web page. Since the number of visitors is an integer, this is best viewed as a classification problem.
False
A neural network is an ensemble method.
False
An epoch, in the context of stochastic gradient descent, is the number of iterations required for the training to converge
False
Bootstrap aggregating (bagging) works best on simple models with high bias and low variance.
False
Classification problems have only qualitative inputs
False
Compared to bagging, a random forest on the same training set performs better by decreasing the bias and increasing the variance.
False
Deep learning is a nonparametric method
False
Deep neural networks can only be used for regression, not classification.
False
For classification, the input variables have to be categorical.
False
In neural networks, the sigmoid activation function is a common choice for the last layer in a multi-class classification problem.
False
K-nearest neighbors is a generative model.
False
LASSO and Ridge Regression are mathematically equivalent
False
Like the bagging technique, cross-validation helps to reduce the flexibility of the model.
False
Linear discriminant analysis (LDA) is a non-parametric model.
False
Linear regression requires all input variables to be numerical (quantitative).
False
Logistic regression and linear discriminant analysis (LDA) will always produce the same decision boundary for binary classification problems.
False
Neural networks can only be used for classification problems, and not for regression problems.
False
Normalizing the dataset is important for the performance of a classification tree.
False
QDA is a non-linear classification algorithm and always has a quadratic decision surface.
False
Regularization can only be used for regression methods, and not for classification methods.
False
Standard Gradient Descent is guaranteed to converge and find the global minimum.
False
The least squares problem always has a unique solution β̂.
False
The optimal weights in deep learning always have a closed-form solution, but it is too expensive to compute when the number of data points is large.
False
The partitioning of the input space shown below (figure not reproduced here) could be generated by recursive binary splitting
False
The training error usually increases when we increase the model flexibility
False
The bias-variance tradeoff is only meaningful when learning by minimizing the mean squared error cost function.
False
Using bootstrap aggregating, we never have to worry about how much data we have collected. We can always sample more datasets to improve our models.
False
When bootstrapping a dataset, it is important to sample without replacement.
False
k-NN is a linear classifier if k = 1
False
Logistic regression is a regression method.
False. It is a classification method (despite its name)
A nonlinear classifier can never have a linear decision boundary
False, a non-linear classifier can still have a linear decision boundary.
One should not split datasets randomly into training and test data, but always take the last data points as the test data.
False, if the data is collected in a non-random fashion this could lead to a test dataset that only contains data of a certain type.
Random forest is a special version of boosting with trees
False, random forest is a bagging algorithm, not a boosting algorithm
Regularization decreases the bias of the model
False, Regularization is used to reduce the variance and will thus increase the bias
A classifier Ĝ(X) is said to be linear if the function Ĝ, which maps each input to a predicted class, is a linear function of the model parameters.
False, a classifier is said to be linear if its decision boundary is linear. Ĝ takes values in a discrete set and cannot be a linear function!
The model bias typically tends to zero as the number of training data points tends to infinity.
False, any mismatch between the postulated model and the true input-output relationship will result in a model bias which does not vanish as the number of data points becomes large.
Boosting is typically used to improve large models with small bias and high variance.
False, boosting is typically used with simple base models with high bias and low variance.
A non-parametric model for classification always achieves zero training error since the complexity grows with data.
False, consider e.g. k-nearest-neighbors with k = 10.
Misclassification loss is sensitive to outliers, i.e. incorrectly classified training data points far from the decision boundary.
False, misclassification loss yields a loss of 1 for any misclassified point, regardless of how far from the decision boundary it is.
The correlation between any pair of ensemble members of a bagged regression model f̂_bag(x) = (1/B) Σ_{b=1}^{B} f̂*_b(x) tends to zero as the number of ensemble members B tends to infinity.
False, the ensemble members are conditionally independent (given the training data set) and the correlation between any pair of ensemble members is independent of B.
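As a side note (a standard result, sketched here rather than taken from the exam solution): if every ensemble member has variance σ² and every pair has correlation ρ, the variance of the bagged average is

\operatorname{Var}\big[\hat f_{\mathrm{bag}}(x)\big] = \operatorname{Var}\Big[\tfrac{1}{B}\sum_{b=1}^{B}\hat f^{*b}(x)\Big] = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2,

so increasing B drives the variance down towards ρσ², but ρ itself does not depend on B.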
The Bayes classifier cannot be implemented in practice, but if it could it would always attain zero test error.
False, there is typically an irreducible error.
An underfitted model has high bias and low variance. Therefore, it shows a low accuracy on the training set and high accuracy on the test set.
False. An underfitted model typically has low accuracy on both the training set and the test set.
In a linear regression model with Gaussian noise, MLE and MSE always give the same result.
False.
Solving a logistic regression problem using gradient descent can lead to multiple local optima.
False. (Because the logistic loss is convex. Reference in the draft (April 30, 2021) SML book on page 96: "Examples of convex functions are the cost functions for logistic regression, linear regression and L1-regularized linear regression.")
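A minimal sketch in Python (toy data and hyperparameters are made up) illustrating the point: because the (regularized) logistic loss is convex, gradient descent started from different initial weights reaches essentially the same solution. A small L2 penalty is added so the minimizer is unique even if the toy data happens to be separable.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.normal(size=200) > 0).astype(float)

def grad(w, lam=1e-2):
    # Gradient of the L2-regularized logistic loss.
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (p - y) / len(y) + lam * w

def run_gd(w0, lr=0.5, steps=5000):
    w = w0.copy()
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

w_a = run_gd(rng.normal(size=2))
w_b = run_gd(rng.normal(size=2))
print(w_a, w_b)  # both runs end up at (numerically) the same global optimum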
The model y = θ1 x1 + θ1² x2 + ε is an example of a linear regression model with parameters θ1, θ1², input variables x1, x2 and a noise term ε.
False. (Reference in the draft (April 30, 2021) SML book on page 49: "linear regression is a model which is linear in its parameters")
It is easy to parallelize the training of a boosted model.
False. (Reference in the draft (April 30, 2021) SML book on page 151: "Another unfortunate aspect of the sequential nature of boosting is that it is not possible to parallelize the learning.")
The bias of the model decreases as the size of the training dataset goes to infinity.
False. (See Figure 4.9 (a) in the draft (April 30, 2021) SML book. The bias is approximately constant as the number of training examples increases.)
If gradient descent converges, the solution is guaranteed to be a local minimum.
False. Gradient descent can converge to saddle points.
A classifier is called linear if the function that maps each input to a predicted class is linear in the parameters.
False. It is called linear if it has a linear decision boundary.
When using LDA for binary classification, the midpoint between the two clusters μ = (μ1+μ2)/2 will always give p(y = 1|x = μ) = 1/2
False. It is not true in general. However, it is true if π̂1 = π̂2 = 1/2.
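A sketch of the underlying calculation (the standard LDA log-odds, not copied from the exam solution): with shared covariance estimate Σ̂,

\log\frac{p(y=1\mid x)}{p(y=2\mid x)} = \log\frac{\hat\pi_1}{\hat\pi_2} + (\hat\mu_1-\hat\mu_2)^{\mathsf T}\hat\Sigma^{-1}\Big(x - \frac{\hat\mu_1+\hat\mu_2}{2}\Big),

which at x = (μ1+μ2)/2 equals log(π̂1/π̂2); this is zero, i.e. p(y = 1|x = μ) = 1/2, exactly when π̂1 = π̂2.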
Random forest is an extension of AdaBoost.
False. Random forest is an extension of bagging.
A model with lower bias always performs better than a model with higher bias in terms of the mean squared error on test data.
False. The expected mean squared error is the sum of the squared bias, the variance and the irreducible error. An increase in bias can reduce the variance.
A classification tree with a single binary split is a linear classifier.
True
A k-nearest neighbors classifier always attains zero training error for k = 1 for datasets where no inputs are repeated, i.e. xi ≠ xj for all i ≠ j.
True
A linear classifier has a linear decision boundary.
True
A neural network with linear activation functions is linear in the input variables.
True
Boosting primarily increases performance by reducing the bias of the base model.
True
Cross-validation can be used to learn the regularization parameter λ in ridge regression.
True
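For instance, a minimal scikit-learn sketch (synthetic data; the grid of candidate λ values, called alpha in scikit-learn, is illustrative):

import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + 0.1 * rng.normal(size=100)

# 5-fold cross-validation over a grid of regularization strengths.
model = RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5).fit(X, y)
print(model.alpha_)  # the lambda value selected by cross-validation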
Deep learning is a parametric method
True
In binary classification, the output can take only two possible values.
True
LASSO and Ridge Regression are two different methods for regularization
True
LASSO regularization can be used as an input selection method.
True
LDA is a special case of QDA
True
Logistic regression is a linear classifier.
True
One could use LDA as base classifier in boosting.
True
Quadratic discriminant analysis is a parametric model.
True
Regression models have quantitative outputs.
True
Regularization allows us to restrict the flexibility of a model.
True
Regularization can be used to avoid overfitting in linear regression.
True
Regularization may prevent overfitting
True
Regularization, like ridge regression and LASSO, adds an extra term to the cost function.
True
The absolute error loss function is more robust to outliers than the squared error loss function.
True
The expected mean squared error for new, previously unseen data points can be decomposed into a sum of squared bias, variance and irreducible error.
True
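Written out (the standard decomposition, stated here as a reminder), for a new input x with output y = f(x) + ε and a learned predictor f̂:

\mathbb{E}\big[(y-\hat f(x))^2\big] = \big(f(x) - \mathbb{E}[\hat f(x)]\big)^2 + \mathbb{E}\Big[\big(\hat f(x) - \mathbb{E}[\hat f(x)]\big)^2\Big] + \operatorname{Var}[\varepsilon],

where the three terms are the squared bias, the variance and the irreducible error, and the expectations are over training data and noise.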
The k-NN classifier most often suffers from overfitting when k = 1
True
When using bagging, an out-of-bag estimate of the expected new data error Enew is computationally much cheaper than a k-fold cross-validation.
True
c-fold cross validation can be used for selecting a good value of k in k-NN.
True
k-NN is a nonparametric method
True
In a neural network model, a convolutional layer uses significantly fewer parameters compared to the dense layer with the same number of hidden units.
True (Because of the sparse interactions and parameter sharing in convolutional layers. Reference in the draft (April 30, 2021) SML book on page 126: "Furthermore, a convolutional layer uses significantly fewer parameters compared to the corresponding dense layer.")
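A back-of-the-envelope count in Python (the layer sizes are made up for illustration):

# Toy example: a 32x32 single-channel image, with a hidden representation of the
# same spatial size produced either by a dense layer or by one 3x3 convolution.
in_h = in_w = 32
hidden_units = in_h * in_w                                  # same number of hidden units

dense_params = (in_h * in_w) * hidden_units + hidden_units  # weights + biases
conv_params = 3 * 3 * 1 * 1 + 1                             # one 3x3 kernel + one bias

print(dense_params)  # 1049600
print(conv_params)   # 10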
Convolutional neural networks are well suited for classification problems where the input is an image.
True, Convolutional neural networks are well suited for problems where the data has a local structure, such as images or time series.
The model y = β0 + β1 x1 + β2 sin(x2) + ε is a linear regression model (β0, β1 and β2 are the unknown parameters)
True, It is linear in the parameters and thus a linear model
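A quick sketch (synthetic data, hypothetical coefficient values) showing that the model can be fitted with ordinary least squares precisely because it is linear in β0, β1, β2:

import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(-3, 3, size=200)
x2 = rng.uniform(-3, 3, size=200)
y = 1.0 + 2.0 * x1 - 1.5 * np.sin(x2) + 0.1 * rng.normal(size=200)

# Design matrix with the transformed input sin(x2): the model is still
# linear in the parameters, so least squares applies directly.
Phi = np.column_stack([np.ones_like(x1), x1, np.sin(x2)])
beta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(beta_hat)  # approximately [1.0, 2.0, -1.5]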
Probabilistic models assign probability distributions to unknown model parameters.
True, in a probabilistic model the belief about unknown model parameters is represented using probability distributions.
The model bias of k-NN typically increases as k increases.
True, the model becomes less flexible (= larger bias) as k increases. For large enough k the model will always predict according to the dominating class.
Both CART and K-Nearest Neighbor are non-parametric.
True.
Dropout is a regularization technique which prevents overfitting and generalizes the model.
True.
Bagging allows to estimate the expected new data error Enew without cross-validation.
True. (Reference in the draft (April 30, 2021) SML book on page 141: "When using bagging, it turns out that there is a way to estimate the expected new data error Enew without using cross-validation.")
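A minimal scikit-learn sketch (illustrative data): the out-of-bag estimate is a by-product of a single fit, while k-fold cross-validation requires refitting the whole ensemble k times.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)

# One fit: the out-of-bag R^2 comes for free from bagging.
rf = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=0).fit(X, y)
print(rf.oob_score_)

# 5-fold cross-validation needs 5 additional fits of the whole ensemble.
print(cross_val_score(RandomForestRegressor(n_estimators=200, random_state=0),
                      X, y, cv=5).mean())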
Enforcing a maximum depth for the tree can help reduce overfitting in decision trees.
True. (Reference in the draft (April 30, 2021) SML book on page 29)
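For example, a small scikit-learn sketch (synthetic noisy data, illustrative settings) comparing a fully grown tree with a depth-limited one:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in [None, 3]:  # None means the tree is grown fully
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(depth, tree.score(X_tr, y_tr), tree.score(X_te, y_te))
# The fully grown tree fits the training data (almost) perfectly but tends to
# generalize worse on this noisy data than the depth-limited tree.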
For models that are trained iteratively, a lower training error Etrain can be achieved by training longer.
True. (Reference in the draft (April 30, 2021) SML book on page 240: "For models that are trained iteratively we can reduce Etrain by training longer.")
A higher value of the regularization hyperparameter in a linear regression problem with L1 regularization (also called LASSO) leads to a more sparse model, where fewer model parameters are non-zero.
True. Reference in the draft (April 30, 2021) SML book on page 94: "Whereas L2 regularization pushes all parameters towards small values (but not necessarily exactly zero), L1 tends to favor so-called sparse solutions where only a few of the parameters are non-zero, and the rest are exactly zero."
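A small scikit-learn sketch (synthetic data, illustrative regularization values) of this sparsity effect:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=200)  # only 2 relevant inputs

for alpha in [0.01, 0.1, 1.0]:  # larger alpha = stronger L1 regularization
    coef = Lasso(alpha=alpha).fit(X, y).coef_
    print(alpha, np.sum(coef != 0))  # typically fewer non-zero parameters for larger alpha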
A too large number of ensemble members leads to an increased complexity of a bagging model and results in a higher variance.
False. (Reference in the draft (April 30, 2021) SML book on page 140: "It is important to understand that by the construction of bagging, more ensemble members does not make the resulting model more flexible, but only reduces the variance.")