SML true/false questions from exams

For models that are trained iteratively, a lower training error Etrain can be achieved by training longer.

True. (Reference in the draft (April 30, 2021) SML book on page 240: "For models that are trained iteratively we can reduce Etrain by training longer.")

A convolutional layer in a neural network can only appear directly after the input layer

False

Bootstrap aggregating (bagging) works best on simple models with high bias and low variance

False

For classification, the input variables have to be categorical

False

K-nearest neighbors is a generative model.

False

Linear discriminant analysis (LDA) is a non-parametric model.

False

Linear regression requires all input variables to be numerical (quantitative)

False

If gradient descent converges, the solution is guaranteed to be a local minimum

False. Gradient descent can also converge to a saddle point; if it converges, the solution may be a local minimum, but this is not guaranteed.

Random forest is an extension of Adaboost

False. Random forest is an extension of bagging.

Quadratic discriminant analysis is a parametric model

True

A classification tree with a single binary split is a linear classifier

True

A neural network with linear activation functions is linear in the input variables

True

Logistic regression and linear discriminant analysis (LDA) will always produce the same decision boundary for binary classification problems

False

It is easy to parallelize the training of a boosted model.

False. (Reference in the draft (April 30, 2021) SML book on page 151: "Another unfortunate aspect of the sequential nature of boosting is that it is not possible to parallelize the learning.")

Regression models have quantitative outputs.

True.

Bagging allows us to estimate the expected new data error Enew without cross-validation

True. (Reference in the draft (April 30, 2021) SML book on page 141: "When using bagging, it turns out that there is a way to estimate the expected new data error Enew without using cross-validation.")

Boosting primarily increases performance by reducing the bias of the base model

True

Cross-validation can be used to learn the regularization parameter λ in ridge regression.

True
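As an illustration (my own sketch, not from the book or the exams), λ can be chosen by cross-validation with scikit-learn's RidgeCV; the synthetic data and the candidate grid of λ values below are arbitrary choices.

```python
# Minimal sketch (assumes scikit-learn): pick the ridge penalty by cross-validation.
# scikit-learn calls the regularization parameter "alpha" (lambda in the course notation).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

# 5-fold cross-validation over a logarithmic grid of candidate lambda values.
model = RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5)
model.fit(X, y)
print("selected lambda:", model.alpha_)
```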

LDA is a special case of QDA

True

Logistic regression is a linear classifier.

True

Regularization allows us to restrict the flexibility of a model

True

Probabilistic models assign probability distributions to unknown model parameters.

True. Probabilistic models use probability distributions to describe the uncertainty in the model parameters and the data. In contrast to deterministic approaches, which give a single point estimate of the parameters, probabilistic models assign probability distributions to the unknown parameters and thereby quantify the uncertainty in those estimates. This is the basis of Bayesian inference, where we want the posterior distribution of the model parameters given the data.

Normalizing the dataset is important for the performance of a classification tree.

False

Talking about the bias-variance tradeoff only has meaning when learning by minimizing the mean squared error cost function

False

When bootstrapping a dataset, it is important to sample without replacement

False

The optimal weights in deep learning always have a closed-form solution, but it is too expensive to compute when the number of data points is large.

False. In deep learning there is generally no closed-form solution for the optimal weights. They are found with iterative optimization techniques such as gradient descent, which repeatedly adjust the weights to decrease the loss function, i.e., the measure of how well the model fits the training data.
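A minimal numerical sketch (my own illustration, not from the book) of the iterative idea, the generic update θ ← θ − γ∇J(θ), shown on linear regression where a closed-form solution exists to compare against; for a deep nonlinear network no such closed form exists, so only the iterative scheme remains.

```python
# Minimal sketch (illustrative): gradient descent theta <- theta - gamma * grad J(theta)
# on a least-squares problem, compared with the closed-form solution. For deep
# networks no closed form exists, so the weights are found with updates like these.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=100)

theta = np.zeros(3)
gamma = 0.1  # learning rate (arbitrary choice)
for _ in range(500):
    grad = 2.0 / len(y) * X.T @ (X @ theta - y)  # gradient of the mean squared error
    theta -= gamma * grad

print("gradient descent :", theta)
print("closed form      :", np.linalg.lstsq(X, y, rcond=None)[0])
```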

Using bootstrap aggregating, we never have to worry about how much data we have collected. We can always sample more datasets to improve our models

False. The bootstrap datasets are all resampled from the same original training data, so they contain no new information. Bagging can reduce the variance of the model, but it is no substitute for collecting more (or better) data; the quality and quantity of the original dataset still limit what the model can learn.

The quadratic discriminant analysis classifier is only applicable to inputs that follow a Gaussian distribution.

False. The classifier can be applied to any inputs; the Gaussian assumption is only the modelling assumption used to derive it.

A classifier is called linear if the function that maps each input to a predicted class is linear in the parameters.

False. A classifier is called linear if it has a linear decision boundary. (Reference in the draft (April 30, 2021) SML book on page 49: "linear regression is a model which is linear in its parameters".)

If the second term in the sum above is dominant, we call this 'underfitting'.

False. Underfitting refers to a model that is not flexible enough to capture the pattern in the data.

Learning a classifier with the logistic loss always produces a linear decision boundary in input space

False. Logistic regression gives a linear decision boundary, but the logistic loss can also be used with non-linear models (e.g., neural networks), whose decision boundaries need not be linear.

Boosting is typically used to improve large models with small bias and high variance

False, boosting is typically used with simple base models with high bias and low variance.

Ridge regression is typically used as an input selection method.

False, but this is true for LASSO (L1 regularization).

A non-parametric model for classification always achieves zero training error since the complexity grows with data.

False, consider e.g. k-nearest-neighbors with K = 10.

When assessing the out-of-sample performance of a given learned model f(x; θ̂), its expected new error can be estimated using k-fold cross-validation.

False. For a given learned model you can just evaluate it on hold-out data; k-fold cross-validation is used to estimate the expected error of the method, which requires re-training the model in each fold.

Consider learning a linear classification model f(x; θ) = sign(xᵀθ) using the misclassification loss. The gradient descent method will find a model that has the minimum average misclassification.

False. The misclassification loss is piecewise constant in θ, so its gradient is zero almost everywhere and gradient descent makes no progress.
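Written out (a worked note in the question's notation), the cost being minimized is

```latex
J(\boldsymbol{\theta}) = \frac{1}{n} \sum_{i=1}^{n}
\mathbb{I}\{\operatorname{sign}(\mathbf{x}_i^{\mathsf T}\boldsymbol{\theta}) \neq y_i\},
```

and small changes of θ leave all the signs, and hence the cost, unchanged except on a set of measure zero, so the gradient carries no information about where the minimum is.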

In classification problems, the input variables are always categorical.

False. The output is categorical, but the inputs need not be.

The model bias typically tends to zero as the number of training data points tends to infinity

False. Increasing the number of training data points mainly reduces the variance of the model, i.e., how much its predictions vary between different training sets. The bias is determined by the flexibility of the chosen model class: if the class cannot represent the true function, a systematic error remains no matter how much data is used, so there is no guarantee that the bias tends to zero as the number of training data points tends to infinity.

Solving a logistic regression problem using gradient descent can lead to multiple local optimum solutions.

False. (Because the logistic loss is convex. Reference in the draft (April 30, 2021) SML book on page 96: "Examples of convex functions are the cost functions for logistic regression, linear regression and L1-regularized linear regression.")
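As a worked note (my own summary, assuming labels y ∈ {−1, 1}), the logistic loss for a linear model is

```latex
J(\boldsymbol{\theta}) = \frac{1}{n}\sum_{i=1}^{n}
\ln\!\left(1 + e^{-y_i \mathbf{x}_i^{\mathsf T}\boldsymbol{\theta}}\right),
```

a sum of convex functions of θ and hence convex, so any local minimum found by gradient descent is also a global minimum.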

A too large number of ensemble members leads to an increased complexity of a bagging model and results in a higher variance.

False. (Reference in the draft (April 30, 2021) SML book on page 140: "It is important to understand that by the construction of bagging, more ensemble members does not make the resulting model more flexible, but only reduces the variance.")

The model y = θ1x1 + θ2x2 + ϵ is an example of a linear regression model with parameters θ1, θ2, input variables x1, x2 and a noise term ϵ.

False. (Reference in the draft (April 30, 2021) SML book on page 49: "linear regression is a model which is linear in its parameters")

The bias of the model decreases as the size of the training dataset goes to infinity

False. (See Figure 4.9 (a) in the draft (April 30, 2021) SML book. The bias is approximately constant as the number of training examples increases.)

An underfitted model has high bias and low variance. Therefore, it shows a low accuracy on the training set and a high accuracy on the test set.

False. An underfitted model has high bias, which means that it oversimplifies the relationship between the inputs and outputs and is not flexible enough to capture the underlying pattern in the data. As a result, it will have low accuracy both on the training set and on the test set.

Like the bagging technique, cross-validation helps to reduce the flexibility of the model.

False. Bagging and cross-validation have different goals. Bagging reduces the variance of the model by training multiple models on different bootstrap samples of the data and combining their predictions. Cross-validation estimates the generalization error of a model by repeatedly splitting the data into training and validation sets and evaluating on the validation set. Cross-validation can help to avoid overfitting by providing an estimate of the model's performance on unseen data, but it does not by itself reduce the flexibility of the model.

Compared to Bagging, Random Forest on the same trainset performs better by decreasing the bias and increasing the variance

False. Compared to bagging on the same training set, random forests generally have lower variance of the averaged prediction (they are less likely to overfit) and, if anything, slightly higher bias. From the book: "This will cause the B trees to be less correlated, and averaging their predictions can therefore result in a larger variance reduction compared to bagging. It should be noted, however, that this random perturbation of the training procedure will increase the variance of each individual tree. In the notation of equation (7.2b), the random forest decreases ρ (good) but increases σ² (bad) compared to bagging. Experience has, however, shown that the reduction in correlation is the dominant effect, so that the averaged prediction variance is often reduced."

Misclassification loss is sensitive to outliers, i.e. incorrectly classified training data points far from the decision boundary

False. Misclassification loss is not sensitive to outliers: it only counts whether a training point is classified correctly or not. The loss does not take the distance to the decision boundary into account, so an incorrectly classified point far from the boundary contributes exactly as much to the loss as any other misclassified point.

In neural networks, sigmoid activation function is a common choice for the last layer in a multi-class classification problem

False. In multi-class classification problems, the softmax activation function is commonly used in the last layer of a neural network, not the sigmoid activation function. The softmax activation function maps the outputs of the neural network to a probability distribution over the classes, while the sigmoid activation function maps its input to values between 0 and 1.
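A minimal sketch (my own illustration, plain NumPy) of the softmax used in the last layer for multi-class problems; the logits below are arbitrary example numbers.

```python
# Minimal sketch (illustrative): softmax maps a vector of class scores (logits) to a
# probability distribution over the classes, unlike the elementwise sigmoid.
import numpy as np

def softmax(z):
    z = z - np.max(z)      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])   # arbitrary scores for three classes
probs = softmax(logits)
print(probs, probs.sum())            # ≈ [0.66, 0.24, 0.10], sums to 1
```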

When using LDA for binary classification, the midpoint between the two class means, µ = (µ1 + µ2)/2, will always give p(y = 1 | x = µ) = 1/2.

False. It is not true in general; it holds only if the estimated class priors are equal, π̂1 = π̂2 = 1/2.
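A short worked check (my own derivation, assuming the usual LDA model with a shared covariance Σ̂): at the midpoint the two Gaussian class densities are equal, so the posterior reduces to the prior,

```latex
p(y=1 \mid \mathbf{x}=\boldsymbol{\mu})
= \frac{\hat\pi_1 \, \mathcal{N}(\boldsymbol{\mu} \mid \hat{\boldsymbol{\mu}}_1, \hat{\boldsymbol{\Sigma}})}
       {\hat\pi_1 \, \mathcal{N}(\boldsymbol{\mu} \mid \hat{\boldsymbol{\mu}}_1, \hat{\boldsymbol{\Sigma}})
      + \hat\pi_2 \, \mathcal{N}(\boldsymbol{\mu} \mid \hat{\boldsymbol{\mu}}_2, \hat{\boldsymbol{\Sigma}})}
= \hat\pi_1,
```

which equals 1/2 exactly when π̂1 = π̂2 = 1/2.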

QDA is a non-linear classification algorithm and always has a quadratic decision surface.

False. QDA (Quadratic Discriminant Analysis) is a non-linear classification algorithm in general, but its decision surface is not always quadratic: if the estimated class covariance matrices are equal, the quadratic terms cancel and the decision boundary becomes linear (as in LDA).

The correlation between any pair of ensemble members of a bagged regression model f̂_bag^B(x) = (1/B) ∑_{b=1}^{B} f̂^{*b}(x) tends to zero as the number of ensemble members B tends to infinity.

False. All ensemble members are trained on bootstrap datasets drawn from the same original training data, so their predictions remain positively correlated regardless of how many members are used. Averaging over more members reduces the variance of the bagged prediction, but the correlation between any pair of members does not tend to zero as B tends to infinity.

A model with lower bias always performs better than a model with higher bias in terms of the mean squared error on test data

False. The expected mean squared error decomposes into the squared bias plus the variance (plus irreducible noise), so a decrease in bias can be outweighed by an increase in variance.
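For reference, the decomposition behind this answer (consistent with the formula quoted later in this set; f̄ denotes the average model and σ² the irreducible noise):

```latex
\mathbb{E}\big[(y_\star - \hat f(\mathbf{x}_\star))^2\big]
= \underbrace{\big(f_0(\mathbf{x}_\star) - \bar f(\mathbf{x}_\star)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat f(\mathbf{x}_\star) - \bar f(\mathbf{x}_\star))^2\big]}_{\text{variance}}
+ \sigma^2.
```

A lower bias therefore only helps if it is not offset by a larger variance.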

Regularization, like ridge regression and LASSO, adds an extra term to the cost function

True

A linear classifier has a linear decision boundary.

True. In a binary classification problem, a linear classifier divides the input space into two regions separated by a linear decision boundary: a hyperplane (a line in two dimensions, a plane in three). The boundary is determined by the weights and bias of the model, which define the linear function used to make predictions, and the predicted class label is given by which side of the boundary the input falls on.

The absolute error loss function is more robust to outliers than the squared error loss function

True The absolute error loss function is less sensitive to outliers than the squared error loss function. This is because the squared error loss function puts more weight on large errors, which can be caused by outliers, while the absolute error loss function treats all errors equally regardless of their size.
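A tiny numerical illustration (my own example residuals, not from the book):

```python
# Minimal sketch (illustrative): a single large residual dominates the squared-error
# loss but enters the absolute-error loss only linearly.
import numpy as np

residuals = np.array([0.5, -0.3, 0.2, 10.0])   # the last value plays the outlier
print("mean squared error :", np.mean(residuals**2))       # ≈ 25.1, driven by the outlier
print("mean absolute error:", np.mean(np.abs(residuals)))  # ≈ 2.75
```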

In a neural network model, a convolutional layer uses significantly fewer parameters compared to the dense layer with the same number of hidden units

True (Because of the sparse interactions and parameter sharing in convolutional layers. Reference in the draft (April 30, 2021) SML book on page 126: "Furthermore, a convolutional layer uses significantly fewer parameters compared to the corresponding dense layer.")

The model sensitivity of k-NN typically decreases as k increases

True. A larger k means that a single data point has a smaller influence on the prediction, i.e., lower sensitivity.

The squared-error performance of a learning method for regression can be decomposed into three terms: E[(f0(x⋆) − f̄(x⋆))²] + E[(f(x⋆; θ̂(T)) − f̄(x⋆))²] + σ², where the first term can be reduced by using a more flexible family of models

True. The first term is the squared bias, which can be reduced by using a more flexible family of models.

Stochastic gradient descent can be faster than ordinary gradient descent even though it requires more search steps

True, since each iteration is faster

A k-nearest neighbors classifier always attains zero training error for k = 1 for datasets where no inputs are repeated, i.e. xi ≠ xj for all i ≠ j

True.

A classifier Ĝ(X) is said to be linear if the function Ĝ, which maps each input to a predicted class, is a linear function of the model parameters.

True. A linear classifier is a type of classifier in machine learning that uses a linear function of the model parameters to make predictions. The function takes as input a set of features (represented as a vector x) and computes a weighted sum of those features, which is then passed through an activation function to produce the predicted class label.

The model bias of k-NN typically increases as k increases

True. Consider the extreme case k = 1: the training data are predicted perfectly, so the bias is essentially zero, but the predictions vary a lot between training sets, i.e., the variance is high. As k increases, each prediction averages over more neighbors and the model becomes less flexible: the bias increases while the variance typically decreases.

The Bayes classifier can not be implemented in practice, but if it could it would always attain zero test error.

False. The Bayes classifier assumes perfect knowledge of the true distribution p(y | x) and therefore cannot be implemented in practice. Even if it could, it would attain the Bayes error rate, which is the lowest possible test error but is in general greater than zero when the class distributions overlap.

LASSO regularization can be used as an input selection method.

True. LASSO regularization is a popular technique used in regression analysis to prevent overfitting and select important features by shrinking the coefficients of less important features to zero. In other words, LASSO can be used to perform feature selection by identifying and excluding features that are less important in predicting the response variable. By setting some of the coefficients to zero, LASSO essentially eliminates the corresponding features from the model, making it a useful tool for input selection.
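A minimal sketch (my own illustration, scikit-learn) of the sparsity that makes LASSO usable for input selection; the data and the regularization strength are arbitrary choices.

```python
# Minimal sketch (assumes scikit-learn): LASSO sets some coefficients exactly to zero,
# so the surviving inputs can be read off as the "selected" ones.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)
lasso = Lasso(alpha=1.0).fit(X, y)
print("coefficients   :", np.round(lasso.coef_, 2))
print("selected inputs:", np.flatnonzero(lasso.coef_))
```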

When using bagging, an out-of-bag estimate of the expected new data error Enew is computationally much cheaper than a k-fold cross-validation.

True. The out-of-bag estimate of the expected new data error in bagging is computationally cheaper than k-fold cross-validation because in bagging, each sample is used as the validation set exactly once, so there is no need to perform k separate training and validation processes. From the book: "𝐸OOB provides an estimate of 𝐸new which can be at least as good as the estimate 𝐸𝑘-fold from 𝑘-fold cross-validation. Most importantly, however, 𝐸OOB comes almost for free in bagging, whereas 𝐸𝑘-fold requires much more computation when re-training 𝑘 times."
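A minimal sketch (my own illustration, scikit-learn): the out-of-bag estimate is obtained from the same bagging run, with no extra re-training; the data and settings are arbitrary.

```python
# Minimal sketch (assumes scikit-learn): with oob_score=True, each ensemble member is
# evaluated on the samples left out of its bootstrap draw, giving an estimate of the
# new-data performance without a separate cross-validation loop.
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
bag = BaggingRegressor(n_estimators=100, oob_score=True, random_state=0).fit(X, y)
print("out-of-bag score (R^2):", bag.oob_score_)
```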

Enforcing a maximum depth for the tree can help reduce overfitting in decision trees

True. (Reference in the draft (April 30, 2021) SML book on page 29)
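A minimal sketch (my own illustration, scikit-learn) comparing an unrestricted tree with a depth-limited one on synthetic data; whether the shallow tree wins on the test set depends on the data, but it is clearly less prone to overfit.

```python
# Minimal sketch (assumes scikit-learn): capping max_depth restricts the flexibility
# of the tree, which typically narrows the gap between training and test accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)             # grown until pure leaves
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
print("no depth limit: train %.2f  test %.2f" % (deep.score(X_tr, y_tr), deep.score(X_te, y_te)))
print("max_depth=3   : train %.2f  test %.2f" % (shallow.score(X_tr, y_tr), shallow.score(X_te, y_te)))
```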

Both CART and K-Nearest Neighbor are non-parametric

True. Both CART (Classification and Regression Trees) and K-Nearest Neighbor (KNN) are non-parametric algorithms, meaning they do not make assumptions about the underlying distribution of the data

Dropout is a regularization technique which prevents overfitting and generalizes the model

True. Dropout is a regularization technique that randomly drops out neurons in a neural network during training, which helps to prevent overfitting by reducing the dependence of the model on any single neuron. It generalizes the model by forcing it to learn multiple representations of the input data.
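A minimal sketch (my own illustration, PyTorch) of where a dropout layer sits in a network; the layer sizes and the dropout probability p = 0.5 are arbitrary choices.

```python
# Minimal sketch (assumes PyTorch): nn.Dropout randomly zeroes hidden units during
# training and is switched off at evaluation time.
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # active only in training mode
    nn.Linear(64, 3),
)
model.train()   # dropout is applied in the forward pass
model.eval()    # dropout is disabled; the full network is used for prediction
```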

In a linear regression model with Gaussian noise, MLE and MSE always give the same result

True. In a linear regression model with Gaussian noise, the maximum likelihood estimation (MLE) of the model parameters is equivalent to minimizing the mean squared error (MSE) between the model predictions and the observed data. Therefore, the optimization problem for both MLE and MSE is the same, and the optimal solution is also the same.
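A short derivation (my own summary, assuming a linear model θᵀx for concreteness) of why the two coincide under the Gaussian noise assumption y = θᵀx + ε, ε ~ N(0, σ²):

```latex
\ln p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\theta})
= -\frac{n}{2}\ln(2\pi\sigma^2)
  - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\big(y_i - \boldsymbol{\theta}^{\mathsf T}\mathbf{x}_i\big)^2,
```

so maximizing the likelihood over θ is exactly the same as minimizing the sum of squared errors (and hence the MSE), since the first term and the factor 1/(2σ²) do not depend on θ.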

A higher value of the regularization hyperparameter in a linear regression problem with L1 regularization (also called LASSO) leads to a more sparse model, where fewer model parameters are non-zero

True. Reference in the draft (April 30, 2021) SML book on page 94: "Whereas L2 regularization pushes all parameters towards small values (but not necessarily exactly zero), L1 tends to favor so-called sparse solutions where only a few of the parameters are non-zero, and the rest are exactly zero."

Standard Gradient Descent is guaranteed to converge and find the global minimum.

False. Standard gradient descent is not guaranteed to converge to the global minimum in all cases. It can get stuck in local minima (or saddle points) and may not find the global minimum, especially if the cost function is non-convex. Variants such as momentum-based gradient descent, Adagrad, and Adam can help in practice, but they do not provide such a guarantee either.
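A tiny illustration (my own example, not from the book or exams) of gradient descent settling in a local rather than the global minimum of a non-convex function; the function, start point, and step size are arbitrary choices.

```python
# Minimal sketch (illustrative): f(w) = w**4 - 3*w**2 + w has a local minimum near
# w ≈ 1.13 and a lower, global minimum near w ≈ -1.30. Started at w = 2, plain
# gradient descent converges to the local minimum.
def grad(w):
    return 4 * w**3 - 6 * w + 1   # derivative of f(w) = w**4 - 3*w**2 + w

w = 2.0                           # arbitrary starting point
for _ in range(1000):
    w -= 0.01 * grad(w)           # fixed step size 0.01
print(w)                          # ≈ 1.13: the local minimum, not the global one
```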

