Machine Learning - Midterm

For the k-means algorithm, finding the optimal number of clusters with the silhouette-score method is more accurate than the elbow method.

True
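
To make the comparison concrete, here is a minimal numpy sketch of the silhouette coefficient for a single point (all names are illustrative; a library such as scikit-learn computes this for every point and averages):

```python
import numpy as np

def silhouette_point(X, labels, i):
    """Silhouette s(i) = (b - a) / max(a, b), where a is the mean distance
    from point i to its own cluster and b is the smallest mean distance
    to any other cluster."""
    own = labels[i]
    d = np.linalg.norm(X - X[i], axis=1)            # distances to all points
    same = labels == own
    a = d[same & (np.arange(len(X)) != i)].mean()   # mean intra-cluster distance
    b = min(d[labels == c].mean() for c in set(labels) if c != own)
    return (b - a) / max(a, b)

# Two well-separated 1-D clusters: the score should be close to +1.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
labels = np.array([0, 0, 0, 1, 1, 1])
s = silhouette_point(X, labels, 0)
```

Scores near +1 indicate a point is well inside its cluster; scores near 0 or below suggest the chosen K splits or merges natural groups.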

For a linear SVM with C = infinity on a linearly separable dataset, the training error is guaranteed to be zero.

True

For two runs of K-Means with the same K, initialized from the same cluster centroids, it is expected that both runs give the same clustering results.

True

Given m data points, the training error converges to the true error as m → ∞.

True

High bias and low variance tend to result in underfitting.

True

In SVMs, the values of αi for non-support vectors are 0

True

In gradient-based algorithms, an early-stopping criterion can work as regularization.

True

In gradient-descent-based algorithms, choosing a small learning rate may cause the model to fail to converge.

True

Logistic Regression can be used for classification.

True

Logistic Regression with no polynomial features will always give a linear decision boundary.

True

Making a decision tree deeper ensures a better fit to the training data but reduces robustness/generalization.

True

Subsequent principal components are always orthogonal to each other.

True
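
This follows because the components are eigenvectors of the symmetric covariance matrix. A quick numpy check on illustrative random data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 3-D data (the mixing matrix is arbitrary).
X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.5]])
Xc = X - X.mean(axis=0)                 # center the data
cov = np.cov(Xc, rowvar=False)          # 3x3 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # columns are principal directions

# Every pair of distinct components has dot product ~ 0,
# so the Gram matrix of the components is the identity.
gram = eigvecs.T @ eigvecs
```

`numpy.linalg.eigh` is used because the covariance matrix is symmetric, which guarantees an orthonormal set of eigenvectors.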

Support vector machines, like logistic regression models, give a probability distribution over the possible labels given an input example.

False

The only way to solve the logistic regression problem is to use gradient-based (iterative) solutions, as there is no known analytical way to solve it.

True
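
Because there is no closed form, logistic regression is fit iteratively. A minimal batch gradient-descent sketch on toy data (step size and iteration count are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy 1-D problem: label is 1 exactly when the feature is positive.
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = (x > 0).astype(float)
X = np.column_stack([np.ones(100), x])   # add an intercept column

w = np.zeros(2)
lr = 0.5
for _ in range(1000):                    # batch gradient descent
    p = sigmoid(X @ w)                   # predicted probabilities
    w -= lr * X.T @ (p - y) / len(y)     # gradient of the mean log-loss

accuracy = ((sigmoid(X @ w) > 0.5) == (y == 1)).mean()
```

In practice, libraries use fancier iterative solvers (e.g. Newton-type methods), but all of them are iterative rather than analytical.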

We can reach a single global minimum if we solve a linear regression problem by minimizing the sum of squared errors using gradient descent, since the cost function is convex.

True

When the feature space is larger, overfitting is more likely.

True

Clustering (K-Means) and PCA are what type of learning algorithms?

Unsupervised Learning

What is unsupervised learning?

Uses unlabeled data and tries to find natural clusters and patterns.

What are the two main issues with K-Means?

1. Sensitive to the initial centroids. 2. The number of clusters K has to be selected manually.

Consider a point that is correctly classified and distant from the decision boundary. (a) SVM's decision boundary might be unaffected by this point, but the one learned by logistic regression will be affected. (b) SVM's decision boundary will be affected by this point, and the one learned by logistic regression will be affected. (c) SVM's decision boundary might be unaffected by this point, but the one learned by logistic regression will be unaffected. (d) SVM's decision boundary will be affected by this point, but the one learned by logistic regression will be unaffected.

A

Which one of these statements is TRUE about decision tree: (a) A tree with depth of 3 has higher variance than a tree with depth of 1. (b) A tree with depth of 3 has higher bias than a tree with depth 1. (c) A tree with depth of 3 always has higher training error than a tree with depth 1. (d) A tree with depth of 3 never has higher test error than a tree with depth 1.

A

Random forest classifier

A collection of a large number of decision trees trained via a bagging method.

Describe an ROC curve.

A graph showing the performance of a classification model at all classification thresholds. The curve plots two parameters: true positive rate and false positive rate.
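
A minimal sketch of how those two rates are computed at a single threshold (names are illustrative; libraries sweep every threshold to trace the full curve):

```python
import numpy as np

def tpr_fpr(scores, y_true, threshold):
    """True/false positive rates for one classification threshold."""
    pred = scores >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    tpr = tp / np.sum(y_true == 1)   # recall on the positive class
    fpr = fp / np.sum(y_true == 0)   # false alarms on the negative class
    return tpr, fpr

scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2])
y_true = np.array([0,   0,   1,    1,   1,   0])
# Sweeping the threshold from high to low traces the ROC curve
# from (0, 0) up to (1, 1).
points = [tpr_fpr(scores, y_true, t) for t in (0.9, 0.5, 0.0)]
```

At the highest threshold nothing is predicted positive, giving (0, 0); at threshold 0 everything is, giving (1, 1).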

Which statement is WRONG about PCA: (a) PCA seeks linear transformations to a lower-dimensional space. (b) PCA's first component tries to capture the least of the data variance. (c) PCA is an unsupervised method. (d) PCA can be used as a pre-processing step for other machine learning algorithms.

B

Suppose we want to compute 10-Fold Cross-Validation error on 100 training examples. We need to compute error N1 times, and the Cross-Validation error is the average of the errors. To compute each error, we need to build a model with data of size N2, and test the model on the data of size N3. What are the appropriate numbers for N1, N2, N3? (a) N1 = 1, N2 = 90, N3 = 10 (b) N1 = 10, N2 = 100, N3 = 10 (c) N1 = 10, N2 = 90, N3 = 10 (d) N1 = 10, N2 = 100, N3 = 100

C
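
The counts can be verified with a plain-Python sketch of the fold bookkeeping (no ML library needed):

```python
# 10-fold CV on 100 examples: 10 error computations,
# each trained on 90 examples and tested on the held-out 10.
indices = list(range(100))
k = 10
fold_size = len(indices) // k
folds = [indices[i * fold_size:(i + 1) * fold_size] for i in range(k)]

splits = []
for test_fold in folds:
    test_set = set(test_fold)
    train = [i for i in indices if i not in test_set]
    splits.append((len(train), len(test_fold)))

n1 = len(splits)   # number of error computations: 10
```

Each of the 10 iterations trains on 90 examples (N2) and tests on 10 (N3), matching answer (c).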

To maximize the margin in SVM, we want to: (a) maxθ ½‖θ‖² such that θᵀXᵢ ≤ 1 if yᵢ = 1 and θᵀXᵢ ≥ −1 if yᵢ = −1 (b) minθ ½‖θ‖² such that θᵀXᵢ ≥ 1 if yᵢ = 1 and θᵀXᵢ ≤ −1 if yᵢ = −1 (c) minθ ½‖θ‖² such that θᵀXᵢ ≥ 1 if yᵢ = 1 and θᵀXᵢ ≤ −1 if yᵢ = −1 (d) None of the above

C

Which of the following is true about "Ridge" or "Lasso" regression methods in the case of feature selection? (e) Ridge regression uses subset selection of features (f) Lasso regression uses subset selection of features (g) Both use subset selection of features (h) None of the above

F
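
The key difference is that the L1 (Lasso) penalty can drive weights exactly to zero, which is what performs the implicit subset selection, while the L2 (Ridge) penalty only shrinks them. A hedged numpy sketch using proximal gradient descent (ISTA) on a toy problem where feature 1 is pure noise (all constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))   # feature 0 is signal, feature 1 is noise
y = 2.0 * X[:, 0]

lam, lr = 0.3, 0.1
w = np.zeros(2)
for _ in range(500):          # ISTA: gradient step, then soft-threshold
    grad = X.T @ (X @ w - y) / n
    z = w - lr * grad
    w = np.sign(z) * np.maximum(np.abs(z) - lr * lam, 0.0)

# Ridge (closed form) only shrinks weights; it rarely makes them exactly 0.
w_ridge = np.linalg.solve(X.T @ X / n + lam * np.eye(2), X.T @ y / n)
```

The soft-threshold step zeroes the noise feature's weight outright, whereas ridge leaves it small but nonzero.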

Consider a cancer diagnosis classification problem where almost all the people being diagnosed don't have cancer. The probability of correct classification is the most important metric to optimize.

False

Even though stochastic gradient descent updates the model parameters faster (compared to batch gradient descent), batch gradient descent takes longer to converge.

False

If a learning algorithm is suffering from high bias, getting more training data is likely to help.

False

If we use feature scaling, it's guaranteed that gradient descent-based algorithms will converge.

False

K-means automatically adjusts the number of clusters.

False

Leave-one-out cross-validation generally gives less accurate estimates of the true test error than 10-fold cross-validation.

False

Low Bias and High Variance tends to result in underfitting.

False

Random forest is an ensemble learning method that attempts to lower the bias error of decision trees.

False

The eigenvector of the covariance matrix with the largest eigenvalue is the direction of minimum variance in the data.

False

Unsupervised learning cannot be used for data visualization and dimensionality reduction.

False

Random forest is an ensemble method that uses bagging and random feature selection. At which level does the random feature selection happen?

Node level.

Bagging

In bagging, a random sample of data in a training set is selected with replacement, meaning that individual data points can be chosen more than once.
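
Sampling with replacement takes one line in the standard library (names are illustrative):

```python
import random

random.seed(0)
data = list(range(10))
# Bootstrap sample: same size as the original, drawn with replacement,
# so some points typically appear more than once and some not at all.
sample = random.choices(data, k=len(data))
out_of_bag = [x for x in data if x not in sample]
```

In a random forest, each tree is trained on its own bootstrap sample; the out-of-bag points can be used to estimate test error for free.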

What is supervised learning?

Process of learning with labelled training data

Describe K-Means Clustering Algorithm

Randomly choose K cluster center locations (centroids). Loop until convergence:
➢ Assign each point to the cluster of the closest centroid.
➢ Re-estimate the cluster centroids based on the data assigned to each cluster.
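
Those two steps translate into a short numpy sketch (toy data and fixed seeds; all names are illustrative):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Randomly pick K distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins the closest centroid's cluster.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: re-estimate each centroid as the mean of its points.
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):   # converged
            break
        centroids = new
    return centroids, labels

# Two obvious blobs around (0, 0) and (10, 10).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(10, 0.5, (20, 2))])
centroids, labels = kmeans(X, k=2)
```

On well-separated data like this, the centroids settle on the two blob means; real implementations also handle empty clusters and use smarter initialization (e.g. k-means++).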

What is C in SVM?

Regularization hyperparameter. A higher C value means fewer margin violations but less generalization.

The K-Means algorithm will in general converge to a local-optima rather than a global one. Given this, how would you adapt the algorithm to increase the chances of finding a good solution?

Start from spread-out (sparse) centroids. Run the algorithm multiple times and pick the run with the lowest cost; that works mainly for small K.

What is subset selection?

Subset selection is the process of identifying and removing as much of the irrelevant and redundant information as possible. This reduces the dimensionality of the data and allows learning algorithms to operate faster and more effectively

Classification and Regression are what type of learning algorithms?

Supervised Learning

Decision trees with depth one will always give a linear decision boundary.

True

Dimensionality reduction can be used as pre-processing for machine learning algorithms like decision trees, SVM, etc.

True

Adding more features to a linear regression model always increases model variance.

True

By normalizing the data, the gradient descent algorithm will converge faster.

True

Consider a cancer diagnosis classification problem where almost all the people being diagnosed don't have cancer. For this dataset, Receiver Operating Characteristic (ROC) curve can measure the true performance.

True

high variance

Sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the noise in the training set (overfitting).

high bias

A condition where the output of the machine learning model is systematically far from the actual output (underfitting).

What is the kernel trick in SVM?

Makes it possible to get the same result as if you had added many polynomial features, even very high-degree ones, without actually adding them.
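
For instance, the degree-2 polynomial kernel K(x, z) = (x·z + 1)² equals the dot product of explicit 6-dimensional feature maps, without ever building them. A small numpy check (the feature map shown is the standard one for this kernel):

```python
import numpy as np

def poly_kernel(x, z):
    """Degree-2 polynomial kernel, computed in the original 2-D space."""
    return (x @ z + 1.0) ** 2

def phi(x):
    """Explicit degree-2 feature map that the kernel implicitly uses."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, x2 ** 2, s * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
k_implicit = poly_kernel(x, z)   # cheap: stays in the 2-D input space
k_explicit = phi(x) @ phi(z)     # same value via the 6-D feature space
```

The kernel computes the 6-D inner product with 2-D work, which is what lets SVMs use very high-degree (even infinite-dimensional, e.g. RBF) feature spaces.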

