Machine Learning Exam 1

how to solve overfitting

- regularization
- add more training data
- simplifying training data
- reducing noise in data

how to solve underfitting

- use a more powerful model
- including better/more relevant features
- reducing constraints on model

A perfect classifier will have ROC AUC close to:

1
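A minimal sketch of why a perfect classifier reaches an AUC of 1, using the pairwise-ranking view of ROC AUC (the fraction of positive/negative pairs where the positive gets the higher score). The function name `roc_auc` and the toy scores are assumptions made for illustration.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
# Every positive is scored above every negative, so all pairs are ranked correctly.
perfect = roc_auc(labels, [0.1, 0.2, 0.8, 0.9])
# Uninformative scores: every pair is a tie.
constant = roc_auc(labels, [0.5, 0.5, 0.5, 0.5])
```

With these toy inputs, `perfect` is 1.0 and `constant` is 0.5, the value expected from random guessing.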

Logistic regression is a ML algorithm used for:

Classification

Which is better: hold out method or cross-validation?

Cross-validation

What method do we use to find the best fit line for data in Linear Regression?

Least Square Error

Are gradient descent algorithms for logistic regression and linear regression identical?

No

Consider a point that is correctly classified and distant from the decision boundary.
(a) SVM's decision boundary might be unaffected by this point, but the one learned by logistic regression will be affected.
(b) SVM's decision boundary will be affected by this point, and the one learned by logistic regression will be affected.
(c) SVM's decision boundary might be unaffected by this point, but the one learned by logistic regression will be unaffected.
(d) SVM's decision boundary will be affected by this point, but the one learned by logistic regression will be unaffected.

a

Suppose we want to compute 10-Fold Cross-Validation error on 100 training examples. We need to compute error N1 times, and the Cross-Validation error is the average of the errors. To compute each error, we need to build a model with data of size N2, and test the model on the data of size N3. What are the appropriate numbers for N1, N2, N3?
(a) N1 = 1, N2 = 90, N3 = 10
(b) N1 = 10, N2 = 100, N3 = 10
(c) N1 = 10, N2 = 90, N3 = 10
(d) N1 = 10, N2 = 100, N3 = 100

c
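The counts in this answer can be checked with a small sketch. The helper `k_fold_indices` is a hypothetical name, and it assumes contiguous folds with k dividing the number of examples evenly.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) for k contiguous folds; assumes k divides n_samples."""
    fold_size = n_samples // k
    for i in range(k):
        lo, hi = i * fold_size, (i + 1) * fold_size
        test_idx = list(range(lo, hi))
        train_idx = [j for j in range(n_samples) if j < lo or j >= hi]
        yield train_idx, test_idx


splits = list(k_fold_indices(100, 10))
n1 = len(splits)        # number of errors to compute
n2 = len(splits[0][0])  # training size for each model
n3 = len(splits[0][1])  # test size for each model
```

Here `n1`, `n2`, `n3` come out as 10, 90, and 10: each example is used for testing exactly once and for training nine times.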

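To maximize the margin in SVM, we want to:
(a) $\max_{\theta} \frac{1}{2}\|\theta\|^2$ such that $\theta^T X_i \le 1$ if $y_i = 1$ and $\theta^T X_i \ge -1$ if $y_i = -1$
(b) $\min_{\theta} \frac{1}{2}\|\theta\|^2$ such that $\theta^T X_i \ge 1$ if $y_i = 1$ and $\theta^T X_i \le -1$ if $y_i = -1$
(c) $\min_{\theta} \frac{1}{2}\|\theta\|^2$ such that $\theta^T X_i \ge 1$ if $y_i = 1$ and $\theta^T X_i \le -1$ if $y_i = -1$
(d) None of the above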

c
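A short sketch of why minimizing the norm maximizes the margin, assuming the usual hard-margin setup with no separate bias term, as in the options of this question. Points on the margin satisfy $\theta^T X = \pm 1$, and the distance between those two hyperplanes is $2/\|\theta\|$:

```latex
\text{margin} = \frac{2}{\|\theta\|}
\quad\Longrightarrow\quad
\arg\max_{\theta} \frac{2}{\|\theta\|}
= \arg\min_{\theta} \frac{1}{2}\|\theta\|^{2}
\quad \text{subject to } y_i\,\theta^{T} X_i \ge 1 \text{ for all } i
```

The single constraint $y_i\,\theta^T X_i \ge 1$ is the two case-wise constraints ($\ge 1$ for $y_i = 1$, $\le -1$ for $y_i = -1$) written together.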

instance vs model based learning

instance-based: the trivial approach. For a new, previously unseen sample, apply a similarity measure against all the training data and use the most similar examples.
model-based (what we're using): use all the data in the training set to train a model

Why can logistic regression hypothesis output range only from 0 to 1?

it is a probability
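A minimal sketch of the logistic (sigmoid) function behind that output range. The names `sigmoid` and `hypothesis` and the example weights are assumptions for illustration, not taken from the course material.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def hypothesis(theta, x):
    """h(x) = sigmoid(theta . x), read as the estimated probability of the positive class."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)


# theta . x = 0.5 * 2.0 - 0.25 * 4.0 = 0, which lies exactly on the decision boundary.
p = hypothesis([0.5, -0.25], [2.0, 4.0])
```

Whatever the value of the linear score, the sigmoid squashes it into the interval between 0 and 1, so it can be read as a probability. Here `p` is 0.5.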

gradient descent techniques that can be scalable to very large datasets

stochastic GD

A machine learning program that is trained from labeled data and filters spam emails should be modeled using

supervised learning

batch learning

the system cannot learn incrementally and takes all the available data in at once. This takes lots of time and resources. The system is trained, then launched, and must apply what it has learned.

Models that overfit fail to generalize on:

testing set

online learning

train the system incrementally by feeding it data instances individually or in small groups called mini-batches
- fast and cheap
- great if the system receives a continuous flow of data
- saves a lot of space because once the system has learned from data, it no longer needs to keep it

T/F: Feature scaling using techniques like mean normalization can make gradient descent converge faster

true
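A minimal sketch of mean normalization, assuming the common definition (x - mean) / (max - min). The helper name and the sample values are hypothetical.

```python
def mean_normalize(values):
    """Rescale each value to (x - mean) / (max - min); assumes the values are not all equal."""
    mean = sum(values) / len(values)
    spread = max(values) - min(values)
    return [(v - mean) / spread for v in values]


sizes = [1000.0, 2000.0, 3000.0]
scaled = mean_normalize(sizes)
```

After scaling, `scaled` is `[-0.5, 0.0, 0.5]`. Features on similar scales tend to give a better-conditioned cost surface, which is why gradient descent usually needs fewer iterations.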

Which is true for the hold-out method? 1. it always estimates the true error rate accurately 2. an unfortunate split (test/train) can give a misleading result 3. it is a good choice for sparse datasets

2

T/F: Even though stochastic gradient descent updates the model parameters faster than batch gradient descent, batch gradient descent takes longer to converge

False

T/F: If a learning algorithm is suffering from high bias, getting more training data is likely to help

False

T/F: If we use feature scaling, it's guaranteed that gradient descent-based algorithms will converge

False

T/F: In case of 3-way splits, we don't need to keep the test set separated

False

T/F: In terms of speed and results, Grid search is more efficient than Random search

False

T/F: It is okay to tune your model after testing

False

T/F: It is recommended to use linear regression for classification tasks

False

T/F: K-means automatically adjusts the number of clusters

False

T/F: Leave-one-out cross validation generally gives less accurate estimates of true test error than 10-fold cross validation

False

T/F: Overfitting is more likely when you have a huge amount of data.

False

T/F: Random forest is an ensemble learning method that attempts to lower the bias error of decision trees

False

T/F: Standardizing features is required before training a Linear Regression

False

T/F: The largest eigenvector of the covariance matrix is the direction of minimum variance in the data

False

T/F: The linear regression cost function for logistic regression is convex.

False

T/F: The output of a classification system is a continuous value and the output for a regression system is discrete

False

T/F: Consider a cancer diagnosis classification problem where almost all the people being diagnosed don't have cancer. The probability of correct classification is the most important metric to optimize.

False

Why use feature scaling?

Faster

When do you need to change or modify your model?

High variance or high bias

When should you use regularization?

During the optimization step, to reduce overfitting

Logistic regression hypothesis means:

Probability of the positive class

When is leave-one-out cross-validation useful?

Sparse dataset

T/F: Adding more features to a linear regression model always increases model variance

True

T/F: Building a nonlinear regression with a high polynomial degree could result in overfitting.

True

T/F: By normalizing data, the gradient descent algorithm will converge faster

True

T/F: Consider a cancer diagnosis classification problem where almost all the people being diagnosed don't have cancer. For this dataset, Receiver Operating Characteristic (ROC) curve can measure the true performance.

True

T/F: Decision trees with depth one will always give a linear decision boundary

True

T/F: Dimensionality reduction can be used as pre-processing for machine learning algorithms like decision trees, SVM, etc

True

T/F: Early stopping helps in case of high variance

True

T/F: For k-means algorithm, finding the optimal number of clusters using Silhouette Score method is more accurate than the elbow method

True

T/F: For a linear SVM with C = infinity, given a linearly separable dataset, the training error is guaranteed to be zero

True

T/F: For two runs of K-means with the same K, initialized from the same cluster centroids, both runs are expected to give the same clustering results

True

T/F: Given m data points, the training error converges to the true error as m → ∞

True

T/F: In SVMs, the values of αi for non-support vectors are 0

True

T/F: In case of high bias, increasing the model size is an option

True

T/F: In gradient based algorithms, early stopping criteria can work as regularization.

True

T/F: In gradient descent-based algorithms, choosing a small learning rate may cause the model to fail to converge

True

T/F: In the closed form solution, we don't have to choose the learning rate.

True
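A sketch of a closed-form fit for the single-feature case, where the least-squares slope and intercept follow directly from sample means. The function name `fit_line` and the toy data are assumptions. Note that no learning rate or iteration count appears anywhere.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x, computed in closed form."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Toy data generated from y = 1 + 2x, so the fit should recover those coefficients.
intercept, slope = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

With several features, the same idea becomes the normal equation, which still has no learning rate to tune.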

T/F: It is possible to apply a logistic regression algorithm on a 3-class classification problem

True

T/F: Learning curve is one of the ways to solve bias variance tradeoff

True

T/F: Linear Regression is a supervised machine learning algorithm

True

T/F: Linear Regression is mainly used for Regression

True

T/F: Logistic Regression can be used for classification

True

T/F: Logistic Regression with no polynomial will always give a linear decision boundary

True

T/F: Machine Learning is a branch of Artificial Intelligence (AI), that provides systems with ability to learn and improve from experience without being explicitly programmed

True

T/F: Making a decision tree deeper will assure better fit but reduce robustness/generalization

True

T/F: ROC curve is similar to Precision-Recall curve. It plots sensitivity vs 1-specificity.

True

T/F: Results achieved by supervised learning are much better than unsupervised learning

True

T/F: Subsequent principal components are always orthogonal to each other

True

T/F: Support vector machines, like logistic regression models, give a probability distribution over the possible labels given an input example

False (a standard SVM outputs a decision score, not a probability distribution over the labels)

T/F: The only way to solve logistic regression problem is using gradient based solutions as there is no known analytical way to solve it

True

T/F: The performance on the testing set is much more important than performance on training set

True

T/F: We can get a single global minimum if we solve a linear regression problem by minimizing the sum of squared errors using gradient descent

True

T/F: When the feature space is larger, overfitting is more likely

True

Is it a good idea to do cross validation repeatedly?

Yes

Which one of these statements is TRUE about decision trees:
(a) A tree with depth of 3 has higher variance than a tree with depth of 1.
(b) A tree with depth of 3 has higher bias than a tree with depth 1.
(c) A tree with depth of 3 always has higher training error than a tree with depth 1.
(d) A tree with depth of 3 never has higher test error than a tree with depth 1.

a

The run time of closed form solution of Linear Regression grows [a] with the number of features and [b] with the number of data points

a. b.

In the gradient descent approach, if the learning rate is too [a], then the algorithm will have to go through many iterations to converge. If the learning rate is too [b], the algorithm may diverge, failing to find a good solution

a. small b. high
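Both failure modes can be seen in a small sketch that minimizes f(x) = x^2 (gradient 2x) from x = 1. The function name and the three learning rates are assumptions chosen for illustration.

```python
def gradient_descent(learning_rate, steps, start=1.0):
    """Run plain gradient descent on f(x) = x**2 and return the final x."""
    x = start
    for _ in range(steps):
        x -= learning_rate * 2 * x  # gradient of x**2 is 2x
    return x


slow = gradient_descent(0.001, 50)     # too small: barely moves toward the minimum at 0
good = gradient_descent(0.1, 50)       # reasonable: ends very close to 0
diverged = gradient_descent(1.1, 50)   # too high: each step overshoots and |x| grows
```

Each step multiplies x by (1 - 2 * learning_rate). That factor is close to 1 for the tiny rate, 0.8 for the reasonable rate, and -1.2 for the large rate, so the last run moves further from the minimum on every step.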

How to build a nonlinear logistic regression

add more polynomial features

We can build non-linear decision boundary by:

add powers (nonlinear versions) of each feature as new features, then train a linear model on this extended set of features. This is called polynomial regression.
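A minimal sketch of the idea for a single feature. The helper names and the weights are hypothetical, picked by hand rather than learned. The model is still linear in the extended features [x, x^2], but the resulting boundary is non-linear in x.

```python
def add_powers(x, degree):
    """Return [x, x**2, ..., x**degree] for a single scalar feature."""
    return [x ** d for d in range(1, degree + 1)]


def linear_score(theta0, theta, features):
    """Ordinary linear model applied to the extended feature vector."""
    return theta0 + sum(t * f for t, f in zip(theta, features))


# Hypothetical weights: score = x**2 - 4, which is positive only outside [-2, 2].
theta0, theta = -4.0, [0.0, 1.0]
predictions = [linear_score(theta0, theta, add_powers(x, 2)) > 0 for x in (-3.0, 0.0, 3.0)]
```

The points at -3 and 3 land in the positive class and 0 in the negative class, a split that no single threshold on x alone could produce.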

Machine Learning works very well for
- problems for which existing solutions require a lot of hand-tuning or long lists of rules
- complex problems for which there is no good solution at all using a traditional programming approach
- changing environments which require resilience and generalization
- All of the above

all of the above

Which of the following is true about "Ridge" or "Lasso" regression methods in case of feature selection?
(a) Ridge regression uses subset selection of features
(b) Lasso regression uses subset selection of features
(c) Both use subset selection of features
(d) None of the above

b
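A sketch of why Lasso can act as feature selection while Ridge does not. It assumes the simplified orthonormal-design case, where each penalized coefficient is a simple function of the ordinary least-squares coefficient; the exact constants depend on how the penalty is scaled. Function names and values are hypothetical.

```python
def ridge_shrink(beta, lam):
    """Ridge under an orthonormal design: every coefficient is scaled down but stays non-zero."""
    return [b / (1.0 + lam) for b in beta]


def lasso_shrink(beta, lam):
    """Lasso under an orthonormal design: soft-thresholding, small coefficients become exactly zero."""
    shrunk = []
    for b in beta:
        magnitude = max(abs(b) - lam, 0.0)
        shrunk.append(magnitude if b >= 0 else -magnitude)
    return shrunk


ols = [3.0, -0.5, 0.25]          # assumed least-squares coefficients
ridge = ridge_shrink(ols, 1.0)   # all three features keep a non-zero weight
lasso = lasso_shrink(ols, 1.0)   # only the first feature keeps a non-zero weight
```

Features whose Lasso coefficient is driven to zero drop out of the model, which is the subset-selection behavior the answer refers to.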

Which statement is WRONG about PCA:
(a) PCA seeks linear transformations to a lower dimensional space.
(b) PCA's first component tries to capture the least of the data variance.
(c) PCA is an unsupervised method.
(d) PCA can be used as a pre-processing step for other machine learning algorithms.

b

types of supervised learning

classification (discrete output) and regression (continuous output)

feature engineering

coming up with a good set of features to train on. It involves:
- feature selection: selecting the most useful features to train on among existing features
- feature extraction: combining existing features to produce a more useful one
- creating new features by gathering new data

Benefits of vectorization

easier, faster computation, less expensive, handles more data

T/F: the least squares cost function of Linear Regression can have multiple minima (in other words, there is no guarantee that the least squares cost function will be convex)

false

Why is overfitting bad?

inaccurate, won't generalize for test data, misleading

What does it mean to have a convex function?

one global minimum, bowl shape

_________ algorithms can be used to train systems on huge datasets that cannot fit in one machine's main memory. The algorithm loads part of the data, runs a training step on that data, and repeats the process until it has run on all of the data.

online learning
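The load-a-part, train-a-step loop can be sketched as follows. This is an illustration only: the in-memory list stands in for data that would really be read from disk in chunks, and the one-parameter model y ≈ w * x, the function names, and the learning rate are all assumptions.

```python
def mini_batches(data, batch_size):
    """Yield successive chunks of the data; a real system would stream these from storage."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


def train_online(stream, learning_rate=0.1):
    """Fit y ~ w * x with one gradient step per mini-batch, keeping only w between batches."""
    w = 0.0
    for batch in stream:
        # Gradient of the mean squared error over this batch only.
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= learning_rate * grad
    return w


# Toy data generated from y = 3x.
data = [(x, 3.0 * x) for _ in range(100) for x in (0.5, 1.0, 1.5)]
w = train_online(mini_batches(data, batch_size=3))
```

Only one batch and the current weight are needed at any moment, which is what lets this style of training handle datasets larger than main memory. On this toy data `w` ends up very close to 3.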

High variance indicates:

overfitting

overfitting vs underfitting

overfitting: the model performs well on training data but does not generalize to unseen data
underfitting: the opposite of overfitting. The model is too simple to capture the structure of the data

machine learning practice

preparation, representation, optimization, evaluation

What does the hypothesis function of logistic regression mean?

probability

How to solve/minimize overfitting

regularization, add more data, fewer features

Which of the following systems involve an agent performing actions in an environment and receiving rewards/penalties?
- Semi-Supervised Learning
- Model based Learning
- Reinforcement Learning

reinforcement learning

When is the accuracy metric not the best indicator?

when data is skewed/unbalanced
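A small sketch of the problem, assuming a hypothetical dataset of 100 patients with 5 positives and a classifier that always predicts the negative class.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Fraction of actual positives that were predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)


y_true = [1] * 5 + [0] * 95   # 5 positive cases, 95 negative cases
y_pred = [0] * 100            # classifier that always predicts "negative"

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
```

Accuracy comes out at 0.95 even though recall is 0.0: the classifier never finds a single positive case, which accuracy alone hides on skewed data.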

