Machine Learning Exam 1
how to solve overfitting
- regularization - add more training data - simplify the model - reduce noise in the training data
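A minimal sketch of the regularization option, assuming scikit-learn is available and using synthetic data (the names and degree are illustrative, not from the exam): an unregularized high-degree polynomial fit tends to overfit, while the same model with a ridge penalty generalizes better.

```python
# Sketch: compare an unregularized polynomial fit to a ridge-regularized one.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, size=(30, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=30)   # noisy target

plain = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
ridge = make_pipeline(PolynomialFeatures(degree=15), Ridge(alpha=1.0))

# Less negative cross-validated MSE indicates better generalization.
print(cross_val_score(plain, X, y, cv=5, scoring="neg_mean_squared_error").mean())
print(cross_val_score(ridge, X, y, cv=5, scoring="neg_mean_squared_error").mean())
```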
how to solve underfitting
- use a more powerful model - feed it better/more relevant features - reduce constraints on the model
A perfect classifier will have ROC AUC close to:
1
Logistic regression is a ML algorithm used for:
Classification
Which is better: hold out method or cross-validation?
Cross-validation
What method do we use to find the best fit line for data in Linear Regression?
Least Square Error
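For reference, the least squares (squared error) cost that the best-fit line minimizes can be written as below; this uses one common convention (some texts omit the 1/2m factor):

$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2, \qquad h_\theta(x) = \theta_0 + \theta_1 x$$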
Are gradient descent algorithms for logistic regression and linear regression identical?
No
Consider a point that is correctly classified and distant from the decision boundary. (a) SVM's decision boundary might be unaffected by this point, but the one learned by logistic regression will be affected. (b) SVM's decision boundary will be affected by this point, and the one learned by logistic regression will be affected. (c) SVM's decision boundary might be unaffected by this point, but the one learned by logistic regression will be unaffected. (d) SVM's decision boundary will be affected by this point, but the one learned by logistic regression will be unaffected.
a
Suppose we want to compute 10-Fold Cross-Validation error on 100 training examples. We need to compute error N1 times, and the Cross-Validation error is the average of the errors. To compute each error, we need to build a model with data of size N2, and test the model on the data of size N3. What are the appropriate numbers for N1, N2, N3? (a) N1 = 1, N2 = 90, N3 = 10 (b) N1 = 10, N2 = 100, N3 = 10 (c) N1 = 10, N2 = 90, N3 = 10 (d) N1 = 10, N2 = 100, N3 = 100
c
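A minimal sketch (assuming scikit-learn) confirming the split sizes behind answer (c): 10-fold CV on 100 examples trains on 90 and tests on 10 in each of the 10 folds, and the CV error is the average of the 10 fold errors.

```python
# Sketch: fold sizes for 10-fold cross-validation on 100 examples.
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(100).reshape(-1, 1)
for train_idx, test_idx in KFold(n_splits=10).split(X):
    print(len(train_idx), len(test_idx))   # prints "90 10" ten times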
To maximize the margin in SVM, we want to: (a) $\max_\theta \frac{1}{2}\|\theta\|^2$ such that $\theta^T x_i \le 1$ if $y_i = 1$ and $\theta^T x_i \ge -1$ if $y_i = -1$ (b) $\min_\theta \frac{1}{2}\|\theta\|^2$ such that $\theta^T x_i \ge 1$ if $y_i = 1$ and $\theta^T x_i \le -1$ if $y_i = -1$ (c) $\min_\theta \frac{1}{2}\|\theta\|^2$ such that $\theta^T x_i \ge 1$ if $y_i = 1$ and $\theta^T x_i \le -1$ if $y_i = -1$ (d) None of the above
c
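To see why option (c) maximizes the margin: under those constraints the two margin hyperplanes are $\theta^T x = 1$ and $\theta^T x = -1$, whose separation is $2/\|\theta\|$, so maximizing the margin is equivalent to minimizing $\frac{1}{2}\|\theta\|^2$:

$$\text{margin} = \frac{2}{\|\theta\|} \quad\Longrightarrow\quad \max_\theta \frac{2}{\|\theta\|} \;\equiv\; \min_\theta \frac{1}{2}\|\theta\|^2 \quad \text{s.t. } y_i\,\theta^T x_i \ge 1 \ \ \forall i$$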
instance vs model based learning
instance-based (the most trivial form of learning): for a new sample never seen before, apply a similarity measure against all the training data and find the most similar instances. model-based (what we're mostly using): use all the data in the training set to train a model, then use that model to make predictions
Why can logistic regression hypothesis output range only from 0 to 1?
it is a probability
gradient descent techniques that can be scalable to very large datasets
stochastic GD
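Stochastic GD scales because each update uses a single example, so the step cost does not grow with dataset size. A minimal NumPy sketch (illustrative data and variable names) contrasting one batch update with one stochastic pass for linear regression:

```python
# Sketch: one epoch of batch vs. stochastic gradient descent (squared-error loss).
import numpy as np

rng = np.random.RandomState(0)
X = np.c_[np.ones(200), rng.randn(200, 1)]        # bias column + one feature
y = 4 + 3 * X[:, 1] + rng.randn(200)
lr = 0.1

# Batch GD: one update per pass, gradient over ALL examples.
theta_batch = np.zeros(2)
grad = X.T @ (X @ theta_batch - y) / len(y)
theta_batch -= lr * grad

# Stochastic GD: one cheap update per example, so it scales to huge datasets.
theta_sgd = np.zeros(2)
for i in rng.permutation(len(y)):
    grad_i = X[i] * (X[i] @ theta_sgd - y[i])
    theta_sgd -= lr * grad_i

print(theta_batch, theta_sgd)
```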
A machine learning program that is trained from labeled data and filters spam emails should be modeled using
supervised learning
batch learning
The system cannot learn incrementally: it must be trained using all the available data at once, which takes a lot of time and computing resources. The system is trained, then launched and applies what it has learned.
Models that overfit fail to generalize on:
testing set
online learning
Train the system incrementally by feeding it data instances individually or in small groups called mini-batches. Fast and cheap, and great if the system receives a continuous flow of data. Saves a lot of space because once the system has learned from new data, it no longer needs to keep that data.
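A minimal sketch of online learning, assuming scikit-learn (SGDClassifier's partial_fit) and a synthetic stream of mini-batches; the batch size and loss name are illustrative (newer scikit-learn versions call the logistic loss "log_loss"):

```python
# Sketch: incremental (online) learning with partial_fit on mini-batches.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
model = SGDClassifier(loss="log_loss")
classes = np.array([0, 1])                 # all classes must be declared up front

for _ in range(100):                       # e.g., a continuous stream of data
    X_batch = rng.randn(32, 5)
    y_batch = (X_batch[:, 0] > 0).astype(int)
    model.partial_fit(X_batch, y_batch, classes=classes)   # incremental update
```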
T/F: Feature scaling using techniques like mean normalization can make gradient descent converge faster
true
Which is true for the hold-out method? 1. It always estimates the true error rate accurately 2. An unfortunate split (test/train) can give a misleading result 3. It is a good choice for sparse datasets
2
T/F: Even though Stochastic Gradient Descent is faster at updating the model parameters (compared to Batch Gradient Descent), Batch Gradient Descent takes a longer time to converge
False
T/F: If a learning algorithm is suffering from high bias, getting more training data is likely to help
False
T/F: If we use feature scaling, it's guaranteed that gradient descent-based algorithms will converge
False
T/F: In case of 3-way splits, we don't need to keep the test set separated
False
T/F: In terms of speed and results, Grid search is more efficient than Random search
False
T/F: It is okay to tune your model after testing
False
T/F: It is recommended to use linear regression for classification tasks
False
T/F: K-means automatically adjusts the number of clusters
False
T/F: Leave-one-out cross-validation generally gives less accurate estimates of true test error than 10-fold cross-validation
False
T/F: Overfitting is more likely when you have a huge amount of data.
False
T/F: Random forest is an ensemble learning method that attempts to lower the bias error of decision trees
False
T/F: Standardizing features is required before training a Linear Regression
False
T/F: The largest eigenvector of the covariance matrix is the direction of minimum variance in the data
False
T/F: The linear regression cost function for logistic regression is convex.
False
T/F: The output of a classification system is a continuous value and the output for a regression system is discrete
False
T/F: consider a cancer diagnosis classification problem where almost all the people being diagnosed don't have cancer. The probability of correct classification is the most important metric to optimize.
False
Why use feature scaling?
Faster
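A minimal sketch (assuming scikit-learn, with made-up numbers) of why scaling helps: standardization puts features on a comparable scale, which typically lets gradient descent converge faster.

```python
# Sketch: standardize features that live on very different scales.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 2000.0],
              [2.0, 3000.0],
              [3.0, 4000.0]])              # features on very different scales

X_scaled = StandardScaler().fit_transform(X)        # (x - mean) / std per column
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))  # ~0 mean, unit variance
```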
When do you need to change or modify your model?
High variance or high bias
When should you use regularization?
Optimization
Logistic regression hypothesis means:
Probability of the positive class
When is one leave out useful?
Sparse dataset
T/F: Adding more features to a linear regression model always increases model variance
True
T/F: Building a nonlinear regression with a high polynomial degree could result in overfitting.
True
T/F: By normalizing data, the gradient descent algorithm will converge faster
True
T/F: Consider a cancer diagnosis classification problem where almost all the people being diagnosed don't have cancer. For this dataset, Receiver Operating Characteristic (ROC) curve can measure the true performance.
True
T/F: Decision trees with depth one will always give a linear decision boundary
True
T/F: Dimensionality reduction can be used as pre-processing for machine learning algorithms like decision trees, SVM, etc
True
T/F: Early stopping helps in case of high variance
True
T/F: For k-means algorithm, finding the optimal number of clusters using Silhouette Score method is more accurate than the elbow method
True
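A minimal sketch (assuming scikit-learn, with synthetic blob data) of choosing K by silhouette score, which gives a numeric criterion instead of reading the elbow by eye:

```python
# Sketch: pick K for K-Means via silhouette score (higher is better).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```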
T/F: For a linear SVM with a large C = infinity value, given a linearly separable dataset, the training error is guaranteed to be zero
True
T/F: For two runs of K-Means with the same K and initialized from the same cluster centroids, it is expected that both runs give the same clustering results
True
T/F: Given m data points, the training error converges to the true error as m → ∞
True
T/F: In SVMs, the values of αi for non-support vectors are 0
True
T/F: In case of high bias, increasing the model size is an option
True
T/F: In gradient based algorithms, early stopping criteria can work as regularization.
True
T/F: In gradient descent based algorithms, choosing a small learning rate may cause the model to fail to converge
True
T/F: In the closed form solution, we don't have to choose the learning rate.
True
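A minimal NumPy sketch (illustrative data) of the closed-form normal equation: no learning rate or iterations are needed, though the cost grows quickly with the number of features because of the matrix solve.

```python
# Sketch: closed-form linear regression, theta = (X^T X)^{-1} X^T y.
import numpy as np

rng = np.random.RandomState(0)
X = np.c_[np.ones(100), rng.randn(100, 3)]     # bias column + 3 features
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.randn(100) * 0.1

theta = np.linalg.solve(X.T @ X, X.T @ y)      # solve instead of explicit inverse
print(theta)
```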
T/F: It is possible to apply a logistic regression algorithm on a 3-class classification problem
True
T/F: Learning curve is one of the ways to solve bias variance tradeoff
True
T/F: Linear Regression is a supervised machine learning algorithm
True
T/F: Linear Regression is mainly used for Regression
True
T/F: Logistic Regression can be used for classification
True
T/F: Logistic Regression with no polynomial features will always give a linear decision boundary
True
T/F: Machine Learning is a branch of Artificial Intelligence (AI), that provides systems with ability to learn and improve from experience without being explicitly programmed
True
T/F: Making a decision tree deeper will assure better fit but reduce robustness/generalization
True
T/F: ROC curve is similar to Precision-Recall curve. It plots sensitivity vs 1-specificity.
True
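A minimal sketch (assuming scikit-learn, with toy labels and scores) of computing the quantities the ROC curve plots: true positive rate (sensitivity) against false positive rate (1 - specificity), with AUC close to 1 for a near-perfect classifier.

```python
# Sketch: ROC points and AUC from predicted scores.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])   # predicted probabilities
fpr, tpr, thresholds = roc_curve(y_true, scores)
print(fpr, tpr, roc_auc_score(y_true, scores))
```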
T/F: Results achieved by supervised learning are much better than unsupervised learning
True
T/F: Subsequent principal components are always orthogonal to each other
True
T/F: Support vector machines, like logistic regression models, give a probability distribution over the possible labels given an input example
True
T/F: The only way to solve logistic regression problem is using gradient based solutions as there is no known analytical way to solve it
True
T/F: The performance on the testing set is much more important than performance on training set
True
T/F: We can get a single global minimum if we solve a linear regression problem by minimizing the sum of squared errors using gradient descent
True
T/F: When the feature space is larger, overfitting is more likely
True
Is it a good idea to do cross validation repeatedly?
Yes
Which one of these statements is TRUE about decision tree: (a) A tree with depth of 3 has higher variance than a tree with depth of 1. (b) A tree with depth of 3 has higher bias than a tree with depth 1. (c) A tree with depth of 3 always has higher training error than a tree with depth 1. (d) A tree with depth of 3 never has higher test error than a tree with depth 1.
a
The run time of closed form solution of Linear Regression grows [a] with the number of features and [b] with the number of data points
a. quickly (roughly cubically, because of the matrix inversion) b. linearly
In the gradient descent approach, if the learning rate is too [a], then the algorithm will have to go through many iterations to converge. If the learning rate is too [b], the algorithm may diverge, failing to find a good solution
a. small b. high
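For reference, the gradient descent update that the learning rate $\alpha$ controls is:

$$\theta_j \leftarrow \theta_j - \alpha\,\frac{\partial}{\partial \theta_j} J(\theta)$$

A small $\alpha$ means many tiny steps (slow convergence); a large $\alpha$ can overshoot the minimum and diverge.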
How to build a nonlinear logistic regression
add more polynomial features
We can build non-linear decision boundary by:
Add powers (nonlinear versions) of each feature as new features, then train a linear model on this extended set of features. This is called polynomial regression.
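A minimal sketch of the same idea for classification, assuming scikit-learn and a synthetic concentric-circles dataset (the degree is illustrative): adding polynomial feature combinations lets a linear classifier like logistic regression learn a nonlinear decision boundary.

```python
# Sketch: nonlinear decision boundary via polynomial features + logistic regression.
from sklearn.datasets import make_circles
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)
clf = make_pipeline(PolynomialFeatures(degree=3), LogisticRegression(max_iter=1000))
print(clf.fit(X, y).score(X, y))   # plain logistic regression cannot separate circles
```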
Machine Learning works very well for - problems for which existing solutions require a lot of hand-tuning or long lists of rules - complex problems for which there is no good solution at all using a traditional programming approaches - changing environments which require resilience and generalization - All of the above
all of the above
Which of the following is true about "Ridge" or "Lasso" regression methods in case of feature selection? (a) Ridge regression uses subset selection of features (b) Lasso regression uses subset selection of features (c) Both use subset selection of features (d) None of the above
b
Which statement is WRONG about PCA: (a) PCA seeks for linear transformations to a lower dimensional space. (b) PCA first component tries to capture the least of data variance. (c) PCA is an unsupervised method. (d) PCA can be used as pre-processing step for other machine learning algorithms.
b
types of supervised learning
classification (discrete output) and regression (continuous output)
feature engineering
coming up with a good set of features to train on involves: - feature selection: selecting the most useful features to train on among existing features. - feature extraction: combining existing features to produce a more useful one - creating new features by gathering new data
Benefits of vectorization
easier, faster computation, less expensive, handles more data
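A minimal NumPy sketch (array sizes are illustrative) of the point: the same dot product written as an explicit Python loop versus a single vectorized call, where the vectorized form is shorter and much faster on large arrays.

```python
# Sketch: loop vs. vectorized dot product.
import numpy as np

a = np.random.rand(100_000)
b = np.random.rand(100_000)

total = 0.0
for x, w in zip(a, b):      # loop version: one multiply-add at a time
    total += x * w

vectorized = a @ b          # vectorized version: single optimized call
print(np.isclose(total, vectorized))
```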
T/F: the least squares cost function of Linear Regression can have multiple minima (in other words, there is no guarantee that the least squares cost function will be convex)
false
Why is overfitting bad?
inaccurate; won't generalize to test/unseen data; misleading
What does it mean to have a convex function?
a single global minimum; bowl shaped
_________ algorithms can be used to train systems on huge datasets that cannot fit in one machine's main memory. The algorithm loads part of the data, runs a training step on that data, and repeats the process until it has run on all of the data.
online learning
High variance indicates:
overfitting
overfitting vs underfitting
overfitting: model performs well on training data but does not generalize to unseen data. underfitting: opposite of overfitting; model is too simple to learn the underlying structure of the data
machine learning practice
preparation, representation, optimization, evaluation
What does the hypothesis function of logistic regression mean?
probability
How to solve/minimize overfitting
regularization, add more data, less features
Which of the following systems involve an agent performing actions in an environment receiving reward/penalties? - Semi-Supervised Learning - Model based Learning - Reinforcement Learning
reinforcement learning
When is the accuracy metric not the best indicator?
when data is skewed/unbalanced
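A minimal sketch (assuming scikit-learn, with made-up label counts) of why accuracy misleads on skewed data: a model that always predicts the majority class looks accurate but catches no positive cases, which is why recall/precision or ROC are better indicators here.

```python
# Sketch: high accuracy but zero recall on an imbalanced dataset.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([0] * 990 + [1] * 10)    # 1% positive class (e.g., has cancer)
y_pred = np.zeros_like(y_true)             # always predict "no cancer"
print(accuracy_score(y_true, y_pred))      # 0.99, looks great
print(recall_score(y_true, y_pred))        # 0.0, catches no positive cases
```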