Machine learning
Which of the following statements is NOT an advantage of CART (Classification and Regression Trees)? 1. Interpretability 2. Handling categorical variables without the need for one-hot encoding 3. Using a greedy algorithm which can get stuck in a local optimum 4. Handling non-linear data sets
3. Using a greedy algorithm which can get stuck in a local optimum
How do you summarize a hierarchical clustering algorithm? A. Start with each point in its own cluster B. Repeat until all points are in a single cluster C. Merge the clusters D. Identify the two closest clusters based on a distance metric (Euclidean, Manhattan, etc.). This is our dissimilarity metric.
A - D - C - B
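For concreteness, here is a minimal Python sketch of steps A-D using SciPy; the data, linkage method, and cluster count are illustrative assumptions:

```python
# A minimal sketch of agglomerative hierarchical clustering with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)  # toy data (assumption)

# Step A: each point starts in its own cluster (handled internally).
# Steps B-D: linkage() repeatedly finds the two closest clusters under
# the chosen distance metric and merges them until one cluster remains.
Z = linkage(X, method="average", metric="euclidean")

# Cut the resulting tree into, say, 3 flat clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```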
Which of the following statements are correct? A. SVC can handle separable data B. SVC can handle noisy data (outliers) by using a soft margin C. SVC can handle non-separable data
A and B are correct, C is incorrect
A) ----------- is an example of an unsupervised dimension reduction algorithm. B) ----------- is an example of a supervised dimension reduction technique. C) ----------- are two of the most common unsupervised clustering techniques.
A) PCA B) Lasso regression C) K-means and hierarchical clustering
The step size in a gradient descent algorithm to find the parameters of a simple regression model is determined by the slope multiplied by the learning rate! (T/F)
True
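A minimal sketch of that update rule for simple linear regression; the toy data and learning rate are illustrative assumptions:

```python
# A minimal sketch of gradient descent for simple linear regression,
# showing step = gradient ("slope") * learning rate.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

intercept, slope, lr = 0.0, 0.0, 0.01  # initial guesses and learning rate
for _ in range(1000):
    resid = y - (intercept + slope * x)
    # Gradients of the residual sum of squares w.r.t. each parameter.
    g_intercept = -2 * resid.sum()
    g_slope = -2 * (resid * x).sum()
    # Each step moves by the gradient times the learning rate.
    intercept -= lr * g_intercept
    slope -= lr * g_slope

print(intercept, slope)
```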
The two main types of hierarchical clustering techniques are:
Agglomerative clustering (bottom-up approach) - Divisive clustering (top-down approach)
Which of the following statements is NOT correct about AdaBoost? A. The weak learners are ALL decision stumps B. Each stump depends on the previous stump's errors rather than being independent. C. All the instances (training observations) get the SAME weight during the entire AdaBoost process. D. Misclassified observations are assigned HIGHER weights
All the instances (training observations) get the SAME weight during the entire AdaBoost process.
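A hedged sketch of just the weight-update step (toy labels and one stump's predictions, not a full AdaBoost implementation):

```python
# A minimal sketch of AdaBoost's instance-weight update.
import numpy as np

y = np.array([1, -1, 1, 1, -1])      # true labels (assumption)
pred = np.array([1, -1, -1, 1, -1])  # one stump's predictions (assumption)
w = np.full(len(y), 1 / len(y))      # start with equal weights

err = w[pred != y].sum()             # weighted error of the stump
alpha = 0.5 * np.log((1 - err) / err)  # the stump's "amount of say"

# Misclassified observations get HIGHER weight, correct ones lower;
# the next stump therefore focuses on the previous stump's mistakes.
w *= np.exp(-alpha * y * pred)
w /= w.sum()                         # renormalize
print(w)
```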
Which of the following statements is NOT correct about ensemble learning? A. Ensemble learning is a method that is used to enhance the performance of a machine learning model by combining several learners. B. The main idea in ensemble learning is to learn one super accurate model, instead of focusing on training many low accuracy models. C. It combines the predictions from a collection of models. D. Ensemble learning typically produces more accurate and more stable predictions than the best single model.
B. The main idea in ensemble learning is to learn one super accurate model, instead of focusing on training many low accuracy models.
Bootstrapping the data and using the aggregate to make a decision is called?
Bagging
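A minimal bagging sketch, assuming scikit-learn is available; the tree count and data set are illustrative:

```python
# A minimal sketch of bagging: bootstrap the data, fit one tree per
# sample, then aggregate the predictions by majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))  # bootstrap sample (with replacement)
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Aggregate: majority vote across the bootstrapped trees.
votes = np.array([t.predict(X) for t in trees])
y_hat = (votes.mean(axis=0) > 0.5).astype(int)
print((y_hat == y).mean())
```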
Which of the following statements are correct? 1. Residual sum of squares is an example of a loss function! 2. The goal is to minimize the loss function by finding the best parameter estimates.
Both 1 and 2
Which of the following statements are correct with respect to PCA vs Clustering: A. PCA looks for a low-dimensional representation of the observations that explains a good fraction of the variance. B. Clustering looks for homogeneous subgroups among the observations.
Both A and B
Which of the following statements are correct: A. K-nearest neighbors (KNN) is one of the simplest and best-known non-parametric supervised learning techniques, most often used for classification. B. Contrary to other learning algorithms that allow discarding the training data after the model is built, KNN keeps all training examples in memory.
Both A and B
Which of the following is NOT an advantage of a KNN model?
Curse of dimensionality
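A minimal KNN sketch with scikit-learn (hyperparameter choices are illustrative); note the two hyperparameters, K and the distance metric:

```python
# A minimal sketch of a KNN classifier and its two hyperparameters.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean").fit(X, y)
print(knn.predict_proba(X[:3]))  # class probabilities from neighbor votes
```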
Which of the following statements is NOT correct about the Gradient Boosting Machines (GBM)? A. In gradient boosting, each weak learner corrects its predecessor's error. B. Unlike AdaBoost, the weights of the training instances are not tweaked, instead, each predictor is trained using the residual errors of predecessor as labels. C. Unlike AdaBoost, each tree can be larger than a stump. D. It can only be applied to classification problems.
D. It can only be applied to classification problems.
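A minimal sketch of the residual-fitting idea for regression; the toy data, tree depth, and learning rate are illustrative assumptions:

```python
# A minimal sketch of gradient boosting for regression: each small tree
# is fit to the residuals of the current ensemble.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.random.rand(200, 3)
y = X[:, 0] * 3 + np.sin(X[:, 1] * 6)  # toy target

lr, trees = 0.1, []
pred = np.full(len(y), y.mean())       # start from the mean
for _ in range(100):
    resid = y - pred                   # predecessor's errors as labels
    tree = DecisionTreeRegressor(max_depth=2).fit(X, resid)  # larger than a stump
    trees.append(tree)
    pred += lr * tree.predict(X)       # shrink each tree's contribution

print(np.mean((y - pred) ** 2))        # training MSE shrinks as trees are added
```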
Which of the following models does not use a decision tree as a weak learner? A. AdaBoost B. Gradient Boosting Machine C. XGBoost D. Decision Trees
Decision Trees - a DT model relies on a single tree as the only learner in the algorithm. There is no aggregation of weak learners in DT models.
In a simple linear regression model, we can use -------------------- to optimize the parameters of interest, namely the intercept and the slope.
Either ordinary least squares (OLS) or gradient descent (GD)
Which of the following methods is the best way to compare different machine learning classification models when the target variable is highly imbalanced (for example 98% class 1 and only 2% class 2)?
F1-score
The choice of linkage method used in hierarchical clustering (single linkage, average linkage, complete linkage, etc.) does not affect the final set of clusters at all. What matters most is the choice of distance metric (Euclidean, Manhattan, etc.). (T/F)
False
The performance of K-means clustering is independent of where we put the initial starting points! (T/F)
False
By construction, the Maximum Margin Classifier (MMC) is NOT sensitive to outliers in the training data set. (T/F)
False - MMCs are highly sensitive to outliers in general.
In order to decide which feature to begin with and where to put the split, the CART algorithm compares -------------- for classification and ----------------- for regression in each region.
Gini impurity - MSE
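A minimal sketch of both criteria (the function names are our own, not from any library):

```python
# A minimal sketch of the two split criteria CART compares in each
# region: Gini impurity for classification and MSE for regression.
import numpy as np

def gini(labels):
    """Gini impurity of a region: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def mse(values):
    """MSE of a region around its mean prediction."""
    return np.mean((values - values.mean()) ** 2)

print(gini(np.array([0, 0, 1, 1])))    # 0.5, the worst case for two classes
print(mse(np.array([1.0, 2.0, 3.0])))  # variance around the region's mean
```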
The main difference between Bagging and Boosting is: A. In bagging, the bootstrapped trees are independent from each other, but in boosting, each tree is grown using information from the previous tree. B. In boosting, the bootstrapped trees are independent from each other, but in bagging, each tree is grown using information from the previous tree. C. Boosting is a parallel process but bagging is sequential. D. Boosting relies on bootstrapped data but bagging relies on one copy of the training data.
In bagging, the bootstrapped trees are independent from each other, but in boosting, each tree is grown using information from the previous tree.
What is machine learning?
Machine learning is a subset of AI that gives machines the ability to learn automatically and improve from experience without being explicitly programmed.
Q: What are the outputs (predictions) of a CART model for classification and regression trees respectively? (what are the predicted values in each terminal node) A: for classification, the predictions are the ------------- in each region and for regression, the predictions are the ----------------- of the observations in each region.
Majority class - Average value
Principal Component Analysis (PCA) reduces the dimension of the data by --------------- and --------------------. (Projection errors: the perpendicular distances from the PC line. Spread: the variation of the data along the PC line.)
Minimizing the projection errors - Maximizing the spreads
A ----------- shows the proportion of the total variance in the data explained by each principal component.
Scree plot
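A minimal scree-plot sketch with scikit-learn and matplotlib; the data set is an assumption:

```python
# A minimal sketch of a scree plot from sklearn's PCA.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 5)  # toy data (assumption)
pca = PCA().fit(X)

# explained_variance_ratio_ is exactly what a scree plot displays.
plt.plot(range(1, 6), pca.explained_variance_ratio_, marker="o")
plt.xlabel("Principal component")
plt.ylabel("Proportion of variance explained")
plt.show()
```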
------------ is an optimization technique that uses a randomly selected subset of the data at every step rather than the full dataset. This reduces the time spent calculating the derivatives of the loss function.
Stochastic gradient descent
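A minimal SGD sketch in the same spirit as the gradient descent sketch above; the batch size and learning rate are illustrative assumptions:

```python
# A minimal sketch of stochastic gradient descent: each step uses a
# random subset (mini-batch) of the data instead of the full data set.
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(10_000)
y = 2.0 + 3.0 * x + rng.standard_normal(10_000) * 0.1

intercept, slope, lr = 0.0, 0.0, 0.05
for _ in range(2000):
    idx = rng.integers(0, len(x), size=32)  # random subset of the data
    resid = y[idx] - (intercept + slope * x[idx])
    intercept -= lr * (-2 * resid.mean())   # gradients on the batch only
    slope -= lr * (-2 * (resid * x[idx]).mean())

print(intercept, slope)  # close to the true 2.0 and 3.0
```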
XGBoost is a refined and customized version of a gradient boosting decision tree system, focusing on performance and speed. (T/F)
True
The main idea behind the ridge regression is to introduce a small amount of bias to the model in order to get a significant drop in variance. (T/F)
True
At the core of every CART model is the need for the algorithm to decide on two things: 1. Which feature to begin with! 2. Where to put the split! (T/F)
True
Boosting is a process that uses a set of machine learning algorithms to combine weak learners (usually decision trees) to form strong learners in order to increase the accuracy of the model. (T/F)
True
Cross validation allows us to compare different machine learning methods and get a sense of how well they will work in practice! (T/F)
True
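A minimal cross-validation sketch comparing two classifiers with scikit-learn; the models and data set are illustrative:

```python
# A minimal sketch of comparing two models with 5-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
for model in (LogisticRegression(max_iter=1000), KNeighborsClassifier()):
    scores = cross_val_score(model, X, y, cv=5)  # accuracy on each fold
    print(type(model).__name__, scores.mean())
```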
In K-means clustering, we seek to partition the observations into a pre-specified number of clusters. However, in hierarchical clustering, we do not know in advance how many clusters we want. (T/F)
True
Polynomial regression model is a special case of general linear regression models. (T/F)
True - the model is called linear as long as it is linear in the parameters, not in the features.
K-means clustering is an example of --------------- models and K-nearest neighbors is considered a ------------- model.
unsupervised learning - supervised learning
The inability of a machine learning method to capture the true relationship between features and the target variable is called ----- and the difference in fits between data sets is called --------
bias - variance
The objective function in K-mean clustering is to:
minimize the total within-cluster variation
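A minimal sketch: in scikit-learn, inertia_ is exactly this total within-cluster sum of squares (the data and K are illustrative assumptions):

```python
# A minimal sketch of the K-means objective: inertia_ is the total
# within-cluster sum of squares the algorithm tries to minimize.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(150, 2)  # toy data (assumption)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.inertia_)          # total within-cluster variation
```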
One disadvantage of CART is that it may overfit the data, especially if there is no stopping criterion in the algorithm. This will lead to generating a very bushy tree in general. (T/F)
True
As the order of polynomial model increases (more complex model), the model bias will ----------- and model variance will ------------
decrease - increase
From The Elements of Statistical Learning (aka the Bible of Machine Learning and, btw, our main textbook): Trees have one aspect that prevents them from being the ideal tool for predictive learning, namely -------------
inaccuracy
As we increase the hyperparameter K in a KNN model from 1 to 10, the model bias ----------------- and the model variance ----------------
increases - decreases
Which of the following is NOT a solution to the potential overfitting of CART? 1. Pruning back a bushy tree 2. Defining a stopping criterion 3. Optimizing over where to put the split 4. None of the above
Optimizing over where to put the split - this is part of how CART grows the tree and cannot by itself avoid overfitting if there is no stopping criterion.
Suppose you are hired as a data analyst to analyze a COVID data set. You want to use logistic regression and construct a confusion matrix based on the COVID test results. If your goal is to avoid missing too many cases of COVID (avoid false negatives), then A: which of the following probability thresholds would satisfy your objective? B: what is the consequence of that? Hint: if y_hat > threshold, then your prediction is positive; otherwise negative.
A= 0.2 and B: lower precision and higher recall
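A minimal sketch showing why the lower threshold (0.2 vs. 0.5) trades precision for recall; the predicted probabilities here are simulated rather than from a fitted model:

```python
# A minimal sketch of the precision/recall trade-off across thresholds.
import numpy as np
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
proba = 0.3 * y_true + rng.random(200) * 0.7  # fake predicted probabilities

for threshold in (0.5, 0.2):
    y_pred = (proba > threshold).astype(int)  # positive if above threshold
    print(threshold,
          round(precision_score(y_true, y_pred), 2),
          round(recall_score(y_true, y_pred), 2))
```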
Which of the following statements are correct: A. One big difference between linear regression and logistic regression is how the line is fit to the data. B. Maximum likelihood estimation (MLE) is a method of estimating the parameters of a probability distribution by maximizing a likelihood function, so that under the assumed statistical model the observed data is most probable.
Both A and B
Which of the following statements are correct: A. The shortest distance between the observations and the hyperplane is called the margin! B. When we use the hyperplane that gives us the largest margin to make classification, we are using a Maximum Margin Classifier (MMC).
Both A and B
If the target binary variable is relatively balanced, then we use AUC to compare different machine learning classifiers. The model with lower AUC is preferred to a model with higher AUC. (T/F)
False - If the target binary variable is relatively balanced, then we use AUC to compare different machine learning classifiers. The model with higher AUC is preferred to a model with lower AUC.
PC1 is the unique vector that accounts for the smallest proportion of the variance in the initial data. (T/F)
False - PC1 is the unique vector that accounts for the largest proportion of the variance in the initial data.
Regression and Classification methods are different types of Reinforcement learning! (T/F)
False - Regression and Classification methods are different types of supervised learning!
There is only one hyperparameter in a KNN model. (T/F)
False - There are two hyperparameters in KNN: the distance metric and K.
Unlike econometrics, in machine learning the true relationship between variables is known! (T/F)
False - We never know the true relationship between variables, regardless of using a machine learning approach or an econometrics approach. All we can do is estimate the true relationship and hope that we get as close as possible to it!
When the sample sizes are relatively small, the ridge regression can improve predictions in the test data by making them more sensitive to the training data. (T/F)
False - When the sample sizes are relatively small, the ridge regression can improve predictions in the test data by making them less sensitive to the training data.
The outputs of a KNN classification model are categorical (for example 0 or 1 for a binary KNN classifier) (T/F)
False - the outputs are probabilities (the fraction of the K nearest neighbors that belong to each class).
Ridge regression helps reduce the model variance by ------------- parameters and Lasso regression does the same job by -------------- parameters.
Shrinking - dropping
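A minimal sketch of that contrast (simulated data; the penalty strengths are illustrative assumptions):

```python
# A minimal sketch: ridge shrinks coefficients toward zero, while lasso
# can set redundant coefficients exactly to zero.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = 3 * X[:, 0] + rng.standard_normal(100)  # only the first feature matters

print(Ridge(alpha=10).fit(X, y).coef_.round(2))   # all small but nonzero
print(Lasso(alpha=0.5).fit(X, y).coef_.round(2))  # redundant features exactly 0
```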
Eigenvectors show the direction of the principal components and eigenvalues represent their magnitudes. Loosely speaking, the eigenvectors are just linear combinations of the original variables, and the eigenvalue associated with each principal component tells you how much of the variation in the data set that component explains. (T/F)
True
One of the main advantages of Lasso regression over Ridge regression is that with Lasso regression we can reduce the number of features in the model by setting the weights of the redundant features to zero, but with Ridge regression this is impossible. (T/F)
True
Polynomial regression models are more prone to overfit the data compared to other linear regression models not using polynomial features. (T/F)
True
Random Forests combine the simplicity of decision trees with flexibility by creating bootstrapped data sets (a forest of trees), resulting in a vast improvement in accuracy! (T/F)
True
Random forest is an example of ensemble machine learning algorithm meaning that a group of weak learners (individual trees) come together and build a strong learner (the forest) with a better performance in general. (T/F)
True
Support Vector Machines (SVM) use something called kernel functions to systematically find SVCs in higher dimensions without ever transforming the data into that higher dimension! With this trick, SVMs are able to handle non-separable data sets. (T/F)
True
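A minimal sketch of the kernel trick on data that is not linearly separable, using scikit-learn's make_circles (the parameters are illustrative):

```python
# A minimal sketch: an RBF-kernel SVM separates data that no linear
# boundary can, without explicitly mapping it to a higher dimension.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)           # kernel computes higher-dim
print(linear.score(X, y), rbf.score(X, y))  # similarities implicitly
```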
Support vector classifiers (SVC) allow misclassifications by adding a soft margin to the MMC concept. (T/F)
True
The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true-positive rate is also known as sensitivity, recall or probability of detection in machine learning. The false-positive rate is also known as probability of false alarm. (T/F)
True
The goal of unsupervised learning is to discover the underlying patterns and find groups of samples that behave similarly! (T/F)
True
Unsupervised learning is a type of machine learning that does not use labeled data. The two main types of unsupervised machine learning algorithms are dimension reduction and clustering. The idea of unsupervised learning is to find patterns within the data themselves, with no target variable. (T/F)
True
In this type of machine learning, the machines don't get any feedback from the output!
Unsupervised learning
In a random forest learning algorithm, using a bootstrapped sample and considering only a subset of the variables at each step results in a wide variety of trees. Why do you think this variety is what makes random forests more effective than an individual decision tree?
The variety decorrelates the trees: each bootstrapped tree makes somewhat different errors, so when we aggregate their predictions (by voting or averaging), the individual mistakes tend to cancel out. The result is a lower-variance prediction than any single tree could give, without increasing bias much.
Fitting the training data well but making poor predictions on new data is called the ------------
bias-variance trade-off
In the polynomial regression equation y = β0 + β1x + β2x² + ... + βd x^d + ε, d is --------------
both a hyperparameter and the order of the polynomial model
To evaluate the performance of a classification machine learning model, we can use ----------
confusion matrix
Which of the following is not an advantage of a Random Forest model?
easy to interpret and determine variable significance
The ---------- the penalty term used in the ridge regression, the --------------- the fitted line in the training data.
larger - flatter
In linear regression, we fit the line using ---------- and in logistic regression we fit the S curve using --------
least squares - maximum likelihood
What are the support vectors in SVC?
observations on the edge and inside the soft margin
When the data is 1-dimensional, the SVC is a -------------. When the data is 2-dimensional, the SVC is a -------------. When the data is 3-dimensional, the SVC is a --------------. Finally, when the data is in 4 or more dimensions, the SVC is a ---------.
Single point - Line - Plane - Hyperplane
Which of the following is not an advantage of the stochastic gradient descent (SGD) method over simple gradient descent (GD)?
the implementation of GD is simpler, especially for big data
Which one is correct definition of Supervised learning? In supervised learning -------
the machines learn to model relationships based on labeled data!
Out-of-bag error is defined as:
the proportion of out-of-bag samples that were incorrectly classified
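A minimal sketch, assuming scikit-learn's random forest with oob_score enabled:

```python
# A minimal sketch of the out-of-bag error with sklearn's random forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)
rf = RandomForestClassifier(oob_score=True, random_state=0).fit(X, y)
print(1 - rf.oob_score_)  # proportion of OOB samples misclassified
```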
In machine learning, we use -------------- data to train the model and ------------ data to evaluate the performance of our model.
train - test