Machine learning
Which of the following statements is NOT an advantage of CART (Classification and Regression Trees)? 1. Interpretability 2. Handling categorical variables without the need for one-hot encoding 3. Using a greedy algorithm which can get stuck in a local optimum 4. Handling non-linear data sets
3. Using a greedy algorithm which can get stuck in a local optimum
How do you summarize a hierarchical clustering algorithm? A. Start with each point in its own cluster B. Repeat until all points are in a single cluster C. Merge the clusters D. Identify the two closest clusters based on a distance metric (Euclidean, Manhattan, etc.). This is our dissimilarity metric.
A - D - C - B
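For concreteness, here is a minimal Python sketch of steps A-D using SciPy; the data, linkage method, and cluster count are illustrative assumptions:

```python
# A minimal sketch of agglomerative hierarchical clustering with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)  # toy data (assumption)

# Step A: each point starts in its own cluster (handled internally).
# Steps B-D: linkage() repeatedly finds the two closest clusters under
# the chosen distance metric and merges them until one cluster remains.
Z = linkage(X, method="average", metric="euclidean")

# Cut the resulting tree into, say, 3 flat clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```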
Which of the following statements are correct? A. SVC can handle separable data B. SVC can handle noisy data (outliers) by using a soft margin C. SVC can handle non-separable data
A and B are correct, C is incorrect
A) ----------- is an example of an unsupervised dimension reduction algorithm. B) ----------- is an example of a supervised dimension reduction technique. C) ----------- are two of the most common unsupervised clustering techniques.
A) PCA B) Lasso regression C) K-means and hierarchical clustering
The step size in a gradient descent algorithm to find the parameters of a simple regression model is determined by the slope multiplied by the learning rate! (T/F)
True
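A minimal sketch of that update rule for simple linear regression; the toy data and learning rate are illustrative assumptions:

```python
# A minimal sketch of gradient descent for simple linear regression,
# showing step = gradient ("slope") * learning rate.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

intercept, slope, lr = 0.0, 0.0, 0.01  # initial guesses and learning rate
for _ in range(1000):
    resid = y - (intercept + slope * x)
    # Gradients of the residual sum of squares w.r.t. each parameter.
    g_intercept = -2 * resid.sum()
    g_slope = -2 * (resid * x).sum()
    # Each step moves by the gradient times the learning rate.
    intercept -= lr * g_intercept
    slope -= lr * g_slope

print(intercept, slope)
```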
The two main types of hierarchical clustering techniques are:
Agglomerative clustering (bottom-up approach) - Divisive clustering (top-down approach)
Which of the following statements is NOT correct about AdaBoost? A. The weak learners are ALL decision stumps B. Each stump depends on the previous stump's errors rather than being independent. C. All the instances (training observations) get the SAME weight during the entire AdaBoost process. D. Misclassified observations are assigned HIGHER weights
All the instances (training observations) get the SAME weight during the entire AdaBoost process.
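A hedged sketch of just the weight-update step (toy labels and one stump's predictions, not a full AdaBoost implementation):

```python
# A minimal sketch of AdaBoost's instance-weight update.
import numpy as np

y = np.array([1, -1, 1, 1, -1])      # true labels (assumption)
pred = np.array([1, -1, -1, 1, -1])  # one stump's predictions (assumption)
w = np.full(len(y), 1 / len(y))      # start with equal weights

err = w[pred != y].sum()             # weighted error of the stump
alpha = 0.5 * np.log((1 - err) / err)  # the stump's "amount of say"

# Misclassified observations get HIGHER weight, correct ones lower;
# the next stump therefore focuses on the previous stump's mistakes.
w *= np.exp(-alpha * y * pred)
w /= w.sum()                         # renormalize
print(w)
```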
Which of the following statements is NOT correct about ensemble learning? A. Ensemble learning is a method that is used to enhance the performance of a machine learning model by combining several learners. B. The main idea in ensemble learning is to learn one super accurate model, instead of focusing on training many low accuracy models. C. It combines the predictions from a collection of models. D. Ensemble learning typically produces more accurate and more stable predictions than the best single model.
B. The main idea in ensemble learning is to learn one super accurate model, instead of focusing on training many low accuracy models.
Bootstrapping the data and using the aggregate to make a decision is called?
Bagging
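A minimal bagging sketch, assuming scikit-learn is available; the tree count and data set are illustrative:

```python
# A minimal sketch of bagging: bootstrap the data, fit one tree per
# sample, then aggregate the predictions by majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))  # bootstrap sample (with replacement)
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Aggregate: majority vote across the bootstrapped trees.
votes = np.array([t.predict(X) for t in trees])
y_hat = (votes.mean(axis=0) > 0.5).astype(int)
print((y_hat == y).mean())
```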
Which of the following statements are correct? 1. Residual sum of squares is an example of a loss function! 2. The goal is to minimize the loss function by finding the best parameter estimates.
Both 1 and 2
Which of the following statements are correct with respect to PCA vs Clustering: A. PCA looks for a low-dimensional representation of the observations that explains a good fraction of the variance. B. Clustering looks for homogeneous subgroups among the observations.
Both A and B
Which of the following statements are correct: A. K-nearest neighbors (KNN) is one of the simplest and best-known non-parametric supervised learning techniques, most often used for classification. B. Contrary to other learning algorithms that allow discarding the training data after the model is built, KNN keeps all training examples in memory.
Both A and B
Which of the following is NOT an advantage of a KNN model?
Curse of dimensionality
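A minimal KNN sketch with scikit-learn (hyperparameter choices are illustrative); note the two hyperparameters, K and the distance metric:

```python
# A minimal sketch of a KNN classifier and its two hyperparameters.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean").fit(X, y)
print(knn.predict_proba(X[:3]))  # class probabilities from neighbor votes
```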
Which of the following statements is NOT correct about the Gradient Boosting Machines (GBM)? A. In gradient boosting, each weak learner corrects its predecessor's error. B. Unlike AdaBoost, the weights of the training instances are not tweaked, instead, each predictor is trained using the residual errors of predecessor as labels. C. Unlike AdaBoost, each tree can be larger than a stump. D. It can only be applied to classification problems.
D. It can only be applied to classification problems.
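A minimal sketch of the residual-fitting idea for regression; the toy data, tree depth, and learning rate are illustrative assumptions:

```python
# A minimal sketch of gradient boosting for regression: each small tree
# is fit to the residuals of the current ensemble.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.random.rand(200, 3)
y = X[:, 0] * 3 + np.sin(X[:, 1] * 6)  # toy target

lr, trees = 0.1, []
pred = np.full(len(y), y.mean())       # start from the mean
for _ in range(100):
    resid = y - pred                   # predecessor's errors as labels
    tree = DecisionTreeRegressor(max_depth=2).fit(X, resid)  # larger than a stump
    trees.append(tree)
    pred += lr * tree.predict(X)       # shrink each tree's contribution

print(np.mean((y - pred) ** 2))        # training MSE shrinks as trees are added
```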
Which of the following models does not use a decision tree as a weak learner? A. AdaBoost B. Gradient Boosting Machine C. XGBoost D. Decision Trees
Decision Trees - a DT model relies on a single tree as the only learner in the algorithm. There is no aggregation of weak learners in DT models.
In a simple linear regression model, we can use -------------------- to optimize the parameters of interest, namely the intercept and the slope.
Either ordinary least squares (OLS) or gradient descent (GD)
Which of the following methods is the best way to compare different machine learning classification models when the target variable is highly imbalanced (for example 98% class 1 and only 2% class 2)?
F1-score
The choice of linkage method used in hierarchical clustering (single linkage, average linkage, complete linkage, etc.) does not affect the final set of clusters at all. What matters most is the choice of distance metric (Euclidean, Manhattan, etc.). (T/F)
False
The performance of K-means clustering is independent of where we put the initial starting points! (T/F)
False
By construction, the Maximum Margin Classifier (MMC) is NOT sensitive to outliers in the training data set. (T/F)
False - MMCs are highly sensitive to outliers in general.
In order to decide which feature to begin with and where to put the split, the CART algorithm compares -------------- for classification and ----------------- for regression in each region.
Gini impurity - MSE
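A minimal sketch of both criteria (the function names are our own, not from any library):

```python
# A minimal sketch of the two split criteria CART compares in each
# region: Gini impurity for classification and MSE for regression.
import numpy as np

def gini(labels):
    """Gini impurity of a region: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def mse(values):
    """MSE of a region around its mean prediction."""
    return np.mean((values - values.mean()) ** 2)

print(gini(np.array([0, 0, 1, 1])))    # 0.5, the worst case for two classes
print(mse(np.array([1.0, 2.0, 3.0])))  # variance around the region's mean
```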
The main difference between Bagging and Boosting is: A. In bagging, the bootstrapped trees are independent from each other, but in boosting, each tree is grown using information from the previous tree. B. In boosting, the bootstrapped trees are independent from each other, but in bagging, each tree is grown using information from the previous tree. C. Boosting is a parallel process but bagging is sequential. D. Boosting relies on bootstrapped data but bagging relies on one copy of the training data.
In bagging, the bootstrapped trees are independent from each other, but in boosting, each tree is grown using information from the previous tree.
What is machine learning?
Machine learning is a subset of AI that gives machines the ability to learn automatically and improve from experience without being explicitly programmed.
Q: What are the outputs (predictions) of a CART model for classification and regression trees respectively? (what are the predicted values in each terminal node) A: for classification, the predictions are the ------------- in each region and for regression, the predictions are the ----------------- of the observations in each region.
Majority class - Average value
Principal Component Analysis (PCA) reduces the dimension of the data by --------------- and --------------------. (Projection errors: the perpendicular distances from the PC line. Spread: the variation of the data along the PC line.)
Minimizing the projection errors - Maximizing the spreads
A ----------- shows the proportion of the total variance in the data explained by each principal component.
Scree plot
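A minimal scree-plot sketch with scikit-learn and matplotlib; the data set is an assumption:

```python
# A minimal sketch of a scree plot from sklearn's PCA.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 5)  # toy data (assumption)
pca = PCA().fit(X)

# explained_variance_ratio_ is exactly what a scree plot displays.
plt.plot(range(1, 6), pca.explained_variance_ratio_, marker="o")
plt.xlabel("Principal component")
plt.ylabel("Proportion of variance explained")
plt.show()
```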
------------ is an optimization technique that uses a randomly selected subset of the data at every step rather than the full dataset. This reduces the time spent calculating the derivatives of the loss function.
Stochastic gradient descent
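A minimal SGD sketch in the same spirit as the gradient descent sketch above; the batch size and learning rate are illustrative assumptions:

```python
# A minimal sketch of stochastic gradient descent: each step uses a
# random subset (mini-batch) of the data instead of the full data set.
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(10_000)
y = 2.0 + 3.0 * x + rng.standard_normal(10_000) * 0.1

intercept, slope, lr = 0.0, 0.0, 0.05
for _ in range(2000):
    idx = rng.integers(0, len(x), size=32)  # random subset of the data
    resid = y[idx] - (intercept + slope * x[idx])
    intercept -= lr * (-2 * resid.mean())   # gradients on the batch only
    slope -= lr * (-2 * (resid * x[idx]).mean())

print(intercept, slope)  # close to the true 2.0 and 3.0
```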
XGBoost is a refined and customized version of a gradient boosting decision tree system, focusing on performance and speed. (T/F)
True
The main idea behind the ridge regression is to introduce a small amount of bias to the model in order to get a significant drop in variance. (T/F)
True
At the core of every CART model is the need for the algorithm to decide on two things: 1. Which feature to begin with! 2. Where to put the split! (T/F)
True
Boosting is a process that uses a set of machine learning algorithms to combine weak learners (usually decision trees) to form strong learners in order to increase the accuracy of the model. (T/F)
True
Cross validation allows us to compare different machine learning methods and get a sense of how well they will work in practice! (T/F)
True
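A minimal cross-validation sketch comparing two classifiers with scikit-learn; the models and data set are illustrative:

```python
# A minimal sketch of comparing two models with 5-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
for model in (LogisticRegression(max_iter=1000), KNeighborsClassifier()):
    scores = cross_val_score(model, X, y, cv=5)  # accuracy on each fold
    print(type(model).__name__, scores.mean())
```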
In K-means clustering, we seek to partition the observations into a pre-specified number of clusters. However, in hierarchical clustering, we do not know in advance how many clusters we want. (T/F)
True
Polynomial regression model is a special case of general linear regression models. (T/F)
True - the model is called linear as long as it is linear in the parameters, not in the features.
K-means clustering is an example of --------------- models and K-nearest neighbors is considered a ------------- model.
unsupervised learning - supervised learning
The inability of a machine learning method to capture the true relationship between features and the target variable is called ----- and the difference in fits between data sets is called --------
bias - variance
The objective function in K-mean clustering is to:
minimize the total within-cluster variation
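A minimal sketch: in scikit-learn, inertia_ is exactly this total within-cluster sum of squares (the data and K are illustrative assumptions):

```python
# A minimal sketch of the K-means objective: inertia_ is the total
# within-cluster sum of squares the algorithm tries to minimize.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(150, 2)  # toy data (assumption)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.inertia_)          # total within-cluster variation
```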
One disadvantage of CART is that it may overfit the data, especially if there is no stopping criterion in the algorithm. This will lead to generating a very bushy tree in general. (T/F)
True
As the order of polynomial model increases (more complex model), the model bias will ----------- and model variance will ------------
decrease - increase
From The Elements of Statistical Learning (aka the Bible of Machine Learning and, btw, our main textbook): Trees have one aspect that prevents them from being the ideal tool for predictive learning, namely -------------
inaccuracy
As we increase the hyperparameter K in a KNN model from 1 to 10, the model bias ----------------- and the model variance ----------------
increases - decreases
Which of the following is NOT a solution to the potential overfitting of CART? 1. Pruning back a bushy tree 2. Defining a stopping criterion 3. Optimizing over where to put the split 4. None of the above
Optimizing over where to put the split - this is part of how CART grows the tree and cannot by itself avoid overfitting if there is no stopping criterion.
Suppose you are hired as a data analyst to analyze a COVID data set. You want to use logistic regression and construct a confusion matrix based on the COVID test results. If your goal is to avoid missing too many cases of COVID (avoid false negatives), then A: which of the following probability thresholds would satisfy your objective? B: what is the consequence of that? Hint: if y_hat > threshold, then your prediction is positive; otherwise negative.
A= 0.2 and B: lower precision and higher recall
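A minimal sketch showing why the lower threshold (0.2 vs. 0.5) trades precision for recall; the predicted probabilities here are simulated rather than from a fitted model:

```python
# A minimal sketch of the precision/recall trade-off across thresholds.
import numpy as np
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
proba = 0.3 * y_true + rng.random(200) * 0.7  # fake predicted probabilities

for threshold in (0.5, 0.2):
    y_pred = (proba > threshold).astype(int)  # positive if above threshold
    print(threshold,
          round(precision_score(y_true, y_pred), 2),
          round(recall_score(y_true, y_pred), 2))
```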
Which of the following statements are correct: A. One big difference between linear regression and logistic regression is how the line is fit to the data. B. Maximum likelihood estimation (MLE) is a method of estimating the parameters of a probability distribution by maximizing a likelihood function, so that under the assumed statistical model the observed data is most probable.
Both A and B
Which of the following statements are correct: A. The shortest distance between the observations and the hyperplane is called the margin! B. When we use the hyperplane that gives us the largest margin to make classification, we are using a Maximum Margin Classifier (MMC).
Both A and B
If the target binary variable is relatively balanced, then we use AUC to compare different machine learning classifiers. The model with lower AUC is preferred to a model with higher AUC. (T/F)
False - If the target binary variable is relatively balanced, then we use AUC to compare different machine learning classifiers. The model with higher AUC is preferred to a model with lower AUC.
PC1 is the unique vector that accounts for the smallest proportion of the variance in the initial data. (T/F)
False - PC1 is the unique vector that accounts for the largest proportion of the variance in the initial data.
Regression and Classification methods are different types of Reinforcement learning! (T/F)
False - Regression and Classification methods are different types of supervised learning!
There is only one hyperparameter in a KNN model. (T/F)
False - There are two hyperparameters in KNN: the distance metric and K.
Unlike econometrics, in machine learning the true relationship between variables is known! (T/F)
False - We never know the true relationship between variables, regardless of using a machine learning approach or an econometrics approach. All we can do is estimate the true relationship and hope that we get as close as possible to it!
When the sample sizes are relatively small, the ridge regression can improve predictions in the test data by making them more sensitive to the training data. (T/F)
False - When the sample sizes are relatively small, the ridge regression can improve predictions in the test data by making them less sensitive to the training data.
The outputs of a KNN classification model are categorical (for example 0 or 1 for a binary KNN classifier) (T/F)
False - the outputs are probabilities (the fraction of the K nearest neighbors that belong to each class).
Ridge regression helps reduce the model variance by ------------- parameters and Lasso regression does the same job by -------------- parameters.
Shrinking - dropping
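A minimal sketch of that contrast (simulated data; the penalty strengths are illustrative assumptions):

```python
# A minimal sketch: ridge shrinks coefficients toward zero, while lasso
# can set redundant coefficients exactly to zero.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = 3 * X[:, 0] + rng.standard_normal(100)  # only the first feature matters

print(Ridge(alpha=10).fit(X, y).coef_.round(2))   # all small but nonzero
print(Lasso(alpha=0.5).fit(X, y).coef_.round(2))  # redundant features exactly 0
```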
Eigenvectors show the direction of the principal components and eigenvalues represent their magnitudes. Loosely speaking, the eigenvectors are just linear combinations of the original variables, and the eigenvalue associated with each principal component tells you how much of the variation in the data set that component explains. (T/F)
True
One of the main advantages of Lasso regression over Ridge regression is that with Lasso regression we can reduce the number of features in the model by setting the weights of the redundant features to zero, but with Ridge regression this is impossible. (T/F)
True
Polynomial regression models are more prone to overfit the data compared to other linear regression models not using polynomial features. (T/F)
True
Random Forests combine the simplicity of decision trees with flexibility by creating bootstrapped data sets (a forest of trees), resulting in a vast improvement in accuracy! (T/F)
True
Random forest is an example of ensemble machine learning algorithm meaning that a group of weak learners (individual trees) come together and build a strong learner (the forest) with a better performance in general. (T/F)
True
Support Vector Machines (SVM) use something called kernel functions to systematically find SVCs in higher dimensions without ever transforming the data into that higher dimension! With this trick, SVMs are able to handle non-separable data sets. (T/F)
True
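A minimal sketch of the kernel trick on data that is not linearly separable, using scikit-learn's make_circles (the parameters are illustrative):

```python
# A minimal sketch: an RBF-kernel SVM separates data that no linear
# boundary can, without explicitly mapping it to a higher dimension.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)           # kernel computes higher-dim
print(linear.score(X, y), rbf.score(X, y))  # similarities implicitly
```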
Support vector classifiers (SVC) allow misclassifications by adding a soft margin to the MMC concept. (T/F)
True
The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true-positive rate is also known as sensitivity, recall or probability of detection in machine learning. The false-positive rate is also known as probability of false alarm. (T/F)
True
The goal of unsupervised learning is to discover the underlying patterns and find groups of samples that behave similarly! (T/F)
True
Unsupervised learning is a type of machine learning that does not use labeled data. The two main types of unsupervised machine learning algorithms are dimension reduction and clustering. The idea of unsupervised learning is to find patterns within the data themselves, with no target variable. (T/F)
True
In this type of machine learning, the machines don't get any feedback from the output!
Unsupervised learning
In a random forest learning algorithm, using a bootstrapped sample and considering only a subset of the variables at each step results in a wide variety of trees. Why do you think this variety is what makes random forests more effective than an individual decision tree?
The variety decorrelates the trees: each bootstrapped tree makes somewhat different errors, so when we aggregate their predictions (by voting or averaging), the individual mistakes tend to cancel out. The result is a lower-variance prediction than any single tree could give, without increasing bias much.
Fitting the training data well but making poor predictions on new data is called the ------------
bias-variance trade-off
In the polynomial regression equation y = β0 + β1x + β2x² + ... + βd x^d + ε, d is --------------
both a hyperparameter and the order of the polynomial model
To evaluate the performance of a classification machine learning model, we can use ----------
confusion matrix
Which of the following is not an advantage of a Random Forest model?
easy to interpret and determine variable significance
The ---------- the penalty term used in the ridge regression, the --------------- the fitted line in the training data.
larger - flatter
In linear regression, we fit the line using ---------- and in logistic regression we fit the S curve using --------
least squares - maximum likelihood
What are the support vectors in SVC?
observations on the edge and inside the soft margin
When the data is 1-dimensional, the SVC is a -------------. When the data is 2-dimensional, the SVC is a -------------. When the data is 3-dimensional, the SVC is a --------------. Finally, when the data is in 4 or more dimensions, the SVC is a ---------.
Single point - Line - Plane - Hyperplane
Which of the following is not an advantage of the stochastic gradient descent (SGD) method over simple gradient descent (GD)?
the implementation of GD is simpler, especially for big data
Which one is correct definition of Supervised learning? In supervised learning -------
the machines learn to model relationships based on labeled data!
Out-of-bag error is defined as:
the proportion of out-of-bag samples that were incorrectly classified
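A minimal sketch, assuming scikit-learn's random forest with oob_score enabled:

```python
# A minimal sketch of the out-of-bag error with sklearn's random forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)
rf = RandomForestClassifier(oob_score=True, random_state=0).fit(X, y)
print(1 - rf.oob_score_)  # proportion of OOB samples misclassified
```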
In machine learning, we use -------------- data to train the model and ------------ data to evaluate the performance of our model.
train - test