Machine learning

Explain the difference between bagging and pasting

Bagging: Sampling is performed with replacement (bootstrap aggregating). Pasting: Sampling is performed without replacement. Bagging and pasting: The same training instance can be sampled several times across multiple predictors. Bagging only: The same training instance can be sampled several times for the same predictor.
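
A minimal sketch of the two sampling schemes, using only the Python standard library. The helper names `bagging_sample` and `pasting_sample` are illustrative and do not come from any library.

```python
import random

def bagging_sample(instances, size, rng):
    """Bagging: draw with replacement, so an instance may repeat within one sample."""
    return rng.choices(instances, k=size)

def pasting_sample(instances, size, rng):
    """Pasting: draw without replacement, so an instance appears at most once per sample."""
    return rng.sample(instances, size)

rng = random.Random(42)
training_set = list(range(10))

# One sample per predictor. With either scheme, the same instance may
# appear in the samples of several different predictors.
bags = [bagging_sample(training_set, 10, rng) for _ in range(3)]
pastes = [pasting_sample(training_set, 6, rng) for _ in range(3)]
```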

3 applications of clustering, briefly explain

• Customer segmentation: Useful to understand who your customers are and what they need, so you can adapt your products and marketing campaigns to each segment. • Data analysis: Analyzing each cluster of data separately might give further insights. • Dimensionality reduction: Once a dataset has been clustered, it is usually possible to measure each instance's affinity with each cluster. • Anomaly detection (also called outlier detection): Any instance that has a low affinity to all the clusters is likely to be an anomaly.

What are the advantages of using DBSCAN over the K-Means algorithm?

• DBSCAN can identify any number of clusters of any shape. • It is robust to outliers. • It has just two hyperparameters (eps and min_samples).

Which of the following is a whitebox model

• Decision Trees

Is regularization important for decision trees?

• Decision Trees make very few assumptions about the training data • If left unconstrained, the tree structure will adapt itself to the training data, fitting it very closely—indeed, most likely overfitting it

Which of the following are algorithms for building decision trees?

• ID3 • CART

Petal length is 5.00 cm and petal width is 1.50 cm for an iris flower. The following decision tree is given. What are the probabilities for the setosa, versicolor and virginica classes?

• Iris-Setosa: 0% (0/54) • Iris-Versicolor: 90.7% (49/54) • Iris-Virginica: 9.3% (5/54)
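
The percentages above are the class ratios among the training instances in the leaf that the flower falls into. A small sketch of that arithmetic, assuming the leaf holds the counts quoted in the answer (0, 49 and 5 out of 54); the function name `leaf_probabilities` is made up for illustration.

```python
def leaf_probabilities(class_counts):
    """Class probability = share of the leaf's training instances in that class."""
    total = sum(class_counts.values())
    return {label: count / total for label, count in class_counts.items()}

# Counts taken from the answer above; the tree figure itself is not reproduced here.
leaf = {"Iris-Setosa": 0, "Iris-Versicolor": 49, "Iris-Virginica": 5}
probabilities = leaf_probabilities(leaf)

# The predicted class is the one with the highest probability in the leaf.
predicted_class = max(probabilities, key=probabilities.get)
```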

The figure below shows an ensemble learning system. Which of these could be true for this figure

• It uses boosting • It trains classifiers sequentially • It cannot be parallelized, so it does not scale as well as bagging or pasting

What are the limitations of the K-Means algorithm

• Necessary to run the algorithm several times to avoid suboptimal solutions, • Need to specify the number of clusters, • Does not behave very well when the clusters have varying sizes, different densities, or nonspherical shapes.

Which of the following is not true for whitebox models?

• Predictions are hard to explain

Differences between Extra Trees vs Random Forest

• Random Forest uses bootstrap replicas, whereas Extra Trees uses the whole original sample by default (Extra Trees has an optional parameter that lets users bootstrap replicas). • Random Forest searches for the optimum split, while Extra Trees chooses the split threshold randomly.

How does K-means clustering algorithm work

• Start by placing the centroids randomly (e.g., by picking k instances randomly as centroids). • Then label the instances according to the closest centroid. • Update the centroid locations based on the labels. • Relabel the instances according to the closest centroid. • Continue until the centroids stop moving.
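
The steps above can be sketched in plain Python as follows. This is an illustrative toy version, not a library implementation; it assumes points are tuples of numbers, and it keeps an empty cluster's old centroid (a choice made here for simplicity).

```python
import math
import random

def closest(point, centroids):
    """Index of the centroid nearest to the point."""
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, k, rng, max_iter=100):
    centroids = rng.sample(points, k)  # pick k instances as the initial centroids
    for _ in range(max_iter):
        labels = [closest(p, centroids) for p in points]  # label step
        new_centroids = []
        for c in range(k):  # update step
            members = [p for p, label in zip(points, labels) if label == c]
            new_centroids.append(mean(members) if members else centroids[c])
        if new_centroids == centroids:  # centroids stopped moving
            break
        centroids = new_centroids
    labels = [closest(p, centroids) for p in points]
    return centroids, labels

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = k_means(points, k=2, rng=random.Random(0))
```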

List three ways to diversify learners in an ensemble learning model

• Use different training algorithms with the same training set • Use the same training algorithm for every predictor, but train on different random subsets of the training set • Use different random subsets of features

How does clustering help pre-processing

• Create a pipeline that first clusters the training set into 50 clusters • Replace the images with their distances to these 50 clusters • Then apply a Logistic Regression model

Which of the following are true for boosting?

a) It is another ensemble method that combines several weak learners into a strong learner b) Trains predictors sequentially c) Each predictor attempts to correct its predecessor

Which of the following are true for the CART algorithm that is used to train decision trees?

a) Searches for an optimum split at the top level, then repeats the process at each subsequent level. b) It is a greedy algorithm. It does not check whether or not the split will lead to the lowest possible impurity several levels down. c) Produces a solution that's reasonably good but not guaranteed to be optimal.

Which of the following are true for soft voting?

a) A soft voting classifier often achieves better accuracy than hard voting b) It gives more weight to highly confident votes c) All classifiers used from the Scikit-Learn library must have a predict_proba() method for soft voting to work

Which of the following can be done by a Support Vector Machine (SVM) learning model?

a) Classification b) Regression c) Outlier detection

SVM Classification is also called _____________ classification.

Large margin

Given the following figure, what is the optimal number of clusters

about 4

According to the figure below, how many decision trees would you use in gradient boosting?

about 57

support vectors

are the samples on the margin.

color segmentation

assign pixels to the same segment if they have a similar color

Aggregating the predictions of a group of predictors (such as classifiers or regressors) often gives better predictions than the best individual predictor. This technique is called

ensemble learning

During SVM classification, if we strictly impose that all instances must be off the street and on the right side, this is called

hard margin classification

All pixels that are part of the same individual object are assigned to the same segment.

instance segmentation

Using a kernel function to obtain the effect of transforming the original space into a higher-dimensional space during cost function optimization, without actually computing the transformation, is called

kernel trick
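
A small numeric check of the idea, assuming a degree-2 polynomial kernel on 2D vectors. The names `explicit_map` and `poly_kernel` are illustrative. Both routes give the same inner product, but the kernel never builds the higher-dimensional vectors.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def explicit_map(x):
    """Degree-2 polynomial feature map from 2D to 3D."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(a, b):
    """The same inner product, computed directly in the original 2D space."""
    return dot(a, b) ** 2

a, b = (1.0, 2.0), (3.0, 4.0)
mapped = dot(explicit_map(a), explicit_map(b))  # transform first, then take the dot product
kernel = poly_kernel(a, b)                      # no transformation needed
```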

Many datasets are not even close to being linearly separable. Linear SVM would not perform well on these datasets. We need to transform the original space to a higher dimensional space to improve the performance. Certain values can be passed to Scikit Learn Support Vector Machine Classifier (SVC)'s 'kernel' parameter to transform the original space to a higher dimension. Which of the following values can be used for that:

poly, rbf, sigmoid

We use an SVM learning model to solve an ML problem. We see scaled and unscaled data in the figures. Do you recommend using scaled or unscaled data, and why?

Scaled. SVMs are sensitive to feature scales, and on the scaled graph the points are closer to the margin bounds, which gives a better decision boundary.

All pixels that are part of the same object type get assigned to the same segment.

semantic segmentation

For an ensemble learning model to work, which of the following need to be satisfied by the learners?

• There is a sufficient number of weak learners • They are sufficiently diverse

Even if each classifier used in an ensemble learning model is a ______ learner (slightly better than random guessing), the ensemble can still be a ______ learner (achieving high accuracy).

weak, strong
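
A rough illustration of why this can work, under the strong assumption that the classifiers make independent errors (real ensembles trained on the same data rarely satisfy this fully). The function `majority_accuracy` is a hypothetical helper that evaluates the binomial probability of a correct majority vote, computed in log space to avoid overflow.

```python
import math

def majority_accuracy(n_classifiers, p_correct):
    """Probability that more than half of n independent classifiers are right."""
    n = n_classifiers
    total = 0.0
    for k in range(n // 2 + 1, n + 1):
        # log of C(n, k) * p^k * (1 - p)^(n - k)
        log_term = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                    + k * math.log(p_correct)
                    + (n - k) * math.log(1 - p_correct))
        total += math.exp(log_term)
    return total

single = majority_accuracy(1, 0.51)       # one weak learner
medium = majority_accuracy(101, 0.51)     # 101 weak learners voting
large = majority_accuracy(1001, 0.51)     # 1001 weak learners voting
```

Under the independence assumption, accuracy grows with the number of voters even though each one is only slightly better than random guessing.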

Name and describe two main tasks that are achieved using unsupervised learning

Clustering: The goal is to group similar instances together into clusters. Anomaly detection: Learn what "normal" data looks like, use that to detect abnormal instances

What is a Gaussian Mixture Model

Gaussian Mixture Model (GMM) is a probabilistic model that assumes that the instances were generated from a mixture of several Gaussian distributions whose parameters are unknown.

Briefly explain the difference between AdaBoost and Gradient Boost

AdaBoost tweaks the instance weights at every iteration so that each new predictor focuses on the instances its predecessor got wrong. Gradient Boosting instead fits each new predictor to the residual errors made by the previous predictor.
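
A toy sketch of the residual-fitting idea for 1D regression with squared error. The weak learner here is a one-split decision stump, chosen for brevity; all function names and the learning rate are illustrative assumptions, not a library API.

```python
def fit_stump(xs, targets):
    """Find the single threshold split that minimizes squared error on targets."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the largest x so the right side is non-empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = (sum((t - left_value) ** 2 for t in left)
                 + sum((t - right_value) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value

def gradient_boost(xs, ys, n_trees, learning_rate):
    stumps, preds = [], [0.0] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]  # errors of the ensemble so far
        stump = fit_stump(xs, residuals)                # new predictor fits the residuals
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return stumps

def mse(stumps, xs, ys, learning_rate):
    preds = [sum(learning_rate * predict_stump(s, x) for s in stumps) for x in xs]
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(xs)

xs = list(range(10))
ys = [x * x for x in xs]
one_tree = gradient_boost(xs, ys, n_trees=1, learning_rate=0.5)
twenty_trees = gradient_boost(xs, ys, n_trees=20, learning_rate=0.5)
```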

This sequential learning technique has some similarities with ____________, except that instead of tweaking a single predictor's parameters to minimize a cost function, _________ adds predictors to the ensemble, gradually making it better

Gradient Descent, AdaBoost

What are hard clustering and soft clustering?

Hard clustering: Assign each instance to a single cluster. Soft clustering: Give each instance a score per cluster.

hard and soft voting classifiers for ensemble learning

Hard voting: Aggregate the predictions of each classifier and predict the class that gets the most votes. This majority-vote classifier is called a hard voting classifier. Soft voting: Predict the class with the highest class probability, averaged over all the individual classifiers.
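
A minimal sketch contrasting the two schemes on made-up probabilities from three hypothetical classifiers for a two-class problem. The function names are illustrative.

```python
from collections import Counter

def hard_vote(probas):
    """Each classifier votes for its most likely class; the majority wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in probas]
    return Counter(votes).most_common(1)[0][0]

def soft_vote(probas):
    """Average the class probabilities over classifiers; the highest average wins."""
    n_classes = len(probas[0])
    averages = [sum(p[c] for p in probas) / len(probas) for c in range(n_classes)]
    return max(range(n_classes), key=averages.__getitem__)

# Two classifiers lean slightly toward class 0; one is very confident in class 1.
probas = [[0.55, 0.45], [0.55, 0.45], [0.10, 0.90]]
hard_result = hard_vote(probas)
soft_result = soft_vote(probas)
```

Here the two schemes disagree: hard voting follows the two lukewarm votes, while soft voting gives more weight to the highly confident one.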

What is the main idea behind the stacking approach in ensemble learning

Idea: Train a model to aggregate the predictions of all predictors in an ensemble.

is the task of partitioning an image into multiple segments.

Image segmentation

Difference between Random Forest (RF) vs Random Patches (RP)

In Random Patches, the subset of features is selected globally once and for all, prior to the construction of the tree. In Random Forest, subsets of features are drawn locally at each node.

Hyperplane

In an n-dimensional Euclidean space, a flat, (n-1)-dimensional subset of that space that divides the space into two disconnected parts.

What would you do to improve the performance of the DBSCAN algorithm for the figure below?

Increase eps

If we are using a decision tree for the data above, which data should we use?

It does not matter, decision trees do not need the data to be scaled.

Similarities and differences between K-Means and the Expectation-Maximization (EM) algorithm

Similarities with K-Means: • Initializes the cluster parameters randomly • Then repeats two steps until convergence: the expectation step (assigning instances to clusters) and the maximization step (updating the clusters). Differences: EM is a generalization of K-Means that • Finds the cluster centers (μ(1) to μ(k)) • Also finds the cluster size, shape, and orientation (Σ(1) to Σ(k)) • Also finds the cluster relative weights (ϕ(1) to ϕ(k)) • Uses soft cluster assignments instead of hard ones

Which of the following are true for hard margin classification?

a) Only works if the data is linearly separable. b) Sensitive to outliers, so it will probably not generalize as well.

The following figure is given for the Gaussian mixture, what would be the optimal number of clusters?

Optimal # of clusters: 3

It is called random _______ when we sample both training instances and features.

Patches

It is called random _______ when we keep all training instances but sample features

Subspaces

margin

The distance from the decision boundary to the closest examples of each of the two classes

The Random Forest algorithm introduces extra randomness when growing trees; instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features

True

SVM differs from many other algorithms in that it picks the extreme cases closest to the boundary and uses them to construct the decision boundary.

True

The idea of SVM is to create a line or a hyperplane which separates the data into classes.

True

What methods are used to find the optimal number of trees in gradient boosting?

Two-stage training method: • Trains a large number of trees • Measures validation error at each stage in the training • Selects the number of trees with minimum validation error • Trains a new model with the optimal number of trees found. Early stopping method: • Implements incremental learning • Measures validation error for every tree • Stops adding trees when the error increases 5 times in a row
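
Both methods can be sketched on a list of validation errors, one per added tree. The error values below are invented for illustration, and the early-stopping variant shown counts rounds that fail to improve on the best error seen so far, which is one common reading of "5 times in a row".

```python
def two_stage_best(validation_errors):
    """Train everything first, then pick the tree count with the lowest validation error."""
    best_index = min(range(len(validation_errors)), key=validation_errors.__getitem__)
    return best_index + 1

def early_stopping(validation_errors, patience=5):
    """Stop once the error has failed to improve `patience` times in a row."""
    best_error = float("inf")
    best_n_trees = 0
    bad_rounds = 0
    for n_trees, error in enumerate(validation_errors, start=1):
        if error < best_error:
            best_error, best_n_trees, bad_rounds = error, n_trees, 0
        else:
            bad_rounds += 1
            if bad_rounds == patience:
                break  # later trees are never trained
    return best_n_trees

# Hypothetical validation errors after adding tree 1, 2, 3, ...
errors = [0.9, 0.7, 0.5, 0.4, 0.41, 0.42, 0.43, 0.44, 0.45, 0.1]
n_two_stage = two_stage_best(errors)
n_early = early_stopping(errors)
```

The invented data shows the trade-off: early stopping saves training time but can miss an improvement that would only appear after the patience window.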

How does clustering help semi-supervised learning

Useful for semi-supervised learning when we have plenty of unlabeled instances and very few labeled instances. 1. Suppose we can only afford to label 50 instances. 2. Cluster the training set into 50 clusters. Then, for each cluster, find the representative image closest to the centroid and label it. 3. Propagate the labels to all the other instances in the same cluster (label propagation).

How do we pick the best solution in K-Means?

Use the inertia metric. Inertia: The mean squared distance between each instance and its closest centroid. The algorithm chooses the result with the lowest inertia.
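
A short sketch of the metric as defined on this card, applied to two hypothetical sets of centroids for the same data. The points and centroid sets are invented for illustration.

```python
import math

def inertia(points, centroids):
    """Mean squared distance from each point to its closest centroid."""
    squared = [min(math.dist(p, c) ** 2 for c in centroids) for p in points]
    return sum(squared) / len(points)

points = [(0, 0), (0, 2), (10, 0), (10, 2)]
run_a = [(0, 1), (10, 1)]  # one centroid in the middle of each pair
run_b = [(5, 0), (5, 2)]   # a suboptimal solution from another initialization

# Keep the run with the lowest inertia.
best_run = min([run_a, run_b], key=lambda centroids: inertia(points, centroids))
```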

How do we select the number of clusters for a Gaussian mixture model?

We get the Bayesian information criterion (BIC) or the Akaike information criterion (AIC) values for various cluster numbers and select the one that has the minimum BIC or AIC values.
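
For reference, the two criteria are commonly defined as below. The symbols are not part of the original card: m is the number of instances, p is the number of parameters learned by the model, and L-hat is the maximized value of the model's likelihood function. Both penalize models with more parameters and reward models that fit the data well.

```latex
\mathrm{BIC} = \log(m)\,p - 2\log(\hat{L}), \qquad \mathrm{AIC} = 2p - 2\log(\hat{L})
```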

Why do we want to diversify the learners

When diversified, learners will make very different types of errors, improving the ensemble's accuracy

Which of the following is not true for decision tree regression?

When used for regression, decision trees are not prone to overfitting

How does DBSCAN (Density-Based Spatial Clustering of Applications with Noise) work

1. Pick a random point that has not been marked yet. 2. Check whether the point has at least n neighbors within distance ε. • Yes: it is a core point. Check all other points in its neighborhood and mark each one as a core point if it has n or more neighbors, or as a border point if it does not. Continue the process until no unvisited core point is left in the neighborhood. All core points reached this way will be in the same cluster. • No: it is an outlier point. 3. If there are unmarked points left, go back to the first step.
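
The procedure above can be sketched in plain Python as follows. This is a simplified illustration, not the Scikit-Learn implementation: it scans points in order rather than randomly, counts a point as its own neighbor, and lets a point first marked as an outlier be re-marked as a border point when a later core point reaches it.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_samples):
    labels = [None] * len(points)  # None means "not marked yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # outlier, unless a core point claims it later
            continue
        cluster += 1  # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster, does not expand it
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:  # j is also a core point
                queue.extend(j_neighbors)
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_samples=3)
```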

How does uncertainty sampling strategy during active learning work

1. The model is trained on the labeled instances gathered so far, and then makes predictions on all the unlabeled instances. 2. The expert labels the instances for which the model is most uncertain 3. Iterate this process until the performance improvement stops being worth the labeling effort
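
Step 2 can be sketched as picking the instances whose highest predicted class probability is lowest. The probabilities below are invented, and `most_uncertain` is a hypothetical helper, not a library function.

```python
def most_uncertain(probas, n_queries):
    """Indices of the unlabeled instances the model is least confident about."""
    confidence = [max(p) for p in probas]
    ranked = sorted(range(len(probas)), key=confidence.__getitem__)
    return ranked[:n_queries]

# Hypothetical class probabilities predicted for four unlabeled instances.
probas = [[0.95, 0.05], [0.52, 0.48], [0.30, 0.70], [0.50, 0.50]]
to_label = most_uncertain(probas, n_queries=2)  # send these to the expert
```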

What is active learning?

Active learning is when a human expert interacts with the learning algorithm, providing labels for specific instances when the algorithm requests them

When does DBSCAN not perform well?

DBSCAN cannot capture all the clusters properly if the density varies significantly across the clusters

Random Forest is an ensemble of _____ _____, generally trained via the _______ method

Decision Trees, bagging

How to avoid local minimum in K-means

Run the algorithm multiple times with different random initializations and keep the best solution.

Which of the following figures has lower variance?

The one on the right (it applies bagging).

Why is the training complexity of a decision tree using the CART algorithm O(n × m × log2(m))?

There are n features and m samples, and the tree has about log2(m) levels. The algorithm compares all features on all samples at each level, hence the O(n × m × log2(m)) complexity.

Why is the prediction complexity of a decision tree O(log2(m))?

There is one comparison at each level, and there are about log2(m) levels.

Each of the following Support Vector Machine Regression models used a different value for the kernel parameter. Which value was used for each?

Top left: rbf. Top right: linear. Bottom left: poly. Bottom right: sigmoid.

Bagging and pasting methods enable parallel training and prediction. Training can be done in parallel on different CPU cores or servers. Predictions can be made in parallel.

True

