Machine Learning Exam 2
What is active learning?
Active learning is when a human expert interacts with the learning algorithm, providing labels for specific instances when the algorithm requests them.
Name and describe two main tasks that are achieved using unsupervised learning.
Clustering: The goal is to group similar instances together into clusters. Anomaly detection: Learn what "normal" data looks like, then use that to detect abnormal instances.
When does DBSCAN not perform well?
DBSCAN cannot capture all the clusters properly if the density varies significantly across the clusters.
How does the K-means clustering algorithm work?
How it works: • Start by placing the centroids randomly (e.g., by picking k instances randomly as centroids), • Then label the instances according to the closest centroid, • Update the centroid locations based on the labels, • Relabel the instances according to the closest centroid, • Continue until the centroids stop moving.
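Note: the steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation; the helper name `k_means`, the toy two-blob data, and the rule that an empty cluster keeps its previous centroid are all assumptions made for the example.

```python
import numpy as np

def k_means(X, k, rng, max_iter=100):
    """Plain K-means: pick k random instances as centroids, then alternate
    labeling and centroid updates until the centroids stop moving."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Label each instance with the index of its closest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of the instances assigned to it
        # (assumption: an empty cluster simply keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # the centroids stopped moving
        centroids = new_centroids
    return centroids, labels

rng = np.random.default_rng(0)
# Toy data (illustrative only): two well-separated blobs of 20 points each.
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(20, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(20, 2)),
])
centroids, labels = k_means(X, k=2, rng=rng)
print(centroids)
```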
Why do we want to diversify the learners?
When diversified, learners will make very different types of errors, improving the ensemble's accuracy.
A ___________ in an n-dimensional Euclidean space is a flat, (n-1)-dimensional subset of that space that divides the space into two disconnected parts.
hyperplane
Why is having regularization for decision trees important?
• Decision Trees make very few assumptions about the training data • If left unconstrained, the tree structure will adapt itself to the training data, fitting it very closely—indeed, most likely overfitting it
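Note: a small sketch of what regularization looks like in practice, assuming Scikit-Learn's `DecisionTreeClassifier` and its `max_depth` and `min_samples_leaf` hyperparameters. The moons dataset and the specific values (3 and 5) are arbitrary choices for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

# Noisy toy dataset (illustrative choice).
X, y = make_moons(n_samples=200, noise=0.25, random_state=42)

# Unconstrained tree: keeps splitting until every leaf is pure.
unconstrained = DecisionTreeClassifier(random_state=42).fit(X, y)

# Regularized tree: limited depth and a minimum number of samples per leaf.
regularized = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=42
).fit(X, y)

for name, tree in [("unconstrained", unconstrained), ("regularized", regularized)]:
    print(name, "depth:", tree.get_depth(), "leaves:", tree.get_n_leaves(),
          "training accuracy:", tree.score(X, y))
```

A perfect training score for the unconstrained tree is a sign that it has memorized the noise; comparing both trees on held-out data would show whether the regularized one generalizes better.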
Which of the following are algorithms used to build decision trees? • ID3 • CART • Entropy • Gini
• ID3 • CART
How does the uncertainty sampling strategy work in active learning?
Uncertainty sampling: 1. The model is trained on the labeled instances gathered so far, and then makes predictions on all the unlabeled instances. 2. The expert labels the instances for which the model is most uncertain 3. Iterate this process until the performance improvement stops being worth the labeling effort.
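Note: a minimal sketch of the uncertainty sampling loop. Several things here are assumptions for illustration: the "expert" is simulated by looking up the true labels, uncertainty is measured as the lowest top-class probability, and the batch size (10) and number of rounds (5) are fixed instead of stopping when the improvement is no longer worth the labeling effort.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y_true = make_classification(n_samples=300, n_features=5, random_state=0)

# Start with a handful of labeled instances from each class.
labeled = list(np.where(y_true == 0)[0][:5]) + list(np.where(y_true == 1)[0][:5])
labeled_set = set(labeled)
unlabeled = [i for i in range(len(X)) if i not in labeled_set]

for _ in range(5):
    # 1. Train on the labeled instances gathered so far.
    model = LogisticRegression(max_iter=1000).fit(X[labeled], y_true[labeled])
    # 2. Predict on all unlabeled instances and find the least confident ones.
    confidence = model.predict_proba(X[unlabeled]).max(axis=1)
    query_positions = np.argsort(confidence)[:10]
    queried = [unlabeled[p] for p in query_positions]
    # 3. The "expert" labels them (simulated here by reusing y_true).
    labeled.extend(queried)
    queried_set = set(queried)
    unlabeled = [i for i in unlabeled if i not in queried_set]

print(len(labeled), "labeled,", len(unlabeled), "unlabeled")
```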
Random Forest is an ensemble of ___________, generally trained via the __________ method.
Decision Trees, bagging
What is a Gaussian Mixture Model?
Gaussian Mixture Model (GMM) is a probabilistic model that assumes that the instances were generated from a mixture of several Gaussian distributions whose parameters are unknown.
What would you do to improve the performance of the DBSCAN algorithm for the figure below?
Increase the eps
(Pic) If we are using a decision tree for the data above, which data should we use (scaled or unscaled)?
It does not matter, decision trees do not need the data to be scaled.
_________: The distance from the decision boundary to the closest examples of the two classes.
Margin
According to the figure below, how many decision trees would you use in gradient boosting?
Min
How can we avoid getting stuck in a local minimum in K-means?
Run the algorithm multiple times with different random initializations and keep the best solution.
Why is the prediction complexity of a decision tree O(log2(m))?
Making a prediction requires one comparison at each level, and a roughly balanced tree trained on m instances has about log2(m) levels.
(Pic) One of the following values for the kernel parameter was used for each of the following Support Vector Machine regression models: • linear • rbf • poly • sigmoid. Name the figures based on the value used for the kernel parameter:
Top left: rbf Top right: linear Bottom left: poly Bottom right: sigmoid
Bagging and pasting methods enable parallel training and prediction: training can be done in parallel on different CPU cores or servers, and predictions can be made in parallel. True/False
True
What are the methods used to find the optimal number of trees in gradient boosting?
Two-stage training method: • Trains a large number of trees • Measures the validation error at each stage of training • Selects the number of trees with the minimum validation error • Trains a new model with the optimal number of trees found. Early stopping method: • Implements incremental learning • Measures the validation error after every added tree • Stops adding trees when the validation error fails to improve 5 times in a row
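Note: both methods can be sketched with Scikit-Learn's `GradientBoostingRegressor`, using `staged_predict()` for the two-stage approach and `warm_start=True` for incremental training. The synthetic quadratic data, the cap of 120 trees, and `max_depth=2` are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(300, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.normal(size=300)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

# Two-stage method: train many trees, measure the validation error at each
# stage, then retrain with the number of trees that minimized it.
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120, random_state=42)
gbrt.fit(X_train, y_train)
val_errors = [mean_squared_error(y_val, y_pred)
              for y_pred in gbrt.staged_predict(X_val)]
best_n = int(np.argmin(val_errors)) + 1
gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=best_n,
                                      random_state=42).fit(X_train, y_train)

# Early stopping: add one tree at a time (warm_start keeps existing trees)
# and stop once the validation error has not improved 5 times in a row.
gbrt_es = GradientBoostingRegressor(max_depth=2, warm_start=True, random_state=42)
min_val_error = float("inf")
rounds_without_improvement = 0
for n_estimators in range(1, 121):
    gbrt_es.n_estimators = n_estimators
    gbrt_es.fit(X_train, y_train)
    val_error = mean_squared_error(y_val, gbrt_es.predict(X_val))
    if val_error < min_val_error:
        min_val_error = val_error
        rounds_without_improvement = 0
    else:
        rounds_without_improvement += 1
        if rounds_without_improvement == 5:
            break

print("two-stage best:", best_n, "| early stopping ended at:", gbrt_es.n_estimators)
```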
It is called random _________ when we sample both training instances and features.
patches
Even if each classifier used in an ensemble learning model is a ______ learner (slightly better than random guessing), the ensemble can still be a ______ learner (achieving high accuracy).
weak, strong
Given the following figure, what is the optimal number of clusters?
4
__________________ is the task of partitioning an image into multiple segments.
Image segmentation
What is the difference between Random Forest (RF) and Random Patches (RP)?
In Random Patches, the subset of features is selected globally once and for all, prior to the construction of the tree. In Random Forest, subsets of features are drawn locally at each node.
The Random Forest algorithm introduces extra randomness when growing trees; instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features True/False
True
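How do we identify the best solution in K-means?
Use the inertia metric. Inertia: the sum of squared distances between each instance and its closest centroid (this is what Scikit-Learn reports as inertia_; dividing by the number of instances gives the mean squared distance). The algorithm chooses the result with the lowest inertia.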
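Note: a short sketch of picking the best of several runs by inertia, assuming Scikit-Learn's `KMeans`. Each run uses a single random initialization (`n_init=1`) so the selection is visible; in normal use the `n_init` parameter performs this selection internally. The blob data and the 10 seeds are illustrative assumptions. The last lines recompute inertia by hand as a sum of squared distances to the closest centroid.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=5, cluster_std=1.0, random_state=7)

# Run K-means several times, each with a different random initialization.
runs = [
    KMeans(n_clusters=5, init="random", n_init=1, random_state=seed).fit(X)
    for seed in range(10)
]

# Keep the solution with the lowest inertia.
best = min(runs, key=lambda km: km.inertia_)

# Inertia recomputed by hand: sum of squared distances to the assigned centroid.
manual_inertia = ((X - best.cluster_centers_[best.labels_]) ** 2).sum()

print("inertias:", [round(km.inertia_, 1) for km in runs])
print("best:", best.inertia_, "manual:", manual_inertia)
```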
How do we select the number of clusters for a Gaussian mixture model?
We get the Bayesian information criterion (BIC) or the Akaike information criterion (AIC) values for various cluster numbers and select the one that has the minimum BIC or AIC values.
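Note: a minimal sketch of this selection procedure, assuming Scikit-Learn's `GaussianMixture` and its `bic()` and `aic()` methods. The blob data, the candidate range 1 to 6, and `n_init=5` are assumptions for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=600, centers=3, cluster_std=0.8, random_state=1)

candidate_ks = list(range(1, 7))
models = [
    GaussianMixture(n_components=k, n_init=5, random_state=1).fit(X)
    for k in candidate_ks
]

# Lower is better for both criteria.
bics = [model.bic(X) for model in models]
aics = [model.aic(X) for model in models]

best_k_bic = candidate_ks[bics.index(min(bics))]
best_k_aic = candidate_ks[aics.index(min(aics))]
print("best k by BIC:", best_k_bic, "| best k by AIC:", best_k_aic)
```

BIC and AIC often agree; when they differ, BIC tends to pick the simpler model because it penalizes the number of parameters more heavily.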
____________________all pixels that are part of the same object type get assigned to the same segment.
semantic segmentation
When bagging is compared to a single predictor trained on the original training set, the ensemble has a ________ bias but a ______ variance.
similar, lower
___________ are the samples on the margin.
support vectors
The petal length is 5.00 cm and the petal width is 1.50 cm for an iris flower. The following decision tree is given. What are the probabilities for the setosa, versicolor and virginica classes?
Iris-Setosa: 0% (0/54) • Iris-Versicolor: 90.7% (49/54) • Iris-Virginica: 9.3% (5/54)
Which of the following figures has lower variance?
The one on the right (it applies bagging)
Which of the following is true for soft voting? a) A soft voting classifier often achieves better accuracy than hard voting b) It gives more weight to highly confident votes c) All classifiers used from the Scikit-Learn library must have a predict_proba() method to be able to use soft voting d) All of the options listed
d) All of the options listed
________________________: All pixels that are part of the same individual object are assigned to the same segment.
instance segmentation
Explain the difference between bagging and pasting
Bagging: sampling is performed with replacement (bootstrap aggregating). Pasting: sampling is performed without replacement. Both bagging and pasting: the same training instance can be sampled several times across multiple predictors. Bagging only: the same training instance can be sampled several times for the same predictor.
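Note: the sampling difference can be shown with the standard library alone. This sketch only draws the index samples one predictor would be trained on; the sizes (100 instances, 60 drawn for pasting) are arbitrary. In Scikit-Learn the same switch is the `bootstrap` parameter of `BaggingClassifier` (`True` for bagging, `False` for pasting).

```python
import random

random.seed(0)
training_indices = list(range(100))

# Bagging: sample WITH replacement, so an instance can repeat
# within the sample given to a single predictor.
bagging_sample = random.choices(training_indices, k=100)

# Pasting: sample WITHOUT replacement, so every instance appears
# at most once in the sample given to a single predictor.
pasting_sample = random.sample(training_indices, k=60)

print("distinct instances in bagging sample:", len(set(bagging_sample)))
print("distinct instances in pasting sample:", len(set(pasting_sample)))
```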
Which of the following is a whitebox model? • Decision Trees • Random Forests • Neural Networks • All of the options
• Decision Trees
What are the limitations of the K-Means algorithm?
• Necessary to run the algorithm several times to avoid suboptimal solutions, • Need to specify the number of clusters, • Does not behave very well when the clusters have varying sizes, different densities, or nonspherical shapes.
Which of the following is not true for whitebox models? • Intuitive models • Decisions are easy to interpret • Predictions are hard to explain • None of the options
• Predictions are hard to explain
List three ways to diversify learners in an ensemble learning model:
• Use different training algorithms with the same training set • Use the same training algorithm for every predictor, but train on different random subsets of the training set • Use different random subsets of features
For an ensemble learning model to work, which of the following need to be satisfied by the learners? • there are a sufficient number of weak learners • they are sufficiently diverse • the learners must be random • decision trees must be used • random forest learner must be used
• there are a sufficient number of weak learners • they are sufficiently diverse
(Pic) Match the definitions seen on the decision tree nodes: • samples • value • gini
_samples_ counts how many training instances the node applies to. _value_ gives how many training instances of each class the node applies to. _gini_ measures the node's impurity.
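Note: a small worked example tying the three node attributes together. It reuses the class counts from the iris question in this set (0 setosa, 49 versicolor, 5 virginica) and assumes the standard Gini impurity formula, 1 minus the sum of squared class proportions, which is not spelled out in the answer above.

```python
# Class counts at the node, as in the iris question (0 / 49 / 5).
value = [0, 49, 5]

# samples: how many training instances the node applies to.
samples = sum(value)

# Class probabilities predicted for an instance that lands in this node.
probabilities = [count / samples for count in value]

# Gini impurity: 1 - sum of squared class proportions (0 means a pure node).
gini = 1.0 - sum(p ** 2 for p in probabilities)

print("samples:", samples)
print("probabilities:", [round(p, 3) for p in probabilities])
print("gini:", round(gini, 3))
```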
SVM Classification is also called _____________ classification. a) Large margin b) Street c) Hyperplane d) None of the options listed
a) Large margin
Which of the following are true for hard margin classification? a) Only works if the data is linearly separable. b) Sensitive to outliers and it will probably not generalize as well. c) Tries to find a good balance between keeping the street as large as possible and limiting the margin violations. d) All of the options listed
a) Only works if the data is linearly separable. b) Sensitive to outliers and it will probably not generalize as well.
____________________: assign pixels to the same segment if they have a similar color.
color segmentation
Which of the following is true for the CART algorithm that is used to build decision trees? a) Searches for an optimum split at the top level, then repeats the process at each subsequent level. b) It is a greedy algorithm. It does not check whether or not the split will lead to the lowest possible impurity several levels down. c) Produces a solution that's reasonably good but not guaranteed to be optimal. d) All of the options listed
d) All of the options listed
Which of the following are true for boosting? a) It is another ensemble method that combines several weak learners into a strong learner b) It trains predictors sequentially c) Each predictor attempts to correct its predecessor d) All of the options listed
d) All of the options listed
Which of the following can be done by a Support Vector Machine (SVM) learning model? a) Classification b) Regression c) Outlier detection d) All of the options listed
d) All of the options listed
Which of the following is not true for decision tree regression? a) Instead of predicting a class in each node, it predicts a value. b) The predicted value for each region is always the average target value of the instances in that region. c) The algorithm splits each region in a way that makes most training instances as close as possible to that predicted value. d) When used for regression, decision trees are not prone to overfitting
d) When used for regression, decision trees are not prone to overfitting
In _________ learning, we aggregate the predictions of a group of predictors (such as classifiers or regressors), and we often get better predictions than with the best individual predictor.
ensemble
During SVM classification, if we strictly impose that all instances must be off the street and on the right side, this is called _________ classification
hard margin
Using a function to transform the original space into a higher-dimensional space during the cost function optimization is called the ________ _____
kernel trick
It is called random _______ when we keep all training instances but sample features.
subspaces
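Note: both random subspaces and random patches can be sketched with Scikit-Learn's `BaggingClassifier`, which exposes separate controls for instance sampling (`max_samples`, `bootstrap`) and feature sampling (`max_features`, `bootstrap_features`). The iris data and the sampling fractions below are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # 150 instances, 4 features

# Random subspaces: keep all training instances, sample only the features.
random_subspaces = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=10,
    bootstrap=False, max_samples=1.0,  # every instance, no resampling
    max_features=0.5,                  # half of the features per predictor
    random_state=0,
).fit(X, y)

# Random patches: sample both training instances and features.
random_patches = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=10,
    bootstrap=True, max_samples=0.7,   # resampled subset of instances
    max_features=0.5,                  # half of the features per predictor
    random_state=0,
).fit(X, y)

# Feature indices seen by each predictor in the random subspaces ensemble.
for features in random_subspaces.estimators_features_[:3]:
    print(sorted(features))
```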
Many datasets are not even close to being linearly separable. Linear SVM would not perform well on these datasets. We need to transform the original space to a higher dimensional space to improve the performance. Certain values can be passed to Scikit Learn Support Vector Machine Classifier (SVC)'s 'kernel' parameter to transform the original space to a higher dimension. Which of the following values can be used for that: • 'linear', • 'poly', • 'rbf', • 'sigmoid'
• 'poly', • 'rbf', • 'sigmoid'
The figure below shows an ensemble learning system. Which of these could be true for this figure? • It uses boosting • Trains classifiers sequentially • It cannot be scaled • All of the options listed
• All of the options listed
Briefly explain the difference between AdaBoost and Gradient Boost
Gradient Boost attempts to fit the new predictor to the residual errors made by the previous predictor instead of tweaking the instance weights at every iteration
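Note: the "fit the new predictor to the residual errors" idea can be sketched by hand with three small regression trees. The synthetic quadratic data, the tree depth, and the choice of exactly three trees are assumptions for illustration; `GradientBoostingRegressor` automates the same idea and adds a learning rate.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(100, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.normal(size=100)

# First tree: fit the targets directly.
tree1 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, y)

# Second tree: fit the residual errors left by the first tree.
residuals1 = y - tree1.predict(X)
tree2 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, residuals1)

# Third tree: fit the residual errors left by the first two trees.
residuals2 = residuals1 - tree2.predict(X)
tree3 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, residuals2)

def ensemble_predict(X_new):
    """The ensemble prediction is the sum of the predictions of all trees."""
    return sum(tree.predict(X_new) for tree in (tree1, tree2, tree3))

mse_first_tree = np.mean((y - tree1.predict(X)) ** 2)
mse_ensemble = np.mean((y - ensemble_predict(X)) ** 2)
print("training MSE, first tree:", mse_first_tree)
print("training MSE, three trees:", mse_ensemble)
```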
This sequential learning technique has some similarities with _______________, except that instead of tweaking a single predictor's parameters to minimize a cost function, _______________ adds predictors to the ensemble, gradually making it better.
Gradient Descent, AdaBoost
What is hard clustering and soft clustering?
Hard clustering: assigning each instance to a single cluster. Soft clustering: giving each instance a score per cluster.
Describe hard and soft voting classifiers for ensemble learning?
Hard voting: aggregate the predictions of each classifier and predict the class that gets the most votes. This majority-vote classifier is called a hard voting classifier. Soft voting: predict the class with the highest class probability, averaged over all the individual classifiers. This is called soft voting.
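Note: a tiny hand-computed example where hard and soft voting disagree. The three probability vectors are made up for illustration: two classifiers lean slightly toward class 1, one is very confident in class 0. In Scikit-Learn the corresponding switch is `VotingClassifier(voting="hard")` versus `voting="soft"`.

```python
# Made-up class probabilities [P(class 0), P(class 1)] from three classifiers
# for a single instance.
probas = [
    [0.49, 0.51],  # classifier 1: barely prefers class 1
    [0.49, 0.51],  # classifier 2: barely prefers class 1
    [0.99, 0.01],  # classifier 3: very confident in class 0
]

# Hard voting: each classifier votes for its most likely class,
# and the class with the most votes wins.
hard_votes = [p.index(max(p)) for p in probas]
hard_prediction = max(set(hard_votes), key=hard_votes.count)

# Soft voting: average the class probabilities over all classifiers,
# then predict the class with the highest average probability.
n_classes = len(probas[0])
mean_proba = [sum(p[c] for p in probas) / len(probas) for c in range(n_classes)]
soft_prediction = mean_proba.index(max(mean_proba))

print("hard voting predicts class", hard_prediction)
print("soft voting predicts class", soft_prediction)
```

Soft voting flips the result here because the single highly confident vote outweighs the two marginal ones.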
What is the main idea behind the stacking approach in ensemble learning?
Idea: Train a model to aggregate the predictions of all predictors in an ensemble.
What are the similarities and differences between K-Means and the Expectation-Maximization (EM) algorithm?
Similarities with K-Means: • EM initializes the cluster parameters randomly, • then repeats two steps until convergence: the expectation step (assigning instances to clusters) and the maximization step (updating the clusters). Differences: EM is a generalization of K-Means that • finds the cluster centers (μ(1) to μ(k)), • also finds each cluster's size, shape, and orientation (Σ(1) to Σ(k)), • finds the clusters' relative weights (ϕ(1) to ϕ(k)), • uses soft cluster assignments instead of hard ones.
The SVM differs from many other algorithms in that it picks the extreme cases closest to the boundary and uses them to construct the decision boundary. True / False
True
The idea of SVM is to create a line or a hyperplane which separates the data into classes. True / False
True
How does DBSCAN (Density-Based Spatial Clustering of Applications with Noise) work?
Visit all points using the algorithm below and mark each one as a core, border, or outlier point: 1. Pick a random point that has not been marked yet. 2. Check whether the point has at least n neighbors within distance ε. a. Yes: it is a core point. Check all other points in its neighborhood and mark them as core points if they have n or more neighbors; otherwise mark them as border points. Continue the process until no unvisited core points are left in the neighborhood. All core points reached this way belong to the same cluster. b. No: it is an outlier point (unless it is later found in the neighborhood of a core point, in which case it becomes a border point). 3. If there are unmarked points left, go back to step 1.
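Note: a minimal sketch of the same idea using Scikit-Learn's `DBSCAN`, where `eps` plays the role of ε and `min_samples` plays the role of n (Scikit-Learn counts the point itself as one of its neighbors). The toy data, two tight blobs plus one distant point, is an assumption chosen so the core/outlier distinction is easy to see.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Toy data (illustrative only): two tight blobs and one far-away point.
X = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(20, 2)),
    [[20.0, 20.0]],
])

dbscan = DBSCAN(eps=1.0, min_samples=5).fit(X)

labels = dbscan.labels_                 # cluster index per point, -1 = outlier
n_clusters = len(set(labels) - {-1})    # do not count the outlier label

# Boolean mask of core points, built from the indices DBSCAN reports.
is_core = np.zeros(len(X), dtype=bool)
is_core[dbscan.core_sample_indices_] = True

print("clusters found:", n_clusters)
print("core points:", is_core.sum(), "| outliers:", (labels == -1).sum())
```

Points that are neither core points nor labeled -1 are the border points: they sit within ε of a core point but do not have enough neighbors themselves.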
(Pic) We use an SVM learning model to solve an ML problem. The figures show scaled and unscaled data. Do you recommend using scaled or unscaled data, and why?
Use scaled data: SVMs are sensitive to feature scales, and without scaling the model will tend to neglect features with small values.
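Note: a short sketch comparing an SVM with and without feature scaling, assuming Scikit-Learn's `StandardScaler` and `SVC` combined in a pipeline. The wine dataset is an illustrative choice because its features have very different ranges; the exact scores depend on the data and are not the point.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)

# SVM trained directly on the raw features.
unscaled_scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=5)

# Same SVM, but every feature is standardized first. Putting the scaler in
# the pipeline means it is fit on the training folds only.
scaled_model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scaled_scores = cross_val_score(scaled_model, X, y, cv=5)

print("mean accuracy, unscaled:", unscaled_scores.mean())
print("mean accuracy, scaled:  ", scaled_scores.mean())
```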
The following figure is given for the Gaussian mixture, what would be the optimal number of clusters?
3
Name 3 applications of clustering, briefly explain:
• Customer segmentation: Useful to understand who your customers are and what they need, so you can adapt your products and marketing campaigns to each segment. • Data analysis: Analyzing each cluster of data separately might give further insights. • Dimensionality reduction: Once a dataset has been clustered, it is usually possible to measure each instance's affinity with each cluster. • Anomaly detection (also called outlier detection): Any instance that has a low affinity to all the clusters is likely to be an anomaly. • Semi-supervised learning: If you only have a few labels, you could perform clustering and propagate the labels to all the instances in the same cluster. • Image segmentation: Cluster pixels according to their color and then replace each pixel's color with the mean color of its cluster. • Search engines: Search for images that are similar to a reference image.
What are the differences between Extra Trees and Random Forest?
The main two differences are the following: • Random forest uses bootstrap replicas whereas Extra Trees use the whole original sample by default. (extra trees has optional parameter allowing users to bootstrap replicas) • Random Forest chooses the optimum split while Extra Trees chooses it randomly
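Note: both differences are visible in Scikit-Learn's defaults. This sketch inspects the `bootstrap` parameter of each ensemble and the `splitter` setting of the individual trees; the iris data and 50 trees are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

X, y = load_iris(return_X_y=True)

random_forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
extra_trees = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X, y)

# Difference 1: Random Forest bootstraps the training set by default,
# Extra Trees trains every tree on the whole original sample by default.
print("bootstrap:", random_forest.bootstrap, extra_trees.bootstrap)

# Difference 2: Random Forest trees search for the best split,
# Extra Trees trees use randomly drawn split thresholds.
print("splitter:", random_forest.estimators_[0].splitter,
      extra_trees.estimators_[0].splitter)
```

Passing `bootstrap=True` to `ExtraTreesClassifier` turns on the optional bootstrap replicas mentioned above.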
Why is the training complexity of a decision tree using the CART algorithm O(n × m × log2(m))?
There are n features, m samples, and about log2(m) levels. The algorithm compares all features on all samples at each level, hence the O(n × m × log2(m)) complexity.