Machine learning
Explain the difference between bagging and pasting
Bagging (bootstrap aggregating): sampling is performed with replacement. Pasting: sampling is performed without replacement. Both bagging and pasting allow the same training instance to be sampled several times across multiple predictors, but only bagging allows the same training instance to be sampled several times for the same predictor.
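A minimal sketch of the two sampling schemes, using only Python's standard library. The function names `sample_bagging` and `sample_pasting` are hypothetical helpers for illustration, not part of any library.

```python
import random

def sample_bagging(n_instances, n_samples, rng):
    """Draw indices WITH replacement: an index may repeat within one predictor's sample."""
    return [rng.randrange(n_instances) for _ in range(n_samples)]

def sample_pasting(n_instances, n_samples, rng):
    """Draw indices WITHOUT replacement: every index is unique within one predictor's sample."""
    return rng.sample(range(n_instances), n_samples)

rng = random.Random(42)  # fixed seed only so the demo is repeatable
for predictor in range(3):
    # Each predictor gets its own sample, so an instance can reappear across predictors
    # under both schemes; only bagging can repeat it inside a single sample.
    print("predictor", predictor,
          "bagging:", sorted(sample_bagging(10, 6, rng)),
          "pasting:", sorted(sample_pasting(10, 6, rng)))
```

Because pasting never repeats an index, it cannot draw more samples than there are instances, while bagging can.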
Give 3 applications of clustering and briefly explain each
• Customer segmentation: useful to understand who your customers are and what they need, so you can adapt your products and marketing campaigns to each segment. • Data analysis: analyzing each cluster of data separately might give further insights. • Dimensionality reduction: once a dataset has been clustered, it is usually possible to measure each instance's affinity with each cluster, and the vector of affinities can replace the original features. • Anomaly detection (also called outlier detection): any instance that has a low affinity to all the clusters is likely to be an anomaly.
What are the advantages of using DBSCAN over the K-Means algorithm?
• DBSCAN can identify any number of clusters of any shape. • It is robust to outliers. • It has just two hyperparameters (eps and min_samples).
Which of the following is a whitebox model
• Decision Trees
Is regularization important for decision trees?
• Yes. Decision Trees make very few assumptions about the training data. • If left unconstrained, the tree structure will adapt itself to the training data, fitting it very closely and most likely overfitting it.
Which of the following are algorithms for building decision trees?
• ID3 • CART
Petal length is 5.00 cm and petal width is 1.50 cm for an iris flower. The following decision tree is given. What are the probabilities for the setosa, versicolor and virginica classes?
• Iris-Setosa: 0% (0/54) • Iris-Versicolor: 90.7% (49/54) • Iris-Virginica: 9.3% (5/54)
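The percentages above are the class counts in the leaf node that the flower ends up in, divided by the total number of training instances in that leaf. A small sketch of the arithmetic, assuming the leaf holds 0, 49 and 5 instances as stated (the helper name `leaf_probabilities` is hypothetical):

```python
def leaf_probabilities(class_counts):
    """Turn the per-class instance counts of a leaf node into class probabilities."""
    total = sum(class_counts.values())
    return {name: count / total for name, count in class_counts.items()}

# Counts taken from the answer above: 54 training instances reached this leaf.
leaf = {"setosa": 0, "versicolor": 49, "virginica": 5}
for name, prob in leaf_probabilities(leaf).items():
    print(f"{name}: {prob:.1%}")
```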
The figure below shows an ensemble learning system. Which of these could be true for this figure
• It uses boosting • It trains classifiers sequentially • It does not scale as well as bagging or pasting, because sequential training cannot be parallelized
What are the limitations of the K-Means algorithm
• It must be run several times to avoid suboptimal solutions. • The number of clusters must be specified in advance. • It does not behave very well when the clusters have varying sizes, different densities, or nonspherical shapes.
Which of the following is not true for whitebox models?
• Predictions are hard to explain
Differences between Extra Trees and Random Forest
• Random Forest trains each tree on a bootstrap replica, whereas Extra Trees uses the whole original sample by default (Extra Trees has an optional parameter that lets users enable bootstrap replicas). • Random Forest searches for the optimum split threshold for each candidate feature, while Extra Trees picks the thresholds at random.
How does the K-Means clustering algorithm work?
• Start by placing the centroids randomly (e.g., by picking k instances at random as centroids). • Label each instance according to its closest centroid. • Update each centroid's location to the mean of the instances assigned to it. • Relabel the instances according to the closest centroid. • Continue until the centroids stop moving.
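The steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function name `kmeans` is hypothetical, the initial centroids are passed in by the caller (in practice they would be picked at random, e.g. with `random.sample(points, k)`), and a cluster that ends up empty simply keeps its previous centroid.

```python
from math import dist

def kmeans(points, centroids, max_iter=100):
    """Alternate labeling and centroid updates until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Label each instance with the index of its closest centroid.
        labels = [min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
                  for p in points]
        # Move each centroid to the mean of the instances assigned to it.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(tuple(sum(axis) / len(members)
                                           for axis in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # empty cluster: keep old centroid
        if new_centroids == centroids:  # centroids stopped moving
            break
        centroids = new_centroids
    return centroids, labels

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, centroids=[(0, 0), (10, 10)])
print(labels)
print(centroids)
```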
List three ways to diversify learners in an ensemble learning model
• Use different training algorithms with the same training set • Use the same training algorithm for every predictor, but train on different random subsets of the training set • Use different random subsets of features
How does clustering help preprocessing?
Example for an image classification task: • Create a pipeline that first clusters the training set into 50 clusters. • Replace each image with its distances to these 50 clusters. • Then apply a Logistic Regression model to the transformed data.
Which of the following are true for boosting?
a) It is another ensemble method that combines several weak learners into a strong learner b) It trains predictors sequentially c) Each predictor attempts to correct its predecessor
Which of the following are true for the CART algorithm used to train decision trees?
a) It searches for an optimum split at the top level, then repeats the process at each subsequent level. b) It is a greedy algorithm: it does not check whether or not the split will lead to the lowest possible impurity several levels down. c) It produces a solution that is reasonably good but not guaranteed to be optimal.
Which of the following are true for soft voting?
a) A soft voting classifier often achieves better accuracy than hard voting b) It gives more weight to highly confident votes c) All the Scikit-Learn classifiers used must have a predict_proba() method for soft voting to be possible
Which of the following can be done by a Support Vector Machine (SVM) learning model?
a) Classification b) Regression c) Outlier detection
SVM Classification is also called _____________ classification.
a) Large margin
Given the following figure, what is the optimal number of clusters
about 4
According to the figure below, how many decision trees would you use in gradient boosting?
about 57
support vectors
are the samples on the margin.
color segmentation
assign pixels to the same segment if they have a similar color
Aggregate the predictions of a group of predictors (such as classifiers or regressors); this often gives better predictions than the best individual predictor.
ensemble learning
During SVM classification, if we strictly impose that all instances must be off the street and on the right side, this is called
hard margin classification
All pixels that are part of the same individual object are assigned to the same segment.
instance segmentation
Using a kernel function to implicitly transform the original space into a higher-dimensional space during cost function optimization is called
kernel trick
Many datasets are not even close to being linearly separable, and a linear SVM would not perform well on them. We need to transform the original space to a higher-dimensional space to improve performance. Certain values can be passed to the 'kernel' parameter of Scikit-Learn's Support Vector Machine classifier (SVC) to do this. Which of the following values can be used?
poly, rbf, sigmoid
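We use an SVM learning model to solve an ML problem. The figures show the data scaled and unscaled. Do you recommend using scaled or unscaled data, and why?
Scaled. SVMs are sensitive to feature scales: with unscaled data the feature with the largest range dominates the margin, while after scaling the decision boundary separates the classes with a much better margin.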
All pixels that are part of the same object type get assigned to the same segment.
semantic segmentation
For an ensemble learning model to work, which of the following need to be satisfied by the learners?
• There is a sufficient number of weak learners • They are sufficiently diverse
Even if each classifier used in an ensemble learning model is a ______ learner (slightly better than random guessing), the ensemble can still be a ______ learner (achieving high accuracy).
weak, strong
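A rough calculation shows why this can happen. The sketch below assumes, unrealistically, that the classifiers make fully independent errors, and computes the probability that a majority of them is correct. The helper name `majority_accuracy` is hypothetical; log-gamma is used only to keep the binomial terms numerically stable.

```python
from math import exp, lgamma, log

def majority_accuracy(n_classifiers, p_correct):
    """P(more than half of n independent classifiers are right), each right with prob p."""
    needed = n_classifiers // 2 + 1  # smallest strict majority
    total = 0.0
    for k in range(needed, n_classifiers + 1):
        # log of C(n, k) * p^k * (1 - p)^(n - k)
        log_term = (lgamma(n_classifiers + 1) - lgamma(k + 1)
                    - lgamma(n_classifiers - k + 1)
                    + k * log(p_correct)
                    + (n_classifiers - k) * log(1 - p_correct))
        total += exp(log_term)
    return total

for n in (1, 101, 1001):
    print(n, "classifiers ->", round(majority_accuracy(n, 0.51), 3))
```

In practice the classifiers are trained on the same data and their errors are correlated, so the real gain is smaller than this idealized figure.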
Name and describe two main tasks that are achieved using unsupervised learning
• Clustering: the goal is to group similar instances together into clusters. • Anomaly detection: learn what "normal" data looks like, then use that to detect abnormal instances.
What is a Gaussian Mixture Model
Gaussian Mixture Model (GMM) is a probabilistic model that assumes that the instances were generated from a mixture of several Gaussian distributions whose parameters are unknown.
Briefly explain the difference between AdaBoost and Gradient Boosting
AdaBoost tweaks the instance weights at every iteration so that each new predictor focuses on the instances its predecessor got wrong. Gradient Boosting instead fits each new predictor to the residual errors made by the previous predictor.
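Fitting each new predictor to the residuals can be sketched with one-dimensional regression stumps. This is an illustrative toy, not how any particular library implements gradient boosting: `fit_stump` and `gradient_boost` are hypothetical names, the learning rate is fixed at 1, and the loss is squared error.

```python
def fit_stump(xs, targets):
    """Fit a one-split regression stump that minimizes squared error on the targets."""
    values = sorted(set(xs))
    thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]
    if not thresholds:  # all x identical: fall back to a constant prediction
        mean = sum(targets) / len(targets)
        return lambda x: mean
    best = None
    for t in thresholds:
        left = [y for x, y in zip(xs, targets) if x <= t]
        right = [y for x, y in zip(xs, targets) if x > t]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean

def gradient_boost(xs, ys, n_stages):
    """Each stage fits a stump to the residuals left by the stages before it."""
    stumps, residuals = [], list(ys)
    for _ in range(n_stages):
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        residuals = [r - stump(x) for x, r in zip(xs, residuals)]
    return lambda x: sum(s(x) for s in stumps)

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = list(range(10))
ys = [x ** 2 for x in xs]
for n in (1, 3, 10):
    print(n, "stages, training MSE:", round(mse(gradient_boost(xs, ys, n), xs, ys), 2))
```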
This sequential learning technique has some similarities with ____________, except that instead of tweaking a single predictor's parameters to minimize a cost function, _________ adds predictors to the ensemble, gradually making it better
Gradient Descent, AdaBoost
What are hard clustering and soft clustering?
Hard clustering: assign each instance to a single cluster. Soft clustering: give each instance a score per cluster (for example, its distance to each centroid or a similarity score).
Hard and soft voting classifiers for ensemble learning
Hard voting: aggregate the predictions of each classifier and predict the class that gets the most votes (a majority vote). Soft voting: predict the class with the highest class probability, averaged over all the individual classifiers.
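The two rules can disagree. In the sketch below, three hypothetical classifiers output class probabilities for a two-class problem; two of them lean weakly toward class 1 and one is very confident in class 0. The helper names are made up for illustration, and ties in the hard vote are broken by whichever class was counted first.

```python
from collections import Counter

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

def hard_vote(probas):
    """Each classifier casts one vote for its most likely class; the majority wins."""
    votes = [argmax(p) for p in probas]
    return Counter(votes).most_common(1)[0][0]

def soft_vote(probas):
    """Average the class probabilities over all classifiers, then pick the best class."""
    n_classes = len(probas[0])
    averaged = [sum(p[c] for p in probas) / len(probas) for c in range(n_classes)]
    return argmax(averaged)

probas = [
    [0.45, 0.55],  # weakly prefers class 1
    [0.45, 0.55],  # weakly prefers class 1
    [0.95, 0.05],  # strongly prefers class 0
]
print("hard vote:", hard_vote(probas))
print("soft vote:", soft_vote(probas))
```

Here the hard vote picks class 1 (two votes to one), while the soft vote picks class 0 because the confident classifier pulls the average probability of class 0 above 0.6.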
What is the main idea behind the stacking approach in ensemble learning
Idea: Train a model to aggregate the predictions of all predictors in an ensemble.
_____ is the task of partitioning an image into multiple segments.
Image segmentation
Difference between Random Forest (RF) vs Random Patches (RP)
In Random Patches, the subset of features is selected globally once and for all, prior to the construction of the tree. In Random Forest, subsets of features are drawn locally at each node.
Hyperplane
In an n-dimensional Euclidean space, a flat, (n-1)-dimensional subset of that space that divides the space into two disconnected parts.
What would you do to improve the performance of the DBSCAN algorithm for the figure below?
Increase eps
If we are using a decision tree for the data above, which data should we use?
It does not matter, decision trees do not need the data to be scaled.
Similarities and differences between K-Means and the Expectation-Maximization (EM) algorithm
Similarities with K-Means: • It initializes the cluster parameters randomly. • It then repeats two steps until convergence: the expectation step (assigning instances to clusters) and the maximization step (updating the clusters). Differences (EM is a generalization of K-Means): • It finds not only the cluster centers (μ(1) to μ(k)), • but also the cluster size, shape, and orientation (Σ(1) to Σ(k)), • and the cluster relative weights (ϕ(1) to ϕ(k)). • It uses soft cluster assignments instead of hard ones.
Which of the following are true for hard margin classification?
a) It only works if the data is linearly separable. b) It is sensitive to outliers and will probably not generalize as well.
The following figure is given for the Gaussian mixture, what would be the optimal number of clusters?
Optimal # of clusters: 3
It is called random _______ when we sample both training instances and features.
Patches
It is called random _______ when we keep all training instances but sample features
Subspaces
margin
The distance between the decision boundary and the closest examples of each of the two classes
The Random Forest algorithm introduces extra randomness when growing trees; instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features
True
The SVM differs from many other algorithms in that it picks the extreme cases closest to the boundary (the support vectors) and uses them to construct the decision boundary.
True
The idea of SVM is to create a line or a hyperplane which separates the data into classes.
True
What methods are used to find the optimal number of trees in gradient boosting?
Two-stage training method: • Train a large number of trees. • Measure the validation error at each stage of training. • Select the number of trees with the minimum validation error. • Train a new model with that number of trees. Early stopping method: • Train incrementally, adding one tree at a time. • Measure the validation error after every tree. • Stop adding trees when the validation error has not improved for 5 iterations in a row.
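The early stopping rule can be sketched on a list of validation errors, one per added tree. The function name `early_stopping_n_trees` and the error values are hypothetical; in a real setting each error would come from evaluating the ensemble on a validation set after adding a tree.

```python
def early_stopping_n_trees(val_errors, patience=5):
    """Return the number of trees with the lowest validation error seen before
    the error fails to improve `patience` times in a row."""
    best_error = float("inf")
    best_n_trees = 0
    rounds_without_improvement = 0
    for n_trees, error in enumerate(val_errors, start=1):
        if error < best_error:
            best_error = error
            best_n_trees = n_trees
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
            if rounds_without_improvement == patience:
                break  # stop adding trees
    return best_n_trees

# Made-up validation errors: the minimum is reached with 4 trees.
errors = [0.90, 0.70, 0.50, 0.40, 0.41, 0.42, 0.43, 0.44, 0.45, 0.10]
print(early_stopping_n_trees(errors))
```

Note that early stopping never sees the last value in this list: training halts after five non-improving trees, which is the trade-off compared with the two-stage method that evaluates every stage.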
How does clustering help semi-supervised learning?
It is useful when we have plenty of unlabeled instances and very few labeled instances. For example, with only 50 labeled instances: 1. Cluster the training set into 50 clusters. 2. For each cluster, find the representative image closest to the centroid and use these 50 images as the labeled instances. 3. Propagate each label to all the other instances in the same cluster (label propagation).
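How do we know which K-Means solution is the best?
Use the inertia metric. Inertia: the sum of the squared distances between each instance and its closest centroid (some texts use the mean instead, which ranks solutions the same way on a fixed dataset). The algorithm chooses the solution with the lowest inertia.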
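A minimal sketch of the inertia computation, assuming the sum-of-squared-distances convention used by Scikit-Learn's `inertia_` attribute (dividing by the number of instances gives the mean-based variant). The function name `inertia` and the sample points are made up for illustration.

```python
def inertia(points, centroids):
    """Sum, over all instances, of the squared distance to the closest centroid."""
    total = 0
    for p in points:
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
    return total

points = [(0, 0), (0, 2), (10, 0)]
good = [(0, 1), (10, 0)]   # one centroid per natural group
bad = [(0, 0), (0, 2)]     # both centroids stuck in the same group
print(inertia(points, good), inertia(points, bad))
```

The solution with the lower value is the one K-Means would keep when run with several initializations.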
How do we select the number of clusters for a Gaussian mixture model?
We compute the Bayesian information criterion (BIC) or the Akaike information criterion (AIC) for various numbers of clusters and select the number that minimizes the BIC or AIC.
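For reference, the two criteria are usually written as below. The symbols are not from the original card: m is assumed to be the number of instances, p the number of parameters learned by the model, and \hat{L} the maximized value of the model's likelihood function.

```latex
\begin{aligned}
\mathrm{BIC} &= \log(m)\,p - 2\log(\hat{L}) \\
\mathrm{AIC} &= 2p - 2\log(\hat{L})
\end{aligned}
```

Both reward a good fit (large \hat{L}) and penalize models with many parameters; the BIC penalty grows with the dataset size, so it tends to favor simpler models than the AIC.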
Why do we want to diversify the learners
When diversified, learners will make very different types of errors, improving the ensemble's accuracy
Which of the followings is not true for decision tree regression
When used for regression, decision trees are not prone to overfitting
How does DBSCAN (Density-Based Spatial Clustering of Applications with Noise) work
1. Pick a random point that has not been marked yet. 2. Check whether the point has at least n neighbors within distance ε. • Yes: it is a core point. Check all the other points in its neighborhood: mark those that have n or more neighbors as core points and those that do not as border points. Continue the process until no unvisited core point is left in the neighborhood. All these core points, together with their border points, belong to the same cluster. • No: mark it as an outlier (noise) point. 3. If there are unmarked points left, go back to the first step.
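The procedure above can be sketched in plain Python. This is a simplified illustration with a hypothetical function name, not Scikit-Learn's implementation. Two assumptions are worth flagging: a point counts as its own neighbor (so `min_samples` includes the point itself), and a point first marked as noise is relabeled as a border point if a core point later reaches it.

```python
from math import dist

NOISE = -1

def dbscan(points, eps, min_samples):
    """Return one cluster label per point; NOISE (-1) marks outliers."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue                      # already marked
        nearby = neighbors(i)
        if len(nearby) < min_samples:
            labels[i] = NOISE             # not a core point (may become a border point)
            continue
        cluster += 1                      # i is a core point: start a new cluster
        labels[i] = cluster
        to_visit = [j for j in nearby if j != i]
        while to_visit:
            j = to_visit.pop()
            if labels[j] == NOISE:
                labels[j] = cluster       # border point reached from a core point
                continue
            if labels[j] is not None:
                continue                  # already assigned to a cluster
            labels[j] = cluster
            nearby_j = neighbors(j)
            if len(nearby_j) >= min_samples:
                to_visit.extend(nearby_j)  # j is also a core point: keep expanding
    return labels

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(points, eps=1.5, min_samples=3))
```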
How does uncertainty sampling strategy during active learning work
1. The model is trained on the labeled instances gathered so far, and then makes predictions on all the unlabeled instances. 2. The expert labels the instances for which the model is most uncertain 3. Iterate this process until the performance improvement stops being worth the labeling effort
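Step 2 needs a measure of uncertainty. One common choice, assumed here, is the "least confident" rule: an instance is more uncertain the lower its highest predicted class probability is. The function name and the probabilities below are hypothetical.

```python
def select_most_uncertain(probas, n_queries):
    """Return the indices of the n_queries instances whose top class probability is lowest."""
    by_confidence = sorted(range(len(probas)), key=lambda i: max(probas[i]))
    return by_confidence[:n_queries]

# Predicted class probabilities for four unlabeled instances (made-up values).
probas = [
    [0.90, 0.10],
    [0.55, 0.45],
    [0.50, 0.50],
    [0.99, 0.01],
]
print(select_most_uncertain(probas, n_queries=2))
```

The selected instances are the ones sent to the expert for labeling before the model is retrained.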
What is active learning?
Active learning is when a human expert interacts with the learning algorithm, providing labels for specific instances when the algorithm requests them
When does DBSCAN not perform well?
DBSCAN cannot capture all the clusters properly if the density varies significantly across the clusters
Random Forest is an ensemble of _____ _____, generally trained via the _______ method
Decision Trees, bagging
How to avoid local minimum in K-means
Run the algorithm multiple times with different random initializations and keep the best solution.
Which of the following figures has lower variance?
The one on the right (it applies bagging).
Why is the training complexity of a decision tree using the CART algorithm O(n × m log2(m))?
There are n features, m samples, and about log2(m) levels in a roughly balanced tree. The algorithm compares all features on all samples at each level, hence the O(n × m log2(m)) complexity.
Why is the prediction complexity of a decision tree O(log2(m))?
Making a prediction requires one comparison at each level, and a roughly balanced tree has about log2(m) levels.
Which value of the kernel parameter was used for each of the following Support Vector Machine regression models?
Top left: rbf. Top right: linear. Bottom left: poly. Bottom right: sigmoid.
Bagging and pasting methods enable parallel training and prediction: training can be done in parallel on different CPU cores or servers, and predictions can be made in parallel too.
True