478 Midterm
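CART cost function
J(k, t_k) = (m_left / m) * G_left + (m_right / m) * G_right, where G_left and G_right are the impurities of the left and right subsets and m_left and m_right are their numbers of instances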
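A minimal sketch of the weighted-impurity cost above. The function name `cart_cost` and the example numbers are assumptions chosen for illustration; the impurities are passed in rather than computed.

```python
def cart_cost(m_left, g_left, m_right, g_right):
    """Weighted impurity of a candidate split.

    m_left / m_right: number of instances in each subset.
    g_left / g_right: impurity (e.g. Gini) of each subset.
    """
    m = m_left + m_right
    return (m_left / m) * g_left + (m_right / m) * g_right

# Hypothetical split: 50 instances in a pure left node,
# 100 instances in a right node with impurity 0.5.
cost = cart_cost(50, 0.0, 100, 0.5)
```

CART would evaluate this cost for every candidate feature and threshold and keep the split with the lowest value.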
F_score/ F1_score
2 [(P*R)/(P+R)] == 2/[(1/P)+(1/R)]
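The two forms of the F1 score above are algebraically the same. A small sketch that checks this numerically (function names and sample values are assumptions for illustration):

```python
def f1_product_form(precision, recall):
    """F1 as 2 * (P * R) / (P + R)."""
    return 2 * (precision * recall) / (precision + recall)

def f1_harmonic_form(precision, recall):
    """F1 as the harmonic mean 2 / (1/P + 1/R)."""
    return 2 / ((1 / precision) + (1 / recall))

# Made-up precision and recall values.
p, r = 0.5, 1.0
f1_a = f1_product_form(p, r)
f1_b = f1_harmonic_form(p, r)
```

Because it is a harmonic mean, F1 is pulled toward the lower of the two values, so a classifier needs both good precision and good recall to score well.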
For a Multi-class classifier with 4 labels/classes, what is the dimensionality of the confusion matrix?
4 x 4
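To see why the matrix is 4 x 4, here is a plain-Python sketch that builds one. The helper name `confusion_matrix` and the toy labels are made up for illustration; rows are assumed to be actual classes and columns predicted classes.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Count predictions: rows are actual classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix

# Toy labels for a 4-class problem (classes 0..3).
y_true = [0, 1, 2, 3, 0, 1, 2, 3]
y_pred = [0, 1, 2, 3, 0, 2, 2, 1]
cm = confusion_matrix(y_true, y_pred, n_classes=4)

# Correct predictions land on the main diagonal.
correct = sum(cm[i][i] for i in range(4))
```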
CART algorithm
Constructs a binary tree; scikit-learn uses the CART algorithm for its implementation of decision trees; it is a recursive algorithm, and a greedy one in the sense that it searches for an optimum split at the top level, then repeats the process at each subsequent level.
sklearn by default uses entropy as criterion for splitting (T or F)
FALSE
A regression model can predict categorical values (T or F)
False
Kernel trick is to transform the data to a lower dimensional space so that it becomes linearly separable (T or F)
False
Normalization is a random shuffling and always hurts results (T or F)
False
Shuffling data randomly before or after splitting to train/test sets would significantly reduce the model performance (T or F)
False
The cost function of decision trees is a weighted average between both gini and entropy of each node. (T or F)
False
Gini
G_i = 1 - sum_{k=1..n} (p_{i,k})^2, where p_{i,k} is the ratio of class k instances among the training instances in node i
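A short sketch of the Gini impurity formula, computed from per-class instance counts in a node. The function name and the counts are assumptions for illustration.

```python
def gini_impurity(class_counts):
    """Gini impurity of a node: 1 minus the sum of squared class ratios."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

# Hypothetical node holding 0, 49 and 5 instances of three classes.
mixed_node = gini_impurity([0, 49, 5])   # about 0.168
pure_node = gini_impurity([50, 0, 0])    # a pure node has impurity 0
```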
Recall Function
TP / (TP + FN)
Precision function
TP / (TP + FP)
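Both formulas can be checked with a few lines of Python. The counts below are invented for illustration.

```python
def recall(tp, fn):
    """Share of actual positives that the classifier found."""
    return tp / (tp + fn)

def precision(tp, fp):
    """Share of positive predictions that were correct."""
    return tp / (tp + fp)

# Made-up counts: 8 true positives, 2 false positives, 8 false negatives.
r = recall(tp=8, fn=8)
p = precision(tp=8, fp=2)
```

With these counts the classifier is usually right when it predicts positive, but it finds only half of the actual positives.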
Soft margin & hard margin SVM
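If we strictly impose that all instances be off the street and on the correct side of the decision boundary, this is called hard margin classification; hard margin SVM only works if the data is linearly separable, and it is quite sensitive to outliers. Soft margin classification relaxes this requirement: it looks for a balance between keeping the street as wide as possible and limiting margin violations.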
SVM overfitting and underfitting
If you trained an SVM classifier with a linear kernel and it seems to underfit the training set, try changing the kernel to a non-linear kernel such as poly or rbf. If you trained an SVM classifier with a non-linear kernel and it seems to overfit the training set, try changing the kernel to a linear kernel or a non-linear kernel with a lower complexity, e.g. a polynomial with a lower degree. A higher value for the C parameter is more likely to lead to overfitting as it narrows the margins. A lower value for the C parameter is less likely to lead to overfitting as it makes the model generalize better by widening the margins.
Example of unsupervised learning?
Image clustering
Online learning vs batch learning differences
In batch learning, the system is incapable of learning incrementally; it must be trained using all available data, whereas in online learning, you train the system incrementally by feeding it data instances sequentially, either individually or by small groups called mini-batches
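A toy sketch of the difference, using a running mean as a stand-in for a model. The class `OnlineMean` and its `partial_fit` method are hypothetical names for illustration, not a real library API.

```python
def batch_fit(data):
    """Batch learning: needs the full dataset at once."""
    return sum(data) / len(data)

class OnlineMean:
    """Online learning: updates its estimate one mini-batch at a time."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def partial_fit(self, mini_batch):
        # Incremental update; earlier data does not need to be kept around.
        self.n += len(mini_batch)
        self.mean += (sum(mini_batch) - len(mini_batch) * self.mean) / self.n

data = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
batch_estimate = batch_fit(data)

online = OnlineMean()
for start in range(0, len(data), 2):
    online.partial_fit(data[start:start + 2])  # mini-batches of size 2
```

Both approaches reach the same estimate here, but the online version never holds more than one mini-batch in memory.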
Ridge Regression Cost Function
J(theta) = MSE(theta) + (alpha / 2) * sum_{i=1..n} theta_i^2
Lasso Regression Cost Function
J(theta) = MSE(theta) + alpha * sum_{i=1..n} |theta_i|
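A sketch of the Ridge and Lasso penalties side by side. It assumes `theta[0]` is the bias term and is left out of the penalty (the sums start at i = 1), and that the MSE has already been computed; all values are made up.

```python
def ridge_cost(mse, theta, alpha):
    """MSE plus (alpha / 2) times the sum of squared weights (bias excluded)."""
    return mse + (alpha / 2) * sum(t ** 2 for t in theta[1:])

def lasso_cost(mse, theta, alpha):
    """MSE plus alpha times the sum of absolute weights (bias excluded)."""
    return mse + alpha * sum(abs(t) for t in theta[1:])

# Made-up values: theta[0] = 5.0 is the bias and is not penalized.
theta = [5.0, 3.0, -4.0]
ridge = ridge_cost(mse=1.0, theta=theta, alpha=0.1)  # 1.0 + 0.05 * 25
lasso = lasso_cost(mse=1.0, theta=theta, alpha=0.1)  # 1.0 + 0.1 * 7
```

Setting `alpha` to 0 removes the penalty, leaving the plain MSE in both cases.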
What ML method is supervised learning but sounds like it shouldn't be?
Logistic Regression
What is overfitting?
The model performs well on the training data but does not generalize well; it is likely to detect patterns in the noise itself, especially when the training set is too small or too noisy.
What is RMSE used for?
Regression error
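For reference, RMSE is the square root of the mean squared difference between targets and predictions. A minimal sketch with invented numbers:

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between targets and predictions."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Every prediction is off by exactly 2, so the RMSE is 2.
error = rmse([1, 2, 3, 4], [3, 4, 1, 2])
```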
Cross validation:
Splits the data into k different folds and runs for k iterations; in each iteration it reserves 1 fold for testing and uses the remaining k-1 folds for training; it is a technique for evaluating the performance of ML models.
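The fold bookkeeping can be sketched in plain Python. The generator name `k_fold_indices` is an assumption for illustration, and no shuffling is done here.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k iterations."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]                # 1 fold for testing
        train = indices[:start] + indices[start + size:]  # k-1 folds for training
        yield train, test
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
```

Each sample appears in the test fold exactly once across the k iterations; the k scores are typically averaged.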
Difference between supervised and unsupervised learning
Both supervised and unsupervised learning involve training; however, in supervised learning the training data you feed the algorithm includes the desired solutions, called labels, whereas in unsupervised learning the data is not labeled.
Examples of Machine Learning regression problems
Temperature forecast; predicting stock market
MNIST dataset (0-9) handwritten
The features of each data sample are pixel intensities and there are 10 different labels; the features are numbers from 0-255 and the labels are from 0-9; it is often called the "Hello World" of machine learning
Precision vs recall
The lower the number of false negatives, the higher the recall
A confusion matrix with high scores on the main diagonal indicates a good model performance (T or F)
True
A good way to reduce overfitting is to regularize the model (constrain it) by adding a regularization parameter (T or F)
True
A smaller value for Entropy is better and should be preferred for choosing the feature in decision trees (T or F)
True
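A small sketch of node entropy, assuming base-2 logarithms and per-class instance counts as input (the function name and counts are made up):

```python
import math

def entropy(class_counts):
    """Entropy of a node: -sum(p_k * log2(p_k)) over classes with p_k > 0."""
    total = sum(class_counts)
    result = 0.0
    for count in class_counts:
        if count > 0:  # skip empty classes; log2(0) is undefined
            p = count / total
            result -= p * math.log2(p)
    return result

evenly_mixed = entropy([5, 5])   # maximally impure for two classes
pure = entropy([10, 0])          # a pure node has zero entropy
```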
Clustering is an example of unsupervised learning (T or F)
True
Common to use 80% of data for training and 20% for testing? (T or F)
True
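A plain-Python sketch of an 80/20 split with shuffling. The helper name, the seed, and the 100-sample range are assumptions for illustration.

```python
import random

def train_test_split(samples, test_ratio=0.2, seed=42):
    """Shuffle a copy of the samples, then hold out test_ratio of them."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

train_set, test_set = train_test_split(range(100))
```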
Comparing AI, ML & DL, one can argue that there is a superset-subset relationship between them such that DL is a subset of ML and ML is a subset of the broad field of approaches, algorithms and techniques in AI (T or F)
True
Cross validation is an effective way of model evaluation (T or F)
True
Finding an optimal decision tree is an NP-complete problem (T or F)
True
Fine tuning model parameters may improve the results of the ML model (T or F)
True
Gradient Descent solves optimization problems on the cost function using gradient matrix and a learning rate which should be neither too small nor too large (T or F)
True
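To illustrate the learning-rate point, here is a sketch of gradient descent on a toy one-parameter cost, (theta - 3)^2, whose minimum is at theta = 3. The cost function and the rates are chosen only for illustration.

```python
def gradient_descent(gradient, start, learning_rate, n_steps):
    """Repeatedly step against the gradient, scaled by the learning rate."""
    theta = start
    for _ in range(n_steps):
        theta -= learning_rate * gradient(theta)
    return theta

def grad(theta):
    """Gradient of the toy cost (theta - 3)^2."""
    return 2 * (theta - 3)

good = gradient_descent(grad, start=0.0, learning_rate=0.1, n_steps=100)
too_small = gradient_descent(grad, start=0.0, learning_rate=0.001, n_steps=100)
too_large = gradient_descent(grad, start=0.0, learning_rate=1.1, n_steps=20)
```

With the moderate rate the result is very close to 3; the tiny rate is still far away after 100 steps, and the large rate overshoots further on every step and diverges.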
Machine Learning is great for problems for which solutions require a lot of hand-tuning or long lists of rules: one ML algorithm can often simplify code and perform better (T or F)
True
Matplotlib is a Python module that has a wide variety of plotting features and functions and can be used for data visualization (T or F)
True
Normalization is one way of scaling and usually improves the model performance (T or F)
True
Normalization may change the data range (T or F)
True
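A sketch of min-max normalization, one common way to rescale a feature into the range 0 to 1. The function name and the values are made up.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: avoid dividing by zero
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_scale([10, 20, 30, 50])
```

The relative order and spacing of the values is preserved; only the range changes.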
Preprocessing the data is a critical step in preparing the data for the ML model and may include cleaning the data by dropping NA values (T or F)
True
ROC curve plots "true positive rate" TPR on y-axis against "false positive rate" FPR on x-axis, and its "area under curve" AUC is a performance measure of ML models (T or F)
True
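Each point on a ROC curve comes from one decision threshold. A sketch that computes (TPR, FPR) pairs for a few thresholds, with invented labels and scores and assuming both classes are present:

```python
def tpr_fpr(y_true, scores, threshold):
    """Return (TPR, FPR) when scores >= threshold are predicted positive."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if label == 1 and predicted_positive:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), fp / (fp + tn)

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# Lowering the threshold moves along the curve from (0, 0) toward (1, 1).
roc_points = [tpr_fpr(y_true, scores, t) for t in (0.9, 0.5, 0.3, 0.0)]
```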
Regression is predicting a target numeric value, such as the price of a car, given a set of features called predictors (T or F)
True
Some algorithms can deal with partially labeled training data, usually a lot of unlabeled data and a bit of labeled data. This is called semisupervised learning (T or F)
True
Sometimes scaling large values in features may improve the results of the ML model (T or F)
True
Stochastic Gradient Descent uses only one random instance to compute the gradients at every step whereas Batch Gradient Descent uses the whole training set. (T or F)
True
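The difference shows up in how the gradient is computed. A sketch for a one-weight linear model y = w * x with MSE cost; the model, data, and function names are assumptions for illustration.

```python
import random

def batch_gradient(w, xs, ys):
    """Gradient of the MSE computed over the whole training set."""
    m = len(xs)
    return (2 / m) * sum((w * x - y) * x for x, y in zip(xs, ys))

def stochastic_gradient(w, xs, ys, rng):
    """Gradient estimate from a single randomly chosen instance."""
    i = rng.randrange(len(xs))
    return 2 * (w * xs[i] - ys[i]) * xs[i]

# Toy data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
rng = random.Random(0)

g_batch = batch_gradient(0.0, xs, ys)
g_sgd = stochastic_gradient(0.0, xs, ys, rng)
```

The stochastic estimate is cheaper but noisy: depending on which instance is drawn, it can differ a lot from the batch gradient.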
There is a trade-off between precision and recall such that any attempt to increase precision will decrease recall and vice-versa (T or F)
True
Typical supervised learning task is classification (T or F)
True
Gini
A lower Gini index is better because it means lower impurity