Machine Learning Quiz 4
Error Rate
(FP + FN) / (TP + TN + FP + FN), or 1 - accuracy
Accuracy formula
(TP + TN) / (TP + TN + FP + FN)
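A minimal Python sketch of both formulas; the TP/TN/FP/FN counts below are made up purely for illustration:

```python
# Hypothetical confusion-matrix counts (illustration only)
TP, TN, FP, FN = 90, 80, 10, 20

accuracy = (TP + TN) / (TP + TN + FP + FN)    # (90 + 80) / 200 = 0.85
error_rate = (FP + FN) / (TP + TN + FP + FN)  # (10 + 20) / 200 = 0.15
assert abs(error_rate - (1 - accuracy)) < 1e-9  # error rate = 1 - accuracy
print(accuracy, error_rate)
```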
Boosting
At each iteration of the boosting process: 1. The resampled datasets are constructed specifically to generate complementary learners. 2. Each learner's vote is weighted based on its past performance.
Adaptive Boosting (AdaBoost) Strengths
Boosting is a relatively simple ensemble method to implement. Requires less parameter tuning compared to other ensemble methods. Can be used with many different classifiers.
The Holdout Method
The holdout method sets aside a portion of the data as a test set, so the model is trained on one partition and evaluated on data it has never seen. For this to result in a truly accurate estimate of future performance, at no time should the performance on the test dataset be allowed to influence the model.
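A minimal holdout sketch, assuming scikit-learn (the notes do not prescribe a particular library); the model choice is illustrative:

```python
# Holdout evaluation: train on one partition, test once on held-out data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Stratified split keeps class proportions similar in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
# The test set is touched only once, for the final performance estimate.
print(model.score(X_test, y_test))
```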
Adaptive Boosting Weaknesses
High tendency to overfit with many weak learners. Rather slow training time. Sensitive to noisy data and outliers.
Random Forest Strengths
Performs well on most problems. Handles noisy or missing data as well as categorical or continuous features. Selects only the most important features. Works for data with an extremely large number of features.
The Receiver Operating Characteristic (ROC) curve
The Receiver Operating Characteristic (ROC) curve is commonly used to examine the trade-off between detecting true positives and avoiding false positives.
Measuring Performance
The best measure of performance is always the one that captures whether the classifier is successful at its intended purpose. This means that performance measures should be defined for utility rather than raw accuracy.
2. Which parameters to tune
The goal is to: 1. Create a matrix or grid of parameter combinations. 2. Develop candidate models based on the parameter grid. 3. Perform a search for the best model.
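A minimal sketch of these three steps, assuming scikit-learn's GridSearchCV; the model and grid values are illustrative:

```python
# Grid search: build a parameter grid, fit a candidate model per combination,
# and search for the best-performing one.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
param_grid = {"max_depth": [2, 4, 8], "min_samples_split": [2, 10]}  # the parameter grid
search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)  # develops and evaluates one candidate model per combination
print(search.best_params_, search.best_score_)
```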
Advantages of Ensembles
1) Better generalizability to future problems. 2) Improved performance on massive or minuscule datasets. 3) The ability to synthesize data from distinct domains. 4) A more nuanced understanding of difficult learning tasks.
Problems with the Holdout Method
1. Each partition may have a larger or smaller proportion of some classes. This could lead to a class being omitted from the training data (resolved by stratified random sampling). 2. Some samples may have too many or few difficult cases, easy-to-predict cases, or outliers. 3. Substantial portions of data must be reserved to test and validate the model.
Two main ensemble method families
1. Averaging methods 2. Boosting methods
Cross-validation
A technique known as repeated holdout is sometimes used to mitigate the problems with the holdout method. It uses the average result from several random holdout samples to evaluate a model's performance. It forms the basis for the technique known as k-fold cross-validation.
Averaging Methods
Independently built models with their predictions averaged or combined by a voting scheme. They attempt to reduce the variance of a single base estimator. Examples include *Bagging methods* and *Random Forests*.
K-fold Cross-validation
K-fold cross-validation randomly divides the data into k completely separate random partitions called folds. Each fold is used once as the test set while the remaining data is used for training. At the end, the average performance across all k folds is reported.
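A minimal k-fold sketch, assuming scikit-learn; k = 10 is a common choice but not mandated by these notes:

```python
# K-fold cross-validation: each fold serves once as the test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
folds = KFold(n_splits=10, shuffle=True, random_state=42)
# The other nine folds train the model each time.
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=folds)
print(scores.mean())  # average performance across all folds
```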
Extreme Gradient Boosting (XGBoost)
The Extreme Gradient Boosting learner is an implementation of gradient boosted decision trees designed specifically for speed and performance. With gradient boosting, instead of assigning weights to models at each iteration, subsequent models attempt to predict the residuals of prior models using a gradient descent algorithm to minimize loss.
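A minimal sketch, assuming the separate `xgboost` Python package is installed; the parameter values are illustrative:

```python
# XGBoost: each new tree is fit to the residual errors of the current ensemble.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```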
Estimating Future Performance (resubstitution error)
Most learners present performance measures during training. This is known as the *resubstitution* error. This metric is overly optimistic and cannot reliably measure future performance. A better measure of future performance is evaluation against unseen data.
The F-Measure
The F-measure combines precision and recall into a single number using the harmonic mean. It provides a convenient way to compare several models side by side. F-measure = (2 × precision × recall) / (precision + recall) = 2TP / (2TP + FP + FN)
Boosting Methods
Sequentially built models which are combined to produce a powerful ensemble. They attempt to reduce the *bias* of the *combined* estimator. Examples include AdaBoost and Gradient Tree Boosting
Area Under the Curve (AUC)
The AUC treats the ROC diagram as a two-dimensional square and measures the total area under the ROC curve. AUC ranges from 0.5 (for a classifier with no predictive value) to 1.0 (for a perfect classifier).
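A minimal ROC/AUC sketch, assuming scikit-learn; the logistic regression model is illustrative:

```python
# ROC curve and AUC from predicted probabilities of the positive class.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]  # probability of the positive class
fpr, tpr, _ = roc_curve(y_test, scores)     # points along the ROC curve
print(roc_auc_score(y_test, scores))        # area under that curve, 0.5-1.0
```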
Adaptive Boosting (AdaBoost)
The Adaptive Boosting learner works by sequentially adding weak models which are trained using weighted training data. Each model is assigned a stage value which corresponds to how accurate it is against the training data. The prediction for the ensemble is taken as the sum of the weighted predictions of the weak classifiers.
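A minimal AdaBoost sketch, assuming a recent scikit-learn version; depth-1 trees ("stumps") are a typical weak learner but not required:

```python
# AdaBoost: each stump trains on reweighted data; its vote is scaled by its accuracy.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
stump = DecisionTreeClassifier(max_depth=1)
ada = AdaBoostClassifier(estimator=stump, n_estimators=50, random_state=42)
print(cross_val_score(ada, X, y, cv=5).mean())
```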
Stacking
utilizes another model to learn a combination function from the predictions of several other models.
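A minimal stacking sketch, assuming scikit-learn's StackingClassifier; the base learners and meta-learner are illustrative choices:

```python
# Stacking: a meta-learner (logistic regression) learns how to combine base predictions.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
base_learners = [("rf", RandomForestClassifier(random_state=42)),
                 ("knn", KNeighborsClassifier())]
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(max_iter=5000))
print(cross_val_score(stack, X, y, cv=5).mean())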
ensemble
The meta-learning approach that utilizes the principle of creating a varied team of experts. Ensemble methods are based on the idea that by combining multiple weaker learners, a stronger learner is created.
Precision
The precision (also known as the positive predictive value) is defined as the proportion of positive predictions that are truly positive. A model with high precision is trustworthy. precision = TP / (TP + FP)
Parameter Tuning
The process of adjusting a model's parameters to identify the best fit is called parameter tuning.
Recall
The recall is a measure of the completeness of the results. A model with high recall has wide breadth. recall = TP / (TP + FN)
Sensitivity
The sensitivity of a model (also called the true positive rate) measures the proportion of positive examples that were correctly classified. sensitivity = TP / (TP + FN)
Specificity
The specificity of a model (also called the true negative rate) measures the proportion of negative examples that were correctly classified. specificity = TN/ (TN + FP)
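A small worked example covering the F-measure, precision, recall/sensitivity, and specificity formulas above; the confusion-matrix counts are made up for illustration:

```python
# Hypothetical confusion-matrix counts (illustration only)
TP, TN, FP, FN = 90, 80, 10, 20

precision   = TP / (TP + FP)   # 90 / 100 = 0.90
recall      = TP / (TP + FN)   # 90 / 110 ~= 0.82 (same formula as sensitivity)
specificity = TN / (TN + FP)   # 80 / 90  ~= 0.89
f_measure   = (2 * precision * recall) / (precision + recall)
print(precision, recall, specificity, f_measure)
```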
meta-learning
The technique of combining and managing the predictions of multiple models is known as meta-learning.
3. What performance criteria to use
These include statistics such as accuracy and kappa (for classifiers) and R-squared or RMSE (for numeric models). Cost-sensitive measures such as sensitivity, specificity, and area under the ROC curve (AUC) can also be used.
1. What model to choose
This requires an understanding of the strengths and weaknesses of machine learning models. It also requires an understanding of the data and the machine learning task. Simply understanding whether the task is a classification problem or a numeric prediction problem (regression) helps narrow the choices.
Random Forest Weaknesses
Unlike a decision tree, the model is not easily interpretable. May require some work to tune the model to the data. Increased computational complexity.
Bootstrapping
Using sampling with replacement to form the training set, bootstrapping presents an alternative to cross-validation: 1. Sample a dataset of n instances n times with replacement, forming a new dataset of n instances. 2. Use the resulting data as the training set. 3. Use instances from the original dataset that are not in the resulting training set for testing. Bootstrapping typically uses less data than cross-validation for training; therefore, its test error will be rather pessimistic. To account for this, the final error rate for bootstrapping is calculated as: error = 0.632 × error_test + 0.368 × error_train
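A minimal sketch of those three steps and the 0.632 weighting, assuming scikit-learn and NumPy; the decision tree is an illustrative model choice:

```python
# Bootstrap evaluation: sample n instances with replacement, test on the rest.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
n = len(y)
rng = np.random.default_rng(42)

boot_idx = rng.integers(0, n, size=n)             # sample n instances with replacement
oob_idx = np.setdiff1d(np.arange(n), boot_idx)    # instances never drawn ("out of bag")

model = DecisionTreeClassifier(random_state=42).fit(X[boot_idx], y[boot_idx])
error_train = 1 - model.score(X[boot_idx], y[boot_idx])
error_test = 1 - model.score(X[oob_idx], y[oob_idx])
error = 0.632 * error_test + 0.368 * error_train  # the 0.632 bootstrap estimate
print(error)
```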
Occam's razor ("law of parsimony")
when presented with competing hypothetical answers to a problem, one should select the answer that makes the fewest assumptions.
automated parameter tuning
a search to identify the optimal combination of parameters using a choice of evaluation methods and metrics. It requires answering three questions: 1. What type of machine learning model (and specific implementation) should be trained on the data? 2. Which model parameters can be adjusted, and how extensively should they be tuned to find the optimal settings? 3. What criteria should be used to evaluate the models to find the best candidate?
confusion matrix
a table that categorizes predictions according to whether they match the actual value
Bagging (Bootstrap Aggregating)
a technique that generates a number of training datasets by bootstrap sampling the original training data. 1. The training datasets are used to generate a set of models using a single learner. 2. The models' predictions are combined using voting (for classification) or averaging (for numeric prediction).
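A minimal bagging sketch, assuming a recent scikit-learn version; the decision tree base learner and ensemble size are illustrative:

```python
# Bagging: each tree trains on a bootstrap sample; predictions are combined by voting.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
bag = BaggingClassifier(estimator=DecisionTreeClassifier(),
                        n_estimators=25, random_state=42)
print(cross_val_score(bag, X, y, cv=5).mean())
```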
allocation function
dictates how much of the training data each model receives
Random Forest
focuses only on ensembles of decision trees. It combines the base principles of bagging with random feature selection to add additional diversity to decision tree models. 1. Create random vectors. 2. Use each random vector to build a decision tree. 3. Combine the decision trees.
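A minimal random forest sketch, assuming scikit-learn; the parameter values are illustrative:

```python
# Random forest: each tree sees a bootstrap sample and a random feature subset per split.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)
print(cross_val_score(forest, X, y, cv=5).mean())
```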
combination function
governs how disagreements among the predictions are reconciled.
The Kappa Statistic
k = [Pr(a) - Pr(e)] / [1 - Pr(e)] adjusts accuracy by accounting for the possibility of a correct prediction by chance alone. Pr(a) refers to the proportion of actual agreement, while Pr(e) refers to the expected agreement. Kappa values range from 0 (poor) to 1 (good). Pr(e) = Pr(actual yes) × Pr(predicted yes) + Pr(actual no) × Pr(predicted no)
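A worked kappa computation using the formula above; the confusion-matrix counts are made up for illustration:

```python
# Hypothetical confusion-matrix counts (illustration only)
TP, TN, FP, FN = 90, 80, 10, 20
total = TP + TN + FP + FN

pr_a = (TP + TN) / total                              # observed agreement = 0.85
pr_yes = ((TP + FN) / total) * ((TP + FP) / total)    # chance agreement on "yes"
pr_no = ((TN + FP) / total) * ((TN + FN) / total)     # chance agreement on "no"
pr_e = pr_yes + pr_no                                 # expected agreement = 0.50
kappa = (pr_a - pr_e) / (1 - pr_e)
print(kappa)   # 0.70 for these counts
```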
Hyperparameters
parameters that need to be set before the learning process begins