CSC529 Week 3

Boosting challenges

Two challenges: how should the training examples be reweighted at each round, and how should the resulting classifiers be combined?

Random Forests algorithm

Choose T, the number of trees to grow.
Choose m < M (M is the total number of features), the number of features used to calculate the best split at each node (typically 20% of M).
For each tree:
- Choose a training set by drawing N times with replacement from the training data (N is the number of training examples).
- At each node, randomly choose m features and calculate the best split among them.
- Grow the tree fully; do not prune.
Use majority voting among all the trees.
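A minimal sketch of this procedure in Python, assuming scikit-learn is available. The bootstrap loop is explicit; the per-node choice of m random features is delegated to DecisionTreeClassifier via max_features. The function names and defaults are illustrative, and class labels are assumed to be non-negative integers.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_fit(X, y, n_trees=100, m_features=None, seed=0):
    # Grow n_trees fully grown, unpruned trees, each on a bootstrap sample;
    # max_features=m_features limits each split to m randomly chosen features.
    rng = np.random.default_rng(seed)
    n = len(X)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)  # draw N times with replacement
        trees.append(DecisionTreeClassifier(max_features=m_features).fit(X[idx], y[idx]))
    return trees

def random_forest_predict(trees, X):
    # Majority vote among all the trees (labels assumed non-negative ints).
    votes = np.stack([t.predict(X) for t in trees]).astype(int)
    return np.array([np.bincount(votes[:, j]).argmax() for j in range(votes.shape[1])])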

Ensembles: Manipulate output targets

(same data, same algorithm: convert multiclass problems into many two-class problems)

EP: Given three classifiers, each with probability of error of 0.2, combined by simple voting, what is the probability of error of the ensemble?

0.104

EP: Given 9356 classifiers, each with probability of error of 0.5, combined by simple voting, what is the probability of error of the ensemble?

0.5

EP: Random Forests are an example of ....

Bagging

EP: Which of the following ensemble techniques can be easily parallelized?

Bagging

Bagging algorithm

Bagging - Training
1. k = 1
2. p_i = 1/m, for i = 1, ..., m
3. While k < EnsSize
   3.1 Create training set Tk (normally of size m) by sampling from T with replacement according to probability distribution p.
   3.2 Build classifier Ck using learning algorithm L and training set Tk.
   3.3 If error_T(Ck) < threshold, k = k + 1.
   3.4 Go to 3.1.
4. Output C1, C2, ..., Ck.
Classification: classify new examples by voting among C1, C2, ..., Ck.
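A hedged Python sketch of this training loop, with scikit-learn's DecisionTreeClassifier standing in for the generic learner L. Plain bagging keeps every classifier, so the error-threshold test in step 3.3 is omitted here; ens_size and the helper names are illustrative.

import numpy as np
from collections import Counter
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def bagging_train(X, y, base=DecisionTreeClassifier(), ens_size=25, seed=0):
    # Steps 1-4: uniform p, sample m examples with replacement, build C1..Ck.
    rng = np.random.default_rng(seed)
    m = len(X)
    ensemble = []
    for _ in range(ens_size):
        idx = rng.choice(m, size=m, replace=True)         # step 3.1: uniform resampling
        ensemble.append(clone(base).fit(X[idx], y[idx]))  # step 3.2
    return ensemble

def bagging_predict(ensemble, X):
    # Classification: majority vote among C1, ..., Ck.
    preds = [c.predict(X) for c in ensemble]
    return [Counter(col).most_common(1)[0][0] for col in zip(*preds)]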

EP: Which of the following is a fundamental difference between bagging and boosting?

Bagging does not take the performance of previously built models into account when building a new model. With boosting, each new model is built based on the results of the previous models.

Measuring Bias and Variance in Practice (biases)

Bias and variance are both defined as expectations: Bias(x) = E_P[f(x) - fbar(x)]

Measuring Bias and Variance in Practice (variance)

Bias and variance are both defined as expectations: Var(x) = E_P[(f(x) - fbar(x))^2], where fbar(x) = E_P[f(x)] is the average prediction over training sets.
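In practice these expectations can be approximated by repeating the model-building process on resampled training sets, as the "Error due to Bias/Variance" cards below describe. A minimal sketch for a regression setting, assuming a scikit-learn-style estimator; the function name and arguments are illustrative.

import numpy as np
from sklearn.base import clone

def bias_variance_at_point(base, X, y, x_query, y_true, n_rounds=200, seed=0):
    # Approximate E_P by retraining on bootstrap replicates of (X, y).
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_rounds):
        idx = rng.choice(len(X), size=len(X), replace=True)
        preds.append(clone(base).fit(X[idx], y[idx]).predict(x_query.reshape(1, -1))[0])
    preds = np.array(preds)
    fbar = preds.mean()                      # average prediction over training sets
    bias = fbar - y_true                     # how far the average prediction is off
    variance = ((preds - fbar) ** 2).mean()  # spread of predictions around their mean
    return bias, variance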

Boosting: Principles

Boost a set of weak learners into a strong learner by making currently misclassified records more important.

Boosting algorithm

Boosting - Training
1. k = 1; w_1i = 1/m, for i = 1, ..., m
2. While k < EnsSize
   2.1 For i = 1, ..., m: p_i = w_i / sum(w_i)
   2.2 Create training set Tk (normally of size m) by sampling from T with replacement according to probability distribution p.
   2.3 Build classifier Ck using learning algorithm L and training set Tk.
   2.4 Classify the examples in T.
   2.5 If error_T(Ck) < threshold: k = k + 1, and for i = 1 to m, if Ck(x_i) != y_i then w_i = w_i * Beta (Beta > 1): increase the weights of misclassified examples.
   2.6 Go to 2.1.
3. Output C1, C2, ..., Ck.
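A compact Python rendering of this loop under the card's assumptions (a fixed Beta > 1, resampling in proportion to weight). The beta and error_threshold parameters are illustrative, and the sketch assumes the base learner eventually beats the threshold.

import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def boosting_train(X, y, base=DecisionTreeClassifier(max_depth=1),
                   ens_size=10, beta=2.0, error_threshold=0.5, seed=0):
    rng = np.random.default_rng(seed)
    m = len(X)
    w = np.ones(m) / m                                  # step 1: uniform weights
    ensemble = []
    while len(ensemble) < ens_size:
        p = w / w.sum()                                 # 2.1: normalize to a distribution
        idx = rng.choice(m, size=m, replace=True, p=p)  # 2.2: weighted resampling
        clf = clone(base).fit(X[idx], y[idx])           # 2.3: build Ck
        wrong = clf.predict(X) != y                     # 2.4: classify all of T
        if wrong.mean() < error_threshold:              # 2.5: accept Ck...
            ensemble.append(clf)
            w[wrong] *= beta                            # ...and up-weight its mistakes
    return ensemble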

Boosting

Boosting is a machine learning ensemble meta-algorithm primarily for reducing bias, and also variance, in supervised learning, and a family of machine learning algorithms that convert weak learners to strong ones. Boosting is based on the question posed by Kearns and Valiant: can a set of weak learners be combined to create a single strong learner?

Bagging 2 steps

1. Bootstrap: sampling with replacement; each sample contains around 63.2% of the original records.
2. Bootstrap aggregation: train a classifier on each bootstrap sample and use majority voting to determine the class label of the ensemble classifier.

Bagging stands for

Bootstrap Aggregation

Bagging

Bootstrap aggregating, also called bagging, is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the model averaging approach.

elephant & the blind men example

Why ensembles work: six blind men each touch a different part of an elephant (e.g., the tail or the trunk). Individually, each can describe only the part he feels; together, they can identify the animal as an elephant.

ensemble methods efficiency

a complex problem can be decomposed into multiple sub-problems that are easier to understand and solve (divide-and-conquer approach)

Boosting was revised to be a

practical algorithm, AdaBoost, for building ensembles that empirically improves generalization performance (Freund & Schapire, 1996).

The Representational Problem

arises when the hypothesis space does not contain any good approximation of the target class(es).

The Statistical Problem

arises when the hypothesis space is too large for the amount of available data. Hence, there are many hypotheses with the same accuracy on the data and the learning algorithm chooses only one of them! There is a risk that the accuracy of the chosen hypothesis is low on unseen data!

The Computational Problem

arises when the learning algorithm cannot guarantee finding the best hypothesis.

It is easy to see why bagging reduces variance

averaging

Boosting: In an ensemble, the output of an instance is computed by

averaging the outputs of several hypotheses

Boosting was originally developed

by computational learning theorists to guarantee performance improvements on fitting training data for a weak learner that only needs to generate a hypothesis with a training accuracy greater than 0.5 (Schapire, 1990).

Ensemble learning

combine the predictions of different hypotheses by some sort of voting

Boosting: Instead of constructing the hypothesis independently,

construct them so that each new hypothesis focuses on the instances that were problematic for the previous hypotheses

Bagging decreases error by

decreasing the variance in the results due to unstable learners, algorithms (like decision trees) whose output can change dramatically when the training data is slightly changed.

Human ensembles are

demonstrably better

ensemble methods applications

Distributed computing; privacy-preserving applications; large-scale data with reusable models; multiple sources of data.

Bagging is a simple example of an

ensemble learning algorithm

Why Do Ensembles Work? Dietterich(2002) showed that

ensembles overcome three problems: the Statistical Problem, the Computational Problem, and the Representational Problem

Learning with Weighted Examples: Generic approach

is to replicate examples in the training set in proportion to their weights (e.g., 10 replicates of an example with weight 0.01 and 100 replicates of one with weight 0.1). Most algorithms can instead be enhanced to incorporate weights directly in the learning algorithm, with the same effect. For decision trees, when calculating information gain, simply increment the corresponding count by w_i rather than by 1 when counting example i.
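A sketch of the decision-tree case: weighted class totals replace unit counts when computing the entropy used for information gain. The function name is illustrative.

import numpy as np

def weighted_entropy(labels, weights):
    # Example i contributes its weight w_i instead of a count of 1.
    totals = {}
    for label, w in zip(labels, weights):
        totals[label] = totals.get(label, 0.0) + w
    p = np.array(list(totals.values()))
    p = p / p.sum()
    return float(-(p * np.log2(p)).sum())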

Issues in ensembles (bagging is easily)

parallelized

Issues in ensembles (boosting is NOT easily)

parallelized

Mathematical proof example for ensembles: For 3 classifiers, each with probability of error of 1/3, combined by simple voting, the probability of error is equal to the probability that two classifiers make an error plus the probability that all three classifiers make an error.

pe(ens) = C(3,2) pe^2 (1-pe) + C(3,3) pe^3 = 3 * (1/3)^2 * (2/3) + (1/3)^3 = 2/9 + 1/27 = 7/27 ≈ 0.26 (down from 0.33 for a single classifier)
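The same calculation generalizes to any odd number of independent classifiers. A quick Python check, reproducing both this 7/27 result and the 0.104 answer from the earlier three-classifier question:

from math import comb

def ensemble_error(p, n):
    # P(majority vote is wrong) for n independent classifiers, each with error p:
    # the ensemble errs when more than half of the classifiers err.
    k_min = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

print(ensemble_error(1/3, 3))  # 7/27, about 0.259
print(ensemble_error(0.2, 3))  # 0.104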

Value of Ensembles (When combing multiple independent and diverse decisions each of which is at least more accurate than random guessing)

random errors cancel each other out, correct decisions are reinforced.

Bagging draws examples with

replacement

Ensembles: Manipulate training data

same learning algorithm, different training data

Ensembles: Heterogeneous ensembles

same training data, different learning algorithms

The more classifiers used,

the better the ensemble performs

Issues in ensembles (variants of boosting )

to handle noisy data

Ensembles: Manipulate input features

use different subsets of the attribute sets

Ensemble methods combine learners to reduce

variance

EP: Which of the following are considered ensemble learning. (Mark all that apply)

Combining the outputs of several different algorithms using the same training data. Combining the outputs of the same algorithm trained on different subsets of the training data. Combining the outputs of the same algorithm trained on different features of the training data.

Bagging 2

Create ensembles by repeatedly resampling the training data at random. Given a training set of size n, create m samples of size n by drawing n examples from the original data with replacement. Each bootstrap sample will on average contain 63.2% of the unique training examples; the rest are replicates. Combine the m resulting models using a simple majority vote. This decreases error by decreasing the variance in the results due to unstable learners, algorithms (like decision trees) whose output can change dramatically when the training data is slightly changed.

Boosting method

Examples are given weights. At each iteration, a new hypothesis is learned and the examples are reweighted to focus the system on examples that the most recently learned classifier got wrong.

EP: As you increase the number of classifiers in an ensemble, the classification error decreases until the error approaches 0.

FALSE

EP: Boosting is an ensemble method that uses sampling with replacement of a dataset and learns independently several classifiers.

FALSE

EP: Ensemble methods can be used for classification but not prediction.

FALSE

EP: Ensemble methods can be used for supervised learning but not unsupervised learning.

FALSE

EP: For methods like Boosting or Bagging to perform well, it is necessary that the sub-samples from the dataset used in each classifier are as similar as possible.

FALSE

EP: Given a dataset of N observations, the random forest algorithm will produce N * 0.632 trees.

FALSE

EP: Random decision trees are computationally expensive to build.

FALSE

EP: Random subspaces methods (like random forest) are meta-learning methods that learn a set of classifiers using datasets that have subsets of attributes of the original data.

FALSE

Random Decision tree (single model learning algorithms)

1. Fix the structure of the model and minimize some form of error or maximize data likelihood (e.g., logistic regression, naive Bayes). 2. Use some "free-form" function to match the data given a preference criterion such as information gain, Gini index, or MDL (e.g., decision trees, rule-based classifiers).

Boosting: Basic Algorithm

General loop: set all examples to equal, uniform weights. For t from 1 to T, learn a hypothesis h_t from the weighted examples, then decrease the weights of the examples h_t classifies correctly. The base (weak) learner must focus on correctly classifying the most highly weighted examples while strongly avoiding overfitting. During testing, each of the T hypotheses gets a weighted vote proportional to its accuracy on the training data.

How to build ensembles:

1. Heterogeneous ensembles (same training data, different learning algorithms)
2. Manipulate training data (same learning algorithm, different training data)
3. Manipulate input features (use different subsets of the attribute sets)
4. Manipulate output targets (same data, same algorithm: convert multiclass problems into many two-class problems)
5. Inject randomness into learning algorithms

Issues in ensembles (other)

How "weak" should a base learner for boosting be? What is the theoretical explanation of boosting's ability to improve generalization? Exactly how does the diversity of an ensemble affect its generalization performance? How can boosting and bagging be combined?

Bagging: Variance Reduction

If each classifier has high variance (is unstable), the aggregated classifier has a smaller variance than any single classifier. The bagging classifier is an approximation of the true average, computed by replacing the true probability distribution with its bootstrap approximation.

Random Forests: Improve accuracy

Incorporate more diversity and reduce variance

AdaBoost: Basic Algorithm

Initially, set uniform weights on all records. At each round: create a bootstrap sample based on the weights; train a classifier on the sample and apply it to the original training set; increase the weights of wrongly classified records and decrease the weights of correctly classified records; if the error rate is higher than 50%, start over. The final prediction is a weighted average of all the classifiers, with each weight representing that classifier's training accuracy.
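A minimal AdaBoost-style sketch of these rounds, assuming binary labels in {-1, +1}. It uses direct reweighting through sample_weight rather than bootstrap resampling, which the "Learning with Weighted Examples" card notes has the same effect; the alpha formula is the standard AdaBoost vote weight.

import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, n_rounds=50, base=DecisionTreeClassifier(max_depth=1)):
    # y must be in {-1, +1}. Returns the classifiers and their voting weights.
    m = len(X)
    w = np.ones(m) / m                               # uniform initial weights
    classifiers, alphas = [], []
    for _ in range(n_rounds):
        clf = clone(base).fit(X, y, sample_weight=w)
        wrong = clf.predict(X) != y
        err = max(w[wrong].sum(), 1e-10)             # clamp to avoid log(0)
        if err >= 0.5:                               # worse than chance: start over
            w = np.ones(m) / m
            continue
        alpha = 0.5 * np.log((1 - err) / err)        # vote weight grows with accuracy
        w *= np.exp(np.where(wrong, alpha, -alpha))  # raise wrong, lower right
        w /= w.sum()
        classifiers.append(clf)
        alphas.append(alpha)
    return classifiers, alphas

def adaboost_predict(classifiers, alphas, X):
    # Weighted vote: sign of the alpha-weighted sum of individual predictions.
    return np.sign(sum(a * clf.predict(X) for a, clf in zip(alphas, classifiers)))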

Random Decision tree (Learning as encoding)

Make no assumption about the true model, neither parametric form nor free form. Do not prefer one base model over another; just average them.

EP: Which of the following ensemble techniques work well when the errors of the classifiers are strongly correlated.

Neither

Stories of Success (ensembles)

Netflix million dollar prize

Main Ideas of Boosting

New classifiers should focus on difficult cases. Examine the learning set and get some "rule of thumb" (a weak learner). Reweight the examples of the training set to concentrate on the "hard" cases for the previous rule, then derive the next rule of thumb. Build a single, accurate predictor by combining the rules of thumb.

EP: AdaBoost is susceptible to outliers.

TRUE

EP: Classifier ensembles usually have better performance than stand-alone classifiers because they combine the different points of view of each individual classifier.

TRUE

Error due to Bias:

The error due to bias is taken as the difference between the expected (or average) prediction of our model and the correct value which we are trying to predict. Of course you only have one model so talking about expected or average prediction values might seem a little strange. However, imagine you could repeat the whole model building process more than once: each time you gather new data and run a new analysis creating a new model. Due to randomness in the underlying data sets, the resulting models will have a range of predictions. Bias measures how far off in general these models' predictions are from the correct value.

Error due to Variance

The error due to variance is taken as the variability of a model prediction for a given data point. Again, imagine you can repeat the entire model building process multiple times. The variance is how much the predictions for a given point vary between different realizations of the model.

For an ensemble to work the following conditions must be true:

The errors of the classifiers must not be strongly correlated (think of the exam example: if everyone knows by heart exactly the same chapters, will working in groups help solve the test?). The error of each individual classifier making up the ensemble must be less than 0.5 (at least better than chance).

Motivations of ensemble methods

Ensemble model improves accuracy and robustness over single model methods

Each bootstrap sample will on average contain

63.2% of the unique training examples, the rest are replicates
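The 63.2% figure is the chance that a given example is drawn at least once in n draws with replacement: 1 - (1 - 1/n)^n, which tends to 1 - 1/e ≈ 0.632 as n grows. A one-line check in Python:

n = 10_000
print(1 - (1 - 1/n) ** n)  # about 0.632, i.e. 1 - 1/e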

Random Decision Tree (algorithm)

At each node, an unused feature is chosen randomly. A discrete feature is unused if it has never been chosen previously on the decision path from the root to the current node; a continuous feature can be chosen multiple times on the same path, but with a different threshold value each time. Stop when a node becomes too small (<= 3 examples) or the total height of the tree exceeds some limit, such as the total number of features. Prediction: simple averaging over multiple trees.
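A compact sketch of the tree-growing step for discrete features only; the per-path threshold handling for continuous features and the averaging across trees are left out. All names are illustrative.

import numpy as np

def grow_random_tree(X, y, unused, depth, max_depth, rng):
    # Leaves store class frequencies so prediction can average
    # probabilities over multiple trees.
    if len(y) <= 3 or depth >= max_depth or not unused:  # stopping rules from the card
        values, counts = np.unique(y, return_counts=True)
        return dict(zip(values, counts / counts.sum()))
    f = rng.choice(sorted(unused))                       # random unused feature
    children = {v: grow_random_tree(X[X[:, f] == v], y[X[:, f] == v],
                                    unused - {f}, depth + 1, max_depth, rng)
                for v in np.unique(X[:, f])}
    return {"feature": f, "children": children}

# Example call: unused starts as all feature indices, height capped at the
# number of features, as the card suggests.
# tree = grow_random_tree(X, y, set(range(X.shape[1])), 0, X.shape[1],
#                         np.random.default_rng(0))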

Random Decision tree (such methods will make mistakes if)

Data is insufficient, or the structure of the model or the preference criterion is inappropriate for the problem.

EP: Which of the following is correct with respect to random forests compared to decision trees?

Random forests are more difficult to interpret but often more accurate.

Random Forests: Improve efficiency

Searching among subsets of features is much faster than searching among the complete set

Random Decision Tree (prediction)

Simple averaging over multiple trees

How to build ensembles: The two dominant approaches belong to category 2 (manipulate training data)

They are: bagging and boosting

Random Decision Tree (potential advantages)

Training can be very efficient, particularly for very large datasets. No cross-validation-based parameter estimation is needed, unlike some parametric methods. Gives natural multi-class probability estimates. Imposes very little about the structure of the model.

