ISLR Ch 8

What is the RSS for tree models?

$\sum_{j=1}^{J} \sum_{i \in R_j} \left( y_i - \hat{y}_{R_j} \right)^2$, where $\hat{y}_{R_j}$ is the mean response for the training observations within the jth box.
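A minimal NumPy sketch of this RSS computation (not from the book; the function name tree_rss and the toy data are made up for illustration):

```python
import numpy as np

def tree_rss(y, regions):
    """RSS for a tree partition: sum over regions of squared
    deviations from each region's mean response."""
    y = np.asarray(y, dtype=float)
    regions = np.asarray(regions)
    rss = 0.0
    for r in np.unique(regions):
        y_r = y[regions == r]                   # responses falling in region R_j
        rss += np.sum((y_r - y_r.mean()) ** 2)  # (y_i - yhat_{R_j})^2
    return rss

# Example with two regions
y = [1.0, 2.0, 3.0, 10.0, 12.0]
regions = [0, 0, 0, 1, 1]
print(tree_rss(y, regions))  # 2.0 + 2.0 = 4.0
```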

Give Algorithm for Building a Regression Tree

1. Use recursive binary splitting to grow a large tree on the training data, stopping only when each terminal node has fewer than some minimum number of observations.
2. Apply cost complexity pruning to the large tree in order to obtain a sequence of best subtrees, as a function of α.
3. Use K-fold cross-validation to choose α. That is, divide the training observations into K folds. For each k = 1, ..., K:
   (a) Repeat Steps 1 and 2 on all but the kth fold of the training data.
   (b) Evaluate the mean squared prediction error on the data in the left-out kth fold, as a function of α.
   Average the results for each value of α, and pick α to minimize the average error.
4. Return the subtree from Step 2 that corresponds to the chosen value of α.
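A hedged scikit-learn sketch of Steps 1-4 (the book's labs use R; here ccp_alpha plays the role of α, and make_regression supplies stand-in training data):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Steps 1-2: grow a large tree and recover the candidate alpha sequence.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
alphas = np.clip(path.ccp_alphas, 0.0, None)  # guard against tiny negative values

# Step 3: K-fold cross-validation over alpha (K = 5 here).
cv = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"ccp_alpha": alphas},
    cv=5,
    scoring="neg_mean_squared_error",
).fit(X, y)

# Step 4: the pruned subtree corresponding to the chosen alpha.
best_tree = cv.best_estimator_
print(cv.best_params_["ccp_alpha"], best_tree.get_n_leaves())
```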

the process of building a regression tree.

1. We divide the predictor space—that is, the set of possible values for X1, X2, ..., Xp—into J distinct and non-overlapping regions, R1, R2, ..., RJ.
2. For every observation that falls into the region Rj, we make the same prediction, which is simply the mean of the response values for the training observations in Rj.

What is a classification tree? How are predictions made?

A classification tree is very similar to a regression tree, except that it is used to predict a qualitative response rather than a quantitative one. for a classification tree, we predict that each observation belongs to the most commonly occurring class of training observations in the region to which it belongs.

In the classification setting, RSS cannot be used as a criterion for making the binary splits. What is an alternative?

A natural alternative to RSS is the classification error rate. Since we plan to assign an observation in a given region to the most commonly occurring class of training observations in that region, the classification error rate is simply the fraction of the training observations in that region that do not belong to the most common class: $E = 1 - \max_k(\hat{p}_{mk})$, where $\hat{p}_{mk}$ is the proportion of training observations in the mth region that are from the kth class.

Explain the 3 tuning parameters of boosting

Boosting has three tuning parameters:

1. The number of trees B. Unlike bagging and random forests, boosting can overfit if B is too large, although this overfitting tends to occur slowly if at all. We use cross-validation to select B.
2. The shrinkage parameter λ, a small positive number. This controls the rate at which boosting learns. Typical values are 0.01 or 0.001, and the right choice can depend on the problem. Very small λ can require using a very large value of B in order to achieve good performance.
3. The number d of splits in each tree, which controls the complexity of the boosted ensemble. Often d = 1 works well, in which case each tree is a stump, consisting of a single split. In this case, the boosted ensemble is fitting an additive model, since each term involves only a single variable. More generally d is the interaction depth, and controls the interaction order of the boosted model, since d splits can involve at most d variables.
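As a rough illustration, these three parameters map onto scikit-learn's GradientBoostingRegressor as follows (an assumption about the library mapping; the book itself uses R's gbm package):

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_friedman1(n_samples=500, random_state=0)

boost = GradientBoostingRegressor(
    n_estimators=1000,   # B: number of trees (choose by cross-validation)
    learning_rate=0.01,  # lambda: shrinkage parameter
    max_depth=1,         # d = 1: each tree is a stump -> additive model
    random_state=0,
).fit(X, y)
print(boost.score(X, y))
```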

Equation 8.4 is similar to which other method?

Equation 8.4 is reminiscent of the lasso (6.7) from Chapter 6, in which a similar formulation was used in order to control the complexity of a linear model.

Explain the algorithm

Given the current model, we fit a decision tree to the residuals from the model. That is, we fit a tree using the current residuals, rather than the outcome Y, as the response. We then add this new decision tree into the fitted function in order to update the residuals. Each of these trees can be rather small, with just a few terminal nodes, determined by the parameter d in the algorithm. By fitting small trees to the residuals, we slowly improve fˆ in areas where it does not perform well. The shrinkage parameter λ slows the process down even further, allowing more and different shaped trees to attack the residuals.
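A minimal sketch of this residual-fitting loop using small scikit-learn regression trees; the variable names and synthetic data are illustrative only:

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.tree import DecisionTreeRegressor

X, y = make_friedman1(n_samples=500, random_state=0)
B, lam, d = 500, 0.01, 1          # number of trees, shrinkage, tree depth

f_hat = np.zeros(len(y))          # start with f_hat(x) = 0,
residuals = y.copy()              # so the residuals are just y
trees = []
for _ in range(B):
    tree = DecisionTreeRegressor(max_depth=d).fit(X, residuals)  # fit to residuals
    update = lam * tree.predict(X)
    f_hat += update               # add shrunken tree to the fitted function
    residuals -= update           # update the residuals
    trees.append(tree)

print(np.mean((y - f_hat) ** 2))  # training MSE of the boosted fit
```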

Can you construct decision tree with qualitative predictor variables?

Yes, decision trees can be constructed even in the presence of qualitative predictor variables. A split on one of these variables amounts to assigning some of the qualitative values to one branch and assigning the remaining values to the other branch.

How does boosting for decision trees work?

In bagging each tree is built on a bootstrap data set, independent of the other trees. Boosting works in a similar way, except that the trees are grown sequentially: each tree is grown using information from previously grown trees. Boosting does not involve bootstrap sampling; instead each tree is fit on a modified version of the original data set.

How is binary splitting done?

In order to perform recursive binary splitting, we first select the predictor Xj and the cutpoint s such that splitting the predictor space into the regions {X|Xj < s} and {X|Xj ≥ s} leads to the greatest possible reduction in RSS.
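An illustrative brute-force version of this search (not the book's code; best_split and the toy data are made up), which scans every predictor and every observed cutpoint and keeps the pair with the smallest RSS:

```python
import numpy as np

def best_split(X, y):
    best = (None, None, np.inf)                # (j, s, rss)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            if len(left) == 0 or len(right) == 0:
                continue
            rss = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
            if rss < best[2]:
                best = (j, s, rss)
    return best

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.where(X[:, 1] < 0.5, 1.0, 5.0) + rng.normal(scale=0.1, size=100)
print(best_split(X, y))   # should recover predictor 1 and a cutpoint near 0.5
```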

How do we construct the regions R1,...,RJ? What is their shape?

In theory, the regions could have any shape. However, we choose to divide the predictor space into high-dimensional rectangles, or boxes, for simplicity and for ease of interpretation of the resulting predictive model. The goal is to find boxes R1, ..., RJ that minimize the RSS, given by

How do we determine the best way to prune the tree?

Intuitively, our goal is to select a subtree that leads to the lowest test error rate. Given a subtree, we can estimate its test error using cross-validation or the validation set approach. However, estimating the cross-validation error for every possible subtree would be too cumbersome, since there is an extremely large number of possible subtrees.

What is OOB analogous to when B is large?

It can be shown that with B sufficiently large, OOB error is virtually equivalent to leave-one-out cross-validation error. The OOB approach for estimating the test error is particularly convenient when performing bagging on large data sets for which cross-validation would be computationally onerous.

In which context does a regression tree perform better than linear regression and vice versa. How is the performance measured?

It depends on the problem at hand. If the relationship between the features and the response is well approximated by a linear model as in (8.8), then an approach such as linear regression will likely work well, and will outperform a method such as a regression tree that does not exploit this linear structure. If instead there is a highly non-linear and complex relationship between the features and the response as indicated by model (8.9), then decision trees may outperform classical approaches. The relative performances of tree-based and classical approaches can be assessed by estimating the test error, using either cross-validation or the validation set approach.
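A sketch of such a comparison using cross-validation in scikit-learn, under the assumption of a roughly linear ground truth (so linear regression should come out ahead here; with a highly non-linear response the ranking can flip):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Linear truth generated by make_regression.
X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)

for name, model in [("linear", LinearRegression()),
                    ("tree", DecisionTreeRegressor(max_depth=3, random_state=0))]:
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(name, round(mse, 1))   # 5-fold CV estimate of test MSE
```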

Gini index is referred to as a measure of node purity. Why?

It is not hard to see that the Gini index takes on a small value if all of the $\hat{p}_{mk}$'s are close to zero or one. For this reason the Gini index is referred to as a measure of node purity—a small value indicates that a node contains predominantly observations from a single class.

Why is tree pruning needed?/ Drawbacks of large trees

A large tree may produce good predictions on the training set, but it is likely to overfit the data, leading to poor test set performance, because the resulting tree may be too complex. A smaller tree with fewer splits (that is, fewer regions R1, ..., RJ) might lead to lower variance and better interpretation at the cost of a little bias.

Is cross-validation or validation set approach used to estimate the test error of a bagged model? Why or why not?

No. We use out-of-bag (OOB) error estimation instead.

What is decorrelating of trees? Why is it useful?

Random forests force each split to consider only a subset of the predictors. Therefore, on average (p − m)/p of the splits will not even consider the strong predictor, and so other predictors will have more of a chance. We can think of this process as decorrelating the trees, thereby making the average of the resulting trees less variable and hence more reliable.

Explain how Random Forest works

Random forests provide an improvement over bagged trees by way of a small tweak that decorrelates the trees. As in bagging, we build a number of decision trees on bootstrapped training samples. But when building these decision trees, each time a split in a tree is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors. The split is allowed to use only one of those m predictors. A fresh sample of m predictors is taken at each split, and typically we choose m ≈ √p.
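A scikit-learn sketch, assuming max_features="sqrt" is used to get m ≈ √p candidate predictors at each split (the book's lab uses R's randomForest package):

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor

X, y = make_friedman1(n_samples=500, random_state=0)
rf = RandomForestRegressor(
    n_estimators=500,      # B bootstrapped trees
    max_features="sqrt",   # m ~ sqrt(p) predictors considered at each split
    random_state=0,
).fit(X, y)
print(rf.score(X, y))
```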

What is Cost complexity pruning—also known as weakest link pruning—

Rather than considering every possible subtree, we consider a sequence of trees indexed by a nonnegative tuning parameter α. For each value of α there corresponds a subtree T ⊂ T0 that makes the quantity in (8.4), $\sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha |T|$, as small as possible, where |T| denotes the number of terminal nodes of T.

How are the Gini index and cross entropy numerically similar?

Since $0 \le \hat{p}_{mk} \le 1$, it follows that $-\hat{p}_{mk} \log \hat{p}_{mk} \ge 0$. One can show that the cross-entropy will take on a value near zero if the $\hat{p}_{mk}$'s are all near zero or near one. Therefore, like the Gini index, the cross-entropy will take on a small value if the mth node is pure.

Why is node purity important?

Suppose that we have a test observation that belongs to the region given by that right-hand leaf. Then we can be pretty certain that its response value is Yes. In contrast, if a test observation belongs to the region given by the left-hand leaf, then its response value is probably Yes, but we are much less certain. Even though the split does not reduce the classification error, it improves the Gini index and the cross-entropy, which are more sensitive to node purity.

Explain how bagging works. Give the formula

Take repeated samples from the (single) training data set. In this approach we generate B different bootstrapped training data sets. We then train our method on the bth bootstrapped training set in order to get $\hat{f}^{*b}(x)$, and finally average all the predictions, to obtain $\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x)$.
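A minimal sketch of this formula: fit a deep tree to each of B bootstrap samples and average the B predictions (illustrative code, not the book's):

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.tree import DecisionTreeRegressor

X, y = make_friedman1(n_samples=300, random_state=0)
rng = np.random.default_rng(0)
B = 100

preds = np.zeros((B, len(y)))
for b in range(B):
    idx = rng.integers(0, len(y), size=len(y))           # bootstrap sample
    tree = DecisionTreeRegressor().fit(X[idx], y[idx])   # f*b, grown deep (unpruned)
    preds[b] = tree.predict(X)
f_bag = preds.mean(axis=0)                               # average the B predictions
print(np.mean((y - f_bag) ** 2))
```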

What are two other alternatives to classification error rate? Give formulas

The Gini index, $G = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk})$, and the cross-entropy, $D = -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}$, where $\hat{p}_{mk}$ is the proportion of training observations in the mth region that are from the kth class.
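A small sketch comparing the three impurity measures for a node, given its class proportions (the function node_impurity is made up for illustration):

```python
import numpy as np

def node_impurity(p):
    """Classification error, Gini index, and cross-entropy for class proportions p."""
    p = np.asarray(p, dtype=float)
    err = 1.0 - p.max()                              # classification error rate
    gini = np.sum(p * (1.0 - p))                     # Gini index
    entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))   # cross-entropy
    return err, gini, entropy

print(node_impurity([0.5, 0.5]))   # impure node: all three measures are large
print(node_impurity([0.9, 0.1]))   # nearly pure node: Gini and entropy are small
```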

Decision trees suffer from high variance. Explain

The decision trees discussed in Section 8.1 suffer from high variance. This means that if we split the training data into two parts at random, and fit a decision tree to both halves, the results that we get could be quite different.

What's the difference between random forest and bagging?

The main difference between bagging and random forests is the choice of predictor subset size m. For instance, if a random forest is built using m = p, then this amounts simply to bagging.

some of the splits yield two terminal nodes that have the same predicted value. Why, then, is the split performed at all?

The split is performed because it leads to increased node purity. That is, all 9 of the observations corresponding to the right-hand leaf have a response value of Yes, whereas 7/11 of those corresponding to the left-hand leaf have a response value of Yes.

Explain in detail function 8.4

The tuning parameter α controls a trade-off between the subtree's complexity and its fit to the training data. When α = 0, then the subtree T will simply equal T0, because then (8.4) just measures the training error. However, as α increases, there is a price to pay for having a tree with many terminal nodes, and so the quantity (8.4) will tend to be minimized for a smaller subtree.

How do tree based methods work?

Tree-based methods involve stratifying or segmenting the predictor space into a number of simple regions. In order to make a prediction for a given observation, we typically use the mean or the mode of the training observations in the region to which it belongs. Since the set of splitting rules used to segment the predictor space can be summarized in a tree, these types of approaches are known as decision tree methods.

What is the disadvantage of bagging?

Bagging improves prediction accuracy at the expense of interpretability; the resulting model can no longer be displayed as a single tree.

How do you apply bagging to regression trees?

To apply bagging to regression trees, we simply construct B regression trees using B bootstrapped training sets, and average the resulting predictions. These trees are grown deep, and are not pruned. Hence each individual tree has high variance, but low bias. Averaging these B trees reduces the variance.

Pros and cons of tree based methods

Tree-based methods are simple and useful for interpretation. However, they typically are not competitive with the best supervised learning approaches.

Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method. True or False

True

The number of trees B is not a critical parameter with bagging. T or F

True. using a very large value of B will not lead to overfitting. In practice we use a value of B sufficiently large that the error has settled down.

What should be our choice of subset size m for the random forest?

Using a small value of m in building a random forest will typically be helpful when we have a large number of correlated predictors.

Explain OOB error estimation

We can predict the response for the ith observation using each of the trees in which that observation was OOB. This will yield around B/3 predictions for the ith observation. In order to obtain a single prediction for the ith observation, we can average these predicted responses (if regression is the goal) or can take a majority vote (if classification is the goal). This leads to a single OOB prediction for the ith observation. An OOB prediction can be obtained in this way for each of the n observations, from which the overall OOB MSE (for a regression problem) or classification error (for a classification problem) can be computed.
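A sketch of OOB error estimation via scikit-learn's BaggingRegressor, assuming oob_score=True is used to collect the out-of-bag predictions described above (note that oob_score_ reports an OOB R² rather than an MSE):

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import BaggingRegressor

X, y = make_friedman1(n_samples=500, random_state=0)
bag = BaggingRegressor(
    n_estimators=200,   # default base learner is a regression tree
    oob_score=True,     # keep track of out-of-bag predictions
    random_state=0,
).fit(X, y)
print(bag.oob_score_)   # OOB estimate of test-set R^2
```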

Explain when during the tree process these three are preferable

When building a classification tree, either the Gini index or the cross-entropy is typically used to evaluate the quality of a particular split, since these two approaches are more sensitive to node purity than is the classification error rate. Any of these three approaches might be used when pruning the tree, but the classification error rate is preferable if prediction accuracy of the final pruned tree is the goal.

What does the Gini index measure?

The Gini index is a measure of total variance across the K classes.

It is computationally infeasible to consider every possible partition of the feature space into J boxes. For this reason, we take a top-down, greedy approach that is known as recursive binary splitting. Explain

The approach is top-down because it begins at the top of the tree (at which point all observations belong to a single region) and then successively splits the predictor space; each split is indicated via two new branches further down on the tree. It is greedy because at each step of the tree-building process, the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step.

What is the main idea behind bagging, boosting and random forest?

Bagging, random forests, and boosting each involve producing multiple trees which are then combined to yield a single consensus prediction. Combining a large number of trees can often result in dramatic improvements in prediction accuracy, at the expense of some loss in interpretation.

How can bagging be extended to a classification problem where Y is qualitative?

Construct B trees using B bootstrapped training sets. For a given test observation, we can record the class predicted by each of the B trees, and take a majority vote: the overall prediction is the most commonly occurring class among the B predictions.
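A tiny sketch of the majority vote itself, with made-up predictions from B = 5 trees:

```python
import numpy as np

tree_predictions = np.array(["Yes", "Yes", "No", "Yes", "No"])  # one class per tree
classes, counts = np.unique(tree_predictions, return_counts=True)
print(classes[np.argmax(counts)])   # most commonly occurring class -> "Yes"
```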

What is tree pruning?

We grow a very large tree T0, and then prune it back in order to obtain a subtree.

one difference between boosting and random forests:

In boosting, because the growth of a particular tree takes into account the other trees that have already been grown, smaller trees are typically sufficient. Using smaller trees can aid in interpretability as well; for instance, using stumps leads to an additive model.

Explain a situation where random forest performs better than bagging

In building a random forest, at each split in the tree, the algorithm is not even allowed to consider a majority of the available predictors. This may sound crazy, but it has a clever rationale. Suppose that there is one very strong predictor in the data set, along with a number of other moderately strong predictors. Then in the collection of bagged trees, most or all of the trees will use this strong predictor in the top split. Consequently, all of the bagged trees will look quite similar to each other. Hence the predictions from the bagged trees will be highly correlated. Unfortunately, averaging many highly correlated quantities does not lead to as large of a reduction in variance as averaging many uncorrelated quantities. In particular, this means that bagging will not lead to a substantial reduction in variance over a single tree in this setting.

How do you know which variables are important in bagging?

One can obtain an overall summary of the importance of each predictor using the RSS (for bagging regression trees) or the Gini index (for bagging classification trees). In the case of bagging regression trees, we can record the total amount that the RSS (8.1) is decreased due to splits over a given predictor, averaged over all B trees. A large value indicates an important predictor. Similarly, in the context of bagging classification trees, we can add up the total amount that the Gini index (8.6) is decreased by splits over a given predictor, averaged over all B trees.
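A sketch using scikit-learn, where feature_importances_ aggregates the impurity decrease (RSS or Gini) attributable to each predictor across the trees; the data here are synthetic:

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor

X, y = make_friedman1(n_samples=500, random_state=0)
rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)
for j, imp in enumerate(rf.feature_importances_):
    print(f"X{j}: {imp:.3f}")   # larger values indicate more important predictors
```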

What are out of bag observations?

The key to bagging is that trees are repeatedly fit to bootstrapped subsets of the observations. One can show that on average, each bagged tree makes use of around two-thirds of the observations. The remaining one-third of the observations not used to fit a given bagged tree are referred to as the out-of-bag (OOB) observations.

What are some of the advantages of trees?

▲ Trees are very easy to explain to people.
▲ Some people believe that decision trees more closely mirror human decision-making than do the regression and classification approaches.
▲ Trees can be displayed graphically, and are easily interpreted even by a non-expert (especially if they are small).
▲ Trees can easily handle qualitative predictors without the need to create dummy variables.

What are the disadvantages of trees?

▼ Unfortunately, trees generally do not have the same level of predictive accuracy as some of the other regression and classification approaches seen in this book. However, by aggregating many decision trees, using methods like bagging, random forests, and boosting, the predictive performance of trees can be substantially improved.
▼ Decision trees suffer from high variance.

