4: Decision Trees

Random forests vs. regular decision trees - 2 key differences

1.) Random forests have larger trees.
- They grow more freely with no pruning.
- Aggregating predictions across trees reduces variance and combats overfitting (lower variance than a single tree).
2.) Random forests consider a random subset of features at each split.
- This minimizes the risk of correlated predictions and allows other important predictors to become known.
- Reduces variance even a step further.
- We use CV to determine how many features to consider at each split (see the sketch below).
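A minimal sketch of tuning mtry (the number of features considered at each split) by cross-validation with caret's train(); the data frame df, target variable target, grid values, and ntree setting are hypothetical and illustrative only.

library(caret)

set.seed(100)
rf_grid <- expand.grid(mtry = 1:5)   # candidate numbers of features per split

rf_cv <- train(
  target ~ .,
  data = df,
  method = "rf",                     # randomForest under the hood
  trControl = trainControl(method = "cv", number = 5),
  tuneGrid = rf_grid,
  ntree = 200
)

rf_cv$bestTune                       # mtry with the lowest CV error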

3 hyperparameters in boosting

1.) Shrinkage parameter: a value between 0 and 1 that controls how fast or slow the boosting learns.
- Controls the amount of information gained from each tree.
2.) Number of trees: can be viewed as a hyperparameter, since the number of trees is a flexibility measure.
3.) interaction.depth: controls the depth of each tree built in a boosted model.
- Typically set to smaller values, as the trees are weak learners that are aggregated to obtain the boosted model's prediction.

Recursive binary splitting

A greedy algorithm that determines the splitting rules by dividing the predictor space into two smaller regions at each node. It makes each split by minimizing some target function:
- SSE for regression
- Deviance for Poisson regression
- Error rate, Gini index, or cross entropy for classification
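A rough sketch of how these target functions map onto rpart(); the data frames (df_reg, df_cls) and target names are hypothetical placeholders.

library(rpart)

# method determines the splitting criterion used by recursive binary splitting
reg_tree <- rpart(y ~ ., data = df_reg, method = "anova")      # splits minimize SSE
cls_tree <- rpart(cls ~ ., data = df_cls, method = "class",
                  parms = list(split = "gini"))                 # or split = "information" for entropy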

The train function

A versatile function that fits models with different tuning parameters. In the case of fitting decision trees, train conveniently performs CV for the tuning process. - Every cp candidate value is specified, and the code will select the one with the lowest CV error for the final selected tree.
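A minimal sketch of this tuning process, assuming a hypothetical data frame df with target y; the candidate cp values are illustrative.

library(caret)
library(rpart)

set.seed(42)
cp_grid <- expand.grid(cp = c(0.001, 0.005, 0.01, 0.05, 0.1))

tree_cv <- train(
  y ~ .,
  data = df,
  method = "rpart",
  trControl = trainControl(method = "cv", number = 10),
  tuneGrid = cp_grid
)

tree_cv$bestTune       # cp with the lowest CV error
tree_cv$finalModel     # tree refit on the full training data at that cp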

Bootstrapping

Sampling n observations randomly WITH replacement from the original training dataset of n observations. This process is repeated for ensemble methods like random forests to create many bootstrap samples from a training dataset.
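A quick base-R sketch of drawing one bootstrap sample, assuming a hypothetical training data frame train_df.

set.seed(1)
n <- nrow(train_df)

# One bootstrap sample: n row indices drawn WITH replacement
boot_idx    <- sample(seq_len(n), size = n, replace = TRUE)
boot_sample <- train_df[boot_idx, ]

# Rows never drawn are the "out-of-bag" observations for this sample
oob_sample <- train_df[-unique(boot_idx), ]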

Boosting in R - gbm()

See diagram.
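The diagram is not reproduced here; as a stand-in, a hedged sketch of a gbm() fit using the three hyperparameters above. The data frames df and df_test, the target y, and all parameter values are hypothetical/illustrative.

library(gbm)

set.seed(10)
boost_fit <- gbm(
  y ~ .,
  data = df,
  distribution = "gaussian",    # "bernoulli" for binary classification
  n.trees = 1000,               # number of trees (flexibility)
  shrinkage = 0.01,             # shrinkage parameter between 0 and 1
  interaction.depth = 2,        # depth of each weak learner
  cv.folds = 5
)

best_n <- gbm.perf(boost_fit, method = "cv")                 # CV-selected number of trees
preds  <- predict(boost_fit, newdata = df_test, n.trees = best_n)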

Boosting in R - xgb.train()

See diagram.
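Again in place of the diagram, a hedged sketch of xgb.train(); the numeric feature matrices x_train/x_test and target vector y_train are hypothetical, and the parameter values are illustrative.

library(xgboost)

dtrain <- xgb.DMatrix(data = x_train, label = y_train)

params <- list(
  objective = "reg:squarederror",   # "binary:logistic" for classification
  eta = 0.05,                       # shrinkage parameter
  max_depth = 2                     # depth of each tree
)

set.seed(10)
xgb_fit <- xgb.train(params = params, data = dtrain, nrounds = 500)

preds <- predict(xgb_fit, newdata = xgb.DMatrix(x_test))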

General algorithm for creating a decision tree

See diagram.

Regression vs. classification trees

See diagram.

Summary of models, R packages/methods, and hyperparameters

See diagram.

Stopping criteria

Since recursive binary splitting builds a big tree, introducing stopping criteria helps reduce overfitting. All of the stopping criteria except for maxdepth are inversely related to flexibility!
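A minimal sketch of passing these stopping criteria to rpart() via rpart.control(), assuming a hypothetical data frame df with target y; the values shown are rpart's defaults.

library(rpart)

ctrl <- rpart.control(
  minsplit  = 20,    # min observations in a node before a split is attempted
  minbucket = 7,     # min observations allowed in a terminal node
  cp        = 0.01,  # min improvement required for a split to be kept
  maxdepth  = 30     # max depth of any node (root counted as depth 0)
)

tree_fit <- rpart(y ~ ., data = df, method = "anova", control = ctrl)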

Ensemble methods

Statistical techniques that combine multiple models to obtain one predictive model. 2 methods covered for this exam...
1.) Random Forests
2.) Boosting

Pruning a regression tree

Trimming/cutting off parts of a tree to make it smaller and less flexible. We can use the four previous stopping criteria; however, we will focus on cp.
- A cp of 0 means the most flexible tree (no pruning) and a cp of 1 means the least flexible tree (most pruning).
- cp corresponds to how much information gain there needs to be for a split to occur.
- To leverage the bias-variance trade-off, we want to set a cp that minimizes the test error. We usually use CV to estimate the test error here!
- We can also apply the one-standard-error rule here as well, but it isn't as common.
- Use xerror as a reference to the test error in the R output.
- Also note how outputs are expressed from rpart()!
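A sketch of pruning at the CV-minimizing cp, assuming full_tree is a hypothetical rpart object grown with a very small cp (e.g., cp = 0).

library(rpart)

printcp(full_tree)                        # CP table: CP, nsplit, rel error, xerror, xstd

# Pick the cp row with the lowest cross-validation error (xerror)
cp_table <- full_tree$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

pruned_tree <- prune(full_tree, cp = best_cp)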

Variable importance in boosting

Use variable importance measures and partial dependence plots to visualize the effect of predictors.
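For instance, a hedged sketch assuming boost_fit is a fitted gbm object, best_n its CV-selected number of trees, and x1 a hypothetical predictor:

library(gbm)

summary(boost_fit, n.trees = best_n)              # relative influence of each predictor
plot(boost_fit, i.var = "x1", n.trees = best_n)   # partial dependence plot for x1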

Decision trees

Visual diagrams that split up the predictor space to help us make predictions based on observations. Essentially, a series of splitting rules that divide the dataset into mutually exclusive nodes, each with a single static prediction that the observations in a node share. The number of terminal nodes determines the flexibility of a decision tree!

Building decision trees - recursive binary splitting for regression trees

We use the functions rpart() and rpart.plot() to build decision trees and visualize them.
- Make sure to use set.seed() for the CV portion!
When you build a tree, the most important predictors will have the highest positions on the tree (i.e., be the first splits). Variables left out of the tree mean they are not as significantly predictive.
NOTE: for CV, rpart() will show you the final tree sizes at the top of the plot and then have the CV error curve and corresponding cp values on the y- and x-axes.
- The built tree is represented by the rightmost point!
BUILDING IS DIFFERENT THAN SELECTING!
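A minimal sketch of that workflow, assuming a hypothetical data frame df with target y; the cp value is illustrative.

library(rpart)
library(rpart.plot)

set.seed(123)                                 # rpart's internal CV uses random folds
tree_fit <- rpart(y ~ ., data = df, method = "anova", cp = 0.001)

rpart.plot(tree_fit)   # most important predictors appear in the earliest splits
plotcp(tree_fit)       # CV error curve: tree size on top, cp on the x-axis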

Parametric models vs. decision trees

While neither is perfect, one might be better than the other depending on the problem.

Section 4.1 #2 Business Problem

You are a consultant for the Bureau of Tax Forecasting, which is concerned with how much revenue a new, upcoming tax will generate. The tax will only impact people who have an income exceeding $50,000. Using some incomplete census data, the bureau has hired you to predict who will be impacted by the tax. Whether or not a given worker exceeds $50,000 in annual income is denoted by the feature income_flag.

Section 4.1 #1 Business Problem

Your friend Jamie is planning to marry the love of his life, but he is quite clueless on how to start the search for a wedding ring. He expresses how he is afraid of paying more than he should, as he knows little about what aspects of a diamond are significant to its price. Aware of your statistical and actuarial skills, he comes to you for help with a dataset of diamonds to analyze.

Section 4.2 Business Problem

Your friend Jamie is planning to marry the love of his life, but he is quite clueless on how to start the search for a wedding ring. He expresses how he is afraid of paying more than he should, as he knows little about what aspects of a diamond are significant to its price. Aware of your statistical and actuarial skills, he comes to you for help with a dataset of diamonds to analyze.

Student Data - Part 10

- Fit a random forest model and compared it to a single decision tree from a previous task. Partial dependence plots showed that Medu was likely overshadowed by Mjob and failures in the single decision tree. This highlights a random forest's ability to let other important predictors stand out.
- Further, constructed a boosted model and compared it to the other two models using test RMSE. Final results were Random Forest > Boosted Model > Single Tree.

Mining Data - Part 5

- Fit two Poisson regression trees and compared them using test RMSE. We didn't need to exclude the collinear predictors here because non-parametric methods aren't impacted by those types of assumptions.
- The second model was noticeably larger than the first model (23 vs. 4 terminal nodes) and only had a slightly lower test RMSE. We decided to go with the first model because of this.

Hospital Data - Part 5

- Fit two classification trees and compared them using AUC.
- The second model had a slightly greater AUC, but with many more leaves. Whether we prefer the first or second model depends on the business problem!

Hospital Data - Part 6

- Given Readmission.Status had a lot more 0's than 1's, classification models here may suffer from low sensitivity and high specificity. We proposed oversampling to mitigate this issue.
- Further used variable importance measures and partial dependence plots to determine that lnRiskscore is by far the most important predictor.

Drawbacks of oversampling and undersampling

Although unbalanced classes affect all classification problems, there are risks to these methods...
Oversampling: may overfit the minority class since duplicates of observations are created.
Undersampling: may underfit the majority class since observations are deleted.

Variable importance in random forests

Random forests can't be visualized by a single tree, so how do we summarize feature importance? Use variable importance measures and partial dependence metrics/plots to visualize the effect of predictors. Random forests do a good job of showcasing all of the predictors that are important. Single decision trees may have important predictors get overshadowed by more dominant ones.
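A sketch of these tools in the randomForest package, assuming a hypothetical fitted object rf_fit, data frame df, and predictor x1.

library(randomForest)

importance(rf_fit)       # importance measure(s) for each predictor
varImpPlot(rf_fit)       # ranked variable importance plot

# Partial dependence of the prediction on the (hypothetical) predictor x1
partialPlot(rf_fit, pred.data = df, x.var = "x1")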

Section 4.2 Part A Begin by constructing a random forest model using the provided code. This may take around 5 minutes to finish running. Interpret the model outputs.

Among the 8 features total, diamond clarity is deemed the most important feature, in that it reduced the error the most across all the trees. The second most important feature is the carat, but it is well surpassed by the clarity. The least important feature is the table, presumably because a diamond's dimensional quality as relating to price is captured better by other features.

Main advantages and disadvantages of ensemble methods

Advantages: they focus more on predictive accuracy.
- They focus on reducing variance (random forests) and bias (boosting), resulting overall in a more predictive model compared to a single tree.
Disadvantage: they are not easily interpretable and result in less output in R.

Section 4.1 #2 Part A Begin by constructing a classification tree using the provided code. Justify the cp value you use for pruning and provide the plots of the unpruned and pruned trees.

After building the classification tree, I pruned it at the cp value of 0.0021. This value corresponds to the lowest cross-validation error among six cp values that produce trees no bigger than the unpruned tree.

Random forests vs. Boosting models

Random forests tend to do well in reducing variance while having a similar bias to that of a basic tree model. The variance reduction arises from aggregating predictions across many trees that were built on different subsets of the data and that consider only a subset of features at each split. Both practices hinder overfitting to the idiosyncrasies of the training data, and hence keep the variance low. Gradient boosting machines use the same underlying training data at each step. This is very effective at reducing bias but is very sensitive to the training data (high variance).

Section 4.1 #3 Part C Explain two advantages of decision trees over GLM in light of this dataset.

One advantage of decision trees for this dataset is that interactions do not have to be identified or specified beforehand. For example, it is not surprising for the occupation variables worktype and employer to be interdependent, so it would be natural for interactions to emerge. Another advantage of decision trees for this dataset is their more streamlined handling of features. This data has several factors with a decent number of levels. A GLM requires the necessary dummy variables to account for them, bringing the total to 17 predictors, and it can only assess their predictive power after modeling with them. In contrast, a decision tree doesn't require dummy variables and will simply not split on a predictor if it is not deemed significant.

Competitor and surrogate splits

Competitor splits: the best alternatives from predictors not chosen as the best split during recursive binary splitting. - May be good to know if there's another similar best split given that the algorithm is greedy. Surrogate splits: alternatives that mimic the best split for observations missing its value. - Think of them as "backup splitting rules" because we aren't able to use the actual value.

Key difference between decision trees and parametric regression models

Decision trees don't make any concrete assumptions about how the target and the predictors relate. We still use similar accuracy measures as before, though!
- Ex.) regression trees can use test RMSE and classification trees can use AUC.

Oversampling and undersampling

Often during classification problems, the dataset is disproportionate, which means there are not enough observations for one of the classes. This can cause concerns regarding sensitivity and specificity...
Oversampling: duplicates observations in the minority class while retaining all observations in the majority class.
- Suitable when there are not enough data points.
Undersampling: randomly deletes observations in the majority class while retaining all observations in the minority class.
- Suitable when there are too many data points (less common).
Goal = have a similar number of observations in each class.
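A hedged base-R sketch of both approaches, assuming a hypothetical training data frame train_df with a binary target flag (1 = minority class); resampling like this is applied to the training data only.

set.seed(5)
minority <- train_df[train_df$flag == 1, ]
majority <- train_df[train_df$flag == 0, ]

# Oversampling: duplicate minority rows (sampled with replacement) to match the majority
over_idx   <- sample(seq_len(nrow(minority)), size = nrow(majority), replace = TRUE)
train_over <- rbind(majority, minority[over_idx, ])

# Undersampling: randomly drop majority rows to match the minority
under_idx   <- sample(seq_len(nrow(majority)), size = nrow(minority))
train_under <- rbind(majority[under_idx, ], minority)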

Boosting

Ensemble method that grows trees sequentially using information from previous trees. Before fitting the second tree, the training set is updated by removing information explained by the first tree; information that remains unexplained is referred to as "residuals". Before fitting the third tree, the training set is updated again, and so on, for as many trees as desired. Focuses on reducing bias by training subsequent trees on the same training data. Number of trees is the flexibility measure!

Final takeaway on ensemble methods

Ensemble methods are generally less quantifiable and more difficult to interpret compared to parametric regression and single decision tree models. We rely more on variable importance measures and partial dependence plots to interpret the model. But, for example, variable importance measures do not show how changing the values/levels of predictors would impact the prediction. Trade-off between predictive accuracy and interpretability!

Section 4.2 Part B Imagine you also ran a boosted model which resulted in training and test RMSE's of 700 and 1000 (test RMSE lower than random forest), respectively. Justify why the random forest model might be preferred over this boosted model.

Even though the boosted model reports a lower test RMSE than the random forest model, the difference is not drastic. The edge in predictive accuracy belonging to the boosted model is minor. However, the difference between the training and test RMSE's for the boosted model is quite significant, in contrast to the practically identical numbers for the random forest model. This indicates that the boosted model shows evidence of overfitting, whereas the random forest model does not. Hence, since the two models are roughly comparable in terms of accuracy, we might prefer the random forest model because it seems to not be overfitted.

Random forests

Fits a decision tree to each bootstrap sample and aggregates them to obtain a single prediction.
- For regression, we focus on the average response.
- For classification, we focus on the most frequent response.
Focuses on reducing variance by aggregating across trees built on different data and only considering a subset of predictors at each split within trees (de-correlated predictions).
NOTE: for a classification random forest, when exactly half of the trees predict an observation as positive, the model will choose the target prediction at random.

Student Data - Part 9

Ran three regression trees and compared them using test RMSE. The second model ended up with the lowest test error and also had a medium complexity, which coincides with the intuitive test error U-shaped curve.

Section 4.1 #2 Part B A reviewer of your work comments on how the model results clearly indicate that the unpruned tree is superior to the pruned tree. Critique the reviewer's comment.

I disagree with the reviewer. While the test AUC of the pruned tree is less than the test AUC of the unpruned tree, they differ by only about 0.004. It can be argued that predictive accuracy is about the same for both trees. We also know that the pruned tree is better from the perspective of cross-validation, suggesting that the unpruned tree is more flexible than needed. Therefore, it is not apparent from the results that the unpruned tree is the superior model.

More on Gini index and Entropy

In classification trees, the algorithm looks to maximize the information gain at each split, using the Gini index or entropy as the impurity measure. Minimizing an impurity measure is equivalent to maximizing the information gain in that measure. Information gain is calculated as the measure for the parent node minus the weighted average measure across the child nodes. In the diagram, the example is a poor split because it did not improve our understanding of the classification between success and failure. We want to maximize homogeneity at each node!
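A toy worked example of information gain with the Gini index; the node counts are invented for illustration.

# Gini impurity of a node, given the vector of class proportions p
gini <- function(p) 1 - sum(p^2)

# Parent node: 10 successes and 10 failures; children after a hypothetical split
parent <- gini(c(10, 10) / 20)    # 0.50
left   <- gini(c(8, 2) / 10)      # 0.32
right  <- gini(c(2, 8) / 10)      # 0.32

# Information gain = parent impurity - weighted average child impurity
gain <- parent - (10/20 * left + 10/20 * right)   # 0.18
gain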

Recursive binary splitting and Poisson regression

Instead of minimizing the SSE at each split, Poisson trees look to minimize the deviance. NOTE: constructing a tree without exposures just assumes all w_i equal one.
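A minimal sketch of fitting Poisson trees with rpart(), assuming a hypothetical data frame df_counts with columns counts and exposure.

library(rpart)

# With exposure: response is cbind(exposure, counts); splits minimize the Poisson deviance
pois_tree <- rpart(cbind(exposure, counts) ~ ., data = df_counts, method = "poisson")

# Without exposure: rpart treats every observation as having exposure (w_i) equal to 1
pois_tree_no_exp <- rpart(counts ~ ., data = df_counts[, names(df_counts) != "exposure"],
                          method = "poisson")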

Section 4.1 #1 Part B Justify whether modeling with a decision tree works well for this dataset.

It appears that modeling with a decision tree is a poorer option for this dataset. As mentioned, there is evidence for improvement in predictive accuracy if the tree could grow bigger, but a large decision tree is difficult to interpret, thus losing its appeal as being capable of depicting a model with a simple visual. The significant predictors from the trees, both carat and y, also suggest that a linear or continuous relationship with price might make more sense, rather than having a static prediction per region in the predictor space.

Pruning a classification tree

Like with regression trees, we usually select the cp that results in the lowest CV error (which is listed as xerror). nsplit = number of splits in the tree number of terminal nodes = nsplit + 1 When a cp is between two rows of the output, pick the row below as the answer!

Advantages and disadvantages of decision trees

Make sure to note that, regarding interactions, they are not always easy to find in the tree visual.

Reading a classification tree

Middle decimal is the predicted rate at that node, i.e., the proportion of observations at that node with Y = 1. The percentage is the proportion of the entire dataset's observations that fall into that node.

Bagging

Model: special case of random forest that considers all features at each split. This could be undesirable, such as if a dominant predictor appears as the first split in many of the trees, thus resulting in similar/correlated predictions across trees. Methodology: bootstrapping and subsequent aggregation of the bootstraps. Variance refers to the sensitivity of the model to changes in the training dataset. Bagging reduces variance at the expense of bias because each individual tree is trained on different data.

Random forests in R

NOTE: if mtry is not specified, the default is p/3 for regression and sqrt(p) for classification.
- We often use CV to determine how many features to consider at each split.
NOTE: the function predict() does not output predictions on the training data! By default, predictions are for out-of-bag observations (observations NOT used to fit a particular tree).
- You can also specify a separate test dataset.
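A hedged sketch tying these notes together, assuming hypothetical train_df/test_df data frames with numeric target y (so y is one column of train_df).

library(randomForest)

set.seed(200)
rf_fit <- randomForest(y ~ ., data = train_df,
                       ntree = 500,
                       mtry = floor((ncol(train_df) - 1) / 3),  # default-style p/3 for regression
                       importance = TRUE)

oob_preds  <- predict(rf_fit)                      # out-of-bag predictions (no newdata given)
test_preds <- predict(rf_fit, newdata = test_df)   # predictions on a separate test set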

Constructing a classification tree in R

NOTE: within function predict(), we can set type = "prob" to explicitly display probabilities or set type = "class" to get target predictions. - We need probabilities to create an ROC object and the predictions to create a confusion matrix.
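A sketch of that workflow with pROC and caret, assuming a hypothetical fitted classification tree cls_tree and test data frame test_df whose binary target flag has factor levels "0" and "1".

library(rpart)
library(pROC)
library(caret)

probs   <- predict(cls_tree, newdata = test_df, type = "prob")[, "1"]   # predicted P(flag = 1)
classes <- predict(cls_tree, newdata = test_df, type = "class")         # predicted labels

roc_obj <- roc(test_df$flag, probs)              # ROC object built from the probabilities
auc(roc_obj)

confusionMatrix(classes, factor(test_df$flag))   # confusion matrix from the class predictions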

Section 4.1 #1 Part A You begin by constructing two regression trees and use the Pearson goodness-of-fit statistic to compare them. Explain what distinguishes the two resulting trees from each other and recommend one of the trees. Justify your recommendation.

The first decision tree is larger than the second decision tree. This is due to how the tree construction only differed in one way: the minsplit parameter. It was set much higher for the second tree, making it more difficult for splits to occur, thus producing fewer terminal nodes and a few differences in some of the splits. The main differences in the splits revolve around diamonds with higher prices. Given that Jamie does not seem as concerned with interpretation, I recommend the first tree because of its lower test Pearson goodness-of-fit statistic, while keeping in mind that an even bigger tree might actually be better.

Overfitting concerns in boosting

The number of trees and shrinkage parameter (a.k.a. eta) simultaneously control the flexibility of the boosted model. If eta is set to be small, we should expect to build quite a lot of trees in order to reach a moderately flexible model. Each tree would only add a little to the model's variance, and even after adding many trees, the variance still might not be that big (assuming a small enough eta). When eta is at its maximum of 1, each tree adds a lot to the model's variance, and just a few trees can easily produce high variance.

Note on variable granularity when building trees

Tree models tend to overfit to variables that have many ways to split, including continuous numerical variables and categorical variables with many levels. This is because it is easier to choose a split where the information gain is large.

What if we see that the tree's CV error continues to decrease as flexibility increases and there is no U-shape in the CV error curve?

This behavior implies that a regression tree might not be a good fit. If having more and more splits in the predictor space continues to lead to improvements, particularly from continuous predictors, it likely means there is a prominent continuous relationship between the target and each key predictor.

How to calculate a partial dependence plot

To calculate partial dependence of a variable, we need to first replace all values for that variable in the dataset with the lowest value for that variable, calculate predictions for all observations, and average the predictions. This process is repeated for all values of the variable (or a selected grid of values).
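A manual sketch of this calculation, assuming a hypothetical fitted model fit, data frame df, and predictor x1.

pd_grid <- sort(unique(df$x1))            # grid of values (could also be a coarser grid)

partial_dependence <- sapply(pd_grid, function(v) {
  df_mod <- df
  df_mod$x1 <- v                          # set x1 to this value for ALL observations
  mean(predict(fit, newdata = df_mod))    # average prediction over the dataset
})

plot(pd_grid, partial_dependence, type = "l",
     xlab = "x1", ylab = "Average prediction")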

