Tree-based models
Relative training error for regression trees
= sum of RSS over all leaves / TSS
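Spelled out, with |T| denoting the number of leaves, R_m the m-th leaf, and y-bar the overall target mean (this notation is supplied here for clarity, not taken from the card):

```latex
\[
\text{relative training error}
  = \frac{\sum_{m=1}^{|T|} \sum_{i \in R_m} \bigl(y_i - \hat{y}_{R_m}\bigr)^2}
         {\sum_{i} \bigl(y_i - \bar{y}\bigr)^2}
\]
```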
Boosting
Also known as gradient boosting machines. Works on the principle of sequential learning: builds a sequence of interdependent trees, each using information from previously grown trees
Terminal node
Also known as leaves. Nodes at the bottom of the tree that are not split further and constitute an endpoint
Describe the randomization at each split in a random forest algorithm
At each split, a random sample of m predictors is chosen as the split candidates out of the p predictors, and out of those m, the one that reduces impurity the most is used. A new random sample is drawn at every split. Default for classification is m = sqrt(p); default for regression is m = p/3. m can be treated as a hyperparameter and tuned by cross-validation
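A minimal R sketch of setting m (called mtry). The randomForest package, the data frame dat, and the target y are assumptions made here for illustration, not named on the card:

```r
library(randomForest)  # assumed package choice

set.seed(1)
# mtry = number of predictors sampled as split candidates at each split.
# Defaults are roughly sqrt(p) for classification and p/3 for regression,
# but mtry can be tuned like any other hyperparameter.
rf <- randomForest(y ~ ., data = dat, ntree = 500, mtry = 3)
rf$mtry  # the value actually used
```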
What does it mean for the recursive binary splitting algorithm to be greedy?
At each split, we adopt the binary split that is currently the best, in the sense that it leads to the greatest reduction in impurity at that point, instead of looking ahead and selecting a split that leads to a better tree in a future step. Repeat until a stopping criterion is reached
Differences between boosting and random forests
Base trees in a random forest are fit in parallel and independently, while base trees in a boosted model are fit in series. Random forests address model variance; boosting reduces bias
Pros of boosting relative to base trees
Better prediction performance
Ensemble methods tend to be more successful in reducing which sources of error?
Bias: they capture complex relationships by using multiple base models, each working on different parts of the data. Variance: they improve the stability of the overall predictions by averaging the predictions of the individual base models; the variance of an average is smaller than that of a single component
Bagging
A bootstrapped average of any model. A random forest is a type of bagging for decision trees
What are three different ways to calculate node impurity for classification trees
Classification error rate, Gini index, and entropy
cp in the rpart.control() function
Complexity parameter. The value penalizing a tree by its size when cost-complexity pruning is performed; it is the minimum reduction in the relative training error required for a split to be made. Default is 0.01. The higher the value, the fewer the splits and the less complex the tree
Drawbacks of ensemble methods
Computationally prohibitive to implement and difficult to interpret due to the need to deal with hundreds or thousands of base models
What constraints is c subject to in cost-complexity pruning?
Constraints imposed by other tree control parameters, such as minimum number of observations in a node and maximum depth level
rpart.control() function
Controls the complexity of the tree via minsplit, minbucket, cp, and maxdepth. These are hyperparameters, tuned either by cross-validation (cp) or by trial and error (minbucket and maxdepth)
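A sketch of how these parameters might be set; values marked as defaults match the rpart documentation, the rest are illustrative:

```r
library(rpart)

ctrl <- rpart.control(
  minsplit  = 20,    # a node needs at least this many observations to be split (default)
  minbucket = 7,     # minimum observations allowed in any terminal node (minsplit/3, rounded)
  cp        = 0.01,  # minimum improvement in relative error required for a split (default)
  maxdepth  = 30,    # maximum depth from the root to the furthest terminal node (default)
  xval      = 10     # folds used for the built-in cross-validation (default)
)
```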
Pruning
Cutting branches off a large decision tree to get a more compact, more interpretable, and more predictive tree
Entropy
D = - sum across the k classes of (p hat) * log2(p hat). The lower the D, the purer the node. Log base two is required, but most calculators are not set up for it, so compute log2(x) as ln(x)/ln(2)
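A worked two-class example with assumed class proportions of 0.7 and 0.3:

```latex
\[
D = -\sum_{k=1}^{K}\hat{p}_k \log_2 \hat{p}_k
  = -\Bigl(0.7\,\tfrac{\ln 0.7}{\ln 2} + 0.3\,\tfrac{\ln 0.3}{\ln 2}\Bigr)
  \approx 0.881
\]
```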
Binary tree
Decision tree where each node has at most two children. This is the only type of tree on the exam
Why do you prune a tree?
Decision trees easily suffer from overfitting and can easily become too complex, which makes them difficult to interpret and gives them high variance, since the number of observations in each leaf is small and vulnerable to noise. We therefore shrink the tree to optimize the bias-variance trade-off and the predictive performance on test data
What is the reason for randomizing each split in a random forest?
Decorrelate the base trees by making them use more diverse features, which leads to a greater variance reduction when the base trees are averaged. Without this randomization, if there is a very strong predictor, each base tree will make use of it and all the predictions will be highly correlated
Information gain means
Decrease in impurity created by a tree split = node impurity of parent - sum of [ (number in child node / number in parent node) * impurity of child node ]. Called information gain for entropy and Gini gain for Gini
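In symbols, with I(.) the chosen impurity measure and n_c the number of observations in child c (notation assumed here):

```latex
\[
\text{gain} = I(\text{parent})
  - \sum_{c \,\in\, \text{children}} \frac{n_c}{n_{\text{parent}}}\, I(c)
\]
```

For example, under assumed numbers, a 10-observation parent with Gini 0.5 that splits into a 6-observation child with Gini 0.278 and a pure 4-observation child has a Gini gain of 0.5 - (0.6 * 0.278 + 0.4 * 0), which is roughly 0.333.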
What's the basis of a decision tree model
Divides the feature space into a finite set of non-overlapping, exhaustive regions containing relatively homogeneous observations (with respect to the target) that are more amenable to analysis and prediction. Divides the feature space using a series of classification rules, or splits, based on the values or levels of the predictors
How to predict an observation with a decision tree
Drop the observation down the decision tree and use the classification rules to identify the region to which it belongs. For numeric target variables, use the average of the target variable in that region as the predicted value. For categorical target variables, use the most common class of the target variable in that region as the predicted class
Classification error rate
E = 1 - max(p hat), where p hat is the proportion of the target observations in that node that belong to the kth class. This is the complement of the proportion of observations in the most common class, i.e., the proportion of misclassified observations. The lower the E, the more homogeneous the node
Boosting algorithm
Each iteration fits a tree to the residuals of the preceding tree and subtracts a scaled-down version of the current tree's predictions from the preceding tree's residuals to form new residuals. Each tree focuses on predicting the observations that the previous tree predicted poorly. Moves toward the negative gradient of the objective function we want to minimize
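A minimal sketch of how this is often run in R, assuming the gbm package and a hypothetical data frame dat with a numeric target y (the card does not name a package):

```r
library(gbm)  # assumed package choice

set.seed(1)
fit <- gbm(
  y ~ ., data = dat,
  distribution      = "gaussian",  # squared-error loss for a numeric target
  n.trees           = 1000,        # number of sequential base trees
  shrinkage         = 0.01,        # lambda: scales down each tree's contribution
  interaction.depth = 2            # depth of each base tree
)
# The prediction is the sum of the scaled-down predictions of the base trees.
pred <- predict(fit, newdata = dat, n.trees = 1000)
```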
What is a requirement for an ensemble method to work?
Each model has to already be reasonably good
When fitting a tree what if we have control parameters that impose no restrictions?
Each terminal node will have one observation, and we will have a perfect fit
Graph for entropy
For k = 2, this is an upside-down U that is fatter than the Gini curve
Gini index
G = sum across the k classes of p hat * (1 - p hat) = 1 - sum across the k classes of (p hat)^2. Measures the variance across the k classes. The lower the G, the purer the node. Default in R. It measures how often a randomly chosen element from the node would be incorrectly classified if it were randomly classified according to the distribution of targets in the node
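For the same assumed two-class node with proportions 0.7 and 0.3 used in the entropy example:

```latex
\[
G = \sum_{k=1}^{K}\hat{p}_k\bigl(1-\hat{p}_k\bigr)
  = 1 - \sum_{k=1}^{K}\hat{p}_k^{\,2}
  = 1 - \bigl(0.7^2 + 0.3^2\bigr) = 0.42
\]
```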
Compare GLMs and decision trees, based on how they handle numeric predictors
GLMs assume that the effect of a numeric predictor on the target mean is monotonic and assign it a single regression coefficient; nonlinear and non-monotonic effects can be captured by adding higher-power polynomial terms. Trees separate observations into distinct groups based on the ordered values of a numeric predictor; the predicted means or probabilities can be irregular as a function of the numeric predictor, and trees are good at handling nonlinear relationships since no monotonic structure is imposed
Compare GLMs and decision trees, based on how they handle categorical predictors
GLMs have to binarize categorical predictors and convert them to dummy variables, assigning one coefficient to each non-baseline level. Trees separate the levels of a predictor into two groups, so no binarization is needed; binarization would in fact impose unwarranted restrictions on trees, since a single split could no longer send more than one level to the same group
Compare GLMs and decision trees, based on output
GLMs produce a closed-form equation showing how the target mean depends on the feature values, with coefficients indicating the magnitude and direction of the effects on the target. Trees produce a set of if/else splitting rules showing how the target mean varies with the feature values, with no equation
Compare GLMs and decision trees based on collinearity
GLMs are vulnerable to the presence of highly correlated predictors; collinearity may inflate the variance of coefficient estimates and model predictions, which makes the interpretation of the coefficient estimates less meaningful. Trees have less of a problem with collinearity, even if it is perfect, but with highly correlated predictors the choice of which one to use in a split may be rather arbitrary, which can skew the variable importance scores
Compare GLMs and decision trees based on interactions between predictors
GLMs cannot capture interactions unless interaction terms are manually inserted into the equation. For trees, each split in recursive binary splitting affects only a subset of the feature space, so when the splits are based on different predictors an interaction effect is automatically modeled; the more splits, the larger the extent of the interaction that can be captured
Cost complexity pruning
Grow a very large tree and then prune it back to get a smaller subtree. Prune branches that do not improve the goodness of fit by a sufficient amount
Node impurity
How dissimilar the observations are in a node
Balanced tree
If the left and right subtrees coming out of any node differ in depth by at most one
Pros of decision trees
Interpretability: as long as there aren't too many leaves, trees are easy to interpret and explain in a non-technical way because of the transparent nature of the classification rules, which mirror human decision making; they can also be displayed graphically, which makes it easier to tell which predictors matter most. Nonlinear relationships: numeric predictors can affect the target in a complex manner, and no extra terms need to be supplied in advance; there is also no need to transform skewed variables. Interactions: trees automatically recognize interactions, so we don't need to identify them ahead of time. Categorical variables: no need for binarization or selecting a baseline level. Variable selection: variables are automatically selected; those that do not appear in the tree are filtered out, and the most important ones show up at the top. Missing data: trees can be easily modified to handle missing data. Quick to run. Good at recognizing natural break points in continuous variables
Cons of random forests over a single tree
Interpretability: random forests consist of hundreds to thousands of decision trees, so they are harder to interpret and cannot be visualized in a tree diagram, which makes it difficult to see how predictions depend on each feature; the whole process is a black box. Computational power: they take longer to implement due to the computational burden
What is wrong with the naïve pruning method?
It overlooks that a not-so-good split early in the tree-building process may be followed by a good split. Too short-sighted
Cons of boosting relative to base trees
Less interpretability and computational efficiency
maxdepth in rpart.control() function
Maximum number of branches from the root to the furthest terminal node. The higher the value, the more complex the tree. Default is 30
Entropy maximum and minimum value
Minimum is zero when all observations are in one class. Maximum is log2(k) when the observations are evenly split among the k classes
minbucket in the rpart.control() function
Minimum number of observations in any terminal node; if a split would generate a node with fewer observations, the split is not made. The lower the value, the larger and more complex the tree. Default is minsplit/3 (rounded)
minsplit in the rpart.control() function
Minimum number of observations that must be in a node in order for it to be split further. The lower this is, the larger and more complex the tree. Default is minbucket*3
Minimum and maximum values of Gini index
Minimum value of zero if all observations are in one class. Maximum value of 1 - 1/k if the observations are spread evenly among the k classes
Relative training error for classification trees
Misclassifications made by the tree / misclassifications made by using the majority class. Uses the classification error rate
Cons of boosting relative to random forest
More vulnerable to overfitting, due to multiple base trees trying to capture the signal in the data. More sensitive to hyperparameter values
Pros of random forest against a single tree
Much more robust. Averaging results leads to a substantial variance reduction, even though the trees are unpruned, producing much more precise predictions. Does not reduce bias, but the bias is already low since we are using unpruned trees
How many leaves do we have if we have N splits?
N + 1 For numeric target variables, we have a finite number of possible predictions equal to N + 1
Is there a restriction on how many times we can use a predictor to split the data?
No, we can use the same predictor more than once
Root node
Node at the top of a decision tree, representing the full data set
xval argument in rpart() function
Number of folds used when doing cross-validation. Default is 10. No effect on complexity, but affects performance
Stopping criterion for the splitting process
The number of observations in all terminal nodes is below some threshold, or there are no more splits that lead to a significant reduction in impurity
Depth
Number of tree splits needed to go from the tree's root node to the furthest terminal node. Measures the complexity of the tree. The larger the depth, the more complex the tree
Optimally weighted
Weights chosen so that the objective function evaluated on a separate validation set is minimized
rpart.plot() characteristics
Observations satisfying the splitting criterion are sent to the left branch. A predicted response value is printed at each node
To split in a classification tree we want
Overall impurity, defined as the weighted average of the impurity of the resulting nodes (weighted by the number of observations), to be the smallest among all possible choices
Cons for decision trees
Overfitting: trees are more prone to overfitting and producing unstable predictions with high variance, even with pruning, since a small change in the training data can lead to a large change in the fitted tree and its predictions; a bad initial split can mess up the rest of the tree. Numeric variables: tree splits may need to be made on the same variable repeatedly, which raises complexity and lowers interpretability. Categorical variables: trees favor predictors with more levels, even if they are not the best predictors, because there are more ways to split them, so we may want to combine levels into more representative groups. Lack of model diagnostic tools
Pros of boosting relative to random forest
Performs better in terms of prediction accuracy, due to the emphasis on bias reduction
Node
Point on a decision tree that corresponds to a subset of the training data. Also called a segment
Naïve pruning method
Pre-specify an impurity reduction threshold, and make splits only when the reduction in impurity exceeds this threshold
How do we predict in boosting?
Prediction is the sum of the scaled down prediction of each base tree
What do we use to measure node impurity for regression trees?
RSS = sum of (y - y hat)^2, where y hat is the mean of the target observations in that node. The lower this is, the more homogeneous the target observations in that node are
To find the split for a regression tree we want to
RSS = sum over the first branch of (y - y hat)^2 + sum over the second branch of (y - y hat)^2 to be the smallest among all possible choices of the predictor and cutoff level
Two types of ensemble tree methods
Random forests and boosting
Bootstrap
Re-sampling method that randomly samples the original training observations with replacement
To construct a decision tree we use
Recursive binary splitting
Dimension of the partition of the predictor space
Same dimension as the number of predictors, so if we have two variables, the partition looks like a box split into different regions
Binary splits
Separating the training observations into two groups according to the value or level of one and only one predictor at a time
Deliverables for single decision tree
Set of classification rules based on the values or levels of predictors and represented in the form of a tree
Toy decision tree
Shaped like a family tree
Lambda in boosting
Shrinkage parameter applied to the predictions of the individual base trees. Makes the algorithm take longer to converge, but prevents overfitting
Feature of the boosting algorithm
Slow learning
Nested family of c
Solving the penalized objective for each c produces a nested family of fitted decision trees indexed by c. If c1 < c2, then the tree fit with c2 is smaller, and all of its nodes form a subset of the nodes of the tree fit with c1
Method argument in the rpart() function
Specifies whether we want a regression tree ("anova") or a classification tree ("class"). If left blank, R makes a choice based on the target, which may not be desirable
Are decision trees supervised or unsupervised
Supervised and can tackle both regression and classification problems
Random Forest
Takes the training set and uses bootstrapping to produce B bootstrap training samples; these samples are independent and random. We then train an unpruned decision tree on each of the B samples separately, apply the test set to each, and combine the results to form an overall prediction. Need to decide the number of trees, the number of observations in each bootstrapped sample, and the proportion of features used at each split
What do ensemble methods allow?
They allow us to hedge against the instability of decision trees and substantially improve their prediction performance
How do we choose which node impurity calculation to use for classification trees?
This doesn't make a huge difference, but the Gini index and entropy are more sensitive to node impurity, are differentiable with respect to the class proportions, and are more amenable to numerical optimization. The classification error rate is used most often in cost-complexity pruning
How do we choose c in cost complexity, pruning
This is a hyperparameter, so we can use k-fold cross-validation. For a given c, we fit a tree on all but one of the k folds, generate predictions on the held-out fold, and measure the predictive performance. We repeat this k times, compute the overall performance, and choose the c that results in the lowest error
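A common rpart workflow for this; dat and y are placeholder names, and xerror is rpart's cross-validation error column:

```r
library(rpart)

# Grow a deliberately large tree with a small pre-specified cp.
fit <- rpart(y ~ ., data = dat, method = "anova",
             control = rpart.control(cp = 0.001, xval = 10))

# cptable records the cross-validation error for each candidate cp.
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]

# Snip off the branches that do not pay for their complexity.
pruned <- prune(fit, cp = best_cp)
```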
When m = p at each split in a random forest
This is called the bagged model also known as bootstrap aggregation
Deciding whether to treat a variable numerically, or categorically for trees
This is less of an issue than for GLMs, and retaining it as a numeric variable doesn't reduce the dimensionality as much. If we treat a variable as numeric, the decision tree will take the order into account when splitting; if we treat it as a factor, the tree can make many more splits, but the splits may be odd
Disadvantages of using classification error rate
This is not sensitive enough to node purity, since it depends on the class proportions only through the maximum and not the distribution of the other classes
Recursive binary splitting is what type of algorithm
Top down and greedy
Effect on trees of adding higher-power terms or transformations
Monotonic transformations of a predictor have no effect on the tree. Transforming the target will have an effect on the tree. Non-monotonic transformations of a predictor will have an effect on the tree
Cost complexity pruning, mathematically
Use a penalized objective function: relative training error + c*|T|. The relative training error is the training error of the tree scaled by the training error of the simplest tree (the root). c is the complexity parameter, bounded between zero and one. |T| is the size of the tree, measured by the number of terminal nodes
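In symbols, with T a subtree of the full tree T_0 and |T| its number of terminal nodes (notation assumed here):

```latex
\[
\min_{T \,\subseteq\, T_0}\;
  \text{relative training error}(T) + c\,|T|,
\qquad 0 \le c \le 1
\]
```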
What are the overall predictions in a random forest for categorical targets?
Use the majority vote and pick the predicted class as the one that is the most commonly occurring among the B predictions
rpart() function
Used for fitting trees. Takes a formula argument that specifies the target and predictors, a data argument that specifies the data frame, a method argument, a control argument, an xval argument, and a parms argument
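A minimal call, with the formula and data names (y, x1, x2, dat) as placeholders:

```r
library(rpart)

tree <- rpart(
  y ~ x1 + x2,                     # target and predictors
  data    = dat,                   # data frame holding the variables
  method  = "class",               # "class" for classification, "anova" for regression
  control = rpart.control(cp = 0.01, minbucket = 5, xval = 10),
  parms   = list(split = "gini")   # impurity measure for classification splits
)
```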
What is the basis of ensemble methods?
Uses multiple base models, instead of relying on a single model, and takes the result of these base models in aggregate to make an overall prediction
Control argument in the rpart() function
Uses the rpart.control() function to specify a list of parameters controlling when the partitioning stops. Controls the complexity of the tree
Why do we pre-specify the complexity parameter when fitting trees if we actually prune it later
We do this to pre-prune splits that are obviously not worthwhile, to save computational effort. We don't want this pre-specified complexity parameter to be too big and over-prune, so we use a reasonably small value like 0.001
How do we make decisions on what predictor to use and what our cut off value is going to be for binary splits?
We need to quantify node impurity and partition the target observations by whichever variable results in the greatest reduction in impurity, or greatest information gain
What does it mean for the recursive binary splitting algorithm to be top down
We start from the top of the tree and go down, sequentially partitioning the feature space in a series of binary splits
What is the rationale behind using a shrunken version of the tree predictions in boosting?
We want to move only a fraction at a time to slow down the learning process and allow trees of different shapes to attack the residuals. Methods that learn slowly tend to do well
Accuracy weighted
Weight is a measure of accuracy of the fitted model measured on a separate validation set
What are the overall predictions in a random forest for a numeric target?
Weighted average of the B base predictions. Use 1/B for an evenly weighted average; could also use an accuracy-weighted or optimally weighted average
Graph for Gini index
When k = 2, this looks like an upside-down U that is skinnier than the entropy curve
Graph for classification error rate
When k=2 it looks like an upside down V
Every time we create a binary split, what two interrelated decisions are we making?
Which predictor to split on; and given that split predictor, what is the corresponding cutoff value, or how are the levels of a categorical predictor split into groups?
Role of c in cost complexity pruning
A penalty term to prevent overfitting. If c = 0, there is no price for a complex tree and the fitted tree is the same as the large, overgrown tree. As c increases, the penalty becomes greater and tree branches are snipped off to form new, larger terminal nodes. If c = 1, the penalized objective is minimized at the root node, because no matter how we split the data, the drop in the relative training error can never compensate for the increase in the penalty