Tree-based models
Relative training error for regression trees
= sum of RSS over all leaves / TSS
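Spelled out, with |T| denoting the number of leaves, R_m the m-th leaf, and y-bar the overall target mean (this notation is supplied here for clarity, not taken from the card):

```latex
\[
\text{relative training error}
  = \frac{\sum_{m=1}^{|T|} \sum_{i \in R_m} \bigl(y_i - \hat{y}_{R_m}\bigr)^2}
         {\sum_{i} \bigl(y_i - \bar{y}\bigr)^2}
\]
```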
Boosting
Also known as gradient boosting machines. Works on the principle of sequential learning: builds a sequence of interdependent trees, each using information from previously grown trees
Terminal node
Also known as leaves. Nodes at the bottom of the tree that are not split further and constitute an endpoint
Describe the randomization at each split in a random forest algorithm
At each split, a random sample of m predictors is chosen as the split candidates out of the p predictors, and out of those m, the one that reduces impurity the most is used. A new random sample is drawn at every split. Default for classification is m = sqrt(p); default for regression is m = p/3. m can be treated as a hyperparameter and tuned by cross-validation
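A minimal R sketch of setting m (called mtry). The randomForest package, the data frame dat, and the target y are assumptions made here for illustration, not named on the card:

```r
library(randomForest)  # assumed package choice

set.seed(1)
# mtry = number of predictors sampled as split candidates at each split.
# Defaults are roughly sqrt(p) for classification and p/3 for regression,
# but mtry can be tuned like any other hyperparameter.
rf <- randomForest(y ~ ., data = dat, ntree = 500, mtry = 3)
rf$mtry  # the value actually used
```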
What does it mean for the recursive binary splitting algorithm to be greedy?
At each split, we adopt the binary split that is currently the best, in the sense that it leads to the greatest reduction in impurity at that point, instead of looking ahead and selecting a split that leads to a better tree in a future step. Repeat until a stopping criterion is reached
Differences between boosting and random forests
Base trees in a random forest are fit in parallel and independently, while base trees in a boosted model are fit in series. Random forests address model variance; boosting reduces bias
Pros of boosting relative to base trees
Better prediction performance
Ensemble methods tend to be more successful in reducing which sources of error?
Bias: they capture complex relationships by using multiple base models, each working on different parts of the data. Variance: they improve the stability of the overall predictions by averaging the predictions of the individual base models; the variance of an average is smaller than that of a single component
Bagging
A bootstrapped average of any model. A random forest is a type of bagging for decision trees
What are three different ways to calculate node impurity for classification trees
Classification error rate, Gini index, and entropy
cp in the rpart.control() function
Complexity parameter. The value penalizing a tree by its size when cost-complexity pruning is performed; it is the minimum reduction in the relative training error required for a split to be made. Default is 0.01. The higher the value, the fewer the splits and the less complex the tree
Drawbacks of ensemble methods
Computationally prohibitive to implement and difficult to interpret due to the need to deal with hundreds or thousands of base models
What constraints is c subject to in cost-complexity pruning?
Constraints imposed by other tree control parameters, such as minimum number of observations in a node and maximum depth level
rpart.control() function
Controls the complexity of the tree via minsplit, minbucket, cp, and maxdepth. These are hyperparameters, tuned either by cross-validation (cp) or by trial and error (minbucket and maxdepth)
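A sketch of how these parameters might be set; values marked as defaults match the rpart documentation, the rest are illustrative:

```r
library(rpart)

ctrl <- rpart.control(
  minsplit  = 20,    # a node needs at least this many observations to be split (default)
  minbucket = 7,     # minimum observations allowed in any terminal node (minsplit/3, rounded)
  cp        = 0.01,  # minimum improvement in relative error required for a split (default)
  maxdepth  = 30,    # maximum depth from the root to the furthest terminal node (default)
  xval      = 10     # folds used for the built-in cross-validation (default)
)
```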
Pruning
Cutting branches off a large decision tree to get a more compact, more interpretable, and more predictive tree
Entropy
D = - sum across the k classes of (p hat) * log2(p hat). The lower the D, the purer the node. Log base two is required, but most calculators are not set up for it, so compute log2(x) as ln(x)/ln(2)
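A worked two-class example with assumed class proportions of 0.7 and 0.3:

```latex
\[
D = -\sum_{k=1}^{K}\hat{p}_k \log_2 \hat{p}_k
  = -\Bigl(0.7\,\tfrac{\ln 0.7}{\ln 2} + 0.3\,\tfrac{\ln 0.3}{\ln 2}\Bigr)
  \approx 0.881
\]
```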
Binary tree
Decision tree where each node has at most two children. This is the only type of tree on the exam
Why do you prune a tree?
Decision trees easily suffer from overfitting and can easily become too complex, which makes them difficult to interpret and gives them high variance, since the number of observations in each leaf is small and vulnerable to noise. We therefore shrink the tree to optimize the bias-variance trade-off and the predictive performance on test data
What is the reason for randomizing each split in a random forest?
Decorrelate the base trees by making them use more diverse features, which leads to a greater variance reduction when the base trees are averaged. Without this randomization, if there is a very strong predictor, each base tree will make use of it and all the predictions will be highly correlated
Information gain means
Decrease in impurity created by a tree split = node impurity of parent - sum of [ (number in child node / number in parent node) * impurity of child node ]. Called information gain for entropy and Gini gain for Gini
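In symbols, with I(.) the chosen impurity measure and n_c the number of observations in child c (notation assumed here):

```latex
\[
\text{gain} = I(\text{parent})
  - \sum_{c \,\in\, \text{children}} \frac{n_c}{n_{\text{parent}}}\, I(c)
\]
```

For example, under assumed numbers, a 10-observation parent with Gini 0.5 that splits into a 6-observation child with Gini 0.278 and a pure 4-observation child has a Gini gain of 0.5 - (0.6 * 0.278 + 0.4 * 0), which is roughly 0.333.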
What's the basis of a decision tree model
Divides the feature space into a finite set of non-overlapping, exhaustive regions containing relatively homogeneous observations (with respect to the target) that are more amenable to analysis and prediction. Divides the feature space using a series of classification rules, or splits, based on the values or levels of the predictors
How to predict an observation with a decision tree
Drop the observation down the decision tree and use the classification rules to identify the region to which it belongs. For numeric target variables, use the average of the target variable in that region as the predicted value. For categorical target variables, use the most common class of the target variable in that region as the predicted class
Classification error rate
E = 1 - max(p hat), where p hat is the proportion of the target observations in that node that belong to the kth class. This is the complement of the proportion of observations in the most common class, i.e., the proportion of misclassified observations. The lower the E, the more homogeneous the node
Boosting algorithm
Each iteration fits a tree to the residuals of the preceding tree and subtracts a scaled-down version of the current tree's predictions from the preceding tree's residuals to form new residuals. Each tree focuses on predicting the observations that the previous tree predicted poorly. Moves toward the negative gradient of the objective function we want to minimize
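A minimal sketch of how this is often run in R, assuming the gbm package and a hypothetical data frame dat with a numeric target y (the card does not name a package):

```r
library(gbm)  # assumed package choice

set.seed(1)
fit <- gbm(
  y ~ ., data = dat,
  distribution      = "gaussian",  # squared-error loss for a numeric target
  n.trees           = 1000,        # number of sequential base trees
  shrinkage         = 0.01,        # lambda: scales down each tree's contribution
  interaction.depth = 2            # depth of each base tree
)
# The prediction is the sum of the scaled-down predictions of the base trees.
pred <- predict(fit, newdata = dat, n.trees = 1000)
```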
What is a requirement for an ensemble method to work?
Each model has to already be reasonably good
When fitting a tree what if we have control parameters that impose no restrictions?
Each terminal node will have one observation, and we will have a perfect fit
Graph for entropy
For k = 2, this is an upside-down U that is fatter than the Gini curve
Gini index
G = sum across the k classes of p hat * (1 - p hat) = 1 - sum across the k classes of (p hat)^2. Measures the variance across the k classes. The lower the G, the purer the node. Default in R. It measures how often a randomly chosen element from the node would be incorrectly classified if it were randomly classified according to the distribution of targets in the node
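For the same assumed two-class node with proportions 0.7 and 0.3 used in the entropy example:

```latex
\[
G = \sum_{k=1}^{K}\hat{p}_k\bigl(1-\hat{p}_k\bigr)
  = 1 - \sum_{k=1}^{K}\hat{p}_k^{\,2}
  = 1 - \bigl(0.7^2 + 0.3^2\bigr) = 0.42
\]
```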
Compare GLMs and decision trees, based on how they handle numeric predictors
GLMs assume that the effect of a numeric predictor on the target mean is monotonic and assign it a single regression coefficient; nonlinear and non-monotonic effects can be captured by adding higher-power polynomial terms. Trees separate observations into distinct groups based on the ordered values of a numeric predictor; the predicted means or probabilities can be irregular as a function of the numeric predictor, and trees are good at handling nonlinear relationships since no monotonic structure is imposed
Compare GLMs and decision trees, based on how they handle categorical predictors
GLMs have to binarize categorical predictors and convert them to dummy variables, assigning one coefficient to each non-baseline level. Trees separate the levels of a predictor into two groups, so no binarization is needed; binarization would in fact impose unwarranted restrictions on trees, since a single split could no longer send more than one level to the same group
Compare GLMs and decision trees, based on output
GLMs produce a closed-form equation showing how the target mean depends on the feature values, with coefficients indicating the magnitude and direction of the effects on the target. Trees produce a set of if/else splitting rules showing how the target mean varies with the feature values, with no equation
Compare GLMs and decision trees based on collinearity
GLMs are vulnerable to the presence of highly correlated predictors; collinearity may inflate the variance of coefficient estimates and model predictions, which makes the interpretation of the coefficient estimates less meaningful. Trees have less of a problem with collinearity, even if it is perfect, but with highly correlated predictors the choice of which one to use in a split may be rather arbitrary, which can skew the variable importance scores
Compare GLMs and decision trees based on interactions between predictors
GLMs cannot capture interactions unless interaction terms are manually inserted into the equation. For trees, each split in recursive binary splitting affects only a subset of the feature space, so when the splits are based on different predictors an interaction effect is automatically modeled; the more splits, the larger the extent of the interaction that can be captured
Cost complexity pruning
Grow a very large tree and then prune it back to get a smaller subtree. Prune branches that do not improve the goodness of fit by a sufficient amount
Node impurity
How dissimilar the observations are in a node
Balanced tree
If the left and right subtrees coming out of any node differ in depth by at most one
Pros of decision trees
Interpretability: as long as there aren't too many leaves, trees are easy to interpret and explain in a non-technical way because of the transparent nature of the classification rules, which mirror human decision making; they can also be displayed graphically, which makes it easier to tell which predictors matter most. Nonlinear relationships: numeric predictors can affect the target in a complex manner, and no extra terms need to be supplied in advance; there is also no need to transform skewed variables. Interactions: trees automatically recognize interactions, so we don't need to identify them ahead of time. Categorical variables: no need for binarization or selecting a baseline level. Variable selection: variables are automatically selected; those that do not appear in the tree are filtered out, and the most important ones show up at the top. Missing data: trees can be easily modified to handle missing data. Quick to run. Good at recognizing natural break points in continuous variables
Cons of random forests over a single tree
Interpretability: random forests consist of hundreds to thousands of decision trees, so they are harder to interpret and cannot be visualized in a tree diagram, which makes it difficult to see how predictions depend on each feature; the whole process is a black box. Computational power: they take longer to implement due to the computational burden
What is wrong with the naïve pruning method?
It overlooks that a not-so-good split early in the tree-building process may be followed by a good split. Too short-sighted
Cons of boosting relative to base trees
Less interpretability and computational efficiency
maxdepth in rpart.control() function
Maximum number of branches from the root to the furthest terminal node. The higher the value, the more complex the tree. Default is 30
Entropy maximum and minimum value
Minimum is zero when all observations are in one class. Maximum is log2(k) when the observations are evenly split among the k classes
minbucket in the rpart.control() function
Minimum number of observations in any terminal node; if a split would generate a node with fewer observations, the split is not made. The lower the value, the larger and more complex the tree. Default is minsplit/3 (rounded)
minsplit in the rpart.control() function
Minimum number of observations that must be in a node in order for it to be split further. The lower this is, the larger and more complex the tree. Default is minbucket*3
Minimum and maximum values of Gini index
Minimum value of zero if all observations are in one class. Maximum value of 1 - 1/k if the observations are spread evenly among the k classes
Relative training error for classification trees
Misclassifications made by the tree / misclassifications made by using the majority class. Uses the classification error rate
Cons of boosting relative to random forest
More vulnerable to overfitting, due to multiple base trees trying to capture the signal in the data. More sensitive to hyperparameter values
Pros of random forest against a single tree
Much more robust. Averaging results leads to a substantial variance reduction, even though the trees are unpruned, producing much more precise predictions. Does not reduce bias, but the bias is already low since we are using unpruned trees
How many leaves do we have if we have N splits?
N + 1 For numeric target variables, we have a finite number of possible predictions equal to N + 1
Is there a restriction on how many times we can use a predictor to split the data?
No, we can use the same predictor more than once
Root node
Node at the top of a decision tree, representing the full data set
xval argument in rpart() function
Number of folds used when doing cross-validation. Default is 10. No effect on complexity, but affects performance
Stopping criterion for the splitting process
The number of observations in all terminal nodes is below some threshold, or there are no more splits that lead to a significant reduction in impurity
Depth
Number of tree splits needed to go from the tree's root node to the furthest terminal node. Measures the complexity of the tree. The larger the depth, the more complex the tree
Optimally weighted
Weights chosen so that the objective function evaluated on a separate validation set is minimized
rpart.plot() characteristics
Observations satisfying the splitting criterion are sent to the left branch. A predicted response value is printed at each node
To split in a classification tree we want
Overall impurity, defined as the weighted average of the impurity of the resulting nodes (weighted by the number of observations), to be the smallest among all possible choices
Cons for decision trees
Overfitting: trees are more prone to overfitting and producing unstable predictions with high variance, even with pruning, since a small change in the training data can lead to a large change in the fitted tree and its predictions; a bad initial split can mess up the rest of the tree. Numeric variables: tree splits may need to be made on the same variable repeatedly, which raises complexity and lowers interpretability. Categorical variables: trees favor predictors with more levels, even if they are not the best predictors, because there are more ways to split them, so we may want to combine levels into more representative groups. Lack of model diagnostic tools
Pros of boosting relative to random forest
Performs better in terms of prediction accuracy, due to the emphasis on bias reduction
Node
Point on a decision tree that corresponds to a subset of the training data. Also called a segment
Naïve pruning method
Pre-specify an impurity reduction threshold, and make splits only when the reduction in impurity exceeds this threshold
How do we predict in boosting?
Prediction is the sum of the scaled down prediction of each base tree
What do we use to measure node impurity for regression trees?
RSS = sum of (y - y hat)^2, where y hat is the mean of the target observations in that node. The lower this is, the more homogeneous the target observations in that node are
To find the split for a regression tree we want to
RSS = sum over the first branch of (y - y hat)^2 + sum over the second branch of (y - y hat)^2 to be the smallest among all possible choices of the predictor and cutoff level
Two types of ensemble tree methods
Random forests and boosting
Bootstrap
Re-sampling method that randomly samples the original training observations with replacement
To construct a decision tree we use
Recursive binary splitting
Dimension of the partition of the predictor space
Same dimension as the number of predictors, so if we have two variables, the partition looks like a box split into different regions
Binary splits
Separating the training observations into two groups according to the value or level of one and only one predictor at a time
Deliverables for single decision tree
Set of classification rules based on the values or levels of predictors and represented in the form of a tree
Toy decision tree
Shaped like a family tree
Lambda in boosting
Shrinkage parameter applied to the predictions of the individual base trees. Makes the algorithm take longer to converge, but prevents overfitting
Feature of the boosting algorithm
Slow learning
Nested family of c
Solving the penalized objective for each c produces a nested family of fitted decision trees indexed by c. If c1 < c2, then the tree fit with c2 is smaller, and all of its nodes form a subset of the nodes of the tree fit with c1
Method argument in the rpart() function
Specifies whether we want a regression tree ("anova") or a classification tree ("class"). If left blank, R makes a choice based on the target, which may not be desirable
Are decision trees supervised or unsupervised
Supervised and can tackle both regression and classification problems
Random Forest
Takes the training set and uses bootstrapping to produce B bootstrap training samples; these samples are independent and random. We then train an unpruned decision tree on each of the B samples separately, apply the test set to each, and combine the results to form an overall prediction. Need to decide the number of trees, the number of observations in each bootstrapped sample, and the proportion of features used at each split
What do ensemble methods allow?
They allow us to hedge against the instability of decision trees and substantially improve their prediction performance
How do we choose which node impurity calculation to use for classification trees?
This doesn't make a huge difference, but the Gini index and entropy are more sensitive to node impurity, are differentiable with respect to the class proportions, and are more amenable to numerical optimization. The classification error rate is used most often in cost-complexity pruning
How do we choose c in cost complexity, pruning
This is a hyperparameter, so we can use k-fold cross-validation. For a given c, we fit a tree on all but one of the k folds, generate predictions on the held-out fold, and measure the predictive performance. We repeat this k times, compute the overall performance, and choose the c that results in the lowest error
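A common rpart workflow for this; dat and y are placeholder names, and xerror is rpart's cross-validation error column:

```r
library(rpart)

# Grow a deliberately large tree with a small pre-specified cp.
fit <- rpart(y ~ ., data = dat, method = "anova",
             control = rpart.control(cp = 0.001, xval = 10))

# cptable records the cross-validation error for each candidate cp.
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]

# Snip off the branches that do not pay for their complexity.
pruned <- prune(fit, cp = best_cp)
```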
When m = p at each split in a random forest
This is called the bagged model also known as bootstrap aggregation
Deciding whether to treat a variable numerically, or categorically for trees
This is less of an issue than for GLMs, and retaining it as a numeric variable doesn't reduce the dimensionality as much. If we treat a variable as numeric, the decision tree will take the order into account when splitting; if we treat it as a factor, the tree can make many more splits, but the splits may be odd
Disadvantages of using classification error rate
This is not sensitive enough to node purity, since it depends on the class proportions only through the maximum and not the distribution of the other classes
Recursive binary splitting is what type of algorithm
Top down and greedy
Effect on trees of adding higher-power terms or transformations
Monotonic transformations of a predictor have no effect on the tree. Transforming the target will have an effect on the tree. Non-monotonic transformations of a predictor will have an effect on the tree
Cost complexity pruning, mathematically
Use a penalized objective function: relative training error + c*|T|. The relative training error is the training error of the tree scaled by the training error of the simplest tree (the root). c is the complexity parameter, bounded between zero and one. |T| is the size of the tree, measured by the number of terminal nodes
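In symbols, with T a subtree of the full tree T_0 and |T| its number of terminal nodes (notation assumed here):

```latex
\[
\min_{T \,\subseteq\, T_0}\;
  \text{relative training error}(T) + c\,|T|,
\qquad 0 \le c \le 1
\]
```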
What are the overall predictions in a random forest for categorical targets?
Use the majority vote and pick the predicted class as the one that is the most commonly occurring among the B predictions
rpart() function
Used for fitting trees. Takes a formula argument that specifies the target and predictors, a data argument that specifies the data frame, a method argument, a control argument, an xval argument, and a parms argument
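A minimal call, with the formula and data names (y, x1, x2, dat) as placeholders:

```r
library(rpart)

tree <- rpart(
  y ~ x1 + x2,                     # target and predictors
  data    = dat,                   # data frame holding the variables
  method  = "class",               # "class" for classification, "anova" for regression
  control = rpart.control(cp = 0.01, minbucket = 5, xval = 10),
  parms   = list(split = "gini")   # impurity measure for classification splits
)
```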
What is the basis of ensemble methods?
Uses multiple base models, instead of relying on a single model, and takes the result of these base models in aggregate to make an overall prediction
Control argument in the rpart() function
Uses the rpart.control() function to specify a list of parameters controlling when the partitioning stops. Controls the complexity of the tree
Why do we pre-specify the complexity parameter when fitting trees if we actually prune it later
We do this to pre-prune splits that are obviously not worthwhile, to save computational effort. We don't want this pre-specified complexity parameter to be too big and over-prune, so we use a reasonably small value like 0.001
How do we make decisions on what predictor to use and what our cut off value is going to be for binary splits?
We need to quantify node impurity and partition the target observations by whichever variable results in the greatest reduction in impurity, or greatest information gain
What does it mean for the recursive binary splitting algorithm to be top down
We start from the top of the tree and go down, sequentially partitioning the feature space in a series of binary splits
What is the rationale behind using a shrunken version of the tree predictions in boosting?
We want to move only a fraction at a time to slow down the learning process and allow trees of different shapes to attack the residuals. Methods that learn slowly tend to do well
Accuracy weighted
Weight is a measure of accuracy of the fitted model measured on a separate validation set
What are the overall predictions in a random forest for a numeric target?
Weighted average of the B base predictions. Use 1/B for an evenly weighted average; could also use an accuracy-weighted or optimally weighted average
Graph for Gini index
When k = 2, this looks like an upside-down U that is skinnier than the entropy curve
Graph for classification error rate
When k=2 it looks like an upside down V
Every time we create a binary split, what two interrelated decisions are we making?
Which predictor to split on; and given that split predictor, what is the corresponding cutoff value, or how are the levels of a categorical predictor split into groups?
Role of c in cost complexity pruning
A penalty term to prevent overfitting. If c = 0, there is no price for a complex tree and the fitted tree is the same as the large, overgrown tree. As c increases, the penalty becomes greater and tree branches are snipped off to form new, larger terminal nodes. If c = 1, the penalized objective is minimized at the root node, because no matter how we split the data, the drop in the relative training error can never compensate for the increase in the penalty