BADM 211 Final Exam: Random Forests
9. Know the relationship between GridSearchCV parameters and the RandomForest algorithm
-GridSearchCV is used for hyperparameter selection/tuning (i.e. for finding the best estimator)
code: gridSearch = GridSearchCV(RandomForestClassifier(), param_grid, cv=2)
-the first argument is the Random Forest model to tune, the second is param_grid, a dictionary of candidate hyperparameter values, and cv sets the number of cross-validation folds
-cv=2 is the smallest value GridSearchCV accepts (0 and 1 won't work); a low cv is used to keep the search cheap, since the Random Forest already gets an internal error estimate from its out-of-bag (bootstrap) samples, so extra folds add little (note: that is bootstrap sampling, not true cross-validation); a runnable version is sketched below
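A minimal sketch of the full call, assuming only that scikit-learn is installed; the dataset and the values in param_grid are illustrative stand-ins, not recommendations:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# toy data standing in for the course dataset
X, y = make_classification(n_samples=500, n_features=16, random_state=0)

# candidate hyperparameter values to try (illustrative choices)
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 5],
}

# cv=2 is the smallest number of folds GridSearchCV accepts
gridSearch = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=2)
gridSearch.fit(X, y)

print(gridSearch.best_params_)     # best value found for each hyperparameter
print(gridSearch.best_estimator_)  # the refit RandomForestClassifier using those values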
6. Interpret variable importance and understand how it is computed
-bar graph shows the importance of each variable (the larger the bar, the more important)
-regression: importance is the total decrease in the residual sum of squares (RSS) from splits on that variable, averaged over all trees (the more a variable reduces the RSS, the more important it is)
-classification: computed using the mean decrease in the Gini index from splits on that variable, averaged over all trees and expressed relative to the maximum (the predictors with the largest mean decrease in Gini index are the most important)
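A short sketch of how the importance bar graph is typically produced from a fitted forest; the breast-cancer dataset here is only a stand-in with named features:

import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# stand-in dataset with named predictors
data = load_breast_cancer()
X, y = data.data, data.target

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# feature_importances_ holds each predictor's mean decrease in impurity,
# normalized so the values sum to 1
importances = rf.feature_importances_

# horizontal bar graph: the longer the bar, the more important the predictor
order = importances.argsort()
plt.barh([data.feature_names[i] for i in order], importances[order])
plt.xlabel("Variable importance (mean decrease in Gini impurity)")
plt.tight_layout()
plt.show()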
5. Know how Random Forests deal with multicollinearity in the construction of the ensemble
-by selecting random subsets of the rows and columns for each tree: bootstrap sampling for the rows/samples and feature sampling for the columns/predictors (only m = sqrt(p) of the p predictors are considered at a split)
-because every tree sees a different slice of the data, correlated predictors cannot dominate every tree, so the trees stay decorrelated and the damage from multicollinearity is limited (the collinearity in the data itself is not removed); a configuration sketch follows below
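A small sketch of how this subsetting is set in scikit-learn; max_features="sqrt" makes each split consider only m = sqrt(p) randomly chosen predictors, and bootstrap=True gives each tree its own resample of the rows (p and the other values are placeholders):

import math
from sklearn.ensemble import RandomForestClassifier

p = 16                      # suppose the data has 16 predictors
m = round(math.sqrt(p))     # each split then considers only m = 4 of them
print("p =", p, "-> m =", m)

rf = RandomForestClassifier(
    n_estimators=500,
    bootstrap=True,         # random subset of rows per tree (sampled with replacement)
    max_features="sqrt",    # random subset of m = sqrt(p) columns per split
    random_state=0,
)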
1. Understand what the out-of-bag score indicates
-out-of-bag (OOB) observations are the observations not used to fit a given bagged tree: each tree is fit to a bootstrap sample of the observations (and this is done many times), so some observations are left out of each sample, and those leftover, out-of-bag observations can be used to test the model (note: bagging is just doing bootstrapping many times and combining the results)
-the OOB score therefore estimates the test error of the bagged model without needing a separate holdout set (see the sketch below)
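A sketch of reading the OOB score in scikit-learn, with toy data as a stand-in; when oob_score=True, each observation is scored only by the trees that did not see it in their bootstrap sample:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=300,
    oob_score=True,   # score each observation with the trees that did NOT train on it
    random_state=0,
)
rf.fit(X, y)

# accuracy estimated from the out-of-bag observations, i.e. an estimate of test accuracy
print("OOB score:", rf.oob_score_)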
8. Be able to perform a grid search optimization for locating the best values for hyperparameters
-run GridSearchCV on the Random Forest over a coarse grid of values and look at the best estimator/best parameters it finds
-then zoom in: build a finer grid of values around the best value of each hyperparameter and search again
-repeat the zooming until the score stops improving (or you are satisfied); a two-pass sketch follows below
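One way to do the zooming in code: run a coarse grid, read best_params_, then search a narrower grid around those values; all grid values and the dataset below are illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=16, random_state=0)

# pass 1: coarse grid over wide ranges
coarse = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"max_depth": [2, 5, 10, 20], "min_samples_leaf": [1, 10, 50]},
    cv=2,
)
coarse.fit(X, y)
print("coarse best:", coarse.best_params_)   # best values found on the coarse grid

# pass 2: finer grid zoomed in around the pass-1 winners
fine = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"max_depth": [8, 10, 12], "min_samples_leaf": [5, 10, 15]},
    cv=2,
)
fine.fit(X, y)
print("fine best:", fine.best_params_)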
4. Understand the method by which Random Forests create ensembles of trees
1. Draw multiple bootstrap resamples of cases from the training data
2. For each resample, use a random subset of predictors and produce a tree
3. Combine the predictions/classifications from all the trees (the "forest")
 -Voting for classification
 -Averaging for prediction
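A rough, hand-rolled sketch of those three steps using DecisionTreeClassifier and numpy, only to illustrate the mechanism that RandomForestClassifier automates; the dataset and counts are stand-ins, and unlike the real algorithm it draws the predictor subset once per tree rather than at every split:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=16, random_state=0)
rng = np.random.default_rng(0)
n, p = X.shape
m = int(np.sqrt(p))          # size of the random predictor subset
n_trees = 25

trees, feature_sets = [], []
for _ in range(n_trees):
    # 1. bootstrap resample of the rows (drawn with replacement)
    rows = rng.integers(0, n, size=n)
    # 2. random subset of the columns for this tree
    #    (scikit-learn actually re-draws the subset at every split)
    cols = rng.choice(p, size=m, replace=False)
    tree = DecisionTreeClassifier(random_state=0).fit(X[rows][:, cols], y[rows])
    trees.append(tree)
    feature_sets.append(cols)

# 3. combine: majority vote across the forest (averaging would be used for regression)
votes = np.array([t.predict(X[:, cols]) for t, cols in zip(trees, feature_sets)])
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy of the hand-rolled forest:", (forest_pred == y).mean())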
7. Know the hyperparameters of the Random Forest model (scikit-learn documentation)
 a. Early stopping parameters (max_depth, etc.)
 b. Algorithm design parameters (oob_score, bootstrap)
a. -n_estimators: number of trees built before averaging (regression) or taking the majority vote (classification)
 -criterion: impurity measure used for splitting (gini or entropy for classification; squared_error/MSE or absolute_error/MAE for regression)
 -min_impurity_decrease: a node is split only if the split decreases the impurity by at least this amount
 -min_samples_split: minimum number of samples a node must contain for a split to be considered
 -max_depth: maximum depth of each tree (limits how far a tree can grow)
 -max_features: maximum number of features (i.e. predictors) considered when searching for the best split
 -min_samples_leaf: minimum size of an end node (i.e. leaf) of each decision tree
 -min_weight_fraction_leaf: same idea as min_samples_leaf but expressed as a fraction of the total (weighted) number of observations
 -max_leaf_nodes: maximum number of terminal nodes (i.e. leaves) a decision tree may have
b. -oob_score: whether to use the out-of-bag observations to estimate the model's accuracy (stored in oob_score_)
 -bootstrap: whether to bootstrap the samples when building trees (if False, every tree sees the whole dataset)
 -class_weight: weights applied to the classes (e.g. "balanced")
 -n_jobs: number of processor cores to use in parallel (-1 uses all of them)
 -random_state: controls the randomness so results can be replicated
 -verbose: prints progress updates while fitting
 -warm_start: if True, reuses the solution of the previous fit and adds more estimators; if False, fits a whole new forest
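A sketch showing where these hyperparameters go when the model is built; every value below is an arbitrary illustration, not a recommendation:

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    # a. early stopping / tree-size parameters
    n_estimators=300,             # number of trees in the forest
    criterion="gini",             # impurity measure ("gini" or "entropy")
    max_depth=10,                 # maximum depth of each tree
    min_samples_split=10,         # smallest node that may still be split
    min_samples_leaf=5,           # smallest allowed leaf (terminal node)
    min_weight_fraction_leaf=0.0, # leaf size as a fraction of all (weighted) samples
    max_features="sqrt",          # predictors considered at each split
    max_leaf_nodes=None,          # cap on the number of leaves per tree
    min_impurity_decrease=0.0,    # split only if impurity drops by at least this much
    # b. algorithm design / bookkeeping parameters
    bootstrap=True,               # draw a bootstrap sample for each tree
    oob_score=True,               # estimate accuracy from out-of-bag observations
    class_weight=None,            # per-class weights (e.g. "balanced")
    n_jobs=-1,                    # use all available processor cores
    random_state=0,               # make results reproducible
    verbose=0,                    # >0 prints progress while fitting
    warm_start=False,             # True reuses the previous fit and adds more trees
)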
3. Know the differences between bagging and Random Forests
bagging: bootstrap aggregating
 -the multiplier effect comes from multiple bootstrap samples rather than multiple methods (bootstrapping means taking resamples, with replacement, from the original data)
 1. generate multiple bootstrap samples
 2. run the algorithm on each sample and produce predictions/scores
 3. average those predictions (regression) or take a majority vote (classification)
random forests:
 -improve on bagging by decorrelating the trees
 -when building the decision trees, each time a split is considered, a fresh random sample (i.e. subset) of m predictors (usually m = sqrt(p)) is chosen as split candidates from the full set of p predictors, and the split may use only one of those m predictors (i.e. at each split the algorithm is not even allowed to consider a majority of the available predictors)
 -this keeps the bagged trees from looking similar to each other, hence "decorrelating" them (e.g. if one strong predictor dominated the data, the trees under bagging would all split on it first and look alike, i.e. be correlated; Random Forests prevent this)
 -the main difference between bagging and random forests is the choice of predictor subset size m: if a random forest is built using m = p it amounts simply to bagging (every split can use all p predictors, so the trees end up correlated), while m = sqrt(p) decorrelates the trees; on the Heart data (ISLR), random forests with m = sqrt(p) reduce both the test error and the OOB error relative to bagging
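This difference is easy to demonstrate in scikit-learn: max_features=None lets every split consider all p predictors (plain bagging), while max_features="sqrt" gives the random forest's m = sqrt(p); the toy data below is a stand-in and the outcome will vary by dataset:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=25, n_informative=5, random_state=0)

# bagging: m = p, every split may use any predictor, so trees tend to look alike
bagging = RandomForestClassifier(
    n_estimators=300, max_features=None, oob_score=True, random_state=0
).fit(X, y)

# random forest: m = sqrt(p), splits see only a random subset, decorrelating the trees
forest = RandomForestClassifier(
    n_estimators=300, max_features="sqrt", oob_score=True, random_state=0
).fit(X, y)

print("OOB score, bagging (m = p):            ", bagging.oob_score_)
print("OOB score, random forest (m = sqrt(p)):", forest.oob_score_)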
2. Know the differences between Decision Trees and Random Forests
decision trees:
 -build one single tree
 -easy to interpret
 -but unstable (small changes in the data can change the tree a lot)
 -but relatively poor predictive performance
 -and a single tree gives no built-in estimate of its own test error
random forests:
 -an ensemble method: averages a bunch of trees built on bootstrap samples (e.g. when predicting a numerical value it takes the average of the values predicted by the individual trees; for classification it takes a majority vote)
 -the out-of-bag observations give a built-in estimate of test error (similar in spirit to cross-validation, though not the same thing)
 -multiple bootstrap samples are used initially, and the predictions/classifications are tabulated
 -predicts more accurately, reduces the variance of the predictions, and the individual trees' errors tend to cancel out
 -but you lose the interpretability and the explicit rules embodied in a single tree
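A quick way to see the accuracy side of this trade-off, comparing a single tree to a forest on the same stand-in data with cross-validated accuracy (the dataset and fold count are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=300, random_state=0)

# mean 5-fold cross-validated accuracy for each model
print("decision tree:", cross_val_score(single_tree, X, y, cv=5).mean())
print("random forest:", cross_val_score(forest, X, y, cv=5).mean())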