Predictive Analytics - Tips from Actex
OLS model
Choose the estimates of the Bs to minimize the sum of the squared differences between the observed target values and the fitted values under the model
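A minimal R sketch of fitting by least squares (the dataset and predictors are just illustrative):
ols <- lm(mpg ~ wt + hp, data = mtcars)   # coefficients chosen to minimize the residual sum of squares
summary(ols)                              # estimates, standard errors, R^2
sum(residuals(ols)^2)                     # the minimized sum of squared differences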
How to assess predictive performance for classification problems
Classification error rate = proportion of observations in the test set that are incorrectly classified. A smaller error rate indicates the classifier is more predictive.
How to assess predictive performance for regression problems
Commonly use the root mean squared error, RMSE = sqrt((1/n) * sum of squared prediction errors), which aggregates all of the prediction errors on the test set (size = n). A smaller RMSE indicates the model is more predictive. RMSE is measured in the same unit as the target variable.
What does the alpha argument do?
Controls the transparency of the plotted object on a scale from 0 (fully transparent) to 1 (opaque). Increase transparency by decreasing alpha. This is useful when many data points overlap.
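For example (illustrative):
library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(alpha = 0.3)   # alpha < 1 makes overlapping points partially transparent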
Deviance residual of a GLM
The signed square root of the contribution of the ith observation to the deviance; approximately normally distributed and homoscedastic regardless of the target distribution. Deviance D = sum(di^2) over all observations.
Backward/forward selection using the stepAIC function in R
#Backward selection (stepAIC() is in the MASS package)
model.backward <- stepAIC(model.full.bin)
#Forward selection: start from the null model and add features up to the full model
model.null <- lm(Balance ~ 1, data = train.bin)
model.forward <- stepAIC(model.null, direction = "forward", scope = list(upper = model.full.bin, lower = model.null))
#Backward selection using BIC instead of AIC
model.backward.BIC <- stepAIC(model.full.bin, k = log(nrow(train)))
Confusion matrix
- A tabular display of how the predictions of a binary classifier line up with the observed classes
- Use a pre-specified cutoff to say whether the event is predicted to occur
- 4 scenarios: true positive, true negative, false positive, false negative
- A large discrepancy in the classification performance of the training and test confusion matrices indicates overfitting
- We care most about the confusion matrix on the test set
Comparing boosting to other methods
- Boosting has better prediction accuracy than random forests due to its emphasis on bias reduction, but is more vulnerable to overfitting and has a higher computational cost
- Boosting has better prediction accuracy than base trees, but less interpretability and computational efficiency
Boosting
- Builds a sequence of inter-dependent trees using information from previously grown trees
- In each iteration, fit a tree to the residuals of the preceding tree (using the same set of predictors); a scaled-down version of the current tree's predictions is subtracted from the preceding tree's residuals to form the new residuals
- Effect: each tree focuses on predicting observations that the previous tree predicted poorly (reducing bias)
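A sketch of a boosted regression model with the gbm package (the dataset and parameter values are illustrative, not prescriptive):
library(gbm)
library(MASS)                              # for the Boston data used here
set.seed(1)
boost <- gbm(medv ~ ., data = Boston,
             distribution = "gaussian",    # squared-error loss for a regression problem
             n.trees = 500,                # number of sequentially grown trees
             shrinkage = 0.01,             # scales down each tree's contribution
             interaction.depth = 2)        # depth of each base tree
summary(boost)                             # relative influence of each predictor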
Visualizing the behavior of the cross-validation error
- Complexity parameter plot: shows how the cross-validation error of a decision tree varies with the complexity parameter - Top of the plot shows # of terminal nodes
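For example, assuming a tree fitted with rpart (dataset is illustrative):
library(rpart)
set.seed(1)
tree <- rpart(Species ~ ., data = iris, method = "class")
plotcp(tree)   # cross-validation error vs. cp; the axis at the top shows the tree size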
Pros of decision trees as compared to GLMs
- Easy to interpret because of their if/then nature
- Handle nonlinear relationships with no need for variable transformation
- Good at recognizing interactions between variables, so feature generation is less of an issue
- Categorical predictors are automatically handled without the need for binarization or selecting a baseline level
- Variables are automatically selected, with the most important variables at the top of the tree
- Less susceptible to model mis-specification issues
- Can be easily modified to deal with missing data
Random forests
- Generate multiple bootstrapped samples of the training set and fit base trees in parallel, independently on each of the samples
- "Bootstrapped" = sample training observations with replacement
- Train a base, unpruned decision tree on each of the B bootstrapped samples
- At each split, a random sample of m predictors is chosen as the split candidates out of the p available features; the candidate that contributes the greatest reduction in impurity is used for the split
- Set m = sqrt(p) for classification, m = p/3 for regression
- Randomizing the split candidates decorrelates the base trees by making them use more diverse features for different splits, leading to greater variance reduction
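A sketch with the randomForest package (ntree and mtry values are illustrative):
library(randomForest)
set.seed(1)
rf <- randomForest(mpg ~ ., data = mtcars,
                   ntree = 500,                          # B, the number of bootstrapped trees
                   mtry = floor((ncol(mtcars) - 1) / 3), # m = p/3 for a regression forest
                   importance = TRUE)
rf               # out-of-bag error summary
varImpPlot(rf)   # variable importance plot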
Splitting algorithm for a decision tree
- Given an impurity measure, the best binary split at each step maximizes the reduction in the impurity measure
- Resulting splits are further split recursively
- The process continues until a stopping criterion is reached, e.g. the # of observations in a node falls below some pre-specified threshold, or there are no more splits that contribute a significant reduction in node impurity
k-means clustering
- Goal: assign each obs in a dataset into one and only one of k clusters
- k is selected upfront through the elbow method (plot the ratio of between-cluster variation to the total variation against the value of k)
- The k clusters are chosen such that the variation of the obs inside each cluster (within-cluster variation) is as small as possible while the variation between clusters is large
- Outliers may distort cluster arrangements
- Standardize features before running the analysis -> attaches an equal weight to all features when performing distance calculations
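A sketch with base R's kmeans() (the choice of k and nstart is illustrative):
X <- scale(iris[, 1:4])                    # standardize features before clustering
set.seed(1)
km <- kmeans(X, centers = 3, nstart = 20)  # nstart: number of random initializations
km$size                                    # cluster sizes
km$betweenss / km$totss                    # ratio plotted against k in the elbow method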
Ensemble methods
- Hedge against overfitting - Substantially improve predictive performance - Produce multiple base models and combine the results of these models to make predictions - Enhances the model's ability to capture complex relationships in the data - Reduces the variability of the model's predictions - 2 methods: random forests and boosting
Cons of decision trees compared to GLMs
- More prone to overfitting - Produce unstable predictions with a high variance, small change in the training data can lead to a large change in the fitted tree and predictions - Tend to favor categorical features with many levels over those with few levels - Lack of model diagnostic tools for decision trees
Commonly used terminology for decision trees
- Node: a point on a decision tree that corresponds to a subset of the data and often results from 1 or more binary splits
- Root node: the node at the top of the decision tree, representing the full dataset
- Terminal node/leaf: nodes that are not further split and constitute an end on their own
- Binary tree: a decision tree where each node has only 2 children
- Depth: the # of branches from the tree's root node to the furthest terminal node; a measure of the complexity of a decision tree
Cluster analysis
- Partitions heterogeneous obs into a set of distinct homogeneous groups (clusters) where obs share similar characteristics
- Goal: find subgroups in the dataset
1. k-means clustering
2. Hierarchical clustering
Receiver Operating Characteristic (ROC) curve
- Plots the sensitivity against the specificity of a given classifier for each cutoff in the range (0, 1)
- Each point on the curve corresponds to a certain cutoff
- Closer to the top-left corner -> better predictive ability
- Overall predictive ability is summarized by the area under the curve (AUC)
- A classifier no better than chance lies on the 45-degree line, with AUC = 0.5
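A sketch with the pROC package (the binary target and logistic model are made up for illustration):
library(pROC)
dat <- mtcars
dat$high_mpg <- as.integer(dat$mpg > median(dat$mpg))        # illustrative binary target
fit <- glm(high_mpg ~ wt + hp, data = dat, family = binomial)
probs <- predict(fit, type = "response")
roc_obj <- roc(dat$high_mpg, probs)   # sensitivity/specificity over all cutoffs
plot(roc_obj)
auc(roc_obj)                          # area under the ROC curve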
Principal components analysis (PCA): proportion of variance explained
- Quantifies how much information is captured by the PCs, helps us know how many to use - Decreases with the index of the PC since subsequent PCs have more and more orthogonality constraints and so less flexibility with the choice of PC loadings
Random forest results for different types of target variables
- Quantitative: overall prediction is the average of the B base predictions - Qualitative: take a majority vote and pick the predicted class as the one that is most commonly occurring among the B predictions
Notes about cross validation error to determine the optimal lambda
- Red dots represent the cross-validation error corresponding to different values of lambda; the vertical lines around the dots are CI bands
- Numbers at the top of the graph indicate how many features have nonzero coefficient estimates at a given value of lambda
- The first vertical dotted line corresponds to the value of lambda that minimizes the cv error (stored in the lambda.min component of the cv.glmnet output)
- The second vertical dotted line corresponds to the simplest regularized regression model whose cv error is within 1 standard error of the minimum error (the lambda.1se component)
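A sketch with glmnet (the data and the choice of alpha are illustrative):
library(glmnet)
X <- model.matrix(mpg ~ ., data = mtcars)[, -1]   # design matrix without the intercept
y <- mtcars$mpg
set.seed(1)
cv.fit <- cv.glmnet(X, y, alpha = 1)   # lasso; cross-validates over a grid of lambdas
plot(cv.fit)                           # the CV error plot described above
cv.fit$lambda.min                      # lambda minimizing the CV error (first dotted line)
cv.fit$lambda.1se                      # largest lambda within 1 SE of the minimum (second dotted line)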
Creating useful features using PCA
- Replace the original variables with the PCs that capture most of the info and use the PCs as predictors for the target variable
- Remember to delete the original variables, otherwise the model will be rank-deficient
- The PCs are mutually uncorrelated -> collinearity is not an issue
- Purpose: optimize the bias-variance trade-off and improve prediction accuracy
More sophisticated methods that can be mentioned for dealing with missing values
- Replace with mean, median, or mode - Use a model based on combos of other variables to fill in missing values (imputation)
If remaining data values look off
- Retain an entry if it is not entirely clear that it is an error; just make a remark about its potential abnormality
- If it is clearly an error and the variable is used as a predictor, remove the entry and comment on why
Principal components analysis (PCA): visualization
- A scree plot shows the PVE of each of the M PCs
- The PVE is a decreasing function -> look for the point where the PVE has dropped off to a sufficiently low level (the elbow)
- Add the PVEs of the retained PCs together to determine how much of the total variance they explain
Principal components analysis (PCA): selecting loadings
- The first PC loadings should capture as much information in the original dataset as possible -> maximize the sample variance of the 1st PC
- Subsequent PCs should also maximize the variance of the linear combo, but must also be orthogonal to (uncorrelated with) the previous PCs, and thus measure different aspects of the variables in the dataset
- Features must be centered (mean = 0) and scaled; otherwise those with a large variance on their scale will receive a larger loading even if the variable does not explain much of the underlying data
How to justify removing missing values from the dataset
- These observations are few enough - No evidence they are systematically generated - Their removal is unlikely to introduce bias or impair the effectiveness of the model
Principal components analysis (PCA): basic terminology and purpose
- Transforms a large # of possibly correlated variables into a smaller, much more manageable set of variables that capture most of the info (in terms of variability) in the original high-dimensional dataset - Principal components (PCs): variables that are normalized linear combos of the existing variables using loadings - Simplify the dataset, reduce dimension, useful for feature generation - Can only be used on numeric variables, so any categorical variables must be binarized
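A sketch with prcomp() (dataset is illustrative):
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)  # center and scale the features
pca$rotation            # PC loadings
summary(pca)            # proportion of variance explained by each PC
plot(pca, type = "l")   # scree-style plot of the PC variances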
Variable importance plot
- Used to understand random forests - Ranks the predictors according to their contribution to the model, with the importance score for a particular predictor computed by totaling the drop in node impurity due to that predictor, averaged over all the trees in the random forest - Variables high on list -> larger improvements in splitting criterion when used as splits
Basic coding decision trees in R
- method = "anova" indicates regression, "class" indicates classification
- control is used to specify a list of parameters controlling when the partitioning stops, i.e. the complexity of the tree to be built
- minsplit: minimum # of observations that must exist in a node in order for a split to be attempted; lower value -> more complex tree
- minbucket: minimum # of observations in any terminal node; lower value -> more complex tree
- cp: complexity parameter, a value between 0 and 1 specifying the minimum amount of impurity reduction required for a split to be made; default value = 0.01; higher cp -> fewer splits and a less complex tree; setting it to 0 leads to the most complex tree while setting it to 1 prohibits any splits
- maxdepth: maximum # of branches from the tree's root node to the furthest terminal node; higher value -> more complex tree
- xval: # of folds used when doing cross-validation
- parms: describes the parameters that guide how the splits are performed (gini or information); limited to categorical target variables
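A sketch showing these control arguments (the values are illustrative, not recommendations):
library(rpart)
set.seed(1)
tree <- rpart(Species ~ ., data = iris, method = "class",
              parms = list(split = "gini"),
              control = rpart.control(minsplit = 20, minbucket = 7, cp = 0.01,
                                      maxdepth = 10, xval = 10))
tree$cptable   # complexity parameter table produced by cross-validation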
Akaike information criterion (AIC)
AIC = -2l + 2p, where l = loglikelihood of the model on the training set and p = # of parameters in the model. The higher the value of l, the better the fit. p acts as a penalty term for an overfitted model. We want the model with the smallest AIC.
Bayesian Information Criterion (BIC)
BIC = -2l + ln(n) * p, where n = size of the training set. We want the model with the smallest BIC. For more complex models, BIC is stricter than AIC and so represents a more conservative approach to feature selection.
Pros/cons of random forests as compared to single trees
- Pros: more robust, substantial variance reduction (especially when B is large), produces more precise predictions
- Cons: loss of interpretability, greater computational burden
2 types of feature selection
1. Backward selection: start with the model with all features; in each step, drop the feature whose removal yields the greatest improvement
2. Forward selection: start with the simplest model and, in each step, add the feature that results in the greatest improvement
Performance metrics that result from a confusion matrix
1. Classification error rate: proportion of misclassifications = (FN + FP)/n
2. Sensitivity/Recall: relative frequency of correctly predicting an event of interest when the event does take place, i.e. how sensitive the classifier is to identifying positive cases = TP/(TP + FN)
3. Specificity: relative frequency of correctly predicting a non-event when there is indeed no event = TN/(TN + FP)
Higher sensitivity and specificity -> better classifier; ideally both are close to 1
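A minimal sketch of the calculations (the counts are made up for illustration):
TP <- 40; FP <- 10; FN <- 20; TN <- 130
n <- TP + FP + FN + TN
(FN + FP) / n    # classification error rate
TP / (TP + FN)   # sensitivity / recall
TN / (TN + FP)   # specificity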
Choices of target distributions for GLMs
1. Continuous, positive data: choose a model that captures the right-skewness of the target variable, such as gamma or inverse Gaussian 2. Binary data: binomial 3. Count data: Poisson, note that if the variance exceeds the mean (overdispersion) we can tweak Poisson to account for this
Exam note on partitioning data into test and training sets
1. Describe the design/results of the training/test split, i.e. the proportion (or number) of observations that go into each 2. Point out that you are using stratified sampling to ensure that both sets contain similar and representative values of the response variable 3. Explicitly check that the 2 sets of target variable values are comparable by looking at some summary stats, such as the mean
Measures of impurity for classification decision trees
1. Entropy
2. Gini index
3. Classification error
For all measures, larger values -> the more impure/heterogeneous the node is
3 part structure for interpreting the results of a GLM
1. Interpret the precise values of the estimated coefficients, i.e. every unit increase in a continuous predictor is associated with a multiplicative change of e^B in the expected value of the target variable, holding everything else fixed 2. Comment on whether the sign of the estimated coefficient conforms to intuition 3. Relate the findings to the "big picture" and explain how they can help the clients make a better decision
How to control the complexity of a decision tree
1. Pre-specify an impurity reduction threshold, starting from the root node and making a split only when the reduction in impurity exceeds this threshold; the downside is that this ignores the fact that a not-so-good split early in the tree could be followed by a good split later in the tree
2. Pruning: grow a large tree and retroactively remove its splits that don't fulfill the impurity reduction threshold; this reduces tree complexity, prevents overfitting, and improves prediction accuracy
How splits are determined in decision trees
1. Quantitative target variables: use the sample mean of the target variable as the predicted value, called a regression tree 2. Qualitative target variables: use the most common class of the target variable in that group as the predicted class, called a classification tree
Linear model assumptions
1. Random error terms follow a zero-mean normal distribution with a common variance (homoskedastic) 2. As a result of #1, Y is normally distributed with the same constant variance
Residual graphical analysis
1. Residuals vs. Fitted plot: residuals should display no discernible patterns and should be spread symmetrically in either direction, otherwise may indicate heteroskedasticity 2. Normal Q-Q plot: points are expected to lie closely on a 45 degree straight line passing through the origin, this indicates the random errors are normally distributed
Types of regularization
1. Ridge regression: f(B) is the sum of the squares of the slope coefficients; coefficients can be shrunk but not all the way to 0
2. Lasso regression: f(B) is the sum of the absolute values of the slope coefficients; coefficients can be forced to 0, producing simpler and more interpretable models with fewer features (sparse); in general lasso is preferred unless the fitted model is significantly inferior to ridge based on prediction accuracy
3. Elastic net regression: captures both ridge and lasso using a mixing coefficient (a) to determine the relative weighting; a = 0 -> ridge, a = 1 -> lasso
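A sketch of how the mixing coefficient is set via the alpha argument of glmnet (data is illustrative):
library(glmnet)
X <- model.matrix(mpg ~ ., data = mtcars)[, -1]
y <- mtcars$mpg
ridge   <- glmnet(X, y, alpha = 0)     # ridge: squared-coefficient penalty
lasso   <- glmnet(X, y, alpha = 1)     # lasso: absolute-value penalty, can zero out coefficients
elastic <- glmnet(X, y, alpha = 0.5)   # elastic net: mixture of the two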
Performance metrics on a linear model (non-penalized)
1. Test RMSE - most commonly used since most interpretable and gives a sense of the typical magnitude of the prediction error 2. R^2 3. Test loglikelihood All three are functions of the sum of squared prediction errors on the test set
Characteristics of GLM models
1. The target variable is no longer confined to the class of normal random variables; it just has to be a member of the exponential family of distributions
2. Instead of equating the mean of the target variable to the linear combo of predictors, a GLM sets a function of the target mean to be linearly related to the predictors, allowing us to analyze situations where the effects on the mean are more complex than merely additive
Link function: links the mean of the target (mu) to a linear combo of the predictors (the linear predictor, eta); can be any monotonic function
Training/validation/test split
1. Training set (70%): where you develop your predictive model to estimate the signal function and if needed, the model parameters; same training set should be used on all candidate models 2. Validation set (20%): assess the predictive performance of the predictive model using a certain performance metric; use to determine which candidate model performs the best 3. Test set (10%): evaluate the predictive performance of the chosen model and obtain an independent measure of its prediction accuracy
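A sketch of a stratified split with caret (the dataset and the 70/20/10 proportions are illustrative):
library(caret)
set.seed(1)
train_idx <- createDataPartition(iris$Sepal.Length, p = 0.7, list = FALSE)  # stratified on the target
train <- iris[train_idx, ]
rest  <- iris[-train_idx, ]
val_idx <- createDataPartition(rest$Sepal.Length, p = 2/3, list = FALSE)    # 2/3 of the rest -> 20% overall
validation <- rest[val_idx, ]
test       <- rest[-val_idx, ]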
Weights and offsets in GLMS
1. Weights: used when the observations of the target variable are averaged across members of the same group; the variance of each observation is inversely related to the group size, which serves as the weight for that observation
2. Offsets: used when the more members in a group, the more events we expect to observe; the group size (exposure) enters as an additional predictor with a fixed coefficient of 1 to account for the different means of different observations; note the offset term must be on the same scale as the linear predictor
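A sketch of the weights and offset arguments of glm() (the data here are simulated purely for illustration):
set.seed(1)
policy <- data.frame(claims   = rpois(100, lambda = 2) + 1,
                     age_band = factor(sample(c("young", "old"), 100, replace = TRUE)),
                     members  = sample(10:50, 100, replace = TRUE))
# Offset: log(group size) enters the linear predictor with a coefficient fixed at 1
mod.offset <- glm(claims ~ age_band, family = poisson(link = "log"),
                  offset = log(members), data = policy)
# Weights: when the target is an average over a group, weight each row by the group size
policy$avg_claims <- policy$claims / policy$members
mod.weight <- glm(avg_claims ~ age_band, family = Gamma(link = "log"),
                  weights = members, data = policy)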
Choices of link functions for GLM and interpretations
1. log link: use if the target mean is known to be positive. Interpretation: a unit change in a continuous predictor with coefficient B is associated with a multiplicative change in the target mean by a factor of e^B, holding all other variables fixed
2. logit link: use if the target mean is between 0 and 1 (binary target variable). Interpretation: similar to the log link, but applied to the odds, where odds = e^eta
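A sketch of specifying these links in glm() (the data and predictors are illustrative):
mod.log <- glm(mpg ~ wt + hp, family = Gamma(link = "log"), data = mtcars)
exp(coef(mod.log))     # multiplicative change in the target mean per unit increase
dat <- mtcars
dat$high_mpg <- as.integer(dat$mpg > median(dat$mpg))
mod.logit <- glm(high_mpg ~ wt, family = binomial(link = "logit"), data = dat)
exp(coef(mod.logit))   # multiplicative change in the odds per unit increase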
Regularization
A form of feature selection that shrinks the magnitude of the estimated coefficients of features with limited predictive power towards zero, reducing their expected effect on the target variable.
Minimize: RSS + lambda * f(B), where
lambda = regularization parameter that controls the extent of regularization and quantifies our preference for simple models
f(B) = regularization penalty that captures the size of the regression coefficients
Bias-variance tradeoff
A more complex model has a lower bias but a higher variance than a less flexible model. The expected test prediction error follows a U-shape: at first the bias drops faster than the variance rises, but as the model becomes more complex the drop in bias is outweighed by the rise in variance. The goal of a good predictive model is to find the minimum point of the U-shaped curve.
Unsupervised learning methods
A target variable is absent, and we are interested in extracting relationships and structures between different variables in the data. The 2 methods covered on PA are principal components analysis and cluster analysis.
Aesthetics vs. geoms in ggplot
Aesthetics determine what relationships we want to see in the plot, while geoms determine how we want to see those relationships. We use the geom functions to set a property that affects how a plot looks.
Arguments for AIC vs. BIC
BIC: tends to favor a model with fewer features and so a more conservative approach to feature selection, makes sense since our goal is to identify key factors that relate to the target variable AIC: retains more features and so the client will be given more factors to consider, risk of missing important features will be smaller
Exam note: decision trees
Be sure to describe the rationale behind the choice of the control parameters, such as minbucket, cp, parms, maxdepth Once a final tree model has been reached, comment on how many splits the tree has, which variables appear to be the more informative/influential as shown in the first few splits, and whether that makes sense
How to color data points blue in a ggplot2
Do not set the color inside aes() in the initial ggplot() call; that simply maps every point to a constant named "blue" (drawn in the default color, with a legend) rather than actually coloring the points blue. Instead, set color in the geom when creating the scatterplot:
ggplot(data = mtcars, aes(x = wt, y = mpg)) + geom_point(color = "blue")
Important first step in data exploration
Explore the distributional characteristics of the target variable -> may help you see useful transformations and determine which analytic techniques should be applied
Ways to compare the predictive performance of 2 decision trees
First, use the predict function on the test set. Then:
1. Compare confusion matrices
2. Compare AUC
Use 2 separate lines for different colors, or a single line?
For 2 lines (one per color group) - define the color aesthetic in the ggplot() call, so all geoms inherit it
For 1 line - define the color aesthetic only in geom_point()
Detecting interaction between variables in a decision tree
For example: the left branch of the root node is followed by a split using X2 while the same split is not found in the right branch -> the effect of X2 on the target depends on the value of the variable used at the root split, indicating an interaction
Arguments for forward vs. backward selection
Forward: more likely to produce a final model with fewer features, better aligns with the goal of identifying key factors affecting the target variable
Deviance of a GLM
A goodness-of-fit measure; measures the extent to which the GLM departs from the most elaborate GLM, known as the saturated model, which allows for a perfect fit.
D = 2(lsat - l), where lsat = loglikelihood of the saturated model and l = loglikelihood of the given GLM.
Lower deviance -> the closer the GLM is to the model with a perfect fit
Solution to Normal Q-Q plot deviating from 45 degree line
If the points do not line up with the line at the extreme tails, use a response distribution with a fatter tail than the normal distribution
Solution to discernable shape of residuals in the Residuals vs. Fitted plot
Incorporate polynomial terms if there are nonlinear patterns
Determining the best cutoff for the confusion matrix
Involves a trade-off between having high sensitivity and high specificity
- Cutoff set to 1 -> sensitivity = 0, specificity = 1
- As the cutoff decreases, sensitivity increases while specificity decreases
- Cutoff set to 0 -> sensitivity = 1, specificity = 0
Common technique for dealing with positive-valued skewed data
Log transformation
Canonical link functions
Normal: identity, g(mu) = mu
Binomial: logit, g(pi) = ln(pi/(1-pi))
Poisson: natural log, g(mu) = ln(mu)
Gamma: inverse, g(mu) = 1/mu
Inverse Gaussian: squared inverse, g(mu) = 1/mu^2
Definition of bias
Bias = expected predicted value - true value of the signal function. In general, the more complex a model, the lower the bias (in magnitude) due to its greater ability to capture the underlying signal
Definition of variance
Quantifies the amount by which the predicted value would change if we estimated it using a different training set In general, the more complex a model, the higher the variance because a small change in the training data can cause massive changes in the predictive model
Regression vs. classification problems
Regression: supervised learning problems with a numeric target variable Classification: problems with a categorical target variable, this model is called a classifier
Important step after stepwise selection
Run the final model and check that everything looks right - make comments on which features are included, whether this matches the observations from the data exploration section, whether coefficients are statistically significant
One-standard-error rule
Select the smallest tree whose cross-validation error is within 1 standard error from the minimum cross validation error, limit shown as the dotted line on the complexity parameter plot
What to do when there are missing values in the dataset
Simply remove the observations that contain missing values since there are more observations than columns.
1. Use the is.na() function:
actuary.1 <- actuary[!is.na(actuary$Q1), ]
2. Use the complete.cases() function:
cc <- complete.cases(actuary)
actuary.2 <- actuary[cc, ]
3. Use the na.omit() function:
actuary.3 <- na.omit(actuary)
How to select the predictor to split and the corresponding cutoff level at each step
Split the target observations based on the value of an "influential" predictor, choosing the predictor and cutoff that result in the greatest reduction in the variability of the target variable
Exam note: data cleaning
Summarize the total effects - how many observations and variables are left, any significant or somewhat controversial changes to the data
F-statistic
Tests whether several coefficients are zero, with the alternative that at least one of the coefficients is non-zero However, the test does not indicate which variables are actually predictive
Feature generation: definition and purpose
The process of developing new features based on existing variables in the data, which can then serve as final inputs in the model Purpose is to enhance the flexibility of the model and lower bias of the predictions at the expense of an increase in variance These are commonly used in constructing GLMs
Feature selection: definition and purpose
The process of dropping features with limited predictive power as an attempt to control model complexity and prevent overfitting Used in both GLMs and decision trees Especially relevant for categorical predictors, examples: 1. Combine similar categories: if a target variable behaves similarly (mean, median) in 2 categories 2. Combine sparse categories with others: if there are few observations in the category and its target variable behaves similarly to another category
Supervised learning methods
There is a target variable "supervising" our analysis, and our goal is to understand the relationship between the target and the predictors and make accurate predictions for the target based on the predictors. The 2 methods covered on PA are GLMs and decision trees.
When to use split box plots
To visualize the conditional distribution of a numeric variable given different levels of the categorical variable
Cross-validation
Useful when data set is too small to split into training/validation/test sets 1. Set aside test data 2. Split remaining data into k folds of ~equal size (common choices are k = 5 or 10) 3. Predictive model is fit k times. Each of the k-folds serves as the validation set and the remaining k-1 folds serve as the training set.
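A sketch of k-fold cross-validation with caret (k and the model choice are illustrative):
library(caret)
set.seed(1)
ctrl <- trainControl(method = "cv", number = 10)   # 10-fold cross-validation
cv.model <- train(mpg ~ wt + hp, data = mtcars, method = "lm", trControl = ctrl)
cv.model$results   # cross-validated RMSE and R^2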
Expected test error components
Variance of the prediction + squared bias + variance of the error, where the first 2 terms are reducible and the last term is irreducible
Measure of impurity for regression decision trees
Variability is measured by the RSS. Lower RSS -> the more homogeneous the target observations
Purpose of the gridExtra package
We can arrange completely independent ggplots, all of which correspond to the entire sample, side-by-side in a single figure
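For example (the two plots are illustrative):
library(ggplot2)
library(gridExtra)
p1 <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
p2 <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) + geom_boxplot()
grid.arrange(p1, p2, ncol = 2)   # arrange the independent plots side by side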
More on the regularization parameter (lambda)
When lambda = 0 -> coefficient estimates = OLS estimates
As lambda increases -> the effect of regularization becomes more severe, decreasing the variance but increasing the bias of the coefficient estimates
As lambda approaches infinity -> all slope coefficients approach 0 and the model becomes intercept-only
lambda is determined by minimizing the cross-validation error
Hierarchical principle
Whenever a higher-order variable is retained in the model due to statistical significance, so are the lower-order variables, even if they are insignificant; i.e. if the interaction between TV and radio is kept, then the separate terms for TV and radio should be kept as well
Equation for the relationship between the target variable (Y) and the set of predictors (X)
Y = f(X) + e where f is the signal function and e is the noise (random error that cannot be captured by the systematic component of the model)
Operator for testing whether a is an element of the vector c(a, b, c)
a %in% c(a, b, c)
Coding for calculating the mean of several columns
actuary.n$S <- apply(actuary.n[ , c("Q1","Q2", "Q3")], 1, mean) where MARGIN = 1 indicates the function will act on rows, and MARGIN = 2 indicates the function will act on columns
How to concatenate strings
actuary.n$genderSmoke <- paste0(actuary.n$gender, actuary.n$smoke)
paste0() joins with no separator; to insert a space between the strings, use paste() with sep = " " instead
Sort data set in ascending/descending order
actuary.n1 <- actuary.n[order(actuary.n$S), ] actuary.n2 <- actuary.n[order(actuary.n$S, decreasing = TRUE), ]
Sort data by several variables: gender, then by average, both in descending order
actuary.n3 <- actuary.n[order(actuary.n$gender, actuary.n$S, decreasing = TRUE), ]
Coding to return the person with the highest/lowest average
actuary.n[which.max(actuary.n$S), "name"] actuary.n[which.min(actuary.n$S), "name"]
Binarization of a categorical variable
binarizer <- dummyVars(~ Gender + Student + Married + Ethnicity, data = Credit, fullRank = FALSE)
binarized_vars <- data.frame(predict(binarizer, newdata = Credit))
Note that fullRank = TRUE leaves out the baseline levels of the categorical predictors and so is appropriate for regression
Operator for at least one of cond1 and cond2 being true
cond1 | cond2
Elements of the cptable of a decision tree
cp: complexity parameter
nsplit: # of splits; the # of terminal nodes is this + 1
rel error: 1 - R^2 = RSS/TSS, a scaled version of the training error
xerror: the cross-validation error; in the worked example with 6 observations, each observation is left out in turn, a decision tree is fitted to the remaining 5 observations, and a predicted value is generated for the held-out observation; the 6 prediction errors are squared, summed, and divided by the TSS
xstd: standard error of the cross-validation error for each fitted decision tree
Syntax in R for creating an interaction term in a linear model
model.3 <- lm(sales ~ TV * radio, data = ad) OR model.3 <- lm(sales ~ TV + radio + TV:radio, data = ad)
Creating a RMSE function
rmse <- function(observed, predicted) { sqrt(mean((observed - predicted)^2)) }
Partial correlation
The correlation between 2 variables after controlling for the effects of the other predictors.
Calculated by regressing y on all predictors other than x, then regressing x on those same predictors, and finding the correlation between the two sets of residuals:
m1 <- lm(sales ~ TV + radio, data = ad)
m2 <- lm(newspaper ~ TV + radio, data = ad)
cor(m1$residuals, m2$residuals)