Predictive Modeling
Scheme for finalizing model type: (3 steps)
1. Start with the least interpretable and most flexible models 2. Investigate simpler models that are easier to interpret 3. Use the simplest model that reasonably approximates the performance of the more complex models
k-Fold cross validation
1. Samples are partitioned into k sets of similar size 2. The model is fit on k-1 partitions 3. The held-out samples are used to estimate model performance 4. Repeat with a different subset held out each time 5. The estimates of performance are summarized
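A minimal sketch of the procedure, assuming Python with scikit-learn and synthetic data (neither is part of these notes):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=0)

# Partition the samples into k = 10 folds of similar size.
cv = KFold(n_splits=10, shuffle=True, random_state=0)

# Fit on k-1 folds, score the held-out fold, repeat, then summarize.
scores = cross_val_score(LinearRegression(), X, y,
                         cv=cv, scoring="neg_root_mean_squared_error")
print(f"RMSE: {-scores.mean():.2f} +/- {scores.std():.2f}")
```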
Sample, data point, observation, instance
A single, independent unit of data (e.g., a customer, patient, or compound). Note: "sample" can also refer to a subset of data points.
As we push towards higher accuracy, models become ________ _______ and their _________ becomes more difficult
more complex; interpretability
A poor choice of a tuning parameter can...
Result in over-fitting
r2
The square of the correlation coefficient (r) between the observed and predicted values; R2 = r2
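A quick numpy illustration with made-up observed/predicted vectors (assumed for illustration only):

```python
import numpy as np

observed  = np.array([3.1, 4.0, 5.2, 6.1, 7.3])
predicted = np.array([2.9, 4.2, 5.0, 6.4, 7.0])

r = np.corrcoef(observed, predicted)[0, 1]  # correlation coefficient
print(f"r = {r:.3f}, R^2 = r^2 = {r**2:.3f}")
```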
y
a vector of all n outcome values
If highly right skewed, skewness is
> 1
How are models validated? (2 ways)
1. "Goodness of fit": Model statistic that compares difference between predicted/expected values 2. Look at performance on testing set not used for model building (training).
Reasons models fail (5 reasons)
1. Complex variables (e.g., human behavior) 2. Inadequate validation process 3. Inadequate pre-processing 4. Over-fitting the model 5. Exploring only a few models
Ingredients of an effective predictive model (3)
1. Intuition and deep knowledge of problem context 2. Begin process with relevant data 3. Versatile computational tools
Two techniques for determining the effectiveness of the model:
1. Quantitative assessments of model statistics (e.g., RMSE) 2. Visualizations of the model
Before building a model (2 things)
1. Understand predictors/response variables for data set 2. Perform pre-processing as necessary to optimize each model's predictive performance.
To get reliable/trustworthy models: (3)
1. Understand the data and objective of the modeling 2. Pre-process and split the data 3. Build, evaluate, and select models
Dummy variables
A categorical variable which has been converted into a series of binary variables.
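A minimal sketch using pandas' get_dummies (the library choice and data are assumptions):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})

# Each level of the categorical predictor becomes a binary column.
dummies = pd.get_dummies(df, columns=["color"])
print(dummies)
```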
If highly left skewed, skewness is
< -1
E[.]
Expected value of .
Tuning parameter
A parameter which cannot be directly estimated from the data.
Zero variance predictor
A predictor with a single unique value
If n > P...
All predictive models can handle this scenario
Data Splitting
Allocating data to a training set (used to build models) and a testing set (used to evaluate model performance).
When the goal is to choose between models, ____ is the preferred resampling method.
Bootstrap
Partial Least Squares (PLS)
Can be used for correlated predictors
Binning predictors
Categorize a numeric predictor into two or more groups prior to analysis
Principal Component Analysis (PCA)
Commonly used data reduction technique
Regression models predict a _____ outcome
Continuous
Continuous
Data that have natural numeric scales
Predictors/independent variables, attributes, descriptors
Data used as input for the prediction equation.
Training Set
Data used to develop models
Categorical, nominal attribute, discrete
Data values with no natural numeric scale.
Process for choosing a tuning parameter (3)
1. Define a set of candidate values 2. Estimate model performance for the candidate values 3. Choose the optimal value (using visualizations and/or statistics)
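A sketch of this process, assuming scikit-learn's GridSearchCV, a KNN model, and synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=0)

candidates = {"n_neighbors": [1, 3, 5, 7, 9, 11]}  # step 1: candidate values
search = GridSearchCV(KNeighborsClassifier(), candidates, cv=10)
search.fit(X, y)                                    # step 2: estimate performance
print(search.best_params_)                          # step 3: choose optimal value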
Test/Validation Set
Data used to evaluate the performance of a final set of candidate models.
PCA works by...
Finding linear combinations of the predictors (PCs), which capture the most possible variance.
Non-Linear Classification Models (6)
1. Flexible Discriminant Analysis 2. Neural Networks 3. SVMs 4. KNNs 5. Naive Bayes 6. Nearest Shrunken Centroids
Resampling Technique
Methods used to estimate model performance
More complex models have very _____ which leads to ______.
high variance; over-fitting
Root mean squared error (RMSE)
How far, on average, the residuals are from zero. Square root of the mean squared error.
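A minimal numpy illustration with made-up observed/predicted values:

```python
import numpy as np

observed  = np.array([3.0, 5.0, 2.5, 7.0])
predicted = np.array([2.5, 5.0, 4.0, 8.0])

residuals = observed - predicted
mse  = np.mean(residuals ** 2)   # mean squared error
rmse = np.sqrt(mse)              # same units as the outcome
print(rmse)
```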
Multiple Linear Regression models cannot handle.....
Missing predictor information.
Strategies for missing values (2)
1. Remove predictors 2. Impute values (e.g., KNN or mean imputation)
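A sketch of both imputation strategies, assuming scikit-learn's imputers and a tiny made-up matrix:

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0], [4.0, np.nan]])

print(SimpleImputer(strategy="mean").fit_transform(X))  # mean imputation
print(KNNImputer(n_neighbors=2).fit_transform(X))       # KNN imputation
```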
Linear models are based on ______ combinations of the predictors
Linear
Linear Classification Models (4)
1. Linear Discriminant Analysis 2. Quadratic Discriminant Analysis 3. Regularized Discriminant Analysis 4. Partial Least Squares Discriminant Analysis
Linear Regression Models (3)
1. Linear regression 2. Partial Least Squares 3. L1 regularization
Common transformations to remove skewness (4)
1. Log 2. Square root 3. Inverse 4. Box-Cox
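A sketch of the Box-Cox transformation, assuming scipy and simulated right-skewed data:

```python
import numpy as np
from scipy.stats import boxcox, skew

rng = np.random.default_rng(0)
x = rng.lognormal(size=500)   # right-skewed, strictly positive data

# Box-Cox estimates the transformation parameter lambda by maximum likelihood.
x_bc, lam = boxcox(x)
print(skew(x), skew(x_bc), lam)  # skewness should be near zero after
```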
Downside of center/scaling
Loss of interpretability since data are no longer in the original units.
Problems with manual binning of continuous data (3)
1. Loss of performance in the model 2. Loss of precision in the predictions 3. A high rate of false positives
Transformations to resolve outliers
First make sure the values are scientifically valid and are not data recording errors; then apply the spatial sign transformation.
If n < P
Multiple linear regression and linear discriminant analysis cannot be used directly. Recursive partitioning and K-nearest neighbors can be used directly.
Non-Linear Regression Models (4)
1. Neural Networks 2. Multivariate Adaptive Regression Splines (MARS) 3. Support Vector Machines (SVMs) 4. K-Nearest Neighbors (KNNs)
Non-Linear models are based on _______ combinations of the predictors
Non-linear
Methods for splitting data (3)
1. Nonrandom approaches (the data have a natural break) 2. Random approaches 3. Stratified random sampling (split within subgroups)
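A sketch of stratified random sampling, assuming scikit-learn's train_test_split and synthetic class-imbalanced data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, weights=[0.8, 0.2], random_state=0)

# stratify=y preserves the class proportions in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
```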
n
Number of data points
P
Number of predictors
Multicollinearity (between-predictor correlations) is problematic because....
two or more predictors represent the same underlying information
Outcome, dependent variable, target, class, response
Outcome event or quantity being predicted.
Supervised data pre-processing
Outcome variable is utilized to pre-process the data
Models that have been ____ usually have _____ accuracy when predicting a new sample.
over-fit; poor
In a KNN model, too few neighbors will ____ the model and too many neighbors will ___ the model.
over-fit; under-fit
Mathematically, the jth PC can be written as:
PC_j = (a_j1 × Predictor 1) + (a_j2 × Predictor 2) + ... + (a_jP × Predictor P), where P is the number of predictors and a_j1, a_j2, ..., a_jP are the component weights, which help us understand which predictors are most important to each PC.
What is predictive modeling?
Predicting future events given current observations (predictors). Geisser (1993): "process by which a model is chosen/created to best predict outcome." Kuhn & Johnson: "process of developing a math tool/model that generates an accurate prediction."
Near-zero variance predictor
A predictor with only a handful of unique values that occur with very low frequencies.
Pr[.] or p
Probability of event
Model building/training, parameter estimation
The process of using data to determine the values of the model equation's parameters.
Most common method for characterizing a regression model's predictive capabilities?
RMSE
Bootstrap
1. A random sample of the data points is taken (with replacement) 2. Samples not selected are the "out-of-bag" samples 3. A model is built on the selected samples 4. The model predicts the out-of-bag samples
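A sketch of a single bootstrap iteration in numpy (model fitting omitted; the indices are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
indices = np.arange(n)

boot = rng.choice(indices, size=n, replace=True)  # sampled with replacement
out_of_bag = np.setdiff1d(indices, boot)          # never selected this round
print(sorted(set(boot)), out_of_bag)
```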
Spearman's rank correlation
Ranks of observed and predicted outcomes are obtained and the correlation coefficient between these ranks is calculated.
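A minimal illustration, assuming scipy and made-up vectors:

```python
import numpy as np
from scipy.stats import spearmanr

observed  = np.array([2.0, 4.5, 3.1, 8.0, 6.2])
predicted = np.array([2.2, 4.0, 3.5, 7.1, 6.8])

rho, _ = spearmanr(observed, predicted)  # correlation of the ranks
print(rho)
```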
Data reduction
Reduce the data by generating a smaller set of predictors that seek to capture a majority of the information in the original variables
How to center data?
The average predictor value is subtracted from all the values. The predictor will have a mean of zero.
Unsupervised data pre-processing
The outcome variable is not considered by pre-processing techniques
Data pre-processing
The process of adding, deleting, or transforming a data set
R2 is (2 things)
1. The proportion of the variation in the data explained by the model 2. A measure of correlation, not accuracy
Before using PCA, we must... (3 things)
1. Transform skewed predictors 2. Center/scale the predictors 3. Decide how many components to retain
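A sketch of these steps (minus the skewness transformation), assuming scikit-learn's StandardScaler and PCA on the iris data:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Center/scale first, then extract two uncorrelated components.
pipe = make_pipeline(StandardScaler(), PCA(n_components=2))
scores = pipe.fit_transform(X)
print(pipe.named_steps["pca"].explained_variance_ratio_)
```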
Models resistant to outliers
1. Tree-based classification (not covered in this class) 2. Support vector machines
Recursive data partitioning
Unaffected by predictors of different scales; can be used with missing data; the partitioning structure is less stable when predictors are correlated
The primary advantage of PCA is that the PCs are...
Uncorrelated
How to scale data?
Value of each predictor variable is divided by its standard deviation. All data will have a common standard deviation of one.
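A minimal numpy illustration of centering and scaling together:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
z = (x - x.mean()) / x.std()   # subtract the mean, divide by the std. dev.
print(z.mean(), z.std())       # mean 0, standard deviation 1
```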
xi
a collection (vector) of the P predictors for the ith data point, i = 1....n
X
a matrix of P predictors for all data points; this matrix has n rows and P columns
b
an estimated model coefficient based on a sample of data points
beta
an unknown or theoretical model coefficient
xj bar
average or sample mean of n data points for the jth predictor. j = 1....P
If moderately left skewed, skewness is
between -1 and -1/2
If moderately right skewed, skewness is
between 1/2 and 1
If roughly symmetric, skewness =
between -1/2 and 1/2
R2 is the...
coefficient of determination
f(.)
function of .; g(.) and h(.) also represent functions
If left skewed, skewness is
negative
C
number of classes for a categorical outcome
If right skewed, skewness is
positive
yi hat
predicted outcome of the ith data point, i = 1....n
pl
probability of the lth event
Resampling method for when sample size is small...
repeated 10-fold cross validation
Resampling method for large sample sizes....
simple 10-fold cross validation
Σ with i = 1 below and n above
summation operator over the index i, from 1 to n
Mean squared error (MSE)
the average of the squared differences between the forecasted and observed values
y bar
the average or sample mean of the n observed values of the outcome variable
yi
the ith observed value of the outcome, i = 1....n
Feature selection
the process of determining the minimum set of relevant predictors needed by the model.
X'
the transpose of X; this matrix has P rows and n columns
Cl
the value of the lth class level
xij
value of the jth predictor for the ith data point, i = 1....n and j = 1....P