Predictive Modeling

Scheme for finalizing model type: (3 steps)

1. Start with the least interpretable and most flexible models
2. Investigate simpler models that are easier to interpret
3. Use the simplest model that reasonably approximates the performance of the more complex models

k-Fold cross validation

1. Samples are partitioned into k sets of similar size
2. The model is fit on k-1 partitions
3. The held-out samples are used to estimate model performance
4. Repeat with a different subset held out each time
5. The k estimates of performance are summarized
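
A minimal sketch of 10-fold cross-validation in Python (the scikit-learn KFold splitter, the linear regression model, and the simulated data are illustrative choices, not part of the card):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Toy data: 100 samples, 5 predictors
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=100)

kf = KFold(n_splits=10, shuffle=True, random_state=0)
fold_rmse = []
for train_idx, test_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])  # fit on k-1 partitions
    preds = model.predict(X[test_idx])                          # predict the held-out partition
    fold_rmse.append(np.sqrt(mean_squared_error(y[test_idx], preds)))

# Summarize the k estimates of performance
print(f"RMSE: {np.mean(fold_rmse):.3f} +/- {np.std(fold_rmse):.3f}")
```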

Sample, data point, observation, instance

A single, independent unit of data (e.g., a customer, patient, or compound). Note: "sample" can also refer to a subset of data points.

As we push towards higher accuracy, models become ________ _______ and their _________ becomes more difficult

more complex; interpretation

A poor choice of a tuning parameter can...

Result in over-fitting

r2

The square of the correlation coefficient (r) between the observed and predicted values; R2 = r2

y

a vector of the n outcome values

If highly right skewed, skewness is

> 1

How are models validated? (2 ways)

1. "Goodness of fit": Model statistic that compares difference between predicted/expected values 2. Look at performance on testing set not used for model building (training).

Reasons models fail (5 reasons)

1. Complex variables (e.g., human behavior)
2. Inadequate validation process
3. Inadequate pre-processing
4. Over-fitting the model
5. Exploring only a few models

Ingredients of an effective predictive model (3)

1. Intuition and deep knowledge of the problem context
2. Beginning the process with relevant data
3. Versatile computational tools

Two techniques for determining the effectiveness of the model:

1. Quantitative assessments of statistics (e.g., RMSE)
2. Visualizations of the model

Before building a model (2 things)

1. Understand the predictors and response variables in the data set
2. Perform pre-processing as necessary to optimize each model's predictive performance

To get reliable/trustworthy models: (3)

1. Understand the data and the objective of the modeling
2. Pre-process and split the data
3. Build, evaluate, and select models

Dummy variables

A categorical variable which has been converted into a series of binary variables.
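
A small illustration in Python with pandas (the color column is made up; get_dummies is one common way to create dummy variables):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})

# Each level of the categorical predictor becomes its own 0/1 column
dummies = pd.get_dummies(df["color"], prefix="color", dtype=int)
print(dummies)

# drop_first=True would drop one level, which avoids perfect collinearity in linear models
```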

If highly left skewed, skewness is

< -1

E[.]

Expected value of .

Tuning parameter

A parameter which cannot be directly estimated from the data.

Zero variance predictor

A predictor with a single unique value

If n > P...

All predictive models handle this scenario

Data Splitting

Allocating data to a testing set and a training set to build models and evaluate performance of a model.

When the goal is to choose between models, ____ is the preferred resampling method.

Bootstrap

Partial Least Squares (PLS)

Can be used for correlated predictors

Binning predictors

Categorize a numeric predictor into two or more groups prior to analysis

Principal Component Analysis (PCA)

Commonly used data reduction technique

Regression models predict a _____ outcome

Continuous

Continuous

Data that have natural numeric scales

Predictors/independent variables, attributes, descriptors

Data used as input for the prediction equation.

Training Set

Data used to develop models

Categorical, nominal attribute, discrete

Data values with no scale.

Process for choosing a tuning parameter (3)

1. Define a set of candidate values
2. Estimate model performance across the candidate values
3. Choose the optimal value(s) using visualizations and statistics
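
A sketch of the same three steps as a grid search over the number of neighbors in a KNN regressor (the candidate values, the model, and the simulated data are illustrative):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# 1. Define a set of candidate values for the tuning parameter
grid = {"n_neighbors": [1, 3, 5, 7, 9, 15, 25]}

# 2. Estimate model performance for each candidate value via cross-validation
search = GridSearchCV(KNeighborsRegressor(), grid,
                      scoring="neg_root_mean_squared_error", cv=10)
search.fit(X, y)

# 3. Choose the optimal value from the resampled statistics
print(search.best_params_, -search.best_score_)
```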

Test/Validation Set

Data used to evaluate the performance of a final set of candidate models.

PCA works by...

Finding linear combinations of the predictors (PCs), which capture the most possible variance.

Non-Linear Classification Models (6)

Flexible Discriminant Analysis
Neural Networks
Support Vector Machines (SVMs)
K-Nearest Neighbors (KNNs)
Naive Bayes
Nearest Shrunken Centroids

Resampling Technique

Methods used to estimate model performance

More complex models have very _____ which leads to ______.

high variance; over-fitting

Root mean squared error (RMSE)

How far, on average, the residuals are from zero. Square root of the mean squared error.
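
A minimal sketch of the computation (the observed and predicted values are toy numbers):

```python
import numpy as np

observed  = np.array([3.0, 5.0, 2.5, 7.0])
predicted = np.array([2.8, 5.4, 2.9, 6.1])

residuals = observed - predicted
mse  = np.mean(residuals ** 2)   # mean squared error
rmse = np.sqrt(mse)              # root mean squared error
print(rmse)
```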

Multiple Linear Regression models cannot handle.....

Missing predictor information.

Strategies for missing values (2)

1. Remove the predictors with missing values
2. Impute the missing values (e.g., KNN or mean imputation)
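
A sketch of the imputation strategies using scikit-learn's SimpleImputer and KNNImputer (the toy matrix with missing entries is made up):

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [4.0, np.nan],
              [5.0, 6.0]])

# Mean imputation: replace each missing value with its column mean
print(SimpleImputer(strategy="mean").fit_transform(X))

# KNN imputation: replace missing values using the most similar complete rows
print(KNNImputer(n_neighbors=2).fit_transform(X))
```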

Linear models are based on ______ combinations of the predictors

Linear

Linear Classification Models (4)

Linear Discriminant Analysis
Quadratic Discriminant Analysis
Regularized Discriminant Analysis
Partial Least Squares Discriminant Analysis

Linear Regression Models (3)

Linear regression
Partial Least Squares
L1 Regularization

Common transformations to remove skewness (4)

Log
Square root
Inverse
Box-Cox
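
A sketch applying these transformations to a simulated right-skewed, strictly positive variable (scipy's boxcox estimates the Box-Cox lambda by maximum likelihood):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.lognormal(mean=0.0, sigma=1.0, size=500)  # strongly right skewed, positive

log_x  = np.log(x)            # log transform
sqrt_x = np.sqrt(x)           # square-root transform
inv_x  = 1.0 / x              # inverse transform
bc_x, lam = stats.boxcox(x)   # Box-Cox transform; lambda estimated from the data

print("skewness before:", stats.skew(x))
print("skewness after Box-Cox:", stats.skew(bc_x))
```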

Downside of center/scaling

Loss of interpretability since data are no longer in the original units.

Problems with manual binning of continuous data (3)

1. Loss of performance in the model
2. Loss of precision in the predictions
3. A high rate of false positives

Transformations to resolve outliers

1. Make sure the values are scientifically valid
2. Check that there are no data recording errors
3. Apply the spatial sign transformation

If n < P

Multiple linear regression and linear discriminant analysis cannot be used directly. Recursive partitioning and K-nearest neighbors can be used directly.

Non-Linear Regression Models (4)

Neural Networks
Multivariate Adaptive Regression Splines (MARS)
Support Vector Machines (SVMs)
K-Nearest Neighbors (KNNs)

Non-Linear models are based on _______ combinations of the predictors

Non-linear

Methods for splitting data (3)

1. Nonrandom approaches (when the data have a natural break)
2. Random approaches
3. Stratified random sampling (split based on subgroups)
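
A sketch of a stratified random split with scikit-learn (stratify=y keeps the class proportions similar in both sets; the imbalanced labels are simulated):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
y = rng.choice(["yes", "no"], size=100, p=[0.2, 0.8])  # imbalanced outcome

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Class proportions are preserved in the training and test sets
print(np.mean(y_train == "yes"), np.mean(y_test == "yes"))
```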

n

Number of data points

P

Number of predictors

Multicollinearity (between-predictor correlations) is problematic because....

Two or more predictors represent the same information

Outcome, dependent variable, target, class, response

Outcome event or quantity being predicted.

Supervised data pre-processing

Outcome variable is utilized to pre-process the data

Models that have been ____ usually have _____ accuracy when predicting a new sample.

over-fit; poor

In a KNN model, too few neighbors will ____ the model and too many neighbors will ___ the model.

over-fit; under-fit

Mathematically, the jth PC can be written as:

PCj = (aj1 × Predictor 1) + (aj2 × Predictor 2) + ... + (ajP × Predictor P), where P is the number of predictors and aj1, aj2, ..., ajP are the component weights, which help us understand which predictors are most important to each PC.
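
A numpy sketch that checks this relationship: the jth PC score is the dot product of the (centered and scaled) predictors with the jth row of weights, which scikit-learn exposes as pca.components_ (the simulated data and the use of scikit-learn are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 4))
Z = StandardScaler().fit_transform(X)  # center and scale first

pca = PCA().fit(Z)
scores = pca.transform(Z)

# The first PC score of the first sample equals a11*Predictor1 + ... + a1P*PredictorP
a1 = pca.components_[0]                # weights a11 ... a1P
print(scores[0, 0], Z[0] @ a1)         # the two values agree (up to floating point)
```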

What is predictive modeling?

Predicting future events given current observations (predictors)
Geisser (1993): the "process by which a model is chosen/created to best predict an outcome"
Kuhn & Johnson: the "process of developing a math tool/model that generates an accurate prediction"

Near-zero variance predictor

A predictor that has only a handful of unique values, which occur with very low frequencies.

Pr[.] or p

Probability of event

Model building/training, parameter estimation

Process of using data to determine values of model equations.

Most common method for characterizing a regression model's predictive capabilities?

RMSE

Bootstrap

1. A random sample of the data is taken with replacement
2. The samples not selected are the "out-of-bag" samples
3. The model is built on the selected samples
4. The model predicts the out-of-bag samples
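
A minimal sketch of a single bootstrap resample (sampling with replacement; the linear regression model and simulated data are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=100)

n = len(y)
boot_idx = rng.choice(n, size=n, replace=True)    # random sample taken with replacement
oob_idx  = np.setdiff1d(np.arange(n), boot_idx)   # "out-of-bag" samples not selected

model = LinearRegression().fit(X[boot_idx], y[boot_idx])  # model built on selected samples
rmse = np.sqrt(mean_squared_error(y[oob_idx], model.predict(X[oob_idx])))
print(rmse)                                       # performance on out-of-bag samples
```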

Spearman's rank correlation

Ranks of observed and predicted outcomes are obtained and the correlation coefficient between these ranks is calculated.
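
A sketch with scipy (spearmanr does the ranking internally; the observed and predicted values are toy numbers):

```python
import numpy as np
from scipy import stats

observed  = np.array([2.1, 4.5, 3.3, 8.0, 6.2])
predicted = np.array([2.5, 4.0, 3.0, 7.1, 6.9])

rho, p_value = stats.spearmanr(observed, predicted)
print(rho)

# Equivalent by definition: the Pearson correlation of the ranks
print(np.corrcoef(stats.rankdata(observed), stats.rankdata(predicted))[0, 1])
```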

Data reduction

Reduce the data by generating a smaller set of predictors that seek to capture a majority of the information in the original variables

How to center data?

The average predictor value is subtracted from all the values. The predictor will have a mean of zero.

Unsupervised data pre-processing

The outcome variable is not considered by pre-processing techniques

Data pre-processing

The process of adding, deleting, or transforming a data set

R2 is (2 things)

1. The proportion of the variation in the data explained by the model
2. A measure of correlation, not accuracy

Before using PCA, we must... (3 things)

1. Transform skewed predictors
2. Center and scale the predictors
3. Decide how many components to retain
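
A sketch of the three steps in order (a log transform stands in for the skewness transformation, and the 95% cumulative-variance cutoff is an illustrative choice):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
raw = rng.lognormal(size=(100, 6))  # skewed, positive predictors

# 1. Transform skewed predictors (log here; Box-Cox is another option)
X = np.log(raw)

# 2. Center and scale
Z = StandardScaler().fit_transform(X)

# 3. Decide how many components to retain (e.g., enough to explain ~95% of the variance)
pca = PCA().fit(Z)
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_keep = int(np.searchsorted(cum_var, 0.95) + 1)
print(n_keep, cum_var)
```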

Models resistant to outliers

Tree-based classification models (not covered in this class)
Support vector machines

Recursive data partitioning

Unaffected by predictors of different scales; can be used with missing data; less stable partitioning structure when predictors are correlated

The primary advantage of PCA is that the PCs are...

Uncorrelated

How to scale data?

Value of each predictor variable is divided by its standard deviation. All data will have a common standard deviation of one.
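
A sketch that centers and scales by hand and checks the result against scikit-learn's StandardScaler (the toy matrix is simulated):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
X = rng.normal(loc=10.0, scale=3.0, size=(50, 2))

# Center: subtract each column's mean; scale: divide by each column's standard deviation
Z_manual  = (X - X.mean(axis=0)) / X.std(axis=0)
Z_sklearn = StandardScaler().fit_transform(X)

print(np.allclose(Z_manual, Z_sklearn))             # True
print(Z_manual.mean(axis=0), Z_manual.std(axis=0))  # ~0 and 1 for each column
```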

xi

a collection (vector) of the P predictors for the ith data point, i = 1....n

X

a matrix of P predictors for all data points; this matrix has n rows and P columns

b

an estimated model coefficient based on a sample of data points

beta

an unknown or theoretical model coefficient

xj bar

average or sample mean of n data points for the jth predictor. j = 1....P

If moderately left skewed, skewness is

between -1 and -1/2

If moderately right skewed, skewness is

between 1/2 and 1

If roughly symmetric, skewness =

between -1/2 and 1/2

R2 is the...

coefficient of determination

f(.)

function of .; g(.) and h(.) also represent functions

If left skewed, skewness is

negative

C

number of classes for a categorical outcome

If right skewed, skewness is

positive

yi hat

predicted outcome of the ith data point, i = 1....n

pl

probability of the lth event

Resampling method for when sample size is small...

repeated 10-fold cross validation

Resampling method for large sample sizes....

simple 10-fold cross validation

Σ with i = 1 below and n above (summation sign)

the summation operator over the index i, from 1 to n

Mean squared error (MSE)

the average of the squared differences between the forecasted and observed values

y bar

the average or sample mean of the n observed values of the outcome variable

yi

the ith observed value of the outcome, i = 1....n

Feature selection

the process of determining the minimum set of relevant predictors needed by the model.

X'

the transpose of X; this matrix has P rows and n columns

Cl

the value of the lth class level

xij

value of the jth predictor for the ith data point, i = 1....n and j = 1....P

