Ch 6 - Multiple Linear Regression
outcome variable
the Y - what you are trying to predict
stepwise
- Like forward selection, except at each step we also consider dropping predictors that are no longer statistically significant
- use step() (in R's stats package) on a fitted lm model to run stepwise regression; set direction = "both"
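A minimal sketch of stepwise search with step(), assuming a training data frame train.df with the outcome Price (the same setup as the exhaustive-search example further down); note that step() selects by AIC rather than by p-values, and the object names here are illustrative:

car.lm <- lm(Price ~ ., data = train.df)         # start from the full model
car.lm.step <- step(car.lm, direction = "both")  # each step may add or drop one predictor
summary(car.lm.step)                             # the selected model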
backward elimination
- Start with all predictors
- Successively eliminate the least useful predictor, one at a time
- Stop when all remaining predictors have a statistically significant contribution
- in the chapter example this yields a model with 7 predictors
- use step() on a fitted lm model; set direction = "backward"
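A minimal sketch of backward elimination, under the same train.df/Price assumptions as above:

car.lm <- lm(Price ~ ., data = train.df)             # start from the full model
car.lm.back <- step(car.lm, direction = "backward")  # only drops predictors, never adds
summary(car.lm.back)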
forward selection
- Start with no predictors
- Add them one by one
- Stop when the next addition is not statistically significant
- use step() on a fitted lm model; set direction = "forward" (and supply a scope so step() knows which predictors it may add)
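A minimal sketch of forward selection; the scope argument tells step() the largest model the search may grow to (same train.df/Price assumptions, illustrative names):

car.lm.null <- lm(Price ~ 1, data = train.df)  # intercept-only starting model
car.lm.full <- lm(Price ~ ., data = train.df)  # upper limit of the search
car.lm.fwd <- step(car.lm.null,
                   scope = list(lower = car.lm.null, upper = car.lm.full),
                   direction = "forward")      # only adds predictors
summary(car.lm.fwd)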
popular subset selection algorithms
- iterative algorithms that search only part of the space of all possible regression models (in contrast to exhaustive search): forward selection, backward elimination, and stepwise regression
exhaustive search
- evaluates all possible subsets of predictors (singles, pairs, triplets, etc.); also called "best subset" selection
- shows the best model for each number of predictors
- computationally intensive, so not feasible for big data; the partial search algorithms (forward, backward, stepwise) exist as cheaper alternatives
- judge candidate models by adjusted R2
- Goal: find a parsimonious model (the simplest model that performs sufficiently well) - more robust, with higher predictive accuracy
- subset selection methods only help find "good" candidate models; these should then be run and assessed, with predictive accuracy evaluated on validation data
- use regsubsets() in package leaps; it requires manually coding categorical predictors into binary dummies:

library(leaps)

# recode the categorical Fuel_Type column (column 4) into binary dummies
Fuel_Type <- as.data.frame(model.matrix(~ 0 + Fuel_Type, data = train.df))
train.df <- cbind(train.df[, -4], Fuel_Type[, ])
head(train.df)

# best subset of each size, up to all available predictors
search <- regsubsets(Price ~ ., data = train.df, nbest = 1,
                     nvmax = dim(train.df)[2], method = "exhaustive")
sum <- summary(search)
sum$which   # which predictors appear in the best model of each size
sum$rsq     # R-squared of each model
sum$adjr2   # adjusted R-squared of each model
sum$Cp      # Mallows's Cp of each model
Adjusted R-squared
- the squared multiple correlation coefficient (R2), adjusted downward for the number of independent variables used to make the estimate
- adjusted R-squared for the models with 1 predictor, 2 predictors, 3 predictors, etc. (exhaustive search method):

> sum$adjr2
 [1] 0.7556359 0.7922356 0.8267935 0.8436895 0.8494282 0.8534911
 [7] 0.8544782 0.8545430 0.8543943 0.8542602 0.8540382

- adjusted R-squared climbs until you hit 7 predictors, then stabilizes (and eventually drifts down), so choose the model with 7 predictors, according to the adjusted R-squared criterion
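For reference, the standard formula (not spelled out in these notes): with n records and p predictors, adjusted R2 = 1 - (1 - R2)(n - 1)/(n - p - 1), so an added predictor raises adjusted R2 only if it improves the fit enough to offset the lost degree of freedom.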
explanatory modeling
- explaining or quantifying the average effect of inputs on an outcome
- it is a descriptive task: learn about the underlying relationship in the population, find a trend or pattern
- Goal: explain the relationship between the predictors (explanatory variables) and the target
- the familiar use of regression in data analysis
- Model goal: fit the data well and understand the contribution of the explanatory variables to the model
- "goodness-of-fit" measures: R2, residual analysis, p-values
predictive modeling
- predicting the outcome value for new records, given their input values
- it is a predictive task: scoring new records
- Goal: predict target values in other data where we have the predictor values but not the target values
- the classic data mining context
- Model goal: good predictive accuracy
- predictive models are fit to training data, and predictive accuracy is assessed on a separate validation (hold-out) data set
- explaining the role of the predictors is not the primary purpose (though still useful)
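A minimal sketch of that workflow in base R, assuming a data frame df with the outcome Price (the names and the 60/40 split are illustrative):

set.seed(1)                                           # reproducible partition
train.rows <- sample(rownames(df), round(nrow(df) * 0.6))
train.df <- df[train.rows, ]                          # training partition
valid.df <- df[setdiff(rownames(df), train.rows), ]   # validation (hold-out) partition

car.lm <- lm(Price ~ ., data = train.df)              # fit on training data only
pred <- predict(car.lm, newdata = valid.df)           # score the hold-out records
sqrt(mean((valid.df$Price - pred)^2))                 # validation RMSE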
Ordinary least squares
- the estimation method used to fit the regression model: it minimizes the sum of squared deviations between the actual outcome values (Y) and the predicted values based on the model (ŷ)
- the resulting predictions are unbiased and have the smallest mean squared error (among linear unbiased estimators, provided the standard model assumptions hold)
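Written out in standard notation (a textbook identity, not from these notes): OLS chooses the coefficients β0, β1, ..., βp that minimize Σ (yi - ŷi)2 = Σ (yi - β0 - β1·x1i - ... - βp·xpi)2, summed over the n training records; this is exactly the fit that lm() computes.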
predictors
- the X variables - what you are inputting
- removing redundant predictors is key to achieving predictive accuracy and robustness
how to reduce the number of predictors
- use domain knowledge - reduce to a sensible set that reflects the problem at hand - make use of computational power and statistical performance metrics
multiple linear regression model p.153
- used to fit a relationship between a numerical outcome variable (Y) and a set of predictor variables (X's)
- in this chapter the goal is to predict new records, not simply to identify which variables are significant
- can be used for both explanatory and predictive modeling
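In standard notation (a textbook fact, not spelled out in these notes), the model is Y = β0 + β1X1 + β2X2 + ... + βpXp + ε, where ε is the noise term; the β coefficients are estimated from the data, typically by ordinary least squares (see the entry above).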
overfitting
- when a model is too complex: it fits the training data so closely that it captures the noise as well as the signal, and so predicts new records poorly
underfitting
- when a model is too simple to capture the underlying relationship, so it fits even the training data poorly
explanatory vs predictive p.154-155
1.) an explanatory model should fit the existing data well; a predictive model should predict new records well
2.) explanatory modeling uses the whole data set for estimation; predictive modeling splits the data into training and validation sets
3.) an explanatory model performs well if it fits the data well; a predictive model performs well if it predicts new records accurately
4.) explanatory models focus on the coefficients; predictive models focus on the predictions
the 3 algos of iterative search methods are
1.) forward selection - start with no predictors, then add predictors one by one
2.) backward elimination - start with all predictors, then eliminate the least useful ones one by one
3.) stepwise regression - like forward selection, but at each step we also consider dropping predictors that are not statistically significant
Understand the methods of variable selection in linear regression p.161-166
Backward elimination, forward selection, and stepwise regression (direction = "backward", "forward", or "both" in step())
