Chapter 6: Multiple Linear Regression
Backward Elimination
A popular subset selection algorithm. We start with all predictors and then, at each step, eliminate the least useful predictor (according to statistical significance).
Forward selection
A popular subset selection algorithm. We start with no predictors and then add predictors one by one. Each predictor added is the one (among all remaining predictors) that has the largest contribution to R2 on top of the predictors that are already in the model.
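Forward selection as described above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the data, the helper names `ols_r2` and `forward_selection`, and the stopping rule `min_gain` (stop when the best candidate improves R2 by less than 0.01) are all assumptions made for the example.

```python
def ols_r2(columns, y):
    """R2 of a least-squares fit of y on the given columns plus an intercept."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    # Augmented normal equations: (X'X | X'y)
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(k):
            if r != c:
                factor = A[r][c] / A[c][c]
                A[r] = [v - factor * w for v, w in zip(A[r], A[c])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    predicted = [sum(b * v for b, v in zip(beta, row)) for row in rows]
    mean_y = sum(y) / n
    sse = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - sse / sst


def forward_selection(data, y, min_gain=0.01):
    """Add the predictor with the largest R2 gain until the gain is below min_gain."""
    selected, best_r2 = [], 0.0
    remaining = list(data)
    while remaining:
        # R2 of the current model plus each remaining candidate
        scores = {name: ols_r2([data[s] for s in selected] + [data[name]], y)
                  for name in remaining}
        best = max(scores, key=scores.get)
        if scores[best] - best_r2 < min_gain:  # assumed stopping rule
            break
        selected.append(best)
        remaining.remove(best)
        best_r2 = scores[best]
    return selected, best_r2


# Hypothetical data: y depends on x1 only, plus a little noise
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
noise = [0.1, -0.2, 0.15, -0.1, 0.05, -0.05, 0.2, -0.15]
data = {
    "x1": x1,
    "x2": [2, 1, 4, 3, 6, 5, 8, 7],   # correlated with x1
    "x3": [5, 3, 8, 1, 9, 2, 7, 4],   # unrelated to y
}
y = [3 + 2 * a + e for a, e in zip(x1, noise)]

selected, r2 = forward_selection(data, y)
print(selected, round(r2, 4))
```

With this data only x1 is picked, because once x1 is in the model neither x2 nor x3 adds a meaningful amount of R2.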
Stepwise Regression
A popular subset selection algorithm. Like forward selection, except that at each step we also consider dropping predictors that are not statistically significant, as in backward elimination.
Best Subsets (Exhaustive Search)
All possible subsets of predictors are assessed (singles, pairs, triplets, etc.) to see which fits with the highest R2. Computationally intensive, but Solver can do it. Models are judged by adjusted R2.
Best Subsets (Exhaustive) Search
All possible subsets of predictors are assessed (singles, pairs, triplets, etc.). Computationally intensive, but the solver can do it. Judge by adjusted R2.
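To see why an exhaustive search is computationally intensive, the sketch below only enumerates the candidate subsets that would each have to be fitted and scored (for example by adjusted R2). The predictor names are hypothetical.

```python
from itertools import combinations

# Hypothetical predictor names
predictors = ["x1", "x2", "x3", "x4"]

# Every non-empty subset: singles, pairs, triplets, and the full set
subsets = [
    combo
    for size in range(1, len(predictors) + 1)
    for combo in combinations(predictors, size)
]

# With p predictors there are 2**p - 1 non-empty subsets to fit
print(len(subsets), 2 ** len(predictors) - 1)
```

The count doubles with every extra predictor, so 20 predictors already mean more than a million candidate models.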
Partial search algorithms
Forward: variables are added one at a time, starting with the most significant. Backward: starts with all predictors and drops the least significant one at a time. Stepwise: either adds or drops a predictor at each step until the best model is found.
Coefficient of Determination (R2)
In simple linear regression, it is the square of the coefficient of correlation (R). It ranges from 0 to 1. It is the proportion of the total variation in the response variable (Y) that is explained or accounted for by the variation in the predictor (X). E.g., if R2 = 0.8, then 80% of the variation in miles driven is accounted for by the number of gallons in the tank. The remaining 20% is influenced by other unknown variables, such as traffic, weather, number of passengers, tows, road construction, etc.
Correlation Coefficient (Pearson R)
Measures the strength of the linear relationship between two variables. It can range from -1.00 to 1.00. Positive values indicate a direct relationship and negative values indicate an inverse relationship. A value of -1 is as strong a relationship as 1; the sign only gives the direction. A value of 0 is the weakest case: no linear relationship.
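As a small illustration of the range of R, the sketch below computes the Pearson coefficient by hand for three made-up data sets: a perfect direct relationship, a perfect inverse relationship, and one with no linear relationship. The function name `pearson_r` and the numbers are assumptions for the example.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]
r_direct = pearson_r(x, [2, 4, 6, 8, 10])    # perfect direct relationship
r_inverse = pearson_r(x, [10, 8, 6, 4, 2])   # perfect inverse relationship
r_none = pearson_r(x, [2, 1, 3, 1, 2])       # no linear relationship

print(r_direct, r_inverse, r_none)
```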
Explanatory vs. Predictive
1. Explaining or quantifying the average effect of inputs on an output (explanatory or descriptive task). 2. Predicting the outcome value for new records, given their input values (predictive task).
When given a chart with Actual and Predicted values, how do you figure out the error in prediction?
Actual - Prediction = Error in Prediction
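A quick worked example of this subtraction, using made-up actual and predicted values:

```python
# Hypothetical actual and predicted values (illustrative only)
actual = [10.0, 12.0, 15.0]
predicted = [9.0, 13.5, 15.0]

# Error in prediction (residual) for each record: actual - predicted
errors = [a - p for a, p in zip(actual, predicted)]
print(errors)
```

A positive error means the model under-predicted that record, a negative error means it over-predicted.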
Least Squares Principle
Determining a regression equation by minimizing the sum of the squares of the vertical distances between the actual Y values and the predicted values of Y.
Multiple Linear Regression --> Prediction
Emphasis is on how to make predictions on new data (supervised), e.g., predicting how much the sales revenue would be if a certain dollar amount is spent on advertising in the future.
Multiple Linear Regression -> Explanation
Explanation: Theoreticians want to know if a relationship exists between predictors and response. Helps to understand and explain what goes on, e.g., amount of gas and mileage; advertisement budget and actual sales; hours studied and exam scores.
Partial Search Algorithms
Forward: variables are added one at a time, starting with the most significant. Backward: starts with all predictors and drops the least significant one at a time. Stepwise: either adds or drops a predictor at each step until the best model is found (it considers one variable at a time, either picking it up or dropping it).
Selecting Subsets of Predictors
Goal: find a parsimonious model (the simplest model that performs sufficiently well). Such a model is more robust (resistant to errors in the results produced by deviations from assumptions such as normality) and has higher predictive accuracy (from removing unrelated predictors), measured by R2.
Regression Analysis
If there is a strong relationship between two variables, one can use a linear model of the form Y' = a + b*X (this is the simple linear form). Y' is the predicted value; a is the Y-intercept (the value of Y' when X = 0); b is the slope of the line, or the average change in Y' for each change of one unit in X.
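The intercept a and slope b can be estimated with the usual least squares formulas, b = Sxy / Sxx and a = mean(Y) - b * mean(X). The sketch below applies them to made-up gallons and miles data; the function name `fit_line` and the numbers are assumptions for illustration.

```python
def fit_line(x, y):
    """Least squares estimates of intercept a and slope b for Y' = a + b*X."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical data: gallons in the tank vs. miles driven
gallons = [1, 2, 3, 4, 5]
miles = [32, 58, 91, 118, 151]

a, b = fit_line(gallons, miles)      # a is about 0.6, b is about 29.8
predicted_miles = a + b * 6          # prediction for a new record with 6 gallons
print(round(a, 2), round(b, 2), round(predicted_miles, 1))
```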
Predictive Modeling
In predictive modeling (data mining), in contrast to explanatory modeling, the goal is to find a regression model that best predicts new individual records.
Explanatory Modeling
In explanatory and descriptive modeling, where the focus is on modeling the average record, we try to fit the best model to the data in an attempt to learn about the underlying relationship in the population.
Correlation Analysis
Measurement of association between two variables. If you suspect two variables have a relationship, start by drawing a scatter plot. A scatter diagram is a chart that portrays the relationship between two variables. If the points are scattered all over the graph, there is not much of a relationship; if the points are aligned, there is potential for a good relationship.
Ordinary Least Squares
This model finds the values β0, β1, β2, ... that minimize the sum of squared deviations between the actual values (Y) and their predicted values based on the model (^Y).
Residual Output
The residual output shows the difference between the actual beverage sales and the predicted values. The residual column displays the errors, i.e., the part of each actual value that the model does not capture. The smaller the sum of squared errors, the better. The residual value is a measure of how far the regression line vertically misses a data point. By predicting with a regression equation instead of the mean of the values, you are able to reduce these vertical distances.
Calculating Coefficient of Determination (R2)
The proportion of the total variation in the dependent variable (Y) that is explained by the variation in the independent variable (X): R2 = 1 - (unexplained variation / total variation)
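A small worked example of this formula, taking the unexplained variation as the sum of squared residuals and the total variation as the sum of squared deviations of Y from its mean. The actual and predicted values are made up for illustration.

```python
# Hypothetical actual values and model predictions
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.5, 8.5]

mean_y = sum(actual) / len(actual)

# Unexplained variation: sum of squared residuals (actual - predicted)
unexplained = sum((a - p) ** 2 for a, p in zip(actual, predicted))

# Total variation: sum of squared deviations of the actual values from their mean
total = sum((a - mean_y) ** 2 for a in actual)

r2 = 1 - unexplained / total   # about 0.95 for this data
print(unexplained, total, r2)
```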
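Y' = a + b*X (regression analysis)
A steeper slope means a larger change in Y' for each one-unit change in X. A horizontal line has a slope of 0, meaning no linear relationship. Because the slope depends on the units of X and Y, use R or R2 rather than steepness alone to judge the strength of the relationship.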
True or False A good predictive model has high prediction accuracy (to a useful practical level)
True
True or False A good explanatory model is one that fits the data closely, whereas a good predictive model is one that predicts new cases accurately
True
True or False A regression model that fits the existing data too well is not likely to perform well with new data
True
True or False Both explanatory and predictive modeling involve using a database to fit a model (i.e. to estimate coefficients), checking model validity, assessing its performance, and comparing to other models
True
True or False In explanatory models the focus is on the coefficients (β), whereas in predictive models the focus is on the predictions (^y)
True
True or False Linear regression models are very popular tools, not only for explanatory modeling but also for prediction
True
True or False The smallest R2 you can have is 0
True
True or False A correlation coefficient of 0 is the worst relationship (no linear relationship at all)
True
True or False If X1 is known to cause Y, then such a statement indicates actionable policy changes - this is called explanatory modeling.
True
True or False Like R2, higher values of adjusted R2 indicate a better fit
True
True or False Regression is the Opposite of Progress
True
True or False Removing redundant predictors is key to achieving predictive accuracy and robustness
True
True or False Subset selection methods help find "good" candidate models. These should then be re-run and assessed
True
True or False Unlike R2, which does not account for the number of predictors used, adjusted R2 uses a penalty on the number of predictors
True
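The penalty can be made concrete with the standard formula, adjusted R2 = 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of records and p the number of predictors. The numbers below are made up; the function name `adjusted_r2` is an assumption for the example.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R2 for a model with p predictors fitted on n records."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Same R2 and sample size, different numbers of predictors (hypothetical values)
few_predictors = adjusted_r2(0.80, n=21, p=4)     # about 0.75
many_predictors = adjusted_r2(0.80, n=21, p=10)   # about 0.60

print(round(few_predictors, 2), round(many_predictors, 2))
```

Two models with the same R2 are therefore ranked differently: the one that needed more predictors to get there scores lower.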
True or False The more predictors there are, the higher the chance of missing values in the data. If we delete or impute cases with missing values, multiple predictors will lead to a high rate of case deletion or imputation
True
True or False Regression modeling means not only estimating the coefficients but also choosing which input variables to include and in what form.
True. For example, a numerical input can be included as-is, or in logarithmic form (log X), or in a binned form (e.g., age group). Multiple linear regression is applicable to numerous predictive modeling situations. Examples are predicting customer activity on credit cards from their demographics and historical activity patterns, etc.
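The two alternative forms mentioned here, a logarithmic form and a binned form, might look like this in code. The income and age values, the bin boundaries, and the helper name `age_group` are all hypothetical choices for illustration.

```python
import math

# Hypothetical numerical inputs
incomes = [20000, 50000, 120000]
ages = [22, 35, 61]

# Logarithmic form: include log(X) instead of X
log_incomes = [math.log(v) for v in incomes]


def age_group(age):
    """Binned form: map an age to a group (bin boundaries are an assumption)."""
    if age < 30:
        return "under 30"
    if age < 50:
        return "30-49"
    return "50+"


groups = [age_group(a) for a in ages]
print(log_incomes, groups)
```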
True or False In explanatory and descriptive modeling, where the focus is on modeling the average record, we try to fit the best model to the data in an attempt to learn about the underlying relationship in the population.
True. In contrast, in predictive modeling (data mining), the goal is to find a regression model that best predicts new individual records.
True or False Data are used to estimate the coefficients and to quantify the noise
True. In predictive modeling, the data are also used to evaluate model performance.
True or False When the goal is to predict outcomes of new individual cases, the data are typically split into a training set and validation set.
True. The training set is used to estimate the model, and the validation or holdout set is used to assess this model's predictive performance on new, unobserved data.
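A minimal sketch of such a split, assuming a random 60/40 partition with a fixed seed; the fraction, the seed, and the function name `train_validation_split` are illustrative choices, not something the text prescribes.

```python
import random


def train_validation_split(records, validation_fraction=0.4, seed=1):
    """Randomly partition records into a training set and a validation set."""
    rng = random.Random(seed)        # fixed seed so the split is reproducible
    shuffled = records[:]            # copy, so the original order is untouched
    rng.shuffle(shuffled)
    n_valid = int(round(len(shuffled) * validation_fraction))
    return shuffled[n_valid:], shuffled[:n_valid]


records = list(range(10))            # stand-in for 10 data records
train, valid = train_validation_split(records)
print(len(train), len(valid))
```

The model would be estimated on `train` only, and its prediction errors would then be measured on `valid`.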
What does Y' in regression analysis represent?
Y' is the predicted value
What does "a" in regression analysis represent?
a is the Y-intercept (it is value of Y' when X=0)
What does "b" in regression analysis represent?
b is the slope of the line, or average change in Y' for each change of one unit in X
The Coefficient of Determination
R2: the percentage of the variation in the Y variable that can be explained by the variation in the X variable
Multi-collinearity
When two or more predictors are highly correlated. If you have two columns that are perfectly correlated, then the extra variable won't be of any use at all.
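One simple way to spot this is to compute the pairwise correlations between predictors and flag the pairs above some cutoff. The sketch below does that for made-up data in which height is recorded twice (in centimeters and in inches), so the two columns are perfectly correlated. The 0.9 threshold and all names are assumptions for the example.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


heights_cm = [150, 160, 170, 180, 190]
predictors = {
    "height_cm": heights_cm,
    "height_in": [h / 2.54 for h in heights_cm],  # same information, other unit
    "age": [30, 25, 41, 29, 35],
}

threshold = 0.9  # assumed cutoff for "highly correlated"
flagged = [
    (a, b)
    for a, b in combinations(predictors, 2)
    if abs(pearson_r(predictors[a], predictors[b])) > threshold
]
print(flagged)
```

Here only the two height columns are flagged, and one of them could be dropped without losing any information.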
least squares regression line
the line with the smallest sum of squared residuals