Chapter 6: Multiple Linear Regression

Backward Elimination

A popular subset selection algorithm. We start with all predictors and, at each step, eliminate the least useful predictor (according to statistical significance).

Forward selection

A popular subset selection algorithm. We start with no predictors and then add predictors one by one. Each predictor added is the one (among all predictors not yet in the model) that has the largest contribution to R² on top of the predictors that are already in it.
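
The forward selection loop described above can be sketched in plain Python. This is only an illustration under stated assumptions: the data are made up (y is built exactly as 2*x1 + 3*x2, and x3 is unrelated), the helper names `solve`, `r_squared`, and `forward_selection` are hypothetical, and the stopping rule (`min_gain`, a minimum improvement in R²) is an assumption, since the card does not say when to stop.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def r_squared(columns, y):
    """R^2 of a least squares fit of y on the given columns plus an intercept."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Normal equations: (X'X) beta = X'y
    XtX = [[sum(X[i][j] * X[i][m] for i in range(n)) for m in range(k)]
           for j in range(k)]
    Xty = [sum(X[i][j] * y[i] for i in range(n)) for j in range(k)]
    beta = solve(XtX, Xty)
    pred = [sum(bj * xj for bj, xj in zip(beta, row)) for row in X]
    mean_y = sum(y) / n
    sse = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - sse / sst


def forward_selection(predictors, y, min_gain=0.01):
    """Add the predictor with the largest R^2 gain until the gain is too small."""
    chosen, best_r2 = [], 0.0
    remaining = list(predictors)
    while remaining:
        scores = {
            name: r_squared([predictors[c] for c in chosen + [name]], y)
            for name in remaining
        }
        best = max(scores, key=scores.get)
        if scores[best] - best_r2 < min_gain:  # assumed stopping rule
            break
        chosen.append(best)
        remaining.remove(best)
        best_r2 = scores[best]
    return chosen, best_r2


# Made-up data: y depends on x1 and x2 only.
predictors = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2, 1, 4, 3, 6, 5],
    "x3": [5, 3, 6, 2, 4, 1],
}
y = [2 * a + 3 * b for a, b in zip(predictors["x1"], predictors["x2"])]

chosen, final_r2 = forward_selection(predictors, y)
print(chosen, round(final_r2, 4))
```

With this data the loop should stop after picking x1 and x2, because adding x3 cannot raise R² any further.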

Stepwise Regression

A popular subset selection algorithm. Like forward selection, except that at each step we also consider dropping predictors that are not statistically significant, as in backward elimination.

Best Subsets (Exhaustive Search)

- All possible subsets of predictors are assessed (singles, pairs, triplets, etc.) to see which fits best.
- Computationally intensive, though a solver can do it.
- Judged by adjusted R² rather than plain R², because plain R² never decreases when a predictor is added.
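
To see why exhaustive search is computationally intensive, the sketch below enumerates every non-empty subset of a small predictor list and then prints how the number of candidate models grows. The predictor names are hypothetical and `all_subsets` is an illustrative helper, not part of any particular solver.

```python
from itertools import combinations


def all_subsets(predictors):
    """Yield every non-empty subset of the predictor names, smallest first."""
    for size in range(1, len(predictors) + 1):
        for subset in combinations(predictors, size):
            yield subset


names = ["price", "ads", "season"]  # hypothetical predictor names
subsets = list(all_subsets(names))
for s in subsets:
    print(s)

# With p predictors there are 2**p - 1 non-empty subsets to fit and compare.
for p in (3, 10, 20, 30):
    print(p, 2**p - 1)
```

Each subset would still need its own regression fit, which is why partial search methods are used when the number of predictors is large.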

Partial search algorithms

- Forward: variables are added one at a time, starting with the most significant.
- Backward: starts with all predictors and drops the least significant one at a time.
- Stepwise: either adds or drops a predictor at each step until the best model is found.

Coefficient of Determination (R²)

- In simple linear regression, it is the square of the coefficient of correlation (R).
- It ranges from 0 to 1.
- It is the proportion of the total variation in the response variable (Y) that is explained, or accounted for, by the variation in the predictor (X).
- E.g., if R² = 0.8, then 80% of the variation in miles driven is accounted for by the number of gallons in the tank. The remaining 20% is influenced by other variables, such as traffic, weather, number of passengers, tows, road construction, etc.

Correlation Coefficient (Pearson R)

- Measures the strength of the linear relationship between two variables.
- It can range from -1.00 to 1.00.
- Positive values indicate a direct relationship and negative values indicate an inverse relationship.
- A value of -1 is as strong a relationship as 1; the sign only gives the direction. A value of 0 is the weakest case: no linear relationship.
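
A small sketch of the calculation, using made-up data chosen to hit the three reference points (1, -1, and 0). The function name `pearson_r` is illustrative.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]
r_pos = pearson_r(x, [2, 4, 6, 8, 10])   # perfect direct relationship
r_neg = pearson_r(x, [10, 8, 6, 4, 2])   # perfect inverse relationship
r_zero = pearson_r(x, [2, 1, 0, 1, 2])   # V-shaped: no linear relationship

print(r_pos, r_neg, r_zero)
```

The V-shaped case is a reminder that a correlation of 0 rules out a linear relationship only; the two variables are clearly related, just not along a line.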

Explanatory vs. Predictive

1. Explaining or quantifying the average effect of inputs on an output (explanatory or descriptive task)
2. Predicting the outcome value for new records, given their input values (predictive task)

When given a chart with Actual and Predicted values, how do you figure out the error in prediction?

Actual - Prediction = Error in Prediction
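
As a quick illustration with made-up numbers, the error is computed record by record:

```python
actual = [10, 12, 15]      # made-up actual values
predicted = [9, 13, 15]    # made-up model predictions

# Error in prediction = actual - predicted, one value per record.
errors = [a - p for a, p in zip(actual, predicted)]
print(errors)  # [1, -1, 0]
```

A positive error means the model under-predicted; a negative error means it over-predicted.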

Least Squares Principle

Determining a regression equation by minimizing the sum of the squares of the vertical distances between the actual Y values and the predicted values of Y.
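
For a single predictor, the least squares line has a closed-form solution. The sketch below uses made-up data and illustrative function names (`fit_line`, `sse`); it also checks that nudging the intercept or slope away from the fitted values makes the sum of squared vertical distances larger.

```python
def fit_line(x, y):
    """Return (a, b) of the least squares line Y' = a + b*X."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def sse(a, b, x, y):
    """Sum of squared vertical distances between actual and predicted Y."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = fit_line(x, y)
print(round(a, 2), round(b, 2))          # approximately 2.2 and 0.6
print(sse(a, b, x, y))                   # smallest achievable for a line
print(sse(a + 0.1, b, x, y), sse(a, b + 0.1, x, y))  # both larger
```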

Multiple Linear Regression --> Prediction

The emphasis is on making predictions for new data (supervised learning), e.g., predicting how much sales revenue there would be if a certain dollar amount is spent on advertising in the future.

Multiple Linear Regression --> Explanation

Theoreticians want to know whether a relationship exists between the predictors and the response. It helps to understand and explain what goes on, e.g.:
- Amount of gas and mileage
- Advertising budget and actual sales
- Hours studied and exam scores

Selecting Subsets of Predictors

Goal: find a parsimonious model (the simplest model that performs sufficiently well).
- More robust (resistant to errors in the results produced by deviations from assumptions such as normality)
- Higher predictive accuracy (from removing unrelated predictors), measured by R²

Regression Analysis

If there is a strong relationship between two variables, one can use a linear model of the form Y' = a + b*X (the simple linear form), where:
- Y' is the predicted value
- a is the Y-intercept (the value of Y' when X = 0)
- b is the slope of the line, or the average change in Y' for each one-unit change in X

Predictive Modeling

In predictive modeling (data mining), the goal is to find a regression model that best predicts new individual records.

Explanatory Modeling

In explanatory and descriptive modeling, where the focus is on modeling the average record, we try to fit the best model to the data in an attempt to learn about the underlying relationship in the population.

Correlation Analysis

Measurement of the association between two variables. If you suspect two variables have a relationship, start by drawing a scatter plot. A scatter diagram is a chart that portrays the relationship between two variables:
- If the points are scattered all over the graph, there is not much of a relationship.
- If the points are aligned, there is potential for a good relationship.

This method finds the values of β0, β1, and β2 that minimize the sum of squared deviations between the actual values (Y) and their predicted values based on the model (Ŷ)

Ordinary Least Squares
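
Written out for a model with two predictors, the quantity being minimized can be sketched as follows. The record index i, the sample size n, and the predictor names X_1 and X_2 are notation assumed here; they do not appear on the card.

```latex
\min_{\beta_0,\,\beta_1,\,\beta_2} \; \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2,
\qquad \hat{Y}_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i}
```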

Residual Output

The residual output shows the difference between the actual beverage sales and the predicted values. The residual column lists the prediction errors, i.e., the part of each actual value that the model does not capture. The smaller the sum of squared errors, the better. A residual measures how far the regression line vertically misses a data point. Predicting with a regression equation instead of the mean of Y reduces these vertical distances.

Calculating Coefficient of Determination (R²)

The proportion of the total variation in the dependent variable (Y) that is explained by the variation in the independent variable (X):
R² = 1 - (unexplained variation / total variation)   (Equation 2)
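
A minimal worked example of this formula, assuming made-up data. The predicted values are those of the line Y' = 2.2 + 0.6*X evaluated at X = 1 to 5, which is assumed here to be the fitted regression line for this data.

```python
y = [2, 4, 5, 4, 5]                      # actual values (made up)
predicted = [2.8, 3.4, 4.0, 4.6, 5.2]    # Y' = 2.2 + 0.6*X for X = 1..5

mean_y = sum(y) / len(y)

# Unexplained variation: squared distances from the regression line.
unexplained = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
# Total variation: squared distances from the mean of Y.
total = sum((yi - mean_y) ** 2 for yi in y)

r2 = 1 - unexplained / total
print(round(r2, 4))  # 0.6
```

Here 60% of the variation in Y is explained by X, and the remaining 40% is left in the residuals.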

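Y' = a + b*X (regression analysis)

A steeper line means Y' changes more for each one-unit change in X. A horizontal line has a slope of 0, meaning X has no linear effect on Y'. Because the slope depends on the units of X and Y, the strength of the relationship is judged by the correlation coefficient, not by steepness alone.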

True or False A good predictive model has high prediction accuracy (to a useful practical level)

True

True or False A good explanatory model is one that fits the data closely, whereas a good predictive model is one that predicts new cases accurately

True

True or False A regression model that fits the existing data too well is not likely to perform well with new data

True

True or False Both explanatory and predictive modeling involve using a database to fit a model (i.e. to estimate coefficients), checking model validity, assessing its performance, and comparing to other models

True

True or False In explanatory models the focus is on the coefficients (β), whereas in predictive models the focus is on the predictions (ŷ)

True

True or False Linear regression models are very popular tools, not only for explanatory modeling but also for prediction

True

True or False The smallest R² you can have is 0

True

True or False A correlation coefficient of 0 is the weakest relationship (no linear relationship)

True

True or False If X1 is known to cause Y, then such a statement indicates actionable policy changes - this is called explanatory modeling.

True

True or False Like R², higher values of adjusted R² indicate a better fit

True

True or False Regression is the Opposite of Progress

True

True or False Removing redundant predictors is key to achieving predictive accuracy and robustness

True

True or False Subset selection methods help find "good" candidate models. These should then be re-run and assessed

True

True or False Unlike R², which does not account for the number of predictors used, adjusted R² uses a penalty on the number of predictors

True
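
The penalty can be made concrete with the usual adjusted R² formula, 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of records and p the number of predictors. The formula is not stated on the card, and the numbers below are made up for illustration.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with p predictors fitted on n records."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Same R^2, same data size, different numbers of predictors.
few = adjusted_r2(0.80, n=25, p=3)
many = adjusted_r2(0.80, n=25, p=6)
print(round(few, 4), round(many, 4))
```

With R² held at 0.80, the model using six predictors gets the lower adjusted R², so extra predictors have to earn their place by raising R² enough to offset the penalty.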

True or False The more predictors there are, the higher the chance of missing values in the data. If we delete or impute cases with missing values, multiple predictors will lead to a high rate of case deletion or imputation

True

True or False Regression modeling means not only estimating the coefficients but also choosing which input variables to include and in what form.

True. For example, a numerical input can be included as-is, in logarithmic form (log X), or in binned form (e.g., age group). Multiple linear regression is applicable to numerous predictive modeling situations, for example predicting customer activity on credit cards from their demographics and historical activity patterns.
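
The two alternative forms mentioned above can be sketched as follows. The income values, the bin boundaries, and the helper name `age_group` are all hypothetical choices made for illustration.

```python
from math import log


def age_group(age):
    """Map a numerical age to a bin label (boundaries are an assumption)."""
    if age < 18:
        return "under 18"
    if age < 40:
        return "18-39"
    if age < 65:
        return "40-64"
    return "65+"


incomes = [20000, 45000, 120000]
log_incomes = [log(v) for v in incomes]   # logarithmic form of the input

ages = [17, 25, 52, 70]
groups = [age_group(a) for a in ages]     # binned form of the input

print(log_incomes)
print(groups)
```

Which form to use is a modeling choice: the log form compresses large values, while the binned form turns one numerical input into a categorical one.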

True or False In explanatory and descriptive modeling, where the focus is on modeling the average record, we try to fit the best model to the data in an attempt to learn about the underlying relationship in the population.

True. In contrast, in predictive modeling (data mining), the goal is to find a regression model that best predicts new individual records.

True or False Data are used to estimate the coefficients and to quantify the noise

True. In predictive modeling, the data are also used to evaluate model performance.

True or False When the goal is to predict outcomes of new individual cases, the data are typically split into a training set and validation set.

True. The training set is used to estimate the model, and the validation (or holdout) set is used to assess the model's predictive performance on new, unobserved data.
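
A minimal sketch of this workflow, assuming simulated data (Y = 3 + 2*X plus random noise), a 60/40 split, and RMSE as the performance measure. None of these choices come from the card; they are assumptions made to keep the example self-contained.

```python
import random
from math import sqrt

rng = random.Random(42)  # fixed seed so the split is reproducible

# Simulated records (x, y): y = 3 + 2*x plus noise with standard deviation 1.
records = [(x, 3 + 2 * x + rng.gauss(0, 1)) for x in range(1, 21)]

# Shuffle, then split 60% training / 40% validation.
rng.shuffle(records)
cut = len(records) * 6 // 10
train, valid = records[:cut], records[cut:]


def fit_line(data):
    """Least squares estimates (a, b) for Y' = a + b*X."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    b = sxy / sxx
    return mean_y - b * mean_x, b


# Estimate the model on the training set only.
a, b = fit_line(train)

# Assess predictive performance on the validation set.
rmse = sqrt(sum((y - (a + b * x)) ** 2 for x, y in valid) / len(valid))
print(len(train), len(valid), round(a, 2), round(b, 2), round(rmse, 2))
```

The validation records play no part in estimating a and b, so the RMSE is an estimate of how the model would perform on new data.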

What does Y' in regression analysis represent?

Y' is the predicted value

What does "a" in regression analysis represent?

a is the Y-intercept (it is value of Y' when X=0)

What does "b" in regression analysis represent?

b is the slope of the line, or average change in Y' for each change of one unit in X

The Coefficient of Determination

R²: the percentage of the variation in the Y variable that can be explained by the variation in the X variable

Multi-collinearity

When two or more predictors are highly correlated. If two columns are perfectly correlated, the extra variable adds no information at all.
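
One way to spot the problem is to check the correlation between predictors before fitting. The sketch below uses two made-up predictors that move almost in lockstep and also reports the variance inflation factor, which for two predictors reduces to 1 / (1 - r²). The VIF is a standard diagnostic, but it is not mentioned on the card and is included here as an assumption about what a reader might check.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Two hypothetical predictors; x2 is roughly 2 * x1.
x1 = [1, 2, 3, 4, 5]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_r(x1, x2)
vif = 1 / (1 - r ** 2)   # undefined (division by zero) if r is exactly +/-1
print(round(r, 4), round(vif, 1))
```

A correlation this close to 1 means the second predictor carries almost no information beyond the first, so one of the two is a candidate for removal.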
