Multiple Linear Regression
Reducing Multicollinearity
1. Do not include redundant independent variables.
2. Add more observations.
3. Use ridge regression to estimate the parameters of the model (a minimal sketch follows below).
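Ridge regression (item 3) can be sketched as follows; this is a minimal illustration using scikit-learn on simulated, deliberately collinear data, with an arbitrary penalty value rather than a tuned one.

```python
# Minimal ridge-regression sketch for the multicollinearity remedy above.
# Data, variable names, and the alpha value are illustrative only.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.05, size=100)   # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 3 * x1 - 1 * x2 + rng.normal(size=100)

# The penalty (alpha) shrinks the coefficients, stabilising the estimates
# when the predictors are highly correlated.
ridge = Ridge(alpha=1.0).fit(X, y)
print(ridge.coef_, ridge.intercept_)
```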
Variable Selection Procedures
1. All possible subsets
2. Forward selection
3. Backward elimination
4. Stepwise selection
VIF equation
VIFj = 1 / (1 - Rj^2), where Rj^2 is the coefficient of determination from regressing the j-th predictor on all the other predictors.
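A small sketch of this formula in Python (simulated data; statsmodels assumed available): regress each predictor on the others and apply 1/(1 - R^2).

```python
# VIF computed directly from its definition: 1 / (1 - R_j^2), where R_j^2 comes
# from regressing predictor j on the remaining predictors. Simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
X[:, 2] = 0.8 * X[:, 0] + 0.2 * rng.normal(size=200)  # induce collinearity

for j in range(X.shape[1]):
    others = np.delete(X, j, axis=1)
    r2 = sm.OLS(X[:, j], sm.add_constant(others)).fit().rsquared
    print(f"VIF for x{j}: {1 / (1 - r2):.2f}")
```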
Multiple Regression
A technique for describing the relationship between a continuous response Y and a set of two or more explanatory variables. It is the simple linear model extended to several predictors.
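A minimal fitting sketch with statsmodels; the response (sbp) and predictors (age, bmi) are made-up names on simulated data.

```python
# Fitting a multiple regression: same least-squares machinery as the simple
# linear model, just with more than one term on the right-hand side.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "age": rng.uniform(20, 70, 150),
    "bmi": rng.normal(25, 4, 150),
})
df["sbp"] = 100 + 0.5 * df["age"] + 1.2 * df["bmi"] + rng.normal(0, 8, 150)

model = smf.ols("sbp ~ age + bmi", data=df).fit()
print(model.summary())
```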
Variable Selection: Backward Elimination
All candidate variables are initially brought into the equation. The coefficient of the least significant variable is then tested. If it is significant, all variables are retained in the model and the process stops. If it is not significant, the variable is removed, the regression equation is refitted without it, and the process is repeated.
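A sketch of that loop on simulated data, using p-values and an illustrative 0.05 significance level (the level itself is an assumption, not part of the definition).

```python
# Backward elimination: start with all candidates, repeatedly drop the least
# significant variable, refit, and stop once everything left is significant.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(12)
n = 150
X = pd.DataFrame(rng.normal(size=(n, 4)), columns=["x1", "x2", "x3", "x4"])
y = 1 + 2 * X["x1"] - 1.5 * X["x3"] + rng.normal(size=n)

kept = list(X.columns)
while kept:
    fit = sm.OLS(y, sm.add_constant(X[kept])).fit()
    pvals = fit.pvalues.drop("const")          # ignore the intercept
    worst = pvals.idxmax()                     # least significant variable
    if pvals[worst] <= 0.05:
        break                                  # everything left is significant
    kept.remove(worst)                         # remove it and refit
print("Variables retained:", kept)
```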
Dummy variables
Binary variables in the regression model with values 0 or 1 for each individual observation; the regression coefficient indicates the average difference in the dependent variable between the two groups, holding the other predictors constant.
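A short sketch of dummy coding on simulated data (group, age, and y are illustrative names); the C(group) coefficient is the adjusted difference between the two groups.

```python
# Dummy variable example: a two-level factor becomes a single 0/1 column whose
# coefficient is the average group difference, holding age constant.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "group": rng.choice(["control", "treated"], 120),
    "age": rng.uniform(20, 70, 120),
})
df["y"] = 5 + 0.3 * df["age"] + 4 * (df["group"] == "treated") + rng.normal(0, 2, 120)

# C(group) asks statsmodels/patsy to build the 0/1 dummy automatically.
fit = smf.ols("y ~ C(group) + age", data=df).fit()
print(fit.params)
```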
Variable Selection: Stepwise Selection
Combines the features of forward selection and backward elimination. Variables are added one by one to the model depending on their significance. After a variable is added, all variables already included in the model are examined and any variable that is not significant is deleted.
Cook's D
Measures how much the fitted regression line would change if a data point were removed. Values larger than 4/n are considered potentially highly influential.
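A sketch of applying the 4/n rule with statsmodels' influence diagnostics on simulated data containing one planted outlier.

```python
# Cook's D: flag observations whose removal would move the fitted line a lot.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
X = sm.add_constant(rng.normal(size=(50, 2)))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=50)
y[0] += 10                                     # plant one influential point

fit = sm.OLS(y, X).fit()
cooks_d = fit.get_influence().cooks_distance[0]
flagged = np.where(cooks_d > 4 / len(y))[0]    # the 4/n rule of thumb
print("Potentially influential observations:", flagged)
```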
DFBETA Cut-off
|DFBETA| > 2/sqrt(n) flags an observation as potentially influential; |DFBETA| > 1 is an alternative cut-off.
Variable Selection: Forward Selection
Variables are included in the model one at a time (according to the strength of their association with Y) until the coefficient of the next variable would not be significantly different from zero. First, all simple linear regression models are considered to find the one that gives the best fit based on the F-statistic; that variable is brought into the regression equation first.
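A sketch of the forward-selection loop on simulated data. It judges each candidate by its p-value on entry (for a single added term this is equivalent to the F-statistic, since F = t^2); the 0.05 entry level is an illustrative choice.

```python
# Forward selection: add the most significant candidate at each step and stop
# when the next variable's coefficient would not differ significantly from zero.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(13)
n = 150
X = pd.DataFrame(rng.normal(size=(n, 4)), columns=["x1", "x2", "x3", "x4"])
y = 1 + 2 * X["x1"] - 1.5 * X["x3"] + rng.normal(size=n)

selected, remaining = [], list(X.columns)
while remaining:
    # p-value of each candidate when added to the current model
    pvals = {v: sm.OLS(y, sm.add_constant(X[selected + [v]])).fit().pvalues[v]
             for v in remaining}
    best = min(pvals, key=pvals.get)
    if pvals[best] >= 0.05:
        break                          # next variable would not be significant
    selected.append(best)
    remaining.remove(best)
print("Variables entered:", selected)
```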
VIF
Measure of how highly correlated each independent variable is with the other predictors in the model; a VIF larger than 10 for any variable, or a mean VIF substantially larger than 1, usually implies that multicollinearity may be influencing the least squares estimates.
DFBETA
Measure of how much an observation has affected the estimate of a regression coefficient (each observation has one DFBETA value for each regression coefficient).
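A sketch of the DFBETAS diagnostic (the scaled version reported by statsmodels) checked against the 2/sqrt(n) cut-off on simulated data.

```python
# DFBETAS: one value per coefficient per observation; flag |DFBETAS| > 2/sqrt(n).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
X = sm.add_constant(rng.normal(size=(60, 2)))
y = X @ np.array([0.5, 1.5, -2.0]) + rng.normal(size=60)

fit = sm.OLS(y, X).fit()
dfbetas = fit.get_influence().dfbetas     # shape: (observations, coefficients)
cutoff = 2 / np.sqrt(len(y))
rows, cols = np.where(np.abs(dfbetas) > cutoff)
print("Observation/coefficient pairs over the cut-off:", list(zip(rows, cols)))
```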
Interactions
Used to account for a difference in slopes between groups, i.e., an interaction between a continuous variable and a dummy variable. Include an interaction term (the two explanatory variables multiplied together) in the model.
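A sketch of a continuous-by-dummy interaction using statsmodels formula syntax, where x * C(group) expands to both main effects plus their product; names and data are illustrative.

```python
# Interaction term: lets the slope of x differ between the two groups.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
df = pd.DataFrame({
    "x": rng.uniform(0, 10, 200),
    "group": rng.choice(["A", "B"], 200),
})
slope = np.where(df["group"] == "A", 1.0, 2.5)        # different slope per group
df["y"] = 2 + slope * df["x"] + rng.normal(0, 1, 200)

fit = smf.ols("y ~ x * C(group)", data=df).fit()
print(fit.params)    # the x:C(group)[T.B] term estimates the difference in slopes
```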
Coefficient of Multiple Determination
R-squared never decreases as variables are added; it is affected by the number of parameters and the sample size, so R-squared values cannot be compared between studies or between models with different sample sizes and numbers of parameters.
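A small illustration of why this matters: adding even a pure-noise variable never lowers R-squared (simulated data).

```python
# R-squared can only increase when a variable is added, noise or not, which is
# why the adjusted version is preferred when comparing models.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(16)
n = 60
x1 = rng.normal(size=n)
noise = rng.normal(size=n)                 # unrelated to y
y = 3 * x1 + rng.normal(size=n)

small = sm.OLS(y, sm.add_constant(x1)).fit()
big = sm.OLS(y, sm.add_constant(np.column_stack([x1, noise]))).fit()
print(small.rsquared, big.rsquared)        # big.rsquared is never smaller
```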
Overall Significance of Model
Same as simple linear regression: TOTAL SS (df = n-1) = REGRESSION SS (df = p) + RESIDUAL SS (df = n-p-1). The F-test provides a composite test of the null hypothesis H0: beta1 = beta2 = ... = betap = 0 (the predictor variables are irrelevant) against HA: at least one beta does not equal 0 (the predictors do better than just the average of Y).
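A sketch of reading these quantities off a fitted statsmodels model (simulated data), showing the same degrees-of-freedom decomposition.

```python
# Overall F-test: H0 says all slopes are zero (intercept-only model is enough).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n, p = 100, 3
X = sm.add_constant(rng.normal(size=(n, p)))
y = X @ np.array([1.0, 0.8, 0.0, -0.5]) + rng.normal(size=n)

fit = sm.OLS(y, X).fit()
print("F =", fit.fvalue, "p-value =", fit.f_pvalue)
print("df (regression, residual):", fit.df_model, fit.df_resid)   # p and n - p - 1
```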
Interpretation of Multiple Regression
Each population slope (beta) is the amount by which Y changes on average when Xi changes by one unit and all other X variables remain constant. In the case of two predictor variables X1 and X2, the coefficient of X2 in the regression of Y on X1 and X2 is equivalent to carrying out: 1) the regression of Y on X1 (the residuals are the part of Y not explained by X1); 2) the regression of X2 on X1 (the residuals are the part of X2 not explained by X1); 3) the regression of the residuals from 1 on the residuals from 2.
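A sketch verifying this residual-on-residual equivalence numerically on simulated data: the X2 slope from the full model matches the slope from step 3.

```python
# The coefficient of x2 in the full model equals the slope from regressing
# "y adjusted for x1" on "x2 adjusted for x1".
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
x1 = rng.normal(size=300)
x2 = 0.5 * x1 + rng.normal(size=300)
y = 1 + 2 * x1 - 3 * x2 + rng.normal(size=300)

full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

ry = sm.OLS(y, sm.add_constant(x1)).fit().resid     # part of y not explained by x1
rx2 = sm.OLS(x2, sm.add_constant(x1)).fit().resid   # part of x2 not explained by x1
partial = sm.OLS(ry, sm.add_constant(rx2)).fit()

print(full.params[2], partial.params[1])            # both estimate the x2 slope
```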
Collinearity
Correlation between the explanatory variables; pairwise correlations of 0.9 and above are cause for concern. The VIF can be calculated to assess it.
Fitting a Model for Multiple Regression
Use the method of least squares, exactly as in simple linear regression, with the additional independent variables included in the equation.
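A bare-bones least-squares sketch with numpy on simulated data, solving the normal equations directly and checking against numpy's built-in solver.

```python
# Least squares for multiple regression: beta_hat = (X'X)^{-1} X'y,
# with an intercept column included in X.
import numpy as np

rng = np.random.default_rng(15)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # intercept + 2 predictors
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)        # normal equations
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # numpy's least-squares solver
print(beta_hat, beta_lstsq)                         # the two agree
```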
Adjusted R-squared
Use this to take into account the chance contribution of each variable included. It is calculated as adjusted R-squared = 1 - (1 - R-squared)(n - 1)/(n - p - 1), where p is the number of predictor variables.
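A quick check of the formula against statsmodels' reported value on simulated data.

```python
# Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
n, p = 80, 4
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.0, 0.5, 0.0]) + rng.normal(size=n)

fit = sm.OLS(y, sm.add_constant(X)).fit()
r2_adj = 1 - (1 - fit.rsquared) * (n - 1) / (n - p - 1)
print(r2_adj, fit.rsquared_adj)    # the two values agree
```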
Leverage
When a point has an unusual X profile: a measure of how far an observation is from the others in terms of the levels of the INDEPENDENT variables. Observations with leverage values larger than 2(k+1)/n, where k is the number of predictors, are considered potentially highly influential.
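A sketch of flagging high-leverage points with the hat values from statsmodels on simulated data containing one deliberately extreme X profile.

```python
# Leverage: distance of an observation from the others in predictor space;
# flag hat values above 2(k + 1)/n.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(10)
k, n = 2, 50
X = rng.normal(size=(n, k))
X[0] = [8, -8]                       # unusual X profile, far from the others
y = X @ np.array([1.0, 1.0]) + rng.normal(size=n)

fit = sm.OLS(y, sm.add_constant(X)).fit()
leverage = fit.get_influence().hat_matrix_diag
print(np.where(leverage > 2 * (k + 1) / n)[0])   # potentially influential rows
```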
Variable Selection: All Possible Subsets
If there are k variables under consideration, then there are 2^k - 1 possible subsets. If k is large this becomes computationally demanding, but the candidate subsets can be compared using the following criteria: R^2 (maximize within each subset size), adjusted R^2 (maximize across subsets of different sizes), and AIC (Akaike's information criterion; minimize).
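A brute-force sketch of the approach on a small simulated example (feasible only while 2^k - 1 stays small), ranking subsets by adjusted R-squared and AIC.

```python
# All possible subsets: fit every non-empty subset of the candidate predictors
# and compare adjusted R^2 (maximize) and AIC (minimize).
from itertools import combinations
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)
n = 120
X = rng.normal(size=(n, 4))
y = 2 + 1.5 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(size=n)

results = []
for size in range(1, X.shape[1] + 1):
    for subset in combinations(range(X.shape[1]), size):
        fit = sm.OLS(y, sm.add_constant(X[:, list(subset)])).fit()
        results.append((subset, fit.rsquared_adj, fit.aic))

best_adj = max(results, key=lambda r: r[1])   # best by adjusted R^2
best_aic = min(results, key=lambda r: r[2])   # best by AIC
print("Best by adjusted R^2:", best_adj)
print("Best by AIC:", best_aic)
```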
Categorical (factor) explanatory variables
Multiple regression does not require the X variables to be normally distributed, nor does it require them to be continuous; categorical explanatory variables can be included (e.g., as dummy variables).