Unit 3: Multiple Linear Regression
What is multiple linear regression?
In multiple linear regression, the data are realizations of the response variable along with the corresponding values of the predicting variables. The relationship captured is the linear relationship between the response variable and the predicting variables.
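In standard notation (added here for reference, not copied from the slides), the model is
$y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \cdots + \beta_p x_{i,p} + \epsilon_i, \quad i = 1, \ldots, n$
where the $\epsilon_i$ are the error terms.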
the estimators of the regression coefficients are biased under what conditions?
In short, under the model assumptions the estimator β hat is unbiased for beta because the expectation of the estimator is equal to the true parameter. Bias arises when those assumptions fail, for example when the zero-mean error (linearity) assumption does not hold.
Is y_hat biased?
No, it's not; since β hat is unbiased, y hat is also unbiased.
does it matter what order variables enter the model in LM?
Note that the order in which the predicting variables enter the model is important here. I recommend reviewing the lesson where I introduce the testing procedure for subsets of coefficients. The anova command gives the sum of squares explained by the first variable, then the second variable conditional on including the first, then the third variable conditional on the first and second, and so forth. For example, the extra sum of squares from adding income to the model that already includes takers and rank is 2,858.
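As a minimal sketch in R (assuming a data frame satdata with columns sat, takers, rank, and income; the names are illustrative), the sequential sums of squares reported by anova depend on the order of the terms, while the coefficient estimates do not:
## Sequential (Type I) sums of squares depend on term order
model.a = lm(sat ~ takers + rank + income, data = satdata)
model.b = lm(sat ~ income + takers + rank, data = satdata)
anova(model.a)  ## income row: extra SS given takers and rank already in the model
anova(model.b)  ## income row: SS for income entered first
## The coefficient estimates from summary() are the same for either ordering.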
what does each parameter mean?
Only now there are several predictors: each estimated coefficient β hat i represents the estimated expected change in y associated with a one-unit change in the corresponding predicting variable, holding all other variables in the model fixed.
matrix form of multiple linear regression model
The matrix of predicting variables, with a leading column of ones for the intercept, is also known as the design matrix.
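In standard notation (added here for reference), the matrix form and the least squares estimator are
$Y = X\beta + \epsilon, \qquad \hat{\beta} = (X^{T}X)^{-1}X^{T}Y$
where $Y$ is the $n \times 1$ response vector, $X$ is the $n \times (p+1)$ design matrix, $\beta$ is the $(p+1) \times 1$ coefficient vector, and $\epsilon$ is the $n \times 1$ error vector.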
What happened to the Cook's distance plot of SAT score after taking log(takers)?
Expenditure (for which Alaska is an outlier) became significant. For this model with the transformation, the Cook's distance values show that Alaska is an influential point. The reason is that the predicting variable expenditure is now strongly associated with SAT score (its p-value is small), so we also see a change in the statistical significance of expenditure.
explanatory variable
Explanatory variables can be used to explain variability in the response variable. They may be included in the model even if other similar variables are in the model.
For numerical summaries between the response and individual qualitative predicting variables, we can use ANOVA to test whether the means of the response are statistically different across the categories of the qualitative variable.
For example:
## Qualitative data & the response
summary(aov(imdb ~ year))
deciding whether a variable should be qualitative or quantitative?
Generally, we can transform a quantitative variable into a qualitative, categorical variable, particularly when we see a non-linear relationship between that predicting variable and the response.
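A minimal sketch in R, assuming a data frame movies with a quantitative year column and a hypothetical budget column (names are illustrative):
## Treat year as qualitative by converting it to a factor
movies$year.f = as.factor(movies$year)
## Or bin a continuous variable into categories at chosen cut points
movies$budget.cat = cut(movies$budget, breaks = c(0, 10, 50, Inf),
                        labels = c("low", "medium", "high"))
fit = lm(imdb ~ year.f + budget.cat, data = movies)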
ANOVA, simple linear regression and multiple linear regression
In a different lecture we contrasted the simple linear regression model with the ANOVA model. In simple linear regression, we model variation in the response with respect to a quantitative variable, whereas in ANOVA we model variation in the response with respect to one or more qualitative variables. Multiple regression is a generalization of both models. When we have both quantitative and qualitative variables in a multiple regression model, we need to understand how to interpret the model and how to model the qualitative variables.
MLR flexibility
if you've got two predicting variables, you can create a number of models (see the sketch after this list), including:
1. first order in X1 and X2
2. second order in X1 and X2
3. first order in X1 and X2 plus the interaction X1*X2
4. second order in X1 and X2 plus the first-order interaction X1*X2
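As a sketch in R with hypothetical variables y, x1, and x2, the four forms could be fit as:
fit1 = lm(y ~ x1 + x2)                              ## 1. first order in X1 and X2
fit2 = lm(y ~ x1 + I(x1^2) + x2 + I(x2^2))          ## 2. second order in X1 and X2
fit3 = lm(y ~ x1 + x2 + x1:x2)                      ## 3. first order plus the interaction
fit4 = lm(y ~ x1 + I(x1^2) + x2 + I(x2^2) + x1:x2)  ## 4. second order plus first-order interaction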
Knowledge check
note residual analysis and goodness of fit
what's meant by the notation X~N(0,1)
X is normally distributed with mean = 0 and variance = 1 (so the standard deviation is also 1).
Marginal vs. Conditional
summary: marginal for SLR (no consideration of other factors); conditional for MLR (consider other factors) - expect different values (and perhaps even different signs) for the parameters of the same predicting variable.
The marginal model, or simple linear regression, captures the association of one predicting variable to the response variable marginally, that is, without consideration of other factors. The conditional model, or multiple linear regression, captures the association of a predicting variable to the response variable conditional on the other predicting variables in the model. Importantly, the estimated regression coefficients for the conditional and marginal relationships can be different, not only in magnitude but also in sign or direction of the relationship. Thus, the two models used to capture the relationship between a predicting variable and a response variable will provide different estimates of the relationship.
why the different lines in the probability density function of chi-square distribution?
when k (degrees of freedom) = 1, you have a high probability of drawing Q near 0. Think about it - if X1 ~ N(0,1), the mean = 0, so you're likely to draw a number near 0, and then you're squaring it. But when you add more squared variables, each drawn from its own distribution, the sum is less likely to be near 0.
do we transform predictor variables?
you bet we do.
baselines for qualitative predictors
## Include all dummy variables without intercept
fit.1 = lm(imdb ~ genre.1 + genre.2 + genre.3 + genre.4 - 1)
## Include 3 dummy variables with intercept
fit.2 = lm(imdb ~ genre.1 + genre.2 + genre.3)
## genre = as.factor(data$genre) - No intercept
fit.3 = lm(imdb ~ genre)
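When genre is a factor, R drops one level and uses it as the baseline. A sketch of changing the baseline with base R's relevel() (the level name "drama" is purely illustrative):
genre = relevel(as.factor(data$genre), ref = "drama")  ## "drama" becomes the baseline category
fit.4 = lm(imdb ~ genre)  ## intercept = expected response for the baseline; other coefficients are offsets from it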
what's the correlation coefficient?
It measures the strength and direction of the linear association between two variables; we can use it for X-Y (predictor vs. response) or X-X (predictor vs. predictor) pairs.
is beta_hat an unbiased estimator of beta?
If the distribution of the error terms is normal (i.e., the response variables are normally distributed), then β hat is also normally distributed, with β as the mean and sigma squared (X'X)^(-1) as the covariance matrix. In short, the estimator β hat is unbiased for beta because the expectation of the estimator is equal to the true parameter.
controlling variables
Controlling variables can be used to control for selection bias in a sample. They're used as default variables to capture more meaningful relationships. They are used in regression for observational studies, for example when there are known sources of selection bias in the sample data. They are not necessarily of direct interest, but once a researcher identifies biases in the sample, he or she will need to correct for those, and will do so through controlling variables.
testing for assumptions of multiple linear regression
1) mean zero / linearity - plot residuals vs. each predicting variable - residuals should be centered around zero with no pattern
2) constant variance - plot residuals vs. fitted values - should not see variance increasing/decreasing along the x-axis (fitted values)
3) uncorrelated errors - plot residuals vs. fitted values - should not see clusters
4) normality - plot a histogram and QQ plot of the residuals (see the R sketch below)
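A minimal sketch of these diagnostics in R, assuming a fitted lm object fit and an illustrative predictor column data$x1:
res = resid(fit)
plot(data$x1, res); abline(h = 0)  ## linearity: residuals centered at zero, no pattern
plot(fitted(fit), res)             ## constant variance: no funnel shape; uncorrelated errors: no clusters
hist(res)                          ## normality: roughly symmetric and bell-shaped
qqnorm(res); qqline(res)           ## normality: points close to the line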
What are the assumptions of multiple linear regression?
1) zero mean assumption of errors, meaning that linearity also holds
2) constant variance assumption - it cannot be true that the model is more accurate for some parts of the population and less accurate for other parts. A violation of this assumption means that the estimates are not as efficient as they could be in estimating the true parameters and would also result in poorly calibrated prediction intervals.
3) the response variables are independently drawn from the data-generating process (uncorrelated errors). Violation of this assumption can lead to a misleading assessment of the strength of the regression.
4) normally distributed errors - if this assumption is violated, hypothesis tests, confidence intervals and prediction intervals can be misleading.
methods to confirm linearity assumption
1. plot all predictors vs. response w/ plot(meddcor[,1:5])
rule of thumb for what's 'large' w/ Cook's distance
A rule of thumb is that when the Cook's distance for a particular observation is larger than 4/n, it could be an indication of an outlier. Often, if D, the Cook's distance, is larger than 1, it is a clear indication of an outlier. But sometimes we simply look for very large values of the Cook's distance.
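A sketch of applying these rules of thumb in R, assuming a fitted lm object fit:
cook = cooks.distance(fit)
n = length(cook)
which(cook > 4 / n)  ## flagged by the 4/n rule of thumb
which(cook > 1)      ## clear indication of an influential observation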
how many degrees of freedom in MLR estimate of variance?
But for the sample variance we actually use n - p - 1 degrees of freedom because, when we replace the error terms with the residuals, we replace p + 1 coefficient parameters -- we replace β0 with β0 hat, β1 with β1 hat, and so on. So we lose p + 1 degrees of freedom. The scaled variance estimator follows a chi-square distribution with n - p - 1 degrees of freedom.
But the covariance matrix depends on sigma square, which we do not know. What should we do?
Estimating sigma squared: we can replace sigma squared with the mean squared error (MSE), which is equal to the sum of squared residuals divided by n - p - 1.
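A sketch of this estimate in R, assuming a fitted lm object fit with p predictors:
n = length(resid(fit))
p = length(coef(fit)) - 1                  ## number of predictors, excluding the intercept
sigma2.hat = sum(resid(fit)^2) / (n - p - 1)
summary(fit)$sigma^2                       ## R's residual standard error squared; matches sigma2.hat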
partial F test
For this, we can use what we call the partial F-test, which means that we compare the extra sum of squares for regression from adding the Z predictors to the model that already has the X's against the sum of squared errors of the full model. Remember, for the F-statistic, we divide the (extra) mean sum of squares for regression by the mean sum of squared errors of the full model. We compare this statistic with the critical point of the F distribution whose degrees of freedom are the number of added predictors and n - p - 1, where p is the total number of predictors in the full model.
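A sketch of the partial F-test in R using anova() on nested models (variable names x1, x2, z1, z2 are illustrative):
reduced = lm(y ~ x1 + x2)            ## model with the X's only
full    = lm(y ~ x1 + x2 + z1 + z2)  ## model that also includes the Z's
anova(reduced, full)                 ## partial F-test: extra SSReg from the Z's against the MSE of the full model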
marginal vs. conditional
Marginal is like simple linear regression.
## Marginal versus Conditional Modeling
summary(aov(imdb ~ rtdirector))
summary(lm(imdb ~ duration))
summary(lm(imdb ~ earnings))
why use MLR?
Multiple linear regression allows for quantifying the relationship of a predicting variable to a response when other factors vary.
with interaction terms, do the models stay parallel?
No - the interaction allows the slope to differ across the categories, so the fitted lines are no longer parallel.
fit accuracy measures
PM is Serban's favorite: the closer to zero, the better, and it is not scale sensitive. It can also be used for comparing various forecasting models, which we'll not do in this example. The most common reported measures of prediction accuracy are:
● MSPE = mean squared prediction error = the average of the squared differences between predicted and observed responses
● MAE = mean absolute prediction error = the average of the absolute values of the differences
● MAPE = mean absolute percentage error = a percentage measure, the average of the absolute differences scaled by the observed responses
● PM = precision measure = the ratio between the sum of squared prediction errors and the sum of squared differences between the observed responses and the mean of the observed responses
These measures can be computed as in the sketch below.
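A sketch of computing these in R, assuming observed test responses y.new and predictions y.pred (names are illustrative):
err  = y.new - y.pred
MSPE = mean(err^2)                                ## mean squared prediction error
MAE  = mean(abs(err))                             ## mean absolute prediction error
MAPE = mean(abs(err) / y.new)                     ## mean absolute percentage error
PM   = sum(err^2) / sum((y.new - mean(y.new))^2)  ## precision measure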
predictive variables
Predictive variables can be used to best predict variability in the response regardless of their explanatory power. Thus, when selecting explanatory variables, the objective is to explain the variability in the response. Whereas when selecting predictive variables, the objective is to predict the response.
F comparison test to compare full model vs. reduced model
So at least one predicting variable among the four has improved the predictive power of the model versus the model that doesn't include the four predictors.
what's one of the reasons that we do linear regression analysis?
So why restrict ourselves to linear models? Well, they are simpler to understand, and they're simpler mathematically. But most importantly, they work well for a wide range of circumstances.
Y_hat estimation vs. predict
Specifically, the uncertainty in the estimated regression line comes from the estimation alone, that is, from the estimation of the regression coefficients. Whereas for prediction, the uncertainty comes from both the estimation of the regression coefficients and the newness of the observation.
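A sketch in R, assuming a fitted lm object fit and a data frame newx of new predictor values:
predict(fit, newdata = newx, interval = "confidence")  ## uncertainty from estimating the coefficients only
predict(fit, newdata = newx, interval = "prediction")  ## also includes the new observation's variability, so it is wider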
In our MLR on the SAT data, why does takers have a large P value and rank does not?
Takers and rank are highly correlated; perhaps removing rank would result in takers having a small P value.
are the model parameters exact estimates?
The model parameters are: b0, b1, ..., bp, sigma_squared. They are:
- unknown regardless of how much data are observed
- estimated given the model assumptions
- estimated based on data
What's the F-test null hypothesis?
The null hypothesis is that all coefficients (not the intercept) = 0.
checking for correlation (clustering) of residuals, NOT independence
Using residual analysis, we are checking for uncorrelated errors but not independence; Independence is a more complicated matter; if the data are from a randomized experiment, then independence holds; but most data are from observational studies. We commonly correct for selection bias in observational studies using controlling variables.
What are we looking for in this image to confirm linearity?
What we're looking for is that the residuals must be spread randomly across the 0 line.
What's the estimate of Y?
Y_hat is the regression line itself
Can year be qualitative and quantitative?
Yes. Normally, if observations are over many years, year is quantitative; with only a few distinct years, it can instead be treated as qualitative (a factor).
what transform can we apply if we have non-constant variance and/or normality issues?
The Box-Cox transformation of the response.
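A sketch using boxcox() from the MASS package, assuming a fitted lm object fit:
library(MASS)
bc = boxcox(fit)                ## profiles the log-likelihood over the power parameter lambda
lambda = bc$x[which.max(bc$y)]  ## lambda that maximizes the log-likelihood
## lambda near 0 suggests a log transformation of the response; lambda near 1 suggests no transformation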
When there is such a distinct gap in a variable's distribution, sometimes it's a good idea to consider transforming it from a continuous variable to an indicator variable.
testing subsets of coefficients
how much SSReg increases when adding on another variable
testing null hypotheses of beta for statistical significance if beta = 0 is the null hypothesis
if beta = 0 is the null hypothesis, divide the estimated coefficient by its standard error and compare the resulting t-value to the t distribution with n - p - 1 degrees of freedom
is a beta statistically significant?
if it doesn't have 0 in its confidence interval, yeah
testing null hypotheses of beta for statistical significance if testing null hypothesis of beta = constant
if testing the null hypothesis beta = c for some constant c, subtract c from the estimated coefficient before dividing by its standard error; the reference distribution is the same (see the formulas below)
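In standard notation, the two t-statistics are
$t = \hat{\beta}_j / se(\hat{\beta}_j)$ for $H_0: \beta_j = 0$, and $t = (\hat{\beta}_j - c) / se(\hat{\beta}_j)$ for $H_0: \beta_j = c$,
both compared to the t distribution with $n - p - 1$ degrees of freedom.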
causality
it's hard to prove causality w/o a controlled research experiment. You can at least call relationships b/w predictors and response variables 'associative'