Econometrics lecture note 6 - Multiple linear regression estimation
Beware interpretations of R squared and adjusted R squared
The adjusted R squared is useful because it summarises the extent to which the regressors explain the variation in Y. Maximising the adjusted R squared is rarely the goal in practice; if it is too close to 1, that is a sign of a logical problem with the regression model.
Further on the dummy variable trap - how to solve it
You can't have a set of dummy variables that add up to equal another dummy variable (or the constant regressor); if they do, the regressors are perfectly multicollinear.
Heteroskedasticity
The error term in a regression is homoskedastic if its variance, conditional on the regressors, is constant (does not depend on X); otherwise the error term is heteroskedastic. (We work under the assumption of heteroskedasticity.)
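In symbols, a standard way to write the two cases:
var(uᵢ | Xᵢ) = σᵤ² for every i (a constant) - homoskedastic
var(uᵢ | Xᵢ) depends on Xᵢ - heteroskedastic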
Standard error of regression (SER)
Estimates the standard deviation of the error term u.
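The usual formula (with n observations and k regressors plus a constant):
SER = √(SSR / (n − k − 1)), where SSR = Σᵢ ûᵢ² is the sum of squared residuals.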
Multiple linear regression
Extends the single linear regression model to include additional variables as regressors. Allows us to estimate the effect on Y of changing one variable while holding all other regressors constant. A key tool for eliminating omitted variable bias.
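With k regressors, the population model (in the standard notation used throughout these notes) is:
Yᵢ = ß₀ + ß₁X₁ᵢ + ß₂X₂ᵢ + … + ßₖXₖᵢ + uᵢ,  i = 1, …, n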
Fixing omitted variable bias
One approach: focus on schools with similar levels of income, so that income is held approximately fixed across the comparison.
R squared
The fraction of the sample variance of Y explained by (or predicted by) the regressors.
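In the usual notation:
R² = ESS/TSS = 1 − SSR/TSS, where ESS is the explained sum of squares, SSR the sum of squared residuals, and TSS the total sum of squares of Y.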
Omitted variable bias
If a regressor is correlated with a variable that has been omitted from the analysis AND that omitted variable determines, in part, the dependent variable, then the OLS estimator of the effect of interest will suffer from omitted variable bias (OVB), so that E[ß^₁] ≠ ß₁. The error term of the regression contains these confounding variables.
Signing omitted variable bias
In our example, there is a positive relationship between income and test score (Y), so a positive sign (+), and a negative relationship between income and class size (X), so a negative sign (−). Take the sign of the correlation between the omitted variable and Y and multiply it by the sign of the correlation between the omitted variable and X: this gives the sign of the omitted variable bias. Here (+) × (−) = (−), so the bias is negative (downward).
The constant
The intercept is the expected value of Y when all regressors are equal to zero. We can equivalently write the regression as including an additional regressor X₀, a dummy variable equal to one for every observation; X₀ is called the constant regressor.
Summary
Just look at what's been written on the slides!
When does omitted variable bias exist
It occurs when the omitted variable satisfies two conditions: 1. It is correlated with an included regressor. 2. It helps determine the dependent variable. If omitted variable bias exists then E[ß^₁] ≠ ß₁: the OLS estimator of ß₁ is biased, and the usual machinery for estimating and testing the regression fails.
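A minimal simulation sketch of this (hypothetical numbers, numpy only): when an omitted variable W is correlated with X and also determines Y, the "short" regression of Y on X alone gives a biased slope, while including W removes the bias.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data-generating process: W is omitted, correlated with X,
# and helps determine Y. The true direct effect of X on Y is -1.0.
w = rng.normal(size=n)                    # omitted variable (e.g. income)
x = -0.5 * w + rng.normal(size=n)         # regressor correlated with W
y = 2.0 - 1.0 * x + 1.5 * w + rng.normal(size=n)

# "Short" regression of Y on X only: W ends up in the error term.
X_short = np.column_stack([np.ones(n), x])
b_short = np.linalg.lstsq(X_short, y, rcond=None)[0]

# "Long" regression including W: both OVB conditions are removed.
X_long = np.column_stack([np.ones(n), x, w])
b_long = np.linalg.lstsq(X_long, y, rcond=None)[0]

print("short-regression slope (biased):  ", round(b_short[1], 3))  # well below -1 (downward bias)
print("long-regression slope (unbiased): ", round(b_long[1], 3))   # close to -1
```

The signs match the rule above: W raises Y (+) and is negatively correlated with X (−), so the short-regression slope is biased downward.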
Omitted variable bias and OLS assumption 1
Omitted variable bias means our first least squares assumption, E[u|X] = 0, fails. u contains all factors other than X that are determinants of Y. If one of these factors is correlated with X, then u is correlated with X. For example, if income is a determinant of test score (Y) and we omit it, then it is in u; and if income is also correlated with class size, then u will be correlated with X. Because u and X are correlated in the presence of an omitted variable, the conditional mean of u given X is not zero: if corr(uᵢ,Xᵢ) ≠ 0, then E[uᵢ|Xᵢ] ≠ 0.
Avoiding the dummy variable trap
Only include G − 1 of the G dummy variables in the regression. The dummy we do not include defines the base category (also called the base group or omitted category). We interpret each included dummy as the change in the outcome variable when that dummy equals 1, relative to the base group, holding all other regressors constant. Alternatively, we can include all G dummies and drop the constant from the regression (very uncommon). In sum, when your software indicates you have perfect multicollinearity, eliminate it by: 1. Determining the source of the perfect multicollinearity. 2. Creating a base group (dropping one dummy). 3. Ensuring you interpret the regression coefficients on the remaining dummies relative to the omitted base group.
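A minimal sketch, assuming a pandas DataFrame with a hypothetical categorical column "region", showing the trap and the fix of dropping one dummy:

```python
import pandas as pd

# Hypothetical data: 'region' has G = 3 categories.
df = pd.DataFrame({
    "region": ["north", "south", "west", "north", "south", "west"],
    "score":  [650, 640, 655, 660, 635, 658],
})

# The trap: the G dummies sum to 1 in every row, i.e. to the constant regressor.
all_dummies = pd.get_dummies(df["region"])
print(all_dummies.sum(axis=1).tolist())   # every entry is 1 -> perfectly collinear with the constant

# The fix: keep only G - 1 dummies; the dropped category ("north", the first
# alphabetically) is the base group the remaining coefficients are read against.
dummies = pd.get_dummies(df["region"], drop_first=True)
print(dummies.columns.tolist())           # ['south', 'west']
```

The coefficients on the kept dummies are then interpreted relative to the dropped base group, as described above.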
Econometrically modelling test scores
Other variables, like income, affect both test scores and class size - lower income is associated with larger class sizes and lower scores. In words, if a variable like income varies across classes, this creates a negative relationship between class size and test score. This has serious implications for the OLS estimate, because ß₁ is designed to capture the direct link alone, and the OLS coefficient will fail to isolate it. The OLS estimate is driven by two forces: 1. The direct relationship we want to determine empirically. 2. A separate, indirect correlation due to differences in another variable.
Dummy variable trap
A possible source of perfect multicollinearity arises when multiple dummy variables are used as regressors. If the dummies cover every observation, adding them up always equals one, which is the constant regressor, so the regressors are perfectly linearly related.
R squared in multiple regression
R squared always rises when a regressor is added to the regression, unless the estimated coefficient on the added regressor is exactly 0 (rare). Because of this, we often work with the adjusted R squared, a modified version that does not necessarily increase when a new regressor is added.
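The usual adjustment, with n observations and k regressors:
adjusted R² = 1 − [(n − 1)/(n − k − 1)] · (SSR/TSS)
The factor (n − 1)/(n − k − 1) penalises adding regressors, so the adjusted R² can fall when a new regressor adds little explanatory power.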
Imperfect multicollinearity
A related issue, which arises when one regressor is highly, but not perfectly, correlated with the other regressors. It does not prevent statistics programs from providing OLS estimates, but it results in regression coefficients that are estimated imprecisely, have large standard errors, and are therefore often statistically insignificant. Intuitively, if two regressors are highly correlated and almost always co-move together, it is hard to disentangle their individual impacts on the dependent variable. Whereas perfect multicollinearity arises because of a logical mistake in the regression set-up, imperfect multicollinearity is not necessarily an error; it is a function of the data, OLS, and the question you are trying to address.
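A minimal numpy sketch (hypothetical data) comparing how spread out the OLS estimate of ß₁ is when X₁ and X₂ are nearly uncorrelated versus highly correlated:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 200, 2000

def slope_spread(rho):
    """Std dev of the OLS estimate of beta1 across simulated samples,
    when corr(X1, X2) is approximately rho."""
    estimates = []
    for _ in range(reps):
        x1 = rng.normal(size=n)
        x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
        y = 1.0 + 2.0 * x1 + 2.0 * x2 + rng.normal(size=n)
        X = np.column_stack([np.ones(n), x1, x2])
        b = np.linalg.lstsq(X, y, rcond=None)[0]
        estimates.append(b[1])
    return np.std(estimates)

print("spread of b1, corr ~ 0.0 :", round(slope_spread(0.0), 3))
print("spread of b1, corr ~ 0.99:", round(slope_spread(0.99), 3))  # much larger standard error
```

The estimates remain centred on the true value in both cases; high (but imperfect) correlation only makes them much noisier, which is exactly the imprecision described above.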
Student test score example
Richer regressions allow us to control for many other variables that predict test scores, avoiding omitted variable bias and isolating the relationship between class size and test scores, which is the main effect of interest.
How to deal with omitted variable bias
Should we expect ß^₁ to be bigger or smaller than ß₁? Conceptually, when we interpret our OLS estimate ß^₁, we can think of it as containing two parts: ß^₁ = ß₁ + γ, where ß₁ is the direct relationship and γ is the indirect relationship. Given we expect γ < 0, we expect the regression to yield ß^₁ < ß₁, i.e. a downward-biased estimate of the relationship (the bias goes the other way if γ > 0). The magnitude of the estimate may therefore be too large relative to the true value, as it is confounded by other variables.
Formula for omitted variable bias
Suppose least squares assumptions 2 and 3 hold, but assumption 1 does not. Let corr(uᵢ,Xᵢ) = ρXᵤ denote the correlation between Xᵢ and uᵢ in the single linear regression.
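The standard statement of the result this heading refers to (completing the formula): as the sample size n grows,
ß^₁ → ß₁ + ρXᵤ · (σᵤ / σX)
so the OLS estimator converges to the true ß₁ plus a bias term whose sign is the sign of ρXᵤ and whose size grows with the strength of the correlation between X and u. The bias does not vanish in large samples.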
What does all of this mean
Suppose you get a statistically significant ß^₁: it could be driven by another variable, and the direct link might not exist at all. You may have a statistically significant ß^₁ that is driven simply by the fact that higher-income schools tend to have smaller class sizes and higher-income kids do better on tests. There could be no direct relationship, with the estimate driven entirely by the indirect relationship.
Control variables
Taking conditional expectations at a point where X₁ = x₁ and X₂ = x₂ gives the population regression function, written out below. We often refer to some of the regressors in a multiple linear regression as control variables. Interpreting ß₁, we say it is the relationship between X₁ and Y holding X₂ fixed (or controlling for X₂).
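For the two-regressor case, the population regression function and the partial-effect interpretation of ß₁ are:
E[Y | X₁ = x₁, X₂ = x₂] = ß₀ + ß₁x₁ + ß₂x₂
E[Y | X₁ = x₁ + ΔX₁, X₂ = x₂] − E[Y | X₁ = x₁, X₂ = x₂] = ß₁ΔX₁
so ß₁ is the expected change in Y from changing X₁ by one unit while holding X₂ fixed.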
Perfect multicollinearity example: huge class
Two important aspects: 1. Perfect multicollinearity can arise because of the constant regressor. 2. Perfect multicollinearity is specific to the dataset you have at hand; we could imagine classes with more than 35 students (and if the data contained some, the perfect multicollinearity would disappear).
Perfect multicollinearity
Two or more regressors exhibit perfect multicollinearity if one of them is a perfect linear combination of the others. You can't hold something fixed and change it at the same time: for a group of perfectly collinear regressors, it is impossible to hold one fixed while estimating the effect of another on the dependent variable. In practice: your software will typically flag it with an error message; fix it by dropping one variable or otherwise modifying the set of regressors to eliminate the problem.
OLS estimation with multiple linear regression
We use the ordinary least squares (OLS) estimator to estimate the regression coefficients of the multiple linear regression model. The OLS estimator finds the regression coefficients that minimise the mistakes the model makes in predicting the dependent variable given the regressors; the OLS estimators together minimise the sum of squared prediction mistakes.
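Formally, the OLS estimators solve:
min over b₀, b₁, …, bₖ of Σᵢ (Yᵢ − b₀ − b₁X₁ᵢ − … − bₖXₖᵢ)²
and the minimising values are the OLS estimates ß^₀, ß^₁, …, ß^ₖ.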
Coefficient interpretation
When we interpret the coefficients, we imagine changing only one regressor at a time, leaving the others fixed. ß₁ is often called the partial effect of X₁ on Y, which emphasises our focus on changing just one regressor while holding the other regressors fixed.
Income, class size, and the error term: impact on the regression
Yields bias in ß^₁