Econometrics - test 1: Multiple Linear Regression Assumptions and Theorems
Normality of error terms - assumption
It is assumed that the unobserved factors are normally distributed around the population regression function.
Gauss-Markov assumptions
MLR.1 - MLR.5
Classical linear model (CLM) assumptions
MLR.1 - MLR.6
Assumption of Homoscedasticity
MLR.5 The values of the explanatory variables must contain no information about the variance of the unobserved factors (not to be confused with exogeneity, which requires that they contain no information about the MEAN of the unobserved factors): Var(Ui | Xi1, Xi2, ..., Xik) = σ². Shorthand notation: Var(Ui | Xi) = σ². This assumption may also be hard to justify in many cases.
In some cases, normality can be achieved through
transformations of the dependent variable (e.g. use log(wage) instead of wage)
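As a rough illustration of this point, here is a minimal Python sketch; the data-generating process for wage and educ below is entirely assumed for the example. Residuals from a regression of wage on educ come out strongly right-skewed, while residuals from the log(wage) regression are roughly symmetric.

```python
import numpy as np

def skewness(a):
    # simple sample skewness: E[(a - mean)^3] / sd^3
    a = a - a.mean()
    return np.mean(a ** 3) / np.mean(a ** 2) ** 1.5

rng = np.random.default_rng(9)
educ = rng.uniform(8, 20, size=2000)
# assumed data-generating process: wage is lognormal given educ
wage = np.exp(0.5 + 0.08 * educ + rng.normal(scale=0.5, size=2000))
X = np.column_stack([np.ones(2000), educ])

for dep in (wage, np.log(wage)):
    b = np.linalg.lstsq(X, dep, rcond=None)[0]
    print(round(skewness(dep - X @ b), 2))   # strongly skewed for wage, near 0 for log(wage)
```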
Theorem 3.1
(Unbiasedness of OLS) Under assumptions MLR.1 - MLR.4: E(Bjhat) = Bj for j = 0, 1, ..., k
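A minimal Monte Carlo sketch of Theorem 3.1 (the data-generating process and the true coefficient values below are assumptions chosen for illustration): averaging the OLS estimates over many random samples recovers the true coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = np.array([1.0, 2.0, -0.5])           # true B0, B1, B2 (assumed for illustration)
estimates = []
for _ in range(5000):                        # repeated samples from the same population
    x1 = rng.normal(size=200)
    x2 = rng.normal(size=200)
    u = rng.normal(size=200)                 # error satisfies MLR.4 by construction
    y = beta[0] + beta[1] * x1 + beta[2] * x2 + u
    X = np.column_stack([np.ones(200), x1, x2])
    estimates.append(np.linalg.lstsq(X, y, rcond=None)[0])

print(np.mean(estimates, axis=0))            # close to [1.0, 2.0, -0.5]: E(Bjhat) = Bj
```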
Gauss-Markov Theorem
- Assumptions (conditions): MLR1 - MLR5
No perfect collinearity
- In the sample (and therefore in the population), none of the independent variables is constant (similar to SLR.3) and there are no exact linear relationships among the independent variables. This only rules out perfect collinearity/correlation between explanatory variables; imperfect correlation is allowed. Constant variables are also ruled out (collinear with the intercept).
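A small numpy sketch of the failure case (variable names and numbers are illustrative): when one regressor is an exact linear function of another, the design matrix loses rank and X'X cannot be inverted, so the OLS coefficients are not uniquely determined.

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
x2 = 3.0 * x1 + 2.0                          # exact linear function of x1 -> perfect collinearity
X = np.column_stack([np.ones(50), x1, x2])

print(np.linalg.matrix_rank(X))              # 2, not 3: columns are linearly dependent
print(np.linalg.det(X.T @ X))                # (numerically) zero -> X'X cannot be inverted
```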
Components of OLS Variances:
1) The error variance; 2) The total sample variation in the explanatory variable; 3) Linear relationships among the independent variables
Components of OLS Variances: 1) The error variance:
A high error variance increases the sampling variance because there is more "noise" in the equation. A large error variance necessarily makes estimates imprecise. The error variance does not decrease with the sample size.
- Omitting relevant variables
In general, all estimated coefficients will be biased. Instead of estimating B1, we estimate B1 + B2*d1, where d1 is the slope from a regression of the omitted variable on x1. There is no bias if B2 equals zero (the omitted variable is irrelevant) or d1 equals zero (the omitted variable is uncorrelated with x1).
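A hedged simulation sketch of this result (the true parameter values are assumed for illustration): omitting x2 from the estimated model leaves the short-regression slope centered on B1 + B2*d1 rather than on B1.

```python
import numpy as np

rng = np.random.default_rng(2)
b1, b2, d1 = 1.0, 0.8, 0.5                   # illustrative true values
slopes = []
for _ in range(5000):
    x1 = rng.normal(size=300)
    x2 = d1 * x1 + rng.normal(size=300)      # omitted variable, correlated with x1
    y = b1 * x1 + b2 * x2 + rng.normal(size=300)
    X_short = np.column_stack([np.ones(300), x1])   # x2 omitted from the estimated model
    slopes.append(np.linalg.lstsq(X_short, y, rcond=None)[0][1])

print(np.mean(slopes))                       # approx b1 + b2*d1 = 1.4, not b1 = 1.0
```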
superfluous variable
An explanatory variable that is a perfect linear combination of other explanatory variables
Unbiasedness
Average property in repeated samples; in a given sample, the estimates may still be far away from the true values.
Multicollinearity problem
Dropping some independent variables may reduce multicollinearity (but this may lead to omitted variable bias). Only the sampling variance of the variables involved in the multicollinearity will be inflated; the estimates of the other effects may be very precise. Multicollinearity is not a violation of MLR.3 in the strict sense.
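One common way to quantify this in practice is the variance inflation factor VIF_j = 1/(1 - Rj²); the sketch below (the data and the vif helper are illustrative, not from the text) computes it with plain numpy.

```python
import numpy as np

def vif(X, j):
    """Variance inflation factor: 1 / (1 - Rj^2), where Rj^2 comes from
    regressing column j of X on the remaining columns (plus a constant)."""
    y = X[:, j]
    others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    resid = y - others @ np.linalg.lstsq(others, y, rcond=None)[0]
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(3)
x1 = rng.normal(size=500)
x2 = 0.95 * x1 + 0.3 * rng.normal(size=500)   # strongly (but not perfectly) correlated with x1
x3 = rng.normal(size=500)                     # unrelated to x1 and x2
X = np.column_stack([x1, x2, x3])
print([round(vif(X, j), 2) for j in range(3)])  # large for x1, x2; near 1 for x3
```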
endogenous variables
Explanatory variables that are correlated with the error term
exogenous variables
Explanatory variables that are uncorrelated with the error term
MLR.5
Homoskedasticity
More likely to hold zero conditional mean: simple or multiple regression?
In a multiple regression model, the zero conditional mean assumption is much more likely to hold because fewer things end up in the error
Assumption: Linear in parameters
In the population, the relationship between y and the explanatory variables is linear: y = B0 + B1X1 + B2X2 + ... + BkXk + u
Exogeneity
Key assumption for a causal interpretation of the regression, and for unbiasedness of the OLS estimators. The values of the explanatory variables must contain no information about the mean of the unobserved factors.
MLR.1
Linear in parameters
Gauss-Markov Theorem definition
Mathematical result stating that, under certain conditions, the OLS estimator is the best linear unbiased estimator (BLUE) of the regression coefficients conditional on the values of the regressors.
Components of OLS Variances: 2) The total sample variation in the explanatory variable:
More sample variation leads to more precise estimates. Total sample variation automatically increases with the sample size. Increasing the sample size is thus a way to get more precise estimates.
Will irrelevant variables cause bias in a model?
No
MLR.3
No perfect collinearity
MLR.6
Normality of error terms
When is OLS BLUE?
OLS is only the best estimator if MLR.1 - MLR.5 hold; if there is heteroskedasticity, for example, there are better estimators.
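A sketch of this point under an assumed form of heteroskedasticity (Var(u|x) = x², taken as known for the sake of the example): weighted least squares, i.e. OLS on data rescaled by 1/x, is linear and unbiased but has a smaller sampling variance than OLS, so OLS is no longer best.

```python
import numpy as np

rng = np.random.default_rng(4)
ols_slopes, wls_slopes = [], []
for _ in range(3000):
    x = rng.uniform(1.0, 5.0, size=200)
    u = rng.normal(size=200) * x             # Var(u|x) = x^2: heteroskedasticity, MLR.5 fails
    y = 1.0 + 2.0 * x + u
    X = np.column_stack([np.ones(200), x])

    ols_slopes.append(np.linalg.lstsq(X, y, rcond=None)[0][1])

    w = 1.0 / x                               # weight by 1/sd(u|x), assumed known here
    Xw, yw = X * w[:, None], y * w
    wls_slopes.append(np.linalg.lstsq(Xw, yw, rcond=None)[0][1])

print(np.std(ols_slopes), np.std(wls_slopes))   # WLS slope varies less across samples
```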
Partialling Out
One can show that the estimated coefficient of an explanatory variable in a multiple regression can be obtained in two steps: 1) regress the explanatory variable on all other explanatory variables; 2) regress y on the residuals from this regression.
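A minimal numpy demonstration of this two-step procedure, with an assumed data-generating process: the coefficient on x1 from the full multiple regression coincides with the slope from regressing y on the step-1 residuals.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 400
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

# Full multiple regression of y on a constant, x1 and x2
X = np.column_stack([np.ones(n), x1, x2])
b_full = np.linalg.lstsq(X, y, rcond=None)[0]

# Step 1: regress x1 on the other regressors (constant and x2), keep the residuals
Z = np.column_stack([np.ones(n), x2])
r1 = x1 - Z @ np.linalg.lstsq(Z, x1, rcond=None)[0]

# Step 2: regress y on those residuals (no constant needed: r1 has mean zero)
b_partial = (r1 @ y) / (r1 @ r1)

print(b_full[1], b_partial)                  # identical coefficient on x1
```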
MLR.2
Random sampling
Components of OLS Variances: 3) Linear relationships among the independent variables
Regress xj on all other independent variables (including a constant). The R-squared of this regression (Rj²) will be higher the better xj can be linearly explained by the other independent variables, and the sampling variance of Bjhat increases with this R-squared. The problem of almost linearly dependent explanatory variables is called multicollinearity.
Theorem 3.2
Sampling variances of the OLS slope estimators. Under assumptions MLR.1 - MLR.5: Var(Bjhat) = σ² / [SSTj (1 - Rj²)], where SSTj is the total sample variation in xj and Rj² is the R-squared from regressing xj on all other explanatory variables (and a constant).
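A quick numerical check of this formula (the data-generating process is assumed for illustration): computing sigma²hat / (SSTj (1 - Rj²)) gives the same value as the usual matrix expression sigma²hat * [(X'X)^(-1)]_jj.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])

b = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ b
sigma2 = resid @ resid / (n - X.shape[1])      # unbiased estimate of the error variance

# Var(B1hat) via the textbook decomposition: sigma^2 / (SST_1 * (1 - R_1^2))
sst1 = np.sum((x1 - x1.mean()) ** 2)
Z = np.column_stack([np.ones(n), x2])
r1 = x1 - Z @ np.linalg.lstsq(Z, x1, rcond=None)[0]
R2_1 = 1.0 - (r1 @ r1) / sst1                  # R^2 from regressing x1 on the other regressors
var_formula = sigma2 / (sst1 * (1.0 - R2_1))

# Same quantity from the matrix formula sigma^2 * (X'X)^{-1}
var_matrix = sigma2 * np.linalg.inv(X.T @ X)[1, 1]
print(var_formula, var_matrix)                 # identical
```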
Random sampling
The data is a random sample drawn from the population {(xi1, xi2, ..., xik, yi): i = 1, ..., n}. Each data point therefore follows the population equation: Yi = B0 + B1Xi1 + B2Xi2 + ... + BkXik + Ui
Why partialling out works?
The residuals from the first regression are the part of the explanatory variable that is uncorrelated with the other explanatory variables. The slope coefficient of the second regression therefore represents the isolated effect of the explanatory variable on the dependent variable.
Zero conditional mean
The values of the explanatory variables must contain no information about the mean of the unobserved factors: E(ui | xi1, xi2, ..., xik) = 0
The Gauss-Markov Theorem
Under assumptions MLR.1 - MLR.4, OLS is unbiased. However, under these assumptions there may be many other estimators that are unbiased. Which one is the unbiased estimator with the smallest variance? OLS, provided MLR.5 also holds.
Theorem 3.4 (Gauss-Markov Theorem)
Under assumptions MLR.1 - MLR.5, the OLS estimators are the best linear unbiased estimators (BLUEs) of the regression coefficients, i.e. they have the smallest variance among all linear unbiased estimators.
normality assumption - discussion
Under normality (MLR.6), OLS is the best unbiased estimator, even among nonlinear estimators.
Endogeneity
Violation of assumption MLR.4 (zero conditional mean): explanatory variables that are correlated with the error term are endogenous. The values of the explanatory variables then contain information about the mean of the unobserved factors.
When does MLR.4 hold?
When all explanatory variables are exogenous (exogeneity)
MLR.4
Zero conditional mean (of the error given the explanatory variables)
sampling variances of the OLS estimators
The formulas are only valid under assumptions MLR.1 - MLR.5 (in particular, there has to be homoskedasticity)
What problem irrelevant factors generate?
Including irrelevant variables may increase the sampling variance of the estimators.
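A simulation sketch (all numbers assumed for illustration): including an irrelevant regressor that is correlated with x1 leaves B1hat unbiased but inflates its sampling variance relative to the correctly specified model.

```python
import numpy as np

rng = np.random.default_rng(7)
short_slopes, long_slopes = [], []
for _ in range(3000):
    x1 = rng.normal(size=200)
    x2 = 0.8 * x1 + rng.normal(size=200)     # irrelevant (true coefficient 0) but correlated with x1
    y = 1.0 + 2.0 * x1 + rng.normal(size=200)

    Xs = np.column_stack([np.ones(200), x1])
    Xl = np.column_stack([np.ones(200), x1, x2])
    short_slopes.append(np.linalg.lstsq(Xs, y, rcond=None)[0][1])
    long_slopes.append(np.linalg.lstsq(Xl, y, rcond=None)[0][1])

# Both are centered on 2.0 (no bias), but the model with the irrelevant
# regressor has a larger sampling variance for B1hat.
print(np.mean(short_slopes), np.mean(long_slopes))
print(np.var(short_slopes), np.var(long_slopes))
```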
Larger sample sizes:
The t-distribution is close to the Normal(0,1) distribution (MLR.6 not needed). t-tests are therefore valid in large samples without MLR.6; MLR.1 - MLR.5 are still necessary, esp. homoskedasticity.
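A rough large-sample illustration (the error distribution and sample size are assumptions): even with clearly non-normal errors, the t-statistic for the true slope rejects at roughly the 5% rate when compared with the Normal(0,1) critical value.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 1000
tstats = []
for _ in range(3000):
    x = rng.normal(size=n)
    u = rng.exponential(size=n) - 1.0        # skewed, clearly non-normal errors (MLR.6 violated)
    y = 1.0 + 2.0 * x + u
    X = np.column_stack([np.ones(n), x])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    sigma2 = resid @ resid / (n - 2)
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
    tstats.append((b[1] - 2.0) / se)         # t-statistic for the true slope value

# Close to N(0,1): roughly 5% of |t| exceed 1.96 despite non-normal errors
print(np.mean(np.abs(np.array(tstats)) > 1.96))
```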