Chapter 9 Multiple Regression
Durbin-Watson in SAS
proc reg; model ret=ret_w ret_usd Jan / dwprob; run;
Adjusted R Square
Need to adjust because every time we add a variable, R squared always increases regardless of whether the new variable adds explanatory power. This makes it hard to compare between models.
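A minimal sketch of the adjustment (the R-squared values here are made up for illustration):

```python
# Adjusted R-squared penalizes extra regressors so that models with
# different numbers of variables can be compared.
def adjusted_r2(r2, n, k):
    """n = number of observations, k = number of independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# A nearly useless fourth variable nudges R^2 up, but adjusted R^2 falls.
print(round(adjusted_r2(0.800, 60, 3), 4))  # 0.7893
print(round(adjusted_r2(0.801, 60, 4), 4))  # 0.7865
```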
Serial Correlation vs. Heteroskedasticity
Serial: regression errors are correlated across observations/error terms. Hetero: the variance of the error terms differs across observations (the constant-error-variance assumption is violated).
Assumptions in Multiple Regression
-Linear relationship -Independent variables are not random, and no exact linear relationship exists among them -Expected value of residuals is zero -Constant variance of residuals [i.e., homoskedastic error terms] -Residuals are independent [uncorrelated residuals] -Residuals are normally distributed
How to detect Heteroskedasticity?
1. Look at the residuals: very small in one region and very large in another means the variance is not constant. 2. Look at correlations between the squared error terms and the independent variables. 3. In PROC REG, add /spec to the model line to apply a test (e.g., the Breusch-Pagan chi-squared test).
How many dummy variables for 4 quarters?
3. Q1=1 in the first quarter, Q2 and Q3 work the same way, and each is 0 otherwise. When it's the fourth quarter, they are all zero; the fourth quarter is picked up by the intercept.
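A sketch of the encoding (the helper name is made up): k categories get k-1 dummies, and the base category is the one where all dummies are zero.

```python
# One dummy per quarter except the base quarter (Q4), which the
# intercept captures.
def quarter_dummies(quarter):
    """quarter in {1, 2, 3, 4} -> (Q1, Q2, Q3) dummy tuple."""
    return tuple(1 if quarter == q else 0 for q in (1, 2, 3))

print(quarter_dummies(1))  # (1, 0, 0)
print(quarter_dummies(4))  # (0, 0, 0): Q4 is the base case, in the intercept
```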
Possible remedies to minimize multicolinearity:
◦Stepwise regression: add independent variables one at a time; if adding x2 doesn't help explain y, don't add it. Look at adjusted R squared. ◦Or, in reverse, start with all independent variables and remove redundant ones. ◦Additional or new data. There is no formal test to determine multicollinearity.
Calculated T vs Critical value T
Reject the null hypothesis when the absolute value of the calculated t-statistic exceeds the critical t-value; otherwise, fail to reject.
Model mis-specifications
A model is misspecified when it violates the assumptions underlying linear regression, its functional form is incorrect, or it contains time-series specification problems.
What are the two types of heteroskedasticity?
Conditional and unconditional
How do we know there is serial correlation?
Durbin-Watson statistic. It calculates a d-statistic: D≈2 no autocorrelation; D<2 positive serial correlation; D>2 negative serial correlation.
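The d-statistic is simple to compute from the residuals (the residual series below are made up to show the two cases):

```python
# Durbin-Watson: d = sum((e_t - e_{t-1})^2) / sum(e_t^2).
# d ~ 2 -> no autocorrelation; d < 2 -> positive; d > 2 -> negative.
def durbin_watson(resid):
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(e ** 2 for e in resid)
    return num / den

pos = [1.0, 0.9, 0.8, 0.7, 0.6]    # residuals drifting together
alt = [1.0, -1.0, 1.0, -1.0, 1.0]  # residuals flipping sign each period
print(durbin_watson(pos))  # well below 2: positive serial correlation
print(durbin_watson(alt))  # well above 2: negative serial correlation
```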
Residuals
The difference between your Y and your Y-hat (the fitted value).
Dummy Variable
Takes the value 1 or 0 (not 1 or 2).
Sources of mis-specification cont....
Error term correlated with independent variables (time-series misspecification).
ANOVA table goal
Explain the variation of Y through X.
What is RSS?
Explained variation of the dependent variable
Issues in Regression Analysis:
Heteroskedasticity Autocorrelation Multicollinearity
Avoiding misspecification
If independent or dependent variables are nonlinear, use an appropriate transformation to make them linear. For example, use common-size statements or log-based transformations. Avoid independent variables that are mathematical transformations of dependent variables. Don't include spurious independent variables (no data mining). Perform diagnostic tests for violations of the linear regression assumptions. If violations are found, use appropriate corrections. Validate model estimations out-of-sample when possible. Ensure that data come from a single underlying population.
Why is Heteroskedasticity a problem?
If the residuals are heteroskedastic, the standard error estimated using ordinary least squares is understated, causing you to overstate the t-stat, and tests of hypotheses will be incorrect (you reject the null when you should not). -Bottom line: you cannot trust the test statistics.
Generally, model misspecification can result in _______________________ when we are using linear regression.
Invalid statistical conclusions
How to calculate R squared?
R squared = RSS/SST, i.e., explained variation over total variation. It tells us how much of the variation in Y is explained by my regression.
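Using the RSS and SST definitions from these cards (the sums of squares below are hypothetical):

```python
# R^2 = RSS / SST: explained variation over total variation of Y.
def r_squared(rss, sst):
    return rss / sst

print(r_squared(80.0, 100.0))  # 0.8: the regression explains 80% of Y's variation
```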
How do you calculate the F stat?
Mean square of the regression / mean square of the error terms: F = MSR/MSE = (RSS/k) / (SSE/(n-k-1)).
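A sketch with hypothetical sums of squares (note SST = RSS + SSE in the ANOVA decomposition):

```python
# F = MSR / MSE = (RSS / k) / (SSE / (n - k - 1)).
def f_stat(rss, sse, n, k):
    msr = rss / k            # mean square of the regression
    mse = sse / (n - k - 1)  # mean square of the error terms
    return msr / mse

print(round(f_stat(rss=80.0, sse=20.0, n=25, k=3), 2))  # 28.0
```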
Sources of mis-specification
Misspecified functional form can arise from several possible problems: ◦Omitted variable bias (e.g., the bid-ask spread example omitting lnMKTCAP) ◦Incorrectly represented variables: unscaled data (e.g., negative bid-ask spread, mktcap) ◦Data that are pooled that should not be.
Model specification
Models should ◦Be grounded in financial or economic reasoning (vs. data mining). ◦Have variables that are an appropriate functional form for their nature. ◦Have specifications that are parsimonious (same R-squared, fewer variables). ◦Be in compliance with the regression assumptions. ◦Be tested out-of-sample before applying them to decisions.
Why is multicollinearity a problem?
Multicollinearity makes it difficult to determine which variables are doing the most explaining. ◦Coefficient estimates are unreliable ◦Impossible to distinguish individual impact ◦Inflated s.e., so the t-test has little power. In other words, tests of significance are questionable, and your estimates will be incorrect.
Three broad categories of qualitative dependent variable models
Probit and logit models deal with qualitative dependent variables. Discriminant analysis estimates a linear function, which can then be used to assign an observation to one of the underlying categories, e.g., Altman's model on bankruptcy.
Multicollinearity
Talks about the independent variables: occurs when your independent variables are highly correlated. ◦Occurs in time-series or cross-sectional analyses. ◦Prevalent in financial models. ◦Not a concern if the model is used only for prediction.
F- Statistic
Tells us the significance of the whole model.
Heteroskedasticity
The variance of the error terms increases as X increases, so OLS underestimates the standard error. The error terms are non-normal, either very positive or very negative. The t-stat is going to be wrong, and you are going to make wrong inferences: with an understated standard error you get a large t and reject more often than you should. What type of error? Type I. Occurs in cross-sectional and time-series data.
Random error term.
The random error term itself contains uncertainty, which can be estimated from the standard error of the estimate for the regression equation. We want the standard error of the estimate to be small
There are two sources of uncertainty in linear regression models:
Uncertainty associated with the random error term. Uncertainty associated with the parameter estimates.
SSE
Unexplained variation of the dependent variable
Interpretation of SST
Total variation of the dependent variable
Autocorrelation/Serial Correlation
Violates the assumption of independent error terms: the residuals are related to each other and can build on each other over time.
Interpretation of Coefficient of of Dummy Variable
We fail to find a significant relationship between Jan. and the rest of the month Intercept-all the other variables will be euqal to zero. When its the fourth q
How can Heteroskedasticity be fixed?
In SAS, next to SPEC on the model line, type HCC. You can also... ◦Use more sophisticated estimation methods to get robust standard errors ◦Explicitly incorporate an independent variable that includes the missing factor.
What is Multiple Regression?
A method of explaining the variation in the dependent variable using the variation in more than one explanatory variable.
Degrees of freedom in multiple regression
n-k-1, where k = number of independent variables
slope coefficient
The change in the dependent variable for a one-unit change in that independent variable, holding all other variables constant.
Definition of the Intercept?
the value of the dependent variable if all independent variables have a value of zero.
What is Conditional Heteroskedasticity?
Heteroskedasticity wherein the error variance is correlated with the values of the independent variables. Therefore... ◦F-test and t-tests are unreliable.
Detection of Multicollinearity
◦Classic symptom: high R-square and significant F but insignificant t-stats (fortunately it does not affect the F stat) ◦High pairwise correlations between independent variables
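A sketch of the pairwise-correlation check (the two regressor series are made up, with x2 built to be nearly a multiple of x1):

```python
# Pearson correlation between two independent variables; values near
# +/-1 are a multicollinearity warning sign.
import math

def corr(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [2.1, 3.9, 6.2, 8.0]  # roughly 2 * x1: a nearly redundant regressor
print(round(corr(x1, x2), 3))  # close to 1 -> multicollinearity warning
```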
How to fix autocorrelation?
◦Robust standard errors ◦Explicitly incorporate source of correlation (e.g., time or seasonality variable) to capture the effect. ◦Use alternative time-series methods that capture autocorrelation (ARCH, GARCH models).
Uncertainty associated with the parameter estimates.
◦The estimated parameters also contain uncertainty because they are only estimates of the true underlying population parameters. ◦For a single independent variable, as covered in the prior chapter, estimates of this uncertainty can be obtained. ◦For multiple independent variables, the matrix algebra necessary to obtain such estimates is beyond the scope of this text. B0 and B1 are estimates, so there is uncertainty. Linear regression is estimation.