Session 3
Sample covariance (COVxy)
COVxy = [sum of (Xi - mean of X)(Yi - mean of Y)] / (n - 1), where n = sample size and Xi, Yi = the ith observations on variables X and Y
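A minimal Python sketch (not part of the original notes) of the formula above, using made-up numbers; np.cov is included only as a cross-check.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

n = len(x)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)  # divide by n - 1, not n
print(cov_xy)                       # manual sample covariance
print(np.cov(x, y, ddof=1)[0, 1])   # same value from numpy
```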
assumptions of a multiple regression model
1. A linear relationship exists between the dependent and independent variables.
2. The independent variables are not random, and there is no exact linear relationship between any two or more independent variables.
3. The expected value of the error term, conditional on the independent variables, is zero.
4. The variance of the error term is constant for all observations.
5. The error term for one observation is not correlated with that of another observation.
6. The error term is normally distributed.
The process to test whether an AR time series model is correctly specified involves three steps:
1. Estimate the AR model being evaluated using linear regression.
2. Calculate the autocorrelations of the model's residuals (i.e., the level of correlation between the forecast errors from one period to the next).
3. Test whether the autocorrelations are significantly different from zero: if the model is correctly specified, none of the autocorrelations will be statistically significant. (To test for significance, a t-test is used to test the hypothesis that the correlations of the residuals are zero. The t-stat is the estimated autocorrelation divided by its standard error, where the standard error is 1/sqrt(T) and T is the number of observations; see the sketch below.)
***Note: the Durbin-Watson test we used with trend models is not appropriate for testing for serial correlation of the error terms in an autoregressive model. Use this t-test instead.
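A minimal Python sketch of steps 2 and 3, assuming the model's residuals are already available; the function name, lags, and random residuals are illustrative only.

```python
import numpy as np

def residual_autocorr_tstats(residuals, max_lag=4):
    """t-stats for residual autocorrelations; standard error is 1/sqrt(T)."""
    resid = np.asarray(residuals, dtype=float)
    T = len(resid)
    se = 1.0 / np.sqrt(T)
    results = []
    for lag in range(1, max_lag + 1):
        r = np.corrcoef(resid[:-lag], resid[lag:])[0, 1]  # autocorrelation at this lag
        results.append((lag, r, r / se))                   # |t| well above 2 suggests misspecification
    return results

# Illustrative use with random residuals (should show no significant autocorrelation).
rng = np.random.default_rng(0)
for lag, r, t in residual_autocorr_tstats(rng.standard_normal(100)):
    print(lag, round(r, 3), round(t, 2))
```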
limitations of regression analysis
1. Linear relationships can change over time, known as parameter instability.
2. Even if a relationship is accurately estimated, its usefulness in investment analysis will be limited if other market participants are also aware of it and act on this evidence.
3. If the assumptions underlying regression analysis do not hold, the interpretation and tests of hypotheses may not be valid.
4. Heteroskedasticity = nonconstant variance of the error term.
5. Autocorrelation = error terms are not independent.
11B. Describe limitations of correlation
1. Outliers - may influence the results of the regression and the estimate of the correlation coefficient.
2. Spurious correlation - occurs when there appears to be a relationship between two variables when in fact there is none. E.g., height is correlated with vocabulary, but this correlation is driven by a third variable, age.
3. Nonlinear relationships - correlation measures the linear association between two variables, so it may not always be reliable. Two variables could have a strong nonlinear relationship and still have a very low correlation.
Detecting heteroskedasticity
Two methods to detect: examine scatter plots of the residuals and use the Breusch-Pagan chi-square test. The Breusch-Pagan test calls for a regression of the squared residuals on the independent variables. If conditional heteroskedasticity is present, the independent variables will contribute significantly to the explanation of the squared residuals. BP chi-square stat = n * R^2, where n is the number of observations, R^2 is from the second regression of the squared residuals (from the first regression) on the independent variables (NOT THE ORIGINAL R^2), and k = the number of independent variables (the degrees of freedom for the one-tailed chi-square test).
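A hedged Python sketch of the BP statistic, assuming the residuals from the original regression and the matrix of independent variables are already in hand; the function name and setup are illustrative, not a specific library's API.

```python
import numpy as np

def breusch_pagan_stat(residuals, X):
    """BP chi-square stat = n * R^2 from regressing squared residuals on X.
    X is the n-by-k matrix of independent variables (without an intercept column)."""
    resid2 = np.asarray(residuals, dtype=float) ** 2
    n = len(resid2)
    Xc = np.column_stack([np.ones(n), X])            # add an intercept for the auxiliary regression
    beta, *_ = np.linalg.lstsq(Xc, resid2, rcond=None)
    fitted = Xc @ beta
    ss_res = np.sum((resid2 - fitted) ** 2)
    ss_tot = np.sum((resid2 - resid2.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot                        # R^2 of the SECOND regression
    return n * r2                                     # compare with chi-square critical value, df = k, one-tailed
```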
Sample Correlation coefficient (R)
= covariance of X and Y / [(sample SD of X) × (sample SD of Y)]. The result is bounded: -1 <= R <= +1.
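A short illustrative calculation of the sample correlation from the covariance and sample standard deviations, reusing the made-up data from the covariance sketch above.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

cov_xy = np.cov(x, y, ddof=1)[0, 1]
r = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))  # covariance scaled by the sample SDs
print(r, np.corrcoef(x, y)[0, 1])                     # both values lie between -1 and +1
```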
11.e Describe assumptions underlying linear regression and interpret regression coefficient
1. A linear relationship exists between the dependent and the independent variable.
2. The independent variable is uncorrelated with the error term.
3. The expected value of the error term is zero.
4. The variance of the error term is constant.
5. The error term is independently distributed.
6. The error term is normally distributed.
When is a trend model covariance stationary
A time series is covariance stationary if it satisfies the following three conditions:
1. Constant and finite expected value: the expected value of the time series is constant over time (the mean-reverting level).
2. Constant and finite variance: the time series' volatility around its mean does not change over time.
3. Constant and finite covariance between values at any given lag: the covariance of the time series with leading or lagged values of itself is constant.
**If it appears that the mean or variance changes over time, then the data are not covariance stationary. It is often the case that financial or economic data are not covariance stationary. For example, if data exhibit seasonality or trends over time, the mean will not be constant. A nonstationary time series will produce meaningless regression results.
T-test- assume populations are normally distributed, we use t-test to test whether null hypothesis should be rejected
A two-tailed test is appropriate. The formula for the t-test is: t = [r * sqrt(n - 2)] / sqrt(1 - r^2), where r = the sample correlation and n = the number of observations.
explain autoregressive conditional heteroskedasticity (ARCH)
ARCH exists if the variance of the residuals in one period is dependent on the variance of the residuals in a previous period. When this exists, the standard errors of the regression coefficients in AR models and the hypothesis tests of these coefficients are invalid. An ARCH model is used to test for autoregressive conditional heteroskedasticity.
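A rough sketch, assuming an ARCH(1)-style check: regress the squared residuals on the prior period's squared residuals and look at the slope's t-stat. Names and setup are illustrative; in practice the significance test would use the appropriate critical t-value.

```python
import numpy as np

def arch1_slope_tstat(residuals):
    """Regress squared residuals on the previous period's squared residuals.
    A slope significantly different from zero suggests ARCH is present."""
    e2 = np.asarray(residuals, dtype=float) ** 2
    y, x = e2[1:], e2[:-1]
    X = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - 2)                    # error variance of the auxiliary regression
    se_slope = np.sqrt(s2 / np.sum((x - x.mean()) ** 2)) # usual OLS standard error of the slope
    return beta[1], beta[1] / se_slope                   # slope estimate and its t-stat
```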
F-Statistic
Assesses how well the set of independent variables, as a group, explains the variation in the dependent variable. It is used to test whether AT LEAST ONE of the independent variables explains a significant portion of the variation of the dependent variable. ***The F-test is ALWAYS a one-tailed test.
Keys to the Exam Reading 13
Autoregressive (AR) models, mean reversion, modeling seasonality, covariance stationarity and random walks, ARCH, and forecasting using nonstationary time series models
Factors that determine which model is best; linear or log-linear
Bottom line: when a variable grows at a constant rate, a log-linear model is most appropriate. When the variable increases over time by a constant amount, a linear trend model is most appropriate.
Explain how time-series variables should be analyzed for nonstationarity and/or cointegration before use in a linear regression.
Cointegration - two time series are economically linked (related to the same macro variables) or follow the same trend, and that relationship is not expected to change. The residuals are tested for a unit root using the Dickey-Fuller test with critical t-values calculated by Engle and Granger. Remember: the DF test does not use the standard critical t-values we typically use in testing the statistical significance of individual regression coefficients. The DF-EG test further adjusts them to test for cointegration. As with the DF test, you do not have to know the critical t-values for the DF-EG test. If the null is rejected, we say the series (of error terms, in this case) is covariance stationary and the two time series are cointegrated.
Key focus for reading 11
Correlation and regression, confidence intervals and t-tests, the ANOVA table
11A. Calculate and Interpret Covariance and sample correlation coefficient
Covariance - the statistical measure of how closely two variables move together; the number can range from negative to positive infinity.
F-stat formula
F = MSR / MSE = (RSS/k) / (SSE/(n - k - 1)). **With one independent variable: numerator df = k = 1 and denominator df = n - k - 1 = n - 2. MSR = mean regression sum of squares; MSE = mean squared error.
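A minimal sketch of the F-stat calculation with made-up ANOVA numbers (n, k, RSS, and SSE are all illustrative).

```python
# F = (RSS/k) / (SSE/(n - k - 1))
n, k = 30, 2            # observations and independent variables (illustrative)
RSS, SSE = 80.0, 40.0   # explained and unexplained sums of squares (illustrative)

MSR = RSS / k
MSE = SSE / (n - k - 1)
F = MSR / MSE
print(MSR, MSE, F)      # F is compared with a one-tailed critical value, df = k and n - k - 1
```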
Confidence Interval for Coefficient
Formula: point estimate +/- (reliability factor * standard error). Example: what is the 95% confidence interval for b1 if the coefficient estimate is 0.90 and the standard error is 0.17? 0.90 +/- (critical t-value * standard error) = 0.90 +/- (2.2 * 0.17), giving 0.53 < b1 < 1.27.
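A tiny check of the arithmetic in the example above.

```python
# Confidence interval: point estimate +/- critical t-value * standard error
b1, se, t_crit = 0.90, 0.17, 2.2
lower, upper = b1 - t_crit * se, b1 + t_crit * se
print(round(lower, 2), round(upper, 2))   # roughly 0.53 and 1.27
```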
correcting Heteroskedasticity
Heteroskedasticity is not easy to correct; the most common remedy is to calculate robust standard errors, called White-corrected standard errors.
Key focus for reading 12
Hypothesis testing of coefficients Predicted Y-value ANOVA table- calculate and interpret its components Problems in regression analysis- define, effect, detect, and correct
Important to remember about F-stat
IMPORTANT: **The F-test is always a one-tailed test. **With one independent variable, the F-test tests the same hypothesis as the t-test for statistical significance of the slope coefficient. **The F-test is more applicable in multiple regressions. ***Always remember that the F-test is not as useful when we only have one independent variable because it tells us the same thing as the t-test of the slope coefficient. Make sure you know that fact for the exam.
Unconditional Heteroskedasticity
It usually causes NO major problems with the regression.
11j. Describe the use of ANOVA in regression analysis, interpret ANOVA, and calculate and interpret F-statistic
In regression, we use ANOVA to determine the usefulness of the independent variable or variables in explaining variation in the dependent variable.
MSE
Mean squared error (MSE) = SSE / (n - k - 1)
Correcting Multicollinearity
Most common method to correct for multicollinearity is to omit one or more of the correlated independent variables
Detecting multicollinearity
The most common way to detect multicollinearity is the situation where t-tests indicate that none of the individual coefficients is significantly different from zero, while the F-test is statistically significant and the R^2 is high. This suggests that the variables together explain much of the variation in the dependent variable, but the individual independent variables do not. ***Rule of thumb: if the absolute value of the sample correlation between any two independent variables in the regression is greater than 0.7, multicollinearity is a problem. **This only works with two independent variables; if there are more than two independent variables, while individual variables may not be highly correlated, linear combinations might be, leading to multicollinearity.
Heteroskedasticity, Serial Correlation (autocorrelation), Multicollinearity
On exam day, you MUST be able to answer the following four questions about each assumption violation: What is it? What is its effect on regression analysis? How do we detect it? How do we correct for it?
Heteroskedasticity
Occurs when the variance of the residuals is NOT constant across observations; this happens when there are subsamples that are more spread out than the rest of the sample
Limitations of trend models
One of the assumptions underlying linear regression is that the residual term is independently distributed and the residuals are uncorrelated with each other. A violation of this assumption is referred to as autocorrelation. If serial correlation is present, neither a linear nor a log-linear trend model is appropriate; in this case, we need to turn to an autoregressive (AR) model. The Durbin-Watson statistic is used to detect autocorrelation. For a time series model without serial correlation, DW should be approximately equal to 2.0. A DW significantly less than 2.0 suggests that the residual terms are positively correlated.
Regression sum of squares (RSS)
RSS- variation explained by X variables
R^2 calculation
R^2 = explained variation / total variation = RSS / SST
adjusted R^2
R^2 almost always increases as variables are added to the model (referred to as overestimating the regression). To overcome this problem, we adjust R^2 for the number of independent variables: adjusted R^2 = 1 - [(n - 1) / (n - k - 1)] * (1 - R^2), where n = number of observations and k = number of independent variables. With more than one independent variable, adjusted R^2 <= R^2; adding new variables may increase or decrease the adjusted R^2, and the adjusted R^2 may be less than zero if R^2 is low enough.
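A minimal sketch of the adjusted R^2 formula with illustrative numbers.

```python
# adjusted R^2 = 1 - [(n - 1) / (n - k - 1)] * (1 - R^2)
n, k, r2 = 40, 3, 0.63
r2_adj = 1 - ((n - 1) / (n - k - 1)) * (1 - r2)
print(round(r2_adj, 3))   # always <= R^2 when k >= 1
```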
12.C Formulate null and alternative hypothesis about population value of a regression coefficient, calculate the value of the test stat, and determine whether to reject
Regression coefficient t-test: t = (estimated coefficient - hypothesized value) / standard error. ***Usually the hypothesized value is 0, so the t-stat is simply estimate / standard error.
MSR
Regression mean square (MSR) = RSS / k
remembering for P-value and T-stat
Reject H0 if the p-value is small (less than .05) and the t-stat is large (usually over 2). You will never calculate these, but you will have to interpret them.
Limitations of Regression- end reading 11
Relationships change over time Difficult to apply Usefulness limited- for investment applications because all participants can observe relationships
11F. Calculate and Interpret standard error of estimate (SEE), coefficient of determination (R^2) and confidence interval
SEE measures the accuracy of predicted values from the regression equation; it measures the FIT of the regression line (the smaller the number, the better the fit). SEE = square root of [SSE / (n - 2)] = square root of MSE. MAIN POINT: calculated from the ANOVA table; lower is better.
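A short sketch with illustrative ANOVA figures for a simple (one independent variable) regression, showing SEE and R^2 from the same table.

```python
import numpy as np

n, k = 20, 1              # observations and independent variables (illustrative)
SSE, SST = 18.0, 60.0     # unexplained and total sums of squares (illustrative)

SEE = np.sqrt(SSE / (n - 2))   # equals sqrt(MSE) when k = 1
R2 = 1 - SSE / SST             # same as RSS / SST
print(round(SEE, 3), round(R2, 3))
```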
Sum of squared errors (SSE)
SSE- unexplained variation Thus, SST = RSS+SSE
effects of heteroskedasticity
The standard errors are too low in the t-test formula, making the t-stats too high and creating false significance. The F-test is also unreliable.
Negative serial correlation
Standard errors are too large, which leads to t-stats that are too small (Type II errors). Positive serial correlation is more common.
Autoregressive (AR) models
The dependent variable is regressed against one or more lagged values of itself.
SSE
The sum of the squared vertical distances between the estimated and actual Y-values is referred to as the sum of squared errors (SSE). So the regression line is the line that minimizes the SSE.
Serial Correlation
This is also known as autocorrelation: the situation in which the residual terms are correlated with one another. ***A very common problem with time series data and financial data.
Log-Linear Trend Models
Time series data, particularly financial time series, often display exponential growth. If we plot such data, the observations will form a convex curve.
Correcting serial correlation
To correct, adjust the coefficient standard errors using the Hansen method and recalculate the t-stats, or improve the specification of the model.
Putting it all together!!!
To determine what type of model is best suited to meet your needs, follow these steps:
1. Determine your goal (are you attempting to model the relationship of a variable to other variables, or are you trying to model the variable over time?).
2. If you have decided on using a time series analysis for an individual variable, plot the values of the variable over time and look for characteristics that would indicate nonstationarity, such as non-constant variance (heteroskedasticity), a non-constant mean, seasonality, or structural change (i.e., a significant shift in the plotted data at a point in time that seems to divide the data into two or more distinct patterns).
3. If there is no seasonality or structural shift, use a trend model (if the data plot on a straight line with an upward or downward slope, use a linear trend model; if the data plot in a curve, use a log-linear trend model).
4. Run the trend analysis, compute the residuals, and test for serial correlation using the Durbin-Watson test (if you detect no serial correlation, you can use the model; if you detect serial correlation, you must use another model, such as an AR model).
5. If the data has serial correlation, reexamine the data for stationarity before running an AR model. If it is not stationary, treat the data for use in an AR model as follows:
5a. If the data has a linear trend, first-difference the data.
5b. If the data has an exponential trend, first-difference the natural log of the data.
5c. If there is a structural shift in the data, run two separate models as discussed above.
5d. If the data has a seasonal component, incorporate the seasonality in the AR model as discussed above.
6. After first-differencing in step 5, if the series is covariance stationary, run an AR(1) model and test for serial correlation and seasonality.
6a. If there is no remaining serial correlation, you can use the model.
6b. If you still detect serial correlation, incorporate lagged values of the variable into the AR model until you have removed any serial correlation.
7. Test for ARCH: regress the squared residuals on the squared lagged residuals and test whether the resulting coefficient is significantly different from zero.
7a. If the coefficient is not significantly different from zero, you can use the model.
7b. If the coefficient is significantly different from zero, ARCH is present; correct using generalized least squares.
8. If you have developed two statistically reliable models and want to determine which is better at forecasting, calculate their out-of-sample RMSE.
making decision with T-stat
To make a decision, the calculated test statistic is compared with the critical t-value for the appropriate degrees of freedom and level of significance. Bearing in mind that we are conducting a two-tailed test, the decision rule can be stated as: reject H0 (correlation = 0) if t > +t critical or t < -t critical. This test statistic has a t-distribution with n - 2 degrees of freedom if the null hypothesis is true. Notes: as n (the number of observations) increases, the number of degrees of freedom increases and the absolute value of the critical t-value decreases. Also, the absolute value of the numerator increases with larger n, resulting in larger-magnitude t-values.
Total sum of squares (SST)
Total variation in the "Y" variable (SST). RSS + SSE = SST.
11C. Formulate a test of the hypothesis
We want to test whether the correlation between the population of two variables is equal to zero using a t-test
autoregressive model (AR)
When the dependent variable is regressed against one or more lagged values of itself, the resulting model is called an autoregressive (AR) model. For example, the sales of a firm could be regressed against the firm's sales in previous months. In an AR time series, past values of a variable are used to predict the current value of the variable.
interpret regression coefficients
Y = b0 + b1*Xi + ei, where b0 = intercept; b1 = slope coefficient, the change in the dependent variable for a one-unit change in the independent variable (slope = covariance / variance); ei = residual error for the ith observation. We refer to the intercept b0 and the slope coefficient b1 as the regression coefficients.
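A minimal sketch, on made-up data, of estimating b0 and b1 for a simple regression using the slope = covariance/variance relationship noted above.

```python
import numpy as np

# Fit Y = b0 + b1*X by ordinary least squares on illustrative data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # slope = covariance / variance
b0 = y.mean() - b1 * x.mean()                         # intercept
print(b0, b1)   # b1 is the change in Y for a one-unit change in X
```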
Log-linear model formula
Y = e^(b0 + b1*t). To make this easier, we take the natural log of both sides to get ln(Y) = b0 + b1*t + e. ***When forecasting, remember that the fitted value is ln(Y), so take the antilog (raise e to the fitted value) to get the forecast of Y.**
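A short sketch of a log-linear trend forecast with illustrative coefficients, showing the antilog step.

```python
import numpy as np

# The regression predicts ln(Y), so exponentiate to recover Y.
b0, b1 = 1.2, 0.05        # illustrative estimated coefficients
t = 10                    # illustrative period
ln_y_hat = b0 + b1 * t
y_hat = np.exp(ln_y_hat)  # antilog of the fitted value gives the forecast of Y
print(y_hat)
```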
Time Series Analysis
A set of observations for a variable over successive periods of time. Trend models are easy: the linear trend model and the log-linear trend model. Lagged models are more difficult, mainly autoregressive (AR) models.
Confidence Intervals for a regression coefficient
coefficient +/- (critical t-value * standard error)
Correlation coefficient (R)
converts the covariance into a measure that is easier to interpret. R = measure of the strength of the linear relationship (correlation) between the two variables
Root mean squared error
The root mean squared error (RMSE) criterion is used to compare the accuracy of autoregressive models in forecasting out-of-sample values. For example, a researcher may have two autoregressive models, an AR(1) and an AR(2). To determine which model will more accurately forecast future values, we calculate the RMSE for the out-of-sample data. The model with the lowest RMSE for in-sample data may not be the model with the lowest RMSE for out-of-sample data; the model with the lower RMSE for the out-of-sample data will have lower forecast error and will be expected to have better predictive power in the future. Besides the RMSE criterion, we will also want to examine the stability of the regression coefficients.
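A minimal sketch of the RMSE calculation on made-up out-of-sample actuals and forecasts.

```python
import numpy as np

def rmse(actual, forecast):
    """Root mean squared error of out-of-sample forecasts."""
    err = np.asarray(actual, dtype=float) - np.asarray(forecast, dtype=float)
    return np.sqrt(np.mean(err ** 2))

# The model with the lower out-of-sample RMSE is preferred.
print(rmse([1.0, 2.0, 3.0], [1.1, 1.8, 3.3]))
```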
A one-period-ahead forecast for an AR(1) model is determined in the following manner:
x-hat(t+1) = b0-hat + b1-hat * x(t). Likewise, a two-step-ahead forecast for an AR(1) model is calculated as: x-hat(t+2) = b0-hat + b1-hat * x-hat(t+1). **Note: the hats indicate that the inputs used in multi-period forecasts are themselves forecasts, which makes a multi-period forecast more uncertain than a single-period forecast.
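A tiny sketch of the chained forecast with illustrative coefficients: the two-step forecast uses the one-step forecast as its input.

```python
# Chained AR(1) forecasts with illustrative estimated coefficients and last observation.
b0_hat, b1_hat = 0.5, 0.8
x_t = 2.0

x_t1 = b0_hat + b1_hat * x_t    # one-period-ahead forecast
x_t2 = b0_hat + b1_hat * x_t1   # two-period-ahead forecast uses the forecast x_t1
print(x_t1, x_t2)
```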
t-test- Testing significance example
n = 10 and r = .475, so t = (.475 * sqrt(8)) / sqrt(1 - .475^2) = 1.5267. The two-tailed critical t-value at 5% with df = 10 - 2 = 8 is found in the t-table as +/- 2.306. Because -2.306 < 1.5267 < 2.306, the null cannot be rejected.
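A quick check of the arithmetic in the example above.

```python
import numpy as np

# t = r * sqrt(n - 2) / sqrt(1 - r^2)
n, r = 10, 0.475
t = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
print(round(t, 4))   # about 1.5267, inside +/- 2.306, so do not reject the null
```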
p-value
The probability value for a particular hypothesis; it is the smallest level of significance at which the null hypothesis can be rejected. In most regression software packages, the p-values printed for regression coefficients apply to the test of the null hypothesis that the true parameter is equal to 0 against the alternative that the parameter is not equal to 0. For example, if the p-value is .005, we can reject the hypothesis that the true parameter is equal to 0 at the 0.5% significance level (99.5% confidence).
Coefficient of Determination (R^2)
R^2 measures the percentage of the total variation in Y explained by the independent variable X. For example, an R^2 of .63 indicates that the independent variable explains 63% of the variation in the dependent variable. Caution: you cannot conclude causation from a high R^2; it could be spurious correlation.
Positive serial correlation
Results in coefficient standard errors that are too small, which will cause the computed t-stats to be larger than they should be (Type I error: rejection of the null hypothesis when it is actually true). The F-test is also unreliable because the MSE will be underestimated, leading again to too many Type I errors. Same as with heteroskedasticity: t-stats too high.
Detecting serial correlation
Detect with scatter plots and the Durbin-Watson statistic: DW is approximately 2(1 - r), where r = the correlation of residuals from one observation to the next. **Other details of the DW test are too ugly to remember.
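A minimal sketch of the DW statistic computed directly from the residuals (the sum-of-squared-differences form, which is approximately 2(1 - r)); the random residuals are illustrative.

```python
import numpy as np

def durbin_watson(residuals):
    """DW statistic: values near 2 suggest no serial correlation,
    values well below 2 suggest positive serial correlation."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(0)
print(durbin_watson(rng.standard_normal(100)))   # roughly 2 for uncorrelated residuals
```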
P-value
The smallest level of significance for which the null can be rejected. ***If the p-value is less than the significance level, the null hypothesis can be rejected. ***If the p-value is greater than the significance level, the null hypothesis cannot be rejected. Comparing the p-value to the significance level is an alternative method of doing hypothesis tests of the coefficients.
Conditional heteroskedasticity
This is related to the level of the independent variables. For example, conditional heteroskedasticity exists if the variance of the residual term increases as the value of the independent variable increases. This does create significant problems for statistical inference. **On a chart, near 0 the residuals sit close to the regression line (low residual variance), but as you move out they separate further from the line, creating higher residual variance.
When do you know an AR model is correct?
when an AR model is correctly specified, the residual terms will not exhibit serial correlation
dummy variables
Used when the independent variable is binary in nature, i.e., it is either on or off. Dummy variables are often used to quantify the impact of qualitative events and are assigned a value of 0 or 1. Example: in a time series regression of monthly stock returns, you could employ a January dummy variable that would take on the value of 1 if a stock return occurred in January and 0 if it occurred in any other month. The purpose of including the January dummy variable would be to see if stock returns in January were significantly different from stock returns in all other months of the year (see the sketch below). Important consideration: with dummy variables, the number of dummy variables to include in the model matters. Whenever we want to distinguish between n classes, we must use n - 1 dummy variables; if we do not, the regression assumption of no exact linear relationship between independent variables would be violated.
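A rough sketch of the January dummy example; the return data, the built-in "January effect," and all coefficient values are purely illustrative.

```python
import numpy as np

# Monthly returns regressed on a January dummy (1 in January, 0 otherwise).
rng = np.random.default_rng(1)
months = np.tile(np.arange(1, 13), 5)                  # five years of monthly observations
returns = rng.normal(0.01, 0.02, size=months.size)
returns[months == 1] += 0.015                          # artificial January effect for the example

jan_dummy = (months == 1).astype(float)                # 2 classes (January vs. not) -> 1 dummy
X = np.column_stack([np.ones(months.size), jan_dummy])
b0, b1 = np.linalg.lstsq(X, returns, rcond=None)[0]
print(b0, b1)   # b1 estimates the average difference between January and non-January returns
```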
Multicollinearity
When two or more of the independent variables (X) in a multiple regression are highly correlated with each other. It reduces t-stats, making them artificially small.