Level 2 SS 3
ANOVA Table
◦the output of the ANOVA procedure is an ANOVA table, which is a summary of the variation in the dependent variable. ANOVA tables are included in the regression output of many statistical software packages.
How do you detect serial correlation?
◦residual plots and the Durbin-Watson statistic
To test whether the two time series have unit roots, the analyst first runs separate DF tests with five possible results:
1. Both time series are covariance stationary (linear regression can be used).
2. Only the dependent variable time series is covariance stationary (regression is not reliable).
3. Only the independent variable time series is covariance stationary (regression is not reliable).
4. Neither time series is covariance stationary and the two series are not cointegrated (regression is not reliable).
5. Neither time series is covariance stationary and the two series are cointegrated (regression is reliable because the error term is covariance stationary).
A time series is covariance stationary if it satisfies the following three conditions:
1. Constant and finite expected value: the expected value of the time series is constant over time (we will refer to this value as the mean-reverting level).
2. Constant and finite variance: the time series' volatility around its mean (i.e., the distribution of the individual observations around the mean) does not change over time.
3. Constant and finite covariance between values at any given lag: the covariance of the time series with leading or lagged values of itself is constant.
What are the two guidelines to follow to determine what type of model is best suited to meet your needs?
1. Determine your goal.
   A. Are you attempting to model the relationship of a variable to other variables (e.g., cointegrated time series, cross-sectional multiple regression)?
   B. Are you trying to model the variable over time (e.g., trend model)?
2. If you have decided on using a time series analysis for an individual variable, plot the values of the variable over time and look for characteristics that would indicate nonstationarity, such as non-constant variance (heteroskedasticity), non-constant mean, seasonality, or structural change.
What is the procedure to test whether an AR time series model is correctly specified? (Three Steps)
1. Estimate the AR model being evaluated using linear regression, starting with a first-order AR model.
2. Calculate the autocorrelations of the model's residuals (i.e., the level of correlation between the forecast errors from one period to the next).
3. Test whether the autocorrelations are significantly different from zero.
   A. To test for significance, a t-test is used to test the hypothesis that the correlations of the residuals are zero.
   B. The t-statistic is the estimated autocorrelation divided by its standard error (see the sketch below).
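Not from the curriculum reading — a minimal Python sketch of this residual-autocorrelation check, assuming statsmodels is available and using a made-up series x; the standard error of each residual autocorrelation is approximated as 1/sqrt(T).

```python
# Hypothetical illustration: test whether AR(1) residual autocorrelations differ from zero.
import numpy as np
from statsmodels.tsa.ar_model import AutoReg
from statsmodels.tsa.stattools import acf

np.random.seed(0)
x = np.cumsum(np.random.normal(size=200)) * 0.1 + np.random.normal(size=200)  # made-up series

fit = AutoReg(x, lags=1).fit()              # step 1: estimate the AR(1) model
resid = fit.resid
T = len(resid)

rho = acf(resid, nlags=12, fft=False)[1:]   # step 2: residual autocorrelations at lags 1..12
t_stats = rho / (1.0 / np.sqrt(T))          # step 3: t = autocorrelation / standard error (~1/sqrt(T))

for k, (r, t) in enumerate(zip(rho, t_stats), start=1):
    flag = "significant" if abs(t) > 2.0 else "not significant"
    print(f"lag {k:2d}: autocorr={r:+.3f}  t={t:+.2f}  ({flag})")
```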
What are steps 3-8 to determine what type of model is best suited?
3. If there is no seasonality or structural shift, use a trend model.
   - If the data plot on a straight line with an upward or downward slope, use a linear trend model.
   - If the data plot in a curve, use a log-linear trend model.
4. Run the trend analysis, compute the residuals, and test for serial correlation using the Durbin-Watson test.
   - If you detect no serial correlation, you can use the model.
   - If you detect serial correlation, you must use another model (e.g., AR).
5. If the data has serial correlation, reexamine the data for stationarity before running an AR model. If it is not stationary, treat the data for use in an AR model as follows:
   - If the data has a linear trend, first-difference the data.
   - If the data has an exponential trend, first-difference the natural log of the data.
   - If there is a structural shift in the data, run two separate models as discussed above.
   - If the data has a seasonal component, incorporate the seasonality in the AR model as discussed below.
6. After first-differencing in step 5, if the series is covariance stationary, run an AR(1) model and test for serial correlation and seasonality.
   - If there is no remaining serial correlation, you can use the model.
   - If you still detect serial correlation, incorporate lagged values of the variable (possibly including one for seasonality, e.g., for monthly data, add the 12th lag of the time series) into the AR model until you have removed (i.e., modeled) any serial correlation.
7. Test for ARCH. Regress the squared residuals on squares of lagged values of the residuals and test whether the resulting coefficient is significantly different from zero.
   - If the coefficient is not significantly different from zero, you can use the model.
   - If the coefficient is significantly different from zero, ARCH is present; correct using generalized least squares.
8. If you have developed two statistically reliable models and want to determine which is better at forecasting, calculate their out-of-sample RMSE.
The t-statistic is the estimated autocorrelation divided by its standard error; its numerator is the correlation of the error term in period t with the kth lagged error term.
What are in sample forecasts?
in-sample forecasts are made within the range of data (i.e., the time period) used to estimate the model, which for a time series is known as the sample or test period
Forecasting with an autoregressive model
these are calculated in the same manner as forecasts from other regression models, but because the independent variables are lagged values of the dependent variable, a one-step-ahead forecast must be calculated before a two-step-ahead forecast can be calculated (see the sketch below)
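A worked example, not from the text: assumed AR(1) coefficients b0 = 1.2 and b1 = 0.4 with a current value x_t = 5.0; the two-step-ahead forecast reuses the one-step-ahead forecast.

```python
# Hypothetical AR(1) chain-rule forecast with assumed coefficients.
b0, b1 = 1.2, 0.4   # assumed estimated intercept and slope
x_t = 5.0           # assumed current value of the series

x_t1 = b0 + b1 * x_t    # one-step-ahead forecast: 1.2 + 0.4 * 5.0 = 3.2
x_t2 = b0 + b1 * x_t1   # two-step-ahead forecast: 1.2 + 0.4 * 3.2 = 2.48

print(f"one-step-ahead: {x_t1:.2f}")
print(f"two-step-ahead: {x_t2:.2f}")
```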
How do you determine if a time series is covariance stationary?
• run an AR model and examine autocorrelations ◦an AR model is estimated and the statistical significance of the autocorrelations at various lags is examined • perform the Dickey-Fuller (DF) test ◦transform the AR(1) model and run a simple regression (see the sketch below)
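Not part of the curriculum reading — a minimal sketch of a unit-root check using the augmented Dickey-Fuller test in statsmodels, run on a made-up random-walk series.

```python
# Hypothetical unit-root check: ADF test on a simulated random walk.
import numpy as np
from statsmodels.tsa.stattools import adfuller

np.random.seed(1)
x = np.cumsum(np.random.normal(size=300))   # random walk -> has a unit root

adf_stat, p_value, *_ = adfuller(x)
print(f"ADF statistic = {adf_stat:.2f}, p-value = {p_value:.3f}")
if p_value < 0.05:
    print("Reject the null of a unit root: series looks covariance stationary.")
else:
    print("Cannot reject the null: series has a unit root (not covariance stationary).")
```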
What is autoregressive conditional heteroskedasticity (ARCH)?
◦ARCH exists if the variance of the residuals in one period is dependent on the variance of the residuals in a previous period. ‣ when this exists, the standard errors of the regression coefficients in AR models and the hypothesis tests of these coefficients are invalid.
What is a linear trend model?
◦a linear trend model is a time series pattern that can be graphed using a straight line ◦a downward sloping line indicates a negative trend, while an upward sloping line indicates a positive trend ◦ordinary least squares (OLS) regression is used to estimate the coefficients in the trend line, which provides the following prediction equation: ŷt = b0 + b1(t)
What is a structural change?
◦a structural change is indicated by a significant shift in the plotted data at a point in time that seems to divide the data into two or more distinct patterns ◦in this case, you have to run two different models, one incorporating the data before and one after that date, and test whether the time series has actually shifted ◦if it has shifted, a single time series model estimated over the whole period will produce misleading results
What is mean reversion?
◦a time series exhibits mean reversion if it has a tendency to move toward its mean ‣ it tends to decline when the current value is above the mean and increase when the current value is below the mean ◦for an AR(1) model, the mean-reverting level is b0 / (1 − b1) ◦if the series is at its mean-reverting level, the model predicts that the next value of the time series will be the same as its current value
What is an ARCH time series?
◦an arch time series is one for which the variance of the residuals in one period is dependent on (i.e. a function of) the variance of the residuals in the preceding period.
Using ARCH Models
◦ARCH models are used to test for autoregressive conditional heteroskedasticity
Limitations of Trend Models
◦trend models assume the residuals are uncorrelated with each other; violation of this assumption is referred to as autocorrelation. when the residuals are persistently positive or negative for periods of time, the data are said to exhibit serial correlation, and a trend model is not an appropriate specification for the time series ◦even if the data appear to call for a log-linear trend, serial correlation means a trend model should not be used; use an autoregressive model instead ◦for a time series model without serial correlation, DW should be approximately equal to 2.0. a DW significantly different from 2.0 suggests that the residual terms are correlated
What is cointegration?
◦cointegration means that two time series are economically linked (related to the same macro variables) or follow the same trend and that relationship is not expected to change ◦if two time series are cointegrated, the error term from regressing one on the other is covariance stationary and the t-tests are reliable (i.e., scenario 5 can be used, while scenario 4 cannot)
What is a random walk?
◦if a time series follows a random walk process, the predicted value of the series (i.e., the value of the dependent variable) in one period is equal to the value of the series in the previous period plus a random error term: xt = x(t−1) + εt
What is a random walk with a drift?
◦if a time series follows a random walk with a drift, the intercept term is not equal to zero ◦in addition to a random error term, the time series is expected to increase or decrease by a constant amount each period: xt = b0 + x(t−1) + εt
Predicting the variance of a time series
◦if a time series has ARCH errors, an ARCH model can be used to predict the variance of the residuals in future periods ◦for an ARCH(1) model, the predicted variance of the residuals in period t + 1 is σ̂²(t+1) = a0 + a1·ε²t (see the sketch below)
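A worked illustration, not from the text: assumed ARCH(1) estimates a0 = 0.01 and a1 = 0.30 with a current squared residual of 0.04.

```python
# Hypothetical ARCH(1) variance forecast with assumed parameter estimates.
a0, a1 = 0.01, 0.30      # assumed estimated ARCH(1) coefficients
eps_sq_t = 0.04          # assumed squared residual in the current period

var_t1 = a0 + a1 * eps_sq_t   # predicted residual variance for period t+1
print(f"predicted variance for t+1: {var_t1:.4f}")   # 0.01 + 0.30*0.04 = 0.0220
```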
First Differencing
◦if a time series is a random walk (i.e., has a unit root), we can transform the data into a covariance stationary time series using first differencing ◦the process involves subtracting the value of the time series (i.e., the dependent variable) in the immediately preceding period from the current value of the time series to define a new dependent variable, y: yt = xt − x(t−1) (i.e., we model the change in the dependent variable; see the sketch below)
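A minimal sketch, assuming numpy and the adfuller test from statsmodels: first-difference a simulated random walk and confirm that the differenced series no longer has a unit root.

```python
# Hypothetical first-differencing example on a simulated random walk.
import numpy as np
from statsmodels.tsa.stattools import adfuller

np.random.seed(2)
x = np.cumsum(np.random.normal(size=300))   # random walk: x_t = x_{t-1} + error
y = np.diff(x)                               # first difference: y_t = x_t - x_{t-1}

print(f"ADF p-value, levels:      {adfuller(x)[1]:.3f}")   # typically large -> unit root
print(f"ADF p-value, differences: {adfuller(y)[1]:.3f}")   # typically ~0 -> stationary
```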
What is the chain rule of forecasting?
◦it is the calculation of successive forecasts in this manner (a one-step-ahead forecast must be calculated before a two-step-ahead forecast can be calculated)
Bottom line on DF-EG test
◦just like regular DF test, if the null is rejected, we say the series (of error terms in this case) is covariance stationary and the two time series are cointegrated
What is covariance stationarity?
◦neither a random walk nor a random walk with a drift exhibits covariance stationarity ◦looking at the equation from the last card, the mean-reverting level is b0 / (1 − b1); for a random walk b1 = 1, so 1 − b1 = 0 and b0/0 is undefined, meaning the series is not covariance stationary ◦a series with b1 = 1 has a unit root, and we cannot use the least squares regression procedure we used to estimate an AR(1) model without first transforming the data
What are out of sample forecasts?
◦out-of-sample forecasts are made outside of the sample period ◦we compare how accurate a model is in forecasting the y variable value for a time period outside the period used to develop the model. ◦help see if the model adequately describes the time series and whether it has relevance (i.e. predictive power) in the real world.
LOS 13.d: Describe the structure of an autoregressive (AR) model of order p, and calculate one and two-period-ahead forecasts given the estimated coefficients.
◦p indicates the number of lagged values that the autoregressive model will include as independent variables. AR(2) means a second-order autoregressive model
How do you test if two time series are cointegrated?
◦regress one variable on the other using the following model: yt = b0 + b1·xt + εt ◦the residuals are tested for a unit root using the Dickey-Fuller test with critical t-values calculated by Engle and Granger (i.e., the DF-EG test) ◦if the test rejects the null hypothesis of a unit root, we say the error terms generated by the two time series are covariance stationary and the two series are cointegrated ◦if the two series are cointegrated, we can use the regression to model their relationship (see the sketch below)
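Not from the reading — a sketch of an Engle-Granger style cointegration check using statsmodels' coint function on two simulated series that share a common random-walk trend.

```python
# Hypothetical cointegration test on two simulated series sharing a common trend.
import numpy as np
from statsmodels.tsa.stattools import coint

np.random.seed(3)
trend = np.cumsum(np.random.normal(size=400))          # shared random-walk component
y = 2.0 + 1.5 * trend + np.random.normal(size=400)     # series 1
x = 1.0 + 1.0 * trend + np.random.normal(size=400)     # series 2

t_stat, p_value, crit = coint(y, x)    # Engle-Granger test on the regression residuals
print(f"EG t-stat = {t_stat:.2f}, p-value = {p_value:.3f}")
if p_value < 0.05:
    print("Reject unit root in the residuals: the two series appear cointegrated.")
else:
    print("Cannot reject: no evidence of cointegration.")
```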
what do you do if a time series model has been determined to contain arch errors?
◦regression procedures that correct for heteroskedasticity, such as generalized least squares, must be used in order to develop a predictive model ◦otherwise, the standard errors of the model's coefficients will be incorrect, leading to invalid conclusions
Autocorrelation & Model Fit
◦serial correlation (or autocorrelation) means the error terms are positively or negatively correlated ◦when the error terms are correlated, standard errors are unreliable and t tests of individual coefficients can incorrectly show statistical significance or insignificance.
What is the root mean squared error?
◦the root mean squared error criterion (RMSE) is used to compare the accuracy of autoregressive models in forecasting out of sample values. ◦the model with the lower RMSE for the out of sample data will have lower forecast error and will be expected to have better predictive power in the future.
How do you test whether a time series is arch(1)?
◦the squared residuals from an estimated time-series model are regressed on the first lag of the squared residuals; if the coefficient on the lagged squared residual is significantly different from zero, the series exhibits ARCH(1) (see the sketch below)
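A minimal sketch, not from the text, assuming statsmodels: fit an AR(1) model, then regress the squared residuals on their first lag and check the significance of the slope coefficient.

```python
# Hypothetical ARCH(1) test: regress squared residuals on lagged squared residuals.
import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.ar_model import AutoReg

np.random.seed(4)
x = np.random.normal(size=500)                 # made-up series (no ARCH by construction)

resid = AutoReg(x, lags=1).fit().resid
eps_sq = resid ** 2

y = eps_sq[1:]                                  # squared residual in period t
X = sm.add_constant(eps_sq[:-1])                # squared residual in period t-1 (plus intercept)
arch_fit = sm.OLS(y, X).fit()

print(f"slope a1 = {arch_fit.params[1]:+.3f}, t-stat = {arch_fit.tvalues[1]:+.2f}")
print("ARCH present" if abs(arch_fit.tvalues[1]) > 2.0 else "no evidence of ARCH")
```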
Log-Linear Trend Models
◦time series data often displays exponential growth (growth with continuous compounding). ◦positive exponential growth means that the random variable (i.e., the time series) tends to increase at some constant rate of growth. ◦the observations will form a convex curve. ◦negative exponential growth means that the data tends to decrease at some constant rate of decay, and the plotted time series will be a concave curve.
How do you correct for seasonality?
◦to adjust for seasonality in an AR model, an additional lag of the dependent variable (corresponding to the same period in the previous year) is added to the original model as another independent variable. if the data are quarterly, the seasonal lag is 4; if monthly, it is 12
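Not from the text — a sketch of adding a seasonal lag for quarterly data using statsmodels' AutoReg, which accepts an explicit list of lags (here lag 1 plus the seasonal lag 4) on a made-up quarterly series.

```python
# Hypothetical seasonal AR model for quarterly data: include lag 1 and the seasonal lag 4.
import numpy as np
from statsmodels.tsa.ar_model import AutoReg

np.random.seed(5)
n = 200
season = np.tile([1.0, -0.5, 0.3, -0.8], n // 4)   # made-up quarterly seasonal pattern
x = season + np.random.normal(scale=0.2, size=n)

fit = AutoReg(x, lags=[1, 4]).fit()    # x_t regressed on x_{t-1} and x_{t-4}
print(fit.params)                       # intercept, coefficient on lag 1, coefficient on lag 4
```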
Bottom line on linear versus log-linear
◦when a variable grows at a constant rate, a log-linear model is most appropriate ◦when the variable increases over time by a constant amount, a linear trend model is most appropriate
What is an autoregressive model?
◦when the dependent variable is regressed against one or more lagged values of itself, the model is called an autoregressive model ◦in an autoregressive time series, past values of a variable are used to predict the current (and hence future) value of the variable ◦statistical inferences based on ordinary least squares (OLS) estimates for an AR time series model may be invalid unless the time series being modeled is covariance stationary
LOS 13.h: Explain the instability of coefficients of time-series models
◦financial and economic time series inherently exhibit some form of instability or nonstationarity ◦this is because financial and economic conditions are dynamic, and the estimated regression coefficients in one period may be quite different from those estimated during another period ◦models estimated over longer time periods are less stable because there is more time for the underlying conditions to change
What are some factors that determine which model is best?
◦first, you must plot the data ◦use a linear trend model if the data points appear to be equally distributed above and below the regression line; inflation rate data are often modeled this way ◦if the data plot with a non-linear (curved) shape, the residuals from a linear trend model will be persistently positive or negative for a period of time ‣ in that case, use a log-linear trend model; financial data are often modeled this way
How do you transform the AR1 model for DF?
first, start with the basic form of the AR(1) model and subtract x(t−1) from both sides, giving xt − x(t−1) = b0 + (b1 − 1)x(t−1) + εt • then test whether the new, transformed coefficient g = b1 − 1 is different from zero using a modified t-test ◦if b1 − 1 is not significantly different from zero, then b1 must be equal to 1.0 and, therefore, the series must have a unit root ◦if the null (g = 0) cannot be rejected, the conclusion is that the time series has a unit root ‣ if the null is rejected, the series does not have a unit root
3. Other time-series misspecifications that result in nonstationarity (covered later)
...
Analysis of variance (ANOVA)
Analysis of variance (ANOVA) is a statistical procedure for analyzing the total variability of the dependent variable
Confidence interval for the regression coefficient, b1, is calculated as:
b̂1 ± (tc × sb1) • tc is the critical two-tailed t-value for the selected confidence level with the appropriate number of degrees of freedom, which is equal to the number of sample observations minus 2 (i.e., n − 2) • sb1 is the standard error of the regression coefficient. it is a function of the SEE: as SEE rises, sb1 also increases, and the confidence interval widens ◦SEE measures the variability of the data about the regression line, and the more variable the data, the less confidence there is in the regression model to estimate a coefficient (see the sketch below)
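A worked sketch, not from the text: assumed slope estimate 0.64, standard error 0.26, and n = 36 observations; the critical t uses n − 2 degrees of freedom.

```python
# Hypothetical 95% confidence interval for a slope coefficient.
from scipy.stats import t

b1_hat = 0.64       # assumed estimated slope coefficient
s_b1 = 0.26         # assumed standard error of the slope
n = 36              # assumed number of observations

t_crit = t.ppf(0.975, df=n - 2)              # two-tailed 5% critical value, df = n - 2
lower = b1_hat - t_crit * s_b1
upper = b1_hat + t_crit * s_b1
print(f"95% CI for b1: [{lower:.3f}, {upper:.3f}]")   # roughly [0.11, 1.17]
```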
LOS 11.g: Formulate a null and alternative hypothesis about a population value of a regression coefficient, and determine the appropriate test statistic and whether the null hypothesis is rejected at a given level of significance.
a t-test may also be used to test the hypothesis that the true slope coefficient, b1, is equal to some hypothesized value. letting b̂1 be the point estimate for b1, the appropriate test statistic with n − 2 degrees of freedom is: t = (b̂1 − hypothesized b1) / sb1
What is the effect of heteroskedasticity on regression analysis
there are four effects of heteroskedasticity you need to be aware of: ‣ the standard errors are usually unreliable estimates ‣ the coefficient estimates (the Bj) aren't affected ‣ if the standard errors are too small, but the coefficient estimates themselves are not affected, the t-statistics will be too large and the null hypothesis of no statistical significance is rejected too often. the opposite is true if the standard errors are too large ‣ the f test is also unreliable.
Hypothesis Testing of Regression coefficients
use the t-test on each of the individual coefficients... remember df = n − k − 1, where k is the number of slope (regression) coefficients and the 1 accounts for the intercept; in simple linear regression k = 1, so df = n − 2
How do you detect heteroskedasticity?
• examine scatter plots of the residuals • use the Breusch-Pagan chi-square test
Regression Coefficient Confidence Interval
• hypothesis testing for a regression coefficient may use the confidence interval for the coefficient being tested.
Intercept term
• is the line's intersection with the y-axis at x = 0. it can be positive, negative, or zero
LOS 11.j: Explain limitations of regression analysis
• linear relationships can change over time. this means that an estimation equation based on data from a specific time period may not be relevant for forecasts or predictions in another time period. this is known as parameter instability • even if the regression model accurately reflects the historical relationship between the two variables, its usefulness in investment analysis will be limited if other market participants are also aware of, and act on, the relationship • if the assumptions underlying regression do not hold, the results may not be valid. if the data are heteroskedastic (non-constant variance of the error terms) or exhibit autocorrelation (error terms are not independent), regression results may be invalid
What is multiple regression?
◦ multiple regression is regression analysis with more than one independent variable ◦simple linear regression explains the variation in stock returns in terms of the variation in systematic risk as measured by beta ◦with multiple regression, stock returns can be regressed against beta and against additional variables. such as ‣ firm size ‣ equity ‣ industry classification
F-Statistic
◦the F-test assesses how well a set of independent variables, as a group, explains the variation in the dependent variable ◦in multiple regression, the F-statistic is used to test whether at least one independent variable in a set of independent variables explains a significant portion of the variation of the dependent variable
LOS 11.c: Formulate a test of the hypothesis that the population correlation coefficient equals zero, and determine whether the hypothesis is rejected at a given level of significance
◦H0: ρ = 0 versus Ha: ρ ≠ 0 ◦assuming the variables are normally distributed, use a t-test; the test statistic is t = r√(n − 2) / √(1 − r²) with n − 2 degrees of freedom ‣ to make a decision, compare the computed t to the critical t-value for the appropriate degrees of freedom and level of significance ‣ reject H0 if t > +tcritical or t < −tcritical
The DW test procedure for positive serial correlation as follows:
◦H0: the regression has no positive serial correlation ◦if DW < dl, the error terms are positively serially correlated (i.e., reject the null hypothesis of no positive serial correlation) ◦if dl < DW < du, the test is inconclusive ◦if DW > du, there is no evidence that the error terms are positively correlated (i.e., fail to reject the null of no positive serial correlation) (see the sketch below)
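Not from the reading — a minimal sketch computing the DW statistic from OLS residuals with statsmodels; the critical values dl and du would still come from a DW table.

```python
# Hypothetical Durbin-Watson check on OLS residuals.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

np.random.seed(6)
x = np.random.normal(size=100)
y = 1.0 + 2.0 * x + np.random.normal(size=100)   # made-up data with independent errors

resid = sm.OLS(y, sm.add_constant(x)).fit().resid
dw = durbin_watson(resid)
print(f"DW = {dw:.2f}")   # ~2.0 when residuals are not serially correlated
```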
MSR and MSE
◦MSR is the mean regression sum of squares and MSE is the mean squared error. both are simply calculated as the appropriate sum of squares divided by its degrees of freedom
Adjusted R2
◦R2 almost always increases as variables are added to the model, even if the marginal contribution of the new variables is not statistically significant ◦thus, a high R2 may reflect the impact of a large set of independent variables rather than how well the set explains the dependent variable, overstating the explanatory power of the regression
Coefficient of Determination, R2
◦R2 can be used to see the effectiveness of all independent variables ◦R2 = (total variation − unexplained variation) / total variation = (SST − SSE) / SST = explained variation / total variation = RSS / SST
Regression sum of squares (RSS)
◦RSS measures the variation in the dependent variable that is explained by the independent variable. RSS is the sum of the squared distances between the predicted Y-values and the mean of Y.
Sum of squared errors (SSE)
◦SSE measures the unexplained variation in the dependent variable. ◦also known as the sum of squared residuals or the residual sum of squares. ◦SSE is the sum of the squared vertical distances between the actual Y-values and the predicted Y-values on the regression line
What are the three broad categories of model misspecification, or ways in which the regression model can be specified incorrectly, each with several subcategories:
◦the functional form can be misspecified ◦explanatory variables are correlated with the error term in time series models ◦other time-series misspecifications that result in nonstationarity
Total Variation
◦Total variation = explained variation + unexplained variation ◦SST = RSS + SSE
2. Explanatory variables are correlated with the error term in time series models
◦a lagged dependent variable is used as an independent variable ◦a function of the dependent variable is used as an independent variable ("forecasting the past") ◦independent variables are measured with error.
Interpreting a scatter plot
◦a scatter plot is a collection of points on a graph where each point represents the values of two variables (an x/y pair) ◦upward sweeping scatter plot indicates a positive correlation between the two variables, while a downward sweeping plot implies a negative correlation
What is a time series?
◦a time series is a set of observations for a variable over successive periods of time (e.g. monthly stock market returns for past ten yrs) ◦series has a trend if a consistent pattern can be seen by plotting the data (individual observations) on a graph
How do you correct serial correlation?
◦adjust the coefficient standard errors (Hansen method) ‣ also corrects for conditional heteroskedasticity ‣ these adjusted standard errors, which are sometimes called serial-correlation-consistent standard errors or Hansen-White standard errors, are then used in hypothesis testing of the regression coefficients ‣ only use the Hansen method if serial correlation is a problem; the White-corrected standard errors are preferred if only heteroskedasticity is a problem. if both conditions are present, use the Hansen method ◦improve the specification of the model: explicitly incorporate the time-series nature of the data (e.g., include a seasonal term). this can be tricky
LOS 12.e: Calculate and interpret the F-statistic, and describe how it is used in regression analysis
◦an F-test assesses how well the set of independent variables, as a group, explains the variation in the dependent variable. ◦F-statistic is used to test whether at least one of the independent variables explains a significant portion of the variation of the dependent variable. ◦to determine whether at least one of the coefficients is statistically significant, the calculated F-statistic is compared with the one-tailed critical F-value, Fc, at the appropriate level of significance. The degrees of freedom for the numerator and denominator are: ‣ dfnum = k ‣ df denom = n-k-1 ‣ n= number of observations ‣ k=number of independent variables
LOS 12.d: Explain the assumptions of a multiple regression model
◦as with simple linear regression, most of the assumptions made with the multiple regression pertain to the error term: ‣ a linear relationship exists between the dependent and independent variables. ‣ the independent variables are not random, and there is no exact linear relation between any two or more independent variables ‣ the expected value of the error term, conditional on the independent variable, is zero (i.e, sum of errors =0) ‣ the variance of the error terms is constant for all observations ‣ the error term for one observation is not correlated with that of another observation ‣ the error term is normally distributed.
LOS 11.e: Explain the assumptions underlying linear regression, and interpret the regression coefficients
◦assumptions: ‣ a linear relationship exists between the dependent and the independent variable ‣ the independent variable is uncorrelated with the residuals ‣ the expected value of the residual term is zero ‣ the variance of the residual term is constant for all observations ‣ the residual term is independently distributed; that is, the residual for one observation is not correlated with that of another observation ‣ the residual term is normally distributed.
What are the effects of model misspecification on the regression results?
◦the effects are basically the same for all of the misspecifications we will discuss ‣ regression coefficients are often biased and/or inconsistent, which means we can't have any confidence in our hypothesis tests of the coefficients or in the predictions of the model
What is the effect of serial correlation on regression analysis?
◦because the data cluster together from observation to observation, positive serial correlation typically results in coefficient standard errors that are too small ◦these small standard errors will cause the computed t-statistics to be larger than they should be, which causes too many Type I errors (rejection of the null hypothesis when it is actually true) ◦the F-test will also be unreliable because the MSE will be underestimated, leading again to too many Type I errors
How do you detect multicollinearity?
◦the best way to detect it is when t-tests indicate that none of the individual coefficients is significantly different from zero, while the F-test is significant and the R2 is high ◦so together the variables explain the dependent variable, but individually they don't; this means the independent variables are highly correlated with each other ◦if the absolute value of the sample correlation between any two independent variables in the regression is greater than 0.7, multicollinearity is a potential problem. this only works if there are exactly two independent variables ‣ if there are more than two independent variables, while individual variables may not be highly correlated, linear combinations might be, leading to multicollinearity ‣ high correlation among the independent variables suggests the possibility of multicollinearity, but low correlation among the independent variables does not necessarily indicate multicollinearity is not present
Calculating R squared and SEE
◦both of these can also be calculated directly from the anova table ◦R2 is the percentage of the total variation in the dependent variable explained by the independent variable ◦SEE is the standard deviation of the regression error terms and is equal to the square root of the mean squared error (MSE)
How do you correct heteroskedasticity?
◦calculate robust standard errors (also called White-corrected standard errors or heteroskedasticity-consistent standard errors) ◦these robust standard errors are then used to recalculate the t-statistics using the original regression coefficients ◦can also use generalized least squares, which modifies the original regression equation (see the sketch below)
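Not from the reading — a sketch of White-corrected (heteroskedasticity-consistent) standard errors in statsmodels via the cov_type argument; the coefficient estimates are unchanged, only the standard errors (and hence t-statistics) are recomputed.

```python
# Hypothetical comparison of ordinary vs. White-corrected standard errors.
import numpy as np
import statsmodels.api as sm

np.random.seed(7)
x = np.random.uniform(1, 10, size=200)
y = 3.0 + 0.5 * x + np.random.normal(scale=x, size=200)   # error variance grows with x

X = sm.add_constant(x)
ols_fit = sm.OLS(y, X).fit()                   # ordinary standard errors
robust_fit = sm.OLS(y, X).fit(cov_type="HC0")  # White-corrected standard errors

print("coefficients (identical):", ols_fit.params.round(3), robust_fit.params.round(3))
print("ordinary SEs:", ols_fit.bse.round(3))
print("robust SEs:  ", robust_fit.bse.round(3))
```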
Nonlinear relationships
◦correlation measures the linear relationship between two variables ◦two variables could have a nonlinear relationship such as y = (3x − 6)² and the correlation would be close to zero ◦therefore, another limitation of correlation analysis is that it does not capture strong nonlinear relationships between variables
Outliers
◦outliers represent a few extreme values for sample observations ◦relative to the rest of the sample data, the value of an outlier may be extraordinarily large or small ◦outliers can result in apparent statistical evidence that a significant relationship exists when, in fact, there is none, or that there is no relationship when, in fact, there is a relationship
Covariance
◦covariance between two random variables is a statistical measure of the degree to which the two variables move together ◦covariance captures the linear relationship between two variables ◦positive covariance indicates that the variables tend to move together ◦negative covariance indicates that the variables tend to move in opposite directions ◦covariance ranges from negative to positive infinity and is presented in squared units, which makes it hard to interpret, so we use correlation instead
what is the decision rule for the F-test?
◦decision rule: reject Ho if F (test statistic) > Fc (critical value)
Dependent variable
◦the dependent variable is the variable whose variation is explained by the independent variable; it is also called the endogenous variable ◦independent variable: used to explain the variation of the dependent variable. the independent variable is also referred to as the explanatory variable, the exogenous variable, or the predicting variable
Slope Coefficient (b1)
◦describes the change in Y for one unit change in X. ◦it can be positive, negative or zero, depending on the relationship between the regression variables.
What are discriminant models?
◦discriminant models are similar to probit and logit models but make different assumptions regarding the independent variables. ◦discriminant analysis results in a linear function similar to an ordinary regression, which generates an overall score, or ranking, for an observation. ◦The scores can then be used to rank or classify observations
What is the effect of multicollinearity on regression analysis?
◦even though multicollinearity does not affect the consistency of slope coefficients, such coefficients themselves tend to be unreliable ◦additionally, the standard errors of the slope coefficients are artificially inflated ‣ hence, there is a greater probability that we will incorrectly conclude that a variable is not statistically significant (i.e. Type II error) ◦likely to be present to some extent in economic models, but the issue is whether the multicollinearity has a significant effect on the regression results
What are the three most common violations for regression?
◦heteroskedasticity ◦serial correlation ◦multicollinearity
What is heteroskedasticity?
◦heteroskedasticity occurs when the variance of the residuals is not the same across all observations in the sample. This happens when there are sub samples that are more spread out than the rest of the sample. ◦unconditional heteroskedasticity occurs when the heteroskedasticity is not related to the level of the independent variables, which means that it doesn't systematically increase or decrease with changes in the value of the independent variable(s). while this is a problem with the equal variance assumption, it usually causes no major problem with the regression
DW Explained further
◦if the error terms are homoskedastic and not serially correlated, DW is approximately 2 ◦DW < 2 if the error terms are positively serially correlated (r > 0) ◦DW > 2 if the error terms are negatively serially correlated (r < 0)
LOS 12.h: Formulate a multiple regression equation by using dummy variables to represent qualitative factors, and interpret the coefficients and regression results.
◦if independent variable is binary in nature = it is either "on" or "off" ◦ these are called dummy variables and are used to quantify the impact of qualitative events. ◦they are assigned value of "0" or "1" ◦whenever we want to distinguish between n classes, we must use n-1 dummy variables. otherwise, the regression assumption of no exact linear relationship between independent variables would be violated.
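Not from the reading — a sketch of the n − 1 dummy-variable rule using pandas; for four quarters, drop_first=True keeps three dummies so there is no exact linear relationship among the regressors.

```python
# Hypothetical dummy-variable encoding: 4 quarters -> 3 dummy variables.
import pandas as pd

quarters = pd.Series(["Q1", "Q2", "Q3", "Q4", "Q1", "Q2"], name="quarter")
dummies = pd.get_dummies(quarters, drop_first=True)   # keeps Q2, Q3, Q4; Q1 is the base case
print(dummies)
```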
1. The functional form can be misspecified
◦important variables are omitted ◦variables should be transformed ◦data is improperly pooled
What is conditional heteroskedasticity?
◦it is heteroskedasticity that is related to the level of the independent variables. ◦for example, heteroskedasticity exists if the variance of the residual term increases as the value of the independent variable increases, as shown in figure 4. ◦this does create significant problems for statistical inference
What is negative serial correlation?
◦it occurs when a positive error in one period increases the probability of observing a negative error in the next period.
LOS 12.m: Interpret the economic meaning of the results of multiple regression analysis, and evaluate a regression model and its results
◦look at the slope coefficients
What's a popular application of discriminant models?
◦makes use of financial ratios as the independent variables to predict the qualitative dependent variable bankruptcy. ◦a linear relationship among the independent variables produces a value for the dependent variable that places a company in a bankrupt or not bankrupt class.
Standard error of estimates (SEE)
◦measures the degree of variability of the actual Y-Values relative to the estimated Y-Values from a regression equation. The SEE gauges the "fit" of the regression line. ◦The smaller the standard error, the better the fit. ◦the SEE is the standard deviation of the error terms in the regression. As such, SEE is also referred to as the standard error of the residual, or standard error of the regression.
Total sum of squares (SST)
◦measures the total variation in the dependent variable. SST is equal to the sum of the squared differences between the actual Y-values and the mean of Y
Determining Statistical Significance
◦most common hypothesis test is to test statistical significance- which means testing the null hypothesis that the coefficient is zero versus the alternative that it is not: ‣ "testing statistical significance" => Ho: Bj = 0 versus Ha: Bj DNE 0
How do you correct multicollinearity?
◦the most common method to correct for multicollinearity is to omit one or more of the correlated independent variables; however, it is often difficult to identify the specific variable that is the cause of the problem
LOS 12.j: Describe multicollinearity, and explain its causes and effects in regression analysis.
◦multicollinearity refers to the condition when two or more of the independent variables, or linear combinations of the independent variables, in a multiple regression are highly correlated with each other. ◦this condition distorts the standard error of estimate and the coefficient standard errors, leading to problems when conducting t-tests for statistical significance of parameters.
Interpreting p-Values
◦p-value is the smallest level of significance for which the null hypothesis can be rejected. ◦an alternative method of doing hypothesis testing of the coefficients is to compare the p-value to the significance level: ‣ if the p-value is less than significance level, the null hypothesis can be rejected ‣ if the p-value is greater than the significance level, the null hypothesis cannot be rejected.
What is positive serial correlation?
◦positive serial correlation exists when a positive regression error in one time period increases the probability of observing a positive regression error for the next time period.
Predicted Values
◦predicted values are values of the dependent variable based on the estimated regression coefficients and a prediction about the value of the independent variable. ◦they are the values that area predicted by the regression equation, given an estimate of the independent variable.
What is a probit and logit model?
◦a probit model is based on the normal distribution, while a logit model is based on the logistic distribution ◦application of these models results in estimates of the probability that the event occurs (e.g., the probability of default) ◦the maximum likelihood methodology is used to estimate coefficients for probit and logit models ◦these coefficients relate the independent variables to the likelihood of an event occurring, such as a merger, bankruptcy, or default
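Not from the reading — a sketch of estimating a logit model with statsmodels on made-up default data; sm.Probit has the same interface and simply swaps the distributional assumption.

```python
# Hypothetical logit model: probability of default as a function of a leverage ratio.
import numpy as np
import statsmodels.api as sm

np.random.seed(8)
leverage = np.random.uniform(0.1, 0.9, size=300)                   # made-up ratio
p_default = 1 / (1 + np.exp(-(-4 + 6 * leverage)))                  # true probabilities
default = (np.random.uniform(size=300) < p_default).astype(int)     # 1 = default, 0 = no default

X = sm.add_constant(leverage)
logit_fit = sm.Logit(default, X).fit(disp=False)     # maximum likelihood estimation
print(logit_fit.params)                               # intercept and slope estimates
print(logit_fit.predict(X[:5]))                       # fitted default probabilities
```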
LOS 12.l: Describe models with qualitative dependent variables
◦a qualitative dependent variable is a dummy variable that takes on a value of either zero or one ◦for example, a model that attempts to predict whether a bond issuer will default: the dependent variable is one if the issuer defaults and zero if it does not
Correlation coefficient
◦r, a measure of the strength of the linear relationship between two variables ◦no unit of measure, is a pure measure of tendency of two variables to move together.
Regression line explained
◦regression line is the line for which the estimates of Bo and B1 are such that the sum of squared differences (vertical differences) between the Y values predicted by the regression equation and actual y values is minimized ◦the sum of the squared vertical distances between the estimated and actual Y-values is referred to as the sum of squared errors (SSE)
What is regression model specification?
◦regression model specification is the selection of the explanatory (independent) variables to be included in the regression and the transformations, if any, of those explanatory variables.
Breusch-Pagan test
◦a regression of the squared residuals on the independent variables ◦if conditional heteroskedasticity is present, the independent variables will significantly contribute to the explanation of the squared residuals ◦the BP test statistic is n × R² (from this second regression), which follows a chi-square distribution with k degrees of freedom ◦it is a one-tailed test because heteroskedasticity is only a problem if the R² and the BP test statistic are too large (see the sketch below)
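Not from the reading — a sketch using statsmodels' het_breuschpagan, which regresses the squared OLS residuals on the regressors and returns the LM (n·R²) statistic and its p-value.

```python
# Hypothetical Breusch-Pagan test for conditional heteroskedasticity.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

np.random.seed(9)
x = np.random.uniform(1, 10, size=200)
y = 2.0 + 0.8 * x + np.random.normal(scale=0.5 * x, size=200)   # error variance rises with x

X = sm.add_constant(x)
resid = sm.OLS(y, X).fit().resid
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(resid, X)
print(f"BP (LM) statistic = {lm_stat:.2f}, p-value = {lm_pvalue:.4f}")
```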
F-Test explained
◦rejection of the null indicates that the slope coefficient is significantly different from zero, which is interpreted to mean that the independent variable makes a significant contribution to the explanation of the dependent variable ◦in simple linear regression, it tells us the same thing as the t-test of the slope coefficient (tb1) ◦the F-test is not particularly important for simple linear regression, but it is important for multiple regression
What is serial correlation?
◦serial correlation, aka autocorrelation, refers to situation in which the residual terms are correlated with one another. ◦serial correlation is a relatively common problem with time series data
The F-Statistic with one independent variable
◦since there is only one independent variable, the F-test tests the same hypothesis as the t-test for statistical significance of the slope coefficient ◦H0: b1 = 0 versus Ha: b1 ≠ 0 ◦compare the calculated F-statistic with the critical F-value, Fc, at the appropriate level of significance ◦the degrees of freedom for the numerator and denominator with one independent variable are: ◦df numerator = k = 1 ◦df denominator = n − k − 1 = n − 2 ◦n = number of observations ◦decision rule: reject H0 if F > Fc
Spurious Correlation
◦spurious correlation refers to the appearance of a causal linear relationship when, in fact, there is no relation
Confidence intervals for a regression coefficient
◦the confidence interval for a regression coefficient in multiple regression is calculated and interpreted the same way as it is in simple linear regression. ‣ estimated regression coefficient +/- (critical t-value)(coefficient standard error) ◦constructing a confidence interval and conducting a t test with a null hypothesis of equal to zero will always result in the same conclusion regarding the statistical significance of the regression coefficient.
LOS 12.g: Evaluate how well a regression model explains the dependent variable by analyzing the output of the regression equation and an anova table
◦the info in an anova table is used to attribute the total variation of the dependent variable to one of two sources: the regression model or the residuals. This is indicated in the first column in the table, where the "source" of the variation is listed. ◦the info in an anova table can be used to calculate R2, the f statistic, and the standard error of estimate (SEE) ◦R2 = RSS/SST ◦F=MSR/MSE with k and n-k-1 Degrees of freedom ◦SEE = square root of MSE
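A worked sketch, not from the text: assumed ANOVA quantities (RSS = 80, SSE = 20, n = 25, k = 2) used to back out R², the F-statistic, and SEE as described above.

```python
# Hypothetical calculation of R-squared, F, and SEE from assumed ANOVA quantities.
import math

RSS, SSE = 80.0, 20.0     # assumed regression and error sums of squares
n, k = 25, 2              # assumed observations and independent variables

SST = RSS + SSE                       # total sum of squares = 100
r_squared = RSS / SST                 # 0.80
MSR = RSS / k                         # 40.0
MSE = SSE / (n - k - 1)               # 20 / 22 ~= 0.909
F = MSR / MSE                         # ~= 44.0
SEE = math.sqrt(MSE)                  # ~= 0.953

print(f"R^2 = {r_squared:.2f}, F = {F:.1f}, SEE = {SEE:.3f}")
```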
Interpreting the multiple regression results
◦the intercept term is the value of the dependent variable when the independent variables are all equal to zero ◦each slope coefficient is the estimated change in the dependent variable for a one-unit change in that independent variable, holding the other independent variables constant ‣ slope coefficients in multiple regression are sometimes called partial slope coefficients
Coefficient of determination (R sqr)
◦the percentage of the total variation in the dependent variable explained by the independent variable
LOS 11.d: Distinguish between the dependent and independent variables in a linear regression
◦the purpose of simple linear regression is to explain the variation in a dependent variable in terms of the variation in a single independent variable ◦the term variation is interpreted as the degree to which a variable differs from its mean value ◦ don't confuse variation with variance - they are related but are not the same.
How much below the magic number 2 is statistically significant enough to reject the null hypothesis of no positive serial correlation?
◦there are tables of DW statistics that give upper and lower critical DW values for various sample sizes, levels of significance, and numbers of degrees of freedom
Durbin-Watson statistic (DW)
◦used to detect the presence of serial correlation
◦so you adjust R2 for the number of independent variables: Ra² = 1 − [(n − 1) / (n − k − 1)] × (1 − R²)
◦where: ‣ n = number of observations ‣ k = number of independent variables ‣ Ra² = adjusted R² ◦adjusted R² is always less than or equal to R², so adding a new variable may either increase or decrease Ra²; if the new variable barely impacts R², Ra² may decrease. it can also be less than zero if R² is low enough