CFA L2 Quant
explain Akaike's information criterion (AIC)
-evaluates a collection of models that explain the same dependent variable -AIC is a measure of model parsimony, so a lower AIC indicates a better-fitting model
In what situations would adjusted R^2 increase or decrease when adding a variable?
-increase: if the added independent variable's coefficient has an absolute t-statistic greater than 1 -decrease: if the added variable's absolute t-statistic is less than 1 (insignificant)
When do we use the Logit model?
-used when linear regression is unsuitable because the relationship between the dependent and independent variables is not linear, i.e., the dependent variable is discrete/binary rather than continuous
T/F the DW statistic can be used for regressions that have a lagged value of the dependent variable as one of the explanatory variables
F
T/F "This regression is a special case of a first-order autoregressive [AR(1)] model in which the value for b0 is close to zero and the value of b1 is close to 1. These values suggest that the time series is a random walk."
T. When modeled using an AR(1) model, random walks will have an estimated intercept coefficient near zero and an estimated slope coefficient on the first lag near 1. Therefore, the statement is correct.
Standard Error of the Estimate (SEE)
The standard error of the estimate is the standard deviation of the regression residuals. The lower the SEE, the better the model fit. SEE = sqrt(MSE)
limitations of log-linear model?
not appropriate if serial correlation is present
What is a unit root?
- A random walk, with or without a drift, is not covariance stationary - If the value of the lag coefficient is equal to one, the time series is said to have a unit root and will follow a random walk process.
Effect of Serial Correlation on Regression Analysis
-positive serial correlation results in coefficient standard errors that are too small, even though the estimated coefficients are consistent -the F-test will be unreliable because MSE will be understated
explain variance inflation factor (VIF)
- allows us to quantify multicollinearity - VIF > 5 warrants further investigation while a VIF > 10 indicates severe multicollinearity
Possible remedies for serial correlation include:
- robust standard errors (also called serial-correlation consistent or Newey-West corrected standard errors). These robust standard errors are then used to recalculate the t-statistics using the original regression coefficients, as in the sketch below.
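A minimal sketch of this remedy in Python, assuming statsmodels is available; the series y and regressor X are hypothetical, with AR(1) errors planted to create serial correlation:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 120
X = sm.add_constant(rng.normal(size=(n, 1)))
e = np.zeros(n)
for t in range(1, n):                 # AR(1) errors -> positive serial correlation
    e[t] = 0.7 * e[t - 1] + rng.normal()
y = X @ np.array([1.0, 0.5]) + e

ols = sm.OLS(y, X).fit()              # naive standard errors
nw = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": 4})  # Newey-West
print(ols.tvalues)                    # t-stats based on understated SEs
print(nw.tvalues)                     # corrected t-stats, same coefficients
```

The coefficients are identical in both fits; only the standard errors, and hence the t-statistics, change.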
assumptions of a multiple regression model (6)
-A linear relationship exists between the dependent and independent variables.
-The independent variables are not random, and there is no exact linear relation between any two or more independent variables.
-The expected value of the error term, conditional on the independent variables, is zero.
-The variance of the error term is constant for all observations.
-The error terms are uncorrelated with each other.
-The error term is normally distributed.
Explain Schwarz's Bayesian information criterion (BIC or SBC)
-A statistic used to compare sets of independent variables for explaining a dependent variable. It is preferred for finding the model with the best goodness of fit. -Compared to AIC, BIC assesses a greater penalty for having more parameters in a model, so it will tend to prefer smaller, more parsimonious models. -Lower BIC means a better fit
When is AIC vs BIC preferred?
-AIC is better for prediction purposes -BIC is better when goodness of fit is preferred (see the sketch below)
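A minimal sketch of comparing candidate models by AIC and BIC with statsmodels; the data and variables (one useful regressor, one irrelevant one) are hypothetical:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 200
X1 = rng.normal(size=(n, 1))          # useful regressor
X2 = rng.normal(size=(n, 1))          # irrelevant regressor
y = 1.0 + 2.0 * X1[:, 0] + rng.normal(size=n)

small = sm.OLS(y, sm.add_constant(X1)).fit()
large = sm.OLS(y, sm.add_constant(np.hstack([X1, X2]))).fit()

# Lower AIC/BIC indicates the better-fitting model; BIC penalizes the
# extra parameter in `large` more heavily than AIC does.
print(f"small: AIC={small.aic:.1f}  BIC={small.bic:.1f}")
print(f"large: AIC={large.aic:.1f}  BIC={large.bic:.1f}")
```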
explain the effect of the below misspecification: Omission of important independent variable(s)
-Biased and inconsistent regression parameters -May lead to serial correlation or heteroskedasticity in the residuals
Explain Leverage
-For a particular independent variable, leverage measures the distance between the value of the ith observation of that independent variable and the mean value of that variable across all n observations. -Leverage is a value between 0 and 1, and the higher the leverage, the more distant the observation's value is from the variable's mean and, hence, the more influence the ith observation can potentially exert on the estimated regression.
Rejecting the null in the DF test means?
-Rejecting the null (that g1 = b1 − 1 equals zero) means the coefficient is statistically different from zero, so there is no unit root -If (b1 − 1) is not significantly different from zero, then b1 is statistically indistinguishable from 1.0 and, therefore, the series has a unit root
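A minimal sketch of this test in Python using statsmodels' augmented Dickey-Fuller implementation; the input series is a hypothetical simulated random walk:

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(1)
series = np.cumsum(rng.normal(size=500))   # random walk: x_t = x_(t-1) + e_t

stat, pvalue = adfuller(series)[:2]
# H0: the series has a unit root. A large p-value (fail to reject) is
# consistent with a random walk; rejecting H0 implies no unit root.
print(f"ADF statistic = {stat:.2f}, p-value = {pvalue:.3f}")
```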
What is the Breusch-Godfrey test used to detect? What is the Breusch-Pagan test used to detect?
-The Breusch-Godfrey test is used to detect serial correlation; it is a more general test that can accommodate serial correlation at multiple lags -The Breusch-Pagan test is a formal test used to detect heteroskedasticity
What is multicollinearity? Effects? How to detect?
-Two or more independent variables are highly correlated with each other -Standard errors tend to be inflated, so coefficient estimates tend to be unreliable -Detect: high R^2 and a significant F-test combined with insignificant t-tests
What is Heteroskedasticity? Define the difference between unconditional and conditional Heteroskedasticity.
-Variance of the residuals is not the same across all observations
-Unconditional heteroskedasticity: not related to the level of the independent variables; it does not systematically increase or decrease with changes in their values (NOT an issue)
-Conditional heteroskedasticity: related to the level of the independent variables, e.g., the error variance increases as the value of an independent variable increases (BIG problem)
How to spot for seasonality?
-test whether the residual autocorrelation at the seasonal lag (e.g., lag 4 with quarterly data) is statistically different from zero -if the t-statistic on that autocorrelation is significant (reject the null of zero autocorrelation), seasonality is present
Explain the Breusch-Pagan Test for conditional heteroskedasticity
-test statistic is n x R^2, where R^2 comes from a regression of the squared residuals on the independent variables -the test statistic is compared to a chi-square critical value (one-tailed, with degrees of freedom equal to the number of independent variables); if n x R^2 is greater, the null hypothesis of no conditional heteroskedasticity is rejected
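A minimal sketch of the n x R^2 computation in Python, assuming statsmodels and scipy; the data are hypothetical, with conditional heteroskedasticity planted in the errors (statsmodels' het_breuschpagan runs a comparable calculation):

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(2)
n, k = 200, 1
X = sm.add_constant(rng.normal(size=(n, k)))
# Error variance grows with |x| -> conditional heteroskedasticity
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n) * (1 + np.abs(X[:, 1]))

resid = sm.OLS(y, X).fit().resid
aux = sm.OLS(resid**2, X).fit()       # auxiliary regression of squared residuals
bp_stat = n * aux.rsquared            # test statistic: n x R^2
crit = chi2.ppf(0.95, df=k)           # one-tailed test, k degrees of freedom
print(bp_stat > crit)                 # True -> reject H0 of no cond. het.
```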
Interpret Adjusted R^2
-the value of R^2 adjusted for the number of independent variables and the sample size. -model with higher adjusted R^2 is preferred
Explain what the Root Mean Squared Error (RMSE) is
-used to compare out-of-sample forecasting performance -the smaller the RMSE, the more accurate the forecast
when is a log-linear vs linear model appropriate when it comes to growth rates?
-if the variable grows at a constant rate, a log-linear model is most appropriate -if the variable grows by a constant amount, a linear trend model is most appropriate
formula to calculate variance inflation factor (VIF)
1 / (1 - R^2_j), where R^2_j is the R^2 from regressing independent variable j on the remaining independent variables
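A minimal sketch of this formula in Python, with two deliberately collinear hypothetical regressors; statsmodels' variance_inflation_factor gives the same number:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
x1 = rng.normal(size=300)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=300)   # nearly collinear with x1
X = sm.add_constant(np.column_stack([x1, x2]))

r2 = sm.OLS(X[:, 1], X[:, [0, 2]]).fit().rsquared  # regress x1 on const, x2
print(1 / (1 - r2))                                # manual VIF for x1
print(variance_inflation_factor(X, 1))             # statsmodels VIF for x1
```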
A time series is covariance stationary if it satisfies the following three conditions:
1) Constant and finite expected value. The expected value of the time series is constant over time. (Later, we will refer to this value as the mean-reverting level.)
2) Constant and finite variance. The time series' volatility around its mean (i.e., the distribution of the individual observations around the mean) does not change over time.
3) Constant and finite covariance between values at any given lag. The covariance of the time series with leading or lagged values of itself is constant.
Assumptions of linear regression
1) Linearity: The relationship between the dependent variable, Y, and the independent variable, X, is linear.
2) Homoskedasticity: The variance of the regression residuals is the same for all observations.
3) Independence: The observations, pairs of Ys and Xs, are independent of one another. This implies the regression residuals are uncorrelated across observations.
4) Normality: The regression residuals are normally distributed.
how to correct heteroskedasticity?
1) compute Robust SE 2) use generalized least squares
four effects of heteroskedasticity on regression
1. The standard errors are usually unreliable estimates.
2. The coefficient estimates themselves aren't affected.
3. If the standard errors are too small, the t-statistics will be too large and the null hypothesis of no statistical significance will be rejected too often. The opposite is true if the standard errors are too large.
4. The F-test is also unreliable.
A leverage measure is potentially influential if it exceeds this
3*[(k + 1)/n]
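A minimal sketch of this rule in Python, assuming statsmodels; the data are hypothetical, with one extreme x-value planted so it gets flagged:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n, k = 100, 2
X = rng.normal(size=(n, k))
X[0, 0] = 8.0                          # plant one extreme x-value
y = X @ np.array([1.0, -1.0]) + rng.normal(size=n)

res = sm.OLS(y, sm.add_constant(X)).fit()
leverage = res.get_influence().hat_matrix_diag   # h_ii for each observation
flagged = np.where(leverage > 3 * (k + 1) / n)[0]
print(flagged)                         # observation 0 should be flagged
```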
Formula for standard error of the autocorrelations
= 1/sqrt(T), where T is the number of observations used in the regression
F Stat Formula
= MSR/MSE = (RSS/k) / (SSE/(n - k - 1))
Compare a Type I Error vs Type II Error
A type I error (false-positive) occurs if an investigator rejects a null hypothesis that is actually true in the population; a type II error (false-negative) occurs if the investigator fails to reject a null hypothesis that is actually false in the population
Explain an independent variable's slope change in a Logit model
An independent variable's slope coefficient in a logistic regression model is the change in the log odds that the event happens per unit change in the independent variable, holding all other independent variables constant.
Formula for F-test from ANOVA Table
F = MSR/MSE = (RSS/k) / (SSE/(n - k - 1))
explain adjusted R^2
Goodness-of-fit measure that adjusts the coefficient of determination, R2, for the number of independent variables in the model.
Durbin Watson decision rule and null hypothesis
H0: no positive serial correlation. If d > d_u: fail to reject H0. If d < d_l: reject H0 and conclude positive serial correlation. If d_l < d < d_u: the test is inconclusive.
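A minimal sketch computing the statistic in Python with statsmodels; the series is hypothetical, with positively autocorrelated errors planted so DW falls below 2:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(5)
n = 150
x = rng.normal(size=n)
e = np.zeros(n)
for t in range(1, n):                  # positively autocorrelated errors
    e[t] = 0.6 * e[t - 1] + rng.normal()
y = 1.0 + 0.5 * x + e

resid = sm.OLS(y, sm.add_constant(x)).fit().resid
print(durbin_watson(resid))            # expect a value well below 2
```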
A time series is a random walk if:
If a time series follows a random walk process, the predicted value of the series (i.e., the value of the dependent variable) in one period is equal to the value of the series in the previous period plus a random error term: x_t = x_(t-1) + ε_t.
Explain how to find influential points using leverage measure?
If the value exceeds 3[(k+1)/n]
Formula for LR Test
LR = −2 (Log likelihood restricted model − Log likelihood unrestricted model)
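A minimal sketch of the LR test for a logistic regression in Python, assuming statsmodels and scipy; the data and the restriction (dropping x2) are hypothetical:

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(6)
n = 500
x1, x2 = rng.normal(size=n), rng.normal(size=n)
p = 1 / (1 + np.exp(-(0.5 + 1.0 * x1 + 0.8 * x2)))
y = rng.binomial(1, p)

restricted = sm.Logit(y, sm.add_constant(x1)).fit(disp=0)
unrestricted = sm.Logit(y, sm.add_constant(np.column_stack([x1, x2]))).fit(disp=0)

lr = -2 * (restricted.llf - unrestricted.llf)   # LR test statistic
pval = chi2.sf(lr, df=1)                        # df = number of restrictions
print(f"LR = {lr:.2f}, p = {pval:.4f}")         # small p -> keep x2
```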
explain the effect of the below misspecification: Inappropriate variable form
May lead to heteroskedasticity in the residuals
explain the effect of the below misspecification: Inappropriate variable scaling
May lead to heteroskedasticity in the residuals or multicollinearity
explain the effect of the below misspecification: Data improperly pooled
May lead to heteroskedasticity or serial correlation in the residuals
Is a random walk also covariance stationary?
No. Neither a random walk nor a random walk with a drift exhibits covariance stationarity. A series needs a finite mean-reverting level to be covariance stationary.
What indicates a better fitting model using the log-likelihood metric?
Note the log-likelihood metric is always negative, so higher values (closer to 0) indicate a better fitting model.
Reasons to use adjusted R squared over R squared
R squared always increases as variables are added to the model, even if the marginal contribution of those new variables is not statistically significant
Why is a random walk not covariance stationary?
Random walks have no finite mean reverting level and no finite variance
What is serial correlation?
Serial correlation, also known as autocorrelation, refers to a situation in which regression residual terms are correlated with one another; that is, they are not independent. Serial correlation can pose a serious problem in regressions using time-series data.
If Cook's D is greater than 1
The ith observation is highly likely to be an influential data point.
If Cook's D is greater than 2*sqrt(k/n)
The ith observation is highly likely to be an influential data point.
If Cook's D is greater than 0.5...
The ith observation may be influential and merits further investigation.
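A minimal sketch checking these thresholds in Python with statsmodels' influence diagnostics; the data are hypothetical, with one outlier planted:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n, k = 80, 2
X = rng.normal(size=(n, k))
y = X @ np.array([1.0, 1.0]) + rng.normal(size=n)
y[0] += 10.0                                   # plant an outlier

infl = sm.OLS(y, sm.add_constant(X)).fit().get_influence()
cooks_d = infl.cooks_distance[0]               # Cook's D per observation
student = infl.resid_studentized_external      # studentized residuals

print(np.where(cooks_d > 2 * np.sqrt(k / n))[0])   # likely influential
print(np.where(np.abs(student) > 3)[0])            # outliers (large sample)
```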
How to Correct Multicollinearity
The most common method to correct for multicollinearity is to omit one or more of the correlated independent variables. Alternatively, we can use a different proxy for one of the included independent variables, or increase the sample size.
How to detect Heteroskedasticity?
There are two methods to detect heteroskedasticity: examining scatter plots of the residuals and using the Breusch-Pagan chi-square (χ2) test. A scatter plot of the residuals versus one or more of the independent variables can reveal patterns among observations.
How to correct for seasonality?
To adjust for seasonality in an AR model, an additional lag of the dependent variable (corresponding to the same period in the previous year) is added to the original model as another independent variable. For example, if quarterly data are used, the seasonal lag is 4; if monthly data are used the seasonal lag is 12; and so on.
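A minimal sketch of this adjustment in Python for quarterly data, i.e., fitting x_t = b0 + b1*x_(t-1) + b2*x_(t-4) + ε_t with OLS; the series is hypothetical and simulated with a seasonal term:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
T = 200
x = np.zeros(T)
for t in range(4, T):                  # simulate an AR(1) with a seasonal lag
    x[t] = 0.5 + 0.4 * x[t - 1] + 0.3 * x[t - 4] + rng.normal()

dep = x[4:]                            # x_t
lag1 = x[3:-1]                         # x_(t-1)
lag4 = x[:-4]                          # x_(t-4), the seasonal lag
res = sm.OLS(dep, sm.add_constant(np.column_stack([lag1, lag4]))).fit()
print(res.params)                      # estimates of b0, b1, b2
```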
Explain cointegration
Two time series are cointegrated if a long-term financial or economic relationship exists between them such that they do not diverge from each other without bound in the long run. For example, two time series are cointegrated if they share a common trend.
How to detect Serial Correlation?
Use the Durbin-Watson test: DW ≈ 2(1 - r), where r is the sample correlation between the regression residuals from one period and those from the previous period. DW < 2 suggests positive serial correlation; DW > 2 suggests negative serial correlation.
Explain the likelihood ratio (LR) test
a method to assess the fit of logistic regression models and is based on the log-likelihood metric that describes the fit to the data.
A time series model with no serial correlation has a DW statistic of what value?
approximately 2
autoregressive conditional heteroskedasticity (ARCH) exists if
ARCH exists if the variance of the residuals in one period is dependent on the variance of the residuals in a previous period. When this condition exists, the standard errors of the regression coefficients in AR models, and the hypothesis tests of these coefficients, are invalid.
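A minimal sketch of testing for this in Python via statsmodels' het_arch, an LM test that regresses squared residuals on their own lags; the residual series is hypothetical, simulated with ARCH(1) errors:

```python
import numpy as np
from statsmodels.stats.diagnostic import het_arch

rng = np.random.default_rng(9)
n = 300
e = np.zeros(n)
for t in range(1, n):                  # ARCH(1): variance depends on the
    sigma = np.sqrt(0.2 + 0.7 * e[t - 1] ** 2)   # previous squared residual
    e[t] = sigma * rng.normal()

lm_stat, lm_pvalue = het_arch(e, nlags=1)[:2]
print(f"LM = {lm_stat:.2f}, p = {lm_pvalue:.4f}")   # small p -> ARCH present
```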
mean-reverting level is expressed as
b_0 / (1 - b_1)
Explain Cook's distance
A metric for identifying influential data points. It measures how much the estimated values of the regression change if observation i is deleted from the sample.
Interpret a high and low RMSE
lower is better; implies that it yields the most accurate forecast.
Total sum of squares (SST)
measures the total variation in the dependent variable. SST is equal to the sum of the squared differences between the actual Y-values and the mean of Y (actual minus mean).
Sum of squared errors (SSE)
measures the unexplained variation in the dependent variable. It's also known as the sum of squared residuals or the residual sum of squares. SSE is the sum of the squared vertical distances between the actual Y-values and the predicted Y-values on the regression line (actual minus predicted).
Regression sum of squares (RSS)
measures the variation in the dependent variable that is explained by the independent variable. RSS is the sum of the squared distances between the predicted Y-values and the mean of Y (predicted minus mean).
degrees of freedom for Breusch--Godfrey (BG) test
n - p - k - 1 degrees of freedom, where p is the number of residual lags included in the test
Breusch--Pagan test Formula
n x R^2, where R^2 is from the regression of the squared residuals on the independent variables
Can DW test statistic be used for an AR model?
no
coefficient of determination (r^2)
percentage of the total variation in the dependent variable (Y) explained by the independent variable (X) =RSS/SST (Explained variation/Total variation)
RMSE formula
square root of the average squared errors
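A minimal sketch of this formula in Python; the actual and forecast arrays are hypothetical:

```python
import numpy as np

actual = np.array([1.2, 0.8, 1.5, 1.1])      # out-of-sample observations
forecast = np.array([1.0, 1.0, 1.3, 1.2])    # model forecasts
rmse = np.sqrt(np.mean((actual - forecast) ** 2))
print(rmse)                                  # lower -> more accurate forecasts
```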
Conditional Heteroskedasticity occurs when...
the error variance is not constant and is correlated with the independent variables in the multiple regression
When to use AR(2) model?
add a second lag and fit an AR(2) model when the t-statistics on the residual autocorrelations of the AR(1) model are significant, indicating the AR(1) specification is incomplete
For large samples when using studentized residuals, the flagged observation is an outlier if
|t| > 3
For small samples when using studentized residuals, the flagged observation is potentially influential if
|t| > critical value of the t-statistic with n - k - 2 degrees of freedom at the selected significance level