CFA L2 Quantitative Methods
F statistic
F = MSR/MSE = (RSS/k) / (SSE/(n-k-1)), where MSR = mean regression sum of squares and MSE = mean squared error. This is always a one-tailed test. It is not as useful when we have only one independent variable because it tells us the same thing as the t-test of the slope coefficient.
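As a quick illustration with hypothetical numbers: suppose RSS = 400, SSE = 600, k = 3, and n = 54. Then MSR = 400/3 = 133.3, MSE = 600/(54 - 3 - 1) = 12.0, and F = 133.3/12.0 = 11.1, with 3 and 50 degrees of freedom. Reject the null that all slope coefficients equal zero if F exceeds the one-tailed critical value.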
Example: The Breusch-Pagan test The residual plot of mutual fund returns over time shows evidence of heteroskedasticity. To confirm your suspicions, you regress the squared residuals from the original regression on the independent variable, S&P 500 index returns. The R2 from that regression is 8%. Use the Breusch-Pagan test to determine whether heteroskedasticity is present at the 5% significance level.
With five years of monthly observations, n is equal to 60. The test statistic is: n × R2 = 60 × 0.08 = 4.8 The one-tailed critical value for a chi-square distribution with one degree of freedom and α equal to 5% is 3.841. Therefore you should reject the null hypothesis and conclude that you have a problem with conditional heteroskedasticity.
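A minimal Python sketch of the same mechanics, assuming statsmodels and scipy are available (the data below is simulated purely for illustration):

import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(0)
x = rng.normal(size=60)                      # 60 monthly index returns (simulated)
resid = rng.normal(size=60) * (1 + 0.5 * x)  # residuals whose variance depends on x

# Breusch-Pagan: regress squared residuals on the independent variable(s)
aux = sm.OLS(resid**2, sm.add_constant(x)).fit()
bp_stat = len(x) * aux.rsquared              # test statistic = n x R^2
critical = chi2.ppf(0.95, df=1)              # df = number of independent variables; 3.841
print(bp_stat > critical)                    # True -> reject H0 of no conditional heteroskedasticity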
generalized least squares
a second method to correct for heteroskedasticity, which attempts to eliminate the heteroskedasticity by modifying the original regression equation
Correlation coefficient
converts the covariance into a standardized measure that is easier to interpret. The correlation coefficient, or r, is a measure of the strength of the linear relationship between two variables. It has no unit of measurement; it is a pure measure of the tendency of two variables to move together.
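In symbols, r = cov(X,Y) / (sX × sY), where sX and sY are the sample standard deviations of X and Y. The correlation coefficient is bounded between -1 and +1.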
positive serial correlation
exists when a positive regression error in one time period increases the probability of observing a positive regression error in the next time period
Standard error of estimate (SEE)
measures the degree of variability of the actual Y-values relative to the estimated Y-values from a regression equation. The smaller the standard error, the better the fit. The SEE is the standard deviation of the error terms in the regression and equals the square root of the MSE.
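In the notation used elsewhere in these notes, SEE = √MSE = √(SSE / (n - k - 1)). For example, with hypothetical values SSE = 600 and n - k - 1 = 50, MSE = 12 and SEE = √12 ≈ 3.46.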
homoskedasticity
refers to the basic assumption of a multiple regression model that the variance of the error terms is constant.
Serial Correlation (Auto Correlation)
refers to the situation in which the residual terms are correlated with one another. Serial correlation is a relatively common problem with time series data.
Ordinary Least Square
regression is used to estimate the coefficients in the trend line, which provides the following prediction equation: ŷt = b0 + b1(t). It is very similar to the simple linear regression model we covered previously; only here, (t) takes on the value of the time period.
When these assumptions are violated, the inferences drawn from the model are questionable. There are three primary assumption violations that you will encounter: On exam day, you must be able to answer the following four questions about each of the three assumption violations:•What is it?•What is its effect on regression analysis?•How do we detect it?•How do we correct for it?
(1) heteroskedasticity, (2) serial correlation (i.e., autocorrelation), and (3) multicollinearity.
To determine whether a time series is covariance stationary, we can
(1) run an AR model and examine autocorrelations, or (2) perform the Dickey Fuller test.
Advantages of simulations
1. Better input quality - superior inputs are likely to result when an analyst goes through the process of selecting a proper distribution for critical inputs, rather than relying on single best estimates. The distribution selected can additionally be checked for conformity with historical or cross sectional data 2. Provides a distribution of expected value rather than a point estimate - the distribution of an investment's expected value provides an indication of risk in the investment.
A time series is covariance stationary if it satisfies the following
1. Constant and finite expected value: the expected value of the time series is constant over time. 2. Constant and finite variance: the time series' volatility around its mean does not change over time. 3. Constant and finite covariance between values at any given lag: the covariance of the time series with leading or lagged values of itself is constant. The estimation results of an AR model involving a time series that is not covariance stationary are meaningless.
Limitations of multiple regression
1. Linear relationships can change over time. This means that the estimation equation based on data from a specific time period may not be relevant for forecasts or predictions in another time period. This is referred to as parameter instability. 2. Even if the regression model accurately reflects the historical relationship between the variables, its usefulness in investment analysis will be limited if other market participants are also aware of and act on this evidence. 3. If the assumptions underlying regression analysis do not hold, the interpretation and tests of hypotheses may not be valid. For example, if the data is heteroskedastic (non-constant variance of the error terms) or exhibits autocorrelation (error terms are not independent), regression results may be invalid.
Three broad categories of model misspecification, or ways in which the regression model can be specified incorrectly, each with several subcategories
1. The functional form can be misspecified: important variables are omitted, variables should be transformed, or data is improperly pooled. 2. Explanatory variables are correlated with the error term in time series models: a lagged dependent variable is used as an independent variable, a function of the dependent variable is used as an independent variable, or independent variables are measured with error. 3. Other time-series misspecifications that result in nonstationarity.
correcting serial correlation
1. Adjust the coefficient standard errors using the Hansen method. The Hansen method also corrects for conditional heteroskedasticity. These adjusted standard errors, which are sometimes called serial correlation consistent standard errors or Hansen-White standard errors, are then used in hypothesis testing of the regression coefficients. Only use the Hansen method if serial correlation is a problem; the White-corrected standard errors are preferred if only heteroskedasticity is a problem. If both conditions are present, use the Hansen method. 2. Improve the specification of the model. The best way to do this is to explicitly incorporate the time-series nature of the data.
four effects of heteroskedasticity
1. The standard errors are usually unreliable estimates. 2. The coefficient estimates aren't affected. 3. If the standard errors are too small (while the coefficient estimates themselves are unaffected), the t-statistics will be too large, and the null hypothesis of no statistical significance will be rejected too often. The opposite is true if the standard errors are too large. 4. The F-test is also unreliable.
Example: Computing the slope coefficient and intercept term. Compute the slope coefficient and intercept term for the ABC regression example using the following information: cov(S&P 500, ABC) = 0.000336; mean S&P 500 return = -2.70%; var(S&P 500) = 0.000522; mean ABC return = -4.05%.
Answer: The slope coefficient is b1 = cov(S&P 500, ABC) / var(S&P 500) = 0.000336 / 0.000522 = 0.64. The intercept term is b0 = mean ABC return - b1 × (mean S&P 500 return) = -4.05% - 0.64(-2.70%) = -2.3%. The regression line is therefore: ABC = -2.3% + 0.64(S&P 500).
Total variation
= explained variation + unexplained variation, or SST = RSS + SSE
Time-Series Analysis LOS 11
A time series is a set of observations of a random variable spaced evenly through time (e.g., quarterly sales revenue for a company over the past 60 quarters). For the exam, given a regression output, identifying violations such as heteroskedasticity, nonstationarity, serial correlation, etc., will be important, as well as being able to calculate a predicted value given a time-series model. Know why a log-linear model is sometimes used; understand the implications of seasonality and how to detect and correct it, as well as the root mean squared error (RMSE) criterion.
Solving For the mean reverting level
All covariance stationary time series have a finite mean-reverting level.
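For an AR(1) model, xt = b0 + b1xt-1 + εt, the mean-reverting level solves x = b0 + b1x, giving x = b0 / (1 - b1). For example, with hypothetical estimates b0 = 1.2 and b1 = 0.4, the mean-reverting level is 1.2 / 0.6 = 2.0. The level is finite only when b1 ≠ 1, which is why a random walk has no finite mean-reverting level.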
Example: The Durbin-Watson test for serial correlation Suppose you have a regression output which includes three independent variables that provide you with a DW statistic of 1.23. Also suppose that the sample size is 40. At a 5% significance level, determine if the error terms are serially correlated.
Answer: From a 5% DW table with n = 40 and k = 3, the lower and upper critical DW values are found to be dl = 1.34 and du = 1.66, respectively. Since DW < dl (i.e., 1.23 < 1.34), you should reject the null hypothesis and conclude that the regression has positive serial correlation among the error terms.
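A minimal numpy sketch of how the statistic itself is computed, using simulated residuals for illustration:

import numpy as np

rng = np.random.default_rng(1)
e = rng.normal(size=40)  # regression residuals in time order (simulated here)

# DW = sum of squared successive differences / sum of squared residuals
dw = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
print(dw)  # statsmodels.stats.stattools.durbin_watson(e) computes the same statistic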
unit root
As we discussed in the previous LOS, if the coefficient on the lag variable is 1, the series is not covariance stationary. If the value of the lag coefficient is equal to one, the time series is said to have a unit root and will follow a random walk process. Since a time series that follows a random walk is not covariance stationary, modeling such a time series in an AR model can lead to incorrect inferences.
Effect of Serial Correlation
Because of the tendency of the data to cluster together from observation to observation, positive serial correlation typically results in coefficient standard errors that are too small, even though the estimated coefficients are consistent. These small standard error terms will cause the computed t-statistics to be larger than they should be, which will cause too many Type I errors: the rejection of the null hypothesis when it is actually true. The F-test will also be unreliable because the MSE will be underestimated leading again to too many Type I errors.
Simulation constraints
Constraints are specific limits imposed by users of simulations as a risk assessment tool. A constraint is a condition that, if violated, would pose dire consequences for the firm. The three types are book value constraints, earnings and cash flow constraints, and market value constraints.
Dynamic Correlations
Correlations between input variables may not be stable. To the extent that correlations between input variables change over time, it becomes far more difficult to model them. If we model the correlation between variables based on past data and such relationships amongst variables change, the output of simulation will be flawed.
Earnings / CF
Earnings or cash flow constraints can be imposed internally to meet analyst expectations or to achieve bonus targets. Sometimes, failure to meet analyst expectations could result in job losses for the executive team. Earnings constraints can also be imposed externally, such as by a loan covenant. Violating such a constraint could be very expensive for the firm.
Effect of Multicollinearity on Regression Analysis
Even though multicollinearity does not affect the consistency of slope coefficients, such coefficients themselves tend to be unreliable. Additionally, the standard errors of the slope coefficients are artificially inflated. Hence, there is a greater probability that we will incorrectly conclude that a variable is not statistically significant (i.e., a Type II error). Multicollinearity is likely to be present to some extent in most economic models. The issue is whether the multicollinearity has a significant effect on the regression results.
How to analyze a regression model
Examine the individual coefficients using t-tests, determine the validity of the model with the F-test and the R2, and look out for heteroskedasticity, serial correlation, and multicollinearity.
LOS 11.h: Explain the instability of coefficients of time-series models.
Financial and economic time series inherently exhibit some form of instability or nonstationarity. Models estimated with shorter time series are usually more stable than those with longer time series because a longer sample period increases the chance that the underlying economic process has changed. Thus, there is a tradeoff between the increased statistical reliability of longer time periods and the increased stability of estimates from shorter periods.
There are three approaches to specifying a distribution:
Historical: examination of past data may point to a distribution that is suitable for the probabilistic variable. This method assumes that future values of the variable will be similar to its past. Cross-sectional: when past data is unavailable (or unreliable), we may estimate the distribution of the variable based on the values of the variable for peers. Pick a distribution and estimate its parameters: when neither historical nor cross-sectional data provide adequate insight, subjective specification of a distribution along with the related parameters is the appropriate approach.
first differencing
If we believe a time series is a random walk (i.e., has a unit root), we can transform the data to a covariance stationary time series using a procedure called first differencing. This involves subtracting the value of the time series (i.e., the dependent variable) in the immediately preceding period from the current value of the time series to define a new dependent variable, y.
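In symbols, the new series is yt = xt - xt-1. If xt follows a random walk, xt = xt-1 + εt, then yt = εt, which has a constant mean (zero) and constant variance, so the differenced series is covariance stationary and can be modeled with an AR model.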
LOS 12.a: Describe steps in running a simulation. LOS 12.b: Explain three ways to define the probability distributions for a simulation's variables. LOS 12.c: Describe how to treat correlation across variables in a simulation.
Imagine a capital budgeting exercise to estimate the net present value of a project. The project involves production of a new product with uncertain demand. To estimate the cash flows from the project, we need the estimated demand and selling price per unit for each of the years, as well as estimated cash expenses. All of these variables are uncertain and can only be estimated with error. Some variables are more uncertain than others; for example, cash expenses are relatively easy to estimate, but product demand is more uncertain. Some variables (e.g., interest rates) are not limited to a few discrete outcomes; such variables can take on any number of values (within some plausible range). Simulations lend themselves to situations where the risk is continuous (e.g., uncertainty in interest rate). This flexibility helps simulations to accurately model reality and provide a full picture of the risk in an investment
run an AR model and examine autocorrelations
An AR model is estimated, and the statistical significance of the autocorrelations of the residuals at various lags is examined. A stationary process will usually have residual autocorrelations that are insignificantly different from zero at all lags or that decay to zero as the number of lags increases.
Steps in simulation - 3: Check for correlations among variables
In this step, we use historical data to determine whether any of the probabilistic input variables are systematically related. When there is a strong correlation between variables, we can either (1) allow only one of the variables to vary, or (2) build the rules of correlation into the simulation.
Non Stationary Distribution
Input variable distributions may change over time, so the distribution and parameters specified for a particular simulation may not be valid anymore. For example, based on past data, we conclude that earnings growth rate has a normal distribution with a mean of 3% and variance of 2.5%. However, the parameters may have changed to a mean of 2% and variance of 5%.
Market value constraints
Market value constraints seek to minimize the likelihood of financial distress or bankruptcy for the firm. In a simulation, we can explicitly model the entire distribution of key input variables to identify situations where financial distress would be likely. We can then explicitly incorporate the costs of financial distress in a valuation model for the firm.
Multicollinearity
Multicollinearity refers to the condition when two or more of the independent variables, or linear combinations of the independent variables, in a multiple regression are highly correlated with each other. This condition distorts the standard error of estimate and the coefficient standard errors, leading to problems when conducting t-tests for statistical significance of parameters.
LOS 10 - Multiple Regression
Multiple regression is the centerpiece of the quantitative methods topic at Level II. It is a useful analysis tool that shows up throughout the Level II curriculum. Multiple regression is especially important for multifactor models in Study Sessions 9 (Equity) and 17 (Portfolio Management). Know this material well. You should know how to use a t-test to assess the significance of the individual regression parameters and an F-test to assess the effectiveness of the model as a whole in explaining the dependent variable. You should understand the effect that heteroskedasticity, serial correlation, and multicollinearity have on regression results. Also be able to identify the common model misspecifications. Focus on interpretation of the regression equation and the test statistics. Remember that most of the test and descriptive statistics discussed (e.g., t-stat, F-stat, and R2) are provided in the output of statistical software. Hence, application and interpretation of these measurements are more likely than actual computations on the exam.
Adding a second independent variable to a linear regression
Notice that the estimated slope coefficient for X1 changed from 4.5 to 2.5 when we added X2 to the regression. We would expect this to happen most of the time when a second variable is added to the regression, unless X2 is uncorrelated with X1, because if X1 increases by 1 unit, then we would expect X2 to change as well. The multiple regression equation captures this relationship between X1 and X2 when predicting Y.
Steps in simulation - 1: Determine the probabilistic variables
Probabilistic variables are the uncertain input variables that influence the value of an investment. While there is no limit to the number of uncertain input variables, in practice some variables are either predictable or have an insignificant influence on the value of the investment.
However, there are several different types of models that use a qualitative dependent variable
probit, logit, and discriminant models
Coefficient of determination (multiple regression)
R2 can be used to test the overall effectiveness of the entire set of independent variables in explaining the dependent variable. Its interpretation is similar to that for simple linear regression: the percentage of variation in the dependent variable that is collectively explained by all of the independent variables. Calculated in the same manner as in simple linear regression: R2 = RSS / SST.
Inappropriate Stat Distribution
Real world data often does not fit the stringent requirements of statistical distributions. If the underlying distribution of an input is improperly specified, the quality of that input will be poor
Input Quality
Regardless of the complexities employed in running simulations, if the underlying inputs are poorly specified, the output will be low quality (i.e., garbage in, garbage out). In fact, the detailed output provided in a simulation may give the decision maker a false sense of making an informed decision.
If X and Y are perfectly correlated...
Regressing Y onto X will result in a standard error of estimate of zero.
coefficient of determination (R^2)
R2 is defined as the percentage of the total variation in the dependent variable explained by the independent variable. It ranges from 0 to +1.
Autoregressive
Statistical inferences based on ordinary least squares (OLS) estimates for an AR time series model may be invalid unless the time series being modeled is covariance stationary.
Example: Calculating a confidence interval for a regression coefficient Calculate the 90% confidence interval for the estimated coefficient for the independent variable PR in the real earnings growth example.
The critical t-value is 1.68, the same as we used in testing the statistical significance at the 10% significance level (which is the same thing as a 90% confidence level). The estimated slope coefficient is 0.25 and the standard error is 0.032. The 90% confidence interval is: 0.25 ± (1.68)(0.032) = 0.25 ± 0.054 = 0.196 to 0.304 Professor's Note: Notice that because zero is not contained in the 90% confidence interval, we can conclude that the PR coefficient is statistically significant at the 10% level—the same conclusion we made when using the t-test earlier in this topic review. Constructing a confidence interval and conducting a t-test with a null hypothesis of "equal to zero" will always result in the same conclusion regarding the statistical significance of the regression coefficient.
LOS 10.o: Evaluate and interpret a multiple regression model and its results.
The economic meaning of the results of a regression estimation focuses primarily on the slope coefficients. For example, suppose that we run a regression using a cross section of stock returns (in percent) as the dependent variable, and the stock betas (CAPM) and market capitalizations (in $ billions) as our independent variables
Correcting Multicollinearity
The most common method to correct for multicollinearity is to omit one or more of the correlated independent variables. Unfortunately, it is not always an easy task to identify the variable(s) that are the source of the multicollinearity. There are statistical procedures that may help in this effort, like stepwise regression, which systematically removes variables from the regression until multicollinearity is minimized.
Detecting Multicollinearity
The most common way to detect multicollinearity is the situation where t-tests indicate that none of the individual coefficients is significantly different than zero, while the F-test is statistically significant and the R2 is high. This suggests that the variables together explain much of the variation in the dependent variable, but the individual independent variables don't.
ANOVA Table
The output of the ANOVA procedure is an ANOVA table, which is a summary of the variation in the dependent variable. ANOVA tables are included in the regression output of many statistical software packages. You can think of the ANOVA table as the source of the data for the computation of many of the regression concepts discussed in this topic review.
Interpreting p-Values
The p-value is the smallest level of significance for which the null hypothesis can be rejected. An alternative method of doing hypothesis testing of the coefficients is to compare the p-value to the significance level: •If the p-value is less than significance level, the null hypothesis can be rejected. •If the p-value is greater than the significance level, the null hypothesis cannot be rejected.
Detecting Serial Correlation
There are two methods that are commonly used to detect the presence of serial correlation: residual plots and the Durbin-Watson statistic
Steps in simulation - 2: Define probability distributions for these variables
This important but sometimes difficult step entails specifying the distribution from which to sample each uncertain variable. First we must determine the appropriate distribution to characterize the uncertain variable; then we must also specify the parameters for that distribution.
Probabilistic Approaches: Scenario Analysis, Decision Trees, and Simulations LOS 12
This topic review discusses simulation as a risk-measurement tool. After studying this material, you should be able to understand the methodology used in running simulations and the limitations of simulations, recognize why and how certain constraints are introduced into simulations, and determine when simulation versus some other method (such as a decision tree or scenario analysis) is appropriate.
Correcting for seasonality
To adjust for seasonality in an AR model, an additional lag of the dependent variable (corresponding to the same period in the previous year) is added to the original model as another independent variable. For example, if quarterly data are used, the seasonal lag is 4; if monthly data are used the seasonal lag is 12; and so on.
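For example, with quarterly data, an AR(1) model with a seasonal lag takes the form xt = b0 + b1xt-1 + b2xt-4 + εt, where the fourth lag captures the value from the same quarter of the previous year.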
LOS 11.o: Determine an appropriate time-series model to analyze a given investment problem and justify that choice.
To determine what type of model is best suited to meet your needs, follow these guidelines: 1. Determine your goal 2.If you have decided on using a time series analysis for an individual variable, plot the values of the variable over time and look for characteristics that would indicate nonstationarity 3.If there is no seasonality or structural shift, use a trend model 4.Run the trend analysis, compute the residuals, and test for serial correlation using the Durbin Watson test. 5.If the data has serial correlation, reexamine the data for stationarity before running an AR model. If it is not stationary, treat the data for use in an AR model as follows: 6.After first-differencing in 5 above, if the series is covariance stationary, run an AR(1) model and test for serial correlation and seasonality. 7.Test for ARCH. Regress the square of the residuals on squares of lagged values of the residuals and test whether the resulting coefficient is significantly different from zero. 8.If you have developed two statistically reliable models and want to determine which is better at forecasting, calculate their out-of-sample RMSE.
Occasionally an analyst will run a regression using two time series (i.e., time series utilizing two different variables). For example, using the market model to estimate the equity beta for a stock, an analyst regresses a time series of the stock's returns (yt) on a time series of returns for the market (xt): yt = b0 + b1xt + et
To test whether the two time series have unit roots, the analyst first runs separate DF tests with five possible results: 1.Both time series are covariance stationary. 2.Only the dependent variable time series is covariance stationary. 3.Only the independent variable time series is covariance stationary. 4.Neither time series is covariance stationary and the two series are not cointegrated. 5.Neither time series is covariance stationary and the two series are cointegrated. In scenario 1 the analyst can use linear regression, and the coefficients should be statistically reliable, but regressions in scenarios 2 and 3 will not be reliable. Whether linear regression can be used in scenarios 4 and 5 depends upon whether the two time series are cointegrated.
Nonlinear relationships
Two variables could have a nonlinear relationship yet a zero correlation. Therefore, another limitation of correlation analysis is that it doesn't capture strong nonlinear relationships between variables.
Example: Hypothesis testing with dummy variables The standard error of the coefficient b1 is equal to 0.15 from the EPS regression model. Test whether first quarter EPS is equal to fourth quarter EPS at the 5% significance level.
We are testing the following hypothesis: H0: b1 = 0 vs. HA: b1 ≠ 0. The t-statistic is 0.75/0.15 = 5.0, and the two-tailed 5% critical value with 36 degrees of freedom is approximately 2.03. Therefore, we should reject the null and conclude that first quarter EPS is statistically significantly different from fourth quarter EPS at the 5% significance level.
predicting the dependent variable
We can use the regression equation to make predictions about the dependent variable based on forecasted values of the independent variables. The process is similar to forecasting with simple linear regression, only now we need predicted values for more than one independent variable. It does not matter if some variables are statistically insignificant: use the full model.
LOS 11.e: Explain how autocorrelations of the residuals can be used to test whether the autoregressive model fits the time series.
When an AR model is correctly specified, the residual terms will not exhibit serial correlation. Serial correlation (or autocorrelation) means the error terms are positively or negatively correlated. When the error terms are correlated, standard errors are unreliable and t-tests of individual coefficients can incorrectly show statistical significance or insignificance.
Durbin-Watson statistic
The Durbin-Watson statistic is approximated by DW ≈ 2(1 - r), where r is the correlation between consecutive error terms. You can see from this approximation that DW is approximately equal to 2 if the error terms are homoskedastic and not serially correlated (r = 0), DW < 2 if the error terms are positively serially correlated (r > 0), and DW > 2 if the error terms are negatively serially correlated (r < 0).
qualitative dependent variable
a dummy variable that takes on a value of either zero or one
dickey fuller test
a more definitive test for a unit root. For statistical reasons, you cannot directly test whether the coefficient on the independent variable in an AR time series is equal to 1. To compensate, Dickey and Fuller created a rather ingenious test for a unit root. Remember, if an AR(1) model has a coefficient of 1, it has a unit root and no finite mean-reverting level (i.e., it is not covariance stationary). If the null hypothesis is rejected, you know there is not a unit root; if you cannot reject the null, the series has a unit root.
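The standard transformation behind the test: subtract xt-1 from both sides of the AR(1) model xt = b0 + b1xt-1 + εt to get xt - xt-1 = b0 + (b1 - 1)xt-1 + εt. The Dickey-Fuller test is then H0: (b1 - 1) = 0 (the series has a unit root) versus HA: (b1 - 1) < 0, evaluated with a modified t-test that uses revised critical values.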
Probit and logit models
a probit model is based on the normal distribution, while a logit model is based on the logistic distribution. Application of these models results in estimates of the probability that the event occurs (pd). The maximum likelihood methodology is used to estimate coefficients for probit and logit models. These coefficients relate the independent variables to the likelihood of an event occurring, such as a merger, bankruptcy, or default
Analysis of variance
a statistical procedure for analyzing the total variability of the dependent variable.
finite mean-reverting level
a time series must have a finite mean-reverting level to be covariance stationary. Thus, a random walk, with or without a drift, is not covariance stationary, and exhibits what is known as a unit root (b1 = 1). For a time series that is not covariance stationary, the least squares regression procedure that we have been using to estimate an AR(1) model will not work without transforming the data.
Adjusted R2
an adjustment that accounts for the fact that R2 automatically increases as independent variables are added to the regression, even when their explanatory power is marginal
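The standard formula is: adjusted R2 = 1 - [(n - 1) / (n - k - 1)] × (1 - R2). For example, with hypothetical values R2 = 0.60, n = 30, and k = 5, adjusted R2 = 1 - (29/24)(0.40) ≈ 0.52. Adjusted R2 is always less than or equal to R2 and can fall when a variable with little explanatory power is added.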
partial slope coefficient
another name for the coefficients in a multiple regression
linear trend model
appropriate if the data points appear to be equally distributed above and below the regression line. Inflation rate data can often be modeled with a linear trend model. Use when data grows by a constant AMOUNT
Decision Tree
are an appropriate approach when risk is both discrete and sequential. For example, imagine that an investment's value varies based on the uncertain outcome of a number of discrete sequential events, and at time t=0 there are two possible choices: make the investment or not. If we make the investment, the cash flow at time t=1 can be either high (C1H) or low (C1L). If the cash flow is high, we can then decide to expand capacity (expand or don't expand). If we expand capacity, the cash flow can be EC2H or EC2L, but if we don't expand capacity, the cash flow will either be DC2H or DC2L. Like simulations, decision trees consider all possible states of the outcome and hence the sum of the probabilities is 1.
BV Constraint
are imposed on a firm's book value of equity. There are two types of restrictions on book value of equity that may necessitate risk hedging: •Regulatory capital requirements. Banks and insurance companies are required to maintain adequate levels of capital. Violations of minimum capital requirements are considered serious and could threaten the very existence of the firm. •Negative equity. In some countries, negative book value of equity may have serious consequences
Dummy variables
assigned values of 0 or 1. An important consideration when performing multiple regression with dummy variables is the choice of the number of dummy variables to include in the model. Whenever we want to distinguish between n classes, we must use n - 1 dummy variables. Otherwise, the regression assumption of no exact linear relationship between independent variables would be violated.
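As an illustration consistent with the quarterly EPS example elsewhere in these notes: to distinguish n = 4 quarters, use 3 dummy variables, e.g., EPSt = b0 + b1Q1t + b2Q2t + b3Q3t + εt, where Q1t = 1 if period t is a first quarter and 0 otherwise. The intercept b0 is the average EPS for the omitted class (fourth quarter), and each slope coefficient measures the difference from that omitted class.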
What is ARCH
autoregressive conditional heteroskedasticity (ARCH) exists if the variance of the residuals in one period is dependent on the variance of the residuals in a previous period. When this condition exists, the standard errors of the regression coefficients in AR models and the hypothesis tests of these coefficients are invalid.
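The standard test for ARCH(1) regresses the squared residuals on their own lagged values: εt2 = a0 + a1εt-12 + ut, where εt2 is the squared residual in period t. If a1 is statistically different from zero, ARCH is present and generalized least squares can be used to correct for it (see step 7 in the model-selection guidelines below).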
Covariance
between two random variables is a statistical measure of the degree to which the two variables move together. It captures the linear relationship between two variables. The actual value is not very meaningful because its measurement is extremely sensitive to the scale of the two variables.
8.If you have developed two statistically reliable models and want to determine which is better at forecasting,
calculate their out-of-sample RMSE.
Outliers
computed correlation coefficients, as well as other sample statistics, may be affected by outliers. These represent a few extreme values for sample observations. Relative to the rest of the sample data, the value of an outlier may be extraordinarily large or small.
Scenario Analysis
computes the value of an investment under a finite set of scenarios (e.g., best case, worst case, and most likely case). Because the full spectrum of outcomes is not considered in these scenarios, the combined probability of the outcomes that are considered is less than 1.
Professor's Note: Testing the statistical significance of the regression coefficients means
conducting a t-test with a null hypothesis that the regression coefficient is equal to zero. Rather than cover that concept here, even though it is mentioned in this LOS, we will cover it in detail in a later LOS as part of our general discussion of hypothesis testing
two methods of detecting heteroskedasticity
examining scatter plots of the residuals and using the Breusch-Pagan chi-square test.
Trend
if a consistent pattern can be seen by plotting the data (i.e., the individual observations) on a graph. For example, a seasonal trend in sales data is easily detected by plotting the data and noting the significant jump in sales during the same month(s) each year.
Random walk
if a time series follows a random walk process, the predicted value of the series in one period is equal to the value of the series in the previous period plus a random error term: xt = xt-1 + εt, where: 1. E(εt) = 0: the expected value of each error term is zero. 2. E(εt2) = σ2: the variance of the error terms is constant. 3. E(εi,εj) = 0 if i ≠ j: there is no serial correlation in the error terms.
Random walk with a drift
if a time series follows this, the intercept term is not equal to zero: xt = b0 + xt-1 + εt, with b0 ≠ 0. That is, in addition to a random walk error term, the time series is expected to increase or decrease by a constant amount each period.
Mean Reversion
if it has a tendency to move toward its mean. In other words, the time series has a tendency to decline when the current value is above the mean and rise when the current value is below the mean. If a time series is at its mean-reverting level, the model predicts that the next value of the time series will be the same as its current value
Linear vs. log-linear trend models
A linear trend model may be appropriate if the data points appear to be equally distributed above and below the regression line; inflation rate data can often be modeled with a linear trend model. A log-linear model may be appropriate if the data plots with a non-linear (curved) shape; in that case, the residuals from a linear trend model will be persistently positive or negative for a period of time.
Limitations of using simulations as a risk assessment tool
input quality, inappropriate statistical distributions, non-stationary distributions, dynamic correlations
Time Series
is a set of observations for a variable over successive periods of time (e.g., monthly stock market returns for the past ten years).
Conditional Heteroskedasticity
is heteroskedasticity that is related to the level of (i.e., conditional on) the independent variables. For example, conditional heteroskedasticity exists if the variance of the residual term increases as the value of the independent variable increases: the residual variance associated with larger values of the independent variable, X, is larger than the residual variance associated with smaller values of X. Conditional heteroskedasticity does create significant problems for statistical inference.
Multiple Regression definition
is regression analysis with more than one independent variable. It is used to quantify the influence of two or more independent variables on a dependent variable. For instance, simple (or univariate) linear regression explains the variation in stock returns in terms of the variation in systematic risk as measured by beta. With multiple regression, stock returns can be regressed against beta and against additional variables, such as firm size, equity, and industry classification, that might influence returns.
Regression model specification
is the selection of the explanatory (independent) variables to be included in the regression and the transformations, if any, of those explanatory variables
Primary concern when selecting a time-series sample (economic or financial data)
is the underlying economic process. Have there been regulatory changes? Has there been a dramatic change in the underlying economic environment? If the answer is yes, then the historical data may not provide a reliable model.
root mean squared error criterion
is used to compare the accuracy of autoregressive models in forecasting out of sample values. For example, a researcher may have two autoregressive (AR) models: an AR(1) model and an AR(2) model. To determine which model will more accurately forecast future values, we calculate the RMSE (the square root of the average of the squared errors) for the out-of-sample data. Note that the model with the lowest RMSE for in-sample data may not be the model with the lowest RMSE for out-of-sample data.
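A minimal numpy sketch, using hypothetical out-of-sample forecasts from two competing models:

import numpy as np

actual = np.array([2.1, 2.4, 2.0, 2.6, 2.3])  # realized out-of-sample values (hypothetical)
f_ar1 = np.array([2.0, 2.5, 2.2, 2.4, 2.2])   # AR(1) forecasts (hypothetical)
f_ar2 = np.array([2.3, 2.1, 1.9, 2.8, 2.5])   # AR(2) forecasts (hypothetical)

def rmse(forecast, actual):
    return np.sqrt(np.mean((forecast - actual) ** 2))

print(rmse(f_ar1, actual), rmse(f_ar2, actual))  # prefer the model with the lower RMSE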
Covariance Stationarity
mean and variance do not change over time
Cointegrated
means that two time series are economically linked (related to the same macro variables) or follow the same trend and that relationship is not expected to change. If two time series are cointegrated, the error term from regressing one on the other is covariance stationary and the t-tests are reliable. This means that scenario 5 will produce reliable regression estimates, whereas scenario 4 will not.
log-linear model
most useful when the data plots with a non-linear shape and the residuals from a linear trend model would be persistently positive or negative. Often used for financial data and company sales. Use when data grows by a constant RATE.
negative serial correlation
occurs when a positive error in one period increases the probability of observing a negative error in the next period.
unconditional heteroskedasticity
occurs when the heteroskedasticity is not related to the level of the independent variables, which means that it doesn't systematically increase or decrease with changes in the value of the independent variable(s). While this is a violation of the equal variance assumption, it usually causes no major problems with the regression
Heteroskedasticity
occurs when the variance of the residuals is not the same across all observations in the sample. This happens when there are subsamples that are more spread out than the rest of the sample.
Steps in simulation - 4: Run the simulation
running the simulation means randomly drawing variables from their underlying distributions and then using them as inputs to generate estimated values. This process may be repeated to yield thousands of estimates of value, giving a distribution of the investment's value.
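A minimal Python sketch of the capital budgeting simulation described earlier in this topic; every distribution parameter, the discount rate, and the initial outlay are illustrative assumptions, and correlations across inputs (step 3) are ignored for simplicity:

import numpy as np

rng = np.random.default_rng(42)
n_sims, years, rate = 10_000, 5, 0.10  # assumed horizon and discount rate

# Step 4: draw each probabilistic input from its assumed distribution (step 2)
demand = rng.normal(100_000, 20_000, size=(n_sims, years))    # units sold per year
price = rng.normal(10.0, 1.0, size=(n_sims, years))           # selling price per unit
expenses = rng.normal(600_000, 50_000, size=(n_sims, years))  # cash expenses per year

cash_flows = demand * price - expenses
discount = (1 + rate) ** np.arange(1, years + 1)
npv = (cash_flows / discount).sum(axis=1) - 2_000_000         # assumed initial outlay

# The output is a distribution of value, not a point estimate
print(npv.mean(), np.percentile(npv, 5))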
Discriminant model
similar to probit and logit models, but makes different assumptions regarding the independent variables. Discriminant analysis results in a linear function similar to an ordinary regression, which generates an overall score, or ranking, for an observation. The scores can then be used to rank or classify observations. A popular application of a discriminant model makes use of financial ratios as the independent variables to predict the qualitative dependent variable bankruptcy. A linear relationship among the independent variables produces a value for the dependent variable that places a company in a bankrupt or not-bankrupt class.
Chain Rule of Forecasting
since the independent variable is a lagged value of the dependent variable, it is necessary to calculate a one-step-ahead forecast before a two-step-ahead forecast can be calculated.
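For example, for an AR(1) model with hypothetical estimates b0 = 1.2 and b1 = 0.4 and a current value xt = 3.0, the one-step-ahead forecast is 1.2 + 0.4(3.0) = 2.4, and the two-step-ahead forecast is 1.2 + 0.4(2.4) = 2.16. Forecast errors compound, because the two-step forecast depends on the uncertain one-step forecast.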
interpretation of slope coefficient and intercept term
slope coefficient: when excess X returns increase by 1%, Y returns increase by b1%. intercept term: when the excess return on X is zero, the return on Y is b0. However, any conclusions regarding the importance of an independent variable in explaining a dependent variable require determining the statistical significance of the slope coefficient. A hypothesis test must be conducted, or a confidence interval must be formed, to assess the importance of the variable.
Spurious correlation
the appearance of a causal linear relationship when, in fact, there is no relation. Certain data items may be highly correlated purely by chance.
Limitations of correlation analysis
the impact of outliers, the potential for spurious correlation, and nonlinear relationships
white corrected standard errors (robust standard errors)
the most common remedy for heteroskedasticity. These robust standard errors are then used to recalculate the t-statistics using the original regression coefficients. On the exam, use robust standard errors to calculate t-statistics if there is evidence of heteroskedasticity.
Independent variable
the variable used to explain the variation of the dependent variable (also referred to as the explanatory variable, the exogenous variable, or the predicting variable).
Dependent variable
the variable whose variation is explained by the independent variable. We are interested in answering the question, "What explains fluctuations in the dependent variable?" (also referred to as the explained variable, the endogenous variable, or the predicted variable).
Purpose of a simple linear regression
to explain the variation in a dependent variable in terms of the variation in a single independent variable.
Predicted Values
values of the dependent variable based on the estimated regression coefficients and a prediction about the value of the independent variable. They are the values that are predicted by the regression equation, given an estimate of the independent variable.
autoregressive model
when the dependent variable is regressed against one or more lagged values of itself, the resulting model is an autoregressive model. For example, the sales for a firm could be regressed against the sales for the firm in the previous month.
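In symbols, an AR(1) model is xt = b0 + b1xt-1 + εt, and an AR(p) model generalizes this to p lagged values: xt = b0 + b1xt-1 + b2xt-2 + ... + bpxt-p + εt.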
1. Determine your goal
•Are you attempting to model the relationship of a variable to other variables (e.g., cointegrated time series, cross-sectional multiple regression)? •Are you trying to model the variable over time (e.g., trend model)?
7. Test for ARCH. Regress the square of the residuals on squares of lagged values of the residuals and test whether the resulting coefficient is significantly different from zero.
•If the coefficient is not significantly different from zero, you can use the model. •If the coefficient is significantly different from zero, ARCH is present. Correct using generalized least squares.
5.If the data has serial correlation, reexamine the data for stationarity before running an AR model. If it is not stationary, treat the data for use in an AR model as follows:
•If the data has a linear trend, first-difference the data. •If the data has an exponential trend, first-difference the natural log of the data. •If there is a structural shift in the data, run two separate models as discussed above. •If the data has a seasonal component, incorporate the seasonality in the AR model as discussed below.
3. If there is no seasonality or structural shift, use a trend model.
•If the data plot on a straight line with an upward or downward slope, use a linear trend model. •If the data plot in a curve, use a log-linear trend model.
6.After first-differencing in 5 above, if the series is covariance stationary, run an AR(1) model and test for serial correlation and seasonality.
•If there is no remaining serial correlation, you can use the model. •If you still detect serial correlation, incorporate lagged values of the variable (possibly including one for seasonality—e.g., for monthly data, add the 12th lag of the time series) into the AR model until you have removed (i.e., modeled) any serial correlation.
4.Run the trend analysis, compute the residuals, and test for serial correlation using the Durbin Watson test.
•If you detect no serial correlation, you can use the model. •If you detect serial correlation, you must use another model (e.g., AR).
The interpretation of the estimated regression coefficients from a multiple regression is the same as in simple linear regression for the intercept term but significantly different for the slope coefficients:
•The intercept term is the value of the dependent variable when the independent variables are all equal to zero. •Each slope coefficient is the estimated change in the dependent variable for a one-unit change in that independent variable, holding the other independent variables constant. That's why the slope coefficients in a multiple regression are sometimes called partial slope coefficients
The number of simulations needed for a good output is driven by:
•The number of uncertain variables. The higher the number of probabilistic inputs, the greater the number of simulations needed. •The types of distributions. The greater the variability in types of distributions, the greater the number of simulations needed. Conversely, if all variables are specified by one distribution (e.g., normal), then the number of simulations needed would be lower. •The range of outcomes. The wider the range of outcomes of the uncertain variables, the higher the number of simulations needed.