Quantitative Methods Modules 7-10
qualitative dependent variable
-example = default or don't default -can't use ordinary linear regression; use a probit, logit, or discriminant model instead
Misspecification #4: Using a Lagged Dependent Variable as an Independent Variable
-the misspecification is exactly what the title says: a lagged value of the dependent variable is used as an independent variable
Detecting Multicollinearity
#***-The most common sign of multicollinearity: t-tests indicate that none of the individual coefficients is significantly different from zero, while the F-test is statistically significant and the R^2 is high. -This suggests that the variables together explain much of the variation in the dependent variable, but no individual independent variable does. *The correct step is to remove one of the variables and see whether the remaining variable becomes significant. *-If the absolute value of the sample correlation between any two independent variables in the regression is greater than 0.7, multicollinearity is a potential problem; however, correlation between independent variables is a reasonable indicator of multicollinearity only in a regression with exactly two independent variables.
**T-Test
T = R*SQRT(N-2)/SQRT(1-R^2)
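A minimal Python sketch of this significance test (the sample correlation and sample size are made up):

import math

r, n = 0.60, 30                                  # hypothetical sample correlation and sample size
t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)   # the formula above
print(t, "df =", n - 2)                          # compare to the critical t with n-2 degrees of freedom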
A time series is covariance stationary if it satisfies the following three conditions:
1. Constant and finite expected value. The expected value of the time series is constant over time. (Later, we will refer to this value as the mean-reverting level.)
2. Constant and finite variance. The time series' volatility around its mean (i.e., the distribution of the individual observations around the mean) does not change over time.
3. Constant and finite covariance between values at any given lag. The covariance of the time series with leading or lagged values of itself is constant.
Misspecification #6: Measuring Independent Variables with Error
e.g., using GDP per capita as a proxy for happiness—the proxy measures the true variable with error
Degrees of freedom (error) =
n - k - 1 (n = number of observations, k = number of independent variables)
R^2 =
(SST - SSE) / SST or RSS/SST; SST = RSS + SSE
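A minimal Python sketch of the decomposition with made-up numbers (note the identity SST = RSS + SSE holds exactly only for OLS fits that include an intercept):

import numpy as np

y     = np.array([3.0, 5.0, 7.0, 6.0, 9.0])   # actual Y values (hypothetical)
y_hat = np.array([3.5, 4.8, 6.9, 6.5, 8.3])   # predicted Y values (hypothetical)
sst = np.sum((y - y.mean())**2)               # total variation
sse = np.sum((y - y_hat)**2)                  # unexplained variation
rss = np.sum((y_hat - y.mean())**2)           # explained variation
print((sst - sse) / sst)                      # R^2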
*Multiple linear Regression Assumptions
****-The variance of the error term is assumed to be constant, resulting in errors that are [homoskedastic].
**-The error term is normally distributed.
**-The error terms are uncorrelated with each other.
-The independent variables are not random.
*-Regression forces the mean of the error term to zero by definition.
**Advantages of Simulations
**-Better input quality.
*-Provides a distribution of expected value rather than a point estimate.
*-Simulations do not provide better estimates of value.
**-Simulations by themselves do not necessarily lead to better decisions.
**correlation coefficient
**Rxy = Cov(X,Y) / (SdX × SdY); r = Rxy
https://www.youtube.com/watch?v=NVc3RZHyzaA
**BA II Plus, with two data series: 2nd [DATA] (the 7 key), 2nd [CLR WORK], enter the values, 2nd [STAT], then the down-arrow key. In simple linear regression, r = the correlation coefficient = the square root of R^2.
Effect of Heteroskedasticity on Regression Analysis
**The standard errors are usually unreliable estimates. **If the heteroskedasticity test statistic > the critical value (e.g., 1.96), there is heteroskedasticity in the error term variance. The coefficient estimates (the b-hats) aren't affected. If the standard errors are too small (but the coefficient estimates themselves are not affected), the t-statistics will be too large and the null hypothesis of no statistical significance is rejected too often. The opposite is true if the standard errors are too large. The F-test is also unreliable.
**Dickey and Fuller
*-If an AR(1) model has a lag coefficient of 1, it has a unit root and no finite mean-reverting level (i.e., it is *not covariance stationary).
*The Dickey-Fuller test for unit roots tests whether the series is covariance nonstationary. The DF test computes the conventional t-statistic, which is then compared against a revised set of critical values computed by DF.
***If the test statistic is significant, we reject the null hypothesis (that the time series has a unit root), implying that a unit root is NOT present.
Mnemonic: DF — Dickey = one unit, Fuller = hole in one.
**SEE = The standard error of estimate
*= SQRT(MSE) = SQRT(SSE/(n-2)) in simple regression. The SEE will be low if the relationship is strong, and high if the relationship is weak.
*Classification and Regression Trees (CART)
*Classification trees are appropriate when the target variable is categorical; regression trees are appropriate when the target is continuous.
*The CART (classification and regression trees) approach is most commonly used when the outcome is binary (e.g., outperforms or does not) and there may be significant nonlinear relationships among variables.
*Variables are added in order of the greatest contribution to misclassification-error reduction and cease being added when no further meaningful reduction is possible.
Random walk.
*If a time series follows a random walk process, the predicted value of the series (i.e., the value of the dependent variable) in one period is equal to the value of the series in the previous period plus a random error term. In equation form: xt = xt-1 + εt, where the best forecast of xt is xt-1, and:
E(εt) = 0: the expected value of each error term is zero.
E(εt^2) = σ^2: the variance of the error terms is constant.
E(εi εj) = 0 for i ≠ j: there is no serial correlation in the error terms.
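A minimal Python sketch simulating a random walk (parameters are illustrative):

import numpy as np

rng = np.random.default_rng(seed=42)
eps = rng.normal(0.0, 1.0, size=100)   # E(eps)=0, constant variance, no serial correlation
x = np.cumsum(eps)                     # builds xt = xt-1 + eps_t, with x0 = 0
print(x[-1])                           # the best forecast of the next value is simply the last value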
ARCH (1)
*If the estimate of the slope of the regression of the squared residuals on the lagged one period squared residuals is statistically significantly different from 0, the time series is ARCH(1).
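A minimal Python sketch of this check, assuming resid holds the residuals from some fitted model (the helper name is hypothetical):

import numpy as np

def arch1_t_stat(resid):
    e2 = resid**2
    y, x = e2[1:], e2[:-1]                     # squared residuals vs. their one-period lag
    X = np.column_stack([np.ones_like(x), x])
    b, ss = np.linalg.lstsq(X, y, rcond=None)[:2]
    s2 = ss[0] / (len(y) - 2)                  # residual variance of this regression
    se_b1 = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
    return b[1] / se_b1                        # significant t => ARCH(1) is present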
**unsupervised learning,
*ML program is [not given labeled] training data. Instead, inputs are provided without any conclusions about those inputs. In the absence of any tagged data, the program seeks out structure or interrelationships in the data. *Clustering is one example of the output of an unsupervised ML program.
Effects of multicollinearity and serial correlation on coefficients
*Neither multicollinearity nor serial correlation affects the consistency of the regression (slope) coefficients—the coefficient estimates remain consistent.
*Multicollinearity can, however, make the regression coefficients unreliable.
*Multicollinearity = consistent but not reliable (like darts that all land on the 1).
*Both multicollinearity and serial correlation bias the standard errors of the slope coefficients.
*Cointegration (good)
*[Two] time series are cointegrated if they are economically linked (related to the same macro variables) or follow the same trend, and that relationship is not expected to change.
***If only one of the time series has a unit root, we should not use linear regression. If neither time series has a unit root, or if both time series have a unit root [AND] the series are cointegrated, linear regression is appropriate.
*If cointegrated, the slope and intercept estimates are valid.
Discriminant models.
-Discriminant models are similar to probit and logit models but make different assumptions regarding the independent variables. Discriminant analysis results in a linear function similar to an ordinary regression, which generates an overall score, or ranking, for an observation.
*First Differencing
-If we believe a time series is a random walk (i.e., has a unit root), we can transform the data to a covariance stationary time series using a procedure called first differencing.
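A minimal Python sketch (hypothetical levels data): the differenced series yt = xt − xt−1 is what gets modeled:

import numpy as np

x = np.array([100.0, 101.5, 101.2, 103.0, 102.4])  # hypothetical random-walk levels
y = np.diff(x)                                     # yt = xt - xt-1
print(y)                                           # model y, which is covariance stationary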
There are several limitations of using simulations as a risk assessment tool:
-Input quality. -Inappropriate statistical distributions. -Non-stationary distributions. -Dynamic correlations.
Decision trees
-Like simulations, decision trees consider all possible states of the outcome and hence the sum of the probabilities is 1.
Log-Linear Trend Models
-Time series data, particularly financial time series, often display exponential growth: yt = e^(b0 + b1(t)). Take the natural log of both sides to get the linear model ln(yt) = b0 + b1(t).
Ordinary least squares (OLS) regression
-used to estimate the coefficients of the trend line yt = b0 + b1(t)
*The standard error of the estimated autocorrelations
1 / SQRT (T)
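A minimal Python sketch comparing sample autocorrelations to this standard error (white-noise data, so the autocorrelations should fall within roughly ±2/SQRT(T)):

import numpy as np

def autocorr(x, k):
    xd = x - x.mean()
    return np.sum(xd[k:] * xd[:-k]) / np.sum(xd**2)

rng = np.random.default_rng(1)
x = rng.normal(size=200)
print(autocorr(x, 1), 1 / np.sqrt(len(x)))   # lag-1 autocorrelation vs. its standard error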
Types of constraints in Simulations
1-Book value constraints
-Regulatory capital requirements. Banks and insurance companies are required to maintain adequate levels of capital. Violations of minimum capital requirements are considered serious and could threaten the very existence of the firm.
-Negative equity. In some countries, negative book value of equity may have serious consequences. For example, some European countries require firms to raise additional capital in the event that book value of equity becomes negative.
2-Earnings and cash flow constraints
-Sometimes, failure to meet analyst expectations could result in job losses for the executive team.
-Earnings constraints can also be imposed externally, such as a loan covenant. Violating such a constraint could be very expensive for the firm.
3-Market value constraints
-Market value constraints seek to minimize the likelihood of financial distress or bankruptcy for the firm.
MSR (mean squared regression)
= RSS / k = [sum of (predicted Y - mean of Y)^2] / k, where k = number of independent variables
SEE
= (SSE/(n-k-1))^(1/2) = SQRT(MSE)
random forest
A random forest is a collection of randomly generated classification trees from the same data set. A randomly selected subset of features is used in creating each tree and each tree is slightly different from the others.
Mean reversion
A time series exhibits mean reversion if it has a tendency to move toward its mean.
*One of the assumptions underlying linear regression is that the residuals are uncorrelated with each other.
A violation of this assumption is referred to as autocorrelation.
Simple Linear Regression
Assumes the independent variable is uncorrelated with the residuals. The appropriate degrees of freedom for both confidence intervals (intercept and slope) is the number of observations in the sample (n) minus two.
*Random walk
b0 = 0, b1 = 1
R^2
Can be used to test the overall effectiveness of the entire set of independent variables in explaining the dependent variable.
Correlated Risk
Correlated risks are difficult to model using decision trees. Correlation across risks can be modeled explicitly using simulation. In a scenario analysis, we can address correlations by creating different scenarios that reflect how these variables should move together.
R^2 or coefficient of determination
Defined as the percentage of the total variation in the dependent variable explained by the independent variable. *For example, an R2 of 0.63 indicates that the variation of the independent variable explains 63% of the variation in the dependent variable.
Dimension Reduction
Dimension reduction seeks to remove the noise (i.e., those attributes that do not contain much information).
F-Test
How well the set of independent variables, as a group, explains the variation in the dependent variable. One-tailed test. Ha: at least one of the slopes is not equal to 0.
Random Walk with a Drift.
If a time series follows a random walk with a drift, the intercept term is not equal to zero. That is, in addition to a random error term, the time series is expected to increase or decrease by a constant amount each period. In equation form:
xt = b0 + b1(xt-1) + εt
where:
b0 = the constant drift
b1 = 1
#***The Breusch-Pagan
If the Breusch-Pagan test statistic is significant at any reasonable level of significance, it indicates heteroskedasticity. It is a one-tailed test because we are concerned only with large values of the coefficient of determination from the regression of the squared residuals.
**unit root
If the value of the lag coefficient is equal to one, the time series is said to have a unit root and will follow a random walk process.
*Dummy Variables
Independent variable is binary in nature—it is either "on" or "off." Dummy variables are used to represent a qualitative independent variable. *Must hold the other variables constant when interpreting the dummy's coefficient. *Use n-1 dummy variables for n categories. *If 5 days, use 4 dummy variables.
Total sum of squares (SST)
Measures the total variation in the dependent variable. SST is equal to the sum of the squared differences between the ACTUAL Y-values and the mean of Y:
Sum of squared errors (SSE) (Unexplained variation)
Measures the unexplained variation in the dependent variable. It's also known as the sum of squared residuals or the residual sum of squares. SSE is the sum of the squared vertical distances between the actual Y-values and the predicted Y-values on the regression line.
Regression sum of squares (RSS) (Explained variation)
Measures the variation in the dependent variable that is explained by the independent variable. RSS is the sum of the squared distances between the PREDICTED Y-values and the mean of Y.
Module 10
Module 8
Module 8.11: Machine Learning Algorithms
Module 8.8: Multicollinearity
Module 8.9: Model Misspecification, and Qualitative Dependent Variables
Module 9.1: Linear and Log-Linear Trend Models
Module 9.2: Autoregressive (AR) Models
Module 9.3: Random Walks and Unit Roots
*Breusch-Pagan test
N × (R^2 of the residual regression), chi-square distributed with k degrees of freedom
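A minimal Python sketch of the statistic, assuming resid holds the regression residuals and X the n-by-k matrix of independent variables (both names are placeholders):

import numpy as np

def breusch_pagan(resid, X):
    y = resid**2                                    # regress squared residuals on X
    Xc = np.column_stack([np.ones(len(y)), X])
    b = np.linalg.lstsq(Xc, y, rcond=None)[0]
    y_hat = Xc @ b
    r2 = 1 - np.sum((y - y_hat)**2) / np.sum((y - y.mean())**2)
    return len(y) * r2                              # compare to chi-square critical value, df = k, one-tailed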
Neural Networks (Artificial Neural Networks)
Neural networks are constructed with nodes connected by links. The input layer is the nodes with values for the features (independent variables). *Neural networks are suitable when the underlying relationship between the features is nonlinear. It does not mimic human neurons. Also, the features need not have a common unit of measurement.
null hypothesis
The null hypothesis is the opposite of the conclusion you are trying to support. Conclusion: I can jump more than 5 feet. Null: I cannot jump more than 5 feet.
*Unconditional Heteroskedasticity
Occurs when the heteroskedasticity is not related to the level of the independent variables, which means that it doesn't systematically increase or decrease with changes in the value of the independent variable(s). While this is a violation of the equal variance assumption, [it usually causes no major problems with the regression.]
*penalized regression.
Penalized regression models seek to minimize forecasting errors by reducing the problem of overfitting *In summary, penalized regression models seek to reduce the number of features included in the model while retaining as much predictive information in the data as possible. *assumes linear regression
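A minimal Python sketch of the idea. The curriculum's example is LASSO; a ridge (L2) penalty is used here only because it has a closed form, b = (X'X + λI)^-1 X'y, and the shrinkage intuition is the same:

import numpy as np

def ridge(X, y, lam):
    Xc = np.column_stack([np.ones(len(y)), X])   # add intercept column
    I = np.eye(Xc.shape[1])
    I[0, 0] = 0.0                                # conventionally, don't penalize the intercept
    return np.linalg.solve(Xc.T @ Xc + lam * I, Xc.T @ y)   # larger lam => smaller coefficients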
Type 1 error
Rejecting null hypothesis when it is true
R^2 when independent variables are dropped
R2 never increases when independent variables are dropped.
*MSR =
RSS/k = explained (regression) variation divided by the number of independent variables, k (be careful to identify k correctly).
*R^2 =
RSS/SST = Regression / (Regression + Error)
R^2residual
The R^2 from a second regression, of the squared residuals on the independent variables (used in the Breusch-Pagan test). *One-tailed test.
AR(1) model
Random walks will have an estimated intercept coefficient near zero and an estimated slope coefficient on the first lag near 1.
RSS
Regression Sum of Squares
Misspecification #2: Variable Should Be Transformed
Regression assumes that the dependent variable is linearly related to each of the independent variables. Typically, however, market capitalization is not linearly related to portfolio returns, but rather the natural log of market cap is linearly related. If we include market cap in the regression without transforming it by taking the natural log—if we use M and not ln(M)—we've misspecified the model.
Effects of Model Misspecification
Regression coefficients are often biased and/or inconsistent, which means we can't have any confidence in our hypothesis tests of the coefficients or in the predictions of the model.
***Conditional Heteroskedasticity
Related to the level of (i.e., conditional on) the independent variables. For example, conditional heteroskedasticity exists if the variance of the residual term increases as the value of the *independent variable increases.
*-Conditional heteroskedasticity results in consistent coefficient estimates, but the standard errors are biased, so the t-statistics are unreliable.
*P-Value
Remember: p < 0.05 is significant at the 5% level. A p-value of 0.95 is tricking you—it is extremely insignificant.
MSE (mean squared error)
SSE / df df = n - k -1
standard error of estimate (SEE)
SEE measures the degree of variability of the actual Y-values relative to the estimated Y-values from a regression equation. -Measures the Y variable's variability that is not explained by the regression equation. -Is the square root of the sum of the squared deviations from the regression line divided by (n − 2) in simple regression. -Standard error of a coefficient = coefficient / its t-value.
*MSE
SSE/(n-k-1) = error sum of squares / (n - k - 1)
Choosing among scenario analysis, simulations, and decision trees
Scenario analysis is the most appropriate choice when outcomes are discrete and risks occur concurrently. Simulations are better suited for continuous risks than for discrete outcomes. Decision trees are better suited to sequential risks than to concurrent risks, since risk is considered in phases.
**The Hansen procedure
Simultaneously corrects for heteroskedasticity and serial correlation. -A method used to adjust the coefficient standard errors. Trick: two hands solve for both.
Slope
Slope = Cov(X,Y) / Var(X)
T-Test
Estimated slope / standard error of the slope (when testing H0: slope = 0)
#**Steps in Model Training
1-Specify the algorithm.
2-Specify the hyperparameters (before the processing begins).
3-Divide the data into training and validation samples. In the case of cross validation, the training and validation samples are randomly regenerated every learning cycle.
4-Evaluate the training using a performance parameter, P, in the validation sample.
5-Repeat the training until an adequate level of performance is achieved. In choosing the number of repetitions, the researcher must use caution to avoid overfitting the model.
*Stationary vs Non-stationary
Stationary = series reverts toward its mean (desirable). Non-stationary = series does not revert to its mean.
Sample Covariance
Sum of ((Xi - Xmean)(Yi - Ymean)) / (n-1)
SSE
Sum of Squares Error
Supervised Learning Algorithms
Supervised learning algorithms are used for prediction (i.e., regression) and classification.
**Spurious Correlation
The appearance of a relationship between two variables when there is none is spurious correlation.
*Heteroskedasticity
The variance of the residuals is not the same across all observations in the sample. This happens when there are subsamples that are more spread out than the rest of the sample. **a systematic relationship between the residuals and the independent variable
Predicted values
They are the values that are predicted by the regression equation, given an estimate of the independent variable. Predicted Y = intercept + (slope × X)
Variance =
Total sum of squares / (n-1) (i.e., the sample variance of the dependent variable)
The root mean squared error criterion (RMSE)
Used to compare the accuracy of autoregressive models in forecasting out-of-sample values.
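A minimal Python sketch with hypothetical hold-out data; prefer the model with the lower out-of-sample RMSE:

import numpy as np

actual   = np.array([2.1, 1.8, 2.5, 2.2])   # hold-out (out-of-sample) actuals
forecast = np.array([2.0, 2.0, 2.3, 2.4])   # a model's forecasts for the same periods
print(np.sqrt(np.mean((actual - forecast)**2)))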
*Correlation near zero ≠ no relationship
Variables with a correlation close to 0 can nonetheless exhibit a strong relationship—just not a linear relationship.
*autoregressive model (AR)
When the dependent variable is regressed against one or more lagged values of itself. In an autoregressive time series, past values of a variable are used to predict the current (and hence future) value of the variable.
#****mean reverting level is expressed as
xt = b0 / (1 - b1), where b0 = intercept and b1 = slope
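A minimal Python sketch with made-up coefficients, showing the AR(1) forecast converging to b0/(1-b1):

b0, b1 = 1.2, 0.4
level = b0 / (1 - b1)        # mean-reverting level = 1.2 / 0.6 = 2.0
x = 5.0                      # hypothetical current value, above the level
for _ in range(10):
    x = b0 + b1 * x          # each iteration pulls the forecast toward the level
print(x, level)              # x is now approximately 2.0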
The standard error of the estimate
[SSE/(n − k − 1)]^1/2
**Supervised learning
Uses [labeled] training data to guide the ML program in achieving superior forecasting accuracy. Once trained on the labeled data, the program infers the input-output relationship without further human intervention, and it can be used with unstructured data.
A time series
a set of observations for a variable over successive periods of time (e.g., monthly stock market returns for the past ten years).
Covariance
a statistical measure of the degree to which the two variables move together. -If two variables are negatively correlated, then the covariance is negative.
Scenario analysis
Computes the value of an investment under a finite set of scenarios.
-Because the full spectrum of outcomes is not considered, the combined probability of the outcomes that are considered is less than 1.
*-Scenario analysis is very subjective in the selection of risk parameters.
*-Scenario analysis does not require the estimation of the underlying distribution of risk factors.
*-Scenario analysis is the most appropriate choice when outcomes are discrete and risks occur concurrently.
*-Simulations are better suited for continuous risks than for discrete outcomes.
type 2 error
failing to reject a false null hypothesis
#***Durbin-Watson
Designed to detect serial correlation in a multiple regression.
**-With no serial correlation, DW should be approximately equal to 2.0. A DW significantly different from 2.0 suggests that the residual terms are correlated.
*-If the regression residuals are positively correlated, the Durbin-Watson statistic will be less than 2; a Durbin-Watson statistic greater than 2 suggests negative serial correlation.
*-Serial correlation (also called autocorrelation) exists when the error terms are correlated with each other.
**#-The Durbin-Watson statistic is NOT an appropriate test for serial correlation in an autoregressive (AR) model.
*-Example: the Durbin-Watson statistic is given in Exhibit 1 as 1.890 and the critical values as 1.63 and 1.72. Because the value (1.890) exceeds the upper critical value (1.72), the test fails to reject the null hypothesis of no positive serial correlation. If the statistic were 1.68 (between the critical values), the test would be inconclusive.
-Use the Hansen method as a remedy.
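A minimal Python sketch of the statistic, DW = Σ(et − et−1)^2 / Σ et^2, which is approximately 2(1 − r) for lag-1 residual autocorrelation r:

import numpy as np

def durbin_watson(e):
    return np.sum(np.diff(e)**2) / np.sum(e**2)   # ~2 means no serial correlation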
*Functional form model misspecification.
An umbrella category covering the first three misspecifications: omitting a variable, failing to transform a variable, and incorrectly pooling data.
Misspecification #3: Incorrectly Pooling Data
don't combine data from year 1 and year 2 if something changed in year 2. (maybe interest rates rose)
Misspecification #5: Forecasting the Past
If measuring stock returns for 2018 using market cap as an independent variable, use the January 1 market cap, not the December 31 market cap (otherwise the independent variable contains information from the period being forecast).
Misspecification #1: Omitting a Variable
If the omitted variable is correlated with the other independent variables, the resulting regression coefficient estimates are biased and inconsistent.
A structural change
is indicated by a significant shift in the plotted data at a point in time that seems to divide the data into two or more distinct patterns.
chain rule of forecasting.
it is necessary to calculate a one-step-ahead forecast before a two-step-ahead forecast can be calculated.
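A minimal Python sketch with made-up AR(1) coefficients:

b0, b1, x_t = 0.5, 0.8, 10.0
x_1ahead = b0 + b1 * x_t        # one-step-ahead forecast = 8.5
x_2ahead = b0 + b1 * x_1ahead   # two-step-ahead uses the forecast, not observed data = 7.3
print(x_1ahead, x_2ahead)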
an AR(1) model with a seasonal lag:
ln xt = b0 + b1(ln xt-1) + b2(ln xt-4) + εt.
*Covariance Stationarity.
mean and variance do not change over time *Neither a random walk nor a random walk with a drift exhibits covariance stationarity
*Deep learning nets (DLNs)
neural networks with many hidden layers (often more than 20). Deep learning requires large quantities of data and fast computers to train models. It takes a set of input data and passes it through multiple layers of functions to generate a set of probabilities.
A consistent estimator
one for which the accuracy of the parameter estimate increases as the sample size increases.
An unbiased estimator
one for which the expected value of the estimator is equal to the parameter you are trying to estimate. For example, because the expected value of the sample mean is equal to the population mean, the sample mean is an unbiased estimator of the population mean.
If a test asks whether a value is "greater than" or "less than"...
one-tailed test
*Serial Correlation
Error terms that are correlated with one another across observations—e.g., the tendency for stock returns to be related to past returns.
**autoregressive conditional heteroskedasticity (ARCH)
the variance of the residuals in one period is dependent on the variance of the residuals in a previous period. The test for ARCH is based on a regression of the squared residuals on their lagged values.
*ARCH model
used to test for autoregressive conditional heteroskedasticity.
Big Data
very large data sets which may include both structured (e.g., spreadsheet) data and unstructured (e.g., emails, text, or pictures) data
In-sample forecasts
we are comparing how accurate our model is in forecasting the actual data we used to develop the model
Out-of-sample forecasts
we compare how accurate a model is in forecasting the y variable value for a time period outside the period used to develop the model. -Used for predicting real-world (future) values.
Factors that Determine Which Model Is Best
when the residuals from a linear trend model are serially correlated, a log-linear trend model may be more appropriate.
*Multicollinearity
when two or more of the independent variables, or linear combinations of the independent variables, in a multiple regression are highly correlated with each other. *Most variables are partially correlated; the problem occurs only when they are highly correlated.
xt = b0 + b1xt-1 + εt
where:
xt = value of time series at time t
b0 = intercept at the vertical axis (y-axis)
b1 = slope coefficient
xt-1 = value of time series at time t − 1
εt = error term (or residual term or disturbance term)
t = time; t = 1, 2, 3...T
activation function
At each node of a neural network, an activation function—a nonlinear function (e.g., an S-shaped sigmoid)—is applied to the weighted sum of the node's inputs before the result is passed to the next layer.
A linear trend
yt = b0 + b1(t) + εt
where:
yt = the value of the time series (the dependent variable) at time t
b0 = intercept at the vertical axis (y-axis)
b1 = slope coefficient (or trend coefficient)
εt = error term (or residual term or disturbance term)
t = time (the independent variable); t = 1, 2, 3...T
#*Probit and logit models.
-A probit model is based on the normal distribution, while a logit model is based on the logistic distribution. -Both are used for binary dependent variables -example: return positive or negative.
Adjusted R^2
-Equal to or less than R^2; may be less than zero if R^2 is low enough.
*R^2adj = 1 - [(n-1)/(n-k-1)] × (1-R^2)
**-Adjusted R^2 adjusts for the loss of degrees of freedom when additional independent variables are added to a regression.
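A minimal Python sketch of the formula (the inputs are made up):

def adjusted_r2(r2, n, k):
    return 1 - ((n - 1) / (n - k - 1)) * (1 - r2)

print(adjusted_r2(0.63, 60, 5))   # always <= R^2; can be negative when R^2 is low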
**Steps in simulations
1-Determine the probabilistic variables.
*-Focus attention on a few variables that have a significant impact on value, rather than trying to define probability distributions for dozens of inputs that may have only a marginal impact on value.
2-Define probability distributions for these variables.
-Historical data.
*-Cross-sectional data: we may estimate the distribution of the variable based on the values of the variable for peers.
-Pick a distribution and estimate the parameters (our best guess).
3-Check for correlations among variables.
-When there is a strong correlation between variables, we can either (1) allow only one of the variables to vary (the other variable could then be algorithmically computed), or (2) build the rules of correlation into the simulation (this necessitates more sophisticated simulation packages). If we pursue the first option, the random variable should be the one that has the highest impact on valuation.
4-Run the simulation. The number of simulations needed increases with:
-The number of probabilistic inputs.
-The types of distributions: the greater the variability in types of distributions, the greater the number of simulations needed.
-The range of outcomes: the wider the range of outcomes of the uncertain variables, the higher the number of simulations needed.
To determine what type of model is best suited to meet your needs, follow these guidelines:
1-Determine your goal. Are you attempting to model the relationship of a variable to other variables (e.g., cointegrated time series, cross-sectional multiple regression)? Are you trying to model the variable over time (e.g., trend model)?
*2-If you have decided on using a time series analysis for an individual variable, plot the values of the variable over time and look for characteristics that would indicate nonstationarity, such as non-constant variance (heteroskedasticity), non-constant mean, seasonality, or structural change.
3-If there is no seasonality or structural shift, use a trend model.
4-Run the trend analysis, compute the residuals, and test for serial correlation using the Durbin-Watson test.
5-If the data has serial correlation, reexamine the data for stationarity before running an AR model.
6-After first differencing in step 5, if the series is covariance stationary, run an AR(1) model and test for serial correlation and seasonality.
7-Test for ARCH: regress the square of the residuals on squares of lagged values of the residuals and test whether the resulting coefficient is significantly different from zero.
8-If you have developed two statistically reliable models and want to determine which is better at forecasting, calculate their out-of-sample RMSE.
Machine learning
ML terms:
*-Target variable (or tag variable) is the dependent variable (i.e., the y-variable). Target variables can be continuous, categorical, or ordinal.
-When a target variable is specified, the problem is structured; when the target is continuous, structured regression models are appropriate.
-Features are the independent variables (i.e., the x-variables).
-Feature engineering is curating a dataset of features for ML processing.
*F-Test =
MSR/MSE
F-Statistic
MSR/MSE. -A value you get when you run an ANOVA test or a regression analysis, used to find out whether the means of two populations are significantly different.
Module 9.4: Seasonality
Module 9.5: ARCH and Multiple Time Series
Total Variation = Explained variation + Unexplained variation
SST = RSS + SSE
Multiple Linear Regression
Y = b0 + b1X1 + b2X2 + ... + error term
confidence intervals
always two tailed
*AR(p)
where p indicates the number of lagged values that the autoregressive model will include as independent variables. *An AR model regresses a dependent variable against one or more lagged values of itself. Multi-period forecasts are more uncertain than single-period forecasts: for a two-step-ahead forecast, there is the usual uncertainty of forecasting xt+1 using xt, plus the additional uncertainty of forecasting xt+2 using the forecasted value xt+1.