Quant

A rudimentary way to think of machine learning algorithms is that they:

"find the pattern, apply the pattern."

ANOVA

(analysis of variance) ANOVA is a statistical procedure that attributes the variation in the dependent variable to one of two sources: the regression model or the residuals (i.e., the error term).

Qualitative Independent Variables

(dummy variables) they capture the effect of binary independent variables. When we want to distinguish between n classes, we must use (n - 1) dummy variables. Otherwise, we would violate the regression assumption of no exact linear relationship between independent variables. A dummy variable can be an intercept dummy or a slope dummy or a combination of the two.

Deep learning algorithms

Algorithms such as neural networks and reinforcement learning learn from their own prediction errors and are used for complex tasks such as image recognition and natural language processing.

formula for F-statistic in multiple regression model test (for all coefficients collectively)

F = MSR / MSE = (RSS/k) / [SSE/(n − k − 1)]

Rejection of the null hypothesis at a stated level of significance indicates that at least one of the coefficients is significantly different from zero, which is interpreted to mean that at least one of the independent variables in the regression model makes a significant contribution to the explanation of the dependent variable.
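
A minimal NumPy sketch of these ANOVA quantities on simulated data (all names are illustrative); it also computes the R², SEE, MSR, and MSE defined in other cards in this set:

import numpy as np

# Simulated data: n = 40 observations, k = 2 independent variables
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=40)
n, k = X.shape

Xd = np.column_stack([np.ones(n), X])        # add intercept column
b = np.linalg.lstsq(Xd, y, rcond=None)[0]    # OLS coefficient estimates
y_hat = Xd @ b

sst = np.sum((y - y.mean()) ** 2)            # total variation
rss = np.sum((y_hat - y.mean()) ** 2)        # explained (regression) variation
sse = np.sum((y - y_hat) ** 2)               # residual variation; sst = rss + sse

f_stat = (rss / k) / (sse / (n - k - 1))     # F = MSR / MSE
r_squared = rss / sst
see = (sse / (n - k - 1)) ** 0.5             # standard error of the estimate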

F1 score formula when evaluating fit of a ML algorithm

F1 score = (2 × P × R) / (P + R)
P = precision
R = recall
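
A small sketch computing this and the related fit metrics (accuracy, precision, recall, covered in other cards in this set) from hypothetical confusion-matrix counts:

# Hypothetical confusion-matrix counts
tp, fp, tn, fn = 60, 10, 25, 5

precision = tp / (tp + fp)                    # P
recall    = tp / (tp + fn)                    # R
f1        = 2 * precision * recall / (precision + recall)
accuracy  = (tp + tn) / (tp + fp + tn + fn)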

what is the coefficient of determination (R^2)? What does it represent? formula?

R² = RSS/SST

R² is the percentage of the variation in the dependent variable explained by the independent variables. The higher the R², the better the regression model fits the data; an R² above 90% indicates a very strong fit.

Seasonality

Seasonality in a time series is tested by calculating the autocorrelations of the error terms. A statistically significant lagged error term (p < 0.05) may indicate seasonality. To adjust for seasonality in an AR model, an additional lag of the variable (corresponding to the statistically significant lagged error term) is added to the original model. Usually, if quarterly data are used, the seasonal lag is 4; if monthly data are used, the seasonal lag is 12. If a seasonal lag coefficient is appropriate and corrects the seasonality, the revised model incorporating the seasonal lag will show no statistically significant lagged error terms.

signs of nonstationarity

Signs of nonstationarity include linear trend, exponential trends, seasonality, or a structural change in the data.

accuracy formula when evaluating fit of a ML algorithm

accuracy = (true positives + true negatives) / (all positives and negatives)

bag-of-words (BOW) procedure

After the n-grams technique is applied to the text, a bag-of-words (BOW) procedure collects all the tokens in a document.

Data wrangling

also known as preprocessing data. includes data transformation and scaling. Data transformation types include extraction, aggregation, filtration, selection, and conversion of data. Scaling is the conversion of data to a common unit of measurement. Common scaling techniques include normalization and standardization.

Deep learning nets

are more complex neural networks with MANY hidden layers (often more than 20), useful for pattern, speech, and image recognition.

Random forest algorithm

a type of SUPERVISED machine learning algorithm. This is a variant of the classification tree whereby a large number of classification trees are trained using data bagged from the same data set.

Classification and regression tree algorithms

a type of SUPERVISED machine learning algorithm. This is used for classifying categorical target variables when there are significant nonlinear relationships among variables.

K-nearest neighbor algorithm

a type of SUPERVISED machine learning algorithm. This is used to classify an observation based on nearness to the observations in the training sample.

Penalized regression algorithm

a type of SUPERVISED machine learning algorithm. This reduces overfitting by imposing a penalty on—and reducing—the nonperforming features.

Hierarchical clustering algorithm

a type of UNSUPERVISED machine learning algorithm. This builds a hierarchy of clusters without any predefined number of clusters.

K-means clustering algorithm

a type of UNSUPERVISED machine learning algorithm. This partitions observations into k non-overlapping clusters; a centroid is associated with each cluster.

Principal components analysis algorithm

a type of UNSUPERVISED machine learning algorithm. This summarizes the information in a large number of correlated factors into a much smaller set of uncorrelated factors called eigenvectors.

Neural networks

comprise an input layer, hidden layers (which process the input), and an output layer. The nodes in the hidden layer are called neurons, which comprise a summation operator (that calculates a weighted average) and an activation function (a nonlinear function).

Conditional heteroskedasticity affects what?

The computed t-statistics and F-statistic. Conditional heteroskedasticity results in consistent coefficient estimates, but it biases the standard errors, which in turn distorts the computed t-statistics and F-statistic.

Steps in a Data Analysis Project

conceptualization of the modeling task
data collection
data preparation and wrangling
data exploration
model training

In multiple regression assumptions, the variance of the error term IS assumed to be

constant across all observations, resulting in errors that are homoskedastic.

cross-validation

a method of reducing overfitting in a supervised ML model. Cross-validation is a technique for evaluating ML models by training several models on subsets of the available input data and evaluating them on the complementary subsets. Use cross-validation to detect overfitting, i.e., failing to generalize a pattern.
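
A minimal sketch of k-fold cross-validation in plain NumPy; the "model" here is a trivial mean-only predictor, purely for illustration:

import numpy as np

def k_fold_mse(y, k=5):
    """Estimate out-of-sample MSE of a mean-only model via k-fold CV."""
    idx = np.arange(len(y))
    np.random.default_rng(0).shuffle(idx)
    errors = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)                # train on the complement...
        pred = y[train].mean()                         # ...fit the (trivial) model...
        errors.append(np.mean((y[fold] - pred) ** 2))  # ...score on the held-out fold
    return np.mean(errors)

y = np.random.default_rng(1).normal(size=100)
print(k_fold_mse(y))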

what makes a variable significant in a regression analysis?

p value less than significance level. The p-value is the smallest level of significance for which the null hypothesis can be rejected. An independent variable is significant if the p-value is less than the stated significance level.

receiver operating characteristic (ROC)

plots a curve showing the tradeoff between the false positive rate and the true positive rate

precision (P) formula when evaluating fit of a ML algorithm

precision (P) = true positives / (false positives + true positives)

recall (R) formula when evaluating fit of a ML algorithm

recall (R) = true positives / (true positives + false negatives)

Durbin-Watson statistic may be used to test for

serial correlation at a single lag
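
The statistic itself is easy to compute directly from the residuals; a sketch with hypothetical residuals (values near 2 suggest no serial correlation at lag 1):

import numpy as np

resid = np.random.default_rng(0).normal(size=50)   # hypothetical regression residuals
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
# dw ≈ 2: no first-lag serial correlation; dw < 2: positive; dw > 2: negative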

what can you conclude if you reject the null hypothesis of a durbin-watson test.

serial correlation. the error terms are correlated with each other.

Data standardization

standardization centers the variables at a mean of 0 and a standard deviation of 1. Unlike normalization, standardization is not sensitive to outliers, but it assumes that the variable distribution is normal.
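
A sketch contrasting the two scaling techniques on a 1-D NumPy array (the data are illustrative):

import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 100.0])            # note the outlier

normalized   = (x - x.min()) / (x.max() - x.min())   # rescaled to [0, 1]
standardized = (x - x.mean()) / x.std()              # mean 0, standard deviation 1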

Data Scaling

the conversion of data to a common unit of measurement. Common scaling techniques include normalization and standardization.

consequence of conditional heteroskedasticity

the standard errors will be too low, which, in turn, causes the t-statistics to be too high

predominance

the state or condition of being greater in number or amount

Moving data from a storage medium to where they are needed is referred to as

transfer of data

conditional heteroskedasticity causes what type of error?

Type I error. The standard errors of the parameter estimates will be too small and the t-statistics too large.

If you find serial correlation in a time series model, what do we do? then, how do we correct the following: Correcting for a linear trend Correcting for an exponential trend Correcting for a structural shift Correcting for seasonality

Use an autoregressive (AR) model, after transforming the series so it is covariance stationary.
Correcting for a linear trend: use first differencing.
Correcting for an exponential trend: take the natural log, then first difference.
Correcting for a structural shift: estimate separate models before and after the change.
Correcting for seasonality: add a seasonal lag.

Serial Correlation

when the residuals are correlated with each other. When serial correlation is positive, one period's residual tells us the next period's residual will tend to have the same sign; when negative, the next period's residual will tend to have the opposite sign.

Effect: coefficients are consistent; standard errors are underestimated; too many Type I errors (with positive correlation).
Detection: Breusch-Godfrey (BG) F-test.
Correction: use robust (Newey-West corrected) standard errors.

Data cleansing

Data cleansing deals with missing, invalid, inaccurate, and non-uniform values, and duplicate observations.

Outliers vs high-leverage points

Outliers are extreme observations of the dependent ("Y") variable; high-leverage points are extreme observations of the independent ("X") variables.

Ensemble learning algorithm

a type of SUPERVISED machine learning algorithm. This combines predictions from multiple models, resulting in a lower average error rate.

document term matrix

A document term matrix organizes text as structured data: rows represent documents and columns represent tokens. Cell values reflect the number of times a token appears in a document.

standard error of the estimate formula

SEE = Sqrt [SSE/(n-k-1)]

degrees of freedom of the sum of squares errors (SSE)

SSE df = n − k − 1

formula for R^2

RSS/SST, or the square of the correlation coefficient (in a simple regression)

What does the sum of squared totals (SST) mean?

SST measures the total variation in the dependent variable (the sum of squared deviations of the actual values from their mean). SST = RSS + SSE.

formula for the mean reverting level for an AR(1) model

mean-reverting level = b0 / (1 − b1)
b0 = intercept
b1 = coefficient on the lagged variable
For example, with b0 = 1.0 and b1 = 0.6, the mean-reverting level is 1.0 / (1 − 0.6) = 2.5.

multiple-R formula

= sqrt(R^2)

what does A low SEE imply

A low SEE implies a high R^2

Autoregressive Conditional Heteroskedasticity (ARCH)

ARCH describes the condition where the variance of the residuals in one time period within a time series is dependent on the variance of the residuals in another period. When this condition exists, the standard errors of the regression coefficients in AR models and the hypothesis tests of these coefficients are invalid. The ARCH(1) regression model is expressed as: εt^2 = a0 + a1*εt−1^2 + μt If the coefficient, a1, is statistically different from zero, the time series is ARCH(1).
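
A minimal sketch of this ARCH(1) regression, assuming statsmodels is available (the residuals here are simulated white noise, so no ARCH effect should be found):

import numpy as np
import statsmodels.api as sm

resid = np.random.default_rng(0).normal(size=200)   # hypothetical AR-model residuals
e2 = resid ** 2

X = sm.add_constant(e2[:-1])          # lagged squared residuals plus intercept a0
fit = sm.OLS(e2[1:], X).fit()         # e_t^2 = a0 + a1 * e_(t-1)^2 + u_t
print(fit.tvalues[1], fit.pvalues[1]) # a significant a1 indicates ARCH(1)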

Random Walk

A random walk time series is one for which the value in one period equals the value in the previous period, plus a random (unpredictable) error. If we believe a time series is a random walk (i.e., has a unit root), we can transform the data to a covariance stationary time series using a procedure called first differencing.

Random walk without a drift: xt = xt−1 + εt
Random walk with a drift: xt = b0 + xt−1 + εt

In either case, the mean-reverting level is undefined (b1 = 1), so the series is not covariance stationary.

Multicollinearity

A situation in which several independent variables are highly correlated with each other. This characteristic can result in difficulty in estimating separate or independent regression coefficients for the correlated variables.

Effect: coefficients are consistent (but unreliable); standard errors are overestimated; too many Type II errors.
Detection: conflicting t- and F-statistics; high variance inflation factors (VIF).
Correction: drop one of the correlated variables, or use a different proxy for an included independent variable.

Unit Root

A unit root is a stochastic trend in a time series, sometimes called a "random walk with drift". If a time series has a unit root, it shows a systematic pattern that is unpredictable. If the value of the lag coefficient is equal to one, the time series is said to have a unit root and will follow a random walk process. A series with a unit root is not covariance stationary. Economic and finance time series frequently have unit roots. First differencing will often eliminate the unit root. If there is a unit root, this period's value is equal to last period's value plus a random error term and the mean reverting level is undefined. In probability theory and statistics, a unit root is a feature of some stochastic processes, such as random walks, that can cause problems in statistical inference involving time series models.

Joint hypothesis test

An F-test to evaluate nested models, which consist of a full (or unrestricted) model and a restricted model that uses "q" fewer independent variables. To test whether the "q" excluded variables add to the explanatory power of the model, we test the null hypothesis that all coefficients of the excluded variables are equal to zero against the alternative that at least one of the excluded coefficients is not equal to zero. We reject the null hypothesis if the F-statistic is greater than the critical value. In essence, this is a combined test for the coefficients of all excluded variables.

F-statistic in a joint hypothesis test:
F = [(SSEr − SSEu)/q] / [SSEu/(n − k − 1)]
SSEr = restricted model's SSE
SSEu = unrestricted model's SSE
q = number of independent variables excluded in the restricted model
k = number of independent variables in the unrestricted model

Decision rule: reject H0 if F (test statistic) > Fc (critical value)
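
Worked as a short sketch with hypothetical sums of squared errors:

# Hypothetical inputs for the joint F-test
sse_u, sse_r = 1200.0, 1500.0   # unrestricted and restricted SSE
n, k, q = 50, 5, 2              # observations, predictors in full model, excluded variables

f_stat = ((sse_r - sse_u) / q) / (sse_u / (n - k - 1))
# reject H0 if f_stat exceeds the critical F value with (q, n-k-1) degrees of freedom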

Akaike's information criterion (AIC) and Schwarz's Bayesian information criterion (BIC)

Both are used to evaluate competing models with the same dependent variable. AIC is preferred if the goal is a better forecast; BIC is preferred if the goal is a better goodness of fit. Both use the natural log of SSE/n. Lower AIC and BIC values indicate a better model under either criterion. While R² measures the goodness of fit of the regression, AIC and BIC trade that fit off against model complexity.

AIC = n × ln(SSE/n) + 2(k+1)
BIC = n × ln(SSE/n) + ln(n) × (k+1)
k = # of independent variables
n = # of observations (data points)

In practice, adding independent variables increases k, which raises the penalty terms in AIC and BIC; but if the added variables improve the fit, SSE falls, which lowers AIC and BIC.
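
The same formulas as a sketch with hypothetical inputs:

import math

n, k, sse = 120, 4, 350.0   # hypothetical sample size, predictors, and SSE

aic = n * math.log(sse / n) + 2 * (k + 1)
bic = n * math.log(sse / n) + math.log(n) * (k + 1)
# lower is better; BIC penalizes extra variables more heavily once ln(n) > 2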

how to check for serial correlation in a time series model?

Check for serial correlation using the Durbin-Watson statistic.

Data Exploration

Data exploration involves exploratory data analysis, feature selection, and feature engineering. Exploratory data analysis looks at summary statistics describing the data and any patterns or relationships that can be observed. Feature selection involves choosing only those features that meaningfully contribute to the model's predictive power. Feature engineering optimizes the selected features.

degrees of freedom of the sum of squares totals (SST)

SST df = (k) + (n-k-1) = (n-1)

Cointegration

Cointegration means that two time series are economically linked (related to the same macro variables) or follow the same trend, and that relationship is not expected to change. If two time series are cointegrated, the error term from regressing one on the other is covariance stationary and the t-tests are reliable. To test whether two time series are cointegrated, we regress one variable on the other using the following model:
yt = b0 + b1xt + ε
where:
yt = value of time series y at time t
xt = value of time series x at time t
We then test the residuals for a unit root (e.g., with an Engle-Granger Dickey-Fuller test); if the residuals are covariance stationary, the two series are cointegrated.
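
A sketch of this procedure with simulated series, assuming statsmodels for the unit-root test (a proper Engle-Granger test uses its own critical values, so this is only illustrative):

import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
x = np.cumsum(rng.normal(size=300))        # a random walk
y = 2.0 + 0.5 * x + rng.normal(size=300)   # linked to x, so cointegrated by construction

b1, b0 = np.polyfit(x, y, 1)               # regress y on x
resid = y - (b0 + b1 * x)
stat, pvalue, *_ = adfuller(resid)         # unit-root test on the residuals
# stationary residuals (small p-value) are consistent with cointegration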

Cook's D values

Cook's distance (D) is a measure of how much a regression estimate would change if an observation were deleted. It is useful for identifying influential data points; an observation is typically flagged as influential if D > 2 × sqrt(k/n). Influential data points should be checked for input errors; alternatively, the observation may be valid but the model incomplete.

Structural Change (Coefficient Instability) in regression modeling

Estimated regression coefficients may change from one time period to another. There is a trade-off between the statistical reliability of using a long time series and the coefficient stability of a short time series. You need to ask, has the economic process or environment changed? A structural change is indicated by a significant shift in the plotted data at a point in time that seems to divide the data into two distinct patterns. When this is the case, you have to run two different models, one incorporating the data before and one after that date, and test whether the time series has actually shifted. If the time series has shifted significantly, a single time series encompassing the entire period (i.e., encompassing both patterns) will likely produce unreliable results, so the model using more recent data may be more appropriate.

requirements for a series to be covariance stationary

For a time series to be covariance stationary:
1) the series must have an expected value that is constant and finite in all periods,
2) the series must have a variance that is constant and finite in all periods, and
3) the covariance of the time series with itself for a fixed number of periods in the past or future must be constant and finite in all periods.

First Differencing

If we believe a time series is a random walk (i.e., has a unit root), we can transform the data to a covariance stationary time series using a procedure called first differencing. involves subtracting the value of the time series (i.e., the dependent variable) in the immediately preceding period from the current value of the time series to define a new dependent variable, y. yt = xt − xt−1 ⇒ yt = εt Then, stating y in the form of an AR(1) model: yt = b0 + b1yt−1 + εt where: b0 = b1 = 0 This transformed time series has a finite mean-reverting level of 0/(1−0)=0 and is, therefore, covariance stationary.
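
A one-line sketch with a simulated random walk:

import numpy as np

x = np.cumsum(np.random.default_rng(0).normal(size=200))  # random walk (unit root)
y = np.diff(x)   # y_t = x_t - x_(t-1): approximately white noise, covariance stationary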

formula for mean square errors (MSE)

MSE = SSE / (n − k − 1)
k = # of independent variables
n = # of observations (data points)

overfitting in supervised learning (ML)

In supervised learning, overfitting results from having a large number of independent variables (features), resulting in an overly complex model that may have fit random noise, improving in-sample forecasting accuracy. However, overfit models do not generalize well to new data (i.e., they have low out-of-sample R-squared). To reduce overfitting, data scientists use *complexity reduction* and *cross-validation*. In complexity reduction, a penalty is imposed to exclude features that are not meaningfully contributing to out-of-sample prediction accuracy; this penalty value increases with the number of independent variables used by the model. Cross-validation is a technique for evaluating ML models by training several models on subsets of the available input data and evaluating them on the complementary subsets; use cross-validation to detect overfitting, i.e., failing to generalize a pattern.

Supervised machine learning

Inputs and outputs are identified by humans for the computer, and the algorithm uses this labeled training data to model relationships.

name entity recognition (NER)

Named entity recognition (NER) is a method of identifying entities, such as names and locations, that often appear in unstructured data. An NER algorithm analyzes individual tokens and their surrounding semantics, referring to its dictionary, to tag an object class to each token.

formula for mean square residuals (MSR)

MSR = RSS / k
k = # of independent variables

Nested models

Models that are the same in every way except that (at least) one parameter is fixed (at 0) in one of the models. Nested models comprise an unrestricted (or full) model and a restricted model that uses "q" fewer independent variables. We use the joint hypothesis test to test the null hypothesis.

consequences of multicollinearity

Multicollinearity refers to independent variables that are correlated with each other. Multicollinearity causes standard errors for the regression coefficients to be too high, which, in turn, causes the t-statistics to be too low. Multicollinearity has no effect on the F-statistic.

Chain Rule of Forecasting

Multiperiod forecasting with AR models is done one period at a time, and forecast uncertainty increases with each successive forecast because each is based on previously forecasted values. The calculation of successive forecasts in this manner is referred to as the chain rule of forecasting. A one-period-ahead forecast for an AR(1) model is determined in the following manner: xt+1 = b0 + b1*xt Likewise, a 2-step-ahead forecast for an AR(1) model is calculated as: xt+2 = b0 + b1*xt+1
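
A sketch of the chain rule with hypothetical AR(1) coefficients:

b0, b1 = 1.0, 0.6            # hypothetical AR(1) coefficients
x = 3.0                      # last observed value

forecasts = []
for _ in range(3):           # 1-, 2-, and 3-step-ahead forecasts
    x = b0 + b1 * x          # each step feeds on the previous forecast
    forecasts.append(x)
# [2.8, 2.68, 2.608] -> converging toward the mean-reverting level b0/(1-b1) = 2.5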

Jason Fye, CFA, wants to check for seasonality in monthly stock returns (i.e., the January effect) after controlling for market cap and systematic risk. The type of model that Fye would most appropriately select is:

Multiple regression model.

Data Normalization

Normalization rescales a variable to values between 0 and 1: X_normalized = (X − X_min) / (X_max − X_min). Unlike standardization, normalization is sensitive to outliers.

parts-of-speech (POS) tokenization

POS tagging (parts-of-speech tagging) is a process that marks up the words in a text for a particular part of speech based on each word's definition and context. It reads text in a language and assigns a specific token (part of speech) to each word.

Reinforcement learning

Perform an action, then learn from the resulting reward or punishment. Agents seek to learn from their own errors while maximizing a defined reward.

Logistic Regression Models

Predict nominal types of responses, such as "Yes, a company will go bankrupt" or "No, a company will not go bankrupt," typically over some predetermined period. Qualitative dependent variables (e.g., bankrupt versus non-bankrupt) require methods other than ordinary least squares. Logistic regression (logit) models use the log odds as the dependent variable, and the coefficients are estimated using the maximum likelihood estimation methodology. The slope coefficients in a logit model are interpreted as the change in the "log odds" of the event occurring per 1-unit change in the independent variable, holding all other independent variables constant.

what is the degrees of freedom of the regression sum of squares (RSS)

RSS df = k

multicollinearity has an effect on which statistic

The t-statistics, not the F-statistic.

Support vector machine algorithm

a type of SUPERVISED machine learning algorithm. This is a linear classification algorithm that separates the data into one of two possible classifiers based on a model-defined hyperplane.

Preparing, Wrangling, and Exploring Text-Based Data for Financial Forecasting

Text processing involves removing HTML tags, punctuation, numbers, and white spaces. Text is then normalized by lowercasing words, removing stop words, and stemming/lemmatization. Text wrangling involves tokenization of the text.

One possible problem that could jeopardize the validity of the employment growth rate model is multicollinearity. Which of the following would most likely suggest the existence of multicollinearity?

The F-statistic suggests that the overall regression is significant, however the regression coefficients are not individually significant.

what does adjusted R^2 represent?

R² is the ratio of the explained variation to the total variation; in the context of regression, it is a statistical measure of how well the regression line approximates the actual data, i.e., the explanatory power of the regression model. The adjusted R² provides a measure of goodness of fit that adjusts for the number of independent variables included in the model.

Unsupervised learning

The computer is not given labeled data; rather, it is provided unlabeled data that the algorithm uses to determine the structure of the data.

Root mean squared error (RMSE)

Used to assess the predictive accuracy of autoregressive models. For example, you could compare the results of an AR(1) and an AR(2) model. The RMSE is the square root of the average (or mean) squared error. The model with the lower RMSE is better. Out-of-sample forecasts predict values using a model for periods beyond the time series used to estimate the model. The RMSE of a model's out-of-sample forecasts should be used to compare the accuracy of alternative models.

What would a regression formula be of a dependent variable (e.g., sales) on three independent variables?

Yi = b0 + (b1 × X1i) + (b2 × X2i) + (b3 × X3i) + εi
Yi = dependent variable
b0 = intercept
b1, b2, b3 = slope coefficients
X1i, X2i, X3i = independent variables
εi = error term
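
A sketch estimating such a model with NumPy least squares (data simulated, names illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                              # three independent variables
y = 2.0 + X @ np.array([1.5, -0.7, 0.3]) + rng.normal(scale=0.2, size=100)

Xd = np.column_stack([np.ones(len(y)), X])                 # prepend intercept column
b0, b1, b2, b3 = np.linalg.lstsq(Xd, y, rcond=None)[0]     # intercept and slopes
y_pred = b0 + b1 * 0.5 + b2 * 1.2 + b3 * (-0.3)            # prediction for a new point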

N-grams technique

a technique that defines a token as a sequence of words, applied when the word sequence is important. N-grams are contiguous sequences of n items from a given sequence of text, used in natural language processing (NLP) to capture local context and extract meaning from text data.

Covariance Stationary

A covariance stationary series fluctuates around a constant mean: plotted, the data scatter around a straight line parallel to the x-axis. Signs of nonstationarity include a linear trend, exponential trends, seasonality, or a structural change in the data. Covariance stationarity is a property of a time series indicating that its properties, such as the mean and variance, remain constant over time. A nonstationary time series leads to invalid linear regression estimates with no economic meaning.

A time series is covariance stationary if it satisfies the following three conditions:
1. Constant and finite mean.
2. Constant and finite variance.
3. Constant and finite covariance with leading or lagged values.

To determine whether a time series is covariance stationary, we can:
1. Plot the data to see if the mean and variance remain constant.
2. Perform the Dickey-Fuller test (a test for a unit root, i.e., whether b1 − 1 equals zero).

Most economic and financial time series are not stationary. The degree of nonstationarity depends on the length of the series and the underlying economic and market environment and conditions. An AR(1) model is covariance stationary when it has a mean-reverting level, i.e., |b1| < 1.

What does the sum of squared errors (SSE) mean?

The difference between the (predicted) regression estimate and the actual value: SSE is the sum of the squared distances between each dependent variable's predicted value (on the regression line) and its actual value.

What does the regression sum of squares (RSS) mean?

The difference between the mean value and the (predicted) regression estimate: RSS is the sum of the squared distances between the mean of the dependent variable and each predicted value (on the regression line).

hegemony

domination over others

Curation

ensuring the quality of data, for example by adjusting for bad or missing data.

Influential data points

extreme observations that when excluded cause a significant change in model coefficients. Influential data points cause the model to perform poorly out of sample.

Log-Linear Trend Model (time series model)

for exponential growth over time (like dividend reinvesting) Log-linear regression assumes the dependent financial variable grows at some constant rate: yt = e^(b0 + b1(t)) ln(yt) = ln[e^(b0 + b1(t))] ⇒ ln(yt) = b0 + b1(t) The log-linear model is best for a data series that exhibits a trend or for which the residuals are correlated or predictable or the mean is non-constant. Most of the data related to investments have some type of trend and thus lend themselves more to a log-linear model. In addition, any data that have seasonality are candidates for a log-linear model. Recall that any exponential growth data call for a log-linear model. The use of the transformed data produces a linear trend line with a better fit for the data and increases the predictive ability of the model. Because the log-linear model more accurately captures the behavior of the time series, the impact of serial correlation in the error terms is minimized.
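
A sketch of a log-linear fit: regress ln(y) on t, then exponentiate to forecast (the series is simulated):

import numpy as np

rng = np.random.default_rng(0)
t = np.arange(1, 101)
y = np.exp(0.5 + 0.02 * t + rng.normal(scale=0.05, size=100))   # exponential growth

b1, b0 = np.polyfit(t, np.log(y), 1)    # ln(y_t) = b0 + b1*t
forecast_101 = np.exp(b0 + b1 * 101)    # forecast for period 101
# b1 approximates the constant per-period growth rate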

The degree to which a machine learning model retains its explanatory power when predicting out-of-sample is most commonly described as:

generalization

Conditional Heteroskedasticity

Heteroskedasticity means the residuals do not have constant variance (e.g., the further right on the plot, the larger the residuals, so the model is not fit very well). Heteroskedasticity is CONDITIONAL when it is related to the independent variables; unconditional heteroskedasticity is not related to the independent variables and is not a problem.

Effect: coefficients are consistent; standard errors are underestimated; too many Type I errors.
Detection: Breusch-Pagan chi-square test.
Correction: use robust (White-corrected) standard errors.

Linear Trend Model (time series model)

just one independent variable: time ("normal" y = b + mx form). The typical time series uses time as the independent variable to estimate the value of the time series (the dependent variable) in period t:

yt = b0 + b1(t) + εt

The predicted change in y per period is b1, with t = 1, 2, ..., T. Trend models are limited in that they assume time alone explains the dependent variable, and they tend to be plagued by assumption violations. The Durbin-Watson test statistic can be used to check for serial correlation. A linear trend model may be appropriate if the data points seem to be equally distributed above and below the line and the mean is constant. Growth in GDP and inflation levels are likely candidates for linear models.

complexity reduction

method of reducing overfitting in a supervised ML model. In complexity reduction, a penalty is imposed to exclude features that are not meaningfully contributing to out-of-sample prediction accuracy. This penalty value increases with the number of independent variables used by the model.

Autoregressive (AR) Model

when you see autoregressive, think "regress back one time period (t-1) to tell us about this variable at t" In AR models, the dependent variable is regressed against previous values of itself. An autoregressive model of order p can be represented as: xt = b0 + b1*xt−1 + b2*xt−2 + ... + bp*xt−p + εt There is no longer a distinction between the dependent and independent variables (i.e., x is the only variable). An AR(p) model is specified correctly if the autocorrelations of residuals from the model are not statistically significant at any lag. When testing for serial correlation in an AR model, don't use the Durbin-Watson statistic. Use a t-test to determine whether any of the correlations between residuals at any lag are statistically significant. If some are significant, the model is incorrectly specified and a lagged variable at the indicated lag should be added.
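
A sketch estimating an AR(1) model by regressing the series on its first lag and checking the residual autocorrelations (the series is simulated):

import numpy as np

rng = np.random.default_rng(0)
x = np.zeros(300)
for t in range(1, 300):
    x[t] = 1.0 + 0.6 * x[t - 1] + rng.normal()   # simulate AR(1): b0 = 1.0, b1 = 0.6

b1, b0 = np.polyfit(x[:-1], x[1:], 1)            # regress x_t on x_(t-1)
resid = x[1:] - (b0 + b1 * x[:-1])

# residual autocorrelations at lags 1-4 should be insignificant if well specified
for lag in range(1, 5):
    r = np.corrcoef(resid[:-lag], resid[lag:])[0, 1]
    print(lag, round(r, 3), abs(r) > 2 / np.sqrt(len(resid)))  # rough |t| > 2 rule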
