Quant
A rudimentary way to think of machine learning algorithms is that they:
"find the pattern, apply the pattern."
ANOVA
(analysis of variance) ANOVA is a statistical procedure that attributes the variation in the dependent variable to one of two sources: the regression model or the residuals (i.e., the error term).
Qualitative Independent Variables
(dummy variables) they capture the effect of binary independent variables. When we want to distinguish between n classes, we must use (n - 1) dummy variables. Otherwise, we would violate the regression assumption of no exact linear relationship between independent variables. A dummy variable can be an intercept dummy or a slope dummy or a combination of the two.
Deep learning algorithms
Algorithms such as neural networks and reinforcement learning learn from their own prediction errors and are used for complex tasks such as image recognition and natural language processing.
formula for F-statistic in multiple regression model test (for all coefficients collectively)
F = MSR / MSE = (RSS/k) / [SSE/(n-k-1)] Rejection of the null hypothesis at a stated level of significance indicates that at least one of the coefficients is significantly different from zero, which is interpreted to mean that at least one of the independent variables in the regression model makes a significant contribution to the explanation of the dependent variable.
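A minimal Python sketch (all ANOVA quantities are hypothetical) showing how RSS, SSE, MSR, MSE, R^2, and the F-statistic fit together, with the critical value taken from scipy:

```python
from scipy.stats import f as f_dist

# Hypothetical ANOVA quantities: k = 3 independent variables, n = 40 observations
RSS, SSE, k, n = 120.0, 60.0, 3, 40
SST = RSS + SSE
R2  = RSS / SST                          # coefficient of determination
MSR = RSS / k                            # mean square regression
MSE = SSE / (n - k - 1)                  # mean square error
F   = MSR / MSE                          # tests H0: all slope coefficients = 0

F_crit = f_dist.ppf(0.95, k, n - k - 1)  # 5% critical value, df = (k, n - k - 1)
print(R2, F, F_crit, F > F_crit)         # reject H0 if F > F_crit
```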
F1 score formula when evaluating fit of a ML algorithm
F1 score = (2*P*R) / (P + R) P = precision R = recall
what is the coefficient of determination (R^2)? What does it represent? formula?
R^2 = RSS/SST. R^2 is the percentage of the variation in the dependent variable explained by the independent variables; the higher R^2 is, the better the regression model fits the data. An R^2 above 90% indicates a very strong relationship.
Seasonality
Seasonality in a time series is tested by calculating the autocorrelations of the error terms. A statistically significant lagged error term (i.e., its p-value is below the significance level) may indicate seasonality. To adjust for seasonality in an AR model, an additional lag of the variable (corresponding to the statistically significant lagged error term) is added to the original model. Usually, if quarterly data are used, the seasonal lag is 4; if monthly data are used, the seasonal lag is 12. If a seasonal lag coefficient is appropriate and corrects the seasonality, a revised model incorporating the seasonal lag will show no statistical significance of the lagged error terms.
signs of nonstationarity
Signs of nonstationarity include a linear trend, an exponential trend, seasonality, or a structural change in the data.
accuracy formula when evaluating fit of a ML algorithm
accuracy = (true positives + true negatives) / (all positives and negatives)
bag-of-words (BOW) procedure
after the n-grams technique is applied to text, a bag-of-words (BOW) procedure then collects all the tokens in a document.
Data wrangling
also known as preprocessing data. includes data transformation and scaling. Data transformation types include extraction, aggregation, filtration, selection, and conversion of data. Scaling is the conversion of data to a common unit of measurement. Common scaling techniques include normalization and standardization.
Deep learning nets
are more complex neural networks with MANY hidden layers (more than 20) useful for pattern, speech, and image recognition.
Random forest algorithm
a type of SUPERVISED machine learning algorithm. This is a variant of the classification tree whereby a large number of classification trees are trained using data bagged from the same data set.
Classification and regression tree algorithms
a type of SUPERVISED machine learning algorithm. This is used for classifying categorical target variables when there are significant nonlinear relationships among variables.
K-nearest neighbor algorithm
a type of SUPERVISED machine learning algorithm. This is used to classify an observation based on nearness to the observations in the training sample.
Penalized regression algorithm
a type of SUPERVISED machine learning algorithm. This reduces overfitting by imposing a penalty on—and reducing—the nonperforming features.
Hierarchical clustering algorithm
a type of UNSUPERVISED machine learning algorithm. This builds a hierarchy of clusters without any predefined number of clusters.
K-means clustering algorithm
a type of UNSUPERVISED machine learning algorithm. This partitions observations into k non-overlapping clusters; a centroid is associated with each cluster.
Principal components analysis algorithm
a type of UNSUPERVISED machine learning algorithm. This summarizes the information in a large number of correlated factors into a much smaller set of uncorrelated factors called eigenvectors.
Neural networks
comprise an input layer, hidden layers (which process the input), and an output layer. The nodes in the hidden layer are called neurons, which comprise a summation operator (that calculates a weighted average) and an activation function (a nonlinear function).
Conditional heteroskedasticity affects what?
computed t-statistic and F-statistic. Conditional heteroskedasticity results in consistent coefficient estimates, but it biases standard errors, affecting the computed t-statistic and F-statistic
Steps in a Data Analysis Project
1) conceptualization of the modeling task, 2) data collection, 3) data preparation and wrangling, 4) data exploration, 5) model training
In multiple regression assumptions, the variance of the error term IS assumed to be
constant and non-random, resulting in errors that are homoskedastic
cross-validation
method of reducing overfitting in a supervised ML model. Cross-validation is a technique for evaluating ML models by training several ML models on subsets of the available input data and evaluating them on the complementary subsets of the data. Use cross-validation to detect overfitting, i.e., failing to generalize a pattern
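A minimal k-fold cross-validation sketch using only numpy; the features, target, and OLS model are made up for illustration:

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(100, 3)                                  # hypothetical features
y = X @ np.array([1.0, -2.0, 0.5]) + np.random.randn(100)    # hypothetical target

k_folds = 5
indices = np.arange(len(y))
np.random.shuffle(indices)
folds = np.array_split(indices, k_folds)

errors = []
for i in range(k_folds):
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k_folds) if j != i])
    # Fit OLS on the training folds, then evaluate on the held-out (complementary) fold
    Xtr = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
    beta, *_ = np.linalg.lstsq(Xtr, y[train_idx], rcond=None)
    Xte = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
    errors.append(np.mean((y[test_idx] - Xte @ beta) ** 2))

print(np.mean(errors))   # average out-of-sample MSE across the folds
```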
what makes a variable significant in a regression analysis?
p value less than significance level. The p-value is the smallest level of significance for which the null hypothesis can be rejected. An independent variable is significant if the p-value is less than the stated significance level.
receiver operating characteristic (ROC)
plots a curve showing the tradeoff between the false positive rate and the true positive rate
precision (P) formula when evaluating fit of a ML algorithm
precision (P) = true positives / (false positives + true positives)
recall (R) formula when evaluating fit of a ML algorithm
recall (R) = true positives / (true positives + false negatives)
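A small sketch tying together the accuracy, precision, recall, and F1 formulas from the cards above; the confusion-matrix counts are hypothetical:

```python
# Hypothetical confusion-matrix counts from a binary classifier
TP, FP, TN, FN = 80, 20, 90, 10

accuracy  = (TP + TN) / (TP + FP + TN + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```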
Durbin-Watson statistic may be used to test for
serial correlation at a single lag
what can you conclude if you reject the null hypothesis of a durbin-watson test.
serial correlation. the error terms are correlated with each other.
Data standardization
standardization centers the variables at a mean of 0 and a standard deviation of 1. Unlike normalization, standardization is not sensitive to outliers, but it assumes that the variable distribution is normal.
Data Scaling
the conversion of data to a common unit of measurement. Common scaling techniques include normalization and standardization.
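A small numpy sketch of both scaling techniques applied to a hypothetical feature:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 100.0])            # hypothetical raw feature

normalized   = (x - x.min()) / (x.max() - x.min())    # rescaled to the range [0, 1]
standardized = (x - x.mean()) / x.std()               # mean 0, standard deviation 1

print(normalized)
print(standardized)
```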
consequence of conditional heteroskedasticity
the standard errors will be too low, which, in turn, causes the t-statistics to be too high
predominance
the state or condition of being greater in number or amount
Moving data from a storage medium to where they are needed is referred to as
transfer of data
conditional heteroskedasticity causes what type of error?
Type I error. Standard errors of the parameter estimates will be too small and the t-statistics too large
If you find serial correlation in a time series model, what do we do? then, how do we correct the following: Correcting for a linear trend Correcting for an exponential trend Correcting for a structural shift Correcting for seasonality
use an autoregressive (AR) model, after transforming the series to be covariance stationary. Correcting for a linear trend: use first differencing. Correcting for an exponential trend: take the natural log and then first difference. Correcting for a structural shift: estimate the models before and after the change. Correcting for seasonality: add a seasonal lag.
Serial Correlation
when the residuals are correlated with each other. When serial correlation is positive, a residual tends to be followed by a residual of the same sign in the next period; when negative, a residual tends to be followed by a residual of the opposite sign. effect: Coefficients are consistent. Standard errors are underestimated. Too many Type I errors (positive correlation). detection: Breusch-Godfrey (BG) F-test. correction: Use robust or Newey-West corrected standard errors
Data cleansing
Data cleansing deals with missing, invalid, inaccurate, and non-uniform values, and duplicate observations.
Outliers vs high-leverage points
Outliers are extreme observations of the dependent or "Y" variable; high-leverage points are the extreme observations of the independent or "X" variables.
Ensemble learning algorithm
a type of SUPERVISED machine learning algorithm. This combines predictions from multiple models, resulting in a lower average error rate.
document term matrix
A document term matrix organizes text as structured data: documents are represented by rows and tokens by columns. Cell values reflect the number of times a token appears in a document.
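A minimal pure-Python sketch of a bag of words and a document term matrix (documents as rows, tokens as columns), using made-up sentences and simple whitespace tokenization:

```python
from collections import Counter

docs = ["rates rise as inflation rises", "inflation falls as rates fall"]
tokenized = [d.split() for d in docs]                          # whitespace tokenization

vocab = sorted(set(tok for doc in tokenized for tok in doc))   # the bag of words
dtm = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

print(vocab)   # columns: tokens
print(dtm)     # rows: documents; cell values: token counts
```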
standard error of the estimate formula
SEE = Sqrt [SSE/(n-k-1)]
degrees of freedom of the sum of squares errors (SSE)
SSE df = n-k-1
formula for R^2
RSS/SST, or the square of the correlation coefficient (multiple R)
What does the sum of squared totals (SST) mean?
SST = RSS + SSE. SST is the total variation in the dependent variable: the sum of squared deviations of the actual values of the dependent variable from their mean.
formula for the mean reverting level for an AR(1) model
= b0 / (1-b1) b0 = intercept b1 = coefficient on the lagged value of the series (xt−1)
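A quick worked example with hypothetical AR(1) estimates:

```python
b0, b1 = 2.0, 0.6                   # hypothetical intercept and lag coefficient
mean_reverting_level = b0 / (1 - b1)
print(mean_reverting_level)         # 5.0: the level the series tends to revert toward
```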
multiple-R formula
= sqrt(R^2)
what does A low SEE imply
A low SEE implies a high R^2
Autoregressive Conditional Heteroskedasticity (ARCH)
ARCH describes the condition where the variance of the residuals in one time period within a time series is dependent on the variance of the residuals in a previous period. When this condition exists, the standard errors of the regression coefficients in AR models and the hypothesis tests of these coefficients are invalid. The ARCH(1) regression model is expressed as: εt^2 = a0 + a1*εt−1^2 + μt If the coefficient, a1, is statistically different from zero, the time series is ARCH(1).
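A minimal sketch of the ARCH(1) regression: regress squared residuals on their own first lag and inspect a1. The residuals here are simulated stand-ins for residuals from a fitted AR model, and the significance test on a1 is omitted:

```python
import numpy as np

np.random.seed(1)
resid = np.random.randn(200)          # stand-in for residuals from a fitted AR model
eps2 = resid ** 2

y = eps2[1:]                                         # epsilon_t^2
X = np.column_stack([np.ones(len(y)), eps2[:-1]])    # [1, epsilon_(t-1)^2]
(a0, a1), *_ = np.linalg.lstsq(X, y, rcond=None)

print(a0, a1)   # if a1 is statistically different from zero, the series exhibits ARCH(1)
```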
Random Walk
A random walk time series is one for which the value in one period is equal to the value in the previous period, plus a random (unpredictable) error. If we believe a time series is a random walk (i.e., has a unit root), we can transform the data to a covariance stationary time series using a procedure called first differencing. Random walk without a drift: xt = xt−1 + εt Random walk with a drift: xt = b0 + xt−1 + εt In either case, the mean reverting level is undefined (b1 = 1), so the series is not covariance stationary.
Multicollinearity
A situation in which several independent variables are highly correlated with each other. This characteristic can result in difficulty in estimating separate or independent regression coefficients for the correlated variables. effect: Coefficients are consistent (but unreliable). Standard errors are overestimated. Too many Type II errors. detection: Conflicting t and F-statistics; high variance inflation factors (VIF). correction: Drop one of the correlated variables, or use a different proxy for an included independent variable
Unit Root
A unit root is a stochastic trend in a time series, sometimes called a "random walk with drift". If a time series has a unit root, it shows a systematic pattern that is unpredictable. If the value of the lag coefficient is equal to one, the time series is said to have a unit root and will follow a random walk process. A series with a unit root is not covariance stationary. Economic and finance time series frequently have unit roots. First differencing will often eliminate the unit root. If there is a unit root, this period's value is equal to last period's value plus a random error term and the mean reverting level is undefined. In probability theory and statistics, a unit root is a feature of some stochastic processes, such as random walks, that can cause problems in statistical inference involving time series models.
Joint hypothesis test
An F-test to evaluate nested models, which consist of a full (or unrestricted) model and a restricted model that uses "q" fewer independent variables. To test whether the "q" excluded variables add to the explanatory power of the model, we test the hypothesis: the null is that all coefficients of the excluded variables are equal to zero, and the alternative is that at least one of the excluded coefficients is not equal to zero. We reject the null hypothesis if the F-statistic is greater than the critical value. In essence, this is a combined test of the coefficients of all excluded variables. F-statistic in a joint hypothesis test: F = [(SSEr - SSEu)/q] / [SSEu/(n-k-1)], where SSEr = restricted model's SSE, SSEu = unrestricted model's SSE, q = # of excluded independent variables in the restricted model, and k = # of independent variables in the unrestricted model. Decision rule: reject H0 if F (test statistic) > Fc (critical value)
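A small sketch of the joint (nested-model) F-test with hypothetical sums of squared errors:

```python
from scipy.stats import f as f_dist

SSE_u, SSE_r = 50.0, 60.0   # unrestricted and restricted SSE (hypothetical)
n, k, q = 60, 5, 2          # observations, slopes in the unrestricted model, excluded variables

F = ((SSE_r - SSE_u) / q) / (SSE_u / (n - k - 1))
F_crit = f_dist.ppf(0.95, q, n - k - 1)

print(F, F_crit, F > F_crit)   # reject H0 (all excluded coefficients are zero) if F > F_crit
```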
Akaike's information criterion (AIC) and Schwarz's Bayesian information criterion (BIC)
Both are used to evaluate competing models with the same dependent variable. AIC is preferred if the goal is a better forecast, while BIC is preferred if the goal is a better goodness of fit. Both formulas use the natural logarithm. ** Lower AIC & BIC values indicate a better model under either criterion. While R^2 measures the goodness of fit of the independent variables in the regression model, AIC and BIC also penalize model complexity. AIC = n×ln(SSE/n) + 2(k+1); BIC = n×ln(SSE/n) + ln(n)×(k+1), where k = # of independent variables and n = # of observations. In practice, adding independent variables increases k and therefore the penalty term, but if the added variables improve the fit, SSE falls and the first term shrinks; AIC and BIC balance these two effects.
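A quick sketch applying the AIC and BIC formulas above to two hypothetical competing models:

```python
import numpy as np

def aic(sse, n, k):
    return n * np.log(sse / n) + 2 * (k + 1)

def bic(sse, n, k):
    return n * np.log(sse / n) + np.log(n) * (k + 1)

# Hypothetical competing models on the same dependent variable (n = 60 observations)
print(aic(sse=48.0, n=60, k=3), bic(sse=48.0, n=60, k=3))
print(aic(sse=45.0, n=60, k=5), bic(sse=45.0, n=60, k=5))   # lower values indicate the better model
```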
how to check for serial correlation in a time series model?
Check for serial correlation using the Durbin-Watson statistic (for trend models; for AR models, test the residual autocorrelations with t-tests instead).
Data Exploration
Data exploration involves exploratory data analysis, feature selection, and feature engineering. Exploratory data analysis looks at summary statistics describing the data and any patterns or relationships that can be observed. Feature selection involves choosing only those features that meaningfully contribute to the model's predictive power. Feature engineering optimizes the selected features.
degrees of freedom of the sum of squares totals (SST)
SST df = (k) + (n-k-1) = (n-1)
Cointegration
Cointegration means that two time series are economically linked (related to the same macro variables) or follow the same trend and that relationship is not expected to change. If two time series are cointegrated, the error term from regressing one on the other is covariance stationary and the t-tests are reliable. To test whether two time series are cointegrated, we regress one variable on the other using the following model: yt = b0 + b1xt + ε where: yt = value of time series y at time t xt = value of time series x at time t
Cook's D values
Cook's distance (D) is a measure of how much a regression estimate would change if an observation were deleted. It's useful for identifying influential data points. An observation is flagged as influential if its Cook's D is greater than 2*sqrt(k/n). Influential data points should be checked for input errors; alternatively, the observation may be valid but the model incomplete.
Structural Change (Coefficient Instability) in regression modeling
Estimated regression coefficients may change from one time period to another. There is a trade-off between the statistical reliability of using a long time series and the coefficient stability of a short time series. You need to ask, has the economic process or environment changed? A structural change is indicated by a significant shift in the plotted data at a point in time that seems to divide the data into two distinct patterns. When this is the case, you have to run two different models, one incorporating the data before and one after that date, and test whether the time series has actually shifted. If the time series has shifted significantly, a single time series encompassing the entire period (i.e., encompassing both patterns) will likely produce unreliable results, so the model using more recent data may be more appropriate.
requirements for a series to be covariance stationary
For a time series to be covariance stationary: 1) the series must have an expected value that is constant and finite in all periods, 2) the series must have a variance that is constant and finite in all periods 3) the covariance of the time series with itself for a fixed number of periods in the past or future must be constant and finite in all periods.
First Differencing
If we believe a time series is a random walk (i.e., has a unit root), we can transform the data to a covariance stationary time series using a procedure called first differencing. involves subtracting the value of the time series (i.e., the dependent variable) in the immediately preceding period from the current value of the time series to define a new dependent variable, y. yt = xt − xt−1 ⇒ yt = εt Then, stating y in the form of an AR(1) model: yt = b0 + b1yt−1 + εt where: b0 = b1 = 0 This transformed time series has a finite mean-reverting level of 0/(1−0)=0 and is, therefore, covariance stationary.
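A minimal numpy sketch of first differencing a simulated random walk (and the log first difference used for exponential trends):

```python
import numpy as np

np.random.seed(2)
x = np.cumsum(np.random.randn(200)) + 100     # simulated random walk (has a unit root)

y = np.diff(x)                  # first difference: y_t = x_t - x_(t-1)
log_diff = np.diff(np.log(x))   # natural log, then first difference

print(round(x.std(), 2), round(y.std(), 2))   # the differenced series no longer drifts
```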
formula for mean square errors (MSE)
MSE = SSE/(n-k-1) k = # of independent variables n = number of data points (dependent variables)
overfitting in supervised learning (ML)
In supervised learning, overfitting results from having a large number of independent variables (features), resulting in an overly complex model that may fit random noise, which improves in-sample forecasting accuracy. However, overfit models do not generalize well to new data (i.e., low out-of-sample R-squared). To reduce the problem of overfitting, data scientists use *complexity reduction* and *cross-validation*. In complexity reduction, a penalty is imposed to exclude features that are not meaningfully contributing to out-of-sample prediction accuracy. This penalty value increases with the number of independent variables used by the model. Cross-validation is a technique for evaluating ML models by training several ML models on subsets of the available input data and evaluating them on the complementary subsets of the data. Use cross-validation to detect overfitting, i.e., failing to generalize a pattern.
Supervised machine learning
Inputs and outputs are identified by humans for the computer, and the algorithm uses this labeled training data to model relationships.
named entity recognition (NER)
It is a method of identifying entities such as names, locations, etc., that are often seen in unstructured data. An algorithm that analyzes individual tokens and their surrounding semantics while referring to its dictionary to tag an object class to the token.
formula for mean square residuals (MSR)
MSR = RSS/k k = # of independent variables
Nested models
Models that are the same in every way except that (at least) one parameter is fixed (at 0) within one of the models. Nested models comprise an unrestricted (or full) model and a restricted model that uses "q" fewer independent variables. We use the joint hypothesis test to test the null hypothesis.
consequences of multicollinearity
Multicollinearity refers to independent variables that are correlated with each other. Multicollinearity causes standard errors for the regression coefficients to be too high, which, in turn, causes the t-statistics to be too low. multicollinearity has no effect on the F-statistic
Chain Rule of Forecasting
Multiperiod forecasting with AR models is done one period at a time, where risk increases with each successive forecast because it is based on previously forecasted values. The calculation of successive forecasts in this manner is referred to as the chain rule of forecasting. A one-period-ahead forecast for an AR(1) model is determined in the following manner: xt+1 = b0 + b1*xt Likewise, a 2-step-ahead forecast for an AR(1) model is calculated as: xt+2 = b0 + b1*xt+1
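A worked sketch of the chain rule with hypothetical AR(1) estimates:

```python
b0, b1, x_t = 1.0, 0.8, 10.0      # hypothetical AR(1) estimates and current value

x_t1 = b0 + b1 * x_t              # one-period-ahead forecast: 9.0
x_t2 = b0 + b1 * x_t1             # two-period-ahead forecast uses the forecasted x_t1: 8.2

print(x_t1, x_t2)
```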
Jason Fye, CFA, wants to check for seasonality in monthly stock returns (i.e., the January effect) after controlling for market cap and systematic risk. The type of model that Fye would most appropriately select is:
Multiple regression model.
Data Normalization
Normalization rescales variables to values between 0 and 1: normalized x = (x - min) / (max - min). Unlike standardization, normalization is sensitive to outliers.
parts-of-speech (POS) tokenization
POS tagging (parts-of-speech tagging) is a process of marking up the words in a text as a particular part of speech based on each word's definition and context. It is responsible for reading text in a language and assigning a specific token (part of speech) to each word.
Reinforcement learning
Perform an action, then learn based on a reward or punishment; agents seek to learn from their own errors, maximizing a defined reward.
Logistic Regression Models
Predict nominal (qualitative) responses, such as "yes, a company will go bankrupt" or "no, a company will not go bankrupt," during some predetermined period. Qualitative dependent variables (e.g., bankrupt versus non-bankrupt) require methods other than ordinary least squares. Logistic regression (logit) models use log odds as the dependent variable, and the coefficients are estimated using the maximum likelihood estimation methodology. The slope coefficients in a logit model are interpreted as the change in the "log odds" of the event occurring per 1-unit change in the independent variable, holding all other independent variables constant.
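A short sketch of how a logit slope coefficient is read; the coefficient and baseline probability are hypothetical:

```python
import numpy as np

b1 = 0.40                     # hypothetical logit slope coefficient (change in log odds)
odds_multiplier = np.exp(b1)  # odds are multiplied by ~1.49 per 1-unit increase in X

p = 0.30                                      # hypothetical baseline probability
log_odds = np.log(p / (1 - p))                # the logit transformation
new_p = 1 / (1 + np.exp(-(log_odds + b1)))    # implied probability after a 1-unit increase in X

print(odds_multiplier, new_p)
```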
what is the degrees of freedom of the regression sum of squares (RSS)
RSS df = k
multicollinearity has an effect on which statistic
T-statistic not F-statistic.
Support vector machine algorithm
a type of SUPERVISED machine learning algorithm. This is a linear classification algorithm that separates the data into one of two possible classifiers based on a model-defined hyperplane.
Preparing, Wrangling, and Exploring Text-Based Data for Financial Forecasting
Text processing involves removing HTML tags, punctuation, numbers, and white spaces. Text is then normalized by lowercasing of words, removal of stop words, and stemming/lemmatization. Text wrangling involves tokenization of text.
One possible problem that could jeopardize the validity of the employment growth rate model is multicollinearity. Which of the following would most likely suggest the existence of multicollinearity?
The F-statistic suggests that the overall regression is significant; however, the regression coefficients are not individually significant.
what does adjusted R^2 represent?
R^2 is the ratio of the explained variation to the total variation; in the context of regression, it is a statistical measure of how well the regression line approximates the actual data, i.e., the explanatory power of the regression model. The adjusted R^2 provides a measure of goodness of fit that adjusts for the number of independent variables included in the model: adjusted R^2 = 1 - [(n-1)/(n-k-1)]×(1-R^2).
Unsupervised learning
The computer is not given labeled data; rather, it is provided unlabeled data that the algorithm uses to determine the structure of the data.
Root mean squared error (RMSE)
Used to assess the predictive accuracy of autoregressive models. For example, you could compare the results of an AR(1) and an AR(2) model. The RMSE is the square root of the average (or mean) squared error. The model with the lower RMSE is better. Out-of-sample forecasts predict values using a model for periods beyond the time series used to estimate the model. The RMSE of a model's out-of-sample forecasts should be used to compare the accuracy of alternative models.
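A small sketch comparing out-of-sample RMSE for two candidate models; the actual values and forecasts are hypothetical:

```python
import numpy as np

actual       = np.array([1.2, 0.8, 1.5, 1.1])   # hypothetical out-of-sample values
forecast_ar1 = np.array([1.0, 0.9, 1.3, 1.2])
forecast_ar2 = np.array([1.1, 0.7, 1.6, 1.0])

rmse = lambda a, f: np.sqrt(np.mean((a - f) ** 2))
print(rmse(actual, forecast_ar1), rmse(actual, forecast_ar2))   # lower RMSE = better model
```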
What would a regression formula be of a dependent variable (e.g., sales) on three independent variables?
Yi = b0 + (b1 × X1i) + (b2 × X2i) + (b3 × X3i) + εi, where Yi = dependent variable, b0 = intercept, b1, b2, b3 = slope coefficients, X1i, X2i, X3i = independent variables, and εi = error term
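A minimal sketch fitting the three-variable regression above by OLS with numpy; the data are simulated for illustration:

```python
import numpy as np

np.random.seed(3)
n = 50
X = np.random.randn(n, 3)                                        # X1, X2, X3
y = 2.0 + X @ np.array([1.5, -0.7, 0.3]) + np.random.randn(n)    # simulated dependent variable

X_design = np.column_stack([np.ones(n), X])                      # adds the intercept column
coefs, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print(coefs)   # estimates of [b0, b1, b2, b3]
```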
N-grams technique
a technique that defines a token as a sequence of words and is applied when the sequence is important. N-gram is a technique used in natural language processing (NLP). N-grams are contiguous sequences of n items from a given sequence of text. They can be used to capture local context and to extract meaning from text data
Covariance Stationary
the data fluctuate around a constant mean, i.e., they can be plotted around a straight line parallel to the x-axis. Signs of nonstationarity include a linear trend, an exponential trend, seasonality, or a structural change in the data. Covariance stationarity is a property of a time series that indicates that its properties, such as the mean and variance, remain constant over time. A nonstationary time series leads to invalid linear regression estimates with no economic meaning. A time series is covariance stationary if it satisfies the following three conditions: 1. Constant and finite mean. 2. Constant and finite variance. 3. Constant and finite covariance with leading or lagged values. To determine whether a time series is covariance stationary, we can: 1. Plot the data to see if the mean and variance remain constant. 2. Perform the Dickey-Fuller test (a test for a unit root, i.e., whether b1 − 1 is equal to zero). Most economic and financial time series relationships are not stationary. The degree of nonstationarity depends on the length of the series and the underlying economic and market environment and conditions. An AR(1) model is covariance stationary when it has a finite mean reverting level, i.e., |b1| < 1.
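A short sketch of checking for a unit root with the Dickey-Fuller test; the series are simulated, and statsmodels (whose adfuller function implements the augmented Dickey-Fuller test) is assumed to be available:

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

np.random.seed(4)
random_walk = np.cumsum(np.random.randn(250))   # unit root: not covariance stationary
white_noise = np.random.randn(250)              # covariance stationary

for name, series in [("random walk", random_walk), ("white noise", white_noise)]:
    stat, pvalue = adfuller(series)[:2]             # H0: the series has a unit root
    print(name, round(stat, 2), round(pvalue, 3))   # small p-value: reject the unit root
```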
What does the sum of squared errors (SSE) mean?
difference between: the (predicted) regression estimate and the actual value. SSE is the sum of the squared distances between each observation's actual value of the dependent variable and its predicted value (on the regression line).
What does the regression sum of squares (RSS) mean?
difference between: the mean value and the (predicted) regression estimate. RSS is the sum of the squared distances between the predicted values of the dependent variable (on the regression line) and the mean of the dependent variable.
hegemony
domination over others
Curation
ensuring the quality of data, for example by adjusting for bad or missing data.
Influential data points
extreme observations that when excluded cause a significant change in model coefficients. Influential data points cause the model to perform poorly out of sample.
Log-Linear Trend Model (time series model)
for exponential growth over time (like dividend reinvesting). Log-linear regression assumes the dependent financial variable grows at some constant rate: yt = e^(b0 + b1(t)), so ln(yt) = ln[e^(b0 + b1(t))] ⇒ ln(yt) = b0 + b1(t). The log-linear model is best for a data series that exhibits a trend or for which the residuals are correlated or predictable or the mean is non-constant. Most of the data related to investments have some type of trend and thus lend themselves more to a log-linear model. In addition, any data that have seasonality are candidates for a log-linear model. Recall that any exponential growth data call for a log-linear model. The use of the transformed data produces a linear trend line with a better fit for the data and increases the predictive ability of the model. Because the log-linear model more accurately captures the behavior of the time series, the impact of serial correlation in the error terms is minimized.
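A small sketch of fitting a log-linear trend by regressing ln(y) on t; the series is simulated with constant exponential growth:

```python
import numpy as np

np.random.seed(5)
t = np.arange(1, 101)
y = np.exp(0.5 + 0.03 * t) * np.exp(0.05 * np.random.randn(100))   # exponential growth + noise

b1, b0 = np.polyfit(t, np.log(y), 1)    # fit ln(y) = b0 + b1*t (polyfit returns slope first)
print(b0, b1)                           # b1 estimates the constant growth rate (~0.03)

forecast_next = np.exp(b0 + b1 * 101)   # convert back from logs to forecast the level
print(forecast_next)
```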
The degree to which a machine learning model retains its explanatory power when predicting out-of-sample is most commonly described as:
generalization
Conditional Heteroskedasticity
heteroskedasticity means the variance of the residuals is not constant (e.g., the residuals grow larger as the value of the independent variable increases), so the model does not fit equally well across the data. Heteroskedasticity is CONDITIONAL when it is related to the independent variables; unconditional heteroskedasticity is not related to the independent variables and isn't a problem. effect: Coefficients are consistent. Standard errors are underestimated. Too many Type I errors. detection: Breusch-Pagan chi-square test. correction: Use robust or White-corrected standard errors
Linear Trend Model (time series model)
just one independent variable: time ('normal y = b + mx'). The typical time series trend model uses time as the independent variable to estimate the value of the time series (the dependent variable) in period t: yt = b0 + b1(t) + εt, where the predicted change in y per period is b1 and t = 1, 2, ..., T. Trend models are limited in that they assume time alone explains the dependent variable. Also, they tend to be plagued by various assumption violations. The Durbin-Watson test statistic can be used to check for serial correlation. A linear trend model may be appropriate if the data points seem to be equally distributed above and below the line and the mean is constant. Growth in GDP and inflation levels are likely candidates for linear trend models.
complexity reduction
method of reducing overfitting in a supervised ML model. In complexity reduction, a penalty is imposed to exclude features that are not meaningfully contributing to out-of-sample prediction accuracy. This penalty value increases with the number of independent variables used by the model.
Autoregressive (AR) Model
when you see autoregressive, think "regress back one time period (t-1) to tell us about this variable at t" In AR models, the dependent variable is regressed against previous values of itself. An autoregressive model of order p can be represented as: xt = b0 + b1*xt−1 + b2*xt−2 + ... + bp*xt−p + εt There is no longer a distinction between the dependent and independent variables (i.e., x is the only variable). An AR(p) model is specified correctly if the autocorrelations of residuals from the model are not statistically significant at any lag. When testing for serial correlation in an AR model, don't use the Durbin-Watson statistic. Use a t-test to determine whether any of the correlations between residuals at any lag are statistically significant. If some are significant, the model is incorrectly specified and a lagged variable at the indicated lag should be added.
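A minimal sketch of estimating an AR(1) model by OLS and checking the residual autocorrelations; the series is simulated, and the approximate 2/sqrt(T) band stands in for the formal t-test on each autocorrelation:

```python
import numpy as np

np.random.seed(6)
x = np.zeros(300)
for t in range(1, 300):
    x[t] = 1.0 + 0.6 * x[t - 1] + np.random.randn()     # simulate an AR(1) series

# Regress x_t on x_(t-1)
X = np.column_stack([np.ones(299), x[:-1]])
(b0, b1), *_ = np.linalg.lstsq(X, x[1:], rcond=None)
resid = x[1:] - X @ np.array([b0, b1])

# Residual autocorrelations at lags 1-4; roughly significant if |rho| > 2/sqrt(T)
T = len(resid)
for lag in range(1, 5):
    rho = np.corrcoef(resid[:-lag], resid[lag:])[0, 1]
    print(lag, round(rho, 3), abs(rho) > 2 / np.sqrt(T))
```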
