CFA Level 2 (Quantitative Methods)
LOS 4.a: Identify and explain steps in a data analysis project.
The steps involved in a data analysis project include (1) conceptualization of the modeling task, (2) data collection, (3) data preparation and wrangling, (4) data exploration, and (5) model training.
LOS 2.j: Describe implications of unit roots for time-series analysis, explain when unit roots are likely to occur and how to test for them, and demonstrate how a time series with a unit root can be transformed so it can be analyzed with an AR model.
A time series has a unit root if the coefficient on the lagged dependent variable is equal to one. A series with a unit root is not covariance stationary. Economic and finance time series frequently have unit roots. Data with a unit root must be first differenced before being used in a time-series model.
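As a minimal sketch (assuming pandas is installed; the values are hypothetical), first differencing removes the unit root so the differenced series can be modeled with an AR process:

```python
# First differencing a series with a unit root (hypothetical price levels).
import pandas as pd

prices = pd.Series([100.0, 102.0, 101.5, 103.0, 104.2])  # a level series likely to have a unit root
diffs = prices.diff().dropna()  # first differences: x_t - x_(t-1)
print(diffs)                    # the differenced series can be analyzed with an AR model
```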
LOS 2.a: Calculate and evaluate the predicted trend value for a time series, modeled as either a linear trend or a log-linear trend, given the estimated trend coefficients.
A time series is a set of observations for a variable over successive periods of time. A time series model captures the time-series pattern and allows us to make predictions about the variable in the future.
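A short worked illustration of LOS 2.a with made-up coefficients (the estimates and forecast period below are hypothetical, not from the curriculum):

```python
# Predicted trend value for a linear and a log-linear trend model.
import math

b0, b1, t = 2.0, 0.05, 21            # hypothetical estimated coefficients and forecast period

linear_trend = b0 + b1 * t           # linear trend: y_t = b0 + b1*t
log_linear = math.exp(b0 + b1 * t)   # log-linear trend: ln(y_t) = b0 + b1*t  =>  y_t = e^(b0 + b1*t)

print(linear_trend, log_linear)
```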
LOS 2.c: Explain the requirement for a time series to be covariance stationary and describe the significance of a series that is not stationary.
A time series is covariance stationary if its mean, variance, and covariances with lagged and leading values do not change over time. Covariance stationarity is a requirement for using AR models.
LOS 2.f: Explain mean reversion and calculate a mean-reverting level.
A time series is mean reverting if it tends toward its mean over time. The mean-reverting level for an AR(1) model is b0 / (1 - b1).
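A quick worked example with hypothetical AR(1) estimates:

```python
# Mean-reverting level of an AR(1) model: b0 / (1 - b1). Coefficients are hypothetical.
b0, b1 = 1.2, 0.7
mean_reverting_level = b0 / (1 - b1)   # = 4.0; values above 4.0 are expected to fall, values below to rise
print(mean_reverting_level)
```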
LOS 4.f: Describe objectives, steps, and techniques in model training.
Before model training, the model is conceptualized: ML engineers work with domain experts to identify data characteristics and relationships. ML seeks to identify patterns in the training data such that the model is able to generalize to out-of-sample data. Model fitting errors can be caused by using a small training sample or by using an inappropriate number of features. Too few features may underfit the data, while too many features can lead to the problem of overfitting. Model training involves model selection, model evaluation, and tuning.
LOS 2.i: Describe characteristics of random walk processes and contrast them to covariance stationary processes.
A random walk time series is one for which the value in one period is equal to the value in the previous period, plus a random error. A random walk process does not have a mean-reverting level and is not covariance stationary.
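An illustrative simulation (assuming numpy; not from the curriculum) that contrasts a random walk with its stationary first difference:

```python
# Simulating a random walk x_t = x_(t-1) + e_t.
import numpy as np

rng = np.random.default_rng(0)
errors = rng.normal(0.0, 1.0, 500)
x = np.cumsum(errors)               # each value equals the previous value plus a random error
print(x.var(), np.diff(x).var())    # the level series wanders with no mean-reverting level;
                                    # its first difference is just e_t, which is covariance stationary
```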
LOS 1.c: Explain the assumptions underlying a multiple linear regression model and interpret residual plots indicating potential violations of these assumptions.
Assumptions underlying a multiple regression model include: -A linear relationship exists between the dependent and independent variables. -The residuals are normally distributed. -The variance of the error terms is constant for all observations. -The residual for one observation is not correlated with that of another observation. -The independent variables are not random, and there is no exact linear relation between any two or more independent variables.
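A minimal residual-plot sketch, assuming statsmodels and matplotlib are available and using simulated data, to show how violations (heteroskedasticity, nonlinearity) would appear:

```python
# Residuals vs. fitted values diagnostic plot for a multiple regression (hypothetical data).
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))                                   # two independent variables
y = 1.0 + 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(size=200)

model = sm.OLS(y, sm.add_constant(X)).fit()
plt.scatter(model.fittedvalues, model.resid)   # look for funnels (nonconstant variance) or patterns (nonlinearity)
plt.axhline(0, color="grey")
plt.xlabel("Fitted values"); plt.ylabel("Residuals")
plt.show()
```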
LOS 4.d: Describe objectives, methods, and examples of data exploration.
Data exploration involves exploratory data analysis (EDA), feature selection, and feature engineering (FE). EDA looks at summary statistics describing the data and any patterns or relationships that can be observed. Feature selection involves choosing only those features that meaningfully contribute to the model's predictive power. FE optimizes the selected features.
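A quick EDA sketch, assuming pandas and a hypothetical data frame of candidate features:

```python
# Exploratory data analysis: summary statistics and pairwise correlations.
import numpy as np
import pandas as pd

rng = np.random.default_rng(10)
df = pd.DataFrame({"ret": rng.normal(size=100),
                   "size": rng.normal(size=100),
                   "value": rng.normal(size=100)})

print(df.describe())   # summary statistics for each feature
print(df.corr())       # pairwise correlations help flag redundant features before feature selection
```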
LOS 4.b: Describe objectives, steps, and examples of preparing and wrangling data.
Data cleansing deals with missing, invalid, inaccurate, and nonuniform values as well as with duplicate observations. Data wrangling or preprocessing includes data transformation and scaling. Data transformation types include extraction, aggregation, filtration, selection, and conversion of data. Scaling is the conversion of data to a common unit of measurement. Common scaling techniques include normalization and standardization. Normalization scales variables between the values of 0 and 1, while standardization centers the variables at a mean of 0 and a standard deviation of 1. Unlike normalization, standardization is not sensitive to outliers, but it assumes that the variable distribution is normal.
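A short illustration of the two scaling techniques (assuming numpy; the values, including the outlier, are hypothetical):

```python
# Min-max normalization vs. z-score standardization.
import numpy as np

x = np.array([2.0, 5.0, 7.0, 10.0, 50.0])           # note the outlier at 50

normalized = (x - x.min()) / (x.max() - x.min())     # rescales to [0, 1]; the outlier compresses the other values
standardized = (x - x.mean()) / x.std()              # centers at mean 0 with standard deviation 1
print(normalized, standardized)
```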
LOS 4.g: Describe preparing, wrangling, and exploring text-based data for financial forecasting.
Text processing involves removing HTML tags, punctuation, numbers, and white spaces. Text is then normalized by lowercasing of words, removal of stop words, stemming, and lemmatization. Text wrangling involves tokenization of text. N-grams is a technique that defines a token as a sequence of words and is applied when the sequence is important. A bag-of-words (BOW) procedure then collects all the tokens in a document. A document term matrix organizes text as structured data: documents are represented by rows and tokens by columns, and cell values reflect the number of times a token appears in a document.
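A toy sketch of building a bag-of-words and document term matrix, assuming scikit-learn is installed and using made-up documents:

```python
# Bag-of-words and document term matrix with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["profits rose sharply", "profits fell", "revenue rose"]
vectorizer = CountVectorizer()              # tokenizes and lowercases by default
dtm = vectorizer.fit_transform(docs)        # rows = documents, columns = tokens
print(vectorizer.get_feature_names_out())   # the token vocabulary
print(dtm.toarray())                        # cell values = number of times a token appears in a document
```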
LOS 3.b: Describe overfitting and identify methods of addressing it.
In supervised learning, overfitting results from using a large number of independent variables (features), producing an overly complex model that may fit the random noise in the training data. This improves in-sample forecasting accuracy, but overfit models do not generalize well to new data (i.e., low out-of-sample R-squared). To reduce the problem of overfitting, data scientists use complexity reduction and cross-validation. In complexity reduction, a penalty is imposed to exclude features that are not meaningfully contributing to out-of-sample prediction accuracy. This penalty value increases with the number of independent variables used by the model.
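An illustrative sketch of both remedies, assuming scikit-learn and simulated data in which only two of twenty candidate features matter:

```python
# Penalized (LASSO) regression evaluated with cross-validation.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 20))                                 # many candidate features
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=100)      # only two features actually matter

ols = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
lasso = cross_val_score(Lasso(alpha=0.1), X, y, cv=5, scoring="r2")
print(ols.mean(), lasso.mean())   # the penalized model typically holds up better out of sample
```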
LOS 2.g: Contrast in-sample and out-of-sample forecasts and compare the forecasting accuracy of different time-series models based on the root mean squared error criterion.
In-sample forecasts are made within the range of data used in the estimation. Out-of-sample forecasts are made outside of the time period for the data used in the estimation. The root mean squared error (RMSE) criterion is used to compare the accuracy of autoregressive models in forecasting out-of-sample values. For example, a researcher may have two autoregressive (AR) models, both of which seem to fit the data: an AR(1) model and an AR(2) model. To determine which model will more accurately forecast future values, we calculate the RMSE for each model's out-of-sample forecasts. The model with the lower RMSE for the out-of-sample data will have lower forecast error and will be expected to have better predictive power in the future.
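A worked comparison using hypothetical out-of-sample forecasts and actual values (assuming numpy):

```python
# Comparing two models' out-of-sample forecasts by RMSE.
import numpy as np

actual = np.array([1.2, 0.8, 1.5, 1.1])
ar1_fcst = np.array([1.0, 0.9, 1.4, 1.3])
ar2_fcst = np.array([1.3, 0.7, 1.6, 1.0])

rmse = lambda f: np.sqrt(np.mean((actual - f) ** 2))   # square root of the average squared error
print(rmse(ar1_fcst), rmse(ar2_fcst))                   # choose the model with the lower out-of-sample RMSE
```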
LOS 2.h: Explain the instability of coefficients of time-series models.
Most economic and financial time-series data are not stationary. The degree of the nonstationarity depends on the length of the series and changes in the underlying economic environment.
LOS 1.a: Describe the types of investment problems addressed by multiple linear regression and the regression process.
Multiple regression models can be used to identify relationships between variables, to forecast variables, or to test existing theories.
LOS 3.e: Describe neural networks, deep learning nets, and reinforcement learning.
Neural networks comprise an input layer, hidden layers (which process the input), and an output layer. The nodes in the hidden layer are called neurons, which comprise a summation operator (that calculates a weighted average) and an activation function (a nonlinear function). Deep learning networks are neural networks with many hidden layers (more than 20), useful for pattern, speech, and image recognition. Reinforcement learning (RL) algorithms seek to learn from their own errors, thus maximizing a defined reward.
LOS 1.k: Describe influence analysis and methods of detecting influential data points.
Outliers are extreme observations of the dependent or 'Y' variable, while high-leverage points are extreme observations of the independent or 'X' variables. Influential data points are extreme observations that, when excluded, cause a significant change in model coefficients, causing the model to perform poorly out-of-sample. Cook's D values greater than √(k/n) indicate that the observation is highly likely to be an influential data point. Influential data points should be checked for input errors; alternatively, the observation may be valid but the model incomplete.
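A minimal sketch assuming statsmodels and simulated data, flagging observations whose Cook's D exceeds √(k/n):

```python
# Influence analysis with Cook's distance.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 2))
y = 1.0 + X @ np.array([0.5, -0.2]) + rng.normal(size=50)
y[0] += 10                                        # plant one influential observation

results = sm.OLS(y, sm.add_constant(X)).fit()
cooks_d = results.get_influence().cooks_distance[0]
k, n = 2, len(y)                                  # k = number of independent variables
print(np.where(cooks_d > np.sqrt(k / n))[0])      # indices of likely influential data points
```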
LOS 1.m: Formulate and interpret a logistic regression model.
Qualitative dependent variables (e.g., bankrupt versus non-bankrupt) require methods other than ordinary least squares. Logistic regression (logit) models use log odds as the dependent variable, and the coefficients are estimated using the maximum likelihood estimation methodology. The slope coefficients in a logit model are interpreted as the change in the "log odds" of the event occurring per 1-unit change in the independent variable, holding all other independent variables constant.
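A logit sketch assuming statsmodels and simulated bankruptcy data (the variable names and coefficients are hypothetical):

```python
# Logistic regression: the slope is the change in log odds per 1-unit change in the predictor.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
leverage = rng.normal(size=300)
prob = 1 / (1 + np.exp(-(-1.0 + 2.0 * leverage)))    # true model used only to simulate labels
bankrupt = rng.binomial(1, prob)

logit = sm.Logit(bankrupt, sm.add_constant(leverage)).fit(disp=0)   # maximum likelihood estimation
print(logit.params)   # slope = change in the log odds of bankruptcy per 1-unit change in leverage
```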
LOS 1.l: Formulate and interpret a multiple regression model that includes qualitative independent variables.
Qualitative independent variables (dummy variables) capture the effect of a binary independent variable. A dummy variable can be an intercept dummy, or a slope dummy.
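An illustrative specification with both dummy types, assuming statsmodels' formula API and a hypothetical recession indicator:

```python
# Intercept dummy (shifts the intercept) and slope dummy (interaction term shifts the slope).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
df = pd.DataFrame({"x": rng.normal(size=200),
                   "recession": rng.integers(0, 2, 200)})
df["y"] = 1.0 + 0.5 * df.x - 0.8 * df.recession - 0.4 * df.recession * df.x + rng.normal(size=200)

fit = smf.ols("y ~ x + recession + recession:x", data=df).fit()
print(fit.params)   # 'recession' is the intercept dummy; 'recession:x' is the slope dummy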
LOS 2.l: Explain how to test and correct for seasonality in a time-series model and calculate and interpret a forecasted value using an AR model with a seasonal lag.
Seasonality in a time series is tested by calculating the autocorrelations of error terms. A statistically significant lagged error term corresponding to the periodicity of the data indicates seasonality. Seasonality can be corrected by incorporating the appropriate seasonal lag term in an AR model. If a seasonal lag coefficient is appropriate and corrects the seasonality, the AR model with the seasonal terms will have no statistically significant autocorrelations of error terms.
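A worked forecast from an AR model with a seasonal lag, using hypothetical quarterly estimates:

```python
# Forecast from an AR(1) model augmented with a seasonal (fourth-quarter) lag.
b0, b1, b4 = 0.05, 0.30, 0.60   # hypothetical intercept, first-lag, and seasonal-lag coefficients
x_lag1, x_lag4 = 0.02, 0.04     # most recent value and the value four quarters ago

forecast = b0 + b1 * x_lag1 + b4 * x_lag4   # x_t = b0 + b1*x_(t-1) + b4*x_(t-4)
print(forecast)
```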
LOS 3.c: Describe supervised machine learning algorithms—including penalized regression, support vector machine, k-nearest neighbor, classification and regression tree, ensemble learning, and random forest—and determine the problems for which they are best suited.
Supervised learning algorithms include: -Penalized regression. Reduces overfitting by imposing a penalty on and reducing the nonperforming features. -Support vector machine (SVM). A linear classification algorithm that separates the data into one of two possible classes based on a model-defined hyperplane. -K-nearest neighbor (KNN). Used to classify an observation based on nearness to the observations in the training sample. -Classification and regression tree (CART). Used for classifying categorical target variables when there are significant nonlinear relationships among variables. -Ensemble learning. Combines predictions from multiple models, resulting in a lower average error rate. -Random forest. A variant of the classification tree whereby a large number of classification trees are trained using data bagged from the same data set.
LOS 4.e: Describe methods for extracting, selecting and engineering features from textual data.
Summary statistics for textual data include term frequency and co-occurrence. A word cloud is a visual representation of all the words in a BOW, such that words with higher frequency have a larger font size; this allows the analyst to determine which words are contextually more important. Feature selection can use tools such as document frequency, the chi-square test, and mutual information (MI). FE for text data includes identification of numbers, usage of N-grams, named entity recognition (NER), and parts of speech (POS) tagging.
LOS 2.o: Determine an appropriate time-series model to analyze a given investment problem and justify that choice.
The RMSE criterion is used to determine which forecasting model will produce the most accurate forecasts. The RMSE equals the square root of the average squared error.
LOS 2.k: Describe the steps of the unit root test for nonstationarity and explain the relation of the test to autoregressive time-series models.
To determine whether a time series is covariance stationary, we can (1) run an AR model and examine the autocorrelations, and/or (2) perform the Dickey-Fuller test.
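A minimal sketch of the second approach, assuming statsmodels and a simulated series with a unit root:

```python
# Augmented Dickey-Fuller unit root test.
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(6)
random_walk = np.cumsum(rng.normal(size=300))   # a series with a unit root

stat, pvalue, *_ = adfuller(random_walk)
print(stat, pvalue)   # a high p-value means we cannot reject the null of a unit root (nonstationarity)
```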
LOS 3.d: Describe unsupervised machine learning algorithms—including principal components analysis, k-means clustering, and hierarchical clustering—and determine the problems for which they are best suited.
Unsupervised learning algorithms include: -Principal components analysis. Summarizes the information in a large number of correlated factors into a much smaller set of uncorrelated factors, called eigenvectors. -K-means clustering. Partitions observations into k nonoverlapping clusters; a centroid is associated with each cluster. -Hierarchical clustering. Builds a hierarchy of clusters without any predefined number of clusters.
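An illustrative pipeline combining two of these algorithms, assuming scikit-learn and toy data:

```python
# PCA to compress correlated features, then k-means to partition observations into clusters.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 10))                       # many candidate features

components = PCA(n_components=2).fit_transform(X)    # a smaller set of uncorrelated composites
labels = KMeans(n_clusters=3, n_init=10).fit_predict(components)   # k = 3 nonoverlapping clusters
print(labels[:10])
```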
LOS 2.e: Explain how autocorrelations of the residuals can be used to test whether the autoregressive model fits the time series.
When an AR model is correctly specified, the residual terms will not exhibit serial correlation. If the residuals possess some degree of serial correlation, the AR model that produced the residuals is not the best model for the data being studied and the regression results will be problematic. The procedure to test whether an AR time-series model is correctly specified involves three steps: (1) estimate the AR model being evaluated using linear regression, (2) calculate the autocorrelations of the model's residuals, and (3) test whether the autocorrelations are significant.
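A sketch of the three steps, assuming statsmodels and a simulated AR(1) series (the standard error of each residual autocorrelation is approximated by 1/√T):

```python
# Testing residual autocorrelations of a fitted AR model.
import numpy as np
from statsmodels.tsa.ar_model import AutoReg
from statsmodels.tsa.stattools import acf

rng = np.random.default_rng(8)
x = np.zeros(300)
for t in range(1, 300):
    x[t] = 0.6 * x[t - 1] + rng.normal()       # simulate an AR(1) series

resid = AutoReg(x, lags=1).fit().resid         # step 1: estimate the AR model
autocorr = acf(resid, nlags=5)[1:]             # step 2: residual autocorrelations at lags 1-5
t_stats = autocorr / (1 / np.sqrt(len(resid))) # step 3: significant t-stats suggest misspecification
print(t_stats)
```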
LOS 2.n: Explain how time-series variables should be analyzed for nonstationarity and/or cointegration before use in a linear regression.
When working with two time series in a regression: (1) if neither time series has a unit root, then the regression can be used; (2) if only one series has a unit root, the regression results will be invalid; (3) if both time series have a unit root and are cointegrated, then the regression can be used; (4) if both time series have a unit root but are not cointegrated, the regression results will be invalid. The Dickey-Fuller test with critical t-values calculated by Engle and Granger is used to determine whether two time series are cointegrated.
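A minimal sketch of the Engle-Granger cointegration test, assuming statsmodels and a simulated pair of unit-root series that share a common trend:

```python
# Engle-Granger cointegration test for two unit-root series.
import numpy as np
from statsmodels.tsa.stattools import coint

rng = np.random.default_rng(9)
x = np.cumsum(rng.normal(size=400))    # a unit-root series
y = 2.0 * x + rng.normal(size=400)     # shares the same stochastic trend as x

stat, pvalue, _ = coint(x, y)
print(stat, pvalue)   # a low p-value suggests cointegration, so the regression of y on x can be used
```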
LOS 3.a: Describe supervised machine learning, unsupervised machine learning, and deep learning.
With supervised learning, inputs and outputs are identified for the computer, and the algorithm uses this labeled training data to model relationships. With unsupervised learning, the computer is not given labeled data; rather, it is provided unlabeled data that the algorithm uses to determine the structure of the data. Deep learning uses algorithms such as neural networks, and reinforcement learning algorithms learn from their own prediction errors; both are used for complex tasks such as image recognition and natural language processing.