CFA Lvl II - Quant
What are the traditional (i.e. with structured data) ML Model Building Steps?
1. Conceptualization of the modeling task. This crucial first step entails determining what the output of the model should be (e.g., whether the price of a stock will go up/down one week from now), how this model will be used and by whom, and how it will be embedded in existing or new business processes. 2. Data collection. The data traditionally used for financial forecasting tasks are mostly numeric data derived from internal and external sources. Such data are typically already in a structured tabular format, with columns of features, rows of instances, and each cell representing a particular value. 3. Data preparation and wrangling. This step involves cleansing and preprocessing of the raw data. Cleansing may entail resolving missing values, out-of-range values, and the like. Preprocessing may involve extracting, aggregating, filtering, and selecting relevant data columns. 4. Data exploration. This step encompasses exploratory data analysis, feature selection, and feature engineering. 5. Model training. This step involves selecting the appropriate ML method (or methods), evaluating performance of the trained model, and tuning the model accordingly.
What are transformations common in practice for Data Wrangling?
1. Extraction - A new variable can be extracted from the current variable for ease of analyzing and using for training the ML model. 2. Aggregation - Two or more variables can be aggregated into one variable to consolidate similar variables. 3. Filtration - The data rows that are not needed for the project must be identified and filtered. 4. Selection - The data columns that are intuitively not needed for the project can be removed. 5. Conversion - The variables can be of different types: nominal, ordinal, continuous, and categorical. The variables in the dataset must be converted into appropriate types to further process and analyze them correctly.
What are the possible errors in a raw dataset?
1. Incompleteness Error 2. Invalidity Error 3. Inaccuracy Error 4. Inconsistency Error 5. Non-uniformity Error 6. Duplication Error
What are the Text ML Model Building Steps?
1. Text problem formulation. 2. Data (text) curation. 3. Text preparation and wrangling. 4. Text exploration. The resulting output (e.g., sentiment prediction scores) can either be combined with other structured variables or used directly for forecasting and/or analysis.
What is Support Vector Machine (SVM)?
A linear classifier that aims to seek the optimal hyperplane—the one that separates the two sets of data points by the maximum margin (and thus is typically used for classification).
What is a linear classifier?
A binary classifier that makes its classification decision based on a linear combination of the features of each data point.
What is a bag-of-words (BOW)?
A collection of a distinct set of tokens from all the texts in a sample dataset. Created after the cleansed text is normalized.
What is a Random Forest Classifier?
A collection of many different decision trees generated by a bagging method or by randomly reducing the number of features available during training.
What is a corpus?
A collection of text data in any form, including list, matrix, or data table forms
What is a binary CART?
A combination of an initial root node, decision nodes, and terminal nodes. The root node and each decision node represent a single feature (f) and a cutoff value (c) for that feature. The CART algorithm iteratively partitions the data into sub-groups until terminal nodes are formed that contain the predicted label.
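A minimal sketch of fitting a classification tree in code, assuming scikit-learn; the two-feature dataset and the depth limit are illustrative, not from the curriculum.

```python
from sklearn.tree import DecisionTreeClassifier

X = [[0.2, 1.0], [0.4, 0.8], [0.9, 0.3], [0.7, 0.1]]  # feature values (f)
y = [0, 0, 1, 1]                                       # class labels

# Each decision node learns a feature and cutoff value (c); limiting depth
# caps how far the data are iteratively partitioned into sub-groups.
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(tree.predict([[0.8, 0.2]]))  # predicted label from the terminal node reached
```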
Hierarchical clustering is best described as a technique in which: A) the grouping of observations is unsupervised. B) features are grouped into a pre-specified number, k, of clusters. C) observations are classified according to predetermined labels.
A is correct. B is incorrect because it refers to k-means clustering. C is incorrect because it refers to classification, which involves supervised learning.
A column of a document term matrix is best described as representing: A) a token. B) a regularization term. C) an instance.
A is correct. Each column of a document term matrix represents a token from the bag-of-words that is built using all the documents in a sample dataset.
When some words appear very infrequently in a textual dataset, techniques that may address the risk of training highly complex models include: A) stemming. B) scaling. C) data cleansing.
A is correct. Stemming, the process of converting inflected word forms into a base word (or stem), is one technique that can address the problem described.
What is a learning curve in ML?
A curve which plots the accuracy rate (= 1 − error rate) in the validation or test samples (i.e., out-of-sample) against the amount of data in the training sample, so is useful for describing under- and overfitting as a function of bias and variance errors.
What is a fitting curve?
A curve which shows the in- and out-of-sample error rates (Ein and Eout) on the y-axis plotted against model complexity on the x-axis.
What is a test sample?
A data sample that is used to test a model's ability to predict well on new data.
What is a training sample?
A data sample that is used to train a model.
What is a validation sample?
A data sample that is used to validate and tune a model.
What is a labeled data set?
A dataset that contains matched sets of observed inputs or features (Xs) and the associated output or target (Y).
What is a summation operation?
A functional part of a neural network's node that multiplies each input value received by a weight and sums the weighted values to form the total net input, which is then passed to the activation function.
What is Document Term Matrix (DTM)?
A matrix where each row belongs to a document (or text file), and each column represents a token (or term). The number of rows is equal to the number of documents (or text files) in a sample text dataset. The number of columns is equal to the number of tokens from the BOW built using all the documents in the sample dataset. The cells typically contain the counts of the number of times a token is present in each document.
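A minimal sketch of building a DTM, assuming scikit-learn; the three sample "documents" are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["stock price rises", "stock price falls", "earnings beat estimates"]
vec = CountVectorizer()             # columns come from the BOW of all docs
dtm = vec.fit_transform(docs)       # one row per document
print(vec.get_feature_names_out())  # the tokens (column headings)
print(dtm.toarray())                # cells = token counts per document
```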
What is an Eigenvalue?
A measure that gives the proportion of the total variance in the initial dataset that is explained by each eigenvector.
What is Grid Search?
A method of systematically training a model by using various combinations of hyperparameter values, cross validating each model, and determining which combination of hyperparameter values ensures the best model performance.
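A minimal grid search sketch, assuming scikit-learn; the SVM model and its hyperparameter grid are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)
grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}  # values to try
search = GridSearchCV(SVC(), grid, cv=5)  # cross-validates every combination
search.fit(X, y)
print(search.best_params_, search.best_score_)  # best-performing combination
```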
Deep learning net (Definition)
A neural network (NN) with many hidden layers (at least 3 but often more than 20). NNs and DLNs have been successfully applied to a wide variety of complex tasks characterized by non-linearities and interactions among features, particularly pattern recognition problems.
What is the learning rate in ML?
A parameter that affects the magnitude of adjustments in the weights in a neural network.
What is a hyperparameter?
A parameter whose value must be set by the researcher before the learning begins.
What is a scree plot?
A plot that shows the proportion of total variance in the data explained by each principal component.
What is LASSO (least absolute shrinkage and selection operator)?
A popular type of penalized regression where the penalty term involves summing the absolute values of the regression coefficients. The greater the number of included features, the larger the penalty. So, a feature must make a sufficient contribution to model fit to offset the penalty from including it.
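A minimal LASSO sketch, assuming scikit-learn; the synthetic data (only two of five features matter) and the alpha penalty weight are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=100)  # features 2-4 add nothing

# The penalty sums the absolute values of the coefficients, so features
# that contribute too little to model fit shrink toward (or to) zero.
model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)
```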
What is Feature Engineering?
A process of creating new features by changing or transforming existing features.
What is Feature Selection?
A process whereby only pertinent features from the dataset are selected for ML model training. Selecting fewer features decreases ML model complexity and training time.
What is Pruning?
A regularization technique used in CART to reduce the size of the classification or regression tree: in pruning, sections of the tree that provide little classifying power are removed.
What are N-Grams?
A representation of word sequences. The length of a sequence varies from 1 to n. When one word is used, it is a unigram; a two-word sequence is a bigram; and a 3-word sequence is a trigram; and so on.
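A minimal n-gram sketch in plain Python; the token list is invented for illustration.

```python
def ngrams(tokens, n):
    """Return all length-n word sequences from a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["the", "stock", "price", "rose"]
print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams: ['the stock', 'stock price', 'price rose']
print(ngrams(tokens, 3))  # trigrams
```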
What is Regex (regular expression)?
A series of texts that contains characters in a particular order. Regex is used to search for patterns of interest in a given text.
What is dimension reduction?
A set of techniques for reducing the number of features in a dataset while retaining variation across observations to preserve the information contained in that variation.
What is a Cluster?
A subset of observations from a dataset such that all the observations within the same cluster are deemed "similar."
What is cross-validation?
A technique for estimating out-of-sample error directly by determining the error in validation samples.
What is K-Fold Cross-validation?
A technique for mitigating the holdout sample problem (excessive reduction of the training set size). The data (excluding test sample and fresh data) are shuffled randomly and then divided into k equal sub-samples, with k - 1 samples used as training samples and one sample, the kth, used as a validation sample.
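A minimal k-fold cross-validation sketch, assuming scikit-learn; k = 5 and the logistic regression model are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=200, random_state=0)
# Shuffle and split into k = 5 folds; each fold serves once as the
# validation sample while the other k - 1 folds train the model.
folds = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=folds)
print(scores.mean())  # average validation accuracy across the k folds
```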
What is Ensemble Learning?
A technique of combining the predictions from a collection of models. It typically produces more accurate and more stable predictions than the best single model.
What is a dendrogram?
A type of tree diagram used for visualizing a hierarchical cluster analysis - it highlights the hierarchical relationships among clusters.
What is a composite variable?
A variable that combines two or more variables that are statistically strongly related to each other.
What is an Eigenvector?
A vector that defines new mutually uncorrelated composite variables that are linear combinations of the original features.
What is Document Frequency (DF)?
A frequency measure that helps to discard the noise features that carry no specific information about the text class and are present across all texts. The DF of a token is defined as the number of documents (texts) that contain the respective token divided by the total number of documents. It is the simplest feature selection method and often performs well when many thousands of tokens are present.
What is soft margin classification?
An adaptation in the support vector machine algorithm which adds a penalty to the objective function for observations in the training set that are misclassified.
What is Parts of Speech?
An algorithm that uses language structure and dictionaries to tag every token in the text with a corresponding part of speech (i.e., noun, verb, adjective, proper noun, etc.).
What is K-Means?
An unsupervised ML algorithm that partitions observations into a fixed number (k) of non-overlapping clusters. Each cluster is characterized by its centroid, and each observation belongs to the cluster with the centroid to which that observation is closest.
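A minimal k-means sketch, assuming scikit-learn; k = 3 and the synthetic blob data are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # the centroid characterizing each cluster
print(km.labels_[:10])      # each observation joins its closest centroid's cluster
```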
What is a Principal components analysis (PCA)?
An unsupervised ML algorithm that reduces highly correlated features into fewer uncorrelated composite variables by transforming the feature covariance matrix. PCA produces eigenvectors that define the principal components (i.e., the new uncorrelated composite variables) and eigenvalues, which give the proportion of total variance in the initial data that is explained by each eigenvector and its associated principal component.
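A minimal PCA sketch, assuming scikit-learn; standardizing the iris data first and keeping two components are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)
pca = PCA(n_components=2).fit(X)  # two uncorrelated composite variables
# Proportion of total variance explained by each principal component
# (the eigenvalue share of each eigenvector):
print(pca.explained_variance_ratio_)
```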
What is Hierarchical Clustering?
An unsupervised iterative algorithm that is used to build a hierarchy of clusters. Two main strategies—agglomerative (bottom-up) and divisive (top-down)—are used to define the intermediary clusters (i.e., those clusters between the initial dataset and the final set of clustered data).
As used in supervised machine learning, regression problems involve: A) binary target variables. B) continuous target variables. C) categorical target variables.
B is correct. A and C are incorrect because when the target variable is binary or categorical, the problem is a classification problem rather than a regression problem.
A cell of a document term matrix is best described as containing: A) a token. B) a count of tokens. C) a count of instances.
B is correct. A cell in a document term matrix contains a count of the number of tokens of the kind indicated in the column heading.
A neural network is best described as a technique for machine learning that is: A) exactly modeled on the human nervous system. B) based on layers of nodes connected by links when the relationships among the features are usually non-linear. C) based on a tree structure of nodes when the relationships among the features are linear.
B is correct. A is incorrect because neural networks are not exactly modeled on the human nervous system. C is incorrect because neural networks are not based on a tree structure of nodes when the relationships among the features are linear.
How does Agglomerative (bottom-up) hierarchical clustering work?
Begins with each observation being its own cluster. Then, the algorithm finds the two closest clusters, defined by some measure of distance, and combines them into a new, larger cluster. This process is repeated until all observations are clumped into a single cluster.
What are the two types of model fitting errors?
Bias and Variance. Bias error is associated with underfitting, and variance error is associated with overfitting. Bias error is high when a model is overly simplified and does not sufficiently learn from the patterns in the training data. Variance error is high when the model is overly complicated and memorizes the training data so much that it will likely perform poorly on new data.
Describe Text Problem Formulation
First step in Text ML Model building process. Analysts begin by determining how to formulate the text classification problem, identifying the exact inputs and outputs for the model. Perhaps we are interested in computing sentiment scores (structured output) from text (unstructured input). Analysts must also decide how the text ML model's classification output will be utilized.
Dimension reduction techniques are best described as a means to reduce a set of features: A) to a manageable size without regard for the variation in the data. B) to a manageable size while increasing the variation in the data. C) to a manageable size while retaining as much of the variation in the data as possible.
C is correct because dimension reduction techniques, like PCA, are aimed at reducing the feature set to a manageable size while retaining as much of the variation in the data as possible.
CART is best described as: A) an unsupervised ML algorithm. B) a clustering algorithm based on decision trees. C) a supervised ML algorithm that accounts for non-linear relationships among the features.
C is correct. A is incorrect because CART is a supervised ML algorithm. B is incorrect because CART is a classification and regression algorithm, not a clustering algorithm.
Which of the following best describes penalized regression? Penalized regression: A) is unrelated to multiple linear regression. B) involves a penalty term that is added to the predicted target variable. C) is a category of general linear models used when the number of features and overfitting are concerns.
C is correct. A is incorrect because penalized regression is related to multiple linear regression. B is incorrect because penalized regression involves adding a penalty term to the sum of the squared regression residuals.
In text cleansing, situations in which one may need to add an annotation include the removal of: A) html tags. B) white spaces. C) punctuations.
C is correct. Some punctuations, such as percentage signs, currency symbols, and question marks, may be useful for ML model training, so when such punctuations are removed, annotations should be added.
The output produced by preparing and wrangling textual data is best described as a: A) data table. B) confusion matrix. C) document term matrix.
C is correct. The objective of data preparation and wrangling of textual data is to transform the unstructured data into structured data. The output of these processes is a document term matrix that can be read by computers. The document term matrix is similar to a data table for structured data.
Points to cover in normalizing textual data include: A) removing numbers. B) removing white spaces. C) lowercasing the alphabet.
C is correct. The other choices are related to text cleansing.
In machine learning, if the target variable to be predicted is categorical or ordinal (e.g., determining a firm's rating), then it is a task of:
Classification
What are Neural Networks made of?
Consist of nodes connected by links. They have three types of layers: an input layer, hidden layers, and an output layer. Learning takes place in the hidden layer nodes, each of which consists of a summation operator and an activation function. Neural networks have been successfully applied to a variety of investment tasks characterized by non-linearities and complex interactions among variables.
Describe text exploration
Fourth & final step in Text ML Model building process. This step encompasses text visualization through techniques, such as word clouds, and text feature selection and engineering.
The general feature selection methods in text data include?
Frequency, Chi-Square, Mutual Information
What are holdout samples?
Data samples that are held and not used to train a model.
What are the main factors causing model fitting errors?
Dataset size and number of features.
What is Variance Error?
Describes how much a model's results change in response to new data from validation and test samples.
What are two types of problems well suited to Unsupervised ML?
Dimension Reduction and Clustering
What is Base error?
Error due to randomness in the data.
Data exploration involves what three tasks?
Exploratory data analysis, feature selection, and feature engineering.
What is the False Positive Rate (FPR)?
False positive rate (FPR) = FP/(TN + FP)
What is generalization in ML?
Generalization describes the degree to which an ML model retains its explanatory power when predicting out-of-sample. Overfitting, a primary reason for lack of generalization, is the tendency of ML algorithms to tailor models to the training data at the expense of generalization to new data points.
What is a target?
In machine learning, the dependent variable (Y) in a labeled dataset.
How is Classification and regression tree (CART) used?
It can be applied to predict either a categorical target variable, producing a classification tree, or a continuous target variable, producing a regression tree.
What is Regularization?
It describes methods that reduce statistical variability in high dimensional data estimation or prediction problems.
What is the Root Mean Squared Error (RMSE)?
It is a single metric that captures all the prediction errors in the data (n observations). The root mean squared error is computed by finding the square root of the mean of the squared differences between the actual values and the model's predicted values (the errors). A small RMSE indicates potentially better model performance. The formula is: RMSE = √(Σi (Predicted_i − Actual_i)² / n).
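A minimal RMSE computation with NumPy; the predicted and actual values are invented for illustration.

```python
import numpy as np

predicted = np.array([10.2, 9.8, 11.5])
actual = np.array([10.0, 10.0, 11.0])
rmse = np.sqrt(np.mean((predicted - actual) ** 2))
print(rmse)  # ~0.33; smaller indicates potentially better performance
```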
What is the goal of Machine Learning
Machine learning aims at extracting knowledge from large amounts of data by learning from known examples to determine an underlying structure in the data. The emphasis is on generating structure or predictions without human intervention. An elementary way to think of ML algorithms is to "find the pattern, apply the pattern."
What is unsupervised learning?
Machine learning that does not make use of labeled data.
What is involved in Model Tuning?
Managing the trade-off between model bias error, associated with underfitting, and model variance error, associated with overfitting. A fitting curve of in-sample (training sample) error and out-of-sample (cross-validation sample) error on the y-axis versus model complexity on the x-axis is useful for managing the bias vs. variance error trade-off.
What is Mutual Information (MI)?
Measures how much information is contributed by a token to a class of texts. The mutual information value will be equal to 0 if the token's distribution in all text classes is the same. The MI value approaches 1 as the token in any one class tends to occur more often in only that particular class of text.
What are the three tasks of ML model training?
Method selection, performance evaluation, and tuning. Method selection is the art and science of deciding which ML method(s) to incorporate and is guided by such considerations as the classification task, type of data, and size of data. Performance evaluation entails using an array of complementary techniques and measures to quantify and understand a model's performance. Tuning is the process of undertaking decisions and actions to improve model performance. These steps may be repeated multiple times until the desired level of ML model performance is attained.
What are techniques for feature engineering?
Numbers, N-Grams, Named Entity Recognition (NER), Parts of Speech (POS)
How do you calculate Out-of-sample error?
Out-of-sample error equals bias error plus variance error plus base error
What is complexity?
Refers to the number of features, terms, or branches in a model and to whether the model is linear or non-linear (non-linear is more complex).
In machine learning, if the target variable to be predicted is continuous, then the task is one of:
Regression
Describe data(text) curation
Second step in Text ML Model building process. This step involves gathering relevant external text data via web services or web spidering (scraping or crawling) programs that extract raw content from a source, typically web pages. Annotation of the text data with high-quality, reliable target (dependent) variable labels might also be necessary for supervised learning and performance evaluation purposes. For instance, experts might need to label whether a given expert assessment of a stock is bearish or bullish.
How does Divisive (top-down) hierarchical clustering work?
Starts with all observations belonging to a single cluster. The observations are then divided into two clusters based on some measure of distance. The algorithm then progressively partitions the intermediate clusters into smaller clusters until each cluster contains only one observation.
What do you need for Supervised Learning?
Supervised learning depends on having labeled training data as well as matched sets of observed inputs (X's, or features) and the associated output (Y, or target). The inferred pattern is then used to map a given input set into a predicted output.
What is K-Nearest Neighbor (KNN)?
Supervised learning technique most often used for classification. The idea is to classify a new observation by finding similarities ("nearness") between it and its k-nearest neighbors in the existing data set.
What factors govern method selection for ML Models?
Supervised or unsupervised learning, Type of Data, Size of Data.
What is the F1 Score?
The F1 measure is the harmonic mean, or weighted average, of the precision and recall scores. Also called the f-measure or the f-score, the F1 score is calculated using the following formula: F1 score = (2 * P * R)/(P + R). The F1 measure penalizes classifiers with imbalanced precision and recall scores, like the trivial classifier that always predicts the positive class. A model with perfect precision and recall scores will achieve an F1 score of one.
What is a Centroid?
The center of a cluster formed using the K-means clustering algorithm.
What is Bias Error?
The degree to which a model fits the training data; bias error is high when an overly simplified (underfit) model learns too little from the training data.
What are Features?
The independent variables (X's) in a labeled dataset
Describe the functions of the three groups of layers of a deep learning net.
The input layer, the hidden layers, and the output layer constitute the three groups of layers of DLNs. - The input layer receives the inputs (i.e., features) and has as many nodes as there are dimensions of the feature set. - The hidden layers consist of nodes, each comprised of a summation operator and an activation function that are connected by links. These hidden layers are, in effect, where the model is learned. - The final layer, the output layer, produces a set of probabilities of an observation being in any of the target style categories (each represented by a node in the output layer). The DLN assigns the category based on the style category with the highest probability.
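A minimal forward pass through one hidden layer with NumPy, showing the summation operator and activation function at each node; the weights, sigmoid activation, and softmax output are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([0.5, -1.2, 0.3])      # input layer: one node per feature
rng = np.random.default_rng(0)
W_hidden = rng.normal(size=(4, 3))  # links into 4 hidden nodes
W_output = rng.normal(size=(2, 4))  # links into 2 output (category) nodes

net = W_hidden @ x       # summation operator: weighted sum of inputs
hidden = sigmoid(net)    # activation function at each hidden node
scores = W_output @ hidden
probs = np.exp(scores) / np.exp(scores).sum()  # probability per category
print(probs.argmax())    # assign the category with the highest probability
```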
What is a ground truth?
The known outcome (i.e., target variable) of each observation in a labelled dataset.
The number of iterations required to reach optimum results for Machine Learning model training depends on:
The nature of the problem and input data and the level of model performance needed for practical application.
What is model accuracy?
The percentage of correctly predicted classes out of total predictions. Accuracy = (TP + TN)/(TP + FP + TN + FN)
What is Exploratory data analysis (EDA)?
The preliminary step in data exploration. Exploratory graphs, charts, and other visualizations, such as heat maps and word clouds, are designed to summarize and observe data.
What is Winsorization?
The process by which extreme values and outliers are replaced with the maximum (for large value outliers) and minimum (for small value outliers) values of data points that are not outliers.
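A minimal winsorization sketch with NumPy, clipping at the 5th and 95th percentiles (the cutoffs are an illustrative choice).

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # 100 is a large-value outlier
lo, hi = np.percentile(x, [5, 95])
print(np.clip(x, lo, hi))  # outliers replaced by the boundary values
```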
What is one-hot encoding?
The process in which categorical variables are converted into binary form (0 or 1) for machine reading.
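A minimal one-hot encoding sketch, assuming pandas; the ratings column is invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"rating": ["AAA", "AA", "AAA", "B"]})
# One binary (0/1) column per category, readable by an ML algorithm
print(pd.get_dummies(df["rating"], dtype=int))
```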
What is Scaling?
The process of adjusting the range of a feature by shifting and changing the scale of the data. Two of the most common ways of scaling are normalization and standardization.
What is forward propagation?
The process of adjusting weights in a neural network, to reduce total error of the network, by moving forward through the network's layers.
What is Standardization in scaling?
The process of both centering and scaling the variables. Centering involves subtracting the mean (μ) of the variable from each observation (Xi) so the new mean is 0. Scaling adjusts the range of the data by dividing the centered values (Xi − μ) by the standard deviation (σ) of feature X. The resultant standardized variable will have an arithmetic mean of 0 and a standard deviation of 1: Xi (standardized) = (Xi − μ)/σ.
What is normalization in scaling?
The process of rescaling numeric variables to the range [0, 1]. To normalize variable X, the minimum value (Xmin) is subtracted from each observation (Xi), and then this value is divided by the difference between the maximum and minimum values of X (Xmax − Xmin) as follows: Xi (normalized) = (Xi − Xmin)/(Xmax − Xmin).
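A minimal sketch of both scaling methods with NumPy; the data values are illustrative.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
normalized = (x - x.min()) / (x.max() - x.min())  # rescaled to [0, 1]
standardized = (x - x.mean()) / x.std()           # mean 0, std dev 1
print(normalized)    # [0. 0.333 0.667 1.]
print(standardized)  # [-1.342 -0.447 0.447 1.342]
```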
What is model Recall (also known as sensitivity)?
The ratio of correctly predicted positive classes to all actual positive classes. Recall is useful in situations where the cost of FN or Type II error is high—for example, when an expensive product passes quality inspection (predicted Class "0") and is sent to the valued customer, but it is actually quite defective (actual Class "1"). Recall (R) = TP/(TP + FN)
What is Model Precision?
The ratio of correctly predicted positive classes to all predicted positive classes. Precision is useful in situations where the cost of FP, or Type I error, is high—for example, when an expensive product fails quality inspection (predicted Class "1") and is scrapped, but it is actually perfectly good (actual Class "0"). Precision (P) = TP/(TP + FP)
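Pulling the accuracy, precision, recall, and F1 cards together, a minimal sketch computed from invented confusion-matrix counts.

```python
TP, FP, TN, FN = 60, 10, 25, 5  # illustrative counts

accuracy = (TP + TN) / (TP + FP + TN + FN)          # 0.85
precision = TP / (TP + FP)                          # ~0.857
recall = TP / (TP + FN)                             # ~0.923 (= true positive rate)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean, ~0.889
fpr = FP / (TN + FP)                                # false positive rate, ~0.286
print(accuracy, precision, recall, f1, fpr)
```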
What is Term Frequency (TF)?
The ratio of the number of times a given token occurs in all the texts in the dataset to the total number of tokens in the dataset (e.g., word associations, average word and sentence length, and word and syllable counts).
What is Clustering?
The sorting of observations into groups (clusters) such that observations in the same cluster are more similar to each other than they are to observations in other clusters.
What is used as input for the ML algorithm in a big data project involving text data analysis for classifying and predicting sentiment of financial text for particular stocks?
The text data are transformed into structured data for populating the DTM, which is then used as the input for the ML algorithm.
What is projection error?
The vertical (perpendicular) distance between a data point and a given principal component.
Text cleansing typically involves removing the following:
html tags, punctuations, most numbers, and white spaces.
Describe data(text) preparation and wrangling
Third step in Text ML Model building process. This step involves critical cleansing and preprocessing tasks necessary to convert streams of unstructured data into a format that is usable by traditional modeling methods designed for structured inputs.
What is Data Preparation (Cleansing)?
This is the initial and most common task in data preparation that is performed on raw data. Data cleansing is the process of examining, identifying, and mitigating errors in raw data.
What is Data Wrangling (Preprocessing)?
This task performs transformations and critical processing steps on the cleansed data to make the data ready for ML model training. Raw data most commonly are not present in the appropriate format for model consumption. After the cleansing step, data need to be processed by dealing with outliers, extracting useful variables from existing data points, and scaling the data.
What is tokenization?
Tokenization is the process of splitting a given text into separate tokens. This step takes place after cleansing the raw text data (removing html tags, numbers, extra white spaces, etc.). The tokens are then normalized to create the bag-of-words (BOW).
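A minimal tokenization-to-BOW sketch in plain Python; the cleansed sample sentences are invented for illustration.

```python
texts = ["the stock rose", "the bond fell", "the stock fell"]  # cleansed text
tokens = [t for text in texts for t in text.lower().split()]   # tokenization
bow = sorted(set(tokens))  # BOW: distinct tokens across all texts
print(bow)  # ['bond', 'fell', 'rose', 'stock', 'the']
```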
To derive term frequency (TF) at the sentence level and TF-IDF, both of which can be inputs to the DTM, the following frequency measures should be used to create a term frequency measures table:
TotalWordsInSentence; TotalWordCount; TermFrequency (Collection Level); WordCountInSentence; SentenceCountWithWord; Document Frequency; and Inverse Document Frequency.
For classification problems, error analysis involves computing what four basic evaluation metrics?
True positive (TP), false positive (FP), true negative (TN), and false negative (FN) metrics. FP is also called a Type I error, and FN is also called a Type II error.
What is the True Positive Rate (TPR)?
True positive rate (TPR) = TP/(TP + FN) Note that true positive rate is the same as recall.
What is a reinforcement learning (RL) algorithm and how is it used?
A machine learning approach (often combined with DLNs as deep reinforcement learning) in which an agent performs actions that will maximize its rewards over time, taking into consideration the constraints of its environment. Unlike supervised learning, RL has neither direct labeled data for each observation nor instantaneous feedback. With RL, the algorithm needs to observe its environment, learn by testing new actions (some of which may not be immediately optimal), and reuse its previous experiences. The learning subsequently occurs through millions of trials and errors.
What is Overfitting?
When a model fits the training data too well and so does not predict well using new data.
What is Duplication error?
Where duplicate observations are present. This can be corrected by removing the duplicate entries.
What is Inaccuracy error?
Where the data are not a measure of true value. This can be rectified with the help of business records and administrators.
What is Non-uniformity error?
Where the data are not present in an identical format. This can be resolved by converting the data points into a preferable standard format.
What is Incompleteness error?
Where the data are not present, resulting in missing data. This can be corrected by investigating alternate data sources.
What is Invalidity error?
Where the data are outside of a meaningful range, resulting in invalid data. This can be corrected by verifying other administrative data records.
What is Inconsistency error?
Where the data conflict with the corresponding data points or reality. This contradiction should be eliminated by clarifying with another source.
What is metadata?
A set of data that describes and gives information about other data.
Preprocessing for structured data typically involves performing what transformations?
extraction, aggregation, filtration, selection, and conversion.
To carry out receiver operating characteristic (ROC) analysis, ROC curves and area under the curve (AUC) of various models are calculated and compared. The more convex the ROC curve and the higher the AUC, ____________ the model performance.
the better the model performance.
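A minimal ROC/AUC sketch, assuming scikit-learn; the actual labels and predicted probabilities are invented for illustration.

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_prob = [0.1, 0.4, 0.8, 0.7, 0.9, 0.3, 0.6, 0.2]
fpr, tpr, _ = roc_curve(y_true, y_prob)  # points tracing the ROC curve
print(roc_auc_score(y_true, y_prob))     # AUC = 1.0 here (perfect ranking)
```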