Quantitative Methods 2

deep learning

Algorithms based on complex neural networks, ones with many hidden layers (more than 3), that address highly complex tasks, such as image classification, face recognition, speech recognition, and natural language processing

deep learning nets

Algorithms based on complex neural networks, ones with many hidden layers (more than 3), that address highly complex tasks, such as image classification, face recognition, speech recognition, and natural language processing.

Recall

Also known as sensitivity, in error analysis for classification problems it is the ratio of correctly predicted positive classes to all actual positive classes. Recall is useful in situations where the cost of false negatives (FN), or Type II error, is high.

Soft Margin Classification

An adaptation in the support vector machine algorithm which adds a penalty to the objective function for observations in the training set that are misclassified.

name entity recognition

An algorithm that analyzes individual tokens and their surrounding semantics while referring to its dictionary to tag an object class to the token.

parts of speech

An algorithm that uses language structure and dictionaries to tag every token in the text with a corresponding part of speech (i.e., noun, verb, adjective, proper noun, etc.).

Hierarchical clustering

An iterative unsupervised learning procedure used for building a hierarchy of clusters.

holdout sample

Data samples that are not used to train a model.

metadata

Data that describes and gives information about other data.

variance error

Describes how much a model's results change in response to new data from validation and test samples. Unstable models pick up noise and produce high variance error, causing overfitting and high out-of-sample error.

bias error

Describes the degree to which a model fits the training data. Algorithms with erroneous assumptions produce high bias error with poor approximation, causing underfitting and high in-sample error.

Recall is useful in situations where the cost of

FN or Type II error is high

Precision is useful in situations where the cost of

FP, or Type I error, is high

False positive rate (FPR) =

FP/(TN + FP)

neural networks

Highly flexible machine learning algorithms that have been successfully applied to a variety of tasks characterized by non-linearities and interactions among features. They typically consist of three layers: input layer; hidden layer, where learning occurs; and output layer.

precision

In error analysis for classification problems, it is the ratio of correctly predicted positive classes to all predicted positive classes. Precision is useful in situations where the cost of false positives (FP), or Type I error, is high.

target

In machine learning, the dependent variable (Y) in a labeled dataset; the company in a merger or acquisition that is being acquired.

LASSO

Least absolute shrinkage and selection operator is a type of penalized regression in which the penalty term is the sum of the absolute values of the regression coefficients, so the regression minimizes the sum of squared residuals plus that penalty.
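
Written out, the LASSO objective is usually stated as the sum of squared residuals plus a penalty on the absolute coefficient values. The notation below (n observations, K features, coefficients b_k, fitted values \hat{Y}_i, and penalty hyperparameter \lambda) is assumed for illustration and does not come from the card itself.

```latex
\min_{b_0, b_1, \dots, b_K} \;
\sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2
+ \lambda \sum_{k=1}^{K} \left| b_k \right|,
\qquad \lambda > 0,
\qquad \hat{Y}_i = b_0 + \sum_{k=1}^{K} b_k X_{ki}
```

A larger \lambda penalizes non-zero coefficients more heavily, which is why LASSO tends to drop features and produce parsimonious models.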

Reinforcement learning

Machine learning in which a computer learns from interacting with itself (or data generated by the same algorithm).

Unsupervised learning

Machine learning that does not make use of labeled data.

Supervised learning

Machine learning where algorithms infer patterns between a set of inputs (the X's) and the desired output (Y). The inferred pattern is then used to map a given input set into a predicted output.

mutual information

Measures how much information is contributed by a token to a class of texts. MI will be 0 if the token's distribution in all text classes is the same. MI approaches 1 as the token in any one class tends to occur more often in only that particular class of text.

base error

Model error due to randomness in the data.

web spidering (scraping or crawling) programs

Programs that extract raw content from a source, typically web pages.

term frequency (TF)

Ratio of the number of times a given token occurs in all the texts in the dataset to the total number of tokens in the dataset.

DF =

SentenceCountWithWord/Total number of sentences.

When some words appear very infrequently in a textual dataset, techniques that may address the risk of training highly complex models include:

Stemming, the process of converting inflected word forms into a base word (or stem), is one technique that can address the problem described.

TF-IDF =

TF * IDF
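
The sentence-level formulas in this set (TF, DF, IDF, and TF-IDF) can be sketched in a few lines of Python. This is a minimal illustration, assuming sentences are already tokenized into lowercase word lists and that the logarithm is the natural log; the function names and the toy corpus are hypothetical.

```python
import math

def tf(word, sentence_tokens):
    # TF (sentence level) = WordCountInSentence / TotalWordsInSentence
    return sentence_tokens.count(word) / len(sentence_tokens)

def df(word, sentences):
    # DF = SentenceCountWithWord / Total number of sentences
    return sum(1 for s in sentences if word in s) / len(sentences)

def idf(word, sentences):
    # IDF = log(1 / DF); undefined if the word appears in no sentence
    return math.log(1 / df(word, sentences))

def tf_idf(word, sentence_tokens, sentences):
    # TF-IDF = TF * IDF
    return tf(word, sentence_tokens) * idf(word, sentences)

# Hypothetical toy corpus: four tokenized sentences
sentences = [
    ["profit", "rose", "sharply"],
    ["profit", "fell"],
    ["revenue", "rose"],
    ["profit", "and", "revenue", "rose"],
]

common = tf_idf("profit", sentences[0], sentences)  # word found in 3 of 4 sentences
rare = tf_idf("fell", sentences[1], sentences)      # word found in 1 of 4 sentences
```

In this toy corpus the rare word "fell" scores higher than the common word "profit", which matches the card stating that a high TF-IDF value signals a term concentrated in few texts.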

Recall (R) =

TP/(TP + FN)

True positive rate (TPR) =

TP/(TP + FN)

Precision (P) =

TP/(TP + FP)

Readme files

Text files provided with raw data that contain information related to a data file. They are useful for understanding the data and how they can be interpreted correctly.

Centroid

The center of a cluster formed using the K-means clustering algorithm.

F1 score

The harmonic mean of precision and recall. F1 score is a more appropriate overall performance metric (than accuracy) when there is unequal class distribution in the dataset and it is necessary to measure the equilibrium of precision and recall.

features

The independent variables (X's) in a labeled dataset.

ground truth

The known outcome (i.e., target variable) of each observation in a labelled dataset.

ensemble method

The method of combining multiple learning algorithms, as in ensemble learning.

Sentence Length

The number of characters, including spaces, in a sentence.

Document frequency (DF)

The number of documents (texts) that contain a particular token divided by the total number of documents. It is the simplest feature selection method and often performs well when many thousands of tokens are present.

collection frequency (CF)

The number of times a given word appears in the whole corpus (i.e., collection of sentences) divided by the total number of words in the corpus.

Accuracy

The percentage of correctly predicted classes out of total predictions. It is an overall performance metric in classification problems.

Exploratory data analysis (EDA)

The preliminary step in data exploration, where graphs, charts, and other visualizations (heat maps and word clouds) as well as quantitative methods (descriptive statistics and central tendency measures) are used to observe and summarize data.

one hot encoding

The process by which categorical variables are converted into binary form (0 or 1) for machine reading. It is one of the most common methods for handling categorical features in text data.
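
As a minimal sketch of one-hot encoding, the function below turns a list of category labels into rows of 0/1 indicators, one column per distinct category. The function name and the sample labels are hypothetical.

```python
def one_hot_encode(values):
    # One column per distinct category, in sorted order
    categories = sorted(set(values))
    # Each row has a 1 in the column of its own category and 0 elsewhere
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows

# Hypothetical categorical feature
categories, rows = one_hot_encode(["buy", "sell", "hold", "buy"])
```

Each row sums to 1 because an observation belongs to exactly one category.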

backward propagation

The process of adjusting weights in a neural network, to reduce total error of the network, by moving backward through the network's layers.

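forward propagation

The process of passing input values forward through a neural network's layers, from the input layer through the hidden layers to the output layer, to compute the network's predicted output. Weights are not adjusted in this step; they are adjusted during backward propagation.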

Data Preparation (Cleansing)

The process of examining, identifying, and mitigating (i.e., cleansing) errors in raw data.

Frequency analysis

The process of quantifying how important tokens are in a sentence and in the corpus as a whole. It helps in filtering unnecessary tokens (or features).

Clustering

The sorting of observations into groups (clusters) such that observations in the same cluster are more similar to each other than they are to observations in other clusters.

projection error

The vertical (perpendicular) distance between a data point and a given principal component.

Data Wrangling (Preprocessing)

This task performs transformations and critical processing steps on cleansed data to make the data ready for ML model training (i.e., preprocessing), and includes dealing with outliers, extracting useful variables from existing data points, and scaling the data.

When using big data for inference or prediction, there is a "fourth V":

Veracity relates to the credibility and reliability of different data sources.

overfitting

When a model fits the training data too well and so does not predict well using new data.

generalize

When a model retains its explanatory power when predicting out-of-sample (i.e., using new data).

TF (Sentence Level) =

WordCountInSentence/TotalWordsInSentence

Xi (normalized)=

(Xi − Xmin)/(Xmax − Xmin)
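
The normalization formula, read as (Xi − Xmin)/(Xmax − Xmin), rescales every value into the range 0 to 1. A minimal Python sketch follows; the function name and the guard for a constant feature are assumptions added for illustration.

```python
def min_max_normalize(values):
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        # Assumption: a constant feature cannot be rescaled (division by zero)
        raise ValueError("all values are identical; range is zero")
    # Xi (normalized) = (Xi - Xmin) / (Xmax - Xmin)
    return [(x - x_min) / (x_max - x_min) for x in values]

# Hypothetical feature values
normalized = min_max_normalize([10, 20, 40])
```

The smallest value maps to 0 and the largest to 1, so a single extreme outlier compresses all other values toward one end of the range.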

A column of a document term matrix is best described as representing

a token.

avoid overfitting and improve the predictive power of the CART model by

adding regularization parameters.

the goal of machine learning algorithms is to

automate decision-making processes by generalizing (i.e., "learning") from known examples to determine an underlying structure in the data.

The KNN algorithm has many applications in the investment industry, including

bankruptcy prediction, stock price prediction, corporate bond credit rating assignment, and customized equity and bond index creation.

The shape of the ROC curve provides insight into the model's performance. A more convex curve indicates

better model performance

Model fitting has two types of error:

bias and variance

Most commonly, CART is applied to

binary classification or regression

As used in supervised machine learning, regression problems involve:

continuous target variables.

Feature extraction is the process of

creating (i.e., extracting) new variables from existing ones in the data.

The output produced by preparing and wrangling textual data is best described as a:

document term matrix.

Typical applications of CART in investment management include, among others,

enhancing detection of fraud in financial statements, generating consistent decision processes in equity and fixed-income selection, and simplifying communication of investment strategies to clients.

As complexity increases in the test set,

error rates (Eout) rise and variance error rises.

Data exploration involves three important tasks:

exploratory data analysis, feature selection, and feature engineering

True or false: simulations yield better estimates of expected value than conventional risk-adjusted value models

false; the expected values from simulations should be fairly close to the expected value that we would obtain using the expected values for each of the inputs (rather than the entire distribution).

The model is overfitted, so it has

high variance error.

Which of the following best describes penalized regression? Penalized regression:

is a category of general linear models used when the number of features and overfitting are concerns

Bagging is a very useful technique because

it helps to improve the stability of predictions and protects against overfitting the model.

an important drawback of random forest is that it

lacks the ease of interpretability of individual trees; as a result, it is considered a relatively black box-type algorithm.

the sum of the probabilities of the scenarios we examine in scenario analysis can be

less than one, whereas the sum of the probabilities of outcomes in decision trees and simulations has to equal one.

IDF =

log(1/DF).

Research suggests that machine learning solutions outperform

mean-variance optimization in portfolio construction

Variance error is associated with

overfitting

Supervised machine learning models are trained using labeled data, and depending on the nature of the target (Y) variable, they can be divided into two types:

regression for a continuous target variable and classification for a categorical or ordinal target variable

Supervised learning can be divided into two categories of problems,

regression problems and classification problems, with the distinction between them being determined by the nature of the target (Y) variable

Decision trees are designed for

sequential and discrete risks, where the risk in an investment is considered in phases and the risk in each phase is captured in the possible outcomes and the probabilities that they will occur

A long-established statistical method for dimension reduction is principal components analysis (PCA). PCA is used to

summarize or reduce highly correlated features of data into a few main, uncorrelated composite variables.

Research comparing statistical and machine learning methods' abilities to explain and predict equity prices so far indicates that simple neural networks produce models of equity returns at the individual stock and portfolio level that are

superior to models built using traditional statistical methods due to their ability to capture dynamic and interacting variables.

Multiple regression is an example of

supervised learning

In practice, the smallest number of principal components that should be retained is

that which the scree plot shows as explaining 85% to 95% of total variance in the initial data set.

Therefore, an optimal point of model complexity exists where

the bias and variance error curves intersect and in- and out-of-sample error rates are minimized.

Running a simulation is simplest for firms that consider

the same kind of projects repeatedly.

A high TF-IDF value indicates

the word appears many times within a small number of documents, signifying an important yet unique term within a sentence

The relatively parsimonious models produced by applying penalized regression methods, like LASSO, tend to work well because

they are less subject to overfitting.

Dimension reduction techniques are best described as a means to reduce a set of features:

to a manageable size while retaining as much of the variation in the data as possible.

For classification problems, error analysis involves computing four basic evaluation metrics:

true positive (TP), false positive (FP), true negative (TN), and false negative (FN) metrics

Bias error is associated with

underfitting

Big data differs from traditional data sources based on the presence of a set of characteristics commonly referred to as the 3Vs:

volume, variety, and velocity.

F1 score =

(2 * P * R)/(P + R)

New weight =

(Old weight) - (Learning rate) × (Partial derivative of the total error with respect to the old weight)
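
The weight-update rule can be sketched on a single weight. The error function used here, E(w) = (w - 3)^2 with derivative 2 × (w - 3), is a hypothetical stand-in for a network's total error, chosen only because its minimum at w = 3 is easy to check.

```python
def update_weight(old_weight, learning_rate, gradient):
    # New weight = old weight - learning rate * partial derivative of the error
    return old_weight - learning_rate * gradient

# Hypothetical error function E(w) = (w - 3)^2, so dE/dw = 2 * (w - 3)
w = 0.0
for _ in range(50):
    w = update_weight(w, learning_rate=0.1, gradient=2 * (w - 3))
```

After repeated updates the weight settles close to 3, the value that minimizes the assumed error. A larger learning rate takes bigger steps and, if set too high, can overshoot the minimum.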

Expected base-year after-tax cash flow =

(Revenue × Pretax margin - Non-operating expenses)(1 - Tax rate)

Accuracy =

(TP + TN)/(TP + FP + TN + FN)
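
The formulas in this set for precision, recall (TPR), FPR, F1 score, and accuracy all derive from the same four confusion-matrix counts. The sketch below computes them together; the function name and the example counts are hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    precision = tp / (tp + fp)                    # P = TP/(TP + FP)
    recall = tp / (tp + fn)                       # R = TPR = TP/(TP + FN)
    fpr = fp / (tn + fp)                          # FPR = FP/(TN + FP)
    accuracy = (tp + tn) / (tp + fp + tn + fn)    # share of correct predictions
    f1 = (2 * precision * recall) / (precision + recall)  # harmonic mean of P and R
    return {
        "precision": precision,
        "recall": recall,
        "fpr": fpr,
        "accuracy": accuracy,
        "f1": f1,
    }

# Hypothetical confusion-matrix counts
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

With these counts, precision is 0.80 and accuracy is 0.85; recall is higher than precision because there are fewer false negatives than false positives.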

DLNs have become hugely successful because of a confluence of three developments:

1) advances in analytical methods for fitting these models; 2) the availability of large quantities of machine readable data to train models; and 3) fast computers, especially new chips in the graphics processing unit (GPU) class, tailored for the type of calculations done on DLNs.

Ensemble learning can be divided into two main categories:

1) an ensemble method can be an aggregation of heterogeneous learners (i.e., different types of algorithms combined together with a voting classifier); or 2) an ensemble method can be an aggregation of homogeneous learners (i.e., a combination of the same algorithm, using different training data that are based, for example, on a bootstrap aggregating, or bagging, technique).

common practice for splitting ML dataset

1) training set; 2) cross-validation (CV) set; and 3) test set. These are in the ratio of 60:20:20, respectively.

Generically, there are three ways in which we can go about defining probability distributions:

1) historical data; 2) cross-sectional data; and 3) statistical distribution and parameters

linear classifier

A binary classifier that makes its classification decision based on a linear combination of the features of each data point.

agglomerative clustering

A bottom-up hierarchical clustering method that begins with each observation being treated as its own cluster. The algorithm finds the two closest clusters, based on some measure of distance (similarity), and combines them into 1 new larger cluster. This process is repeated iteratively until all observations are clumped into a single large cluster.

majority-vote classifier

A classifier that assigns to a new data point the predicted label with the most votes (i.e., occurrences).
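
A majority-vote classifier can be sketched with a simple tally. The function name and labels are hypothetical, and the tie-breaking behavior (the label counted first wins) is a property of this sketch rather than part of the definition.

```python
from collections import Counter

def majority_vote(predicted_labels):
    # Return the label with the most votes (occurrences)
    return Counter(predicted_labels).most_common(1)[0][0]

# Hypothetical predictions from three different models for one data point
label = majority_vote(["up", "down", "up"])
```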

K-means

A clustering algorithm that repeatedly partitions observations into a fixed number, k, of non-overlapping clusters.

Bag-of-words (BOW)

A collection of a distinct set of tokens from all the texts in a sample dataset. BOW does not capture the position or sequence of words present in the text.

random forest classifier

A collection of a large number of decision trees trained via a bagging method.

corpus

A collection of text data in any form, including list, matrix, or data table forms.

learning curve

A curve which plots the accuracy rate (= 1 - error rate) in the validation or test samples (i.e., out-of-sample) against the amount of data in the training sample, so is useful for describing under- and overfitting as a function of bias and variance errors.

fitting curve

A curve which shows in- and out-of-sample error rates (Ein and Eout) on the y-axis plotted against model complexity on the x-axis.

test sample

A data sample that is used to test a model's ability to predict well on new data.

training sample

A data sample that is used to train a model.

validation sample

A data sample that is used to validate and tune a model.

labeled data set

A dataset that contains matched sets of observed inputs or features (X's) and the associated output or target (Y).

summation operator

A functional part of a neural network's node that multiplies each input value received by a weight and sums the weighted values to form the total net input, which is then passed to the activation function.

activation function

A functional part of a neural network's node that transforms the total net input received into the final output of the node. The activation function operates like a light dimmer switch that decreases or increases the strength of the input.
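
The two parts of a node, the summation operator and the activation function, can be sketched as follows. The sigmoid activation and the optional bias term are assumptions made for illustration; the cards do not name a specific activation function.

```python
import math

def summation_operator(inputs, weights, bias=0.0):
    # Multiply each input by its weight and sum to get the total net input
    return sum(x * w for x, w in zip(inputs, weights)) + bias

def sigmoid(total_net_input):
    # Assumed activation: squashes any net input into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-total_net_input))

def node_output(inputs, weights, bias=0.0):
    # Net input from the summation operator is passed to the activation function
    return sigmoid(summation_operator(inputs, weights, bias))

# Hypothetical node with two inputs
net = summation_operator([1.0, 2.0], [0.5, -0.25])
out = node_output([1.0, 2.0], [0.5, -0.25])
```

Here the weighted inputs cancel to a net input of 0, which the sigmoid maps to 0.5, the midpoint of the "dimmer switch" range.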

confusion matrix

A grid used for error analysis in classification problems, it presents values for four evaluation metrics including true positive (TP), false positive (FP), true negative (TN), and false negative (FN).

Support vector machine

A linear classifier that determines the hyperplane that optimally separates the observations into two sets of data points.

document term matrix (DTM)

A matrix where each row belongs to a document (or text file), and each column represents a token (or term). The number of rows is equal to the number of documents (or text files) in a sample text dataset. The number of columns is equal to the number of tokens from the BOW built using all the documents in the sample dataset. The cells typically contain the counts of the number of times a token is present in each document.
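
A document term matrix can be sketched directly from this description: one row per document, one column per token in the bag-of-words, and counts in the cells. The function name and the sample documents are hypothetical, and the documents are assumed to be tokenized already.

```python
def build_dtm(documents):
    # Bag-of-words: the distinct tokens across all documents (sorted for stable columns)
    bow = sorted({token for doc in documents for token in doc})
    # One row per document; each cell counts how often the token occurs in it
    dtm = [[doc.count(token) for token in bow] for doc in documents]
    return bow, dtm

# Hypothetical tokenized documents
documents = [
    ["profit", "rose", "profit"],
    ["loss", "rose"],
]
bow, dtm = build_dtm(documents)
```

The matrix has as many rows as documents and as many columns as tokens in the bag-of-words, and word order within each document is lost.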

eigenvalue

A measure that gives the proportion of total variance in the initial dataset that is explained by each eigenvector.

Grid Search

A method of systematically training a model by using various combinations of hyperparameter values, cross validating each model, and determining which combination of hyperparameter values ensures the best model performance.

learning rate

A parameter that affects the magnitude of adjustments in the weights in a neural network.

hyperparameter

A parameter whose value must be set by the researcher before learning begins.

Scree plot

A plot that shows the proportion of total variance in the data explained by each principal component.

Feature engineering

A process of creating new features by changing or transforming existing features.

feature selection

A process whereby only pertinent features from the dataset are selected for model training. Selecting fewer features decreases model complexity and training time.

Penalized regression

A regression that includes a constraint such that the regression coefficients are chosen to minimize the sum of squared residuals plus a penalty term that increases in size with the number of included features.

Pruning

A regularization technique used in CART to reduce the size of the classification or regression tree. Pruning removes sections of the tree that provide little classifying power.

N-grams

A representation of word sequences. The length of a sequence varies from 1 to n. When one word is used, it is a unigram; a two-word sequence is a bigram; and a 3-word sequence is a trigram; and so on.
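
Generating n-grams from a tokenized text takes one sliding window. This is a minimal sketch; the function name and the sample tokens are hypothetical.

```python
def ngrams(tokens, n):
    # Slide a window of length n across the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical tokenized sentence
tokens = ["the", "market", "rose", "today"]
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```

Unlike a plain bag-of-words, bigrams and trigrams preserve some word order, at the cost of many more distinct tokens.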

regular expression (regex)

A sequence of characters in a particular order that defines a search pattern. Regex is used to search for patterns of interest in a given text.
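
As a small illustration of regex, the pattern below uses Python's standard `re` module to pull percentage figures out of a sentence. The sample text and the pattern are hypothetical examples.

```python
import re

# Hypothetical sentence from a financial news text
text = "Revenue rose 12.5% to $3.2 billion, while costs fell 4%."

# One or more digits, an optional decimal part, then a percent sign
percentages = re.findall(r"\d+(?:\.\d+)?%", text)
```

The dollar amount is not matched because no percent sign follows it.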

Dimension reduction

A set of techniques for reducing the number of features in a dataset while retaining variation across observations to preserve the information contained in that variation.

application programming interface (API)

A set of well-defined methods of communication between various software components and typically used for accessing external data.

cluster

A subset of observations from a data set such that all the observations within the same cluster are deemed "similar."

K-Nearest Neighbor

A supervised learning technique that classifies a new observation by finding similarities ("nearness") between this new observation and the existing data.
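
A minimal KNN classifier can be sketched with the standard library. Euclidean distance and the default k = 3 are assumptions of this sketch (both are choices the researcher makes), and the function name and data points are hypothetical.

```python
import math
from collections import Counter

def knn_classify(new_point, points, labels, k=3):
    # Measure "nearness" as Euclidean distance to every labeled observation
    distances = sorted(
        (math.dist(new_point, point), label)
        for point, label in zip(points, labels)
    )
    # Take the labels of the k nearest neighbors and return the most common one
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical labeled observations with two features each
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels = ["low", "low", "low", "high", "high", "high"]

prediction = knn_classify((1.5, 1.5), points, labels, k=3)
```

Because the result depends on distances, features are usually scaled to comparable ranges before KNN is applied.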

Classification and Regression Tree (CART)

A supervised machine learning technique that can be applied to predict either a categorical target variable, producing a classification tree, or a continuous target variable, producing a regression tree. CART is commonly applied to binary classification or regression.

ceiling analysis

A systematic process of evaluating different components in the pipeline of model building. It helps to understand what part of the pipeline can potentially improve in performance by further tuning.

cross-validation

A technique for estimating out-of-sample error directly by determining the error in validation samples.

k-fold cross-validation

A technique in which data (excluding test sample and fresh data) are shuffled randomly and then are divided into k equal sub-samples, with k - 1 samples used as training samples and one sample, the kth, used as a validation sample.
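
The shuffling and splitting step can be sketched as below. The sketch rotates the validation role so that each of the k sub-samples is held out once, which is how the technique is commonly run; the function name and seed are hypothetical, and sub-samples are exactly equal in size only when the number of observations is divisible by k.

```python
import random

def k_fold_splits(n_observations, k, seed=0):
    # Shuffle observation indices, then deal them into k sub-samples
    indices = list(range(n_observations))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        # One sub-sample validates; the other k - 1 are used for training
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

# Hypothetical dataset of 10 observations split into 5 folds
splits = list(k_fold_splits(10, 5))
```

Averaging the validation error across the k rounds gives an estimate of out-of-sample error without touching the test sample.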

Ensemble Learning

A technique of combining the predictions from a collection of models to achieve a more accurate prediction.

Bootstrap aggregating (or bagging)

A technique whereby the original training data set is used to generate n new training data sets or bags of data. Each new bag of data is generated by random sampling with replacement from the initial training set.
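
Generating the bags is a matter of sampling with replacement. In this minimal sketch each bag has the same size as the original training set, which is a common choice but an assumption here; the function name, seed, and data are hypothetical.

```python
import random

def make_bags(training_set, n_bags, seed=0):
    rng = random.Random(seed)
    size = len(training_set)
    # Each bag draws `size` observations with replacement from the training set
    return [[rng.choice(training_set) for _ in range(size)]
            for _ in range(n_bags)]

# Hypothetical training observations
bags = make_bags([1, 2, 3, 4, 5], n_bags=3)
```

Because sampling is with replacement, a bag typically repeats some observations and omits others, so each model in the ensemble sees slightly different data.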

complexity

A term referring to the number of features, terms, or branches in a model and to whether the model is linear or non-linear (non-linear is more complex).

Regularization

A term that describes methods for reducing statistical variability in high dimensional data estimation problems.

divisive clustering

A top-down hierarchical clustering method that starts with all observations belonging to a single large cluster. The observations are then divided into two clusters based on some measure of distance (similarity). The algorithm then progressively partitions the intermediate clusters into smaller ones until each cluster contains only 1 observation.

dendrogram

A type of tree diagram used for visualizing a hierarchical cluster analysis—it highlights the hierarchical relationships among the clusters.

composite variable

A variable that combines two or more variables that are statistically strongly related to each other.

eigenvectors

Vectors that define new, mutually uncorrelated composite variables that are linear combinations of the original features.

