Machine Learning


Deep learning

The computer learns from interacting with itself (i.e., from data generated by the same algorithm); based on neural networks

What is complexity in ML?

refers to the number of features, terms, or branches in the model and to whether the model is linear or non-linear (non-linear is more complex). As models become more complex, overfitting risk increases.

Three sources of out of sample errors

1. Bias error (model fit; decreases as the model gets more complex) 2. Variance error (responsiveness to new data; increases as the model gets more complex) 3. Base error (due to randomness in the data). Out-of-sample error = bias error + variance error + base error.

Two types of unsupervised learning models

1. Dimension reduction: use principal component analysis (PCA). The goal is to reduce the number of features to a manageable size while retaining the variation in the data, i.e., to reduce highly correlated features into a few main, uncorrelated composite variables. The algorithm looks for the eigenvectors with the highest eigenvalues (see the PCA sketch below). Con: the chosen features can be hard to interpret. 2. Clustering: create subsets of similar data from the data set that are as dissimilar from one another as possible (separation). Useful in investing, where grouping by similarity is important. Con: requires judgment in deciding what counts as similar. a. K-means b. Hierarchical clustering
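
For illustration, a minimal PCA sketch in Python using scikit-learn on synthetic data (the library choice and the data are assumptions, not part of the curriculum):

```python
# Reduce correlated features to a few uncorrelated principal components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                    # 200 observations, 10 features
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)    # make feature 1 track feature 0

pca = PCA(n_components=3)              # keep the 3 components with largest eigenvalues
Z = pca.fit_transform(X)               # Z: the new, uncorrelated composite variables
print(pca.explained_variance_ratio_)   # share of total variance each component explains
```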

Three layers of neural network

1. Input layer 2. Hidden layer (where learning occurs) 3. Output layer. *The nodes in the hidden layer transform the input features; inputs are standardized (scaled) to adjust for differences in the units of the data.

Issues with ML algorithms

1. Overly complex models that are difficult to explain 2. Overfitting, where the model does not generalize well to new data (low out-of-sample predictive power); this is why out-of-sample performance matters more than in-sample fit.

Supervised learning methods

1. Regression a. Penalized regression (minimizes the sum of squared errors but applies a penalty term that grows with each added feature) i. LASSO: the penalty term is scaled by lambda (λ), a hyperparameter set before the learning begins; the larger λ, the higher the penalty (if λ = 0, it is regular OLS). See the sketch below. 2. Classification a. Support vector machine (SVM) b. K-nearest neighbor (KNN) c. Classification and regression tree (CART)
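
A hedged LASSO sketch using scikit-learn's Lasso on synthetic data (the library, the alpha value, and the data are illustrative assumptions; alpha plays the role of λ):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))                          # 20 candidate features
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)    # only 2 truly matter

model = Lasso(alpha=0.1).fit(X, y)   # larger alpha => heavier penalty; at 0, plain OLS
print(np.sum(model.coef_ != 0))      # LASSO drives weak features' coefficients to zero
```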

Two functional parts of the hidden layer nodes

1. Summation operator (multiplies each input value by a weight and sums the weighted values to form a new value) that gets passed on to 2. Activation function (transforms that value into the node's final output). *Backward propagation is the iterative process where network weights are adjusted to reduce the total error of the network: New weight = (Old weight) − (Learning rate) × (Partial derivative of the total error with respect to the old weight). The learning rate sets the magnitude of the adjustment; the larger the error, the larger the adjustment. See the numeric sketch below.
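
A minimal numeric sketch of this update rule, assuming a single sigmoid node and squared error (all values are made up for illustration):

```python
import numpy as np

def activation(z):                 # sigmoid activation function
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 0.3])     # standardized input features
w = np.array([0.1, 0.4, -0.2])     # node weights
target = 1.0
lr = 0.05                          # learning rate: magnitude of the adjustment

z = np.dot(w, x)                   # summation operator
out = activation(z)                # activation function output
error = out - target

# Backward propagation: new weight = old weight - lr * d(error^2)/d(weight)
grad = 2 * error * out * (1 - out) * x
w = w - lr * grad
```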

3 classes of machine learning

1. Supervised Learning 2. Unsupervised Learning 3. Deep Learning

3 samples within ML data sets

1. Training sample (in-sample) 2. Validation sample (for validating and tuning the model) 3. Test sample (out-of-sample)

Two main techniques to avoid overfitting risk

1. Complexity reduction 2. Cross-validation (estimate the model's out-of-sample error, then adjust the model accordingly)

How to prevent overfitting in supervised machine learning

1. Prevent the algorithm from getting too complex (penalties for complexity) 2. Proper data sampling using cross-validation (determining error using the validation sample) a. K-fold cross-validation: shuffle the data into k sub-samples, where one of the k samples serves as the validation sample in each round (see the sketch below)
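
A sketch of k-fold cross-validation using scikit-learn's KFold with k = 5 on synthetic data (the library, the model, and k = 5 are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, -0.5, 0.0, 2.0]) + rng.normal(size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)   # shuffle, then split into k folds
for train_idx, val_idx in kf.split(X):                 # each fold validates once
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    print(model.score(X[val_idx], y[val_idx]))         # out-of-sample R^2 per fold
```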

How to calculate dimensionality

= number of features. With 19 fundamental and 5 technical factors (i.e., the features), the dimensionality of the model is 19 + 5 = 24.

What type of Model is a CART

CART is a supervised ML algorithm, specifically a classification and regression algorithm.

Unsupervised learning + 2 important applications

Does not make use of labeled data; no target is supplied, so the algorithm tries to find structure within the data by itself. 1. Reducing the dimension of the data to fewer, more relevant features (dimension reduction) 2. Sorting the data into clusters (clustering)

Hierarchical clustering

Does not require any prior input (by contrast, in k-means clustering the algorithm segments the data into a predetermined number of clusters with no defined relationship among the resulting clusters). An iterative procedure used to build a hierarchy of clusters. Works in rounds where in each round the clusters either: a. increase in size (agglomerative clustering): each observation starts as its own cluster, and the algorithm merges the two closest clusters based on distance/similarity; or b. decrease in size (divisive clustering): all observations start in a single cluster and are divided into more and more clusters based on similarity. A dendrogram helps visualize the clusters and highlights the relationships among them (see the sketch below).
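
A sketch of agglomerative clustering ending in a dendrogram, using SciPy on synthetic data (the library, Ward linkage, and the data are assumptions for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (20, 2)),    # two loose groups of observations
               rng.normal(5, 1, (20, 2))])

Z = linkage(X, method="ward")   # bottom-up: repeatedly merge the two closest clusters
dendrogram(Z)                   # visualizes the hierarchy of merges
plt.show()
```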

Supervised learning

Independent variables are features; the dependent variable is the target. Can be used for: 1. regression problems (continuous target) 2. classification problems (categorical or ordinal target)

What is ensemble learning + two main categories

Instead of using a single model, predict using a group/ensemble of models, taking the combined (e.g., averaged or majority-vote) result of many prediction models. a. Heterogeneous learners: different models combined through a voting classifier (the output that is most frequent across the models wins). b. Homogeneous learners: the same model trained on different training data sets via bootstrap aggregating (bagging), where each training set is a bootstrap sample of the original data. See the sketch below.
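
A sketch showing both flavors with scikit-learn on synthetic data (the specific models chosen for the voting ensemble are arbitrary illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier, BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Heterogeneous: different models, majority vote on the predicted class
voter = VotingClassifier([("lr", LogisticRegression()),
                          ("svm", SVC()),
                          ("tree", DecisionTreeClassifier())], voting="hard")
voter.fit(X, y)

# Homogeneous: same model, each copy trained on a bootstrap sample of the data
bagger = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50).fit(X, y)
```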

Summary

Machine learning aims at extracting knowledge from large amounts of data by learning from known examples to determine an underlying structure in the data. The emphasis is on generating structure or predictions without human intervention. An elementary way to think of ML algorithms is to "find the pattern, apply the pattern."

Supervised learning depends on having labeled training data as well as matched sets of observed inputs (X's, or features) and the associated output (Y, or target). It can be divided into two categories: regression and classification. If the target variable to be predicted is continuous, then the task is one of regression. If the target variable is categorical or ordinal (e.g., determining a firm's rating), then it is a classification problem.

With unsupervised learning, algorithms are trained with no labeled data, so they must infer relations between features, summarize them, or present an interesting underlying structure in their distributions that has not been explicitly provided. Two important types of problems well suited to unsupervised ML are dimension reduction and clustering.

Another category of ML algorithm includes deep learning (based on neural networks), in which a computer learns from interacting with itself. Sophisticated algorithms address such highly complex tasks as image classification, face recognition, speech recognition and natural language processing, and reinforcement learning.

Generalization describes the degree to which an ML model retains its explanatory power when predicting out-of-sample. Overfitting, a primary reason for lack of generalization, is the tendency of ML algorithms to tailor models to the training data at the expense of generalization to new data points.

Bias error is the degree to which a model fits the training data. Variance error describes how much a model's results change in response to new data from validation and test samples. Base error is due to randomness in the data. Out-of-sample error equals bias error plus variance error plus base error.

K-fold cross-validation is a technique for mitigating the holdout sample problem (excessive reduction of the training set size). The data (excluding test sample and fresh data) are shuffled randomly and then divided into k equal sub-samples, with k − 1 samples used as training samples and one sample, the kth, used as a validation sample.

Regularization describes methods that reduce statistical variability in high-dimensional data estimation or prediction problems. LASSO (least absolute shrinkage and selection operator) is a popular type of penalized regression where the penalty term involves summing the absolute values of the regression coefficients. The greater the number of included features, the larger the penalty. So, a feature must make a sufficient contribution to model fit to offset the penalty from including it.

Support vector machine (SVM) is a linear classifier that aims to seek the optimal hyperplane, the one that separates the two sets of data points by the maximum margin (and thus is typically used for classification).

K-nearest neighbor (KNN) is a supervised learning technique most often used for classification. The idea is to classify a new observation by finding similarities ("nearness") between it and its k-nearest neighbors in the existing data set.

Classification and regression tree (CART) can be applied to predict either a categorical target variable, producing a classification tree, or a continuous target variable, producing a regression tree. A binary CART is a combination of an initial root node, decision nodes, and terminal nodes. The root node and each decision node represent a single feature (f) and a cutoff value (c) for that feature. The CART algorithm iteratively partitions the data into sub-groups until terminal nodes are formed that contain the predicted label.

Ensemble learning is a technique of combining the predictions from a collection of models. It typically produces more accurate and more stable predictions than the best single model. A random forest classifier is a collection of many different decision trees generated by a bagging method or by randomly reducing the number of features available during training.

Principal components analysis (PCA) is an unsupervised ML algorithm that reduces highly correlated features into fewer uncorrelated composite variables by transforming the feature covariance matrix. PCA produces eigenvectors that define the principal components (i.e., the new uncorrelated composite variables) and eigenvalues, which give the proportion of total variance in the initial data that is explained by each eigenvector and its associated principal component.

K-means is an unsupervised ML algorithm that partitions observations into a fixed number (k) of non-overlapping clusters. Each cluster is characterized by its centroid, and each observation belongs to the cluster with the centroid to which that observation is closest.

Hierarchical clustering is an unsupervised iterative algorithm that is used to build a hierarchy of clusters. Two main strategies are used to define the intermediary clusters (i.e., those clusters between the initial data set and the final set of clustered data). Agglomerative (bottom-up) hierarchical clustering begins with each observation being its own cluster. Then, the algorithm finds the two closest clusters, defined by some measure of distance, and combines them into a new, larger cluster. This process is repeated until all observations are clumped into a single cluster. Divisive (top-down) hierarchical clustering starts with all observations belonging to a single cluster. The observations are then divided into two clusters based on some measure of distance. The algorithm then progressively partitions the intermediate clusters into smaller clusters until each cluster contains only one observation.

Neural networks consist of nodes connected by links. They have three types of layers: an input layer, hidden layers, and an output layer. Learning takes place in the hidden layer nodes, each of which consists of a summation operator and an activation function. Neural networks have been successfully applied to a variety of investment tasks characterized by non-linearities and complex interactions among variables. Neural networks with many hidden layers (at least 3 but often more than 20) are known as deep learning nets (DLNs) and are the backbone of the artificial intelligence revolution.

The RL algorithm involves an agent that should perform actions that will maximize its rewards over time, taking into consideration the constraints of its environment.

Deep learning nets

NN with many hidden layers: at least 3, but often more than 20, hidden layers

CART

Often applied to binary classification or regression. At each node, chooses the feature and the cutoff value that minimize classification error. Pro: the tree provides a visual explanation for the prediction (see the sketch below).
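
An illustrative classification tree with scikit-learn on the Iris data set (the data set and the depth limit are assumptions for the sake of the example):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3).fit(X, y)   # limit depth to curb overfitting
plot_tree(tree, filled=True)   # each node shows the chosen feature and cutoff value
plt.show()
```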

Regularization

Regularization is a technique used to address the overfitting problem in statistical models. It reduces statistical variability in high-dimensional data estimation.

SVM (Support Vector Machines)

SVM is a linear classifier that creates a boundary (hyperplane) that optimally separates the data into two or more categories. If a data point falls in the wrong category, a penalty is added to the model. Suited for small- to medium-size but complex, high-dimensional data sets. See the sketch below.
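
A sketch of a soft-margin linear SVM with scikit-learn (the C value and the synthetic data are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
svm = SVC(kernel="linear", C=1.0).fit(X, y)   # smaller C tolerates more misclassification
print(svm.score(X, y))                        # in-sample classification accuracy
```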

How to select which ML model to use

Step 1: Is the data complex (multiple correlated features)? If yes, reduce the dimension using PCA. Step 2: Is it a classification or a numerical prediction problem? a. If numerical: use penalized regression if linear, or CART, random forest, or neural networks if non-linear. b. If a classification problem, the choice depends on whether the data is labeled: i. if labeled, use K-nearest neighbor (KNN) or support vector machine (SVM) for linear data, and CART, random forest, or neural networks for non-linear data; ii. if unlabeled, the choice depends on linearity: neural networks for non-linear data, k-means for linear data.

What is the optimal point of model complexity

Where the bias error and variance error curves intersect (minimizing total out-of-sample error). When a model is overfitted, it has high variance error.

Reinforcement learning

An algorithm that involves an agent performing actions that will maximize its rewards over time, taking into consideration the constraints of its environment. RL can be used in a similar way in investment strategies, where the agent could be a virtual trader who follows certain trading rules (the actions) in a specific market (the environment) to maximize its profits (its reward).

Benefit and disadvantage of more nodes

All features are interconnected with non-linear activation functions, which allows a neural network to uncover non-linear relationships. As more nodes and more hidden layers are specified, a neural network's ability to handle complexity tends to increase (but so does the risk of overfitting).

Neural networks

Suitable for non-linear, complex features; used for supervised learning (regression and classification) and also underlie reinforcement learning.

Random forest classifier + usage

Fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control overfitting (an example of ensemble learning). See the sketch below.
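
A sketch with scikit-learn's RandomForestClassifier on synthetic data (the hyperparameters are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
# Many trees on bootstrap sub-samples, with a random subset of
# features ("sqrt" of the total) considered at each split.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt").fit(X, y)
print(rf.score(X, y))
```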

What are the features of a good model + how to detect a bad, overfitted model

A good-fit/robust model fits the training (in-sample) data well and generalizes well to out-of-sample data, both within acceptable degrees of error. An overfitted model shows low in-sample error but high out-of-sample error.

Benefits of ML compared to other statistical methods

Handles problems with many variables (high dimensionality), with large data sets, or with a high degree of non-linearity; captures non-linear relationships.

k-means clustering

An iterative analytics technique that seeks to allocate each observation to the cluster closest to it. Requires specifying k (the number of clusters), a distance measure, and the feature set used to group by similarity. Each cluster is characterized by its centroid (i.e., center), and each observation is assigned by the algorithm to the cluster with the centroid to which that observation is closest. The algorithm starts from an initial set of centroids, then repeatedly reassigns observations and recomputes centroids to minimize the distance between each centroid and the observations in its cluster, until no further minimization is possible. See the sketch below.
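
A k-means sketch with scikit-learn (k = 4 and the blob data are illustrative assumptions):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
km = KMeans(n_clusters=4, n_init=10).fit(X)   # iterate until assignments stop changing
print(km.cluster_centers_)                    # one centroid per cluster
print(km.labels_[:10])                        # each observation's nearest-centroid cluster
```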

K-Nearest Neighbor

Classifies a new observation by finding similarities ("nearness") between this new observation and the existing data. The decision rule is to choose the classification with the largest number of nearest neighbors out of the k being considered. k is a hyperparameter: the number of nearest neighbors considered. Non-parametric: the model makes no assumptions about the distribution of the data. However, it requires defining what "similar" means, which demands knowledge and understanding of the data. See the sketch below.
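
A KNN sketch with scikit-learn on the Iris data set (k = 5 and the data set are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# Majority class among the 5 nearest neighbors wins (Euclidean distance by default).
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict(X[:3]))
```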

Learning curve

Plots the accuracy rate (1 − error rate) against the training sample size; the goal is to have out-of-sample accuracy increase as the training sample size increases. See the sketch below.
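
A sketch using scikit-learn's learning_curve utility (the estimator, data set, and sample sizes are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(max_depth=3), X, y, cv=5,
    train_sizes=np.linspace(0.2, 1.0, 5))       # growing training sample sizes
print(sizes)
print(val_scores.mean(axis=1))                  # out-of-sample accuracy should rise
```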

What is underfitting

The model is not complex enough and does not capture the relationships in the data.

