ML Algorithms


SVM Scoring

A point close to the line returns a value close to zero and may be difficult to classify. If the magnitude of the value is large, the model may have more confidence in the prediction.
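
A rough illustration of this scoring, as a minimal sketch with hypothetical weights w and bias b for a linear SVM: the sign of w·x + b gives the class and the magnitude is a proxy for confidence.

```python
import numpy as np

# Hypothetical learned hyperplane: f(x) = w . x + b (values assumed for illustration).
w = np.array([0.4, -0.9])
b = 0.1

def svm_score(x):
    """Signed score: the sign gives the class, the magnitude suggests confidence."""
    return np.dot(w, x) + b

for point in [np.array([0.1, 0.1]), np.array([3.0, -2.0])]:
    s = svm_score(point)
    label = 1 if s >= 0 else 0
    print(f"point={point}, score={s:+.2f}, predicted class={label}")
```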

Preparing Data for Linear Regression

Linear Assumption. Linear regression assumes that the relationship between your input and output is linear; it does not support anything else. This may be obvious, but it is good to remember when you have a lot of attributes. You may need to transform data to make the relationship linear (e.g. a log transform for an exponential relationship).
Remove Noise. Linear regression assumes that your input and output variables are not noisy. Consider using data cleaning operations that let you better expose and clarify the signal in your data. This is most important for the output variable, and you want to remove outliers in the output variable (y) if possible.
Remove Collinearity. Linear regression will overfit your data when you have highly correlated input variables. Consider calculating pairwise correlations for your input data and removing the most correlated.
Gaussian Distributions. Linear regression will make more reliable predictions if your input and output variables have a Gaussian distribution. You may get some benefit from using transforms (e.g. log or Box-Cox) on your variables to make their distributions more Gaussian-looking.
Rescale Inputs. Linear regression will often make more reliable predictions if you rescale input variables using standardization or normalization.

Preparing data for Naive Bayes

1. Categorical Inputs: Naive Bayes assumes label attributes such as binary, categorical or nominal.
2. Gaussian Inputs: If the input variables are real-valued, a Gaussian distribution is assumed. In that case the algorithm will perform better if the univariate distributions of your data are Gaussian or near-Gaussian. This may require removing outliers (e.g. values that are more than 3 or 4 standard deviations from the mean).
3. Classification Problems: Naive Bayes is a classification algorithm suitable for binary and multiclass classification.
4. Log Probabilities: The calculation of the likelihood of different class values involves multiplying a lot of small numbers together. This can lead to an underflow of numerical precision. As such it is good practice to use a log transform of the probabilities to avoid this underflow (see the sketch after this list).
5. Kernel Functions: Rather than assuming a Gaussian distribution for numerical input values, more complex distributions can be used, such as a variety of kernel density functions.
6. Update Probabilities: When new data becomes available, you can simply update the probabilities of your model. This can be helpful if the data changes frequently.
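
A minimal sketch of the log-probability trick from point 4, using made-up likelihood values: summing logs avoids the numerical underflow that multiplying many small probabilities can cause.

```python
import math

# Assumed per-attribute likelihoods P(d_i | h) and prior P(h) for illustration only.
likelihoods = [1e-5, 3e-4, 2e-6, 5e-3]
prior = 0.5

product_form = prior
for p in likelihoods:
    product_form *= p          # multiplying many small numbers risks underflow

log_form = math.log(prior) + sum(math.log(p) for p in likelihoods)

print("direct product:", product_form)   # tiny number, shrinks toward zero at scale
print("log-space sum :", log_form)       # stays in a safe numeric range
```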

5 Main Nonlinear Algorithms

1. Classification and Regression trees 2. Naive Bayes 3. KNN 4. Learning Vector Quantization 5. SVMs

Parametric Functions Limitations

1. Constrained by choosing a functional form 2. Limited complexity - suited to simpler problems 3. Poor Fit - unlikely in practice to match underlying mapping

Nonparametric Benefits

1. Flexibility: capable of fitting a larger # of functional forms 2. Power: No assumptions about the underlying data 3. Performance: Can result in higher performance

4 main Linear Algorithms

1. Gradient descent 2. Linear regression 3. Logistic regression 4. Linear discriminant analysis

Nonparametric Limitations

1. More data 2. Slower 3. Overfitting: easier to overfit because it is harder to explain why specific predictions are made

Data Prep for SVM

1. Numerical inputs 2. Binary classification

Extensions to LDA

1. Quadratic Discriminant Analysis: Each class uses its own estimate of variance (or covariance when there are multiple input variables). 2. Flexible Discriminant Analysis: Nonlinear combinations of inputs are used, such as splines. 3. Regularized Discriminant Analysis: Introduces regularization into the estimate of the variance (or covariance), moderating the influence of different variables on LDA.

Preparing data for KNN

1. Rescale all variables to the range 0-1 2. Address missing data 3. Lower the dimensionality

Steps for a parametric function

1. Select a form for the function. 2. Learn the coefficients for the function from the training data.

Parametric Functions Benefits

1. Simplifies the mapping to a known functional form, making it interpretable. 2. Models are very fast to learn. 3. They do not require as much training data.

LDA Assumptions

1. That your data is Gaussian. 2. That each attribute has the same variance, i.e. that values of each variable vary around the mean by the same amount on average.

Limitations of Logistic Regression

1. Two-Class Problems. Logistic regression is intended for two-class or binary classification problems. It can be extended for multiclass classification, but is rarely used for this purpose. 2. Unstable With Well-Separated Classes. Logistic regression can become unstable when the classes are well separated. 3. Unstable With Few Examples. Logistic regression can become unstable when there are few examples from which to estimate the parameters. An unstable model here refers to anything on the spectrum from poor results to a complete failure of the model.

Techniques to reduce overfitting

1. Use a resampling technique to estimate model accuracy 2. Hold back a validation dataset

Parametric Functions

A learning model that summarizes data with a set of parameters of fixed size (independent of the number of training examples) is called a parametric model.

AdaBoost

AdaBoost is best used to boost the performance of decision trees on binary classification problems. AdaBoost can be used to boost the performance of any machine learning algorithm, but it is best used with weak learners. The number of weak learners to use is specified by the user.
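
A minimal sketch, assuming scikit-learn is available: AdaBoostClassifier boosts decision stumps by default, and the number of weak learners is the user-supplied n_estimators; the dataset here is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic binary classification problem for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 50 weak learners (decision stumps by default), boosted sequentially.
model = AdaBoostClassifier(n_estimators=50, random_state=0)
model.fit(X, y)
print("training accuracy:", model.score(X, y))
```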

Association Problems

An association rule learning problem is where you want to discover rules that describe large portions of your data, such as people that buy A also tend to buy B. The Apriori algorithm is commonly used for association rule learning problems.

Curse of dimensionality

As the number of dimensions increases, the volume of the input space increases at an exponential rate. In high dimensions, points that are otherwise similar may be separated by very large distances.

Bagging

Bagging is the application of the Bootstrap procedure to a high-variance machine learning algorithm, typically decision trees. Let's assume we have a dataset of 1000 instances and we are using the CART algorithm. Bagging of the CART algorithm would work as follows. 1. Create many (e.g. 100) random subsamples of our dataset with replacement. 2. Train a CART model on each sample. 3. Given a new dataset, calculate the average prediction from each model.
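
A minimal sketch of those three steps, assuming scikit-learn's DecisionTreeRegressor as the CART model and a synthetic dataset of 1000 instances:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Hypothetical dataset of 1000 instances with a single noisy feature.
X = rng.uniform(-3, 3, size=(1000, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=1000)

# 1. Create many random subsamples with replacement; 2. train a CART model on each.
models = []
for _ in range(100):
    idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample (with replacement)
    tree = DecisionTreeRegressor()               # high-variance base learner
    tree.fit(X[idx], y[idx])
    models.append(tree)

# 3. Given new data, average the predictions from each model.
X_new = np.array([[0.0], [1.5]])
bagged_pred = np.mean([m.predict(X_new) for m in models], axis=0)
print(bagged_pred)
```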

Bias

Bias refers to the simplifying assumptions made by a model to make the target function easier to learn. Generally, parametric algorithms have a high bias, making them fast to learn and easier to understand, but generally less flexible.

Boosting

Boosting is an ensemble technique that attempts to create a strong classifier from a number of weak classifiers.

Preparing data for CART

CART does not require any special data preparation other than a good representation of the problem.

Preparing data for LDA

Classification Problems. This might go without saying, but LDA is intended for classification problems where the output variable is categorical. LDA supports both binary and multiclass classification.
Gaussian Distribution. The standard implementation of the model assumes a Gaussian distribution of the input variables. Consider reviewing the univariate distributions of each attribute and using transforms to make them more Gaussian-looking (e.g. log and root for exponential distributions and Box-Cox for skewed distributions).
Remove Outliers. Consider removing outliers from your data. These can skew the basic statistics used to separate classes in LDA, such as the mean and the standard deviation.
Same Variance. LDA assumes that each input variable has the same variance. It's almost always a good idea to standardize your data before using LDA so that it has a mean of 0 and a standard deviation of 1.

Ensemble Algorithms (definition & what they do)

Combine the predictions from multiple models in order to provide more accurate predictions. 1. Bagging/Random Forests 2. Boosting ensemble/AdaBoost

How a CART Learns

Creating a binary decision tree is actually a process of dividing up the input space. A greedy approach called recursive binary splitting is used to divide the space. This is a numerical procedure where all the values are lined up and different split points are tried and tested using a cost function. The split with the best cost (the lowest cost, because we minimize cost) is selected.

Most common adaboost

Decision trees with one level (otherwise known as decision stumps).

Gradient descent - definition

Gradient descent is an optimization algorithm used to find the values of parameters (coefficients) of a function (f) that minimizes a cost function (cost). Gradient descent is best used when the parameters cannot be calculated analytically (e.g. using linear algebra) and must be searched for by an optimization algorithm. Common examples of algorithms with coefficients that can be optimized using gradient descent are Linear Regression and Logistic Regression.
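
A minimal sketch of gradient descent fitting a simple linear regression y ≈ b0 + b1*x by minimizing mean squared error; the data and learning rate are made up.

```python
import numpy as np

# Assumed toy dataset and learning rate for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.1, 4.2, 4.8])

b0, b1 = 0.0, 0.0
learning_rate = 0.01

for _ in range(2000):
    pred = b0 + b1 * x
    error = pred - y
    # Partial derivatives of the MSE cost with respect to each coefficient.
    grad_b0 = 2.0 * error.mean()
    grad_b1 = 2.0 * (error * x).mean()
    # Step each coefficient downhill along its gradient.
    b0 -= learning_rate * grad_b0
    b1 -= learning_rate * grad_b1

print(f"b0={b0:.3f}, b1={b1:.3f}")
```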

High & Low Variance Algs

High: Decision trees (higher if not pruned), KNN, SVM. Low: Linear Regression, LDA, Logistic Regression.

High & Low bias algorithms

High: Linear Regression, LDA, Logistic Regression. Low: Decision trees, KNN, SVM.

Linear Discriminant Analysis (LDA)

If you have more than two classes, then Linear Discriminant Analysis is the preferred linear classification technique. LDA makes predictions by estimating the probability that a new set of inputs belongs to each class, using Bayes' Theorem to estimate the probabilities. The representation of LDA is straightforward. It consists of statistical properties of your data, calculated for each class. For a single input variable (x) this is the mean and the variance of the variable for each class. For multiple variables, the same properties are calculated over the multivariate Gaussian, namely the means and the covariance matrix (a multi-dimensional generalization of variance). These statistical properties are estimated from your data and plugged into the LDA equation to make predictions. These are the model values that you would save to file for your model.

Linear Regression most common method

In general: Ordinary Least Squares. In machine learning: Gradient Descent.

Soft Margin Classifier

In practice, real data is messy and cannot be separated perfectly with a hyperplane. The constraint of maximizing the margin of the line that separates the classes must be relaxed. This is often called the soft margin classifier. An additional set of coefficients is introduced that gives the margin wiggle room in each dimension. These coefficients are sometimes called slack variables.

Stochastic Gradient Descent

In situations when you have large amounts of data, you can use a variation of gradient descent called stochastic gradient descent. In this variation, the gradient descent procedure described above is run, but the update to the coefficients is performed for each training instance rather than at the end of the batch of instances. The dataset order must be randomized.
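
A minimal sketch of the stochastic variant on the same toy problem: the coefficients are updated after every training instance and the dataset order is reshuffled each pass (data and learning rate are again assumed).

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.1, 4.2, 4.8])

b0, b1 = 0.0, 0.0
learning_rate = 0.01

for epoch in range(50):
    order = rng.permutation(len(x))      # randomize the dataset order each pass
    for i in order:
        error = (b0 + b1 * x[i]) - y[i]
        b0 -= learning_rate * error          # update after each instance,
        b1 -= learning_rate * error * x[i]   # not at the end of the batch

print(f"b0={b0:.3f}, b1={b1:.3f}")
```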

Logistic Regression

Logistic regression is the go-to method for binary classification problems (problems with two class values). It is named for the function used at the core of the method, the logistic function. The logistic function, also called the sigmoid function, was developed by statisticians to describe properties of population growth in ecology, rising quickly and maxing out at the carrying capacity of the environment. It's an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, but never exactly at those limits. Logistic regression is a linear method, but the predictions are transformed using the logistic function.
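
A minimal sketch of the logistic (sigmoid) transform with hypothetical coefficients b0 and b1: the linear combination of inputs is squashed into a probability between 0 and 1.

```python
import math

def sigmoid(z):
    """S-shaped curve mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Assumed learned intercept and slope for a single input.
b0, b1 = -2.0, 0.8

for x in (-1.0, 2.5, 6.0):
    p = sigmoid(b0 + b1 * x)
    print(f"x={x:+.1f} -> P(class=1)={p:.3f}")
```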

K-Nearest Neighbors (KNN)

KNN makes predictions using the training dataset directly. Predictions are made for a new data point by searching through the entire training set for the k most similar instances (the neighbors) and summarizing the output variable for those k instances. For regression this might be the mean output variable; in classification this might be the mode (or most common) class value. The most common distance function is Euclidean distance (especially when the input variables are real-valued and similar in type), but others can be used. Manhattan distance is a good choice if the input variables are not similar in type.
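
A minimal sketch of KNN classification with Euclidean distance on a made-up training set: the prediction is the most common class among the k nearest neighbors.

```python
import numpy as np

# Assumed toy training data: two clusters with labels 0 and 1.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8], [4.9, 5.3]])
y_train = np.array([0, 0, 1, 1, 1])

def knn_predict(x_new, k=3):
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distance to each instance
    nearest = y_train[np.argsort(distances)[:k]]          # labels of the k closest instances
    return np.bincount(nearest).argmax()                  # mode (most common) class

print(knn_predict(np.array([4.8, 5.1])))   # lands near the second cluster -> class 1
```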

Gaussian Naive Bayes

Naive Bayes can be extended to real-valued attributes, most commonly by assuming a Gaussian distribution. This extension of naive Bayes is called Gaussian Naive Bayes. Other functions can be used to estimate the distribution of the data, but the Gaussian (or Normal distribution) is the easiest to work with because you only need to estimate the mean and the standard deviation from your training data.
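
A minimal sketch of the Gaussian PDF used as a likelihood, with assumed per-class mean and standard deviation for a single feature:

```python
import math

def gaussian_pdf(x, mean, std):
    """Gaussian probability density, parameterized only by a mean and standard deviation."""
    exponent = math.exp(-((x - mean) ** 2) / (2 * std ** 2))
    return exponent / (math.sqrt(2 * math.pi) * std)

# Hypothetical per-class statistics estimated from training data.
stats = {"class_0": (2.0, 0.5), "class_1": (4.0, 1.0)}

x = 3.2
for label, (mean, std) in stats.items():
    print(label, gaussian_pdf(x, mean, std))   # likelihood P(x | class)
```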

Naive Bayes

Naive Bayes is a classification algorithm for binary (two-class) and multiclass classification problems. The technique is easiest to understand when described using binary or categorical input values. It is called naive Bayes or idiot Bayes because the calculation of the probabilities for each hypothesis is simplified to make it tractable. Rather than attempting to calculate the joint probability of the attribute values P(d1, d2, d3 | h), they are assumed to be conditionally independent given the target value and calculated as P(d1 | h) * P(d2 | h) and so on. This is a very strong assumption that is most unlikely in real data, i.e. that the attributes do not interact. Nevertheless, the approach performs surprisingly well on data where this assumption does not hold. Bayes' Theorem provides a way that we can calculate the probability of a hypothesis given our prior knowledge. Bayes' Theorem is stated as: P(h|d) = [P(d|h) * P(h)] / P(d). After calculating the posterior probability P(h|d) for a number of different hypotheses, you can select the hypothesis with the highest probability. This is the maximum probable hypothesis and may formally be called the maximum a posteriori (MAP) hypothesis.

Nonparametric Functions

Nonparametric methods are good when you have a lot of data and no prior knowledge, and when you don't want to worry too much about choosing just the right features.

Batch gradient descent

One iteration of the algorithm is called one batch and this form of gradient descent is referred to as batch gradient descent. Batch gradient descent is the most common form of gradient descent described in machine learning.

Kernel Trick

Other kernels can be used that transform the input space into higher dimensions, such as a Polynomial Kernel and a Radial Kernel. This is called the Kernel Trick. It is desirable to use more complex kernels because they allow the lines that separate the classes to be curved or even more complex, which in turn can lead to more accurate classifiers.
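
A minimal sketch of three common kernel functions (parameter values assumed): each replaces the plain inner product of two observations and implicitly works in a higher-dimensional space.

```python
import numpy as np

def linear_kernel(a, b):
    # Plain inner product used by linear SVM.
    return np.dot(a, b)

def polynomial_kernel(a, b, degree=2, coef0=1.0):
    # Implicitly maps inputs to a space of polynomial feature combinations.
    return (np.dot(a, b) + coef0) ** degree

def radial_kernel(a, b, gamma=0.5):
    # RBF kernel: similarity decays with squared distance between the points.
    return np.exp(-gamma * np.sum((a - b) ** 2))

u, v = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(u, v), polynomial_kernel(u, v), radial_kernel(u, v))
```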

Tips for Gradient Descent

Plot Cost versus Time: Collect and plot the cost values calculated by the algorithm on each iteration. The expectation for a well-performing gradient descent run is a decrease in cost on each iteration. If it does not decrease, try reducing your learning rate.
Learning Rate: The learning rate value is a small real value such as 0.1, 0.001 or 0.0001. Try different values for your problem and see which works best.
Rescale Inputs: The algorithm will reach the minimum cost faster if the shape of the cost function is not skewed and distorted. You can achieve this by rescaling all of the input variables (X) to the same range, such as between 0 and 1.
Few Passes: Stochastic gradient descent often does not need more than 1-to-10 passes through the training dataset to converge on good or good enough coefficients.
Plot Mean Cost: The updates for each training dataset instance can result in a noisy plot of cost over time when using stochastic gradient descent. Taking the average over 10, 100, or 1000 updates can give you a better idea of the learning trend for the algorithm.

NB: P(h|d)

Probability of hypothesis h given the data d. This is called the posterior probability.

Semi-supervised Learning

Problems where you have a large amount of input data (X) and only some of the data is labeled (Y) are called semi-supervised learning problems. They arise because labeling data can be expensive and time-consuming and may require domain experts.

Supervised Learning

Supervised learning is where you have input variables (X) and an output variable (Y) and you use an algorithm to learn the mapping function from the input to the output. Supervised learning problems can be further grouped into regression and classification problems.

Overfitting vs underfitting

Overfitting refers to learning the training data too well at the expense of not generalizing well to new data. Underfitting refers to failing to learn the problem from the training data sufficiently.

F1 Score

The F1 score is the harmonic average of the precision and recall, where an F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0.
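
A minimal sketch computing precision, recall and F1 from assumed confusion-matrix counts:

```python
# Assumed counts of true positives, false positives and false negatives.
tp, fp, fn = 40, 10, 5

precision = tp / (tp + fp)   # fraction of retrieved instances that are relevant
recall = tp / (tp + fn)      # fraction of relevant instances that were retrieved
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(f"precision={precision:.3f}, recall={recall:.3f}, F1={f1:.3f}")
```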

SVM

The Maximal-Margin Classifier is a hypothetical classifier that best explains how SVM works in practice. The numeric input variables (x) in your data (the columns) form an n-dimensional space. For example, if you had two input variables, this would form a two-dimensional space. A hyperplane is a line that splits the input variable space. In SVM, a hyperplane is selected to best separate the points in the input variable space by their class, either class 0 or class 1.

Pruning a CART

The complexity of a decision tree is defined as the number of splits in the tree. Simpler trees are preferred. They are easy to understand (you can print them out and show them to subject matter experts), and they are less likely to overfit your data. The fastest and simplest pruning method is to work through each leaf node in the tree and evaluate the effect of removing it using a hold-out test set. Leaf nodes are removed only if removing them results in a drop in the overall cost function on the entire test set.

SVM Kernel

The learning of the hyperplane in linear SVM is done by transforming the problem using some linear algebra. A powerful insight is that the linear SVM can be rephrased using the inner product of any two given observations, rather than the observations themselves.

KNN model representation

The model representation for KNN is the entire training dataset. It is as simple as that. KNN has no model other than storing the entire dataset, so there is no learning required. Efficient implementations can store the data using complex data structures like k-d trees to make look-up and matching of new patterns during prediction efficient.

CART Stopping Criterion

The recursive binary splitting procedure described above needs to know when to stop splitting as it works its way down the tree with the training data. The most common stopping procedure is to use a minimum count on the number of training instances assigned to each leaf node, e.g. 5 or 10. Too specific a count (e.g. a count of 1) and the tree will overfit the training data and likely have poor performance on the test set.

Classification and Regressions Trees (CART)

The representation for the CART model is a binary tree. This is your binary tree from algorithms and data structures, nothing too fancy. Each node represents a single input variable (x) and a split point on that variable (assuming the variable is numeric). The leaf nodes of the tree contain an output variable (y) which is used to make a prediction. Given a new input, the tree is traversed by evaluating the specific input, starting at the root node of the tree. A learned binary tree is actually a partitioning of the input space. You can think of each input variable as a dimension in a p-dimensional space. The decision tree splits this up into rectangles (when p = 2 input variables) or hyper-rectangles with more inputs. New data is filtered through the tree and lands in one of the rectangles, and the output value for that rectangle is the prediction made by the model.

Small vs. Large C

The smaller the value of C, the more sensitive the algorithm is to the training data (higher variance and lower bias). The larger the value of C, the less sensitive the algorithm is to the training data (lower variance and higher bias).

Bias-Variance Trade-off

There is a trade-off at play between these two concerns, and the algorithms you choose and the way you choose to configure them find different balances in this trade-off for your problem. In reality we cannot calculate the real bias and variance error terms because we do not know the actual underlying target function. Nevertheless, as a framework, bias and variance provide the tools to understand the behavior of machine learning algorithms in the pursuit of predictive performance.

Unsupervised Learning

Unsupervised learning is where you only have input data (X) and no corresponding output variables. The goal for unsupervised learning is to model the underlying structure or distribution in the data in order to learn more about the data. Unsupervised learning problems can be further grouped into clustering and association problems.

Variance

Variance is the amount that the estimate of the target function will change if different training data is used. Generally, nonparametric machine learning algorithms that have a lot of flexibility have a high variance.

Bootstrapping

We know that our sample is small and that our mean has error in it. We can improve the estimate of our mean using the bootstrap procedure: 1. Create many (e.g. 1000) random subsamples of our dataset with replacement (meaning we can select the same value multiple times). 2. Calculate the mean of each subsample. 3. Calculate the average of all of our collected means and use that as our estimated mean for the data. This can be used to estimate other quantities like standard deviation and learned coefficients.
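
A minimal sketch of the three bootstrap steps on a made-up small sample:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50.0, scale=10.0, size=30)   # assumed small sample

# 1. Create many random subsamples with replacement; 2. calculate the mean of each.
means = []
for _ in range(1000):
    subsample = rng.choice(data, size=len(data), replace=True)
    means.append(subsample.mean())

# 3. Average the collected means to get the estimated mean (the spread also
#    gives a sense of the uncertainty in that estimate).
print("sample mean   :", data.mean())
print("bootstrap mean:", np.mean(means))
print("bootstrap std of the mean:", np.std(means))
```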

Important characteristics of submodels when combining predictions using bagging

When bagging with decision trees, we are less concerned about individual trees overfitting the training data. For this reason and for efficiency, the individual decision trees are grown deep (e.g. few training samples at each leaf node of the tree) and the trees are not pruned. These trees will have both high variance and low bias.

Learning Vector Quantization (LVQ)

An artificial neural network algorithm that allows you to choose how many training instances to hang onto and learns exactly what those instances should look like. It is best understood as a classification algorithm.

Tuning Parameter

The tuning parameter is simply called C and defines the magnitude of the wiggle allowed across all dimensions. The C parameter defines the amount of violation of the margin allowed. C = 0 means no violation, and we are back to the inflexible Maximal-Margin Classifier described above. The larger the value of C, the more violations of the hyperplane are permitted.

Coefficient determination in Logistic Regression

Coefficients in logistic regression are estimated using a process called maximum-likelihood estimation.

Precision

precision (also called positive predictive value) is the fraction of relevant instances among the retrieved instances.

Accuracy paradox

Predictive models with a given level of accuracy may have greater predictive power than models with higher accuracy. This can happen, for example, with imbalanced classes, where simply predicting the majority class yields high accuracy but little real predictive power.

NB: P(d|h)

probability of data d given that the hypothesis h was true.

NB: P(d)

probability of the data (regardless of the hypothesis).

Recall

recall (also known as sensitivity) is the fraction of relevant instances that have been retrieved over the total amount of relevant instances.

Preparing data for AdaBoost

Some heuristics for best preparing your data for AdaBoost:
Quality Data: Because the ensemble method continues to attempt to correct misclassifications in the training data, you need to be careful that the training data is of high quality.
Outliers: Outliers will force the ensemble down the rabbit hole of working hard to correct for cases that are unrealistic. These could be removed from the training dataset.
Noisy Data: Noisy data, specifically noise in the output variable, can be problematic. If possible, attempt to isolate and clean these from your training dataset.

Random Forest (how it is different from bagged tree)

In a standard bagged decision tree, the learning algorithm is allowed to search through all variables and all variable values in order to select the optimal split point. The random forest algorithm changes this procedure so that the learning algorithm is limited to a random sample of features to search. The number of features that can be searched at each split point (m) must be specified as a parameter to the algorithm. You can try different values and tune it using cross-validation. For classification a good default is m = sqrt(p). For regression a good default is m = p/3. Here m is the number of randomly selected features that can be searched at a split point and p is the number of input variables.
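
A minimal sketch of the defaults and the per-split feature sampling, with an assumed feature count of p = 16:

```python
import math
import numpy as np

p = 16                                   # assumed total number of input variables
m_classification = int(math.sqrt(p))     # default for classification: m = sqrt(p)
m_regression = max(1, p // 3)            # default for regression: m = p/3

# At each split point, only a random subset of m features is searched.
rng = np.random.default_rng(0)
candidate_features = rng.choice(p, size=m_classification, replace=False)
print("m (classification):", m_classification, " m (regression):", m_regression)
print("features searched at this split:", sorted(candidate_features))
```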

NB: P(h)

the probability of hypothesis h being true (regardless of the data). This is called the prior probability of h.

