ML Algorithms
SVM Scoring
A point close to the separating line returns a value close to zero and may be difficult to classify. If the magnitude of the value is large, the model may have more confidence in the prediction.
Preparing Data for Linear Regression
Linear Assumption. Linear regression assumes that the relationship between your input and output is linear. It does not support anything else. This may be obvious, but it is good to remember when you have a lot of attributes. You may need to transform data to make the relationship linear (e.g. log transform for an exponential relationship). Remove Noise. Linear regression assumes that your input and output variables are not noisy. Consider using data cleaning operations that let you better expose and clarify the signal in your data. This is most important for the output variable and you want to remove outliers in the output variable (y) if possible. Remove Collinearity. Linear regression will overfit your data when you have highly correlated input variables. Consider calculating pairwise correlations for your input data and removing the most correlated. Gaussian Distributions. Linear regression will make more reliable predictions if your input and output variables have a Gaussian distribution. You may get some benefit using transforms (e.g. log or Box-Cox) on your variables to make their distribution more Gaussian looking. Rescale Inputs. Linear regression will often make more reliable predictions if you rescale input variables using standardization or normalization.
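A minimal sketch of two of these preparations (a log transform followed by standardization), assuming NumPy; the array values are made up for illustration.

import numpy as np

# Hypothetical feature with an exponential-looking relationship to the target.
x = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])

# Log transform to make the relationship more linear.
x_log = np.log(x)

# Standardization: rescale to zero mean and unit standard deviation.
x_std = (x_log - x_log.mean()) / x_log.std()
print(x_std)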
Preparing data for Naive Bayes
1. Categorical Inputs: Naive Bayes assumes label attributes such as binary, categorical or nominal. 2. Gaussian Inputs: If the input variables are real-valued, a Gaussian distribution is assumed. In which case the algorithm will perform better if the univariate distributions of your data are Gaussian or near-Gaussian. This may require removing outliers (e.g. values that are more than 3 or 4 standard deviations from the mean). 3. Classification Problems: Naive Bayes is a classification algorithm suitable for binary and multiclass classification. 4. Log Probabilities: The calculation of the likelihood of different class values involves multiplying a lot of small numbers together. This can lead to an underflow of numerical precision. As such it is good practice to use a log transform of the probabilities to avoid this underflow. 5. Kernel Functions: Rather than assuming a Gaussian distribution for numerical input values, more complex distributions can be used such as a variety of kernel density functions. 6. Update Probabilities: When new data becomes available, you can simply update the probabilities of your model. This can be helpful if the data changes frequently.
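A small illustration of point 4 (log probabilities); the probability values are invented for the example.

import math

# Many small conditional probabilities, as in naive Bayes with many attributes.
probs = [1e-5] * 80

# Direct multiplication underflows to 0.0 in double precision.
product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0 due to underflow

# Summing log probabilities keeps the comparison between classes usable.
log_sum = sum(math.log(p) for p in probs)
print(log_sum)  # a large negative number, safe to compare across classes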
5 Main Nonlinear Algorithms
1. Classification and Regression trees 2. Naive Bayes 3. KNN 4. Learning Vector Quantization 5. SVMs
Parametric Functions Limitations
1. Constrained by choosing a functional form 2. Limited complexity - suited to simpler problems 3. Poor Fit - unlikely in practice to match underlying mapping
Nonparametric Benefits
1. Flexibility: capable of fitting a larger # of functional forms 2. Power: No assumptions about the underlying data 3. Performance: Can result in higher performance
4 main Linear Algorithms
1. Gradient descent 2. Linear regression 3. Logistic regression 4. Linear discriminant analysis
Nonparametric Limitations
1. More data 2. Slower 3. Overfitting: easier to overfit because it is harder to explain why specific predictions are made
Data Prep for SVM
1. Numerical Inputs 2. Binary classification
Extensions to LDA
1. Quadratic Discriminant Analysis: Each class uses its own estimate of variance (or covariance when there are multiple input variables). 2. Flexible Discriminant Analysis: Where nonlinear combinations of inputs are used, such as splines. 3. Regularized Discriminant Analysis: Introduces regularization into the estimate of the variance (or covariance), moderating the influence of different variables on LDA.
Preparing data for KNN
1. Rescale all between 0-1 2. Address missing data 3. Lower the dimensionality
Steps for a parametric function
1. Select a form for the function. 2. Learn the coefficients for the function from the training data.
Parametric Functions Benefits
1. Simplifies the mapping to a known functional form - interpretable. 2. Models are very fast to learn. 3. Do not require as much training data.
LDA Assumptions
1. That your data is Gaussian 2. That each attribute has the same variance, that values of each variable vary around the mean by the same amount on average.
Limitations of Logistic Regression
1. Two-Class Problems. Logistic regression is intended for two-class or binary classification problems. It can be extended for multiclass classification, but is rarely used for this purpose. 2. Unstable With Well Separated Classes. Logistic regression can become unstable when the classes are well separated. 3. Unstable With Few Examples. Logistic regression can become unstable when there are few examples from which to estimate the parameters. An unstable model here refers to anything on the spectrum from poor results to a complete failure to fit the model.
Techniques to reduce overfitting
1. Use a resampling technique to estimate model accuracy 2. Hold back a validation dataset
Parametric Functions
A learning model that summarizes data with a set of parameters of fixed size (independent of the number of training examples) is called a parametric model.
AdaBoost
AdaBoost is best used to boost the performance of decision trees on binary classification problems. It can be used to boost the performance of any machine learning algorithm, but it works best with weak learners. The number of weak learners is a parameter set by the user.
Association Problems
An association rule learning problem is where you want to discover rules that describe large portions of your data, such as people that buy A also tend to buy B. The Apriori algorithm is commonly used for association rule learning problems.
Curse of dimensionality
As the number of dimensions increases, the volume of the input space increases at an exponential rate. In high dimensions, points that may be similar may have very large distances.
Bagging
Bagging is the application of the Bootstrap procedure to a high-variance machine learning algorithm, typically decision trees. Let's assume we have a dataset of 1000 instances and we are using the CART algorithm. Bagging of the CART algorithm would work as follows. 1. Create many (e.g. 100) random subsamples of our dataset with replacement. 2. Train a CART model on each sample. 3. Given a new dataset, calculate the average prediction from each model.
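A sketch of the three bagging steps above, assuming NumPy and scikit-learn's decision tree as the base model and binary 0/1 class labels; function name and sizes are placeholders.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_predict(X_train, y_train, X_new, n_models=100, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(X_train)
    models = []
    for _ in range(n_models):
        # 1. Random subsample of the dataset with replacement (a bootstrap sample).
        idx = rng.integers(0, n, size=n)
        # 2. Train a CART model on the sample.
        models.append(DecisionTreeClassifier().fit(X_train[idx], y_train[idx]))
    # 3. Average the predictions across models (majority vote for 0/1 labels).
    votes = np.array([m.predict(X_new) for m in models])
    return np.round(votes.mean(axis=0))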
Bias
Bias is the set of simplifying assumptions made by a model to make the target function easier to learn. Generally, parametric algorithms have a high bias, making them fast to learn and easier to understand but generally less flexible.
Boosting
Boosting is an ensemble technique that attempts to create a strong classifier from a number of weak classifiers.
Preparing data for CART
CART does not require any special data preparation other than a good representation of the problem.
Preparing data for LDA
Classification Problems. This might go without saying, but LDA is intended for classification problems where the output variable is categorical. LDA supports both binary and multiclass classification. Gaussian Distribution. The standard implementation of the model assumes a Gaussian distribution of the input variables. Consider reviewing the univariate distributions of each attribute and using transforms to make them more Gaussian-looking (e.g. log and root for exponential distributions and Box-Cox for skewed distributions). Remove Outliers. Consider removing outliers from your data. These can skew the basic statistics used to separate classes in LDA such as the mean and the standard deviation. Same Variance. LDA assumes that each input variable has the same variance. It's almost always a good idea to standardize your data before using LDA so that it has a mean of 0 and a standard deviation of 1.
Ensemble Algorithms (definition & what they do)
Combine the predictions from multiple models in order to provide more accurate predictions. 1. Bagging/Random Forests 2. Boosting ensemble/AdaBoost
How a CART Learns
Creating a binary decision tree is actually a process of dividing up the input space. A greedy approach called recursive binary splitting is used to divide the space. This is a numerical procedure where all the values are lined up and different split points are tried and tested using a cost function. The split with the best cost (lowest cost, because we minimize cost) is selected.
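A sketch of evaluating one candidate split with a Gini index cost, as commonly used in classification trees; the helper and the data rows are hypothetical.

def gini(groups, classes):
    # Weighted Gini index across the groups produced by a candidate split.
    n_total = sum(len(g) for g in groups)
    score = 0.0
    for group in groups:
        if not group:
            continue
        size = len(group)
        labels = [row[-1] for row in group]
        p_sq = sum((labels.count(c) / size) ** 2 for c in classes)
        score += (1.0 - p_sq) * (size / n_total)
    return score

# Candidate split: each row ends with its class label (0 or 1).
left = [[2.7, 0], [1.3, 0]]
right = [[7.5, 1], [9.0, 1], [3.0, 0]]
print(gini([left, right], classes=[0, 1]))  # lower cost = better split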
Most common adaboost
Decision trees with one level (otherwise known as decision stumps).
Gradient descent - definition
Gradient descent is an optimization algorithm used to find the values of the parameters (coefficients) of a function (f) that minimize a cost function. Gradient descent is best used when the parameters cannot be calculated analytically (e.g. using linear algebra) and must be searched for by an optimization algorithm. Common examples of algorithms with coefficients that can be optimized using gradient descent are Linear Regression and Logistic Regression.
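A minimal sketch of batch gradient descent for simple linear regression (one coefficient plus an intercept); the learning rate and data are illustrative.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.1, 4.2, 4.8])

b0, b1 = 0.0, 0.0   # coefficients to learn
alpha = 0.01        # learning rate
for _ in range(2000):
    error = (b0 + b1 * x) - y
    # Gradient step for the squared-error cost, using the whole batch.
    b0 -= alpha * error.mean()
    b1 -= alpha * (error * x).mean()

print(b0, b1)  # should approach the least-squares fit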
High & Low Variance Algs
High: Decision trees (higher if not pruned), KNN, SVM Low: Linear Reg, LDA, Logistic Regression
High & Low bias algorithms
High: Linear Reg, LDA, Logistic Reg Low: Decision trees, KNN, SVM
Linear Discriminant Analysis (LDA)
If you have more than two classes then Linear Discriminant Analysis is the preferred linear classification technique. LDA makes predictions by estimating the probability that a new set of inputs belongs to each class, using Bayes' Theorem to estimate the probabilities. The representation of LDA is straightforward. It consists of statistical properties of your data, calculated for each class. For a single input variable (x) this is the mean and the variance of the variable for each class. For multiple variables, the same properties are calculated over the multivariate Gaussian, namely the means and the covariance matrix (a multi-dimensional generalization of variance). These statistical properties are estimated from your data and plugged into the LDA equation to make predictions. They are the model values that you would save to file for your model.
Linear Regression most common method
In general: Ordinary Least Squares. In machine learning: Gradient Descent.
Soft Margin Classifier
In practice, real data is messy and cannot be separated perfectly with a hyperplane. The constraint of maximizing the margin of the line that separates the classes must be relaxed. This is often called the soft margin classifier. An additional set of coefficients is introduced that gives the margin wiggle room in each dimension. These coefficients are sometimes called slack variables.
Stochastic Gradient Descent
In situations when you have large amounts of data, you can use a variation of gradient descent called stochastic gradient descent. In this variation, the gradient descent procedure described above is run but the update to the coefficients is performed for each training instance, rather than at the end of the batch of instances. Dataset order must be randomized.
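The same linear-regression update as in the gradient descent sketch above, but applied per training instance with the order shuffled each pass; the data and settings are made up.

import numpy as np

rng = np.random.default_rng(1)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.1, 4.2, 4.8])

b0, b1 = 0.0, 0.0
alpha = 0.01
for _ in range(50):                    # a few passes are often enough
    order = rng.permutation(len(x))    # randomize dataset order each pass
    for i in order:
        error = (b0 + b1 * x[i]) - y[i]
        # Update the coefficients after every single training instance.
        b0 -= alpha * error
        b1 -= alpha * error * x[i]
print(b0, b1)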
Logistic Regression
Logistic regression is the go-to method for binary classification problems (problems with two class values). It is named for the function used at the core of the method, the logistic function. The logistic function, also called the sigmoid function, was developed by statisticians to describe properties of population growth in ecology, rising quickly and maxing out at the carrying capacity of the environment. It's an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, but never exactly at those limits. Logistic regression is a linear method, but the predictions are transformed using the logistic function.
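A sketch of the logistic (sigmoid) function and how a linear combination of inputs is pushed through it; the coefficients here are made up.

import math

def sigmoid(z):
    # Maps any real value into (0, 1), never reaching the limits exactly.
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical learned coefficients for a single input variable x.
b0, b1 = -4.0, 1.5
x = 3.0
p = sigmoid(b0 + b1 * x)       # predicted probability of class 1
label = 1 if p >= 0.5 else 0
print(p, label)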
K-Nearest Neighbors (KNN)
KNN makes predictions using the training dataset directly. Predictions are made for a new data point by searching through the entire training set for the k most similar instances (the neighbors) and summarizing the output variable for those k instances. For regression this might be the mean output variable; in classification this might be the mode (most common) class value. The most common distance function is Euclidean distance (especially when the input variables are real-valued and similar in type), but others can be used. Manhattan distance is a good choice if the input variables are not similar in type.
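A minimal KNN classification sketch using Euclidean distance, assuming NumPy; the tiny dataset and the value of k are only for illustration.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the new point to every training instance.
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    # Classification: take the mode (most common class) of the k neighbors.
    return Counter(y_train[nearest]).most_common(1)[0][0]

X = np.array([[1.0, 1.0], [1.2, 0.8], [6.0, 6.0], [5.5, 6.2]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([1.1, 0.9])))  # -> 0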
Gaussian Naive Bayes
Naive Bayes can be extended to real-valued attributes, most commonly by assuming a Gaussian distribution. This extension of naive Bayes is called Gaussian Naive Bayes. Other functions can be used to estimate the distribution of the data, but the Gaussian (or Normal distribution) is the easiest to work with because you only need to estimate the mean and the standard deviation from your training data.
Naive Bayes
Naive Bayes is a classification algorithm for binary (two-class) and multiclass classification problems. The technique is easiest to understand when described using binary or categorical input values. It is called naive Bayes (or idiot Bayes) because the calculation of the probabilities for each hypothesis is simplified to make it tractable. Rather than attempting to calculate the joint probability of the attribute values, P(d1, d2, d3|h), they are assumed to be conditionally independent given the target value and calculated as P(d1|h) * P(d2|h) and so on. This is a very strong assumption that is most unlikely in real data, i.e. that the attributes do not interact. Nevertheless, the approach performs surprisingly well on data where this assumption does not hold. Bayes' Theorem provides a way that we can calculate the probability of a hypothesis given our prior knowledge. Bayes' Theorem is stated as: P(h|d) = [P(d|h) * P(h)] / P(d). After calculating the posterior probability, P(h|d), for a number of different hypotheses, you can select the hypothesis with the highest probability. This is the maximum probable hypothesis and may formally be called the maximum a posteriori (MAP) hypothesis.
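A small worked example of Bayes' Theorem and the MAP choice; the prior and likelihood values are invented for illustration.

# Two hypotheses (classes) with prior probabilities P(h) and likelihoods P(d|h).
priors = {"class0": 0.6, "class1": 0.4}
likelihoods = {"class0": 0.02, "class1": 0.09}

# Numerators of Bayes' Theorem: P(d|h) * P(h). P(d) is the same for every
# hypothesis, so it can be ignored when we only want the MAP hypothesis.
scores = {h: likelihoods[h] * priors[h] for h in priors}
map_hypothesis = max(scores, key=scores.get)
print(scores, map_hypothesis)   # class1 wins: 0.036 > 0.012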
Nonparametric Functions
Nonparametric methods are good when you have a lot of data and no prior knowledge, and when you don't want to worry too much about choosing just the right features.
Batch gradient descent
One iteration of the algorithm is called one batch and this form of gradient descent is referred to as batch gradient descent. Batch gradient descent is the most common form of gradient descent described in machine learning.
Kernel Trick
Other kernels can be used that transform the input space into higher dimensions, such as a Polynomial Kernel and a Radial Kernel. This is called the Kernel Trick. Using more complex kernels is desirable because it allows the lines separating the classes to be curved or even more complex, which in turn can lead to more accurate classifiers.
Tips for Gradient Descent
Plot Cost versus Time: Collect and plot the cost values calculated by the algorithm each iteration. The expectation for a well performing gradient descent run is a decrease in cost each iteration. If it does not decrease, try reducing your learning rate. Learning Rate: The learning rate value is a small real value such as 0.1, 0.001 or 0.0001. Try different values for your problem and see which works best. Rescale Inputs: The algorithm will reach the minimum cost faster if the shape of the cost function is not skewed and distorted. You can achieve this by rescaling all of the input variables (X) to the same range, such as between 0 and 1. Few Passes: Stochastic gradient descent often does not need more than 1-to-10 passes through the training dataset to converge on good or good enough coefficients. Plot Mean Cost: The updates for each training dataset instance can result in a noisy plot of cost over time when using stochastic gradient descent. Taking the average over 10, 100, or 1000 updates can give you a better idea of the learning trend for the algorithm.
NB: P(h|d)
Probability of hypothesis h given the data d. This is called the posterior probability.
Semi-supervised Learning
Problems where you have a large amount of input data (X) and only some of the data is labeled (Y) are called semi-supervised learning problems. These arise because labeling data can be expensive or time-consuming and may require domain experts.
Supervised Learning
Supervised learning is where you have input variables (X) and an output variable (Y ) and you use an algorithm to learn the mapping function from the input to the output. Supervised learning problems can be further grouped into regression and classification problems.
Overfitting vs underfitting
Overfitting refers to learning the training data too well at the expense of not generalizing well to new data. Underfitting refers to failing to learn the problem from the training data sufficiently.
F1 Score
The F1 score is the harmonic average of the precision and recall, where an F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0.
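A quick check of the formula F1 = 2 * precision * recall / (precision + recall), with made-up precision and recall values.

precision, recall = 0.8, 0.5
f1 = 2 * precision * recall / (precision + recall)
print(f1)  # approximately 0.615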
SVM
The Maximal-Margin Classifier is a hypothetical classifier that best explains how SVM works in practice. The numeric input variables (x) in your data (the columns) form an n-dimensional space. For example, if you had two input variables, this would form a two-dimensional space. A hyperplane is a line that splits the input variable space. In SVM, a hyperplane is selected to best separate the points in the input variable space by their class, either class 0 or class 1.
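A sketch of how a learned hyperplane produces the score mentioned under SVM Scoring above; the weights and intercept are made up for the example.

import numpy as np

# Hypothetical learned hyperplane: w . x + b = 0 splits the input space.
w = np.array([0.8, -0.5])
b = -0.2

def svm_score(x):
    # The sign gives the predicted class; the magnitude reflects distance
    # from the hyperplane and therefore the confidence of the prediction.
    return float(np.dot(w, x) + b)

print(svm_score(np.array([2.0, 1.0])))   # clearly positive -> class 1
print(svm_score(np.array([0.4, 0.3])))   # near zero -> uncertain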
Pruning a CART
The complexity of a decision tree is defined as the number of splits in the tree. Simpler trees are preferred. They are easy to understand (you can print them out and show them to subject matter experts), and they are less likely to overfit your data. The fastest and simplest pruning method is to work through each leaf node in the tree and evaluate the effect of removing it using a hold-out test set. Leaf nodes are removed only if it results in a drop in the overall cost function on the entire test set.
SVM Kernel
The learning of the hyperplane in linear SVM is done by transforming the problem using some linear algebra. A powerful insight is that the linear SVM can be rephrased using the inner product of any two given observations, rather than the observations themselves.
KNN model representation
The model representation for KNN is the entire training dataset. It is as simple as that. KNN has no model other than storing the entire dataset, so there is no learning required. Efficient implementations can store the data using complex data structures like k-d trees to make look-up and matching of new patterns during prediction efficient.
CART Stopping Criterion
The recursive binary splitting procedure described above needs to know when to stop splitting as it works its way down the tree with the training data. The most common stopping procedure is to use a minimum count on the number of training instances assigned to each leaf node, e.g. 5 or 10. If the count is too specific (e.g. a count of 1), the tree will overfit the training data and likely have poor performance on the test set.
Classification and Regression Trees (CART)
The representation for the CART model is a binary tree. This is your binary tree from algorithms and data structures, nothing too fancy. Each node represents a single input variable (x) and a split point on that variable (assuming the variable is numeric). The leaf nodes of the tree contain an output variable (y) which is used to make a prediction. Given a new input, the tree is traversed by evaluating the specific input, starting at the root node of the tree. A learned binary tree is actually a partitioning of the input space. You can think of each input variable as a dimension in a p-dimensional space. The decision tree splits this up into rectangles (when p = 2 input variables) or hyper-rectangles with more inputs. New data is filtered through the tree and lands in one of the rectangles, and the output value for that rectangle is the prediction made by the model.
Small vs. Large C
The smaller the value of C, the more sensitive the algorithm is to the training data (higher variance and lower bias). The larger the value of C, the less sensitive the algorithm is to the training data (lower variance and higher bias).
Bias-Variance Trade-off
There is a trade-off at play between these two concerns and the algorithms you choose and the way you choose to configure them are finding different balances in this trade-off for your problem. In reality we cannot calculate the real bias and variance error terms because we do not know the actual underlying target function. Nevertheless, as a framework, bias and variance provide the tools to understand the behavior of machine learning algorithms in the pursuit of predictive performance.
Unsupervised Learning
Unsupervised learning is where you only have input data (X) and no corresponding output variables. The goal for unsupervised learning is to model the underlying structure or distribution in the data in order to learn more about the data. Unsupervised learning problems can be further grouped into clustering and association problems.
Variance
Variance is the amount that the estimate of the target function will change if different training data was used. Generally nonparametric machine learning algorithms that have a lot of flexibility have a high variance.
Bootstrapping
We know that our sample is small and that our mean has error in it. We can improve the estimate of our mean using the bootstrap procedure: 1. Create many (e.g. 1000) random subsamples of our dataset with replacement (meaning we can select the same value multiple times). 2. Calculate the mean of each subsample. 3. Calculate the average of all of our collected means and use that as our estimated mean for the data. This can be used to estimate other quantities like standard deviation and learned coefficients.
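A sketch of the three steps, assuming NumPy; the data values and number of subsamples are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
data = np.array([3.2, 4.8, 2.9, 5.1, 4.4, 3.7])

# 1. Many random subsamples of the dataset with replacement.
means = []
for _ in range(1000):
    sample = rng.choice(data, size=len(data), replace=True)
    # 2. Calculate the mean of each subsample.
    means.append(sample.mean())

# 3. The average of the collected means is the bootstrap estimate of the mean.
print(np.mean(means))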
Important characteristics of submodels when combining predictions using bagging
When bagging with decision trees, we are less concerned about individual trees overfitting the training data. For this reason and for efficiency, the individual decision trees are grown deep (e.g. few training samples at each leaf node of the tree) and the trees are not pruned. These trees will have both high variance and low bias.
Learning Vector Quantization (LVQ)
an artificial neural network algorithm that allows you to choose how many training instances to hang onto and learns exactly what those instances should look like. Best understood as a classification algorithm.
Tuning Parameter
called simply C, it defines the magnitude of the wiggle allowed across all dimensions. The C parameter defines the amount of violation of the margin allowed. A value of C = 0 allows no violations and we are back to the inflexible Maximal-Margin Classifier described above. The larger the value of C, the more violations of the hyperplane are permitted.
Coefficient determination in Logistic Regression
coefficients in logistic regression are estimated using a process called maximum-likelihood estimation.
Precision
precision (also called positive predictive value) is the fraction of relevant instances among the retrieved instances.
Accuracy paradox
predictive models with a given level of accuracy may have greater predictive power than models with higher accuracy.
NB: P(d|h)
probability of data d given that the hypothesis h was true.
NB: P(d)
probability of the data (regardless of the hypothesis).
Recall
recall (also known as sensitivity) is the fraction of relevant instances that have been retrieved over the total amount of relevant instances.
Preparing data for AdaBoost
some heuristics for best preparing your data for AdaBoost. Quality Data: Because the ensemble method continues to attempt to correct misclassifications in the training data, you need to be careful that the training data is of high quality. Outliers: Outliers will force the ensemble down the rabbit hole of working hard to correct cases that are unrealistic. These could be removed from the training dataset. Noisy Data: Noisy data, specifically noise in the output variable, can be problematic. If possible, attempt to isolate and clean these from your training dataset.
Random Forest (how it is different from bagged tree)
In a bagged decision tree, the learning algorithm is allowed to look through all variables in order to select the most optimal split point. The random forest algorithm changes this procedure so that the learning algorithm is limited to a random sample of features to search. The number of features that can be searched at each split point (m) must be specified as a parameter to the algorithm. You can try different values and tune it using cross-validation. For classification a good default is m = sqrt(p). For regression a good default is m = p/3. Where m is the number of randomly selected features that can be searched at a split point and p is the number of input variables.
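A quick check of the suggested defaults for m; p here is a hypothetical number of input variables.

import math

p = 16                                  # hypothetical number of input variables
m_classification = int(math.sqrt(p))    # sqrt(p) -> 4 features per split
m_regression = max(1, p // 3)           # p/3   -> 5 features per split
print(m_classification, m_regression)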
NB: P(h)
the probability of hypothesis h being true (regardless of the data). This is called the prior probability of h.