Machine Learning Concepts

Confusion matrix

A confusion matrix helps organize and visualize classification results. Each row represents the actual number of observations in a class, and each column represents the number of observations predicted as belonging to a class.
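For instance, scikit-learn's confusion_matrix follows the same convention (rows are actual classes, columns are predicted classes); the labels below are made up purely for illustration:

```python
from sklearn.metrics import confusion_matrix

# Illustrative labels: 1 = positive class, 0 = negative class
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[3 1]   <- actual 0: 3 predicted 0, 1 predicted 1
#  [1 3]]  <- actual 1: 1 predicted 0, 3 predicted 1
```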

Random forests

A random forest is an ensemble method that can utilize many decision trees, whose decisions it averages. Typically, an individual decision tree may be prone to overfitting because a leaf node can be created for each observation. Two characteristics of random forests allow a reduction in both overfitting and the correlation between the trees. The first is bagging, where individual decision trees are each fitted on a bootstrap sample and their predictions averaged afterwards. Bagging significantly reduces the variance of the random forest versus the variance of any individual decision tree. The second way random forests reduce overfitting is that a random subset of features is considered at each split, preventing the important features from always being present at the tops of individual trees. Random forests are often used due to their versatility, interpretability (you can quickly see feature importance), quick training times (they can be trained in parallel), and prediction performance. In interviews, you'll be asked how they work versus a decision tree, and when you would use a random forest over other techniques.
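A minimal scikit-learn sketch (dataset and hyperparameters chosen only for illustration) showing parallel training and the quick feature-importance readout mentioned above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_features controls the random subset of features considered at each split;
# n_jobs=-1 trains the trees in parallel
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", n_jobs=-1, random_state=0)
rf.fit(X_train, y_train)

print("test accuracy:", rf.score(X_test, y_test))
print("feature importances (first 5):", rf.feature_importances_[:5])
```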

Multicollinearity

Another pitfall is if the predictors are correlated. This phenomenon, known as multicollinearity, affects the resulting coefficient estimates by making it difficult to distinguish the true underlying individual weights of the variables. Multicollinearity most commonly shows up as coefficient estimates whose signs or magnitudes change unexpectedly, and it is one of the reasons why model weights cannot be directly interpreted as feature importance in linear regression. One way to assess multicollinearity is by examining the variance inflation factor (VIF), which quantifies how much the estimated coefficients are inflated when multicollinearity exists. Methods to address multicollinearity include removing the correlated variables, linearly combining the variables, or using PCA/PLS (partial least squares).
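A small sketch of the VIF check using statsmodels; the synthetic predictors are illustrative, with x2 deliberately constructed to be nearly collinear with x1:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic predictors: x2 is almost a linear function of x1
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2 * x1 + rng.normal(scale=0.1, size=200)
x3 = rng.normal(size=200)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# A VIF above roughly 5-10 is a common rule of thumb for problematic multicollinearity
for i, col in enumerate(X.columns):
    print(col, variance_inflation_factor(X.values, i))
```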

Bagging vs. boosting

Bagging and boosting are two main types of ensemble learning methods. The main difference between them is the way in which they are trained. In bagging, weak learners are trained in parallel, but in boosting, they learn sequentially. This means that a series of models is constructed and, with each new model iteration, the weights of the data misclassified by the previous model are increased. This redistribution of weights helps the algorithm identify the examples it needs to focus on to improve its performance. AdaBoost, which stands for "adaptive boosting," is one of the most popular boosting algorithms, as it was one of the first of its kind. Other types of boosting algorithms include XGBoost, GradientBoost, and BrownBoost. Bagging and boosting also differ in the scenarios in which they are used: bagging methods are typically used on weak learners that exhibit high variance and low bias, whereas boosting methods are leveraged when low variance and high bias is observed.

Bagging

Bagging, also known as bootstrap aggregation, is an ensemble learning method commonly used to reduce variance within a noisy dataset. In bagging, a random sample of data in a training set is selected with replacement, meaning that individual data points can be chosen more than once. After several data samples are generated, weak models are trained independently on them, and depending on the type of task (regression or classification, for example) the average or majority of those predictions yields a more accurate estimate. As a note, the random forest algorithm is considered an extension of the bagging method, using both bagging and feature randomness to create an uncorrelated forest of decision trees.

The bagging algorithm has three basic steps:
1. Bootstrapping: Bagging leverages a bootstrap sampling technique to create diverse samples. This resampling method generates different subsets of the training dataset by selecting data points at random and with replacement. Because each selection is made from the full training dataset, the same instance can be selected multiple times, so a value/instance may appear twice (or more) in a sample.
2. Parallel training: These bootstrap samples are then trained independently and in parallel with each other using weak or base learners.
3. Aggregation: Finally, depending on the task (i.e., regression or classification), an average or a majority of the predictions is taken to compute a more accurate estimate. In the case of regression, an average is taken of all the outputs predicted by the individual learners; this is known as soft voting. For classification problems, the class with the highest majority of votes is accepted; this is known as hard voting or majority voting.

The key benefits of bagging include:
- Ease of implementation: Python libraries such as scikit-learn (also known as sklearn) make it easy to combine the predictions of base learners or estimators to improve model performance. The scikit-learn documentation lays out the available modules that you can leverage in your model optimization.
- Reduction of variance: Bagging can reduce the variance within a learning algorithm. This is particularly helpful with high-dimensional data, where missing values can lead to higher variance, making the model more prone to overfitting and preventing accurate generalization to new datasets.

The key challenges of bagging include:
- Loss of interpretability: It's difficult to draw very precise business insights through bagging due to the averaging involved across predictions. While the output is more precise than any individual data point, a more accurate or complete dataset could also yield more precision within a single classification or regression model.
- Computational expense: Bagging slows down and grows more intensive as the number of iterations increases, so it's not well suited for real-time applications. Clustered systems or a large number of processing cores are ideal for quickly creating bagged ensembles on large datasets.
- Less flexibility: As a technique, bagging works particularly well with algorithms that are less stable. Ones that are more stable or subject to high amounts of bias do not provide as much benefit, as there's less variation within the dataset of the model. As noted in the Hands-On Guide to Machine Learning, "bagging a linear regression model will effectively just return the original predictions for large enough b."
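A minimal sketch of these steps with scikit-learn's BaggingClassifier (synthetic data, illustrative settings); each tree is fit on its own bootstrap sample and the predictions are combined by majority vote:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Each of the 100 trees is fit on a bootstrap sample (sampling with replacement);
# predictions are combined by majority vote for classification
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=100,
    bootstrap=True,
    n_jobs=-1,       # the independent learners can be trained in parallel
    random_state=0,
)

print(cross_val_score(bagged_trees, X, y, cv=5).mean())
```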

Batch gradient descent

Batch gradient descent sums the error for each data point in a training set, updating the model only after all training examples have been evaluated. This process is referred to as a training epoch. While this batching provides computational efficiency, it can still have a long processing time for large training datasets, as it needs to store all of the data in memory. Batch gradient descent also usually produces a stable error gradient and convergence, but sometimes that convergence point isn't the most ideal, settling on a local minimum rather than the global one.
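A toy NumPy sketch of batch gradient descent for linear regression, where every parameter update uses the entire training set; the data and learning rate are illustrative:

```python
import numpy as np

# Synthetic data: y ≈ 3x + 2 plus noise
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = 3 * X[:, 0] + 2 + rng.normal(scale=0.1, size=200)

Xb = np.c_[np.ones(len(X)), X]        # add intercept column
theta = np.zeros(2)                   # parameters to learn
lr = 0.1                              # learning rate

for epoch in range(500):              # one update per pass over the *entire* dataset
    error = Xb @ theta - y            # uses every training example
    grad = Xb.T @ error / len(y)      # gradient of the mean squared error cost
    theta -= lr * grad

print(theta)   # should end up close to [2, 3]
```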

Linear regression assumptions

Before you can use linear regression, you must validate its four main assumptions to prevent erroneous results:
- Linearity: The relationship between the feature set and the target variable is linear.
- Homoscedasticity: The variance of the residuals is constant.
- Independence: All observations are independent of one another.
- Normality: The residuals are normally distributed.

Types of boosting

Boosting algorithms can differ in how they create and aggregate weak learners during the sequential process. Three popular types of boosting methods include:
- Adaptive boosting or AdaBoost: Yoav Freund and Robert Schapire are credited with the creation of the AdaBoost algorithm. This method operates iteratively, identifying misclassified data points and adjusting their weights to minimize the training error. The model continues to optimize in a sequential fashion until it yields the strongest predictor.
- Gradient boosting: Building on the work of Leo Breiman, Jerome H. Friedman developed gradient boosting, which works by sequentially adding predictors to an ensemble, with each one correcting for the errors of its predecessor. However, instead of changing the weights of data points like AdaBoost, gradient boosting trains on the residual errors of the previous predictor. The name gradient boosting is used because it combines the gradient descent algorithm and the boosting method.
- Extreme gradient boosting or XGBoost: XGBoost is an implementation of gradient boosting that's designed for computational speed and scale. XGBoost leverages multiple cores on the CPU, allowing learning to occur in parallel during training.
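As an illustration of the XGBoost variant, a short sketch assuming the third-party xgboost package is installed; the dataset and hyperparameters are arbitrary:

```python
# Assumes the third-party xgboost package is installed (pip install xgboost)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_jobs uses multiple CPU cores; learning_rate shrinks each tree's contribution
model = XGBClassifier(n_estimators=300, learning_rate=0.1, max_depth=3, n_jobs=-1)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```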

Boosting

Boosting is an ensemble learning method that combines a set of weak learners into a strong learner to minimize training errors. In boosting, a random sample of data is selected, fitted with a model, and then trained sequentially—that is, each model tries to compensate for the weaknesses of its predecessor. With each iteration, the weak rules from each individual classifier are combined to form one strong prediction rule.

The key benefits of boosting include:
- Ease of implementation: Boosting can be used with several hyperparameter tuning options to improve fitting. No data preprocessing is required, and many boosting algorithms have built-in routines to handle missing data. In Python, the scikit-learn library of ensemble methods (sklearn.ensemble) makes it easy to implement popular boosting methods, including AdaBoost, gradient boosting, etc.
- Reduction of bias: Boosting algorithms combine multiple weak learners in a sequential method, iteratively improving upon observations. This approach can help to reduce the high bias commonly seen in shallow decision trees and logistic regression models.
- Computational efficiency: Since boosting algorithms only select features that increase their predictive power during training, they can help to reduce dimensionality as well as increase computational efficiency.

The key challenges of boosting include:
- Overfitting: There's some dispute in the research around whether boosting can help reduce overfitting or exacerbate it. We include it under challenges because in the instances where it does occur, predictions cannot be generalized to new datasets.
- Intense computation: Sequential training in boosting is hard to scale up. Since each estimator is built on its predecessors, boosting models can be computationally expensive, although XGBoost seeks to address scalability issues seen in other types of boosting methods. Boosting algorithms can be slower to train than bagging, and a large number of parameters can also influence the behavior of the model.
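A minimal AdaBoost sketch with scikit-learn, using decision stumps as the weak learners (synthetic data for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Decision stumps (depth-1 trees) are the classic weak learner for AdaBoost;
# each successive stump focuses on the examples the previous ones misclassified
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=200, random_state=0)
print(cross_val_score(ada, X, y, cv=5).mean())
```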

Cross-validation

Cross-validation assesses the performance of an algorithm in several subsamples of training data. It consists of running the algorithm on subsamples of the training data, such as the original data without some of the original observations, and evaluating model performance on the portion of the data that was excluded from the subsample. This process is repeated many times for the different subsamples, and results are combined at the end. Cross-validation helps you avoid training and testing on the same subsets of data points, which would lead to overfitting. As mentioned earlier, in cases where there isn't enough data or getting more data is costly, cross-validation enables you to have more faith in the quality and consistency of a model's test performance.
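A quick sketch with scikit-learn's cross_val_score, which handles the splitting, repeated fitting, and score aggregation; the model and dataset are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Each of the 5 folds is held out once for evaluation while the model
# is trained on the remaining 4 folds; the scores are then combined
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())
```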

Ensemble learning

Ensemble learning refers to a group (or ensemble) of base learners, or models, which work collectively to achieve a better final prediction. A single model, also known as a base or weak learner, may not perform well individually due to high variance or high bias. However, when weak learners are aggregated, they can form a strong learner, as their combination reduces bias or variance, yielding better model performance. Ensemble methods are frequently illustrated using decision trees, as this algorithm can be prone to overfitting (high variance and low bias) when it hasn't been pruned, and it can also lend itself to underfitting (low variance and high bias) when it's very small, like a decision stump, which is a decision tree with one level. Remember, when an algorithm overfits or underfits its training set, it cannot generalize well to new datasets, so ensemble methods are used to counteract this behavior and allow the model to generalize to new datasets. While decision trees can exhibit high variance or high bias, they are not the only modeling technique that leverages ensemble learning to find the "sweet spot" within the bias-variance tradeoff.

Local minima and saddle points

For convex problems, gradient descent can find the global minimum with ease, but for nonconvex problems, gradient descent can struggle to find the global minimum, where the model achieves the best results. Recall that when the slope of the cost function is at or close to zero, the model stops learning. A few scenarios beyond the global minimum can also yield this slope: local minima and saddle points. Local minima mimic the shape of a global minimum, where the slope of the cost function increases on either side of the current point. With saddle points, however, the cost function reaches a local maximum on one side of the point and a local minimum on the other; the name is inspired by the shape of a horse's saddle. Noisy gradients can help gradient descent escape local minima and saddle points.

Gradient descent

Gradient descent is an optimization algorithm used to train machine learning models by minimizing the error between predicted and actual results. The goal of gradient descent is to minimize the cost function, or the error between predicted and actual y. In order to do this, it requires two ingredients: a direction and a learning rate. These determine the partial derivative calculations of future iterations, allowing the algorithm to gradually arrive at the local or global minimum (i.e., the point of convergence).

The learning rate (also referred to as step size or alpha) is the size of the steps taken to reach the minimum. This is typically a small value, and it is evaluated and updated based on the behavior of the cost function. A high learning rate results in larger steps but risks overshooting the minimum. Conversely, a low learning rate has small step sizes. While it has the advantage of more precision, the number of iterations compromises overall efficiency, as it takes more time and computation to reach the minimum.

The cost (or loss) function measures the difference, or error, between actual y and predicted y at the current position. It improves the machine learning model's efficacy by providing feedback so the model can adjust its parameters to minimize the error and find the local or global minimum. The algorithm continuously iterates, moving along the direction of steepest descent (the negative gradient) until the cost function is close to or at zero, at which point the model stops learning. Additionally, while the terms cost function and loss function are often treated as synonymous, there is a slight difference between them: a loss function refers to the error of one training example, while a cost function calculates the average error across an entire training set.
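A toy sketch of the learning-rate tradeoff on a one-dimensional cost function J(w) = (w - 4)^2, made up purely for illustration: a small rate converges smoothly, a large one oscillates, and a too-large one diverges:

```python
# Minimizing the 1-D cost J(w) = (w - 4)^2 with gradient dJ/dw = 2 * (w - 4)
def gradient_descent(lr, steps=25, w=0.0):
    for _ in range(steps):
        w -= lr * 2 * (w - 4)   # step in the direction of the negative gradient
    return w

print(gradient_descent(lr=0.1))   # small steps: converges smoothly toward 4
print(gradient_descent(lr=0.9))   # large steps: oscillates around the minimum
print(gradient_descent(lr=1.1))   # too large: overshoots and diverges
```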

Hyperparameter tuning

Hyperparameters are important because they impact a model's training time, compute resources needed (and hence cost), and, ultimately, performance. One popular method for tuning hyperparameters is grid search, which involves forming a grid of candidate values (the Cartesian product over all hyperparameters) and then sequentially trying all such combinations and seeing which yields the best results. While comprehensive, this method can take a long time to run, since the cost increases exponentially with the number of hyperparameters. Another popular hyperparameter tuning method is random search, where we define a distribution for each parameter and randomly sample from the joint distribution over all parameters. This solves the problem of exploring an exponentially increasing search space, but is not guaranteed to achieve an optimal result.
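A short scikit-learn sketch contrasting the two approaches; the model, parameter ranges, and budget of 20 random samples are illustrative:

```python
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Grid search: tries every combination in the Cartesian product (3 x 3 = 9 candidates)
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)

# Random search: samples 20 points from the joint distribution over the parameters
rand = RandomizedSearchCV(
    SVC(),
    {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-3, 1e1)},
    n_iter=20, cv=5, random_state=0,
)
rand.fit(X, y)
print(rand.best_params_, rand.best_score_)
```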

Vanishing and exploding gradients

In deeper neural networks, particularly recurrent neural networks, we can also encounter two other problems when the model is trained with gradient descent and backpropagation. Vanishing gradients: this occurs when the gradient is too small. As we move backwards during backpropagation, the gradient continues to become smaller, causing the earlier layers in the network to learn more slowly than later layers. When this happens, the weight updates become insignificant (effectively zero), and the algorithm stops learning. Exploding gradients: this happens when the gradient is too large, creating an unstable model. In this case, the model weights grow too large and are eventually represented as NaN. One solution to this issue is to leverage a dimensionality reduction technique, which can help to minimize complexity within the model.
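A toy numerical illustration (not a real network): backpropagation multiplies the gradient by roughly one factor per layer, so factors below 1 drive it toward zero and factors above 1 blow it up:

```python
# Toy illustration: one multiplicative factor per layer during backpropagation.
grad = 1.0
for layer in range(50):
    grad *= 0.5          # e.g. small weights / saturating activations
print("vanishing:", grad)  # ~8.9e-16, effectively zero

grad = 1.0
for layer in range(50):
    grad *= 1.5          # e.g. large weights
print("exploding:", grad)  # ~6.4e8, numerically unstable
```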

Learning curves

Learning curves are plots of model learning performance over time. The y-axis is some metric of learning (for example, classification accuracy), and the x-axis is experience (time). A popular data science interview question involving learning curves is "How would you identify if your model was overfitting?" By analyzing the learning curves, you should be able to spot whether the model is underfitting or overfitting. For example, consider a plot where, as the number of iterations increases, the training error keeps getting better, but the validation error is not improving—in fact, it is increasing at the end—a clear sign that the model is overfitting and training can be stopped. Additionally, learning curves can help you discover whether a dataset is representative or not. If the data were not representative, the plot would show a large gap between the training curve and validation curve that doesn't get smaller even as training time increases.
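One way to compute learning curves in practice is scikit-learn's learning_curve, which varies the training-set size rather than the number of iterations; a persistent gap between training and validation scores points to the same issues described above. A minimal sketch with illustrative choices:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)

# Scores for progressively larger training sets, cross-validated on held-out folds
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=2000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
)
for n, tr, va in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n}: train={tr:.3f}, val={va:.3f}")
```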

Leave-one-out cross-validation

Leave-one-out cross-validation is a special case of k-fold cross-validation where k is equal to the size of the dataset (n). That is, the model is tested on every single data point during cross-validation.
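A minimal sketch using scikit-learn's LeaveOneOut splitter (dataset chosen only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)

# One fold per observation: the model is trained n times, each time leaving out a single point
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print(len(scores), scores.mean())   # 150 folds for the 150-row iris dataset
```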

Heteroscedasticity

Linear regression assumes that the residuals (the distance between what the model predicted and the actual value) are identically distributed. If the variance of the residuals is not constant, heteroscedasticity is most likely present, meaning that the residuals are not identically distributed. To detect heteroscedasticity, you can plot the residuals versus the fitted values. If the spread of the residuals is not roughly constant across the fitted values (for example, it fans out as the fitted values increase), this indicates that you should try to transform the dependent variable or include nonlinear terms in the model.
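A small sketch of the residuals-versus-fitted diagnostic using statsmodels and matplotlib; the synthetic data is deliberately heteroscedastic so the plot fans out:

```python
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm

# Illustrative data where the error variance grows with x (heteroscedastic)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 300)
y = 2 * x + rng.normal(scale=0.5 * x)      # noise scale depends on x

model = sm.OLS(y, sm.add_constant(x)).fit()

# A funnel shape (spread widening with fitted values) suggests heteroscedasticity
plt.scatter(model.fittedvalues, model.resid, s=10)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```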

Normality

Linear regression assumes the residuals are normally distributed. We can test this through a QQ plot. Also known as a quantile-quantile plot, a QQ plot graphs the standardized residuals versus theoretical quantiles and shows whether the residuals appear to be normally distributed (i.e., the plot resembles a straight line). If the QQ plot is not a reasonably straight line, this is a sign that the residuals are not normally distributed, and hence, the model should be reexamined. In that case, transforming the dependent variable (with a log or square-root transformation, for example) can help reduce skew.
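A minimal QQ-plot sketch with statsmodels (synthetic, well-behaved data for illustration):

```python
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=300)
y = 1.5 * x + rng.normal(size=300)

model = sm.OLS(y, sm.add_constant(x)).fit()

# Points falling close to the 45-degree line suggest approximately normal residuals
sm.qqplot(model.resid, line="45", fit=True)
plt.show()
```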

Mini-batch gradient descent

Mini-batch gradient descent combines concepts from both batch gradient descent and stochastic gradient descent. It splits the training dataset into small batch sizes and performs updates on each of those batches. This approach strikes a balance between the computational efficiency of batch gradient descent and the speed of stochastic gradient descent.

Confounding variables

Multicollinearity is an extreme case of confounding, which occurs when a variable other than the main independent or dependent variables affects the relationship between the independent and dependent variables. This can cause invalid correlations. For example, say you were studying the effects of ice cream consumption on sunburns and found that higher ice cream consumption leads to a higher likelihood of sunburn. That would be an incorrect conclusion because temperature is the confounding variable: higher summer temperatures lead to people eating more ice cream and also spending more time outdoors (which leads to more sunburn). Confounding can occur in many other ways, too. One way is selection bias, where the data are biased due to the way they were collected (for example, group imbalance). Another problem, known as omitted variable bias, occurs when important variables are omitted, resulting in a linear regression model that is biased and inconsistent. Omitted variables can stem from dataset generation issues or choices made during modeling. A common way to handle confounding is stratification, a process where you create multiple categories or subgroups in which the confounding variables do not vary much, and then test the significance and strength of associations using chi-square tests.

k-fold cross-validation

One popular way to do cross-validation is called k-fold cross-validation. The process is as follows (a minimal sketch appears after the list):
1. Randomly shuffle the data and split it into k equally sized blocks (folds).
2. For each fold i, train the model on all the data except fold i, and evaluate the validation error on fold i.
3. Average the k validation errors from step 2 to get an estimate of the true error.
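A sketch of these three steps written out with scikit-learn's KFold splitter (model and dataset are illustrative):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
errors = []

# Step 1: shuffle and split into k equally sized folds
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Step 2: train on everything except the held-out fold, evaluate on that fold
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    errors.append(1 - model.score(X[val_idx], y[val_idx]))

# Step 3: average the k validation errors
print(np.mean(errors))
```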

Outliers

Outliers can have an outsized impact on regression results. There are several ways to identify outliers. One of the more popular methods is examining Cook's distance, which is an estimate of the influence of a given data point. Cook's distance takes into account the residual and leverage (how far the X value is from those of other observations) of every point. In practice, it can be useful to remove points with a Cook's distance above a certain threshold.
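A small statsmodels sketch computing Cook's distance; the data is synthetic, with one planted outlier, and the 4/n cutoff is just a common rule of thumb:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)
y[0] += 15                                          # plant an outlier

model = sm.OLS(y, sm.add_constant(x)).fit()
cooks_d = model.get_influence().cooks_distance[0]   # one distance per observation

# A common rule of thumb flags points with Cook's distance above 4/n
threshold = 4 / len(y)
print(np.where(cooks_d > threshold)[0])
```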

Overfitting

Overfitting refers to the scenario where the model does not generalize well out of sample. That's because during overfitting, the model picks up too much noise or random fluctuation in the training data, which hinders performance on data the model has never seen before. Simpler, smaller models are less likely to overfit.

Regularization

Regularization aims to reduce the complexity of models. In relation to the bias-variance tradeoff, regularization aims to decrease complexity in a way that significantly reduces variance while only slightly increasing bias. The most widely used forms of regularization are L1 and L2. Both methods add a simple penalty term to the objective function. The penalty helps shrink the coefficients of features, which reduces overfitting; this is why, not surprisingly, they are known as shrinkage methods. Specifically, L1, also known as lasso, adds the absolute value of each coefficient to the objective function as a penalty. On the other hand, L2, also known as ridge, adds the squared magnitude of each coefficient to the objective function. The L1 and L2 penalties can also be linearly combined, resulting in the popular form of regularization called elastic net. Since overfitting is a prevalent problem in machine learning, it's important to understand when to use each type of regularization. For example, L1 serves as a feature selection method, since many coefficients shrink to 0 (are zeroed out) and hence are removed from the model. L2 is less likely to shrink any coefficients to 0. Therefore, L1 regularization leads to sparser models and is thus considered a stricter shrinkage operation.
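A short scikit-learn sketch on synthetic data with only 10 informative features out of 50, showing that the L1 penalty zeroes out many coefficients while the L2 penalty only shrinks them; the alpha values are illustrative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

# 50 features, only 10 of which actually carry signal
X, y = make_regression(n_samples=200, n_features=50, n_informative=10, noise=5, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)                      # L1: absolute-value penalty
ridge = Ridge(alpha=1.0).fit(X, y)                      # L2: squared-magnitude penalty
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)    # linear combination of L1 and L2

# L1 zeroes out many coefficients (feature selection); L2 only shrinks them
print("zero coefs - lasso:", np.sum(lasso.coef_ == 0))
print("zero coefs - ridge:", np.sum(ridge.coef_ == 0))
print("zero coefs - elastic net:", np.sum(enet.coef_ == 0))
```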

How to apply cross-validation for time series data?

Standard k-fold CV can't be applied, since the time-series data is not randomly distributed but instead is already in chronological order. Therefore, you should not be using data "in the future" for predicting data "from the past." Instead, you should use historical data up until a given point in time, and vary that point in time from the beginning till the end.
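scikit-learn's TimeSeriesSplit implements exactly this idea: each split trains on history up to a point in time and validates on what comes after. A tiny sketch:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)   # 10 observations in chronological order

# Each split trains only on data up to a point in time and validates on the data that follows
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
    print("train:", train_idx, "test:", test_idx)
# train: [0 1]             test: [2 3]
# train: [0 1 2 3]         test: [4 5]
# train: [0 1 2 3 4 5]     test: [6 7]
# train: [0 1 2 3 4 5 6 7] test: [8 9]
```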

The end-to-end ML workflow

Step 1: Clarify the problem and constraints
Step 2: Establish metrics
Step 3: Understand your data sources
Step 4: Explore your data
Step 5: Clean your data
Step 6: Feature engineering
Step 7: Model selection
Step 8: Model training & evaluation
Step 9: Deployment
Step 10: Iterate

Stochastic gradient descent

Stochastic gradient descent (SGD) runs a training epoch for each example within the dataset, updating the model's parameters one training example at a time. Since you only need to hold one training example at a time, it is easier to store in memory. While these frequent updates can offer more detail and speed, they can result in a loss of computational efficiency compared to batch gradient descent. The frequent updates also produce noisy gradients, but this can be helpful in escaping a local minimum and finding the global one.

Bias-variance tradeoff

The bias-variance tradeoff refers to finding a model with the right complexity, minimizing both the bias and variance error. With any model, we are usually trying to estimate a function f(x), which predicts our target variable y based on our input x. This relationship can be described as y = f(x) + w, where w is noise not captured by f(x) and is assumed to be distributed as a zero-mean Gaussian random variable for certain regression problems. To assess how well the model fits, we can decompose the error of y into the following:
1. Bias: how close the model's predicted values come to the true underlying f(x) values, with smaller being better
2. Variance: the extent to which model prediction error changes based on training inputs, with smaller being better
3. Irreducible error: variation due to inherently noisy observation processes
The tradeoff between bias and variance provides a lens through which you can analyze different models. Say we want to predict housing prices given a large set of potential predictors (square footage of a house, the number of bathrooms, and so on). A model with high bias but low variance, such as linear regression, is easy to implement but may oversimplify the situation at hand. This high bias but low variance would mean that predicted house prices are frequently off from the market value, but the variance in these predicted prices is low. On the flip side, a model with low bias and high variance, such as neural networks, would lead to predicted house prices closer to market value, but with predictions varying wildly based on the input features.

Bootstrapping

The process of bootstrapping is simply drawing observations from a large data sample repeatedly (sampling with replacement) and estimating some quantity of a population by averaging estimates from multiple smaller samples. Besides being useful in cases where the dataset is small, bootstrapping is also useful for helping deal with class imbalance: for the classes that are rare, we can generate new samples via bootstrapping.
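A minimal NumPy sketch of bootstrapping the mean of a sample (synthetic data, with 1,000 resamples chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=80)   # one observed sample

# Draw 1,000 bootstrap samples (with replacement) and collect the statistic of interest
boot_means = [rng.choice(data, size=len(data), replace=True).mean() for _ in range(1000)]

print("bootstrap estimate of the mean:", np.mean(boot_means))
print("95% interval:", np.percentile(boot_means, [2.5, 97.5]))
```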

Interpretability & explainability

There's usually a tradeoff between performance and model interpretability. Often, using a more complex model might increase performance but make results harder to interpret. Various models have their own ways of conveying feature importance. For example, linear models have weights which can be visualized and analyzed to identify what the model is using and learning. There are also some general frameworks that can help with more "black-box" models. One is SHAP (SHapley Additive exPlanations), which uses Shapley values to denote the average marginal contribution of a feature over all possible combinations of inputs. Another technique is LIME (Local Interpretable Model-agnostic Explanations), which uses sparse linear models built around individual predictions to understand how the model behaves in that local vicinity.
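A brief SHAP sketch, assuming the third-party shap package is installed; the model and dataset are arbitrary, and the summary plot ranks features by their average contribution:

```python
# Assumes the third-party shap package is installed (pip install shap)
import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Shapley values estimate each feature's average marginal contribution to a prediction
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)
```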

Precision and recall

Two metrics that go beyond accuracy are precision and recall. In classification, precision is the proportion of observations predicted positive by the classifier that are actually positive. In the cancer diagnostic example, it's the percentage of people you said would have cancer who actually ended up having the disease. Recall, also known as sensitivity, is the percentage of total positive cases captured, out of all positive cases. It's essentially how well you do in finding people with cancer. In real-world modeling, there's a natural tradeoff between optimizing for precision or recall. For example, having high recall—catching most people who have cancer—ends up saving the lives of some people with the disease. However, this often leads to misdiagnosing others who didn't truly have cancer, which subjects healthy people to costly and dangerous treatments like chemotherapy for a cancer they never had. On the flip side, having high precision means being confident that the people flagged as having cancer really do have it, but being that selective usually means missing some true cases, which lowers recall.
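A small sketch with scikit-learn's precision_score and recall_score on made-up cancer-screening labels:

```python
from sklearn.metrics import precision_score, recall_score

# 1 = has cancer, 0 = healthy (illustrative labels)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# Precision: of those predicted to have cancer, how many actually do? 3 / 5
# Recall: of those who actually have cancer, how many did we catch?   3 / 4
print("precision:", precision_score(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))
```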

Underfitting

Underfitting refers to the scenario where the model is not learning enough of the true relationship underlying the data.

