Regularization

Basics of Regularization

Consider a training dataset made up of input variables X = (x1, x2, ..., xN) and the corresponding target variables t = (t1, t2, ..., tN). The inputs X are random variables distributed uniformly on [0, 1]. Each target in t is obtained by substituting the corresponding x into the function sin(2πx) and then adding some Gaussian noise. Note: Gaussian noise here means the deviations of the targets from the true function value, and these deviations follow a Gaussian distribution. This mimics any real-world dataset, since no data is free of noise.
Our goal is to find the pattern underlying this dataset and generalize it to predict the target value for new values of x. The problem is that the targets are corrupted by random noise, so it is difficult to recover the underlying function sin(2πx) from the training data. So, how do we solve it? Let's try fitting a polynomial to the data (Eq 1.1): y(x, w) = w0 + w1·x + w2·x² + ... + wM·x^M. Note that this polynomial is a non-linear function of x but a linear function of the coefficients w. We fit it to the training data to find the values of w that minimize the error in predicting the targets. The error function used here is the squared error (the mean squared error up to a constant factor), E(w) = (1/2) Σn (y(xn, w) − tn)², summed over the N training points. To minimize it we use calculus: the derivative of E(w) is set to 0 and solved for w. E(w) is quadratic in w, so its derivative is linear in w and the equation has a single solution, denoted w*.
So now we can find the best w for a given polynomial, but which degree M (in Eq 1.1) should we choose? A polynomial of any degree can be fitted to the training data, so how do we pick the best one with minimum complexity? Furthermore, the Taylor expansion of the sine function suggests that a higher-order polynomial can always approximate the function more closely. For reference, the Taylor expansion of sin(x) is sin(x) = x − x³/3! + x⁵/5! − x⁷/7! + ...
To see the problem graphically, plot the target variable against the input variable for y = sin(2πx). In such a plot, blue circles are the training data points, the green curve is the expected function that the fit should recover, and the red curves are polynomials of various degrees (given by the variable M) trained on the dataset. For a high degree, the polynomial has trained itself to hit the target values of all the noisy data points and has therefore failed to capture the correct pattern. Such a function may give zero error on the training set but will give huge errors when predicting targets for a test dataset.
To avoid this condition, regularization is used. Regularization is a technique for tuning the function by adding a penalty term to the error function. The additional term controls an excessively fluctuating function so that the coefficients don't take extreme values. Techniques that keep the coefficients in check by reducing their values are called shrinkage methods, or weight decay in the case of neural networks. Overfitting can also be controlled by increasing the size of the training dataset.
EDIT: Here, increasing the size of the dataset to avoid overfitting refers to increasing the number of observations (rows), not the number of features (columns). Adding columns may increase the complexity of the problem and may therefore result in poorer performance. Thanks to Sean McClure for pointing it out :)
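The overfitting behaviour described in this card can be reproduced in a few lines. The following is a minimal sketch rather than the original author's code: it assumes NumPy, 10 training points, a noise standard deviation of 0.3 and a fixed random seed, all of which are illustrative choices, and the helper names (make_data, fit_polynomial, rms_error) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed: an arbitrary, illustrative choice


def make_data(n, noise_std=0.3):
    """Sample x uniformly on [0, 1] and set t = sin(2*pi*x) + Gaussian noise."""
    x = rng.uniform(0.0, 1.0, size=n)
    t = np.sin(2 * np.pi * x) + rng.normal(0.0, noise_std, size=n)
    return x, t


def fit_polynomial(x, t, degree):
    """Least-squares weights w for y(x, w) = w0 + w1*x + ... + wM*x^M."""
    Phi = np.vander(x, degree + 1, increasing=True)  # columns: 1, x, x^2, ...
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w


def rms_error(x, t, w):
    """Root-mean-square difference between predictions and targets."""
    Phi = np.vander(x, len(w), increasing=True)
    return float(np.sqrt(np.mean((Phi @ w - t) ** 2)))


x_train, t_train = make_data(10)
x_test, t_test = make_data(100)

train_rms, test_rms = {}, {}
for M in (0, 1, 3, 9):
    w_star = fit_polynomial(x_train, t_train, M)
    train_rms[M] = rms_error(x_train, t_train, w_star)
    test_rms[M] = rms_error(x_test, t_test, w_star)
    print(f"M={M}: train RMS={train_rms[M]:.3f}, test RMS={test_rms[M]:.3f}")
```

The training error can only go down as M grows, because a higher-degree polynomial contains every lower-degree one as a special case, and with 10 points M = 9 has enough coefficients to pass through every noisy point. The test error is the number that shows whether the fit generalizes.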

Balancing bias and variance

The concept of balancing bias and variance is helpful in understanding the phenomenon of overfitting.

Variance

Variance refers to the amount by which our estimate of f(X) would change if we estimated it using a different training dataset. Since the training data is used to fit the statistical learning method, different training datasets will result in different estimates. Ideally, the estimate of f(X) should not vary too much between training sets. However, if a method has high variance, then small changes in the training data can result in large changes in the estimate of f(X).

What are L1 and L2 regularization?

Mathematically speaking, regularization adds a penalty term to the cost function so that the coefficients cannot fit the training data so perfectly that the model overfits. The difference between L1 and L2 lies only in the penalty: L2 uses the sum of the squares of the weights, while L1 uses the sum of the absolute values of the weights.
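As a small illustration of the two penalties, the sketch below computes both for one weight vector. The function names and the example weights are made up for illustration.

```python
def l1_penalty(weights):
    """L1 penalty: sum of the absolute values of the weights."""
    return sum(abs(w) for w in weights)


def l2_penalty(weights):
    """L2 penalty: sum of the squares of the weights."""
    return sum(w * w for w in weights)


weights = [0.5, -2.0, 0.0, 3.0]  # illustrative values
print(l1_penalty(weights))  # 5.5
print(l2_penalty(weights))  # 13.25
```

Note how the squared penalty is dominated by the largest weight (3.0 contributes 9 of the 13.25), whereas the L1 penalty grows in proportion to each weight's magnitude.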

Overfitting

One of the major aspects of training a machine learning model is avoiding overfitting. An overfitted model has low accuracy on data it has not seen, because it is trying too hard to capture the noise in the training dataset. By noise we mean data points that don't represent the true properties of the data but arise from random chance. Learning such data points makes the model more flexible, at the risk of overfitting.

What is the regularization parameter?

When a high-degree polynomial is used to fit a set of points in a linear regression setup, we use regularization to prevent overfitting: a penalty term weighted by a parameter lambda is included in the cost function. Lambda then appears in the update of the theta parameters in the gradient descent algorithm, where a larger lambda shrinks the parameters more strongly.
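Here is a hedged sketch of how lambda can enter the gradient descent update. It assumes the cost J(theta) = (1/(2m)) [ Σ (h(x) − y)² + lambda Σ theta_j² ] with the intercept theta_0 left unpenalized, which is one common convention and not necessarily the one behind this card. The function name, learning rate, step count and synthetic data are all illustrative.

```python
import numpy as np


def ridge_gradient_descent(X, y, lam, lr=0.1, n_steps=5000):
    """Gradient descent on squared error plus an L2 penalty weighted by lam."""
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])  # prepend a column of ones for theta_0
    theta = np.zeros(n + 1)
    for _ in range(n_steps):
        residual = Xb @ theta - y
        grad = Xb.T @ residual / m
        grad[1:] += (lam / m) * theta[1:]  # penalty gradient; intercept is skipped
        theta -= lr * grad
    return theta


rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, size=50)

theta_plain = ridge_gradient_descent(X, y, lam=0.0)
theta_reg = ridge_gradient_descent(X, y, lam=10.0)
print("lambda=0 :", np.round(theta_plain, 3))
print("lambda=10:", np.round(theta_reg, 3))
```

With lambda = 0 the update is ordinary least-squares gradient descent. With lambda > 0 each step pulls the non-intercept parameters towards zero in addition to reducing the data error.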

2.1 Let's Regularize:

Finally we are here! In this section I am going to tell you about the types of regularization and how they are obtained. Let's go back to the point where we started. Our initial condition on the dataset was that the target variable t consists of the actual value plus some added Gaussian noise. According to our probabilistic viewpoint, we can write from equation 2.3: p(t | X, w, β) = Πn Normal(tn | wT*phi(xn), 1/β), where β is the inverse of the variance, or precision parameter. Note that this equation is only valid if the data points are drawn independently from the distribution (eq 2.2). The sum-of-squares error for this function is defined by: ED(w) = (1/2) Σn (tn − wT*phi(xn))². Don't get confused by seeing "wT*phi(xn)". It's nothing but our y(x,w), in a more generalized form of the original eq (1.2). The phi(xn) are known as basis functions. For eq (1.1), the basis functions were the powers of x.
We have seen previously that using this error function alone often leads to overfitting, which is why a regularization term was introduced. After introducing the regularization coefficient, the overall cost function becomes J(w) = ED(w) + λ·EW(w), where λ controls the relative importance of the data-dependent error ED(w) with respect to the regularization term EW(w). A generalized form of the regularized cost is: J(w) = (1/2) Σn (tn − wT*phi(xn))² + (λ/2) Σj |wj|^q. If q = 1, it is termed lasso regression or L1 regularization, and if q = 2, it is called ridge regression or L2 regularization. If both the L1 and L2 terms are introduced simultaneously in the cost function, it is termed elastic net regularization. Note that the regularization term corresponds to the constraint Σj |wj|^q ≤ η for an appropriate value of the constant η.
Now, how do these extra regularization terms help us keep a check on the coefficients? Let's understand that. Consider the case of lasso regularization. Suppose we have the equation y = w1 + w2x. We consider an equation with only two parameters because it is easy to visualize in contour plots. I got a headache when I tried visualizing a multidimensional graph. Not an easy task. Even Google was not helpful here :P
Note: as a quick review, a contour plot is a 2D representation of a given 3D graph. For the cost function or error function, the 3D graph is plotted over the coefficient values w1 and w2 and the corresponding error, i.e. X-axis = w1, Y-axis = w2 and Z-axis = J(w1,w2), where J(w1,w2) is the cost function. In the contour plot, the x and y axes represent the independent variables (w1 and w2 in this case) and the cost function is shown in a 2D view as curves of constant value.
Now, returning to regularization. The cost function for lasso regression is: J(w1,w2) = ED(w1,w2) + (λ/2)(|w1| + |w2|). The contour of the penalty term is a diamond (fig 3). In that figure, the blue circles represent the contours of the un-regularized error function (ED) and the diamond-shaped contour is for the L1 regularization term, i.e. (λ/2)(|w1| + |w2|). The optimal value, denoted w*, lies where the two sets of contours meet, so that both terms of the cost function share a common value of w as the equation requires. In the figure this happens at a point where w1 is zero, i.e. the basis function corresponding to w1 does not affect the output. Hence, for a suitable value of λ, the solution vector is sparse (e.g. [0, w2]). This is how the complexity of equation (1.3) can be reduced. Most entries of the solution vector w are zero, and the non-zero entries carry only the relevant and important information, capturing the general trend in the dataset.
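To make the generalized cost concrete, the sketch below evaluates the data term plus (λ/2) Σj |wj|^q for q = 1 and q = 2. It assumes NumPy. The function name regularized_cost, the tiny design matrix Phi (basis functions 1 and x) and the numbers are made up for illustration.

```python
import numpy as np


def regularized_cost(w, Phi, t, lam, q):
    """0.5 * sum of squared errors + 0.5 * lam * sum(|w_j|^q)."""
    data_term = 0.5 * np.sum((Phi @ w - t) ** 2)
    penalty = 0.5 * lam * np.sum(np.abs(w) ** q)
    return float(data_term + penalty)


# Three illustrative points with basis functions phi_0(x) = 1, phi_1(x) = x
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
t = np.array([1.0, 2.0, 3.0])
w = np.array([1.0, 2.0])

print(regularized_cost(w, Phi, t, lam=2.0, q=1))  # lasso-style cost: 5.5
print(regularized_cost(w, Phi, t, lam=2.0, q=2))  # ridge-style cost: 7.5
```

The data term is the same in both calls (2.5). Only the penalty changes: 3.0 for q = 1 and 5.0 for q = 2.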

Balancing Bias and Variance to Control Errors in Machine Learning

In the world of Machine Learning, accuracy is everything. You strive to make your model more accurate by tuning and tweaking the parameters, but you are never able to make it 100% accurate. That's the hard truth about prediction and classification models: they can never be error-free. In this article I'll discuss why this happens and other forms of error that can be reduced.

What does Regularization achieve?

A standard least squares model tends to have some variance in it, i.e. the model won't generalize well to a dataset different from its training data. Regularization significantly reduces the variance of the model without a substantial increase in its bias. The tuning parameter λ used in regularization techniques such as ridge and lasso controls this impact on bias and variance. As λ rises, it reduces the values of the coefficients and thus reduces the variance. Up to a point, this increase in λ is beneficial, as it only reduces the variance (hence avoiding overfitting) without losing any important properties of the data. But beyond a certain value, the model starts losing important properties, giving rise to bias in the model and thus to underfitting. Therefore, the value of λ should be selected carefully.
This is all the basics you need to get started with regularization. It is a useful technique that can help improve the accuracy of your regression models. A popular library that implements these algorithms is Scikit-Learn. It has a wonderful API that can get your model up and running with just a few lines of Python code.
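As a rough idea of what "a few lines" looks like, here is a sketch using scikit-learn's LinearRegression, Ridge and Lasso estimators on synthetic data. It assumes scikit-learn and NumPy are installed. Note that scikit-learn calls the tuning parameter alpha rather than λ, and that its Lasso objective scales the squared-error term by 1/(2n), so alpha values are not directly comparable between the two estimators. The data and the alpha values are illustrative, not tuned.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Synthetic target: only the first two features actually matter
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # L2 penalty
lasso = Lasso(alpha=0.5).fit(X, y)   # L1 penalty

print("least squares:", np.round(ols.coef_, 3))
print("ridge        :", np.round(ridge.coef_, 3))
print("lasso        :", np.round(lasso.coef_, 3))
```

In this setup, ridge shrinks all coefficients a little, while lasso sets the coefficients of the three irrelevant features exactly to zero.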

Ridge Regression

In ridge regression, the RSS is modified by adding a shrinkage quantity, so the function to minimize is RSS + λ Σj βj², with the sum running over the p coefficients β1, ..., βp. The coefficients are estimated by minimizing this function. Here, λ is the tuning parameter that decides how much we want to penalize the flexibility of our model. An increase in the flexibility of a model is represented by an increase in its coefficients, and if we want to minimize the function above, these coefficients need to be small. This is how ridge regression prevents the coefficients from rising too high. Also, notice that we shrink the estimated association of each variable with the response, except for the intercept β0. The intercept is a measure of the mean value of the response when xi1 = xi2 = ... = xip = 0.
When λ = 0, the penalty term has no effect, and the estimates produced by ridge regression are equal to the least squares estimates. However, as λ→∞, the impact of the shrinkage penalty grows, and the ridge regression coefficient estimates approach zero. As can be seen, selecting a good value of λ is critical, and cross-validation comes in handy for this purpose. Because the penalty is the squared L2 norm of the coefficient vector, this method is also known as L2 regularization.
The coefficients produced by the standard least squares method are scale equivariant, i.e. if we multiply an input by c, then the corresponding coefficient is scaled by a factor of 1/c. Therefore, regardless of how a predictor is scaled, the product of predictor and coefficient (Xjβj) remains the same. This is not the case with ridge regression, so we need to standardize the predictors, i.e. bring them to the same scale, before performing ridge regression. This is done by dividing each predictor by its standard deviation: x̃ij = xij / sqrt((1/n) Σi (xij − x̄j)²), where x̄j is the mean of the j-th predictor over the n observations.
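The sketch below puts these pieces together: predictors are divided by their standard deviation, and the ridge coefficients are then computed in closed form for several values of λ, with the intercept β0 left unpenalized. It assumes NumPy. The function names, the synthetic data and the grid of λ values are illustrative.

```python
import numpy as np


def standardize(X):
    """Divide each column by its standard deviation (1/n convention)."""
    return X / X.std(axis=0)


def ridge_fit(X, y, lam):
    """Minimize RSS + lam * sum(beta_j^2); the intercept is not penalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean  # centering removes the intercept
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    beta0 = y_mean - x_mean @ beta
    return beta0, beta


rng = np.random.default_rng(0)
# Two predictors on very different scales
X = rng.normal(size=(60, 2)) * np.array([1.0, 1000.0])
y = 2.0 + 1.5 * X[:, 0] + 0.002 * X[:, 1] + rng.normal(scale=0.5, size=60)

Xs = standardize(X)
lambdas = [0.0, 1.0, 100.0, 1e6]
coef_norms = [float(np.linalg.norm(ridge_fit(Xs, y, lam)[1])) for lam in lambdas]
for lam, norm in zip(lambdas, coef_norms):
    print(f"lambda={lam:g}: ||beta|| = {norm:.4f}")
```

With λ = 0 this reproduces the least squares fit, and the norm of the coefficient vector shrinks towards zero as λ grows, as described above.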

General Rule

When a statistical method tries to match the data points too closely, any change in the dataset produces a different estimate, even though each estimate is highly accurate on its own training data. A general rule is that, as a statistical method tries to match data points more closely, or when a more flexible method is used, the bias reduces but the variance increases.

Bias

Bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model. So, if the true relation is complex and you try to use linear regression, it will undoubtedly result in some bias in the estimate of f(X). No matter how many observations you have, it is impossible to produce an accurate prediction with a restrictive, simple algorithm when the true relation is highly complex.

Probabilistic view:

Here we will look at the same problem from a different perspective. Once again, let the training data be X and the corresponding target variables be t. We have to find the correct value of the target variable for any new test input x. In other words, we want the most probable value of t given x, so we model p(t|x), parameterised by w. Let us make a small assumption for this purpose: for a given value of x, the corresponding value of t follows a Gaussian distribution with mean equal to y(x,w) (note that y(x,w) is our eq 1.1), i.e. p(t | x, w, β) = Normal(t | y(x,w), 1/β), where the Gaussian distribution is defined as Normal(t | μ, σ²) = (1/sqrt(2πσ²)) exp(−(t − μ)²/(2σ²)). Here β is the precision parameter, i.e. the inverse of the variance. Why have we made this assumption? Don't worry, it is justified in the section "Proof for Assumption".
Going back to our probability equation, the parameters w and β are unknown, and we use the training data to determine them. All the training data points are independent of each other, and independent probabilities can be multiplied, so we can write p(t | X, w, β) = Πn Normal(tn | y(xn,w), 1/β). We get the correct values of w and β by maximizing this expression. For convenience we take its logarithm, which does not change the location of the maximum: ln p(t | X, w, β) = −(β/2) Σn (y(xn,w) − tn)² + (N/2) ln β − (N/2) ln(2π), where N is the number of training points. When differentiating with respect to w, the last two terms can be ignored because they contain no w. The β in front of the first term only scales it and therefore has no effect on where the derivative is zero. So, if we ignore β, we end up with the same squared error formula, which gives us the value of w as in (eq 1.2). Isn't this magic? Similarly, we find β by differentiating with respect to β, which gives 1/β = (1/N) Σn (y(xn,w) − tn)², evaluated at the w found above. Now that we know the values of w and β, we can use our final equation, p(t | x, w, β) = Normal(t | y(x,w), 1/β), to predict the values for test data.
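The two maximum-likelihood steps above (least squares for w, then the mean squared residual for 1/β) can be checked numerically. This is a minimal sketch assuming NumPy, a polynomial y(x,w) of degree 3 and synthetic sin(2πx) data. The function names and all numeric settings are illustrative.

```python
import numpy as np


def fit_ml(x, t, degree):
    """Maximum-likelihood w (least squares) and precision beta."""
    Phi = np.vander(x, degree + 1, increasing=True)
    w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    residuals = Phi @ w_ml - t
    beta_ml = 1.0 / np.mean(residuals ** 2)  # 1/beta = mean squared residual
    return w_ml, float(beta_ml)


def log_likelihood(x, t, w, beta):
    """ln p(t | X, w, beta) for Gaussian noise with precision beta."""
    Phi = np.vander(x, len(w), increasing=True)
    n = len(t)
    sq = np.sum((Phi @ w - t) ** 2)
    return float(-0.5 * beta * sq + 0.5 * n * np.log(beta) - 0.5 * n * np.log(2 * np.pi))


rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=50)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=50)

w_ml, beta_ml = fit_ml(x, t, degree=3)
print("w_ml   :", np.round(w_ml, 3))
print("beta_ml:", round(beta_ml, 3))
print("log-likelihood at the maximum:", round(log_likelihood(x, t, w_ml, beta_ml), 3))
```

Evaluating log_likelihood at any other w or β, for example w_ml + 0.1 or 2 * beta_ml, gives a lower value, which is a quick way to convince yourself that these formulas do locate the maximum.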

Bias-Variance Trade off

Picture a target diagram in which the center of the target is a model that perfectly predicts the correct values. As we move away from the bulls-eye, our predictions get worse and worse. Imagine we can repeat our entire model-building process to get a number of separate hits on the target, such that each blue dot represents a different realization of our model based on a different dataset for the same problem. The diagram displays four cases representing the combinations of high and low bias and variance. High bias is when all the dots are far from the bulls-eye, and high variance is when the dots are widely scattered. This illustration, combined with the previous explanation, makes the difference between bias and variance pretty clear.
As described earlier, in order to minimize the expected test error, we need to select a statistical learning method that simultaneously achieves low variance and low bias. There is always a trade-off between the two, because it is easy to obtain a method with extremely low bias but high variance (for instance, by drawing a curve that passes through every single training observation) or a method with very low variance but high bias (by fitting a horizontal line to the data). The challenge lies in finding a method for which both the variance and the squared bias are low.
"Mastering the trade-off between bias and variance is necessary to become a machine learning champion." Keep this concept in mind while solving machine learning problems, as it helps in improving model accuracy. Retaining it also helps you quickly decide on the best statistical model for a given situation.
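The "repeat the model-building process many times" idea can be simulated directly. The sketch below is an illustration under simplifying assumptions: the true function is sin(2πx), the training inputs are kept fixed and only the noise is redrawn for each dataset, and the horizontal line (degree 0) is compared with a degree-5 polynomial. It assumes NumPy, and every name and numeric setting is illustrative.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)


def true_f(x):
    return np.sin(2 * np.pi * x)


def bias2_and_variance(degree, n_train=30, n_sets=200, noise_std=0.3):
    """Estimate squared bias and variance of a polynomial fit by simulation."""
    x_train = np.linspace(0.0, 1.0, n_train)  # fixed inputs (a simplification)
    x_grid = np.linspace(0.0, 1.0, 50)        # points where predictions are compared
    preds = np.empty((n_sets, x_grid.size))
    for s in range(n_sets):
        t = true_f(x_train) + rng.normal(0.0, noise_std, n_train)  # fresh noise
        coeffs = P.polyfit(x_train, t, degree)
        preds[s] = P.polyval(x_grid, coeffs)
    mean_pred = preds.mean(axis=0)
    bias2 = float(np.mean((mean_pred - true_f(x_grid)) ** 2))
    variance = float(np.mean(preds.var(axis=0)))
    return bias2, variance


bias2_simple, var_simple = bias2_and_variance(degree=0)      # horizontal line
bias2_flexible, var_flexible = bias2_and_variance(degree=5)  # flexible curve
print(f"degree 0: bias^2={bias2_simple:.4f}, variance={var_simple:.4f}")
print(f"degree 5: bias^2={bias2_flexible:.4f}, variance={var_flexible:.4f}")
```

The horizontal line barely changes between datasets (low variance) but is far from the true curve (high bias). The flexible polynomial tracks the curve on average but moves more from one dataset to the next.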

Lasso

Lasso is another variation, in which the function RSS + λ Σj |βj| is minimized. It's clear that this variation differs from ridge regression only in how it penalizes high coefficients: it uses |βj| (the modulus) instead of the square of βj as its penalty. In statistics, this is known as the L1 norm.
Let's take a look at the above methods from a different perspective. Ridge regression can be thought of as minimizing the RSS subject to the sum of squares of the coefficients being less than or equal to s. The lasso can be thought of as minimizing the RSS subject to the sum of the moduli of the coefficients being less than or equal to s. Here, s is a constant that exists for each value of the shrinkage factor λ. These inequalities are also referred to as constraint functions.
Consider a problem with 2 parameters. According to the above formulation, the ridge regression constraint is β1² + β2² ≤ s. This implies that the ridge regression coefficients have the smallest RSS (loss function) of all points that lie within the circle given by β1² + β2² ≤ s. Similarly, for the lasso the constraint becomes |β1| + |β2| ≤ s. This implies that the lasso coefficients have the smallest RSS (loss function) of all points that lie within the diamond given by |β1| + |β2| ≤ s.
Picture these constraints in the (β1, β2) plane: the constraint regions (green areas) are a diamond for the lasso and a circle for ridge regression, drawn together with the contours of the RSS (red ellipses). Points on the same ellipse share the same value of RSS. For a very large value of s, the green regions contain the center of the ellipses, making the coefficient estimates of both techniques equal to the least squares estimates. When s is smaller and the center lies outside the constraint region, the lasso and ridge regression coefficient estimates are given by the first point at which an ellipse contacts the constraint region.
Since ridge regression has a circular constraint with no sharp points, this intersection will not generally occur on an axis, and so the ridge regression coefficient estimates will be exclusively non-zero. However, the lasso constraint has corners at each of the axes, and so the ellipse will often intersect the constraint region at an axis. When this occurs, one of the coefficients will equal zero. In higher dimensions (where there are many more than 2 parameters), many of the coefficient estimates may equal zero simultaneously.
This sheds light on the obvious disadvantage of ridge regression, which is model interpretability. It will shrink the coefficients of the least important predictors very close to zero, but it will never make them exactly zero. In other words, the final model will include all predictors. In the case of the lasso, however, the L1 penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter λ is sufficiently large. Therefore, the lasso also performs variable selection and is said to yield sparse models.
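The exact-zero behaviour of the lasso can be seen with a small from-scratch solver. The sketch below uses cyclic coordinate descent with soft-thresholding, which is one standard way to minimize the lasso objective and is not claimed to be the method behind this card. It assumes NumPy and omits the intercept for brevity, and the data, the value of λ and the function names are illustrative.

```python
import numpy as np


def soft_threshold(z, gamma):
    """Shrink z towards zero by gamma, returning exactly 0 when |z| <= gamma."""
    return np.sign(z) * max(abs(z) - gamma, 0.0)


def lasso_coordinate_descent(X, y, lam, n_iters=200):
    """Minimize 0.5 * ||y - X b||^2 + lam * sum(|b_j|), one coordinate at a time."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iters):
        for j in range(p):
            # Residual with the contribution of coordinate j added back
            partial_residual = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_residual
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
    return beta


def ridge_closed_form(X, y, lam):
    """Minimize ||y - X b||^2 + lam * sum(b_j^2)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
# Only the first two predictors influence the response
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

beta_lasso = lasso_coordinate_descent(X, y, lam=50.0)
beta_ridge = ridge_closed_form(X, y, lam=50.0)
print("lasso:", np.round(beta_lasso, 3))
print("ridge:", np.round(beta_ridge, 3))
```

The lasso solution keeps the two relevant predictors and sets the other four coefficients exactly to zero, while the ridge solution shrinks every coefficient but leaves all of them non-zero.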

Cross-validation

One way of avoiding overfitting is to use cross-validation, which helps in estimating the error over the test set and in deciding which parameters work best for your model.
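As an example of using cross-validation to decide on a parameter, the sketch below picks the ridge penalty λ by 5-fold cross-validation. It assumes NumPy, a ridge model without an intercept for brevity, and synthetic data whose rows are already in random order, so contiguous folds are acceptable. The function names and the candidate λ values are illustrative.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Ridge coefficients in closed form (no intercept, for brevity)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


def cv_error(X, y, lam, k=5):
    """Average validation MSE of ridge regression over k folds."""
    folds = np.array_split(np.arange(len(y)), k)
    errors = []
    for val_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[val_idx] = False  # hold this fold out
        beta = ridge_fit(X[train_mask], y[train_mask], lam)
        errors.append(np.mean((X[val_idx] @ beta - y[val_idx]) ** 2))
    return float(np.mean(errors))


rng = np.random.default_rng(0)
X = rng.normal(size=(40, 10))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=40)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
scores = {lam: cv_error(X, y, lam) for lam in lambdas}
best_lam = min(scores, key=scores.get)
for lam in lambdas:
    print(f"lambda={lam:g}: CV MSE={scores[lam]:.3f}")
print("selected lambda:", best_lam)
```

Each candidate λ is scored only on data that was held out when fitting, so the selected value is the one that is estimated to generalize best, not the one that fits the training data best.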

What is regularization used for?

Regularization is a technique used to keep model parameters from taking extreme values, which would otherwise result in overfitting of machine learning models. Ridge and lasso are two common regularization techniques. Regularization improves the model's accuracy on unseen data by adding a penalty term to the cost function that is minimized.

What does ridge regression do?

Ridge Regression is a technique for analyzing multiple regression data that suffer from multicollinearity. ... By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors.

Model Complexity

The complexity of a relation, f(X), between input and response variables is an important factor to consider while learning from a dataset. A simple relation is easy to interpret. For example, a linear model looks like this: Y ≈ β0 + β1X1 + β2X2 + ... + βpXp. It is easy to infer information from this relation, and it clearly tells how a particular feature impacts the response variable. Such models come under the category of restrictive models, as they can take only a particular form, linear in this case.
But a relation may be more complex than this; for example, it may be quadratic, circular, etc. Models for such relations are more flexible, as they fit the data points more closely and can take different forms. Generally, such methods result in higher accuracy. But this flexibility comes at the cost of interpretability, as a complex relation is harder to interpret.
Choosing a flexible model does not always guarantee high accuracy. This happens because a flexible statistical learning procedure works too hard to find patterns in the training data, and may pick up some patterns that are caused by random chance rather than by true properties of the unknown function f. This changes our estimate of f(X), leading to a less accurate model. This phenomenon is known as overfitting.
When inference is the goal, there are clear advantages to using simple and relatively inflexible statistical learning methods. In some settings, however, we are only interested in prediction, and the interpretability of the predictive model is simply not of interest. This is when we use more flexible methods.

Ridge Regression #1

The main idea of Ridge Regression is to find a new line that doesn't fit the training data as well. In other words, we introduce a small amount of bias into how the new line is fit to the data. In return for that small amount of bias, we get a significant drop in variance. When Ridge Regression determines values for the parameters in the equation Size = y-axis intercept + slope × Weight, it minimizes "the sum of the squared residuals + lambda × the slope^2".
- "the slope^2" adds a penalty to the Least Squares method.
- lambda determines how severe that penalty is.
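For a single predictor, the minimizer of "the sum of the squared residuals + lambda × the slope^2" has a simple closed form: slope = Sxy / (Sxx + lambda), with the intercept chosen so the line passes through the point of means. The sketch below applies it to made-up Weight and Size values. The function name and the data are hypothetical.

```python
def ridge_line(weights, sizes, lam):
    """Fit Size = intercept + slope * Weight with a penalty lam * slope^2."""
    n = len(weights)
    w_mean = sum(weights) / n
    s_mean = sum(sizes) / n
    sxy = sum((w - w_mean) * (s - s_mean) for w, s in zip(weights, sizes))
    sxx = sum((w - w_mean) ** 2 for w in weights)
    slope = sxy / (sxx + lam)  # lam > 0 pulls the slope towards zero
    intercept = s_mean - slope * w_mean
    return intercept, slope


weights = [1.0, 2.0, 3.0, 4.0]  # made-up data lying on Size = 2 * Weight - 0.5
sizes = [1.5, 3.5, 5.5, 7.5]

print(ridge_line(weights, sizes, lam=0.0))  # (-0.5, 2.0): the Least Squares line
print(ridge_line(weights, sizes, lam=5.0))  # (2.0, 1.0): a flatter line
```

With lambda = 0 the result is the ordinary Least Squares line. Increasing lambda flattens the slope, so predicted Size becomes less sensitive to Weight.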

Regularization

The word regularize means to make things regular or acceptable, and that is exactly what we use it for. Regularization refers to techniques that reduce error by fitting a function appropriately to the given training set while avoiding overfitting. In regression, it constrains (regularizes, or shrinks) the coefficient estimates towards zero. In other words, this technique discourages learning a more complex or flexible model, so as to avoid the risk of overfitting.
The goal of the learning problem is to find a function that predicts the outcome (label) while minimizing the expected error over all possible inputs and labels. A simple relation for linear regression looks like this: Y ≈ β0 + β1X1 + β2X2 + ... + βpXp. Here Y represents the learned relation and β represents the coefficient estimates for the different variables or predictors (X). The fitting procedure involves a loss function, known as the residual sum of squares or RSS, and the coefficients are chosen to minimize it. This adjusts the coefficients based on your training data. If there is noise in the training data, the estimated coefficients won't generalize well to future data. This is where regularization comes in and shrinks, or regularizes, these learned estimates towards zero.

Quality of fit

To quantify the extent to which the predicted response value for a given observation is close to the true response value for that observation, the most commonly used measure in the regression setting is the mean squared error (MSE): MSE = (1/n) Σi (yi − f̂(xi))², where f̂(xi) is the prediction for the i-th observation. As the name suggests, it is the mean of the squared errors, i.e. of the squared differences between predicted and observed values over all inputs. It is known as the training MSE if calculated using training data, and the test MSE if calculated using test data. The expected test MSE for a given value x0 can always be decomposed into the sum of three fundamental quantities: the variance of f̂(x0), the squared bias of f̂(x0), and the variance of the error term e. Here, e is the irreducible error, which was discussed earlier. So, let's see more about bias and variance.
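A direct translation of the MSE definition, shown on made-up observed and predicted values (the function name and the numbers are illustrative):

```python
def mean_squared_error(y_true, y_pred):
    """Mean of the squared differences between observed and predicted values."""
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / len(y_true)


observed = [2.0, 4.0, 6.0, 8.0]
predicted = [1.0, 4.0, 7.0, 10.0]
print(mean_squared_error(observed, predicted))  # (1 + 0 + 1 + 4) / 4 = 1.5
```

Calling the same function on training data gives the training MSE, and calling it on held-out data gives the test MSE.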

Proof for Assumption:

To understand this, you need a little knowledge about the expectation of a function. The expectation of a function is its average value under a probability distribution. For continuous variables it is defined as E[f] = ∫ p(x) f(x) dx.
Consider a situation where we want to categorize a patient as having cancer or not, based on some observed variable. If a person who does not have cancer is categorized as a cancer patient, some loss is incurred by our model; conversely, if a patient who has cancer is predicted to be healthy, the loss incurred is several times higher than in the previous case. Let Lij denote the loss when i is falsely predicted as j. In the regression setting, the loss L(t, y(x)) depends on the target t and the prediction y(x), and the average, or expected, loss is given by E[L] = ∫∫ L(t, y(x)) p(x,t) dx dt. We wish to minimize the loss as much as possible, so we choose y(x) such that E[L] is minimum. Taking the squared loss L(t, y(x)) = (y(x) − t)², applying the product rule [p(x,t) = p(t|x) * p(x)] and differentiating with respect to y(x), we get y(x) = ∫ t p(t|x) dt = Et[t|x].
Does this equation look familiar? Yes, it is the assumption we made. The term Et[t|x] is the conditional average of t conditioned on x, i.e. for a given x, t follows a Gaussian with mean equal to y. This is also known as the regression function. Fig 2 helps to get a clear understanding of it: for a given x, the target t follows a Gaussian distribution, and the minimum of the loss function is obtained when the regression function y(x) takes the value y that is the mean of that distribution.
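The claim that the conditional mean minimizes the expected squared loss can be checked numerically. The sketch below assumes NumPy and a hypothetical conditional distribution p(t|x) at one fixed x (a Gaussian with mean 0.7 and standard deviation 0.3, chosen purely for illustration). It approximates the expected loss by a sample average and searches a grid of candidate predictions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Samples of t for one fixed x, drawn from a hypothetical Gaussian p(t | x)
t_samples = rng.normal(loc=0.7, scale=0.3, size=10_000)


def expected_squared_loss(y, t):
    """Sample estimate of E[(y - t)^2] for a candidate prediction y."""
    return float(np.mean((y - t) ** 2))


candidates = np.linspace(0.0, 1.5, 151)  # grid of candidate predictions, step 0.01
losses = np.array([expected_squared_loss(y, t_samples) for y in candidates])
best_y = float(candidates[np.argmin(losses)])

print("prediction with the lowest average loss:", best_y)
print("sample mean of t                       :", float(t_samples.mean()))
```

The best candidate on the grid is the one closest to the sample mean of t, which is the numerical counterpart of y(x) = Et[t|x].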

