StatQuest Linear Regression, Gradient Descent
What is the problem with Gradient Descent, and how can we solve it?
It is slow on large datasets, and one solution is Stochastic Gradient Descent.
Which machine learning models have no analytical solution?
Logistic Regression and Neural Networks.
What is Mini-Batch Stochastic Gradient Descent?
Mini-Batch Stochastic Gradient Descent is a variation of SGD that randomly selects a small subset of observations, typically between 10 and 1,000, to compute the gradient at each iteration. For example, instead of randomly selecting one point per iteration (Stochastic Gradient Descent), we might select 3 points. This approach strikes a balance between the cheap iterations of SGD and the stability of using the entire dataset: averaging over a small subset reduces the noise introduced by using a single data point, while each iteration is still much cheaper than one over the entire dataset. Additionally, the computation for a mini-batch can be easily parallelized, further reducing training time for large datasets.
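The idea of selecting a few points per iteration can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `minibatch_sgd`, the synthetic data (points lying exactly on the line y = 1 + 2x), the learning rate, and the choice to average the gradient over the batch are all assumptions made for the example.

```python
import random

def minibatch_sgd(data, batch_size=3, learning_rate=0.05, steps=2000, seed=0):
    """Fit y = intercept + slope * x by mini-batch SGD on squared residuals."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    intercept, slope = 0.0, 0.0        # arbitrary starting guess
    for _ in range(steps):
        # Randomly select a small subset of observations for this iteration.
        batch = rng.sample(data, batch_size)
        d_intercept = 0.0
        d_slope = 0.0
        for x, y in batch:
            residual = y - (intercept + slope * x)
            d_intercept += -2 * residual       # derivative w.r.t. intercept
            d_slope += -2 * x * residual       # derivative w.r.t. slope
        # Assumption: average the gradient over the batch before stepping.
        intercept -= learning_rate * d_intercept / batch_size
        slope -= learning_rate * d_slope / batch_size
    return intercept, slope

# Hypothetical noise-free data on the line y = 1 + 2x.
data = [(i / 10, 1.0 + 2.0 * (i / 10)) for i in range(30)]
intercept, slope = minibatch_sgd(data)
print(round(intercept, 3), round(slope, 3))
```

Setting `batch_size=1` in this sketch gives plain Stochastic Gradient Descent, and setting it to `len(data)` gives ordinary Gradient Descent on the full dataset.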
What is the meaning of Gradient?
NOTE: A collection of derivatives of the same function but with respect to different parameters is called a Gradient, so this is where Gradient Descent gets its name from. We'll use the gradient to descend to the lowest SSR.
How does Gradient Descent work in machine learning?
This is an example of using gradient descent to optimize the intercept of a linear regression model. The first step is to calculate the derivative of the sum of squared residuals (SSR) with respect to the intercept, which measures how much the SSR changes when the intercept is varied. At the current intercept of 0.57, the derivative is -2.3. Next, the step size is calculated by multiplying the derivative by the learning rate, a hyperparameter that scales the size of the update at each iteration. Here the learning rate is 0.1, so the step size is -2.3 x 0.1 = -0.23. Finally, the new intercept is calculated by subtracting the step size from the current intercept: 0.57 - (-0.23) = 0.8. The new intercept of 0.8 is closer to the optimal value than the previous intercept of 0.57 and gives a lower SSR, so the model is improving. The step size is also smaller than in the previous step, because the slope of the tangent line is not as steep as before. Gradient descent takes smaller steps as it gets closer to the minimum, which helps it settle there instead of overshooting.
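The arithmetic of this single update can be checked with a few lines of Python. The helper name `gradient_step` is made up for this sketch; the numbers (intercept 0.57, derivative -2.3, learning rate 0.1) are the ones from the example above.

```python
def gradient_step(current, derivative, learning_rate):
    """Return (new_value, step_size) for one gradient descent update."""
    step_size = derivative * learning_rate   # Step Size = Derivative x Learning Rate
    new_value = current - step_size          # New = Current - Step Size
    return new_value, step_size

new_intercept, step_size = gradient_step(current=0.57, derivative=-2.3, learning_rate=0.1)
print(round(step_size, 2), round(new_intercept, 2))
```

Because the derivative is negative, subtracting the step size increases the intercept, which is the direction in which the SSR decreases.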
How does Stochastic Gradient Descent (SGD) help to overcome the local minimum problem?
Using a random subset of the training data at each iteration can help Stochastic Gradient Descent (SGD) avoid getting stuck in local minima in several ways: 1) Noisy gradients: when the algorithm uses the full training set for each iteration, it follows the same loss surface every time and can settle in a local minimum where the gradient is close to zero. Each random subset gives a slightly different gradient, so a point where the full gradient is zero is usually not a stopping point for every subset, and the noise can push the algorithm back out. 2) Increased exploration of the parameter space: the randomness lets the algorithm visit different regions of the parameter space, which can help it find a better solution than it would by always using the full dataset. 3) More frequent updates: since SGD updates the model parameters more often than batch gradient descent, it often makes faster progress per pass over the data. None of this guarantees that SGD finds the global minimum, but the extra randomness makes it less likely to stay trapped in a poor local minimum.
Optimizing Two or More Parameters with Gradient Descent.
When optimizing two parameters using gradient descent, we end up with a three-dimensional graph of the SSR (Sum of Squared Residuals), where one axis represents the SSR value and the other two axes represent the slope and intercept values. To find the optimal values of slope and intercept, we calculate the partial derivatives of the SSR function with respect to each parameter. For each observation, the derivative with respect to the intercept is -2 x (Height - (intercept + slope x Weight)), and the derivative with respect to the slope is -2 x Weight x (Height - (intercept + slope x Weight)); the full derivatives are the sums of these terms over all observations. We plug the observed values into these derivatives and start with random values for slope and intercept. Each iteration then uses both derivatives: one to update the intercept and one to update the slope. Each parameter gets its own step size (Step SizeIntercept = DerivativeIntercept x Learning Rate, Step SizeSlope = DerivativeSlope x Learning Rate), and the new value is the current value minus its step size. After some iterations, we should get close to the slope and intercept that minimize the SSR. The minimum SSR is 0 only if the line fits every point exactly; otherwise it is the smallest value the data allows.
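A minimal sketch of the two-parameter procedure is shown below. The three (Weight, Height) pairs, the starting values (intercept 0, slope 1), the learning rate of 0.01, and the stopping thresholds are all illustrative assumptions, not values stated in these notes.

```python
# Assumed example data: three (Weight, Height) observations.
weights = [0.5, 2.3, 2.9]
heights = [1.4, 1.9, 3.2]

def gradient(intercept, slope):
    """Partial derivatives of the SSR with respect to intercept and slope."""
    d_intercept = sum(-2 * (h - (intercept + slope * w))
                      for w, h in zip(weights, heights))
    d_slope = sum(-2 * w * (h - (intercept + slope * w))
                  for w, h in zip(weights, heights))
    return d_intercept, d_slope

def fit(intercept=0.0, slope=1.0, learning_rate=0.01,
        min_step=1e-6, max_steps=10000):
    for _ in range(max_steps):
        # Compute both derivatives at the current point before updating either.
        d_intercept, d_slope = gradient(intercept, slope)
        step_intercept = d_intercept * learning_rate
        step_slope = d_slope * learning_rate
        # Stop once both step sizes are very close to 0.
        if abs(step_intercept) < min_step and abs(step_slope) < min_step:
            break
        intercept -= step_intercept
        slope -= step_slope
    return intercept, slope

intercept, slope = fit()
print(round(intercept, 2), round(slope, 2))
```

Both derivatives are evaluated before either parameter changes, so the two updates in one iteration are based on the same point on the SSR surface.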
How can you solve the local minimum problem in Gradient Descent, briefly?
1) Try again using different random numbers to initialize the parameters that we want to optimize. Starting with different values may avoid a local minimum. 2) Fiddle around with the Step Size. Making it a little larger may help avoid getting stuck in a local minimum. 3) Use Stochastic Gradient Descent, because the extra randomness helps avoid getting trapped in a local minimum.
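The first suggestion, restarting from different random initial values, might look like the sketch below. The loss function x^4 - 3x^2 + x is a made-up one-parameter example with one local and one global minimum; the function names, learning rate, and number of restarts are likewise assumptions for illustration.

```python
import random

def loss(x):
    # Hypothetical loss with a local minimum near x = 1.13
    # and a global minimum near x = -1.30.
    return x**4 - 3 * x**2 + x

def d_loss(x):
    return 4 * x**3 - 6 * x + 1

def descend(x, learning_rate=0.01, steps=1000):
    """Plain gradient descent from the starting value x."""
    for _ in range(steps):
        x -= learning_rate * d_loss(x)
    return x

def best_of_restarts(n_starts=10, seed=0):
    """Run gradient descent from several random starts and keep the best result."""
    rng = random.Random(seed)
    candidates = [descend(rng.uniform(-2.0, 2.0)) for _ in range(n_starts)]
    return min(candidates, key=loss)

best_x = best_of_restarts()
print(round(best_x, 2), round(loss(best_x), 2))
```

A single run started at, say, x = 1.5 settles in the local minimum, while keeping the lowest-loss result across several random starts makes it much more likely that at least one run lands in the deeper minimum.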
How can you choose the learning rate?
Use Cross Validation: try several learning rates and keep the one that performs best on held-out data.
How do you find the point with the minimum SSR (Iterative Method)?
Gradient Descent starts with a guess for the value and then goes into a loop that improves the guess one small step at a time.
Explain Gradient Descent with one parameter.
Gradient descent is an iterative optimization algorithm commonly used in machine learning to minimize a loss function. The goal is to find the values of the parameters that minimize the loss. The algorithm starts with some initial values for the parameters, typically chosen randomly. At each iteration, the derivative of the loss function with respect to each parameter is calculated. If the derivative is positive, the parameter is decreased, and if it is negative, the parameter is increased, so each update moves toward the minimum. The size of the update is determined by the learning rate, which is a hyperparameter set by the user. For example, if the learning rate is 0.01, the step size for updating the intercept is Step SizeIntercept = Derivative x Learning Rate. If the current intercept is 0 and the derivative is -7.3, the step size is -7.3 x 0.01 = -0.073, and New intercept = Current intercept - Step SizeIntercept = 0 - (-0.073) = 0.073. The algorithm continues to iterate until the step size becomes small enough, indicating that the parameters have converged to a minimum, or until a maximum number of steps is reached. There are variations of gradient descent, such as stochastic gradient descent and mini-batch gradient descent, that use different strategies for selecting the data points used at each iteration. Overall, gradient descent is a widely used optimization algorithm that underlies many machine learning models.
3 steps in Gradient Descent.
Gradient descent is an iterative optimization method that aims to minimize the loss function of a model by updating its parameters, such as the intercept and slope, based on the gradient of the loss function. In this case, we are optimizing a linear regression model by updating only its intercept. Step 1: take the derivative of the sum of squared residuals (SSR) with respect to the intercept. For a single observation, SSR = (Height - (intercept + 0.64 x Weight))^2, where 0.64 is the fixed slope, and its derivative with respect to the intercept is -2 x (Height - (intercept + 0.64 x Weight)); with several observations, these terms are summed. Step 2: initialize the intercept with a random value, for example 0, and plug it into the derivative. Step 3: calculate Step Size = Derivative of SSR x Learning Rate, where the Learning Rate is a hyperparameter that scales the size of each update, and subtract the step size from the current intercept to get the new intercept. The new intercept is then plugged back into the derivative, and the process is repeated until the step size is close to 0 (smaller than a threshold value, t) or we reach the maximum number of steps. In summary, gradient descent is an iterative method that updates the parameters of a model, such as the intercept, based on the gradient of the loss function. By choosing an appropriate learning rate and maximum number of steps, we can optimize the model and improve its fit.
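The three steps can be sketched as a short loop. The fixed slope of 0.64 and the starting intercept of 0 come from the text above; the three (Weight, Height) observations, the learning rate of 0.1, the threshold of 0.001, and the limit of 1,000 steps are assumptions chosen for the illustration.

```python
# Assumed example data: three (Weight, Height) observations.
weights = [0.5, 2.3, 2.9]
heights = [1.4, 1.9, 3.2]
SLOPE = 0.64  # fixed slope from the text

def ssr(intercept):
    """Sum of squared residuals for a given intercept."""
    return sum((h - (intercept + SLOPE * w)) ** 2
               for w, h in zip(weights, heights))

def d_ssr_d_intercept(intercept):
    """Step 1: derivative of the SSR with respect to the intercept."""
    return sum(-2 * (h - (intercept + SLOPE * w))
               for w, h in zip(weights, heights))

def fit_intercept(start=0.0, learning_rate=0.1, threshold=0.001, max_steps=1000):
    intercept = start                      # Step 2: initial guess
    for _ in range(max_steps):
        # Step 3: Step Size = Derivative x Learning Rate
        step_size = d_ssr_d_intercept(intercept) * learning_rate
        if abs(step_size) < threshold:     # stop when the step is close to 0
            break
        intercept -= step_size             # New = Current - Step Size
    return intercept

best = fit_intercept()
print(round(best, 2), round(ssr(best), 3))
```

With these assumed data points, the derivative at an intercept of 0 is about -5.7, so the first step moves the intercept to roughly 0.57, and later steps shrink as the derivative approaches 0.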
P-value for R squared
If R2 = 0.66 and the p-value is 0.1, there is a 10% chance that random data could give us an R2 ≥ 0.66.
How can you solve the local minimum problem in Gradient Descent?
One common technique for avoiding local minima in Gradient Descent is to use a variant called Stochastic Gradient Descent (SGD). SGD uses a randomly selected subset of the training data at each iteration, which can help the algorithm avoid getting stuck in local minima. Another technique is to use a variant of Gradient Descent called "momentum". Momentum adds a fraction of the previous update to the current update, which smooths out the steps taken and allows the algorithm to keep making progress even when the gradient is small. A third technique is "learning rate annealing", which means reducing the learning rate over time. A larger learning rate early on can help the algorithm step past shallow local minima, and a smaller one later helps it avoid overshooting the minimum. Finally, more advanced optimization algorithms such as Adam or RMSprop use adaptive learning rates and other features that can help the algorithm avoid local minima.
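The momentum update described above can be sketched in a few lines. The loss function (x - 3)^2, the learning rate, and the momentum fraction `beta = 0.9` are assumptions for this illustration; the sketch only shows the mechanics of the update on a simple bowl-shaped loss, not an escape from a local minimum.

```python
def d_loss(x):
    # Derivative of the hypothetical loss (x - 3)^2, whose minimum is at x = 3.
    return 2 * (x - 3)

def momentum_descent(x=0.0, learning_rate=0.1, beta=0.9, steps=500):
    velocity = 0.0  # the previous update; starts at zero
    for _ in range(steps):
        # Current update = fraction of the previous update + the usual gradient step.
        velocity = beta * velocity + learning_rate * d_loss(x)
        x -= velocity
    return x

result = momentum_descent()
print(round(result, 4))
```

Because part of the previous update carries over, the parameter keeps moving even where the gradient is small, which is the property that can help it roll through flat regions and shallow dips.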
How do you find the point with the minimum SSR (Analytical Solution)?
Calculate the SSR for different y-axis intercept values and plot them. If you connect all the SSR values you get a curve, and that curve has an equation. Take the derivative of this function and solve for the intercept where the derivative (the slope of the curve) equals zero.
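For the intercept of a straight line with a fixed slope, setting the derivative to zero can be solved by hand, and the result is easy to check in code. The function name `best_intercept`, the three (Weight, Height) observations, and the fixed slope of 0.64 are assumptions used only for this sketch.

```python
def best_intercept(weights, heights, slope):
    """Intercept that minimizes the SSR when the slope is held fixed.

    Setting d SSR / d intercept = sum(-2 * (h - (b + slope * w))) = 0
    and solving for b gives b = mean(h - slope * w).
    """
    offsets = [h - slope * w for w, h in zip(weights, heights)]
    return sum(offsets) / len(offsets)

# Assumed example data and fixed slope.
intercept = best_intercept([0.5, 2.3, 2.9], [1.4, 1.9, 3.2], slope=0.64)
print(round(intercept, 2))
```

This direct solution exists because the SSR curve for a line is a simple parabola; for models without such a closed form, the iterative method is used instead.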
What is the problem with Gradient Descent?
It is possible that we get stuck at the bottom of a local minimum instead of finding our way to the bottom of the global minimum.
How are derivatives used in Gradient Descent?
The derivative tells us in which direction to take a step and how large that step should be, so let's learn how to take the derivative of the SSR!
Why is Stochastic Gradient Descent important?
What if we had a more complicated model with 10,000 parameters? Then we would have 10,000 derivatives to compute, each one over the entire dataset. Stochastic Gradient Descent randomly selects one data point from the dataset for each iteration and computes the gradient based on that single data point. This greatly reduces the amount of computation needed per iteration, which makes optimization on large datasets faster and more practical. Additionally, the randomness can make SGD less likely than standard Gradient Descent to get stuck in local minima.