Gradient descent and stochastic gradient descent

In this post I would like to introduce gradient descent and its applications.

I hope that every reader of this blog is familiar with derivatives and gradients of simple functions learned in school, such as f(x)=7x²+3x+9. If you differentiate this function, you get the derivative f'(x)=14x+3.
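If you want to check such derivatives yourself, here is a minimal sketch using the sympy library (my choice for illustration, not something this post depends on):

```python
import sympy as sp

x = sp.symbols("x")
f = 7 * x**2 + 3 * x + 9

f_prime = sp.diff(f, x)  # differentiate f with respect to x
print(f_prime)           # prints: 14*x + 3
```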

However, functions can also depend on several variables; such a function could look like this:
f(x, y) = x²+y²
or f(x₁, x₂) = x₁²+x₂² (just a notational difference)

To differentiate such a function, you take partial derivatives.

That means you differentiate with respect to each variable once, where in each individual derivative of a variable xᵢ all other variables are treated as constants. So you differentiate the function f(x₁, x₂) = x₁²+x₂² twice (once with respect to x₁ and once with respect to x₂).

This results in the following two derivatives:

  • Derivative with respect to x₁: ∂f(x₁, x₂)/∂x₁ = 2x₁
  • Derivative with respect to x₂: ∂f(x₁, x₂)/∂x₂ = 2x₂

Sidenote: ∂ is the symbol used to denote a partial derivative; the variable below the fraction bar is the one you differentiate with respect to.

One can now combine all partial derivatives into a vector. This vector is called the gradient:

∇f(x₁, x₂) = (∂f(x₁, x₂)/∂x₁, ∂f(x₁, x₂)/∂x₂) = (2x₁, 2x₂)

It is important to know that the gradient of a function f at a given point gives the direction of steepest ascent at that point; its magnitude is the rate of change in that direction.
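The same kind of sympy sketch as above works for partial derivatives and the gradient; again, the library choice is just for illustration:

```python
import sympy as sp

x1, x2 = sp.symbols("x1 x2")
f = x1**2 + x2**2

# Partial derivatives: all other variables are treated as constants
df_dx1 = sp.diff(f, x1)  # 2*x1
df_dx2 = sp.diff(f, x2)  # 2*x2

# The gradient is simply the vector of all partial derivatives
gradient = sp.Matrix([df_dx1, df_dx2])
print(gradient.T)  # Matrix([[2*x1, 2*x2]])

# Evaluated at a concrete point such as (3, 4), it gives the
# direction of steepest ascent at that point
print(gradient.subs({x1: 3, x2: 4}).T)  # Matrix([[6, 8]])
```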

With this knowledge we can now understand the “Gradient Descent Algorithm” – an algorithm we can use to optimize a function. If we have a function to optimize, for example a cost/loss function, where we know the inputs but want to change the weights of the function so that the output is minimal, we can do that with this algorithm.

Sidenote: If the function is convex, like f(x)=x², it is called convex optimization, otherwise it is called non-convex optimization.

Convexity of the function to be optimized is favorable because gradient descent then reaches the global minimum of the function, whereas when optimizing a non-convex function the algorithm may converge to a local minimum rather than the global one.
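As a tiny illustration of this (jumping ahead to the update rule explained below; the function and start values are made up), gradient descent on the non-convex f(x) = x⁴ − x³ − 2x² ends in a different minimum depending on where it starts:

```python
def descend(x, learning_rate=0.02, steps=500):
    """Plain gradient descent on f(x) = x**4 - x**3 - 2*x**2."""
    for _ in range(steps):
        x -= learning_rate * (4 * x**3 - 3 * x**2 - 4 * x)  # f'(x)
    return x

# Same algorithm, different start values, different minima:
print(descend(1.0))    # ≈  1.44, the global minimum
print(descend(-1.0))   # ≈ -0.69, only a local minimum
```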

For the explanation of the gradient descent algorithm I use the sum of the squared differences as the function to optimize; more precisely, it is a loss function, because we want to minimize its numerical value at the end.

  • f(x) = ŷ = intercept + slope*x = b + m*x
  • squared residual = (y-ŷ)² = (y-intercept-slope*x)²
  • Loss function: L(squared residuals)
    • = ∑(squared residuals) = (y₁-intercept-slope*x₁)² + … + (yₙ-intercept-slope*xₙ)²
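To make this concrete, here is a minimal sketch of the model and loss in plain Python; the data points are invented example values:

```python
def predict(x, intercept, slope):
    """The model: y_hat = intercept + slope * x."""
    return intercept + slope * x

def loss(xs, ys, intercept, slope):
    """Sum of the squared residuals over all data points."""
    return sum((y - predict(x, intercept, slope)) ** 2 for x, y in zip(xs, ys))

xs = [1.0, 2.0, 3.0]          # invented example data
ys = [2.0, 4.1, 5.9]
print(loss(xs, ys, intercept=0.0, slope=1.0))  # loss at some starting values
```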

For a better understanding, read the blog article on simple linear regression!

Applying gradient descent to the function L(squared residuals) to find optimal values for intercept and slope:

  1. Choose initial values for the parameters to be optimized, in this case for intercept and slope, for example 0 and 1
  2. Determine the gradients (differentiate the function L(squared residuals) with respect to intercept and slope, since these are what we want to optimize)  (Tip: because the loss function is the sum of the squared residuals, we can just differentiate the single squared-residual term (y-intercept-slope*x)² once and sum that derivative over all x’s)
  3. Calculate the slope from each derivative by plugging in the current values for intercept and slope together with the y’s and x’s
  4. Calculate the Step Size: Step Size = Slope * Learning Rate (see below for what the learning rate is exactly)
  5. Update the individual values to be optimized: New Weight = Old Weight – Step Size (e.g. Intercept: New Intercept = Old Intercept – Step Size)
  6. Repeat steps 3 to 5 as many times as set beforehand (for example 10,000 times) or until the Step Size becomes smaller than e.g. 0.0001 (a common stopping threshold in machine learning) – when iterating through steps 3 to 5, the updated values from step 5 are used in step 3 to calculate the slope
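Putting steps 1 to 6 together, a minimal sketch of the whole loop could look as follows; the data, learning rate, and stopping threshold are made-up example values, not universal defaults:

```python
def gradients(xs, ys, intercept, slope):
    """Partial derivatives of L(squared residuals), summed over all points:
    d/d(intercept) (y - intercept - slope*x)**2 = -2 * (y - intercept - slope*x)
    d/d(slope)     (y - intercept - slope*x)**2 = -2 * x * (y - intercept - slope*x)
    """
    d_intercept = sum(-2 * (y - intercept - slope * x) for x, y in zip(xs, ys))
    d_slope = sum(-2 * x * (y - intercept - slope * x) for x, y in zip(xs, ys))
    return d_intercept, d_slope

def gradient_descent(xs, ys, learning_rate=0.01, max_iters=10_000, tol=0.0001):
    intercept, slope = 0.0, 1.0                         # step 1: initial values
    for _ in range(max_iters):                          # step 6: repeat
        d_i, d_s = gradients(xs, ys, intercept, slope)  # steps 2 + 3
        step_i = d_i * learning_rate                    # step 4: step sizes
        step_s = d_s * learning_rate
        intercept -= step_i                             # step 5: update weights
        slope -= step_s
        if max(abs(step_i), abs(step_s)) < tol:         # step 6: early stop
            break
    return intercept, slope

xs = [1.0, 2.0, 3.0]   # invented example data
ys = [2.0, 4.1, 5.9]
print(gradient_descent(xs, ys))  # roughly (0.1, 1.95) for this data
```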

What is this “Step Size”? If you think of the algorithm graphically, you first determine the slope at a point of the function. If the slope is negative, the minimum lies ahead of you, and subtracting the negative step size moves you one step forward (New Intercept = Old Intercept – (negative Step Size) → New Intercept = Old Intercept + Step Size). If the slope is positive, you step back, because going forward would climb away from the minimum. The length of each step is controlled by the so-called Learning Rate. If it is too large, you may overshoot the minimum and never reach it, because the steps are simply not granular enough. If it is too small, you need very many steps and the computational cost can become extreme. It’s a trade-off.
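To see this trade-off in action, here is a tiny sketch on f(x) = x² (slope 2x) comparing a moderate and a too-large learning rate; the concrete numbers are chosen only for illustration:

```python
def minimize_parabola(learning_rate, steps=20):
    """Gradient descent on f(x) = x**2, whose slope at x is 2*x."""
    x = 5.0
    for _ in range(steps):
        x -= learning_rate * 2 * x   # step size = slope * learning rate
    return x

print(minimize_parabola(0.1))   # shrinks towards the minimum at 0
print(minimize_parabola(1.1))   # overshoots: |x| grows with every step
```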

At the end of running the algorithm, you have successfully optimized the values!

Now the question remains: what is “Stochastic Gradient Descent”?

If you run gradient descent on, for example, a linear regression with 100 factors, you have to compute 100 derivatives of the loss function. Add to that 20,000,000 samples and at least 1,000 iterations, and you get a total of 100 * 20,000,000 * 1,000 = 2,000,000,000,000 (2 trillion) calculations.

Since this takes an extremely long time, you can instead update the factors after a single sample (the loss function is then not the sum of all squared residuals but only the squared residual of one random sample). This is the strict definition of stochastic gradient descent: you choose a random sample and update the factors using only that sample. This is also done about 1,000 times, but it is still far more efficient. In practice, however, several random samples are usually grouped into a batch, which is then used for the update (mini-batch gradient descent).
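Here is a minimal sketch of this idea, reusing the derivative of a single squared residual from above and updating on one random sample per step (data and hyperparameters are again invented):

```python
import random

def sgd(xs, ys, learning_rate=0.01, iterations=1_000):
    intercept, slope = 0.0, 1.0
    for _ in range(iterations):
        # Pick one random sample; the "loss" is now just this one
        # squared residual instead of the sum over all samples
        i = random.randrange(len(xs))
        x, y = xs[i], ys[i]
        residual = y - intercept - slope * x
        intercept -= learning_rate * (-2 * residual)
        slope -= learning_rate * (-2 * x * residual)
    return intercept, slope

xs = [1.0, 2.0, 3.0]   # invented example data
ys = [2.0, 4.1, 5.9]
print(sgd(xs, ys))  # noisy, but close to the full-batch result
```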

Sidenote – ML lingo: In modern machine learning the factors are called features. Let’s say we have 100 features and 60,000 samples. You might then update the weights (slopes, intercepts, …) each time with a batch size of 10,000. With 60,000 samples we can run 6 batches without reusing samples. One full pass over the data is called an epoch. So usually you need to define a batch size and the number of epochs.
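The batching itself is simple bookkeeping; here is a sketch of how 60,000 samples split into 6 batches of 10,000 per epoch (the sample values are placeholders):

```python
def iter_batches(samples, batch_size):
    """Yield consecutive, non-overlapping batches of the given size."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

samples = list(range(60_000))   # stand-in for 60,000 training samples
for epoch in range(3):          # number of epochs, chosen arbitrarily
    for batch in iter_batches(samples, batch_size=10_000):
        pass  # one weight update per batch -> 6 updates per epoch
```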
