Multiple linear regression with gradient descent from scratch in Python

In this post I will show you how to calculate the optimal coefficients of a multiple linear regression using the gradient descent algorithm. In previous posts I already introduced you to simple linear regression (with one independent variable) and to the (stochastic) gradient descent algorithm.

First, I will briefly introduce what multiple linear regression is, and then I will present my from-scratch implementation of the gradient descent algorithm applied to multiple linear regression.

Multiple linear regression is similar to simple linear regression: in both cases you model a linear relationship. The difference is that multiple linear regression has two or more independent variables.

Simple linear regression: ŷ = b₀ + b₁x₀

Multiple linear regression: ŷ = b₀ + b₁x₀ + b₂x₁ + …

So there is more than one factor explaining ŷ – Example: the two factors gender and age are both important in predicting a person’s socioeconomic status.

Now let’s move on to implementing gradient descent to optimize the coefficients of a multiple linear regression in Python (GitHub link at the end):

Step 1: First we need to create a loss function. As a loss function I take the sum of the squared residuals:
∑(squared residuals) = (y₁-ŷ₁)² + … + (yₙ-ŷₙ)²
So we calculate ŷ (= b₀ + b₁x₀ + b₂x₁ + …) for every sample, subtract it from the actual y and square the result. We do this for all samples and add them up to get the sum of squared differences.
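In plain Python, the sum above can be computed for a tiny made-up example like this (the numbers are purely illustrative):

```python
# Sum of squared residuals for three hypothetical samples:
y_actual = [3.0, 5.0, 7.0]   # observed y values
y_hat = [2.5, 5.5, 6.0]      # predicted y values
ssr = sum((y - yh) ** 2 for y, yh in zip(y_actual, y_hat))
print(ssr)  # 0.25 + 0.25 + 1.0 = 1.5
```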

Using the library “SymPy” we can create symbolic functions and work with them in Python, so I will use it here. First I create the following equation with SymPy: (y-ŷ)² = (y-(b₀ + b₁x₀ + b₂x₁+…))². I will substitute the following variable names:
y -> y_actual
bᵢ -> betas[i]
xᵢ -> x_values[i]

Quickly, how SymPy works: in SymPy the variables (symbols) have to be created first, so I store them separately in lists.
Then I create the desired function – the loss function – with the help of these variables.

This I will store in “L_fct” as a string. I also have an empty list “gradient”, to which I will add the derivatives with respect to each beta to be optimized. With “smp.diff” I can differentiate the previously created function with SymPy, which I do once for each beta. Then I store the derivatives in the “gradient” list.

Sidenote: If you have stored a function as a string in a variable, for example f="2*x", and you have defined the variables of this function beforehand, then you can use Python’s built-in “eval” function to evaluate it (so print(eval(f)) with x=4 declared before -> output>> 8).
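A quick sketch of that trick (the variable names are just for illustration):

```python
# Tiny demo of the "eval" trick:
x = 4
f = "2*x"        # function stored as a string
print(eval(f))   # -> 8

# The same trick works with list lookups, which is how the generated
# SymPy strings like "betas[0] + betas[1]*x_values[0]" are evaluated later:
betas = [1.0, 2.0]
x_values = [3.0]
print(eval("betas[0] + betas[1]*x_values[0]"))  # -> 7.0
```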

import sympy as smp

def generate_loss_function_and_gradient(number_of_betas: int):
    # create variables:
    #   y -> y_actual
    #   b_0 -> betas[0], b_1 -> betas[1], ...
    #   x_0 -> x_values[0], x_1 -> x_values[1], ...
    y_actual = smp.symbols("y_actual", real=True)
    beta_symbols = []
    x_symbols = []
    for i in range(number_of_betas):
        beta_symbols.append(smp.symbols(f"betas[{i}]", real=True))
    for i in range(number_of_betas-1):
        x_symbols.append(smp.symbols(f"x_values[{i}]", real=True))

    # Add variables to the function which will look like this:
    # (y-ŷ)^2 = (y-(b_0 + b_1*x_0 + b_2*x_1 + ...))^2 -> (y_actual - (betas[0] + betas[1]*x_values[0] + ...))^2
    f = beta_symbols[0]
    for i in range(1, number_of_betas):
        f = f+beta_symbols[i]*x_symbols[i - 1]
    f = (y_actual-f)**2
    # Save function as a string
    L_fct = str(f).replace(" ", "")
    # calculate gradient - derive with respect to every beta
    gradientList = []
    for i in range(len(beta_symbols)):
        gradientList.append(str(smp.diff(f, beta_symbols[i])).replace(" ", ""))
    return L_fct, gradientList
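For intuition, this is what smp.diff returns on a minimal two-beta version of the same loss; I use plain symbol names here instead of the list-style names above:

```python
import sympy as smp

# Minimal two-beta version of the loss: (y - (b0 + b1*x0))^2
y, b0, b1, x0 = smp.symbols("y b0 b1 x0", real=True)
loss = (y - (b0 + b1 * x0)) ** 2

# Differentiate with respect to each beta, as the function above does
d_b0 = smp.diff(loss, b0)  # chain rule: 2*(y - b0 - b1*x0) * (-1)
d_b1 = smp.diff(loss, b1)  # chain rule: 2*(y - b0 - b1*x0) * (-x0)

print(d_b0)
print(d_b1)
```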

Step 2: Now we pass the gradient, all sample data (x-values and associated y-value = one sample), the betas we start with (which are then optimized with each iteration) and the learning rate to the optimization function. For each derivative in the gradient it sums the slopes over all samples, calculates the step size, and updates all betas in the last step.

def gradient_descent_algorithm(learning_rate: float, y_actual_list: list, x_values_population: list, betas: list, gradient: list):
    # Create list to save temporarily new betas
    new_betas = [0 for i in range(len(betas))]
    # Iterate through each derivative of the gradient
    for i in range(len(gradient)):
        tmp = 0
        # Calculate derivative ("slope") of every sample for the derivative i in gradient and sum them up in "tmp"
        for j in range(len(x_values_population)):
            y_actual = y_actual_list[j]
            x_values = x_values_population[j]
            tmp += eval(gradient[i])
        # Calculate Step size
        step_size = tmp * learning_rate
        # Calculate new beta
        new_betas[i] = betas[i]-step_size
    # When every beta is calculated: Update them
    for i in range(len(betas)):
        betas[i] = new_betas[i]
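As a sanity check on what the eval-based version computes, the same update can also be written with the hand-derived gradient ∂(y-ŷ)²/∂bᵢ = -2·(y-ŷ)·xᵢ₋₁ (and -2·(y-ŷ) for the intercept b₀). A minimal sketch with my own (hypothetical) function names:

```python
def predict(betas, x_values):
    # y_hat = b0 + b1*x0 + b2*x1 + ...
    y_hat = betas[0]
    for b, x in zip(betas[1:], x_values):
        y_hat += b * x
    return y_hat

def gradient_descent_step(learning_rate, y_actual_list, x_values_population, betas):
    new_betas = []
    for i in range(len(betas)):
        grad = 0.0
        for y_actual, x_values in zip(y_actual_list, x_values_population):
            residual = y_actual - predict(betas, x_values)
            # derivative of (y - y_hat)^2 with respect to beta_i
            grad += -2.0 * residual * (1.0 if i == 0 else x_values[i - 1])
        new_betas.append(betas[i] - learning_rate * grad)
    betas[:] = new_betas  # update in place, like the eval-based version

# quick check: a few hundred steps on perfect data y = 1 + 2*x
betas = [0.0, 0.0]
xs = [[0.0], [1.0], [2.0]]
ys = [1.0, 3.0, 5.0]
for _ in range(1000):
    gradient_descent_step(0.05, ys, xs, betas)
print(betas)  # -> approximately [1.0, 2.0]
```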

Step 3: I also wrote some code to compute the sum of squared differences, so the loop can stop either when the difference between the loss of the previous step and the current step is smaller than a threshold, or when the number of iterations reaches its maximum:

def loss_function(y_actual: float, x_values: list, betas: list, L_fct: str):
    loss_of_sample = eval(L_fct)
    return loss_of_sample


def sum_loss_function(y_actual: list, x_values_population: list, betas: list, L_fct: str):
    sum_losses = 0
    for i in range(len(y_actual)):
        sum_losses += loss_function(y_actual[i], x_values_population[i], betas, L_fct)
    return sum_losses

Last step: Test everything – To test everything I created 3 samples with 3 independent variables each (which means 4 betas: 3 factors + 1 intercept):

if __name__ == '__main__':
    betas_vals = [1, 1, 1, 1]  # 4 betas - random initial values (I set them all to one)
    x_vals = [[0.5, 2.1, 0.9],  # sample 1 - 3 features
              [2.3, 8.9, 4],  # sample 2 - 3 features
              [2.9, 12.2, 6]]  # sample 3 - 3 features
    y_vals = [-37.04, -207.0, -312.9]  # 3 outcomes

    # Generate loss function and gradient
    L_function, gradient_functions = generate_loss_function_and_gradient(len(betas_vals))
    # Print loss function
    print("Loss function:", L_function)

    # Create list which will hold betas from previous iteration to compare loss function from previous to current step
    # I do it to check whether to break out of the loop because the step is small enough
    betas_previous = [0 for i in range(len(betas_vals))]

    for i in range(100000):
        print("Iteration number", i + 1)
        # fill beta values from previous step
        for j in range(len(betas_vals)):
            betas_previous[j] = betas_vals[j]
        # Optimize betas
        gradient_descent_algorithm(0.0001, y_vals, x_vals, betas_vals, gradient_functions)
        # Compare losses from previous step to current -> break from loop if loss is small
        if abs(sum_loss_function(y_vals, x_vals, betas_vals, L_function) - sum_loss_function(y_vals, x_vals, betas_previous, L_function)) < 0.00001:
            break

    # Print optimized betas
    print("Optimized betas:", betas_vals)

    # Print actual and predicted y
    for i in range(len(x_vals)):
        print("Actual y:", y_vals[i], "- Predicted y:", betas_vals[0] + betas_vals[1] * x_vals[i][0] + betas_vals[2] * x_vals[i][1] + betas_vals[3] * x_vals[i][2])

Everything that we coded generates the following output:

Loss function: (-betas[0]-betas[1]*x_values[0]-betas[2]*x_values[1]-betas[3]*x_values[2]+y_actual)**2
Iteration number 1
Iteration number 2
...
Iteration number 99999
Iteration number 100000
Optimized betas: [15.39260868874023, 11.865689746680127, -12.9318031446267, -34.0024362801841]
Actual y: -37.04 - Predicted y: -36.43352569380147
Actual y: -207.0 - Predicted y: -208.41909800180952
Actual y: -312.9 - Predicted y: -311.9795070914377

Important sidenote: The whole procedure is very sensitive to the learning rate. If the step size is too large, the minimum can be skipped entirely, and the loss can even end up at a higher value of the convex function than where it started. If this happens, i.e. the loss grows larger and larger, the learning rate must be reduced further! What also helps is to normalize the x and y data beforehand (min-max scaling to the range 0 to 1). This is very common in machine learning. If you then want to use the optimized beta factors, you must always scale new data according to the same scheme as in the optimization procedure.
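A minimal sketch of such a min-max scaling helper (my own hypothetical function, not part of the post’s code):

```python
def min_max_scale(values):
    # scale a list of numbers to the range [0, 1]
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_scale([2.3, 8.9, 4.0])
print(scaled)  # smallest value -> 0.0, largest -> 1.0
```

To apply the optimized betas to new data later, you would have to reuse the same lo and hi from the training data instead of recomputing them.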

This was my post about implementing the gradient descent algorithm for multiple linear regression models. I hope you enjoyed it! The full, aggregated code can also be found here: github.com/Heuristic-Analyst/…
