In this post I will show you how to calculate the optimal values of a multiple linear regression using the gradient descent algorithm. In the previous posts I already introduced you to the simple linear regression (of one independent variable) and also to the (stochastic) gradient descent algorithm.
First, I will briefly introduce what multiple linear regression is and then I will present my implementation of the gradient descent algorithm created by scratch applied to multiple linear regression.
The multiple linear regression is like the simple linear regression. You try to represent a linear relationship. The difference is that in multiple linear regression there are 2 or more independent variables.
Simple linear regression:
Multiple linear regression:
So there are more than one than faktors explaining the ŷ – Example: The 2 factors gender and age are important in predicting a person’s socioeconomic status
Now let’s move on to the implementation of gradient descent to optimize the factors of a multiple linear regression in Python: (Github link at the end)
Step 1: First we need to create a loss function. As a loss function I take the sum of the squared residuals:
∑(squared residuals) = (y₁-ŷ₁)² + … + (yᵢ-ŷᵢ)²
So we calculate ŷ (=b₀ + b₁x₀ + b₂x₁+…) of every sample sample, subtract it from the actual y and square it. We do this for all samples and add them to get th sum of squared differences.
Using the library “SymPy” we can create functions and work with Python, so I will use that here. First I create the following equation with SymPy: (y-ŷ)² =(y-(b₀ + b₁x₀ + b₂x₁+…))². I will sibstitute the following variable names:
y -> y_actual
bᵢ -> betas[i]
xᵢ -> x_values[i]
Quickly how SymPy work: In SymPy, the variables have to be created first -> I store them separately in lists.
Then I create the desired function, loss function, with the help of these variables.
This I will store in “L_fct” as string. I also have an empty list “gradient”, in which I will add all derivatives after the betas to be optimized. With “smp.diff” I can derive the previously created function with SymPy, which I do once for each beta. Then I store the derivatives in the “gradient” list.
Sidenote: If you have stored a function as a string in a variable, for example f=”2x”, and you have defined the variables of this function before, then you can use the “eval” function to calculate the function (so print(eval(f)) with x=4 declared before -> output>> 8).
import sympy as smp
def generate_loss_function_and_gradient(number_of_betas: int):
# create variables:
# y -> y_actual
# b_0 -> betas[0], b_1 -> betas[1], ...
# x_0 -> x_values[0], x_1 -> x_values[1], ...
y_actual = smp.symbols("y_actual", real=True)
beta_symbols = []
x_symbols = []
for i in range(number_of_betas):
beta_symbols.append(smp.symbols(f"betas[{i}]", real=True))
for i in range(number_of_betas-1):
x_symbols.append(smp.symbols(f"x_values[{i}]", real=True))
# Add variables to the function which will look like this:
# (y-ŷ)^2 = (y-(b_0 + b_1*x_0 + b_2*x_1 + ...))^2 -> (y_actual - (betas[0] + betas[1]*x_values[0] + ...))^2
f = beta_symbols[0]
for i in range(1, number_of_betas):
f = f+beta_symbols[i]*x_symbols[i - 1]
f = (y_actual-f)**2
# Save function as a string
L_fct = str(f).replace(" ", "")
# calculate gradient - derive with respect to every beta
gradientList = []
for i in range(len(beta_symbols)):
gradientList.append(str(smp.diff(f, beta_symbols[i])).replace(" ", ""))
return L_fct, gradientList
Step 2: Now we have to pass the gradient, all sample data (x-values and associated y-value = 1 sample), the betas with which we started (and then optimized with each iteration) and the learning rate to the optimization function. Then calculate the descent values for each descent and update all betas in the last step.
def gradient_descent_algorithm(learning_rate:float, y_actual_list: list, x_values_population:list, betas: list, gradient: list):
# Create list to save temporarily new betas
new_betas = [0 for i in range(len(betas))]
# Iterate through each derivative of the gradient
for i in range(len(gradient)):
tmp = 0
# Calculate derivative ("slope") of every sample for the derivative i in gradient and sum them up in "tmp"
for j in range(len(x_values_population)):
y_actual = y_actual_list[j]
x_values = x_values_population[j]
tmp += eval(gradient[i])
# Calculate Step size
step_size=tmp*learning_rate
# Calculate new beta
new_betas[i] = betas[i]-step_size
# When every beta is calculated: Update them
for i in range(len(betas)):
betas[i] = new_betas[i]
Step 3: I wrote some code to calculate the sum of squared differences to stop either when the difference of the loss from the previous step to this step is smaller than a threshold, or until the number of iterations has reached its maximum:
def loss_function(y_actual: float, x_values: list, betas: list, L_fct: str):
loss_of_sample = eval(L_fct)
return loss_of_sample
def sum_loss_function(y_actual: list, x_values_population: list, betas: list, L_fct: str):
sum_losses = 0
for i in range(len(y_actual)):
sum_losses += loss_function(y_actual[i], x_values_population[i], betas, L_fct)
return sum_losses
Last step: Test everything – To test everything I created 3 sample data with 3 independent variables each (that are then 4 betas – 3 factors + 1 intercept):
if __name__ == '__main__':
betas_vals = [1, 1, 1, 1] # 4 factors - random initial values (I set them to all be ones)
x_vals = [[0.5, 2.1, 0.9], # sample 1 - 3 features
[2.3, 8.9, 4], # sample 2 - 3 features
[2.9, 12.2, 6]] # sample 3 - 3 features
y_vals = [-37.04, -207.0, -312.9] # 3 outcomes
# Generate loss function and gradient
L_function, gradient_functions = generate_loss_function_and_gradient(len(betas_vals))
# Print loss function
print("Loss function:", L_function)
# Create list which will hold betas from previous iteration to compare loss function from previous to current step
# I do it to check whether to break out of the loop because the step is small enough
betas_previous = [0 for i in range(len(betas_vals))]
for i in range(100000):
print("Iteration number", i + 1)
# fill beta values from previous step
for j in range(len(betas_vals)):
betas_previous[j] = betas_vals[j]
# Optimize betas
gradient_descent_algorithm(0.0001, y_vals, x_vals, betas_vals, gradient_functions)
# Compare losses from previous step to current -> break from loop if loss is small
if abs(sum_loss_function(y_vals, x_vals, betas_vals, L_function) - sum_loss_function(y_vals, x_vals, betas_previous, L_function)) < 0.00001:
break
# Print optimized betas
print("Optimized betas:", betas_vals)
# Print actual and predicted y
for i in range(len(x_vals)):
print("Actual y:", y_vals[i], "- Predicted y:", betas_vals[0] + betas_vals[1] * x_vals[i][0] + betas_vals[2] * x_vals[i][1] + betas_vals[3] * x_vals[i][2])
Everything that we coded generate the following output:
Loss function: (-betas[0]-betas[1]*x_values[0]-betas[2]*x_values[1]-betas[3]*x_values[2]+y_actual)**2
Iteration number 1
Iteration number 2
...
Iteration number 99999
Iteration number 100000
Optimized betas: [15.39260868874023, 11.865689746680127, -12.9318031446267, -34.0024362801841]
Actual y: -37.04 - Predicted y: -36.43352569380147
Actual y: -207.0 - Predicted y: -208.41909800180952
Actual y: -312.9 - Predicted y: -311.9795070914377
Important sidenote: The whole procedure is very sensitive to the learning rate. It is possible that the step size is so large that the minimum is completely skipped and even a higher numerical value of the convex function is reached than intended. If this happens, i.e. the loss becomes larger and larger, then the learning rate must be reduced further! What also helps is to standardize the data x and y before (between 0 and 1). This is very common in machine learning. If you then want to use the optimized beta factors, you must always standardize the data according to the same scheme as in the optimization procedure.
This was my post about implementing the gradient descent algorithm for multiple linear regression models. I hope you enjoyed it! The full, aggregated code can also be found here: github.com/Heuristic-Analyst/…