In this post, I describe a simple method of fitting a simple linear regression using two formulas.
"Simple linear regression" is a linear regression model with a single explanatory variable. A linear regression tries to fit the data with a straight line, hence the name.
A linear model follows this equation:

y = m*x + b

Where:
- m: slope
- b: intercept with the y-axis
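To make the equation concrete, here is a minimal sketch in Python (the values m = 2 and b = 1 are made-up illustration values, not from this post):

# Hypothetical linear model with slope m = 2 and y-intercept b = 1
m, b = 2, 1
def predict(x):
    return m * x + b

print(predict(3))  # 2*3 + 1 = 7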
Look at the following picture:
- In grey are the data points we want to forecast
- y is the dependent variable, x is the independent variable
- The blue line is the linear model we are trying to find
- The forecasts made by the linear model are denoted ŷ
The best linear model here (using ordinary least squares optimization) is the one that minimizes the sum of the squared differences between the observed values y and the predictions ŷ.
Sum of squared differences:

SSE = Σ (yᵢ - ŷᵢ)²
We square the differences so that deviations of points below the line (negative) and points above the line (positive) do not cancel each other out when summed.
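A minimal sketch of this sum in Python (the function name sum_squared_errors and the toy inputs are my own, not from this post):

# Sum of squared differences between observed y values and predictions ŷ
def sum_squared_errors(y_values, y_predictions):
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_values, y_predictions))

# Toy example: three observations vs. a line's predictions
print(sum_squared_errors([1, 2, 3], [1.1, 1.9, 3.2]))  # 0.01 + 0.01 + 0.04 ≈ 0.06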
To minimize the sum of squared differences we use the following equations to get the slope m and the y-intercept b:

m = Σ (xᵢ - x̄)(yᵢ - ȳ) / Σ (xᵢ - x̄)²
b = ȳ - m * x̄
Sidenote: m is essentially the covariance of X and Y divided by the variance of X (the N (or N-1) factors in the covariance and variance formulas cancel out).
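As a quick worked example (toy numbers of my own, not from this post): for the points (1, 2), (2, 4), (3, 6) we get x̄ = 2 and ȳ = 4, so m = ((1-2)(2-4) + (2-2)(4-4) + (3-2)(6-4)) / ((1-2)² + (2-2)² + (3-2)²) = (2 + 0 + 2) / (1 + 0 + 1) = 2 and b = 4 - 2*2 = 0, which recovers the line y = 2x exactly.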
Once you have calculated both m and b, your linear model (y = m*x + b) is fitted to the data and ready to be tested and then maybe even used!
Simple linear regression in Python: (github.com/Heuristic-Analyst/…)
import random
import matplotlib.pyplot as plt

# Generate noisy data with a roughly linear trend (slope around -70)
x = [i for i in range(1000)]
y = [-70*(i-(random.random()*1000-250))+130 for i in range(1000)]

# Plot the raw data points
plt.scatter(x, y)
plt.show()
def linear_regression(x_values: list, y_values: list):
    x_mean = 0
    y_mean = 0
    n = 0
    m_sum_numerator = 0
    m_sum_denominator = 0
    m = 0
    b = 0
    # Compute the means of x and y
    for i in range(len(x_values)):
        x_mean += x_values[i]
        y_mean += y_values[i]
        n += 1
    x_mean /= n
    y_mean /= n
    # m = Σ (xᵢ - x̄)(yᵢ - ȳ) / Σ (xᵢ - x̄)²
    for i in range(len(x_values)):
        m_sum_numerator += (x_values[i] - x_mean) * (y_values[i] - y_mean)
        m_sum_denominator += (x_values[i] - x_mean) ** 2
    m = m_sum_numerator / m_sum_denominator
    # b = ȳ - m * x̄
    b = y_mean - m * x_mean
    return m, b
m, b = linear_regression(x, y)
print(m, b)
Output >>> -69.8562263813947 17009.049915164287
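As a sanity check (my own addition, not part of the original post), the same fit can be reproduced with numpy.polyfit, which should return nearly identical coefficients:

import numpy as np

# A degree-1 polynomial fit is exactly a simple linear regression
m_np, b_np = np.polyfit(x, y, 1)
print(m_np, b_np)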
# Predict y for every x using the fitted model
y_predict = []
for x_value in x:
    y_predict.append(m*x_value+b)

# Plot the data points and the fitted regression line
plt.scatter(x, y)
plt.plot(x, y_predict, color="red")
plt.show()
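To quantify how well the fitted line matches the data (my own addition, not from this post), one common metric is the coefficient of determination R². A minimal sketch, reusing x, y, and y_predict from above:

# R² = 1 - SSE / SST, where SST is the total sum of squares around the mean of y
y_mean = sum(y) / len(y)
sse = sum((y[i] - y_predict[i]) ** 2 for i in range(len(y)))
sst = sum((y_i - y_mean) ** 2 for y_i in y)
r_squared = 1 - sse / sst
print(r_squared)  # closer to 1 means a better fit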