In this post, I describe a simple method of fitting a simple linear regression using two formulas.
"Simple linear regression" is a linear regression model with a single explanatory variable. A linear regression tries to fit the data with a straight line, hence the name.
A linear model follows this equation:

y = m*x + b

Where:
- m: slope
- b: intercept with the y-axis
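To make the equation concrete, here is a minimal sketch in Python (the values m = 2 and b = 1 are made-up illustration values, not from this post):

# Hypothetical linear model with slope m = 2 and y-intercept b = 1
m, b = 2, 1
def predict(x):
    return m * x + b

print(predict(3))  # 2*3 + 1 = 7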
Look at the following picture:
- In grey are the data points we want to forecast
- y is the dependent variable, x is the independent variable
- The blue line is the linear model we are trying to find
- The forecasts made by the linear model are denoted ŷ
The best linear model here (using ordinary least squares optimization) is the one that minimizes the sum of the squared differences between the observed values y and the predictions ŷ.
Sum of squared differences:

SSE = Σ (yᵢ - ŷᵢ)²
We square the differences so that deviations of points below the line (negative) and points above the line (positive) do not cancel each other out when summed.
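A minimal sketch of this sum in Python (the function name sum_squared_errors and the toy inputs are my own, not from this post):

# Sum of squared differences between observed y values and predictions ŷ
def sum_squared_errors(y_values, y_predictions):
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_values, y_predictions))

# Toy example: three observations vs. a line's predictions
print(sum_squared_errors([1, 2, 3], [1.1, 1.9, 3.2]))  # 0.01 + 0.01 + 0.04 ≈ 0.06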
To minimize the sum of squared differences we use the following equations to get the slope m and the y-intercept b:

m = Σ (xᵢ - x̄)(yᵢ - ȳ) / Σ (xᵢ - x̄)²
b = ȳ - m * x̄
Sidenote: m is essentially the covariance of X and Y divided by the variance of X (the N (or N-1) factors in the covariance and variance formulas cancel out).
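As a quick worked example (toy numbers of my own, not from this post): for the points (1, 2), (2, 4), (3, 6) we get x̄ = 2 and ȳ = 4, so m = ((1-2)(2-4) + (2-2)(4-4) + (3-2)(6-4)) / ((1-2)² + (2-2)² + (3-2)²) = (2 + 0 + 2) / (1 + 0 + 1) = 2 and b = 4 - 2*2 = 0, which recovers the line y = 2x exactly.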
Once you have calculated both m and b, your linear model (y = m*x + b) is fitted to the data and ready to be tested and then maybe even used!
Simple linear regression in Python: (github.com/Heuristic-Analyst/…)
import random
import matplotlib.pyplot as plt

# Generate noisy data with a roughly linear trend (slope around -70)
x = [i for i in range(1000)]
y = [-70*(i-(random.random()*1000-250))+130 for i in range(1000)]

# Plot the raw data points
plt.scatter(x, y)
plt.show()
def linear_regression(x_values: list, y_values: list):
    x_mean = 0
    y_mean = 0
    n = 0
    m_sum_numerator = 0
    m_sum_denominator = 0
    m = 0
    b = 0
    # Compute the means of x and y
    for i in range(len(x_values)):
        x_mean += x_values[i]
        y_mean += y_values[i]
        n += 1
    x_mean /= n
    y_mean /= n
    # m = Σ (xᵢ - x̄)(yᵢ - ȳ) / Σ (xᵢ - x̄)²
    for i in range(len(x_values)):
        m_sum_numerator += (x_values[i] - x_mean) * (y_values[i] - y_mean)
        m_sum_denominator += (x_values[i] - x_mean) ** 2
    m = m_sum_numerator / m_sum_denominator
    # b = ȳ - m * x̄
    b = y_mean - m * x_mean
    return m, b
m, b = linear_regression(x, y)
print(m, b)
Output >>> -69.8562263813947 17009.049915164287
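As a sanity check (my own addition, not part of the original post), the same fit can be reproduced with numpy.polyfit, which should return nearly identical coefficients:

import numpy as np

# A degree-1 polynomial fit is exactly a simple linear regression
m_np, b_np = np.polyfit(x, y, 1)
print(m_np, b_np)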
# Predict y for every x using the fitted model
y_predict = []
for x_value in x:
    y_predict.append(m*x_value+b)

# Plot the data points and the fitted regression line
plt.scatter(x, y)
plt.plot(x, y_predict, color="red")
plt.show()
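To quantify how well the fitted line matches the data (my own addition, not from this post), one common metric is the coefficient of determination R². A minimal sketch, reusing x, y, and y_predict from above:

# R² = 1 - SSE / SST, where SST is the total sum of squares around the mean of y
y_mean = sum(y) / len(y)
sse = sum((y[i] - y_predict[i]) ** 2 for i in range(len(y)))
sst = sum((y_i - y_mean) ** 2 for y_i in y)
r_squared = 1 - sse / sst
print(r_squared)  # closer to 1 means a better fit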