Right away the most important thought in this post:
- Covariance by itself only gives the direction of correlatedness
- Without knowing the scales of your variables, you can’t tell which variables are more correlated
Imagine we have a sample X filled with weight values of people.
In addition we also have another sample Y with height values of people.
Now one can plot both samples X and Y on a xy-coordinate – as a result we would get a scatter plot.
With the calculation of the covariance of X and Y we can see the 3 possible relationships between both:
- Positive covariance – positive trend – weight rises and height rises
- Negative covariance – negative trend – weight decline and height decline
- covariance equals 0 – no trend – same height but rising or declinging weight (or vice verca)
The formula of a covariance of X and Y – cov(X,Y) – is the following:
Formula to calculate covariance for a population:
Formula to calculate covariance for a sample:
where:
- x_i and y_i = Population/sample values
- x and y with line = Mean of all given x and y
- N = Number/Amount of given x-values (same as Amount of given y-values)
As you can see in the formula you could swap X and Y ( covariance(X,Y) = covariance(Y,X) )
Sidenote: For the explanation of why there is a difference in the sample and population covariance look at the previous post “Variance and standard deviation“
In the formula above one can see that (x_i – x_mean) can be positive or negative, same for the y-values. Because of that there are four variations with two outcomes in regards to positivity or negativity of the number:
- (x_i – x_mean) can be positive and (y_i – y_mean) can be positive = positive number
- (x_i – x_mean) can be positive and (y_i – y_mean) can be negative = negative number
- (x_i – x_mean) can be negative and (y_i – y_mean) can be positive = positive number
- (x_i – x_mean) can be negative and (y_i – y_mean) can be negative = negative number
This depends on the distribution of the values (respective the mean value).
Covariance is the stepping stone to calculate many other things, like the correlation of X an Y.
The Pearson correlation does not say whether there is a positive, negative trend, but is a measurement of how strong the relationship between X and Y is.
The formula to calculate the Pearson correlation is the following:
Sidenote: You can also multiply the variances first and take the square root of the product, no difference (square root rule)
Keep in mind, covariances are not compareable. This is why it is good to get a better grasp of the relationship of covariance and correlation. One can think of correlation being the scaled covariance from 0 to 1, which we achive through scaling it with the individual variations.
Implementation of covariance and Pearson correlation in Python code: (github.com/Heuristic-Analyst/…)
from math import sqrt
def covariance(x_values:list, y_values:list, population_or_sample:str):
x_mean = 0
y_mean = 0
sum_x_y_diff = 0
n = 0
for i in range(len(x_values)):
x_mean += x_values[i]
y_mean += y_values[i]
n += 1
x_mean /= n
y_mean /= n
for i in range(len(x_values)):
sum_x_y_diff += (x_values[i]-x_mean)*(y_values[i]-y_mean)
if population_or_sample == "population":
return sum_x_y_diff/n
elif population_or_sample == "sample":
return sum_x_y_diff/(n-1)
def variance(x_values:list, population_or_sample:str):
x_mean = 0
squared_sum_x = 0
n = 0
for x in x_values:
x_mean += x
n += 1
x_mean /= n
for x in x_values:
squared_sum_x += (x-x_mean)**2
if population_or_sample == "population":
return squared_sum_x/n
elif population_or_sample == "sample":
return squared_sum_x/(n-1)
def correlation(x_values:list, y_values:list, population_or_sample:str):
corr = covariance(x_values, y_values) / sqrt(variance(x_values, population_or_sample) * variance(y_values, population_or_sample))
return corr