In this post I explain what the variance and standard deviation are and how to calculate them, answer two common questions that school books and teachers mostly leave unclear, and share Python code to calculate both.
To quantify how much values vary around a mean, one can use the so-called “variance” and “standard deviation”.
Before we jump into the formulas, I should explain the difference between population and sample:
Population: the population is the entire group about which you want to draw conclusions.
Sample: A sample is the specific group you are collecting data from in order to use it to draw conclusions about the characteristics of the population.
(Side note: the collection methodology, also known as sampling, is therefore extremely important for any sample, because the sample should reflect the characteristics of the population.)
The formulas of the variance for the population and the sample differ:
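Population formula:
$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2$$
Sample formula:
$$s^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2$$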
where:
- x_i = Population/sample values
- $\bar{x}$ (x with a line) = Mean of all given x
- N = Number of given x
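To make the two formulas concrete, here is a minimal sketch that applies them step by step to a few heights. The values in `heights_cm` are made up for illustration, and the cross-check uses `pvariance` and `variance` from Python's standard `statistics` module.

```python
import math
import statistics

# Made-up example data (heights in cm), not taken from the post
heights_cm = [160, 170, 180, 175, 165]

n = len(heights_cm)
mean = sum(heights_cm) / n  # x with line

# Squared difference of every value to the mean
squared_diffs = [(x - mean) ** 2 for x in heights_cm]

population_variance = sum(squared_diffs) / n      # divide by N
sample_variance = sum(squared_diffs) / (n - 1)    # divide by N - 1

print(population_variance)  # 50.0
print(sample_variance)      # 62.5

# Cross-check against the standard library
assert math.isclose(population_variance, statistics.pvariance(heights_cm))
assert math.isclose(sample_variance, statistics.variance(heights_cm))
```

The only difference between the two results is the divisor, so the sample variance is always the larger of the two.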
Graphically you can imagine this plot:
In the example above you see the differences between the individual heights and the sample mean. Because we square the differences, we also square the units, so the variance is in cm². That is why we need the standard deviation, which is just the square root of the variance and is therefore back in cm.
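Standard deviation formula (for population and sample):
$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2} \qquad s = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2}$$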
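As a small sketch of the step from variance to standard deviation, the snippet below takes the square root of both variances and compares the result with `pstdev` and `stdev` from the standard `statistics` module. The heights are made-up example values.

```python
import math
import statistics

# Made-up example data (heights in cm), not taken from the post
heights_cm = [160, 170, 180, 175, 165]

# Variance is in cm², the square root brings us back to cm
population_std = math.sqrt(statistics.pvariance(heights_cm))
sample_std = math.sqrt(statistics.variance(heights_cm))

print(round(population_std, 2))  # 7.07
print(round(sample_std, 2))      # 7.91

# The standard library offers the same results directly
assert math.isclose(population_std, statistics.pstdev(heights_cm))
assert math.isclose(sample_std, statistics.stdev(heights_cm))
```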
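Now you may ask: “Why don’t we just sum up the absolute values?” Good question.
Answer: Mostly convention. Summing the absolute values also gives a valid measure of spread (the mean absolute deviation), but the squared version was introduced roughly 100 years ago and is used as a basis for many more calculations and analyses.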
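For comparison, here is a minimal sketch of what "just summing the absolute values" would look like next to the squared version. The function names `mean_absolute_deviation` and `population_std` are hypothetical helpers written for this illustration, and the data is made up.

```python
from math import sqrt

# Made-up example data (heights in cm), not taken from the post
heights_cm = [160, 170, 180, 175, 165]


def mean_absolute_deviation(values):
    # Average of the absolute differences to the mean (no squaring)
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)


def population_std(values):
    # Square root of the average squared difference to the mean
    mean = sum(values) / len(values)
    return sqrt(sum((x - mean) ** 2 for x in values) / len(values))


mad = mean_absolute_deviation(heights_cm)
std = population_std(heights_cm)

print(mad)            # 6.0
print(round(std, 2))  # 7.07
```

Both numbers are in cm and both describe the spread. The squared version weights large differences more heavily, which is why it comes out larger here.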
Another frequently asked question: “Why is there a -1 in the sample variance formula and not in the population formula?”
Also a very good question.
Answer: Suppose you draw a sample S from a population P. Now calculate the variance of S several times, each time not with the mean of S but with a different value in its place (say, values around the actual mean of S and the mean of P). If you plot the resulting variance against the value used, the points form a curve, a parabola.
The parabola has a minimum and rises to the left and right of it.
The minimum of the parabola lies exactly at the actual mean of the sample. This is an important observation.
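The parabola can be checked numerically. The sketch below uses a made-up sample and a hypothetical helper `mean_squared_deviation`, calculates the "variance" around many candidate values, and looks up where the curve is lowest.

```python
# Made-up sample (heights in cm), not taken from the post
sample = [160, 170, 180, 175, 165]
sample_mean = sum(sample) / len(sample)


def mean_squared_deviation(values, center):
    # Same as the variance formula, but around an arbitrary center
    return sum((x - center) ** 2 for x in values) / len(values)


# Candidate centers from 160.0 to 180.0 in steps of 0.5
candidates = [160 + 0.5 * i for i in range(41)]
curve = [mean_squared_deviation(sample, c) for c in candidates]

# The candidate with the smallest value is the sample mean itself
best_center = candidates[curve.index(min(curve))]
print(best_center, sample_mean)  # 170.0 170.0
```

Every other candidate, including one that happens to equal the population mean, gives a larger value than the sample mean does.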
The variance of S is meant to approximate the variance of the population P, which is measured around the mean of P. The mean of S usually differs at least slightly from the mean of P, so the mean of P most likely does not sit at the minimum of the parabola but to its left or right, where the curve is higher. The variance calculated with the actual mean of S therefore tends to be too small.
For this reason, we correct the population formula by dividing by N-1 instead of N, which gives a larger result that better approximates the variance of population P.
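A small simulation can illustrate the effect of the correction. The sketch below is only an illustration under stated assumptions: the population is simply the numbers 1 to 100, samples of size 5 are drawn with replacement via `random.choices`, and `average_estimates` is a hypothetical helper. It does not prove why the correction is exactly -1, it only shows that dividing by N-1 lands closer to the population variance on average.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Assumed population for this illustration: the numbers 1..100
population = list(range(1, 101))
true_variance = statistics.pvariance(population)


def average_estimates(sample_size, trials):
    # Draw many samples and average both variance estimates
    uncorrected_total = 0.0
    corrected_total = 0.0
    for _ in range(trials):
        sample = random.choices(population, k=sample_size)
        mean = sum(sample) / sample_size
        squared_sum = sum((x - mean) ** 2 for x in sample)
        uncorrected_total += squared_sum / sample_size       # divide by N
        corrected_total += squared_sum / (sample_size - 1)   # divide by N - 1
    return uncorrected_total / trials, corrected_total / trials


uncorrected_avg, corrected_avg = average_estimates(5, 20000)

print("population variance:", true_variance)
print("average with N:     ", uncorrected_avg)
print("average with N - 1: ", corrected_avg)
```

On average, the version without the correction comes out clearly below the population variance, while the corrected version ends up close to it.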
Code to calculate the variance and standard deviation in Python: (github.com/Heuristic-Analyst/…)
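from math import sqrt


def variance(x_values: list, population_or_sample: str):
    x_mean = 0
    squared_sum_x = 0
    n = 0
    # Calculate the mean of all given values
    for x in x_values:
        x_mean += x
        n += 1
    x_mean /= n
    # Sum up the squared differences to the mean
    for x in x_values:
        squared_sum_x += (x - x_mean) ** 2
    # Population: divide by n, sample: divide by n - 1
    if population_or_sample == "population":
        return squared_sum_x / n
    elif population_or_sample == "sample":
        return squared_sum_x / (n - 1)
    raise ValueError('population_or_sample must be "population" or "sample"')


def standard_deviation(x_values: list, population_or_sample: str):
    # The standard deviation is the square root of the variance
    var = variance(x_values, population_or_sample)
    return sqrt(var)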