Standard Deviation in NumPy: Population vs Sample

Understand how to accurately calculate standard deviation using NumPy by distinguishing between sample and population data. Learn why adjusting degrees of freedom is crucial for meaningful statistical analysis.

Key Insights

Use NumPy's built-in std function to calculate standard deviation, and set its ddof (delta degrees of freedom) argument deliberately: 0 for population data, which is the default, and 1 for sample data.
Recognize that when working with sample data, such as a subset of temperature or height measurements, setting ddof to 1 gives a less biased estimate of the population's standard deviation, resulting in a slightly higher value (e.g., 14.4 instead of 14).
Understand that the choice of degrees of freedom matters even more when analyzing variance: variance is the square of standard deviation, so the gap between the sample and population values is wider.

This lesson is a preview from our Data Science & AI Certificate Online (includes software) and Python Certification Online (includes software & exam). Enroll in a course for detailed lessons, live instructor support, and project-based training.

Let's take a look at how we can calculate standard deviation using NumPy. If we pass a list to NumPy's built-in std function, it gives us the standard deviation of that list. Now, there is a big thing to consider here: the standard deviation of the degrees list is 14 degrees only if we treat those readings as the entire population of degrees.
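As a minimal sketch of passing a list to std, the example below uses a small made-up list of readings (the lesson's actual degrees list is not shown here, so these values are an assumption and will not reproduce the 14 quoted above). It also repeats the calculation by hand to show that the default divides by the number of values.

```python
import numpy as np

# Hypothetical readings for illustration; not the lesson's degrees list.
degrees = [2, 4, 4, 4, 5, 5, 7, 9]

# With no extra arguments, np.std treats the data as a whole population
# and divides the summed squared deviations by N.
population_std = np.std(degrees)

# The same calculation by hand: square root of the mean squared deviation.
mean = sum(degrees) / len(degrees)
manual_std = (sum((x - mean) ** 2 for x in degrees) / len(degrees)) ** 0.5

print(population_std)  # 2.0
print(manual_std)      # 2.0
```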

In other words, all the degrees that have ever been measured, which for temperature doesn't really make sense. Any set of temperature readings is going to be a sample of the temperatures. In other cases, though, we might have the entire population and want the standard deviation of that.

With something like height, we can at least imagine looking at the entire population: not just the height of a random thousand men in America, but the height of every man in America. If we're looking at a sample, as we are in this case, a small subset of the overall population, we need to adjust for that to get a better estimate of the true standard deviation. To do that, we use a concept called degrees of freedom, which we won't explain in full here.

You can look into that more if you want to dive into exactly what offset we use and where it comes from. The short version is that it separates population variance, the variance in the whole population, from the variance estimated from one particular sample of it. In NumPy, the degrees of freedom argument, ddof, is zero by default. That gives the population value.
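For readers who want the offset spelled out, here is a sketch of the two standard formulas, written in LaTeX. The symbols N (number of values), x_i (each value), and x-bar (the mean of the values) are notation introduced for this illustration. NumPy's divisor is N minus ddof, so ddof=0 corresponds to the first formula and ddof=1 to the second.

```latex
% Population standard deviation (ddof = 0): divide by N
\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2}

% Sample standard deviation (ddof = 1): divide by N - 1
s = \sqrt{\frac{1}{N - 1} \sum_{i=1}^{N} (x_i - \bar{x})^2}
```

Dividing by the smaller number N - 1 is why the sample figure always comes out a little higher than the population figure for the same data.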

If we set it to one, that's for a sample. If we do that, we get a slightly higher number, 14.4. Now, this matters particularly when we're looking at variance, which is the square of the standard deviation, so the gap between the sample and population values gets even wider. But this is a key thing.
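The comparison can be sketched as follows, again with a made-up list of readings standing in for the lesson's degrees list (an assumption, so the numbers differ from the 14 and 14.4 quoted above). The ddof argument is accepted by both np.std and np.var.

```python
import numpy as np

# Hypothetical readings for illustration; not the lesson's degrees list.
degrees = [2, 4, 4, 4, 5, 5, 7, 9]

# ddof=0 (the default) divides by N: the whole-population figure.
pop_std = np.std(degrees, ddof=0)
pop_var = np.var(degrees, ddof=0)

# ddof=1 divides by N - 1: the figure to use for a sample.
sample_std = np.std(degrees, ddof=1)
sample_var = np.var(degrees, ddof=1)

print(pop_std, round(sample_std, 2))  # 2.0 2.14
print(pop_var, round(sample_var, 2))  # 4.0 4.57
```

On this data the standard deviation rises by about 0.14 while the variance rises by about 0.57, which illustrates why the choice is more visible in variance.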

We can use NumPy to do this, but we must always be mindful of degrees of freedom: are we looking at a sample of the population, or the entire population as a whole?

Colin Jaffe

Colin Jaffe is a programmer, writer, and teacher with a passion for creative code, customizable computing environments, and simple puns. He loves teaching code, from the fundamentals of algorithmic thinking to the business logic and user flow of application building—he particularly enjoys teaching JavaScript, Python, API design, and front-end frameworks.

Colin has taught code to a diverse group of students since learning to code himself, including young men of color at All-Star Code, elementary school kids at The Coding Space, and marginalized groups at Pursuit. He also works as an instructor for Noble Desktop, where he teaches classes in the Full-Stack Web Development Certificate and the Data Science & AI Certificate.

Colin lives in Brooklyn with his wife, two kids, and many intricate board games.
