jayson kibet

Posted on Jun 20

Why Statistics is Important in Data Science

#computerscience #datascience #learning #python

Introduction

Ask ten people what data science is and you'll hear things like Python, machine learning and building dashboards. Almost nobody says "statistics." But statistics is the thing doing the real work underneath all of it. It's what tells you whether your results actually mean anything.

Simple Example:

I analyzed a CSV dataset of Nairobi rental listings. The data includes monthly rent, property type, bedrooms, bathrooms, floor size and distance to the CBD.
Imagine you're looking at rental prices in Nairobi. You open your spreadsheet and see thousands of numbers staring back at you. Where do you even begin?
Statistics gives you a starting point. First, you need to understand what kind of data you're actually dealing with. Some numbers, like rent, are continuous: they can take almost any value - 150,000 KES, 287,500 KES, you name it. Other numbers, like bedrooms, are discrete counts - you can have 1 bedroom, 2 bedrooms or 3 bedrooms. You can't have 2.5 bedrooms.
Some of your data isn't even numbers. Estates like Westlands or Kilimani are just names (text), which makes them categorical data. There's no mathematical order to them. Westlands isn't "bigger" or "better" than Kilimani in any way that fits into a formula.
This sounds obvious, but if you get this wrong, everything else breaks. You might try to average something that should be counted. Or you might use the wrong test on the wrong kind of data and end up with completely meaningless results.
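Here is a minimal sketch of how that distinction changes what you compute. The four listings and the field names (`rent_kes`, `bedrooms`, `estate`) are made up for illustration; they are not rows from the dataset above.

```python
from collections import Counter
from statistics import mean

# Made-up listings for illustration only (not the article's dataset).
listings = [
    {"rent_kes": 150_000, "bedrooms": 2, "estate": "Westlands"},
    {"rent_kes": 287_500, "bedrooms": 3, "estate": "Kilimani"},
    {"rent_kes": 95_000, "bedrooms": 1, "estate": "Kilimani"},
    {"rent_kes": 180_000, "bedrooms": 2, "estate": "Westlands"},
]

# Continuous value: an average is meaningful.
avg_rent = mean(row["rent_kes"] for row in listings)

# Discrete count: a frequency table is usually more useful than an average.
bedroom_counts = Counter(row["bedrooms"] for row in listings)

# Categorical text: there is no order, so counting is all that makes sense.
estate_counts = Counter(row["estate"] for row in listings)

print(avg_rent)
print(bedroom_counts)
print(estate_counts)
```

The same three columns get three different treatments, and that choice comes from the type of data, not from the tool.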

1. What the numbers actually tell you

Once you understand what you're working with, simple statistics start telling you a real story.
Take the average rent in the data - 223,196 KES. But here's something interesting. The middle value (the median) - the price you get when you line up every single listing from cheapest to most expensive and pick the one in the exact middle - is 190,307 KES.
That gap between the average and the middle? It's a clue. It means a small number of really expensive properties are pulling the average upward. If you only looked at the average, you'd think rent in Nairobi is higher than what most people actually pay. The middle value gives you a much more honest picture.
The spread of prices matters too. Most rents fall somewhere between 129,188 KES and 287,698 KES. Knowing this range helps you understand what normal market prices look like. You can also spot outliers - those properties that are way too expensive or surprisingly cheap - using a simple rule instead of just guessing.
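To see the effect on a small scale, here is a sketch using Python's built-in `statistics` module. The eight rents are invented, with one deliberately extreme value, and the 1.5 x IQR cutoff is one common rule of thumb for outliers. It is an assumption here, not necessarily the rule used on the real dataset.

```python
from statistics import mean, median, quantiles

# Made-up monthly rents in KES; the last one is deliberately extreme.
rents = [90_000, 120_000, 150_000, 170_000, 190_000, 210_000, 250_000, 900_000]

avg = mean(rents)    # pulled upward by the 900,000 listing
mid = median(rents)  # barely affected by it

# Quartiles and the interquartile range (IQR).
q1, _, q3 = quantiles(rents, n=4)
iqr = q3 - q1

# Rule of thumb: flag anything more than 1.5 * IQR outside the quartiles.
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [r for r in rents if r < low or r > high]

print(avg, mid)
print(outliers)
```

One expensive listing is enough to drag the average well above the median, and the same listing is the only one the rule flags.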

2. The shape of your data matters more (fun part)

This is where things get interesting. The rent data leans to one side - lots of cheap and mid-range listings, with a few really expensive ones stretching things out. In other words, it is skewed to the right.
Why does this matter? Because many common tools, like basic regression models, work best when values spread out fairly evenly around the middle. If you ignore how your data is shaped and just run the numbers anyway, your results can be subtly off in ways that are really hard to catch later.
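One way to put a number on "leans to one side" is a skewness score. The sketch below uses the simple moment-based formula on made-up rents, then applies a log transform. The log transform is a common remedy for right-skewed data, offered here as an assumption rather than something the dataset above was actually put through.

```python
import math
from statistics import mean

def skewness(values):
    """Moment-based skewness: about 0 for symmetric data, > 0 for a long right tail."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Made-up monthly rents in KES with one very expensive listing.
rents = [90_000, 120_000, 150_000, 170_000, 190_000, 210_000, 250_000, 900_000]

raw_skew = skewness(rents)
log_skew = skewness([math.log(r) for r in rents])

print(raw_skew, log_skew)
```

The log-transformed values are still skewed, but noticeably less so, which is often enough to make standard tools behave better.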

3. Central Limit Theorem

There's something called the Central Limit Theorem that explains a deeper reason this matters. The simple version is this: even if your raw data is completely lopsided, if you take many reasonably large samples and look at their averages, those averages will start to form a nice, roughly normal (bell-shaped) pattern. This one idea is the reason most of the math behind confidence intervals and statistical tests actually works in the real world. Without it, a lot of what data scientists do every day wouldn't really hold up.
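You can watch this happen with a small simulation. Everything below is simulated: the exponential distribution is an assumption, picked only because it is clearly lopsided, and the sample size of 100 is arbitrary.

```python
import random
from statistics import fmean, median

random.seed(42)  # fixed seed so the run is repeatable

# Simulated right-skewed "rents" with an average around 200,000.
population = [random.expovariate(1 / 200_000) for _ in range(50_000)]

# Draw many samples of size 100 and record each sample's average.
sample_means = [fmean(random.sample(population, 100)) for _ in range(2_000)]

# In a skewed distribution the mean sits well above the median.
# In a roughly symmetric one the two are close together.
raw_gap = (fmean(population) - median(population)) / fmean(population)
means_gap = (fmean(sample_means) - median(sample_means)) / fmean(sample_means)

print(raw_gap, means_gap)
```

The raw values have a large gap between mean and median, while the sample averages have almost none. That is the lopsidedness washing out.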

4. The question nobody asks

Here's something that gets skipped over way too often: you almost never have all the data. You have a sample - a small piece of a much bigger picture. This Nairobi data isn't every single rental in the city. It's just a slice.
So the real question isn't "what's the average rent in this dataset?" The real question is "based on this small slice, what can I honestly say about rent across all of Nairobi?"
This is exactly what statistics is built for. A confidence interval lets you say something like "we're 95% confident the real average rent across Nairobi is somewhere between 215,000 and 231,000 KES." That's much more honest than pretending you know the exact number.
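A rough sketch of how such an interval is computed: take the sample average, then add and subtract about 1.96 standard errors. The sample here is simulated, not the real listings, and the normal approximation is an assumption that is reasonable for a sample this large. For small samples a t-distribution would be the safer choice.

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(0)

# Simulated sample of 200 monthly rents (illustrative only).
sample = [random.lognormvariate(12, 0.5) for _ in range(200)]

n = len(sample)
avg = mean(sample)
standard_error = stdev(sample) / n ** 0.5

# z-value for a two-sided 95% interval (about 1.96).
z = NormalDist().inv_cdf(0.975)

lower = avg - z * standard_error
upper = avg + z * standard_error

print(f"95% CI for the mean: {lower:,.0f} to {upper:,.0f} KES")
```

A bigger sample shrinks the standard error, which is why more data gives you a narrower, more useful interval.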

5. Are two things really different, or does it just look that way?

Statistics also gives you a way to test if two things are genuinely different or if it's just random chance in your sample.

Say you want to know if rent in Kilimani is really higher than rent in Westlands. You start by assuming there's no real difference. Then you run a test and check how surprising your result would be if that assumption were true.
This same basic idea of testing if something is real or just luck is what companies use when they test a new app feature or pricing page on some users before rolling it out to everyone.
Skip this step and you'll end up making big claims based on tiny samples and random chance. That's not a rare mistake. It's probably the most common way data analysis goes wrong.
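One simple way to run such a test without any formulas is a permutation test: shuffle the estate labels many times and count how often chance alone produces a gap as big as the one you observed. The rents below are made up, and the permutation test is just one option among several (a t-test is another).

```python
import random
from statistics import fmean

random.seed(1)

# Made-up rents (KES) for two estates; not taken from the real dataset.
kilimani = [210_000, 185_000, 240_000, 199_000, 260_000, 230_000, 175_000, 220_000]
westlands = [190_000, 205_000, 170_000, 215_000, 180_000, 200_000, 160_000, 195_000]

observed = fmean(kilimani) - fmean(westlands)

# Null hypothesis: the estate label doesn't matter.
# Shuffle the labels and see how often a gap this large shows up by chance.
pooled = kilimani + westlands
n_k = len(kilimani)
n_shuffles = 10_000
extreme = 0
for _ in range(n_shuffles):
    random.shuffle(pooled)
    diff = fmean(pooled[:n_k]) - fmean(pooled[n_k:])
    if abs(diff) >= abs(observed):
        extreme += 1

# Share of shuffles at least as extreme as the real gap (the p-value).
p_value = extreme / n_shuffles

print(observed, p_value)
```

A small p-value means the gap would be surprising if the two estates were really the same. A large one means your sample can't tell them apart.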

6. Just because two things move together doesn't mean one causes the other

Properties further from the city center tend to be cheaper. Easy to see, easy to plot on a chart. But just because two things move together doesn't mean one is causing the other.
Maybe being far from town really does lower the price. But maybe both things are caused by something else entirely, like less infrastructure in those areas.
Mixing these two up - "these move together" and "one causes the other" - is one of the most common mistakes people make with data. Statistics trains you to slow down and not jump to that conclusion.

You can build models that predict rent using several things at once: size, bedrooms, distance from town. But even then, the model only shows you connections. It doesn't prove what's causing what.
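Here is a small simulation of that "something else entirely" scenario. A hidden factor (a hypothetical infrastructure score) drives both distance and rent. Neither variable appears in the other's formula, yet they still end up strongly correlated. All numbers are invented for the demonstration.

```python
import random
from statistics import fmean

random.seed(7)

def pearson(xs, ys):
    """Pearson correlation coefficient, between -1 and 1."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical hidden factor: an infrastructure score per neighbourhood.
infrastructure = [random.gauss(0, 1) for _ in range(1_000)]

# Both variables depend on the hidden factor plus their own noise.
distance_km = [10 - 3 * z + random.gauss(0, 1) for z in infrastructure]
rent_kes = [200_000 + 50_000 * z + random.gauss(0, 20_000) for z in infrastructure]

r = pearson(distance_km, rent_kes)
print(r)
```

The correlation comes out strongly negative even though, by construction, distance has no direct effect on rent here. The correlation number alone can't tell you which story is true.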

7. The traps that catch people who skip statistics

Sometimes a pattern that's true for every small group disappears or even flips when you combine all the groups together (this is known as Simpson's paradox). Sometimes, if you test enough things, you'll find something that looks "significant" just by random chance even if nothing real is going on. Sometimes your data quietly leaves things out without you noticing, like a dataset of current listings that is missing all the properties that got rented out fast - and which ones went fast probably wasn't random at all.
None of these issues show up as error messages in your code. They show up as wrong answers that sound confident. And that's worse, because nothing tells you to double check.
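The "test enough things" trap is easy to quantify. Assuming each test uses the usual 5% significance threshold and the tests are independent (a simplifying assumption), the chance of at least one false alarm grows quickly with the number of tests:

```python
def chance_of_false_alarm(n_tests, alpha=0.05):
    """Probability of at least one false positive across independent tests
    when nothing real is going on."""
    return 1 - (1 - alpha) ** n_tests

for n in (1, 10, 20):
    print(n, round(chance_of_false_alarm(n), 3))
```

With ten tests there is roughly a 40% chance that at least one looks "significant" purely by luck, and with twenty it is closer to 64%.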

8. Why all this matters for machine learning

Every machine learning model is really just a statistical model wearing different clothes. Knowing how spread out your data is and what shape it takes helps you pick the right model and notice when something's wrong.
Testing a model on data it hasn't seen before - which is standard practice - is really just using a sample to guess at the bigger picture. Picking good inputs for your model gets much easier once you understand which numbers actually carry useful information and which just look interesting on a graph.
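As a minimal sketch of that idea, the code below fits a one-input linear model on 80% of some simulated listings and measures its error on the 20% it never saw. The relationship between floor size and rent is an invented assumption, and real projects would normally use a library such as scikit-learn rather than hand-written least squares.

```python
import random
from statistics import fmean

random.seed(3)

# Simulated listings: rent depends on floor size plus noise (assumed relationship).
sizes = [random.uniform(30, 200) for _ in range(500)]
rents = [40_000 + 1_200 * s + random.gauss(0, 25_000) for s in sizes]

# Hold out the last 20% of rows; the rows are already in random order.
split = int(0.8 * len(sizes))
x_train, x_test = sizes[:split], sizes[split:]
y_train, y_test = rents[:split], rents[split:]

# Ordinary least squares for one input, fitted on the training rows only.
mx, my = fmean(x_train), fmean(y_train)
slope = (
    sum((x - mx) * (y - my) for x, y in zip(x_train, y_train))
    / sum((x - mx) ** 2 for x in x_train)
)
intercept = my - slope * mx

def mean_abs_error(xs, ys):
    """Average absolute gap between predicted and actual rent."""
    return fmean(abs(y - (intercept + slope * x)) for x, y in zip(xs, ys))

train_error = mean_abs_error(x_train, y_train)
test_error = mean_abs_error(x_test, y_test)

print(train_error, test_error)
```

The test error is your sample-based estimate of how the model would do on listings it has never seen. If it is much worse than the training error, the model has memorized rather than learned.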

Conclusion

Statistics is what stands between "I noticed a pattern" and "this pattern actually means something." It's the difference between reporting an average and knowing when that average is misleading you. Between seeing two things move together and knowing if it's worth acting on. Between running one test and understanding why ten tests could have fooled you.

Data science without statistics isn't a simpler version of data science. It's just guessing, dressed up with a nicer chart.

Statistics is what decides which one you end up with: a confident report full of wrong conclusions, or an honest one with the uncertainty clearly laid out.

Start with the basics. Ask yourself what your data actually is, what the center looks like and how spread out it is. Getting those right will save you from getting everything else wrong.
I hope you found this article useful.

DEV Community