Sami

Statistics

Descriptive Statistics Explained

My notes while learning Statistics for Data Science

When I first started learning Data Science, I thought statistics was just another boring mathematics subject with formulas to memorize.

I was completely wrong.

The more I learned, the more I realized that statistics is actually the language of data. Before building Machine Learning models, creating dashboards, or making predictions, we first need to understand what our data is trying to tell us.

This blog is my attempt to explain the concepts in a simple way—the way I understood them while learning.


What is Statistics?

Imagine a company has data on 10 million customers.

Can a human sit and read every row?

Obviously not.

Statistics helps us summarize huge amounts of data into meaningful information so we can make decisions.

Instead of reading every single record, statistics answers questions like:

  • What is the average customer age?
  • Which product sells the most?
  • How much variation exists in customer spending?
  • Are sales increasing or decreasing?

In simple words,

Statistics is the science of collecting, organizing, analyzing, and understanding data.

This is exactly why statistics is used almost everywhere.

  • Netflix recommends movies using statistical patterns.
  • Hospitals test whether a new medicine actually works.
  • Companies forecast future sales.
  • Governments conduct surveys before making policies.

Types of Statistics

Statistics is mainly divided into two parts.

1. Descriptive Statistics

This is where every Data Analyst starts.

Descriptive Statistics focuses on understanding the data we already have.

It answers questions like:

  • What is the average salary?
  • Which category appears most?
  • How spread out are the values?
  • What does the data look like?

It does not predict the future.

It simply describes the present.

Example:

Suppose we have students' marks.

78
82
91
67
75
88

Using descriptive statistics, we can calculate:

  • Average marks
  • Highest marks
  • Lowest marks
  • Most common marks
  • Overall distribution
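As a rough sketch, most of these summaries can be computed with Python's built-in functions and the standard `statistics` module. The variable names are my own; the marks are the ones listed above.

```python
import statistics

marks = [78, 82, 91, 67, 75, 88]

average = statistics.mean(marks)   # sum of marks / number of marks
highest = max(marks)
lowest = min(marks)

# multimode returns every value tied for most frequent;
# here each mark appears once, so all six values come back
most_common = statistics.multimode(marks)

print(round(average, 2), highest, lowest)
```

Because no mark repeats in this small example, "most common marks" is not very informative here. It becomes useful on larger datasets.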

2. Inferential Statistics

This goes one step ahead.

Instead of describing existing data, it tries to make conclusions about a larger population using only a sample.

For example,

Imagine India has more than a billion people.

Surveying every person isn't possible.

Instead, researchers survey a small group and use statistics to estimate what the entire population might think.

Machine Learning leans heavily on the same idea: a model learns patterns from sample (training) data and is then applied to data it has not seen.


Population vs Sample

This was one of the easiest concepts once I stopped overthinking it.

Suppose a college has 12,000 students.

The Population is all 12,000 students.

Now imagine we randomly choose 500 students for a survey.

Those 500 students are called the Sample.

Population
↓

12000 Students

↓

Take 500 Random Students

↓

Sample

Since collecting data from everyone is expensive and time-consuming, most companies work with samples.

The important part is choosing the sample correctly.

A good sample should be:

  • Large enough
  • Random
  • Representative of the entire population

Otherwise, the conclusions may be misleading.
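Here is a minimal sketch of drawing a simple random sample in Python. The student IDs are hypothetical stand-ins for the 12,000 students, and the fixed seed is only there so the draw is repeatable.

```python
import random

# Hypothetical student IDs 1..12000 standing in for the population
population = list(range(1, 12001))

rng = random.Random(42)                 # fixed seed, purely for repeatability
sample = rng.sample(population, 500)    # 500 students, chosen without repeats

print(len(population), len(sample))
```

`random.sample` picks without replacement, so no student is surveyed twice. It gives randomness, but whether the sample is representative still depends on how the population list was built.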


Parameter vs Statistic

This confused me initially because the words sound almost the same.

Here's how I remember it.

A Parameter describes the entire population.

A Statistic describes only the sample.

For example,

Average salary of every employee in a company

→ Parameter

Average salary of 300 surveyed employees

→ Statistic

Simple.
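To see the difference in numbers, here is a small sketch. The salaries are randomly generated, made-up values (not real data), and the company size of 5,000 is my own assumption.

```python
import random
import statistics

rng = random.Random(0)

# Made-up salaries for 5,000 employees (illustrative only)
salaries = [rng.randint(30000, 90000) for _ in range(5000)]

parameter = statistics.mean(salaries)    # uses every employee
surveyed = rng.sample(salaries, 300)
statistic = statistics.mean(surveyed)    # uses only the 300 surveyed

print(parameter, statistic)
```

The two numbers will usually be close but not identical. The statistic is an estimate of the parameter.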


Types of Data

Before doing any analysis, we should understand what kind of data we're working with.

Some data represents categories. This is called categorical (qualitative) data.

Examples:

  • Gender
  • Department
  • City
  • Blood Group

Some data represents numbers. This is called numerical (quantitative) data.

Examples:

  • Salary
  • Height
  • Weight
  • Age
  • Marks

Knowing the data type helps us choose the right visualization and statistical method.


Measures of Central Tendency

Suppose your friend asks,

"Can you summarize this dataset in one number?"

That's exactly what central tendency does.

It finds the center of the data.

There are different ways to define this center.


Mean (Average)

The most commonly used measure.

Formula:

Mean = Sum of all values / Number of values

Example

Marks:

80 90 70 60 100

Mean

(80+90+70+60+100)/5

=80

Easy.

But there's a problem.

Mean is affected by extreme values.

Suppose one billionaire enters a room of middle-class people.

Suddenly the average wealth becomes enormous.

That doesn't represent reality.
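A quick sketch of both points in Python. The marks are from the example above; the wealth figures are hypothetical numbers I picked to make the effect obvious.

```python
marks = [80, 90, 70, 60, 100]
mean_marks = sum(marks) / len(marks)   # 400 / 5

# Hypothetical wealth figures: nine people with 50,000 each,
# then one billionaire walks in
room = [50_000] * 9
mean_before = sum(room) / len(room)
room.append(1_000_000_000)
mean_after = sum(room) / len(room)

print(mean_marks, mean_before, mean_after)
```

One new value moves the mean from 50,000 to about 100 million, even though nine out of ten people in the room are no richer than before.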


Median

Median is the middle value after arranging the data in order. If the number of values is even, it is the average of the two middle values.

Example

10 20 25 30 90

Median = 25

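Unlike the mean, the median is barely affected by extremely high or low values, because it depends only on the order of the data and not on how large the extremes are.

That's why salaries, house prices, and income distributions are often summarized with the median instead of the mean.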
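A small sketch using the standard `statistics` module. The second and third lists are my own variations on the example above.

```python
import statistics

values = [10, 20, 25, 30, 90]
median_value = statistics.median(values)
print(median_value)                         # 25

# Replacing 90 with a far larger value moves the mean but not the median
extreme = [10, 20, 25, 30, 9000]
median_extreme = statistics.median(extreme)
mean_extreme = statistics.mean(extreme)
print(median_extreme, mean_extreme)

# With an even count, the median is the average of the two middle values
median_even = statistics.median([10, 20, 30, 90])
print(median_even)                          # 25.0
```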


Mode

Mode is simply the value that appears most frequently.

Example

2 3 5 3 6 7 3

Mode = 3

This is useful when analyzing customer preferences.

Example:

Most purchased mobile brand.

Most common payment method.

Most selected course.
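In Python this could look roughly like the sketch below. The list of payment methods is invented for illustration.

```python
import statistics
from collections import Counter

values = [2, 3, 5, 3, 6, 7, 3]
mode_value = statistics.mode(values)
print(mode_value)   # 3

# Mode also works on categories; these payment methods are invented
payments = ["UPI", "Card", "UPI", "Cash", "UPI", "Card"]
top_method, count = Counter(payments).most_common(1)[0]
print(top_method, count)
```

Unlike the mean and median, the mode makes sense for categorical data, which is why it fits questions like "most common payment method".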


Weighted Mean

Sometimes every value shouldn't have equal importance.

Example:

Your semester marks.

Maybe:

Assignments = 20%

Mid Exam = 30%

Final Exam = 50%

Here a simple average would be misleading, because it treats all three scores as equally important.

Each score has a different weight.

That's where Weighted Mean becomes useful: multiply each score by its weight, add the results, and divide by the total of the weights.
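A minimal sketch of the calculation. The weights are the ones from the example; the three scores are hypothetical.

```python
# Weights come from the example above; the scores are hypothetical
weights = {"assignments": 20, "mid_exam": 30, "final_exam": 50}
scores = {"assignments": 90, "mid_exam": 70, "final_exam": 80}

weighted_mean = (
    sum(weights[k] * scores[k] for k in weights) / sum(weights.values())
)
simple_mean = sum(scores.values()) / len(scores)

print(weighted_mean, simple_mean)
```

With these numbers the weighted mean is 79 while the simple mean is 80, because the weaker mid exam score counts for more than the strong assignment score.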


Trimmed Mean

Imagine a company accidentally records these salaries.

25000
27000
26000
28000
30000
5000000

That last value is an extreme outlier.

Instead of letting one unusual value distort the average, we remove a small percentage of the highest and lowest values before calculating the mean.

This gives a more reliable average.
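Here is a small sketch of the idea. The helper `trimmed_mean` is my own illustrative function (not from a library), and it drops a fixed count of values from each end rather than a percentage, which keeps the example easy to follow on six numbers.

```python
def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest values."""
    if 2 * k >= len(values):
        raise ValueError("k removes every value")
    kept = sorted(values)[k:len(values) - k]
    return sum(kept) / len(kept)

salaries = [25000, 27000, 26000, 28000, 30000, 5000000]

plain = sum(salaries) / len(salaries)
trimmed = trimmed_mean(salaries, 1)   # drop one value from each end

print(plain, trimmed)
```

The plain mean is 856,000, which describes nobody in the list. After trimming, the mean is 27,750, which is much closer to a typical salary.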


Measures of Dispersion

Knowing only the average isn't enough.

Consider these two classes.

Class A

50
50
50
50
50

Class B

20
40
50
60
80

Both have the same average.

But clearly, Class B is much more spread out.

Dispersion tells us how scattered the data is.


Range

The simplest measure.

Range = Maximum - Minimum

Example

10 20 30 50

Range = 50 − 10 = 40

Easy to calculate.

But it depends only on two values, so it's very sensitive to outliers.
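A tiny sketch of the calculation, plus one extra value (500, my own addition) to show the sensitivity to outliers.

```python
values = [10, 20, 30, 50]
value_range = max(values) - min(values)

# One extreme value is enough to stretch the range
with_outlier = values + [500]
outlier_range = max(with_outlier) - min(with_outlier)

print(value_range, outlier_range)
```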


Variance

Variance measures how far the values are from the mean. More precisely, it is the average of the squared differences between each value and the mean.

Instead of looking only at the highest and lowest values, it considers every observation.

Higher variance means the data is more spread out.

Lower variance means the values stay close to the average.
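To make this concrete, here is a sketch that computes the variance step by step for the two classes from the dispersion example. I'm assuming the population formula (divide by n). The sample version divides by n - 1 instead.

```python
def population_variance(values):
    """Average squared distance of each value from the mean."""
    mean = sum(values) / len(values)
    squared_diffs = [(x - mean) ** 2 for x in values]
    return sum(squared_diffs) / len(values)

class_a = [50, 50, 50, 50, 50]
class_b = [20, 40, 50, 60, 80]

var_a = population_variance(class_a)
var_b = population_variance(class_b)

print(var_a, var_b)
```

Class A has a variance of 0 because every value equals the mean. Class B has a variance of 400, even though both classes share the same average of 50.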


Standard Deviation

Standard deviation is simply the square root of variance, which puts it back in the same units as the data.

It is probably the most widely used measure of spread in statistics.

A low standard deviation means the data points are tightly packed.

A high standard deviation means the data is widely scattered.

In Data Science, you'll see Standard Deviation almost everywhere—from feature engineering to anomaly detection and probability distributions.
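A short sketch using the standard `statistics` module on the Class B marks from the dispersion example. Which function is appropriate depends on whether the data is the whole population or only a sample.

```python
import statistics

class_b = [20, 40, 50, 60, 80]

pop_sd = statistics.pstdev(class_b)     # treats the data as the whole population
sample_sd = statistics.stdev(class_b)   # treats the data as a sample (divides by n - 1)

print(pop_sd, round(sample_sd, 2))
```

The population value is 20, the square root of the variance of 400. The sample value is slightly larger because it divides by n - 1.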


Coefficient of Variation (CV)

Imagine two datasets.

Dataset A

Average = 20

Standard Deviation = 5

Dataset B

Average = 500

Standard Deviation = 20

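Comparing the standard deviations directly isn't fair because the averages are on completely different scales: a spread of 5 around 20 is large, while a spread of 20 around 500 is small.

Coefficient of Variation solves this by expressing the standard deviation relative to the mean (CV = Standard Deviation / Mean, often shown as a percentage). Dataset A has a CV of 25%, while Dataset B has a CV of only 4%.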

This makes it easier to compare datasets with different scales.
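As a quick sketch, using the two datasets above (the function name is my own):

```python
def coefficient_of_variation(mean, std_dev):
    """Standard deviation as a percentage of the mean."""
    return std_dev / mean * 100

cv_a = coefficient_of_variation(20, 5)     # Dataset A
cv_b = coefficient_of_variation(500, 20)   # Dataset B

print(cv_a, cv_b)
```

Dataset B has the larger standard deviation but the smaller CV, so relative to its mean it is the more consistent of the two. CV only makes sense when the mean is not zero (or close to it).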


Visualizing Data

Numbers alone don't always tell the whole story.

Visualizations help us understand patterns much faster.

Some commonly used tables and charts include:

Frequency Distribution Table

Shows how many times each value appears.

Useful for categorical data.
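A minimal sketch with `collections.Counter`. The department labels are invented for illustration.

```python
from collections import Counter

# Invented department labels, for illustration only
departments = ["Sales", "HR", "Sales", "IT", "IT", "Sales", "HR", "Sales"]

frequency = Counter(departments)

# Print categories from most to least frequent
for category, count in frequency.most_common():
    print(f"{category:<6} {count}")
```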


Histogram

Used for numerical data.

It helps us understand:

  • Distribution
  • Skewness
  • Peaks
  • Outliers

Whenever I open a new dataset, one of the first charts I create is a histogram.
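In practice I would use a plotting library for this, but the idea behind a histogram can be sketched with the standard library alone: group values into bins and count each bin. This reuses the students' marks from the descriptive statistics example, and the bin width of 10 is my own choice.

```python
from collections import Counter

marks = [78, 82, 91, 67, 75, 88]
bin_width = 10  # an arbitrary choice; other widths give a different picture

# Map each mark to the lower edge of its bin, e.g. 78 -> 70
bins = Counter((m // bin_width) * bin_width for m in marks)

# Print one row per bin, with one '#' per value in that bin
for lower in sorted(bins):
    print(f"{lower}-{lower + bin_width - 1}: {'#' * bins[lower]}")
```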


Scatter Plot

Scatter plots are useful when working with two numerical variables.

For example,

Hours Studied vs Marks

Experience vs Salary

Temperature vs Ice Cream Sales

They help identify relationships and trends.


Contingency Table

Used when comparing two categorical variables.

Example:

Gender vs Purchased Product

Department vs Promotion Status

It helps identify relationships between categories.
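Here is a rough sketch of building such a table by counting pairs. The records are invented for illustration, and the layout code is just one simple way to print the grid.

```python
from collections import Counter

# Invented (department, promotion status) records, for illustration only
records = [
    ("Sales", "Promoted"), ("Sales", "Not promoted"),
    ("IT", "Promoted"), ("IT", "Promoted"),
    ("HR", "Not promoted"), ("Sales", "Not promoted"),
]

counts = Counter(records)   # how often each (department, status) pair occurs
rows = sorted({dept for dept, _ in records})
cols = sorted({status for _, status in records})

# Header row, then one row per department
print(f"{'':<8}" + "".join(f"{c:<14}" for c in cols))
for r in rows:
    print(f"{r:<8}" + "".join(f"{counts[(r, c)]:<14}" for c in cols))
```

A `Counter` returns 0 for pairs that never occur, so empty cells in the table are filled in automatically.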


My Biggest Takeaway

Earlier, I thought statistics was all about formulas.

Now I see it differently.

Statistics is simply a way of asking better questions about data.

Instead of looking at thousands of rows, we summarize, visualize, compare, and understand what's happening.

Every Machine Learning model, every dashboard, and every business decision starts with this understanding.

Learning descriptive statistics has made me realize that before predicting the future with AI, we first need to understand the present through data.

And that's exactly what statistics teaches us.


What's Next?

In the next blog, I'll explore more of statistics, especially Inferential Statistics, and explain it the same way: with simple examples, real-world intuition, and practical understanding instead of memorized formulas.

If you're also starting Data Science, I hope these notes make your learning a little easier.
