Chanchal Singh

Posted on Nov 19

Statistics Day5: The Super-Simple Guide to Random Variables and Correlation for Data Science Beginners

#statistics #machinelearning #datascience #beginners

If you’re learning statistics for data science, you’ll hear words that sound very big: random variables, PDF, correlation, and more.

But don’t worry.
Today, we’ll break everything down in simple language so even a 10-year-old can follow.

What Is a Random Variable?

A random variable is just a number that comes from a random activity.

Think of it like this:
You do something uncertain → you get a number as a result.

Example: Roll a die → you get 1, 2, 3, 4, 5, or 6.
That number is your random variable.

There are two types:

1. Discrete Random Variables

Discrete means you can count the possible values.
They come in separate chunks — no in-between values.

Examples:

Number of chocolates in a box (you can’t have 4.6 chocolates)
Number of students absent
Die outcome (1–6)

Why does it matter in data science?
You use discrete random variables when your feature takes clear, countable values.
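
To see a discrete random variable in action, here is a minimal sketch that simulates rolling a fair die. The seed (42) and the number of rolls (6000) are arbitrary choices for illustration, not anything special.

```python
import random
from collections import Counter

rng = random.Random(42)  # fixed seed so the run is repeatable

# Each roll is one draw of the random variable: a whole number from 1 to 6.
rolls = [rng.randint(1, 6) for _ in range(6000)]

counts = Counter(rolls)
for face in range(1, 7):
    share = counts[face] / len(rolls)
    print(f"face {face}: {share:.3f}")  # each share should land near 1/6
```

Every result is one of six separate values, and with enough rolls each face shows up about one time in six.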

2. Continuous Random Variables

Continuous means the values can be anything in a range — even decimals.

Examples:

Height (160.25 cm is possible)
Temperature (34.7°C, 34.75°C…)
Weight

Why does it matter?
Many statistical methods, and some ML models, assume continuous data follows a pattern such as the normal distribution.

What Is a Normal Distribution?

A normal distribution is the famous bell-shaped curve.

It looks like a hill that is:

highest in the middle
smooth
symmetric
values near the mean are more common

Example: Most people’s heights cluster around an average.
Only a few are extremely short or extremely tall.
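
A small simulation can make the bell shape concrete. This is only a sketch: the average height of 165 cm and the spread of 7 cm are made-up numbers, and `rng.gauss` is used to generate pretend heights rather than real measurements.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: average height 165 cm, standard deviation 7 cm.
heights = [rng.gauss(165, 7) for _ in range(10_000)]

mean = statistics.mean(heights)
std = statistics.stdev(heights)

# Share of people within 1 and within 3 standard deviations of the mean.
within_1 = sum(abs(h - mean) <= std for h in heights) / len(heights)
within_3 = sum(abs(h - mean) <= 3 * std for h in heights) / len(heights)

print(f"mean = {mean:.1f} cm, std = {std:.1f} cm")
print(f"within 1 std: {within_1:.1%}")  # roughly 68% for a normal distribution
print(f"within 3 std: {within_3:.1%}")  # roughly 99.7%
```

Most of the simulated heights sit close to the average, and very few land far out in the tails.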

What Is the Probability Density Function (PDF)?

The PDF is simply a formula that tells us:

“How likely are values near a given point in a continuous distribution?”

For a normal distribution, the PDF looks complicated, but the meaning is simple:

It helps us find probabilities for continuous values
The highest point is at the mean (most likely)
The sides go down smoothly (less likely)

You cannot take one point and say “this value has 10% probability.”
For continuous data, any single exact value has probability zero, so we talk about areas under the curve instead.

Think of the curve as a mountain.
Probability = how much area lies under that mountain between two points.
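
Here is a rough sketch of the "area under the mountain" idea using Python's built-in `statistics.NormalDist`. The mean of 165 cm and standard deviation of 7 cm are hypothetical values chosen only for the example.

```python
from statistics import NormalDist

# Hypothetical height distribution: mean 165 cm, standard deviation 7 cm.
heights = NormalDist(mu=165, sigma=7)

# The PDF gives the height of the curve at a point. It is a density,
# not a probability, so it should not be read as "x% chance".
density_at_mean = heights.pdf(165)

# Probability = area under the curve between two points.
# The CDF gives the area to the left of a value, so we subtract.
p_between = heights.cdf(172) - heights.cdf(158)

print(f"density at 165 cm: {density_at_mean:.4f}")
print(f"P(158 cm <= height <= 172 cm): {p_between:.4f}")  # about 0.68
```

The range 158 to 172 cm is one standard deviation on each side of the mean, which is why the area comes out near 68%.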

This helps in:

calculating confidence intervals
computing z-scores
understanding statistical tests

Pearson's Correlation Coefficient (r)

Pearson’s correlation tells us:

“How strongly are two numerical variables linearly related?”

It gives a number between -1 and +1:

Value (r)	Meaning
+1	Perfect positive relationship
0	No linear relationship
-1	Perfect negative relationship

Examples:

Height vs weight → positive correlation
Age vs interest in toys → negative correlation
Shoe size vs IQ → almost zero correlation

In simple terms:
If one goes up and the other goes up too → positive.
If one goes up and the other goes down → negative.
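
To show where the number r comes from, here is a minimal from-scratch sketch. The helper name `pearson_r` and the tiny datasets are made up for illustration; in real projects you would more likely reach for a library function.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r: how closely the points follow a straight line."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # How much x and y move together around their means.
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("r is undefined when one variable never changes")
    return co_movement / math.sqrt(spread_x * spread_y)

x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [3, 5, 7, 9, 11])    # y rises with x in a straight line
r_down = pearson_r(x, [9, 8, 7, 6, 5])   # y falls as x rises
r_curve = pearson_r(x, [4, 1, 0, 1, 4])  # U-shaped pattern, not a line

print(r_up, r_down, r_curve)
```

The last case is worth a second look: y depends completely on x, yet r comes out at zero because the pattern is a curve. That is why r = 0 means "no linear relationship" rather than "no relationship at all".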

Practical Use Cases

Concept	Real-Life Use	Data Science Use
Discrete RV	Counting customers	Classification features
Continuous RV	Measuring weight or speed	Regression, clustering
PDF	Finding chances in continuous data	Hypothesis testing, probability models
Pearson Correlation	See if two things are linked	Feature selection, EDA

When Are These Useful in Machine Learning?

1. Feature Engineering

Correlation helps detect:

predictive features
multicollinearity (when features are strongly correlated with each other)

2. Understanding Your Dataset

Random variables and distributions help decide:

Which visualization to use
Which model suits the data
Whether scaling/normalization is required

3. Statistical Testing

The PDF and the normal distribution help compute:

z-scores
p-values
confidence intervals
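
The three items above can be sketched in a few lines. All the numbers here are hypothetical (a population mean of 165 cm, a standard deviation of 7 cm, a sample of 49 people averaging 166 cm), and the interval assumes the standard deviation is already known, which is a simplification.

```python
import math
from statistics import NormalDist

mu, sigma = 165.0, 7.0          # hypothetical population mean and std dev (cm)
standard_normal = NormalDist()  # mean 0, standard deviation 1

# z-score: how many standard deviations a value sits from the mean.
x = 179.0
z = (x - mu) / sigma

# Two-sided p-value: chance of landing at least this far from the mean,
# in either direction, if the data really follow this normal curve.
p_value = 2 * (1 - standard_normal.cdf(abs(z)))

# 95% confidence interval for a mean, assuming sigma is known and a
# hypothetical sample of 49 people with an average height of 166 cm.
n, sample_mean = 49, 166.0
z_crit = standard_normal.inv_cdf(0.975)  # about 1.96
margin = z_crit * sigma / math.sqrt(n)
ci = (sample_mean - margin, sample_mean + margin)

print(f"z = {z:.2f}, p = {p_value:.4f}")
print(f"95% CI: {ci[0]:.2f} to {ci[1]:.2f} cm")
```

Each step is just an area under the normal curve, read off the CDF or its inverse.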

Simple Examples to Lock the Concepts

Example 1: Discrete

Number of pets in a house:

0, 1, 2, 3, … Countable. No decimals.

Example 2: Continuous

Time taken to run 100 meters:

12.5 s, 12.51 s, 12.512 s, … Infinite possibilities.

Example 3: Pearson Correlation

Study time vs test score → high positive
Ice cream sales vs temperature → positive
Mobile use vs sleep → negative

I love breaking down complex topics into simple, easy-to-understand explanations so everyone can follow along. If you're into learning AI in a beginner-friendly way, make sure to follow for more!

Connect on LinkedIn: https://www.linkedin.com/in/chanchalsingh22/
Connect on YouTube: https://www.youtube.com/@Brains_Behind_Bots

DEV Community
