DEV Community

Chanchal Singh


Statistics Day5: The Super-Simple Guide to Random Variables and Correlation for Data Science Beginners

If you’re learning statistics for data science, you’ll hear words that sound very big: random variables, PDF, correlation, and more.

But don’t worry.
Today, we’ll break everything down in simple language so even a 10-year-old can follow.


What Is a Random Variable?

A random variable is just a rule that turns the result of a random activity into a number.

Think of it like this:
You do something uncertain → you get a number as a result.

Example: Roll a die → you get 1, 2, 3, 4, 5, or 6.
The number that comes up is the value of your random variable.

There are two types:


1. Discrete Random Variables

Discrete means you can count the possible values.
They come in separate chunks — no in-between values.

Examples:

  • Number of chocolates in a box (you can’t have 4.6 chocolates)
  • Number of students absent
  • Dice outcome (1–6)

Why does it matter in data science?
You use discrete random variables when your feature takes clear, countable values.
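To see a discrete random variable in action, here is a minimal sketch that simulates die rolls using only Python's standard library. The seed, the number of rolls, and the helper name `roll_die` are arbitrary choices made for this illustration.

```python
import random
from collections import Counter


def roll_die(rng):
    """One random activity (a die roll) turned into a number from 1 to 6."""
    return rng.randint(1, 6)


rng = random.Random(42)  # fixed seed (arbitrary) so every run gives the same rolls
rolls = [roll_die(rng) for _ in range(6000)]

counts = Counter(rolls)
for face in sorted(counts):
    # Each face should appear in roughly 1/6 (about 0.167) of the rolls.
    print(face, counts[face], round(counts[face] / len(rolls), 3))
```

Only six separate values ever appear, and nothing falls in between them. That is what makes the variable discrete.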

[Figure: discrete vs. continuous random variables]


2. Continuous Random Variables

Continuous means the value can be any number within a range, including decimals.

Examples:

  • Height (160.25 cm is possible)
  • Temperature (34.7°C, 34.75°C…)
  • Weight

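Why does it matter?
Many statistical methods and some ML models (for example, Gaussian naive Bayes, or the error term in linear regression) assume that continuous data follows a pattern such as the normal distribution.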


What Is a Normal Distribution?

A normal distribution is the famous bell-shaped curve.

[Figure: normal distribution curve]

It looks like a hill that is:

  • highest in the middle
  • smooth
  • symmetric

The height of the hill shows how common values are, so values near the mean are the most common.

Example: Most people’s heights cluster around an average.
Only a few are extremely short or extremely tall.
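The clustering around the average can be checked with a small simulation. This is a rough sketch: the mean of 165 cm and the standard deviation of 7 cm are made-up numbers, not real survey data, and `share_within` is a helper written just for this example.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed (arbitrary) for repeatable output
MEAN_CM, SD_CM = 165.0, 7.0  # made-up values, for illustration only

# Draw 10,000 simulated heights from a normal distribution.
heights = [rng.gauss(MEAN_CM, SD_CM) for _ in range(10_000)]

sample_mean = statistics.fmean(heights)
sample_sd = statistics.stdev(heights)


def share_within(values, center, spread, k):
    """Fraction of values that lie within k * spread of center."""
    return sum(abs(v - center) <= k * spread for v in values) / len(values)


near = share_within(heights, sample_mean, sample_sd, 1)
far = 1 - share_within(heights, sample_mean, sample_sd, 3)

print(f"within 1 standard deviation of the mean: {near:.1%}")
print(f"more than 3 standard deviations away:    {far:.1%}")
```

For normally distributed data, about two thirds of the values fall within one standard deviation of the mean, and well under 1% fall more than three standard deviations away.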


What Is the Probability Density Function (PDF)?

The PDF is simply a formula that tells us:

“How likely is it for a value to land near a given point in a continuous distribution?”

For a normal distribution, the PDF looks complicated, but the meaning is simple:

  • It helps us find probabilities for continuous values
  • The highest point is at the mean (most likely)
  • The sides go down smoothly (less likely)

You cannot take one point and say “this value has 10% probability.”
For continuous data, we talk about areas under the curve.

[Figure: probability density function]

Think of the curve as a mountain.
Probability = how much area lies under that mountain between two points.
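The "area under the mountain" idea can be tried out with `statistics.NormalDist` from Python's standard library (Python 3.8+). This is a minimal sketch: the height distribution (mean 165 cm, standard deviation 7 cm) is a hypothetical example.

```python
from statistics import NormalDist

# Hypothetical height distribution: mean 165 cm, standard deviation 7 cm.
heights = NormalDist(mu=165.0, sigma=7.0)

# pdf() gives the height of the curve at a point. It is a density, not a probability.
density_at_mean = heights.pdf(165.0)

# Probability = area under the curve between two points,
# computed as a difference of the cumulative distribution function (cdf).
p_between = heights.cdf(172.0) - heights.cdf(158.0)

# The area over a single point has zero width, so its probability is 0.
p_exact = heights.cdf(165.0) - heights.cdf(165.0)

# A density can even be larger than 1 when the curve is very narrow.
narrow_peak = NormalDist(mu=0.0, sigma=0.1).pdf(0.0)

print(f"density at the mean:         {density_at_mean:.4f}")
print(f"P(158 <= height <= 172):     {p_between:.4f}")
print(f"P(height is exactly 165):    {p_exact}")
print(f"peak density when sigma=0.1: {narrow_peak:.2f}")
```

The last line is the reason a single point on the curve cannot be read as "this value has 10% probability". Only areas are probabilities.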

This helps in:

  • calculating confidence intervals
  • computing z-scores
  • understanding statistical tests

Pearson's Correlation Coefficient (r)

Pearson’s correlation tells us:

“How strongly are two numerical variables related?”

It gives a number between -1 and +1:

| Value (r) | Meaning                       |
|-----------|-------------------------------|
| +1        | Perfect positive relationship |
| 0         | No linear relationship        |
| -1        | Perfect negative relationship |

[Figure: Pearson's correlation coefficient]

Examples:

  • Height vs weight → positive correlation
  • Age vs hours spent playing with toys → negative correlation
  • Shoe size vs IQ → almost zero correlation

In simple terms:
If one goes up and the other goes up too → positive.
If one goes up and the other goes down → negative.
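Here is a small from-scratch sketch of Pearson's r, written out so each step is visible. The function name `pearson_r` and the tiny data sets are made up for this illustration. In practice you would usually call a library routine instead.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r: how closely paired numbers follow a straight line."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # How much x and y move together around their means.
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable moves on its own.
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("r is undefined when one variable never changes")
    return co_movement / math.sqrt(spread_x * spread_y)


hours = [1, 2, 3, 4, 5]
goes_up = [2 * h + 1 for h in hours]      # rises in a straight line
goes_down = [10 - h for h in hours]       # falls in a straight line
u_shape = [(h - 3) ** 2 for h in hours]   # clearly related, but not in a straight line

print(round(pearson_r(hours, goes_up), 3))    # close to +1
print(round(pearson_r(hours, goes_down), 3))  # close to -1
print(round(pearson_r(hours, u_shape), 3))    # close to 0
```

The U-shaped example shows why 0 means "no linear relationship" rather than "no relationship": the two variables are clearly linked, just not along a straight line.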


Practical Use Cases

| Concept             | Real-Life Use                      | Data Science Use                       |
|---------------------|------------------------------------|----------------------------------------|
| Discrete RV         | Counting customers                 | Classification features                |
| Continuous RV       | Measuring weight or speed          | Regression, clustering                 |
| PDF                 | Finding chances in continuous data | Hypothesis testing, probability models |
| Pearson Correlation | See if two things are linked       | Feature selection, EDA                 |

When Are These Useful in Machine Learning?

1. Feature Engineering

Correlation helps detect:

  • predictive features
  • multicollinearity (when features are too similar)
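As a rough sketch of both checks, the snippet below correlates each feature with a target and then flags feature pairs that are nearly copies of each other. All numbers are made up, and the 0.9 cutoff is an arbitrary assumption for this example, not a standard rule.

```python
import math
from itertools import combinations


def pearson_r(xs, ys):
    """Pearson's r for two equal-length lists of numbers."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_movement / math.sqrt(spread_x * spread_y)


# Made-up data: three features and the target we want to predict.
height_cm = [150, 160, 170, 180, 190]
features = {
    "height_cm": height_cm,
    "height_in": [cm / 2.54 for cm in height_cm],  # same information, different unit
    "shoe_size": [38, 41, 40, 44, 43],
}
weight_kg = [50, 58, 66, 77, 85]  # target

# 1) Predictive features: how strongly does each feature track the target?
target_r = {name: pearson_r(values, weight_kg) for name, values in features.items()}
for name, r in target_r.items():
    print(f"{name} vs weight_kg: r = {r:.3f}")

# 2) Multicollinearity: which feature pairs are almost duplicates?
THRESHOLD = 0.9  # arbitrary cutoff chosen for this example
flagged = []
for a, b in combinations(features, 2):
    r = pearson_r(features[a], features[b])
    if abs(r) > THRESHOLD:
        flagged.append((a, b, r))

for a, b, r in flagged:
    print(f"too similar: {a} and {b} (r = {r:.3f})")
```

Here `height_cm` and `height_in` carry the same information, so one of them can usually be dropped before training.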

2. Understanding Your Dataset

Random variables and distributions help decide:

  • Which visualization to use
  • Which model suits the data
  • Whether scaling/normalization is required

3. Statistical Testing

PDF + normal distribution help compute:

  • z-scores
  • p-values
  • confidence intervals
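All three quantities can be sketched with `statistics.NormalDist` from the standard library. The test scores are made up, and the snippet assumes the data is roughly normal. With only ten values, a t-distribution would normally be the safer choice for the interval, so treat this as an illustration of the idea rather than a recipe.

```python
import math
from statistics import NormalDist, fmean, stdev

scores = [72, 75, 78, 80, 81, 83, 85, 88, 90, 98]  # made-up test scores

mean = fmean(scores)
sd = stdev(scores)

# z-score: how many standard deviations the top score sits above the mean.
z = (98 - mean) / sd

# Two-sided tail area under the standard normal curve beyond |z|.
# This is the calculation behind a p-value for a z statistic.
standard_normal = NormalDist()  # mean 0, standard deviation 1
p_two_sided = 2 * (1 - standard_normal.cdf(abs(z)))

# 95% confidence interval for the mean, using the normal approximation.
z_crit = standard_normal.inv_cdf(0.975)  # about 1.96
margin = z_crit * sd / math.sqrt(len(scores))
ci = (mean - margin, mean + margin)

print(f"mean = {mean:.1f}, sd = {sd:.2f}")
print(f"z-score of 98 = {z:.2f}")
print(f"two-sided tail area = {p_two_sided:.3f}")
print(f"95% CI for the mean = ({ci[0]:.1f}, {ci[1]:.1f})")
```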

Simple Examples to Lock the Concepts

Example 1: Discrete

Number of pets in a house:

  • 0, 1, 2, 3…

Countable. No decimals.

Example 2: Continuous

Time taken to run 100 meters:

  • 12.5 s, 12.51 s, 12.512 s

Infinite possibilities.

Example 3: Pearson Correlation

Study time vs test score → high positive
Ice cream sales vs temperature → positive
Mobile use vs sleep → negative


I love breaking down complex topics into simple, easy-to-understand explanations so everyone can follow along. If you're into learning AI in a beginner-friendly way, make sure to follow for more!

Connect on LinkedIn: https://www.linkedin.com/in/chanchalsingh22/
Connect on YouTube: https://www.youtube.com/@Brains_Behind_Bots
