Correlation measures the strength and direction of a relationship between two numerical variables.
👉 It answers questions like:
When X increases, does Y increase or decrease?
How strongly are X and Y related?
📌 Correlation does NOT mean causation.
Example:
Ice cream sales ↑ and temperature ↑ → correlated
Ice cream sales ↑ does NOT cause temperature ↑
2️⃣ Why Correlation is Important in Data Science
Correlation is used in:
✔ Exploratory Data Analysis (EDA)
✔ Feature selection
✔ Detecting multicollinearity
✔ Understanding data patterns
✔ Model simplification
✔ Business insights
Example:
If two features are highly correlated, one may be removed.
3️⃣ Direction of Correlation
➕ Positive Correlation
Both variables increase together
Example: Height & Weight
📈 Graph: Upward slope
➖ Negative Correlation
One increases, the other decreases
Example: Speed & Travel Time
📉 Graph: Downward slope
⚪ Zero Correlation
No relationship
Example: Shoe size & IQ
📊 Graph: Random scatter
4️⃣ Correlation Coefficient (r)
The correlation coefficient measures correlation numerically.
Range:
-1 ≤ r ≤ +1
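| Value of r | Meaning |
| --- | --- |
| +1 | Perfect positive |
| -1 | Perfect negative |
| 0 | No linear correlation |
| ±0.7 to ±1 | Strong |
| ±0.3 to ±0.7 | Moderate |
| ±0.0 to ±0.3 | Weak |
📌 The strong / moderate / weak cut-offs are rules of thumb, not fixed standards.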
5️⃣ Pearson Correlation (Most Common)
📌 Used for:
Linear relationships
Continuous numerical data
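Formula:
$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
Assumptions:
✔ Linear relationship
✔ No extreme outliers
✔ Normal distribution (optional but preferred)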
Example:
Study hours & exam marks
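A minimal sketch of Pearson's r in plain Python: the sum of products of deviations from the means, divided by the square root of the product of the summed squared deviations. The function name `pearson_r` and the study-hours data are made up for illustration; in practice a library routine such as `scipy.stats.pearsonr` would normally be used.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator: sum of products of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator terms: summed squared deviations of each variable
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)

# Hypothetical data: hours studied and exam marks for five students
study_hours = [1, 2, 3, 4, 5]
exam_marks = [52, 58, 65, 70, 80]

r = pearson_r(study_hours, exam_marks)
print(f"r = {r:.3f}")  # close to +1: strong positive linear relationship
```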
6️⃣ Spearman Rank Correlation
📌 Used for:
Monotonic relationships (linear or non-linear)
Ranked or ordinal data
Key Idea:
Convert values into ranks
Apply Pearson on ranks
Example:
Customer satisfaction rank vs loyalty rank
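The two-step idea (rank, then apply Pearson) can be sketched as follows. This is an illustrative plain-Python version, assuming tied values should share the average of their ranks; the helper names and the cubic data are hypothetical.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson_r(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def spearman_rho(xs, ys):
    """Spearman correlation: Pearson applied to the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

# y = x**3 is monotonic but not a straight line
xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]

print("Spearman:", spearman_rho(xs, ys))  # ranks agree perfectly
print("Pearson: ", pearson_r(xs, ys))     # lower, because the curve is not linear
```

Because only the ordering matters, Spearman reports a perfect relationship here while Pearson does not.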
7️⃣ Kendall’s Tau Correlation
📌 Used for:
Small datasets
Ordinal data
Data with tied ranks (the tau-b variant adjusts for ties)
Concept:
Counts concordant & discordant pairs
Example:
Ranking similarity between two judges
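Counting concordant and discordant pairs can be sketched directly. The version below is the tau-b variant, which corrects the denominator for ties; the function name and the judges' rankings are invented for the example, and `scipy.stats.kendalltau` is the usual ready-made alternative.

```python
from itertools import combinations
from math import sqrt

def kendall_tau_b(xs, ys):
    """Kendall's tau-b from concordant and discordant pair counts."""
    concordant = discordant = 0
    untied_x = untied_y = 0  # pairs whose x values differ / whose y values differ
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx != 0:
            untied_x += 1
        if dy != 0:
            untied_y += 1
        if dx * dy > 0:
            concordant += 1  # both variables order this pair the same way
        elif dx * dy < 0:
            discordant += 1  # the variables order this pair in opposite ways
    if untied_x == 0 or untied_y == 0:
        raise ValueError("tau is undefined when a variable is constant")
    return (concordant - discordant) / sqrt(untied_x * untied_y)

# Hypothetical rankings of five contestants by two judges
judge_a = [1, 2, 3, 4, 5]
judge_b = [2, 1, 3, 5, 4]

tau = kendall_tau_b(judge_a, judge_b)
print(f"tau = {tau:.2f}")  # 8 concordant, 2 discordant out of 10 pairs
```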
8️⃣ Correlation vs Covariance
| Covariance | Correlation |
| --- | --- |
| Measures joint variability | Measures strength & direction |
| Units depend on data | Unit-free |
| Hard to interpret | Easy to interpret |
| Range: −∞ to +∞ | Range: −1 to +1 |
📌 Correlation = Normalized covariance
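A small sketch of what "normalized" means here: dividing the covariance by both standard deviations cancels the units. The height and weight numbers are made up; the point is only that changing centimetres to metres rescales the covariance but leaves the correlation alone.

```python
from math import sqrt

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    std_x = sqrt(covariance(xs, xs))  # variance is the covariance of x with itself
    std_y = sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (std_x * std_y)

# Hypothetical measurements
height_cm = [150, 160, 170, 180, 190]
weight_kg = [52, 58, 69, 77, 88]
height_m = [h / 100 for h in height_cm]  # same heights, different unit

print("cov (cm):", covariance(height_cm, weight_kg))
print("cov (m): ", covariance(height_m, weight_kg))   # 100 times smaller
print("corr (cm):", correlation(height_cm, weight_kg))
print("corr (m): ", correlation(height_m, weight_kg))  # unchanged
```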
9️⃣ Correlation Matrix
A correlation matrix shows correlations between multiple variables.
Example:
| | A | B | C |
| --- | --- | --- | --- |
| A | 1 | 0.8 | -0.2 |
| B | 0.8 | 1 | -0.4 |
| C | -0.2 | -0.4 | 1 |
📌 Used in:
Feature selection
Heatmaps
Multivariate EDA
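As a rough illustration, a correlation matrix is just Pearson's r computed for every pair of columns. The sketch below uses plain Python and a made-up three-column dataset; with pandas the same result would typically come from `DataFrame.corr()`.

```python
from math import sqrt

def pearson_r(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def correlation_matrix(columns):
    """Nested dict: matrix[a][b] is the correlation between columns a and b."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical dataset with three numeric columns
data = {
    "A": [1, 2, 3, 4, 5],
    "B": [2, 4, 5, 4, 6],
    "C": [9, 7, 8, 5, 6],
}

matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, "  ".join(f"{row[other]:+.2f}" for other in data))
```

The diagonal is always 1 (each variable against itself) and the matrix is symmetric, so only one triangle needs to be read.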
🔥 10️⃣ Multicollinearity
What is it?
When independent variables (predictors) are highly correlated with each other
Problems:
❌ Unstable coefficients
❌ Reduced model interpretability
❌ Inflated variance
Detection:
Correlation Matrix
VIF (Variance Inflation Factor)
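One way to see how the two detection methods connect: the VIF of each predictor equals the matching diagonal entry of the inverse of the predictors' correlation matrix. The sketch below assumes A, B and C from the example correlation matrix are all predictors, and inverts the matrix with a small hand-written Gauss-Jordan routine. The function names are illustrative; `statsmodels` provides `variance_inflation_factor` for real datasets.

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular: perfect multicollinearity")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vif_from_correlation(corr):
    """VIF per predictor: the diagonal of the inverse correlation matrix."""
    inverse = invert(corr)
    return [inverse[i][i] for i in range(len(corr))]

# Correlation matrix of predictors A, B, C (values from the example matrix)
corr = [
    [1.0, 0.8, -0.2],
    [0.8, 1.0, -0.4],
    [-0.2, -0.4, 1.0],
]

for name, vif in zip("ABC", vif_from_correlation(corr)):
    print(f"VIF({name}) = {vif:.2f}")
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic multicollinearity, though the cut-off is a convention rather than a hard rule.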
11️⃣ Correlation ≠ Causation (Very Important)
Correlation does NOT mean one variable causes the other.
Example:
Crime rate & Ice cream sales are correlated
Both depend on temperature
📌 Hidden variable = Confounding factor
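A small simulation can make the confounding idea concrete. In this sketch all numbers are invented: ice cream sales and crime are each generated from temperature plus independent noise, and never from each other. The raw correlation comes out high, while the partial correlation (which controls for temperature) is close to zero.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def partial_r(xs, ys, zs):
    """Correlation of xs and ys after controlling for zs."""
    r_xy, r_xz, r_yz = pearson_r(xs, ys), pearson_r(xs, zs), pearson_r(ys, zs)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

rng = random.Random(0)  # fixed seed so the run is repeatable
temperature = [35 * rng.random() for _ in range(2000)]
# Both series depend on temperature plus independent noise, never on each other
ice_cream = [10 * t + 100 * (rng.random() - 0.5) for t in temperature]
crime = [2 * t + 40 * (rng.random() - 0.5) for t in temperature]

r_raw = pearson_r(ice_cream, crime)
r_partial = partial_r(ice_cream, crime, temperature)
print(f"raw r = {r_raw:.2f}, controlling for temperature = {r_partial:.2f}")
```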
12️⃣ Limitations of Correlation
⚠ Only measures linear relationships (Pearson)
⚠ Sensitive to outliers
⚠ Cannot capture cause-effect
⚠ Misses complex patterns
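Two of these limitations are easy to demonstrate with toy data (all values below are made up). First, a perfect but U-shaped relationship gives r = 0. Second, a single extreme point can flip a perfect negative correlation into a strongly positive one.

```python
from math import sqrt

def pearson_r(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# 1) y is fully determined by x, but the relationship is not linear
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]
r_quadratic = pearson_r(xs, ys)
print("U-shaped data:", r_quadratic)

# 2) Five points on a perfectly decreasing line, then one extreme outlier
clean_x, clean_y = [1, 2, 3, 4, 5], [5, 4, 3, 2, 1]
r_clean = pearson_r(clean_x, clean_y)
r_outlier = pearson_r(clean_x + [20], clean_y + [20])
print("without outlier:", r_clean)
print("with outlier:   ", round(r_outlier, 2))
```

This is why a scatter plot is worth checking before trusting a single correlation number.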
13️⃣ Correlation in Machine Learning
Used in:
Feature elimination
Dimensionality reduction
Data cleaning
Model diagnostics
Example:
Remove one of two features with |r| > 0.9 (a strong negative correlation is just as redundant as a strong positive one)
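A minimal sketch of this kind of feature elimination, assuming a simple greedy rule: walk through the features in order and keep one only if its absolute correlation with every already-kept feature stays at or below the threshold. The function name, feature names and values are hypothetical, and which feature of a correlated pair survives depends on column order.

```python
from math import sqrt

def pearson_r(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def drop_correlated(columns, threshold=0.9):
    """Return the names of features kept by a greedy |r| <= threshold filter."""
    kept = []
    for name in columns:
        # Keep this feature only if it is not redundant with any kept feature
        if all(abs(pearson_r(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical features: the two area columns carry the same information
features = {
    "area_sqft": [800, 1000, 1200, 1500, 2000],
    "area_sqm": [74.3, 92.9, 111.5, 139.4, 185.8],
    "age_years": [30, 5, 18, 12, 25],
}

kept = drop_correlated(features)
print(kept)
```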
14️⃣ Real-World Example (Data Science)
📌 Dataset: House Prices
| Feature | Correlation with Price |
| --- | --- |
| Area | +0.85 |
| Distance to city | -0.62 |
| Age of house | -0.40 |
| Bedrooms | +0.70 |
Interpretation:
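Area has a strong positive correlation with price: larger houses tend to cost more
Distance to city has a moderate negative correlation with price: houses farther out tend to cost less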
15️⃣ Visualizing Correlation
✔ Scatter plots
✔ Heatmaps
✔ Pair plots
16️⃣ Summary (Key Takeaways)
✔ Correlation measures relationship, not causation
✔ Range is from −1 to +1
✔ Pearson → Linear
✔ Spearman → Rank / Monotonic
✔ Used heavily in EDA & ML
✔ Helps detect redundancy in features