Feature selection is one of the most important steps before building any machine learning model.
And one of the simplest tools to do this is correlation.
But correlation alone doesn’t tell the whole story.
To use it correctly, you also need to understand variance, standard deviation, and a few other related statistical terms.
This blog breaks everything down in the simplest way possible — no heavy maths, just practical understanding.
1. What Is Correlation?
Correlation tells us how two numerical features move together.
- If they grow together → positive correlation
- If one grows while the other falls → negative correlation
- If they don’t move in any clear pattern → zero correlation
Correlation ranges from –1 to +1:
- +1 → they move together perfectly
- –1 → they move in perfectly opposite directions
- 0 → no linear relationship
In feature selection, correlation helps you answer:
“Which features are actually related to the target?”
“Which features are repeating the same information?”
2. How Do We Use Correlation for Feature Selection?
A. Select Features That Are Correlated With the Target
If you're predicting house price and size_in_sqft is strongly correlated with price, that feature is useful.
Example:
| Feature | Correlation with Price |
|---|---|
| Size (sqft) | 0.82 |
| No. of rooms | 0.65 |
| Age of house | –0.20 |
| Zip code | 0.05 |
High correlation → strong predictive power.
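Here's a minimal pandas sketch of this check (the column names and numbers are hypothetical, made up for illustration):

```python
import pandas as pd

# Hypothetical housing data: made-up columns and values
df = pd.DataFrame({
    "size_sqft": [800, 1200, 1500, 2000, 2400],
    "num_rooms": [2, 3, 3, 4, 5],
    "age":       [30, 12, 8, 5, 2],
    "price":     [150, 230, 280, 390, 460],
})

# Pearson correlation of every feature with the target
correlations = df.corr()["price"].drop("price")
print(correlations.sort_values(ascending=False))
```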
B. Remove Features That Are Highly Correlated With Each Other
When two features are too similar, they cause multicollinearity, which confuses models (especially regression).
Example:
- height and total_floors → correlation 0.95
- They're giving the same information.
- You keep only one.
This makes your model:
- simpler
- faster
- less noisy
- more stable
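If you want to automate this, one common recipe (just a sketch; the 0.9 cutoff is a judgment call, not a universal rule) is to scan the upper triangle of the correlation matrix so each pair is checked once:

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one feature from every pair whose |correlation| exceeds the threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair appears exactly once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)
```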
C. The Big Warning: Correlation Only Catches Linear Relationships
If a feature has a non-linear relationship with the target, its correlation may be close to 0 even when the feature is genuinely useful.
Example:
Predicting salary from experience: salary grows quickly at first and then flattens out, a non-linear curve.
Low correlation does not mean the feature is useless.
Best practice:
Include the feature anyway and check feature importance using:
- Random Forest
- XGBoost
- SHAP values
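A tiny synthetic demo (toy data, not the salary example above) shows the failure mode: a feature with a U-shaped relationship to the target has near-zero Pearson correlation, yet a random forest immediately ranks it as the important one:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 1000)             # useful feature, but non-linear
noise = rng.uniform(-3, 3, 1000)         # pure noise feature for contrast
y = x ** 2 + rng.normal(0, 0.1, 1000)    # U-shaped target

print(np.corrcoef(x, y)[0, 1])           # close to 0: correlation calls x "useless"

X = np.column_stack([x, noise])
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(model.feature_importances_)        # x dominates, noise is near zero
```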
3. Variance — How Spread Out the Data Is
Variance tells you how much the values are spread from the average.
- Low variance → values are almost the same
- High variance → wide variety of values
Example:
| Values | Variance |
|---|---|
| 50, 50, 50, 50 | Very low |
| 10, 80, 120, 200 | Very high |
In feature selection:
Features with extremely low variance (almost constant features) should be removed.
Example:
- A column with 99% “No” and 1% “Yes”
- Gives almost no information
This is called low-variance filtering.
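scikit-learn has this built in as VarianceThreshold. A minimal sketch (the 0.05 threshold is an arbitrary choice for illustration):

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical near-constant column: 99% "No" (0) and 1% "Yes" (1)
X = pd.DataFrame({
    "almost_constant": [0] * 99 + [1],
    "useful": list(range(100)),
})

selector = VarianceThreshold(threshold=0.05)
selector.fit(X)
print(selector.get_support())  # [False  True]: the near-constant column is dropped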
4. Standard Deviation — The More Interpretable Version of Variance
Standard deviation (SD) is the square root of variance.
Why do we use SD?
Because SD is in the same units as the data, so it’s easier to interpret.
Example:
- Variance = 2500
- SD = 50

So SD = 50 means: "On average, values are about 50 units away from the mean."
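In NumPy both are one call each; this toy array is chosen so the numbers match the example above:

```python
import numpy as np

values = np.array([50, 150, 50, 150])  # mean is 100, every value is 50 away

print(np.var(values))  # 2500.0  (population variance)
print(np.std(values))  # 50.0    (square root of the variance)
```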
In data science:
- High SD → more spread
- Low SD → less spread
SD is important in:
- normal distribution
- Z-score normalization
- outlier detection
5. Practical Use Cases in Real Data Science
A. Feature Engineering
- Remove highly correlated features
- Keep features that correlate with the target
- Remove low-variance features
- Treat outliers using SD
B. Model Stability (Regression Models)
High correlation among features (multicollinearity):
- inflates coefficients
- makes the model unstable
- reduces interpretability
Solution:
- Correlation matrix
- Variance Inflation Factor (VIF)
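With statsmodels, VIF takes only a few lines (a sketch with hypothetical columns; a common rule of thumb is that VIF above roughly 5–10 signals multicollinearity):

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical numeric feature matrix (no target column)
X = pd.DataFrame({
    "size_sqft": [800, 1200, 1500, 2000, 2400, 3000],
    "num_rooms": [2, 3, 3, 4, 5, 6],
    "age":       [30, 12, 8, 5, 2, 1],
})

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)
```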
C. Detecting Outliers
Using SD:
- Any value more than 3 SD from the mean is often considered an outlier.

This helps clean the dataset before modeling.
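A minimal NumPy version of the 3-SD rule (synthetic data with one injected outlier):

```python
import numpy as np

rng = np.random.default_rng(0)
values = np.append(rng.normal(100, 10, 1000), [250.0])  # inject one outlier

mean, sd = values.mean(), values.std()
mask = np.abs(values - mean) <= 3 * sd  # keep points within 3 SD of the mean

print(values[~mask])    # the flagged outliers (includes the injected 250)
cleaned = values[mask]
```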
D. Normalization
Z-score = (value – mean) ÷ SD
Used heavily in:
- KNN
- SVM
- Gradient descent-based models
Because these models depend on distances (or on gradient scales), standardization is essential.
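The manual formula and scikit-learn's StandardScaler give the same result; here's a quick check on a toy column:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[50.0], [150.0], [50.0], [150.0]])  # toy feature column

z_manual = (X - X.mean()) / X.std()       # (value - mean) / SD
z_sklearn = StandardScaler().fit_transform(X)

print(np.allclose(z_manual, z_sklearn))   # True
```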
6. Quick Summary Table
| Concept | Meaning | Why It Matters for Feature Selection |
|---|---|---|
| Correlation | How two features move together | Helps identify useful or redundant features |
| Variance | How spread out the data is | Remove near-constant features |
| Standard Deviation | Average spread from the mean | Used in scaling and outlier detection |
| High Feature-to-Target Correlation | Strong predictor | Keep it |
| High Feature-to-Feature Correlation | Redundant | Remove one |
| Low Correlation | Not always useless | Check with ML model importance |
7. Final Takeaways
- Use correlation to pick predictive features.
- Remove features that are too similar to each other.
- Use variance and standard deviation to spot boring or noisy features.
- Always validate with ML models because correlation misses non-linear relationships.
Feature selection is not just theory — it’s one of the most practical skills in data science.
If you understand correlation, variance, and SD, you're already ahead.
Connect on LinkedIn: https://www.linkedin.com/in/chanchalsingh22/
Connect on YouTube: https://www.youtube.com/@Brains_Behind_Bots
I love breaking down complex topics into simple, easy-to-understand explanations so everyone can follow along. If you're into learning AI in a beginner-friendly way, make sure to follow for more!


