🔍 Step-by-Step Guide to Understanding Your Data with .describe()
(Using a Regression Dataset)
Whether you're a beginner in data science or preparing data for a machine learning model, the first step is always:
📌 Understand the data before using it.
And one of the most powerful tools for that is:
df.describe()
In this post, I’ll walk you through exactly how to analyze and interpret **`.describe()` output**, step by step, using a **realistic regression dataset** with features like `age`, `income_lakhs`, and `premium_amount`.
📦 Step 1: Load the Dataset
Let’s simulate a dataset like you'd use for predicting insurance premiums.
import numpy as np  # used for the log transforms in later steps
import pandas as pd
# Simulated dataset
data = {
    'age': [22, 31, 45, 356, 18, 29, 60],
    'number_of_dependants': [0, 2, 3, 1, -3, 1, 5],
    'income_lakhs': [7, 17, 31, 930, 1, 9, 15],
    'premium_amount': [8608, 13928, 22274, 43471, 3501, 9000, 11000]
}
df = pd.DataFrame(data)
📊 Step 2: Use `describe()` to Summarize the Data
df.describe()
You’ll get this output:
Statistic | age | number_of_dependants | income_lakhs | premium_amount |
---|---|---|---|---|
count | 7.0 | 7.0 | 7.0 | 7.0 |
mean | 80.14 | 1.29 | 144.29 | 15968.86 |
std | 122.48 | 2.50 | 346.60 | 13436.10 |
min | 18.0 | -3.0 | 1.0 | 3501.0 |
25% | 25.5 | 0.5 | 8.0 | 8804.0 |
50% (median) | 31.0 | 1.0 | 15.0 | 11000.0 |
75% | 52.5 | 2.5 | 24.0 | 18101.0 |
max | 356.0 | 5.0 | 930.0 | 43471.0 |
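If a quartile looks odd (for example, 25.5 for `age` when nobody in the data is 25.5), that's because pandas interpolates linearly between neighbouring sorted values by default. Here is a minimal pure-Python sketch of that calculation; `linear_percentile` is a hypothetical helper written for illustration, not a pandas function:

```python
def linear_percentile(values, q):
    """Percentile by linear interpolation between sorted neighbours (q in [0, 1])."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)        # fractional index into the sorted list
    lower = int(position)                    # index just below the position
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

ages = [22, 31, 45, 356, 18, 29, 60]
print(linear_percentile(ages, 0.25))  # 25.5, halfway between 22 and 29
print(linear_percentile(ages, 0.50))  # 31.0, the middle value
```

With 7 values, the 25% position is index 1.5 in the sorted list, so the result sits halfway between the second and third smallest ages.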
🧠 Step 3: Check for Missing or Incomplete Data
Check the `count` row, which reports the number of non-null values in each column:
- Is the count equal to the total number of rows?
- If not, there are missing values.
In our case, all counts = 7 → ✅ No missing values.
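To see how `count` behaves when values really are missing, here is a small sketch on a hypothetical three-row frame (made up for illustration, not part of the dataset above). `count` skips `NaN`, so it drops below the row count, and `isna().sum()` reports the gaps per column directly:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing income value
demo = pd.DataFrame({
    'age': [22, 31, 45],
    'income_lakhs': [7.0, np.nan, 31.0],
})

non_null = demo.describe().loc['count']   # non-null values per column
missing = demo.isna().sum()               # missing values per column

print(non_null)
print(missing)
print(len(demo) - non_null)               # same information, derived from count
```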
❌ Step 4: Spot Invalid Values
Now look for logically impossible values:
- `age = 356` → Invalid! No one lives that long.
- `number_of_dependants = -3` → Can’t have negative dependents.
✅ Action: These are data quality issues. You should either:
- Remove those rows, or
- Impute/fix the incorrect values.
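One way to pull out the offending rows is with boolean masks. The sketch below assumes that an age above 100 is implausible (the same cutoff used in Step 8) and that dependants can't be negative; adjust both rules to your own domain:

```python
import pandas as pd

df = pd.DataFrame({
    'age': [22, 31, 45, 356, 18, 29, 60],
    'number_of_dependants': [0, 2, 3, 1, -3, 1, 5],
    'income_lakhs': [7, 17, 31, 930, 1, 9, 15],
    'premium_amount': [8608, 13928, 22274, 43471, 3501, 9000, 11000],
})

# Assumed validity rules (illustrative thresholds, not universal constants)
bad_age = (df['age'] < 0) | (df['age'] > 100)
bad_dependants = df['number_of_dependants'] < 0

invalid_rows = df[bad_age | bad_dependants]
print(invalid_rows)                          # the rows with age 356 and -3 dependants
print(bad_age.sum(), bad_dependants.sum())   # how many rows break each rule
```

Looking at the full rows, rather than just the summary, helps you decide whether to drop or repair them.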
📈 Step 5: Detect Skewed Data (Mean vs Median)
Compare the mean and 50% (median):
- `income_lakhs`: mean = 144.29, median = 15 → heavily right-skewed (one very large income pulls the mean up)
- `premium_amount`: mean = 15,969, median = 11,000 → moderate right skew
✅ Action:
- Heavily skewed features or targets can let a few extreme values dominate a regression fit.
- Apply log transformation to handle skew:
df['log_income'] = np.log1p(df['income_lakhs'])
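If you'd like a number to go with the mean-vs-median comparison, pandas also offers `skew()`. Here is a rough sketch that compares skewness before and after the log transform. Values near 0 suggest a roughly symmetric distribution, and with only seven rows the result is indicative at best:

```python
import numpy as np
import pandas as pd

income = pd.Series([7, 17, 31, 930, 1, 9, 15], name='income_lakhs')
log_income = np.log1p(income)             # log(1 + x), safe for x = 0

gap = income.mean() - income.median()     # a positive gap hints at right skew
print(f"mean - median: {gap:.2f}")
print(f"skew before:   {income.skew():.2f}")
print(f"skew after:    {log_income.skew():.2f}")
```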
📉 Step 6: Detect Outliers with IQR Method
Use IQR (Interquartile Range) to find outliers:
Q1 = df['income_lakhs'].quantile(0.25)
Q3 = df['income_lakhs'].quantile(0.75)
IQR = Q3 - Q1
upper_limit = Q3 + 1.5 * IQR
outliers = df[df['income_lakhs'] > upper_limit]
In our example, Q1 = 8 and Q3 = 24, so IQR = 16 and the upper limit is 24 + 1.5 × 16 = 48. `income_lakhs = 930` is a clear outlier, far beyond that limit.
✅ Action: Either:
- Remove the outlier if it’s not realistic.
- Cap it using winsorization.
- Log-transform the feature to reduce skew.
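Capping can be done with `clip()`. The sketch below applies a winsorization-style cap to `income_lakhs` at the IQR fences instead of deleting the row. Using the fences as the caps is an assumption made for illustration; percentile-based caps, such as the 1st and 99th, are another common choice:

```python
import pandas as pd

income = pd.Series([7, 17, 31, 930, 1, 9, 15], name='income_lakhs')

q1 = income.quantile(0.25)                # 8.0
q3 = income.quantile(0.75)                # 24.0
iqr = q3 - q1                             # 16.0
lower_limit = q1 - 1.5 * iqr              # -16.0
upper_limit = q3 + 1.5 * iqr              # 48.0

# Values outside the fences are pulled back to the nearest fence
income_capped = income.clip(lower=lower_limit, upper=upper_limit)
print(income_capped.tolist())             # 930 becomes 48.0, the rest are unchanged
```

Capping keeps the row (and its other columns) in the dataset, which matters when you only have a handful of observations.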
📊 Step 7: Visualize It (Boxplot or Histogram)
import seaborn as sns
import matplotlib.pyplot as plt
sns.boxplot(x=df['income_lakhs'])
plt.title("Income Distribution")
plt.show()
This will visually show the outlier (930) far outside the box.
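A histogram is the other option mentioned in the heading. Here is a minimal sketch using plain matplotlib that puts the raw and log-transformed income side by side, so you can see how much the transform compresses the long right tail (`plot_income_histograms` is a hypothetical helper name):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def plot_income_histograms(income):
    """Draw raw and log1p-transformed income histograms side by side."""
    fig, (ax_raw, ax_log) = plt.subplots(1, 2, figsize=(10, 4))
    ax_raw.hist(income, bins=10)
    ax_raw.set_title("Income (raw)")
    ax_log.hist(np.log1p(income), bins=10)
    ax_log.set_title("Income (log1p)")
    return fig

income = pd.Series([7, 17, 31, 930, 1, 9, 15], name='income_lakhs')
fig = plot_income_histograms(income)
plt.show()
```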
✅ Step 8: Clean the Data
Here’s what I’d do before modeling:
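# Drop rows with an implausible age (assumes nobody in the data is over 100)
df = df[df['age'] <= 100].copy()  # .copy() avoids SettingWithCopyWarning on the assignments below
# Set negative dependents to 0
df['number_of_dependants'] = df['number_of_dependants'].clip(lower=0)
# Log-transform income and premium (np is NumPy, i.e. import numpy as np)
df['log_income'] = np.log1p(df['income_lakhs'])
df['log_premium'] = np.log1p(df['premium_amount'])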
Now the data is ready for regression modeling!
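Before moving on, it's worth running `describe()` once more to confirm the fixes landed. This sketch wraps the steps above in a hypothetical `clean` function and checks the result. The assertions encode the same assumptions as before (age at most 100, no negative dependants):

```python
import numpy as np
import pandas as pd

def clean(raw):
    """Apply the cleaning steps from this section to a copy of the data."""
    out = raw[raw['age'] <= 100].copy()
    out['number_of_dependants'] = out['number_of_dependants'].clip(lower=0)
    out['log_income'] = np.log1p(out['income_lakhs'])
    out['log_premium'] = np.log1p(out['premium_amount'])
    return out

raw = pd.DataFrame({
    'age': [22, 31, 45, 356, 18, 29, 60],
    'number_of_dependants': [0, 2, 3, 1, -3, 1, 5],
    'income_lakhs': [7, 17, 31, 930, 1, 9, 15],
    'premium_amount': [8608, 13928, 22274, 43471, 3501, 9000, 11000],
})
cleaned = clean(raw)

summary = cleaned.describe()
print(summary.loc[['count', 'min', 'max']])

# Sanity checks: these pass silently if the cleaning worked
assert summary.loc['max', 'age'] <= 100
assert summary.loc['min', 'number_of_dependants'] >= 0
```

Keeping the cleaning in one function also leaves the raw frame untouched, so you can rerun it if you change a threshold.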
🔀 Recap: Your Step-by-Step Process
Step | What to Do |
---|---|
1️⃣ | Load and explore your data |
2️⃣ | Use .describe() for summary stats |
3️⃣ | Check for missing values |
4️⃣ | Identify invalid entries |
5️⃣ | Analyze distribution using mean vs median |
6️⃣ | Detect outliers using IQR |
7️⃣ | Visualize with plots |
8️⃣ | Clean and transform data |
🧠 Final Thoughts
The `.describe()` method is your first lens into the soul of your dataset.
Use it well, and you’ll:
- Catch bad data before it messes up your model
- Discover hidden patterns (like skew or outliers)
- Build better models, faster
🚀 Want to Try It Yourself?
Use any regression dataset (insurance, housing prices, salary prediction), and go through these steps. Start with:
df.describe()
And from there, let the data speak.