🔍 Step-by-Step Guide to Understanding Your Data with .describe() (Using a Regression Dataset)
Whether you're a beginner in data science or preparing data for a machine learning model, the first step is always:
📌 Understand the data before using it.
And one of the most powerful tools for that is:
df.describe()
In this blog, I’ll walk you through exactly how to analyze and interpret **.describe() output**, step by step, using a **realistic regression dataset** with features like age, income_lakhs, and premium_amount.
📦 Step 1: Load the Dataset
Let’s simulate a dataset like you'd use for predicting insurance premiums.
import pandas as pd
# Simulated dataset
data = {
'age': [22, 31, 45, 356, 18, 29, 60],
'number_of_dependants': [0, 2, 3, 1, -3, 1, 5],
'income_lakhs': [7, 17, 31, 930, 1, 9, 15],
'premium_amount': [8608, 13928, 22274, 43471, 3501, 9000, 11000]
}
df = pd.DataFrame(data)
📊 Step 2: Use describe() to Summarize the Data
df.describe()
You’ll get output like this (rounded to two decimals):
| | age | number_of_dependants | income_lakhs | premium_amount |
|---|---|---|---|---|
| count | 7.0 | 7.0 | 7.0 | 7.0 |
| mean | 80.14 | 1.29 | 144.29 | 15968.86 |
| std | 122.48 | 2.50 | 346.60 | 13436.10 |
| min | 18.0 | -3.0 | 1.0 | 3501.0 |
| 25% | 25.5 | 0.5 | 8.0 | 8804.0 |
| 50% (median) | 31.0 | 1.0 | 15.0 | 11000.0 |
| 75% | 52.5 | 2.5 | 24.0 | 18101.0 |
| max | 356.0 | 5.0 | 930.0 | 43471.0 |
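If you are wondering where a value like 25.5 for the 25% row of age comes from: by default pandas interpolates linearly between the two nearest sorted values. Here is a minimal sketch that reproduces it by hand (the variable names are mine, just for illustration):

```python
import pandas as pd

age = pd.Series([22, 31, 45, 356, 18, 29, 60])

# Sort the values and locate the (fractional) position of the 25th percentile
sorted_age = age.sort_values().tolist()       # [18, 22, 29, 31, 45, 60, 356]
position = 0.25 * (len(sorted_age) - 1)       # 1.5, i.e. halfway between index 1 and 2
lower = sorted_age[int(position)]             # 22
upper = sorted_age[int(position) + 1]         # 29

# Linear interpolation between the two neighbours
manual_q1 = lower + (upper - lower) * (position - int(position))

print(manual_q1)            # 25.5
print(age.quantile(0.25))   # 25.5, same value describe() reports
```

With only 7 rows, every quartile sits between two actual data points, so small datasets often show ".5" values that never appear in the raw data.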
🧠 Step 3: Check for Missing or Incomplete Data
Check the count row:
- Is the count equal to the total number of rows?
- If not, there are missing values.
In our case, all counts = 7 → ✅ No missing values.
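A quick way to double-check this without reading the table by eye is to count missing values directly. The sketch below rebuilds the same simulated dataset so it runs on its own; `missing_per_column` and `columns_with_gaps` are names I picked for illustration.

```python
import pandas as pd

# Same simulated dataset as in Step 1
df = pd.DataFrame({
    'age': [22, 31, 45, 356, 18, 29, 60],
    'number_of_dependants': [0, 2, 3, 1, -3, 1, 5],
    'income_lakhs': [7, 17, 31, 930, 1, 9, 15],
    'premium_amount': [8608, 13928, 22274, 43471, 3501, 9000, 11000],
})

# Number of missing (NaN/None) entries in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# The same check expressed through the count row of describe()
counts = df.describe().loc['count']
columns_with_gaps = counts[counts < len(df)].index.tolist()
print(columns_with_gaps)  # empty list when nothing is missing
```

Keep in mind that describe() only summarizes numeric columns by default, so `df.isna().sum()` is the safer check when your dataset also has text columns.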
❌ Step 4: Spot Invalid Values
Now look for logically impossible values:
- age = 356 → Invalid! No one lives that long.
- number_of_dependants = -3 → Can’t have negative dependents.
✅ Action: These are data quality issues. You should either:
- Remove those rows, or
- Impute/fix the incorrect values.
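Before removing or fixing anything, it helps to pull out the offending rows and look at them. This is a rough sketch using boolean masks; the age cutoff of 100 is an assumption (the same one used later in Step 8), so adjust the rules to your own domain.

```python
import pandas as pd

# Same simulated dataset as in Step 1
df = pd.DataFrame({
    'age': [22, 31, 45, 356, 18, 29, 60],
    'number_of_dependants': [0, 2, 3, 1, -3, 1, 5],
    'income_lakhs': [7, 17, 31, 930, 1, 9, 15],
    'premium_amount': [8608, 13928, 22274, 43471, 3501, 9000, 11000],
})

# Assumed validity rules (illustrative, not universal)
MAX_PLAUSIBLE_AGE = 100
invalid_age = df['age'] > MAX_PLAUSIBLE_AGE
invalid_dependants = df['number_of_dependants'] < 0

# Rows that break at least one rule
invalid_rows = df[invalid_age | invalid_dependants]
print(invalid_rows)
```

Here the mask flags two rows: the one with age 356 and the one with -3 dependants.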
📈 Step 5: Detect Skewed Data (Mean vs Median)
Compare the mean and 50% (median):
- income_lakhs: mean = 144.29, median = 15 → heavily right-skewed (a few very high incomes pull the mean up)
- premium_amount: mean = 15,969, median = 11,000 → moderate right skew
✅ Action:
- Skewed features can distort a regression fit, because a few extreme values have an outsized influence on the coefficients.
- Apply log transformation to handle skew:
import numpy as np  # needed for np.log1p
df['log_income'] = np.log1p(df['income_lakhs'])
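If you want a number to go with the mean-vs-median comparison, pandas can compute sample skewness with `.skew()`. The sketch below puts the three statistics side by side and checks how the log transform changes the skew of income; the `summary`, `raw_skew`, and `log_skew` names are just for illustration.

```python
import numpy as np
import pandas as pd

# Same simulated dataset as in Step 1
df = pd.DataFrame({
    'age': [22, 31, 45, 356, 18, 29, 60],
    'number_of_dependants': [0, 2, 3, 1, -3, 1, 5],
    'income_lakhs': [7, 17, 31, 930, 1, 9, 15],
    'premium_amount': [8608, 13928, 22274, 43471, 3501, 9000, 11000],
})

# Mean, median, and sample skewness for every column
summary = pd.DataFrame({
    'mean': df.mean(),
    'median': df.median(),
    'skew': df.skew(),
})
print(summary.round(2))

# Skewness of income before and after the log transform
raw_skew = df['income_lakhs'].skew()
log_skew = np.log1p(df['income_lakhs']).skew()
print(round(raw_skew, 2), round(log_skew, 2))
```

A positive skew value means a long right tail. On this data the log transform reduces the skew of income but does not remove it, because the 930 value is still far from the rest.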
📉 Step 6: Detect Outliers with IQR Method
Use IQR (Interquartile Range) to find outliers:
Q1 = df['income_lakhs'].quantile(0.25)
Q3 = df['income_lakhs'].quantile(0.75)
IQR = Q3 - Q1
upper_limit = Q3 + 1.5 * IQR
outliers = df[df['income_lakhs'] > upper_limit]
In our example, Q1 = 8 and Q3 = 24, so IQR = 16 and the upper limit is 24 + 1.5 × 16 = 48. income_lakhs = 930 is a clear outlier, far beyond that limit.
✅ Action: Either:
- Remove the outlier if it’s not realistic.
- Cap it using winsorization.
- Log-transform the feature to reduce skew.
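To make the capping option concrete, here is a small sketch that checks both the lower and upper IQR limits and then clips values to them. Note the assumption: clipping at the IQR limits is one simple capping variant; winsorization in the strict sense caps at chosen percentiles (for example the 5th and 95th), so treat this as an illustration rather than the one correct recipe.

```python
import pandas as pd

income = pd.Series([7, 17, 31, 930, 1, 9, 15], name='income_lakhs')

# Quartiles and interquartile range
q1 = income.quantile(0.25)   # 8.0
q3 = income.quantile(0.75)   # 24.0
iqr = q3 - q1                # 16.0

# Limits on both sides, not only the upper one
lower_limit = q1 - 1.5 * iqr   # -16.0
upper_limit = q3 + 1.5 * iqr   # 48.0

# Values outside the limits are flagged as outliers
outliers = income[(income < lower_limit) | (income > upper_limit)]
print(outliers.tolist())   # [930]

# Cap extreme values at the limits instead of dropping the rows
capped = income.clip(lower=lower_limit, upper=upper_limit)
print(capped.tolist())
```

Capping keeps the row (and its other columns) in the dataset, which matters when you only have a handful of rows like we do here.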
📊 Step 7: Visualize It (Boxplot or Histogram)
import seaborn as sns
import matplotlib.pyplot as plt
sns.boxplot(x=df['income_lakhs'])
plt.title("Income Distribution")
plt.show()
The boxplot draws the outlier (930) as a lone point far to the right of the box and whiskers.
✅ Step 8: Clean the Data
Here’s what I’d do before modeling:
import numpy as np
# Drop rows with an implausible age; .copy() avoids SettingWithCopyWarning on the assignments below
df = df[df['age'] <= 100].copy()
# Set negative dependents to 0
df['number_of_dependants'] = df['number_of_dependants'].clip(lower=0)
# Log-transform income and premium (log1p also handles zeros safely)
df['log_income'] = np.log1p(df['income_lakhs'])
df['log_premium'] = np.log1p(df['premium_amount'])
Now the data is ready for regression modeling!
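It is worth re-running describe() after cleaning to confirm the fixes actually landed. The sketch below wraps the cleaning rules in a small helper and then inspects the result; `clean_insurance_data` and its `max_age` parameter are hypothetical names of mine, not part of pandas.

```python
import numpy as np
import pandas as pd


def clean_insurance_data(raw: pd.DataFrame, max_age: int = 100) -> pd.DataFrame:
    """Apply the Step 8 cleaning rules and return a new DataFrame."""
    # Drop rows with an implausible age; copy so later assignments are safe
    cleaned = raw[raw['age'] <= max_age].copy()
    # Negative dependant counts are treated as 0
    cleaned['number_of_dependants'] = cleaned['number_of_dependants'].clip(lower=0)
    # Log-transform the skewed columns
    cleaned['log_income'] = np.log1p(cleaned['income_lakhs'])
    cleaned['log_premium'] = np.log1p(cleaned['premium_amount'])
    return cleaned


raw = pd.DataFrame({
    'age': [22, 31, 45, 356, 18, 29, 60],
    'number_of_dependants': [0, 2, 3, 1, -3, 1, 5],
    'income_lakhs': [7, 17, 31, 930, 1, 9, 15],
    'premium_amount': [8608, 13928, 22274, 43471, 3501, 9000, 11000],
})

cleaned = clean_insurance_data(raw)

# Re-check the rows of describe() that exposed the problems earlier
print(cleaned.describe().loc[['count', 'min', 'max']])
```

One thing this makes visible: the row with age 356 is also the row with income 930, so dropping it removes the income outlier as well, and the cleaned data has 6 rows.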
🔀 Recap: Your Step-by-Step Process
| Step | What to Do |
|---|---|
| 1️⃣ | Load and explore your data |
| 2️⃣ | Use .describe() for summary stats |
| 3️⃣ | Check for missing values |
| 4️⃣ | Identify invalid entries |
| 5️⃣ | Analyze distribution using mean vs median |
| 6️⃣ | Detect outliers using IQR |
| 7️⃣ | Visualize with plots |
| 8️⃣ | Clean and transform data |
🧠 Final Thoughts
The .describe() method is your first lens into the soul of your dataset.
Use it well, and you’ll:
- Catch bad data before it messes up your model
- Discover hidden patterns (like skew or outliers)
- Build better models, faster
🚀 Want to Try It Yourself?
Use any regression dataset (insurance, housing prices, salary prediction), and go through these steps. Start with:
df.describe()
And from there, let the data speak.