Animesh Kewale

Posted on Jul 20

Step-by-Step Guide to Understanding Your Data with `.describe()` (Regression Example)

#machinelearning #python #datascience #beginners

🔍 Step-by-Step Guide to Understanding Your Data with `.describe()` (Using a Regression Dataset)

Whether you're a beginner in data science or preparing data for a machine learning model, the first step is always:

📌 Understand the data before using it.

And one of the most powerful tools for that is:


df.describe()

In this blog, I’ll walk you through exactly how to analyze and interpret **`.describe()` output**, step by step, using a **realistic regression dataset** with features like age, income_lakhs, and premium_amount.

📦 Step 1: Load the Dataset

Let’s simulate a small dataset like one you’d use for predicting insurance premiums.

import numpy as np  # needed for the log transforms in later steps
import pandas as pd

# Simulated dataset
data = {
    'age': [22, 31, 45, 356, 18, 29, 60],
    'number_of_dependants': [0, 2, 3, 1, -3, 1, 5],
    'income_lakhs': [7, 17, 31, 930, 1, 9, 15],
    'premium_amount': [8608, 13928, 22274, 43471, 3501, 9000, 11000]
}

df = pd.DataFrame(data)

📊 Step 2: Use `describe()` to Summarize the Data

df.describe()

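You’ll get output like this (values rounded to two decimals):

| | age | number_of_dependants | income_lakhs | premium_amount |
|---|---|---|---|---|
| count | 7.00 | 7.00 | 7.00 | 7.00 |
| mean | 80.14 | 1.29 | 144.29 | 15968.86 |
| std | 122.48 | 2.50 | 346.60 | 13436.10 |
| min | 18.00 | -3.00 | 1.00 | 3501.00 |
| 25% | 25.50 | 0.50 | 8.00 | 8804.00 |
| 50% (median) | 31.00 | 1.00 | 15.00 | 11000.00 |
| 75% | 52.50 | 2.50 | 24.00 | 18101.00 |
| max | 356.00 | 5.00 | 930.00 | 43471.00 |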
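The 25% value of 25.5 for age does not appear anywhere in the raw data, which can look odd at first. By default pandas interpolates linearly between the two nearest sorted values. The sketch below reproduces that by hand; `sorted_age`, `position`, and `manual_q1` are names chosen here for illustration.

```python
import pandas as pd

age = pd.Series([22, 31, 45, 356, 18, 29, 60])
sorted_age = sorted(age)  # [18, 22, 29, 31, 45, 60, 356]

# Position of the 25th percentile among n sorted values is 0.25 * (n - 1).
# With n = 7 that is 1.5, i.e. halfway between the 2nd and 3rd values.
position = 0.25 * (len(sorted_age) - 1)
manual_q1 = (sorted_age[1] + sorted_age[2]) / 2

print(position)            # fractional position in the sorted list
print(manual_q1)           # hand-computed first quartile
print(age.quantile(0.25))  # pandas' result, using its default linear interpolation
```

With only 7 rows, the quartiles depend heavily on this interpolation, so treat them as rough guides rather than precise cut-offs.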

🧠 Step 3: Check for Missing or Incomplete Data

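Check the count row, which reports the number of non-null values in each column:

Is the count equal to the total number of rows (len(df))?
If not, that column has missing values.

In our case, every count is 7 and the dataset has 7 rows → ✅ No missing values.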
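To see what a mismatch looks like, here is a minimal sketch using a hypothetical frame, `df_missing`, with one age deliberately left blank (it is not part of the dataset above). It compares the count row against the row total and cross-checks with `isna().sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only: one age value is missing.
df_missing = pd.DataFrame({
    'age': [22, 31, np.nan, 45],
    'income_lakhs': [7, 17, 31, 9],
})

# describe() counts only non-null values, so 'age' falls below the row total.
counts = df_missing.describe().loc['count']
missing_per_column = df_missing.isna().sum()

print(len(df_missing))      # total number of rows
print(counts)               # non-null values per column
print(missing_per_column)   # missing values per column
```

`isna().sum()` gives the same information directly and also covers non-numeric columns, which `describe()` skips by default.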

❌ Step 4: Spot Invalid Values

Now scan the min and max rows for logically impossible values:

age max = 356 → Invalid! No one lives that long.
number_of_dependants min = -3 → You can’t have a negative number of dependants.

✅ Action: These are data quality issues. You should either:

Remove those rows, or
Impute/fix the incorrect values.
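Before deciding between removing and fixing, it helps to pull up the offending rows. The sketch below flags them with boolean masks. The validity rules (age between 0 and 100, dependants not negative) are assumptions made for this example, not universal thresholds.

```python
import pandas as pd

df = pd.DataFrame({
    'age': [22, 31, 45, 356, 18, 29, 60],
    'number_of_dependants': [0, 2, 3, 1, -3, 1, 5],
})

# Assumed rules for this example: 0 <= age <= 100 and dependants >= 0.
invalid_age = ~df['age'].between(0, 100)
invalid_dependants = df['number_of_dependants'] < 0

# Rows that break at least one rule.
invalid_rows = df[invalid_age | invalid_dependants]
print(invalid_rows)
```

Seeing the full rows, rather than just the min and max, shows whether a bad value is an isolated typo or part of a record that is wrong throughout.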

📈 Step 5: Detect Skewed Data (Mean vs Median)

Compare the mean and 50% (median):

income_lakhs: mean ≈ 144.3, median = 15 → heavily right-skewed (the single 930 entry pulls the mean up)
premium_amount: mean ≈ 16,000, median = 11,000 → moderate right skew
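The same mean-versus-median comparison can be run for every column at once. This is a rough sketch: `summary` and `mean_to_median` are names invented here, and the `skew` column comes from pandas' own `DataFrame.skew()`, where positive values indicate a right skew.

```python
import pandas as pd

df = pd.DataFrame({
    'age': [22, 31, 45, 356, 18, 29, 60],
    'number_of_dependants': [0, 2, 3, 1, -3, 1, 5],
    'income_lakhs': [7, 17, 31, 930, 1, 9, 15],
    'premium_amount': [8608, 13928, 22274, 43471, 3501, 9000, 11000],
})

# One row per feature: mean, median, and pandas' sample skewness.
summary = pd.DataFrame({
    'mean': df.mean(),
    'median': df.median(),
    'skew': df.skew(),
})

# A ratio well above 1 means the mean is being pulled up by large values.
summary['mean_to_median'] = summary['mean'] / summary['median']
print(summary.round(2))
```

The ratio is only a quick heuristic and breaks down when the median is zero or negative, so read it alongside the skew column.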

✅ Action:

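Heavily skewed features can distort a regression fit, because a few extreme values end up dominating it.
A log transformation compresses the long right tail (log1p computes log(1 + x), so zeros are handled too):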

df['log_income'] = np.log1p(df['income_lakhs'])

📉 Step 6: Detect Outliers with IQR Method

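Use the IQR (interquartile range, Q3 - Q1) to find outliers:

Q1 = df['income_lakhs'].quantile(0.25)   # 8.0
Q3 = df['income_lakhs'].quantile(0.75)   # 24.0
IQR = Q3 - Q1                            # 16.0
upper_limit = Q3 + 1.5 * IQR             # 48.0

# Rows whose income lies above the upper limit
outliers = df[df['income_lakhs'] > upper_limit]

In our example, Q1 = 8 and Q3 = 24, so IQR = 16 and the upper limit is 24 + 1.5 × 16 = 48. income_lakhs = 930 is a clear outlier, far beyond that limit.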

✅ Action: Either:

Remove the outlier if it’s not realistic.
Cap it using winsorization.
Log-transform the feature to reduce skew.
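The capping option can be sketched with `Series.clip`. Note the assumption: this caps values at the IQR limits from this step, which is a simple stand-in for winsorization proper (that usually caps at chosen percentiles, such as the 5th and 95th). The names `lower_limit` and `capped` are illustrative.

```python
import pandas as pd

income = pd.Series([7, 17, 31, 930, 1, 9, 15], name='income_lakhs')

q1 = income.quantile(0.25)
q3 = income.quantile(0.75)
iqr = q3 - q1

# Limits at 1.5 * IQR beyond the quartiles.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Values outside the limits are pulled back to the nearest limit; others are unchanged.
capped = income.clip(lower=lower_limit, upper=upper_limit)

print(lower_limit, upper_limit)
print(capped.tolist())
```

Capping keeps the row in the dataset, which matters when the other columns in that row are still trustworthy.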

📊 Step 7: Visualize It (Boxplot or Histogram)

import seaborn as sns
import matplotlib.pyplot as plt

sns.boxplot(x=df['income_lakhs'])
plt.title("Income Distribution")
plt.show()

The plot shows the outlier (930) as an isolated point far beyond the box and whiskers.

✅ Step 8: Clean the Data

Here’s what I’d do before modeling:

# Drop rows with an invalid age; .copy() avoids pandas' SettingWithCopyWarning below
df = df[df['age'] <= 100].copy()

# Set negative dependants to 0
df['number_of_dependants'] = df['number_of_dependants'].clip(lower=0)

# Log-transform income and premium
df['log_income'] = np.log1p(df['income_lakhs'])
df['log_premium'] = np.log1p(df['premium_amount'])

Dropping the age = 356 row also removes the income_lakhs = 930 outlier, since both sit in the same row. Now the data is ready for regression modeling!
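A quick sanity check is to run `.describe()` again on the cleaned data and confirm the problems are gone. The sketch below repeats the cleaning on a fresh copy of the dataset, stored under the illustrative name `clean`, so it runs on its own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [22, 31, 45, 356, 18, 29, 60],
    'number_of_dependants': [0, 2, 3, 1, -3, 1, 5],
    'income_lakhs': [7, 17, 31, 930, 1, 9, 15],
    'premium_amount': [8608, 13928, 22274, 43471, 3501, 9000, 11000],
})

# Same cleaning rules as in this step, applied to a copy.
clean = df[df['age'] <= 100].copy()
clean['number_of_dependants'] = clean['number_of_dependants'].clip(lower=0)
clean['log_income'] = np.log1p(clean['income_lakhs'])
clean['log_premium'] = np.log1p(clean['premium_amount'])

# Re-check the summary: min and max should now fall in plausible ranges.
print(clean.describe().round(2))
```

If min, max, and the mean-versus-median gap all look reasonable in the second summary, the cleaning did what it was supposed to.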

🔀 Recap: Your Step-by-Step Process

| Step | What to Do |
|---|---|
| 1️⃣ | Load and explore your data |
| 2️⃣ | Use `.describe()` for summary stats |
| 3️⃣ | Check for missing values |
| 4️⃣ | Identify invalid entries |
| 5️⃣ | Analyze distribution using mean vs median |
| 6️⃣ | Detect outliers using IQR |
| 7️⃣ | Visualize with plots |
| 8️⃣ | Clean and transform data |

🧠 Final Thoughts

The `.describe()` method is your first lens into the soul of your dataset.

Use it well, and you’ll:

Catch bad data before it messes up your model
Discover hidden patterns (like skew or outliers)
Build better models, faster

🚀 Want to Try It Yourself?

Use any regression dataset (insurance, housing prices, salary prediction), and go through these steps. Start with:

df.describe()

And from there, let the data speak.
