
Animesh Kewale

Step-by-Step Guide to Understanding Your Data with `.describe()` (Regression Example)

🔍 Step-by-Step Guide to Understanding Your Data with .describe() (Using a Regression Dataset)

Whether you're a beginner in data science or preparing data for a machine learning model, the first step is always:

📌 Understand the data before using it.

And one of the most powerful tools for that is:


df.describe()

In this blog, I’ll walk you through exactly how to analyze and interpret **`.describe()` output**, step by step, using a **realistic regression dataset** with features like age, income_lakhs, and premium_amount.


📦 Step 1: Load the Dataset

Let’s simulate a dataset like you'd use for predicting insurance premiums.

import numpy as np  # needed later for np.log1p
import pandas as pd

# Simulated dataset
data = {
    'age': [22, 31, 45, 356, 18, 29, 60],
    'number_of_dependants': [0, 2, 3, 1, -3, 1, 5],
    'income_lakhs': [7, 17, 31, 930, 1, 9, 15],
    'premium_amount': [8608, 13928, 22274, 43471, 3501, 9000, 11000]
}

df = pd.DataFrame(data)

📊 Step 2: Use describe() to Summarize the Data

df.describe()

You’ll get this output:

                   age  number_of_dependants  income_lakhs  premium_amount
count             7.00                  7.00          7.00            7.00
mean             80.14                  1.29        144.29        15968.86
std             122.48                  2.50        346.60        13436.10
min              18.00                 -3.00          1.00         3501.00
25%              25.50                  0.50          8.00         8804.00
50% (median)     31.00                  1.00         15.00        11000.00
75%              52.50                  2.50         24.00        18101.00
max             356.00                  5.00        930.00        43471.00
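describe() also accepts a percentiles argument, which can help when the default quartiles hide what happens in the tails. A small sketch on the Step 1 data; the 5% and 95% cut-offs are arbitrary choices for illustration:

```python
import pandas as pd

# Same simulated dataset as in Step 1, repeated so this runs standalone
df = pd.DataFrame({
    'age': [22, 31, 45, 356, 18, 29, 60],
    'number_of_dependants': [0, 2, 3, 1, -3, 1, 5],
    'income_lakhs': [7, 17, 31, 930, 1, 9, 15],
    'premium_amount': [8608, 13928, 22274, 43471, 3501, 9000, 11000],
})

# Request extra percentiles alongside the usual quartiles
summary = df.describe(percentiles=[0.05, 0.25, 0.5, 0.75, 0.95]).round(2)
print(summary)

# Transposing gives one feature per row, which is easier to scan with many columns
print(summary.T)
```

With wide datasets, the transposed view tends to be the more readable of the two.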

🧠 Step 3: Check for Missing or Incomplete Data

Check the count row:

  • Is the count equal to the total number of rows?
  • If not, there are missing values.

In our case, all counts = 7 → ✅ No missing values.
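The count check can also be done in code rather than by eye. A minimal sketch, assuming the same data as Step 1; missing_per_column and count_matches are illustrative names, not part of pandas:

```python
import pandas as pd

# Same simulated dataset as in Step 1, repeated so this runs standalone
df = pd.DataFrame({
    'age': [22, 31, 45, 356, 18, 29, 60],
    'number_of_dependants': [0, 2, 3, 1, -3, 1, 5],
    'income_lakhs': [7, 17, 31, 930, 1, 9, 15],
    'premium_amount': [8608, 13928, 22274, 43471, 3501, 9000, 11000],
})

# Number of missing (NaN) values in each column
missing_per_column = df.isna().sum()

# The count row of describe() equals the row count when nothing is missing
count_matches = df.describe().loc['count'] == len(df)

print(missing_per_column)
print(count_matches.all())
```

By default describe() only summarizes numeric columns, so df.isna().sum() is the safer check when the dataset also has text columns.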


❌ Step 4: Spot Invalid Values

Now look for logically impossible values:

  • age = 356 → Invalid! No one lives that long.
  • number_of_dependants = -3 → Can’t have negative dependents.

Action: These are data quality issues. You should either:

  • Remove those rows, or
  • Impute/fix the incorrect values.
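One way to locate the offending rows is with boolean masks. This sketch assumes valid ages fall between 0 and 100 and that dependants cannot be negative; both rules are illustrative and should come from your own domain knowledge:

```python
import pandas as pd

# Same simulated dataset as in Step 1, repeated so this runs standalone
df = pd.DataFrame({
    'age': [22, 31, 45, 356, 18, 29, 60],
    'number_of_dependants': [0, 2, 3, 1, -3, 1, 5],
    'income_lakhs': [7, 17, 31, 930, 1, 9, 15],
    'premium_amount': [8608, 13928, 22274, 43471, 3501, 9000, 11000],
})

# Masks for values that break the assumed rules
bad_age = ~df['age'].between(0, 100)          # between() is inclusive on both ends
bad_dependants = df['number_of_dependants'] < 0

# Rows that violate at least one rule
invalid_rows = df[bad_age | bad_dependants]
print(invalid_rows)
```

Looking at the full rows, not just the bad cell, helps decide whether to drop the record or repair a single value.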

📈 Step 5: Detect Skewed Data (Mean vs Median)

Compare the mean and 50% (median):

  • income_lakhs: mean = 144.29, median = 15 → heavily right-skewed (a single extreme value, 930, pulls the mean up)
  • premium_amount: mean ≈ 15,969, median = 11,000 → moderate right skew

Action:

  • Heavily skewed features can give a few extreme rows outsized influence on a linear regression fit.
  • Apply log transformation to handle skew:
df['log_income'] = np.log1p(df['income_lakhs'])
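To put numbers on the mean-versus-median comparison, here is a small sketch on the Step 1 data. It uses pandas' built-in sample skewness (skew()); the mean_minus_median column is just a label chosen for this example:

```python
import numpy as np
import pandas as pd

# Same simulated dataset as in Step 1, repeated so this runs standalone
df = pd.DataFrame({
    'age': [22, 31, 45, 356, 18, 29, 60],
    'number_of_dependants': [0, 2, 3, 1, -3, 1, 5],
    'income_lakhs': [7, 17, 31, 930, 1, 9, 15],
    'premium_amount': [8608, 13928, 22274, 43471, 3501, 9000, 11000],
})

# One row per feature: mean, median, and sample skewness
summary = pd.DataFrame({
    'mean': df.mean(),
    'median': df.median(),
    'skew': df.skew(),
})
summary['mean_minus_median'] = summary['mean'] - summary['median']
print(summary.round(2))

# log1p compresses the long right tail, so skewness should drop
log_income = np.log1p(df['income_lakhs'])
print(round(log_income.skew(), 2), 'vs', round(df['income_lakhs'].skew(), 2))
```

A positive mean_minus_median together with a positive skew value points to a right-skewed column.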

📉 Step 6: Detect Outliers with IQR Method

Use IQR (Interquartile Range) to find outliers:

Q1 = df['income_lakhs'].quantile(0.25)
Q3 = df['income_lakhs'].quantile(0.75)
IQR = Q3 - Q1
upper_limit = Q3 + 1.5 * IQR

outliers = df[df['income_lakhs'] > upper_limit]

In our example, Q1 = 8 and Q3 = 24, so IQR = 16 and upper_limit = 24 + 1.5 × 16 = 48. income_lakhs = 930 is far above that, so it is flagged as an outlier.
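The snippet above only tests the upper fence. A minimal sketch that checks both sides, using the income values from Step 1; iqr_bounds is a hypothetical helper written for this example, not a pandas function:

```python
import pandas as pd

income = pd.Series([7, 17, 31, 930, 1, 9, 15], name='income_lakhs')

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences using the k * IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

lower, upper = iqr_bounds(income)

# Keep only the values that fall outside either fence
outliers = income[(income < lower) | (income > upper)]
print(lower, upper)
print(outliers.tolist())
```

For income the lower fence is negative and flags nothing, but the same helper is useful on columns where unusually small values matter.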

Action: Pick one of the following:

  • Remove the outlier if it’s not realistic.
  • Cap it using winsorization.
  • Log-transform the feature to reduce skew.
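As an illustration of the capping option, the sketch below clips income_lakhs at the IQR upper fence. Capping at the fence is only one convention; percentile-based caps are also common, so treat the threshold here as an assumption made for the example:

```python
import pandas as pd

income = pd.Series([7, 17, 31, 930, 1, 9, 15], name='income_lakhs')

# Upper fence from the IQR rule
q1, q3 = income.quantile([0.25, 0.75])
upper_limit = q3 + 1.5 * (q3 - q1)

# Values above the fence are replaced by the fence; all others are unchanged
income_capped = income.clip(upper=upper_limit)
print(income_capped.tolist())
```

Unlike dropping the row, capping keeps the record's other columns available to the model.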

📊 Step 7: Visualize It (Boxplot or Histogram)

import seaborn as sns
import matplotlib.pyplot as plt

sns.boxplot(x=df['income_lakhs'])
plt.title("Income Distribution")
plt.show()

This will visually show the outlier (930) far outside the box.
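For the histogram option, here is a sketch using plain matplotlib on the Step 1 income values, drawn once on the raw scale and once after log1p. The bin count of 10 and the figure size are arbitrary choices:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

income = pd.Series([7, 17, 31, 930, 1, 9, 15], name='income_lakhs')

fig, (ax_raw, ax_log) = plt.subplots(1, 2, figsize=(8, 3))

# Raw scale: six of the seven values land in the first bin, 930 sits alone in the last
counts_raw, _, _ = ax_raw.hist(income, bins=10)
ax_raw.set_title("income_lakhs")

# log1p scale: the spread among the typical incomes becomes visible
counts_log, _, _ = ax_log.hist(np.log1p(income), bins=10)
ax_log.set_title("log1p(income_lakhs)")

fig.tight_layout()
plt.show()
```

Comparing the two panels is a quick way to judge whether a log transform is worth applying.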


✅ Step 8: Clean the Data

Here’s what I’d do before modeling:

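# Drop rows with an impossible age (assumes no valid age is above 100);
# .copy() avoids pandas' SettingWithCopyWarning on the assignments below
df = df[df['age'] <= 100].copy()

# Set negative dependents to 0
df['number_of_dependants'] = df['number_of_dependants'].clip(lower=0)

# Log-transform income and premium (log1p also handles zeros)
df['log_income'] = np.log1p(df['income_lakhs'])
df['log_premium'] = np.log1p(df['premium_amount'])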

Now the data is ready for regression modeling!
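Before handing the frame to a model, it can help to rerun the earlier checks on the cleaned data. The sketch below repeats the cleaning on the Step 1 data and then asserts the rules from Step 4; the age limit of 100 is the same assumption used in the cleaning code, and clean is just a name chosen here:

```python
import numpy as np
import pandas as pd

# Same simulated dataset as in Step 1, repeated so this runs standalone
df = pd.DataFrame({
    'age': [22, 31, 45, 356, 18, 29, 60],
    'number_of_dependants': [0, 2, 3, 1, -3, 1, 5],
    'income_lakhs': [7, 17, 31, 930, 1, 9, 15],
    'premium_amount': [8608, 13928, 22274, 43471, 3501, 9000, 11000],
})

# Apply the cleaning steps to a copy so the raw data stays available
clean = df[df['age'] <= 100].copy()
clean['number_of_dependants'] = clean['number_of_dependants'].clip(lower=0)
clean['log_income'] = np.log1p(clean['income_lakhs'])
clean['log_premium'] = np.log1p(clean['premium_amount'])

# Sanity checks: fail loudly if a rule is still violated
assert clean['age'].max() <= 100
assert (clean['number_of_dependants'] >= 0).all()
assert clean.notna().all().all()

print(clean.describe().round(2))
```

In this dataset the row with age = 356 is also the row with income_lakhs = 930, so dropping it removes the income outlier as well.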


🔀 Recap: Your Step-by-Step Process

Step What to Do
1️⃣ Load and explore your data
2️⃣ Use .describe() for summary stats
3️⃣ Check for missing values
4️⃣ Identify invalid entries
5️⃣ Analyze distribution using mean vs median
6️⃣ Detect outliers using IQR
7️⃣ Visualize with plots
8️⃣ Clean and transform data

🧠 Final Thoughts

The .describe() method is your first lens into the soul of your dataset.

Use it well, and you’ll:

  • Catch bad data before it messes up your model
  • Discover hidden patterns (like skew or outliers)
  • Build better models, faster

🚀 Want to Try It Yourself?

Use any regression dataset (insurance, housing prices, salary prediction), and go through these steps. Start with:

df.describe()

And from there, let the data speak.
