DEV Community

Cover image for Data Cleaning & Preparation: The Ultimate Guide for Any Dataset
Nivesh Bansal
Nivesh Bansal

Posted on

Data Cleaning & Preparation: The Ultimate Guide for Any Dataset

When we start working as a Data Analyst, the first step is not building dashboards or making charts.
The first step is always cleaning and preparing the dataset.

👉 If the dataset is not clean, the final analysis will be wrong or misleading.

In this post, we will go through every important task you should do with a fresh dataset (no matter if it is sales data, HR data, COVID data, banking data, or any other).

We will use Python (pandas) for examples, but the steps apply to any dataset.


Step 1: Import Libraries and Load Dataset

import pandas as pd

## Example for CSV file
df = pd.read_csv("data.csv")

## Example for Excel file
## df = pd.read_excel("data.xlsx")

## Example for SQL database
## import sqlite3
## conn = sqlite3.connect("data.db")
## df = pd.read_sql("SELECT * FROM table_name", conn)

print(df.head())
Enter fullscreen mode Exit fullscreen mode

Always start by loading the dataset and looking at the first few rows.


Step 2: Check Dataset Shape

print(df.shape)
Enter fullscreen mode Exit fullscreen mode

Example Output:

(10000, 15)
Enter fullscreen mode Exit fullscreen mode
  • 10,000 rows (data entries)
  • 15 columns (variables)

Helps you know how big the dataset is.


Step 3: Get Dataset Info

print(df.info())
Enter fullscreen mode Exit fullscreen mode

This shows:

  • Column names
  • Data types (int, float, object, datetime)
  • Non-null counts

Example:

order_id        10000 non-null int64
order_date      9990 non-null object
sales_amount    9800 non-null float64
customer_name   10000 non-null object
Enter fullscreen mode Exit fullscreen mode

Step 4: Check First and Last Few Rows

print(df.head(10))
print(df.tail(10))
Enter fullscreen mode Exit fullscreen mode

Helps you:

  • Confirm column order
  • Check sorting (time, id, etc.)
  • Spot strange values

Step 5: Check Column Names

print(df.columns)
Enter fullscreen mode Exit fullscreen mode

Example:

['Order ID', 'Order Date', 'Sales Amount', 'Customer Name']
Enter fullscreen mode Exit fullscreen mode

Sometimes names have spaces, symbols, or capital letters. Clean them:

df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
Enter fullscreen mode Exit fullscreen mode

Now column names are:

['order_id', 'order_date', 'sales_amount', 'customer_name']
Enter fullscreen mode Exit fullscreen mode

Step 6: Handle Data Types

Always check if each column has the correct type:

print(df.dtypes)
Enter fullscreen mode Exit fullscreen mode

Example fixes:

## Convert date column
df['order_date'] = pd.to_datetime(df['order_date'], errors='coerce')

## Convert numeric columns
df['sales_amount'] = pd.to_numeric(df['sales_amount'], errors='coerce')

## Convert categorical columns
df['customer_name'] = df['customer_name'].astype('category')
Enter fullscreen mode Exit fullscreen mode

Step 7: Check Missing Values

print(df.isnull().sum())
Enter fullscreen mode Exit fullscreen mode

Example Output:

order_date      10
sales_amount   200
customer_name    0
Enter fullscreen mode Exit fullscreen mode

Ways to handle missing values:

  1. Drop rows/columns
df.dropna(subset=['order_date'], inplace=True)
Enter fullscreen mode Exit fullscreen mode
  1. Fill with mean/median/mode
df['sales_amount'].fillna(df['sales_amount'].median(), inplace=True)
Enter fullscreen mode Exit fullscreen mode
  1. Fill forward/backward (time series)
df['sales_amount'].fillna(method='ffill', inplace=True)
Enter fullscreen mode Exit fullscreen mode

Step 8: Check for Duplicates

print(df.duplicated().sum())
df = df.drop_duplicates()
Enter fullscreen mode Exit fullscreen mode

Avoid double counting.


Step 9: Explore Unique Values

print(df['customer_name'].unique())
print(df['customer_name'].nunique())
Enter fullscreen mode Exit fullscreen mode

Helps find spelling mistakes like "John Smith" vs "Jon Smith".


Step 10: Descriptive Statistics

print(df.describe(include='all'))
Enter fullscreen mode Exit fullscreen mode

Example:

sales_amount   min: -50   max: 100000   mean: 2500
Enter fullscreen mode Exit fullscreen mode

Helps detect:

  • Outliers
  • Negative values (not possible for sales)
  • Very large values

Step 11: Handle Outliers

Example: Negative sales amount should not exist.

df = df[df['sales_amount'] >= 0]
Enter fullscreen mode Exit fullscreen mode

Or cap extreme values:

q1 = df['sales_amount'].quantile(0.25)
q3 = df['sales_amount'].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

df = df[(df['sales_amount'] >= lower) & (df['sales_amount'] <= upper)]
Enter fullscreen mode Exit fullscreen mode

Step 12: Standardize Text Data

Example: customer names, product names, cities

df['customer_name'] = df['customer_name'].str.strip().str.title()
Enter fullscreen mode Exit fullscreen mode

Removes extra spaces and standardizes case.


Step 13: Rename or Drop Columns

df.rename(columns={'sales_amount': 'revenue'}, inplace=True)

df.drop(columns=['unnecessary_column'], inplace=True)
Enter fullscreen mode Exit fullscreen mode

Step 14: Create New Features

Examples:

  • Profit Margin
  • Year, Month from date
  • Age group
df['year'] = df['order_date'].dt.year
df['month'] = df['order_date'].dt.month
Enter fullscreen mode Exit fullscreen mode

Step 15: Check Data Consistency

  • Dates should be in correct order
  • IDs should be unique
  • Categories should make sense
print(df['year'].unique())
print(df['order_id'].nunique(), df.shape[0])
Enter fullscreen mode Exit fullscreen mode

Step 16: Sort and Reset Index

df = df.sort_values(by=['order_date'])
df = df.reset_index(drop=True)
Enter fullscreen mode Exit fullscreen mode

Step 17: Final Save

df.to_csv("clean_data.csv", index=False)
Enter fullscreen mode Exit fullscreen mode

Now dataset is ready for analysis or dashboard (Power BI, Tableau, etc.).


Final Checklist for Any Dataset

✅ Import dataset
✅ Check shape, info, head/tail
✅ Clean column names
✅ Fix data types
✅ Handle missing values
✅ Remove duplicates
✅ Check unique values
✅ Descriptive statistics
✅ Handle outliers
✅ Standardize text
✅ Drop/rename columns
✅ Create new features
✅ Check consistency
✅ Sort and reset index
✅ Save clean dataset


Conclusion

No matter what dataset you are working on (Sales, HR, Finance, COVID, Marketing, Banking, Ecommerce),
these steps will always help you make the dataset clean and ready for analysis.

👉 Clean data = Better insights + Correct dashboards + Happy clients.

By Nivesh Bansal

Top comments (0)