When you start working as a data analyst, the first step is not building dashboards or making charts.
The first step is always cleaning and preparing the dataset.
👉 If the dataset is not clean, the final analysis will be wrong or misleading.
In this post, we will go through every important task you should do with a fresh dataset, whether it is sales data, HR data, COVID data, banking data, or anything else.
We will use Python (pandas) for examples, but the steps apply to any dataset.
Step 1: Import Libraries and Load Dataset
import pandas as pd

# Example for a CSV file
df = pd.read_csv("data.csv")

# Example for an Excel file
# df = pd.read_excel("data.xlsx")

# Example for a SQL database
# import sqlite3
# conn = sqlite3.connect("data.db")
# df = pd.read_sql("SELECT * FROM table_name", conn)

print(df.head())
Always start by loading the dataset and looking at the first few rows.
Step 2: Check Dataset Shape
print(df.shape)
Example Output:
(10000, 15)
- 10,000 rows (data entries)
- 15 columns (variables)
Helps you know how big the dataset is.
Step 3: Get Dataset Info
df.info()
This shows:
- Column names
- Data types (int, float, object, datetime)
- Non-null counts
Example:
order_id 10000 non-null int64
order_date 9990 non-null object
sales_amount 9800 non-null float64
customer_name 10000 non-null object
Step 4: Check First and Last Few Rows
print(df.head(10))
print(df.tail(10))
Helps you:
- Confirm column order
- Check sorting (time, id, etc.)
- Spot strange values
Step 5: Check Column Names
print(df.columns.tolist())
Example:
['Order ID', 'Order Date', 'Sales Amount', 'Customer Name']
Sometimes names have spaces, symbols, or capital letters. Clean them:
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
Now column names are:
['order_id', 'order_date', 'sales_amount', 'customer_name']
Step 6: Handle Data Types
Always check if each column has the correct type:
print(df.dtypes)
Example fixes:
# Convert a date column
df['order_date'] = pd.to_datetime(df['order_date'], errors='coerce')

# Convert a numeric column
df['sales_amount'] = pd.to_numeric(df['sales_amount'], errors='coerce')

# Convert a categorical column
df['customer_name'] = df['customer_name'].astype('category')
Step 7: Check Missing Values
print(df.isnull().sum())
Example Output:
order_date 10
sales_amount 200
customer_name 0
Ways to handle missing values:
- Drop rows/columns
df = df.dropna(subset=['order_date'])
- Fill with the mean/median/mode
df['sales_amount'] = df['sales_amount'].fillna(df['sales_amount'].median())
- Fill forward/backward (time series)
df['sales_amount'] = df['sales_amount'].ffill()
Note: `fillna(method='ffill')` is deprecated in recent pandas versions; use `.ffill()` instead.
Step 8: Check for Duplicates
print(df.duplicated().sum())
df = df.drop_duplicates()
Duplicate rows inflate counts and sums, so drop them to avoid double counting.
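If only a key column should be unique, `drop_duplicates` takes a `subset` argument. A small sketch (the data here is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"order_id": [1, 1, 2], "sales_amount": [100, 100, 250]})

# 'subset' controls which columns define a duplicate;
# keep='first' keeps the first occurrence of each order_id
df = df.drop_duplicates(subset=["order_id"], keep="first")

print(len(df))  # → 2
```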
Step 9: Explore Unique Values
print(df['customer_name'].unique())
print(df['customer_name'].nunique())
Helps find spelling mistakes like "John Smith" vs "Jon Smith".
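Once you spot such variants, one common fix is a replacement map. A minimal sketch (the variants and mapping below are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"customer_name": ["John Smith", "Jon Smith", "JOHN SMITH"]})

# Map known misspellings/variants to one canonical name (hypothetical mapping)
fixes = {"Jon Smith": "John Smith", "JOHN SMITH": "John Smith"}
df["customer_name"] = df["customer_name"].replace(fixes)

print(df["customer_name"].unique())  # → ['John Smith']
```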
Step 10: Descriptive Statistics
print(df.describe(include='all'))
Example:
sales_amount min: -50 max: 100000 mean: 2500
Helps detect:
- Outliers
- Negative values (not possible for sales)
- Very large values
Step 11: Handle Outliers
Example: Negative sales amount should not exist.
df = df[df['sales_amount'] >= 0]
Or cap extreme values:
q1 = df['sales_amount'].quantile(0.25)
q3 = df['sales_amount'].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
df = df[(df['sales_amount'] >= lower) & (df['sales_amount'] <= upper)]
Step 12: Standardize Text Data
Example: customer names, product names, cities
df['customer_name'] = df['customer_name'].str.strip().str.title()
Removes extra spaces and standardizes case.
Step 13: Rename or Drop Columns
df.rename(columns={'sales_amount': 'revenue'}, inplace=True)
df.drop(columns=['unnecessary_column'], inplace=True)
Step 14: Create New Features
Examples:
- Profit Margin
- Year, Month from date
- Age group
df['year'] = df['order_date'].dt.year
df['month'] = df['order_date'].dt.month
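The other features in the list can be built the same way. A sketch for profit margin and age group; the `cost` and `age` columns are assumptions for illustration, not part of the sales example above:

```python
import pandas as pd

df = pd.DataFrame({"revenue": [100.0, 250.0], "cost": [60.0, 200.0], "age": [23, 67]})

# Profit margin (assumes a 'cost' column exists alongside revenue)
df["profit_margin"] = (df["revenue"] - df["cost"]) / df["revenue"]

# Age group via binning (assumes an 'age' column; bin edges are illustrative)
df["age_group"] = pd.cut(df["age"], bins=[0, 18, 35, 60, 120],
                         labels=["<18", "18-35", "36-60", "60+"])

print(df[["profit_margin", "age_group"]])
```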
Step 15: Check Data Consistency
- Dates should be in correct order
- IDs should be unique
- Categories should make sense
print(df['year'].unique())
print(df['order_id'].nunique(), df.shape[0])  # the two numbers should match if IDs are unique
Step 16: Sort and Reset Index
df = df.sort_values(by=['order_date'])
df = df.reset_index(drop=True)
Step 17: Final Save
df.to_csv("clean_data.csv", index=False)
Now the dataset is ready for analysis or for a dashboard (Power BI, Tableau, etc.).
Final Checklist for Any Dataset
✅ Import dataset
✅ Check shape, info, head/tail
✅ Clean column names
✅ Fix data types
✅ Handle missing values
✅ Remove duplicates
✅ Check unique values
✅ Descriptive statistics
✅ Handle outliers
✅ Standardize text
✅ Drop/rename columns
✅ Create new features
✅ Check consistency
✅ Sort and reset index
✅ Save clean dataset
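The checklist can be wrapped into one reusable function. A minimal sketch, using the column names from the sales example above (the sample data is made up):

```python
import pandas as pd

def clean_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """Minimal cleaning pipeline following the steps above."""
    df = df.copy()
    # Clean column names
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
    # Fix data types
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
    df["sales_amount"] = pd.to_numeric(df["sales_amount"], errors="coerce")
    # Handle missing values and duplicates
    df = df.dropna(subset=["order_date"]).drop_duplicates()
    df["sales_amount"] = df["sales_amount"].fillna(df["sales_amount"].median())
    # Remove impossible values, then standardize text
    df = df[df["sales_amount"] >= 0].copy()
    df["customer_name"] = df["customer_name"].str.strip().str.title()
    # Sort and reset index
    return df.sort_values("order_date").reset_index(drop=True)

raw = pd.DataFrame({
    "Order ID": [2, 1, 1],
    "Order Date": ["2024-02-01", "2024-01-15", "2024-01-15"],
    "Sales Amount": ["250", "100", "100"],
    "Customer Name": ["  bob lee ", "ann ray", "ann ray"],
})
print(clean_dataset(raw))
```

This keeps every cleaning decision in one place, so the same pipeline can be re-run whenever a fresh export of the data arrives.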
Conclusion
No matter what dataset you are working on (Sales, HR, Finance, COVID, Marketing, Banking, Ecommerce),
these steps will always help you make the dataset clean and ready for analysis.
👉 Clean data = Better insights + Correct dashboards + Happy clients.
By Nivesh Bansal