When you start working as a data analyst, the first step is not building dashboards or making charts.
The first step is always cleaning and preparing the dataset.
👉 If the dataset is not clean, the final analysis will be wrong or misleading.
In this post, we will go through every important task you should do with a fresh dataset, whether it is sales data, HR data, COVID data, banking data, or anything else.
We will use Python (pandas) for examples, but the steps apply to any dataset.
Step 1: Import Libraries and Load Dataset
import pandas as pd

# Example for a CSV file
df = pd.read_csv("data.csv")

# Example for an Excel file
# df = pd.read_excel("data.xlsx")

# Example for a SQL database
# import sqlite3
# conn = sqlite3.connect("data.db")
# df = pd.read_sql("SELECT * FROM table_name", conn)

print(df.head())
Always start by loading the dataset and looking at the first few rows.
Step 2: Check Dataset Shape
print(df.shape)
Example Output:
(10000, 15)
- 10,000 rows (data entries)
- 15 columns (variables)
Helps you know how big the dataset is.
Step 3: Get Dataset Info
df.info()
This shows:
- Column names
- Data types (int, float, object, datetime)
- Non-null counts
Example:
order_id 10000 non-null int64
order_date 9990 non-null object
sales_amount 9800 non-null float64
customer_name 10000 non-null object
Step 4: Check First and Last Few Rows
print(df.head(10))
print(df.tail(10))
Helps you:
- Confirm column order
- Check sorting (time, id, etc.)
- Spot strange values
Step 5: Check Column Names
print(df.columns.tolist())
Example:
['Order ID', 'Order Date', 'Sales Amount', 'Customer Name']
Sometimes names have spaces, symbols, or capital letters. Clean them:
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
Now column names are:
['order_id', 'order_date', 'sales_amount', 'customer_name']
Step 6: Handle Data Types
Always check if each column has the correct type:
print(df.dtypes)
Example fixes:
# Convert a date column
df['order_date'] = pd.to_datetime(df['order_date'], errors='coerce')

# Convert a numeric column
df['sales_amount'] = pd.to_numeric(df['sales_amount'], errors='coerce')

# Convert a categorical column
df['customer_name'] = df['customer_name'].astype('category')
Step 7: Check Missing Values
print(df.isnull().sum())
Example Output:
order_date 10
sales_amount 200
customer_name 0
Ways to handle missing values:
- Drop rows/columns
df = df.dropna(subset=['order_date'])
- Fill with the mean/median/mode
df['sales_amount'] = df['sales_amount'].fillna(df['sales_amount'].median())
- Fill forward/backward (time series)
df['sales_amount'] = df['sales_amount'].ffill()
Note: `fillna(method='ffill')` is deprecated in recent pandas versions; use `.ffill()` instead.
Step 8: Check for Duplicates
print(df.duplicated().sum())
df = df.drop_duplicates()
Duplicate rows inflate counts and sums, so drop them to avoid double counting.
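If only a key column should be unique, `drop_duplicates` takes a `subset` argument. A small sketch (the data here is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"order_id": [1, 1, 2], "sales_amount": [100, 100, 250]})

# 'subset' controls which columns define a duplicate;
# keep='first' keeps the first occurrence of each order_id
df = df.drop_duplicates(subset=["order_id"], keep="first")

print(len(df))  # → 2
```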
Step 9: Explore Unique Values
print(df['customer_name'].unique())
print(df['customer_name'].nunique())
Helps find spelling mistakes like "John Smith" vs "Jon Smith".
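Once you spot such variants, one common fix is a replacement map. A minimal sketch (the variants and mapping below are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"customer_name": ["John Smith", "Jon Smith", "JOHN SMITH"]})

# Map known misspellings/variants to one canonical name (hypothetical mapping)
fixes = {"Jon Smith": "John Smith", "JOHN SMITH": "John Smith"}
df["customer_name"] = df["customer_name"].replace(fixes)

print(df["customer_name"].unique())  # → ['John Smith']
```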
Step 10: Descriptive Statistics
print(df.describe(include='all'))
Example:
sales_amount min: -50 max: 100000 mean: 2500
Helps detect:
- Outliers
- Negative values (not possible for sales)
- Very large values
Step 11: Handle Outliers
Example: Negative sales amount should not exist.
df = df[df['sales_amount'] >= 0]
Or cap extreme values:
q1 = df['sales_amount'].quantile(0.25)
q3 = df['sales_amount'].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
df = df[(df['sales_amount'] >= lower) & (df['sales_amount'] <= upper)]
Step 12: Standardize Text Data
Example: customer names, product names, cities
df['customer_name'] = df['customer_name'].str.strip().str.title()
Removes extra spaces and standardizes case.
Step 13: Rename or Drop Columns
df.rename(columns={'sales_amount': 'revenue'}, inplace=True)
df.drop(columns=['unnecessary_column'], inplace=True)
Step 14: Create New Features
Examples:
- Profit Margin
- Year, Month from date
- Age group
df['year'] = df['order_date'].dt.year
df['month'] = df['order_date'].dt.month
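The other features in the list can be built the same way. A sketch for profit margin and age group; the `cost` and `age` columns are assumptions for illustration, not part of the sales example above:

```python
import pandas as pd

df = pd.DataFrame({"revenue": [100.0, 250.0], "cost": [60.0, 200.0], "age": [23, 67]})

# Profit margin (assumes a 'cost' column exists alongside revenue)
df["profit_margin"] = (df["revenue"] - df["cost"]) / df["revenue"]

# Age group via binning (assumes an 'age' column; bin edges are illustrative)
df["age_group"] = pd.cut(df["age"], bins=[0, 18, 35, 60, 120],
                         labels=["<18", "18-35", "36-60", "60+"])

print(df[["profit_margin", "age_group"]])
```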
Step 15: Check Data Consistency
- Dates should be in correct order
- IDs should be unique
- Categories should make sense
print(df['year'].unique())
print(df['order_id'].nunique(), df.shape[0])  # the two numbers should match if IDs are unique
Step 16: Sort and Reset Index
df = df.sort_values(by=['order_date'])
df = df.reset_index(drop=True)
Step 17: Final Save
df.to_csv("clean_data.csv", index=False)
Now the dataset is ready for analysis or for a dashboard (Power BI, Tableau, etc.).
Final Checklist for Any Dataset
✅ Import dataset
✅ Check shape, info, head/tail
✅ Clean column names
✅ Fix data types
✅ Handle missing values
✅ Remove duplicates
✅ Check unique values
✅ Descriptive statistics
✅ Handle outliers
✅ Standardize text
✅ Drop/rename columns
✅ Create new features
✅ Check consistency
✅ Sort and reset index
✅ Save clean dataset
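The checklist can be wrapped into one reusable function. A minimal sketch, using the column names from the sales example above (the sample data is made up):

```python
import pandas as pd

def clean_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """Minimal cleaning pipeline following the steps above."""
    df = df.copy()
    # Clean column names
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
    # Fix data types
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
    df["sales_amount"] = pd.to_numeric(df["sales_amount"], errors="coerce")
    # Handle missing values and duplicates
    df = df.dropna(subset=["order_date"]).drop_duplicates()
    df["sales_amount"] = df["sales_amount"].fillna(df["sales_amount"].median())
    # Remove impossible values, then standardize text
    df = df[df["sales_amount"] >= 0].copy()
    df["customer_name"] = df["customer_name"].str.strip().str.title()
    # Sort and reset index
    return df.sort_values("order_date").reset_index(drop=True)

raw = pd.DataFrame({
    "Order ID": [2, 1, 1],
    "Order Date": ["2024-02-01", "2024-01-15", "2024-01-15"],
    "Sales Amount": ["250", "100", "100"],
    "Customer Name": ["  bob lee ", "ann ray", "ann ray"],
})
print(clean_dataset(raw))
```

This keeps every cleaning decision in one place, so the same pipeline can be re-run whenever a fresh export of the data arrives.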
Conclusion
No matter what dataset you are working on (Sales, HR, Finance, COVID, Marketing, Banking, Ecommerce),
these steps will always help you make the dataset clean and ready for analysis.
👉 Clean data = Better insights + Correct dashboards + Happy clients.
By Nivesh Bansal