In today’s data-driven world, building a data pipeline is a must-have skill for aspiring data engineers and analysts. Whether you're preparing raw data for analysis, automating reporting, or just learning the ropes, a clean and simple data pipeline gives you a hands-on understanding of how real-world data flows.
In this blog, we’ll walk through building a basic ETL (Extract, Transform, Load) pipeline using Python and Pandas—the go-to library for data manipulation.
What Is a Data Pipeline?
A data pipeline is a series of steps that move data from one system to another, often transforming it along the way. Common pipeline stages include the following (a minimal code skeleton is sketched right after this list):
Extract – Getting raw data from a source (CSV, API, database, etc.)
Transform – Cleaning, restructuring, or enriching the data.
Load – Saving the final data into a target system (file, database, data lake, etc.)
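To make those three stages concrete before we start, here is a minimal skeleton in Python. The function names and the CSV-to-Parquet flow are illustrative assumptions on my part; the rest of this post builds a real version of each stage.

import pandas as pd

def extract(path):
    # Extract: read raw data from a source (a CSV file in this sketch)
    return pd.read_csv(path)

def transform(df):
    # Transform: clean or enrich the data (the real rules depend on your dataset)
    return df.dropna()

def load(df, target):
    # Load: write the result to a target (a Parquet file in this sketch)
    df.to_parquet(target, index=False)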
Project Goal
We’ll build a pipeline that:
Extracts data from a sample CSV
Cleans and transforms it
Loads the result into a Parquet file (a common storage format in data lakehouses)
- Extract: Loading Raw Data
Let’s use a sample dataset of e-commerce sales. Suppose you have a CSV like this:
order_id,customer_name,product,quantity,price,date
1001,Alice,Laptop,1,700,2023-11-01
1002,Bob,Mouse,2,25,2023-11-01
1003,,Monitor,1,150,2023-11-02
1004,Charlie,Laptop,,700,2023-11-03
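If you want to follow along without hunting for a dataset, you can create that exact file first. This snippet is just a convenience for running the examples, not part of the pipeline itself:

# Write the sample data to disk so the code below runs as-is
sample_csv = """order_id,customer_name,product,quantity,price,date
1001,Alice,Laptop,1,700,2023-11-01
1002,Bob,Mouse,2,25,2023-11-01
1003,,Monitor,1,150,2023-11-02
1004,Charlie,Laptop,,700,2023-11-03
"""
with open('ecommerce_sales.csv', 'w') as f:
    f.write(sample_csv)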
Python Code:
import pandas as pd
# Load data
df = pd.read_csv('ecommerce_sales.csv')
print(df.head())
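Optionally, you can ask Pandas to parse the date column into real datetimes at read time. This is a small optional sketch; the column name matches the sample CSV above.

# Optional: parse the date column while reading instead of keeping it as a string
df = pd.read_csv('ecommerce_sales.csv', parse_dates=['date'])
print(df.dtypes)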
- Transform: Cleaning the Data
We’ll do basic transformations:
Drop rows with missing customer names
Fill missing quantities with 1
Create a new column: total_price = quantity * price
# Drop rows where customer_name is missing
df = df.dropna(subset=['customer_name'])

# Fill missing quantities with 1
df['quantity'] = df['quantity'].fillna(1)

# Calculate total price
df['total_price'] = df['quantity'] * df['price']
Now your data is clean and enriched!
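One small detail worth knowing: because one quantity was missing, Pandas reads that column as floats, so after filling it you may want to cast back to integers. The checks below are optional and just a sketch of the kind of sanity-checking you might add.

# Optional: quantity was read as float because of the missing value; cast it back to int
df['quantity'] = df['quantity'].astype(int)

# Optional: quick sanity checks before loading
assert df['customer_name'].notna().all()
assert (df['total_price'] >= 0).all()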
- Load: Writing to Parquet
Parquet is a fast, columnar storage format widely used across the data lakehouse ecosystem, including table formats like Apache Iceberg and Delta Lake and tools like OLake.
# Save to Parquet
df.to_parquet('processed_sales.parquet', index=False)
print("Data pipeline completed! File saved as processed_sales.parquet")
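Note that to_parquet relies on a Parquet engine such as pyarrow (or fastparquet) being installed. Reading the file back is a simple way to confirm the load step worked; this is an optional check, not part of the pipeline:

# to_parquet needs a Parquet engine, e.g.: pip install pyarrow
# Optional: read the file back to verify the load step
df_check = pd.read_parquet('processed_sales.parquet')
print(df_check.head())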
Full Code in One Shot:
import pandas as pd

# Extract
df = pd.read_csv('ecommerce_sales.csv')

# Transform
df = df.dropna(subset=['customer_name'])
df['quantity'] = df['quantity'].fillna(1)
df['total_price'] = df['quantity'] * df['price']

# Load
df.to_parquet('processed_sales.parquet', index=False)
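As an optional step further, Parquet output can also be partitioned, which is the layout lakehouse tools typically expect. The line below is a sketch and assumes the pyarrow engine; it writes one folder per date instead of a single file.

# Optional: write one subfolder per date value (requires pyarrow)
df.to_parquet('processed_sales_partitioned', partition_cols=['date'])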
Why This Matters
While this is a basic example, it reflects a real-world pattern:
Ingest → clean → enrich → store
Parquet is used in cloud storage, big data systems, and lakehouses
This foundation scales up to tools like Apache Airflow, dbt, Apache Spark, and OLake
Next Steps You Can Try:
Add logging or error handling to make the pipeline production-ready (a rough sketch follows this list)
Load data from a REST API instead of a CSV
Schedule it using cron or Airflow
Load the Parquet into a data lakehouse using OLake or Iceberg
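To give a flavor of that first item, here is a hedged sketch of the same pipeline with logging and basic error handling. The structure and names are my own choices, so treat it as a starting point rather than the definitive way to do it. Once it lives in a script like this, scheduling it with cron or an Airflow task is straightforward.

import logging
import pandas as pd

logging.basicConfig(level=logging.INFO, format='%(asctime)s %(levelname)s %(message)s')
log = logging.getLogger('sales_pipeline')

def run_pipeline(source='ecommerce_sales.csv', target='processed_sales.parquet'):
    try:
        log.info('Extracting %s', source)
        df = pd.read_csv(source)

        log.info('Transforming %d rows', len(df))
        df = df.dropna(subset=['customer_name'])
        df['quantity'] = df['quantity'].fillna(1)
        df['total_price'] = df['quantity'] * df['price']

        log.info('Loading %d rows to %s', len(df), target)
        df.to_parquet(target, index=False)
    except FileNotFoundError:
        log.error('Source file %s not found', source)
        raise
    except Exception:
        log.exception('Pipeline failed')
        raise

if __name__ == '__main__':
    run_pipeline()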
Thanks for reading!
Want more beginner-friendly data engineering tutorials? Let me know and I’ll keep sharing them!