Vignan Baratam

Building a Simple Data Pipeline with Python and Pandas

In today’s data-driven world, building a data pipeline is a must-have skill for aspiring data engineers and analysts. Whether you're preparing raw data for analysis, automating reporting, or just learning the ropes, a clean and simple data pipeline gives you a hands-on understanding of how real-world data flows.

In this blog, we’ll walk through building a basic ETL (Extract, Transform, Load) pipeline using Python and Pandas—the go-to library for data manipulation.

What Is a Data Pipeline?

A data pipeline is a series of steps that move data from one system to another, often transforming it along the way. Common pipeline stages include (sketched in code after this list):

  1. Extract – Getting raw data from a source (CSV, API, database, etc.)

  2. Transform – Cleaning, restructuring, or enriching the data

  3. Load – Saving the final data into a target system (file, database, data lake, etc.)
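
Here is a minimal sketch of those three stages as plain Python functions. The function names and the raw.csv/clean.parquet paths are placeholders for illustration, not a fixed API:

import pandas as pd

def extract(path):
    # Extract: read raw data from a source (here, a CSV file)
    return pd.read_csv(path)

def transform(df):
    # Transform: clean or enrich; this placeholder just drops fully empty rows
    return df.dropna(how='all')

def load(df, path):
    # Load: write the result to a target (here, a Parquet file)
    df.to_parquet(path, index=False)

load(transform(extract('raw.csv')), 'clean.parquet')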

Project Goal

We’ll build a pipeline that:

Extracts data from a sample CSV

Cleans and transforms it

Loads the result into a Parquet file (a common storage format in data lakehouses)

  1. Extract: Loading Raw Data

Let’s use a sample dataset of e-commerce sales. Suppose you have a CSV like this:

order_id,customer_name,product,quantity,price,date
1001,Alice,Laptop,1,700,2023-11-01
1002,Bob,Mouse,2,25,2023-11-01
1003,,Monitor,1,150,2023-11-02
1004,Charlie,Laptop,,700,2023-11-03
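
If you'd like to follow along, one option is to create this exact file from Python first (the file name matches the code below):

# Write the sample dataset to ecommerce_sales.csv
sample = """order_id,customer_name,product,quantity,price,date
1001,Alice,Laptop,1,700,2023-11-01
1002,Bob,Mouse,2,25,2023-11-01
1003,,Monitor,1,150,2023-11-02
1004,Charlie,Laptop,,700,2023-11-03
"""
with open('ecommerce_sales.csv', 'w') as f:
    f.write(sample)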

Python Code:

import pandas as pd

# Load data
df = pd.read_csv('ecommerce_sales.csv')
print(df.head())
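
One optional tweak, assuming the date column shown above: ask pandas to parse it as real datetimes at read time instead of leaving it as strings:

# Optional: parse the date column while reading
df = pd.read_csv('ecommerce_sales.csv', parse_dates=['date'])
print(df.dtypes)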

  2. Transform: Cleaning the Data

We’ll do basic transformations:

Drop rows with missing customer names

Fill missing quantities with 1

Create a new column: total_price = quantity * price

# Drop rows where customer_name is missing
df = df.dropna(subset=['customer_name'])

# Fill missing quantities with 1
df['quantity'] = df['quantity'].fillna(1)

# Calculate total price
df['total_price'] = df['quantity'] * df['price']

Now your data is clean and enriched!
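
A quick, optional sanity check confirms the cleanup: after these steps, the cleaned columns should report zero missing values:

# Optional sanity check: the cleaned columns should have no nulls left
print(df[['customer_name', 'quantity']].isna().sum())
print(df[['product', 'quantity', 'price', 'total_price']].head())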

  3. Load: Writing to Parquet

Parquet is a fast, columnar storage format used throughout the lakehouse ecosystem, including table formats like Apache Iceberg and Delta Lake and tools like OLake. Note that pandas needs a Parquet engine installed to write it, typically pyarrow or fastparquet (pip install pyarrow).

# Save to Parquet
df.to_parquet('processed_sales.parquet', index=False)

print("Data pipeline completed! File saved as processed_sales.parquet")

Full Code in One Shot:

import pandas as pd

# Extract
df = pd.read_csv('ecommerce_sales.csv')

# Transform
df = df.dropna(subset=['customer_name'])
df['quantity'] = df['quantity'].fillna(1)
df['total_price'] = df['quantity'] * df['price']

# Load
df.to_parquet('processed_sales.parquet', index=False)

Why This Matters

While this is a basic example, it reflects a real-world pattern:

Ingest → clean → enrich → store

Parquet is used in cloud storage, big data systems, and lakehouses

This foundation scales up to tools like Apache Airflow, dbt, Apache Spark, and OLake

Next Steps You Can Try:

Add logging or error handling to make the pipeline production-ready (a sketch follows this list)

Load data from a REST API instead of a CSV

Schedule it using cron or Airflow

Load the Parquet into a data lakehouse using OLake or Iceberg
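
As a starting point for the first item, here is a minimal sketch of the same pipeline with logging and basic error handling. The logger setup shown is just one reasonable choice, not the only way to do it:

import logging
import pandas as pd

logging.basicConfig(level=logging.INFO, format='%(asctime)s %(levelname)s %(message)s')
log = logging.getLogger('pipeline')

def run_pipeline():
    try:
        log.info('Extracting ecommerce_sales.csv')
        df = pd.read_csv('ecommerce_sales.csv')

        log.info('Transforming %d rows', len(df))
        df = df.dropna(subset=['customer_name'])
        df['quantity'] = df['quantity'].fillna(1)
        df['total_price'] = df['quantity'] * df['price']

        log.info('Loading %d rows to processed_sales.parquet', len(df))
        df.to_parquet('processed_sales.parquet', index=False)
    except FileNotFoundError:
        log.error('Input file not found; nothing to process')
        raise

run_pipeline()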

Thanks for reading!
Want more beginner-friendly data engineering tutorials? Let me know and I’ll keep sharing them!
