Srinivasaraju Tangella

Build a Complete Data Pipeline from Scratch: CSV to Dashboard Using Python, MySQL, and Airflow

Building a Simple Yet Powerful Data Pipeline: From Raw CSV to Dashboard Visualization

Every Data Engineer’s journey begins with understanding how data flows — from raw collection to meaningful visualization. In this article, let’s build a simple but complete data pipeline that covers all essential stages of ETL (Extract, Transform, Load). We’ll use Python, MySQL, and Airflow to bring everything together, and finally visualize the results with Grafana or Metabase.

The Goal

We’ll create a pipeline that collects sales data from a CSV file, cleans and transforms it using Python, loads it into a MySQL database, and schedules everything with Apache Airflow. Finally, we’ll visualize the processed data in a simple dashboard.

This flow mirrors how real-world Data Engineering systems work — just in a smaller, easier-to-grasp version.

The Flow

Our data journey looks like this:

CSV File → Python ETL Script → MySQL Database → Airflow Scheduler → Grafana Dashboard

Each block in this chain has a clear responsibility — extracting, transforming, storing, orchestrating, and visualizing.

Step 1: Create the Sample Data

We’ll start with a simple CSV file named sales_data.csv:

order_id,customer,amount,date
1001,Ravi,250,2025-11-10
1002,Anita,300,2025-11-11
1003,Ramesh,150,2025-11-12
1004,Sita,,2025-11-12
1005,Raju,200,2025-11-12
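
If you'd rather generate this file from code, a small pandas snippet does the job. This is just a convenience sketch; it recreates exactly the rows shown above, including the record with a missing amount:

import pandas as pd

# Recreate the sample data above, including the row with a missing amount
rows = [
    (1001, 'Ravi', 250, '2025-11-10'),
    (1002, 'Anita', 300, '2025-11-11'),
    (1003, 'Ramesh', 150, '2025-11-12'),
    (1004, 'Sita', None, '2025-11-12'),
    (1005, 'Raju', 200, '2025-11-12'),
]
pd.DataFrame(rows, columns=['order_id', 'customer', 'amount', 'date']).to_csv('sales_data.csv', index=False)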

Step 2: Extract and Transform the Data

We’ll use Python and Pandas to read the data, clean it, and load it into MySQL.

File: etl_pipeline.py

import pandas as pd
import mysql.connector

# Step 1: Extract
df = pd.read_csv('sales_data.csv')

# Step 2: Transform
df = df.dropna(subset=['amount'])          # Drop rows with a missing amount
df['amount'] = df['amount'].astype(float)
df['date'] = pd.to_datetime(df['date'])

# Step 3: Load into MySQL
connection = mysql.connector.connect(
    host='localhost',
    user='root',
    password='yourpassword',
    database='sales_db'
)

cursor = connection.cursor()

cursor.execute("""
CREATE TABLE IF NOT EXISTS sales (
    order_id INT PRIMARY KEY,
    customer VARCHAR(50),
    amount FLOAT,
    date DATE
)
""")

# Cast NumPy/pandas values to plain Python types so the MySQL driver can bind them
for _, row in df.iterrows():
    cursor.execute("""
        REPLACE INTO sales (order_id, customer, amount, date)
        VALUES (%s, %s, %s, %s)
    """, (int(row['order_id']), row['customer'], float(row['amount']), row['date'].date()))

connection.commit()
cursor.close()
connection.close()

print("✅ Data loaded successfully!")

This script performs all three core ETL tasks:

Extracts from the CSV file

Transforms data by removing nulls and formatting types

Loads clean data into MySQL
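
After a run, it's worth a quick sanity check that the load actually happened. A minimal sketch, assuming the same MySQL credentials used in etl_pipeline.py:

import mysql.connector

# Assumes the same connection settings as etl_pipeline.py
connection = mysql.connector.connect(
    host='localhost',
    user='root',
    password='yourpassword',
    database='sales_db'
)
cursor = connection.cursor()
cursor.execute("SELECT COUNT(*), SUM(amount) FROM sales")
row_count, total_amount = cursor.fetchone()
print(f"Rows loaded: {row_count}, total amount: {total_amount}")
cursor.close()
connection.close()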

Step 3: Automate the Pipeline with Apache Airflow

Once the ETL script is ready, we can schedule it using Airflow. Create a DAG file named sales_etl_dag.py inside your Airflow DAGs directory.

from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

default_args = {
    'owner': 'data_engineer',
    'start_date': datetime(2025, 11, 12),
    'retries': 1
}

with DAG('sales_etl_pipeline',
         default_args=default_args,
         schedule_interval='@daily',
         catchup=False) as dag:

    extract_transform_load = BashOperator(
        task_id='run_etl_script',
        bash_command='python /opt/airflow/dags/etl_pipeline.py'
    )

This DAG runs your Python ETL script every day, automatically processing new data as it arrives. Airflow makes it easy to visualize runs, track logs, and retry on failure — key features for reliable data engineering.
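
If you'd rather keep the whole pipeline in Python instead of shelling out, Airflow's PythonOperator can call the ETL logic directly. A minimal sketch, assuming etl_pipeline.py sits in the DAGs folder and wraps its steps in a run_etl() function (a name introduced here purely for illustration):

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

from etl_pipeline import run_etl  # hypothetical wrapper around the extract/transform/load steps

with DAG('sales_etl_pipeline_python',
         start_date=datetime(2025, 11, 12),
         schedule_interval='@daily',
         catchup=False) as dag:

    run_etl_task = PythonOperator(
        task_id='run_etl',
        python_callable=run_etl
    )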

Step 4: Visualize the Data

After the data is stored in MySQL, connect it to a visualization tool like Grafana, Metabase, or Power BI.
You can create charts or dashboards with queries such as:

SELECT date, SUM(amount) AS total_sales
FROM sales
GROUP BY date
ORDER BY date;

This lets you see daily sales trends, total revenue, or customer activity — transforming simple CSV data into actionable insights.
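
Before building the dashboard, you can cross-check the numbers by recomputing the same aggregation in pandas straight from the source CSV. A quick sketch of the pandas equivalent of the query above:

import pandas as pd

# Recompute daily totals from the raw CSV as a cross-check for the dashboard query
df = pd.read_csv('sales_data.csv')
df = df.dropna(subset=['amount'])
df['date'] = pd.to_datetime(df['date'])
daily_sales = df.groupby('date')['amount'].sum().reset_index(name='total_sales')
print(daily_sales)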

Step 5: Make It Production-Ready

Once you understand this flow, you can extend it easily:

Store raw files in S3 or Google Cloud Storage

Replace Pandas with Apache Spark for larger datasets

Use Kafka for real-time streaming ingestion

Add Great Expectations for data quality validation (a lightweight version of this idea is sketched after this list)

Orchestrate everything in CI/CD pipelines with Jenkins or GitHub Actions
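
The data-quality point above doesn't have to wait for a full Great Expectations setup. A few explicit checks before the load step already catch the most common problems. Here is a minimal, plain-pandas sketch; the column names and rules simply mirror this example pipeline:

import pandas as pd

def validate_sales(df: pd.DataFrame) -> None:
    """Raise if the sales data breaks basic expectations before it is loaded."""
    expected_columns = {'order_id', 'customer', 'amount', 'date'}
    if not expected_columns.issubset(df.columns):
        raise ValueError(f"Missing columns: {expected_columns - set(df.columns)}")
    if df['order_id'].duplicated().any():
        raise ValueError("Duplicate order_id values found")
    if (df['amount'] < 0).any():
        raise ValueError("Negative sales amounts found")

# Example: validate the cleaned DataFrame before loading it into MySQL
df = pd.read_csv('sales_data.csv').dropna(subset=['amount'])
validate_sales(df)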

The Takeaway

Even a small data pipeline like this demonstrates the full lifecycle of modern Data Engineering — extraction, transformation, loading, automation, and visualization.
Once you grasp this end-to-end flow, you’ll have the foundation to build large-scale, production-ready data systems.
