Building a Simple Yet Powerful Data Pipeline: From Raw CSV to Dashboard Visualization
Every Data Engineer’s journey begins with understanding how data flows — from raw collection to meaningful visualization. In this article, let’s build a simple but complete data pipeline that covers all essential stages of ETL (Extract, Transform, Load). We’ll use Python, MySQL, and Airflow to bring everything together, and finally visualize the results with Grafana or Metabase.
The Goal
We’ll create a pipeline that collects sales data from a CSV file, cleans and transforms it using Python, loads it into a MySQL database, and schedules everything with Apache Airflow. Finally, we’ll visualize the processed data in a simple dashboard.
This flow mirrors how real-world Data Engineering systems work — just in a smaller, easier-to-grasp version.
The Flow
Our data journey looks like this:
CSV File → Python ETL Script → MySQL Database → Airflow Scheduler → Grafana Dashboard
Each block in this chain has a clear responsibility — extracting, transforming, storing, orchestrating, and visualizing.
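Before building anything, make sure the core Python packages are in place. A plain pip install is enough for a local experiment, though the Airflow docs recommend installing it against a constraints file if you want reproducible versions:

pip install pandas mysql-connector-python apache-airflow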
Step 1: Create the Sample Data
We’ll start with a simple CSV file named sales_data.csv:
order_id,customer,amount,date
1001,Ravi,250,2025-11-10
1002,Anita,300,2025-11-11
1003,Ramesh,150,2025-11-12
1004,Sita,,2025-11-12
1005,Raju,200,2025-11-12
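If you'd rather create this file from code than type it by hand, a few lines of pandas will do. This is just a convenience sketch that recreates the same rows, including the deliberately blank amount for order 1004:

import pandas as pd

rows = [
    (1001, 'Ravi',   250,  '2025-11-10'),
    (1002, 'Anita',  300,  '2025-11-11'),
    (1003, 'Ramesh', 150,  '2025-11-12'),
    (1004, 'Sita',   None, '2025-11-12'),  # missing amount on purpose
    (1005, 'Raju',   200,  '2025-11-12'),
]
pd.DataFrame(rows, columns=['order_id', 'customer', 'amount', 'date']).to_csv('sales_data.csv', index=False)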
Step 2: Extract and Transform the Data
We’ll use Python and Pandas to read the data, clean it, and load it into MySQL.
File: etl_pipeline.py
import pandas as pd
import mysql.connector

# Step 1: Extract
df = pd.read_csv('sales_data.csv')

# Step 2: Transform
df = df.dropna(subset=['amount'])                # Remove rows with a missing amount
df['amount'] = df['amount'].astype(float)
df['date'] = pd.to_datetime(df['date']).dt.date  # Plain date objects for the DATE column

# Step 3: Load into MySQL
connection = mysql.connector.connect(
    host='localhost',
    user='root',
    password='yourpassword',
    database='sales_db'
)
cursor = connection.cursor()

cursor.execute("""
    CREATE TABLE IF NOT EXISTS sales (
        order_id INT PRIMARY KEY,
        customer VARCHAR(50),
        amount FLOAT,
        date DATE
    )
""")

# REPLACE INTO upserts on the primary key, so re-running the script won't create duplicates.
# Values are cast to plain Python types because the connector can't convert NumPy types.
for _, row in df.iterrows():
    cursor.execute("""
        REPLACE INTO sales (order_id, customer, amount, date)
        VALUES (%s, %s, %s, %s)
    """, (int(row['order_id']), row['customer'], float(row['amount']), row['date']))

connection.commit()
cursor.close()
connection.close()
print("✅ Data loaded successfully!")
This script performs all three core ETL tasks:
Extracts from the CSV file
Transforms data by removing nulls and formatting types
Loads clean data into MySQL (a more compact alternative is sketched just below)
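As an aside, if the row-by-row insert feels verbose, pandas can push the whole DataFrame in a single call through SQLAlchemy. This is a minimal sketch, assuming SQLAlchemy and a driver such as pymysql are installed; unlike REPLACE INTO above, it simply appends rows and won't de-duplicate on order_id, so re-runs would need extra handling:

from sqlalchemy import create_engine

# Same database and credentials as the script above (adjust to your setup)
engine = create_engine('mysql+pymysql://root:yourpassword@localhost/sales_db')

# Write the cleaned DataFrame into the sales table in one call
df.to_sql('sales', con=engine, if_exists='append', index=False)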
Step 3: Automate the Pipeline with Apache Airflow
Once the ETL script is ready, we can schedule it using Airflow. Create a DAG file named sales_etl_dag.py inside your Airflow DAGs directory.
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

default_args = {
    'owner': 'data_engineer',
    'start_date': datetime(2025, 11, 12),
    'retries': 1
}

with DAG('sales_etl_pipeline',
         default_args=default_args,
         schedule_interval='@daily',
         catchup=False) as dag:

    # A single task, so there are no dependencies to wire up
    extract_transform_load = BashOperator(
        task_id='run_etl_script',
        bash_command='python /opt/airflow/dags/etl_pipeline.py'
    )
This DAG runs your Python ETL script every day, automatically processing new data as it arrives. Airflow makes it easy to visualize runs, track logs, and retry on failure — key features for reliable data engineering.
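If you'd rather keep the whole pipeline in Python instead of shelling out, a PythonOperator works too. The sketch below is only a variant under assumptions: it supposes you refactor etl_pipeline.py so its logic lives in a main() function that can be imported, and it uses a separate DAG id so it doesn't collide with the BashOperator version:

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def run_etl():
    # Hypothetical: assumes etl_pipeline.py exposes its logic as main()
    from etl_pipeline import main
    main()

default_args = {
    'owner': 'data_engineer',
    'start_date': datetime(2025, 11, 12),
    'retries': 1
}

with DAG('sales_etl_pipeline_python',
         default_args=default_args,
         schedule_interval='@daily',
         catchup=False) as dag:

    PythonOperator(
        task_id='run_etl_callable',
        python_callable=run_etl
    )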
Step 4: Visualize the Data
After the data is stored in MySQL, connect it to a visualization tool like Grafana, Metabase, or Power BI.
You can create charts or dashboards with queries such as:
SELECT date, SUM(amount) AS total_sales
FROM sales
GROUP BY date
ORDER BY date;
This lets you see daily sales trends, total revenue, or customer activity — transforming simple CSV data into actionable insights.
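For the customer-activity angle, a similar query works; the column aliases here are just suggestions:

SELECT customer,
       COUNT(*) AS order_count,
       SUM(amount) AS total_spent
FROM sales
GROUP BY customer
ORDER BY total_spent DESC;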
Step 5: Make It Production-Ready
Once you understand this flow, you can extend it easily:
Store raw files in S3 or Google Cloud Storage
Replace Pandas with Apache Spark for larger datasets
Use Kafka for real-time streaming ingestion
Add Great Expectations for data quality validation (a lightweight hand-rolled check is sketched after this list)
Orchestrate everything in CI/CD pipelines with Jenkins or GitHub Actions
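To give a flavour of the data quality idea without pulling in Great Expectations itself, here is a small hand-rolled check you could call between the transform and load steps. It is only a stand-in for a proper validation framework, and the specific rules are assumptions about what a healthy sales batch looks like:

def validate_sales(df):
    """Basic sanity checks before loading; raises if the batch looks wrong."""
    errors = []
    if df['order_id'].duplicated().any():
        errors.append("duplicate order_id values")
    if (df['amount'] <= 0).any():
        errors.append("non-positive amounts")
    if df['date'].isna().any():
        errors.append("missing dates")
    if errors:
        raise ValueError("Data quality checks failed: " + "; ".join(errors))
    return df

df = validate_sales(df)  # e.g. right after the transform step in etl_pipeline.py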
The Takeaway
Even a small data pipeline like this demonstrates the full lifecycle of modern Data Engineering — extraction, transformation, loading, automation, and visualization.
Once you grasp this end-to-end flow, you’ll have the foundation to build large-scale, production-ready data systems.