How I Built an Automated System That Turns Messy Sales Data Into Business Gold
Ever wonder how your favorite supermarket knows exactly when to restock the shelves, which products are flying off the racks, or why they always seem to have your favorite snacks in stock? The secret lies in data pipelines, and I built one from scratch.
The Problem: Data Drowning
Imagine you're the manager of a busy supermarket (e.g., Naivas). Every single day, thousands of transactions flow through your registers: customers buying milk, bread, snacks, and cleaning supplies. Each transaction generates a line of data: who bought what, how much they paid, and how they paid.
Now here's the challenge: all this data is sitting in a messy Google spreadsheet, updated by cashiers in real-time. It's like having a river of gold nuggets flowing past you, but no way to catch them.
The questions that keep you up at night:
- Which products are selling the most?
- What payment methods do customers prefer?
- Are there duplicate transactions messing up your accounting?
- How can you make this data useful for reports AND for your mobile app?
This is exactly the problem I solved with the Supermarket ETL Pipeline.
The Solution: An Automated Data Factory
Think of my solution like a water treatment plant for data:
| Stage | Water Plant Analogy | What My Pipeline Does |
|---|---|---|
| Extract | Pumping water from the river | Pulling raw sales data from Google Sheets |
| Transform | Filtering out dirt and impurities | Cleaning duplicates, fixing missing values |
| Load | Storing clean water in tanks | Saving clean data to PostgreSQL & MongoDB |
The Google Sheet
*Screenshot: the source Google Sheet with raw transaction data (columns like id, quantity, product_name, total_amount, payment_method, customer_type), including a few messy or duplicate rows.*
How It Works
Step 1: Extraction: "Fishing for Data"
My pipeline starts by reaching out to Google Sheets; think of it like casting a fishing net into a lake. The spreadsheet contains raw transaction records: every purchase, every customer, every payment.
The Pipeline says: "Hey Google, give me all the sales data!"
Google responds: "Here's 1,000 rows of transactions!"
Why Google Sheets? Because it's where real businesses often keep their data: it's accessible, shareable, and doesn't require expensive software.
Terminal showing extraction logs
*Screenshot: terminal output with the "Starting extraction from Google Sheets" and "Extracted X rows" messages.*
Step 2: Transformation: "The Car Wash for Data"
Raw data is messy. Imagine every car that comes through a car wash covered in mud, leaves, and bird droppings. The transformation stage is my car wash: it takes dirty data and makes it sparkle.
What gets cleaned:
| Problem | Solution |
|---|---|
| Duplicate transactions (same ID twice) | Removed automatically |
| Missing transaction IDs | Rows dropped |
| Unnecessary columns | Only essential fields kept |
The pipeline keeps only what matters:
- `id` — Unique transaction identifier
- `quantity` — How many items purchased
- `product_name` — What was bought
- `total_amount` — How much was paid
- `payment_method` — Cash, card, or digital
- `customer_type` — Member or regular customer
Transformation Logs
Step 3: Loading: "Two Warehouses, Two Purposes"
Here's where it gets interesting. Instead of storing data in just one place, I built a dual-database strategy. Think of it like having two different storage facilities:
PostgreSQL: The Library
PostgreSQL is like a meticulously organized library. Every book (data record) has its place, follows strict rules, and can be cross-referenced with other books easily.
Best for:
- Financial reports ("How much revenue did we make last month?")
- Accounting audits (data integrity is guaranteed)
- Complex queries ("Show me all cash transactions over $100 from member customers")
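For a taste of what that looks like in practice, here's a minimal query sketch using SQLAlchemy. It assumes the `transactions` table and columns the pipeline creates; the connection URL and the literal filter values ('cash', 'member', 100) are illustrative placeholders, not part of the pipeline itself.

```python
# A minimal sketch of the kind of query PostgreSQL shines at.
# The connection URL and the filter values are illustrative placeholders.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@localhost:5432/supermarket")

query = text("""
    SELECT id, product_name, total_amount
    FROM transactions
    WHERE payment_method = 'cash'
      AND customer_type = 'member'
      AND total_amount > 100
    ORDER BY total_amount DESC
""")

with engine.connect() as conn:
    for row in conn.execute(query):
        print(row.id, row.product_name, row.total_amount)
```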
MongoDB: The Flexible Warehouse
MongoDB is like a modern warehouse with adjustable shelving. You can store items of different shapes and sizes without reorganizing everything.
Best for:
- Mobile app backends (JSON-friendly)
- Rapid prototyping ("Let's quickly add a new field!")
- Analytics dashboards (flexible data exploration)
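And here's a small sketch of that flexibility with PyMongo, assuming the `transactions` collection the pipeline loads. The database name, the 'digital' payment value, and the new `loyalty_points` field are assumptions for illustration only.

```python
# Illustrative only: the database name, filter value, and new field
# are assumptions, not part of the pipeline above.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
transactions = client["supermarket"]["transactions"]

# JSON-friendly reads for an app backend: ten most recent digital payments
for doc in transactions.find({"payment_method": "digital"}).sort("id", -1).limit(10):
    print(doc["product_name"], doc["total_amount"])

# Add a brand-new field on the fly -- no schema migration required
transactions.update_many(
    {"customer_type": "member"},
    {"$set": {"loyalty_points": 0}}
)
```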
Docker containers running
PostgreSQL data view
MongoDB data view
How It Works (The Technical Deep-Dive)
For my fellow engineers, let's pop the hood and look at the engine.
Architecture Overview
```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Google Sheets  │────▶│   Python ETL    │────▶│   PostgreSQL    │
│  (Data Source)  │     │   (Container)   │     │  (Relational)   │
└─────────────────┘     │                 │     └─────────────────┘
                        │  • Extract      │
                        │  • Transform    │     ┌─────────────────┐
                        │  • Load         │────▶│     MongoDB     │
                        └─────────────────┘     │   (Document)    │
                                                └─────────────────┘
```
Project folder structure
The Modular Design Philosophy
Instead of one giant script, I split the pipeline into specialized modules, like having different specialists in a hospital:
| File | Role | Hospital Analogy |
|---|---|---|
| `config.py` | Configuration management | Hospital administrator |
| `extract.py` | Data extraction | Ambulance driver |
| `transform.py` | Data cleaning | Surgeon |
| `load_postgres.py` | PostgreSQL loading | Recovery ward nurse |
| `load_mongo.py` | MongoDB loading | Rehabilitation specialist |
| `main.py` | Orchestration | Chief of Medicine |
Why this matters:
- Testability: I can test the transformation logic without needing a database connection (see the test sketch after this list)
- Maintainability: Changing the data source doesn't break the loading logic
- Scalability: Adding a new destination (like Snowflake) is just adding one new file
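To make the testability point concrete, here's a minimal sketch of a unit test for `transform_data`. The sample rows are invented; the test needs no database, no network, and no Google Sheets.

```python
# test_transform.py -- an illustrative unit test; the sample rows are made up.
import pandas as pd

from etl_pipeline.transform import transform_data


def test_transform_drops_duplicates_missing_ids_and_extra_columns():
    raw = pd.DataFrame([
        {"id": 1, "quantity": 2, "product_name": "Milk", "total_amount": 4.0,
         "payment_method": "cash", "customer_type": "member", "cashier": "A"},
        {"id": 1, "quantity": 2, "product_name": "Milk", "total_amount": 4.0,
         "payment_method": "cash", "customer_type": "member", "cashier": "A"},
        {"id": None, "quantity": 1, "product_name": "Bread", "total_amount": 1.5,
         "payment_method": "card", "customer_type": "regular", "cashier": "B"},
    ])

    clean = transform_data(raw)

    assert len(clean) == 1                 # duplicate and missing-ID rows removed
    assert "cashier" not in clean.columns  # non-essential columns dropped
```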
main.py code
```python
from etl_pipeline.config import Config
from etl_pipeline.extract import extract_data
from etl_pipeline.transform import transform_data
from etl_pipeline.load_postgres import load_to_postgres
from etl_pipeline.load_mongo import load_to_mongo
import sys
import logging

# Configure logging to stdout
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[logging.StreamHandler(sys.stdout)]
)


def main():
    logging.info("ETL Application pipeline initialized.")

    # 1. Extract
    try:
        if Config.DATA_SOURCE_TYPE == "sheets":
            logging.info(f"Starting extraction from Google Sheets (ID: {Config.GOOGLE_SHEET_ID})")
            # Extract data
            data = extract_data(
                source_type="sheets",
                sheet_id=Config.GOOGLE_SHEET_ID
            )
        else:
            logging.error(f"Unknown data source: {Config.DATA_SOURCE_TYPE}")
            return

        logging.info(f"Extracted {len(data)} rows.")

        # 2. Transform
        logging.info("Step 2: Transform")
        transformed_data = transform_data(data)
        logging.info(f"Transformed Data Shape: {transformed_data.shape}")

        # 3. Load to PostgreSQL
        logging.info("Step 3: Load to PostgreSQL")
        load_to_postgres(transformed_data, Config.POSTGRES_URL)

        # 4. Load to MongoDB
        logging.info("Step 4: Load to MongoDB")
        load_to_mongo(
            transformed_data,
            Config.MONGO_URI,
            Config.MONGO_DB
        )

        logging.info("ETL pipeline completed successfully.")
    except Exception as e:
        logging.critical(f"ETL failed: {e}")


if __name__ == "__main__":
    main()
```
The Code Walkthrough
Extraction: Pandas Does the Heavy Lifting
```python
import pandas as pd


def extract_from_public_sheet(sheet_id):
    # Any public Google Sheet can be exported as CSV via this URL pattern
    export_url = f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv"
    df = pd.read_csv(export_url)
    return df
```
The magic: Google Sheets can export any public sheet as CSV, and Pandas reads it directly from the URL; no authentication is needed for public sheets!
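Calling it is a one-liner. The sheet ID below is a placeholder; if the sheet isn't shared publicly, `pd.read_csv` will fail, so a small guard keeps the error readable.

```python
# Illustrative usage of the function above; the sheet ID is a placeholder.
SHEET_ID = "your-public-sheet-id"

try:
    raw_df = extract_from_public_sheet(SHEET_ID)
    print(f"Extracted {len(raw_df)} rows with columns: {list(raw_df.columns)}")
except Exception as exc:
    print(f"Extraction failed -- is the sheet shared publicly? ({exc})")
```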
Transformation: Clean Data or Bust
```python
def transform_data(df):
    required_columns = ["id", "quantity", "product_name",
                        "total_amount", "payment_method", "customer_type"]
    df_transformed = df[required_columns].copy()
    df_transformed.drop_duplicates(subset=['id'], inplace=True)
    df_transformed.dropna(subset=['id'], inplace=True)
    return df_transformed
```
Key decisions:
- Only keep essential columns (data minimization)
- Remove duplicates by transaction ID (data integrity)
- Drop rows with missing IDs (no orphan records)
Loading: Two Paths, One Pipeline
PostgreSQL with SQLAlchemy:
```python
from sqlalchemy import create_engine


def load_to_postgres(df, db_url, table_name="transactions"):
    # 'replace' rebuilds the table on every run, so reloads stay idempotent
    engine = create_engine(db_url)
    df.to_sql(table_name, engine, if_exists='replace', index=False)
```
MongoDB with PyMongo:
```python
from pymongo import MongoClient


def load_to_mongo(df, mongo_uri, db_name, collection_name="transactions"):
    client = MongoClient(mongo_uri)
    collection = client[db_name][collection_name]
    # Convert the DataFrame to a list of dicts -- MongoDB's native shape
    records = df.to_dict("records")
    collection.insert_many(records)
```
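Wired together, the whole happy path fits in a few lines. The connection strings below are placeholders; in the real pipeline they come from `Config` (see `main.py` above).

```python
# Illustrative wiring of the three stages; connection details are placeholders.
POSTGRES_URL = "postgresql://user:password@localhost:5432/supermarket"
MONGO_URI = "mongodb://localhost:27017"

clean_df = transform_data(extract_from_public_sheet("your-public-sheet-id"))

load_to_postgres(clean_df, POSTGRES_URL)           # relational copy for reports
load_to_mongo(clean_df, MONGO_URI, "supermarket")  # document copy for the app
```

One design note: `if_exists='replace'` rebuilds the PostgreSQL table on every run, so reloads are idempotent there, while `insert_many` appends to MongoDB each time; if you rerun the pipeline against the same collection, consider clearing it first or switching to upserts.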
Successful ETL run
Docker: The "It Works on My Machine" Killer
One of the biggest headaches in software is environment setup. "It works on my machine!" is the developer's equivalent of "the dog ate my homework."
Docker solves this by containerizing everything. My entire stack (Python app, PostgreSQL, MongoDB) runs in isolated containers that work identically on any machine.
The docker-compose.yml Magic
```yaml
services:
  postgres:
    image: postgres:15
    # PostgreSQL runs in its own container

  mongo:
    image: mongo:6
    # MongoDB runs in its own container

  etl-app:
    build: .
    depends_on:
      - postgres
      - mongo
    # My Python app starts after the database containers
```
To run the entire system:
```bash
docker compose up -d --build
docker compose exec etl-app python main.py
```
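One caveat: `depends_on` only controls start order; it doesn't wait until PostgreSQL or MongoDB are actually accepting connections. A small retry loop at the top of the ETL app is one way to bridge that gap. This is a sketch of the idea, not code from the pipeline above.

```python
# A minimal readiness check (not part of the pipeline above): retry until
# PostgreSQL accepts connections, then let the ETL run proceed.
import time

from sqlalchemy import create_engine, text


def wait_for_postgres(db_url, retries=10, delay=3):
    engine = create_engine(db_url)
    for attempt in range(1, retries + 1):
        try:
            with engine.connect() as conn:
                conn.execute(text("SELECT 1"))
            print(f"PostgreSQL ready after {attempt} attempt(s).")
            return
        except Exception:
            time.sleep(delay)
    raise RuntimeError(f"PostgreSQL not reachable after {retries} attempts.")
```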
Key Lessons & Design Decisions
Why Two Databases?
| Use Case | Best Database | Reason |
|---|---|---|
| Financial reports | PostgreSQL | ACID compliance, SQL support |
| Mobile app API | MongoDB | JSON-native, flexible schema |
| Complex joins | PostgreSQL | Relational model excels |
| Rapid prototyping | MongoDB | No schema migrations needed |
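To make the trade-off concrete, here's the same question, revenue per payment method, asked of both stores. Both snippets are sketches: the connection details are placeholders, and the column names are the ones the pipeline loads.

```python
# Same question, two stores. Connection details are placeholders.
from pymongo import MongoClient
from sqlalchemy import create_engine, text

# PostgreSQL: one declarative SQL statement
engine = create_engine("postgresql://user:password@localhost:5432/supermarket")
with engine.connect() as conn:
    rows = conn.execute(text(
        "SELECT payment_method, SUM(total_amount) AS revenue "
        "FROM transactions GROUP BY payment_method"
    ))
    print({row.payment_method: row.revenue for row in rows})

# MongoDB: an aggregation pipeline over the same records
client = MongoClient("mongodb://localhost:27017")
pipeline = [{"$group": {"_id": "$payment_method", "revenue": {"$sum": "$total_amount"}}}]
print(list(client["supermarket"]["transactions"].aggregate(pipeline)))
```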
Why Python?
- Pandas: Industry-standard for data manipulation
- SQLAlchemy: database toolkit and ORM; its parameterized queries help prevent SQL injection
- PyMongo: Lightweight MongoDB driver
- Rich ecosystem: Libraries for everything
Why Modular Design?
Think of it like LEGO blocks. Each module is a self-contained piece that:
- Can be tested independently
- Can be replaced without breaking others
- Makes debugging a breeze
Future Enhancements
This pipeline is production-ready, but here's what could come next:
- Scheduling: Run automatically every hour with Apache Airflow or cron
- Message Queues: Use Kafka/RabbitMQ for async processing at scale
- Data Validation: Add Great Expectations for data quality checks (a plain-pandas stopgap is sketched after this list)
- Monitoring: Add Prometheus/Grafana for pipeline observability
- More Sources: Extend to pull from APIs, S3, or other databases
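As referenced in the list above, here's what a plain-pandas stopgap for data-quality checks might look like until a dedicated tool such as Great Expectations is wired in. The allowed payment methods are illustrative assumptions; this is a sketch, not part of the current pipeline.

```python
# A stopgap data-quality gate in plain pandas (not the Great Expectations API).
import pandas as pd

ALLOWED_PAYMENT_METHODS = {"cash", "card", "digital"}  # illustrative values


def validate(df: pd.DataFrame) -> pd.DataFrame:
    assert df["id"].is_unique, "Duplicate transaction IDs slipped past transform"
    assert (df["total_amount"] >= 0).all(), "Negative transaction totals found"
    unexpected = set(df["payment_method"].unique()) - ALLOWED_PAYMENT_METHODS
    assert not unexpected, f"Unexpected payment methods: {unexpected}"
    return df
```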
Conclusion
Building this ETL pipeline taught me that good data engineering is invisible. When it works, nobody notices: the reports are accurate, the app loads fast, and decisions get made with confidence.
But behind that invisibility is careful architecture: modular code, dual-database strategy, containerized deployment, and clean data transformations.
Whether you're a business analyst who just wants clean data, or an engineer looking to build your own pipeline, I hope this walkthrough demystified the magic behind turning chaotic spreadsheets into business intelligence gold.
The supermarket never runs out of your favorite snacks because somewhere, a data pipeline is quietly doing its job.
If you're interested in the code, check out the repository here: GitHub Repo





