John Wakaba

Real-World ETL Pipeline from a Public Google Sheet

Spreadsheets are everywhere.

They’re easy to use, easy to share, and often become the first home of business data. But they’re terrible for analytics, automation, and scale.

In this article, I’ll walk through how I built a production-style ETL pipeline that:

  • Extracts data from a public Google Sheet
  • Cleans and validates the data using Python
  • Loads the data into PostgreSQL and MongoDB
  • Handles real-world issues like UUIDs, connection strings, and performance bottlenecks

The Problem

A supermarket dataset was stored in a Google Sheet. While this works for manual inspection, it introduces several problems:

  • No schema enforcement
  • No support for analytics or BI tools
  • Poor performance for large queries
  • No safe way to integrate with applications

The goal was to move this data into proper databases while following real ETL best practices.


Architecture Overview

Public Google Sheet (CSV Export)
        ↓
     Python ETL
        ↓
  Transform & Validate
        ↓
 PostgreSQL (Analytics)   MongoDB (Documents)

Why two databases?

  • PostgreSQL acts as the system of record for analytics and reporting
  • MongoDB provides flexible, document-based storage for application access
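
To make that flow concrete, here’s a minimal sketch of the orchestration layer. The function names (extract_sheet, transform, load_postgres, load_mongo) are placeholders for illustration, not the project’s actual module layout:

from loguru import logger


def run_pipeline() -> None:
    # Placeholder functions, sketched in the steps below
    df = extract_sheet()   # pull the CSV export into a DataFrame
    df = transform(df)     # enforce types and validate ids
    load_postgres(df)      # system of record for analytics
    load_mongo(df)         # document store for application access
    logger.info("Pipeline finished with {} rows", len(df))


if __name__ == "__main__":
    run_pipeline()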

Tools Used

  • Python 3.12
  • UV for dependency and environment management
  • Pandas for data transformation
  • Requests for HTTP-based extraction
  • PostgreSQL (via SQLAlchemy)
  • MongoDB (via PyMongo)
  • Loguru for structured logging

Step 1: Extracting Data from a Public Google Sheet

Instead of dealing with Google Cloud authentication, I used a simpler (and very realistic) approach.

Google Sheets exposes a CSV export endpoint for public sheets.

A human-friendly link like this:

https://docs.google.com/spreadsheets/d/<sheet-id>/edit

can be converted to:

https://docs.google.com/spreadsheets/d/<sheet-id>/export?format=csv&gid=<gid>
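
In Python, building that export URL is just string formatting. A quick sketch (the sheet ID and gid below are placeholders, and the imports are the ones the fetch snippet further down relies on):

from io import StringIO

import pandas as pd
import requests

SHEET_ID = "your-sheet-id"  # placeholder: the /d/<sheet-id>/ part of the share link
GID = "0"                   # placeholder: the gid of the tab to export

csv_url = (
    f"https://docs.google.com/spreadsheets/d/{SHEET_ID}"
    f"/export?format=csv&gid={GID}"
)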

Python can then fetch the data directly:

response = requests.get(csv_url)
df = pd.read_csv(StringIO(response.text))

No API keys. No OAuth. Fully automated.


Step 2: Transforming the Data (Where Things Get Real)

This is where assumptions break.

I initially assumed the id column was numeric. It wasn’t.

It contained UUIDs like:

47d54138-a950-4ec0-9d4a-e637e8dfb290

Trying to cast this to an integer caused the pipeline to fail.

Lesson #1: The data always wins

The fix was simple but important:

  • Treat id as a string
  • Update both transformation logic and database schemas
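
A minimal sketch of that fix (the UUID validation step is my own addition for illustration; only the id column is taken from the dataset):

import uuid

import pandas as pd


def transform(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Keep id as a string: these are UUIDs, not integers
    df["id"] = df["id"].astype("string").str.strip()

    def is_valid_uuid(value) -> bool:
        try:
            uuid.UUID(str(value))
            return True
        except (ValueError, TypeError):
            return False

    # Drop any rows whose id isn't a valid UUID
    return df[df["id"].map(is_valid_uuid)]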

Step 3: Loading into PostgreSQL

PostgreSQL is the backbone of the pipeline.

Key design decisions:

  • Strong schema enforcement
  • Idempotent inserts
  • Safe re-runs of the pipeline

The table is created if it doesn’t exist, and inserts use:

ON CONFLICT (id) DO NOTHING

This ensures:

  • No duplicate records
  • No need to truncate tables
  • Safe incremental runs
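
A sketch of what that can look like with SQLAlchemy’s PostgreSQL insert helper. The table definition and connection URL here are illustrative; only the string id column comes from the dataset:

import pandas as pd
from sqlalchemy import Column, MetaData, String, Table, create_engine
from sqlalchemy.dialects.postgresql import insert

engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/supermarket")

metadata = MetaData()
products = Table(
    "products",
    metadata,
    Column("id", String, primary_key=True),  # UUIDs stored as strings
    Column("name", String),                  # illustrative column
)
metadata.create_all(engine)  # create the table if it doesn't exist


def load_postgres(df: pd.DataFrame) -> None:
    # Skip rows whose id already exists, so re-runs are safe
    stmt = insert(products).on_conflict_do_nothing(index_elements=["id"])
    with engine.begin() as conn:
        conn.execute(stmt, df.to_dict("records"))

Passing a list of dicts executes the insert for every record, and the ON CONFLICT clause quietly skips ids that are already in the table.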

Step 4: Loading into MongoDB (and Fixing Performance)

My first MongoDB implementation used update_one() in a loop.

It worked — but it was painfully slow.

Lesson #2: Row-by-row writes don’t scale

The fix was switching to bulk operations:
collection.bulk_write(operations, ordered=False)

This reduced load time.
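
For reference, a sketch of the bulk upsert, keyed on the same string id (the database and collection names are placeholders):

import pandas as pd
from loguru import logger
from pymongo import MongoClient, UpdateOne

client = MongoClient("mongodb://localhost:27017")
collection = client["supermarket"]["products"]  # placeholder names


def load_mongo(df: pd.DataFrame) -> None:
    # One upsert operation per record, sent to MongoDB in a single batch
    operations = [
        UpdateOne({"_id": rec["id"]}, {"$set": rec}, upsert=True)
        for rec in df.to_dict("records")
    ]
    if operations:
        result = collection.bulk_write(operations, ordered=False)
        logger.info("Upserted {}, modified {}", result.upserted_count, result.modified_count)

Setting ordered=False lets MongoDB keep going past individual failures instead of aborting the whole batch.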


Step 5: Debugging a Nasty PostgreSQL Error

One of the most confusing errors I hit was:

could not translate host name "4401@localhost"

It turned out the PostgreSQL password contained an @ symbol.

Lesson #3: Passwords in connection strings must be URL-encoded

The solution was URL encoding:

KIM@4401 → KIM%404401
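
Rather than encoding by hand, urllib.parse.quote_plus can do it where the connection URL is built (the credentials below are made up):

from urllib.parse import quote_plus

from sqlalchemy import create_engine

user = "etl_user"                    # placeholder
password = quote_plus("p@ssw0rd!")   # '@' becomes %40, '!' becomes %21

engine = create_engine(
    f"postgresql+psycopg2://{user}:{password}@localhost:5432/supermarket"
)

SQLAlchemy’s URL.create() can also build the connection URL from separate components, which sidesteps the escaping problem entirely.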

Results

After fixing these issues, the pipeline:

  • Runs end-to-end
  • Can be safely re-run without duplicates
  • Loads clean data into both databases
  • Handles real-world data quirks correctly

Key Takeaways

  • Spreadsheets are common sources — but not suitable destinations
  • Never assume data types without inspecting real data
  • UUIDs are extremely common in production systems
  • Bulk operations matter for performance
  • Environment variables and connection strings are frequent failure points

Final Thoughts

This project wasn’t about flashy tools.

It was about building something real, breaking it, and fixing it.

