John Wakaba

Real-World ETL Pipeline from a Public Google Sheet

Spreadsheets are everywhere.

They’re easy to use, easy to share, and often become the first home of business data. But they’re terrible for analytics, automation, and scale.

In this article, I’ll walk through how I built a production-style ETL pipeline that:

  • Extracts data from a public Google Sheet
  • Cleans and validates the data using Python
  • Loads the data into PostgreSQL and MongoDB
  • Handles real-world issues like UUIDs, connection strings, and performance bottlenecks

The Problem

A supermarket dataset was stored in a Google Sheet. While this works for manual inspection, it introduces several problems:

  • No schema enforcement
  • No support for analytics or BI tools
  • Poor performance for large queries
  • No safe way to integrate with applications

The goal was to move this data into proper databases while following real ETL best practices.


Architecture Overview

Public Google Sheet (CSV Export)
        ↓
     Python ETL
        ↓
  Transform & Validate
        ↓
 PostgreSQL (Analytics)   MongoDB (Documents)

Why two databases?

  • PostgreSQL acts as the system of record for analytics and reporting
  • MongoDB provides flexible, document-based storage for application access
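
To make that flow concrete, here’s a minimal sketch of the orchestration layer. The function names (extract_sheet, transform, load_postgres, load_mongo) are placeholders for illustration, not the project’s actual module layout:

from loguru import logger


def run_pipeline() -> None:
    # Placeholder functions, sketched in the steps below
    df = extract_sheet()   # pull the CSV export into a DataFrame
    df = transform(df)     # enforce types and validate ids
    load_postgres(df)      # system of record for analytics
    load_mongo(df)         # document store for application access
    logger.info("Pipeline finished with {} rows", len(df))


if __name__ == "__main__":
    run_pipeline()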

Tools Used

  • Python 3.12
  • UV for dependency and environment management
  • Pandas for data transformation
  • Requests for HTTP-based extraction
  • PostgreSQL (via SQLAlchemy)
  • MongoDB (via PyMongo)
  • Loguru for structured logging

Step 1: Extracting Data from a Public Google Sheet

Instead of dealing with Google Cloud authentication, I used a simpler (and very realistic) approach.

Google Sheets exposes a CSV export endpoint for public sheets.

A human-friendly link like this:

https://docs.google.com/spreadsheets/d/<sheet-id>/edit

can be converted to:

https://docs.google.com/spreadsheets/d/<sheet-id>/export?format=csv&gid=<gid>
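
In Python, building that export URL is just string formatting. A quick sketch (the sheet ID and gid below are placeholders, and the imports are the ones the fetch snippet further down relies on):

from io import StringIO

import pandas as pd
import requests

SHEET_ID = "your-sheet-id"  # placeholder: the /d/<sheet-id>/ part of the share link
GID = "0"                   # placeholder: the gid of the tab to export

csv_url = (
    f"https://docs.google.com/spreadsheets/d/{SHEET_ID}"
    f"/export?format=csv&gid={GID}"
)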

Python can then fetch the data directly:

response = requests.get(csv_url)
df = pd.read_csv(StringIO(response.text))

No API keys. No OAuth. Fully automated.


Step 2: Transforming the Data (Where Things Get Real)

This is where assumptions break.

I initially assumed the id column was numeric. It wasn’t.

It contained UUIDs like:

47d54138-a950-4ec0-9d4a-e637e8dfb290

Trying to cast this to an integer caused the pipeline to fail.

Lesson #1: The data always wins

The fix was simple but important:

  • Treat id as a string
  • Update both transformation logic and database schemas
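
A minimal sketch of that fix (the UUID validation step is my own addition for illustration; only the id column is taken from the dataset):

import uuid

import pandas as pd


def transform(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Keep id as a string: these are UUIDs, not integers
    df["id"] = df["id"].astype("string").str.strip()

    def is_valid_uuid(value) -> bool:
        try:
            uuid.UUID(str(value))
            return True
        except (ValueError, TypeError):
            return False

    # Drop any rows whose id isn't a valid UUID
    return df[df["id"].map(is_valid_uuid)]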

Step 3: Loading into PostgreSQL

PostgreSQL is the backbone of the pipeline.

Key design decisions:

  • Strong schema enforcement
  • Idempotent inserts
  • Safe re-runs of the pipeline

The table is created if it doesn’t exist, and inserts use:

ON CONFLICT (id) DO NOTHING

This ensures:

  • No duplicate records
  • No need to truncate tables
  • Safe incremental runs
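
A sketch of what that can look like with SQLAlchemy’s PostgreSQL insert helper. The table definition and connection URL here are illustrative; only the string id column comes from the dataset:

import pandas as pd
from sqlalchemy import Column, MetaData, String, Table, create_engine
from sqlalchemy.dialects.postgresql import insert

engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/supermarket")

metadata = MetaData()
products = Table(
    "products",
    metadata,
    Column("id", String, primary_key=True),  # UUIDs stored as strings
    Column("name", String),                  # illustrative column
)
metadata.create_all(engine)  # create the table if it doesn't exist


def load_postgres(df: pd.DataFrame) -> None:
    # Skip rows whose id already exists, so re-runs are safe
    stmt = insert(products).on_conflict_do_nothing(index_elements=["id"])
    with engine.begin() as conn:
        conn.execute(stmt, df.to_dict("records"))

Passing a list of dicts executes the insert for every record, and the ON CONFLICT clause quietly skips ids that are already in the table.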

Step 4: Loading into MongoDB (and Fixing Performance)

My first MongoDB implementation used update_one() in a loop.

It worked — but it was painfully slow.

Lesson #2: Row-by-row writes don’t scale

The fix was switching to bulk operations:
collection.bulk_write(operations, ordered=False)

This reduced load time.
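
For reference, a sketch of the bulk upsert, keyed on the same string id (the database and collection names are placeholders):

import pandas as pd
from loguru import logger
from pymongo import MongoClient, UpdateOne

client = MongoClient("mongodb://localhost:27017")
collection = client["supermarket"]["products"]  # placeholder names


def load_mongo(df: pd.DataFrame) -> None:
    # One upsert operation per record, sent to MongoDB in a single batch
    operations = [
        UpdateOne({"_id": rec["id"]}, {"$set": rec}, upsert=True)
        for rec in df.to_dict("records")
    ]
    if operations:
        result = collection.bulk_write(operations, ordered=False)
        logger.info("Upserted {}, modified {}", result.upserted_count, result.modified_count)

Setting ordered=False lets MongoDB keep going past individual failures instead of aborting the whole batch.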


Step 5: Debugging a Nasty PostgreSQL Error

One of the most confusing errors I hit was:

could not translate host name "4401@localhost"

It turned out the PostgreSQL password contained an @ symbol.

Lesson #3: Passwords in connection strings must be URL-encoded

The solution was URL encoding:

KIM@4401 → KIM%404401
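
Rather than encoding by hand, urllib.parse.quote_plus can do it where the connection URL is built (the credentials below are made up):

from urllib.parse import quote_plus

from sqlalchemy import create_engine

user = "etl_user"                    # placeholder
password = quote_plus("p@ssw0rd!")   # '@' becomes %40, '!' becomes %21

engine = create_engine(
    f"postgresql+psycopg2://{user}:{password}@localhost:5432/supermarket"
)

SQLAlchemy’s URL.create() can also build the connection URL from separate components, which sidesteps the escaping problem entirely.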

Results

After fixing these issues, the pipeline:

  • Runs end-to-end
  • Can be safely re-run without duplicates
  • Loads clean data into both databases
  • Handles real-world data quirks correctly

Key Takeaways

  • Spreadsheets are common sources — but not suitable destinations
  • Never assume data types without inspecting real data
  • UUIDs are extremely common in production systems
  • Bulk operations matter for performance
  • Environment variables and connection strings are frequent failure points

Final Thoughts

This project wasn’t about flashy tools.

It was about building something real, breaking it, and fixing it.

