
Oliver Samuel


Building Scalable Data Pipelines with Airflow, Docker, and Python: A SightSearch Case Study

Data is the new oil, but a raw oil field isn't useful until you build a pipeline to refine it. In this article, I'll take you through the journey of building SightSearch, a robust data ingestion orchestration pipeline. Whether you're a seasoned data engineer or a product manager curious about how data moves from a website to a database, you're in the right place.


The Problem: Why Orchestration Matters

Imagine you need to scrape thousands of product images and details daily. You write a script. It works fine on day one. But then:

  • The script crashes halfway through
  • You run out of disk space
  • You forget to run it on Sunday
  • The website layout changes

A simple script isn't enough. You need orchestration, a system that manages, schedules, monitors, and retries your tasks automatically.


Tech Stack

I entered the workshop with a clear goal: build something scalable and reliable. Here are the tools I chose:

  • Apache Airflow: The industry standard for orchestrating complex workflows (DAGs)
  • Docker & Docker Compose: To ensure the code runs the same way on my laptop as it does in production
  • Python: For the heavy lifting (scraping, image processing)
  • MongoDB: NoSQL storage for our flexible product data
  • PostgreSQL: Relational storage for Airflow's internal metadata

Architecture Overview

The pipeline is split into independent, reusable "tasks." This modularity is key. If the scraping works but the database is down, we don't lose the data; we just retry the storage step later.

A high-level diagram of the architecture


The Pipeline in Action

Let's look at the heart of our project: the Airflow DAG (Directed Acyclic Graph). It defines the order of operations.
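Here's a minimal sketch of the DAG definition. The task callables, module names, schedule, and retry settings below are illustrative assumptions, not the exact production code:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical task module -- the real callables live in the project's own packages.
from tasks import scrape_products, process_images, validate_and_store

default_args = {
    "retries": 2,                         # retry failed tasks automatically
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="sightsearch_ingestion_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                    # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
    default_args=default_args,
) as dag:
    scrape = PythonOperator(task_id="scrape_products", python_callable=scrape_products)
    process = PythonOperator(task_id="process_images", python_callable=process_images)
    store = PythonOperator(task_id="validate_and_store", python_callable=validate_and_store)

    # The bitshift operator encodes the order of operations: scrape -> process -> store.
    scrape >> process >> store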

1. The Scrape Task

First, we hit the target website to gather raw product titles and image URLs, following pagination links and throttling requests to respect the site's rate limits.
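A simplified sketch of that task; the URL, CSS selectors, and page count are placeholders rather than the real target site:

import time

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://example.com/products"   # placeholder target, not the real site

def scrape_products(max_pages: int = 5, delay_seconds: float = 1.0) -> list[dict]:
    """Collect product titles and image URLs across paginated listings."""
    records = []
    for page in range(1, max_pages + 1):
        response = requests.get(BASE_URL, params={"page": page}, timeout=30)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        for card in soup.select(".product-card"):            # placeholder selector
            records.append({
                "title": card.select_one(".title").get_text(strip=True),
                "image_url": card.select_one("img")["src"],
            })
        time.sleep(delay_seconds)   # simple rate limiting between pages
    return records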

2. Image Processing

Raw images are heavy. We download them, compute a perceptual hash (pHash) for deduplication, and extract metadata such as dimensions and file size.
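A sketch of that step using Pillow and the imagehash library; the exact fields stored in the project may differ:

from io import BytesIO

import imagehash
import requests
from PIL import Image

def process_image(image_url: str) -> dict:
    """Download an image, compute its perceptual hash, and extract basic metadata."""
    response = requests.get(image_url, timeout=30)
    response.raise_for_status()

    raw_bytes = response.content
    image = Image.open(BytesIO(raw_bytes))

    return {
        "image_url": image_url,
        "phash": str(imagehash.phash(image)),   # perceptual hash used for deduplication
        "width": image.width,
        "height": image.height,
        "file_size_bytes": len(raw_bytes),
    }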

3. Validation and Storage

Data quality is paramount. We validate every record. Good data goes to MongoDB; bad data is logged for review.
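A simplified sketch of that split; the database name, collection name, and required fields are assumptions for illustration:

import logging

from pymongo import MongoClient

logger = logging.getLogger(__name__)

REQUIRED_FIELDS = ("title", "image_url", "phash")   # assumed minimum schema

def validate_and_store(records: list[dict], mongo_uri: str) -> None:
    """Insert valid records into MongoDB; log invalid ones for review."""
    collection = MongoClient(mongo_uri)["sightsearch"]["products"]

    for record in records:
        missing = [field for field in REQUIRED_FIELDS if not record.get(field)]
        if missing:
            logger.warning("Skipping record missing fields %s: %r", missing, record)
            continue
        collection.insert_one(record)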


Step-by-Step Walkthrough

Here's how we bring this system to life.

Phase 1: The Setup

We use docker-compose to spin up our entire infrastructure with one command:

docker compose -f docker/docker-compose.yml up -d

Terminal showing Docker containers starting up successfully

Phase 2: The Airflow UI

Once running, we log into the Airflow webserver. This is our command center.

We unpause our sightsearch_ingestion_pipeline and trigger a run.

Phase 3: Monitoring Execution

As the pipeline runs, we can watch each task succeed in real time. This visual feedback is incredibly satisfying and useful for debugging.

Airflow UI showing tasks turning dark green, indicating success

Phase 4: Verifying the Data

Finally, the moment of truth. We check our database to ensure the data actually arrived.

MongoDB query db.products.findOne() returning a structured product document with title, price, and image_metadata
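If you prefer to verify from Python instead of the mongo shell (pymongo spells it find_one rather than findOne), a quick sanity check looks like this; the connection string assumes docker-compose exposes MongoDB on localhost:

from pymongo import MongoClient

# Assumes docker-compose maps MongoDB to localhost:27017.
client = MongoClient("mongodb://localhost:27017")
product = client["sightsearch"]["products"].find_one()
print(product)   # expect fields like title, price, and image_metadata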


Challenges and Best Practices

It wasn't all smooth sailing. Here are critical lessons I learned:

1. Handling Secrets Securely

Initially, I hardcoded database passwords in docker-compose.yml. This is a huge security risk!

Solution: I refactored to use a .env file, keeping my credentials out of version control.
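On the Python side, the scripts then read credentials from the environment instead of hardcoding them; the variable names here are illustrative, and docker-compose injects them from the .env file:

import os

from pymongo import MongoClient

# Values come from the .env file loaded by docker-compose; names are illustrative.
MONGO_USER = os.environ["MONGO_USER"]
MONGO_PASSWORD = os.environ["MONGO_PASSWORD"]
MONGO_HOST = os.getenv("MONGO_HOST", "mongo")

client = MongoClient(f"mongodb://{MONGO_USER}:{MONGO_PASSWORD}@{MONGO_HOST}:27017")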

2. Module-Level Connections

I initially opened a database connection at the top of the scraping script, at module level. Because Airflow imports that module every time it parses the DAG file, it tried to connect to the database during parsing, leading to timeouts.

Solution: I moved the connection logic inside the execution functions. Always initialize resources lazily!
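In code, the difference looks roughly like this (simplified; the real task functions do more work):

from pymongo import MongoClient

# Bad: this line would run every time Airflow parses the DAG file,
# even when no task is actually executing.
# client = MongoClient("mongodb://mongo:27017")

def store_records(records: list[dict]) -> None:
    """Open the connection only when the task actually runs."""
    client = MongoClient("mongodb://mongo:27017")   # lazy: created inside the task
    if records:
        client["sightsearch"]["products"].insert_many(records)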


Conclusion

SightSearch demonstrates that with the right tools, even complex data ingestion can be made reliable and transparent. Airflow gives us control, Docker gives us consistency, and Python gives us power.

If you're interested in the code, check out the repository here: GitHub Repo

