Data is the new oil, but a raw oil field isn't useful until you build a pipeline to refine it. In this article, I'll take you through the journey of building SightSearch, a robust data ingestion orchestration pipeline. Whether you're a seasoned data engineer or a product manager curious about how data moves from a website to a database, you're in the right place.
The Problem: Why Orchestration Matters
Imagine you need to scrape thousands of product images and details daily. You write a script. It works fine on day one. But then:
- The script crashes halfway through
- You run out of disk space
- You forget to run it on Sunday
- The website layout changes
A simple script isn't enough. You need orchestration: a system that manages, schedules, monitors, and retries your tasks automatically.
Tech Stack
I entered the workshop with a clear goal: build something scalable and reliable. Here are the tools I chose:
- Apache Airflow: The industry standard for orchestrating complex workflows (DAGs)
- Docker & Docker Compose: To ensure the code runs the same way on my laptop as it does in production
- Python: For the heavy lifting (scraping, image processing)
- MongoDB: NoSQL storage for our flexible product data
- PostgreSQL: Relational storage for Airflow's internal metadata
Architecture Overview
The pipeline is split into independent, reusable "tasks." This modularity is key: if the scraping works but the database is down, we don't lose the data; we simply retry the storage step later.

A high-level diagram of the architecture
The Pipeline in Action
Let's look at the heart of our project: the Airflow DAG (Directed Acyclic Graph). It defines the order of operations.
1. The Scrape Task
First, we hit the target website to gather raw product titles and image URLs. The scraper walks the paginated listing pages and throttles its own requests to stay within the site's rate limits.
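To make that concrete, here is a minimal scraping sketch built on requests and BeautifulSoup; the target URL, CSS selectors, and the fixed one-second delay are placeholders for illustration, not the site or selectors SightSearch actually uses.

```python
import time

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://example.com/products"  # placeholder target site
REQUEST_DELAY_SECONDS = 1.0                # simple politeness delay between pages


def scrape_products(max_pages: int = 5) -> list[dict]:
    """Collect product titles and image URLs across paginated listing pages."""
    records = []
    for page in range(1, max_pages + 1):
        response = requests.get(BASE_URL, params={"page": page}, timeout=30)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")

        # Selectors are illustrative; the real ones depend on the site's markup.
        for card in soup.select(".product-card"):
            title = card.select_one(".product-title")
            image = card.select_one("img")
            if title and image:
                records.append({
                    "title": title.get_text(strip=True),
                    "image_url": image.get("src"),
                })

        time.sleep(REQUEST_DELAY_SECONDS)  # crude rate limiting between pages
    return records
```

A production version would also add retry with backoff on failed requests, but the shape of the task is the same: fetch, parse, emit a list of records.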
2. Image Processing
Raw images are heavy. We download them, calculate a perceptual hash (pHash) for deduplication, and extract metadata such as dimensions and file size.
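Here is roughly what that step can look like using Pillow and the imagehash library; the function name and the returned fields are illustrative rather than the exact schema SightSearch stores.

```python
import io

import imagehash
import requests
from PIL import Image


def process_image(image_url: str) -> dict:
    """Download an image and extract a perceptual hash plus basic metadata."""
    response = requests.get(image_url, timeout=30)
    response.raise_for_status()
    raw_bytes = response.content

    image = Image.open(io.BytesIO(raw_bytes))
    return {
        "image_url": image_url,
        "phash": str(imagehash.phash(image)),  # perceptual hash for deduplication
        "width": image.width,
        "height": image.height,
        "file_size_bytes": len(raw_bytes),
    }
```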
3. Validation and Storage
Data quality is paramount. We validate every record. Good data goes to MongoDB; bad data is logged for review.
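Putting the three steps together, a minimal DAG sketch might look like the following, assuming Airflow 2.x with the TaskFlow API; the module paths (sightsearch.scraper, sightsearch.images, sightsearch.storage), the daily schedule, and the retry count are assumptions for illustration, not the project's actual layout.

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(
    dag_id="sightsearch_ingestion_pipeline",
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args={"retries": 2},  # each task retries automatically on failure
)
def sightsearch_ingestion_pipeline():

    @task
    def scrape() -> list[dict]:
        # Import and connect inside the task, never at module level (see lessons below).
        from sightsearch.scraper import scrape_products  # hypothetical module path
        return scrape_products()

    @task
    def process_images(records: list[dict]) -> list[dict]:
        from sightsearch.images import process_image  # hypothetical module path
        return [{**r, "image_metadata": process_image(r["image_url"])} for r in records]

    @task
    def validate_and_store(records: list[dict]) -> None:
        from sightsearch.storage import validate_and_store_records  # hypothetical module path
        validate_and_store_records(records)

    # Chain the tasks: scrape -> process -> validate/store.
    validate_and_store(process_images(scrape()))


sightsearch_ingestion_pipeline()
```

Because each step is its own task, a failure in storage doesn't force a re-scrape; Airflow simply retries the failed task.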
Step-by-Step Walkthrough
Here's how we bring this system to life.
Phase 1: The Setup
We use docker-compose to spin up our entire infrastructure with one command:
docker compose -f docker/docker-compose.yml up -d

Terminal showing Docker containers starting up successfully
Phase 2: The Airflow UI
Once running, we log into the Airflow webserver. This is our command center.
We unpause our sightsearch_ingestion_pipeline DAG and trigger a run.
Phase 3: Monitoring Execution
As the pipeline runs, we can watch each task succeed in real-time. This visual feedback is incredibly satisfying and useful for debugging.

Airflow UI showing specific tasks turning dark green, indicating success
Phase 4: Verifying the Data
Finally, the moment of truth. We check our database to ensure the data actually arrived.

MongoDB query db.products.findOne() returning a structured product document with title, price, and image_metadata
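A quick programmatic check with pymongo can look something like this; the connection string and database name are assumptions, while the products collection matches the db.products.findOne() query shown above.

```python
import os

from pymongo import MongoClient

# Connection details come from the environment; the database name is illustrative.
client = MongoClient(os.environ.get("MONGO_URI", "mongodb://localhost:27017"))
collection = client["sightsearch"]["products"]

# Equivalent of db.products.findOne() in the mongo shell.
sample = collection.find_one()
if sample:
    print(sample.get("title"), sample.get("image_metadata", {}).get("phash"))

# Quick sanity check on how many documents the pipeline has produced overall.
print("total documents:", collection.count_documents({}))
```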
Challenges and Best Practices
It wasn't all smooth sailing. Here are the two critical lessons I learned:
1. Handling Secrets Securely
Initially, I hardcoded database passwords in docker-compose.yml. This is a huge security risk!
Solution: I refactored to use a .env file, keeping my credentials out of version control.
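In practice, that means the application code only ever reads credentials from the container's environment (populated from the .env file through docker-compose). Here is a small sketch of that pattern; MONGO_URI and MONGO_DB are example variable names, not necessarily the project's actual ones.

```python
import os

# These values are injected into the container's environment from the .env file,
# so no credentials ever appear in the code or in docker-compose.yml itself.
MONGO_URI = os.getenv("MONGO_URI")
MONGO_DB = os.getenv("MONGO_DB", "sightsearch")

if not MONGO_URI:
    # Fail fast with a clear message instead of a confusing connection error later.
    raise RuntimeError("MONGO_URI is not set; check your .env file")
```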
2. Module-Level Connections
I initially opened a database connection at the top of our scraping script. This caused Airflow to try to connect to the database every time it parsed the file, leading to timeouts.
Solution: I moved the connection logic inside the execution functions. Always initialize resources lazily!
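Here is a sketch of that lazy pattern, using pymongo as an example; the function, database, and collection names are illustrative.

```python
import os

from pymongo import MongoClient


def store_records(records: list[dict]) -> None:
    """Open the database connection only when the task actually runs."""
    # Creating the client here, inside the function, means the Airflow scheduler
    # can parse the DAG file without ever touching the database.
    client = MongoClient(os.environ["MONGO_URI"], serverSelectionTimeoutMS=5000)
    try:
        if records:
            client["sightsearch"]["products"].insert_many(records)
    finally:
        client.close()
```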
Conclusion
SightSearch demonstrates that with the right tools, even complex data ingestion can be made reliable and transparent. Airflow gives us control, Docker gives us consistency, and Python gives us power.
If you're interested in the code, check out the repository here: GitHub Repo