After spending some time diving into the fundamentals of data engineering and Docker, I wanted a practical way to put my skills to the test. I decided to move beyond writing basic scripts and attempted to build a functional, containerized data pipeline. My goal was simple: scrape real-world e-commerce data, manage persistence reliably, and use Docker to orchestrate the entire environment.
This project served as a test of what I’ve learned so far about data flow, containerization, and handling the unique quirks of file-based databases in a virtualized setting.
The Stack
To keep things lightweight but effective, I chose the following tools:
- Scrapy: For building the spider logic.
- SQLite: A serverless, file-based database for local persistence.
- Docker & Docker Compose: To containerize the application and manage service dependencies.
- sqlite-web: A web-based interface to monitor the scraped data in real-time.
The Spider: Testing Extraction Logic
I built a Scrapy spider targeting the Xiaomi store on Jumia. The goal here was to practice selecting specific HTML elements and cleaning the data before it enters the pipeline.
# jumiascraper/spiders/jumia_spider.py
import scrapy
from jumiascraper.items import JumiaScraperItem


class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://www.jumia.com.ng/mlp-xiaomi-store/"]

    def parse(self, response):
        product_articles = response.css('article.prd._fb.col.c-prd')
        for product in product_articles:
            # get(default='') avoids crashing on listings missing an old price or discount badge
            yield JumiaScraperItem(
                name=product.css('div.info h3.name::text').get(default='').strip(),
                new_price=product.css('div.info div.prc::text').get(default='').strip(),
                old_price=product.css("div.info div.s-prc-w div.old::text").get(default='').strip(),
                discount=product.css('div.info div.s-prc-w div.bdg._dsct._sm::text').get(default='').strip(),
                url=response.urljoin(product.css('a.core::attr(href)').get())
            )
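The spider imports JumiaScraperItem from jumiascraper/items.py, which isn't shown above. A minimal sketch of what that item definition would look like, with field names inferred from the spider (so treat it as an assumption rather than the exact file):

# jumiascraper/items.py (sketch; fields inferred from the spider above)
import scrapy


class JumiaScraperItem(scrapy.Item):
    name = scrapy.Field()
    new_price = scrapy.Field()
    old_price = scrapy.Field()
    discount = scrapy.Field()
    url = scrapy.Field()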
The Pipeline: Learning Data Persistence
One of my main learning goals was handling data updates. Instead of just appending duplicates, I used INSERT OR REPLACE logic in the pipeline. With a UNIQUE constraint on the url column in the schema, the database automatically updates existing records with the latest prices rather than creating new entries.
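The behavior is easy to check in isolation before wiring it into Scrapy. A minimal sketch against an in-memory database (the table and values here are illustrative only, not project code):

# Quick illustration of INSERT OR REPLACE with a UNIQUE column (illustrative only)
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products(name TEXT, price TEXT, url TEXT UNIQUE)")
conn.execute("INSERT OR REPLACE INTO products VALUES (?, ?, ?)", ("Redmi 12", "N150,000", "/redmi-12/"))
# Same url again: the conflicting row is replaced, so the table still holds one row with the new price
conn.execute("INSERT OR REPLACE INTO products VALUES (?, ?, ?)", ("Redmi 12", "N145,000", "/redmi-12/"))
print(conn.execute("SELECT * FROM products").fetchall())  # [('Redmi 12', 'N145,000', '/redmi-12/')]

One nuance worth knowing: INSERT OR REPLACE deletes the conflicting row and inserts a fresh one, so an AUTOINCREMENT id will change on every update.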
# jumiascraper/pipelines.py
import sqlite3
from itemadapter import ItemAdapter


class JumiascraperPipeline:
    def open_spider(self, spider):
        # Fetching the DB name from Scrapy settings
        self.conn = sqlite3.connect(spider.settings.get('SQLITE_DB'))
        self.cur = self.conn.cursor()
        self.cur.execute("""
            CREATE TABLE IF NOT EXISTS jumia_products(
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                name TEXT,
                new_price TEXT,
                old_price TEXT,
                discount TEXT,
                url TEXT UNIQUE
            )
        """)
        self.conn.commit()

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        data = [adapter.get('name'), adapter.get('new_price'),
                adapter.get('old_price'), adapter.get('discount'), adapter.get('url')]
        # Upsert logic: Update price if product already exists
        self.cur.execute('''INSERT OR REPLACE INTO jumia_products(name, new_price,
            old_price, discount, url) VALUES (?, ?, ?, ?, ?)''', data)
        return item

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()
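The pipeline reads SQLITE_DB from the Scrapy settings, while the compose file below injects it as an environment variable, so settings.py has to bridge the two. A minimal sketch of that glue (the fallback filename and pipeline priority are assumptions):

# jumiascraper/settings.py (sketch; only the relevant lines)
import os

# Pick up the database path from the container environment,
# falling back to a local filename for runs outside Docker (assumed default)
SQLITE_DB = os.environ.get('SQLITE_DB', 'jumia_data.db')

ITEM_PIPELINES = {
    'jumiascraper.pipelines.JumiascraperPipeline': 300,
}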
Orchestration
The most interesting part of this project was orchestrating the services via docker-compose.yml. Since SQLite is a file on disk rather than a network service, I had to use shared volumes to allow the db_viewer to "see" the database created by the scraper.
# jumiascraper/docker-compose.yml
services:
  scraper:
    build: .
    environment:
      - SQLITE_DB=jumia_data.db
    volumes:
      - .:/app  # Shared directory for the .db file
  db_viewer:
    image: coleifer/sqlite-web
    ports:
      - "8080:8080"
    volumes:
      - .:/data:ro  # Read-only access to the same directory
    environment:
      - SQLITE_DATABASE=jumia_data.db
    depends_on:
      - scraper
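The scraper service uses build: ., which implies a Dockerfile in the project root that isn't shown here. A minimal sketch of what it might contain (the base image, requirements file, and crawl command are all assumptions):

# Dockerfile (sketch; base image and entrypoint are assumptions)
FROM python:3.11-slim

WORKDIR /app

# Install dependencies (a requirements.txt containing scrapy is assumed)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# Run the spider defined above; the .db file lands in /app, i.e. the shared volume
CMD ["scrapy", "crawl", "products"]

With both files in place, docker compose up --build starts the scraper and the viewer together, and the scraped rows become browsable at http://localhost:8080 once the pipeline commits (which, in the code above, happens when the spider closes).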
Lessons Learned
- Volumes are Key: I learned that for file-based databases in Docker, shared volumes are the primary way to give another container visibility into the data, since SQLite has no server process you can expose over a network port.
- Data Integrity: Implementing UNIQUE constraints and INSERT OR REPLACE logic was a great lesson in maintaining a clean dataset over multiple runs.
- Read-Only Safety: Using the :ro flag for the viewer service taught me how to provide data visibility while ensuring the monitoring tool doesn't accidentally interfere with the scraper's writes.
Future Work: Scaling for Multi-Site Price Tracking
While this project focuses on Jumia, I’ve designed the architecture to be expandable. In the future, I plan to add more spiders to scrape additional e-commerce sites (like Konga). This will allow me to track price changes on similar products across multiple retailers simultaneously.
To support this, I'll need to evolve the schema further—likely adding a store_name column—to compare prices across different sources while still using the INSERT OR REPLACE logic to keep data fresh. This will be another great way to test how Docker handles multiple concurrent scraping services and shared database access.
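As a sketch of where that schema evolution might land, the table could gain a store_name column and the uniqueness constraint could widen to the (store_name, url) pair, so the same upsert pattern keeps one fresh row per product per store. All names here are hypothetical and nothing below is implemented yet:

# Possible future schema and upsert (hypothetical; not yet implemented)
CREATE_SQL = """
CREATE TABLE IF NOT EXISTS products(
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    store_name TEXT,
    name TEXT,
    new_price TEXT,
    old_price TEXT,
    discount TEXT,
    url TEXT,
    UNIQUE(store_name, url)
)
"""

UPSERT_SQL = """
INSERT OR REPLACE INTO products(store_name, name, new_price, old_price, discount, url)
VALUES (?, ?, ?, ?, ?, ?)
"""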