Why Build a Price Comparison Tool?
Price comparison tools are one of the most practical web scraping projects you can build. Whether it's for personal use (finding the best deals) or a business application (competitive pricing intelligence), the fundamentals are the same: scrape prices from multiple sources, normalize the data, and present the results.
In this tutorial, I'll walk you through building a multi-site price scraper from scratch.
Architecture Overview
Our price comparison tool has four components:
- Scrapers — Site-specific modules that extract product data
- Normalizer — Cleans and standardizes data across sources
- Storage — SQLite database for price history
- Reporter — Generates comparison output
┌─────────────┐ ┌────────────┐ ┌──────────┐ ┌──────────┐
│ Scraper A │────▶│ │────▶│ │────▶│ │
│ Scraper B │────▶│ Normalizer │────▶│ SQLite │────▶│ Reporter │
│ Scraper C │────▶│ │────▶│ │────▶│ │
└─────────────┘ └────────────┘ └──────────┘ └──────────┘
Setting Up
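A plain virtual environment with a handful of libraries is enough to follow along. The package list below matches the code used in the rest of the tutorial:

```shell
# Create an isolated environment and install the dependencies.
python3 -m venv venv
source venv/bin/activate
pip install requests beautifulsoup4 pandas schedule
```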
The Base Scraper Class
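Every site-specific scraper shares the same plumbing, so it makes sense to pull that into a base class. Here is a minimal sketch of one way to do it — the `Product` dataclass, the `BaseScraper` name, and the `parse_price` helper are my own naming choices, but the rest of the code in this tutorial expects product records with these fields:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

import requests


@dataclass
class Product:
    """One price observation for one product on one site."""
    product_id: str
    name: str
    price: float
    currency: str
    source: str
    url: str
    in_stock: bool = True


class BaseScraper(ABC):
    """Shared plumbing for every site-specific scraper."""

    def __init__(self):
        # A persistent session reuses connections and carries headers
        # that look like a normal browser.
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                           'AppleWebKit/537.36 (KHTML, like Gecko) '
                           'Chrome/120.0 Safari/537.36'),
            'Accept-Language': 'en-US,en;q=0.9',
        })

    @abstractmethod
    def search(self, query: str) -> list[Product]:
        """Return Product records for a search query."""

    @staticmethod
    def parse_price(text: str) -> float:
        """Normalize a raw price string like '$1,299.99' to a float.

        Assumes a dot decimal separator; locale handling is left out.
        """
        cleaned = ''.join(ch for ch in text if ch.isdigit() or ch == '.')
        return float(cleaned) if cleaned else 0.0
```

Keeping `parse_price` on the base class means every scraper normalizes currency strings the same way, which is half the "normalizer" job already.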
Building Site-Specific Scrapers
Amazon Scraper
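A sketch of what an Amazon scraper can look like. The CSS selectors here are illustrative guesses — Amazon's markup changes often and varies by region, so treat them as placeholders to verify against the live site. `Product` and `parse_price` are repeated so the snippet runs standalone; in the project they would come from the base-scraper module:

```python
from dataclasses import dataclass

import requests
from bs4 import BeautifulSoup


@dataclass
class Product:
    product_id: str
    name: str
    price: float
    currency: str
    source: str
    url: str
    in_stock: bool = True


def parse_price(text: str) -> float:
    cleaned = ''.join(ch for ch in text if ch.isdigit() or ch == '.')
    return float(cleaned) if cleaned else 0.0


class AmazonScraper:
    BASE = 'https://www.amazon.com'

    def __init__(self):
        self.session = requests.Session()
        self.session.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'

    def search(self, query: str) -> list[Product]:
        resp = self.session.get(f'{self.BASE}/s', params={'k': query}, timeout=15)
        resp.raise_for_status()
        return self.parse_results(resp.text)

    def parse_results(self, html: str) -> list[Product]:
        # NOTE: selectors below are assumptions and will need maintenance.
        soup = BeautifulSoup(html, 'html.parser')
        products = []
        for card in soup.select('div[data-asin]'):
            asin = card.get('data-asin')
            title = card.select_one('h2')
            price = card.select_one('.a-price .a-offscreen')
            if not (asin and title and price):
                continue  # skip sponsored slots and incomplete cards
            products.append(Product(
                product_id=asin,
                name=title.get_text(strip=True),
                price=parse_price(price.get_text()),
                currency='USD',
                source='amazon',
                url=f'{self.BASE}/dp/{asin}',
            ))
        return products
```

Splitting `search` (network) from `parse_results` (parsing) keeps the parser testable against saved HTML, without hitting the site.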
eBay Scraper
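The eBay scraper follows the same pattern. Again, the class names in the selectors reflect eBay's listing markup as I last saw it and should be verified against the live site; `Product` and `parse_price` are repeated so the snippet runs standalone:

```python
from dataclasses import dataclass

import requests
from bs4 import BeautifulSoup


@dataclass
class Product:
    product_id: str
    name: str
    price: float
    currency: str
    source: str
    url: str
    in_stock: bool = True


def parse_price(text: str) -> float:
    cleaned = ''.join(ch for ch in text if ch.isdigit() or ch == '.')
    return float(cleaned) if cleaned else 0.0


class EbayScraper:
    BASE = 'https://www.ebay.com'

    def __init__(self):
        self.session = requests.Session()
        self.session.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'

    def search(self, query: str) -> list[Product]:
        resp = self.session.get(f'{self.BASE}/sch/i.html',
                                params={'_nkw': query}, timeout=15)
        resp.raise_for_status()
        return self.parse_results(resp.text)

    def parse_results(self, html: str) -> list[Product]:
        soup = BeautifulSoup(html, 'html.parser')
        products = []
        for card in soup.select('li.s-item'):
            title = card.select_one('.s-item__title')
            price = card.select_one('.s-item__price')
            link = card.select_one('a.s-item__link')
            if not (title and price and link):
                continue
            url = link['href']
            # Listing ids appear in the URL as /itm/<id>; strip the query string.
            item_id = url.split('/itm/')[-1].split('?')[0]
            # eBay sometimes shows ranges like '$12.50 to $20.00'; keep the low end.
            low_price = price.get_text().split(' to ')[0]
            products.append(Product(
                product_id=item_id,
                name=title.get_text(strip=True),
                price=parse_price(low_price),
                currency='USD',
                source='ebay',
                url=url,
            ))
        return products
```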
Price Database
```python
import sqlite3


class PriceDatabase:
    def __init__(self, db_path='prices.db'):
        self.conn = sqlite3.connect(db_path)
        self.create_tables()

    def create_tables(self):
        self.conn.execute('''
            CREATE TABLE IF NOT EXISTS prices (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                product_id TEXT,
                name TEXT,
                price REAL,
                currency TEXT,
                source TEXT,
                url TEXT,
                scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            )
        ''')
        self.conn.commit()

    def save_product(self, product: Product):
        self.conn.execute(
            'INSERT INTO prices (product_id, name, price, currency, source, url) '
            'VALUES (?, ?, ?, ?, ?, ?)',
            (product.product_id, product.name, product.price,
             product.currency, product.source, product.url)
        )
        self.conn.commit()

    def get_price_history(self, product_id):
        cursor = self.conn.execute(
            'SELECT price, scraped_at FROM prices WHERE product_id = ? ORDER BY scraped_at',
            (product_id,)
        )
        return cursor.fetchall()

    def get_best_prices(self, name_query):
        cursor = self.conn.execute('''
            SELECT name, price, source, url, scraped_at
            FROM prices
            WHERE name LIKE ?
            ORDER BY price ASC
            LIMIT 20
        ''', (f'%{name_query}%',))
        return cursor.fetchall()
```
The Comparison Engine

```python
import random
import time

import pandas as pd


class PriceComparer:
    def __init__(self):
        self.scrapers = [
            AmazonScraper(),
            EbayScraper(),
        ]
        self.db = PriceDatabase()

    def compare(self, query: str) -> pd.DataFrame:
        all_products = []
        for scraper in self.scrapers:
            try:
                products = scraper.search(query)
                for p in products:
                    self.db.save_product(p)
                all_products.extend(products)
                print(f'{scraper.__class__.__name__}: found {len(products)} results')
            except Exception as e:
                print(f'{scraper.__class__.__name__} failed: {e}')
            # Polite random delay between sites.
            time.sleep(random.uniform(1, 3))
        if not all_products:
            return pd.DataFrame()
        df = pd.DataFrame([
            {'name': p.name, 'price': p.price, 'source': p.source,
             'url': p.url, 'in_stock': p.in_stock}
            for p in all_products
        ])
        return df.sort_values('price').head(20)


# Run comparison
comparer = PriceComparer()
results = comparer.compare('wireless mouse')
print(results[['name', 'price', 'source']].to_string())
```
Scaling with Proxies
When scraping multiple sites, you'll quickly hit rate limits. Using a proxy service like ThorData with rotating residential IPs solves this:
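A generic sketch of per-request rotation with `requests`. The host, port, and credentials below are placeholders, and the convention of encoding a session id into the proxy username — common among residential providers — is an assumption; check your provider's dashboard for the exact format:

```python
import random

import requests

# Placeholder endpoint and credentials — substitute your provider's values.
PROXY_HOST = 'proxy.example.com'
PROXY_PORT = 9999
PROXY_USER = 'username'
PROXY_PASS = 'password'


def build_proxies(session_id: str) -> dict:
    """Build a requests-style proxies dict.

    Many residential providers pin or rotate the exit IP based on a
    session id embedded in the username (an assumption to verify).
    """
    auth = f'{PROXY_USER}-session-{session_id}:{PROXY_PASS}'
    proxy_url = f'http://{auth}@{PROXY_HOST}:{PROXY_PORT}'
    return {'http': proxy_url, 'https': proxy_url}


def fetch_with_rotation(url: str) -> requests.Response:
    # A fresh random session id per request rotates the exit IP.
    proxies = build_proxies(str(random.randint(0, 999999)))
    return requests.get(url, proxies=proxies, timeout=20)
```

To wire this into the scrapers, set `self.session.proxies = build_proxies(...)` in the base class constructor so every request goes through the rotating pool.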
Scheduling Price Checks
```python
import time

import schedule


def daily_check():
    comparer = PriceComparer()
    products = ['wireless mouse', 'mechanical keyboard', 'usb-c hub']
    for query in products:
        results = comparer.compare(query)
        if not results.empty:
            best = results.iloc[0]
            print(f'Best {query}: ${best["price"]:.2f} at {best["source"]}')
        time.sleep(5)


schedule.every().day.at('09:00').do(daily_check)

while True:
    schedule.run_pending()
    time.sleep(60)
```
Conclusion
A price comparison tool is a great way to learn multi-site scraping. The key challenges are normalizing data across sources and handling anti-bot measures. For serious scraping workloads, a reliable proxy service like ThorData keeps your scrapers running smoothly across all target sites.
The full code from this tutorial gives you a foundation to build on — add more retailers, implement price alerts, or build a web dashboard.