Vinicius Fagundes
🧠 Building a Serverless Price Indexer with AWS: From Web Crawling to Insights

Tracking product prices across e-commerce platforms is crucial for brands, marketplaces, and sellers. We needed a way to index smartphone prices at regular intervals and store that data in a form that's ready for analytics, all without managing infrastructure manually.

In this post, we'll walk through how we built a serverless price indexer using AWS Lambda and S3. The system is fast, scalable, and cost-effective, and you can easily adapt it to other product categories.


📚 Table of Contents

  1. 🧰 What We're Using and Why
  2. ✅ Crawling E-Commerce Sites with AWS Lambda
  3. 📈 Outcome

🧰 What We're Using and Why

Here's a quick breakdown of the tech stack and why it fits this use case:

๐Ÿ› ๏ธ AWS Lambda

A serverless compute service, perfect for short-running scraping tasks. It scales automatically and only incurs cost when used.

โฐ EventBridge

Used to schedule our Lambda crawler at regular intervals (e.g., every 6–12 hours).

๐Ÿ Python + Libraries

  • requests and beautifulsoup4: Simple, reliable HTML fetching and parsing.
  • pandas: Tabular data manipulation.
  • pyarrow: For writing optimized Parquet files.
  • boto3: To interact with S3.

Note that apart from boto3, none of these ship with the Lambda Python runtime, so you'll need to bundle them as a Lambda layer or a container image.

🪣 Amazon S3 + Parquet

Stores crawled data efficiently. Parquet gives us compression + columnar performance for analytics.
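As a quick illustration of the columnar payoff, here's a minimal sketch of reading the snapshots back with pandas, pulling only the columns a query needs. The bucket path is a placeholder, and it assumes pyarrow and s3fs are installed:

import pandas as pd

# Read every snapshot under the prefix into one DataFrame.
# Parquet's columnar layout means only "name" and "price" are
# actually fetched; the other columns are never read.
prices = pd.read_parquet(
    "s3://your-s3-bucket/price-index/smartphones/",  # placeholder bucket
    columns=["name", "price"],
)
print(prices.describe())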


✅ Crawling E-Commerce Sites with AWS Lambda

🎯 Objective

Build a lightweight, scalable web crawler that scrapes smartphone prices from target e-commerce pages.

🧱 Components

  • Python
  • AWS Lambda (triggered by EventBridge)
  • requests, beautifulsoup4
  • Output as Parquet to S3

📦 Lambda Code Example

import datetime

import boto3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import requests
from bs4 import BeautifulSoup

BUCKET = "your-s3-bucket"

def lambda_handler(event, context):
    url = "https://example.com/smartphones"
    headers = {"User-Agent": "Mozilla/5.0"}

    # Fail fast on network errors or non-2xx responses.
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    now = datetime.datetime.now(datetime.timezone.utc)
    items = []
    for product in soup.select(".product-card"):
        name_el = product.select_one(".product-title")
        price_el = product.select_one(".product-price")
        if name_el is None or price_el is None:
            continue  # skip cards missing a title or price
        # Strip the currency symbol and thousands separators before casting.
        price = float(price_el.text.strip().replace("$", "").replace(",", ""))
        items.append({
            "name": name_el.text.strip(),
            "price": price,
            "timestamp": now.isoformat(),
        })

    # One timestamped Parquet snapshot per run.
    df = pd.DataFrame(items)
    table = pa.Table.from_pandas(df)
    pq.write_table(table, "/tmp/data.parquet")

    # isoformat() puts ":" in the key, which some tools dislike;
    # a compact timestamp keeps keys portable.
    s3_key = f"price-index/smartphones/{now.strftime('%Y%m%dT%H%M%SZ')}.parquet"
    boto3.client("s3").upload_file("/tmp/data.parquet", BUCKET, s3_key)

    return {"status": "success", "records": len(items)}

โฐ Scheduling

Use Amazon EventBridge to run this Lambda every few hours. This gives you a time-series trail of smartphone prices.
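Here's a minimal sketch of wiring that schedule up with boto3; the rule name, cadence, and function ARN are all placeholders to swap for your own:

import boto3

events = boto3.client("events")

# Create (or update) a scheduled rule. "rate(6 hours)" matches the
# 6-12 hour cadence mentioned above.
events.put_rule(
    Name="price-indexer-schedule",  # placeholder rule name
    ScheduleExpression="rate(6 hours)",
    State="ENABLED",
)

# Point the rule at the crawler Lambda (placeholder ARN).
events.put_targets(
    Rule="price-indexer-schedule",
    Targets=[{
        "Id": "price-indexer-lambda",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:price-indexer",
    }],
)

EventBridge also needs permission to invoke the function; the Lambda add_permission API handles that (the console adds it automatically when you attach a trigger).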


📈 Outcome

✅ Fully serverless pipeline to extract pricing data

✅ Easily stores compressed, time-stamped snapshots in S3

✅ Clean data foundation for future analytics or indexing
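Because each run lands as a timestamped snapshot, turning the raw files into a price history takes only a few lines of pandas. A sketch, again with a placeholder bucket and assuming s3fs and pyarrow are available:

import pandas as pd

# Load all snapshots under the prefix (placeholder bucket).
df = pd.read_parquet("s3://your-s3-bucket/price-index/smartphones/")
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Daily average price per product: a simple time-series view.
trend = (
    df.set_index("timestamp")
      .groupby("name")["price"]
      .resample("1D")
      .mean()
)
print(trend.head())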
