Vinicius Fagundes
🧠 Building a Serverless Price Indexer with AWS: From Web Crawling to Insights

Tracking product prices across e-commerce platforms is crucial for brands, marketplaces, and sellers. We needed a way to index smartphone prices at regular intervals and store that data in a form that's ready for analytics, all without managing infrastructure manually.

In this post, we'll walk through how we built a serverless price indexer using AWS Lambda and S3. The system is fast, scalable, and cost-effective, and you can easily adapt it to other product categories.


📚 Table of Contents

  1. 🧰 What We're Using and Why
  2. ✅ Crawling E-Commerce Sites with AWS Lambda
  3. 📈 Outcome

🧰 What We're Using and Why

Here's a quick breakdown of the tech stack and why it fits this use case:

๐Ÿ› ๏ธ AWS Lambda

A serverless compute service, perfect for short-running scraping tasks. It scales automatically and only incurs cost when used.

โฐ EventBridge

Used to schedule our Lambda crawler at regular intervals (e.g., every 6–12 hours).

๐Ÿ Python + Libraries

  • requests and beautifulsoup4: Simple, reliable HTML fetching and parsing.
  • pandas: Tabular data manipulation.
  • pyarrow: For writing optimized Parquet files.
  • boto3: To interact with S3.

Note that apart from boto3, none of these ship with the Lambda Python runtime, so you'll need to bundle them as a Lambda layer or a container image.

🪣 Amazon S3 + Parquet

Stores crawled data efficiently. Parquet gives us compression + columnar performance for analytics.
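As a quick illustration of the columnar payoff, here's a minimal sketch of reading the snapshots back with pandas, pulling only the columns a query needs. The bucket path is a placeholder, and it assumes pyarrow and s3fs are installed:

import pandas as pd

# Read every snapshot under the prefix into one DataFrame.
# Parquet's columnar layout means only "name" and "price" are
# actually fetched; the other columns are never read.
prices = pd.read_parquet(
    "s3://your-s3-bucket/price-index/smartphones/",  # placeholder bucket
    columns=["name", "price"],
)
print(prices.describe())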


✅ Crawling E-Commerce Sites with AWS Lambda

🎯 Objective

Build a lightweight, scalable web crawler that scrapes smartphone prices from target e-commerce pages.

🧱 Components

  • Python
  • AWS Lambda (triggered by EventBridge)
  • requests, beautifulsoup4
  • Output as Parquet to S3

📦 Lambda Code Example

import datetime

import boto3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import requests
from bs4 import BeautifulSoup

BUCKET = "your-s3-bucket"

def lambda_handler(event, context):
    url = "https://example.com/smartphones"
    headers = {"User-Agent": "Mozilla/5.0"}

    # Fail fast on network errors or non-2xx responses.
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    now = datetime.datetime.now(datetime.timezone.utc)
    items = []
    for product in soup.select(".product-card"):
        name_el = product.select_one(".product-title")
        price_el = product.select_one(".product-price")
        if name_el is None or price_el is None:
            continue  # skip cards missing a title or price
        # Strip the currency symbol and thousands separators before casting.
        price = float(price_el.text.strip().replace("$", "").replace(",", ""))
        items.append({
            "name": name_el.text.strip(),
            "price": price,
            "timestamp": now.isoformat(),
        })

    # One timestamped Parquet snapshot per run.
    df = pd.DataFrame(items)
    table = pa.Table.from_pandas(df)
    pq.write_table(table, "/tmp/data.parquet")

    # isoformat() puts ":" in the key, which some tools dislike;
    # a compact timestamp keeps keys portable.
    s3_key = f"price-index/smartphones/{now.strftime('%Y%m%dT%H%M%SZ')}.parquet"
    boto3.client("s3").upload_file("/tmp/data.parquet", BUCKET, s3_key)

    return {"status": "success", "records": len(items)}

โฐ Scheduling

Use Amazon EventBridge to run this Lambda every few hours. This gives you a time-series trail of smartphone prices.
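Here's a minimal sketch of wiring that schedule up with boto3; the rule name, cadence, and function ARN are all placeholders to swap for your own:

import boto3

events = boto3.client("events")

# Create (or update) a scheduled rule. "rate(6 hours)" matches the
# 6-12 hour cadence mentioned above.
events.put_rule(
    Name="price-indexer-schedule",  # placeholder rule name
    ScheduleExpression="rate(6 hours)",
    State="ENABLED",
)

# Point the rule at the crawler Lambda (placeholder ARN).
events.put_targets(
    Rule="price-indexer-schedule",
    Targets=[{
        "Id": "price-indexer-lambda",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:price-indexer",
    }],
)

EventBridge also needs permission to invoke the function; the Lambda add_permission API handles that (the console adds it automatically when you attach a trigger).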


📈 Outcome

✅ Fully serverless pipeline to extract pricing data

✅ Easily stores compressed, time-stamped snapshots in S3

✅ Clean data foundation for future analytics or indexing
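Because each run lands as a timestamped snapshot, turning the raw files into a price history takes only a few lines of pandas. A sketch, again with a placeholder bucket and assuming s3fs and pyarrow are available:

import pandas as pd

# Load all snapshots under the prefix (placeholder bucket).
df = pd.read_parquet("s3://your-s3-bucket/price-index/smartphones/")
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Daily average price per product: a simple time-series view.
trend = (
    df.set_index("timestamp")
      .groupby("name")["price"]
      .resample("1D")
      .mean()
)
print(trend.head())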
