vinicius fagundes

🧠 Building a Serverless Price Indexer with AWS: From Web Crawling to Insights

Tracking product prices across e-commerce platforms is crucial for brands, marketplaces, and sellers. We needed a way to index smartphone prices at regular intervals and store that data in a form that’s ready for analytics, all without managing infrastructure manually.

In this post, we’ll walk through how we built a serverless price indexer using AWS Lambda and S3. The system is fast, scalable, and cost-effective, and you can easily adapt it to other product categories.


📚 Table of Contents

  1. What We’re Using and Why
  2. ✅ Crawling E-Commerce Sites with AWS Lambda
  3. 📈 Outcome

🧰 What We’re Using and Why

Here’s a quick breakdown of the tech stack and why it fits this use case:

πŸ› οΈ AWS Lambda

A serverless compute service, a natural fit for short-running scraping tasks. It scales automatically and incurs cost only while it runs.

⏰ EventBridge

Used to schedule our Lambda crawler at regular intervals (e.g., every 6–12 hours).

🐍 Python + Libraries

  • requests: straightforward HTTP fetching.
  • beautifulsoup4: simple, reliable HTML parsing.
  • pandas: Tabular data manipulation.
  • pyarrow: For writing optimized Parquet files.
  • boto3: To interact with S3.

🪣 Amazon S3 + Parquet

Stores crawled data efficiently. Parquet gives us compression + columnar performance for analytics.


✅ Crawling E-Commerce Sites with AWS Lambda

🎯 Objective

Build a lightweight, scalable web crawler that scrapes smartphone prices from target e-commerce pages.

🧱 Components

  • Python
  • AWS Lambda (triggered by EventBridge)
  • requests, beautifulsoup4
  • Output as Parquet to S3

📦 Lambda Code Example

import datetime

import boto3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import requests
from bs4 import BeautifulSoup

def lambda_handler(event, context):
    url = "https://example.com/smartphones"
    headers = {"User-Agent": "Mozilla/5.0"}

    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # fail fast on 4xx/5xx instead of parsing an error page
    soup = BeautifulSoup(response.text, "html.parser")

    now = datetime.datetime.now(datetime.timezone.utc)
    items = []
    for product in soup.select(".product-card"):
        title = product.select_one(".product-title")
        price = product.select_one(".product-price")
        if title is None or price is None:
            continue  # skip malformed cards rather than crashing the whole run
        items.append({
            "name": title.text.strip(),
            # Strip the currency symbol and thousands separators before parsing.
            "price": float(price.text.replace("$", "").replace(",", "").strip()),
            "timestamp": now.isoformat(),
        })

    df = pd.DataFrame(items)
    table = pa.Table.from_pandas(df)

    # Lambda only allows writes under /tmp; write locally, then upload.
    local_path = "/tmp/data.parquet"
    pq.write_table(table, local_path)

    s3 = boto3.client("s3")
    s3_key = f"price-index/smartphones/{now.strftime('%Y-%m-%dT%H-%M-%S')}.parquet"
    s3.upload_file(local_path, "your-s3-bucket", s3_key)

    return {"status": "success", "records": len(items)}

⏰ Scheduling

Use Amazon EventBridge to run this Lambda every few hours. This gives you a time-series trail of smartphone prices.
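A minimal sketch of that wiring with boto3 (the rule name, target ID, and ARN below are placeholders to adapt to your account):

```python
SCHEDULE_EXPRESSION = "rate(6 hours)"  # a cron(...) expression also works

def schedule_crawler(lambda_arn, rule_name="price-indexer-schedule"):
    import boto3

    events = boto3.client("events")
    # Create (or update) the scheduled rule.
    events.put_rule(Name=rule_name, ScheduleExpression=SCHEDULE_EXPRESSION)
    # Point the rule at the crawler Lambda.
    events.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "price-indexer-lambda", "Arn": lambda_arn}],
    )
```

Note that EventBridge must also be allowed to invoke the function: a Lambda `AddPermission` call (principal `events.amazonaws.com`, source ARN of the rule), which the console does for you automatically.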


📈 Outcome

✅ Fully serverless pipeline to extract pricing data

✅ Compressed, time-stamped snapshots stored in S3

✅ Clean data foundation for future analytics or indexing
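For instance, once a few snapshots have accumulated, building a price history is a small pandas job. The in-memory frames below stand in for snapshots you would load back from S3:

```python
import pandas as pd

# Two snapshots as the crawler would have written them on consecutive runs.
snap1 = pd.DataFrame({"name": ["Phone A"], "price": [299.0],
                      "timestamp": ["2024-01-01T00:00:00"]})
snap2 = pd.DataFrame({"name": ["Phone A"], "price": [279.0],
                      "timestamp": ["2024-01-02T00:00:00"]})

# Stack the snapshots into one time series.
history = pd.concat([snap1, snap2], ignore_index=True)

# Lowest observed price per product across all snapshots.
best_price = history.groupby("name")["price"].min()
```

The same shape of query (group by product, aggregate over time) powers price-drop alerts and trend dashboards on top of this data.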
