Tracking product prices across e-commerce platforms is crucial for brands, marketplaces, and sellers. We needed a way to index smartphone prices at regular intervals and store that data in a form that's ready for analytics, all without managing infrastructure manually.
In this post, we'll walk through how we built a serverless price indexer using AWS Lambda and S3. The system is fast, scalable, and cost-effective, and you can easily adapt it to other product categories.
Table of Contents
- What We're Using and Why
- Crawling E-Commerce Sites with AWS Lambda
- Outcome
What We're Using and Why
Here's a quick breakdown of the tech stack and why it fits this use case:
AWS Lambda
A serverless compute service, perfect for short-running scraping tasks. It scales automatically and incurs cost only while it runs.
EventBridge
Used to schedule our Lambda crawler at regular intervals (e.g., every 6-12 hours).
Python + Libraries
- requests: simple, reliable HTTP fetching.
- beautifulsoup4: HTML parsing.
- pandas: tabular data manipulation.
- pyarrow: writing optimized Parquet files.
- boto3: interacting with S3.
Amazon S3 + Parquet
Stores crawled data efficiently. Parquet gives us compression plus columnar performance for analytics.
Crawling E-Commerce Sites with AWS Lambda
Objective
Build a lightweight, scalable web crawler that scrapes smartphone prices from target e-commerce pages.
Components
- Python
- AWS Lambda (triggered by EventBridge)
- requests, beautifulsoup4
- Output as Parquet to S3
Lambda Code Example
```python
import datetime

import boto3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import requests
from bs4 import BeautifulSoup


def lambda_handler(event, context):
    url = "https://example.com/smartphones"
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    items = []
    for product in soup.select(".product-card"):
        name_el = product.select_one(".product-title")
        price_el = product.select_one(".product-price")
        if name_el is None or price_el is None:
            continue  # skip malformed cards instead of crashing the run
        items.append({
            "name": name_el.text.strip(),
            "price": float(price_el.text.replace("$", "").replace(",", "")),
            "timestamp": datetime.datetime.utcnow().isoformat(),
        })

    df = pd.DataFrame(items)
    table = pa.Table.from_pandas(df)

    # Lambda only allows writes under /tmp, so write locally, then upload.
    s3 = boto3.client("s3")
    s3_key = f"price-index/smartphones/{datetime.datetime.utcnow().isoformat()}.parquet"
    pq.write_table(table, "/tmp/data.parquet")
    s3.upload_file("/tmp/data.parquet", "your-s3-bucket", s3_key)

    return {"status": "success", "records": len(items)}
```
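Real listings are messier than the handler's float(...replace("$", "")) assumes: currency symbols, thousands separators, and stray text all appear in the wild. A hedged regex-based helper you could swap in (not part of the original handler):

```python
import re


def parse_price(text: str):
    """Extract the first number like 1,299.00 from a price string.

    Returns None when no number is found, so callers can skip bad cards.
    """
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return float(match.group(0).replace(",", ""))


print(parse_price("$1,299.00"))  # → 1299.0
print(parse_price("USD 549"))    # → 549.0
```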
Scheduling
Use Amazon EventBridge to run this Lambda every few hours. This gives you a time-series trail of smartphone prices.
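If you deploy with AWS SAM, the EventBridge schedule can live right next to the function definition. A sketch, assuming the handler above lives in app.py (resource names, memory, and timeout values are illustrative):

```yaml
Resources:
  PriceIndexerFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: app.lambda_handler
      Runtime: python3.12
      Timeout: 120
      MemorySize: 512
      Events:
        CrawlSchedule:
          Type: Schedule
          Properties:
            Schedule: rate(6 hours)
```

The schedule expression also accepts cron syntax (e.g., cron(0 */6 * * ? *)) if you need runs at specific times rather than fixed intervals.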
Outcome
- Fully serverless pipeline to extract pricing data
- Compressed, time-stamped snapshots stored in S3
- Clean data foundation for future analytics or indexing