Tracking product prices across e-commerce platforms is crucial for brands, marketplaces, and sellers. We needed a way to index smartphone prices at regular intervals and store that data in a form that's ready for analytics, all without managing infrastructure manually.
In this post, we'll walk through how we built a serverless price indexer using AWS Lambda and S3. The system is fast, scalable, and cost-effective, and you can easily adapt it to other product categories.
Table of Contents
- What We're Using and Why
- Crawling E-Commerce Sites with AWS Lambda
- Outcome
What We're Using and Why
Here's a quick breakdown of the tech stack and why it fits this use case:
AWS Lambda
A serverless compute service, perfect for short-running scraping tasks. It scales automatically and only incurs cost when used.
EventBridge
Used to schedule our Lambda crawler at regular intervals (e.g., every 6–12 hours).
Python + Libraries
- `requests` and `beautifulsoup4`: Simple, reliable HTML parsing.
- `pandas`: Tabular data manipulation.
- `pyarrow`: For writing optimized Parquet files.
- `boto3`: To interact with S3.
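To make the selector logic concrete before we get to the full crawler, here is a minimal, self-contained sketch of extracting names and prices with `beautifulsoup4`. The HTML snippet and CSS class names are illustrative assumptions, not taken from any real site:

```python
from bs4 import BeautifulSoup

# Hypothetical markup mirroring the selectors the crawler will use.
html = """
<div class="product-card">
  <span class="product-title"> Phone A </span>
  <span class="product-price">$499.00</span>
</div>
<div class="product-card">
  <span class="product-title">Phone B</span>
  <span class="product-price">$1,299.00</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
items = [
    {
        "name": card.select_one(".product-title").text.strip(),
        # Drop the currency symbol and thousands separator before converting.
        "price": float(
            card.select_one(".product-price").text.replace("$", "").replace(",", "")
        ),
    }
    for card in soup.select(".product-card")
]
print(items)
```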
Amazon S3 + Parquet
Stores crawled data efficiently. Parquet gives us compression plus columnar performance for analytics.
Crawling E-Commerce Sites with AWS Lambda
Objective
Build a lightweight, scalable web crawler that scrapes smartphone prices from target e-commerce pages.
Components
- Python
- AWS Lambda (triggered by EventBridge)
- `requests`, `beautifulsoup4`
- Output as Parquet to S3
Lambda Code Example
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
import pyarrow.parquet as pq
import pyarrow as pa
import boto3
import datetime


def lambda_handler(event, context):
    url = "https://example.com/smartphones"
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    items = []
    for product in soup.select(".product-card"):
        name = product.select_one(".product-title").text.strip()
        # Strip the currency symbol and thousands separators before converting.
        price_text = product.select_one(".product-price").text
        price = float(price_text.replace("$", "").replace(",", ""))
        items.append({
            "name": name,
            "price": price,
            "timestamp": datetime.datetime.utcnow().isoformat(),
        })

    df = pd.DataFrame(items)
    table = pa.Table.from_pandas(df)

    # Lambda only allows writes under /tmp, so write locally, then upload.
    s3 = boto3.client("s3")
    s3_key = f"price-index/smartphones/{datetime.datetime.utcnow().isoformat()}.parquet"
    pq.write_table(table, "/tmp/data.parquet")
    s3.upload_file("/tmp/data.parquet", "your-s3-bucket", s3_key)

    return {"status": "success", "records": len(items)}
```
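The `float(...)` conversion above assumes a clean "$123.45" string. Real listings often carry thousands separators or show "Out of stock" instead of a price, so a slightly hardier parser can help. This helper is a hypothetical addition, not part of the original handler:

```python
import re
from typing import Optional


def parse_price(text: str) -> Optional[float]:
    """Pull the first numeric amount out of strings like '$1,299.00'.

    Returns None when no digits are present (e.g. 'Out of stock').
    """
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))
```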
Scheduling
Use Amazon EventBridge to run this Lambda every few hours. This gives you a time-series trail of smartphone prices.
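For reference, wiring up the schedule with the AWS CLI might look like the following. This is a sketch: the rule name, function name, region, and account ID are placeholders you would replace with your own values.

```shell
# Create a rule that fires every 6 hours (names/ARNs are placeholders).
aws events put-rule \
  --name smartphone-price-crawl \
  --schedule-expression "rate(6 hours)"

# Point the rule at the crawler Lambda.
aws events put-targets \
  --rule smartphone-price-crawl \
  --targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:123456789012:function:price-indexer"

# Allow EventBridge to invoke the function.
aws lambda add-permission \
  --function-name price-indexer \
  --statement-id eventbridge-invoke \
  --action lambda:InvokeFunction \
  --principal events.amazonaws.com \
  --source-arn arn:aws:events:us-east-1:123456789012:rule/smartphone-price-crawl
```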
Outcome
- Fully serverless pipeline to extract pricing data
- Easily stores compressed, time-stamped snapshots in S3
- Clean data foundation for future analytics or indexing