I bought my first bicycle in 2019. The pandemic had just started, and all team sports were on hold during that period. The bicycle was already a few years old, and I bought it for a few hundred euros. I really started to enjoy cycling and joined a small cycling team.
During the summer of 2021, I started looking around for a new bike. It was time to trade my old one for a new one. I became interested in a Canyon Endurance CF SL 8, but unfortunately it was always sold out. In September 2021, I subscribed to the mailing list so I would get an email when the model was back in stock.
After a few months I started to wonder why I never got an email. I saw that similar bikes were being sold, and after a little investigation on Reddit, I read that the notification system wasn't working properly. The emails were probably sent in small batches, and it didn't take more than a few minutes before all Canyon Endurance CF SL 8s were sold out again.
It was time to look for another method. I read about a tool called Distill, which could be used to check the site every few seconds. The free version worked as a browser plugin. To use it without your browser, you needed a paid plan.
So I decided to build something similar that was cheaper.
I decided to develop a Lambda function to scrape* the web page and check if the bicycle in my size was still unavailable. I checked the content of the div with the class productConfiguration__availabilityMessage.
I made use of the Python module BeautifulSoup to scrape the webpage and find a match for the Coming soon text.
If the text changed, I would get notified by email using SNS.
""" Scrape Canyon site."""
import os

import boto3
import requests
from bs4 import BeautifulSoup

client = boto3.client("sns")
url = "https://www.canyon.com/xxx"


def lambda_handler(event, context):
    """Main."""
    page = requests.get(url)
    results = BeautifulSoup(page.content, "html.parser")
    items = []
    # Collect the availability message of every size
    for div in results.find_all(
        "div", attrs={"class": "productConfiguration__availabilityMessage"}
    ):
        text = div.text
        items.append(text.strip())
    # size small is 4th of the list
    small_item = items[3]
    print("item: " + small_item)
    # Compare case-insensitively so "Coming soon" and "Coming Soon" both match
    if "soon" not in small_item.lower():
        print("alert!")
        client.publish(
            TopicArn=os.environ["TOPIC"],
            Message="Time to buy a Canyon!",
            Subject="Time to buy a Canyon!",
        )
The code (and actually the whole solution) is really basic. I didn't need advanced integrations or checks.
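To show the check in isolation, here is a minimal sketch that can run offline. It is not the code I deployed: it swaps BeautifulSoup for the standard library's HTMLParser so it works without the layer, and the names AvailabilityParser, should_alert and SAMPLE, as well as the sample markup, are made up for illustration. It also assumes that staying quiet is the right reaction when fewer sizes than expected are found.

```python
"""Offline sketch of the availability check (standard library only)."""
from html.parser import HTMLParser

# Class name taken from the scraper above; everything else is illustrative.
AVAILABILITY_CLASS = "productConfiguration__availabilityMessage"


class AvailabilityParser(HTMLParser):
    """Collect the stripped text of every div with the availability class."""

    def __init__(self):
        super().__init__()
        self.messages = []
        self._depth = 0  # > 0 while we are inside a matching div
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        # Count nested divs too, so we know when the matching div closes.
        if self._depth or AVAILABILITY_CLASS in classes:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.messages.append("".join(self._chunks).strip())
                self._chunks = []

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def should_alert(html, size_index=3):
    """Return True when the message at size_index no longer says 'soon'."""
    parser = AvailabilityParser()
    parser.feed(html)
    if len(parser.messages) <= size_index:
        # Assumption: a changed layout or blocked request should not alert.
        return False
    return "soon" not in parser.messages[size_index].lower()


# Hypothetical markup for five sizes; the fourth one is in stock.
SAMPLE = "".join(
    f'<div class="{AVAILABILITY_CLASS}"> {text} </div>'
    for text in ["Coming soon", "Coming soon", "Coming soon", "In stock", "Coming soon"]
)

if __name__ == "__main__":
    print(should_alert(SAMPLE))
```

Keeping the decision in a plain function like this makes it possible to test the logic against saved HTML before wiring it to SNS.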
I used a simple EventBridge rule to trigger the Lambda every minute (except when I need to sleep!).
Event:
  Type: AWS::Events::Rule
  Properties:
    Description: Trigger every minute
    Name: ScraperEvent
    # Run every minute when I don't sleep
    ScheduleExpression: cron(0/1 6-23 * * ? *)
    Targets:
      - Arn: !GetAtt Scraper.Arn #Lambda Arn
        Id: canyon-scraper
LambdaPermission:
  Type: AWS::Lambda::Permission
  Properties:
    FunctionName: !GetAtt Scraper.Arn
    Action: lambda:InvokeFunction
    Principal: events.amazonaws.com
    SourceArn: !GetAtt Event.Arn
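EventBridge evaluates cron expressions in UTC, so the quiet window of this rule is 00:00 to 05:59 UTC, which may not line up exactly with your local night. The sketch below only models the minutes and hours fields of the expression to make that window explicit; rule_fires, ACTIVE_HOURS and invocations_per_day are hypothetical names, not part of any AWS API.

```python
from datetime import datetime, timezone

# Hours field "6-23" of cron(0/1 6-23 * * ? *), evaluated in UTC.
ACTIVE_HOURS = range(6, 24)


def rule_fires(moment):
    """Return True if the rule would fire during this minute.

    `moment` should be timezone-aware; it is converted to UTC first.
    """
    utc = moment.astimezone(timezone.utc)
    # The minutes field "0/1" matches every minute, so only the hour matters.
    return utc.hour in ACTIVE_HOURS


# 18 active hours * 60 minutes = 1080 invocations per day.
invocations_per_day = len(ACTIVE_HOURS) * 60

if __name__ == "__main__":
    print(rule_fires(datetime.now(timezone.utc)), invocations_per_day)
```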
SNS was used to send notifications.
Topic:
  Type: AWS::SNS::Topic
  Properties:
    DisplayName: canyon-topic
    Subscription:
      - Endpoint: me@mail.com
        Protocol: email
    TopicName: canyon-topic
I also made use of Lambda layers to make the function as fast and lightweight as possible. I used Docker to build my layer and I uploaded it to S3.
$ docker run --rm \
    --volume=$(pwd):/lambda-build \
    -w=/lambda-build \
    lambci/lambda:build-python3.8 \
    pip install -r requirements.txt --target python
$ zip -vr python.zip python/
$ aws s3 cp python.zip s3://xxx-layers/python.zip
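The `--target python` part matters: for Python runtimes, Lambda picks up layer dependencies from a top-level python/ directory inside the archive. As a quick sanity check before uploading, something like the sketch below could verify that layout. The function names and the in-memory sample archive are hypothetical; for the real file you would pass the path to python.zip instead.

```python
import io
import zipfile


def layer_layout_ok(zip_source):
    """Return True if the archive is non-empty and every entry is under python/."""
    with zipfile.ZipFile(zip_source) as archive:
        names = archive.namelist()
    return bool(names) and all(name.startswith("python/") for name in names)


def build_sample_layer():
    """Build a tiny in-memory archive shaped like python.zip (illustrative)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("python/bs4/__init__.py", "")
        archive.writestr("python/requests/__init__.py", "")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(layer_layout_ok(build_sample_layer()))
```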
I configured my Lambda to make use of this layer.
Scraper:
  Type: AWS::Serverless::Function
  Properties:
    FunctionName: canyon-scraper
    CodeUri: src/
    Handler: lambda.lambda_handler
    Runtime: python3.8
    Role: !GetAtt ScraperRole.Arn
    Environment:
      Variables:
        TOPIC: !Ref Topic
    Layers:
      - !Ref libs
libs:
  Type: AWS::Serverless::LayerVersion
  Properties:
    LayerName: python-lib
    Description: Dependencies for the canyon scraper
    ContentUri: s3://xxx-layers/python.zip
    CompatibleRuntimes:
      - python3.8
The function was fast enough to run with the minimum amount of memory and with a timeout of a few seconds.
This basic solution monitored the website every minute. It's important to note that cron expressions that lead to rates faster than 1 minute are not supported. If you need a faster solution, then check out this blog of a fellow Community Builder! Be sure that you're allowed to scrape and that you're not flooding the server!
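One way to check whether a path may be scraped is to read the site's robots.txt with Python's built-in urllib.robotparser. The sketch below parses a made-up robots.txt so it can run offline; SAMPLE_ROBOTS, the user agent name and the example.com URLs are assumptions for illustration, not Canyon's actual rules.

```python
from urllib.robotparser import RobotFileParser


def allowed_to_fetch(robots_lines, user_agent, url):
    """Return True if the given robots.txt lines allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, not taken from a real site.
SAMPLE_ROBOTS = [
    "User-agent: *",
    "Disallow: /checkout/",
]

if __name__ == "__main__":
    print(allowed_to_fetch(SAMPLE_ROBOTS, "canyon-scraper", "https://www.example.com/bikes/road"))
```

Against a live site, you would call `set_url()` with the address of its robots.txt and then `read()` instead of `parse()`. Keep in mind that robots.txt only covers crawling rules; the site's terms of use still apply.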
Now I just had to wait, and after a month I got an email...
And I was able to order my favorite bicycle! It gave me great satisfaction. Like many, I'm also interested in new AWS features, fancy integrations and big setups, but sometimes you don't need the new fancy stuff to get the job done.
All code is available here. Feel free to fork it and adapt it to your needs.
*Web scraping is legal if you follow the rules (avoid scraping personal data or intellectual property, check the copyrights and robots.txt of the website, don't flood the servers, ...)!