
Karthik Subramanian for AWS Community Builders

Originally published at Medium


Web Scraping with Selenium & AWS Lambda

In my last post I created a Lambda that accepts a request, stores it in a DynamoDB table, and sends a message to an SQS queue.

Let’s now create another Lambda that reads from that queue and processes the request by scraping the URL with Selenium.
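
For orientation: Lambda hands an SQS-triggered function an event containing a Records list, and each record carries the original message as a JSON string in its body field. Below is a minimal sketch of unpacking that shape. The function name extract_requests is hypothetical and not part of the project; it only illustrates the event layout.

```python
import json


def extract_requests(event):
    """Return the decoded body of every SQS record in a Lambda event."""
    if not event or "Records" not in event:
        return []
    # Each record's body is the JSON string that was sent to the queue.
    return [json.loads(record["body"]) for record in event["Records"]]


sample_event = {
    "Records": [
        {
            "body": json.dumps(
                {"request_id": "5232634", "url": "https://www.google.com/search?q=aws+sqs"}
            )
        }
    ]
}
print(extract_requests(sample_event))
```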

Installing Selenium

Create a new file under src called “chrome-deps.txt” and copy the following into it -

acl adwaita-cursor-theme adwaita-icon-theme alsa-lib at-spi2-atk at-spi2-core
atk avahi-libs cairo cairo-gobject colord-libs cryptsetup-libs cups-libs dbus
dbus-libs dconf desktop-file-utils device-mapper device-mapper-libs elfutils-default-yama-scope
elfutils-libs emacs-filesystem fribidi gdk-pixbuf2 glib-networking gnutls graphite2
gsettings-desktop-schemas gtk-update-icon-cache gtk3 harfbuzz hicolor-icon-theme hwdata jasper-libs
jbigkit-libs json-glib kmod kmod-libs lcms2 libX11 libX11-common libXau libXcomposite libXcursor libXdamage
libXext libXfixes libXft libXi libXinerama libXrandr libXrender libXtst libXxf86vm libdrm libepoxy
liberation-fonts liberation-fonts-common liberation-mono-fonts liberation-narrow-fonts liberation-sans-fonts
liberation-serif-fonts libfdisk libglvnd libglvnd-egl libglvnd-glx libgusb libidn libjpeg-turbo libmodman
libpciaccess libproxy libsemanage libsmartcols libsoup libthai libtiff libusbx libutempter libwayland-client
libwayland-cursor libwayland-egl libwayland-server libxcb libxkbcommon libxshmfence lz4 mesa-libEGL mesa-libGL
mesa-libgbm mesa-libglapi nettle pango pixman qrencode-libs rest shadow-utils systemd systemd-libs trousers ustr
util-linux vulkan vulkan-filesystem wget which xdg-utils xkeyboard-config

Create another file called “install-browser.sh” and copy the following -

#!/bin/bash

echo "Downloading Chromium..."

curl "https://www.googleapis.com/download/storage/v1/b/chromium-browser-snapshots/o/Linux_x64%2F$CHROMIUM_VERSION%2Fchrome-linux.zip?generation=1652397748160413&alt=media" > /tmp/chromium.zip

unzip /tmp/chromium.zip -d /tmp/

mv /tmp/chrome-linux/ /opt/chrome

curl "https://www.googleapis.com/download/storage/v1/b/chromium-browser-snapshots/o/Linux_x64%2F$CHROMIUM_VERSION%2Fchromedriver_linux64.zip?generation=1652397753719852&alt=media" > /tmp/chromedriver_linux64.zip

unzip /tmp/chromedriver_linux64.zip -d /tmp/

mv /tmp/chromedriver_linux64/chromedriver /opt/chromedriver
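
Both curl URLs follow the same pattern: the object name (platform, snapshot revision, file name) is percent-encoded so that each / becomes %2F. Here is a small sketch of building such a URL, in case you want to target a different revision. The helper name snapshot_url is hypothetical. It deliberately leaves out the generation query parameter, which pins one stored revision of an object and so would not carry over to another snapshot; whether the download behaves the same without it is an assumption to verify.

```python
from urllib.parse import quote

BASE = "https://www.googleapis.com/download/storage/v1/b/chromium-browser-snapshots/o/"


def snapshot_url(revision, filename, platform="Linux_x64"):
    """Build a download URL for a file in a Chromium snapshot (hypothetical helper)."""
    object_name = f"{platform}/{revision}/{filename}"
    # safe="" forces "/" to be encoded as %2F, matching the URLs in the script.
    return f"{BASE}{quote(object_name, safe='')}?alt=media"


print(snapshot_url("1002910", "chrome-linux.zip"))
print(snapshot_url("1002910", "chromedriver_linux64.zip"))
```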

Update the Dockerfile to look like this -

FROM public.ecr.aws/lambda/python:3.9 as stage

# Hack to install chromium dependencies

RUN yum install -y -q sudo unzip

# Current stable version of Chromium

ENV CHROMIUM_VERSION=1002910

# Install Chromium

COPY install-browser.sh /tmp/

RUN /usr/bin/bash /tmp/install-browser.sh

FROM public.ecr.aws/lambda/python:3.9 as base

COPY chrome-deps.txt /tmp/

RUN yum install -y $(cat /tmp/chrome-deps.txt)

COPY --from=stage /opt/chrome /opt/chrome

COPY --from=stage /opt/chromedriver /opt/chromedriver

COPY create.py ${LAMBDA_TASK_ROOT}
COPY process.py ${LAMBDA_TASK_ROOT}

COPY requirements.txt ${LAMBDA_TASK_ROOT}

COPY db/ ${LAMBDA_TASK_ROOT}/db/

RUN python3.9 -m pip install -r requirements.txt -t .

Update the requirements.txt file and add

selenium==4.4.2

And install the dependency

pip install -r src/requirements.txt

Process the request

Create a new file under src called “process.py” for the new Lambda function -

import json

from selenium import webdriver
from selenium.webdriver.common.by import By

from db import db_helper


def lambda_handler(event=None, context=None):
    request = get_request(event=event)
    if request is None:
        return {
            "statusCode": 400,
            "body": {
                "message": "Cannot parse url"
            }
        }
    dbHelper = db_helper.DBHelper()
    try:
        dbHelper.update_order_status(request=request, status='In Progress')
        url = request['url']
        driver = get_driver()
        driver.get(url)
        # Collect the matching elements on the rendered page
        search_results = driver.find_elements(By.XPATH, "//div[@data-header-feature]")
        dbHelper.update_order_status(request=request, status='Complete')
    except Exception as e:
        print(e)
        dbHelper.update_order_status(request=request, status='Failed')
        return {
            "statusCode": 500,
            "body": {
                "message": f"Error processing request: {e}"
            }
        }
    return {
        "statusCode": 200,
        "body": json.dumps(
            {
                "records found": len(search_results),
            }
        ),
    }


def get_request(event):
    # Without an event there is nothing to parse
    if event is None:
        return None
    # SQS wraps the message in Records; the body is a JSON string
    if "Records" in event:
        body = event['Records'][0]['body']
        event = json.loads(body)
    return event


def get_driver():
    chrome_options = webdriver.ChromeOptions()
    # Paths match where the Dockerfile copies Chromium and chromedriver
    chrome_options.binary_location = "/opt/chrome/chrome"
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")
    chrome_options.add_argument("--disable-gpu")
    chrome_options.add_argument("--disable-dev-tools")
    chrome_options.add_argument("--no-zygote")
    chrome_options.add_argument("--single-process")
    chrome_options.add_argument("window-size=2560x1440")
    chrome_options.add_argument("--remote-debugging-port=9222")
    # selenium 4.4.2 accepts the driver path as the first argument;
    # later releases expect a Service object instead
    input_driver = webdriver.Chrome("/opt/chromedriver", options=chrome_options)
    return input_driver
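
Two details of this handler are worth noting. The Chrome process is never shut down, and because the exception is caught and a 500 is returned, the invocation still counts as successful for the SQS trigger, so the message is deleted from the queue rather than retried. Below is a minimal sketch of one alternative, assuming you do want failed messages retried: quit the driver in a finally block and re-raise after recording the failure. The name process_request and the make_driver and db parameters are hypothetical, and the stand-in classes exist only so the sketch runs without Chrome or DynamoDB.

```python
def process_request(request, make_driver, db):
    """Scrape request['url'], always close the browser, and re-raise on failure."""
    db.update_order_status(request=request, status="In Progress")
    driver = None
    try:
        driver = make_driver()
        driver.get(request["url"])
        # "xpath" is the string value of selenium's By.XPATH
        count = len(driver.find_elements("xpath", "//div[@data-header-feature]"))
        db.update_order_status(request=request, status="Complete")
        return count
    except Exception:
        db.update_order_status(request=request, status="Failed")
        raise  # an unhandled error leaves the message on the queue for a retry
    finally:
        if driver is not None:
            driver.quit()  # release the Chrome process even on failure


# Stand-ins (assumptions for illustration only)
class FakeDriver:
    def __init__(self):
        self.closed = False

    def get(self, url):
        if not url.startswith("http"):
            raise ValueError(f"bad url: {url}")

    def find_elements(self, by, value):
        return ["result-1", "result-2"]

    def quit(self):
        self.closed = True


class FakeDB:
    def __init__(self):
        self.statuses = []

    def update_order_status(self, request, status):
        self.statuses.append(status)


db = FakeDB()
print(process_request({"url": "https://example.com"}, FakeDriver, db), db.statuses)
```

In the real function, make_driver would be get_driver and db would be the DBHelper instance.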



Finally, modify the template.yaml file to tell SAM about the new Lambda -

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: >
  python3.9
  Sample SAM Template for serverless-arch-example
Parameters:
  Environment:
    Type: String
    Description: AWS Environment where code is being executed (AWS_SAM_LOCAL or AWS)
    Default: 'AWS'
  DynamoDBUri:
    Type: String
    Description: AWS local DynamoDB instance URI (will only be used if AWSENVNAME is AWS_SAM_LOCAL)
    Default: 'http://docker.for.mac.host.internal:8000'
  ProjectName:
    Type: String
    Description: 'Name of the project'
    Default: 'serverless-arch-example'
# More info about Globals: https://github.com/awslabs/serverless-application-model/blob/master/docs/globals.rst
Globals:
  Function:
    Timeout: 120
    MemorySize: 2048
    Environment:
      Variables:
        ENVIRONMENT: !Ref Environment
        DYNAMODB_DEV_URI: !Ref DynamoDBUri
        ORDERS_TABLE_NAME: !Ref OrdersTable
        SQS_QUEUE: !Ref OrdersQueue
Resources:
  OrdersTable:
    Type: AWS::DynamoDB::Table
    Properties:
      TableName: !Join ['-', [!Sub '${ProjectName}', 'orders']]
      AttributeDefinitions:
        - AttributeName: request_id
          AttributeType: S
      KeySchema:
        - AttributeName: request_id
          KeyType: HASH
      ProvisionedThroughput:
        ReadCapacityUnits: 3
        WriteCapacityUnits: 3
  OrdersQueue:
    Type: AWS::SQS::Queue
    Properties:
      QueueName: !Join ['-', [!Sub '${ProjectName}', 'orders']]
      VisibilityTimeout: 120 # must be same as lambda timeout
  CreateFunction:
    Type: AWS::Serverless::Function # More info about Function Resource: https://github.com/awslabs/serverless-application-model/blob/master/versions/2016-10-31.md#awsserverlessfunction
    Properties:
      PackageType: Image
      ImageConfig:
        Command:
          - create.lambda_handler
      Architectures:
        - x86_64
      Events:
        CreateAPI:
          Type: Api # More info about API Event Source: https://github.com/awslabs/serverless-application-model/blob/master/versions/2016-10-31.md#api
          Properties:
            Path: /example/create
            Method: post
      Policies:
        - AmazonDynamoDBFullAccess
        - SQSSendMessagePolicy:
            QueueName: !GetAtt OrdersQueue.QueueName
    Metadata:
      Dockerfile: Dockerfile
      DockerContext: ./src
      DockerTag: python3.9-v1
  ProcessFunction:
    Type: AWS::Serverless::Function # More info about Function Resource: https://github.com/awslabs/serverless-application-model/blob/master/versions/2016-10-31.md#awsserverlessfunction
    Properties:
      FunctionName: !Join ['-', [!Sub '${ProjectName}', 'process']]
      PackageType: Image
      ImageConfig:
        Command:
          - process.lambda_handler
      Architectures:
        - x86_64
      Policies:
        - AmazonDynamoDBFullAccess
      Events:
        SqsEvent:
          Type: SQS
          Properties:
            Queue: !GetAtt OrdersQueue.Arn
            BatchSize: 1
    Metadata:
      Dockerfile: Dockerfile
      DockerContext: ./src
      DockerTag: python3.9-v1
Outputs:
  # ServerlessRestApi is an implicit API created out of Events key under Serverless::Function
  # Find out more about other implicit resources you can reference within SAM
  # https://github.com/awslabs/serverless-application-model/blob/master/docs/internals/generated_resources.rst#api
  CreateAPI:
    Description: "API Gateway endpoint URL for Prod stage for Create function"
    Value: !Sub "https://${ServerlessRestApi}.execute-api.${AWS::Region}.amazonaws.com/Prod/example/create"
  CreateFunction:
    Description: "Create Lambda Function ARN"
    Value: !GetAtt CreateFunction.Arn
  CreateFunctionIamRole:
    Description: "Implicit IAM Role created for Create function"
    Value: !GetAtt CreateFunctionRole.Arn
  OrdersTable:
    Description: "DynamoDB Table for orders"
    Value: !GetAtt OrdersTable.Arn
  OrdersQueue:
    Description: "SQS Queue for orders"
    Value: !GetAtt OrdersQueue.Arn
  ProcessFunction:
    Description: "Process Lambda Function ARN"
    Value: !GetAtt ProcessFunction.Arn
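
A note on VisibilityTimeout: the queue's visibility timeout has to be at least as long as the function timeout, which is why the two values are kept in step here. AWS's guidance for SQS event sources goes further and suggests six times the function timeout plus any batching window, so that a retry has room to finish. Here is a small sketch of that arithmetic; the helper name is hypothetical.

```python
def recommended_visibility_timeout(function_timeout_s, batch_window_s=0):
    """Six times the function timeout plus the batching window, in seconds."""
    return 6 * function_timeout_s + batch_window_s


# With the 120-second Timeout from Globals and no batching window
print(recommended_visibility_timeout(120))
```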



Since we created a new Lambda function, we need to tell AWS where to grab the image from. Modify the samconfig.toml file and add another entry to the image_repositories array for ProcessFunction, with exactly the same value as the one for CreateFunction. So if the row looked like this before -

image_repositories = ["CreateFunction=541434768954.dkr.ecr.us-east-2.amazonaws.com/serverlessarchexample8b9687a4/createfunction286a02c8repo"]

It should now look like this -

image_repositories = ["CreateFunction=541434768954.dkr.ecr.us-east-2.amazonaws.com/serverlessarchexample8b9687a4/createfunction286a02c8repo",
"ProcessFunction=541434768954.dkr.ecr.us-east-2.amazonaws.com/serverlessarchexample8b9687a4/createfunction286a02c8repo"]
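
Each entry in image_repositories has the form FunctionLogicalId=RepositoryUri. Here both functions can point at one repository because they are built from the same Dockerfile and differ only in the ImageConfig command. A quick sketch that splits such entries into a mapping, using shortened placeholder URIs rather than the real ones:

```python
# Placeholder URIs (assumptions for illustration), in the same form as samconfig.toml
entries = [
    "CreateFunction=123456789012.dkr.ecr.us-east-2.amazonaws.com/example/repo",
    "ProcessFunction=123456789012.dkr.ecr.us-east-2.amazonaws.com/example/repo",
]

# Split on the first "=" only: the left side is the function's logical ID.
repositories = dict(entry.split("=", 1) for entry in entries)

for function_name, uri in repositories.items():
    print(function_name, "->", uri)
```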

Test the changes

Build the app -

sam build

To mimic receiving an event from the queue, we invoke the Lambda locally and pass it a sample payload.

Under the events directory, update the contents of the event.json file -

{
  "Records": [
    {
      "messageId": "2fc6cd1b-544b-452d-bf13-035256a10358",
      "receiptHandle": "AQEB1xh5E0MulLiCOgW9GdHXdr14bSrCSAGbjl6WToOIVCObaMZfBZCYIqBoNG3aAW4dhubspACLsqtKlYltUkPjzcct38Hkx9GFTuRgkT/tz91Skf029ADYrEt8azHC50S/TjdCNGFMF0pLln4RnUxFqUBqivBuyRXkj/R4khOzXDKK6gT2MNr2rVqHPKNxWkWR7QHMIULCo0Bh4rxG7TtmfFWlvLpy8O1mMTviIj2ajPBS7iYV1bBE6uT2rOWfWKafbcBjwSqUZImBdCUbSTimP414aYMoi2mtDKvgukcb3UBWDA4pDRTNpiK5oNpbfGbL/zJIiifGDTkjFgfHpBPqixP+09bevn2MUGwIKBjoPkSXAf/vf/llniedtkSMjSRDFZCRgLQIeySQ3pkWPPfbAw==",
      "body": "{ \"request_id\": \"5232634\", \"url\": \"https://www.google.com/search?q=aws+sqs\"\n}",
      "attributes": {
        "ApproximateReceiveCount": "2",
        "SentTimestamp": "1661096438766",
        "SenderId": "AIDAX4EAG5Y5I2ZNJ6RNX",
        "ApproximateFirstReceiveTimestamp": "1661096438771"
      },
      "messageAttributes": {},
      "md5OfBody": "8b2d97573fcd7eeddf89ed10a153cc81",
      "eventSource": "aws:sqs",
      "eventSourceARN": "arn:aws:sqs:us-east-2:541434768954:reviews-scraper",
      "awsRegion": "us-east-2"
    }
  ]
}
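
Note that body is itself a JSON string nested inside the JSON event, which is why its quotes are escaped. Writing that by hand is error-prone, so here is a minimal sketch that builds an equivalent payload containing only the fields process.py actually reads. The helper name build_sqs_event is hypothetical.

```python
import json


def build_sqs_event(request_id, url):
    """Wrap a request in the minimal SQS event shape that process.py reads."""
    # The inner dumps produces the string that SQS would deliver as the body.
    body = json.dumps({"request_id": request_id, "url": url})
    return {"Records": [{"body": body, "eventSource": "aws:sqs"}]}


event = build_sqs_event("5232634", "https://www.google.com/search?q=aws+sqs")
# The outer dumps escapes the body's quotes for us.
print(json.dumps(event, indent=2))
```

The printed output can be saved as events/event.json and passed to sam local invoke.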



Now we run the app locally with the following command -

sam local invoke --env-vars ./tests/env.json -e ./events/event.json ProcessFunction

The output should look like this -

SAM output

Check the local DynamoDB table to verify that the request was marked complete -

DynamoDB table

Deploying the changes

Deploy the changes to AWS with the following command -

sam deploy

The output should look like this -

SAM deploy output

Just like before, test the changes by triggering a request from Postman and validating the data in the DynamoDB table -

DynamoDB table

You’ll notice that the message from the last test was also processed successfully.

Source Code

Here is the source code for the project created here.

Next: Part 5: Writing a CSV to S3 from AWS Lambda
