Jakob Ondrey for AWS Community Builders

Automate a Papertrail with AWS Lambda

Photo by Nana Smirnova on Unsplash

The aws_lambda_python_alpha module (or probably any aws_lambda_<language>_alpha module) is really great for writing, building, packaging, and deploying serverless functions with the AWS CDK.
  • Import Both:
import aws_cdk.aws_lambda as lambda_
import aws_cdk.aws_lambda_python_alpha as lambda_python
  • Create Lambda Infrastructure with alpha module

  • Build (w/ docker running) and Deploy:

cdk synth --no-staging
cdk deploy

The CDK will take care of the rest.

About Me
My name is Jakob and I am a DevOps Engineer. I used to be a lot of other things as well (Dish Washer, Retail Employee, Camp Counselor, Army Medic, Infectious Disease Researcher), but now I am a DevOps Engineer. I received no formal CS education, but I'm not self-taught, because I had thousands of instructors who taught me through their tutorials and blog posts. The culture of information sharing within the software engineering community is vital to everyone, especially those like me who didn't have other options. So, as I learn new things I will be documenting them through the eyes of someone learning for the first time, because those are the people most in need of a guide. Happy Learning! And don't be a stranger.

The Problem

A couple of months ago I caused quite a stir by pointing out that my local school district (the 3rd largest in the United States) was underrepresenting COVID-19 cases on their School Dashboards in the middle of the Omicron surge.

To summarize: They changed which cases were being displayed without changing the legends of the graphs and charts displaying the data. This resulted in thousands of cases missing that, based on the legends, should have been there. It was very bad work on their part.

One tool that helped me tell the story and counter the school district's excuses was the set of archived snapshots of the district's web pages that had been captured by the Internet Archive's Wayback Machine.

The Solution

If you haven't heard of the Wayback Machine, go check it out. It is essentially a time capsule of web pages, and it allowed me to prove that on (at least) certain dates the district's graphs were not showing what they said they were. It was very useful for me.

So what does this have to do with anything? Well, in the spirit of accountability, and wanting to play with the AWS CDK, I built a stack that you can deploy to automate the archival of web pages to the Internet Archive. Because it uses a serverless Lambda function, it is so cost-effective that it is basically free (you would need to make over a million archive requests per day, every day, to go over the AWS Free Tier).

You can find the repo here if you want to take a look or clone it for your own use and I will also outline the juicy bits below, but briefly:

Using Python and the AWS CDK, I have created a CloudFormation stack consisting primarily of a Lambda function that uses a Wayback Machine Python library to request the archival of a list of URLs, and an EventBridge rule that triggers the Lambda. The CDK also creates all the accessory roles and permission sets that are required for this to function.

The Function

The function code (below) sits in a directory with a requirements.txt file. The CDK (with the help of docker) will create a container, install the dependencies listed in requirements.txt, and deploy that artifact to AWS for you. It's quite neat.

 1    from waybackpy import WaybackMachineSaveAPI
 2    import signal
 3    import logging
 5    logger = logging.getLogger()
 6    logger.setLevel(logging.INFO)
 8    user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
 9    url_list = [
10        "",  # the URLs you want archived go here
11        ]
13    def handler(signum, frame):
14        raise Exception("Request Sent")  # raised when the alarm fires
16    def lambda_handler(event, context):
17        signal.signal(signal.SIGALRM, handler)
18        for url in url_list:
19            signal.alarm(5) # resets alarm
20            try:
21                save_api = WaybackMachineSaveAPI(url, user_agent)
22                save_api.save()
23            except Exception:
24                logger.info("Request Sent: {}".format(url))
25                continue

At the top, the imports and logging are set up, along with the user_agent for the Wayback API. Then there is a list of all the URLs you want to save.

Skipping the handler function for a moment, the Lambda will enter the lambda_handler function. The first thing it does, on line 17, is use the signal library to register the handler function so that it is called whenever a signal alarm counts down to 0. This is what allows us to just throw requests at the Wayback Machine without having to wait the 30+ seconds it takes for a response saying that the archival is done.

Then we iterate over url_list, resetting the signal alarm to 5 seconds each time and then requesting that the URL be saved. Meanwhile the alarm is counting down, and when it reaches 0 the handler raises its exception, the request is logged, and we move on to the next iteration of resetting the alarm and requesting that a URL be saved.

I think it is a nice way to self-rate-limit without having to wait for a response. I suppose you could lower the timer to 1 second if you really wanted to.
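To see the alarm pattern on its own, here is a minimal sketch that swaps the Wayback request for a plain time.sleep, so it runs without any network access. The names RequestSent, on_alarm, and fire_and_move_on are made up for this illustration and are not part of the repo. It also assumes a Unix-like system (SIGALRM is not available on Windows) and that it runs in the main thread.

```python
import signal
import time


class RequestSent(Exception):
    """Raised by the alarm handler to abandon a call that is taking too long."""


def on_alarm(signum, frame):
    # Called by the OS when the alarm set with signal.alarm() expires.
    raise RequestSent()


def fire_and_move_on(tasks, wait_seconds=1):
    """Run each (name, callable) pair, giving up on it after wait_seconds."""
    outcomes = []
    signal.signal(signal.SIGALRM, on_alarm)
    try:
        for name, task in tasks:
            signal.alarm(wait_seconds)  # (re)start the countdown
            try:
                task()
                outcomes.append((name, "finished"))
            except RequestSent:
                outcomes.append((name, "abandoned"))
    finally:
        signal.alarm(0)  # cancel any alarm that is still pending
    return outcomes


if __name__ == "__main__":
    demo = [("fast", lambda: None), ("slow", lambda: time.sleep(3))]
    print(fire_and_move_on(demo))
```

Two small differences from the function above: the sketch raises a dedicated exception class, so that a genuine error inside a task is not mistaken for the timeout, and it cancels the alarm after the loop so that nothing fires once the work is done.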

The Stack

The CDK stack is maintained in /wayback/
Note that both aws_lambda and aws_lambda_python_alpha modules are imported and that aws_lambda_python_alpha is imported on its own. That is important.

The stack contains two things: the Lambda function and the EventBridge rule that triggers it.


The function is created with the aws_lambda_python_alpha module. It is used instead of aws_lambda because it automatically packages the function together with its dependencies, and it even allows local testing should we be interested in that.

16        wayback_function = lambda_python.PythonFunction(
17            self,
18            "wayback",
19            function_name="WaybackArchiver",
20            runtime=lambda_.Runtime.PYTHON_3_9,
21            entry="./wayback_app",
22            index="",
23            handler="lambda_handler",
24            memory_size=128,
25            timeout=Duration.seconds(300),
26            )

Most of the parameters make sense when viewed in context, but I will explicitly point out that entry= on line 21 is the DIRECTORY that will be packaged up by the CDK. If there is a requirements.txt file in this directory, all the libraries and modules listed in it will be installed.

index= and handler= on lines 22-23 together tell the CDK which file and function in the entry directory should be the starting point at execution.
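As a rough illustration of how those two values combine: as far as I understand it, the file name loses its .py extension and is joined to the function name with a dot, giving the module.function handler string that Lambda expects. The helper below is a hypothetical stand-in written for this post, not the CDK's own code, and app.py is a made-up file name. The default values are my assumption about what the module falls back to when the parameters are omitted.

```python
from pathlib import PurePosixPath


def resolve_handler(index: str = "index.py", handler: str = "handler") -> str:
    """Combine an index file and a function name into a Lambda handler string.

    Hypothetical helper that mimics (as an assumption) how the two values
    are joined; it does not call into the CDK.
    """
    path = PurePosixPath(index)
    if path.suffix != ".py":
        raise ValueError("index must point at a .py file")
    # "src/app.py" becomes "src.app"; "app.py" becomes "app".
    module = ".".join(path.with_suffix("").parts)
    return f"{module}.{handler}"


print(resolve_handler("app.py", "lambda_handler"))  # prints app.lambda_handler
```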

Event Rule

The rule is not assigned to a variable because nothing else references it, but this section creates the EventBridge rule that triggers the Lambda. There are only two important things here: schedule and targets.

28        event.Rule(self, "WaybackRule",
29            rule_name="WaybackRule",
30            schedule=event.Schedule.cron(
31                minute="0", 
32                hour="9",
33                ),
34            targets=[
35                targets.LambdaFunction(wayback_function),
36                ],
37            )

Line 30 indicates that you will be using the CDK's version of cron notation. Note that the time zone is UTC and every value is given as a string. In addition, parameters that are not stated default to *. So, for example, not including day= above means every day of the month.
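To make the defaults concrete, here is a small sketch of how those keyword arguments end up as an EventBridge cron expression, whose six fields are minutes, hours, day of month, month, day of week, and year. The cron_expression function is a hypothetical helper written for this post to mimic that behavior as I understand it, not the CDK's implementation. One wrinkle it shows: EventBridge requires a ? in one of the two day fields, so an unspecified day of week is rendered as ? rather than *.

```python
from typing import Optional


def cron_expression(
    minute: str = "*",
    hour: str = "*",
    day: Optional[str] = None,
    month: str = "*",
    week_day: Optional[str] = None,
    year: str = "*",
) -> str:
    """Build an EventBridge-style cron string from optional fields."""
    if day is not None and week_day is not None:
        raise ValueError("supply day or week_day, not both")
    if week_day is not None:
        # A day of the week was chosen, so the day of the month becomes '?'.
        day_field, week_day_field = "?", week_day
    else:
        day_field = day if day is not None else "*"
        week_day_field = "?"
    return f"cron({minute} {hour} {day_field} {month} {week_day_field} {year})"


# The schedule used in the stack: every day at 09:00 UTC.
print(cron_expression(minute="0", hour="9"))  # prints cron(0 9 * * ? *)
```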

Line 34 is where you specify what you want to happen when this event occurs. In this case it references the wayback_function defined above.

Building and Deploying

The stack can be built, including the packaged lambda (assuming docker is running) and deployed with the following commands:

cdk synth --no-staging
cdk deploy

Once created you will have a function that will archive a list of web pages (that you don't have to manage storage of) without further direction from you.

Modifications to the stack or the application just need to be rebuilt and deployed, and the old version will be replaced with the modified one.

Now, figure out which web pages you think could use some oversight and start archiving them. You never know when an external record of their content at a given time will bear fruit!
