zahaar

Posted on Apr 25, 2022

Generate PDFs from HTML via Puppeteer on AWS Lambda + API Gateway

#lambda #puppeteer #pdf #aws

“Evil cannot create anything new, they can only corrupt and ruin what good forces have invented or made.” - JRR Tolkien.

Preface

Wouldn't it be great to generate PDF files using HTML && CSS, without relying on overly complex drivers that depend on a whole bunch of C libraries?

While also supporting all the latest features of HTML5 && CSS3?

Well, we have great news. There is a framework called Puppeteer that uses a relatively new Chrome feature and makes it accessible through a Node.js-based API.

Essentially, Puppeteer launches a Chromium browser instance in headless mode (without actually opening a window) and lets us drive the browser through a set of API commands: parse a website, retrieve images, generate a PDF and so on, as if you had opened the HTML file in the latest browser version.

We could build a Docker image running Puppeteer and deploy it on ECS or Heroku, but creating a stable && optimized image can be quite challenging...

In contrast, running it in AWS Lambda is, IMHO, much simpler in terms of development speed, debugging and monitoring. Besides, serverless is a nice concept for a POC (you pay for what you use).

Repo -> End Result

You can see the complete working example in this repo

zahaar / generate-pdf-lambda

Generate PDF document via Puppeteer running on AWS Lambda

This repo contains a serverless application that takes an HTML template and returns a PDF as a binary

Diagram

Requirements

How to Run

Clone this repo: git clone https://github.com/zahaar/generate-pdf-lambda
Import the cURL request into Insomnia (Postman is not recommended, as it can't visualize PDFs).
Run make api-local to have a local API GW running.
Send the cURL request via Insomnia.

You can also invoke the Lambda directly, bypassing API GW, by supplying an example event in a file and running make invokation-local. The response will be a base64-encoded PDF binary.

How to Deploy

A configured AWS CLI V2 is a must -> AWS Console Account && API Keys

make deploy
Fetch AWS SAM deploy output URL Value, and change the Url in Insomnia from localhost to that value execution result in…


Requirements and Prerequisites

1. Set Up a Local AWS SAM Template with Chrome Lambda Layer

In this step we complete the local SAM execution setup. Once this is done, we will have a solid reference point.

The end version of this step can be fetched from the 1_local-setup branch

We can create a basic SAM template by running sam init or by following a guide,

but our end goal should be a more complete structure like this



├── Makefile
├── VERSION -- for VERSION tracking, helpful for CI
├── envs.json -- to set envs for local execution ( if necessary )
├── events
│   └── api-gw-event.json -- an example API GW event for local execution
├── src
│   └── app.js -- main source code file
└── template.yaml -- AWS SAM configuration template

app.js contains simple code that returns the same event.body it receives from the example event.



...
...
  var response = {
    statusCode: 200,
    body: event.body,
  }

  return response
...

while template.yaml has a resource configuration for the API GW service



...
...
  ApiGatewayApi:
    Type: AWS::Serverless::Api
    Properties:
      StageName: Staging
      BinaryMediaTypes:
        - application~1pdf  # Note the support for the binary PDF media type
...

and the Lambda. In the context of our goal, it's called PdfFunction.
Take note of the Layer used in this config. By using the chrome-aws-lambda layer, we essentially avoid having to set up package.json dependencies for puppeteer and chrome on the Docker image that AWS uses on EC2 for Lambdas, a step that can be quite challenging.



...
...
  PdfFunction:
    Type: AWS::Serverless::Function
    Description: Generates a PDF from HTML posted via API Gateway
    Properties:
      CodeUri: src/
      Handler: app.handler
      Runtime: nodejs12.x
      Timeout: 15
      MemorySize: 3008

      Layers:
        - !Sub 'arn:aws:lambda:${AWS::Region}:764866452798:layer:chrome-aws-lambda:22'
      Environment:
        Variables:
          EXAMPLE_ENV: 'CHANGE_THIS'
      Events:
        ApiEvent:
          Type: Api
          Properties:
            Path: /pdf
            Method: post
            RestApiId:
              Ref: ApiGatewayApi
...

To test that all requirements are met, let's run a local event.



make invokation-local

The output should be essentially the same as the execution logs in CloudWatch



...
...
Mounting /Users/wparker/Dev/scheduled-website-screenshot-app/.aws-sam/build/PdfFunction as /var/task:ro,delegated inside runtime container
START RequestId: e4d7743d-5be2-4735-84c8-9d5160d9a750 Version: $LATEST
...

2. Configure Puppeteer in Lambda; Supply Template HTML

The next step is to program app.js to start Puppeteer, consume HTML from an API GW event, and return a base64-encoded response that API GW decodes before sending it to the client.
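As a rough sketch of the "consume HTML from an API GW event" part: the helper below is hypothetical (extractHtml is not a name from the repo) and assumes the proxy event carries the payload in event.body, with the isBase64Encoded flag set when API GW treated the request as binary.

```javascript
// Hypothetical helper: pull the HTML template out of an API Gateway proxy event.
function extractHtml(event) {
  if (!event || typeof event.body !== 'string' || event.body.length === 0) {
    throw new Error('Request body with HTML is required')
  }
  // API GW sets isBase64Encoded when it treats the request payload as binary.
  return event.isBase64Encoded
    ? Buffer.from(event.body, 'base64').toString('utf8')
    : event.body
}
```

With a text/html request the body normally arrives as a plain string, so the base64 branch is only a safety net.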

The end version of this step can be fetched from the 2_generate-pdf branch

We need to change the Lambda handler code to something like this file (it is too long to be displayed here)

Key takeaways are:

The browser launch args in this example are set specifically for AWS Lambda compatibility.



...
...
    browser = await chromium.puppeteer.launch({
      args: chromium.args,
      defaultViewport: chromium.defaultViewport,
      executablePath: await chromium.executablePath,
      headless: chromium.headless,
      ignoreHTTPSErrors: true,
    })
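One thing the excerpt above does not show is cleanup. A possible way to make sure the browser is always closed is sketched below. withBrowser is a hypothetical helper name, and chromium is assumed to be the chrome-aws-lambda module passed in by the caller (for example const chromium = require('chrome-aws-lambda')); the linked file may handle this differently.

```javascript
// Hypothetical helper: launch the browser, run a task, and always close the browser.
// `chromium` is expected to be the chrome-aws-lambda module supplied by the caller.
async function withBrowser(chromium, task) {
  let browser = null
  try {
    browser = await chromium.puppeteer.launch({
      args: chromium.args,
      defaultViewport: chromium.defaultViewport,
      executablePath: await chromium.executablePath,
      headless: chromium.headless,
      ignoreHTTPSErrors: true,
    })
    return await task(browser)
  } finally {
    // Close even when the task throws, so a warm container does not keep Chromium running.
    if (browser !== null) {
      await browser.close()
    }
  }
}
```

Passing the module in as an argument also makes the helper easy to exercise locally with a stub, without downloading Chromium.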

The output format is set to mimic an A4 document.



...
...
    await page.setViewport({
      width: 1080,
      height: 1600,
      deviceScaleFactor: 1,
      isLandscape: true,
    })
    pdf = await page.pdf({
      format: 'a4',
      margin: {
        top: '0px',
        right: '0px',
        bottom: '0px',
        left: '0px',
      },
    })
...
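The excerpt does not show how the HTML gets into the page. One common approach, sketched here as an assumption rather than the repo's actual code, is page.setContent followed by page.pdf. renderPdf is a hypothetical name, and printBackground is an extra option that the original snippet does not set.

```javascript
// Hypothetical helper: load an HTML string into a Puppeteer page and render it as an A4 PDF.
async function renderPdf(page, html) {
  // Wait until network activity settles so remote fonts and images are included.
  await page.setContent(html, { waitUntil: 'networkidle0' })
  return page.pdf({
    format: 'a4',
    printBackground: true, // assumption: keep CSS backgrounds in the output
    margin: { top: '0px', right: '0px', bottom: '0px', left: '0px' },
  })
}
```

The page argument would come from browser.newPage(); the returned value is the PDF binary that gets base64-encoded for the response.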

The response headers are set for PDF file transfer. The isBase64Encoded flag is set to true to tell API GW that it needs to decode the body.



...
...
  var response = {
    statusCode: 200,
    headers: {
      'Access-Control-Allow-Origin': '*',
      'Access-Control-Allow-Methods': 'GET, POST',
      'Content-type': 'application/pdf',
      'Content-Disposition': 'attachment; filename="foo.pdf"',
    },
    isBase64Encoded: true,
    body: pdf.toString('base64'),
  }
...
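The same response can be wrapped in a small function, which makes the base64 round trip easy to check locally. This is only a sketch: buildPdfResponse and its filename parameter are hypothetical names, while the headers mirror the snippet above.

```javascript
// Hypothetical helper: wrap PDF bytes in an API Gateway proxy response.
function buildPdfResponse(pdf, filename = 'foo.pdf') {
  return {
    statusCode: 200,
    headers: {
      'Access-Control-Allow-Origin': '*',
      'Access-Control-Allow-Methods': 'GET, POST',
      'Content-type': 'application/pdf',
      'Content-Disposition': `attachment; filename="${filename}"`,
    },
    // Tells API GW that the body is base64 and has to be decoded into binary.
    isBase64Encoded: true,
    // Buffer.from accepts both a Buffer and a Uint8Array.
    body: Buffer.from(pdf).toString('base64'),
  }
}
```

Decoding the body with Buffer.from(response.body, 'base64') should give back exactly the bytes that Puppeteer produced.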

To test this code, an HTML template is needed. We will use this open-source one for demonstration.
The document is sent as the request body with 'Content-Type: text/html'.

Please note 'Accept: application/pdf'; this is important, because API GW decodes the base64 response into binary only when the Accept header matches a configured binary media type.

The resulting cURL request is in this file
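If you prefer a script over an HTTP client, the same request can be sketched in Node. Everything here is illustrative: buildPdfRequest and savePdf are hypothetical names, the global fetch assumes Node 18 or newer, and the URL has to be supplied by you.

```javascript
// Hypothetical client helper: build fetch() options for the /pdf endpoint.
function buildPdfRequest(html) {
  return {
    method: 'POST',
    headers: {
      'Content-Type': 'text/html',
      // Should match an entry in BinaryMediaTypes so API GW returns decoded binary.
      Accept: 'application/pdf',
    },
    body: html,
  }
}

// Sketch: post the HTML and save the returned PDF to disk.
// Assumes Node 18+ (global fetch); the URL is supplied by the caller.
async function savePdf(url, html, outputPath) {
  const fs = require('fs/promises')
  const res = await fetch(url, buildPdfRequest(html))
  if (!res.ok) {
    throw new Error(`Request failed with status ${res.status}`)
  }
  await fs.writeFile(outputPath, Buffer.from(await res.arrayBuffer()))
}
```

For a local run the URL would typically be something like http://127.0.0.1:3000/pdf, assuming SAM's default local port.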

To test our result, let's start SAM in local start-api mode (akin to a server, as opposed to a one-time invocation).



make api-local

Import the cURL request into Insomnia.

Result

3. Deploy Lambda + API GW via SAM

Refer to the deploy step in README.md on the main branch

Tips

1. Insomnia vs Postman

Instead of playing tricks with Postman PDF visualization, I highly recommend switching to Insomnia.

Insomnia Visualized Response	Postman Visualization Response

2. `Error: Error building docker image: pull access denied for`

Jan 14 '21

Works fine here. You shouldn't need credentials for Public ECR (you can use auth for specific cases) but if you just want to consume it, remove the existing credentials