AWS Amplify & Puppeteer

Josh Vincent ・9 min read

I recently worked on a project that required collecting hourly market data for the top 10 companies by market cap.

We found it very difficult to find any APIs that could solve this for us, so I have put this together in the hope that it helps someone else using Amplify, Puppeteer, Lambda, DynamoDB, and scraping public data off the web.

We discovered that we could run a Lambda function every hour to get the top 10 companies by market cap and insert the data into a database, which we would fetch later (a future blog post to come...).
This is how we did it.

Here is what the architecture looks like.

  1. A CloudWatch Events rule fires on a 1-hour schedule
  2. CloudWatch invokes our first Lambda function
  3. The Lambda function reads the scraped table and pushes one message per row to an SQS queue
  4. SQS triggers invocations of a second Lambda function that write the data to our DynamoDB table

Creating our new project.

In the VS Code terminal, run the following:

amplify init

Create a name for your project and then accept all defaults

Note: It is recommended to run this command from the root of your app directory
$ Enter a name for the project MarketData
$ Enter a name for the environment dev
$ Choose your default editor: Visual Studio Code
$ Choose the type of app that you're building javascript

Please tell us about your project
$ What javascript framework are you using none
$ Source Directory Path:  src
$ Distribution Directory Path: dist
$ Build Command:  npm run-script build
$ Start Command: npm run-script start
Using default provider  awscloudformation

$ Do you want to use an AWS profile? Yes
$ Please choose the profile you want to use marketdata

If you don't have your credentials saved in ~/.aws/credentials you can just say no to the last question.

Creating our first Lambda function

Next, we get to create our first lambda function by doing the following.

amplify add function

Follow the prompts which should look like this

$ amplify add function
$ Select which capability you want to add: Lambda function (serverless function)
$ Provide a friendly name for your resource to be used as a label for this category in the project: marketdataPuppeteer
$ Provide the AWS Lambda function name: marketdataPuppeteer
$ Choose the runtime that you want to use: NodeJS
$ Choose the function template that you want to use: Hello World
$ Do you want to access other resources in this project from your Lambda function? Yes
$ Select the category 

You can access the following resource attributes as environment variables from your Lambda function
        ENV
        REGION
$ Do you want to invoke this function on a recurring schedule? Yes
$ At which interval should the function be invoked: Hourly
$ Enter the rate in hours: 1
$ Do you want to configure Lambda layers for this function? No
$ Do you want to edit the local lambda function now? (Y/n) y

Set Up the dependencies

Next, change into the source folder of the function you just created:

cd amplify/backend/function/marketdataPuppeteer/src/

Now install the following.

npm install chrome-aws-lambda --save-prod
npm install puppeteer-core --save-prod

Great, now open the index.js file inside the amplify/backend/function/marketdataPuppeteer/src/ folder.

Add the required dependencies

//amplify/backend/function/marketdataPuppeteer/src/index.js

var AWS = require("aws-sdk");
var SQS = new AWS.SQS({ region: "ap-southeast-2" });
const chromium = require("chrome-aws-lambda");

You will also need to add in the URL of your SQS queue which we will set up shortly.

var QUEUE_URL =
  "https://sqs.ap-southeast-2.amazonaws.com/012345678/Market-Data-Que";

Here is our full function:

//amplify/backend/function/marketdataPuppeteer/src/index.js

const AWS = require("aws-sdk");
const SQS = new AWS.SQS({ region: "ap-southeast-2" });
const chromium = require("chrome-aws-lambda");

const QUEUE_URL =
  "https://sqs.ap-southeast-2.amazonaws.com/012345678/Market-Data-Que";

// Resolve after the given number of milliseconds
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

exports.handler = async (event, context) => {
  let browser = null;

  try {
    browser = await chromium.puppeteer.launch({
      args: chromium.args,
      defaultViewport: chromium.defaultViewport,
      executablePath: await chromium.executablePath,
      headless: chromium.headless,
      ignoreHTTPSErrors: true,
    });

    const page = await browser.newPage();
    await page.goto("https://www.marketindex.com.au/asx-listed-companies");
    await delay(5000); // Wait 5 seconds for the page to load

    const selector = "tbody tr"; // Every row in the table body
    const rows = await page.$$eval(selector, (tableRows) =>
      // For each row, collect the trimmed text content of every <td> element
      tableRows.map((tableRow) =>
        [...tableRow.getElementsByTagName("td")].map((tableData) =>
          tableData.textContent.trim()
        )
      )
    );
    // All the rows now exist in the rows variable; rows[0][2] is the third column of the first row

    // Send one message per company and wait for every send to finish.
    // Without the await, Lambda can end the invocation before the messages go out.
    await Promise.all(
      rows.slice(0, 10).map((company) => {
        // Create an object for each of the rows to add to our table
        const item = {
          symbol: company[2],
          name: company[3],
          price: company[4],
          daily_change: company[5],
          yearly_change: company[6],
          marketcap: company[7],
        };
        console.log(item); // In our logs we should now have a list of objects with the above data

        const params = {
          MessageBody: JSON.stringify(item),
          QueueUrl: QUEUE_URL,
        };
        return SQS.sendMessage(params)
          .promise()
          .then((result) => console.log("Successfully sent message", result))
          .catch((error) => console.log("Error failed to send message", error));
      })
    );
  } finally {
    // Always close the browser, even if scraping failed
    if (browser !== null) {
      await browser.close();
    }
  }
};
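
Since we only ever send ten messages, the individual sendMessage calls in the handler above could also be collapsed into a single request. The sketch below is one way to do that, assuming the SQS sendMessageBatch call, which accepts up to 10 entries per request. The helper name buildBatchParams and the use of the array index as the entry Id are illustrative choices, not part of the original function. The sample item is the FMG row used in the test event later in this post.

```javascript
// Hypothetical helper: turn up to 10 scraped items into one sendMessageBatch request.
function buildBatchParams(items, queueUrl) {
  if (items.length > 10) {
    throw new Error("SQS accepts at most 10 entries per batch");
  }
  return {
    QueueUrl: queueUrl,
    Entries: items.map((item, index) => ({
      Id: String(index), // Must be unique within the batch
      MessageBody: JSON.stringify(item),
    })),
  };
}

// Example with a single item in the same shape as the scraped rows
const exampleParams = buildBatchParams(
  [{ symbol: "FMG", name: "FMG Fortescue Metals Group Ltd", price: "$20.63" }],
  "https://sqs.ap-southeast-2.amazonaws.com/012345678/Market-Data-Que"
);
console.log(JSON.stringify(exampleParams, null, 2));

// Inside the handler this would stand in for the individual sendMessage calls:
// await SQS.sendMessageBatch(buildBatchParams(items, QUEUE_URL)).promise();
```

A batch response reports per-entry problems in its Failed array, so it is worth logging the result rather than assuming every entry went through.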


cd back to the project root and push the backend to AWS by running:

amplify push --y

You should see something like this:

$ amplify push
✔ Successfully pulled backend environment dev from the cloud.

Current Environment: dev

| Category | Resource name       | Operation | Provider plugin   |
| -------- | ------------------- | --------- | ----------------- |
| Function | marketdataPuppeteer | Create    | awscloudformation |
$ Are you sure you want to continue? (Y/n) y


This pushes your code to the AWS Amplify backend and creates your Lambda function.

Open the Lambda console in AWS. You should see your new function listed there.

Creating the SQS Queue

Now that our Lambda function is set up, you might have noticed that we haven't created the queue behind QUEUE_URL yet.
At the time of writing, Amplify doesn't have built-in functionality to create SQS queues, so we will do it via the console.

Click Create queue.

Give your queue a name; we called ours Market-Data-Que.

Keep all the defaults and create the queue.

Now copy the queue's URL (shown as its endpoint address), go back to your function, and replace QUEUE_URL with your value.
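
Hard-coding the URL works, but it means editing the source whenever the queue changes. One alternative, sketched below, is to read it from an environment variable. The variable name QUEUE_URL and the resolveQueueUrl helper are assumptions for illustration; Amplify does not set this variable for you, so you would add it yourself in the Lambda configuration.

```javascript
// Hypothetical helper: prefer an environment variable, fall back to a hard-coded URL.
// QUEUE_URL is an assumed variable name that you would define on the function yourself.
function resolveQueueUrl(env, fallbackUrl) {
  const url = env.QUEUE_URL || fallbackUrl;
  if (!url) {
    throw new Error("No SQS queue URL configured");
  }
  return url;
}

const QUEUE_URL = resolveQueueUrl(
  process.env,
  "https://sqs.ap-southeast-2.amazonaws.com/012345678/Market-Data-Que"
);
console.log("Sending to", QUEUE_URL);
```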

Run another amplify push to update our recent changes.

amplify push --y 

Because this function runs a little longer than normal and Puppeteer has to launch a headless browser, we increased both the timeout and the memory for this function.

Go to your Lambda function in the console.

  1. Click the Configuration tab
  2. Click Edit
  3. Increase Memory to 512 MB
  4. Set Timeout to 2 min 30 sec (increase this if you need to)
  5. Click the "View the marketdataLambdaRole12313-dev-role" link to open the role in the IAM console
  6. We want to allow Lambda to talk to SQS, so add the AmazonSQSFullAccess policy to the role (restrict it to your queue's ARN to lock down access)

Attach the policy and return to your lambda function.
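
The full-access policy is the quickest way to get going, but the function only ever needs to send messages to one queue. As a rough sketch of what locking this down to the queue's ARN might look like, the snippet below builds an IAM policy document that allows just sqs:SendMessage on a single queue. The buildSendOnlyPolicy helper is hypothetical, and the ARN reuses the placeholder account number from the queue URL above; substitute your own.

```javascript
// Hypothetical helper: build a least-privilege IAM policy document for one queue.
function buildSendOnlyPolicy(queueArn) {
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Effect: "Allow",
        Action: ["sqs:SendMessage"],
        Resource: queueArn,
      },
    ],
  };
}

// Placeholder ARN built from the example account number and queue name in this post
const policy = buildSendOnlyPolicy(
  "arn:aws:sqs:ap-southeast-2:012345678:Market-Data-Que"
);

// Print JSON that can be pasted into the IAM console's inline policy editor
console.log(JSON.stringify(policy, null, 2));
```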

Now we are ready to test!

Open the Test tab, create a test event using the hello-world template, and invoke the function.
The execution log should show the ten scraped items, each followed by a "Successfully sent message" line.

If you then visit your SQS queue, you should see a heap of messages waiting there.

Okay, great. So far we have a Lambda function that opens the page in Puppeteer, extracts the data, and pushes it to an SQS queue.

Now we want to create a new function that uses the SQS queue as a trigger to enter our entries into the DynamoDB table. Let's set them up.

Creating the storage table.

First let's create the Storage with amplify add storage

$ amplify add storage 
$ Please select from one of the below mentioned services: NoSQL Database

Welcome to the NoSQL DynamoDB database wizard
This wizard asks you a series of questions to help determine how to set up your NoSQL database table.

$ Please provide a friendly name for your resource that will be used to label this category in the project: marketdata
$ Please provide table name: marketdata

You can now add columns to the table.

$ What would you like to name this column: id
$ Please choose the data type: string
$ Would you like to add another column? Yes
$ What would you like to name this column: symbol
$ Please choose the data type: string
$ Would you like to add another column? Yes
$ What would you like to name this column: name
$ Please choose the data type: string
$ Would you like to add another column? Yes
$ What would you like to name this column: price
$ Please choose the data type: string
$ Would you like to add another column? Yes
$ What would you like to name this column: daily_change
$ Please choose the data type: string
$ Would you like to add another column? Yes
$ What would you like to name this column: yearly_change
$ Please choose the data type: string
$ Would you like to add another column? Yes
$ What would you like to name this column: marketcap
$ Please choose the data type: string
$ Would you like to add another column? No

$ Please choose partition key for the table: id
$ Do you want to add a sort key to your table? Yes
$ Please choose sort key for the table: symbol

You can optionally add global secondary indexes for this table. These are useful when you run queries defined in a different column than the primary key.
To learn more about indexes, see:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.CoreComponents.html#HowItWorks.CoreComponents.SecondaryIndexes

$ Do you want to add global secondary indexes to your table? No
$ Do you want to add a Lambda Trigger for your Table? No

Do another amplify push and see the DynamoDB database created

amplify push --y

Now open the console and go to DynamoDB; you should see your table there.

Now that the table is created, we can create another Lambda function that pulls messages from the SQS queue and writes the data into our new table.

Creating Lambda function 2

We create another function with amplify add function

amplify add function
$ Select which capability you want to add: Lambda function (serverless function)
$ Provide a friendly name for your resource to be used as a label for this category in the project: addMarketDataToDB
$ Provide the AWS Lambda function name: addMarketDataToDB
$ Choose the runtime that you want to use: NodeJS
$ Choose the function template that you want to use: Hello World
$ Do you want to access other resources in this project from your Lambda function? Yes
$ Select the category storage
Storage category has a resource called marketdata
$ Select the operations you want to permit for marketdata create

You can access the following resource attributes as environment variables from your Lambda function
        ENV
        REGION
        STORAGE_MARKETDATA_ARN
        STORAGE_MARKETDATA_NAME
$ Do you want to invoke this function on a recurring schedule? No
$ Do you want to configure Lambda layers for this function? No
$ Do you want to edit the local lambda function now? (Y/n) y

cd into the new function's source folder

cd amplify/backend/function/addMarketDataToDB/src

Add the following dependencies to the top of your index.js

const AWS = require("aws-sdk");
const docClient = new AWS.DynamoDB.DocumentClient();

Here is our full function:

/* Amplify Params - DO NOT EDIT
    ENV
    REGION
    STORAGE_MARKETDATA_ARN
    STORAGE_MARKETDATA_NAME
Amplify Params - DO NOT EDIT */

const AWS = require("aws-sdk");
const docClient = new AWS.DynamoDB.DocumentClient();

exports.handler = async (event, context) => {
  // With a batch size of 1, each invocation receives exactly one SQS record
  const { body } = event.Records[0];
  const parsedBody = JSON.parse(body);
  const timestamp = new Date().toISOString();
  const params = {
    // Table name set by the Amplify CLI (marketdata-dev in the dev environment)
    TableName: process.env.STORAGE_MARKETDATA_NAME,
    Item: {
      id: context.awsRequestId, // Uses the request ID of this Lambda invocation from context
      timestamp: timestamp,
      ...parsedBody,
    },
  };
  console.log(params);
  await docClient
    .put(params)
    .promise()
    .then((data) => console.log("Success!", data))
    .catch((err) => {
      console.log("error!", err);
      throw err; // Rethrow so the message returns to the queue instead of being lost
    });
};

/* SQS event looks like this.
{
    "Records": [
      {
        "messageId": "19dd0b57-b21e-4ac1-bd88-01bbb068cb78",
        "receiptHandle": "MessageReceiptHandle",
        "body": "{\n      \"symbol\": \"FMG\",\n      \"name\": \"FMG Fortescue Metals Group Ltd\",\n      \"price\": \"$20.63\",\n      \"daily_change\": \"+1.23%\",\n      \"yearly_change\": \"+113.78%\",\n      \"marketcap\": \"$63.52 B\"\n    }",
        "attributes": {
          "ApproximateReceiveCount": "1",
          "SentTimestamp": "1523232000000",
          "SenderId": "123456789012",
          "ApproximateFirstReceiveTimestamp": "1523232000001"
        },
        "messageAttributes": {},
        "md5OfBody": "{{{md5_of_body}}}",
        "eventSource": "aws:sqs",
        "eventSourceARN": "arn:aws:sqs:ap-southeast-2:123456789012:MyQueue",
        "awsRegion": "ap-southeast-2"
      }
    ]
}
*/
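
The handler above reads only event.Records[0], which is fine while the trigger's batch size is 1. If you later raise the batch size, each invocation can carry several records. The sketch below shows one way to turn every record into its own put request. recordsToPutParams is a hypothetical helper, and using the SQS messageId as the id (instead of the Lambda request ID) is an assumption made so that each item in a batch gets a distinct partition key. The second record uses a placeholder symbol, not real market data.

```javascript
// Hypothetical helper: map every SQS record in an event to DocumentClient put params.
function recordsToPutParams(event, tableName, timestamp) {
  return event.Records.map((record) => ({
    TableName: tableName,
    Item: {
      id: record.messageId, // Unique per message, so batched items do not collide
      timestamp: timestamp,
      ...JSON.parse(record.body),
    },
  }));
}

// Example event with two records, trimmed to the fields the helper reads
const sampleEvent = {
  Records: [
    { messageId: "msg-1", body: JSON.stringify({ symbol: "FMG", price: "$20.63" }) },
    { messageId: "msg-2", body: JSON.stringify({ symbol: "XYZ", price: "$0.00" }) }, // placeholder row
  ],
};

const putParams = recordsToPutParams(
  sampleEvent,
  "marketdata-dev",
  new Date().toISOString()
);
console.log(putParams.map((p) => p.Item.symbol));

// Inside the handler, with docClient defined as above:
// await Promise.all(putParams.map((p) => docClient.put(p).promise()));
```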


Update our cloud backend with another amplify push

amplify push --y 



Wait for the addMarketDataToDB function to appear in the console.

Once it has appeared, go to the Test tab and create a test event from this example SQS event.

{
    "Records": [
      {
        "messageId": "19dd0b57-b21e-4ac1-bd88-01bbb068cb78",
        "receiptHandle": "MessageReceiptHandle",
        "body": "{\n      \"symbol\": \"FMG\",\n      \"name\": \"FMG Fortescue Metals Group Ltd\",\n      \"price\": \"$20.63\",\n      \"daily_change\": \"+1.23%\",\n      \"yearly_change\": \"+113.78%\",\n      \"marketcap\": \"$63.52 B\"\n    }",
        "attributes": {
          "ApproximateReceiveCount": "1",
          "SentTimestamp": "1523232000000",
          "SenderId": "123456789012",
          "ApproximateFirstReceiveTimestamp": "1523232000001"
        },
        "messageAttributes": {},
        "md5OfBody": "{{{md5_of_body}}}",
        "eventSource": "aws:sqs",
        "eventSourceARN": "arn:aws:sqs:ap-southeast-2:123456789012:MyQueue",
        "awsRegion": "ap-southeast-2"
      }
    ]
}
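
If you want to test with other companies, wrapping an item in this envelope by hand gets tedious. Here is a small sketch of a helper that does it. makeSqsTestEvent is a hypothetical name, and only the body our handler reads, plus a few identifying fields copied from the example event, are filled in.

```javascript
// Hypothetical helper: wrap an item in a minimal SQS-style event for the Lambda Test tab.
function makeSqsTestEvent(item) {
  return {
    Records: [
      {
        messageId: "19dd0b57-b21e-4ac1-bd88-01bbb068cb78",
        receiptHandle: "MessageReceiptHandle",
        body: JSON.stringify(item), // The handler calls JSON.parse on this string
        eventSource: "aws:sqs",
        awsRegion: "ap-southeast-2",
      },
    ],
  };
}

const testEvent = makeSqsTestEvent({
  symbol: "FMG",
  name: "FMG Fortescue Metals Group Ltd",
  price: "$20.63",
  daily_change: "+1.23%",
  yearly_change: "+113.78%",
  marketcap: "$63.52 B",
});

// Print JSON that can be pasted into the console's test event editor
console.log(JSON.stringify(testEvent, null, 2));
```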

If this is working, you should now see an entry in your DynamoDB table containing the FMG data from the test event.

Next, we need to configure the Lambda function to receive messages from the SQS queue.
Go back to your Lambda function in the console.

  1. Click Triggers
  2. Find SQS
  3. Enter the name of your SQS queue
  4. Set the batch size to 1
  5. Click Create
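
The same trigger can also be created from code rather than the console. As a sketch, the snippet below builds the parameters for the Lambda API's createEventSourceMapping call. The buildTriggerParams helper is hypothetical, the function name is a placeholder for whatever name your function has in the console, and the queue ARN reuses the placeholder account number from earlier.

```javascript
// Hypothetical helper: parameters for Lambda's createEventSourceMapping API call.
function buildTriggerParams(functionName, queueArn) {
  return {
    FunctionName: functionName,
    EventSourceArn: queueArn,
    BatchSize: 1, // One message per invocation, matching step 4 above
    Enabled: true,
  };
}

const triggerParams = buildTriggerParams(
  "addMarketDataToDB-dev", // placeholder: use the name shown in your Lambda console
  "arn:aws:sqs:ap-southeast-2:012345678:Market-Data-Que"
);
console.log(triggerParams);

// With the aws-sdk package available, the call would look like:
// const AWS = require("aws-sdk");
// await new AWS.Lambda({ region: "ap-southeast-2" })
//   .createEventSourceMapping(triggerParams)
//   .promise();
```

Whichever route you take, the function's role needs permission to receive and delete messages from the queue, or the trigger will fail to be created.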

Once the trigger is created, go to your SQS queue and you will notice the message count decreasing: Lambda is now polling the queue and inserting the data into the table.

Now you have a fully automated web scraper running every hour. To test the whole process, just wait an hour or manually trigger the first Lambda function we created!
