
Lambda to scrape data using typescript & serverless


In this blog post we are going to do the following:

  • Write a Lambda function in Node.js/TypeScript to extract the following data from a website:
    • The title of the page
    • All images on the page
  • Store the extracted data in AWS S3

We will use the following Node packages for this project:

  • serverless (must be installed globally): helps us write & deploy the Lambda function
  • cheerio: parses the content of a web page into a jQuery-like object
  • axios: promise-based HTTP client for the browser and Node.js
  • exceljs: to read, manipulate and write spreadsheets
  • aws-sdk
  • serverless-offline: to run Lambda functions locally

Step 1: Install serverless globally

npm install -g serverless

Step 2: Create a new TypeScript-based project from the Serverless template library like this:

sls create --template aws-nodejs-typescript

Step 3: Install the required Node packages for this Lambda project:

npm install axios exceljs cheerio aws-sdk
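
Note that serverless-offline (which we will use in the next step to run the function locally) is not included in the command above. Assuming you want it available in the project, add it as a dev dependency:

npm install --save-dev serverless-offline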

Step 4: Add serverless-offline to the plugins list in serverless.ts:

plugins: ['serverless-webpack', 'serverless-offline']

Step 5: Add the S3 bucket name to the environment variables in serverless.ts like this:

environment: {
      AWS_NODEJS_CONNECTION_REUSE_ENABLED: '1',
      AWS_BUCKET_NAME: 'YOUR BUCKET NAME'
}

Step 6: Define your function in the serverless.ts file like this:

import type { AWS } from '@serverless/typescript';

const serverlessConfiguration: AWS = {
  service: 'scrapeContent',
  frameworkVersion: '2',
  custom: {
    webpack: {
      webpackConfig: './webpack.config.js',
      includeModules: true
    }
  },
  // Add the serverless-webpack plugin
  plugins: ['serverless-webpack', 'serverless-offline'],
  provider: {
    name: 'aws',
    runtime: 'nodejs14.x',
    apiGateway: {
      minimumCompressionSize: 1024,
    },
    environment: {
      AWS_NODEJS_CONNECTION_REUSE_ENABLED: '1',
      AWS_BUCKET_NAME: 'scrape-data-at-56'
    },
  },
  functions: {
    scrapeContent: {
      handler: 'handler.scrapeContent',
      events: [
        {
          http: {
            method: 'get',
            path: 'scrapeContent',
          }
        }
      ]
    }
  }
};

module.exports = serverlessConfiguration;
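
One thing to keep in mind for when you deploy (Step 10): the Lambda's execution role needs permission to read and write objects in your bucket. A minimal sketch, assuming the Serverless Framework v2 iamRoleStatements syntax and with YOUR_BUCKET_NAME as a placeholder for the bucket configured above, would be to add something like this under provider in serverless.ts:

iamRoleStatements: [
  {
    Effect: 'Allow',
    Action: ['s3:PutObject', 's3:GetObject'],
    Resource: 'arn:aws:s3:::YOUR_BUCKET_NAME/*'
  }
]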

Step 7: In your handler.ts file, define your function to do the following:

  • Receive the URL to scrape from the query string
  • Make a GET request to that URL using axios
  • Parse the response with cheerio
  • Extract the title and images from the parsed response, store them in a JSON file, and store all the image URLs in an Excel file
  • Upload the extracted files to S3

import { APIGatewayEvent } from "aws-lambda";
import "source-map-support/register";
import axios from "axios";
import * as cheerio from "cheerio";


import { badRequest, okResponse, errorResponse } from "./src/utils/responses";
import { scrape } from "./src/interface/scrape";
import { excel } from "./src/utils/excel";
import { getS3SignedUrl, uploadToS3 } from "./src/utils/awsWrapper";

export const scrapeContent = async (event: APIGatewayEvent, _context) => {

  try {

    if (!event.queryStringParameters?.url) {
      return badRequest;
    }

    //load the page
    const response = await axios.get(event.queryStringParameters?.url);
    const $ = cheerio.load(response.data);

    //extract title and all images on page
    const scrapeData = {} as scrape;
    scrapeData.images = [];
    scrapeData.url = event.queryStringParameters?.url;
    scrapeData.dateOfExtraction = new Date();
    scrapeData.title = $("title").text();
    $("img").each((_i, image) => {
      scrapeData.images.push({
        url: $(image).attr("src"),
        alt: $(image).attr("alt"),
      });
    });

    //add this data to an excel sheet and upload it to S3
    const excelSheet = await saveDataAsExcel(scrapeData);
    const objectKey = `${scrapeData.title.toLocaleLowerCase().replace(/ /g, '_')}_${new Date().getTime()}`;
    await uploadToS3({
      Bucket: process.env.AWS_BUCKET_NAME,
      Key: `${objectKey}.xlsx`,
      ContentType:
        'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
      Body: await excelSheet.workbook.xlsx.writeBuffer()
    });

    //Get signed url with an expiry date
    scrapeData.xlsxUrl = await getS3SignedUrl({
      Bucket: process.env.AWS_BUCKET_NAME,
      Key: `${objectKey}.xlsx`,
      Expires: 3600 //this is 60 minutes, change as per your requirements
    });

    //Upload the scraped data as a JSON file to S3
    await uploadToS3({
      Bucket: process.env.AWS_BUCKET_NAME,
      Key: `${objectKey}.json`,
      ContentType:
        'application/json',
      Body: JSON.stringify(scrapeData)
    });

    return okResponse(scrapeData);
  } catch (error) {
    return errorResponse(error);
  }
};

/**
 * Builds an Excel workbook from the scraped data
 * @param scrapeData the data extracted from the page
 * @returns the populated excel workbook wrapper
 */
async function saveDataAsExcel(scrapeData: scrape) {
  const workbook:excel = new excel({ headerRowFillColor: '046917', defaultFillColor: 'FFFFFF' });
  let worksheet = await workbook.addWorkSheet({ title: 'Scraped data' });
  workbook.addHeaderRow(worksheet, [
    "Title",
    "URL",
    "Date of extraction",
    "Images URL",
    "Image ALT Text"
  ]);

  workbook.addRow(
    worksheet,
    [
      scrapeData.title,
      scrapeData.url,
      scrapeData.dateOfExtraction.toDateString()
    ],
    { bold: false, fillColor: "ffffff" }
  );

  for (let image of scrapeData.images) {
    workbook.addRow(
      worksheet,
      [
        '', '', '',
        image.url,
        image.alt
      ],
      { bold: false, fillColor: "ffffff" }
    );
  }

  return workbook; 
}
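
The handler above imports a few small helper modules that are not shown in this post. Here is a minimal sketch of what they could look like, assuming the v2 aws-sdk and the field names used by the handler; the actual files in the repository may differ.

// src/interface/scrape.ts (sketch): shape of the scraped data used above
export interface scrape {
  url: string;
  title: string;
  dateOfExtraction: Date;
  images: { url?: string; alt?: string }[];
  xlsxUrl?: string;
}

// src/utils/responses.ts (sketch): minimal API Gateway response helpers
export const badRequest = {
  statusCode: 400,
  body: JSON.stringify({ message: "Please provide a url query string parameter" }),
};

export const okResponse = (data: unknown) => ({
  statusCode: 200,
  body: JSON.stringify(data),
});

export const errorResponse = (error: unknown) => ({
  statusCode: 500,
  body: JSON.stringify({ message: error instanceof Error ? error.message : String(error) }),
});

// src/utils/awsWrapper.ts (sketch): thin wrappers around the aws-sdk v2 S3 client
import { S3 } from "aws-sdk";

const s3 = new S3();

export const uploadToS3 = (params: S3.PutObjectRequest) => s3.upload(params).promise();

export const getS3SignedUrl = (params: { Bucket: string; Key: string; Expires: number }) =>
  s3.getSignedUrlPromise("getObject", params);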
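
Similarly, the excel utility is just a thin wrapper around exceljs. A rough sketch, assuming only the constructor options and methods that saveDataAsExcel above relies on:

// src/utils/excel.ts (sketch): tiny wrapper around exceljs
import * as ExcelJS from "exceljs";

interface rowOptions {
  bold: boolean;
  fillColor: string;
}

export class excel {
  workbook: ExcelJS.Workbook;
  private headerRowFillColor: string;
  private defaultFillColor: string;

  constructor(options: { headerRowFillColor: string; defaultFillColor: string }) {
    this.workbook = new ExcelJS.Workbook();
    this.headerRowFillColor = options.headerRowFillColor;
    this.defaultFillColor = options.defaultFillColor;
  }

  //exceljs addWorksheet is synchronous; this is async only because the handler awaits it
  async addWorkSheet(options: { title: string }): Promise<ExcelJS.Worksheet> {
    return this.workbook.addWorksheet(options.title);
  }

  addHeaderRow(worksheet: ExcelJS.Worksheet, headers: string[]) {
    this.addStyledRow(worksheet, headers, { bold: true, fillColor: this.headerRowFillColor });
  }

  addRow(worksheet: ExcelJS.Worksheet, values: string[], options: rowOptions) {
    this.addStyledRow(worksheet, values, options);
  }

  private addStyledRow(worksheet: ExcelJS.Worksheet, values: string[], options: rowOptions) {
    //add the row, then apply the font and a solid fill colour to each cell
    const row = worksheet.addRow(values);
    row.font = { bold: options.bold };
    row.eachCell((cell) => {
      cell.fill = {
        type: "pattern",
        pattern: "solid",
        fgColor: { argb: options.fillColor || this.defaultFillColor },
      };
    });
  }
}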

Step 8: Set your AWS access key and secret key in your environment like this:

export AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY
export AWS_SECRET_ACCESS_KEY=YOUR_SECRET_ACCESS_KEY

Step 9: You are now ready to run this function on your machine like this:

sls offline --stage local

Now you should be able to access your function from your machine like this: http://localhost:3000/local/scrapeContent?url=ANY_URL_YOU_WISH_TO_SCRAPE
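
For example, assuming you want to scrape example.com, you could test it with curl:

curl "http://localhost:3000/local/scrapeContent?url=https://example.com"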

Step 10: If you wish to deploy this Lambda function to your AWS account, you can do it like this:

sls deploy

You can check out this Lambda function from here.
