Running web scrapers on traditional VPS instances (like DigitalOcean or Linode) introduces unnecessary maintenance overhead. You have to manage PM2, handle memory leaks, rotate logs, and update Node.js versions.
A cleaner, more modern approach is to use a serverless architecture to orchestrate your data extraction.
The Serverless Stack
Instead of running the scraping code yourself, use a managed Actor like the Vinted Smart Scraper. This Actor exposes a REST API that allows you to trigger runs programmatically.
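One detail worth knowing about Apify's REST API: in URLs, an Actor is addressed as `username~actor-name` (tilde, not slash). A minimal sketch of normalizing the ID before building the run URL; `buildRunUrl` is an illustrative helper name, not part of any SDK:

```javascript
// Build the Apify "run Actor" endpoint URL.
// The API path uses a tilde between username and actor name,
// so we normalize the more familiar slash form here.
function buildRunUrl(actorId, token) {
  return `https://api.apify.com/v2/acts/${actorId.replace('/', '~')}/runs?token=${token}`;
}
```

Calling `buildRunUrl('kazkn/vinted-smart-scraper', token)` yields a URL with `kazkn~vinted-smart-scraper` in the path, which is the form the API accepts.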
An AWS Lambda function (or a Cloudflare Worker) can act as the orchestrator, triggered on a schedule by something like an Amazon EventBridge rule so no cron server is involved.
The Lambda Implementation (Node.js)
Here is a simplified example of how you can trigger the Apify Actor from an AWS Lambda function:
```javascript
// Node.js 18+ Lambda runtimes ship a global fetch, so no node-fetch
// dependency is needed. On older runtimes, require('node-fetch') instead.
exports.handler = async (event) => {
  const APIFY_TOKEN = process.env.APIFY_TOKEN;
  // Note: the Apify API expects a tilde between username and actor name.
  const ACTOR_ID = 'kazkn~vinted-smart-scraper';

  // The payload for the scraper
  const runInput = {
    startUrls: [{ url: 'https://www.vinted.fr/vetements?brand_id[]=53' }],
    maxItems: 50
  };

  const response = await fetch(
    `https://api.apify.com/v2/acts/${ACTOR_ID}/runs?token=${APIFY_TOKEN}`,
    {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(runInput)
    }
  );

  if (!response.ok) {
    throw new Error(`Apify API returned ${response.status}`);
  }

  const data = await response.json();
  console.log(`Run started with ID: ${data.data.id}`);
  return { statusCode: 200, body: 'Extraction triggered' };
};
```
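Starting the run is only half the job; the scraped items land in the run's dataset. A minimal sketch of collecting them, assuming Apify's run-status (`/v2/actor-runs/{runId}`) and dataset-items (`/v2/datasets/{datasetId}/items`) endpoints; `waitForRun` and `fetchItems` are illustrative names, and the `fetchImpl` parameter exists only so the sketch can be exercised without network access:

```javascript
// Terminal states an Apify run can end in.
const TERMINAL = ['SUCCEEDED', 'FAILED', 'ABORTED', 'TIMED-OUT'];

// Poll the run until it reaches a terminal state and return the run object,
// which includes defaultDatasetId. Assumes Node.js 18+ (global fetch).
async function waitForRun(runId, token, { fetchImpl = fetch, intervalMs = 5000 } = {}) {
  for (;;) {
    const res = await fetchImpl(`https://api.apify.com/v2/actor-runs/${runId}?token=${token}`);
    const { data } = await res.json();
    if (TERMINAL.includes(data.status)) return data;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Fetch the run's dataset items as a JSON array.
async function fetchItems(datasetId, token, { fetchImpl = fetch } = {}) {
  const res = await fetchImpl(
    `https://api.apify.com/v2/datasets/${datasetId}/items?token=${token}&format=json`
  );
  return res.json();
}
```

In practice, polling from inside the first Lambda wastes billed wait time; a cleaner design is to let Apify call a webhook when the run finishes and handle the results in a second, webhook-triggered function.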
At a handful of runs per day, this setup sits comfortably within the AWS Lambda free tier, and your extraction jobs fire on schedule without a single server to maintain.
Abstract your infrastructure today with the Vinted Smart Scraper.