You’ve built a powerful Puppeteer script that scrapes product data from Alibaba. It works perfectly in your terminal, outputting clean JSON every time you run it. But then your RevOps manager asks for that same data to populate a live dashboard in Retool or a Google Sheet.
Business users don’t live in the terminal, and they certainly don't want to manage Node.js environments or npm dependencies just to check competitor pricing. To bridge this gap, you need to move the logic out of a standalone script and into a Node.js scraping API.
This guide explores how to transform a local Alibaba scraper into a production-ready microservice using Express.js. We’ll cover modular refactoring, concurrency management, and how to secure your endpoint so internal teams can access live data on demand.
High-Level Architecture
Before writing code, it helps to understand the request lifecycle. Traditional scraping is often scheduled, running once a day or week. "Data-on-Demand" is synchronous. A user clicks a button and expects a response within seconds.
The flow works like this:
- The Client: An internal tool or a simple `curl` command sends a GET request to your server with a search query.
- The Express API: Validates the request and checks for an API key.
- The Scraper Module: Spawns a Puppeteer instance, navigates Alibaba, and extracts the data.
- The Response: The server sends structured JSON back to the client and closes the browser instance.
Because browsers are resource-heavy, stability is the priority. Spawning 50 Chrome instances simultaneously will crash a standard VPS, so we need to implement safeguards.
Phase 1: Refactoring the Scraper for Modularity
Most CLI scrapers use a "flat" structure where logic runs immediately. To use this in an API, wrap it in an asynchronous function that returns data instead of just logging it to the console.
Create a file named scraper.js. This module will export a single function, scrapeAlibaba, to handle the browser lifecycle.
```javascript
const puppeteer = require('puppeteer');

/**
 * Scrapes Alibaba search results for a given keyword.
 * @param {string} keyword - The product to search for.
 * @param {number} limit - Maximum number of products to return.
 */
async function scrapeAlibaba(keyword, limit = 10) {
  // Use --no-sandbox for compatibility with many Linux environments
  const browser = await puppeteer.launch({
    headless: "new",
    args: ['--no-sandbox', '--disable-setuid-sandbox']
  });

  try {
    const page = await browser.newPage();

    // Set a realistic user agent to avoid immediate detection
    await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36');

    const searchUrl = `https://www.alibaba.com/trade/search?SearchText=${encodeURIComponent(keyword)}`;
    await page.goto(searchUrl, { waitUntil: 'networkidle2', timeout: 60000 });

    // Wait for the product cards to load
    await page.waitForSelector('.list-no-v2-main-list-element', { timeout: 10000 });

    const products = await page.evaluate((maxItems) => {
      const items = Array.from(document.querySelectorAll('.list-no-v2-main-list-element'));
      return items.slice(0, maxItems).map(item => ({
        title: item.querySelector('.search-card-e-title span')?.innerText || 'N/A',
        price: item.querySelector('.search-card-e-price-main')?.innerText || 'N/A',
        minOrder: item.querySelector('.search-card-e-v2-min-order')?.innerText || 'N/A',
        link: item.querySelector('a.search-card-e-slider__link')?.href || ''
      }));
    }, limit);

    return products;
  } catch (error) {
    console.error(`Scraping failed for ${keyword}:`, error.message);
    throw error;
  } finally {
    await browser.close();
  }
}

module.exports = { scrapeAlibaba };
```
By returning the products array and using a finally block to ensure browser.close() always runs, the scraper becomes a predictable, reusable unit.
Phase 2: Building the Express Server
To expose this function via an HTTP endpoint, use Express.js for its simplicity and middleware ecosystem.
First, install the packages:
```bash
npm install express puppeteer
```
Then, create server.js:
```javascript
const express = require('express');
const { scrapeAlibaba } = require('./scraper');

const app = express();
const PORT = process.env.PORT || 3000;

app.get('/api/search', async (req, res) => {
  const { q, limit } = req.query;

  if (!q) {
    return res.status(400).json({
      error: "Missing search query. Use ?q=keyword"
    });
  }

  try {
    const results = await scrapeAlibaba(q, parseInt(limit) || 10);
    res.json({
      success: true,
      query: q,
      timestamp: new Date().toISOString(),
      count: results.length,
      data: results
    });
  } catch (error) {
    res.status(500).json({
      success: false,
      error: "Failed to fetch data from Alibaba",
      details: error.message
    });
  }
});

app.listen(PORT, () => {
  console.log(`Alibaba Microservice running on http://localhost:${PORT}`);
});
```
Navigating to http://localhost:3000/api/search?q=mechanical+keyboard will now trigger a real-time scrape and return a JSON payload.
Phase 3: Handling Concurrency and Stability
The code above works for one user, but Puppeteer is resource-heavy. Each browser instance can consume 100MB to 500MB of RAM. If ten people hit the API at once, the server might run out of memory.
Limiting Concurrency
Use a simple counter to ensure you only run a set number of browsers at once. This prevents the server from being overwhelmed.
```javascript
let activeRequests = 0;
const MAX_CONCURRENT_SCRAPES = 3;

app.get('/api/search', async (req, res) => {
  if (activeRequests >= MAX_CONCURRENT_SCRAPES) {
    return res.status(503).json({ error: "Server busy. Try again later." });
  }

  activeRequests++;
  try {
    const results = await scrapeAlibaba(req.query.q);
    res.json({ data: results });
  } catch (error) {
    // Without this catch, a failed scrape would leave the request hanging
    res.status(500).json({ error: error.message });
  } finally {
    activeRequests--;
  }
});
```
Proxy Integration
Alibaba protects its data aggressively. Sending too many requests from a single IP will lead to CAPTCHAs. For a reliable service, use rotating residential proxies.
Update the puppeteer.launch configuration in scraper.js:
```javascript
const browser = await puppeteer.launch({
  args: [
    `--proxy-server=http://YOUR_PROXY_ADDRESS:PORT`,
    '--no-sandbox'
  ]
});

const page = await browser.newPage();
await page.authenticate({
  username: 'YOUR_USERNAME',
  password: 'YOUR_PASSWORD'
});
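To keep proxy credentials out of source control, one option is to read them from environment variables. The variable names below are examples, not a standard:

```javascript
// Illustrative: proxy details come from the environment, with safe defaults.
const PROXY_SERVER = process.env.PROXY_SERVER || '';

function buildLaunchArgs() {
  const args = ['--no-sandbox'];
  // Only add the proxy flag when a server is actually configured
  if (PROXY_SERVER) {
    args.push(`--proxy-server=${PROXY_SERVER}`);
  }
  return args;
}
```

Passing `buildLaunchArgs()` to `puppeteer.launch` lets the same code run locally without a proxy and in production with one.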
Phase 4: Error Handling and Response Formatting
A good API returns context. When building for internal tools, map scraping failures to appropriate HTTP status codes:
- 400 (Bad Request): No search term provided.
- 404 (Not Found): The scrape finished, but Alibaba returned zero results.
- 503 (Service Unavailable): The server is at capacity or the IP is blocked.
Standardizing the response format makes it easier for other developers to use the API. A consistent structure like { success: boolean, data: [], error: string } works best.
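That envelope can be centralized in a small helper so every route returns the same shape. `buildResponse` below is a hypothetical helper name, not code from the service above:

```javascript
// Hypothetical helper: wraps results (or an error) in the shared envelope.
function buildResponse({ query, data = [], error = null }) {
  return {
    success: !error,
    query,
    timestamp: new Date().toISOString(),
    count: data.length,
    data,
    error
  };
}
```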
Phase 5: Security and Integration
Never leave a scraping API open to the public. Since Puppeteer consumes significant resources, an unprotected endpoint can lead to high infrastructure costs or a crashed server.
Adding API Key Middleware
A simple middleware function can protect the route:
```javascript
// In production, load the key from an environment variable
// rather than committing it to source control
const API_KEY = process.env.API_KEY || "your-super-secret-key";

const authenticate = (req, res, next) => {
  const userKey = req.headers['x-api-key'];
  if (userKey && userKey === API_KEY) {
    next();
  } else {
    res.status(401).json({ error: "Unauthorized. Valid API key required." });
  }
};

app.get('/api/search', authenticate, async (req, res) => { ... });
```
Real-World Usage
With this API in place, the RevOps team can use a tool like Retool. They can drag a "Button" component onto a canvas, set it to trigger a REST API request to your microservice, and display the results in a table. They get live Alibaba data without writing any code.
To Wrap Up
Building a "Data-on-Demand" microservice transforms web scraping from a developer task into a company-wide asset. By wrapping Puppeteer in an Express API, you provide a clean, structured interface for non-technical stakeholders.
As you scale, consider these improvements:
- Caching: Use Redis to store search results for 24 hours to reduce proxy costs.
- Task Queues: For longer scrapes, move from a synchronous API to a background job system like BullMQ.
- Headless Management: Use services like Browserless.io to run Chrome instances on external infrastructure, keeping your API server lightweight.
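Before reaching for Redis, the caching idea can be prototyped with an in-memory `Map`. The sketch below uses illustrative names (`getCached`/`setCached`) and expires entries after a TTL; unlike Redis, it loses everything on restart:

```javascript
// Minimal in-memory TTL cache standing in for Redis.
const cache = new Map();

function setCached(key, value) {
  cache.set(key, { value, storedAt: Date.now() });
}

function getCached(key, ttlMs = 24 * 60 * 60 * 1000) {
  const entry = cache.get(key);
  if (!entry) return null;
  if (Date.now() - entry.storedAt > ttlMs) {
    cache.delete(key); // stale: force a fresh scrape next time
    return null;
  }
  return entry.value;
}
```

In the route handler, a `getCached(q)` hit can then be returned immediately, and `setCached(q, results)` called after each successful scrape.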
These patterns ensure your data extraction pipelines are accessible, resilient, and professional.