Prithwish Nath

Posted on May 19 • Originally published at Medium

5 Production Stacks for Live Data Ingestion at Scale (Without Getting Blocked)

#webdev #programming #ai #javascript

TL;DR: Most teams over-engineer data ingestion. They use Kafka before they’ve hit their first rate limit, or Playwright before they’ve checked the network tab. This guide shows five production-tested stacks for live data ingestion from minimal fetch + cron up to LLM agents calling Model Context Protocol (MCP) tools — with the specific failure mode each one solves.

What You’ll Learn

When a plain fetch loop with a cron job is truly all you need
How an agent calling MCP tools adapts where other methods can’t
The serverless + object storage pattern that handles high fan-out volume
How to add retries, idempotency, and replay to any I/O layer without rewriting it
When (and only when) to reach for a headless browser

The right ingestion stack is the one with the fewest moving parts that still handles your specific failure mode. Not someone else’s failure mode. Yours.

| Stack | Failure mode it solves | Vendor surface | Complexity | Cost floor | Real ceiling |
| --- | --- | --- | --- | --- | --- |
| 1. Bun/Node fetch + allowlist | Stable public APIs, no anti-bot story | None | Minimal | Free | Upstream rate limits, not your hardware. Most public APIs cap at 60–6,000 req/min. |
| 2. Agent + Bright Data MCP | High-complexity targets: anti-bot, JS rendering, multi-step flows, adaptive extraction, at moderate volume | Bright Data (MCP tooling w/ free tier) | Low | Bright Data free tier | Rapid: 5K req/mo free. Pro tools and `web_data_*` extractors bill separately; check Bright Data pricing before you rely on them. |
| 3. Serverless cron → object storage | Bursty ingest, fan-out, raw payload durability | Cloud provider only | Low-medium | ~$0 | Sub-request limit per invocation (50 Free / 1K Paid on CF Workers). Bypass with Queues. |
| 4. Durable workflow engine + swappable I/O | Flaky upstreams, retries, idempotency, replay | Workflow engine + optional proxy | Medium | Varies by engine | Concurrency, history size, replay behavior, and hosted usage limits vary; model fan-out before you scale. |
| 5. Minimal Playwright headless | JS-rendered pages, SPAs, click-to-render flows | Optional proxy vendor | Medium-high | Compute cost | Memory-bound. Each Chromium context ~200–500 MB. Parallelize based on your instance's RAM, not intuition. |

1. Bun/Node `fetch` + Allowlist — The Boring Baseline That Works

Documentation: MDN: Fetch API · Bun HTTP

License: N/A

Free Tier: N/A

Best for: Stable public APIs, RSS feeds, open datasets, internal tooling that hits your own endpoints, anything where the anti-bot story is that there is no anti-bot story.

What is the fetch + cron stack?

Plain fetch against a list of known-good URLs, on a timer, writing results to flat files or SQLite. No framework, queue, or service dependencies. A script you can read start to finish in five minutes.

Why use fetch + cron for live data ingestion?

I want to be honest about how often this is all you need. If you’re hitting stable public APIs — government data portals, RSS feeds, well-behaved JSON endpoints, open datasets that publish on a schedule — there is no failure mode that demands anything more than this.

This is the stack everything else is measured against. Before you add a workflow engine like Temporal/Trigger, agents, a proxy, or anything else — see if this one is enough. It usually is.

// bun run ingest.ts   

import { Database } from "bun:sqlite";  

const ALLOWLIST = [  
  "https://api.github.com/repos/vercel/next.js/releases",  
  "https://registry.npmjs.org/react",  
  "https://data.gov/some-dataset.json",  
  // whatever else  
];  

const db = new Database("./data.sqlite");  

db.run(`  
  CREATE TABLE IF NOT EXISTS raw_payloads (  
    id INTEGER PRIMARY KEY AUTOINCREMENT,  
    url TEXT,  
    fetched_at TEXT,  
    payload TEXT  
  )  
`);  

async function ingest() {  
  for (const url of ALLOWLIST) {  
    try {  
      const res = await fetch(url, {  
        headers: { "User-Agent": "my-ingest-bot/1.0" },  
        signal: AbortSignal.timeout(10_000),  
      });  

      if (!res.ok) {  
        console.warn(`[${res.status}] ${url}`);  
        continue;  
      }  

      const payload = await res.text();  

      db.run(  
        `INSERT INTO raw_payloads (url, fetched_at, payload) VALUES (?, ?, ?)`,  
        [url, new Date().toISOString(), payload],  
      );  
    } catch (err) {  
      console.error(`Failed: ${url}`, err);  
    }  
  }  
}  

ingest();

Run it with a system cron, a GitHub Actions schedule, or a simple setInterval. That's literally the whole stack.

How to Handle Pagination

Most real APIs paginate. A very simple, intuitive pattern is to keep one loop, store each payload, and stop only when the API gives no next pointer.

type Page = { items?: unknown[]; next_cursor?: string; next_url?: string };  

async function ingestPaginated(baseUrl: string) {  
  let url: string | null = baseUrl;  
  let page = 0;  

  while (url) {  
    const res = await fetch(url, {  
      headers: { "User-Agent": "my-ingest-bot/1.0" },  
      signal: AbortSignal.timeout(10_000),  
    });  
    if (!res.ok) {  
      console.warn(`[${res.status}] page ${page} — stopping`);  
      break;  
    }  

    const json = (await res.json()) as Page;  

    db.run(  
      `INSERT INTO raw_payloads (url, fetched_at, payload) VALUES (?, ?, ?)`,  
      [url, new Date().toISOString(), JSON.stringify(json)],  
    );  

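    // Prefer an explicit next_url; otherwise set the cursor via URL so it is
    // encoded and any query string already on baseUrl is preserved.
    url = json.next_url ?? null;
    if (!url && json.next_cursor) {
      const next = new URL(baseUrl);
      next.searchParams.set("cursor", json.next_cursor);
      url = next.toString();
    }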
    page++;  
    if (url) await new Promise((r) => setTimeout(r, 250)); // polite pacing  
  }  

  console.log(`Ingested ${page} pages from ${baseUrl}`);  
}

For offset-based APIs (?page=1&per_page=100), increment page and stop when the response is empty. For link-header APIs (Link: <url>; rel="next"), parse res.headers.get("link") for the next URL.
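For the link-header case, here is a minimal sketch of the parsing step. The function name `parseNextLink` is my own, and it only handles the common `<url>; rel="next"` shape, so treat it as a starting point rather than a complete RFC 8288 parser.

```typescript
// Returns the rel="next" target from a Link header, or null when there is none.
function parseNextLink(header: string | null): string | null {
  if (!header) return null;
  // Split only on commas that start a new <...> entry, so commas inside URLs survive.
  for (const part of header.split(/,\s*(?=<)/)) {
    const match = part.match(/^\s*<([^>]+)>\s*;(.*)$/);
    if (!match) continue;
    const target = match[1];
    const params = match[2];
    // rel may be quoted and may carry several space-separated values.
    const rel = params.match(/(?:^|;)\s*rel\s*=\s*"?([^";]+)"?/i);
    if (rel && rel[1].trim().split(/\s+/).includes("next")) return target;
  }
  return null;
}
```

Inside the pagination loop, the next URL would then come from `parseNextLink(res.headers.get("link"))`, and the loop ends when it returns null.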

When fetch + cron isn’t enough

You start getting 403s or rate-limited responses → Stack 2 (Agent + Bright Data MCP) or Stack 5 (Playwright)
You need retries with backoff and idempotency → Stack 4 (durable workflow engine)
Your list of URLs grows to thousands and you need fan-out → Stack 3 (serverless)
The pages require JS execution → Stack 5 (Playwright)
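The routing rules above can be written down as a small lookup, which is handy if you want the "do I need more than Stack 1?" check to live next to your monitoring. This is only a sketch: the `Symptoms` shape, the function name, and the 1,000-URL fan-out threshold are assumptions of mine, not numbers from any vendor.

```typescript
type Symptoms = {
  blockedOrRateLimited?: boolean; // 403s or rate-limited responses
  needsRetriesAndIdempotency?: boolean;
  urlCount?: number;
  needsJsExecution?: boolean;
};

// Maps observed symptoms to the stack numbers worth evaluating next.
function nextStacks(s: Symptoms, fanOutThreshold = 1000): number[] {
  const stacks = new Set<number>();
  if (s.blockedOrRateLimited) {
    stacks.add(2);
    stacks.add(5);
  }
  if (s.needsRetriesAndIdempotency) stacks.add(4);
  if ((s.urlCount ?? 0) >= fanOutThreshold) stacks.add(3);
  if (s.needsJsExecution) stacks.add(5);
  // No symptoms at all means the boring baseline is still enough.
  return stacks.size > 0 ? Array.from(stacks).sort((a, b) => a - b) : [1];
}
```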

What I got wrong

Always store the raw payload, just in case! I built a pipeline that extracted five fields and stored them in a normalized SQLite table, and I didn’t see the point of keeping the raw response. Three weeks later I needed a sixth field, one that had been in every response the whole time. Re-fetching that government dataset took four days because of their rate limits. So, yeah. The disk cost of storing res.text() verbatim is trivial. The cost of discovering after the fact that your schema was wrong, then wasting time and money re-ingesting, is decidedly not. 😅 You can parse however you want downstream, separately, later.
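To make "parse later" concrete, here is a rough sketch of a downstream projection over stored raw rows. The `RawRow` type mirrors the raw_payloads columns from the script above; `projectFields` is a hypothetical helper, and it assumes each payload is a JSON object with the fields at the top level.

```typescript
type RawRow = { url: string; fetched_at: string; payload: string };

// Pulls the requested top-level fields out of raw JSON payloads.
// Adding a sixth field later is a new call, not a re-fetch.
function projectFields(rows: RawRow[], fields: string[]): Record<string, unknown>[] {
  const out: Record<string, unknown>[] = [];
  for (const row of rows) {
    let parsed: unknown;
    try {
      parsed = JSON.parse(row.payload);
    } catch {
      continue; // non-JSON payloads are skipped here but stay stored verbatim
    }
    if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) continue;
    const record = parsed as Record<string, unknown>;
    const projected: Record<string, unknown> = { url: row.url, fetched_at: row.fetched_at };
    for (const field of fields) projected[field] = record[field] ?? null;
    out.push(projected);
  }
  return out;
}
```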

2. Agent + Bright Data MCP — Complexity Scale Without the Infra Tax

Repository: https://github.com/brightdata/brightdata-mcp

License: MIT

Free Tier: 5,000 requests/month

Best for: High-complexity, moderate-volume targets (competitive pricing, job market analysis, funding data, SERP monitoring) where the hard problem is anti-bot, adaptive navigation, or frequent site changes; not throughput.

What is the agent + Bright Data MCP stack?

An agent loop — running in the IDE, as a headless script, or on a schedule — wired to Bright Data’s MCP (Model Context Protocol) server as its acquisition layer. The agent calls MCP tools, gets structured data back, and writes results to whatever sink fits your pipeline — files, NDJSON, a database, object storage — without fixed selectors, without a scraping framework, and without proxy infra you operate yourself.

Why use Bright Data MCP for agentic data extraction?

Most guides treat “scale” as a throughput problem. This stack solves a different one: complexity scale — targets where the hard problem is defeating defenses, not managing volume.

A traditional scraper against a heavily defended site is a maintenance contract. Every DOM restructure breaks a selector, every bot detection upgrade breaks your fingerprint, and every geo-block breaks your IP pool. You spend more time maintaining the scraper than using the data. An agent in the loop sidesteps this — it reads what’s on the page and derives the extraction schema from the current DOM, the same way a person would. When the site changes, the agent adapts — autonomously configuring + calling into Bright Data primitives like its proxy network for bot bypass, Scraping Browser when a real browser session is required, and the SERP API when you need to hit Google/Bing etc.

This is also the right stack when you need adaptive acquisition — targets where the data you want depends on what you find at each step. Navigating a site to a specific product variant, following a pagination trail that changes shape, clicking through a login flow — these aren’t hard for a browser-capable agent and are genuinely painful to script deterministically. The proven production pattern here is LLM as orchestrator, pre-built tools as the acquisition layer — which is exactly what MCP provides.

An MCP client like Claude Desktop or Cursor is just the easiest entry point:

Basic setup:

{  
  "mcpServers": {  
    "Bright Data": {  
      "command": "npx",  
      "args": [  
        "mcp-remote",  
        "https://mcp.brightdata.com/mcp?token=YOUR_API_TOKEN"  
      ]  
    }  
  }  
}

For local + advanced config:

{  
  "mcpServers": {  
    "Bright Data": {  
      "command": "npx",  
      "args": ["@brightdata/mcp"],  
      "env": {  
        "API_TOKEN": "YOUR_API_TOKEN",  
        "PRO_MODE": "true",  
        "WEB_UNLOCKER_ZONE": "custom",  
        "BROWSER_ZONE": "custom_browser"  
      }  
    }  
  }  
}

But the same MCP wiring can also run headlessly from a script, triggered by cron, or invoked from any orchestrator. The agent loop is not coupled to a GUI.

How to run Bright Data MCP headlessly (without Cursor or Claude Desktop)

You don’t actually need Cursor, Claude Desktop, or any hosted client. The MCP TypeScript SDK gives you a reference Client: install @modelcontextprotocol/client (or use the umbrella @modelcontextprotocol/sdk with the .../client/*.js paths). That client can spawn @brightdata/mcp as a subprocess, negotiate the MCP handshake, and expose listTools() / callTool() the same way an IDE-hosted MCP client does.

It looks something like this:

import { Client } from "@modelcontextprotocol/client";  
import { StdioClientTransport } from "@modelcontextprotocol/client/stdio";  

const client = new Client({ name: "ingest-client", version: "1.0.0" });  
const transport = new StdioClientTransport({  
  command: "npx",  
  args: ["@brightdata/mcp"],  
  env: { ...process.env, API_TOKEN: process.env.API_TOKEN!, PRO_MODE: "true" },  
});  

await client.connect(transport);  

// Call any Bright Data MCP tool (SDK: distinguish result.isError from thrown ProtocolError/SdkError)  
const result = await client.callTool({  
  name: "scrape_as_markdown",  
  arguments: { url: "https://example.com" },  
});

For the hosted endpoint instead of stdio, swap StdioClientTransport for StreamableHTTPClientTransport and point it at https://mcp.brightdata.com/mcp?token=…. The MCP TypeScript SDK client guide covers transports and error handling.
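Whichever transport you pick, the tool result still needs unpacking before it goes to a sink. MCP tool results carry a `content` array of blocks plus an optional `isError` flag; the sketch below flattens the text blocks and turns a tool-level error into an exception. `ToolResultLike` and `toolText` are local names I made up so the example stands alone, not types exported by the SDK.

```typescript
type ContentBlock = { type: string; text?: string };
type ToolResultLike = { content?: ContentBlock[]; isError?: boolean };

// Joins all text blocks of a tool result; throws when the tool itself reported failure.
function toolText(result: ToolResultLike): string {
  const text = (result.content ?? [])
    .filter((block) => block.type === "text" && typeof block.text === "string")
    .map((block) => block.text as string)
    .join("\n");
  if (result.isError) {
    throw new Error(`MCP tool reported an error: ${text || "(no message)"}`);
  }
  return text;
}
```

In the script above this would be called as something like `toolText(result as ToolResultLike)`. Protocol-level failures surface as thrown errors from callTool() itself, so they still need their own try/catch.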

What I got wrong

To explain what I did wrong, first, here are the tools included in the Bright Data MCP:

**scrape_as_markdown** / **scrape_as_html** — General-purpose scraping with bot bypass
**search_engine** — SERP (search engine results page) data without writing a scraper
**navigate** / **click** / **type** — Full browser automation for flows that require interaction
60+ specialized extractors for Amazon, LinkedIn, Crunchbase, Yahoo Finance, and more

The first lesson was a tier/billing one. The default Rapid mode gives you search and Web Unlocker-backed scraping, but not the 60+ Pro tools, browser automation, or the `web_data_*` APIs. Those require PRO_MODE=true, which is pay-as-you-go on top of the free tier. I'd skimmed the marketing copy and missed that footnote entirely.

The second lesson was an API semantics one. I had lowered POLLING_TIMEOUT thinking it was a standard request timeout. It isn't — those web_data_* tools submit a background data-collection job and then poll for the result, and POLLING_TIMEOUT controls how long that polling is allowed to run. Slow extractions just need more time. BASE_TIMEOUT and BASE_MAX_RETRIES are what you actually want for the base tools (search_engine, scrape_as_markdown) — they don't affect the polling path at all.

3. Serverless Cron + Object Storage — Disposable Compute, Durable Data

Repository: https://github.com/cloudflare/workers-sdk

Documentation: Cloudflare Workers · Cloudflare R2 · AWS Lambda

License: MIT

Free Tier: Cloud provider free tiers (limits apply — e.g. Workers limits)

Best for: High-volume URL lists, fan-out workloads, raw payload archiving.

What is the serverless cron + object storage pattern?

Basically, Cloudflare Workers / AWS Lambda + Cloudflare R2/AWS S3 + optional manifest (DynamoDB, Postgres, or a key in R2 itself). A short-lived worker that fires on a schedule, fetches one or more payloads, and lands them in object storage as raw files. The compute is fully disposable. The storage is durable. A small manifest (optional but useful) tracks what’s been fetched and when.

Why use serverless workers + R2/S3 for fan-out ingestion?

This is the pattern I reach for when I need fan-out. If Stack 1 is “one script, one machine, one process,” this is “N concurrent workers, each responsible for a slice of the work, all landing to the same durable sink.” You can ingest a thousand URLs in parallel without managing a server, and the raw payloads survive whatever happens to the compute.

The interesting design decision is the separation: when you run (cron) is completely decoupled from what does the fetching (the worker). That worker is a pure function — give it a URL, it gives you a payload in storage. You can swap the acquisition layer (direct fetch today, proxy tomorrow) without touching the scheduling or the storage format.

// Cloudflare Worker (wrangler.toml has crons configured)  

export default {  
  async scheduled(event: ScheduledEvent, env: Env, ctx: ExecutionContext) {  
    const urls = await getUrlBatch(env); // from KV, D1, or hardcoded slice  

    await Promise.allSettled(  
      urls.map(async (url) => {  
        try {  
          const res = await fetch(url, {
            // Cap per-request wait so one bad host doesn't stall the batch (good hygiene, not CF's max wall time)
            signal: AbortSignal.timeout(25_000),
          });

          if (!res.ok) return;

          const key = `raw/${new Date().toISOString().slice(0, 10)}/${encodeURIComponent(url)}.json`;

          await env.BUCKET.put(key, res.body, {
            httpMetadata: { contentType: "application/json" },
            customMetadata: {
              source_url: url,
              fetched_at: new Date().toISOString(),
              status: String(res.status),
            },
          });
        } catch (err) {  
          console.error(`Failed: ${url}`, err);  
        }  
      }),  
    );  
  },  
};

wrangler.toml

[[r2_buckets]]  
binding = "BUCKET"  
bucket_name = "my-ingest-bucket"  

[triggers]  
crons = ["0 * * * *"]

Do I need a manifest for serverless ingest?

You don’t always need one. But if you need to know “did I already ingest this URL today?” or “which keys are new since the last pipeline run?”, a manifest pays for itself fast. Cheapest would be a JSON file in R2 itself, keyed by date. Need more? Try a DynamoDB table or a single Postgres table with (url, date, key) rows.
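Here is a minimal sketch of those two manifest questions, assuming the manifest is just an array of (url, date, key) entries loaded from wherever you keep it. The type and function names are illustrative.

```typescript
type ManifestEntry = { url: string; date: string; key: string }; // date is YYYY-MM-DD

// "Did I already ingest this URL today?"
function alreadyIngested(manifest: ManifestEntry[], url: string, date: string): boolean {
  return manifest.some((entry) => entry.url === url && entry.date === date);
}

// "Which keys are new since the last pipeline run?"
function newKeysSince(manifest: ManifestEntry[], lastRunDate: string): string[] {
  // ISO dates sort lexicographically, so a string comparison is enough.
  return manifest.filter((entry) => entry.date > lastRunDate).map((entry) => entry.key);
}
```

For a JSON-file manifest this is the whole query layer. With DynamoDB or Postgres the same two questions become a key lookup and a range query.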

What I got wrong

The biggest gotcha is that Cloudflare Workers silently caps outbound fetch calls at 50 sub-requests on the Free plan per invocation — the excess doesn't error, it simply doesn't fire. I learned this from the logs that weren't there. For any batch larger than that cap, you need to dispatch to Cloudflare Queues and process in smaller chunks.
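The chunking itself is the easy part. A sketch, with the batch size left as a parameter because the right number depends on your plan and on how many other sub-requests each URL costs you (I'd stay well under the cap rather than right at it):

```typescript
// Splits a URL list into batches small enough for one invocation each.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```

Each batch would then go out as one queue message, and the queue consumer runs the same fetch-and-store logic as the scheduled handler above.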

The other thing is that Workers has a wall-clock time limit — 30 seconds on the Free plan. When you fan out to 40 URLs and three of them are a slow government portal that takes 28 seconds to respond, those three will get cut off at the execution limit, and the logs will show nothing wrong. The only way I caught it was tracking expected vs. actually-written object counts in the manifest — when those numbers diverged, something had timed out quietly. Per-request AbortSignal.timeout helps, but a manifest count is the only reliable canary.
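That expected-vs-written check is small enough to sketch. The names here are mine, and it assumes you can list which source URLs actually landed (for example from the source_url metadata written alongside each object).

```typescript
type RunReport = { expected: number; written: number; missing: string[]; healthy: boolean };

// Compares the URLs a run was supposed to fetch against the ones that reached storage.
function reconcile(expectedUrls: string[], writtenUrls: string[]): RunReport {
  const written = new Set(writtenUrls);
  const missing = expectedUrls.filter((url) => !written.has(url));
  return {
    expected: expectedUrls.length,
    written: expectedUrls.length - missing.length,
    missing,
    healthy: missing.length === 0,
  };
}
```

Alerting on `healthy === false` catches the quiet timeouts that never show up as log lines.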

4. A Durable Workflow Engine + Swappable I/O — The Stable Orchestration Layer

Repository: Trigger.dev · Temporal

Documentation: Trigger.dev docs · Temporal docs

License: Varies by engine (Trigger.dev: Apache 2.0; Temporal: MIT)

Free Tier: Varies by engine — hosted usage tiers, self-hosted deployments, and managed-cloud limits all differ

Best for: Any ingest workload where “what failed and why” needs to be answerable, upstreams are flaky, or you’re running at a scale where silent failures are unacceptable.

What is the workflow engine stack?

A workflow engine (Trigger.dev, Temporal, Inngest, AWS Step Functions, etc.) handles retries, idempotency, scheduling, replay, and observability — while the actual acquisition is a swappable I/O step inside the workflow.

Why use Temporal, Inngest, or AWS Step Functions for resilient ingestion?

The counterintuitive argument here is that this stack isn’t heavier than Stack 3 in any meaningful sense — it just makes the complexity visible instead of hiding it in ad-hoc retry logic and try/catch soup.

Every serious ingestion system eventually needs automatic retries with backoff, deduplication of runs, visibility into what failed and why, and the ability to replay a failed run without re-running the whole pipeline. Trigger.dev, Temporal, Inngest, and Step Functions all live in this category. The APIs differ but the job is the same.

So with durable orchestration stabilized, your I/O step — direct fetch, proxy, browser job — is the interchangeable part. When your upstream starts blocking you, you change one function. The retries, scheduling, idempotency, and replay story stay intact.

Here’s a Trigger.dev example for this stack:

// trigger.dev tasks (see current SDK import path for your version, often `@trigger.dev/sdk`)
// writeToStorage() and getTargetUrls() are your own helpers; they are not defined here.
import { task, schedules, idempotencyKeys } from "@trigger.dev/sdk";

export const ingestUrl = task({
  id: "ingest-url",
  retry: {
    maxAttempts: 5,
    factor: 2,
    minTimeoutInMs: 1000,
    maxTimeoutInMs: 30_000,
  },
  run: async (payload: { url: string; date: string }) => {
    // Swap this block for proxy / Web Unlocker / browser job when direct fetch isn't enough.
    const res = await fetch(payload.url, {
      signal: AbortSignal.timeout(15_000),
    });
    // Throwing marks the attempt as failed so the retry policy above applies.
    if (!res.ok) throw new Error(`HTTP ${res.status}`);
    const data = await res.json();
    await writeToStorage(data, payload.url);
    return { success: true, url: payload.url };
  },
});

// Trigger a batch on a cron schedule. A declarative cron belongs on schedules.task(), not task().
export const ingestBatch = schedules.task({
  id: "ingest-batch",
  cron: "0 */6 * * *",
  run: async () => {  
    const urls = await getTargetUrls();  
    const date = new Date().toISOString().slice(0, 10);  
    const items = await Promise.all(  
      urls.map(async (url) => ({  
        payload: { url, date },  
        options: {  
          idempotencyKey: await idempotencyKeys.create(`ingest:${url}:${date}`, {  
            scope: "global",  
          }),  
        },  
      }))  
    );  
    await ingestUrl.batchTriggerAndWait(items);  
  },  
});
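To get a feel for what a retry block like the one above means in wall-clock terms, here is a generic exponential backoff calculation using the same four parameter names. This is a sketch of the textbook formula, not Trigger.dev's actual implementation, and it ignores jitter, so treat the numbers as approximate.

```typescript
type RetryPolicy = {
  maxAttempts: number;
  factor: number;
  minTimeoutInMs: number;
  maxTimeoutInMs: number;
};

// Delay before the nth retry (1-based): min * factor^(n-1), capped at max.
function backoffDelayMs(retry: number, policy: RetryPolicy): number {
  const raw = policy.minTimeoutInMs * Math.pow(policy.factor, retry - 1);
  return Math.min(policy.maxTimeoutInMs, raw);
}

// maxAttempts includes the first try, so there are maxAttempts - 1 retries.
function delaySchedule(policy: RetryPolicy): number[] {
  const retries = Math.max(0, policy.maxAttempts - 1);
  return Array.from({ length: retries }, (_, i) => backoffDelayMs(i + 1, policy));
}

const schedule = delaySchedule({ maxAttempts: 5, factor: 2, minTimeoutInMs: 1000, maxTimeoutInMs: 30_000 });
console.log(schedule); // [1000, 2000, 4000, 8000]
```

Under these assumptions a URL that keeps failing gives up after roughly 15 seconds of waiting plus the time spent in the five attempts themselves.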

When to swap the I/O step in a workflow

When the I/O step starts failing — consistent 403s, CAPTCHAs, geo-blocks — you replace fetch(url) with a call through a proxy or unlocker API. The retry logic, the scheduling, the idempotency — none of it changes. You changed one line. That's the payoff.

// Before  
const res = await fetch(url);  
// After — swap the I/O layer for a proxy or unlocker when direct fetch isn't enough  
// const res = await proxyClient.fetch(url);
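One way to keep that swap to a single line is to pass the I/O layer in as a function. The sketch below assumes a hypothetical `Fetcher` type and `acquire` helper; only the parts of the response the pipeline actually uses are typed, so a proxy client or a browser job can satisfy it without pretending to be a full Response.

```typescript
// Anything that can turn a URL into a status plus a body can be the I/O layer.
type Fetcher = (url: string) => Promise<{
  ok: boolean;
  status: number;
  text(): Promise<string>;
}>;

// The orchestration-facing step: it never needs to know which fetcher it was given.
async function acquire(url: string, fetcher: Fetcher): Promise<string> {
  const res = await fetcher(url);
  // Throwing is what hands the failure to the workflow engine's retry policy.
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  return res.text();
}

// Today's I/O layer: plain fetch. A proxy-backed fetcher would have the same shape.
const directFetcher: Fetcher = (url) => fetch(url);
```

Inside the task, `acquire(payload.url, directFetcher)` becomes `acquire(payload.url, someProxyFetcher)` and nothing else moves.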

What I got wrong

Don’t neglect idempotency keys! They feel optional until the first time you need to replay something, which is always, eventually.
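If it helps to see why the key matters, here is a toy model of what the engine does with it. The key format matches the `ingest:${url}:${date}` string used in the batch task above; `RunLedger` is an in-memory stand-in I invented for the durable dedupe store a real engine maintains.

```typescript
// Same URL + same day always produces the same key, no matter how often it is replayed.
function idempotencyKey(url: string, date: string): string {
  return `ingest:${url}:${date}`;
}

class RunLedger {
  private seen = new Set<string>();

  // Returns true if the work should run, false if this key was already claimed.
  claim(key: string): boolean {
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }
}
```

Replaying a whole batch then only re-runs the items whose keys were never claimed, which is the difference between a safe replay and a pile of duplicate rows.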

5. Minimal Playwright Headless — The Last Resort

Repository: https://github.com/microsoft/playwright

Documentation: https://playwright.dev/docs/intro

License: Apache 2.0

Free Tier: Unlimited (open source; you pay for compute/hosting)

Best for: SPAs, client-rendered dashboards, sites that gate content behind click interactions, any page that simply doesn’t exist until JavaScript runs.

What is the minimal Playwright headless stack?

One Playwright browser context, strict timeouts, screenshot and/or HTML captured to storage, with an optional residential/datacenter proxy. No parallel contexts unless you’ve actually measured you need them. No fancy orchestration unless you’ve already tried Stack 4 with a browser job.

When to use Playwright for scraping (and why it’s a last resort)

Headless browsers are the right tool for exactly one specific failure mode: the page doesn’t exist until JavaScript executes it. SPAs, client-side rendered dashboards, pages that require a click to reveal pricing — these can’t be handled by any of the previous stacks without adding a browser layer.

But headless is expensive. CPU, memory, time. A Playwright context consumes dramatically more resources than a fetch. You serialize concurrency. Cold starts on serverless are brutal. If you're reaching for Playwright because you might need it, DON'T. Try one of the above stacks first.
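If you do end up parallelizing, the comparison table's rough figure of 200–500 MB per Chromium context gives you a back-of-the-envelope ceiling. The sketch below is only that: the 500 MB default takes the pessimistic end of the range, and the 1 GB reserved for the OS and your own process is an assumption you should replace with a measured number.

```typescript
// Estimates how many browser contexts fit in memory, never returning less than one.
function maxParallelContexts(totalRamMb: number, perContextMb = 500, reservedMb = 1024): number {
  const usableMb = totalRamMb - reservedMb;
  // Below one context's worth of headroom, fall back to the minimal single-context setup.
  if (usableMb < perContextMb) return 1;
  return Math.floor(usableMb / perContextMb);
}

console.log(maxParallelContexts(4096)); // 6
```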

The minimal version of this stack will be familiar to most readers: one browser context, one page, strict timeouts, then write to disk. Proxies are optional.

import { chromium } from "playwright";  
import { writeFile } from "fs/promises";  

async function scrapeWithBrowser(url: string, outputDir: string) {  
  const browser = await chromium.launch({  
    headless: true,  
    args: [  
      "--no-sandbox",  
      "--disable-setuid-sandbox",  
      "--disable-dev-shm-usage", // critical for containerized environments  
    ],  
  });  

  const context = await browser.newContext({  
    // Proxy config goes here when you need it:  
    // proxy: { server: "http://proxy.brightdata.com:22225", username: "...", password: "..." },  
    userAgent:  
      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",  
    viewport: { width: 1280, height: 800 },  
  });  

  const page = await context.newPage();  

  try {  
    // Hard timeout on navigation — don't wait for every lazy-loaded asset  
    await page.goto(url, {  
      waitUntil: "domcontentloaded", // not 'networkidle' — too slow  
      timeout: 20_000,  
    });  

    // Wait for the element you actually want, not the whole page  
    await page.waitForSelector("[data-testid='price']", { timeout: 8_000 });  
    const html = await page.content();  
    const slug = encodeURIComponent(url).slice(0, 80);  
    const ts = Date.now();  
    await Promise.all([  
      writeFile(`${outputDir}/${slug}-${ts}.html`, html),  
      page.screenshot({  
        path: `${outputDir}/${slug}-${ts}.png`,  
        fullPage: false,  
      }),  
    ]);  

    return { success: true, url };  
  } finally {  
    // Always close — leaked contexts accumulate fast  
    await context.close();  
    await browser.close();  
  }  
}

When to add a proxy to your Playwright stack

Only when you’ve been blocked, definitively. Not before. The minimal version without a proxy will work on the vast majority of public pages. When you start seeing CAPTCHAs, bot challenges, or suspiciously empty responses, that’s when you plug in residential proxies or a scraping browser service (which handles fingerprinting and unblocking at the browser level).
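"Definitively blocked" is easier to act on if you measure it. Here is a rough heuristic for the three signals mentioned above; the phrase list and the 500-character threshold are guesses on my part, so tune them against pages you have actually been blocked on.

```typescript
type BlockSignal = { blocked: boolean; reason: string | null };

// Classifies a fetched page as blocked or not, using status code and body heuristics.
function detectBlock(status: number, html: string, minBodyLength = 500): BlockSignal {
  if (status === 403 || status === 429) {
    return { blocked: true, reason: `http_${status}` };
  }
  // Common challenge-page wording; extend with whatever your targets actually serve.
  if (/captcha|verify you are human|unusual traffic/i.test(html)) {
    return { blocked: true, reason: "challenge_page" };
  }
  if (html.trim().length < minBodyLength) {
    return { blocked: true, reason: "suspiciously_empty" };
  }
  return { blocked: false, reason: null };
}
```

Counting these per target over a few runs gives you the evidence for (or against) paying for a proxy.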

What I got wrong

Of course, the one gotcha that catches everyone on the containerization side: --disable-dev-shm-usage. I deployed without it. Container crashed with exit code 1 and no useful error. Two hours of debugging: Chromium was exhausting Docker's default 64MB /dev/shm. The flag routes Chromium's shared memory to /tmp instead. It is mandatory in any containerized environment. It is never in the first tutorial you read. It is always in the second incident report.

Decision Tree: Choosing Your Ingestion Stack

Start with the simplest viable ingestion layer, then add capabilities only as you actually need more complexity:

Stack 1 → Simple public APIs and small URL sets
Stack 2 → Agentic navigation, anti-bot handling, or validation
Stack 3 → Large-scale fan-out across many URLs
Stack 4 → Durable orchestration, retries, and observability
Stack 5 → JavaScript-rendered pages via headless browsers

Remember: these stacks are composable layers, not mutually exclusive choices.

The most common production pattern I’ve seen is Stack 4 (orchestration) wrapping Stack 5 (browser) or Stack 3 (serverless fan-out) as the I/O step.

Frequently Asked Questions (FAQ)

Q: What is the simplest data ingestion stack for a small project?

A: A plain fetch loop on a cron schedule writing to SQLite or flat files. No framework, no queue, no service dependencies — a script you can read end-to-end in five minutes. This works for the majority of stable public APIs, RSS feeds, and open datasets.

Q: Can I run Bright Data MCP without Claude Desktop or Cursor?

A: Yes. The official @modelcontextprotocol/sdk for TypeScript ships an MCP client (Client + StdioClientTransport) that spawns @brightdata/mcp as a subprocess from any Node script. From there, bridge the tool loop to whatever LLM you prefer: Anthropic's Messages API, Ollama's /api/chat with a local tool-calling model, or any OpenAI-compatible endpoint. No proprietary client install required.

Q: Should I store raw API payloads or only the parsed fields I need?

A: Always store the raw payload first. Schemas evolve, fields you didn’t think you needed become important, and re-fetching upstream data — especially rate-limited public APIs — can take days.

Q: When is it worth paying for a proxy or unblocking service?

A: When you’re seeing 403s, 429s, CAPTCHAs, or geo-blocks on your target. Not before. Vanilla fetch and a minimal Playwright context work on the vast majority of public pages. Add a paid proxy or unblocking layer only when you've measured the specific failure mode you're solving. The same applies to LLM-agent stacks: validate the data exists and has the shape you need before committing to paid infrastructure.

Q: How do I run an MCP server programmatically from a backend service?

A: Use the MCP TypeScript SDK’s Client with the transport that matches your deployment: StdioClientTransport for local subprocesses (e.g. npx @brightdata/mcp), or StreamableHTTPClientTransport / SSEClientTransport for a remote MCP URL like https://mcp.brightdata.com/mcp?token=…. Call client.connect(transport), then listTools() / callTool() exactly like an IDE would. The MCP TypeScript SDK client guide covers all transports.

Q: When should I combine stacks rather than pick one?

A: Almost always! These patterns are primitives, not solo architectures. The most common production shape is Stack 4 (durable orchestration and retries) wrapping Stack 5 (Playwright as the browser I/O step) or Stack 3 (serverless fan-out for parallel fetching). Stack 1 or 2 for ad-hoc validation before you commit to building any of it. Treat the decision tree as "which I/O layer do I need?" not "which complete system should I adopt?"

What stack are you running for live data ingestion? Have you run into walls I didn’t cover here? Drop it in the comments. 👇
