AlterLab

Posted on • Edited on • Originally published at alterlab.io

# Build a Web Scraping Pipeline with n8n and AlterLab

n8n is a workflow automation platform built around HTTP nodes, visual routing, and an in-process JavaScript runtime. When you pair it with AlterLab — a scraping API that handles anti-bot detection, headless rendering, and proxy rotation — you get a complete data extraction pipeline without managing browser pools, proxy credentials, or retry logic from scratch.

This tutorial builds a production-ready pipeline: URL inputs → scraping API → HTML parsing → structured storage, driven by a cron schedule with proper error handling.

## Prerequisites

- n8n instance (self-hosted via Docker or n8n Cloud)
- API key — follow the quickstart guide to get one in under two minutes
- Familiarity with n8n's workflow editor and basic JavaScript

## Step 1: Store the API Key in n8n Credentials

Never hardcode secrets into HTTP Request nodes. Go to Settings → Credentials → Add Credential → Header Auth and fill in:

| Field | Value |
|-------|-------|
| Name | Scraping API Key |
| Header Name | `X-API-Key` |
| Header Value | `YOUR_API_KEY` |

Reference this credential in every HTTP Request node in the workflow. Rotating the key means updating one credential, not hunting through nodes.
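The same rule applies to the standalone Python snippets later in this post: read the key from the environment rather than pasting it into source. A minimal sketch — the `ALTERLAB_API_KEY` variable name is an assumption, not something the API prescribes:

```python
import os

def get_api_key() -> str:
    """Read the scraping API key from the environment instead of source code."""
    key = os.environ.get("ALTERLAB_API_KEY")  # hypothetical variable name
    if not key:
        raise RuntimeError("ALTERLAB_API_KEY is not set")
    return key

# key = get_api_key()  # raises unless the variable is exported in the shell
```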


## Step 2: Configure the HTTP Request Node

Drop an HTTP Request node into the canvas. Set Method to POST, URL to https://api.alterlab.io/v1/scrape, authenticate with the credential created above, and set Body Content Type to JSON.

```json title="HTTP Request — Payload"
{
  "url": "https://books.toscrape.com/catalogue/page-1.html",
  "render_js": false,
  "premium_proxy": false,
  "country": "us",
  "timeout": 30000
}
```
For targets protected by Cloudflare, Akamai, or PerimeterX, set `render_js: true` and `premium_proxy: true`. The [anti-bot bypass](https://alterlab.io/anti-bot-bypass-api) layer handles TLS fingerprinting, browser emulation, and CAPTCHA solving transparently — no extra configuration on your end.
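As an illustration — the URL here is a placeholder — the payload for such a target keeps the same fields with those two flags flipped:

```json title="HTTP Request — Payload (protected target)"
{
  "url": "https://example-protected-site.com/products",
  "render_js": true,
  "premium_proxy": true,
  "country": "us",
  "timeout": 30000
}
```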

The same request in cURL for testing before wiring into n8n:



```bash title="Terminal"
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://books.toscrape.com/catalogue/page-1.html",
    "render_js": false,
    "premium_proxy": false
  }'
```

The equivalent single-URL call in Python:

```python title="single_scrape.py" {7-12}
import httpx

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.alterlab.io/v1/scrape"

def scrape(url: str, render_js: bool = False) -> dict:
    with httpx.Client() as client:  # synchronous single fetch
        r = client.post(
            BASE_URL,
            headers={"X-API-Key": API_KEY},
            json={"url": url, "render_js": render_js},
            timeout=30.0,
        )
        r.raise_for_status()
        return r.json()

result = scrape("https://books.toscrape.com/catalogue/page-1.html")
print(result["status_code"], result["elapsed_ms"], "ms")
```

The API response shape:



```json title="API Response"
{
  "success": true,
  "status_code": 200,
  "url": "https://books.toscrape.com/catalogue/page-1.html",
  "html": "<!DOCTYPE html>...",
  "elapsed_ms": 712
}
```

Run the cURL command above against a live target to see the response before building the rest of the pipeline.


## Step 3: Parse HTML in the Code Node

Add a Code node immediately after the HTTP Request. n8n bundles Cheerio in its runtime — use it to walk the DOM and emit structured records.

```javascript title="n8n Code Node — Extract Book Listings" {7-18}
const { load } = require('cheerio');

const results = [];

for (const item of $input.all()) {
  const $ = load(item.json.html);

  $('article.product_pod').each((_, el) => { // iterate product cards
    const title = $(el).find('h3 a').attr('title');
    const price = $(el).find('.price_color').text().trim();
    const rating = $(el).find('p.star-rating').attr('class')?.split(' ')[1];
    const relHref = $(el).find('h3 a').attr('href');

    results.push({ // emit flat record
      title,
      price,
      rating,
      url: `https://books.toscrape.com/catalogue/${relHref}`,
      scraped_at: new Date().toISOString(),
    });
  });
}

return results.map(r => ({ json: r }));
```
For targets that return JSON from an XHR endpoint (scraped through the proxy), skip Cheerio and parse directly:



```javascript title="n8n Code Node — Parse JSON from html Field" {2-3}
const raw = $input.first().json.html;
const data = JSON.parse(raw);            // html field contains the raw JSON string
return data.products.map(p => ({ json: p }));
```

If Cheerio is missing in a self-hosted setup, run `npm install cheerio` in the n8n working directory, allow it via the `NODE_FUNCTION_ALLOW_EXTERNAL` environment variable, and restart the service.



## Step 4: Scrape Multiple Pages

Use a Code node to generate a URL list, then feed it through Split In Batches → HTTP Request:

```javascript title="n8n Code Node — Generate Paginated URL List" {3-6}
const BASE = 'https://books.toscrape.com/catalogue/page-';
const PAGES = 50;

const urls = Array.from( // generate range of page URLs
  { length: PAGES },
  (_, i) => ({ json: { url: `${BASE}${i + 1}.html` } })
);

return urls;
```
Set **Split In Batches** to a batch size of 5 to avoid hammering the target. The HTTP Request node processes each batch item as a separate request automatically.
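The batching logic itself is simple; for intuition, here is the same idea in a few lines of Python (a standalone sketch, not n8n code):

```python
def chunk(items: list, size: int = 5):
    """Yield successive batches of at most `size` items (mirrors Split In Batches)."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

urls = [f"https://books.toscrape.com/catalogue/page-{i}.html" for i in range(1, 13)]
batches = list(chunk(urls, 5))
print([len(b) for b in batches])  # → [5, 5, 2]
```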

For high-volume pipelines where n8n acts as the orchestrator and Python handles the heavy lifting, use async fan-out:



```python title="batch_scrape.py" {15-21}
import asyncio

import httpx

API_KEY  = "YOUR_API_KEY"
ENDPOINT = "https://api.alterlab.io/v1/scrape"

async def fetch(client: httpx.AsyncClient, url: str) -> dict:
    r = await client.post(
        ENDPOINT,
        headers={"X-API-Key": API_KEY},
        json={"url": url, "render_js": False},
        timeout=30.0,
    )
    r.raise_for_status()
    return r.json()

async def scrape_batch(urls: list[str]) -> list[dict]:  # fan-out entry point
    async with httpx.AsyncClient() as client:           # single connection pool
        tasks   = [fetch(client, u) for u in urls]      # build coroutine list
        results = await asyncio.gather(*tasks, return_exceptions=True)
        return results

if __name__ == "__main__":
    pages = [f"https://books.toscrape.com/catalogue/page-{i}.html" for i in range(1, 11)]
    data  = asyncio.run(scrape_batch(pages))

    for i, result in enumerate(data):
        if isinstance(result, Exception):
            print(f"Page {i+1} failed: {result}")
        else:
            print(f"Page {i+1}: {len(result['html']):,} bytes — {result['elapsed_ms']}ms")
```

The Python scraping API client wraps this pattern with built-in retry logic, concurrency throttling, and typed responses — worth switching to once you move beyond prototyping.
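That client's internals aren't shown here, but the retry-with-backoff pattern it refers to can be sketched in a few lines — illustrative only; the library's actual API may differ:

```python
import time

def with_retries(fn, attempts: int = 3, base_delay: float = 2.0):
    """Call fn(), retrying on failure with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            time.sleep(base_delay * (2 ** attempt))  # 2s, 4s, ...
```

Wrapping any of the scrape calls above then looks like `with_retries(lambda: scrape(url))`.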


## Step 5: Route Data to Storage

Wire the Code node output to whichever storage node fits your stack.

**Postgres** — recommended for structured pipelines:

- Node: Postgres, Operation: Insert, Table: `scraped_books`
- Map `title`, `price`, `rating`, `url`, `scraped_at` directly from Code node output fields

**Google Sheets** — minimal setup for low-volume runs:

- Node: Google Sheets, Operation: Append or Update
- Same column mapping

**Webhook forward** — for downstream microservices or event buses:

```json title="Webhook Payload"
{
  "source": "n8n-book-scraper",
  "run_id": "{{ $execution.id }}",
  "count": 20,
  "records": [
    { "title": "A Light in the Attic", "price": "£51.77", "rating": "Three", "url": "..." }
  ]
}
```
---

## Step 6: Schedule and Add Error Handling

Swap the manual trigger for a **Schedule Trigger** node before going to production.

| Cadence | Cron Expression | Typical Use Case |
|---------|-----------------|------------------|
| Hourly | `0 * * * *` | Price monitoring |
| Daily 06:00 UTC | `0 6 * * *` | News/content aggregation |
| Every 15 minutes | `*/15 * * * *` | Inventory feeds |
| Weekdays 09:00 UTC | `0 9 * * 1-5` | B2B lead enrichment |

For event-driven scraping — e.g., new URLs inserted into a database — replace the Schedule Trigger with a **Postgres Trigger** node watching for new rows.

**Error handling — configure before going live:**

1. HTTP Request node → enable **Retry On Fail**: 3 retries, 2000ms backoff
2. Code node → enable **Continue On Fail** if partial runs are acceptable
3. In **Settings → Error Workflow**, assign a dedicated workflow that captures and routes failures:



```javascript title="Error Workflow — Log Failures to Dead-Letter Table" {5-11}
// Runs inside the error workflow's Code node
const err = $input.first().json;

return [{
  json: {
    workflow:     err.workflow?.name,
    node:         err.execution?.lastNodeExecuted,   // which node threw
    message:      err.execution?.error?.message,
    failed_at:    new Date().toISOString(),
    execution_id: err.execution?.id,
  }
}];
```

Route the output to a Postgres scrape_errors table or a Slack node. Silent failures are harder to diagnose than loud ones.

| Approach | Anti-Bot Handling | Setup Time | Maintenance | Scaling | Cost Model |
|----------|-------------------|------------|-------------|---------|------------|
| DIY Playwright + Proxies | Manual (fingerprinting, stealth) | Days–weeks | High (browser updates, proxy churn) | Complex (concurrency, queueing) | Infrastructure + proxy fees |
| n8n + Scraping API | Automatic (TLS, CAPTCHA, headers) | <1 hour | Low (API versioned separately) | Batch nodes + API concurrency | Per successful request |
| Commercial ETL (Apify, etc.) | Varies by actor | Minutes (pre-built actors) | Low but opaque | Platform-managed | Platform subscription + compute |

## Monitoring Pipeline Health

Don't rely solely on n8n's execution log. Instrument your pipeline explicitly:

- Log `success: false` responses from the scraping API to a monitoring table — the API returns this field even on 200 responses if the target blocked the request
- Store `elapsed_ms` per run in a `scrape_metrics` table; an upward trend signals proxy pool degradation
- Row count guard — between the parser and the storage node, add a Code node that alerts if `results.length < EXPECTED_MINIMUM`:

```javascript title="n8n Code Node — Row Count Guard" {5-9}
const MINIMUM = 15; // expect at least 15 records per page

const count = $input.all().length;

if (count < MINIMUM) { // trigger alert path
  throw new Error(`Low yield: got ${count}, expected >= ${MINIMUM}`);
}

return $input.all(); // pass through if OK
```
Place this node between the Code parser and the storage node. When it throws, n8n's error workflow catches it.
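The `elapsed_ms` trend check can be made concrete with a small helper — a sketch with illustrative thresholds, not a prescribed alerting rule:

```python
from statistics import mean

def latency_degraded(samples_ms: list[float], window: int = 10, factor: float = 1.5) -> bool:
    """Flag degradation when the mean latency of the most recent `window`
    samples exceeds the mean of the earlier baseline by `factor`."""
    if len(samples_ms) < 2 * window:
        return False  # not enough history to compare
    baseline = mean(samples_ms[:-window])
    recent = mean(samples_ms[-window:])
    return recent > factor * baseline

print(latency_degraded([700] * 20 + [1500] * 10))  # → True
```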

---

## Takeaways

- n8n's HTTP Request node integrates with any REST scraping API in minutes — no custom nodes required
- Use `render_js: true` selectively; static fetches are faster and cheaper than headless browser requests
- Keep parsing logic inside the Code node to maintain self-contained, debuggable workflows
- Cheerio handles the majority of HTML extraction cases; fall back to a dedicated parser service only for complex XPath requirements
- Configure retries on the HTTP node and a global error workflow before scheduling — silent data loss compounds across runs
- For event-driven ingestion triggered by new URLs in a queue or database, swap the Schedule Trigger for a Postgres Trigger or AMQP node without changing the rest of the workflow