n8n is a workflow automation platform built around HTTP nodes, visual routing, and an in-process JavaScript runtime. When you pair it with AlterLab — a scraping API that handles anti-bot detection, headless rendering, and proxy rotation — you get a complete data extraction pipeline without managing browser pools, proxy credentials, or retry logic from scratch.
This tutorial builds a production-ready pipeline: URL inputs → scraping API → HTML parsing → structured storage, driven by a cron schedule with proper error handling.
## Prerequisites
- n8n instance (self-hosted via Docker or n8n Cloud)
- AlterLab API key — follow the quickstart guide to get one in under two minutes
- Familiarity with n8n's workflow editor and basic JavaScript
## Step 1: Store the API Key in n8n Credentials
Never hardcode secrets into HTTP Request nodes. Go to **Settings → Credentials → Add Credential → Header Auth** and fill in:
| Field | Value |
|---|---|
| Name | Scraping API Key |
| Header Name | X-API-Key |
| Header Value | YOUR_API_KEY |
Reference this credential in every HTTP Request node in the workflow. Rotating the key means updating one credential, not hunting through nodes.
## Step 2: Configure the HTTP Request Node

Drop an HTTP Request node onto the canvas. Set Method to POST, URL to https://api.alterlab.io/v1/scrape, authenticate with the credential created above, and set Body Content Type to JSON.
```json title="HTTP Request — Payload"
{
  "url": "https://books.toscrape.com/catalogue/page-1.html",
  "render_js": false,
  "premium_proxy": false,
  "country": "us",
  "timeout": 30000
}
```
For targets protected by Cloudflare, Akamai, or PerimeterX, set `render_js: true` and `premium_proxy: true`. The [anti-bot bypass](https://alterlab.io/anti-bot-bypass-api) layer handles TLS fingerprinting, browser emulation, and CAPTCHA solving transparently — no extra configuration on your end.
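For such a protected target, the payload differs only in those two flags (the URL below is a placeholder):

```json title="HTTP Request — Payload (Protected Target)"
{
  "url": "https://example-protected-site.com/products",
  "render_js": true,
  "premium_proxy": true,
  "country": "us",
  "timeout": 30000
}
```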
The same request in cURL for testing before wiring into n8n:
```bash title="Terminal"
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://books.toscrape.com/catalogue/page-1.html",
    "render_js": false,
    "premium_proxy": false
  }'
```
The equivalent single-URL call in Python:
```python title="single_scrape.py" {10-15}
import httpx

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.alterlab.io/v1/scrape"

def scrape(url: str, render_js: bool = False) -> dict:
    with httpx.Client() as client:  # synchronous single fetch
        r = client.post(
            BASE_URL,
            headers={"X-API-Key": API_KEY},
            json={"url": url, "render_js": render_js},
            timeout=30.0,
        )
        r.raise_for_status()
        return r.json()

result = scrape("https://books.toscrape.com/catalogue/page-1.html")
print(result["status_code"], result["elapsed_ms"], "ms")
```
The API response shape:
```json title="API Response"
{
  "success": true,
  "status_code": 200,
  "url": "https://books.toscrape.com/catalogue/page-1.html",
  "html": "<!DOCTYPE html>...",
  "elapsed_ms": 712
}
```
Run the cURL command against a live target to confirm the response shape before building the rest of the pipeline.
## Step 3: Parse HTML in the Code Node
Add a Code node immediately after the HTTP Request. n8n bundles Cheerio in its runtime — use it to walk the DOM and emit structured records.
```javascript title="n8n Code Node — Extract Book Listings" {7-18}
const { load } = require('cheerio');

const results = [];
for (const item of $input.all()) {
  const $ = load(item.json.html);
  $('article.product_pod').each((_, el) => {  // iterate product cards
    const title = $(el).find('h3 a').attr('title');
    const price = $(el).find('.price_color').text().trim();
    const rating = $(el).find('p.star-rating').attr('class')?.split(' ')[1];
    const relHref = $(el).find('h3 a').attr('href');
    results.push({  // emit flat record
      title,
      price,
      rating,
      url: `https://books.toscrape.com/catalogue/${relHref}`,
      scraped_at: new Date().toISOString(),
    });
  });
}

return results.map(r => ({ json: r }));
```
For targets that return JSON from an XHR endpoint (scraped through the proxy), skip Cheerio and parse directly:
```javascript title="n8n Code Node — Parse JSON from html Field" {2-3}
const raw = $input.first().json.html;
const data = JSON.parse(raw);  // html field contains the raw JSON string
return data.products.map(p => ({ json: p }));
```

If Cheerio is missing in a self-hosted setup, run `npm install cheerio` in the n8n working directory and restart the service.
## Step 4: Scrape Multiple Pages
Use a Code node to generate a URL list, then feed it through Split In Batches → HTTP Request:
```javascript title="n8n Code Node — Generate Paginated URL List" {3-6}
const BASE = 'https://books.toscrape.com/catalogue/page-';
const PAGES = 50;
const urls = Array.from(  // generate range of page URLs
  { length: PAGES },
  (_, i) => ({ json: { url: `${BASE}${i + 1}.html` } })
);
return urls;
```
Set **Split In Batches** to a batch size of 5 to avoid hammering the target. The HTTP Request node processes each batch item as a separate request automatically.
For high-volume pipelines where n8n acts as the orchestrator and Python handles the heavy lifting, use async fan-out:
```python title="batch_scrape.py" {18-22}
import asyncio

import httpx

API_KEY = "YOUR_API_KEY"
ENDPOINT = "https://api.alterlab.io/v1/scrape"

async def fetch(client: httpx.AsyncClient, url: str) -> dict:
    r = await client.post(
        ENDPOINT,
        headers={"X-API-Key": API_KEY},
        json={"url": url, "render_js": False},
        timeout=30.0,
    )
    r.raise_for_status()
    return r.json()

async def scrape_batch(urls: list[str]) -> list[dict]:  # fan-out entry point
    async with httpx.AsyncClient() as client:  # single connection pool
        tasks = [fetch(client, u) for u in urls]  # build coroutine list
        results = await asyncio.gather(*tasks, return_exceptions=True)
        return results

if __name__ == "__main__":
    pages = [f"https://books.toscrape.com/catalogue/page-{i}.html" for i in range(1, 11)]
    data = asyncio.run(scrape_batch(pages))
    for i, result in enumerate(data):
        if isinstance(result, Exception):
            print(f"Page {i+1} failed: {result}")
        else:
            print(f"Page {i+1}: {len(result['html']):,} bytes — {result['elapsed_ms']}ms")
```
The Python scraping API client wraps this pattern with built-in retry logic, concurrency throttling, and typed responses — worth switching to once you move beyond prototyping.
## Step 5: Route Data to Storage
Wire the Code node output to whichever storage node fits your stack.
**Postgres** — recommended for structured pipelines:

- Node: **Postgres**, Operation: **Insert**, Table: `scraped_books`
- Map `title`, `price`, `rating`, `url`, `scraped_at` directly from Code node output fields

**Google Sheets** — minimal setup for low-volume runs:

- Node: **Google Sheets**, Operation: **Append or Update**
- Same column mapping

**Webhook forward** — for downstream microservices or event buses:
```json title="Webhook Payload"
{
  "source": "n8n-book-scraper",
  "run_id": "{{ $execution.id }}",
  "count": 20,
  "records": [
    { "title": "A Light in the Attic", "price": "£51.77", "rating": "Three", "url": "..." }
  ]
}
```
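If Python ends up owning the storage step instead of an n8n Postgres node, the column mapping above is a single `executemany`. The sketch below uses SQLite so it runs anywhere; for the Postgres setup described above you'd swap in a Postgres driver and its placeholder style, and the `scraped_books` schema here is an assumption:

```python
import sqlite3

records = [  # shape emitted by the Code node parser
    {"title": "A Light in the Attic", "price": "£51.77", "rating": "Three",
     "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
     "scraped_at": "2024-01-01T00:00:00Z"},
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS scraped_books "
    "(title TEXT, price TEXT, rating TEXT, url TEXT, scraped_at TEXT)"
)
conn.executemany(
    "INSERT INTO scraped_books VALUES (:title, :price, :rating, :url, :scraped_at)",
    records,
)
conn.commit()
print(conn.execute("SELECT COUNT(*) FROM scraped_books").fetchone()[0])  # 1
```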
---
## Step 6: Schedule and Add Error Handling
Swap the manual trigger for a **Schedule Trigger** node before going to production.
| Cadence | Cron Expression | Typical Use Case |
|---------|-----------------|------------------|
| Hourly | `0 * * * *` | Price monitoring |
| Daily 06:00 UTC | `0 6 * * *` | News/content aggregation |
| Every 15 minutes | `*/15 * * * *` | Inventory feeds |
| Weekdays 09:00 UTC | `0 9 * * 1-5` | B2B lead enrichment |
For event-driven scraping — e.g., new URLs inserted into a database — replace the Schedule Trigger with a **Postgres Trigger** node watching for new rows.
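The claim-and-mark logic behind that trigger can be prototyped outside n8n in a few lines. SQLite here for portability, and the `urls` table with a `scraped` flag is a hypothetical schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (url TEXT, scraped INTEGER DEFAULT 0)")
conn.execute("INSERT INTO urls (url) VALUES ('https://books.toscrape.com/catalogue/page-1.html')")

def claim_pending(conn, limit=10):
    """Fetch unscraped URLs and mark them claimed in one transaction."""
    rows = conn.execute(
        "SELECT rowid, url FROM urls WHERE scraped = 0 LIMIT ?", (limit,)
    ).fetchall()
    conn.executemany("UPDATE urls SET scraped = 1 WHERE rowid = ?",
                     [(rid,) for rid, _ in rows])
    conn.commit()
    return [u for _, u in rows]

batch = claim_pending(conn)  # claims the one pending URL; a second call returns []
```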
**Error handling — configure before going live:**
1. HTTP Request node → enable **Retry On Fail**: 3 retries, 2000ms backoff
2. Code node → enable **Continue On Fail** if partial runs are acceptable
3. In **Settings → Error Workflow**, assign a dedicated workflow that captures and routes failures:
```javascript title="Error Workflow — Log Failures to Dead-Letter Table" {5-11}
// Runs inside the error workflow's Code node
const err = $input.first().json;

return [{
  json: {
    workflow: err.workflow?.name,
    node: err.execution?.lastNodeExecuted,  // which node threw
    message: err.execution?.error?.message,
    failed_at: new Date().toISOString(),
    execution_id: err.execution?.id,
  }
}];
```
Route the output to a Postgres `scrape_errors` table or a Slack node. Silent failures are harder to diagnose than loud ones.
How does this stack compare to the alternatives?

| Approach | Anti-Bot Handling | Setup Time | Maintenance | Scaling | Cost Model |
|---|---|---|---|---|---|
| DIY Playwright + Proxies | Manual (fingerprinting, stealth) | Days–weeks | High (browser updates, proxy churn) | Complex (concurrency, queueing) | Infrastructure + proxy fees |
| n8n + Scraping API | Automatic (TLS, CAPTCHA, headers) | <1 hour | Low (API versioned separately) | Batch nodes + API concurrency | Per successful request |
| Commercial ETL (Apify, etc.) | Varies by actor | Minutes (pre-built actors) | Low but opaque | Platform-managed | Platform subscription + compute |
## Monitoring Pipeline Health
Don't rely solely on n8n's execution log. Instrument the pipeline explicitly:

- **Log `success: false` responses** from the scraping API to a monitoring table — the API returns this field even on 200 responses if the target blocked the request
- **Store `elapsed_ms` per run** in a `scrape_metrics` table; an upward trend means proxy pool degradation
- **Add a row count guard** — between the parser and the storage node, insert a Code node that alerts if `results.length < EXPECTED_MINIMUM`:
```javascript title="n8n Code Node — Row Count Guard" {4-6}
const MINIMUM = 15;  // expect at least 15 records per page
const count = $input.all().length;

if (count < MINIMUM) {  // trigger alert path
  throw new Error(`Low yield: got ${count}, expected >= ${MINIMUM}`);
}

return $input.all();  // pass through if OK
```
Place this node between the Code parser and the storage node. When it throws, n8n's error workflow catches it.
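The `elapsed_ms` trend check from the monitoring list works the same way — compare a recent window against the baseline. A sketch; the window size and `factor` threshold are illustrative:

```python
from statistics import mean

def latency_degraded(samples: list[int], window: int = 10, factor: float = 1.5) -> bool:
    """Flag proxy-pool degradation: recent window noticeably slower than baseline."""
    if len(samples) < 2 * window:
        return False  # not enough history to compare
    baseline = mean(samples[:-window])  # everything before the recent window
    recent = mean(samples[-window:])
    return recent > factor * baseline

print(latency_degraded([700] * 20))               # False — latency is flat
print(latency_degraded([700] * 10 + [1400] * 10)) # True — recent runs doubled
```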
---
## Takeaways
- n8n's HTTP Request node integrates with any REST scraping API in minutes — no custom nodes required
- Use `render_js: true` selectively; static fetches are faster and cheaper than headless browser requests
- Keep parsing logic inside the Code node to maintain self-contained, debuggable workflows
- Cheerio handles the majority of HTML extraction cases; fall back to a dedicated parser service only for complex XPath requirements
- Configure retries on the HTTP node and a global error workflow before scheduling — silent data loss compounds across runs
- For event-driven ingestion triggered by new URLs in a queue or database, swap the Schedule Trigger for a Postgres Trigger or AMQP node without changing the rest of the workflow