# Automate Web Scraping in n8n with AlterLab's API
n8n is a workflow automation tool that connects APIs, databases, and services. Pair it with a scraping API that handles anti-bot bypass, proxy rotation, and headless rendering, and you get a pipeline that pulls structured data from any website on a schedule.
This tutorial shows how to build that pipeline. You will configure an n8n workflow that sends scrape requests, receives clean JSON, and routes the data to a database, spreadsheet, or webhook.
## Prerequisites
- An n8n instance (self-hosted or cloud)
- An API key from alterlab.io/signup
- Basic familiarity with n8n's node-based workflow editor
## Step 1: Configure the HTTP Request Node
Create a new workflow in n8n. Add an HTTP Request node and configure it as follows:
- **Method**: POST
- **URL**: `https://api.alterlab.io/v1/scrape`
- **Authentication**: Header Auth
- **Header Name**: `X-API-Key`
- **Header Value**: Your API key
- **Send Body**: JSON
Set the JSON body to:
```json title="HTTP Request Body"
{
"url": "https://example.com/products",
"formats": ["json"],
"min_tier": 3
}
```
The `min_tier` parameter controls the scraping tier. Tier 3 enables JavaScript rendering. Set it higher for sites with aggressive bot detection. The [anti-bot bypass](https://alterlab.io/anti-bot-bypass-api) system auto-escalates if the initial tier fails.
## Step 2: Test with cURL First
Before building the full workflow, verify the endpoint works from your terminal. This isolates API issues from n8n configuration problems.
```bash title="Terminal" {1-4}
curl -X POST https://api.alterlab.io/v1/scrape \
-H "X-API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com/products", "formats": ["json"]}'
```
A successful response returns structured data:
```json title="Response" {3-8}
{
"status": "success",
"data": {
"products": [
{"name": "Widget A", "price": 29.99},
{"name": "Widget B", "price": 49.99}
]
},
"metadata": {
"url": "https://example.com/products",
"timestamp": "2026-04-11T10:30:00Z"
}
}
```
## Step 3: Build the Full n8n Workflow
A production workflow needs more than a single HTTP request. You need error handling, data transformation, and a destination for the scraped data.
### Workflow Structure
```plaintext
[Schedule Trigger] -> [HTTP Request (Scrape)] -> [Code (Parse)] -> [Database/Sheet/Webhook]
```
Add these nodes in order:
**1. Schedule Trigger**

Set a cron expression for your scrape frequency. Daily at 6 AM UTC: `0 6 * * *`
**2. HTTP Request Node**
Use the configuration from Step 1. Enable "Continue On Fail" so one failed scrape does not block the entire workflow.
**3. Code Node (Data Transformation)**
Parse the JSON response and extract the fields you need:
```python title="n8n Code Node" {8-21}
import json

# Access the HTTP Request output (n8n's Python code nodes expose _input)
response = _input.first().json["body"]
if isinstance(response, str):
    response = json.loads(response)

# Extract product data
products = response.get("data", {}).get("products", [])

# Transform to your schema
items = []
for product in products:
    items.append({
        "json": {
            "name": product["name"],
            "price": product["price"],
            "scraped_at": response["metadata"]["timestamp"],
            "source": response["metadata"]["url"]
        }
    })

return items
```
**4. Destination Node**
Connect your output node. Common choices:
- **Postgres/MySQL**: Use the database node to upsert records
- **Google Sheets**: Append rows for lightweight tracking
- **Webhook**: Push to your own API or a Slack channel
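To illustrate the upsert pattern the database node performs, here is a minimal sketch using Python's built-in `sqlite3` as a stand-in for the production database (the real node targets Postgres/MySQL; the table and column names follow the workflow example later in this tutorial):

```python
import sqlite3

# In-memory stand-in for the production database
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product_prices (name TEXT PRIMARY KEY, price REAL, scraped_at TEXT)"
)

def upsert_product(item):
    """Insert a scraped product, or update price/timestamp if the name already exists."""
    conn.execute(
        "INSERT INTO product_prices (name, price, scraped_at) VALUES (?, ?, ?) "
        "ON CONFLICT(name) DO UPDATE SET price = excluded.price, "
        "scraped_at = excluded.scraped_at",
        (item["name"], item["price"], item["scraped_at"]),
    )

upsert_product({"name": "Widget A", "price": 29.99, "scraped_at": "2026-04-11T10:30:00Z"})
upsert_product({"name": "Widget A", "price": 24.99, "scraped_at": "2026-04-12T10:30:00Z"})
```

Re-running the workflow updates existing rows instead of duplicating them, keeping the table at one row per product.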
## Step 4: Handle Multiple URLs
Scraping a single page is straightforward. Real pipelines scrape dozens or hundreds of URLs. Use n8n's Split Out node to fan out requests.
```python title="URL List Generator" {3-7}
# Code node that outputs multiple URLs
urls = [
"https://example.com/products/page/1",
"https://example.com/products/page/2",
"https://example.com/products/page/3"
]
return [{"json": {"url": u}} for u in urls]
```
Connect this to a Split Out node, then to your HTTP Request node. Each URL becomes a separate execution branch. n8n processes them in parallel up to your concurrency limit.
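If the page count varies over time, the hardcoded list above can be generated instead. A small sketch (the `paginated_urls` helper is hypothetical, not an n8n or AlterLab API):

```python
def paginated_urls(base, pages):
    """Build one n8n item per page URL, ready for the Split Out node."""
    return [{"json": {"url": f"{base}/page/{i}"}} for i in range(1, pages + 1)]

# In a Code node, you would end with: return paginated_urls(...)
items = paginated_urls("https://example.com/products", 3)
```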
Add rate limiting between requests if the target site requires it. Use the Wait node between the Split Out and HTTP Request nodes:
```plaintext
Wait: 2 seconds
```
## Step 5: Add Error Handling and Retries
Scraping fails. Pages change structure, sites go down, anti-bot systems update. Your workflow should handle failures gracefully.
### Retry Configuration
In the HTTP Request node settings:
- Retry On Fail: Enable
- Max Retries: 3
- Retry Backoff: Exponential
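To make the exponential setting concrete, here is roughly how the delay grows across attempts (a sketch assuming a 1-second base interval; n8n's actual wait between tries is configurable):

```python
def backoff_delays(max_retries=3, base_seconds=1.0):
    """Delay before each retry doubles: 1s, 2s, 4s for three retries."""
    return [base_seconds * (2 ** attempt) for attempt in range(max_retries)]
```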
### Error Routing
Add an error output branch from the HTTP Request node:
```plaintext
[HTTP Request] --(success)--> [Parse] --> [Database]
       |
       --(error)--> [Error Handler] --> [Alert/Log]
```
The error handler can log failures to a separate sheet, send a Slack notification, or queue the URL for a retry with a higher tier.
```python title="Error Handler Code Node" {6-11}
from datetime import datetime, timezone

# Capture failed URLs for retry
error_data = _input.first().json
failed_urls = []
failed_urls.append({
    "url": error_data.get("url"),
    "error": error_data.get("error"),
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "retry_tier": 4  # escalate tier on retry
})

return [{"json": {"failed": failed_urls}}]
```
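The tier escalation on retry can be capped so repeated failures never push past the top tier. A hypothetical helper (tier 5 is the highest tier mentioned in this guide):

```python
def escalate_tier(current_tier, max_tier=5):
    """Bump the scraping tier for the retry, capped at the highest tier."""
    return min(current_tier + 1, max_tier)
```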
## Step 6: Use Cortex AI for Structured Extraction
Some pages do not have clean HTML structures. Product listings buried in JavaScript, unstructured text, or dynamic content require a different approach. Cortex AI extracts structured data using natural language instructions.
```bash title="Terminal" {5-9}
curl -X POST https://api.alterlab.io/v1/scrape \
-H "X-API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/reviews",
"formats": ["json"],
"cortex": {
"prompt": "Extract reviewer name, rating (1-5), and review text from each review block"
}
}'
```
The response returns data matching your schema:
```json title="Cortex AI Response" {4-12}
{
"status": "success",
"data": {
"reviews": [
{
"reviewer_name": "Jane D.",
"rating": 5,
"review_text": "Excellent product, fast shipping."
},
{
"reviewer_name": "Mark S.",
"rating": 4,
"review_text": "Good quality, slightly overpriced."
}
]
}
}
```
In n8n, the Cortex output works identically to standard JSON output. Route it through the same Code and Database nodes.
## Step 7: Monitor and Alert on Changes
Scraping is not always about collecting new data. Sometimes you need to detect changes on existing pages. Price drops, stock availability, competitor updates, regulatory filings.
Configure monitoring by storing previous scrape results and comparing them on each run:
```python title="Change Detection Code Node" {6-15}
# Compare current scrape with previous state
current = _input.first().json
previous = get_previous_state(current["url"]) or {}  # database lookup; empty on first run
changes = []

for key in current["data"]:
    if key not in previous:
        changes.append({"field": key, "action": "added", "value": current["data"][key]})
    elif current["data"][key] != previous[key]:
        changes.append({
            "field": key,
            "action": "changed",
            "old": previous[key],
            "new": current["data"][key]
        })

# Only pass through if changes detected
if changes:
    return [{"json": {"url": current["url"], "changes": changes}}]
return []
```
When changes exist, route to an alert node. When nothing changed, the workflow exits silently.
## Cost Considerations
Scraping pipelines can run expensive if you are not careful. A few practices:
- **Cache aggressively**: Do not re-scrape pages that have not changed. Store hashes of previous responses and skip identical results.
- **Use the lowest tier that works**: Start with `min_tier: 1` for static pages. Only escalate to tier 3+ for JavaScript-heavy sites.
- **Batch URLs**: Group related URLs into single workflow runs rather than triggering separate workflows per URL.
- **Set spend limits**: API keys support spend caps. Set them per workflow to prevent runaway costs.
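The hash-based caching can be as simple as a stable digest of each response, computed with the standard library (a sketch; `sort_keys` keeps the serialization canonical so equal payloads always hash the same):

```python
import hashlib
import json

def response_hash(data):
    """Stable fingerprint of a scrape result; skip the write if it matches the stored hash."""
    canonical = json.dumps(data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```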
Check pricing for current rates. You pay for what you use with no monthly minimums.
## Complete Workflow Example
Here is the full n8n workflow JSON for a daily product price scrape:
```json title="n8n Workflow Export" {10-20}
{
"name": "Daily Price Scraper",
"nodes": [
{
"name": "Schedule",
"type": "n8n-nodes-base.scheduleTrigger",
"parameters": {
"rule": { "interval": ["days"], "triggerAtHour": 6 }
}
},
{
"name": "Scrape Products",
"type": "n8n-nodes-base.httpRequest",
"parameters": {
"method": "POST",
"url": "https://api.alterlab.io/v1/scrape",
"authentication": "headerAuth",
"body": {
"url": "={{ $json.url }}",
"formats": ["json"],
"min_tier": 3
},
"options": {
"retryOnFail": true,
"maxTries": 3
}
}
},
{
"name": "Parse Response",
"type": "n8n-nodes-base.code",
"parameters": {
"jsCode": "const data = $input.first().json.body;\nreturn data.data.products.map(p => ({ json: p }));"
}
},
{
"name": "Save to Database",
"type": "n8n-nodes-base.postgres",
"parameters": {
"operation": "upsert",
"table": "product_prices",
"columns": "name,price,scraped_at"
}
}
],
"connections": {
"Schedule": { "main": [[{ "node": "Scrape Products", "type": "main" }]] },
"Scrape Products": { "main": [[{ "node": "Parse Response", "type": "main" }]] },
"Parse Response": { "main": [[{ "node": "Save to Database", "type": "main" }]] }
}
}
```
Import this into n8n via the workflow editor, replace the authentication credentials with your API key, and adjust the URL and database schema to match your use case.
## Troubleshooting
**Empty responses**: The page may require a higher tier. Increase `min_tier` to 4 or 5. Check the [API docs](https://alterlab.io/docs) for tier descriptions.
**Rate limit errors**: Add a Wait node between requests. Start with 1-2 seconds and increase if needed.
**CAPTCHA blocks**: Set `min_tier: 5` to enable CAPTCHA solving. This costs more per request but eliminates manual intervention.
**Schema drift**: Websites change their HTML structure. Cortex AI handles this better than CSS selectors since it uses semantic understanding. Switch to Cortex if your selectors break frequently.
**n8n timeout**: Long-running scrapes can exceed n8n's execution timeout. For large batches, use the webhook pattern. Configure AlterLab to push results to an n8n webhook URL instead of polling.
## Takeaway
n8n handles orchestration. AlterLab handles extraction. Together they give you a scraping pipeline that runs on a schedule, handles failures, and delivers clean data to your systems.
Start with a single URL and a basic HTTP Request node. Add error handling, multi-URL support, and change detection as your needs grow. The [quickstart guide](https://alterlab.io/docs/quickstart/installation) covers API setup in under five minutes.