Web scraping breaks. Pages get redesigned, HTML structures change, and your regex stops matching. I built a pipeline that uses AI to handle format changes automatically.
The Architecture
I track data from 33+ websites that update their content frequently. The goal: detect changes within 6 hours and update my database automatically.
```
n8n (scheduler) → Firecrawl (scraper) → Gemini AI (parser) → SQLite (storage)
                         ↓                                        ↓
                   Error handler                       Telegram notification
```
Why n8n Over Custom Code
I initially built this as a Node.js cron job. It worked, but:
- Debugging required reading logs line by line
- Error handling was ad hoc (try/catch everywhere)
- Adding a new source meant modifying code and redeploying
n8n gives you:
- Visual workflow editor (see errors at a glance)
- Built-in retry logic per node
- Webhook triggers for manual re-runs
- Credential management (no API keys in code)
Step 1: Firecrawl for JS-Heavy Sites
Many modern websites load content via client-side JavaScript. A simple fetch() returns an empty shell. Firecrawl handles this:
```javascript
// n8n HTTP Request node
const response = await fetch('https://api.firecrawl.dev/v0/scrape', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${FIRECRAWL_KEY}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    url: targetUrl,
    formats: ['markdown'],
    waitFor: 3000,
    removeBase64Images: true
  })
});
```
The waitFor: 3000 parameter gives SPAs time to render. The markdown format strips HTML and returns clean text — perfect for AI processing.
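Before the markdown reaches the AI step, the response body needs unwrapping. A minimal sketch, assuming a `{ success, data: { markdown } }` response shape for the v0 endpoint (check the Firecrawl docs for your API version):

```typescript
// Assumed response shape for Firecrawl's /v0/scrape endpoint.
interface FirecrawlResponse {
  success: boolean;
  data?: { markdown?: string };
}

// Pull the markdown out of a scrape response, failing loudly so the
// n8n error branch can catch it and skip the source for this cycle.
function extractMarkdown(body: FirecrawlResponse): string {
  if (!body.success || !body.data?.markdown) {
    throw new Error("Firecrawl scrape failed or returned no markdown");
  }
  return body.data.markdown;
}
```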
Step 2: AI Extraction (The Self-Healing Part)
This is where it gets interesting. Traditional scraping uses CSS selectors or XPath:
```javascript
// Fragile - breaks when HTML changes
const price = document.querySelector('.plan-card .price-value').textContent;
```
My approach sends the full page content to Gemini Flash with a structured extraction prompt:
```javascript
const prompt = `
Extract the following from this page content:
- items: array of {name, size, price, features: string[]}
- supported_platforms: string[]
- last_updated: date if visible

Rules:
- Return valid JSON only
- Use null for missing values
- Normalize prices to integers (USD)
- Remove currency symbols
`;

const result = await gemini.generateContent([
  prompt,
  pageMarkdown
]);
```
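Even with "Return valid JSON only" in the prompt, models sometimes wrap their output in a markdown code fence. A small parser sketch (the function name is mine, not from the pipeline) that strips an optional fence and returns `null` rather than throwing, so the workflow can route to its retry branch:

```typescript
// Strip an optional ```json ... ``` fence from model output, then parse.
// Returns null on invalid JSON so the caller can decide how to retry.
function parseModelJson<T>(raw: string): T | null {
  const cleaned = raw
    .trim()
    .replace(/^```(?:json)?\s*/i, "") // leading fence, with or without "json"
    .replace(/```\s*$/, "");          // trailing fence
  try {
    return JSON.parse(cleaned) as T;
  } catch {
    return null;
  }
}
```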
When a website redesigns, the AI still finds the data because it understands semantics, not DOM structure. I haven't had to update a single extraction rule in 3 months despite multiple source sites redesigning.
Step 3: Smart Diffing
Naive approach: overwrite the database every cycle. Problem: you lose change history and generate unnecessary writes.
My diff layer compares extracted values against current state:
```typescript
interface Change {
  field: string;
  oldValue: unknown;
  newValue: unknown;
  confidence: number;
}

function computeDiff(
  current: Record<string, unknown>,
  extracted: Record<string, unknown>
): Change[] {
  const changes: Change[] = [];

  for (const [key, newVal] of Object.entries(extracted)) {
    if (newVal === null) continue; // AI couldn't find it

    const oldVal = current[key];
    if (JSON.stringify(oldVal) === JSON.stringify(newVal)) continue;

    // Confidence check: flag large changes for review
    const confidence = typeof newVal === 'number' && typeof oldVal === 'number'
      ? 1 - Math.abs(newVal - oldVal) / Math.max(oldVal, 1)
      : 1;

    changes.push({ field: key, oldValue: oldVal, newValue: newVal, confidence });
  }
  return changes;
}
```
Changes with confidence below 0.5 (value changed by more than 50%) get flagged for manual review instead of auto-updating. This catches AI hallucinations.
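To make the threshold concrete, here is the confidence expression from computeDiff in isolation, with two worked values:

```typescript
// confidence = 1 - |new - old| / max(old, 1), same expression as in computeDiff
const confidence = (oldVal: number, newVal: number): number =>
  1 - Math.abs(newVal - oldVal) / Math.max(oldVal, 1);

// A price jump from 100 to 180 is an 80% change:
//   confidence = 1 - 80/100 = 0.2 → below 0.5, flagged for review.
// A bump from 100 to 110 gives 0.9 → auto-applied.
```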
Step 4: Notifications
```typescript
if (changes.length > 0) {
  const autoApplied = changes.filter(c => c.confidence >= 0.5);
  const flagged = changes.filter(c => c.confidence < 0.5);

  await sendTelegram({
    text: [
      `Source: ${source.name}`,
      `Auto-updated: ${autoApplied.map(c => c.field).join(', ')}`,
      flagged.length ? `Flagged for review: ${flagged.map(c => c.field).join(', ')}` : '',
      `Timestamp: ${new Date().toISOString()}`
    ].filter(Boolean).join('\n')
  });
}
```
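sendTelegram above is a thin wrapper. A sketch of what such a helper might build, using the standard Telegram Bot API sendMessage endpoint (the token and chat ID are placeholders; only the request construction is shown so it can be inspected without network access):

```typescript
// Build the request for Telegram's sendMessage endpoint.
// Actual sending is a plain fetch(); this separates construction
// from I/O so the payload can be tested offline.
function buildTelegramRequest(
  botToken: string,
  chatId: string,
  text: string
): { url: string; body: string } {
  return {
    url: `https://api.telegram.org/bot${botToken}/sendMessage`,
    body: JSON.stringify({ chat_id: chatId, text }),
  };
}

// Usage: const { url, body } = buildTelegramRequest(BOT_TOKEN, CHAT_ID, msg);
// await fetch(url, { method: "POST",
//   headers: { "Content-Type": "application/json" }, body });
```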
Error Handling Strategy
| Error Type | n8n Behavior | Custom Logic |
|---|---|---|
| Network timeout | Auto-retry 3x | Skip source, try next cycle |
| Firecrawl 429 | Exponential backoff | Add jitter (1-5s random delay) |
| AI returns invalid JSON | Retry with simpler prompt | Flag source, keep old data |
| AI returns null for all fields | No retry | Alert: source may have redesigned |
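The backoff-plus-jitter row can be expressed as a small helper. A sketch with my own base and cap values (the pipeline's actual numbers may differ); the random source is injectable so the math is testable:

```typescript
// Exponential backoff with random jitter for rate-limit (429) retries.
// Base delay doubles each attempt (1s, 2s, 4s, ...), plus a 1-5s random
// jitter so 33 sources don't all retry in lockstep; capped at 60s.
function backoffDelayMs(
  attempt: number,
  rand: () => number = Math.random
): number {
  const base = 1000 * 2 ** attempt; // 1s, 2s, 4s, ...
  const jitter = 1000 + rand() * 4000; // 1-5s random delay
  return Math.min(base + jitter, 60_000);
}
```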
Performance Numbers
| Metric | Value |
|---|---|
| Sources tracked | 33 |
| Crawl frequency | Every 6 hours |
| Average cycle time | 4 minutes |
| AI extraction accuracy | 97% |
| False positive rate | 0.3% |
| Uptime (3 months) | 99.8% |
| Monthly cost | ~$20 (VPS + API calls) |
The n8n Workflow Layout
```
[Cron Trigger: 6h]
        ↓
[Loop: for each source]
        ↓
[HTTP: Firecrawl] → [Error: skip source]
        ↓
[HTTP: Gemini AI] → [Error: retry with fallback prompt]
        ↓
[Code: JSON parse + validate]
        ↓
[Code: Diff against DB]
        ↓
[Branch: changes detected?]
  Yes → [Code: Apply changes] → [Telegram: notify]
  No  → [Continue loop]
```
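The "JSON parse + validate" node rejects extractions that don't match the expected shape before they reach the diff step. A minimal structural check as a sketch (field names follow the extraction prompt earlier; a real validator would cover all fields):

```typescript
// Minimal structural check on the AI's extraction output: an items
// array whose entries each carry a string name. Anything else routes
// to the workflow's error branch instead of the diff step.
function isValidExtraction(value: unknown): value is { items: { name: string }[] } {
  if (typeof value !== "object" || value === null) return false;
  const items = (value as { items?: unknown }).items;
  return (
    Array.isArray(items) &&
    items.every(
      (i) =>
        typeof i === "object" &&
        i !== null &&
        typeof (i as { name?: unknown }).name === "string"
    )
  );
}
```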
Lessons Learned
AI > regex for scraping. Format-agnostic extraction means zero maintenance when sources change their HTML.
Confidence thresholds prevent bad data. The 50% change threshold has caught every AI hallucination I've observed in testing.
n8n visual debugging saves hours. Clicking a node to see its input/output beats reading log files.
Self-host n8n. The cloud version works, but self-hosting on the same VPS as your database eliminates network latency for DB operations.
The platform this pipeline feeds is propfirmkey.com — a data comparison engine. The scraping pipeline keeps it updated automatically.
Have you used AI for web scraping? What's your experience with accuracy vs traditional selectors? Let me know in the comments.