Building a Self-Healing Web Scraping Pipeline with n8n and Gemini AI

Web scraping breaks. Pages redesign, HTML structures change, and your regex stops matching. I built a pipeline that uses AI to handle format changes automatically.

The Architecture

I track data from 33+ websites that update their content frequently. The goal: detect changes within 6 hours and update my database automatically.

```
n8n (scheduler) → Firecrawl (scraper) → Gemini AI (parser) → SQLite (storage)
         ↓                                                          ↓
    Error handler                                          Telegram notification
```

Why n8n Over Custom Code

I initially built this as a Node.js cron job. It worked, but:

  • Debugging required reading logs line by line
  • Error handling was ad-hoc (try/catch everywhere)
  • Adding a new source meant modifying code and redeploying

n8n gives you:

  • Visual workflow editor (see errors at a glance)
  • Built-in retry logic per node
  • Webhook triggers for manual re-runs
  • Credential management (no API keys in code)

Step 1: Firecrawl for JS-Heavy Sites

Many modern websites load content via client-side JavaScript. A simple `fetch()` returns an empty shell. Firecrawl handles this:

```javascript
// n8n HTTP Request node
const response = await fetch('https://api.firecrawl.dev/v0/scrape', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${FIRECRAWL_KEY}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    url: targetUrl,
    formats: ['markdown'],
    waitFor: 3000,
    removeBase64Images: true
  })
});
```

The `waitFor: 3000` parameter gives SPAs time to render. The `markdown` format strips HTML and returns clean text — perfect for AI processing.
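Downstream of the HTTP node, I pull the markdown out of Firecrawl's response before handing it to the AI. A minimal sketch of that step — the response shape (`{ success, data: { markdown } }`) is what my v0 setup returns, so verify it against your Firecrawl version:

```typescript
// Assumed v0 response shape; check your Firecrawl version's docs.
interface FirecrawlResponse {
  success: boolean;
  data?: { markdown?: string };
}

function extractMarkdown(body: FirecrawlResponse): string {
  if (!body.success || !body.data?.markdown) {
    // Treat an empty scrape as a hard failure so n8n's error path fires
    throw new Error('Firecrawl returned no markdown content');
  }
  return body.data.markdown;
}
```

Throwing here (rather than passing an empty string along) is deliberate: it routes the source into the "skip, try next cycle" error path instead of feeding the AI a blank page.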

Step 2: AI Extraction (The Self-Healing Part)

This is where it gets interesting. Traditional scraping uses CSS selectors or XPath:

```javascript
// Fragile - breaks when HTML changes
const price = document.querySelector('.plan-card .price-value').textContent;
```

My approach sends the full page content to Gemini Flash with a structured extraction prompt:

```javascript
const prompt = `
Extract the following from this page content:
- items: array of {name, size, price, features: string[]}
- supported_platforms: string[]
- last_updated: date if visible

Rules:
- Return valid JSON only
- Use null for missing values
- Normalize prices to integers (USD)
- Remove currency symbols
`;

const result = await gemini.generateContent([
  prompt,
  pageMarkdown
]);
```

When a website redesigns, the AI still finds the data because it understands semantics, not DOM structure. I haven't had to update a single extraction rule in 3 months despite multiple source sites redesigning.
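"Return valid JSON only" in the prompt is a request, not a guarantee — Gemini occasionally wraps its answer in a markdown code fence anyway. A small validation sketch (the `parseAiJson` helper is my own, not part of any SDK) that returns `null` to trigger the simpler-prompt retry:

```typescript
// Strip an optional ```json fence, then parse; null signals "retry or flag".
function parseAiJson(raw: string): Record<string, unknown> | null {
  const stripped = raw
    .replace(/^```(?:json)?\s*/i, '')
    .replace(/\s*```$/, '')
    .trim();
  try {
    const parsed = JSON.parse(stripped);
    // Only accept a top-level object; anything else is a malformed extraction
    return typeof parsed === 'object' && parsed !== null && !Array.isArray(parsed)
      ? (parsed as Record<string, unknown>)
      : null;
  } catch {
    return null;
  }
}
```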

Step 3: Smart Diffing

Naive approach: overwrite the database every cycle. Problem: you lose change history and generate unnecessary writes.

My diff layer compares extracted values against current state:

```typescript
interface Change {
  field: string;
  oldValue: unknown;
  newValue: unknown;
  confidence: number;
}

function computeDiff(
  current: Record<string, unknown>,
  extracted: Record<string, unknown>
): Change[] {
  const changes: Change[] = [];

  for (const [key, newVal] of Object.entries(extracted)) {
    if (newVal === null) continue; // AI couldn't find it

    const oldVal = current[key];
    if (JSON.stringify(oldVal) === JSON.stringify(newVal)) continue;

    // Confidence check: flag large changes for review
    const confidence = typeof newVal === 'number' && typeof oldVal === 'number'
      ? 1 - Math.abs(newVal - oldVal) / Math.max(oldVal, 1)
      : 1;

    changes.push({ field: key, oldValue: oldVal, newValue: newVal, confidence });
  }

  return changes;
}
```

Changes with confidence below 0.5 (value changed by more than 50%) get flagged for manual review instead of auto-updating. This catches AI hallucinations.
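To make the threshold concrete, here's the confidence formula from `computeDiff` pulled out with a worked example. A price dropping from $100 to $40 is a 60% change, so confidence is 0.4 — below the 0.5 line, flagged:

```typescript
// Same formula as in computeDiff, isolated for a sanity check.
function numericConfidence(oldVal: number, newVal: number): number {
  return 1 - Math.abs(newVal - oldVal) / Math.max(oldVal, 1);
}

// $100 → $40: |40 - 100| / 100 = 0.6, so confidence = 0.4 → flagged.
// Changes over 100% (e.g. $100 → $250) push confidence negative,
// which still lands safely below the 0.5 threshold.
const flagged = numericConfidence(100, 40) < 0.5;
```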

Step 4: Notifications

```javascript
if (changes.length > 0) {
  const autoApplied = changes.filter(c => c.confidence >= 0.5);
  const flagged = changes.filter(c => c.confidence < 0.5);

  await sendTelegram({
    text: [
      `Source: ${source.name}`,
      `Auto-updated: ${autoApplied.map(c => c.field).join(', ')}`,
      flagged.length ? `Flagged for review: ${flagged.map(c => c.field).join(', ')}` : '',
      `Timestamp: ${new Date().toISOString()}`
    ].filter(Boolean).join('\n')
  });
}
```

Error Handling Strategy

| Error Type | n8n Behavior | Custom Logic |
| --- | --- | --- |
| Network timeout | Auto-retry 3x | Skip source, try next cycle |
| Firecrawl 429 | Exponential backoff | Add jitter (1-5s random delay) |
| AI returns invalid JSON | Retry with simpler prompt | Flag source, keep old data |
| AI returns null for all fields | No retry | Alert: source may have redesigned |
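For the 429 row, the per-attempt delay can be sketched like this. The base delay and 30-second cap are my assumptions; n8n's built-in retry settings drive the actual scheduling, this just computes the wait:

```typescript
// Exponential backoff with 1-5s random jitter, capped at 30s base.
function backoffDelayMs(attempt: number): number {
  const base = Math.min(1000 * 2 ** attempt, 30_000); // 1s, 2s, 4s, ... capped
  const jitter = 1000 + Math.random() * 4000;         // 1-5s random jitter
  return base + jitter;
}
```

The jitter matters more than the exponent here: with 33 sources on the same 6-hour cron, un-jittered retries would hit Firecrawl in lockstep and keep tripping the rate limit.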

Performance Numbers

| Metric | Value |
| --- | --- |
| Sources tracked | 33 |
| Crawl frequency | Every 6 hours |
| Average cycle time | 4 minutes |
| AI extraction accuracy | 97% |
| False positive rate | 0.3% |
| Uptime (3 months) | 99.8% |
| Monthly cost | ~$20 (VPS + API calls) |

The n8n Workflow Layout

```
[Cron Trigger: 6h]
    ↓
[Loop: for each source]
    ↓
[HTTP: Firecrawl] → [Error: skip source]
    ↓
[HTTP: Gemini AI] → [Error: retry with fallback prompt]
    ↓
[Code: JSON parse + validate]
    ↓
[Code: Diff against DB]
    ↓
[Branch: changes detected?]
  Yes → [Code: Apply changes] → [Telegram: notify]
  No  → [Continue loop]
```

Lessons Learned

  1. AI > regex for scraping. Format-agnostic extraction means zero maintenance when sources change their HTML.

  2. Confidence thresholds prevent bad data. The 50% change threshold has caught every AI hallucination in my testing so far.

  3. n8n visual debugging saves hours. Clicking a node to see its input/output beats reading log files.

  4. Self-host n8n. The cloud version works, but self-hosting on the same VPS as your database eliminates network latency for DB operations.

The platform this pipeline feeds is propfirmkey.com — a data comparison engine. The scraping pipeline keeps it updated automatically.


Have you used AI for web scraping? What's your experience with accuracy vs traditional selectors? Let me know in the comments.
