Web scraping breaks. Pages get redesigned, HTML structures change, and your regex stops matching. I built a pipeline that uses AI to handle format changes automatically.
The Architecture
I track data from 33+ websites that update their content frequently. The goal: detect changes within 6 hours and update my database automatically.
```
n8n (scheduler) → Firecrawl (scraper) → Gemini AI (parser) → SQLite (storage)
                         ↓                                        ↓
                   Error handler                       Telegram notification
```
Why n8n Over Custom Code
I initially built this as a Node.js cron job. It worked, but:
- Debugging required reading logs line by line
- Error handling was ad hoc (try/catch everywhere)
- Adding a new source meant modifying code and redeploying
n8n gives you:
- Visual workflow editor (see errors at a glance)
- Built-in retry logic per node
- Webhook triggers for manual re-runs
- Credential management (no API keys in code)
Step 1: Firecrawl for JS-Heavy Sites
Many modern websites load content via client-side JavaScript. A simple fetch() returns an empty shell. Firecrawl handles this:
```javascript
// n8n HTTP Request node
const response = await fetch('https://api.firecrawl.dev/v0/scrape', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${FIRECRAWL_KEY}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    url: targetUrl,
    formats: ['markdown'],
    waitFor: 3000,
    removeBase64Images: true
  })
});
```
The waitFor: 3000 parameter gives SPAs time to render. The markdown format strips HTML and returns clean text — perfect for AI processing.
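Before the markdown reaches the AI step, the response body needs unwrapping. A minimal sketch, assuming a `{ success, data: { markdown } }` response shape for the v0 endpoint (check the Firecrawl docs for your API version):

```typescript
// Assumed response shape for Firecrawl's /v0/scrape endpoint.
interface FirecrawlResponse {
  success: boolean;
  data?: { markdown?: string };
}

// Pull the markdown out of a scrape response, failing loudly so the
// n8n error branch can catch it and skip the source for this cycle.
function extractMarkdown(body: FirecrawlResponse): string {
  if (!body.success || !body.data?.markdown) {
    throw new Error("Firecrawl scrape failed or returned no markdown");
  }
  return body.data.markdown;
}
```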
Step 2: AI Extraction (The Self-Healing Part)
This is where it gets interesting. Traditional scraping uses CSS selectors or XPath:
```javascript
// Fragile - breaks when HTML changes
const price = document.querySelector('.plan-card .price-value').textContent;
```
My approach sends the full page content to Gemini Flash with a structured extraction prompt:
```javascript
const prompt = `
Extract the following from this page content:
- items: array of {name, size, price, features: string[]}
- supported_platforms: string[]
- last_updated: date if visible

Rules:
- Return valid JSON only
- Use null for missing values
- Normalize prices to integers (USD)
- Remove currency symbols
`;

const result = await gemini.generateContent([
  prompt,
  pageMarkdown
]);
```
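Even with "Return valid JSON only" in the prompt, models sometimes wrap their output in a markdown code fence. A small parser sketch (the function name is mine, not from the pipeline) that strips an optional fence and returns `null` rather than throwing, so the workflow can route to its retry branch:

```typescript
// Strip an optional ```json ... ``` fence from model output, then parse.
// Returns null on invalid JSON so the caller can decide how to retry.
function parseModelJson<T>(raw: string): T | null {
  const cleaned = raw
    .trim()
    .replace(/^```(?:json)?\s*/i, "") // leading fence, with or without "json"
    .replace(/```\s*$/, "");          // trailing fence
  try {
    return JSON.parse(cleaned) as T;
  } catch {
    return null;
  }
}
```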
When a website redesigns, the AI still finds the data because it understands semantics, not DOM structure. I haven't had to update a single extraction rule in 3 months despite multiple source sites redesigning.
Step 3: Smart Diffing
Naive approach: overwrite the database every cycle. Problem: you lose change history and generate unnecessary writes.
My diff layer compares extracted values against current state:
```typescript
interface Change {
  field: string;
  oldValue: unknown;
  newValue: unknown;
  confidence: number;
}

function computeDiff(
  current: Record<string, unknown>,
  extracted: Record<string, unknown>
): Change[] {
  const changes: Change[] = [];

  for (const [key, newVal] of Object.entries(extracted)) {
    if (newVal === null) continue; // AI couldn't find it

    const oldVal = current[key];
    if (JSON.stringify(oldVal) === JSON.stringify(newVal)) continue;

    // Confidence check: flag large changes for review
    const confidence = typeof newVal === 'number' && typeof oldVal === 'number'
      ? 1 - Math.abs(newVal - oldVal) / Math.max(oldVal, 1)
      : 1;

    changes.push({ field: key, oldValue: oldVal, newValue: newVal, confidence });
  }
  return changes;
}
```
Changes with confidence below 0.5 (value changed by more than 50%) get flagged for manual review instead of auto-updating. This catches AI hallucinations.
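To make the threshold concrete, here is the confidence expression from computeDiff in isolation, with two worked values:

```typescript
// confidence = 1 - |new - old| / max(old, 1), same expression as in computeDiff
const confidence = (oldVal: number, newVal: number): number =>
  1 - Math.abs(newVal - oldVal) / Math.max(oldVal, 1);

// A price jump from 100 to 180 is an 80% change:
//   confidence = 1 - 80/100 = 0.2 → below 0.5, flagged for review.
// A bump from 100 to 110 gives 0.9 → auto-applied.
```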
Step 4: Notifications
```typescript
if (changes.length > 0) {
  const autoApplied = changes.filter(c => c.confidence >= 0.5);
  const flagged = changes.filter(c => c.confidence < 0.5);

  await sendTelegram({
    text: [
      `Source: ${source.name}`,
      `Auto-updated: ${autoApplied.map(c => c.field).join(', ')}`,
      flagged.length ? `Flagged for review: ${flagged.map(c => c.field).join(', ')}` : '',
      `Timestamp: ${new Date().toISOString()}`
    ].filter(Boolean).join('\n')
  });
}
```
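sendTelegram above is a thin wrapper. A sketch of what such a helper might build, using the standard Telegram Bot API sendMessage endpoint (the token and chat ID are placeholders; only the request construction is shown so it can be inspected without network access):

```typescript
// Build the request for Telegram's sendMessage endpoint.
// Actual sending is a plain fetch(); this separates construction
// from I/O so the payload can be tested offline.
function buildTelegramRequest(
  botToken: string,
  chatId: string,
  text: string
): { url: string; body: string } {
  return {
    url: `https://api.telegram.org/bot${botToken}/sendMessage`,
    body: JSON.stringify({ chat_id: chatId, text }),
  };
}

// Usage: const { url, body } = buildTelegramRequest(BOT_TOKEN, CHAT_ID, msg);
// await fetch(url, { method: "POST",
//   headers: { "Content-Type": "application/json" }, body });
```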
Error Handling Strategy
| Error Type | n8n Behavior | Custom Logic |
|---|---|---|
| Network timeout | Auto-retry 3x | Skip source, try next cycle |
| Firecrawl 429 | Exponential backoff | Add jitter (1-5s random delay) |
| AI returns invalid JSON | Retry with simpler prompt | Flag source, keep old data |
| AI returns null for all fields | No retry | Alert: source may have redesigned |
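The backoff-plus-jitter row can be expressed as a small helper. A sketch with my own base and cap values (the pipeline's actual numbers may differ); the random source is injectable so the math is testable:

```typescript
// Exponential backoff with random jitter for rate-limit (429) retries.
// Base delay doubles each attempt (1s, 2s, 4s, ...), plus a 1-5s random
// jitter so 33 sources don't all retry in lockstep; capped at 60s.
function backoffDelayMs(
  attempt: number,
  rand: () => number = Math.random
): number {
  const base = 1000 * 2 ** attempt; // 1s, 2s, 4s, ...
  const jitter = 1000 + rand() * 4000; // 1-5s random delay
  return Math.min(base + jitter, 60_000);
}
```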
Performance Numbers
| Metric | Value |
|---|---|
| Sources tracked | 33 |
| Crawl frequency | Every 6 hours |
| Average cycle time | 4 minutes |
| AI extraction accuracy | 97% |
| False positive rate | 0.3% |
| Uptime (3 months) | 99.8% |
| Monthly cost | ~$20 (VPS + API calls) |
The n8n Workflow Layout
```
[Cron Trigger: 6h]
        ↓
[Loop: for each source]
        ↓
[HTTP: Firecrawl] → [Error: skip source]
        ↓
[HTTP: Gemini AI] → [Error: retry with fallback prompt]
        ↓
[Code: JSON parse + validate]
        ↓
[Code: Diff against DB]
        ↓
[Branch: changes detected?]
  Yes → [Code: Apply changes] → [Telegram: notify]
  No  → [Continue loop]
```
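The "JSON parse + validate" node rejects extractions that don't match the expected shape before they reach the diff step. A minimal structural check as a sketch (field names follow the extraction prompt earlier; a real validator would cover all fields):

```typescript
// Minimal structural check on the AI's extraction output: an items
// array whose entries each carry a string name. Anything else routes
// to the workflow's error branch instead of the diff step.
function isValidExtraction(value: unknown): value is { items: { name: string }[] } {
  if (typeof value !== "object" || value === null) return false;
  const items = (value as { items?: unknown }).items;
  return (
    Array.isArray(items) &&
    items.every(
      (i) =>
        typeof i === "object" &&
        i !== null &&
        typeof (i as { name?: unknown }).name === "string"
    )
  );
}
```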
Lessons Learned
AI > regex for scraping. Format-agnostic extraction means zero maintenance when sources change their HTML.
Confidence thresholds prevent bad data. The 50% change threshold has caught every AI hallucination I've observed in testing.
n8n visual debugging saves hours. Clicking a node to see its input/output beats reading log files.
Self-host n8n. The cloud version works, but self-hosting on the same VPS as your database eliminates network latency for DB operations.
The platform this pipeline feeds is propfirmkey.com — a data comparison engine. The scraping pipeline keeps it updated automatically.
Have you used AI for web scraping? What's your experience with accuracy vs traditional selectors? Let me know in the comments.