DEV Community

Vinicius Porto

When the Scraper Breaks Itself: Building a Self-Healing CSS Selector Repair System

A Python sidecar that watches your scraper fail, calls a local LLM, and fixes the problem before your users notice.


The Problem with Fragile Selectors

Production web scrapers have a hidden fragility: they depend on CSS selectors and XPath expressions authored against a snapshot of a third-party website's DOM. The moment the site redesigns its layout, renames a class, or restructures a table, those selectors silently return nothing — or, worse, return the wrong data.

For a surf alert system that monitors dozens of forecast sources, this is a recurring operational problem. A selector like tr.forecast-table__row[data-row-name="wave-heigth"] (yes, a typo the upstream site just fixed) breaks at 3 AM. The scraper records a failure, the forecast pipeline stalls, and users stop receiving alerts for a beach they care about. An engineer wakes up to a Slack notification, digs through logs, finds the selector, pushes a fix, and deploys.

The fix itself usually takes under five minutes. The detection, investigation, and deploy ceremony takes two hours.

This is not a scaling problem — it is a friction problem. The actual repairs are trivially mechanical: look at the new HTML, find the element, write a new selector. Automating that loop, safely, is what the Self-Healer is for.


Core Idea in One Paragraph

When the Ruby scraper fails to extract a field (e.g., wave_height returns nil), it publishes a repair job to a Redis queue. A Python sidecar — the Self-Healer — picks up that job, fetches the current HTML from the source URL, trims it to fit a token budget, and sends a targeted prompt to a local LLM running via MLX. The LLM proposes new CSS/XPath selector candidates with confidence scores and reasoning. Each candidate is then tested against the live HTML using BeautifulSoup and lxml, and the extracted value is validated against a type schema (e.g., float:0.1-20.0 for wave height). If a candidate passes, the new selector is written directly to data_sources.selector_overrides in PostgreSQL — no redeploy required. The next scraper run reads the updated config and proceeds as if nothing happened.
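Concretely, the repair job can be as small as a JSON blob on the queue. A sketch of what such a payload might look like — the field names follow the job schema described later in the pipeline, but the values and URL are illustrative, and `expected_type` is shown here as an assumption about how the type schema travels with the job:

```python
import json

# Illustrative repair-job payload. beach_name, field_name, failed_selector,
# source_url, and last_known_good_value come from the article's job schema;
# the concrete values and the expected_type field are assumptions.
job = {
    "beach_name": "example-beach",
    "field_name": "wave_height",
    "failed_selector": 'tr.forecast-table__row[data-row-name="wave-heigth"]',
    "source_url": "https://example.com/forecast",
    "expected_type": "float:0.1-20.0",
    "last_known_good_value": 1.5,
}

payload = json.dumps(job)  # what the scraper side would push onto healer:jobs
```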


Design Principles

The shape of this system comes from a few deliberate constraints.

LLM as proposer, not decider

The model suggests selectors. Code decides whether they work. Every candidate goes through deterministic BeautifulSoup/lxml evaluation against the actual HTML before anything touches the database. The LLM output is treated like a PR from an intern: read with interest, merged only after review.

This matters because LLMs are confident even when wrong. A model can generate a plausible-looking XPath that matches the wrong column in a table — extracting the header label instead of the numeric value. Type validation catches this: "Wave Height (m)" fails float:0.1-20.0, so the candidate is rejected.

Sandbox before promotion

Candidates run against fetched HTML in-process, using the same parsing libraries the scraper would use. There is no staging environment, no shadow traffic, no A/B rollout. The sandbox is the gate. A selector that fails to match the current HTML, or that matches but extracts a value failing type validation, never reaches the database.

Escalation over hallucination

Not all broken selectors can be repaired automatically. Two cases get escalated instead of guessed:

  • JS-rendered pages: If the fetched HTML is an empty SPA shell — detected via React/Vue/Angular framework markers, sparse body text relative to total HTML size, or loading spinners — there is no static selector to find. The repair status becomes ESCALATED_JS, the data_sources row is flagged with is_js_rendered = true, and a Slack message goes to the team.
  • Exhausted retries: If the LLM generates candidates across multiple context window widths and none pass validation, the status becomes ESCALATED_HUMAN. Automated repair has reached its limit; a human needs to look.

Both paths are first-class outcomes, not error states. They produce repair log entries, update DB flags, and send structured Slack notifications — so the team has full visibility without polling logs.

Decoupling via Redis

The Ruby scraper and Python sidecar share nothing except two Redis queues: healer:jobs (input) and healer:results (output). The scraper does LPUSH healer:jobs on failure; the healer does BRPOP healer:jobs and blocks until a job arrives. There is no shared process, no RPC, no HTTP call between services. Either side can restart independently without affecting the other.
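That contract is small enough to sketch in a few lines. The following is a minimal sketch assuming a redis-py-style client (`brpop` returns a `(queue, payload)` tuple or `None` on timeout); a tiny in-memory stand-in lets it run without a live Redis:

```python
import json

def pop_job(r, queue="healer:jobs", timeout=0):
    """Blocking pop of one repair job. `r` is any client exposing a
    redis-py-style brpop() that returns (queue, payload) or None."""
    popped = r.brpop(queue, timeout=timeout)
    if popped is None:
        return None
    _queue, payload = popped
    return json.loads(payload)

def push_result(r, result, queue="healer:results"):
    """Publish a repair outcome onto the results queue."""
    r.lpush(queue, json.dumps(result))

# In-memory stand-in for Redis so the sketch runs without a server.
class FakeRedis:
    def __init__(self):
        self.queues = {"healer:jobs": ['{"field_name": "wave_height"}']}

    def brpop(self, queue, timeout=0):
        items = self.queues.get(queue, [])
        return (queue, items.pop()) if items else None

    def lpush(self, queue, payload):
        self.queues.setdefault(queue, []).append(payload)

r = FakeRedis()
job = pop_job(r)
push_result(r, {"status": "REPAIRED", "field_name": job["field_name"]})
```

Because the only shared surface is the queue names and the JSON shape, either side can be rewritten, restarted, or scaled without the other noticing.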

Local inference

The LLM runs on-device via MLX, Apple's machine learning framework for Apple Silicon. This has three practical implications: fetched HTML never leaves the machine, there are no API costs per repair, and latency is bounded by local compute rather than network round-trips. The LLM client uses an OpenAI-compatible HTTP API (POST /chat/completions), so switching to a cloud model is a one-line config change if needed.
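Since the only contract is the OpenAI-compatible endpoint shape, the client reduces to building a `POST /chat/completions` request. A sketch under stated assumptions — the `base_url` and `model` defaults are placeholders, not the project's actual configuration:

```python
import json
import urllib.request

def build_chat_request(prompt, base_url="http://localhost:8080",
                       model="local", temperature=0.3):
    """Build an OpenAI-compatible chat completion request.

    The endpoint shape (POST /chat/completions) is the only contract;
    whether it is served by a local MLX process or a cloud provider is
    a one-line config change. base_url and model here are illustrative.
    """
    url = f"{base_url}/chat/completions"
    payload = {
        "model": model,
        "temperature": temperature,  # low temperature for repeatable proposals
        "messages": [{"role": "user", "content": prompt}],
    }
    return url, payload

def send_chat_request(url, payload, timeout=60):
    """POST the request and return the assistant message content."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        data = json.loads(resp.read())
    return data["choices"][0]["message"]["content"]
```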

Traceability by default

Every repair attempt — successful or not — writes to selector_repair_logs. Jobs carry an optional scrape_failure_id that links back to the original scrape_failures row. When a repair succeeds, scrape_failures.resolved_at is stamped and resolved_by is set to 'healer'. The full audit trail is in the database, queryable, and joinable.


System Context

The Self-Healer is a container in the existing Docker Compose stack, enabled via --profile healer. It adds no new database (it reads and writes to shared PostgreSQL), no new queue infrastructure (it uses the existing Redis), and no new external dependencies beyond the local MLX process and a Slack webhook.


End-to-End Pipeline

A single repair job moves through eight steps:

1. Job arrives. RepairWorker pops from healer:jobs. The job carries beach_name, field_name, failed_selector, source_url, and optionally an html_snapshot and a last_known_good_value.

2. HTML fetch. HTMLFetcher makes a plain HTTP GET to source_url. If the job included an html_snapshot, it is used directly — useful for testing and replaying past failures.

3. JS detection. JSDetector checks the HTML for framework markers (<div id="root"></div>, data-reactroot, ng-app, etc.), loading indicators, and sparse body content relative to total HTML size. If any check fires, the job escalates immediately. Feeding an LLM trimmed markup from an empty SPA shell produces useless selectors.
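The detection logic is a set of cheap heuristics, not a parser. A minimal sketch of the idea — the marker patterns and the text-to-markup ratio threshold are illustrative, not the production values:

```python
import re

# Framework markers that suggest an SPA shell. Patterns are illustrative.
FRAMEWORK_MARKERS = [
    r'<div[^>]*\bid="root"[^>]*>\s*</div>',  # empty React mount point
    r'data-reactroot',
    r'\bng-app\b',
]

def looks_js_rendered(html, min_text_ratio=0.01):
    """Heuristic JS-rendering check: framework markers, then a sparse
    visible-text-to-markup ratio. Threshold is an assumed default."""
    for marker in FRAMEWORK_MARKERS:
        if re.search(marker, html):
            return True
    # Strip tags and collapse whitespace to estimate visible body text.
    text = re.sub(r"<[^>]+>", " ", html)
    text = re.sub(r"\s+", " ", text).strip()
    return len(html) > 0 and len(text) / len(html) < min_text_ratio
```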

4. HTML trimming. HTMLTrimmer reduces a 300 KB page to a ~5 KB snippet. It removes <script>, <style>, <svg>, and other non-content tags; strips non-selector attributes (keeping id, class, data-*, aria-*); locates the element nearest the target selector or the expected value; then walks up context_levels parent nodes to include enough surrounding structure. The default is 3 levels. Retries use 5, 7, then 9 levels.
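The "walk up `context_levels` parents" step is the heart of the trimmer. A simplified sketch using BeautifulSoup — the real trimmer also strips scripts, styles, and non-selector attributes, which this version omits, and `trim_to_context` is a hypothetical name:

```python
from bs4 import BeautifulSoup

def trim_to_context(html, match_text, context_levels=3):
    """Locate the element containing match_text (e.g. the last known
    good value), walk up `context_levels` ancestors, and return that
    subtree's markup. Returns None when the anchor text is not found."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.find(string=lambda s: s and match_text in s)
    if node is None:
        return None
    element = node.parent  # the tag that directly wraps the text
    for _ in range(context_levels):
        if element.parent is None or element.parent.name == "[document]":
            break  # reached the document root; stop widening
        element = element.parent
    return str(element)
```

Each retry calls the same routine with a larger `context_levels`, so the LLM sees progressively more surrounding structure at a progressively higher token cost.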

5. LLM call. LLMClient builds a prompt with the field name, the failed selector, the expected type, the last known good value, and the trimmed HTML snippet. It posts to the MLX OpenAI-compatible endpoint with temperature=0.3. The response is parsed into a SelectorCandidates object. If the model outputs preamble before the JSON (some models do), _extract_balanced_brace finds the {...} block regardless.
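One way such a balanced-brace helper might work — this is a from-scratch sketch of the idea behind `_extract_balanced_brace`, not the project's actual implementation; it counts braces while staying aware of JSON string literals so a `}` inside a string doesn't end the block early:

```python
def extract_balanced_brace(text):
    """Return the first balanced {...} block in text, or None.

    Tolerates model preamble before the JSON and trailing chatter
    after it. String-aware so braces inside JSON strings don't
    unbalance the count.
    """
    start = text.find("{")
    if start == -1:
        return None
    depth, in_string, escaped = 0, False, False
    for i, ch in enumerate(text[start:], start):
        if escaped:
            escaped = False          # skip the character after a backslash
        elif ch == "\\" and in_string:
            escaped = True
        elif ch == '"':
            in_string = not in_string
        elif not in_string:
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    return text[start:i + 1]
    return None  # braces never balanced
```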

6. Sandbox. SelectorSandbox runs each candidate against the full HTML using BeautifulSoup (CSS) or lxml (XPath) and extracts the matched value. Results are sorted: successful matches first, then by LLM confidence score.
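The CSS half of that evaluation fits in one small function. A sketch assuming BeautifulSoup (XPath candidates would go through lxml the same way); `run_css_candidate` is a hypothetical name:

```python
from bs4 import BeautifulSoup

def run_css_candidate(html, selector, attribute=None):
    """Evaluate one CSS candidate against the full page.

    Returns the extracted text (or attribute value), or None when the
    selector matches nothing or is malformed.
    """
    soup = BeautifulSoup(html, "html.parser")
    try:
        match = soup.select_one(selector)
    except Exception:  # the model can emit syntactically invalid selectors
        return None
    if match is None:
        return None
    if attribute:
        return match.get(attribute)
    return match.get_text(strip=True)
```

Note that a candidate can "succeed" here while still being wrong — which is exactly why the next step exists.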

7. Validation. Validator checks each passing sandbox result against the expected type spec. For float:0.1-20.0, it extracts the numeric portion via regex (handling mixed strings like "1.5ESE"), converts to float, and range-checks it. A header cell returning "Wave Height (m)" fails here.

8. Promote or retry/escalate. The first candidate that passes validation is promoted: selector_overrides is updated, the repair log is written, the scrape failure is resolved, and Slack gets a success notification. If no candidate passes and retries remain, the trimmer is called again with wider context. If retries are exhausted, the job escalates to human review.

[Sequence diagram: messages and data moving across services for a single repair.]

[Flow diagram: the internal step-by-step flow inside a single repair attempt.]


Deep Dives

Token budget and the context_levels retry loop

Sending a full HTML page to an LLM is wasteful and often impossible — 300 KB of HTML is around 75,000 tokens, well beyond typical context windows and full of noise. The trimmer's job is to find the smallest HTML fragment that still gives the LLM enough signal.

The context_levels parameter controls how far up the DOM tree to walk from the target element. At level 3 you might get table > tbody > tr > td — enough to understand the structure. At level 9 (used on the final retry) you might get the full forecast section container. Each retry widens the window, trading token cost for contextual richness.

This mirrors how a human engineer approaches the same problem: start by looking at the specific row, and if that isn't enough, zoom out to the table.
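The retry loop itself is tiny. A sketch of the orchestration, where `attempt_repair` stands in for the trim → prompt → sandbox → validate chain (the schedule values come from the article; the function names are illustrative):

```python
# Default context depth, then widening retries: 3, 5, 7, 9 levels.
CONTEXT_SCHEDULE = [3, 5, 7, 9]

def repair_with_retries(attempt_repair):
    """Run one full repair attempt per context width until a candidate
    passes validation, or escalate to a human when the schedule is
    exhausted. attempt_repair(context_levels=n) returns the promoted
    selector or None."""
    for levels in CONTEXT_SCHEDULE:
        result = attempt_repair(context_levels=levels)
        if result is not None:
            return result, "REPAIRED"
    return None, "ESCALATED_HUMAN"
```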

Prompting for structured output

The prompt template is explicit about output format: "You must respond with ONLY a single JSON object. Start your response with { and end with }." It also encodes rules that push the model toward stable selectors: prefer data-* and aria-* attributes over class names, prefer label-anchored selectors over positional nth-child, include both CSS and XPath variants when possible.

The structured output — a candidates array with selector, type, confidence, reasoning, and attribute — lets the pipeline operate deterministically on the model's output without parsing prose. The reasoning field is stored in the repair log, useful for post-incident review and prompt iteration.
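A hypothetical assembly of such a prompt — the quoted "ONLY a single JSON object" instruction is from the article, but the rest of the template wording here is a reconstruction, not the production template:

```python
# Hypothetical prompt template; only the JSON-only instruction and the
# selector-stability rules are taken from the article's description.
PROMPT_TEMPLATE = """You must respond with ONLY a single JSON object. \
Start your response with {{ and end with }}.

The selector below stopped extracting the field "{field_name}".
Failed selector: {failed_selector}
Expected type: {expected_type}
Last known good value: {last_known_good}

Rules: prefer data-* and aria-* attributes over class names; prefer \
label-anchored selectors over positional nth-child; include both CSS \
and XPath variants when possible.

HTML snippet:
{html_snippet}
"""

def build_prompt(field_name, failed_selector, expected_type,
                 last_known_good, html_snippet):
    return PROMPT_TEMPLATE.format(
        field_name=field_name,
        failed_selector=failed_selector,
        expected_type=expected_type,
        last_known_good=last_known_good,
        html_snippet=html_snippet,
    )
```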

Why type validation is not enough

A selector can match the right element type but the wrong element. Consider:

```html
<tr data-row-name="wave-height">
  <th>Wave Height (m)</th>   <!-- header -->
  <td>1.5</td>               <!-- data    -->
</tr>
```

If the LLM generates tr[data-row-name="wave-height"] th instead of td, the sandbox extracts "Wave Height (m)". Type validation rejects this because it cannot parse a float from that string. Correct outcome, for the right reason.

The deeper issue is that the healer currently validates in isolation. It does not verify that the extracted HTML structure matches what the Ruby scraper's process_wave_data method expects downstream — for example, a <td> carrying a data-swell-state JSON attribute that encodes the full swell envelope. A selector that extracts 1.5 from a plain <span> would pass today's validation but silently break the downstream extraction logic.

This is an acknowledged gap. The roadmap includes a StructuralValidator that codifies field-level DOM requirements (element type, required attributes, required children) and a ValueCrossValidator that compares extracted values against recent historical data for the same beach — catching plausible-looking outliers that type ranges alone would pass.
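To make the second idea concrete, here is one way a `ValueCrossValidator` check could look. This is a hypothetical sketch of a roadmap item, not existing code — a simple z-score test against recent history for the same beach and field:

```python
from statistics import mean, stdev

def plausible_against_history(value, history, max_sigma=3.0):
    """Hypothetical cross-validation check (roadmap, not implemented):
    reject extracted values far outside recent history, even when they
    pass the static type range. `history` is a list of recent floats
    for the same beach/field; max_sigma is an assumed threshold."""
    if len(history) < 2:
        return True  # not enough signal; fall back to type validation alone
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return value == mu  # flat history: only the same value is plausible
    return abs(value - mu) <= max_sigma * sigma
```

A reading of 18.0 m would pass `float:0.1-20.0` but fail this check at a beach whose recent waves cluster around 1.4 m.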

Limitations and Honest Boundaries

Static HTML only. The automated repair path works only when target data is present in the raw HTTP response. Single-page applications that populate content via JavaScript require a headless browser for rendering, which is a different class of tooling. For now, JS-rendered sources are escalated rather than guessed at.

Scraper coupling. Even a correctly extracted value might not flow through the pipeline if the scraper's process_* methods expect a specific DOM structure the new selector doesn't deliver. The healer validates the selector in isolation; it cannot yet confirm end-to-end compatibility with Ruby-side extraction logic.

LLM correctness. A local model running at temperature=0.3 is not a theorem prover. The validation pipeline exists precisely because the model can propose syntactically valid but semantically wrong selectors. No incorrect LLM suggestion can reach production without passing independent deterministic checks — but monitoring scrape quality over time (alerts, golden samples, periodic audits of selector_overrides) remains necessary.


Roadmap: From Reactive Repair to Proactive Extraction Engine

The current system operates in repair mode: it reacts to production failures and fixes them. The building blocks — fetch, detect, trim, prompt, sandbox, validate, promote — compose naturally into a second mode.

Scout mode would run proactively for new sources: given a URL and a target schema (wave height, period, wind speed, tide, etc.), the system discovers selectors without waiting for a failure. It adds stricter series validation — not just "does this selector extract a float?" but "does it extract a time-aligned series of consistent floats across multiple forecast periods?" — and returns a structured config the Ruby scraper can consume immediately.

The architectural upgrade this enables is significant. Instead of onboarding a new surf forecast source by manually inspecting HTML and writing selectors, an engineer submits a URL and reviews the proposed config. The same Python service handles both ongoing repair and initial discovery with shared components and a single mental model. The system stops being "a scraper with a repair button" and becomes a forecast extraction engine with two modes.
