My critic agent caught 3 data integrity bugs my main agent introduced — all in 24h

#claude #ai #python #automation

This week, my autonomous real estate agent's metrics showed 1 email submission. Then they showed 0. In between, the agent found 47 ghost strings and 5 false legal claims it had been publishing quietly for weeks.

Here's what actually happened — and what it tells you about running autonomous agents in production.

Background: BailleurVérif, 424 wakes later

BailleurVérif is a French tool that checks whether your rent is legal under encadrement des loyers (rent control law). It crawls listings from locservice.fr, cross-references them against ELAN art. 140 rent caps, and serves verdicts with cited jurisprudence.

The system runs as an autonomous agent: a Claude Sonnet instance wakes every 2 hours via cron, reads its runs/ diary, checks three inboxes (inbox.md, inbox-from-critic.md, inbox-from-strategic-critic.md), and decides what to ship. Wake 424 happened today. Wake 1 was 6 weeks ago.

What I didn't build from the start — but wish I had — was strict separation between test data and production metrics.

Bug 1: The subscriber that wasn't

On June 2, subscribers.jsonl had 2 entries. The metrics reported subscribers_real_lifetime = 2. The strategic critic scored this as a positive signal and built its T+72h success criteria around it: email_submitted >= 3.

Then the tactical critic ran (audit-57, June 3 at 07:00Z) and flagged this:

{
  "file": "subscribers.jsonl",
  "line": 2,
  "email": "smoke-strategic40-run417@example.com",
  "source": "smoke-test-run417",
  "note": "Builder self-verification curl from run-417 ship. Not a real user."
}

The executor agent (run-417) had added a smoke test to verify its own deploy, and that test wrote its entry to the same subscribers.jsonl file used for real conversions. Three runs later, email_submitted_lifetime = 1 was being cited as a positive signal, and the T+72h target (email_submitted >= 3) looked one-third met thanks to a fake entry.

There were 5 smoke entries in total across subscribers.jsonl and funnel-events.jsonl — from runs 330, 346, 376, 416, and 417. All written by the agent checking its own work. All silently counted as real user activity.

The fix (run-421):

def is_smoke(entry: dict) -> bool:
    """Return True if the entry was written by one of the agent's own smoke tests."""
    # Note: 'test' is a broad marker and would also match a real address
    # such as 'contest@...'; an explicit flag on write is more reliable.
    smoke_markers = ('smoke', 'test', 'eclp-smoke')
    for field in ('email', 'source', 'sessionId'):
        value = (entry.get(field) or '').lower()
        if any(marker in value for marker in smoke_markers):
            return True
    meta = entry.get('meta') or {}
    # bool() so callers always get True/False, never None
    return bool(meta.get('smoke')) or meta.get('src') == 'smoke'

Post-purge: subscribers_real_lifetime = 1, email_submitted_real_lifetime = 0. Honest numbers, even if they're smaller than before.

Lesson: Smoke tests and production metrics must never share the same file. Use a separate endpoint (/api/_smoke/) or a distinct JSONL with a strict naming convention. The bug is obvious in retrospect. In a cron-driven agent loop with no human reviewing every write, it isn't.
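One way to keep the two streams apart at write time is to route on an explicit flag before anything touches disk. The sketch below is not the code from the repo: `record_subscriber`, its `smoke` parameter, `count_entries`, and the `smoke-subscribers.jsonl` file name are all hypothetical names chosen for illustration.

```python
import json
from pathlib import Path

# Hypothetical file names; only subscribers.jsonl exists in the setup described above.
REAL_LOG = "subscribers.jsonl"
SMOKE_LOG = "smoke-subscribers.jsonl"

def record_subscriber(entry: dict, data_dir: Path, smoke: bool = False) -> Path:
    """Append one entry as a JSON line, routing smoke traffic to its own file."""
    target = data_dir / (SMOKE_LOG if smoke else REAL_LOG)
    with target.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return target

def count_entries(path: Path) -> int:
    """Count non-empty JSON lines; a missing file counts as zero."""
    if not path.exists():
        return 0
    with path.open(encoding="utf-8") as fh:
        return sum(1 for line in fh if line.strip())
```

With this split, a lifetime metric is just `count_entries` on the real file, with no filter to forget or get wrong later.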

Bug 2: 47 pages with `{ville}` in their JSON-LD

My agent generates programmatic pages for French cities: nantes-dpe-f-g-interdit-location.html, toulouse-dpe-f-g-interdit-location.html, and 45 others. Each page includes a JSON-LD FAQPage schema with city-specific Q&A, including:

{
  "@type": "Question",
  "name": "Le DPE est-il opposable juridiquement à {ville} ?"
}

The template variable {ville} was never substituted. Forty-seven pages had this literal string in their JSON-LD — meaning Google's Rich Results parser was seeing a malformed FAQ question for every DPE city page. This had been silently degrading Rich Results eligibility across the whole section for weeks.

The critic caught it in audit-57. Previous runs had manually fixed 2 cities. The sweep was incomplete. Run-421 fixed all 47 in one pass:

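import re
import pathlib
from typing import Optional

def extract_city(html: str) -> Optional[str]:
    # Pull the city name from the meta description: "A Nantes (44)..."
    m = re.search(r'content="[AÀ]\s+([^(]+?)\s*\(', html)
    return m.group(1).strip() if m else None

for path in pathlib.Path('wedge-tool/static').glob('*-dpe-f-g-interdit-location.html'):
    # Explicit UTF-8: the pages contain accented characters
    html = path.read_text(encoding='utf-8')
    city = extract_city(html)
    if city and '{ville}' in html:
        path.write_text(html.replace('{ville}', city), encoding='utf-8')
        print(f"Fixed: {path.name} -> {city}")
    elif '{ville}' in html:
        # No city found: leave the file alone but make the miss visible
        print(f"Skipped (no city found): {path.name}")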

Post-fix: grep -rl "{ville}" wedge-tool/static/ returned 0 results, and all 47 JSON-LD FAQPage schemas now carry the real city name. Before the fix, 1 of the 6 FAQ questions on each page (about 17%) contained the literal placeholder, on all 47 pages.

Lesson: Template substitution bugs are invisible to page rendering (browsers display HTML fine) but break structured data parsers. Add a post-build linter that greps for unresolved placeholders before deploying.
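A plain grep works, but a slightly stricter variant is to parse each JSON-LD block and inspect only its string values, so that braces in CSS or inline scripts do not trigger false positives. The following is a minimal sketch under assumptions: placeholders are taken to look like `{ville}` (braces around a single identifier), and the function names `find_unresolved` and `lint` are made up for this example. It does not check the visible page body.

```python
import json
import re
from pathlib import Path

LD_JSON = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)
# Assumption: a placeholder is braces around one identifier, e.g. {ville}.
PLACEHOLDER = re.compile(r"\{[A-Za-z_][A-Za-z0-9_]*\}")

def iter_strings(node):
    """Yield every string (keys and values) in a parsed JSON structure."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for key, value in node.items():
            yield key
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)

def find_unresolved(html: str) -> list[str]:
    """Return unresolved placeholders found inside JSON-LD blocks."""
    found = []
    for block in LD_JSON.findall(html):
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            found.append("<invalid JSON-LD>")
            continue
        for text in iter_strings(data):
            found.extend(PLACEHOLDER.findall(text))
    return found

def lint(root: Path) -> int:
    """Print offending files and return how many there are."""
    bad = 0
    for path in sorted(root.rglob("*.html")):
        hits = find_unresolved(path.read_text(encoding="utf-8"))
        if hits:
            bad += 1
            print(f"{path}: {sorted(set(hits))}")
    return bad
```

A deploy step could call `lint(Path('wedge-tool/static'))` and abort when the result is non-zero.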

Bug 3: 5 pages claiming a law that doesn't apply

This one is the most serious.

BailleurVérif has pages for the Grenoble metropolitan cluster: Grenoble, Échirolles, Eybens, Fontaine, Saint-Martin-d'Hères. These pages claimed — in title, meta description, JSON-LD Dataset, and body — that these cities had encadrement des loyers in force under ELAN art. 140.

They don't. Grenoble Alpes Métropole applied for ELAN status, but no arrêté préfectoral was ever published. The pages were generated from a template that conflated candidature (an application for the scheme) with the scheme actually being in force. For weeks, these pages told visitors (and Google) that a specific rent cap was legally binding in Grenoble. It wasn't.

Run-423 built check_legal_regime.py v2 — a 272-line script that cross-references:

An AUTHORITATIVE table of cities with confirmed décrets/arrêtés (Paris, Lille, Lyon, Bordeaux, Montpellier, Est Ensemble, Plaine Commune)
Wikipedia FR Contrôle des loyers
Service-Public.fr F1314

Each city gets a confidence_score (0–1) and a pending_legal_verification flag. Running it against all 32 encadrement pages revealed: 26 confirmed, 5 pending (Grenoble cluster), 1 explicitly non-applicable (Marseille — zone tendue but no ELAN art. 140 arrêté).
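To make the two output fields concrete, here is one way such a record could be derived. This is an illustrative sketch only: the post does not show how check_legal_regime.py computes its score, so the weights, the threshold, and the names `SOURCE_WEIGHTS`, `LegalStatus`, and `assess` are all assumptions. The inputs are yes/no answers per source, not real lookups.

```python
from dataclasses import dataclass

# Hypothetical weights: the authoritative table counts most, the two
# secondary sources only corroborate. Not taken from the real script.
SOURCE_WEIGHTS = {"authoritative": 0.6, "wikipedia_fr": 0.2, "service_public": 0.2}

@dataclass
class LegalStatus:
    city: str
    confidence_score: float           # between 0.0 and 1.0
    pending_legal_verification: bool  # True when the score is below the threshold

def assess(city: str, sources: dict[str, bool], threshold: float = 0.6) -> LegalStatus:
    """Combine per-source yes/no answers into a score and a pending flag."""
    score = sum(weight for name, weight in SOURCE_WEIGHTS.items() if sources.get(name))
    score = round(score, 2)  # avoid float noise such as 0.7999999999999999
    return LegalStatus(city, score, pending_legal_verification=score < threshold)
```

With these made-up weights, a city missing from the authoritative table can never reach the threshold on secondary sources alone, which matches the "pending is safer than confirmed" stance.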

Run-424 batch-patched all 5 pages, 17 edits each:

- <title>Loyer à Grenoble 2026 — Plafond légal 12,4 €/m² | BailleurVérif</title>
+ <title>Loyer à Grenoble 2026 — estimation observatoire 12,4 €/m² (statut légal pending)</title>

- "plafond légal applicable à Grenoble"
+ "estimation observatoire — ELAN art. 140 non confirmé par arrêté préfectoral"

85 edits total. All 5 pages now display a visible amber disclaimer and no longer present the observatory estimate as a legal cap.

Lesson: LLM-assisted content generation at scale creates E-E-A-T risk when the model extrapolates from a pattern ("city X applied for status" → "city X has status"). A fact-checking tool that queries authoritative sources is not optional; it's infrastructure.

The architecture behind the catches

The critic agent runs on the same cron. It reads the last 7 run files, checks metric deltas against source data, and writes prioritized recommendations to inbox-from-critic.md. The executor reads this inbox on every wake and either honors the recommendation or explicitly overrides it with a written WHY_THIS_NOT_THAT ritual.

The key asymmetry: the critic has no agency. It cannot write code or ship files. It can only write text. This prevents the critic from introducing its own bugs while still giving it real leverage — a flagged issue the executor ignores gets re-raised at the next audit, escalating to the strategic critic if needed.
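The "check metric deltas against source data" step can be pictured as a recount followed by a text-only note. The sketch below is a simplified stand-in, not the critic's actual logic: the function names, the `[P1]` priority tag, and the note format are assumptions, and the smoke check is reduced to a substring match on two fields.

```python
import json
from pathlib import Path

def count_real_subscribers(jsonl_path: Path) -> int:
    """Recount from source, skipping entries whose email or source mentions 'smoke'."""
    count = 0
    for line in jsonl_path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)
        tagged = f"{entry.get('email') or ''} {entry.get('source') or ''}".lower()
        if "smoke" not in tagged:
            count += 1
    return count

def audit_metric(reported: int, jsonl_path: Path, inbox: Path) -> bool:
    """Compare a reported metric with the recount; on mismatch, append a note.

    The critic only appends text to its inbox file; it never edits data or code.
    """
    actual = count_real_subscribers(jsonl_path)
    if actual == reported:
        return True
    with inbox.open("a", encoding="utf-8") as fh:
        fh.write(
            f"- [P1] subscribers_real_lifetime reported {reported}, "
            f"recount from {jsonl_path.name} gives {actual}\n"
        )
    return False
```

Because the only side effect is an appended line in the inbox, a wrong recount costs the executor one override note rather than a corrupted data file.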

All three bugs this week were caught by the critic, not by automated tests. There are no unit tests. There's check_legal_regime.py now, and a post-build grep for {ville}, but mostly the quality layer is: write runs, have a separate agent read them, close the loop within 24h.

At wake 424, the honest state is: 1 real subscriber (Marseille, via the DPE page), 0 email submissions post-CTA, 4 humans engaged lifetime, 97 sessions. The metrics are smaller than they looked before the purge. They're also real.

Takeaways

Separate smoke from signal on write, not on read. By the time you're filtering, the metric is already corrupted — and decisions have been made on it.
Template bugs survive because HTML renders fine. Structured data parsers don't forgive unresolved placeholders. Add a linter to your deploy step.
Autonomous content at scale requires a legal fact-check layer. "Pending" is always safer than "confirmed" when you're not certain of a regulatory status.

🔗 Source code (MIT): github.com/Creariax5/bailleurverif · Site: bailleurverif.fr · Wikidata: Q139857638