DEV Community: Florian Demartini

My critic agent caught 3 data integrity bugs my main agent introduced — all in 24h

Florian Demartini — Wed, 03 Jun 2026 14:39:26 +0000

This week, my autonomous real estate agent had 1 recorded email submission. Then it had 0. In between, it found 47 ghost strings and 5 false legal claims it had been publishing quietly for weeks.

Here's what actually happened — and what it tells you about running autonomous agents in production.

Background: BailleurVérif, 424 wakes later

BailleurVérif is a French tool that checks whether your rent is legal under encadrement des loyers (rent control law). It crawls listings from locservice.fr, cross-references them against ELAN art. 140 rent caps, and serves verdicts with cited jurisprudence.

The system runs as an autonomous agent: a Claude Sonnet instance wakes every 2 hours via cron, reads its runs/ diary, checks three inboxes (inbox.md, inbox-from-critic.md, inbox-from-strategic-critic.md), and decides what to ship. Wake 424 happened today. Wake 1 was 6 weeks ago.

What I didn't build from the start — but wish I had — was strict separation between test data and production metrics.

Bug 1: The subscriber that wasn't

On June 2, subscribers.jsonl had 2 entries. The metrics reported subscribers_real_lifetime = 2. The strategic critic scored this as a positive signal and built its T+72h success criteria around it: email_submitted >= 3.

Then the tactical critic ran (audit-57, June 3 at 07:00Z) and flagged this:

{
  "file": "subscribers.jsonl",
  "line": 2,
  "email": "smoke-strategic40-run417@example.com",
  "source": "smoke-test-run417",
  "note": "Builder self-verification curl from run-417 ship. Not a real user."
}

The executor agent (run-417) had added a smoke test to verify its own deploy — and written it to the same subscribers.jsonl file used for real conversions. Three runs later, email_submitted_lifetime = 1 was being cited as a positive signal, and the T+72h deadline metric was theoretically 33% satisfied by a fake entry.

There were 5 smoke entries in total across subscribers.jsonl and funnel-events.jsonl — from runs 330, 346, 376, 416, and 417. All written by the agent checking its own work. All silently counted as real user activity.

The fix (run-421):

def is_smoke(entry):
    """Return True if the entry looks like a smoke test written by the agent."""
    # Note: 'test' is a broad marker and also matches addresses such as contest@...
    smoke_markers = ('smoke', 'test', 'eclp-smoke')
    for field in ('email', 'source', 'sessionId'):
        # `or ''` guards against fields that are present but null
        value = (entry.get(field) or '').lower()
        if any(m in value for m in smoke_markers):
            return True
    meta = entry.get('meta') or {}
    # bool() so the function returns True/False, never None
    return bool(meta.get('smoke')) or meta.get('src') == 'smoke'

Post-purge: subscribers_real_lifetime = 1, email_submitted_real_lifetime = 0. Honest numbers, even if they're smaller than before.

Lesson: Smoke tests and production metrics must never share the same file. Use a separate endpoint (/api/_smoke/) or a distinct JSONL with a strict naming convention. The bug is obvious in retrospect. In a cron-driven agent loop with no human reviewing every write, it isn't.
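A minimal sketch of what separating on write could look like, assuming a JSONL layout like the one above. The smoke file name and both helper functions are hypothetical; only subscribers.jsonl comes from the actual setup.

```python
import json
from pathlib import Path

REAL_LOG = "subscribers.jsonl"
SMOKE_LOG = "subscribers-smoke.jsonl"  # hypothetical name for the smoke-only file

def record_subscriber(entry: dict, data_dir: Path, smoke: bool = False) -> Path:
    """Append one entry as a JSON line, routing smoke traffic to its own file."""
    target = data_dir / (SMOKE_LOG if smoke else REAL_LOG)
    with target.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return target

def count_real_subscribers(data_dir: Path) -> int:
    """Count lines in the production file only; smoke entries never reach it."""
    path = data_dir / REAL_LOG
    if not path.exists():
        return 0
    with path.open(encoding="utf-8") as fh:
        return sum(1 for line in fh if line.strip())
```

With this shape the metric needs no filter at read time, because a self-verification write can only land in the smoke file if the caller passes smoke=True.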

Bug 2: 47 pages with `{ville}` in their JSON-LD

My agent generates programmatic pages for French cities: nantes-dpe-f-g-interdit-location.html, toulouse-dpe-f-g-interdit-location.html, and 45 others. Each page includes a JSON-LD FAQPage schema with city-specific Q&A, including:

{
  "@type": "Question",
  "name": "Le DPE est-il opposable juridiquement à {ville} ?"
}

The template variable {ville} was never substituted. Forty-seven pages had this literal string in their JSON-LD — meaning Google's Rich Results parser was seeing a malformed FAQ question for every DPE city page. This had been silently degrading Rich Results eligibility across the whole section for weeks.

The critic caught it in audit-57. Previous runs had manually fixed 2 cities. The sweep was incomplete. Run-421 fixed all 47 in one pass:

import pathlib
import re
from typing import Optional

def extract_city(html: str) -> Optional[str]:
    # Pull city name from meta description: "A Nantes (44)..."
    # Returns None when the description does not match the expected pattern
    m = re.search(r'content="[AÀ]\s+([^(]+?)\s*\(', html)
    return m.group(1).strip() if m else None

for path in pathlib.Path('wedge-tool/static').glob('*-dpe-f-g-interdit-location.html'):
    # Explicit encoding: the pages contain accented French text
    html = path.read_text(encoding='utf-8')
    city = extract_city(html)
    if city and '{ville}' in html:
        path.write_text(html.replace('{ville}', city), encoding='utf-8')
        print(f"Fixed: {path.name} -> {city}")

Post-fix: grep -rl "{ville}" wedge-tool/static/ returned 0 results. All 47 JSON-LD FAQPage schemas are now valid. Each unresolved {ville} had affected 1 of the 6 FAQ questions per page, so about 17% of the FAQ entries on each of the 47 pages were exposing a raw placeholder to structured data parsers.

Lesson: Template substitution bugs are invisible to page rendering (browsers display HTML fine) but break structured data parsers. Add a post-build linter that greps for unresolved placeholders before deploying.
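Such a linter can be very small. The sketch below assumes placeholders are brace-wrapped identifiers like {ville} and that pages are static HTML. The function names are hypothetical.

```python
import re
from pathlib import Path

# Matches brace-wrapped identifiers such as {ville}. JSON objects do not match,
# because they contain quotes, colons, or whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{[A-Za-z_][A-Za-z0-9_]*\}")

def find_placeholders(text: str) -> list[str]:
    """Return the distinct unresolved placeholders found in one document."""
    return sorted(set(PLACEHOLDER.findall(text)))

def lint_directory(root: Path) -> dict[str, list[str]]:
    """Map each offending HTML file name to the placeholders it still contains."""
    report = {}
    for path in sorted(root.rglob("*.html")):
        found = find_placeholders(path.read_text(encoding="utf-8"))
        if found:
            report[path.name] = found
    return report
```

A deploy step could refuse to ship whenever lint_directory returns a non-empty report. Inline JavaScript can trigger false positives (for example a minified `{return}`), so pages with scripts may need an allow-list.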

Bug 3: 5 pages claiming a law that doesn't apply

This one is the most serious.

BailleurVérif has pages for the Grenoble metropolitan cluster: Grenoble, Échirolles, Eybens, Fontaine, Saint-Martin-d'Hères. These pages claimed — in title, meta description, JSON-LD Dataset, and body — that these cities had encadrement des loyers in force under ELAN art. 140.

They don't. Grenoble Alpes Métropole applied for ELAN status, but no arrêté préfectoral was ever published. The pages were generated from a template that conflated candidature with confirmed application. For months, these pages told visitors (and Google) that a specific rent cap was legally binding in Grenoble. It wasn't.

Run-423 built check_legal_regime.py v2 — a 272-line script that cross-references:

An AUTHORITATIVE table of cities with confirmed décrets/arrêtés (Paris, Lille, Lyon, Bordeaux, Montpellier, Est Ensemble, Plaine Commune)
Wikipedia FR Contrôle des loyers
Service-Public.fr F1314

Each city gets a confidence_score (0–1) and a pending_legal_verification flag. Running it against all 32 encadrement pages revealed: 26 confirmed, 5 pending (Grenoble cluster), 1 explicitly non-applicable (Marseille — zone tendue but no ELAN art. 140 arrêté).
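To make the scoring idea concrete, here is a rough sketch, not the actual check_legal_regime.py. The weights, the dataclass, and the check_city function are assumptions for illustration. Only the city table and the two output fields come from the description above.

```python
from dataclasses import dataclass

# Territories listed above as having a confirmed décret/arrêté.
AUTHORITATIVE = {"Paris", "Lille", "Lyon", "Bordeaux", "Montpellier",
                 "Est Ensemble", "Plaine Commune"}

# Hypothetical weights, chosen only for illustration.
W_AUTHORITATIVE, W_SECONDARY = 0.6, 0.2

@dataclass
class RegimeCheck:
    city: str
    confidence_score: float
    pending_legal_verification: bool

def check_city(city: str, secondary_confirmations: int) -> RegimeCheck:
    """secondary_confirmations: how many of the two secondary sources
    (Wikipedia FR, Service-Public.fr) list the city as covered, 0 to 2."""
    n = max(0, min(2, secondary_confirmations))
    score = (W_AUTHORITATIVE if city in AUTHORITATIVE else 0.0) + W_SECONDARY * n
    score = round(min(score, 1.0), 2)
    # Without an entry in the authoritative table, the page stays pending.
    return RegimeCheck(city, score, pending_legal_verification=city not in AUTHORITATIVE)
```

The design choice worth keeping is that secondary sources can raise the score but cannot clear the pending flag on their own. A real version would also need a commune-to-territory mapping for groupings such as Est Ensemble, which this sketch leaves out.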

Run-424 batch-patched all 5 pages, 17 edits each:

- <title>Loyer à Grenoble 2026 — Plafond légal 12,4 €/m² | BailleurVérif</title>
+ <title>Loyer à Grenoble 2026 — estimation observatoire 12,4 €/m² (statut légal pending)</title>

- "plafond légal applicable à Grenoble"
+ "estimation observatoire — ELAN art. 140 non confirmé par arrêté préfectoral"

85 edits total. 5 pages now display a visible amber disclaimer. E-E-A-T accuracy consolidated.

Lesson: LLM-assisted content generation at scale creates E-E-A-T risk when the model interpolates from pattern ("city X applied for status" → "city X has status"). A fact-checking tool querying authoritative sources is not optional — it's infrastructure.

The architecture behind the catches

The critic agent runs on the same cron. It reads the last 7 run files, checks metric deltas against source data, and writes prioritized recommendations to inbox-from-critic.md. The executor reads this inbox on every wake and either honors the recommendation or explicitly overrides it with a written WHY_THIS_NOT_THAT ritual.
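The "check metric deltas against source data" step boils down to recomputing a number from the raw file and comparing it with what the run reported. A minimal sketch, with hypothetical function names and a caller-supplied predicate:

```python
import json

def recount(jsonl_lines, is_real) -> int:
    """Count entries in a JSONL source that the predicate accepts as real."""
    total = 0
    for line in jsonl_lines:
        line = line.strip()
        if line and is_real(json.loads(line)):
            total += 1
    return total

def audit_metric(name: str, reported: int, jsonl_lines, is_real) -> dict:
    """Compare a reported metric with the value recomputed from source data."""
    actual = recount(jsonl_lines, is_real)
    return {"metric": name, "reported": reported,
            "recomputed": actual, "flag": reported != actual}
```

Any finding with flag set to True is what would end up as a recommendation in inbox-from-critic.md.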

The key asymmetry: the critic has no agency. It cannot write code or ship files. It can only write text. This prevents the critic from introducing its own bugs while still giving it real leverage — a flagged issue the executor ignores gets re-raised at the next audit, escalating to the strategic critic if needed.

All three bugs this week were caught by the critic, not by automated tests. There are no unit tests. There's check_legal_regime.py now, and a post-build grep for {ville}, but mostly the quality layer is: write runs, have a separate agent read them, close the loop within 24h.

At wake 424, the honest state is: 1 real subscriber (Marseille, via the DPE page), 0 email submissions post-CTA, 4 humans engaged lifetime, 97 sessions. The metrics are smaller than they looked before the purge. They're also real.

Takeaways

Separate smoke from signal on write, not on read. By the time you're filtering, the metric is already corrupted — and decisions have been made on it.
Template bugs survive because HTML renders fine. Structured data parsers don't forgive unresolved placeholders. Add a linter to your deploy step.
Autonomous content at scale requires a legal fact-check layer. "Pending" is always safer than "confirmed" when you're not certain of a regulatory status.

🔗 Code source MIT github.com/Creariax5/bailleurverif · Site bailleurverif.fr · Wikidata Q139857638

My autonomous agent scraped 35 real questions from French renters — then rewrote our homepage

Florian Demartini — Wed, 27 May 2026 14:36:07 +0000

Every product landing page I've seen — including mine — had a section that said something like "you might be worried about your lease, your deposit, your DPE rating." Pure copywriter hypothesis. This week, my autonomous agent replaced every word of it with data scraped from Reddit. Here's the exact pipeline.

Background: 370 autonomous cycles, one French housing rights tool

BailleurVérif is a French housing rights SaaS I run solo. The agent wakes every 2 hours via cron, reads a strategic critic audit every 12 hours, and executes prescriptions autonomously — no human in the loop per wake cycle. As of today: 370 wakes, 27/27 strategic prescriptions honored, 0 ScheduleWakeup calls (external cron handles pacing).

Stack: Python + Anthropic Claude API for the critic/executor pattern + SQLite funnel tracker + static HTML server. The agent commits to GitHub, pings the Indexing API, publishes datasets on data.gouv.fr. This week it made a product decision I'd been procrastinating on for weeks.

The problem: our homepage copy was entirely hypothetical

Strategic critic audit-26 (2026-05-26T21:55Z) flagged it directly:

"Homepage copy is 100% hypothetical — 'vous vous demandez peut-être si votre loyer est légal...' — while Reddit has thousands of real renter questions publicly accessible."

The prescription: scrape 30+ questions from French subreddits, build a data-driven page, publish the JSON dataset as CC-BY 4.0, then use that corpus to swap the homepage hero. Builder-only. Zero human action required from me.

The scraping pipeline

The agent wrote scrape_reddit_locataires_run367.py. Two rounds, rate-limited at 2s/request, with an identified bot UA:

import urllib.request, urllib.parse, json, time

SUBREDDITS = "france+paris+immobilier+vosfinances+AskFrance"
UA = "BailleurVerifBot/0.1 (+bailleurverif.fr; contact@bailleurverif.fr)"

TAG_QUERIES = {
    "loyer-abusif": [
        "loyer abusif encadrement refus propriétaire",
        "loyer trop élevé que faire locataire",
    ],
    "dpe-invalide": [
        "DPE faux fraude loyer augmenté",
        "dpe invalide recours locataire",
    ],
    "depot-garantie-non-restitue": [
        "caution non rendue délai légal",
        "dépôt garantie non restitué propriétaire",
    ],
}

results = []
for tag, queries in TAG_QUERIES.items():
    for q in queries:
        url = (
            f"https://www.reddit.com/r/{SUBREDDITS}/search.json"
            f"?q={urllib.parse.quote(q)}"
            "&restrict_sr=1"  # keep results inside the listed subreddits
            "&sort=top&limit=25&t=month"
        )
        req = urllib.request.Request(url, headers={"User-Agent": UA})
        with urllib.request.urlopen(req, timeout=10) as resp:
            data = json.loads(resp.read())
        for post in data["data"]["children"]:
            p = post["data"]
            results.append({
                "tag": tag,
                "title": p["title"],
                "subreddit": p["subreddit"],
                "score": p["score"],
                "num_comments": p["num_comments"],
                "id_hash": p["id"],  # no username stored
            })
        time.sleep(2)

Round 1 produced 50 raw results. Round 2 added r/Locataires and tightened the queries. Final dataset: 35 questions.

The quality filter: why 50 → 35 matters

The filter pass mattered more than the scrape volume:

Title-anchor required: must contain a tag-specific word (loyer, DPE, caution, garantie, encadrement)
Renter scope only: not a landlord asking about their property
Anti-noise blacklist: "copropriété", "syndic", "locaux commerciaux" — adjacent topics, out of scope

What got cut: 8 landlord questions, 4 commercial lease questions, 3 homeowner renovation questions.
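The three filter rules can be sketched as one predicate. The anchor words and the blacklist come from the list above. The per-tag grouping of anchors and the landlord markers are assumptions, since the original scope check is not shown.

```python
# Anchor words from the filter description, grouped per tag (grouping assumed).
ANCHORS_BY_TAG = {
    "loyer-abusif": ("loyer", "encadrement"),
    "dpe-invalide": ("dpe",),
    "depot-garantie-non-restitue": ("caution", "garantie"),
}
BLACKLIST = ("copropriété", "syndic", "locaux commerciaux")
# Hypothetical phrases suggesting the author is a landlord, not a renter.
LANDLORD_MARKERS = ("mon locataire", "je suis propriétaire", "je suis bailleur")

def keep_question(title: str, tag: str) -> bool:
    """Apply title-anchor, blacklist, and renter-scope checks to one post title."""
    t = title.lower()
    if not any(anchor in t for anchor in ANCHORS_BY_TAG.get(tag, ())):
        return False
    if any(term in t for term in BLACKLIST):
        return False
    return not any(marker in t for marker in LANDLORD_MARKERS)
```

Keyword matching on the title is crude, so a version like this is best treated as a first pass before a manual or model-based review of borderline posts.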

Final distribution: loyer-abusif: 26 / dpe-invalide: 6 / depot-garantie: 3.

Score range: 8 to 347 upvotes. The agent sorted by score within each tag and picked the highest-scoring question per tag, 3 in total, for the homepage hero.
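The selection step, keeping the highest-scoring question for each tag, might look like this. The function name is hypothetical and the records follow the dataset fields (tag, score).

```python
def top_question_per_tag(questions: list[dict]) -> dict[str, dict]:
    """Return, for each tag, the question with the highest score."""
    best: dict[str, dict] = {}
    for q in questions:
        current = best.get(q["tag"])
        if current is None or q["score"] > current["score"]:
            best[q["tag"]] = q
    return best
```

On ties the first question seen wins, which keeps the result stable for a given input order.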

The output: 776 lines of HTML + a CC-BY 4.0 JSON dataset

build_questions_reelles_page_run367.py generated /questions-reelles-locataires-fr.html with:

JSON-LD: WebPage + BreadcrumbList + FAQPage (3 legal questions) + Dataset + Organization
35 questions by category, anonymized (id_hash only, subreddit + score shown)
Direct links to the legal templates matching each question tag
CC-BY 4.0 downloadable JSON dataset (27 KB)

Dataset schema:

{
  "metadata": {
    "title": "Questions réelles de locataires FR — Reddit 2026-05",
    "source": "Reddit public search API",
    "license": "CC-BY 4.0",
    "scrape_date": "2026-05-27",
    "filter_methodology": "tag-anchor title match + renter scope + blacklist anti-noise",
    "total_questions": 35
  },
  "questions": [
    {
      "id_hash": "e95b9fed0029",
      "tag": "loyer-abusif",
      "title": "Mais qui respecte l'encadrement des loyers pour les petites surfaces ?",
      "subreddit": "paris",
      "score": 93,
      "num_comments": 41
    }
  ]
}

The homepage hero swap — 12 hours later

Strategic audit-27 arrived at 10:00Z the next morning with a single prescription: use the corpus to replace the hypothetical homepage intro. The agent picked 3 questions (one per tag, highest score), added citation badges, and modified 22 lines of index.html.

Before:

<p>Vous vous demandez peut-être si votre loyer respecte l'encadrement légal,
si votre DPE est valide, si votre propriétaire peut garder votre caution...</p>

After: 3 real questions with r/paris · 93 votes badges. A CTA linking to all 35 questions. Commit 47404ed, smoke-tested via curl -s https://bailleurverif.fr/ | grep "loyer abusif" ✅.

The whole cycle — audit-26 prescribed, agent built and shipped, audit-27 prescribed homepage swap, agent shipped — completed in under 16 hours, with zero messages from me.

The side discovery: 60% of "direct" visits were bots

While auditing the funnel this week, the agent cross-referenced session IDs against raw UA strings in visits.jsonl. The result: 159 raw "direct" sessions → 63 plausibly human after UA filtering.

The filter:

import re

BOT_UA_PATTERNS = [
    "Googlebot", "Applebot", "Yandex", "HeadlessChrome", "GoogleOther",
    "PerplexityBot", "GPTBot", "ClaudeBot", "CCBot", "Bytespider",
    "bot", "crawler", "spider",
]
HEADLESS_CLEAN_VERSION = re.compile(r"Chrome/14[5-9]\.0\.0\.0")
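# Heuristic: flags UAs whose Chrome version is zeroed, e.g. "148.0.0.0".
# Caveat: with Chrome's User-Agent Reduction, regular Chrome also sends "148.0.0.0",
# so treat a match as a weak signal rather than proof of headless traffic.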

def is_bot(ua: str) -> bool:
    ua_lower = ua.lower()
    if any(b.lower() in ua_lower for b in BOT_UA_PATTERNS):
        return True
    return bool(HEADLESS_CLEAN_VERSION.search(ua))

Two visits the agent had labeled "plausible human" in earlier runs turned out to be Googlebot Nexus 5X and YandexRenderResourcesBot. The agent now tracks direct_humans_after_ua_filter_lifetime = 63 as a separate metric from raw visit counts.

The 60% noise estimate is probably conservative: sessionId=null visits (JS not executed) add another layer of bot signal the agent is now checking.
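Combining both signals could look like the sketch below. It assumes each line of visits.jsonl is a JSON object with a "ua" field (an assumed name) and the sessionId field mentioned above. The marker list is a shortened stand-in for the full pattern list.

```python
import json

# Shortened marker list for illustration; "bot" already covers most named crawlers.
BOT_MARKERS = ("bot", "crawler", "spider", "headlesschrome", "googleother")

def is_plausible_human(visit: dict) -> bool:
    """Reject known bot UAs and visits where the page's JavaScript never ran."""
    ua = (visit.get("ua") or "").lower()  # "ua" field name is an assumption
    if not ua or any(marker in ua for marker in BOT_MARKERS):
        return False
    # A missing sessionId suggests client-side JS was not executed.
    return visit.get("sessionId") is not None

def count_direct_humans(jsonl_lines) -> int:
    """Count visits that pass both the UA check and the sessionId check."""
    return sum(1 for line in jsonl_lines
               if line.strip() and is_plausible_human(json.loads(line)))
```

Treating a null sessionId as a bot signal will also drop real visitors who block JavaScript, so the resulting counter is a lower bound on human traffic.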

Lessons from this cycle

Reddit public JSON is underused for user research. No auth needed, 2s/request rate limit is acceptable, and you get score + comment count as engagement proxy for free. Query it before writing your landing page copy.
Filter harder than you scrape. 50 → 35 with strict title-anchor + scope criteria. The 15 removed would have diluted the page signal and made the homepage swap weaker. Quality over volume.
Real user questions beat copywriter hypothesis every time. We had no surveys, no user interviews. But 26 questions tagged loyer-abusif with scores ranging from 8 to 347 are more actionable than "you might be wondering if..."
Track humans_after_ua_filter from day one. Raw visit counts are noise-heavy. Cross-reference UA strings. Ship a derived counter early.

🔗 Code source MIT github.com/Creariax5/bailleurverif · Site bailleurverif.fr · Wikidata Q139857638

90 pages with broken Rich Results. My autonomous agent found them, fixed them, and rewrote its own monitoring.

Florian Demartini — Wed, 20 May 2026 14:35:48 +0000

Twelve days into running my solo SaaS on an autonomous agent, I watched it silently fix a structured data bug across 90 pages — then patch its own monitoring sub-agent to detect the same pattern forever.

Here's the actual log.

The product

BailleurVérif is a French rental compliance checker built on open data (data.gouv.fr + ADIL jurisprudence). It generates programmatic HTML pages to answer questions like "is my Paris apartment rent legally capped?" — no account needed.

The agent running it fires every hour via cron, reads server logs and memory files, makes decisions, ships code, and commits to GitHub. No human in the loop unless I send an explicit brief.

The bug that hid for 11+ cycles

For 11+ wake cycles — roughly 11 consecutive hours — 90 HTML files were serving invalid BreadcrumbList JSON-LD. The missing field: the item property on ListItem position 2.

Google's Rich Results parser requires both name AND item for every breadcrumb position. Without item, the breadcrumb rich result is silently dropped from SERP — no error, no warning, just quietly gone.

// BROKEN (what was live)
{
  "@type": "ListItem",
  "position": 2,
  "name": "Encadrement des loyers Paris"
}

// VALID (what it should be)
{
  "@type": "ListItem",
  "position": 2,
  "name": "Encadrement des loyers Paris",
  "item": "https://bailleurverif.fr/loyer-legal-paris.html"
}

I noticed it via GSC URL Inspection — the tool showed "BreadcrumbList is invalid" for the Paris page I'd shipped 9 hours earlier. I sent a brief to the agent at 09:45Z: "Fix missing item field on BreadcrumbList position 2 — 81+ pages."

The agent's response in one wake cycle

Run-321 started at 10:00Z. By 11:00Z:

1. Committed the fix across 90 files

The agent ran a Python str.replace pass. It turned out 90 files had the bug, not just 81, because the wider grep caught more templates. One commit:

commit 3ee81da
fix: add missing item field on BreadcrumbList position 2 (81+ pages)

90 files: 31 encadrement-loyer + 50 DPE F/G + 9 connexes
(guide-bailleur, scanner-arnaque, irl-revision-loyer, etc.)

2. Wrote a permanent discipline document

The agent created memory-agent/concepts/seo-discipline.md (+80 lines): the correct JSON-LD pattern, 6 canonical hub URLs, 4 anti-patterns, and a rule that sub-seo-monitor should detect this automatically going forward.

Not as a one-time note — as a concept file that every future wake loads as context.

3. PATCHed its own monitoring sub-agent

This is the part I find genuinely interesting. The agent sent an HTTP PATCH to sub-seo-monitor — a Haiku sub-agent running nightly — to add a new audit task between existing tasks 2 and 3:

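import json
import re

def audit_breadcrumbs(html_content: str) -> dict:
    """Count BreadcrumbList entries on one page that lack the `item` URL."""
    missing = []
    for script in re.findall(
        r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
        html_content, re.DOTALL
    ):
        try:
            data = json.loads(script)
        except json.JSONDecodeError:
            continue  # skip malformed JSON-LD blocks
        # A script tag may hold a single object or a list of objects
        for node in data if isinstance(data, list) else [data]:
            if isinstance(node, dict) and node.get("@type") == "BreadcrumbList":
                for entry in node.get("itemListElement", []):
                    if "item" not in entry:
                        missing.append(entry)
    # Key name matches the alert rule; per page it counts entries missing `item`
    return {"pages_with_missing_item": len(missing)}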

# Alert rule: if pages_with_missing_item >= 1 → prepend inbox.md HEAD

The sub-seo-monitor prompt went from 3,301 → 5,766 characters (+2,465 chars). The backup hash was logged: 81a0184d8f687290. The sub-agents registry was updated with last_update_run=run-321.

From now on: any HTML template regression that reintroduces a missing item field gets caught within 24 hours.
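The alert rule itself, prepending to the head of inbox.md when the audit finds at least one offender, can be sketched as follows. Both function names and the alert wording are hypothetical.

```python
from pathlib import Path

def prepend_alert(inbox: Path, message: str) -> None:
    """Put the message at the top of the inbox so the next wake reads it first."""
    existing = inbox.read_text(encoding="utf-8") if inbox.exists() else ""
    inbox.write_text(message.rstrip("\n") + "\n\n" + existing, encoding="utf-8")

def report_breadcrumb_audit(inbox: Path, pages_with_missing_item: int) -> bool:
    """Write an alert when the audit count is >= 1; return whether one was written."""
    if pages_with_missing_item >= 1:
        prepend_alert(
            inbox,
            f"ALERT: {pages_with_missing_item} BreadcrumbList entries missing `item`",
        )
        return True
    return False
```

Prepending rather than appending matters here because the executor reads the inbox from the top on every wake.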

What happened in the 12 hours after the fix

Independently that same day, the SEO infrastructure closed a loop I'd been waiting on for weeks:

Googlebot WRS Mobile rendered the homepage with JavaScript for the first time.

The proof is in server.log. Three consecutive requests from IP 66.249.73.129 (verified Googlebot):

2026-05-20T06:40:00Z  GET /                        200
2026-05-20T06:40:01Z  GET /api/changelog?limit=5   200   ← JS-only endpoint
2026-05-20T06:40:02Z  POST /api/visit               200

/api/changelog is called exclusively by client-side JavaScript on the homepage. A plain HTML crawler never hits it. Googlebot hitting it means Googlebot is actually executing our JS.
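The same check can be automated over parsed log records. This sketch assumes each record is a dict with "ua" and "path" keys, which is an assumed shape rather than the real server.log format, and it does not verify the requester's IP.

```python
from urllib.parse import urlsplit

# Endpoint described above as reachable only through client-side JavaScript.
JS_ONLY_PATHS = {"/api/changelog"}

def js_rendering_hits(records, js_only_paths=JS_ONLY_PATHS):
    """Return requests where a Googlebot UA fetched a JS-only endpoint."""
    hits = []
    for record in records:
        path = urlsplit(record["path"]).path  # drop the query string
        if "googlebot" in record["ua"].lower() and path in js_only_paths:
            hits.append(record)
    return hits
```

Because user agents can be spoofed, a hit only counts as proof once the source IP has been confirmed as Googlebot, as was done for 66.249.73.129 here.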

That same day, 9 distinct bot crawls hit the Paris page within 12 hours of it going live — from 4 independent channels:

Googlebot Mobile WRS (rendered JS, see above)
Google-InspectionTool/1.0 (rare signal, likely GSC quality check)
GPTBot/1.3 (OpenAI LLM ingestion pipeline)
Generic AWS/Bing crawlers

// dashboard-extras.json, 12h post-ship
{
  "bot_hits_24h": 60,
  "bot_hits_lifetime": 118,
  "gptbot_today": 11,
  "last_googlebot": "2026-05-20T08:43:24Z"
}

Same week: the agent added Wikidata entity Q139857638 to the site's Organization JSON-LD sameAs array, and made the footer links to GitHub and Wikidata visible. Moat category-4 count went from 2 → 3 substantive components.

Stack

Agent runtime: Claude claude-opus-4-6 (Builder Opus) running the main cron wake; Claude claude-haiku-4-5 (sub-agents: sub-seo-monitor, sub-observatoire-publisher, sub-critic, sub-linkedin-drafter)
Memory: flat .md files in memory-agent/ (concepts, decisions, kpis, snapshots) — no vector DB, no embeddings, just structured Markdown loaded at wake start
Orchestration: cron 0 * * * * on a Linux VPS, each wake = 1 Claude API call, time-boxed 15 min
Sub-agent management: local Node.js agent-browser server with PATCH/GET API, agents registered in sub-agents-registry.json
HTML generation: Python str.replace on templates, 90 static files, committed and pushed via GitHub PAT
SEO signals: JSON-LD (Organization, BreadcrumbList, FAQPage, Dataset), IndexNow pings, sitemap.xml auto-generated
Data: data.gouv.fr reuse 6a0c30a, ADIL jurisprudence scraping, observatoire 121-wave cross-analysis (57.6% violation rate nationally)

Takeaways

Silent structured data bugs are insidious. Google drops invalid Rich Results without noise. The only detection path is GSC URL Inspection or a dedicated nightly audit — not your server logs.
Patching your own monitoring is the actual fix. The breadcrumb code fix took 3 minutes. Writing the discipline doc and PATCHing the sub-agent prompt took 12 more. But now any template regression is caught within 24h automatically, forever.
Googlebot rendering JS is a measurable milestone. The gap between "crawls HTML" and "executes JavaScript" matters for JS-heavy pages. server.log is your proof: look for client-side-only API endpoints in the Googlebot user agent trail.
Flat Markdown memory beats vector stores for small autonomous agents. The agent's memory-agent/concepts/ directory is just structured .md files. It loads relevant files at wake start, writes new ones when it learns something. No embedding pipeline, no retrieval latency. For under 200 files, simple grep and read is fast enough.
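For scale, "simple grep and read" can be as little as the sketch below. The load_concepts function and its keyword matching are assumptions about how such a loader could work, not the agent's actual code.

```python
from pathlib import Path

def load_concepts(memory_dir: Path, keyword: str) -> dict[str, str]:
    """Return the contents of every .md file that mentions the keyword."""
    selected = {}
    for path in sorted(memory_dir.glob("*.md")):
        text = path.read_text(encoding="utf-8")
        if keyword.lower() in text.lower():
            selected[path.name] = text
    return selected
```

A call such as load_concepts(Path("memory-agent/concepts"), "BreadcrumbList") would pull in seo-discipline.md at wake start, with no index to build or keep in sync.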

🔗 Code source MIT github.com/Creariax5/bailleurverif · Site bailleurverif.fr · Wikidata Q139857638
