My agent shipped a UX fix — then diagnosed why it couldn't measure if it worked

#claude #ai #automation #saas

My agent shipped a capture fix. 28 hours of monitoring later, it flagged its own measurement failure.

You ship a UX improvement. You set up proper monitoring. You wait. Then your critic agent comes back with: "MISS ≥80% confidence — the root cause is not UX, it's acquisition."

That happened this week on bailleurverif.fr.

The fix

Two weeks ago, my strategic critic agent issued a prescription: ship a capture optimization patch to improve email signups at the verdict screen — the moment a user sees whether their apartment listing is legally compliant. The hypothesis: reduce friction at that critical step → more emails collected.

Commit 1f0f669 went live. The agent entered PAUSE-AND-MEASURE mode: no new features, no other UX changes. The protocol, set by the critic:

Wait 72 hours
Collect ≥30 qualifying human visits (users who reached a verdict_displayed event)
Then evaluate
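The protocol above boils down to a two-condition gate. Here is a minimal sketch of it, assuming the agent tracks elapsed hours and a count of qualifying visits; the class name `MeasureGate` and the status strings are hypothetical, not taken from the project's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MeasureGate:
    """Pause-and-measure gate: evaluate only once both thresholds are met."""

    window_hours: float = 72.0   # from the critic's protocol
    required_sample: int = 30    # qualifying humans (verdict_displayed)

    def status(self, elapsed_hours: float, qualifying_visits: int) -> str:
        # The window must elapse first, whatever the sample looks like.
        if elapsed_hours < self.window_hours:
            return "wait"
        # Window elapsed but too few humans: the result is not interpretable.
        if qualifying_visits < self.required_sample:
            return "insufficient_sample"
        return "evaluate"


gate = MeasureGate()
print(gate.status(elapsed_hours=28, qualifying_visits=3))
```

Keeping "insufficient_sample" separate from "evaluate" is the point: an underpowered window should never be read as a failed experiment.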

What the data showed at T+28h

{
  "visits_total": 421,
  "last_new_visit": "2026-06-09T20:56:49Z",
  "visit_source": "Googlebot",
  "humans_engaged_lifetime": "5-6",
  "email_submitted_lifetime": 0,
  "subscribers_by_intent": {"unset": 1},
  "humans_via_seo_cluster_93_pages": 0,
  "gap_flat_consecutive_hours": 28
}

421 total visits. The last new visit was a Googlebot at 20:56 UTC. For 28 consecutive hours after that: zero new qualifying humans.
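For readers who want to reproduce this kind of count, here is a rough sketch of separating qualifying humans from bot traffic in a JSONL event log. The field names (`visitor_id`, `event`, `user_agent`) and the bot marker list are assumptions made for illustration; the project's real schema may differ.

```python
import io
import json

# Illustrative substrings only; a real bot filter would need a fuller list.
BOT_MARKERS = ("googlebot", "bingbot", "crawler", "spider")


def count_qualifying_humans(lines, event_name="verdict_displayed"):
    """Count distinct non-bot visitors with at least one `event_name` event."""
    visitors = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        agent = event.get("user_agent", "").lower()
        if any(marker in agent for marker in BOT_MARKERS):
            continue  # skip crawler traffic
        if event.get("event") == event_name:
            visitors.add(event["visitor_id"])
    return len(visitors)


# Hypothetical sample log: one human reaching a verdict twice, one human
# who only viewed a page, and one crawler.
sample = io.StringIO(
    '{"visitor_id": "a1", "event": "verdict_displayed", "user_agent": "Mozilla/5.0 Firefox"}\n'
    '{"visitor_id": "a1", "event": "verdict_displayed", "user_agent": "Mozilla/5.0 Firefox"}\n'
    '{"visitor_id": "b2", "event": "page_view", "user_agent": "Mozilla/5.0 Safari"}\n'
    '{"visitor_id": "g3", "event": "verdict_displayed", "user_agent": "Googlebot/2.1"}\n'
)
qualifying = count_qualifying_humans(sample)
print(qualifying)
```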

The critic's evaluation was blunt:

# Criteria set by Strategic Critic (audit-53):
# required_sample  = 30    # qualifying humans (verdict_displayed)
# window_hours     = 72
# actual_at_T28h   = 3     # humans
# projected_at_T72h = 10-15  # humans
# projected_verdicts = 1-2
# P(email_submitted >= 1)  = 0.10 to 0.40  # absolute
#
# Verdict: MISS >= 80% confidence
# Reason:  insufficient sample, IRRESPECTIVE of UX quality
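One way to sanity-check a verdict like this is to ask how likely the sample target is under a simple arrival model. The sketch below assumes visits follow a Poisson process, which is my assumption and not necessarily how the critic reasoned, and plugs in the optimistic end of the critic's own projection (15 humans at T+72h). The function names are hypothetical.

```python
import math


def poisson_tail(k_min: int, lam: float) -> float:
    """P(X >= k_min) for X ~ Poisson(lam)."""
    if k_min <= 0:
        return 1.0
    below = sum(math.exp(-lam) * lam**k / math.factorial(k) for k in range(k_min))
    return max(0.0, 1.0 - below)


def prob_reach_sample(observed: int, required: int, projected_total: float) -> float:
    """Chance of still reaching `required` given a projected final total."""
    expected_more = max(projected_total - observed, 0)
    return poisson_tail(required - observed, expected_more)


# 3 humans observed at T+28h, 30 required, optimistic projection of 15.
p_hit = prob_reach_sample(observed=3, required=30, projected_total=15)
print(f"P(reach 30 qualifying humans) = {p_hit:.6f}")
```

Under these assumptions the chance of hitting the sample is far below 1%, so the MISS call does not depend on anything about the patch itself.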

The patch might be perfect. The patch might be terrible. There's no way to know, because there's almost no one to test it on.

The real bottleneck: acquisition, not UX

This is the insight that made the week worth writing about.

The tactical critic's verdict (audit-71, scored 8.2/10) was not "the UX fix failed." It was something more uncomfortable: the bottleneck is structural and sits in acquisition traffic, not in UX capture. You cannot measure conversion improvements at 2-3 qualifying humans per day.

My acquisition breakdown for bailleurverif.fr:

Organic pull via LLM (ChatGPT, Perplexity, etc.): ~1 qualifying session per week
SEO cluster for Seine-Saint-Denis city pages (Saint-Denis, Montreuil, Saint-Ouen): 0 humans in 22h
Target set by critic: ≥3 humans via city pages by 2026-06-11T22:00Z → MISS ≥85% confidence

The critics raised this as ★★★ priority — the tier that overrides the usual no-interruption protocol and triggers a direct flag to me. The asymmetry: my response time is ~30 seconds to acknowledge, versus a 10-minute agent action to document it. Worth the interrupt.

What the agent did while waiting

In the same session, the strategic critic's audit-55 issued one carve-out exception to the pause: enrich one new city page in the cluster-93 SEO pool.

The agent chose Saint-Ouen-sur-Seine over Bobigny — deliberately. Saint-Ouen belongs to EPT Plaine Commune (same administrative grouping as Saint-Denis and Montreuil, both empirically validated pages). Bobigny belongs to Est Ensemble — a different EPCI, no prior data. EPCI alignment was the tiebreaker.

The enrichment on encadrement-loyer-saint-ouen-2026.html added:

FAQPage JSON-LD with 5 Q&A pairs grounded in cited legal sources: loi ELAN art. 140, décret 2020-1502 (Plaine Commune's entry into the encadrement des loyers rent-control scheme), loi 3DS art. 79, ADIL 93, TJ Bobigny
A comparison table: Plaine Commune cap (25.2€/m²) vs Paris 17e/18e (27.7-33.2€/m²)
A section linking to the observatoire dataset (1 conforming Saint-Ouen listing + 1 violation from Saint-Denis for context)
A CTA preset: /scan-url.html?q=saint-ouen
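For reference, FAQPage markup follows the schema.org shape of a `mainEntity` list of `Question` items, each with an `acceptedAnswer`. Below is a minimal sketch of generating it from Python. The helper names are hypothetical and the question and answer strings are placeholders, not the page's real content.

```python
import json


def faq_jsonld(pairs):
    """Build a schema.org FAQPage object from (question, answer) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }


def as_script_tag(data):
    """Serialize JSON-LD into a script element for a static HTML page."""
    payload = json.dumps(data, ensure_ascii=False, indent=2)
    # Escape "</" so answer text cannot close the script element early.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Placeholder content only; real entries should cite verified sources.
faq = faq_jsonld([("Placeholder question?", "Placeholder answer with a cited source.")])
tag = as_script_tag(faq)
print(tag)
```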

Page grew from 255 → 358 lines (+40%). One Indexing API ping sent. Then: stop. Back to baseline monitoring.

The ECLI hallucination protocol

One more thing worth documenting: 4 files with hallucinated ECLI identifiers (European Case Law Identifiers, here referencing French court judgments, in the format ECLI:FR:CCASS:...) were flagged this week.

The SB-2 safeguard was introduced after two prior incidents in which pages shipped with court citations that looked authoritative but were fabricated by the agent. The fix: any ECLI reference must be verified against Judilibre (the Cour de cassation's official open-data API) before shipping. Files with unverified ECLIs get quarantined, not published.
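A quarantine gate of this kind can be sketched in a few lines. The version below is an illustration under stated assumptions, not the project's SB-2 code. The verification step is passed in as a callable (`is_verified`, hypothetical) so that no particular Judilibre endpoint is assumed, and the identifiers in the example are deliberately fake.

```python
import re
from typing import Callable

# General ECLI shape: ECLI:<country>:<court>:<year>:<ordinal>
ECLI_PATTERN = re.compile(r"ECLI:[A-Z]{2}:[A-Z0-9]{1,7}:\d{4}:[A-Z0-9.]{1,25}")


def extract_eclis(text: str) -> list[str]:
    """Return the distinct ECLI-shaped identifiers found in `text`."""
    # Trailing dots are usually sentence punctuation, not part of the id.
    found = (match.rstrip(".") for match in ECLI_PATTERN.findall(text))
    return sorted(set(found))


def triage(text: str, is_verified: Callable[[str], bool]) -> tuple[str, list[str]]:
    """Publish only if every ECLI in the text passes verification."""
    unverified = [ecli for ecli in extract_eclis(text) if not is_verified(ecli)]
    return ("quarantine" if unverified else "publish", unverified)


# Fake identifiers and a stand-in verifier, for illustration only.
known_good = {"ECLI:FR:CCASS:2099:XX00001"}
page = "See ECLI:FR:CCASS:2099:XX00001 and also ECLI:FR:CCASS:2099:XX00002."
decision, unverified = triage(page, lambda ecli: ecli in known_good)
print(decision, unverified)
```

A well-formed identifier proves nothing about whether the judgment exists, which is why the regex only extracts candidates and the decision rests entirely on the external check.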

Four such files accumulated. The agent prepared a batch purge decision file and required explicit human acknowledgment to execute — because deleting legal content that might be partially correct has asymmetric risk.

I didn't respond within 24 hours. The agent logged it as "trust juridique drift accepté audit trail" (roughly: legal-trust drift accepted, recorded in the audit trail) and moved on. The files are still there, quarantined.

Lesson: even a well-designed safeguard can accumulate silent drift if the human in the loop responds too slowly.

Stack

bailleurverif.fr
├── Static HTML (no framework — SEO first)
├── Python agents (Claude Sonnet 4.6 via Anthropic SDK)
│   ├── Builder executor    — cron */2h
│   ├── Tactical Critic     — cron ~*/12h, scores sessions 0-10
│   ├── Strategic Critic    — weekly, moat + architecture review
│   └── Sub-agents          — SEO monitor, content syndicator, Bluesky
├── Browserbase + Playwright (observatoire scraping)
├── SQLite + JSONL          (funnel events, visits, ledger)
└── OVH + Google Indexing API

The multi-critic setup has been the highest-leverage architectural decision in this project. Without an independent tactical critic scoring each session, the agent optimizes for activity rather than outcomes — shipping things that look productive but don't move any real metric.

Takeaways

Know your minimum viable sample before shipping UX experiments. 3 humans in 28 hours is not enough signal. An independent critic calculated this and flagged it. Without that check, the agent would have waited out the full 72h window and concluded "the UX didn't work": the wrong lesson from the wrong diagnosis.
Autonomous agents need an acquisition loop, not just a conversion loop. Building a better funnel is useless at 2-3 qualifying humans per day. The real work right now is seeding acquisition: Wikipedia FR edits, semantic pull from LLMs, city page SEO at scale.
Safeguards create silent drift when human response times are slow. The ECLI quarantine system works — but 4 files accumulated while I wasn't watching. Design for the loop, not just the rule.
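To put a number on the first takeaway, here is a rough sketch using the standard normal-approximation formula for comparing two conversion rates. The baseline and target rates (5% and 10%) are hypothetical, since the site has no measured conversion rate yet, and the 2.5 humans/day figure is simply the midpoint of the 2-3 per day quoted above.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(p_base, p_target, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant to detect p_base -> p_target."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p_base * (1 - p_base) + p_target * (1 - p_target)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p_base - p_target) ** 2)


def days_to_collect(n, humans_per_day):
    """Days needed to observe n qualifying humans at a steady daily rate."""
    return math.ceil(n / humans_per_day)


# Hypothetical rates: 5% baseline email capture, 10% after the fix.
n = sample_size_per_variant(p_base=0.05, p_target=0.10)
days = days_to_collect(n, humans_per_day=2.5)
print(f"{n} qualifying humans per variant, about {days} days at 2.5/day")
```

Even with a generous assumed effect, the required sample runs into the hundreds per variant, which at current traffic means months rather than a 72-hour window.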

The next strategic audit will likely prescribe a hard pivot: stop optimizing UX, start building acquisition. The agent that figures out how to get 30 qualifying humans per week will be worth measuring.

🔗 Source code (MIT): github.com/Creariax5/bailleurverif · Site: bailleurverif.fr · Wikidata: Q139857638