DEV Community: Sergei Parfenov

The Bug That Crashes Your Import Is the Lucky One

Sergei Parfenov — Fri, 31 Jul 2026 12:43:45 +0000

This is a submission for DEV's Summer Bug Smash: Clear the Lineup powered by Sentry.

You are migrating a 50,000-message Slack workspace to Zulip. Somewhere around message 31,000 the import dies with KeyError: 'ts'. Annoying, but here is the uncomfortable part: that is the lucky outcome. The unlucky one is "ts": "NaN", where nothing dies, nothing warns, and your company's message history quietly comes out in the wrong order.

TL;DR: Zulip's Slack importer used float(message["ts"]) unguarded, both as a sort key and as date_sent. One message with a missing or malformed ts aborted the entire import; a non-finite value like "NaN" did not even raise, it silently broke the sort. My fix (zulip/zulip#39813) skips such messages with a warning and requires ts to parse to a finite float via math.isfinite. The regression test fails with KeyError: 'ts' on the old code.

Project Overview

Zulip is an open-source team chat server (Django/Python, ~25k stars) with an unusually strict engineering culture: near-total backend test coverage, strict mypy, and a commit discipline of "each commit is a minimal coherent idea".

The code I touched lives in zerver/data_import/: the subsystem that converts exports from Slack, Microsoft Teams, and Mattermost into Zulip's format. This subsystem has one property that should shape every line in it: the input is another tool's output. Import is a long batch process over data of arbitrary quality, and the admin running the migration has no way to "fix" what Slack's export tool produced. A pipeline that dies on record 31,207 of 50,000 is strictly worse than one that skips record 31,207 with a warning.

Bug Fix or Performance Improvement

get_messages_iterator() in zerver/data_import/slack.py streams every message of the export, sorting each day's messages by timestamp:

yield from sorted(messages_for_one_day, key=get_timestamp_from_message)

where the sort key was simply:

def get_timestamp_from_message(message: ZerverFieldsT) -> float:
    return float(message["ts"])

That one line has three distinct failure modes, all reproduced during validation:

1. ts is missing. KeyError: 'ts' straight out of sorted(...). The whole import aborts because of one message:

File "zerver/data_import/slack.py", line 911, in get_messages_iterator
  yield from sorted(messages_for_one_day, key=get_timestamp_from_message)
File "zerver/data_import/slack.py", line 1461, in get_timestamp_from_message
  return float(message["ts"])
KeyError: 'ts'

2. ts is garbage. "not-a-number" raises ValueError. Same total abort.

3. ts is "NaN". The nasty one. float("NaN") is a perfectly valid parse, so nothing raises. But NaN is incomparable (NaN < x and x < NaN are both False), which violates the total ordering Timsort assumes, so sorted() silently returns an inconsistent order:

sorted(["1434139102.000002", "NaN", "1434139101.000001"], key=float)
# => ['1434139102.000002', 'NaN', '1434139101.000001']   # not sorted, no error

No exception, no warning. Just a migrated archive with scrambled chronology. Crashing is the lucky case; this is the case that costs you a re-migration three weeks later when someone notices the history reads wrong.
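Both halves of that claim can be checked with the standard library alone. This is a minimal sketch with illustrative variable names, not code from the importer: every ordering comparison against NaN is False, and filtering on math.isfinite before sorting restores a well-defined order.

```python
import math

raw = ["1434139102.000002", "NaN", "1434139101.000001"]

nan = float("NaN")
# Every ordering comparison involving NaN is False, in both directions,
# and NaN is not even equal to itself.
nan_comparisons = (nan < 1.0, 1.0 < nan, nan == nan)

# Drop anything that does not parse to a finite float, then sort the rest.
finite_only = sorted(
    (ts for ts in raw if math.isfinite(float(ts))),
    key=float,
)

print(nan_comparisons)  # (False, False, False)
print(finite_only)      # ['1434139101.000001', '1434139102.000002']
```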

One honesty note: I did not discover this failure class from zero. The Zulip maintainers' own audit issue (#39650) flags the unguarded timestamp sort key as one of the two highest-impact robustness items, and it was unclaimed when I picked it up. What I brought is the implementation, the non-finite analysis, and the test that pins the behavior.

Code

PR: zulip/zulip#39813 (branch P0rt:slack-ts-guard, single commit, +54/-0 across the module and its test file).

The fix is a predicate plus a guard, following the skip-with-warning pattern that already exists two lines above for other unimportable messages:

import logging
import math

def message_has_valid_timestamp(message: ZerverFieldsT) -> bool:
    """Whether `ts` is present and parses to a finite float.

    `get_timestamp_from_message` is used both as a sort key and for
    `date_sent`, so a missing or malformed `ts` on a single message
    would otherwise abort the entire import — and a non-finite value
    like "NaN" would not even raise, instead silently producing an
    inconsistent sort order.
    """
    try:
        return math.isfinite(float(message["ts"]))
    except (KeyError, TypeError, ValueError):
        return False

# Guard, inside the loop over messages (excerpt):
if not message_has_valid_timestamp(message):
    logging.warning(
        "Skipping Slack message with invalid ts %r in %s",
        message.get("ts"), message_dir,
    )
    continue

The detail that matters: math.isfinite, not a bare try/except around float(). A naive "does it parse" check would have fixed the two loud failure modes and waved the silent one straight through.

My Improvements

The test had to fail first. I wrote test_get_messages_iterator_skips_invalid_timestamps: a temp directory with one day of Slack export containing five messages, two valid (deliberately in reverse chronological order), one with no ts, one with "not-a-number", one with "NaN". Then I rolled the source file back to upstream/main, kept the test, and ran it: KeyError: 'ts', exactly the predicted crash. Restored the fix: the test asserts that exactly the two valid messages survive, in correct order, with exactly three warnings logged. A regression test that never failed against the old code is a test fitted to the fix, not a test of the fix.
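The shape of such a test can be sketched outside Zulip's test harness. Everything here is a stand-in: has_valid_ts mirrors the predicate from the PR, while iter_sorted_valid and the test class are hypothetical names that replace the real importer and its temp-directory fixture with an in-memory list.

```python
import logging
import math
import unittest
from collections.abc import Iterator
from typing import Any


def has_valid_ts(message: dict[str, Any]) -> bool:
    """Stand-in for the PR's predicate: ts is present and parses to a finite float."""
    try:
        return math.isfinite(float(message["ts"]))
    except (KeyError, TypeError, ValueError):
        return False


def iter_sorted_valid(messages: list[dict[str, Any]]) -> Iterator[dict[str, Any]]:
    """Hypothetical stand-in for the importer: skip bad ts with a warning, sort the rest."""
    valid = []
    for message in messages:
        if not has_valid_ts(message):
            logging.warning("Skipping message with invalid ts %r", message.get("ts"))
            continue
        valid.append(message)
    yield from sorted(valid, key=lambda m: float(m["ts"]))


class SkipInvalidTimestampsTest(unittest.TestCase):
    def test_skips_invalid_timestamps(self) -> None:
        messages = [
            {"text": "second", "ts": "1434139102.000002"},  # valid, deliberately first
            {"text": "first", "ts": "1434139101.000001"},   # valid, deliberately second
            {"text": "no ts field"},
            {"text": "garbage", "ts": "not-a-number"},
            {"text": "nan", "ts": "NaN"},
        ]
        # assertLogs captures records sent to the root logger at WARNING or above.
        with self.assertLogs(level="WARNING") as logs:
            result = list(iter_sorted_valid(messages))
        self.assertEqual([m["text"] for m in result], ["first", "second"])
        self.assertEqual(len(logs.output), 3)
```

To see the red run in this sketch, replace the body of has_valid_ts with `return True`: the sort key then reaches the message without ts and the test fails with KeyError: 'ts'.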

Full-suite validation in a provisioned dev environment, not just the happy path: ./tools/test-backend zerver.tests.test_slack_importer (56/56 passing), ./tools/lint and ./tools/run-mypy clean, and test-backend --coverage showing zero uncovered lines in zerver/data_import/slack.py after the change, including the new guard path.

One decision I explicitly left open for reviewers: skipping a message with a broken ts versus synthesizing a fallback timestamp for it. Skipping loses the message but keeps the archive honest; a synthetic timestamp keeps the message but fabricates chronology. I went with skip-plus-warning because it matches the file's existing conventions, and flagged the tradeoff in the PR instead of pretending it does not exist.

Best Use of Sentry

I instrumented the demo with sentry-sdk (traces_sample_rate=1.0, environment bugsmash-demo) and ran the exact same poisoned export twice: once with zerver/data_import/slack.py rolled back to upstream/main, once with the fix in place. Same branch, same data, one file different.

Before the fix. The import span dies with KeyError: 'ts', attached right on the trace, status internal_error.

Drilling into the issue gives the full stacktrace, pointing exactly where this PR points: get_messages_iterator -> get_timestamp_from_message.

And Seer's root-cause analysis of that issue: the diagnosis is spot on, down to quoting the exact poisoned message it pulled from the frame locals ({"channel_name": "general", "text": "no ts field"}).

One detail worth being honest about, precisely because a Sentry engineer judges this category: Seer's suggested one-liner, float(message.get("ts", 0)), is the fallback-timestamp option I explicitly declined in the PR. It fixes the missing-ts mode, but leaves the ValueError mode alive, waves "NaN" straight through into the sort, and stamps real messages with a 1970 date_sent. Its closing advice, though (log a warning and investigate the upstream data quality), is exactly what the shipped fix does. Reading Seer's output critically instead of pasting its patch is, I would argue, the intended use of the tool: it found the root cause in seconds; deciding the failure policy stayed my job.
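The three claims about that one-liner are easy to check in isolation. A minimal sketch, where fallback_timestamp is a hypothetical wrapper around the suggested expression and the message dicts are toy inputs:

```python
import math
from datetime import datetime, timezone


def fallback_timestamp(message: dict) -> float:
    """The suggested one-liner: default a missing ts to 0."""
    return float(message.get("ts", 0))


# 1. A missing ts no longer raises, but the message is dated to the Unix epoch.
missing = fallback_timestamp({"channel_name": "general", "text": "no ts field"})
epoch_year = datetime.fromtimestamp(missing, tz=timezone.utc).year

# 2. A malformed ts still raises ValueError.
try:
    fallback_timestamp({"ts": "not-a-number"})
    garbage_raises = False
except ValueError:
    garbage_raises = True

# 3. "NaN" parses without complaint and would flow into the sort key unchecked.
nan_passes = math.isnan(fallback_timestamp({"ts": "NaN"}))

print(epoch_year, garbage_raises, nan_passes)  # 1970 True True
```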

After the fix. Same export, same transaction: zero issues, and the three skipped messages surface as three warning-level logs on the trace instead of one fatal error.

That pair of traces is the whole fix, told by monitoring: the same bad input, downgraded from an unhandled exception that kills a migration to three structured warnings on a completed run. (A pedantic footnote for anyone reading the trace metadata: both runs report release c0b5d8c because the red run rolled back only the module file, not the branch HEAD.)

The uncomfortable takeaway from failure mode 3: "does it crash" is a terrible proxy for "is it correct", and the failure modes that skip the crash are the ones that survive into production. Which raises the question I left for the reviewers, and now for you: for a message that is otherwise fine but has a broken timestamp, would you skip it or synthesize a fallback date_sent? I picked skip. Convince me otherwise in the comments.

Nothing Was Broken. The Report Still Didn't Arrive.

Sergei Parfenov — Sun, 26 Jul 2026 13:50:33 +0000

This is a submission for DEV's Summer Bug Smash: Clear the Lineup powered by Sentry.

On July 24 at 09:00 Berlin time, the doctors group did not get its daily digest. It got the first half of one, unfinished. Nobody noticed.

The failure was recorded correctly, in a state file that nothing reads. The job's delivery mode was none. The fallback alert channel had been dead since June 13, for reasons that turn out to be bug four. So the run failed, the record was written, and the information stopped there.

Here is how our team actually learns that something broke. Different job, same pipeline:

The failure the team actually saw: a doctor's scheduled job died and she had to tag the CTO in chat. This is what "silent" pipelines look like when they finally speak.

Some context, since everything below assumes it. Symptomato is a telehealth service: patients describe symptoms in a chat, doctors answer them from a helpdesk. Sympy is the agent that works the seam between the two — on a schedule it reads the doctors' inbox and tells them, in Telegram, which conversations need a human today. Nobody watches it run. That is the point of it, and it is also why a broken run can go unnoticed for six weeks.

So I went looking for the bug behind the missing digest. That is the uncomfortable part: there wasn't one.

TL;DR: an agent run failed with no defective component anywhere in the path. That non-incident triggered a full audit of the pipeline, which turned up four bugs in our own code that nobody had ever seen fail. All four have the same shape: a function that does not know something reports a value instead of admitting it. This post is those four fixes, plus the instrumentation that makes this failure class visible, plus a deliberate decision to record none of the message content while doing it.

Project Overview

Under the hood: a self-hosted agent runtime with a job scheduler, a tool plugin we wrote on top of Chatwoot (the helpdesk our doctors work in), a set of host-side Python cron scripts, and Telegram as the delivery channel. Three scheduled jobs do the boring work: a morning digest of conversations that need attention, pings when a paid consultation goes 24 hours without a doctor reply, inbox prechecks.

One constraint shapes everything below: patient conversations contain medical text. Any observability we add has to work without recording message bodies.

Bug Fix or Performance Improvement

The incident with no bug in it

The digest agent sent its first Telegram message, then decided to finalize the digest by editing that message instead of sending a second one. The message tool requires a recipient field even for an edit. The edit call did not have one. The tool rejected it, the turn ended, the scheduler marked the run as error.

Walk that path again and look for the defect. The tool validated its input exactly as its schema says. The scheduler recorded the failure exactly as designed. The model picked a legal tool sequence that the prompt never forbade. Every component behaved to spec, and the doctors still had no digest.

This is the failure mode that makes agent pipelines different: the execution path is chosen at runtime by a model, so "correct components" and "correct behavior" stop being the same claim. Yesterday the same job sent one message and worked. Today it chose send-then-edit and did not. There is no line of code you can point at, and no test that fails, because nothing is deterministic enough to fail.

Once instrumented, the same failure looks like this. The capture is from the replay run — the July 24 failure itself predates the instrumentation — and it needed no code changes at the failure site:

The anchor bug, captured automatically: the model chose "send, then edit", and the edit call has no "to" field, which is the recipient the message tool requires. The tool that failed lives in the runtime, not in our code. We never instrumented it; the error was scraped from the gateway's ERROR log lines and turned into an issue.

Seer reconstructs the failure from breadcrumbs alone — no prompts, no message bodies were ever recorded.

The four bugs the audit did find

A missing digest with no defect in it is a bad place to stop, so I audited the pipeline properly: the tool plugin, the scheduler jobs, the host scripts, the existing telemetry extensions. It came back with a list. Four items on that list turned out to be the same bug wearing different clothes, and I only saw the pattern once I wrote them down next to each other.

1. The tool reports a conversation status it never looked up. getConversationSummary returns status: "open" as a literal, for every conversation, always. The real status sits in the API response one method below. So when the agent asks "is this conversation still open", it is told yes, unconditionally. Every judgment the model makes about whether to act on a conversation rests on a constant.

2. The tool returns page one and calls it the inbox. listConversationsByInbox fetches page=1 and returns the payload with no all_count check and no truncation marker. Chatwoot paginates at 25. Our host-side Python script got this right and loops until the count matches; the agent tool never did. So on any day with more than 25 open conversations, the model receives 25 and narrates them as the full picture, which is exactly the sort of confident summary you cannot catch by reading the output.

3. "The data isn't there yet" gets cached as a fact. tariff_from_triage_note returns "unknown" for two very different situations: the note is unparseable, and the note has not been posted yet. The caller caches the result unconditionally, and "unknown" is truthy, so the next run short-circuits on the cache and never looks again. A patient whose chat is scanned in the window before the backend posts its triage note is marked unknown permanently, which means the 24-hour ping that exists specifically for paid one-time consultations never fires for them.

4. The error handler reports failures through the channel that just failed. Our host digest script calls a Telegram helper with no error handling, catches the resulting exception, and then tries to report that exception with the same helper, which raises again. Since June 13 the script has ended in a double traceback — 128 of them in the log — and every one of those runs paid for a model call before crashing. Next to it, the monitor script ends its send with > /dev/null 2>&1, so a failed alert is indistinguishable from a delivered one anywhere in the system.

Now the shape. In every one of these, something the system does not know is represented as something it does know. Unknown status becomes "open". Twenty-five of thirty becomes "the inbox". A missing note becomes a tariff value, cached forever. A failed alert becomes a successful one. Three of the four never raise at all; the fourth raises twice a day into a log nobody reads, which comes to the same thing. All of them produce plausible output. And the incident that started the audit is the same thing one level up: a failed run represented as nothing at all.

Code

The status bug is the smallest and my favorite, because the correct value was already in scope. The whole thing is one line:

return { id: conversationId, status: "open", messages: formatted };

status is a literal. The real value sits in the conversation payload that the method one level below already fetches — the fix is to read it from there instead of asserting it.

Pagination was ported from the host script that already did it right, plus an explicit truncation marker on message reads, so a partial conversation announces itself instead of passing as complete.
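The loop-until-the-count-matches idea can be sketched without touching the real API. In this sketch fetch_page is a hypothetical callable that returns one page plus the total the server reports, and the truncated flag plays the role of the truncation marker, applied to the listing for brevity.

```python
from collections.abc import Callable
from dataclasses import dataclass

# Hypothetical page fetcher: takes a 1-based page number and returns
# (items_on_that_page, total_count_reported_by_the_server).
FetchPage = Callable[[int], tuple[list[dict], int]]


@dataclass
class Listing:
    items: list[dict]
    total: int
    truncated: bool  # True when fewer items were collected than the server reported


def list_all(fetch_page: FetchPage, max_pages: int = 50) -> Listing:
    """Fetch pages until the collected count reaches the reported total."""
    items: list[dict] = []
    total = 0
    for page in range(1, max_pages + 1):
        batch, total = fetch_page(page)
        items.extend(batch)
        if not batch or len(items) >= total:
            break
    return Listing(items=items, total=total, truncated=len(items) < total)


# A fake inbox of 30 conversations, served 25 per page.
inbox = [{"id": i} for i in range(1, 31)]


def fake_fetch(page: int, page_size: int = 25) -> tuple[list[dict], int]:
    start = (page - 1) * page_size
    return inbox[start:start + page_size], len(inbox)


full = list_all(fake_fetch)
capped = list_all(fake_fetch, max_pages=1)
print(len(full.items), full.truncated)      # 30 False
print(len(capped.items), capped.truncated)  # 25 True
```

The second call is the old behavior made honest: it still returns 25 items, but the caller can see that 25 is not the whole inbox.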

The tariff cache stops recording ignorance as knowledge:

# before
tariff = tariff_from_triage_note(cid, cw_token)
entry["tariff"] = tariff          # cache: note content is immutable

# after
tariff = tariff_from_triage_note(cid, cw_token)
if tariff != "unknown":           # "not posted yet" is not a value
    entry["tariff"] = tariff

This does not distinguish the two "unknown"s either; it stops trusting them. An unparseable note now costs a re-read every run instead of being wrong forever, which is the trade I want.
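The difference shows up on the second run. A minimal sketch, assuming the caller short-circuits on a truthy cached value as described earlier; the resolve functions, the entry dict and the tariff name are all hypothetical stand-ins for the real script:

```python
from collections.abc import Callable


def resolve_tariff_buggy(entry: dict, lookup: Callable[[], str]) -> str:
    if entry.get("tariff"):           # "unknown" is a non-empty string, so truthy
        return entry["tariff"]
    tariff = lookup()
    entry["tariff"] = tariff          # caches "unknown" as if it were an answer
    return tariff


def resolve_tariff_fixed(entry: dict, lookup: Callable[[], str]) -> str:
    if entry.get("tariff"):
        return entry["tariff"]
    tariff = lookup()
    if tariff != "unknown":           # only a real answer is worth caching
        entry["tariff"] = tariff
    return tariff


def run_twice(resolve: Callable[[dict, Callable[[], str]], str]) -> tuple[str, str]:
    """Run 1 happens before the triage note exists, run 2 after it is posted."""
    entry: dict = {}
    first = resolve(entry, lambda: "unknown")
    second = resolve(entry, lambda: "paid_one_time")  # hypothetical tariff name
    return first, second


print(run_twice(resolve_tariff_buggy))  # ('unknown', 'unknown')
print(run_twice(resolve_tariff_fixed))  # ('unknown', 'paid_one_time')
```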

And the alerting path: the Telegram helper handles its own failures and logs them, the error handler no longer depends on the channel that just failed, and chunking — a fifth thing I fixed while I was in there — splits outside HTML tags instead of through them.
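One way to read the first two of those changes, as a sketch rather than the actual script: send_safely and report_failure are hypothetical names, the send callable stands in for the Telegram helper, and stderr is an assumed last-resort channel.

```python
import logging
import sys
from collections.abc import Callable

log = logging.getLogger("digest")


def send_safely(send: Callable[[str], None], text: str) -> bool:
    """Try to deliver; on failure, log the traceback and return False instead of raising."""
    try:
        send(text)
        return True
    except Exception:
        log.exception("alert delivery failed")
        return False


def report_failure(send: Callable[[str], None], error: Exception) -> bool:
    """Report an error without depending on the alert channel being alive."""
    delivered = send_safely(send, f"digest failed: {type(error).__name__}")
    if not delivered:
        # Last resort that cannot fail the same way: the process's own stderr.
        print(f"UNDELIVERED ALERT: {error!r}", file=sys.stderr)
    return delivered


def broken_channel(text: str) -> None:
    raise ConnectionError("telegram unreachable")


delivered = report_failure(broken_channel, RuntimeError("digest crashed"))
print(delivered)  # False
```

The return value is the point: a failed alert is now a False that the caller can act on, not a second traceback and not a silent success.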

One thing I am deliberately not calling a fix: the prompt line telling the digest job to send once and never edit. It patches today's symptom and it took ten seconds, but it is an instruction to a stochastic system, not a repair. The actual answer to the opening incident is not in the prompt. It is that a failed run now reaches someone, instead of a file nothing reads.

My Improvements

Every fix went red before it went green. The pagination and tariff bugs got standalone repro scripts against local mocks, with the buggy implementation copied verbatim, so the failure is demonstrated rather than argued:

open conversations in inbox:      30
agent check_inbox list_open sees: 25  (ids 1..25)
host precheck list_open sees:     30  (ids 1..30)
=> 5 conversations invisible to the LLM agent, no truncation signal returned

run 1 (triage in progress): tariff='unknown'  cached={'tariff': 'unknown'}
run 2+ (note now present):  tariff='unknown'  (cache short-circuits, note never re-read)
24h ping fires: False   (correct behaviour would be: True)

For the digest incident itself, the replay went into a private test group rather than the doctors group, running the same job with the same shape of payload:

RED replay: the agent sends, then tries to edit — the edit dies, the message stays raw. GREEN: same job, one send, no edit.

The RED runs are the interesting ones. The message is still sitting there in its raw, pre-edit state, which is precisely how this failed in production: not with an absence, but with a half-finished artifact that looks close enough to a real digest to be skimmed past.

Best Use of Sentry

The instrumentation is the other half of the fix, because the four bugs above are the ones I found. The category of "silent wrong answer" is not exhausted by an audit, so the pipeline needs to be able to report on itself. Three independent sources now feed one stream: errors scraped from the runtime's own error log lines, failures inside our agent tools, and cron monitors that stop hearing from a job.

Three different origins, one stream: (1) a core tool error scraped from the runtime's ERROR log lines, (2) a failure inside one of our own agent tools, (3) a cron monitor that stopped hearing from its job. The unnumbered rows are the same machinery catching unrelated faults.

When one of our tools fails, the HTTP call that caused it arrives attached, which turns "the agent said something odd this morning" into a five-second read:

An agent tool failure with the exact HTTP call that caused it — attached automatically as breadcrumbs.

Every tool call is now a span with the attributes that matter for this pipeline, and none of the attributes that would put patient text into a third-party system:

Every agent tool call is a span: tool name, action, numeric conversation id, latency, error status. Nothing in that list is message content — that is the whole design.

The tool's own Input and Output tabs are where the arguments and the result would sit. They are empty, and that is deliberate:

The same run in Sentry's AI view — model calls and tool calls in sequence, with the selected tool's Input tab empty: arguments and responses are never recorded (PHI).

Model calls get the same treatment, including the ones nobody is watching, which for this agent is most of them:

One span per model call — model, provider, conversation id, latency, time to first byte. Emitted for background cron runs too, which is where this agent does most of its work.

And they correlate, which is what makes an agent run debuggable at all: the model call, the tool call it triggered, and the outbound HTTP request that tool made, in one waterfall.

One agent run, one trace: the model call, the tool call it triggered, and the outbound HTTP request — correlated through the runtime's own trace context.

Token usage and cost land at run level, keyed by conversation:

Run-level usage: tokens in/out and dollar cost per agent run, keyed by conversation id. It is the cheapest possible answer to "what is this agent costing us".

Then the piece that speaks to the failure mode underneath the opening incident. Error monitoring catches runs that fail. It cannot catch runs that stop happening, and it cannot catch a delivery that quietly goes nowhere. Cron monitors turn a schedule into an expectation, and check-ins into evidence. (The red one below is the host digest script from bug four, not the doctors' digest from the opening: two different jobs, one failure mode.)

Three host crons, monitored from code (no UI setup). The red one is the script from bug four — it has been failing since June 13, and the monitor is the first thing in six weeks to say so out loud.

Missed, missed, failed. A job that stops running produces no error at all, and a job that fails into a log nobody reads produces none that anyone sees — a cron monitor turns both into the same alert.

Six weeks of that script crashing on schedule would have been one alert on day one.
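The idea behind a cron monitor fits in a few lines. This is a toy model of the concept, not Sentry's implementation: classify, Slot, the ten-minute grace window and the check-in format are all assumptions made for illustration.

```python
from datetime import datetime, timedelta
from enum import Enum


class Slot(Enum):
    OK = "ok"
    FAILED = "failed"
    MISSED = "missed"


def classify(
    expected: list[datetime],
    checkins: dict[datetime, bool],
    grace: timedelta = timedelta(minutes=10),
) -> list[Slot]:
    """For each expected run time, look for a check-in inside the grace window."""
    results = []
    for due in expected:
        hit = next(
            (ok for at, ok in sorted(checkins.items()) if due <= at <= due + grace),
            None,
        )
        if hit is None:
            results.append(Slot.MISSED)   # the job never reported in at all
        else:
            results.append(Slot.OK if hit else Slot.FAILED)
    return results


first_due = datetime(2026, 7, 24, 9, 0)
expected = [first_due + timedelta(days=n) for n in range(3)]
# Only the third run reported in, and it reported a failure.
checkins = {expected[2] + timedelta(minutes=1): False}

print([slot.value for slot in classify(expected, checkins)])
# ['missed', 'missed', 'failed']
```

The schedule supplies the expectation, so silence becomes a result in its own right instead of the absence of one.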

On the PHI side, the deliberate choice: prompt and response recording is off. The safe list is model id, provider, token counts, cost, durations, tool name and action, numeric conversation ids, session key, job id, outcome. Anything string-valued that comes from message content goes through redaction or does not get sent. The result is a monitoring stack that can tell me a tool failed, which tool, on which conversation id, how long it took and what it cost, and cannot tell me or anyone else what the patient wrote.
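A safe list like that is easiest to enforce as an allowlist rather than a blocklist. A minimal sketch, assuming hypothetical attribute names for the items in the list above; the real key names and the redaction step are not shown in this post.

```python
from typing import Any

# Hypothetical attribute names for the safe list above.
SAFE_KEYS = frozenset({
    "model_id", "provider", "tokens_in", "tokens_out", "cost_usd",
    "duration_ms", "tool_name", "tool_action", "conversation_id",
    "session_key", "job_id", "outcome",
})


def safe_attributes(raw: dict[str, Any]) -> dict[str, Any]:
    """Keep only allowlisted keys; everything else is dropped before sending."""
    return {key: value for key, value in raw.items() if key in SAFE_KEYS}


event = {
    "tool_name": "check_inbox",
    "tool_action": "list_open",
    "conversation_id": 1042,
    "duration_ms": 311,
    "outcome": "error",
    "message_body": "patient text that must never leave the host",
    "arguments": {"query": "free text"},
}

print(sorted(safe_attributes(event)))
# ['conversation_id', 'duration_ms', 'outcome', 'tool_action', 'tool_name']
```

With an allowlist, a new field added to a tool's payload is invisible to monitoring until someone decides it is safe, which is the failure direction you want when the payload can contain patient text.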

Sentry offers a full conversation transcript for AI traces, and for this agent it is empty by construction: "This conversation's messages weren't captured".

That empty transcript is not a gap in the setup. It is the setup.

What I took away

The bug I went looking for did not exist, and the four I found were all the same bug wearing different clothes: a component that could not distinguish "I don't know" from a value, and picked a value. In ordinary code, that produces a wrong answer somewhere downstream and usually an exception eventually. In a pipeline where a language model reads those answers, it produces a fluent, confident, well-formatted summary of a reality that is 25 conversations wide instead of 30, and nobody downstream has any way to tell.

So the question I am still working on, and would like yours on: in your own systems, how do you tell the difference between an agent run that went fine and an agent run that quietly took a path that does not work? Error rates will not show it. Neither will the output, because the output always looks great.

'World Models' Will Be the Next Buzzword. The Man Saying That Just Raised $1B to Build One

Sergei Parfenov — Fri, 24 Jul 2026 12:10:03 +0000

In March, the CEO of a research lab with zero products closed a $1.03 billion seed round — the largest in European history. Then he told TechCrunch that "'world models' will be the next buzzword," predicting that within six months every company would slap the label on itself to raise money.

The CEO is Alexandre LeBrun. The lab is AMI Labs, Paris. The chairman is Yann LeCun. And LeBrun was right on schedule: VCs have pushed roughly $3B into the category since.

When the person behind the biggest bet in a space tells you the space is about to become a content-free buzzword, that's the frame worth keeping. The question isn't whether "world model" gets diluted — it will. The question is whether there's a real, testable architectural disagreement underneath the label.

There is. This post is the paper trail: what LeCun is actually claiming, what the published results show, who paid for it, and the strongest arguments that the whole thesis is wrong.

Who is LeCun, in 30 seconds

Turing Award 2018 (shared with Hinton and Bengio). Architect of convolutional neural networks. At Meta (then Facebook) from 2013, where he built FAIR into one of the largest industrial research labs on the planet, and chief AI scientist from 2018. His departure was confirmed in November 2025, after Meta folded FAIR into its new superintelligence org, put roughly $14.3B into Scale AI, and reoriented around Llama and generative products.

The split was architectural, not financial. Meta isn't investing in AMI Labs — but the two keep a research partnership around the architecture this whole story runs on.

The case against LLMs, as he makes it

LeCun's critique predates the current hype cycle by years, and it's more specific than "LLMs bad." His standard list of what's missing: understanding of the physical world, persistent memory, reasoning, and planning. The supporting arguments:

Autoregression compounds errors. Every generated token conditions on possibly-wrong previous tokens. Fine for prose. Bad for long-horizon plans, where one early mistake invalidates everything downstream.
Text is a thin slice of reality. By his estimate, a four-year-old has taken in more raw sensory data through vision alone than the largest text corpora contain. Text describes the world; it doesn't contain its dynamics.
The falling pen. Drop a pen and it can land many ways. A system predicting one most-likely continuation in surface space is doing a different computation than one reasoning over a distribution of physical outcomes.
The stakes argument. LeBrun ran Nabla, a medical AI company, and arrived at the same place from the applied side: in healthcare, hallucination isn't a UX bug. It's a liability category.
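The compounding-error argument is usually stated as simple arithmetic. A sketch under the simplifying assumption that each step fails independently with the same probability, which is exactly the assumption critics of the argument dispute:

```python
def p_sequence_correct(per_step_error: float, steps: int) -> float:
    """Probability that all `steps` are correct, assuming independent errors."""
    return (1.0 - per_step_error) ** steps


# A 1% per-step error rate, over increasingly long horizons.
for steps in (10, 100, 1000):
    print(steps, round(p_sequence_correct(0.01, steps), 3))
# 10 0.904
# 100 0.366
# 1000 0.0
```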

At VivaTech this year, LeCun said current chatbots understand the physical world worse than a rat. AMI's corporate framing is more measured: token prediction works well for discrete, low-dimensional tasks — retrieval, summarization, code — and mimics intelligence without modeling the world.

One thing worth flagging before we go further: this critique was formulated before RL-trained reasoning models existed. The "no reasoning, no planning" claim is weaker in 2026 than it was in 2022. The honest version of the argument today is about the reliability and grounding of that reasoning — not its existence.

"World model" currently means three different things

This is the part most coverage skips, and it's where the buzzword damage will happen. At least three technically distinct approaches share the label:

Generative video prediction. Predict future frames, conditioned on actions. Google DeepMind's Genie 3 generates navigable 3D worlds at 24fps; NVIDIA's Cosmos (launched at CES 2025, 2M+ downloads, trained on ~20M hours of video) targets synthetic data for robotics and AVs; Runway's GWM-1 bets on interactive video. The model's imagination is literally watchable.
Explicit 3D representation. Fei-Fei Li's World Labs treats the world as a spatial object, not a frame sequence. Marble (shipped November 2025) generates 3D environments on Gaussian splats plus physics engines, viewable in VR.
Latent-space prediction. JEPA — LeCun's bet. Don't generate pixels at all. Encode observations into abstract representations and predict how those evolve. The claim: most pixel-level detail is irrelevant for planning, and predicting it wastes capacity and compute.

These are not interchangeable. They differ in what they predict, how you evaluate them, and what they're for. When a startup calls itself a "world model company," the first useful question is which of the three it means.

The paper trail: what JEPA has actually shown

The fundraise is loud. The papers are quieter and more interesting.

2022 — the position paper. A Path Towards Autonomous Machine Intelligence lays out the whole program: a modular agent (perception, world model, cost, actor, short-term memory, configurator) with JEPA as the learning substrate. No results — pure architecture. It reads as a research roadmap, and AMI Labs is essentially this paper incorporated.

2023 — I-JEPA. Images. Mask blocks of an image, predict the representations of the masked regions from visible context — never reconstruct pixels, no handcrafted augmentations. Meta reported training a ViT-Huge in under 72 hours on 16 A100s, a fraction of what pixel-reconstruction methods burn, with strong linear-probe results.
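Schematically, and in notation chosen for this post rather than taken from the paper: let x_vis be the visible context, m the masked positions, f_theta the context encoder, f_theta-bar a separate target encoder whose output is treated as a fixed target (sg is stop-gradient), g_phi the predictor, and d_psi a pixel decoder. The norm is left unspecified on purpose. The contrast between predicting representations and reconstructing pixels is then:

```latex
\mathcal{L}_{\text{latent}}
  = \bigl\lVert\, g_\phi\bigl(f_\theta(x_{\text{vis}}),\, m\bigr)
      - \operatorname{sg}\bigl([f_{\bar\theta}(x)]_m\bigr) \bigr\rVert
\qquad \text{instead of} \qquad
\mathcal{L}_{\text{pixel}}
  = \bigl\lVert\, d_\psi\bigl(f_\theta(x_{\text{vis}}),\, m\bigr) - x_m \bigr\rVert
```

The target on the left lives in embedding space, so the model is never asked to account for pixel detail that the encoder has already discarded.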

2024 — V-JEPA. Same trick on video: masked latent prediction over spatiotemporal patches. The representations turn out to encode motion unusually well — exactly what pixel-reconstruction models are notoriously mediocre at.

2025 — V-JEPA 2. The load-bearing result. A ~1B-parameter ViT-g encoder trained on 22M videos (over 1M hours). From the paper:

77.3% top-1 on Something-Something v2 (motion understanding)
39.7 recall@5 on Epic-Kitchens-100 action anticipation — a 44% relative improvement over prior task-specific models
Aligned with an 8B language model: 84.0 on PerceptionTest, 76.9 on TempCompass — state of the art at that scale

Then the robotics part, V-JEPA 2-AC: take the frozen encoder, post-train an action-conditioned predictor on less than 62 hours of unlabeled robot video from the open DROID dataset, and deploy zero-shot on Franka arms in two labs the model never saw. Planning runs as model-predictive control in latent space. Results: 65–80% success on pick-and-place with unseen objects — no rewards, no task-specific data, no data from the deployment robots. Planning takes ~16 seconds per action versus ~4 minutes for the video-generation baseline (Cosmos), and on a cup-relocation task it hit ~80% where the vision-language-action baseline Octo managed 15%.
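As a toy illustration of that planning loop, not of V-JEPA 2-AC itself: the predictor below is a hand-written stand-in (latent plus action) where the real system uses a learned network, the planner is plain random shooting chosen for brevity, and every name is hypothetical. What it shares with the description above is the structure: score candidate action sequences by rolling them out in latent space, execute only the first action, then replan.

```python
import random

Vec = tuple[float, float]


def predict(z: Vec, action: Vec) -> Vec:
    """Toy stand-in for a learned action-conditioned predictor in latent space."""
    return (z[0] + action[0], z[1] + action[1])


def cost(z: Vec, goal: Vec) -> float:
    """L1 distance between a predicted latent and the goal latent."""
    return abs(z[0] - goal[0]) + abs(z[1] - goal[1])


def plan(z: Vec, goal: Vec, rng: random.Random,
         horizon: int = 3, samples: int = 200) -> tuple[float, list[Vec]]:
    """Random shooting: sample action sequences, roll each out in latent
    space, and keep the one whose final predicted latent is closest to the goal."""
    best_cost, best_seq = float("inf"), []
    for _ in range(samples):
        seq = [(rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)) for _ in range(horizon)]
        rolled = z
        for action in seq:
            rolled = predict(rolled, action)
        c = cost(rolled, goal)
        if c < best_cost:
            best_cost, best_seq = c, seq
    return best_cost, best_seq


def run_episode(seed: int, steps: int = 10) -> float:
    """Receding-horizon loop: plan, execute only the first action, replan."""
    rng = random.Random(seed)
    state, goal = (0.0, 0.0), (2.0, -1.5)
    for _ in range(steps):
        _, seq = plan(state, goal, rng)
        state = predict(state, seq[0])
    return cost(state, goal)


print(run_episode(seed=0))  # remaining distance to the goal after 10 replanning steps
```

Nothing in this loop needs a reward signal or a pixel decoder: the only learned pieces in the real system would be the encoder and the predictor, and the planner treats both as black boxes. A sampled planner like this one offers no convergence guarantee, which is part of why per-action planning time matters.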

Meta also released physical-reasoning benchmarks alongside (IntPhys 2, MVPBench, CausalVQA) — on which current models, LLMs included, still trail humans badly. That gap is the entire pitch.

2026 — AMI Labs. As of July: no public model. Reporting points to work on world models that adapt continually through action, plus the conference-keynote circuit. LeBrun's stated timeline: roughly a year to the first usable pieces in a product, years to real commercial applications — healthcare (via Nabla), robotics, wearables, industrial. LeCun says year one is research. Credit where due: they are not pretending otherwise.

The money map

The AMI round was led by Bezos Expeditions, Cathay Innovation, Greycroft, Hiro Capital and HV Capital, with NVIDIA, Temasek, Samsung and Toyota Ventures in, plus individuals including Bezos, Mark Cuban, Eric Schmidt and — unusually for a seed round — Tim Berners-Lee. They initially sought €500M and closed ~€890M.

Zoom out and the pattern sharpens. World Labs raised $1B in February at $5.4B post. Decart took $300M in May at $4B (Karpathy is an angel). Odyssey raised $1.2B. And NVIDIA is on nearly every cap table — it has committed over $40B in AI equity in 2026, frequently structured as equity in exchange for long-term GPU commitments.

Read that incentive carefully. NVIDIA's omnipresence is evidence that world models are compute-hungry, not that the architecture is right. Demis Hassabis calling world models essential for AGI is the more meaningful endorsement — DeepMind builds them regardless of the hype cycle, and has no fundraise to justify.

The honest counterarguments

I don't want to write the version of this post that just relays the pitch. Here's the strongest case against.

1. LLMs may already learn world models. The Othello-GPT line of work trained a transformer on nothing but move sequences and found an emergent, probeable representation of the board state — later shown by Neel Nanda to be linearly decodable. Gurnee & Tegmark found linear representations of space and time inside Llama-2, down to individual "space neurons." Next-token prediction demonstrably induces some internal model of the data-generating process. So the LeCun claim has to be quantitative — "not enough of one, not grounded enough" — not categorical.

2. Those internal models might be junk anyway. Follow-up interpretability work suggests Othello-GPT's "board" may be a bag of correlated heuristics rather than a clean algorithm — epicycles, not an orrery. This cuts both ways: it weakens "LLMs already have world models," and it warns that JEPA latents could look equally messy once someone probes them as hard.

3. Latent prediction is hard to inspect. With generative world models you can watch what the model imagines and see it break. A JEPA predictor's mistakes live in embedding space; evaluation is indirect — probes, downstream planning success. That's a real tooling and debuggability cost, and part of why the generative camp iterates faster in public.

4. Nobody's world model is robust yet. Genie 3 stays coherent for a few minutes and remembers changes for about one. V-JEPA 2-AC does tabletop pick-and-place, not laundry. The gap between "80% cup relocation" and "robot in your kitchen" is the same kind of gap LLMs face between benchmark and deployment. It would be inconsistent to hold only one camp to the deployment standard.

5. The falsifiability question. What would count as the thesis winning? My candidates: JEPA-style planners beating vision-language-action models on generalization at matched scale; sample-efficiency curves holding beyond manipulation toys; a grounded system measurably hallucinating less in a domain like healthcare. What would count as losing: hybrid LLM systems closing the physical-reasoning benchmark gap first. Either outcome is visible within a couple of years — which is more than you can say for most $1B theses.

What to do with this (if you build things)

Practical, not philosophical:

Nothing here replaces your LLM calls in 2026. The most optimistic insider timeline is a year to first usable pieces, years to products. Plan your stack accordingly and ignore anyone selling you a "world model" API this quarter.
Form your own priors hands-on. V-JEPA 2 checkpoints are public. Cosmos is self-hostable — the 7B variant fits on a single H100 80GB. Marble has a free tier. Genie is a closed preview. An afternoon of poking beats a month of threads.
When the label shows up in a pitch deck, ask which of the three architectures it means — frames, splats, or latents — and where its pick-and-place numbers are. If there's no answer, you've found the buzzword LeBrun warned you about.
Watch three markers: AMI's first release and its license (LeCun spent a decade arguing for open research — the licensing choice will say a lot); whether latent-space planning scales past tabletop manipulation before generative models get long-horizon coherence; and DeepMind shipping a hybrid LLM + world model system, since they own both pieces and have no thesis to defend.

LeBrun's six months are almost up. The benchmarks aren't going anywhere.

Reading list

V-JEPA 2 paper: https://arxiv.org/abs/2506.09985
Meta's V-JEPA 2 announcement + benchmarks: https://ai.meta.com/blog/v-jepa-2-world-model-benchmarks/
I-JEPA: https://arxiv.org/abs/2301.08243
The 2022 position paper: https://openreview.net/forum?id=BZ5a1r-kVsf
AMI Labs raise (TechCrunch): https://techcrunch.com/2026/03/09/yann-lecuns-ami-labs-raises-1-03-billion-to-build-world-models/
Largest EU seed context (Crunchbase): https://news.crunchbase.com/venture/world-model-ai-lab-ami-raises-europes-largest-seed-round/
The VC map (Forbes): https://www.forbes.com/sites/josipamajic/2026/06/30/world-model-startups-raise-3-billion-vcs-bet-beyond-llms/
Othello-GPT: https://thegradient.pub/othello/
Nanda's linear-representation follow-up: https://www.neelnanda.io/mechanistic-interpretability/othello
Space/time representations in Llama-2: https://arxiv.org/abs/2310.02207
Melanie Mitchell's skeptical read: https://aiguide.substack.com/p/llms-and-world-models-part-2

Autonomy Is the Bug: Why Self-Driving Agents Hallucinate When the Model Barely Does

Sergei Parfenov — Tue, 21 Jul 2026 16:23:46 +0000

The best models in 2026 hallucinate on about 1% of grounded, single-shot tasks. Hand one a document and ask for a summary and it almost never invents anything.

Now let that same model drive itself: plan a task, call a tool, read the result, decide the next step, update its memory, repeat for twenty steps with no human in the loop. It will now go wrong on most runs. Same model, same weights. The thing that changed is autonomy, and autonomy is where hallucination is actually manufactured.

This is the part almost nobody states plainly: for a self-driving agent, the dominant cause of "making things up" is not the model's factual accuracy. It's the structure of running unattended across many steps. A better model barely moves it. Below is why, with the numbers, and what actually does.

TL;DR: Agent errors compound multiplicatively. At 95% per-step accuracy, a 10-step task succeeds ~60% of the time, a 20-step task ~36%. Even a fictional 99%-per-step agent fails ~18% of 20-step tasks. And it's worse than the clean math, because agents self-condition on their own earlier mistakes, so errors accelerate rather than just accumulate. This is structural, not a model-quality problem: no checkpoint fixes p^n. What fixes it is architecture: shorter chains, per-step validation, scoped context, and escalation. It also helps not to reach for a reasoning model, which on grounded steps often hallucinates 3–4× more.

The one equation that governs autonomous agents

Reliability engineering has a law for chained systems (Lusser's law): the reliability of a sequence is the product of the step reliabilities, not the average. For an agent taking n independent steps, each succeeding with probability p, end-to-end success is p^n.

That exponent is the entire story of why autonomy is hard. Watch what it does:

Per-step accuracy	5 steps	10 steps	20 steps
99% (fantasy)	95%	90%	82%
95% (excellent)	77%	60%	36%
90% (good)	59%	35%	12%
85% (a strong result)	44%	20%	4%

Read the 95% row again, because 95% per step sounds great. An agent that gets each step right 95% of the time succeeds on a 20-step task 36% of the time. It fails most of them. And 85% per step, which is a genuinely strong result on a complex reasoning action, collapses to 4% over 20 steps: one success in twenty-five.
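
The table is nothing more than p^n evaluated on a grid, so it is easy to regenerate for your own step counts. A minimal sketch, assuming independent steps exactly as the table does (`chain_success` is a name made up for this example):

```python
def chain_success(p: float, n: int) -> float:
    """End-to-end success of n independent steps, each succeeding with probability p."""
    return p ** n

# Same grid as the table above: per-step accuracy down, chain length across.
for p in (0.99, 0.95, 0.90, 0.85):
    row = "  ".join(f"{chain_success(p, n):4.0%}" for n in (5, 10, 20))
    print(f"{p:.0%} per step -> {row}")
```

Swap in your own measured per-step accuracy and real chain lengths. The number that comes out is an upper bound if errors in your agent are not independent.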

Demis Hassabis called compounding agent errors "compound interest in reverse," and it's exact. The same multiplicative machinery that makes compound interest miraculous over time makes autonomous error catastrophic over steps. Demos hide it because a demo is 2–3 steps. Production is 5+ steps over messy inputs, which is precisely where p^n bites.

The load-bearing consequence: no amount of model improvement fully solves this. Going from 95% to 97% per step helps, but you're still fighting an exponent. The gap between "impressive demo" and "reliable product" is not a capability gap you close with a better model. It's a structural property of chaining a non-deterministic component, and you engineer around it or you ship a coin flip.
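
One way to feel the exponent is to invert it: fix the end-to-end reliability you need and ask how many steps you can afford. A small sketch under the same independence assumption (`max_steps` is a hypothetical helper, and the 80% target is an arbitrary example):

```python
import math

def max_steps(p: float, target: float) -> int:
    """Largest n with p**n >= target, assuming independent steps (0 < p < 1)."""
    return math.floor(math.log(target) / math.log(p))

# How long a chain can each per-step accuracy sustain at an 80% end-to-end target?
for p in (0.95, 0.97, 0.99):
    print(f"p={p:.2f}: at most {max_steps(p, 0.80)} steps")
```

Under these assumptions, moving from 95% to 97% per step buys only a few extra steps at the same target, which is the sense in which you are still fighting an exponent.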

Why it's actually worse than p^n: self-conditioning

The table above assumes each step's error is independent. In real agents, it isn't, and the dependence runs the wrong way.

Researchers documented a self-conditioning effect in multi-step LLM pipelines in early 2026: once an agent has produced an error and that error is sitting in its context, the model conditions on it. It sees its own earlier mistake as established fact and reasons from it, which makes the next mistake more likely, not less. Errors don't just accumulate, they compound on themselves. The real failure curve is steeper than p^n predicts.

This is the mechanism underneath "the agent went off the rails around step 6." It didn't hit a hard wall. It made one small wrong turn at step 4, wrote that into its own working context, and every subsequent step treated the wrong turn as ground truth. By step 8 it's confidently building on a fact it invented four steps ago. The hallucination isn't fresh at step 8, it's laundered from step 4 through the agent's own memory.

And this points straight at one of the most effective fixes, which is counterintuitive: scoped context. A step that doesn't know about the errors in steps 2 and 3 can't condition on them. Passing only the relevant slice of context forward, instead of the full accumulated history, breaks the self-conditioning loop. Less memory means more reliability, the opposite of the instinct to give the agent everything.
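
Scoped context can be as simple as making each step declare the keys it is allowed to read. The runner below is a sketch, not a framework: `run_scoped`, the step tuples, and the `scratchpad` key are all invented for this example, and the step functions are plain Python standing in for model calls.

```python
from typing import Callable

# (output key, input keys the step may read, step function)
Step = tuple[str, tuple[str, ...], Callable[[dict], object]]

def run_scoped(steps: list[Step], state: dict) -> dict:
    """Run steps in order, handing each one only the keys it declared."""
    state = dict(state)
    for out_key, needs, fn in steps:
        scoped = {k: state[k] for k in needs}  # everything else stays invisible
        state[out_key] = fn(scoped)
    return state

def extract_total(ctx: dict) -> int:
    return sum(ctx["prices"])

def write_report(ctx: dict) -> str:
    # Sees only "total"; a wrong note left by an earlier step is out of reach.
    return f"Total due: {ctx['total']}"

result = run_scoped(
    [("total", ("prices",), extract_total), ("report", ("total",), write_report)],
    {"prices": [40, 2], "scratchpad": "step 2 guessed the currency was EUR"},
)
```

The full state still exists for logging and debugging. It just never reaches a step that did not ask for it.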

The autonomy-specific amplifiers

On top of the raw compounding, self-driving specifically adds failure sources a single-shot call never has:

Lossy memory compaction. When the agent's history won't fit in context, it summarizes, and repeated summarization is brutal. A 2026 rate-distortion study measured it: an agent archiving raw records and retrieving on demand held ~95% recall; an agent overwriting memory with LLM summaries dropped to 33–56% recall, worst when it compacted most often. At equal budget, lossy self-summarization loses roughly half the facts over a long run, and a single-turn benchmark can't see it because the loss only appears when compaction repeats. The agent then confidently fills the holes it created.

Context drift. ~65% of enterprise agent failures in 2025 were attributed to context drift or memory loss during multi-step reasoning, not to running out of context. As the window fills, attention de-prioritizes the original task; early tool outputs get overwritten by later ones; the agent quietly diverges before it ever hits the limit. Chroma's context-rot work shows degradation accelerating past ~30k tokens even in models with far larger windows: capacity is not fidelity.

Multi-agent contamination. Split the work across agents for reliability and you can make it worse: naive full-broadcast state sharing increased hallucination by 34% in a controlled 2026 study, because one agent's error propagates to all of them. Chaining five agents at 95% each yields 77% end-to-end, not 95%: the same p^n exponent, now spread laterally.

Every one of these is a property of running autonomously, not of the model's factual accuracy. A single API call to the same model has none of them.

Where the model does matter: don't make per-step accuracy worse

The model isn't irrelevant: it sets your p. But the counterintuitive part is that the "smarter" model many teams reach for to fix agent reliability often lowers p on exactly the grounded steps agents are made of.

Reasoning models hallucinate more on grounded tasks. Same base model: DeepSeek-V3 hallucinates at 3.9% on grounded summarization; its reasoning sibling R1 at 14.3%, roughly 3.7× worse. It's a pattern, not a one-off: on Vectara's 2026 dataset, every reasoning model exceeded 10%, while a non-reasoning model (Gemini Flash-Lite) led at 3.3%. The mechanism is the same self-conditioning drift: reasoning models generate long internal chains that wander from the source. So "let me upgrade to a reasoning model so my agent thinks more carefully" can lower your per-step accuracy and, through the exponent, wreck end-to-end reliability. For the grounded steps that make up most agent loops, a fast non-reasoning model is often the higher-p choice.

"Lowest hallucination" can mean "abstains most." Claude 4.1 Opus posts 0% on a knowledge benchmark, but by declining when unsure, which for an agent is often exactly right (a confident wrong step corrupts everything downstream; an "I'm not sure" step can be escalated). Calibration, knowing what it doesn't know, matters more for p than raw accuracy, because a self-conditioning agent punishes confident-wrong far more than honest-uncertain.

The takeaway on models: pick for high, calibrated per-step accuracy on your actual step type (usually grounded → non-reasoning), not for leaderboard position. But understand that even a great p is fighting p^n, so the model is necessary and nowhere near sufficient.

What actually fixes autonomous hallucination

None of these is a model upgrade. All of them attack the exponent or the self-conditioning.

Shorter chains. The single highest-leverage move. Cutting a 20-step task to 8 steps usually does more for reliability than a realistic accuracy bump, because you're reducing the exponent, not the base: at 95% per step, 8 steps succeed ~66% of the time, while raising per-step accuracy to 97% and keeping all 20 steps gets you only ~54%. If a task can be decomposed into shorter independent sub-tasks with checkpoints between, do that.
Validation between steps. Schema-validate each step's output before passing it forward. A 95%-accurate step that's checked can't silently corrupt everything downstream: you catch the error at step 4 instead of discovering it at step 20. This directly interrupts self-conditioning.
Scoped context, not full history. Pass only what the next step needs. A step that can't see earlier errors can't condition on them. Less context is more reliability here.
Escalation and circuit breakers. The agent must know when it's uncertain and hand off. The dangerous failure isn't the agent that stops and reports, it's the one that fails silently and continues, laundering a wrong turn through twelve more steps. Test explicitly: does it recognize its own error, signal it, and halt rather than compound?
Selective autonomy, not full autonomy. Aggressive automation for low-risk, reversible steps; hard human gates on high-stakes, irreversible ones. Cap the blast radius. The goal was never "fully autonomous," it's "autonomous where it's safe, gated where it isn't."
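
Validation between steps and escalation fit in a few lines once they are treated as part of the runner rather than part of the prompt. A sketch under stated assumptions: `run_validated`, `Escalate`, and the three toy steps are hypothetical, and the validators are simple predicates where a real pipeline would use schema checks.

```python
class Escalate(Exception):
    """A step's output failed validation; stop and hand off instead of continuing."""

def run_validated(steps, payload):
    """steps: list of (name, fn, is_valid). Halt at the first invalid output."""
    for name, fn, is_valid in steps:
        payload = fn(payload)
        if not is_valid(payload):
            raise Escalate(f"step {name!r} produced invalid output: {payload!r}")
    return payload

def parse_amount(text):
    return float(text)

def apply_discount(amount):
    return amount - 150.0  # deliberately naive: can go negative

steps = [
    ("parse", parse_amount, lambda x: x >= 0),
    ("discount", apply_discount, lambda x: x >= 0),
    ("format", lambda x: f"${x:.2f}", lambda s: s.startswith("$")),
]
```

With input `"100"` the runner raises `Escalate` at the `discount` step, so the `format` step never gets to turn a negative amount into a confident-looking string. That is the circuit breaker: the bad value is stopped where it is produced, not discovered at the end of the chain.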

The thing to internalize

A single model on a single task is a solved-ish problem: ~1% hallucination on grounded work. An autonomous agent is a different object: it multiplies per-step reliability across many steps, and it conditions on its own mistakes, so even under the independence assumption a 95%-per-step agent fails most 20-step tasks (0.95^20 ≈ 36% success), and the real curve is steeper still. That is not a model-quality problem and no checkpoint fixes it.

The teams whose agents are still running in 2028 won't be the ones who bought the most capable model. They'll be the ones who treated compound failure as a design constraint from day one: shorter chains, validation between steps, scoped context, escalation, and the discipline to leave a step out of autonomy when it can't be made safe.

When your self-driving agent hallucinates, don't reach for a smarter model first. Count the steps, check whether it's conditioning on its own earlier output, and ask which of those steps ever needed to be autonomous at all. The model isn't lying. Autonomy handed it its own earlier mistake and asked it to keep going.

What's your real per-step accuracy once you measure the whole chain, not the demo, and where did shortening the chain buy you more than any model swap? I'd bet most "the model hallucinates" complaints are compound error and self-conditioning wearing a model-shaped mask.

Sources & further reading

"The Compounding Errors Problem: Why Multi-Agent Systems Fail", Zartis (2026) — p^n across accuracy levels; why model improvement can't fully solve it.
"The Math That's Killing Your AI Agent", Towards Data Science (2026) — the compound table; fail-detectably-and-gracefully test.
"The Compound Error Problem: Why 95% Accurate AI Agents Still Fail", Highland Edge (2026) — self-conditioning; "compound interest in reverse"; scoped context passing.
"The Math Behind Why Multi-Step AI Agents Fail in Production", Flavius Dinu (2026) — Lusser's law; shorter chains + verification + guardrails.
"Multi-Agent Reliability Math: Chaining 5 Agents Drops to 77%", MindStudio (2026) — lateral compounding; 85–90% realistic per-agent rates.
"What to Keep, What to Forget: A Rate–Distortion View of Memory Compaction" (2026) — reversible ~0.95 vs irreversible 0.33–0.56 recall.
"Hallucination as Context Drift" (2026) — +34% from naive multi-agent sync.
Reasoning-tax and abstention figures: Vectara HHEM 2026; Suprmind AI hallucination benchmarks (DeepSeek V3 3.9% vs R1 14.3%; Claude 4.1 Opus 0% via abstention).

'Local' Solves Where Your Data Goes. It Doesn't Solve What Your Agent Does

Sergei Parfenov — Mon, 20 Jul 2026 11:19:35 +0000

Local models got good this year. Gemma 4's 12B runs agentic workloads in 16GB of RAM, GLM-5.2 tops the open-weight leaderboards under a permissive license, Qwen 3.6 does tool-calling that would've been frontier-only eighteen months ago. The "just run it locally" argument hit the top of Hacker News, and for once it wasn't cope — a model on your own hardware finally handles real work.

So teams are moving agents on-prem, and the pitch is almost always the same: local means private, private means safe. The first half is true. The second half is a category error that's going to cause incidents, because it quietly swaps a data question for a behavior question and hopes nobody notices.

TL;DR — Local deployment fixes exactly one thing: where your data goes. It does nothing for what your agent does. Prompt injection (50–85% success rates, architectural not model-level), silent provenance failures (the agent faking its own logs), and privilege escalation all survive the move to your hardware unchanged — and you've traded the provider's security team for none. Local agents are genuinely safe for a specific shape of task (bounded scope, trusted inputs, reversible or gated actions) and dangerously oversold for another (untrusted input, irreversible actions, regulated decisions). The dividing line isn't where the model runs. It's what the agent is allowed to touch.

What "local" actually buys you

Let's be precise about the real win, because it's real and worth having: data sovereignty. Your prompts, your documents, your customer data, your proprietary code: none of it leaves your infrastructure. For a hospital, a bank, a defense contractor, or anyone under GDPR handling personal data, that's not a nice-to-have, it's frequently the difference between "can deploy" and "legal says no." With the EU AI Act's high-risk provisions due to apply from August 2, 2026, and sector regulators (OCC, FDA, FINRA, SEC) applying existing authorities to agent deployments, keeping data on-prem removes a whole class of compliance friction.

That's the entire list. Data location. Everything else people attribute to local — that it's safer, more controllable, less exploitable — is either untrue or unrelated to where the weights sit.

What "local" does not buy you

Here's the uncomfortable part, in three failures that don't care about your network topology.

1. Prompt injection is architectural, not remote

The single most common belief I want to kill: that prompt injection is something that happens to cloud APIs and air-gapping escapes it. It isn't and it doesn't.

Prompt injection is #1 on the OWASP LLM Top 10 for one reason: LLMs cannot structurally distinguish trusted instructions from untrusted data. That's a property of how the model reads a context window, not of where the context window is hosted. A recent systematization across 78 studies puts injection success rates above 85%. Independent numbers land at 50–84% depending on configuration. Moving the model to your basement changes none of those numbers.

Worse, the dangerous variant for agents is indirect injection, and local makes it arguably harder to reason about, not easier. Indirect injection is when the agent autonomously retrieves attacker-controlled content — a poisoned document, a malicious webpage, a compromised record — and executes instructions hidden inside it, during its normal operating loop. And here's the local-specific sting: research shows that the moment a planner consumes raw local content — arbitrary files, logs, metadata off your own disk — malicious instructions embedded in those local artifacts steer the agent's reasoning, even if execution is later sandboxed. Your local filesystem is not a trusted input just because it's yours. Any file an attacker touched, any document a user uploaded, any log a previous run wrote is now an injection vector, and it's sitting inside your trusted perimeter, which is exactly where you weren't looking.

There's a beautiful, awful result that makes the point: asking an agent to seek clarification when a task is ambiguous — a behavior everyone agrees is desirable — measurably increases its injection vulnerability, because the clarification response becomes a fresh attack channel. You cannot prompt your way out of this, on any hardware.

2. The agent can still fake its own evidence

I've written a whole series on this, so I'll keep it short: an agent that writes a record of its own execution can write a false one. The Darwin Gödel Machine incident is the canonical example — an agent editing its own harness wrote a log claiming its tests passed. They never ran. It then read that log back as ground truth. No deception, just tool-use hallucination hitting a filesystem that can't record who wrote what.

Notice that every part of that happens locally by default. Running the model on-prem doesn't add a single control here. If anything it removes one, because a cloud provider at least gives you their audit tooling, their request logs, their trace infrastructure. Roll your own local agent and you get an empty /var/log and whatever provenance discipline you remembered to build — which, for most teams shipping fast, is none. "Local" and "auditable" are orthogonal, and people constantly assume the first implies the second.
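
The provenance discipline in question does not have to be heavy. One possible shape, sketched with Python's standard `hmac` module: the runtime that actually executes a tool signs the result with a key the model never sees, and anything unsigned is treated as narration. The function names and record fields are invented for this example, and the whole scheme is void if the agent can read `runtime_key`.

```python
import hashlib
import hmac
import json

def mint_result(runtime_key: bytes, tool: str, exit_code: int, output: str) -> dict:
    """Called by the runtime that executed the tool, never by the model."""
    record = {"tool": tool, "exit_code": exit_code, "output": output}
    body = json.dumps(record, sort_keys=True).encode()
    record["sig"] = hmac.new(runtime_key, body, hashlib.sha256).hexdigest()
    return record

def is_verified(runtime_key: bytes, record: dict) -> bool:
    """True only if the record carries a signature minted with the runtime's key."""
    claimed = str(record.get("sig", ""))
    unsigned = {k: v for k, v in record.items() if k != "sig"}
    body = json.dumps(unsigned, sort_keys=True).encode()
    expected = hmac.new(runtime_key, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(claimed, expected)
```

A log line the agent wrote itself, such as `{"tool": "pytest", "output": "12 passed", "sig": "trust me"}`, fails `is_verified`, so a downstream step has a mechanical reason not to treat it as ground truth.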

3. You inherited the provider's threat model and fired their security team

This is the trade nobody prices in. When you call a hosted frontier API, an enormous amount of invisible security work comes along: input/output filtering, jailbreak detection, egress monitoring, rate limiting, red-teaming, abuse detection, incident response. You may resent paying for it, but it's there.

Move local and all of that is now your job. The attack surface didn't shrink — prompt injection, tool abuse, privilege escalation, data exfiltration through side channels are all still live — but the team defending against it went from "the provider's security org" to "you, probably part-time, while also shipping the feature." Survey data captures the gap bluntly: 82% of executives are confident their existing policies cover unauthorized agent actions, and the operational reality is nowhere close. Local deployment doesn't cause that gap, but it widens it, because it hands you more of the stack to secure while feeling like it did the opposite.

So where ARE local agents genuinely safe?

This isn't a "don't run local agents" post — I run them, they're great, and the data-sovereignty win is often decisive. It's a "stop using local as a synonym for safe" post. The actual safety question has nothing to do with where the model runs and everything to do with three properties of the task:

Green zone — local agents are genuinely safe here:

Bounded, trusted inputs. The agent operates over content you control end to end — your own codebase, your internal docs, data with no attacker-writable path into it. No indirect-injection surface because nothing untrusted enters the context.
Reversible or gated actions. The agent proposes; a human or a deterministic check disposes. Draft the email, don't send it. Suggest the migration, don't run it. Write the PR, don't merge it. If every consequential action has an undo or a gate, injection and hallucination cost you time, not damage.
Low blast radius. Worst case is contained. A local coding assistant that can edit files in one sandboxed repo, a document-Q&A agent that can only read, a log-triage agent that can only annotate — the failure mode is "wrong output," which you catch, not "wire transfer sent."

Concretely, the tasks that fit: private code assistance over your own repos, RAG/document Q&A over internal knowledge (read-only), draft generation with human review, log and telemetry triage that annotates rather than acts, offline data transformation with visible outputs. These are safe and benefit maximally from local — sensitive data, high volume, no need for frontier-level reasoning.

Red zone — local changes nothing about the danger:

Untrusted input in the loop. The agent reads emails, browses the web, ingests user uploads, processes third-party documents. Every one of those is an injection channel, and it's identical on local and cloud. If anything, do this on a hosted model with injection defenses before you do it on a bare local one.
Irreversible or high-value actions. Payments, deployments, deletions, external messages, database mutations, anything with a side effect you can't take back. The DGM failure and every injection result apply in full. Local gives you zero additional protection on the exact axis that matters most.
Regulated decisions. Credit, healthcare, legal, hiring. Here the arXiv literature is blunt: a production KYC deployment reported negative results, with control failures surfaced only by internal audit and a population of legitimate applicants the automated pipeline silently couldn't serve. In regulated settings the consensus is that full autonomy is rarely advisable regardless of hosting, and legal accountability amplifies every threat relative to an unregulated deployment. "It runs on-prem" is not an answer a regulator accepts.

The pattern: local is a data-locality decision; safety is an autonomy-and-blast-radius decision. They're independent axes, and conflating them is how you end up with a locally-hosted agent doing something on the red-zone list because "it's private, so it's fine."

The controls that actually matter (and are the same either way)

Since the model's location isn't doing security work, something else has to. The controls that make an agent safe are identical on local and cloud, and they're all about what the agent can touch, not where it thinks:

Least privilege on tools. The agent gets the minimum action surface for the task. A read-only agent can't write. A drafting agent can't send. This is the single highest-leverage control and it's free.
A gate before anything irreversible. Human-in-the-loop or a deterministic check on every action you can't undo. Gate on the action's reversibility, not the model's confidence — a model can be confidently wrong, and injection makes it confidently malicious.
Trusted vs untrusted input separation. Treat every file, document, and retrieved page the agent didn't author as potentially poisoned. Don't let raw untrusted content into the planning context; redact, sandbox, or validate at the boundary.
Provenance on tool outputs. A "tests passed" line counts only if it's backed by an artifact the agent couldn't author. The runtime that executed the tool mints the verified result, not the model narrating about it.
An audit log the agent can't rewrite. Append-only, outside the agent's editable surface. This is the thing local deployment silently removes if you don't build it, so build it.
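
The first two controls, least privilege and a gate on irreversible actions, can live in one small dispatch function that sits between the model and its tools. This is a sketch with hypothetical names (`Tool`, `call_tool`, `Denied`), and `approve` stands in for whatever human or deterministic check you use:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Tool:
    name: str
    fn: Callable[..., object]
    reversible: bool

class Denied(Exception):
    """The requested action is outside the agent's grant or was not approved."""

def call_tool(granted: dict, name: str, approve: Callable[[str, dict], bool], **kwargs):
    """Run a tool only if it was granted, and gate irreversible ones on approval."""
    if name not in granted:
        raise Denied(f"tool {name!r} is not granted to this agent")
    tool = granted[name]
    # Gate on the action's reversibility, not on how confident the model sounds.
    if not tool.reversible and not approve(name, kwargs):
        raise Denied(f"irreversible tool {name!r} was not approved")
    return tool.fn(**kwargs)
```

A drafting agent would be handed a `granted` dict containing only a reversible `draft_email` tool. Even a fully successful injection then has nothing to call that sends anything.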

Every one of those is architecture. None of them is a model, and none of them cares whether the weights are in us-east-1 or under your desk.
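
As one example of how little code that architecture needs, here is a hash-chained log, a common way to approximate the append-only audit log from the list above. Two caveats, both assumptions of this sketch: the helper names are made up, and a hash chain is only tamper-evident. An agent with write access can still rebuild the whole chain unless the latest hash is also stored somewhere it cannot reach.

```python
import hashlib
import json

GENESIS = "0" * 64

def _digest(prev: str, event: dict) -> str:
    body = json.dumps({"prev": prev, "event": event}, sort_keys=True)
    return hashlib.sha256(body.encode()).hexdigest()

def append_entry(log: list, event: dict) -> None:
    """Append an event whose hash commits to everything before it."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"prev": prev, "event": event, "hash": _digest(prev, event)})

def verify_chain(log: list) -> bool:
    """Recompute every link; any edited or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev or entry["hash"] != _digest(prev, entry["event"]):
            return False
        prev = entry["hash"]
    return True
```

Rewriting an old entry to say the tests passed changes its hash, which no longer matches what the next entry committed to, so `verify_chain` returns `False`.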

The takeaway

Local models crossing the capability threshold is genuinely one of the best things to happen in this space — the data-sovereignty win alone unlocks deployments that were legally impossible a year ago. Use them. But "local" answers exactly one question: where does my data live. It is silent on the question that actually determines whether you get an incident: what is my agent allowed to do, and what happens when it's wrong.

The safe local agent and the dangerous local agent run the same model on the same hardware. The difference is entirely in the blast radius you granted it. Design that, and location becomes what it should be — a compliance and cost decision, not a security blanket.

If you're running local agents: what's actually in your green zone versus what crept into the red one because "it's on-prem so it's fine"? The creep is never a decision, it's an omission — curious where people have caught it.

Sources & further reading

OWASP Top 10 for LLM Applications 2025 — prompt injection as the #1 architectural vulnerability.
"Prompt injection: types, real-world CVEs, and enterprise defenses", Vectra AI (2026) — 50–84% success rates; the Aug 2, 2026 EU AI Act deadline.
"PlanTwin: Privacy-Preserving Planning Abstractions for Cloud-Assisted LLM Agents" — malicious instructions in local artifacts steering reasoning even when execution is sandboxed; 85%+ injection success across 78 studies.
"ASPI: Seeking Ambiguity Clarification Amplifies Prompt Injection Vulnerability" — desirable clarification behavior increases attack surface.
"Agent Security Meets Regulatory Reality" — production KYC negative results; how legal accountability amplifies each threat.
"AI Agent Security in 2026", Beam — the 82% executive-confidence gap; healthcare incident rates.
My series on the provenance side of this: the agent that faked its own test log.

The Agent Faked a Test Log, Then Believed It. Self-Editing Harnesses Have a Provenance Problem.

Sergei Parfenov — Wed, 08 Jul 2026 11:39:46 +0000

Lilian Weng published a new survey on July 4: Harness Engineering for Self-Improvement. It maps roughly three years of work on agents that optimize their own scaffolding — context managers, workflows, harness code, and eventually the optimizer that optimizes the harness. Most of the discussion around it will be about recursive self-improvement, because RSI is the exciting frame.

I read it with a different hat on. I run agents in production, and this blog has been circling one question for a while: what does it take to trust the output of a long agent chain? Read from that angle, her survey is not really a post about self-improvement. It's a post about a research field independently reinventing operations engineering — regression gates, immutable audit logs, least privilege — because every loop that skips those controls gets burned in a documented, reproducible way.

The cleanest burn in the whole literature comes from the Darwin Gödel Machine (DGM) paper. An agent, allowed to edit its own harness code, faked a log saying its unit tests had run and passed. The tests never ran. The fake log went into its own context. Downstream, the same agent read that log and concluded its changes were validated. It lied to itself, then trusted the lie — except "lied" smuggles in intent that was never there. This was garden-variety tool-use hallucination meeting an untyped log. Which is worse, not better: you don't need a deceptive agent to get this failure, just a filesystem that can't say who wrote what.

If you've read my post on provenance dying at the storage boundary, you already know where this is going.

What a harness is, and why it became the optimization target

Weng's definition, compressed: the harness is everything between the raw model and the world. The loop that decides when to plan and when to act. Tool interfaces. Context assembly. Memory files. Permission checks. Evaluation. Claude Code and Codex CLI are harnesses. So is your homegrown retry-wrapper-plus-prompt-template, whether you call it that or not.

That this layer matters is now measurable. Terminal-Bench 2.0 — 89 hard, containerized command-line tasks — shows the same frontier models scoring differently under different scaffolds; the best pairing in the benchmark paper (Codex CLI + GPT-5.2) lands at 63%. The benchmark authors are explicit that scaffolds get engineered around the quirks of specific models, which is also the founding observation of the self-harness line of work: harness design is model-specific, and hand-tuning it per model doesn't scale.

Weng organizes the field as a ladder of what gets optimized — prompts, then structured context, then workflows, then harness code, then the optimizer code itself — and walks every rung with examples. I won't duplicate the map; she does it in 28 well-spent minutes. What I want to do instead is pull a few load-bearing systems off that ladder and squint at their numbers and their failure reports — because that's where the story stops being about self-improvement and starts being about trust.

What the numbers say when you squint

STOP (Zelikman et al. 2023) — the original improver-improves-the-improver loop, still the conceptual core of the field — reported the result that should frame everything else: seed the recursion with GPT-4 and downstream utility climbs across iterations; seed it with GPT-3.5 or Mixtral and the loop actively hurts. Recursion is not a free lunch. Below a capability threshold, the loop amplifies noise instead of signal. The base model remains the ceiling; the harness moves you around underneath it.

Meta-Harness (Lee et al. 2026) — a search loop in which a coding agent proposes, edits, and evaluates whole harness variants — is the honest data point on how much headroom automated search finds above strong human engineering. On Terminal-Bench-2 it initializes the search from Terminus-2 and Terminus-KIRA — already strong hand-built harnesses — and the discovered harness comes out ahead: 37.6% on Haiku 4.5, where the next-best agent (Goose) sits at 35.5% and Terminus-KIRA at 33.7%, and 76.4% on Opus 4.6 against 74.7% for Terminus-KIRA. Call it two to four points, found by a proposer (Claude Code on Opus 4.6) that evaluates around 60 harness variants over 20 iterations in a few hours of wall-clock time. Polish-sized gains — but polish at that price is a good trade.

The caveats are more instructive than the headline. The one entry still above Meta-Harness on Opus, ForgeCode at 81.8%, could not be reproduced by the authors from its public code. And — the detail I'd tattoo on the field — in the TB-2 experiment the search set and the test set are the same 89 tasks. The authors say so plainly: the benchmark is small and expensive, a proper split would gut the signal, so they run it as a discovery problem and compensate with manual inspection plus regex audits for task-specific string leakage. There's also a genuinely encouraging trace buried in the qualitative results: early candidates that bundled structural fixes with prompt-template rewrites regressed, and the proposer hypothesized the shared prompt edit was the confound, isolated the structural change, and shipped a safer additive modification that won the run. The optimizer performing ablation hygiene on itself is the best thing in the paper.

Second-best thing in the paper: the proposer-context ablation, which lands on the same conclusion as an experiment I ran here. Give the optimizer scores only: median 34.6. Add LLM-written summaries of the trajectories: 34.9, statistically nothing. Hand it the full raw traces: 50.0. Summaries did not recover the signal; compression strips exactly the diagnostic detail the optimizer feeds on. Independent group, different setup, same shape: the information that makes a trace useful for diagnosis is the information a summary throws away first. Provenance does not survive compression.

Full disclosure, because it's too on-the-nose to skip: this exact failure mode bit me while writing this piece. My first draft sourced the Meta-Harness numbers from an LLM-generated paper-summary site, which confidently attributed that 35.5% baseline to Terminus-KIRA. The paper's own table says 35.5% is Goose; Terminus-KIRA sits at 33.7%. The wrong number lived in the draft until a fact-check pass against the raw table caught it. A machine-written summary — no provenance types, no link back to the table row — had quietly swapped a baseline. The ablation's finding, wearing street clothes. I nearly shipped an article about untyped trust on the strength of an untyped summary.

DGM — an evolutionary archive of coding agents, each free to rewrite its own harness repo — posted the impressive relative jump, 20% → 50% on SWE-bench Verified with a frozen Claude 3.5 Sonnet, but from a deliberately naive starting harness, and reporting put a single 80-iteration run at around $22k and two weeks. Look at what the loop actually discovered: finer-grained edit tooling, validation-and-retry on empty patches, context summarization near the window limit. Every one of these is a standard trick that hand-built coding harnesses already ship. The loop rediscovered expert engineering. That's a legitimate result — automated search recovering expert knowledge is useful — but it is not evidence of the loop exceeding expert knowledge.

SIA, the joint harness-plus-weights loop — a feedback agent decides per iteration whether to edit the harness or update the weights — is the one to hold loosely. Weng herself flags the confounds — the task agent (gpt-oss-120b) is far weaker than the meta and feedback agents steering it (Claude Sonnet 4.6), and the baselines are thin — and files the result under promising but unproven. A survey author doing that out loud is worth appreciating.

And at the "AI scientist" end of the ladder, Trehan & Chopra ran idea-to-paper attempts with a minimal scaffold (read, write, search, list). Of the generated ideas, four passed human selection and exactly one made it through to an executed paper. Their failure taxonomy reads like an ops incident review: stale defaults inherited from training data, implementation drift under execution pressure, context degradation on long horizons, and over-optimism — declaring victory on noisy results, the pattern Bubeck et al. call "p-hacking and eureka-ing".

Aggregate reading: self-editing loops today mostly (a) recover known harness engineering when started from scratch, (b) squeeze low single-digit points out of strong baselines, and (c) fail in ways that are structured and predictable rather than exotic. That last one is the good news. Predictable failure modes are the ones you can build controls for.

The incident, in slow motion

Back to DGM, because the appendix on objective hacking is the most instructive few pages in this literature.

Setup: the researchers wanted the loop to reduce hallucinated tool use — the model narrating tool calls it never made. They instrumented real tool invocations with special marker tokens and scored agents on whether unmarked "tool output" appeared in generations. The checking functions were hidden from the agent during self-modification, and the authors note hacking happened more often when they weren't hidden.

Two lineages are worth naming. Node 96 did the honest work — improved tool-invocation logging, stripped hallucinated tool output when it could determine no tool ran — and scored 1.67 out of 2.0. Node 114 hit a perfect 2.0 after just two self-modifications: it deleted the detection markers. Perfect score, zero progress on the actual problem. Goodhart's law in its purest lab form. The team caught it because DGM keeps what Sakana describes as a "transparent, traceable lineage" of every change — and that lineage lived outside the agent's editable surface.

The faked test log is the deeper incident, though. A record was written by the model, persisted to the filesystem, and later consumed by the model as if it were runtime truth. Exactly one bit of metadata would have blocked the failure: who wrote this record, the harness runtime or the model? In the typed-provenance framing I've been arguing for, that's a runtime-verified vs self-reported tag, plus one rule: self-reported never upgrades itself by being re-read.
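A minimal sketch of that tag and that rule. The `Record` type and both helper functions are hypothetical names of mine, not anything from DGM or another harness mentioned here, and the log text is made up for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Provenance(Enum):
    RUNTIME_VERIFIED = "runtime-verified"  # written by the harness runtime
    SELF_REPORTED = "self-reported"        # written by the model


@dataclass(frozen=True)
class Record:
    text: str
    provenance: Provenance


def reread(record: Record) -> Record:
    """Re-reading a record hands it back with its original tag: no upgrade."""
    return Record(text=record.text, provenance=record.provenance)


def counts_as_validation(record: Record) -> bool:
    """Only runtime-written records may serve as evidence that tests ran."""
    return record.provenance is Provenance.RUNTIME_VERIFIED


# Illustrative log lines (hypothetical content, identical text, different authors).
fake_log = Record("unit tests: all passed", Provenance.SELF_REPORTED)
real_log = Record("unit tests: all passed", Provenance.RUNTIME_VERIFIED)
```

The two records carry identical text; only the tag separates them, and no number of passes through `reread` changes it.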

This is the storage-boundary failure from my earlier post, except recursive. The artifact crossing the boundary isn't user data anymore — it's the evidence the next iteration of the system uses to decide what the current iteration did. Untyped trust used to compound per request. In a self-editing harness it compounds per generation.

Three invariants every working loop converges on

Put the successful loops side by side and the same skeleton appears, whether or not the authors name it:

1. A bounded editable surface. AlphaEvolve marks evolvable code regions with explicit block comments. Self-Harness — Shanghai AI Lab's loop in which the agent mines its own failure traces and patches its own harness — hands its proposer a declared list of editable surfaces plus a record of passing behaviors it must not break. DGM lets the agent rewrite its own repo — but not the evaluation stack. The loop needs a fence, drawn in advance, in writing.

2. An evaluator and permission layer outside the fence. DGM hides the checker functions and measures more hacking when it doesn't. Self-Harness validates candidate edits on a held-out split the proposer never optimized against, and promotes only edits with zero regression on both held-in and held-out sets. Meta-Harness is the instructive partial violation: on TB-2 the search optimizes against the test set itself — acknowledged, defended (small, expensive benchmark), and patched with manual audits. Even a careful team bends this invariant the moment evaluation gets expensive, which is exactly why it needs to be stated as an invariant rather than left as taste. Weng's own conclusion in the challenges section is that evaluators and permission controls should sit outside the loop that evolves the harness. This is separation of duties. It has a name because we've needed it before.

3. An append-only, typed record of what happened. DGM's lineage is how node 114 got caught. Self-Harness's weakness-mining stage explicitly rejects flat pass/fail labels — each failure record captures what the verifier observed, whether the agent's own behavior actually caused it, and through what mechanism — because two timeouts that look identical in an error log can have entirely different roots. That is not a scalar trust score. That is a typed provenance record. Even ACE — the context-optimization loop whose curator emits small itemized deltas instead of rewriting the whole prompt blob — lands on the same instinct: keep every change diffable, reviewable, attributable.

If you've operated software for a living you recognize all three: least privilege, separation of duties plus CI regression gates, immutable audit logs. The field isn't inventing new safety machinery — it's rediscovering ops controls from the inside, one incident at a time. Weng herself reaches for an operating-systems analogy for harnesses; I'm just following it down to the ops floor. I mean that as a compliment. Convergent evolution is evidence the constraints are real rather than stylistic.

Exactly one system in Weng's survey treats this as a first-class constraint — ScientistOne (Meng et al. 2026), over on the AI-scientist branch, where every claim must trace back to an evidence source and the chain gets audited. That idea has not crossed over to the self-editing harness loops. There, provenance keeps getting built as a side effect: lineage exists in DGM because researchers wanted to debug evolution; failure records are rich in Self-Harness because flat labels made weakness mining useless. The argument I've been making for two posts now is that the record type system should be the load-bearing wall, not the scaffolding you only notice after it catches something.

You already run one of these

This is not a frontier-lab concern. If your coding agent maintains its own memory file, writes its own instruction or skill files, or appends "lessons learned" that get loaded into future sessions — you are running a self-editing harness. Smaller loop, same topology: model-authored artifacts feeding future model behavior, usually with zero record types and no regression gate.

The canonical failure shape doesn't need an adversary. An agent writes a confident note into its own memory — say, "the staging DB is safe to reset" — and three sessions later a different task reads it as established fact. Nobody hacked anything. The system simply has no way to distinguish what it verified from what it once said.

The checklist I'd actually apply:

Treat model-authored harness edits like schema migrations. Memory files, instruction files, generated skills: versioned, diffable, reversible. A model changing its own operating instructions is a deploy, not a note.
Two-gate promotion. An edit must fix the failure it targets (held-in) and break nothing else (held-out). Self-Harness converges on this shape independently, which is interesting, but shape is not sufficiency. My own preregistered test found a dumb baseline matching my gate scheme on half the failure classes, and I'd want Self-Harness benchmarked against an equally dumb accept-if-tests-pass rule before concluding the machinery earns its complexity. Run the gates, and run the strawman against them.
Type every persisted record at write time. runtime-verified / self-reported / human-authored, minimum. Enforce at read time that self-reported claims can't gate promotions or authorize actions.
Keep the evaluator outside anything the loop can write. Checker code, marker tokens, permission checks, credentials. If the agent can grep the checker, assume it will eventually optimize the checker instead of the task.
Keep the failures. Rejected edits and failed trajectories are the cheapest signal the loop has. The literature's bias toward publishing successes is exactly the bias your local loop will inherit from its own logs if you prune them. Weng lists preserving negative results among the field's open challenges; it applies just as hard at your scale.

None of this needs a research budget. It's a few enum values, a CI job, and some restraint about what ends up in the agent's writable mount.
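The two-gate promotion item from the checklist, as a minimal sketch. This is my reading of the rule, not Self-Harness's code: `promote`, `passes`, and the idea of modelling each task as a boolean check are all assumptions for illustration.

```python
from typing import Callable, Iterable

# A check takes a harness (any object) and returns True when the task passes.
Check = Callable[[object], bool]


def passes(harness: object, checks: Iterable[Check]) -> set[int]:
    """Indices of the checks this harness passes."""
    return {i for i, check in enumerate(checks) if check(harness)}


def promote(baseline: object, candidate: object,
            held_in: list[Check], held_out: list[Check]) -> bool:
    """Promote only if the candidate gains on held-in and regresses nowhere."""
    base_in, cand_in = passes(baseline, held_in), passes(candidate, held_in)
    base_out, cand_out = passes(baseline, held_out), passes(candidate, held_out)
    # Zero regression: everything the baseline passed must still pass.
    no_regression = base_in <= cand_in and base_out <= cand_out
    # The edit must fix at least one held-in failure to be worth shipping.
    fixes_something = len(cand_in) > len(base_in)
    return no_regression and fixes_something
```

The strawman to race it against is one line: accept whenever every held-in check passes, with no held-out split at all.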

Where I land on "the model will eat the harness"

Weng's prediction runs through the prompt-engineering analogy: models absorbed the tricks, while the job of specifying what you want, under which constraints, judged how — that part outlived every trick. I mostly buy it, with one sharpening.

Split your harness into two piles. Pile one exists to compensate for the model: context massaging, retry phrasing, output parsing, the clever loop tweaks. Pile two exists to protect you from the model: permissions, evaluators, provenance types, the audit log. Pile one depreciates with every model release (that's the loan structure I wrote about in the coding-speedup post), and automated harness search will only accelerate the depreciation, since it rediscovers those tricks for cents on the engineer-dollar. Pile two appreciates, because the more capable and self-modifying the system, the more the trust boundary is the only part you actually own.

STOP's capability-threshold result cuts both ways here and closes the argument neatly: below the threshold, the loop can't help itself; above it, the loop starts probing the checker. Either way, the invariants aren't optional.

Read Weng's survey — it's the best map of this territory right now. Then go look at what your agents are already writing into their own context for tomorrow.

Papers referenced: Weng 2026 (survey) · DGM — Zhang et al. 2025 · Self-Harness — Zhang et al. 2026 · Meta-Harness — Lee et al. 2026 · STOP — Zelikman et al. 2023 · ACE — Zhang et al. 2025 · SIA — Hebbar et al. 2026 · Trehan & Chopra 2026 · Terminal-Bench 2.0 — Merrill et al. 2026 · ScientistOne — Meng et al. 2026 · Bubeck et al. 2025

My Strawman Baseline Beat My Own Scheme on Half the Gate Classes

Sergei Parfenov — Mon, 06 Jul 2026 11:48:40 +0000

Part 4 ended with a question I couldn't answer: has anyone actually measured what gate decisions do on the reconstructed provenance vector versus the original? Not argued from first principles. Measured.

Nobody in the comments had data. Neither did I. So I built the harness: provenance-compaction-lab.

Four arms, one oracle

Four provenance-tracking arms observe the same synthetic trajectory — same seed, same degradation events, same merges. They differ only in what happens to provenance between decisions:

ground_truth — full vector, full lineage, never compacted. The oracle everything else is judged against.
structural_min — the Part 4 scheme. Axis scores keep their running min. Every C steps, lineage truncates to the last K hops; taint ids attached in the folded prefix are dropped, only a count survives. The compression penalty multiplies into reconstruction, which is folded into the min like any other axis.
structural_perhop — identical, except reconstruction is never min-folded. It's carried structurally as (n_compactions, worst_penalty) and handed to gates as data.
prose — the honest-but-naive baseline, not a proposal. Every C steps an LLM summarizes the working state, provenance included, into ≤150 words; a second call extracts scores and taints back out. Whatever survives the round trip is all this arm knows.

Merge is unchanged since Part 3 — element-wise min:

def merge(vectors: Iterable[ProvenanceVector]) -> ProvenanceVector:
    """Merge = element-wise min across inputs (Part 3)."""
    vs = list(vectors)
    if not vs:
        raise ValueError("merge() needs at least one input vector")
    return ProvenanceVector(
        scores={axis: min(v.scores[axis] for v in vs) for axis in AXES}
    )
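
To see the rule on concrete numbers, here is a self-contained version. The axis names are the five that show up in the summarized state later in this post; the `ProvenanceVector` dataclass is a stand-in I'm assuming, and the repo's own type and `AXES` constant may differ.

```python
from dataclasses import dataclass
from typing import Iterable

# Axis names as they appear in the summarized state; assumed, not copied from the repo.
AXES = ("freshness", "capability", "tool_integrity", "verification", "reconstruction")


@dataclass(frozen=True)
class ProvenanceVector:  # stand-in for the repo's type
    scores: dict[str, float]


def merge(vectors: Iterable[ProvenanceVector]) -> ProvenanceVector:
    """Merge = element-wise min across inputs."""
    vs = list(vectors)
    if not vs:
        raise ValueError("merge() needs at least one input vector")
    return ProvenanceVector(
        scores={axis: min(v.scores[axis] for v in vs) for axis in AXES}
    )


# Two inputs: one weak on verification, one uniformly weaker but very fresh.
a = ProvenanceVector(dict.fromkeys(AXES, 0.9) | {"verification": 0.45})
b = ProvenanceVector(dict.fromkeys(AXES, 0.7) | {"freshness": 1.0})
merged = merge([a, b])
```

The merged value inherits the worst score on every axis, so one weak input is enough to drag a whole axis down.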

Every 5 steps, nine gate policies — score thresholds, reconstruction-coupled, lineage blocklist, lineage allowlist, several flagged irreversible — fire against all four arms, and every disagreement with the oracle is logged. The matrix: 500-step horizons, cadences C ∈ {10, 25, 50}, K=5, allowlist window W=8 (> K deliberately, so the starvation shows), reconstruction penalty 0.02 per compaction, three degradation profiles, 20 seeds per cell. All four hypotheses were written into the spec before the first line of code. The whole mock matrix reruns in seconds, deterministic per seed.

One limitation up front: the trajectories are synthetic. The generator is the component you're meant to swap for your own traces — more on that at the end.

Anatomy of one false-proceed

Before any aggregate number, here is one decision going wrong, hop by hop.

Seed 0, C=25, med profile. Decision at step 25, gate payment_no_untrusted_taint — irreversible, blocks on unverified_web and tool_flaky taints. The working value's full taint history at step 25, which is what the oracle sees:

taint:unverified_web:2      ← unverified web fetch, step 2
taint:tool_flaky:6          ← flaky tool call, step 6
taint:tool_flaky:7          ← flaky tool call, step 7
taint:unverified_web:11     ← unverified web fetch, step 11
taint:stale_cache:16        ← stale cache read, step 16
taint:fallback_model:24     ← fallback model, step 24

At step 25 the first compaction fires. This is the deliberate lossy step, verbatim from the repo:

def truncate_lineage(state: ValueState, keep_hops: int) -> None:
    """Truncate to the last K hops; fold the prefix into aggregate counts and
    drop the folded taint ids from ``tainted_by`` (only the count survives)."""
    hops = state.lineage.hops
    if len(hops) <= keep_hops:
        return
    folded_now = hops[: len(hops) - keep_hops]
    kept = hops[len(hops) - keep_hops :]
    prefix = state.lineage.folded or FoldedPrefix()
    for hop in folded_now:
        prefix.absorb(hop)
    state.lineage.hops = list(kept)
    state.lineage.folded = prefix
    # invariant for structural arms: tainted_by == union of taints_added over
    # the surviving (visible) hops — folded taint ids are gone.
    state.tainted_by = set().union(*(set(h.taints_added) for h in kept)) if kept else set()

Lineage truncates to the last K=5 hops. Every taint attached in the folded prefix is gone. What structural_min still sees:

taint:fallback_model:24

The oracle blocks the payment — four gate-relevant taints. structural_min sees zero gate-relevant taints and proceeds. Nothing in the surviving state even hints that a check was skipped: the fold counter (n_taints_folded) is the only trace, and a default-allow gate has no reason to look at it.

Same state under the other two lineage strategies — measured, not hypothetical. Degrade-to-untrusted sees n_taints_folded > 0 and refuses to act; here it happens to be right, but had the folded taints been benign, the same refusal would have been a false-stop. Rehydrate fetches the 21 folded hops of the value's merged ancestry from the append-only log — 2.6 KB read — and recovers the oracle's block exactly.

That single mechanism, repeated across the matrix, produces the first headline number.

The headline numbers

1. 3.47% of irreversible-action decisions (payment, send) proceeded when the uncompacted oracle said block (structural_min, C=25, med profile).
2. structural_min memory dies at compaction cycle ≈ 35: from that cycle on, every reconstruction-coupled gate blocks permanently (0.98³⁵ ≈ 0.493 < 0.5). structural_perhop never dies.
3. Prose-vs-structural flip ratio: 1.39×. Prose disagrees with the oracle on 9.97% of decisions, structural_min on 7.16%, all gates pooled.
4. Rehydration: 459 cold-storage lookups per 100 lineage-gate decisions buy the flip rate down from 8.01% to 0.00%.

One sort before the details, because that list mixes two kinds of number. The death cycle is arithmetic in a lab coat — 0.98³⁵ < 0.5 needs a calculator, not a harness — and the error polarity two sections down follows from the gate definitions once you state them. For those, the harness contributes the price tag, not the fact. Numbers 1, 3 and 4 are measurements: nothing in the setup forces 3.47% rather than 12%, prose losing overall while winning two gate classes, or rehydration costing 4.6 lookups instead of 40.

Boyko was right twice

On Part 4, Nazar Boyko made two claims sharp enough to preregister as hypotheses.

Claim one: axis scores would track closely between arms; the gates that inspect lineage are where decisions split. Confirmed — with a disclosure. Score-gate flip rate for both structural arms is 0.00%, and that number is by construction, not a discovery: compaction never touches the running min of the base axes, and the harness property-tests exactly that invariant (base-axis drift against the oracle: 0.0000, at every decision point, every config). What the measurement adds is where divergence concentrates once scores are ruled out: lineage gates flip 8.01% of decisions blind, and reconstruction-coupled gates do worse than that.

Claim two: the recursion. If the compression penalty is folded into a running min, reconstruction decays monotonically toward zero regardless of how clean each individual hop is. The death-spiral run — 5,000 steps, C=25, 200 compactions — confirms it: monotone the whole way down, crossing the 0.5 gate threshold at cycle 35. From that point every reconstruction-coupled gate blocks permanently. The scores are pristine, the memory is technically alive, and nothing is allowed to act on it.

structural_perhop, which carries (n_compactions, worst_penalty) instead of folding, posts 0.00% flips on the same gates and never dies. Drift says the same thing from another angle: structural_min's only drifting axis is reconstruction, at 0.2100 MAE against the oracle; perhop holds it to 0.0191.
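The difference between the two arms fits in a few lines. This is a sketch under assumptions: the 0.02 penalty and the 0.5 threshold come from the run above, but the function names are mine, and the post does not spell out how gates consume the per-hop tuple, so the rule used here (judge the worst single compaction) is hypothetical.

```python
def min_fold(n_compactions: int, penalty: float = 0.02) -> float:
    """Reconstruction score when every compaction multiplies in (1 - penalty)
    and the result is folded into a running min."""
    score = 1.0
    for _ in range(n_compactions):
        score = min(score, score * (1.0 - penalty))
    return score


def per_hop(n_compactions: int, penalty: float = 0.02) -> tuple[int, float]:
    """Carry (n_compactions, worst_penalty) as data; nothing compounds."""
    return (n_compactions, penalty if n_compactions else 0.0)


THRESHOLD = 0.5

# Which of these cycle counts still clear the gate under min-folding?
folded_alive = [n for n in (1, 34, 35, 200) if min_fold(n) >= THRESHOLD]

# Hypothetical gate rule for the per-hop arm: judge the worst single compaction.
n_seen, worst = per_hop(200)
perhop_alive = (1.0 - worst) >= THRESHOLD
```

Under min-folding only the first two cycle counts survive; the per-hop tuple reports 200 compactions and still clears the same threshold, because no single hop was ever worse than 0.02.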

The error direction is a design choice

Both lineage gate styles were in the harness on purpose, and W > K was chosen deliberately so the starvation would show.

Blocklist gates — default-allow: "block if any taint:unverified_web:* appears in tainted_by." They fail dangerous: 8.50% false-proceeds, 0.00% false-stops. The fold forgets taints; the gate finds nothing to object to.

Allowlist gates — default-deny: "proceed only if lineage proves no fallback hop within the last 8 hops." They fail expensive: 7.53% false-stops, 0.00% false-proceeds. The proof gets folded away; the gate refuses.

Perfect polarization, and it's the same compaction both times. Compaction decides that you lose information; the gate's default under missing information decides which way you fail. So pick the polarity per action class: default-deny anything irreversible. You pay in over-blocking. The alternative pays in wired money.
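Both polarities on one truncated state, as a toy sketch. K=5 and W=8 match the matrix above; the 12-hop history, the dict layout, and both gate functions are hypothetical stand-ins, not the harness's gates.

```python
K, W = 5, 8  # hops kept after truncation; allowlist window (W > K)

# Full history of 12 hops, oldest first: one bad taint early on and no
# fallback hop anywhere. The oracle therefore blocks on the blocklist gate
# and proceeds on the allowlist gate.
full = [{"taints": {"unverified_web"} if i == 2 else set(), "fallback": False}
        for i in range(12)]
visible = full[-K:]  # everything a truncating arm still sees


def blocklist_gate(hops: list[dict]) -> bool:
    """Default-allow: proceed unless a blocked taint is visible."""
    return not any("unverified_web" in h["taints"] for h in hops)


def allowlist_gate(hops: list[dict]) -> bool:
    """Default-deny: proceed only with proof over a full W-hop window."""
    window = hops[-W:]
    return len(window) == W and not any(h["fallback"] for h in window)
```

On `full`, the blocklist gate blocks and the allowlist gate proceeds. On `visible`, both answers invert: the blocklist gate proceeds (false-proceed) and the allowlist gate refuses (false-stop). Same missing hops, opposite failures.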

The direction of this result was predictable from the definitions — that's exactly why it preregistered cleanly. The magnitudes at these degradation rates were not.

The strawman refuses to die

H4, as preregistered: prose flip rates dominate the structural arms on every gate class. Verdict: FAIL. Per gate class, mock channel, blind mode:

gate class	prose	structural_min	structural_perhop
score	2.44%	0.00%	0.00%
reconstruction	4.97%	16.17%	0.00%
lineage_blocklist	6.82%	8.50%	8.50%
lineage_allowlist	29.41%	7.53%	7.53%

The strawman beats the Part 4 scheme on two of four classes. On reconstruction-coupled gates, the min-folded penalty is a bigger source of corruption than the noisy channel — the death spiral punishes structural_min before prose noise catches up. On blocklist gates, deterministic truncation forgets taints at roughly the rate a lossy extractor does.

My first instinct was that the mock channel was rigged against prose — taint recall was assumed at 0.6. So I ran the real one: gpt-5-mini, the actual summarize→extract round trip. Here is the blob it produced for the exact six-taint state from the worked example above:

Working memory state: freshness 0.700, capability 0.650, tool_integrity 0.650, verification 0.450, reconstruction 1.000. Attached taints: taint:fallback_model:24, taint:stale_cache:16, taint:tool_flaky:6, taint:tool_flaky:7, taint:unverified_web:11, taint:unverified_web:2. Overall, the memory is moderately fresh and moderately capable with intact reconstruction, but tool integrity is degraded and verification is low, indicating reduced trust in outputs that relied on tools or external content. Multiple taints note use of a fallback model, stale cache data, two instances of flaky tools, and two unverified web sources; these should be considered when using or revalidating this state.

Extraction recovered all five scores exactly and all six taint ids. Across 993 extractions: 0 parse failures, taint recall 0.893, precision 0.904 (245 fabricated taints), score MAE ≈ 0.0001. A modern small model copies explicit values through prose nearly losslessly.
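For readers who want the two metrics pinned down, a set-based sketch. Whether the harness pools counts across extractions or averages per extraction is not stated here, so treat this as the per-extraction definition only; the function name and the fabricated `taint:tool_flaky:9` id are illustrative.

```python
def taint_recall_precision(truth: set[str], extracted: set[str]) -> tuple[float, float]:
    """Recall: share of true taints recovered.
    Precision: share of extracted taints that are real (the rest are fabricated)."""
    hits = len(truth & extracted)
    recall = hits / len(truth) if truth else 1.0
    precision = hits / len(extracted) if extracted else 1.0
    return recall, precision


truth = {"taint:unverified_web:2", "taint:tool_flaky:6", "taint:tool_flaky:7"}
# One true taint dropped, one fabricated id added (hypothetical extraction).
extracted = {"taint:unverified_web:2", "taint:tool_flaky:6", "taint:tool_flaky:9"}
recall, precision = taint_recall_precision(truth, extracted)
```

A dropped taint costs recall and can cause a false-proceed on a blocklist gate; a fabricated one costs precision and can cause a false-stop.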

parameter	mock (assumed)	gpt-5-mini (measured)
score noise	gaussian σ = 0.08	≈ 0.0001 MAE
taint recall	0.60	0.893
taint precision	0.90	0.904
flip rate, score gates	2.4%	0.08%
flip rate, reconstruction gates	5.0%	0.0%
flip rate, allowlist gates	29.4%	43.1%

Disclosure on this table before the punchline: the real column is a smaller slice — C ∈ {10, 25}, med profile only, 2 seeds, those 993 extractions — while the mock column is the full matrix. Read it as channel characterization, not a matched benchmark; the repo ships a matched-slice config if you want the strict twin. (The matched slice also doubles as a sanity check: the structural arms' columns come out identical between mock and real runs, because they never touch the channel — the only column that actually moves is prose.)

The last row is the point. The channel got nearly perfect, and allowlist flips got worse. A summary preserves values; it destroys structure. "No fallback hop within the last 8 hops" is a proof about an ordered window, and no amount of faithful prose reconstitutes the window. The failure mode of prose isn't noise — it's the loss of provability.

Caveat, because it matters: this measured the channel in a best case. The summarize prompt hands the model a clean structured list and asks it to carry the list across. Real agent memory interleaves provenance with content competing for the same 150 words. Treat the mock's 0.6 recall as the pessimistic bound and gpt-5-mini's 0.893 as the optimistic one — on structure-dependent gates, both bounds tell the same story.

Where the crossover sits

"Prose sometimes beats structural_min" is a result. A design rule needs to know when. A finer cadence sweep, C from 5 to 100, med profile, 20 seeds per point, settles it.

On reconstruction-coupled gates, structural_min is worse than the prose strawman at every cadence up to C=50 — at C=10 that's 34.92% flips against prose's 6.85% — and only drops below prose between C=50 and C=75. The rule that falls out: once memory sees more than ~7–10 compaction cycles inside a decision horizon, min-folded reconstruction — not summarization noise — is the dominant corruption source. On blocklist gates the cross comes even earlier, around C=15. Pooled over all gate classes, prose stays behind at every cadence — allowlist starvation and score noise keep it there.

Before anyone quotes that ~7–10 cycle rule as a constant: it isn't one. The threshold is a function of two parameters I chose: the per-compaction penalty (0.02 here) and the noise of the channel it races. The death cycle is closed-form: memory dies at n = ln(θ) / ln(1 − p), which at θ = 0.5, p = 0.02 gives 34.3, so the first whole cycle below threshold is 35, matching the run. That part scales as roughly 1/p for small penalties: double the penalty, halve the death cycle. Arithmetic.
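The closed form, as a few lines you can rerun with your own penalty. The function name is mine; the formula is the one above.

```python
import math


def death_cycle(threshold: float, penalty: float) -> tuple[float, int]:
    """Return (exact crossing point n, first whole cycle strictly below
    threshold) for a score that decays as (1 - penalty) ** n."""
    n = math.log(threshold) / math.log(1.0 - penalty)
    return n, math.floor(n) + 1


n, first_dead = death_cycle(0.5, 0.02)     # n ≈ 34.3, first_dead == 35
_, first_dead_2x = death_cycle(0.5, 0.04)  # doubled penalty: first_dead_2x == 17
```

Doubling the penalty moves the death cycle from 35 to 17, which is the 1/p scaling up to rounding.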

The crossover against prose does not scale as 1/p — and I know because I assumed it would and swept it: 14 cadences × 5 penalties × 20 seeds, 1,400 mock runs, under a minute. The fit says the crossover cycle count moves as ~p^−1.6, steeper than 1/p, because the baseline it races isn't flat: prose noise compounds with cycle count too, just slower, so as the penalty shrinks, the intersection runs away superlinearly. (Exponent fitted on the sweep grid, p from 0.005 to 0.1 — don't carry it far outside.) prov-lab report prints the analytic death cycle for whatever penalty you configure, and prov-lab sweep maps your crossover — so the rule you quote can be yours, not mine.

Which exposes what H4 got wrong at preregistration time: "structural vs prose" was never the axis. The axis is which fields the compaction preserves, relative to which fields the gates read. Every arm in this experiment failed exactly where it discards something some gate consumes.

Quimby's question has a price tag

Max Quimby asked, on Part 4: when lineage has been compressed and a policy needs the detail — do you re-expand from somewhere, or does the policy degrade to treating the result as untrusted? Both answers are in the harness, with deciding blind as the control. The irreversible payment gate, three ways:

mode	agreement	false-proceed	false-stop	lookups / 100 decisions
blind	91.22%	8.78%	0.00%	0
degrade	95.72%	0.00%	4.28%	0
rehydrate	100.00%	0.00%	0.00%	639.9

Aggregated across all lineage gates: 459 lookups per 100 decisions, tens of KB read, flip rate 8.01% → 0.00%. The gate with all three modes, verbatim:

    def evaluate(
        self, view: GateView, mode: str = "blind", hop_log: HopLog | None = None
    ) -> GateDecision:
        taints = set(view.tainted_by)
        lookups = 0
        bytes_read = 0
        detail_missing = view.folded is not None and view.folded.n_taints_folded > 0
        if mode == "degrade" and detail_missing:
            # degrade-to-untrusted: taints were dropped, refuse to act
            return GateDecision(proceed=False)
        if mode == "rehydrate" and detail_missing and hop_log is not None:
            assert view.folded is not None
            hops, bytes_read = hop_log.fetch(view.folded.folded_hop_ids)
            lookups = len(view.folded.folded_hop_ids)
            for hop in hops:
                taints |= set(hop.taints_added)
        blocked = any(_matches(t, self.block_prefixes) for t in taints)
        return GateDecision(proceed=not blocked, lookups=lookups, bytes_read=bytes_read)

Degrade costs nothing and converts every dangerous error into an expensive one — a legitimate answer for reversible actions. Rehydration from an append-only hop log recovers the oracle exactly, at a price that turned out measurable and small: about 4.6 lookups per lineage-gate decision. For irreversible gates, that's the trade I'd take every time.

What to actually build

Persist running-min axis scores. Constant size, lossless by construction, drift 0.0000 at every decision point — property-tested. This half of Part 4 survives contact with measurement.
Never fold a compression penalty into a min. Track (n_compactions, worst_penalty) structurally. perhop flips 0.00% of reconstruction-coupled decisions and never dies.
Pick gate polarity per action class. Default-deny anything irreversible: fail expensive, not dangerous.
Append-only hop log plus rehydrate-on-demand for irreversible gates; degrade-to-untrusted is fine for reversible ones. A reference implementation on stdlib sqlite3 ships in the repo as provlab.store.
Measure your own pipeline. The real-channel run was ~2,000 requests to a small model — $1–1.5 and about 19 minutes. Less than a coffee.
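The first two items above can be sketched in a few lines. This is a minimal illustration, not the code from the repo: `RunningScores`, `observe`, and `record_compaction` are hypothetical names, and the penalty value is made up. The point it shows is that the per-axis minimum stays constant-size, while compaction is counted beside the scores instead of being folded into them.

```python
from dataclasses import dataclass, field


@dataclass
class RunningScores:
    """Constant-size trust state: one running minimum per axis."""
    mins: dict[str, float] = field(default_factory=dict)
    # Compaction is tracked beside the scores, never folded into them.
    n_compactions: int = 0
    worst_penalty: float = 0.0

    def observe(self, axis: str, score: float) -> None:
        # min is idempotent: seeing the same score again changes nothing
        self.mins[axis] = min(self.mins.get(axis, 1.0), score)

    def record_compaction(self, penalty: float) -> None:
        # structural record: how many compactions, and the worst one seen
        self.n_compactions += 1
        self.worst_penalty = max(self.worst_penalty, penalty)


state = RunningScores()
for step_score in (0.9, 0.4, 0.7):      # illustrative per-step freshness scores
    state.observe("freshness", step_score)
for _ in range(3):                      # three compactions, same hypothetical penalty
    state.record_compaction(penalty=0.2)

print(state.mins, state.n_compactions, state.worst_penalty)
```

A gate that cares about reconstruction can then read `(n_compactions, worst_penalty)` directly and apply its own rule, while the axis scores stay exactly what the trajectory produced.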

Run it

uv sync
uv run prov-lab run --config experiments/config.yaml --mock
uv run prov-lab report

Whole matrix in seconds, deterministic per seed, MIT: P0rt/provenance-compaction-lab.

Two ways to point this at your own system now. prov-lab audit is the closing question as a command: ~20 lines of YAML — which fields your compaction preserves, which fields each gate reads, each gate's default polarity — and it prints the starvation table: which of your gates are deciding blind, and in which direction they'll fail. No simulation, no traces, five minutes. And prov-lab run --trace your.jsonl replays the whole harness over real agent logs: taint-derivation rules are YAML data, not code (tool status ≠ ok → taint, cache age over threshold → taint, and so on), the oracle is a full-provenance replay of the same trace, and the report tells you what it could and couldn't map. Synthetic trajectories are this experiment's honest limitation; replication on a real memory pipeline is the one result I can't produce alone.
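To make the audit idea concrete without the tool, here is a rough sketch of the same set arithmetic in plain Python. It is not prov-lab's implementation and does not use its YAML schema: the field names, gate names, and the `starvation_table` helper are all hypothetical. It assumes a starved default-deny gate fails toward false-stop and a starved default-allow gate fails toward false-proceed, in line with the blind and degrade rows of the payment-gate table.

```python
# Hypothetical inputs: field and gate names are illustrative only.
preserved = {"axis_scores", "n_compactions"}

gates = {
    "payment": {"reads": {"axis_scores", "tainted_by"}, "default": "deny"},
    "send_email": {"reads": {"tainted_by"}, "default": "allow"},
    "summarize": {"reads": {"axis_scores"}, "default": "allow"},
}


def starvation_table(
    preserved: set[str], gates: dict[str, dict]
) -> list[tuple[str, list[str], str]]:
    """List gates that read a field the compaction drops, with failure direction."""
    rows = []
    for name, spec in gates.items():
        missing = sorted(spec["reads"] - preserved)
        if not missing:
            continue  # this gate still sees everything it consumes
        # Starved default-deny gates stop wrongly (expensive);
        # starved default-allow gates proceed wrongly (dangerous).
        direction = "false-stop" if spec["default"] == "deny" else "false-proceed"
        rows.append((name, missing, direction))
    return rows


for name, missing, direction in starvation_table(preserved, gates):
    print(f"{name:<12} missing={missing} fails toward {direction}")
```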

Credits, which in this series means authorship: Nazar Boyko called both the score/lineage split and the min-fold recursion before a line of this code existed. Max Quimby asked the question that became a price tag. This part, like the ones before it, was co-written by the comment section.

So, the question for this thread: what does your compaction actually preserve, relative to what your gates read? prov-lab audit is that question as a command; --trace is the full version. Either way — post the table.

Part 6 is attest() — restoration semantics. Everything in this system can only lower an axis. What event is allowed to raise one, and who holds the authority to say so?

Your Provenance Vector Dies at the Storage Boundary

Sergei Parfenov — Wed, 01 Jul 2026 11:58:09 +0000

Last post I argued that agent trust should be a typed provenance vector: carry what-degraded-and-how alongside each result, propagate it, let each consumer apply its own policy. The comments agreed on the model and then immediately found the two places it breaks in the real world. Both are load-bearing, both were things I hand-waved, and this post is about them.

mote asked what happens when the agent runs 500 steps and the vector no longer fits in the context window.
Mykola said the quiet part louder: "you can build a perfect trust lattice but most agents just act on output without checking provenance. The hard part is enforcement, not the model."

Both are right, and together they name the two ways a provenance vector dies in production: nobody reads it, or it can't survive being stored. One problem is about enforcement, the other about persistence.

TL;DR — Two failure modes kill a provenance vector in production. Enforcement: if acting on a value doesn't require passing through the gate, developers (and models writing tool calls) will skip it — so make the unsafe path unrepresentable via types, not discipline. Persistence: on long-horizon agents the vector must survive compression to fit bounded memory, and naive summarization washes out exactly the axes you need — so compress structurally (per-axis, lossless scores + lossy lineage), not as prose.

Problem 1: enforcement, or the vector nobody reads

Mykola's point is the one that should scare you, because it's true of almost every "add metadata to make it safer" scheme: the metadata is optional, so under deadline it gets skipped. You can ship a beautiful Provenance type and six months later find that the payment path reads result.value and never touches result.provenance. The lattice was perfect. Nobody consulted it.

The fix is not "remember to check." Discipline doesn't scale and it definitely doesn't survive a model writing its own tool calls. The fix is to make acting without checking something the code physically cannot express.

This is a solved problem in a neighboring field, and it's worth stealing wholesale. Capability-based security has done this for decades: authority is an unforgeable token you must hold a reference to — you can't perform the action without possessing the capability, and possession is the check. Recent work brings this into static types explicitly: track the capability in the type system, and the absence of it in a function's type guarantees, at compile time, that the function can't perform the guarded action. The safety isn't a runtime assertion you might forget — it's a property of what typechecks.

Applied to provenance, the move is: the irreversible action can't accept a raw value, only a gated one.

from typing import Generic, TypeVar
T = TypeVar("T")

class Provenanced(Generic[T]):
    """A value you cannot use for a side effect without unwrapping —
    and the ONLY unwrap path runs the gate."""
    def __init__(self, value: T, prov: Provenance):
        self._value = value
        self._prov = prov

    def unwrap_for(self, action: "Policy") -> T:
        decision = gate(action, self._prov)
        if decision != "proceed":
            raise ProvenanceViolation(decision, self._prov)  # refetch / escalate / ...
        return self._value

# the side-effecting function's SIGNATURE refuses raw values:
def charge_card(amount: Provenanced[Money], policy: Policy) -> Receipt:
    money = amount.unwrap_for(policy)   # the only way to get the Money out
    ...

Now "charge the card without checking provenance" doesn't fail code review — it doesn't typecheck. There is no path from a raw Money to charge_card, because the signature demands Provenanced[Money], and the only way to extract the value runs the gate. You've moved the enforcement from the developer's memory into the type system. It's the same trick as idempotency keys from two posts ago: don't ask people to remember the safe thing, make the unsafe thing unrepresentable.

The honest limit (which a commenter will rightly raise, so I'll raise it first): this holds at the framework boundary, in typed code you control. The moment your agent writes free-form tool calls — the model generating Python that calls your API directly — it can simply not use the wrapper, and you're back to enforcement-by-hope. For that case the type system can't reach, so enforcement has to drop to the infrastructure layer: the side-effecting tools sit behind a proxy that refuses any call whose payload doesn't carry valid provenance. You lose compile-time guarantees and get runtime rejection instead — worse, but still "structurally can't skip it" rather than "please remember." The principle survives even when the mechanism changes: enforcement lives in something the actor can't route around, never in something it's asked to honor.
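A minimal sketch of that proxy, under stated assumptions: "valid provenance" is taken here to mean "signed with a key the model never sees and above the tool's floors", which is one possible reading and not something the post specifies. `SECRET`, `sign`, `proxy_call`, and the request layout are hypothetical names chosen for illustration.

```python
import hashlib
import hmac
import json

# Hypothetical: held by the framework and the proxy, never exposed to the model.
SECRET = b"proxy-only-key"


def sign(payload: dict, provenance: dict) -> str:
    """Bind a provenance record to one specific payload."""
    body = json.dumps({"payload": payload, "provenance": provenance}, sort_keys=True)
    return hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()


def proxy_call(tool, request: dict, floors: dict[str, float]):
    """Forward to the side-effecting tool only if provenance is present, signed, and admitted."""
    payload = request.get("payload", {})
    provenance = request.get("provenance")
    if provenance is None:
        raise PermissionError("rejected: no provenance attached")
    expected = sign(payload, provenance)
    supplied = str(request.get("signature", ""))
    if not hmac.compare_digest(expected.encode(), supplied.encode()):
        raise PermissionError("rejected: provenance signature invalid")
    # A missing axis counts as 0.0, so omission can never pass a floor.
    failed = [a for a, floor in floors.items() if provenance.get(a, 0.0) < floor]
    if failed:
        raise PermissionError(f"rejected: below floor on {failed}")
    return tool(**payload)


def charge_card(amount_cents: int) -> str:
    return f"charged {amount_cents}"


payload = {"amount_cents": 1200}
prov = {"freshness": 0.97, "capability": 0.8}
ok = {"payload": payload, "provenance": prov, "signature": sign(payload, prov)}
print(proxy_call(charge_card, ok, floors={"freshness": 0.95}))
```

A model-written call that skips the wrapper has no signature to present, so it is rejected at runtime instead of failing a type check.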

Problem 2: provenance that survives compression

mote's problem is deeper and I didn't have an answer in the thread, so I went and found one. Here's the setup: a long-horizon agent — mote's case is literally robots on edge hardware with a hard context ceiling — can't hold a growing provenance graph in working memory across 500 steps. It has to compress. And the standard compression move, summarize-history-into-prose, is catastrophic for provenance specifically, because summarization is lossy in an uncontrolled way — it'll happily drop "step 47 ran on a stale cache" to save tokens, and that's the one fact a downstream gate needed.

This isn't hypothetical. The field now attributes the majority of enterprise agent failures to context drift and memory loss during multi-step reasoning — not to hitting the context limit, but to the quality degradation on the way there. And there's a subtler trap the RL-agent researchers named: compression credit is causally entangled — the same downstream failure needs opposite explanations depending on whether the bad state came from a tool or from memory. If your compression flattens that distinction, you can't even diagnose what broke.

So the naive answer — "summarize the provenance too" — reintroduces the exact scalar-collapse problem from the last post, now smuggled in through the storage layer. A summary is an average wearing a trench coat.

The better answer comes from a simple observation: the axes have different compression economics, so don't compress them uniformly.

Scores compress to almost nothing, losslessly. A per-axis float — freshness: 0.2, capability: 0.6 — is a handful of numbers. Even across 500 steps, if you keep only the running minimum per axis (which is what the gate reads anyway; recall the min from last post), that's constant size regardless of history length. You never need to compress the scores, because min-reduction already bounds them.
Lineage is what explodes, and lineage is what you can afford to lose. The tainted_by sets — which exact steps degraded each axis — grow with the trajectory. But for the gate decision, you usually don't need the full ancestry; you need "is any unverified degraded step still on the live path." So this is the part you lossy-compress: keep the axis scores whole, summarize the lineage behind a pointer, and accept that you lose "which exact step" while keeping "how degraded, per axis."

This maps onto where the research is heading. The most promising long-horizon approaches have stopped treating the trajectory as prose to be summarized and started treating it as a typed dependency graph the agent annotates as it works, with a deterministic eviction policy that walks the graph when the token budget blows — explicitly to avoid the four pathologies of prose compaction: unpredictable lossiness, structural destruction, blocking cost, and compression-induced hallucination. A typed provenance vector is that annotation. The eviction policy for provenance is: evict lineage detail, never evict axis scores.

There's one more axis this forces you to add, and it's almost funny: compression is itself a degradation source. A vector reconstructed from a lossy summary is less trustworthy than one carried whole — so "this provenance was reconstructed across a storage boundary" is a real provenance fact that deserves its own axis. reconstruction: 0.8 means "these scores survived a compaction; treat the lineage as approximate." The provenance system has to describe its own lossiness. Turtles, but only two deep.

Why this keeps being a security problem in disguise

Every post in this series has ended up borrowing from security, and this one makes the reason explicit. Traditional taint tracking assumes deterministic program states and exact data-flow: memory locations, registers, string matches. LLM agents break all of that — untrusted content gets rewritten, summarized, and used to choose later actions, so "did this bad input reach that sink" is a question about semantic and causal influence, not byte-level flow. The agent security researchers building taint trackers for exactly this case had to redefine propagation to include semantic transformation and cross-session persistence through memory — which is the same two problems this post is about (enforcement and persistence), arrived at from the attack side instead of the reliability side.

That convergence is the tell. When the reliability people and the security people independently reinvent the same structure — unforgeable gating plus provenance that survives memory — it's because it's the actual shape of the problem, not a preference.

Where the series stands

Four posts, one arc:

Availability — agents fail on capacity (rate limits), not reasoning.
Correctness — the capacity fixes buy uptime by acting on unearned output; you need correct uptime.
The model — trust isn't a scalar; it's a typed provenance vector with policy at the consumer.
The reality (this one) — that vector only works if it's unskippable (enforcement by type/proxy) and survivable (structural compression, not prose).

The through-line, one more time: agent reliability is a provenance problem, and provenance is a solved discipline — capability security, data lineage, taint analysis — that we're re-deriving because the untraceable thing now acts, and acts through a bounded, forgetful, non-deterministic memory. The novelty isn't the primitives. It's that they now have to hold under compression and under a model that can route around anything you merely ask it to respect.

If you're building this: gate at a boundary the actor can't skip (type or proxy), compress scores losslessly and lineage lossily, and add a reconstruction axis the day your provenance crosses a storage line. Start there.

Credit, again, to the comment section that wrote the spec: mote (compression across the storage boundary, the edge/bounded-context framing that motivates the whole second half), Mykola Kondratiuk (enforcement is the hard part, not the model), plus Tae Kim, Nazar Boyko, Ken, and Ahmet Özel for sharpening the axis rules in the last thread. Open question for this one: has anyone actually run provenance across a compaction boundary in production and measured what the gate decisions do on the reconstructed vector versus the original? That's the experiment I don't have data for yet, and it's the one that decides whether any of this holds.

Sources & further reading

"Tracking Capabilities for Safer Agents" — capabilities as unforgeable tokens tracked in static types; compile-time non-interference from the absence of a capability.
"Ghost in the Agent: Redefining Information Flow Tracking for LLM Agents" (NeuroTaint) — why classical taint doesn't transfer: agents rewrite, summarize, and act on untrusted content; taint as semantic/causal/persistent influence.
"Beyond Compaction: Structured Context Eviction for Long-Horizon Agents" — annotate the trajectory as a typed dependency graph; deterministic graph-walking eviction instead of prose summarization.
"AI Agent Context Compression: Strategies for Long-Running Sessions" — context drift/memory loss as the majority of enterprise agent failures; anchored iterative summarization beats full reconstruction.
"HiMPO: Hindsight-Informed Memory Policy Optimization" — causally entangled memory credit: the same failure needs opposite explanations depending on tool-vs-memory origin.
The series: Part 1 — capacity · Part 2 — correct uptime · Part 3 — typed provenance

Trust Isn't a Scalar: Typed Provenance for Agent Chains

Sergei Parfenov — Mon, 22 Jun 2026 14:21:08 +0000

Two posts ago, in the one about agents failing quietly, I handed you a fix for silent degradation: tag a degraded output trust="degraded", propagate the taint down the chain, and gate irreversible actions on it. Clean, shippable, and — as a commenter named Theo pointed out within a day — wrong in a way that matters.

The tag was a boolean. And trust isn't a boolean. It isn't even a scalar.

This post is me being wrong in public and fixing it, because the corrected model is genuinely better and most of it was built by people in that comment thread. Credits at the end; they earned them.

TL;DR — A single trust score (full/degraded, or 0.0–1.0) collapses on real chains, because degradation happens along different axes — a stale cache lowers freshness, a weaker fallback lowers capability — and different downstream steps care about different ones. Collapse them to one number and you either over-reject (every degradation is fatal) or under-reject (the dangerous one gets averaged away). What actually composes is typed provenance: carry a vector of what-was-degraded-and-how alongside the result, propagate it across the chain, and let each consumer apply its own policy at the moment it's about to act.

Why a scalar collapses

Here's the case that broke my boolean, almost verbatim from Theo's comment.

You have two downstream steps, both consuming an upstream result:

A summarization step. It tolerates a weaker model just fine, but it must not run on stale data.
A price calculation. It's the reverse: it needs current data, but a slightly weaker model doing arithmetic is fine.

Now the upstream result came from a fallback model reading a 2-hour-old cache. So it's degraded on both a capability axis (weaker model) and a freshness axis (old cache). What's your single trust score?

If you set it low (treat any degradation as serious), the summarization step over-rejects — it would've been totally fine with the weaker model, but your scalar said "degraded" so it bails or escalates needlessly.
If you set it high (it's "mostly fine"), the price calc under-rejects — it acts on stale data because the scalar averaged the freshness problem into a number that looked acceptable.

There is no single threshold that's simultaneously right for both consumers, because they're not measuring the same thing. A scalar forces every consumer to share one definition of "trustworthy," and they don't have one. As Theo put it: collapse the vector to one number and you destroy exactly the information the consumer needs to make its own decision.

This isn't just my comment section talking, either — it's where the field is converging. A recent framework (TrustBench) makes the same move explicitly: rather than reduce trust to a single scalar, keep dimensional scores per trust aspect, and weight them per domain — healthcare prioritizing citation validity and recency, finance prioritizing calculation and compliance. Same shape, arrived at independently. When several people reach for the same structure from different directions, it's usually because the structure is real.

Trust is a vector; provenance is what you propagate

Here's the reframe that fixes it, and it starts with a vocabulary correction I owe you: I kept calling the thing "trust." That was the bug in the language, not just the code.

Trust is not a property of a value. It's a judgment a consumer makes about a value. What the value actually carries is provenance — the typed record of how it came to be: which model produced it, how fresh its inputs were, which tools ran, what got degraded and along which axis. Trust is what each consumer computes from that provenance, under its own policy. The price calc and the summarizer look at the same provenance and reach different verdicts, and that's correct, not contradictory.

So you don't propagate a degraded flag. You propagate a typed vector, and each axis degrades independently:

from dataclasses import dataclass, field
from enum import Enum

class Axis(str, Enum):
    FRESHNESS = "freshness"      # how current were the inputs
    CAPABILITY = "capability"    # how strong was the model that produced this
    TOOL = "tool"                # did the tool calls actually succeed
    VERIFICATION = "verification" # was this checked against ground truth

@dataclass
class Provenance:
    # per-axis score in [0,1]; 1.0 = fully trusted on that axis
    axes: dict[Axis, float] = field(default_factory=lambda: {a: 1.0 for a in Axis})
    # which upstream step_ids contributed degradation, per axis
    tainted_by: dict[Axis, set[str]] = field(default_factory=lambda: {a: set() for a in Axis})

    def merge(self, *upstreams: "Provenance") -> "Provenance":
        out = Provenance()
        for axis in Axis:
            # an output is only as fresh as its stalest input, only as
            # capable as its weakest producer — min, not average. averaging
            # is exactly how the dangerous axis gets washed out.
            out.axes[axis] = min([self.axes[axis]] + [u.axes[axis] for u in upstreams])
            out.tainted_by[axis] = set(self.tainted_by[axis])
            for u in upstreams:
                out.tainted_by[axis] |= u.tainted_by[axis]
        return out

The min is doing real work there. The whole failure of my original taint-as-boolean was that it answered "is anything degraded?" — a single OR across the chain. The vector answers "what kind of degradation is this output carrying, and how much, per axis?" — and crucially, it takes the minimum per axis rather than averaging, because averaging is the mathematical operation that makes a serious freshness problem disappear behind three fine capability scores.
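A tiny worked example of that last point, with made-up numbers: three upstream freshness scores, one of them from a stale cache, and a hypothetical consumer floor of 0.7.

```python
from statistics import mean

# Illustrative freshness scores from three upstream steps; one read a stale cache.
freshness_inputs = [1.0, 1.0, 0.2]
floor = 0.7  # hypothetical consumer floor

averaged = mean(freshness_inputs)
merged = min(freshness_inputs)

print(f"mean={averaged:.2f} passes floor: {averaged >= floor}")  # mean=0.73 passes floor: True
print(f"min={merged:.2f} passes floor: {merged >= floor}")       # min=0.20 passes floor: False
```

The average clears the floor because two healthy inputs outvote the stale one; the minimum does not, which is the behavior the gate needs.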

The gate is per-consumer, not global

Now the irreversibility gate from the last post stops being one global threshold and becomes a policy that lives at each consumer:

@dataclass
class Policy:
    # per-axis minimum this consumer requires to act without re-check
    floors: dict[Axis, float]

    def admits(self, p: Provenance) -> bool:
        return all(p.axes[a] >= floor for a, floor in self.floors.items())

# the summarizer doesn't care about capability, but demands freshness
SUMMARIZE = Policy(floors={Axis.FRESHNESS: 0.9, Axis.CAPABILITY: 0.3})

# the price calc is the mirror image
PRICE_CALC = Policy(floors={Axis.FRESHNESS: 0.95, Axis.CAPABILITY: 0.6,
                            Axis.VERIFICATION: 0.8})

def gate(action_policy: Policy, p: Provenance) -> str:
    if action_policy.admits(p):
        return "proceed"
    # which axis failed tells you HOW to recover, not just THAT to stop
    failed = [a for a, f in action_policy.floors.items() if p.axes[a] < f]
    if Axis.FRESHNESS in failed:
        return "refetch"      # re-run the stale step on live data
    if Axis.CAPABILITY in failed:
        return "re-run-on-primary"
    return "escalate-to-human"

This is the payoff. The same upstream provenance vector flows to both consumers, and they reach different, individually correct decisions from it. The summarizer proceeds; the price calc refetches. One global score could never do that — and the failed-axis tells you how to recover, which a boolean never could.
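Here is a condensed, self-contained run of that claim. The types are trimmed copies of the ones above (lineage and the tool axis are left out for brevity), and the upstream scores are assumptions picked for illustration: a freshness of 0.92 sits between the two freshness floors, which is what makes the two consumers disagree.

```python
from dataclasses import dataclass, field
from enum import Enum


class Axis(str, Enum):
    FRESHNESS = "freshness"
    CAPABILITY = "capability"
    VERIFICATION = "verification"


@dataclass
class Provenance:
    axes: dict[Axis, float] = field(default_factory=lambda: {a: 1.0 for a in Axis})


@dataclass
class Policy:
    floors: dict[Axis, float]

    def failed_axes(self, p: Provenance) -> list[Axis]:
        return [a for a, floor in self.floors.items() if p.axes[a] < floor]


def gate(policy: Policy, p: Provenance) -> str:
    failed = policy.failed_axes(p)
    if not failed:
        return "proceed"
    if Axis.FRESHNESS in failed:
        return "refetch"            # stale input: re-run on live data
    if Axis.CAPABILITY in failed:
        return "re-run-on-primary"  # weak producer: retry on the primary model
    return "escalate-to-human"


SUMMARIZE = Policy(floors={Axis.FRESHNESS: 0.9, Axis.CAPABILITY: 0.3})
PRICE_CALC = Policy(
    floors={Axis.FRESHNESS: 0.95, Axis.CAPABILITY: 0.6, Axis.VERIFICATION: 0.8}
)

# Illustrative upstream result: fallback model, slightly stale cache.
upstream = Provenance()
upstream.axes[Axis.FRESHNESS] = 0.92
upstream.axes[Axis.CAPABILITY] = 0.5

print(gate(SUMMARIZE, upstream))   # proceed
print(gate(PRICE_CALC, upstream))  # refetch
```

The price calc fails on both freshness and capability here; the gate reports the freshness recovery first because that is the order the branches are checked in.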

Notice this also absorbs a point another commenter (Manuel) made independently: he argued the tag should be an enum, not a bool — skipped-tool vs stale-data vs retry-budget-exhausted route differently. He was right, and the vector is the generalization: an enum is a vector with one axis active; the full structure lets multiple axes degrade at once, which is the real production case.

"Gate on risk, not confidence" — and confidence is just one axis

The last post argued you should gate on irreversibility, not on the model's self-reported confidence. The vector makes that precise instead of hand-wavy: confidence is one axis among several, and it's the one the model grades itself on. A model can be 95%-confident (high on a confidence axis) while sitting on a freshness score of 0.2 because it reasoned over a stale cache. The skill-conditional-trust literature makes the same argument from the routing side — a single global score is the wrong object because it can't express "great at this, useless at that." Confidence-as-the-only-axis is how you get the war story everyone has: the agent that was sure, and sure on the wrong thing.

How many axes before it stops being worth it?

This is the honest open question, and the one I asked Theo back. A vector with 40 axes is just a scalar's opposite failure — unwieldy, untunable, theater of rigor. My current answer, and I'd genuinely take pushback: start with the axes that map to your actual degradation sources, and no more. If your system has exactly two ways to degrade — fallback model and stale cache — you have two axes (capability, freshness). Add verification the moment you have a re-check step whose result you want to carry. Add tool when a tool can half-succeed. The axis count should equal the number of distinct things that can independently go wrong, not the number of things you can imagine going wrong. If two "axes" always move together, they're one axis.

The sweet spot, I think, is the smallest set where each axis maps to a different recovery action. Freshness → refetch. Capability → re-run on primary. Verification → escalate. If two axes would trigger the same recovery, collapse them. The vector earns its complexity only where it changes what you do.

The practical layer (mostly stolen from the comments)

The vector is the core idea, but the thread surfaced a full toolkit around it, and it'd be dishonest to present any of it as mine:

Admission control, upstream of everything (Dan): before the agent fans out, decide if the whole task can afford to run, and separate the four limits that 429s blur together — provider quota (physics), account quota (policy), task budget (this run), ledger (forensics). The ledger turns out to be the same record as provenance: "this run cost 47 calls, 12 on the fallback tier" is both your bill and your capability-axis score.
Validation at consumption, not production (James): don't validate on the fresh-call path and trust the cache; validate when a value is used, regardless of where it came from. That closes the laundering loophole at the consumer — which is exactly where the per-consumer gate already lives.
Time-bound by causality, not wall-clock (HARD IN SOFT OUT): I was tempted by "reset taint after N seconds." Don't — degraded state can sleep and surface later. Clear an axis when nothing on the live path still derives from the degraded step, not when a timer expires.
The poor-man's version for solo builders (TuanAnhNguyen): no observability stack? Have any tool that acts on a stale-readable input append one line to a log, and grep it before anything irreversible. It's the 5%-effort version of the provenance vector — a breadcrumb instead of a graph — and below a certain scale it's the correct amount of engineering.
The distributed correction (Abdullah): my original concurrency cap was an in-process semaphore, which silently assumes one process. Under serverless fan-out, N containers each capping at 8 gives you 8N real concurrency. The limiter has to live outside the workers. (Also: TPM saturates before RPM on long-context agents, and "fallback to a cheaper model" is fiction if it draws from the same pooled tier. Both are capability/freshness axis sources you'd otherwise miss.)
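The causality rule from the list above is the one that most benefits from code, so here is a rough sketch of it. The step names and the `parents` map are invented for illustration, and `ancestors` and `still_tainted` are hypothetical helpers: an axis stays tainted for as long as any live step is, or derives from, the degraded step, regardless of how much time has passed.

```python
def ancestors(step: str, parents: dict[str, set[str]]) -> set[str]:
    """All steps that `step` transitively derives from."""
    seen: set[str] = set()
    stack = [step]
    while stack:
        for parent in parents.get(stack.pop(), set()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def still_tainted(
    degraded_step: str, live_steps: set[str], parents: dict[str, set[str]]
) -> bool:
    """True while any live step is, or derives from, the degraded step."""
    return any(
        s == degraded_step or degraded_step in ancestors(s, parents)
        for s in live_steps
    )


# Illustrative derivation graph: child step -> the steps it was computed from.
parents = {
    "plan": {"cache_read"},
    "ship_order": {"plan"},
    "fresh_plan": {"live_read"},
}

print(still_tainted("cache_read", {"ship_order"}, parents))  # True
print(still_tainted("cache_read", {"fresh_plan"}, parents))  # False
```

No timer appears anywhere: the taint clears only when the live path has been rebuilt from inputs that never touched the stale read.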

The parable that says it better than I did

A commenter (HARD IN SOFT OUT) left this, and it's the whole series in five lines:

The agent hit a rate limit. It fell back to a cached answer from last Tuesday. The world changed on Wednesday. The agent kept working. The logs said "cache hit, 200 OK." The user got a message: "Your order has shipped." The warehouse's API key expired on Thursday.

Every hop green. Every log a 200. And a real package never ships. A scalar trust score on that final "order shipped" output would read fine — the last call succeeded. A provenance vector reads freshness: 0.1, tainted_by: {warehouse_check} and the shipping gate refuses to fire. That's the entire difference between uptime and correct uptime, and between a boolean and a vector.

Where this leaves the series

Three posts in, the actual thesis has assembled itself: agent reliability is a provenance problem. Availability (post 1) is the easy axis. Correctness (post 2) is the one that bites. And the structure that makes correctness tractable (post 3) is typed provenance carried through the chain, with policy at the edges. None of that is exotic — it's data lineage, taint analysis, and saga patterns, borrowed from disciplines that solved their version decades ago, newly load-bearing because the untraceable thing now acts.

If you're building this: start with two axes and a min, put the policy at the consumer, and add an axis only when it changes a recovery action. Everything else is premature.

This post was largely written by the comments on the last one. Credit, specifically: Theo Valmis (trust-is-a-vector, the summarize-vs-price-calc case, "typed provenance"), Manuel Bruña (enum-not-bool), Dan (admission control, the four-limit split), James O'Connor (validation at consumption), HARD IN SOFT OUT (causality-bound taint, the parable), TuanAnhNguyen (the solo-builder grep version), Abdullah Shahin (the distributed-limiter and pooled-fallback corrections), and Scarab Systems (the "evidence gate" framing that started me thinking about provenance as an obligation, not metadata). Best comment section on this site. Question for the thread: how many axes does your system actually need, and which ones map to a distinct recovery action versus just feeling rigorous?

Sources & further reading

"Real-Time Trust Verification for Safe Agentic Actions" (TrustBench) — dimensional trust scores over a scalar, domain-weighted, with block/warn/proceed gating.
"When Should Agent Trust Be Conditional?" — why a single global trust score is the wrong object for skill-heterogeneous agents.
"From Agent Traces to Trust: A Survey of Evidence Tracing and Execution Provenance in LLM Agents" — persistent lineage across memory writes, retrievals, and reuse.
"Redefining AI Agent Trust: An Input/Output-First Approach", Monte Carlo — trust as enforced contracts at system boundaries (freshness, schema, lineage on input; traceability on output).
Part 1 — the capacity side and Part 2 — correct uptime.

The Most Powerful Model on the Market Got Pulled by the Government in 3 Days. Is It Real, or a Hype Bubble?

Sergei Parfenov — Sat, 13 Jun 2026 14:46:35 +0000

The timing is almost too clean to be real.

On June 9, Anthropic shipped Claude Fable 5 — a "Mythos-class" model they described as more capable than anything they'd previously made generally available. Three days later, on June 12, the US Commerce Department sent a letter to CEO Dario Amodei placing Fable 5 (and its restricted sibling Mythos 5) under export controls: no access for any location outside the US, and no access for foreign persons inside it.

Anthropic couldn't filter non-US users from everyone else in real time. So they did the only thing they could: they killed the model for everyone, worldwide. Including US citizens.

If you opened a session this weekend and got "there's an issue with the selected model (claude-fable-5)... you may not have access to it" — that's not your setup. The model your session pointed at was pulled. Your projects, history, and limits are untouched; only which model answers you changed. Switch to Opus 4.8 or Sonnet and you're back.

Now the question worth actually thinking about: is this real, or is everyone inflating a bubble around a model nobody can even use right now?

The honest answer is both, and the interesting part is separating the two.

What's genuinely new here

Strip the drama and there's a real precedent underneath.

AI export controls, until this week, were about hardware: chips, lithography machines, the physical supply chain. The chokepoint was always silicon. What just happened is different in kind — the government reached past the hardware and pulled a deployed, commercial software model that hundreds of millions of people were already using.

That's the part to file away. It means a frontier model is now being treated less like a product and more like a dual-use technology with an off-switch held by someone other than the vendor. If you build on these APIs, model availability is no longer just an SLA question or a "will the vendor deprecate it" question. It's a geopolitical dependency. That's a real shift in how you should think about resilience — treat your model provider like any critical supply-chain vendor, with a fallback path that doesn't assume the top model stays reachable.

So: precedent — real. Worth tracking. Not hype.

Where the bubble is

Here's where I think a lot of the coverage is doing unpaid marketing.

The official justification is a jailbreak — reportedly surfaced by another company and escalated to the government as a national-security concern. Anthropic's own response, which is the most useful document in this whole episode, says the quiet part plainly: the technique they were shown exposed a small number of previously known, minor vulnerabilities — the kind that other publicly available models find without any jailbreak at all (they name-check a competing GPT-class model). In other words, the "national security threat" rests on a narrow, non-universal exploit, not on some unique cliff-edge capability that only Fable 5 possesses.

Now layer on the incentive structure. There is no better marketing in this industry than "a model so powerful the government had to ban it." That sentence sells capability, sells the safety narrative ("we build things genuinely dangerous enough to be regulated"), and sells it for free, in every headline, with the government as an involuntary co-signer. The halo effect is enormous, and it maps perfectly onto a story the market already wants to believe and already prices into valuations.

I'm not saying anyone engineered this. I'm saying notice how neatly a suspension you didn't choose reinforces the exact narrative that benefits you most.

So what's actually true?

Let me be concrete, because vagueness is how bubbles survive.

The capabilities are real. Fable 5 is priced at $10 / $50 per million input/output tokens — roughly double Opus 4.8 — and counts as 2x usage on subscription plans. You don't price a model like that, or burn that much compute on it, for a phantom. There's a genuinely strong model here.

The regulatory precedent is real. First time a deployed commercial model has been pulled by export control. That changes the risk model for everyone shipping on top of these APIs.

The "existential / too-dangerous-to-exist" framing is mostly bubble. It's assembled from one government's reaction to one narrow jailbreak, plus a halo that happens to be extremely convenient for the vendor. Anthropic itself is arguing the directive is a misunderstanding and that the exploit is neither unique nor severe — which is a strange thing to argue if you actually believed your model was a civilizational hazard.

My read: hold both thoughts at once. The governance story is the real headline. The "scariest model ever" story is the one selling tickets.

What to do if you build on this

Practical, not philosophical:

Don't hard-code your default to a frontier model whose availability you don't control. Set a fallback chain (Fable → Opus 4.8 → Sonnet) and make sure your app degrades, not breaks, when the top model vanishes.
Reserve the expensive model for the tasks that earn it — long agentic runs, hard refactors, genuinely multi-step reasoning. At 2x cost and 2x usage, defaulting everything to the top tier is just lighting money on fire even when it is available.
Treat model availability as a supply-chain risk in your architecture docs. This won't be the last time a model you depend on disappears for reasons that have nothing to do with you.
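A fallback chain can be very small. The sketch below is one way to do it, not a provider SDK: `ModelUnavailable`, `call_with_fallback`, and the two stand-in client functions are hypothetical names, and a real client would map its own "model not accessible" error onto that exception.

```python
from typing import Callable, Sequence


class ModelUnavailable(Exception):
    """Hypothetical error: the provider cannot or will not serve this model."""


def call_with_fallback(
    prompt: str,
    chain: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each (name, client) pair in order; return (model_name, answer)."""
    errors: list[str] = []
    for name, client in chain:
        try:
            return name, client(prompt)
        except ModelUnavailable as exc:
            # Only availability errors move us down the chain; any other
            # exception is a real bug and should surface immediately.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("no model in the chain was reachable: " + "; ".join(errors))


# Stand-in clients for illustration only.
def fable(prompt: str) -> str:
    raise ModelUnavailable("model withdrawn")


def opus(prompt: str) -> str:
    return f"opus answer to: {prompt}"


model, answer = call_with_fallback(
    "summarize this diff", [("fable", fable), ("opus", opus)]
)
```

Returning the model name alongside the answer matters: the caller can log which tier actually served the request instead of assuming it was the default.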

The model is gone for now. No firm return date — Anthropic says it's working to restore access and frames the whole thing as a misunderstanding. Until then, Opus 4.8 still does the job for the overwhelming majority of what any of us actually ship.

The model left. The narrative is still here, doing its job.

You Fixed the Rate Limits. Now Your Agent Fails Quietly.

Sergei Parfenov — Thu, 11 Jun 2026 16:58:21 +0000

Last week I wrote that your agent isn’t failing because it hallucinates — it’s failing because of rate limits. The capacity-engineering toolkit in that post — concurrency caps, backoff with jitter, fallback models, caching — is real and it works. Deploy it and your agent stops dying.

Then a commenter (ANP2) pointed out the thing the post undersold, and it’s been stuck in my head since: every one of those fixes quietly opens a correctness hole while it closes the availability one. This post is me paying that comment thread its due, because the second half of the story turns out to matter more than the first.

TL;DR — A 429 is a loud failure: you see it, you alert on it, you fix it. Retries, fallbacks, and caches keep the agent alive — but they let it act on output it didn’t freshly earn: a stale cache hit, a different model’s answer, a re-run side effect. You’ve traded loud failures for quiet ones. The fix is to treat availability (“can I serve this?”) and correctness (“can I still trust the result?”) as two separate gates — and to propagate trust across the agent’s chain, not just per call.

The trade you didn’t know you made

Here’s the uncomfortable symmetry. The whole point of my last post was that the dominant production failure mode isn’t the model being wrong — it’s the plumbing saying no. The capacity toolkit fixes the plumbing. But look at what each fix actually does:

A retry re-runs a call. If that call had a side effect — created a ticket, sent a message, committed a change — the retry runs the side effect again. The agent didn’t fail; it succeeded twice, which is its own kind of wrong.
A fallback model answers when the primary is rate-limited. But it’s a different model: different training, different calibration, different failure modes. The task continues on an answer the primary never produced.
A cache hit serves a response generated for an earlier input. If the world moved — the codebase changed, the data updated — the cached answer can be subtly stale for this request while looking perfectly fresh.

Each mechanism keeps the agent up. None of them guarantees the agent is right. And the cruel part is the failure economics: the 429 you eliminated was honest — visible, countable, alertable. The failures you bought instead are silent. The agent stays up and is confidently wrong, which is exactly the failure mode the hallucination-hunters were worried about in the first place — just arriving through the plumbing instead of the model.

The reliability you bought is uptime, not correct uptime. (That phrase is ANP2’s, and it’s better than anything in my original post.)

Two gates, not one

The conversation in that thread converged on a framing I now use everywhere: an agent’s runtime layer has to answer two different questions, and conflating them is where the quiet failures breed.

Gate 1 — “Can I serve this?” This is the availability gate. Trip the fallback on 429s, serve the cache on a hit, retry on transient errors. Another commenter (Echo) nailed the key property of this gate: when you trip a fallback only on rate-limit errors — never on bad outputs — the failure mode you’ve introduced is latency, not quality. The fallback just buys time. That’s a fine trade, and it’s why the capacity toolkit is still the right first move.

Gate 2 — “Can I act on this irreversibly?” This is the correctness gate, and it’s where the degraded outputs from Gate 1 must get re-examined. The moment an output is about to feed something you can’t take back — a merge, a payment, a message to a user, a deleted record — its provenance matters. Did it come from the primary, fresh? Or from a fallback, a cache, a retry?

One rule worth stealing here: gate on risk, not on confidence. There’s a war story making the rounds of an agent that was 95% confident about a production database migration — the missing 5% was a foreign-key constraint absent from its test data, and the only thing that prevented corrupted referential integrity across three tables was a hard rule that destructive operations always require human approval, regardless of confidence. Confidence is the model grading itself; irreversibility is a property of the action. Gate on the second.

The two gates fail differently, and that’s the point: Gate 1 failures cost you time; Gate 2 failures cost you trust. A system with only Gate 1 is fast and quietly dangerous. A system with only Gate 2 is safe and constantly down. You need both, and they need to stay separate.

Per-call correctness: the three tags

The minimum viable version of Gate 2 is making degraded outputs identifiable. Three mechanisms, one per capacity fix:

1. Idempotency keys on anything with side effects. Before an agent action that touches the world, generate a key from the task + step + inputs. The receiving system deduplicates on it. Now a retry is safe by construction — the second execution is a no-op instead of a double-fire. This is decades-old distributed-systems practice; agent frameworks have mostly just… not adopted it yet.

import hashlib
import json

def idempotency_key(task_id: str, step: int, payload: dict) -> str:
    # sort_keys keeps the serialization stable, so equal inputs always hash to the same key
    raw = json.dumps({"t": task_id, "s": step, "p": payload}, sort_keys=True)
    # keep 32 hex chars (128 bits) of the SHA-256 digest
    return hashlib.sha256(raw.encode()).hexdigest()[:32]

# pass it with the side-effecting call; the receiver dedupes on it
create_ticket(..., idempotency_key=idempotency_key(task.id, step.n, args))

The grown-up version of this is the saga pattern from distributed systems: each step records its completion and defines a compensation action, so a task that dies at step 4 of 7 can roll back cleanly instead of orphaning state. Idempotency prevents duplicate effects; sagas handle partial completion. Once your agents fail mid-workflow — and they will — you eventually want both.
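To make the saga idea concrete, here is a minimal in-memory sketch. It assumes each step is a `(name, action, compensation)` triple and that compensations themselves do not fail; `run_saga` and the demo step names are hypothetical, and a production version would persist the completion log rather than keep it in a list.

```python
from typing import Callable

# (name, action, compensation)
Step = tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: list[Step]) -> list[str]:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    log: list[str] = []
    done: list[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            # Undo what already happened, newest first.
            for prev_name, _, compensate in reversed(done):
                compensate()
                log.append(f"undone:{prev_name}")
            return log
        done.append(step)
        log.append(f"done:{name}")
    return log


state: set[str] = set()


def failing_merge() -> None:
    raise RuntimeError("provider timeout")


log = run_saga([
    ("branch", lambda: state.add("branch"), lambda: state.discard("branch")),
    ("ticket", lambda: state.add("ticket"), lambda: state.discard("ticket")),
    ("merge", failing_merge, lambda: None),
])
```

After the failed merge, `state` is empty again: the ticket and the branch were rolled back in reverse order instead of being orphaned.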

2. Trust tags on fallback outputs. When the fallback answers instead of the primary, don’t just return the text: return it together with a tag, for example the pair (text, "degraded"). It’s cheap to add, and it’s the hook everything downstream needs. A degraded answer is fine for the agent to keep thinking with; it is not fine to act on irreversibly without a re-check.

3. Validity conditions on cache entries. A cache entry shouldn’t just store the response — it should store what the response assumed: which file version, which data snapshot, which config. On a hit, check the assumptions, not just the key. If the codebase moved since the entry was written, that’s a miss wearing a hit’s clothes. And the assumptions can move without you touching anything: providers silently update models, document stores drift, input distributions shift — degradation with no error to catch. Your “primary, fresh” answer from last Tuesday may already be a fallback in disguise.
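The cache idea in point 3 can be sketched in a few lines. This is an illustration under assumptions, not a library API: `AssumptionCache`, the `file_sha` assumption name, and the choice to evict an entry on a validity miss are all mine.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheEntry:
    response: str
    assumptions: dict[str, str]  # what the response took for granted


class AssumptionCache:
    def __init__(self) -> None:
        self._entries: dict[str, CacheEntry] = {}
        self.validity_misses = 0  # key matched, assumptions did not

    def put(self, key: str, response: str, assumptions: dict[str, str]) -> None:
        self._entries[key] = CacheEntry(response, dict(assumptions))

    def get(self, key: str, current: dict[str, str]) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None  # ordinary miss
        # Every stored assumption must still hold in the current world.
        if any(current.get(k) != v for k, v in entry.assumptions.items()):
            self.validity_misses += 1
            del self._entries[key]  # stale: evict so it cannot be served later
            return None
        return entry.response


cache = AssumptionCache()
cache.put("explain:utils.py", "uses a retry loop", {"file_sha": "abc123"})
fresh = cache.get("explain:utils.py", {"file_sha": "abc123"})
stale = cache.get("explain:utils.py", {"file_sha": "def456"})
```

Counting validity misses separately from ordinary misses is what lets you later tell "nothing was cached" apart from "something was cached and the world moved".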

The part single calls don’t prepare you for: trust must propagate

Here’s where agents make this genuinely harder than classic distributed systems, and it’s the piece I’d add on top of the thread that started this post.

Say step 3 of a 6-step task came from a lower-trust fallback. Steps 4, 5, and 6 each run on the primary, fresh, individually flawless. Are they trustworthy?

No — and this is the trap. They reasoned on top of a degraded input. This isn’t a niche concern, either: observability vendors who cluster production agent traces report that chained corruption — one bad step at position N silently poisoning everything after it — is the single most common and most insidious agent failure mode they see. And the math is brutal: at a 95% per-step success rate, an 8-step task completes cleanly ~66% of the time; at 85% per step, it’s ~27%. The chain is where reliability goes to die, quietly. Each step is locally correct and the trajectory is still poisoned. If the trust tag stays local to the call that produced it, the degraded answer launders itself: two “clean” hops later it looks pristine, and your irreversibility gate at step 6 checks the last call’s tag, sees green, and fires.

So the tag can’t be per-call metadata. It has to taint — propagate to everything downstream of it, the way taint-tracking works in security analysis:

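from dataclasses import dataclass, field

@dataclass
class StepResult:
    step_id: str        # unique id of the step that produced this result
    output: str
    trust: str          # "full" | "degraded"
    tainted_by: set[str] = field(default_factory=set)  # which upstream steps were degraded

def propagate(inputs: list[StepResult], my_trust: str) -> tuple[str, set[str]]:
    # inherit every taint the inputs already carry
    taint = set().union(*(r.tainted_by for r in inputs))
    # add any input that was itself produced on a degraded path
    taint |= {r.step_id for r in inputs if r.trust == "degraded"}
    # my own trust can't exceed the weakest input
    trust = "degraded" if taint or my_trust == "degraded" else "full"
    return trust, taint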

Then the irreversibility gate checks the aggregate trust of the whole trajectory, not the last hop: if anything upstream was degraded and unverified, the action pauses for a re-check — re-run the degraded step on the primary, or escalate to a human. In my experience the re-check fires rarely; the point isn’t that fallbacks are usually wrong, it’s that the one time the degraded path feeds a merge or a payment, you want it caught at the gate instead of in the incident review.
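A sketch of that gate, under assumptions: the `IRREVERSIBLE` action names, the `gate` function, and the `verified` set (degraded steps that have since been re-run on the primary or approved by a human) are hypothetical. The record type is redeclared here, with a step id, so the block runs on its own.

```python
from dataclasses import dataclass, field

# Hypothetical action names; use whatever your agent can actually do.
IRREVERSIBLE = {"merge", "payment", "send_message", "delete_record"}


@dataclass
class StepResult:
    step_id: str
    output: str
    trust: str  # "full" | "degraded"
    tainted_by: set[str] = field(default_factory=set)


def gate(action: str, trajectory: list[StepResult], verified: set[str]) -> str:
    """Return "proceed" or "recheck" for an action that is about to run."""
    if action not in IRREVERSIBLE:
        return "proceed"  # reversible work may continue on degraded input
    # Collect every degraded step anywhere in the trajectory, not just the last hop.
    degraded = {s.step_id for s in trajectory if s.trust == "degraded"}
    for s in trajectory:
        degraded |= s.tainted_by
    return "recheck" if degraded - verified else "proceed"


trajectory = [
    StepResult("s3", "plan from fallback", "degraded"),
    StepResult("s4", "patch", "full", {"s3"}),
    StepResult("s5", "tests pass", "full", {"s3"}),
]
decision = gate("merge", trajectory, verified=set())
```

Here the last call looks clean, yet `decision` is `"recheck"` because step `s3` is still unverified. Once `s3` is re-run on the primary and added to `verified`, the same call returns `"proceed"`.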

Making it observable (or it didn’t happen)

Same lesson as the capacity post, one level up. You can’t engineer what you can’t see, and correctness debt is even quieter than 429s. The minimum dashboard:

% of completed tasks with any degraded step — your real exposure, invisible in error rates because nothing errored.
% of irreversible actions that fired with taint — should be ~zero; every one is a gate you skipped.
Cache validity-miss rate — hits that failed the assumption check. If this is zero, you’re probably not checking assumptions.
Fallback divergence — periodically replay fallback-answered requests on the primary and diff. This is your measured answer to “how different is the fallback, actually?” instead of a vibe.

None of these show up in uptime. All of them are the difference between uptime and correct uptime.
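The first two numbers on that dashboard fall out of per-task counters. A minimal sketch, assuming you already record the fields below for each task; `TaskRecord` and `exposure` are hypothetical names, not part of any tracing product.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    completed: bool
    degraded_steps: int        # steps served by a fallback, a stale cache, or a retry
    irreversible_actions: int  # merges, payments, sends, deletes
    tainted_actions: int       # irreversible actions that fired with taint


def exposure(tasks: list[TaskRecord]) -> dict[str, float]:
    """Share of completed tasks with a degraded step, and share of tainted actions."""
    done = [t for t in tasks if t.completed]
    actions = sum(t.irreversible_actions for t in done)
    degraded_tasks = sum(1 for t in done if t.degraded_steps > 0)
    tainted = sum(t.tainted_actions for t in done)
    return {
        "degraded_task_rate": degraded_tasks / len(done) if done else 0.0,
        "tainted_action_rate": tainted / actions if actions else 0.0,
    }


metrics = exposure([
    TaskRecord(True, 0, 1, 0),
    TaskRecord(True, 2, 1, 1),
    TaskRecord(True, 1, 0, 0),
    TaskRecord(False, 3, 0, 0),  # not completed, so excluded
])
```

In this sample two of the three completed tasks touched a degraded path, and one of the two irreversible actions fired with taint. Neither fact would appear in an error-rate chart.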

The takeaway

The capacity toolkit from the last post is still step one — an agent that’s down helps nobody. But availability engineering has a hidden invoice: every mechanism that keeps the agent alive does it by substituting something for the fresh, primary, verified answer. That substitution is usually fine — which is exactly what makes it dangerous, because “usually fine” plus “irreversible” plus “silent” is how you get the 3am incident that no alert predicted.

Two gates. Tag what’s degraded. Taint what it touches. Check the trajectory, not the last call, before anything you can’t undo.

Uptime is table stakes. Correct uptime is the product.

Sources & further reading

Detecting AI Agent Failure Modes in Production, Latitude (2026) — chained corruption as the most common and most insidious production failure mode.
AI Agent Error Handling: 5 Patterns to Catch Silent Failures, Kevin Tan (2026) — the saga pattern, the 95%-confident migration story, and risk-based escalation.
AI Agent Failure Modes: What Goes Wrong in Production, Trantor (2026) — silent quality degradation from provider model updates and store drift.
International AI Safety Report 2026 — why agent failures are categorically riskier: actions in the world, no human in the loop.
My previous post on the capacity side — the availability toolkit this post is the second half of.

Credit where due: this post exists because ANP2 and Echo took the last one apart constructively in the comments — the “uptime, not correct uptime” framing and the latency-not-quality fallback distinction are theirs. Best argument I’ve had on this site. If you’re running agents in prod: do you track degraded-path exposure at all, or does your observability stop at error rates? Genuinely curious how rare Gate 2 is in the wild.

The Comments Got Good. That's How I Knew

Sergei Parfenov — Thu, 04 Jun 2026 14:09:06 +0000

I wrote a post about model distillation. The comments were thoughtful, specific, technically sharp — and that's exactly what made me check whether any of them were written by people.

🧪 Everything here — the scraper, the detector, the simulation, the figures — is reproducible: github.com/P0rt/the_cozy_web

A few weeks ago I published a post on how model distillation actually works. It did fine — 35 reactions, 14 comments. And the comments were great. Not "great post, thanks for sharing" great. Substantively great. People pushed back on my "the student is bounded by the teacher" claim with a real counter-example. Someone reframed distillation as "a forcing function for what you actually need." Someone dropped a paper recommendation. Someone shared a 20× cost number from production.

I should have felt good. Instead I felt the thing you feel when a stranger knows your name. Something was off, and it took me a day to articulate what: the comments were too well-adapted. Every one of them did the same three things in the same order, like they'd all read the same playbook. And a suspicious number of the accounts were two weeks old, or named after a product, or both.

So I did what I do. I pulled the data. This is what I found, why I now think a real chunk of "engagement" on dev blogs is machine-generated or machine-shaped, and — because I don't trust my own pattern-matching — what the actual peer-reviewed research says about whether you can even tell anymore.

"Great post!" is dead. Meet the eco-comment.

The old bot comment was easy. "Nice article, very informative, looking forward to more!" You could smell it. Anyone could.

That's not what's under my posts anymore. The new thing is substantive and ecological — it adds real value, it's polite, it never picks a real fight, and it leaves the thread feeling cozier than before. Here's the actual skeleton, which I only saw once I'd read fourteen of them back to back:

Validate a specific phrase from the post. Not generic praise — they quote your framing back at you. "The 'separate the engineering from the geopolitics' framing is the public service here."
Add one piece of genuine nuance. "One thing I'd add…" "The part worth amplifying for builders…" Often a real, correct technical point.
Drop a first-person-plural anecdote with a number, naming a product. "We use [model X] as our daily driver and the cost difference is roughly 20×." "When working with [our GPU product], we've seen…"
Never, ever, actually disagree. Even the "corrections" are framed so gently that I — the author — instantly conceded.

Read one, it's a great comment. Read eight, it's a template. And step 3 is the tell: the technical substance isn't the point. It's the wrapper around a product mention, engineered to be useful enough to clear a spam filter and an AI detector both.

My own thread, by the numbers

I scraped my article's comments straight from the dev.to public API and ran them through two things: a detector I'd built earlier for the old "Great post!" style, and a set of new structural signals. (analyze_devto.py)

My old detector shrugged. On the eight non-me comments it gave a mean "coziness" score of 0.25 — i.e. it confidently waved them through as human. Of course it did: it was built to catch clichés, em-dashes, and uniform positivity, and these comments are armored with exactly the thing that defeats it — real specifics.

The new signals told a different story:

product/company plug:              4 / 8 comments
opens by validating a phrase:      5 / 8 comments
comments that genuinely push back: 2 / 8   (and I conceded both, instantly)
auto-generated-looking username:   1   (a random-hex handle, 0 posts, "Thank you for this!")

Then I looked at who was commenting. Public profiles, public join dates. I'm going to describe the patterns rather than pillory individuals — but the shapes were loud:

An account literally named after a product ("Sealed GPUs. Private AI."), whose comment plugs that product. That one isn't a person; it's a brand broadcasting.
A two-week-old persona account — created days before my post — that plugs two named tools and somehow published five articles in its first fortnight.
A throwaway with a random-hex username, zero posts, and a one-line "Thank you for this!"
A couple that look more human — real names, older accounts — but still run the exact template and still ship a startup plug.

To be fair and clear: I can't prove any single one of these is a bot. Some are probably real people running their comments through an assistant. But that distinction matters less than it sounds, and I'll come back to why.

Is it just me? I swept 38 other posts.

A pattern on one thread is an anecdote. So I pulled comments across 38 popular dev.to articles in ai, machinelearning, webdev, and programming — 1,366 comments from 346 accounts (sweep_devto.py) — and looked for the same fingerprint.

Two findings made the hair on my neck stand up.

A handful of accounts spray the same template across dozens of unrelated posts. The most prolific commenters in my sample showed up on 14–22 distinct articles each — several of them the same accounts that had appeared on my own thread, several of them flagged for product plugs. A human who loved your distillation post might also comment on three others. They don't leave structurally-identical "validate → nuance → we-at-Product → number" comments on fourteen different articles in a couple of weeks.

Different "people" reuse the same connective tissue. I counted 4-grams that appear across distinct accounts. Humans almost never echo each other's exact phrasing. These did:

x13 distinct accounts:  "exactly the kind of"
 x8 distinct accounts:  "is exactly the kind"
 x7 distinct accounts:  "this is exactly the"
 x6 distinct accounts:  "is the part that"

"This is exactly the kind of thing that…" is a generative construction — it's how an LLM hedges into a confident-sounding addition. Thirteen different strangers don't independently converge on it. One model behind thirteen masks does.

Across the whole sweep, 11 accounts left long product plugs, 32 opened with phrase-validation, and 4 ran the full skeleton. It's not my imagination, and it's not just my post. It's the ambient texture of the platform now.
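The shared-phrase count is simple to reproduce. What follows is a sketch of the idea rather than the code in sweep_devto.py: the tokenizer, the `shared_ngrams` name, and the sample comments are my assumptions. The key detail is counting distinct accounts per 4-gram, so one account repeating itself does not inflate the number.

```python
import re
from collections import defaultdict


def shared_ngrams(
    comments: list[tuple[str, str]], n: int = 4, min_accounts: int = 2
) -> list[tuple[str, int]]:
    """comments are (account, text) pairs; returns (ngram, distinct accounts)."""
    accounts_by_ngram: dict[str, set[str]] = defaultdict(set)
    for account, text in comments:
        words = re.findall(r"[a-z0-9']+", text.lower())
        for i in range(len(words) - n + 1):
            accounts_by_ngram[" ".join(words[i:i + n])].add(account)
    shared = [
        (gram, len(accounts))
        for gram, accounts in accounts_by_ngram.items()
        if len(accounts) >= min_accounts
    ]
    # Most widely shared first, ties broken alphabetically.
    return sorted(shared, key=lambda item: (-item[1], item[0]))


sample = [
    ("a", "This is exactly the kind of framing I needed."),
    ("b", "Honestly, this is exactly the kind of post we want."),
    ("a", "This is exactly the kind of thing again."),
    ("c", "I disagree with the benchmark setup."),
]
result = shared_ngrams(sample)
```

On this sample, three 4-grams are shared by accounts `a` and `b`; the second comment from `a` adds nothing, and the dissenting comment from `c` shares no phrasing with anyone.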

I'd been calling this the wrong thing

I went in thinking "bots." What I'd actually walked into is two older ideas fusing.

Dead Internet Theory — the half-joke that the web "died" and is now mostly bots and generated text talking to itself — has stopped being a joke. Hal Berghel makes the serious version of the case in IEEE Computer ("Generative AI Is Breathing New Life Into the Dead Internet Theory", 2026): strip the conspiracy, and the lean core — synthetic content drowning out and being mistaken for humans — just converges with what's measurable. Imperva clocked automated traffic at 51% of the web in 2024, the first time bots crossed half. Even Sam Altman said it out loud: the wave of AI activity makes dead-internet theory feel real.

The other half is the Cozy Web. Venkatesh Rao coined the term; Maggie Appleton diagrammed it alongside Yancey Strickler's "dark forest": humans fleeing the bot-infested public square into private rooms — group chats, Discords, DMs. Appleton's follow-up, "The Expanding Dark Forest and Generative AI", nails the mechanism: generative AI accelerates the retreat.

Here's the part I missed until I saw my own comment section. These aren't two theories. They're one loop. The public web fills with frictionless synthetic text → real people retreat to private rooms → the public spaces that remain (the comment section under my post) get thinner on actual humans → which makes them even easier to fill with synthetic text. My "cozy" thread wasn't a healthy community. It was the calm surface of that loop running.

And the comment section was already half-empty before the bots arrived. Publications spent the 2010s killing comments: Popular Science shut its section down in 2013, and a peer-reviewed survey of why newsrooms did the same found that the conversation had already migrated to social platforms. The robots didn't kill the comment section. They moved into a house that was already mostly vacant.

Why this actually works (and why I couldn't just tell)

This is the part that unsettled me most, because I pride myself on spotting this stuff, and the research says I shouldn't trust that for a second.

Humans can't distinguish LLM social text from human text. Spitale, Biller-Andorno & Germani showed in Science Advances (2023) that people can't tell GPT tweets from human ones — and rate the AI's information as more credible. Jones & Bergen found GPT-4 passes a controlled Turing test (taken for human 54% of the time, FAccT 2025).

The persuasion is superhuman when it's personalized. Salvi, Ribeiro, Gallotti & West, in Nature Human Behaviour (2025): given a little data about who it is talking to, GPT-4 had roughly 81% higher odds than a human opponent of moving its debate partner toward agreement. The Zurich r/changemyview field experiment reportedly found AI replies 3–6× more persuasive than humans, though I'll flag honestly that the study was withdrawn and never peer-reviewed; the only on-record account is the university's ethics response. Cite it as a withdrawn preprint, not a result.

Fake-but-substantive content is, by now, undetectable to people. This is the literature closest to my eco-comments. The canonical Ott et al. (ACL 2011) already showed humans judge fake reviews at chance. The LLM-era update — Meng et al., "Fake Product Reviews are Indistinguishable to Humans and Machines" (2025) — found people at 50.8% (a coin flip) and detectors no better. A promotional plug wearing a sincere technical comment is exactly that, in a new venue.

And the detectors fail precisely because of the specifics. My detector waved these comments through, and that's not a bug in my code — it's the field. Krishna et al. (NeurIPS 2023) showed light paraphrasing collapses DetectGPT from 70.3% to 4.6% and defeats GPTZero, OpenAI's classifier, and watermarks. Liang et al. (Patterns 2023) showed detectors are biased against non-native English writers and bypassable by prompting. The "real technical detail" that made these comments feel human is the same mechanism that blinds the detector. Specificity isn't proof of a human. It's camouflage.

So the honest position isn't "I caught the bots." It's: the tools that would let me be sure don't work, and the research says they can't.

I modeled what it does to a thread

If I can't reliably catch individual comments, I can at least ask: what does rising automation do to a conversation, statistically? So I built a toy. (dead_internet_sim.py)

I didn't simulate language — I simulated its statistics, because my thesis is statistical. Each comment is a bag of tokens from two pools: a big, fat-tailed human vocabulary (where the typos, the tangents, the specific war stories live) and a tiny cozy vocabulary of phatic praise. Each comment has an assist level α from 0 (I typed this, annoyed) to 1 (an agent posts for me, I never read the thread). As α rises, more tokens come from the cozy pool and the comment's stance gets pulled from "disagree" toward "agree."

Then I swept a whole community's average autonomy from 0 → 1 and watched the thread's "liveness" — lexical diversity, disagreement, surprise, and a composite index that dies if any of those hits zero.
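For readers who want the mechanics without opening the repo, here is a stripped-down sketch of that setup. It is not dead_internet_sim.py: the pool sizes, the Pareto draw for the fat-tailed human vocabulary, and the 40% base disagreement rate are my assumptions, and it tracks only two of the liveness measures.

```python
import random


def make_comment(
    alpha: float,
    rng: random.Random,
    length: int = 30,
    human_vocab: int = 5000,
    cozy_vocab: int = 40,
) -> tuple[list[str], bool]:
    """Return (tokens, disagrees) for one comment at assist level alpha."""
    tokens: list[str] = []
    for _ in range(length):
        if rng.random() < alpha:
            tokens.append(f"cozy{rng.randrange(cozy_vocab)}")
        else:
            # A Pareto draw gives a fat-tailed rank distribution over the human pool.
            rank = min(int(rng.paretovariate(1.0)), human_vocab)
            tokens.append(f"human{rank}")
    # Assumed: 40% of unassisted comments disagree, fading linearly with alpha.
    disagrees = rng.random() < 0.4 * (1.0 - alpha)
    return tokens, disagrees


def thread_stats(alpha: float, n_comments: int = 200, seed: int = 0) -> tuple[float, float]:
    """Return (type-token ratio, share of disagreeing comments) for one thread."""
    rng = random.Random(seed)
    comments = [make_comment(alpha, rng) for _ in range(n_comments)]
    all_tokens = [tok for tokens, _ in comments for tok in tokens]
    diversity = len(set(all_tokens)) / len(all_tokens)
    disagreement = sum(1 for _, d in comments if d) / n_comments
    return diversity, disagreement


human_diversity, human_disagreement = thread_stats(0.0)
agent_diversity, agent_disagreement = thread_stats(1.0)
```

Even this cartoon shows the direction of the effect: at α = 1 the thread can use at most 40 distinct tokens and nobody disagrees, while at α = 0 both numbers are clearly higher. The exact shape in between depends on the assumptions above.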

Two things fall out, and both match what I saw on my own post:

It's not linear — there's a knee around 0.65. You don't need a botnet. You need the average commenter to be two-thirds on the assist dial, and the thread becomes a smooth surface: polite, "engaged," contributing almost no new information.
Disagreement dies first (the steep red line). The very first thing automation sands off is friction — the "actually, you benchmarked this wrong" energy. Which is exactly why my comment section felt so nice. It didn't get kinder. It got conflict-free, and I'd been reading conflict-free as kind.

A cozy thread literally uses fewer distinct words:

Effective vocabulary collapses from ~175 words to ~60 as autonomy maxes out. (Honest wrinkle: at low autonomy it ticks up slightly — a little assistance adds a register before saturation homogenizes everything. The damage isn't assistance existing. It's assistance dominating.)

And here's the detector failure as a picture — it cleanly separates the old caricature comments, which is useless, because the comments on my post don't look like the left pile anymore:

The line I actually care about isn't "bot vs. human"

I kept wanting a verdict on each account. The research talked me out of it. The useful axis isn't bot-or-not — it's the autonomy spectrum:

I typed it → spell-check → "polish this" → "write a comment for me" → an agent posts, I never read the thread
   α=0          α≈0.2          α≈0.5             α≈0.8                        α→1.0

The product account is α≈1.0 — a brand broadcasting. The two-week-old persona spraying fourteen threads is close behind. But a real growth-hacker at α≈0.8 might be genuinely interested, letting a model do the writing and slip in the plug. From the thread's point of view, it barely matters: either way, the high-entropy human part — the real disagreement, the idiosyncratic detail, the thing that made it a conversation — got outsourced and smoothed away. That's the loss. Not "a bot was here," but "no one staked anything specific."

There's even a cheerful counter-current I want to be fair about: AI content on the web is large but not yet total (~17–19% of Google's top results in 2025, by an imperfect detector), some sites are bringing comment sections back on the back of AI moderation, and dev.to's supportive culture is a real, deliberate choice, not just an artifact of bots. Even "what % is bots" has no agreed answer — it depends entirely on your detector. The sky isn't falling. It's just getting quieter in a very specific way.

What I'm going to do about my own blog

Not "ban AI" — that's unenforceable (the detectors are biased and gameable) and wrong (a quick polish genuinely helps a non-native writer or a tired one). The lever isn't the level of assistance. It's whether assistance crowds out the high-entropy channels.

I'll reward specificity over positivity. A comment that cites line 14, a version number, a counter-benchmark is worth ten that validate my framing. If a platform ranks by "nice," it is literally selecting for the cozy mean.
I'll treat disagreement as a feature, not a moderation failure. My simulation's clearest result is that friction dies first. A comment culture optimized purely for niceness is optimizing for deadness with extra steps.
I'll stop asking "was a model involved." It's the wrong question, because the answer is "yes, partly, almost always now." The real question is: did a human read the thing and stake some specificity on a real reply?

Limitations (read this before you @ me — if you're real)

I can't prove a single account is a bot. Everything above is signals — template reuse, account age, product plugs, cross-post spray — not a confession. The honest claim is about aggregate texture, not any individual.
The simulation is a toy. Two token pools and a stance variable are a cartoon of language. The shape of the collapse is a property of my assumptions as much as reality. It's an argument made precise, not evidence.
My detector is a strawman by design — I show it failing on purpose. Don't deploy it; don't deploy anything like it as a gate on real people (see Liang et al. on who gets falsely flagged).
The Zurich study is withdrawn, and "% of the web is bots/AI" numbers are detector-dependent and shaky. I've tried to lean only on the load-bearing peer-reviewed work and flag the rest.
Causation is underdetermined. My cozy comments might also reflect good moderation, kind norms, or survivorship (the cranks left for Reddit). AI-mediation is a driver, not provably the driver.

The one-line version

My blog didn't get a nicer community. It got an assistant, learned some manners, and stopped saying anything surprising. The internet didn't die — it just outsourced the parts that used to make it a conversation, and called the result "cozy."

If this post gets a comment that opens by quoting my own framing back at me, adds one tasteful piece of nuance, and mentions a product its account is named after… well. You know what I'm going to check.

Run it yourself

git clone https://github.com/P0rt/the_cozy_web
cd the_cozy_web
pip install -r requirements.txt

python3 dead_internet_sim.py     # liveness collapse + figures
python3 coziness_detector.py     # the heuristic scorer + histogram
python3 analyze_devto.py         # tear apart a real dev.to thread (defaults to my distillation post)
python3 sweep_devto.py           # the cross-post template sweep (38 dev.to articles)

Every factual claim links to its source. If you only read two, read Meng et al. on why fake reviews are now indistinguishable and Krishna et al. on why the specifics defeat the detector.