The Metric Most Agent Products Are Missing
Most AI evaluation work you see on agent products measures the same thing: was this response good? You get a score per output, you track it over time, you look for regressions. That's the pattern whether you're using LLM-as-judge, user thumbs-up/down, or hand-graded samples.
This is a reasonable thing to measure. It's also an incomplete thing to measure, in a way that matters more for multi-turn agents than the industry has quite caught up to.
Here's what we're missing: a conversation can be full of individually good responses and still be structurally broken. The agent contradicts itself across turns. It shifts topic in ways the user didn't cue. It answers the current message fine but has stopped tracking what was agreed three messages ago. Each response passes a quality bar. The conversation fails anyway.
A paper uploaded to arXiv last week tries to formalize this gap — and more interestingly, proposes a way to measure it in production without embeddings, judges, or access to model internals. I want to walk through what it shows, because I think the measurement framing is more important than the specific method.
The paper is Token Statistics Reveal Conversational Drift in Multi-turn LLM Interaction (Hafez, Nazeri, v2 April 17).
The Number That Made Me Read It Twice
Across 4,574 conversational turns spanning 34 conditions, three frontier teacher models and one student model, the authors report:
- Their proposed signal aligns with structural consistency in 85% of conditions.
- It aligns with semantic quality in only 44% of conditions.
Put those two numbers next to each other and they tell a story.
Response quality and conversational consistency are not the same thing; they can be measured separately, and they diverge. And the tools most teams use, LLM-as-judge on outputs and user feedback on individual responses, are measuring the 44% side of the gap, not the 85% side.
If your agent is deployed in anything that looks like an ongoing interaction — support, coaching, tutoring, sales, therapy-adjacent use cases, long-form research, gaming — the side you're not measuring is the side where the trust breaks.
What the Paper Actually Proposes
The authors define a metric they call Bipredictability (P). Their description of it, taken from the abstract: it "measures shared predictability across the context, response, next prompt loop relative to the turn total uncertainty."
In plain terms: across a conversation turn, there's information that the context predicts about the response, information that the response predicts about the next prompt, and information that the next prompt predicts about the context. How much those loops overlap, relative to how uncertain each turn is overall, is what they track.
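To make that verbal description concrete, here is a toy stand-in built purely from token frequencies. Everything in it (the unigram entropies, the three-way token overlap, the normalization) is my guess at what "shared predictability relative to total uncertainty" could mean, not the authors' actual computation:

```python
from collections import Counter
import math

def unigram_entropy(tokens):
    """Shannon entropy (bits) of the unigram distribution over tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def bipredictability_toy(context, response, next_prompt):
    """Toy stand-in for the paper's P (NOT their formula): the share of
    the turn's unigram uncertainty carried by tokens that appear in all
    three legs of the context -> response -> next-prompt loop."""
    shared = set(context) & set(response) & set(next_prompt)
    turn = context + response + next_prompt
    counts = Counter(turn)
    total = sum(counts.values())
    h_turn = unigram_entropy(turn)
    if h_turn == 0:
        return 0.0
    h_shared = -sum((counts[t] / total) * math.log2(counts[t] / total)
                    for t in shared)
    return h_shared / h_turn
```

On this toy version, a next prompt that stays on the context's topic scores above zero, while a non-sequitur next prompt with no shared tokens scores exactly zero, which is at least the qualitative behavior the abstract attributes to P.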
The implementation is a lightweight auxiliary component they call the Information Digital Twin (IDT) — running alongside the agent and computing P from the token stream. No embeddings. No auxiliary evaluator model. No white-box model access.
That engineering profile matters. It means the signal can, in principle, sit in a production deployment at trivial cost. Most "measure what's happening in your LLM agent" proposals involve an LLM judge or a vector DB query per turn. This one is token frequency statistics.
I haven't built their system, and the abstract doesn't go deep on implementation details, so I'm hedging on whether the engineering will be as clean in practice as the abstract implies. But the design choice is pointing at something real: if you want a monitoring signal that can run continuously in production, it has to be cheap enough to not change your deployment economics. Bipredictability, at least as described, fits that constraint.
What Their IDT Caught
The detection result reported in the abstract: 100% sensitivity for contradictions, topic shifts, and non-sequiturs in their tested set.
Sensitivity claims at that level always deserve scrutiny — it means every tested failure was caught, not that every failure in every real deployment will be. The authors are testing against constructed conversations with known failures. Production distributions will be messier.
Still, even accepting the number at face value for the test conditions, three failure types are worth naming because they map directly to real user complaints:
- Contradictions — agent says A in turn 3, says not-A in turn 12, and the user has been quietly losing faith since turn 7.
- Topic shifts — agent pivots away from the user's thread without a cue. Feels "off" in a way users rarely articulate.
- Non-sequiturs — response that's individually coherent but doesn't actually engage with what just happened.
If you've ever had a user say "I don't know, it just stopped feeling right" — they're usually describing one of these three. None of them are caught by "rate this response 1-5" dashboards.
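You don't need the paper's metric to get a first detector for these into production. Even crude lexical overlap between consecutive turns catches the blunt cases; the threshold and whitespace tokenization below are arbitrary choices of mine, not anything from the paper:

```python
def jaccard(a, b):
    """Token-set overlap between two turns, in [0, 1]."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def flag_topic_shifts(turns, threshold=0.1):
    """Crude baseline detector: flag turn i when its lexical overlap
    with the previous turn falls below threshold.  Catches blunt topic
    shifts and some non-sequiturs; misses paraphrased ones entirely."""
    return [i for i in range(1, len(turns))
            if jaccard(turns[i - 1], turns[i]) < threshold]
```

The point isn't that this baseline is good; it's that the instrumentation surface exists today, and anything is better than outsourcing detection to users.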
Why This Is a Different Measurement Problem Than Response Quality
I think the piece builders most often miss is that response-level evaluation and conversation-level evaluation are structurally different problems.
Response quality is a pointwise judgment. You can sample, score, aggregate. LLM-as-judge does a decent job of this. It's the kind of evaluation that fits neatly into existing observability tooling — each output is a discrete event with a score attached.
Conversation-level consistency is a sequence problem. You can't score it by looking at any single turn. You need to look at relationships between turns. The measurement surface is the conversation trajectory, not individual messages.
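The structural difference shows up even in a toy harness. Below, every response passes a pointwise check, but a scorer that sees the whole prefix catches a contradiction between turns. The "ship date" scorer is an invented example of mine, not anything from the paper:

```python
def pointwise_eval(turns, score):
    """Response-quality style: score each turn in isolation, aggregate."""
    return sum(score(t) for t in turns) / len(turns)

def trajectory_eval(turns, score):
    """Consistency style: the scorer sees each growing prefix, so it can
    penalize relationships between turns, not just turns.
    Assumes at least two turns."""
    return min(score(turns[: i + 1]) for i in range(1, len(turns)))

# Toy conversation: turn 3 contradicts turn 1.
turns = ["we ship Friday", "use plan B", "we ship Monday"]

def looks_fine(turn):
    return 1.0  # every response is fluent and on-topic in isolation

def consistent(prefix):
    # flag the prefix if two turns commit to different ship dates
    dates = {t.split()[-1] for t in prefix if t.startswith("we ship")}
    return 1.0 if len(dates) <= 1 else 0.0
```

Here `pointwise_eval(turns, looks_fine)` returns 1.0 while `trajectory_eval(turns, consistent)` returns 0.0: same conversation, opposite verdicts, because the two evaluators are looking at different objects.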
The tools haven't caught up. Agent observability platforms like Langfuse, LangSmith, and Helicone are doing better-than-ever work on per-call metrics: latency, cost, tool usage, response sampling. Very little in that category instruments conversation-level structural properties, which is the level where multi-turn agents mostly fail.
The paper's contribution, from a tools-thinking perspective, is identifying that there's a cheap signal at this level if you know where to look.
Three Audit Questions For Your Multi-Turn Agents
If you're shipping any kind of multi-turn agent, three questions are worth sitting with:
1. Do you measure anything about the conversation as a whole, or only about individual turns?
Most teams I know answer "only individual turns" after thinking about it. The shape of current dashboards enforces this — each row is a request.
2. If a user tells you "the conversation got weird around message 15," can you go find what happened?
Most production agents don't retain full conversation state in a way that makes this analyzable after the fact. Or they retain it, but nothing about the trajectory is indexed or searchable.
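Fixing this doesn't require the paper's machinery, just turn-indexed retention. A minimal sketch follows; the record schema and the `drift` field are mine, with whatever trajectory signal you compute plugged in:

```python
import time

def log_turn(store, conv_id, turn_idx, role, text, drift=None):
    """Append one analyzable record per turn.  `drift` is whatever
    trajectory-level signal you compute (a hypothetical field here)."""
    store.setdefault(conv_id, []).append({
        "turn": turn_idx, "role": role, "text": text,
        "drift": drift, "ts": time.time(),
    })

def window(store, conv_id, around, radius=3):
    """Pull the turns surrounding a reported trouble spot, so
    'it got weird around message 15' becomes a concrete query."""
    return [r for r in store.get(conv_id, [])
            if abs(r["turn"] - around) <= radius]
```

In production the dict becomes a database keyed by conversation id and turn index, but the property that matters is the same: the trajectory is retrievable and queryable after the fact.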
3. Have you instrumented topic shifts, contradictions, or non-sequiturs in any form?
If you haven't, you're outsourcing detection of these failures to your users. They'll notice, but you won't — and by the time they tell you, attrition has happened.
These aren't theoretical failures. They're the most common complaint pattern in post-cancellation interviews I've seen for multi-turn AI products: "It worked fine at first but then kind of drifted."
What This Paper Leaves Open
A few things the abstract doesn't settle that I'd want to know before building on it:
- Exactly how Bipredictability is computed. The verbal description is suggestive but not precise enough to reimplement from; that will take the full paper.
- Which frontier models were used. The abstract says "three frontier teacher models" without naming them. Worth checking whether the signal transfers across model families.
- Whether code or data is public. The arXiv page doesn't list code or dataset resources. For a proposal that's essentially "add this runtime monitor to your system," reference implementation availability will determine how fast this gets adopted.
- False positive behavior at production scale. 100% sensitivity on a curated test set is a different claim from "works reliably at scale without flooding you with false flags." The abstract doesn't report specificity in a form I can quote.
I'm flagging these not to dismiss the paper but because the distance between "this is the right idea" and "this is deployable" is where most interesting research lives, and it's worth staying honest about that distance.
The Builder Takeaway
The specific metric matters less to me than the framing. The framing is:
Response quality is a property of individual outputs. Conversational reliability is a property of the trajectory. If you only measure the first, you're blind to failure modes that happen at the second level — and those are the failure modes that drive user churn in multi-turn products.
Whatever the eventual best implementation turns out to be — Bipredictability, embeddings-based, something else — the thing worth internalizing is that there's a measurement gap here, and closing it probably requires rethinking what your agent observability stack is watching.
For me, the immediate action from reading this isn't "implement IDT." It's closer to: audit what dashboards my team and I are actually looking at, and note how many of them measure conversation-level properties at all. The answer for most of us is going to be close to zero. That's the gap worth working on before worrying about which specific metric to adopt.
The Throughline
I've been writing this week about AI agent primitives — persona that's actually runtime steering rather than a static string, design that's content-structure-first, and now measurement that's trajectory-level rather than pointwise.
There's a pattern connecting them. The abstractions we shipped first for AI agents were the ones that were easy to build: persona as string, design as template, evaluation as per-output score. In each case, the more accurate primitive is a little harder and a little more runtime-y: persona as steering, design as content-reading, evaluation as trajectory-watching.
That's not a coincidence. Early AI product design has been constrained by what was cheap and easy to instrument at the call site. What I'm watching the research space do, right now, is build the tooling that lets the harder and more accurate primitives become cheap and easy too. When they do, the products that got shipped on the easier abstractions will look more brittle than they currently do.
If you're building, the question worth asking isn't just "what do I ship now?" It's also "which of my current primitives is an early-days hack that I'll want to replace when better measurement lands?" For multi-turn agents, my guess is that evaluation is one of those — and this paper is a pointer toward where the replacement starts.