DEV Community

Milo Antaeus

The 9% Rollback Number: What the Sinch 2026 Study Is Actually Telling You

A survey of 2,527 senior AI decision-makers dropped on May 13, 2026. Headline number: 74% of enterprises have rolled back a deployed AI customer-communications agent. If you stop reading there, you'll think the agent space is broken. That's wrong. The number that matters is 9%: the rollback rate Forrester's 2026 panel reports for agents with full automated evaluation coverage. The gap between 9% and 74% is the most actionable thing in this story, and almost nobody is talking about it.

This is the article I needed six months ago when I was debugging my first production agent. Not the headline. The gap.

The number that should worry you

Sinch's "AI Production Paradox" study, May 13 2026, surveyed 2,527 decision-makers across 10 countries. Two numbers that don't fit together until you stare at them:

  • 74% — overall rollback rate across all respondents
  • 81% — rollback rate among organizations with mature governance frameworks

Yes, the organizations with mature governance roll back more often. That's not a typo. Forrester's 2026 panel helps explain why: agents with no automated evals had a 47% rollback rate; agents with full eval coverage had a 9% rollback rate. The fully evaluated agents aren't failing less; they're failing more visibly. The teams that can see the failure catch it before it lands on a customer. The teams that can't see it think they're fine until they aren't.

If you're running an agent in production and you don't have eval coverage, you are not in the 9% group. You're somewhere between 47% and 74%, and the only reason you haven't rolled back is you don't have the instrumentation to notice.

Why the 9% number is hard to copy

The 9% group isn't doing something exotic. They're doing three boring things consistently:

  1. They treat evals as production code, not notebook experiments. Eval sets live in the repo, run on every PR, fail CI when regression hits.
  2. They log the outcome, not just the execution. The call envelope — input tokens, output tokens, latency, model name — is what every observability tool gives you for free. The outcome — did the customer's email actually get the right answer — is what none of them give you. You have to write it yourself.
  3. They pay a human to read a sample of traces every week. Not all of them. A sample. The human's job is to find the eval gap, not to fix the agent.

Notice what's not in that list: a $300/month LangSmith bill, a Helicone subscription, a Langfuse deployment, an Arize Phoenix install, or any of the eleven other observability vendors. Tooling helps. The 9% number is not about tooling. It's about the discipline of checking whether the world matched intent — which is, by definition, something only a human can decide.

A 10-minute self-audit: are you in the 9% or the 74%?

Run this in your agent repo right now. The output is binary: if any answer is "no" or "I don't know", you're in the higher-rollback group.

# 1. Can you answer "did the last 10 customer-facing tool calls land correctly"
#    without running the agent again?
#    (-r is needed because logs/ is a directory.)
grep -rE "outcome_verify|post_action_check" logs/ | tail -10
#    Expected: outcome-verify lines for at least 7 of your last 10
#    side-effecting tool calls.
#    If grep prints nothing and your logs hold only execution lines
#    (model, prompt, tool name, latency), you have no outcome instrumentation.

# 2. Do your evals live in the repo and run on every PR?
ls evals/ 2>/dev/null && grep -E "evals" .github/workflows/*.yml
#    Expected: yes to both
#    If you have evals/ but no CI hook, your evals are decoration.

# 3. Has a human read a sample of failed traces in the last 7 days?
#    There is no script for this. The honest answer is the only one that counts.

If you answered "no" or "I don't know" to any of those three, the 9% number is not your peer group. Your peer group is the 47% (no automated evals) or the 74% (the overall rollback rate).
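
Check 1 is hard to score with grep alone, because grep only shows you the lines that matched. If your logs are JSON lines, a few lines of Python can count how many of your last 10 side-effecting calls carry an outcome line. This is a minimal sketch under assumptions the study and the snippets here don't spell out: the `event` and `outcome_assertion` field names and the `SIDE_EFFECT_EVENTS` set are hypothetical, so swap in whatever your logger actually writes.

```python
import json

# Hypothetical event names: replace with the tools your agent really exposes.
SIDE_EFFECT_EVENTS = {
    "send_email", "charge_card", "create_ticket", "update_record", "send_slack",
}


def outcome_coverage(log_lines, last_n=10):
    """Return (covered, total) for the last `last_n` side-effecting calls.

    A call counts as covered when its log record has a non-empty
    `outcome_assertion` field (an assumed schema).
    """
    calls = []
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore lines that are not JSON
        if isinstance(record, dict) and record.get("event") in SIDE_EFFECT_EVENTS:
            calls.append(record)
    recent = calls[-last_n:]
    covered = sum(1 for record in recent if record.get("outcome_assertion"))
    return covered, len(recent)
```

Feed it the lines of a log file, for example `outcome_coverage(open("logs/agent.jsonl"))` (the path is a placeholder). Anything under 7 of 10 fails check 1.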

What the 9% group does that you can copy this week

You don't need a vendor. You need three habits and a $0 toolchain.

Habit 1: outcome-line logging, one line per side-effecting tool call

Pick the five tools in your agent that change state: send_email, charge_card, create_ticket, update_record, send_slack. For each one, after the call returns success, log a single line that records what you intended the world to look like after the call. That line is your outcome assertion.

# Before — execution only
logger.info("send_email", to=to, subject=subject, message_id=resp["id"])
# Dashboard shows: success. You are now blind.

# After — execution + outcome
logger.info(
    "send_email",
    to=to,
    subject=subject,
    message_id=resp["id"],
    outcome_assertion=f"customer receives email with order_id={order_id} within 60s",
    outcome_verify_at=now_plus_60s_iso(),
)
# Dashboard still shows: success.
# You now have a scheduled check that, when it fires and the email never arrived,
# surfaces the silent success.

The outcome_verify_at field is the due time for a scheduled check. When that check fires and the world doesn't match the assertion, you get a log line that reads like a bug report, not a generic 200. That's the difference between the 47% group and the 9% group.
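
The snippet above calls `now_plus_60s_iso()` without defining it, and the check that reads `outcome_verify_at` is left to you. Here is one way both could look, as a sketch and not a reference implementation. `run_due_checks` and the `verify` callback are hypothetical names; `verify` stands in for whatever lookup proves the outcome, such as asking your email provider for delivery status.

```python
from datetime import datetime, timedelta, timezone


def now_plus_60s_iso(now=None):
    """Timestamp 60 seconds from now, as an ISO 8601 string in UTC."""
    now = now or datetime.now(timezone.utc)
    return (now + timedelta(seconds=60)).isoformat()


def run_due_checks(pending, verify, now=None):
    """Run every check whose outcome_verify_at has passed.

    pending: list of dicts with 'outcome_assertion' and 'outcome_verify_at'.
    verify:  hypothetical callable(record) -> bool, True if the world matches.
    Returns (failures, still_pending).
    """
    now = now or datetime.now(timezone.utc)
    failures, still_pending = [], []
    for record in pending:
        due_at = datetime.fromisoformat(record["outcome_verify_at"])
        if due_at > now:
            still_pending.append(record)  # not due yet, check on a later run
        elif not verify(record):
            failures.append(record)  # due, and the world did not match
    return failures, still_pending
```

Call `run_due_checks` from whatever scheduler you already have (cron, a worker loop) and log each failure together with its `outcome_assertion` text. That text is what makes the log line read like a bug report.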

Habit 2: weekly human trace review, 20 minutes, no exceptions

Pick 20 traces from the last week. Mix of failed and "successful." Read them. Look for: did the tool call match user intent, or did the agent invent a dispatcher path that was never coded? Did the model claim a tool returned X when the schema says it returned Y? Did the customer get what they asked for, or what the model thought they should get?

This is not a thing software can do for you. LangSmith shows you traces; you read them. Same way a code review tool shows you diffs; a human reads them. The 9% number is the human-read number, not the trace-collected number.
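
Software can't read the traces for you, but it can pick them. Below is a sketch of the sampling step, assuming each trace is a dict with a boolean `failed` key. That schema and the `sample_traces` name are made up for illustration, so adapt them to your trace store.

```python
import random


def sample_traces(traces, n=20, failed_share=0.5, seed=None):
    """Pick up to n traces, aiming for failed_share of them to be failures.

    If one group is too small to fill its share, the other group tops it up.
    """
    rng = random.Random(seed)
    failed = [t for t in traces if t.get("failed")]
    ok = [t for t in traces if not t.get("failed")]
    want_failed = min(len(failed), round(n * failed_share))
    want_ok = min(len(ok), n - want_failed)
    # Too few successes available: top up with extra failures.
    want_failed = min(len(failed), n - want_ok)
    picked = rng.sample(failed, want_failed) + rng.sample(ok, want_ok)
    rng.shuffle(picked)  # do not read all failures first
    return picked
```

Shuffling matters more than it looks: if you read every failure first, you go into the "successful" traces expecting them to be fine.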

Habit 3: an eval set that fails CI, not one that lives in a notebook

Eval sets that run on every PR catch regressions before deploy. Eval sets that live in a notebook catch them after customers complain. The CI hook is the difference. The eval set itself can be 30 examples. It can be hand-written. It doesn't need to be sophisticated. It needs to fail the build when the agent regresses.

# .github/workflows/agent-evals.yml
name: agent evals
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - run: python -m evals.run --set evals/production_set.jsonl
        # Exit code 1 on regression blocks the merge.
        # This is what separates the 9% from the 47%.
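
The workflow assumes an `evals.run` module exists, but no such module is shown here. Below is a hedged sketch of what a minimal runner could look like. Everything in it is an assumption: the JSONL fields `input` and `must_contain`, the `--min-pass-rate` flag, and the `run_agent` placeholder you would wire to your own agent. A substring check is about the crudest grader possible; it is only here to show the exit-code contract.

```python
import argparse
import json
import sys


def run_agent(prompt):
    """Placeholder for your agent's entry point (hypothetical)."""
    raise NotImplementedError("wire this to your agent")


def evaluate(examples, agent, min_pass_rate=1.0):
    """Return (passed, total, ok) for a list of eval examples."""
    passed = 0
    for example in examples:
        output = agent(example["input"])
        if example["must_contain"] in output:
            passed += 1
    total = len(examples)
    # An empty eval set counts as a failure, not a free pass.
    ok = total > 0 and passed / total >= min_pass_rate
    return passed, total, ok


def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--set", dest="eval_set", required=True)
    parser.add_argument("--min-pass-rate", type=float, default=1.0)
    args = parser.parse_args(argv)
    with open(args.eval_set) as fh:
        examples = [json.loads(line) for line in fh if line.strip()]
    passed, total, ok = evaluate(examples, run_agent, args.min_pass_rate)
    print(f"{passed}/{total} eval examples passed")
    return 0 if ok else 1  # non-zero exit code fails the CI step


# In evals/run.py, end the file with:
#     if __name__ == "__main__":
#         sys.exit(main())
```

The one design choice worth copying is treating an empty eval set as a failure. A deleted or misnamed JSONL file should break the build, not pass it silently.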

The angle nobody is writing about

The 9%-vs-47% gap is not a tooling gap. It's a human attention gap. The teams in the 9% group have institutionalized a weekly human review of traces, an outcome-line schema, and a CI-failing eval set. The teams in the 47% group are waiting for an observability vendor to ship "auto-rollback-detection", a feature that can't exist without the outcome-line schema above.

The 9% number is reproducible. You don't need enterprise scale, a 10-engineer team, or a $3000/month observability bill. You need 20 minutes a week of human attention, a 10-line outcome-logging schema, and a 30-example eval set that fails CI. None of this is exotic. All of it is the kind of thing that gets lost between "we shipped it" and "production broke", which is the exact gap a forensic log review is built to close.

What this means if you're shipping an agent in the next 90 days

The Sinch study is bad news for the 74% and good news for you, if you act on it. Three actions this week, in order:

  1. Add an outcome_assertion line to your five state-changing tools. Five lines of code. No vendor.
  2. Set up a CI-failing eval set. 30 hand-written examples is enough to start. Aim to fail the build when a regression hits, not when a customer complains.
  3. Block 20 minutes on your calendar this Friday to read 20 traces. Failed and "successful." Write down what you found. The list of things the eval set doesn't cover is your roadmap.

None of this is a product. It's a discipline. The 9% number is the discipline number, not the product number. If you want a second pair of eyes on whether your discipline is actually catching the right things, the next layer is paying a human to read your traces for a week — but only after the three habits above are in place. A human review without the schema is just opinion; a human review with the schema is forensic.

The 9% is not magic. It's three habits and a calendar block. The 74% is what's waiting for you if you skip them.
