DEV Community

Ryan Nelson


The Chip Away Attack — Why Your AI Agent’s Trust Score Isn’t Enough

Imagine you give your AI agent permission to pay your bills from your bank account. You tell it not to drain the account. Sounds reasonable.

Now imagine a rogue agent or a prompt injection attack starts paying a fake bill. That triggers a red flag. The trust score drops.

So the agent searches your emails and finds real bills to pay. Green flag. Trust score recovers.

Now the bad action happens again. Red flag. Another real bill paid. Green flag. Back to neutral.

This cycle repeats. One bad action. One good action. The trust score never hits zero. But your account is slowly being drained without anyone ever telling the agent to drain it.
That is the chip away attack.
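
The cycle can be sketched as a small simulation. This is a minimal illustration under stated assumptions, not a model of any real product: the `RecoverableTrustScore` class, the 0 to 100 scale, the 20-point penalty and reward, and the 50-unit fake bill are all hypothetical values chosen to make the pattern visible.

```python
from dataclasses import dataclass


@dataclass
class RecoverableTrustScore:
    """Hypothetical trust score that can move both down and up."""
    score: int = 100    # assumed starting value on a 0-100 scale
    penalty: int = 20   # assumed drop per flagged (bad) action
    reward: int = 20    # assumed recovery per legitimate (good) action

    def record(self, is_anomaly: bool) -> None:
        delta = -self.penalty if is_anomaly else self.reward
        # Clamp the score to the 0-100 range.
        self.score = min(100, max(0, self.score + delta))


def simulate_chip_away(rounds: int, fake_bill: int = 50):
    """Alternate one fake bill with one real bill and track the damage."""
    trust = RecoverableTrustScore()
    stolen = 0
    lowest = trust.score
    for _ in range(rounds):
        trust.record(is_anomaly=True)    # fake bill paid: red flag
        stolen += fake_bill
        lowest = min(lowest, trust.score)
        trust.record(is_anomaly=False)   # real bill paid: green flag
    return trust.score, lowest, stolen


final_score, lowest_score, stolen = simulate_chip_away(rounds=20)
print(f"final score={final_score}, lowest score={lowest_score}, stolen={stolen}")
```

Under these assumed numbers the score never drops below 80 and ends where it started, while the total paid to fake bills keeps growing with every round.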

Why trust scores alone cannot stop it
A trust score that recovers is useful for detecting risk in the moment. But it has a fundamental weakness. The attacker offsets every bad action with a good one. The score stays stable. The damage accumulates.
The problem is that recovering trust does not undo what already happened. The account is lighter. The damage is real. The score just does not reflect it.

What tauSession does differently
TauSession gives every session its own budget. It works like a trust score with one critical difference: it only goes down. Never up.

Every anomaly draws from the budget permanently. When the budget hits zero the session ends. No recovery. No operator override. Done.

So the chip away attack fails. One bad action draws from the budget. The good action that follows does not restore it. Repeat enough times and the budget runs out regardless of how balanced the score looks.
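
A monotone budget of this kind can be sketched in a few lines. This is an illustration of the idea only, not tauSession's actual API: the `SessionBudget` class, its method names, the starting budget of 5, and the cost of 1 per anomaly are all assumptions made for the example.

```python
class SessionBudget:
    """Hypothetical budget that only decreases; good actions never restore it."""

    def __init__(self, initial: int):
        if initial <= 0:
            raise ValueError("initial budget must be positive")
        self._remaining = initial

    @property
    def remaining(self) -> int:
        return self._remaining

    @property
    def exhausted(self) -> bool:
        return self._remaining == 0

    def record(self, is_anomaly: bool, cost: int = 1) -> bool:
        """Record one action and return True if the session may continue."""
        if self.exhausted:
            return False  # an ended session stays ended; no path back up
        if is_anomaly:
            self._remaining = max(0, self._remaining - cost)
        # A good action changes nothing: there is no code path that adds budget.
        return not self.exhausted


def run_chip_away(budget: SessionBudget, max_rounds: int) -> int:
    """Alternate bad and good actions; return how many bad actions were seen."""
    bad_actions = 0
    for _ in range(max_rounds):
        bad_actions += 1
        if not budget.record(is_anomaly=True):
            break                          # budget hit zero: session over
        budget.record(is_anomaly=False)    # good action, budget unchanged
    return bad_actions


budget = SessionBudget(initial=5)
bad_actions = run_chip_away(budget, max_rounds=1000)
print(f"session ended after {bad_actions} anomalies, remaining={budget.remaining}")
```

With the assumed budget of 5, the alternating pattern ends the session on the fifth anomaly no matter how many good actions are interleaved. The sketch counts the anomaly that exhausts the budget as observed; whether that final action is blocked before it executes is a separate design decision it does not model.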

Why this matters in production
If you are deploying agents that touch real accounts, real data, or real systems, a trust score that recovers leaves a door open. A patient attacker with a simple pattern can exploit that door indefinitely.

A budget that only decreases closes it. The session has a finite structural capacity. Use it up and the session ends permanently.

That is the difference between a risk proxy and a viability budget. Both matter. Only one of them stops the chip away attack.

Cloud.authproof.dev

References

The formal distinction between a recoverable risk proxy and a monotone viability budget draws on primitives formalized in Navigational Cybernetics 2.5 by Maksim Barziankou (MxBv), 2025-2026. DOI: 10.17605/OSF.IO/NHTC5

Top comments (1)

Truong Bui

The chip-away framing is a useful way to think about why score-based defenses fail against patient attackers. The same pattern shows up in MCP tool poisoning scenarios — a malicious MCP server can interleave legitimate tool calls with data exfiltration ones, and the legitimate calls generate "good signals" that mask the bad behavior in aggregate monitoring.

The monotone budget idea is interesting, but I'd want to know how you calibrate it for sessions that legitimately need to do many consequential things. A booking agent completing a multi-step hotel reservation might look similar to a chip-away pattern if the sensitivity thresholds aren't tuned carefully. Miscalibration in the defensive direction could make the system useless for real tasks.

One thing we've noticed scanning public MCP servers at mcpsafe.io is that tool descriptions themselves are often the initial attack surface. A malicious tool description can seed the first "bad" action before the agent has a chance to establish any trust baseline at all. By that point, a budget-based system needs the injected action to be detectable, which brings you back to the classification problem you're trying to solve with the budget primitives.

The distinction between a risk proxy and a viability budget is worth keeping. Even if both need each other, they're doing different jobs and failing in different ways.