I’ve always had a soft spot for authentication systems.
About seven years ago, I started working on auth, and something just clicked. What began a...
For further actions, you may consider blocking this person and/or reporting abuse
Never-trust-always-verify covers who's calling, but agents broke it on a different axis for me. The call is authorized and still wrong. An agent with valid creds that confidently does the wrong thing passes every auth check, so I ended up gating on the output being worth paying for, not on the identity making the request.
Yeah, that shift caught me off guard too while building this.
At some point I realised the scary part wasn’t an unauthorised action anymore — it was a fully authorised system confidently continuing in the wrong direction 😅
It starts feeling less like classic security and more like continuously validating whether the system is still being sensible.
Same reframe stuck with me. The model moved from is this call allowed to is this even the right thing to do, and auth only answers the first one. Intent drift keeps going, so the check has to keep running instead of firing once at the door.
Exactly! I am afraid it is more of a wait-and-watch situation at the moment, where zero trust is the bare minimum and, honestly, just the tip of the iceberg.
The perimeter is the easy part. You can authorize an agent perfectly and it'll still confidently do the wrong allowed thing, which is a correctness problem, not an access one. I don't think auth frameworks touch that layer at all.
I think that's what makes agent systems so interesting from a security perspective.
For years we've focused on answering "who can do what?" and got pretty good at it. Now we're running into a different class of problems where the permissions are correct, but the outcome still isn't.
Feels like we're missing a layer between authorization and business logic that deals with whether the system's behaviour remains aligned over time.
The trajectory framing in these comments is the right call, but I think it hides a second trap worth naming. If you validate the trajectory by reading the event stream it emitted — transactions_uploaded → score_calculated → insights_generated — you're checking the agent's own account of its trajectory, not the trajectory. That's the same never-trust-the-self-report problem you started with, just moved up from the single call to the sequence. And it's exactly the failure you're worried about: every step honestly reported success while the world quietly diverged underneath it.
What's worked better for me is to stop validating the path and make each step re-prove its own footing instead. Step N is allowed only if its preconditions can be re-derived from primary state at that moment — not inherited as a result handed down from step N-1. The score step doesn't trust "categorisation done" because the previous step said so; it re-checks against the actual categorised data. When something has drifted, the chain breaks at the first step whose precondition no longer re-derives, instead of marching to the end on a counterfeit. That turns "validate the whole trajectory," which is unbounded, into "each step proves its warrant locally," which you can actually enforce.
It also answers your logs-tell-you-what-not-why point directly. If each step records which precondition it re-derived and against what state, the authorization trace is the why — you're not reconstructing intent from an event log months later, you're reading what each action proved about the world before it ran.
The smaller thing underneath all of it: the default at each step has to be stop-unless-warranted, not continue-unless-flagged. Drift only marches on because the loop continues by default. "Should it keep acting" becomes a question the system re-asks every step, and the safe answer, when it can't re-derive its footing, is no.
I like that distinction.
One thing I hadn't really thought about while building this was that an event stream is still a form of self-reporting. The system is effectively saying "trust me, I completed the previous step correctly."
Re-deriving critical assumptions at each stage feels much more robust, especially as workflows get longer. It also naturally creates better auditability because you're recording what the system checked before acting, not just the fact that it acted.
I think your last point is probably the most important one though. Most systems are designed to keep moving unless something explicitly fails. For agentic workflows, there may be a strong argument that the default should be the opposite: keep proving you should continue, otherwise stop.
Glad it landed. The one place this bites when you build it: stop-unless-warranted can over-halt if "warranted" is too strict — every step stalls because it can't perfectly re-derive, and now the safety mechanism is the bottleneck. So the real design call is which assumptions are worth re-proving each step.
The heuristic that's worked for me: re-derive the ones whose silent drift is unrecoverable — the side-effecting, can't-take-it-back steps (the notification you send, the score you write, the timeline row you update) — and let the cheap reversible reads ride. A step you can redo for free doesn't need a warrant. A step that fires a side effect into the world does, because if it ran on a stale assumption you can't un-send it. That keeps the proving local instead of turning every node into a checkpoint, which is the thing that would actually make people rip it back out.
The "re-derive preconditions" pattern is solid, but it introduces its own attack surface worth thinking about. If an adversary knows the agent re-checks state before each step, they can target the verification endpoint itself — rate-limit it, poison the primary state it reads, or just make it intermittently slow so the agent keeps halting on the "stop-unless-warranted" default. You've effectively turned your safety mechanism into a denial-of-service vector. The re-derivation checks are also observable side effects — anyone monitoring the target systems can infer the agent's decision pipeline from the pattern of reads it makes before acting.
That's a really interesting trade-off.
A lot of safety mechanisms end up becoming critical dependencies themselves. The moment a workflow depends on continuously proving its footing, the verification path becomes part of the system's attack surface.
It also highlights that "fail closed" isn't free. Stopping when you can't verify is often the safer choice, but it can also become a reliability problem if verification is noisy, unavailable, or under attack.
Feels like there's a balancing act here between trust, resilience, and observability. The more checks you add, the more confidence you get in each decision, but the more dependent the system becomes on those checks remaining healthy.
Both of these are right, and worth separating because they hit different properties.
The DoS one is real, but it's the trade you want. If an attacker can make the re-derivation slow or unavailable, the agent halts — that's a denial of service, not a denial of safety. Their win is "you stopped," not "you acted on the state I poisoned." Fail-closed deliberately moves the worst case under attack from "silently wrong" to "loudly stopped," and you then manage the DoS as a DoS — read budgets, a bounded degraded path. The continue-unless-flagged alternative hands that same attacker the strictly worse outcome: keep acting on stale state. So re-derivation doesn't so much create the surface as choose which failure you take when it's under pressure.
The poisoning point is the sharper one, and it isn't really a counter — it's the same principle one level down. Re-derivation is only worth what the independence of the source it reads is worth. If the adversary controls the primary state, re-deriving against it is just self-report wearing a different hat — you're re-checking against their fiction. Which means the source has to be one they can't author: a signed reading, several independent reads that have to agree, a system outside their reach. "Re-derive" was always shorthand for "consult something the actor — and ideally the attacker — couldn't have produced"; if the only available primary fails that test, it buys nothing there, and that's worth knowing rather than papering over.
The read-pattern side channel is the one I'd concede cleanest. It's a real confidentiality leak, orthogonal to the integrity the re-derivation buys. You can make the reads oblivious — re-derive a fixed set regardless of which branch you're on, so the pattern stops revealing the branch taken — but that's constant-time-crypto reasoning and it isn't free. In a high-adversarial setting it's a genuine cost, not a footnote.
The trajectory framing is the one that stuck with me. We ran into the same gap from a
different angle the memory side.
The question we kept hitting was not just whether the agent is authorized to act, but
whether the memory driving the action is authorized to govern that specific decision at
that moment. A memory can pass every ingestion gate, have valid metadata, and still be
stale or misscoped by the time it reaches the action.
CLAIM-24 in our series tests exactly that. A grant is still timestamp-valid but the
source state underneath it has changed. The re-derivation pattern ANP2 describes here
is what we built each step re-proves its footing against live state instead of
inheriting the result from the previous step.
The layer after that is what you called intent drift. Even when the memory is
authorized and the source is fresh, the instruction inside can be anomalous. Recipient
doesn't match the session. Scope is wider than the operation requires. That is CLAIM-28
for us behavioral norm detection. The memory cleared every authority check and still
should have been refused.
Zero Trust at the door is real and necessary. It just does not reach these two layers.
Interesting angle. In pentesting CI/CD pipelines, I've seen a similar pattern where every step — checkout, build, push — is technically authorized, but attackers chain them to create effects far beyond any single permission. A poisoned dependency or a workflow manipulation can turn a series of valid actions into a full compromise. Zero Trust verifies the moment, but it's blind to the trajectory. The "drift" you describe in agentic systems feels almost identical to what we hunt for in supply chain attacks: the signature is valid, the commits are signed, but the sequence becomes malicious.
Have you ever considered mapping this kind of trajectory validation onto automated workflows like CI/CD? Each step is a decision point, and the question "should it keep acting?" is exactly what we need to ask when a pipeline starts auto-merging, auto-deploying, or auto-releasing based on previous outputs. That might be a fruitful intersection.
The trajectory point in the comments is the sharper one. Zero trust validates identity at each call but can't validate that the sequence of calls wasn't drifting somewhere nobody approved. The step after trajectory-aware policy is trajectory-level evidence: each action bound to the context active at that moment, so the drift is reconstructible months later.
That's a great point.
One thing I found while building PlanetLedger was that logs tell you what happened, but not always why it happened. When you're looking back weeks or months later, that missing context makes it really hard to understand how a system arrived at a particular outcome.
Being able to reconstruct the reasoning and context behind a chain of actions feels like it will become increasingly important as these systems become more autonomous.
The "validates moments, not trajectories" framing is the right one, and the trajectory concept is what is missing from most agent governance writeups. Every step in the example you gave is individually allowed, individually audited, individually within scope. The system as a whole is drifting in a direction that nobody approved.
The cheap-and-honest version of trajectory tracking is a session-level "what is the user trying to do?" check that runs at the end of every step, not at the start. Most agent designs gate on entry: "is this tool call allowed?" They do not gate on exit: "after this tool call, is the agent's overall trajectory still the one the user asked for?".
The practical implementation is a small post-action hook that diffs the agent's stated intent (from the user prompt at session start) against the actual actions in the last N tool calls. If the diff gets large, the hook pauses the agent and asks the user to re-confirm. The "diff" can be a simple embedding similarity, or a regex of file paths touched, or just a count of distinct modules touched. The point is the drift signal, not the exact metric.
The thing this is not: it is not "the model is hallucinating" or "the tool is broken". The drift is a normal property of an agent doing useful work over time. The fix is to make drift visible and pauseable, not to prevent it.
The hardest part of implementing this in practice is that the user is rarely available to re-confirm mid-task. The two patterns that have worked for us: (a) a hard cap on "drift score" per session, after which the agent stops and posts a summary, and (b) a periodic "where are we?" check-in that the user can ignore. Both are imperfect, and both are better than the alternative of a session that quietly does more than the user asked for.
This is super close to what I kept running into while testing chained workflows.
The weird part was that nothing looked obviously broken in isolation, but after a few steps the overall behaviour sometimes started feeling slightly “off”. Small deviations compound really quickly.
I also like the idea of treating this as something to monitor and surface rather than trying to eliminate completely. Otherwise agents probably become too restrictive to actually be useful.