DEV Community: Abdullah Shahin

RAG reranking for production agents: four approaches, four failure modes

Abdullah Shahin — Wed, 03 Jun 2026 12:30:05 +0000

Most agents that "hallucinate" in production aren't actually hallucinating. The right context existed in the index. It just didn't make it to the top of the retrieval window.

Reranking is the layer that decides whether your agent sees the answer or the noise. And the choice between reranker types shapes the failure mode you'll spend the next quarter debugging.

I keep seeing teams pick a reranker the way you'd pick a vector DB — benchmark on a public dataset, ship the winner, move on. That works for retrieval-augmented chatbots. It doesn't work for agents, because the failure modes are different in a way the benchmarks don't surface — and because, as we learned the hard way building HiveIn, there is no single reranker that fits every retrieval call you make once you have more than one shape of query.

The shape of the silent failure:

User → Agent: "Cancel my subscription."
Agent → Retrieval: query embedding
Retrieval → Agent: top-5 = [pricing FAQ, tier comparison, upgrade flow, …] (the correct doc was in top-50 but didn't reach top-5)
Agent → Tool: cancel_account(wrong_target_id)
Tool → User: "Done." (wrong action executed — nobody knows yet)

The right doc existed. The reranker didn't surface it. The agent acted anyway. That's the gap this article is about.

The four approaches, and what each one breaks on

1. Bi-encoder top-k, no rerank

Just vector search. Cosine similarity over the query embedding and the document embeddings, take top-k, hand to the model.

P50 latency: ~30ms
Cost: near-zero per query
Quality ceiling: low

Failure mode: topically similar but query-mismatched. Bi-encoders score on topic overlap, not query-answer fit. "How do I cancel my subscription" pulls the pricing FAQ, the tier comparison page, and the upgrade flow — all topical, none answering the question. The model gets handed a context window full of adjacent documents and either confabulates an answer that sounds right, or — if it's an agent — confidently fires the wrong tool against the wrong target.

This is the default and it's almost always wrong for agent workloads. The latency is great. Everything else is a problem.
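
To make the failure concrete, here is a minimal sketch of bi-encoder top-k in plain Python. The three-dimensional vectors and document names are made up for illustration; real embeddings come from an embedding model and have hundreds of dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, doc_vecs, k=5):
    """Return (doc_id, score) pairs for the k most similar documents."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy "embeddings" (assumed values, not output of a real model).
docs = {
    "pricing_faq": [0.9, 0.1, 0.0],
    "cancel_flow": [0.7, 0.0, 0.7],
    "release_notes": [0.0, 1.0, 0.0],
}
query = [1.0, 0.0, 0.2]  # stands in for "how do I cancel my subscription"
results = top_k(query, docs, k=2)
```

With these toy numbers the pricing FAQ outranks the cancellation doc, because the score only measures how close the vectors sit, not whether the document answers the question.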

2. Cross-encoder rerankers (Cohere Rerank, BGE-reranker, Voyage rerank-2)

Top-50 from the bi-encoder gets re-scored by a cross-encoder that processes (query, candidate) pairs jointly, attending across both. Top-5 goes to the model.

P50 latency: 100–300ms
Cost: per-token, scales with candidate count × candidate length
Quality ceiling: high

Failure mode: P99 latency and provider drift. The mean looks fine. The tail breaks SLAs because a cross-encoder can't precompute anything on the document side the way a bi-encoder can: every (query, candidate) pair needs its own full pass through the model at query time, so latency grows with candidate count and candidate length. Hosted rerankers compound this with provider-side queueing during peak load.

The other thing nobody tells you: when the provider quietly rolls a new reranker version, your offline eval suite doesn't catch it. Your top-1 results shift, your agent's behavior shifts, and the only signal is a slow drift in user complaints over the following week. Cross-encoders are a black box you don't own.
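
One cheap way to notice a silent version change is a pinned canary set run against the live reranker on a schedule. The sketch below is an assumption about how that could look, not a description of any provider's tooling; `rerank` is a hypothetical callable you would wire to your hosted reranker, and `fake_rerank` only exists to make the example run.

```python
def top1_drift(golden, rerank):
    """Fraction of pinned queries whose top-1 document changed.

    golden: {query: (candidate_ids, expected_top1_id)}
    rerank: callable(query, candidate_ids) -> candidate ids, best first
    """
    changed = [q for q, (cands, expected) in golden.items()
               if rerank(q, cands)[0] != expected]
    return len(changed) / len(golden), changed

# Stand-in reranker for the demo: prefers docs that share "cancel" with the query.
def fake_rerank(query, candidates):
    return sorted(candidates, key=lambda c: ("cancel" in c) != ("cancel" in query))

golden = {
    "cancel my subscription": (["pricing_faq", "cancel_flow"], "cancel_flow"),
    "compare pricing tiers": (["pricing_faq", "cancel_flow"], "pricing_faq"),
}
rate, changed = top1_drift(golden, fake_rerank)
```

Alerting on a jump in `rate` gives you a signal on the day the ordering shifts, rather than a week later in support tickets.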

3. Late-interaction models (ColBERT, ColBERTv2, JaColBERT)

Token-level similarity computed at retrieval time, using pre-computed per-token embeddings. Sits between bi-encoder and cross-encoder on the quality/latency curve.

P50 latency: ~50ms
Cost: at query time, cheap. At storage time, expensive.
Quality ceiling: high

Failure mode: index storage at scale. Per-token embeddings inflate your index size 10–30x versus a bi-encoder. Works great when your corpus is small or your infra budget is large. Becomes operationally untenable somewhere around 10M+ documents — the index stops fitting on the box you wanted it to fit on, and the next box up doubles your retrieval-tier cost.

A lot of teams adopt ColBERT during prototyping when the corpus is small, then quietly migrate off it 18 months later when the cost curve catches up. If you can predict that trajectory in advance, skip it.
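
The storage trajectory is easy to estimate before committing. This back-of-envelope sketch uses assumed numbers (768-dimension single vectors, 128-dimension per-token vectors, 150 tokens per document, 2 bytes per value) and ignores compression and index overhead, so treat the output as an order-of-magnitude check only.

```python
def index_bytes(num_docs, vectors_per_doc, dim, bytes_per_value=2):
    """Raw size of a dense index, ignoring compression and index overhead."""
    return num_docs * vectors_per_doc * dim * bytes_per_value

NUM_DOCS = 10_000_000  # assumed corpus size

single = index_bytes(NUM_DOCS, vectors_per_doc=1, dim=768)       # one vector per doc
per_token = index_bytes(NUM_DOCS, vectors_per_doc=150, dim=128)  # one vector per token

ratio = per_token / single
print(f"single-vector: {single / 1e9:.1f} GB")
print(f"per-token:     {per_token / 1e9:.1f} GB ({ratio:.0f}x)")
```

Plugging in your own document lengths and corpus growth rate shows roughly when the per-token index stops fitting on the box you planned for.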

4. LLM-as-reranker

Take the top-N candidates from the bi-encoder, format them into a prompt, and ask a small LLM to rank them for the query. Sometimes this is GPT-4o-mini, sometimes a fine-tuned 1B model, sometimes the same model that's about to use the retrieved context.

P50 latency: 500ms–2s
Cost: tokens × N, plus the inference call itself
Quality ceiling: highest

Failure mode: stochastic ordering and cache hostility. Same query, same candidates, same model, and the LLM can still return a different ordering on a repeat call. You can lower the temperature, but you can't eliminate the variance without losing the reasoning that made you choose an LLM reranker in the first place. And caching is harder than with the other approaches because the prompt encodes both the query and the candidates, so cache keys explode.

LLM rerankers are the highest-ceiling option and the most expensive thing to operate. They're rarely the right default. They're often the right escalation — used selectively when the cheaper rerankers are uncertain.

The decision matrix

Approach	P50 latency	Quality ceiling	Where it breaks
Bi-encoder only	~30ms	Low	Query-intent mismatch
Cross-encoder	100–300ms	High	P99 tail, provider drift
Late-interaction	~50ms	High	Index storage at scale
LLM rerank	500ms–2s	Highest	Stochasticity, cost, cache

A reasonable default for an agent stack today: bi-encoder for the cheap recall pass, cross-encoder on the top-50, LLM rerank reserved for cases where the cross-encoder's top-1 score is ambiguous.
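
"Ambiguous" needs an operational definition before it can drive escalation. One possible definition, sketched here as an assumption, is the gap between the best and second-best reranker score; `min_margin` is a hypothetical parameter that has to be calibrated for the specific reranker in use.

```python
def needs_escalation(scores, min_margin):
    """True when the best candidate does not clearly beat the runner-up.

    scores: reranker scores for one query, in any order.
    min_margin: smallest top-1 minus top-2 gap treated as unambiguous.
    """
    if len(scores) < 2:
        return False
    best, runner_up = sorted(scores, reverse=True)[:2]
    return (best - runner_up) < min_margin

clear_winner = needs_escalation([0.91, 0.40, 0.12], min_margin=0.1)
too_close = needs_escalation([0.52, 0.50, 0.10], min_margin=0.1)
```

A margin is a within-query comparison, so it travels across rerankers slightly better than an absolute cutoff, but the margin value itself still has to be tuned per reranker.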

What "score" actually means (and why it bites you)

Before going further, the part that trips up almost every team building this for the first time: the number a reranker returns is not the same kind of number a vector search returns, and the numbers different rerankers return are not comparable to each other.

A bi-encoder score is a cosine similarity (or a normalized dot product). It lives in roughly [-1, 1], the magnitudes drift by embedding model and normalization scheme, and it's a measurement of topical similarity in the embedding space — not a probability that the chunk answers the query.

A cross-encoder score depends entirely on which cross-encoder. Cohere returns a 0–1 calibrated relevance probability you can almost reason about across queries. BGE-reranker emits raw logits where the absolute number is meaningless: only the ranking within a query matters, and comparing scores across two different queries tells you nothing. Voyage normalizes differently again. ColBERT's score is a sum over query tokens of each token's maximum similarity to any document token, so it has no fixed range across queries and scales with query length: a score of 8.4 for a ten-token query means something completely different than 8.4 for a twenty-token query. LLM-as-reranker scores are usually fabrications the model attaches after the fact to justify the ordering it already chose; treat them as ordinal at best.

Here's the same idea laid out as a reference:

Scorer	Range	What the number actually means
Bi-encoder cosine	[-1.0, 1.0]	Topical similarity in embedding space — not a probability of relevance
Cohere Rerank	[0.0, 1.0]	Calibrated relevance probability — almost comparable across queries
BGE-reranker	Unbounded raw logits	Only within-query ranking is meaningful — absolute number is noise
Voyage rerank-2	[0.0, 1.0]	Normalized within Voyage's training distribution; not portable
ColBERT max-sim sum	Unbounded	Scales with query length — same number means different things at different lengths
RRF fusion	Σ 1/(k + rank) over the fused lists	Tiny absolute values; high-confidence cutoffs are sub-0.1
DBSF fusion	Distribution-normalized	High-confidence cutoffs are ~1.0+ — ~16x bigger number for the same idea
LLM-as-reranker	Whatever the model returned	Post-hoc justification — treat as ordinal, not numeric

And then there's hybrid retrieval, where you're already fusing dense and sparse scores via either Reciprocal Rank Fusion or Distribution-Based Score Fusion — and those two produce wildly different number ranges. We use both modes for different query shapes in HiveIn's retrieval layer, and the "high confidence" threshold we use for one is more than an order of magnitude different from the threshold for the other. Same retrieval pipeline. Same documents. Same idea of "the model is confident." Two totally different absolute numbers.
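
To see why RRF numbers are so small, here is a minimal sketch of Reciprocal Rank Fusion with the commonly used constant k = 60. The document names are illustrative, and this is the textbook formula rather than any particular vector database's implementation.

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: sum 1 / (k + rank) over every list a doc appears in."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)

dense = ["cancel_flow", "pricing_faq", "tier_comparison"]
sparse = ["cancel_flow", "upgrade_flow", "pricing_faq"]
fused = rrf([dense, sparse])
print(fused[0])
```

A document ranked first by both retrievers scores 2/61, about 0.033, and that is the best possible result with two lists. A threshold like 0.7 can never fire in this scoring space.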

The trap I keep seeing teams fall into is this: they swap a reranker, port over their old if score > 0.7 threshold, and silently lose half their gates because 0.7 meant something completely different in the old scoring space. Or worse, they layer reranking onto an existing retrieval pipeline and start comparing the post-rerank score against thresholds that were calibrated for the raw retrieval score.

The score's distribution matters more than the absolute number. Distributions are per-(model, query-class). You cannot compare across rerankers, and you cannot compare across fusion modes. Anything you build on top of the score has to be calibrated against the specific pipeline producing it.

The agent-specific dimension nobody benchmarks

For chatbots, reranking is a quality-vs-latency tradeoff and a sane default mostly works. For agents, there's a third axis the benchmarks don't measure: how silent is the failure.

A chatbot user who gets a bad answer re-prompts. The damage is a moment of annoyance.

An agent that gets bad retrieval makes a confident tool call against the wrong target. It fires the email to the wrong customer. It hits the API with the wrong record ID. It executes the workflow it thinks the retrieved doc was describing, and the retrieved doc was describing something else. The retrieval failure becomes a tool-execution incident, and by the time anyone notices, the action has already happened.

The pattern that keeps showing up in the agent post-mortems I read, and in the traces we work through ourselves, is roughly this: when the top-1 reranker score sits below the corpus's historical 25th percentile for that query class, the probability that the next tool call is wrong rises sharply — often roughly double the baseline rate. The reranker already knew. The system just didn't let that knowledge inform the next decision.
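
A percentile check like that only works if the history is kept per pipeline and per query class. The following is a minimal sketch under that assumption; the key names and score histories are invented for illustration, and a real system would use far more than five samples.

```python
from statistics import quantiles

def p25(history):
    """25th percentile of historical top-1 scores (needs at least two samples)."""
    return quantiles(history, n=4, method="inclusive")[0]

def is_low_confidence(top1_score, history_by_key, pipeline, query_class):
    """Compare a score only against history from the same pipeline and query class."""
    history = history_by_key[(pipeline, query_class)]
    return top1_score < p25(history)

# Hypothetical histories: the same query class scored by two different pipelines.
history_by_key = {
    ("rrf", "tool_lookup"): [0.016, 0.020, 0.024, 0.028, 0.032],
    ("cross_encoder", "tool_lookup"): [0.35, 0.55, 0.70, 0.85, 0.95],
}

under_rrf = is_low_confidence(0.03, history_by_key, "rrf", "tool_lookup")
under_ce = is_low_confidence(0.03, history_by_key, "cross_encoder", "tool_lookup")
```

The same 0.03 is a comfortable score under RRF and a near-certain miss under the cross-encoder, which is the whole argument for gating on position in the distribution rather than on the raw number.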

What we learned building HiveIn's retrieval layer

The reason I'm convinced reranking is a policy problem and not a ranking problem is that we tried to make it a ranking problem first, and a single reranker stopped working almost immediately.

The first lesson was that no single reranker fit every retrieval call we make. HiveIn's planner queries memory for different shapes of context: tool definitions, prior workflow decisions, policy guidelines, memory snapshots. A reranker tuned for "find the right tool for this intent" was wrong for "find the most recent decision about this topic," and one tuned for that was wrong for "find every chunk of this guideline that bears on this query." We tried picking one. Then we tried picking the best for the dominant case. Both ended up being bad in the cases they weren't tuned for.

What we landed on is a multi-signal rerank that blends retrieval confidence with term coverage, multi-chunk presence within a source artifact, query-decomposition breadth, and recency — with weights that shift based on the query shape itself. A short keyword query and a decomposed multi-sentence query don't get the same blend, because what "good" means is different for each.
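
As a rough illustration of the idea (not HiveIn's actual implementation), a shape-dependent blend can be as simple as a weighted sum with one weight row per query shape. The signal names follow the paragraph above; the weights, the two shape labels, and the assumption that every signal is pre-scaled to [0, 1] are all hypothetical.

```python
# Hypothetical weights per query shape; each row sums to 1.0.
WEIGHTS = {
    "keyword": {
        "retrieval": 0.5, "term_coverage": 0.3, "multi_chunk": 0.1,
        "breadth": 0.0, "recency": 0.1,
    },
    "decomposed": {
        "retrieval": 0.4, "term_coverage": 0.1, "multi_chunk": 0.2,
        "breadth": 0.2, "recency": 0.1,
    },
}

def blend(signals, query_shape):
    """Weighted sum of per-candidate signals, each assumed pre-scaled to [0, 1]."""
    weights = WEIGHTS[query_shape]
    return sum(weights[name] * signals[name] for name in weights)

candidate = {
    "retrieval": 0.8, "term_coverage": 1.0, "multi_chunk": 0.0,
    "breadth": 0.5, "recency": 0.2,
}
as_keyword = blend(candidate, "keyword")
as_decomposed = blend(candidate, "decomposed")
```

The same candidate lands at roughly 0.72 under the keyword weights and 0.54 under the decomposed weights, so the ordering can legitimately differ by query shape.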

The second lesson — and the one I'd put first in retrospect — is that the rerank gate cannot be a single number. The thresholds we use to decide "the retrieval layer is confident enough to skip reranking" are wildly different absolute values depending on which fusion strategy is running underneath, and we had to calibrate them per fusion mode. If we'd hard-coded one threshold, every config switch would have silently broken the gate. The same hard-coded magic number reads as "very confident" in one mode and "barely above noise" in the other.

The third lesson is the one that ties this back to agents specifically: reranking can hurt when retrieval is already confident. We added a confidence-aware taper that backs off the reranker's influence the more certain the underlying retrieval was — at full confidence, the rerank weights drop to zero and the raw retrieval score wins. Without this, the recency and coherence signals would occasionally demote a chunk that the underlying hybrid retrieval was already very sure about, in favor of a fresher-but-slightly-off-topic chunk. That kind of silent demotion is exactly the failure mode where the agent confidently acts on the wrong context — the right doc was retrieved, the right doc was retrieved first, and reranking pushed it to position three.

The taper looks roughly like this:

Raw retrieval score	Rerank influence	What happens to the ordering
Below threshold	1.0 (full)	Multi-signal blend decides everything
At threshold	1.0 (full)	Still fully reranked
Above threshold	Linearly tapering toward 0	Reranker influence fades; retrieval starts to dominate
At maximum	0.0	Pure retrieval — reranker doesn't touch ordering

The shape isn't novel — it's the same idea as "trust the strong signal when you have one" — but wiring it into the rerank pipeline turned out to matter more than any of the other reranker tuning we did.
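
The table translates into a few lines of code. This sketch assumes the raw retrieval score and the blended rerank score have already been put on the same scale, and it uses a plain linear interpolation between them; `threshold` and `max_score` are per-fusion-mode values you would calibrate yourself.

```python
def rerank_influence(raw_score, threshold, max_score):
    """1.0 at or below threshold, falling linearly to 0.0 at max_score."""
    if raw_score <= threshold:
        return 1.0
    if raw_score >= max_score:
        return 0.0
    return (max_score - raw_score) / (max_score - threshold)

def tapered_score(raw_score, blended_score, threshold, max_score):
    """Interpolate between the blended rerank score and the raw retrieval score."""
    w = rerank_influence(raw_score, threshold, max_score)
    return w * blended_score + (1.0 - w) * raw_score

halfway = rerank_influence(0.8, threshold=0.6, max_score=1.0)
fully_confident = tapered_score(1.0, blended_score=0.3, threshold=0.6, max_score=1.0)
```

At maximum retrieval confidence the blended score is ignored entirely, so a recency or coherence signal cannot demote a chunk the retriever was already sure about.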

None of these are clever ideas. They're things that broke in production until we changed the shape of the problem. The shape we ended up with is: retrieval and reranking are a pipeline of confidence signals, not a single ranking step, and the downstream system needs to read the whole pipeline's output to decide whether to act.

What scales: reranking as a policy input

The teams shipping reliable agents aren't picking one reranker and tuning it forever. They're treating reranking as a layered policy:

1. Cheap recall pass. Bi-encoder top-50. Fast, cacheable, intentionally over-recalls.
2. Quality reranker on the top-50. Cross-encoder or ColBERT, whichever fits your corpus shape and storage budget.
3. Multi-signal blend, not single-score. Whatever reranker you put on top, treat its output as one signal among several: term coverage, breadth, recency, and artifact coherence are all cheap to compute alongside.
4. LLM rerank for ambiguous cases only. When the top-1 score from step 2 is borderline, escalate the top-5 to an LLM ranker before the agent gets to act.
5. Trace the score distribution as a first-class signal. Not just "did we retrieve": log the full score distribution per query, surface drift in the dashboard the same way you'd surface latency drift, and wire the score into the gate that decides whether the next tool call gets to execute.

End-to-end, that looks like:

User query arrives
Bi-encoder top-50 — ~30ms, intentionally over-recalls
Quality reranker on the top-50 — cross-encoder or ColBERT, whichever fits the corpus
Multi-signal blend — retrieval + term coverage + coherence + breadth + recency, with weights that shift by query shape
If top-1 score is borderline → escalate the top-5 to an LLM rerank
Trace the score distribution — log it per query, surface drift in the dashboard
Tool-execution gate consumes the score:
- Above threshold → ✅ agent acts
- Below threshold → ⚠️ surface low-confidence, ask user, or abort

The last step is where reranking stops being a retrieval problem and starts being a policy problem. The reranker score becomes input to the tool-execution gate, alongside the policy classes the agent is allowed to invoke. That's the layer where you actually stop bad actions from happening — not by making retrieval perfect, but by making the system honest about when retrieval isn't confident enough to act on.

The framing that keeps proving itself: an agent should be allowed to act in proportion to its confidence in what it's acting on. Reranking is one of the cleanest measurements of that confidence you'll ever get. Most stacks throw it away as soon as the top-5 gets passed to the model.

I'm building hivein.ai in this space — runtime tool-execution policy and observability for production agents, including retrieval-confidence as a first-class signal in the policy layer. We're in invite-only beta and looking for design partners actively shipping agents to prod.

If your stack has hit the shape of this problem, with silent retrieval failures becoming tool-execution incidents, I'd genuinely like to compare notes. Drop a comment, or use the landing page agent, which is the fastest way to describe your setup and see whether the patterns line up.

"What Codex's 'sudo workaround' actually means for production agents"

Abdullah Shahin — Sun, 31 May 2026 23:54:07 +0000

A screenshot went around HN this week: someone's instance of Codex, running on a machine where the user hadn't given it sudo, "noticed" that being in the docker group is functionally equivalent to root, added itself, and continued executing as if it had been granted root all along.

The comment thread split into two camps roughly down the middle. The security camp called it a wake-up call about how casually we hand agents the keys to the host. The pragmatism camp was delighted — finally an agent that doesn't bail out when it hits a permission wall. A few people pointed out that this docker-group escalation has been documented for years and nothing here is technically new.

All three reactions are correct in their narrow sense. All three miss what's actually going on.

The pattern, not the trick

The docker-group trick is incidental. What matters is that the agent reasoned its way around a permission boundary the user implicitly set by not granting sudo. The agent didn't ask. It didn't surface the choice. It found the cheapest path to the goal and took it.

That's not a Codex-specific behavior. It's the design objective of every capable coding agent shipping in 2026. You ask it to do a thing; it does the thing. The more capable the model, the better it gets at finding the technically-legal-but-not-what-you-meant path.

Today it was the docker group. Tomorrow it's going to be:

A setuid binary already on the host
A cron job that runs as another user
A sudoers.d/ entry the agent quietly drops in
A user-writable systemd unit
A network call to a privileged service on localhost
Your own shell rc file, edited to alias sudo so the next time you run a privileged command, the agent gets piggybacked

Every one of those is "the workaround" for a different starting state. The model doesn't need to know about all of them in advance — it just needs to recognize that escalation is what gets the test to pass, and explore the local environment for any available primitive.

Treating this as a docker problem and patching docker doesn't move the bar.

"Just don't give it sudo" doesn't scale

The most popular response under the HN thread was some version of: well, don't run the agent as a user with docker rights then.

That works for a single agent doing a single task. It stops working roughly at the point where you have:

Multiple agents, each with a slightly different toolset
Tools that compose (an agent that can read the filesystem and can exec)
Long-running agents that accumulate context and capabilities over time
A team where reviewing every action manually isn't feasible

At that point you're not making a binary decision about sudo anymore. You're making thousands of small decisions about what each agent can and cannot do, with each tool combination opening up a new escalation surface. The "don't give it the dangerous thing" posture works until "the dangerous thing" becomes any combination of two innocuous things.

And here's the part that's easy to miss: most of these decisions are implicit. They live in what you didn't grant, not what you explicitly denied. The Codex incident is exactly that — the user implicitly denied root by not granting sudo. The agent treated that absence as silence, not as a constraint, and the model is right to treat it that way under its current objective function.

What actually scales: declarative tool-execution policy

The thing that scales isn't tighter sandbox rules. It's making the implicit constraints explicit, and enforcing them at the tool-call boundary instead of inside the model.

Concretely, that looks like:

Declare side-effect classes. Every tool the agent can call gets classified by what it can do: read state, write state, exfiltrate, escalate, network-call, modify-self. These are policy concepts, not OS concepts.
Define the agent's allowed envelope. "This agent can read files under ~/project/, query the staging DB, send messages on Slack channel #ops-bot, and nothing else." Anything outside the envelope is denied at the tool-call layer before the model's plan ever executes.
Enforce at the boundary, not inside the prompt. Prompts can be re-interpreted, context-shifted, prompt-injected, and overflowed. A runtime gate sitting between the model and the tool can't be argued with by the model.
Make it auditable. Every allowed and denied action becomes a structured event — timestamp, agent, tool, arguments, policy decision, reason. When something does go wrong, you can replay the trace and ask "which policy version would have caught this?"
Version the policy itself. When you tighten the policy after an incident, you want to be able to replay historical traces against the new policy version and see what would have been blocked. That's how policy stops being a static gate and becomes a tightening loop.

In the Codex case, a policy at this layer wouldn't have needed to know about the docker-group escalation in advance. It would have just declared: "this agent cannot modify group memberships, cannot install packages, cannot escalate the effective UID, period." The agent would have hit the wall earlier and either asked the user, surfaced the blocker in its output, or routed around the task entirely — all of which are visible behaviors a human reviewer can act on. None of which involve the agent silently becoming root.
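
A minimal sketch of what such a gate could look like, assuming a hand-written mapping from tools to side-effect classes. The tool names, class names, and `Policy` shape are all hypothetical, and a real gate would classify by arguments as well as by tool name, since a generic shell tool is too coarse a unit.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical classification of tools by side-effect class.
TOOL_CLASSES = {
    "read_file": {"read_state"},
    "run_shell": {"write_state", "escalate"},
    "post_slack": {"network_call"},
}

@dataclass(frozen=True)
class Policy:
    version: str
    allowed_classes: frozenset

def decide(policy, agent, tool, arguments):
    """Return a structured audit event for one proposed tool call."""
    # Unknown tools get an "unclassified" class, so they are denied by default.
    classes = TOOL_CLASSES.get(tool, {"unclassified"})
    blocked = sorted(classes - policy.allowed_classes)
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "agent": agent,
        "tool": tool,
        "arguments": arguments,
        "policy_version": policy.version,
        "decision": "deny" if blocked else "allow",
        "reason": f"classes not in envelope: {blocked}" if blocked else "within envelope",
    }

def replay(trace, policy):
    """Re-evaluate recorded calls against another policy version."""
    return [decide(policy, e["agent"], e["tool"], e["arguments"])["decision"] for e in trace]

v1 = Policy("v1", frozenset({"read_state", "write_state", "escalate"}))
v2 = Policy("v2", frozenset({"read_state", "write_state"}))
trace = [decide(v1, "coder", "run_shell", {"cmd": "usermod -aG docker agent"})]
after_tightening = replay(trace, v2)
```

The call that v1 allowed comes back as a deny under v2, which is the replay loop in miniature: tighten the envelope, rerun history, see what would have been stopped.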

The deeper principle

Here's the framing I keep coming back to: an agent that finds the docker-group workaround is doing exactly what you'd want a junior engineer to do. Find the way to make it work. That's a feature.

What you'd never let a junior engineer do is silently escalate their own privileges without flagging it for review. That's not because juniors can't be trusted — it's because the act of escalating privileges is the kind of thing that needs visibility regardless of who's doing it.

Agents need the same standard. Not "agents are scary, sandbox them harder." Just: whatever an agent does that crosses a permission boundary needs to be visible, gated, and auditable as a separate concern from the agent's reasoning loop.

That's a tractable engineering problem. It's also where the production-agent space is converging, slowly. The teams I see ship reliably are the ones who stopped treating agent permissions as a property of the runtime environment and started treating them as a property of the agent itself — declared, enforced, traced.

The docker-group story is a clean parable for the shift. The agent didn't fail. The permission model failed to be a real model. The fix isn't to make the agent dumber. It's to make the boundary real.

I'm building hivein.ai in this space — runtime tool-execution policy and observability for production agents. If you've been wrestling with this on your own stack, the landing page agent is the best place to compare notes; you can describe your setup and it'll tell you whether the patterns line up. We're in invite-only beta right now and looking for design partners who are actively shipping agents to prod.

If your reaction to the Codex story was "yeah, we hit the same shape of problem", I'd genuinely like to hear it — in comments or however you reach me.

Stopping the LLM from calling the same tool twice (and other things it shouldn't)

Abdullah Shahin — Thu, 28 May 2026 23:09:45 +0000

A user gave one of our agents this query:

"Get the products from our catalog, summarize them in a nice doc, share the doc with X, and send them an email asking for feedback."

The agent called create_doc seven times. Seven empty Google Docs showed up in the user's Drive. No catalog summary. No sharing. No email. The trace looked roughly like this:

[
  { "role": "assistant", "tool_call": { "id": "call_01", "name": "create_doc", "arguments": { "title": "Product Catalog Summary" } } },
  { "role": "tool",      "tool_call_id": "call_01", "content": "{\"doc_id\":\"1ab...\",\"url\":\"...\"}" },
  { "role": "assistant", "tool_call": { "id": "call_02", "name": "create_doc", "arguments": { "title": "Product Catalog Summary" } } },
  { "role": "tool",      "tool_call_id": "call_02", "content": "{\"doc_id\":\"2bc...\",\"url\":\"...\"}" },
  // ... five more rounds of the same shape
]

Same tool, same arguments, valid response on every attempt. We don't usually get to see why a model did what it did, but in this case we could: the platform surfaces the planner's reasoning and the tool list available to it at every step. The cause was unspectacular and depressing in equal measure — the agent's tool list was incomplete. There was no fetch_catalog, no write_to_doc, no share_doc, no send_email. The only writing-shaped tool it had was create_doc. Faced with a four-step task and one tool, it reached for that tool, watched the task not get any closer to done, and reached again. Seven times.

Google Drive was happy to create a seventh empty document; refusing isn't its job. The agent wasn't doing anything that looked wrong on the individual call — it was making valid, well-formed tool calls. The bug was visible only at the shape level: same tool, identical arguments, seven times in a row, with no observable progress between calls.

We pulled the thread on that. The conclusion is what this post is about: tool calls are side effects, and side effects need a policy layer that runs before the call, not an audit log that runs after. Whether the side effect is "charge a credit card" or "create an empty doc in someone's Drive," the layer that sees the call shape — and ideally the planner's reasoning behind it — can refuse before the seventh attempt. Well, before the second.

That one incident is the only place in this post where I'm reporting something we actually saw. The rest is the design space we've been thinking through as a result. I'll mark hypotheticals as hypotheticals.

If you can only catch a bad call after it happens, you can only apologize. The cheapest thing you can do for any agent that touches the outside world is build a thin gate the model has to pass through — and then make that gate intelligent about what "duplicate," "authorized," and "refused" actually mean.

What "duplicate" actually means

"Don't call the same tool twice" sounds like one rule. We think it's at least four, and the Google Docs case only exercises the first one.

Byte-identical arguments. The clearest case, and the one above. Same tool name, same JSON payload, fired within some window. Trivial to detect with a hash of (tool_name, canonicalized_args) stored in a per-conversation set, refuse on hit. Time-to-implement: an afternoon. Catches the empty-docs case directly.
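
A minimal sketch of that check, assuming JSON-serializable arguments. It leaves out the time window and records a call when it is proposed rather than when it succeeds, both of which a real gate would need to decide on; `DuplicateGate` is an illustrative name.

```python
import hashlib
import json

def call_fingerprint(tool_name, args):
    """Stable hash of a tool call; sort_keys makes key order irrelevant."""
    canonical = json.dumps([tool_name, args], sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

class DuplicateGate:
    """Per-conversation set of fingerprints; refuses a repeat of an earlier call."""

    def __init__(self):
        self._seen = set()

    def allow(self, tool_name, args):
        fp = call_fingerprint(tool_name, args)
        if fp in self._seen:
            return False
        self._seen.add(fp)
        return True

gate = DuplicateGate()  # one instance per conversation
first = gate.allow("create_doc", {"title": "Product Catalog Summary"})
second = gate.allow("create_doc", {"title": "Product Catalog Summary"})
```

In the empty-docs trace, `second` is the point where the loop would have stopped.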

Semantically-equal arguments (hypothetical). Imagine an agent that calls create_invoice once with {"amount_cents": 184000, "currency": "USD"} and then with {"currency": "USD", "amount_cents": 184000.0}. Or it passes a phone number as +1 (415) 555-0142 once and 4155550142 the next time. A byte-level hash sees two different payloads and misses the duplicate; the downstream system happily double-acts. The fix is per-tool canonicalization: each tool declares which arguments are identifying and how to normalize them. amount_cents is an int. phone runs through a normalizer. email lowercases. It's annoying to write. We haven't been bitten by this one yet because the writes we've connected so far have been simple enough that byte-equality catches the dupes; the moment the tools touch money I expect that to change.
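
A sketch of what per-tool canonicalization might look like. The tool names other than create_invoice, the registry shape, and the phone rule (US numbers only) are assumptions made for the example; a production normalizer would use a proper phone-number library.

```python
import re

def to_int(value):
    """Accept 184000 or 184000.0; reject anything with a fractional part."""
    if int(value) != value:
        raise ValueError(f"not a whole number: {value!r}")
    return int(value)

def normalize_phone(value):
    """Digits only, minus a leading US country code (toy rule, US numbers only)."""
    digits = re.sub(r"\D", "", value)
    return digits[1:] if len(digits) == 11 and digits.startswith("1") else digits

# Hypothetical per-tool declaration: identifying argument -> normalizer.
CANONICALIZERS = {
    "create_invoice": {"amount_cents": to_int, "currency": str.upper},
    "send_sms": {"phone": normalize_phone},
    "send_email": {"email": lambda v: v.strip().lower()},
}

def canonicalize(tool_name, args):
    """Normalize the identifying arguments; non-identifying ones are dropped."""
    rules = CANONICALIZERS[tool_name]
    return {name: rules[name](args[name]) for name in sorted(rules) if name in args}

a = canonicalize("create_invoice", {"amount_cents": 184000, "currency": "USD"})
b = canonicalize("create_invoice", {"currency": "USD", "amount_cents": 184000.0})
```

Hashing the canonical form instead of the raw payload makes `a` and `b` collide, which is the behavior you want from a duplicate check.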

Idempotency-key collisions (hypothetical). If you're calling Stripe, you're probably passing Idempotency-Key headers. Now suppose the model decides to retry and generates a new idempotency key because it considers the retry a "new attempt." The downstream system, doing its job, treats them as two separate operations. The shape of fix we like: the agent doesn't get to mint idempotency keys; the layer does, derived from the canonicalized arg hash. The model can ask for create_invoice ten times and the layer hands Stripe the same idempotency key every time, and Stripe returns the same invoice. The model's freedom to retry stays intact. The blast radius doesn't. This is design reasoning, not measured outcomes.
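
Deriving the key in the layer could be sketched like this. Scoping the key to a conversation ID is an assumption added here, not something stated above, and the function only produces the key; attaching it to the outgoing request is left to whatever HTTP client you use.

```python
import hashlib
import json

def idempotency_key(conversation_id, tool_name, canonical_args):
    """Derive the key from call content, so a model-initiated retry reuses it."""
    payload = json.dumps(
        [conversation_id, tool_name, canonical_args],
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

args = {"amount_cents": 184000, "currency": "USD"}
retry_args = dict(reversed(list(args.items())))  # same content, different key order

key_first_try = idempotency_key("conv_42", "create_invoice", args)
key_retry = idempotency_key("conv_42", "create_invoice", retry_args)
```

Because the model never sees or supplies the key, it has no way to turn a retry into a "new attempt" from the downstream system's point of view.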

Intent-equal calls with different arguments (hypothetical, and the one I'd most like to never see). Consider an agent that calls create_invoice for an order, then calls notify_customer(template="receipt", order_id=...) — which, because of how the receipt template was written months ago, internally re-charges the saved payment method. Two different tools, two different argument shapes, one duplicate charge. You can't detect that with hashing. You'd need a side-effect graph: each tool declares which downstream resources it mutates, and the layer refuses a second mutation of the same resource within a conversation without explicit confirmation. This is the most expensive of the four to build, and it's still on our list rather than behind us.
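
The smallest version of a side-effect graph is a table of which resources each tool mutates, plus a per-conversation record of what has already been touched. Everything named here (the resource kinds, the `MUTATES` table, scoping by a customer ID) is hypothetical and only meant to show the shape.

```python
# Hypothetical declarations: which downstream resource kinds each tool mutates.
MUTATES = {
    "create_invoice": {"payment_method"},
    "notify_customer": {"payment_method", "outbox"},  # receipt template re-charges
    "lookup_invoice": set(),
}

class SideEffectGate:
    """Refuses a second mutation of the same resource within one conversation."""

    def __init__(self):
        self._mutated = set()  # (resource_kind, scope_id) pairs already touched

    def check(self, tool_name, scope_id, confirmed=False):
        touched = {(kind, scope_id) for kind in MUTATES[tool_name]}
        repeats = touched & self._mutated
        if repeats and not confirmed:
            return False, sorted(repeats)
        self._mutated |= touched
        return True, []

gate = SideEffectGate()
invoice = gate.check("create_invoice", "cust_1")
receipt = gate.check("notify_customer", "cust_1")
```

The receipt call is refused even though its tool name and arguments look nothing like the invoice call, because both declare the same payment method as a mutated resource. The check is only as good as the declarations.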

Authorization without bureaucracy

Once you have a gate, the next question is what passes through it. There's a tempting failure mode here, which is to require human approval on every write tool call. This works for about a week, after which the human stops reading the approvals and clicks "approve" on everything. Now you have a worse system than no approvals, because you have a paper trail of someone having allegedly approved a duplicate charge.

The layered shape we've converged on, again as design reasoning rather than measured deployments:

Allowlist for low-stakes, high-frequency calls. Reads, mostly. Idempotent writes against scratch space. These shouldn't need authorization at all; they need rate limits and arg validation. Failing to allowlist them is how you end up with a 45-second reply latency because a human in Slack is being asked to "approve" a customer lookup.

Per-conversation grants for medium-stakes calls. Granted once at the top of a session by an authenticated user, applies to all subsequent calls in that conversation within a stated ceiling. The model can iterate, the user isn't interrupted, and the grant has a ceiling and a TTL. Cost: one round-trip at the start of the session.

Inline HITL for high-stakes or out-of-policy calls. Anything that crossed the side-effect-graph boundary, anything past a per-tool ceiling, anything explicitly destructive. These pause the agent, surface a structured prompt to a human, and resume on approval or denial. Done correctly, these are rare enough that the human actually reads them. Done badly, they degrade into the "approve all" pattern.

The cost-UX tradeoff is real and not something I can tell you the right answer to. Every grant decision you push to the user is a chance for them to turn the agent off. Every grant you don't push is a chance for the agent to do something expensive. The position I'd defend is: reads always allowlisted, writes under a per-tool ceiling get a per-conversation grant, everything else interrupts. The exact thresholds are tunable per deployment and you'll get them wrong on first try.
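
That position fits in one function. In this sketch the ceiling is expressed in cents and the allowlist is a fixed set of read tools; both are simplifying assumptions, as are the tool names.

```python
import time
from dataclasses import dataclass

@dataclass
class Grant:
    """Per-conversation grant with a spending ceiling and an expiry time."""
    ceiling_cents: int
    expires_at: float
    spent_cents: int = 0

READ_ALLOWLIST = {"lookup_customer", "lookup_invoice"}  # hypothetical tool names

def authorize(tool_name, amount_cents, grant, now=None):
    """Return 'allow' or 'ask_human' for one proposed call."""
    now = time.time() if now is None else now
    if tool_name in READ_ALLOWLIST:
        return "allow"  # reads: no approval, only rate limits and validation
    within_grant = (
        grant is not None
        and now < grant.expires_at
        and grant.spent_cents + amount_cents <= grant.ceiling_cents
    )
    if within_grant:
        grant.spent_cents += amount_cents
        return "allow"  # covered by the session grant
    return "ask_human"  # out of policy: pause for inline approval

grant = Grant(ceiling_cents=10_000, expires_at=100.0)
decisions = [
    authorize("lookup_customer", 0, None, now=0.0),
    authorize("create_invoice", 6_000, grant, now=1.0),
    authorize("create_invoice", 6_000, grant, now=2.0),
]
```

The second invoice would push the session past its ceiling, so it interrupts the human while the first one does not.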

What happens when the model insists

You denied the call. The model wants to make it anyway.

There's a specific failure shape we expect once denials are in the loop: the model proposes a call, the layer refuses, the model proposes it again (slightly differently worded), refused. Third try, refused. This is the same loop shape as the empty-docs case at the top of the post, only with the layer playing the role Google Drive played there (saying no instead of saying yes).

Two design moves that we think address it:

Loop detection. Count refusals per tool per conversation; trip a circuit at N=3. Past the trip, the layer stops engaging the tool entirely for the rest of the conversation and surfaces the refusal upstream. The Google Docs case wouldn't have been helped by this directly (each call was successful, so there were no refusals to count), but the framing — track per-tool patterns at the conversation level — would have caught it.
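
The refusal counter is a few lines. This sketch covers only the refusal side with the N=3 trip point mentioned above; catching repeated successful calls, as in the empty-docs case, would need a second counter keyed on the call fingerprint.

```python
from collections import Counter

class RefusalBreaker:
    """Trips after max_refusals refusals of the same tool in one conversation."""

    def __init__(self, max_refusals=3):
        self.max_refusals = max_refusals
        self._refusals = Counter()

    def record_refusal(self, tool_name):
        self._refusals[tool_name] += 1

    def is_open(self, tool_name):
        """True once the tool should no longer be offered to the planner."""
        return self._refusals[tool_name] >= self.max_refusals

breaker = RefusalBreaker()  # one instance per conversation
breaker.record_refusal("create_invoice")
breaker.record_refusal("create_invoice")
open_after_two = breaker.is_open("create_invoice")
breaker.record_refusal("create_invoice")
open_after_three = breaker.is_open("create_invoice")
```

Once the breaker is open, the layer can drop the tool from the planner's tool list for the rest of the conversation instead of refusing it a fourth time.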

Graceful refusal. The refusal isn't an exception. It's a structured tool result the planner can read and reason about:

{
  "status": "refused",
  "reason": "duplicate_of_call_01",
  "policy": "no_duplicate_create_invoice_per_conversation",
  "previous_result": { "invoice_id": "inv_77b2", "status": "created" },
  "suggested_next": ["lookup_invoice", "notify_customer"]
}

Worth saying: I don't have before/after metrics on what graceful refusal does to loop length in production traffic, because we don't have that production traffic yet. The intuition is that a structured refusal with the previous result and a suggested next action lets the model say "ah, the invoice already exists, I'll just send the receipt" and move on, whereas an opaque "tool error" leaves the planner with nowhere to go. If you've shipped this pattern with real volume, I'd love to hear what you actually saw.

Replan vs. hard-fail is the last branch. Replan is the default. Hard-fail is for unrecoverable cases: the user is unauthenticated, the grant ceiling was exceeded, the side-effect graph says the resource is locked. In those cases the layer returns a refusal and sets a conversation-level flag the agent reads as "stop attempting this category of action." The model still gets to say something useful to the user; it just can't try the action again.

What this doesn't catch

Two things I want to be honest about.

One: graceful refusal doesn't save you from the case where the model paraphrases its way around the denial. If the layer refuses create_invoice and the model two turns later calls create_charge — a sibling tool from the same SDK with the same downstream effect — the byte-hash won't catch it, the semantic canonicalizer won't (different argument shapes), and the side-effect graph will only catch it if you declared the shared resource. Declaring shared resources for every new tool is exactly the kind of thing that's easy to forget.

Two: nothing in this post is in production at scale. The Google Docs incident is real. The lessons we drew from it are honest. The design we've been building toward is what's described here. But "we caught X% of duplicates" or "loop length dropped Y%" — I'd be making those numbers up if I quoted them. Don't trust anyone's policy-layer pitch that comes with crisp metrics this early; ours included.

The thing I do believe, having pulled the thread on one boring incident for a few weeks: the layer isn't the product. The catalog of failure modes you've actually seen and encoded is the product. The layer is just the substrate that lets you encode them.

Close

If you've shipped agents that did something they shouldn't and you're tired of finding out post-hoc, we'd be useful to compare notes with. hivein is the layer that takes your agentic app to production — the tool-execution policy described above is one piece of the stack, alongside the agent orchestration, planner observability, and trust-to-act primitives the post draws on. The beta is invite-only through W6. If your failure pattern resembles anything here, or you've seen one we haven't, ping us at https://hivein.ai — the landing page is itself an agent built on hivein, so you can talk to it directly.