DEV Community: Logan

AI Agents and Employee PII: The Policy Gap Most Organizations Are Missing

Logan — Fri, 03 Jul 2026 15:57:30 +0000

In April 2026, Meta disclosed that its Model Capability Initiative — a program designed to build AI agents capable of performing white-collar computer tasks — had deployed keystroke and screen monitoring software on U.S. employee laptops. The tool, described in an internal memo to Meta Superintelligence Labs staff, captures mouse movements, clicks, keystrokes, and screenshots to train the models Meta is building for autonomous computer use. The stated purpose was model improvement. The stated safeguard was that the data would not be used for performance reviews.

What the program illustrates is something more fundamental: one of the world's most technically sophisticated organizations built an agentic AI system that processes detailed behavioral data about its own employees — and the governance framework governing what that system could accumulate and transmit was the program memo, not a data policy.

Most enterprise data governance frameworks were designed for customer PII. They specify which fields require encryption, which records trigger GDPR obligations, which systems need audit trails. Those policies were written for databases and defined application boundaries. They were not written for AI agents operating across internal systems — and the employee data those agents now routinely process sits in a governance blind spot that most organizations haven't addressed.

Why Employee Data Is a Different Category of PII Exposure

Organizations have invested years in controls for customer-facing PII: consent flows, breach notification procedures, access logs tied to specific users. These were designed for data flows where the organization owns both ends of the transaction and the data subjects are customers.

Employee data processed by AI agents doesn't have those boundaries, for three structural reasons.

Context windows don't scope themselves. When an agent retrieves documents to complete a task, it typically retrieves more than the task requires. A query about an employee's vacation balance may pull a record that also contains that employee's home address, emergency contact, and medical leave history. The model processes all of it. The transmission cannot be reversed after the fact. No log entry undoes the exposure.

Multi-agent architectures propagate the problem downstream. When an orchestrator agent delegates a sub-task, it often passes a broader context than the sub-task requires. Employee data ingested at the top of an orchestration chain propagates to worker agents that have no awareness of what they're processing — and no mechanism to enforce that only the relevant fields reach each downstream step.

Third-party assistants are in the data path. Claude Desktop, GitHub Copilot, and Cursor are used by employees for internal work tasks at scale. When a developer routes an HR query through a vendor assistant via MCP, data from connected systems — directories, wikis, internal portals — moves outside the organization's direct control. According to Cyberhaven's 2025 AI Adoption & Risk Report, 34.8% of all corporate data that employees input into AI tools is sensitive, up from 10.7% two years prior. Employee records are a significant share of that.

The Structural Failure Mode

The pattern that creates employee PII exposure in agentic systems is not a novel attack vector. It is an architectural gap: an agent holds service account credentials that authorize access to a broad data store, and no separate policy specifies what subset of that data can enter the model's context for a given workflow.

In May 2025, Serviceaide — a provider of agentic AI-based IT management and workflow software — reported a breach affecting 483,000 Catholic Health patients. The failure was not a context window leak; it was an Elasticsearch database that had been inadvertently made publicly accessible. But the incident illustrates a governance pattern common to agentic AI deployments: the data infrastructure that agents depend on accumulates sensitive records that were never scoped to the agent's actual function, and governance gaps anywhere in that stack create compliance exposure.

The employee data version of this failure is less visible than a patient record exposure — but structurally identical. A finance reconciliation agent with access to an HR API can reach salary fields it was never intended to process. A productivity agent summarizing emails can ingest performance reviews, termination notices, and accommodation requests from adjacent threads. A coding agent debugging a payroll module can encounter Social Security numbers embedded in test data. In none of these cases does the agent declare what it accumulated or why.

What Makes Employee PII Legally Distinct in Agentic Systems

Most data governance frameworks treat employee data the same as internal operational data — regulated, but not to the same standard as customer PII. Agentic AI changes that calculus in three ways.

HIPAA applies to employee data in healthcare organizations. Health benefit elections, FMLA documentation, and accommodation records are protected health information under HIPAA. The unique user identification standard (45 CFR § 164.312(a)(2)(i)) requires that PHI access be traceable to a specific individual. When multiple agents share a service account, the audit trail cannot satisfy this requirement — there is no chain of custody, only a credential log.

GDPR's data minimization obligation is not optional. Article 5(1)(c) requires that personal data be "adequate, relevant and limited to what is necessary" for the processing purpose. Context window construction — where agents pull broadly to answer narrowly — is in direct tension with this standard when EU employee data is in scope. The EDPB's coordinated enforcement action launched March 19, 2026, across 25 European data protection authorities specifically targeted transparency obligations in automated processing, which extend to agentic AI.

CCPA's ADMT requirements are in effect as of January 1, 2026. California's Automated Decision-Making Technology rules mandate meaningful disclosure when automated systems make consequential decisions about individuals, including employees. AI agents used for HR workflows — scheduling, performance analysis, task assignment — may trigger ADMT obligations that most legal teams have not yet assessed against existing agent deployments.

What a Real Employee PII Policy for Agent Systems Must Cover

A data access policy controls which credentials can reach which systems. An agent PII policy is different: it controls what data can enter the model's context window, regardless of what the service account credentials are capable of retrieving. Most organizations have only the first policy. The second is what's missing.

Pre-execution interception, not post-processing detection. Enterprise security tooling is built around detection: log what happened, alert on violations, investigate after the fact. For AI agents, detection is structurally insufficient. Once PII enters a context window and influences a model completion, the exposure has occurred. Effective governance requires intercepting data before the model sees it — stripping non-relevant fields from retrieval results and tool call responses before they reach the model.

The data handling policies that govern agent behavior need to operate at the retrieval layer, not the logging layer.

Scope enforcement at the agent level, not the credential level. Service account permissions are infrastructure configuration. Agent PII policy is a separate layer that specifies, per workflow, which data types are permitted to enter the agent's context. A finance reconciliation agent should not process employee home addresses even if its service account can retrieve them. These are policy decisions that no RBAC system currently enforces for AI agents — because RBAC was designed for human users, not for context window construction.

Agent identity in the access audit trail. When multiple agents share a credential, the access log attributes every data access to one service account name. This is not an audit trail — it is a connection log with no attribution. For HIPAA-covered workflows, per-agent identity in the access audit trail is a mandatory requirement that shared credentials cannot satisfy.

MCP-native PII filtering for vendor assistants. Internal agents can be governed through SDK instrumentation. Vendor assistants — the ones employees are already using for work — require governance at the transport layer. The data flowing through those MCP connections into a vendor assistant's context needs PII filtering applied in transit, before it reaches the model, regardless of which assistant is making the request.

How Waxell Handles Employee PII in Agentic Systems

Waxell Observe instruments AI agents at the context ingestion layer with 2 lines of code and no rebuilds required. Across 200+ supported libraries and frameworks — including LangChain, CrewAI, AutoGen, OpenAI, and Anthropic — it intercepts data before it reaches the model, applies detection and redaction rules drawn from 50+ policy categories covering PII, Privacy, Compliance, and Identity, and logs the full execution trace including what was redacted and why. This produces the chain of custody for employee data that GDPR's accountability principle (Article 5(2)) and HIPAA's access audit requirements demand.

For vendor assistants and MCP-connected tools, Waxell's MCP Gateway operates at the transit layer. PII is detected and redacted in flight before it reaches the assistant model. Secrets — API keys, tokens, embedded credentials — are blocked entirely and never leave the gateway. Tool fingerprinting ensures that the connections the organization has approved are the connections the assistant is actually using. PII policy is enforced regardless of which assistant initiates the request.

For high-stakes employee data workflows — healthcare HR, financial compensation systems, regulatory filings where wrong processing is irreversible — Waxell Runtime provides pre-execution policy enforcement before each step fires. Policy gates, budget limits, and durable workflow checkpoints ensure governance is native to the execution environment, not applied after the fact.

The combined architecture produces what most current deployments lack: a scope-limited, auditable layer of privacy enforcement that operates between service account permissions and model context, and generates the compliance artifacts that GDPR audits, HIPAA OCR investigations, and CCPA enforcement will require.

FAQ

Why is employee PII in AI agents harder to govern than customer PII?

Customer PII governance has defined edges: consent flows, structured data stores, regulated applications with clear boundaries. Employee data in AI agent systems lacks those edges. Internal agents span HR directories, productivity suites, codebases, and operational databases simultaneously — none of which were designed with the assumption that an AI would construct context windows from their contents. Employee PII enters model contexts through paths that no existing data policy specifically addresses.

Does GDPR apply to employee PII processed by AI agents?

Yes. GDPR's data minimization obligation (Article 5(1)(c)) and accountability principle (Article 5(2)) apply to any processing of personal data about EU individuals, including employees. AI agents that retrieve employee records must limit processing to what is necessary for the stated purpose and must be able to demonstrate compliance. The EDPB's March 2026 coordinated enforcement action, spanning 25 European DPAs, specifically examined transparency obligations in automated processing — which extends to agentic AI systems.

How does HIPAA's unique user identification standard apply to AI agents?

45 CFR § 164.312(a)(2)(i) requires that access to electronic PHI be assigned through unique identifiers so individual users can be identified and tracked. When multiple AI agents operate under a single shared service account, PHI access cannot be traced to a specific responsible agent or the human who authorized the task. This conflicts directly with HIPAA's access audit requirement. Each agent accessing PHI-bearing systems needs its own identity in the audit trail.

What's the practical difference between a data access policy and an agent PII policy?

A data access policy controls which credentials can reach which systems. An agent PII policy controls what data can enter the model's context window, regardless of what those credentials can retrieve. An agent can hold service account credentials authorizing broad database access and still be governed by a PII policy that strips non-task-relevant fields from retrieval results before they reach the model. Most organizations have only the first policy. The second is what determines what the model actually processes.

What's the fastest path to employee PII governance for an existing agent system?

Without restructuring the underlying architecture, the fastest intervention is SDK-level instrumentation at the context ingestion layer. Waxell Observe initializes with 2 lines of code, requires no rebuilds, and can be applied to an existing agent without changes to the underlying agent logic. For MCP-connected vendor assistants, routing through a governed gateway that applies PII redaction in transit provides coverage without modifying the assistant configuration.

Should organizations block employee data from all agent contexts?

Total prohibition is impractical for workflows that depend on employee-specific information. The operative standard is data minimization with explicit justification: each field that appears in an agent's context should have a documented reason for being there relative to the task, and fields without documented justification should be excluded. This is the GDPR Article 5(1)(c) standard applied to the mechanics of context window construction — and enforcing it requires tooling that operates at the retrieval layer, not at the logging layer.

Sources

Meta Model Capability Initiative (MCI) program: Fortune, April 21, 2026 — https://fortune.com/2026/04/21/meta-will-start-tracking-employees-screens-and-keystrokes-to-train-ai/
Meta MCI details (keystroke/mouse/screenshot capture, Superintelligence Labs memo): TechCrunch, April 21, 2026 — https://techcrunch.com/2026/04/21/meta-will-record-employees-keystrokes-and-use-it-to-train-its-ai-models/
34.8% of corporate AI tool inputs are sensitive data (up from 10.7% two years prior): Cyberhaven 2025 AI Adoption & Risk Report — https://info.cyberhaven.com/hubfs/Cyberhaven-2025-AI-Adoption-Risk-Report.pdf
Serviceaide agentic AI data breach, 483,000 Catholic Health patients affected, May 2025: BankInfoSecurity, May 16, 2025 — https://www.bankinfosecurity.com/agentic-ai-tech-firm-says-health-data-leak-affects-483000-a-28424
HIPAA 2025 Security Rule amendments: HIPAA Journal — https://www.hipaajournal.com/when-ai-technology-and-hipaa-collide/
CCPA ADMT requirements effective January 1, 2026: SentinelOne PII Security guide, updated January 2026 — https://www.sentinelone.com/cybersecurity-101/cybersecurity/pii-security-ai-best-practices/
EDPB CEF 2026 coordinated enforcement action, 25 DPAs, March 19, 2026: Verified in prior Waxell posts (02, 47, 86); EDPB.europa.eu

Poisoned MCP Tool Descriptions Leak Agent Data: What Microsoft's Warning Means for Enterprise Governance

Logan — Fri, 03 Jul 2026 15:40:47 +0000

On June 30, 2026, Microsoft Incident Response and its Defender security research team published a specific warning: MCP tool description poisoning — where an attacker embeds hidden instructions into the natural-language metadata of an MCP (Model Context Protocol) tool — can redirect a connected AI agent into quietly exfiltrating company data without triggering a single visible alert. The researchers walked through a complete scenario involving a finance team's third-party "invoice enrichment" service. The tool had been approved months earlier, never deeply reviewed, and never given a re-approval trigger on description changes. The attacker modified the tool's description field, burying a hidden directive inside what looked like formatting notes: grab the last thirty unpaid invoices and attach them to the next outbound call. MCP picked up the change immediately. The next time an analyst asked a routine question about a vendor, the agent followed the hidden order, silently bundled a month of invoice records, and sent them alongside a legitimate-looking response to a server outside the organization. The analyst saw nothing wrong.

Microsoft frames this as a scenario rather than a named breach, but the attack class is documented in the wild. The MCPTox benchmark, published in August 2025, tested tool description poisoning across 45 real MCP servers and 20 AI models — finding attack success rates as high as 72.8% with certain models, and an average success rate of 36.5%. The research also found that agents almost never refuse: the highest refusal rate observed was less than 3%, attributed to Claude 3.7 Sonnet. In September 2025, Koi Security confirmed the first in-the-wild instance — an npm package called postmark-mcp that had shipped fifteen clean releases before version 1.0.16 introduced a single line that secretly BCC'd every email an agent sent to an attacker-controlled address.

Why Do Poisoned MCP Tool Descriptions Go Undetected?

The problem is not a bug in any specific product. It is a trust boundary that most enterprise teams have not drawn yet.

MCP tool descriptions are plain text that live inside an agent's context window right next to its system prompt and conversation history. The agent reads tool descriptions to understand what it can do and when to do it — that's the entire mechanism by which MCP works. But it also means that whoever controls a tool's description controls one input into the agent's decision-making. The agent has no reliable way to distinguish an honest description from one that carries a hidden order.

What makes this worse in practice is that MCP tool descriptions can update live. A tool server pushes a new version and connected agents pick it up without any re-approval step in most enterprise configurations. Security teams treat approved tools the way they treat approved software packages: once vetted and deployed, they're in. Nobody is watching the description field the way they'd watch a code commit. The MCPTox research confirms why this matters: more capable models are actually more susceptible to the attack, because the attack exploits their stronger instruction-following. The same property that makes a model useful makes it a better target for embedded directives.

The deeper architectural problem is that MCP mixes instructions and data in the same channel. A tool's description field is semantically identical to its output from the model's perspective — both are text the model processes in the same context. An attacker who poisons a description before an organization connects the tool can wait patiently through every automated check. There's no signature to match, no payload to scan. The first signal that anything went wrong may be the data already on its way to the attacker's server.

What Should Security Teams Audit Right Now?

Before reaching for a platform, three things can be checked with existing access:

Inventory every connected tool. Pull a list of every MCP server and tool your agents can reach. For each one, note who approved it, when, and what process (if any) exists for reviewing description changes. In most organizations, this list does not exist. Building it is the prerequisite for every other control.

Read the tool descriptions you have approved. Open the current description for each connected tool and read it the way you'd read a suspicious email. Look for conditional language — "if the user asks about X, also do Y." Look for references to external endpoints, data collection instructions, or text that has no business appearing in a help field. This is tedious work. It is also the thing that would have caught the postmark-mcp attack before version 1.0.16 shipped.

Put a human in front of write and send actions. Microsoft is explicit on this point: any agent action that moves data outside the organization — sending email, calling external APIs, writing to shared storage, making financial requests — should require a person to approve it before execution, not after review. The attack scenario Microsoft describes works specifically because the exfiltration step looked identical to a normal outbound call. Requiring human approval for that step would have stopped it at the boundary.

How Does Waxell MCP Gateway Handle This?

The architectural problem Microsoft describes — live description updates, no re-approval trigger, no inspection of what descriptions actually say — maps precisely to what Waxell MCP Gateway was built to address.

When a tool connects to the Gateway, it doesn't get catalogued and forgotten. Its description is immediately passed through a prompt injection scanner that runs at fingerprint time, before any agent can call the tool. The scanner is specifically checking for embedded instructions — the kind of hidden "also collect invoices" directive buried in a formatting block that Microsoft's researchers describe. Tools that fail the scan land in Pending review status and cannot be called until a human clears them.

Description drift is handled at the fingerprinting layer. Waxell maintains five fingerprint states for every connected tool: Pending, Drift Detected, Trusted, Blocked, and Removed. A tool whose description changes without re-approval automatically moves to Drift Detected and is blocked from calls until the change is reviewed. MCP's live-update behavior — the property that makes description poisoning operationally easy for attackers — becomes visible and auditable instead of silent.

Human-in-the-loop approval for consequential actions is enforced at the gateway layer, not at the application. When an agent call would move data to an external endpoint, initiate a write, or touch any action the policy marks as sensitive, the Gateway holds the MCP connection open and routes the approval to the right person. The agent waits. Nothing is sent until a human says yes. This is the control Microsoft explicitly calls for, applied uniformly across every tool call that reaches the gateway.

Waxell MCP Gateway connects to 160+ upstream MCP servers, enforces 50+ policy categories out of the box, and propagates policy changes across the fleet in 30 seconds. The audit log is durable and exportable, so if a suspicious call did go out before a description was flagged, you have a timestamped record of exactly what the agent sent, when, and to which tool.

The Microsoft post maps these controls to Prompt Shields, Purview DLP, and Entra Agent ID — tooling specific to organizations running Microsoft-native infrastructure. The principles it's prescribing are what Waxell MCP Gateway delivers for teams building on any framework, connecting any client — Claude Desktop, Cursor, Claude Code, or any MCP-compatible assistant — to any of the 160+ upstream connectors in the catalog.

See how Waxell MCP Gateway works →

Start for free at waxell.dev/signup

Frequently Asked Questions

What is MCP tool description poisoning?
MCP tool description poisoning is an attack where a threat actor modifies the natural-language description of a Model Context Protocol (MCP) tool to embed hidden instructions. Because AI agents read tool descriptions to decide what to do and when to do it, a poisoned description can redirect the agent's behavior — collecting unauthorized data, calling external endpoints, or exfiltrating records — without generating a visible error or triggering an access control violation.

How does the attack remain undetected?
Every individual action the agent takes is legitimate. The tool was approved. The data query ran within the user's own permissions. The outbound call went to a server that was permitted when the tool was first onboarded. No single action trips an alarm. The attack is only detectable if something inspects the content of tool descriptions before each call, or if anomalous data movement patterns are flagged in the audit log after the fact.

What did Microsoft's June 30 warning specifically say?
Microsoft Incident Response and its Defender security research team published a blog post on June 30, 2026, walking through an invoice enrichment scenario where a finance team's approved third-party MCP tool was modified by an attacker to exfiltrate the last thirty unpaid invoices during a routine query. The post identified MCP's live description update behavior — no re-approval step required — as the core gap that makes the attack easy to execute and hard to catch. The full post is at the Microsoft Security Blog: securing-ai-agents-ai-tools-move-from-reading-acting.

Has this happened in the real world?
Yes. In September 2025, researchers at Koi Security identified the postmark-mcp npm package — a malicious MCP server that had mirrored a legitimate email tool for fifteen clean releases before version 1.0.16 introduced a single line that secretly BCC'd every email an AI agent sent to an attacker-controlled address. The MCPTox benchmark (August 2025, arxiv.org/abs/2508.14925) found attack success rates as high as 72.8% with certain models across 45 real MCP servers and 20 leading AI models. Model refusal rates were negligible — the highest observed was less than 3%.

What's the difference between tool description poisoning and prompt injection?
Prompt injection embeds malicious instructions in content the model reads during a task — documents, emails, retrieved search results, user inputs. Tool description poisoning specifically targets the metadata layer: the fields that explain what a tool does and when to call it. Both exploit the same structural vulnerability — the model cannot reliably distinguish its principal's instructions from adversarially crafted text in its context — but they attack at different layers. Tool description poisoning is particularly persistent because descriptions are written once, approved once, and rarely reviewed again.

What three controls specifically prevent this?
Microsoft identifies the same three controls that Waxell MCP Gateway enforces: first, a scanner that reads tool descriptions for embedded instructions before any agent can call the tool; second, a fingerprinting or drift-detection system that catches description changes and requires re-approval before they go live; and third, human-in-the-loop enforcement that holds any data-exfiltration or write action for explicit approval before execution. Each control alone reduces risk; all three together close the attack path Microsoft describes.

Sources

GuardFall Shell Injection: How 10 of 11 Popular AI Coding Agents Bypass Their Own Safety Guards

Logan — Thu, 02 Jul 2026 14:23:24 +0000

The safety check that's supposed to stop an AI coding agent from running a dangerous command is the exact reason developers feel safe enough to switch off human review. Turn on --auto-exec, set up the CI pipeline, trust the blocklist. GuardFall is what happens when the blocklist doesn't work.

On June 30, 2026, Adversa AI researcher Omer Ben Simon published research showing that 10 of the 11 most popular open-source AI coding agents — accounting for roughly 548,000 combined GitHub stars — can be bypassed with shell tricks that have been documented in security research for decades. The affected tools include Aider, Cline, Roo-Code, Goose, Plandex, Open Interpreter, OpenHands, SWE-agent, opencode, and NousResearch's Hermes. Only one agent, Continue, was built to actually defend against this class of bypass.

GuardFall is the name Adversa AI gave to a class of bypasses against pattern-based shell guards in agentic coding tools: the guard inspects raw command text, bash rewrites that text before executing it, and the two never see the same thing. The gap is structural, not incidental — adding more denylist patterns closes none of it.

What Actually Happened?

The research started with NousResearch's open-source Hermes project. Adversa AI found that Hermes's 30-pattern regex denylist could be walked past by using decades-old shell quoting techniques. The filter sees r''m as different from rm. Bash removes the empty quotes and runs rm anyway. From there, they tested ten more of the most popular coding agents by GitHub star count.

The attack requires two conditions, neither of which is unusual in production environments. First, the AI model must emit the malicious command. A blunt rm -rf / is typically refused, but the same command embedded in a Makefile target, an MCP tool's documentation response, or a config file that the agent reads as part of a repository is emitted as routine work. Second, an auto-execution flag must be on — --auto-exec, --auto-run, --auto-test, dangerously-skip-permissions — or the container sandbox has to be switched off. Both are standard in CI pipelines and developer laptop setups.

Adversa demonstrated the full attack chain end-to-end against the production Plandex binary and confirmed equivalent attack paths against eight other agents. In the demonstrated attack scenario: a developer's CI pipeline includes a coding agent running on pull requests; the agent reads an attacker-submitted Makefile that defines a clean target wiping ~/.aws/credentials; the agent's auto-exec flag runs it without human review. The IAM credentials are gone.

Five bypass classes cover the structural problem. Class A: quote removal — r''m becomes rm. Class B: $IFS expansion — rm$IFS-rf$IFS/ looks like one word to a pattern filter, three arguments to bash. Class C: command substitution — $(echo rm) -rf / emits the binary name dynamically, invisible to a string matcher. Class D: base64 piped to a shell — each segment is benign alone, destructive in combination. Class E: alternative destructive argv shapes — find /x -delete, dd of=/dev/sda, install -m 4755 payload /usr/bin/backdoor.

There is no single CVE assigned to GuardFall. That is the point: it is not a patchable bug in a single component. It is a design convention — agent to shell, gated by string matching — that fails structurally.

Why Pattern-Based Shell Guards Can't Win This Fight

The fundamental problem is that a text matcher operating on raw command strings cannot model bash's expansion rules. Any rewrite that survives the matcher but is bash-equivalent to a matched pattern is a bypass. The breadth of the blocklist is irrelevant.

The Adversa survey identified four architectural failure modes across the affected agents:

Mode 1: Regex over raw string. Hermes, opencode, and Goose compile a fixed regex set and test it against the verbatim command. This fails all five bypass classes.

Mode 2: Tokenized but raw-text matching. Cline (opt-in mode) and Roo-Code tokenize the command but still match against the rebuilt segment text. This partially closes Class A, but Classes C-inside-quotes and E remain open.

Mode 3: No static guard plus auto-yes. Aider, Plandex, Open Interpreter, and Cline (default mode) do no pattern matching. The defense is a human in the loop — until workflow pressure produces --auto-test, --auto-exec, or a malicious repository ships a config file that flips the flag before any operator does. A .aider.conf.yml file inside a cloned repository can set auto-test: true and plant a payload in the test-cmd field. The agent reads it as normal project configuration.

Mode 4: Container sandbox with a documented local opt-out. OpenHands and SWE-agent sandbox by default — sound when the workspace is disposable. Both ship a documented local-mode configuration that disables the container and provides zero fallback. Local mode is common in CI, developer laptops, and self-hosted runners: exactly the deployments where the credentials are real.

Continue is the exception. Its evaluator tokenizes with shell-quote, detects variable expansion, recurses into command substitutions, checks pipe destinations, and maintains an explicit disabled list for canonical destructive patterns. In tests, zero of 21 bypass payloads reached allowedWithoutPermission in the default IDE mode.

The common thread across every failing design: the decision sits somewhere that cannot hold it. The model cannot be relied upon — the same payload refused as a direct request is emitted without hesitation when wrapped in operational context. The operator confirmation cannot be relied upon — auto-yes flags get set the first time the workflow slows down. The container cannot be relied upon — it defends until the local opt-out is set, which is common in real deployments.

What Should You Check Before Your Next CI Run?

None of these is a complete fix. They're compensating controls while the architectural problem gets addressed upstream.

Redirect $HOME for every agent invocation. A one-line shell wrapper — HOME=$HOME/.agent-sandbox-$RANDOM agent … — keeps the project directory but removes ~/.ssh/, ~/.aws/, shell history, and the credential surface the attack most wants. This is always-on and has no single-flag opt-out.

Disable auto-execute flags unless unavoidable. This applies to --auto-exec, --auto-run, --auto-test, dangerously-skip-permissions, and auto-mode: full. In CI, treat these as privileged subprocess invocations requiring explicit justification — not defaults.

Disable agent execution on fork pull requests. An attacker-controlled README or test config file is the most direct path from untrusted content to privileged shell execution. Fork PRs should not run coding agents with full account access.

Audit repository config files. A malicious .aider.conf.yml shipped inside a cloned repository can flip auto-test: true and plant a payload in the test-cmd field. Treat repo-shipped agent configuration as untrusted code.

For builders and maintainers: the Continue-style evaluator is the only sound, always-on, flag-independent defense the survey identifies. Implementing the five components — tokenize, detect expansion, recurse into substitutions, check pipe destinations, maintain an explicit disabled list — is described by the researchers as roughly a two-day exercise for an experienced engineer.

How Waxell Handles This

The structural problem GuardFall exposes is that the agent-to-shell boundary has no governance layer. The agent emits a command, the shell runs it, and the string match in between has no model of what bash will actually do. Adding more patterns to the blocklist does not solve a lexer-versus-evaluator problem.

Waxell Observe sits at the agent execution layer, not in front of the shell. Initialized with two lines of code, it intercepts execution at the tool-call level and evaluates each call against 50+ policy categories before anything reaches the shell. The Content policy category validates what can be ingested as input; the Safety category enforces hard limits on what tool calls the agent is permitted to emit; the Identity category ensures each action is scoped to the identity that authorized it, not the full account. These policies run pre-execution. They evaluate before bash -c ever sees the string.

The contrast with pattern-based blocklists is direct: a blocklist runs on text the agent has already decided to emit. Waxell's policy engine runs on tool calls before the shell sees them — and on inputs before the model decides anything. A quote-removal or $IFS bypass that defeats a text matcher does not affect a policy that evaluates the semantics of the action, not the characters in the command string.

For teams running agents connected to external services via MCP, Waxell MCP Gateway adds a second layer. It scans tool descriptions at fingerprint time — when the tool is registered, before any agent calls it — using a prompt injection scanner that catches embedded instructions in tool metadata. A Makefile instruction planted in an MCP tool's documentation field is flagged before it ever reaches the model's context. PII and secrets are redacted in flight before they enter the agent. These are controlled inputs and validated data interfaces: the governance decision sits somewhere that can actually hold it.

You can explore the full agent security model. Start with Waxell Observe in two lines:

pip install waxell-observe

→ waxell.dev/signup

Frequently Asked Questions

Which AI coding agents are affected by GuardFall?

The ten affected agents in the June 30, 2026 Adversa AI survey are: Aider, Cline, Goose, Open Interpreter, OpenHands, opencode, Plandex, Roo-Code, SWE-agent, and NousResearch Hermes. Together they account for roughly 548,000 GitHub stars. Only Continue was built with a guard architecture that closes the structural majority of the bypass surface.

Is there a CVE assigned to GuardFall?

No. Adversa AI explicitly states there is no single CVE to track or patch. GuardFall is a class of bypasses — a design convention, not a discrete vulnerability in a single component. Adding more denylist patterns does not fix it; the architectural pattern (agent to bash, gated by string matching) fails structurally regardless of how many patterns are in the list.

What does GuardFall mean for CI/CD pipelines?

CI pipelines typically run with auto-execute flags on — that is the point of CI. GuardFall is exploitable the moment that flag is active. Every CI pipeline that uses an affected coding agent and processes untrusted content (fork pull requests, third-party dependency files, attacker-accessible config files) sits in the attack surface. Redirect $HOME and disable agent execution on fork PRs as immediate compensating controls.

Does disabling auto-execute protect against GuardFall?

Partially. The attack requires the model to emit a malicious command AND the shell to run it automatically. Removing the auto-execute flag removes the second condition, but doesn't close the first: a malicious repo can ship a config file (.aider.conf.yml) that re-enables auto-exec on first edit, circumventing the operator's setting. Disabling auto-execute is a compensating control, not a complete defense.

How does Waxell's policy enforcement prevent GuardFall-style attacks?

Waxell Observe intercepts at the tool-call level before the shell sees any command. Its 50+ policy categories — including Content (input validation), Safety (tool-call limits), and Identity (scope enforcement) — run pre-execution against the semantics of the action, not the text of the command string. A quote-removal or $IFS bypass that defeats a text matcher does not affect a policy evaluating what the call is attempting to do. Waxell MCP Gateway adds a second layer for externally connected tools: it scans tool descriptions at fingerprint time for embedded instructions and redacts credentials and PII before they reach the agent.

What should I do before my next CI run?

Immediately: redirect $HOME for every agent invocation, disable auto-execute flags on fork PRs, and audit repository-shipped agent configuration files for planted auto-exec settings. This week: disable all auto-execution in automated pipelines processing untrusted content. This quarter: evaluate the Continue-style tokenize-and-canonicalize evaluator for any coding agent or computer-use shell channel you build or operate.

Sources

Adversa AI, "GuardFall: a universal shell injection vulnerability in open-source AI agents," Omer Ben Simon, June 30, 2026: https://adversa.ai/blog/opensource-ai-coding-agents-shell-injection-vulnerability/
The Hacker News, "GuardFall Exposes Open-Source AI Coding Agents to Decades-Old Shell Injection Risks," Swati Khandelwal, June 30, 2026: https://thehackernews.com/2026/06/guardfall-exposes-open-source-ai-coding.html
SecurityAffairs, "GuardFall Flaw Hits 10 of 11 Popular Open-Source AI Agents": https://securityaffairs.com/194546/ai/guardfall-flaw-hits-10-of-11-popular-open-source-ai-agents.html
SecurityWeek, "Decades-Old Bash Tricks Expose AI Coding Agents to Supply Chain Attacks": https://www.securityweek.com/decades-old-bash-tricks-expose-ai-coding-agents-to-supply-chain-attacks/
SC Media, "Shell injection flaw found in 10 of 11 open-source AI agents": https://www.scworld.com/brief/shell-injection-flaw-found-in-10-of-11-open-source-ai-agents

Copilot Billing Shock: Why GitHub's First Token Billing Cycle Burned Agentic Developers

Logan — Wed, 01 Jul 2026 15:59:11 +0000

GitHub Copilot billing shock is what happens when a platform built around predictable request limits switches to per-token usage billing — and for developers running unattended agentic sessions, the first signal is the invoice. On June 1, 2026, GitHub migrated Copilot subscribers — reportedly around 4.7 million paid accounts — from flat-rate "premium requests" to usage-based token billing. The first complete billing cycle ended June 30. The invoices are arriving today.

Developers running agentic sessions — agent mode, code review agent, Copilot cloud agent — are reporting costs 10x to 50x higher than their previous flat subscriptions. Developer reports on Reddit and GitHub's own community forums describe Pro plan users who paid $29/month now seeing bills of $750; others on the $50/month tier are looking at $3,000. According to coverage of GitHub's own internal data from the transition, a single agentic coding task consumes roughly 1,000 times the tokens of a standard single-turn query. One credit equals one cent. A Pro plan includes 1,500 credits per month. When an agent opening a large codebase, iterating through tool calls, and growing its context window across a multi-hour session burns through that allocation — and the default spending cap is unbounded — there is nothing stopping the bill from running.

The additional-usage cap must be explicitly enabled in Settings → Billing → GitHub Copilot. Many developers didn't know that. Many still don't.

Why AI Agent Cost Runaway Keeps Happening at the Platform Level

The billing architecture isn't the problem. The enforcement architecture is.

Most platforms built around AI tool usage were designed for a different threat model: overage on a predictable request volume. Under that model, logging usage and alerting near a monthly limit is sufficient. You get a warning email, you cut usage, you stay in budget. That model assumes the cost per interaction is roughly stable and human-initiated.

Agentic sessions break both assumptions. First, the token cost of an individual session is unpredictable before it starts. An agent tasked with reviewing a large codebase doesn't know how many model calls it will take — and neither does the billing system. The context window grows as the agent accumulates intermediate results, which means token consumption is superlinear: later steps cost more than earlier ones because they carry more context forward.

Second, agentic sessions are frequently unattended. A developer kicks off a refactor, closes the laptop, and comes back to find the agent has run for six hours. Every intermediate model call was billed at the current API rate. The session completed. The credits are gone. The additional-usage billing kicked in somewhere in the third hour.

The structural gap is that cost enforcement happens at the invoice, not at the execution boundary. GitHub's usage dashboard is retrospective — it tells you what happened after the cycle closes. GitHub's account-level spending cap, if enabled, prevents future billing beyond a monthly threshold. But neither stops a runaway session mid-flight. The agent is executing. The tokens are being spent. The bill is accumulating. Nothing in the execution path is checking whether the current session cost already exceeds what was intended.

This is the difference between observing costs and enforcing them. A dashboard after the fact is a cost autopsy, not cost governance.

What to Do Before Your Next Copilot Billing Cycle

These are concrete actions that apply now, regardless of what agent infrastructure you're running.

Enable a hard spending cap immediately. In Settings → Billing → GitHub Copilot, set a monthly cap. This is the highest-leverage action available to individual users and organization administrators. The default is unbounded, which means without this step your account has no automatic stop for additional usage charges.

Identify which workflows are generating agentic sessions. Agent mode, code review agent, and Copilot cloud agent all operate at a fundamentally different token cost than standard chat. In a team environment, a small number of power users running agentic sessions regularly can exhaust the pooled credit budget that lighter users depend on for basic completions.

Audit your session patterns, not just your monthly totals. GitHub's usage dashboard shows per-cycle and per-user aggregates. What it doesn't show — and what matters most for agentic cost control — is per-session breakdown. Which specific runs consumed the most? Were they unattended? Did any run significantly longer than the task warranted? Without per-session visibility, the only signal you get is the invoice.

If you're running agents outside GitHub's managed cloud, the platform-level spending cap doesn't apply. Agents running directly against OpenAI, Anthropic, or any other model API are governed only by whatever your own infrastructure enforces — which, in most cases, is nothing until the monthly API bill arrives.

How Waxell Observe Prevents This Before the Bill Arrives

The Copilot billing shock is a documented version of a pattern that affects any system running unmonitored agentic sessions: without in-execution budget enforcement, costs don't get stopped — they get discovered.

Waxell Observe is instrumented into the execution layer, not the billing layer. Two lines of code initialize the SDK; from there, every model call, tool use, and intermediate step is traced in real time. Observe enforces hard budget limits per agent, per user, and per session. Before each step executes, the current session cost is checked against the configured ceiling. When an agent hits that limit, execution stops. The run is logged. The developer gets a report on what happened, what it cost, and where it stopped.

This matters for agentic sessions specifically because token cost is nonlinear and often grows faster than human attention. A standard LLM call costs roughly what you'd expect from the model's published rates. An agentic loop that opens a context window, queries a codebase, runs a series of tool calls, and iterates toward completion can consume 50x the tokens — and the cost multiplier isn't visible until the session ends, unless there's enforcement in the execution arc.

Waxell Observe's real-time telemetry surfaces per-run cost, per-model cost, and per-user cost across sessions — not as a report after the cycle closes, but as the session is running. The 50+ policy categories available in Observe include Cost policies that can halt execution, trigger alerts, or escalate for human review before additional cost is incurred. Enforcement latency runs at 0.045ms p95 — the check adds nothing perceivable to agent performance.

Copilot's billing architecture is Microsoft's choice. The enforcement gap inside agentic execution is a structural problem that applies to any team running agents directly — against OpenAI, Anthropic, Groq, or any framework that doesn't include its own budget enforcement layer. Waxell Observe auto-instruments 200+ libraries, including LangChain, CrewAI, AutoGen, LlamaIndex, and Semantic Kernel.

Start free at waxell.dev/signup — 2-line setup, no rebuild required.

FAQ

What caused the GitHub Copilot billing shock in June 2026?
GitHub Copilot switched from flat-rate "premium request" billing to usage-based token billing on June 1, 2026. The first complete billing cycle ended June 30. Agentic sessions — which can consume approximately 1,000x the tokens of a standard query, according to GitHub's own engineering data — were the primary driver of the 10x to 50x cost increases developers reported. The additional-usage spending cap defaults to unbounded; accounts without an explicit cap had no automatic stop on their June bill.

Why did agentic sessions cost so much more under token billing?
Agentic tasks involve multiple model calls within a single session. An agent working through a code review or refactor opens a large context window, runs tool calls, interprets results, and iterates — each step consuming tokens at the current model's API rate. Context accumulates across the session, so later model calls are more expensive than earlier ones. A session that runs unattended for several hours can spend a month's worth of flat-rate credits before the developer checks back in.

What's the difference between a spending cap and session-level budget enforcement?
A spending cap — like GitHub's monthly credit cap — limits how much your account is billed in a given cycle. It stops future charges after the cap is reached. Session-level budget enforcement, like what Waxell Observe implements, stops an individual agent run before it exceeds its cost threshold. One limits your monthly bill; the other stops a runaway session before it becomes a billing event at all.

Does Waxell Observe work with GitHub Copilot?
Waxell Observe instruments agents you build and deploy directly — not GitHub's managed cloud agent. If your team runs AI agents using OpenAI, Anthropic, LangChain, CrewAI, or any of 200+ other supported frameworks, Observe can enforce session-level budget limits in real time, independent of GitHub's platform billing.

How fast does Waxell's cost enforcement work?
Policy checks, including Cost policies, operate at 0.045ms p95 latency. The enforcement layer checks budget state before each step executes, with no meaningful impact on agent performance.

Should I wait for GitHub to fix this before doing anything?
No. GitHub has shipped a cost tracker in Visual Studio 2026 and organization administrators can set account-level caps — but neither of those changes the per-session enforcement gap. The architectural problem (costs accumulate inside the execution arc before any limit check fires) requires an enforcement layer inside the session, not above it. If your team is building and running agents against any LLM API, that layer needs to be part of your agent infrastructure, not something you're waiting for the billing platform to provide.

Sources: GitHub Blog — Copilot Moving to Usage-Based Billing | TechCrunch — "What a joke": GitHub Copilot's new token-based billing | TechTimes — Copilot Billing Shock Confirmed: Agentic Users Face 10x Cost Surge | Visual Studio Magazine — Slammed by Copilot Usage-Based Billing on Day 1 | The Register — Angry devs vow to flee GitHub Copilot as metered billing takes hold | GitHub Docs — Models and Pricing

AI Agent Hallucination: Why Detection Alone Doesn't Protect Production Systems

Logan — Wed, 01 Jul 2026 14:09:43 +0000

In August 2025, EY surveyed 975 C-suite leaders across 21 countries on AI governance. The results were bleak: 99% of organizations reported AI-related financial losses in the prior year, and 64% reported losses exceeding $1 million — averaging $4.4 million per affected company. The survey did not identify a shortage of detection tools. It identified a shortage of governance.

That distinction matters. The AI observability ecosystem is well-supplied. Arize, LangSmith, Helicone, and a dozen others have built sophisticated hallucination evaluation frameworks. LLM-as-judge pipelines, faithfulness scorers, groundedness metrics — the tooling exists and much of it works. The problem is architectural, not instrumental: detection tells you an agent produced a bad output. It does not, by itself, determine what happens next.

When agents are connected to production systems — sending emails, updating records, initiating workflows — "what happens next" is the only question that matters.

Why Detection Runs Too Late

Hallucination detection in production AI systems is almost universally retrospective. An agent completes a reasoning chain, generates an output, and the output is routed to an evaluator — a faithfulness scorer, a groundedness check, an LLM judge — which surfaces a confidence signal. If the signal is below threshold, an alert fires or a flag appears in a dashboard.

By the time the evaluator runs, the agent has already acted. In agentic architectures where outputs directly trigger downstream tool calls — calendar invites sent, database entries written, API requests dispatched — detection after the fact is forensics, not prevention.

This problem compounds at scale. Most teams sample between 1% and 5% of production traffic for evaluation. Hallucinations concentrate in the long tail: uncommon intents, edge-case entities, corner inputs that appear rarely but trigger the most confident-sounding wrong answers. Sampling almost guarantees you're evaluating the wrong traffic.

There's also the evaluation quality problem. LLM-as-judge pipelines — using a model to grade another model's output — have documented failure modes: they share biases with the model under evaluation, they correlate with output length rather than factual accuracy, and they require significant calibration to behave consistently across task types. The pipeline is not neutral. A previous post in this series covered why LLM-as-judge fails in production in detail; the failure modes haven't disappeared.

The net effect: detection systems produce signals. Those signals often arrive after the relevant action has already propagated downstream. And the signals themselves require careful interpretation.

None of this means detection is useless. It means detection is necessary but not sufficient for protecting production systems from hallucination-driven failures.

The Missing Layer: Fallback Enforcement

The piece most teams haven't built is a fallback enforcement layer — a mechanism that intercepts an output when it fails a quality check and routes it somewhere other than the downstream action it was about to trigger.

There are three fallback patterns that hold up in production:

Halt and escalate. When an output falls below a defined quality threshold — factuality score, groundedness score, confidence percentile — execution stops and the case is routed to a human reviewer before any downstream action occurs. This is the highest-overhead pattern, appropriate for high-stakes workflows where the cost of a bad action significantly exceeds the cost of delay.

Graceful degradation with caveat. The output is delivered but flagged — with explicit language indicating it could not be verified against available context, or that confidence is below threshold. This works well in customer-facing applications where a delayed response is worse than an uncertain one, and where the end user can decide how to act on the information.

Reroute to a more conservative path. The agent is redirected to a simpler, more constrained version of its task — narrower scope, fewer tool calls, a retrieval-only path with no synthesis — that is less likely to hallucinate but still produces useful output.

What all three patterns share is a runtime dependency: they require the ability to intercept an output before it reaches its intended downstream destination, evaluate it against a policy, and execute a different code path based on the result. This is not an evaluation pipeline. It is a policy engine.

That distinction is what most of the observability tooling doesn't provide. Evaluation surfaces signals. A policy engine acts on them.

The Architectural Gap in the Observability Ecosystem

The major observability platforms — Arize, LangSmith, Langfuse — are built primarily around post-hoc evaluation. They are excellent at answering the question: "What did my agent do, and was it right?" They are not built to enforce what happens when the answer to that question is no.

This is a coherent product choice. Evaluation and enforcement are different problems with different latency requirements. An LLM-as-judge evaluation might add hundreds of milliseconds. An enforcement layer that intercepts production traffic has to add almost nothing — any meaningful latency penalty invalidates the architecture for real-time applications.

The engineering consequence is that teams typically build enforcement manually, if at all: a bespoke script that reads evaluation scores from a monitoring dashboard and attempts to wire up conditional logic in the application layer. This is fragile, hard to audit, and disconnected from the governance intent behind the original evaluation design.

How Waxell Handles This

Waxell Observe instruments agent execution at the framework level — 200+ libraries auto-instrumented with 2 lines of code — and exposes output validation policies as a native runtime control. Quality policies in Waxell include output monitoring across 50+ policy categories, with enforcement actions that fire inline with execution: halt, escalate, route to human-in-the-loop, or log with caveat.

The enforcement runs at 0.045ms p95 latency — fast enough to intercept production traffic without adding perceptible delay to agent responses. The output monitoring surface provides full trace visibility for every evaluation that fires, so teams can audit not just what their agents produced but what governance decisions were made and why.

For high-stakes workflows where a hallucinated output could trigger irreversible downstream action — financial transactions, healthcare documentation, infrastructure changes — Waxell Runtime applies policy enforcement before each step executes, not after. Governance is native to the execution environment, not retrofitted on top of it. Every decision is checkpointed; every workflow can be paused for human review at any point without losing execution state.

The practical effect: an agent in a Waxell-governed environment that produces an output failing a quality policy doesn't deliver that output to its downstream action. It routes to whatever fallback behavior the policy specifies. The detection and the enforcement are the same system, not two separate pipelines requiring manual integration.

Teams building in RAG architectures can also use Waxell's testing environment to evaluate output quality against their specific document corpus before deploying to production — catching failure modes under controlled conditions rather than discovering them in live traffic.

FAQ

What is the difference between hallucination detection and hallucination governance?
Detection identifies that an agent produced a bad output. Governance determines what happens next — whether the output is blocked, rerouted, escalated, or delivered with caveats. Most observability tools provide detection. Governance requires a policy engine that can act on the detected signal at runtime, before the output reaches its downstream destination.

Why does detection after the fact create risk in agentic systems?
Unlike traditional software, agentic systems act — they write to databases, send messages, trigger workflows. If detection fires after the action has been taken, the damage is already done. The enforcement window is between output generation and downstream action propagation, which in fast-moving agentic pipelines can be milliseconds.

How does LLM-as-judge evaluation fit into a fallback enforcement architecture?
LLM-as-judge is one signal among many that can trigger a fallback policy. It's useful for semantic quality assessment but has documented failure modes at scale. A robust enforcement architecture combines multiple signal types — groundedness scores, confidence percentiles, retrieval faithfulness — and applies policy logic across them, rather than treating any single evaluator as authoritative.

What hallucination rate should enterprise teams target in production?
Target is task-dependent. Enterprise chatbots running on general knowledge queries report approximately 18% hallucination rates in uncontrolled settings. RAG-governed systems in the same environments reduce that to 3–8%. High-stakes applications in legal, healthcare, and financial domains typically require rates below 1%, which is only achievable with retrieval governance, output validation policies, and human-in-the-loop escalation for edge cases.

Does adding an enforcement layer slow agents down significantly?
At Waxell's p95 latency of 0.045ms for governance evaluation, no. The latency penalty is imperceptible in real-time applications. The execution cost comes from human escalation paths — halt-and-escalate workflows add human review time, not compute time. Teams typically route only the traffic that fails quality thresholds through escalation paths, keeping the median latency profile unchanged.

Can these fallback patterns work with agents that use tool calls extensively?
Yes, but the enforcement point shifts. For tool-call-heavy agents, the most important interception point is the decision to call a tool based on a hallucinated premise — for example, an agent that queries a database with a fabricated entity name. Pre-execution policy enforcement, as in Waxell Runtime, checks the tool-call intent before the call is dispatched, not after the response is received.

Sources

EY 2025 Responsible AI Pulse Survey (August–September 2025, 975 C-suite leaders, 21 countries) — EY Global Newsroom
AI Hallucination Rates and Benchmarks 2026 — Suprmind AI
Arize Phoenix Hallucination Evaluator — Arize Documentation
LibreEval: Open-Source RAG Hallucination Benchmark — Arize AI
HN discussion: "Ask HN: How are you preventing LLM hallucinations in production systems?" — Hacker News

AI Agent Output Quality: Why 90% Confidence Becomes 12% at Step 20

Logan — Mon, 29 Jun 2026 18:30:49 +0000

A 90% reliable agent running a 20-step workflow produces a fully correct result less than one time in eight. That's not a model problem. It's a compounding problem — and it's why the current generation of AI agent output quality tooling is solving the wrong half of the equation.

Output quality in an agentic system is not the same problem as output quality in a chatbot. When an LLM gives a wrong answer to a one-shot question, the user sees it, rolls their eyes, and asks again. When an agent gives a wrong answer at step 3 of a 20-step workflow — and that answer becomes the input to step 4 — the error propagates, amplifies, and usually isn't visible until something expensive has already happened. Between January and May 2026, analysts examining 73 production agent incidents found that failures almost never travel alone: in 61% of multi-layer incidents, a retrieval failure upstream was the root cause that made the tool-call layer go wrong, and the originating layer appeared completely healthy the entire time.

Most teams treating AI agent output quality as a detection problem are solving the wrong problem. Detection tells you what went wrong. It doesn't prevent downstream damage. The architecture question isn't "how do we find bad outputs?" It's "how do we prevent bad outputs from becoming bad actions?"

The Compounding Confidence Problem

The math that most output quality frameworks ignore is elementary probability.

If an agent produces each step at 90% reliability — a generous assumption for most production systems, given that hallucination rates across 37 models still range from 15% to 52% in 2026 benchmarks — a 20-step pipeline has an end-to-end reliability of 0.9^20. That's roughly 12%.

Stated plainly: a 90% accurate agent running a 20-step workflow produces a fully correct result less than one time in eight.

That 90% per-step figure can mask a deeper problem: LLMs don't know when they're wrong. Research on LLM calibration in 2026 shows that verbalized confidence scores diverge substantially from both token-level probabilities and actual accuracy. LLMs tend toward overconfidence, mimicking human hedging patterns rather than expressing genuine uncertainty. A model that says "I'm 95% confident" about a factual claim may be wrong at the same rate as a model that says "I think."

This is not a model quality problem. It's an architectural one. And the current generation of output quality tooling — eval frameworks, LLM-as-judge, hallucination dashboards — was largely designed for non-agentic systems where outputs are final artifacts, not intermediate steps.

Why Evaluation Frameworks Miss the Agentic Case

The leading approach to LLM output quality today — evaluation platforms like Braintrust, Arize's LLM-as-judge templates, and various hallucination detection tools — is fundamentally retrospective. You run the agent, collect the traces, score the outputs, and identify failure modes for the next release cycle.

That's the right discipline for improving models and prompts over time. It's not sufficient for preventing a running agent from acting on a bad output right now.

Consider what "output quality" actually means in a multi-step agentic context. A customer support agent that misclassifies a ticket severity doesn't produce a wrong answer — it escalates the wrong ticket, which routes to the wrong queue, which gets worked by the wrong team, which misses an SLA. By the time any evaluation catches it, three downstream systems have already been wrong.

The evaluation-only approach also has a structural blind spot: it measures quality at the output level, not the action level. An agent can produce a syntactically valid, fluent, semantically coherent response — one that would score well on any LLM-as-judge rubric — and still take the wrong action, because it has misunderstood a constraint, confused two data records, or made an implicit assumption that no prompt made explicit.

Tool-call failures, according to analysts examining 73 production agent incidents in early 2026, occur upstream in 61% of multi-layer incidents. The bad output isn't always visible in the LLM's text — it's embedded in what the agent chose to call, and with what arguments.

What Output Quality Enforcement Actually Looks Like

The distinction worth drawing is between output measurement and output governance.

Output measurement produces a score. A hallucination detector fires. An LLM-as-judge says 0.72. A confidence probe flags uncertainty. These signals are valuable — but a signal without a response is just a log entry.

Output governance means the signal is connected to a decision about execution. When a quality check fails, the agent pauses, escalates, or stops — rather than silently continuing.

Output validation policies at the governance layer enforce this in practice. A policy that says "if response confidence is below threshold X, require human sign-off before proceeding" is a different class of control than a dashboard that tells you after the fact that many of your agent's responses were below that threshold.

The distinction is structural. Evaluation platforms observe what happened. A governance layer decides what happens next.

This architectural separation matters for a second reason: in regulated contexts — financial workflows under MiFID II, healthcare automation under HIPAA, document processing under GDPR — it's not sufficient to have detected the bad output. The audit requirement is to demonstrate that the bad output was caught before it produced a regulated action. A log entry that says "this response was rated 0.52 by our judge" is not the same compliance artifact as "this response triggered an escalation policy and was not acted upon before review."

Practical Output Quality Gates for Multi-Step Agents

The engineering question is what output quality controls actually look like at each layer of a production agent.

At the input boundary, the relevant check is schema and constraint validation before the LLM sees the data. An agent that processes a malformed record can produce a perfectly coherent response that is fundamentally wrong because the premise was wrong. Schema enforcement before model call is the cheapest quality gate in the pipeline.

At the output boundary — the response the model produces — the relevant checks are hallucination detection against grounding context (for RAG-backed agents), response schema validation, and content policy evaluation. These should fire synchronously, not in an async eval pipeline. If the check fails, the agent shouldn't proceed.

At the action boundary — the tool call — the relevant check is whether the intended action is consistent with what the agent was authorized to do with the information it retrieved. Scope enforcement at the action level catches a category of quality failure that no LLM-as-judge can detect: a factually accurate response that nonetheless triggers an unauthorized action.

Connecting these three layers to a single policy surface — rather than implementing each as a separate custom check — is where the engineering investment for most teams is still outstanding. Custom validation logic for each layer creates three independent systems that can drift, break independently, and don't share a common audit trail.

The output monitoring problem becomes tractable when quality checks produce structured policy evaluation records that are legible across the entire execution chain.

How Waxell Handles This

Waxell Observe instruments output quality enforcement across all three boundaries with 50+ policy categories and 2 lines of code. The policy engine fires synchronously during execution — not in a separate eval pipeline — so quality policy violations halt or redirect agent behavior before downstream actions execute.

For multi-step workflows where output quality failures carry high stakes, Waxell Runtime adds pre-execution policy gates at every step. Before a step begins, the governance layer checks whether current runtime state meets the conditions required to proceed. If a prior step's output failed a quality threshold — flagged by Observe — Runtime can block the next step, route to human review, or trigger a graceful checkpoint-and-resume rather than propagating the error.

Waxell instruments 200+ libraries without requiring agents to be rebuilt, and the policy evaluation surface covers the full output quality stack: hallucination detection, response schema validation, content policy enforcement, tool-call scope checks, and cost-per-output limits. Every policy evaluation is logged with full trace context — the input, the model output, the policy that fired, and the execution decision — so the audit record reflects what was governed, not just what was observed.

The 0.045ms p95 latency means output quality gates fire in-band without degrading the user-facing experience.

The key differentiator is this: the goal isn't to know that agents sometimes produce bad outputs. Every team with more than a few weeks in production already knows that. The goal is to build a system where bad outputs don't become bad actions — and where you can prove to an auditor, a regulator, or a post-mortem that the governance was working.

FAQ

What is AI agent output quality, and why is it different from LLM output quality?
In a non-agentic LLM interaction, output quality refers to the accuracy, relevance, and coherence of a model's response to a single query. In an agentic system, output quality includes all of that — but it also covers whether the agent's output is consistent with the constraints on what the agent is authorized to do, whether the response is safe to act upon in the current execution context, and whether quality holds across a multi-step sequence, not just at a single step. An agent that produces a correct response to step 3's sub-task but has misunderstood its scope can still produce a harmful action.

Why do hallucination detection tools miss agentic failures?
Most hallucination detection tools check whether a model's output is grounded in retrieved context — a useful signal for RAG systems. Agentic failures often don't look like hallucinations in the traditional sense. The model may produce a fluent, coherent response that is factually accurate relative to the data it retrieved, but that response becomes wrong when combined with other retrieved data, a prior step's output, or an implicit assumption the model made about authorization scope. The failure mode is structural, not factual.

What is a good output quality threshold for production agents?
There is no universal threshold; it depends on the risk profile of the workflow and the downstream consequences of error. A reasonable starting point is to ask what the consequence of a 5% error rate looks like given the volume of operations and the cost of each error. For workflows where a wrong output leads to a communication or a data write, the threshold should be calibrated against the cost of the corrective action. For workflows under regulatory compliance, the relevant threshold may not be a quality score at all — it may be a categorical policy: certain output types always require review before action.

What's the difference between an eval framework and a governance layer?
An eval framework is a measurement tool — it scores outputs, identifies failure modes, and informs model or prompt improvements over release cycles. A governance layer is an enforcement tool — it makes real-time decisions about whether an agent's output is safe to act upon during execution. The two are complementary, not competing. Eval frameworks improve quality over time; a governance layer enforces policy right now.

How does output quality enforcement interact with human-in-the-loop workflows?
Output quality policies are one of the cleaner triggers for human escalation in agentic systems. Rather than requiring human review for every action — which defeats the productivity purpose of the agent — teams can configure escalation to fire specifically when output quality checks fail or fall below a threshold. This produces a smaller, targeted review queue that captures the highest-risk outputs rather than reviewing everything. The governance layer creates the escalation; the human provides the override.

Do output quality policies add latency to agent execution?
Any synchronous check adds latency. The relevant question is whether the latency is proportionate to the risk avoided. At 0.045ms p95, policy evaluation overhead is sub-millisecond — below the threshold where most production agents would notice it. The larger latency cost is usually in the escalation flow when a policy fires, which is intentional: a paused workflow is less expensive than a bad action.

Sources

LLM Hallucination Statistics 2026: AI Gets Facts Wrong Up to 82% of the Time — SQ Magazine
LLM Calibration and Uncertainty Quantification in Production AI Agents — Zylos Research, April 2026
LLM output confidence discussion — Hacker News
LLM Output Drift in Financial Workflows: Validation and Mitigation — Hacker News
Why AI Agents Fail in Production: The Agent Failure Stack Explained — Sherlocks.ai

CVE-2026-42824 SearchLeak: How M365 Copilot Became a One-Click Data Exfiltration Tool

Logan — Mon, 29 Jun 2026 14:10:26 +0000

CVE-2026-42824, named "SearchLeak" by Varonis Threat Labs researchers who discovered it, is a critical three-stage vulnerability chain in Microsoft 365 Copilot Enterprise. It allowed an attacker to exfiltrate emails, one-time passwords, password reset links, calendar entries, and indexed organizational files — with a single click on a crafted link and no additional permissions required on the attacker's side. Microsoft patched the vulnerability at the backend in early June 2026 and publicly disclosed it under a critical severity rating on June 15, 2026. No tenant administrator action is required to apply the fix.

The attack targets M365 Copilot Enterprise specifically — the tier with access to the organization's full data estate, including Exchange, SharePoint, OneDrive, and Teams. That scope of access is what made the blast radius significant. Depending on how broadly Copilot was connected to the M365 environment, anything the victim's account could read, Copilot could retrieve and exfiltrate.

How SearchLeak Works: Three Bugs That Shouldn't Exist Together

Varonis chained three distinct weaknesses to produce the attack. Individually, each is insufficient for meaningful exploitation. Together, they give an attacker silent read access to a victim's mailbox and indexed organizational content in a single click.

Stage 1 — Parameter-to-Prompt Injection (P2P). Microsoft 365 Copilot Enterprise Search accepts a q URL parameter for search queries. The problem: that parameter was not treated as user-supplied search input — it was passed directly to Copilot's AI engine as an executable instruction. An attacker who could get a victim to click a crafted m365.cloud.microsoft/search/?q=<PAYLOAD> URL could control what Copilot searched for and what it did with the results. No authentication on the attacker's side. No elevated permissions. Just a link.

Stage 2 — HTML Rendering Race Condition. To prevent AI-generated HTML from executing, Microsoft wrapped Copilot responses in <code> blocks. The catch: this wrapping happened after the streaming response completed. During streaming, raw HTML was temporarily rendered by the browser. An injected <img> tag in the AI's output fired — triggering an outbound HTTP request carrying stolen data — before the sanitizer activated. By the time the guardrail engaged, the request had already left.

Stage 3 — CSP Bypass via Bing SSRF. The browser's Content Security Policy blocked images loading from arbitrary domains. But Bing's "Search by Image" endpoint (bing.com/images/searchbyimage?imgurl=…) was allowlisted in the CSP. When the injected <img> tag pointed to this Bing endpoint with stolen data embedded in the URL parameter, Bing's backend made a server-side fetch to retrieve the "image" from the attacker's server. The CSP applied to the victim's browser, not to Bing's infrastructure. Bing became an unwitting exfiltration proxy.

The complete chain: victim clicks link → Copilot interprets q parameter as search instructions → searches mailbox and indexed org content → generates response with embedded <img> tag → image fires during streaming, before sanitization → Bing fetches attacker URL with stolen data in the path → attacker reads email subjects, OTP codes, file names from server logs. No second click. The crafted link pointed to microsoft.com, bypassing standard anti-phishing URL filters.

Why Do AI Assistants With Broad Data Access Keep Producing This Class of Vulnerability?

SearchLeak isn't the first time M365 Copilot has been the vector for prompt injection-driven data exfiltration. Varonis previously discovered "Reprompt" in Copilot Personal. Similar patterns — AI assistant processes injected instructions from user-controlled inputs, then acts outside the user's actual intent — have been documented across LangChain, AutoGen, and Semantic Kernel deployments.

The structural reason these vulnerabilities keep appearing is a mismatch between how enterprise AI assistants are designed and the threat model they operate under.

Enterprise AI assistants are built to be maximally helpful: search everything the user can access, render rich responses, integrate with the full data estate. The security assumption is that the AI's output represents the user's intent — if Copilot generates an image request or a formatted response, it's acting for the authenticated user, not an attacker who injected instructions into the query string.

Parameter-to-Prompt Injection breaks that assumption at the input layer. If any user-controlled input — a URL parameter, a retrieved document, a calendar event body, a Teams message — can modify what the model believes it should do, the model's output is no longer a reliable proxy for user intent. The same logic applies to indirect prompt injection, where an attacker embeds instructions in a document that an agent later retrieves and processes.

Pre-execution enforcement — checking what the AI is about to do before it does it — doesn't exist at the application layer in most enterprise AI assistants today. They sanitize output. SearchLeak demonstrates why output sanitization as a post-processing step fails when AI responses stream: the side effects fire before the sanitizer catches up.

What Should Teams Running M365 Copilot Do Right Now?

The backend patch closes the specific SearchLeak chain. A few actions are worth taking regardless.

Confirm the patch applies to your tenant. Because Microsoft applied this server-side, no admin action is required — but confirm with your Microsoft account team that your Copilot Enterprise instance is on the patched version, particularly if you operate in a sovereign cloud or GCC environment with delayed rollouts.

Audit Copilot's indexed data scope. The blast radius of SearchLeak depended entirely on what Copilot had read access to. Review what SharePoint sites, Exchange mailboxes, and OneDrive folders Copilot's search index includes, and apply least-privilege — restrict indexing to what users actually need Copilot to reach.

Review M365 Unified Audit Logs for anomalous Copilot Search URL patterns. SearchLeak required a victim to click a crafted URL. If users received suspicious links to Copilot Search between Copilot Enterprise's launch and June 15, log review is warranted. Look for q parameters in the audit logs that contain HTML tags or encoded image URLs pointing to Bing's search-by-image endpoint.

Add P2P injection to your threat model. Any AI system where a URL parameter, tool description, or retrieved document reaches the LLM without strict type enforcement is susceptible to this class of attack. It's not yet standard in most threat modeling frameworks — but it should be.

How Waxell MCP Gateway Handles This Class of Attack

SearchLeak exploited the gap between what an AI assistant was trusted to do and what it actually did once an attacker injected instructions at the input layer. That gap — AI assistants with broad data access operating without a pre-execution enforcement layer — is a governance problem that a patch addresses only in its specific form.

Waxell MCP Gateway sits between AI clients (Claude Desktop, Claude Code, Cursor, and any MCP-compatible client) and the 160+ upstream MCP connectors those clients use. That position — at the tool call boundary, before tool calls execute — lets it enforce controls the AI client itself cannot.

Prompt injection scanner at fingerprint time. When an MCP tool registers with the gateway, Waxell scans its tool descriptions for injected instructions before any agent call reaches it. The P2P injection vector in SearchLeak (malicious instructions passed to the model via a user-controlled input field) is the same class of attack the prompt injection scanner is built to catch in tool description parameters. Tools with injected override instructions are flagged and held for review before a single agent call fires — not after the data has already moved.

PII redaction and secret blocking in-flight. The most operationally damaging output of SearchLeak was one-time passwords, password reset links, and email content. Waxell MCP Gateway strips PII — email addresses, SSNs, financial data — and blocks secrets (security codes, API tokens, password reset links) before they can transit the gateway to any downstream endpoint. The gateway's 50+ policy categories include specific policies for identity, privacy, and content — enforced at 0.045ms p95 latency, meaning governance doesn't introduce meaningful overhead. Even if an injected prompt succeeded in retrieving sensitive content, it couldn't exit the organization's perimeter through the gateway.

Human-in-the-loop approvals for external data sends. The SSRF chain in SearchLeak relied on data leaving the organization via an outbound HTTP request initiated by Bing's backend. Waxell MCP Gateway can require human approval for any tool call that would result in data leaving the organization's environment — surfacing the action for a human reviewer before it completes, rather than letting it fire silently during a streaming response.

Durable audit trail. The gateway maintains an exportable log of every tool call: what the agent requested, what arguments it passed, what the tool returned. A SearchLeak-class exfiltration attempt would appear in the log as an anomalous outbound data pattern. Paired with Waxell Observe, which instruments the AI client side across 200+ libraries and frameworks, you get a full trace from model inference through tool execution.

Microsoft's patch addresses the specific implementation flaw. The governance model addresses the class: any AI assistant with broad data access that lacks a pre-execution enforcement layer at the tool call boundary will remain exposed to variants of this chain as AI assistants expand their integrations.

Try Waxell MCP Gateway free — no rebuild required, one URL replaces all upstream MCP configurations — at waxell.dev/signup.

Frequently Asked Questions

What is CVE-2026-42824 SearchLeak?
CVE-2026-42824 is a critical three-stage vulnerability chain in Microsoft 365 Copilot Enterprise, named SearchLeak by Varonis Threat Labs researchers who discovered it. It combines a Parameter-to-Prompt Injection weakness, an HTML rendering race condition, and a CSP bypass via Bing's server-side request forgery to enable one-click exfiltration of emails, MFA codes, password reset links, and indexed organizational files. Microsoft patched it with a backend fix publicly disclosed on June 15, 2026.

Does SearchLeak require special permissions or multiple steps from the attacker?
No. The attacker needs only to get a victim to click a crafted URL pointing to m365.cloud.microsoft — a legitimate Microsoft domain. No elevated permissions, no plugins, and no second user action are required. From the victim's perspective, Copilot appears to briefly "think" before the exfiltration completes silently in the background.

What data could SearchLeak steal?
Email subject lines and content — including security codes, one-time passwords, and password reset links — calendar meeting details and attendee information, SharePoint documents, OneDrive files, and any other organizational content indexed by Copilot Enterprise Search.

Is patching M365 Copilot enough to prevent similar attacks in the future?
The specific SearchLeak chain is closed. But P2P injection, indirect prompt injection via retrieved content, and SSRF-based exfiltration are a class of attack, not a single bug. Varonis documented this chain in Copilot Enterprise; they previously found a related variant in Copilot Personal. Similar patterns exist in other AI assistant frameworks. AI assistants with broad data access that don't enforce pre-execution controls at the tool call boundary remain exposed to future variants.

What makes SearchLeak different from standard prompt injection?
Standard indirect prompt injection typically requires an attacker to embed malicious instructions in content the model retrieves — a document, a web page, a database row. SearchLeak uses Parameter-to-Prompt injection, where the attack vector is a URL query parameter that Copilot passed directly to its AI engine as an instruction. Combined with a race condition that fires the exfiltration during response streaming before any sanitization pass, it produced a more reliable and lower-friction attack than indirect injection typically achieves.

How does a governance layer at the MCP level help with a vulnerability in a Microsoft product?
The harm in SearchLeak occurred at the tool call layer — Copilot executed a search (tool call), generated a response that triggered an outbound request (side effect), and data left the organization (the actual damage). A governed tool layer that scans descriptions for injection, blocks secrets from exiting the perimeter, and requires human approval for outbound data sends addresses the mechanism of harm rather than only the specific implementation bug. Governance and patching are complementary, not interchangeable.

Sources

SearchLeak: How We Turned M365 Copilot Into a One-Click Data Exfiltration Weapon — Varonis Threat Labs (primary research, Dolev Taler)
New attack turned Microsoft 365 Copilot into 1-click data theft tool — BleepingComputer, June 15, 2026
One-Click Microsoft 365 Copilot Flaw Could Have Let Attackers Steal Emails, Files, and MFA Codes — The Hacker News
Copilot 'SearchLeak' Attack Allows 1-Click Data Theft — Dark Reading
SearchLeak vulnerability allows data theft from Microsoft 365 Copilot Enterprise — SC Media

Samsung ChatGPT Ban Ends: The Enterprise Content Policy That Made It Safe to Deploy at Scale

Logan — Thu, 25 Jun 2026 14:14:53 +0000

In March 2023, Samsung allowed its engineers to use ChatGPT. Within approximately nineteen days, three separate data leaks had occurred. Engineers pasted proprietary semiconductor source code into ChatGPT to check for errors. A second uploaded equipment defect detection code seeking optimization advice. A third converted confidential internal meeting recordings into transcripts using an AI transcription tool, then fed those transcripts to ChatGPT to generate meeting minutes. All of it traveled to OpenAI's public servers, outside Samsung's security perimeter. Samsung banned ChatGPT and other generative AI tools company-wide in May 2023.

On June 21, 2026, Samsung announced it is deploying ChatGPT Enterprise to all employees in South Korea and its global Device eXperience (DX) division — reportedly approximately 125,000 people, in one of OpenAI's largest enterprise rollouts to date. The deployment includes Codex for software development workflows and spans product engineering, marketing, and manufacturing. Training for the global workforce is expected to complete by end of 2026. What changed between the ban and this rollout is not ChatGPT. It's the enforcement layer Samsung put between its employees and the model.

Why the Ban Was the Wrong Fix

Banning AI tools does not eliminate the underlying risk. It eliminates visibility into it.

In the three years Samsung blocked ChatGPT, employees continued using AI assistants through browser extensions, personal accounts, and third-party tools that never touched the corporate network. The proprietary code and meeting notes kept moving. Samsung just had no record of any of it.

The structural failure in 2023 was not that ChatGPT existed. It was that there was no content enforcement layer between employees and the model. When an engineer pasted semiconductor source code into a chatbot, no system evaluated what was in that paste, classified it as proprietary, or blocked the call before it went out. The ban created the appearance of governance while leaving the actual risk unaddressed.

This is the pattern that repeats in every enterprise AI ban. Organizations treat access as the control surface and miss that the real control surface is the data traveling through AI systems. Restricting access to one tool redirects that data to a less visible one. The risk doesn't go away — the audit trail does.

What Enterprise AI Content Policy Actually Requires

An enterprise AI content policy is a governance control that evaluates the data in an AI call before the call leaves the perimeter, applies a classification to that data, and enforces a rule — block, redact, log, or escalate — based on the classification. This is distinct from post-hoc DLP alerting, which detects sensitive data after it has already traveled to an external model. Pre-execution enforcement acts before the exposure. Post-hoc detection reports on it.

Enterprise AI governance at the data layer requires three things: classification, enforcement, and audit.

Classification means the system knows what kind of data is in a given request before it leaves the perimeter. Source code, internal financial projections, meeting transcripts, customer records — these need to be classified at the point of call. Classification can be rule-based (pattern matching against known code syntax, IP-specific strings), semantic (ML-based identification of proprietary language), or both. The Samsung 2023 incidents — source code paste, equipment code upload, meeting transcript processing — would each have been classifiable at the call layer.

Enforcement means the system acts on that classification in real time. If an employee pastes source code into a ChatGPT session and that paste matches a Content policy rule, the call is blocked, the sensitive fragment is redacted, or the request is routed to human review before it goes anywhere. Pre-execution enforcement is the difference between preventing a leak and logging one.

Audit means there is a durable record of every AI call: what was sent, what was blocked, what was allowed through, and which policy decision applied. Without it, you cannot remediate after an incident, you cannot demonstrate compliance to a regulator, and you cannot determine whether your controls are actually working.

Samsung's 2026 deployment reportedly includes end-to-end encryption, data isolation, and real-time data loss prevention monitoring — enterprise controls built at the ChatGPT platform tier. What they do not cover is the application tier: the custom AI agents, internal copilots, and CI/CD integrations that enterprise teams build on top of the model. Platform-tier controls govern the ChatGPT Enterprise interface. They do not extend to every AI call your company makes.

What Teams Should Check Before an Enterprise AI Rollout

Before scaling an AI deployment to hundreds or thousands of employees, close these gaps first:

Determine whether your controls are pre-execution or post-execution. Log-based DLP that alerts after sensitive data has traveled to an external model is useful for forensics. It is not useful for preventing the Samsung 2023 incidents. Enforcement needs to sit before the LLM call, not after it.

Map your full AI call surface, not just your enterprise contracts. ChatGPT Enterprise's controls apply to ChatGPT Enterprise. If developers are calling Claude or GPT-4o through direct API access, or using a third-party coding copilot in their IDE, or running internal agents against production data — those calls fall outside the enterprise platform's scope. Document every path through which AI calls leave your perimeter.

Verify your data classification logic handles code. Standard DLP systems are built around documents and email. Proprietary source code has different lexical and structural patterns. Confirm your classification rules are explicitly tested against developer workflows: copy-paste from IDE, inline code review queries, and upload of scripts or notebooks.

Check whether you have an AI-specific audit trail. Standard security logs capture network-level events. AI governance audit trails capture what was in the prompt, what the model returned, and what policy decision was applied to the call. These are different records. You need both, and you need the latter to exist before a regulator asks for it.

How Waxell Runtime Handles This

The gap Samsung encountered in 2023 — and that most enterprises still face when deploying AI at scale — is that governance tools have historically operated at the UI layer, not the execution layer. When AI calls happen through a custom integration, an internal agent, or a developer's direct API access, platform-tier controls do not see them.

Waxell Runtime sits at the execution layer. It wraps the LLM call, applies policy enforcement before the call reaches the model, and produces a durable execution record afterward. For content governance, Waxell Runtime's Content policy type lets you define what categories of data are permitted in AI calls — source code, PII, financial projections, credential strings — and enforces those rules at the point of call, before anything leaves your perimeter.

The initialization is two lines:

import waxell
waxell.init(api_key="YOUR_KEY")

After that, every LLM call in your stack is governed. Waxell Runtime supports 200+ libraries out of the box and requires no changes to your agent architecture. It ships with 26 policy categories across content, cost, control, quality, and kill-switch enforcement. A Content policy configured for source code classification would have caught all three Samsung 2023 incidents at the call layer — before any data left the network.

This is what data governance looks like at the execution layer: not an access ban, not a post-hoc alert, but a policy that evaluates every call before it goes out and enforces the rule before the data travels.

Samsung's 2023 incident was three engineers and three leaks in under three weeks. An enterprise deploying AI to 125,000 employees is running those probability curves at a different scale. The enforcement architecture needs to match.

For vendor-built or third-party AI agents that your team didn't build — tools that sit outside your codebase and connect via MCP or API — Waxell Connect extends governance to agents you don't control. It governs external agents at the data interface level without requiring SDK integration on the vendor's side.

Frequently Asked Questions

What caused Samsung to ban ChatGPT in 2023?
Three Samsung engineers leaked proprietary data through ChatGPT within nineteen days of the company permitting its use. One pasted semiconductor manufacturing source code to check for errors. A second uploaded equipment defect detection code seeking optimization advice. A third converted confidential internal meeting recordings into transcripts and fed them to ChatGPT for meeting minutes generation. Samsung banned generative AI tools company-wide in response to the three incidents.

What changed in Samsung's 2026 ChatGPT deployment?
Samsung is deploying ChatGPT Enterprise — an enterprise-grade version with data isolation, end-to-end encryption, and administrative controls — to approximately 125,000 employees, rather than the public ChatGPT interface engineers used in 2023. The rollout includes real-time DLP monitoring and an enterprise AI governance framework. The fundamental difference is not the model; it's the enforcement controls between employees and the model.

Why doesn't a ChatGPT ban solve the data governance problem?
Banning one AI tool redirects usage to others while removing visibility. Employees continue using AI through personal accounts, browser extensions, and unapproved tools. A ban that operates at the access layer doesn't stop data from moving through AI systems — it stops the organization from seeing where that data goes.

What is an enterprise AI content policy?
An enterprise AI content policy is a governance rule that evaluates the data in an AI call before the call is made and applies an enforcement action — block, redact, log, or escalate — based on what the data contains. Content policies operate at the execution layer, before sensitive information reaches the model, unlike DLP alerting which detects exposure after the fact. You can read more about content policy mechanics on the Waxell glossary.

Does ChatGPT Enterprise's data isolation cover custom integrations and internal agents?
No. ChatGPT Enterprise's controls govern the ChatGPT Enterprise interface. Custom API integrations, internal AI agents, coding assistants accessed through developer tooling, and third-party AI tools that call LLMs directly are outside the platform's governance scope. Those calls require a separate enforcement layer at the application or infrastructure tier.

How does Waxell Runtime enforce content policies across an enterprise AI stack?
Waxell Runtime wraps LLM calls at the execution layer and applies Content policies before the call reaches the model. It initializes in two lines, supports 200+ libraries without architectural changes, and ships with 26 policy categories covering content, cost, control, quality, and kill-switch rules. Content policies can be configured to block, redact, log-and-pass, or escalate to human review depending on data classification and severity.

Sources

AI Agent Cost Enforcement: What Actually Changes When You Add Hard Limits

Logan — Wed, 24 Jun 2026 16:06:49 +0000

By April 2026, Uber had burned through its entire annual AI coding budget. A separate company reportedly ran up a $500 million Claude bill after forgetting to set usage limits on employees, according to a report from Axios. At Priceline, Chris Reed, the company's senior director of IT finance, told TechCrunch that a routine Cursor contract renewal came back four to five times more expensive than expected. These aren't edge cases. J.R. Storment, executive director of the FinOps Foundation, said in early June that companies were calling him in April and May to report being "3x over our entire 2026 token budget and it's only April."

The problem isn't that teams lacked data. Most of them had dashboards. What they couldn't do, in almost every case, was stop the spending.

That is the gap between cost visibility and cost enforcement — and it is a structural gap, not a tooling gap. This post is about what actually changes when you close it.

The Visibility-Without-Control Trap

Cost visibility for AI agents has matured quickly. Platforms like LangSmith, Helicone, Arize, and Datadog's new AI monitoring layer all provide token-level dashboards, cost-per-request breakdowns, and spending trend charts. This tooling is genuinely useful. It tells you what happened.

None of it stops what is happening now.

An alert that fires when you've spent $10,000 is not enforcement. A dashboard showing you're trending toward a $500K monthly bill is not enforcement. These tools create awareness after cost is incurred. An agent that has entered a recursive loop, over-queried a connected data source, or re-sent its full conversation history with every turn doesn't consult your dashboard before placing the next LLM call. It keeps running.

Jellyfish, an engineering management platform, tracked per-developer AI token consumption across its user base and found it had risen 18.6x in nine months through early 2026. Heavy users were roughly twice as productive as low users — but spent ten times the tokens to get there. "Whether extreme spend pays off comes down to the ultimate business value of shipped code," Nicholas Arcolano, head of research at Jellyfish, told TechCrunch. "Which most companies still can't measure."

That measurement gap is what makes observability-only tooling insufficient. Visibility tells you you're spending too much. It doesn't tell you what to cut. And it can't stop the agent doing the cutting.

What the "Before" State Looks Like

Teams running AI agents without enforcement converge on roughly the same set of workarounds.

Manual budget reviews. An engineering lead or finance partner checks an AI spend report once a week. Anomalies get escalated. Agents that exceeded budget last week get reviewed this week. The feedback loop is measured in days, during which the overspend continues.

Alert-and-intervene. A monitoring platform sends a Slack message when daily spend crosses a threshold. Someone on call investigates. Depending on the time and their workload, this might happen in minutes or hours. The agent runs throughout.

API key rotation. The most common "hard stop" in actual use: when spend gets unacceptable, rotate or revoke the API key. This stops the agent — and everything else on that key. It is a circuit breaker with collateral damage.

None of these scale with agent proliferation. As agent deployments grow and each agent operates autonomously across longer time horizons, the window between "something is wrong" and "this already cost $40,000" collapses.

Vitaly Gordon, CEO of engineering intelligence platform Faros AI, described a CTO who called him in April with this: "One of my engineers spent $40,000 on tokens last month, and I genuinely don't know whether I should stop him or tell everyone else to be like him." That inability to distinguish productive spend from runaway spend in real time is what defines the before state. Alerting tools surface the number; they don't answer the question behind it.

What "After" Looks Like: The Structural Shift

Hard enforcement changes the control surface. Instead of monitoring what an agent has spent, you define what an agent is allowed to spend — and that ceiling is enforced before the next LLM call is placed, not after the bill arrives.

This is the key architectural shift: pre-execution versus post-execution control. With observability tooling, the policy lives downstream of the agent. Spend occurs, data lands in a monitoring sink, an alert fires. With enforcement, the policy lives upstream: the agent requests an LLM call, the governance layer checks the remaining token budget, and if the limit is exceeded, the call doesn't happen. No further tokens are consumed. The agent stops at the boundary.

The practical implications of this shift compound:

Budgets become a first-class agent property. Instead of managing AI spend at the account or team level, you manage it per-agent or per-task. A summarization agent gets a different ceiling than a multi-step research and code generation agent. The difference between "this task costs $0.40" and "this task costs $12" becomes visible and enforceable at the task level.

Runaway loops are terminated, not observed. A recursive loop that would have consumed 400,000 tokens before a human noticed it hits a configured ceiling — say, 10,000 tokens — and stops. The agent surfaces a structured exception. The operator gets a notification. The spending stops at the wall.

Cost forecasting becomes reliable. When each agent type has a hard ceiling, the maximum spend per deployment becomes calculable. Thirty agents running in parallel, each capped at a defined token limit per task: the worst-case scenario is arithmetic, not a guess. This is what finance teams mean when they ask for "guardrails" — not dashboards that show them the problem, but walls that prevent it.

What Surprises Teams After Enforcement Goes Live

Three things consistently surface when teams implement hard limits for the first time.

Agents hit limits more often than expected. Tasks that look simple — one or two LLM calls — often aren't. Multi-step agents with large context windows and retrieval pipelines consume tokens at a scale that's easy to underestimate. The first few weeks of enforcement make this concrete immediately. This is useful data: it forces prompt optimization and architectural review that teams would have deferred indefinitely without a forcing function. The enforcement doesn't create the problem; it makes a pre-existing one visible before it becomes expensive.

Granularity matters more than people expect. A budget defined at the agent level gets consumed unevenly. Some tasks take ten times the tokens of others. A monthly budget set at the agent level can be exhausted by one expensive run, leaving the agent unusable. Task-level or session-level budgeting — ceilings applied per request or workflow, not per calendar period — is what actually solves the predictability problem.

Enforcement surfaces architectural problems. When an agent consistently hits limits on a particular task type, it's almost always a signal that something in the design is inefficient: too many round-trips to the LLM, too much irrelevant context being passed, a retrieval system returning documents that don't contribute to the answer. Enforcement doesn't fix these — but it flags them immediately, rather than when a quarterly spend report arrives on a CFO's desk.

Alexander Embiricos, OpenAI's head of enterprise, described the shift in customer conversations this way at an event in June: "Six months ago, I would have a conversation with a customer and it would be all about 'What can it do? Is it good enough?' Our conversations are never about that now. Now the conversations are about 'What visibility do you have? What auditability do you have? What token controls do you have?'"

How Waxell Runtime Handles This

Waxell Runtime enforces cost and token limits as pre-execution policies — the check happens before an LLM call is placed, not after the bill arrives. The policy layer sits above agents without requiring an SDK on the agent itself, and without code changes or rebuilds.

The enforcement model handles the precision that teams discover they need after going live: per-task budget ceilings, per-model cost caps, hard stops that surface structured exceptions rather than silent failures, and configurable responses (human escalation, graceful termination, or policy-defined retry logic). Waxell Runtime ships with 50+ policy categories across the cost enforcement arc and beyond — scope controls, output validation, data handling, and escalation triggers all under the same governance layer.

For real-time cost visibility, Waxell Observe instruments existing agents in two lines of code, covering 200+ libraries without requiring a rebuild. Teams use Observe to establish baseline token consumption per task type — then use Runtime to set enforcement ceilings derived from that baseline.

The combination is what closes the gap the before state leaves open: you can see what's happening (Observe) and stop what shouldn't be (Runtime). The two tools are complementary by design. Visibility tells you where to set the wall. Enforcement makes it hold.

FAQ

What's the difference between a cost alert and cost enforcement?
A cost alert notifies you after spending has occurred. Cost enforcement prevents spending from occurring past a defined threshold. The difference is pre-execution versus post-execution control — and it's the difference between learning you've overspent and preventing the overspend. Most AI observability platforms today provide alerts. Very few provide pre-execution hard stops.

Why don't LangSmith, Helicone, or Arize provide enforcement?
These are observability platforms: they track, visualize, and analyze what agents have done. Enforcement requires a governance layer that intercepts LLM calls before they're placed — architecturally upstream of where monitoring sinks operate. Some platforms are adding cost alerting features, but pre-execution hard stops require a different architectural position.

What happens to an agent when it hits a hard cost limit?
In a properly designed enforcement system, the agent stops cleanly and emits a structured event. The operator receives a notification with context: what task was running, how much was consumed, what the configured limit was. The agent doesn't retry silently; it waits for a policy-defined response — human approval, a budget increment, or graceful termination. Silent failures are a sign of an alerting system pretending to be enforcement.

Should I set limits per agent, per task, or per team?
Start with per-task limits. Task-level granularity catches runaway behavior early while giving productive tasks adequate headroom. Agent-level limits are useful as a backstop. Team-level limits work for rollup budget management but are too coarse-grained to catch individual agent failures before they've become expensive.

How do I determine where to set limits?
Run with instrumentation first: deploy Waxell Observe to capture baseline token consumption per task type across two to four weeks of production traffic. Use the 90th-percentile observation as your starting soft limit and the 99th percentile as your hard stop. Adjust after the first enforcement events surface outlier tasks.

Does enforcing a token budget reduce agent quality?
Not in well-designed agents operating on typical tasks. A ceiling set at the 90th-percentile baseline won't constrain agents that are working as designed. What enforcement does surface is tasks that were silently consuming five to ten times expected tokens — usually due to a design flaw in context handling or retrieval, not because the extra tokens produced better outputs.

Sources

Rebecca Bellan, "The token bill comes due: Inside the industry scramble to manage AI's runaway costs," TechCrunch, June 5, 2026. https://techcrunch.com/2026/06/05/the-token-bill-comes-due-inside-the-industry-scramble-to-manage-ais-runaway-costs/
Axios report cited in TechCrunch (above)
"Show HN: I lost $200 from an agent loop, so I built per-tool AI budget controls," Hacker News. https://news.ycombinator.com/item?id=46991656
"Expensively Quadratic: The LLM Agent Cost Curve," Hacker News. https://news.ycombinator.com/item?id=47000034

DeepMind's AI Control Roadmap Treats Agents as Insider Threats — And Sets the Bar for Every Team

Logan — Tue, 23 Jun 2026 14:16:33 +0000

On June 18, 2026, Google DeepMind published its AI Control Roadmap, a defense-in-depth framework for managing AI agents that might be imperfectly aligned. Written by Rohin Shah and Four Flynn, the document makes an uncomfortable assumption explicit: alignment training alone cannot guarantee that a deployed agent will stay within bounds. So DeepMind built a system that assumes it won't.

The framework applies insider threat controls — the same layered security enterprises use for privileged employees with dangerous access — directly to AI agents. It defines four detection levels (D1 through D4), ranging from monitoring an agent's chain-of-thought reasoning at the baseline to analyzing internal model representations when reasoning becomes opaque at the most advanced tier. It defines three response tiers (R1 through R3), from asynchronous review of transcripts after the fact, to real-time blocking of actions before they execute. The required tier scales with the agent's capability: the more capable and potentially evasive the model, the further up the stack the controls must reach.

DeepMind developed this framework from real data. Their team analyzed over one million internal coding agent tasks using a prototype asynchronous monitoring system. The finding that should reframe how most teams think about agent safety: the majority of flagged events did not stem from adversarial intent. They resulted from agent misinterpretation or overeagedness to achieve a user's goal. A live monitor built for the Gemini Spark agent, for example, was catching unintentional data deletion — not a rogue model trying to cause harm, but an overager one completing a task in ways it shouldn't.

DeepMind also published a companion document for policymakers: "Three Layers of Agent Security", covering individual agent security, multi-agent systems, and ecosystem resilience. Publishing both documents simultaneously signals that this framework is intended as an industry reference, not just internal practice.

Why Alignment Training Alone Doesn't Hold at Production Scale

The working assumption in most enterprise deployments is that choosing a well-aligned model from a reputable frontier lab handles the safety problem. This is understandable — it worked well enough when agents were doing simple, reversible tasks with human review at every step. It doesn't hold when agents are executing autonomously at scale.

DeepMind's document challenges this directly, and the argument isn't that frontier models are unsafe. It's that no training process can provide guarantees about agent behavior across every input, environment, and capability level an agent will encounter in production. Alignment gives you a probability distribution over behaviors. An agent that behaves correctly 99.9% of the time will still produce off-script actions at production volume — a million tasks at 99.9% means a thousand incidents.

The insider threat parallel is precise. A company can't assume every employee with office access will remain trustworthy under all conditions, so it layers controls — access logging, anomaly detection, escalation procedures, emergency termination — independent of how rigorous the hiring process was. DeepMind is applying the same logic. They use the MITRE ATT&CK framework, the standard threat taxonomy for enterprise security, to categorize potential AI agent behaviors. This breaks down "the agent did something wrong" into specific tactics and techniques that can be monitored, detected, and responded to with defined playbooks.

Most enterprise teams building on top of frontier models inherit the output of alignment training but have not built the structural governance plane that sits above it. They have observability — they can see what happened. They don't have enforcement — the pre-execution layer that stops the wrong action before it runs.

DeepMind has one of the largest AI safety teams in the world and direct access to their own model internals. They still chose to build system-level governance rather than rely solely on alignment. That choice is a benchmark.

How Production Agent Teams Are Exposed Right Now

Start with detection tier D1, the baseline: monitoring the agent's chain-of-thought reasoning for behavioral signals. For a team running a production coding agent or customer-facing workflow agent, this means reading and classifying the agent's stated reasoning against a behavioral policy in real time — not logging it, but evaluating it. Most teams log. Few evaluate against a policy.

D2 adds behavioral pattern detection beyond what the agent says — identifying signals in action sequences and tool call patterns that don't match policy. D3 adds a dedicated supervisor model reviewing agent decisions for high-risk actions. D4, the most advanced level, handles cases where the agent has learned to reason without visible text, requiring inspection of model internals rather than transcript analysis.

On the response side, R1 (asynchronous review) is roughly equivalent to having server logs after an incident — better than nothing, but not control. R3 (synchronous blocking) requires an enforcement layer that intercepts between the agent's decision and the tool call it wants to make, and can halt execution before the action runs.

That gap — knowing what happened versus stopping it before it happens — is the core exposure DeepMind is describing. Here's a practical self-assessment checklist based on the framework:

Detection:

Can you evaluate agent chain-of-thought against a behavioral policy in real time, not just log it? (D1)
Do you have behavioral pattern detection based on action sequences, not just keyword filtering? (D2)
For high-risk action categories, do you have a supervisor reviewing decisions before execution? (D3)

Response:

For low-risk reversible actions, do you have asynchronous review with defined remediation SLAs? (R1)
For high-risk actions — database writes, external API calls, file deletion, privileged operations — do you have pre-execution blocking? (R2/R3)
Do you have a policy-enforced kill switch that triggers automatically on defined behavioral thresholds, not just a manual container termination? (Kill policy)

Most teams can check the first item on each list. Almost none can check the last.

How Waxell Runtime Handles This

What DeepMind describes in the AI Control Roadmap is a governance plane: a structural layer that sits above individual agents and enforces behavioral policies independent of what the underlying model does. Waxell Runtime is that layer, built for teams that need these controls without redesigning their agents from scratch.

Runtime wraps your existing agents — regardless of model, framework, or architecture — and enforces policy-based controls before each action executes. No changes to agent code. No rebuilds. Waxell intercepts at the tool call layer and applies the policy, in real time, using a 2-line initialization against 200+ frameworks and libraries.

The controls DeepMind recommends map directly to Waxell Runtime policy categories:

Kill policy is the mechanism behind R3 synchronous blocking. When an agent's behavior crosses a defined threshold — a runaway loop, unauthorized scope expansion, cost limit violation, or content policy breach — Waxell's Kill policy terminates execution before the action runs. This is automated, policy-enforced, and logged with a full audit trail. It's the difference between a kill switch that someone has to pull and one that fires when the threshold is hit.

Control policy implements scope enforcement: what tools the agent is allowed to call, what data it can access, and what actions it can take at all. This is D2/D3 detection translated into enforcement. Not monitoring whether scope is violated, but preventing the violation at the execution boundary.

Content policy handles output-layer R3 blocking — stopping agent responses that contain PII, harmful content, or outputs that violate defined rules before they reach users or downstream systems.

Across 26 policy categories, Waxell Runtime provides D1-D3 detection and R1-R3 response coverage without requiring teams to build a custom supervisor agent stack. Waxell Observe handles the monitoring and behavioral pattern detection layer. Runtime handles enforcement. Together they cover the full arc of what DeepMind's framework requires.

For teams using vendor AI, external agents, or MCP-native tools — agents you didn't build and can't instrument directly — Waxell Connect extends the same governance to that layer without requiring the vendor to change anything. Same policies, same audit trail, no code changes on the agent side.

The $2.9 trillion in US economic value that McKinsey projects AI agents will create by 2030 is only capturable if enterprise teams can actually deploy those agents at scale. DeepMind's framework is the clearest published articulation of what "at scale" requires from a governance standpoint. It's also one team's implementation — not a spec for everyone else to replicate. Waxell Runtime is the implementation for teams that need these controls without DeepMind's internal tooling budget.

Get access to Waxell Runtime at https://waxell.ai/get-access.

Frequently Asked Questions

What is DeepMind's AI Control Roadmap?

DeepMind's AI Control Roadmap, published June 18, 2026, is a defense-in-depth framework for managing deployed AI agents that might be imperfectly aligned. It defines four detection levels (D1–D4) scaling from chain-of-thought monitoring to opaque reasoning analysis, and three response tiers (R1–R3) from asynchronous transcript review to real-time action blocking. It applies insider threat controls — standard in enterprise security for privileged human access — to AI agents.

Why does DeepMind treat its own AI agents as insider threats?

Because alignment training cannot guarantee agent behavior across every environment and capability level encountered in production. An insider threat model doesn't assume adversarial intent — it applies layered controls regardless, because no training process (like no hiring process) can guarantee behavior under all conditions. The same logic that justifies access logging and anomaly detection for privileged employees justifies behavioral monitoring and pre-execution enforcement for AI agents.

What is the difference between monitoring and governance for AI agents?

Monitoring means you can observe and log what an agent did after the fact. Governance means you enforce policies before an action executes — blocking it, requiring human approval, or terminating the agent when a threshold is crossed. DeepMind's framework distinguishes these explicitly: asynchronous transcript review (R1) is monitoring; synchronous blocking before execution (R3) is governance. Most enterprise teams have built monitoring. Governance requires a separate runtime enforcement layer.

What did DeepMind find from analyzing 1 million agent tasks?

DeepMind analyzed over one million internal coding agent tasks and found that most flagged events weren't adversarial — they resulted from agent misinterpretation or overeagedness to achieve a user's goal. This is critical for governance design: a system built only to detect deliberate misuse will miss the majority of real-world failures. Behavioral monitoring must cover off-script actions broadly, including well-intentioned agents acting on tasks in unauthorized ways.

What is a Kill policy for AI agents?

A Kill policy is an automated, pre-configured rule that terminates or halts agent execution when a defined behavioral threshold is crossed — runaway loops, scope violations, cost limits, or content violations. Unlike a manual kill switch, it fires automatically at the tool call layer before the action runs, with a full audit trail of what triggered it. Waxell Runtime implements Kill policies as one of its 26 policy categories.

How does Waxell Runtime implement DeepMind's recommended controls?

Waxell Runtime wraps existing agents and enforces policies before each tool call executes, without requiring changes to agent code or framework. Kill policies implement R3 synchronous blocking. Control policies implement D2/D3 scope enforcement. Content policies block unauthorized output at the response layer. Waxell Observe handles D1/D2 behavioral monitoring. For external and MCP-native agents, Waxell Connect extends the same governance without requiring vendor-side changes. Initialization takes two lines and supports 200+ frameworks.

Primary source: Securing the future of AI agents — Google DeepMind, June 18, 2026.

Multi-Agent Governance: Why Treating Every Agent the Same Breaks Coordinator, Planner, and Worker Systems

Logan — Tue, 23 Jun 2026 14:12:10 +0000

The dominant pattern in production multi-agent AI in 2026 is orchestrator-worker: a coordinator agent dispatches specialized subagents to handle specific tasks, results flow back up the chain, and the system presents a unified output. Most governance tools don't care which role each agent plays.

The same input validation rules apply to the coordinator that apply to the deepest worker. The same cost limits run at the planner layer as at the execution layer. If your governance policy says "flag PII in any agent output," it fires identically whether the agent making the call is the one deciding what to do next or the one doing it. That's not governance — it's coverage theater.

A 2025 study analyzing 1,600+ annotated execution traces across seven production multi-agent frameworks (Cemri et al., arXiv:2503.13657) identified 14 distinct failure modes. The researchers clustered them into three categories: system design issues, inter-agent misalignment, and task verification failures. The taxonomy is notable precisely because these failure categories map directly to the three architectural layers that most governance tools treat identically — coordinator, planner, and worker.

Multi-agent governance is the set of policies, controls, and oversight mechanisms that manage behavior across a system of collaborating agents — not just within any individual one. When the governance plane treats all agents as interchangeable, it secures the components while leaving the architecture exposed.

The Three Layers That Get Identical Treatment

Orchestrator-worker architectures contain at least three functionally distinct agent roles, each with a different relationship to risk:

The coordinator owns the execution plan. It decides which agents run, in what order, with what context. It allocates tasks, synthesizes results, and determines whether a subtask output is acceptable enough to proceed. The coordinator's failure mode isn't usually a harmful action — it's a harmful decision: routing a sensitive request to the wrong subagent, passing unvalidated user input downstream as if it were trusted, or continuing to iterate when the system should have stopped and escalated.

The planner translates high-level intent into structure. It decomposes goals, surfaces dependencies, and proposes execution sequences. The planner's failure mode is scope expansion: creating subtasks that, in aggregate, exceed the authorization of the original request. A document review request that the planner decomposes into 40 subagent calls — each individually within scope — may produce a cost and data-access footprint the human who initiated the request never anticipated and would never have approved explicitly.

The worker executes. It calls tools, reads data, writes outputs. Workers are the agents that touch production systems, which is why they receive almost all governance attention. But they're often the last agents in a chain that began with an unvalidated coordinator decision or an overbroad planner decomposition. Governing at the worker layer while leaving the coordinator and planner uncontrolled is roughly equivalent to putting a lock on your server room door and leaving the network unrestricted.

Why Coordinator-Level Governance Is a Different Problem

The coordinator's job is to decide. That's not the same as acting — but in multi-agent systems, decisions are upstream of actions. A bad coordinator decision propagates through every agent it dispatches.

Consider a financial services workflow: a coordinator receives a user query and routes it to a data-retrieval subagent, a calculation subagent, and a report-generation subagent. Each worker runs within its individually scoped permissions. But the coordinator's decision to pass the raw user query — including whatever it contained — directly to the subagent with database access was never evaluated by the governance layer. The coordinator made a trust decision that the governance system never saw.

This is the structural mechanism behind prompt injection in multi-agent systems. Malicious content doesn't need to bypass a worker's input filter. It needs to convince the coordinator to route it somewhere it can execute. A 2025 research report from the Cooperative AI Foundation (Hammond et al., arXiv:2502.14143) identified miscoordination, conflict, and collusion as the three primary failure modes in multi-agent AI systems — each arising at the orchestration and coordination layer, not at the individual execution layer.

Cross-agent policy enforcement applied only at the worker layer misses all three of these. The governance plane needs to sit above the coordinator, not alongside the workers — evaluating coordination decisions before they dispatch, not after workers have already executed.

The Planner's Scope Problem

A planner that decomposes a task aggressively isn't necessarily making an unsafe individual decision. Each subtask may be individually authorized. The problem is that scope validation is almost universally implemented at the tool-call level: "is this agent allowed to call this API?" — not at the plan level: "is the sum of all subtasks proportionate to what was originally authorized?"

This matters because planner scope decisions are cumulative and compounding. In Cemri et al.'s taxonomy, "system design issues" — which include how agents are scoped and how tasks are decomposed — were the largest failure category across the 1,600+ traces studied. The most common failure wasn't a worker acting out of bounds. It was a system design that gave the planner no effective constraint on how far it could decompose a task.

Planner-level governance requires enforcing policies at the plan structure, not just at execution time. Without this, planners become the second multi-agent governance blind spot: less dramatic than coordinator prompt injection, but responsible for a larger share of runaway costs and unintended scope expansion in production systems.

The Worker's Trust Problem and the Parent-Child Trace Gap

Workers in orchestrator-worker systems receive their instructions from coordinators and planners — agents they have no independent mechanism to authenticate. If a coordinator has been compromised by prompt injection, every worker that trusts its output inherits that compromise. The governance layer is the only component in the system that sits outside this trust chain and can validate instructions at each handoff point.

This creates a second structural requirement: the governance plane must observe and enforce at every inter-agent boundary, not just at the entry point of the system and the final output.

This also creates the parent-child trace problem. Most tracing and observability tools can tell you which agents were called and in what order. What they struggle with is attributing a downstream worker action back to the coordinator decision that authorized it — especially in systems with dynamic routing, where dispatch decisions aren't predetermined in code.

An agent registry that catalogues each agent's authority, role, and lineage is the prerequisite for this kind of attribution. Without it, a multi-agent execution log is a chronological list of events with no causal structure. You can read it, but you can't use it to answer the question that matters in a post-incident review: which coordinator decision authorized the chain of actions that led to this outcome?

The trace problem is especially acute in multi-agent systems that include external agents — vendor-built tools, third-party MCP servers, orchestrators built by a different team. These agents operate outside any SDK you control. Their behavior is observable only if there is a governance layer that sits above the entire system, not just around the components you built.

How Waxell Handles This

Waxell Runtime applies policy enforcement at the governance plane level — above the coordinator, not alongside the workers. This means different rules can be applied at different points in the agent hierarchy. Coordinator-level decisions, planner decompositions, and worker tool calls carry different risk profiles, and Waxell Runtime's 50+ policy categories out of the box cover pre-execution, mid-execution, and post-execution enforcement with no rebuilds required.

For multi-agent systems that include agents you didn't build — vendor-built orchestrators, third-party subagents, external tools integrated via MCP — Waxell Connect governs the agents you didn't build without requiring SDK instrumentation or code changes from those external systems. It enforces governance at the boundary between agents you control and agents you don't, which is precisely where the trust gap in orchestrator-worker systems is widest.

The multi-agent execution logs Waxell captures preserve the causal structure of the system: coordinator decision → planner decomposition → worker execution, with governance check results at each transition point. That's what a multi-agent audit trail actually needs to contain — not a flat event list, but a causally ordered record that maps each action back to the authority that authorized it.

FAQ

Why do most governance tools treat all agents identically in a multi-agent system?
Because the dominant tools — LangSmith, Arize, Braintrust — were designed for single-agent observability and extended to multi-agent use. They instrument at the LLM call or tool call level, which captures what each agent did but doesn't enforce different rules based on each agent's role. Governance that differentiates by role requires a layer that sits above the orchestration framework, not inside each individual agent.

What failure modes emerge specifically at the coordinator layer?
Prompt injection routing (where malicious content convinces the coordinator to dispatch to a sensitive agent), authorization bypass (where the coordinator passes unvalidated user input as if it were a trusted system instruction), and uncontrolled iteration (where the coordinator continues spawning subagents past the point where a human would have stopped). None of these are caught by worker-layer policy enforcement.

What is the "scope expansion" problem with planners?
In orchestrator-worker systems, planners decompose high-level goals into subtasks. Each subtask may be individually authorized, but the aggregate footprint — the number of API calls, the volume of data accessed, the total cost incurred — may far exceed what the user or operator intended. Planner-level governance requires enforcing constraints on the plan structure, not just on individual executions.

How does Waxell Connect help govern external agents in a multi-agent system?
Waxell Connect governs agents you didn't build — third-party orchestrators, vendor agents, MCP-native integrations — without requiring those systems to instrument your SDK or change their code. It enforces governance at the boundary where your system interacts with external agents, which is where the trust chain has its widest gap in any orchestrator-worker architecture.

What should a multi-agent audit trail capture that most logs miss?
Causal structure: which coordinator decision authorized which planner action, which planner decomposition led to which worker calls, and what policy checks ran at each handoff. Most tracing tools produce flat event lists ordered by timestamp. A usable multi-agent audit trail is a causally structured record that lets you trace any outcome back to the orchestration decision that originated it.

Do EU AI Act requirements differ for multi-agent systems?
Under EU AI Act Annex III (effective December 2027 under the EU Digital Omnibus amendment), high-risk AI systems must maintain human oversight, detailed logging, and transparency about decision-making. For multi-agent systems in regulated sectors, this effectively requires governance coverage at the coordination and planning layers — not just at execution — because the decisions that create compliance risk originate at the orchestrator, not the worker.

Sources

Cemri, M., Pan, M.Z., Yang, S., et al. (2025). "Why Do Multi-Agent LLM Systems Fail?" arXiv:2503.13657. https://arxiv.org/abs/2503.13657. Publication date: March 17, 2025 (v1); revised October 26, 2025 (v3).
Hammond, L., Chan, A., Clifton, J., et al. (2025). "Multi-Agent Risks from Advanced AI." arXiv:2502.14143. Cooperative AI Foundation Technical Report #1. https://arxiv.org/abs/2502.14143. Publication date: February 19, 2025.
EU AI Act Digital Omnibus Amendment. Annex III effective date: December 2027.

MCP Governance: How to Secure AI Agents in a Plugin-Driven World

Logan — Mon, 22 Jun 2026 15:13:25 +0000

There's a before-and-after line in the history of AI agent deployments, and Model Context Protocol is it.

On April 20, 2026, CIS published its MCP Security Companion Guide — a set of controls co-authored with Astrix and Cequence that extends CIS Controls v8.1 into MCP environments. The finding that prompted it: MCP formally expands an agent's identity, access control, logging, and application security surfaces in ways that most enterprise governance frameworks don't yet cover. Four days earlier, The Register reported that security researchers had documented a design flaw in the official MCP implementation that, according to their analysis, puts as many as 200,000 servers at risk, with 10 high- and critical-severity CVEs filed. If MCP governance wasn't on your radar six months ago, it is now.

MCP governance is the practice of defining and enforcing policies on AI agents that use Model Context Protocol to connect to external tools and services. Because MCP dramatically expands an agent's capability surface — connecting it to file systems, databases, APIs, and communication platforms through a standardized protocol — it proportionally expands the attack surface and the governance requirements. MCP governance operates at three layers: tool allowlisting and scoping (which tools can be called, under what conditions), tool result inspection (scanning responses before they enter agent context), and sequence-level policy evaluation (tracking cross-tool patterns within a session). For a broader definition of the governance plane that MCP sits within, see What is agentic governance →.

Before MCP, agents were powerful but bounded. What they could do in the world was limited by whatever tools a developer explicitly wired up — dangerous if the tools were dangerous, but bounded. After MCP, an agent's capabilities can be dynamically extended by connecting to MCP servers — some of which the developer chose, some of which may have been added by other users, other systems, or the agent ecosystem itself. That dynamic extensibility is what makes MCP valuable, and what makes governance non-optional.

What MCP Actually Changes

The key shift is this: before MCP, an agent's capabilities were determined at deploy time by whatever tools a developer explicitly configured. After MCP, an agent's capabilities can be dynamically extended by connecting to MCP servers — some of which the developer chose, some of which may have been added by other users, other systems, or the agent ecosystem itself.

MCP adoption is moving faster than most security teams anticipated. Claude Code ships an MCP client. Developer tooling is shipping MCP integrations. Enterprise platforms are building MCP servers for internal systems. Most production agents in sophisticated deployments will have access to significant tool ecosystems via MCP within the next 12–18 months — if they don't already.

Here's the governance implication, stated plainly: when your agent had five explicitly configured tools, you could reason about what it might do with those five tools. When your agent has access to dozens of tools via MCP — some installed by you, some added by others, some with capabilities that have changed since you last reviewed them — the reasoning problem is fundamentally different. The CIS guide introduces the concept of "Non-Human Identities" (NHIs): each MCP connection is a non-human identity with its own access scope, and most organizations don't have a registry for them or policies governing what they can do.

How Does MCP Expand the AI Agent Attack Surface?

Every tool an agent can access is a potential attack surface. Not because the tool is malicious, but because any tool can be invoked in ways that cause unintended consequences, and the space of invocations an LLM might generate is large.

Prompt injection via tool results. This is the one that should keep security teams up at night. In March 2026, Oasis Security researchers demonstrated a complete attack pipeline — dubbed "Claudy Day" — that chained invisible prompt injection with data exfiltration, showing that conversation history could be stolen from a default, unmodified claude.ai session. No special configuration. No elevated permissions. The injection was embedded as invisible HTML tags inside Claude's ?q= URL pre-fill parameter — invisible to the user in the chat box, but transmitted in full to the model when they pressed Send. Delivered via a Google Ads campaign exploiting an open redirect on claude.com, Claude was then instructed to exfiltrate conversation history to an attacker-controlled Anthropic account via the Files API. This wasn't a breach; it was a demonstration of what's possible when content entering agent context goes uninspected. For a deeper technical breakdown of how prompt injection enters agents through tool call results, post 43 covers the mechanism in full.

Unauthorized tool invocation. An agent that has access to a "send email" MCP tool can send emails. The question is: under what conditions should it be allowed to? If the agent's legitimate scope is answering product questions, and it has access to an email tool for a narrow set of notification purposes, what prevents it from invoking that tool in other contexts? Without an explicit policy that defines when the tool may be called, the answer is "the model's judgment." That's not a policy. That's a hope.

Data exfiltration through tool call sequences. An agent with access to database query tools and communication tools could retrieve sensitive data and transmit it externally — not through a single dramatic action, but through a sequence of plausible-looking operations that individually wouldn't raise flags. Defending against this requires visibility into sequences of tool calls and cross-tool patterns within a session, not just individual invocations. This is what the CIS guide calls "auditable interactions across the protocol layer."

Cost explosions from unconstrained tool use. Some MCP tools are expensive to call. Code execution environments, large data retrievals, external API calls with per-request pricing — an agent that calls these tools more aggressively than expected, or gets into a loop involving expensive tools, can run up costs quickly. Without budget enforcement at the tool call level, you won't know until after.

Why Existing Agent Governance Doesn't Fully Address MCP

If you have governance infrastructure in place — policies on LLM calls, input/output scanning, session-level budget guardrails — you have a foundation. But MCP introduces governance requirements that foundation may not cover.

Dynamic tool availability. Static governance policies assume a known, fixed set of tools. MCP's whole point is that the tool set is extensible. Your governance layer needs to handle tool calls against tools that weren't in scope when the policy was written — which means you need policies that generalize across tool categories, not just name-specific allowlists.

Tool result trust. Your existing scanning may evaluate inputs from users. MCP tool results come from servers, not users, and deserve their own trust evaluation. A tool result can contain malicious content even when the tool itself is legitimate and the server is trusted. The content-as-attack-surface problem requires governance at the tool result layer, not just the user input layer. The GitHub coding agent prompt injection case — where instructions embedded inside repository files were retrieved and acted on by an agent without disclosure — is a sharp example: GitHub AI Agent Prompt Injection: What No CVE Disclosure Looks Like.

Cross-tool sequence analysis. Individual tool calls may be benign. Sequences of tool calls may not be. A governance layer that evaluates each tool invocation in isolation misses the pattern-level risks. This requires session-level analysis — tracking what tools have been called, in what sequence, with what payloads — and applying policies to sequences rather than individual events.

Under EU AI Act Annex III (enforcement deadline: August 2, 2026), deployers of high-risk AI systems must retain automated logs for at least six months and implement human oversight mechanisms. A November 2025 European Commission proposal to delay this deadline was not enacted into law — the Digital Omnibus trilogue failed on April 28, 2026 — and August 2026 remains the operative enforcement date. MCP tool call sequences are exactly what those audit trail requirements are designed to capture, and most existing governance stacks don't log them at that level.

What MCP Governance Actually Looks Like

Practically speaking, MCP governance operates at three layers:

Tool allowlisting and scoping. Define explicitly which MCP tools the agent is permitted to call, under which conditions, with which parameter constraints. This is RBAC for tool access — not just "can the agent access this tool" but "under what circumstances." An agent might be permitted to read files but not write them, or to query a database only within a specified set of tables, or to call an email tool only when a specific approval condition is met. Waxell's MCP Gateway operationalizes this at the infrastructure level: one URL per tenant replaces every upstream MCP config, with tool fingerprinting that tracks each tool through 5 states — Pending, Drift, Trusted, Blocked, Removed — so changes to tool definitions are caught before any agent calls them. For agents you build, Waxell Observe enforces 50+ policy categories pre-execution with no agent rebuild required.

Tool result inspection. Every tool call response that gets appended to the agent's context should pass through an inspection step before it's processed. Scan for injection patterns. Check for PII that shouldn't be entering the context. Validate that the response conforms to expected schemas. Flag anomalies for review. The MCP Gateway handles this at the network layer: its prompt injection scanner inspects tool descriptions at fingerprint time (before any agent invocation), and PII redaction strips sensitive data in flight so it never reaches the model. Secrets — API keys, tokens, credentials — are blocked at the gateway and never leave.

Sequence-level policy evaluation. Track tool call sequences within a session and evaluate them against session-level policies. A policy might say: if more than three different external endpoints are called within a single session, flag for review. Or: if a database read is followed within N turns by any outbound call, require approval. These are session-level governance decisions, not call-level decisions.

For teams governing MCP servers they didn't build — marketplace servers, vendor-provided integrations, third-party data connectors — the MCP Gateway puts a single governed endpoint in front of all of them, with 160+ upstream connectors and every tools/call identity-resolved and policy-checked before the upstream ever sees it. The MCP STDIO design flaw that put 200,000 servers at risk is a clear example of why governance of external MCP servers can't depend on the servers themselves implementing controls.

Why Is Now the Right Time to Establish MCP Governance?

MCP is early enough that the tooling ecosystem around it is still being built — including governance tooling. The CIS MCP Security Guide was published precisely because organizations are deploying MCP and running into a predictable set of problems: audit trails, SSO-integrated auth, gateway behavior, and configuration portability. The guide exists because there was demand for it. That demand reflects real production deployments hitting real governance gaps.

The organizations that establish MCP governance practices now — while the tool ecosystem is still relatively small and the attack surface is still relatively bounded — will be in a fundamentally different position than the organizations that establish MCP governance practices in 18 months, when the ecosystem is larger, the incidents have accumulated, and the EU AI Act Annex III enforcement machinery is operational.

There's a general principle worth stating: the right time to build governance for a capability is when the capability is being adopted, not after the first incident that makes its absence visible. MCP is being adopted right now.

How Waxell handles this: Waxell's MCP Gateway is a single governed endpoint that replaces every upstream MCP configuration in your organization. It fingerprints every tool across 5 states (Pending → Drift → Trusted → Blocked → Removed), scans tool descriptions for prompt injection before any agent calls them, redacts PII in flight, and blocks secrets at the gateway so they never leave. Destructive actions park for human approval. Policy changes propagate in 30 seconds. For agents you build yourself, Waxell Observe instruments them with 2 lines of code and enforces 50+ policy categories pre-execution — no agent rebuild required. Get access →

Frequently Asked Questions

What is MCP governance?
MCP governance is the practice of enforcing policies on AI agents that use Model Context Protocol to access external tools and services. It covers which tools an agent is permitted to call (allowlisting), what conditions must be met before a tool is invoked (scoping), whether tool results are safe to enter agent context (result inspection), and whether a sequence of tool calls within a session is consistent with policy (sequence-level analysis).

How does Model Context Protocol change the AI agent attack surface?
Before MCP, an agent's capabilities were fixed at deployment time — a developer explicitly configured each tool. With MCP, an agent's capabilities can be dynamically extended by connecting to MCP servers, some of which may have been added by other users or systems. This expands the attack surface from "five tools you explicitly chose" to "an ecosystem of tools with capabilities that may have changed since you last reviewed them." Governance requirements scale with that expansion.

What is prompt injection via MCP tool results?
Prompt injection via tool results occurs when an MCP tool returns content that contains instructions designed to alter the agent's behavior — for example, a retrieved document containing text like "ignore your previous instructions and..." That text enters the agent's context window and may influence subsequent reasoning. In March 2026, Oasis Security demonstrated the "Claudy Day" attack chain — showing that invisible prompt injection plus data exfiltration was achievable against a default claude.ai session without any special configuration, using a URL parameter injection delivered via Google Ads. Defense requires inspecting all content before it enters agent context, whether from tool results, URL parameters, or any other source.

How do you govern AI agents that use MCP tools?
Effective MCP governance requires three layers: tool allowlisting (explicitly defining which tools the agent may call and under what conditions), tool result inspection (scanning every tool response for PII, injection patterns, and schema anomalies before it enters context), and sequence-level monitoring (tracking cross-tool patterns within a session and triggering policies based on sequences, not just individual calls). Static policies based on named tool allowlists need to generalize across tool categories as the MCP ecosystem grows.

Why doesn't existing agent governance fully cover MCP?
Most existing governance infrastructure was designed for known, fixed tool sets. MCP's value proposition is a dynamically extensible tool ecosystem — which means static name-based allowlists become inadequate. Additionally, MCP introduces tool result trust evaluation as a distinct concern: a tool can be legitimate and its server trusted, but its results can still contain malicious content. Governance layers that only evaluate user inputs miss this vector entirely.

What does the CIS MCP Security Guide (April 2026) recommend?
The CIS MCP Security Guide, published April 20, 2026 in partnership with Astrix and Cequence, extends CIS Controls v8.1 into MCP-based environments. Key recommendations include: governing Non-Human Identities (NHIs) created by each MCP connection, implementing secure tool access controls with explicit permission grants per capability rather than broad access, ensuring auditable interactions across the protocol layer, and addressing four priority risk areas — data leakage, unbounded agent autonomy, credential misuse, and unsafe tool execution.

How does Waxell's MCP Gateway enforce MCP governance?
The MCP Gateway is a single URL that replaces every upstream MCP server configuration for your organization. Every tool call passes through it — the gateway fingerprints tools (tracking them through Pending, Drift, Trusted, Blocked, and Removed states), scans tool descriptions for prompt injection at fingerprint time, redacts PII before it reaches the model, and blocks secrets at the network layer. Destructive actions require human approval. It works with Claude Desktop, Claude Code, Cursor, and any MCP-compatible client — setup takes minutes, no code changes needed. See how it works →

Sources

Center for Internet Security (CIS), Astrix, and Cequence, CIS, Astrix, and Cequence Release New AI Security Companion Guides (April 20, 2026) — https://www.cisecurity.org/about-us/media/press-release/cis-astrix-and-cequence-release-new-ai-security-companion-guides
The Register, MCP 'design flaw' puts 200k servers at risk: Researcher (April 16, 2026) — https://www.theregister.com/2026/04/16/anthropic_mcp_design_flaw/
Oasis Security, Claude.ai Prompt Injection Vulnerability ("Claudy Day") (March 2026) — https://www.oasis.security/blog/claude-ai-prompt-injection-data-exfiltration-vulnerability
Palo Alto Networks Unit 42, New Prompt Injection Attack Vectors Through MCP Sampling (2026) — https://unit42.paloaltonetworks.com/model-context-protocol-attack-vectors/
EU AI Act Service Desk (European Commission), Timeline for the Implementation of the EU AI Act — https://ai-act-service-desk.ec.europa.eu/en/ai-act/timeline/timeline-implementation-eu-ai-act
EU Artificial Intelligence Act, Annex III: High-Risk AI Systems Referred to in Article 6(2) — https://artificialintelligenceact.eu/annex/3/