Yesterday, Anthropic's Claude Code source code leaked. The entire safety system for dangerous cybersecurity work turned out to be a single text file with one instruction: "Be careful not to introduce security vulnerabilities."
That is the safety layer at one of the most powerful AI companies in the world. Just a prompt asking the model nicely to behave.
This is not a shot at Anthropic. It is a symptom of something the whole industry is dealing with right now. We have confused guidance with enforcement, and as agents move into production, that distinction is starting to matter a lot.
Why prompt guardrails feel like they work
When you are building an agent in development, prompt-based guardrails seem totally reasonable. You write something like "never delete production data," the model follows it, and you ship it. It works.
The problem is that prompts are probabilistic. The model does not follow your instructions because anything forces it to. It follows them because that response is statistically likely given your system prompt, and that is a fundamentally different thing.
That gap is small in a controlled demo, but it widens under a few conditions that come up all the time in production.
Prompt injection happens when an attacker embeds instructions inside content your agent reads, whether that is a document, an email, or a database record. The injected instruction competes with your system prompt, and researchers have shown attack success rates exceeding 90% against production guardrail systems.
Multi-step reasoning is another problem. A prompt check happens at the input boundary, but agents do not operate at the input boundary. They reason across multiple steps, call tools, read results, and reason again. A message that looks completely clean at the first step can trigger a dangerous tool call three steps later that no classifier ever saw.
Model updates create a third issue. Your guardrail was tuned against one version of the model, and when the model updates, the probability distribution shifts. The guardrail that worked last month might not work the same way next month.
None of this is theoretical. The OWASP Agentic Top 10, published in late 2025, documents ten agent-specific attack categories that did not exist in the original LLM threat model, and most of them happen entirely outside the layer that prompt guardrails watch.
Where the gap actually lives
Here is what happens when a LangGraph agent calls a tool:
```python
# The agent decides to call a tool
tool_call = {
    "name": "stripe/refund",
    "arguments": {"amount": 800, "customer_id": "cust_123"},
}

# The tool executes
result = stripe_refund(amount=800, customer_id="cust_123")
```
There is a moment between the agent deciding to call that tool and the tool actually running, and that moment is where enforcement has to happen. Not before the prompt, not after the response, but right there between intent and action.
Prompt guardrails do not live in that moment. They live before it, in the system prompt, where the model reads them and decides whether to follow them. If the model has been manipulated, or is just statistically unlikely to comply, nothing fires.
A runtime enforcement layer lives in that moment. It intercepts the tool call before it executes, checks it against policies defined in code, and makes a deterministic decision: permit, deny, or defer for human approval. The model does not get a vote.
What this looks like in practice
With Faramesh, you add one command to run your agent:
```shell
faramesh run agent.py
```
No SDK changes, no changes to your agent code. Faramesh wraps the execution layer and checks every tool call against your policies before anything runs.
Those policies are written in FPL, the Faramesh Policy Language. It is a domain-specific language built specifically for agent governance, with agent-native concepts as first-class primitives: sessions, delegation chains, budget limits, and human approval flows. Unlike YAML or OPA Rego, FPL is readable by engineers and non-engineers alike, and the same policy that takes 60+ lines of YAML takes around 50 lines of FPL with stronger guarantees.
A policy for a payment agent looks like this:
```
agent payment-bot {
  default deny
  model "gpt-4o"
  framework "langgraph"

  rules {
    deny! shell/* reason: "never shell"

    defer stripe/refund
      when amount > 500
      notify: "finance"
      reason: "high value refund"

    permit stripe/*
      when amount <= 500
  }
}
```
This is deterministic. It does not matter what the model was told or what was injected into the context. A refund over $500 gets deferred to a human every time, and a shell command gets blocked every time. That is what enforcement actually means.
The deny! effect is worth pointing out specifically. It is a compile-time guarantee, meaning no subsequent permit rule can override it, no child policy can override it, and the compiler verifies this structurally. It is not a runtime convention. It is a guarantee baked into the language itself.
Why this matters beyond security
This is not only a security problem. It is a deployment problem.
Right now, agents can assist with payments, infrastructure changes, customer operations, and credential management, but most teams will not deploy them autonomously into those workflows because there is no layer they can trust to keep things in bounds. Prompt guardrails create the appearance of control; a runtime enforcement layer creates actual control. That distinction is what unlocks the use cases that make agents genuinely valuable: not just as assistants, but as workers that can operate the systems that matter.
The Claude Code leak is a good reminder that even the companies building the most sophisticated AI in the world are still relying on text files to enforce safety boundaries. That is simply where the industry is right now. The enforcement layer that should exist is what we are building at Faramesh.
The core repo is open source at github.com/faramesh/faramesh-core. Would love to hear from anyone building agents in production.