I've been contributing to Microsoft's agent-governance-toolkit for the past few weeks — fixing bugs, writing docstrings, building Colab notebooks. And the more I dug into the codebase, the more one question kept coming up: why can't you just use a well-crafted system prompt to keep your agent safe?
The short answer: you can. Until you can't.
The Prompt Engineering Approach
Prompt-level guardrails look like this:
"You are a helpful assistant. Never recommend illegal activities. Never share personal user data. Always stay within budget constraints."
This works remarkably well — right up until it doesn't. The problem is structural: your guardrail lives inside the model's context window, which means it's subject to the same probabilistic reasoning as everything else the model does.
Three things consistently break prompt-only guardrails:
Jailbreak attacks. Users have discovered that framing requests differently — roleplay scenarios, "hypothetical" framings, multi-step manipulations — can cause models to comply with things they were explicitly told not to do. This isn't a bug in the model. It's a feature of how language models work. They follow the most compelling framing of the current context.
Context window exhaustion. A long conversation eventually pushes your system prompt out of the context window. Your "never exceed budget" guardrail disappears after message 40. The model doesn't know what it forgot.
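This failure mode is easy to reproduce with a naive sliding-window history. A hedged sketch (the message structure, token counting, and `trim_history` helper are illustrative, not any particular framework's API):

```python
# Illustrative: a naive sliding window that trims oldest messages first.
# The system prompt is just another message, so a long enough
# conversation silently evicts it.
def trim_history(messages, max_tokens,
                 count_tokens=lambda m: len(m["content"].split())):
    trimmed = list(messages)
    total = sum(count_tokens(m) for m in trimmed)
    while trimmed and total > max_tokens:
        evicted = trimmed.pop(0)  # oldest first -- including the system prompt
        total -= count_tokens(evicted)
    return trimmed

history = [{"role": "system", "content": "Never exceed the budget."}]
history += [{"role": "user", "content": "message " * 50} for _ in range(40)]

kept = trim_history(history, max_tokens=500)
# The guardrail is gone: no system message survives the trim.
assert all(m["role"] != "system" for m in kept)
```

A real framework may pin the system prompt, but the underlying point stands: anything that lives only in context is at the mercy of context management.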
Model dependency. A guardrail tuned for GPT-4 behaves differently on Claude, which behaves differently on Llama. Every model swap means re-testing every prompt. At scale, that's not sustainable.
The Policy-as-Code Approach
Policy-as-code doesn't ask the model to follow rules. It enforces rules in code, before and after the model is ever called.
Here's what that looks like with the agent-governance-toolkit:
```python
from agent_os.integrations.base import GovernancePolicy
from agent_os.integrations import LangChainKernel

policy = GovernancePolicy(
    blocked_patterns=["DROP TABLE", "rm -rf", r"\b\d{3}-\d{2}-\d{4}\b"],
    max_tool_calls=10,
    require_human_approval=False,
)

kernel = LangChainKernel(policy=policy)
governed_agent = kernel.wrap(my_chain)
```
Every call to governed_agent.invoke() now goes through a policy check before it reaches the model. If the input matches a blocked pattern — SQL injection, destructive command, SSN pattern — the call is rejected. The model never sees it.
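Conceptually, that pre-call check is just pattern matching against the input before the model is ever invoked. A simplified sketch of the idea — not the toolkit's actual internals; the `check_input` function and this `PolicyViolationError` class are illustrative:

```python
import re

class PolicyViolationError(Exception):
    """Raised when an input matches a blocked pattern (illustrative)."""

BLOCKED_PATTERNS = ["DROP TABLE", "rm -rf", r"\b\d{3}-\d{2}-\d{4}\b"]

def check_input(text: str) -> str:
    # Each entry is treated as a regex. A match rejects the call
    # deterministically -- no model ever sees the input.
    for pattern in BLOCKED_PATTERNS:
        if re.search(pattern, text):
            raise PolicyViolationError(f"blocked pattern matched: {pattern!r}")
    return text

check_input("Summarize last quarter's sales")  # passes through untouched
try:
    check_input("My SSN is 123-45-6789")
except PolicyViolationError:
    pass  # rejected in code, regardless of model behavior
```

The key property: the outcome is the same on every model, every time, for every phrasing the regex covers.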
This is a fundamentally different architecture. The guardrail isn't in the model's context. It's in your infrastructure.
Where Each Approach Fails
Neither approach is complete on its own.
Where prompt engineering fails:
- Adversarial users who iterate on jailbreaks
- Long-running agents where context window limits apply
- Multi-model deployments where you can't tune per model
- Compliance scenarios where you need an audit trail
Where policy-as-code falls short:
- Creative tasks where rules can't capture nuance
- Behavioral style (tone, personality, format)
- Rapid iteration — changing a policy requires a code deploy
- User experience — prompts understand context, rules do not
A guardrail that says blocked_patterns=["inappropriate content"] can't capture the infinite ways "inappropriate" manifests in natural language. A prompt can. But a prompt can't guarantee that a rule is never violated. Code can.
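You can see that gap in a few lines. A literal pattern catches the exact string but misses a paraphrase with the same intent (illustrative sketch):

```python
import re

# A literal rule: catches the exact phrasing, nothing else.
blocked = [r"DROP\s+TABLE"]

def is_blocked(text: str) -> bool:
    return any(re.search(p, text, re.IGNORECASE) for p in blocked)

assert is_blocked("please DROP TABLE users")            # exact phrasing: caught
assert not is_blocked("please delete the users table")  # same intent: missed
```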
Real Example: PII in Memory Writes
Here's a concrete failure mode I saw while contributing to this toolkit.
An agent with memory capabilities might inadvertently write a user's SSN into its long-term memory during a conversation about financial planning. A prompt guardrail saying "never store sensitive data" relies on the model recognizing what counts as sensitive — and recognizing it every single time, across thousands of conversations.
The toolkit handles this differently. Memory writes are intercepted at the code layer:
```python
import re

_PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # SSN
    re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),  # email
    re.compile(r"\b(?:password|secret|token)\s*[:=]\s*\S+", re.IGNORECASE),
]
```

(Note the email TLD class is `[A-Za-z]{2,}` — writing `[A-Z|a-z]` would also match a literal pipe character, a common regex slip.)
Before anything is written to memory, it's checked against these patterns. If a match is found, the write is blocked and a PolicyViolationError is raised. The model's cooperation is irrelevant — the rule is enforced in code.
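A minimal sketch of that interception, reusing just the SSN pattern for brevity. The `PolicyViolationError` name follows the post above; the `guarded_write` helper and the in-memory store are illustrative, not the toolkit's API:

```python
import re

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

class PolicyViolationError(Exception):
    pass

def guarded_write(memory: list, text: str) -> None:
    # The check runs before anything touches long-term memory;
    # the model's cooperation is never part of the decision.
    if SSN_PATTERN.search(text):
        raise PolicyViolationError("PII detected; memory write blocked")
    memory.append(text)

memory = []
guarded_write(memory, "User prefers index funds")
try:
    guarded_write(memory, "User's SSN is 123-45-6789")
except PolicyViolationError:
    pass  # the write never happened
assert memory == ["User prefers index funds"]
```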
The Layered Model That Actually Works
The right answer isn't "choose one." It's to use both at the right layer.
Use policy-as-code for:
- Hard constraints that must never be violated (safety, compliance, budget)
- Anything that needs an audit trail
- Rules that apply regardless of model behavior
- Security boundaries (tool allowlists, blocked patterns, call budgets)
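For the security-boundary bullet, the enforcement can be as small as a set and a counter. A hedged sketch (the `ToolPolicy` class and its method names are illustrative, not the toolkit's API):

```python
class ToolPolicy:
    """Illustrative: tool allowlist plus a hard call budget, enforced in code."""

    def __init__(self, allowed_tools, max_calls):
        self.allowed_tools = set(allowed_tools)
        self.max_calls = max_calls
        self.calls = 0

    def authorize(self, tool_name: str) -> None:
        if tool_name not in self.allowed_tools:
            raise PermissionError(f"tool not on allowlist: {tool_name}")
        if self.calls >= self.max_calls:
            raise RuntimeError("call budget exhausted")
        self.calls += 1

policy = ToolPolicy(allowed_tools={"search", "calculator"}, max_calls=2)
policy.authorize("search")       # ok
policy.authorize("calculator")   # ok
try:
    policy.authorize("search")   # third call: budget exhausted
except RuntimeError:
    pass
```

No prompt can guarantee "at most 10 tool calls"; a counter in code can.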
Use prompt engineering for:
- Behavioral style and tone
- Output formatting
- Persona and creative direction
- User experience nuances
Use human-in-the-loop for:
- High-stakes decisions that exceed both layers
- Ambiguous cases the policy engine flags as uncertain
- Exception handling with accountability
Think of it as defense in depth. Prompts guide behavior. Policies enforce limits. Humans handle edge cases. No single layer carries all the weight.
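The three layers compose into a single call path. A sketch of that composition (every name here is illustrative; the toolkit's real wrapper differs):

```python
def governed_call(user_input, policy_check, model, needs_human, ask_human):
    # Layer 1: policy-as-code -- hard limits, checked deterministically.
    policy_check(user_input)      # raises on violation; the model never runs

    # Layer 2: the prompt layer -- tone, persona, and formatting live
    # inside the model call itself.
    draft = model(user_input)

    # Layer 3: human-in-the-loop -- escalate what the first two can't settle.
    if needs_human(draft):
        return ask_human(draft)
    return draft

# Toy wiring to show the flow end to end:
result = governed_call(
    "summarize the report",
    policy_check=lambda text: None,         # nothing blocked
    model=lambda text: f"summary of: {text}",
    needs_human=lambda draft: False,        # low stakes, no escalation
    ask_human=lambda draft: draft,
)
assert result == "summary of: summarize the report"
```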
A Concrete Starting Point
If you're building agents today without a policy layer, here's the minimum viable governance setup using the toolkit:
```shell
pip install agent-governance-toolkit[full]
```

```python
from agent_os.integrations.base import GovernancePolicy
from agent_os.integrations import LangChainKernel

# Start with just these three
policy = GovernancePolicy(
    blocked_patterns=["rm -rf", "DROP TABLE"],  # safety
    max_tool_calls=20,                          # budget
    log_all_calls=True,                         # audit trail
)

kernel = LangChainKernel(policy=policy)
governed = kernel.wrap(your_existing_agent)
```
You don't need to rewrite your agent. You wrap it. Your existing prompts stay. You've simply added an enforcement layer in code underneath them.
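On the audit-trail piece: the value of logging every call is being able to prove, later, what was checked and what was decided. An illustrative shape such a record might take — this is not the toolkit's actual log format:

```python
import hashlib
import json
from datetime import datetime, timezone

def audit_record(agent_input: str, decision: str) -> str:
    # Illustrative audit entry: enough to show what was checked and when,
    # without persisting the raw (possibly sensitive) input itself.
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "input_sha256": hashlib.sha256(agent_input.encode()).hexdigest(),
        "decision": decision,  # e.g. "allowed" or "blocked"
    }
    return json.dumps(entry)

record = json.loads(audit_record("DROP TABLE users", "blocked"))
assert record["decision"] == "blocked"
```

A prompt-only guardrail produces no such artifact; a code-layer check can emit one on every call.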
The Bigger Picture
The shift from prompt-only to policy + prompt is the maturation of agent governance. It's the difference between "I hope my agent behaves" and "I can prove my agent behaves."
Prompt engineering got us to where we are. It's powerful and flexible and necessary. But as agents move into production — taking real actions, handling real data, running autonomously at scale — "I told it not to" is no longer sufficient as a governance strategy.
Policy-as-code is how you make agent behavior auditable, testable, and enforceable. Not instead of good prompts. On top of them.
I'm Kanish Tyagi — MS Data Science student at UT Arlington, open source contributor to Microsoft's agent-governance-toolkit. Find me on GitHub and LinkedIn.
Source code and docs: github.com/microsoft/agent-governance-toolkit