DEV Community: Kanish Tyagi

Agent SRE — SLOs, Error Budgets, and Circuit Breakers for AI Agents

Kanish Tyagi — Fri, 10 Apr 2026 21:30:57 +0000

When a traditional web service goes down, you get a 500 error. You check the logs, find the exception, fix it, deploy. The failure is deterministic and reproducible.

When an AI agent degrades, it's different. It doesn't crash — it just starts giving worse answers. It calls tools more times than it should. It hallucinates details it used to get right. It slows down under load in ways that are hard to measure. By the time you notice, it's been quietly failing for hours.

This is why AI agents need Site Reliability Engineering. Not the same SRE you apply to APIs and databases — a version adapted to the specific ways agents fail.

How Agents Fail Differently

Traditional services fail in discrete, measurable ways: error rate goes up, latency goes up, availability goes down. You set thresholds, you get paged, you fix it.

Agents fail on a spectrum:

Accuracy degradation — outputs become less correct over time as context accumulates
Tool call inflation — agent starts using 15 tool calls for tasks that used to take 3
Hallucination rate increase — factual errors appear more frequently
Task completion drift — agent completes the literal request but misses the intent
Delegation loops — agent spawns sub-agents that spawn more sub-agents recursively

None of these show up as a 500 error. They require different observability, different thresholds, and different response strategies.
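
Tool call inflation, for instance, can be watched with a rolling average compared against a known-good baseline. The following is a minimal sketch in plain Python; the class name, the window size, and the 2x ratio are illustrative assumptions and are not part of the toolkit.

```python
from collections import deque


class ToolCallMonitor:
    """Hypothetical helper: flags drift of the rolling tool-call average above a baseline."""

    def __init__(self, baseline_calls: float, window: int = 50, max_ratio: float = 2.0):
        self.baseline_calls = baseline_calls
        self.max_ratio = max_ratio
        self.recent = deque(maxlen=window)  # tool-call counts for the last `window` tasks

    def record(self, tool_calls: int) -> None:
        self.recent.append(tool_calls)

    @property
    def rolling_average(self) -> float:
        return sum(self.recent) / len(self.recent) if self.recent else 0.0

    def is_inflated(self) -> bool:
        # Inflated when the rolling average exceeds baseline * max_ratio
        return self.rolling_average > self.baseline_calls * self.max_ratio


monitor = ToolCallMonitor(baseline_calls=3)
for calls in [3, 4, 3, 15, 14, 16]:   # made-up per-task tool-call counts
    monitor.record(calls)

print(f"rolling average: {monitor.rolling_average:.1f}")
print(f"inflated: {monitor.is_inflated()}")
```

A rolling window keeps one unusually hard task from paging anyone while still catching a sustained shift.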

Defining SLOs for Agents

A Service Level Objective is a target for how well your service should perform. For a web API, it's usually availability (99.9%) and latency (p99 < 200ms).

For an agent, you need different dimensions:

from agent_os.sre import AgentSLO

slo = AgentSLO(
    agent_id="research-agent",

    # How often should the agent successfully complete its task?
    task_success_rate=0.95,        # 95% of tasks complete successfully

    # How many tool calls is reasonable per task?
    max_tool_calls_per_task=10,    # alert if average exceeds this

    # How long should a task take?
    max_latency_ms=30000,          # 30 seconds max

    # How often can a task end in an error?
    max_error_rate=0.02,           # 2% error rate maximum

    # How available should the agent be?
    min_availability=0.99,         # 99% uptime
)

These numbers come from your domain. A research agent doing deep analysis might reasonably use 20 tool calls. A customer service agent answering simple questions should never need more than 3. Define SLOs based on what good behavior looks like for your specific use case.

Error Budgets: When to Throttle vs Keep Running

An error budget is the complement of your SLO. If your task success SLO is 95%, your error budget is 5%: that's how much failure you can tolerate before taking action.

from agent_os.sre import ErrorBudget

budget = ErrorBudget(slo=slo, window_hours=24)

# After 1000 tasks with 60 failures (6% error rate)
budget.record_tasks(total=1000, failed=60)

print(budget.remaining_percent)  # -20% — budget exhausted
print(budget.status)             # "EXHAUSTED"
print(budget.recommendation)    # "Throttle agent — error budget depleted"
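
The -20% figure follows from simple arithmetic: a 95% SLO over 1000 tasks tolerates 50 failures, and 60 failures spends 120% of that allowance. A small sketch of the calculation, assuming the budget is measured as a fraction of allowed failures (the function name is hypothetical and independent of the toolkit's ErrorBudget class):

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Return the unspent fraction of the error budget (negative once overspent)."""
    allowed_failures = (1.0 - slo_target) * total  # failures the SLO tolerates
    if allowed_failures <= 0:
        raise ValueError("need at least one task and an SLO below 100%")
    return 1.0 - failed / allowed_failures


remaining = error_budget_remaining(slo_target=0.95, total=1000, failed=60)
print(f"{remaining:.0%}")  # -20%
```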

When the error budget is exhausted, you have three options:

Throttle — reduce the agent's call rate, buy time to investigate
Degrade gracefully — switch to a simpler, more reliable mode
Circuit break — stop the agent entirely until the issue is resolved

The right choice depends on the stakes. A research agent exhausting its budget? Throttle it. A financial agent exhausting its budget? Circuit break immediately.

Circuit Breakers: Automatically Stopping Degraded Agents

I built a circuit breaker simulation in the Colab notebook I contributed to this repo. Here's the core concept translated to production code:

from agent_os.sre import CircuitBreaker

breaker = CircuitBreaker(
    agent_id="research-agent",
    failure_threshold=0.05,    # open circuit at 5% error rate
    latency_threshold_ms=30000, # open circuit if avg latency exceeds 30s
    cooldown_seconds=300,       # try again after 5 minutes
)

# Before each agent call
if breaker.is_open():
    # Circuit is open — agent is degraded
    raise AgentUnavailableError("Circuit breaker open — agent degraded")

try:
    result = agent.run(task)
    breaker.record_success()
except Exception:
    breaker.record_failure()
    raise

The circuit has three states:

Closed — normal operation, all calls go through
Open — agent is degraded, calls are rejected immediately
Half-open — cooldown expired, testing if agent has recovered

print(breaker.state)  # "CLOSED", "OPEN", or "HALF_OPEN"
print(breaker.failure_rate)  # current failure rate
print(breaker.time_until_retry)  # seconds until half-open

This prevents a degraded agent from silently producing bad outputs at scale. Instead of hoping someone notices the quality drop, the circuit breaker makes the degradation explicit and stops it automatically.
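
To make the state transitions concrete, here is a self-contained sketch of a three-state breaker. It is an illustration, not the toolkit's CircuitBreaker: the class name, the min_calls guard, and the use of cumulative counts instead of a sliding window are all simplifying assumptions.

```python
import time


class SimpleCircuitBreaker:
    """Illustrative three-state breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED."""

    def __init__(self, failure_threshold=0.05, min_calls=20,
                 cooldown_seconds=300, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.min_calls = min_calls            # avoid tripping on a tiny sample
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock                    # injectable so time can be faked
        self.state = "CLOSED"
        self.successes = 0
        self.failures = 0
        self.opened_at = None

    @property
    def failure_rate(self):
        total = self.successes + self.failures
        return self.failures / total if total else 0.0

    def is_open(self):
        # Once the cooldown has passed, allow a trial call by moving to HALF_OPEN
        if self.state == "OPEN" and self.clock() - self.opened_at >= self.cooldown_seconds:
            self.state = "HALF_OPEN"
        return self.state == "OPEN"

    def record_success(self):
        if self.state == "HALF_OPEN":
            self._reset()                     # trial call succeeded: close the circuit
        self.successes += 1

    def record_failure(self):
        self.failures += 1
        if self.state == "HALF_OPEN":
            self._trip()                      # trial call failed: reopen immediately
        elif self.state == "CLOSED":
            total = self.successes + self.failures
            if total >= self.min_calls and self.failure_rate > self.failure_threshold:
                self._trip()

    def _trip(self):
        self.state = "OPEN"
        self.opened_at = self.clock()

    def _reset(self):
        self.state = "CLOSED"
        self.successes = 0
        self.failures = 0
        self.opened_at = None


fake_now = [0.0]                              # fake clock for the demonstration
breaker = SimpleCircuitBreaker(min_calls=10, clock=lambda: fake_now[0])
for _ in range(9):
    breaker.record_success()
breaker.record_failure()                      # 1 failure in 10 calls = 10% > 5%
print(breaker.state)                          # OPEN
fake_now[0] = 301.0                           # cooldown of 300s has passed
print(breaker.is_open())                      # False, breaker is now HALF_OPEN
breaker.record_success()
print(breaker.state)                          # CLOSED
```

Injecting the clock keeps the sketch testable without sleeping through the cooldown.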

Chaos Testing: Breaking Your Agent on Purpose

The best time to discover how your agent fails is before it fails in production. Chaos testing deliberately introduces failures to find weaknesses.

from agent_os.sre import ChaosEngine

chaos = ChaosEngine(agent=governed_agent)

# Inject 50% tool failure rate
with chaos.inject_tool_failures(rate=0.5):
    results = [governed_agent.run(task) for task in test_tasks]
    print(f"Success rate under chaos: {sum(r.success for r in results)/len(results):.1%}")

# Inject high latency
with chaos.inject_latency(ms=5000):
    results = [governed_agent.run(task) for task in test_tasks]
    print(f"Timeout rate under latency: {sum(r.timed_out for r in results)/len(results):.1%}")

# Inject policy violations
with chaos.inject_policy_violations(rate=0.3):
    results = [governed_agent.run(task) for task in test_tasks]
    print(f"Violation detection rate: {sum(r.violation_caught for r in results)/len(results):.1%}")

Run chaos tests before deploying. The questions you're answering:

Does the circuit breaker actually trip when it should?
Does the agent degrade gracefully or fail catastrophically?
Are error budgets calculated correctly under real failure conditions?
Does the governance layer catch violations during chaos?

What to Monitor and Alert On

Traditional monitoring: CPU, memory, error rate, latency.

Agent monitoring adds:

from agent_os.sre import AgentMetrics

metrics = AgentMetrics(agent_id="research-agent")

# Core reliability metrics
print(metrics.task_success_rate)      # % of tasks completed successfully
print(metrics.avg_tool_calls)         # average tool calls per task
print(metrics.p99_latency_ms)         # 99th percentile task latency
print(metrics.error_budget_remaining) # % of error budget left

# Agent-specific metrics
print(metrics.hallucination_rate)     # estimated factual error rate
print(metrics.policy_violation_rate)  # % of calls blocked by governance
print(metrics.delegation_depth_avg)   # average sub-agent spawn depth

# Circuit breaker state
print(metrics.circuit_state)          # CLOSED / OPEN / HALF_OPEN
print(metrics.circuit_opens_24h)      # how many times circuit opened today

Alert thresholds I'd recommend starting with:

Metric	Warning	Critical
Task success rate	< 95%	< 90%
Avg tool calls	> 15	> 25
Error budget remaining	< 25%	< 10%
Circuit opens (24h)	> 3	> 10
Policy violation rate	> 5%	> 15%
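
The table translates directly into a lookup. A minimal sketch, assuming the metric names from the AgentMetrics listing above and expressing percentages as fractions; the classify function is hypothetical:

```python
# metric -> (direction, warning, critical), copied from the table above
THRESHOLDS = {
    "task_success_rate": ("below", 0.95, 0.90),
    "avg_tool_calls": ("above", 15, 25),
    "error_budget_remaining": ("below", 0.25, 0.10),
    "circuit_opens_24h": ("above", 3, 10),
    "policy_violation_rate": ("above", 0.05, 0.15),
}


def classify(metric: str, value: float) -> str:
    """Return "OK", "WARNING", or "CRITICAL" for one metric reading."""
    direction, warning, critical = THRESHOLDS[metric]
    if direction == "below":
        breached_critical, breached_warning = value < critical, value < warning
    else:
        breached_critical, breached_warning = value > critical, value > warning
    if breached_critical:
        return "CRITICAL"
    if breached_warning:
        return "WARNING"
    return "OK"


print(classify("task_success_rate", 0.93))       # WARNING
print(classify("avg_tool_calls", 30))            # CRITICAL
print(classify("error_budget_remaining", 0.50))  # OK
```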

AccuracyDeclaration: Formally Declaring Agent Accuracy

One underused feature of the toolkit is AccuracyDeclaration — a formal, versioned statement of what accuracy levels your agent guarantees:

from agent_os.sre import AccuracyDeclaration

declaration = AccuracyDeclaration(
    agent_id="research-agent",
    version="1.2.0",
    declared_accuracy={
        "factual_retrieval": 0.94,    # 94% accuracy on factual questions
        "task_completion": 0.97,      # 97% of tasks complete successfully
        "policy_compliance": 0.999,   # 99.9% policy compliance
    },
    measurement_method="automated_eval_suite_v3",
    valid_for_days=90,
    supersedes="1.1.0",
)

declaration.publish()

This serves two purposes. First, it creates accountability — you've formally stated what your agent guarantees, and you're tracking against it. Second, it enables automated degradation detection — if current metrics fall below declared accuracy, alert immediately.
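
The degradation check itself is a comparison between declared and measured values. A rough sketch under stated assumptions: the measured numbers are made up, and find_degraded is a hypothetical helper, not a toolkit function.

```python
declared = {
    "factual_retrieval": 0.94,
    "task_completion": 0.97,
    "policy_compliance": 0.999,
}


def find_degraded(declared, measured, tolerance=0.0):
    """Return {dimension: (declared, measured)} for every dimension below its declared level."""
    degraded = {}
    for dimension, target in declared.items():
        current = measured.get(dimension)
        # A missing measurement is treated as degraded rather than silently passing
        if current is None or current < target - tolerance:
            degraded[dimension] = (target, current)
    return degraded


# Hypothetical readings from an evaluation run
measured = {"factual_retrieval": 0.91, "task_completion": 0.98, "policy_compliance": 0.999}
print(find_degraded(declared, measured))  # {'factual_retrieval': (0.94, 0.91)}
```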

The SRE Mindset for Agent Teams

Traditional SRE asks: "How do we keep this service running?"

Agent SRE asks: "How do we keep this agent correct?"

Running and correct are different things for AI systems. An agent can be running — responding to every request, returning outputs — while being completely wrong. That's the failure mode traditional SRE doesn't catch.

The principles are the same: define what good looks like (SLOs), measure continuously (metrics), respond automatically to degradation (circuit breakers), and test your failure modes before they happen (chaos testing). The implementation is different because the failure modes are different.

If you're deploying agents in production and you don't have SLOs defined for them, you don't actually know if they're working. You just know they're running.

Getting Started

pip install "agent-governance-toolkit[full]"

The interactive Colab notebook I built for this repo walks through SLOs, circuit breakers, and chaos testing with live code:

👉 github.com/microsoft/agent-governance-toolkit

I'm Kanish Tyagi — MS Data Science student at UT Arlington, open source contributor to Microsoft's agent-governance-toolkit. Find me on GitHub and LinkedIn.

Building Trust Between AI Agents — DIDs, Signatures, and Zero-Trust Mesh

Kanish Tyagi — Wed, 08 Apr 2026 18:39:54 +0000

Imagine you walk into a room full of strangers. Everyone is wearing a mask. Someone hands you a document and says "sign this — it's from the CEO." How do you know it's actually from the CEO? You don't. You have no way to verify identity.

This is the trust problem in multi-agent AI systems. And it's more serious than most teams realize.

Why Agents Need Identity

When a single AI agent operates in isolation, trust is simple — you trust it or you don't. But modern AI systems are increasingly multi-agent. A research agent delegates to a writing agent. A planning agent spawns execution sub-agents. An orchestrator coordinates dozens of specialized agents in parallel.

In these systems, every interaction between agents is a potential attack surface.

Consider this scenario: Agent A receives a message claiming to be from Agent B, instructing it to delete a set of records. How does Agent A verify that:

The message actually came from Agent B
Agent B is authorized to make that request
Agent B hasn't been compromised and is acting within its intended scope

Without cryptographic identity, the answer to all three is: it can't.

The Solution: Cryptographic Agent Identity

The agent-governance-toolkit solves this with a trust mesh — a framework where every agent has a verifiable cryptographic identity, and every inter-agent interaction is validated before execution.

Here's how it works.

Ed25519 Key Pairs

Each agent gets an Ed25519 key pair at creation time:

from agent_os.identity import AgentIdentity

# Each agent gets a unique cryptographic identity
identity = AgentIdentity.create(
    agent_id="research-agent-001",
    role="researcher",
    capabilities=["web_search", "read_file"],
)

print(identity.did)
# did:key:z6MkhaXgBZDvotDkL5257faiztiGiC2QtKLGpbnnEGta2doK

print(identity.public_key_hex)
# ed25519 public key

The DID (Decentralized Identifier) is a self-describing, globally unique identifier derived from the public key. No central registry required.
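
For readers curious how an identifier like that can be derived without a registry, here is a sketch of the did:key encoding for Ed25519 as I understand the method: prepend the multicodec prefix 0xed 0x01 to the 32-byte public key, base58btc-encode the result, and add the multibase prefix "z". Whether the toolkit derives its DIDs exactly this way is an assumption; the function names are hypothetical.

```python
import os

BASE58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"
ED25519_MULTICODEC_PREFIX = bytes([0xED, 0x01])  # multicodec code for an Ed25519 public key


def base58btc_encode(data: bytes) -> str:
    number = int.from_bytes(data, "big")
    encoded = ""
    while number > 0:
        number, remainder = divmod(number, 58)
        encoded = BASE58_ALPHABET[remainder] + encoded
    # Each leading zero byte is written as the first alphabet character
    leading_zeros = len(data) - len(data.lstrip(b"\x00"))
    return BASE58_ALPHABET[0] * leading_zeros + encoded


def did_key_from_ed25519(public_key: bytes) -> str:
    if len(public_key) != 32:
        raise ValueError("Ed25519 public keys are 32 bytes")
    # "z" is the multibase prefix for base58btc
    return "did:key:z" + base58btc_encode(ED25519_MULTICODEC_PREFIX + public_key)


# Random bytes stand in for a public key here; a real one comes from key generation
example_did = did_key_from_ed25519(os.urandom(32))
print(example_did)
```

Because the prefix bytes are fixed, identifiers produced this way begin with z6Mk, which matches the example above.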

Signed Messages

When Agent A sends a message to Agent B, it signs it with its private key:

message = {
    "from": identity.did,
    "to": "did:key:z6Mk...",
    "action": "analyze_dataset",
    "payload": {"dataset_id": "ds_001"},
    "timestamp": "2026-04-05T10:00:00Z",
}

signed_message = identity.sign(message)
# Includes: message + signature + public key

Agent B verifies the signature before acting:

from agent_os.identity import verify_message

is_valid = verify_message(signed_message)
if not is_valid:
    raise SecurityError("Message signature invalid — rejecting")

If the message was tampered with in transit, the signature check fails. The action is rejected before execution.

Trust Scoring: Agents Earn and Lose Trust

Identity verification tells you who sent a message. Trust scoring tells you whether to act on it.

The toolkit maintains a trust score (0-1000) for each agent in the mesh:

from agent_os.trust import TrustEngine

trust = TrustEngine()

# New agent starts with baseline trust
score = trust.get_score("research-agent-001")
print(score)  # 500 (baseline)

# Trust increases with successful verified interactions
trust.record_success("research-agent-001")
score = trust.get_score("research-agent-001")
print(score)  # 510

# Trust decreases with policy violations
trust.record_violation("research-agent-001", severity="high")
score = trust.get_score("research-agent-001")
print(score)  # 460

Agents with low trust scores get restricted automatically:

from agent_os.trust import TrustPolicy

policy = TrustPolicy(
    minimum_score=400,          # below this, agent is quarantined
    require_human_approval=600, # below this, human must approve actions
    full_autonomy=800,          # above this, agent operates freely
)

This is how the system handles compromised agents gracefully. An agent that starts behaving badly loses trust, gets restricted, and eventually gets quarantined — automatically, without human intervention.
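
Read as a decision rule, the three thresholds might map scores to tiers as follows. This is a sketch with a hypothetical function name; the policy above does not say how scores from 600 to 800 are handled, so the "supervised" tier is a placeholder assumption.

```python
def trust_tier(score: int, minimum_score: int = 400,
               require_human_approval: int = 600, full_autonomy: int = 800) -> str:
    """Map a trust score (0-1000) to a restriction tier."""
    if score < minimum_score:
        return "quarantined"
    if score < require_human_approval:
        return "human_approval"
    if score > full_autonomy:
        return "full_autonomy"
    # Assumption: the 600-800 band is not described in the text
    return "supervised"


for score in (350, 460, 700, 900):
    print(score, trust_tier(score))
```

With these defaults, the score of 460 from the earlier example would land in the human-approval tier.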

Delegation Chains: Limited Capability Grants

In multi-agent systems, parent agents often spawn child agents to handle subtasks. The challenge: how do you give a child agent enough capability to do its job without giving it unlimited power?

The answer is delegation chains — cryptographically signed capability grants with explicit limits:

from agent_os.delegation import DelegationChain

# Parent agent creates a limited delegation for child
delegation = DelegationChain.create(
    parent_identity=orchestrator_identity,
    child_did="did:key:z6Mk...",
    granted_capabilities=["read_file"],  # subset of parent's capabilities
    excluded_capabilities=["delete_file", "send_email"],
    max_depth=2,          # child cannot further delegate more than 2 levels
    expires_in_seconds=300,  # delegation expires in 5 minutes
)

The child agent presents this delegation when making requests:

# Child agent acts with delegated authority
result = child_agent.execute(
    action="read_file",
    params={"path": "/data/analysis.csv"},
    delegation=delegation,
)

The toolkit verifies the entire delegation chain before execution:

Is the delegation cryptographically valid?
Is the requested capability within the granted scope?
Has the delegation expired?
Is the delegation depth within limits?

If any check fails, the action is rejected. A compromised child agent cannot exceed the capabilities its parent explicitly granted.
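
The non-cryptographic checks in that list are straightforward to sketch. The example below assumes the signature has already been verified elsewhere and uses a hypothetical Delegation dataclass in place of the toolkit's DelegationChain.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Delegation:
    """Hypothetical stand-in for a delegation grant; signature fields are omitted."""
    granted_capabilities: frozenset
    excluded_capabilities: frozenset
    max_depth: int
    expires_at: float  # Unix timestamp


def check_delegation(delegation, capability, depth, now=None):
    """Return (allowed, reason). Signature verification is assumed to be done already."""
    now = time.time() if now is None else now
    if now >= delegation.expires_at:
        return False, "delegation expired"
    if depth > delegation.max_depth:
        return False, "delegation depth exceeded"
    if capability in delegation.excluded_capabilities:
        return False, f"capability '{capability}' explicitly excluded"
    if capability not in delegation.granted_capabilities:
        return False, f"capability '{capability}' not in delegation scope"
    return True, "ok"


grant = Delegation(
    granted_capabilities=frozenset({"read_file"}),
    excluded_capabilities=frozenset({"delete_file", "send_email"}),
    max_depth=2,
    expires_at=1_000.0 + 300,  # issued at t=1000, valid for 5 minutes
)
print(check_delegation(grant, "read_file", depth=1, now=1_100.0))   # (True, 'ok')
print(check_delegation(grant, "send_email", depth=1, now=1_100.0))  # rejected: excluded
print(check_delegation(grant, "read_file", depth=1, now=1_400.0))   # rejected: expired
```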

A Practical Example: 3 Agents Collaborating

Here's how trust verification plays out in a real multi-agent workflow:

from agent_os.identity import AgentIdentity
from agent_os.trust import TrustEngine
from agent_os.delegation import DelegationChain

# Agent 1: Orchestrator (high trust, broad capabilities)
orchestrator = AgentIdentity.create(
    agent_id="orchestrator",
    role="orchestrator",
    capabilities=["web_search", "read_file", "write_file", "send_email"],
)

# Agent 2: Researcher (medium trust, search and read)
researcher = AgentIdentity.create(
    agent_id="researcher",
    role="researcher",
    capabilities=["web_search", "read_file"],
)

# Agent 3: Writer (medium trust, read and write)
writer = AgentIdentity.create(
    agent_id="writer",
    role="writer",
    capabilities=["read_file", "write_file"],
)

trust = TrustEngine()
trust.set_score("orchestrator", 900)
trust.set_score("researcher", 700)
trust.set_score("writer", 700)

# Orchestrator delegates research task to researcher
research_delegation = DelegationChain.create(
    parent_identity=orchestrator,
    child_did=researcher.did,
    granted_capabilities=["web_search"],
    max_depth=1,
    expires_in_seconds=600,
)

# Researcher executes with verified delegation
# Toolkit checks: valid signature + sufficient trust + within capability scope
result = researcher.execute(
    action="web_search",
    query="latest AI governance frameworks",
    delegation=research_delegation,
)

# Orchestrator delegates writing task to writer
write_delegation = DelegationChain.create(
    parent_identity=orchestrator,
    child_did=writer.did,
    granted_capabilities=["write_file"],
    max_depth=1,
    expires_in_seconds=600,
)

# Writer cannot exceed delegated scope
# This would be rejected — send_email not in delegation
writer.execute(
    action="send_email",
    delegation=write_delegation,
)
# SecurityError: capability 'send_email' not in delegation scope

Every interaction is verified. Every capability grant is explicit and time-limited. No agent can exceed what it was explicitly authorized to do.

Comparison With Human Trust Models

The cryptographic trust model mirrors how humans establish trust in high-stakes environments:

Human World	Agent Mesh
Government-issued ID	Ed25519 key pair + DID
Signed contract	Signed message
Professional reputation	Trust score
Power of attorney	Delegation chain
Expiry date on credentials	TTL on delegations
Revocation list	Trust score below threshold

The difference: human trust systems have gaps. IDs can be forged. Signatures can be disputed. Reputation takes months to establish. The cryptographic model is mathematically verifiable as long as private keys stay secret: either the signature is valid or it isn't, and either the capability was granted or it wasn't.

Why This Matters for Production Systems

Most teams building multi-agent systems today are operating on implicit trust — agents interact freely, permissions are not enforced, and there's no audit trail of inter-agent communication.

This works fine in development. It becomes a serious problem in production, where:

Agents interact at scale across organizational boundaries
Compromised agents can propagate bad behavior to other agents
Regulatory requirements demand audit trails of automated decisions
A single rogue agent in a mesh can corrupt the entire workflow

The zero-trust mesh approach — verify every identity, validate every capability, audit every interaction — is how you build multi-agent systems that are safe to run in production.

Getting Started

pip install "agent-governance-toolkit[full]"

Full source code and documentation:
👉 github.com/microsoft/agent-governance-toolkit


I Added Governance to My AI Agent in 30 Minutes — Here's How

Kanish Tyagi — Wed, 08 Apr 2026 18:33:51 +0000

A few weeks ago I was building a LangChain agent and realized I had no idea what it was actually doing. It could call any tool, write anything to memory, make unlimited API calls. It was essentially unsupervised.

Then I found Microsoft's agent-governance-toolkit. I added governance to my agent in about 30 minutes. Here's exactly how.

What We're Building

A LangChain agent that:

Can only use approved tools
Blocks dangerous patterns (SQL injection, destructive commands, PII)
Logs every action for audit
Stops itself when it hits a call budget

No rewrites. Just a governance wrapper around your existing agent.

Step 1: Install (2 minutes)

pip install "agent-governance-toolkit[full]"

Verify it works:

agent-governance verify

You should see a green checkmark. You're ready.

Step 2: Your First Policy (5 minutes)

Before governance, your agent looks like this:

from langchain_openai import ChatOpenAI  # current home of ChatOpenAI (langchain-openai package)
from langchain.agents import initialize_agent, AgentType

llm = ChatOpenAI(model="gpt-4")
agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION)

# No limits. No logging. No audit trail.
agent.run("Do whatever the user asks")

After governance:

from langchain_openai import ChatOpenAI  # current home of ChatOpenAI (langchain-openai package)
from langchain.agents import initialize_agent, AgentType
from agent_os.integrations.base import GovernancePolicy
from agent_os.integrations import LangChainKernel

llm = ChatOpenAI(model="gpt-4")
base_agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION)

# Define what your agent is allowed to do
policy = GovernancePolicy(
    name="my-first-policy",
    blocked_patterns=["DROP TABLE", "rm -rf"],
    max_tool_calls=10,
    log_all_calls=True,
)

# Wrap your existing agent — no rewrite needed
kernel = LangChainKernel(policy=policy)
governed_agent = kernel.wrap(base_agent)

# Same interface, now governed
governed_agent.invoke({"input": "Help me analyze this dataset"})

That's it. Your agent now has a policy layer.

Step 3: Add Tool Restrictions (5 minutes)

You probably don't want your agent to be able to delete files or execute arbitrary shell commands. Add an allowlist:

policy = GovernancePolicy(
    name="restricted-policy",
    # Only these tools are allowed
    allowed_tools=["web_search", "read_file", "send_email"],
    # These are always blocked regardless
    blocked_patterns=[
        "rm -rf",           # destructive shell
        "DROP TABLE",       # SQL injection
        r"\b\d{3}-\d{2}-\d{4}\b",  # SSN pattern
        r"password\s*[:=]",  # credential leak
    ],
    max_tool_calls=10,
)

Test it:

# This gets blocked before reaching the LLM
allowed, reason = kernel.pre_execute(ctx, "Execute: DROP TABLE users")
print(allowed)  # False
print(reason)   # "Blocked pattern 'DROP TABLE' detected"

# This passes through
allowed, reason = kernel.pre_execute(ctx, "Search for recent AI papers")
print(allowed)  # True

The rejection happens at the code layer — the model never sees the blocked input.
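
Conceptually, a check like this is a regular-expression scan over the input before anything else runs. A standalone sketch, assuming every pattern is treated as a regex and matched case-insensitively (the toolkit's actual matching rules may differ, and check_input is a hypothetical name):

```python
import re

BLOCKED_PATTERNS = [
    "rm -rf",                    # destructive shell
    "DROP TABLE",                # SQL injection
    r"\b\d{3}-\d{2}-\d{4}\b",    # SSN pattern
    r"password\s*[:=]",          # credential leak
]

# Compile once at import time; keep the source string for readable error messages
_COMPILED = [(pattern, re.compile(pattern, re.IGNORECASE)) for pattern in BLOCKED_PATTERNS]


def check_input(text: str):
    """Return (allowed, reason) for a candidate input string."""
    for pattern, regex in _COMPILED:
        if regex.search(text):
            return False, f"Blocked pattern {pattern!r} detected"
    return True, ""


print(check_input("Execute: DROP TABLE users"))
print(check_input("Search for recent AI papers"))
```

Case-insensitive matching is a deliberate choice in this sketch: a literal "DROP TABLE" check would miss "drop table".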

Step 4: Add Audit Logging (5 minutes)

Every action your agent takes is now logged automatically when log_all_calls=True. But you can also inspect the audit trail directly:

from datetime import datetime

audit_log = []

def on_action(event):
    audit_log.append({
        "timestamp": datetime.now().isoformat(),
        "input": event.get("input"),
        "allowed": event.get("allowed"),
        "reason": event.get("reason"),
    })

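# on_action is assumed to be registered with the toolkit's event hook;
# the registration call is not shown here.

# Check what happened
for entry in audit_log:
    status = "✅ ALLOWED" if entry["allowed"] else "🚫 BLOCKED"
    # Guard against events that carry no input text
    print(f"{entry['timestamp']} - {status}: {(entry['input'] or '')[:50]}")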

In production, this goes to your observability stack. During development, it tells you exactly what your agent tried to do.

Step 5: Run the OWASP Compliance Check (3 minutes)

The toolkit includes a built-in compliance checker against the OWASP Agentic Top 10:

agent-governance verify --badge

This checks your configuration against known attack vectors:

Prompt injection resistance
Tool poisoning detection
Data exfiltration patterns
Privilege escalation blocks

You get a pass/fail report with specific recommendations.

Before vs After

Here's what changed in 30 minutes:

	Before	After
Tool calls	Unlimited	Max 10 per session
Dangerous inputs	Passed to LLM	Blocked before LLM
PII handling	Unvalidated	Checked on every write
Audit trail	None	Every action logged
OWASP compliance	Unknown	Verified
Performance overhead	—	<0.1ms per action check

The last point matters: governance doesn't slow your agent down in any meaningful way. The policy checks are in-process Python, not network calls.

What I Learned Contributing to This Repo

I spent the past week contributing code to this toolkit — fixing bugs, adding docstrings, building Colab notebooks. Here's what surprised me:

The attack surface is bigger than you think. I saw how MCP tool poisoning works, how PII leaks into memory writes, how delegation chains can be exploited. None of these are theoretical. They're patterns security researchers are actively demonstrating.

Wrapping is better than rewriting. The governance layer wraps your existing agent. You don't throw away your prompts or your framework. You add a code layer underneath them. That's the right architecture.

Auditability is the real value. The blocked requests matter less than the audit trail. Knowing what your agent tried to do — even when it was allowed — is how you understand and improve agent behavior in production.

Try It Yourself

The full toolkit is open source:

pip install "agent-governance-toolkit[full]"

There are also interactive Colab notebooks (which I built!) that let you explore policy enforcement, MCP security scanning, and multi-agent governance without any local setup:

👉 github.com/microsoft/agent-governance-toolkit

The quickstart takes 10 minutes. The governance layer takes 5 more.


Policy-as-Code vs Prompt Engineering — When Guardrails Need Governance

Kanish Tyagi — Sun, 05 Apr 2026 12:20:44 +0000

I've been contributing to Microsoft's agent-governance-toolkit for the past few weeks — fixing bugs, writing docstrings, building Colab notebooks. And the more I dug into the codebase, the more one question kept coming up: why can't you just use a well-crafted system prompt to keep your agent safe?

The short answer: you can. Until you can't.

The Prompt Engineering Approach

Prompt-level guardrails look like this:

"You are a helpful assistant. Never recommend illegal activities. Never share personal user data. Always stay within budget constraints."

This works remarkably well — right up until it doesn't. The problem is structural: your guardrail lives inside the model's context window, which means it's subject to the same probabilistic reasoning as everything else the model does.

Three things consistently break prompt-only guardrails:

Jailbreak attacks. Users have discovered that framing requests differently — roleplay scenarios, "hypothetical" framings, multi-step manipulations — can cause models to comply with things they were explicitly told not to do. This isn't a bug in the model. It's a feature of how language models work. They follow the most compelling framing of the current context.

Context window exhaustion. In a long conversation, history truncation can push your system prompt out of the context window, depending on how the framework trims messages. Your "never exceed budget" guardrail can silently disappear dozens of messages in. The model doesn't know what it forgot.

Model dependency. A guardrail tuned for GPT-4 behaves differently on Claude, which behaves differently on Llama. Every model swap means re-testing every prompt. At scale, that's not sustainable.

The Policy-as-Code Approach

Policy-as-code doesn't ask the model to follow rules. It enforces rules in code, before and after the model is called.

Here's what that looks like with the agent-governance-toolkit:

from agent_os.integrations.base import GovernancePolicy
from agent_os.integrations import LangChainKernel

policy = GovernancePolicy(
    blocked_patterns=["DROP TABLE", "rm -rf", r"\b\d{3}-\d{2}-\d{4}\b"],
    max_tool_calls=10,
    require_human_approval=False,
)

kernel = LangChainKernel(policy=policy)
governed_agent = kernel.wrap(my_chain)

Every call to governed_agent.invoke() now goes through a policy check before it reaches the model. If the input matches a blocked pattern — SQL injection, destructive command, SSN pattern — the call is rejected. The model never sees it.

This is a fundamentally different architecture. The guardrail isn't in the model's context. It's in your infrastructure.

Where Each Approach Fails

Neither approach is complete on its own.

Where prompt engineering fails:

Adversarial users who iterate on jailbreaks
Long-running agents where context window limits apply
Multi-model deployments where you can't tune per model
Compliance scenarios where you need an audit trail

Where policy-as-code falls short:

Creative tasks where rules can't capture nuance
Behavioral style (tone, personality, format)
Rapid iteration — changing a policy requires a code deploy
User experience — prompts understand context, rules do not

A guardrail that says blocked_patterns=["inappropriate content"] can't capture the infinite ways "inappropriate" manifests in natural language. A prompt can. But a prompt can't guarantee that a rule is never violated. Code can.

Real Example: PII in Memory Writes

Here's a concrete failure mode I saw while contributing to this toolkit.

An agent with memory capabilities might inadvertently write a user's SSN into its long-term memory during a conversation about financial planning. A prompt guardrail saying "never store sensitive data" relies on the model recognizing what counts as sensitive — and recognizing it every single time, across thousands of conversations.

The toolkit handles this differently. Memory writes are intercepted at the code layer:

import re

_PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # SSN
    re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),  # email
    re.compile(r"\b(?:password|secret|token)\s*[:=]\s*\S+", re.IGNORECASE),  # credentials
]

Before anything is written to memory, it's checked against these patterns. If a match is found, the write is blocked and a PolicyViolationError is raised. The model's cooperation is irrelevant — the rule is enforced in code.
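
A write-time guard of this kind can be sketched in a few lines. The GuardedMemory class and the local PolicyViolationError below are stand-ins so the example runs without the toolkit installed; they are assumptions about the shape of the mechanism, not the toolkit's implementation.

```python
import re


class PolicyViolationError(Exception):
    """Local stand-in for the toolkit's exception of the same name."""


PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "credential": re.compile(r"\b(?:password|secret|token)\s*[:=]\s*\S+", re.IGNORECASE),
}


class GuardedMemory:
    """Hypothetical memory store that refuses writes containing PII-like text."""

    def __init__(self):
        self._entries = []

    def write(self, text: str) -> None:
        for label, pattern in PII_PATTERNS.items():
            if pattern.search(text):
                # Reject before anything is persisted
                raise PolicyViolationError(f"memory write blocked: {label} pattern matched")
        self._entries.append(text)

    def entries(self):
        return list(self._entries)


memory = GuardedMemory()
memory.write("User prefers low-cost index funds")
try:
    memory.write("User's SSN is 123-45-6789")
except PolicyViolationError as exc:
    print(exc)  # memory write blocked: ssn pattern matched
print(len(memory.entries()))  # 1
```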

The Layered Model That Actually Works

The right answer isn't "choose one." It's to use both at the right layer.

Use policy-as-code for:

Hard constraints that must never be violated (safety, compliance, budget)
Anything that needs an audit trail
Rules that apply regardless of model behavior
Security boundaries (tool allowlists, blocked patterns, call budgets)

Use prompt engineering for:

Behavioral style and tone
Output formatting
Persona and creative direction
User experience nuances

Use human-in-the-loop for:

High-stakes decisions that exceed both layers
Ambiguous cases the policy engine flags as uncertain
Exception handling with accountability

Think of it as defense in depth. Prompts guide behavior. Policies enforce limits. Humans handle edge cases. No single layer carries all the weight.

A Concrete Starting Point

If you're building agents today without a policy layer, here's the minimum viable governance setup using the toolkit:

pip install "agent-governance-toolkit[full]"

from agent_os.integrations.base import GovernancePolicy
from agent_os.integrations import LangChainKernel

# Start with just these three
policy = GovernancePolicy(
    blocked_patterns=["rm -rf", "DROP TABLE"],  # safety
    max_tool_calls=20,                           # budget
    log_all_calls=True,                          # audit trail
)

kernel = LangChainKernel(policy=policy)
governed = kernel.wrap(your_existing_agent)

You don't need to rewrite your agent. You wrap it. Your existing prompts stay. You've added code-level enforcement underneath them.

The Bigger Picture

The shift from prompt-only to policy + prompt is the maturation of agent governance. It's the difference between "I hope my agent behaves" and "I can prove my agent behaves."

Prompt engineering got us to where we are. It's powerful and flexible and necessary. But as agents move into production — taking real actions, handling real data, running autonomously at scale — "I told it not to" is no longer sufficient as a governance strategy.

Policy-as-code is how you make agent behavior auditable, testable, and enforceable. Not instead of good prompts. On top of them.


Source code and docs: github.com/microsoft/agent-governance-toolkit

Why AI Agent Governance Matters in 2026

Kanish Tyagi — Fri, 03 Apr 2026 03:53:48 +0000

A few weeks ago I was browsing GitHub looking for open source projects to contribute to. I stumbled on Microsoft's agent-governance-toolkit and decided to dig in. What I found surprised me — not because the code was impressive (it was), but because the problem it solved was one I hadn't thought seriously about before.

We talk a lot about building AI agents. We don't talk nearly enough about what happens when they go wrong.

The Rise of Autonomous AI Agents

In 2026, AI agents aren't a research curiosity anymore. They're running in production. Companies are deploying agents that browse the web, write and execute code, query databases, send emails, and call external APIs — all without a human in the loop for each step.

This is genuinely powerful. An agent that can autonomously research, analyze, and act can compress hours of work into minutes. But that power comes with a problem most teams are only starting to reckon with: an agent that can do anything will eventually do the wrong thing.

Not out of malice. Out of ambiguity, edge cases, and the fundamental unpredictability of systems built on probabilistic models.

What Can Go Wrong — And Does

Let me give you concrete examples from patterns I saw while contributing to this toolkit.

Data exfiltration via MCP tool poisoning. MCP (Model Context Protocol) is a standard for giving agents access to tools. What most teams don't realize is that an attacker can embed hidden instructions directly in a tool's description. The agent reads the description, interprets it as instructions, and suddenly a "read file" tool is quietly sending your data to an external endpoint. The attack is invisible at the UI level. It lives in the metadata.

Runaway tool calls. An agent that's allowed to call tools indefinitely will sometimes loop. A bug in the task framing, an ambiguous goal, or an unexpected API response can send an agent into a cycle that burns through API credits, hits rate limits, or worse — makes irreversible changes at scale.

PII in memory writes. Agents with memory capabilities can inadvertently store sensitive user data — SSNs, emails, API keys embedded in text — in their long-term memory. Without validation at the write layer, that data persists and propagates.

Prompt injection across delegation chains. When agents spawn sub-agents (which is increasingly common in multi-agent architectures), each handoff is an attack surface. A malicious instruction injected early in a chain can propagate through multiple agents before anyone realizes something is wrong.

These aren't theoretical. Security researchers have publicly demonstrated tool poisoning and prompt injection against real agent setups, and runaway loops and leaked PII are failure modes that need no attacker at all.

What Governance Actually Means

"Governance" is one of those words that sounds bureaucratic until you see what the absence of it costs.

In practical terms, governance for AI agents means three things:

1. Policy enforcement before execution. Every tool call, every action an agent wants to take, gets checked against a set of rules before it happens. Not after. Not logged for review later. Before. If the action matches a blocked pattern — a SQL injection attempt, a destructive shell command, a PII pattern — it's stopped. The agent gets a policy violation error. The action doesn't execute.

2. Audit logging you can actually use. Every action an agent takes, whether allowed or blocked, gets recorded with enough context to understand what happened. Not just "agent called tool X" but which agent, which context, what the input was, what the policy decision was, and why.

3. Budget and circuit breaker enforcement. Agents need hard limits — on the number of tool calls per session, on the delegation depth between agents, on execution time. When those limits are hit, the agent stops. Not degrades gracefully. Stops.
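To make the three pieces concrete, here is a minimal sketch of a gate that checks each tool call before it runs, records the decision, and enforces a hard call budget. It is not the toolkit's implementation: ToolGate, PolicyViolation, and the audit record fields are hypothetical names chosen for illustration, and the pattern matching is a plain regex search.

```python
import re
from dataclasses import dataclass, field


class PolicyViolation(Exception):
    """Raised when a tool call is rejected before it executes."""


@dataclass
class ToolGate:
    """Hypothetical gate (not the toolkit's API): decide, log, then run."""

    blocked_patterns: list
    max_tool_calls: int
    calls_made: int = 0
    audit_log: list = field(default_factory=list)

    def _decide(self, tool_input: str):
        # Budget check first: once the limit is hit, nothing else runs.
        if self.calls_made >= self.max_tool_calls:
            return "block", "tool call budget exhausted"
        for pattern in self.blocked_patterns:
            if re.search(pattern, tool_input, flags=re.IGNORECASE):
                return "block", f"input matches blocked pattern {pattern!r}"
        return "allow", "no rule matched"

    def call(self, agent_id: str, tool, tool_input: str):
        decision, reason = self._decide(tool_input)
        # Allowed and blocked calls are both recorded, with the reason.
        self.audit_log.append({
            "agent": agent_id,
            "tool": getattr(tool, "__name__", repr(tool)),
            "input": tool_input,
            "decision": decision,
            "reason": reason,
        })
        if decision == "block":
            raise PolicyViolation(reason)
        self.calls_made += 1
        return tool(tool_input)
```

A caller routes every tool invocation through gate.call(agent_id, tool, tool_input). The policy decision is made before tool is ever invoked, so a blocked action never executes, and the audit entry for it still exists afterwards.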

How agent-governance-toolkit Implements This

This is where it gets concrete. The toolkit wraps your existing agent frameworks — LangChain, CrewAI, AutoGen, PydanticAI, smolagents, Google ADK — with a governance layer that sits between your agent and the outside world.

You define a policy:

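from agent_os.integrations.base import GovernancePolicy

policy = GovernancePolicy(
    # Destructive SQL, destructive shell command, and a US SSN-shaped number
    blocked_patterns=["DROP TABLE", "rm -rf", r"\b\d{3}-\d{2}-\d{4}\b"],
    max_tool_calls=10,             # hard budget on tool calls
    require_human_approval=False,  # no human sign-off step
)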

Then you wrap your agent:

from agent_os.integrations import LangChainKernel

kernel = LangChainKernel(policy=policy)
governed_agent = kernel.wrap(my_langchain_chain)

Every call to governed_agent.invoke() now goes through a pre-execution check. If the input matches a blocked pattern, the call is rejected before reaching the LLM or any external service. If the agent has exceeded its tool call budget, the call is rejected. If the output violates the policy, the post-execution check catches it.

The interception happens at multiple levels. Deep hooks intercept individual tool calls within the agent, memory writes are validated for PII before they're saved, and delegation chains between sub-agents are tracked and depth-limited.
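The memory-write part of this can be sketched in a few lines. This is an assumption-heavy illustration, not the toolkit's validator: GuardedMemory and PII_PATTERNS are hypothetical names, and the two regexes only catch SSN-shaped numbers and simple email addresses. Real PII detection needs far broader coverage.

```python
import re

# Illustrative patterns only (hypothetical, not the toolkit's rules).
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


class GuardedMemory:
    """Long-term memory that validates every write before storing it."""

    def __init__(self):
        self._entries = []

    def write(self, text: str) -> None:
        # Collect the names of every PII pattern found in the text.
        found = sorted(
            name for name, pattern in PII_PATTERNS.items() if pattern.search(text)
        )
        if found:
            # Reject the write so the data never persists or propagates.
            raise ValueError(
                f"memory write rejected, PII detected: {', '.join(found)}"
            )
        self._entries.append(text)

    def read_all(self) -> list:
        # Return a copy so callers cannot bypass write() by mutating it.
        return list(self._entries)
```

Rejecting the write is the strictest option. Redacting the match and storing the rest is a reasonable alternative when losing the whole entry would hurt the agent more than it helps.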

For MCP specifically, the toolkit includes a security scanner that checks tool definitions for poisoning patterns — hidden instructions, exfiltration attempts, privilege escalation, role overrides — before those tools are registered with the agent.
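As a rough picture of what scanning before registration means, the sketch below checks each tool's name and description against a few heuristics and refuses to register anything that matches. The rules, scan_tool, and register_tools are hypothetical and deliberately naive. They are not the toolkit's scanner, and a real one would need many more patterns than these three.

```python
import re

# Hypothetical heuristics for illustration; not the toolkit's rule set.
POISONING_RULES = {
    "hidden_instruction": re.compile(
        r"<!--.*?-->|ignore (all )?(previous|prior) instructions", re.I | re.S
    ),
    "exfiltration": re.compile(
        r"\b(send|post|upload|forward)\b.*\bhttps?://", re.I | re.S
    ),
    "role_override": re.compile(
        r"\byou are now\b|\bact as (the )?(system|admin)", re.I
    ),
}


def scan_tool(tool: dict) -> list:
    """Return the sorted names of every rule the tool's metadata trips."""
    text = f"{tool.get('name', '')}\n{tool.get('description', '')}"
    return sorted(
        rule for rule, pattern in POISONING_RULES.items() if pattern.search(text)
    )


def register_tools(tools: list):
    """Split tools into those safe to register and those rejected, with reasons."""
    accepted, rejected = [], {}
    for tool in tools:
        findings = scan_tool(tool)
        if findings:
            rejected[tool["name"]] = findings
        else:
            accepted.append(tool)
    return accepted, rejected
```

The important property is the ordering: the description is inspected as untrusted data before the agent ever reads it as context.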

Key Concepts Worth Understanding

The trust mesh. In a multi-agent system, trust isn't binary. Different agents should have different permission levels based on their role, their source, and the context they're operating in. The toolkit models this through trust cards and identity verification — each agent has a verifiable identity, and policies can be scoped to specific agents or roles.

Policy as code. Governance policies are defined as YAML files that can be version-controlled, reviewed, and deployed like any other configuration. They have a schema, they can be validated before deployment (agentos policy validate), and they can be diffed between versions. You don't want to be manually updating code every time a new blocked pattern needs to be added.
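Once a policy is plain data, validating and diffing it is straightforward. The sketch below assumes the YAML file has already been parsed into a Python dict and uses a made-up three-field schema that mirrors the GovernancePolicy arguments shown earlier. POLICY_SCHEMA, validate_policy, and diff_policies are hypothetical helpers, not the toolkit's schema or the behavior of its validate command.

```python
# Hypothetical schema: field name -> expected Python type after parsing.
POLICY_SCHEMA = {
    "blocked_patterns": list,
    "max_tool_calls": int,
    "require_human_approval": bool,
}


def validate_policy(policy: dict) -> list:
    """Return a list of human-readable errors; empty means the policy is valid."""
    errors = []
    for key, expected in POLICY_SCHEMA.items():
        if key not in policy:
            errors.append(f"missing field: {key}")
            continue
        value = policy[key]
        # bool is a subclass of int in Python, so reject True/False for int fields.
        wrong_type = not isinstance(value, expected) or (
            expected is int and isinstance(value, bool)
        )
        if wrong_type:
            errors.append(f"{key}: expected {expected.__name__}")
    for key in sorted(policy.keys() - POLICY_SCHEMA.keys()):
        errors.append(f"unknown field: {key}")
    return errors


def diff_policies(old: dict, new: dict) -> dict:
    """Map each changed field to its (old, new) pair, for review before deploy."""
    changes = {}
    for key in sorted(old.keys() | new.keys()):
        if old.get(key) != new.get(key):
            changes[key] = (old.get(key), new.get(key))
    return changes
```

Running the validator in CI and attaching the diff to the pull request gives reviewers the same workflow they already use for any other configuration change.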

SLOs for agents. Just like you set service level objectives for APIs and infrastructure, you can set them for agents. Error rate thresholds, latency limits, availability targets. When an agent breaches its SLO, a circuit breaker trips and the agent is taken out of service. This prevents a degraded agent from silently producing bad outputs at scale.
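A minimal sketch of the breaker idea, assuming a single SLO on task success rate measured over a sliding window. SLOCircuitBreaker and its parameters are hypothetical names rather than the toolkit's API, and latency and availability targets are left out to keep the example short.

```python
from collections import deque


class SLOCircuitBreaker:
    """Trips when the windowed success rate falls below the SLO target."""

    def __init__(self, min_success_rate: float, window: int = 20, min_samples: int = 5):
        self.min_success_rate = min_success_rate
        self.min_samples = min_samples
        # Only the most recent `window` outcomes count toward the rate.
        self._outcomes = deque(maxlen=window)
        self.open = False

    def success_rate(self) -> float:
        if not self._outcomes:
            return 1.0
        return sum(self._outcomes) / len(self._outcomes)

    def record(self, success: bool) -> None:
        self._outcomes.append(success)
        # Wait for enough samples so one early failure cannot trip the breaker.
        enough = len(self._outcomes) >= self.min_samples
        if enough and self.success_rate() < self.min_success_rate:
            self.open = True

    def allow_request(self) -> bool:
        # An open breaker means the agent is out of service.
        return not self.open

    def reset(self) -> None:
        """Put the agent back in service, e.g. after an operator review."""
        self._outcomes.clear()
        self.open = False
```

The breaker stays open until someone calls reset(). That matches the hard-stop behavior described here: a degraded agent is taken out of service instead of continuing to serve requests.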

Audit logging as a first-class concern. Every governance decision — allow or block — is recorded with full context. This isn't just for debugging. It's for compliance, incident response, and understanding actual agent behavior in production over time.

Why This Matters Now

The timing matters. We're at an inflection point.

Most teams deploying AI agents today are doing so without any governance layer. They're moving fast, the agents are working well enough in testing, and governance feels like a problem for later. But "later" in AI agent deployments often means "after the first serious incident."

The cost of retrofitting governance onto an existing agent system is much higher than building it in from the start. The audit trail doesn't exist yet, the policy boundaries haven't been defined, and the agents have already been granted permissions that are difficult to revoke without breaking workflows.

This isn't just a security problem either. It's a reliability problem and a trust problem. If your agents take actions that users or customers didn't expect and can't explain, trust in the system erodes quickly. Governance is what makes agent behavior predictable and explainable — not just safe.

Getting Started

The quickstart takes about 10 minutes:

pip install "agent-governance-toolkit[full]"

There are also three interactive Google Colab notebooks that let you explore the toolkit without setting anything up locally:

Policy Enforcement 101 — define a policy, see violations blocked in real time
MCP Security Proxy — scan tool definitions for poisoning patterns
Multi-Agent Governance — SLOs, circuit breakers, chaos testing

Full source code and docs: github.com/microsoft/agent-governance-toolkit

A Note From a Contributor

I came to this project as someone who builds ML models and data pipelines — not a security engineer. But contributing to this codebase changed how I think about agent design. The attack surfaces are real, the failure modes are unintuitive, and the tooling to address them is still early.

If you're building agents in production, governance isn't optional. It's the difference between a system you can operate confidently and one you're constantly firefighting.

I'm Kanish Tyagi — MS Data Science student at UT Arlington and open source contributor. Find me on GitHub and LinkedIn.