Microsoft just open-sourced the Agent Governance Toolkit, a runtime governance platform that covers all 10 risks in the OWASP Agentic Top 10. I've spent the morning reading through the architecture, benchmarks, and OWASP compliance docs, and it's one of the most thorough agent governance frameworks I've seen from any company, open-source or otherwise.
- Policy evaluation at 0.012ms latency.
- Ed25519 cryptographic agent identity with trust scoring.
- Four-tier execution rings with kill switches.
- Circuit breakers and chaos engineering for reliability.
- Adapters for 12+ frameworks including LangChain, AutoGen, CrewAI, and Google ADK.
- 6,100+ tests. MIT licensed.
This is the kind of infrastructure that the agentic ecosystem desperately needs, and Microsoft giving it away for free accelerates the entire space.
It also makes me more confident about the bet we've been making at Rynko, because the toolkit solves a genuinely hard set of problems that we don't solve and it leaves room for the specific problem that we do.
## What the Toolkit Does Well
The toolkit has four components, and each one addresses a real production concern that teams building agentic systems struggle with.
Agent OS is the policy engine. Every agent action passes through it before execution. You define capabilities (which tools the agent can call), resource limits (token budgets, API call caps), and content policies. It evaluates these at sub-millisecond latency: 72,000 policy evaluations per second for single rules, 31,000 for 100-rule policies. Custom policies can be written in OPA/Rego or Cedar, which means teams can reuse their existing policy infrastructure rather than learning a new DSL, a thoughtful design choice.
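To make the shape of a policy evaluation concrete, here's a minimal sketch in Python. Every name here (`evaluate`, the policy fields) is invented for illustration; it is not the toolkit's API, just the kind of allow/deny check a policy engine performs per action.

```python
# Hypothetical sketch of a capability/resource-limit check.
# Field names are illustrative, not the Agent OS schema.

def evaluate(policy: dict, action: dict) -> bool:
    """Return an allow/deny decision for a proposed agent action."""
    if action["tool"] not in policy["allowed_tools"]:
        return False  # capability check: tool not on the allowlist
    if action.get("tokens", 0) > policy["token_budget"]:
        return False  # resource check: over the token budget
    return True

policy = {"allowed_tools": {"search", "submit_order"}, "token_budget": 4000}

print(evaluate(policy, {"tool": "submit_order", "tokens": 1200}))  # True
print(evaluate(policy, {"tool": "delete_db", "tokens": 10}))       # False
```

Note the return type: a bare boolean. That allow/deny shape becomes relevant later when comparing policy engines with output validation.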
AgentMesh handles identity and inter-agent trust. Every agent gets an Ed25519 cryptographic credential. Trust scores on a 0–1000 scale determine what an agent can do, e.g. a score of 900+ gets verified partner access, below 300 gets read-only. Communication between agents is encrypted through trust gates, and it bridges the A2A, MCP, and IATP protocols. The trust scoring model is particularly well thought out: new agents default to 500 and progress based on compliance history, which mirrors how you'd onboard a new team member with gradually expanding permissions.
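The tiering logic described above can be sketched in a few lines. The thresholds (900+, below 300, default 500) come from the docs; the tier names and the function itself are my own shorthand, not AgentMesh code.

```python
# Illustrative mapping from a 0-1000 trust score to an access tier.
# Thresholds follow the toolkit's docs; tier names are invented here.

def access_tier(score: int) -> str:
    if score >= 900:
        return "verified_partner"  # highest trust: verified partner access
    if score < 300:
        return "read_only"         # low trust: read-only
    return "standard"              # everything in between

print(access_tier(500))  # "standard" (the stated default for new agents)
print(access_tier(950))  # "verified_partner"
print(access_tier(120))  # "read_only"
```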
Agent Runtime is the execution supervisor. It uses four privilege rings to isolate what agents can touch, saga orchestration to coordinate multi-step operations, kill switches to terminate non-compliant agents, and append-only audit logs that record everything for forensic replay.
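Privilege rings are the classic OS isolation idea applied to agents: lower ring numbers are more privileged, and an agent can only perform operations at its own ring or a less privileged one. The sketch below is purely illustrative; the ring numbering and which operations land in which ring are my assumptions, not the toolkit's layout.

```python
# Illustrative four-ring privilege model (ring 0 = most privileged).
# Operation-to-ring assignments are invented for this sketch.

RING_OF_OPERATION = {
    "read_memory": 3,   # least privileged: allowed in every ring
    "call_tool":   2,
    "write_state": 1,
    "spawn_agent": 0,   # most privileged: only ring-0 agents
}

def allowed(agent_ring: int, operation: str) -> bool:
    """An agent may perform an operation only at its ring or less privileged."""
    return RING_OF_OPERATION[operation] >= agent_ring

print(allowed(2, "call_tool"))    # True
print(allowed(2, "spawn_agent"))  # False: needs ring 0
```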
Agent SRE provides reliability engineering: SLO enforcement, error budgets, circuit breakers to prevent cascading failures, replay debugging, and chaos engineering. These are the production observability patterns you'd expect from a team that runs Azure at scale.
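For readers who haven't met the circuit breaker pattern: after repeated failures it "opens" and fails fast instead of hammering a struggling downstream service, then lets a trial call through after a cooldown. This is a generic textbook sketch, not Agent SRE's implementation; the class name, thresholds, and API are mine.

```python
import time

# Toy circuit breaker: illustrative only, not the Agent SRE API.
class CircuitBreaker:
    """Fail fast after repeated failures; allow a retry after a cooldown."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the circuit opened

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # a success closes the circuit
        return result
```

Wrapping every downstream call in something like this is what keeps one misbehaving agent or flaky API from cascading into the rest of the pipeline.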
All four components work together to answer a fundamental question: is this agent allowed to do what it's trying to do, and is it doing it safely?
This is genuinely hard infrastructure to build correctly. Identity, policy enforcement, execution isolation, and reliability engineering each have deep rabbit holes, and Microsoft has the engineering depth to go down all of them properly.
## Where Flow Adds a Complementary Layer
The toolkit governs agent behavior: permissions, identity, execution boundaries, reliability. Flow governs agent output, i.e. the actual data the agent produces when it completes an action.
These are different concerns. The toolkit ensures the agent is authorized and operating safely. Flow ensures the data the agent produces is correct and hasn't been tampered with before reaching the downstream system.
One reasonable question to ask would be: couldn't AgentMesh's trust gates or the Agent OS policy engine handle data validation too? Technically, you could write OPA/Rego policies that inspect payload fields; Rego is expressive enough to check `input.payload.amount > 0`. But policy engines are designed to return allow/deny decisions, not structured validation errors with field-level messages that an agent can use to self-correct and resubmit. You'd also be mixing authorization concerns with domain-specific business logic in the same policy files, and you wouldn't get HMAC-based payload verification or human approval routing. It's a bit like using a firewall for input validation: it can inspect packet contents, but that doesn't make it the right layer for checking whether an invoice total matches its line items.
Think about the OWASP compliance mapping in the toolkit. ASI-05 addresses unexpected code execution through privilege rings and sandboxing, which ensure the agent can't run arbitrary code. That's the right control for that risk. But once the agent produces a result through an approved tool call - an invoice, a purchase order, a compliance report - there's a different question to answer: is the data in that result actually correct?
An agent can be fully authorized, properly authenticated, running within its privilege ring, with no circuit breaker tripped. The policy engine approved the action. And the agent still submits `"currency": "usd"` instead of `"USD"`, calculates a total that's off by a rounding error, or drops a required field. These are domain-specific data quality issues that a behavioral governance layer isn't designed to catch, and honestly shouldn't try to: that would mix concerns and bloat the policy engine with domain logic.
This is what Flow was built for. You define a gate with a schema and business rules specific to your domain, and the agent's output gets validated before it reaches the downstream system. Failed validations return structured errors the agent can use to self-correct. Passed validations return a validation_id, an HMAC-SHA256 hash of the validated payload that the downstream system can independently verify.
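To show what HMAC-based payload verification buys you, here's a minimal sketch using Python's standard `hmac` module. The key handling, JSON canonicalization, and function names are assumptions for illustration, not Flow's actual scheme.

```python
import hashlib
import hmac
import json

# Shared secret known to the gateway and the downstream system.
# In practice this would come from a secrets manager, not a literal.
SECRET_KEY = b"shared-gateway-secret"

def validation_id(payload: dict) -> str:
    """HMAC-SHA256 over a canonical serialization of the payload."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hmac.new(SECRET_KEY, canonical.encode(), hashlib.sha256).hexdigest()

def verify(payload: dict, vid: str) -> bool:
    """Downstream check: recompute and compare in constant time."""
    return hmac.compare_digest(validation_id(payload), vid)

order = {"order_id": "ORD-2847", "amount": 500}
vid = validation_id(order)
print(verify(order, vid))                      # True
print(verify({**order, "amount": 9999}, vid))  # False: payload was tampered
```

The point of the HMAC (rather than a plain hash) is that only a holder of the key can mint a valid ID, so a compromised intermediary can't alter the payload and recompute a matching digest.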
## How the Two Layers Work Together
The distinction maps to how we think about security in traditional systems. Authentication and authorization tell you who's making a request and whether they're allowed to. Input validation tells you whether the data they're sending is well-formed and correct. You've always needed both. The agentic world isn't different.
| Layer | Question | Microsoft Toolkit | Rynko Flow |
|---|---|---|---|
| Identity | Who is this agent? | Ed25519 credentials, trust scores | API key auth |
| Authorization | Can it call this tool? | Policy engine, capability model | - |
| Execution | Is it running safely? | Privilege rings, sandboxing | - |
| Reliability | Will failures cascade? | Circuit breakers, SLOs | - |
| Output correctness | Is the data valid? | - | Schema + business rules |
| Output integrity | Was the data tampered? | - | HMAC verification |
| Human oversight | Should a person review? | - | Approval routing |
The toolkit handles the first four rows; Flow handles the last three. Together, they cover the pipeline end to end.
## A Practical Example
Say you have an order processing agent running in an environment with the toolkit deployed. The policy engine confirms the agent has permission to submit orders. AgentMesh verified its identity. The runtime supervisor confirmed it's operating within its privilege ring.
The agent submits this order:
```json
{
  "order_id": "ORD-2847",
  "vendor": "Acme Corp",
  "amount": -500,
  "currency": "usd",
  "line_items": []
}
```
From the toolkit's perspective, everything checks out. The agent was authorized, authenticated, and operating within bounds. The policy engine approved the action. And it should approve it: the toolkit's job is to enforce behavioral governance, not validate business data.
Flow picks up where the toolkit leaves off. A gate with the appropriate schema and rules catches three issues:
```json
{
  "success": false,
  "errors": [
    { "field": "amount", "message": "Must be >= 0" },
    { "field": "currency", "message": "Must be one of: USD, EUR, GBP" },
    { "rule": "line_items.length > 0", "message": "Must have at least one line item" }
  ]
}
```
The agent self-corrects using the structured feedback, resubmits, and gets a validation_id on success. The downstream system verifies the ID before accepting the data. The toolkit made sure the right agent submitted the order safely. Flow made sure the order itself was correct.
## Performance: Both Layers Are Essentially Free
One thing the toolkit's benchmarks highlight is that governance overhead should be invisible relative to LLM latency. Their policy evaluation adds 0.01–0.1ms. An LLM API call takes 200–3,000ms. I think they're exactly right about this: governance shouldn't be the bottleneck, and at those numbers it never will be.
Flow operates at a different timescale because it's doing more work per evaluation: parsing payloads, validating schemas against variable arrays, running expression-based business rules through a recursive-descent parser. Our benchmarks show ~50ms server-side validation for enterprise-scale payloads (21 schema variables, 10 business rules, 900 line items in a single payload). For typical payloads (a few KB), it's single-digit milliseconds.
Combined, both layers add maybe 50–60ms to a pipeline where the LLM inference took 500–3,000ms. You're paying a negligible cost for behavioral governance and output validation together.
## The Bigger Picture
Between the OWASP Agentic Top 10, the AWS Agentic AI Security Scoping Matrix, Snapchat's Auton framework, and now Microsoft's toolkit, the industry is converging on something I think is important: agent governance is not a single problem with a single solution. It's a stack of specialized layers, each addressing different risks at different points in the pipeline.
Microsoft releasing this toolkit validates the category in a way that benefits everyone building in the space. When the company that runs Azure tells the world "agent governance is infrastructure, here's our reference implementation for free," it moves the conversation from "do we need agent governance?" to "which layers do we still need to add?"
We think output validation is one of those layers. Not because the toolkit missed something, but because domain-specific data correctness is a separate concern that deserves its own specialized tooling. Checking whether an invoice has the right currency code, whether an order total matches its line items, or whether a compliance report includes all required fields isn't a policy evaluation problem. It's a schema and business rule problem with optional human review in the loop.
That's what we built Flow to handle. If you're deploying the Agent Governance Toolkit and want to add output validation to the pipeline, try dropping a Flow gate between the governed agent and your downstream system. The free tier gives you 500 validation runs per month and three gates, enough to see how the two layers work together in practice.
Rynko Flow is a validation gateway for AI agent outputs. Try it free or read the docs.