DEV Community: Srijith Kartha

Madrigal's "Failures as Eval Suites" Pattern and How Flow Already Provides the Infrastructure

Srijith Kartha — Wed, 06 May 2026 08:23:56 +0000

A blog post on LangChain's site about how Madrigal Pharmaceuticals built their multi-agent AI platform caught my attention this week. Not because of the architecture — orchestrator routing, parallel agents, shared workspace — that part is well-trodden ground by now. What stood out was one sentence buried in their quality assurance section:

"Production failures feed back into our LangSmith datasets automatically. Every meaningful error becomes a new test case. The eval suite grows from real failures, not synthetic scenarios."

This is the most underappreciated pattern in production AI right now: using your validation failures as your evaluation dataset.

Why Synthetic Evals Aren't Enough

Most teams building agent evaluation suites start the same way: write 50-100 synthetic test cases, run the agent against them, measure pass rates. It works for getting to v1. Then the agent goes to production, and you discover failure modes that none of your synthetic cases covered — because you couldn't have imagined them.

An invoice agent that passes every test but occasionally drops the currency field when the source data uses ISO 4217 codes it hasn't seen before. A contract extraction agent that handles English perfectly but silently returns empty arrays for bilingual documents. A code review agent that flags security issues correctly 98% of the time but gives a false positive on every string comparison that happens to contain "eval" in a variable name.

These are the failures that matter, and they only surface in production. Madrigal's insight was to capture them automatically and feed them back into their eval datasets. Their eval suite doesn't just test what they thought could go wrong — it tests what actually went wrong.

What Madrigal Built Custom

Madrigal's setup, based on the LangChain post, works roughly like this:

1. Agents process tasks in production
2. LangSmith traces every tool call, retrieval, and decision
3. When a meaningful error occurs, it's captured and added to a LangSmith dataset
4. That dataset becomes a test case for future agent iterations
5. LLM-as-judge evaluators grade full agent runs, using criteria that "mirror real end-user business feedback forms"

This is a solid approach, and LangSmith's tracing gives them visibility into the full chain of agent decisions. But there are two things worth noting about how they built it:

First, the evaluation is primarily LLM-based. They use "LLM-as-judge graders" that score outcomes. There's no mention of deterministic validation — schema checks, type enforcement, business rule evaluation — happening before the LLM judges the output.

Second, the failure capture appears to happen at the tracing level. When something goes wrong, the trace captures it. But traces capture everything — they don't inherently distinguish between "the agent made a wrong tool call" and "the agent's final output violated a business constraint." Both end up in the same dataset, and someone has to curate which failures are actually useful as eval cases.

Flow Already Captures This — With Structure

When we built Flow, we weren't thinking about it as an eval dataset generator. We built it as a validation gateway: agent submits output, Flow checks it against a schema and business rules, the output is either approved, rejected, or routed to a human. But it turns out that every rejected run is, by definition, a production failure case with structured metadata about what went wrong.

Here's what a Flow rejection looks like:

{
  "success": false,
  "status": "REJECTED",
  "errors": [
    {
      "type": "business_rule",
      "rule": "total === lineItems.reduce((sum, item) => sum + item.amount, 0)",
      "message": "Line item amounts must sum to total",
      "computed": { "total": 1250, "sum": 1150 }
    },
    {
      "type": "schema",
      "field": "currency",
      "message": "Required field missing"
    }
  ]
}

And when AI Judge is enabled, you also get:

{
  "aiJudge": {
    "criteria": [
      {
        "criterion": "Invoice must include a payment terms clause",
        "verdict": "fail",
        "confidence": 0.92,
        "reasoning": "No payment terms found in the document. Net-30/60/90 clause is missing."
      },
      {
        "criterion": "All monetary values must use two decimal places",
        "verdict": "pass",
        "confidence": 0.97,
        "reasoning": "All amounts use consistent two-decimal formatting."
      }
    ]
  }
}

Every rejection tells you exactly which rules failed, what the computed values were, and (with AI Judge) which qualitative criteria the output didn't meet. That's not a raw error log — it's a structured eval case with a grading rubric attached.
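To make that concrete, here is a minimal sketch of turning one rejection into an eval-case record. The function name `rejection_to_eval_case` and the output field names (`agent_output`, `failed_checks`, `target`) are assumptions made for illustration; only the shape of the rejection itself comes from the example above.

```python
import json

# Rejection in the shape shown above (trimmed to the fields used here).
SAMPLE_REJECTION = {
    "success": False,
    "status": "REJECTED",
    "errors": [
        {
            "type": "business_rule",
            "rule": "total === lineItems.reduce((sum, item) => sum + item.amount, 0)",
            "message": "Line item amounts must sum to total",
            "computed": {"total": 1250, "sum": 1150},
        },
        {"type": "schema", "field": "currency", "message": "Required field missing"},
    ],
}


def rejection_to_eval_case(agent_output: dict, rejection: dict) -> dict:
    """Flatten one rejected run into a record an eval pipeline could store.

    The record layout is hypothetical, not a Flow or LangSmith format.
    """
    failed_checks = []
    for err in rejection.get("errors", []):
        failed_checks.append({
            "type": err.get("type"),
            # Business-rule errors carry "rule"; schema errors carry "field".
            "target": err.get("rule") or err.get("field"),
            "message": err.get("message"),
            "computed": err.get("computed"),
        })
    return {
        "agent_output": agent_output,
        "status": rejection.get("status"),
        "failed_checks": failed_checks,
    }


case = rejection_to_eval_case({"total": 1250, "lineItems": []}, SAMPLE_REJECTION)
jsonl_line = json.dumps(case)  # one line of a JSONL eval dataset
```

Each `failed_checks` entry doubles as the grading rubric for that case: a future agent version passes it only when none of those checks fail again.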

Three Things Flow Captures That Traces Don't

1. Deterministic failure reasons before LLM evaluation

As described in the post, Madrigal's LLM-as-judge graders score full agent runs. Flow's business rules catch schema violations and constraint failures before AI Judge even fires. If your agent submitted amount: -500, you don't need an LLM to tell you that's wrong; you need a rule that says amount > 0. The deterministic layer filters out failures that are unambiguous, so your eval dataset isn't polluted with cases where the answer is obvious.
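The ordering can be sketched as follows. This is an illustration of the idea, not Flow's implementation: `run_gate`, `RULES`, the `judged` flag, and the stub judge are all hypothetical names, and a real judge would be an LLM call.

```python
from typing import Callable

# Each deterministic rule: (expression label, check function, failure message).
RULES = [
    ("amount > 0", lambda p: p.get("amount", 0) > 0, "Amount must be positive"),
]


def run_gate(payload: dict, judge: Callable[[dict], list]) -> dict:
    """Run cheap deterministic rules first; call the judge only if they pass."""
    errors = [
        {"type": "business_rule", "rule": label, "message": message}
        for label, check, message in RULES
        if not check(payload)
    ]
    if errors:
        # Unambiguous failure: reject without spending an LLM call.
        return {"status": "REJECTED", "errors": errors, "judged": False}
    criteria = judge(payload)
    failed = any(c["verdict"] == "fail" for c in criteria)
    return {
        "status": "REJECTED" if failed else "APPROVED",
        "errors": [],
        "aiJudge": {"criteria": criteria},
        "judged": True,
    }


judge_calls = []


def stub_judge(payload: dict) -> list:
    """Stand-in for an LLM judge; records that it was invoked."""
    judge_calls.append(payload)
    return [{"criterion": "Invoice must include a payment terms clause", "verdict": "pass"}]


rejected = run_gate({"amount": -500}, stub_judge)
approved = run_gate({"amount": 500}, stub_judge)
```

In this sketch the negative amount is rejected by the rule alone, so the judge is invoked only once, for the payload that passed the deterministic layer.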

2. Per-criterion failure granularity

When a Flow run fails AI Judge, you don't just get "the output was bad." You get per-criterion verdicts — which specific criteria failed, with what confidence, and why. If your agent keeps failing the "payment terms clause" criterion but passing everything else, that's a specific, actionable signal. You know exactly what to fix in your prompt or retrieval pipeline.

3. Human override signal

When a run goes to REVIEW_REQUIRED and a human approver overrides the AI's verdict — either accepting something the AI rejected or rejecting something the AI approved — that's a calibration signal for your eval criteria. If humans are consistently overriding a specific AI Judge criterion, the criterion is probably wrong or too strict. This is eval metadata you can't get from traces alone.
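One way to use that signal is to compute, per criterion, how often the human verdict disagreed with the AI verdict. The review record shape below (`criterion`, `ai_verdict`, `human_verdict`) is an assumption made for this sketch, not a documented Flow export format.

```python
from collections import defaultdict


def override_rates(reviews: list) -> dict:
    """Return, per criterion, the fraction of reviews where the human disagreed with the AI."""
    totals = defaultdict(int)
    overrides = defaultdict(int)
    for review in reviews:
        criterion = review["criterion"]
        totals[criterion] += 1
        if review["ai_verdict"] != review["human_verdict"]:
            overrides[criterion] += 1
    return {criterion: overrides[criterion] / totals[criterion] for criterion in totals}


# Hypothetical review history for two criteria.
reviews = [
    {"criterion": "payment terms clause", "ai_verdict": "fail", "human_verdict": "pass"},
    {"criterion": "payment terms clause", "ai_verdict": "fail", "human_verdict": "pass"},
    {"criterion": "payment terms clause", "ai_verdict": "fail", "human_verdict": "fail"},
    {"criterion": "payment terms clause", "ai_verdict": "pass", "human_verdict": "pass"},
    {"criterion": "two decimal places", "ai_verdict": "pass", "human_verdict": "pass"},
]

rates = override_rates(reviews)
```

A criterion whose override rate stays high across many reviews is a candidate for rewording or loosening before it is trusted as an eval rubric.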

The Feedback Loop

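In practice, the feedback loop works like this: the agent submits its output, the gate validates it, rejected runs become eval cases, and human overrides recalibrate the criteria.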

The gate doesn't just validate — it generates the data that makes your agent better over time. Every failure is a test case. Every human override is a calibration signal. The eval suite grows from production reality, not from what you guessed might go wrong.

And because the rejection data is structured JSON with specific rule IDs, field names, and computed values, you can aggregate failure patterns across thousands of runs. "This agent fails the amount > 0 rule 3% of the time" is something you can track on a dashboard and watch improve over sprint cycles.
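As a rough sketch of that aggregation, the snippet below computes a per-rule failure rate across a batch of runs. The run shape (a dict with an `errors` list) follows the rejection example earlier in the post; the function name and the idea of keying on `rule` or `field` are assumptions for illustration.

```python
from collections import Counter


def rule_failure_rates(runs: list) -> dict:
    """Fraction of runs in which each rule (or schema field) failed at least once."""
    failures = Counter()
    for run in runs:
        # Use a set so a rule is counted at most once per run.
        failed = {err.get("rule") or err.get("field") for err in run.get("errors", [])}
        failures.update(failed)
    total = len(runs)
    if total == 0:
        return {}
    return {target: count / total for target, count in failures.items()}


# 100 hypothetical runs: 3 fail "amount > 0", the rest are clean.
runs = [{"errors": [{"type": "business_rule", "rule": "amount > 0"}]}] * 3
runs += [{"errors": []}] * 97

rates = rule_failure_rates(runs)
```

Running the same computation per sprint or per agent version gives the trend line the dashboard would plot.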

What We're Building Next

Reading the Madrigal post pushed us to think about this more explicitly. Right now, all the rejection data exists in Flow's run history — you can query it via the API, see it in the dashboard, and build reports on it. But we haven't made it easy to export that data into external eval platforms.

Here's what's on the roadmap:

Failure pattern dashboard. Group rejected runs by which rules and criteria failed most often. Instead of scrolling through individual runs, see at a glance that 40% of your rejections are the same missing field.

Eval dataset export. One-click export of rejected runs (with payloads, failure reasons, and human override data) in formats that work with eval tools — LangSmith datasets, Braintrust, or plain JSONL for custom pipelines.

Webhook on rejection. Fire a webhook when a run is rejected, with the full failure payload. Pipe it directly into your CI, your eval pipeline, or a Slack channel where your agent developers see it in real time.

Gate-level eval metrics. Track how your failure rate changes over time for each gate. When you push a new agent version, see whether the rejection rate went up or down — not just whether your synthetic eval suite passed.

How This Complements LangSmith

I want to be clear: this isn't a "Flow vs. LangSmith" argument. LangSmith does tracing, observability, and eval orchestration — it gives you visibility into what the agent did and why. Flow does output validation — it gives you a verdict on whether the agent's final output meets your requirements. They solve different problems, and using both makes the feedback loop stronger.

The ideal setup looks like this: LangSmith traces the agent's reasoning chain. Flow validates the agent's output. When Flow rejects a run, that rejection — with its structured failure reasons — gets added to a LangSmith dataset. Now your eval suite tests both the reasoning (via traces) and the output quality (via Flow's validation criteria). The agent gets better at both thinking and producing.

Madrigal built the output validation part custom. If they'd had Flow, they could've defined their validation criteria as gates and gotten structured rejection data for free, instead of parsing it out of traces.

Getting Started

If you're running agents in production and want to start building an eval suite from real failures:

1. Define a Flow gate with the schema and business rules your agent's output needs to satisfy
2. Add AI Judge criteria for the qualitative checks you can't express as rules
3. Point your agent at the gate (one API call per submission)
4. Pull rejected runs from the API and add them to your eval dataset
5. Watch your rejection rate decrease as your agent improves

The free tier gives you 500 runs per month and three gates. Enough to validate whether this pattern works for your pipeline before committing to it.

Rynko Flow is a validation gateway for AI agent outputs. Try it free or read the docs.

Microsoft's Agent Governance Toolkit and Where Rynko Flow Fits In

Srijith Kartha — Sun, 22 Mar 2026 02:29:54 +0000

Microsoft just open-sourced the Agent Governance Toolkit, a runtime governance platform that covers all 10 risks in the OWASP Agentic Top 10. I've spent the morning reading through the architecture, benchmarks, and OWASP compliance docs, and it's one of the most thorough agent governance frameworks I've seen from any company, open-source or otherwise.

Policy evaluation at 0.012ms latency.
Ed25519 cryptographic agent identity with trust scoring.
Four-tier execution rings with kill switches.
Circuit breakers and chaos engineering for reliability.
Adapters for 12+ frameworks including LangChain, AutoGen, CrewAI, and Google ADK.
6,100+ tests. MIT licensed.

This is the kind of infrastructure that the agentic ecosystem desperately needs, and Microsoft giving it away for free accelerates the entire space.

It also makes me more confident about the bet we've been making at Rynko, because the toolkit solves a genuinely hard set of problems that we don't solve — and it leaves room for the specific problem that we do.

What the Toolkit Does Well

The toolkit has four components, and each one addresses a real production concern that teams building agentic systems struggle with.

Agent OS is the policy engine. Every agent action passes through it before execution. You define capabilities like which tools the agent can call, resource limits like token budgets and API call caps, and content policies. It evaluates these at sub-millisecond latency: 72,000 policy evaluations per second for single rules, 31,000 for 100-rule policies. Custom policies can be written in OPA/Rego or Cedar, so teams can reuse their existing policy infrastructure rather than learn a new DSL, which is a thoughtful design choice.

AgentMesh handles identity and inter-agent trust. Every agent gets Ed25519 cryptographic credentials. Trust scores on a 0–1000 scale determine what an agent can do, e.g., a score of 900+ gets verified partner access, while a score below 300 gets read-only access. Communication between agents is encrypted through trust gates, and AgentMesh bridges the A2A, MCP, and IATP protocols. The trust scoring model is particularly well thought out: new agents default to 500 and progress based on compliance history, which mirrors how you'd onboard a new team member with gradually expanding permissions.

Agent Runtime is the execution supervisor. It uses four privilege rings to isolate what agents can touch. Saga orchestration coordinates multi-step operations. Kill switches terminate non-compliant agents, and append-only audit logs record everything for forensic replay.

Agent SRE provides reliability engineering: SLO enforcement, error budgets, circuit breakers to prevent cascading failures, replay debugging, and chaos engineering. These are the production observability patterns you'd expect from a team that runs Azure at scale.

All four components work together to answer a fundamental question: is this agent allowed to do what it's trying to do, and is it doing it safely?

This is genuinely hard infrastructure to build correctly. Identity, policy enforcement, execution isolation, and reliability engineering each have deep rabbit holes, and Microsoft has the engineering depth to go down all of them properly.

Where Flow Adds a Complementary Layer

The toolkit governs agent behavior: permissions, identity, execution boundaries, reliability. Flow governs agent output, i.e., the actual data the agent produces when it completes an action.

These are different concerns. The toolkit ensures the agent is authorized and operating safely. Flow ensures the data the agent produces is correct and hasn't been tampered with before reaching the downstream system.

One reasonable question to ask would be: couldn't AgentMesh's trust gates or the Agent OS policy engine handle data validation too? Technically, you could write OPA/Rego policies that inspect payload fields — Rego is expressive enough to check input.payload.amount > 0. But policy engines are designed to return allow/deny decisions, not structured validation errors with field-level messages that an agent can use to self-correct and resubmit. You'd also be mixing authorization concerns with domain-specific business logic in the same policy files. Also, you wouldn't get HMAC-based payload verification or human approval routing. It's a bit like using a firewall for input validation — it can inspect packet contents, but that doesn't make it the right layer for checking whether an invoice total matches its line items.

Think about the OWASP compliance mapping in the toolkit. ASI-05 addresses unexpected code execution through privilege rings and sandboxing, which ensures that the agent can't run arbitrary code. That's the right control for that risk. But once the agent produces a result through an approved tool call (an invoice, a purchase order, a compliance report), there's a different question to answer: is the data in that result actually correct?

An agent can be fully authorized, properly authenticated, running within its privilege ring, with no circuit breaker tripped. The policy engine approved the action. And the agent still submits "currency": "usd" instead of "USD", calculates a total that's off by a rounding error, or drops a required field. These are domain-specific data quality issues that a behavioral governance layer isn't designed to catch and, honestly, shouldn't try to catch: doing so would mix concerns and bloat the policy engine with domain logic.

This is what Flow was built for. You define a gate with a schema and business rules specific to your domain, and the agent's output gets validated before it reaches the downstream system. Failed validations return structured errors that the agent can use to self-correct. Passed validations return a validation_id, an HMAC-SHA256 hash of the validated payload, which the downstream system can independently verify.

How the Two Layers Work Together

The distinction maps to how we think about security in traditional systems. Authentication and authorization tell you who's making a request and whether they're allowed to. Input validation tells you whether the data they're sending is well-formed and correct. You've always needed both. The agentic world isn't different.

Layer	Question	Microsoft Toolkit	Rynko Flow
Identity	Who is this agent?	Ed25519 credentials, trust scores	API key auth
Authorization	Can it call this tool?	Policy engine, capability model	-
Execution	Is it running safely?	Privilege rings, sandboxing	-
Reliability	Will failures cascade?	Circuit breakers, SLOs	-
Output correctness	Is the data valid?	-	Schema + business rules
Output integrity	Was the data tampered?	-	HMAC verification
Human oversight	Should a person review?	-	Approval routing

The toolkit handles the first four rows, from identity through reliability. Flow handles the last three. Together, they cover the pipeline end to end.

A Practical Example

Say you have an order processing agent running in an environment with the toolkit deployed. The policy engine confirms the agent has permission to submit orders. AgentMesh verified its identity. The runtime supervisor confirmed it's operating within its privilege ring.

The agent submits this order:

{
  "order_id": "ORD-2847",
  "vendor": "Acme Corp",
  "amount": -500,
  "currency": "usd",
  "line_items": []
}

From the toolkit's perspective, everything checks out. The agent was authorized, authenticated, and operating within bounds. The policy engine approved the action. And it should approve it — the toolkit's job is to enforce behavioral governance, not validate business data.

Flow picks up where the toolkit leaves off. A gate with the appropriate schema and rules catches three issues:

{
  "success": false,
  "errors": [
    { "field": "amount", "message": "Must be >= 0" },
    { "field": "currency", "message": "Must be one of: USD, EUR, GBP" },
    { "rule": "line_items.length > 0", "message": "Must have at least one line item" }
  ]
}

The agent self-corrects using the structured feedback, resubmits, and gets a validation_id on success. The downstream system verifies the ID before accepting the data. The toolkit made sure the right agent submitted the order safely. Flow made sure the order itself was correct.
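For illustration, the three checks in this example can be sketched as a small validator. This is a hand-written stand-in under stated assumptions, not how Flow evaluates gates: `validate_order` and `ALLOWED_CURRENCIES` are hypothetical names, and the error shapes are copied from the response above.

```python
ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}


def validate_order(order: dict) -> dict:
    """Return structured, field-level errors for the order example."""
    errors = []

    amount = order.get("amount")
    if not isinstance(amount, (int, float)) or amount < 0:
        errors.append({"field": "amount", "message": "Must be >= 0"})

    # Case-sensitive on purpose: "usd" is not an allowed value.
    if order.get("currency") not in ALLOWED_CURRENCIES:
        errors.append({"field": "currency", "message": "Must be one of: USD, EUR, GBP"})

    if len(order.get("line_items") or []) == 0:
        errors.append({
            "rule": "line_items.length > 0",
            "message": "Must have at least one line item",
        })

    return {"success": not errors, "errors": errors}


bad_order = {
    "order_id": "ORD-2847",
    "vendor": "Acme Corp",
    "amount": -500,
    "currency": "usd",
    "line_items": [],
}
result = validate_order(bad_order)
```

The point of returning every failure with its field or rule, rather than a single deny, is that the agent can fix all three problems in one resubmission.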

Performance — Both Layers Are Essentially Free

One thing the toolkit's benchmarks highlight is that governance overhead should be invisible relative to LLM latency. Their policy evaluation adds 0.01–0.1ms. An LLM API call takes 200–3,000ms. I think they're exactly right about this — governance shouldn't be the bottleneck, and at those numbers it never will be.

Flow operates at a different timescale because it's doing more work per evaluation — parsing payloads, validating schemas against variable arrays, running expression-based business rules through a recursive descent parser. Our benchmarks show ~50ms server-side validation for enterprise-scale payloads (21 schema variables, 10 business rules, 900 line items in a single payload). For typical payloads (a few KB), it's single-digit milliseconds.

Combined, both layers add maybe 50–60ms to a pipeline where the LLM inference took 500–3,000ms. You're paying a negligible cost for behavioral governance and output validation together.

The Bigger Picture

Between the OWASP Agentic Top 10, the AWS Agentic AI Security Scoping Matrix, Snapchat's Auton framework, and now Microsoft's toolkit, the industry is converging on something I think is important: agent governance is not a single problem with a single solution. It's a stack of specialized layers, each addressing different risks at different points in the pipeline.

Microsoft releasing this toolkit validates the category in a way that benefits everyone building in the space. When the company that runs Azure tells the world "agent governance is infrastructure, here's our reference implementation for free," it moves the conversation from "do we need agent governance?" to "which layers do we still need to add?"

We think output validation is one of those layers. Not because the toolkit missed something, but because domain-specific data correctness is a separate concern that deserves its own specialized tooling. Checking whether an invoice has the right currency code, whether an order total matches its line items, or whether a compliance report includes all required fields isn't a policy evaluation problem. It's a schema and business rule problem with optional human review in the loop.

That's what we built Flow to handle. If you're deploying the Agent Governance Toolkit and want to add output validation to the pipeline, try dropping a Flow gate between the governed agent and your downstream system. The free tier gives you 500 validation runs per month and three gates — enough to see how the two layers work together in practice.

IBM's $11 Billion Confluent Acquisition, AWS + Cerebras, and Where Output Validation Fits In

Srijith Kartha — Wed, 18 Mar 2026 08:08:21 +0000

IBM is solving real-time data for agents. AWS is solving inference speed. Both are foundational. Here's how Rynko Flow adds an output validation layer to complement what they're building.

Two announcements in the same week paint a clear picture of where enterprise AI infrastructure is headed, and both of them are exciting.

IBM closed its $11 billion acquisition of Confluent, the Kafka-based streaming platform used by 40% of Fortune 500 companies. The thesis is sound: enterprises moving from AI experimentation to production need live, continuously flowing data — not batch exports that arrive hours late. As Rob Thomas (IBM SVP) put it, "AI decisions need to happen just as fast" as the transactions generating the data. That's exactly right, and Confluent is the best platform in the world for making it happen.

Meanwhile, AWS announced a collaboration with Cerebras to bring wafer-scale inference to Amazon Bedrock. The CS-3 delivers thousands of times more memory bandwidth than the fastest GPU, targeting the decode bottleneck that slows agentic workloads. Andrew Feldman (Cerebras CEO) called it "blisteringly fast inference." Their disaggregated architecture pairs Trainium for compute-heavy prefill with Cerebras WSE for bandwidth-heavy token generation — an order of magnitude faster inference than what's available today. For anyone building real-time agentic workflows, this is a big deal.

These are the kind of infrastructure investments that make agentic systems practical at enterprise scale. They also got me thinking about where Rynko Flow fits into this picture.

The Pipeline and Where Each Layer Contributes

The enterprise AI pipeline looks roughly like this:

IBM + Confluent handle the input: getting live, governed, trustworthy data to the agent. AWS + Cerebras handle the processing: making the agent produce output fast enough for real-time operations. Both are necessary — an agent making decisions on stale data is worse than no agent at all, and an agent that takes 30 seconds to respond isn't useful for time-sensitive workflows.

What we've been focused on at Rynko is the next step in that pipeline: once the agent processes that real-time data at speed and produces a result — an invoice, a purchase order, a compliance report — how do you validate that the result is correct before it reaches the downstream system?

This is a genuinely different problem from data freshness or inference speed, and it's the problem we built Flow to solve. Even with perfect input data, agents can submit "usd" instead of "USD", produce a total that's off by a rounding error, or silently drop a required field. The data flowing in was pristine. The processing was fast. The output still needs a checkpoint.

What Flow Adds to the Pipeline

Flow is a validation gateway that sits between the agent's output and your downstream systems. You define a gate with a schema and business rules, the agent submits its output, and Flow validates it before the data moves forward. Failed submissions return structured errors the agent can use to self-correct. Passed submissions return a tamper-proof validation_id that the downstream system can verify to confirm nothing was modified in transit.
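The submit, correct, resubmit cycle can be sketched as a bounded retry loop. Everything here is a stand-in: `stub_gate` imitates a gate locally instead of calling Flow, `uppercase_currency` plays the role of the agent's correction step, and the `val_stub` identifier is a placeholder, not a real validation_id.

```python
def submit_with_retries(payload: dict, submit, correct, max_attempts: int = 3) -> dict:
    """Submit a payload, feeding structured errors back into a correction step."""
    for attempt in range(1, max_attempts + 1):
        result = submit(payload)
        if result["success"]:
            return {
                "attempts": attempt,
                "validation_id": result["validation_id"],
                "payload": payload,
            }
        if attempt < max_attempts:
            payload = correct(payload, result["errors"])
    return {"attempts": max_attempts, "validation_id": None, "payload": payload}


def stub_gate(payload: dict) -> dict:
    """Local stand-in for a gate: only checks the currency enum."""
    if payload.get("currency") not in {"USD", "EUR", "GBP"}:
        return {
            "success": False,
            "errors": [{"field": "currency", "message": "Must be one of: USD, EUR, GBP"}],
        }
    return {"success": True, "validation_id": "val_stub"}


def uppercase_currency(payload: dict, errors: list) -> dict:
    """Toy correction step: fix the currency casing if that field was flagged."""
    fixed = dict(payload)
    if any(err.get("field") == "currency" for err in errors):
        fixed["currency"] = str(fixed.get("currency", "")).upper()
    return fixed


outcome = submit_with_retries({"currency": "usd", "amount": 500}, stub_gate, uppercase_currency)
```

Capping the attempts matters: an agent that cannot act on the errors should stop and surface the failure rather than loop indefinitely.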

Say you have an order processing agent. Confluent is streaming real-time order events from your POS systems, inventory databases, and payment providers. The agent processes these events and produces a purchase order to send downstream. Here's the Flow gate that checks the agent's output:

Schema:
  - order_id: string, required
  - vendor: string, required
  - amount: number, required, min 10
  - currency: string, required, enum [USD, EUR, GBP]
  - line_items: array of objects, required

Business Rules:
  - amount > 0 ("Order amount must be positive")
  - amount <= 100000 ("Single order cannot exceed $100,000")
  - line_items.length > 0 ("Must have at least one line item")

The agent submits its payload. Flow validates it against the schema and evaluates every business rule. If the agent submitted amount: -500, it gets back:

{
  "success": false,
  "status": "validation_failed",
  "errors": [
    { "rule": "amount > 0", "message": "Order amount must be positive" }
  ]
}

The agent self-corrects and resubmits. When validation passes, the response includes a validation_id:

{
  "success": true,
  "status": "validated",
  "validation_id": "val_4f546e9bcb76f120c4984d72"
}

That validation_id is an HMAC-SHA256 hash of the validated payload, computed using canonical JSON serialization with recursively sorted keys. This means even if the payload passes through multiple systems that reorder the JSON keys or reformat the whitespace, the verification still works. The downstream system receives the payload and the validation_id from the agent, then calls Flow to verify:

curl -X POST https://api.rynko.dev/api/flow/verify \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "validation_id": "val_4f546e9bcb76f120c4984d72",
    "payload": { "order_id": "ORD-001", "vendor": "Acme", "amount": 500, ... }
  }'

{
  "verified": true,
  "runId": "550e8400-e29b-41d4-a716-446655440000",
  "gateName": "Order Validation",
  "gateSlug": "order-validation"
}

If the agent tampered with the payload after validation — changed the amount, added a field, removed a required value — verification returns verified: false. The downstream system knows not to trust the data.
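The key-order independence comes from hashing a canonical serialization rather than the raw bytes. Here is a minimal sketch of that technique using Python's standard library. It is an assumption-laden illustration: the secret, the function names, and the full hex digest output are hypothetical, and Flow's actual ID format (the `val_` prefix and length) and key handling are not described in this post.

```python
import hashlib
import hmac
import json


def canonical_json(payload) -> bytes:
    """Serialize with keys sorted at every nesting level and no extra whitespace."""
    return json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def sign(payload, secret: bytes) -> str:
    """HMAC-SHA256 over the canonical form, as a hex string."""
    return hmac.new(secret, canonical_json(payload), hashlib.sha256).hexdigest()


def verify(payload, signature: str, secret: bytes) -> bool:
    """Constant-time comparison of the recomputed and presented signatures."""
    return hmac.compare_digest(sign(payload, secret), signature)


SECRET = b"demo-secret"  # hypothetical key, for illustration only

validated = {"order_id": "ORD-001", "vendor": "Acme", "amount": 500,
             "meta": {"b": 2, "a": 1}}
signature = sign(validated, SECRET)

# Same data, different key order at both levels.
reordered = {"meta": {"a": 1, "b": 2}, "amount": 500,
             "vendor": "Acme", "order_id": "ORD-001"}
tampered = dict(validated, amount=5000)
```

Reordering keys leaves the signature valid while changing any value invalidates it. Note that canonicalization in this sketch does not normalize number formatting, so `500` and `500.0` would sign differently.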

Validation Doesn't Have to Be a Bottleneck

One concern I hear is whether validation adds meaningful latency to a real-time pipeline. We benchmarked Flow against enterprise-scale payloads — the kind of data you'd see flowing through Kafka in a large manufacturing or logistics operation.

We tested with a Sterling Commerce OMS-style order payload: 21 schema variables, 10 business rules, 900 order line items. The payloads were around 9MB for XML and 7.3MB for JSON.

Metric	XML (9.1 MB)	JSON (7.3 MB)
Total round-trip	4,989 ms	4,401 ms
Server-side validation	~50 ms	~50 ms
Network upload (at ~800 KB/s)	~3,800 ms	~3,100 ms

The validation itself — schema checks plus 10 business rule evaluations — takes about 50 milliseconds. The rest is network transfer. At typical payload sizes (a few KB for a single order or invoice), the validation adds single-digit milliseconds. For a 30-line order at 0.3MB, the total round-trip was 1,960ms with most of that being upload time over a standard connection.

Server-side processing is fast because Flow runs validation in-memory: schema validation against a pre-compiled variable array, then expression evaluation through a recursive descent parser for each business rule. No database queries during validation. Persistence runs asynchronously after the response is sent — payloads go to S3, run metadata goes to Postgres, both fire-and-forget so the agent gets its response immediately.
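To show what expression evaluation through a recursive descent parser involves, here is a deliberately small sketch. It is not Flow's parser: the grammar covers only numbers, plain variable names, arithmetic, parentheses, and a single comparison, so dotted paths such as line_items.length are out of scope, and all names are hypothetical.

```python
import operator
import re

TOKEN = re.compile(r"\s*(?:(\d+(?:\.\d+)?)|([A-Za-z_]\w*)|(<=|>=|==|!=|<|>|[+\-*/()]))")
COMPARE = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
           ">=": operator.ge, "==": operator.eq, "!=": operator.ne}


def tokenize(text: str) -> list:
    tokens, pos, text = [], 0, text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise ValueError(f"Unexpected character at {pos}: {text[pos]!r}")
        number, name, op = match.groups()
        if number is not None:
            tokens.append(("num", float(number)))
        elif name is not None:
            tokens.append(("name", name))
        else:
            tokens.append(("op", op))
        pos = match.end()
    return tokens


class RuleParser:
    """comparison := additive (cmp additive)? ; additive := product (+|- product)* ;
    product := atom (*|/ atom)* ; atom := number | name | -atom | ( additive )"""

    def __init__(self, tokens: list, variables: dict):
        self.tokens, self.pos, self.variables = tokens, 0, variables

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else (None, None)

    def take(self):
        token = self.peek()
        self.pos += 1
        return token

    def comparison(self):
        left = self.additive()
        kind, value = self.peek()
        if kind == "op" and value in COMPARE:
            self.take()
            return COMPARE[value](left, self.additive())
        return left

    def additive(self):
        value = self.product()
        while self.peek() in (("op", "+"), ("op", "-")):
            _, op = self.take()
            rhs = self.product()
            value = value + rhs if op == "+" else value - rhs
        return value

    def product(self):
        value = self.atom()
        while self.peek() in (("op", "*"), ("op", "/")):
            _, op = self.take()
            rhs = self.atom()
            value = value * rhs if op == "*" else value / rhs
        return value

    def atom(self):
        kind, value = self.take()
        if kind == "num":
            return value
        if kind == "name":
            return self.variables[value]
        if (kind, value) == ("op", "-"):
            return -self.atom()
        if (kind, value) == ("op", "("):
            inner = self.additive()
            if self.take() != ("op", ")"):
                raise ValueError("Expected ')'")
            return inner
        raise ValueError(f"Unexpected token: {value!r}")


def evaluate_rule(expression: str, variables: dict):
    parser = RuleParser(tokenize(expression), variables)
    result = parser.comparison()
    if parser.pos != len(parser.tokens):
        raise ValueError("Trailing tokens in expression")
    return result
```

For example, `evaluate_rule("amount > 0", {"amount": -500})` evaluates the first business rule from the gate above against the rejected payload. Because evaluation is a single in-memory pass over a handful of tokens, the per-rule cost stays small relative to network transfer.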

For Kafka-speed pipelines where even 50ms matters, Flow also supports webhook delivery — validation happens, and the validated payload is pushed directly to your endpoint without the agent needing to relay it. That eliminates the agent-as-middleman entirely.

All Three Layers Together

Here's how I'd architect an agentic pipeline with all three layers working together:

Confluent handles data-in-motion: live events, governed, streaming at scale. Cerebras on Bedrock processes those events fast — the disaggregated Trainium+WSE architecture means agents produce structured output at speeds that make real-time workflows practical. Flow validates that output against your schema and business rules, returns errors for self-correction or a tamper-proof validation_id on success. The downstream system verifies the validation_id before accepting the data.

Three separate problems, three separate layers. Each one does its part well. Confluent ensures the agent gets good data. Cerebras ensures the agent processes it fast. Flow ensures the output is correct before it reaches production systems.

Why This Matters Now

I want to be clear: Flow doesn't replace anything IBM, Confluent, AWS, or Cerebras are building. They're solving data infrastructure and inference speed, foundational problems that every enterprise needs addressed. These are massive, hard engineering challenges, and both the acquisition and the partnership reflect the kind of investment this space deserves.

What Flow adds is a complementary output validation layer. As agents move from experimental to production, and as the data flowing through them gets faster and the inference gets cheaper, the volume of agent-generated outputs hitting downstream systems is going to increase significantly. Having a validation checkpoint in that pipeline — one that catches domain-specific errors, enforces business rules, and provides tamper-proof verification — becomes more valuable as the rest of the stack gets faster.

AWS's Agentic AI Security Scoping Matrix (published November 2025) calls out many of the capabilities Flow provides: approval gateway enforcement, agent controls, audit trails, agency perimeters. We've mapped Flow against every scope in that framework — it covers Scopes 2 and 3 well, with partial coverage at Scope 4 where fully autonomous agents need capabilities beyond what a validation gateway provides alone.

If you're building agentic workflows on Kafka, Bedrock, or both, try dropping a Flow gate between your agent and your downstream system. The free tier gives you 500 validation runs per month and three gates — enough to see how output validation fits into your pipeline.

Teaching Gates to Learn: How We Built Intelligence Into Rynko Flow

Srijith Kartha — Mon, 16 Mar 2026 07:14:12 +0000

Flow validates agent outputs against schemas and business rules. But when 60% of agents fail the same rule on first attempt, the gate should be telling you why — and helping you fix it.

The key insight: When agents fail and retry without clear guidance, it's not a minor inconvenience — it's a reliability failure. Every failed correction loop is a moment where your automation is stuck in a cycle of non-compliance. Gate Intelligence identifies the friction points that prevent agents from reaching a successful state, and feeds that knowledge back into the gate's contract so the next agent gets it right on the first try.

When we launched Flow, the pitch was straightforward: define a gate with a schema and business rules, point your agent at it, and Flow validates the payload before it reaches your database. Schema checks, expression-based rules, optional human approval. It works well — agents submit data, gates validate it, failed submissions come back with structured errors the agent can act on.

But we were sitting on a pile of useful data and not doing anything with it.

Every run Flow processes is stored: the input payload, the validation verdict, which rules passed, which failed, and the exact values that caused the failure. For agents that self-correct, we track the full chain — first attempt, second attempt, third, until either the agent gets it right or gives up. That's tens of thousands of data points per gate per week, and until now, it only showed up as numbers on the analytics dashboard.

Gate Intelligence turns that data into concrete suggestions for improving your gates.

The Problem It Solves

Here's a real pattern we saw in our own test gates. We set up an invoice validation gate with five business rules: amount must be positive, currency must be a 3-letter uppercase code, vendor can't be empty, line items must sum to the total, and there must be at least one line item. Standard stuff.

When we ran agents against it, 40% of first attempts failed the currency format rule. The agents were submitting "usd" and "eur" instead of "USD" and "EUR". Another 25% failed the line items sum check — off by a fraction of a cent due to floating-point rounding. 15% of submissions omitted the vendor field entirely.
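The line-item failure is easy to reproduce outside Flow. The snippet below is a plain Python illustration of why an exact sum check trips on binary floating point, and two common ways around it. The variable names and the 0.01 tolerance are illustrative choices, not Flow's rule syntax.

```python
import math
from decimal import Decimal

line_items = [0.10, 0.20]
total = 0.30

# Exact equality fails: 0.10 + 0.20 is 0.30000000000000004 in binary floats.
exact_match = sum(line_items) == total

# Option 1: compare within an absolute tolerance (one cent here).
tolerant_match = math.isclose(sum(line_items), total, abs_tol=0.01)

# Option 2: do the arithmetic in decimal, built from the string form of each value.
decimal_match = sum(Decimal(str(x)) for x in line_items) == Decimal(str(total))

print(exact_match, tolerant_match, decimal_match)
```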

None of these are schema problems. The schema says currency is a string, vendor is required, amount is a number. All correct. The issue is that the gate's documentation and rules don't give agents enough context to get it right on the first try. The currency rule says "must equal its own uppercase version" — technically precise, but a Claude or GPT model reading the MCP tool description doesn't know that means "must be uppercase ISO 4217."

When agents fail and retry without that context, the automation isn't saving time — it's stuck. Each failed loop is a moment where your pipeline is spinning instead of producing results. If an agent needs five attempts and 45 seconds to pass a rule that a single well-placed hint would have fixed on the first try, the gate itself is the bottleneck.

Gate Intelligence identifies these patterns automatically and tells you what to do about them.

What It Computes

Every hour, a background job runs for each active gate. It analyzes the last 7 days of runs and computes six metrics:

Per-rule failure rates — what percentage of first-attempt submissions fail each rule, with trend direction compared to the previous 7-day window. If your "amount must be positive" rule went from 5% failure to 15%, that's flagged as trending up.

Common failure values — the actual values agents submitted that caused failures. For the currency rule, this surfaces "usd", "eur", "gbp" as the top offenders. For a numeric rule, it might show 0, -1, or 99999.999. These values are what make suggestions actionable — instead of "rule X fails a lot," the system can say "agents are submitting lowercase currency codes."

Field omission rates — how often required schema fields are missing from submissions. A 30% omission rate on the vendor field usually means agents don't realize it's required, or the field name isn't clear enough.

Chain convergence — of all the failed submissions that triggered a correction chain, what percentage eventually succeeded? If agents submit, fail, retry, fail again, and give up 70% of the time, that's a fundamental reliability problem. A 30% convergence rate doesn't just mean "some retries": it means fewer than a third of failed submissions ever recover. For any system that's supposed to run autonomously, that's a non-starter.

Average chain length and time-to-correction — how many attempts does it take, and how long does the cycle last? Two attempts averaging 3 seconds is healthy. Five attempts averaging 45 seconds means the agent is struggling.
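The first-attempt metrics above can be sketched in a few lines. This is an illustration under assumptions: the run record shape (`attempt`, `payload`, `failures`), the function names, and the 2-point trend threshold are all hypothetical, and the real job presumably works against stored runs rather than an in-memory list.

```python
from collections import Counter

def first_attempt_metrics(runs, required_fields):
    """Failure rates, common failure values, and omission rates for first attempts."""
    first = [r for r in runs if r["attempt"] == 1]
    total = len(first)
    if total == 0:
        return {"failure_rates": {}, "common_values": {}, "omission_rates": {}}

    failed_runs = Counter()  # rule -> number of first attempts that failed it
    values = {}              # rule -> Counter of offending values
    for run in first:
        for rule in {f["rule"] for f in run["failures"]}:
            failed_runs[rule] += 1
        for f in run["failures"]:
            values.setdefault(f["rule"], Counter())[f["value"]] += 1

    return {
        "failure_rates": {rule: n / total for rule, n in failed_runs.items()},
        "common_values": {rule: [v for v, _ in c.most_common(3)]
                          for rule, c in values.items()},
        "omission_rates": {name: sum(name not in r["payload"] for r in first) / total
                           for name in required_fields},
    }

def trend(current, previous, threshold=0.02):
    """Compare this window's rate with the previous window's (threshold is assumed)."""
    delta = current - previous
    if delta > threshold:
        return "up"
    if delta < -threshold:
        return "down"
    return "flat"


bad = lambda v: [{"rule": "currency_format", "value": v}]
runs = [
    {"attempt": 1, "payload": {"currency": "usd", "vendor": "A"}, "failures": bad("usd")},
    {"attempt": 1, "payload": {"currency": "usd"}, "failures": bad("usd")},
    {"attempt": 1, "payload": {"currency": "eur", "vendor": "B"}, "failures": bad("eur")},
    {"attempt": 1, "payload": {"currency": "USD", "vendor": "C"}, "failures": []},
    {"attempt": 2, "payload": {"currency": "USD", "vendor": "A"}, "failures": []},
]
metrics = first_attempt_metrics(runs, required_fields=["vendor"])
print(metrics)
print(trend(0.15, 0.05))
```

Retries (attempt 2 and later) are excluded on purpose, so a long correction chain does not inflate a rule's failure rate.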

Pattern Detection

Raw metrics tell you what is failing. Pattern detection tells you why.

The system examines common failure values and classifies them into four fixable patterns:

Case mismatch — the submitted value is a lowercase version of what's expected. "usd" vs "USD", "active" vs "Active". This usually means the gate needs to either make the rule case-insensitive or add explicit guidance about expected casing.
Rounding tolerance — a numeric value is within 1% of the expected value but fails an exact comparison because of rounding or floating-point precision. For example, an amount of 99.999 fails an exact equality check where the expected sum is 100.00. The fix is usually adding a small tolerance to the rule.
Type coercion — a string representation of a number where a number is expected. The string "42" instead of the number 42. Common with agents that serialize JSON from natural language.
Empty string — an empty string where a non-empty value is expected. Distinct from a missing field — the agent knows the field exists but doesn't have a value for it.
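A rough sketch of how the four patterns could be told apart for a single failure value. It assumes the classifier is given both the submitted value and the value the rule expected, which the post does not spell out, so `classify_failure` and its inputs are hypothetical.

```python
def _is_number(x):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(x, (int, float)) and not isinstance(x, bool)

def classify_failure(submitted, expected):
    """Return one of the four pattern names, or None if nothing matches."""
    if submitted == "":
        return "empty_string"
    if isinstance(submitted, str) and isinstance(expected, str):
        # Lowercase version of the expected value, e.g. "usd" vs "USD".
        if submitted != expected and submitted == expected.lower():
            return "case_mismatch"
        return None
    if isinstance(submitted, str) and _is_number(expected):
        # A string that parses as a number where a number was expected.
        try:
            float(submitted)
        except ValueError:
            return None
        return "type_coercion"
    if _is_number(submitted) and _is_number(expected) and expected != 0:
        # Within 1% of the expected value, but not equal to it.
        if submitted != expected and abs(submitted - expected) <= 0.01 * abs(expected):
            return "rounding_tolerance"
    return None


print(classify_failure("usd", "USD"))
print(classify_failure(99.999, 100.00))
print(classify_failure("42", 42))
print(classify_failure("", "Acme Corp"))
```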

These patterns feed into the suggestions. Instead of a generic "rule X fails 60% of the time," the suggestion says "the currency format rule fails 60% of the time — agents submit lowercase currency codes (usd, eur, gbp). Consider adding a note that currency must be uppercase ISO 4217."

Suggestions and the Intelligence Tab

Each gate now has an Intelligence tab alongside Configuration and Performance. The tab shows a summary bar with insight counts by severity, per-rule failure rates with trend arrows, field omission rates, chain convergence metrics, and a health trend chart built from historical snapshots.

Below the analysis cards, concrete suggestions appear as dismissable cards with three severity levels:

Severity	Trigger	Example
Critical	Rule fails >50% of first attempts	"Currency format fails 60% — add format guidance"
Critical	Chain convergence below 50%	"Agents give up on this gate 70% of the time"
Warning	Required field missing >30%	"Vendor omitted in 38% of submissions"
Info	Rule never fails (500+ runs)	"This rule may be redundant — 0 failures in 600 runs"
Info	All rules >95% success	"Excellent validation performance"

Each suggestion has three actions: Apply, Dismiss, and Snooze (hide for 7 days).
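The trigger table translates naturally into threshold checks. The sketch below mirrors the thresholds in the table; the function name, input shapes, and message wording are assumptions made for illustration.

```python
def build_suggestions(rule_stats, convergence, omission_rates):
    """rule_stats: {rule: {"failure_rate": float, "runs": int}}; rates are 0..1."""
    suggestions = []
    for rule, s in rule_stats.items():
        if s["failure_rate"] > 0.50:
            suggestions.append(
                ("critical", f"{rule} fails {s['failure_rate']:.0%} of first attempts"))
        elif s["failure_rate"] == 0 and s["runs"] >= 500:
            suggestions.append(
                ("info", f"{rule} may be redundant: 0 failures in {s['runs']} runs"))
    if convergence is not None and convergence < 0.50:
        suggestions.append(
            ("critical", f"Agents give up on this gate {1 - convergence:.0%} of the time"))
    for name, rate in omission_rates.items():
        if rate > 0.30:
            suggestions.append(("warning", f"{name} omitted in {rate:.0%} of submissions"))
    if rule_stats and all(s["failure_rate"] < 0.05 for s in rule_stats.values()):
        suggestions.append(("info", "Excellent validation performance"))
    return suggestions


result = build_suggestions(
    rule_stats={
        "currency_format": {"failure_rate": 0.60, "runs": 200},
        "amount_positive": {"failure_rate": 0.0, "runs": 600},
    },
    convergence=0.30,
    omission_rates={"vendor": 0.38},
)
for severity, message in result:
    print(severity, "-", message)
```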

Version-Controlled Hints

This is where the architecture gets interesting. The first version of "Apply" directly modified the gate's description field — injecting hint text like "Common mistakes: agents submit lowercase currency codes." It worked, but it was a bad design for three reasons.

First, no version control. The description change bypassed the gate's draft/publish pipeline. In regulated environments — banking, healthcare, insurance — operators need to know exactly when and why a gate's contract changed. A direct write to the description is invisible in the version history.

Second, no audit trail. If an agent's behavior shifts after someone clicks "Apply" (for better or worse), there's no correlation between the click and the behavior change.

Third, no review step. The hint goes live immediately. If Gate Intelligence generates five suggestions and the operator clicks Apply on all of them, five changes hit production with no review.

So we changed the approach. Hints are now a first-class versioned field on the gate — stored alongside the schema, business rules, and identity key fields. When you click "Apply" on a suggestion, it adds the hint text to the gate's draft version. If no draft exists, it creates one. The hint doesn't go live until you review it in the gate configurator and publish.

The gate configurator now has a dedicated "Hints" panel sitting between the Details and Schema steps — visible at a glance without opening any dialog. You can see what Intelligence suggested, edit the text, add your own custom hints, or remove ones you don't want. When you're satisfied, you publish the gate version — which goes through the existing audit log, resets circuit breakers, and notifies connected MCP sessions that the tool description has changed.

This means hints get the same treatment as any other gate configuration change: versioned, auditable, rollbackable.
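The draft/publish behavior can be modeled in a small sketch. The class and method names below are hypothetical and the real gate record certainly carries more fields; the point is only that "Apply" touches the draft, and nothing reaches agents until publish.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class GateVersion:
    number: int
    hints: list = field(default_factory=list)
    status: str = "draft"

@dataclass
class Gate:
    published: GateVersion
    draft: Optional[GateVersion] = None
    history: list = field(default_factory=list)

    def apply_hint(self, hint: str) -> None:
        """'Apply' adds the hint to the draft, creating one from published if needed."""
        if self.draft is None:
            self.draft = GateVersion(number=self.published.number + 1,
                                     hints=list(self.published.hints))
        if hint not in self.draft.hints:
            self.draft.hints.append(hint)

    def publish(self) -> None:
        """Promote the draft; the previous version stays in history for rollback."""
        if self.draft is None:
            raise ValueError("nothing to publish")
        self.history.append(self.published)
        self.draft.status = "published"
        self.published, self.draft = self.draft, None


gate = Gate(published=GateVersion(number=1, status="published"))
gate.apply_hint("Currency must be uppercase ISO 4217 (e.g., USD, EUR, not usd)")
live_before_publish = list(gate.published.hints)  # still empty: hint is draft-only
gate.publish()
print(gate.published.number, gate.published.hints)
```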

How Hints Reach the Agent

The MCP tool description for each gate is assembled from three sources:

Submit data to Invoice Validation
Validates invoice payloads before processing to the ERP system.

Business rules:
- amount_positive: Amount must be greater than zero
- currency_format: Currency must be valid ISO 4217
- line_items_match: Line item totals must equal invoice amount

--- Best Practices ---
- Currency must be uppercase ISO 4217 (e.g., USD, EUR, not usd)
- Line item totals must sum to the invoice amount within ±0.01
- Vendor name is required — do not submit an empty string

The gate description is always included. Business rules are always appended so the agent knows the constraints. The "Best Practices" section only appears when the auto-hints toggle is enabled on the gate — it's off by default because it changes what agents see, and the gate owner should make that decision deliberately.
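As a sketch of that assembly logic, the function below builds the description text from a gate record. The record keys (`name`, `description`, `rules`, `hints`, `auto_hints_enabled`) are assumed names for illustration, not Flow's actual data model.

```python
def build_tool_description(gate: dict) -> str:
    """Assemble description + business rules, plus hints when the toggle is on."""
    lines = [f"Submit data to {gate['name']}", gate["description"], "", "Business rules:"]
    lines += [f"- {rule['id']}: {rule['message']}" for rule in gate["rules"]]
    # Hints are read straight from the published gate record (hypothetical keys).
    if gate.get("auto_hints_enabled") and gate.get("hints"):
        lines += ["", "--- Best Practices ---"]
        lines += [f"- {hint}" for hint in gate["hints"]]
    return "\n".join(lines)


gate = {
    "name": "Invoice Validation",
    "description": "Validates invoice payloads before processing to the ERP system.",
    "rules": [
        {"id": "amount_positive", "message": "Amount must be greater than zero"},
        {"id": "currency_format", "message": "Currency must be valid ISO 4217"},
    ],
    "hints": ["Currency must be uppercase ISO 4217 (e.g., USD, EUR, not usd)"],
    "auto_hints_enabled": False,  # off by default, per the gate owner's choice
}
without_hints = build_tool_description(gate)
with_hints = build_tool_description({**gate, "auto_hints_enabled": True})
print(with_hints)
```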

The key improvement over the original architecture: reading hints is now a simple array read from the published gate record, not a separate query against the insights table. The old approach queried the insights service every time an MCP tool description was built, which meant an extra database hit on every tool list request. The new approach reads the hints directly from the gate record, where they were copied at publish time. This matters because MCP tool descriptions are assembled on every session connection and tool refresh. Dropping the extra lookup keeps the tool-build path fast and predictable, which is critical when you're serving multiple concurrent agent sessions.

Historical Snapshots and Trend Analysis

Each time the intelligence job runs, it saves a snapshot with aggregate metrics: total runs, overall failure rate, per-rule failure rates, field omission rates, chain convergence, and suggestion counts. This creates a time series of gate health that's visible in the Intelligence tab as a bar chart.

The chart color-codes each bar: red for failure rates above 50%, amber for 20–50%, and the primary color below 20%. Hovering shows the exact values and date. Over time, you can see whether applying suggestions actually improved the gate's success rate — which is the whole point.
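The color bands reduce to a small mapping. In this sketch the handling of the exact boundaries (20% and 50% both count as amber) is an assumption, since the post only gives the ranges.

```python
def bar_color(failure_rate: float) -> str:
    """Map an overall failure rate (0..1) to a chart color band."""
    if failure_rate > 0.50:
        return "red"
    if failure_rate >= 0.20:
        return "amber"
    return "primary"


print([bar_color(rate) for rate in (0.05, 0.20, 0.35, 0.50, 0.72)])
```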

What's Next

Gate Intelligence today is reactive — it analyzes historical data and suggests improvements. There are two directions on the roadmap:

Proactive schema evolution: if Intelligence detects that agents consistently submit a field that isn't in the schema (say, a tax rate keeps appearing in payloads that only define amount and currency), it suggests adding it. This requires analyzing raw payloads beyond just validation results, which is a different data pipeline.

AI Judge integration: the current business rules are deterministic expressions. We're building an "AI Judge" mode that evaluates payloads using an LLM for semantic checks that can't be expressed as expressions — things like "the description should be professional in tone" or "the address looks like a real postal address." Intelligence would track AI Judge pass/fail rates the same way it tracks expression rules, but the suggestion engine would need to account for the non-deterministic nature of LLM evaluation.

Neither is shipped yet, but the foundation is designed for them. The analysis pipeline, suggestion engine, versioned hints, and snapshot time series are all extensible — adding a new data source feeds into the same pattern detection and suggestion framework without rearchitecting anything.

Getting Started

If you have an active Flow gate with at least 50 runs, Intelligence will start generating insights on the next hourly cycle. Open any gate, click the Intelligence tab, and hit Refresh to trigger analysis immediately.

We're rolling it out gradually — it's available today for all paid tiers (Starter, Growth, Scale) and will be available on the Free tier once we're confident in the compute overhead.

Whether your agents are running on AWS Bedrock, OpenAI's API, or any other provider — the validation layer is where reliability is won or lost. If your gates are rejecting 60% of first attempts and your correction chains converge less than half the time, your automation isn't autonomous. It's just expensive retry logic. Gate Intelligence gives you the data to fix that, and the versioned hints to make the fix stick.

Flow docs: docs.rynko.dev/flow

Get started: app.rynko.dev/signup — free tier, 500 runs/month, 3 gates, no credit card.

How Rynko Flow Maps to the AWS Agentic AI Security Scoping Matrix

Srijith Kartha — Thu, 12 Mar 2026 18:54:10 +0000

AWS published a framework for securing autonomous AI agents. We mapped every scope and security dimension to what Flow does today — and where the gaps are.

When AWS published the Agentic AI Security Scoping Matrix in November 2025, it put language around something we'd been building toward with Rynko Flow for a few months. The framework categorizes agentic AI systems into four scopes based on two axes — agency (what the agent can do) and autonomy (how independently it acts) — and maps six security dimensions across each scope. It's been referenced by OWASP, CoSAI, and multiple systems integrators since publication.

I read through it and realized we'd already implemented a significant portion of what it recommends, particularly at Scopes 2 through 4. But I also found gaps worth being honest about. This post walks through each scope, maps it to Flow's current capabilities, and flags where we're still building.

A Quick Primer on What Flow Does

For context if you haven't seen Flow before: Rynko Flow is a validation gateway that sits between AI agents and downstream systems. You define a gate with a JSON schema and business rules, your agent submits payloads to it, and Flow validates the data before it proceeds. Failed validations return structured errors the agent can use to self-correct. Successful validations return a tamper-proof validation_id. Optionally, you can add human approval steps and webhook delivery.

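The pipeline: agent submission → schema validation → business rules → optional human approval → webhook delivery.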

Gates are exposed as MCP tools, so agents discover and use them without per-gate integration code. We also support REST API submission for non-MCP agents.

One distinction that matters throughout this post: Flow is a validation checkpoint, not a centralized orchestrator. It doesn't manage agent workflows or decide what runs next. The agent (or whatever framework orchestrates it — LangGraph, CrewAI, your own code) decides when to call a gate and what to do with the result. Flow's job is narrower: validate the data, return a verdict, and track what happened.

That said, Flow's webhook delivery does enable a form of event-driven orchestration — when a payload passes validation, the webhook can trigger the next agent or service in a pipeline, creating loosely-coupled handoffs without a central coordinator. This means Flow covers some of the AWS framework's security dimensions deeply (audit, agent controls, agency perimeters) and has a partial but real story for orchestration through webhooks.

With that context, here's how Flow maps to each scope in the AWS framework.

Scope 1: No Agency

What AWS describes: Systems with human-initiated processes and no autonomous change capabilities. The agent follows predefined paths, processes data within workflow nodes, but can't modify anything. Read-only operations. Fixed execution paths.

Security focus: Process integrity, boundary enforcement, preventing agents from exceeding their boundaries.

How Flow maps here: Flow isn't really designed for Scope 1. If your agent is purely read-only and follows a fixed workflow with no ability to produce output that reaches external systems, you don't need a validation gateway — there's nothing to validate.

That said, Flow's schema validation does share DNA with one of Scope 1's key requirements: "input validation at each workflow step boundary." If you're building a pipeline where each stage processes data and hands it to the next, you could place a Flow gate between stages to validate that each node's output conforms to the expected shape. But that's using Flow as plumbing, not as an agent governance layer.

Flow coverage: Minimal — and that's fine. Scope 1 agents don't produce autonomous outputs.

Scope 2: Prescribed Agency

What AWS describes: Human-initiated, human-approved agentic actions. Agents can gather information, analyze data, and prepare recommendations, but all actions of consequence require explicit human approval. This is the "human in the loop" (HITL) scope.

Key characteristics:

Agents can execute change with human review and approval
Real-time human oversight with approval workflows
Bidirectional interaction — agents can ask humans for context
Audit trails of all human approval decisions

Security focus: Securing approval workflows, preventing agents from bypassing human authorization, and maintaining oversight effectiveness.

How Flow maps here: This is where Flow starts to fit well. The approval workflow was one of the first features we built into Flow, and it maps directly to what the AWS framework calls "approval gateway enforcement."

Here's how each Scope 2 security dimension looks in Flow:

Security Dimension	Scope 2 Requirement	Flow Implementation
Identity context	User auth, service auth, human identity verification for approvals	JWT auth for dashboard users, API key auth for agents (scoped to team/workspace), magic-link reviewer identity via HMAC-SHA256 signed tokens
Data, memory, & state protection	Role-based access control, human approval workflows, read-mostly permissions for agents	Workspace-scoped gates, team-based RBAC, agents can only submit — they can't modify gate schemas or approve their own runs
Audit & logging	Human decision audit trails, agent recommendation logging, approval process tracking	Every run logged with full payload, per-rule validation verdicts, approval decisions (approve/reject) with reviewer identity and timestamp
Agent & FM controls	Approval gateway enforcement, extended session monitoring	Gate validation = approval gateway. MCP session tracking with `mcpSessionId` ties agent submissions to a specific session. Circuit breaker monitors session health
Agency perimeters & policies	Human-validated constraint changes, time-bound elevated access, multi-step validation	Gate schema versioning (draft → publish cycle means a human approves schema changes). Magic links expire in 72 hours. Schema validation + business rules = multi-step validation
Orchestration	Multi-step workflow orchestration, approval-gated tool access, human-validated tool chains	Flow doesn't centrally orchestrate — the agent or framework (LangGraph, CrewAI, etc.) decides what to call and when. However, Flow's webhook delivery provides an event-driven handoff mechanism: when a submission passes validation (and approval, if configured), the validated payload is pushed to a webhook endpoint that can trigger the next agent, tool, or service. This enables loosely-coupled pipeline orchestration without a central orchestrator — each gate validates one stage and hands off to the next via webhook

The Scope 2 implementation consideration that stood out to me was "time-bounded approval tokens with automatic expiration." We built this: magic links for external reviewers are HMAC-SHA256 signed, expire in 72 hours, and are single-use for approval actions. The reviewer doesn't need a Rynko account — they click the link, see the payload rendered with safe content (sanitized HTML/Markdown), and approve or reject.
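For readers who want to see what a signed, expiring, single-use token involves, here is a minimal sketch using Python's standard `hmac` module. The token layout, the claim names, the hardcoded secret, and the in-memory used-token set are all assumptions for illustration; Flow's actual token format is not described in this post.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"demo-secret-change-me"  # hypothetical key; load from a secret store in practice
TTL_SECONDS = 72 * 3600            # 72-hour expiry, as described above
_used_tokens = set()               # stand-in for a persistent single-use record

def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).decode().rstrip("=")

def _unb64(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))

def issue_token(run_id, reviewer, now=None):
    now = time.time() if now is None else now
    claims = {"run_id": run_id, "reviewer": reviewer, "exp": int(now) + TTL_SECONDS}
    payload = json.dumps(claims, sort_keys=True).encode()
    signature = hmac.new(SECRET, payload, hashlib.sha256).digest()
    return f"{_b64(payload)}.{_b64(signature)}"

def redeem_token(token, now=None):
    """Return the claims if the token is authentic, unexpired, and unused; else None."""
    now = time.time() if now is None else now
    try:
        payload_part, signature_part = token.split(".")
        payload, signature = _unb64(payload_part), _unb64(signature_part)
    except ValueError:
        return None
    expected = hmac.new(SECRET, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):  # constant-time comparison
        return None
    claims = json.loads(payload)
    if now >= claims["exp"] or token in _used_tokens:
        return None
    _used_tokens.add(token)
    return claims


token = issue_token("run_1", "reviewer@example.com", now=1000)
print(redeem_token(token, now=2000))  # claims on first use
print(redeem_token(token, now=2001))  # None: already used
```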

One gap: the paper mentions "cryptographically signed approval decisions." Our approval decisions are stored in the database with reviewer identity and timestamp, but we don't produce a standalone cryptographic proof of the decision. It's an area where we could do more.

Flow coverage: Strong. Approval workflows, audit trails, scoped identity, and time-bounded reviewer access align closely with Scope 2 requirements.

Scope 3: Supervised Agency

What AWS describes: Human-initiated, but the agent executes autonomously. The agent makes decisions and takes actions without further approval. Humans define objectives and trigger execution, but agents operate independently through dynamic planning and tool usage. Optional human intervention points exist, but the agent can proceed without them.

Key characteristics:

Agents can execute change with no (or optional) human review
Dynamic planning and decision-making during execution
Direct access to external APIs and systems
Persistent memory across extended sessions
Autonomous tool selection and orchestration within defined boundaries

Security focus: Maintaining control during autonomous execution, scope management, and behavioral monitoring.

How Flow maps here: This is Flow's primary operating mode. Most teams using Flow today have agents that submit data autonomously — no human in the loop — and rely entirely on the gate's schema and business rules to catch problems. The agent self-corrects from structured errors, and Flow's circuit breaker intervenes if the agent enters a failure loop.

Security Dimension	Scope 3 Requirement	Flow Implementation
Identity context	Agent authentication, identity delegation for autonomous actions	API keys authenticate agents at team/workspace scope. MCP sessions bind agent identity to a persistent session with its own state
Data, memory, & state protection	Context-aware authorization, just-in-time privilege elevation, dynamic permission boundaries	Schema + business rules provide context-aware authorization — the gate evaluates each submission based on the data content, not just the caller's identity. Gate versioning allows operators to update what's accepted without downtime (publish a new version to tighten or relax rules). Flow doesn't provide privilege elevation — agents have the same permissions throughout a session
Audit & logging	Comprehensive action logging, reasoning chain capture, extended session tracking	Full run audit trail (payload, validation verdicts, processing time). Self-correction chain tracking links retries — you can see the full sequence of submit → fail → correct → resubmit as a single chain with a shared `correlationId`. MCP session IDs track activity across an agent's full conversation
Agent & FM controls	Container isolation, long-running process management, tool invocation sandboxing	Gate validation acts as a sandbox for agent outputs — the agent can call `validate_*` tools, but every payload must pass through deterministic rules before it's accepted. Circuit breaker prevents runaway retry loops by pausing the gate after consecutive failures. Flow doesn't provide container isolation or manage the agent's process lifecycle — it only controls the validation boundary
Agency perimeters & policies	Runtime constraint evaluation, resource scaling limits, automated safety checks	Business rules are evaluated at runtime against each submission — constraints are enforced per-payload, not just at setup time. Monthly run quotas cap total throughput per team. Circuit breaker acts as an automated safety check, tripping after N consecutive failures. These are static limits though — they don't adjust dynamically based on agent behavior
Orchestration	Dynamic tool orchestration, parallel execution paths, cross-system integration	Flow isn't a centralized orchestrator — it doesn't decide what runs next. But it supports two forms of integration: (1) MCP tool discovery, where agents find gates dynamically, and (2) webhook delivery, where validated payloads are pushed to downstream endpoints that can trigger the next step in a pipeline. This enables event-driven, loosely-coupled orchestration — gate A validates and webhooks to service B, which processes and submits to gate C. The sequencing emerges from the webhook chain, not from a central coordinator

The self-correction chain tracking deserves specific mention here. The AWS paper talks about "reasoning chain capture" — being able to see why an agent made the decisions it did. Flow's chain tracking gives you a concrete version of this for validation: when an agent submits to a gate, fails, reads the errors, and resubmits, the entire sequence is linked by a correlationId. You can see exactly what the agent submitted each time, which rules it violated, what it fixed, and whether it eventually succeeded. In the webapp, chains are displayed as collapsible groups — the latest attempt shows as the primary row with a badge showing "3 attempts," and expanding reveals the full correction timeline.
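A small sketch of how runs sharing a `correlationId` could be rolled up into the collapsed view described above, and into a convergence figure. The run fields (`attempt`, `passed`, `ts`) and the helper names are hypothetical, and "chain" here means any group whose first attempt failed.

```python
from collections import defaultdict

def group_chains(runs):
    """Group runs by correlationId and summarize each chain."""
    by_chain = defaultdict(list)
    for run in runs:
        by_chain[run["correlationId"]].append(run)

    summaries = []
    for correlation_id, attempts in by_chain.items():
        attempts.sort(key=lambda r: r["attempt"])
        latest = attempts[-1]  # shown as the primary row
        summaries.append({
            "correlationId": correlation_id,
            "attempts": len(attempts),
            "first_failed": not attempts[0]["passed"],
            "converged": latest["passed"],
            "seconds": latest["ts"] - attempts[0]["ts"],
            "badge": f"{len(attempts)} attempts" if len(attempts) > 1 else None,
        })
    return summaries

def convergence_rate(summaries):
    """Share of chains with a failed first attempt that eventually passed."""
    failed = [s for s in summaries if s["first_failed"]]
    if not failed:
        return None
    return sum(s["converged"] for s in failed) / len(failed)


runs = [
    {"correlationId": "c1", "attempt": 1, "passed": False, "ts": 0},
    {"correlationId": "c1", "attempt": 2, "passed": False, "ts": 2},
    {"correlationId": "c1", "attempt": 3, "passed": True, "ts": 5},
    {"correlationId": "c2", "attempt": 1, "passed": False, "ts": 0},
    {"correlationId": "c2", "attempt": 2, "passed": False, "ts": 4},
    {"correlationId": "c3", "attempt": 1, "passed": True, "ts": 0},
]
summaries = group_chains(runs)
print(summaries[0]["badge"], convergence_rate(summaries))
```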

The circuit breaker is Flow's implementation of the "automated safety checks" requirement. When an agent keeps failing the same gate — tracked per session for MCP agents, per payload hash for REST agents — the circuit breaker trips after a configurable number of consecutive failures. The gate transitions to a paused state, the system sends in-app and email notifications to the gate creator, and all further submissions are blocked until the cooldown expires or a new gate version is published.
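The behavior described here (count consecutive failures per key, pause the gate, resume on cooldown or publish) fits in a short sketch. The class, its method names, and the default values are hypothetical; the key would be a session ID for MCP agents or a payload hash for REST agents.

```python
class CircuitBreaker:
    def __init__(self, max_failures=5, cooldown_seconds=300.0):
        self.max_failures = max_failures
        self.cooldown_seconds = cooldown_seconds
        self.consecutive = {}     # key -> consecutive failure count
        self.paused_until = None  # None means the gate is active

    def allows_submissions(self, now: float) -> bool:
        # The pause lifts on its own once the cooldown has elapsed.
        if self.paused_until is not None and now >= self.paused_until:
            self.reset()
        return self.paused_until is None

    def record(self, key: str, passed: bool, now: float) -> None:
        if passed:
            self.consecutive[key] = 0  # any success breaks the streak
            return
        self.consecutive[key] = self.consecutive.get(key, 0) + 1
        if self.consecutive[key] >= self.max_failures:
            self.paused_until = now + self.cooldown_seconds  # trip: pause the gate

    def reset(self) -> None:
        self.consecutive.clear()
        self.paused_until = None

    def on_publish(self) -> None:
        # Publishing a new gate version clears stale breaker state.
        self.reset()


breaker = CircuitBreaker(max_failures=3, cooldown_seconds=60)
for t in (0, 1, 2):
    breaker.record("session-1", passed=False, now=t)
blocked = not breaker.allows_submissions(now=3)
recovered = breaker.allows_submissions(now=62)
print(blocked, recovered)
```

A real implementation would also send the in-app and email notifications when the breaker trips; that side effect is left out of the sketch.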

Here's what makes the circuit breaker interesting from the AWS framework perspective: it's an example of graceful degradation, which the paper calls out as a key architectural pattern. The paper says: "Design systems to automatically reduce autonomy levels when security events are detected." The circuit breaker does exactly this — when the agent can't produce valid output, Flow reduces the agent's effective autonomy by blocking further submissions and notifying the human operator.

Flow coverage: Strong to comprehensive. Self-correction chains, circuit breaker, runtime validation, MCP session tracking, and structured audit trails address most Scope 3 requirements.

Scope 4: Full Agency

What AWS describes: Fully autonomous AI that initiates its own activities based on environmental monitoring, learned patterns, or predefined conditions. No human triggers the agent — it operates continuously, makes independent decisions about when and how to act. The highest level of agency and risk.

Key characteristics:

Self-directed activity initiation based on environmental triggers
Continuous operation with minimal human oversight
High to full degrees of autonomy in goal setting, planning, and execution
Dynamic interaction with multiple external systems and agents
Capability for recursive self-improvement

Security focus: Continuous behavioral validation, enforcing agency boundaries, preventing capability drift, and maintaining organizational alignment.

How Flow maps here: Flow isn't a Scope 4 system itself — it doesn't initiate actions or make autonomous decisions. But it serves as a governance layer that Scope 4 agents submit to. The distinction matters: Flow doesn't control what the agent does; it controls what outputs the agent can successfully land in downstream systems. In the AWS framework's terms, Flow provides the "Advanced Deterministic Guardrails" that Scope 4 requires.

Security Dimension	Scope 4 Requirement	Flow Implementation	Gap
Identity context	Dynamic identity lifecycle, federated auth, continuous identity verification, agent identity attestation	API key auth, MCP session binding	No agent identity attestation or dynamic identity lifecycle. Flow authenticates the agent but doesn't verify its internal state or attest to its identity to third parties
Data, memory, & state protection	Behavioral authorization, adaptive access controls, continuous authorization validation	Every submission is checked against schema and business rules — this provides continuous authorization validation at the data level. Gate versioning lets operators evolve rules over time. Business rules reject outputs that violate constraints regardless of which agent submitted them	No ML-based adaptive controls — rules are deterministic, defined by humans. No behavioral authorization that learns from past patterns
Audit & logging	Continuous behavioral logging, pattern analysis, predictive monitoring, automated incident correlation	Full run audit trail. Chain tracking correlates related submissions. Circuit breaker events log failure patterns. Archive sync for long-term retention	No predictive monitoring or ML-based pattern analysis. Circuit breaker counts failures but doesn't detect novel anomaly patterns
Agent & FM controls	Behavioral analysis, anomaly detection, automated containment, self-healing security	Circuit breaker provides automated containment — pauses the gate and notifies operators when failures accumulate. Self-correction chain tracking gives visibility into agent retry patterns. Reset-on-publish clears stale circuit breaker state when new rules are deployed, which is closer to operational recovery than true self-healing	No behavioral analysis beyond failure counting. No anomaly detection on payload content. No automated response that adapts without human intervention
Agency perimeters & policies	Self-adjusting boundaries, context-aware constraints, cross-system resource management, autonomous limit adaptation	Business rules provide context-aware constraints (evaluated per-payload). Circuit breaker provides a form of autonomous limit adaptation — it auto-pauses the gate without human action. Monthly quotas cap resource usage	No self-adjusting boundaries — rules are static until a human publishes a new version. No cross-system resource management
Orchestration	Autonomous multi-agent orchestration, cross-session learning, dynamic service discovery	MCP tool discovery lets agents find gates dynamically. Webhook delivery enables event-driven handoffs between stages — a validated payload can trigger the next agent or service without a central orchestrator. This supports loosely-coupled multi-agent pipelines	No centralized multi-agent coordination. No cross-session learning. No dynamic service discovery beyond MCP tool listing. Each gate is independent — Flow doesn't manage the pipeline topology

I want to be transparent about where the gaps are, because the Scope 4 requirements are genuinely hard. The paper calls for "continuous monitoring with machine learning-based anomaly detection" and "automated response systems for behavioral deviations." Flow's circuit breaker is an automated response system, but it's simple — it counts consecutive failures. It doesn't analyze payload content for anomalies, detect drift in agent behavior patterns over time, or predict when an agent is likely to start producing invalid output.

That said, Flow provides the infrastructure that makes Scope 4 deployment safer:

Every submission is validated — the agent can't skip the gate. Schema + business rules are deterministic, not probabilistic. A Scope 4 agent that submits an order with a negative total gets rejected regardless of how autonomously it's operating.
Self-correction is tracked — you can see whether a Scope 4 agent is self-correcting successfully (resilient autonomy) or spiraling into repeated failures (failing autonomy). Chain tracking gives you this visibility without instrumenting the agent itself.
Automated containment — the circuit breaker pauses the gate when failures accumulate. This is the "failsafe mechanism that can halt operations when confidence drops" that the paper recommends for Scope 4.
Human re-entry point — when the circuit breaker trips, the gate creator gets notified (in-app + email). This is the "human ability to inject strategic guidance without disrupting operations" pattern. The human publishes a new gate version (potentially with adjusted rules), which reactivates the gate and resets circuit breakers.

Flow coverage: Partial but meaningful. Flow provides the deterministic guardrails, automated containment, and audit infrastructure that Scope 4 requires. The gaps are in ML-based anomaly detection, agent identity attestation, and cross-session learning — areas that require capabilities beyond what a validation gateway provides on its own.

The Six Security Dimensions — Summary Matrix

Here's a consolidated view of how Flow maps across all four scopes and six dimensions:

Dimension	Scope 1	Scope 2	Scope 3	Scope 4
Identity	N/A	JWT + API key + magic links	+ MCP session binding	Gap: no agent attestation
Data & state	N/A	RBAC, workspace scoping	+ runtime schema validation, gate versioning	Gap: no adaptive controls
Audit	N/A	Run logs, approval trails	+ chain tracking, session tracking	Gap: no predictive monitoring
Agent controls	N/A	Approval gateway	+ circuit breaker, chain tracking (observes correction, doesn't drive it)	Gap: no behavioral analysis
Agency perimeters	N/A	Schema versioning, expiring tokens	+ runtime rules, quotas, circuit breaker	Gap: no self-adjusting boundaries
Orchestration	N/A	Webhook delivery enables event-driven handoffs	+ MCP tool discovery	Gap: no centralized orchestration or cross-session learning. Supports loosely-coupled pipelines via webhooks, not coordinated multi-agent workflows

The Key Architectural Patterns

The AWS paper concludes with five architectural patterns for agentic AI deployments. Flow aligns with four of them:

Progressive autonomy deployment — "Start with Scope 1 or 2 implementations and gradually advance." Flow supports this directly. A gate can start with approval workflows (Scope 2 — human reviews every submission). Once you're confident in the schema and rules, remove the approval step and let the agent operate autonomously (Scope 3). The gate's validation logic stays the same; you're just adjusting the human oversight level.

Continuous validation loops — "Establish automated systems that continuously verify agent behavior against expected patterns." This is literally what Flow does. Every agent submission is validated against the gate's schema and business rules. The self-correction loop (submit → fail → read errors → fix → resubmit) is a continuous validation loop operating at the individual submission level.
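
The submit → fail → read errors → fix → resubmit loop can be sketched locally. This is only an illustration: `validate` is a stand-in for a gate, and `toy_fix` is a hypothetical replacement for the agent's own correction step. Neither is part of Flow.

```python
from typing import Callable

ALLOWED_CURRENCIES = {"USD", "EUR", "GBP", "INR"}


def validate(payload: dict) -> list[str]:
    """Stand-in for a gate: return error messages (an empty list means valid)."""
    errors = []
    if payload.get("currency") not in ALLOWED_CURRENCIES:
        errors.append("currency: must be one of: USD, EUR, GBP, INR")
    amount = payload.get("amount")
    if not isinstance(amount, (int, float)) or amount < 0:
        errors.append("amount: must be a number >= 0")
    return errors


def submit_with_correction(
    payload: dict,
    fix: Callable[[dict, list[str]], dict],
    max_attempts: int = 3,
) -> dict:
    """Validate, and on failure hand the errors to `fix` before resubmitting."""
    errors: list[str] = []
    for attempt in range(1, max_attempts + 1):
        errors = validate(payload)
        if not errors:
            return {"status": "validated", "attempts": attempt, "payload": payload}
        if attempt < max_attempts:
            payload = fix(payload, errors)
    return {"status": "validation_failed", "attempts": max_attempts, "errors": errors}


def toy_fix(payload: dict, errors: list[str]) -> dict:
    """Hypothetical fixer: in a real pipeline the LLM does this step."""
    fixed = dict(payload)
    if any(e.startswith("currency") for e in errors):
        fixed["currency"] = "USD"
    return fixed


result = submit_with_correction(
    {"vendor": "Globex Corp", "amount": 12500, "currency": "Dollars"}, toy_fix
)
print(result["status"], result["attempts"])  # validated 2
```

The cap on attempts matters as much as the loop itself: without it, an agent that cannot fix its output keeps resubmitting indefinitely.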

Human oversight integration — "Maintain meaningful human oversight through strategic checkpoints, behavioral reporting, and manual override capabilities." Flow's approval workflows are the strategic checkpoints. Chain tracking and circuit breaker notifications are the behavioral reporting. Gate versioning and manual pause/resume are the manual override capabilities.

Graceful degradation — "Design systems to automatically reduce autonomy levels when security events are detected." The circuit breaker does this: when an agent accumulates consecutive failures, the gate pauses, notifications fire, and the agent's effective autonomy drops to zero until a human intervenes or the cooldown expires. The paper specifically recommends systems that "automatically inject tighter restrictions such as requiring more HITL or reducing the actions an agent can take" — this is exactly what happens when a gate transitions from auto-approve to paused.

The one pattern we don't cover well is layered security architecture — "defense in depth with security controls at multiple levels." Flow operates at the application layer (validating agent outputs). It doesn't provide network-level controls, model-level guardrails, or infrastructure-level isolation. Teams building Scope 3-4 systems need Flow as one layer in a broader security stack, not as the only layer.

What We're Building Next

Reading the AWS framework confirmed some things on our roadmap and added others:

AI Judge (semantic validation) — Currently, Flow's validation is deterministic: JSON Schema types and expression-based business rules. For Scope 4 agents producing unstructured or semi-structured outputs, deterministic rules aren't always enough. We're building an LLM-based evaluation mode where a second model reviews the agent's output against criteria defined in natural language. This addresses the "continuous behavioral validation" gap at Scope 4.

Approval timeout enforcement — Our approval workflow creates a pending_approval state, but there's no automatic expiration. The paper's Scope 2 recommendation of "time-bounded approval tokens with automatic expiration" applies here. We're adding configurable timeout with auto-reject or auto-escalate behavior.
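
As a rough sketch of what timeout enforcement could look like, the function below decides what happens to a run that is still waiting for approval. The function name, the `on_timeout` option, and the "rejected" and "escalated" result values are assumptions for illustration, not Flow's actual configuration.

```python
from datetime import datetime, timedelta, timezone


def resolve_pending(
    created_at: datetime,
    now: datetime,
    timeout: timedelta,
    on_timeout: str = "reject",
) -> str:
    """Return the state a pending approval should be in at time `now`."""
    if on_timeout not in ("reject", "escalate"):
        raise ValueError("on_timeout must be 'reject' or 'escalate'")
    if now - created_at < timeout:
        return "pending_approval"  # still inside the approval window
    return "rejected" if on_timeout == "reject" else "escalated"


submitted = datetime(2026, 3, 1, 12, 0, tzinfo=timezone.utc)
print(resolve_pending(submitted, submitted + timedelta(hours=2), timedelta(hours=1)))
# rejected
```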

Agent behavioral baselining — The paper's Scope 4 requirements for "pattern analysis" and "predictive monitoring" are beyond what a simple failure counter provides. We're exploring tracking submission patterns per agent (payload shape, submission frequency, validation pass rate) and flagging deviations. Not ML-based yet, but statistical baselines that surface when an agent's behavior changes.
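
A statistical baseline of this kind can be very small. The sketch below flags an agent whose latest validation pass rate sits far from its own history, using a z-score. The threshold, the minimum sample count, and the floor on the standard deviation are all assumed values, not anything Flow ships today.

```python
from statistics import mean, pstdev


def is_deviation(
    history: list[float],
    current: float,
    z_threshold: float = 3.0,
    min_samples: int = 5,
    min_sigma: float = 0.01,
) -> bool:
    """Flag `current` when it is more than z_threshold deviations from the baseline.

    history: past per-window pass rates (0.0 to 1.0) for one agent.
    min_sigma: floor on the standard deviation so a very steady agent
               does not trigger on tiny fluctuations.
    """
    if len(history) < min_samples:
        return False  # not enough data to call anything a baseline
    baseline = mean(history)
    sigma = max(pstdev(history), min_sigma)
    return abs(current - baseline) / sigma > z_threshold


pass_rates = [0.96, 0.94, 0.95, 0.97, 0.95, 0.96]
print(is_deviation(pass_rates, 0.95))  # False
print(is_deviation(pass_rates, 0.60))  # True
```

The same function works for submission frequency or payload size; only the metric fed into `history` changes.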

Where Flow Fits in Your Agentic Architecture

Rynko Flow isn't a complete Scope 4 security stack — no single product is. But it provides the validation gateway, automated containment, and audit infrastructure that the AWS framework identifies as critical across Scopes 2-4.

If you're building agents that produce data destined for production systems — orders, invoices, tickets, customer records — Flow gives you deterministic guardrails that work regardless of which model or framework your agent uses. The gate doesn't care whether the agent is a Claude tool-use loop, a LangGraph pipeline, or a custom orchestrator. It validates the output, tracks the correction chain, and trips the circuit breaker if things go sideways.

The AWS framework is worth reading in full: it provides a structured way to think about where your agent sits on the agency/autonomy spectrum and what security controls you need at that level. And if you're at Scope 2 or above, a validation gateway isn't optional; it underpins several of the six security dimensions in the matrix above, including audit, agent controls, and agency perimeters.

Paper: The Agentic AI Security Scoping Matrix

Flow docs: docs.rynko.dev/flow

Get started: app.rynko.dev/signup — free tier, 500 runs/month, 3 gates, no credit card.

Validating CrewAI Agent Outputs with Rynko Flow

Srijith Kartha — Tue, 10 Mar 2026 14:41:45 +0000

CrewAI's strength is that you define agents with roles, goals, and tools, and the framework handles the orchestration. An agent researches, another analyzes, a third writes the report. The problem shows up when the last agent in the chain produces the final output — a JSON payload that needs to be structurally valid, conform to business rules, and sometimes get human approval before it goes downstream.

Most CrewAI tutorials skip this part. The output comes back as a string, maybe you parse it as JSON, and you hope it's correct. In production, that hope turns into bugs.

I've been using Rynko Flow as the validation layer after CrewAI tasks. The agent does its work, the output goes through a Flow gate that checks schema and business rules, and only validated data moves forward. When validation fails, the error response is structured enough that the agent can fix itself and retry.

What We're Building

A CrewAI crew with two agents:

Order Processor — Takes a natural language order request and extracts structured data
Validator — Submits the extracted data to a Rynko Flow gate, handles errors, and retries if needed

The validator agent uses a custom tool that wraps the Flow API, so it gets structured validation errors directly in its tool response.

Setup

pip install crewai httpx

You'll need:

A Rynko account (free tier is fine)
A Flow gate with your schema (setup guide)
An OpenAI API key (CrewAI's default LLM)

The Flow Validation Tool

CrewAI agents use tools — Python functions decorated with @tool. Here's one that submits data to a Flow gate and returns the result in a format the LLM can reason about:

import os
import json
import httpx
from crewai.tools import tool

RYNKO_BASE_URL = os.environ.get("RYNKO_BASE_URL", "https://api.rynko.dev/api")
RYNKO_API_KEY = os.environ["RYNKO_API_KEY"]
GATE_ID = os.environ["FLOW_GATE_ID"]

@tool("validate_order")
def validate_order(order_json: str) -> str:
    """Validate an order payload against the Flow gate.
    Input must be a JSON string with fields: vendor (string),
    amount (number), currency (USD/EUR/GBP/INR), po_number (optional string).
    Returns validation result with status and any errors."""

    try:
        payload = json.loads(order_json)
    except json.JSONDecodeError as e:
        return json.dumps({"success": False, "error": f"Invalid JSON: {e}"})

    response = httpx.post(
        f"{RYNKO_BASE_URL}/flow/gates/{GATE_ID}/runs",
        json={"payload": payload},
        headers={
            "Authorization": f"Bearer {RYNKO_API_KEY}",
            "Content-Type": "application/json",
        },
        timeout=30,
    )

    result = response.json()

    if result.get("status") == "validation_failed":
        errors = result.get("error", {}).get("details", [])
        error_lines = [f"- {e.get('field', e.get('rule_id', 'unknown'))}: {e.get('message')}" for e in errors]
        return json.dumps({
            "success": False,
            "status": "validation_failed",
            "errors": error_lines,
            "message": "Fix these errors and resubmit.",
        }, indent=2)

    return json.dumps({
        "success": True,
        "status": result.get("status"),
        "run_id": result.get("runId"),
        "validation_id": result.get("validation_id"),
    }, indent=2)

The tool returns structured JSON in both success and failure cases. When validation fails, the error messages are specific enough — "currency must be one of: USD, EUR, GBP, INR" — that the LLM can fix the issue without guessing.

Defining the Agents

from crewai import Agent

order_processor = Agent(
    role="Order Processor",
    goal="Extract structured order data from customer requests accurately",
    backstory=(
        "You are an order processing specialist. You extract vendor name, "
        "amount, currency, and PO number from natural language requests. "
        "You output clean JSON with fields: vendor, amount, currency, po_number. "
        "Currency must be a 3-letter code (USD, EUR, GBP, or INR)."
    ),
    verbose=True,
    allow_delegation=False,
)

order_validator = Agent(
    role="Order Validator",
    goal="Validate extracted orders against business rules and fix any issues",
    backstory=(
        "You validate order data by submitting it to the validation gateway. "
        "If validation fails, you read the error messages carefully, fix each "
        "issue in the JSON, and resubmit. You keep trying until it passes or "
        "you've made 3 attempts. Always report the final validation status."
    ),
    tools=[validate_order],
    verbose=True,
    allow_delegation=False,
)

The validator agent has the Flow tool and explicit instructions to read errors and retry. CrewAI agents follow their backstory closely, so the self-correction behavior comes from the backstory rather than from framework-level retry logic.

Defining the Tasks

from crewai import Task

extract_task = Task(
    description=(
        "Extract order data from this customer request:\n\n"
        "{user_request}\n\n"
        "Output a JSON object with fields: vendor (string), amount (number), "
        "currency (3-letter code: USD, EUR, GBP, or INR), po_number (string, optional). "
        "Output ONLY the JSON, nothing else."
    ),
    expected_output="A JSON object with vendor, amount, currency, and optional po_number",
    agent=order_processor,
)

validate_task = Task(
    description=(
        "Take the order JSON from the previous task and validate it using the "
        "validate_order tool. If validation fails, read the error messages, fix "
        "the JSON, and call the tool again with corrected data. "
        "Report the final run ID and validation status."
    ),
    expected_output="Validation result with run ID and status (validated or failed)",
    agent=order_validator,
    context=[extract_task],
)

The context=[extract_task] tells CrewAI to pass the output of the extract task to the validator. The validator then takes that JSON and runs it through Flow.

Running the Crew

from crewai import Crew, Process

crew = Crew(
    agents=[order_processor, order_validator],
    tasks=[extract_task, validate_task],
    process=Process.sequential,
    verbose=True,
)

result = crew.kickoff(
    inputs={
        "user_request": (
            "We need to process an order from Globex Corp for "
            "twelve thousand five hundred dollars, PO number PO-2026-042"
        )
    }
)

print("\n--- Final Result ---")
print(result)

What Happens at Runtime

When you run this, the output shows the full agent reasoning:

[Order Processor] Extracting order data...
> {"vendor": "Globex Corp", "amount": 12500, "currency": "USD", "po_number": "PO-2026-042"}

[Order Validator] Validating order...
> Using tool: validate_order
> Tool result: {"success": true, "status": "validated", "run_id": "..."}

--- Final Result ---
Order validated successfully. Run ID: 550e8400-...

Now here's the interesting case. Say the processor extracts currency: "Dollars":

[Order Validator] Validating order...
> Using tool: validate_order
> Tool result: {"success": false, "errors": ["- currency: must be one of: USD, EUR, GBP, INR"]}

[Order Validator] The currency is invalid. Fixing to "USD" and resubmitting...
> Using tool: validate_order
> Tool result: {"success": true, "status": "validated", "run_id": "..."}

The validator reads the error, fixes the currency, and resubmits. One retry, no human involved.

Handling Multiple Agents Writing to the Same Gate

CrewAI shines when you have multiple specialized agents. In a more complex setup, you might have separate crews for different order types — one for domestic orders, one for international, one for recurring subscriptions. All three can validate against the same Flow gate.

# Different crews, same validation gate
domestic_crew = Crew(agents=[domestic_processor, validator], ...)
international_crew = Crew(agents=[intl_processor, validator], ...)
subscription_crew = Crew(agents=[sub_processor, validator], ...)

The gate enforces consistent validation regardless of which crew produced the data. If you change a business rule — say, increasing the minimum order amount from $10 to $50 — you update it once in the Flow dashboard and every crew picks it up immediately.

Flow's analytics dashboard shows validation results by session, so you can see which crew or agent is producing the most errors and needs prompt tuning.

Adding Human Approval

For high-value orders, configure the gate's approval mode to require human review. When the validator submits a $50,000 order, Flow holds it in a review_required state instead of auto-approving. A reviewer gets an email, reviews the payload, and approves or rejects.

Your CrewAI task can poll for the approval result:

import time  # needed for time.sleep below

@tool("wait_for_approval")
def wait_for_approval(run_id: str) -> str:
    """Poll a Flow run until it reaches a terminal state."""
    for _ in range(60):
        response = httpx.get(
            f"{RYNKO_BASE_URL}/flow/runs/{run_id}",
            headers={"Authorization": f"Bearer {RYNKO_API_KEY}"},
            timeout=30,
        )
        status = response.json().get("status")
        if status in ("approved", "rejected", "completed", "delivered"):
            return json.dumps({"status": status, "run_id": run_id})
        time.sleep(5)
    return json.dumps({"status": "timeout", "run_id": run_id})

Using MCP Instead of REST

If you prefer the agent to discover Flow gates dynamically through tool calling (rather than hardcoding the gate ID), you can connect CrewAI to Flow's MCP endpoint. Flow auto-generates a validate_{gate_slug} tool for each active gate, and the tool schema includes field types and constraints so the LLM knows what to submit.

This is useful when your agents work across multiple gates and need to pick the right one based on context.

Local Development Setup

# Create project
mkdir crewai-flow-demo && cd crewai-flow-demo
python -m venv .venv
source .venv/bin/activate

# Install
pip install crewai httpx python-dotenv

# Environment
cat > .env << 'EOF'
OPENAI_API_KEY=sk-...
RYNKO_API_KEY=your_api_key_here
FLOW_GATE_ID=your_gate_id_here
EOF

Create main.py with the code above, add from dotenv import load_dotenv; load_dotenv() at the top, and run with python main.py. CrewAI's verbose=True shows you the full agent reasoning — useful for debugging prompt issues.

Full Working Example

The complete code — agents, tools, tasks, .env.example, and two test scenarios — is in our developer resources repo. Clone it, add your API keys, and run python src/main.py.

Resources:

Rynko Flow documentation
CrewAI documentation
Sign up for free — 500 Flow runs/month, no credit card
Self-correction demo (terminal recording)

Adding Output Validation to Your LangGraph Agent with Rynko Flow

Srijith Kartha — Tue, 10 Mar 2026 14:12:42 +0000

Your LangGraph agent works great in demos. But in production, every node's output needs to be validated before the next node acts on it. Here's how to add a validation step without writing custom checking logic.

LangGraph gives you fine-grained control over your agent's execution graph — you define nodes, edges, and conditional routing. But one thing that's missing from most LangGraph tutorials is what happens when a node produces bad data. The next node just receives it and either crashes or propagates the error downstream.

I ran into this when building an order processing pipeline with LangGraph. The extraction node would occasionally produce negative amounts, invalid currencies, or missing fields. The downstream nodes — pricing, invoicing, fulfillment — would silently process the bad data. By the time someone noticed, the damage was already in the database.

The typical fix is writing validation logic inside each node. That works, but it means every node carries its own schema checks, the validation rules are scattered across your codebase, and there's no central place to see what's failing and why.

So I hooked up Rynko Flow as an external validation step in the graph. The agent extracts data, Flow validates it against a schema and business rules, and only if it passes does the pipeline continue. If it fails, the agent gets structured errors it can use to self-correct.

What You'll Build

A LangGraph agent with three nodes:

Extract — LLM extracts order data from a natural language request
Validate — Submits the extracted data to a Rynko Flow gate
Process — Handles the validated order (or routes back for correction)

The graph looks like this:

extract → validate → process
              ↓ (if failed)
          extract (retry with error context)

Prerequisites

pip install langgraph langchain-openai httpx

You'll also need:

A Rynko account (free tier works)
A Flow gate configured with your order schema
An OpenAI API key (or any LangChain-compatible LLM)

Setting Up the Flow Gate

Create a gate in the Flow dashboard with this schema:

Field	Type	Constraints
vendor	string	required, min 1 char
amount	number	required, >= 0
currency	string	required, one of: USD, EUR, GBP, INR
po_number	string	optional

Add a business rule: amount >= 10 with error message "Order amount must be at least $10."

If you already have a Pydantic model, you can import the schema directly — run YourModel.model_json_schema() and paste the output into the gate's Import Schema dialog. There's a tutorial for that.

Save and publish the gate. Note the gate ID — you'll need it in the code.

The Validation Client

First, a small wrapper around the Flow API. This is what the validate node will call:

import httpx
import os

RYNKO_BASE_URL = os.environ.get("RYNKO_BASE_URL", "https://api.rynko.dev/api")
RYNKO_API_KEY = os.environ["RYNKO_API_KEY"]

def validate_with_flow(gate_id: str, payload: dict) -> dict:
    """Submit a payload to a Flow gate and return the result."""
    response = httpx.post(
        f"{RYNKO_BASE_URL}/flow/gates/{gate_id}/runs",
        json={"payload": payload},
        headers={
            "Authorization": f"Bearer {RYNKO_API_KEY}",
            "Content-Type": "application/json",
        },
        timeout=30,
    )
    return response.json()

This returns the full validation result: status, errors, validation ID, the works. The important fields are status (either "validated" or "validation_failed") and, when validation fails, the field-level error details, which the validate node below reads from error.details.

Defining the Graph State

LangGraph uses a typed state that flows between nodes. Ours tracks the user request, extracted data, validation result, and retry count:

from typing import TypedDict, Optional

class OrderState(TypedDict):
    user_request: str
    extracted_data: Optional[dict]
    validation_result: Optional[dict]
    validation_errors: Optional[str]
    retry_count: int
    final_result: Optional[str]

The Three Nodes

Extract Node

The LLM extracts structured order data from the user's natural language request. If there were previous validation errors, they're included in the prompt so the LLM can correct its output:

from langchain_openai import ChatOpenAI
import json

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

GATE_ID = os.environ["FLOW_GATE_ID"]  # Your gate ID

def extract_order(state: OrderState) -> dict:
    error_context = ""
    if state.get("validation_errors"):
        error_context = (
            f"\n\nYour previous extraction had validation errors:\n"
            f"{state['validation_errors']}\n"
            f"Fix these issues in your new extraction."
        )

    response = llm.invoke(
        f"Extract order data from this request as JSON with fields: "
        f"vendor (string), amount (number), currency (string, one of USD/EUR/GBP/INR), "
        f"po_number (string, optional).\n\n"
        f"Request: {state['user_request']}"
        f"{error_context}\n\n"
        f"Respond with ONLY valid JSON, no markdown."
    )

    try:
        extracted = json.loads(response.content)
    except json.JSONDecodeError:
        extracted = {"vendor": "", "amount": 0, "currency": ""}

    return {"extracted_data": extracted}
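
One practical wrinkle: models sometimes wrap their JSON in a markdown code fence even when told not to, which makes json.loads fail. The helper below is a hypothetical addition (not part of the original tutorial code) that strips a surrounding fence before parsing and returns None when no JSON object can be recovered.

```python
import json
import re
from typing import Optional

FENCE = "`" * 3  # markdown code fence marker
_FENCED = re.compile(rf"^{FENCE}(?:json)?\s*\n(.*?)\n{FENCE}\s*$", re.DOTALL)


def parse_llm_json(text: str) -> Optional[dict]:
    """Parse a JSON object from model output, tolerating a markdown fence."""
    cleaned = text.strip()
    match = _FENCED.match(cleaned)
    if match:
        cleaned = match.group(1)  # keep only the fenced body
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        return None
    # The gate expects an object, so reject arrays and scalars too.
    return data if isinstance(data, dict) else None


print(parse_llm_json('{"vendor": "Globex Corp", "amount": 12500}'))
print(parse_llm_json("sorry, I could not extract an order"))  # None
```

In the extract node you could call this in place of json.loads and keep the same empty-field fallback when it returns None, so the gate still reports which fields are missing.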

Validate Node

This is the Flow integration — submit the extracted data to the gate and capture the result:

def validate_order(state: OrderState) -> dict:
    result = validate_with_flow(GATE_ID, state["extracted_data"])

    if result.get("status") == "validation_failed":
        errors = result.get("error", {}).get("details", [])
        error_text = "\n".join(
            f"- {e.get('field', e.get('rule_id', 'unknown'))}: {e.get('message', 'invalid')}"
            for e in errors
        )
        return {
            "validation_result": result,
            "validation_errors": error_text,
            "retry_count": state.get("retry_count", 0) + 1,
        }

    return {
        "validation_result": result,
        "validation_errors": None,
    }

Process Node

If validation passed, the order moves forward. In a real system this would write to your database, trigger fulfillment, or call another API:

def process_order(state: OrderState) -> dict:
    validation_id = state["validation_result"].get("validation_id", "")
    return {
        "final_result": (
            f"Order processed successfully.\n"
            f"Vendor: {state['extracted_data']['vendor']}\n"
            f"Amount: {state['extracted_data']['amount']} {state['extracted_data']['currency']}\n"
            f"Validation ID: {validation_id}"
        )
    }

The validation_id is a tamper-proof token from Flow — your downstream systems can verify that the data passed validation and hasn't been modified since.

Wiring the Graph

Now connect the nodes with conditional routing. If validation fails and we haven't exhausted retries, route back to the extract node with the error context:

from langgraph.graph import StateGraph, END

def should_retry(state: OrderState) -> str:
    if state.get("validation_errors") and state.get("retry_count", 0) < 3:
        return "retry"
    elif state.get("validation_errors"):
        return "give_up"
    return "proceed"

# Build the graph
graph = StateGraph(OrderState)

graph.add_node("extract", extract_order)
graph.add_node("validate", validate_order)
graph.add_node("process", process_order)

graph.set_entry_point("extract")
graph.add_edge("extract", "validate")

graph.add_conditional_edges(
    "validate",
    should_retry,
    {
        "retry": "extract",     # Back to extraction with error context
        "proceed": "process",   # Validation passed
        "give_up": END,         # Max retries reached
    },
)
graph.add_edge("process", END)

app = graph.compile()

Running It

result = app.invoke({
    "user_request": "Process an order from Globex Corp for twelve thousand five hundred dollars USD, PO number PO-2026-042",
    "retry_count": 0,
})

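# final_result is only set when validation passed; on give_up, show the errors instead
print(result.get("final_result") or f"Validation failed:\n{result.get('validation_errors')}")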

Output:

Order processed successfully.
Vendor: Globex Corp
Amount: 12500.0 USD
Validation ID: v_abc123...

The Self-Correction Loop

The interesting part is what happens when the LLM makes a mistake. Say it extracts currency: "Dollars" instead of "USD". Flow returns:

{
  "status": "validation_failed",
  "errors": [
    {"field": "currency", "message": "must be one of: USD, EUR, GBP, INR"}
  ]
}

The graph routes back to the extract node, which now includes the error in its prompt. The LLM reads "currency must be one of: USD, EUR, GBP, INR", fixes its extraction to "USD", and the second attempt passes validation.

This happens automatically — no human intervention, no hardcoded fixes. The LLM uses the structured error feedback from Flow to correct itself.

In our testing, most validation issues resolve in one retry. The retry_count cap of 3 is a safety net — if the agent can't fix it in three attempts, something is fundamentally wrong with the input and it's better to fail explicitly.
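
To see exactly where the cap bites, here is the same routing logic restated on a plain dict so it runs without LangGraph. Because the validate node increments retry_count on every failure, the counter equals the number of failed attempts so far.

```python
def should_retry(state: dict, max_attempts: int = 3) -> str:
    """Same branching as the router above, restated on a plain dict."""
    if state.get("validation_errors") and state.get("retry_count", 0) < max_attempts:
        return "retry"
    if state.get("validation_errors"):
        return "give_up"
    return "proceed"


for failures in (1, 2, 3):
    state = {"validation_errors": "- currency: invalid", "retry_count": failures}
    print(failures, should_retry(state))
# 1 retry
# 2 retry
# 3 give_up

print(should_retry({"validation_errors": None, "retry_count": 1}))  # proceed
```

So the graph makes at most three submissions in total: the initial attempt plus two retries.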

Why Not Just Use Pydantic in the Node?

You could validate with Pydantic directly in the extract node. For a single agent, that works fine. But Flow gives you a few things Pydantic doesn't:

Business rules that cross fields. Pydantic validates field types and constraints, but expressions like endDate > startDate or quantity * price == total need custom validators. Flow evaluates these as expressions — you configure them in the dashboard, no code changes needed.

Centralized validation across agents. If you have five different LangGraph pipelines submitting orders, they all validate against the same gate. Change a rule once, it applies everywhere. With Pydantic, you'd need to update the model in every repo.

Observability. Flow's analytics dashboard shows you which fields fail most often, which business rules trigger, and which agents (by session) are producing the most errors. When you're debugging why Agent C keeps submitting bad currencies, this is where you look.

Approval workflows. For high-value orders, add a human approval step on the gate. The pipeline pauses, a reviewer approves or rejects, and the graph resumes. You can't do this with a Pydantic validator.

Adding MCP for Direct Tool Access

If you want the LLM to call Flow tools directly (instead of going through a hardcoded REST call), you can use LangChain's MCP tool integration. Flow's MCP endpoint at https://api.rynko.dev/api/flow/mcp auto-generates a validate_{gate_slug} tool for each active gate in your workspace.

This means the LLM can discover available gates and submit payloads through tool calling, which is useful when the agent needs to decide which gate to validate against based on the input.

Local Development Setup

To set up a local LangGraph development environment:

# Create a project directory
mkdir langgraph-flow-demo && cd langgraph-flow-demo

# Set up a virtual environment
python -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install langgraph langchain-openai httpx python-dotenv

# Create .env file
cat > .env << 'EOF'
OPENAI_API_KEY=sk-...
RYNKO_API_KEY=your_api_key_here
FLOW_GATE_ID=your_gate_id_here
EOF

Create a main.py with the code from this tutorial, add from dotenv import load_dotenv; load_dotenv() at the top, and run it with python main.py.

For iterative development, LangGraph has a built-in visualization tool:

# Print the graph structure (requires the grandalf package)
app.get_graph().print_ascii()

# Or save as PNG (requires pygraphviz)
app.get_graph().draw_png("graph.png")

This shows you the nodes, edges, and conditional routing at a glance — useful for verifying the self-correction loop is wired correctly.

Full Working Example

The complete code for this tutorial — including the graph, Flow client, .env.example, and two test scenarios — is in our developer resources repo. Clone it, add your API keys, and run python src/main.py.

Resources:

Rynko Flow documentation
Flow API reference
LangGraph documentation
Sign up for free — 500 Flow runs/month, no credit card
Self-correction demo (terminal recording)

Launching Rynko Flow: A Self-Correcting Validation Gateway for AI Agent Outputs

Srijith Kartha — Mon, 09 Mar 2026 17:41:18 +0000

When we launched Rynko, the focus was document generation — templates, PDFs, Excel files. But the more we worked with teams building AI-powered workflows, the more we noticed the same problem showing up everywhere: the agent produces structured data, and the developer writes validation code to check it before passing it downstream. Schema checks, business rule enforcement, sometimes a human review step. Every team was building some version of this from scratch.

So, we built Flow.

What Flow Does

Rynko Flow is a validation gateway that sits between your AI agent and your downstream systems. You define a gate with a schema and business rules, your agent submits payloads to it, and Flow validates the data before it moves forward. If the payload fails, the agent gets a clear error response it can act on. If it passes, Flow returns a tamper-proof validation_id that downstream systems can verify to confirm the data hasn't been modified in transit.

The pipeline runs in four stages: schema validation, business rules, optional approval, and webhook delivery.

Each stage is independent. Schema validation checks field types, required fields, and constraints like min/max values and allowed enums. Business rules evaluate cross-field expressions — things like endDate > startDate or quantity * price == total. If you need a human to review before delivery, add an approval step with internal team members or external reviewers. Once everything passes, Flow delivers the payload to your webhook endpoints.
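
To make the first two stages concrete, here is a minimal local sketch of a schema check followed by cross-field rules. It is not Flow's engine: the SCHEMA and RULES structures, the function names, and the choice to run rules only after the schema passes are assumptions made for illustration.

```python
# Hypothetical schema description: field -> (expected type, required)
SCHEMA = {
    "quantity": (int, True),
    "price": (int, True),  # integer cents, so the arithmetic rule is exact
    "total": (int, True),
    "note": (str, False),
}

# Cross-field rules: (rule id, predicate, error message)
RULES = [
    ("total_matches", lambda p: p["quantity"] * p["price"] == p["total"],
     "total must equal quantity * price"),
    ("positive_quantity", lambda p: p["quantity"] > 0, "quantity must be > 0"),
]


def check_schema(payload: dict) -> list[dict]:
    """Stage 1: required fields and types."""
    errors = []
    for field, (expected, required) in SCHEMA.items():
        if field not in payload:
            if required:
                errors.append({"field": field, "message": "is required"})
            continue
        if not isinstance(payload[field], expected):
            errors.append({"field": field, "message": f"must be {expected.__name__}"})
    return errors


def check_rules(payload: dict) -> list[dict]:
    """Stage 2: cross-field expressions, evaluated on schema-valid data."""
    return [{"rule_id": rid, "message": msg} for rid, pred, msg in RULES if not pred(payload)]


def run_gate(payload: dict) -> dict:
    schema_errors = check_schema(payload)
    if schema_errors:
        return {"status": "validation_failed", "layer": "schema", "errors": schema_errors}
    rule_errors = check_rules(payload)
    if rule_errors:
        return {"status": "validation_failed", "layer": "business_rules", "errors": rule_errors}
    return {"status": "validated"}


print(run_gate({"quantity": 2, "price": 500, "total": 1000})["status"])  # validated
print(run_gate({"quantity": 2, "price": 500, "total": 900})["layer"])    # business_rules
```

Running the rules only on schema-valid payloads keeps each rule simple, since it can assume the fields exist and have the right type.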

Gates, Not Middleware

A gate is a named validation checkpoint. It has a schema (the structure you expect), business rules (the constraints that cross fields), and optionally an approval configuration and delivery channels. Each gate gets its own API endpoint.

Creating a gate takes about a minute in the dashboard:

Open the Flow dashboard and click Create Gate
Name your gate — give it something descriptive like "Order Validation". Flow generates a URL-friendly slug automatically (order-validation)
Define the schema — use the schema builder to add fields. For an order gate, you'd add orderId (string, required), amount (number, required, min 0), currency (string, required, allowed values: USD/EUR/GBP), and customerEmail (string, required, email format). Each field has a type dropdown and constraint options — no JSON to write by hand
Add business rules — click Add Rule and write expressions like amount >= 10 with an error message ("Order amount must be at least $10"). The rule editor validates your expression as you type, so you know it'll work before you save
Save the gate — it's immediately active and ready to receive payloads

If you already have your data models defined in code, you don't have to recreate the schema manually. Flow supports importing schemas directly from Pydantic (Python) and Zod (TypeScript). In the schema builder, click Import Schema, pick the format, and paste the JSON Schema output from model_json_schema() (Pydantic) or zodToJsonSchema() (Zod). Flow maps the types, constraints, and required fields automatically. There's a full tutorial with code examples for both.

This means if you have a Pydantic model like:

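from typing import Literal

from pydantic import BaseModel, EmailStr, Field  # EmailStr needs the email-validator extra

class Order(BaseModel):
    order_id: str
    amount: float = Field(ge=0)
    currency: Literal["USD", "EUR", "GBP"]
    customer_email: EmailStr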

You run Order.model_json_schema(), paste the output into the import dialog, and your gate schema is ready — field types, constraints, and all.

{
  "title": "Order",
  "type": "object",
  "properties": {
    "order_id": {
      "title": "Order Id",
      "type": "string"
    },
    "amount": {
      "title": "Amount",
      "type": "number",
      "minimum": 0
    },
    "currency": {
      "title": "Currency",
      "type": "string",
      "enum": ["USD", "EUR", "GBP"]
    },
    "customer_email": {
      "title": "Customer Email",
      "type": "string",
      "format": "email"
    }
  },
  "required": ["order_id", "amount", "currency", "customer_email"]
}

When your agent submits a payload to this gate, the response tells you exactly what happened at each validation layer:

{
  "success": true,
  "runId": "550e8400-e29b-41d4-a716-446655440000",
  "status": "validated",
  "validation_id": "v_abc123...",
  "layers": {
    "schema": "pass",
    "business_rules": "pass"
  }
}

If validation fails, the response includes specific error details — which field failed, which constraint was violated, which business rule returned false. The agent gets actionable feedback it can use to fix the data and resubmit.

Why This Matters for AI Agents

LLMs hallucinate. They produce plausible-looking data that might have an invalid enum value, a missing required field, or a number that violates a business constraint. When you're generating a single document, you catch these by eye. When an agent is processing hundreds of payloads autonomously, you need systematic validation.

The interesting thing we've seen in practice is that agents self-correct. When an MCP-connected agent submits a payload that fails validation, it reads the error response, fixes the issues, and resubmits — often without any human involvement. We ran tests where we intentionally gave agents incomplete or incorrect data, and the validation-resubmission loop resolved the issues in one or two attempts. (ref: Flow MCP — AI Agent Integration Test Report | Rynko Documentation)
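
The validation-resubmission loop can be sketched in a few lines. This is an illustration rather than how any particular agent or the Flow SDK does it: `submit` and `repair` are hypothetical callables you would supply, and the `errors` list with `field` entries is an assumed response shape (only the `success` field appears in the response shown earlier).

```python
def submit_with_self_correction(payload, submit, repair, max_attempts=3):
    """Submit a payload; on failure, pass the errors to `repair` and retry.

    Returns the last response and the number of attempts used.
    """
    response = {}
    for attempt in range(1, max_attempts + 1):
        response = submit(payload)
        if response.get("success"):
            return response, attempt
        if attempt < max_attempts:
            # The agent reads the structured errors and produces a fixed payload.
            payload = repair(payload, response.get("errors", []))
    return response, max_attempts


def fake_gate(payload):
    """Stand-in for a gate that requires a currency field."""
    errors = []
    if "currency" not in payload:
        errors.append({"field": "currency", "message": "required"})
    return {"success": not errors, "errors": errors}


def fake_repair(payload, errors):
    """Stand-in for the agent's fix-up step."""
    fixed = dict(payload)
    for err in errors:
        if err["field"] == "currency":
            fixed["currency"] = "USD"
    return fixed
```

Capping `max_attempts` on the client side complements the server-side circuit breaker: the agent gives up on its own before the gate has to block it.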

Flow has a built-in circuit breaker for this pattern. If an agent (identified by its MCP session) keeps submitting payloads that fail the same gate, Flow backs off — first warning, then temporarily blocking submissions from that session. This prevents a malfunctioning agent from burning through your quota with an infinite retry loop. The circuit breaker tracks failures per gate per session, with configurable thresholds and cooldown periods.
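
As a rough mental model, a per-gate, per-session breaker can look like the sketch below. The thresholds, the cooldown length, and the choice to reset the counter on a passing run are all assumptions; the text above only says that Flow warns first, then blocks temporarily, with configurable values.

```python
import time
from dataclasses import dataclass, field


@dataclass
class CircuitBreaker:
    warn_after: int = 3               # hypothetical threshold
    block_after: int = 5              # hypothetical threshold
    cooldown_seconds: float = 300.0   # hypothetical cooldown
    _failures: dict = field(default_factory=dict)
    _blocked_until: dict = field(default_factory=dict)

    def state(self, gate, session, now=None):
        """Return "ok", "warning", or "blocked" for one (gate, session) pair."""
        now = time.monotonic() if now is None else now
        key = (gate, session)
        until = self._blocked_until.get(key)
        if until is not None:
            if now < until:
                return "blocked"
            # Cooldown elapsed: clear the block and start counting again.
            del self._blocked_until[key]
            self._failures[key] = 0
        if self._failures.get(key, 0) >= self.warn_after:
            return "warning"
        return "ok"

    def record(self, gate, session, passed, now=None):
        """Record one validation outcome and return the resulting state."""
        now = time.monotonic() if now is None else now
        key = (gate, session)
        if passed:
            self._failures[key] = 0
        else:
            self._failures[key] = self._failures.get(key, 0) + 1
            if self._failures[key] >= self.block_after:
                self._blocked_until[key] = now + self.cooldown_seconds
        return self.state(gate, session, now)
```

Keying the counters on the (gate, session) pair is what keeps one misbehaving agent from affecting others that submit to the same gate.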

Multi-Agent Workflows

The single-agent case is straightforward, but Flow was really designed for multi-agent architectures — the kind you build with LangGraph, CrewAI, AutoGen, or your own orchestration layer. In these setups, you have multiple specialized agents handling different parts of a pipeline: one agent researches, another drafts, a third formats, and a fourth submits to your system of record. Each agent is good at its job, but none of them knows what the others are doing, and any of them can produce data that doesn't meet your downstream requirements.

Gates are the shared contract between these agents and your systems. A "Customer Order" gate doesn't care whether the payload comes from a single monolithic agent or from the last step in a five-agent chain — it validates the same schema and business rules regardless. This means you can swap agents, change your orchestration graph, or add new agents to the pipeline without touching your validation logic. The gate is stable while the agents evolve around it.

In practice, this plays out in a few ways:

Pipeline validation. An orchestrator runs Agent A (data extraction) → Agent B (enrichment) → Agent C (formatting), and the final output goes through a Flow gate before hitting your database. If Agent C produces bad data, the orchestrator gets structured errors it can route back to the responsible agent for correction — not a generic 400 from your API.

Parallel agents, same gate. Multiple agents process different inputs concurrently — say, ten order-processing agents each handling a different customer. They all submit to the same "Order Validation" gate. Flow validates each independently, the circuit breaker tracks failures per session so one misbehaving agent doesn't affect the others, and your downstream system only receives validated payloads.

Cross-agent consistency. When Agent A writes to the "Invoice" gate and Agent B writes to the "Payment" gate, and both gates have business rules referencing amount ranges and currency constraints, you get consistent validation across your entire agent fleet without encoding those rules in each agent's prompt.

The analytics dashboard makes this observable — you can see which agents (by session) are hitting which gates, what their failure rates look like, and which business rules are triggering most often. When you're running dozens of agents in production, this is how you find the one that's drifting.

Human-in-the-Loop When You Need It

Not everything should be auto-approved. For high-value transactions, sensitive data changes, or any scenario where you want a human to verify before the data moves downstream, Flow supports approval workflows.

You configure approvers on a gate — either team members who review from the dashboard, or external reviewers who get a magic link via email. External reviewers don't need a Rynko account. They click a link, see the payload, and approve or reject it. The magic links are HMAC-SHA256 signed, expire after 72 hours, and are single-use for approval actions.
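
For readers unfamiliar with signed links, the sketch below shows one common way to combine an HMAC-SHA256 signature, a 72-hour expiry, and single-use redemption. The token layout, the in-memory used-token set, and every name here are assumptions for illustration; Rynko's actual link format is not described in this post.

```python
import hashlib
import hmac
import time
from typing import Optional

SECRET = b"demo-secret"        # hypothetical; load from a secret store in practice
TTL_SECONDS = 72 * 60 * 60     # 72-hour expiry, as stated above
_used_tokens = set()           # in-memory single-use registry for the demo


def _sign(message: str) -> str:
    return hmac.new(SECRET, message.encode(), hashlib.sha256).hexdigest()


def make_token(run_id: str, approver: str, now: Optional[float] = None) -> str:
    """Build "run|approver|expiry|signature" (assumed layout)."""
    issued = time.time() if now is None else now
    message = f"{run_id}|{approver}|{int(issued + TTL_SECONDS)}"
    return f"{message}|{_sign(message)}"


def redeem_token(token: str, now: Optional[float] = None) -> bool:
    """Accept a token once, only if the signature is valid and it has not expired."""
    now = time.time() if now is None else now
    try:
        run_id, approver, expires, sig = token.split("|")
        expires_at = int(expires)
    except ValueError:
        return False
    expected = _sign(f"{run_id}|{approver}|{expires}")
    # Constant-time comparison avoids leaking signature bytes through timing.
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return False
    if now >= expires_at or token in _used_tokens:
        return False
    _used_tokens.add(token)
    return True
```

A real service would track redeemed tokens in durable storage instead of a process-local set.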

The approval model is any-approves: the first approver to act determines the outcome. For high-volume gates, we batch notification emails into 5-minute digests so reviewers don't get buried in individual emails. There's also a hard safety cap of 30 emails per hour per approver to prevent notification fatigue.

The review experience for freetext content (Markdown, HTML, plain text) includes scroll-to-approve guardrails — the approve button stays disabled until the reviewer has scrolled through the entire content. For long documents, we auto-generate a table of contents from Markdown headers so reviewers can navigate quickly.

MCP Integration: Dynamic Tools per Gate

This is where Flow connects directly to the AI agent workflow. When you connect an AI tool (Claude Desktop, Cursor, VS Code, or any MCP client) to Flow's MCP endpoint at https://api.rynko.dev/api/flow/mcp, Flow auto-generates a validation tool for each active gate in your workspace.

A gate with slug order-validation becomes a tool called validate_order_validation. The tool's input schema is generated from the gate's current schema — each field becomes a typed JSON Schema property with its constraints. The tool description includes the gate's business rule error messages, so the agent understands the constraints before submitting.
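
The naming convention is simple enough to sketch. The slug and tool-name mappings below reproduce the two examples in this post (Order Validation to order-validation to validate_order_validation); the exact slug rules Flow applies to unusual names, the shape of the `gate` record, and the description wording are assumptions. The `name`, `description`, and `inputSchema` keys are the standard fields of an MCP tool definition.

```python
import re


def slugify(name: str) -> str:
    """Lowercase the name and collapse non-alphanumeric runs into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def tool_name_for(slug: str) -> str:
    """Map a gate slug to its tool name."""
    return "validate_" + slug.replace("-", "_")


def build_tool(gate: dict) -> dict:
    """Assemble an MCP-style tool definition from a hypothetical gate record."""
    rules = "; ".join(rule["message"] for rule in gate["rules"])
    return {
        "name": tool_name_for(gate["slug"]),
        "description": f"Validate a payload against the {gate['name']} gate. Rules: {rules}",
        "inputSchema": gate["schema"],
    }
```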

When you update a gate's schema, Flow pushes a notifications/tools/list_changed event to connected agents. They automatically see the updated tool list without reconnecting. No redeploy, no config change.

A conversation with an MCP-connected agent might look like this:

You: Validate this order before submitting.

Agent: I'll validate it against your Order Validation gate.
       [calls validate_order_validation]

       The order passed validation:
       - Schema validation: pass
       - Business rules: pass
       - Validation ID: v_abc123...

       You can use the validation ID to confirm this data
       hasn't been modified when it reaches your order system.

Freetext Mode

Not all agent outputs are structured JSON. Sometimes the output is a Markdown document, an HTML email body, or a code snippet. Flow supports a freetext mode where the gate accepts content as a string instead of a structured schema.

Content format is declared at gate creation — plaintext, Markdown, HTML, or code. For Markdown and HTML content, Flow runs a sanitization pipeline on the backend using sanitize-html with a strict allowlist. Script tags, iframes, event handlers, and inline styles are stripped. Links get rel="noopener noreferrer". The reviewer sees sanitized content in a sandboxed view.

This is useful for agent workflows that produce reports, summaries, or email drafts where you need a human to review the content before it gets sent.

Delivery and Reliability

After validation (and approval, if configured), Flow delivers the payload to your webhook endpoints. Deliveries are signed with HMAC-SHA256 so you can verify the payload hasn't been tampered with.
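
On the receiving side, verification usually means recomputing the HMAC over the raw request body and comparing it to the signature that came with the delivery. The sketch below assumes a hex-encoded signature computed over the exact body bytes; check the webhook documentation for the actual header name and encoding, which this post does not specify.

```python
import hashlib
import hmac


def sign_body(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw body (assumed encoding)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_webhook(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Return True only if the signature matches the body byte-for-byte."""
    expected = sign_body(secret, body)
    # compare_digest runs in constant time to resist timing attacks.
    return hmac.compare_digest(expected.encode(), signature_hex.encode())
```

Verify against the raw bytes before parsing the JSON: re-serializing a parsed payload can reorder keys or change whitespace, which would change the digest.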

The retry policy is straightforward: 5 attempts with delays at 30 seconds, 2 minutes, and 10 minutes. Each delivery attempt is logged, and failed deliveries can be retried manually from the dashboard or via the SDK.

Flow enforces per-team concurrency caps to keep the multi-tenant system fair — the exact limits scale with your tier, but even on the Scale plan, no single team can consume more than 25% of total worker concurrency. This prevents one team's spike from degrading service for everyone else.

SDKs

We've added Flow support to all three official SDKs. The pattern is the same across Node.js, Python, and Java: submit a run, poll for the result, and handle approvals:

import { Rynko } from '@rynko/sdk';

const client = new Rynko({ apiKey: process.env.RYNKO_API_KEY });

const run = await client.flow.submitRun('gate_abc123', {
  input: {
    orderId: 'ORD-2026-042',
    amount: 1250.00,
    currency: 'USD',
    customerEmail: 'buyer@example.com',
  },
});

const result = await client.flow.waitForRun(run.id, {
  pollInterval: 1000,
  timeout: 60000,
});

if (result.status === 'approved' || result.status === 'completed') {
  console.log('Validated:', result.validatedPayload);
} else if (result.status === 'validation_failed') {
  console.log('Errors:', result.errors);
}

The SDKs are at version 1.3.1 with 14 Flow methods covering gates (read-only), runs, approvals, and deliveries. All three SDKs include automatic retry with exponential backoff for rate limits and transient errors.
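
If you are calling the API without an SDK, retry with exponential backoff is easy to reproduce. The sketch below is a generic version of the technique, not the SDKs' actual code: the `TransientError` class, the delay values, and the jitter amount are all assumptions.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for rate-limit and other retryable failures."""


def call_with_backoff(fn, max_attempts=4, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call fn(), retrying on TransientError with doubling delays plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # A little jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 10))
```

Passing `sleep` as a parameter makes the helper easy to test without actually waiting.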

Pricing

Flow is a separate subscription from Rynko's document generation (Render). The pricing is based on validation runs per month:

Tier	Runs/Month	Gates	Price	Overage
Free	500	3	$0	—
Starter	10,000	Unlimited	$29/mo	$0.005/run
Growth	100,000	Unlimited	$99/mo	$0.004/run
Scale	500,000	Unlimited	$349/mo	$0.002/run

The free tier is gate-limited to 3 — this is intentional. Most teams find they need more gates once they start connecting multiple agents to different validation checkpoints, and that's the natural upgrade trigger. Paid tiers have unlimited gates.
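
To compare tiers for a given volume, the table can be turned into a small calculator. This sketch assumes overage is billed per run beyond the included amount and that the Free tier simply cannot exceed its 500 runs; both are readings of the table, not confirmed billing rules.

```python
from typing import Optional

# (included runs, monthly price in USD, overage per run), copied from the table above
TIERS = {
    "Free": (500, 0.0, None),          # no overage listed for Free
    "Starter": (10_000, 29.0, 0.005),
    "Growth": (100_000, 99.0, 0.004),
    "Scale": (500_000, 349.0, 0.002),
}


def monthly_cost(tier: str, runs: int) -> Optional[float]:
    """Estimated monthly cost, or None if the tier cannot cover this volume."""
    included, price, overage = TIERS[tier]
    extra = max(0, runs - included)
    if extra == 0:
        return price
    if overage is None:
        return None
    return round(price + extra * overage, 2)


def cheapest_tier(runs: int) -> str:
    """Pick the tier with the lowest estimated cost for this volume."""
    costs = {tier: monthly_cost(tier, runs) for tier in TIERS}
    feasible = {tier: cost for tier, cost in costs.items() if cost is not None}
    return min(feasible, key=feasible.get)
```

Under these assumptions, Starter with overage costs the same as Growth at 24,000 runs per month (29 + 14,000 × 0.005 = 99), so Growth is the cheaper choice above that volume.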

To celebrate the launch, we're opening a Founder's Preview — sign up today and get 3 months of the Growth tier (100,000 runs/month, unlimited gates) completely free. No credit card required, no commitment. Once the preview ends, you can stay on Growth or switch to any tier that fits your usage.

If you also need document generation, Render Packs are available as add-ons on any tier — $19/month for 500 documents, $49 for 2,000, or $119 for 10,000.

The Dashboard

Flow comes with a full web dashboard for managing gates, reviewing runs, handling approvals, and tracking analytics. The gate configurator includes a visual schema builder (with Pydantic and Zod import), a business rule editor with live expression validation, and approval/delivery configuration. The runs view shows real-time status updates, validation error breakdowns, and a timeline of each run's journey through the pipeline.

The analytics dashboard covers the metrics you'd expect — run outcomes by gate, top failing rules, approval rates and decision times, throughput over configurable periods, and circuit breaker health. These metrics help you tune your gates and catch systemic issues early.

Getting Started

Sign up for free — 500 Flow runs/month included, no credit card
Create a gate in the Flow dashboard — define your schema and business rules
Submit a test payload using the Quick Start guide or the dry-run endpoint (doesn't count against quota)
Connect your AI agent via the MCP endpoint — Claude Desktop, Cursor, VS Code, Windsurf, Zed, or Claude Code

Flow is live and production-ready. We've been running it internally for weeks and the architecture has handled sustained load without surprises. If you're building with AI agents and need a systematic way to validate their outputs before they reach downstream systems, this is what we built it for.

Questions or feedback: support@rynko.dev or Discord.

Fast PDF Generation API: A Native Puppeteer Alternative

Srijith Kartha — Mon, 23 Feb 2026 10:29:54 +0000

Today, I am introducing Rynko. This is a new document generation platform built to help developers and AI agents design and generate PDF and Excel documents at scale without the traditional overhead.

If you are building a SaaS, you eventually have to generate an invoice, a receipt, or a complex report. Developers usually waste days wrestling with CSS media queries or setting up resource-heavy HTML-to-PDF microservices using Puppeteer. Rynko provides the infrastructure to design, version, and generate documents deterministically. You can go from a blank canvas to a production-ready template in minutes and get back to building your core product.

Architecture: Native Speed, No Bloat

Rynko generates PDF and Excel documents from a single definition. This definition is a structured JSON component tree rather than HTML.

We explicitly chose not to use HTML because headless browsers are heavy. A standard Chromium-based PDF generator can easily consume hundreds of megabytes of RAM per instance. Rynko uses a native layout pipeline powered by the Yoga Layout Engine and PDFKit.

The result is a massive win for your server costs and performance:

Low Footprint: Rynko workers operate at roughly 50MB of memory.
High Speed: Documents generate in a median of 426 milliseconds.
Deterministic: Identical JSON input produces identical PDF output every single time. There are no rendering differences between your local machine and your production server.

The Template Designer

You do not have to write the JSON manually. Templates are designed visually using a drag-and-drop editor that supports over 19 component types. These include data tables, charts, dynamic QR codes, and conditional logic.

Each component has a strict property schema validated at design time. You can preview templates in real time with live variable substitution and sample data. Once designed, the exact same template can generate both a highly styled PDF and a formatted Excel spreadsheet with native formulas.

AI Integration: Let Agents Do the Work

To make integration even faster, we built a native Model Context Protocol (MCP) server. This allows AI agents from Claude Desktop, Cursor, or Windsurf to interact with Rynko directly.

You can prompt your IDE to "Generate an invoice template for Acme Corp with a tax calculated field." The agent will use the MCP tools to build the JSON tree and draft the template. You can then review it visually in the dashboard before using it in your application.

Developer Experience

We treat document generation as a first-class, code-first concern. We provide official SDKs for Node.js, Python, and Java. These SDKs feature automatic retries with exponential backoff.

You can batch generate multiple documents in a single API call. Final documents are delivered via cryptographically signed URLs that automatically expire after three days. Webhook deliveries include HMAC-SHA256 signature verification so you can securely update your database when a document is ready.

Infrastructure That Grows with You

Rynko is easy enough for a weekend side project, but it is built to handle enterprise scale when your startup grows.

We organize resources using Projects and Environments. You get complete resource isolation for your dev, staging, and production environments. When you land enterprise clients, Rynko is ready with PDF/A-2b compliance for long-term archival, role-based access control for your team, and full audit logs.

Join the Public Beta: Founder's Preview

Rynko isn't just a basic PDF wrapper. We are building the deterministic infrastructure that allows developers and AI agents to generate documents at scale.

Rynko is currently in Public Beta: Founder's Preview. Join today to claim 5,000 free document generation credits and start building deterministic document workflows without the Chromium overhead.

Try It Free | MCP Setup Guide | Documentation

Questions? Join our Discord or check the npm package.

Disclosure: I ideate and draft content, then refine it with the aid of artificial intelligence tools like Claude and revise it to reflect my intended message.