Srijith Kartha

Posted on • Originally published at blog.rynko.dev

Teaching Gates to Learn: How We Built Intelligence Into Rynko Flow

Flow validates agent outputs against schemas and business rules. But when 60% of agents fail the same rule on first attempt, the gate should be telling you why — and helping you fix it.


The key insight: When agents fail and retry without clear guidance, it's not a minor inconvenience — it's a reliability failure. Every failed correction loop is a moment where your automation is stuck in a cycle of non-compliance. Gate Intelligence identifies the friction points that prevent agents from reaching a successful state, and feeds that knowledge back into the gate's contract so the next agent gets it right on the first try.


When we launched Flow, the pitch was straightforward: define a gate with a schema and business rules, point your agent at it, and Flow validates the payload before it reaches your database. Schema checks, expression-based rules, optional human approval. It works well — agents submit data, gates validate it, failed submissions come back with structured errors the agent can act on.

But we were sitting on a pile of useful data and not doing anything with it.

Every run Flow processes is stored: the input payload, the validation verdict, which rules passed, which failed, and the exact values that caused the failure. For agents that self-correct, we track the full chain — first attempt, second attempt, third, until either the agent gets it right or gives up. That's tens of thousands of data points per gate per week, and until now, it only showed up as numbers on the analytics dashboard.

Gate Intelligence turns that data into concrete suggestions for improving your gates.

The Problem It Solves

Here's a real pattern we saw in our own test gates. We set up an invoice validation gate with five business rules: amount must be positive, currency must be a 3-letter uppercase code, vendor can't be empty, line items must sum to the total, and there must be at least one line item. Standard stuff.

When we ran agents against it, 40% of first attempts failed the currency format rule. The agents were submitting "usd" and "eur" instead of "USD" and "EUR". Another 25% failed the line items sum check — off by a fraction of a cent due to floating-point rounding. 15% of submissions omitted the vendor field entirely.

None of these are schema problems. The schema says currency is a string, vendor is required, amount is a number. All correct. The issue is that the gate's documentation and rules don't give agents enough context to get it right on the first try. The currency rule says "must equal its own uppercase version" — technically precise, but a Claude or GPT model reading the MCP tool description doesn't know that means "must be uppercase ISO 4217."

When agents fail and retry without that context, the automation isn't saving time — it's stuck. Each failed loop is a moment where your pipeline is spinning instead of producing results. If an agent needs five attempts and 45 seconds to pass a rule that a single well-placed hint would have fixed on the first try, the gate itself is the bottleneck.

Gate Intelligence identifies these patterns automatically and tells you what to do about them.

The gate configurator showing the 5-step card layout (Details → Hints → Schema → Processing → Delivery)

What It Computes

Every hour, a background job runs for each active gate. It analyzes the last 7 days of runs and computes six metrics:

Per-rule failure rates — what percentage of first-attempt submissions fail each rule, with trend direction compared to the previous 7-day window. If your "amount must be positive" rule went from 5% failure to 15%, that's flagged as trending up.

Common failure values — the actual values agents submitted that caused failures. For the currency rule, this surfaces "usd", "eur", "gbp" as the top offenders. For a numeric rule, it might show 0, -1, or 99999.999. These values are what make suggestions actionable — instead of "rule X fails a lot," the system can say "agents are submitting lowercase currency codes."

Field omission rates — how often required schema fields are missing from submissions. A 30% omission rate on the vendor field means agents don't realize it's required, or the field name isn't clear enough.

Chain convergence — of all the failed submissions that triggered a correction chain, what percentage eventually succeeded? If agents submit, fail, retry, fail again, and give up 70% of the time, that's a fundamental reliability problem. A 30% convergence rate doesn't just mean "some retries" — it means your automation succeeds less than a third of the time. For any system that's supposed to run autonomously, that's a non-starter.

Average chain length and time-to-correction — how many attempts does it take, and how long does the cycle last? Two attempts averaging 3 seconds is healthy. Five attempts averaging 45 seconds means the agent is struggling.
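The chain metrics above can be sketched in a few lines. This is a minimal illustration, not Flow's implementation — the `Chain` record shape and field names are assumptions:

```python
from dataclasses import dataclass

# Hypothetical shape of one correction chain: the ordered attempts an
# agent made against a single gate, with outcome and total duration.
@dataclass
class Chain:
    attempts: int        # total submissions in the chain
    converged: bool      # did the final attempt pass?
    duration_s: float    # time from first attempt to final verdict

def chain_metrics(chains: list[Chain]) -> dict:
    """Aggregate convergence rate, average length, and time-to-correction."""
    if not chains:
        return {"convergence": None, "avg_attempts": None, "avg_seconds": None}
    converged = [c for c in chains if c.converged]
    return {
        # Share of correction chains that eventually succeeded.
        "convergence": len(converged) / len(chains),
        # Average attempts per chain (healthy gates stay near 2).
        "avg_attempts": sum(c.attempts for c in chains) / len(chains),
        # How long a successful correction cycle takes on average.
        "avg_seconds": (sum(c.duration_s for c in converged) / len(converged))
                       if converged else None,
    }
```

Two chains that pass quickly and one that gives up after five attempts would yield a 67% convergence rate — already in warning territory by the thresholds described later.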

The three analysis cards (Rule Failure Rates, Field Omissions, Self-Correction Chains) showing real data with trend arrows and color-coded metrics

Pattern Detection

Raw metrics tell you what is failing. Pattern detection tells you why.

The system examines common failure values and classifies them into four fixable patterns:

  • Case mismatch — the submitted value is a lowercase version of what's expected. "usd" vs "USD", "active" vs "Active". This usually means the gate needs to either make the rule case-insensitive or add explicit guidance about expected casing.

  • Rounding tolerance — a numeric value is within 1% of the expected threshold but fails because of floating-point precision. An amount of 99.999 failing an exact equality check where the expected sum is 100.00. The fix is usually adding a small tolerance to the rule.

  • Type coercion — a string representation of a number where a number is expected. The string "42" instead of the number 42. Common with agents that serialize JSON from natural language.

  • Empty string — an empty string where a non-empty value is expected. Distinct from a missing field — the agent knows the field exists but doesn't have a value for it.

These patterns feed into the suggestions. Instead of a generic "rule X fails 60% of the time," the suggestion says "the currency format rule fails 60% of the time — agents submit lowercase currency codes (usd, eur, gbp). Consider adding a note that currency must be uppercase ISO 4217."
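A classifier for the four patterns can be surprisingly small. The sketch below follows the descriptions above (including the 1% rounding window) but is illustrative — function name and ordering are my own, not Flow's code:

```python
def classify_failure(submitted, expected):
    """Classify a failing value into one of the four fixable patterns,
    or None if it matches none of them."""
    # Empty string: the agent knows the field exists but has no value.
    if submitted == "":
        return "empty_string"
    # Case mismatch: "usd" submitted where "USD" was expected.
    if (isinstance(submitted, str) and isinstance(expected, str)
            and submitted != expected
            and submitted.lower() == expected.lower()):
        return "case_mismatch"
    # Type coercion: the string "42" where the number 42 was expected.
    if isinstance(submitted, str) and isinstance(expected, (int, float)):
        try:
            float(submitted)
            return "type_coercion"
        except ValueError:
            pass
    # Rounding tolerance: numeric value within 1% of the expected threshold.
    if (isinstance(submitted, (int, float)) and isinstance(expected, (int, float))
            and expected != 0
            and abs(submitted - expected) / abs(expected) <= 0.01):
        return "rounding_tolerance"
    return None
```

Running it over the common failure values from the analysis step ("usd" vs "USD", 99.999 vs 100.00) is what turns raw counts into the targeted suggestions below.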

Suggestions and the Intelligence Tab

Each gate now has an Intelligence tab alongside Configuration and Performance. The tab shows a summary bar with insight counts by severity, per-rule failure rates with trend arrows, field omission rates, chain convergence metrics, and a health trend chart built from historical snapshots.

Below the analysis cards, concrete suggestions appear as dismissable cards with three severity levels:

| Severity | Trigger | Example |
| --- | --- | --- |
| Critical | Rule fails >50% of first attempts | "Currency format fails 60% — add format guidance" |
| Critical | Chain convergence below 50% | "Agents give up on this gate 70% of the time" |
| Warning | Required field missing >30% | "Vendor omitted in 38% of submissions" |
| Info | Rule never fails (500+ runs) | "This rule may be redundant — 0 failures in 600 runs" |
| Info | All rules >95% success | "Excellent validation performance" |

Each suggestion has three actions: Apply, Dismiss, and Snooze (hide for 7 days).
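The trigger table maps naturally onto a small rule set. Here's a hedged sketch of that mapping — the metric field names are made up for illustration:

```python
def suggestions_for(m):
    """Turn gate metrics into (severity, message) suggestions.
    `m` is an assumed dict: rule_failure_rates, field_omission_rates,
    chain_convergence, total_runs."""
    out = []
    rates = m["rule_failure_rates"]  # {rule_key: first-attempt failure rate}
    for rule, rate in rates.items():
        if rate > 0.50:
            out.append(("critical", f"{rule} fails {rate:.0%} of first attempts"))
        elif rate == 0.0 and m["total_runs"] >= 500:
            out.append(("info", f"{rule} may be redundant — 0 failures in {m['total_runs']} runs"))
    if m["chain_convergence"] < 0.50:
        out.append(("critical",
                    f"agents give up on this gate {1 - m['chain_convergence']:.0%} of the time"))
    for field, rate in m["field_omission_rates"].items():
        if rate > 0.30:
            out.append(("warning", f"{field} omitted in {rate:.0%} of submissions"))
    if rates and all(r < 0.05 for r in rates.values()):
        out.append(("info", "excellent validation performance — all rules above 95% success"))
    return out
```

The point of keeping the triggers this simple is that every suggestion stays explainable: each one can cite the exact metric and threshold that produced it.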

Close-up of a critical-severity suggestion card showing the Apply/Dismiss/Snooze action buttons and the severity coloring

Version-Controlled Hints

This is where the architecture gets interesting. The first version of "Apply" directly modified the gate's description field — injecting hint text like "Common mistakes: agents submit lowercase currency codes." It worked, but it was a bad design for three reasons.

First, no version control. The description change bypassed the gate's draft/publish pipeline. In regulated environments — banking, healthcare, insurance — operators need to know exactly when and why a gate's contract changed. A direct write to the description is invisible in the version history.

Second, no audit trail. If an agent's behavior shifts after someone clicks "Apply" (for better or worse), there's no correlation between the click and the behavior change.

Third, no review step. The hint goes live immediately. If Gate Intelligence generates five suggestions and the operator clicks Apply on all of them, five changes hit production with no review.

So we changed the approach. Hints are now a first-class versioned field on the gate — stored alongside the schema, business rules, and identity key fields. When you click "Apply" on a suggestion, it adds the hint text to the gate's draft version. If no draft exists, it creates one. The hint doesn't go live until you review it in the gate configurator and publish.

The gate configurator now has a dedicated "Hints" panel sitting between the Details and Schema steps — visible at a glance without opening any dialog. You can see what Intelligence suggested, edit the text, add your own custom hints, or remove ones you don't want. When you're satisfied, you publish the gate version — which goes through the existing audit log, resets circuit breakers, and notifies connected MCP sessions that the tool description has changed.

This means hints get the same treatment as any other gate configuration change: versioned, auditable, rollbackable.
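In pseudocode terms, "Apply" reduces to a draft-or-create step. This is a sketch of the flow described above, with a hypothetical gate record shape (`published` and `draft` version dicts):

```python
def apply_suggestion(gate, hint_text):
    """Add a suggested hint to the gate's draft version.
    Nothing goes live until the draft is published."""
    if gate.get("draft") is None:
        # No draft exists: fork one from the published version so the
        # hint rides through the normal review/publish pipeline.
        gate["draft"] = {
            **gate["published"],
            "hints": list(gate["published"].get("hints", [])),
        }
    if hint_text not in gate["draft"]["hints"]:
        gate["draft"]["hints"].append(hint_text)
    return gate
```

The invariant worth noting: the published record is never mutated, so the version history, audit log, and rollback story all fall out of the existing publish machinery for free.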

Hints Panel in the gate configurator, showing 2-3 hints between the Define Details and Schema & Validate step cards

How Hints Reach the Agent

The MCP tool description for each gate is assembled from three sources — the gate description, its business rules, and the published hints. For the invoice gate, the result looks like this:

```
Submit data to Invoice Validation
Validates invoice payloads before processing to the ERP system.

Business rules:
- amount_positive: Amount must be greater than zero
- currency_format: Currency must be valid ISO 4217
- line_items_match: Line item totals must equal invoice amount

--- Best Practices ---
- Currency must be uppercase ISO 4217 (e.g., USD, EUR, not usd)
- Line item totals must sum to the invoice amount within ±0.01
- Vendor name is required — do not submit an empty string
```

The gate description is always included. Business rules are always appended so the agent knows the constraints. The "Best Practices" section only appears when the auto-hints toggle is enabled on the gate — it's off by default because it changes what agents see, and the gate owner should make that decision deliberately.

The key improvement over the original architecture: reading hints is now a simple array read from the published gate record rather than a query against the insights table. The old approach hit the database every time an MCP tool description was built — and descriptions are assembled on every session connection and tool refresh. Now the hints are copied onto the gate record at publish time, so the tool-build path stays fast and predictable, which is critical when you're serving multiple concurrent agent sessions.
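The assembly itself is a straightforward concatenation. A sketch under assumed field names (`description`, `rules`, `hints`, `auto_hints_enabled`) — the hints come straight off the published record, no extra query:

```python
def build_tool_description(gate):
    """Assemble an MCP tool description from the three sources:
    gate description, business rules, and (optionally) hints."""
    lines = [gate["description"], "", "Business rules:"]
    lines += [f"- {r['key']}: {r['summary']}" for r in gate["rules"]]
    # Hints only appear when the gate owner has opted in.
    if gate.get("auto_hints_enabled") and gate.get("hints"):
        lines += ["", "--- Best Practices ---"]
        lines += [f"- {h}" for h in gate["hints"]]
    return "\n".join(lines)
```

Because the function is pure over the published record, the description is deterministic for a given gate version — exactly what you want when many agent sessions refresh their tool lists concurrently.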

Historical Snapshots and Trend Analysis

Each time the intelligence job runs, it saves a snapshot with aggregate metrics: total runs, overall failure rate, per-rule failure rates, field omission rates, chain convergence, and suggestion counts. This creates a time series of gate health that's visible in the Intelligence tab as a bar chart.

The chart color-codes each bar: red for failure rates above 50%, amber for 20–50%, and the primary color below 20%. Hovering shows the exact values and date. Over time, you can see whether applying suggestions actually improved the gate's success rate — which is the whole point.
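The color thresholds reduce to a tiny function — shown here only to pin down the boundaries described above (the names are illustrative):

```python
def bar_color(failure_rate: float) -> str:
    """Health-chart color for a snapshot's overall failure rate."""
    if failure_rate > 0.50:
        return "red"      # failing more often than not
    if failure_rate >= 0.20:
        return "amber"    # elevated, worth investigating
    return "primary"      # healthy
```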

The health trend chart showing failure rate bars over time, with visible improvement (red bars transitioning to amber/green) after hints were applied

What's Next

Gate Intelligence today is reactive — it analyzes historical data and suggests improvements. There are two directions on the roadmap:

Proactive schema evolution: if Intelligence detects that agents consistently submit a field that isn't in the schema (say, a tax rate keeps appearing in payloads that only define amount and currency), it suggests adding it. This requires analyzing raw payloads beyond just validation results, which is a different data pipeline.

AI Judge integration: the current business rules are deterministic expressions. We're building an "AI Judge" mode that evaluates payloads using an LLM for semantic checks that can't be expressed as expressions — things like "the description should be professional in tone" or "the address looks like a real postal address." Intelligence would track AI Judge pass/fail rates the same way it tracks expression rules, but the suggestion engine would need to account for the non-deterministic nature of LLM evaluation.

Neither is shipped yet, but the foundation is designed for them. The analysis pipeline, suggestion engine, versioned hints, and snapshot time series are all extensible — adding a new data source feeds into the same pattern detection and suggestion framework without rearchitecting anything.

Getting Started

If you have an active Flow gate with at least 50 runs, Intelligence will start generating insights on the next hourly cycle. Open any gate, click the Intelligence tab, and hit Refresh to trigger analysis immediately.

We're rolling it out gradually — it's available today for all paid tiers (Starter, Growth, Scale) and will be available on the Free tier once we're confident in the compute overhead.

The full gate detail page showing all three tabs (Configuration, Performance, Intelligence) with the Intelligence tab active

Whether your agents are running on AWS Bedrock, OpenAI's API, or any other provider — the validation layer is where reliability is won or lost. If your gates are rejecting 60% of first attempts and your correction chains converge less than half the time, your automation isn't autonomous. It's just expensive retry logic. Gate Intelligence gives you the data to fix that, and the versioned hints to make the fix stick.

Flow docs: docs.rynko.dev/flow

Get started: app.rynko.dev/signup — free tier, 500 runs/month, 3 gates, no credit card.
