Vaibhav Kumar Kandhway

Posted on Jun 7

The Execution Safety Crisis in Multi-Agent Workflows — And the Architectural Pattern That Solves It

#ai #programming #architecture #systemdesign

The biggest unresolved problem in multi-agent workflows is not reasoning. It is execution safety.

Most teams building with LLMs today have not encountered this problem yet — because they have not scaled yet. This article is for the ones who are about to.

The Core Tension

LLMs are probabilistic by nature. Every output is a sample from a probability distribution. There is no guarantee that the same prompt produces the same output twice. That is not a bug — it is the fundamental property that makes language models useful.

Production backend systems are deterministic by requirement. The same input must always produce the same state change, traceably, verifiably, with an audit log that can be reconstructed after the fact.

When you connect an agent directly to an execution environment — via raw Python, open-ended tool calling, or unstructured function dispatch — you are bridging these two worlds with no safety boundary between them.

The agent reasons correctly ninety-nine times. The hundredth time, it hallucinates a parameter, misreads a context window, or generates structurally valid but semantically wrong instructions. In a traditional software system, that is a bug you catch in testing. In an agentic system with direct execution access, that is a silent state corruption — no stack trace, no audit log, no clean error surface.

This is not a prompting problem. It is not a model quality problem. It is an architectural problem.

Three Approaches — And Why Two Fail at Scale

Approach 1 — Direct Execution (Raw Tool Calling)

The agent generates intent and executes it directly via function calls, shell commands, or Python scripts. The architecture looks like this:

User Intent → LLM → Tool Call → System Execution

This is where most teams start. It is fast to prototype, easy to wire up with LangChain or CrewAI, and works impressively in demos.

The problem surfaces in production. There is no layer between what the model decided and what the system did. Failures are runtime failures — discovered after state has already changed. Invalid arguments do not fail cleanly; they fail at the system boundary, often silently, often after partial execution.
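A minimal sketch of the partial-execution failure, assuming a hypothetical in-memory ledger and two made-up tools (`debit`, `credit`); none of these names come from a real framework. The second tool call carries a hallucinated account name, and by the time it fails the first call has already changed state:

```python
# Hypothetical in-memory state standing in for a production system.
balances = {"alice": 100, "bob": 0}

def debit(account: str, amount: int) -> None:
    balances[account] -= amount

def credit(account: str, amount: int) -> None:
    balances[account] += amount  # raises KeyError for an unknown account

TOOLS = {"debit": debit, "credit": credit}

# Hypothetical model output: the second call misspells the account name.
tool_calls = [
    ("debit", {"account": "alice", "amount": 40}),
    ("credit", {"account": "bobb", "amount": 40}),
]

def run_directly(calls) -> None:
    """Approach 1: dispatch each call as it arrives, with no validation pass."""
    for name, args in calls:
        TOOLS[name](**args)

failure = None
try:
    run_directly(tool_calls)
except KeyError as exc:
    failure = exc

# alice has been debited, bob was never credited: the transfer is half applied.
print(balances)
```

In this toy the failure is at least loud. Against a real backend the second call might instead succeed against the wrong record, which is the silent variant of the same problem.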

A 2025 research taxonomy of multi-agent failures identified 14 unique failure modes across frameworks including AutoGen, ChatDev, and CrewAI. The study's core finding was stark: "improvements in the base model capabilities will be insufficient to address the full taxonomy. Instead, good multi-agent system design requires organizational understanding; even organizations of sophisticated individuals can fail catastrophically if the organization structure is flawed."

The failures are not in the model. They are in the architecture.

There is also a compounding reliability problem. If each agent in a chain is 95% reliable, chaining three agents together drops overall task success to roughly 86%. Add more steps and reliability falls exponentially — not because any individual agent is bad, but because failures cascade across the chain with no structural containment.
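The arithmetic behind that figure can be sketched in a few lines, assuming each step fails independently (a simplifying assumption; real agent failures are often correlated):

```python
def chain_success(per_step_reliability: float, steps: int) -> float:
    """Probability that every step in the chain succeeds, assuming independence."""
    return per_step_reliability ** steps

for steps in (1, 3, 5, 10, 20):
    print(f"{steps:>2} steps: {chain_success(0.95, steps):.1%}")
```

At three steps this gives 0.95 ** 3 = 0.857, the "roughly 86%" above. At ten steps it is about 60%, and at twenty about 36%.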

Direct execution has no containment layer. This is the approach that cannot scale.

Approach 2 — Natural Language Parsing with Guardrails

A validation layer sits between the agent and the execution environment, checking outputs against a set of rules before running them.

User Intent → LLM → Output → Guardrail Filter → Execution

This is better. Frameworks like NeMo Guardrails, Guardrails-AI, and AWS Bedrock Guardrails operate in this space. They provide output validation, content filtering, and policy enforcement at the boundary.

But the grammar of what the agent can produce is still unbounded. The model outputs free-form text or loosely structured JSON. The guardrail then attempts to validate that output against a rule set.

The problem is fundamental: you are filtering an infinite space rather than constraining the space itself. Rule-based validation written against ambiguous, open-ended output will always have edge cases. An agent that outputs something technically valid but semantically harmful can slip through. An agent that outputs something in a format the guardrail did not anticipate can fail unpredictably.
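A deliberately simple sketch of that edge-case problem, using a hypothetical regex denylist. This is not how NeMo Guardrails, Guardrails-AI, or Bedrock Guardrails are implemented; it only illustrates filtering an open-ended output space with rules written in advance:

```python
import re

# Hypothetical denylist: the patterns the guardrail author thought of up front.
DENYLIST = [
    re.compile(r"\bDROP\s+TABLE\b", re.IGNORECASE),
    re.compile(r"\bDELETE\s+FROM\b", re.IGNORECASE),
]

def guardrail_allows(agent_output: str) -> bool:
    """Return True when no denylisted pattern matches the free-form output."""
    return not any(pattern.search(agent_output) for pattern in DENYLIST)

print(guardrail_allows("DROP TABLE users;"))  # False: anticipated, blocked
print(guardrail_allows("TRUNCATE users;"))    # True: just as destructive, never anticipated
```

Every rule added closes one gap, but the set of possible outputs stays unbounded, so the list of gaps never reaches zero.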

Microsoft's research on LLMs and DSLs found that models still hallucinate outputs even when given grammar files and format constraints — they produce correctly formatted responses that are semantically wrong. Filtering catches some of that. It cannot catch all of it, because the thing you are filtering against is not formally defined.

This approach is necessary but not sufficient.

Approach 3 — The LLM-to-DSL Compiler Pattern

This is the architectural shift that moves the safety guarantee from runtime behavior to structural design.

User Intent → LLM → DSL Output → Grammar Validator → Execution Engine

Instead of generating free-form code or natural language instructions, the agent compiles user intent into a Domain-Specific Language — a rigid, custom grammar with a strictly bounded output space. The system then runs that DSL through a deterministic validation engine before a single instruction touches system state.

We have used DSLs for decades to constrain logic to strict domains:

Standard SQL has no statement for invoking a shell command (vendor extensions aside)
Terraform's configuration language declares resources; it has no construct for arbitrary imperative code (provisioners aside)
CSS describes presentation; it has no syntax for running scripts

The grammar defines what is expressible. Everything outside it is structurally impossible — not filtered, not blocked, but inexpressible by construction.

The new paradigm applies this same principle to AI orchestration.

The Three Stages of the LLM-to-DSL Pattern

Stage 1 — Constrained Generation

The agent translates user intent into DSL rather than general-purpose code.

Here is a minimal illustration of the difference. Consider an agent tasked with querying a database.

Open-ended tool calling (Approach 1):

# The LLM generates this. Anything goes.
import subprocess
result = subprocess.run(["psql", "-c", "DROP TABLE users;"], capture_output=True)

The model intended to query. It hallucinated a destructive operation. The grammar of Python allowed it.

DSL-constrained output (Approach 3):

QUERY users
  WHERE status = "active"
  LIMIT 100
  RETURN [id, name, email]

This grammar does not contain a DROP keyword, so a destructive operation cannot be expressed. The hallucination has no surface to land on.

The DSL defines the contract between the AI's reasoning and the system's execution. Not by filtering what the model says — by defining what the model can say.
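One way to picture "defining what the model can say" is to write down the shape of every statement the grammar admits. The sketch below does that with Python dataclasses for the illustrative QUERY DSL above; the type names (`Condition`, `Query`) and the `render` helper are assumptions made for this example, not part of any published system:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Condition:
    field: str
    value: str

@dataclass(frozen=True)
class Query:
    """The only statement the DSL can express: a bounded, read-only query."""
    table: str
    where: tuple[Condition, ...]
    limit: int
    columns: tuple[str, ...]

def render(query: Query) -> str:
    """Serialize a Query into the DSL text form used in the article's example."""
    lines = [f"QUERY {query.table}"]
    for cond in query.where:
        lines.append(f'  WHERE {cond.field} = "{cond.value}"')
    lines.append(f"  LIMIT {query.limit}")
    lines.append(f"  RETURN [{', '.join(query.columns)}]")
    return "\n".join(lines)

active_users = Query(
    table="users",
    where=(Condition("status", "active"),),
    limit=100,
    columns=("id", "name", "email"),
)
print(render(active_users))
```

There is no field here that could carry a DROP, a shell command, or a write. Adding one would be a reviewed change to the grammar rather than something a model can improvise at generation time.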

Stage 2 — Deterministic Validation

A backend engine parses the DSL output against a formal grammar. Because the grammar is bounded, parsing is deterministic. A given DSL input either passes or fails: no ambiguity, no partial execution, no silent errors.

Here is what that validation layer looks like structurally:

DSL Input → Lexer → Token Stream → Parser → AST → Semantic Validator → Execution Plan

At each stage, failure is explicit:

Lexer rejects unknown tokens
Parser rejects malformed structure
Semantic Validator rejects valid syntax with invalid logic (e.g., referencing a field that does not exist in the schema)

The result: hallucinations and invalid logic do not produce silent runtime failures. They fail at the compilation step — before execution begins. The error is precise, attributable, and logged at the grammar level, not discovered as a corrupted state three steps later.

This is the parallel to Rust's ownership model. C trusted the programmer — one lapse, and the consequences were severe. Garbage-collected languages trusted the runtime — safety was real, but you lost control. Rust encoded correctness into the compiler itself — the guarantee is structural, not behavioral. The LLM-to-DSL pattern does the same thing for agentic execution.
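The lexer, parser, and semantic validator stages described above can be sketched for the illustrative QUERY DSL as follows. The grammar, the `SCHEMA` contents, and every function name are assumptions made for this example; a production system would more likely use a parser generator and load the schema from its own catalog:

```python
import re

class DSLError(Exception):
    """Compilation failure tagged with the stage that rejected the input."""

    def __init__(self, stage: str, message: str) -> None:
        super().__init__(f"[{stage}] {message}")
        self.stage = stage

# Hypothetical schema; a real system would load this from its own catalog.
SCHEMA = {"users": {"id", "name", "email", "status"}}
KEYWORDS = {"QUERY", "WHERE", "LIMIT", "RETURN"}
TOKEN_RE = re.compile(
    r'\s*(?:(?P<word>[A-Za-z_]\w*)|(?P<num>\d+)|(?P<str>"[^"]*")|(?P<sym>[=\[\],]))'
)

def lex(text: str) -> list:
    """Lexer: turn text into (kind, value) tokens, rejecting unknown characters."""
    text, tokens, pos = text.rstrip(), [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise DSLError("lexer", f"unknown token near {text[pos:pos + 10]!r}")
        kind, value = match.lastgroup, match.group(match.lastgroup)
        if kind == "word":
            kind = "kw" if value in KEYWORDS else "ident"
        tokens.append((kind, value))
        pos = match.end()
    return tokens

def parse(tokens: list) -> dict:
    """Parser: enforce QUERY t [WHERE f = "v"]* LIMIT n RETURN [c, ...]."""
    stream = iter(tokens)
    tok = next(stream, ("eof", ""))

    def expect(kind: str, value: str = "") -> str:
        nonlocal tok
        if tok[0] != kind or (value and tok[1] != value):
            raise DSLError("parser", f"expected {value or kind}, got {tok[1]!r}")
        consumed, tok = tok[1], next(stream, ("eof", ""))
        return consumed

    expect("kw", "QUERY")
    ast = {"table": expect("ident"), "where": [], "columns": []}
    while tok == ("kw", "WHERE"):
        expect("kw", "WHERE")
        field = expect("ident")
        expect("sym", "=")
        ast["where"].append((field, expect("str").strip('"')))
    expect("kw", "LIMIT")
    ast["limit"] = int(expect("num"))
    expect("kw", "RETURN")
    expect("sym", "[")
    ast["columns"].append(expect("ident"))
    while tok == ("sym", ","):
        expect("sym", ",")
        ast["columns"].append(expect("ident"))
    expect("sym", "]")
    expect("eof")
    return ast

def validate(ast: dict) -> dict:
    """Semantic validator: the syntax is fine, but do the names exist?"""
    fields = SCHEMA.get(ast["table"])
    if fields is None:
        raise DSLError("semantic", f"unknown table {ast['table']!r}")
    for name in [f for f, _ in ast["where"]] + ast["columns"]:
        if name not in fields:
            raise DSLError("semantic", f"{ast['table']} has no field {name!r}")
    return ast

def compile_dsl(text: str) -> dict:
    """Run all three stages; nothing executes unless every stage passes."""
    return validate(parse(lex(text)))
```

With this sketch, `DROP TABLE users;` is rejected by the lexer at the `;`, `DROP TABLE users` is rejected by the parser because the input does not start with QUERY, and `QUERY users LIMIT 10 RETURN [password]` is rejected by the semantic validator because `password` is not in the schema. Each error carries the name of the stage that raised it.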

Stage 3 — Diffable Execution

The validated instruction set is human-readable, structured, and reviewable. Before any state change executes, a team can inspect exactly what the agent is proposing.

  AGENT PROPOSED EXECUTION PLAN
  ─────────────────────────────
  QUERY orders
-   WHERE status = "completed"
+   WHERE status = "pending"
    LIMIT 500
    RETURN [order_id, customer_id, amount]

This is not just good engineering practice. It is what makes human-in-the-loop workflows operationally viable at scale. Without a DSL layer, human review of agent actions means reading raw code or natural language outputs — which does not scale and introduces its own interpretation errors. With a DSL layer, review means reading a structured, bounded instruction set where the semantic meaning is explicit.

You can see what the agent is about to do. You can diff it against what you expected. You can reject it before execution. This is what "auditability" actually means in practice.
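A minimal sketch of that review step, using Python's standard `difflib` to compare the plan a reviewer expected with the plan the agent proposed. The `approve` policy (auto-approve only on an exact match) is an assumption chosen for illustration:

```python
import difflib

def plan_diff(expected: str, proposed: str) -> str:
    """Unified diff between the expected plan and the agent's proposed plan."""
    return "\n".join(
        difflib.unified_diff(
            expected.splitlines(),
            proposed.splitlines(),
            fromfile="expected",
            tofile="proposed",
            lineterm="",
        )
    )

def approve(expected: str, proposed: str) -> bool:
    """Auto-approve only on an exact match; anything else goes to a human."""
    return plan_diff(expected, proposed) == ""

expected = (
    'QUERY orders\n'
    '  WHERE status = "completed"\n'
    '  LIMIT 500\n'
    '  RETURN [order_id, customer_id, amount]'
)
proposed = expected.replace('"completed"', '"pending"')

print(plan_diff(expected, proposed))
```

Because the DSL is line-oriented and bounded, an ordinary text diff is enough to show a reviewer the one clause that changed.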

This Is Not Hypothetical — It Is Already in Production

In late 2025, PayPal published research detailing exactly this pattern deployed at production scale. Their system implements a declarative DSL that separates agent workflow specification from implementation — enabling the same pipeline definition to execute across multiple backend languages (Java, Python, Go) and deployment environments.

The results on real e-commerce workflows processing millions of daily interactions:

60% reduction in development time compared to imperative implementations
3x improvement in deployment velocity
Complex workflows expressed in under 50 lines of DSL versus 500+ lines of imperative code
Sub-100ms orchestration overhead — the DSL layer added no meaningful latency

The finding that stands out most: the declarative approach enabled non-engineers to modify agent behaviors safely. The grammar constraint did not just make the system safer — it made the system accessible to a wider set of contributors, because the bounded grammar prevented them from making structurally dangerous changes by accident.

Business Implications

The technical architecture has direct business consequences. They compound at scale.

Auditability Becomes a Compliance Asset

In regulated industries — finance, healthcare, legal — every action an agent takes must be attributable, reviewable, and reversible. A DSL-based control plane produces a structured, human-readable record of every proposed state change before execution. That is not just good engineering. In many jurisdictions, it is the difference between a deployable system and an undeployable one.

GDPR's provisions on automated decision-making (often summarized as a "right to explanation"), HIPAA's audit control requirements, and SOC 2's access control criteria all push toward automated actions being attributable and reconstructable. An agent operating via direct execution makes these requirements hard to satisfy, because no structured record of intent exists before state changes. An agent operating via a DSL control plane produces that record as a by-product of its design.
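As a rough sketch of what a structured record of a proposed state change could look like, the function below emits one JSON line per proposed plan. The field names and the choice of a SHA-256 digest are assumptions for illustration and are not drawn from any regulation or product:

```python
import hashlib
import json
from datetime import datetime, timezone

def audit_record(agent_id: str, plan: str, decision: str) -> str:
    """Build one JSON audit line for a proposed plan, before anything executes."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "agent_id": agent_id,
        "plan": plan,
        # The digest lets a later reviewer confirm the logged plan was not altered.
        "plan_sha256": hashlib.sha256(plan.encode("utf-8")).hexdigest(),
        "decision": decision,  # e.g. "approved", "rejected", "pending_review"
    }
    return json.dumps(record, sort_keys=True)

proposed_plan = 'QUERY orders WHERE status = "pending" LIMIT 500 RETURN [order_id]'
print(audit_record("billing-agent-1", proposed_plan, "pending_review"))
```

Because the plan is short, bounded DSL text rather than arbitrary code, the record can store it verbatim and still be read by a non-engineer.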

Incident Cost Drops Dramatically

When an agent operating via direct execution corrupts state, the failure is discovered at runtime — after the fact, often without a clear trace of what instruction caused it. Recovery requires reconstructing intent from logs that may be incomplete.

When an agent operating via DSL produces invalid logic, the failure is caught at parse time — before execution, with a precise error at the grammar level. The blast radius is zero. No state was changed. The mean time to detection collapses from hours to milliseconds.

Reported production failures make this concrete. In one case, two agents trapped in a runaway interaction loop reportedly ran for 11 days before detection, generating a $47,000 API bill. In another, expense report agents fabricating plausible but false entries at Ramp reportedly generated over $1 million in fraudulent invoices in 90 days. These are not reasoning failures. They are execution containment failures. A DSL control plane with a bounded grammar addresses the first pattern at the validation stage: an agent cannot enter an infinite loop if the DSL grammar does not express unbounded iteration. The second pattern is syntactically valid output, so it falls to the semantic validator, which has to check each proposed entry against source records.
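A small sketch of a grammar that cannot express unbounded iteration. The `REPEAT n TIMES action` statement and the cap of 10 are hypothetical, invented for this example; the point is only that "forever" has no spelling in the language:

```python
import re

MAX_REPEAT = 10  # hypothetical hard cap enforced by the validator
REPEAT_RE = re.compile(r"REPEAT (\d+) TIMES (\w+)")

def expand_repeat(statement: str) -> list:
    """Expand a bounded REPEAT statement into an explicit, finite list of actions."""
    match = REPEAT_RE.fullmatch(statement.strip())
    if match is None:
        raise ValueError(f"not a valid REPEAT statement: {statement!r}")
    count, action = int(match.group(1)), match.group(2)
    if not 1 <= count <= MAX_REPEAT:
        raise ValueError(f"repeat count {count} is outside 1..{MAX_REPEAT}")
    return [action] * count

print(expand_repeat("REPEAT 3 TIMES send_reminder"))
```

Statements such as `REPEAT FOREVER send_reminder` or `REPEAT 1000000 TIMES send_reminder` raise before any action runs, so the worst-case cost of a plan is known at validation time.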

Human Oversight Becomes Operationally Viable

Diffable execution means a human reviewer can inspect exactly what the agent is about to do — in structured, readable form — before approving it. This makes human-in-the-loop architectures practical at scale.

This matters because the emerging regulatory consensus around autonomous AI systems is moving toward mandatory human oversight for high-stakes actions. Building that oversight capability into the architecture now, rather than retrofitting it later, is a significant operational advantage.

Vendor and Model Portability Increases

When your execution layer depends on the specific output format of a particular model, switching models breaks production. Your agent's behavior is coupled to the model's generation behavior — and that coupling is invisible until it breaks.

When your execution layer depends on a DSL grammar that the model compiles into, the model becomes interchangeable. The contract is with the grammar, not the model. You can swap Claude for GPT-4o, or fine-tune a smaller model on DSL generation, without touching your execution layer. The separation of concerns is structural.
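A sketch of that separation, with two stub functions standing in for different model backends. No real model API is called; `model_a`, `model_b`, and the one-line grammar regex are all assumptions for illustration:

```python
import re
from typing import Callable

# The contract: a one-line form of the article's illustrative QUERY DSL.
GRAMMAR = re.compile(r"QUERY \w+ LIMIT \d+ RETURN \[\w+(?:, \w+)*\]")

def model_a(intent: str) -> str:
    """Stub for one backend that emits conforming DSL."""
    return "QUERY users LIMIT 10 RETURN [id, email]"

def model_b(intent: str) -> str:
    """Stub for a second backend whose output format has drifted."""
    return "Sure! Here is the query: QUERY users LIMIT 10 RETURN [id, email]"

def compile_intent(intent: str, model: Callable[[str], str]) -> str:
    """Accept output from any backend, but only if it conforms to the grammar."""
    dsl = model(intent).strip()
    if GRAMMAR.fullmatch(dsl) is None:
        raise ValueError(f"output does not conform to grammar: {dsl!r}")
    return dsl

print(compile_intent("list ten users", model_a))
```

Swapping `model_a` for another backend changes nothing downstream as long as the output still matches the grammar. When it does not, as with `model_b`, the mismatch surfaces as an explicit validation error instead of as changed behavior in the execution layer.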

The Deeper Principle

The LLM-to-DSL pattern is an instance of a broader principle that keeps appearing across the history of computing: the most reliable systems are the ones that make unsafe states inexpressible, not the ones that catch unsafe states at runtime.

Type systems do this for data. Memory ownership models do this for allocation. Formal grammars do this for syntax. The LLM-to-DSL pattern does this for agentic execution.

General-purpose languages build the engines. DSLs constrain the agents.

The teams that will win in production agentic infrastructure are not the ones with the best models. They are the ones that figured out the boundary between AI reasoning and system execution — and made that boundary structurally enforced.

Have you hit execution safety failures in an agentic system you were building? I would like to know where the boundary broke down — and what architectural choices you made in response.
