TL;DR
Letting an LLM write runtime code feels powerful until you run it every weekday at 9:00. Small per-step errors compound across long workflows. We shifted code generation out of runtime and into the planning phase using a typed DSL, validated components, and MCP. The LLM plans and composes; a deterministic engine executes. Result: lower failure rates, bounded costs, and behavior that teams can trust.
Why runtime codegen breaks at scale
We started the common way: the model wrote code on the fly to call tools. After months of tuning, reality intruded with a 20%+ end-to-end failure rate, driven by:
- Missing awaits in generated async code
- Type mismatches between tool inputs and outputs
- Edge cases that crashed whole workflows
The math hurts: if each step is 95% accurate, a 10-step job succeeds only about 60% of the time (0.95^10 ≈ 0.599). That is not usable for customer-facing operations.
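The compounding is easy to verify yourself; the step accuracy and step count below are illustrative:

```python
# Per-step accuracy compounds multiplicatively across a workflow.
def end_to_end_success(step_accuracy: float, steps: int) -> float:
    return step_accuracy ** steps

print(end_to_end_success(0.95, 10))  # ~0.599: a 10-step job fails ~40% of the time
print(end_to_end_success(0.99, 10))  # ~0.904: even 99% per step loses ~10% of runs
```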
The shift: plan first, execute deterministically
We moved generation from runtime to the plan:
Natural language → DSL (pre-built components) → Validate → Execute
- The model no longer writes arbitrary code.
- It selects and arranges vetted components, like snapping Lego bricks together (see the sketch after this list).
- Validation happens before execution, so most errors surface early.
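Our actual DSL is richer than this, but a minimal sketch of the idea looks like the following; the component registry, component names, and type names are all hypothetical:

```python
from dataclasses import dataclass

# Hypothetical component registry: each vetted component declares its
# input and output types so plans can be checked before anything runs.
REGISTRY = {
    "load_csv":    {"in": "Path",      "out": "DataFrame"},
    "dedupe":      {"in": "DataFrame", "out": "DataFrame"},
    "send_emails": {"in": "DataFrame", "out": "Report"},
}

@dataclass
class Step:
    component: str

def validate(plan: list[Step]) -> None:
    """Reject plans with unknown components or type mismatches between steps."""
    prev_out = None
    for step in plan:
        spec = REGISTRY.get(step.component)
        if spec is None:
            raise ValueError(f"unknown component: {step.component}")
        if prev_out is not None and spec["in"] != prev_out:
            raise TypeError(f"{step.component} expects {spec['in']}, got {prev_out}")
        prev_out = spec["out"]

# The LLM emits the plan; validation runs before execution ever starts.
validate([Step("load_csv"), Step("dedupe"), Step("send_emails")])
```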
Separating control flow from data flow
- Control flow (planning): the LLM decides steps and order.
- Data flow (runtime): typed DataFrames process records in memory, outside the model context.
This gives us:
- Datasets that exceed context windows
- Summaries to the model rather than raw data (sketched after this list)
- Live previews that feel like spreadsheets
- Predictable compute and token costs
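Here is the "summaries, not raw data" point in practice: the model only ever sees a compact description of the frame. A sketch with pandas; the summary fields are just one reasonable choice:

```python
import pandas as pd

def summarize_for_model(df: pd.DataFrame, sample_rows: int = 3) -> dict:
    """Return a compact, token-cheap description of a frame instead of its rows."""
    return {
        "rows": len(df),
        "columns": {col: str(dtype) for col, dtype in df.dtypes.items()},
        "null_counts": df.isna().sum().to_dict(),
        "sample": df.head(sample_rows).to_dict(orient="records"),
    }

df = pd.DataFrame({"email": ["a@x.com", None], "amount": [12.5, 99.0]})
print(summarize_for_model(df))  # the model sees this dict, never the full dataset
```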
Why MCP helps
MCP (Model Context Protocol) standardizes how agents discover and call tools. In practice it gave us:
- Clear tool schemas and capabilities
- A consistent way to register and permission servers
- Fewer bespoke adapters, easier testing
We adopted MCP early at MaybeAI and found it made typing and contracts around tools much easier to enforce.
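For a flavor of what a typed tool looks like, here is a minimal sketch using FastMCP from the official Python SDK; the tool itself is hypothetical:

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("crm-tools")

@mcp.tool()
def find_contacts(domain: str, limit: int = 50) -> list[dict]:
    """Look up CRM contacts by email domain (hypothetical tool body)."""
    # Typed parameters become the tool's schema, so agents discover a
    # contract instead of guessing at an untyped endpoint.
    return [{"email": f"demo@{domain}"}][:limit]

if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```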
What production users actually need
Our users live in Excel: sales ops, marketing analysts, finance. They want the same workflow to run every Monday at 09:00, with the same guarantees. Wrong emails or polluted CRM data are not theoretical bugs. They are business incidents.
So we aim for:
- Natural-language planning with a familiar, “vibe coding” surface
- Deterministic execution under the hood
- Strong typing and schema validation at the edges
Architecture sketch
Intent capture
- Users describe the job in chat.
- We map it to goals, inputs, outputs, and constraints.
Planning (LLM + DSL)
- The LLM composes a plan using typed components.
- We validate types, schemas, and preconditions.
Execution (deterministic engine)
- DataFrames flow through steps with idempotent writes.
- Concurrency, retries, and backoff are policy-driven.
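The retry and backoff policy is ordinary engineering rather than anything model-specific. A minimal sketch, with the policy numbers and the TransientError taxonomy assumed:

```python
import random
import time

class TransientError(Exception):
    """Assumed error class for retryable failures (rate limits, timeouts)."""

def run_with_policy(step, *, max_attempts: int = 4, base_delay: float = 1.0):
    """Run one plan step under a budgeted retry policy with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except TransientError:
            if attempt == max_attempts:
                raise  # budget exhausted; surface to routing, not silent retry
            # Exponential backoff with jitter keeps retries from stampeding.
            time.sleep(base_delay * 2 ** (attempt - 1) + random.random())
```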
Observation
- Trace IDs for every run
- Structured errors with semantics (sketched below)
- Replay and diff
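A sketch of what the trace IDs and structured errors above can look like; the error taxonomy here is illustrative:

```python
from dataclasses import dataclass
from enum import Enum
import uuid

class ErrorKind(Enum):
    VALIDATION = "validation"  # caught in planning, before execution
    TRANSIENT = "transient"    # retryable: rate limits, timeouts
    PERMANENT = "permanent"    # needs a human or a new plan

@dataclass
class RunError:
    trace_id: str  # one trace ID per run, attached to every error
    step: str
    kind: ErrorKind
    detail: str

# Errors carry enough structure to aggregate, route, and replay.
err = RunError(trace_id=str(uuid.uuid4()), step="send_emails",
               kind=ErrorKind.TRANSIENT, detail="CRM rate limit hit")
```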
A short checklist for production agents
- Front-load validation in planning
- Typed inputs/outputs and schema checks at every boundary
- Pre-built, battle-tested components as the default building blocks
- Explicit error semantics and routing rather than a generic “failed” bucket
- Idempotency keys for writes (see the sketch after this list)
- Budgeted retries with exponential backoff
- Audit, replay, and runbooks for human fallback
- Clear SLAs and alerts (for example, P95 latency and error thresholds)
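For the idempotency item above, one common pattern (the key scheme here is just an example) is deriving a stable key from run, step, and payload, so a retried write is recognized as a duplicate by the receiving system:

```python
import hashlib
import json

def idempotency_key(run_id: str, step: str, payload: dict) -> str:
    """Derive a stable key so a retried write de-duplicates instead of doubling."""
    body = json.dumps(payload, sort_keys=True)  # canonical form: same payload, same key
    return hashlib.sha256(f"{run_id}:{step}:{body}".encode()).hexdigest()

# Same inputs always produce the same key, across retries and replays.
print(idempotency_key("run-2024-06-03", "send_emails", {"to": "a@x.com"}))
```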
Anti-patterns we learned to avoid
- Treating long-running business workflows as open-ended research loops
- Shipping untyped tool calls from runtime codegen
- Hiding data inside the model context instead of using structured frames
- Opaque error messages that cannot be aggregated or routed
What improved after the shift
- Stability: most errors are caught in validation, not while writing to external systems.
- Cost control: fewer unnecessary tokens and retries.
- Scale: plans reference components; components can be optimized once and reused.
- User trust: previews look and behave like spreadsheets they already understand.
Open questions for the community
- How are you composing tools at scale with MCP?
- Where do you enforce typing: at the protocol boundary, inside components, or both?
- Have you found a reliable pattern for guardrails that still keeps planning flexible?
We are still iterating, and MCP is evolving fast. If you are running agents in production, share your patterns and pitfalls. We would love to compare notes.