TL;DR
Letting an LLM write runtime code feels powerful until you run it every weekday at 9:00. Small per-step errors compound across long workflows. We shifted code generation out of runtime and into the planning phase using a typed DSL, validated components, and MCP. The LLM plans and composes; a deterministic engine executes. Result: lower failure rates, bounded costs, and behavior that teams can trust.
Why runtime codegen breaks at scale
We started the common way: the model wrote code on the fly to call tools. After months of tuning, reality intruded with a 20%+ end-to-end failure rate, driven by:
- Missing awaits in generated async code
- Type mismatches between tool inputs and outputs
- Edge cases that crashed whole workflows
The math hurts: if each step is 95% accurate, a 10-step job succeeds only about 60% of the time (0.95^10 ≈ 0.599). That is not usable for customer-facing operations.
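The compounding is easy to verify yourself; the step accuracy and step count below are illustrative:

```python
# Per-step accuracy compounds multiplicatively across a workflow.
def end_to_end_success(step_accuracy: float, steps: int) -> float:
    return step_accuracy ** steps

print(end_to_end_success(0.95, 10))  # ~0.599: a 10-step job fails ~40% of the time
print(end_to_end_success(0.99, 10))  # ~0.904: even 99% per step loses ~10% of runs
```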
The shift: plan first, execute deterministically
We moved generation from runtime to the plan:
Natural language → DSL (pre-built components) → Validate → Execute
- The model no longer writes arbitrary code.
- It selects and arranges vetted components, like snapping Lego bricks together (see the sketch after this list).
- Validation happens before execution, so most errors surface early.
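Our actual DSL is richer than this, but a minimal sketch of the idea looks like the following; the component registry, component names, and type names are all hypothetical:

```python
from dataclasses import dataclass

# Hypothetical component registry: each vetted component declares its
# input and output types so plans can be checked before anything runs.
REGISTRY = {
    "load_csv":    {"in": "Path",      "out": "DataFrame"},
    "dedupe":      {"in": "DataFrame", "out": "DataFrame"},
    "send_emails": {"in": "DataFrame", "out": "Report"},
}

@dataclass
class Step:
    component: str

def validate(plan: list[Step]) -> None:
    """Reject plans with unknown components or type mismatches between steps."""
    prev_out = None
    for step in plan:
        spec = REGISTRY.get(step.component)
        if spec is None:
            raise ValueError(f"unknown component: {step.component}")
        if prev_out is not None and spec["in"] != prev_out:
            raise TypeError(f"{step.component} expects {spec['in']}, got {prev_out}")
        prev_out = spec["out"]

# The LLM emits the plan; validation runs before execution ever starts.
validate([Step("load_csv"), Step("dedupe"), Step("send_emails")])
```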
Separating control flow from data flow
- Control flow (planning): the LLM decides steps and order.
- Data flow (runtime): typed DataFrames process records in memory, outside the model context.
This gives us:
- Datasets that exceed context windows
- Summaries to the model rather than raw data (sketched after this list)
- Live previews that feel like spreadsheets
- Predictable compute and token costs
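Here is the "summaries, not raw data" point in practice: the model only ever sees a compact description of the frame. A sketch with pandas; the summary fields are just one reasonable choice:

```python
import pandas as pd

def summarize_for_model(df: pd.DataFrame, sample_rows: int = 3) -> dict:
    """Return a compact, token-cheap description of a frame instead of its rows."""
    return {
        "rows": len(df),
        "columns": {col: str(dtype) for col, dtype in df.dtypes.items()},
        "null_counts": df.isna().sum().to_dict(),
        "sample": df.head(sample_rows).to_dict(orient="records"),
    }

df = pd.DataFrame({"email": ["a@x.com", None], "amount": [12.5, 99.0]})
print(summarize_for_model(df))  # the model sees this dict, never the full dataset
```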
Why MCP helps
MCP (Model Context Protocol) standardizes how agents discover and call tools. In practice it gave us:
- Clear tool schemas and capabilities
- A consistent way to register and permission servers
- Fewer bespoke adapters, easier testing
We adopted MCP early at MaybeAI and found it made typing and contracts around tools much easier to enforce.
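For a flavor of what a typed tool looks like, here is a minimal sketch using FastMCP from the official Python SDK; the tool itself is hypothetical:

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("crm-tools")

@mcp.tool()
def find_contacts(domain: str, limit: int = 50) -> list[dict]:
    """Look up CRM contacts by email domain (hypothetical tool body)."""
    # Typed parameters become the tool's schema, so agents discover a
    # contract instead of guessing at an untyped endpoint.
    return [{"email": f"demo@{domain}"}][:limit]

if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```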
What production users actually need
Our users live in Excel: sales ops, marketing analysts, finance. They want the same workflow to run every Monday at 09:00, with the same guarantees. Wrong emails or polluted CRM data are not theoretical bugs. They are business incidents.
So we aim for:
- Natural-language planning with a familiar, “vibe coding” surface
- Deterministic execution under the hood
- Strong typing and schema validation at the edges
Architecture sketch
Intent capture
- Users describe the job in chat.
- We map it to goals, inputs, outputs, and constraints.
Planning (LLM + DSL)
- The LLM composes a plan using typed components.
- We validate types, schemas, and preconditions.
Execution (deterministic engine)
- DataFrames flow through steps with idempotent writes.
- Concurrency, retries, and backoff are policy-driven.
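The retry and backoff policy is ordinary engineering rather than anything model-specific. A minimal sketch, with the policy numbers and the TransientError taxonomy assumed:

```python
import random
import time

class TransientError(Exception):
    """Assumed error class for retryable failures (rate limits, timeouts)."""

def run_with_policy(step, *, max_attempts: int = 4, base_delay: float = 1.0):
    """Run one plan step under a budgeted retry policy with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except TransientError:
            if attempt == max_attempts:
                raise  # budget exhausted; surface to routing, not silent retry
            # Exponential backoff with jitter keeps retries from stampeding.
            time.sleep(base_delay * 2 ** (attempt - 1) + random.random())
```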
Observation
- Trace IDs for every run
- Structured errors with semantics (sketched below)
- Replay and diff
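A sketch of what the trace IDs and structured errors above can look like; the error taxonomy here is illustrative:

```python
from dataclasses import dataclass
from enum import Enum
import uuid

class ErrorKind(Enum):
    VALIDATION = "validation"  # caught in planning, before execution
    TRANSIENT = "transient"    # retryable: rate limits, timeouts
    PERMANENT = "permanent"    # needs a human or a new plan

@dataclass
class RunError:
    trace_id: str  # one trace ID per run, attached to every error
    step: str
    kind: ErrorKind
    detail: str

# Errors carry enough structure to aggregate, route, and replay.
err = RunError(trace_id=str(uuid.uuid4()), step="send_emails",
               kind=ErrorKind.TRANSIENT, detail="CRM rate limit hit")
```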
A short checklist for production agents
- Front-load validation in planning
- Typed inputs/outputs and schema checks at every boundary
- Pre-built, battle-tested components as the default building blocks
- Explicit error semantics and routing rather than a generic “failed” bucket
- Idempotency keys for writes (see the sketch after this list)
- Budgeted retries with exponential backoff
- Audit, replay, and runbooks for human fallback
- Clear SLAs and alerts (for example, P95 latency and error thresholds)
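For the idempotency item above, one common pattern (the key scheme here is just an example) is deriving a stable key from run, step, and payload, so a retried write is recognized as a duplicate by the receiving system:

```python
import hashlib
import json

def idempotency_key(run_id: str, step: str, payload: dict) -> str:
    """Derive a stable key so a retried write de-duplicates instead of doubling."""
    body = json.dumps(payload, sort_keys=True)  # canonical form: same payload, same key
    return hashlib.sha256(f"{run_id}:{step}:{body}".encode()).hexdigest()

# Same inputs always produce the same key, across retries and replays.
print(idempotency_key("run-2024-06-03", "send_emails", {"to": "a@x.com"}))
```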
Anti-patterns we learned to avoid
- Treating long-running business workflows as open-ended research loops
- Shipping untyped tool calls from runtime codegen
- Hiding data inside the model context instead of using structured frames
- Opaque error messages that cannot be aggregated or routed
What improved after the shift
- Stability: most errors are caught in validation, not while writing to external systems.
- Cost control: fewer unnecessary tokens and retries.
- Scale: plans reference components; components can be optimized once and reused.
- User trust: previews look and behave like spreadsheets they already understand.
Open questions for the community
- How are you composing tools at scale with MCP?
- Where do you enforce typing: at the protocol boundary, inside components, or both?
- Have you found a reliable pattern for guardrails that still keeps planning flexible?
We are still iterating, and MCP is evolving fast. If you are running agents in production, share your patterns and pitfalls. We would love to compare notes.