DEV Community: Karl Mehta

The Missing Engineering Stack for Production AI Agents

Karl Mehta — Sun, 17 May 2026 15:29:24 +0000

The "build an agent in 5 minutes" tutorials get you to a demo. They don't get you to production. Here's the field guide for the four primitives that decide whether your agent survives contact with real users, real data, and real adversaries — context-window discipline, skill composition, capability-based security, and drift telemetry. Concrete patterns, named tradeoffs, and the enterprise integrations that let you ship past prototype.

This is part 1 of a 3-post series. Part 2 — Why current IDEs need to be redesigned for the agent era — covers the developer-tooling argument. Part 3 introduces what I'm shipping next.

1. Tokens — context-window discipline
A token is the unit of inference cost, the unit of latency, and the unit of model attention. Treat it like memory in a 1990s embedded system: budget every byte, evict aggressively, and never assume the next call gets the same allocation.

Prompt caching is a 90% cost cut you'd be insane to ignore
Anthropic's cache_control: { type: 'ephemeral' } marker (5-minute TTL by default, 1-hour via the extended-TTL beta) deduplicates the static prefix of your prompts at the inference layer. Cached tokens are billed at 10% of input cost; cache writes cost 25% more on the first call. The math: any system prompt + tool catalog + few-shot exemplar bank that's reused more than ~3 times per 5 minutes is a net cost win. Order matters — the cache is a prefix, not a content-addressable store, so the cached span has to be byte-identical and at the start.

messages: [
{ role: "user", content: [
{ type: "text", text: STATIC_TOOL_CATALOG, cache_control: { type: "ephemeral" } },
{ type: "text", text: STATIC_SKILLS_BUNDLE, cache_control: { type: "ephemeral" } },
{ type: "text", text: dynamicUserTurn },
]}
]
Two cache breakpoints because cache reads accumulate up to the most recent cache_control marker — splitting tool catalog from skill bundle lets either evolve without busting the other. OpenAI's automatic prefix caching (no opt-in, but no extended TTL) and Gemini's explicit CachedContent resources are the equivalents on the other major providers.

Model routing — pay Haiku rates for Opus-class outcomes
A single agent run rarely needs the same model for every step. The cost spread is enormous: Claude Haiku 4.5 is $1/$5 per million in/out, Sonnet 4.6 is $3/$15, Opus 4.7 is $15/$75. The pattern that's worked for me is a three-tier router:

Retrieval / classification / extraction → Haiku. Use structured outputs (forced JSON via tool_use with strict mode) so the model can't waste tokens on freeform.
Synthesis / reasoning over retrieved context → Sonnet. The default mid-tier; this is where 80% of business logic lives.
Tool selection / planning / disambiguation → Opus only when the planner has to coordinate >5 tool calls or weigh ambiguous user intent.

Switching costs ~50ms of router latency. The cost amortization is typically 4–8× on production workloads. The trap: don't route based on input length alone — route based on the step type. A 50-token "is this a refund request?" classifier on Haiku is 60× cheaper than the same call on Opus.

Streaming, KV reuse, and the structured-output dodge
Streaming via SSE (Anthropic, OpenAI) or gRPC bidirectional (Vertex) is non-negotiable for latency. The first token typically lands at 200–600 ms; the full response at 2–8 seconds. If your UX waits for the full response, you've added 4 seconds of perceived latency for zero product reason.

KV cache reuse across calls is the under-discussed companion to prompt caching. Modern Anthropic and OpenAI back-ends keep the attention key-value cache warm across the cache TTL. Order tool calls so the most-frequently-called tools come first in your tool list, because tool definitions are part of the prefix that gets cached.

The structured-output dodge: when you need a list, a classification, or a structured fact, don't ask the model in freeform — define a tool, force it via tool_choice, and receive a typed JSON object. You skip 50–80% of the freeform tokens the model would otherwise generate, and the output is parser-safe by construction. Pair with strict mode (OpenAI) or JSON Schema with $defs (Anthropic) to refuse off-schema outputs at the decoder.

2. Skills — composition, not prompts
A "skill" is the unit of behavior an agent can perform. Most production agents conflate three different things into a megaprompt: identity (who are you), capabilities (what can you do), and policies (what you must / must not do). That conflation makes prompts impossible to evolve safely. Separate them into composable fragments, then assemble at runtime.

The model I've shipped against — and what I think every production agent eventually converges on — is the trigger / action / restriction triple per skill:

{
"id": "refund-policy-2024",
"trigger": "the user asks for a refund",
"action": "verify the order is within the 30-day window, then issue a refund via tools.stripe.refund and post-confirm via tools.email.send",
"restriction": "never issue refunds > $500 without a human-approval gate; never refund subscription items in their first cycle"
}

Domain experts (PMs, ops, legal) author triples in plain English. The runtime composes them into a system-prompt slot. Versioning per skill — not per agent. Eval suites attach to the skill, so swapping out a refund policy in 2026 doesn't require reblessing the entire agent.

Tool use, MCP, and the transport question
Tools are the IO of an agent. The schema is the contract. Two opinions worth holding:

Strict JSON schemas with additionalProperties: false. Closed-world schemas catch hallucinated arguments at the validator instead of in production. Strict mode (OpenAI) and the Anthropic tool_choice + JSON-Schema combo both enforce this.
Tools should be small and idempotent. orders.refund(orderId, amountCents), not orders.handle(intent, payload). The agent's planner is dramatically more reliable when each tool does one thing with a typed input.

Once you have more than ~5 tools, the catalog itself becomes worth standardizing. Model Context Protocol (MCP) — Anthropic's open-source agent ↔ tool spec — is the answer that's consolidating the ecosystem. Three transports, three different tradeoffs:

stdio — local-process tools. Lowest latency, zero network surface. Use this for code execution, filesystem ops, anything sensitive.
SSE (deprecated in favor of StreamableHTTP) — long-poll over HTTP. Browser-friendly, easy to host. Latency ~50ms.
StreamableHTTP — single-endpoint HTTP with optional SSE for streaming responses. The current recommendation for hosted MCP servers. Compatible with most cloud LB stacks.

The plan-execute-review loop
For agents with >3 sequential tool calls, prompt the model to plan first (one message, no tool calls), execute against that plan (n messages, tool calls only), then review the result against the plan's stated success criteria (one message, no tool calls). Anthropic's Agent SDK ships this pattern via the plan_mode primitive; it's also straightforward to implement in raw fetch with three system-prompt slots.

The bonus: when the agent fails, the failure is grounded in a textual plan you can replay, eval, and red-team — instead of an opaque chain of tool calls.

3. Security — capability-based, not vibe-based
The threat surface of an agent is wider than people pretend. A short list:

Prompt injection — adversarial input in retrieved context, tool outputs, or user data flips the agent's instructions.
Data exfiltration — the agent calls a tool that emits sensitive data to an attacker-controlled destination (an email, a webhook, a markdown image with a query string).
Tool abuse / RCE — the agent uses a legitimate tool in a way the designer didn't intend (a shell tool, a code-exec tool).
Supply chain — a tool dependency or model weight is compromised.
Secret leakage — API keys end up in logs, prompts, or tool error messages.

Capability-based authority, not ambient authority
The security primitive that's stood up best in 50 years of OS research is the object capability: hand a process the smallest unforgeable token that lets it do exactly the thing it needs, and nothing else. Apply this to agents.

Concretely: don't give the agent a long-lived OPENAI_API_KEY with billing access. Give it a per-session token, scoped to specific endpoints, with a TTL. Every tool gets a separate principal. Authorize via OAuth 2.1 with PKCE — the agent walks the user through delegated authorization, the user sees the exact scopes, and tokens are stored in the OS keychain (libsecret on Linux, Keychain on macOS, DPAPI on Windows; Electron's safeStorage wraps the platform primitive for cross-OS).

Sandbox the tools, not just the agent
If a tool runs untrusted code or writes to a filesystem, isolate it. Three real options ranked by overhead:

WASM (Wasmtime, Wasmer) — sub-millisecond startup, deny-by-default I/O, easy to configure capability lists. The right choice for code-exec and policy-evaluation tools.
gVisor — userspace kernel; near-full Linux compatibility with a 10–100ms startup cost. Right for tool subprocesses that need the full POSIX surface.
Firecracker — microVM; ~125ms startup, hardware-backed isolation. Right for multi-tenant agent execution in shared infra.

ko/distroless container images, SLSA Level 3 build attestation, and sigstore-signed artifacts close the supply-chain surface. If your agent runs in a long-lived process, write the SBOM to the artifact registry and gate deploys on cosign verification.

Prompt injection defense
The most under-addressed threat. The mitigations that actually work:

Channel separation. Treat tool outputs and retrieved documents as data, not as instructions. Anthropic's recent research on instruction-data separation in the system prompt is the current best practice — wrap untrusted content in clearly labeled XML-ish tags and tell the model to ignore any instructions inside them.
Allowlist tool surfaces. The agent can call send_email only to addresses on a per-conversation allowlist that the user explicitly authorized. The same pattern applies to outbound HTTP, database writes, file outputs.
Output content classifiers. Run a small model over the agent's tool calls before they execute, looking for known exfil patterns (suspicious destinations, base64-encoded blobs, sensitive-field references).
HITL gates on consequential actions. Anything that costs money, sends external communication, modifies a database, or touches PII goes through a human approval before execution. The threshold is per-skill.

4. Trust — telemetry, not vibes
"It worked when I tested it" is not a trust story. The four signals you actually need on every agent in production:

Eval pass rate against a golden set
A regression suite of input/output pairs the agent must continue to pass. Run on every prompt change, every model upgrade, every tool catalog edit. Tag failures by skill so you can localize regressions. Pairwise LMSYS-style judging works for tone-sensitive outputs; exact-match works for structured outputs. Don't conflate them.

Drift detection
Even with a stable model, your agent's behavior drifts when the input distribution shifts — new product launches, seasonal traffic, adversarial probing. Track distribution shift on input embeddings (cosine distance from a reference centroid) and behavioral metrics (tool-call mix, refund rate, escalation rate). Alarm at 2σ; investigate at 1σ.

Behavioral canaries
Plant N synthetic inputs per day designed to exercise the prompt-injection, exfil, and jailbreak surfaces. Pass rate on canaries is your live red-team signal. When a new attack class appears in the wild, add it to the canary set; you'll know the next time someone tries it.

Audit trail with integrity
Every run captured as JSONL — input, system prompt, tool calls, model responses, costs, latencies. Hash chain over the events; periodically anchor the head into an immutable store (S3 Object Lock, GCS Bucket Lock). When auditors ask "what did the agent do on March 12 at 14:22 UTC", you have a Merkle-verifiable answer.

A composite TrustScore rolls these up: weighted blend of eval pass rate, drift score, canary survival, HITL approval rate. Per agent, per skill, per day. The score is operationally meaningful only if it's grounded in those underlying signals — a score with no traceable inputs is theater.

The compliance + enterprise integrations
For anything regulated — health, finance, government, EU operations — the trust telemetry has to map onto external frameworks. The integrations I've found genuinely useful:

TrustModel.ai for the GRC overlay — NIST AI RMF, ISO 42001, EU AI Act Article-by-Article mapping, SOC 2, FedRAMP. The TrustScore feeds directly into the control library and produces auditor-ready reports without re-instrumenting the agent.
Cisco DefenseClaw — Apache 2.0, free, OSS. Jeetu Patel announced it from the RSAC 2026 keynote stage on March 23, 2026; it's the most consequential agent-security release of the year. Four components ship in the box: Skills Scanner (capability scan before execution), MCP Scanner (allow/block on MCP server inspection), CodeGuard (static analysis for secrets, unsafe deserialization, weak crypto, and injection patterns), and a Guardrail Proxy (runtime inspection of prompts, completions, and tool calls via regex rules + optional LLM judgment). Stack is a Go gateway sidecar + Python CLI + a TypeScript plugin for the OpenClaw framework that DefenseClaw was built to protect. The framework is observable by default, with first-class Splunk connectivity for the audit-trail story above. It bridges the trust gap that has 85% of enterprises experimenting with agents but only 5% running them in production. Personal note: Jeetu Patel is one of my role models, and I started coding the integration into the IDE I'm shipping the moment he walked off the RSAC stage. The most quoted line from the announcement — "I run OpenClaw at home — that's exactly why we built DefenseClaw" — is the right framing. There's no good reason not to wrap DefenseClaw around every production agent.
OpenTelemetry GenAI — the emerging standard for agent telemetry semconv. Emit the standard span attributes (gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens) and your traces work in any OTel-compatible backend.

The bar
A production agent is not a model and a prompt. It's a token economy, a skill catalog with versioning, a capability-scoped security model, and a trust telemetry stack. Each of those is a non-trivial engineering surface in its own right; together, they're more work than the "build an agent in 5 minutes" tutorials acknowledge.

The argument I'll make in part 2 is that the IDEs we have weren't built to help engineers hit this bar. They were built for the 2010 unit of work — one developer, one project, one file at a time — and the unit of work in 2026 is an agent that gets trained, guard-railed, and overseen by a domain expert who isn't the engineer. The tooling has to follow.

The Commoditization of LLM Models

Karl Mehta — Tue, 05 May 2026 00:55:29 +0000

I’m becoming more convinced that LLMs are moving toward the same structure as payment networks. The models will be incredibly important. But the largest value will not be captured by the raw model layer alone. It will be captured by the layers above it: routing, evals, RAG, MCP, memory, orchestration, agentic workflows, vertical applications, and trust infrastructure.

As a founder and developer, this pattern feels familiar to me.I previously built a fintech company that routed transactions across multiple rails and 100+ payment methods around the world. It was eventually acquired by Visa. In payments, Visa, Mastercard, and AmEx were critical rails. But Stripe, PayPal, Adyen, PlaySpan (acquired by VISA) and others created enormous value by abstracting those rails, optimizing routing, managing risk, improving developer experience, and owning the merchant workflow. I think the same thing is happening with LLMs.

At the bottom, we will likely have a small number of frontier model providers: OpenAI, Anthropic, Google, and a strong open-weight ecosystem. They will remain valuable. They will set the capability frontier. But for most production apps, the model will increasingly become a pluggable inference rail. The value moves up the stack.

Layer one: model gateways and routing.

OpenRouter, LiteLLM, Bedrock, Together, Fireworks, Groq, and internal enterprise gateways are making model access interchangeable. A developer can route a request to GPT, Claude, Gemini, Llama, Mistral, DeepSeek, Qwen, or a fine-tuned model depending on cost, latency, context length, modality, privacy, or benchmark performance. This is where the “LLM as rail” abstraction begins.

Layer two: RAG and context engineering.

The hard problem in enterprise AI is not generating fluent text. It is assembling the right context at the right time. A useful AI system needs to know the patient record, contract clause, support ticket, lab result, CRM object, claim history, policy document, API schema, prior memory, and user permission boundary. RAG is evolving from “vector search over PDFs” into a full context layer: hybrid search, graph retrieval, tool retrieval, memory retrieval, structured database queries, re-ranking, summarization, and dynamic context packing. The LLM is only as good as the context substrate around it.

Layer three: MCP and tool connectivity.

MCP makes the harness layer much stronger because it standardizes how agents discover and call tools. Instead of every app building custom glue code for Gmail, Slack, GitHub, Postgres, EHRs, CRMs, calendars, and internal APIs, MCP gives agents a more consistent interface to external systems. This is a big deal.

Once tools become discoverable and composable, the agent is no longer just a chat interface. It becomes a workflow runtime that can read, reason, act, verify, and update state across systems.

Layer four: agentic orchestration.

This is where frameworks like LangGraph, LlamaIndex, LangChain, CrewAI, AutoGen, Semantic Kernel, and custom orchestration layers matter. The future agentic app will not call one model once.

It will use one model for planning, another for coding, another for extraction, another for medical reasoning, another for summarization, and another for cheap classification. It will make these decisions in real time based on task type, latency, cost, reliability, and safety constraints. One task may go to Claude for long-context reasoning. Another may go to Gemini for multimodal input. Another may go to GPT for tool use. Another may go to a local or open-weight model for cheap classification. Another may run through multiple models in parallel for consensus, critique, or ensemble evaluation.

This is exactly how payment orchestration worked. You didn’t hard-code one rail. You routed dynamically based on geography, fees, approval rates, fraud risk, currency, merchant category, and availability.

Layer five: evals, trust, and governance.

This is where I think platforms like TrustModel.ai become important. If the application can route across multiple LLMs, the system also needs a way to continuously evaluate which model is right for which task. Not just “which model is smartest,” but which one is safest, cheapest, fastest, most compliant, most consistent, most robust against prompt injection, best at structured output, best at domain reasoning, and least likely to hallucinate.

A serious agentic system needs multi-dimensional evals across models and workflows. It needs to test safety, quality, bias, factuality, privacy leakage, tool-use reliability, refusal behavior, cost, latency, and auditability. That eval layer becomes the control plane for selecting models and keeping applications safe across changing model providers. This is not optional in healthcare, finance, legal, or enterprise AI.

Layer six: vertical workflow applications.

This is where the most durable value gets created. A healthcare agent that closes care gaps is not valuable because it uses one specific LLM. It is valuable because it understands clinical workflows, patient context, lab data, insurance constraints, escalation paths, HIPAA boundaries, and provider operations. A revenue cycle agent is valuable because it knows claims, denials, CPT codes, payer policies, appeal letters, and EHR workflows.

A legal agent is valuable because it knows contract structures, risk positions, fallback clauses, negotiation playbooks, and approval workflows. The model is necessary. But the system, data, workflow, distribution, trust, and feedback loop create the moat. This is why I do not think “which model wins?” is the most interesting question. The better question is: who owns the orchestration layer between the model and the workflow?

My bet is that most serious applications and agents will be multi-model by default. That is already how I’m building. I’m working on agents that use five different LLMs in parallel, each selected for the task where it performs best: reasoning, extraction, summarization, coding, evaluation, or low-cost classification. The system should optimize in real time, just like a payment router optimizes transaction success, cost, and risk across multiple rails.

LLMs are becoming intelligence rails. The value will accrue to the builders who turn those rails into reliable systems.