Dr Hernani Costa

Posted on Mar 27 • Originally published at radar.firstaimovers.com

Agentic AI Isn't a Script: Why Production Fails

#ai #architecture #automation #production

Most companies don't fail with agentic AI because models are weak. They fail because they treat production systems like notebooks—and that's a $500k mistake waiting to happen.

A prompt gets longer. A chain gets messier. A few tools get bolted on. Then someone calls it an "agent." What they actually shipped is hidden state, unclear control flow, weak auditability, and no reliable way to replay failures.

That is not an AI strategy. That is probabilistic scripting.

If you are serious about production, the transition from LLM spaghetti to agentic systems is not a prompt rewrite. It is a shift into distributed systems engineering. The winning question is no longer, "Which framework looks clever in a demo?" It is, "How do we operationalize this without breaking the business?"

That is where the real work starts.

The architectural shift that matters

The old LangChain notebook pattern was useful for exploration. It was never the right operating model for high-stakes systems. In production, a solid AI Architecture requires explicit state, deterministic routing where possible, durable execution, and clear pause-and-resume semantics for human review.

That is exactly why LangGraph has become the default serious choice for many Python teams: it is built around stateful graphs, persistence, interrupts, and replay, not just prompt orchestration. LangChain's own stack now makes that relationship explicit, and LangGraph documents persistence and time-travel-style recovery as core behaviors, not afterthoughts. read

CrewAI has moved in a similar direction. Its current documentation is clear: if you want a production-ready application, start with a Flow. Flows manage state, persist execution, and resume long-running workflows. Crews then become a unit of work inside that structure, not the structure itself. That is an important distinction, because too many teams still confuse "multiple agents talking" with architecture. read

The migration path: from notebooks to a real runtime

The first step is not choosing a cloud provider. It is choosing discipline.

Start by extracting every LLM call, retriever call, tool invocation, and policy decision into discrete Python units. Make those units idempotent where you can. Then define a typed state model that becomes the source of truth for a single request or workflow thread. That is the point where your system stops behaving like a pile of prompts and starts behaving like software.

The next move is even more important: push decision logic out of the model whenever you can. If routing can be handled by a rule, validator, regex, threshold, or policy matrix, do it in code. Save the model for ambiguity, synthesis, and language reasoning. Every decision you remove from the prompt is one less production incident you will have to explain later.

Then use a strangler pattern. Wrap the new graph behind an API boundary, run it in shadow mode, and compare it with the legacy path using a shared trace ID. Do not only compare final outputs. Compare retrieval quality, tool selection, branch decisions, latency, and cost by node. If you cannot compare trajectories, you are not really validating the migration.

Human Review in Agentic AI Is a Runtime Primitive

One of the most dangerous mistakes in agentic systems is treating human approval like a front-end workflow.

It is not.

It is a runtime control point.

LangGraph's interrupt model pauses execution, persists graph state, and waits until the process is resumed. LangChain's human-in-the-loop middleware applies the same idea to tool calls that may require review, such as file writes, SQL execution, or other high-stakes operations. This is the right model for compliance-heavy or business-critical environments because it turns oversight into enforced control flow instead of soft guidance in a prompt. read

CrewAI is moving in the same direction with HITL triggers, flow persistence, and enterprise management features. That makes it more credible than many people assume, especially for organizations that prefer an opinionated orchestration layer and a more packaged operational experience. read

The key operational rule is simple: never block a business-critical approval step on an open synchronous request. Persist state, emit the review event, notify the approver, and resume when the decision arrives. If your system times out while waiting for a human, the problem is not your prompt. The problem is your architecture.

Observability is the product

If a customer says, "The AI made the wrong call," your team needs to answer four questions immediately:

What state was the system in?
What tool calls were made?
What branch was taken and why?
Can we replay it?

If the answer to the last question is no, you do not have production observability. You have logging.

LangGraph's persistence model is built for replay and resumption. LangSmith now supports managed cloud, hybrid, and self-hosted deployment options, which matters for teams balancing convenience with control and compliance. read

For teams that want a broader model gateway and stronger operational controls across providers, LiteLLM has become increasingly practical. Its proxy layer supports routing, fallback, spend tracking, rate limits, session budgets, guardrails, and even MCP permission control through a fixed gateway endpoint. That makes it useful during migration, especially when old and new systems must coexist. read

And testing can no longer be "looks good to me." Promptfoo's current documentation is explicit about CI/CD integration and red teaming. That is the right mindset. Agentic systems need regression testing not only for quality, but for security, tool abuse, and prompt injection. read

Infrastructure: do not jump to Kubernetes because it feels serious

This is where many teams overbuild.

If you are calling hosted models from OpenAI, Anthropic, or Azure OpenAI, you usually do not need Kubernetes on day one. Managed container platforms are often the smarter move because they let you focus on state, runtime behavior, and governance instead of cluster operations.

Azure Container Apps is especially relevant here. Microsoft positions it as a serverless container platform, and the current docs show support for private endpoints plus jobs for finite-duration background tasks that share the same environment, networking, and logging as your apps. That is a clean fit for agentic workloads with asynchronous steps. read

AWS Fargate offers a similar value proposition on the AWS side: run containers without managing servers or clusters, while each task gets its own network interface inside your VPC. That is enough for many production deployments, especially when the main complexity is orchestration and governance rather than custom inference. read

Use Kubernetes when you truly need it: self-hosted models, GPU scheduling, specialized inference stacks, or platform-level controls that justify the operational tax. Not because it sounds enterprise.

The data layer should be boring until the problem proves otherwise

A lot of AI architecture discussions still jump straight to vector databases as if retrieval begins and ends there.

That is outdated.

For many production systems, Postgres plus pgvector is the best default because it lets you store vectors alongside transactional and relational data with ACID semantics, joins, and point-in-time recovery. That lowers operational sprawl and gives lean teams a simpler foundation. read

Use Redis for hot-path caching and short-lived coordination, not as your only durable system of record.

Add object storage for raw files, prompts, attachments, and archived traces.

And only bring in a graph database when the domain truly depends on relationships, not just similarity. Neo4j now maintains an official GraphRAG Python package, which is a strong signal that graph-based retrieval is becoming a serious production pattern, especially where relationships matter as much as text. read

In other words: do not build a five-database AI platform because a conference talk made it sound sophisticated.

Start with the simplest architecture that preserves state, traceability, and retrieval quality. Expand only when the workload justifies it.

The real CTO decision

The hard truth is this: most agentic AI programs do not need more model experimentation.

They need a stronger operating model.

That means:

explicit state instead of hidden prompt logic
durable execution instead of best-effort retries
interrupts and resume semantics instead of manual workarounds
trajectory-level evaluation instead of eyeballing answers
controlled rollout instead of big-bang rewrites
simple infrastructure until real constraints force more complexity

If you get those choices right, the framework debate becomes much easier. LangGraph is strong because it treats agent execution like a system, not a script. CrewAI is improving because it is moving toward the same production reality through Flows. LiteLLM matters because model access, budgets, and routing need governance. Managed container platforms matter because most teams should spend their effort on runtime reliability, not cluster babysitting.

That is the playbook in 2026.

Not more prompts.

Better systems.

Your move

If your team is still running critical AI workflows out of notebooks, scattered chains, or undocumented tool loops, you do not have an agentic platform yet.

You have migration debt.

The companies that win this next phase will not be the ones with the flashiest demo. They will be the ones that can explain, replay, govern, and safely evolve what their AI systems actually do.

That is where AI Strategy Consulting stops being theory and starts becoming leverage.

DEV Community