I lead engineering, but I have not built an agentic AI system before. Right now I am learning this space, and this post is mainly a record of how I am thinking about it: what can go wrong, what architecture I want to try, and what I expect before even running a POC.
Why this post exists
I have built and delivered many software systems over the years. But "agentic AI in production" feels different to me. I have done POCs and some experimental coding, but not multi-agent systems with strong operational constraints.
When I started reading about agentic AI, most of the content was not very helpful at this stage:
- Vendor demos show what is possible, but not what works in real production.
- Success stories are written after everything worked, so the path looks obvious in hindsight.
When you are starting out, those decisions are not obvious. So before writing production code, I am doing what I usually do: building a framework to evaluate architectures against possible failure scenarios, then documenting it so I can validate it later. This post is that framework.
The number that started my thinking
An MIT report found that around 95% of enterprise GenAI pilots create no measurable business impact, and BCG attributes roughly 70% of failures to operational problems, not model quality.
The question is not "can we build an agent"; that is already proven. The real question is: what do the successful 5% do differently?
Most likely, the answer lies in operational aspects that demos never show: cost, latency, observability, and fallback behavior.
Four failure modes most tutorials skip
These are hypotheses, not proven results.
1. One model handling everything. A customer support POC likely sees roughly 50% simple FAQs, 35% queries that need tool calls, and 15% complex multi-step requests. Routing everything through a single large model means paying top-tier cost and latency even for queries that do not need it. (A routing sketch follows this list.)
2. No memory of repeats. Customer support is full of repeated questions with small variations, yet a basic agent calls the model every single time. Tutorials rarely discuss caching, or treat it as a "later optimization", by which point the cost is already too high. (See the semantic cache sketch below.)
3. Invisible behavior. The agent tells a customer their refund is processed, but it is not. Later this becomes an escalation, and no one can explain why: the logs show API success, not the agent's reasoning, tool calls, or parameters. (See the tracing sketch below.)
4. No graceful degradation. During peak traffic, latency spikes. A good system should reduce response complexity and answer faster instead of waiting for a perfect answer. (See the timeout-fallback sketch below.)
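
To make failure mode 1 concrete, here is a minimal routing sketch. The model names and the keyword heuristic are placeholders I invented for illustration; a real router would likely use a trained classifier or a cheap model as the judge.

```python
# Minimal tiered-routing sketch (hypothetical model names and heuristics).
# Idea: only the complex tail of traffic should hit the large, expensive model.

SIMPLE_MODEL = "small-model"    # placeholder: cheap, fast model
COMPLEX_MODEL = "large-model"   # placeholder: expensive, capable model

TOOL_HINTS = ("refund", "order status", "cancel", "invoice")

def pick_model(query: str) -> str:
    q = query.lower()
    if any(hint in q for hint in TOOL_HINTS):
        return COMPLEX_MODEL    # likely needs tool calls or multi-step work
    if len(q.split()) < 20:
        return SIMPLE_MODEL     # short, FAQ-style question
    return COMPLEX_MODEL        # default to the capable model when unsure

print(pick_model("How do I reset my password?"))    # -> small-model
print(pick_model("Cancel my order and refund it"))  # -> large-model
```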
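For failure mode 2, a semantic cache sits in front of the model: embed each incoming query, and if a past query is close enough, return the stored answer without a model call. In this sketch `embed()` is a stub standing in for whatever embedding model you use, and the 0.92 threshold is an arbitrary illustration, not a recommendation.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: call a real embedding model here (this stub is NOT semantic)."""
    vec = np.zeros(256)
    for i, ch in enumerate(text.lower()):
        vec[i % 256] += ord(ch)
    return vec / (np.linalg.norm(vec) or 1.0)  # unit-normalize

class SemanticCache:
    def __init__(self, threshold: float = 0.92):
        self.threshold = threshold
        self.entries: list[tuple[np.ndarray, str]] = []  # (embedding, answer)

    def lookup(self, query: str) -> str | None:
        q = embed(query)
        for vec, answer in self.entries:
            if float(np.dot(q, vec)) >= self.threshold:  # cosine sim (unit vectors)
                return answer
        return None  # cache miss: call the model, then store() the result

    def store(self, query: str, answer: str) -> None:
        self.entries.append((embed(query), answer))
```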
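For failure mode 3, the fix is tracing the agent's actual decisions rather than just HTTP status codes. Here is a sketch using the OpenTelemetry API (it needs the `opentelemetry-api` package and is a no-op until an SDK exporter is configured); `process_refund` is a hypothetical tool, not a real API.

```python
from opentelemetry import trace

tracer = trace.get_tracer("support-agent")

def process_refund(order_id: str) -> dict:
    # Record the tool call itself, not just the surrounding API request,
    # so a later escalation can be traced back to what the agent actually did.
    with tracer.start_as_current_span("tool.process_refund") as span:
        span.set_attribute("tool.name", "process_refund")
        span.set_attribute("tool.args.order_id", order_id)
        result = {"status": "pending"}  # placeholder for the real tool call
        span.set_attribute("tool.result.status", result["status"])
        return result
```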
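And for failure mode 4, one simple pattern is a latency budget with a fallback path: run the full agent pipeline under a timeout, and if it blows the budget, answer with a single cheap call instead. Both pipeline functions below are stubs standing in for real implementations.

```python
import concurrent.futures
import time

_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def full_pipeline(query: str) -> str:
    time.sleep(5)  # stand-in for a slow multi-step agent run
    return f"[full agent answer to: {query}]"

def fast_answer(query: str) -> str:
    return f"[single cheap-model answer to: {query}]"  # stand-in fallback

def answer_with_budget(query: str, budget_s: float = 2.0) -> str:
    future = _pool.submit(full_pipeline, query)
    try:
        return future.result(timeout=budget_s)
    except concurrent.futures.TimeoutError:
        future.cancel()  # best effort; a running call may still finish in background
        return fast_answer(query)

print(answer_with_budget("Where is my order?"))  # falls back after ~2 seconds
```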
What comes next
I mapped each failure mode to an architectural decision (model routing, semantic caching, OpenTelemetry-based observability, and latency-aware routing) and built a benchmark harness that compares a naive baseline against the optimized system.
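
As a rough illustration of the shape of that harness (the query loader, the two `answer` functions, and the stats reported are placeholders, not the actual setup):

```python
import statistics
import time

def run_benchmark(queries, answer_fn):
    """Time each query against one system; cost tracking omitted for brevity."""
    latencies = []
    for q in queries:
        start = time.perf_counter()
        answer_fn(q)  # either the naive baseline or the optimized system
        latencies.append(time.perf_counter() - start)
    return {
        "n": len(latencies),
        "p50_s": statistics.median(latencies),
        "mean_s": statistics.mean(latencies),
    }

# queries = load_queries()  # e.g. the customer support query set from the post
# print(run_benchmark(queries, naive_agent))
# print(run_benchmark(queries, optimized_agent))
```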
The full architecture, the stack rationale (NeMo Agent Toolkit + Azure AI Foundry + NVIDIA NIM), and the benchmark setup with 81 customer support queries are in the full post:
👉 Read the full framework, architecture, and benchmark setup →
Benchmark numbers will come in a follow-up. If you're also evaluating agentic AI for the first time, I'd value pushback on whether these are the right failure modes to design around.