DEV Community: Hemang Joshi

The Enterprise AI Buyer's Checklist: 12 Questions to Ask Before Hiring an AI Consultancy

Hemang Joshi — Fri, 17 Apr 2026 06:37:39 +0000

After auditing dozens of failed AI consulting engagements, I've noticed buyers keep asking the wrong questions. "Do you have AI expertise?" "Can you build an LLM app?" "What's your day rate?" Every consultancy on Earth answers yes, yes, and a reasonable number. Six months later: stalled PoC, burned budget, eroded trust.

Here's the checklist I wish every CTO, VP Engineering, and procurement leader had before signing a statement of work. Four groups: Delivery Proof, Technical Depth, Engineering Practices, Business Fit.

Group 1 - Delivery Proof

1. "Show me three production systems you've shipped in the last 18 months." Green flag: specific deployments, named models/frameworks, user load, what broke in first 30 days, anonymised case studies. Red flag: demo videos, hackathon wins, "NDA" across all three.

2. "What's your PoC-to-production conversion rate?" Green flag: a specific number with context. Red flag: "every PoC goes to production" — either untrue or the PoCs proved nothing.

3. "Walk me through a project that went wrong and how you handled it." Green flag: genuine, uncomfortable story with root cause and process fix. Red flag: "nothing's ever gone wrong" — walk away.

Group 2 - Technical Depth

4. "Which agent framework — CrewAI, LangGraph, AutoGen — would you pick for my use case, and why?" Green flag: they ask about latency, failure tolerance, observability, team maintenance capacity before recommending. See our agentic AI practice. Red flag: "we always use X — it's the best."

5. "How do you evaluate LLM output quality over time?" Green flag: Ragas, DeepEval, Promptfoo, versioned test sets, CI regression, quality dashboards. Red flag: "we spot-check outputs."

6. "Describe your RAG — chunking, embedding, reranking strategy?" Green flag: semantic chunking, hybrid BM25+vector, query expansion, Cohere Rerank or cross-encoder. Red flag: "we just use [vector DB] with OpenAI embeddings."

Group 3 - Engineering Practices

7. "How do you handle prompt versioning and rollback?" Green flag: prompt registry (PromptLayer, LangSmith, Langfuse), tied to release versions, rollback under 5 min. Red flag: "we update in code and redeploy."

8. "Production observability for AI systems?" Green flag: distributed tracing (LangSmith, Langfuse, Arize, Helicone), cost/quality dashboards, hallucination-rate alerts, provider-outage playbook. Red flag: "we log to CloudWatch."

9. "Guardrails and safety rails?" Green flag: layered — input validation, NeMo Guardrails or Guardrails AI, PII redaction, jailbreak detection, audit logs. Red flag: "the model won't say bad things — we tested it."

Group 4 - Business Fit

10. "Who specifically will be on my engagement — and are they senior?" Green flag: named individuals with LinkedIn profiles, GitHub history, written commitment that the lead stays. Red flag: "we'll assign at kickoff."

11. "Communication cadence and escalation process?" Green flag: weekly written updates, shared Slack, named escalation contact, 24-hour SLA, biweekly demos. Red flag: "monthly status report."

12. "If we wanted to take the system fully in-house in six months, how would you enable that?" Green flag: concrete KT plan — documentation standards, pair programming, runbooks, ADRs, formal handover milestone. Red flag: "Why would you want to do that?" — you're being sold a subscription, not a system.

Using the Checklist

You don't need all 12 answers to be perfect. You need all 12 to be specific, grounded, and intellectually honest. A consultancy that responds with concrete examples, named tools, and genuine trade-offs is a partner. One that responds with generalities or "we customise our approach" is a future post-mortem line item.

Print this. Take it into your next vendor meeting. Watch the room.

And if you'd like to see how we'd answer all 12 — with specifics, not slides — book 30 minutes at cal.com/hemangjoshi37a. Bring the hardest question on your list.

For a related read before any vendor meeting, here are the 4 mistakes that kill most enterprise AI projects.

CrewAI vs LangGraph vs AutoGen: Which Framework for Production AI Agents?

Hemang Joshi — Fri, 17 Apr 2026 06:22:29 +0000

📍 Industrial-automation focused companion piece: For the manufacturing-floor angle on these three frameworks (PLC/SCADA, ERP/MES, operator assistants + a 38% wire-scrap reduction case study from our AutoCut V2 wire-cutter deployment in Gandhinagar), see the 2026 LangGraph vs AutoGen vs CrewAI for industrial automation guide on hjLabs.in.

We've shipped production agents on all three frameworks in the last 18 months. Here's the honest comparison most tutorials won't give you.

Every other week a new agent framework trends on GitHub, and every other week a tech lead asks us the same question: "Which one should we actually build on?" The short answer is: it depends on what you're building, who's building it, and how mature your ops practice is. The long answer is this article.

This is not a benchmark post. Benchmarks on toy tasks tell you almost nothing about how a framework behaves when a retrieval call times out at 2 a.m., a tool returns malformed JSON, or the product team asks for a human approval step on step 7 of a 12-step workflow. What follows is a field report from real deployments, including the parts that hurt.

The elevator pitches

CrewAI models agents as roles on a crew. You define agents (researcher, writer, critic), give them goals and backstories, and compose them into sequential or hierarchical "crews" that complete tasks. The mental model is a small team of specialists delivering a deliverable.

LangGraph models agents as a state graph. You define nodes (functions that mutate state), edges (transitions, conditional or static), and a reducer for state updates. The mental model is a finite-state machine with LLM-powered nodes.

AutoGen (we'll focus on v0.4+, which was a near-total rewrite from v0.2) models agents as asynchronous actors that exchange messages. Conversations between agents drive the work forward. The mental model is a group chat where each participant has different skills and tools.

All three support tool use, multi-LLM routing, and memory in some form. Where they diverge is in control flow, observability, and what "production-ready" means.

CrewAI: opinionated, clean, linear-friendly

CrewAI's strength is that it gets out of your way on the 70% of use cases that are essentially a pipeline of specialist steps. Research, then summarize, then critique, then format. You don't fight the framework to express that. The Crew, Agent, and Task abstractions read well, onboarding a new engineer takes an afternoon, and the built-in hierarchical process (where a manager agent delegates to workers) is genuinely useful for research-style workloads.

Where it creaks:

Branching and loops: any workflow with "if condition X, loop back to step 2" ends up with you writing meta-orchestration around CrewAI rather than inside it. The framework is not built around arbitrary graph traversal.
State management: context gets passed implicitly between tasks. For anything beyond a handful of steps, you will want structured state, and that is awkward in the crew abstraction.
Error recovery: the default behavior on a tool failure or malformed LLM output is to surface the exception. Wrapping retries, fallbacks, and partial-progress recovery is DIY.
Observability: there is built-in logging and a paid CrewAI Plus tier for traces, but for serious production debugging you end up plugging in Langfuse, Arize, or your own OpenTelemetry layer.

Ideal use cases: content generation pipelines, research briefs, document summarization workflows, anything that reads as "first do A, then B, then C, and the shape doesn't change per run."

Avoid when: you need cyclic control flow, long-running jobs with resumability, or fine-grained checkpointing per step.

LangGraph: verbose, powerful, production-shaped

LangGraph is what we reach for when the workflow has loops, conditional branches, or needs to survive a process restart. The graph-first model is close to how production distributed systems are actually designed: explicit states, explicit transitions, explicit failure modes. We ship most of our industrial agentic AI deployments on LangGraph for this exact reason.

The headline features that matter in production:

Checkpointing: state is persisted at each node boundary (SQLite, Postgres, Redis). If the process dies, you resume from the last checkpoint. This alone makes it the only serious choice for anything running longer than a few minutes.
Human-in-the-loop: the interrupt primitive lets you pause a graph, surface state to a human, and resume after an approval or correction. We use this heavily for agents that write to production systems.
Streaming: per-node streaming of tokens, state updates, and tool calls. Makes it realistic to build a responsive UI on top of a multi-step agent.
Deterministic reducers: state updates are explicit and testable. You can write unit tests against individual nodes with mocked LLMs, which is borderline impossible in the free-form chat frameworks.
LangSmith integration: native tracing. If you're already in the LangChain ecosystem, observability is essentially free.

The costs are real:

Learning curve: engineers new to the framework need a week or two to internalize graphs, reducers, and channels. If your team doesn't have someone with state-machine instincts, you'll write bad graphs that look like pipelines.
Boilerplate: a trivial workflow takes more code than its CrewAI equivalent. The payoff shows up at complexity.
LangChain dependency surface: you inherit a large package graph and its version churn. Pinning and reproducibility matter.

Ideal use cases: any agent that must be resumable, auditable, or human-reviewed; multi-step reasoning with backtracking; long-running research agents; workflows with SLAs.

Avoid when: the workflow is genuinely linear and the team is small. You're paying for infrastructure you won't use.

AutoGen v0.4+: conversational, research-flavored, improving fast

AutoGen was the framework that made multi-agent conversation mainstream, and v0.4 is a mature rewrite with a clean actor model, async messaging, and a proper runtime. The AgentChat high-level API is ergonomic, and Core gives you the low-level actor primitives when you need them.

What it's good at:

Open-ended collaboration: "researcher and critic loop until the critic is satisfied" is natural in AutoGen, awkward in CrewAI, verbose in LangGraph.
Group chat patterns: SelectorGroupChat, RoundRobinGroupChat, and SwarmGroupChat give you prebuilt multi-agent coordination policies that are genuinely useful for exploratory work.
Code execution: the code executor agents with sandboxed Docker or local execution are still the cleanest implementation in the ecosystem.
Microsoft backing: v0.4 is maintained by a dedicated team, and the roadmap is public.

What hurts in production:

Deterministic flows: AutoGen is optimized for open-ended conversation, not fixed pipelines. Forcing deterministic behavior often means constraining the group chat manager with custom selectors, at which point you're rebuilding what LangGraph gives you for free.
Cost control: free-form agent loops tend to drift. Without aggressive termination conditions, token spend on a single task can surprise you. Budget your max-turns.
Testability: because flow is emergent from messages, unit tests are harder. You end up writing integration tests against recorded transcripts.
Breaking changes: v0.2 to v0.4 was a migration, not an upgrade. Plan your version commitments accordingly.

Ideal use cases: research agents, red-team/blue-team critique loops, code-generation tasks with test-execute-fix cycles, exploratory data analysis agents.

Avoid when: you need predictable latency, predictable cost, or predictable output shape on a per-request basis.

A production decision matrix

When we're helping a team choose, we score against six criteria that actually matter once a system has users:

Criterion	CrewAI	LangGraph	AutoGen
Observability out of the box	Moderate	Strong (LangSmith)	Moderate
Cost control (token budgeting, turn limits)	Good	Strong	Needs work
Error recovery and retry	DIY	First-class checkpoints	DIY
Human-in-the-loop	DIY	First-class (`interrupt`)	Possible via custom agents
Streaming	Basic	Per-node, granular	Message-level
Testability	Good for linear tasks	Strong (pure node tests)	Weak (needs transcripts)

Our production rule of thumb

Start with LangGraph unless one of the following is true:

The workflow is genuinely linear and will stay linear. Use CrewAI; you'll ship faster.
The core value is open-ended agent collaboration or iterative code generation with execution. Use AutoGen.
Your team has no state-machine experience and can't afford the ramp. Use CrewAI, with a plan to migrate the parts that grow complex.

We also mix frameworks in the same system. A common pattern: LangGraph as the top-level orchestrator with explicit state and human approval gates, calling into a CrewAI sub-pipeline for a well-defined content generation step. This is fine. Don't let framework loyalty drive architecture.

Integration tips worth knowing

LLM providers: all three support OpenAI, Anthropic, Azure, and local models via LiteLLM or similar. LangGraph gives you the cleanest per-node model routing (use Haiku for classification, Sonnet for reasoning). CrewAI supports per-agent model config. AutoGen supports per-agent clients via ModelClient.

Tool use: LangGraph's ToolNode plus structured output validation with Pydantic is the most robust combo we've shipped. CrewAI's @tool decorator is ergonomic but you own retry logic. AutoGen's function-calling agents are solid; just cap turn counts.

Memory: none of the three give you production-grade memory for free. Bring your own: Redis for short-term, a vector store (pgvector, Weaviate, or Qdrant) for long-term semantic memory, and an explicit summarization step for conversation compression. The framework should be the orchestrator, not the memory substrate.

RAG: keep retrieval outside the agent loop when you can. A common anti-pattern is giving an agent a search_docs tool and letting it decide when to call it; agents over-call or under-call. A deterministic retrieval step at graph entry, with results injected into state, usually outperforms and costs less.

Closing

Framework choice is less important than most teams treat it. The teams that ship reliable agents are the ones that invested in observability, evaluation harnesses, prompt versioning, and human-review workflows — regardless of what they built on. Any of CrewAI, LangGraph, or AutoGen can be driven to production quality. What varies is how much you fight the framework along the way.

If you're picking a framework right now, the honest answer usually depends on team composition and ops maturity more than on feature sets. We help teams make that call every week — you can see how we structure agentic AI engagements here, or book 30 minutes and I'll sanity-check your choice against your actual workload at cal.com/hemangjoshi37a.

On the operational side, I've written a 2-week playbook for deploying autonomous agents in production and a breakdown of the 4 mistakes that kill most enterprise AI projects — both worth a read before you commit to a framework.

No framework is a silver bullet. But the wrong one, picked for the wrong reasons, costs six months.

How CibrAI Automated 80% of Their Security Analyst Workflow With Agentic AI

Hemang Joshi — Fri, 17 Apr 2026 06:18:04 +0000

A case study in building pragmatic agent systems for cybersecurity operations — written for CISOs and security leaders who are tired of throwing more headcount at alert fatigue.

When CibrAI's security team was drowning in alerts, they didn't need more analysts — they needed AI that could triage like one.

That single reframing became the starting point for one of the most rewarding engineering engagements we've had in the last eighteen months. This is the story of how a focused, two-phase agentic AI build freed the CibrAI security team from tier-1 triage drudgery, sharpened their mean time to resolution on real incidents, and gave their senior analysts their weekends back.

It's also a story about restraint. We didn't build a chatbot. We didn't replace anyone. We built an agent that behaves like a junior analyst who never sleeps and never panics — and we put the senior humans firmly in charge of anything that matters.

The Problem: A Wall of Alerts, a Finite Team

CibrAI is a cybersecurity firm with a lean, high-signal detection team. Like most modern security organisations, their stack produces a large and relentless stream of telemetry: EDR alerts, identity anomalies, cloud posture findings, network IDS hits, phishing reports, and vulnerability notifications. On a typical weekday their analysts were processing thousands of incoming events.

The problem wasn't detection quality. Their tooling was good. The problem was human attention.

Before the engagement, their workflow looked roughly like this:

Every alert landed in a shared queue, regardless of severity or context.
Analysts opened each ticket cold, without historical context pre-attached.
Roughly 70-80% of alerts turned out to be benign once enriched — known-good processes, expected admin behaviour, stale IOCs, or low-confidence correlations.
Senior analysts were routinely pulled into low-severity triage instead of doing the threat hunting and response work they were hired for.

The cost wasn't just time. It was judgement quality. When a team spends seven hours out of eight clicking through benign alerts, the one real incident in that shift gets the tired brain.

The CISO, Andy Curtis, put it plainly in our first call: the team didn't need a bigger headcount. It needed the first pass to stop being a human problem.

The Approach: An Agent, Not an Automation

There's a temptation in security automation to reach for rigid playbooks — SOAR-style flowcharts that execute the same sequence every time. They work until they don't. Real alert contexts are messy and the interesting decisions are almost always the ones the playbook author didn't anticipate.

We took a different path: an agentic AI system that reasons about each alert, decides what information it needs, fetches that information using tools, and escalates when it reaches the edge of its confidence.

The design principles we agreed on up front:

The agent never takes destructive action. It can read, query, correlate, and recommend — but anything that touches production (isolating a host, disabling an account, blocking a hash) stays with a human.
Every decision must be auditable. Each triage decision produces a structured rationale with the tool calls, evidence, and confidence score that led to it.
The agent must know when it doesn't know. Escalation to a human analyst is a first-class outcome, not a failure mode.
The senior analysts define truth. The ground-truth labels that trained the retrieval layer and shaped the prompts came from their historical triage decisions, not from a generic threat taxonomy.

Technical Architecture

The system is built around a central agent loop with tool use, grounded by a retrieval-augmented generation (RAG) layer over the customer's own security knowledge.

The agent loop. A reasoning LLM sits at the centre, receiving each new alert as a structured input. It plans — what do I need to know to classify this? — and then executes tool calls to gather evidence. After each tool call it re-evaluates, either asking for more information or committing to a classification and recommended action.

Tool surface. We gave the agent a carefully scoped set of tools:

SIEM query tools for pulling related events in a time window around the alert
Asset context lookups (is this host a domain controller? a developer laptop? a crown-jewel server?)
Identity context lookups (is this user a privileged admin? on PTO? recently onboarded?)
Threat intelligence enrichment for IPs, domains, and file hashes
Historical case lookup — "have we seen a similar alert pattern before, and how did we triage it?"
A human-escalation tool that opens a high-priority ticket with the agent's full reasoning trace attached

The RAG layer. This is the piece that turned a generic capable model into a CibrAI-specific analyst. We indexed their runbooks, historical closed tickets, internal threat intel notes, and their written detection logic. The agent retrieves from this corpus before committing to a classification, so its judgement reflects how this team thinks about alerts, not how the public internet does.

Observability. Every run produces a structured trace: inputs, retrieved documents, tool calls, intermediate reasoning, final verdict, confidence. These traces feed a weekly review where senior analysts flag disagreements — and those flags become the next round of RAG corpus improvements. The feedback loop is the product.

Results

We rolled out in two phases. Phase one ran in shadow mode for six weeks — the agent triaged every alert alongside the humans, but its decisions didn't route tickets. We compared its verdicts against analyst decisions, tuned the retrieval corpus, tightened the prompts, and added tools where the agent was consistently asking for information it couldn't reach.

Phase two put the agent in the live path for tier-1 triage, with mandatory human escalation above a configurable confidence threshold.

The headline numbers after the first full quarter in production:

~80% of tier-1 triage is now handled by the agent end-to-end. These are the clear benign-by-context alerts and the clear low-severity confirmations that previously ate senior analyst hours.
Mean time to resolution on genuine incidents dropped materially. Because the agent pre-enriches every escalated ticket with its full reasoning trace, analysts start investigations with context already attached instead of building it from scratch.
Senior analysts shifted time into proactive threat hunting. The hours freed didn't disappear into the backlog — they went into higher-leverage work that the team was previously postponing indefinitely.
Analyst-reported alert fatigue dropped noticeably. Subjective, but important. The team is doing more of the work they were hired to do.

As Andy Curtis, CISO at CibrAI, put it in his reference for the engagement:

"Fantastic AI engineer with pragmatic business and technical skills."

That word — pragmatic — is the one we're proudest of. The engagement didn't sell a vision. It shipped a working system and then kept tuning it.

Four Lessons for Other Security Teams

If you're considering a similar build, these are the lessons that would have saved us weeks:

1. Start in shadow mode, and budget real time for it. Six weeks of shadow evaluation sounds long until you see the categories of edge case that only surface in production traffic. Don't cut this phase short.

2. Your runbooks are your moat. The difference between a generic agent and one that behaves like a member of your team is almost entirely in the retrieval corpus. Invest heavily in curating historical tickets, runbooks, and team conventions before you tune a single prompt.

3. Confidence thresholds are a product decision, not a model decision. Where you set the escalation cutoff determines the split between analyst time saved and analyst trust earned. Start conservative, measure, and loosen deliberately.

4. The agent's reasoning trace is as valuable as its verdict. Analysts adopted the system faster once they realised the escalated tickets arrived with a structured explanation they could audit in seconds. Make the trace a first-class output, not a debug log.

Closing

If your security team is stuck in manual triage, an agentic AI approach can transform the workflow without replacing people. Done well, it doesn't introduce a black box into your SOC — it introduces a tireless junior analyst whose reasoning you can inspect, correct, and improve every week.

You can read more engagements like this one on our case studies page, or book a 30-minute consultation at cal.com/hemangjoshi37a to talk through what an agentic triage layer could look like on top of your existing stack.

No pitch deck. Just a conversation about whether the approach fits your team.

Planning a similar rollout? See our 2-week production deployment playbook and how to choose between CrewAI, LangGraph, and AutoGen.

Deploying Autonomous AI Agents in Production: A 2-Week Playbook

Hemang Joshi — Fri, 17 Apr 2026 06:10:47 +0000

Most teams quote 3-6 months to deploy production AI agents. We've done it in 2 weeks. Repeatedly. Not demos. Not hackathon toys. Agents that handle real tickets, call real tools, pass real evals, and don't set the observability dashboard on fire at 2 AM.

The difference isn't some secret framework. It's ruthless scoping, opinionated tooling choices, and a hardening phase that most teams skip because they're still arguing about whether to use LangGraph or AutoGen on day 40.

Here is the exact playbook we run for enterprise AI teams. Fourteen days, start to production. Steal it.

Days 1-2: Scoping and Success Metrics

If you skip this step, the rest of the playbook collapses. Ninety percent of "failed" agent projects I've inherited failed here, not in the code.

The first question is not "which framework" or "which model." It is: what does a working agent actually output, measured how, at what cost, at what latency?

On day one we lock down four numbers:

Task success rate target. For a tier-1 support triage agent, we aim for 85% deflection on classified intents with <2% escalation-to-human-too-late. For a sales research agent, 90% correct CRM field population on a 200-row golden set.
P95 latency budget. Streaming-first agents usually land at 4-8s time-to-first-token and 20-40s time-to-final. Non-streaming batch agents get 60-120s. Anything over that and users will abandon or your queue depth explodes.
Unit economics ceiling. Cost per successful task, not cost per token. A $0.40/task agent that replaces a $12 human action is a business. A $0.08/task agent that hallucinates 15% of the time is a liability.
Blast radius. What's the worst thing this agent can do to a production system if it misfires? If the answer is "wire money" or "drop a database," we architect a human-in-the-loop gate from hour one, not hour 200.

We run a one-hour scoping workshop with the product owner and two senior engineers from the customer side. The output is a one-page spec with those four numbers, three to five representative tasks, and a list of tools the agent must call. That's it. If the customer can't agree on the four numbers, we don't start coding.

Days 3-5: Framework and Tool Selection

By now half the reader is shouting "just tell me which framework." Fine.

CrewAI wins when the workflow is naturally role-based and relatively linear: a researcher agent, a writer agent, an editor agent, pass the baton. It's fast to stand up, the abstractions match how non-engineers think about work, and it plays well with both OpenAI and Anthropic tool-calling. Weakness: anything with complex conditional branching or long-running state becomes awkward.

LangGraph wins the moment the agent needs real state, cycles, interrupts, or human-in-the-loop checkpoints. It is our default for anything touching financial workflows, medical triage, or multi-step enterprise processes with approval gates. The graph model maps cleanly to real business processes. The learning curve is real, though; budget a day for your team to internalize the state schema discipline.

AutoGen wins for research-flavored problems where multiple agents genuinely debate, critique, and iterate. It is overkill for anything deterministic, and we usually don't ship it to production untouched. We port the design to LangGraph once the conversation topology stabilizes.

Raw tool-calling loops with no framework win more often than anyone admits. If your agent is a single role with 4-8 tools and a 20-step max horizon, a 200-line custom loop with structured outputs will beat any framework on latency, debuggability, and on-call sanity.

Tool definitions matter more than framework choice. We standardize on JSON Schema with strict mode enabled (OpenAI) or XML-tagged tool-use blocks (Anthropic). Every tool gets a deterministic name, a one-sentence description optimized for the LLM, typed parameters with enums wherever possible, and a structured error return so the agent can self-correct.

More on our approach: hjlabs.in/AIML/services/agentic-ai/

Days 6-8: Build the PoC Agent

Three days. Not three weeks. If the PoC takes longer, the scope is wrong.

The system prompt is written last, not first. We start with tool definitions and a golden task, let the model try to solve it with zero instructions, and watch where it fails. The system prompt patches those specific failures. Short, behavioral, present tense, under 800 tokens. Long system prompts are a smell.

Memory: working memory is the conversation plus a scratchpad. Episodic memory goes in Postgres with a simple schema. Semantic memory, if needed at all, goes in the RAG layer.

Error handling: every tool call wraps in a retry with exponential backoff on transient errors, a structured error return on permanent ones, and a hard circuit breaker on the total step budget. An agent that has taken 30 tool calls to solve a task is almost certainly in a loop. Kill it, log the transcript, fall back to human handoff.

We enable prompt caching from day one. On Anthropic, cache the system prompt and tool definitions; on OpenAI, rely on automatic prefix caching and order your prompt carefully to maximize hits. 40-70% cost savings from two lines of code.

Days 9-10: RAG Knowledge Base

Most agents that "hallucinate" are actually starving for context. A good RAG layer fixes 80% of perceived agent quality issues.

Our defaults:

Chunking. Semantic chunking on paragraph and section boundaries, 400-800 tokens per chunk with 15% overlap. Tables and code get their own chunk type.
Embedding model. text-embedding-3-large for English-dominant corpora, voyage-3 for technical documentation, cohere-embed-multilingual-v3 for multilingual. Benchmark on the customer's own content.
Vector store. Qdrant for self-hosted production (binary quantization: 32x memory compression, <1% recall loss). Weaviate for hybrid search. Pinecone for zero-ops.
Reranker. Non-negotiable. Cohere Rerank 3 or fine-tuned BGE reranker on top-50 candidates, returning top-5 to the LLM. 15-25 points of answer quality for 80ms of latency.
Query rewriting. A small model rewrites the user query into 2-3 retrieval queries before search.

Deeper writeup: hjlabs.in/AIML/services/rag-systems/

Days 11-12: Evaluation Harness

No evals, no production. Full stop.

The golden dataset is built by hand, by a domain expert, in a spreadsheet. 80-150 tasks, each with the input, the expected tool-call trajectory, and the expected final output shape.

Three layers of evals:

Deterministic checks. Did the agent call the required tools? Did the output parse as valid JSON? Did it stay under the step budget?
LLM-as-judge. A stronger model grades task success against a versioned rubric. Pairwise comparison beats absolute scoring for subjective tasks.
Human spot checks. 10-20 tasks per release, reviewed by the domain expert.

The eval harness runs in CI on every prompt or tool change. A regression in task success rate blocks the merge.

Langfuse or Arize as the trace and eval backend. Langfuse is faster to self-host; Arize has the edge on enterprise features.

Days 13-14: Production Hardening

The last two days are where most teams run out of budget and ship a prototype to prod. Do not be most teams.

Observability: structured trace per agent run with inputs, every tool call, every LLM call with token counts, and the final output.
Rate limiting: per-user, per-tenant, per-tool. A runaway agent that hammers an expensive tool 200 times is a six-figure incident.
Fallbacks: tiered model strategy. Primary model with streaming, secondary model on timeout, static fallback on total failure.
Cost monitoring: dashboards per agent per customer, alert thresholds at 1.5x and 3x daily budget.
Guardrails: input and output. Input guardrails block prompt injection. Output guardrails validate structure, redact PII, check policy violations.

The Real Lesson

Two weeks is not a magic number. It is the result of refusing to let any single phase expand beyond its budget. Scoping eats forever if you let it. Framework debates eat forever if you let them. RAG tuning eats forever if you let it. The playbook works because each phase has a hard stop and a crisp deliverable.

Teams that struggle are almost never struggling with the model. They are struggling with scope discipline, eval discipline, and the unsexy production hardening work that doesn't make for good demos.

Want this playbook executed for your team by engineers who have shipped it to enterprise production more than a dozen times? Book a 30-min scoping call: cal.com/hemangjoshi37a

Before you start, make sure you're not about to hit the 4 mistakes that kill most enterprise AI projects — and see how one security team automated 80% of their workflow with this exact approach.

The 4 Mistakes That Kill 80% of Enterprise AI Projects

Hemang Joshi — Fri, 17 Apr 2026 05:59:05 +0000

What three years of auditing enterprise LLM deployments taught me about why most of them fail before they ship — and how to reverse the damage.

I've audited more than 40 enterprise AI projects over the past three years. Fortune 500 banks. Mid-market logistics firms. Insurance carriers with billion-dollar loss ratios. A few well-funded scale-ups trying to retrofit agents onto a legacy monolith.

Roughly 80% of them had already failed by month three. Not failed as in "the model doesn't work" — failed as in "the pilot stalled, the budget got frozen, the sponsoring VP quietly moved on, and the system is running in a Jupyter notebook that nobody dares touch."

The frustrating part: the failures almost never come from the model itself. GPT-4o, Claude, Llama 3.1, Gemini — they're all more than capable of handling 90% of enterprise workloads. The failures come from how teams wrap the model. And the same four mistakes show up again and again, across industries, team sizes, and budgets.

If you're a CTO, VP of Engineering, or Head of Data mid-deployment on a GenAI initiative, this is the short list I'd audit against today.

Mistake 1: Scoping the entire AI system upfront

Symptom: A 60-page PRD. A 9-month roadmap. A steering committee. A promise to ship "the AI assistant" in Q4.

Root cause: Enterprise teams default to waterfall planning for AI the same way they plan ERP rollouts — because that's the muscle they have. But AI systems are fundamentally probabilistic. You cannot know what your retrieval pipeline needs to look like until you've seen how the model fails on real documents. You cannot know your eval rubric until you've seen the edge cases. You cannot design the agent graph until you've watched a naive single-call pipeline break.

I've seen teams spend four months architecting a multi-agent system, complete with Kubernetes-level infrastructure diagrams, before a single prompt ever touched a real user document. When the PoC finally ran, 70% of the upfront design became obsolete inside two weeks.

The fix: Run a ruthless, scoped PoC first — ideally 2 to 4 weeks, one use case, one data source, one metric. Don't design the whole system. Build the thinnest possible slice that lets you watch the model succeed and fail on your actual data. Then, and only then, decide whether you need a vector database, an agent framework, a reranker, a fine-tune, or any of the other expensive commitments.

The best-run enterprise AI programs I've seen treat every major capability as a distinct PoC-to-production pipeline. Scoping discipline up front is worth more than any framework choice downstream.

Mistake 2: Building agent orchestration from scratch

Symptom: Six months in, the team has built their own planner, their own tool-call dispatcher, their own retry logic, their own memory store, and their own tracing layer. None of it is battle-tested. All of it needs to be maintained by the same engineers who should be building product features.

Root cause: A senior engineer read the ReAct paper, ran a weekend experiment, and concluded that agent orchestration is "just a while loop around tool calls." Three months later, that while loop has sprouted state machines, recovery paths, and a bespoke DSL for describing agent workflows — and nobody outside the original author can reason about it.

The open-source ecosystem has already solved most of what in-house teams keep rebuilding. CrewAI gives you role-based multi-agent patterns with clean handoffs and human-in-the-loop hooks. LangGraph gives you an explicit, inspectable state machine for agentic workflows — with checkpointing, interruption, and replay baked in. AutoGen gives you conversational multi-agent patterns and a strong story around code execution. LlamaIndex Workflows gives you event-driven orchestration with tight RAG integration.

Picking one of these is not a loss of control. It's a gain in leverage. You inherit thousands of engineering hours of edge-case handling — timeouts, partial failures, circular tool calls, token budget overruns, human approval gates — and you can spend your own cycles on the part that actually differentiates you: the domain logic.

The fix: If your team has spent more than two weeks building agent plumbing, stop. Audit what you have against what CrewAI or LangGraph give you out of the box. In most cases, the right move is to port the business logic onto an established framework and delete the custom orchestration. A well-designed agentic system (we write about the patterns that hold up in production here) is mostly about picking the right decomposition and the right guardrails — not about inventing a new runtime.

Mistake 3: No evaluation harness

Symptom: The team ships a feature. A week later, somebody tweaks a prompt to fix an edge case reported by a customer. A week after that, a different edge case silently breaks — and nobody notices until a support ticket lands. Rinse, repeat. Confidence in the system decays over time.

Root cause: Most enterprise teams have never built a rigorous eval harness for LLM-backed features. They're used to unit tests and integration tests for deterministic systems. When outputs are probabilistic, those patterns don't map cleanly, so teams default to "spot check the demo" — which is not an evaluation methodology.

This is the single highest-leverage investment I recommend on audits. A proper eval harness for an LLM feature includes:

A fixed, versioned test set of representative inputs — ideally 100 to 500 examples, curated from real usage, covering happy paths, edge cases, and adversarial inputs.
Reference outputs or scoring rubrics for each test. For extractive tasks, you can use exact match or F1. For generative tasks, use LLM-as-judge with a carefully designed rubric, calibrated against human graders on a subset.
Automated regression runs on every prompt change, model change, or retrieval change. CI-integrated, with a dashboard that surfaces win/loss deltas per change.
Online evaluation in production — user feedback signals, implicit signals (did they accept the suggestion? did they re-prompt?), and canary comparisons across prompt versions.

Tools like Ragas (for RAG-specific metrics — faithfulness, context precision, answer relevance), DeepEval, Promptfoo, and LangSmith make this dramatically less painful than it used to be. But the tool matters less than the discipline. Without an eval harness, you are flying blind, and any improvement is anecdotal.

The fix: Before you ship a second LLM feature, build the harness for the first one. If you're mid-deployment without evals, pause new feature work for two weeks and retrofit. The ROI — measured in regressions caught, prompts shipped with confidence, and customer escalations avoided — pays back inside a quarter.

Mistake 4: Treating AI like traditional software

Symptom: No prompt versioning. No structured logging of model inputs and outputs. No tracing across tool calls. No safety rails on generated content. When something goes wrong in production, the team reconstructs what happened by squinting at nginx logs.

Root cause: The same engineering culture that produced excellent observability for REST APIs hasn't yet internalized that LLM-backed features need a different observability shape. A traditional log line tells you which endpoint was called and what it returned. An LLM log line needs to tell you: which prompt template version, which model, which retrieved context chunks, which tool calls, which intermediate reasoning, and what the final output was — plus user feedback if any.

Four specific gaps I almost always find on audits:

Prompt versioning. Prompts are code. They belong in version control with semantic versions, changelogs, and a rollback path. Storing them in a database with no history is how teams end up in meetings arguing about what the prompt was yesterday.
Retrieval observability. In a RAG system, the model's output is only as good as the chunks it saw. You need to log, for every query, the top-k retrieved chunks, their similarity scores, and the final context window. Without this, diagnosing a wrong answer is guesswork. (We go deeper into production RAG patterns — chunking, reranking, hybrid search, and eval — here.)
Safety rails. Output validation, PII detection, toxicity filters, and jailbreak detection should be part of the request path, not an afterthought. NeMo Guardrails, Llama Guard, and Presidio are mature enough for enterprise use.
Cost and latency telemetry per trace. LLM costs can 10x overnight if a retry loop goes wrong. You need per-request cost attribution, wired into the same dashboards your SREs already watch.

The fix: Treat every LLM interaction as a first-class, traced, versioned, observable event. The tooling is commoditized now — LangSmith, Langfuse, Arize Phoenix, Weights & Biases Weave — pick one and wire it in before you scale usage.

If any of this sounds familiar

Most of these mistakes are recoverable. A stalled PoC can be re-scoped in two weeks. A bespoke agent runtime can be ported to LangGraph in a sprint. An eval harness can be retrofitted in ten days. Observability can be bolted on in a week with the right framework.

What's not recoverable is the credibility you lose with your executive team if the pilot drifts for another quarter.

If your team is mid-deployment and any of these four mistakes feel uncomfortably familiar, I'm happy to do a no-commitment diagnostic. Thirty minutes, on Zoom, and you'll walk away with a concrete list of what to stop doing, what to change, and what to measure. You can book a slot directly at cal.com/hemangjoshi37a.

The teams that ship production AI in 2026 aren't the ones with the biggest budgets or the fanciest models. They're the ones that avoid these four mistakes — or catch them early enough to reverse.

If you're starting now, our 2-week deployment playbook shows the disciplined path — and the enterprise AI buyer's checklist helps you vet vendors who avoid these traps.