<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: souvik roy</title>
    <description>The latest articles on DEV Community by souvik roy (@opengraph-tech).</description>
    <link>https://dev.to/opengraph-tech</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3895747%2Fb533a93e-8b0d-4c63-b9b5-e2fad878722c.jpg</url>
      <title>DEV Community: souvik roy</title>
      <link>https://dev.to/opengraph-tech</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/opengraph-tech"/>
    <language>en</language>
    <item>
      <title>The 5 Places Every AI Agent Dies (and the 4,000-Line Repo That Fixes All Five)</title>
      <dc:creator>souvik roy</dc:creator>
      <pubDate>Fri, 24 Apr 2026 10:25:00 +0000</pubDate>
      <link>https://dev.to/opengraph-tech/the-5-places-every-ai-agent-dies-and-the-4000-line-repo-that-fixes-all-five-2c7g</link>
      <guid>https://dev.to/opengraph-tech/the-5-places-every-ai-agent-dies-and-the-4000-line-repo-that-fixes-all-five-2c7g</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Most agents fail in one of five places. OpenAgent is built for all five."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I've spent the last six months debugging other people's agents in production. Different stacks, different models, different domains. Same five failure modes. Every time.&lt;/p&gt;

&lt;p&gt;If you've ever shipped an agent that worked beautifully in the demo and then produced a confident, fluent, &lt;strong&gt;completely wrong&lt;/strong&gt; answer in prod — this post is for you.&lt;/p&gt;

&lt;p&gt;We'll walk through the five places agents die, and we'll use a repo called &lt;a href="https://github.com/OpenGraph-AI/OpenAgent" rel="noopener noreferrer"&gt;&lt;strong&gt;OpenAgent&lt;/strong&gt;&lt;/a&gt; as our reference implementation. It's ~4,000 lines. MIT. Reads like a cookbook. By the end of this post, you'll either fork it, or you'll understand your own agent well enough to fix it.&lt;/p&gt;

&lt;p&gt;Either is a win.&lt;/p&gt;




&lt;h2&gt;
  
  
  The TL;DR (for the skimmers)
&lt;/h2&gt;

&lt;p&gt;Every production agent needs to answer five questions, in order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Intent&lt;/strong&gt; — what is the user &lt;em&gt;actually&lt;/em&gt; asking?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ambiguity&lt;/strong&gt; — what do I not know yet?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clarifier&lt;/strong&gt; — should I ask, or look it up?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Planner&lt;/strong&gt; — what's the smallest set of steps that gets us there?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Executor&lt;/strong&gt; — how do I run those steps without losing the thread?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Most agent frameworks collapse two or three of these into a single prompt. That's why they break in ways you can't debug. OpenAgent keeps all five as separate, typed, testable stages.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Intent ▸ Ambiguity ▸ Clarifier ▸ Planner ▸ Executor
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each stage has a &lt;strong&gt;typed input&lt;/strong&gt; and a &lt;strong&gt;typed output&lt;/strong&gt;. The Pydantic schema between any two stages is your test surface — and your debug trail.&lt;/p&gt;
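&lt;p&gt;The repo uses Pydantic for these contracts; here is a dependency-free sketch of the same idea using stdlib dataclasses. The names and the 0.7 rule are illustrative, not lifted from the repo:&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Intent:
    goal: str
    confidence: float

    def __post_init__(self):
        # fail at the stage boundary, not six steps downstream
        if self.confidence > 1.0 or 0.0 > self.confidence:
            raise ValueError("confidence must be in [0, 1]")

@dataclass(frozen=True)
class AmbiguityReport:
    needs_clarification: bool
    reasoning: str

def ambiguity_stage(intent):
    # toy rule: shaky extraction confidence means "ask before planning"
    clear = intent.confidence >= 0.7
    return AmbiguityReport(
        needs_clarification=not clear,
        reasoning=f"extraction confidence {intent.confidence}",
    )

report = ambiguity_stage(Intent(goal="summarize the launch post", confidence=0.9))
print(report.needs_clarification)  # False
```

&lt;p&gt;A malformed hand-off fails at construction time, which is exactly the "parse time, not six steps later" property the typed boundary buys you.&lt;/p&gt;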




&lt;h2&gt;
  
  
  Why most agents fail
&lt;/h2&gt;

&lt;p&gt;Picture the most common agent architecture shipped in 2024–2025:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;conversation&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;done&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This "mega-prompt in a while loop" works for demos. It dies in prod because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There's no notion of "what does the user actually want" separate from "what do I do next."&lt;/li&gt;
&lt;li&gt;There's no moment where the agent admits it doesn't know enough.&lt;/li&gt;
&lt;li&gt;There's no plan you can inspect, edit, or resume.&lt;/li&gt;
&lt;li&gt;When it breaks, the only debug signal is a 4,000-token transcript.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You don't have an agent. You have a stochastic &lt;code&gt;while&lt;/code&gt; loop.&lt;/p&gt;

&lt;p&gt;OpenAgent's thesis: &lt;strong&gt;split the loop into five small specialists, wire them with typed contracts, and never let the LLM decide control flow.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let's go stage by stage.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stage 1 — Intent: turn fuzz into a function signature
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The problem.&lt;/strong&gt; Humans don't type goals. They type fragments, vibes, half-sentences. &lt;code&gt;"can you make this better"&lt;/code&gt; is not a specification. Executing on it gives you a confident wrong answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The job.&lt;/strong&gt; Turn raw text into a typed object. Goal, context, constraints, output format, success criteria — and, crucially, &lt;strong&gt;alternative interpretations&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;IntentSchema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;constraints&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;expected_output_format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;success_criteria&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;alternative_interpretations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;   &lt;span class="c1"&gt;# your future self thanks you
&lt;/span&gt;    &lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;                         &lt;span class="c1"&gt;# triggers the next stage
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That last field — &lt;code&gt;alternative_interpretations&lt;/code&gt; — is the one most tutorials skip and the one that saves you. If the model lists three plausible reads of the request, you have a signal that the intent is fuzzy. That signal flows into Stage 2.&lt;/p&gt;
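&lt;p&gt;As a toy sketch of that signal (the function and threshold here are mine, not the repo's):&lt;/p&gt;

```python
def is_fuzzy(alternative_interpretations, confidence):
    # more than one plausible reading, or shaky extraction confidence
    # (below 0.7), both mean: do not execute yet
    return len(alternative_interpretations) > 1 or 0.7 > confidence

print(is_fuzzy(["tighten prose", "shorten it", "fix grammar"], 0.9))  # True
print(is_fuzzy(["fix the failing test"], 0.95))                       # False
```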

&lt;p&gt;&lt;strong&gt;Mental model:&lt;/strong&gt; intent is the &lt;em&gt;function signature&lt;/em&gt;. Until you have it, you don't have a problem. You have a feeling.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stage 2 — Ambiguity: your agent's epistemic humility layer
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The problem.&lt;/strong&gt; Even a cleanly extracted intent can be under-specified. &lt;code&gt;"Write a blog post about our launch"&lt;/code&gt; is structurally fine but missing audience, length, tone, deadline, channel. An agent that steamrolls past this produces a polished artifact no one asked for.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The job.&lt;/strong&gt; Audit the intent along fixed dimensions — scope, audience, depth, format, deadline, domain — and flag each with a severity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AmbiguityReport&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;AmbiguityFlag&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;   &lt;span class="c1"&gt;# dimension, level, impact
&lt;/span&gt;    &lt;span class="n"&gt;needs_clarification&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;
    &lt;span class="n"&gt;reasoning&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This report is a &lt;strong&gt;decision gate&lt;/strong&gt;. The pipeline branches on &lt;code&gt;needs_clarification&lt;/code&gt;, not on a gut feel. If medium-or-higher flags exist, we route to Stage 3. Otherwise we sail straight to planning.&lt;/p&gt;
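&lt;p&gt;The branch itself can live in plain Python. A hypothetical version of the gate, with severity levels as described above:&lt;/p&gt;

```python
SEVERITY = {"low": 0, "medium": 1, "high": 2}

def route(flags):
    # branch in code, not in a prompt: any medium-or-higher flag sends
    # the session to the clarifier, otherwise straight to planning
    needs_clarification = any(SEVERITY[f["level"]] >= SEVERITY["medium"] for f in flags)
    return "clarifier" if needs_clarification else "planner"

print(route([{"dimension": "audience", "level": "high"}]))  # clarifier
print(route([{"dimension": "format", "level": "low"}]))     # planner
```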

&lt;p&gt;&lt;strong&gt;Key separation:&lt;/strong&gt; the ambiguity agent flags &lt;em&gt;what's missing&lt;/em&gt;. It does &lt;strong&gt;not&lt;/strong&gt; write the clarifying questions. Mixing those two jobs optimizes both poorly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stage 3 — Clarifier: ask, or look it up?
&lt;/h2&gt;

&lt;p&gt;This is the stage that makes OpenAgent &lt;em&gt;feel&lt;/em&gt; different from every other agent you've built.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The naive fix for ambiguity is "just ask the user."&lt;/strong&gt; Do that every time and your agent becomes a questionnaire. Users bounce after three questions. Seven is a bloodbath.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The better fix:&lt;/strong&gt; answer what the web can answer. Ask the user only for what they alone know.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# 1. Derive targeted web queries from each ambiguity flag
&lt;/span&gt;&lt;span class="n"&gt;questions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;search_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;clarifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_questions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ambiguity_report&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 2. For each question: can this be confidently answered from the search results?
&lt;/span&gt;&lt;span class="n"&gt;questions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;clarifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;auto_resolve_questions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;questions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;search_results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ambiguity_report&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Split resolved from unresolved
&lt;/span&gt;&lt;span class="n"&gt;unresolved&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;questions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;questions&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;auto_resolved&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;unresolved&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;send_to_user&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unresolved&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;         &lt;span class="c1"&gt;# pause the pipeline
&lt;/span&gt;    &lt;span class="n"&gt;answers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;wait_for_user_response&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;answers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;clarifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build_auto_answers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;questions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# sail through
&lt;/span&gt;
&lt;span class="n"&gt;clarified&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;clarifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;process_answers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ambiguity_report&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;answers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Think of this as a &lt;strong&gt;cost-triage&lt;/strong&gt; step. User attention is the most expensive resource your agent has. Spend it only on personal or organizational context — things that are &lt;em&gt;genuinely&lt;/em&gt; unknowable without the user.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Defaults worth stealing:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Confidence threshold for auto-resolve: &lt;code&gt;0.7&lt;/code&gt;. Lower and the model confabulates sources.&lt;/li&gt;
&lt;li&gt;Max questions to the user: &lt;strong&gt;3&lt;/strong&gt;. Users answer 3. They abandon 7.&lt;/li&gt;
&lt;/ul&gt;
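&lt;p&gt;Put together, the triage rule is a few lines of Python. This is a sketch of the idea, not the repo's code; the dict shape is assumed:&lt;/p&gt;

```python
MAX_USER_QUESTIONS = 3
AUTO_RESOLVE_THRESHOLD = 0.7

def triage(questions):
    # questions: dicts with "text", "web_answer", "web_confidence".
    # Anything the web answered confidently never reaches the user;
    # the remainder is capped, since users answer 3 and abandon 7.
    resolved = [
        q for q in questions
        if q["web_answer"] is not None and q["web_confidence"] >= AUTO_RESOLVE_THRESHOLD
    ]
    unresolved = [q for q in questions if q not in resolved]
    return resolved, unresolved[:MAX_USER_QUESTIONS]

questions = [
    {"text": "What is the product's pricing model?", "web_answer": "tiered SaaS", "web_confidence": 0.9},
    {"text": "Who is the post's audience?", "web_answer": None, "web_confidence": 0.0},
    {"text": "What tone do you want?", "web_answer": None, "web_confidence": 0.0},
]
resolved, to_user = triage(questions)
print(len(resolved), len(to_user))  # 1 2
```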

&lt;h2&gt;
  
  
  Stage 4 — Planner: a DAG, not a vibe
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The problem.&lt;/strong&gt; Dropping a clarified goal into a single "do the thing" prompt gives you a brittle monolith. The model can't back up. You can't resume. You can't verify anything until the whole thing finishes — and by then, you're five paragraphs deep into the wrong answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The job.&lt;/strong&gt; Turn a clarified intent into &lt;strong&gt;numbered, dependency-aware, independently verifiable steps&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PlanStep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;step_number&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;expected_output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;dependencies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;       &lt;span class="c1"&gt;# topological execution
&lt;/span&gt;    &lt;span class="n"&gt;validation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;               &lt;span class="c1"&gt;# how to know it succeeded
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two fields here are non-negotiable in production:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;dependencies&lt;/code&gt;&lt;/strong&gt; — lets the executor topologically sort, and later parallelize independent branches.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;validation&lt;/code&gt;&lt;/strong&gt; — turns "done" from a feeling into a checkable predicate.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you can't write a validation criterion, the step is too vague. Rewrite it. This alone will make your agent 10× more reliable.&lt;/p&gt;
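&lt;p&gt;The &lt;code&gt;dependencies&lt;/code&gt; field maps directly onto the stdlib's &lt;code&gt;graphlib&lt;/code&gt; (Python 3.9+), which also catches cyclic plans for free. A minimal sketch, with the plan reduced to a step-to-dependencies dict:&lt;/p&gt;

```python
from graphlib import TopologicalSorter

# a plan as {step_number: list of dependency step_numbers}; the stdlib
# sorter yields a dependency-respecting order and raises CycleError at
# load time if the model ever emits a circular plan
plan = {1: [], 2: [1], 3: [1], 4: [2, 3]}
order = list(TopologicalSorter(plan).static_order())
print(order)  # a valid execution order: 1 first, 4 last
```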




&lt;h2&gt;
  
  
  Stage 4.5 — Context: gather before you act
&lt;/h2&gt;

&lt;p&gt;Most tutorials skip this. It's the highest-leverage hidden stage in the whole pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The problem.&lt;/strong&gt; An executor that reaches for tools &lt;em&gt;mid-generation&lt;/em&gt; is slow and erratic. The model decides while generating what to search for, then context-switches. Quality drops.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The job.&lt;/strong&gt; Before executing &lt;em&gt;any&lt;/em&gt; step, read the whole plan and, for each step, decide what it needs: knowledge-base lookups, web searches, outputs from dependency steps. Fan out all retrievals &lt;strong&gt;in parallel&lt;/strong&gt;. Attach the results to each step as a &lt;code&gt;StepContext&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;resource_plan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context_agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clarified_intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session_state&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# resource_plan.step_contexts[i] contains everything step i needs
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Separate gathering (embarrassingly parallel) from reasoning (serial). Don't interleave them. Sequential retrievals leave 3–5× latency on the table.&lt;/p&gt;
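&lt;p&gt;In asyncio terms, "fan out in parallel" is one &lt;code&gt;asyncio.gather&lt;/code&gt; call. A minimal sketch with a stand-in fetch (the resource names are made up):&lt;/p&gt;

```python
import asyncio

async def fetch(resource):
    # stand-in for a knowledge-base lookup or web search
    await asyncio.sleep(0.1)
    return f"result for {resource}"

async def gather_contexts(resources):
    # fan out every retrieval at once: wall time is the slowest single
    # fetch, not the sum of all of them
    results = await asyncio.gather(*(fetch(r) for r in resources))
    return dict(zip(resources, results))

contexts = asyncio.run(gather_contexts(["kb:pricing", "web:competitors", "step:2 output"]))
print(contexts["kb:pricing"])  # result for kb:pricing
```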




&lt;h2&gt;
  
  
  Stage 5 — Executor: run the steps, keep the thread, prove you hit the goal
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The problem.&lt;/strong&gt; "Execute the plan" is another vibe. A real executor has to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run steps in dependency order.&lt;/li&gt;
&lt;li&gt;Pass prior outputs into dependents.&lt;/li&gt;
&lt;li&gt;Stream chunks to the UI so it doesn't freeze.&lt;/li&gt;
&lt;li&gt;Survive a step failure without corrupting the rest.&lt;/li&gt;
&lt;li&gt;At the end, &lt;strong&gt;prove the deliverable actually answers the original goal&lt;/strong&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;topological_sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resource_plan&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;step_contexts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;step_number&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;deps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;step_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dependencies&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;process_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;deps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;on_stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;send_chunk&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;step_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;step_number&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;

&lt;span class="n"&gt;final&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;assemble_final&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;step_results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
&lt;span class="c1"&gt;# final includes:
#   output, completeness_check, clarity_check,
#   relevance_check, correctness_check, trace_to_goal
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That last field — &lt;strong&gt;&lt;code&gt;trace_to_goal&lt;/code&gt;&lt;/strong&gt; — is what catches a technically-correct, goal-irrelevant output. It's the check that turns a pipeline into an agent you can actually trust.&lt;/p&gt;




&lt;h2&gt;
  
  
  Five recipes you can steal today
&lt;/h2&gt;

&lt;p&gt;Even if you don't adopt OpenAgent, steal these five patterns. They each solve a class of bugs that costs teams days:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Stream by default, not as an afterthought.&lt;/strong&gt; Every agent should accept an &lt;code&gt;on_stream&lt;/code&gt; callback. Your UI code should never have two paths.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Pause the pipeline like a coroutine, not a state machine.&lt;/strong&gt; When the clarifier needs user input, suspend between phases and resume when the answer arrives. Use a session object that knows its current phase — not a pile of booleans.&lt;/p&gt;
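&lt;p&gt;A minimal sketch of such a session object (names are illustrative, not the repo's):&lt;/p&gt;

```python
from enum import Enum, auto

class Phase(Enum):
    INTENT = auto()
    CLARIFYING = auto()
    PLANNING = auto()
    EXECUTING = auto()
    DONE = auto()

class Session:
    """One field names where the pipeline is parked, instead of a pile
    of booleans (asked, answered, planned) drifting out of sync."""

    def __init__(self):
        self.phase = Phase.INTENT
        self.pending_questions = []
        self.answers = {}

    def pause_for_user(self, questions):
        self.pending_questions = questions
        self.phase = Phase.CLARIFYING

    def resume_with_answers(self, answers):
        if self.phase is not Phase.CLARIFYING:
            raise RuntimeError("can only resume a paused session")
        self.answers.update(answers)
        self.pending_questions = []
        self.phase = Phase.PLANNING

session = Session()
session.pause_for_user(["Who is the audience?"])
print(session.phase.name)  # CLARIFYING
```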

&lt;p&gt;&lt;strong&gt;3. Cache the intent, not the final output.&lt;/strong&gt; Intent is a deterministic function of &lt;code&gt;user_text + prompt&lt;/code&gt;. Cache it. Final output depends on the full session including clarifications — caching it will bite you.&lt;/p&gt;
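&lt;p&gt;A hypothetical cache keyed on exactly those inputs, with a prompt version in the key so prompt changes invalidate stale entries:&lt;/p&gt;

```python
import hashlib

_intent_cache = {}

def cached_intent(user_text, prompt_version, extract):
    # key on exactly the inputs that determine the result; bump
    # prompt_version whenever the intent prompt changes so stale
    # entries can never be served
    key = hashlib.sha256(f"{prompt_version}:{user_text}".encode()).hexdigest()
    if key not in _intent_cache:
        _intent_cache[key] = extract(user_text)
    return _intent_cache[key]

calls = []
def fake_extract(text):
    # stand-in for the real LLM call
    calls.append(text)
    return {"goal": text}

cached_intent("can you make this better", "v1", fake_extract)
cached_intent("can you make this better", "v1", fake_extract)
print(len(calls))  # 1: the second call hit the cache
```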

&lt;p&gt;&lt;strong&gt;4. Typed schemas at every boundary.&lt;/strong&gt; Pydantic between agents is not ceremony. It's your test surface. Bad outputs get caught at parse time, not six steps later when something dereferences a missing field.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Keep the LLM out of control flow.&lt;/strong&gt; The LLM decides &lt;em&gt;content&lt;/em&gt;. Python decides &lt;em&gt;flow&lt;/em&gt; — which phase runs, when to pause, when to retry, when to fall back. If your prompt has an &lt;code&gt;if tool_name == "ask_user":&lt;/code&gt; branch, you've inverted it.&lt;/p&gt;




&lt;h2&gt;
  
  
  How this compares to LangGraph / CrewAI / AutoGen
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;OpenAgent&lt;/th&gt;
&lt;th&gt;LangGraph&lt;/th&gt;
&lt;th&gt;CrewAI&lt;/th&gt;
&lt;th&gt;AutoGen&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Mental model&lt;/td&gt;
&lt;td&gt;Typed pipeline&lt;/td&gt;
&lt;td&gt;Graph of nodes&lt;/td&gt;
&lt;td&gt;Role-playing crew&lt;/td&gt;
&lt;td&gt;Multi-agent chat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Typed contracts between stages&lt;/td&gt;
&lt;td&gt;✅ Pydantic&lt;/td&gt;
&lt;td&gt;⚠️ Optional&lt;/td&gt;
&lt;td&gt;⚠️ Loose&lt;/td&gt;
&lt;td&gt;⚠️ Loose&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auto-resolving clarifier&lt;/td&gt;
&lt;td&gt;✅ Built-in&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Framework weight&lt;/td&gt;
&lt;td&gt;~4k LOC&lt;/td&gt;
&lt;td&gt;Heavy&lt;/td&gt;
&lt;td&gt;Heavy&lt;/td&gt;
&lt;td&gt;Heavy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Pause for user" first-class&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;⚠️ Via interrupts&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;⚠️ Via prompts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reads like a cookbook&lt;/td&gt;
&lt;td&gt;✅ By design&lt;/td&gt;
&lt;td&gt;⚠️ Reference docs&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;When to pick OpenAgent:&lt;/strong&gt; you want to understand every moving part, control each prompt, and own your agent's reasoning end-to-end — not inherit someone else's abstraction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to pick a framework:&lt;/strong&gt; you want to ship fast without thinking about architecture, and the framework's defaults match your domain.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quickstart (literally 60 seconds)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/OpenGraph-AI/OpenAgent.git
&lt;span class="nb"&gt;cd &lt;/span&gt;OpenAgent
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env       &lt;span class="c"&gt;# set LLM_API_KEY at minimum&lt;/span&gt;
python run.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;code&gt;http://localhost:8000/static/index.html&lt;/code&gt;, type a fuzzy request, and watch each phase stream into the UI in real time: intent extraction, ambiguity flags, clarifying questions, the plan, and the executor producing the answer step-by-step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Minimum config:&lt;/strong&gt; one variable — &lt;code&gt;LLM_API_KEY&lt;/code&gt;. Works with any OpenAI-compatible provider. No Redis? Falls back to in-memory. No Exa? Skips web search. Missing keys are features, not errors.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where to start reading the code
&lt;/h2&gt;

&lt;p&gt;If you clone it, open files in this order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://github.com/OpenGraph-AI/OpenAgent/blob/main/backend/models/schemas.py" rel="noopener noreferrer"&gt;&lt;code&gt;backend/models/schemas.py&lt;/code&gt;&lt;/a&gt; — the contracts between phases. Read this first. Everything else is transformations.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/OpenGraph-AI/OpenAgent/blob/main/backend/agents/intent_agent.py" rel="noopener noreferrer"&gt;&lt;code&gt;backend/agents/intent_agent.py&lt;/code&gt;&lt;/a&gt; — the simplest agent. A clean template for your own.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/OpenGraph-AI/OpenAgent/blob/main/backend/agents/clarification_agent.py" rel="noopener noreferrer"&gt;&lt;code&gt;backend/agents/clarification_agent.py&lt;/code&gt;&lt;/a&gt; — the most interesting. Auto-resolve via web search is the trick worth stealing.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/OpenGraph-AI/OpenAgent/blob/main/backend/orchestrator/pipeline.py" rel="noopener noreferrer"&gt;&lt;code&gt;backend/orchestrator/pipeline.py&lt;/code&gt;&lt;/a&gt; — how phases are wired, paused, and resumed.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/OpenGraph-AI/OpenAgent/blob/main/backend/agents/execution_agent.py" rel="noopener noreferrer"&gt;&lt;code&gt;backend/agents/execution_agent.py&lt;/code&gt;&lt;/a&gt; — step-by-step execution with per-step context injection.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The meta-lesson
&lt;/h2&gt;

&lt;p&gt;The real insight in OpenAgent isn't any single stage. It's the &lt;strong&gt;shape of the solution&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The LLM hype cycle trained us to think bigger prompt = better agent. In production, the opposite is true. Better agents come from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Smaller prompts.&lt;/li&gt;
&lt;li&gt;Stronger contracts between prompts.&lt;/li&gt;
&lt;li&gt;Explicit control flow &lt;em&gt;outside&lt;/em&gt; the LLM.&lt;/li&gt;
&lt;li&gt;Pauses, retries, and fallbacks as first-class citizens.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're building an agent right now, forget the framework wars for a minute. Ask yourself: &lt;strong&gt;can I point to the five places mine could fail, and the typed object that lives at each boundary?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If yes, ship.&lt;/p&gt;

&lt;p&gt;If no, spend an afternoon reading &lt;a href="https://github.com/OpenGraph-AI/OpenAgent" rel="noopener noreferrer"&gt;OpenAgent&lt;/a&gt;. Then go back and fix yours.&lt;/p&gt;




&lt;h2&gt;
  
  
  Your turn
&lt;/h2&gt;

&lt;p&gt;Drop a comment with the stage &lt;strong&gt;you&lt;/strong&gt; struggle with most — intent extraction, ambiguity flagging, clarification UX, planning, execution tracing — and I'll reply with the specific file in OpenAgent that nails it.&lt;/p&gt;

&lt;p&gt;And if this saved you a 2am debugging session, the repo lives or dies on one thing:&lt;/p&gt;

&lt;p&gt;⭐ &lt;strong&gt;&lt;a href="https://github.com/OpenGraph-AI/OpenAgent" rel="noopener noreferrer"&gt;Star OpenAgent on GitHub&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Fork it. Break it. Build yours on top of it. That's the whole point.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built with intent, by the folks at &lt;a href="https://opengraph.tech" rel="noopener noreferrer"&gt;OpenGraph.tech&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
