Your engineering team wants to build an AI agent. They have a proof of concept. It demos well. You greenlight it.
Six months later, Gartner's prediction lands on your desk: over 40% of agentic AI projects will be canceled before reaching production by the end of 2027. Yours is one of them.
This happens because the wrong questions get asked at the start. Engineering asks "can we build it?" The PM should be asking "should we ship it — and what breaks if we do?"
Here are 7 questions that separate AI agent projects that ship from ones that burn budget.
1. What Happens When the Agent Is Wrong?
Every AI agent hallucinates. Every single one. The question is not whether it will produce incorrect output — it will. The question is what happens next.
Before greenlighting, demand answers to three sub-questions:
- Detection: How do we know the agent gave a wrong answer? Is there a validation layer, a human review step, or automated checks?
- Impact: What is the blast radius of a wrong answer? Does it send an email, execute a trade, update a database?
- Recovery: Can we undo the action? Is there a rollback mechanism?
If the blast radius is high and recovery is manual, you need a human-in-the-loop design. If the team says "the model is really accurate," push back. Accuracy rates are measured on benchmarks — not on your production data with your edge cases.
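The three sub-questions map directly onto a wrapper around every side-effecting action. Here is a minimal sketch; `validate` and `rollback` are placeholders for whatever detection and undo mechanisms your team actually defines:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class ActionResult:
    ok: bool
    detail: str

def run_with_guardrails(
    action: Callable[[], Any],
    validate: Callable[[Any], bool],
    rollback: Callable[[Any], None],
) -> ActionResult:
    """Execute an agent action, check it, and undo it if the check fails."""
    output = action()
    if validate(output):                    # detection
        return ActionResult(ok=True, detail="validated")
    rollback(output)                        # recovery: limit the blast radius
    return ActionResult(ok=False, detail="validation failed; rolled back")

# Usage: a drafted email only stays "sent" if validation passes.
sent: list[str] = []
result = run_with_guardrails(
    action=lambda: (sent.append("draft"), "draft")[1],
    validate=lambda out: len(out) > 100,    # e.g. a minimum-quality check
    rollback=lambda out: sent.remove("draft"),
)
```

If your team cannot fill in `validate` and `rollback` for a given action, that action needs a human in the loop.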
What to write in the PRD: "Define the failure mode for incorrect output. Document detection method, blast radius, and rollback procedure."
2. What Does This Cost Per User Per Month?
AI agents are expensive in ways that traditional software is not. A standard API endpoint costs fractions of a cent per call. An AI agent making 5-10 LLM calls per user request costs 10-100x more.
Here is why the math surprises PMs:
- Output tokens cost 3-10x more than input tokens at most providers
- Agents chain multiple LLM calls: planning, tool selection, execution, verification, response generation
- Context windows grow with conversation history — every follow-up message costs more than the last
- Retries on failures double or triple the bill
What to ask engineering: "Give me the cost per successful task completion, not cost per API call. Include retries, context growth, and tool calls."
A mid-sized product with 1,000 daily active users running multi-turn agent conversations can burn through 5-10 million tokens per day, or 150-300 million per month. At current frontier-model pricing, that is roughly $3,200-$13,000/month in LLM costs alone, before infrastructure, monitoring, or maintenance.
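You can sanity-check engineering's estimate with back-of-the-envelope math. The per-token prices, call counts, and usage figures below are illustrative assumptions, not real quotes; plug in your own numbers:

```python
def cost_per_task(
    calls_per_task: int = 5,          # planning, tool calls, verification, response
    input_tokens_per_call: int = 2000,  # grows with conversation history
    output_tokens_per_call: int = 500,
    retry_rate: float = 0.1,          # fraction of calls that get retried
    price_in_per_m: float = 3.00,     # assumed $/1M input tokens
    price_out_per_m: float = 15.00,   # assumed $/1M output tokens
) -> float:
    """Dollar cost per successful task completion, retries included."""
    calls = calls_per_task * (1 + retry_rate)
    cost_in = calls * input_tokens_per_call / 1e6 * price_in_per_m
    cost_out = calls * output_tokens_per_call / 1e6 * price_out_per_m
    return cost_in + cost_out

# 1,000 DAU x 2 agent tasks per day x 30 days
monthly = cost_per_task() * 1000 * 2 * 30
```

With these assumptions the bill lands around $4,500/month, and doubling call count or context size doubles it. This is exactly why the question is cost per task completion, not cost per API call.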
What to write in the PRD: "Define a cost ceiling per user per month. Include a kill switch if agent costs exceed 2x the ceiling."
3. How Do We Measure If It Actually Works?
"It works in the demo" is not a success metric. Demos use clean inputs, happy paths, and cherry-picked examples. Production has typos, edge cases, and users who try things nobody anticipated.
Define success metrics before building:
- Task completion rate: What percentage of user requests does the agent fully resolve without human intervention?
- Accuracy rate: Of completed tasks, how many produced correct results? Measure against ground truth, not vibes.
- Time to value: Does the agent save the user time compared to the manual workflow? Measure this — do not assume.
- Fallback rate: How often does the agent hand off to a human? A 60% fallback rate means you built an expensive routing layer, not an agent.
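All four metrics can be computed from a simple task log. A sketch, assuming each task record carries completion, correctness, escalation, and timing fields (the field names here are made up for illustration):

```python
def agent_metrics(tasks: list[dict]) -> dict:
    """Compute the four success metrics from logged task records."""
    total = len(tasks)
    completed = [t for t in tasks if t["completed"]]
    return {
        # fraction of requests fully resolved without a human
        "task_completion_rate": len(completed) / total,
        # of completed tasks, fraction matching ground truth
        "accuracy_rate": (sum(t["correct"] for t in completed) / len(completed))
            if completed else 0.0,
        # fraction handed off to a human
        "fallback_rate": sum(t["escalated"] for t in tasks) / total,
        # fraction of manual time actually saved
        "time_saved_ratio": sum(t["manual_seconds"] - t["agent_seconds"]
                                for t in tasks)
            / sum(t["manual_seconds"] for t in tasks),
    }

logged = [
    {"completed": True,  "correct": True,  "escalated": False,
     "agent_seconds": 10, "manual_seconds": 60},
    {"completed": True,  "correct": True,  "escalated": False,
     "agent_seconds": 20, "manual_seconds": 60},
    {"completed": True,  "correct": False, "escalated": False,
     "agent_seconds": 30, "manual_seconds": 60},
    {"completed": False, "correct": None,  "escalated": True,
     "agent_seconds": 60, "manual_seconds": 60},
]
metrics = agent_metrics(logged)
```

The point is not this exact schema; it is that the log fields needed to compute these numbers must exist from day one, or you will launch blind.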
What to write in the PRD: "Define 4 success metrics with target numbers before development begins. If metrics are not hit within 8 weeks of launch, trigger a go/no-go review."
4. What Data Does It Need — and Do We Have It?
Data quality kills more AI agent projects than bad models. Gartner's research on agentic AI failures found that data quality issues are the single most cited reason for pilot failures.
Ask your engineering team:
- What data sources does the agent need access to? Internal databases, APIs, documents, user history?
- Is that data clean, structured, and accessible via API? Or does it require a 3-month data pipeline project first?
- What happens when the data is stale? If the agent answers based on last week's data, is that acceptable?
- What data does the agent generate? Logs, decisions, user interactions — where does this go?
The hidden trap: many teams discover mid-build that the data they need lives in a system with no API, behind a firewall, or in PDFs that require extraction. This turns a 2-month agent project into a 6-month data engineering project.
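One way to make the data inventory concrete is a small registry that flags non-API sources automatically. The source names and fields below are hypothetical examples:

```python
from dataclasses import dataclass

@dataclass
class DataSource:
    name: str
    access: str       # "api", "database", "files", ...
    freshness: str    # e.g. "real-time", "daily", "weekly"
    owner: str

    @property
    def at_risk(self) -> bool:
        # No API access usually hides a data-engineering project
        return self.access != "api"

sources = [
    DataSource("crm_contacts", "api",   "daily",  "sales-ops"),
    DataSource("contracts",    "files", "weekly", "legal"),  # PDFs, no API
]
risks = [s.name for s in sources if s.at_risk]
```

Any name that lands in `risks` deserves a timeline conversation before the sprint starts, not after.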
What to write in the PRD: "List every data source required. For each, document: access method, freshness requirement, and owner. Flag any source without API access as a risk."
5. Who Reviews the Agent's Output Before Users See It?
This is the question most teams skip. The answer reveals how much trust you are placing in an unpredictable system.
Three common architectures:
- Fully autonomous: Agent acts without review. Fast, but risky for high-stakes actions. Suitable for low-impact tasks like summarization or search.
- Human-in-the-loop: Every action is reviewed by a person. Safe, but slow and expensive. Defeats the purpose if review takes longer than doing the task manually.
- Guardrails + escalation: Agent acts autonomously within defined boundaries. Actions outside those boundaries trigger human review. This is where most production agents land.
The third option is the right default for most teams. But it requires defining the boundaries explicitly — not in a meeting, in code. Examples of boundaries: dollar amount thresholds for automated actions, confidence scores below which the agent defers, and action types that always require approval (sending external emails, modifying production data, making purchases).
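Those boundaries can literally be a few lines of code. A sketch of the decision matrix as a routing function, with made-up thresholds and action names:

```python
# Boundaries from the PRD decision matrix (illustrative values)
APPROVAL_REQUIRED = {"send_external_email", "modify_production_data", "purchase"}
MAX_AUTONOMOUS_DOLLARS = 100.0
MIN_CONFIDENCE = 0.85

def route(action_type: str, amount: float, confidence: float) -> str:
    """Return 'auto' or 'review' for a proposed agent action."""
    if action_type in APPROVAL_REQUIRED:
        return "review"           # some action types always escalate
    if amount > MAX_AUTONOMOUS_DOLLARS:
        return "review"           # dollar threshold exceeded
    if confidence < MIN_CONFIDENCE:
        return "review"           # agent is unsure; defer to a human
    return "auto"
```

The advantage of encoding the matrix this way is that changing a boundary is a one-line, reviewable diff instead of a meeting.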
What to write in the PRD: "Define which actions the agent can take autonomously, which require human approval, and what triggers escalation. Document this as a decision matrix."
6. What Is the Maintenance Cost After Launch?
AI agents are not "build and forget" software. They degrade over time in ways traditional software does not:
- Model deprecation: LLM providers regularly retire model versions. A working agent can break when the underlying model changes or disappears, sometimes with little warning.
- Prompt drift: Small changes to the model's behavior across versions change agent output quality. Prompt engineering that worked in March may fail in June.
- Data drift: The real-world data your agent encounters changes over time. Training or few-shot examples become stale.
- Cost creep: As users discover the agent, usage grows. So does the bill.
A realistic maintenance budget for a production agent includes: monthly prompt tuning, quarterly evaluation runs, model version monitoring, and cost tracking dashboards.
What to ask engineering: "What is the ongoing maintenance plan? Who owns the agent after launch? What is the monthly time commitment?"
What to write in the PRD: "Define agent ownership post-launch. Allocate 15-20% of build effort for ongoing maintenance in the first year."
7. Can We Ship a Deterministic Version First?
This is the question that saves the most money. Before building an AI agent, ask: can we solve 80% of this problem with rules, templates, and traditional automation?
- If the workflow has fewer than 10 decision branches, a decision tree probably works.
- If the input format is predictable, regex or template matching is faster and cheaper.
- If the success criteria are clear and measurable, test automation handles it.
AI agents excel at tasks with high variability, unstructured inputs, and complex reasoning. They are overkill for structured workflows that follow predictable patterns.
The winning pattern: ship the deterministic version first. Measure where it fails. Build the AI agent only for the gap. This is cheaper to build, easier to debug, and gives you real production data about where AI would actually add value — instead of guessing during sprint planning.
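The rules-first pattern is straightforward to sketch. The two regex rules here are illustrative, and `call_agent` is a hypothetical stand-in for the expensive LLM path:

```python
import re

def call_agent(text: str) -> str:
    """Placeholder for the LLM agent; only reached when no rule matches."""
    return "agent:" + text

def handle_request(text: str) -> str:
    """Deterministic rules first; fall through to the agent only for the gap."""
    if re.fullmatch(r"(?i)reset (my )?password", text.strip()):
        return "rule:password_reset"
    if re.search(r"(?i)\border\s+#?\d+\b", text):
        return "rule:order_status"
    # Log these fall-throughs: they tell you where AI actually adds value.
    return call_agent(text)
```

Every request that reaches `call_agent` is production evidence for (or against) the AI investment, which is exactly the data sprint planning never has.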
What to write in the PRD: "Document why this task requires AI over deterministic automation. If >70% of cases could be handled by rules, build the rules first and add AI for the remainder."
The Checklist
Before your next sprint planning, run through all 7:
- ☐ Failure modes defined (detection, blast radius, rollback)
- ☐ Cost per user per month calculated (including retries and context growth)
- ☐ Success metrics defined with target numbers
- ☐ Data sources listed with access methods and freshness requirements
- ☐ Review architecture chosen (autonomous / human-in-loop / guardrails)
- ☐ Maintenance plan and post-launch ownership defined
- ☐ Deterministic alternative evaluated and ruled out with evidence
If your team cannot answer all 7, the project is not ready for greenlighting. That does not mean "never" — it means "not yet."
The 40% failure rate Gartner predicts is not inevitable. It is the result of projects that skipped these questions.
Follow @klement_gunndu for more AI engineering content. We're building in public.