Every AI vendor has a demo that works perfectly. That is the problem.
Gartner predicts over 40% of agentic AI projects will be canceled by the end of 2027 due to escalating costs, unclear business value, or inadequate risk controls (Gartner, June 2025). A separate Gartner report predicted that through 2026, organizations will abandon 60% of AI projects unsupported by AI-ready data (Gartner, February 2025).
The pattern is consistent: teams greenlight AI products based on impressive demos, then discover the gap between demo and production is a canyon.
PMs sit at the decision point. You approve the budget. You set the timeline. You own the outcome when it ships — or when it doesn't. These 5 red flags help you spot the canyon before you walk into it.
Red Flag 1: "AI-Powered" With No Explanation of What That Means
The vendor says their product is "AI-powered." You ask what the AI actually does. They pivot to a slide about "leveraging machine learning" or "using advanced neural networks."
This is AI washing. The term "AI-powered" has become so overused that the U.S. Federal Trade Commission issued guidance warning companies about making unsubstantiated AI claims (FTC, February 2023). The problem has only gotten worse since then.
What to ask instead:
- "What specific task does the AI perform that wasn't possible before?"
- "What model or approach powers this? Is it a foundation model, a fine-tuned model, or a rules engine with an AI label?"
- "What happens when I turn the AI off? What manual process does it replace?"
If the vendor cannot explain in one sentence what the AI does — not what it "leverages" or "harnesses" — the product is either not AI or the team does not understand their own technology. Both are disqualifying.
The test: Ask the sales engineer, not the account executive. Sales engineers talk implementation. Account executives talk vision. You need implementation.
Red Flag 2: The Demo Uses Their Data, Not Yours
The demo runs on a curated dataset. The search returns perfect results. The classification hits 98% accuracy. The generated text reads like a press release from a Fortune 500 company.
Then you feed it your data — messy CSVs with missing fields, inconsistent naming conventions, and 3 years of legacy formatting — and accuracy drops to 60%.
This is the most common gap between demo and production. Gartner found that lack of AI-ready data is the primary reason organizations abandon AI projects (Gartner, February 2025).
What to ask instead:
- "Can we run the demo on our data? Not our cleanest data — our realistic data."
- "What data preparation did you do before this demo? How long did it take?"
- "What percentage of your customers needed data cleaning before going live? How long did that take on average?"
If the vendor hesitates to run on your data, that tells you everything. A mature product handles messy inputs. An immature product needs a clean room.
The test: Bring a sample dataset to the second meeting. Not your best data. Your average data. Watch what happens.
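Before that second meeting, it helps to know how messy your sample actually is, so the vendor can't blame a surprise. A minimal sketch of a pre-demo audit for a generic CSV with a header row — the function name and file layout are hypothetical, not any vendor's tooling:

```python
# Quick pre-demo audit: count rows and missing cells per column in a
# CSV sample, so you know the data's messiness before the vendor does.
# The file path and column names are placeholders for your own data.
import csv
from collections import Counter

def audit(path: str) -> dict:
    """Return the row count and per-column missing-cell counts."""
    missing = Counter()
    rows = 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            rows += 1
            for col, val in row.items():
                # Treat short rows (None) and blank strings as missing.
                if val is None or not val.strip():
                    missing[col] += 1
    return {"rows": rows, "missing": dict(missing)}
```

Run it on your average export, not your best one; the missing-cell counts are the first thing a serious vendor will ask about anyway.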
Red Flag 3: No Production Customers — Only Pilots and POCs
"We have 15 enterprise pilots running right now."
Pilots are not production. A pilot is a controlled experiment with a dedicated support team, a narrow scope, and a safety net. Production means the product handles real traffic, real edge cases, and real failures at scale with no one holding its hand.
Gartner predicted that at least 30% of generative AI projects would be abandoned after proof of concept by the end of 2025 (Gartner, July 2024). The pilot-to-production gap is where most AI projects die.
What to ask instead:
- "How many customers are running this in production today — not pilots, production?"
- "What is your average time from pilot to production deployment?"
- "Can I talk to a production customer in my industry? Not a reference they prepared — one I choose from a list."
- "What is the biggest production failure you have seen, and how did you handle it?"
The last question is the most revealing. Every production system fails. A vendor who cannot describe a specific failure and their response to it has either never been in production or is hiding something.
The test: Ask for 3 production customer names. If they give you 3 pilot names instead, you have your answer.
Red Flag 4: Pricing That Hides the Real Cost
The vendor quotes $2,000 per month for their AI platform. What they don't mention: the platform makes API calls to a foundation model provider, and those calls are billed separately based on token usage.
Your proof of concept runs 50 queries a day. Your production environment will run 5,000. The $2,000/month platform fee stays the same. The model inference cost goes from $200/month to $20,000/month.
This is not a vendor problem — it is an AI economics problem. Foundation model costs scale with usage in ways that traditional SaaS does not. A SaaS product costs the same whether 10 users or 10,000 users run the same query. An AI product that calls GPT-4 or Claude costs more with every query, every token, every retry.
What to ask instead:
- "What is the total cost at 10x our current volume? At 100x?"
- "Does your pricing include model inference costs, or are those separate?"
- "What happens to cost if we switch to a different model? Are we locked into one provider?"
- "What cost optimization have you built in? Caching? Model routing? Batch processing?"
If the vendor quotes a flat rate and cannot answer volume questions, they either haven't scaled or they're counting on you not asking.
The test: Ask for a cost calculator or a cost projection at 3 volume tiers: current, 10x, and 100x. If they don't have one, they haven't thought about it.
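The calculator you're asking for doesn't have to be complicated. A minimal sketch of the volume math from this section — the $0.13 blended per-query inference price is an assumption chosen to roughly match the numbers above, not a real vendor rate:

```python
# Sketch of a volume-tier cost projection for an AI product with a
# flat platform fee plus usage-based inference billing.
# All prices are hypothetical placeholders, not real vendor rates.

PLATFORM_FEE = 2_000     # flat monthly platform fee ($)
COST_PER_QUERY = 0.13    # assumed blended inference cost per query ($)
DAYS_PER_MONTH = 30

def monthly_cost(queries_per_day: float) -> float:
    """Total monthly cost: flat fee plus usage-scaled inference."""
    inference = queries_per_day * DAYS_PER_MONTH * COST_PER_QUERY
    return PLATFORM_FEE + inference

for tier, qpd in [("current", 50), ("10x", 500), ("100x", 5_000)]:
    print(f"{tier:>7}: ${monthly_cost(qpd):,.2f}/month")
```

At 50 queries a day the inference line item is noise; at 5,000 it dwarfs the platform fee. That crossover point is the number the flat-rate quote hides.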
Red Flag 5: No Answer to "What Happens When It's Wrong?"
The AI agent summarizes a contract and misses a liability clause. The AI classifier labels a support ticket as low priority when the customer is about to churn. The AI recommendation engine suggests a product that was discontinued last month.
Every AI system produces wrong outputs. The question is not "is it perfect?" — the question is "what happens when it isn't?"
What to ask instead:
- "What is your accuracy rate on tasks similar to ours? How do you measure it?"
- "When the system produces a wrong output, how does it signal that to the user?"
- "Is there a confidence score? What threshold do you recommend for human review?"
- "What is your feedback loop? If a user corrects an error, does the system learn from it?"
- "Do you have an audit trail? Can I trace why the system made a specific decision?"
A product that cannot explain its failure mode is a product that has not been tested at scale. Confidence scores, human-in-the-loop workflows, and audit trails are not optional features — they are table stakes for any AI product that touches business decisions.
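As a reference point for what "flags it for human review" means in practice, here is a minimal sketch of confidence-threshold routing. The 0.85 cutoff and the record fields are illustrative assumptions, not any specific vendor's API:

```python
# Sketch of confidence-threshold routing: outputs below the review
# threshold go to a human queue instead of straight through.
# The threshold and record fields are hypothetical.

REVIEW_THRESHOLD = 0.85

def route(output: dict) -> str:
    """Return 'auto' for high-confidence outputs, 'human_review' otherwise."""
    return "auto" if output["confidence"] >= REVIEW_THRESHOLD else "human_review"

outputs = [
    {"id": 1, "label": "billing", "confidence": 0.97},
    {"id": 2, "label": "churn_risk", "confidence": 0.62},
]
for o in outputs:
    print(o["id"], route(o))
```

If a vendor can't show you where this branch lives in their product — and what the human queue looks like on the other side of it — the human-in-the-loop story is a slide, not a feature.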
The test: Ask the vendor to show you a wrong output from their system. If they can show it and explain why it happened, they understand their product. If they insist the system "doesn't make mistakes," leave the meeting.
The 5-Question Cheat Sheet
Print this. Bring it to your next vendor meeting.
| # | Question | What a Good Answer Sounds Like |
|---|---|---|
| 1 | What specific task does the AI do? | "It classifies support tickets into 12 categories with 94% accuracy, measured on our benchmark of 10,000 labeled tickets." |
| 2 | Can we run the demo on our data? | "Yes. Send us a sample and we'll run it in our sandbox. Here's the data format we need." |
| 3 | How many production customers use this? | "47 production customers. Average time from pilot to production: 6 weeks. Here are 3 you can call." |
| 4 | What is the total cost at 100x volume? | "Platform fee stays flat. Inference costs scale linearly — here's our cost calculator with 3 tiers." |
| 5 | What happens when the AI is wrong? | "We surface a confidence score on every output. Below 0.85, the system flags it for human review. Here's our audit trail." |
If the vendor cannot give concrete answers to all 5 questions, the product is not ready for your team. It might be ready for a pilot. It is not ready for production.
The Deeper Problem
These red flags are symptoms of a market moving faster than its quality controls. AI vendors are under pressure to ship. PMs are under pressure to adopt. The result: decisions made on demo quality instead of production evidence.
The fix is not to avoid AI products. The fix is to evaluate them the same way you evaluate any production dependency: with your data, at your scale, with a clear understanding of failure modes and costs.
Over 40% of agentic AI projects will be canceled by the end of 2027. The teams that avoid that outcome are the ones asking these questions before they sign — not after.
Follow @klement_gunndu for more AI product content. We're building in public.
Top comments (11)
red flag 2 burned me harder than any other. we approved budget after a demo on their curated dataset, then spent two months realizing our actual data killed accuracy by 40%. vendor response was basically "your data needs cleanup" which... sure, but that gap should have been the first conversation not the last.
the sales engineer test in flag 1 is genuinely underrated. best filter question i've found is "what breaks first?" - if they don't have a confident fast answer, they haven't run it at scale.
That "your data needs cleanup" deflection is basically the vendor admitting they knew the gap existed and chose not to surface it. An honest demo would have asked for a sample of your actual data before the budget conversation even started.
the 'your data needs cleanup' line is almost a red flag in itself at this point - i've heard it enough times to recognize it as a pattern. the better vendors ask for a dirty sample upfront specifically to show how they handle it. the ones who don't usually already know the answer won't look good
Red flag 5 (no failure mode discussion) is the one I see trip up teams the most. I'd add a practical test: ask the vendor to run the demo with deliberately bad input — malformed data, edge cases, adversarial prompts. If they can't or won't, that tells you everything about how far the product is from production-ready. The best AI tools I've worked with fail gracefully and tell you why they failed. The worst ones just confidently return wrong answers.
Deliberately bad input is the best vendor filter — if they flinch at malformed data during a demo, imagine what happens in production with real user chaos. That "run it with garbage" test should honestly be step one in every PM's eval checklist.
Red flag 4 (hidden costs at scale) is the one I wish more teams talked about openly. I run a system that makes thousands of automated calls across multiple APIs and LLMs daily — the difference between prototype costs and production costs isn't linear, it's exponential once you factor in retries, error handling, and the data normalization layer you inevitably have to build.
One thing I'd add to the cheat sheet: ask about degradation behavior. Not just "what happens when it's wrong" but "what happens when your upstream model provider has a bad day?" I've seen agents that work flawlessly on Claude 3.5 suddenly produce garbage when the provider silently updates model weights or changes rate limits. The best production systems I've built have fallback chains and output validation layers — and that's infrastructure the vendor demo will never show you.
The "ask the sales engineer, not the AE" advice is gold. I'd extend it further: ask to talk to their on-call engineer. That person knows every failure mode the product has, because they've been woken up by each one.
Retries are the silent budget killer — we saw one pipeline where retry loops alone 3x'd our token spend before anyone noticed. Logging cost-per-completed-action instead of cost-per-request changed how we think about scaling entirely.
Cost-per-completed-action is such a better mental model than cost-per-request. I've seen the same pattern with automated pipelines — you don't realize how much retries are costing until you actually instrument at the action level.
Curious: did you end up adding circuit breakers or just capping retry attempts? I found that exponential backoff with a hard ceiling on total retries per action was the sweet spot for keeping costs predictable without sacrificing reliability.
christ does anyone on this entire website write anything themselves?
Fair challenge — I write from building AI products daily and getting burned by exactly these demo tricks. Happy to debate any specific point if something felt off.