After auditing dozens of failed AI consulting engagements, I've noticed buyers keep asking the wrong questions. "Do you have AI expertise?" "Can you build an LLM app?" "What's your day rate?" Every consultancy on Earth answers yes, yes, and a reasonable number. Six months later: stalled PoC, burned budget, eroded trust.
Here's the checklist I wish every CTO, VP Engineering, and procurement leader had before signing a statement of work. Four groups: Delivery Proof, Technical Depth, Engineering Practices, Business Fit.
Group 1 - Delivery Proof
1. "Show me three production systems you've shipped in the last 18 months." Green flag: specific deployments, named models/frameworks, user load, what broke in first 30 days, anonymised case studies. Red flag: demo videos, hackathon wins, "NDA" across all three.
2. "What's your PoC-to-production conversion rate?" Green flag: a specific number with context. Red flag: "every PoC goes to production" — either untrue or the PoCs proved nothing.
3. "Walk me through a project that went wrong and how you handled it." Green flag: genuine, uncomfortable story with root cause and process fix. Red flag: "nothing's ever gone wrong" — walk away.
Group 2 - Technical Depth
4. "Which agent framework — CrewAI, LangGraph, AutoGen — would you pick for my use case, and why?" Green flag: they ask about latency, failure tolerance, observability, team maintenance capacity before recommending. Red flag: "we always use X — it's the best."
5. "How do you evaluate LLM output quality over time?" Green flag: Ragas, DeepEval, Promptfoo, versioned test sets, CI regression, quality dashboards. Red flag: "we spot-check outputs."
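"Versioned test sets with CI regression" can sound abstract, so here is a minimal sketch of the idea in plain Python. The test-set format and the keyword scorer are illustrative stand-ins; real suites use an LLM-as-judge or tools like Ragas and DeepEval for scoring:

```python
def score_answer(answer: str, must_contain: list[str]) -> float:
    """Toy scorer: fraction of required facts present in the answer.
    A real suite would use an LLM judge or Ragas/DeepEval metrics."""
    hits = sum(1 for fact in must_contain if fact.lower() in answer.lower())
    return hits / len(must_contain)

# Hypothetical versioned test set -- in practice this lives in the repo
# (e.g. eval/testset_v3.json) and reruns in CI on every prompt or model change.
TEST_SET = [
    {"id": "refund-01", "question": "What is the refund window?",
     "must_contain": ["30 days"]},
    {"id": "refund-02", "question": "Can I return opened items?",
     "must_contain": ["store credit"]},
]

def run_regression(generate, threshold: float = 0.9) -> bool:
    """Fail the CI job if the mean score drops below the pinned threshold."""
    scores = [score_answer(generate(case["question"]), case["must_contain"])
              for case in TEST_SET]
    return sum(scores) / len(scores) >= threshold
```

The point to probe for isn't the scorer itself but the discipline around it: the test set is versioned alongside the prompts, and a regression blocks the deploy.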
6. "Describe your RAG pipeline: chunking, embedding, and reranking strategy." Green flag: semantic chunking, hybrid BM25+vector retrieval, query expansion, Cohere Rerank or a cross-encoder. Red flag: "we just use [vector DB] with OpenAI embeddings."
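Hybrid retrieval means fusing two ranked lists — lexical (BM25) and semantic (vector) — into one. Reciprocal rank fusion is a common way to do that, and the core of it fits in a few lines (pure Python sketch, no retrieval stack assumed):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of doc ids (best first) into a single ranking.
    Each list -- e.g. one from BM25, one from vector search -- contributes
    1 / (k + rank) per document; k=60 is the commonly used constant, and
    a higher k flattens the influence of top ranks."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A vendor with real RAG depth can explain not just this fusion step but why their chunking and reranking choices fit your corpus.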
Group 3 - Engineering Practices
7. "How do you handle prompt versioning and rollback?" Green flag: prompt registry (PromptLayer, LangSmith, Langfuse), tied to release versions, rollback under 5 min. Red flag: "we update in code and redeploy."
8. "Production observability for AI systems?" Green flag: distributed tracing (LangSmith, Langfuse, Arize, Helicone), cost/quality dashboards, hallucination-rate alerts, provider-outage playbook. Red flag: "we log to CloudWatch."
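Behind every observability tool on that list is a per-request trace record with enough fields to build cost and quality dashboards and alert on hallucination rate. A sketch of that record in plain Python — the prices and the `grounded` flag are illustrative assumptions, not real vendor figures:

```python
from dataclasses import dataclass

@dataclass
class LLMTrace:
    """Per-request record an observability stack keeps (LangSmith,
    Langfuse, Arize, Helicone all capture something like this)."""
    model: str
    prompt_tokens: int
    completion_tokens: int
    latency_ms: float
    grounded: bool  # did an automated faithfulness check pass?

# Assumed per-1K-token prices (input, output) -- check your provider's pricing.
PRICE_PER_1K = {"gpt-4o": (0.005, 0.015)}

def cost_usd(t: LLMTrace) -> float:
    p_in, p_out = PRICE_PER_1K[t.model]
    return t.prompt_tokens / 1000 * p_in + t.completion_tokens / 1000 * p_out

def hallucination_rate(traces: list[LLMTrace]) -> float:
    """The number you alert on -- a spike here pages someone."""
    return sum(not t.grounded for t in traces) / len(traces)
```

The red-flag answer ("we log to CloudWatch") captures none of this: no per-request cost, no grounding signal, nothing to alert on.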
9. "What guardrails and safety controls do you put in place?" Green flag: layered — input validation, NeMo Guardrails or Guardrails AI, PII redaction, jailbreak detection, audit logs. Red flag: "the model won't say bad things — we tested it."
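"Layered" means each check runs in sequence before the prompt reaches the model, and every action is recorded for audit. A toy sketch of one input layer — the patterns and jailbreak markers below are deliberately simplistic placeholders; production stacks use NeMo Guardrails or Guardrails AI plus dedicated PII and jailbreak classifiers:

```python
import re

# Illustrative patterns only -- real PII detection needs far broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}
JAILBREAK_MARKERS = ("ignore previous instructions", "you are now dan")

def guard_input(text: str) -> tuple[str, list[str]]:
    """Redact PII, flag jailbreak attempts, return text plus an audit trail."""
    events = []
    for label, pattern in PII_PATTERNS.items():
        text, n = pattern.subn(f"[{label.upper()}_REDACTED]", text)
        if n:
            events.append(f"redacted {n} {label}(s)")
    if any(marker in text.lower() for marker in JAILBREAK_MARKERS):
        events.append("jailbreak attempt flagged")
    return text, events
```

The key word in the green flag is *layered*: any single check will be bypassed eventually, so the design question is what catches what the previous layer missed.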
Group 4 - Business Fit
10. "Who specifically will be on my engagement — and are they senior?" Green flag: named individuals with LinkedIn profiles, GitHub history, written commitment that the lead stays. Red flag: "we'll assign at kickoff."
11. "Communication cadence and escalation process?" Green flag: weekly written updates, shared Slack, named escalation contact, 24-hour SLA, biweekly demos. Red flag: "monthly status report."
12. "If we wanted to take the system fully in-house in six months, how would you enable that?" Green flag: concrete KT plan — documentation standards, pair programming, runbooks, ADRs, formal handover milestone. Red flag: "Why would you want to do that?" — you're being sold a subscription, not a system.
Using the Checklist
You don't need all 12 answers to be perfect. You need all 12 to be specific, grounded, and intellectually honest. A consultancy that responds with concrete examples, named tools, and genuine trade-offs is a partner. One that responds with generalities or "we customise our approach" is a future post-mortem line item.
Print this. Take it into your next vendor meeting. Watch the room.
And if you'd like to see how we'd answer all 12 — with specifics, not slides — book 30 minutes at cal.com/hemangjoshi37a. Bring the hardest question on your list.