Anton Resnick

Posted on May 29 • Originally published at softwarebuilding.ai on May 7

How to Choose an AI Development Agency: 12 Questions That Separate Real Builders from Hype Shops

#ai #business #productivity #startup

Vetting an AI agency is harder than it should be. The category is two years old. Most of the people selling AI services last year were selling something else the year before. Demos are easy to fake. Pitch decks are easier. The signal is mostly in the questions you ask and the texture of the answers you get back.

We are an AI agency. We have been on both sides of these calls. Below are the twelve questions that, in our experience, do the most to separate teams that have shipped production AI from teams that have shipped a demo and a deck. We have included what a strong answer sounds like and what should make you walk.

Before you ask anything: decide if you should hire an agency at all

If you have a senior AI engineer in-house with capacity, you do not need an agency for most projects. The math changes if you need to ship faster than one engineer can, need broader tooling experience than one engineer has, or do not have that engineer and cannot hire one this quarter. We wrote a separate post on the agency-vs-in-house decision; read that first if you have not made it yet. The questions below assume you have decided agency is the right call.

The 12 questions

1. Show me a production AI agent you shipped that is still running.

Strong answer: a screen-share of a live system built by their own engineers, with at least six months in production. They will tell you what broke in the first month. If they hedge ("under NDA," "in private repos," "in a research environment"), they probably do not have one. "Demo videos" are not production.

2. What was the failure mode of your last project, and what did you learn?

Every team that has shipped AI has had something fail in production. Hallucination on an edge case, a runaway tool call, a permission leak, a model deprecation that broke an integration. If they cannot name a specific failure, they have not been there long enough.

3. Walk me through how you handle hallucinations on a high-stakes workflow.

Strong answer mentions at least three of: structured output validation with schema enforcement, an evaluation harness that runs on a held-out test set every deploy, human-in-the-loop checkpoints on high-cost actions, retry/fallback logic, and explicit denial paths (the agent saying "I do not know" instead of guessing). Weak answer is "better prompts" or "we use GPT-5."
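To make that list concrete, here is a minimal sketch of how schema validation, retries, a human checkpoint, and a denial path can fit together. It is an illustration, not any particular agency's implementation: the invoice fields, the `HIGH_COST_THRESHOLD` value, and the `call_model` callable are all hypothetical stand-ins for whatever your workflow and LLM client actually look like.

```python
import json
from typing import Callable, Optional

# Hypothetical schema for an invoice-approval workflow: field name -> required type.
REQUIRED_FIELDS = {"invoice_id": str, "amount": float, "approved": bool}
HIGH_COST_THRESHOLD = 1000.0  # hypothetical cutoff for routing to a human


def validate(raw: str) -> Optional[dict]:
    """Return the parsed payload if it matches the schema exactly, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or set(data) != set(REQUIRED_FIELDS):
        return None
    for key, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data[key], expected_type):
            return None
    return data


def run_guarded(call_model: Callable[[str], str], prompt: str,
                max_attempts: int = 3) -> dict:
    """Retry on malformed output, escalate costly actions, decline instead of guessing."""
    for _ in range(max_attempts):
        data = validate(call_model(prompt))
        if data is None:
            continue  # malformed or off-schema output: try again
        if data["amount"] >= HIGH_COST_THRESHOLD:
            # Human-in-the-loop checkpoint for high-cost actions.
            return {"status": "needs_human_review", "payload": data}
        return {"status": "ok", "payload": data}
    # Explicit denial path: the system says it does not know.
    return {"status": "declined", "reason": "no valid output after retries"}
```

A team with a real answer to this question will have something like each branch above, usually with a proper schema library in place of the hand-rolled type check.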

4. What is your evaluation harness? Show me a recent eval run.

Production AI teams have a test set the agent runs against on every deploy. Regressions block merges. If they do not have one, every release is roulette. This is the single best filter for engineering maturity in 2026.
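For readers who have not seen one, the core of an eval gate can be very small. The sketch below assumes exact-match scoring and a stored baseline score, both of which are simplifications; real harnesses typically use richer graders. The names `run_eval` and `ci_exit_code` are ours, chosen for illustration.

```python
from typing import Callable, List, Tuple

EvalCase = Tuple[str, str]  # (prompt, expected answer)


def run_eval(agent: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Return the fraction of held-out cases the agent answers exactly."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(1 for prompt, expected in cases
                 if agent(prompt).strip() == expected)
    return passed / len(cases)


def ci_exit_code(score: float, baseline: float) -> int:
    """0 lets the merge through; 1 blocks it because the score regressed."""
    return 0 if score >= baseline else 1
```

In a CI job, the result of `ci_exit_code` would be passed to `sys.exit` so that a score below the baseline fails the pipeline. When you ask to see a recent eval run, you are asking to see the output of something shaped like this.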

5. How will the system handle the eventual model migration?

Models get deprecated. Pricing changes. New models behave differently. Strong answer: "Our agent code is decoupled from the specific model. We have a thin provider layer; swapping from Claude 4.6 to 4.7 is a config change plus an eval run." Weak answer: "We have not had to do that yet."
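A "thin provider layer" can be sketched in a few lines. The example below is a hypothetical shape, not a specific vendor SDK: `EchoProvider` stands in for a real client wrapper, and the model names `model-a` and `model-b` are placeholders. The point is that the agent code depends only on the `ChatProvider` interface, so changing models means changing the config.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Protocol


class ChatProvider(Protocol):
    """The only surface the agent code is allowed to depend on."""
    def complete(self, prompt: str) -> str: ...


@dataclass
class EchoProvider:
    """Stand-in for a real vendor wrapper; it just echoes the prompt."""
    model: str

    def complete(self, prompt: str) -> str:
        return f"[{self.model}] {prompt}"


# Registry of provider factories, keyed by the name used in config.
REGISTRY: Dict[str, Callable[[str], ChatProvider]] = {"echo": EchoProvider}


def build_provider(config: dict) -> ChatProvider:
    """Pick the provider and model from config, not from agent code."""
    return REGISTRY[config["provider"]](config["model"])


def summarize_ticket(provider: ChatProvider, ticket: str) -> str:
    """Agent logic: knows nothing about which model sits behind the provider."""
    return provider.complete(f"Summarize: {ticket}")
```

Swapping `{"provider": "echo", "model": "model-a"}` for `{"provider": "echo", "model": "model-b"}` changes the model without touching `summarize_ticket`. The eval run is what tells you whether the new model still behaves.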

6. Who owns the code, the data, and the model accounts?

The right answer is: you do. Code in your GitHub. Data in your accounts. LLM API keys in your name. If they keep the keys, the data lives in their infra, or the code is in a private repo they control, you are renting your AI from them. That is sometimes fine, but you should be choosing it intentionally, not having it sneak in.

7. What does the handoff look like if we want to take this in-house in 12 months?

Even if you intend to keep the agency forever, ask. The answer reveals the shape of the work. Strong: "Documented architecture, runbooks, eval harness, and a 2-week handoff sprint to your new engineer." Weak: "That has never come up" or "Why would you want to do that?"

8. Who specifically will be doing the engineering on my project?

Get names. Get GitHub profiles. Get LinkedIn profiles. Some agencies sell senior talent in the demo and assign juniors to the build. The technical lead on the demo should be the technical lead on your project, or there should be an explicit, named handoff to a peer-level engineer.

9. What is your pricing model? Why?

Hourly is fine for discovery and unknowns. Fixed-price is fine after a discovery sprint produces a tight scope. Open-ended time-and-materials for the whole build, with no discovery first, is a red flag: it means they have not bothered to scope. The why matters more than the model.

10. What is the fastest you can ship a working v1?

Strong answer is concrete and grounded: "Three weeks for a single-workflow agent if your data is in good shape; six to eight if we have to clean it up first." Weak answer is generic: "It depends, every project is different." Of course it depends. The question is whether they have shipped enough to estimate.

11. What part of this would you tell me NOT to build?

Best signal of all. A team that has shipped real AI will know which parts of your idea are bad ideas. They will tell you. A team that says "we can build all of it" is selling a deliverable, not a system. The strongest agencies we know have killed at least one of their own pitched projects by talking the client out of it.

12. If we are unhappy in week three, what happens?

Strong answer: "We have an off-ramp clause — you can stop the project at the end of any sprint. We hand over what we have built and the prepaid balance is credited or refunded." Weak answer: a contract you cannot exit without paying for unfinished work. The off-ramp answer alone tells you a lot about the kind of team you are dealing with.

Hidden red flags inside good-looking answers

They have an "AI Center of Excellence" but cannot name the specific engineer who would lead your build. Often a reseller of someone else's work.
Every answer references LangChain by name regardless of the question. The framework is fine; the over-reliance is not.
"We are partners with [OpenAI / Anthropic / AWS]." Almost everyone is. Tells you nothing about engineering.
They cannot explain why they would NOT use a vector database for your use case. If everything is RAG, they have a hammer.
No public engineering writing. Not a deal-breaker, but the strongest teams almost always have something — a blog, a GitHub, a conference talk.

What strong answers actually sound like

We did not write this post to make you ask us those questions. We wrote it because we have watched too many AI projects start with the wrong agency and stall by month four. The right agency for you might be us. It might not. The questions above will help you tell.

If you want to pressure-test these questions on a live call before you take them to other agencies, book a free strategy call. We will answer them on the record and tell you which ones you should weight most heavily for your specific project.

Talk to us, or read more

Originally published at https://softwarebuilding.ai/blog/how-to-choose-an-ai-development-agency.
