Autor Technologies Inc.

Posted on Jun 29

How to Evaluate an AI Development Studio in 2026 (The 8 Questions We'd Ask)

#typescript #ai #webdev #machinelearning

We've built over 50 AI products in four years. We've also watched companies waste six figures hiring the wrong development partner — teams that demo well, talk a great game about "leveraging AI," and then deliver a chatbot wrapper around GPT-3.5 with a React frontend. We've been brought in to rescue three of those projects in the last year alone.

So we sat down and asked ourselves: if we were a company looking to hire an AI development studio in 2026 — even if it wasn't us — what would we actually ask? Not the softball stuff on their website. The questions that separate studios that ship production AI from those that ship demos.

Why This Matters More Now Than a Year Ago

The number of AI development studios, in Canada and globally, has exploded. A quick search turns up hundreds of them claiming to build "custom AI solutions." Most stood up a website in 2023, built a few proofs of concept, and started charging enterprise rates. The barrier to calling yourself an AI development studio is essentially zero.

Meanwhile, the technical bar for production AI has gotten significantly higher. Customers expect sub-second response times. Healthcare and dental clients expect PHIPA and PIPEDA compliance out of the box. Everyone expects their AI to actually work at 2am on a Sunday, not just during a polished demo.

This gap — between how easy it is to claim expertise and how hard it is to deliver — is the reason you need better questions.

The 8 Questions

1. "What's running in production right now, and can I see the metrics?"

This is the single most important question. Not "what have you built" — what's running right now, handling real users, real data, real edge cases?

When we answer this, we show Loquent: thousands of automated calls per month, 89% automation rate, 4.3/5 patient satisfaction, sub-second first-response latency. These aren't pitch deck numbers. They're from our live Grafana dashboards.
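When a studio shows you numbers like these, it helps to know how they are derived. Here is a minimal TypeScript sketch of an automation rate and a latency percentile computed from raw call records. The `CallRecord` shape, the field names, and the sample values are hypothetical illustrations, not Loquent's actual schema or data.

```typescript
// Hypothetical record shape; field names are illustrative only.
interface CallRecord {
  handledWithoutHuman: boolean; // true if the call never needed a human transfer
  firstResponseMs: number;      // time until the first AI response, in milliseconds
}

// Share of calls completed without a human, as a percentage.
function automationRate(calls: CallRecord[]): number {
  if (calls.length === 0) return 0;
  const automated = calls.filter((c) => c.handledWithoutHuman).length;
  return (automated / calls.length) * 100;
}

// Nearest-rank percentile (p between 0 and 100) of first-response latency.
function latencyPercentile(calls: CallRecord[], p: number): number {
  if (calls.length === 0) return 0;
  const sorted = calls.map((c) => c.firstResponseMs).sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Made-up sample data, only to exercise the functions.
const sampleCalls: CallRecord[] = [
  { handledWithoutHuman: true, firstResponseMs: 420 },
  { handledWithoutHuman: true, firstResponseMs: 610 },
  { handledWithoutHuman: false, firstResponseMs: 950 },
  { handledWithoutHuman: true, firstResponseMs: 380 },
];

const sampleAutomationRate = automationRate(sampleCalls);   // 75
const sampleP95Latency = latencyPercentile(sampleCalls, 95); // 950
```

A useful follow-up question is whether the latency figure is an average or a percentile. An average can look sub-second while the slowest calls are much worse.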

A studio that can only point to completed projects but nothing currently in production is telling you they build things and hand them off. That's fine for a marketing website. It's not fine for AI that needs to handle real-world chaos at 3am.

Red flag: "We built an AI solution for [big company name] but can't share details due to NDA." Everyone has NDAs. Studios that actually ship production AI can at least describe the architecture, the scale, and the outcomes without naming the client.

2. "Walk me through a production incident you had and how you handled it."

We wrote an entire article about a 3am production failure during our healthcare AI launch. The short version: a Deepgram model update silently changed our transcription accuracy, we didn't have model version pinning, and patients in Montreal started getting transferred to humans at double our normal rate.

Any studio with real production experience has war stories. If they don't, they either haven't run anything in production long enough to hit problems, or they're not being honest about what happened. Both are bad signs.

What you want to hear: specific technical details about what broke, how they detected it, how long it took to fix, and what they changed to prevent it from happening again. Not "we had a minor issue and resolved it quickly."
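The "how they detected it" part can be surprisingly small. As a rough sketch, assuming you already aggregate call counts per region and per time window, a check like the following would flag a region whose human-transfer rate jumps well above its baseline. Every name, threshold, and number here is a hypothetical illustration, not our production alerting.

```typescript
interface WindowStats {
  region: string;             // hypothetical region label, e.g. "montreal"
  totalCalls: number;         // calls seen in this time window
  transferredToHuman: number; // calls the AI handed off to a person
}

interface TransferAlert {
  region: string;
  observedRate: number;
  baselineRate: number;
}

// Flags regions whose transfer rate is at least baseline * ratioThreshold.
function detectTransferSpikes(
  windows: WindowStats[],
  baselineRates: { [region: string]: number | undefined },
  ratioThreshold: number,
  minCalls: number
): TransferAlert[] {
  const alerts: TransferAlert[] = [];
  for (const w of windows) {
    const baseline = baselineRates[w.region];
    // Skip regions with no baseline or too few calls to judge reliably.
    if (baseline === undefined || w.totalCalls < minCalls) continue;
    const observed = w.transferredToHuman / w.totalCalls;
    if (observed >= baseline * ratioThreshold) {
      alerts.push({ region: w.region, observedRate: observed, baselineRate: baseline });
    }
  }
  return alerts;
}

// Made-up numbers: one region roughly doubles its transfer rate.
const sampleWindows: WindowStats[] = [
  { region: "montreal", totalCalls: 200, transferredToHuman: 44 },
  { region: "toronto", totalCalls: 200, transferredToHuman: 24 },
  { region: "halifax", totalCalls: 10, transferredToHuman: 9 },
];
const sampleBaselines = { montreal: 0.11, toronto: 0.11, halifax: 0.11 };

const transferAlerts = detectTransferSpikes(sampleWindows, sampleBaselines, 1.8, 50);
```

With these sample values only `montreal` is flagged. `halifax` is skipped because ten calls is below the `minCalls` floor, which keeps a noisy low-volume window from paging someone.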

3. "What's your team structure, and who exactly will work on my project?"

At Autor, we're senior engineers only. No offshore. No handoffs. The person you talk to in the sales call is the person writing your code. That's a deliberate choice — we charge $150/hr and keep the team small enough that quality never becomes someone else's problem.

Many studios operate differently: a senior architect scopes the work, mid-level developers build it, and offshore contractors handle the grunt work. Sometimes that model works. But for AI specifically, the gap between "it works in the demo" and "it works in production" is almost always a senior engineering problem. Junior developers can wire up an OpenAI API call. Handling the edge cases, building the monitoring, tuning the prompts — that takes experience.

Red flag: vague answers about team size, phrases like "we scale the team based on project needs," or an inability to name the specific engineers who'll work on your project.

4. "How do you handle data privacy and compliance?"

If you're in healthcare, dental, legal, or financial services in Canada, this question will, in our experience, eliminate roughly 70% of studios immediately.

We built Loquent from day one with PHIPA and PIPEDA compliance as architectural constraints, not afterthoughts. Patient call data is encrypted at rest and in transit, stored on Canadian servers, access-logged with retention policies, and we can produce a complete audit trail for any patient interaction. We went through this with our healthcare clients' legal teams before writing a line of code.

A studio that says "we can add compliance later" or "we use [US cloud provider] but it's probably fine" doesn't understand how healthcare data works in Canada. Compliance isn't a feature you bolt on. It's a design constraint that affects your database schema, your deployment architecture, your logging strategy, and your vendor selection.

What you want to hear: specific compliance frameworks they've implemented, where data is stored, how access is controlled, and whether they've survived a client's legal or compliance review.
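To give a sense of what "audit trail" means at the code level, here is a minimal in-memory sketch in TypeScript. It assumes an append-only log of access events and a retention window measured in days. The types, class, and method names are hypothetical, and a real system would persist entries to durable storage and follow whatever retention rules the client's legal team specifies.

```typescript
type AccessAction = "read" | "update" | "export" | "delete";

interface AuditEntry {
  actorId: string;  // who accessed the record
  recordId: string; // which patient interaction was touched
  action: AccessAction;
  atMs: number;     // timestamp in epoch milliseconds
}

class AuditTrail {
  private entries: AuditEntry[] = [];

  // Append an access event. A copy is stored so the caller cannot alter history later.
  record(entry: AuditEntry): void {
    this.entries.push({ ...entry });
  }

  // Every logged access to one record, oldest first.
  trailFor(recordId: string): AuditEntry[] {
    return this.entries
      .filter((e) => e.recordId === recordId)
      .sort((a, b) => a.atMs - b.atMs);
  }

  // Entries older than the retention window, which a purge or archive job would act on.
  expired(nowMs: number, retentionDays: number): AuditEntry[] {
    const cutoff = nowMs - retentionDays * 24 * 60 * 60 * 1000;
    return this.entries.filter((e) => e.atMs < cutoff);
  }
}

const DAY_MS = 24 * 60 * 60 * 1000;
const auditNow = 1000 * DAY_MS; // arbitrary fixed "now" so the example is deterministic

const trail = new AuditTrail();
trail.record({ actorId: "agent-7", recordId: "call-42", action: "update", atMs: auditNow - 2 * DAY_MS });
trail.record({ actorId: "agent-3", recordId: "call-42", action: "read", atMs: auditNow - 400 * DAY_MS });
trail.record({ actorId: "agent-3", recordId: "call-99", action: "export", atMs: auditNow - 1 * DAY_MS });

const call42Trail = trail.trailFor("call-42");
const pastRetention = trail.expired(auditNow, 365);
```

The point of the question is whether the studio can answer "who touched this record, and when?" for any single interaction. If that query does not exist somewhere in their system, the audit trail does not exist either.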

5. "What happens after you deliver? Who's on call?"

We've inherited projects from three different studios in the past year. In every case, the original studio delivered a working product, the client signed off, and then things started breaking. The original team had moved on to the next project. Support was an email address that got checked "within 48 hours."

Production AI isn't a build-and-forget deliverable. Models drift. Vendors push updates. User behavior changes. Call volumes spike. You need someone who monitors the system, responds to incidents, and proactively identifies degradation before users notice.

Ask for their SLA terms. Ask what monitoring they run. Ask what happens at 3am on a Saturday when something breaks. At Autor, we run our own AI products in production — Loquent is ours, not a client handoff. So we're already monitoring and maintaining production AI systems around the clock because our own reputation depends on it.

Red flag: "We offer a maintenance package" with no details on response times, monitoring, or what's actually included.

6. "Can you show me your testing process for AI-specific failures?"

Traditional software testing — unit tests, integration tests, end-to-end tests — is necessary but not sufficient for AI systems. AI has failure modes that conventional tests don't catch: prompt regressions, model drift, transcription accuracy changes, hallucination under edge-case inputs.

We built a test harness that replays 200 real call transcripts against every prompt change and flags regressions in intent detection, entity extraction, and task completion rate. We pin vendor model versions and test new versions against a saved corpus before promoting to production. We learned to do this the hard way, after a surprise model update dropped our Montreal automation rate by 9 points overnight.
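The core idea of a replay harness fits in a few lines. The sketch below is a simplified TypeScript illustration, not our actual harness: it checks intent detection only, and the two classifiers are keyword stubs standing in for "current prompt or model version" and "candidate version." In a real setup those would be asynchronous model calls, and all names here are hypothetical.

```typescript
interface TranscriptCase {
  id: string;
  transcript: string;
  expectedIntent: string;
}

// Stand-in for a call to the model under a given prompt or pinned model version.
type IntentClassifier = (transcript: string) => string;

interface ReplayReport {
  accuracy: number;      // fraction of cases where the candidate returned the expected intent
  regressions: string[]; // ids that passed on the baseline but fail on the candidate
}

function replay(
  corpus: TranscriptCase[],
  baseline: IntentClassifier,
  candidate: IntentClassifier
): ReplayReport {
  const regressions: string[] = [];
  let correct = 0;
  for (const c of corpus) {
    const baselineOk = baseline(c.transcript) === c.expectedIntent;
    const candidateOk = candidate(c.transcript) === c.expectedIntent;
    if (candidateOk) correct += 1;
    if (baselineOk && !candidateOk) regressions.push(c.id);
  }
  const accuracy = corpus.length === 0 ? 1 : correct / corpus.length;
  return { accuracy, regressions };
}

// Promotion gate: nothing may regress and accuracy must clear a floor.
function canPromote(report: ReplayReport, minAccuracy: number): boolean {
  return report.regressions.length === 0 && report.accuracy >= minAccuracy;
}

// Invented transcripts and keyword stubs, only to show the mechanics.
const corpus: TranscriptCase[] = [
  { id: "c1", transcript: "I need to book a cleaning", expectedIntent: "book" },
  { id: "c2", transcript: "Please cancel my booking for Friday", expectedIntent: "cancel" },
  { id: "c3", transcript: "What are your hours?", expectedIntent: "other" },
];
const baselineStub: IntentClassifier = (t) =>
  /cancel/i.test(t) ? "cancel" : /book/i.test(t) ? "book" : "other";
const candidateStub: IntentClassifier = (t) =>
  /book/i.test(t) ? "book" : /cancel/i.test(t) ? "cancel" : "other";

const replayReport = replay(corpus, baselineStub, candidateStub);
const promote = canPromote(replayReport, 0.6);
```

Here the candidate misreads "cancel my booking" as a booking request, so `c2` is reported as a regression and the gate blocks promotion even though overall accuracy is above the floor. Tracking regressions per case, rather than accuracy alone, is what catches a change that fixes one transcript while breaking another.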

A studio that tests AI products the same way they test a CRUD app will ship you a product that works great on demo day and degrades unpredictably in production.

What you want to hear: specific testing strategies for prompt regression, model version management, and real-world conversation replay testing. Not just "we have CI/CD."

7. "What AI infrastructure decisions would you make differently if you started over?"

This is a trap question, in the best way. A studio with real production experience has a long list of things they'd do differently. We certainly do: we'd build a graph-based conversation state machine instead of a linear flow, invest in observability from day one instead of bolting on Grafana at month three, and build our regression test corpus from the first call instead of waiting until month four.
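For readers wondering what "graph-based" means in practice, here is a minimal TypeScript sketch of a conversation state machine defined as a transition table. The states and events are invented for illustration and are not Loquent's real call flow. The difference from a linear flow is that any state can route to any other, so a caller who changes their mind at the confirmation step can loop back instead of derailing the call.

```typescript
type StateId = "greeting" | "collect_details" | "confirm" | "human_handoff" | "done";

// Each state lists the events it accepts and the state each event leads to.
const transitions: Record<StateId, { [event: string]: StateId | undefined }> = {
  greeting: { intent_book: "collect_details", unclear: "human_handoff" },
  collect_details: {
    details_complete: "confirm",
    change_request: "collect_details",
    unclear: "human_handoff",
  },
  confirm: { yes: "done", change_request: "collect_details", unclear: "human_handoff" },
  human_handoff: {}, // terminal: a person takes over
  done: {},          // terminal: task completed
};

// Advance one step. Events the current state does not accept leave the state unchanged.
function step(current: StateId, event: string): StateId {
  const edges = transitions[current];
  const next = Object.prototype.hasOwnProperty.call(edges, event) ? edges[event] : undefined;
  return next === undefined ? current : next;
}

// Fold a sequence of events into a final state, starting from "greeting" by default.
function run(events: string[], start: StateId = "greeting"): StateId {
  let state = start;
  for (const e of events) state = step(state, e);
  return state;
}

// The caller books, confirms details, changes their mind, then confirms again.
const finalState = run([
  "intent_book",
  "details_complete",
  "change_request",
  "details_complete",
  "yes",
]);
```

In this example the path goes from `confirm` back to `collect_details` and still ends in `done`. Keeping the graph as data also makes it easy to list every route into `human_handoff`, which is the kind of thing you want to monitor.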

A studio that says "we'd do everything the same" either hasn't run anything long enough to learn, or isn't self-aware enough to be trusted with your project. The best engineers are the ones who can articulate their own past mistakes clearly.

Red flag: confident assertions that their architecture and process are already optimal. Nobody's are.

8. "What would you tell us NOT to build?"

The most valuable thing an AI development partner can do is talk you out of building the wrong thing. We turned down a $200k project last month because the client's requirements would have produced a product their users didn't actually need. That's $200k we could have taken, but shipping something doomed to fail is worse for everyone.

Ask the studio what kinds of projects they've declined. Ask them to look at your requirements and tell you what's unnecessary, what's premature, and what would be a waste of money. A studio that says yes to everything is optimizing for their revenue, not your outcome.

What you want to hear: honest pushback on your own assumptions, specific examples of projects they've declined or descoped, and a willingness to tell you things you don't want to hear.

How to Weight These Answers

Not all eight questions carry equal weight. Here's how we'd prioritize:

1. Production systems with live metrics (Question 1): this is pass/fail. No current production system, no consideration.
2. Compliance and data privacy (Question 4): if you're in a regulated industry in Canada, this is also pass/fail.
3. Post-delivery support and monitoring (Question 5): this is where most projects actually fail, six months after "delivery."
4. War stories and honest retrospectives (Questions 2 and 7): these reveal real experience vs. marketing polish.
5. Everything else (Questions 3, 6, 8): important differentiators once you've passed the first four filters.

If a studio clearly passes the two pass/fail filters (Questions 1 and 4), you're probably in good hands. The remaining six questions help you choose between the studios that make it through.
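If you want to turn that priority order into something you can fill in after each studio call, a gated scorecard is one option. The TypeScript sketch below treats Questions 1 and 4 as pass/fail and weights the rest. The 0 to 5 scale, the pass mark, and the specific weights are our assumptions for illustration; adjust them to your own priorities.

```typescript
type Scores = { [question: number]: number }; // question number (1-8) -> score from 0 to 5

const PASS_MARK = 3; // assumed minimum score to clear a pass/fail question

// Assumed weights mirroring the priority order: Q5 highest, then Q2 and Q7, then the rest.
const weights: { [question: number]: number } = { 5: 3, 2: 2, 7: 2, 3: 1, 6: 1, 8: 1 };
const weightedQuestions = [5, 2, 7, 3, 6, 8];

// Returns null if a pass/fail filter is failed, otherwise a weighted score from 0 to 100.
function evaluateStudio(scores: Scores, regulatedIndustry: boolean): number | null {
  // Question 4 is only a hard gate for buyers in a regulated industry.
  const gates = regulatedIndustry ? [1, 4] : [1];
  for (const q of gates) {
    if ((scores[q] || 0) < PASS_MARK) return null;
  }
  let total = 0;
  let max = 0;
  for (const q of weightedQuestions) {
    total += (scores[q] || 0) * weights[q];
    max += 5 * weights[q];
  }
  return Math.round((total / max) * 100);
}

// Invented scores for two hypothetical studios.
const studioA: Scores = { 1: 5, 4: 4, 5: 4, 2: 3, 7: 5, 3: 4, 6: 2, 8: 5 };
const studioB: Scores = { 1: 2, 4: 5, 5: 5, 2: 5, 7: 5, 3: 5, 6: 5, 8: 5 };

const resultA = evaluateStudio(studioA, true); // 78
const resultB = evaluateStudio(studioB, true); // null: fails the Question 1 gate
```

Note that the hypothetical studio B scores perfectly on everything except Question 1 and is still eliminated. That is the intent of a pass/fail filter: strong answers elsewhere cannot compensate for having nothing in production.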

The Uncomfortable Truth

Most companies hiring an AI development studio in 2026 will make their decision based on the proposal deck, the quoted price, and whether the sales call "felt good." They'll end up with a product that works in the demo and breaks in production, built by a team they never met, running on infrastructure nobody monitors.

The eight questions above aren't complicated. They just require the studio to be honest about what they've actually built, what went wrong, and what happens after they get paid. The studios that can answer them confidently are the ones worth hiring. The ones that can't will give themselves away in how they dodge the questions.

We've been on both sides of this — as the studio being evaluated and as the team brought in to fix what another studio delivered. The difference between a good outcome and a bad one almost always comes down to whether someone asked the hard questions before signing the contract.

If you're evaluating studios right now — including us — we'd love to hear what questions you're asking. Reach out at hello@autor.ca or visit autor.ca.

DEV Community