DEV Community

Serhii Panchyshyn

How to Roll Out an Internal AI Product Without Lying to Yourself

I've helped teams roll out AI products for the past two years.

The same failure pattern shows up almost every time.

They build something that demos well. Leadership gets excited. They ship it to 50 users in week one. Within two weeks, trust is destroyed and the project gets shelved 😅

The teams that succeed do something different. This is the playbook I walk clients through now.


The problem I see everywhere

Most teams measure AI rollouts wrong.

They track one number. "Accuracy" or "user satisfaction" or something equally vague. The number looks good. They ship broadly. Then real users hit edge cases, the agent hallucinates, and suddenly everyone thinks "AI doesn't work for us."

The issue isn't the AI. The issue is they never built the infrastructure to see what was actually happening.

You can't improve what you can't observe. And most teams can't observe anything.


The rollout framework that works

Here's what I advise now. Nine steps, usually 6-8 weeks before external users.


Step 1: Start with 3 users, not 30

Every team wants to move fast. "Let's get feedback from the whole department!"

I push back hard on this.

More users means more noise. You can't inspect every trace. You start pattern-matching on vibes instead of data.

The right first cohort:

  • 3 people who actually need the tool for real work
  • Different roles (support, ops, sales)
  • Direct channel to the eng team

One client started with 30 users. They couldn't keep up with trace review, rolled back to 5, and found more bugs in one week than in the previous month.

// What I recommend tracking for each early user
interface EarlyUserContext {
  userId: string;
  role: string;           // "support", "ops", "sales"
  primaryUseCase: string; // "answer customer questions"
  feedbackChannel: string; // direct line to eng team
}

Step 2: Instrument everything before anyone touches it

This is where most teams cut corners. They want to ship. Observability feels like overhead.

It's not optional.

Before the first user session, you need to answer these questions from your traces:

  1. What query did the user send?
  2. What tools did the agent consider?
  3. Which tool did it pick and why?
  4. What context was in the window?
  5. What was the final response?
  6. Did the user accept, edit, or reject it?

I've seen teams ship without trace logging. They have no idea why things fail. They guess. They tweak prompts randomly. Nothing improves.

// Minimum viable trace structure
interface AgentTrace {
  runId: string;
  userId: string;
  query: string;
  toolsConsidered: string[];
  toolSelected: string;
  contextSummary: string;
  response: string;
  userFeedback: "accepted" | "edited" | "rejected" | null;
  latencyMs: number;
}

LangSmith, Langfuse, whatever. The tool matters less than having something.


Step 3: Review every trace for the first week

Yes, every single one.

This is where you learn what's actually broken. Not what you assumed was broken.

I sit with clients and review traces together. Same patterns show up:

  • Wrong tool selection: Agent picked searchOrders when it should have picked searchShipments
  • Missing context: Agent couldn't answer because the right doc wasn't retrieved
  • Hallucinations: Agent made up data that doesn't exist
  • Premature stopping: Agent gave up too early
  • Slow responses: Anything over 10 seconds feels broken

Create a simple spreadsheet. Log every failure. Categorize them.

Run ID   Failure Type    Root Cause         Fix
------   ------------    ----------         ---
abc123   Wrong tool      Vague tool name    Renamed function
def456   Hallucination   No source doc      Added missing doc
ghi789   Slow response   Too much context   Scoped retrieval

After one week, you'll have a clear picture. This spreadsheet becomes your roadmap.
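The spreadsheet can be as simple as a typed record per failure. A minimal sketch (the category names mirror the patterns above; the field names are illustrative, not a standard):

```typescript
// Hypothetical failure-log entry mirroring the spreadsheet columns above.
type FailureType =
  | "wrong_tool"
  | "missing_context"
  | "hallucination"
  | "premature_stop"
  | "slow_response";

interface FailureLogEntry {
  runId: string;
  failureType: FailureType;
  rootCause: string;
  fix: string;
}

// Count entries per failure type to see where to focus first.
function countByType(log: FailureLogEntry[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const entry of log) {
    counts[entry.failureType] = (counts[entry.failureType] ?? 0) + 1;
  }
  return counts;
}
```

Sorting the counts tells you which fix buys the most: ten wrong-tool failures and one hallucination means you start with tool names, not prompts.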


Step 4: Fix perception before prompts

Here's the insight that saves teams weeks of wasted effort:

90% of early failures come from three sources:

  1. Bad tool names and descriptions
  2. Missing or wrong context
  3. Retrieval pulling irrelevant docs

These aren't prompt problems. They're perception problems.

I tell clients: the agent can only do the right thing if it can see the right things.

// Before: I see this constantly
const tool = {
  name: "handleData",
  description: "Handles data operations"
}

// After: Clear enough for the model to reason about
const tool = {
  name: "createShipmentFromOrder",  
  description: "Creates a new shipment record from an existing order. Requires orderId. Returns shipmentId and tracking number."
}

One client renamed 12 tools in week one. Tool selection accuracy went from 60% to 87%. No prompt changes.


Step 5: Build evals from your failures

Don't build generic evals. Build evals from the specific failures you observed.

Every row in that failure spreadsheet becomes a test case.

// Example eval case from a real client failure
const evalCase = {
  id: "shipment-status-check",
  query: "What's the status of order 12345?",
  expectedTool: "getShipmentByOrderId",
  expectedBehavior: "Return actual status from database",
  failureWeObserved: "Agent said 'delivered' without checking",
  groundTruth: "in_transit"
}

One team I worked with had 47 eval cases after two weeks. All from actual user sessions. All testing things that actually broke.

Generic benchmarks tell you nothing. Failure-driven evals tell you everything.
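A runner for these cases doesn't need a framework. A sketch, with `agent` as a stand-in for your actual agent call (synchronous here for brevity; real calls are async):

```typescript
// Hypothetical eval runner for failure-driven cases like the one above.
interface EvalCase {
  id: string;
  query: string;
  expectedTool: string;
  groundTruth: string;
}

interface EvalResult {
  id: string;
  toolCorrect: boolean;
  answerCorrect: boolean;
}

function runEvals(
  cases: EvalCase[],
  agent: (query: string) => { tool: string; answer: string }
): EvalResult[] {
  return cases.map((c) => {
    const out = agent(c.query);
    return {
      id: c.id,
      // Did the agent pick the tool this failure case requires?
      toolCorrect: out.tool === c.expectedTool,
      // Does the answer actually contain the ground-truth value,
      // instead of a made-up status like the failure we observed?
      answerCorrect: out.answer.includes(c.groundTruth),
    };
  });
}
```

Scoring tool choice and answer content separately matters here too: a case can pick the right tool and still hallucinate the answer.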


Step 6: Measure the right things separately

This is where most teams lie to themselves.

They compute one accuracy number. "We're at 85%!" Leadership is happy. But 85% of what?

I push clients to measure these separately:

interface AgentMetrics {
  // Did we pick the right tool?
  toolSelectionAccuracy: number;

  // Did we retrieve relevant docs?
  retrievalRecall: number;

  // Did the final answer match ground truth?
  answerCorrectness: number;

  // Did we cite the right sources?
  groundingAccuracy: number;

  // Did the user accept the response?
  userAcceptanceRate: number;
}

You can have 95% tool selection and 40% answer correctness. That means retrieval or synthesis is broken.

You can have 90% answer correctness and 60% user acceptance. That means the answer is technically right but useless in practice.

Separate metrics tell you where to focus. One number tells you nothing.
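The two diagnoses above can be mechanized. A sketch that turns metric gaps into a pointer at the broken stage; the thresholds are assumptions from my own rollouts, not universal rules:

```typescript
// AgentMetrics as defined in this step; fractions in [0, 1].
interface AgentMetrics {
  toolSelectionAccuracy: number;
  retrievalRecall: number;
  answerCorrectness: number;
  groundingAccuracy: number;
  userAcceptanceRate: number;
}

function diagnose(m: AgentMetrics): string[] {
  const findings: string[] = [];
  // High tool accuracy but low correctness: the break is after tool
  // selection, in retrieval or synthesis.
  if (m.toolSelectionAccuracy > 0.9 && m.answerCorrectness < 0.6) {
    findings.push(m.retrievalRecall < 0.7 ? "retrieval" : "synthesis");
  }
  // Correct answers users still reject: technically right, useless
  // in practice.
  if (m.answerCorrectness > 0.85 && m.userAcceptanceRate < 0.65) {
    findings.push("answer format or relevance");
  }
  return findings;
}
```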


Step 7: Expand slowly with permission gates

After 2 weeks with 3 users, you might be ready for 10.

Don't flip a switch. Add gates.

const canUseAgent = (user: User): boolean => {
  // Phase 1: Named early adopters
  if (ROLLOUT_PHASE === 1) {
    return earlyAdopters.includes(user.id);
  }

  // Phase 2: Specific teams
  if (ROLLOUT_PHASE === 2) {
    return user.team === "support" || user.team === "ops";
  }

  // Phase 3: Everyone
  return true;
}

Each phase should last at least a week. Each phase needs its own baseline metrics.

If metrics drop when you expand, you've found a gap. That's good. That's the system working.


Step 8: Watch for drift

The first week is not representative.

Early users are curious. They ask simple questions. They're forgiving.

By week 4, they're using it for real work. Queries get harder. Edge cases appear. Patience drops.

I tell clients to track metrics weekly, not just at launch:

Week 1: 87% tool accuracy, 72% answer correctness
Week 2: 85% tool accuracy, 75% answer correctness  
Week 3: 83% tool accuracy, 71% answer correctness
Week 4: 79% tool accuracy, 68% answer correctness  ← investigate

If metrics drift down, dig into traces. Usually it's new use cases, missing docs, or users learning to ask harder questions.
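Catching a slide like the one above is easy to automate from weekly snapshots. A sketch, assuming you store one aggregate per week; flagging after two consecutive down weeks is my habit, not a rule:

```typescript
// Hypothetical weekly metric snapshot.
interface WeeklySnapshot {
  week: number;
  toolAccuracy: number;
  answerCorrectness: number;
}

// Flag any metric that has declined for `window` consecutive weeks.
function driftingMetrics(history: WeeklySnapshot[], window = 2): string[] {
  if (history.length < window + 1) return [];
  const recent = history.slice(-(window + 1));
  const declining = (values: number[]) =>
    values.every((v, i) => i === 0 || v < values[i - 1]);
  const flags: string[] = [];
  if (declining(recent.map((s) => s.toolAccuracy))) flags.push("toolAccuracy");
  if (declining(recent.map((s) => s.answerCorrectness))) flags.push("answerCorrectness");
  return flags;
}
```

Run this on the numbers above and both metrics get flagged at week 4, which is exactly when you should be back in the traces.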


Step 9: Know when you're actually ready

I've seen teams ship too early and destroy trust. I've also seen teams wait forever and never ship.

Here's what ready looks like:

✅ Tool selection accuracy > 90%

✅ Answer correctness > 80%

✅ User acceptance rate > 75%

✅ p95 latency < 8 seconds

✅ No hallucinations in last 100 traces

✅ You've handled the top 10 failure modes

Not ready:

❌ Still finding new failure categories weekly

❌ Metrics vary wildly day to day

❌ Users work around the agent instead of using it

❌ You can't explain why it fails when it fails
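The quantitative half of that checklist fits in one gate. A sketch using the thresholds above; the judgment calls (top 10 failure modes handled, no new failure categories appearing) still need a human and aren't automated here:

```typescript
// Hypothetical rollup of the readiness numbers from the checklist.
interface ReadinessStats {
  toolSelectionAccuracy: number; // fraction, e.g. 0.92
  answerCorrectness: number;     // fraction
  userAcceptanceRate: number;    // fraction
  p95LatencyMs: number;          // milliseconds
  hallucinationsInLast100: number;
}

function isReadyToExpand(s: ReadinessStats): boolean {
  return (
    s.toolSelectionAccuracy > 0.9 &&
    s.answerCorrectness > 0.8 &&
    s.userAcceptanceRate > 0.75 &&
    s.p95LatencyMs < 8000 &&
    s.hallucinationsInLast100 === 0
  );
}
```

A single recent hallucination fails the gate on purpose: one confident fabrication costs more trust than a week of "I don't know."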


The outcome when this works

Teams that follow this playbook:

  • Ship with confidence, not hope
  • Have real data to show leadership
  • Know exactly where to focus engineering effort
  • Build user trust instead of destroying it

The teams that skip steps end up with shelved projects and skeptical users. I've seen it enough times to know.


The checklist

  1. Week 0: Instrument everything
  2. Week 1: 3 users, review every trace, build failure spreadsheet
  3. Week 2: Fix perception issues (tools, context, retrieval)
  4. Week 3: Build evals from failures, establish baselines
  5. Week 4: Expand to 10 users, new roles, new use cases
  6. Week 5: Fix new failures, update evals
  7. Week 6: Expand to full internal team
  8. Week 7+: Monitor drift, harden edge cases
  9. When metrics stabilize: Consider external rollout

The boring work is the real work. Instrument first. Start small. Review everything. Fix perception before prompts. Measure the right things separately. Expand slowly.

Your agent is only as good as your willingness to watch it fail and fix what you find.


If you're rolling out an AI product and want a second set of eyes on your approach, I help teams get this right. DM me on X or LinkedIn.
