I've helped teams roll out AI products for the past two years.
The same failure pattern shows up almost every time.
They build something that demos well. Leadership gets excited. They ship it to 50 users in week one. Within two weeks, trust is destroyed and the project gets shelved 😅
The teams that succeed do something different. This is the playbook I walk clients through now.
## The problem I see everywhere
Most teams measure AI rollouts wrong.
They track one number. "Accuracy" or "user satisfaction" or something equally vague. The number looks good. They ship broadly. Then real users hit edge cases, the agent hallucinates, and suddenly everyone thinks "AI doesn't work for us."
The issue isn't the AI. The issue is they never built the infrastructure to see what was actually happening.
You can't improve what you can't observe. And most teams can't observe anything.
## The rollout framework that works
Here's what I advise now. Nine steps, usually 6-8 weeks before external users.
### Step 1: Start with 3 users, not 30
Every team wants to move fast. "Let's get feedback from the whole department!"
I push back hard on this.
More users means more noise. You can't inspect every trace. You start pattern-matching on vibes instead of data.
The right first cohort:
- 3 people who actually need the tool for real work
- Different roles (support, ops, sales)
- Direct channel to the eng team
One client started with 30 users. They couldn't keep up with the traces, rolled back to 5, and found more bugs in one week than in the entire previous month.
```typescript
// What I recommend tracking for each early user
interface EarlyUserContext {
  userId: string;
  role: string;            // "support", "ops", "sales"
  primaryUseCase: string;  // e.g. "answer customer questions"
  feedbackChannel: string; // direct line to the eng team
}
```
### Step 2: Instrument everything before anyone touches it
This is where most teams cut corners. They want to ship. Observability feels like overhead.
It's not optional.
Before the first user session, you need to answer these questions from your traces:
- What query did the user send?
- What tools did the agent consider?
- Which tool did it pick and why?
- What context was in the window?
- What was the final response?
- Did the user accept, edit, or reject it?
I've seen teams ship without trace logging. They have no idea why things fail. They guess. They tweak prompts randomly. Nothing improves.
```typescript
// Minimum viable trace structure
interface AgentTrace {
  runId: string;
  userId: string;
  query: string;
  toolsConsidered: string[];
  toolSelected: string;
  contextSummary: string;
  response: string;
  userFeedback: "accepted" | "edited" | "rejected" | null;
  latencyMs: number;
}
```
LangSmith, Langfuse, whatever. The tool matters less than having something.
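And if you don't want to commit to a vendor on day one, an append-only JSONL log already answers every question above. A minimal sketch — the trace shape mirrors the `AgentTrace` interface above, and the file path is arbitrary:

```typescript
import * as fs from "fs";

interface AgentTrace {
  runId: string;
  userId: string;
  query: string;
  toolsConsidered: string[];
  toolSelected: string;
  contextSummary: string;
  response: string;
  userFeedback: "accepted" | "edited" | "rejected" | null;
  latencyMs: number;
}

// Append each trace as one JSON object per line.
function logTrace(trace: AgentTrace, path = "traces.jsonl"): void {
  fs.appendFileSync(path, JSON.stringify(trace) + "\n");
}
```

One JSON object per line keeps the log greppable, `jq`-friendly, and trivial to load into a notebook or a proper tool later.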
### Step 3: Review every trace for the first week
Yes, every single one.
This is where you learn what's actually broken. Not what you assumed was broken.
I sit with clients and review traces together. Same patterns show up:
- Wrong tool selection: Agent picked `searchOrders` when it should have picked `searchShipments`
- Missing context: Agent couldn't answer because the right doc wasn't retrieved
- Hallucinations: Agent made up data that doesn't exist
- Premature stopping: Agent gave up too early
- Slow responses: Anything over 10 seconds feels broken
Create a simple spreadsheet. Log every failure. Categorize them.
| Run ID | Failure Type | Root Cause | Fix |
|---|---|---|---|
| abc123 | Wrong tool | Vague tool name | Renamed function |
| def456 | Hallucination | No source doc | Added missing doc |
| ghi789 | Slow response | Too much context | Scoped retrieval |
After one week, you'll have a clear picture. This spreadsheet becomes your roadmap.
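If a spreadsheet feels too manual, the same log works as typed records, and failure counts roll up automatically. A sketch with fields taken from the table above (the type names are my own, not a standard):

```typescript
type FailureType =
  | "wrong-tool"
  | "missing-context"
  | "hallucination"
  | "premature-stop"
  | "slow-response";

// One row of the failure spreadsheet.
interface FailureEntry {
  runId: string;
  failureType: FailureType;
  rootCause: string;
  fix: string;
}

// Count failures per category to see where to focus first.
function failureCounts(log: FailureEntry[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const entry of log) {
    counts[entry.failureType] = (counts[entry.failureType] ?? 0) + 1;
  }
  return counts;
}
```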
### Step 4: Fix perception before prompts
Here's the insight that saves teams weeks of wasted effort:
90% of early failures come from three sources:
- Bad tool names and descriptions
- Missing or wrong context
- Retrieval pulling irrelevant docs
These aren't prompt problems. They're perception problems.
I tell clients: the agent can only do the right thing if it can see the right things.
```typescript
// Before: I see this constantly
const tool = {
  name: "handleData",
  description: "Handles data operations"
};
```

```typescript
// After: Clear enough for the model to reason about
const tool = {
  name: "createShipmentFromOrder",
  description:
    "Creates a new shipment record from an existing order. " +
    "Requires orderId. Returns shipmentId and tracking number."
};
```
One client renamed 12 tools in week one. Tool selection accuracy went from 60% to 87%. No prompt changes.
### Step 5: Build evals from your failures
Don't build generic evals. Build evals from the specific failures you observed.
Every row in that failure spreadsheet becomes a test case.
```typescript
// Example eval case from a real client failure
const evalCase = {
  id: "shipment-status-check",
  query: "What's the status of order 12345?",
  expectedTool: "getShipmentByOrderId",
  expectedBehavior: "Return actual status from the database",
  failureWeObserved: "Agent said 'delivered' without checking",
  groundTruth: "in_transit"
};
```
One team I worked with had 47 eval cases after two weeks. All from actual user sessions. All testing things that actually broke.
Generic benchmarks tell you nothing. Failure-driven evals tell you everything.
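Running these cases doesn't require an eval framework; a loop is enough. A sketch that checks tool selection only, assuming a hypothetical `selectTool(query)` entry point into your agent:

```typescript
interface EvalCase {
  id: string;
  query: string;
  expectedTool: string;
}

interface EvalResult {
  id: string;
  passed: boolean;
  actualTool: string;
}

// Run each failure-derived case against the agent's tool selection.
// `selectTool` stands in for however your agent exposes that decision.
function runEvals(
  cases: EvalCase[],
  selectTool: (query: string) => string
): EvalResult[] {
  return cases.map((c) => {
    const actualTool = selectTool(c.query);
    return { id: c.id, passed: actualTool === c.expectedTool, actualTool };
  });
}
```

Checking answer correctness against `groundTruth` works the same way, with a second comparison per case.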
### Step 6: Measure the right things separately
This is where most teams lie to themselves.
They compute one accuracy number. "We're at 85%!" Leadership is happy. But 85% of what?
I push clients to measure these separately:
```typescript
interface AgentMetrics {
  // Did we pick the right tool?
  toolSelectionAccuracy: number;
  // Did we retrieve relevant docs?
  retrievalRecall: number;
  // Did the final answer match ground truth?
  answerCorrectness: number;
  // Did we cite the right sources?
  groundingAccuracy: number;
  // Did the user accept the response?
  userAcceptanceRate: number;
}
```
You can have 95% tool selection and 40% answer correctness. That means retrieval or synthesis is broken.
You can have 90% answer correctness and 60% user acceptance. That means the answer is technically right but useless in practice.
Separate metrics tell you where to focus. One number tells you nothing.
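Computing these from logged traces is just a handful of ratios over the same trace set. A sketch over a simplified trace record (field names are illustrative, not a standard schema):

```typescript
interface TraceRecord {
  toolSelected: string;
  expectedTool: string;
  answerCorrect: boolean;
  userFeedback: "accepted" | "edited" | "rejected" | null;
}

interface MetricsSnapshot {
  toolSelectionAccuracy: number;
  answerCorrectness: number;
  userAcceptanceRate: number;
}

// Each metric is a separate ratio over the same traces —
// never collapsed into one number.
function computeMetrics(traces: TraceRecord[]): MetricsSnapshot {
  const n = traces.length;
  const ratio = (pred: (t: TraceRecord) => boolean) =>
    n === 0 ? 0 : traces.filter(pred).length / n;
  return {
    toolSelectionAccuracy: ratio((t) => t.toolSelected === t.expectedTool),
    answerCorrectness: ratio((t) => t.answerCorrect),
    userAcceptanceRate: ratio((t) => t.userFeedback === "accepted"),
  };
}
```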
### Step 7: Expand slowly with permission gates
After 2 weeks with 3 users, you might be ready for 10.
Don't flip a switch. Add gates.
```typescript
const canUseAgent = (user: User): boolean => {
  // Phase 1: Named early adopters
  if (ROLLOUT_PHASE === 1) {
    return earlyAdopters.includes(user.id);
  }
  // Phase 2: Specific teams
  if (ROLLOUT_PHASE === 2) {
    return user.team === "support" || user.team === "ops";
  }
  // Phase 3: Everyone
  return true;
};
```
Each phase should last at least a week. Each phase needs its own baseline metrics.
If metrics drop when you expand, you've found a gap. That's good. That's the system working.
### Step 8: Watch for drift
The first week is not representative.
Early users are curious. They ask simple questions. They're forgiving.
By week 4, they're using it for real work. Queries get harder. Edge cases appear. Patience drops.
I tell clients to track metrics weekly, not just at launch:
```text
Week 1: 87% tool accuracy, 72% answer correctness
Week 2: 85% tool accuracy, 75% answer correctness
Week 3: 83% tool accuracy, 71% answer correctness
Week 4: 79% tool accuracy, 68% answer correctness  ← investigate
```
If metrics drift down, dig into traces. Usually it's new use cases, missing docs, or users learning to ask harder questions.
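A crude but useful drift check: flag any metric whose latest weekly value falls more than some threshold below the average of the prior weeks. A sketch, with the 5-point threshold as an illustrative default:

```typescript
// Flag drift when the latest weekly value falls more than `threshold`
// below the average of the preceding weeks.
function isDrifting(weekly: number[], threshold = 0.05): boolean {
  if (weekly.length < 2) return false;
  const prior = weekly.slice(0, -1);
  const baseline = prior.reduce((a, b) => a + b, 0) / prior.length;
  return baseline - weekly[weekly.length - 1] > threshold;
}
```

Against the tool-accuracy series above (`[0.87, 0.85, 0.83, 0.79]`), week 4 trips the check — which is exactly when you'd want to go back into the traces.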
### Step 9: Know when you're actually ready
I've seen teams ship too early and destroy trust. I've also seen teams wait forever and never ship.
Here's what ready looks like:
✅ Tool selection accuracy > 90%
✅ Answer correctness > 80%
✅ User acceptance rate > 75%
✅ p95 latency < 8 seconds
✅ No hallucinations in last 100 traces
✅ You've handled the top 10 failure modes
Not ready:
❌ Still finding new failure categories weekly
❌ Metrics vary wildly day to day
❌ Users work around the agent instead of using it
❌ You can't explain why it fails when it fails
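The readiness bar above can be encoded as a gate, so "are we ready?" is a function call rather than a debate. A sketch using the same thresholds (tune them for your own product):

```typescript
interface ReadinessMetrics {
  toolSelectionAccuracy: number; // 0..1
  answerCorrectness: number;     // 0..1
  userAcceptanceRate: number;    // 0..1
  p95LatencyMs: number;
  hallucinationsInLast100: number;
}

// Mirrors the readiness checklist; every condition must hold.
function isReadyForExternalUsers(m: ReadinessMetrics): boolean {
  return (
    m.toolSelectionAccuracy > 0.9 &&
    m.answerCorrectness > 0.8 &&
    m.userAcceptanceRate > 0.75 &&
    m.p95LatencyMs < 8000 &&
    m.hallucinationsInLast100 === 0
  );
}
```

A gate like this pairs naturally with the phased `canUseAgent` check from Step 7: you don't advance a phase until the function returns true for the current cohort.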
## The outcome when this works
Teams that follow this playbook:
- Ship with confidence, not hope
- Have real data to show leadership
- Know exactly where to focus engineering effort
- Build user trust instead of destroying it
The teams that skip steps end up with shelved projects and skeptical users. I've seen it enough times to know.
## The checklist
- Week 0: Instrument everything
- Week 1: 3 users, review every trace, build failure spreadsheet
- Week 2: Fix perception issues (tools, context, retrieval)
- Week 3: Build evals from failures, establish baselines
- Week 4: Expand to 10 users, new roles, new use cases
- Week 5: Fix new failures, update evals
- Week 6: Expand to full internal team
- Week 7+: Monitor drift, harden edge cases
- When metrics stabilize: Consider external rollout
The boring work is the real work. Instrument first. Start small. Review everything. Fix perception before prompts. Measure the right things separately. Expand slowly.
Your agent is only as good as your willingness to watch it fail and fix what you find.
If you're rolling out an AI product and want a second set of eyes on your approach, I help teams get this right. DM me on X or LinkedIn.