<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Serhii Panchyshyn</title>
    <description>The latest articles on DEV Community by Serhii Panchyshyn (@serhiip).</description>
    <link>https://dev.to/serhiip</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F138013%2F5b142395-3c3d-49af-8418-515743a4e2fb.JPG</url>
      <title>DEV Community: Serhii Panchyshyn</title>
      <link>https://dev.to/serhiip</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/serhiip"/>
    <language>en</language>
    <item>
      <title>How to Roll Out an Internal AI Product Without Lying to Yourself</title>
      <dc:creator>Serhii Panchyshyn</dc:creator>
      <pubDate>Mon, 13 Apr 2026 23:53:55 +0000</pubDate>
      <link>https://dev.to/serhiip/how-to-roll-out-an-internal-ai-product-without-lying-to-yourself-3bl2</link>
      <guid>https://dev.to/serhiip/how-to-roll-out-an-internal-ai-product-without-lying-to-yourself-3bl2</guid>
      <description>&lt;p&gt;I've helped teams roll out AI products for the past two years.&lt;/p&gt;

&lt;p&gt;The same failure pattern shows up almost every time.&lt;/p&gt;

&lt;p&gt;They build something that demos well. Leadership gets excited. They ship it to 50 users in week one. Within two weeks, trust is destroyed and the project gets shelved 😅&lt;/p&gt;

&lt;p&gt;The teams that succeed do something different. This is the playbook I walk clients through now.&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem I see everywhere
&lt;/h2&gt;

&lt;p&gt;Most teams measure AI rollouts wrong.&lt;/p&gt;

&lt;p&gt;They track one number. "Accuracy" or "user satisfaction" or something equally vague. The number looks good. They ship broadly. Then real users hit edge cases, the agent hallucinates, and suddenly everyone thinks "AI doesn't work for us."&lt;/p&gt;

&lt;p&gt;The issue isn't the AI. The issue is they never built the infrastructure to see what was actually happening.&lt;/p&gt;

&lt;p&gt;You can't improve what you can't observe. And most teams can't observe anything.&lt;/p&gt;




&lt;h2&gt;
  
  
  The rollout framework that works
&lt;/h2&gt;

&lt;p&gt;Here's what I advise now. Nine steps, usually 6-8 weeks before external users.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: Start with 3 users, not 30
&lt;/h2&gt;

&lt;p&gt;Every team wants to move fast. "Let's get feedback from the whole department!"&lt;/p&gt;

&lt;p&gt;I push back hard on this.&lt;/p&gt;

&lt;p&gt;More users means more noise. You can't inspect every trace. You start pattern-matching on vibes instead of data.&lt;/p&gt;

&lt;p&gt;The right first cohort:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;3 people who actually need the tool for real work&lt;/li&gt;
&lt;li&gt;Different roles (support, ops, sales)&lt;/li&gt;
&lt;li&gt;Direct channel to the eng team&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One client started with 30 users. Couldn't keep up. Rolled back to 5. Found more bugs in one week than in the previous month.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// What I recommend tracking for each early user&lt;/span&gt;
&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;EarlyUserContext&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;           &lt;span class="c1"&gt;// "support", "ops", "sales"&lt;/span&gt;
  &lt;span class="nl"&gt;primaryUseCase&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// "answer customer questions"&lt;/span&gt;
  &lt;span class="nl"&gt;feedbackChannel&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// direct line to eng team&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 2: Instrument everything before anyone touches it
&lt;/h2&gt;

&lt;p&gt;This is where most teams cut corners. They want to ship. Observability feels like overhead.&lt;/p&gt;

&lt;p&gt;It's not optional.&lt;/p&gt;

&lt;p&gt;Before the first user session, you need to answer these questions from your traces:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What query did the user send?&lt;/li&gt;
&lt;li&gt;What tools did the agent consider?&lt;/li&gt;
&lt;li&gt;Which tool did it pick and why?&lt;/li&gt;
&lt;li&gt;What context was in the window?&lt;/li&gt;
&lt;li&gt;What was the final response?&lt;/li&gt;
&lt;li&gt;Did the user accept, edit, or reject it?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I've seen teams ship without trace logging. They have no idea why things fail. They guess. They tweak prompts randomly. Nothing improves.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Minimum viable trace structure&lt;/span&gt;
&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;AgentTrace&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;runId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;toolsConsidered&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;
  &lt;span class="nl"&gt;toolSelected&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;contextSummary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;userFeedback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;accepted&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;edited&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;rejected&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;latencyMs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;LangSmith, Langfuse, whatever. The tool matters less than having something.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 3: Review every trace for the first week
&lt;/h2&gt;

&lt;p&gt;Yes, every single one.&lt;/p&gt;

&lt;p&gt;This is where you learn what's actually broken. Not what you assumed was broken.&lt;/p&gt;

&lt;p&gt;I sit with clients and review traces together. Same patterns show up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Wrong tool selection&lt;/strong&gt;: Agent picked &lt;code&gt;searchOrders&lt;/code&gt; when it should have picked &lt;code&gt;searchShipments&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Missing context&lt;/strong&gt;: Agent couldn't answer because the right doc wasn't retrieved&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hallucinations&lt;/strong&gt;: Agent made up data that doesn't exist&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Premature stopping&lt;/strong&gt;: Agent gave up too early&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slow responses&lt;/strong&gt;: Anything over 10 seconds feels broken&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Create a simple spreadsheet. Log every failure. Categorize them.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Run ID&lt;/th&gt;
&lt;th&gt;Failure Type&lt;/th&gt;
&lt;th&gt;Root Cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;abc123&lt;/td&gt;
&lt;td&gt;Wrong tool&lt;/td&gt;
&lt;td&gt;Vague tool name&lt;/td&gt;
&lt;td&gt;Renamed function&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;def456&lt;/td&gt;
&lt;td&gt;Hallucination&lt;/td&gt;
&lt;td&gt;No source doc&lt;/td&gt;
&lt;td&gt;Added missing doc&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ghi789&lt;/td&gt;
&lt;td&gt;Slow response&lt;/td&gt;
&lt;td&gt;Too much context&lt;/td&gt;
&lt;td&gt;Scoped retrieval&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;After one week, you'll have a clear picture. This spreadsheet becomes your roadmap.&lt;/p&gt;
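
&lt;p&gt;If you'd rather keep that log next to your traces than in a spreadsheet, the record is tiny. A sketch, with the categories lifted from the table above (the field names are illustrative, not a standard):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Sketch of a failure-log record mirroring the spreadsheet columns above.
// Category names are illustrative; use whatever buckets show up in your traces.
interface FailureLogEntry {
  runId: string;
  failureType: "wrong_tool" | "missing_context" | "hallucination" | "premature_stop" | "slow_response";
  rootCause: string;
  fix: string;
  fixShipped: boolean;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;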




&lt;h2&gt;
  
  
  Step 4: Fix perception before prompts
&lt;/h2&gt;

&lt;p&gt;Here's the insight that saves teams weeks of wasted effort:&lt;/p&gt;

&lt;p&gt;In my experience, roughly 90% of early failures come from three sources:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Bad tool names and descriptions&lt;/li&gt;
&lt;li&gt;Missing or wrong context&lt;/li&gt;
&lt;li&gt;Retrieval pulling irrelevant docs&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These aren't prompt problems. They're perception problems.&lt;/p&gt;

&lt;p&gt;I tell clients: the agent can only do the right thing if it can see the right things.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Before: I see this constantly&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;handleData&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Handles data operations&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// After: Clear enough for the model to reason about&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;createShipmentFromOrder&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Creates a new shipment record from an existing order. Requires orderId. Returns shipmentId and tracking number.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One client renamed 12 tools in week one. Tool selection accuracy went from 60% to 87%. No prompt changes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 5: Build evals from your failures
&lt;/h2&gt;

&lt;p&gt;Don't build generic evals. Build evals from the specific failures you observed.&lt;/p&gt;

&lt;p&gt;Every row in that failure spreadsheet becomes a test case.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Example eval case from a real client failure&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;evalCase&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;shipment-status-check&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;What's the status of order 12345?&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;expectedTool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;getShipmentByOrderId&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;expectedBehavior&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Return actual status from database&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;failureWeObserved&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Agent said 'delivered' without checking&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;groundTruth&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;in_transit&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One team I worked with had 47 eval cases after two weeks. All from actual user sessions. All testing things that actually broke.&lt;/p&gt;

&lt;p&gt;Generic benchmarks tell you nothing. Failure-driven evals tell you everything.&lt;/p&gt;
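
&lt;p&gt;The runner for these cases doesn't need a framework either. A minimal sketch; &lt;code&gt;runAgent&lt;/code&gt; is a placeholder for your own agent entry point, assumed to report which tool it picked:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Minimal sketch of a failure-driven eval runner.
// "runAgent" is a placeholder for your own agent entry point; it is assumed
// to return the tool it selected so we can compare against the expected tool.
interface ToolEvalCase {
  id: string;
  query: string;
  expectedTool: string;
}

type RunAgent = (query: string) =&amp;gt; Promise&amp;lt;{ toolSelected: string }&amp;gt;;

async function runToolEvals(cases: ToolEvalCase[], runAgent: RunAgent) {
  let passed = 0;
  for (const c of cases) {
    const result = await runAgent(c.query);
    if (result.toolSelected === c.expectedTool) {
      passed += 1;
    } else {
      console.log(`FAIL ${c.id}: expected ${c.expectedTool}, got ${result.toolSelected}`);
    }
  }
  console.log(`Tool selection: ${passed}/${cases.length} cases passed`);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Run it on every change. If a case that used to pass starts failing again, you know before your users do.&lt;/p&gt;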




&lt;h2&gt;
  
  
  Step 6: Measure the right things separately
&lt;/h2&gt;

&lt;p&gt;This is where most teams lie to themselves.&lt;/p&gt;

&lt;p&gt;They compute one accuracy number. "We're at 85%!" Leadership is happy. But 85% of what?&lt;/p&gt;

&lt;p&gt;I push clients to measure these separately:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;AgentMetrics&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Did we pick the right tool?&lt;/span&gt;
  &lt;span class="nl"&gt;toolSelectionAccuracy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="c1"&gt;// Did we retrieve relevant docs?&lt;/span&gt;
  &lt;span class="nl"&gt;retrievalRecall&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="c1"&gt;// Did the final answer match ground truth?&lt;/span&gt;
  &lt;span class="nl"&gt;answerCorrectness&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="c1"&gt;// Did we cite the right sources?&lt;/span&gt;
  &lt;span class="nl"&gt;groundingAccuracy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="c1"&gt;// Did the user accept the response?&lt;/span&gt;
  &lt;span class="nl"&gt;userAcceptanceRate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can have 95% tool selection accuracy and 40% answer correctness. That means retrieval or synthesis is broken.&lt;/p&gt;

&lt;p&gt;You can have 90% answer correctness and 60% user acceptance. That means the answer is technically right but useless in practice.&lt;/p&gt;

&lt;p&gt;Separate metrics tell you where to focus. One number tells you nothing.&lt;/p&gt;
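
&lt;p&gt;Two of these fall straight out of the &lt;code&gt;AgentTrace&lt;/code&gt; records from Step 2, assuming you also keep the expected tool from your eval cases keyed by run. A sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Sketch: deriving two of the metrics from AgentTrace records (Step 2).
// "expectedByRun" is assumed to map runId to the tool your eval case expected.
function userAcceptanceRate(traces: AgentTrace[]): number {
  const rated = traces.filter((t) =&amp;gt; t.userFeedback !== null);
  if (rated.length === 0) return 0;
  return rated.filter((t) =&amp;gt; t.userFeedback === "accepted").length / rated.length;
}

function toolSelectionAccuracy(traces: AgentTrace[], expectedByRun: { [runId: string]: string }): number {
  const scored = traces.filter((t) =&amp;gt; expectedByRun[t.runId] !== undefined);
  if (scored.length === 0) return 0;
  return scored.filter((t) =&amp;gt; t.toolSelected === expectedByRun[t.runId]).length / scored.length;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;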




&lt;h2&gt;
  
  
  Step 7: Expand slowly with permission gates
&lt;/h2&gt;

&lt;p&gt;After 2 weeks with 3 users, you might be ready for 10.&lt;/p&gt;

&lt;p&gt;Don't flip a switch. Add gates.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;canUseAgent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;User&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Phase 1: Named early adopters&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ROLLOUT_PHASE&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;earlyAdopters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// Phase 2: Specific teams&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ROLLOUT_PHASE&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;team&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;support&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;team&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ops&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// Phase 3: Everyone&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each phase should last at least a week. Each phase needs its own baseline metrics.&lt;/p&gt;

&lt;p&gt;If metrics drop when you expand, you've found a gap. That's good. That's the system working.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 8: Watch for drift
&lt;/h2&gt;

&lt;p&gt;The first week is not representative.&lt;/p&gt;

&lt;p&gt;Early users are curious. They ask simple questions. They're forgiving.&lt;/p&gt;

&lt;p&gt;By week 4, they're using it for real work. Queries get harder. Edge cases appear. Patience drops.&lt;/p&gt;

&lt;p&gt;I tell clients to track metrics weekly, not just at launch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Week 1: 87% tool accuracy, 72% answer correctness
Week 2: 85% tool accuracy, 75% answer correctness  
Week 3: 83% tool accuracy, 71% answer correctness
Week 4: 79% tool accuracy, 68% answer correctness  ← investigate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If metrics drift down, dig into traces. Usually it's new use cases, missing docs, or users learning to ask harder questions.&lt;/p&gt;
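
&lt;p&gt;Catching the drift doesn't need anything fancy. A sketch of a week-over-week check; the threshold is illustrative, not a recommendation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Sketch: flag a metric that dropped more than a few points week over week.
// The 3-point threshold is illustrative; tune it to your own noise level.
function driftedDown(thisWeek: number, lastWeek: number, maxDrop = 0.03): boolean {
  return lastWeek - thisWeek &amp;gt; maxDrop;
}

// e.g. driftedDown(0.79, 0.83) flags the week-4 drop in tool accuracy above
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;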




&lt;h2&gt;
  
  
  Step 9: Know when you're actually ready
&lt;/h2&gt;

&lt;p&gt;I've seen teams ship too early and destroy trust. I've also seen teams wait forever and never ship.&lt;/p&gt;

&lt;p&gt;Here's what ready looks like:&lt;/p&gt;

&lt;p&gt;✅ Tool selection accuracy &amp;gt; 90%&lt;br&gt;&lt;br&gt;
✅ Answer correctness &amp;gt; 80%&lt;br&gt;&lt;br&gt;
✅ User acceptance rate &amp;gt; 75%&lt;br&gt;&lt;br&gt;
✅ p95 latency &amp;lt; 8 seconds&lt;br&gt;&lt;br&gt;
✅ No hallucinations in last 100 traces&lt;br&gt;&lt;br&gt;
✅ You've handled the top 10 failure modes  &lt;/p&gt;

&lt;p&gt;Not ready:&lt;/p&gt;

&lt;p&gt;❌ Still finding new failure categories weekly&lt;br&gt;&lt;br&gt;
❌ Metrics vary wildly day to day&lt;br&gt;&lt;br&gt;
❌ Users work around the agent instead of using it&lt;br&gt;&lt;br&gt;
❌ You can't explain why it fails when it fails  &lt;/p&gt;
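
&lt;p&gt;The bar is easier to hold when it's encoded rather than argued about in a meeting. A sketch using the &lt;code&gt;AgentMetrics&lt;/code&gt; shape from Step 6, with thresholds mirroring the checklist above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Sketch: encode the readiness checklist so "are we ready?" has one answer.
// Uses AgentMetrics from Step 6; threshold numbers mirror the checklist above.
function readyForBroaderRollout(
  m: AgentMetrics,
  p95LatencyMs: number,
  hallucinationsInLast100: number
): boolean {
  return (
    m.toolSelectionAccuracy &amp;gt; 0.9 &amp;amp;&amp;amp;
    m.answerCorrectness &amp;gt; 0.8 &amp;amp;&amp;amp;
    m.userAcceptanceRate &amp;gt; 0.75 &amp;amp;&amp;amp;
    p95LatencyMs &amp;lt; 8000 &amp;amp;&amp;amp;
    hallucinationsInLast100 === 0
  );
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;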




&lt;h2&gt;
  
  
  The outcome when this works
&lt;/h2&gt;

&lt;p&gt;Teams that follow this playbook:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ship with confidence, not hope&lt;/li&gt;
&lt;li&gt;Have real data to show leadership&lt;/li&gt;
&lt;li&gt;Know exactly where to focus engineering effort&lt;/li&gt;
&lt;li&gt;Build user trust instead of destroying it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The teams that skip steps end up with shelved projects and skeptical users. I've seen it enough times to know.&lt;/p&gt;




&lt;h2&gt;
  
  
  The checklist
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Week 0&lt;/strong&gt;: Instrument everything&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Week 1&lt;/strong&gt;: 3 users, review every trace, build failure spreadsheet&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Week 2&lt;/strong&gt;: Fix perception issues (tools, context, retrieval)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Week 3&lt;/strong&gt;: Build evals from failures, establish baselines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Week 4&lt;/strong&gt;: Expand to 10 users, new roles, new use cases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Week 5&lt;/strong&gt;: Fix new failures, update evals&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Week 6&lt;/strong&gt;: Expand to full internal team&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Week 7+&lt;/strong&gt;: Monitor drift, harden edge cases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When metrics stabilize&lt;/strong&gt;: Consider external rollout&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;The boring work is the real work. Instrument first. Start small. Review everything. Fix perception before prompts. Measure the right things separately. Expand slowly.&lt;/p&gt;

&lt;p&gt;Your agent is only as good as your willingness to watch it fail and fix what you find.&lt;/p&gt;




&lt;p&gt;If you're rolling out an AI product and want a second set of eyes on your approach, I help teams get this right. DM me on X or LinkedIn.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>llm</category>
      <category>product</category>
    </item>
    <item>
      <title>Stop Prompting. Start Engineering Perception.</title>
      <dc:creator>Serhii Panchyshyn</dc:creator>
      <pubDate>Mon, 13 Apr 2026 23:43:02 +0000</pubDate>
      <link>https://dev.to/serhiip/stop-prompting-start-engineering-perception-4fh5</link>
      <guid>https://dev.to/serhiip/stop-prompting-start-engineering-perception-4fh5</guid>
      <description>&lt;p&gt;I've watched teams spend weeks rewriting the same system prompt.&lt;/p&gt;

&lt;p&gt;Different phrasings. More examples. Clearer instructions. The agent still picks the wrong tool. Still hallucinates. Still feels broken.&lt;/p&gt;

&lt;p&gt;Then they rename six functions and accuracy jumps 30%.&lt;/p&gt;

&lt;p&gt;This pattern shows up constantly. The model doesn't care how clever your prompt is. It cares about what it can &lt;em&gt;see&lt;/em&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem I see everywhere
&lt;/h2&gt;

&lt;p&gt;Teams treat prompts like magic spells. Say the right words, get the right output.&lt;/p&gt;

&lt;p&gt;But agents aren't following instructions. They're making predictions based on everything in context. The tool names. The API responses. The error messages. The structure of your data.&lt;/p&gt;

&lt;p&gt;That's perception. And it matters way more than your system prompt.&lt;/p&gt;

&lt;p&gt;Most teams optimize the wrong layer. They iterate on prompts for weeks while their tool names are &lt;code&gt;handleData&lt;/code&gt; and &lt;code&gt;processRequest&lt;/code&gt;. The model has no chance.&lt;/p&gt;

&lt;p&gt;Here are 10 patterns I've seen work across the past two years of helping teams build production agents 💪&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Tool names are the real prompt
&lt;/h2&gt;

&lt;p&gt;Bad tool names are invisible to the model.&lt;/p&gt;

&lt;p&gt;I audit client codebases and find this constantly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// ❌ The model has no idea what this does&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;handleRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;unknown&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// ✅ Now it knows exactly when to use this&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;createShipmentFromOrder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;orderId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One client had 47 tools. Half had names like &lt;code&gt;processData&lt;/code&gt; or &lt;code&gt;executeAction&lt;/code&gt;. The model was guessing.&lt;/p&gt;

&lt;p&gt;We renamed 12 functions. Tool selection accuracy went from 60% to 87%. No prompt changes.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Tool descriptions matter more than you think
&lt;/h2&gt;

&lt;p&gt;The model reads descriptions to decide which tool to pick.&lt;/p&gt;

&lt;p&gt;I tell clients: write descriptions like you're onboarding a new developer. Because you are.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// ❌ Vague description&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;searchRecords&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Search for records in the system&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// ✅ Specific description with constraints&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;searchShipments&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Search shipments by tracking number, origin, destination, or date range. Returns max 50 results. Use filters to narrow results before searching.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Specific descriptions reduce wrong tool selection by 30-40% in my experience.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Passing everything into context is lazy
&lt;/h2&gt;

&lt;p&gt;I've reviewed architectures where teams dump entire conversation histories into context. 20 turns. 50 tool results. Everything.&lt;/p&gt;

&lt;p&gt;The model drowns.&lt;/p&gt;

&lt;p&gt;What works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Last 3 turns by default&lt;/li&gt;
&lt;li&gt;Relevant retrieved docs only&lt;/li&gt;
&lt;li&gt;Structured summaries instead of raw data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Less context. Better decisions. Faster responses.&lt;/p&gt;

&lt;p&gt;One team cut their context by 60% and saw answer quality improve. Counter-intuitive until you realize the model was distracted by noise.&lt;/p&gt;
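
&lt;p&gt;The trimming itself is boring code. A minimal sketch of that default, with illustrative field names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Sketch of a trimmed context builder. Field names are illustrative.
interface Turn {
  role: string;
  content: string;
}

function buildContext(history: Turn[], retrievedDocs: string[], toolResultSummary: string) {
  return {
    recentTurns: history.slice(-3),      // last 3 turns by default
    docs: retrievedDocs.slice(0, 5),     // only the most relevant retrieved docs
    toolResults: toolResultSummary,      // structured summary instead of raw dumps
  };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;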




&lt;h2&gt;
  
  
  4. Scoped retrieval beats broad retrieval
&lt;/h2&gt;

&lt;p&gt;Early RAG implementations pull from everywhere. The whole knowledge base. 200+ docs. The model has no idea which ones matter.&lt;/p&gt;

&lt;p&gt;I push clients toward module-level filtering. If someone asks about shipments, only retrieve shipment docs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// ❌ Retrieve from everything&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;retriever&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// ✅ Scope to relevant module&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;retriever&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; 
  &lt;span class="na"&gt;module&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;detectModule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;maxResults&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt; 
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Recall goes up. Hallucinations go down. Should be the default from day one.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Structured outputs prevent downstream chaos
&lt;/h2&gt;

&lt;p&gt;If another agent or system consumes your output, structure it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// ❌ Free text response&lt;/span&gt;
&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;I found 3 shipments that match. The first one is #12345 going to Chicago...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;// ✅ Structured response&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;shipments&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;12345&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;destination&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Chicago&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;status&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;in_transit&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;12346&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;destination&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Denver&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;status&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;delivered&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;total&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;hasMore&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unstructured responses compound errors. Each downstream consumer has to parse and guess. I've seen entire pipelines break because one agent returned prose instead of JSON.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Silent failures are invisible failures
&lt;/h2&gt;

&lt;p&gt;The model can't fix what it can't see.&lt;/p&gt;

&lt;p&gt;I audit error handling in every client codebase. Same pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// ❌ Silent failure&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;hasPermission&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// ✅ Loud failure&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;hasPermission&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;PERMISSION_DENIED&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;User lacks 'shipments.create' permission&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;requiredPermission&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;shipments.create&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;suggestedAction&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Request access from workspace admin&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Explicit errors let the agent reason about what went wrong. And let you debug faster.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Real system state beats assumed state
&lt;/h2&gt;

&lt;p&gt;I watched an agent confidently tell a user their shipment was delivered.&lt;/p&gt;

&lt;p&gt;It wasn't. The agent assumed based on typical timelines. It never checked the actual record.&lt;/p&gt;

&lt;p&gt;This happens when teams don't pass real state:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// ❌ Agent has to guess&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;orderId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;12345&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// ✅ Agent knows the truth&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;shipment&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;12345&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;in_transit&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// actual current status&lt;/span&gt;
    &lt;span class="na"&gt;lastUpdate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2024-01-15T10:30:00Z&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;currentLocation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Memphis hub&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Agents will make up state if you don't give them real state. Always.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Specialized agents beat one generalist
&lt;/h2&gt;

&lt;p&gt;I've seen teams try to build one agent that handles everything. Customer questions. Data entry. Workflow automation. Reports.&lt;/p&gt;

&lt;p&gt;It's mediocre at all of them.&lt;/p&gt;

&lt;p&gt;The pattern that works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;One agent for Q&amp;amp;A&lt;/strong&gt; using org context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One agent for record operations&lt;/strong&gt; with strict schemas&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One agent for document extraction&lt;/strong&gt; with specialized prompts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each one is easier to eval. Easier to constrain. Easier to improve.&lt;/p&gt;

&lt;p&gt;Generalist agents are harder to debug and harder to trust. I push clients toward decomposition early.&lt;/p&gt;
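
&lt;p&gt;The routing layer on top of them stays thin. A sketch; &lt;code&gt;classifyIntent&lt;/code&gt; and the three agents are placeholders, not a framework:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Sketch of a thin router in front of specialized agents.
// classifyIntent, qaAgent, recordOpsAgent, and extractionAgent are placeholders.
type Intent = "qa" | "record_ops" | "doc_extraction";

async function route(query: string) {
  const intent: Intent = await classifyIntent(query);  // e.g. a small classification call
  if (intent === "qa") return qaAgent.run(query);
  if (intent === "record_ops") return recordOpsAgent.run(query);
  return extractionAgent.run(query);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;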




&lt;h2&gt;
  
  
  9. Guardrails should block bad things, not useful things
&lt;/h2&gt;

&lt;p&gt;I've seen guardrails so aggressive they blocked legitimate business operations.&lt;/p&gt;

&lt;p&gt;"Can you help me set up a webhook?" → BLOCKED (mentions code execution)&lt;/p&gt;

&lt;p&gt;"What's the API endpoint for shipments?" → BLOCKED (mentions API)&lt;/p&gt;

&lt;p&gt;The users stopped trusting the product. Not because the AI was bad. Because the guardrails were dumb.&lt;/p&gt;

&lt;p&gt;Narrow guardrails work better. Be specific about what's actually dangerous. Allow everything else.&lt;/p&gt;
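
&lt;p&gt;In practice that usually means a short deny-list of genuinely dangerous actions rather than keyword matching on the user's message. A sketch with illustrative action names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Sketch: block a short list of genuinely dangerous actions, allow everything else.
// Action names are illustrative.
const blockedActions = new Set([
  "delete_all_records",
  "export_full_database",
  "change_billing_plan",
]);

function isBlocked(requestedAction: string): boolean {
  return blockedActions.has(requestedAction);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;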




&lt;h2&gt;
  
  
  10. Audit perception before rewriting prompts
&lt;/h2&gt;

&lt;p&gt;When a client tells me their agent is underperforming, I ask these questions first:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can it see the right tools? Are names and descriptions clear?&lt;/li&gt;
&lt;li&gt;Can it see the right context? Or is it drowning in noise?&lt;/li&gt;
&lt;li&gt;Can it see real state? Or is it guessing?&lt;/li&gt;
&lt;li&gt;Can it see errors? Or do failures happen silently?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Nine times out of ten, the problem is perception. Not the prompt.&lt;/p&gt;




&lt;h2&gt;
  
  
  The outcome when you get this right
&lt;/h2&gt;

&lt;p&gt;Teams that engineer perception instead of prompts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stop the endless prompt iteration cycle&lt;/li&gt;
&lt;li&gt;Get measurable accuracy improvements in days, not months&lt;/li&gt;
&lt;li&gt;Build agents that actually work in production&lt;/li&gt;
&lt;li&gt;Have clear debugging paths when things break&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The teams that keep tweaking prompts stay stuck. I've seen it enough times to know.&lt;/p&gt;




&lt;h2&gt;
  
  
  The mental model shift
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prompt engineering&lt;/strong&gt; asks: "How do I word this better?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Perception engineering&lt;/strong&gt; asks: "What does the agent need to see to make a good decision?"&lt;/p&gt;

&lt;p&gt;One has diminishing returns after a few iterations.&lt;/p&gt;

&lt;p&gt;The other compounds as your system improves.&lt;/p&gt;




&lt;p&gt;Stop rewriting prompts. Start auditing what your agent can perceive.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rename tools for clarity&lt;/li&gt;
&lt;li&gt;Scope your context&lt;/li&gt;
&lt;li&gt;Pass real state&lt;/li&gt;
&lt;li&gt;Make errors loud&lt;/li&gt;
&lt;li&gt;Use specialized agents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your agent is only as good as what it can see 👀&lt;/p&gt;




&lt;p&gt;If you're building agents and want a second set of eyes on your architecture, I help teams get this right. DM me on X or LinkedIn.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>llm</category>
      <category>webdev</category>
    </item>
    <item>
      <title>My First RAG System Had No Evals. 40% of Answers Were Wrong.</title>
      <dc:creator>Serhii Panchyshyn</dc:creator>
      <pubDate>Mon, 13 Apr 2026 20:58:06 +0000</pubDate>
      <link>https://dev.to/serhiip/my-first-rag-system-had-no-evals-40-of-answers-were-wrong-ab</link>
      <guid>https://dev.to/serhiip/my-first-rag-system-had-no-evals-40-of-answers-were-wrong-ab</guid>
      <description>&lt;p&gt;When I started building production RAG systems, I noticed something: nobody was measuring retrieval quality.&lt;/p&gt;

&lt;p&gt;Teams would ship a system, ask users if it "felt good," and move on. No metrics. No baseline. No way to know if changes actually helped.&lt;/p&gt;

&lt;p&gt;So I started measuring everything. And the first thing I discovered: &lt;strong&gt;most RAG failures aren't LLM failures. They're retrieval failures.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The documents that could answer the question aren't making it into the context window. The LLM is being asked to answer questions without the information it needs. No wonder it hallucinates.&lt;/p&gt;

&lt;p&gt;Here's what I've learned about measuring and fixing RAG systems after building them for B2B SaaS companies.&lt;/p&gt;




&lt;h2&gt;
  
  
  The metric that actually matters: Recall@k
&lt;/h2&gt;

&lt;p&gt;Before I measure anything else on a new RAG system, I measure &lt;strong&gt;Recall@k&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Recall@k answers a simple question: "Of all the documents that &lt;em&gt;should&lt;/em&gt; have been retrieved, what percentage actually made it into the top k results?"&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;recall_at_k&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retrieved_ids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;relevant_ids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;What % of relevant docs are in the top k results?&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;top_k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retrieved_ids&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;relevant&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;relevant_ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;relevant&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;top_k&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;relevant&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;relevant&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;On systems I've audited, Recall@10 is often around 60%. That means 40% of the time, the document that could answer the question isn't even in the context. The LLM never had a chance.&lt;/p&gt;

&lt;p&gt;Here's the math that drives everything:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;P(correct answer) ≈ P(correct context retrieved)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If the right chunks aren't retrieved, the LLM can't answer correctly. This is why I always measure retrieval separately from answer quality. Otherwise you're debugging the wrong layer.&lt;/p&gt;


&lt;h2&gt;
  
  
  You can start measuring today
&lt;/h2&gt;

&lt;p&gt;You don't need production traffic to build evals. Generate synthetic test data from your corpus:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_synthetic_evals&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Generate question-answer pairs from your chunks.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;eval_pairs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Generate 3 questions that this text can answer.
Make them specific. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is this about?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; doesn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t test retrieval.

Text:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Return JSON: [{{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunk_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}}]
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;eval_pairs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;parse_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;eval_pairs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;50-100 questions is enough to establish a baseline. Run your retriever, measure Recall@10, write down the number. Now you can actually tell if changes help.&lt;/p&gt;
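
&lt;p&gt;The baseline loop itself is a few lines. A sketch that reuses &lt;code&gt;recall_at_k&lt;/code&gt; from above; &lt;code&gt;retriever.search&lt;/code&gt; stands in for whatever your retriever actually exposes:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of the baseline loop, using recall_at_k from above.
# `retriever.search` is a placeholder for whatever your retriever exposes.
def measure_baseline(eval_pairs: list, k: int = 10) -&amp;gt; float:
    """Average Recall@k over the synthetic eval set."""
    scores = []
    for pair in eval_pairs:
        retrieved_ids = retriever.search(pair["question"], k=k)
        scores.append(recall_at_k(retrieved_ids, [pair["chunk_id"]], k))
    return sum(scores) / len(scores)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;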


&lt;h2&gt;
  
  
  The two fixes that consistently move the needle
&lt;/h2&gt;

&lt;p&gt;I've tried a lot of retrieval improvements. Most make marginal differences. Two consistently deliver results.&lt;/p&gt;
&lt;h3&gt;
  
  
  Fix 1: Hybrid search
&lt;/h3&gt;

&lt;p&gt;Embeddings are great at semantic similarity. "How do I reset my password?" matches "Steps to recover account access" even though they share no keywords.&lt;/p&gt;

&lt;p&gt;But embeddings are weak on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Numbers&lt;/strong&gt;: They don't understand that 49 is close to 50&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exact match&lt;/strong&gt;: Product codes, IDs, ticker symbols&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rare terms&lt;/strong&gt;: Domain jargon not in the training data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;BM25 (keyword search) catches what embeddings miss. Combine them:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hybrid_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Combine embedding search and BM25 using RRF.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;embedding_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embedding_index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;bm25_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bm25_index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Reciprocal Rank Fusion
&lt;/span&gt;    &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="n"&gt;rrf_k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding_results&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rrf_k&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bm25_results&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rrf_k&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;ranked&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Typical improvement: &lt;strong&gt;5-15% recall boost&lt;/strong&gt; depending on query mix.&lt;/p&gt;
&lt;h3&gt;
  
  
  Fix 2: Add a reranker
&lt;/h3&gt;

&lt;p&gt;Embedding models are bi-encoders. They encode the query and the documents separately, then compare the resulting vectors. Fast, but imprecise.&lt;/p&gt;

&lt;p&gt;Cross-encoders (rerankers) look at the query and document together. Slower, but much more accurate. Use them as a second pass:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search_with_rerank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Retrieve broadly, then rerank precisely.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="c1"&gt;# Cast a wide net
&lt;/span&gt;    &lt;span class="n"&gt;candidates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;hybrid_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Rerank with cross-encoder
&lt;/span&gt;    &lt;span class="n"&gt;pairs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;get_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reranker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pairs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Return top k after reranking
&lt;/span&gt;    &lt;span class="n"&gt;ranked&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;doc_id&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Typical improvement: &lt;strong&gt;another 5-10%&lt;/strong&gt; on top of hybrid search.&lt;/p&gt;

&lt;p&gt;Combined, these two fixes often take a system from 60% to 80% recall. That's the difference between "works sometimes" and "works reliably."&lt;/p&gt;


&lt;h2&gt;
  
  
  Chunking decisions that make or break retrieval
&lt;/h2&gt;

&lt;p&gt;Your chunking strategy matters more than your embedding model choice. A few things I always check:&lt;/p&gt;
&lt;h3&gt;
  
  
  The "it" problem
&lt;/h3&gt;

&lt;p&gt;Chunks that start with "It also supports..." or "This feature allows..." are useless on their own. The word "it" has no meaning without the previous chunk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix: Prepend context to every chunk.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chunk_with_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;section&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sections&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Prepend document and section info
&lt;/span&gt;        &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Document: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Section: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;section&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk_text&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;split_section&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;section&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;chunk_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doc_title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;section&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;section&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;header&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Other chunking rules I follow
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Never split mid-table.&lt;/strong&gt; A row without headers is meaningless.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;10-20% overlap&lt;/strong&gt; between consecutive chunks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test multiple chunk sizes&lt;/strong&gt; (256, 512, 1024 tokens). The optimum depends on your queries; see the sketch after this list.&lt;/li&gt;
&lt;/ol&gt;
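
&lt;p&gt;Here's a minimal sketch of rules 2 and 3 together. Nothing in it is a specific library: build_retriever stands in for whatever indexing code you already have, doc.id and doc.text are assumed fields, and recall_at_10 is the metric helper sketched in the workflow section below.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def split_with_overlap(text: str, size: int, overlap_ratio: float = 0.15) -&amp;gt; list:
    """Fixed word window with roughly 15% overlap (rule 2). Words approximate tokens here."""
    words = text.split()
    step = max(1, int(size * (1 - overlap_ratio)))
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]


def sweep_chunk_sizes(docs, eval_set, sizes=(256, 512, 1024)) -&amp;gt; dict:
    """Rule 3: measure each size instead of guessing. Returns Recall@10 per size."""
    results = {}
    for size in sizes:
        chunks = [
            {"doc_id": doc.id, "content": piece}
            for doc in docs
            for piece in split_with_overlap(doc.text, size)
        ]
        retriever = build_retriever(chunks)                # placeholder: your indexing code
        results[size] = recall_at_10(retriever, eval_set)  # sketched in the workflow section
    return results
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;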


&lt;h2&gt;
  
  
  The workflow I use on every RAG project
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Week 1-2: Establish baseline&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Parse documents (test multiple parsers for PDFs)&lt;/li&gt;
&lt;li&gt;Chunk with context headers&lt;/li&gt;
&lt;li&gt;Generate 50-100 synthetic eval questions&lt;/li&gt;
&lt;li&gt;Build basic retriever&lt;/li&gt;
&lt;li&gt;Measure Recall@10 (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;Write down the number&lt;/li&gt;
&lt;/ol&gt;
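
&lt;p&gt;Steps 3 and 5 are the ones teams skip, so here's a hedged sketch of both. The llm.generate and parse_json calls are the same assumed interfaces as the eval_answer example later in this post, and the chunk shape (a dict that carries its source doc id) is an assumption, not a requirement.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import random


def generate_synthetic_evals(chunks: list, n: int = 100) -&amp;gt; list:
    """Step 3: ask an LLM to write the question each sampled chunk answers."""
    eval_set = []
    for chunk in random.sample(chunks, min(n, len(chunks))):
        result = llm.generate(f"""
Write one realistic user question that is answered by this passage.
Return JSON: {{"question": "..."}}

Passage:
{chunk['content']}
""")
        question = parse_json(result)["question"]
        # The chunk's source document is the expected retrieval target
        eval_set.append({"question": question, "expected_doc_id": chunk["doc_id"]})
    return eval_set


def recall_at_10(retriever, eval_set: list) -&amp;gt; float:
    """Step 5: fraction of questions whose source doc appears in the top 10."""
    hits = 0
    for case in eval_set:
        top_10 = retriever(case["question"], k=10)
        if case["expected_doc_id"] in top_10:
            hits += 1
    return hits / len(eval_set)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;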

&lt;p&gt;&lt;strong&gt;Week 2-4: Apply standard fixes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Add hybrid search (BM25 + embeddings)&lt;/li&gt;
&lt;li&gt;Add reranker&lt;/li&gt;
&lt;li&gt;Measure again&lt;/li&gt;
&lt;li&gt;Compare to baseline (see the sketch after this list)&lt;/li&gt;
&lt;/ol&gt;
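
&lt;p&gt;Steps 3 and 4 are just the same measurement run twice. A quick sketch, reusing the recall_at_10 helper from above; basic_search here is a placeholder for your week 1-2 retriever.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Same eval set, two retrievers. The delta is the number that matters.
baseline = recall_at_10(basic_search, eval_set)        # week 1-2 retriever
improved = recall_at_10(search_with_rerank, eval_set)  # hybrid search + reranker

print(f"Recall@10 baseline:    {baseline:.0%}")
print(f"Recall@10 after fixes: {improved:.0%} ({improved - baseline:+.0%})")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;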

&lt;p&gt;&lt;strong&gt;Week 4+: Debug specific failures&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Break down recall by query type (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;Find worst-performing segment&lt;/li&gt;
&lt;li&gt;Fix that segment&lt;/li&gt;
&lt;li&gt;Measure again&lt;/li&gt;
&lt;/ol&gt;
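
&lt;p&gt;For step 1, tag each eval question with a type when you generate it (the query_type field below is an assumption, not something the earlier sketches produce) and compute recall per bucket:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from collections import defaultdict


def recall_by_query_type(retriever, eval_set: list, k: int = 10) -&amp;gt; dict:
    """Recall@k per query type. The worst bucket is where you dig next."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for case in eval_set:
        qtype = case.get("query_type", "untagged")  # e.g. "lookup", "how-to", "comparison"
        totals[qtype] += 1
        if case["expected_doc_id"] in retriever(case["question"], k=k):
            hits[qtype] += 1
    return {qtype: hits[qtype] / totals[qtype] for qtype in totals}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;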

&lt;p&gt;The key: measure after every change. If you can't see improvement in numbers, you're guessing.&lt;/p&gt;


&lt;h2&gt;
  
  
  When to measure answer quality
&lt;/h2&gt;

&lt;p&gt;Only after retrieval is solid.&lt;/p&gt;

&lt;p&gt;Once Recall@10 is above 80%, start measuring end-to-end:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;eval_answer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Use LLM-as-judge for answer evaluation.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Evaluate this answer. Return JSON:
- correct: true/false (factually accurate)
- grounded: true/false (supported by the context)
- complete: true/false (addresses the full question)

Context: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;format_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
Question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
Answer: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;parse_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;But if retrieval is broken, this eval is noise. You're just measuring how well your LLM fills in gaps it shouldn't have to fill.&lt;/p&gt;


&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;RAG quality is retrieval quality.&lt;/p&gt;

&lt;p&gt;Before you touch your prompts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Generate synthetic evals from your corpus&lt;/li&gt;
&lt;li&gt;Measure Recall@10&lt;/li&gt;
&lt;li&gt;Add hybrid search&lt;/li&gt;
&lt;li&gt;Add a reranker&lt;/li&gt;
&lt;li&gt;Fix your chunking&lt;/li&gt;
&lt;li&gt;Measure again&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The fixes are straightforward. The impact is not.&lt;/p&gt;



&lt;p&gt;&lt;em&gt;This is Part 1 of a series on production AI systems. Next: how to know when to fix your prompts vs. build an evaluator.&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  About me
&lt;/h2&gt;

&lt;p&gt;I help B2B SaaS companies ship production AI in 6 weeks.&lt;/p&gt;

&lt;p&gt;If you're building RAG and want a second set of eyes, I do free AI Teardowns — a 30-45 min video showing exactly where your pipeline is breaking and how to fix it.&lt;/p&gt;

&lt;p&gt;No pitch. Just clarity.&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
        &lt;div class="c-embed__cover"&gt;
          &lt;a href="https://animanovalabs.com/" class="c-link align-middle" rel="noopener noreferrer"&gt;
            &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fanimanovalabs.com%2Fog-image.png" height="420" class="m-0" width="800"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="c-embed__body"&gt;
        &lt;h2 class="fs-xl lh-tight"&gt;
          &lt;a href="https://animanovalabs.com/" rel="noopener noreferrer" class="c-link"&gt;
            AI Implementation for B2B SaaS | AnimaNova Labs | AnimaNova Labs
          &lt;/a&gt;
        &lt;/h2&gt;
          &lt;p class="truncate-at-3"&gt;
            Ship production AI features in 6 weeks. For B2B SaaS companies who need AI but can't hire fast enough. No $300K engineer. No 6-month timeline.
          &lt;/p&gt;
        &lt;div class="color-secondary fs-s flex items-center"&gt;
            &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fanimanovalabs.com%2Ficon.svg%3Ficon.056_r5p2xm~fh.svg" width="1024" height="1024"&gt;
          animanovalabs.com
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;



</description>
      <category>ai</category>
      <category>rag</category>
      <category>machinelearning</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
