<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nazar Boyko</title>
    <description>The latest articles on DEV Community by Nazar Boyko (@nazar_boyko).</description>
    <link>https://dev.to/nazar_boyko</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1875383%2Fbaa58f3a-24b4-4cd1-bea6-034acfb76210.jpg</url>
      <title>DEV Community: Nazar Boyko</title>
      <link>https://dev.to/nazar_boyko</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nazar_boyko"/>
    <language>en</language>
    <item>
      <title>AI Code Review: Helpful Assistant Or False Confidence Machine?</title>
      <dc:creator>Nazar Boyko</dc:creator>
      <pubDate>Tue, 02 Jun 2026 05:03:21 +0000</pubDate>
      <link>https://dev.to/nazar_boyko/ai-code-review-helpful-assistant-or-false-confidence-machine-mp2</link>
      <guid>https://dev.to/nazar_boyko/ai-code-review-helpful-assistant-or-false-confidence-machine-mp2</guid>
      <description>&lt;p&gt;You open a pull request. Thirty seconds later, an AI reviewer drops a comment: &lt;em&gt;"Looks good to me. No issues found."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You feel a tiny chemical reward. Approval. Speed. You're one click closer to merging. The cognitive cost of waiting for a human reviewer just got compressed into half a minute, and the diff you spent two hours wrestling with is now blessed by a system that has read more code than any single engineer alive.&lt;/p&gt;

&lt;p&gt;Then a week later, the same code ships a subtle authorization bug to production, and you're staring at an incident channel wondering how that "Looks good to me" survived a real review.&lt;/p&gt;

&lt;p&gt;This is the question that follows every AI code reviewer around like a shadow. Is it a helpful assistant - a faster, calmer, more patient version of the senior engineer who used to leave you twelve comments before lunch? Or is it a false confidence machine - a tool that says reassuring things about code it doesn't fully understand, and convinces you to merge anyway?&lt;/p&gt;

&lt;p&gt;The honest answer is &lt;em&gt;both&lt;/em&gt;, and which one you get depends almost entirely on how you set up the workflow around it. Let's break down the five angles that actually matter - what these reviewers catch well, where they fail on correctness, where they fail on security, where they make things up, and how to design a review process that gets the upside without buying the downside.&lt;/p&gt;

&lt;h2&gt;What AI Code Review Is Actually Good At&lt;/h2&gt;

&lt;p&gt;Before we pile on the failure modes, it's worth being fair about what these tools do well - because most teams hire them for the right reasons and then forget what those reasons were.&lt;/p&gt;

&lt;p&gt;AI reviewers are excellent at the surface layer of code review. The kind of comments that good engineers stop making out loud because they got tired of writing them but still wish someone would catch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Inconsistent naming inside the same diff. You called it &lt;code&gt;userId&lt;/code&gt; in one function and &lt;code&gt;user_id&lt;/code&gt; in the next.&lt;/li&gt;
&lt;li&gt;Style drift. A trailing comma here, a missing one there, a callback where the rest of the file uses async/await.&lt;/li&gt;
&lt;li&gt;Obvious nullability holes. You destructured &lt;code&gt;response.data.user&lt;/code&gt; two lines after a &lt;code&gt;try&lt;/code&gt; block whose call could have returned &lt;code&gt;undefined&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Forgotten error handling. The new endpoint catches the database error but swallows the auth error two lines below.&lt;/li&gt;
&lt;li&gt;Dead code, unused variables, imports that no longer point at anything.&lt;/li&gt;
&lt;li&gt;Documentation that contradicts the function it's documenting because the signature changed and the JSDoc didn't.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These comments aren't glamorous, but they used to consume real review bandwidth from senior engineers. Pushing them onto a tool that doesn't get tired, doesn't get cranky, and doesn't write &lt;em&gt;"as I've mentioned in the last four reviews..."&lt;/em&gt; in the comment thread is a clear win.&lt;/p&gt;

&lt;p&gt;The second thing AI reviewers are good at is &lt;strong&gt;mechanical pattern matching against well-known antipatterns.&lt;/strong&gt; The same patterns that have been on the OWASP Top 10 for fifteen years. The same patterns that have been in every senior interview question for ten years. The same patterns that show up in a thousand training-set blog posts.&lt;/p&gt;

&lt;p&gt;If you write a SQL query by concatenating a string with &lt;code&gt;req.query.user&lt;/code&gt;, an AI reviewer is going to spot it. If you skip CSRF protection on a state-changing endpoint, it'll spot it. If you log a raw password, it'll spot it. If you commit a JWT secret, it'll spot it.&lt;/p&gt;

&lt;p&gt;What both of these categories have in common is that &lt;em&gt;the code reveals the problem&lt;/em&gt;. The function looks suspicious. The diff is the evidence. The reviewer doesn't need to know what your system does - it only needs to read the lines you handed it.&lt;/p&gt;

&lt;p&gt;This is the half of code review that AI is genuinely changing. And if your team treats AI review as &lt;em&gt;exactly this much&lt;/em&gt; and no more, you'll get a real productivity win with very little risk.&lt;/p&gt;

&lt;p&gt;The problem starts the moment you start expecting it to do the other half.&lt;/p&gt;

&lt;h2&gt;The Other Half: Correctness, Intent, And Where Bugs Actually Live&lt;/h2&gt;

&lt;p&gt;Real bugs almost never live in the code you can see in the diff. They live in the relationship between the code in the diff and the rest of the system.&lt;/p&gt;

&lt;p&gt;Think about the last few real bugs your team shipped. Walk through them. How many of them were caught by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A reviewer noticing that the new endpoint duplicated logic from another endpoint, and now both versions disagreed?&lt;/li&gt;
&lt;li&gt;A reviewer remembering that the field &lt;code&gt;account.status&lt;/code&gt; could be &lt;code&gt;pending_review&lt;/code&gt; &lt;em&gt;only&lt;/em&gt; in tenants migrated before 2022, and the new code didn't handle that case?&lt;/li&gt;
&lt;li&gt;A reviewer realizing that the new background job ran every five minutes, but the table it queried got partitioned by date last quarter, and the query was about to start scanning the entire archive partition?&lt;/li&gt;
&lt;li&gt;A reviewer noticing that the new logic was correct in isolation, but the caller already did the same check three frames up, and now the count was off by one?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of those bugs are visible in the diff. The diff looks fine. The diff &lt;em&gt;is&lt;/em&gt; fine, locally. The bug lives in the conversation between the diff and everything outside it - the other twenty thousand lines of your codebase, the migration that ran two years ago, the operational reality of how the system runs in production.&lt;/p&gt;

&lt;p&gt;This is where AI code review hits its hardest ceiling, and it's worth being precise about why. It's not that the model is bad. It's that &lt;em&gt;the information needed to catch the bug isn't in the prompt&lt;/em&gt;. The reviewer is reading the diff. It might be reading the surrounding file. It might even be reading some retrieved chunks of your codebase. But it isn't reading:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Slack thread from two years ago where the team decided that all auth checks happen at the middleware layer, not the controller.&lt;/li&gt;
&lt;li&gt;The runbook that says this service runs in three regions and the new code's assumption of a single timezone breaks two of them.&lt;/li&gt;
&lt;li&gt;The product manager's verbal decision that "soft delete" means hidden-from-list-but-still-billable, not gone.&lt;/li&gt;
&lt;li&gt;The unwritten rule that any &lt;code&gt;for await&lt;/code&gt; in this codebase needs a concurrency limit because the database connection pool is sized for ten.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A senior reviewer who's worked on your team for two years catches these because the system lives in their head. The AI catches them only if some surface artifact of those decisions made it into the code or into the context window. Most of the time, none of it did.&lt;/p&gt;

&lt;p&gt;The failure mode here isn't &lt;em&gt;wrong&lt;/em&gt; feedback. It's &lt;em&gt;confident absent&lt;/em&gt; feedback. The AI reviewer doesn't flag the bug because it doesn't see the bug. And worse, it tells you &lt;em&gt;"no issues found"&lt;/em&gt;, which sounds like an affirmative review, not an admission of partial coverage. A human who didn't know the codebase well would at least hesitate. A model doesn't hesitate. That's the false confidence.&lt;/p&gt;

&lt;p&gt;The fix isn't to expect the model to do more. It's to stop treating its silence as evidence.&lt;/p&gt;

&lt;h2&gt;Security Review Is The Sharpest Edge Of This Problem&lt;/h2&gt;

&lt;p&gt;Security is where the gap between "what looks suspicious in the diff" and "what's actually exploitable" gets the widest, the fastest.&lt;/p&gt;

&lt;p&gt;Consider three diffs:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;diff-1.py&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@app.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/users/&amp;lt;user_id&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_user&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM users WHERE id = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetchone&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a SQL injection. Every AI reviewer alive will catch it. It's in the training data of every model from every era. It's also unlikely to ship in any team that has any review at all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;diff-2.py&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@app.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/users/&amp;lt;user_id&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@authenticated&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_user&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_one&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_dict&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This looks clean. There's an &lt;code&gt;@authenticated&lt;/code&gt; decorator. The query is parameterized. The return is structured. An AI reviewer reads this and says &lt;em&gt;"looks good"&lt;/em&gt; - and on the surface, it does.&lt;/p&gt;

&lt;p&gt;But this might be an Insecure Direct Object Reference (IDOR) vulnerability. There's no check that the &lt;em&gt;authenticated user&lt;/em&gt; is allowed to fetch &lt;em&gt;this particular user_id&lt;/em&gt;. Anyone who's logged in can read anyone else's profile.&lt;/p&gt;

&lt;p&gt;The model can't catch this without knowing the &lt;em&gt;intent&lt;/em&gt; of your authorization model. Does your system let authenticated users read all other users? Maybe - it's a social app. Or maybe not - it's a healthcare app and each user can only read themselves and their dependents. The diff doesn't say which. The reviewer doesn't know which. So it defaults to a comfortable answer, which is silence.&lt;/p&gt;
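
&lt;p&gt;To make the missing check concrete, here is a minimal, framework-free sketch of the object-level authorization that &lt;code&gt;diff-2.py&lt;/code&gt; leaves out. The &lt;code&gt;can_read_user&lt;/code&gt; helper, the &lt;code&gt;dependents&lt;/code&gt; mapping, and the healthcare-style rule are assumptions made up for illustration; the right rule depends entirely on your product's authorization model.&lt;/p&gt;

```python
# Hypothetical policy: a user may read their own record and their dependents'.
# The names and the rule are illustrative, not taken from any real codebase.

def can_read_user(requester_id, target_id, dependents):
    """Return True if requester may read target under the assumed policy."""
    if requester_id == target_id:
        return True
    return target_id in dependents.get(requester_id, set())


def get_user(requester_id, target_id, users, dependents):
    """Fetch a user record only after the object-level check passes."""
    if not can_read_user(requester_id, target_id, dependents):
        # Being authenticated is not enough; reject cross-user reads.
        raise PermissionError("not allowed to read this user")
    user = users.get(target_id)
    if user is None:
        raise KeyError(target_id)
    return user


users = {"u1": {"name": "Ana"}, "u2": {"name": "Ben"}, "u3": {"name": "Cy"}}
dependents = {"u1": {"u2"}}  # u1 is the guardian of u2

print(get_user("u1", "u2", users, dependents))  # guardian reads dependent
```

&lt;p&gt;Under a social-app policy the same check would look different or disappear entirely, which is why a reviewer without that context cannot tell whether the original diff is a bug.&lt;/p&gt;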

&lt;p&gt;&lt;strong&gt;&lt;code&gt;diff-3.py&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@app.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/teams/&amp;lt;team_id&amp;gt;/invite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;methods&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;POST&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="nd"&gt;@authenticated&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;invite_member&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;team_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;email&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;team&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;teams&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_one&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;team_id&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;team&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ok&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same shape, harder to spot. Even if the model flags the missing check that the &lt;em&gt;user&lt;/em&gt; is authorized to act on &lt;code&gt;team_id&lt;/code&gt;, it might miss that &lt;code&gt;request.json["email"]&lt;/code&gt; is not validated, that the same endpoint can be used to enumerate which emails are already members based on the response shape, or that the rate limit on this endpoint is missing because the team's existing pattern is to put rate limits on the gateway and this new endpoint is registered through a different path. None of that is visible in the seven lines you can see.&lt;/p&gt;

&lt;p&gt;This pattern repeats across every security category. Authorization bugs depend on &lt;em&gt;the rules of your domain&lt;/em&gt;. Race conditions depend on the &lt;em&gt;concurrency model your runtime actually uses in production&lt;/em&gt;. Cryptographic mistakes depend on which threat model you've accepted. Secrets handling depends on which systems the data flows through after this code returns. None of it is in the diff. All of it is in your team's head.&lt;/p&gt;

&lt;p&gt;The AI reviewer is genuinely useful as a &lt;em&gt;first pass&lt;/em&gt; on security - it'll catch the patterns that match the training data, the OWASP-shaped problems, the obvious dumb stuff. The danger is treating that first pass as a &lt;em&gt;security review&lt;/em&gt;. It isn't. Real security review needs someone who knows what your application is for, what data it holds, who's allowed to do what, and what an attacker would actually want.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Warning&lt;/strong&gt;&lt;br&gt;
When the AI reviewer says "no security issues found", read it as &lt;em&gt;"no patterns from my training set matched the diff"&lt;/em&gt;. That is not the same sentence.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;The Hallucination Problem In Review Specifically&lt;/h2&gt;

&lt;p&gt;Hallucination in code generation is a known and well-discussed failure mode. Models invent function names, package versions, config flags, API endpoints. You've probably watched a model confidently import a module that doesn't exist.&lt;/p&gt;

&lt;p&gt;What's less discussed is what hallucination looks like in &lt;em&gt;code review&lt;/em&gt;. The shape is different - and arguably more dangerous - because the model isn't writing code you'll run. It's making &lt;em&gt;assertions&lt;/em&gt; you'll act on.&lt;/p&gt;

&lt;p&gt;A reviewing model can hallucinate in three distinct ways, and each one bites in a different way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phantom references.&lt;/strong&gt; The reviewer comments: &lt;em&gt;"This is similar to the pattern in &lt;code&gt;src/auth/middleware.py&lt;/code&gt; where we already handle this case."&lt;/em&gt; You open &lt;code&gt;src/auth/middleware.py&lt;/code&gt;. There is no such pattern. There might not even be a file at that path. The model invented a plausible reference because that's what its training distribution rewards.&lt;/p&gt;

&lt;p&gt;You can spot this one if you check, but the dangerous version is when the reference is &lt;em&gt;almost&lt;/em&gt; real - the file exists, the function name is close but slightly wrong, the pattern is similar but not identical. You scan it, decide &lt;em&gt;"yeah, that looks consistent"&lt;/em&gt;, and merge. The reviewer has just convinced you of a fact that isn't quite true.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phantom guarantees.&lt;/strong&gt; The reviewer comments: &lt;em&gt;"The framework handles input validation for you here, so you don't need to add it explicitly."&lt;/em&gt; Sometimes this is true. Sometimes the framework has &lt;em&gt;some&lt;/em&gt; validation but not the kind that matters for this endpoint. Sometimes the framework had this feature in v3 and you're on v5 and they removed it. Sometimes the framework never had it and the model is mixing it up with a different framework's docs.&lt;/p&gt;

&lt;p&gt;You're now relying on an assertion about a system you didn't verify. The diff didn't change. The vulnerability is now live, and the reasoning that justified shipping it is a sentence that sounded confident in a review thread.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phantom approvals.&lt;/strong&gt; The reviewer reads the diff, doesn't understand a chunk of it, and chooses safety: it says nothing about that chunk, and says positive things about the parts it does understand. The overall summary lands as &lt;em&gt;"looks good"&lt;/em&gt;. You read the summary. You don't notice that the most complex part of the diff was never actually addressed.&lt;/p&gt;

&lt;p&gt;This is the quietest failure mode and the easiest to miss. The reviewer didn't say anything wrong. It just didn't say anything about the part that mattered. And because there's no visual indicator of "I read this and have no comment" vs "I didn't really understand this and have no comment", you can't tell the difference.&lt;/p&gt;

&lt;p&gt;The defence against all three is the same and it's old-fashioned: don't trust assertions. When the AI reviewer references a file, open the file. When it claims the framework does X, find the docs. When the diff has a tricky section and the AI didn't comment on it, &lt;em&gt;that's a signal&lt;/em&gt;, not a relief.&lt;/p&gt;
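
&lt;p&gt;The first of those checks is cheap to automate. The sketch below assumes review comments are available as plain strings and that file references look like relative paths with an extension; the regex and the &lt;code&gt;missing_references&lt;/code&gt; helper are illustrative and not part of any review tool.&lt;/p&gt;

```python
import re
from pathlib import Path

# Rough heuristic for path-like tokens such as src/auth/middleware.py.
PATH_RE = re.compile(r"\b[\w.-]+(?:/[\w.-]+)+\.\w+\b")


def missing_references(comment, repo_root):
    """Return path-like tokens in the comment that do not exist under repo_root."""
    root = Path(repo_root)
    return [p for p in PATH_RE.findall(comment) if not (root / p).is_file()]


comment = "This mirrors the pattern in src/auth/middleware.py."
# Lists every referenced path that is not a file under the current directory.
print(missing_references(comment, "."))
```

&lt;p&gt;This only catches references to files that don't exist. The &lt;em&gt;almost&lt;/em&gt; real reference, where the file is there but the pattern isn't, still needs a human to open the file and read it.&lt;/p&gt;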

&lt;p&gt;This is the discipline that turns AI review from a confidence machine into a useful one. The model is allowed to say things you'd never have noticed. It's not allowed to be the last layer of trust on anything that matters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feh1rxku15r67s8bz4oxz.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feh1rxku15r67s8bz4oxz.webp" alt="Venn diagram comparison: AI Reviewer catches style consistency, naming drift, obvious nullability, dead code, and classic OWASP patterns; Human Reviewer covers authorization rules, business invariants, cross-service contracts, and operational reality. The thin overlap holds local correctness and surface-level smells." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Designing A Review Workflow That Actually Works&lt;/h2&gt;

&lt;p&gt;Once you accept that AI review is great at one half of the job and structurally incapable of the other half, the workflow design becomes obvious. You arrange the two halves in series, not in parallel, and you give each one the job it can actually do.&lt;/p&gt;

&lt;p&gt;Here's the shape of a workflow that gets the upside.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1 - AI reviewer runs first, as a lint pass.&lt;/strong&gt; The moment a pull request opens, the AI reviewer sweeps it. It flags style issues, naming inconsistencies, obvious antipatterns, classic security patterns, missing error handling, contradictory documentation. It posts these as inline comments on the diff, the same way a linter would.&lt;/p&gt;

&lt;p&gt;The author addresses these &lt;em&gt;before&lt;/em&gt; any human reviewer is paged. This is the whole productivity win - the senior engineer never has to write &lt;em&gt;"please use the existing util for this"&lt;/em&gt; again. The boring layer is done. The author shows up to human review with a diff that's already past the obvious objections.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2 - Author writes the description that the AI didn't.&lt;/strong&gt; This is the step that gets cut and shouldn't. The AI is reviewing the &lt;em&gt;code&lt;/em&gt;. The human reviewer needs to review the &lt;em&gt;change&lt;/em&gt;. A change is the code plus the intent plus the constraints plus the trade-offs the author made.&lt;/p&gt;

&lt;p&gt;The author writes - in plain English - what this change does, why now, what alternatives they considered, what the rollout plan is, what they're explicitly &lt;em&gt;not&lt;/em&gt; trying to solve. If you've never read Tigran Sloyan's or Will Larson's writing on this, the short version is: a PR description is for the next person who reads the commit in three years, not for you today.&lt;/p&gt;

&lt;p&gt;A good description here is also the thing that catches the author's &lt;em&gt;own&lt;/em&gt; hallucinations. If you can't explain in two paragraphs why this change is correct, you've found the actual problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3 - Human reviewer focuses on what the AI can't see.&lt;/strong&gt; With the lint layer handled and the description written, the human reviewer's attention can land on the real questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does this change match what we agreed to build?&lt;/li&gt;
&lt;li&gt;Does it interact correctly with the rest of the system?&lt;/li&gt;
&lt;li&gt;Are the security invariants of &lt;em&gt;this product&lt;/em&gt; preserved, given who the users are and what data they hold?&lt;/li&gt;
&lt;li&gt;Is this the right unit of change to ship, or should it be split, sequenced, or guarded behind a feature flag?&lt;/li&gt;
&lt;li&gt;What are we not going to be able to undo after this merges?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are the questions that need a person who knows the codebase and the product. They're the questions you wanted the AI to handle and it can't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4 - Critical paths get an explicit "AI didn't review this" gate.&lt;/strong&gt; Any change to authentication, authorization, payments, data deletion, migrations, or anything else where the worst-case outcome is "the company gets fined or sued" carries a flag in your review template: &lt;em&gt;"This change touches a critical path. AI review is not sufficient. Require N human approvals with one being from ."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This isn't because the AI's review is &lt;em&gt;wrong&lt;/em&gt; on critical paths. It might be perfectly fine. It's because the cost of being wrong is asymmetric, and you don't let asymmetric risks ride on confidence-shaped reassurance.&lt;/p&gt;
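
&lt;p&gt;One way to wire up such a gate is a small path-based check in CI. The sketch below assumes critical areas can be identified by path patterns; the patterns and the &lt;code&gt;touches_critical_path&lt;/code&gt; name are hypothetical and would need to match your repository layout.&lt;/p&gt;

```python
from fnmatch import fnmatch

# Hypothetical patterns; adjust to wherever auth, payments, and migrations live.
CRITICAL_PATTERNS = [
    "src/auth/*",
    "src/payments/*",
    "migrations/*",
]


def touches_critical_path(changed_files, patterns=CRITICAL_PATTERNS):
    """Return the changed files that match any critical pattern."""
    return [
        path
        for path in changed_files
        if any(fnmatch(path, pattern) for pattern in patterns)
    ]


changed = ["src/auth/middleware.py", "docs/README.md"]
flagged = touches_critical_path(changed)
if flagged:
    print("Critical path touched, extra human approval required:", flagged)
```

&lt;p&gt;If your hosting platform supports code-owner rules, those may cover the same ground natively. What matters is that the gate is driven by what the change touches, not by what the AI reviewer said about it.&lt;/p&gt;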

&lt;p&gt;&lt;strong&gt;Step 5 - Treat AI review output as evidence, not verdict.&lt;/strong&gt; When the AI says &lt;em&gt;"no issues found"&lt;/em&gt;, that is &lt;strong&gt;not&lt;/strong&gt; a green light. It's an absence of red flags from a system that can only see a fraction of them. Your team's mental model of an AI review needs to be: &lt;em&gt;"the obvious problems are probably caught; the subtle ones are not addressed either way".&lt;/em&gt; The decision to merge still belongs to a human and a passing test suite.&lt;/p&gt;

&lt;p&gt;This shape works. It doesn't deliver the "agents review your PRs and you ship faster forever" narrative - but it delivers something better, which is fewer comments your senior engineers hate writing and more attention on the parts of review that actually prevent incidents.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuq1yi6id24t84kzhdirm.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuq1yi6id24t84kzhdirm.webp" alt="Five-stage horizontal review pipeline: PR opened, AI lint pass, author writes description (catches author hallucinations), human review (focuses on what AI can't see), merge. A separate red branch requires two human approvals for critical paths." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;What This Means For The Senior Engineer&lt;/h2&gt;

&lt;p&gt;If you're a senior engineer reading this, the practical shift is small and worth doing deliberately.&lt;/p&gt;

&lt;p&gt;You're not going to stop using AI reviewers - they're too useful at the lint layer to give up. But you're going to stop &lt;em&gt;deferring&lt;/em&gt; to them on anything that matters. When the AI says &lt;em&gt;"looks good"&lt;/em&gt;, you're going to read it as &lt;em&gt;"the cheap problems aren't here, what about the expensive ones"&lt;/em&gt;. When the AI references something in the codebase, you're going to confirm it. When the AI gives you a confident-sounding explanation, you're going to ask the question &lt;em&gt;"how would I know this was wrong?"&lt;/em&gt; before you accept it.&lt;/p&gt;

&lt;p&gt;You're also going to write better PR descriptions, because once the AI handles the lint layer, the bottleneck in review moves to the human reviewer's ability to &lt;em&gt;understand the change&lt;/em&gt;. That's a writing problem, not a coding problem, and it's the one part of code review that has gotten &lt;em&gt;more&lt;/em&gt; important since AI showed up, not less.&lt;/p&gt;

&lt;p&gt;And you're going to push back when leadership says &lt;em&gt;"we have AI review now, we can have fewer humans on it"&lt;/em&gt;. The answer is no, and the reason isn't sentiment - it's that the half of the job AI doesn't do is the half where the bugs that cost real money live. You'd rather have one AI reviewer plus a careful senior than two AI reviewers and a fast merge button.&lt;/p&gt;

&lt;p&gt;The question in the title of this piece - assistant or false confidence machine - isn't really about the tool. It's about the team. The same model, in the same codebase, on the same diff, is one or the other depending on whether a human is treating its output as evidence or as a verdict.&lt;/p&gt;

&lt;p&gt;If you remember anything from this, remember that distinction. AI review is evidence. Humans make the verdict. The day you flip those two roles is the day your incident channel gets a lot louder.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.nazarboyko.com/articles/ai-code-review-helpful-assistant-or-false-confidence-machine" rel="noopener noreferrer"&gt;nazarboyko.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>codereview</category>
      <category>aiassistant</category>
    </item>
    <item>
      <title>Observability For AI Features In Production</title>
      <dc:creator>Nazar Boyko</dc:creator>
      <pubDate>Tue, 02 Jun 2026 04:48:40 +0000</pubDate>
      <link>https://dev.to/nazar_boyko/observability-for-ai-features-in-production-436e</link>
      <guid>https://dev.to/nazar_boyko/observability-for-ai-features-in-production-436e</guid>
      <description>&lt;p&gt;AI features are different from normal application features. A normal API endpoint usually has clear behavior: you send input, you get output, and you can test exact values. You can monitor latency, errors, CPU, memory, database queries, and queue failures. The contract is tight, and when it breaks, the failure has a shape you can recognize.&lt;/p&gt;

&lt;p&gt;AI features are messier. The same user question can produce slightly different answers. The model may call a tool, retrieval may return weak documents, the prompt may grow too large, and token cost can spike without warning. A small prompt change may quietly reduce quality, and a model upgrade may improve one workflow while breaking another.&lt;/p&gt;

&lt;p&gt;So observability for AI is not only "did the request fail?" You also need to ask: did the model receive the right context, did retrieval find the right documents, did the model call the right tools? Was the answer useful, was it safe, was it too slow, was it too expensive? Could we debug this later? If you cannot answer these questions, your AI feature is not production-ready yet.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0urew7cfpae8cl0nitdl.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0urew7cfpae8cl0nitdl.webp" alt="AI Request Lifecycle flow diagram: user request flows through prompt builder, retrieval, model, tool calls, response, and feedback, with observability probes at each step." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Start with the AI request lifecycle
&lt;/h2&gt;

&lt;p&gt;Before adding dashboards, define the lifecycle of one AI request. A common AI feature looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User input
  -&amp;gt; validation
  -&amp;gt; prompt construction
  -&amp;gt; optional retrieval
  -&amp;gt; model call
  -&amp;gt; optional tool call
  -&amp;gt; response validation
  -&amp;gt; response shown to user
  -&amp;gt; user feedback
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each step can fail in a different way. Input validation can fail because the user asks for unsupported behavior. Prompt construction can fail because the template is missing variables. Retrieval can fail because documents are missing, stale, or irrelevant. The model call can fail because of latency, provider errors, rate limits, or poor output. Tool calls can fail because an external API times out or returns an error. Response validation can fail because the answer does not match the required schema. And user feedback can reveal that the answer was technically valid but not useful.&lt;/p&gt;

&lt;p&gt;That means your logs should not only say:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"success"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;They should tell the story of the request.&lt;/p&gt;
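&lt;p&gt;One way to make a log tell that story is to emit one structured event per lifecycle step, all sharing a request ID. The sketch below is only an illustration: the &lt;code&gt;StepEvent&lt;/code&gt; shape, the step names, and both helper functions are assumptions, not a standard format.&lt;/p&gt;

```typescript
// Hypothetical event shape: one record per lifecycle step, joined by requestId.
type StepName =
  | 'validation'
  | 'prompt_construction'
  | 'retrieval'
  | 'model_call'
  | 'tool_call'
  | 'response_validation';

type StepEvent = {
  requestId: string;
  step: StepName;
  status: 'ok' | 'failed';
  durationMs: number;
  error?: string;
};

// Returns the first step that failed, or undefined when every step succeeded.
function firstFailure(events: StepEvent[]): StepEvent | undefined {
  return events.find((e) => e.status === 'failed');
}

// Renders the events of one request as a single readable line.
function describeRequest(events: StepEvent[]): string {
  return events
    .map((e) => {
      const suffix = e.error ? `: ${e.error}` : '';
      return `${e.step} ${e.status} (${e.durationMs}ms)${suffix}`;
    })
    .join(' -> ');
}
```

&lt;p&gt;With events like these, "what happened to this request?" becomes a lookup by request ID instead of a search across unrelated log lines.&lt;/p&gt;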

&lt;h2&gt;
  
  
  What to log for prompts
&lt;/h2&gt;

&lt;p&gt;Prompt logs are useful, but they must be handled carefully. Prompts can contain personal data, customer data, secrets, internal documents, or sensitive business information. Do not blindly log everything forever. A safe approach is to log structured metadata by default and store full prompts only in controlled environments or with redaction.&lt;/p&gt;

&lt;p&gt;Example prompt log:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"request_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ai_req_01HX9"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"feature"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"support_reply_assistant"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"prompt_template"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"support_reply_v4"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"template_version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"system_prompt_hash"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sha256:8f91..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"user_prompt_length"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1840&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"final_prompt_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3120&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"redaction_applied"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"created_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-03T18:40:12Z"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the hash. You do not always need to store the entire system prompt in every log; a hash plus a version is often enough to connect a production request to the exact prompt that produced it.&lt;/p&gt;
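&lt;p&gt;A minimal sketch of how such a hash could be produced, assuming a Node.js runtime and its built-in &lt;code&gt;node:crypto&lt;/code&gt; module. The function name &lt;code&gt;hashPrompt&lt;/code&gt; and the &lt;code&gt;sha256:&lt;/code&gt; prefix are illustrative choices that mirror the log example, not a required format.&lt;/p&gt;

```typescript
import { createHash } from 'node:crypto';

// Returns a label of the form "sha256:" followed by the hex digest of the prompt text.
// The same text always yields the same label, so logs can be joined to a prompt version.
function hashPrompt(promptText: string): string {
  const digest = createHash('sha256').update(promptText, 'utf8').digest('hex');
  return `sha256:${digest}`;
}
```

&lt;p&gt;Any change to the prompt text, even whitespace, produces a different hash, so it helps to hash the exact string that is sent to the model.&lt;/p&gt;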

&lt;p&gt;For debugging, you may also store redacted prompt snapshots:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;redactPrompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;prompt&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;[&lt;/span&gt;&lt;span class="sr"&gt;A-Z0-9._%+-&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;+@&lt;/span&gt;&lt;span class="se"&gt;[&lt;/span&gt;&lt;span class="sr"&gt;A-Z0-9.-&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;+&lt;/span&gt;&lt;span class="se"&gt;\.[&lt;/span&gt;&lt;span class="sr"&gt;A-Z&lt;/span&gt;&lt;span class="se"&gt;]{2,}&lt;/span&gt;&lt;span class="sr"&gt;/gi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[EMAIL]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\b\d{4}[&lt;/span&gt;&lt;span class="sr"&gt;- &lt;/span&gt;&lt;span class="se"&gt;]?\d{4}[&lt;/span&gt;&lt;span class="sr"&gt;- &lt;/span&gt;&lt;span class="se"&gt;]?\d{4}[&lt;/span&gt;&lt;span class="sr"&gt;- &lt;/span&gt;&lt;span class="se"&gt;]?\d{4}\b&lt;/span&gt;&lt;span class="sr"&gt;/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[CARD_NUMBER]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/sk-&lt;/span&gt;&lt;span class="se"&gt;[&lt;/span&gt;&lt;span class="sr"&gt;A-Za-z0-9_-&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;+/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[API_KEY]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is simple, not perfect. In real systems, redaction should be layered and tested. But the principle is important: observability should not become a data leak.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to log for retrieval
&lt;/h2&gt;

&lt;p&gt;RAG features need retrieval observability. If an AI answer is bad, the model may not be the problem. The retrieved context may be weak, the index may be stale, or the user's question may sit outside the knowledge base entirely. Log what retrieval returned:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"request_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ai_req_01HX9"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"retrieval"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"index"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"support_docs_prod"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"refund after annual renewal"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"top_k"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"documents"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"doc_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"refund-policy-2026"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Refund Policy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.82&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-04-12"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"doc_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"billing-faq"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Billing FAQ"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-02-01"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This helps you answer practical questions. Did we retrieve the correct policy? Was the document stale? Did the top result have a low score? Did the user ask a question outside the knowledge base? You can also track retrieval quality over time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;RetrievalMetric&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;topScore&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;resultCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;clickedDocumentId&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;userFeedback&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;helpful&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;not_helpful&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;detectWeakRetrieval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;RetrievalMetric&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resultCount&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;topScore&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.65&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Weak retrieval should be visible. Otherwise, teams blame the model when the actual issue is missing content.&lt;/p&gt;
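&lt;p&gt;To make that visible on a dashboard, one option is to report the share of requests with weak retrieval per time window. This is a sketch under assumptions: the &lt;code&gt;RetrievalSample&lt;/code&gt; type and &lt;code&gt;weakRetrievalRate&lt;/code&gt; are hypothetical names, and the 0.65 default only reuses the threshold from the example above. Similarity scores are not comparable across embedding models, so the threshold needs tuning per index.&lt;/p&gt;

```typescript
// Hypothetical per-request sample, reduced to the two fields the rule needs.
type RetrievalSample = { topScore: number; resultCount: number };

// Same rule as the earlier example: no results, or a top score under the threshold.
function isWeak(sample: RetrievalSample, minScore: number): boolean {
  return sample.resultCount === 0 || minScore > sample.topScore;
}

// Fraction of requests in a window whose retrieval looked weak (0 for an empty window).
function weakRetrievalRate(samples: RetrievalSample[], minScore = 0.65): number {
  if (samples.length === 0) {
    return 0;
  }
  const weak = samples.filter((s) => isWeak(s, minScore)).length;
  return weak / samples.length;
}
```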

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmclqmfqz1fa3mi6szsov.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmclqmfqz1fa3mi6szsov.webp" alt="Technical RAG observability diagram: a user question hits a vector search index, returns top-k documents with similarity scores, version stamps, and freshness badges, then feeds the model with warning icons on stale or low-similarity results." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What to log for tool calls
&lt;/h2&gt;

&lt;p&gt;AI agents often call tools: database lookups, internal APIs, search services, code runners, ticket systems, or deployment systems. Tool calls need serious observability because they can change real systems.&lt;/p&gt;

&lt;p&gt;Log the tool name, the input schema version, the sanitized arguments, the result status, the duration, the retry count, the authorization context, and whether the tool was read-only or write-enabled. Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"request_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ai_req_01HX9"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tool_call"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"tool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"get_customer_subscription"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"tool_version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"v2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"read_only"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"duration_ms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;182&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"success"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"arguments_redacted"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"customer_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cus_[REDACTED]"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For write tools, add extra guardrails:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cancel_subscription"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"write"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"requires_human_confirmation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"confirmation_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"confirm_7831"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"executed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A production AI assistant should not casually execute dangerous actions because the model "thought it was right." Read-only tools are safer. Write tools need approvals, audit logs, permissions, and rollback plans.&lt;/p&gt;
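&lt;p&gt;The approval requirement can be enforced in code rather than left to the prompt. A minimal sketch, assuming confirmations are tracked as a list of approved IDs; &lt;code&gt;gateToolCall&lt;/code&gt;, &lt;code&gt;ToolRequest&lt;/code&gt;, and the reason strings are hypothetical names for illustration.&lt;/p&gt;

```typescript
type ToolMode = 'read_only' | 'write';

type ToolRequest = {
  tool: string;
  mode: ToolMode;
  confirmationId?: string; // set only after a human approved this exact action
};

type GateDecision = { allowed: boolean; reason: string };

// Read-only tools pass. Write tools run only with a confirmation ID that was approved.
function gateToolCall(request: ToolRequest, approvedIds: string[]): GateDecision {
  if (request.mode === 'read_only') {
    return { allowed: true, reason: 'read_only' };
  }
  if (request.confirmationId === undefined) {
    return { allowed: false, reason: 'missing_confirmation' };
  }
  if (!approvedIds.includes(request.confirmationId)) {
    return { allowed: false, reason: 'confirmation_not_approved' };
  }
  return { allowed: true, reason: 'confirmed_write' };
}
```

&lt;p&gt;Logging the returned decision next to the tool call gives the audit trail a reason for every blocked or executed write.&lt;/p&gt;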

&lt;h2&gt;
  
  
  Latency: measure the full path, not only the model
&lt;/h2&gt;

&lt;p&gt;AI latency is often multi-part. A slow response may include prompt building, retrieval, model generation, tool calls, response validation, streaming delay, and frontend rendering. Track each part separately:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;AiTiming&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;requestId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;promptBuildMs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;retrievalMs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;modelMs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;toolMs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;validationMs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;totalMs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;logTiming&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;timing&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;AiTiming&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="k"&gt;void&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ai_timing&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;timing&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;}));&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A dashboard should show p50, p95, and p99 latency. Average latency hides pain: if most requests finish in two seconds but 5% take thirty, users will feel the long tail well before an average-based chart shows it.&lt;/p&gt;
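&lt;p&gt;For reference, percentiles can be computed from raw timings with the nearest-rank method. This is a sketch for small in-memory samples; the &lt;code&gt;percentile&lt;/code&gt; helper is an illustrative name, and metrics backends usually compute percentiles from histograms instead.&lt;/p&gt;

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) {
    throw new Error('percentile of an empty sample is undefined');
  }
  const sorted = [...values].sort((a, b) => a - b);
  // Multiply before dividing so whole-number inputs stay exact.
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}
```

&lt;p&gt;For 100 requests where 95 take 2,000 ms and 5 take 30,000 ms, the mean is 3,400 ms, p50 and p95 are both 2,000 ms, and p99 is 30,000 ms. Only the p99 line shows the thirty-second experience.&lt;/p&gt;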

&lt;p&gt;For streaming responses, also track time to first token:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"request_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ai_req_01HX9"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"time_to_first_token_ms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;740&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"time_to_complete_ms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8420&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Time to first token matters because users feel the product is alive when streaming starts quickly.&lt;/p&gt;
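&lt;p&gt;Both numbers can be captured with a small timer around the stream. A sketch, assuming the streaming client lets you run a callback per chunk; &lt;code&gt;StreamTimer&lt;/code&gt; is a hypothetical helper, and the clock is injectable so it can be tested without real delays.&lt;/p&gt;

```typescript
type StreamTiming = {
  time_to_first_token_ms: number | null; // null when the stream produced no tokens
  time_to_complete_ms: number;
};

class StreamTimer {
  private readonly now: () => number;
  private readonly startedAt: number;
  private firstTokenAt: number | null = null;

  constructor(now: () => number = Date.now) {
    this.now = now;
    this.startedAt = now();
  }

  // Call for every streamed chunk; only the first call is recorded.
  onToken(): void {
    if (this.firstTokenAt === null) {
      this.firstTokenAt = this.now();
    }
  }

  // Call once when the stream ends; returns fields shaped like the log example.
  finish(): StreamTiming {
    const endedAt = this.now();
    return {
      time_to_first_token_ms:
        this.firstTokenAt === null ? null : this.firstTokenAt - this.startedAt,
      time_to_complete_ms: endedAt - this.startedAt,
    };
  }
}
```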

&lt;h2&gt;
  
  
  Token usage and cost
&lt;/h2&gt;

&lt;p&gt;Token usage is not just a billing detail. It is a product health signal. A feature can become expensive because prompts include too much irrelevant context, retrieval returns too many long chunks, conversation history is not summarized, the model is too powerful for a simple task, agents call each other repeatedly, or retries happen silently. Log token usage per feature:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"feature"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pr_summary_assistant"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"example-large-model"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"input_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;18420&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"output_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1620&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"cached_input_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;9000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"estimated_cost_usd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.084&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"request_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ai_req_45KQ"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then create cost dashboards by feature, customer, team, or workflow. A useful metric is cost per successful outcome, not only cost per request. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;support_reply_assistant
- 10,000 requests
- $420 model cost
- 6,800 helpful responses
- cost per helpful response ≈ $0.062
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is much more useful than saying "we spent $420."&lt;/p&gt;
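&lt;p&gt;The arithmetic is simple enough to keep next to the dashboard query. A sketch with hypothetical names (&lt;code&gt;FeatureCostStats&lt;/code&gt;, &lt;code&gt;costPerRequest&lt;/code&gt;, &lt;code&gt;costPerHelpfulResponse&lt;/code&gt;), assuming "helpful" comes from explicit user feedback:&lt;/p&gt;

```typescript
// Hypothetical aggregate for one feature over one reporting window.
type FeatureCostStats = {
  feature: string;
  requests: number;
  modelCostUsd: number;
  helpfulResponses: number;
};

// Plain cost per request; null when the window has no requests.
function costPerRequest(stats: FeatureCostStats): number | null {
  return stats.requests === 0 ? null : stats.modelCostUsd / stats.requests;
}

// Cost divided by responses users marked helpful; null when there are none.
function costPerHelpfulResponse(stats: FeatureCostStats): number | null {
  return stats.helpfulResponses === 0 ? null : stats.modelCostUsd / stats.helpfulResponses;
}
```

&lt;p&gt;With the numbers above, cost per request is $0.042, while cost per helpful response is about $0.0618. The gap between the two is the price of the 3,200 responses that did not help.&lt;/p&gt;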

&lt;h2&gt;
  
  
  Failed generations and schema validation
&lt;/h2&gt;

&lt;p&gt;AI output can fail even when the API request succeeds. Maybe the response is not valid JSON, maybe it misses required fields, maybe it includes text when your application expects structured data. Use schema validation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;zod&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;PrSummarySchema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;changedBehavior&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
  &lt;span class="na"&gt;riskyFiles&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
  &lt;span class="na"&gt;testsRun&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
  &lt;span class="na"&gt;missingTests&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;parsePrSummary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;unknown&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;PrSummarySchema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;safeParse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;success&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Invalid AI response schema: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then log validation failures:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"event"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ai_response_validation_failed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"feature"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pr_summary_assistant"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"template_version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"example-large-model"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"missing required field: testsRun"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



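&lt;p&gt;Because each event records the template version and the model, a jump in validation failures right after a prompt or model change points to the regression quickly.&lt;/p&gt;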
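
&lt;p&gt;One way to wire the parser to that log line is a small wrapper like the sketch below. The names &lt;code&gt;RequestContext&lt;/code&gt;, &lt;code&gt;buildValidationFailureLog&lt;/code&gt;, and &lt;code&gt;parseOrLog&lt;/code&gt; are assumptions made for illustration, and the log sink is passed in so the sketch stays independent of any particular logger.&lt;/p&gt;

```typescript
// Shape of the structured log line shown above.
type ValidationFailureLog = {
  event: 'ai_response_validation_failed';
  feature: string;
  template_version: number;
  model: string;
  error: string;
};

// Hypothetical: what the caller already knows about the request.
type RequestContext = { feature: string; templateVersion: number; model: string };

function buildValidationFailureLog(ctx: RequestContext, error: string): ValidationFailureLog {
  return {
    event: 'ai_response_validation_failed',
    feature: ctx.feature,
    template_version: ctx.templateVersion,
    model: ctx.model,
    error,
  };
}

// Runs a throwing parser; on failure it logs one JSON line and rethrows,
// so the caller still decides whether to retry or fall back.
function parseOrLog(
  parse: (raw: unknown) => unknown,
  raw: unknown,
  ctx: RequestContext,
  log: (line: string) => void,
): unknown {
  try {
    return parse(raw);
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    log(JSON.stringify(buildValidationFailureLog(ctx, message)));
    throw err;
  }
}
```

&lt;p&gt;A call might look like &lt;code&gt;parseOrLog(parsePrSummary, raw, ctx, console.log)&lt;/code&gt;. Rethrowing keeps the logging step from silently swallowing a bad response.&lt;/p&gt;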

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffig8gcrqzcon6361aq3v.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffig8gcrqzcon6361aq3v.webp" alt="Production observability dashboard mockup for AI features: panels for p95 latency, token cost, retrieval quality, tool failures, schema validation errors, and user feedback on a dark editorial background." width="800" height="534"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  User feedback is part of observability
&lt;/h2&gt;

&lt;p&gt;AI quality is not only technical. Users can tell you when an answer was helpful, wrong, too long, unsafe, or irrelevant. Do not collect only thumbs up/down; add lightweight reason categories:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;AiFeedback&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;requestId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;rating&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;positive&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;negative&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt;
    &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;incorrect&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
    &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;missing_context&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
    &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;too_verbose&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
    &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;unsafe&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
    &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;not_actionable&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
    &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;other&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;comment&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This feedback can feed evaluation datasets. If users repeatedly mark answers as &lt;code&gt;missing_context&lt;/code&gt;, the problem may be retrieval. If they mark answers as &lt;code&gt;too_verbose&lt;/code&gt;, the prompt may need tighter formatting rules. If they mark answers as &lt;code&gt;incorrect&lt;/code&gt;, you need deeper analysis: the cause could be a bad prompt, bad context, a weak model, ambiguous user input, or a missing business rule.&lt;/p&gt;
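
&lt;p&gt;A minimal sketch of that triage, assuming feedback is already available as an array in memory. The function names and the suggestion strings are illustrative, and the fallback to &lt;code&gt;manual review&lt;/code&gt; for the remaining reasons is an assumption rather than a rule from this article.&lt;/p&gt;

```typescript
type FeedbackReason =
  | 'incorrect'
  | 'missing_context'
  | 'too_verbose'
  | 'unsafe'
  | 'not_actionable'
  | 'other';

type AiFeedback = {
  requestId: string;
  rating: 'positive' | 'negative';
  reason?: FeedbackReason;
  comment?: string;
};

type ReasonCounts = { [K in FeedbackReason]: number };

// Count negative feedback per reason; entries without a reason land in "other".
function countNegativeReasons(feedback: AiFeedback[]): ReasonCounts {
  const counts: ReasonCounts = {
    incorrect: 0,
    missing_context: 0,
    too_verbose: 0,
    unsafe: 0,
    not_actionable: 0,
    other: 0,
  };
  for (const item of feedback) {
    if (item.rating !== 'negative') continue;
    counts[item.reason ?? 'other'] += 1;
  }
  return counts;
}

// Map the most frequent reason to the first place to look.
function suggestInvestigation(counts: ReasonCounts): string {
  const reasons = Object.keys(counts) as FeedbackReason[];
  let top: FeedbackReason = 'other';
  for (const r of reasons) {
    if (counts[r] > counts[top]) top = r;
  }
  switch (top) {
    case 'missing_context':
      return 'retrieval';
    case 'too_verbose':
      return 'prompt formatting rules';
    case 'incorrect':
      return 'deeper analysis';
    default:
      return 'manual review';
  }
}
```

&lt;p&gt;Positive ratings are skipped on purpose: the counts are meant to rank failure causes, not to measure satisfaction.&lt;/p&gt;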

&lt;h2&gt;
  
  
  Evaluation and regression testing
&lt;/h2&gt;

&lt;p&gt;Production observability tells you what happened. Evaluations help you prevent known failures from coming back. Create a small dataset of realistic cases:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"refund_annual_plan_001"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"input"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Can I get a refund if my annual plan renewed yesterday?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"expected_traits"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"mentions refund window"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"does not promise refund automatically"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"asks for account details if needed"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"forbidden_traits"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"invented policy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"asks for full credit card number"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can run this dataset whenever you change the prompt template, the retrieval index, the model, the tool definitions, the system instructions, or the response schema. The goal is not perfect testing. The goal is catching obvious regressions before users do.&lt;/p&gt;
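
&lt;p&gt;A runner for a dataset in that shape can stay very small. The sketch below is synchronous for brevity (real model and judge calls would be asynchronous), and both &lt;code&gt;generate&lt;/code&gt; and &lt;code&gt;hasTrait&lt;/code&gt; are hypothetical hooks: the first calls your feature, the second decides whether an answer shows a trait, whether by keyword rule, a second model, or a human label.&lt;/p&gt;

```typescript
type EvalCase = {
  id: string;
  input: string;
  expected_traits: string[];
  forbidden_traits: string[];
};

// Hypothetical judge: returns true when the answer exhibits the trait.
type TraitJudge = (answer: string, trait: string) => boolean;

type EvalFailure = { id: string; missing: string[]; violated: string[] };

// Runs every case and returns only the ones that regressed.
function runEvals(
  cases: EvalCase[],
  generate: (input: string) => string,
  hasTrait: TraitJudge,
): EvalFailure[] {
  const failures: EvalFailure[] = [];
  for (const c of cases) {
    const answer = generate(c.input);
    // Expected traits that the answer does not show.
    const missing = c.expected_traits.filter((t) => !hasTrait(answer, t));
    // Forbidden traits that the answer does show.
    const violated = c.forbidden_traits.filter((t) => hasTrait(answer, t));
    if (missing.length + violated.length > 0) {
      failures.push({ id: c.id, missing, violated });
    }
  }
  return failures;
}
```

&lt;p&gt;An empty result means no known failure came back. A non-empty one names the case and the exact trait, which is usually enough to start debugging.&lt;/p&gt;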

&lt;h2&gt;
  
  
  A simple AI observability schema
&lt;/h2&gt;

&lt;p&gt;Here is a practical event model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;AiEvent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
  &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ai.request.started&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="nl"&gt;requestId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="nl"&gt;feature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="nl"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="nl"&gt;promptTemplate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ai.retrieval.completed&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="nl"&gt;requestId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="nl"&gt;topScore&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="nl"&gt;resultCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="nl"&gt;documentIds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ai.model.completed&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="nl"&gt;requestId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="nl"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="nl"&gt;inputTokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="nl"&gt;outputTokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="nl"&gt;durationMs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ai.tool.completed&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="nl"&gt;requestId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="nl"&gt;toolName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="nl"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;success&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;failure&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="nl"&gt;durationMs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ai.response.validation_failed&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="nl"&gt;requestId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="nl"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ai.feedback.received&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="nl"&gt;requestId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="nl"&gt;rating&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;positive&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;negative&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="nl"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can send these events to your normal observability stack. The exact vendor matters less than consistency.&lt;/p&gt;
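
&lt;p&gt;Consistency is easier when every event goes through one emitter. The sketch below assumes a trimmed copy of the &lt;code&gt;AiEvent&lt;/code&gt; union (two variants only, to keep it short) and a hypothetical &lt;code&gt;Sink&lt;/code&gt; that could be stdout, a log exporter, or a queue.&lt;/p&gt;

```typescript
// Trimmed copy of the AiEvent union above; add the other variants as needed.
type AiEvent =
  | {
      type: 'ai.model.completed';
      requestId: string;
      model: string;
      inputTokens: number;
      outputTokens: number;
      durationMs: number;
    }
  | {
      type: 'ai.response.validation_failed';
      requestId: string;
      error: string;
    };

// Hypothetical destination for serialized events.
type Sink = (line: string) => void;

// Returns an emit function that wraps every event in the same envelope.
// The clock is injectable so the output is deterministic in tests.
function createAiEmitter(sink: Sink, now: () => Date = () => new Date()) {
  return function emit(event: AiEvent): void {
    // Same timestamp field and requestId on every line, so events can be joined later.
    sink(JSON.stringify({ timestamp: now().toISOString(), ...event }));
  };
}
```

&lt;p&gt;Because &lt;code&gt;emit&lt;/code&gt; only accepts the union, a misspelled event type or a missing field fails at compile time instead of showing up as a gap in a dashboard.&lt;/p&gt;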

&lt;h2&gt;
  
  
  Privacy and retention
&lt;/h2&gt;

&lt;p&gt;AI observability can collect sensitive information if you are not careful. Set clear rules: redact secrets before logging, avoid storing raw prompts by default, set retention limits, separate debugging access from general analytics access, record prompt template versions, store document IDs instead of full documents when possible, and audit access to AI traces. This is especially important for internal assistants that can read customer support tickets, invoices, medical records, legal documents, or private engineering docs.&lt;/p&gt;
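
&lt;p&gt;Two of those rules, redacting before logging and storing document IDs instead of full documents, can be sketched as follows. The patterns are deliberately simple assumptions (the &lt;code&gt;sk_&lt;/code&gt;/&lt;code&gt;pk_&lt;/code&gt; key format in particular is hypothetical), so treat them as a starting point, not a complete secret scanner.&lt;/p&gt;

```typescript
// Illustrative redaction rules; real deployments need patterns for their own secrets.
const REDACTION_RULES: { name: string; pattern: RegExp }[] = [
  { name: 'email', pattern: /[\w.+-]+@[\w-]+(\.[\w-]+)+/g },
  { name: 'bearer_token', pattern: /Bearer\s+[A-Za-z0-9._~+\/-]+=*/g },
  { name: 'api_key', pattern: /\b(sk|pk)_[A-Za-z0-9_]{16,}\b/g },
];

// Replace every match with a labelled placeholder so traces stay readable.
function redact(text: string): string {
  let out = text;
  for (const rule of REDACTION_RULES) {
    out = out.replace(rule.pattern, `[REDACTED:${rule.name}]`);
  }
  return out;
}

type TraceInput = {
  requestId: string;
  promptTemplateVersion: number;
  documents: { id: string; body: string }[];
  userInput: string;
};

type SafeTrace = {
  requestId: string;
  promptTemplateVersion: number;
  documentIds: string[];
  userInputRedacted: string;
};

// Keep the template version and document IDs; drop document bodies and the raw prompt.
function toSafeTrace(input: TraceInput): SafeTrace {
  return {
    requestId: input.requestId,
    promptTemplateVersion: input.promptTemplateVersion,
    documentIds: input.documents.map((d) => d.id),
    userInputRedacted: redact(input.userInput),
  };
}
```

&lt;p&gt;With the template version and document IDs on the trace, most debugging sessions can reconstruct what the model saw without the raw text ever reaching the log store.&lt;/p&gt;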

&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;AI features need observability because they are probabilistic, context-sensitive, and expensive. You need more than HTTP 200 and p95 latency. You need to see prompts, retrieval, tool calls, tokens, cost, validation failures, user feedback, and evaluation results.&lt;/p&gt;

&lt;p&gt;The best AI observability systems do not only help you debug failures. They help you improve the product. They show where retrieval is weak, where prompts waste context, where model upgrades changed behavior, and where users do not trust the answer. That is the difference between an AI demo and an AI product. A demo only needs to work once. A product needs to keep working tomorrow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI production best practices: &lt;a href="https://developers.openai.com/api/docs/guides/production-best-practices" rel="noopener noreferrer"&gt;https://developers.openai.com/api/docs/guides/production-best-practices&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;OpenAI Agents SDK tracing: &lt;a href="https://openai.github.io/openai-agents-python/tracing/" rel="noopener noreferrer"&gt;https://openai.github.io/openai-agents-python/tracing/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;OpenTelemetry documentation: &lt;a href="https://opentelemetry.io/docs/" rel="noopener noreferrer"&gt;https://opentelemetry.io/docs/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>observability</category>
      <category>production</category>
      <category>prompts</category>
    </item>
    <item>
      <title>AI for Documentation That Developers Actually Maintain</title>
      <dc:creator>Nazar Boyko</dc:creator>
      <pubDate>Tue, 02 Jun 2026 04:41:03 +0000</pubDate>
      <link>https://dev.to/nazar_boyko/ai-for-documentation-that-developers-actually-maintain-2h89</link>
      <guid>https://dev.to/nazar_boyko/ai-for-documentation-that-developers-actually-maintain-2h89</guid>
      <description>&lt;p&gt;Roughly 45% of the traffic to your developer documentation in 2026 isn't human.&lt;/p&gt;

&lt;p&gt;That's not a metaphor. Mintlify, which hosts docs for a few thousand engineering teams, &lt;a href="https://www.mintlify.com/blog/state-of-ai" rel="noopener noreferrer"&gt;published a traffic study in March 2026&lt;/a&gt; showing that AI coding agents (Claude Code, Cursor, OpenCode, ChatGPT, and friends) now account for &lt;strong&gt;45.3%&lt;/strong&gt; of all requests to their hosted docs sites. Human browsers are at 45.8%. Two of those agents, Claude Code and Cursor, make up &lt;strong&gt;95.6%&lt;/strong&gt; of all the bot traffic.&lt;/p&gt;

&lt;p&gt;So when your docs are wrong, you're not just misleading new hires. You're feeding wrong answers into the tool the rest of your team uses to ship features. The asymmetry is brutal: a single stale &lt;code&gt;POST /v1/orders&lt;/code&gt; page can leak into thousands of Cursor autocompletions before anyone notices.&lt;/p&gt;

&lt;p&gt;Which makes the old documentation problem suddenly load-bearing.&lt;/p&gt;

&lt;p&gt;And the old documentation problem isn't &lt;em&gt;writing&lt;/em&gt; docs. AI has solved writing. You can dictate three bullet points to Claude and get a polished wiki entry in twelve seconds. The hard part, the part nobody has actually fixed yet, is &lt;strong&gt;keeping the docs true&lt;/strong&gt; the day after they're written, the week after, the quarter after.&lt;/p&gt;

&lt;p&gt;This piece is about what that takes. Specifically: how to use AI in a way that produces wikis, API references, and architecture decision records that survive contact with a moving codebase. The headline trick is that AI's most useful role here isn't &lt;em&gt;generation&lt;/em&gt;. It's &lt;em&gt;verification&lt;/em&gt;. And once you accept that, three workflows fall out naturally.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Maintenance Problem That Generation Doesn't Fix
&lt;/h2&gt;

&lt;p&gt;Let's name the bug before we fix it.&lt;/p&gt;

&lt;p&gt;Documentation rots because nothing in the normal development loop forces it to stay aligned with the code. You merge a refactor. Tests pass. CI is green. The wiki page about that subsystem? Nothing pinged it. Nothing knows it exists. It is now subtly, expensively wrong, and it will stay wrong until somebody hits a bug six weeks from now and goes looking.&lt;/p&gt;

&lt;p&gt;Stripe's &lt;a href="https://stripe.com/files/reports/the-developer-coefficient.pdf" rel="noopener noreferrer"&gt;Developer Coefficient study&lt;/a&gt; found that developers spend an average of &lt;strong&gt;17.3 hours per week&lt;/strong&gt; fixing the past instead of building the future: debugging, refactoring, and servicing technical debt, out of a roughly 41-hour week. That's about 42% of the week spent going backwards. A meaningful chunk of it is "I trusted the doc, the doc was wrong, I rewrote the doc after burning two hours on a bad assumption." It's a tax everyone pays and nobody itemises.&lt;/p&gt;

&lt;p&gt;The root cause is structural, not motivational. Developers don't avoid documentation because they're lazy. They avoid it because the system never makes them. Three failure modes show up over and over:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;No ownership.&lt;/strong&gt; Code has a CODEOWNERS file. Docs don't. When the auth team renames a flag, nobody pings the doc that mentions the old name, because no script knows the doc exists.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No verification.&lt;/strong&gt; Tests fail when code is wrong. Nothing fails when docs are wrong. Drift is invisible until a human notices.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No update prompt.&lt;/strong&gt; The PR template asks if you updated tests. It rarely asks if you updated the wiki. Even when it does, devs say "yes" reflexively and move on.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You can dump infinite AI-generated prose into Confluence and not fix a single one of these. The docs will still rot. They'll just rot &lt;em&gt;faster&lt;/em&gt; because there's more surface area to rot.&lt;/p&gt;

&lt;p&gt;This is why the well-meaning "let's have AI write a wiki entry for every PR" projects keep dying after three months. They produce content, not maintenance. And content that nobody promised to maintain is content that decays at exactly the same rate as content nobody wrote.&lt;/p&gt;

&lt;h2&gt;
  
  
  What AI Is Bad At (And What It's Actually Good At)
&lt;/h2&gt;

&lt;p&gt;AI is bad at &lt;em&gt;deciding what's true&lt;/em&gt;. That's not pessimism, that's mechanics. Frontier models in 2026 hallucinate at rates &lt;a href="https://suprmind.ai/hub/ai-hallucination-rates-and-benchmarks/" rel="noopener noreferrer"&gt;benchmarked between roughly 3% and 19%&lt;/a&gt;, depending on the model, the task, and the reasoning configuration. Three percent is the floor. Most production deployments aren't at the floor.&lt;/p&gt;

&lt;p&gt;So if you ask Claude "what does the &lt;code&gt;/v2/checkout&lt;/code&gt; endpoint return on partial inventory?" with no context, you will get an answer. It may even be plausible. It may also be entirely invented, indistinguishable from the real one, and confidently formatted as documentation. That's not a flaw of any single model. It's what a language model does when you don't give it ground truth to pull from.&lt;/p&gt;

&lt;p&gt;The good news, also from the 2026 benchmark research, is that &lt;strong&gt;retrieval grounding cuts hallucinations by as much as 70%&lt;/strong&gt; (some reports put it at 70-80% on enterprise knowledge-base tasks). The pattern is consistent: when you anchor the model to a verified source (the actual route handler, the actual schema, the actual diff), quality jumps. When you don't, you get plausible fiction.&lt;/p&gt;

&lt;p&gt;This is the lever. The job of AI in your docs pipeline isn't "write the doc." It's:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Summarise&lt;/strong&gt;, given a primary source.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compare&lt;/strong&gt;, given a doc and the code it claims to describe.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flag&lt;/strong&gt;, when the two have drifted.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Draft&lt;/strong&gt;, in a structured shape, when you've decided what to say.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words: AI is a junior tech writer with a photographic memory and zero judgement. Pair it with a senior reviewer and a verification harness, and it's transformative. Cut it loose with a Confluence write key and a friendly prompt, and you've automated the production of wrongness.&lt;/p&gt;

&lt;p&gt;The three workflows below are all variations on this idea.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqt9ge319e2177xpe758s.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqt9ge319e2177xpe758s.webp" alt="Two-pipeline comparison: AI as Generator outputs a polished but hallucinated doc page, while AI as Verifier compares a code change against the existing doc, emits a drift report, and routes it through human review before the doc updates" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Wiki Updates: Bind Them to the PR
&lt;/h2&gt;

&lt;p&gt;The wiki failure is the most common one and the easiest to fix.&lt;/p&gt;

&lt;p&gt;The pattern that works: &lt;strong&gt;every PR is required to produce a wiki delta&lt;/strong&gt;, and AI writes the first draft of that delta from the diff. Not the whole wiki, just what changed. The human reviewer accepts, edits, or rejects. The reviewer is the same person who's already reviewing the code. The wiki update lives in the same PR, gated by the same merge.&lt;/p&gt;

&lt;p&gt;This solves all three failure modes from earlier:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ownership&lt;/strong&gt;: the PR author owns the wiki delta the same way they own the code change.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verification&lt;/strong&gt;: the reviewer is looking at the code and the doc side by side. Drift becomes much harder to ship without someone noticing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Update prompt&lt;/strong&gt;: it's not a polite request in the PR template, it's a required field. Empty box, no merge.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice this looks like a CI step that runs against every PR:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;.github/workflows/wiki-delta.yml&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Wiki Delta&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;read&lt;/span&gt;           &lt;span class="c1"&gt;# needed by actions/checkout&lt;/span&gt;
  &lt;span class="na"&gt;pull-requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;     &lt;span class="c1"&gt;# needed to post the sticky comment&lt;/span&gt;
&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;draft-wiki-update&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;fetch-depth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Generate wiki delta draft&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;npx wiki-delta-bot \&lt;/span&gt;
            &lt;span class="s"&gt;--diff "${{ github.event.pull_request.base.sha }}..HEAD" \&lt;/span&gt;
            &lt;span class="s"&gt;--wiki ./docs/wiki \&lt;/span&gt;
            &lt;span class="s"&gt;--out wiki-delta.md&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Comment delta on PR&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;marocchino/sticky-pull-request-comment@v2&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;wiki-delta.md&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(The &lt;code&gt;wiki-delta-bot&lt;/code&gt; here is a stand-in: substitute whatever internal tool you use, or Mintlify's PR preview, or a custom Claude call with the diff and the relevant wiki page as context.)&lt;/p&gt;

&lt;p&gt;The crucial detail isn't the tool. It's that the AI gets &lt;em&gt;both inputs&lt;/em&gt;: the diff and the existing wiki page. It writes the smallest change that reconciles the two. Not a full rewrite. Not a fresh page. A delta. The reviewer is reading thirty changed lines, not three hundred regenerated ones, and that's the difference between a process that survives and one that gets disabled after a sprint.&lt;/p&gt;
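
&lt;p&gt;If you go the custom route, the two-input rule mostly comes down to how the prompt is assembled. The sketch below shows one possible shape; the function name, the instruction wording, and the &lt;code&gt;NO_CHANGE&lt;/code&gt; sentinel are all assumptions, and the actual model call is left out.&lt;/p&gt;

```typescript
// Builds a prompt that always carries both inputs: the diff and the current page.
// The wording is illustrative; tune it against your own review feedback.
function buildWikiDeltaPrompt(diff: string, wikiPage: string): string {
  return [
    'You are updating an existing wiki page to match a code change.',
    'Return only the smallest edit that reconciles the page with the diff.',
    'Do not rewrite sections the diff does not affect.',
    'If nothing on the page is affected, reply with exactly: NO_CHANGE',
    '',
    '--- DIFF ---',
    diff,
    '--- CURRENT WIKI PAGE ---',
    wikiPage,
  ].join('\n');
}
```

&lt;p&gt;Giving the model an explicit way to say that nothing changed matters as much as the rest: without it, a delta tends to get invented for every PR.&lt;/p&gt;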

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Tip&lt;/strong&gt;&lt;br&gt;
Start with the wiki pages that get the most agent traffic. If Mintlify-style analytics are available, sort by AI-agent reads, descending. Those are the pages that, if stale, will mislead the most downstream tools.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The "docs updated?" gate as a PR requirement has been a docs-as-code talking point for years, well before AI. The reason it never quite caught on was friction: writing the doc update by hand in every PR is enough of a tax that teams quietly stop enforcing the gate after a few sprints. AI changes the friction math. When the bot has already written 80% of the delta, even the laziest reviewer will tidy it instead of deleting it.&lt;/p&gt;
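
&lt;p&gt;The gate itself can be a few lines of logic in CI. The sketch below assumes application code lives under &lt;code&gt;src/&lt;/code&gt;, reuses the &lt;code&gt;docs/wiki&lt;/code&gt; path from the workflow above, and invents a &lt;code&gt;NO_DOC_CHANGE:&lt;/code&gt; opt-out marker for PRs that genuinely need no doc update. All three are conventions you would replace with your own.&lt;/p&gt;

```typescript
type GateInput = {
  changedFiles: string[]; // paths touched by the PR
  wikiDelta: string; // contents of the required wiki-delta field
};

type GateResult = { pass: boolean; reason: string };

// Hypothetical marker for an explicit, justified opt-out.
const OPT_OUT_PREFIX = 'NO_DOC_CHANGE:';

function checkDocsGate(input: GateInput): GateResult {
  const touchesCode = input.changedFiles.some((f) => f.startsWith('src/'));
  if (!touchesCode) return { pass: true, reason: 'no code changes' };

  const touchesWiki = input.changedFiles.some((f) => f.startsWith('docs/wiki/'));
  if (touchesWiki) return { pass: true, reason: 'wiki updated in this PR' };

  // An opt-out only counts when it comes with a justification.
  const note = input.wikiDelta.trim();
  if (note.startsWith(OPT_OUT_PREFIX)) {
    const justification = note.slice(OPT_OUT_PREFIX.length).trim();
    if (justification.length > 0) {
      return { pass: true, reason: 'explicit opt-out: ' + justification };
    }
  }
  return { pass: false, reason: 'code changed but no wiki delta or opt-out was provided' };
}
```

&lt;p&gt;Requiring a written reason for the opt-out keeps the escape hatch honest: skipping the doc update is still allowed, but it is visible in review.&lt;/p&gt;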

&lt;h2&gt;
  
  
  API Docs: Anchor First, Prose Second
&lt;/h2&gt;

&lt;p&gt;API documentation is the place where AI generation has the worst track record. Ask a model to "write API docs for this service" and you'll get a beautiful page where &lt;code&gt;POST /orders&lt;/code&gt; accepts a &lt;code&gt;quantity: int&lt;/code&gt; field. That field does not exist in your code. Nobody can find where it came from. It's been there for two months. Three internal teams are now writing client code that sends it.&lt;/p&gt;

&lt;p&gt;The fix is to invert the writing order: &lt;strong&gt;start from the code, end with the prose&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;OpenAPI is the lever. Modern frameworks (FastAPI, Spring Boot, NestJS, Hono, Go's &lt;code&gt;huma&lt;/code&gt;, Laravel's Scribe) can &lt;a href="https://www.xano.com/blog/openapi-specification-the-definitive-guide/" rel="noopener noreferrer"&gt;emit an OpenAPI specification straight from the actual route handlers, decorators, and types&lt;/a&gt;. The spec is mechanically true by construction. Routes that don't exist can't appear. Parameters that aren't in the handler signature can't appear. The shape is grounded.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Then&lt;/em&gt; you let AI write prose around that ground truth. Endpoint descriptions, examples, error explanations, "when to use this vs that", those are all genuine writing tasks where a model has nothing to fabricate, because the schema is already pinned down. Mintlify's own analysis of &lt;a href="https://www.mintlify.com/blog/ai-hallucinations" rel="noopener noreferrer"&gt;why hallucinations happen&lt;/a&gt; lands on the same point: structured input is the prevention.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;api/routes/orders.ts&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Fastify-style example, but the pattern is framework-agnostic&lt;/span&gt;
&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/v2/orders&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;OrderCreateSchema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;// ← single source of truth&lt;/span&gt;
      &lt;span class="na"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="mi"&gt;201&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;OrderSchema&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="c1"&gt;// ← AI cannot invent a "quantity" field&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Create a new order&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;&amp;lt;&amp;lt;&amp;lt;AI_FILLS_THIS&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// ← prose, generated against the real schema&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="nx"&gt;createOrderHandler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The schema is the anchor. Everything AI writes downstream is bounded by it. If a developer adds a new field, the spec changes, the AI's next pass at the description sees the new field, and the docs update with it. If a developer renames a field, the old name vanishes from the spec, and the AI's prose stops referring to it, because it never had license to.&lt;/p&gt;
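
&lt;p&gt;That bound can also be enforced mechanically. The sketch below is one hypothetical way to do it: it assumes the AI-written description wraps field names in backticks and that the list of real field names has already been pulled from the generated spec. The function name and the sample fields are illustrative, not taken from a real &lt;code&gt;OrderCreateSchema&lt;/code&gt;:&lt;/p&gt;

```typescript
// Hypothetical guard: lists field names mentioned in AI-written prose that are
// absent from the schema emitted from the code.
// Assumes the prose wraps field names in backticks, e.g. `customer_id`.
function findUnknownFields(description: string, schemaFields: string[]): string[] {
  const pattern = /`([A-Za-z_][A-Za-z0-9_]*)`/g;
  const unknown: string[] = [];
  let match = pattern.exec(description);
  while (match !== null) {
    const name = match[1];
    const isKnown = schemaFields.indexOf(name) !== -1;
    const alreadyFlagged = unknown.indexOf(name) !== -1;
    if (!(isKnown || alreadyFlagged)) {
      unknown.push(name);
    }
    match = pattern.exec(description);
  }
  return unknown;
}

// Sample field names are illustrative only.
const flaggedFields = findUnknownFields(
  "Creates an order for `customer_id`. Set `quantity` per line item.",
  ["customer_id", "items", "currency"],
);
console.log(flaggedFields); // the only flagged name is "quantity"
```

&lt;p&gt;A non-empty result is a reason to reject the generated description before it reaches the docs site. It is a coarse check: it only catches names the prose marks as code.&lt;/p&gt;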

&lt;p&gt;This is also why "generate docs from code" tools used to feel weak and now feel powerful. The tools didn't change much. The grounding layer did. A 2025 spec generator like &lt;code&gt;huma&lt;/code&gt;, paired with a 2026 model reading that spec, produces output that stays inside the spec, where an earlier model would have hallucinated past it.&lt;/p&gt;

&lt;p&gt;The footgun in this section: &lt;strong&gt;don't let AI invent example values&lt;/strong&gt;. A model asked for "an example order body" will happily make up a credit card number, a UUID, and a customer email, none of which correspond to anything in your system. If you can't seed examples from real fixture data or factory output, mark them clearly as illustrative (&lt;code&gt;"customer_id": "&amp;lt;example-uuid&amp;gt;"&lt;/code&gt;) so the human reader, and the next AI consuming this page, can tell.&lt;/p&gt;

&lt;h2&gt;
  
  
  ADRs: AI as the Scribe, You as the Decider
&lt;/h2&gt;

&lt;p&gt;Architecture Decision Records are the documentation form that has the worst capture rate.&lt;/p&gt;

&lt;p&gt;Not because they're hard to write. They're trivial. The &lt;a href="https://adolfi.dev/blog/ai-generated-adr/" rel="noopener noreferrer"&gt;standard ADR format&lt;/a&gt; is a handful of short sections: context, decision, consequences, alternatives considered. You can fill it out in fifteen minutes.&lt;/p&gt;

&lt;p&gt;The reason ADRs go uncaptured is that the &lt;em&gt;moment&lt;/em&gt; you should write one, the moment you make the decision, you're in flow, you're shipping, you're on a Slack call, and an internal voice says "I'll write it down later." Later never comes. Six months later the decision is still in effect, but nobody on the team remembers why, and a new engineer is about to undo it because the original constraint isn't in any document.&lt;/p&gt;

&lt;p&gt;AI is shockingly good at fixing exactly this gap. Not by deciding architecture (it shouldn't, and it can't, because the tradeoffs are tied to your business and your team), but by being the scribe at the moment of decision.&lt;/p&gt;

&lt;p&gt;The pattern that works: when you make a non-trivial architectural call, you say it out loud (in chat, in a PR, in a meeting transcript). An AI agent (Claude Code skill, Cursor agent, whatever your team standardises on) picks that up, interrogates you for two minutes about context and alternatives, and emits a draft ADR. You read it for thirty seconds, fix the parts it got wrong, commit it under &lt;code&gt;docs/adrs/NNN-thing.md&lt;/code&gt;. Done.&lt;/p&gt;

&lt;p&gt;This works because the format is structured enough that the AI knows what's missing. If you say "we're switching from polling to webhooks," the ADR scribe knows it needs to ask: &lt;em&gt;why now, what was tried, what's the rollback, what does this lock you out of?&lt;/em&gt; The interrogation is the value. You'd have skipped those questions on your own.&lt;/p&gt;

&lt;p&gt;Here's roughly what one looks like, drafted from a five-minute conversation:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;docs/adrs/0042-switch-fulfilment-queue-to-redis-streams.md&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# ADR 0042: Switch Fulfilment Queue from SQS to Redis Streams&lt;/span&gt;

&lt;span class="gs"&gt;**Status:**&lt;/span&gt; Accepted
&lt;span class="gs"&gt;**Date:**&lt;/span&gt; 2026-05-12
&lt;span class="gs"&gt;**Deciders:**&lt;/span&gt; @nazar, @ana, @jp

&lt;span class="gu"&gt;## Context&lt;/span&gt;

Fulfilment workers process ~12k orders/day with bursts up to 800/min during
flash sales. We're hitting SQS message size limits (256KB) for ~3% of payloads
after we added line-item metadata in Q1, forcing an S3-pointer workaround that
adds two round trips per message.

&lt;span class="gu"&gt;## Decision&lt;/span&gt;

Replace SQS with Redis Streams (already deployed for session state) as the
primary fulfilment queue. Keep SQS as a dead-letter sink for messages that
fail Redis-side validation.

&lt;span class="gu"&gt;## Consequences&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; Removes the S3-pointer hop; expected p99 latency drop from ~340ms to ~90ms.
&lt;span class="p"&gt;-&lt;/span&gt; Adds operational responsibility: Redis Streams retention and consumer-group
  lag now matter to ops on-call.
&lt;span class="p"&gt;-&lt;/span&gt; Loses SQS's native at-least-once + visibility-timeout semantics; we
  reimplement those via Stream ack + idempotency keys at the worker.

&lt;span class="gu"&gt;## Alternatives Considered&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; &lt;span class="gs"&gt;**Stay on SQS, increase message size via S3 pointers.**&lt;/span&gt; Already in place;
  the two-round-trip cost is what's prompting this change.
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Kafka.**&lt;/span&gt; Closer match semantically, but we have no Kafka operational
  experience and Redis is already in the stack.
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**NATS JetStream.**&lt;/span&gt; Strong candidate; deferred, not enough team familiarity
  and we don't want two new pieces of infrastructure.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note what the AI did &lt;em&gt;not&lt;/em&gt; do here: it didn't decide. It collected what was already decided and put it in a shape that future-you can scan. The decision happened in the heads of three people on a call. The ADR captured it before it evaporated.&lt;/p&gt;

&lt;p&gt;This is, in my view, the highest-leverage use of AI in documentation right now. ADRs are uniquely valuable (they explain &lt;em&gt;why&lt;/em&gt;, which code itself can't), uniquely under-captured (because of the timing problem), and uniquely well-suited to AI scribing (because the format is rigid and the input is conversational).&lt;/p&gt;

&lt;p&gt;If you only do one thing from this article, set up an ADR-scribe workflow.&lt;/p&gt;
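
&lt;p&gt;The rigid format is what makes part of that workflow checkable. A minimal sketch, assuming drafts use the four section names from the example above as level-two Markdown headings; the helper name and the sample draft are hypothetical:&lt;/p&gt;

```typescript
// Section names mirror the ADR example above; the helper itself is hypothetical.
const REQUIRED_SECTIONS = ["Context", "Decision", "Consequences", "Alternatives Considered"];

// Returns the required sections that a draft ADR does not contain yet,
// so the scribe knows which questions it still has to ask.
function missingAdrSections(markdown: string): string[] {
  const headings = markdown
    .split("\n")
    .filter((line) => line.indexOf("## ") === 0)
    .map((line) => line.slice(3).trim());
  return REQUIRED_SECTIONS.filter((name) => headings.indexOf(name) === -1);
}

// Illustrative draft captured halfway through a conversation.
const adrDraft = [
  "# ADR 0043: Switch order status updates from polling to webhooks",
  "",
  "## Context",
  "Polling is hitting partner rate limits.",
  "",
  "## Decision",
  "Move order status updates to webhooks.",
].join("\n");

console.log(missingAdrSections(adrDraft)); // Consequences and Alternatives Considered
```

&lt;p&gt;The same list can drive the interrogation: each missing section maps to a question the agent still needs answered before it emits the draft.&lt;/p&gt;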

&lt;h2&gt;
  
  
  The Verification Layer Behind All of This
&lt;/h2&gt;

&lt;p&gt;The three workflows above share a hidden ingredient: &lt;strong&gt;they all depend on something checking whether the doc still matches the code.&lt;/strong&gt; Without that check, you're just generating prose on a treadmill.&lt;/p&gt;

&lt;p&gt;The verification layer has three components, and they're not interchangeable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Anchors that bind doc to code.&lt;/strong&gt; Swimm's whole product is built on this idea: the documentation includes "smart tokens" and "smart paths" that reference specific functions, types, or line ranges in the codebase. &lt;a href="https://docs.swimm.io/quick-reference-guide/" rel="noopener noreferrer"&gt;When the code changes&lt;/a&gt;, Swimm flags the affected doc with a "Review required" status. The mechanism is straightforward: it's structured references plus a CI job that compares the current code against what the doc expected. The same idea works without Swimm. You can roll your own by linking docs to git refs and running a diff check in CI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Prose linting and link checking.&lt;/strong&gt; &lt;a href="https://buildwithfern.com/post/docs-linting-guide" rel="noopener noreferrer"&gt;Vale&lt;/a&gt; for style (tone, terminology, forbidden phrasing) and any link checker (Lychee, markdown-link-check, Fern's built-in) for broken references. These are 2010s tech, but they're still the cheapest way to catch a category of rot (typos, dead URLs, renamed pages) that AI-generated content is especially prone to.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. AI-driven drift checks.&lt;/strong&gt; This is the new piece. On a schedule (nightly, weekly), an agent walks the docs tree, for each page pulls the code it references, and asks the model: "does this page accurately describe this code?" Output is a drift report. Humans triage. This is the verification arm of the same workflow you use to generate wiki deltas: same anchor, same comparison, just run on a cadence instead of on every PR.&lt;/p&gt;
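
&lt;p&gt;The roll-your-own version of component 1 can be sketched as a pure comparison. This assumes a hypothetical anchor format in which each doc records the code file it describes and the git blob SHA of that file when the doc was last reviewed. Collecting the current SHAs (for example with &lt;code&gt;git rev-parse HEAD:path/to/file&lt;/code&gt;) is left to the CI job, and the shortened SHAs below are made up:&lt;/p&gt;

```typescript
// Hypothetical anchor format: each doc records the file it describes and the
// git blob SHA of that file at the time the doc was last reviewed.
interface DocAnchor {
  docPath: string;
  codePath: string;
  reviewedSha: string;
}

// Current blob SHAs keyed by code path, collected by the CI job.
type ShaByPath = { [codePath: string]: string };

// Returns the docs whose anchored file changed, moved, or disappeared.
function findStaleDocs(anchors: DocAnchor[], current: ShaByPath): string[] {
  const stale: string[] = [];
  for (const anchor of anchors) {
    // A missing entry (deleted or renamed file) never equals the reviewed SHA.
    const unchanged = current[anchor.codePath] === anchor.reviewedSha;
    if (!unchanged) {
      if (stale.indexOf(anchor.docPath) === -1) {
        stale.push(anchor.docPath);
      }
    }
  }
  return stale;
}

const docAnchors: DocAnchor[] = [
  { docPath: "docs/orders.md", codePath: "api/routes/orders.ts", reviewedSha: "a1b2c3" },
  { docPath: "docs/refunds.md", codePath: "api/routes/refunds.ts", reviewedSha: "d4e5f6" },
];
const currentShas: ShaByPath = {
  "api/routes/orders.ts": "a1b2c3", // unchanged since review
  "api/routes/refunds.ts": "0f9e8d", // changed since review
};
console.log(findStaleDocs(docAnchors, currentShas)); // only docs/refunds.md is flagged
```

&lt;p&gt;Whole-file SHAs are the bluntest possible anchor: any edit to the file flags the doc. Finer anchors (a function, a line range) cut the noise at the cost of more bookkeeping.&lt;/p&gt;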

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ox1irarjj1dtdgcbvsb.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ox1irarjj1dtdgcbvsb.webp" alt="Three stacked layers of maintainable docs - Anchors binding code to docs, prose Linters catching style issues and broken links, and scheduled AI Drift Checks - with an arrow labeled stops rot rising through all three" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;None of these three works alone. Anchors without lint checks miss broken external links. Lint checks without anchors don't catch the case where the doc still parses cleanly but describes a function that's been renamed. AI drift checks without anchors get expensive fast: you're rescanning the whole repo every night instead of the bits the doc actually references.&lt;/p&gt;
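
&lt;p&gt;A scheduled drift check scoped by anchors might look like the sketch below. The page shape, the ACCURATE/DRIFT answer convention, and the injected &lt;code&gt;readCode&lt;/code&gt; and &lt;code&gt;askModel&lt;/code&gt; functions are all assumptions made for illustration. A real model call would be asynchronous and provider-specific; it is kept synchronous here so the loop stays easy to read:&lt;/p&gt;

```typescript
// Hypothetical shapes: a page lists the code files it is anchored to.
interface AnchoredPage {
  docPath: string;
  docText: string;
  codePaths: string[];
}

interface DriftFinding {
  docPath: string;
  verdict: string;
}

// readCode and askModel are injected so the sketch stays independent of any
// file system or model provider.
function runDriftCheck(
  pages: AnchoredPage[],
  readCode: (path: string) => string,
  askModel: (prompt: string) => string,
): DriftFinding[] {
  const findings: DriftFinding[] = [];
  for (const page of pages) {
    // Only the anchored files are sent, not the whole repository.
    const code = page.codePaths
      .map((path) => "// " + path + "\n" + readCode(path))
      .join("\n\n");
    const prompt =
      "Does this page accurately describe this code? " +
      "Answer ACCURATE or DRIFT, then explain.\n\n" +
      "PAGE:\n" + page.docText + "\n\nCODE:\n" + code;
    const answer = askModel(prompt).trim();
    // Anything that is not a clear ACCURATE goes to human triage.
    if (answer.indexOf("ACCURATE") !== 0) {
      findings.push({ docPath: page.docPath, verdict: answer });
    }
  }
  return findings;
}
```

&lt;p&gt;The returned findings are the drift report. Treating every non-ACCURATE answer as drift errs toward false alarms, which is the cheaper mistake when a human triages the output anyway.&lt;/p&gt;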

&lt;p&gt;Stack all three and you have docs that fail loudly, like tests. Which is the only kind that gets maintained.&lt;/p&gt;

&lt;h2&gt;
  
  
  The llms.txt Detour Nobody Saw Coming
&lt;/h2&gt;

&lt;p&gt;A small piece of context that's reshaping all of the above: &lt;a href="https://www.mintlify.com/blog/what-is-llms-txt" rel="noopener noreferrer"&gt;llms.txt&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In September 2024, Jeremy Howard of Answer.AI &lt;a href="https://llmstxt.org/" rel="noopener noreferrer"&gt;proposed a standard&lt;/a&gt; for a &lt;code&gt;/llms.txt&lt;/code&gt; file at the root of any docs site, structured to be cheap for a language model to ingest. Within months, Mintlify rolled out automatic support across every site they host. &lt;a href="https://www.mintlify.com/blog/the-value-of-llms-txt-hype-or-real" rel="noopener noreferrer"&gt;Mintlify built the &lt;code&gt;llms-full.txt&lt;/code&gt; variant with Anthropic&lt;/a&gt; (a single concatenated dump of all docs), because parsing fragmented HTML was killing Anthropic's indexing pipelines. Cloudflare, Vercel, and many other docs sites followed soon after.&lt;/p&gt;

&lt;p&gt;It's a small file. It just lists your docs and gives a short description of each. But it changes what "documentation" means: it's the AI-readable interface to your knowledge, separate from the human-readable one. And the same maintenance problem applies, only worse, because nobody is looking at &lt;code&gt;llms.txt&lt;/code&gt; with human eyes. A page that drops out of the file is silently invisible to every Claude Code session for the next quarter.&lt;/p&gt;

&lt;p&gt;If you're already running the three-layer verification stack above, &lt;code&gt;llms.txt&lt;/code&gt; falls out for free; it's just a generated index over docs you're already keeping true. If you're not, &lt;code&gt;llms.txt&lt;/code&gt; accelerates the problem. AI agents read what's in the index, generate code based on it, and your stale docs get amplified into stale autocompletions.&lt;/p&gt;
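
&lt;p&gt;"Falls out for free" can be taken fairly literally. The sketch below renders an index from a list of verified pages, following the layout described in the llmstxt.org proposal as I read it: an H1 title, a blockquote summary, then H2 sections of Markdown links with short descriptions. The &lt;code&gt;DocPage&lt;/code&gt; shape, the site name, and the URLs are hypothetical:&lt;/p&gt;

```typescript
// Hypothetical page record, produced by whatever already tracks verified docs.
interface DocPage {
  title: string;
  url: string;
  summary: string;
  section: string;
}

// Renders an llms.txt-style index: title, summary, then one link list per section.
function renderLlmsTxt(siteTitle: string, siteSummary: string, pages: DocPage[]): string {
  const lines: string[] = [`# ${siteTitle}`, "", `> ${siteSummary}`];
  // Keep sections in first-seen order so the output is stable between runs.
  const sections: string[] = [];
  for (const page of pages) {
    if (sections.indexOf(page.section) === -1) {
      sections.push(page.section);
    }
  }
  for (const section of sections) {
    lines.push("", `## ${section}`, "");
    for (const page of pages) {
      if (page.section === section) {
        lines.push(`- [${page.title}](${page.url}): ${page.summary}`);
      }
    }
  }
  return lines.join("\n") + "\n";
}

const llmsTxt = renderLlmsTxt("Acme Docs", "Internal engineering documentation.", [
  { title: "Orders API", url: "https://example.com/docs/orders.md", summary: "Create and list orders.", section: "API" },
  { title: "ADR 0042", url: "https://example.com/docs/adrs/0042.md", summary: "Why fulfilment moved to Redis Streams.", section: "Decisions" },
]);
console.log(llmsTxt);
```

&lt;p&gt;Because the page list is the input, a page that fails verification can simply be left out of the list, and a page that silently drops out shows up as a diff in the generated file.&lt;/p&gt;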

&lt;h2&gt;
  
  
  The Failure Mode That Bites Senior Teams
&lt;/h2&gt;

&lt;p&gt;Worth naming directly, because it's the one I see senior teams stumble on, not junior ones.&lt;/p&gt;

&lt;p&gt;The temptation: "AI is good enough now. Let's have it generate the docs in bulk from the codebase, ship the result, and call the project done." A weekend hackathon, three thousand new wiki pages, victory.&lt;/p&gt;

&lt;p&gt;This is how you build a much larger pile of plausibly-wrong content than you had before.&lt;/p&gt;

&lt;p&gt;Three things go wrong. First, the bulk-generated pass has a real hallucination rate. At 5-10% per page (a reasonable midpoint of the &lt;a href="https://suprmind.ai/hub/ai-hallucination-rates-and-benchmarks/" rel="noopener noreferrer"&gt;2026 benchmarks&lt;/a&gt;), three thousand pages means roughly 150 to 300 pages carrying at least one wrong claim. Nobody knows which ones. Second, none of those pages have a maintainer or a verification hook, so they start rotting on day one. Third, and this is the new failure mode, those pages are now read by your team's AI tools. Cursor pulls from your wiki. Claude Code pulls from your wiki. The wrong content compounds into wrong code suggestions, and the loop closes.&lt;/p&gt;

&lt;p&gt;Generation without maintenance was a productivity drain in 2022. In 2026, with AI agents accounting for nearly half your doc traffic, it's a self-reinforcing source of bugs.&lt;/p&gt;

&lt;p&gt;The fix isn't subtle: &lt;strong&gt;never generate a doc page that nobody has agreed to maintain.&lt;/strong&gt; If no human's name is on it and no anchor binds it to the code, don't write it. AI can produce infinite content; the bottleneck is and always will be the verification layer attached to it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Changes in Your Workflow
&lt;/h2&gt;

&lt;p&gt;If you've read this far, the practical shape is probably clear, but here it is compactly.&lt;/p&gt;

&lt;p&gt;Your wiki gets a CI job that drafts a delta on every PR, anchored to the diff and the existing page. Reviewer approves or edits, in the same PR that ships the code. No delta, no merge.&lt;/p&gt;

&lt;p&gt;Your API docs get generated from a typed schema (OpenAPI, GraphQL SDL, gRPC proto, whatever), and AI fills in prose against that schema, never against thin air. Example values come from real fixtures, not from the model's imagination.&lt;/p&gt;

&lt;p&gt;Your ADRs get a scribe agent. The moment you make a non-trivial decision in chat or in a PR comment, the agent prompts you for two minutes and produces a draft. You commit it before the day ends.&lt;/p&gt;

&lt;p&gt;A verification stack runs in the background: anchors for code-to-doc binding, prose linting and link checking for hygiene, and a scheduled AI drift check for the cases the first two miss.&lt;/p&gt;

&lt;p&gt;And &lt;code&gt;llms.txt&lt;/code&gt; (or whatever standard your team settles on) gets generated from the same verified source, so the half of your doc traffic that's AI agents stays as accurate as the half that's humans.&lt;/p&gt;

&lt;p&gt;None of this is novel in isolation. Anchors existed before AI. Linting existed before AI. ADRs predate Claude by a decade. What's new is that AI, cheap enough to run on every PR against the actual diff in seconds, finally makes the verification side affordable. The 2010s answer to documentation rot was "be more disciplined." The 2026 answer is "make the discipline tractable." That's the actual win.&lt;/p&gt;

&lt;p&gt;Docs that developers maintain aren't a writing problem. They're an integration problem. AI is most valuable not as the writer but as the integration glue: the thing that reads the diff, compares it to the doc, drafts the delta, runs the check, and surfaces the drift before it ships.&lt;/p&gt;

&lt;p&gt;Build that, and your wiki stops lying. Build the alternative, and the lies just get faster.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>documentation</category>
      <category>architecture</category>
      <category>adr</category>
    </item>
  </channel>
</rss>
