<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alex Cloudstar</title>
    <description>The latest articles on DEV Community by Alex Cloudstar (@alexcloudstar).</description>
    <link>https://dev.to/alexcloudstar</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1190670%2F18910089-3a37-4072-9b4c-289211f053eb.JPG</url>
      <title>DEV Community: Alex Cloudstar</title>
      <link>https://dev.to/alexcloudstar</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alexcloudstar"/>
    <language>en</language>
    <item>
      <title>TypeScript at Scale: Why Your tsc Takes 90 Seconds and How to Fix It</title>
      <dc:creator>Alex Cloudstar</dc:creator>
      <pubDate>Fri, 08 May 2026 08:41:54 +0000</pubDate>
      <link>https://dev.to/alexcloudstar/typescript-at-scale-why-your-tsc-takes-90-seconds-and-how-to-fix-it-3g3k</link>
      <guid>https://dev.to/alexcloudstar/typescript-at-scale-why-your-tsc-takes-90-seconds-and-how-to-fix-it-3g3k</guid>
      <description>&lt;p&gt;The TypeScript codebase I inherited last year had a clean build time of 94 seconds. Incremental builds were 12 seconds on a good day. The editor would freeze for two or three seconds every time you hovered over a Zod schema. Nobody wrote new code without first opening their second monitor to scroll Twitter while the language server caught up.&lt;/p&gt;

&lt;p&gt;It is now 11 seconds for a clean build, sub-second incremental, and the editor stays responsive. We did not move to Project Corsa. We did not switch to Bun. We did not split the repo. We deleted three patterns that were generating millions of redundant type instantiations and tightened a few &lt;code&gt;tsconfig&lt;/code&gt; settings. The work took about a week.&lt;/p&gt;

&lt;p&gt;Most TypeScript performance problems at scale are not "TypeScript is slow." They are "we are asking TypeScript to do something quadratic and it is doing it." This post is the diagnostic playbook for figuring out which thing your codebase is doing.&lt;/p&gt;




&lt;h2&gt;
  
  
  The First Question: Where Is the Time Going
&lt;/h2&gt;

&lt;p&gt;Before tuning anything, get real numbers. The TypeScript compiler ships with two flags that turn the diagnostic question from "feels slow" into "spends 47% of its time in type checking step X."&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx tsc &lt;span class="nt"&gt;--extendedDiagnostics&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output gives you a breakdown: parse time, bind time, check time, emit time, total memory usage. If "Check time" dominates, your problem is in the type system. If "I/O Read time" or "Parse time" dominates, your problem is the size of what you are loading. These are very different problems with very different fixes.&lt;/p&gt;

&lt;p&gt;The next flag is more targeted:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx tsc &lt;span class="nt"&gt;--generateTrace&lt;/span&gt; ./trace
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This writes a Chrome-format trace into &lt;code&gt;./trace&lt;/code&gt; (a &lt;code&gt;trace.json&lt;/code&gt; plus a &lt;code&gt;types.json&lt;/code&gt;). Open &lt;code&gt;trace.json&lt;/code&gt; in &lt;code&gt;chrome://tracing&lt;/code&gt; or &lt;code&gt;https://ui.perfetto.dev&lt;/code&gt;. You get a flame graph of every file the compiler checked, how long each took, and what types it instantiated.&lt;/p&gt;

&lt;p&gt;The pattern to look for is single files that take seconds. Healthy code generates a flame graph where most files complete in under 100ms and the long tail tops out somewhere around 500ms. A file that takes 5 seconds is a file with a type the compiler is struggling with. A file that takes 30 seconds is the file generating most of your build pain, and finding it is most of the work.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;@typescript/analyze-trace&lt;/code&gt; is the tool that reads the trace and tells you what is hot:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx @typescript/analyze-trace ./trace
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It surfaces the worst-offending files, the deepest type instantiations, and the most expensive type aliases. The output is sometimes opaque, but the file names it gives you are almost always the right places to look.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Patterns That Actually Cost You
&lt;/h2&gt;

&lt;p&gt;In every slow codebase I have looked at, the cost concentrates in a small number of patterns. The patterns are recognizable once you know what to look for.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deeply Nested Generic Inference
&lt;/h3&gt;

&lt;p&gt;This is the most common offender, and it almost always lives in code that wraps a library with a generic helper.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;withRetry&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt; &lt;span class="nf"&gt;extends &lt;/span&gt;&lt;span class="p"&gt;(...&lt;/span&gt;&lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;any&lt;/span&gt;&lt;span class="p"&gt;[])&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;any&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;options&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;RetryOptions&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="p"&gt;(...&lt;/span&gt;&lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Parameters&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Awaited&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;ReturnType&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fetchUser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;withRetry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;retries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Looks fine. The cost shows up when you wrap something whose signature is itself heavily generic. If &lt;code&gt;api.users.fetch&lt;/code&gt; returns a Drizzle query result, or a tRPC procedure, or a Zod-inferred type, the compiler has to expand all of those generics every time the wrapper is instantiated. If &lt;code&gt;withRetry&lt;/code&gt; is used in 200 places across your codebase, the compiler does that work 200 times in every type check.&lt;/p&gt;

&lt;p&gt;The fix is rarely to delete the wrapper. It is to break the chain of inference at strategic points. Instead of inferring &lt;code&gt;Awaited&amp;lt;ReturnType&amp;lt;T&amp;gt;&amp;gt;&lt;/code&gt; deep inside the type, accept a simpler input type and let the user spell it out at the call site, or use a type assertion to terminate the inference.&lt;/p&gt;
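
&lt;p&gt;A sketch of what that can look like, keeping the &lt;code&gt;withRetry&lt;/code&gt; shape from above: let the wrapper infer the argument and result types directly, so the compiler never has to expand &lt;code&gt;Parameters&amp;lt;T&amp;gt;&lt;/code&gt; or &lt;code&gt;Awaited&amp;lt;ReturnType&amp;lt;T&amp;gt;&amp;gt;&lt;/code&gt; over a heavily generic wrapped signature:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// A sketch, not the original helper. Args and R are inferred from the
// wrapped function's own parameter and return positions, so nothing here
// asks the compiler to expand Parameters&amp;lt;T&amp;gt; / Awaited&amp;lt;ReturnType&amp;lt;T&amp;gt;&amp;gt;.
type RetryOptions = { retries: number };

function withRetry&amp;lt;Args extends unknown[], R&amp;gt;(
  fn: (...args: Args) =&amp;gt; Promise&amp;lt;R&amp;gt;,
  options: RetryOptions
): (...args: Args) =&amp;gt; Promise&amp;lt;R&amp;gt; {
  return async (...args) =&amp;gt; {
    let lastError: unknown;
    for (let attempt = 0; attempt &amp;lt;= options.retries; attempt++) {
      try {
        return await fn(...args);
      } catch (err) {
        lastError = err;
      }
    }
    throw lastError;
  };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Inference still happens at every call site, but it no longer routes through &lt;code&gt;Parameters&lt;/code&gt; and &lt;code&gt;ReturnType&lt;/code&gt; over the wrapped signature, which is the part that compounds across 200 usages.&lt;/p&gt;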

&lt;h3&gt;
  
  
  Conditional Type Recursion in Hot Paths
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;DeepReadonly&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;K&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="kr"&gt;keyof&lt;/span&gt; &lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;K&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nx"&gt;object&lt;/span&gt;
    &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;K&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nb"&gt;Function&lt;/span&gt;
      &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;K&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;DeepReadonly&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;K&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;K&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A &lt;code&gt;DeepReadonly&lt;/code&gt; over a small interface is fine. A &lt;code&gt;DeepReadonly&lt;/code&gt; applied to your top-level state type, which contains your database row types, which reference your domain types, which contain unions of all your enums, is a recursive type explosion. The compiler will work through it, sometimes. Sometimes it gives up and emits &lt;code&gt;any&lt;/code&gt;, silently. Either way it is slow.&lt;/p&gt;

&lt;p&gt;The default position for recursive utility types should be: do not. If you find yourself reaching for &lt;code&gt;DeepPartial&lt;/code&gt;, &lt;code&gt;DeepReadonly&lt;/code&gt;, &lt;code&gt;DeepKeys&lt;/code&gt;, or anything that walks an arbitrary tree, ask whether you actually need the type to be deep. Most of the time you need it to be one or two levels deep, which is a much cheaper type to write explicitly.&lt;/p&gt;

&lt;p&gt;When you do need recursion, cap the depth:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;DeepReadonly&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;Depth&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;Depth&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
  &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nx"&gt;T&lt;/span&gt;
  &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;K&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="kr"&gt;keyof&lt;/span&gt; &lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="nx"&gt;DeepReadonly&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;K&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nx"&gt;Decrement&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Depth&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives you the safety of a finite recursion at the cost of writing a numeric depth helper. The compiler can always finish.&lt;/p&gt;
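
&lt;p&gt;The depth helper itself is a small tuple-indexing trick. A minimal sketch, enough for the depths you realistically cap at:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// A sketch of the numeric depth helper referenced above. Decrement&amp;lt;N&amp;gt;
// maps a literal depth to the next one down by indexing into a tuple.
// Extend the tuple if you ever need depths beyond 9; Decrement&amp;lt;0&amp;gt; is
// never reached because DeepReadonly checks for 0 before recursing.
type Decrement&amp;lt;N extends number&amp;gt; = [-1, 0, 1, 2, 3, 4, 5, 6, 7, 8][N];
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;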

&lt;h3&gt;
  
  
  Massive Discriminated Unions
&lt;/h3&gt;

&lt;p&gt;A union with eight variants is fast. A union with 200 variants generated from a Zod schema or a code generator is slow. Every time you narrow the union with a discriminator, the compiler has to consider every variant and prove which ones are eliminated.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;Event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
  &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user.created&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;UserCreated&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user.updated&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;UserUpdated&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="c1"&gt;// ... 198 more&lt;/span&gt;
  &lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;handle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;switch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user.created&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;handleUserCreated&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The narrowing inside the switch is where time goes. The compiler proves at each case statement which variants of the union are still possible. With 200 variants, that proof gets expensive. If &lt;code&gt;handle&lt;/code&gt; is called from many places, and each call site re-checks the union, you can pay this cost thousands of times in a single type check.&lt;/p&gt;

&lt;p&gt;Two fixes that usually work: split the union at module boundaries so any single function only deals with a subset, or convert the union into a record type keyed by the discriminator and look up the handler dynamically. The latter sacrifices exhaustiveness checking, which you can get back with a &lt;code&gt;satisfies&lt;/code&gt; clause:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;handlers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user.created&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;handleUserCreated&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user.updated&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;handleUserUpdated&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="nx"&gt;satisfies&lt;/span&gt; &lt;span class="nb"&gt;Record&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;type&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;any&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;handle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;handlers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The compiler still verifies completeness on the &lt;code&gt;satisfies&lt;/code&gt;, but the lookup at the call site is constant-time, not a union narrowing.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;as const&lt;/code&gt; Object Literals With Heavy Inference
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;routes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;users&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/users&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;detail&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/users/:id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;create&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/users&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;update&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/users/:id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;posts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;RouteKey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="kr"&gt;keyof&lt;/span&gt; &lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;routes&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="kr"&gt;keyof&lt;/span&gt; &lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;routes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kr"&gt;keyof&lt;/span&gt; &lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;routes&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;as const&lt;/code&gt; keeps the literal types, which is what you want. The template literal type at the bottom is what is expensive. It generates the cartesian product of all top-level keys and all nested keys, and TypeScript materializes the full set during type checking. For a route table with 50 sections and 5 routes each, you have a 250-element string union that has to be computed every time something references &lt;code&gt;RouteKey&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The fix is to keep the inferred type but stop computing the joined string union at the type level. If you need to enumerate all routes, generate the list at runtime from the object and accept that you pay a tiny startup cost. If you need it at compile time for autocompletion, narrow the scope of the type so it only covers one section at a time.&lt;/p&gt;
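
&lt;p&gt;A sketch of the runtime version, reusing the &lt;code&gt;routes&lt;/code&gt; object from above: the joined keys become plain strings computed once at startup, and nothing forces the compiler to materialize a union:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// A sketch of the runtime alternative, assuming the `routes` object from the
// previous snippet. The "section.route" keys are computed once when the
// module loads; the type level never sees the full cartesian product.
const routeKeys: string[] = Object.entries(routes).flatMap(([section, table]) =&amp;gt;
  Object.keys(table).map((name) =&amp;gt; `${section}.${name}`)
);

// If one section still needs compile-time autocompletion, scope the type to
// that section instead of the whole table:
type UserRouteKey = `users.${keyof typeof routes.users &amp;amp; string}`;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;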

&lt;h3&gt;
  
  
  Library-Caused Slowdown
&lt;/h3&gt;

&lt;p&gt;Sometimes the slow file is not your code. It is &lt;code&gt;node_modules/some-library/dist/index.d.ts&lt;/code&gt;. The trace will show this clearly. Common offenders historically have been older versions of typed-form libraries, validation libraries with very expressive types, and ORMs that try to type your entire schema.&lt;/p&gt;

&lt;p&gt;The trace will tell you which library. The fix is usually one of: upgrade to a newer version that has fixed the issue, swap the library, or wrap the library at a thin module boundary so the heavy types do not leak into your call sites. The wrapping pattern works better than people expect: define a narrower internal type for the bits of the library you actually use, and import only that internal type from the rest of the codebase. The compiler stops re-checking the library's types every time you reference your internal type.&lt;/p&gt;
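
&lt;p&gt;A sketch of that boundary, with hypothetical names (&lt;code&gt;heavy-orm&lt;/code&gt;, &lt;code&gt;findUserById&lt;/code&gt;) standing in for whatever the trace blames: one module imports the library and its types, everything else imports the plain interface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// src/db/users.ts: the only module that touches the library's types.
// 'heavy-orm' and its API are placeholders for whatever library the trace
// points at, not a real package.
import { orm } from 'heavy-orm';

// A plain, hand-written type for the fields the rest of the codebase uses.
export type UserRecord = {
  id: string;
  email: string;
  createdAt: Date;
};

export async function findUserById(id: string): Promise&amp;lt;UserRecord | null&amp;gt; {
  const row = await orm.users.findFirst({ where: { id } });
  if (!row) return null;
  // Re-mapping to the plain type is the boundary: the ORM's inferred row
  // type stops here and never leaks into call sites.
  return { id: row.id, email: row.email, createdAt: row.createdAt };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;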




&lt;h2&gt;
  
  
  Project References, the Right Way
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;tsconfig&lt;/code&gt; project references are the thing everyone reaches for and rarely sets up correctly.&lt;/p&gt;

&lt;p&gt;The promise of project references is that you split your codebase into smaller projects, each with its own &lt;code&gt;tsconfig.json&lt;/code&gt;, and the compiler builds each project once and reuses the output. Incremental builds are dramatically faster because changing a leaf project does not invalidate the type checking of unaffected projects.&lt;/p&gt;

&lt;p&gt;The catch is that project references require composite mode, which requires every referenced project to emit declaration files, which means every referenced project needs a real build output. This is fine for libraries. It is awkward for application code that historically just relied on &lt;code&gt;tsc --noEmit&lt;/code&gt; for type checking and a separate bundler for output.&lt;/p&gt;

&lt;p&gt;The setup that has worked for me on a Next.js + workspace setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apps/
  web/tsconfig.json
packages/
  domain/tsconfig.json
  database/tsconfig.json
  ui/tsconfig.json
tsconfig.base.json
tsconfig.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The root &lt;code&gt;tsconfig.json&lt;/code&gt; references each project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"files"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"references"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"./packages/domain"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"./packages/database"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"./packages/ui"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"./apps/web"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each package has &lt;code&gt;composite: true&lt;/code&gt;, &lt;code&gt;declaration: true&lt;/code&gt;, and produces a &lt;code&gt;.tsbuildinfo&lt;/code&gt; file. The first build is roughly the same speed as before. The second build is dramatically faster because unchanged packages are skipped entirely.&lt;/p&gt;
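
&lt;p&gt;A sketch of what one of the package-level configs looks like under that setup (paths illustrative, for something like &lt;code&gt;packages/database&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "extends": "../../tsconfig.base.json",
  "compilerOptions": {
    "composite": true,
    "declaration": true,
    "declarationMap": true,
    "outDir": "./dist",
    "rootDir": "./src",
    "tsBuildInfoFile": "./dist/.tsbuildinfo"
  },
  "include": ["src"],
  "references": [{ "path": "../domain" }]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Build with &lt;code&gt;tsc --build&lt;/code&gt; from the root rather than plain &lt;code&gt;tsc&lt;/code&gt;, so the compiler walks the reference graph and skips anything whose &lt;code&gt;.tsbuildinfo&lt;/code&gt; says it is up to date.&lt;/p&gt;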

&lt;p&gt;The mistake to avoid: do not split into projects until you have profiled and have a real reason. A small codebase with project references is slower than the same codebase without, because the overhead of the build orchestration outweighs the savings. The crossover point is usually somewhere around 50,000 lines of TypeScript or three to four logical domains that change independently.&lt;/p&gt;

&lt;p&gt;For Astro, SvelteKit, and Next.js apps specifically, the project reference setup interacts with the framework's own type generation. Read the framework's docs before assuming the standard setup will work; they often have specific guidance.&lt;/p&gt;




&lt;h2&gt;
  
  
  Compiler Settings That Matter for Speed
&lt;/h2&gt;

&lt;p&gt;A handful of &lt;code&gt;tsconfig&lt;/code&gt; options have a direct performance impact. Most of the others do not, regardless of what online guides claim.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;skipLibCheck: true&lt;/code&gt;. This is the single highest-impact setting for most codebases. It tells the compiler not to type-check declaration (&lt;code&gt;.d.ts&lt;/code&gt;) files, which in practice means your &lt;code&gt;node_modules&lt;/code&gt;. The downside is that a broken type declaration in a dependency will not be caught at type-check time. The upside is that you stop doing redundant work for hundreds of dependencies. Almost every production codebase should have this on. Library authors who publish types should have it off in their own builds and on in their consumers' builds.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;incremental: true&lt;/code&gt; with a &lt;code&gt;tsBuildInfoFile&lt;/code&gt;. This caches the type-check graph between runs. Even on a single project (no references), this halves the time of subsequent runs because most files have not changed.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;isolatedModules: true&lt;/code&gt;. Required if you are using a separate bundler for emit (which you almost certainly are in 2026 with Vite, Bun, esbuild, Turbopack, or any of the others). It forces you to write code that can be transpiled file-by-file, without cross-file type information, which is exactly how those bundlers work. Slightly more restrictive, but it is what keeps single-file emit safe.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;moduleResolution: "bundler"&lt;/code&gt;. The newer resolution mode introduced in TypeScript 5.0. Faster than &lt;code&gt;node16&lt;/code&gt; for most setups because it skips some of the legacy behavior. Use it if a modern bundler, rather than Node itself, is resolving your imports.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;noUncheckedIndexedAccess: true&lt;/code&gt;. Not a performance setting, but worth mentioning because people assume it slows things down. It does not. It changes the inferred type of array index access from &lt;code&gt;T&lt;/code&gt; to &lt;code&gt;T | undefined&lt;/code&gt;. Pure type-system change, no impact on check time.&lt;/p&gt;
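
&lt;p&gt;Pulled together, the speed-relevant block of an app &lt;code&gt;tsconfig&lt;/code&gt; looks something like this; treat it as a starting point rather than a drop-in, since &lt;code&gt;moduleResolution&lt;/code&gt; in particular depends on your toolchain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "compilerOptions": {
    "skipLibCheck": true,
    "incremental": true,
    "tsBuildInfoFile": "./node_modules/.cache/tsbuildinfo.json",
    "isolatedModules": true,
    "module": "esnext",
    "moduleResolution": "bundler",
    "noEmit": true,
    "strict": true
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;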

&lt;p&gt;The compiler options that do not matter for speed despite the rumors: &lt;code&gt;strict&lt;/code&gt;, &lt;code&gt;noImplicitAny&lt;/code&gt;, &lt;code&gt;strictNullChecks&lt;/code&gt;, &lt;code&gt;exactOptionalPropertyTypes&lt;/code&gt;. Turning these off does not measurably speed up type checking. They affect what gets reported, not how much work the compiler does.&lt;/p&gt;




&lt;h2&gt;
  
  
  Editor Performance Is a Different Problem
&lt;/h2&gt;

&lt;p&gt;The TypeScript language server is what your editor uses for autocomplete, hover info, go-to-definition, and inline errors. It runs the same compiler as &lt;code&gt;tsc&lt;/code&gt; but with different priorities: it tries to give you fast partial answers rather than complete answers.&lt;/p&gt;

&lt;p&gt;When the editor feels slow, the &lt;code&gt;tsc&lt;/code&gt; benchmark does not always reflect it. The language server has its own performance characteristics. The diagnostic for editor performance is to run the "TypeScript: Open TS Server log" command in VS Code (or your editor's equivalent) and watch what the server is doing. You will see entries like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;Info 1234 [10:31:42.123] getQuickInfoAtPosition: 4823.4ms
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A &lt;code&gt;getQuickInfoAtPosition&lt;/code&gt; taking five seconds means the type at the position you hovered is genuinely that expensive to compute. The hot path in the compiler for hovers is type display, and large inferred types (especially from generic libraries) can blow up at display time even when type checking them is fast.&lt;/p&gt;

&lt;p&gt;Two specific editor optimizations that help:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Memory limit: 8192&lt;/code&gt; (or higher). The default language server memory limit is 3GB. Codebases with very rich types blow past this and the language server starts garbage collecting aggressively, which feels like lag. Bumping the limit in your editor settings is free if you have the RAM.&lt;/p&gt;
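
&lt;p&gt;In VS Code the setting behind that limit is &lt;code&gt;typescript.tsserver.maxTsServerMemory&lt;/code&gt;; other editors expose an equivalent knob. Something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "typescript.tsserver.maxTsServerMemory": 8192
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;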

&lt;p&gt;Disable inlay hints in the files where they are slow. Inlay hints (the inferred parameter types and return types shown in the editor) require the language server to compute every type for display. In files with heavy generics, this is the single most expensive operation. Most editors let you disable specific inlay hint categories. Turning off "All inlay hints" on a heavy file is a quality-of-life win even if you keep them on globally.&lt;/p&gt;

&lt;p&gt;If you are running Cursor, Zed, or any of the AI-augmented IDEs from &lt;a href="https://dev.to/blog/cursor-vs-windsurf-vs-zed-ai-ide-2026"&gt;the IDE comparison post&lt;/a&gt;, the language server runs the same way. The AI features are layered on top, but the underlying TypeScript performance is the language server's responsibility, and the same diagnostics apply.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Project Corsa Changes, and What It Does Not
&lt;/h2&gt;

&lt;p&gt;The Go-based TypeScript compiler (&lt;a href="https://dev.to/blog/typescript-7-project-corsa-go-compiler-2026"&gt;Project Corsa&lt;/a&gt;) is the largest single performance change to the language since it shipped. The headline numbers are real: 10x faster on most codebases, sometimes more on codebases that are I/O bound.&lt;/p&gt;

&lt;p&gt;What it does not change is the type system. A codebase with quadratic type-instantiation patterns will still have quadratic type-instantiation patterns under Corsa. The 10x speedup compounds: a 90-second build becomes 9 seconds, but a 9-minute build becomes 54 seconds, which is still slow. If your codebase is generating millions of redundant type instantiations, fixing those patterns is still worth doing. Corsa makes the existing work faster; it does not make the work go away.&lt;/p&gt;

&lt;p&gt;For most codebases, the incremental version of Corsa lands as a drop-in replacement for &lt;code&gt;tsc&lt;/code&gt; and the language server. The migration is small. The wins are large. It is worth doing as soon as it is stable for your version of TypeScript. It is not worth waiting for if your build is currently slow; the patterns described above will pay off both before and after Corsa lands.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Concrete Diagnostic Loop
&lt;/h2&gt;

&lt;p&gt;If your build is slow and you do not know why, here is the order of operations that almost always isolates the problem.&lt;/p&gt;

&lt;p&gt;Start with &lt;code&gt;npx tsc --extendedDiagnostics&lt;/code&gt; and capture the timings. Save the output. You will compare against this later.&lt;/p&gt;

&lt;p&gt;Run &lt;code&gt;npx tsc --generateTrace ./trace&lt;/code&gt; and &lt;code&gt;npx @typescript/analyze-trace ./trace&lt;/code&gt;. The output will list the hottest files. Pick the top three.&lt;/p&gt;

&lt;p&gt;Open each of the hot files. Look at the imports first. The expensive types usually come in through an import. Note any types from libraries that look complex (Zod, Drizzle, tRPC, anything with deep generics).&lt;/p&gt;

&lt;p&gt;Search for usages of those types in the file. Find any place where a generic is being inferred deeply or a conditional type is being recursively expanded. These are your candidates for surgery.&lt;/p&gt;

&lt;p&gt;Try the fixes one at a time. After each, re-run &lt;code&gt;tsc --extendedDiagnostics&lt;/code&gt; and compare against the baseline. You want to see the check time drop. If it does not, revert and try the next thing.&lt;/p&gt;

&lt;p&gt;The reason for one-at-a-time changes is that some "fixes" make things worse, and a batched change hides which one helped and which one hurt. The diagnostic is fast enough that the patience pays off.&lt;/p&gt;

&lt;p&gt;Once the hot files are no longer hot, run the trace again. New hot files will surface as the previous ones fall down the list. Stop when the worst file is in a range you are happy with, usually 200ms or less for a single file.&lt;/p&gt;

&lt;p&gt;The whole loop is a day or two of focused work for most codebases. The win is permanent unless someone reintroduces the same patterns, which is why a &lt;code&gt;tsc --extendedDiagnostics&lt;/code&gt; check in CI as a regression guardrail is worth considering.&lt;/p&gt;
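
&lt;p&gt;A sketch of what that guardrail can look like. It assumes the &lt;code&gt;Check time:&lt;/code&gt; line keeps its current format in the &lt;code&gt;--extendedDiagnostics&lt;/code&gt; output and that you pick the budget from your own baseline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// A sketch of a CI guardrail, not a polished tool. It assumes the build
// type-checks cleanly (execSync throws on a non-zero exit) and that the
// diagnostics output contains a "Check time: X.XXs" line.
import { execSync } from 'node:child_process';

const CHECK_TIME_BUDGET_SECONDS = 20; // your baseline plus headroom

const output = execSync('npx tsc --noEmit --extendedDiagnostics', {
  encoding: 'utf8',
});

const match = output.match(/Check time:\s+([\d.]+)s/);
if (!match) {
  console.error('Could not find "Check time" in tsc output; the format may have changed.');
  process.exit(1);
}

const checkTime = Number(match[1]);
console.log(`Check time ${checkTime}s against a budget of ${CHECK_TIME_BUDGET_SECONDS}s`);
if (checkTime &amp;gt; CHECK_TIME_BUDGET_SECONDS) {
  console.error('Type-check time regression: over budget.');
  process.exit(1);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;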




&lt;h2&gt;
  
  
  What I Would Tell You If You Asked
&lt;/h2&gt;

&lt;p&gt;If you have a slow TypeScript codebase and limited time, the highest-leverage thing you can do is generate a trace and read it. Most teams skip this and try fixes blind. The fixes work some of the time, but the trace tells you exactly where to look, and the work after that is usually small.&lt;/p&gt;

&lt;p&gt;The second highest-leverage thing is &lt;code&gt;skipLibCheck: true&lt;/code&gt;, if you do not already have it. The savings are immediate. The downside is rarely material.&lt;/p&gt;

&lt;p&gt;The third is to cap any recursive utility types you have introduced and to push deeply inferred generic helpers to terminate inference earlier. These are pattern-level changes, not config tweaks, and they require reading the trace to know which patterns matter for your codebase.&lt;/p&gt;

&lt;p&gt;What I would not do: rewrite to a different language or framework hoping the performance will be better. Bun, Deno, and esbuild are faster at the bundling and parsing parts, but the type checking is still TypeScript's compiler doing TypeScript's compiler work. The gains from tooling come from building, not type-checking. You can ship faster builds with a faster bundler and still have a 90-second &lt;code&gt;tsc&lt;/code&gt; because nothing about the bundler changed how the type system works.&lt;/p&gt;

&lt;p&gt;The honest summary: TypeScript at scale is fast enough if you do not do the expensive things, and slow if you do. The expensive things are knowable and the fixes are not exotic. The work is figuring out which of them your codebase is doing, which is what the trace is for.&lt;/p&gt;

&lt;p&gt;For the broader picture of where TypeScript is heading, &lt;a href="https://dev.to/blog/typescript-7-project-corsa-go-compiler-2026"&gt;the Project Corsa post&lt;/a&gt; covers what is coming. For a related performance angle on running TypeScript without a build step at all, &lt;a href="https://dev.to/blog/typescript-without-a-build-step-native-type-stripping-in-nodejs"&gt;the type-stripping post&lt;/a&gt; is useful. Both are about reducing the work the toolchain has to do. This post is about reducing the work the type system has to do, which is the part you control directly even before any new compiler ships.&lt;/p&gt;

</description>
      <category>typescript</category>
      <category>devtools</category>
      <category>productivity</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Passkeys in Production: What I Wish I Knew Before Replacing Passwords</title>
      <dc:creator>Alex Cloudstar</dc:creator>
      <pubDate>Fri, 08 May 2026 08:41:53 +0000</pubDate>
      <link>https://dev.to/alexcloudstar/passkeys-in-production-what-i-wish-i-knew-before-replacing-passwords-5dak</link>
      <guid>https://dev.to/alexcloudstar/passkeys-in-production-what-i-wish-i-knew-before-replacing-passwords-5dak</guid>
      <description>&lt;p&gt;The first passkey login I shipped to real users worked perfectly for forty minutes. Then the support tickets started.&lt;/p&gt;

&lt;p&gt;A user with a personal MacBook and a work Windows laptop could not figure out why his iPhone passkey was not showing up on the Windows machine. A second user had set up a passkey on her phone, lost the phone in a taxi, and now could not get into her account because we had quietly deleted her password fallback when she enrolled. A third user was on a corporate-managed Chrome that had &lt;code&gt;WebAuthn&lt;/code&gt; policy-locked to platform authenticators only, but our flow assumed roaming authenticators would always be offered.&lt;/p&gt;

&lt;p&gt;None of these are bugs in WebAuthn. They are the gap between "passkeys work" as a protocol statement and "passkeys work for the actual humans using your product." Most articles on this topic stop at the first half. This one is about the second half, the part you only learn by shipping.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Passkeys Actually Are, Stripped of Marketing
&lt;/h2&gt;

&lt;p&gt;A passkey is a WebAuthn credential where the private key lives in something the user trusts (their device, their password manager, their security key) and the public key lives on your server. Authentication is a signature challenge. Your server sends a random nonce, the authenticator signs it with the private key, you verify the signature against the public key you stored at registration.&lt;/p&gt;

&lt;p&gt;That much has been true since WebAuthn level 1 in 2019. What changed in 2022 and shipped broadly through 2024 and 2025 is the sync part. Apple, Google, and Microsoft started syncing WebAuthn credentials across devices through their cloud accounts. Then 1Password, Bitwarden, and Dashlane started doing the same across platforms. The credential is no longer locked to a single device.&lt;/p&gt;

&lt;p&gt;The user-facing pitch is "no more passwords, no more phishing, your account is just there on every device you trust." The pitch is mostly true. The mostly part is where the work is.&lt;/p&gt;

&lt;p&gt;Three things to internalize before writing any registration code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A passkey is bound to a relying party ID, which is your domain. Cross-domain passkeys do not exist. A passkey for &lt;code&gt;app.example.com&lt;/code&gt; cannot be used on &lt;code&gt;example.com&lt;/code&gt; unless you set the RP ID to the parent domain at registration time. You make this choice once and you live with it.&lt;/li&gt;
&lt;li&gt;A user can have many passkeys. They will. Treat the credential as the primary key for authentication, not the user. One user, many credentials, with metadata on each one (device label, last used, transport types).&lt;/li&gt;
&lt;li&gt;The authenticator decides what is possible. Some authenticators are platform-bound (Touch ID without iCloud Keychain). Some are roaming (YubiKey). Some are syncing (iCloud Keychain, 1Password). Your code asks for what you want and the browser tells you what you got. You design around the answer, not around your assumptions.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Protocol in One Page
&lt;/h2&gt;

&lt;p&gt;Registration is a four-step dance. The browser API is &lt;code&gt;navigator.credentials.create()&lt;/code&gt; with a &lt;code&gt;publicKey&lt;/code&gt; options object. You generate the options on the server, send them down, the browser creates the credential, you send the attestation back, you verify and store.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Server: generate registration options&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;generateRegistrationOptions&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;rpName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Example&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;rpID&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;example.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;userID&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;TextEncoder&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;userName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;attestationType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;none&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;excludeCredentials&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;existingCredentials&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;credentialId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;transports&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;transports&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;})),&lt;/span&gt;
  &lt;span class="na"&gt;authenticatorSelection&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;residentKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;preferred&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;userVerification&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;preferred&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;authenticatorAttachment&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;undefined&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;sessionStore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;challenge&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;challenge&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;options&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three knobs in that block matter more than they look:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;attestationType: 'none'&lt;/code&gt; is the default for consumer apps. Anything else asks the authenticator to prove what it is, which is useful for regulated environments and a privacy concern for everyone else. Most consumer flows do not need it.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;residentKey: 'preferred'&lt;/code&gt; asks for a discoverable credential, which is what makes the "click sign in and just be signed in" flow work without typing a username. The browser treats it as a preference, not a guarantee, so you handle both cases on login.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;authenticatorAttachment: undefined&lt;/code&gt; means the user can pick a platform authenticator (Touch ID, Windows Hello) or a roaming one (security key, phone). Locking this to &lt;code&gt;platform&lt;/code&gt; will exclude users who want their YubiKey. Locking to &lt;code&gt;cross-platform&lt;/code&gt; will exclude users who want Face ID. Leaving it open is almost always right.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Login (assertion) is the same shape inverted:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Server: generate authentication options&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;generateAuthenticationOptions&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;rpID&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;example.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;userVerification&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;preferred&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;allowCredentials&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;undefined&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// empty for discoverable credential flow&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;sessionStore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;challenge&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;challenge&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;options&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Leaving &lt;code&gt;allowCredentials&lt;/code&gt; empty triggers the discoverable credential flow: the browser shows the user every passkey they have for your domain, they pick one, and you find out which user it is from the credential ID after the assertion. This is the flow you want. The alternative, asking the user for their username first and then sending the list of credentials they own, is fine for sign-in form layouts but gives up the magic.&lt;/p&gt;

&lt;p&gt;The verification step on the server is where you check the signature, the challenge match, the origin, the RP ID hash, and the signature counter (if the authenticator increments one). &lt;code&gt;@simplewebauthn/server&lt;/code&gt; handles all of that. You hand it the response, the expected challenge from the session, and your domain, and it tells you whether to trust this assertion.&lt;/p&gt;
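
&lt;p&gt;A sketch of that verification call, assuming &lt;code&gt;@simplewebauthn/server&lt;/code&gt;; the option names have shifted between major versions of the library, so treat the field names as illustrative and check the version you are on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// A sketch of the server-side verification step with @simplewebauthn/server.
// Field names differ slightly across major versions, so this is the shape,
// not copy-paste code. `body` is the JSON the browser posted back, and the
// credential lookup is whatever your own storage layer provides.
import { verifyAuthenticationResponse } from '@simplewebauthn/server';

const { challenge } = await sessionStore.get(session.id);
const stored = await credentialStore.findByCredentialId(body.id);

const verification = await verifyAuthenticationResponse({
  response: body,
  expectedChallenge: challenge,
  expectedOrigin: 'https://example.com',
  expectedRPID: 'example.com',
  credential: {
    id: stored.credentialId,        // in whatever encoding your library version expects
    publicKey: stored.publicKey,
    counter: stored.signatureCounter,
    transports: stored.transports,
  },
});

if (!verification.verified) {
  throw new Error('Assertion failed verification');
}

// Persist the new counter so clone detection keeps working.
await credentialStore.updateCounter(stored.id, verification.authenticationInfo.newCounter);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;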

&lt;p&gt;Most of the protocol-level work is solved by the SimpleWebAuthn library on Node.js, &lt;code&gt;webauthn-rs&lt;/code&gt; in Rust, and equivalent libraries in Go and Python. Writing it yourself in 2026 is not a sign of seriousness. It is a sign of not having read the spec carefully enough to notice how many ways there are to subtly miscount bytes when parsing the authenticator data.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Account Model You Actually Need
&lt;/h2&gt;

&lt;p&gt;The schema for storing passkeys is small but easy to get wrong. The shape that has held up for me across three production rollouts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;User&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;emailVerifiedAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;createdAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;Credential&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;                 &lt;span class="c1"&gt;// your primary key&lt;/span&gt;
  &lt;span class="nl"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;             &lt;span class="c1"&gt;// foreign key&lt;/span&gt;
  &lt;span class="nl"&gt;credentialId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Uint8Array&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;   &lt;span class="c1"&gt;// WebAuthn credential ID&lt;/span&gt;
  &lt;span class="nl"&gt;publicKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Uint8Array&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;      &lt;span class="c1"&gt;// COSE-encoded public key&lt;/span&gt;
  &lt;span class="nl"&gt;signatureCounter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;transports&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;AuthenticatorTransport&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;
  &lt;span class="nl"&gt;deviceLabel&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;        &lt;span class="c1"&gt;// user-editable&lt;/span&gt;
  &lt;span class="nl"&gt;lastUsedAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;createdAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;backupEligible&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;backupState&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two fields people skip and regret: &lt;code&gt;backupEligible&lt;/code&gt; and &lt;code&gt;backupState&lt;/code&gt;. These come from flags on the authenticator data and they tell you whether the credential is syncing across the user's devices. A credential that is &lt;code&gt;backupEligible: true, backupState: true&lt;/code&gt; is a credential that exists in iCloud Keychain or 1Password or similar. If the user loses their phone, that credential is still recoverable. A credential with &lt;code&gt;backupEligible: false&lt;/code&gt; is locked to one device. If that device dies, the credential dies with it.&lt;/p&gt;

&lt;p&gt;You do not show these flags to the user as raw booleans. You use them to decide what to tell the user about recovery. A user who has only single-device credentials needs more aggressive prompting to add a second factor or set up recovery. A user with synced credentials is in much better shape.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;transports&lt;/code&gt; array is what makes the autofill UI on the next device work. A credential created on an iPhone reports &lt;code&gt;['internal', 'hybrid']&lt;/code&gt;. The &lt;code&gt;hybrid&lt;/code&gt; transport is what enables QR-code-mediated cross-device auth where the user scans a code on a desktop with their phone to log in. Storing transports correctly and passing them back in &lt;code&gt;excludeCredentials&lt;/code&gt; and &lt;code&gt;allowCredentials&lt;/code&gt; makes the browser surface the right options at the right moments.&lt;/p&gt;
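&lt;p&gt;A sketch of the login half of that, using the same SimpleWebAuthn helper that appears in the server code later in this post (&lt;code&gt;db&lt;/code&gt; and the &lt;code&gt;Credential&lt;/code&gt; shape are the ones from the schema above; this is the email-first flow, since the discoverable flow omits &lt;code&gt;allowCredentials&lt;/code&gt; entirely):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { generateAuthenticationOptions } from '@simplewebauthn/server';

// Hand stored transports back on login so the browser routes the request
// to the right authenticator, mirroring the excludeCredentials mapping
// in the registration code below.
const credentials = await db.credentials.findByUserId(user.id);
const options = await generateAuthenticationOptions({
  rpID: process.env.WEBAUTHN_RP_ID!,
  allowCredentials: credentials.map((c) =&amp;gt; ({
    id: c.credentialId,
    transports: c.transports,
  })),
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;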

&lt;p&gt;The &lt;code&gt;deviceLabel&lt;/code&gt; field exists because users will end up with five or six credentials and need to be able to tell them apart. "iPhone 15 Pro," "Work MacBook," "1Password," "YubiKey 5C." The browser does not give you a clean device name on registration. You ask the user. A small text input at the end of the registration flow with a sensible default like "Device added on May 8, 2026" is enough.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Recovery Problem
&lt;/h2&gt;

&lt;p&gt;Here is the part most demos skip. Passkeys without a recovery story are worse than passwords, because at least passwords have email-based reset flows that everyone understands.&lt;/p&gt;

&lt;p&gt;The mental model that has worked: a user account needs at least two ways back in, and they need to be independent failure modes. If both of your recovery methods require the user's phone, losing the phone takes the user out of the account permanently. That is a churn event and, for some applications, a regulatory issue.&lt;/p&gt;

&lt;p&gt;The recovery options worth combining:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A second passkey, registered on a different authenticator. "Add another device" is the clean version of this. The phone is one credential, the password manager is another, the laptop's platform authenticator is a third.&lt;/li&gt;
&lt;li&gt;An emailed magic link. Cheap, familiar to users, and works as long as email is accessible. The downside is that it makes your account security exactly as good as the user's email security, which is a known weak link. For a consumer product this is usually acceptable. For a financial product it is not.&lt;/li&gt;
&lt;li&gt;A printed or shown-once recovery code. A 16-character string the user is told to save somewhere. Most users will not save it. The ones who will are exactly the users you want to keep.&lt;/li&gt;
&lt;li&gt;Identity verification through a third-party service. KYC providers can re-verify the user against their original ID. Expensive and slow. Use this for high-value accounts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pattern that holds up: at registration time, push the user to set up a second method before they finish onboarding. If they bail, mark the account as having weak recovery and show a banner on every login until they fix it. The friction is worth it. The cost of supporting "I lost my only passkey" tickets is high and the resolution is often "the user creates a new account and we lose their data."&lt;/p&gt;
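&lt;p&gt;As a sketch, the banner decision can be a pure function over the credential list from the schema above. The &lt;code&gt;hasEmailFallback&lt;/code&gt; flag is an assumption about your own account model, not something WebAuthn gives you:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// 'strong' means at least two independent ways back in. Every synced
// credential shares one failure mode (the sync account), so the whole
// group counts once; each single-device credential is its own hardware.
function recoveryStrength(
  credentials: Credential[],
  hasEmailFallback: boolean,
): 'strong' | 'weak' {
  const paths =
    (credentials.some((c) =&amp;gt; c.backupEligible) ? 1 : 0) +
    credentials.filter((c) =&amp;gt; !c.backupEligible).length +
    (hasEmailFallback ? 1 : 0);
  return paths &amp;gt;= 2 ? 'strong' : 'weak';
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;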

&lt;p&gt;The other thing to do at registration time: do not delete the password if the user has one. Add the passkey alongside, mark passkeys as preferred, and offer to remove the password later once the user has multiple working passkeys. A common rollout mistake is treating passkey registration as a one-way migration. It should be additive. The password becomes a fallback. Once the user has confirmed they can log in with their passkey on every device they use, you can offer to remove the password. Never remove it without an explicit user action.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Cross-Device Reality
&lt;/h2&gt;

&lt;p&gt;The hardest part of shipping passkeys is not writing the code. It is reasoning about what happens when a user sits down at a device that does not have their credential.&lt;/p&gt;

&lt;p&gt;The clean cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;iPhone user opens Safari on their iPhone or Mac signed into the same iCloud account. The credential syncs. Login works.&lt;/li&gt;
&lt;li&gt;1Password user with the browser extension installed and unlocked. The credential is in 1Password. The extension intercepts the WebAuthn ceremony. Login works.&lt;/li&gt;
&lt;li&gt;Android user with Google Password Manager and Chrome signed in. The credential syncs across their Android devices and Chrome on desktop. Login works.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The messy cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mac user logs in on a Windows laptop. iCloud Keychain does not exist on Windows. The user needs to use the cross-device flow: the browser shows a QR code, the user scans it with their iPhone, the iPhone authenticates over Bluetooth, and the desktop receives the assertion through a relay server. This works but it is not obvious to users. The first time they see the QR code they assume something is broken.&lt;/li&gt;
&lt;li&gt;A user with credentials only in their work device's platform authenticator goes home and tries to log in on their personal laptop. Same QR code flow needed. If their work device is in their pocket, it works. If they left it at the office, they are locked out unless they have a second method.&lt;/li&gt;
&lt;li&gt;A user on a corporate-managed device where IT has disabled cross-device authentication. The QR code flow does not appear. The user can only log in if they have a credential on this specific device. Your support team will see this case more than you expect.&lt;/li&gt;
&lt;li&gt;A user whose password manager is locked. 1Password and Bitwarden need to be unlocked before they can serve a passkey. If the user just opened their browser, the autofill prompt may not show their saved passkeys until they manually unlock their password manager. This is confusing and looks like the passkey is missing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pattern that helps: never assume a login attempt is final. Always offer at least two paths on the login page. "Sign in with passkey" and "Email me a sign-in link" side by side. The passkey path covers most cases. The email path covers the user who is on a new device, locked password manager, or weird policy environment. Forcing users into a single path is where the support tickets come from.&lt;/p&gt;

&lt;p&gt;The other thing that helps: explicit copy. When the QR code flow triggers, do not just show the QR code. Tell the user "Use your phone to scan this code and approve the sign-in." Most users have never seen a WebAuthn cross-device flow and need a sentence to recognize what is happening.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Breaks in the Wild
&lt;/h2&gt;

&lt;p&gt;A list of real failures from real production rollouts. None of these are exotic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Safari and the third-party cookie blocker.&lt;/strong&gt; Safari's privacy mode in some configurations blocks the storage that holds the WebAuthn challenge if you store it in a cookie scoped wrong. If you are seeing intermittent challenge mismatch errors specifically on Safari, check that your session cookie has &lt;code&gt;SameSite=Lax&lt;/code&gt; and is not getting blocked by intelligent tracking prevention. Storing the challenge server-side keyed by session ID dodges this entirely.&lt;/p&gt;
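&lt;p&gt;If you do keep session state in a cookie, the settings matter. A sketch with express-session, shown as an assumption since any session middleware has the same knobs (&lt;code&gt;app&lt;/code&gt; is your Express instance):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import session from 'express-session';

// SameSite=Lax keeps the cookie on top-level navigations without tripping
// Safari's tracking protections the way third-party cookie setups do
app.use(
  session({
    secret: process.env.SESSION_SECRET!,
    resave: false,
    saveUninitialized: false,
    cookie: { sameSite: 'lax', secure: true, httpOnly: true },
  }),
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;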

&lt;p&gt;&lt;strong&gt;Subdomain credential split.&lt;/strong&gt; A user registers a passkey on &lt;code&gt;app.example.com&lt;/code&gt; because that is what the browser was on at the time. They later try to log in on &lt;code&gt;example.com&lt;/code&gt;. The credential does not show up because the RP ID does not match. Fix: pick one canonical RP ID at the start, usually the registrable domain (&lt;code&gt;example.com&lt;/code&gt;), and use it everywhere. Migrating later is painful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Counter rollback.&lt;/strong&gt; Some authenticators (notably some old YubiKeys) increment the signature counter on each authentication. Some (most platform authenticators today) do not, and the counter stays at zero. Your verification logic should accept both. A naive "counter must always increase" check rejects platform authenticator users intermittently.&lt;/p&gt;
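&lt;p&gt;If you verify assertions yourself rather than letting a library handle it, the lenient version is a few lines (a sketch; &lt;code&gt;stored&lt;/code&gt; comes from your credentials table, &lt;code&gt;received&lt;/code&gt; from the authenticator data):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Accept authenticators that never increment (both counters stay 0) and
// normal increments. Reject only a counter that failed to move past the
// stored value, which is the actual clone signal.
function checkSignatureCounter(stored: number, received: number): void {
  if (stored === 0 &amp;amp;&amp;amp; received === 0) return; // counter unused
  if (received &amp;gt; stored) return;                // normal increment
  throw new Error('counter did not increase: possible cloned authenticator');
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;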

&lt;p&gt;&lt;strong&gt;The exclude list explosion.&lt;/strong&gt; &lt;code&gt;excludeCredentials&lt;/code&gt; is meant to prevent the user from registering the same authenticator twice. If a user has 12 credentials, you send 12 entries in the exclude list. Some authenticators handle this poorly and time out. Cap the exclude list at the user's most recently used credentials, or skip it entirely and dedupe on the server when you receive the registration response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resident key promises broken.&lt;/strong&gt; You ask for &lt;code&gt;residentKey: 'required'&lt;/code&gt; because you want discoverable credential flows. The user's authenticator does not support it. The browser silently registers a non-discoverable credential. The user's next login does not show their passkey in the autofill prompt because the credential is not discoverable. Fix: check the response's &lt;code&gt;authenticatorAttachment&lt;/code&gt; and &lt;code&gt;credentialDeviceType&lt;/code&gt; to see what you actually got, and surface a warning if the flow you wanted is not what was created.&lt;/p&gt;
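&lt;p&gt;A sketch of that check. The &lt;code&gt;credProps&lt;/code&gt; client extension is the most direct discoverability signal when the browser reports it, so this uses it alongside &lt;code&gt;authenticatorAttachment&lt;/code&gt;; the type import assumes the SimpleWebAuthn packages already in use here:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import type { RegistrationResponseJSON } from '@simplewebauthn/server';

// Warnings to surface when registration produced something other than the
// discoverable, synced credential the options asked for.
function registrationWarnings(response: RegistrationResponseJSON): string[] {
  const warnings: string[] = [];
  if (response.clientExtensionResults.credProps?.rk === false) {
    warnings.push('Credential is not discoverable; it will not appear in autofill.');
  }
  if (response.authenticatorAttachment === 'cross-platform') {
    warnings.push('Credential lives on a roaming authenticator, not this device.');
  }
  return warnings;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;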

&lt;p&gt;&lt;strong&gt;Email-as-username collision with discoverable credentials.&lt;/strong&gt; You designed your sign-in page to ask for an email first, then offer a passkey. Discoverable credential flow is a button labeled "Sign in with passkey" that bypasses the email entry. New users who open your sign-in page see two options and pick the wrong one. The fix is to combine: show the passkey button up front, and below it, the email input for users who do not have a passkey or want the magic-link path.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Code That Holds Up
&lt;/h2&gt;

&lt;p&gt;What I have ended up with after a few rounds of iteration, on the server side:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;generateRegistrationOptions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;verifyRegistrationResponse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;generateAuthenticationOptions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;verifyAuthenticationResponse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@simplewebauthn/server&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;RP&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;WEBAUTHN_RP_ID&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;WEBAUTHN_RP_NAME&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;origin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;WEBAUTHN_ORIGIN&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;startRegistration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;User&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;credentials&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findByUserId&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;generateRegistrationOptions&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;rpName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;RP&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;rpID&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;RP&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;userID&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;TextEncoder&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="na"&gt;userName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;attestationType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;none&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;excludeCredentials&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;credentialId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;transports&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;transports&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;})),&lt;/span&gt;
    &lt;span class="na"&gt;authenticatorSelection&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;residentKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;preferred&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;userVerification&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;preferred&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;sessionChallenges&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;challenge&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;ttl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;options&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;finishRegistration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;User&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;RegistrationResponseJSON&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;label&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;expectedChallenge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;sessionChallenges&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;expectedChallenge&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;challenge expired&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;verification&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;verifyRegistrationResponse&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;expectedChallenge&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;expectedOrigin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;RP&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;origin&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;expectedRPID&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;RP&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;verification&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;verified&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;verification&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;registrationInfo&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;registration failed&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;info&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;verification&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;registrationInfo&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;credentialId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;credential&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;publicKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;credential&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;publicKey&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;signatureCounter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;credential&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;counter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;transports&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;transports&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;
    &lt;span class="na"&gt;deviceLabel&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;label&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s2"&gt;`Device added &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toLocaleDateString&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;backupEligible&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;credentialBackedUp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;backupState&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;credentialBackedUp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;sessionChallenges&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The login side is the same shape with &lt;code&gt;generateAuthenticationOptions&lt;/code&gt; and &lt;code&gt;verifyAuthenticationResponse&lt;/code&gt;. The thing worth noting is that on a discoverable credential flow, you do not know which user is logging in until the assertion comes back. So you look up the credential by the &lt;code&gt;credentialId&lt;/code&gt; in the assertion, then load the user, then verify. The order matters because verification needs the public key that belongs to that credential.&lt;/p&gt;
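&lt;p&gt;For completeness, a sketch of that order, continuing the module above with its &lt;code&gt;RP&lt;/code&gt;, &lt;code&gt;db&lt;/code&gt;, and &lt;code&gt;sessionChallenges&lt;/code&gt; helpers (&lt;code&gt;findByCredentialId&lt;/code&gt;, &lt;code&gt;findById&lt;/code&gt;, and &lt;code&gt;updateCounter&lt;/code&gt; are assumed helpers on the same store):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;export async function startLogin(sessionId: string) {
  const options = await generateAuthenticationOptions({
    rpID: RP.id,
    userVerification: 'preferred',
    // no allowCredentials: let the authenticator offer discoverable credentials
  });
  await sessionChallenges.set(sessionId, options.challenge, { ttl: 300 });
  return options;
}

export async function finishLogin(sessionId: string, response: AuthenticationResponseJSON) {
  const expectedChallenge = await sessionChallenges.get(sessionId);
  if (!expectedChallenge) throw new Error('challenge expired');

  // the assertion names the credential, and the credential names the user
  const credential = await db.credentials.findByCredentialId(response.id);
  if (!credential) throw new Error('unknown credential');
  const user = await db.users.findById(credential.userId);

  const verification = await verifyAuthenticationResponse({
    response,
    expectedChallenge,
    expectedOrigin: RP.origin,
    expectedRPID: RP.id,
    credential: {
      id: credential.credentialId,
      publicKey: credential.publicKey,
      counter: credential.signatureCounter,
      transports: credential.transports,
    },
  });
  if (!verification.verified) throw new Error('authentication failed');

  await db.credentials.updateCounter(credential.id, verification.authenticationInfo.newCounter);
  await sessionChallenges.delete(sessionId);
  return user;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;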

&lt;p&gt;The session challenge storage is the unsexy part that is worth getting right. A short-lived TTL (five minutes is plenty) keyed by something stable for the request, and never reused. Reusing a challenge breaks the security model entirely. If you are tempted to write your own challenge storage, use Redis or your existing session store and move on.&lt;/p&gt;
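&lt;p&gt;The &lt;code&gt;sessionChallenges&lt;/code&gt; object used above can be this small. A sketch with an ioredis-style client; the key prefix is arbitrary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL!);

// set/get/delete with a TTL; the handlers above delete the challenge after
// one use so it can never be replayed
export const sessionChallenges = {
  set: (key: string, challenge: string, opts: { ttl: number }) =&amp;gt;
    redis.set(`webauthn:challenge:${key}`, challenge, 'EX', opts.ttl),
  get: (key: string) =&amp;gt; redis.get(`webauthn:challenge:${key}`),
  delete: (key: string) =&amp;gt; redis.del(`webauthn:challenge:${key}`),
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;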

&lt;p&gt;For the broader auth library question of whether to build this yourself or pick a service like Clerk, Auth0, or Better Auth, the &lt;a href="https://dev.to/blog/better-auth-vs-clerk-vs-supabase-auth-2026"&gt;auth library comparison&lt;/a&gt; is worth reading. Most of the hosted providers now offer passkey support out of the box, with the same recovery and cross-device subtleties handled for you. The decision is the standard one: build for control and customization, buy for speed and offloaded support burden.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Browser Compatibility Floor in 2026
&lt;/h2&gt;

&lt;p&gt;A short matrix of where things actually work as of mid-2026:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Safari 17+ supports passkeys, syncs through iCloud Keychain, supports the cross-device hybrid transport.&lt;/li&gt;
&lt;li&gt;Chrome 125+ supports passkeys on macOS, Windows, Linux, ChromeOS, and Android. Google Password Manager syncs across signed-in devices.&lt;/li&gt;
&lt;li&gt;Firefox 122+ supports the WebAuthn API but does not sync credentials itself. It defers to the OS-level platform authenticator on macOS and Windows. On Linux, the user's experience depends on whether they have a hardware authenticator plugged in.&lt;/li&gt;
&lt;li&gt;Edge follows Chrome.&lt;/li&gt;
&lt;li&gt;Mobile browsers all defer to the OS authenticator. iOS Safari uses iCloud Keychain. Android Chrome uses Google Password Manager. Both work well.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Conditional UI (the autofill prompt that shows passkeys without the user clicking anything) requires the page to call &lt;code&gt;navigator.credentials.get()&lt;/code&gt; with &lt;code&gt;mediation: 'conditional'&lt;/code&gt; and an &lt;code&gt;&amp;lt;input autocomplete="username webauthn"&amp;gt;&lt;/code&gt;. This works in Safari 16+, Chrome 108+, and Firefox 119+. The user experience is excellent when it lands. The fallback to a clicked button needs to exist for browsers that do not support it.&lt;/p&gt;
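&lt;p&gt;A sketch of the wiring with &lt;code&gt;@simplewebauthn/browser&lt;/code&gt;, assuming its v11-style call shape; &lt;code&gt;fetchJSON&lt;/code&gt; is an assumed helper hitting the endpoints from the server code above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { browserSupportsWebAuthnAutofill, startAuthentication } from '@simplewebauthn/browser';

// Call on page load. Resolves only when the user picks a passkey from the
// autofill prompt attached to &amp;lt;input autocomplete="username webauthn"&amp;gt;.
async function armConditionalUI() {
  if (!(await browserSupportsWebAuthnAutofill())) return; // button path only
  const optionsJSON = await fetchJSON('/auth/passkey/options');
  try {
    const assertion = await startAuthentication({ optionsJSON, useBrowserAutofill: true });
    await fetchJSON('/auth/passkey/verify', { method: 'POST', body: JSON.stringify(assertion) });
  } catch {
    // dismissed or aborted; the clicked-button fallback still works
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;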

&lt;p&gt;The compatibility story is in a much better place than it was even a year ago. The remaining gap is configuration, not capability. Corporate-managed environments are still where things break, and the gap between what the spec allows and what enterprise IT permits is the gap your support tickets will live in.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Would Tell My Past Self
&lt;/h2&gt;

&lt;p&gt;Three things that would have saved me significant time on the first rollout.&lt;/p&gt;

&lt;p&gt;The recovery story is the product. Spend more time on it than on the registration flow. Most engineering attention goes to "how do we make registration smooth" and not enough goes to "what happens when the user calls support saying their phone fell in a lake." The second one is what determines whether passkeys are a net win for your users or a way for them to get locked out.&lt;/p&gt;

&lt;p&gt;Add passkey support without removing passwords first. Treat passwords as a legacy fallback, not a problem to eliminate. Letting users opt in incrementally and confirming their passkeys work across all their devices before any cleanup means the rollback path stays open. Removing passwords prematurely is how you generate a churn event.&lt;/p&gt;

&lt;p&gt;Test on a corporate-managed Windows laptop. The flows that are smooth on a personal MacBook with iCloud Keychain are not necessarily smooth on a managed Windows device with a third-party password manager. The only way to know is to try, and ideally to ship a beta to a population that includes those users before you flip the default.&lt;/p&gt;

&lt;p&gt;Passkeys are better than passwords for users who already have a sync mechanism set up. They are a more modest improvement for users with one device. They are a regression for users you push into them without giving them a working recovery story. The technology is solid. The product work around it is where the wins and losses are.&lt;/p&gt;

&lt;p&gt;If you are building auth from scratch in 2026 and want to skip most of this, &lt;a href="https://dev.to/blog/better-auth-vs-clerk-vs-supabase-auth-2026"&gt;the auth library comparison&lt;/a&gt; is the honest version of which providers handle the messy parts well. If you are extending an existing auth system, the SimpleWebAuthn library plus the schema above will get you to a working passkey flow in a week. Getting it to a flow that does not generate support tickets takes longer, and the difference is mostly the work described in this post.&lt;/p&gt;

&lt;p&gt;The protocol is solved. The product is not. That is the gap worth budgeting for.&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>security</category>
      <category>architecture</category>
      <category>devtools</category>
    </item>
    <item>
      <title>JavaScript Async Lifetimes: The Leak You Have and Probably Do Not Know About</title>
      <dc:creator>Alex Cloudstar</dc:creator>
      <pubDate>Thu, 07 May 2026 08:28:47 +0000</pubDate>
      <link>https://dev.to/alexcloudstar/javascript-async-lifetimes-the-leak-you-have-and-probably-do-not-know-about-4j50</link>
      <guid>https://dev.to/alexcloudstar/javascript-async-lifetimes-the-leak-you-have-and-probably-do-not-know-about-4j50</guid>
      <description>&lt;p&gt;Here is a production bug I have seen three times now, in three different codebases, written by three developers who all considered themselves experienced with async JavaScript.&lt;/p&gt;

&lt;p&gt;A route handler fires three parallel database queries with &lt;code&gt;Promise.all&lt;/code&gt;. One of them hits a slow external service and times out after 30 seconds. &lt;code&gt;Promise.all&lt;/code&gt; rejects immediately. The handler sends a 500. The caller moves on. The other two queries are still running. They are holding database connection pool slots. At a few hundred concurrent requests, the pool exhausts. Every subsequent request queues waiting for a slot. The app looks hung, but the logs show mostly successes.&lt;/p&gt;

&lt;p&gt;The fix everyone reaches for is adding a shorter timeout to the slow query. That helps but does not solve the underlying issue. When &lt;code&gt;Promise.all&lt;/code&gt; rejects, it rejects. It does not cancel the tasks it was waiting on. Those tasks have no owner anymore. They run to completion or to error, nobody is listening, and the resources they hold are not released until they are done.&lt;/p&gt;

&lt;p&gt;This is the async leak problem in JavaScript, and it is more common than most people realize because it is often invisible. The code "works" in the sense that it produces correct outputs. The resource leak shows up as a slow degradation under load, a pool exhaustion event, or a flaky test that passes locally and fails in CI on a slow machine.&lt;/p&gt;

&lt;p&gt;ES2026 shipped the primitives to actually fix this. You do not need a library. You do need to understand what you are composing and why.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Three Failure Modes Worth Knowing
&lt;/h2&gt;

&lt;p&gt;Before the solution, the problem is worth making concrete. These are the three production patterns I have seen cause real incidents.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Abandoned Fetch
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;loadDashboard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;notifications&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="nf"&gt;fetchUser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;fetchSettings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;fetchNotifications&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="c1"&gt;// slow, sometimes takes 10 seconds&lt;/span&gt;
  &lt;span class="p"&gt;]);&lt;/span&gt;
  &lt;span class="nf"&gt;renderDashboard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;notifications&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The user navigates away before the notifications fetch completes. The component unmounts. Your framework might fire a cleanup callback, but that cleanup has no way to reach inside &lt;code&gt;Promise.all&lt;/code&gt; and abort the in-flight fetches. All three requests continue running. In a single-page app with heavy route churn, these orphaned fetches accumulate. They fill browser connection slots, they log errors to surfaces nobody checks, and they burn mobile data the user did not ask to spend.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Zombie Database Query
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;userData&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;auditLog&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;recommendations&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
  &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findOne&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;         &lt;span class="c1"&gt;// completes in 5ms&lt;/span&gt;
  &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;audit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findByUser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;      &lt;span class="c1"&gt;// completes in 12ms&lt;/span&gt;
  &lt;span class="nx"&gt;externalService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;recommend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="c1"&gt;// times out after 30s&lt;/span&gt;
&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When &lt;code&gt;recommend&lt;/code&gt; throws, &lt;code&gt;Promise.all&lt;/code&gt; rejects. Your code catches the error and returns a 500. &lt;code&gt;findOne&lt;/code&gt; and &lt;code&gt;findByUser&lt;/code&gt; are still holding connection pool slots from the database. In a busy API, this pattern under load means your connection pool fills with queries attached to requests that have already failed, and new requests queue waiting for slots that are technically occupied by work nobody is waiting for.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Port Still Bound
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;server&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;startServer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;performSetup&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// slow, sometimes takes a few seconds&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;waitForShutdown&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SIGINT&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You hit Ctrl-C during &lt;code&gt;performSetup&lt;/code&gt;. The &lt;code&gt;process.exit(0)&lt;/code&gt; fires synchronously, tearing down the event loop before &lt;code&gt;performSetup&lt;/code&gt; has a chance to resume and reach any cleanup code. The port stays bound. You try to restart and get &lt;code&gt;EADDRINUSE&lt;/code&gt;. You have seen this. The fix is usually "kill the process manually" rather than "understand why the port is not being released."&lt;/p&gt;
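&lt;p&gt;A sketch of the shape that releases the port, using the same hypothetical &lt;code&gt;startServer&lt;/code&gt;/&lt;code&gt;performSetup&lt;/code&gt; helpers plus an assumed &lt;code&gt;server.close()&lt;/code&gt; and a &lt;code&gt;signal&lt;/code&gt; parameter on &lt;code&gt;performSetup&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;const shutdown = new AbortController();
process.on('SIGINT', () =&amp;gt; shutdown.abort(new Error('SIGINT')));

async function run() {
  const server = await startServer(3000);
  try {
    // an abort makes this await reject instead of the process vanishing under it
    await performSetup({ signal: shutdown.signal });
    await server.waitForShutdown();
  } finally {
    await server.close(); // runs on every exit path, so the port is released
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;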

&lt;p&gt;All three of these have the same root cause: the tasks you started have no owner. When the parent gives up, the children keep running. The language gave you a way to start concurrent work, but not a way to define what happens to that work when the context that started it goes away.&lt;/p&gt;




&lt;h2&gt;
  
  
  What ES2026 Actually Gives You
&lt;/h2&gt;

&lt;p&gt;The honest framing first: JavaScript in 2026 does not have a "structured concurrency" primitive in the way Go, Kotlin, or Swift do. There is no native task scope that automatically propagates cancellation to children when the parent exits. That language feature does not exist yet.&lt;/p&gt;

&lt;p&gt;What does exist is a set of composable primitives that were not in the language two years ago. Together they make it possible to build the pattern yourself without depending on an external library.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;await using&lt;/code&gt; and &lt;code&gt;Symbol.asyncDispose&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The Explicit Resource Management proposal reached Stage 4 in May 2025. &lt;code&gt;await using&lt;/code&gt; is now available natively in Node.js 24+ and Chrome 134+. TypeScript has supported it since version 5.2 with transpilation.&lt;/p&gt;

&lt;p&gt;The core idea: any object that defines &lt;code&gt;[Symbol.asyncDispose]()&lt;/code&gt; returning a Promise can be declared with &lt;code&gt;await using&lt;/code&gt;. When the enclosing block exits, regardless of how it exits (normal return, thrown error, early return), the runtime calls and awaits that method before continuing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DatabaseConnection&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="nx"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Connection&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;unknown&lt;/span&gt;&lt;span class="p"&gt;[]):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;Symbol&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;asyncDispose&lt;/span&gt;&lt;span class="p"&gt;]()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;getUser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;using&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;DatabaseConnection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;acquire&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
  &lt;span class="c1"&gt;// the connection releases when this block exits, always&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SELECT * FROM users WHERE id = ?&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important part is "always." Not "if we reach the cleanup code." Not "if the Promise chain resolved normally." The disposal runs if the function returns, if it throws, and if an upstream abort makes an &lt;code&gt;await&lt;/code&gt; inside the block throw. The LIFO ordering also matters: multiple &lt;code&gt;await using&lt;/code&gt; declarations in the same block dispose in reverse order, which is what you want when resources depend on each other.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;AsyncDisposableStack&lt;/code&gt; extends this for ad-hoc aggregation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;withCleanup&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;using&lt;/span&gt; &lt;span class="nx"&gt;stack&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AsyncDisposableStack&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;openConnection&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
  &lt;span class="nx"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;defer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;logCompletion&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
  &lt;span class="c1"&gt;// both cleanup when block exits, in reverse registration order&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;...&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The limitation worth knowing: Safari does not support &lt;code&gt;await using&lt;/code&gt; natively as of early 2026. TypeScript's transpilation covers it for browser targets, but if you rely on native support in a Safari-heavy environment, test carefully.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;AbortSignal.any()&lt;/code&gt; for Composed Cancellation
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;AbortSignal.any()&lt;/code&gt; has been in every major browser since March 2024 (Chrome 116+, Firefox 124+, Safari 17.4+) and is available in Node.js 20+. It takes an array of &lt;code&gt;AbortSignal&lt;/code&gt; instances and returns a new signal that fires the moment any of the input signals fires.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;controller&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AbortController&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;timeoutSignal&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;AbortSignal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;combined&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;AbortSignal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nx"&gt;controller&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;timeoutSignal&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;combined&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fetch aborts if the user cancels (via &lt;code&gt;controller.abort()&lt;/code&gt;) or if the 5-second timeout fires, whichever comes first. The &lt;code&gt;combined&lt;/code&gt; signal's &lt;code&gt;reason&lt;/code&gt; property tells you which input triggered it.&lt;/p&gt;

&lt;p&gt;The real value is in composition. You can have a request-scoped abort signal, a user-interaction abort signal, and a global shutdown signal, and combine them into one that you pass into all the work spawned for a given operation. Any of them firing aborts everything.&lt;/p&gt;
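&lt;p&gt;Concretely, that composition is one &lt;code&gt;AbortSignal.any()&lt;/code&gt; call per operation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// one signal per concern, merged once and passed to everything the operation spawns
const requestController = new AbortController();  // aborted when the caller goes away
const shutdownController = new AbortController(); // aborted app-wide on SIGTERM

const signal = AbortSignal.any([
  requestController.signal,
  shutdownController.signal,
  AbortSignal.timeout(10_000), // per-operation deadline
]);

const response = await fetch('https://api.example.com/data', { signal });
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;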




&lt;h2&gt;
  
  
  Building a Task Scope
&lt;/h2&gt;

&lt;p&gt;These two primitives together make a small but useful abstraction possible. I have been using a version of this in a handful of projects.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TaskScope&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="nx"&gt;controller&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AbortController&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="nx"&gt;signal&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;controller&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="nx"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;unknown&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;

  &lt;span class="nx"&gt;spawn&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;AbortSignal&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;AbortError&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;controller&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;task&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;task&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;Symbol&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;asyncDispose&lt;/span&gt;&lt;span class="p"&gt;]()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;controller&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abort&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;allSettled&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;loadDashboard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;parentSignal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;AbortSignal&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;scopeSignal&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;AbortSignal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="nx"&gt;parentSignal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;AbortSignal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;]);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;scopeController&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AbortController&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;combinedSignal&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;AbortSignal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nx"&gt;scopeSignal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;scopeController&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;using&lt;/span&gt; &lt;span class="nx"&gt;scope&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;TaskScope&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;notifications&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="nx"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;spawn&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;sig&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;fetchUser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;sig&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="nx"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;spawn&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;sig&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;fetchSettings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;sig&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="nx"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;spawn&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;sig&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;fetchNotifications&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;sig&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
  &lt;span class="p"&gt;]);&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;notifications&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When any of the spawned tasks fails, the &lt;code&gt;catch&lt;/code&gt; handler in &lt;code&gt;spawn&lt;/code&gt; calls &lt;code&gt;this.controller.abort()&lt;/code&gt;. All other spawned tasks receive the abort signal and should stop work. When the &lt;code&gt;await using&lt;/code&gt; block exits, the &lt;code&gt;asyncDispose&lt;/code&gt; method fires the abort and waits for all tasks to settle before releasing.&lt;/p&gt;

&lt;p&gt;This does not magically make your fetch calls abort cleanly. Each function you pass to &lt;code&gt;spawn&lt;/code&gt; needs to actually respect the signal. That means threading the signal through to every &lt;code&gt;fetch&lt;/code&gt; call, every database query, every async operation that has a cancellation mechanism. The scope provides the structure; you still do the wiring.&lt;/p&gt;

&lt;p&gt;The fetch case is easy because the fetch API accepts a signal. The database case depends on your driver. Many modern Node.js database drivers support &lt;code&gt;AbortSignal&lt;/code&gt; on query calls. If yours does not, you wrap the query in a &lt;code&gt;Promise.race&lt;/code&gt; against the abort signal and release the connection in the losing branch. It is more boilerplate, but the intent is explicit.&lt;/p&gt;
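
&lt;p&gt;Here is a minimal sketch of that wrapper. The &lt;code&gt;client&lt;/code&gt; shape is hypothetical, standing in for whatever your driver exposes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// A sketch, not a specific driver API: query() and release() are placeholders.
async function queryWithAbort&amp;lt;T&amp;gt;(
  client: { query(sql: string): Promise&amp;lt;T&amp;gt;; release(): void },
  sql: string,
  signal: AbortSignal,
): Promise&amp;lt;T&amp;gt; {
  signal.throwIfAborted();
  const aborted = new Promise&amp;lt;never&amp;gt;((_, reject) =&amp;gt;
    signal.addEventListener('abort', () =&amp;gt; reject(signal.reason), { once: true }),
  );
  try {
    // First to settle wins. The query may keep running server-side,
    // but the caller unblocks and the connection is not leaked.
    return await Promise.race([client.query(sql), aborted]);
  } finally {
    client.release();
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;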




&lt;h2&gt;
  
  
  &lt;code&gt;AsyncLocalStorage&lt;/code&gt; as Context Carrier
&lt;/h2&gt;

&lt;p&gt;One more tool that ties this together, particularly in server environments: &lt;code&gt;AsyncLocalStorage&lt;/code&gt; from Node.js.&lt;/p&gt;

&lt;p&gt;The use case is ambient context: values that need to be available to anything spawned within a request without being passed as arguments everywhere. Request IDs, user sessions, cancellation tokens, tracing metadata.&lt;/p&gt;

&lt;p&gt;Node.js 24 changed the internal implementation of &lt;code&gt;AsyncLocalStorage&lt;/code&gt; from the legacy &lt;code&gt;async_hooks&lt;/code&gt; machinery to a new &lt;code&gt;AsyncContextFrame&lt;/code&gt; backend. The public API did not change, but the correctness did. Earlier versions had edge cases where context could be silently lost across certain microtask boundary patterns. The Node 24 implementation is more reliable, which matters specifically for patterns where context carries cancellation tokens through nested async call chains.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;AsyncLocalStorage&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;node:async_context&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// Node 24+&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;requestContext&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;AsyncLocalStorage&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;AbortSignal&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;requestId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;next&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;controller&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AbortController&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;close&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;controller&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;client disconnected&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)));&lt;/span&gt;
  &lt;span class="nx"&gt;requestContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;controller&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;requestId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="nx"&gt;next&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;anywhereInTheStack&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;requestContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getStore&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;called outside a request context&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;// ctx.signal is the request-scoped abort signal&lt;/span&gt;
  &lt;span class="c1"&gt;// no need to thread it through every function signature&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This pattern composes cleanly with &lt;code&gt;TaskScope&lt;/code&gt;. The scope reads the ambient signal from the store, combines it with its own signal, and any work spawned inside inherits both.&lt;/p&gt;
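
&lt;p&gt;A sketch of that composition, assuming the &lt;code&gt;TaskScope&lt;/code&gt; constructor accepts a parent signal as in the dashboard example above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Assumption: new TaskScope(parent) merges the parent signal with
// the scope's internal controller.
function requestScope(): TaskScope {
  const ctx = requestContext.getStore();
  if (!ctx) throw new Error('called outside a request context');
  return new TaskScope(ctx.signal);
}

async function loadProfile(userId: string) {
  await using scope = requestScope();
  // `return await` matters: the task must settle before the scope disposes.
  return await scope.spawn((sig) =&amp;gt; fetchUser(userId, sig));
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;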




&lt;h2&gt;
  
  
  When to Reach for Effection
&lt;/h2&gt;

&lt;p&gt;The primitives above get you a long way. For most server routes and browser interactions, &lt;code&gt;await using&lt;/code&gt; plus &lt;code&gt;AbortSignal.any()&lt;/code&gt; plus a thin scope abstraction covers the problem.&lt;/p&gt;

&lt;p&gt;Effection is worth knowing about for cases where the generator-based model is a better fit. It is a maintained library (~5KB gzipped) that enforces the lifetime guarantees at the library level: no task outlives its parent, cancellation propagates down the entire task tree, and cleanup always runs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;yield&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;race&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;yield&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;fetchUser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;yield&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;timeout&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;]);&lt;/span&gt;
  &lt;span class="c1"&gt;// the losing task is actively cancelled, not just abandoned&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The difference from &lt;code&gt;Promise.race&lt;/code&gt; is that Effection's &lt;code&gt;race&lt;/code&gt; actively cancels the loser and awaits its cleanup before resolving. &lt;code&gt;Promise.race&lt;/code&gt; abandons the loser. That distinction is exactly what closes off the failure mode described at the start.&lt;/p&gt;

&lt;p&gt;The tradeoff is the generator syntax. It is not familiar to most JavaScript developers, it requires buy-in from the whole team, and it does not incrementally compose with existing async/await code. I would reach for Effection on greenfield CLIs and servers where correctness is the priority and the team is willing to adopt the model. For existing codebases, the &lt;code&gt;await using&lt;/code&gt; approach is easier to add incrementally.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Honest Limitation
&lt;/h2&gt;

&lt;p&gt;I said this at the start and it is worth repeating: JavaScript in 2026 does not enforce task lifetime guarantees. The language lets you build the pattern. It does not require it.&lt;/p&gt;

&lt;p&gt;Compare this with Go's goroutines, where passing a &lt;code&gt;context.Context&lt;/code&gt; is idiomatic and cancellation propagation is expected by every library you use. Or Kotlin coroutines with structured concurrency enforced by the &lt;code&gt;CoroutineScope&lt;/code&gt;. Or Swift's &lt;code&gt;async let&lt;/code&gt;, which lexically bounds the lifetime of the spawned task. In those languages, "structured" is a property the runtime or compiler enforces.&lt;/p&gt;

&lt;p&gt;In JavaScript, "structured" is a property you add to your codebase through discipline and a thin abstraction. The discipline part is the limiting factor. A new engineer joins, writes &lt;code&gt;Promise.all&lt;/code&gt; without threading signals through, and the leak is back.&lt;/p&gt;

&lt;p&gt;The TC39 Concurrency Control proposal (Stage 1) is about concurrency limiting, not lifetime management. It adds a governor model for capping concurrent operations, which is useful but a different problem. There is no proposal on the standards track for native task lifetime management as of mid-2026.&lt;/p&gt;

&lt;p&gt;What we have is enough to write correct code. What we do not have is a language that makes incorrect code hard to write. That gap is worth being honest about, particularly if you are introducing this pattern to a team that is used to &lt;code&gt;Promise.all&lt;/code&gt; and considers the topic closed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Making It Stick in Practice
&lt;/h2&gt;

&lt;p&gt;The structural change that actually made this work in a production codebase I maintain: treat task scope as a first-class part of the request lifecycle, not an optional add-on.&lt;/p&gt;

&lt;p&gt;Every route handler receives an abort signal from the framework (or creates one tied to the response &lt;code&gt;close&lt;/code&gt; event). That signal flows into a &lt;code&gt;TaskScope&lt;/code&gt; that wraps the handler. Every async operation inside the handler uses &lt;code&gt;scope.spawn&lt;/code&gt; rather than raw &lt;code&gt;Promise.all&lt;/code&gt;. New code added later follows the same pattern because the pattern is already in the scaffolding.&lt;/p&gt;

&lt;p&gt;The cost of adoption is the upfront wiring: making sure fetch calls and database queries actually accept and respect an abort signal. Most modern Node.js libraries do. For the ones that do not, a wrapper that races against the signal is worth writing once and reusing.&lt;/p&gt;

&lt;p&gt;The benefit is not academic. Database connection pool exhaustion under load is a genuinely painful incident. Orphaned fetches in a React app are a common source of "this bug only happens after you navigate quickly" reports. Ports that stay bound after Ctrl-C are a small irritation that adds up over a development day.&lt;/p&gt;

&lt;p&gt;These primitives exist now, they are stable in Node.js 24 and modern browsers, and they compose cleanly without pulling in a new runtime model. The question is whether you add the pattern to your scaffolding now or explain the connection pool leak to your on-call engineer six months from now.&lt;/p&gt;

&lt;p&gt;Given how central async JavaScript is to &lt;a href="https://dev.to/blog/ai-agent-tool-design-2026"&gt;AI agent tooling&lt;/a&gt; and multi-step pipelines where task cancellation actually matters, this is one of those patterns that goes from "good practice" to "necessary" as the complexity of what you are building goes up. The primitives are there. Worth using them.&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>devtools</category>
      <category>productivity</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Anthropic and SpaceX: What the Colossus Deal Actually Means for Developers</title>
      <dc:creator>Alex Cloudstar</dc:creator>
      <pubDate>Thu, 07 May 2026 08:28:14 +0000</pubDate>
      <link>https://dev.to/alexcloudstar/anthropic-and-spacex-what-the-colossus-deal-actually-means-for-developers-ken</link>
      <guid>https://dev.to/alexcloudstar/anthropic-and-spacex-what-the-colossus-deal-actually-means-for-developers-ken</guid>
      <description>&lt;p&gt;On May 6, Claude Code's five-hour rate limits doubled. The peak-hour throttling that had been frustrating paid users for months disappeared. Most people noticed the change and moved on without looking too closely at what caused it.&lt;/p&gt;

&lt;p&gt;The answer is strange enough that I think it is worth looking at closely. Anthropic rented the entire Colossus 1 supercomputer cluster in Memphis, Tennessee from SpaceX. That is 220,000 NVIDIA GPUs and 300 megawatts of power capacity, coming online within a month of the announcement. The reason it is strange: three months before signing this deal, Elon Musk had posted on X that Anthropic's AI was "misanthropic and evil" and told the company it was "doomed."&lt;/p&gt;

&lt;p&gt;Let me walk through what actually happened, what it means practically, and what I think it signals about where we are in the AI compute story.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Colossus 1 Actually Is
&lt;/h2&gt;

&lt;p&gt;Most people have heard the name but do not have a clear picture of the scale. Colossus 1 is the original AI supercomputer cluster that xAI (Musk's AI company) built in Memphis starting in 2024. It went operational in July of that year, remarkably fast for infrastructure of that size.&lt;/p&gt;

&lt;p&gt;The hardware breakdown: the cluster runs a mix of NVIDIA H100s, H200s, and GB200s. 220,000 GPUs total. The 300 megawatt power draw is equivalent to the entire electricity load of roughly 300,000 average American homes. When it launched, it was described as the largest AI training facility in the world by a significant margin.&lt;/p&gt;

&lt;p&gt;Here is what changed and why the deal was possible. Since then, xAI (now merged into SpaceX after a $1.25 trillion all-stock deal in February 2026) built Colossus 2, an even larger cluster with around 520,000 GB200s targeting one gigawatt of power capacity. When Grok's training workloads migrated to the newer, faster hardware, Colossus 1 became a 300-megawatt facility generating very little revenue. The deal with Anthropic solves that problem.&lt;/p&gt;

&lt;p&gt;Anthropic gets the compute immediately. SpaceX gets rental income ahead of its planned June 2026 IPO. That is the straightforward business logic.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Anthropic Needed This
&lt;/h2&gt;

&lt;p&gt;Dario Amodei was on stage at Anthropic's developer conference the same day the deal was announced. He said something that landed harder than most conference quotes: the company had projected 10x growth in Q1 2026. The actual number was 80x, annualized. He called it "just crazy" and "too hard to handle."&lt;/p&gt;

&lt;p&gt;Claude Code specifically drove a lot of that. The adoption curve for AI coding tools has been steep across the industry, and Claude Code became the default choice for a large chunk of that market. The infrastructure was not built for 80x growth. That is what was behind the rate limit caps and the peak-hour throttling that paying users had been hitting for months. It was a capacity problem, not a policy problem.&lt;/p&gt;

&lt;p&gt;Anthropic is not short on future compute commitments. The company has deals with Amazon (up to $25 billion invested, roughly 5 gigawatts of Trainium capacity coming over the next few years), Google (up to $40 billion invested, 5 gigawatts via Broadcom), and several other infrastructure partners. The total compute reserved across all of those deals is measured in gigawatts.&lt;/p&gt;

&lt;p&gt;The problem those deals do not solve is now. AWS Trainium rollouts and Google TPU clusters are measured in years, not weeks. Colossus 1 is available within a month of the announcement. For a company that just discovered its demand is 8x higher than forecast, "available in weeks" is worth a lot even at a smaller scale than the future partnerships will deliver.&lt;/p&gt;

&lt;p&gt;The current deal also appears to be focused on inference rather than training. Anthropic trains Claude on AWS Trainium and Google TPUs. Colossus 1's hardware mix, particularly the H100 and H200 GPU density, is better suited for the inference workloads that serve Claude Pro, Claude Max, and the API. The immediate user-facing impact, the doubled rate limits and removed peak throttling, is consistent with that.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Musk Reversal
&lt;/h2&gt;

&lt;p&gt;This is the part of the story that every tech journalist covered, and for good reason. The timeline is genuinely unusual.&lt;/p&gt;

&lt;p&gt;In February 2026, hours after Anthropic announced a $30 billion funding round, Musk posted directly at the @AnthropicAI account: "Your AI hates Whites &amp;amp; Asians, especially Chinese, heterosexuals and men. This is misanthropic and evil. Fix it." In other posts around the same period he called Anthropic "Misanthropic," said it "hates Western civilization," and declared that "Winning was never in the set of possible outcomes for Anthropic."&lt;/p&gt;

&lt;p&gt;He also had a specific grievance: Anthropic had cut off xAI's access to Claude through Cursor, citing their commercial terms that prohibit using the API to build competing AI products. (Anthropic did the same to OpenAI in August 2025.) The xAI cofounder Tony Wu confirmed it internally: "We will take a hit on productivity, but it really forces us to develop our own coding products and models."&lt;/p&gt;

&lt;p&gt;Three months later they signed a deal together.&lt;/p&gt;

&lt;p&gt;Musk's explanation, posted the day after the announcement: "I spent a lot of time last week with senior members of the Anthropic team to understand what they do to ensure Claude is good for humanity and was impressed. Everyone I met was highly competent and cared a great deal about doing the right thing. No one set off my evil detector. So long as they engage in critical self-examination, Claude will probably be good."&lt;/p&gt;

&lt;p&gt;There is one unusual clause buried in the deal: SpaceX reserves the right to reclaim the compute if Anthropic's AI "engages in actions that harm humanity." Whether that is meaningful contractual language or a rhetorical add-on is hard to say from the outside, but it is the kind of condition that reflects how personally Musk had taken the dispute before the handshake.&lt;/p&gt;

&lt;p&gt;My read on the reversal is simpler than the drama makes it seem. Colossus 1 was sitting underutilized. Anthropic needed compute fast and had budget to pay for it. Both sides had a clear financial reason to set the insults aside. The "evil detector" framing is Musk, but the underlying transaction is just two companies with complementary short-term needs.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Actually Changed for Claude Users
&lt;/h2&gt;

&lt;p&gt;The practical changes are real and immediate.&lt;/p&gt;

&lt;p&gt;For Claude Code specifically: five-hour rate limits doubled for Pro, Max, Team, and Enterprise plans. The peak-hour throttling that kicked in during high-demand periods is gone for Pro and Max accounts. If you have been hitting rate limit errors in the late afternoon US time, that should largely stop.&lt;/p&gt;

&lt;p&gt;For API users on Opus models: Anthropic described the limits as "considerably raised" without publishing exact numbers. The framing in the announcement focused on the ability to "process significantly more input and output tokens per minute."&lt;/p&gt;

&lt;p&gt;The rate limit doubling matters more than it might sound if you are actively building with Claude Code. The five-hour window was a real constraint on complex, multi-step agentic tasks. Longer context windows, more tool calls, deeper refactors, those all burn limits faster. Doubling the window is a meaningful change for anyone doing serious work rather than quick edits.&lt;/p&gt;

&lt;p&gt;The timing of availability is also notable. Colossus 1 is supposed to come online for Anthropic within one month of the announcement. That is unusually fast for infrastructure at this scale, but the cluster is already built and operational. It is a matter of provisioning Anthropic's access rather than constructing anything.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Compute Race Is Now a First-Class Business Problem
&lt;/h2&gt;

&lt;p&gt;Something this deal makes clear, if it was not already, is that AI compute is now a strategic constraint that the companies in this space have to solve actively and continuously.&lt;/p&gt;

&lt;p&gt;Anthropic's situation is a good illustration. They have gigawatt-scale deals committed with Amazon and Google. They also just signed an emergency lease on a competitor's data center because the demand curve outran their projections by a factor of eight. Both things can be true at once. Long-term infrastructure deals are not enough on their own when you are growing at rates this fast.&lt;/p&gt;

&lt;p&gt;The orbital compute angle in the announcement is worth noting, even if it reads as forward-looking. Anthropic and SpaceX expressed interest in developing "multiple gigawatts of orbital AI compute capacity." SpaceX filed with the FCC in January 2026 for authorization to deploy a satellite constellation for exactly this purpose. Google published a feasibility study suggesting space-based data centers become cost-competitive with terrestrial ones once Starship brings launch costs down to around $200 per kilogram, which is a realistic target on a ten-year horizon.&lt;/p&gt;

&lt;p&gt;I would not count orbital compute as near-term capacity planning. But it does reflect where the ceiling conversation is already happening. Terrestrial power, land, and cooling are the constraints. SpaceX has a credible path to removing those constraints eventually, and Anthropic is a customer with both the compute need and the capital to be interesting to them as a long-term partner.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Weird Politics at the Edge of This Deal
&lt;/h2&gt;

&lt;p&gt;This part is less about development and more about context, but I think it matters for how you read the deal.&lt;/p&gt;

&lt;p&gt;Anthropic has said they are "very intentional" about where they add compute capacity, specifically mentioning a preference for democratic countries with stable legal frameworks. In the same month they signed this deal, they were actively suing the Trump administration to reverse a Defense Department decision that blacklisted them as a supply chain risk and cut them off from federal contracts.&lt;/p&gt;

&lt;p&gt;Musk, who controls SpaceX and, since the February merger, xAI, is closely aligned with that same administration. There is an obvious tension between Anthropic's stated preference for democratic infrastructure partners and signing a major deal with someone whose political alignment is with the government that just tried to cut them off.&lt;/p&gt;

&lt;p&gt;I am not drawing a conclusion here, partly because the financial logic of the deal is clear and partly because I do not have visibility into how Anthropic weighed the tradeoff internally. But it is the kind of contradiction that tends to come up again when there is a policy dispute down the line. If SpaceX invokes the "harms humanity" reclaim clause someday, that context will matter.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Means for Developers Using Claude
&lt;/h2&gt;

&lt;p&gt;The immediate practical takeaway: the bottleneck you were hitting on Claude Code is about to be significantly less painful.&lt;/p&gt;

&lt;p&gt;The longer-term takeaway is less tidy. The AI infrastructure layer is consolidating around a small number of very large players, and the relationships between those players are more complicated than a simple vendor-customer model. Anthropic's compute stack now includes Amazon, Google, Microsoft, SpaceX, and Fluidstack in a mix of equity investments, compute credits, and rental agreements. Those relationships come with interests that are not always perfectly aligned with the people building on the platform.&lt;/p&gt;

&lt;p&gt;This is not a reason to stop building on Claude. The rate limits are better, the pricing is still competitive, and the &lt;a href="https://dev.to/blog/prompt-caching-production-guide-2026"&gt;prompt caching economics&lt;/a&gt; still favor Anthropic for high-volume production features. For complex agents, the Claude-specific features (extended thinking, memory primitives, tool use) remain genuinely strong. If you have been building your &lt;a href="https://dev.to/blog/ai-agent-frameworks-comparison-2026"&gt;AI agent architecture&lt;/a&gt; around Claude, the deal does not change the calculus there.&lt;/p&gt;

&lt;p&gt;What it does is add one more data point to the general pattern of the AI infrastructure layer being much more entangled than the clean abstractions on the surface suggest. The API call you make to get a completion goes through a stack that includes a data center leased from the company whose CEO called your provider "evil" this spring. That is not an argument for or against using the API. It is just an accurate description of the current state of things.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Short Version
&lt;/h2&gt;

&lt;p&gt;SpaceX had a 300-megawatt data center with 220,000 GPUs sitting underutilized after upgrading to newer hardware. Anthropic was growing 8x faster than projected and hitting capacity limits. They made a deal that makes clear financial sense for both parties, regardless of what either CEO had said about the other three months earlier.&lt;/p&gt;

&lt;p&gt;Claude Code rate limits doubled as a direct result. That is the part that affects your day-to-day work, and it is a real improvement for anyone doing serious agentic development.&lt;/p&gt;

&lt;p&gt;The rest of the story, the Musk reversal, the orbital compute ambitions, the political contradictions, is worth understanding as context for an industry where the infrastructure layer is genuinely complicated and the companies building on it are making consequential decisions about who they do business with. Those decisions have a way of mattering more than they seem to at announcement time.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devtools</category>
      <category>productivity</category>
    </item>
    <item>
      <title>RAG Chunking Strategies In Production 2026: What Actually Survives Real Documents And Real Queries</title>
      <dc:creator>Alex Cloudstar</dc:creator>
      <pubDate>Wed, 06 May 2026 07:47:35 +0000</pubDate>
      <link>https://dev.to/alexcloudstar/rag-chunking-strategies-in-production-2026-what-actually-survives-real-documents-and-real-queries-m8p</link>
      <guid>https://dev.to/alexcloudstar/rag-chunking-strategies-in-production-2026-what-actually-survives-real-documents-and-real-queries-m8p</guid>
      <description>&lt;p&gt;The first RAG system I shipped chunked every document at 512 tokens with a 50 token overlap, because that was the example in the tutorial I was reading at three in the morning. It worked well enough to ship. It worked poorly enough that two weeks later a customer support engineer pinged me with a screenshot of the assistant confidently citing a policy document, except the cited paragraph was the second half of one policy glued to the first half of an unrelated one. The model had retrieved a chunk that crossed a section boundary, and the chunk read like a single coherent rule that did not exist anywhere in the source. Fixing that one bug took longer than building the original retriever.&lt;/p&gt;

&lt;p&gt;That was a few years ago. The pattern has not changed. Teams still ship RAG systems where the LLM is sophisticated, the embedding model is fine, the vector store is overkill for the data volume, and the chunker is a one-line call to a default splitter that tears documents apart at arbitrary character offsets. The retrieval looks like it is working in the demo, because the demo uses clean Wikipedia paragraphs. It stops working the moment the documents are real, which means messy, inconsistent, structurally meaningful, and full of edge cases the default chunker has never seen.&lt;/p&gt;

&lt;p&gt;By 2026 the production patterns for chunking have settled. They are not glamorous. They are mostly about respecting the structure the document already has, sizing chunks to match how the embedding model thinks, and making the retrieval shape match the queries you actually expect. This post is what I would tell my past self before that 3 a.m. tutorial, and what I would build into any retrieval pipeline before its first real user.&lt;/p&gt;

&lt;h2&gt;
  
  
  Chunking Is The Hidden Half Of RAG
&lt;/h2&gt;

&lt;p&gt;The framing most teams start with is that RAG is about retrieval and generation, with chunking somewhere in the wiring. That framing is wrong. The chunker decides what answers can possibly be found, because the unit of retrieval is the chunk. If the right answer lives in a span the chunker split in half, the retriever cannot return it intact, and the model cannot cite it. Every other component in the pipeline is downstream of the chunking choice.&lt;/p&gt;

&lt;p&gt;This is the same lesson I keep relearning in every retrieval project. You can change the embedding model, swap the vector store, tune the top-k, add a reranker, and you are still bottlenecked by whether the chunks contain the answers the user asks about. A great LLM cannot answer from a chunk that does not contain the relevant information. A great embedding model cannot match a query to a chunk where the answer is split across two retrievable units. The chunker is the floor, and most teams ship with that floor lower than they realize.&lt;/p&gt;

&lt;p&gt;The reason it stays hidden is that chunking failures are silent. The system returns plausible-looking citations, the model produces fluent answers, and only a careful read of the source documents reveals that the answer is wrong, or partial, or stitched together from the wrong context. Compare that to a pipeline where the embedding model is broken: queries return obvious garbage, on-call gets paged, the bug is fixed in an afternoon. Chunking bugs do not page anyone. They show up as a slow drift in answer quality and an unhappy customer support engineer who does not know how to file the ticket.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fixed-Size Chunking Is The Default For A Reason, And A Trap For Another
&lt;/h2&gt;

&lt;p&gt;The default everybody starts with is fixed-size chunking. Pick a chunk size, pick an overlap, slide a window across the document. It is one line of code. It works on any document type. It produces predictable chunk counts and predictable storage costs. There is a real reason this pattern is the default, and there is a real reason it stops being good enough the moment the documents have any structure at all.&lt;/p&gt;

&lt;p&gt;The strength of fixed-size chunking is that it is uniform. Every chunk is the same size, every chunk has the same overlap with its neighbors, and the embedding model sees inputs in a consistent shape. That uniformity matters more than people give it credit for. Embedding quality is sensitive to chunk size, and a pipeline where chunks vary wildly in length produces vectors that are not directly comparable. A 50-token chunk and a 2000-token chunk live in different parts of the embedding space, even if they describe the same topic, because the model encodes density and breadth differently. Fixed-size chunking sidesteps that problem by pretending everything is the same shape.&lt;/p&gt;

&lt;p&gt;The weakness is the part everybody hits within a week of shipping. Fixed-size chunking ignores the structure of the document. It splits in the middle of sentences, in the middle of code blocks, between a heading and the section it introduces, between a question and its answer. The overlap parameter is supposed to paper over this, but overlap is a band-aid. A 50-token overlap on a 512-token chunk gives the next chunk a small lead-in to the previous one, but it does not preserve the boundary that mattered, which was the section heading. The retriever finds the body but loses the title that explained what the body was about.&lt;/p&gt;

&lt;p&gt;The pattern that has worked when I am stuck with fixed-size chunking is to preprocess aggressively. Before the splitter runs, I prepend every chunk with the document title and the nearest preceding heading. The chunker still cuts where it cuts, but the chunk now carries enough context that the embedding can place it in the right neighborhood. This is a hack, and it works, and it is almost always worth the small storage hit. The chunk that says "from a document titled X, in a section about Y, the following text..." retrieves better than the chunk that starts mid-paragraph with no signal of where it came from.&lt;/p&gt;
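
&lt;p&gt;A minimal sketch of that preprocessing step, with illustrative field names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Prepend provenance so the embedding can place the chunk correctly.
// The raw text can still be what you store and display.
interface RawChunk {
  text: string;
  docTitle: string;   // document title
  heading: string;    // nearest preceding heading
}

function withContextHeader(chunk: RawChunk): string {
  return `Document: ${chunk.docTitle}\nSection: ${chunk.heading}\n\n${chunk.text}`;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;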

&lt;h2&gt;
  
  
  Structure-Aware Chunking Is Where Production Lives
&lt;/h2&gt;

&lt;p&gt;The next step up, and the one most production systems should be at, is to chunk along the structure the document already carries. Markdown documents have headings. HTML has tags. PDFs have pages and, with the right parser, sections. Code has functions and classes. Notion pages, Confluence pages, and most internal documentation systems expose a structural tree if you ask nicely. Use it.&lt;/p&gt;

&lt;p&gt;The pattern is to split at structural boundaries first, then post-process to merge or further split based on size constraints. A markdown document becomes a tree of sections, each section becomes a candidate chunk, and any section that exceeds the embedding model's effective context gets recursively split along sub-headings. Sections that are too small get merged with their neighbors, but only their structural neighbors, never across a top-level heading. The output is chunks that respect the author's intent: each chunk is a thing the author wrote as a unit, not a slice of arbitrary text.&lt;/p&gt;
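
&lt;p&gt;A sketch of the heading-first split for markdown. It is deliberately naive: a production version must also skip fenced code blocks and apply the merge-and-split size pass described above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;interface Section { heading: string; body: string; level: number }

function splitMarkdownByHeadings(md: string): Section[] {
  const sections: Section[] = [];
  let current: Section = { heading: '', body: '', level: 0 };
  for (const line of md.split('\n')) {
    const m = /^(#{1,6})\s+(.*)$/.exec(line);
    if (m) {
      // A heading closes the previous section and opens a new one.
      if (current.body.trim()) sections.push(current);
      current = { heading: m[2], body: '', level: m[1].length };
    } else {
      current.body += line + '\n';
    }
  }
  if (current.body.trim()) sections.push(current);
  return sections;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;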

&lt;p&gt;The benefit shows up in retrieval quality, but it also shows up in citation quality. When a structural chunk is retrieved, the model can cite the section heading directly. The user can see "this answer comes from Section 4.2 of the Refunds Policy" instead of "this answer comes from chunk 137." That is a product feature. Users trust citations they can verify. Citations that point to recognizable structural units are easier to verify than citations that point to opaque ranges.&lt;/p&gt;

&lt;p&gt;The trap with structure-aware chunking is that the structural parser has to be good. A bad markdown parser will mistake a code block for a heading and chunk wrong. A bad PDF parser will fail to find sections in a document where the section breaks are visual rather than semantic, which is most real PDFs. Investing in the parser is the unglamorous part of this work. The right move is to spend a day looking at how your parser actually splits a representative sample of your documents, and to fix the cases where it is wrong. The fixes pay back for the lifetime of the index.&lt;/p&gt;

&lt;h2&gt;
  
  
  Semantic Chunking Sounds Smart, Mostly Is Not
&lt;/h2&gt;

&lt;p&gt;There is a class of chunking strategies marketed as "semantic" that try to use embeddings or a small model to find natural break points in the text. The pitch is that the chunker reads the document, notices where the topic shifts, and cuts there. The pitch is correct in theory. In practice, semantic chunking works well on a narrow set of documents and poorly on most of the rest, and the cost is high enough that the trade is rarely good.&lt;/p&gt;

&lt;p&gt;Where it works is on flowing prose without explicit structure. Long-form articles, transcripts, books. The structural signals are absent, the topic shifts are real, and a semantic chunker can find a cut point that a fixed-size chunker would miss. If the entire corpus is documents like this, semantic chunking is worth the engineering cost.&lt;/p&gt;

&lt;p&gt;Where it fails is everything else. On structured documents the semantic chunker fights with the structure. The headings already mark topic shifts, and the embedding-based detector is noisy enough to put cuts in places where the author did not intend cuts. On code, on logs, on FAQs, on transactional documents, semantic chunking adds latency and cost without measurable retrieval improvements. The teams I have seen ship semantic chunking and keep it are the ones whose corpus is dominated by long prose. Everybody else has either ripped it out or quietly downgraded to structure-aware with semantic-style heuristics for the rare cases where it matters.&lt;/p&gt;

&lt;p&gt;The compromise that works is to use a semantic detector only as a fallback. If a structural chunk is too long to fit the embedding model's window, use a semantic detector to find the best cut point inside it. That keeps the cost bounded and the benefit targeted at the cases where structure has run out.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hierarchical Chunking And The Parent-Child Pattern
&lt;/h2&gt;

&lt;p&gt;The pattern that has earned its place in production over the last two years is hierarchical chunking, sometimes called the parent-child or small-to-big pattern. The idea is to chunk at two granularities. Small chunks, sized for retrieval, are what the embedding model and the vector store see. Large chunks, sized for context, are what the LLM sees when a small chunk is retrieved. The retrieval index points from the small chunk to its parent.&lt;/p&gt;

&lt;p&gt;The reason this works is that retrieval and generation have different sweet spots. Retrieval works best on chunks small enough that the embedding represents a single coherent idea. The vector for a 200-token chunk about how to issue a refund is sharp. The vector for a 2000-token chunk that contains that same idea plus four other ideas is blurred, because the embedding has to average over all of them. Generation, on the other hand, works best with more context, because the model needs the surrounding details to produce a complete answer.&lt;/p&gt;

&lt;p&gt;The hierarchical pattern lets you have both. The retriever finds the precise small chunk that matches the query. The pipeline then expands to the parent, which is the section or the page or the document, and sends that to the LLM. The model gets the precision of the small chunk's match and the context of the parent's surroundings. The cost is a little extra storage for the parent text, which is rounding error in any production vector store.&lt;/p&gt;
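
&lt;p&gt;A sketch of the retrieval side. The search function and storage shapes are hypothetical; the load-bearing parts are the child-to-parent pointer and the deduplication:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;interface ChildChunk { id: string; parentId: string; text: string }  // small, embedded
interface ParentChunk { id: string; text: string }                   // section/page, sent to the LLM

async function retrieveWithParents(
  query: string,
  search: (q: string, k: number) =&amp;gt; Promise&amp;lt;ChildChunk[]&amp;gt;,  // your vector store
  parents: Map&amp;lt;string, ParentChunk&amp;gt;,
  k = 10,
): Promise&amp;lt;ParentChunk[]&amp;gt; {
  const children = await search(query, k);
  // Several matching children often share one parent; deduplicate.
  const seen = new Set&amp;lt;string&amp;gt;();
  const out: ParentChunk[] = [];
  for (const c of children) {
    if (seen.has(c.parentId)) continue;
    seen.add(c.parentId);
    const p = parents.get(c.parentId);
    if (p) out.push(p);
  }
  return out;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;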

&lt;p&gt;The discipline is to set the parent boundary at a level that means something. Parents that are entire documents are usually too big. Parents that are paragraphs are usually too small. The right level is almost always the structural level: a section in a markdown doc, a page in a PDF, a function in a code file. The parent is the unit a human would point to when asked "where did this come from."&lt;/p&gt;

&lt;p&gt;The same discipline I covered in &lt;a href="https://dev.to/blog/rag-vs-long-context-2026"&gt;RAG vs long context&lt;/a&gt; applies here, because hierarchical chunking is partly an answer to the question of how much context to send. The retrieval narrows the search. The parent expansion gives the model enough surrounding text to produce a grounded answer. Tuning the small-chunk size and the parent size independently is one of the highest-leverage tuning operations in a RAG pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Chunk Size: The Number Everyone Asks About And The Wrong One To Optimize First
&lt;/h2&gt;

&lt;p&gt;The first question every team asks is what chunk size to use. The honest answer is that it depends on the embedding model, the document type, and the query shape, and the fastest way to get to a good number is to start at 256 to 512 tokens and adjust by measuring. Anchoring to a number before measuring is how teams end up with a confidently wrong setting.&lt;/p&gt;

&lt;p&gt;Embedding models have an effective context that is shorter than their advertised maximum. A model with an 8192-token context window does not embed 8192-token chunks as well as it embeds 512-token chunks. The longer the input, the more the embedding has to compress, and the more semantic detail gets lost in the averaging. The advertised context is the limit, not the recommendation. The recommendation is usually a few hundred tokens, sometimes up to a thousand for newer models. Check the model card. Then verify on your own data, because model cards are written for a benchmark and not for your corpus.&lt;/p&gt;

&lt;p&gt;Document type matters because chunk size interacts with information density. Technical documentation packs ideas tightly: a 256-token chunk of API reference can contain three or four distinct facts. Narrative content is sparser: a 256-token chunk of a blog post might contain half of a single argument. The right chunk size for the dense corpus is smaller, because the embedding can capture the multi-fact density at smaller sizes. The right chunk size for the sparse corpus is larger, because cutting too small leaves the chunks without enough signal to retrieve.&lt;/p&gt;

&lt;p&gt;Query shape matters because the chunk has to answer the kind of question users ask. If the queries are precise lookups ("what is the refund window for product X"), small chunks win, because the answer is a single fact and small chunks isolate facts. If the queries are exploratory ("how does our refund process work"), larger chunks win, because the answer needs context the user is implicitly asking the system to assemble. Most production systems get a mix of both, and the right move is hierarchical chunking, which sidesteps the choice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overlap: The Knob That Matters Less Than You Think
&lt;/h2&gt;

&lt;p&gt;The other parameter every tutorial mentions is overlap. The standard advice is to overlap chunks by 10 to 20 percent. The standard advice is fine and almost never the difference between a working system and a broken one. Overlap is a small lever, and tuning it is one of the last things to do.&lt;/p&gt;

&lt;p&gt;The reason overlap exists is to handle the case where the answer to a query straddles a chunk boundary. With no overlap, the answer is split between two chunks, and neither chunk is a great match for the query. With overlap, one of the two chunks contains the full answer, and the retriever can find it. This is real, and overlap helps, and the help is bounded.&lt;/p&gt;

&lt;p&gt;The case where overlap stops helping is when the chunk boundaries are wrong in the first place. Adding overlap to a fixed-size chunker that splits in the middle of sentences does not produce chunks that respect sentence boundaries. It produces chunks that share a few sentences with their neighbors and still split mid-sentence at the start and end. The fix is not more overlap. The fix is structure-aware chunking that does not split mid-sentence.&lt;/p&gt;

&lt;p&gt;The other case where overlap is wasted is when the chunk size is already large enough that boundary-straddling answers are rare. A 2000-token chunk almost never has its answer split across the boundary, because almost any answer fits inside it. Spending storage on overlap at that size is paying for an edge case that does not happen.&lt;/p&gt;

&lt;p&gt;The pattern I default to is small overlap, around 10 percent, on smallish chunks, around 256 to 512 tokens. It is a sensible setting that does not need tuning unless something else in the pipeline forces it. If the retrieval quality is bad, do not start by tuning overlap. Start by looking at whether the chunks themselves make sense.&lt;/p&gt;

&lt;h2&gt;
  
  
  Metadata Is The Multiplier
&lt;/h2&gt;

&lt;p&gt;The chunk text is not the only thing you store. Every chunk should carry metadata that lets the retriever filter, the reranker reason, and the LLM cite. Document title. Section heading. Source URL. Author. Publication date. Document type. Tags. Whatever your system has that distinguishes documents from each other.&lt;/p&gt;

&lt;p&gt;Metadata pays back in three places. First, in retrieval, where filters cut the search space and improve precision. A query about a 2024 policy should not return a chunk from a 2020 policy, no matter how semantically similar the text is. A metadata filter on date solves that without any embedding-side work. Second, in reranking, where the metadata becomes additional features the reranker can weight. Recent documents, authoritative sources, official policies score higher. Third, in citation, where the metadata is what the LLM uses to tell the user where the answer came from. A citation is only as good as the metadata behind it.&lt;/p&gt;

&lt;p&gt;The pattern that has worked is to over-collect metadata at chunking time and decide later what to use. Storage is cheap. Re-chunking the corpus to add a missing field is expensive. If the source has it, capture it. The first time you need to filter by something you did not capture is the day you regret not capturing it.&lt;/p&gt;
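
&lt;p&gt;A sketch of what an over-collected chunk record might look like; the exact fields depend on what your sources expose:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;interface StoredChunk {
  id: string;
  text: string;
  embedding: number[];
  metadata: {
    docTitle: string;
    sectionHeading: string;
    sourceUrl: string;
    author?: string;
    publishedAt?: string;   // ISO date; enables the 2024-vs-2020 policy filter
    docType: string;        // 'policy' | 'api-reference' | 'transcript' | ...
    tags: string[];
  };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;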

&lt;h2&gt;
  
  
  Tables, Code, And Other Things That Break Default Chunkers
&lt;/h2&gt;

&lt;p&gt;Default chunkers handle prose. They do not handle tables, code blocks, lists with structural meaning, or multi-column PDFs. Each of these requires a different strategy, and each of them shows up in real corpora, and each of them silently degrades retrieval if you do not address them.&lt;/p&gt;

&lt;p&gt;Tables are the worst offender. A table chunked by character count loses its row structure and becomes a stream of cells the embedding model cannot interpret. The fix is to detect tables before chunking and serialize them in a format that preserves structure. Markdown tables, JSON arrays of row objects, or natural-language summaries of the table contents all work, with different trade-offs. The summary approach is the highest quality and the highest cost, because it requires running the table through a small model. The markdown approach is cheaper and works for most queries that ask about the table's contents.&lt;/p&gt;
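
&lt;p&gt;A sketch of the markdown serialization, which keeps the header row attached so each serialized table reads as a self-describing unit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Serialize a parsed table so row structure survives chunking.
function tableToMarkdown(headers: string[], rows: string[][]): string {
  const head = `| ${headers.join(' | ')} |`;
  const rule = `| ${headers.map(() =&amp;gt; '---').join(' | ')} |`;
  const body = rows.map((r) =&amp;gt; `| ${r.join(' | ')} |`).join('\n');
  return [head, rule, body].join('\n');
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;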

&lt;p&gt;Code blocks should be chunked by the structure of the code, not by line count. A function or class is the natural unit. Chunking in the middle of a function produces chunks that have neither the signature nor the implementation, and the embedding represents nothing useful. Most languages have AST parsers that can extract function-level chunks cleanly. The investment pays back in code-search quality, which is otherwise terrible.&lt;/p&gt;

&lt;p&gt;Multi-column PDFs are the failure mode that catches every team that ships RAG against scanned documents. The default text extractor reads top-to-bottom, left-to-right, which produces a stream where the first sentence of column one is followed by the first sentence of column two. The chunks are gibberish. The fix is a layout-aware extractor that respects columns, of which there are several open-source and commercial options as of 2026. Pick one, evaluate on your corpus, switch.&lt;/p&gt;

&lt;h2&gt;
  
  
  How To Know Your Chunking Is Wrong
&lt;/h2&gt;

&lt;p&gt;The hardest part of chunking is that the failure signal is buried in answer quality, which is hard to measure and slow to surface. The discipline is to build a small evaluation set early, before the chunker is locked in, and to run it on every chunking change.&lt;/p&gt;

&lt;p&gt;The eval set is a list of representative queries with known correct answers and known correct source spans in the corpus. For each query, the eval measures whether the retrieval returned the chunk containing the correct span, and whether the LLM produced an answer matching the expected one. This is the same evals discipline I covered in &lt;a href="https://dev.to/blog/ai-evals-solo-developers-2026"&gt;AI evals for solo developers&lt;/a&gt;, applied to the retrieval-and-generation pipeline as a unit.&lt;/p&gt;

&lt;p&gt;The chunking-specific signal to watch is recall at k. If the correct chunk is in the top 10 results most of the time, the chunker is doing its job. If the correct chunk is missing from the top 10 even when the embedding model is solid and the query is clear, the chunker has split the answer in a way that breaks retrieval. That signal is much faster to act on than answer quality, because it points directly at the chunking step.&lt;/p&gt;
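
&lt;p&gt;A sketch of that measurement, where &lt;code&gt;retrieve&lt;/code&gt; stands in for your pipeline's search entry point:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;interface EvalCase { query: string; expectedChunkIds: string[] }

async function recallAtK(
  cases: EvalCase[],
  retrieve: (q: string, k: number) =&amp;gt; Promise&amp;lt;string[]&amp;gt;,  // returns chunk ids
  k = 10,
): Promise&amp;lt;number&amp;gt; {
  let hits = 0;
  for (const c of cases) {
    const got = new Set(await retrieve(c.query, k));
    if (c.expectedChunkIds.some((id) =&amp;gt; got.has(id))) hits++;
  }
  // Fraction of queries whose correct chunk appeared in the top k.
  return hits / cases.length;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;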

&lt;p&gt;The other signal is qualitative. Read the chunks. Take a sample of fifty chunks at random and read them as if you were the embedding model. Do they make sense as standalone units? Do they cut off mid-thought? Do they have enough context to be retrievable? Five minutes of reading chunks beats five hours of tuning hyperparameters, every time, and most teams skip it because it does not feel like engineering. It is the most engineering thing you can do at this layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Would Build From Scratch In 2026
&lt;/h2&gt;

&lt;p&gt;If I were starting a RAG pipeline today, the chunker would be structure-aware, hierarchical, with metadata enrichment, with a small overlap, with special handling for tables and code, and with an eval set running on every change. The chunk size would be a few hundred tokens for retrieval, with parents at the section or page level for generation. The fixed-size fallback would only kick in for unstructured prose, and even then with title and heading prepended to every chunk. The semantic chunker would be a fallback inside the structural chunker, used only when a structural unit was too large to embed cleanly.&lt;/p&gt;

&lt;p&gt;That stack is not novel. It is the stack the production teams I trust have converged on, and it is unglamorous in the same way the &lt;a href="https://dev.to/blog/ai-guardrails-output-validation-2026"&gt;guardrails layer&lt;/a&gt; is unglamorous and the &lt;a href="https://dev.to/blog/ai-agent-observability-debugging-production-2026"&gt;observability layer&lt;/a&gt; is unglamorous. The interesting work is at the LLM, the visible improvements are at the LLM, and the actual quality ceiling sits at the chunker. Most of the wins in a RAG system over the next year are going to come from teams realizing this and putting an engineer on the chunking layer for a week instead of swapping models for the third time.&lt;/p&gt;

&lt;p&gt;If your RAG system is producing answers that look right but feel slightly off, the answer is almost never the LLM. It is almost always the chunker, doing exactly what you told it to do, on documents that did not deserve to be cut where they got cut. Fixing that is the highest-leverage thing you can do in retrieval, and it is sitting there, waiting for somebody to read fifty chunks and notice.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>devtools</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Embedding Models And Reranking In Production 2026: Picking The Pair That Actually Lifts Retrieval Quality</title>
      <dc:creator>Alex Cloudstar</dc:creator>
      <pubDate>Wed, 06 May 2026 07:47:34 +0000</pubDate>
      <link>https://dev.to/alexcloudstar/embedding-models-and-reranking-in-production-2026-picking-the-pair-that-actually-lifts-retrieval-2ci2</link>
      <guid>https://dev.to/alexcloudstar/embedding-models-and-reranking-in-production-2026-picking-the-pair-that-actually-lifts-retrieval-2ci2</guid>
      <description>&lt;p&gt;The first time I swapped an embedding model in production, the answer quality on our internal eval set jumped by twelve points and the latency went down. I felt very smart for about a week. Then a customer success engineer asked why the assistant had stopped finding documents that contained exact product SKUs, and I spent a Saturday discovering that the new model, which was great at semantic similarity, had gotten worse at lexical matching. The old model carried enough surface-level signal to find the SKU. The new one had been trained out of that and pretended every SKU was a similar SKU. Recall on a specific class of query had collapsed, and our eval set had not covered that class.&lt;/p&gt;

&lt;p&gt;That is the standard embedding-model story. The model that wins on benchmarks is not always the model that wins on your data, and the model that wins on your data is not always the model that keeps winning when the queries change shape next quarter. Embeddings are not a commodity. The choice of embedding model and the decision of whether to put a reranker behind it are two of the highest-leverage tuning operations in a retrieval pipeline, and most teams treat both as defaults. The defaults are not bad. They are also not what you ship past year one.&lt;/p&gt;

&lt;p&gt;By 2026 the patterns for picking embedding models and adding rerankers have settled into a small set of choices that consistently outperform the defaults. None of them are exotic. All of them are about understanding what each layer does, what it cannot do, and where the failure modes hide. This post is what I would tell my past self after that Saturday.&lt;/p&gt;

&lt;h2&gt;
  
  
  What An Embedding Model Actually Encodes
&lt;/h2&gt;

&lt;p&gt;The framing that helps most when picking an embedding model is to think about what the model was trained to optimize, because that is what its vectors will encode well. Models trained on web search query-document pairs are good at matching short queries to long documents. Models trained on natural language inference are good at semantic similarity between full sentences. Models trained on code are good at code-to-code or code-to-comment retrieval. Models trained on multilingual corpora are good at cross-language retrieval and often slightly worse at any single language than a dedicated monolingual model.&lt;/p&gt;

&lt;p&gt;What this means in practice is that the right model for your corpus depends on what your queries and documents look like. A support knowledge base with short user queries and medium-length policy documents wants a model trained on query-document pairs. A semantic search across blog posts wants a model trained on long-form similarity. A code search wants a code-specific model. A multilingual product wants a multilingual model and accepts the small penalty in any single language. Defaulting to the highest-MTEB-scoring model regardless of corpus is how teams end up with embeddings that are good in general and mediocre on the specific shape of data they actually run.&lt;/p&gt;

&lt;p&gt;The other thing the embedding encodes is what it does not encode. Most general-purpose embedding models are trained to be invariant to surface-level details that do not affect meaning. Word order, exact phrasing, specific identifiers, punctuation. That invariance is great for semantic search. It is terrible for any retrieval that depends on those exact details. SKUs, version numbers, function names, error codes. The model has been trained to compress these into a representation where similar identifiers are close to each other, which is exactly the wrong behavior when the user wants the specific identifier and not a similar one.&lt;/p&gt;

&lt;p&gt;The fix is not always a different embedding model. The fix is often a hybrid retrieval pipeline that combines dense embeddings with a lexical signal. More on that below. But the framing matters: if you understand what the embedding encodes, you understand which queries it will fail on, and you can plan for those failures instead of being surprised by them in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Embedding Model Choice In Three Tiers
&lt;/h2&gt;

&lt;p&gt;The market in 2026 looks like three tiers, and most teams should pick from one of them based on their constraints.&lt;/p&gt;

&lt;p&gt;The frontier tier is the proprietary embedding APIs from the major model providers. These are the models with the highest benchmark scores, the broadest training, and the steepest cost. They are the right default when you do not want to think about it, when latency is not critical, and when sending your data to an external API is acceptable. The capability is real. The trade is the per-token cost and the network round trip on every embed call.&lt;/p&gt;

&lt;p&gt;The open-weights tier is the strong open models, the descendants of E5, BGE, GTE, Nomic, and the like. By 2026 these are good enough that the gap with the frontier API tier is small for most use cases, and they can be served on commodity GPUs at a fraction of the cost. The trade is that you now run inference: GPU bills, autoscaling, monitoring. For high-volume retrieval, this is almost always cheaper than the API after a few weeks. For low-volume systems, the operational cost is not worth it. The same calculus I covered in &lt;a href="https://dev.to/blog/small-language-models-production-2026"&gt;small language models in production&lt;/a&gt; applies here, because embedding models are exactly that: small models you can host yourself when the volume justifies it.&lt;/p&gt;

&lt;p&gt;The specialized tier is models fine-tuned for a specific domain or task. Code embeddings, scientific paper embeddings, legal document embeddings, product search embeddings. These are not always better than the general models on benchmarks, but they are often better on the specific shape of data they were trained for. For domain-heavy products, this tier is worth the search cost. For general-purpose retrieval, it is not.&lt;/p&gt;

&lt;p&gt;The pattern that has worked when I am unsure is to pick a strong open-weights model, run it on a representative eval set, and only escalate to the frontier tier if the open model leaves measurable quality on the table. Start cheap, measure, escalate only when measurement justifies it. The opposite pattern, starting on the frontier API and trying to descend later, almost always stalls because the team gets used to the latency and quality and the migration becomes a project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Embedding Dimension And The Cost Curve
&lt;/h2&gt;

&lt;p&gt;The other axis on which embedding models differ is dimension. Models output vectors of varying lengths: 384, 512, 768, 1024, 1536, sometimes higher. Higher dimensions can encode more information. They also cost more in storage, more in retrieval, and more in latency, and the cost scales linearly with the number of vectors in the index.&lt;/p&gt;

&lt;p&gt;The trade-off is real and the right setting depends on corpus size. For small indexes, up to a few million vectors, dimension does not matter much. The storage and retrieval costs are rounding error, and the quality gain from higher dimensions is worth taking. For larger indexes, tens or hundreds of millions of vectors, dimension becomes a real cost line. Doubling the dimension doubles the storage and roughly doubles the retrieval cost. At those scales, the right move is often the lower-dimension variant of the same model family, accepting a small quality hit for a large cost reduction.&lt;/p&gt;

&lt;p&gt;The pattern that has emerged in 2026 is Matryoshka embeddings, where the same model can produce vectors at multiple dimensions and the lower-dimension variant is a meaningful prefix of the higher-dimension one. This lets a single model serve both a fast, low-dimension index for the first retrieval pass and a slower, high-dimension representation for reranking. If your embedding model supports this, use it. If it does not, picking a fixed dimension that fits the corpus size is the right move. Avoid the trap of picking the highest dimension the model offers because it scored slightly higher on the benchmark. The benchmark did not run at your scale.&lt;/p&gt;
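
&lt;p&gt;The mechanics of the truncation are simple; this sketch assumes the model was actually trained with a Matryoshka-style objective, because truncating an ordinary embedding this way quietly destroys quality:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

# Take the low-dimension prefix of a Matryoshka embedding and renormalize it.
def truncate_embedding(vec: np.ndarray, dim: int) -&gt; np.ndarray:
    prefix = vec[:dim]
    norm = np.linalg.norm(prefix)
    return prefix / norm if norm &gt; 0 else prefix

full = np.random.rand(1024).astype(np.float32)  # stand-in for a real model output
fast_vec = truncate_embedding(full, 256)        # cheap vector for the first retrieval pass
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
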

&lt;h2&gt;
  
  
  Hybrid Search Is Not Optional
&lt;/h2&gt;

&lt;p&gt;Pure dense retrieval, where the only signal is embedding similarity, is the default in tutorials and the wrong default in production. By 2026 the consensus pattern is hybrid search: combine dense retrieval with a lexical signal, usually BM25 or its variants, and merge the results. Teams that do this consistently see measurable lifts on real-world queries. Teams that skip it consistently rediscover this lesson when their assistant fails to find the document containing the exact phrase the user typed.&lt;/p&gt;

&lt;p&gt;The reason hybrid works is that dense embeddings and lexical search fail in opposite ways. Dense embeddings handle paraphrases, synonyms, and semantic similarity. They miss exact-match queries with rare terms. Lexical search handles exact matches and rare terms. It misses paraphrases. The two signals together cover both failure modes, and the resulting retrieval is more robust than either alone.&lt;/p&gt;

&lt;p&gt;The pattern that has worked is to run both retrievers in parallel, take the top-k from each, and merge with a reciprocal rank fusion or a weighted score combination. The simplest weighting is to give each retriever equal weight and fuse by reciprocal rank, which produces solid results without any tuning. The tuned version weights the two signals based on the query type, but the simple version is good enough for most production systems and avoids the complexity of dynamic weighting.&lt;/p&gt;
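
&lt;p&gt;Reciprocal rank fusion is a few lines once both retrievers return ordered lists of document IDs; the smoothing constant of 60 is the commonly used default and a reasonable starting point:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from collections import defaultdict

# Merge the dense and sparse result lists by reciprocal rank.
def rrf_merge(ranked_lists, k=60, top_n=10):
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

dense_hits = ["doc_12", "doc_7", "doc_3"]
sparse_hits = ["doc_7", "doc_42", "doc_12"]
merged = rrf_merge([dense_hits, sparse_hits])  # doc_7 and doc_12 rise to the top
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
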

&lt;p&gt;The implementation cost is low. Most modern vector stores support a sparse index alongside the dense one, and the additional storage for the sparse index is small. The latency cost is also low, because the two retrievals run in parallel and the merge is a few milliseconds. The quality lift is real and shows up most clearly on the queries that pure dense retrieval was secretly failing on. If your retrieval pipeline is dense-only, adding a sparse component is the highest-leverage change available, and it is usually a half-day project.&lt;/p&gt;

&lt;h2&gt;
  
  
  What A Reranker Does, And Why You Probably Need One
&lt;/h2&gt;

&lt;p&gt;A reranker is a model that runs on the top results from the initial retriever and reorders them by relevance to the query. The initial retriever, dense or hybrid, optimizes for recall: getting the right candidates into the top-k. The reranker optimizes for precision: making sure the most relevant candidates are at the top of that list, where the LLM will see them.&lt;/p&gt;

&lt;p&gt;The reason rerankers exist is that the initial retriever is doing fast similarity matching against a vector index, and that matching is approximate. A bi-encoder embedding model produces one vector per document and one vector per query, then computes similarity. It is fast and scales to billions of documents. It is also limited, because the document and the query are encoded independently, without the model ever seeing them together. A cross-encoder, which is what most rerankers are, takes the query and a candidate document as a single input and produces a relevance score that takes both into account. It is much slower, because it has to run for each candidate. It is also much more accurate, because the model can attend to specific overlaps and interactions between query and document.&lt;/p&gt;

&lt;p&gt;The production pattern is to use the bi-encoder for the first pass, retrieve the top 50 to 200 candidates, and run the cross-encoder reranker on that smaller set to pick the top 5 to 10 that go to the LLM. The bi-encoder handles the scaling problem. The cross-encoder handles the quality problem. Together they get you both, with a latency cost in the tens to low hundreds of milliseconds for typical reranker sizes.&lt;/p&gt;
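
&lt;p&gt;A sketch of that second pass using the sentence-transformers &lt;code&gt;CrossEncoder&lt;/code&gt; API; the model name here is one of the widely used open rerankers and is an illustration, not a recommendation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sentence_transformers import CrossEncoder

# The first pass (dense or hybrid) already produced `candidates` as (chunk_id, text) pairs.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates, top_n=5):
    # The cross-encoder sees query and document together, producing one score per pair.
    scores = reranker.predict([(query, text) for _, text in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [candidate for candidate, _ in ranked[:top_n]]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
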

&lt;p&gt;The teams that ship without a reranker usually do so because the demo looked fine and the additional latency felt unnecessary. The teams that add a reranker after the fact almost always see a measurable lift in answer quality, especially on harder queries where the initial retrieval put the right document at rank 5 instead of rank 1. The LLM cannot prioritize a document the retrieval pipeline ranked low, and a reranker is the cheapest way to fix that ordering.&lt;/p&gt;

&lt;h2&gt;
  
  
  Picking A Reranker
&lt;/h2&gt;

&lt;p&gt;Rerankers come in roughly the same three tiers as embedding models. Frontier APIs from major providers, open-weights cross-encoders, and specialized variants. The cost calculus is similar but the latency story is different. Reranking adds latency on every query, which means it sits in the user-perceived path. The choice of reranker is a tighter trade-off than the choice of embedding model, because embedding latency is paid once at indexing time while reranking latency is paid on every query.&lt;/p&gt;

&lt;p&gt;The frontier rerankers are accurate and add real latency. They are the right choice for high-stakes retrieval where the latency budget can absorb a few hundred milliseconds. The open-weights rerankers are nearly as accurate and faster, especially when self-hosted on a GPU close to the application. They are the right choice for most production systems, particularly chat applications where the user is waiting on the response.&lt;/p&gt;

&lt;p&gt;The other lever is reranker size. The same family often comes in multiple sizes, and the small variants are dramatically faster than the large ones with a small quality penalty. For most production systems, the small variant is the right starting point, and the upgrade to a larger variant happens only if the quality measurements justify it. The latency budget is real, and a 50-millisecond reranker that is 95 percent as good as a 250-millisecond reranker is the better production choice nine times out of ten.&lt;/p&gt;

&lt;p&gt;The pattern that has worked when I am picking a reranker is to evaluate three to five candidates on the same eval set used for the embedding model, look at both the quality lift and the p95 latency, and pick the one that maximizes the quality-per-millisecond. The candidate list is small, the eval is fast, and the answer is almost always clearer than it looks before you measure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost And Latency Budgets
&lt;/h2&gt;

&lt;p&gt;A pipeline with hybrid retrieval and reranking has more moving parts than a pure dense pipeline, and each part has its own cost and latency profile. The discipline is to be honest about the budget at each stage and to allocate it intentionally.&lt;/p&gt;

&lt;p&gt;The dense retrieval is the cheapest and fastest stage. It runs in milliseconds against a vector index, and the cost is dominated by the storage of the vectors themselves. The sparse retrieval is similarly cheap, with the storage cost of an inverted index that scales with the number of unique tokens in the corpus. Both run in parallel and contribute milliseconds to the latency budget.&lt;/p&gt;

&lt;p&gt;The reranker is the expensive stage. A cross-encoder running on 50 candidates is a meaningful chunk of latency, and on 200 candidates it can dominate. The lever is the candidate count: rerank fewer candidates and the latency drops linearly. The right candidate count is the smallest one that still surfaces the correct document into the top-k after reranking, which is something the eval set can tell you. Most production systems land somewhere between 30 and 100 candidates, and the variance below that range is small.&lt;/p&gt;

&lt;p&gt;The LLM call is the slowest and most expensive stage by far, and the retrieval pipeline's job is to keep its input small and relevant. A retrieval that returns five precise chunks lets the LLM run on a small input and produce a fast, focused answer. A retrieval that returns twenty mediocre chunks forces the LLM to read more, costs more in tokens, and dilutes the answer. Investing in retrieval quality is the same as investing in LLM cost reduction, and the &lt;a href="https://dev.to/blog/llm-cost-optimization-production-2026"&gt;LLM cost optimization&lt;/a&gt; story I covered earlier is downstream of how good the retrieval is.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multilingual, Multimodal, And The Rest Of The Long Tail
&lt;/h2&gt;

&lt;p&gt;Most embedding models are trained primarily on English. If your corpus or your queries are in other languages, you need a multilingual model, and you need to be honest about the quality trade. Multilingual models are usually slightly worse at any single language than a dedicated monolingual model, and the gap shrinks every year but does not close. For a single-language product, monolingual is the right choice. For a multilingual product, multilingual is the right choice, and the small quality gap is the price of language coverage.&lt;/p&gt;

&lt;p&gt;Multimodal embeddings, where the model encodes both text and images into the same vector space, have matured to the point where they are useful in production for image-text retrieval and visual search. The trade-off is that a model trained on text-image pairs is usually worse at pure text-text retrieval than a dedicated text model. For products where images are central, multimodal embeddings are the right choice. For products where images are incidental, the right move is often two separate indexes, one for text and one for images, with the application deciding which to query based on the input.&lt;/p&gt;

&lt;p&gt;The long tail of edge cases is the part where evals matter most. Numeric reasoning, chronological ordering, complex multi-clause queries, queries that mix exact matches with semantic intent. Each of these is a class where embedding-only retrieval can fail in ways that are not obvious until they show up in production. The defense is the eval set, again. Cover the long tail in your evals and the failures show up before the users find them.&lt;/p&gt;

&lt;h2&gt;
  
  
  How To Tune The Pipeline Without Breaking It
&lt;/h2&gt;

&lt;p&gt;Embedding models and rerankers have a lot of knobs, and the temptation is to tune everything at once. The discipline is to tune one thing at a time, on a fixed eval set, with a measurement loop that takes minutes rather than days.&lt;/p&gt;

&lt;p&gt;Start with the embedding model. Pick three candidates, run them on the eval, look at recall at the top-k that the reranker will see. Pick the best one and lock it in.&lt;/p&gt;

&lt;p&gt;Move to the reranker. Pick two or three candidates, run them on the locked embedding model, look at the answer quality and the latency. Pick the one that maximizes quality within the latency budget.&lt;/p&gt;

&lt;p&gt;Then tune the candidate count for reranking. Sweep from 20 to 200, plot quality versus latency, pick the knee of the curve. The knee is usually obvious. The temptation to rerank everything is rarely justified by the data.&lt;/p&gt;
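
&lt;p&gt;The sweep itself is mechanical; this sketch assumes you already have an eval harness, with &lt;code&gt;run_pipeline&lt;/code&gt; as a stand-in that retrieves, reranks the given number of candidates, and reports whether the correct chunk landed in the final top-k:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

# Sweep the rerank candidate count, recording recall and p95 latency for each setting.
def sweep_candidate_counts(eval_cases, run_pipeline, counts=(20, 50, 100, 200)):
    results = []
    for n in counts:
        latencies, hits = [], 0
        for case in eval_cases:
            start = time.perf_counter()
            hit = run_pipeline(case, rerank_candidates=n)  # True if the correct chunk made the final top-k
            latencies.append(time.perf_counter() - start)
            hits += int(hit)
        latencies.sort()
        p95 = latencies[int(0.95 * (len(latencies) - 1))]
        results.append({"candidates": n, "recall": hits / len(eval_cases), "p95_seconds": p95})
    return results  # plot recall against p95 and pick the knee
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
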

&lt;p&gt;Finally, tune the merge weights for hybrid retrieval, if you are running it. The default of equal weights with reciprocal rank fusion is usually within a percent or two of the optimum, and tuning past that is worth doing only if the gap shows up in evals.&lt;/p&gt;

&lt;p&gt;The discipline that ties all of this together is the same one I covered for &lt;a href="https://dev.to/blog/ai-evals-solo-developers-2026"&gt;AI evals for solo developers&lt;/a&gt;, and it applies the same way here: build the eval first, run the eval on every change, trust the eval over your intuition. Retrieval is a place where intuition is consistently wrong, because the failure modes are subtle and the wins are often counter-intuitive.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Would Build From Scratch
&lt;/h2&gt;

&lt;p&gt;If I were building a retrieval pipeline today, I would start with a strong open-weights embedding model in the bi-encoder tier, hybrid search combining dense and BM25 with reciprocal rank fusion, a small open-weights cross-encoder reranker on the top 50 candidates, and an eval set built from real user queries and corrected answers. The candidate count and the reranker size would be tuned by measurement. The frontier APIs would be in reserve for the case where the open stack hit a quality ceiling I could measure.&lt;/p&gt;

&lt;p&gt;That stack is unglamorous. It is also the stack that production teams have converged on by 2026, because it works and because the trade-offs are honest. The interesting work in retrieval is no longer at the embedding model. It is at the chunker, where the unit of retrieval gets decided, and at the reranker, where the order gets fixed. The same chunking discipline I covered in &lt;a href="https://dev.to/blog/rag-chunking-strategies-production-2026"&gt;RAG chunking strategies in production&lt;/a&gt; is the layer above this one, and the two layers together are most of what determines whether a RAG system is good or just demoable.&lt;/p&gt;

&lt;p&gt;If your retrieval is producing the right kind of answer at the wrong rank, the fix is a reranker. If it is failing to find documents that contain the exact phrase the user typed, the fix is hybrid search. If it is finding the wrong documents entirely, the fix is the chunker or the embedding model, in that order. The patterns are mostly known. The work is in measuring carefully and resisting the urge to swap models when the actual problem is one layer up or one layer down.&lt;/p&gt;

&lt;p&gt;The pipeline that ships in 2026 and still works in 2027 is the one with an eval set that grows when production surfaces a new failure class, a chunker that respects document structure, an embedding model picked on data and not on benchmarks, hybrid retrieval as a default, and a small fast reranker that earns its latency. None of that is novel. All of it is the thing that turns a retrieval demo into a retrieval product, and most teams are still one or two of these layers short of where they need to be.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>devtools</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Small Language Models In Production 2026: Where SLMs Beat Frontier Models, And Where They Quietly Fail</title>
      <dc:creator>Alex Cloudstar</dc:creator>
      <pubDate>Tue, 05 May 2026 09:58:59 +0000</pubDate>
      <link>https://dev.to/alexcloudstar/small-language-models-in-production-2026-where-slms-beat-frontier-models-and-where-they-quietly-3kn5</link>
      <guid>https://dev.to/alexcloudstar/small-language-models-in-production-2026-where-slms-beat-frontier-models-and-where-they-quietly-3kn5</guid>
      <description>&lt;p&gt;The first time I replaced a frontier model with a small one in production, the cost graph dropped by ninety percent and the on-call channel got quieter. The first time I tried to do that and broke the product, the cost graph also dropped by ninety percent, but the user complaints climbed in a way the dashboard did not catch for two days. Both runs taught me the same thing from opposite directions: small language models are a real production lever, and the lever does not move the same way for every task. The teams I trust have spent 2025 and into 2026 figuring out which tasks bend nicely under a small model and which tasks break the moment you try to save a dollar.&lt;/p&gt;

&lt;p&gt;By small language model I mean roughly the 1B to 30B parameter range. Phi-4 size, Llama 3 8B size, Qwen 2.5 7B size. Models that fit on a single consumer or low-tier datacenter GPU, run at low latency without exotic infrastructure, and cost an order of magnitude less per token than a frontier model. The capability gap between these and the frontier has narrowed enough that the question is no longer "can a small model do this" for many tasks. The question is "is the gap small enough to matter for your specific use case." That is a different question, and answering it well is what this post is about.&lt;/p&gt;

&lt;p&gt;This is what has worked, what has not, and what I would consider before swapping any frontier call for a smaller one in 2026.&lt;/p&gt;

&lt;h2&gt;
  
  
  What An SLM Actually Is, In Production Terms
&lt;/h2&gt;

&lt;p&gt;The interesting line is not parameter count, it is deployment shape. A small language model is a model you can serve yourself, on infrastructure you control, at predictable latency and cost. A frontier model is one you call over an API, at the API's latency and pricing. The capability gap between the two is the headline. The deployment gap is where the actual product implications live.&lt;/p&gt;

&lt;p&gt;When the model is yours to host, you control the latency. You control the rate limits. You control whether the data leaves your network. You can fine-tune. You can quantize. You can colocate the model with the rest of your stack and avoid a network round trip on every call. Those capabilities are not free. You are now responsible for the GPU bill, the deployment, the autoscaling, the monitoring, and the failover. The trade is real and the calculation is rarely the one teams expect when they start.&lt;/p&gt;

&lt;p&gt;When the model is an API, you give up control and you get reliability and capability for the price. The frontier model is run by people whose only job is to run it. Your token cost includes a margin, but it also includes the on-call rotation, the multi-region failover, and the model itself. The trade is paying more per token to do less work yourself, and for many production workloads that is the right trade.&lt;/p&gt;

&lt;p&gt;The production version of "should we use an SLM" is "is this workload high enough volume, low enough complexity, and stable enough in shape that owning the model is cheaper than renting it." If the answer is yes, an SLM is on the table. If the answer is no, the frontier API is almost always still the right call.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where SLMs Beat Frontier Models In 2026
&lt;/h2&gt;

&lt;p&gt;There is a clear class of tasks where a small model, fine-tuned or even just well-prompted, matches or beats a frontier model in production. Knowing the shape of these tasks is the key to picking the right ones to migrate.&lt;/p&gt;

&lt;p&gt;Classification is the most obvious win. Sentiment, intent, topic, language, content moderation, routing decisions. These are tasks where the input is a chunk of text and the output is one of a known set of labels. A 7B model fine-tuned on a few thousand examples typically beats a frontier model on a fixed-label classification task, runs ten times faster, and costs ten times less. The frontier model is doing more work than the task needs. The small model is doing exactly what the task needs.&lt;/p&gt;

&lt;p&gt;Extraction is the next clear win. Pulling structured fields out of unstructured text. Names, dates, amounts, IDs, sentiment per aspect. The same shape as classification but with multiple output fields. Fine-tuned SLMs are very good at this. The benchmark gap between a fine-tuned 8B model and a frontier model on a domain-specific extraction task is often within noise, and the latency and cost gap is enormous.&lt;/p&gt;

&lt;p&gt;Reformatting and rewriting are good targets when the source and target are both in the model's wheelhouse. Convert this prose into bullets. Convert this CSV into JSON. Convert this email into a summary. The task is structurally simple and high volume, and the small model handles it cheaply. The frontier model is overkill.&lt;/p&gt;

&lt;p&gt;Routing decisions inside an agent are a sweet spot. The "which tool should I call" decision can often be made by a small model with a tight prompt, faster and cheaper than asking the frontier model. The same goes for "is this query in scope" or "is this response complete." These are gateway decisions that fire on every request, so the cost savings compound.&lt;/p&gt;

&lt;p&gt;Embedding-adjacent tasks like reranking and similarity scoring are not always SLM tasks in the traditional sense, but small dedicated models in this space have gotten very good. If your retrieval pipeline is calling a frontier model to rerank retrieved chunks, you are leaving money and latency on the table. A small reranker is a better fit and the gap is not capability, it is engineering effort to swap.&lt;/p&gt;

&lt;p&gt;The pattern is that SLMs win on tasks that are narrow, high-volume, and tolerant of fine-tuning. They lose on tasks that are open-ended, low-volume, or that require the kind of broad world knowledge that only a frontier model has internalized.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where SLMs Quietly Fail
&lt;/h2&gt;

&lt;p&gt;The failures are the part that the benchmark tables do not show, because the benchmark tables are testing the tasks the small models are good at. The production failures live in a different shape of task.&lt;/p&gt;

&lt;p&gt;Long-horizon reasoning is the first place SLMs fall apart. Multi-step planning, math with several intermediate steps, code that has to track state across many lines, agentic loops that span more than three or four tool calls. The small model can take any one step. It cannot reliably keep the chain coherent across many of them. By the fifth step, it has lost the plot, and the failure looks like a model that confidently does the wrong thing for reasons that do not match the trace.&lt;/p&gt;

&lt;p&gt;Open-ended generation that has to be on-brand and competent is the second place. Long-form writing where the user expects the same quality as the frontier model. Customer-facing replies in a domain where tone matters. Content where the difference between "good" and "fine" is what the product is selling. A small model can do the work. The output reads like a small model did it, and users notice.&lt;/p&gt;

&lt;p&gt;Anything that requires the model to know things it was not fine-tuned on. The frontier models have absorbed a huge slice of public knowledge in their pretraining. A 7B model has absorbed a smaller slice. Tasks that require recall of facts, especially current ones, are tasks where the SLM will hallucinate or answer in vague generalities while the frontier model gets it right. The gap closes for domains you fine-tune on. It widens for everything else.&lt;/p&gt;

&lt;p&gt;Edge cases in classification. The 8B model is great at the ninety-five percent of inputs it has seen variants of in training. It is mediocre on the long-tail five percent. The frontier model is great on both. If your application sees a fat tail of weird inputs, the SLM will quietly misclassify the weird ones, and you will not notice until the metric for "how often the user clicked the wrong-result-feedback button" creeps up.&lt;/p&gt;

&lt;p&gt;Reasoning over long context. The small model has a smaller working memory in practice, even when its advertised context window is large. Document QA over a fifty-page contract is a task where the frontier model still wins, because the small model loses focus partway through and starts answering from a few salient chunks instead of the whole document. The same task on a one-page input is fine. The threshold is real and worth measuring on your specific workload.&lt;/p&gt;

&lt;p&gt;The failure mode that is hardest to catch is the slow drift. The SLM works on the launch dataset and degrades on the data that comes in three months later, because the data distribution shifted and the model was fine-tuned on the old shape. The frontier model is more robust to this kind of drift because its pretraining was broader. The SLM needs to be retrained or refreshed. If you do not have the pipeline to do that, you have a model whose quality drops slowly and whose problems show up in user complaints, not in your eval suite.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Routing Pattern: Use Both
&lt;/h2&gt;

&lt;p&gt;The teams that are getting the most out of SLMs in 2026 are not picking SLM or frontier. They are routing between them based on the task. The pattern is roughly:&lt;/p&gt;

&lt;p&gt;A cheap, fast classifier or rule-based router takes the incoming request and decides whether it is a task an SLM can handle or one that needs a frontier model. Easy classification or extraction goes to the SLM. Open-ended, multi-step, or out-of-domain requests go to the frontier model. The router itself is often a small model, because deciding "is this complex" is a classification task in itself.&lt;/p&gt;

&lt;p&gt;For requests that go to the SLM, you get the fast, cheap path. For requests that go to the frontier model, you pay for capability. The blended cost across the workload is dramatically lower than running everything through the frontier, and the quality on the hard requests is the same as it would have been without the router.&lt;/p&gt;
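
&lt;p&gt;The shape of that router, with &lt;code&gt;classify_request&lt;/code&gt;, &lt;code&gt;call_slm&lt;/code&gt;, and &lt;code&gt;call_frontier&lt;/code&gt; as hypothetical stand-ins for your own classifier and model clients, is roughly this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Route by task shape: narrow, high-volume work goes to the SLM, everything else
# goes to the frontier model. The names below are stand-ins, not a real API.
SLM_TASKS = {"classify", "extract", "reformat", "route"}

def handle_request(request, classify_request, call_slm, call_frontier):
    task = classify_request(request)    # small model or rules; returns a task label
    if task in SLM_TASKS:
        response = call_slm(request)
        if response.ok:                 # cheap validity check on the SLM output
            return response
    # Out-of-domain, multi-step, or a failed SLM response: pay for the frontier model.
    return call_frontier(request)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
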

&lt;p&gt;The pattern that I covered in &lt;a href="https://dev.to/blog/llm-router-model-routing-fallbacks-2026"&gt;the LLM router and model routing patterns post&lt;/a&gt; is the same shape. Routing by task, with a fallback path, is the production architecture that has won. Single-model architectures are now the exception, not the default, in any system that is cost-sensitive at all.&lt;/p&gt;

&lt;p&gt;The trick to making the router work is to be honest about what each model can do, and to monitor the rate at which the router sends things to the wrong path. A router that sends ten percent of frontier-needing requests to the SLM is producing bad outputs on those requests, and the user does not know that the model decision was the cause. Instrument the router. Sample the SLM responses for human review. Be willing to tighten the router as you learn the shape of the wrong-path failures.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fine-Tuning Is The Multiplier
&lt;/h2&gt;

&lt;p&gt;A small model out of the box is okay. A small model fine-tuned on your task is often as good as a frontier model on that task. The discipline of fine-tuning is what unlocks most of the SLM win, and it is also the part that teams underspend on because it requires data, infrastructure, and a willingness to maintain a training pipeline.&lt;/p&gt;

&lt;p&gt;The data piece is the hardest. You need labeled examples, in your domain, in the shape the model needs to produce. Some of that data you have. Some of it you have to generate or label. The frontier model is your best tool for generating training data: prompt it carefully, generate examples, validate a sample by hand, and use the rest to fine-tune the small model. This is the loop that makes fine-tuning practical: the frontier model trains the small model, the small model serves production, and the frontier model handles the long tail of requests the small model cannot.&lt;/p&gt;

&lt;p&gt;The infrastructure piece is now solvable with managed services. The bar to fine-tune a 7B or 13B model has dropped enough that a single engineer can run the loop in a week. LoRA-style adapters mean you do not have to host a separate full model per fine-tune; you host the base model and swap adapters per task. That is a real architectural advantage that did not exist as cleanly two years ago.&lt;/p&gt;

&lt;p&gt;The willingness piece is harder than the technical pieces. Fine-tuning is not a one-time job. The model needs to be retrained as the data drifts, as the task evolves, as new edge cases come in. The team has to own that pipeline, and the pipeline has to be on a schedule, with monitoring, with a rollback story. Without that, the fine-tuned model is a snapshot that gets stale, and the staleness shows up in production. The same maintenance discipline I covered in &lt;a href="https://dev.to/blog/llm-fine-tuning-developer-guide-2026"&gt;the LLM fine-tuning developer guide&lt;/a&gt; applies, and the teams that take it seriously are the ones who get sustained wins from SLMs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Latency: The Quiet Reason To Switch
&lt;/h2&gt;

&lt;p&gt;The cost win is the headline. The latency win is the one that changes the product. A small model running on a colocated GPU answers in tens of milliseconds for short prompts. A frontier API call is hundreds of milliseconds at best, sometimes more under load, with a long tail that is meaningfully worse. The difference is not in the marketing copy. It is in the user experience.&lt;/p&gt;

&lt;p&gt;For interactive features where the model is on the critical path of a user action, a sub-100ms response feels like an interaction, and a 500ms response feels like a wait. The same feature with a small model can be enabled in places where a frontier model could not. Autocomplete. Inline suggestions. Real-time classification. These are features that exist or do not based on the latency budget.&lt;/p&gt;

&lt;p&gt;For batch and background workflows, the latency difference matters less, but throughput differences are large. A self-hosted small model can run hundreds of concurrent requests on one GPU. The frontier API has rate limits. For high-volume offline work, the SLM throughput advantage compounds with the cost advantage and produces savings that are hard to ignore.&lt;/p&gt;

&lt;p&gt;The latency story has a wrinkle: cold starts. A self-hosted model on autoscaling infrastructure has cold starts, and a cold start on a 13B model loading into GPU memory is not trivial. The pattern is to keep at least one warm replica per region and to be careful about scaling-to-zero on user-facing paths. The cost of one warm replica is small. The cost of a thirty-second cold start in front of a user is large.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost: The Math Is Different Than You Think
&lt;/h2&gt;

&lt;p&gt;The naive cost comparison is per-token API price versus GPU hourly cost divided by tokens served. That math is right but incomplete. The full picture includes the engineering time to ship and maintain the SLM stack, the cost of the fine-tuning pipeline, the cost of the eval and monitoring infrastructure, and the cost of the inevitable migration when the base model gets superseded by a better one.&lt;/p&gt;

&lt;p&gt;For low-volume workloads, the frontier API wins on total cost. The fixed costs of running your own model are larger than the per-token savings until the volume is high enough. The crossover point varies by workload, but for most teams it is somewhere north of a million tokens per day on a sustained basis. Below that, paying the API is the right call, and the engineering effort is better spent elsewhere.&lt;/p&gt;
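
&lt;p&gt;The back-of-the-envelope version of that crossover looks like this, with every number an assumption to be replaced by your own contract prices and GPU costs:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Rough break-even: at what sustained daily volume does a dedicated GPU beat the API?
api_price_per_million_tokens = 10.00   # dollars, blended frontier price (assumed)
gpu_cost_per_day = 24 * 0.60           # one always-on GPU at $0.60/hour (assumed)

break_even_tokens = gpu_cost_per_day / api_price_per_million_tokens * 1_000_000
print(f"Break-even around {break_even_tokens:,.0f} tokens per day")
# Roughly 1.4 million tokens/day under these assumptions. Engineering time,
# fine-tuning, evals, and monitoring are not in this math and push the real
# crossover higher.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
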

&lt;p&gt;For high-volume workloads, the SLM stack starts winning, and the win compounds with each layer of optimization: quantization, batching, KV caching, request scheduling. By the time you are running real volume on dedicated hardware, the per-token cost is a fraction of the API price, and the question is whether you have the engineering bandwidth to keep that stack running well.&lt;/p&gt;

&lt;p&gt;The hidden cost is the migration cost when the base model improves. The Llama 3 fine-tune you shipped last year is now behind a Llama 4 base model on the same task. Migrating means retraining, re-evaluating, redeploying. That is a quarter of work, not a sprint. Build the pipeline so that the migration is as automated as you can make it, because there will be more of them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Would Build Today
&lt;/h2&gt;

&lt;p&gt;If I were starting a new AI product in 2026, I would default to a frontier API for v0. The capability gap is large enough at the start that owning the model is a distraction from product work. The cost will not matter at v0 volume. Ship the product, get users, learn what the workload actually looks like.&lt;/p&gt;

&lt;p&gt;After v0, I would profile the workload by request type. The high-volume, narrow tasks are the ones to migrate first. Classification, extraction, simple reformatting, routing decisions. These are the tasks where an SLM is reliably as good or better, and where the per-request savings compound to real money.&lt;/p&gt;

&lt;p&gt;I would keep the frontier model in the loop for the long tail. Open-ended generation, complex reasoning, multi-step agent flows, anything where the SLM is not yet matching the bar. Route by request shape. Be honest about which tasks are which. Update the routing as the SLMs get better, because they will.&lt;/p&gt;

&lt;p&gt;I would invest in the fine-tuning pipeline early once the migration starts paying off. The pipeline is the multiplier. Without it, the SLM is mediocre and the team gets discouraged. With it, the SLM is competitive and the cost and latency wins are real.&lt;/p&gt;

&lt;p&gt;The other thing I would invest in early is the monitoring and rollback story. SLMs fail differently from frontier models. The failure modes are subtler. The eval suite has to catch them. The rollback path has to exist. The same observability discipline I covered in &lt;a href="https://dev.to/blog/ai-agent-observability-debugging-production-2026"&gt;AI agent observability and debugging in production&lt;/a&gt; applies double, because the SLM is a model you own and the responsibility for its quality is yours.&lt;/p&gt;

&lt;p&gt;The frame that has held up across a year of running this is that SLMs are a tool, not a strategy. The strategy is "use the right model for the task." The SLM is one of the models. The frontier is another. The router is the part of the system that knows which is which, and the team's job is to keep the router honest and the SLMs sharp. The teams that did that in 2025 are the teams whose AI features are profitable in 2026. The teams that did not are the teams whose AI line item is the largest one on the cloud bill, and who are now scrambling to migrate under deadline pressure.&lt;/p&gt;

&lt;p&gt;If your AI product is a single API call to a frontier model on every request, the next quarter's work is probably about replacing some of those calls with smaller models you own. The capability has caught up enough to make it worth doing. The patterns are clear enough to make it doable. The hard part is being honest about which tasks are SLM tasks and which are not, and that honesty is the work that does not show up in the model card.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>devtools</category>
      <category>productivity</category>
    </item>
    <item>
      <title>AI Guardrails And Output Validation In Production 2026: What Actually Catches Bad Outputs Before Users Do</title>
      <dc:creator>Alex Cloudstar</dc:creator>
      <pubDate>Tue, 05 May 2026 09:58:58 +0000</pubDate>
      <link>https://dev.to/alexcloudstar/ai-guardrails-and-output-validation-in-production-2026-what-actually-catches-bad-outputs-before-30o4</link>
      <guid>https://dev.to/alexcloudstar/ai-guardrails-and-output-validation-in-production-2026-what-actually-catches-bad-outputs-before-30o4</guid>
      <description>&lt;p&gt;The first time I shipped an LLM feature with no guardrails, it took eleven days for a user to get the model to recommend a competitor's product inside our own onboarding flow. The screenshot ended up in a Slack channel with about four hundred people in it, and the conversation that followed was the kind that ends with "we need to fix this by Monday." The fix took two weeks. The lesson took longer. I had assumed the model would behave because the prompt told it to behave. The model behaved exactly as well as the prompt could be relied on, which turned out to be not very well at all.&lt;/p&gt;

&lt;p&gt;That was almost two years ago, and I have been chasing the same class of bug ever since. Different products, different prompts, same shape: a model produces an output that looks fine to the model and is wrong for the product. The output ships because nothing in the pipeline was watching for it. The user finds it before the team does. By the time anyone looks at the trace, the screenshot is on Twitter. By 2026 enough teams have hit this wall that the patterns for not hitting it have stabilized. The patterns are not glamorous. They are mostly about adding cheap checks in the right places and being honest about what the model can and cannot be trusted to do unsupervised.&lt;/p&gt;

&lt;p&gt;This is what I have seen work, what I have seen fail, and what I would build into any serious LLM product before it sees a real user.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Guardrails Actually Are
&lt;/h2&gt;

&lt;p&gt;A guardrail is anything between the model output and the user that can reject, rewrite, or flag the output. That is the whole definition. The fancy framing is "constitutional AI" or "policy enforcement" or "alignment layer." The boring framing is "code that runs on the model's response and decides whether to use it." Both framings describe the same thing. The boring one is more useful when you are trying to ship.&lt;/p&gt;

&lt;p&gt;Guardrails are not the same as prompts. Prompts try to influence the model. Guardrails check what came out. The two work together, but they fail in different ways. A prompt that tells the model to never recommend a competitor will work most of the time and fail occasionally. A guardrail that scans the output for competitor names and rejects the response will fail in different ways, mostly false positives. Stacking the two gives you a system where the prompt makes the model behave for free in the easy cases, and the guardrail catches the residual badness in the hard cases. Either layer alone is not enough. Both layers together are what production looks like.&lt;/p&gt;

&lt;p&gt;The other thing guardrails are not is an excuse to stop thinking about the prompt. I have seen teams ship a wall of validators around a prompt that was doing nothing useful, and the result was a system that rejected fifteen percent of model outputs and shipped slop on the other eighty-five. Validators that fire too often are a sign that the prompt or the model is wrong, not that the validators are working. The right ratio is that the model is doing most of the job, and the guardrails are catching the long tail.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer Your Checks: Cheap First, Expensive Last
&lt;/h2&gt;

&lt;p&gt;The single most important architectural choice in a guardrail layer is the order of the checks. The right order is cheap and deterministic first, expensive and probabilistic last. The wrong order is what most teams ship in v1, which is to call another LLM to judge the first LLM's output, then layer regex on top, then notice that the LLM judge costs more than the original generation.&lt;/p&gt;

&lt;p&gt;The cheap checks are the ones that should run first because they catch the most common failures and they cost nothing. Schema validation. Length checks. Forbidden-phrase regex. PII scanners. URL validation. JSON parse. These are deterministic, run in milliseconds, and catch the bulk of obvious failures. If the model returned malformed JSON, you do not need an LLM to tell you that. You need a JSON parser.&lt;/p&gt;

&lt;p&gt;The medium checks come next. Embedding similarity to a deny list. Toxicity classifiers. Language detection. Domain-specific validators that need a small model or a database lookup. These cost more than regex but less than another LLM call, and they catch a different class of failures: outputs that are technically valid but semantically wrong.&lt;/p&gt;

&lt;p&gt;The expensive checks come last, and only when needed. LLM-as-judge. Long-context policy classifiers. Multi-step reasoning checks. These are the ones that catch the failures the cheap layers cannot, and they cost real money and add real latency. The discipline is to invoke them only when the cheap layers are clean and the stakes are high enough to warrant the cost. Calling an LLM judge on every response is a tax on every interaction. Calling it on the five percent of responses that pass everything else but might still be off-policy is a different economics entirely.&lt;/p&gt;

&lt;p&gt;The pattern that has worked for me is to short-circuit. If the cheap layer rejects, do not run the expensive layer. The output is going to be regenerated or rejected anyway. There is no point spending tokens to confirm what you already know. This sounds obvious and is the single most common waste I see in production guardrail stacks: every layer runs on every response, regardless of whether earlier layers already failed.&lt;/p&gt;
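
&lt;p&gt;The short-circuiting shape is small enough to sketch; each check here is a placeholder for whatever your product needs at that layer:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from dataclasses import dataclass

@dataclass
class Verdict:
    passed: bool
    reason: str = ""

def run_guardrails(output, checks):
    # `checks` is ordered cheapest to most expensive: JSON parse, schema, regex,
    # PII scan, classifiers, and only then an LLM judge.
    for check in checks:
        verdict = check(output)
        if not verdict.passed:
            return verdict          # short-circuit: do not pay for the later layers
    return Verdict(passed=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
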

&lt;h2&gt;
  
  
  Schema Validation Is The Workhorse
&lt;/h2&gt;

&lt;p&gt;If your model is returning structured output, the schema validator is the most important guardrail you have. Tighten the schema and you eliminate entire categories of bugs without writing any other validation code. The same rigor I covered in the &lt;a href="https://dev.to/blog/structured-outputs-llm-developer-guide-2026"&gt;structured outputs developer guide&lt;/a&gt; applies double in a guardrail context: every type, format, enum, and constraint is a check that runs for free.&lt;/p&gt;

&lt;p&gt;Use enums when the field has a fixed set of valid values. Use string formats for emails, URLs, dates, UUIDs. Use min and max for numeric ranges and string lengths. Use patterns for IDs that follow a known shape. Use required and additionalProperties: false to forbid the model from inventing extra fields. Each of these is a guardrail, and each of them runs at zero cost.&lt;/p&gt;

&lt;p&gt;The pattern that punches above its weight is custom validators on top of the schema. JSON Schema cannot express "this URL must be on our domain" or "this product ID must exist in our database." A custom validator can. Layer custom validators on top of the schema and you get a contract that catches both syntactic errors (handled by the schema) and semantic errors (handled by the custom validator).&lt;/p&gt;
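
&lt;p&gt;A minimal sketch of that layering with the &lt;code&gt;jsonschema&lt;/code&gt; library; the field names and the domain rule are hypothetical:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from jsonschema import Draft202012Validator

SCHEMA = {
    "type": "object",
    "properties": {
        "status": {"enum": ["resolved", "escalated", "pending"]},
        "summary": {"type": "string", "maxLength": 500},
        "link": {"type": "string", "format": "uri"},
    },
    "required": ["status", "summary"],
    "additionalProperties": False,
}
schema_validator = Draft202012Validator(SCHEMA)

def validate_output(data: dict) -&gt; list[str]:
    errors = [e.message for e in schema_validator.iter_errors(data)]  # syntactic layer
    link = data.get("link", "")
    if link and not link.startswith("https://docs.example.com/"):     # semantic layer
        errors.append("link must point at our documentation domain")
    return errors  # empty list means the output passed both layers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
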

&lt;p&gt;The trap to avoid is making the schema so loose that the validator is doing all the work. If your schema accepts any string for a field that should be one of four enum values, you have moved the contract from the cheap layer to the expensive layer. Push the constraints down to the schema whenever you can. The schema is the cheapest validator you have.&lt;/p&gt;

&lt;h2&gt;
  
  
  Policy Checks Are Where Brand Lives
&lt;/h2&gt;

&lt;p&gt;Policy checks are guardrails that enforce things specific to your product, your brand, and your user agreement. These are the checks that nobody else can write for you, because they are about what your company has decided is acceptable. The model does not know your competitor list. The model does not know which topics are off-limits because of regulatory constraints. The model does not know that your product never makes promises about future features. You have to tell it, and you have to verify.&lt;/p&gt;

&lt;p&gt;The pattern that works is a small list of specific, concrete policies, each backed by a deterministic check. "Never mention competitors X, Y, Z by name." "Never claim the product can do something not in this list." "Never produce output longer than 500 words for a chat response." Each of these can be checked in a few lines of code. The collection of them is the brand layer.&lt;/p&gt;
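
&lt;p&gt;Each of those checks really is a few lines; the competitor list and the length cap below are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

COMPETITORS = ["AcmeAI", "RivalCorp"]   # placeholder names
MAX_CHAT_WORDS = 500

def check_policies(text: str) -&gt; list[str]:
    violations = []
    for name in COMPETITORS:
        if re.search(rf"\b{re.escape(name)}\b", text, flags=re.IGNORECASE):
            violations.append(f"mentions competitor: {name}")
    if len(text.split()) &gt; MAX_CHAT_WORDS:
        violations.append("chat response exceeds length limit")
    return violations  # empty list means the output is on-policy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
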

&lt;p&gt;Avoid policies that require interpretation. "Be helpful" is not a policy. "Be on-brand" is not a policy. These are aspirations that the prompt can chase, but they are not things a guardrail can enforce, because there is no clean check for them. A guardrail is a binary: does the output pass or not. If you cannot write the check, it is not a guardrail.&lt;/p&gt;

&lt;p&gt;The policy I keep relearning to write is the recovery policy. When a policy check fails, what does the system do? Regenerate? Return a canned message? Escalate to a human? Different policies need different responses. A length violation can usually be fixed by regenerating with a tighter prompt. A competitor mention probably needs a regeneration with an explicit instruction to avoid the mention. A regulatory violation might need a hard fallback to a safe canned response. The policy and the recovery are both part of the guardrail.&lt;/p&gt;

&lt;h2&gt;
  
  
  Output Sanitization For UI Safety
&lt;/h2&gt;

&lt;p&gt;If the model output is going to be rendered in a browser, the guardrail layer is also responsible for making sure the output cannot break the UI. This is the part that the security team will care about and that the product team will forget. Both groups are partly right, because the failure modes are different.&lt;/p&gt;

&lt;p&gt;Strip or escape any HTML the model produces, unless the product specifically allows it. Markdown is usually safe to render through a known-good parser. Raw HTML from an LLM is not safe to render, ever, because the model can be coaxed into producing script tags and event handlers that the user did not ask for. The guardrail here is to either strip HTML before rendering, or render through a sanitizer like DOMPurify with a strict allowlist. The same logic that protects against XSS in user-submitted content protects against prompt-injected XSS in model output.&lt;/p&gt;

&lt;p&gt;Validate URLs before rendering them as links. The model can produce URLs that look fine and point to malicious domains, especially in retrieval-augmented systems where the model is mixing user content with external sources. The check is cheap: parse the URL, check the domain against an allowlist or a denylist, reject if it does not match. This is the same problem I covered in &lt;a href="https://dev.to/blog/prompt-injection-defense-app-developers-2026"&gt;prompt injection defense for app developers&lt;/a&gt;, and the guardrail pattern is the same: trust nothing from the model, sanitize at the boundary.&lt;/p&gt;
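
&lt;p&gt;The URL check fits in a few lines, assuming an allowlist of domains the product is willing to link out to:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from urllib.parse import urlparse

ALLOWED_LINK_DOMAINS = {"example.com", "docs.example.com"}  # placeholder allowlist

def url_is_safe(url: str) -&gt; bool:
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    host = (parsed.hostname or "").lower()
    return host in ALLOWED_LINK_DOMAINS

# Strip or neutralize any link that fails this check before the output reaches the renderer.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
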

&lt;p&gt;Strip metadata that could leak system internals. Stack traces, file paths, internal IDs, debug strings. The model picks these up from the prompt or the context and can echo them back in the output. The guardrail layer is the place to scrub them, because the prompt cannot reliably suppress them and the user does not need to see them.&lt;/p&gt;

&lt;h2&gt;
  
  
  PII And Sensitive Data: The Boring Critical Layer
&lt;/h2&gt;

&lt;p&gt;If your product handles user data, the guardrail layer is responsible for not leaking it. This is the part that compliance will ask about, and it is also the part that most teams underspend on because it is unglamorous and rarely shows up in the demo.&lt;/p&gt;

&lt;p&gt;The pattern that has worked is to run a PII detector on every model output before it ships, and to log every detection. Not every detection is a leak. Sometimes the user explicitly asked the model to repeat their email address. The point of the detector is not to block, it is to flag and log so you can audit the rate. If the detector starts firing more often, something has changed in the system: the prompt, the retrieval, the user behavior. The metric matters.&lt;/p&gt;

&lt;p&gt;For outputs that should never contain PII, the detector is a hard guardrail. Block the output, log the trace, and either regenerate or fall back to a safe response. For outputs where PII is allowed, the detector is a soft guardrail: log the detection, optionally redact, but do not block.&lt;/p&gt;
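
&lt;p&gt;In sketch form, assuming a regex-based detector (real systems usually use a library or a small model, and these patterns are deliberately incomplete):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;const PII_PATTERNS: Record&amp;lt;string, RegExp&amp;gt; = {
  email: /[\w.+-]+@[\w-]+\.[\w.]+/,
  usPhone: /\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b/,
};

function detectPii(output: string): string[] {
  const kinds: string[] = [];
  for (const [kind, pattern] of Object.entries(PII_PATTERNS)) {
    if (pattern.test(output)) kinds.push(kind);
  }
  return kinds;
}

// Hard mode blocks the output; soft mode logs the detection and ships.
function applyPiiGuardrail(output: string, mode: "hard" | "soft"): string | null {
  const kinds = detectPii(output);
  if (kinds.length &amp;gt; 0) {
    console.warn("pii_detected", { kinds }); // log every detection, always
    if (mode === "hard") return null; // caller falls back to a safe response
  }
  return output;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;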

&lt;p&gt;The other piece is to scrub PII from the input to the model in the first place. The model cannot leak data it never saw. If you are sending logs, error messages, or third-party content into the prompt, run a PII scrubber on the input. The scrubber is cheaper than the apology email.&lt;/p&gt;

&lt;h2&gt;
  
  
  LLM-As-Judge: Use It, But Not As The Whole Stack
&lt;/h2&gt;

&lt;p&gt;LLM-as-judge is the technique of using a second model call to evaluate the first model's output against a rubric. It works. It is also expensive, slow, and probabilistic. The mistake is treating it as the entire guardrail stack. The right framing is that it is the layer that catches what the cheap layers miss, and it should run on a fraction of the traffic.&lt;/p&gt;

&lt;p&gt;The cases where LLM-as-judge earns its cost are the ones where the rubric is too nuanced for a regex or a classifier. "Is this response on-topic for a customer support context?" is not something a regex can answer. A small judge model with a tight rubric can. "Does this response match the tone the brand uses in our existing content?" is similar. The judge does the work the deterministic layers cannot.&lt;/p&gt;

&lt;p&gt;The pattern is to keep the judge prompt tight. Long judge prompts produce vague judgments. A rubric of three to five concrete criteria, each scored independently, produces consistent results. A rubric of "evaluate whether this response is good" produces noise. Treat the judge prompt with the same discipline you would treat a tool description: be specific about what to check, what counts as failing, and what to return.&lt;/p&gt;
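
&lt;p&gt;A sketch of what a tight rubric looks like in practice. The criteria, the prompt wording, and &lt;code&gt;callJudgeModel&lt;/code&gt; are all placeholders for your own:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;declare function callJudgeModel(rubric: string, response: string): Promise&amp;lt;string&amp;gt;;

const JUDGE_RUBRIC = `You are grading a customer-support response.
Score each criterion independently as true or false:
1. on_topic: the response addresses the customer's question.
2. no_speculation: the response does not invent product capabilities.
3. tone: the response is polite and professional.
Return only JSON: {"on_topic": boolean, "no_speculation": boolean, "tone": boolean}`;

interface JudgeVerdict {
  on_topic: boolean;
  no_speculation: boolean;
  tone: boolean;
}

async function judge(response: string): Promise&amp;lt;JudgeVerdict&amp;gt; {
  const raw = await callJudgeModel(JUDGE_RUBRIC, response);
  return JSON.parse(raw) as JudgeVerdict; // validate this in production
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;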

&lt;p&gt;The other discipline is to validate the judge. Sample the judge's outputs, have a human review them, measure the agreement rate. A judge that disagrees with humans more than ten percent of the time is a judge that is going to ship false positives or false negatives in production. The same evals discipline I covered in &lt;a href="https://dev.to/blog/ai-evals-solo-developers-2026"&gt;AI evals for solo developers&lt;/a&gt; applies to the judge itself. Without that, you have a guardrail you cannot trust, which is worse than no guardrail at all because it gives the team false confidence.&lt;/p&gt;

&lt;h2&gt;
  
  
  What To Do When A Guardrail Fires
&lt;/h2&gt;

&lt;p&gt;The hardest part of guardrail design is what happens after the check fails. The naive answer is "block the output and show an error." The better answer depends on the failure type, the user context, and the cost of the wrong recovery. There is no single right policy, but there are clear patterns.&lt;/p&gt;

&lt;p&gt;For schema and format failures, regenerate with a stricter prompt. The model produced bad JSON because it did not internalize the schema. Telling it the response was rejected and re-asking with the schema restated usually works. Cap the retries at two or three. After that, fall back to a safe response or escalate.&lt;/p&gt;
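
&lt;p&gt;A sketch of that loop with Zod, where &lt;code&gt;generate&lt;/code&gt; stands in for your model call and the reminder wording is illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { z } from "zod";

declare function generate(prompt: string): Promise&amp;lt;string&amp;gt;;

const TicketSchema = z.object({
  title: z.string().max(120),
  severity: z.enum(["low", "medium", "high"]),
});

function tryJson(raw: string): unknown {
  try { return JSON.parse(raw); } catch { return null; }
}

async function generateTicket(prompt: string, maxRetries = 2) {
  let currentPrompt = prompt;
  for (let attempt = 0; attempt &amp;lt;= maxRetries; attempt++) {
    const raw = await generate(currentPrompt);
    const parsed = TicketSchema.safeParse(tryJson(raw));
    if (parsed.success) return parsed.data;
    // re-ask with the schema restated and the rejection named
    currentPrompt = `${prompt}\n\nYour previous response was rejected ` +
      `(${parsed.error.message}). Return ONLY JSON matching the schema.`;
  }
  return null; // fall back to a safe response or escalate
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;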

&lt;p&gt;For policy violations, do not regenerate without changing the prompt. Adding "do not mention competitors" to a regenerate prompt that already had that instruction is unlikely to help. Either rewrite the prompt to be more explicit, or fall back to a canned response, or block. Regenerating with the same instructions is a way to spin tokens without fixing the issue.&lt;/p&gt;

&lt;p&gt;For PII and security violations, fall back hard. Do not regenerate. Do not try to redact and ship. Return a safe response and log the trace for review. The cost of a leaked PII string is higher than the cost of a clipped response. The recovery is to fail safe, every time.&lt;/p&gt;

&lt;p&gt;For judge rejections, the right move depends on the judge confidence. A high-confidence rejection is treated like a policy violation. A low-confidence rejection might be worth regenerating, since the judge itself is uncertain. The pattern is to thread the confidence score through the recovery decision. A binary judge produces binary recoveries. A judge with a confidence score lets you tune the response.&lt;/p&gt;

&lt;p&gt;The thing I keep relearning is to make the recovery visible. Log every fire. Log every recovery. Every guardrail fire is a signal that the prompt or the system is drifting, and the rate over time is the metric that catches drift before users do. Without logging, the guardrails are silent until something goes badly wrong. With logging, they are a continuous quality signal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost And Latency: The Tax You Cannot Skip
&lt;/h2&gt;

&lt;p&gt;A full guardrail stack adds latency and cost. The cheap layers add milliseconds. The medium layers add tens to hundreds of milliseconds. The expensive layers add a second or more and a real fraction of the original generation cost. The temptation is to skip the expensive layers to keep the user experience snappy. The right discipline is to be honest about the trade and tune by use case.&lt;/p&gt;

&lt;p&gt;For chat interfaces where the user is watching the response stream, you can run the cheap layers synchronously and the expensive ones asynchronously. The user sees the response immediately. The expensive checks run in the background, and if they fail, you log and either retract (rare) or correct (more common) in a follow-up. The pattern is similar to how production teams handle generative UI streaming, where the visible response is fast and the validation runs alongside.&lt;/p&gt;

&lt;p&gt;For server-to-server flows where there is no user staring at a spinner, run the full stack synchronously and accept the latency. The benefit is determinism and the cost is response time, and that trade is usually right when there is no user-perceived latency.&lt;/p&gt;

&lt;p&gt;The cost piece is similar. Cheap layers are basically free per request. Medium layers cost cents per thousand requests. Expensive layers cost dollars per thousand requests if you call them on every response. The lever is to call the expensive layers on a fraction of traffic, prioritized by risk. High-stakes flows get the full stack. Low-stakes flows get the cheap layers and the brand checks. The decision is a product call, not an engineering call. The same cost discipline I covered in &lt;a href="https://dev.to/blog/llm-cost-optimization-production-2026"&gt;LLM cost optimization in production&lt;/a&gt; applies here, because guardrails are a real line item, not a free addition.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building For Drift, Not For Launch
&lt;/h2&gt;

&lt;p&gt;The last lesson, and the one teams keep relearning, is that the guardrail stack you ship at launch is not the stack you will run six months later. Models change. Prompts change. User behavior changes. Threats change. The stack has to evolve with all of it.&lt;/p&gt;

&lt;p&gt;Treat the guardrail layer as code that gets shipped on its own cadence, with its own tests, with its own metrics. Per-rule fire rate. False positive rate sampled by humans. Latency per layer. Cost per layer. These are the metrics that tell you whether the stack is doing its job and whether any one rule has started to misbehave. A rule that suddenly fires twice as often this week is a signal. A rule that has not fired in six months might be a rule you can retire.&lt;/p&gt;

&lt;p&gt;Build a way to add new rules quickly. The next bad output is going to surface a class of failure your stack does not catch. The team that can ship a new rule the same day is the team that recovers from incidents in hours. The team that cannot is the team that writes hotfixes in branches and ships them to production a week later. The architecture is not complicated. It is a registry of validators, a config for which ones run on which routes, and a deployment path that does not require a full release. That registry is worth the engineering investment because it is the part of the system that gets used every time something goes wrong.&lt;/p&gt;
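
&lt;p&gt;The registry does not need to be fancy. A sketch, with the route names and validator IDs as examples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;type Validator = (output: string) =&amp;gt; boolean | Promise&amp;lt;boolean&amp;gt;;

const registry = new Map&amp;lt;string, Validator&amp;gt;();
registry.set("no_competitors", (o) =&amp;gt; !o.includes("AcmeAI"));
registry.set("max_length", (o) =&amp;gt; o.length &amp;lt; 4000);

// Which validators run on which routes lives in config, not code,
// so a new rule can ship without a full release.
const routeConfig: Record&amp;lt;string, string[]&amp;gt; = {
  "/chat": ["no_competitors", "max_length"],
  "/extract": ["max_length"],
};

async function runGuardrails(route: string, output: string): Promise&amp;lt;string[]&amp;gt; {
  const failures: string[] = [];
  for (const id of routeConfig[route] ?? []) {
    const validator = registry.get(id);
    if (validator &amp;amp;&amp;amp; !(await validator(output))) failures.push(id);
  }
  return failures; // every entry is a fire to log
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;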

&lt;p&gt;The frontier models are going to keep getting better, and the prompts are going to keep getting tighter, and the cases where the model misbehaves are going to keep shrinking. None of that is going to zero. The guardrails are the part of the stack that turns "the model usually behaves" into "the product never embarrasses us." That gap is where users live, and it is where the work is, and it is the part of the system that earns the trust the rest of the product depends on.&lt;/p&gt;

&lt;p&gt;If your AI feature is one screenshot away from a bad week, the fix is not a better prompt. The fix is the layer that runs after the prompt. That layer is boring, full of regex and schemas and small classifiers, and it is the layer that lets you sleep through the night.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>devtools</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Pricing AI Features in 2026: How To Charge For LLM-Backed Products Without Bleeding Margins</title>
      <dc:creator>Alex Cloudstar</dc:creator>
      <pubDate>Mon, 04 May 2026 08:03:37 +0000</pubDate>
      <link>https://dev.to/alexcloudstar/pricing-ai-features-in-2026-how-to-charge-for-llm-backed-products-without-bleeding-margins-57fm</link>
      <guid>https://dev.to/alexcloudstar/pricing-ai-features-in-2026-how-to-charge-for-llm-backed-products-without-bleeding-margins-57fm</guid>
      <description>&lt;p&gt;The first AI feature I shipped on a flat plan lost money on the third user who discovered it. Not slowly. Immediately. He was running a script through it on a loop because the UI did not stop him from doing that, and his single account burned through more in API costs that week than the feature was supposed to make in a month. I shipped the fix on a Sunday and rewrote the pricing on a Tuesday, and I have not priced an AI feature on a flat plan since.&lt;/p&gt;

&lt;p&gt;That is the lesson the SaaS playbook had not caught up to yet in 2024 and that most teams have finally internalized by 2026. The economics of AI features are different from the economics of CRUD features. A heavy CRUD user costs you a few extra database rows. A heavy AI user costs you real money on every action they take. If your pricing does not reflect that, your power users are an unfunded liability and your accountant is the one who finds out.&lt;/p&gt;

&lt;p&gt;This is what pricing AI features actually looks like in 2026, what works, what backfires, and how to land on a structure that scales with your costs instead of fighting them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Flat Pricing Cannot Work For AI
&lt;/h2&gt;

&lt;p&gt;The pitch for flat pricing is real. It is simple. It is predictable. It is what customers expect from SaaS. The reason it stops working for AI features is that the variance in how people use AI features is enormous, and the variance lands on your bill instead of theirs.&lt;/p&gt;

&lt;p&gt;A normal SaaS feature has a soft cap on usage. There are only so many invoices a freelancer can send in a month. There are only so many tickets a support team can write. The heavy users pay the same as the light users and you make money on the average because the gap between them is bounded.&lt;/p&gt;

&lt;p&gt;AI features have no such cap. A user with a script and a coffee can hit your endpoint a thousand times an hour without thinking about it. A user with a clever prompt can route a hundred-page document through your most expensive model and walk away. The cost per action is high enough and the variance wide enough that "average it out" stops being a viable financial model. You will lose money on the long tail and not make enough on the short tail to cover it.&lt;/p&gt;

&lt;p&gt;The teams I have watched try to make flat pricing work end up doing one of three things. They cap usage in a way that frustrates the users they wanted to keep. They eat the cost and watch their gross margin compress until they have to raise prices on everyone. They quietly downgrade the model behind the feature until the quality drop kicks loose enough customers to balance the bill. None of these are good outcomes. All of them are what happens when you treat AI cost like SaaS cost.&lt;/p&gt;

&lt;p&gt;The framing that has held up is that AI features have a unit cost that is not zero, is not negligible, and is not predictable from your headcount. They are closer to a metered service than a fixed-cost feature. Pricing them like the former is what works.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Models That Have Converged
&lt;/h2&gt;

&lt;p&gt;By 2026, the pricing models that survived contact with real AI workloads have shrunk to three. There are variations, but the underlying shapes are the same.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pure usage-based.&lt;/strong&gt; The customer pays per unit of work. Per token, per request, per generated artifact, per minute of voice. The pricing math passes through directly to the underlying cost with a margin on top. This is the model used by API-first products and by features where the unit of work is well-defined and the customer has a mental model for what they are buying. The OpenAI and Anthropic developer APIs are the canonical examples. So is most of what you would build on the &lt;a href="https://dev.to/blog/llm-router-model-routing-fallbacks-2026"&gt;Vercel AI Gateway&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Credit or token systems.&lt;/strong&gt; The customer buys a bucket of credits per month and spends them on actions. Different actions cost different amounts of credits. Unused credits roll over or expire. This is the model that has won for consumer-facing AI products and prosumer SaaS, because it gives the customer a predictable monthly bill while still letting the vendor charge differentially for different costs. ChatGPT's plan structure, Midjourney's GPU minutes, the credit systems on most image generation tools all use some version of this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hybrid plans with overage.&lt;/strong&gt; The customer pays a monthly subscription that includes a generous-but-bounded amount of AI usage, with overage billed per unit beyond the included amount. Power users pay more, light users pay the flat rate, and the vendor is protected from the worst-case user without making everyone feel metered. This is the dominant model for B2B SaaS that has added AI features on top of existing products. Notion, Linear, the modern incarnation of every productivity tool, all run some version of this.&lt;/p&gt;

&lt;p&gt;The right choice depends on the product, the buyer, and the cost shape of the workload. A pure API product with sophisticated buyers should usually go usage-based. A consumer product where the unit of value is the action should usually go credit-based. A B2B feature glued onto an existing flat plan should almost always go hybrid.&lt;/p&gt;

&lt;p&gt;The thing all three have in common is that the customer's bill moves with their usage in some way. Flat pricing breaks this link, and the link is what keeps the business model intact when usage spikes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pricing Close To The Unit Of Value
&lt;/h2&gt;

&lt;p&gt;The hardest part of pricing AI features is figuring out what the customer is actually buying. The token is the unit your provider bills you in. It is rarely the unit your customer cares about.&lt;/p&gt;

&lt;p&gt;A customer using an AI writer does not care how many tokens went into the article. They care that they got an article. The unit of value is "article." A customer using a code review agent does not care about the input context size. They care that they got a review they could ship. The unit of value is "review." A customer using a chatbot does not care about the round-trip token count. They care that they got an answer. The unit of value is "answer," or maybe "conversation."&lt;/p&gt;

&lt;p&gt;If you price in tokens to a customer who is buying articles, you create a UX where the customer is doing math in their head about what their next action will cost, every action they take. That math is anxiety, and anxiety on every click is how features get used less and churn goes up. You also expose your customer to your supply-side problems. Token counts shift when models change. A new model might be twice as efficient for the same output. A new prompt might be twice as long. None of that should land in your customer's invoice, because none of it is something they did differently.&lt;/p&gt;

&lt;p&gt;The pattern that works is to price in the unit your customer thinks in, and absorb the token-level variance behind it. An "article" costs the customer a flat number of credits regardless of how many tokens you used to generate it. A "review" costs a flat number of credits regardless of which model variant ran. The router work I covered in the &lt;a href="https://dev.to/blog/llm-router-model-routing-fallbacks-2026"&gt;LLM router pattern guide&lt;/a&gt; is what makes this possible. You pick a cheaper model when you can, an expensive model when you have to, and the customer never sees the difference because they are paying for the outcome, not the inputs.&lt;/p&gt;
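
&lt;p&gt;Mechanically, that abstraction is nothing more than a flat credit price per action. A sketch, with the numbers invented:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// The customer pays per outcome; token variance stays on your side.
const CREDIT_COST = {
  article: 10,
  review: 5,
  chat_answer: 1,
} as const;

function chargeFor(action: keyof typeof CREDIT_COST, balance: number): number {
  const cost = CREDIT_COST[action];
  if (balance &amp;lt; cost) throw new Error("insufficient_credits");
  return balance - cost;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;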

&lt;p&gt;The exception is when the customer is technical enough to care about the underlying mechanics. If you are selling to developers building on your API, token-level pricing is what they expect, because they are reasoning about their own cost downstream. The closer the customer is to building their own AI product on top of yours, the more you should price the way their providers price them. The further they are from that, the more you should abstract.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting The Margin Without Setting Yourself On Fire
&lt;/h2&gt;

&lt;p&gt;The naive way to price an AI feature is to take the underlying cost, multiply by some margin, and ship. This works for an afternoon and then fails as soon as your usage mix shifts.&lt;/p&gt;

&lt;p&gt;A few traps to avoid.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Margins that are too thin.&lt;/strong&gt; A 20 percent margin on AI usage looks reasonable until you realize that 20 percent margin is supposed to cover the entire rest of the business. Support, hosting, the engineers building the product, marketing, taxes. Twenty percent of an API call is not a business. Three to five times the underlying cost is not unreasonable for a B2C product. For B2B, the multiples are usually higher. The customer is buying a product, not access to a wholesale price list, and they are paying for the work you did to make the feature work, not just for the model call underneath.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Margins that are too thick on simple work.&lt;/strong&gt; Charging ten dollars for an action that costs you ten cents and that the customer could replicate by typing into ChatGPT for free is how you get a churn rate that does not survive the first competitor. Customers in 2026 know what frontier models cost. The margin you charge has to be defensible by the work you did to make the workflow actually useful. If the answer is "we just stuck their text into a system prompt," you do not get to charge fifty times the underlying cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing that does not move with model prices.&lt;/strong&gt; Frontier model prices have come down between two and ten times in the last two years and will keep coming down. If your pricing is locked in based on what models cost a year ago, your competitors will undercut you with the same workflow on cheaper inference. Build pricing that moves. Either pass cost reductions through to the customer (and make a marketing moment of it), or hold the price and use the margin expansion to fund features the customer actually wanted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prices set without an eval.&lt;/strong&gt; "We will route this to the cheap model and charge the expensive-model price" is the move that destroys trust the second time a customer notices the quality dropped. The cost-saving routing pattern works only if your evals confirm the cheaper model is good enough for the bucket. Without that, you are quietly downgrading your product to widen your margin, and customers will figure it out. The same eval discipline I covered in &lt;a href="https://dev.to/blog/ai-evals-solo-developers-2026"&gt;AI evals for solo developers&lt;/a&gt; is what keeps the routing honest.&lt;/p&gt;

&lt;p&gt;The margin that holds up is the one you can defend with both the underlying cost math and the work you did to make the product useful. Both halves matter. Either alone is a bad answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Designing The Plan Tiers
&lt;/h2&gt;

&lt;p&gt;Once you have picked a model and set a margin, the next question is how the plan tiers should be shaped. This is the part where most teams pattern-match to SaaS and end up with tiers that do not work for AI.&lt;/p&gt;

&lt;p&gt;The tiers that have held up have a few common features.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The free tier is bounded by usage, not by feature gates.&lt;/strong&gt; Free users get a small but real allowance of the AI feature. Not a watered-down version. The same feature, with a usage cap. This is what lets free users actually evaluate whether the feature is worth paying for, instead of bouncing off a degraded version that did not show them the real value. The usage cap protects you from cost. The feature parity protects you from churn-before-conversion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The middle tier is where most paying customers should land.&lt;/strong&gt; It includes enough usage that a typical paying user does not hit the cap during a normal month. The price is set so that this tier is profitable on the average user. If you are seeing a lot of customers regularly hitting overage on this tier, the included usage is too low. If you are seeing the tier lose money on the average customer, the included usage is too high or the price is too low.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The top tier is for the heavy users and the ones who want predictability.&lt;/strong&gt; It either has a much higher allowance or it includes overage credits at a discount. The customers on this tier are often businesses with a budget and a low tolerance for surprise invoices. The plan should give them predictable spend even at high volume, even if that means slightly higher unit cost than pure usage-based pricing would imply.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Overage exists at every tier.&lt;/strong&gt; When a customer hits the cap, they get a clear, fair, undiscounted overage rate. Not a wall. Not a forced upgrade. Just a price they can choose to pay. Walls drive churn. Overage drives revenue. The math on which is better is not close.&lt;/p&gt;

&lt;p&gt;The mistake to avoid is segmenting tiers by feature when the differentiator should be usage. If your AI feature is good enough to be the reason people are paying you, do not gate it behind the top tier. Gate it behind a usage allowance and let everyone use it. The customers who use it heavily will pay you more. The customers who use it lightly will not, and that is fine, because they cost you less.&lt;/p&gt;

&lt;h2&gt;
  
  
  Communicating Usage Without Inducing Anxiety
&lt;/h2&gt;

&lt;p&gt;The UX of usage-based pricing is the part that usually decides whether the model works. Customers who feel a meter ticking in their head every time they click are customers who click less. Customers who get a clean dashboard and a predictable bill are customers who use the product more, find more value, and renew.&lt;/p&gt;

&lt;p&gt;The patterns that work.&lt;/p&gt;

&lt;p&gt;A real-time usage indicator that is informative without being alarmist. Show the customer how much of their allowance they have used this month. Show it as a percentage with a color. Do not show it as an estimated dollar amount that updates with every action. The dollar amount makes every click feel expensive. The percentage makes the cap feel like a budget.&lt;/p&gt;

&lt;p&gt;Soft warnings before hard limits. When a customer is at 80 percent of their allowance with two weeks left in the month, send a quiet email. Tell them they are on track to hit the cap, what their options are, and how much overage would cost. Do not let them surprise themselves at 100 percent. Surprise overages are the single largest source of "I did not understand what I was buying" support tickets, and those are the tickets that turn into chargebacks.&lt;/p&gt;

&lt;p&gt;Clear unit pricing on every action that costs credits. If a workflow costs 5 credits, tell the customer it costs 5 credits before they run it. Hidden costs feel like getting cheated, even when the price is fair. Visible costs feel like a transaction, which is what they are.&lt;/p&gt;

&lt;p&gt;A monthly summary that ties usage to value. The customer should see, at the end of the month, what they got for their money. The number of articles, reviews, conversations, whatever the unit of value is. This is the artifact that makes renewal a yes when budgets get reviewed. The customer cannot remember what they did three weeks ago. The summary remembers for them.&lt;/p&gt;

&lt;p&gt;The thing not to do is bury usage information in a settings page nobody opens. The whole point of usage-based pricing is that the cost reflects the value. If the customer cannot see the value, the cost feels arbitrary.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Backfires
&lt;/h2&gt;

&lt;p&gt;A few patterns look smart in the planning doc and turn into pain in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Free trials that do not have usage caps.&lt;/strong&gt; A 14-day free trial of an AI feature with no usage limit is an invitation to a stranger to spend your money. Cap the trial at a sensible allowance. The serious customer will see the value within the cap. The freeloader will hit it and either convert or move on, and either is better than running up your bill for two weeks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Custom pricing for everything above the lowest tier.&lt;/strong&gt; "Contact sales" pricing makes sense for genuine enterprise deals. It does not make sense for the second paid tier of a self-serve product. Hiding prices forces a sales conversation on customers who would have happily paid the listed price, and most of them will leave instead of starting that conversation. Show prices. Negotiate enterprise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BYO API keys.&lt;/strong&gt; Letting the customer bring their own provider key looks like a way to push cost off your books. It also pushes off all the things that make your product work, including the routing, the evals, the caching, and the observability. The customer ends up running a worse version of your product on their own bill, and your value capture goes to zero. This pattern resurfaces whenever budgets get tight, and it is almost always a worse business than just charging for the value you provide.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Caching savings hidden from the customer.&lt;/strong&gt; If your prompt caching layer cuts your costs by 60 percent, the customer should benefit from that, either through better pricing or through more included usage. Pocketing the savings entirely while the customer pays the pre-cache rate is a posture that survives until a competitor undercuts it. The caching architecture I wrote about in the &lt;a href="https://dev.to/blog/prompt-caching-production-guide-2026"&gt;prompt caching production guide&lt;/a&gt; is great for margins. It is even better when some of those margins are reinvested in price.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing model changes without notice.&lt;/strong&gt; Doubling prices, slashing allowances, or moving features between tiers without warning is the fastest way to turn happy customers into vocal critics. Every AI product I have watched make this move has eaten a wave of churn and a quarter of bad press. If the math is not working, the answer is to grandfather existing customers and change the pricing for new signups, not to break the deal mid-flight.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Simple Framework For New Features
&lt;/h2&gt;

&lt;p&gt;When I sit down to price a new AI feature now, the questions I run through have stabilized into a short list.&lt;/p&gt;

&lt;p&gt;What is the unit of value the customer is buying? Not the token. The article, the review, the conversation, the analysis. Whatever the customer would describe to a friend.&lt;/p&gt;

&lt;p&gt;What is the underlying cost per unit of value, including the worst-case input? Not the average. The 90th percentile. The pricing has to survive the heavy user, not just the median one.&lt;/p&gt;

&lt;p&gt;What multiple of cost is defensible given the work my product does on top of the model? More if I am providing meaningful workflow, eval, integration, distribution. Less if I am thinly wrapping a frontier model.&lt;/p&gt;

&lt;p&gt;How does the customer want to buy this? A meter, a credit pack, or an included allowance with overage. The answer depends on whether they are a developer, a prosumer, or a business buyer.&lt;/p&gt;

&lt;p&gt;What does the cap look like at each tier, and where does the average user land relative to the cap? The middle tier should be profitable for the average user without making them feel restricted.&lt;/p&gt;

&lt;p&gt;What happens when usage spikes? The pricing has to gracefully handle the customer who suddenly does ten times their normal volume, without either melting my margins or crashing into a wall that ends the relationship.&lt;/p&gt;

&lt;p&gt;How does the price move when underlying model costs change? Either the customer benefits and I market the change, or I capture the margin and reinvest in features. The pricing should not be a fossil.&lt;/p&gt;

&lt;p&gt;The answers shape the pricing. The pricing shapes the business. The business has to survive the worst customer in the dataset, not the average one. That framing is what flips AI feature pricing from "this is hard" to "this is solvable, and here is the structure that solves it."&lt;/p&gt;

&lt;h2&gt;
  
  
  What Does Not Change
&lt;/h2&gt;

&lt;p&gt;The thing that has stayed constant through all of this is that customers will pay good money for AI features that solve real problems. They will not pay for AI features that solve fake problems, no matter how cheap the model gets. Pricing is the conversation about how much, in what shape, and on what terms. It is not the conversation about whether the feature is worth anything in the first place. That conversation happens upstream and the pricing model cannot rescue a feature that lost it.&lt;/p&gt;

&lt;p&gt;The other constant is that the unit economics have to work. There is no pricing model that turns negative gross margin into a sustainable business. If a feature loses money on the average user, no amount of clever tiering will save it. The fix is upstream, in the cost structure, the routing, the model choice, the caching. Pricing is downstream of unit economics. It cannot beat them.&lt;/p&gt;

&lt;p&gt;The AI feature pricing models that have converged in 2026 are not complicated. They are usage, credits, or hybrid with overage. The work is in the details. The unit of value, the margin, the tier shape, the UX, the caps, the warnings, the summaries. Get those right and you have a feature that scales with its costs instead of fighting them. Get them wrong and your most enthusiastic users are also the ones killing your business, which is not the user feedback loop you wanted.&lt;/p&gt;

&lt;p&gt;The week I rewrote my pricing was the cheapest education I have ever bought. The bill that scared me into doing it was not even that high. The bill it would have been if I had not is the one I do not have to think about, because I caught it before it got there. Every AI feature I have shipped since has had pricing built in from day one, sized for the heavy user, communicated clearly, and tied to the unit the customer actually values. That has held up across model generations, across customer segments, across price drops in the underlying APIs. The pattern outlasts the parts.&lt;/p&gt;

&lt;p&gt;If you are shipping an AI feature on a flat plan in 2026, the pricing is the bug. Fix it before your power user finds out it is there.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>pricing</category>
      <category>saas</category>
      <category>business</category>
    </item>
    <item>
      <title>Multi-Modal AI Agents In Production: Vision, Audio, And The Glue That Actually Works In 2026</title>
      <dc:creator>Alex Cloudstar</dc:creator>
      <pubDate>Mon, 04 May 2026 08:03:04 +0000</pubDate>
      <link>https://dev.to/alexcloudstar/multi-modal-ai-agents-in-production-vision-audio-and-the-glue-that-actually-works-in-2026-kfi</link>
      <guid>https://dev.to/alexcloudstar/multi-modal-ai-agents-in-production-vision-audio-and-the-glue-that-actually-works-in-2026-kfi</guid>
      <description>&lt;p&gt;The first multi-modal agent I shipped to real users had a beautiful demo and a brutal first week. The demo was a screenshot upload that produced a working bug ticket with the right component, the right severity, and a reproduction step the engineer could actually follow. The first week was a parade of edge cases I had not anticipated. Users uploaded photos of their monitors taken at angles, with glare, with parts of three browser windows visible. They uploaded screenshots of mobile apps the model had never seen. They uploaded full-page captures that exceeded the model's image input limits and got back unhelpful errors. The agent worked perfectly on the inputs I had tested with. It fell apart on the inputs people actually had.&lt;/p&gt;

&lt;p&gt;That is the story of every multi-modal agent I have shipped since. The text-only version is the easy version. Adding vision or audio looks like a small change in the API call and is in fact a significant change in how the system behaves under real traffic. The cost curve is different. The latency profile is different. The failure modes are different. The evaluations have to change. The prompts have to change. The way users interact with the product changes, and the things they expect from it change with them.&lt;/p&gt;

&lt;p&gt;By 2026 the patterns for shipping multi-modal agents have stabilized enough to be useful. They are not the same as the patterns for shipping text agents, and pretending they are is the most common reason teams ship a vision feature that works in the demo and disappoints in production. This is what I have learned, and what the teams I trust have converged on.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Counts As Multi-Modal And Why It Matters
&lt;/h2&gt;

&lt;p&gt;Multi-modal in 2026 mostly means three combinations: text plus images in, text out (the most common); text plus audio in, text out (transcription, voice agents); and text in, audio or images out (TTS, image generation). The end-to-end any-modality-in any-modality-out vision is technically possible with frontier models but rarely shipped as one call in production, because the cost and latency tradeoffs do not pencil out for most use cases.&lt;/p&gt;

&lt;p&gt;The reason the distinction matters is that each pairing has its own failure modes and its own economics. A vision-in agent that helps users debug screenshots has different problems from a voice-in agent that handles support calls, and treating them as a single category produces architectures that are wrong for both. The right way to start is to pick the specific modality combination the product needs and design for the failure modes of that combination, not for the abstract category of "multi-modal."&lt;/p&gt;

&lt;p&gt;The other thing that matters is that adding a modality is not free. The text version of your feature is a baseline. The multi-modal version adds preprocessing, larger payloads, longer latencies, more expensive tokens, and more failure modes. If the user does not actually need the modality, do not add it. The most overbuilt agents I have seen this year were ones where someone had decided "we should support voice" without checking whether the users wanted it. The users were happy typing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Vision: The Tokenization Trap
&lt;/h2&gt;

&lt;p&gt;The first thing that surprises people about vision input is the cost. An image is not one token. It is potentially thousands of tokens, and the count depends on the resolution, the model's tiling strategy, and whether the model uses a lower-fidelity preview pass before the full one. A high-resolution screenshot can cost more than the text prompt around it by an order of magnitude.&lt;/p&gt;

&lt;p&gt;The fix is to preprocess images before they hit the model. Resize aggressively. Most vision tasks do not need a 4K screenshot. They need an image at the resolution where the relevant content is legible. For a UI screenshot that means roughly 1024 pixels on the long edge for most tasks, less for simple recognition, more only when there is fine detail that matters. The model's accuracy on legible content does not improve meaningfully past that range, and the cost grows linearly or worse with pixel count.&lt;/p&gt;

&lt;p&gt;Crop when you can. If the user uploaded a full-page screenshot but the relevant content is in the top quarter, cropping to the relevant region saves tokens and improves accuracy. The model has less noise to ignore. The output is more focused. The bill is lower. Auto-cropping is hard, but interactive cropping (let the user drag a box) is cheap to build and dramatically improves both cost and accuracy.&lt;/p&gt;

&lt;p&gt;Compress carefully. JPEG at 80 percent quality is usually indistinguishable from the original for vision tasks and is a third of the file size. PNG with quantization can be smaller still. The format the model receives is not necessarily the format the user uploaded, and the conversion is a place where you can save real money without hurting quality.&lt;/p&gt;
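
&lt;p&gt;With &lt;code&gt;sharp&lt;/code&gt;, the whole preprocessing pass is a few lines. A sketch using the numbers from above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import sharp from "sharp";

// Cap the long edge at ~1024px and re-encode as JPEG 80 before the
// image ever reaches the model.
async function preprocessScreenshot(input: Buffer): Promise&amp;lt;Buffer&amp;gt; {
  return sharp(input)
    .resize({ width: 1024, height: 1024, fit: "inside", withoutEnlargement: true })
    .jpeg({ quality: 80 })
    .toBuffer();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;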

&lt;p&gt;The other tokenization surprise is that some models charge differently for low-detail and high-detail processing. If your task is a coarse recognition task ("does this image show a chart"), you can ask for low-detail processing and pay a fraction of the cost. If your task is a fine recognition task ("read the labels on the y-axis"), you need high-detail. Picking the right detail level per request is a routing decision similar to the model routing pattern I covered in &lt;a href="https://dev.to/blog/llm-router-model-routing-fallbacks-2026"&gt;the LLM router pattern guide&lt;/a&gt;, and it produces similar savings.&lt;/p&gt;
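
&lt;p&gt;The routing itself can be as simple as a lookup from task type to detail level, assuming your provider exposes a detail knob the way OpenAI's image inputs do:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Coarse tasks go low-detail, fine tasks go high. Task names are examples.
type VisionTask = "classify" | "extract_text" | "read_chart_values";

function detailFor(task: VisionTask): "low" | "high" {
  return task === "classify" ? "low" : "high";
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;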

&lt;h2&gt;
  
  
  Vision: The Failure Modes Nobody Warns You About
&lt;/h2&gt;

&lt;p&gt;Vision models hallucinate differently from text models. The failure modes are subtle, look correct on the surface, and are hard to catch without specific evaluations.&lt;/p&gt;

&lt;p&gt;Text in images is unreliable. Models will read text in images, but they will also confidently misread text, especially low-contrast, small, or stylized text. A timestamp the user can see clearly may be off by a digit in the model's reading. A version number may be one minor version different from what is actually shown. If the task depends on exact text extraction, you should be running OCR as a separate step and feeding the extracted text into the model alongside the image, not relying on the model to read accurately. Modern models are getting better at this. They are not reliable enough to skip the OCR pass for tasks where the text matters.&lt;/p&gt;

&lt;p&gt;Spatial reasoning is shallower than it looks. Models can describe what is in an image. They are worse at reasoning about positions, sizes, and relationships between elements. "Which button is to the left of the menu" is the kind of question that produces confident but wrong answers more often than the demo videos suggest. If your task involves spatial reasoning, validate it specifically, and consider supplementing with vision-specific models or pipelines that produce structured spatial outputs.&lt;/p&gt;

&lt;p&gt;Charts and diagrams are read shallowly. The model will tell you a chart shows a downward trend. It is much less reliable about the specific values, the units, or the inflection points. Treat chart understanding as a fuzzy summary task, not a data extraction task, unless you have specifically validated otherwise.&lt;/p&gt;

&lt;p&gt;Multi-image inputs amplify confusion. Two images in one request work fine if the task is "compare these two." They work less well if the task implicitly assumes the model will keep track of which image is which across a multi-step reasoning chain. The model may conflate them. The fix is to be explicit in the prompt about which image is which, and to keep the number of images per call as low as the task allows.&lt;/p&gt;

&lt;p&gt;The other failure mode is content that the model is not trained on. A screenshot of an obscure enterprise dashboard the model has never seen will be described in generic terms. A screenshot of a well-known web product will be described accurately. The same agent can look smart on common content and dumb on rare content. Validate against the content distribution your users actually have, not against the demo set you used to build it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Audio: Latency Is The First Bill You Pay
&lt;/h2&gt;

&lt;p&gt;Audio in production is dominated by latency in a way that text and even vision are not. A user typing a message has already absorbed the latency of typing. A user speaking expects a response in roughly the time another human would take to respond. That is around eight hundred milliseconds, end to end, from the moment they stop speaking. Anything past two seconds feels broken. Past four seconds, the user starts wondering if the system is alive.&lt;/p&gt;

&lt;p&gt;The latency budget for a voice agent is brutal. The audio has to travel to the server, get transcribed, the transcript has to flow into the agent, the agent has to think, the response has to be generated, the response has to be synthesized into speech, and the speech has to travel back. Every step has a budget, and every step has a worst-case that breaks the experience.&lt;/p&gt;

&lt;p&gt;The pattern that has worked by 2026 is to stream everything that can stream. Streaming transcription that emits partial transcripts as the user is still speaking. Streaming generation that starts the response before it is complete. Streaming TTS that starts audio playback before the full text is generated. Each of these saves hundreds of milliseconds. Together they are the difference between a voice agent that feels alive and one that feels like a voicemail.&lt;/p&gt;

&lt;p&gt;The other pattern is to colocate the components. Sending audio across regions adds round trips that the latency budget cannot absorb. Picking a region close to the user, putting the transcription, the model call, and the TTS in the same region, and minimizing the hops between them is the difference between a sub-second response and a three-second response. The infrastructure for this in 2026 has gotten better than it was, but it is still a place where the careful choices add up.&lt;/p&gt;

&lt;p&gt;The third pattern is to handle interruption. Real conversations have interruptions. The user starts to ask one thing, changes their mind, and asks another. A voice agent that cannot be interrupted will keep talking through the user's correction. The user will hate it. The fix is to have the audio playback pipeline listen for new audio input and stop playback when the user starts speaking. This requires the audio pipeline to be duplex and the agent's state to be revisable mid-response. Both are non-trivial. Both are required if the agent is going to feel like a real conversation.&lt;/p&gt;

&lt;p&gt;The same patterns I covered in the &lt;a href="https://dev.to/blog/ai-voice-agents-production-2026"&gt;voice agents production guide&lt;/a&gt; apply with more force when the voice agent is multi-modal, because every additional modality adds latency that the voice budget cannot afford. If you are layering vision into a voice flow, the vision pass has to fit in the voice latency budget, which usually means it cannot be on the critical path.&lt;/p&gt;

&lt;h2&gt;
  
  
  Audio: The Quality Tax On Real Recordings
&lt;/h2&gt;

&lt;p&gt;Demo audio is clean. Real audio is not. Real audio has background noise, multiple speakers, low-quality microphones, accents, hesitation, and code-switching between languages. The transcription accuracy drops on each of these, and the drops compound.&lt;/p&gt;

&lt;p&gt;The pattern that has worked is to validate the transcription quality on a sample of real user audio before tuning the rest of the pipeline. If the transcription is bad, the agent is bad, regardless of how good the model is. The fix may be a better transcription model, audio preprocessing (noise reduction, normalization), or accepting that some audio inputs are out of scope and falling back to text. All of those are reasonable. Pretending the audio is fine when it is not is not.&lt;/p&gt;

&lt;p&gt;Speaker diarization, which is figuring out who is saying what when there are multiple speakers, is its own problem. It works in clean conditions and fails in messy ones. If your product depends on attributing speech to speakers, the quality of the diarization pass is the limiting factor on the rest of the pipeline. Plan for that. Validate it. Do not assume it works.&lt;/p&gt;

&lt;p&gt;The other quality tax is on the output side. Text-to-speech in 2026 is dramatically better than it was, but it still has artifacts on edge cases: long numbers, technical jargon, names, code snippets read aloud. The fix is to preprocess the text the model generates before it goes to TTS. Spell out numbers in a form the TTS handles well. Replace technical strings with paraphrases. Handle proper nouns explicitly. The output sounds dramatically better with a thin transformation layer between the model and the TTS, and the layer is not hard to write.&lt;/p&gt;
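
&lt;p&gt;The transformation layer is mostly regex. A sketch, with rules you would grow from the artifacts you actually hear:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;function normalizeForTts(text: string): string {
  return text
    // read version numbers as words: "2.14.3" becomes "2 point 14 point 3"
    .replace(/(\d+)\.(\d+)\.(\d+)/g, "$1 point $2 point $3")
    // strip inline code backticks, which most TTS voices mispronounce
    .replace(/`([^`]+)`/g, "$1")
    // expand a unit the voice tends to mangle
    .replace(/\bms\b/g, "milliseconds");
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;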

&lt;h2&gt;
  
  
  Evaluation: Multi-Modal Tasks Need Different Evals
&lt;/h2&gt;

&lt;p&gt;Text evaluation is mature by 2026. The discipline of running evals on production traffic, grading them, and using the grades to guide changes is well-established. Multi-modal evaluation is less mature, and the gap shows up as agents that ship with strong text evals and weak multi-modal evals, then drift in ways nobody catches.&lt;/p&gt;

&lt;p&gt;The shape of a multi-modal eval is different. The inputs include images or audio, which are larger and harder to store. The outputs may include modalities you have to grade differently from text. The grader, if it is an LLM, has to be a multi-modal model itself, which is more expensive than text grading. The cost of a multi-modal eval pass is meaningfully higher than the cost of a text eval pass.&lt;/p&gt;

&lt;p&gt;The patterns that have worked are to focus eval coverage on the failure modes you have actually seen, not on a broad sample. If users are uploading screenshots of mobile apps and the model is mishandling them, build an eval set of mobile app screenshots. Do not try to cover the full distribution of possible inputs. You will spend forever and miss what matters. Cover the ones you have observed go wrong, and grow the set as new failure modes show up.&lt;/p&gt;

&lt;p&gt;The other pattern is to grade multi-modal outputs with structured criteria. "Did the agent correctly identify the bug class." "Did the agent extract the right error message." "Did the agent suggest a reasonable fix." Each is a binary or scalar judgment. The aggregate is a quality score that is comparable across model versions, prompt versions, and pipeline changes. This is the same eval discipline I covered in &lt;a href="https://dev.to/blog/ai-evals-solo-developers-2026"&gt;AI evals for solo developers&lt;/a&gt;, with the additional constraint that the grader has to handle the modality.&lt;/p&gt;
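
&lt;p&gt;In code, each graded case is just a record of binary judgments plus a pointer to the stored input. The field names are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;interface MultiModalEvalCase {
  imageRef: string; // pointer into blob storage, never the raw bytes
  expected: { bugClass: string; errorMessage: string };
}

interface Grades {
  correctBugClass: boolean;
  correctErrorMessage: boolean;
  reasonableFix: boolean;
}

// The aggregate score is comparable across model and prompt versions.
function score(g: Grades): number {
  const checks = [g.correctBugClass, g.correctErrorMessage, g.reasonableFix];
  return checks.filter(Boolean).length / checks.length;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;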

&lt;p&gt;The dataset hygiene is also harder. Storing images and audio at scale is more expensive than storing text. Privacy considerations are larger because images and audio are more identifying than text. Retention policies, redaction strategies, and access controls all get more attention than they did in the text-only version of the same problem. Build for that from the start, because retrofitting privacy onto a multi-modal eval pipeline is painful.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost: Multi-Modal Bills Bend Differently
&lt;/h2&gt;

&lt;p&gt;Text cost scales with token count. Image cost scales with pixel count. Audio cost scales with duration. A feature that mixes them mixes the cost curves, and the bill ends up being shaped by whichever modality is most expensive on a given workload.&lt;/p&gt;

&lt;p&gt;The pattern that catches teams off guard is that vision-heavy features are dominated by image costs, not by the text reasoning costs people instinctively budget for. A feature that processes a hundred screenshots a day at a couple of thousand tokens each will burn more on the screenshot processing than on the model's reasoning over the extracted content. The optimization target is the image cost, not the model cost.&lt;/p&gt;

&lt;p&gt;Audio-heavy features are dominated by transcription cost and TTS cost. The model call in the middle is often the cheapest part of the pipeline. A voice agent's monthly bill is mostly speech, not language. Optimizing the language model is barely a rounding error compared to optimizing the speech components.&lt;/p&gt;

&lt;p&gt;The cost optimization patterns are the same general shape as the text-only patterns I wrote about in the &lt;a href="https://dev.to/blog/llm-cost-optimization-production-2026"&gt;LLM cost optimization guide&lt;/a&gt;. Cache aggressively. Route per request. Use cheaper models for easier work. The specifics differ. For images, the cost is upstream of the model and the optimizations are in preprocessing. For audio, the cost is on either side of the model and the optimizations are in the speech components. Knowing which side of the pipeline the bill lives on is the first step in cutting it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Glue: Pipelines, Not Single Calls
&lt;/h2&gt;

&lt;p&gt;The single biggest architectural mistake I see in multi-modal agents is treating the whole thing as one model call with multiple inputs. The pattern that has worked is to treat the agent as a pipeline of typed steps, where each step is a single-modality operation that produces a typed output, and the orchestration over the pipeline is its own piece of code.&lt;/p&gt;

&lt;p&gt;A vision agent for support tickets, in this pattern, is not "send the image and the user message to the model and parse the response." It is: classify the image type with a fast vision model, run OCR on the image, extract structured fields from the OCR text with a text model, query the user database for matching context, generate the ticket draft with a text model that takes the structured fields and the context, and return the draft. Five steps. Each is single-modality. Each is testable. Each can use a different model picked for its specific job.&lt;/p&gt;
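
&lt;p&gt;The same pipeline as typed signatures, with each &lt;code&gt;declare&lt;/code&gt; standing in for a concrete model or service call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;interface TicketFields { component: string; errorMessage: string }
interface UserContext { plan: string; recentErrors: string[] }

declare function classifyImage(img: Buffer): Promise&amp;lt;"screenshot" | "photo" | "other"&amp;gt;;
declare function runOcr(img: Buffer): Promise&amp;lt;string&amp;gt;;
declare function extractFields(ocrText: string): Promise&amp;lt;TicketFields&amp;gt;;
declare function loadUserContext(userId: string): Promise&amp;lt;UserContext&amp;gt;;
declare function draftTicket(f: TicketFields, ctx: UserContext): Promise&amp;lt;string&amp;gt;;

async function supportTicketAgent(img: Buffer, userId: string): Promise&amp;lt;string&amp;gt; {
  const kind = await classifyImage(img);        // step 1: fast vision model
  if (kind === "other") throw new Error("unsupported_image");
  const ocrText = await runOcr(img);            // step 2: dedicated OCR
  const fields = await extractFields(ocrText);  // step 3: text model
  const ctx = await loadUserContext(userId);    // step 4: plain database query
  return draftTicket(fields, ctx);              // step 5: text model
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every step is independently testable, and the failure of any one of them is localized to that step in the trace.&lt;/p&gt;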

&lt;p&gt;The orchestration is the agent. The model calls are the steps. The pipeline is observable, debuggable, and modifiable in a way that a single multi-modal call is not. When something fails, the failure is localized. When you want to swap a step, the swap is contained. When the cost gets out of hand, the optimization target is one step at a time. This is the same shape that durable workflow patterns push toward, and the reasons are similar.&lt;/p&gt;

&lt;p&gt;The exception is when the task is genuinely cross-modal in a way that decomposing would lose information. "Describe the relationship between this image and this text" is a task where the model needs both modalities at once. Most tasks are not actually that. Most tasks are decomposable, and decomposition produces a better-behaved system. Default to decomposition. Use the cross-modal call when the task actually requires it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Looks Like When It Works
&lt;/h2&gt;

&lt;p&gt;A working multi-modal agent in production by 2026 is a pipeline of single-modality steps, each one tight and observable, with multi-modal calls only where they are necessary. It has aggressive preprocessing on the inputs, structured eval coverage on the failure modes it has seen, and cost dashboards that show where the bill is concentrated. The latency budget is tracked end to end and respected at each step. The privacy and retention policies are explicit and enforced.&lt;/p&gt;

&lt;p&gt;The user-facing experience is fast, accurate on common inputs, gracefully degraded on uncommon ones, and clear about what it can and cannot do. The infrastructure underneath is unglamorous: small steps, typed contracts, careful evals, careful cost watching. The result is an agent that does not embarrass anyone in a customer demo and does not fall over the first time a user uploads a screenshot of something the team did not anticipate.&lt;/p&gt;

&lt;p&gt;That is the agent worth shipping. The demo with the impressive single-call multi-modal magic is fun to build and brittle to ship. The pipeline that does the same thing in five boring steps is what holds up. The boring version is the one that wins.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where This Is Going
&lt;/h2&gt;

&lt;p&gt;The frontier models are getting better at handling all modalities in one call, and the temptation will be to collapse the pipelines into single calls again. That will work for some tasks and will not work for others. The diagnostic is whether you can see, debug, and improve the result without rebuilding the whole thing every time something goes wrong. If the single-call version gives you that, take it. If it does not, the pipeline still wins.&lt;/p&gt;

&lt;p&gt;The other shift is that vision and audio inputs are becoming standard parts of agent surfaces, not special features. Users in 2026 expect to drag an image into a chat and have it understood. They expect to ask questions in voice and get answers in voice. The bar for what counts as multi-modal is rising, and features that ignore those modalities are going to feel dated. The cost of adding them is dropping. The cost of not adding them, in user expectations, is rising.&lt;/p&gt;

&lt;p&gt;The thing that is not changing is that the modalities are different from each other. They have different cost shapes, different failure modes, different evaluation needs, and different latency budgets. Treating them as variations of "send tokens to a model" is the failure pattern. Treating each as its own thing, with its own discipline, is what produces multi-modal agents that work.&lt;/p&gt;

&lt;p&gt;If you are about to add a modality to an existing agent, start by writing down what you expect to change. The cost. The latency. The failure modes. The evals. If those answers do not feel different from the text version, you have not thought about it hard enough yet. The modality changes the system. Plan for that, build for it, and the agent that comes out the other side is the one that earns the multi-modal label instead of just claiming it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>devtools</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Designing Tools For AI Agents In 2026: Schemas, Descriptions, And The Pitfalls That Make LLMs Fail Silently</title>
      <dc:creator>Alex Cloudstar</dc:creator>
      <pubDate>Mon, 04 May 2026 08:02:31 +0000</pubDate>
      <link>https://dev.to/alexcloudstar/designing-tools-for-ai-agents-in-2026-schemas-descriptions-and-the-pitfalls-that-make-llms-fail-4042</link>
      <guid>https://dev.to/alexcloudstar/designing-tools-for-ai-agents-in-2026-schemas-descriptions-and-the-pitfalls-that-make-llms-fail-4042</guid>
      <description>&lt;p&gt;The first agent I shipped that got real usage failed in a way I did not expect. The model was fine. The prompt was fine. The traces showed the agent reaching for a tool called &lt;code&gt;search_docs&lt;/code&gt; and confidently passing the user's entire question as the query, including the polite preamble and the trailing thanks. The tool was returning irrelevant results because nobody had told the model that the query parameter wanted a keyword phrase, not a sentence. I had written a one-line description that said "search the documentation" and called it good. The model did exactly what that description told it to do. It searched the documentation. With the wrong input. Because I never said what the input was supposed to look like.&lt;/p&gt;

&lt;p&gt;That bug took me three days to find, because the agent looked like it was working. The traces had successful tool calls. The outputs were grammatical. The user was getting answers that sounded plausible and were quietly wrong. The fix was not to swap the model. The fix was to rewrite the tool description and tighten the schema. After that, the agent worked. The model had been trying to help me the whole time. I had been the limiting factor.&lt;/p&gt;

&lt;p&gt;That was eighteen months ago. Since then I have shipped a dozen agents in production, debugged a few dozen more, and watched the failure modes converge. The boring truth is that most agent failures are tool design failures. The model is the easy part. The tool is where you make or break the thing. By 2026, the patterns for designing tools that LLMs can use without falling on their face have stabilized enough to write down. This is what I wish someone had handed me before I shipped that first agent.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Mental Model: Tools Are A Public API For The Model
&lt;/h2&gt;

&lt;p&gt;The frame that fixed my tool design was treating each tool as a public API consumed by a customer who has read the docs once, has no ability to ask questions, and gets one shot per call. That customer is the model. Every assumption you do not document, the model has to guess. Every overlap between two tools, the model has to disambiguate. Every error message that does not say what to do next, the model has to invent a recovery strategy from scratch. The same discipline that produces a usable REST API produces a usable tool surface for an agent.&lt;/p&gt;

&lt;p&gt;The thing that makes tool design harder than REST design is that the consumer has no integration phase. The model does not write code against your tool, run it, see the error, fix the code, and try again. It calls the tool with whatever it inferred from the description, gets whatever it gets back, and either uses the result or tries again with another guess. The feedback loop is one shot, in the middle of a conversation, with the user watching. Tool descriptions and schemas have to be self-documenting in a way that human-consumed docs do not. There is no Stack Overflow for the model to fall back on.&lt;/p&gt;

&lt;p&gt;The other thing that makes it harder is that tool surface compounds. Two tools is one pair to keep straight. Twenty tools is a hundred and ninety pairs of "are these two the same thing?" decisions for the model to make on every call. The cost of an extra tool is not linear. It is roughly the cost of explaining how it differs from every other tool you already have. Most agents I have seen with twenty-plus tools were one redesign away from being agents with eight tools and a happier model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Naming Is Most Of The Job
&lt;/h2&gt;

&lt;p&gt;The single highest-leverage decision in tool design is the name of the tool. Names are what the model uses to retrieve the tool from its working memory when it is deciding what to call. A vague name puts the burden of disambiguation on the description. A precise name does most of the work for free.&lt;/p&gt;

&lt;p&gt;The pattern that has worked for me is verb-object, written like a function signature in a codebase you would want to inherit. &lt;code&gt;search_documentation&lt;/code&gt;, &lt;code&gt;get_user_by_id&lt;/code&gt;, &lt;code&gt;create_calendar_event&lt;/code&gt;, &lt;code&gt;summarize_thread&lt;/code&gt;. Not &lt;code&gt;docs&lt;/code&gt;, not &lt;code&gt;user&lt;/code&gt;, not &lt;code&gt;calendar&lt;/code&gt;, not &lt;code&gt;summarize&lt;/code&gt;. Those names give the model a domain or a bare verb instead of a complete action, and that matters because the model chooses tools by the action it wants to take, object included, not by domain.&lt;/p&gt;

&lt;p&gt;Avoid names that overlap. If you have &lt;code&gt;find_user&lt;/code&gt; and &lt;code&gt;get_user&lt;/code&gt;, the model will pick one of them at random the first time it sees a request that fits both, and the choice will not be the one you wanted. If they do different things, name them differently enough that the difference is obvious from the name alone. &lt;code&gt;search_users_by_name&lt;/code&gt; and &lt;code&gt;get_user_by_id&lt;/code&gt; is a much better pair than &lt;code&gt;find_user&lt;/code&gt; and &lt;code&gt;get_user&lt;/code&gt;, because the names make the input shape part of the contract.&lt;/p&gt;

&lt;p&gt;Avoid names that are too clever. The model is good with conventional names because it has seen millions of them in training. It is worse with names you invented for branding reasons. &lt;code&gt;summon_compass&lt;/code&gt; is not a tool name. &lt;code&gt;find_directions&lt;/code&gt; is. The tool description can carry the brand voice. The name should carry the function.&lt;/p&gt;

&lt;p&gt;The last naming rule is the one I keep relearning: be willing to rename a tool when the agent starts using it for the wrong thing. If the model keeps reaching for &lt;code&gt;search_docs&lt;/code&gt; when it should be reaching for &lt;code&gt;lookup_pricing&lt;/code&gt;, the name &lt;code&gt;search_docs&lt;/code&gt; is too broad, or the name &lt;code&gt;lookup_pricing&lt;/code&gt; is too narrow, or the descriptions need work. The names are the first thing to fix. Renaming is cheap. Living with a confused agent is not.&lt;/p&gt;

&lt;h2&gt;
  
  
  Descriptions: Write Like You Are Onboarding A New Engineer
&lt;/h2&gt;

&lt;p&gt;The description field on a tool schema is where most teams underspend their effort and pay for it in production. A one-line description is rarely enough. The model needs to know, in plain language, what the tool does, when to use it, when not to use it, and what to expect back. That is four things, not one, and squeezing them into a single sentence is the most common reason agents pick the wrong tool.&lt;/p&gt;

&lt;p&gt;The structure that has worked is: lead with what the tool does, then say when to call it, then say when not to call it, then describe the shape of the response. The "when not to call it" line is the one that does the most work. It is the equivalent of disambiguating from neighboring tools. If the model knows that &lt;code&gt;search_documentation&lt;/code&gt; is for finding article content and is not for looking up product pricing or user data, it will not reach for it when the user asks about pricing. Without that line, it might.&lt;/p&gt;

&lt;p&gt;A worked example. The bad description: "Search the documentation." The good description: "Searches the product documentation for articles matching a keyword phrase. Use this when the user asks a how-to or conceptual question about the product. Do not use this for pricing lookups (use lookup_pricing) or for user account questions (use get_user_account). Returns up to five articles with title, snippet, and URL."&lt;/p&gt;

&lt;p&gt;The bad version is three words. The good version is about fifty. The good version eliminates a class of bugs that the bad version invites. Descriptions are the cheapest debugging tool you have. Spend the words.&lt;/p&gt;

&lt;p&gt;The other discipline is to write descriptions that match the schema. If a parameter is supposed to be a keyword phrase, the description should say so, and the parameter description should say so, and there should be an example. If the model is allowed to pass natural language, say that. If it is not, say it is not. The number of agents I have seen pass entire user questions into a parameter that wanted a SQL-safe identifier is more than I want to admit. The fix was always to rewrite the parameter description. The model had been doing what the description allowed.&lt;/p&gt;
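
&lt;p&gt;To make that concrete, here is a minimal sketch of the good version as a tool definition, in TypeScript. The &lt;code&gt;ToolDefinition&lt;/code&gt; shape is illustrative rather than any particular SDK's type; the point is that the description carries all four things and the parameter description restates the input shape.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Illustrative shape only; adapt to whatever SDK you register tools with.
type ToolDefinition = {
  name: string;
  description: string;
  parameters: Record&lt;string, unknown&gt;; // JSON Schema for the arguments
};

const searchDocumentation: ToolDefinition = {
  name: "search_documentation",
  // What it does, when to use it, when not to use it, what it returns.
  description: [
    "Searches the product documentation for articles matching a keyword phrase.",
    "Use this when the user asks a how-to or conceptual question about the product.",
    "Do not use this for pricing lookups (use lookup_pricing) or for user account questions (use get_user_account).",
    "Returns up to five articles with title, snippet, and URL.",
  ].join(" "),
  parameters: {
    type: "object",
    properties: {
      query: {
        type: "string",
        description: "Keyword phrase, not a full sentence. Two to ten words, lowercase.",
      },
    },
    required: ["query"],
  },
};
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;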

&lt;h2&gt;
  
  
  Parameter Schemas Are A Contract, Not A Hint
&lt;/h2&gt;

&lt;p&gt;JSON Schema is the contract between the agent and the tool. If the schema is loose, the model will exploit the looseness. If the schema is tight, the model will work harder to produce valid input. The framing that made tool calls reliable for me was treating the schema as the place where I forbid wrong inputs, not the place where I describe what right inputs look like.&lt;/p&gt;

&lt;p&gt;Use enums when the parameter has a small set of valid values. Do not let the model invent a status string when there are only four valid statuses. Put them in an enum. The model gets to pick from a list, the list constrains the output, and the runtime validator rejects anything else. The cost is one line of schema. The benefit is that the agent stops inventing statuses.&lt;/p&gt;

&lt;p&gt;Use string formats when they exist. ISO 8601 dates, email addresses, UUIDs, URLs. Format hints are part of the contract the model is trained against. The model knows what a date in ISO 8601 looks like. Tell it that is what you want, and it will produce one. Leave the format ambiguous, and you will get "tomorrow" passed in as a date string.&lt;/p&gt;

&lt;p&gt;Use min and max constraints. If a search query has to be at least three characters, say so. If a list parameter has a max length, say so. The model will respect the constraints if you state them. It will violate them if you do not, because the description said "search query" and the model interpreted that as "any string."&lt;/p&gt;

&lt;p&gt;Use required vs optional deliberately. Every required parameter is one more thing the model has to figure out before it can call the tool. Every optional parameter is one more way the call can go subtly wrong. When in doubt, leave a parameter out, and make the ones you keep required. Add optional parameters only when they meaningfully change the behavior. Do not add optional parameters as a way to expose every flag your function supports.&lt;/p&gt;

&lt;p&gt;Use descriptions on every parameter. The schema description for the tool talks about the tool. The descriptions on each parameter talk about that parameter. The model reads both. A parameter description that says "the search query" is doing nothing the type does not already do. A parameter description that says "the search query as a keyword phrase, not a full sentence, two to ten words, lowercase" is doing real work.&lt;/p&gt;
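
&lt;p&gt;A sketch of what those constraints look like together, as the parameters schema for a hypothetical &lt;code&gt;create_calendar_event&lt;/code&gt; tool. The field names and limits are invented for illustration; the pattern is enums for closed sets, formats for dates and emails, explicit lengths, descriptions on every parameter, and a short required list.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical parameters schema; every constraint is something a runtime validator can enforce.
const createCalendarEventParameters = {
  type: "object",
  properties: {
    title: {
      type: "string",
      minLength: 3,
      maxLength: 120,
      description: "Short human-readable event title, e.g. 'Quarterly planning call'.",
    },
    start: {
      type: "string",
      format: "date-time",
      description: "Event start in ISO 8601, e.g. 2026-05-04T09:00:00Z. Never a relative phrase like 'tomorrow'.",
    },
    visibility: {
      type: "string",
      enum: ["private", "team", "public"], // pick from the list instead of inventing a status
      description: "Who can see the event.",
    },
    attendees: {
      type: "array",
      items: { type: "string", format: "email" },
      maxItems: 50,
      description: "Attendee email addresses.",
    },
  },
  required: ["title", "start"], // everything else is deliberately optional
  additionalProperties: false,
} as const;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;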

&lt;p&gt;The same rigor that goes into the &lt;a href="https://dev.to/blog/structured-outputs-llm-developer-guide-2026"&gt;structured outputs developer guide&lt;/a&gt; belongs in tool schemas. Structured outputs and tool schemas are the same problem dressed differently: how do you make the model produce something machine-readable that your code can rely on. Tight schemas are the answer in both cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Error Returns Are Where The Agent Recovers Or Spins
&lt;/h2&gt;

&lt;p&gt;The hardest tool design problem is what the tool returns when it fails. A bad error message turns a recoverable mistake into a stuck agent. A good error message tells the agent exactly what to fix and try again with.&lt;/p&gt;

&lt;p&gt;The pattern is to return errors as structured objects, not as string blobs. An error with a code field, a message field, and ideally a hint field is something the model can pattern-match on. An error that just says "invalid input" is something the model has to interpret. The interpretation is sometimes correct and sometimes a guess.&lt;/p&gt;

&lt;p&gt;The hint field is the one that punches above its weight. When the tool rejects a call, say what the agent should do differently. "The user_id parameter must be a UUID. Try calling search_users_by_name first to get the UUID, then call this tool with that value." That is a hint that turns a stuck agent into a working one. Without it, the agent retries with another guess, then another, until it gives up or hits the iteration limit.&lt;/p&gt;

&lt;p&gt;Avoid errors that look like success. A tool that returns an empty array on a misspelled query is silently failing. The agent gets back zero results, assumes the query was correct and the answer is "nothing matched," and reports that to the user. The fix is to return an explicit "no results" object with a hint that the query may be wrong, not an empty array that looks the same as a successful empty result.&lt;/p&gt;
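
&lt;p&gt;A sketch of the shapes this section has been describing, with the field names as assumptions rather than any standard. The structured error is something the model can pattern-match on, and the explicit no-results object does not look like a successful empty result.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical error and "no results" shapes returned to the model.
type ToolError = {
  ok: false;
  code: string;    // stable and pattern-matchable
  message: string; // what went wrong
  hint?: string;   // what to do differently on the next call
};

const invalidUserId: ToolError = {
  ok: false,
  code: "invalid_user_id",
  message: "The user_id parameter must be a UUID.",
  hint: "Call search_users_by_name first to get the UUID, then call this tool with that value.",
};

// Distinguish "nothing matched" from a silent failure the agent will misread.
const noResults = {
  ok: true,
  results: [],
  note: "No articles matched. The query may be misspelled; try a shorter keyword phrase.",
};
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;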

&lt;p&gt;Avoid errors that are hostile. "An error occurred." "Something went wrong." These are the worst possible responses for an agent. The agent has nothing to act on. It will retry with the same input, fail again, and either give up or hallucinate an answer. Every error message is a chance to recover the run. Spend the words.&lt;/p&gt;

&lt;p&gt;The same observability shape I covered in &lt;a href="https://dev.to/blog/ai-agent-observability-debugging-production-2026"&gt;agent observability and debugging&lt;/a&gt; needs to extend to tool errors specifically. The tool error rate, segmented by tool and error code, is one of the most useful health signals an agent has. A spike in a specific error code on a specific tool is a fix you can ship. A spike in generic errors is a debugging session that will eat your week.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tool Granularity: The Goldilocks Problem
&lt;/h2&gt;

&lt;p&gt;How much should one tool do? I have broken this at both extremes, and the stock answer, "neither too much nor too little," is unhelpful, so let me be more specific. The right granularity is the unit of work a competent human would think of as one step.&lt;/p&gt;

&lt;p&gt;Too coarse: a single &lt;code&gt;manage_calendar&lt;/code&gt; tool that takes an &lt;code&gt;action&lt;/code&gt; parameter and dispatches to create, update, delete, or query depending on the value. The model has to decide which action it wants, then encode that into the parameter, then construct the right body for that action. The error surface is huge because the schema has to permit all possible action shapes. The descriptions have to cover four functions in one. The agent gets confused. Split it.&lt;/p&gt;

&lt;p&gt;Too fine: separate tools for &lt;code&gt;set_event_title&lt;/code&gt;, &lt;code&gt;set_event_start&lt;/code&gt;, &lt;code&gt;set_event_end&lt;/code&gt;, &lt;code&gt;set_event_attendees&lt;/code&gt;, where each one mutates one field of a calendar event. The agent has to chain six calls to do what a human thinks of as "create the event." The token cost goes up. The latency goes up. The chance of one of the six calls failing goes up. Combine them.&lt;/p&gt;

&lt;p&gt;The right grain is &lt;code&gt;create_calendar_event&lt;/code&gt;, &lt;code&gt;update_calendar_event&lt;/code&gt;, &lt;code&gt;delete_calendar_event&lt;/code&gt;, &lt;code&gt;list_calendar_events&lt;/code&gt;. Each is one verb-object pair. Each is one unit of work. Each has a focused schema. The agent picks one, calls it, and moves on. Four tools instead of one tool with four hidden modes, or twenty tools that are all the same thing.&lt;/p&gt;
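
&lt;p&gt;A compact sketch of the contrast, with hypothetical names. The coarse version hides four functions behind one action parameter and one permissive payload; the right grain gives each verb-object pair its own focused schema.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Too coarse: one tool, four hidden modes, one schema that has to permit all of them.
const manageCalendar = {
  name: "manage_calendar",
  parameters: {
    type: "object",
    properties: {
      action: { type: "string", enum: ["create", "update", "delete", "query"] },
      payload: { type: "object" }, // shape depends on action; the model has to guess it
    },
    required: ["action", "payload"],
  },
};

// The right grain: one verb-object pair per tool, each with its own focused schema.
const calendarTools = [
  "create_calendar_event",
  "update_calendar_event",
  "delete_calendar_event",
  "list_calendar_events",
];
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;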

&lt;p&gt;The exception is when the underlying API is genuinely composite and exposing the composite as one tool would force ugly schemas. In those cases, splitting is right. The test is whether the descriptions and schemas of the split tools are simpler than the description and schema of the combined tool. If they are, split. If they are not, combine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Authentication, Idempotency, And The Boring Things That Save You
&lt;/h2&gt;

&lt;p&gt;Tools that mutate state need to be safe to retry. The model will retry tools when it thinks the previous call did not work, and the previous call may have actually worked. If your &lt;code&gt;create_invoice&lt;/code&gt; tool is not idempotent, the agent will create duplicate invoices when the network blips, and the user will be unhappy.&lt;/p&gt;

&lt;p&gt;The pattern is to require an idempotency key on any state-mutating tool. The model can generate one and pass it. The tool stores the result of the first call against that key. The second call returns the same result without doing the work again. This is straight out of the payments world and it works just as well for agents. The same principle applies to anything that bills, sends notifications, or moves money.&lt;/p&gt;
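
&lt;p&gt;A minimal sketch of that pattern, assuming an in-memory store and a stubbed invoice writer standing in for real persistence and billing code.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;type Invoice = { id: string; customer_id: string; amount_cents: number };

// Stand-ins: swap for your real persistence and billing code.
const idempotencyStore = new Map&lt;string, Invoice&gt;();
async function writeInvoice(customerId: string, amountCents: number): Promise&lt;Invoice&gt; {
  return { id: crypto.randomUUID(), customer_id: customerId, amount_cents: amountCents };
}

async function createInvoice(args: {
  idempotency_key: string;
  customer_id: string;
  amount_cents: number;
}): Promise&lt;Invoice&gt; {
  // A retry with the same key returns the original result instead of creating a duplicate.
  const previous = idempotencyStore.get(args.idempotency_key);
  if (previous) return previous;

  const invoice = await writeInvoice(args.customer_id, args.amount_cents);
  idempotencyStore.set(args.idempotency_key, invoice);
  return invoice;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;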

&lt;p&gt;Authentication should be invisible to the model. Do not expose API keys as parameters. Do not require the agent to construct authorization headers. The host application holds the credentials, attaches them at call time, and the model never sees them. Every tool spec I have ever seen that exposed authentication to the model leaked credentials into traces, logs, or model outputs. Treat auth as plumbing, not as part of the contract.&lt;/p&gt;

&lt;p&gt;Permissions should be checked in the tool, not in the prompt. The prompt cannot enforce that the user is allowed to call the tool. The tool can. Pass the user identity into the tool, check the permission server-side, and reject the call if the user is not authorized. This is the same security model that I covered in &lt;a href="https://dev.to/blog/securing-ai-agents-production-2026"&gt;securing AI agents in production&lt;/a&gt;, and it applies to every tool that touches user data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Documentation By Example
&lt;/h2&gt;

&lt;p&gt;The single most effective addition to a tool spec is a worked example. One example, in the description, showing what a typical call looks like and what the response looks like. The model will pattern-match on the example more strongly than on the prose description. If the example is a keyword search, the model will produce keyword searches. If the example is a sentence, the model will produce sentences.&lt;/p&gt;

&lt;p&gt;Examples in the description belong on the tool itself, not in the prompt. The prompt is shared across the conversation. The tool description is loaded with the tool. Putting examples on the tool means every agent that uses the tool sees the examples, including agents you have not built yet. It is the cheapest way to ship a tool that other people on your team will use correctly.&lt;/p&gt;

&lt;p&gt;The format that has worked is: a one-line input example, a one-line output example, and a one-line gotcha if there is one. "Example: search_documentation with query='deploy webhook' returns up to five articles. Note: queries longer than ten words tend to underperform; prefer keyword phrases."&lt;/p&gt;

&lt;p&gt;That is three lines. Those three lines have prevented more bugs than any other piece of documentation I have written for an agent.&lt;/p&gt;

&lt;h2&gt;
  
  
  Versioning And Change Management
&lt;/h2&gt;

&lt;p&gt;Tool surfaces change. New parameters get added. Old ones get deprecated. The agent does not know that the v2 of your tool no longer accepts the &lt;code&gt;legacy_id&lt;/code&gt; parameter, because the agent was prompted with the v1 schema and you redeployed with v2 yesterday.&lt;/p&gt;

&lt;p&gt;The discipline is to version tool specs the way you version APIs. Major changes get a new tool name, not a silent breaking change. Minor changes that add optional parameters or relax constraints can go in place. Removing a parameter or changing its meaning is a major change. The agent's prompt cache should be invalidated when major changes ship, because the agent's instinct is going to be tuned to the old shape.&lt;/p&gt;

&lt;p&gt;The other piece is to monitor tool call rates per tool. A tool you deprecated should be at zero. A tool you launched should be ramping up. A tool whose call rate dropped to zero overnight is a tool the agent stopped reaching for, which usually means a description change made it look like a worse fit for the requests it was handling. The metric that catches this is per-tool call volume over time. The metric is boring. The bugs it catches are not.&lt;/p&gt;
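
&lt;p&gt;A small sketch of that metric using prom-client. The metric and label names are illustrative; any metrics client with labeled counters gives you the same per-tool call volume over time.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { Counter } from "prom-client";

// Per-tool call volume and outcomes; the names here are illustrative.
const toolCalls = new Counter({
  name: "agent_tool_calls_total",
  help: "Tool calls made by the agent, by tool name and outcome",
  labelNames: ["tool", "outcome"],
});

// One increment per tool call. A deprecated tool should trend to zero, a new one should ramp,
// and a sudden drop to zero on a live tool is worth an alert.
toolCalls.inc({ tool: "search_documentation", outcome: "ok" });
toolCalls.inc({ tool: "lookup_pricing", outcome: "error" });
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;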

&lt;h2&gt;
  
  
  What This Looks Like When It Works
&lt;/h2&gt;

&lt;p&gt;A well-designed tool surface for an agent has the shape of a small, focused API. Eight to fifteen tools is a comfortable range for a domain-specific agent. Each tool has a verb-object name. Each tool has a description that says what it does, when to use it, when not to use it, and what it returns. Each parameter is typed, constrained, and described. Each error is structured, coded, and hinted. The mutating tools are idempotent. The auth and permissions are server-side. There is at least one example per tool.&lt;/p&gt;

&lt;p&gt;The agent built on top of that surface picks the right tool the first time on the requests you have anticipated. It picks a reasonable tool on the requests you have not. When it picks wrong, the error tells it what to do next, and it recovers. The traces show short tool-call chains because each call does its job. The token cost is low because the schemas are tight. The latency is low because the calls are not retried.&lt;/p&gt;

&lt;p&gt;That is the agent you can ship. That is the agent that does not embarrass you in a customer call. The model is the same model everyone else is calling. The prompt is the same prompt everyone else is writing. The tool surface is the part you control, and it is the part that makes the agent yours.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where This Is Going
&lt;/h2&gt;

&lt;p&gt;The frontier models are getting better at handling messy tool specs. They are also getting better at composing tools that should never have been split. Both of those mean the cost of bad tool design is dropping, but it is not zero, and the gap between an agent built on tight tools and an agent built on loose tools is still wider than the gap between any two recent model versions. Investing in tool design pays off across model upgrades. Investing in prompt tricks does not always.&lt;/p&gt;

&lt;p&gt;The other shift is that tool specs are starting to be shared across agents the way SDKs are shared across applications. The MCP protocol I covered in &lt;a href="https://dev.to/blog/mcp-model-context-protocol-developer-guide-2026"&gt;the MCP developer guide&lt;/a&gt; is one expression of this. A tool you design well can be reused. A tool you design badly is a liability that ships with every agent that imports it. The half-life of a tool spec is now longer than the half-life of a model version, and that is a good reason to spend more time on the spec.&lt;/p&gt;

&lt;p&gt;The thing that is not changing is that the model can only work with what you hand it. The whole job of tool design is making sure what you hand it is something a competent agent can use. The discipline is the discipline of any good API. The reward is an agent that works in production, on the first try, on requests you did not write the prompt for. That, more than any model upgrade, is the difference between an agent demo and an agent product.&lt;/p&gt;

&lt;p&gt;If your agent is failing in ways that look like the model is the problem, look at the tools first. The model is almost never the limiting factor. The tools almost always are.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>devtools</category>
      <category>productivity</category>
    </item>
    <item>
      <title>AI Agent Reliability Engineering in 2026: SLOs, Error Budgets, And Failure Modes That Actually Matter</title>
      <dc:creator>Alex Cloudstar</dc:creator>
      <pubDate>Mon, 04 May 2026 08:02:29 +0000</pubDate>
      <link>https://dev.to/alexcloudstar/ai-agent-reliability-engineering-in-2026-slos-error-budgets-and-failure-modes-that-actually-527m</link>
      <guid>https://dev.to/alexcloudstar/ai-agent-reliability-engineering-in-2026-slos-error-budgets-and-failure-modes-that-actually-527m</guid>
      <description>&lt;p&gt;The dashboard said the agent was at 99.4 percent uptime for the quarter. The customer told me, on the same call where I was about to celebrate that number, that the feature had been broken for him for three weeks. He had stopped using it. He was not going to renew. The agent was returning two-hundreds the entire time. The HTTP layer was fine. The thing the agent was supposed to actually do, which was generate a report he could ship to his client, was not working at all. The model had silently regressed when we swapped a cheaper variant in. The pipeline carried on returning success codes for outputs nobody could use.&lt;/p&gt;

&lt;p&gt;That call ended my career as a person who measures AI agent reliability with traditional service metrics. The numbers we had been shipping to the leadership deck were technically correct and operationally meaningless. The agent was up. The agent was also broken. Both can be true. The reliability framework I had inherited from a decade of regular service work could not see the difference, and I had to build one that could.&lt;/p&gt;

&lt;p&gt;Two years on, the patterns for measuring and improving AI agent reliability have stabilized enough that I trust them. They are not the same as the SRE playbook for normal services, and trying to retrofit one onto the other is the most common reason teams ship reliability dashboards that do not match user reality. This is what actually works.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Traditional Reliability Numbers Lie About Agents
&lt;/h2&gt;

&lt;p&gt;The reason a 200 OK does not mean an agent worked is that the agent is doing more than serving a request. It is making decisions. It is calling tools. It is generating outputs that have to be useful, not just well-formed. None of that is captured by an HTTP status code, a latency histogram, or a process uptime number.&lt;/p&gt;

&lt;p&gt;A traditional service has a small number of failure modes. The process crashes. The database is unreachable. The deploy was bad. The disk is full. Each of these has a clear signal and a clear remediation. The reliability engineering for these failure modes is mature, and tools like Prometheus and PagerDuty solve most of it.&lt;/p&gt;

&lt;p&gt;An agent has all of those failure modes plus a long list of new ones. The model regresses on a class of inputs after a provider-side update. The tool call returns the right shape but the wrong content. The retrieval pipeline pulls a stale document. The prompt template gets an extra newline that breaks the JSON-mode parsing. A schema validator was relaxed during a deploy and now garbage is flowing through. The user phrased the request in a way that hits a known weak spot. None of these surface as 500s. They surface as outputs that look fine to the system and wrong to the user.&lt;/p&gt;

&lt;p&gt;The reliability engineering for these failure modes is not as mature as it should be by 2026, but the patterns have started to converge. The headline insight is that you have to measure outcome, not just throughput. A request that succeeds at the HTTP layer and fails at the task layer is still a failure. If your dashboard cannot see that, your dashboard is going to lie to you about the user experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Layers Of An Agent SLO
&lt;/h2&gt;

&lt;p&gt;The reliability target for an agent is not one number. It is at least three, stacked, and they have to be tracked separately because they fail in different ways and have different remediations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Service-level reliability.&lt;/strong&gt; Did the request hit the agent and come back with a non-error response in a reasonable time. This is the layer your existing tooling already covers. The HTTP success rate, the p95 latency, the deploy success rate. Necessary but not sufficient. A target of 99.5 percent here is conventional and reasonable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output validity.&lt;/strong&gt; Did the agent return something that conforms to the contract it was supposed to return. JSON that parses. Tool calls with the right schema. Outputs that pass the type check before they get rendered. This is the layer where most teams realize the gap exists. A 200 with malformed JSON is not a success. The target here should usually be tighter than the service-level reliability, because the failures here often surface to the user as broken UI. I tend to target 99.9 percent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Task success.&lt;/strong&gt; Did the agent actually do the thing the user wanted. This is the layer that takes real eval work to measure, because "did the user get value" is fuzzier than "did the JSON parse." The tools for this in 2026 are evals run on a sample of production traffic, with grading by either a human, a verifier program, or another LLM. The target here is product-dependent, but for serious applications it is rarely below 95 percent and often higher. The same eval discipline I covered in &lt;a href="https://dev.to/blog/ai-evals-solo-developers-2026"&gt;AI evals for solo developers&lt;/a&gt; is what makes this measurable in the first place.&lt;/p&gt;

&lt;p&gt;The reason all three are needed is that they fail independently. A model regression can collapse task success while service-level reliability stays at 100 percent. A bad deploy can collapse service-level reliability while task success is unaffected on the requests that actually go through. A schema change can collapse output validity while the other two are fine. If you only track one of these, you get a partial view of reality, and partial views are how customers churn while your dashboard is green.&lt;/p&gt;
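
&lt;p&gt;A minimal sketch of tracking the three layers as separate rates. The record shape is invented for illustration; the important part is that task success is graded on a sample after the fact, never inferred from the other two.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// One record per request; the three layers are tracked separately because they fail independently.
type ReliabilityRecord = {
  serviceOk: boolean;    // non-error response within the latency budget
  outputValid: boolean;  // parsed, schema-checked, type-checked before rendering
  taskSuccess?: boolean; // graded later on a sample, by a human, a verifier, or an LLM judge
};

function summarize(records: ReliabilityRecord[]) {
  const rate = (xs: boolean[]) =&gt; xs.filter(Boolean).length / Math.max(xs.length, 1);
  const graded = records.filter((r) =&gt; r.taskSuccess !== undefined);
  return {
    serviceLevel: rate(records.map((r) =&gt; r.serviceOk)),          // e.g. target 99.5 percent
    outputValidity: rate(records.map((r) =&gt; r.outputValid)),      // e.g. target 99.9 percent
    taskSuccess: rate(graded.map((r) =&gt; r.taskSuccess === true)), // product-dependent, rarely below 95
  };
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;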

&lt;h2&gt;
  
  
  Error Budgets That Match The Reality
&lt;/h2&gt;

&lt;p&gt;The classic SRE error budget assumes that failures are independent, attributable, and roughly evenly distributed in time. None of that is true for agent failures.&lt;/p&gt;

&lt;p&gt;A model regression after a provider-side update is not independent. It hits every request in the affected class until you switch models. A retrieval pipeline failure correlates across users who happen to query the same stale documents. A prompt template change ships at one moment and affects every request after it. The error budget burns in spikes, not in smooth curves, and the alerting has to reflect that.&lt;/p&gt;

&lt;p&gt;The pattern that has worked is to set separate error budgets for each of the three SLO layers and to track burn rate, not just total burn. A burn rate that goes from 1x to 10x over an hour is the signal that something just broke, even if the absolute burn is still within budget. Alert on the rate, not on the total. The total tells you the story after the fact. The rate tells you the story while there is still time to act.&lt;/p&gt;
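
&lt;p&gt;A sketch of the burn-rate arithmetic over a single short window. Real alerting usually pairs a short window with a long one, but the core calculation is the same; the numbers here are made up.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Burn rate: observed failure rate divided by the failure rate the SLO allows.
// At 1x the budget lasts exactly the SLO window; at 10x it is gone in a tenth of the time.
function burnRate(failed: number, total: number, sloTarget: number): number {
  if (total === 0) return 0;
  const observedFailureRate = failed / total;
  const allowedFailureRate = 1 - sloTarget; // e.g. 0.001 for a 99.9 percent target
  return observedFailureRate / allowedFailureRate;
}

// Alert on the short-window rate, not the cumulative total.
const lastHour = burnRate(75, 5_000, 0.999); // 0.015 observed vs 0.001 allowed = 15x
if (lastHour &gt; 10) {
  // page someone: something just broke, even if the monthly budget is not exhausted yet
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;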

&lt;p&gt;The other adjustment is that the error budget for task success has to be reset more often than for service reliability. A model upgrade, a prompt template change, a tool addition, any of these can shift the underlying success rate. If you carry over a budget calculated against the old behavior, you will spend it in a week and have nothing left for the rest of the month. I tend to reset task success budgets after any meaningful change to the agent's underlying components, with a fresh measurement of the baseline before declaring the new budget.&lt;/p&gt;

&lt;p&gt;The last adjustment is that the budget should account for the cost of the failure, not just the count. A failure on a free-tier user is not the same as a failure on an enterprise user. A failure that the user can retry is not the same as a failure that loses their work. Weighted budgets, where high-stakes failures count for more, force the team to triage by impact instead of by volume, and that prioritization is what keeps the worst failures from being deprioritized just because they are rare.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Failure Modes Worth Naming
&lt;/h2&gt;

&lt;p&gt;Treating "the agent is broken" as a single failure mode is what produces incident reviews that go nowhere. The reality is that there are a small number of distinct failure modes, each with its own signal, its own remediation, and its own postmortem shape. Naming them is what lets the team build a runbook.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model regression.&lt;/strong&gt; The model you are calling has changed behavior on a class of inputs. The output validity rate or the task success rate drops on the affected bucket. The fix is to pin to a specific model version, switch providers, or roll forward with a new prompt that handles the changed behavior. The detection is your eval running on production traffic and noticing the drop. The runbook step is to compare current outputs against a holdout set from the last known-good period.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool failure.&lt;/strong&gt; A tool the agent calls is returning errors, returning the wrong shape, or returning data that is stale or wrong. The output validity may stay high if the agent recovers gracefully. The task success will drop because the agent is operating on bad inputs. Detection is per-tool error rates and per-tool semantic checks. The runbook step is to verify the tool independently of the agent, isolating whether the issue is the tool or the agent's use of it. This is the same observability shape I covered in &lt;a href="https://dev.to/blog/ai-agent-observability-debugging-production-2026"&gt;agent observability and debugging&lt;/a&gt;, and most of the recurring tool failures trace back to the &lt;a href="https://dev.to/blog/ai-agent-tool-design-2026"&gt;tool design choices&lt;/a&gt; made before the agent shipped.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retrieval drift.&lt;/strong&gt; The retrieval pipeline is returning documents that are stale, irrelevant, or duplicated. The agent's outputs feel slightly off. The user does not always notice individual failures, but renewal numbers slip. Detection requires sampling retrieval results and grading them. The runbook step is to verify the index freshness, the embedding pipeline, and the similarity thresholds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt regression.&lt;/strong&gt; A change to the prompt template, often well-intentioned, has broken a class of requests. The window between deploy and detection is the danger zone. Detection is an eval that runs on every prompt change and an alert on task success rate after deploys. The runbook step is to revert the prompt change and triage in a non-production environment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Schema drift.&lt;/strong&gt; The agent is returning outputs that pass the looser validators but fail the stricter ones, or that have started to drift from the expected shape. Detection is a strict schema validator running on a sample of production outputs and surfacing drift before the looser one starts letting bad data through. The runbook step is to tighten the validator and rerun the eval.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Provider outage.&lt;/strong&gt; The provider is returning errors, rate limiting, or timing out. The fallback should pick this up. The signal that something is wrong is the fallback firing rate going up. Detection is the router's own metrics. The runbook step is to verify the fallback is actually working and to switch primary providers if the outage is sustained. The patterns I covered in the &lt;a href="https://dev.to/blog/llm-router-model-routing-fallbacks-2026"&gt;LLM router pattern guide&lt;/a&gt; are what make this a runbook step instead of an incident.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost spike.&lt;/strong&gt; The bill is climbing faster than the traffic. Something has changed in the cost shape of the work. A new prompt is longer than the old one. A bucket is escalating to the expensive model more often than expected. A user has discovered a way to drive up token usage. Detection is per-bucket and per-user cost dashboards with alerts on derivative changes. The runbook step is to identify the cost source and either contain it, optimize it, or surface it as a billing issue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hallucination.&lt;/strong&gt; The agent is producing outputs that look right and are wrong. This is the hardest failure mode to detect because the surface signals are clean. Detection requires either a verifier that catches the specific class of hallucination (a tool call that references a non-existent file, a citation that does not match the source, a number that does not appear in the input) or a sampled review by a human. The runbook step is to harden the verifier and to retrain or reprompt against the failure mode.&lt;/p&gt;

&lt;p&gt;Each of these has a different signature, a different signal, and a different remediation. The runbook should have a section for each. Pattern matching the symptom to the failure mode is the first step. Without a named failure mode, the team is in "the agent is broken" mode, and that mode does not converge on a fix.&lt;/p&gt;

&lt;h2&gt;
  
  
  Drills That Find The Bugs Before Production Does
&lt;/h2&gt;

&lt;p&gt;Most agent reliability bugs hide until production traffic finds them. The reason is that the input space is large and the test traffic is usually small. The fix is to run drills that simulate real failure modes and verify the system handles them.&lt;/p&gt;

&lt;p&gt;The drills that have caught the most for me.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Provider outage drill.&lt;/strong&gt; Take the primary provider offline in staging. Run a real traffic pattern. Verify the fallback fires, the latency stays within budget, and the task success rate stays above the SLO. The first time you run this, something will be missing. A key not configured. A timeout set wrong. A fallback model that does not actually exist. Better to find it on Tuesday afternoon than during the actual outage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model regression drill.&lt;/strong&gt; Swap the model behind a bucket to a deliberately weaker variant. Run the eval. Verify the alerting fires before the budget is exhausted. The drill verifies that your eval-based detection is connected to your alerting, which is the part that almost always has a gap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool failure drill.&lt;/strong&gt; Make a tool return errors, then make it return malformed responses, then make it return slow responses. Each is a different failure shape. Verify the agent handles each gracefully and the metrics surface the failure correctly. The slow-response case in particular tends to cause subtle bugs where requests pile up and timeouts compound.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost runaway drill.&lt;/strong&gt; Simulate a user driving heavy traffic to an expensive bucket. Verify the cost dashboards alert. Verify the rate limiting kicks in before the budget is blown. Verify the postmortem path includes attributing the cost to the user. The first time someone runs this drill, the cost alerts are usually slower than they should be, and the rate limiting is often missing entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt change drill.&lt;/strong&gt; Ship a prompt change to a staging environment with a deliberately broken section. Verify the eval catches it. Verify the rollout pauses or rolls back automatically. The drill is about verifying that your deployment process for prompt changes is as careful as your deployment process for code changes, which is rarely the case by default.&lt;/p&gt;

&lt;p&gt;The shape of the drill is always the same. Force a known failure. Verify the system detects it. Verify the system mitigates it. Verify the runbook for handling it actually works. Repeat on a schedule. The drill calendar is what turns a reliability claim into a reliability fact.&lt;/p&gt;
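
&lt;p&gt;For the tool failure drill specifically, a minimal fault-injection wrapper is enough to force all three shapes. The &lt;code&gt;FaultMode&lt;/code&gt; type and the wrapper are invented for illustration, not part of any framework.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Wrap a real tool handler and force one failure shape at a time during the drill.
type FaultMode = "none" | "error" | "malformed" | "slow";

function withFault(handler: (args: unknown) =&gt; Promise&lt;unknown&gt;, mode: FaultMode) {
  return async (args: unknown) =&gt; {
    if (mode === "error") throw new Error("injected tool failure");
    if (mode === "slow") await new Promise((r) =&gt; setTimeout(r, 30_000)); // long enough to hit timeouts
    const result = await handler(args);
    if (mode === "malformed") return { unexpected: "shape" };             // wrong shape, successful status
    return result;
  };
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;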

&lt;h2&gt;
  
  
  Observability That Connects Layers
&lt;/h2&gt;

&lt;p&gt;The observability that supports all of this has to span the three SLO layers, not just one. A trace that shows the HTTP request and the latency is not enough. The trace has to include the prompt, the tool calls, the retrieval results, the model used, the validator results, and the final output. Without that, debugging a task-level failure means reproducing it manually, which is the slow path.&lt;/p&gt;

&lt;p&gt;The minimum I want to see in a production agent trace.&lt;/p&gt;

&lt;p&gt;The full prompt that was sent, including the system prompt, the user message, and any context. Redacted as needed for privacy, but not stripped to the point of being unhelpful.&lt;/p&gt;

&lt;p&gt;Every tool call, with the tool name, the arguments, the result, and the time taken. Tool calls are where most agent bugs live, and a trace without tool detail is missing the most useful part.&lt;/p&gt;

&lt;p&gt;The model used and the version. If the router picked a different model than the default, the reason. The cost incurred. The token counts.&lt;/p&gt;

&lt;p&gt;The validator results. Did the output pass schema validation. Did it pass any semantic checks. Did the verifier reject it and trigger a fallback.&lt;/p&gt;

&lt;p&gt;The final output that was returned to the user. The thing the user actually saw. Without this, you cannot reproduce the user's experience.&lt;/p&gt;

&lt;p&gt;The user identifier and the request bucket. Both are needed for cohort analysis when failures correlate with user segment or with workload type.&lt;/p&gt;

&lt;p&gt;The shape that has won is OpenTelemetry traces with custom attributes for the agent-specific fields. The infrastructure for normal services already understands the trace format, and the custom attributes give you the agent-specific context. Most observability platforms can ingest these without bespoke work, and the analysis tools that have grown up around traces work for agent debugging without much adaptation.&lt;/p&gt;
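
&lt;p&gt;A sketch of what that looks like with the OpenTelemetry JavaScript API. The attribute keys are assumptions rather than an official semantic convention, and the &lt;code&gt;redact&lt;/code&gt; helper is a stand-in for whatever redaction you already do.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { trace } from "@opentelemetry/api";

const tracer = trace.getTracer("agent");
const redact = (s: string) =&gt; s.slice(0, 2_000); // stand-in for real redaction

async function handleRequest(userId: string, bucket: string, prompt: string) {
  return tracer.startActiveSpan("agent.request", async (span) =&gt; {
    try {
      span.setAttribute("agent.user_id", userId);
      span.setAttribute("agent.bucket", bucket);
      span.setAttribute("agent.prompt", redact(prompt));
      // ... call the model; record each tool call as a child span with name, args, result, duration ...
      span.setAttribute("agent.model", "provider/model-version");
      span.setAttribute("agent.validator.schema_ok", true);
      return "final output returned to the user";
    } finally {
      span.end();
    }
  });
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;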

&lt;h2&gt;
  
  
  The Postmortem Discipline That Actually Helps
&lt;/h2&gt;

&lt;p&gt;Postmortems for agent incidents are different from postmortems for service incidents. The traditional template assumes a deterministic system and a clear root cause. Agent incidents often have several contributing factors and a fuzzy root cause that is more like "the model started doing this for these reasons."&lt;/p&gt;

&lt;p&gt;The postmortem fields that have produced useful changes after agent incidents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which SLO was breached.&lt;/strong&gt; Service, output validity, or task success. Each implies a different remediation surface.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which failure mode it was.&lt;/strong&gt; From the named list. If it does not fit a named mode, the postmortem produces a new mode and adds it to the list.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The detection lag.&lt;/strong&gt; The time from when the failure started to when the team knew. Long detection lag is a signal that the metrics or the alerts need work, regardless of what caused the failure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The mitigation lag.&lt;/strong&gt; The time from detection to a contained state. Long mitigation lag is a signal that the runbook needs work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The blast radius.&lt;/strong&gt; Which users were affected, what they saw, whether they got a clean error or an incorrect output, whether they retried, whether they churned. Agent failures often produce silent damage that the metrics do not capture, and the postmortem has to surface that explicitly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The eval delta.&lt;/strong&gt; What the eval looked like before and after the incident. Did the eval catch the failure, did it miss it, did the eval need to be updated. The eval is part of the system. When it fails, that is part of the postmortem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The followups.&lt;/strong&gt; Specific, dated, owned. Drills to add. Alerts to tighten. Runbooks to update. Validators to harden. The followups are the only output of the postmortem that changes the system. The narrative is for sharing context. The followups are for fixing things.&lt;/p&gt;

&lt;p&gt;The discipline that makes this work is treating the postmortem as the input to the next round of reliability work, not as a closing artifact for the incident. Every incident produces material for the next sprint of reliability improvements. The agents that get more reliable over time are the ones whose teams have a steady drip of these improvements landing.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Does Not Carry Over From Traditional SRE
&lt;/h2&gt;

&lt;p&gt;A few patterns from the SRE playbook do not work for agents and should be skipped.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Five-nines targets.&lt;/strong&gt; The math that makes 99.999 percent reliability achievable in traditional services does not work when the underlying model has a non-zero error rate that you do not control. Aim for the highest reliability that the business actually needs and do not chase numbers that the underlying components cannot deliver.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pure synthetic monitoring.&lt;/strong&gt; A synthetic prompt run every minute will tell you the agent is alive. It will not tell you the agent is doing useful work on the actual traffic mix you serve. Sample real traffic for the eval signal. Use synthetic monitoring for the service layer only.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strict deployment gates on latency alone.&lt;/strong&gt; A change that improves latency by 10 percent and drops task success by 5 percent is a regression, not a win. The deployment gates have to include task success, not just latency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Identical staging environments.&lt;/strong&gt; A staging environment with a different model, a smaller dataset, or a synthetic traffic generator does not reproduce the failure modes of production. Either invest in staging that mirrors production or accept that some failures will only appear in production and build the rollback story for that case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Treating the model as infrastructure.&lt;/strong&gt; It is a dependency, but it is a dependency that changes behavior on its own schedule and that does not have a release notes page that captures all the relevant changes. Pin where you can, monitor where you cannot, and assume the dependency will surprise you on a regular basis.&lt;/p&gt;

&lt;p&gt;The summary is that the framework looks similar but the parameters are different. The names of the artifacts (SLO, error budget, postmortem, runbook) carry over. The contents of those artifacts have to be rebuilt for the agent context.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Looks Like When It Works
&lt;/h2&gt;

&lt;p&gt;A team running this discipline has a reliability dashboard that has three numbers, not one. The service number is high and steady. The output validity number is high and twitches occasionally on schema changes. The task success number is the one with the most history and the most attention, and it is the one the leadership cares about.&lt;/p&gt;

&lt;p&gt;The team has a runbook with named failure modes, each with a detection signal and a remediation. New failures get added to the runbook after each postmortem. The runbook is the living artifact, not the dashboard.&lt;/p&gt;

&lt;p&gt;The team runs drills on a schedule. Provider outage, model regression, tool failure, cost spike. The drills find one or two issues each time. The drills do not stop. The first time a drill finds nothing in three rounds is the signal that the drills have stopped being aggressive enough.&lt;/p&gt;

&lt;p&gt;The team has eval gates on every prompt change, every model change, every tool change. The gates are integrated with the deployment pipeline. A prompt change that fails the eval does not ship.&lt;/p&gt;

&lt;p&gt;The team has cost dashboards that surface spikes by bucket and by user. Cost is treated as a reliability concern, because a runaway cost is an outage of the business model, even if the service is up.&lt;/p&gt;

&lt;p&gt;The team writes postmortems that produce followups. The followups land in the sprint. The next set of incidents rarely repeats the patterns of the last set, because the patterns get fixed.&lt;/p&gt;

&lt;p&gt;This is not glamorous work. It is the same kind of unsexy reliability discipline that has kept normal services up for decades, adapted for the new failure surface that agents introduce. The teams that take it seriously ship products that work for years. The teams that do not get to live the experience I had on that customer call, where the dashboard says one thing and the customer says another and the customer is the one who is right.&lt;/p&gt;

&lt;p&gt;The dashboard I run now would have caught that quarter's regression on day three. The customer would not have spent three weeks on a broken feature. The renewal would still have been at risk for other reasons (the product is hard), but it would not have been at risk for that one. That is what reliability engineering for agents buys you. Not perfection. Just the chance to know what is actually happening in time to do something about it. The pattern is the floor, not the ceiling, and every agent product I ship now starts from it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>reliability</category>
      <category>architecture</category>
      <category>devtools</category>
    </item>
  </channel>
</rss>
