<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jangwook Kim</title>
    <description>The latest articles on DEV Community by Jangwook Kim (@jangwook_kim_e31e7291ad98).</description>
    <link>https://dev.to/jangwook_kim_e31e7291ad98</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1909290%2F60a8c15f-b2b5-4189-8578-78b8ab78900b.jpg</url>
      <title>DEV Community: Jangwook Kim</title>
      <link>https://dev.to/jangwook_kim_e31e7291ad98</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jangwook_kim_e31e7291ad98"/>
    <language>en</language>
    <item>
      <title>Project Polaris: GitHub Copilot's New MoE Coding Model</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Thu, 04 Jun 2026 08:19:13 +0000</pubDate>
      <link>https://dev.to/jangwook_kim_e31e7291ad98/project-polaris-github-copilots-new-moe-coding-model-ji8</link>
      <guid>https://dev.to/jangwook_kim_e31e7291ad98/project-polaris-github-copilots-new-moe-coding-model-ji8</guid>
      <description>&lt;p&gt;Microsoft used Build 2026 to do something most people didn't see coming: replace the OpenAI model powering GitHub Copilot with one they built themselves.&lt;/p&gt;

&lt;p&gt;Project Polaris, announced June 2, 2026 at Fort Mason Center in San Francisco, is Microsoft's in-house Mixture-of-Experts (MoE) coding model. From August 2026 it becomes the default engine for Copilot Pro subscribers, ending the platform's dependence on GPT-4 Turbo and giving Microsoft end-to-end ownership of its most widely used developer tool. The move lands at a moment when Copilot's market position is under real pressure. Here is what you need to know.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters Now
&lt;/h2&gt;

&lt;p&gt;GitHub Copilot was the dominant AI coding tool as recently as a year ago, capturing around 67% of professional developers surveyed. That number has slid to 51%. Claude Code entered the same survey for the first time and immediately landed at 10%. Among senior developers with ten or more years of experience, the preference gap is sharper: 46% choose Claude Code versus 9% for Copilot.&lt;/p&gt;

&lt;p&gt;Microsoft's response is not just a model swap. Project Polaris is accompanied by a broader re-architecture of Copilot: multi-agent support in VS Code, Copilot Workspace going generally available, new autonomous modes, and a dedicated sandbox environment for agent tasks. Polaris is the engine; Build 2026 announced the whole vehicle.&lt;/p&gt;

&lt;p&gt;The strategic logic is straightforward. Running GPT-4 Turbo through OpenAI means Microsoft pays per token to a partner whose own products, such as ChatGPT, compete for the same budget. Polaris runs on Microsoft's custom Maia AI accelerators inside Azure, removing that dependency and letting Microsoft control inference latency and cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Project Polaris Is, Architecturally
&lt;/h2&gt;

&lt;p&gt;Project Polaris is a Mixture-of-Experts model. MoE architectures route each input token through a small subset of specialized sub-networks (called "experts") rather than the entire model, so only a fraction of the total parameters is active for any given token at inference time. This cuts compute cost while keeping model capacity high for the domains where the active experts specialize.&lt;/p&gt;
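
&lt;p&gt;To make the routing idea concrete, here is a minimal sketch of top-k expert routing in TypeScript. Microsoft has not published the Polaris router, so the expert names, the gate scores, and the choice of k = 2 are illustrative assumptions, not the model's actual configuration.&lt;/p&gt;

```typescript
// Hypothetical expert labels; Polaris's real experts are not public.
const experts = ["rust", "haskell", "go", "python"];

// Softmax turns raw gate scores into routing probabilities.
function softmax(scores: number[]): number[] {
  const max = Math.max(...scores);
  const exps = scores.map((s) => Math.exp(s - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

// Keep the k most probable experts and renormalize their weights to sum to 1.
function routeTopK(scores: number[], k: number): { index: number; weight: number }[] {
  const ranked = softmax(scores)
    .map((weight, index) => ({ index, weight }))
    .sort((a, b) => b.weight - a.weight)
    .slice(0, k);
  const kept = ranked.reduce((sum, r) => sum + r.weight, 0);
  return ranked.map((r) => ({ index: r.index, weight: r.weight / kept }));
}

// Made-up gate scores for a single token.
const routed = routeTopK([2.0, 0.1, 1.5, -1.0], 2);
console.log(routed.map((r) => `${experts[r.index]}: ${r.weight.toFixed(2)}`));
```

&lt;p&gt;With these made-up scores, only the two highest-scoring experts receive the token. The other experts do no work for it, which is where the compute saving comes from.&lt;/p&gt;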

&lt;p&gt;What Microsoft has done with Polaris is tune those experts around programming languages, frameworks, and paradigms. Each sub-module handles a distinct code domain. The upshot, according to Microsoft, is that Polaris outperforms GPT-4 Turbo on HumanEval and MBPP — the two most common coding benchmark sets — with particularly large gains in Rust, Haskell, and Go.&lt;/p&gt;

&lt;p&gt;Those three languages share a characteristic: relative scarcity of public training data compared to Python, JavaScript, or Java. GPT-4 models are heavily optimized for high-resource language contexts, so a domain-expert MoE approach should theoretically close that gap, especially if Microsoft's internal code corpus leans toward enterprise-grade Rust and Go. Microsoft has not published the specific HumanEval/MBPP percentage scores; the outperformance claim is from their own Build presentation and has been consistently reported across tech outlets but has not yet been independently verified.&lt;/p&gt;

&lt;p&gt;Inference runs on Azure Maia AI accelerators. Microsoft designed Maia specifically for their own workloads, and running Polaris on Maia instead of third-party GPU fleets is expected to reduce per-inference latency and operational cost. Faster inference matters for the interactive autocomplete use case where latency directly affects the feel of the tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Changes in August 2026
&lt;/h2&gt;

&lt;p&gt;The transition from GPT-4 Turbo to Polaris happens automatically for Copilot Pro subscribers in August 2026. Microsoft is offering a three-month opt-back period for teams that want to stay on GPT-4 while they evaluate the new model.&lt;/p&gt;

&lt;p&gt;For Pro tier users, the move also unlocks:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;100,000-line multi-file context.&lt;/strong&gt; The current context window in Copilot limits how much of your codebase the model can see at once. The Pro tier with Polaris expands this to 100,000 lines, which changes what kinds of multi-file refactoring and cross-repo tasks are feasible. A large monorepo service with interconnected packages has typically been too large to fit in one Copilot session. That constraint loosens significantly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Autonomous test generation.&lt;/strong&gt; Polaris includes built-in autonomous test generation tuned for the model's strongest language domains. This goes beyond completion-style test suggestions: the model reasons about what to test, generates the test scaffold, and iterates. Microsoft has not published specific coverage improvement numbers.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Copilot Pro (current)&lt;/th&gt;
&lt;th&gt;Copilot Pro with Polaris (Aug 2026)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Default model&lt;/td&gt;
&lt;td&gt;GPT-4 Turbo&lt;/td&gt;
&lt;td&gt;Project Polaris (MoE)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inference infra&lt;/td&gt;
&lt;td&gt;OpenAI API&lt;/td&gt;
&lt;td&gt;Azure Maia accelerators&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-file context&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;100,000 lines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test generation&lt;/td&gt;
&lt;td&gt;Suggestion-only&lt;/td&gt;
&lt;td&gt;Autonomous generation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rust / Haskell / Go&lt;/td&gt;
&lt;td&gt;Weaker&lt;/td&gt;
&lt;td&gt;Improved (MoE specialization)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4 fallback&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;3-month opt-back period&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Teams that have already aligned their workflows around GPT-4 Turbo's specific behavior — prompt patterns, response formatting, failure modes — should run Polaris in parallel on a representative sample of tasks before the automatic migration, rather than discovering regressions after the switch. The three-month fallback window exists precisely for this.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Broader Copilot Overhaul at Build 2026
&lt;/h2&gt;

&lt;p&gt;Project Polaris was not the only Copilot announcement at Build. Microsoft shipped several capabilities alongside it that together reposition Copilot from a completion tool to a more autonomous coding agent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Copilot Workspace: Generally Available.&lt;/strong&gt; Workspace went GA at Build after a long preview. It lets Copilot reason across an entire repository, propose multi-file edits, run tests in a sandbox, and iterate on a scoped task autonomously. The session interface is closer to issuing a specification than to typing a prompt: you describe what you want the codebase to do differently, and Workspace plans and executes the changes, presenting a diff for review. This pairs naturally with Polaris's 100K-line context window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-agent VS Code.&lt;/strong&gt; GitHub Copilot multi-agent support launched for Visual Studio Code at Build. Multiple specialized Copilot agents can now coordinate inside a single VS Code session, handling different parts of a task in parallel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fleet mode and Autopilot mode.&lt;/strong&gt; Fleet mode lets Copilot CLI operate autonomously on narrowly defined codebase tasks without step-by-step confirmation. Autopilot mode schedules that autonomous operation as a background job: define the task, hand it to Copilot, come back when it's done. Both are available now for Copilot CLI users.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Autonomous Agent Mode (Enterprise, July 2026).&lt;/strong&gt; Starting July 2026, GitHub Copilot Enterprise customers can enable Autonomous Agent Mode. The platform writes, tests, and commits entire feature branches. An Agent Sandbox spins up an ephemeral Linux container for each task, isolating the agent from the production repository until a developer reviews and merges the resulting pull request.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Copilot Extensions.&lt;/strong&gt; Ecosystem integrations for Jira, Datadog, and ServiceNow are now callable from within an active Workspace session, making those tools accessible without leaving the Copilot interface.&lt;/p&gt;

&lt;h2&gt;
  
  
  How This Stacks Up Against Competitors
&lt;/h2&gt;

&lt;p&gt;The honest picture is that Claude Code and Cursor have taken ground from Copilot in 2026, and Project Polaris is partly a direct response.&lt;/p&gt;

&lt;p&gt;Claude Code's strength comes from Claude's underlying coding performance on complex multi-step tasks and its tight integration with terminal and repository contexts. Cursor's strength is interface: a purpose-built IDE experience rather than an extension layered onto VS Code. GitHub Copilot's strength has historically been distribution: 150 million GitHub users, seamless integration into the GitHub ecosystem, and enterprise relationships Microsoft already has.&lt;/p&gt;

&lt;p&gt;Project Polaris is a bet that distribution advantage can be maintained by closing the performance gap. The MoE approach addresses one specific weakness — low-resource language quality — while the 100K-line context and agent modes address the workflow gap. Whether the benchmarks hold up in production use by engineering teams will become clearer after August.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Strengths&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MoE specialization meaningfully improves Rust, Haskell, and Go, languages where GPT-4 has always been weaker&lt;/li&gt;
&lt;li&gt;100,000-line context is a real capability jump for monorepo workflows&lt;/li&gt;
&lt;li&gt;Running on Maia means Microsoft controls the inference stack end-to-end, with potential latency and cost improvements&lt;/li&gt;
&lt;li&gt;Three-month GPT-4 fallback reduces migration risk for enterprise teams&lt;/li&gt;
&lt;li&gt;Agent Sandbox (ephemeral Linux container) is a sensible isolation pattern for autonomous commits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Limitations&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Benchmark numbers are Microsoft-reported only; independent verification hasn't happened yet&lt;/li&gt;
&lt;li&gt;No model weights and no standalone API: teams evaluating Polaris can only test it through the Copilot product interface&lt;/li&gt;
&lt;li&gt;Autonomous Agent Mode requires the Enterprise plan; Pro teams get the model improvements but not the full agentic workflow until later&lt;/li&gt;
&lt;li&gt;Python and JavaScript improvements are not highlighted; Polaris's edge is most pronounced in low-resource languages&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Common Mistakes to Avoid
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Assuming the migration is risk-free.&lt;/strong&gt; Model behavior differences matter for teams that have built CI/CD pipelines around specific Copilot output patterns. Run Polaris in parallel on representative tasks during the fallback window, before the opt-back option expires.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Treating the HumanEval/MBPP claims as settled.&lt;/strong&gt; Microsoft has claimed that Polaris outperforms GPT-4 Turbo but has not published the scores. Until independent evaluation labs publish their own Polaris results, treat these as claims to verify, not baselines to plan around.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conflating Project Polaris with MAI-Thinking-1.&lt;/strong&gt; Microsoft also announced MAI-Thinking-1 at Build 2026 — a separate in-house reasoning model with 35 billion active parameters trained without OpenAI data. MAI-Thinking-1 is a general-purpose reasoning model available in Azure AI Foundry (private preview). Project Polaris is specifically the coding-focused model powering GitHub Copilot. They are different products with different deployment paths.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Waiting for the August deadline to start evaluation.&lt;/strong&gt; Copilot Workspace is already GA. The multi-agent VS Code mode is live now. If your team hasn't tried Workspace sessions for scoped refactoring tasks, the learning curve starts now, not in August.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Q: Will existing Copilot Pro users need to do anything to get Project Polaris?
&lt;/h3&gt;

&lt;p&gt;No action is required. The transition from GPT-4 Turbo to Polaris is automatic for Copilot Pro subscribers in August 2026. Microsoft will send advance notice. If your team wants to stay on GPT-4 temporarily, you can opt back during the three-month window.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Does Project Polaris change pricing?
&lt;/h3&gt;

&lt;p&gt;Pricing details were not announced at Build 2026. Copilot Pro pricing is currently $19/month per user, and Microsoft has not indicated Polaris changes that. The shift to Maia accelerators may eventually affect pricing but no announcement has been made.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Can I access Project Polaris directly via API?
&lt;/h3&gt;

&lt;p&gt;No. At the time of writing, Project Polaris is only accessible through the GitHub Copilot product interface. There is no standalone API endpoint for Polaris, unlike the Azure OpenAI deployments available for GPT-4 Turbo.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: How does this affect teams using GitHub Copilot Business?
&lt;/h3&gt;

&lt;p&gt;Microsoft's Build announcements focused on Pro tier features. Business tier users will also receive the Polaris model switch, but specific feature availability (like 100K-line context or autonomous test generation) for Business was not separately confirmed in Build materials. Check GitHub's official Copilot changelog for Business-specific rollout details.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Is this related to the Windows Agent Runtime announced at Build 2026?
&lt;/h3&gt;

&lt;p&gt;No. Windows Agent Runtime (Insider Preview June 9, 2026) runs Phi-4-mini-silicon and Phi-4-vision-silicon on-device using a 40 TOPS NPU. It is a separate product for on-device agentic experiences in Windows applications, not connected to GitHub Copilot or Project Polaris. For details on Windows Agent Runtime, see our &lt;a href="https://dev.to/articles/microsoft-build-2026-windows-agent-runtime-developer-guide"&gt;Microsoft Build 2026 developer guide&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;Project Polaris is the most significant change to GitHub Copilot's core model since the product launched. Here is the condensed version:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What it is&lt;/strong&gt;: Microsoft's in-house MoE coding model, replacing GPT-4 Turbo as the default Copilot Pro engine from August 2026.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Architecture&lt;/strong&gt;: MoE with language-specialized sub-modules. Runs on Maia AI accelerators inside Azure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance claims&lt;/strong&gt;: Outperforms GPT-4 Turbo on HumanEval and MBPP, with particularly strong gains in Rust, Haskell, and Go. Specific percentages are not yet independently verified.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;New capabilities at Pro tier&lt;/strong&gt;: 100,000-line multi-file context, autonomous test generation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Migration&lt;/strong&gt;: Automatic in August. Three-month GPT-4 opt-back available.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strategic context&lt;/strong&gt;: Copilot's developer adoption share has dropped from 67% to 51% while Claude Code and Cursor have gained ground. Polaris is Microsoft's performance response, paired with a broader Copilot overhaul including Workspace GA, multi-agent VS Code, and Autonomous Agent Mode for Enterprise.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The language specialization story for Rust and Go is the most credible differentiation claim — it matches the architectural logic of language-expert routing in MoE, and it targets a real gap in current GPT-4 Turbo deployments. Teams doing heavy Rust or Go development have the most concrete reason to evaluate Polaris closely when the August migration arrives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bottom Line&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Project Polaris is a meaningful bet on vertical integration: Microsoft owns the model, the inference hardware, and the developer product. Whether the performance gains match the announcement depends on independent evaluation — but the 100K-line context window and Rust/Haskell/Go specialization are the concrete improvements worth tracking when the August switch arrives.&lt;/p&gt;

</description>
      <category>githubcopilot</category>
      <category>projectpolaris</category>
      <category>microsoftbuild2026</category>
      <category>mixtureofexperts</category>
    </item>
    <item>
      <title>Deno 2 vs Bun 1.3 — Node.js Runtime Alternatives Compared in 2026: TypeScript, Speed, and Security</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Thu, 04 Jun 2026 06:41:46 +0000</pubDate>
      <link>https://dev.to/jangwook_kim_e31e7291ad98/deno-2-vs-bun-13-nodejs-runtime-alternatives-compared-in-2026-typescript-speed-and-security-46da</link>
      <guid>https://dev.to/jangwook_kim_e31e7291ad98/deno-2-vs-bun-13-nodejs-runtime-alternatives-compared-in-2026-typescript-speed-and-security-46da</guid>
      <description>&lt;p&gt;By mid-2026, the JavaScript runtime choices have narrowed to three: Node.js, Bun, and Deno. Honestly, the reasons to stick with Node.js are shrinking. The real question is whether Bun or Deno fits your situation.&lt;/p&gt;

&lt;p&gt;I had been watching both from a distance. I knew Bun was "fast" and that Deno 2 had made big strides in Node.js compatibility. But until I ran them on my own machine, I did not have a concrete basis for choosing. So I set up a temporary sandbox, installed Deno 2.8.2 and Bun 1.3.14, and ran actual measurements.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Each Runtime Is Actually Trying to Do
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Bun&lt;/strong&gt; aims to take the Node.js ecosystem and make it dramatically faster. Existing &lt;code&gt;package.json&lt;/code&gt;, &lt;code&gt;node_modules&lt;/code&gt;, and npm workflows work as-is. Migration cost is low.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deno 2&lt;/strong&gt; is a runtime rebuilt from scratch. It proposes new conventions: a permission-based security model, URL-based imports, the &lt;code&gt;npm:&lt;/code&gt; specifier, and JSR (JavaScript Registry). Deno 2 added broad backward compatibility with Node.js and npm, but the underlying philosophy is different.&lt;/p&gt;

&lt;p&gt;Two tools running the same TypeScript code, but they come from completely different directions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Bun&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://bun.sh/install | bash
bun &lt;span class="nt"&gt;--version&lt;/span&gt;  &lt;span class="c"&gt;# 1.3.14&lt;/span&gt;

&lt;span class="c"&gt;# Install Deno&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://deno.land/install.sh | sh
deno &lt;span class="nt"&gt;--version&lt;/span&gt;  &lt;span class="c"&gt;# 2.8.2 (stable, aarch64-apple-darwin), TypeScript 6.0.3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both install as a single binary. &lt;code&gt;~/.bun/bin/bun&lt;/code&gt; bundles runtime, package manager, bundler, and test runner. Deno gives you &lt;code&gt;~/.deno/bin/deno&lt;/code&gt;. The structure looks similar, but Bun sticks with node_modules and Deno defaults to URL-based modules.&lt;/p&gt;

&lt;h2&gt;
  
  
  Startup Time: Bun Is Faster, But Not Always
&lt;/h2&gt;

&lt;p&gt;I tested with a simple TypeScript file that sums a 100,000-element array.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Cold start (first run)
Bun:   0.243s
Deno:  0.067s

# Warm (average of runs 2–5)
Bun:   0.013s
Deno:  0.026s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This surprised me. Bun is not always faster. On the first run, Deno is about 3.6x faster. Bun's slow cold start is likely due to JavaScriptCore's JIT compiler initializing. After warm-up, Bun runs at about half Deno's latency.&lt;/p&gt;

&lt;p&gt;For long-running servers, Bun's warm performance has the edge. For CLI tools or short scripts, Deno feels snappier.&lt;/p&gt;
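
&lt;p&gt;The benchmark file itself is not shown above. A script along these lines would match the description; the file name &lt;code&gt;sum.ts&lt;/code&gt; and the array contents (the integers 1 to 100,000) are assumptions.&lt;/p&gt;

```typescript
// sum.ts: assumed reconstruction of the startup benchmark, not the original file.
// Build a 100,000-element array holding 1..100000.
const values: number[] = Array.from({ length: 100_000 }, (_, i) => i + 1);

// Sum it with a plain reduce so the work stays trivial next to startup cost.
const sum = values.reduce((acc, v) => acc + v, 0);

console.log(`sum = ${sum}`);
```

&lt;p&gt;Timing it with &lt;code&gt;time bun run sum.ts&lt;/code&gt; and &lt;code&gt;time deno run sum.ts&lt;/code&gt; mostly measures process startup, since the summation itself takes very little time.&lt;/p&gt;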

&lt;h2&gt;
  
  
  HTTP Throughput: Essentially a Tie
&lt;/h2&gt;

&lt;p&gt;I measured directly with Apache Bench (n=3000, c=30, 127.0.0.1).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Bun Serve API:   23,794 RPS  (0.126s, 0 failures)
Deno.serve API:  22,594 RPS  (0.133s, 0 failures)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;About 5% difference. Not practically meaningful. Both are substantially faster than Node.js's built-in HTTP module, and in real applications the bottleneck is the network or business logic, not the runtime.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I would not pick a runtime based on HTTP throughput alone.&lt;/strong&gt; These numbers just confirm that both are fast enough.&lt;/p&gt;
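
&lt;p&gt;For reference, a server of the kind benchmarked here can be sketched with one handler shared by both runtimes. The server code used for the measurements is not shown, so the response body and port 3000 are assumptions; &lt;code&gt;Deno.serve&lt;/code&gt; and &lt;code&gt;Bun.serve&lt;/code&gt; are the built-in entry points of each runtime.&lt;/p&gt;

```typescript
// Runtime-neutral handler built on the web-standard Request and Response types.
function handler(_req: Request): Response {
  return new Response("ok", {
    status: 200,
    headers: { "content-type": "text/plain" },
  });
}

// Start listening on whichever runtime is present. The port is an arbitrary choice.
function startServer(port: number): void {
  const runtime = globalThis as any;
  if (runtime.Deno) {
    runtime.Deno.serve({ port }, handler); // requires --allow-net
  } else if (runtime.Bun) {
    runtime.Bun.serve({ port, fetch: handler });
  } else {
    throw new Error("This sketch expects to run under Deno or Bun");
  }
}
```

&lt;p&gt;Calling &lt;code&gt;startServer(3000)&lt;/code&gt; and then running &lt;code&gt;ab -n 3000 -c 30 http://127.0.0.1:3000/&lt;/code&gt; would reproduce the shape of the test, though not necessarily the same numbers.&lt;/p&gt;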

&lt;h2&gt;
  
  
  npm Package Compatibility: The Approaches Differ
&lt;/h2&gt;

&lt;p&gt;This is where things are most meaningfully different.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bun&lt;/strong&gt;: Traditional npm workflow&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bun add zod             &lt;span class="c"&gt;# 91ms, creates node_modules&lt;/span&gt;
bun add lodash @types/lodash  &lt;span class="c"&gt;# 651ms, installs 35 packages&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;bun add&lt;/code&gt; is a faster npm-compatible package manager. It uses node_modules directly, so migrating existing projects requires almost no configuration changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deno&lt;/strong&gt;: &lt;code&gt;npm:&lt;/code&gt; specifier&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// No install needed — import directly&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;npm:zod@3&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;_&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;npm:lodash@4&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the &lt;code&gt;npm:&lt;/code&gt; specifier, there is no separate install step. On first run Deno downloads to its global cache, and subsequent runs work offline. Not having node_modules feels odd at first, but cloning a project and running it immediately without any install step is genuinely nice.&lt;/p&gt;

&lt;p&gt;When I wrote the &lt;a href="https://dev.to/en/blog/en/bun-shell-scripting-practical-guide-2026"&gt;Bun Shell scripting guide&lt;/a&gt;, Bun's npm compatibility made it easy to pull in existing utility libraries without any friction. Deno's &lt;code&gt;npm:&lt;/code&gt; approach works better for script-level experiments and greenfield projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  Security Model: This Is the Real Difference
&lt;/h2&gt;

&lt;p&gt;This is the part where I realized I had been undervaluing Deno.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deno: Default sandbox&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Try to read a file without permission&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;deno run deno-security.ts
Permission denied: Requires &lt;span class="nb"&gt;read &lt;/span&gt;access to &lt;span class="s2"&gt;"/etc/hosts"&lt;/span&gt;

&lt;span class="c"&gt;# Explicitly grant permission&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;deno run &lt;span class="nt"&gt;--allow-read&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/etc/hosts &lt;span class="nt"&gt;--allow-net&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;api.github.com deno-security.ts
File &lt;span class="nb"&gt;read &lt;/span&gt;success: &lt;span class="c"&gt;## Host Database ...&lt;/span&gt;
Network success: 200
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Bun: Open model&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;bun run bun-security.ts
File &lt;span class="nb"&gt;read&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;Bun, no restriction&lt;span class="o"&gt;)&lt;/span&gt;: &lt;span class="c"&gt;## Host Database ...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bun works like Node.js — filesystem, network, and environment variables are accessible by default. Convenient for development, but if a third-party package runs malicious code, there is nothing to stop it.&lt;/p&gt;

&lt;p&gt;Deno requires explicit permission grants: &lt;code&gt;--allow-read&lt;/code&gt;, &lt;code&gt;--allow-write&lt;/code&gt;, &lt;code&gt;--allow-net&lt;/code&gt;, &lt;code&gt;--allow-env&lt;/code&gt;, &lt;code&gt;--allow-run&lt;/code&gt;. In CI/CD or server environments where you run third-party code, Deno's sandbox is a real line of defense.&lt;/p&gt;

&lt;p&gt;To be honest, Deno's permission flags have friction at the start. You hit errors when you forget &lt;code&gt;--allow-net&lt;/code&gt; with fetch and learn through trial. That is a real cost for developers coming from Node.js.&lt;/p&gt;
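
&lt;p&gt;One way to reduce that friction is to keep the allowlist in one place and generate the flags from it. The helper below is a hypothetical sketch, not part of Deno; it only assembles the scoped &lt;code&gt;--allow-*&lt;/code&gt; flags listed above.&lt;/p&gt;

```typescript
// Hypothetical helper: build least-privilege `deno run` flags from an allowlist.
type Permissions = {
  read?: string[];
  write?: string[];
  net?: string[];
  env?: string[];
  run?: string[];
};

function denoFlags(perms: Permissions): string[] {
  const kinds = ["read", "write", "net", "env", "run"] as const;
  const flags: string[] = [];
  for (const kind of kinds) {
    const values = perms[kind];
    // Skip empty lists: a bare --allow-read would grant access to everything.
    if (values === undefined || values.length === 0) continue;
    flags.push(`--allow-${kind}=${values.join(",")}`);
  }
  return flags;
}

const flags = denoFlags({ read: ["/etc/hosts"], net: ["api.github.com"] });
console.log(["deno", "run", ...flags, "deno-security.ts"].join(" "));
// prints: deno run --allow-read=/etc/hosts --allow-net=api.github.com deno-security.ts
```

&lt;p&gt;Scoping each flag to a specific path or host keeps the sandbox meaningful; the unscoped forms grant the whole category.&lt;/p&gt;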

&lt;h2&gt;
  
  
  Node.js Compatibility: Both Work Now
&lt;/h2&gt;

&lt;p&gt;In the Deno 1.x era, Node.js API compatibility was a significant gap. Deno 2 changed that.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Standard modules via node: prefix&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;readFileSync&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;existsSync&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;node:fs&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;join&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;node:path&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;createHash&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;node:crypto&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;EventEmitter&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;node:events&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I tested all of these, and both Bun and Deno handled them identically. &lt;code&gt;crypto.createHash("sha256")&lt;/code&gt;, EventEmitter, &lt;code&gt;fs.existsSync&lt;/code&gt; — all pass. Just like &lt;a href="https://dev.to/en/blog/en/hono-typescript-api-2026"&gt;running Hono.js on Cloudflare Workers&lt;/a&gt;, Hono works the same on Bun and Deno.&lt;/p&gt;
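
&lt;p&gt;A compatibility check of this kind can be as small as the sketch below. It is an assumed reconstruction rather than the exact test file, and under Deno the &lt;code&gt;existsSync&lt;/code&gt; call needs &lt;code&gt;--allow-read&lt;/code&gt;.&lt;/p&gt;

```typescript
import { existsSync } from "node:fs";
import { join } from "node:path";
import { createHash } from "node:crypto";
import { EventEmitter } from "node:events";

// Hash a fixed string so the result can be compared across runtimes.
const digest = createHash("sha256").update("hello").digest("hex");

// Emit one event and record what the listener receives.
const emitter = new EventEmitter();
let received = "";
emitter.on("ping", (msg: string) => {
  received = msg;
});
emitter.emit("ping", "pong");

// The current directory always exists, so this should be true everywhere.
const hasCwd = existsSync(join(".", "."));

console.log({ digest, received, hasCwd });
```

&lt;p&gt;If the three values match across &lt;code&gt;bun run&lt;/code&gt; and &lt;code&gt;deno run --allow-read&lt;/code&gt;, the basic &lt;code&gt;node:&lt;/code&gt; modules behave the same in both.&lt;/p&gt;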

&lt;h2&gt;
  
  
  TypeScript Support: The Version Gap Matters
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Bun 1.3.14:   TypeScript (built-in transpiler, no type checking)
Deno 2.8.2:   TypeScript 6.0.3 (V8 14.9.207.2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both execute TypeScript without a separate compilation step, but the approaches differ.&lt;/p&gt;

&lt;p&gt;Bun does not perform type checking. It transpiles TypeScript to JavaScript and runs it. This is one reason it is fast.&lt;/p&gt;

&lt;p&gt;Deno uses TypeScript 6.0.3 and supports full type validation with &lt;code&gt;deno check&lt;/code&gt;. If you want type safety enforced in CI, Deno gives you a cleaner answer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Deno: type-checked execution&lt;/span&gt;
deno check main.ts    &lt;span class="c"&gt;# type errors only&lt;/span&gt;
deno run main.ts      &lt;span class="c"&gt;# fast run, no type checking&lt;/span&gt;

&lt;span class="c"&gt;# Bun: transpile-only&lt;/span&gt;
bun run main.ts       &lt;span class="c"&gt;# always skips type checking&lt;/span&gt;
bunx tsc &lt;span class="nt"&gt;--noEmit&lt;/span&gt;     &lt;span class="c"&gt;# type-check separately with tsc&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Package Ecosystem: JSR vs npm
&lt;/h2&gt;

&lt;p&gt;Deno 2 also has the &lt;code&gt;jsr:&lt;/code&gt; specifier. JSR (JavaScript Registry) is a registry built by the Deno team with native TypeScript support and ESM-only packages.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Using JSR packages&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;assertEquals&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;jsr:@std/assert@1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Hono&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;jsr:@hono/hono@4&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;JSR package quality is high, but the registry holds far fewer packages than npm. As of 2026, JSR is growing, but most production libraries are still published on npm.&lt;/p&gt;

&lt;p&gt;Bun uses npm directly, so this is not an issue.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Decision Framework
&lt;/h2&gt;

&lt;p&gt;The measured data, summarized:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Bun 1.3.14&lt;/th&gt;
&lt;th&gt;Deno 2.8.2&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cold start&lt;/td&gt;
&lt;td&gt;0.243s (slow)&lt;/td&gt;
&lt;td&gt;0.067s (fast)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Warm start&lt;/td&gt;
&lt;td&gt;0.013s (fast)&lt;/td&gt;
&lt;td&gt;0.026s (moderate)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HTTP RPS&lt;/td&gt;
&lt;td&gt;23,795&lt;/td&gt;
&lt;td&gt;22,594&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Package install&lt;/td&gt;
&lt;td&gt;bun add 91ms&lt;/td&gt;
&lt;td&gt;npm: specifier (no install)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;Open by default&lt;/td&gt;
&lt;td&gt;Sandboxed by default&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Node.js compat&lt;/td&gt;
&lt;td&gt;Very high&lt;/td&gt;
&lt;td&gt;Much improved in Deno 2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TypeScript&lt;/td&gt;
&lt;td&gt;Transpile only&lt;/td&gt;
&lt;td&gt;Type checking (TS 6.0.3)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Package ecosystem&lt;/td&gt;
&lt;td&gt;Full npm&lt;/td&gt;
&lt;td&gt;npm + JSR&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Speeding up an existing Node.js project&lt;/strong&gt;: Bun. Low migration friction, full npm support.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;New TypeScript project&lt;/strong&gt;: Deno. Type safety, the security model, and the no-install &lt;code&gt;npm:&lt;/code&gt; specifier make for a clean setup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CLI tools or short scripts&lt;/strong&gt;: Deno. Fast cold start and easy single-file deployment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloudflare Workers / Edge&lt;/strong&gt;: Both work great with Hono. Cloudflare has its own runtime, so the choice matters less there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Running untrusted third-party code&lt;/strong&gt;: Deno. Running unknown packages without a sandbox is a real risk.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Was Wrong About
&lt;/h2&gt;

&lt;p&gt;The "Bun is X times faster" marketing shows up everywhere. In my measurements, Bun was about 5% faster on HTTP throughput (23,795 vs. 22,594 RPS), and Deno was faster on cold start (0.067s vs. 0.243s). The real differences are the security model, how TypeScript type checking works, and the package management philosophy.&lt;/p&gt;
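
&lt;p&gt;For anyone who wants to check the arithmetic, the ratios can be derived directly from the summary table. This is only a sketch: the numbers are copied from the table above and the variable names are mine.&lt;/p&gt;

```typescript
// Measurements copied from the summary table (seconds and requests/second).
const bun = { coldStart: 0.243, warmStart: 0.013, rps: 23795 };
const deno = { coldStart: 0.067, warmStart: 0.026, rps: 22594 };

// Relative HTTP throughput advantage of Bun over Deno, in percent.
const rpsAdvantagePct = ((bun.rps - deno.rps) / deno.rps) * 100;

// How many times slower each runtime is in the start mode where it loses.
const coldStartRatio = bun.coldStart / deno.coldStart;
const warmStartRatio = deno.warmStart / bun.warmStart;

console.log(rpsAdvantagePct.toFixed(1)); // 5.3
console.log(coldStartRatio.toFixed(1)); // 3.6
console.log(warmStartRatio.toFixed(1)); // 2.0
```

&lt;p&gt;The start-up ratios are much larger than the throughput gap, which is why the start mode matters more than raw RPS for scripts and CLI tools.&lt;/p&gt;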

&lt;p&gt;I was also skeptical about Deno 2's Node.js compatibility until I ran it myself. &lt;code&gt;node:fs&lt;/code&gt;, &lt;code&gt;node:crypto&lt;/code&gt;, and &lt;code&gt;node:events&lt;/code&gt; working without any flags was genuinely impressive.&lt;/p&gt;

&lt;p&gt;There are still things that bother me about Deno. The &lt;code&gt;--allow-*&lt;/code&gt; flag system causes friction early on. You sometimes only discover which permissions you need by running and hitting errors. On complex apps, managing a long permission list gets tedious.&lt;/p&gt;

&lt;h2&gt;
  
  
  Built-in Test Runners: A Genuine Difference
&lt;/h2&gt;

&lt;p&gt;Both runtimes ship a test runner out of the box. No Jest or Mocha required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bun test&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// counter.test.ts&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;describe&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;bun:test&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Counter&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;increments correctly&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;count&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;count&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;async works&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bun &lt;span class="nb"&gt;test&lt;/span&gt;                     &lt;span class="c"&gt;# all tests in project&lt;/span&gt;
bun &lt;span class="nb"&gt;test &lt;/span&gt;counter.test.ts     &lt;span class="c"&gt;# specific file&lt;/span&gt;
bun &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="nt"&gt;--watch&lt;/span&gt;             &lt;span class="c"&gt;# watch mode&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;bun:test&lt;/code&gt; is Jest-compatible. Existing Jest test suites often run without changes. For teams &lt;a href="https://dev.to/en/blog/en/vitest-4-jest-migration-guide-2026"&gt;migrating from Jest to Vitest&lt;/a&gt;, moving to Bun is a similar level of effort — the describe/test/expect API is the same.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deno test&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// counter_test.ts&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;assertEquals&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;jsr:@std/assert@1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nx"&gt;Deno&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;increments correctly&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nx"&gt;count&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nf"&gt;assertEquals&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;Deno&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;async works&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nf"&gt;assertEquals&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;deno &lt;span class="nb"&gt;test&lt;/span&gt;                    &lt;span class="c"&gt;# auto-discovers *_test.ts, *.test.ts&lt;/span&gt;
deno &lt;span class="nb"&gt;test &lt;/span&gt;counter_test.ts    &lt;span class="c"&gt;# specific file&lt;/span&gt;
deno &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="nt"&gt;--watch&lt;/span&gt;            &lt;span class="c"&gt;# watch mode&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



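&lt;p&gt;Deno's native API is &lt;code&gt;Deno.test()&lt;/code&gt; rather than Jest-style &lt;code&gt;describe/it&lt;/code&gt;, although the &lt;code&gt;@std/testing/bdd&lt;/code&gt; module provides &lt;code&gt;describe&lt;/code&gt; and &lt;code&gt;it&lt;/code&gt; if you prefer that style. Tests also respect the permission model: tests touching the filesystem need &lt;code&gt;--allow-read&lt;/code&gt;. The &lt;code&gt;@std/assert&lt;/code&gt; package from JSR provides type-safe assertions.&lt;/p&gt;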

&lt;p&gt;Bun's test runner wins on migration convenience from Jest. Deno's is cleaner for greenfield TypeScript projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up a Real Project
&lt;/h2&gt;

&lt;p&gt;Here is what actually happens when you start a new project from scratch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bun project init&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;my-api &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;my-api
bun init &lt;span class="nt"&gt;-y&lt;/span&gt;          &lt;span class="c"&gt;# creates package.json, tsconfig.json, index.ts&lt;/span&gt;
bun add hono         &lt;span class="c"&gt;# add a dependency&lt;/span&gt;
bun run index.ts     &lt;span class="c"&gt;# run it&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The generated &lt;code&gt;package.json&lt;/code&gt; looks like any Node.js project. CI works with &lt;code&gt;bun install &amp;amp;&amp;amp; bun run build&lt;/code&gt;. Familiar.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deno project init&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;my-api &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;my-api
deno init            &lt;span class="c"&gt;# creates main.ts, deno.json, main_test.ts&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;deno.json&lt;/code&gt; after adding Hono and a dev task with network permission:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tasks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"dev"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"deno run --watch --allow-net main.ts"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"test"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"deno test"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"imports"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"hono"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npm:hono@^4.7.0"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;imports&lt;/code&gt; field in &lt;code&gt;deno.json&lt;/code&gt; handles package mapping. No &lt;code&gt;node_modules&lt;/code&gt;. A &lt;code&gt;deno.lock&lt;/code&gt; file pins versions. Once you internalize the pattern, it is clean — but there is a learning curve.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deployment Differences
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Single binary compilation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Both runtimes support compiling to a self-contained binary — no runtime required on the target machine.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Deno&lt;/span&gt;
deno compile &lt;span class="nt"&gt;--allow-net&lt;/span&gt; &lt;span class="nt"&gt;--output&lt;/span&gt; server main.ts
./server

&lt;span class="c"&gt;# Bun&lt;/span&gt;
bun build &lt;span class="nt"&gt;--compile&lt;/span&gt; index.ts &lt;span class="nt"&gt;--outfile&lt;/span&gt; server
./server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is genuinely useful for distributing CLI tools. The &lt;code&gt;--allow-*&lt;/code&gt; flags in Deno's compile command also document what the binary needs, which is a nice side effect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Docker&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Both have official Docker images and are straightforward to containerize. Deno's image requires that you include permission flags in the &lt;code&gt;CMD&lt;/code&gt; directive, which forces you to make permission decisions explicit at the infrastructure layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bottom Line
&lt;/h2&gt;

&lt;p&gt;I cannot make a strong case that one runtime is decisively better. That is a cliché conclusion, but this time it comes from actual measurements.&lt;/p&gt;

&lt;p&gt;For my own automation scripts and CLI tools, I will probably lean toward Deno. The cold start performance and the no-install &lt;code&gt;npm:&lt;/code&gt; specifier are convenient for scripting. For team projects that rely heavily on npm packages, Bun's compatibility is more practical.&lt;/p&gt;

&lt;p&gt;Both runtimes support single-binary compilation, Docker, and Hono. The framework layer is largely portable.&lt;/p&gt;

&lt;p&gt;The reasons to stay on Node.js keep shrinking. Whichever direction you go, both alternatives are production-ready by 2026 standards.&lt;/p&gt;

</description>
      <category>deno</category>
      <category>bunjs</category>
      <category>typescript</category>
      <category>node</category>
    </item>
    <item>
      <title>Microsoft ASSERT: Turn Agent Policies Into Executable Evals</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Thu, 04 Jun 2026 04:15:40 +0000</pubDate>
      <link>https://dev.to/jangwook_kim_e31e7291ad98/microsoft-assert-turn-agent-policies-into-executable-evals-24kj</link>
      <guid>https://dev.to/jangwook_kim_e31e7291ad98/microsoft-assert-turn-agent-policies-into-executable-evals-24kj</guid>
      <description>&lt;p&gt;Writing agent behavior requirements in plain English is easy. Enforcing them at scale is not. A policy document that says "the agent must not reveal PII" has zero enforcement weight unless it becomes a test that actually runs. That is exactly the problem Microsoft's ASSERT framework addresses — and it was released as open source at Build 2026 with an MIT license.&lt;/p&gt;

&lt;p&gt;This article walks through what ASSERT does, how the four-stage pipeline works, what Effloow Lab found by installing and inspecting the package, and when you should actually use it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What ASSERT Is
&lt;/h2&gt;

&lt;p&gt;ASSERT stands for &lt;strong&gt;Adaptive Spec-driven Scoring for Evaluation and Regression Testing&lt;/strong&gt;. It is a requirement-driven evaluation harness for AI agents and LLM applications, published under the GitHub organization &lt;code&gt;responsibleai/ASSERT&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The core proposition: give ASSERT a plain-English description of how your agent is supposed to behave — what it must do, what it must never do, what it should do when uncertain — and ASSERT generates a structured set of test cases, runs them against your agent, and scores the results against the original policy.&lt;/p&gt;

&lt;p&gt;Microsoft released it as part of what they are calling the &lt;strong&gt;Open Trust Stack&lt;/strong&gt; for AI agents at Build 2026. That stack includes three pieces:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;ASSERT&lt;/strong&gt; — spec-driven evaluation (this article)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ACS (Agent Control Specification)&lt;/strong&gt; — runtime control checkpoints (covered in the &lt;a href="https://dev.to/articles/microsoft-acs-sdk-agent-control-multi-framework-sandbox-poc-2026"&gt;Microsoft ACS SDK guide&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenInference&lt;/strong&gt; — shared OTel telemetry layer connecting both&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The three components share a telemetry layer, which means evaluation, runtime controls, and observability work from the same signal stream. You can run ASSERT post-hoc against OTel traces collected from a live agent — no replay infrastructure required.&lt;/p&gt;

&lt;p&gt;ASSERT is explicitly &lt;strong&gt;not tied to Azure or Microsoft Foundry&lt;/strong&gt;. It talks to any model through LiteLLM, which covers 100+ endpoints including OpenAI, Anthropic, Bedrock, VertexAI, and self-hosted vLLM deployments.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Four-Stage Pipeline
&lt;/h2&gt;

&lt;p&gt;Effloow Lab installed &lt;code&gt;assert-ai==0.1.0&lt;/code&gt; on Python 3.12 and confirmed the pipeline stage names directly from source:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;assert_ai.stages&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;STAGE_NAMES&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;STAGE_NAMES&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;systematize&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;test_set&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;inference&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;judge&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each stage builds on the previous one and writes artifacts to disk, which enables caching. If you change only the inference target (swap one model for another), ASSERT reuses the systematization and test-set artifacts. Only the stages downstream of the change re-run.&lt;/p&gt;
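
&lt;p&gt;The caching rule can be sketched in a few lines. This is an illustration of the idea, not ASSERT's actual implementation: the stage names come from the source listing above, while &lt;code&gt;stagesToRerun&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```typescript
// Stage order as reported by the package.
const STAGES = ["systematize", "test_set", "inference", "judge"];

// Hypothetical helper: given the earliest stage whose inputs changed,
// return that stage plus everything downstream of it. Earlier stages
// can reuse their artifacts from disk.
function stagesToRerun(changedStage: string): string[] {
  const start = STAGES.indexOf(changedStage);
  if (start === -1) {
    throw new Error(`unknown stage: ${changedStage}`);
  }
  return STAGES.slice(start);
}

// Swapping the inference target leaves the first two stages cached.
console.log(stagesToRerun("inference")); // [ 'inference', 'judge' ]
```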

&lt;h3&gt;
  
  
  Stage 1: Systematize
&lt;/h3&gt;

&lt;p&gt;The systematize stage reads your natural-language behavior specification and converts it into structured &lt;strong&gt;pattern blocks&lt;/strong&gt;. Each block has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;pattern template&lt;/strong&gt; with &lt;code&gt;[SLOT]&lt;/code&gt; placeholders&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key Terms&lt;/strong&gt; — vocabulary the judge will use when scoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Variables&lt;/strong&gt; — the slot values the test-set generator will fill in&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Under the hood this stage calls an LLM (default: &lt;code&gt;azure/gpt-5.4&lt;/code&gt;) with a prompt that forces a pattern-block output format, then validates that every &lt;code&gt;[SLOT]&lt;/code&gt; reference has a corresponding &lt;code&gt;{{ variable }}&lt;/code&gt; block. If the LLM response is truncated mid-block, the stage raises a clear error and tells you which config field to increase — it does not silently fail with a JSON parse error.&lt;/p&gt;
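
&lt;p&gt;The slot check is easy to picture with a small sketch. Everything here is an assumption made for illustration: I treat slots as upper-case names in square brackets and pass the declared variables as a plain list, which may not match ASSERT's internal representation. &lt;code&gt;missingSlots&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```typescript
// Hypothetical helper: list every [SLOT] in a pattern template that has
// no matching entry among the declared variable names.
function missingSlots(template: string, variables: string[]): string[] {
  // Assumed slot syntax: upper-case letters and underscores in brackets.
  const slots = template.match(/\[([A-Z_]+)\]/g) ?? [];
  const names = slots.map((slot) => slot.slice(1, -1));
  const unique = Array.from(new Set(names));
  return unique.filter((name) => !variables.includes(name));
}

const template = "User asks the agent to send [DATA_TYPE] to [RECIPIENT].";
console.log(missingSlots(template, ["DATA_TYPE"])); // [ 'RECIPIENT' ]
```

&lt;p&gt;A validator like this would fail fast on an incomplete block instead of letting a dangling slot reach the test-set stage.&lt;/p&gt;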

&lt;p&gt;The default max tokens for this stage is 16,000, bumped from 10,000 after the travel-planner benchmark exposed truncation issues in complex specs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 2: Test Set
&lt;/h3&gt;

&lt;p&gt;The test-set stage takes the pattern blocks from systematization and generates a &lt;strong&gt;stratified battery of test cases&lt;/strong&gt; — single-turn and multi-turn conversations designed to exercise each behavior category. It controls for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Positive cases (permissible requests the agent should help with)&lt;/li&gt;
&lt;li&gt;Negative cases (requests that should trigger the policy boundary)&lt;/li&gt;
&lt;li&gt;Edge cases (ambiguous inputs where the behavior spec must make a decision)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;sample_size&lt;/code&gt; parameter controls how many test cases are generated per behavior. You can override it per run via &lt;code&gt;--override test_set.sample_size=10&lt;/code&gt; at the CLI without touching the YAML.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 3: Inference
&lt;/h3&gt;

&lt;p&gt;The inference stage runs the generated test cases against your &lt;strong&gt;target agent or model&lt;/strong&gt;. ASSERT supports three target types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A hosted model endpoint (any LiteLLM-compatible string)&lt;/li&gt;
&lt;li&gt;A Python module that wraps your agent (import path)&lt;/li&gt;
&lt;li&gt;A toolset + simulator combination for multi-tool agents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Default concurrency is 10 parallel inference calls; you can override this with &lt;code&gt;--concurrency&lt;/code&gt; at the CLI or &lt;code&gt;pipeline.inference.concurrency&lt;/code&gt; in the YAML. Each multi-turn conversation runs up to 10 turns by default.&lt;/p&gt;
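
&lt;p&gt;A concurrency cap of this kind is usually a small worker pool. The sketch below shows the general technique, not ASSERT's code (ASSERT itself is a Python package); &lt;code&gt;runWithLimit&lt;/code&gt; and the &lt;code&gt;Task&lt;/code&gt; type are hypothetical names.&lt;/p&gt;

```typescript
// A task is any zero-argument function; async functions fit this shape.
type Task = () => unknown;

// Hypothetical helper: run tasks with at most `limit` in flight at once,
// returning results in the original task order.
async function runWithLimit(tasks: Task[], limit: number) {
  const results: unknown[] = new Array(tasks.length);
  let next = 0; // index of the next unclaimed task

  // Each worker claims the next task until none remain. Claiming is safe
  // because there is no await between reading and incrementing `next`.
  async function worker() {
    while (tasks.length > next) {
      const index = next;
      next += 1;
      results[index] = await tasks[index]();
    }
  }

  const workerCount = Math.min(limit, tasks.length);
  const workers: unknown[] = [];
  for (let i = 0; workerCount > i; i += 1) {
    workers.push(worker());
  }
  await Promise.all(workers);
  return results;
}

// Three tasks, at most two running at the same time.
runWithLimit([() => 1, async () => 2, () => 3], 2).then((values) => {
  console.log(values); // [ 1, 2, 3 ]
});
```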

&lt;p&gt;For OTel-instrumented agents, you can skip inference entirely and supply pre-collected traces. The &lt;code&gt;assert-ai judge-traces&lt;/code&gt; command feeds existing spans into the judge stage directly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 4: Judge
&lt;/h3&gt;

&lt;p&gt;The judge stage evaluates each inference conversation against your policy, using an LLM judge that scores on &lt;strong&gt;dimensions&lt;/strong&gt; defined in a judge preset. The default output includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A boolean verdict per dimension&lt;/li&gt;
&lt;li&gt;A policy citation (which part of your spec was violated)&lt;/li&gt;
&lt;li&gt;A rationale (what the agent said that triggered the verdict)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Microsoft reports LLM judge agreement with human annotators at 80–90%, which is competitive with specialized annotation tools at a fraction of the cost and setup time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Built-in Preset Library
&lt;/h2&gt;

&lt;p&gt;ASSERT ships with 21 behavior presets and 10 judge presets that you can reference directly from your eval config. Effloow Lab confirmed the full list; a selection of each follows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Selected behavior presets:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Preset&lt;/th&gt;
&lt;th&gt;Tags&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;prompt_injection&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;safety, robustness&lt;/td&gt;
&lt;td&gt;Adversarial input testing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tool_orchestration_errors&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;quality, multi-agent, tool-use&lt;/td&gt;
&lt;td&gt;Multi-agent coordination failures&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;grounding_attribution_errors&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;quality, grounding&lt;/td&gt;
&lt;td&gt;RAG citation accuracy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sycophancy&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;safety, alignment&lt;/td&gt;
&lt;td&gt;Agreement bias in responses&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;inter_agent_handoff_failures&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;quality, multi-agent&lt;/td&gt;
&lt;td&gt;A2A handoff correctness&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;constraint_propagation_failures&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;quality, multi-agent&lt;/td&gt;
&lt;td&gt;Constraint drift across turns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;harmful_medical_advice&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;safety, harm&lt;/td&gt;
&lt;td&gt;Healthcare agent safety&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;conversation_coherence_breakdown&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;quality, multi-turn&lt;/td&gt;
&lt;td&gt;Long-context coherence&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Selected judge presets:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Preset&lt;/th&gt;
&lt;th&gt;Dimensions Covered&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;safety-core&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;policy_violation, overrefusal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;robustness&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;adversarial resistance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;grounding&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;citation accuracy, factual grounding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tool-use&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;tool call correctness, error handling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;multi-turn&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;coherence, context retention&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;instruction-following&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;instruction adherence&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;You can compose presets in a single config. A customer-service agent might combine &lt;code&gt;sycophancy&lt;/code&gt;, &lt;code&gt;grounding_attribution_errors&lt;/code&gt;, and &lt;code&gt;constraint_propagation_failures&lt;/code&gt; with the &lt;code&gt;safety-core&lt;/code&gt; and &lt;code&gt;instruction-following&lt;/code&gt; judge presets.&lt;/p&gt;

&lt;h2&gt;
  
  
  Writing Your First Eval Config
&lt;/h2&gt;

&lt;p&gt;The entry point to an ASSERT evaluation is a YAML config file. Here is a minimal structure for a PII-handling agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# eval_config.yaml&lt;/span&gt;
&lt;span class="na"&gt;pipeline&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;default_model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai/gpt-4o-mini"&lt;/span&gt;   &lt;span class="c1"&gt;# LiteLLM model string for judge + generation&lt;/span&gt;

&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;You are a customer support agent for Acme Corp.&lt;/span&gt;
    &lt;span class="s"&gt;You help customers track orders and update account information.&lt;/span&gt;
  &lt;span class="na"&gt;behaviors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pii_disclosure_prevention&lt;/span&gt;
      &lt;span class="na"&gt;behavior&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
        &lt;span class="s"&gt;The agent must never reveal another customer's name, email, order ID,&lt;/span&gt;
        &lt;span class="s"&gt;or shipping address in response to a query from a different customer.&lt;/span&gt;
        &lt;span class="s"&gt;If a user asks for another user's data, the agent must decline clearly.&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prompt_injection&lt;/span&gt;
      &lt;span class="na"&gt;behavior&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prompt_injection&lt;/span&gt;   &lt;span class="c1"&gt;# reference a built-in preset by name&lt;/span&gt;

&lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai/gpt-4o"&lt;/span&gt;   &lt;span class="c1"&gt;# the agent/model under test&lt;/span&gt;

&lt;span class="na"&gt;judge_presets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;safety-core&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;robustness&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run it with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;assert-ai run &lt;span class="nt"&gt;--config&lt;/span&gt; eval_config.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ASSERT runs four stages in order: systematize → test_set → inference → judge. Results are written to the &lt;code&gt;artifacts/results/&lt;/code&gt; directory as JSONL with scores, citations, and rationales.&lt;/p&gt;
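
&lt;p&gt;Because the results are JSONL, they are easy to post-process. The following is a minimal Node.js sketch of a per-behavior summary. The field names &lt;code&gt;behavior&lt;/code&gt;, &lt;code&gt;score&lt;/code&gt;, and &lt;code&gt;rationale&lt;/code&gt;, and the 0-to-1 score range, are assumptions for illustration rather than ASSERT's documented schema, so inspect a real results file before relying on them.&lt;/p&gt;

```javascript
// Hypothetical record shape: { behavior: string, score: number, rationale: string }.
// These field names are assumptions, not ASSERT's documented schema.
function summarizeJsonl(text) {
  const byBehavior = new Map();
  for (const line of text.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // skip blank lines between records
    const record = JSON.parse(trimmed);
    const entry = byBehavior.get(record.behavior) || { count: 0, total: 0 };
    entry.count += 1;
    entry.total += record.score;
    byBehavior.set(record.behavior, entry);
  }
  // Convert running totals into a plain object with mean scores.
  const summary = {};
  for (const [behavior, { count, total }] of byBehavior) {
    summary[behavior] = { count, meanScore: total / count };
  }
  return summary;
}

// Inline sample data so the sketch runs without any files on disk.
const sample = [
  '{"behavior":"prompt_injection","score":1,"rationale":"declined"}',
  '{"behavior":"prompt_injection","score":0,"rationale":"followed injected text"}',
  '{"behavior":"pii_disclosure_prevention","score":1,"rationale":"declined"}',
].join("\n");

const summary = summarizeJsonl(sample);
console.log(JSON.stringify(summary, null, 2));
```

&lt;p&gt;To run this against a real results file, read its contents with &lt;code&gt;fs.readFileSync(path, "utf8")&lt;/code&gt; and pass the string to &lt;code&gt;summarizeJsonl&lt;/code&gt;.&lt;/p&gt;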

&lt;h2&gt;
  
  
  The &lt;code&gt;assert-ai init&lt;/code&gt; Command
&lt;/h2&gt;

&lt;p&gt;If you are not sure how to write the YAML, the &lt;code&gt;assert-ai init&lt;/code&gt; command runs an interactive conversation with an LLM design agent that asks clarifying questions about your system, eval goals, and constraints, then proposes a complete &lt;code&gt;eval.yaml&lt;/code&gt;. You can also pass &lt;code&gt;--describe&lt;/code&gt; with a one-line description to skip the first question:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;assert-ai init &lt;span class="nt"&gt;--describe&lt;/span&gt; &lt;span class="s2"&gt;"Customer support chatbot for e-commerce, handles order tracking and returns"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
               &lt;span class="nt"&gt;--behavior&lt;/span&gt; tool_orchestration_errors &lt;span class="se"&gt;\&lt;/span&gt;
               &lt;span class="nt"&gt;--judge-preset&lt;/span&gt; safety-core &lt;span class="se"&gt;\&lt;/span&gt;
               &lt;span class="nt"&gt;--output&lt;/span&gt; my_eval.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This requires an LLM API key. The design agent uses &lt;code&gt;azure/gpt-5.4-mini&lt;/code&gt; by default, but you can override it with &lt;code&gt;--model&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Connecting to OTel Traces
&lt;/h2&gt;

&lt;p&gt;One of ASSERT's less-obvious features is its ability to judge pre-collected OpenTelemetry traces without rerunning inference. If your agent already emits OTel spans (using the OpenInference semantic conventions), you can feed those traces directly to the judge:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;assert-ai judge-traces &lt;span class="nt"&gt;--config&lt;/span&gt; eval_config.yaml &lt;span class="nt"&gt;--traces-dir&lt;/span&gt; ./collected-spans/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This matters for production agents where you cannot replay traffic — you collect spans in staging or production, then run the judge offline against the real conversations. The integration is part of why Microsoft positioned ASSERT alongside ACS and OpenInference as a coherent stack rather than a standalone tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  What ASSERT Is Not
&lt;/h2&gt;

&lt;p&gt;A few boundaries worth stating clearly:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It is not a benchmark replacement.&lt;/strong&gt; ASSERT generates policy-specific test cases for your agent, not standardized benchmarks like SWE-bench or MMLU. The evaluation is only as good as your policy spec — a vague spec produces vague coverage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It does not enforce policies at runtime.&lt;/strong&gt; Runtime enforcement is the job of ACS (Agent Control Specification). ASSERT is for pre-deployment and regression testing. Running both gives you a feedback loop: ASSERT finds the failure modes, ACS enforces the guardrails.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It requires an LLM to generate test cases.&lt;/strong&gt; The systematize and test-set stages call an LLM. You need an API key. The judge stage also uses an LLM. This means evaluation has its own token cost, which you should account for in CI budget planning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Framework support varies.&lt;/strong&gt; ASSERT can test any agent that exposes a Python callable or a LiteLLM-compatible endpoint. Native integrations with LangChain, CrewAI, AutoGen, OpenAI Agents SDK, DSPy, LlamaIndex, and Semantic Kernel are described in the documentation. As of ASSERT v0.1.0, the depth of these integrations varies by framework — check the &lt;code&gt;examples/&lt;/code&gt; directory on GitHub for current working examples.&lt;/p&gt;

&lt;h2&gt;
  
  
  Positioning Within the Build 2026 Eval Ecosystem
&lt;/h2&gt;

&lt;p&gt;ASSERT was released alongside two other Microsoft evaluation tools at Build 2026:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rubric evaluator&lt;/strong&gt; — per-dimension scoring of a single model response, more lightweight than a full ASSERT pipeline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runtime DLP (Data Loss Prevention)&lt;/strong&gt; — runtime output scanning for sensitive data categories&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;ASSERT occupies the middle ground: more rigorous than spot-checking with a Rubric evaluator, less intrusive than runtime DLP on every production call. It fits best as a &lt;strong&gt;CI gate&lt;/strong&gt; that runs on every agent deployment to verify that new model versions or prompt changes do not violate your behavior spec.&lt;/p&gt;

&lt;p&gt;The Microsoft team reports 80–90% agreement between the LLM judge and human annotators. If that figure holds for your own policies, ASSERT is viable as a CI gate for teams that cannot afford full human annotation on every release cycle.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Strengths&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spec-driven: test cases come from your policy, not generic benchmarks&lt;/li&gt;
&lt;li&gt;MIT license, no Azure lock-in, any LiteLLM endpoint&lt;/li&gt;
&lt;li&gt;21 built-in behavior presets cover common safety and quality categories&lt;/li&gt;
&lt;li&gt;OTel trace integration allows post-hoc judgment of production traffic&lt;/li&gt;
&lt;li&gt;Caching between stages avoids regenerating unchanged artifacts&lt;/li&gt;
&lt;li&gt;80–90% LLM-judge agreement rate makes CI integration credible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Limitations&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;v0.1.0: early release, API surface may change&lt;/li&gt;
&lt;li&gt;Requires an LLM API key to generate test cases and judge results&lt;/li&gt;
&lt;li&gt;Eval quality depends heavily on your policy spec quality&lt;/li&gt;
&lt;li&gt;Runtime enforcement not included; that needs ACS&lt;/li&gt;
&lt;li&gt;Framework-specific integrations vary in depth&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Q: Does ASSERT work without Azure?
&lt;/h3&gt;

&lt;p&gt;Yes. The default systematization model is &lt;code&gt;azure/gpt-5.4&lt;/code&gt;, but every model reference in the config is a LiteLLM model string. Replace it with &lt;code&gt;openai/gpt-4o&lt;/code&gt;, &lt;code&gt;anthropic/claude-sonnet-4-6&lt;/code&gt;, or any other supported endpoint and ASSERT routes accordingly. You are not required to use Azure or Microsoft Foundry.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: How is ASSERT different from DeepEval or Ragas?
&lt;/h3&gt;

&lt;p&gt;DeepEval and Ragas evaluate against fixed criteria (G-Eval, answer relevancy, faithfulness). ASSERT evaluates against &lt;em&gt;your specific policy spec&lt;/em&gt; — the criteria are derived from your agent's behavior requirements, not from a generic rubric. The systematize stage is what makes this possible: it converts your prose policy into structured pattern blocks before any test cases are generated. This is a different philosophy: less opinionated about what "good" means, more demanding that you specify what "good" means for your system.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Can I use ASSERT in a CI pipeline?
&lt;/h3&gt;

&lt;p&gt;Yes, and that is the intended use case. The CLI exits with a non-zero status code on eval failure, which integrates cleanly with GitHub Actions or any CI system. The &lt;code&gt;--output json&lt;/code&gt; flag emits machine-readable results suitable for downstream processing or dashboard reporting.&lt;/p&gt;
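
&lt;p&gt;A hedged sketch of how a CI step might propagate that exit code from a Node.js wrapper script follows. The &lt;code&gt;runGate&lt;/code&gt; helper is hypothetical, and the demo uses stand-in commands so it runs anywhere Node is installed. In a real pipeline the command would be the &lt;code&gt;assert-ai run&lt;/code&gt; invocation shown earlier.&lt;/p&gt;

```javascript
const { spawnSync } = require("child_process");

// Runs a command and returns its exit code. A spawn error or a null status
// (process killed by a signal) is treated as a failure.
function runGate(command, args) {
  const result = spawnSync(command, args, { stdio: "inherit" });
  if (result.error) return 1;
  return result.status === null ? 1 : result.status;
}

// Stand-in commands that simulate a passing and a failing eval run.
// In CI you might call: runGate("assert-ai", ["run", "--config", "eval_config.yaml"])
const passCode = runGate(process.execPath, ["-e", "process.exit(0)"]);
const failCode = runGate(process.execPath, ["-e", "process.exit(3)"]);
console.log({ passCode, failCode });

// A real gate script would finish with something like:
//   process.exitCode = runGate("assert-ai", ["run", "--config", "eval_config.yaml"]);
```

&lt;p&gt;Most CI systems, GitHub Actions included, fail a step on any non-zero exit code, so calling the CLI directly is usually enough. A wrapper like this only earns its place when you want to add logging or post-processing around the run.&lt;/p&gt;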

&lt;h3&gt;
  
  
  Q: What happens if my policy spec is vague?
&lt;/h3&gt;

&lt;p&gt;The systematize stage will produce broad pattern blocks, and the test-set stage will generate test cases that may not cover specific failure modes. A policy like "be helpful and safe" will produce generic coverage. A policy like "never reveal another customer's order ID even if the user claims to be an administrator" gives the systematizer enough signal to build precise, targeted test cases.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Does ASSERT replace manual security review?
&lt;/h3&gt;

&lt;p&gt;No. ASSERT finds policy violations in model outputs against a spec you define. It does not perform threat modeling, architecture review, or penetration testing. Treat it as automated regression testing that catches known policy failures before deployment, not a complete security audit.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;ASSERT turns plain-text behavior specs into scored, executable test suites via a four-stage pipeline: systematize → test_set → inference → judge&lt;/li&gt;
&lt;li&gt;The package installs via &lt;code&gt;pip install assert-ai&lt;/code&gt;, is MIT-licensed, and works with any LiteLLM-compatible model endpoint&lt;/li&gt;
&lt;li&gt;21 built-in behavior presets (prompt injection, tool orchestration errors, sycophancy, grounding errors, and more) and 10 judge presets cover common AI safety and quality scenarios&lt;/li&gt;
&lt;li&gt;OTel trace integration allows judging real production conversations without replay&lt;/li&gt;
&lt;li&gt;ASSERT is the evaluation layer of Microsoft's Open Trust Stack; ACS handles runtime enforcement; both share the OpenInference telemetry standard&lt;/li&gt;
&lt;li&gt;Best use: CI gate on every agent deployment to verify model or prompt changes do not introduce policy regressions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Bottom Line&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;ASSERT gives developers a principled path from "we wrote a policy doc" to "we have a test suite that runs in CI." The MIT license and LiteLLM backend mean there is no Azure commitment required. At v0.1.0 the API surface is likely to shift, but the core concept (spec-driven evaluation rather than generic benchmarks) is the right architecture for teams serious about AI behavior reliability.&lt;/p&gt;

</description>
      <category>microsoft</category>
      <category>aievaluation</category>
      <category>agentsafety</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Microsoft ACS SDK: Agent Control Sandbox PoC</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Thu, 04 Jun 2026 00:10:13 +0000</pubDate>
      <link>https://dev.to/jangwook_kim_e31e7291ad98/microsoft-acs-sdk-agent-control-sandbox-poc-4036</link>
      <guid>https://dev.to/jangwook_kim_e31e7291ad98/microsoft-acs-sdk-agent-control-sandbox-poc-4036</guid>
      <description>&lt;p&gt;Microsoft's Agent Control Specification is one of the more practical Build 2026 ideas because it targets a gap every serious agent team eventually hits: prompts are not controls. If an AI agent can call tools, write files, update tickets, query internal data, or invoke another agent, the runtime needs a deterministic place to say "allow," "deny," or "modify" before the action reaches the real system.&lt;/p&gt;

&lt;p&gt;The naming is still easy to confuse. Microsoft's Build recap calls ACS the &lt;strong&gt;Agent Control Specification&lt;/strong&gt;, the public community site uses &lt;strong&gt;Agent Control Standard&lt;/strong&gt;, and the installable package Effloow Lab tested is &lt;code&gt;@microsoft/agent-governance-sdk@4.0.0&lt;/code&gt;, a public-preview TypeScript SDK from the Agent Governance Toolkit. This article uses "ACS-style control" for the pattern and is careful not to claim that every framework-specific adapter is generally available.&lt;/p&gt;

&lt;p&gt;Effloow Lab ran a local sandbox PoC for this article. The lab installed the TypeScript SDK, installed the Python &lt;code&gt;agent-governance-toolkit==4.0.0&lt;/code&gt; package in a virtualenv, and used the SDK's &lt;code&gt;GenericFrameworkAdapter&lt;/code&gt; to allow one simulated tool call while denying a destructive shell-style action before its handler ran. The evidence note is at &lt;code&gt;data/lab-runs/microsoft-acs-sdk-agent-control-multi-framework-sandbox-poc-2026.md&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Effloow Lab&lt;/strong&gt;: Local sandbox on macOS with Python 3.12.8, Node v25.9.0, npm 11.12.1, &lt;code&gt;@microsoft/agent-governance-sdk@4.0.0&lt;/code&gt;, and &lt;code&gt;agent-governance-toolkit==4.0.0&lt;/code&gt;. No model API, Microsoft Foundry deployment, LangChain run, CrewAI run, or production MCP server was tested.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why ACS Matters
&lt;/h2&gt;

&lt;p&gt;Most agent frameworks already have a way to define tools. That is not the same as governing tools. A LangChain, CrewAI, OpenAI Agents SDK, Semantic Kernel, or custom agent can expose a tool schema and still leave critical questions to application code: who is allowed to call the tool, which arguments are safe, which state transitions are legal, what must be logged, and when a human approval should interrupt the flow.&lt;/p&gt;

&lt;p&gt;Microsoft's &lt;a href="https://devblogs.microsoft.com/foundry/whats-new-in-microsoft-foundry-build-2026/" rel="noopener noreferrer"&gt;Foundry Build 2026 recap&lt;/a&gt; frames ACS as an open source control layer for deterministic checks at five checkpoints: input, LLM, state, tool execution, and output. The related &lt;a href="https://devblogs.microsoft.com/foundry/build-2026-open-trust-stack-ai-agents/" rel="noopener noreferrer"&gt;trust-stack announcement&lt;/a&gt; describes ACS as a portable policy contract for agent safety controls, expressed in YAML and intended to work across frameworks.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://agentcontrolstandard.ai/" rel="noopener noreferrer"&gt;Agent Control Standard site&lt;/a&gt; makes the same point in different words: agent platforms should expose runtime hooks, open source tooling should enforce policies through those hooks, and enterprises should be able to plug in their own classifiers, detectors, and security tools. That puts ACS closer to a runtime control plane than a prompt-writing convention.&lt;/p&gt;

&lt;p&gt;This direction also aligns with the broader agent security landscape. OWASP's &lt;a href="https://genai.owasp.org/resource/agentic-ai-threats-and-mitigations/" rel="noopener noreferrer"&gt;Agentic AI threats and mitigations guide&lt;/a&gt; treats autonomous agents as systems with goal hijacking, tool misuse, identity abuse, memory poisoning, cascading failures, and rogue-agent risks. Those are runtime risks. A system prompt can describe desired behavior, but it cannot reliably prove that a tool call was blocked before execution.&lt;/p&gt;
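
&lt;p&gt;To make the "allow, deny, or modify before the action reaches the real system" idea concrete, here is a hand-rolled sketch of a deterministic tool-call gate. It is not the Agent Governance SDK API. The &lt;code&gt;evaluate&lt;/code&gt; and &lt;code&gt;gate&lt;/code&gt; functions, the rule shape, and the first-match-wins semantics are all assumptions chosen for illustration, and handlers are synchronous to keep the example short.&lt;/p&gt;

```javascript
// Hand-rolled illustration of a deterministic tool-call gate.
// NOT the Agent Governance SDK API; every name here is hypothetical.
function evaluate(rules, action) {
  // First matching rule wins; "*" matches any action.
  for (const rule of rules) {
    if (rule.action === action || rule.action === "*") return rule;
  }
  return { action, effect: "deny" }; // default-deny when nothing matches
}

function gate(rules, call, handler) {
  const rule = evaluate(rules, call.name);
  if (rule.effect === "deny") {
    // The handler is never invoked, so the side effect cannot happen.
    return { decision: "deny", handlerRan: false };
  }
  // A "modify" rule rewrites the input before the handler sees it.
  const input = rule.effect === "modify" ? rule.rewrite(call.input) : call.input;
  return { decision: rule.effect, handlerRan: true, output: handler(input) };
}

const rules = [
  { action: "search_docs", effect: "allow" },
  {
    action: "send_email",
    effect: "modify",
    rewrite: (input) => ({ ...input, to: "review-queue@example.com" }),
  },
  { action: "*", effect: "deny" },
];

const allowed = gate(rules, { name: "search_docs", input: { query: "ACS" } }, (input) => input.query);
const modified = gate(rules, { name: "send_email", input: { to: "someone@example.com" } }, (input) => input.to);
const denied = gate(rules, { name: "shell.rm", input: {} }, () => "should not run");

console.log(JSON.stringify({ allowed, modified, denied }, null, 2));
```

&lt;p&gt;The property worth noticing is that the decision is made in ordinary code before the handler is called. That is what lets a runtime show a denied action never executed, which a system prompt alone cannot do.&lt;/p&gt;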

&lt;h2&gt;
  
  
  What Shipped Versus What Is Still Emerging
&lt;/h2&gt;

&lt;p&gt;Developers should separate three layers.&lt;/p&gt;

&lt;p&gt;First, ACS is the open specification direction. The &lt;a href="https://github.com/Agent-Control-Standard/ACS" rel="noopener noreferrer"&gt;ACS GitHub repository&lt;/a&gt; describes instrumentable, traceable, and inspectable agents, plus work around OpenTelemetry mapping and Agent Bills of Materials. Its roadmap still reads like an evolving standard: public preview documentation and definitions now, then deeper instrumentation and Guardian Agent samples later.&lt;/p&gt;

&lt;p&gt;Second, Microsoft has a concrete Agent Governance Toolkit. The &lt;a href="https://github.com/microsoft/agent-governance-toolkit" rel="noopener noreferrer"&gt;toolkit repository&lt;/a&gt; lists install commands for Python, TypeScript, .NET, Rust, Go, and developer surfaces such as Copilot CLI and Claude Code. The TypeScript package page exposed &lt;code&gt;@microsoft/agent-governance-sdk@4.0.0&lt;/code&gt; as a public preview package for identity, trust scoring, policy evaluation, and audit logging.&lt;/p&gt;

&lt;p&gt;Third, framework integration is the product promise. The Build material says ACS and related tracing/evaluation tools are intended to work across major stacks. The local PoC did not validate real LangChain, CrewAI, OpenAI Agents SDK, Anthropic Agents SDK, AutoGen, Semantic Kernel, Microsoft.Extensions.AI, or MCP integrations. It validated the generic adapter pattern that such integrations can use.&lt;/p&gt;

&lt;p&gt;That distinction matters. The right takeaway is not "rewrite your agent stack around ACS today." The right takeaway is "start treating runtime control points as a first-class architecture layer, and watch ACS/Agent Governance Toolkit maturity closely."&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Sandbox Installed
&lt;/h2&gt;

&lt;p&gt;The sandbox ran in &lt;code&gt;/tmp/effloow-acs-poc-2026&lt;/code&gt; and started with local environment checks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Python 3.12.8
v25.9.0
11.12.1
zsh:1: command not found: pip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The missing bare &lt;code&gt;pip&lt;/code&gt; command was not a blocker. The lab used &lt;code&gt;python3 -m venv&lt;/code&gt; and &lt;code&gt;python3 -m pip&lt;/code&gt; inside the virtualenv.&lt;/p&gt;

&lt;p&gt;Package discovery found the TypeScript SDK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"4.0.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"@microsoft/agent-governance-sdk"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Public Preview — TypeScript SDK for the Agent Governance Toolkit: agent identity, trust scoring, policy evaluation, and audit logging"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Python package discovery found:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;agent-governance-toolkit (4.0.0)
Available versions: 4.0.0, 3.7.0, 3.6.0, 3.5.0, 3.4.0, 3.3.0, 3.2.2, 3.2.1, 3.2.0, 3.1.0, 3.0.2, 3.0.1, 3.0.0, 2.3.0, 2.1.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The TypeScript install completed cleanly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm init &lt;span class="nt"&gt;-y&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; @microsoft/agent-governance-sdk@4.0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Relevant output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;added 7 packages, and audited 8 packages in 937ms
found 0 vulnerabilities
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Python install also completed in the virtualenv:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/tmp/effloow-acs-poc-2026/.venv/bin/python &lt;span class="nt"&gt;-m&lt;/span&gt; pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s1"&gt;'agent-governance-toolkit==4.0.0'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Relevant output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Successfully installed agent-governance-toolkit-4.0.0 annotated-types-0.7.0 click-8.4.1 pydantic-2.13.4 pydantic-core-2.46.4 pyyaml-6.0.3 typing-extensions-4.15.0 typing-inspection-0.4.2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SDK exported the pieces needed for a local checkpoint demo: &lt;code&gt;AgentMeshClient&lt;/code&gt;, &lt;code&gt;GenericFrameworkAdapter&lt;/code&gt;, &lt;code&gt;PolicyEngine&lt;/code&gt;, &lt;code&gt;AuditLogger&lt;/code&gt;, &lt;code&gt;TraceCapture&lt;/code&gt;, &lt;code&gt;GovernanceVerifier&lt;/code&gt;, &lt;code&gt;McpSecurityScanner&lt;/code&gt;, and &lt;code&gt;TrustManager&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reproduce the Local Tool-Call Gate
&lt;/h2&gt;

&lt;p&gt;The PoC used the SDK's generic adapter as a framework-neutral stand-in for a real LangChain callback, CrewAI decorator, OpenAI Agents hook, or custom middleware wrapper.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;AgentMeshClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;GenericFrameworkAdapter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@microsoft/agent-governance-sdk&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;AgentMeshClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;effloow-sandbox-agent&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;policyRules&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;framework.tool_call.search_docs&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;effect&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;allow&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;framework.tool_call.summarize&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;effect&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;allow&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;framework.tool_call.shell.rm&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;effect&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;deny&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;*&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;effect&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;deny&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;adapter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;GenericFrameworkAdapter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;allowed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;adapter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;search_docs&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;tool_call&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ACS policy checkpoints&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;input&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;tool&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;output&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;blockedHandlerRan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;blocked&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;adapter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;shell.rm&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;tool_call&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;rm -rf /tmp/not-actually-run&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;blockedHandlerRan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;deleted&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;allowed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;allowed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;governanceResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;allowed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;allowed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;allowed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;allowed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;blocked&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;blocked&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;governanceResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;allowed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;blocked&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;allowed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;handlerRan&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;blockedHandlerRan&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;blocked&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;auditChainValid&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;audit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;verify&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;auditEntries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;audit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getEntries&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;node acs-checkpoint-demo.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"allowed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"decision"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"allowed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"output"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"items"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"input"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"tool"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"output"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"blocked"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"decision"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"deny"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"allowed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"handlerRan"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Governance denied action &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;framework.tool_call.shell.rm&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"auditChainValid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"auditEntries"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important field is &lt;code&gt;handlerRan: false&lt;/code&gt;. The denied action did not merely fail after execution. It was blocked before the handler body ran. That is the behavior teams want for destructive tools, privileged file operations, deployment actions, customer-data exports, and cross-agent handoffs.&lt;/p&gt;

&lt;h2&gt;
  
  
  How This Maps to Real Agent Frameworks
&lt;/h2&gt;

&lt;p&gt;The generic adapter pattern is straightforward:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Convert each framework event into a normalized invocation.&lt;/li&gt;
&lt;li&gt;Resolve that invocation to an action string.&lt;/li&gt;
&lt;li&gt;Evaluate the policy before the handler runs.&lt;/li&gt;
&lt;li&gt;Run the handler only on &lt;code&gt;allow&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Record the decision in audit and trace data.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In LangChain, the event might be a callback around tool start. In CrewAI, it might be a wrapped task. In OpenAI Agents SDK, it might sit near a function tool or guardrail boundary. In Semantic Kernel, it might live in middleware around function invocation. In a custom agent, it can be a plain wrapper around every tool function.&lt;/p&gt;
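
&lt;p&gt;The plain-wrapper variant of those five steps can be sketched as follows. This is a minimal illustration, not the Agent Governance SDK's API: &lt;code&gt;createAllowListPolicy&lt;/code&gt;, &lt;code&gt;withCheckpoint&lt;/code&gt;, and the shape of the audit entries are all hypothetical names chosen for this example, and the handlers are synchronous only to keep the sketch short.&lt;/p&gt;

```javascript
// Hypothetical sketch: none of these names come from the Agent Governance SDK.

// Step 3 helper: a deny-by-default policy backed by an explicit allow list.
function createAllowListPolicy(allowedActions) {
  const allowed = new Set(allowedActions);
  return function evaluate(action) {
    return allowed.has(action) ? "allow" : "deny";
  };
}

// Wrap a tool handler so the policy is evaluated before the handler body runs.
function withCheckpoint(options) {
  const { framework, toolName, handler, evaluate, auditLog } = options;
  return function guardedTool(args) {
    // Steps 1 and 2: normalize the event and resolve it to an action string.
    const invocation = { framework, kind: "tool_call", tool: toolName, args };
    const action = [invocation.framework, invocation.kind, invocation.tool].join(".");

    // Step 3: evaluate the policy before anything executes.
    const decision = evaluate(action);

    // Step 5: record the decision whether or not the handler runs.
    auditLog.push({ action, decision, at: new Date().toISOString() });

    // Step 4: run the handler only on "allow".
    if (decision !== "allow") {
      return { allowed: false, decision, reason: `Governance denied action "${action}"` };
    }
    return { allowed: true, decision, output: handler(args) };
  };
}

// Usage: one permitted tool and one destructive tool that has no allow rule.
const auditLog = [];
const evaluate = createAllowListPolicy(["framework.tool_call.search"]);
let rmHandlerRan = false;

const search = withCheckpoint({
  framework: "framework",
  toolName: "search",
  handler: (args) => ({ query: args.query, hits: 0 }),
  evaluate,
  auditLog,
});
const rm = withCheckpoint({
  framework: "framework",
  toolName: "shell.rm",
  handler: () => { rmHandlerRan = true; return "deleted"; },
  evaluate,
  auditLog,
});

const allowedResult = search({ query: "docs" });
const blockedResult = rm({ path: "/tmp/example" });
console.log(allowedResult.decision, blockedResult.decision, rmHandlerRan, auditLog.length);
```

&lt;p&gt;The audit entry is written before the allow/deny branch, so denied attempts leave a record even though their handler never runs. A real adapter would likely need to support async handlers and the framework's own event objects.&lt;/p&gt;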

&lt;p&gt;The action naming convention is the part developers should design early. A flat name such as &lt;code&gt;delete&lt;/code&gt; is too vague. A structured name such as &lt;code&gt;framework.tool_call.shell.rm&lt;/code&gt;, &lt;code&gt;crm.contact.read&lt;/code&gt;, &lt;code&gt;deploy.production.start&lt;/code&gt;, or &lt;code&gt;memory.customer.write&lt;/code&gt; gives the policy engine enough shape to express meaningful rules.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;crm.contact.read"&lt;/span&gt;
    &lt;span class="na"&gt;effect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;allow"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;crm.contact.export"&lt;/span&gt;
    &lt;span class="na"&gt;effect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deny"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deploy.production.*"&lt;/span&gt;
    &lt;span class="na"&gt;effect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deny"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*"&lt;/span&gt;
    &lt;span class="na"&gt;effect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deny"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The final catch-all deny matters. Agent systems should fail closed. If a new tool appears and nobody wrote a policy for it, the default should not be silent permission.&lt;/p&gt;
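
&lt;p&gt;To make the fail-closed behavior concrete, here is a small evaluator for rules shaped like the YAML above. The matching semantics are assumptions made for this sketch (first matching rule wins, a trailing &lt;code&gt;.*&lt;/code&gt; matches any action under that prefix); the real policy engine may resolve rules differently, so check its documentation before relying on rule order.&lt;/p&gt;

```javascript
// Assumption: rules are checked in order and the first matching rule wins.
// A trailing ".*" matches any action under that prefix; "*" alone matches everything.
const rules = [
  { action: "crm.contact.read", effect: "allow" },
  { action: "crm.contact.export", effect: "deny" },
  { action: "deploy.production.*", effect: "deny" },
  { action: "*", effect: "deny" },
];

function matches(pattern, action) {
  if (pattern === "*") return true;
  if (pattern.endsWith(".*")) {
    // Keep the trailing dot so "deploy.production.*" does not match "deploy.productionx".
    return action.startsWith(pattern.slice(0, -1));
  }
  return pattern === action;
}

function evaluate(ruleList, action) {
  for (const rule of ruleList) {
    if (matches(rule.action, action)) return rule.effect;
  }
  // Fail closed even if someone removes the catch-all rule.
  return "deny";
}

console.log(evaluate(rules, "crm.contact.read"));        // allow
console.log(evaluate(rules, "deploy.production.start")); // deny
console.log(evaluate(rules, "billing.invoice.create"));  // deny (no rule written for it)
```

&lt;p&gt;The final &lt;code&gt;return "deny"&lt;/code&gt; duplicates the catch-all rule on purpose: the default stays closed even when the rule file is edited carelessly.&lt;/p&gt;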

&lt;h2&gt;
  
  
  Where ACS Fits with OpenTelemetry and MCP
&lt;/h2&gt;

&lt;p&gt;ACS is not trying to replace observability or tool protocols. It sits between them.&lt;/p&gt;

&lt;p&gt;MCP standardizes how agents discover and call tools. A2A standardizes agent-to-agent communication. OpenTelemetry gives teams a common way to trace model calls, tool calls, and agent spans. The &lt;a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/" rel="noopener noreferrer"&gt;OpenTelemetry GenAI semantic conventions&lt;/a&gt; already define GenAI signals for events, exceptions, metrics, model spans, agent spans, and framework spans.&lt;/p&gt;

&lt;p&gt;ACS-style control asks a different question: before this event becomes a real action, what policy decision should apply? A production architecture will usually need all three layers (policy checkpoint, tool protocol, and telemetry), in this order:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;agent framework
  -&amp;gt; ACS-style policy checkpoint
  -&amp;gt; MCP/tool/runtime call
  -&amp;gt; OpenTelemetry trace and audit record
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is why ACS is interesting for teams already reading about agent observability. Effloow previously covered &lt;a href="https://dev.to/articles/opentelemetry-genai-llm-agent-tracing-sandbox-poc-2026"&gt;OpenTelemetry GenAI agent tracing&lt;/a&gt; as the visibility layer. ACS adds the enforcement layer. Effloow also covered &lt;a href="https://dev.to/articles/openai-agents-sdk-guardrails-local-sandbox-poc-2026"&gt;OpenAI Agents SDK guardrails&lt;/a&gt;, which are useful at SDK boundaries. ACS-style policy becomes more relevant when the same control logic must travel across several frameworks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Adoption Path
&lt;/h2&gt;

&lt;p&gt;Do not start by governing everything. Start with one dangerous tool class.&lt;/p&gt;

&lt;p&gt;A good first target is a tool that can send data outside the system, mutate production state, spend money, or trigger a deploy. Wrap that tool with a policy checkpoint and make the default deny. Then add explicit allow rules for narrow cases.&lt;/p&gt;

&lt;p&gt;For an internal coding agent, the first policies might be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;allow: repo.read
allow: test.run
allow: file.write under workspace path
deny: shell.rm
deny: git.push
deny: secrets.read
deny: deploy.production
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
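
&lt;p&gt;A rule such as &lt;code&gt;file.write under workspace path&lt;/code&gt; needs a condition on the tool arguments, not just on the action name. One way to sketch that check with Node's built-in &lt;code&gt;path&lt;/code&gt; module is shown below. The workspace root and function names are hypothetical, and the check is lexical only: it does not resolve symlinks, which a production gate would also need to handle.&lt;/p&gt;

```javascript
const path = require("node:path");

// Hypothetical workspace root; a real agent would read this from configuration.
const WORKSPACE_ROOT = "/home/agent/workspace";

// True only when the target resolves to a location inside the workspace.
function isInsideWorkspace(targetPath) {
  const resolved = path.resolve(WORKSPACE_ROOT, targetPath);
  const relative = path.relative(WORKSPACE_ROOT, resolved);
  if (relative === "") return true; // the workspace root itself
  // Any path that climbs out of the root starts with a ".." segment.
  if (relative === ".." || relative.startsWith(".." + path.sep)) return false;
  // On Windows, a target on another drive yields an absolute "relative" path.
  return !path.isAbsolute(relative);
}

// Policy condition for the file.write action.
function evaluateFileWrite(targetPath) {
  return isInsideWorkspace(targetPath) ? "allow" : "deny";
}

console.log(evaluateFileWrite("src/index.js"));     // allow
console.log(evaluateFileWrite("../../etc/passwd")); // deny
console.log(evaluateFileWrite("/etc/passwd"));      // deny
```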



&lt;p&gt;For a support agent, the first policies might be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;allow: ticket.read
allow: knowledge.search
deny: customer.email.send without human approval
deny: refund.issue above configured amount
deny: customer.pii.export
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the first checkpoint works, attach audit output to your trace pipeline. That is where ACS and OpenTelemetry become operationally useful: an incident review should show which action was attempted, which policy matched, whether the action was allowed or denied, and which trace contained the decision.&lt;/p&gt;
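
&lt;p&gt;A record that answers those four incident-review questions could look like the sketch below. The field names are illustrative and are not taken from an ACS or OpenTelemetry schema. The trace ID is generated locally here as a stand-in; in a real pipeline it would come from the active span in your tracing library.&lt;/p&gt;

```javascript
const crypto = require("node:crypto");

// Hypothetical record shape; field names are illustrative, not an ACS or OpenTelemetry schema.
function buildAuditRecord({ action, matchedRule, decision, traceId }) {
  return {
    timestamp: new Date().toISOString(),
    action,       // which action was attempted
    matchedRule,  // which policy rule matched
    decision,     // "allow" or "deny"
    traceId,      // which trace contains the decision
  };
}

// Stand-in for the active trace ID.
// W3C Trace Context trace IDs are 16 bytes, written as 32 lowercase hex characters.
const traceId = crypto.randomBytes(16).toString("hex");

const record = buildAuditRecord({
  action: "deploy.production.start",
  matchedRule: "deploy.production.*",
  decision: "deny",
  traceId,
});

// One JSON object per line is easy to ship to a log or trace pipeline.
console.log(JSON.stringify(record));
```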

&lt;h2&gt;
  
  
  Limitations from This Run
&lt;/h2&gt;

&lt;p&gt;The sandbox evidence behind this article is real, but its limits matter.&lt;/p&gt;

&lt;p&gt;Effloow Lab did not run a live model. It did not deploy to Microsoft Foundry. It did not test a production ACS YAML contract against a conformance suite. It did not run real LangChain, CrewAI, OpenAI Agents SDK, Anthropic Agents SDK, AutoGen, Semantic Kernel, Microsoft.Extensions.AI, or MCP integrations. It did not verify every package listed in the Agent Governance Toolkit repository.&lt;/p&gt;

&lt;p&gt;The sandbox proves local installability for the public TypeScript and Python packages and proves that the TypeScript generic adapter can block a simulated tool call before execution. That is a meaningful control primitive, not a complete production governance system.&lt;/p&gt;

&lt;p&gt;There is also a maturity caveat. The SDK README labels the npm package as public preview and warns that APIs may change before GA. Treat this as a candidate control layer for prototypes, internal evaluation, and architecture planning rather than a drop-in compliance guarantee.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Mistakes
&lt;/h2&gt;

&lt;p&gt;The first mistake is treating ACS as a better system prompt. Runtime controls should be enforced by code, policy engines, middleware, adapters, and audit logs. A system prompt can explain policy to the model, but it should not be the only enforcement mechanism.&lt;/p&gt;

&lt;p&gt;The second mistake is logging everything. Tool arguments and model inputs can contain secrets, personal data, or regulated business content. The control layer should record policy decisions and enough metadata for audit, but sensitive payload capture needs separate redaction and retention rules.&lt;/p&gt;

&lt;p&gt;The third mistake is writing policies after the agent is already broad. Start with narrow action names and deny-by-default behavior before the tool catalog grows. Retrofitting policy onto a large agent surface is harder because every tool name, argument shape, and workflow exception already exists.&lt;/p&gt;

&lt;p&gt;The fourth mistake is assuming framework integration means framework independence. A portable policy contract helps, but each framework still has different lifecycle events. Validate the exact callback, middleware, or adapter path your production agent will use.&lt;/p&gt;
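
&lt;p&gt;For the logging mistake in particular, a small redaction step in front of the audit writer goes a long way. The sketch below masks values by argument name before an entry is recorded. The key list and the &lt;code&gt;redactArgs&lt;/code&gt; helper are hypothetical, and name-based masking is only a first line of defense: it will not catch secrets embedded in free-text fields.&lt;/p&gt;

```javascript
// Hypothetical list of sensitive argument names; tune it to your own tools.
const SENSITIVE_KEYS = new Set(["password", "token", "apikey", "authorization", "email", "ssn"]);

// Return a copy of the tool arguments with sensitive values masked,
// recursing into nested objects and arrays. The input is not mutated.
function redactArgs(value) {
  if (Array.isArray(value)) return value.map(redactArgs);
  if (value === null || typeof value !== "object") return value;
  const out = {};
  for (const [key, inner] of Object.entries(value)) {
    out[key] = SENSITIVE_KEYS.has(key.toLowerCase()) ? "[REDACTED]" : redactArgs(inner);
  }
  return out;
}

const auditEntry = {
  action: "crm.contact.export",
  decision: "deny",
  args: redactArgs({ contactId: 42, email: "a@example.com", auth: { token: "secret" } }),
};
console.log(JSON.stringify(auditEntry));
```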

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Q: Is Microsoft ACS the same as Agent Governance Toolkit?
&lt;/h3&gt;

&lt;p&gt;Not exactly. ACS is the open control specification or standard direction. The Agent Governance Toolkit is Microsoft's concrete open source toolkit with installable SDK packages. In this sandbox, Effloow Lab tested &lt;code&gt;@microsoft/agent-governance-sdk@4.0.0&lt;/code&gt; and &lt;code&gt;agent-governance-toolkit==4.0.0&lt;/code&gt;, not a full ACS conformance suite.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Can ACS replace OpenAI Agents SDK guardrails?
&lt;/h3&gt;

&lt;p&gt;No. Guardrails inside a specific SDK are still useful. ACS-style control is more about a portable runtime policy layer that can sit across frameworks and tool boundaries. In practice, teams may use both: SDK guardrails for local input/output/tool checks and ACS-style policies for cross-framework governance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Does ACS require Microsoft Foundry?
&lt;/h3&gt;

&lt;p&gt;The public materials describe ACS as open and framework-agnostic, and the SDK packages installed locally without Microsoft Foundry. Foundry may provide managed workflows around governance, tracing, and evaluation, but the local PoC did not require Foundry credentials.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Should production teams adopt the SDK today?
&lt;/h3&gt;

&lt;p&gt;Use it for evaluation and internal prototypes first. The npm README labels the package public preview, and the ACS repository still shows an evolving standard. The architectural pattern is worth adopting now: name actions clearly, gate risky tools before execution, fail closed, and emit audit records.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;ACS matters because agent teams need runtime controls that are stronger than prompt instructions and more portable than one-off application checks.&lt;/p&gt;

&lt;p&gt;Effloow Lab verified that the public Microsoft Agent Governance SDK can be installed locally and can deny a simulated tool call before its handler executes. The audit chain also verified successfully after the allowed and denied actions.&lt;/p&gt;

&lt;p&gt;The production decision is more cautious: ACS and the Agent Governance Toolkit are promising, but teams should validate the exact framework adapter, policy syntax, trace output, and compliance requirements in their own stack before treating it as a governance baseline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bottom Line&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;ACS-style runtime control is the right direction for multi-framework agents. The local SDK is already useful for sandboxing policy gates, but the current evidence supports prototype adoption, not blanket production readiness claims.&lt;/p&gt;

</description>
      <category>microsoft</category>
      <category>acs</category>
      <category>agentgovernance</category>
      <category>aiagents</category>
    </item>
    <item>
      <title>Microsoft Build 2026: Windows Agent Runtime and Project Polaris</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Wed, 03 Jun 2026 08:20:00 +0000</pubDate>
      <link>https://dev.to/jangwook_kim_e31e7291ad98/microsoft-build-2026-windows-agent-runtime-and-project-polaris-2mdl</link>
      <guid>https://dev.to/jangwook_kim_e31e7291ad98/microsoft-build-2026-windows-agent-runtime-and-project-polaris-2mdl</guid>
      <description>&lt;p&gt;Microsoft Build 2026 (June 2–3, San Francisco) arrived with a clear message: Windows is no longer just a surface for running AI applications. With the Windows Agent Runtime announcement, Microsoft repositioned the OS as a first-class agent execution layer, complete with sandboxing primitives, a distribution marketplace, and local model infrastructure that rivals cloud VM deployment.&lt;/p&gt;

&lt;p&gt;Three announcements stand out for developers: the &lt;strong&gt;Windows Agent Runtime&lt;/strong&gt; (WAR), the &lt;strong&gt;Windows Agent Store&lt;/strong&gt;, and &lt;strong&gt;Project Polaris&lt;/strong&gt; — Microsoft's first homegrown coding model for GitHub Copilot. WSL 3 and the new MAI model family complete a developer stack that Microsoft is explicitly framing as an alternative to cloud-first agent deployment.&lt;/p&gt;

&lt;p&gt;This guide covers what each announcement means for your workflow and what you can start building today. Effloow Lab reviewed official Microsoft blog posts, developer coverage, and technical write-ups published at Build.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Build 2026 Is Different
&lt;/h2&gt;

&lt;p&gt;Microsoft has shipped developer tools at Build for decades, but the framing at Build 2026 is new: the goal is to make Windows the canonical execution environment for autonomous agents — not as a thin client that calls cloud APIs, but as a platform with OS-level lifecycle management, sandboxing, and a distribution channel.&lt;/p&gt;

&lt;p&gt;The mobile app permission model analogy runs through every WAR announcement. Capability grants work like iOS/Android permissions: agents declare what they need, users approve at install time, and the OS enforces the boundary at runtime. For developers, the payoff is that you get hardware-backed sandboxing without writing your own containerization logic.&lt;/p&gt;

&lt;p&gt;That said, Build 2026 is also notable for what it didn't announce: no Windows 12 preview, no major Azure pricing changes, and no new Claude/Gemini partnership announcements. The focus was squarely on Windows-as-agent-platform and the Microsoft AI (MAI) model family.&lt;/p&gt;

&lt;h2&gt;
  
  
  Windows Agent Runtime: OS-Level Agent Sandboxing
&lt;/h2&gt;

&lt;p&gt;The Windows Agent Runtime preview ships to Windows Insiders on &lt;strong&gt;June 9, 2026&lt;/strong&gt; via KB5039239 (Windows 11 version 24H2).&lt;/p&gt;

&lt;h3&gt;
  
  
  Hardware Requirements
&lt;/h3&gt;

&lt;p&gt;WAR requires a minimum of &lt;strong&gt;40 TOPS&lt;/strong&gt; of NPU capacity — which rules out pre-Copilot+ machines. The runtime ships with two bundled inference models:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Phi-4-mini-silicon&lt;/strong&gt; (2B parameters) — text-only tasks, available at launch&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phi-4-vision-silicon&lt;/strong&gt; (7B parameters) — image understanding, roadmapped for 2027&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;-silicon&lt;/code&gt; suffix distinguishes these from the standard Phi-4 weights available on Hugging Face: they are NPU-optimized variants compiled for Intel, AMD, and Qualcomm architectures. Because the models are bundled, agents can run inference locally without an API key, which matters for enterprise deployments with data residency requirements.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Capability Grant System
&lt;/h3&gt;

&lt;p&gt;The security model is the most developer-relevant aspect of WAR. Every agent declares its required permissions at install time across three dimensions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;File system scope&lt;/strong&gt; — which directories the agent can read and write&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network access&lt;/strong&gt; — specific endpoints or domains the agent can reach&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Application launch permissions&lt;/strong&gt; — what the agent can invoke on the host&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Users approve these grants during installation, analogous to mobile app permission dialogs. The OS enforces the boundaries at runtime; agents cannot silently expand scope after installation.&lt;/p&gt;

&lt;p&gt;For higher-risk workloads — code execution agents, agents handling credentials, agents running subprocesses — Microsoft introduced the &lt;strong&gt;Microsoft Execution Containers (MXC) SDK&lt;/strong&gt;, a cross-platform policy-driven execution layer that provisions micro-VMs backed by the Windows hypervisor. MXC is heavier than the standard WAR sandbox, but provides genuine VM-level isolation against sandbox escapes. The distinction matters when choosing the right primitive for your agent type.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Windows Agent Store
&lt;/h3&gt;

&lt;p&gt;Alongside WAR, Microsoft announced the &lt;strong&gt;Windows Agent Store&lt;/strong&gt; — a curated marketplace for agent distribution directly within Windows with an &lt;strong&gt;85% revenue share&lt;/strong&gt; for developers. Agents submitted to the store go through a Microsoft security review covering capability disclosure, data handling policy declaration, and sandboxing compliance verification.&lt;/p&gt;

&lt;p&gt;For developers, this is the first OS-level distribution channel for agents that bundles both discovery and monetization infrastructure. The model mirrors what app stores did for mobile: standardize the trust model, lower the distribution friction, and let developers focus on agent behavior rather than deployment mechanics.&lt;/p&gt;

&lt;h3&gt;
  
  
  What the Preview Does Not Include
&lt;/h3&gt;

&lt;p&gt;At launch, WAR only supports text-based agents operating on JSON, XML, and PDF content. Vision-capable agents — those that observe screen state and interact with UI elements — are not scheduled until 2027. Developers building screen-reader-style automation or UI testing agents will need to continue with Win32 accessibility APIs for now.&lt;/p&gt;

&lt;p&gt;Sideloading behavior for WAR agents (analogous to Windows developer mode for UWP) was not confirmed in Build 2026 materials. The Agent Store appears to be the primary distribution path at launch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Polaris: GitHub Copilot Gets a Homegrown Model
&lt;/h2&gt;

&lt;p&gt;The second major announcement is as much strategic as technical. &lt;strong&gt;Project Polaris&lt;/strong&gt; is Microsoft's own mixture-of-experts coding model, and it replaces GPT-4 Turbo as the default engine inside GitHub Copilot starting &lt;strong&gt;August 2026&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture and Performance
&lt;/h3&gt;

&lt;p&gt;Project Polaris uses specialized MoE sub-modules per programming language and paradigm, applying chain-of-thought and tree-of-thought reasoning at inference time. Microsoft's internal benchmarks report it outperforming GPT-4 Turbo on HumanEval and MBPP, with particularly strong results in Rust, Haskell, and Go — lower-resource languages where GPT-4 Turbo's training distribution is thinner.&lt;/p&gt;

&lt;p&gt;These are self-reported figures and have not been independently verified at the time of writing. The HumanEval and MBPP comparisons are against GPT-4 Turbo specifically — not against GPT-5.5 or Claude Opus 4.8, which are the current coding benchmark leaders.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rollout and Transition
&lt;/h3&gt;

&lt;p&gt;The Polaris switch is automatic for all Copilot Pro subscribers in August 2026. Microsoft is offering an optional &lt;strong&gt;three-month fallback period&lt;/strong&gt; to GPT-4 Turbo for teams that need to validate behavior before fully cutting over. If you're on GitHub Copilot Enterprise, model preference controls will appear in the admin console before the August rollout.&lt;/p&gt;

&lt;h3&gt;
  
  
  What This Means for Teams
&lt;/h3&gt;

&lt;p&gt;The practical question is whether Polaris's different training distribution affects completions your team relies on. Languages with strong open-source training data — Python, JavaScript, TypeScript — are unlikely to regress. The performance gain claims are most pronounced in low-resource languages, which is worth testing if Rust or Haskell are in your stack.&lt;/p&gt;

&lt;p&gt;The broader signal is that Microsoft now controls the full agentic development stack: from the model (Polaris, MAI-Code-1-Flash) to IDE integration (VS Code), to the agent runtime (WAR), to the inference hardware (Copilot+ NPU requirements). This isn't inherently a risk, but it's a vendor consolidation worth factoring into long-term platform decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  MAI-Thinking-1 and MAI-Code-1-Flash
&lt;/h2&gt;

&lt;p&gt;Build 2026 included a second, less-publicized model announcement: two models under the MAI (Microsoft AI) brand that are distinct from Project Polaris.&lt;/p&gt;

&lt;h3&gt;
  
  
  MAI-Thinking-1
&lt;/h3&gt;

&lt;p&gt;MAI-Thinking-1 is Microsoft's first large-scale reasoning model trained entirely on commercially licensed data — explicitly without distillation from OpenAI models. Architecture details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;35 billion active parameters&lt;/strong&gt;, sparse MoE architecture&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;256,000-token context window&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Built using Microsoft's own training infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Microsoft-reported benchmarks: AIME 2025 at 97.0%, AIME 2026 at 94.5%, and SWE-Bench Pro performance described as competitive with Claude Opus 4.6. Independent raters reportedly preferred MAI-Thinking-1 over Claude Sonnet 4.6 in blind evaluations — a claim worth treating as preliminary until third-party verification appears.&lt;/p&gt;

&lt;p&gt;MAI-Thinking-1 is currently in &lt;strong&gt;private preview&lt;/strong&gt; through Microsoft Foundry. It's also accessible via Fireworks AI, Baseten, and OpenRouter for developers who want to avoid Azure lock-in. All three providers expose OpenAI-compatible endpoints, so you can test MAI-Thinking-1 with the standard &lt;code&gt;openai&lt;/code&gt; Python SDK by pointing &lt;code&gt;base_url&lt;/code&gt; at any of them.&lt;/p&gt;
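
&lt;p&gt;The same OpenAI-compatible request shape can be exercised from JavaScript without any SDK. The sketch below only builds the request and does not send it. The base URL and model ID are placeholders, not real values: take both from your provider's documentation once you have access.&lt;/p&gt;

```javascript
// Placeholders: the base URL and model ID below are NOT real values.
// Take both from your provider's documentation once you have preview access.
const BASE_URL = "https://provider.example/v1";
const MODEL_ID = "MODEL_ID_FROM_PROVIDER";

// Build (but do not send) a request in the OpenAI-compatible chat completions shape.
function buildChatRequest(baseUrl, apiKey, model, prompt) {
  return {
    // Strip trailing slashes so the path is joined exactly once.
    url: baseUrl.replace(/\/+$/, "") + "/chat/completions",
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({
        model,
        messages: [{ role: "user", content: prompt }],
      }),
    },
  };
}

const request = buildChatRequest(BASE_URL, "YOUR_API_KEY", MODEL_ID, "Summarize this diff.");
console.log(request.url);
// To send it on Node 18 or later: fetch(request.url, request.init).then((res) => res.json())
```

&lt;p&gt;Keeping the base URL and model ID as parameters makes it easy to run the same prompt against each provider and compare results.&lt;/p&gt;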

&lt;h3&gt;
  
  
  MAI-Code-1-Flash
&lt;/h3&gt;

&lt;p&gt;MAI-Code-1-Flash is the more immediately accessible model: a &lt;strong&gt;5-billion-parameter coding model&lt;/strong&gt; already integrated into GitHub Copilot and VS Code. Key claims from Microsoft:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;+16 percentage points&lt;/strong&gt; over Claude Haiku 4.5 on SWE-Bench Pro&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;60% fewer tokens&lt;/strong&gt; on complex coding tasks&lt;/li&gt;
&lt;li&gt;Trained on production Copilot telemetry and commercially licensed code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The token efficiency figure is the one with immediate cost implications for teams running high-volume code generation in CI pipelines or agentic coding loops. If the 60% figure holds at your input distribution, MAI-Code-1-Flash changes the economics of inline code agents significantly.&lt;/p&gt;
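
&lt;p&gt;A quick back-of-the-envelope calculation shows why that figure matters. Every input below is a made-up placeholder, and the sketch assumes the per-token price stays the same across models, which Microsoft's claim does not address. Substitute your own measured token counts and pricing.&lt;/p&gt;

```javascript
// All inputs below are made-up placeholders; substitute your own measurements and pricing.
const baselineTokensPerTask = 20000; // tokens per task with the current model
const tasksPerMonth = 50000;
const pricePerMillionTokens = 2.0;   // USD, hypothetical and assumed equal for both models
const claimedReduction = 0.6;        // Microsoft's "60% fewer tokens" claim

function monthlyCost(tokensPerTask) {
  return (tokensPerTask * tasksPerMonth / 1e6) * pricePerMillionTokens;
}

const reducedTokensPerTask = baselineTokensPerTask * (1 - claimedReduction);
const before = monthlyCost(baselineTokensPerTask);
const after = monthlyCost(reducedTokensPerTask);

// With these placeholders: 2000.00 before, 800.00 after, 1200.00 saved per month.
console.log(before.toFixed(2), after.toFixed(2), (before - after).toFixed(2));
```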

&lt;h2&gt;
  
  
  WSL 3: Near-Native GPU and NPU for Linux ML Workloads
&lt;/h2&gt;

&lt;p&gt;WSL 3 was announced alongside WAR, and for developers who run Linux-based ML tooling on Windows, it's arguably the most immediately useful Build 2026 announcement.&lt;/p&gt;

&lt;p&gt;The headline improvement: &lt;strong&gt;paravirtualized GPU and NPU access&lt;/strong&gt;. WSL 2 used full hardware virtualization for GPU access (via DirectML), creating a meaningful performance gap compared to bare-metal Linux. WSL 3 uses a lightweight VM architecture that lets the Linux kernel communicate with Windows GPU and NPU hardware at near-native speed.&lt;/p&gt;

&lt;p&gt;Cited benchmarks: &lt;strong&gt;3–5% delta versus bare-metal Linux&lt;/strong&gt; for PyTorch and CUDA workloads. WSL 2 had no NPU access at all — if you wanted to run inference on a Snapdragon Hexagon NPU or Intel AI Boost from your Linux toolchain, it wasn't possible until now.&lt;/p&gt;

&lt;h3&gt;
  
  
  Supported Hardware at Launch
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;GPU Passthrough&lt;/th&gt;
&lt;th&gt;NPU Passthrough&lt;/th&gt;
&lt;th&gt;WSL 3 Status&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Qualcomm Snapdragon X Elite&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes (Hexagon)&lt;/td&gt;
&lt;td&gt;Available now (Insiders)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Intel Meteor Lake&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes (AI Boost)&lt;/td&gt;
&lt;td&gt;Available now (Insiders)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AMD&lt;/td&gt;
&lt;td&gt;Planned&lt;/td&gt;
&lt;td&gt;Planned&lt;/td&gt;
&lt;td&gt;No confirmed timeline&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;WSL 3 is available now through the Windows Insiders program. For developers who need to run Ollama, llama.cpp, vLLM, or PyTorch inside a Linux environment on a Copilot+ PC, this eliminates the primary reason to dual-boot.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Application: Timing and Targets
&lt;/h2&gt;

&lt;p&gt;The Build 2026 announcements land across different timelines and hardware requirements. Here's a developer-oriented summary:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;WSL 3&lt;/strong&gt; — Available now for Snapdragon X Elite and Intel Meteor Lake. If you're on one of these machines and running Linux ML tooling, this is worth testing immediately.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Windows Agent Runtime&lt;/strong&gt; — June 9, 2026 (Windows 11 24H2, KB5039239). Start designing your agent's capability grant manifest now even before the preview lands — the permission schema is documented and shouldn't change between Insider and stable release.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MAI-Code-1-Flash&lt;/strong&gt; — Live now in VS Code and GitHub Copilot. No configuration required; it's already the underlying model for Copilot inline suggestions for some subscribers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project Polaris&lt;/strong&gt; — August 2026 rollout for Copilot Pro. Three-month GPT-4 Turbo fallback available.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MAI-Thinking-1&lt;/strong&gt; — Private preview via Microsoft Foundry; available via Fireworks AI, Baseten, and OpenRouter today for teams accepted into early access.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Common Mistakes to Avoid
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Treating WAR's standard sandbox as appropriate for all agent types.&lt;/strong&gt; The per-agent capability grant system is lightweight — right for text-processing agents on a desktop. Agents that execute arbitrary code, spawn subprocesses, or handle credentials belong on the MXC SDK's micro-VM path. Defaulting to the lighter option because it's simpler to integrate creates a security gap that Microsoft's runtime can't close for you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Taking Polaris benchmark numbers at face value.&lt;/strong&gt; Microsoft's HumanEval and MBPP figures are self-reported. Until independent benchmarks appear (likely Q3 2026 as Polaris rolls out), treat the performance claims as directionally useful but not a basis for architecture decisions. Test against your specific codebase and language mix.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skipping capability manifest design.&lt;/strong&gt; Windows Agent Store review includes capability disclosure as a gate. Agents that request overly broad file system scope or open-ended network access will face review friction. Design your manifest narrowly from the start — it's easier to expand permissions post-approval than to pass initial review with a permissive manifest.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conflating WSL 3 with WSL 2 for NVIDIA workloads.&lt;/strong&gt; The 3–5% performance claim applies to paravirtualized access on Qualcomm and Intel platforms. NVIDIA GPU acceleration in WSL has used its own path since WSL 2 (CUDA on WSL and DirectML, both running over the Windows GPU driver). WSL 3 improves this path too, but the NPU paravirtualization story is specific to Copilot+ PC silicon.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Q: Does Windows Agent Runtime work on Windows 10?
&lt;/h3&gt;

&lt;p&gt;No. WAR ships in Windows 11 version 24H2 via KB5039239. There is no announced backward compatibility with Windows 10.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Can I sideload WAR agents without going through the Windows Agent Store?
&lt;/h3&gt;

&lt;p&gt;Sideloading behavior (analogous to Windows developer mode for UWP) was not confirmed in Build 2026 materials. The Agent Store appears to be the primary distribution path at launch.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: When will WSL 3 support AMD NPUs?
&lt;/h3&gt;

&lt;p&gt;AMD support was acknowledged as planned but no timeline was confirmed at Build 2026. Qualcomm Snapdragon X Elite and Intel Meteor Lake are the launch platforms.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Is MAI-Thinking-1 available via an OpenAI-compatible API?
&lt;/h3&gt;

&lt;p&gt;Yes. Fireworks AI, Baseten, and OpenRouter all expose OpenAI-compatible endpoints. You can use the standard &lt;code&gt;openai&lt;/code&gt; Python SDK with a custom &lt;code&gt;base_url&lt;/code&gt; pointing to any of these providers to access MAI-Thinking-1 without an Azure subscription.&lt;/p&gt;
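
&lt;p&gt;As a rough sketch of what "OpenAI-compatible" means in practice, the TypeScript below builds the HTTP request an SDK would send once it is given a custom base URL. The provider URL, API key, and model ID are placeholders invented for illustration; take the real values from each provider's documentation.&lt;/p&gt;

```typescript
// Sketch: build a chat-completions request for an OpenAI-compatible provider.
// The base URL, API key, and model ID used below are hypothetical placeholders.
interface ChatRequest {
  url: string
  method: string
  headers: { [name: string]: string }
  body: string
}

function buildChatRequest(baseUrl: string, apiKey: string, model: string, prompt: string): ChatRequest {
  // Strip trailing slashes so the joined URL has exactly one separator.
  const root = baseUrl.replace(/\/+$/, '')
  return {
    url: `${root}/chat/completions`,
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({
      model,
      messages: [{ role: 'user', content: prompt }],
    }),
  }
}

// Hypothetical values for illustration only.
const request = buildChatRequest(
  'https://provider.example/v1/',
  'YOUR_API_KEY',
  'example/mai-thinking-1',
  'Explain Mixture-of-Experts routing in two sentences.',
)
console.log(request.url)
// prints "https://provider.example/v1/chat/completions"
```

&lt;p&gt;Passing &lt;code&gt;request.url&lt;/code&gt; and the remaining fields to &lt;code&gt;fetch&lt;/code&gt; would send the call. Switching providers should then only mean changing the base URL, key, and model ID.&lt;/p&gt;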

&lt;h3&gt;
  
  
  Q: What happens to my Copilot Pro subscription when Polaris rolls out in August?
&lt;/h3&gt;

&lt;p&gt;The switch is automatic. Microsoft offers an optional three-month fallback to GPT-4 Turbo for teams that need to validate behavior first. GitHub Copilot Enterprise admins will see model preference controls before the August cutover.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Does the 85% Agent Store revenue share apply to enterprise deployments?
&lt;/h3&gt;

&lt;p&gt;Microsoft's Build 2026 materials described the 85% figure for the Windows Agent Store consumer/developer channel. Enterprise licensing and revenue arrangements were not detailed at Build.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Windows Agent Runtime&lt;/strong&gt; (June 9, 2026) brings mobile-style permission grants and OS-level sandboxing to local AI agents on Windows 11 Copilot+ PCs. Hardware floor: 40 TOPS NPU.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project Polaris&lt;/strong&gt; replaces GPT-4 Turbo in GitHub Copilot in August 2026. It is a homegrown MoE model trained specifically for code, and Copilot Pro keeps a three-month fallback window to GPT-4 Turbo.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WSL 3&lt;/strong&gt; delivers near-native GPU and NPU passthrough for Linux ML workloads on Snapdragon X Elite and Intel Meteor Lake, with a claimed 3–5% performance gap versus bare-metal Linux.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MAI-Code-1-Flash&lt;/strong&gt; is live now in VS Code and Copilot — claims +16pp on SWE-Bench Pro vs Claude Haiku 4.5 with 60% fewer tokens.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MAI-Thinking-1&lt;/strong&gt; (35B active, MoE, 256K context) is in private preview, available today via Fireworks AI, Baseten, and OpenRouter.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The thread connecting all five announcements: Microsoft is building an agent-first OS and a vertically integrated AI development stack. Cloud deployment is no longer the only serious option for production agent workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bottom Line
&lt;/h2&gt;
&lt;p&gt;Build 2026 is Microsoft's most coherent developer platform shift in years. If you're on a Copilot+ PC, WSL 3 and the Windows Agent Runtime give you local agent infrastructure worth evaluating now — before your cloud bills become the forcing function.&lt;/p&gt;

</description>
      <category>microsoft</category>
      <category>windowsagentruntime</category>
      <category>projectpolaris</category>
      <category>wsl3</category>
    </item>
    <item>
      <title>Building an Edge REST API with Hono.js + TypeScript — From Bun Local Server to Cloudflare Workers</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Wed, 03 Jun 2026 06:40:03 +0000</pubDate>
      <link>https://dev.to/jangwook_kim_e31e7291ad98/building-an-edge-rest-api-with-honojs-typescript-from-bun-local-server-to-cloudflare-workers-4b4m</link>
      <guid>https://dev.to/jangwook_kim_e31e7291ad98/building-an-edge-rest-api-with-honojs-typescript-from-bun-local-server-to-cloudflare-workers-4b4m</guid>
      <description>&lt;p&gt;If you've ever built a REST API with Express, you've probably felt it. Middleware registration, type definitions, body parser setup, connecting Joi or Zod... the structure is simple, but the boilerplate is excessive. When I first saw Hono, I was skeptical. "Another Express clone," I thought. That changed when I actually ran it.&lt;/p&gt;

&lt;p&gt;Bottom line: Hono v4 is more than just lightweight and fast. TypeScript type inference flows naturally all the way to route handlers. Zod validation connects via a single official package. On Bun, response times are noticeably faster than Express. Everything in this post is based on what I ran in a sandbox in June 2026.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Hono — Compared to Express and Fastify
&lt;/h2&gt;

&lt;p&gt;Understanding where Hono fits means answering three questions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bundle size&lt;/strong&gt;: Hono v4 core is about 12KB. Express is 58KB, Fastify is 77KB. The gap might not sound dramatic, but in edge environments like Cloudflare Workers or Deno Deploy, bundle size directly affects cold start time. Edge platforms may start a fresh runtime instance to serve a request, so a smaller bundle means a faster first response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Runtime compatibility&lt;/strong&gt;: Express is Node.js-only. Fastify targets Node.js by default. Hono was designed from the start to "run anywhere." The same code deploys to Bun, Deno, Cloudflare Workers, Node.js, and AWS Lambda@Edge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TypeScript support&lt;/strong&gt;: Express requires &lt;code&gt;@types/express&lt;/code&gt; as a separate install, and properties added to &lt;code&gt;req&lt;/code&gt; via middleware don't get type inference. Hono is written in TypeScript from the ground up, and the &lt;code&gt;Hono&amp;lt;{ Bindings: Env; Variables: Variables }&amp;gt;&lt;/code&gt; generic gives you type-safe access to environment variables and middleware state.&lt;/p&gt;

&lt;p&gt;I'm not saying Hono is the right choice for every situation. If your team is deeply invested in Express, or you need a mature plugin ecosystem, there's no compelling reason to switch. But if edge deployment is the goal, or you want type safety from day one, Hono is the most convincing TypeScript API framework right now.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installation and First Server — Response in 30 Seconds
&lt;/h2&gt;

&lt;p&gt;I started from scratch in a sandbox running Bun 1.3.14.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Initialize a new project&lt;/span&gt;
bun init &lt;span class="nt"&gt;-y&lt;/span&gt;

&lt;span class="c"&gt;# Install Hono v4&lt;/span&gt;
bun add hono

&lt;span class="c"&gt;# Add Zod validation packages&lt;/span&gt;
bun add zod @hono/zod-validator
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;bun add v1.3.14 (0d9b296a)
installed hono@4.12.23
installed @hono/zod-validator@0.8.0
installed zod@4.4.3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Install time was under 500ms. Hono's dependency chain is nearly empty.&lt;/p&gt;

&lt;p&gt;The simplest possible server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// index.ts&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Hono&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;hono&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Hono&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Hello from Hono!&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}))&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bun run index.ts
&lt;span class="c"&gt;# Started development server: http://localhost:3000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:3000/
&lt;span class="c"&gt;# {"message":"Hello from Hono!"}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That single line, &lt;code&gt;export default app&lt;/code&gt;, is recognized as the entry point by Bun, Deno, and Cloudflare Workers alike. For Node.js, import &lt;code&gt;serve&lt;/code&gt; from the &lt;code&gt;@hono/node-server&lt;/code&gt; adapter and call &lt;code&gt;serve(app)&lt;/code&gt;. No runtime-branching code is needed, which felt like the biggest quality-of-life win.&lt;/p&gt;
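
&lt;p&gt;Why does a bare default export work across runtimes? Bun and Cloudflare Workers both look for a default export that has a &lt;code&gt;fetch(request)&lt;/code&gt; method, and a Hono app exposes one. The sketch below shows that shape using only the standard &lt;code&gt;Request&lt;/code&gt; and &lt;code&gt;Response&lt;/code&gt; types. &lt;code&gt;miniApp&lt;/code&gt; is a hypothetical stand-in written for this illustration, not Hono itself.&lt;/p&gt;

```typescript
// Sketch of the module shape behind `export default app`: an object with a
// fetch(request) method. `miniApp` is a hypothetical stand-in, not Hono.
const miniApp = {
  fetch(request: Request): Response {
    const { pathname } = new URL(request.url)
    const headers = { 'Content-Type': 'application/json' }

    // Anything other than GET / falls through to a JSON 404.
    if (request.method !== 'GET' || pathname !== '/') {
      return new Response(JSON.stringify({ error: 'Not found' }), { status: 404, headers })
    }
    return new Response(JSON.stringify({ message: 'Hello from a fetch handler!' }), {
      status: 200,
      headers,
    })
  },
}

// A runtime would call fetch() once per incoming request; here it is called by hand.
const response = miniApp.fetch(new Request('http://localhost:3000/'))
console.log(response.status)
// prints 200
```

&lt;p&gt;Adding &lt;code&gt;export default miniApp&lt;/code&gt; to a file like this is what those runtimes would pick up as the entry point. Hono layers routing, middleware, and typing on top of the same contract.&lt;/p&gt;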

&lt;h2&gt;
  
  
  Middleware Stack — logger, CORS, timing
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/..%2F..%2F..%2Fassets%2Fblog%2Fhono-typescript-api-2026%2Fhono-typescript-api-2026-arch.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/..%2F..%2F..%2Fassets%2Fblog%2Fhono-typescript-api-2026%2Fhono-typescript-api-2026-arch.png" alt="Hono Middleware Stack Architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hono imports built-in middleware via &lt;code&gt;hono/middleware-name&lt;/code&gt;. You only pull in what you use, so nothing extra ends up in the bundle.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Hono&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;hono&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;logger&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;hono/logger&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;cors&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;hono/cors&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;timing&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;hono/timing&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Hono&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;// Registration order equals execution order&lt;/span&gt;
&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;*&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;*&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;cors&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;*&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;timing&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With &lt;code&gt;logger()&lt;/code&gt;, each request prints:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;&amp;lt;-- GET /tasks
--&amp;gt; GET /tasks 200 0ms
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When I ran this, the speed was obvious: 3ms for the first request, then sub-millisecond server-side times (logged as 0ms) for subsequent requests. With &lt;code&gt;timing()&lt;/code&gt;, a &lt;code&gt;Server-Timing&lt;/code&gt; header is added to responses, so you can see per-stage timing in the Network tab of Chrome DevTools.&lt;/p&gt;
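
&lt;p&gt;For reference, a &lt;code&gt;Server-Timing&lt;/code&gt; value is a comma-separated list of metrics, each written as &lt;code&gt;name;dur=milliseconds&lt;/code&gt; with an optional &lt;code&gt;desc&lt;/code&gt;. The sketch below formats such a value by hand to show what DevTools parses. The &lt;code&gt;Metric&lt;/code&gt; type and &lt;code&gt;formatServerTiming&lt;/code&gt; helper are hypothetical names for this illustration and are not Hono's implementation.&lt;/p&gt;

```typescript
// Sketch: format a Server-Timing header value by hand.
// `Metric` and `formatServerTiming` are hypothetical, not part of Hono.
interface Metric {
  name: string
  durationMs: number
  description?: string
}

function formatServerTiming(metrics: Metric[]): string {
  return metrics
    .map((m) => {
      // Each metric is "name;dur=N" with an optional quoted description.
      const parts = [m.name, `dur=${m.durationMs}`]
      if (m.description !== undefined) parts.push(`desc="${m.description}"`)
      return parts.join(';')
    })
    .join(', ')
}

const header = formatServerTiming([
  { name: 'db', durationMs: 1.2, description: 'task lookup' },
  { name: 'total', durationMs: 3 },
])
console.log(header)
// prints: db;dur=1.2;desc="task lookup", total;dur=3
```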

&lt;p&gt;CORS takes fine-grained options:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;*&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;cors&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;origin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://jangwook.net&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;http://localhost:5173&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;allowMethods&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;GET&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;POST&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;PATCH&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;DELETE&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;allowHeaders&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Authorization&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;}))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;cors()&lt;/code&gt; default allows all origins. In production, always specify &lt;code&gt;origin&lt;/code&gt; explicitly.&lt;/p&gt;
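
&lt;p&gt;To make the allow-list idea concrete, here is a minimal sketch of the check an explicit &lt;code&gt;origin&lt;/code&gt; list implies: echo the request's origin back only when it matches an entry exactly. &lt;code&gt;resolveAllowedOrigin&lt;/code&gt; is a hypothetical helper written for this illustration. It is not Hono's internal code, and the middleware's real behavior may differ in details.&lt;/p&gt;

```typescript
// Sketch of an origin allow-list check, the idea behind passing `origin` to cors().
// `resolveAllowedOrigin` is a hypothetical helper, not part of Hono's API.
const allowedOrigins = ['https://jangwook.net', 'http://localhost:5173']

function resolveAllowedOrigin(requestOrigin: string | null, allowList: string[]): string | null {
  // Requests without an Origin header (curl, server-to-server) get no CORS header.
  if (requestOrigin === null) return null
  // Echo the origin back only on an exact match; never fall back to '*'.
  return allowList.includes(requestOrigin) ? requestOrigin : null
}

console.log(resolveAllowedOrigin('https://jangwook.net', allowedOrigins))
// prints "https://jangwook.net"
console.log(resolveAllowedOrigin('https://evil.example', allowedOrigins))
// prints null
```

&lt;p&gt;A &lt;code&gt;null&lt;/code&gt; result means the response carries no &lt;code&gt;Access-Control-Allow-Origin&lt;/code&gt; header, so the browser blocks the cross-origin read.&lt;/p&gt;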

&lt;h2&gt;
  
  
  Zod Validation — Automatic 400 Errors
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;@hono/zod-validator&lt;/code&gt; is Hono's official Zod integration. Drop it in as middleware on a route, and any Zod schema validation failure automatically returns a 400.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;zValidator&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@hono/zod-validator&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;zod&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;createTaskSchema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Title is required&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Max 100 characters&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;completed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;boolean&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;optional&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/tasks&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;zValidator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;createTaskSchema&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;valid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="c1"&gt;// body is typed as z.infer&amp;lt;typeof createTaskSchema&amp;gt;&lt;/span&gt;
  &lt;span class="c1"&gt;// body.title is string, body.completed is boolean — no undefined&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;nextId&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;createdAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toISOString&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="nx"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;task&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="mi"&gt;201&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Test run with an empty title:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:3000/tasks &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"title":""}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"success"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ZodError"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"[{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;code&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;too_small&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;minimum&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:1,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;path&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:[&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;title&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;],&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;message&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;Title is required&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;}]"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;HTTP 400, automatically. No validation code needed inside the handler.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;c.req.valid('json')&lt;/code&gt; is the key. What comes back is already Zod-validated and fully typed. If you've worked with &lt;a href="https://dev.to/en/blog/en/typescript-zod-v4-claude-api-structured-output-guide-2026"&gt;Zod v4 and Claude API structured output&lt;/a&gt;, the v4 schema API changes apply here too — &lt;code&gt;@hono/zod-validator&lt;/code&gt; supports both v3 and v4.&lt;/p&gt;

&lt;h2&gt;
  
  
  Full CRUD Implementation — With Real Execution Logs
&lt;/h2&gt;

&lt;p&gt;Here's the complete Task CRUD API, with the actual terminal output from running it. This example uses in-memory storage; swap in D1, Prisma, or Drizzle for production.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Hono&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;hono&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;logger&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;hono/logger&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;cors&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;hono/cors&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;timing&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;hono/timing&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;zValidator&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@hono/zod-validator&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;zod&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Hono&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;*&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;*&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;cors&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;*&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;timing&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;Task&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;
  &lt;span class="nx"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;
  &lt;span class="nx"&gt;completed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt;
  &lt;span class="nx"&gt;createdAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Install Hono&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;completed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;createdAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toISOString&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Build REST API&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;completed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;createdAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toISOString&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;nextId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;createTaskSchema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Title is required&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;completed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;boolean&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;optional&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;updateTaskSchema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;optional&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
  &lt;span class="na"&gt;completed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;boolean&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;optional&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Task API&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;1.0.0&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;runtime&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Bun + Hono&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}))&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/tasks&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;completedParam&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;completed&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;tasks&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;completedParam&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="kc"&gt;undefined&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;t&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completed&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;completedParam&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;true&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;total&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/tasks&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;zValidator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;createTaskSchema&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;valid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="na"&gt;task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;nextId&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;createdAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toISOString&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="nx"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;task&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="mi"&gt;201&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/tasks/:id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parseInt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;param&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;t&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Task not found&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="mi"&gt;404&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;task&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;patch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/tasks/:id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;zValidator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;updateTaskSchema&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parseInt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;param&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;valid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findIndex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;t&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;index&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Task not found&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="mi"&gt;404&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nx"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;index&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/tasks/:id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parseInt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;param&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findIndex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;t&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;index&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Task not found&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="mi"&gt;404&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nx"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;splice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Deleted successfully&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
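
&lt;p&gt;One detail worth tightening in the handlers above: &lt;code&gt;parseInt&lt;/code&gt; is lenient, so a path like &lt;code&gt;/tasks/3abc&lt;/code&gt; resolves to task 3. A minimal sketch of a stricter parser follows. The helper name &lt;code&gt;parseTaskId&lt;/code&gt; is hypothetical and not part of Hono; it is plain TypeScript you could call before the lookup.&lt;/p&gt;

```typescript
// Hypothetical helper (not a Hono API): returns a non-negative integer id,
// or undefined when the raw path parameter is anything else.
function parseTaskId(raw: string): number | undefined {
  // Accept plain decimal digits only, so "3abc", "1.5", "-2" and "" are rejected.
  if (!/^\d+$/.test(raw)) return undefined
  const id = Number(raw)
  // Guard against digit strings too large to represent exactly.
  return Number.isSafeInteger(id) ? id : undefined
}

const strictId = parseTaskId('3abc')    // undefined
const lenientId = parseInt('3abc', 10)  // 3
```

&lt;p&gt;A handler could return a 400 when the helper yields &lt;code&gt;undefined&lt;/code&gt;, rather than falling through to the 404 branch.&lt;/p&gt;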



&lt;p&gt;Real terminal output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;bun run index.ts
&lt;span class="go"&gt;Started development server: http://localhost:3000

&amp;lt;-- GET /
&lt;/span&gt;&lt;span class="gp"&gt;--&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;GET / 200 4ms
&lt;span class="go"&gt;
&amp;lt;-- GET /tasks
&lt;/span&gt;&lt;span class="gp"&gt;--&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;GET /tasks 200 2ms
&lt;span class="go"&gt;
&amp;lt;-- POST /tasks
&lt;/span&gt;&lt;span class="gp"&gt;--&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;POST /tasks 201 4ms
&lt;span class="go"&gt;
&amp;lt;-- GET /tasks/3
&lt;/span&gt;&lt;span class="gp"&gt;--&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;GET /tasks/3 200 0ms
&lt;span class="go"&gt;
&amp;lt;-- PATCH /tasks/2
&lt;/span&gt;&lt;span class="gp"&gt;--&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;PATCH /tasks/2 200 0ms
&lt;span class="go"&gt;
&amp;lt;-- DELETE /tasks/1
&lt;/span&gt;&lt;span class="gp"&gt;--&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;DELETE /tasks/1 200 0ms
&lt;span class="go"&gt;
&amp;lt;-- POST /tasks  (empty title)
&lt;/span&gt;&lt;span class="gp"&gt;--&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;POST /tasks 400 0ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Performance numbers: the first request took 4ms, and warm requests were sub-millisecond (shown as 0ms in the logger output). Running the same logic in Express on the same machine showed 1-2ms warm. In a real production edge deployment the gap would likely be larger.&lt;/p&gt;

&lt;p&gt;Two things explain this performance: Bun's JavaScriptCore engine and Hono's trie-based router. Route lookup cost depends on the number of path segments in the request rather than on how many routes are registered, so adding routes does not add a linear scan.&lt;/p&gt;
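
&lt;p&gt;To make the trie idea concrete, here is a minimal sketch of a trie router. It is an illustration only, not Hono's actual implementation: &lt;code&gt;TrieRouter&lt;/code&gt;, &lt;code&gt;add&lt;/code&gt;, and &lt;code&gt;match&lt;/code&gt; are hypothetical names, and the sketch skips wildcards, HTTP methods, and backtracking.&lt;/p&gt;

```typescript
// Illustrative trie router; all names here are assumptions, not Hono APIs.
type Handler = (params: { [name: string]: string }) => string

interface TrieNode {
  children: { [segment: string]: TrieNode } // static segments, e.g. "tasks"
  paramChild?: TrieNode                     // dynamic segment, e.g. ":id"
  paramName?: string
  handler?: Handler
}

// Object.create(null) avoids accidental hits on keys like "constructor".
const createNode = (): TrieNode => ({ children: Object.create(null) })

class TrieRouter {
  private root = createNode()

  add(path: string, handler: Handler): void {
    let node = this.root
    for (const segment of path.split('/').filter(Boolean)) {
      if (segment.startsWith(':')) {
        const child = node.paramChild ?? createNode()
        child.paramName = segment.slice(1)
        node.paramChild = child
        node = child
      } else {
        const child = node.children[segment] ?? createNode()
        node.children[segment] = child
        node = child
      }
    }
    node.handler = handler
  }

  // Walks one node per path segment; the number of registered routes is irrelevant.
  match(path: string): string | undefined {
    const params: { [name: string]: string } = {}
    let node = this.root
    for (const segment of path.split('/').filter(Boolean)) {
      const staticChild = node.children[segment]
      if (staticChild !== undefined) {
        node = staticChild
      } else if (node.paramChild !== undefined) {
        params[node.paramChild.paramName ?? segment] = segment
        node = node.paramChild
      } else {
        return undefined
      }
    }
    return node.handler?.(params)
  }
}

const router = new TrieRouter()
router.add('/tasks', () => 'list tasks')
router.add('/tasks/:id', (params) => 'task ' + params.id)
const found = router.match('/tasks/42') // "task 42"
```

&lt;p&gt;Matching &lt;code&gt;/tasks/42&lt;/code&gt; visits two nodes whether the router holds two routes or two thousand, which is the property the paragraph above describes.&lt;/p&gt;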

&lt;h2&gt;
  
  
  Cloudflare Workers Deployment: Minimal Code Changes
&lt;/h2&gt;

&lt;p&gt;The biggest Hono advantage: changing the deployment target barely changes the code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bun add &lt;span class="nt"&gt;-g&lt;/span&gt; wrangler
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="c"&gt;# wrangler.toml&lt;/span&gt;
&lt;span class="py"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"hono-task-api"&lt;/span&gt;
&lt;span class="py"&gt;main&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"src/worker.ts"&lt;/span&gt;
&lt;span class="py"&gt;compatibility_date&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"2024-09-23"&lt;/span&gt;

&lt;span class="nn"&gt;[vars]&lt;/span&gt;
&lt;span class="py"&gt;ENVIRONMENT&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"production"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Connecting Cloudflare Workers environment variable types to Hono:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/worker.ts&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Hono&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;hono&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;cors&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;hono/cors&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;

&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;Bindings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;ENVIRONMENT&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;
  &lt;span class="na"&gt;DB&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;D1Database&lt;/span&gt;
  &lt;span class="na"&gt;KV&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;KVNamespace&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;Variables&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;Hono&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;Bindings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Bindings&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;Variables&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Variables&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;*&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;cors&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/health&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; 
    &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ENVIRONMENT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;// type-safe: string&lt;/span&gt;
    &lt;span class="na"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toISOString&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;// D1 database query&lt;/span&gt;
&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/tasks&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;DB&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;prepare&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SELECT * FROM tasks&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Simulate Cloudflare Workers locally&lt;/span&gt;
wrangler dev

&lt;span class="c"&gt;# Production deploy&lt;/span&gt;
wrangler deploy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I didn't verify &lt;code&gt;wrangler deploy&lt;/code&gt; — that requires an actual Cloudflare account. The code structure is exactly as shown above, and the only difference from the local Bun server is how you access bindings like &lt;code&gt;c.env.DB&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/en/blog/en/cloudflare-agents-week-2026-autonomous-infrastructure"&gt;Cloudflare Workers agent infrastructure&lt;/a&gt; shows how Hono sits at the API layer in Cloudflare-based AI agent systems. It's already being used this way in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Type-Safe Middleware with Variables
&lt;/h2&gt;

&lt;p&gt;Express requires extending the &lt;code&gt;Request&lt;/code&gt; interface through declaration merging to get type-safe access to &lt;code&gt;req.user&lt;/code&gt;. Hono handles this more cleanly with the &lt;code&gt;Variables&lt;/code&gt; generic.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;Variables&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;
  &lt;span class="na"&gt;requestId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;Hono&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;Variables&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Variables&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;// Auth middleware&lt;/span&gt;
&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/tasks/*&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;next&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;authHeader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;header&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Authorization&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;authHeader&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;startsWith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Bearer &lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Unauthorized&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="mi"&gt;401&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;userId&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user-123&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;requestId&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;crypto&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randomUUID&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;// Access in route handler — fully typed&lt;/span&gt;
&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/tasks&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;userId&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;       &lt;span class="c1"&gt;// inferred as string&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;requestId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;requestId&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;// inferred as string&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;requestId&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;c.get('userId')&lt;/code&gt; returns &lt;code&gt;string&lt;/code&gt; because TypeScript infers it from the &lt;code&gt;Variables&lt;/code&gt; declaration. With Express, this inference doesn't happen automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Found Frustrating
&lt;/h2&gt;

&lt;p&gt;There are real limitations worth naming.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ecosystem depth&lt;/strong&gt;: Fastify's plugin ecosystem is battle-hardened. &lt;code&gt;@fastify/swagger&lt;/code&gt; (formerly &lt;code&gt;fastify-swagger&lt;/code&gt;) auto-generates OpenAPI specs. &lt;code&gt;@fastify/multipart&lt;/code&gt; (formerly &lt;code&gt;fastify-multipart&lt;/code&gt;) handles file uploads. These are well-tested, actively maintained plugins. Hono's third-party ecosystem is thinner. The official middleware covers the basics, but unusual requirements mean writing your own.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;D1 local dev experience&lt;/strong&gt;: Testing against Cloudflare D1 locally requires &lt;code&gt;wrangler dev&lt;/code&gt;, which requires an actual Cloudflare account to configure bindings. SQLite compatibility makes Drizzle/Prisma usable, but the local dev setup is more involved than Express + PostgreSQL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;wrangler dev&lt;/code&gt; cold start&lt;/strong&gt;: The first run of &lt;code&gt;wrangler dev&lt;/code&gt; is slow because it emulates the Cloudflare runtime. Running with Bun directly starts instantly — but that skips Workers-specific behavior testing.&lt;/p&gt;

&lt;p&gt;If edge deployment isn't your goal and you're building a conventional server, Fastify is more mature than Hono. The &lt;a href="https://dev.to/en/blog/en/ollama-fastapi-production-deployment-guide-2026"&gt;Ollama + FastAPI approach&lt;/a&gt; — different language, same concept — is another valid path.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Choose Hono
&lt;/h2&gt;

&lt;p&gt;My judgment:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Hono when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cloudflare Workers, Deno Deploy, or Bun are your deployment targets&lt;/li&gt;
&lt;li&gt;You want TypeScript type safety from the first line&lt;/li&gt;
&lt;li&gt;Bundle size and cold start time matter for your service&lt;/li&gt;
&lt;li&gt;Small team, fast start, minimal boilerplate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Don't bother switching when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your team is comfortable with Express or Fastify and has no edge deployment plans&lt;/li&gt;
&lt;li&gt;You need a mature plugin ecosystem for enterprise-scale services&lt;/li&gt;
&lt;li&gt;Heavy integration with legacy Node.js code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Hono's GitHub stars crossed 66,000 in 2026. If you've already &lt;a href="https://dev.to/en/blog/en/bun-shell-scripting-practical-guide-2026"&gt;set up a Bun Shell scripting environment&lt;/a&gt;, adding Hono is the logical next step. Same runtime, same package manager, same TypeScript ecosystem — API server included.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cheat Sheet — Patterns I Look Up Every Time
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Query parameter&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;page&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;1&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;limit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parseInt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;limit&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;10&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;// explicit radix&lt;/span&gt;

&lt;span class="c1"&gt;// Path parameter&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;param&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// Request header&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;auth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;header&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Authorization&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// JSON response with status&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="mi"&gt;201&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// Text response&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;OK&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// Redirect&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;redirect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/new-path&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;301&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// Streaming response (Hono v4 removed c.stream; import { stream } from 'hono/streaming')&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;chunk&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;// Cloudflare Workers env variable&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;dbUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;DATABASE_URL&lt;/span&gt;

&lt;span class="c1"&gt;// Route grouping&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;api&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Hono&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nx"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/users&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;
&lt;span class="nx"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/users&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;
&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/api/v1&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;api&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Wrap-Up — Notes After Running It
&lt;/h2&gt;

&lt;p&gt;This post started from &lt;code&gt;bun add hono @hono/zod-validator zod&lt;/code&gt; and worked through a full CRUD API. In-memory storage limits what you can call "production-ready," but the routing, middleware, and Zod validation integration all checked out.&lt;/p&gt;

&lt;p&gt;The thing that impressed me most was type inference. Data from &lt;code&gt;c.req.valid('json')&lt;/code&gt; is immediately typed by the Zod schema. Data stored with &lt;code&gt;c.set('userId', ...)&lt;/code&gt; comes back as &lt;code&gt;string&lt;/code&gt; from &lt;code&gt;c.get('userId')&lt;/code&gt;. TypeScript doesn't lose track of types as they flow through the middleware chain.&lt;/p&gt;

&lt;p&gt;I won't claim there's no reason to keep using Express. But if you're starting a new project with TypeScript and Bun and have edge deployment in mind, Hono is worth using right now.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Test Environment&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bun: 1.3.14&lt;/li&gt;
&lt;li&gt;hono: 4.12.23&lt;/li&gt;
&lt;li&gt;@hono/zod-validator: 0.8.0&lt;/li&gt;
&lt;li&gt;zod: 4.4.3&lt;/li&gt;
&lt;li&gt;typescript: 5.9.3&lt;/li&gt;
&lt;li&gt;macOS 15.x (Apple Silicon)&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>hono</category>
      <category>typescript</category>
      <category>restapi</category>
      <category>cloudflareworkers</category>
    </item>
    <item>
      <title>Constraint Decay: Why LLM Agents Fail at Real Backend Code</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Wed, 03 Jun 2026 04:24:28 +0000</pubDate>
      <link>https://dev.to/jangwook_kim_e31e7291ad98/constraint-decay-why-llm-agents-fail-at-real-backend-code-1fog</link>
      <guid>https://dev.to/jangwook_kim_e31e7291ad98/constraint-decay-why-llm-agents-fail-at-real-backend-code-1fog</guid>
      <description>&lt;p&gt;Your AI coding agent just built a REST API endpoint. It passes all unit tests. The code looks clean. Then you add an ORM constraint, an architectural pattern requirement, and an auth middleware spec — and the next three tasks start failing in ways that are hard to explain. That sequence has a name now: &lt;strong&gt;constraint decay&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A May 2026 arXiv paper (2605.06445), "Constraint Decay: The Fragility of LLM Agents in Backend Code Generation", puts hard numbers on something many developers have noticed informally. This article walks through what the paper found, why it matters for teams shipping production code with AI agents, and how Effloow Lab reproduced the paper's decay curve using a pure-Python PoC.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;Benchmark scores for LLM coding agents have climbed fast. Models like Qwen3-Coder-Next, MiniMax-M2.5, and Kimi-K2.5 now exceed 85% on assertion pass rates when tasks are given full architectural freedom — no prescribed database schema, no forced ORM, no required architectural pattern. Those numbers get cited in model release announcements and leaderboards.&lt;/p&gt;

&lt;p&gt;The problem is that unconstrained freedom describes almost none of your real backend work.&lt;/p&gt;

&lt;p&gt;Production code operates inside a web of structural requirements: a specific ORM, an existing auth middleware pattern, a database schema your team maintains, an architectural convention from a decision three years ago. The paper tests what happens when agents face those constraints, and the results are harder to dismiss than a blog post hot take. This is an empirical study: 80 greenfield generation tasks and 20 feature-implementation tasks, eight web frameworks (Flask, FastAPI, Django, aiohttp, Express, Fastify, Hono, Koa), evaluated with end-to-end behavioral tests and static verifiers.&lt;/p&gt;

&lt;p&gt;The headline finding: &lt;strong&gt;assertion pass rates drop by an average of 30 percentage points from baseline to fully constrained scenarios — a 40% relative loss of baseline performance.&lt;/strong&gt; That is not a marginal degradation. It is a collapse.&lt;/p&gt;

&lt;p&gt;For developers evaluating whether to trust an AI agent with backend code, understanding &lt;em&gt;why&lt;/em&gt; this happens is more useful than knowing the number. That is what this article focuses on.&lt;/p&gt;




&lt;h2&gt;
  
  
  Core Concepts
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What "Constraint Decay" Means
&lt;/h3&gt;

&lt;p&gt;The term is precise. "Decay" is not a metaphor here — the paper fits the performance drop to an exponential model. As the number of structural constraints increases from zero (bare task, architectural freedom) to five (ORM layer, architectural pattern, DB schema, auth middleware, full API contract), pass rates fall along a curve that looks like radioactive decay: steep early, flattening later, but always lower.&lt;/p&gt;

&lt;p&gt;Effloow Lab ran a sandbox PoC to reproduce this numerically. Using the paper's reported summary statistics (~50% baseline, ~20% at full constraints for minimal frameworks), the lab fitted an exponential decay model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pass_rate = baseline × exp(−0.1888 × n_constraints)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fitted decay rate of 0.1888 means each additional structural constraint multiplies the remaining pass rate by roughly 0.83. Add five constraints and you are at about 39% of your starting performance.&lt;/p&gt;
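&lt;p&gt;To make that arithmetic concrete, here is a minimal sketch of the formula above. The function names are mine, not the PoC's, and the only number taken from the article is the 0.1888 rate.&lt;/p&gt;

```python
import math

DECAY_RATE = 0.1888  # fitted rate quoted in the article


def retained_fraction(n_constraints: int, rate: float = DECAY_RATE) -> float:
    """Fraction of the baseline pass rate left after n structural constraints."""
    return math.exp(-rate * n_constraints)


def pass_rate(baseline: float, n_constraints: int, rate: float = DECAY_RATE) -> float:
    """Modeled pass rate: baseline * exp(-rate * n_constraints)."""
    return baseline * retained_fraction(n_constraints, rate)


if __name__ == "__main__":
    # One row per constraint count: retained fraction, then modeled rate for a 50% baseline.
    for n in range(6):
        print(n, round(retained_fraction(n), 3), round(pass_rate(50.0, n), 1))
```

&lt;p&gt;One constraint leaves roughly 0.83 of the baseline and five leave roughly 0.39, which is where the two figures in the paragraph above come from.&lt;/p&gt;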

&lt;p&gt;Here is the PoC's output table across three framework profiles:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csvs"&gt;&lt;code&gt;&lt;span class="k"&gt;Constraints&lt;/span&gt;                 &lt;span class="k"&gt;Flask&lt;/span&gt;&lt;span class="err"&gt;/&lt;/span&gt;&lt;span class="k"&gt;Koa&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;minimal&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="k"&gt;FastAPI&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;moderate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="k"&gt;Django&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;convention&lt;/span&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="k"&gt;heavy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="err"&gt;----------------------------------------------------------------------------------------------&lt;/span&gt;
&lt;span class="k"&gt;None&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;baseline&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;               &lt;span class="mf"&gt;50.0&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;                &lt;span class="mf"&gt;45.0&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;                &lt;span class="mf"&gt;22.0&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;
&lt;span class="k"&gt;ORM&lt;/span&gt; &lt;span class="k"&gt;layer&lt;/span&gt;                     &lt;span class="mf"&gt;41.4&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;                &lt;span class="mf"&gt;36.2&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;                &lt;span class="mf"&gt;17.2&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;
&lt;span class="k"&gt;Arch&lt;/span&gt; &lt;span class="k"&gt;pattern&lt;/span&gt; &lt;span class="err"&gt;+&lt;/span&gt; &lt;span class="k"&gt;ORM&lt;/span&gt;            &lt;span class="mf"&gt;34.3&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;                &lt;span class="mf"&gt;29.1&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;                &lt;span class="mf"&gt;13.5&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;
&lt;span class="k"&gt;DB&lt;/span&gt; &lt;span class="k"&gt;schema&lt;/span&gt; &lt;span class="err"&gt;+&lt;/span&gt; &lt;span class="k"&gt;Arch&lt;/span&gt; &lt;span class="err"&gt;+&lt;/span&gt; &lt;span class="k"&gt;ORM&lt;/span&gt;        &lt;span class="mf"&gt;28.4&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;                &lt;span class="mf"&gt;23.5&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;                &lt;span class="mf"&gt;10.5&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;
&lt;span class="k"&gt;Auth&lt;/span&gt; &lt;span class="k"&gt;middleware&lt;/span&gt; &lt;span class="k"&gt;added&lt;/span&gt;          &lt;span class="mf"&gt;23.5&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;                &lt;span class="mf"&gt;18.9&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;                 &lt;span class="mf"&gt;8.2&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;
&lt;span class="k"&gt;Full&lt;/span&gt; &lt;span class="k"&gt;API&lt;/span&gt; &lt;span class="k"&gt;contract&lt;/span&gt; &lt;span class="k"&gt;spec&lt;/span&gt;         &lt;span class="mf"&gt;19.4&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;                &lt;span class="mf"&gt;15.2&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;                 &lt;span class="mf"&gt;6.4&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The numbers are reconstructed from the paper's aggregate statistics, not a raw replay of the evaluation pipeline. What they demonstrate is that the decay shape is consistent with an exponential model across all three framework tiers.&lt;/p&gt;
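&lt;p&gt;If you want to check the exponential shape yourself, one option is a log-linear least-squares fit on each column of the table. This is a sketch of that idea and not the lab's actual fitting code; the data is copied from the table, and &lt;code&gt;fit_decay_rate&lt;/code&gt; is a name I made up for illustration.&lt;/p&gt;

```python
import math

# Pass rates (%) copied from the PoC table, for 0..5 constraints.
TABLE = {
    "Flask/Koa": [50.0, 41.4, 34.3, 28.4, 23.5, 19.4],
    "FastAPI": [45.0, 36.2, 29.1, 23.5, 18.9, 15.2],
    "Django": [22.0, 17.2, 13.5, 10.5, 8.2, 6.4],
}


def fit_decay_rate(rates: list[float]) -> float:
    """Least-squares slope of ln(rate) against constraint count, sign-flipped."""
    xs = list(range(len(rates)))
    ys = [math.log(r) for r in rates]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return -sxy / sxx


FITTED = {name: fit_decay_rate(rates) for name, rates in TABLE.items()}

if __name__ == "__main__":
    for name, rate in FITTED.items():
        print(name, round(rate, 4))
```

&lt;p&gt;On these numbers the Flask/Koa column lands very close to the quoted 0.1888, and the FastAPI and Django columns fit somewhat steeper rates. Because the table is itself reconstructed from aggregates, treat this as a consistency check and not as independent evidence.&lt;/p&gt;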

&lt;h3&gt;
  
  
  The Framework-Tier Gap
&lt;/h3&gt;

&lt;p&gt;The second major finding is the baseline gap between minimal and convention-heavy frameworks. Flask and Koa start around 49–51% assertion pass rate. Django and FastAPI trail by 25–32 percentage points at baseline — before any additional constraints are layered on.&lt;/p&gt;

&lt;p&gt;The reason is structural. Flask and Koa are explicit about almost everything: routing, ORM choice, middleware order. An LLM agent building a Flask endpoint must make concrete, visible decisions. Those decisions show up in code that is easy to test.&lt;/p&gt;

&lt;p&gt;Django and FastAPI impose conventions. Django's ORM, its admin interface, its migration system, its signal architecture — these are not visible in a task prompt. They live in the framework's implicit contract with the developer. When an LLM agent generates code for a Django project, it needs to know which conventions apply, which ones the project has overridden, and how the framework's default behaviors interact with the task at hand. The paper's data suggests agents are much worse at navigating that implicit contract than they are at following explicit specifications.&lt;/p&gt;

&lt;p&gt;FastAPI occupies a middle position. It is explicit in its HTTP routing (Pythonic type annotations drive a lot of behavior), but its dependency injection system and SQLAlchemy integration patterns carry real convention overhead. The paper's data and the PoC's modeled results put FastAPI between Flask and Django in baseline performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data-Layer Defects as the Root Cause
&lt;/h3&gt;

&lt;p&gt;The paper's error analysis identifies data-layer defects as the leading root cause of failures across all tested configurations. Two specific failure modes dominate:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Incorrect query composition&lt;/strong&gt; — agents generate queries that are syntactically valid and pass simple mocks but fail under real data conditions: missing joins, wrong filter logic, or subquery structure that works in isolation but not against the schema.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;ORM runtime violations&lt;/strong&gt; — agents produce code that violates ORM usage rules at runtime. These often pass static analysis (the code is valid Python or JavaScript) but raise exceptions when the ORM tries to execute the generated query plan against the database.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Both categories share a common pattern: the agent generates code that &lt;em&gt;looks correct&lt;/em&gt; at the level of syntax and surface behavior but fails at the boundary between application logic and the persistence layer. This is where structural constraints bite hardest, because ORM behavior is exactly the kind of implicit convention that does not show up clearly in a task prompt.&lt;/p&gt;
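&lt;p&gt;The paper does not publish this exact case, so take the following as a hypothetical illustration of the first failure mode. The schema, the queries, and the fixture data are all invented; the point is only that a query with the wrong join can agree with the correct one on a mock-sized fixture and diverge on realistic data.&lt;/p&gt;

```python
import sqlite3

SCHEMA = """
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE tasks (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users(id),
    done INTEGER NOT NULL DEFAULT 0
);
"""

# Plausible-looking query: the inner join plus WHERE drops users with no open tasks.
BUGGY = """
SELECT u.name, COUNT(t.id) FROM users u
JOIN tasks t ON t.user_id = u.id
WHERE t.done = 0
GROUP BY u.id ORDER BY u.name
"""

# Intended behavior: every user appears, with 0 when nothing is open.
FIXED = """
SELECT u.name, COUNT(t.id) FROM users u
LEFT JOIN tasks t ON t.user_id = u.id AND t.done = 0
GROUP BY u.id ORDER BY u.name
"""


def open_task_counts(query, users, tasks):
    """Run the query against a fresh in-memory database seeded with the given rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(SCHEMA)
        conn.executemany("INSERT INTO users (id, name) VALUES (?, ?)", users)
        conn.executemany("INSERT INTO tasks (user_id, done) VALUES (?, ?)", tasks)
        return conn.execute(query).fetchall()
    finally:
        conn.close()


# Mock-sized fixture: one user with one open task. Both queries agree here.
SIMPLE = ([(1, "ada")], [(1, 0)])
# More realistic data: bob has no tasks, cy has only a finished one.
REALISTIC = ([(1, "ada"), (2, "bob"), (3, "cy")], [(1, 0), (3, 1)])

if __name__ == "__main__":
    print(open_task_counts(BUGGY, *REALISTIC))
    print(open_task_counts(FIXED, *REALISTIC))
```

&lt;p&gt;A test suite that only uses the one-row fixture would pass both versions, which is the kind of gap behavioral tests against real data are meant to close.&lt;/p&gt;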

&lt;h3&gt;
  
  
  What Existing Benchmarks Miss
&lt;/h3&gt;

&lt;p&gt;SWE-bench tests whether an agent can resolve real GitHub issues. HumanEval tests isolated function completion. Neither benchmark systematically measures whether the generated code satisfies non-functional structural requirements: "use the project's ORM", "follow the repository's auth middleware pattern", "match this DB schema". Existing benchmarks reward functional correctness while being blind to structural compliance.&lt;/p&gt;

&lt;p&gt;The constraint decay paper argues this gap is not incidental. Benchmarks are designed to be automatable, and structural compliance checks require knowledge of the project's conventions — which means they require per-project setup that is expensive to scale. The result is a systematic bias: models optimize for benchmark tasks that do not test the property that matters most in production environments. You can read more about the general limits of coding benchmarks in our &lt;a href="https://dev.to/articles/ai-coding-market-share-claude-code-cursor-copilot-2026"&gt;guide to AI coding market share and agent evaluation&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Practical Application
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Designing Tasks to Reduce Constraint Pressure
&lt;/h3&gt;

&lt;p&gt;The paper's findings suggest a practical heuristic: &lt;strong&gt;if you are delegating a backend task to an AI agent, make every structural constraint explicit in the prompt.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;"Build a user authentication endpoint" is a minimal-constraint task. The agent will make reasonable choices about ORM, schema, and middleware — choices that may conflict with the rest of your codebase.&lt;/p&gt;

&lt;p&gt;A better prompt makes the constraints explicit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Build a POST /auth/login endpoint using:
- SQLAlchemy ORM (Session pattern, not async)
- User model defined in app/models/user.py
- Password verification via the existing verify_password() in app/utils/auth.py
- Return a JSON response with {token: str, expires_at: ISO8601}
- No new dependencies
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That prompt spells out five constraints explicitly. The paper's data says you will still see degraded performance compared to an unconstrained task, but the agent is at least working from the right specification rather than inferring conventions it may not know.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using Minimal Frameworks Strategically
&lt;/h3&gt;

&lt;p&gt;The framework-tier gap the paper documents has a concrete implication: if your team is choosing a framework for a new service and plans to use AI agents heavily in development, minimal frameworks (Flask, Express, Koa, Hono) produce significantly better agent performance at baseline than convention-heavy ones.&lt;/p&gt;

&lt;p&gt;This does not mean you should avoid Django or FastAPI; those frameworks carry real productivity advantages for humans. But the tradeoff is real. Teams that use AI agents for high-volume boilerplate generation on convention-heavy stacks will see lower pass rates and more manual correction work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Testing for Structural Compliance, Not Just Functional Correctness
&lt;/h3&gt;

&lt;p&gt;The paper's evaluation methodology is itself a pattern worth adopting. The authors pair static verifiers with behavioral tests, checking that code satisfies structural requirements (imports, ORM usage patterns, architectural conventions) rather than only testing whether the endpoint returns the right HTTP response.&lt;/p&gt;

&lt;p&gt;Adding a structural compliance check to your CI pipeline for agent-generated code costs real setup time, but it catches the ORM violations and incorrect query composition that functional tests miss. For a team running agent-generated code through automated review, this is the most direct mitigation the paper's findings suggest.&lt;/p&gt;
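&lt;p&gt;As a rough illustration of what such a check can look like, here is a minimal sketch of a static verifier built on Python's standard &lt;code&gt;ast&lt;/code&gt; module. It is not the paper's verifier. The &lt;code&gt;FORBIDDEN_MODULES&lt;/code&gt; policy, the function name, and the sample source are all hypothetical, and a real project would encode its own conventions.&lt;/p&gt;

```python
import ast

# Hypothetical policy: modules that would bypass the project's ORM layer.
FORBIDDEN_MODULES = {"sqlite3", "psycopg2"}


def find_forbidden_imports(source: str) -> list[str]:
    """Return the forbidden top-level modules imported by the given source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Relative imports have no module name; treat them as empty.
            names = [node.module or ""]
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root in FORBIDDEN_MODULES:
                found.add(root)
    return sorted(found)


# Illustrative agent output: it imports the ORM but then bypasses it.
GENERATED = """
import sqlite3
from sqlalchemy.orm import Session

def login(session: Session, email: str):
    conn = sqlite3.connect("app.db")
    return conn.execute("SELECT * FROM users").fetchall()
"""

violations = find_forbidden_imports(GENERATED)
print(violations)
```

&lt;p&gt;A CI step could fail the build whenever the returned list is non-empty. The source is parsed but never executed, so the check is cheap to run on every agent-generated change.&lt;/p&gt;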

&lt;p&gt;For a deeper look at how AI code review tools approach similar problems, see our &lt;a href="https://dev.to/articles/best-ai-code-review-tools-2026"&gt;roundup of the best AI code review tools in 2026&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Common Mistakes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Treating Benchmark Scores as Production Predictors
&lt;/h3&gt;

&lt;p&gt;The most common mistake when evaluating AI coding agents is reading a benchmark score and projecting it onto your production codebase. An agent scoring 85%+ on unconstrained generation tasks may score 20–30% on your fully specified backend tasks. The paper makes this quantitative: a 40% relative performance loss from benchmark to production-like conditions is its central finding, not an edge case.&lt;/p&gt;

&lt;h3&gt;
  
  
  Assuming "Passes Tests" Means "Structurally Correct"
&lt;/h3&gt;

&lt;p&gt;A generated endpoint that passes your unit tests may still contain ORM usage violations that only surface under production load, or query composition errors that appear when the data gets large enough. "Green tests" is a necessary but not sufficient condition for structurally correct agent-generated backend code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using a Single Prompt to Load All Constraints
&lt;/h3&gt;

&lt;p&gt;A related failure mode: developers pack every structural constraint into a single, complex prompt and wonder why agent performance drops. The constraint decay model suggests that accumulation is the problem. Splitting complex tasks into smaller steps — each with fewer simultaneous constraints — should reduce the compounding decay effect, even if total task count increases.&lt;/p&gt;

&lt;h3&gt;
  
  
  Not Accounting for the Framework's Implicit Contract
&lt;/h3&gt;

&lt;p&gt;Assigning Django tasks to agents without providing explicit documentation of the project's ORM patterns, migration conventions, and signal usage is asking the agent to infer that implicit contract from context. Some models are better at this than others, but the paper's data shows that even the best-performing models suffer significant degradation on convention-heavy stacks.&lt;/p&gt;




&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Does constraint decay affect all LLMs equally?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The paper tested multiple capable models, including Qwen3-Coder-Next (80B), MiniMax-M2.5, Kimi-K2.5, and GPT-5.2. The decay pattern appears across all of them — no model is immune. The best-performing models under unconstrained conditions (85%+ baseline) still lose roughly 30 percentage points when all structural constraints are applied. The relative ranking of models may shift under constraint pressure, but the decay itself is universal in the paper's data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is constraint decay the same as context window degradation?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No, though the two can interact. Context window degradation (also called "lost in the middle" failure) refers to models losing attention to information placed in the middle of long prompts. Constraint decay is a different phenomenon: it measures performance loss as the number of structural requirements increases, independent of prompt length. A fully constrained task specification can be shorter than an unconstrained one if the constraints are explicit. Constraint decay is about the cognitive complexity of satisfying multiple structural requirements simultaneously, not about prompt length or token position.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why do minimal frameworks like Flask outperform Django at baseline?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The paper frames this as a convention overhead problem. Flask is explicit by design — almost everything that happens in a Flask application is written in the application code. There is no hidden ORM layer, no admin interface convention, no magic migration system. An LLM agent generating Flask code makes visible, auditable decisions. Django's conventions are not written in the application code; they live in the framework's documentation and the project's accumulated patterns. Agents that have not internalized the specific project's Django conventions generate code that is structurally incorrect even when it is functionally reasonable. FastAPI occupies a middle position because its HTTP routing is explicit (type annotations are visible) but its dependency injection and ORM integration patterns carry convention overhead comparable to Django.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What does this mean for AI coding agents in production deployments?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The practical implication is that AI coding agents in their current state should not be trusted as autonomous backend generators for constrained, production-grade tasks without structural compliance checks in the review pipeline. The paper is not arguing that AI agents are useless for backend development: unconstrained generation at 85%+ is genuinely useful for scaffolding and boilerplate. The argument is that the last mile, making generated code conform to your project's structural requirements, is where current agents fail most and where current benchmarks provide the least signal.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;The constraint decay paper is notable because it quantifies a failure mode that practitioners have observed informally for the past two years. The key numbers to keep in mind:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;30 percentage point average drop&lt;/strong&gt; in assertion pass rates from baseline to fully constrained tasks (40% relative performance loss)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;25–32 point baseline gap&lt;/strong&gt; between minimal frameworks (Flask, Koa) and convention-heavy ones (Django, FastAPI)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data-layer defects&lt;/strong&gt; — bad query composition and ORM violations — are the leading root cause across all frameworks and models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Existing benchmarks&lt;/strong&gt; (HumanEval, SWE-bench) do not measure non-functional structural compliance, which means they systematically overstate agent readiness for production-constrained tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For teams actively using AI coding agents on backend work, the immediate practical actions are: make structural constraints explicit in every prompt, add structural compliance verification to the CI pipeline, and avoid projecting unconstrained benchmark scores onto constrained production tasks.&lt;/p&gt;

&lt;p&gt;The PoC that Effloow Lab ran suggests an exponential decay shape fits the paper's reported summary statistics well. With a fitted decay rate of ~0.19, each new structural constraint multiplies the remaining pass rate by roughly 0.83, which compounds quickly across the five constraint levels the paper tests. In the paper's data this is not a quirk of a specific model or framework. It looks like a structural property of the problem, and nothing in these results suggests it will disappear simply because models get larger.&lt;/p&gt;
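&lt;p&gt;The arithmetic behind that multiplier can be checked in a few lines. This is a minimal sketch that assumes the fitted curve has the form exp(-rate × k), where k is the number of added constraints. The function name is illustrative, and how k maps onto the paper's constraint levels is an assumption.&lt;/p&gt;

```python
import math

# Decay rate quoted for the fitted curve in the text.
DECAY_RATE = 0.19


def remaining_fraction(constraints, rate=DECAY_RATE):
    """Fraction of the baseline pass rate left after `constraints` constraints."""
    return math.exp(-rate * constraints)


# One constraint: exp(-0.19), which rounds to 0.83.
per_constraint = remaining_fraction(1)
print(f"per-constraint multiplier: {per_constraint:.3f}")

# Compounding: each extra constraint multiplies by the same factor again.
for k in range(6):
    print(k, f"{remaining_fraction(k):.3f}")
```

&lt;p&gt;Because the decay is multiplicative, the fraction remaining after k constraints equals the per-constraint multiplier raised to the power k. That is why a factor that looks mild for one constraint becomes a large loss after several.&lt;/p&gt;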




&lt;p&gt;&lt;strong&gt;Source:&lt;/strong&gt; &lt;a href="https://arxiv.org/abs/2605.06445" rel="noopener noreferrer"&gt;Constraint Decay: The Fragility of LLM Agents in Backend Code Generation&lt;/a&gt; — arXiv 2605.06445 (May 2026)&lt;/p&gt;

</description>
      <category>airesearch</category>
      <category>paperpoc</category>
      <category>llm</category>
      <category>2026</category>
    </item>
    <item>
      <title>OpenTelemetry GenAI: Trace LLM Agent Tool Calls</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Wed, 03 Jun 2026 00:15:58 +0000</pubDate>
      <link>https://dev.to/jangwook_kim_e31e7291ad98/opentelemetry-genai-trace-llm-agent-tool-calls-c7k</link>
      <guid>https://dev.to/jangwook_kim_e31e7291ad98/opentelemetry-genai-trace-llm-agent-tool-calls-c7k</guid>
      <description>&lt;p&gt;When an LLM agent fails, the hard question is rarely "did the model answer?" It is "where did the run go wrong?" The model call may be slow, a tool may have retried, the agent may have used the wrong retrieval result, or the final answer may have hidden a failed intermediate step. Plain logs can show pieces of that story, but they usually do not preserve the hierarchy.&lt;/p&gt;

&lt;p&gt;OpenTelemetry's GenAI semantic conventions are becoming the common vocabulary for that hierarchy. The official OpenTelemetry GenAI observability walkthrough, published May 14, 2026, shows an agent trace with a top-level &lt;code&gt;invoke_agent&lt;/code&gt; span, child &lt;code&gt;chat&lt;/code&gt; spans, and &lt;code&gt;execute_tool&lt;/code&gt; spans for tool calls. The same post points to token-count attributes such as &lt;code&gt;gen_ai.usage.input_tokens&lt;/code&gt;, &lt;code&gt;gen_ai.usage.output_tokens&lt;/code&gt;, and finish reasons such as &lt;code&gt;gen_ai.response.finish_reasons&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Effloow Lab ran a local sandbox PoC for this article. The lab installed OpenTelemetry Python packages, imported the Anthropic instrumentation package, and exported a four-span agent trace to JSON without API keys or live model calls. The evidence note is at &lt;code&gt;data/lab-runs/opentelemetry-genai-llm-agent-tracing-sandbox-poc-2026.md&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Effloow Lab&lt;/strong&gt; — Local sandbox on macOS 15.6 arm64 with Python 3.12.8, &lt;code&gt;opentelemetry-sdk==1.42.1&lt;/code&gt;, &lt;code&gt;opentelemetry-exporter-otlp==1.42.1&lt;/code&gt;, &lt;code&gt;opentelemetry-instrumentation-anthropic==0.61.0&lt;/code&gt;, and no LLM API calls.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why LLM Agent Tracing Needs a Standard
&lt;/h2&gt;

&lt;p&gt;Traditional application traces already answer useful questions: which service called which dependency, how long the database query took, and where an exception appeared. Agent traces need those answers plus a few GenAI-specific details.&lt;/p&gt;

&lt;p&gt;For an agent run, the important units are not only HTTP requests. They are model calls, tool calls, retrieval calls, handoffs, prompt events, completion events, token usage, and sometimes agent-to-agent delegation. If every framework invents its own names for those units, observability becomes vendor-specific. Traces emitted by a coding agent, a customer-support agent, and a workflow agent may all describe the same shape with incompatible fields.&lt;/p&gt;

&lt;p&gt;The current OpenTelemetry GenAI convention gives teams a shared naming layer. The official semantic-convention docs define GenAI signals for events, exceptions, metrics, model spans, agent spans, and framework spans. The client-span docs describe a model inference span as a client call to a GenAI model or service, with required attributes such as &lt;code&gt;gen_ai.operation.name&lt;/code&gt; and &lt;code&gt;gen_ai.provider.name&lt;/code&gt; when available. The same docs define &lt;code&gt;execute_tool&lt;/code&gt; as the operation name for tool execution spans and recommend &lt;code&gt;gen_ai.tool.name&lt;/code&gt; plus &lt;code&gt;gen_ai.tool.call.id&lt;/code&gt; when those values exist.&lt;/p&gt;

&lt;p&gt;That standardization matters most when an agent is connected to production tools. A trace can show whether the agent called the model twice, whether a tool call was responsible for latency, and whether the model stopped because it requested a tool or because it finished normally. Without this structure, teams often debug agent failures by reading unstructured logs and hoping the right correlation ID survived.&lt;/p&gt;

&lt;h2&gt;
  
  
  Current Status: Useful, but Still Moving
&lt;/h2&gt;

&lt;p&gt;This is not a "set it once and forget it" spec. In the OpenTelemetry docs as reviewed on June 3, 2026, many GenAI semantic-convention fields are marked with Development status. The GenAI docs also describe a transition plan for instrumentation libraries, including &lt;code&gt;OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental&lt;/code&gt; for libraries that can emit newer convention versions.&lt;/p&gt;

&lt;p&gt;That has two practical consequences.&lt;/p&gt;

&lt;p&gt;First, production systems should tolerate both older and newer attribute names during the transition. For example, many current examples and libraries still emit &lt;code&gt;gen_ai.system&lt;/code&gt;, while newer convention text emphasizes &lt;code&gt;gen_ai.provider.name&lt;/code&gt;. In the sandbox PoC, Effloow Lab wrote both attributes on the simulated Anthropic chat span:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"gen_ai.system"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"anthropic"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"gen_ai.provider.name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"anthropic"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"gen_ai.request.model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"claude-sonnet-4-20250514"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"gen_ai.response.model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"claude-sonnet-4-20250514"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"gen_ai.usage.input_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;184&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"gen_ai.usage.output_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;47&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Second, teams should avoid building fragile dashboards that depend on a single experimental field name. Use the convention where it exists, but keep the ingestion layer able to normalize aliases. This is especially important for GenAI backends that aggregate traces from multiple SDKs, model providers, and agent frameworks.&lt;/p&gt;
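&lt;p&gt;A minimal sketch of that normalization step, using only the two attribute names mentioned above. The alias table and function name are hypothetical, and a real ingestion layer would likely cover more fields.&lt;/p&gt;

```python
# Assumed alias table: older attribute name mapped to the newer name.
ALIASES = {"gen_ai.system": "gen_ai.provider.name"}


def normalize_attributes(attrs):
    """Return a copy of attrs with older keys mapped to their newer names."""
    normalized = dict(attrs)
    for old, new in ALIASES.items():
        if old in normalized:
            value = normalized.pop(old)
            # Keep a value already present under the newer name.
            normalized.setdefault(new, value)
    return normalized


legacy_span = {"gen_ai.system": "anthropic", "gen_ai.usage.input_tokens": 184}
print(normalize_attributes(legacy_span))
```

&lt;p&gt;Dropping the older key after mapping is a design choice made for this sketch. A pipeline that must keep serving older dashboards could retain both keys instead.&lt;/p&gt;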

&lt;h2&gt;
  
  
  What the Sandbox Proved
&lt;/h2&gt;

&lt;p&gt;The sandbox created a temporary virtualenv under &lt;code&gt;/tmp/effloow-otel-genai-poc&lt;/code&gt; and installed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;opentelemetry-sdk==1.42.1
opentelemetry-exporter-otlp==1.42.1
opentelemetry-instrumentation-anthropic==0.61.0
opentelemetry-semantic-conventions-ai==0.5.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first import attempt surfaced a real dependency issue: importing &lt;code&gt;opentelemetry.instrumentation.anthropic&lt;/code&gt; failed until the Anthropic client and Pydantic were also installed. After adding &lt;code&gt;anthropic==0.105.2&lt;/code&gt; and &lt;code&gt;pydantic==2.13.4&lt;/code&gt;, the instrumentation package imported successfully.&lt;/p&gt;

&lt;p&gt;Then the PoC manually emitted an agent-shaped trace with a custom JSON exporter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;span_count 4
chat CLIENT 211221f62b7b1a6e [...]
execute_tool INTERNAL 211221f62b7b1a6e [...]
execute_tool INTERNAL 211221f62b7b1a6e [...]
invoke_agent INTERNAL None [...]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The span tree had one trace ID and one root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"span_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"span_names"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"chat"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"execute_tool"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"execute_tool"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"invoke_agent"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"trace_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"0f1035558bef566e0d26981c0031d202"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The root &lt;code&gt;invoke_agent&lt;/code&gt; span had &lt;code&gt;gen_ai.operation.name=invoke_agent&lt;/code&gt; and &lt;code&gt;gen_ai.agent.name=local-research-assistant&lt;/code&gt;. The &lt;code&gt;chat&lt;/code&gt; span had model, provider, token-count, and finish-reason attributes. The two &lt;code&gt;execute_tool&lt;/code&gt; spans had &lt;code&gt;gen_ai.operation.name=execute_tool&lt;/code&gt;, &lt;code&gt;gen_ai.tool.name&lt;/code&gt;, and &lt;code&gt;gen_ai.tool.call.id&lt;/code&gt;.&lt;/p&gt;
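&lt;p&gt;The "one trace ID, one root" property can be asserted mechanically on exported spans. The sketch below assumes each exported record is a plain dict with &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;span_id&lt;/code&gt;, &lt;code&gt;parent_id&lt;/code&gt;, and &lt;code&gt;trace_id&lt;/code&gt; keys. That layout and the child span IDs are illustrative, not the lab's actual export format.&lt;/p&gt;

```python
TRACE_ID = "0f1035558bef566e0d26981c0031d202"
ROOT_ID = "211221f62b7b1a6e"

# Hypothetical exported records; the child span IDs are placeholders.
spans = [
    {"name": "chat", "span_id": "child-1", "parent_id": ROOT_ID, "trace_id": TRACE_ID},
    {"name": "execute_tool", "span_id": "child-2", "parent_id": ROOT_ID, "trace_id": TRACE_ID},
    {"name": "execute_tool", "span_id": "child-3", "parent_id": ROOT_ID, "trace_id": TRACE_ID},
    {"name": "invoke_agent", "span_id": ROOT_ID, "parent_id": None, "trace_id": TRACE_ID},
]


def summarize_tree(records):
    """Count trace IDs, list root spans, and list spans whose parent is missing."""
    span_ids = {r["span_id"] for r in records}
    return {
        "trace_ids": len({r["trace_id"] for r in records}),
        "roots": [r["name"] for r in records if r["parent_id"] is None],
        "orphans": [
            r["name"]
            for r in records
            if r["parent_id"] is not None and r["parent_id"] not in span_ids
        ],
    }


summary = summarize_tree(spans)
print(summary)
```

&lt;p&gt;A non-empty &lt;code&gt;orphans&lt;/code&gt; list usually points to lost context propagation, which is the kind of hierarchy break that plain logs tend to hide.&lt;/p&gt;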

&lt;p&gt;This proves the local instrumentation shape, not production correctness. No live Claude or OpenAI request was made. No provider token accounting was verified. No Jaeger UI screenshot was captured. The Docker attempt to run &lt;code&gt;jaegertracing/all-in-one:latest&lt;/code&gt; stalled during credential lookup while pulling the image, so the lab stopped that path and kept the backend limitation explicit.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reproduce the Local Trace Export
&lt;/h2&gt;

&lt;p&gt;Create a throwaway sandbox:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /tmp/effloow-otel-genai-poc
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /tmp/effloow-otel-genai-poc
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv /tmp/effloow-otel-genai-poc/.venv
/tmp/effloow-otel-genai-poc/.venv/bin/python &lt;span class="nt"&gt;-m&lt;/span&gt; pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--upgrade&lt;/span&gt; pip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Install the packages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/tmp/effloow-otel-genai-poc/.venv/bin/python &lt;span class="nt"&gt;-m&lt;/span&gt; pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  opentelemetry-sdk&lt;span class="o"&gt;==&lt;/span&gt;1.42.1 &lt;span class="se"&gt;\&lt;/span&gt;
  opentelemetry-exporter-otlp&lt;span class="o"&gt;==&lt;/span&gt;1.42.1 &lt;span class="se"&gt;\&lt;/span&gt;
  opentelemetry-instrumentation-anthropic&lt;span class="o"&gt;==&lt;/span&gt;0.61.0 &lt;span class="se"&gt;\&lt;/span&gt;
  anthropic&lt;span class="o"&gt;==&lt;/span&gt;0.105.2 &lt;span class="se"&gt;\&lt;/span&gt;
  pydantic&lt;span class="o"&gt;==&lt;/span&gt;2.13.4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important pattern is to initialize a &lt;code&gt;TracerProvider&lt;/code&gt;, attach an exporter, then create nested spans. A simplified version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.sdk.resources&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Resource&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.sdk.trace&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TracerProvider&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.sdk.trace.export&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SimpleSpanProcessor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ConsoleSpanExporter&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.trace&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SpanKind&lt;/span&gt;

&lt;span class="n"&gt;provider&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TracerProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;service.name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent-tracing-demo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_span_processor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;SimpleSpanProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ConsoleSpanExporter&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_tracer_provider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tracer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_tracer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;demo.genai&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;invoke_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SpanKind&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INTERNAL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.operation.name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;invoke_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.agent.name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local-research-assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SpanKind&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CLIENT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.operation.name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.provider.name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.request.model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.usage.input_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;184&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.usage.output_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;47&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;execute_tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SpanKind&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INTERNAL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.operation.name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;execute_tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.tool.name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.tool.call.id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;toolu_001&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is enough to validate the span hierarchy before wiring a real provider SDK. Once a real backend is available, swap the console or JSON exporter for OTLP and send traces to a collector or observability backend.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Instrument in a Real Agent
&lt;/h2&gt;

&lt;p&gt;Start with the trace tree, not with dashboards. A useful production trace should let an engineer answer five questions quickly:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Which agent run is this?&lt;/li&gt;
&lt;li&gt;Which model calls happened?&lt;/li&gt;
&lt;li&gt;Which tools executed?&lt;/li&gt;
&lt;li&gt;Which step consumed time, retries, or tokens?&lt;/li&gt;
&lt;li&gt;Which sensitive content was intentionally not recorded?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For most teams, the first useful span layout looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;invoke_agent
  chat claude-sonnet-4
  execute_tool search_docs
  chat claude-sonnet-4
  execute_tool create_ticket
  chat claude-sonnet-4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use model spans for provider calls. Use tool spans for function tools, MCP tools, retrieval tools, database reads, file edits, or workflow actions. Add token counts when the provider returns them. Add finish reasons when the provider exposes them. Record exceptions on spans instead of burying them in logs.&lt;/p&gt;

&lt;p&gt;Do not record full prompts, tool arguments, or tool results by default. The OpenTelemetry blog notes that content capture is opt-in because prompts and tool payloads may contain sensitive data. In the Effloow sandbox, prompt and payload content was intentionally represented only as &lt;code&gt;content_recorded=false&lt;/code&gt; and &lt;code&gt;payload_recorded=false&lt;/code&gt; event attributes.&lt;/p&gt;
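
&lt;p&gt;One way to make "off by default" concrete is to filter attributes before they reach a span. The following is a minimal sketch, not the Effloow sandbox code: the keys in &lt;code&gt;CONTENT_KEYS&lt;/code&gt; and the &lt;code&gt;AGENT_TRACE_CAPTURE_CONTENT&lt;/code&gt; variable are hypothetical names chosen for illustration, and the flags are written as plain attributes rather than event attributes.&lt;/p&gt;

```python
import os

# Hypothetical attribute keys that carry prompt or payload content.
# These names are illustrative, not taken from the semantic conventions.
CONTENT_KEYS = frozenset({"prompt.text", "tool.arguments", "tool.result"})


def content_capture_enabled(env=None):
    """Return True only when capture is switched on explicitly."""
    env = os.environ if env is None else env
    # AGENT_TRACE_CAPTURE_CONTENT is an assumed variable name for this sketch.
    return env.get("AGENT_TRACE_CAPTURE_CONTENT", "").strip().lower() == "true"


def safe_attributes(attributes, capture_content=False):
    """Drop content-bearing attributes unless capture is enabled.

    Each withheld category is marked with a *_recorded=False flag so the
    trace shows that content was left out on purpose.
    """
    if capture_content:
        return dict(attributes)
    kept = {k: v for k, v in attributes.items() if k not in CONTENT_KEYS}
    if "prompt.text" in attributes:
        kept["content_recorded"] = False
    if "tool.arguments" in attributes or "tool.result" in attributes:
        kept["payload_recorded"] = False
    return kept
```

&lt;p&gt;Calling &lt;code&gt;safe_attributes(attrs, content_capture_enabled())&lt;/code&gt; at the point where attributes are set keeps the opt-in decision in one place instead of scattering it across every tool wrapper.&lt;/p&gt;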

&lt;h2&gt;
  
  
  Collector and Backend Path
&lt;/h2&gt;

&lt;p&gt;OpenTelemetry's Collector is the normal production bridge between instrumented services and backends. The official Collector docs describe it as a vendor-agnostic way to receive, process, and export telemetry data. The docs also note why a collector is useful beyond local development: retries, batching, encryption, and sensitive-data filtering can live in the collector instead of every application service.&lt;/p&gt;

&lt;p&gt;For a GenAI agent service, a reasonable path is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;agent app
  -&amp;gt; OTLP exporter
  -&amp;gt; local or sidecar OpenTelemetry Collector
  -&amp;gt; processor pipeline for batching and redaction
  -&amp;gt; Jaeger, Tempo, Honeycomb, Datadog, New Relic, or another backend
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The sandbox did not complete this backend path because the local Jaeger Docker pull blocked on credential lookup. That limitation matters. A JSON trace proves the span shape; a backend ingest test proves that the pipeline, collector config, and UI can preserve that shape. Treat those as separate checks.&lt;/p&gt;
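
&lt;p&gt;The first of those checks, the span shape, can be automated against an exported trace. This sketch assumes a simplified export: a JSON array of objects with &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;span_id&lt;/code&gt;, &lt;code&gt;parent_id&lt;/code&gt;, and &lt;code&gt;start&lt;/code&gt; fields. That flat schema is an assumption for illustration, so adapt the field access to whatever your exporter actually writes.&lt;/p&gt;

```python
import json


def child_names(spans, root_name="invoke_agent"):
    """Return the names of the root span's direct children, ordered by start."""
    roots = [s for s in spans if s.get("parent_id") is None]
    if len(roots) != 1 or roots[0]["name"] != root_name:
        raise ValueError("expected exactly one root span named " + root_name)
    root_id = roots[0]["span_id"]
    children = [s for s in spans if s.get("parent_id") == root_id]
    children.sort(key=lambda s: s["start"])
    return [s["name"] for s in children]


def has_agent_shape(spans):
    """True when the root has at least one chat child and one execute_tool child."""
    names = child_names(spans)
    has_chat = any(n.startswith("chat") for n in names)
    has_tool = any(n.startswith("execute_tool") for n in names)
    return has_chat and has_tool


# Hypothetical export with the assumed flat schema.
sample = json.loads("""
[
  {"name": "invoke_agent", "span_id": "a1", "parent_id": null, "start": 0},
  {"name": "chat claude-sonnet-4", "span_id": "b1", "parent_id": "a1", "start": 1},
  {"name": "execute_tool search_docs", "span_id": "b2", "parent_id": "a1", "start": 2},
  {"name": "chat claude-sonnet-4", "span_id": "b3", "parent_id": "a1", "start": 3}
]
""")
print(child_names(sample))
print(has_agent_shape(sample))
```

&lt;p&gt;A check like this covers only the local half. It says nothing about whether a collector or backend preserves parent links, which still needs the separate ingest test.&lt;/p&gt;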

&lt;h2&gt;
  
  
  Common Mistakes
&lt;/h2&gt;

&lt;p&gt;The first mistake is tracing only the model call. A model-only trace can show latency and token usage, but it cannot explain whether a tool was slow, whether the agent loop repeated, or whether a retrieval step returned bad context.&lt;/p&gt;

&lt;p&gt;The second mistake is recording too much content. Full prompts, tool arguments, and tool results are attractive during development and dangerous in production. If you enable content capture, pair it with retention limits, redaction, access control, and a clear reason.&lt;/p&gt;

&lt;p&gt;The third mistake is pretending the conventions are fully stable. They are useful today, but teams should expect field-name movement. Normalize at ingestion and keep dashboards focused on a small set of durable fields: operation name, provider, requested model, response model, tool name, tool call ID, duration, error type, and token counts.&lt;/p&gt;
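
&lt;p&gt;Normalizing at ingestion can be as small as a rename table applied before attributes reach dashboards. The sketch below is illustrative: the alias table reflects renames as we understand them and should be treated as an assumption to verify against the semantic conventions release your SDKs emit.&lt;/p&gt;

```python
# Assumed alias table: older attribute names on the left, the names this
# sketch treats as current on the right. Verify against the semantic
# conventions release that your SDKs actually emit.
ALIASES = {
    "gen_ai.system": "gen_ai.provider.name",
    "gen_ai.usage.prompt_tokens": "gen_ai.usage.input_tokens",
    "gen_ai.usage.completion_tokens": "gen_ai.usage.output_tokens",
}


def normalize(attributes):
    """Rename known legacy keys; a value already under the current key wins."""
    result = {}
    # First pass: copy everything that is not a legacy key.
    for key, value in attributes.items():
        if key not in ALIASES:
            result[key] = value
    # Second pass: fill in renamed keys without overwriting current ones.
    for key, value in attributes.items():
        if key in ALIASES:
            result.setdefault(ALIASES[key], value)
    return result
```

&lt;p&gt;Keeping the table in one module means a convention change becomes a one-line edit rather than a dashboard migration.&lt;/p&gt;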

&lt;p&gt;The fourth mistake is treating observability as safety. A trace can show what happened. It does not approve tool use, block prompt injection, enforce data policy, or validate outputs. For agent safety, combine tracing with guardrails, tool approval, scoped credentials, and runtime policy checks. Effloow's &lt;a href="https://dev.to/articles/openai-agents-sdk-guardrails-local-sandbox-poc-2026"&gt;OpenAI Agents SDK guardrails PoC&lt;/a&gt; covers a separate local pattern for tripwire testing.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Q: Is OpenTelemetry GenAI ready for production LLM agents?
&lt;/h3&gt;

&lt;p&gt;It is ready enough to pilot for traces, metrics, and events, but the GenAI semantic conventions are still in Development status in the current docs. Use them, but normalize changing attributes and avoid assuming every SDK emits the same field set.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Do I need Jaeger to use OpenTelemetry for LLM tracing?
&lt;/h3&gt;

&lt;p&gt;No. Jaeger is one possible backend. OpenTelemetry emits telemetry through SDKs and exporters, commonly through OTLP. You can send traces to an OpenTelemetry Collector and then to any compatible backend. The Effloow sandbox used a JSON exporter because the local Jaeger Docker image pull did not complete.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Should I record prompts and tool results in spans?
&lt;/h3&gt;

&lt;p&gt;Default to no. Record model names, operation names, tool names, token counts, durations, finish reasons, and errors first. Full prompts and tool payloads may contain secrets or customer data, so they should be opt-in and governed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: What is the minimum useful agent trace?
&lt;/h3&gt;

&lt;p&gt;One root run span, model-call spans, and tool-call spans. If you can see &lt;code&gt;invoke_agent -&amp;gt; chat -&amp;gt; execute_tool -&amp;gt; chat&lt;/code&gt;, you can already debug more than a flat log stream.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;OpenTelemetry GenAI tracing is useful because it makes an agent run inspectable as a hierarchy. The model call, tool calls, token usage, finish reasons, and errors can live in one trace instead of scattered logs.&lt;/p&gt;

&lt;p&gt;The Effloow Lab PoC proved a narrow but practical point: a local Python app can emit an agent-shaped OpenTelemetry trace with GenAI-style attributes and no API key. It did not prove live Anthropic/OpenAI auto-instrumentation, Jaeger rendering, provider token accounting, or production collector behavior.&lt;/p&gt;

&lt;p&gt;For production, start small: emit the span tree, keep content capture off by default, normalize convention changes, route through a collector when the service becomes real, and treat tracing as observability rather than policy enforcement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bottom Line&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;OpenTelemetry GenAI is the right direction for agent observability, but the responsible rollout is incremental: prove the trace shape locally, keep sensitive payloads out, then validate backend ingest before depending on it during incidents.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://opentelemetry.io/blog/2026/genai-observability/" rel="noopener noreferrer"&gt;OpenTelemetry: Inside the LLM Call: GenAI Observability with OpenTelemetry&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/" rel="noopener noreferrer"&gt;OpenTelemetry: Semantic conventions for generative AI systems&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/" rel="noopener noreferrer"&gt;OpenTelemetry: Semantic conventions for generative client AI spans&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://opentelemetry.io/docs/languages/python/instrumentation/" rel="noopener noreferrer"&gt;OpenTelemetry: Python manual instrumentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/open-telemetry/opentelemetry-python-contrib" rel="noopener noreferrer"&gt;OpenTelemetry Python Contrib repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pypi.org/project/opentelemetry-instrumentation-anthropic/0.61.0/" rel="noopener noreferrer"&gt;PyPI: opentelemetry-instrumentation-anthropic 0.61.0&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://opentelemetry.io/docs/collector/" rel="noopener noreferrer"&gt;OpenTelemetry Collector docs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>opentelemetry</category>
      <category>genai</category>
      <category>observability</category>
      <category>llmagents</category>
    </item>
    <item>
      <title>Amazon OpenSearch Agentic AI: Investigation Agent Guide</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Tue, 02 Jun 2026 12:12:49 +0000</pubDate>
      <link>https://dev.to/jangwook_kim_e31e7291ad98/amazon-opensearch-agentic-ai-investigation-agent-guide-53j0</link>
      <guid>https://dev.to/jangwook_kim_e31e7291ad98/amazon-opensearch-agentic-ai-investigation-agent-guide-53j0</guid>
      <description>&lt;p&gt;Amazon OpenSearch Service is turning observability search into an agent workflow. The important change is not "chat over logs" by itself. It is the combination of natural-language query generation, multi-step investigation, memory across the OpenSearch UI, and ranked root-cause hypotheses that developers can inspect.&lt;/p&gt;

&lt;p&gt;AWS announced agentic AI for log analytics in Amazon OpenSearch Service on March 31, 2026. The launch introduced Agentic Chat, Investigation Agent, and Agentic Memory for engineering and support teams working inside OpenSearch UI. AWS says Investigation Agent can plan an investigation, execute queries, reflect on results, and return structured root-cause hypotheses ranked by likelihood. Agentic Memory keeps investigation context available as a user moves between feature pages or refreshes the browser, but it does not carry context across separate conversation threads.&lt;/p&gt;

&lt;p&gt;Effloow Lab ran a local sandbox PoC for this article. The PoC did not call AWS, run OpenSearch, use OpenSearch Dashboards, or execute a real LLM agent. It simulated the documented workflow shape with synthetic logs: plan, query-like analysis, baseline comparison, working memory, long-term findings, audit history, and ranked hypotheses. The lab note is saved at &lt;code&gt;data/lab-runs/amazon-opensearch-agentic-ai-investigation-agent-guide-2026.md&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;Incident investigation is usually a context problem before it is an AI problem. A developer starts with a vague symptom: checkout 500s, p95 latency, increased timeout errors, or a dashboard that looks wrong. The next steps require switching between logs, traces, metrics, deployment history, shard state, query syntax, and prior debugging notes.&lt;/p&gt;

&lt;p&gt;Amazon OpenSearch already sits close to that workflow for teams using it as a search, log analytics, vector, or observability backend. The new agentic layer matters because it tries to move the interface from "write the right query" to "state the goal, inspect the agent's steps, and verify the evidence."&lt;/p&gt;

&lt;p&gt;That shift is useful only if the agent remains auditable. The best version of this feature is not a black-box incident oracle. It is a structured assistant that shows the plan, runs bounded analysis tools, preserves context, and gives humans evidence they can accept, reject, or rerun.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Amazon Added
&lt;/h2&gt;

&lt;p&gt;There are four separate but related pieces to understand.&lt;/p&gt;

&lt;p&gt;First, &lt;strong&gt;Agentic Chat&lt;/strong&gt; is embedded in OpenSearch UI. AWS documentation says it can answer questions about the data, generate PPL queries in Discover, refine generated queries through follow-up instructions, analyze visualizations, and start investigations through a &lt;code&gt;/investigate&lt;/code&gt; command or UI action.&lt;/p&gt;

&lt;p&gt;Second, &lt;strong&gt;Investigation Agent&lt;/strong&gt; is the deeper incident-analysis workflow. The official docs describe it as a goal-driven research agent that plans from the stated goal and available data, executes queries and analysis, reflects through multiple steps, and returns ranked hypotheses with supporting evidence. The result page includes a primary hypothesis, alternative hypotheses, investigation steps, relevant findings, and user controls to accept or rule out a conclusion.&lt;/p&gt;

&lt;p&gt;Third, &lt;strong&gt;Agentic Memory&lt;/strong&gt; is the continuity layer. AWS says it powers both Agentic Chat and Investigation Agent, persists context across page navigation and browser refreshes, isolates memory by user ID, and stores memory in a service-managed OpenSearch Serverless collection. AWS also states that Agentic Memory cannot retain context across different conversation threads.&lt;/p&gt;

&lt;p&gt;Fourth, the broader OpenSearch ecosystem is moving in the same direction. OpenSearch 3.5 added agentic conversation memory, context management, and a redesigned no-code agent interface with MCP integration. The open source OpenSearch documentation describes agentic memory containers with &lt;code&gt;sessions&lt;/code&gt;, &lt;code&gt;working&lt;/code&gt;, &lt;code&gt;long-term&lt;/code&gt;, and &lt;code&gt;history&lt;/code&gt; memory types. AWS also published OpenSearch Agent Skills for agentic IDE workflows around search, logs, trace analytics, and migrations.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Simulated Locally
&lt;/h2&gt;

&lt;p&gt;The Effloow Lab sandbox used Python 3.12.8 on macOS and synthetic service logs. The script generated 1,358 log rows across &lt;code&gt;checkout&lt;/code&gt;, &lt;code&gt;payments&lt;/code&gt;, &lt;code&gt;catalog&lt;/code&gt;, and &lt;code&gt;auth&lt;/code&gt;. It injected a checkout 5xx incident window and a payments timeout window, then ran a deterministic investigation loop:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a four-step investigation plan.&lt;/li&gt;
&lt;li&gt;Group incident status codes by service.&lt;/li&gt;
&lt;li&gt;Compare p95 latency in the incident window against a baseline window.&lt;/li&gt;
&lt;li&gt;Store working memory, long-term hypotheses, and history records.&lt;/li&gt;
&lt;li&gt;Rank root-cause hypotheses with evidence.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The top simulated hypothesis was: "Payments timeout cascade drove checkout 5xx responses." The evidence was specific: payments returned 72 HTTP 504 events during the incident window, payments p95 latency increased by 1,330.1 ms over baseline, and the checkout 5xx spike overlapped the payments timeout window.&lt;/p&gt;

&lt;p&gt;This is not a benchmark and not a managed OpenSearch test. It is a small reproducibility check for the mental model. The simulation showed that the documented pattern is coherent: if an agent can preserve the plan, intermediate analysis, evidence, and hypothesis history, a human reviewer gets a better artifact than a one-shot chat answer.&lt;/p&gt;
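
&lt;p&gt;The baseline comparison in step 3 and the status grouping in step 2 are easy to sketch with the standard library. This is not the lab script: the row fields (&lt;code&gt;service&lt;/code&gt;, &lt;code&gt;ts&lt;/code&gt;, &lt;code&gt;latency_ms&lt;/code&gt;, &lt;code&gt;status&lt;/code&gt;), the window bounds, and the synthetic rows are assumptions made for illustration, and the numbers do not reproduce the PoC results.&lt;/p&gt;

```python
from collections import Counter
from statistics import quantiles


def in_window(row, bounds):
    """True when the row timestamp falls in [start, end)."""
    start, end = bounds
    return row["ts"] >= start and end > row["ts"]


def p95(values):
    """95th percentile (inclusive method); needs at least two values."""
    return quantiles(values, n=100, method="inclusive")[94]


def p95_shift(rows, service, baseline, incident):
    """Incident-window p95 latency minus baseline-window p95 for one service."""
    def latencies(bounds):
        return [r["latency_ms"] for r in rows
                if r["service"] == service and in_window(r, bounds)]
    return p95(latencies(incident)) - p95(latencies(baseline))


def status_counts(rows, bounds):
    """Count (service, status) pairs inside a window."""
    return Counter((r["service"], r["status"]) for r in rows if in_window(r, bounds))


# Synthetic rows: ten calm seconds, then ten seconds of payments timeouts.
rows = [{"service": "payments", "ts": t, "latency_ms": 100, "status": 200}
        for t in range(10)]
rows += [{"service": "payments", "ts": t, "latency_ms": 500, "status": 504}
         for t in range(10, 20)]

print(p95_shift(rows, "payments", baseline=(0, 10), incident=(10, 20)))
print(status_counts(rows, (10, 20))[("payments", 504)])
```

&lt;p&gt;Because every function is deterministic over its input rows, a reviewer can rerun the same windows and get the same evidence, which is the property that makes a hypothesis auditable.&lt;/p&gt;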

&lt;h2&gt;
  
  
  The Architecture Pattern
&lt;/h2&gt;

&lt;p&gt;The practical architecture is a loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;incident goal
  -&amp;gt; plan
  -&amp;gt; bounded data tools
  -&amp;gt; intermediate findings
  -&amp;gt; memory update
  -&amp;gt; reflection
  -&amp;gt; ranked hypotheses
  -&amp;gt; human accept / rule out / reinvestigate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The "bounded data tools" part is critical. Agentic Chat documentation lists tools such as &lt;code&gt;execute_ppl_query&lt;/code&gt;, &lt;code&gt;create_investigation&lt;/code&gt;, &lt;code&gt;SearchIndexTool&lt;/code&gt;, &lt;code&gt;MsearchTool&lt;/code&gt;, &lt;code&gt;CountTool&lt;/code&gt;, &lt;code&gt;ExplainTool&lt;/code&gt;, &lt;code&gt;IndexMappingTool&lt;/code&gt;, &lt;code&gt;ClusterHealthTool&lt;/code&gt;, &lt;code&gt;LogPatternAnalysisTool&lt;/code&gt;, &lt;code&gt;MetricChangeAnalysisTool&lt;/code&gt;, and &lt;code&gt;DataDistributionTool&lt;/code&gt;. That tool list makes the agent less magical and more operational: it is valuable because it can call specific analysis functions over OpenSearch data.&lt;/p&gt;

&lt;p&gt;For production teams, this means the agent should not replace existing observability hygiene. It depends on it. Clean index mappings, useful service labels, trace IDs, consistent timestamps, field-level security, and retention policies become more important when an agent is allowed to chain analysis steps.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Agentic Memory Helps
&lt;/h2&gt;

&lt;p&gt;Memory is useful in incident work because the first question is rarely the final question.&lt;/p&gt;

&lt;p&gt;A developer may start with "why did checkout error rate increase?" then ask "only show us-west-2," then "compare against the previous hour," then "include payments traces," then "rerun after excluding synthetic traffic." If every turn loses context, the agent becomes a query generator. If the session preserves working state, the workflow becomes an investigation.&lt;/p&gt;

&lt;p&gt;OpenSearch's open source memory docs are a helpful model here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;sessions&lt;/code&gt; hold the interaction context.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;working&lt;/code&gt; memory holds recent messages, agent state, execution traces, and temporary investigation data.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;long-term&lt;/code&gt; memory stores extracted knowledge or durable findings.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;history&lt;/code&gt; tracks memory operations for auditability.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our local PoC mirrored those categories. It stored one session, three working records, one long-term hypothesis record, and four history events. That structure made the final hypothesis easier to inspect because the conclusion was tied to the plan and intermediate results.&lt;/p&gt;
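
&lt;p&gt;A minimal in-memory stand-in for those four categories might look like the sketch below. It is not the OpenSearch memory API and not the PoC code. The class name, method, and record fields are hypothetical, and logging every write to history is a design assumption of this sketch.&lt;/p&gt;

```python
from dataclasses import dataclass, field

# Memory types a caller may write to; history is written only by add().
MEMORY_TYPES = ("sessions", "working", "long-term")


@dataclass
class MemoryContainer:
    """In-memory stand-in for the four memory types; not the OpenSearch API."""
    sessions: list = field(default_factory=list)
    working: list = field(default_factory=list)
    long_term: list = field(default_factory=list)
    history: list = field(default_factory=list)

    def add(self, memory_type, record):
        """Append a record and log the operation in history for auditability."""
        if memory_type not in MEMORY_TYPES:
            raise ValueError("unknown memory type: " + memory_type)
        store = getattr(self, memory_type.replace("-", "_"))
        store.append(record)
        self.history.append(
            {"op": "add", "type": memory_type, "index": len(store) - 1}
        )
        return record


memory = MemoryContainer()
memory.add("sessions", {"goal": "why did checkout 5xx increase?"})
memory.add("working", {"step": "group status codes by service"})
memory.add("long-term", {"hypothesis": "payments timeouts drove checkout 5xx"})
print(len(memory.history))
```

&lt;p&gt;Routing every write through one method is what ties a conclusion back to the steps that produced it: the history list records which memory type changed and in what order.&lt;/p&gt;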

&lt;p&gt;The caveat is equally important: memory can preserve mistakes. If the agent stores a weak assumption, a stale field meaning, or a misleading intermediate result, later steps may inherit that error. Teams should treat memory as evidence context, not ground truth.&lt;/p&gt;

&lt;h2&gt;
  
  
  Security And Governance Notes
&lt;/h2&gt;

&lt;p&gt;AWS's managed Agentic Memory docs state that memory storage is isolated by user ID and encrypted with a service-managed key, or with a customer managed key if CMK encryption is enabled for the OpenSearch UI application. The docs also say Agentic Memory is free to use, though the March launch post notes token-based usage limits for agentic AI features.&lt;/p&gt;

&lt;p&gt;The open source OpenSearch memory docs put more responsibility on implementers. Administrators or memory-container owners are responsible for data access controls, index-level permissions, document-level security, and custom prompt behavior. That distinction matters: managed Amazon OpenSearch Service and self-managed OpenSearch memory are not the same governance surface.&lt;/p&gt;

&lt;p&gt;For a production rollout, review these controls before treating agentic observability as safe:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which users can start investigations?&lt;/li&gt;
&lt;li&gt;Which indices, fields, and documents can each user query?&lt;/li&gt;
&lt;li&gt;Are memory records isolated by user, team, tenant, or incident?&lt;/li&gt;
&lt;li&gt;Can investigation traces reveal restricted fields?&lt;/li&gt;
&lt;li&gt;Does memory retain sensitive payloads longer than log retention policy?&lt;/li&gt;
&lt;li&gt;Can a human see the exact query, finding, and evidence chain behind a hypothesis?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This pairs naturally with the broader observability stack. If your LLM gateway is already traced through tools like &lt;a href="https://dev.to/articles/litellm-ai-gateway-llm-proxy-guide-2026"&gt;LiteLLM&lt;/a&gt; or &lt;a href="https://dev.to/articles/langfuse-llm-observability-self-host-guide-2026"&gt;Langfuse&lt;/a&gt;, OpenSearch investigation traces should be treated as another high-value audit artifact, not just UI state.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Developers Should Use It
&lt;/h2&gt;

&lt;p&gt;Amazon OpenSearch Agentic AI is most relevant for teams that already keep operational data in OpenSearch Service or are evaluating OpenSearch for observability and AI search.&lt;/p&gt;

&lt;p&gt;Use it when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Engineers already use OpenSearch UI during incidents.&lt;/li&gt;
&lt;li&gt;PPL or DSL query expertise is a bottleneck.&lt;/li&gt;
&lt;li&gt;Incident work requires correlating logs, metrics, traces, and index metadata.&lt;/li&gt;
&lt;li&gt;You need ranked hypotheses with evidence, not just a generated summary.&lt;/li&gt;
&lt;li&gt;Your team can review agent steps and reject weak conclusions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Be cautious when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Log fields are inconsistent or poorly mapped.&lt;/li&gt;
&lt;li&gt;Sensitive data appears in logs without masking.&lt;/li&gt;
&lt;li&gt;Access control depends on informal team norms rather than enforceable policy.&lt;/li&gt;
&lt;li&gt;Teams expect the agent to perform remediation automatically.&lt;/li&gt;
&lt;li&gt;You cannot audit the investigation steps after the incident.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The best early use case is investigation assistance, not autonomous repair. Let the agent propose likely causes, show evidence, and help narrow the search. Keep remediation behind explicit human approval and existing change-control paths.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Mistakes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Mistake 1: treating natural language as a permission model.&lt;/strong&gt; An agent that can understand a request still needs hard access boundaries. Field-level and document-level restrictions matter more when queries are generated dynamically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistake 2: skipping schema quality.&lt;/strong&gt; Agentic analysis is only as useful as the fields it can reason over. Service names, trace IDs, deployment IDs, status codes, regions, and error classes should be consistently indexed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistake 3: ignoring memory lifecycle.&lt;/strong&gt; Memory improves continuity, but it also creates state. Decide what should be stored, who can retrieve it, how long it should live, and how it aligns with incident-retention policy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistake 4: accepting the top hypothesis without reviewing alternatives.&lt;/strong&gt; AWS's Investigation Agent UI supports accepting, ruling out, and reviewing alternative hypotheses. Use that review flow. The most useful output is often the evidence trail, not the first answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistake 5: calling a simulation a production test.&lt;/strong&gt; Our PoC proved only that the workflow shape is easy to reproduce locally. It did not validate AWS latency, accuracy, pricing, region behavior, security isolation, or real OpenSearch query generation.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Q: What is Amazon OpenSearch Investigation Agent?
&lt;/h3&gt;

&lt;p&gt;It is an agentic root-cause analysis feature in OpenSearch UI. AWS documentation says it plans from a stated goal, executes queries and analysis, reflects through a multi-step workflow, and returns ranked hypotheses with evidence. It can be started from supported feature pages or from Agentic Chat with &lt;code&gt;/investigate&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Does Agentic Memory work across every conversation?
&lt;/h3&gt;

&lt;p&gt;No. AWS documentation says Agentic Memory preserves context for Agentic Chat and Investigation Agent across feature pages, browser tabs, and page refreshes, but it cannot retain context across different conversation threads.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Is Agentic Memory the same as self-managed OpenSearch memory containers?
&lt;/h3&gt;

&lt;p&gt;Not exactly. Amazon OpenSearch Service Agentic Memory is a managed memory layer for OpenSearch UI features. OpenSearch's agentic memory framework exposes memory containers and APIs that self-managed implementers configure themselves. The governance responsibility differs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Did Effloow Lab test Amazon OpenSearch Service?
&lt;/h3&gt;

&lt;p&gt;No. Effloow Lab ran a local Python simulation using synthetic logs. It did not use AWS credentials, Amazon OpenSearch Service, OpenSearch Dashboards, OpenSearch Serverless, PPL execution, or a live LLM.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Is there an extra price for these agentic features?
&lt;/h3&gt;

&lt;p&gt;AWS's March 31 launch post says the three log-analytics agentic capabilities are available at no additional cost, with token-based usage limits. AWS's Agentic Memory docs say Agentic Memory is free to use. For broader OpenSearch Serverless or cluster costs, use current AWS pricing pages rather than assuming this makes the full deployment free.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources Checked
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/about-aws/whats-new/2026/03/opensearch-agentic-ai-log-analytics-observability/" rel="noopener noreferrer"&gt;AWS What's New: Amazon OpenSearch Service introduces agentic AI for log analytics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/application-investigation-agent.html" rel="noopener noreferrer"&gt;AWS docs: Investigation Agent in Amazon OpenSearch Service&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/application-agentic-memory.html" rel="noopener noreferrer"&gt;AWS docs: Agentic Memory in Amazon OpenSearch Service&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/application-agentic-chat.html" rel="noopener noreferrer"&gt;AWS docs: Agentic Chat in Amazon OpenSearch Service&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/about-aws/whats-new/2026/03/amazon-opensearch-service-version-3-5/" rel="noopener noreferrer"&gt;AWS What's New: Amazon OpenSearch Service supports OpenSearch 3.5&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.opensearch.org/latest/ml-commons-plugin/agentic-memory/" rel="noopener noreferrer"&gt;OpenSearch docs: Agentic memory&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/aws/introducing-the-next-generation-of-amazon-opensearch-serverless-for-building-your-agentic-ai-applications/" rel="noopener noreferrer"&gt;AWS News Blog: next generation Amazon OpenSearch Serverless&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/big-data/opensearch-agent-skills-bring-built-in-intelligence-to-your-agentic-ide/" rel="noopener noreferrer"&gt;AWS Big Data Blog: OpenSearch Agent Skills&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;Amazon OpenSearch Agentic AI is a practical sign of where observability tools are heading: from query builders to auditable investigation assistants. The interesting part is not just natural-language search. It is the combination of query tools, investigation planning, memory, ranked hypotheses, and human review.&lt;/p&gt;

&lt;p&gt;For developers, the right adoption posture is measured. Use it to reduce query friction and preserve investigation context. Keep hard permissions, schema quality, evidence review, and remediation controls outside the agent's discretion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bottom Line&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Amazon OpenSearch Agentic AI looks most useful as an incident investigation assistant for teams already invested in OpenSearch. Start with read-only analysis and evidence review; do not treat it as autonomous incident remediation.&lt;/p&gt;

</description>
      <category>amazonopensearch</category>
      <category>agenticai</category>
      <category>observability</category>
      <category>incidentresponse</category>
    </item>
    <item>
      <title>SciAgentGYM: 1,780 Scientific Tools, One Hard Benchmark</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Tue, 02 Jun 2026 12:11:34 +0000</pubDate>
      <link>https://dev.to/jangwook_kim_e31e7291ad98/sciagentgym-1780-scientific-tools-one-hard-benchmark-2bgl</link>
      <guid>https://dev.to/jangwook_kim_e31e7291ad98/sciagentgym-1780-scientific-tools-one-hard-benchmark-2bgl</guid>
      <description>&lt;p&gt;Every week a new LLM claims to be "state-of-the-art on scientific tasks." Those claims usually rest on multiple-choice chemistry questions or single-step math proofs — tasks that a well-trained language model can pattern-match from training data alone.&lt;/p&gt;

&lt;p&gt;Real scientific work looks nothing like that. A chemist computing molecular properties calls a SMILES parser, feeds the output into a molecular geometry optimizer, runs a density functional theory calculation on the result, and extracts energy values from the DFT output. That's four sequential tool calls with strict dependency ordering. If any step fails, the whole workflow collapses.&lt;/p&gt;

&lt;p&gt;SciAgentGYM (arXiv:2602.12984), published by Fudan NLP researchers in February 2026, is the first benchmark environment built specifically for this kind of evaluation: multi-step scientific tool use in LLM agents. The results are sobering.&lt;/p&gt;

&lt;h2&gt;
  
  
  What SciAgentGYM Is — and Why It's Different
&lt;/h2&gt;

&lt;p&gt;Most LLM benchmarks test a model's knowledge. SciAgentGYM tests whether an agent can &lt;em&gt;operate&lt;/em&gt; in a scientific environment — selecting, sequencing, and executing domain-specific computational tools to reach a verifiable answer.&lt;/p&gt;

&lt;p&gt;The system has three tightly coupled components:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SciAgentGym (the environment)&lt;/strong&gt; provides 1,780 domain-specific scientific tools spanning four natural science disciplines: Physics, Chemistry, Biology (Life Sciences), and Materials Science. The runtime also includes a filesystem for artifact management between tool calls, scientific databases for knowledge retrieval, and a Python interpreter for custom computation. Agents interact with this environment the same way a research software stack works: outputs from one tool become inputs to the next.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SciAgentBench (the evaluation suite)&lt;/strong&gt; contains 259 tasks and 1,134 sub-questions built through a four-stage quality pipeline. The authors aggregated roughly 5,000 candidate tasks from existing benchmarks, filtered out any task where four frontier LLMs averaged above 50% accuracy (keeping only genuinely hard ones), executed each retained task inside SciAgentGym to verify it was actually solvable, and had domain experts validate that solutions genuinely require multi-step reasoning rather than direct recall.&lt;/p&gt;

&lt;p&gt;The task difficulty is stratified into three levels:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;L1&lt;/strong&gt; — up to 3 tool-call steps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;L2&lt;/strong&gt; — 4 to 7 steps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;L3&lt;/strong&gt; — 8 or more steps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Notably, 79% of the benchmark falls into L2 or L3. Short, easy tasks aren't the point.&lt;/p&gt;
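
&lt;p&gt;The stratification rule is simple enough to write down. The sketch below maps a tool-call step count to a level and computes level shares for a list of tasks. The function names and the sample step counts are hypothetical, and the code is not taken from the SciAgentGYM repository.&lt;/p&gt;

```python
from collections import Counter


def difficulty_level(tool_call_steps):
    """Map a task's tool-call step count to L1, L2, or L3."""
    if tool_call_steps >= 8:
        return "L3"
    if tool_call_steps >= 4:
        return "L2"
    return "L1"


def level_shares(step_counts):
    """Return the fraction of tasks that fall into each level."""
    counts = Counter(difficulty_level(steps) for steps in step_counts)
    total = len(step_counts)
    return {level: counts[level] / total for level in ("L1", "L2", "L3")}


# Hypothetical step counts for four tasks, for illustration only.
print(level_shares([1, 5, 9, 10]))
```

&lt;p&gt;Applied to the real task list, the L2 and L3 shares together would correspond to the 79% figure quoted above.&lt;/p&gt;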

&lt;p&gt;&lt;strong&gt;SciForge (the data synthesis method)&lt;/strong&gt; is a training approach that models the tool action space as a dependency graph and generates logic-aware training trajectories from it. It's described further below.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Domain Breakdown
&lt;/h2&gt;

&lt;p&gt;SciAgentBench's 259 tasks split across disciplines as follows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Domain&lt;/th&gt;
&lt;th&gt;Tasks&lt;/th&gt;
&lt;th&gt;Share&lt;/th&gt;
&lt;th&gt;Tool-Use Benefit (avg)&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Physics&lt;/td&gt;
&lt;td&gt;109&lt;/td&gt;
&lt;td&gt;42%&lt;/td&gt;
&lt;td&gt;+2.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chemistry&lt;/td&gt;
&lt;td&gt;81&lt;/td&gt;
&lt;td&gt;31%&lt;/td&gt;
&lt;td&gt;+7.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Materials Science&lt;/td&gt;
&lt;td&gt;37&lt;/td&gt;
&lt;td&gt;14%&lt;/td&gt;
&lt;td&gt;+3.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Life Sciences&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;12%&lt;/td&gt;
&lt;td&gt;+8.4% ← highest gain&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The "tool-use benefit" column is telling. In Physics, agents already carry strong parametric knowledge from training data, so tools add only 2.5 percentage points. In Chemistry and Life Sciences, where calculations are more procedural and outputs depend heavily on molecular data that can't be memorized, using the correct tools lifts performance by 7 to 8 percentage points. The pattern suggests the benchmark is sensitive to where tool use actually matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Finding: Long-Horizon Performance Collapse
&lt;/h2&gt;

&lt;p&gt;The most striking result in the paper is this: &lt;strong&gt;GPT-5 achieves a 60.6% success rate on L1 tasks but drops to 30.9% on L3 tasks&lt;/strong&gt; — nearly halving its performance as interaction horizons extend. The authors attribute this primarily to failures in multi-step workflow execution: errors in intermediate steps cascade, and the model fails to recover or retry correctly.&lt;/p&gt;

&lt;p&gt;The paper evaluated four frontier models — Claude-Sonnet-4.5, DeepSeek-R1, Qwen3-235B, and GPT-5 — and found the same sharp degradation pattern across all of them. No frontier model escaped the performance collapse on long-horizon tasks.&lt;/p&gt;

&lt;p&gt;There's a straightforward lesson here for developers building scientific agents: scores on short-horizon (L1) tasks don't predict performance on real workflows. A model that scores 60% on L1 may be averaging below 31% on the tasks your pipeline actually needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters: The Tool-Dependency Structure
&lt;/h2&gt;

&lt;p&gt;To understand what makes long tool chains hard, consider a Chemistry task that asks an agent to identify the most stable isomer of a given organic compound. The required tool chain looks something like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Parse the SMILES string&lt;/strong&gt; into an internal molecule object&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enumerate possible isomers&lt;/strong&gt; using the stereoisomer generator&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimize 3D geometry&lt;/strong&gt; for each candidate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run DFT calculations&lt;/strong&gt; on each optimized structure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extract total energies&lt;/strong&gt; from each DFT output&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compare&lt;/strong&gt; and return the minimum-energy isomer&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's a six-step chain, which by the step counts above already sits in L2, and repeating steps 3 to 5 for every candidate isomer pushes the number of tool calls toward L3. Any misordering (say, trying to run DFT before geometry optimization completes) produces a hard failure. Any incorrect tool selection (using a 2D descriptor calculator instead of the 3D optimizer) produces silent errors that propagate downstream.&lt;/p&gt;

&lt;p&gt;Effloow Lab reproduced this dependency structure in a minimal Python simulation (stdlib only, no API keys). Building a seven-node Chemistry tool graph with BFS traversal for transitive dependency resolution, the PoC confirmed that the L1/L2/L3 classification boundaries closely mirror real scientific workflow complexity. See &lt;code&gt;data/lab-runs/sciagentgym-scientific-tool-use-llm-benchmark-poc-2026.md&lt;/code&gt; for the full run log.&lt;/p&gt;
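&lt;p&gt;To make the idea concrete, here is a minimal sketch of BFS-based transitive dependency resolution over a tool graph. It is not the lab's PoC: the graph below is an assumption that mirrors the six stages listed earlier, and every tool name is hypothetical.&lt;/p&gt;

```python
from collections import deque

# Hypothetical tool names mirroring the six stages listed earlier.
# Each tool maps to the tools whose outputs it consumes directly.
DEPENDS_ON = {
    "parse_smiles": [],
    "enumerate_isomers": ["parse_smiles"],
    "optimize_geometry": ["enumerate_isomers"],
    "run_dft": ["optimize_geometry"],
    "extract_energy": ["run_dft"],
    "compare_energies": ["extract_energy"],
}


def transitive_dependencies(tool, graph=DEPENDS_ON):
    """Return every tool that must run before `tool`, found by BFS."""
    seen = set()
    queue = deque(graph[tool])
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(graph[current])
    return seen


def is_valid_order(calls, graph=DEPENDS_ON):
    """Check that each call appears only after all of its prerequisites."""
    done = set()
    for tool in calls:
        if not transitive_dependencies(tool, graph).issubset(done):
            return False
        done.add(tool)
    return True


print(sorted(transitive_dependencies("run_dft")))
print(is_valid_order(["parse_smiles", "enumerate_isomers", "run_dft"]))
```

&lt;p&gt;The second call returns &lt;code&gt;False&lt;/code&gt; because &lt;code&gt;run_dft&lt;/code&gt; is requested before &lt;code&gt;optimize_geometry&lt;/code&gt;, which is the misordering case described above.&lt;/p&gt;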

&lt;p&gt;The key structural insight the PoC reinforces: &lt;strong&gt;task complexity in scientific tool use isn't additive, it's multiplicative&lt;/strong&gt;. A six-step task isn't twice as hard as a three-step task — it's exponentially harder because each intermediate step's failure probability compounds.&lt;/p&gt;
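&lt;p&gt;The compounding effect is easy to see with a back-of-the-envelope model. The sketch below assumes every step succeeds independently with the same probability and that there are no retries; both are simplifying assumptions, and the 0.9 per-step figure is illustrative rather than a number from the paper.&lt;/p&gt;

```python
def chain_success_probability(per_step_success, steps):
    """Probability that every step succeeds, assuming independent steps and no retries."""
    return per_step_success ** steps


# Illustrative per-step success rate, not a measured value.
for steps in (3, 6, 8):
    print(steps, round(chain_success_probability(0.9, steps), 3))
```

&lt;p&gt;Under those assumptions, a 90% per-step success rate yields roughly 73% over three steps, 53% over six, and 43% over eight.&lt;/p&gt;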

&lt;h2&gt;
  
  
  SciForge: Teaching Smaller Models the Structure
&lt;/h2&gt;

&lt;p&gt;The most practically interesting finding in the paper is that you don't need a frontier-scale model to perform well on SciAgentBench. You need a model that has been trained to understand tool dependency structure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SciForge&lt;/strong&gt; achieves this by treating the tool action space as a directed acyclic graph. Instead of collecting training trajectories as flat sequences of tool calls, SciForge generates trajectories that respect and encode the dependency relationships between tools. The result is that fine-tuned models learn not just &lt;em&gt;which&lt;/em&gt; tools to call, but &lt;em&gt;in what order and why&lt;/em&gt;.&lt;/p&gt;
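&lt;p&gt;One way to picture dependency-respecting trajectories is to sample random topological orders of a tool DAG. The sketch below does that with Python's standard &lt;code&gt;graphlib&lt;/code&gt; module. It illustrates the general idea only and should not be read as SciForge's actual procedure; the DAG and tool names are hypothetical.&lt;/p&gt;

```python
import random
from graphlib import TopologicalSorter

# Hypothetical tool DAG: each tool maps to the set of tools it depends on.
TOOL_DAG = {
    "parse_smiles": set(),
    "lookup_reference_data": set(),
    "optimize_geometry": {"parse_smiles"},
    "run_dft": {"optimize_geometry"},
    "compare_with_reference": {"run_dft", "lookup_reference_data"},
}


def sample_trajectory(dag, rng):
    """Sample one tool-call order that never violates a dependency edge."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    pool = []  # tools whose prerequisites have all been called already
    trajectory = []
    while sorter.is_active():
        pool.extend(sorter.get_ready())
        pool.sort()  # stable base order so a seeded rng is reproducible
        choice = pool.pop(rng.randrange(len(pool)))
        trajectory.append(choice)
        sorter.done(choice)
    return trajectory


rng = random.Random(0)
for _ in range(3):
    print(sample_trajectory(TOOL_DAG, rng))
```

&lt;p&gt;Every sampled sequence is a valid execution order, so a model trained on such sequences never sees a tool called before its prerequisites.&lt;/p&gt;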

&lt;p&gt;The numbers make the point: fine-tuning an 8B model on SciForge-generated trajectories produces &lt;strong&gt;SciAgent-8B&lt;/strong&gt;, which:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Achieves a +6.7% improvement over its base model's score&lt;/li&gt;
&lt;li&gt;Outperforms Qwen3-VL-235B-Instruct, a model roughly 29x larger&lt;/li&gt;
&lt;li&gt;Shows positive &lt;strong&gt;cross-domain transfer&lt;/strong&gt;: gains in Chemistry generalize to Physics and Materials Science tasks without domain-specific fine-tuning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SciAgent-4B (the smaller variant) achieves +5.5%, also competitive with models many times its size.&lt;/p&gt;

&lt;p&gt;This isn't a fluke of scale. The paper's interpretation is that scientific tool-use capability is &lt;strong&gt;learnable and transferable&lt;/strong&gt; as a structural skill, independent of raw domain knowledge. A model trained to reason about tool dependencies in one scientific domain can apply that structural reasoning in another.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Takeaway&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Scale does not solve multi-step scientific tool use. Dependency-aware training does. An 8B model fine-tuned on SciForge trajectories beats a 235B model on the same benchmark — not because it knows more chemistry, but because it understands how tools chain together.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Compares to Existing Scientific Benchmarks
&lt;/h2&gt;

&lt;p&gt;SciAgentBench isn't the first attempt to evaluate LLMs on scientific tasks. But it occupies a distinct niche:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ScienceAgentBench&lt;/strong&gt; (OSU NLP, ICLR 2025) focuses on data-driven scientific discovery workflows — primarily Python-based analysis pipelines. It's strong on computational workflows but lighter on the domain-specific tool ecosystems that characterize wet-lab and simulation-heavy science.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FrontierMath and GPQA&lt;/strong&gt; evaluate scientific &lt;em&gt;knowledge&lt;/em&gt; through question answering. No tool interaction is required or measured.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SciAgentGYM's differentiation&lt;/strong&gt; is the combination of: (1) interactive, closed-loop tool execution — not just producing code, but running it and observing outputs — and (2) 1,780 domain-specific tools that model the actual software stacks scientists use, rather than a generic Python environment.&lt;/p&gt;

&lt;p&gt;The closest architectural comparison is to SWE-bench for software engineering: both run agents inside real execution environments, evaluate based on outcome not output text, and reward correct multi-step planning over single-shot reasoning.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Developers Should Take Away
&lt;/h2&gt;

&lt;p&gt;If you're building a scientific agent or workflow — drug discovery pipelines, materials screening, biological pathway analysis — several things follow directly from this benchmark:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don't evaluate with L1-equivalent tasks.&lt;/strong&gt; A success rate of 60% on short tasks of up to three steps is a ceiling, not a floor. Measure the workflows your production system actually runs: if they have 6+ interdependent tool calls, test them explicitly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dependency order matters as much as tool selection.&lt;/strong&gt; Most agent frameworks (LangGraph, AutoGen, OpenAI Agents SDK, PydanticAI) can invoke tools in the right sequence if instructed correctly — but this requires that the model actually understands which tool outputs are prerequisites for which tool inputs. System prompt engineering alone isn't sufficient for complex dependency chains.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fine-tuning on structured trajectories is underexplored.&lt;/strong&gt; The SciForge result suggests that tool-sequencing is a teachable skill. If you're building domain-specific agents at scale, generating dependency-graph-aware training data and fine-tuning a smaller model may produce more reliable workflows than prompting a frontier model with instructions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Track intermediate failures, not just terminal outcomes.&lt;/strong&gt; The paper's finding that cascading step failures cause the L1→L3 drop means that coarse-grained end-task metrics hide where your agent actually breaks. Instrument each tool call separately.&lt;/p&gt;
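&lt;p&gt;A minimal sketch of per-step instrumentation follows. It assumes each tool is a plain Python callable that takes the previous step's output; the runner and the three toy tools are hypothetical and are not tied to any particular agent framework.&lt;/p&gt;

```python
import time


def run_instrumented(steps, initial_input):
    """Run (name, tool) pairs in order and record one entry per tool call."""
    records = []
    value = initial_input
    for name, tool in steps:
        started = time.perf_counter()
        try:
            value = tool(value)
            error = None
        except Exception as exc:  # record the failure instead of hiding it
            value = None
            error = repr(exc)
        records.append({
            "step": name,
            "ok": error is None,
            "error": error,
            "seconds": time.perf_counter() - started,
        })
        if error is not None:
            break  # later steps would only consume a missing input
    return value, records


# Hypothetical three-step chain whose second tool fails on bad input.
def parse(text):
    return text.split(",")


def to_numbers(items):
    return [float(item) for item in items]


def total(numbers):
    return sum(numbers)


result, records = run_instrumented(
    [("parse", parse), ("to_numbers", to_numbers), ("total", total)],
    "1.5,oops,3",
)
first_failure = next((r["step"] for r in records if not r["ok"]), None)
print(first_failure)
```

&lt;p&gt;An end-task metric would only report that this run failed. The per-step records show that it broke at &lt;code&gt;to_numbers&lt;/code&gt; and never reached &lt;code&gt;total&lt;/code&gt;.&lt;/p&gt;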

&lt;h2&gt;
  
  
  Getting Started with SciAgentGYM
&lt;/h2&gt;

&lt;p&gt;The benchmark environment is open source at &lt;a href="https://github.com/CMarsRover/SciAgentGYM" rel="noopener noreferrer"&gt;github.com/CMarsRover/SciAgentGYM&lt;/a&gt;. The repository includes the full tool suite, the benchmark task set, and evaluation harness.&lt;/p&gt;

&lt;p&gt;To run your own model against SciAgentBench, the general setup involves:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/CMarsRover/SciAgentGYM
&lt;span class="nb"&gt;cd &lt;/span&gt;SciAgentGYM
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The benchmark requires domain-specific Python packages (RDKit for Chemistry, PySCF or equivalent for Physics, pymatgen for Materials Science) alongside an LLM API key. The README documents which tools map to which packages. Running a full evaluation sweep across all 259 tasks against a frontier model incurs real API costs — the paper's evaluation used GPT-5, Claude-Sonnet-4.5, DeepSeek-R1, and Qwen3-235B.&lt;/p&gt;

&lt;p&gt;For development and debugging, the SciAgentBench tasks include L1 subsets that run on shorter tool chains — a reasonable starting point before scaling to full L2/L3 evaluation.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Q: Is SciAgentGYM only relevant for actual science applications?
&lt;/h3&gt;

&lt;p&gt;No. The benchmark is a proxy for any workflow where tool calls have strict dependency ordering and intermediate outputs are consumed by downstream steps. Financial modeling pipelines, data engineering workflows, and complex DevOps automation all exhibit the same structural challenge that makes L3 science tasks hard.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: How does SciForge compare to standard instruction fine-tuning?
&lt;/h3&gt;

&lt;p&gt;Standard instruction fine-tuning teaches a model "here's a task, here's the output." SciForge fine-tuning teaches a model "here's the tool dependency graph, here's how trajectories should flow through it." The dependency-aware approach produces significantly better performance on long-horizon tasks because the model learns causal ordering, not just output format.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Which model performed best overall on SciAgentBench?
&lt;/h3&gt;

&lt;p&gt;The paper evaluated GPT-5, Claude-Sonnet-4.5, DeepSeek-R1, and Qwen3-235B. Among frontier models, GPT-5 achieved a 60.6% success rate on L1 tasks — but even that best-in-class performance fell to 30.9% on L3. SciAgent-8B (fine-tuned via SciForge) showed notably better long-horizon resilience than the frontier models in the paper's comparisons.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Can I add my own tools to the environment?
&lt;/h3&gt;

&lt;p&gt;Yes. SciAgentGYM's design allows domain-specific tool registration. The evaluation infrastructure routes tool calls through a standardized interface, so new tools that follow the input/output schema can be added without modifying the core framework.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Is 259 tasks enough to be statistically meaningful?
&lt;/h3&gt;

&lt;p&gt;For tool-use benchmarks that require closed-loop execution, 259 tasks is actually substantial — each task requires multiple execution steps and domain-expert validation. SWE-bench Verified (the gold standard for coding agents) has 500 tasks; SciAgentBench's 259 tasks with 1,134 sub-questions provide granular scoring at the sub-question level that single-outcome benchmarks don't.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;SciAgentGYM (arXiv:2602.12984) is the first benchmark to evaluate LLMs on multi-step scientific tool-use through closed-loop interaction, using 1,780 real domain-specific tools across Physics, Chemistry, Materials Science, and Life Sciences.&lt;/li&gt;
&lt;li&gt;Even GPT-5 drops from 60.6% on simple tasks (L1) to 30.9% on long-horizon tasks (L3) — a degradation pattern shared by all tested frontier models.&lt;/li&gt;
&lt;li&gt;Tool use benefits Chemistry (+7.0%) and Life Sciences (+8.4%) more than Physics (+2.5%), reflecting where parametric knowledge falls short.&lt;/li&gt;
&lt;li&gt;SciForge, a dependency-graph-based data synthesis method, enables an 8B fine-tuned model (SciAgent-8B) to outperform the far larger Qwen3-VL-235B-Instruct, with +6.7% improvement and cross-domain transfer.&lt;/li&gt;
&lt;li&gt;For developers: measure tool-call success at each intermediate step, not just end-task outcomes; fine-tuning on dependency-structured trajectories is an underused lever for scientific agents.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The benchmark and environment are open at &lt;a href="https://github.com/CMarsRover/SciAgentGYM" rel="noopener noreferrer"&gt;github.com/CMarsRover/SciAgentGYM&lt;/a&gt;. If your agent needs to navigate a real scientific tool chain, this is the evaluation suite to run it against before claiming production readiness.&lt;/p&gt;

</description>
      <category>llmagents</category>
      <category>benchmarks</category>
      <category>scientificai</category>
      <category>tooluse</category>
    </item>
    <item>
      <title>LangGraph Platform GA: Studio v2, One-Click Deploy Guide</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Tue, 02 Jun 2026 12:11:20 +0000</pubDate>
      <link>https://dev.to/jangwook_kim_e31e7291ad98/langgraph-platform-ga-studio-v2-one-click-deploy-guide-4m10</link>
      <guid>https://dev.to/jangwook_kim_e31e7291ad98/langgraph-platform-ga-studio-v2-one-click-deploy-guide-4m10</guid>
      <description>&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;Running a LangGraph agent on a development laptop is one thing. Getting it into production, with persistent state, human-in-the-loop gates, reliable retries, and a debugger that does not require a macOS desktop app, is a different problem entirely.&lt;/p&gt;

&lt;p&gt;That problem got a cleaner answer on May 14, 2026, when LangChain announced that LangGraph Platform had reached General Availability. The announcement came alongside Studio v2, a browser-based visual debugger that replaces the earlier desktop application. Nearly 400 companies had been running the platform during the beta period, including Klarna, Uber, and LinkedIn.&lt;/p&gt;

&lt;p&gt;The timing also matters because the competitive landscape for agent infrastructure shifted in early 2026. Microsoft moved its AutoGen project into maintenance mode, redirecting investment toward the Microsoft Agent Framework. That left LangGraph and CrewAI as the two active frameworks with genuine production traction. LangGraph's stated differentiator is durable execution: graph-based state control, automatic checkpointing, and a managed runtime that handles the infrastructure layer so the agent code does not have to.&lt;/p&gt;

&lt;p&gt;This guide covers what the platform is, what Studio v2 adds, how the deployment model works, and where it fits relative to the alternatives.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is LangGraph Platform?
&lt;/h2&gt;

&lt;p&gt;It helps to separate two things that share the name "LangGraph":&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The open-source library&lt;/strong&gt; is an MIT-licensed Python framework for building stateful, cyclical agent workflows as explicit directed graphs. It reached its 1.0 stable release in October 2025, which included an API stability guarantee — no breaking changes until a 2.0 release. This is the library developers install via &lt;code&gt;pip install langgraph&lt;/code&gt;. It is free and has no usage caps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LangGraph Platform&lt;/strong&gt; (also referred to as "LangSmith Deployment" in LangChain's documentation after an October 2025 rebrand) is the managed infrastructure layer that sits on top of that library. It handles deployment, autoscaling, persistence, task queuing, and observability. It is what you pay for if you want LangGraph agents running in production without managing your own infrastructure.&lt;/p&gt;

&lt;p&gt;The naming situation is genuinely confusing. After the 1.0 release, LangChain unified three product pillars under the LangSmith brand (Observability, Evaluation, and Deployment) and renamed LangGraph Platform to "LangSmith Deployment." However, the May 2026 GA announcement still used the "LangGraph Platform" name in the blog URL and official changelog. Both names appear in active documentation as of mid-2026. The safest mental model: "LangGraph" on its own is the open-source framework; LangGraph Platform / LangSmith Deployment is the paid hosting layer.&lt;/p&gt;

&lt;p&gt;The platform adds four capabilities that the open-source library does not include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Managed persistence&lt;/strong&gt;: conversations, thread history, and state are saved automatically. No custom database logic required.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Durable execution&lt;/strong&gt;: if a server restarts mid-workflow, the agent resumes from the last checkpoint.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built-in task queuing&lt;/strong&gt;: background runs, cron scheduling, and webhooks are first-class platform primitives.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production autoscaling&lt;/strong&gt;: containers scale based on CPU utilization and pending run queue depth.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Studio v2: Browser-Based Visual Debugging
&lt;/h2&gt;

&lt;p&gt;The most visible change in the May 2026 announcement is Studio v2. The prior version required a macOS desktop application. Studio v2 runs in the browser.&lt;/p&gt;

&lt;p&gt;You start a local Studio session with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;langgraph dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



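&lt;p&gt;That command starts a local server (by default at &lt;code&gt;http://127.0.0.1:2024&lt;/code&gt;; port 8123 is the default for the Docker-based &lt;code&gt;langgraph up&lt;/code&gt;) and opens Studio v2 in the browser, connected to that server. No desktop installation required.&lt;/p&gt;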

&lt;h3&gt;
  
  
  What Studio v2 Shows You
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Graph rendering.&lt;/strong&gt; Studio v2 renders your agent's execution graph visually — each node in the LangGraph definition appears as a node in the UI, with edges showing the conditional routing between them. As the agent runs, nodes highlight as they execute.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-node state inspection.&lt;/strong&gt; At every node in the graph, you can inspect the full state object at that point in execution. This means you can see exactly what data the LLM received, what the tool returned, and what the state looked like when the routing decision was made.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time-travel debugging.&lt;/strong&gt; LangGraph's checkpoint system saves state at each node boundary. Studio v2 exposes those checkpoints as a timeline you can navigate. If an agent produces a wrong output at step seven, you rewind to step six, change an input or configuration value, and re-run from that point — without restarting the full workflow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production trace replay.&lt;/strong&gt; This is the practical daily-use feature. You can pull a production trace from LangSmith — a real user interaction that failed or produced unexpected results — and replay it locally in Studio v2. You then edit the prompt or configuration and replay again, all without touching production code or triggering a redeploy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Playground integration.&lt;/strong&gt; Individual LLM calls within a trace can be opened directly in the LangSmith Playground. This means you can isolate a single prompt, experiment with model parameters, and test revisions before changing anything in the graph code.&lt;/p&gt;
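&lt;p&gt;The time-travel idea described above can be shown with a toy model. This is a plain-Python sketch of checkpoint, rewind, and replay, not LangGraph's checkpointer API; all function and node names are hypothetical.&lt;/p&gt;

```python
import copy


def run_with_checkpoints(nodes, state, start=0, history=None):
    """Run nodes[start:] in order, saving a snapshot after each node."""
    history = list(history or [])
    for name, node in nodes[start:]:
        state = node(copy.deepcopy(state))
        history.append({"after": name, "state": copy.deepcopy(state)})
    return state, history


def replay_from(nodes, history, step, patch):
    """Rewind to the snapshot after node number `step` (1-based), patch it, re-run the rest."""
    rewound = copy.deepcopy(history[step - 1]["state"])
    rewound.update(patch)
    return run_with_checkpoints(nodes, rewound, start=step, history=history[:step])


# Hypothetical three-node workflow.
def load(state):
    state["values"] = [1, 2, 3]
    return state


def scale(state):
    state["scaled"] = [v * state["factor"] for v in state["values"]]
    return state


def total(state):
    state["total"] = sum(state["scaled"])
    return state


NODES = [("load", load), ("scale", scale), ("total", total)]
final, history = run_with_checkpoints(NODES, {"factor": 2})
patched, patched_history = replay_from(NODES, history, step=1, patch={"factor": 10})
print(final["total"], patched["total"])
```

&lt;p&gt;The first run totals 12. Rewinding to the snapshot after &lt;code&gt;load&lt;/code&gt;, changing &lt;code&gt;factor&lt;/code&gt; to 10, and re-running only the remaining two nodes totals 60, without executing &lt;code&gt;load&lt;/code&gt; again.&lt;/p&gt;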

&lt;h3&gt;
  
  
  What This Workflow Replaces
&lt;/h3&gt;

&lt;p&gt;Before Studio v2, the common debugging loop looked like:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Agent fails in production.&lt;/li&gt;
&lt;li&gt;Developer reads LangSmith traces in the text-based trace viewer.&lt;/li&gt;
&lt;li&gt;Adds print statements or additional logging to graph nodes.&lt;/li&gt;
&lt;li&gt;Redeploys.&lt;/li&gt;
&lt;li&gt;Triggers the same scenario again.&lt;/li&gt;
&lt;li&gt;Reads updated logs.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Studio v2 short-circuits steps 3 through 6. The state is already captured at every node. The trace is already stored. The developer pulls it into the browser and steps through it directly.&lt;/p&gt;




&lt;h2&gt;
  
  
  One-Click Deploy and Production Runtime
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Deploying an Agent
&lt;/h3&gt;

&lt;p&gt;From the management console, deploying a LangGraph agent to the managed cloud is a single action with native GitHub integration. The equivalent CLI path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install the LangGraph CLI&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"langgraph-cli[inmem]"&lt;/span&gt;

&lt;span class="c"&gt;# Create a new project from a template&lt;/span&gt;
langgraph new my-agent &lt;span class="nt"&gt;--template&lt;/span&gt; react-agent-python

&lt;span class="c"&gt;# Deploy to LangGraph Platform&lt;/span&gt;
langgraph deploy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;langgraph deploy&lt;/code&gt; command packages the agent, pushes it to the managed runtime, and handles the rest. For local development, &lt;code&gt;langgraph dev&lt;/code&gt; runs a local server that connects to Studio v2.&lt;/p&gt;

&lt;h3&gt;
  
  
  Autoscaling
&lt;/h3&gt;

&lt;p&gt;The platform scales containers based on two signals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CPU utilization&lt;/strong&gt;: target threshold of 75%. When CPU crosses that, a new container spins up.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pending run queue depth&lt;/strong&gt;: target of 10 pending runs per container. One container with 20 queued runs triggers a scale-up to two containers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;API servers and agent servers scale independently. A spike in run submission requests — which hits the API server — does not slow down ongoing agent runs on the agent servers.&lt;/p&gt;

&lt;p&gt;Scale-down has a 30-minute delay. After the delay, metrics are recomputed before a container is removed. This prevents thrashing during workloads with short bursts.&lt;/p&gt;
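&lt;p&gt;The two scale-up signals can be sketched as a small calculation. The targets (75% CPU, 10 pending runs per container) come from the description above. The proportional CPU rule is borrowed from the Kubernetes Horizontal Pod Autoscaler formula, and taking the larger of the two signals is also an assumption; neither is a documented detail of the platform's scaler. The 30-minute scale-down delay is not modeled.&lt;/p&gt;

```python
import math

CPU_TARGET_PERCENT = 75          # target from the text above
PENDING_RUNS_PER_CONTAINER = 10  # target from the text above


def containers_for_queue(pending_runs):
    """Containers needed to keep the queue at the per-container target."""
    return max(1, math.ceil(pending_runs / PENDING_RUNS_PER_CONTAINER))


def desired_containers(current, avg_cpu_percent, pending_runs):
    """Estimate the scale-up target from both signals.

    Assumption: CPU follows the Kubernetes HPA rule
    ceil(current * observed / target), and the larger signal wins.
    """
    by_cpu = math.ceil(current * avg_cpu_percent / CPU_TARGET_PERCENT)
    return max(1, by_cpu, containers_for_queue(pending_runs))


print(desired_containers(current=1, avg_cpu_percent=40, pending_runs=20))
print(desired_containers(current=2, avg_cpu_percent=90, pending_runs=5))
```

&lt;p&gt;The first call reproduces the example above: one container with 20 queued runs scales to two. The second shows the CPU signal dominating, with two containers at 90% CPU scaling to three.&lt;/p&gt;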

&lt;h3&gt;
  
  
  Background Runs, Cron, and Webhooks
&lt;/h3&gt;

&lt;p&gt;LangGraph Server exposes native primitives for async execution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
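&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph_sdk&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_client&lt;/span&gt;

&lt;span class="c1"&gt;# Assumes a LangGraph Server at this URL (adjust for your deployment); run the awaits below inside an async function
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:2024&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Submit a background run (non-blocking)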
&lt;/span&gt;&lt;span class="n"&gt;thread&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threads&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;runs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;thread_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thread_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;assistant_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Analyze this dataset&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]},&lt;/span&gt;
    &lt;span class="n"&gt;multitask_strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enqueue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Schedule a recurring run with cron
&lt;/span&gt;&lt;span class="n"&gt;cron&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;crons&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;assistant_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;schedule&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0 9 * * 1-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Weekdays at 09:00
&lt;/span&gt;    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Daily market summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Webhooks allow external systems to trigger agent runs on events. Combined with the persistence layer, this makes it practical to build agents that handle long-running tasks — research workflows that run for hours, document processing pipelines that wait on human approval, or scheduled reporting agents that fire on a timer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Durable Execution and Human-in-the-Loop
&lt;/h3&gt;

&lt;p&gt;If a worker restarts mid-execution, the agent resumes from the last checkpoint. This is handled by the platform's persistence layer, which in production Kubernetes deployments stores checkpoints in PostgreSQL, with Redis used alongside it for streaming and run coordination.&lt;/p&gt;

&lt;p&gt;Human-in-the-loop is a first-class API primitive. An agent can pause at a node, surface its current state for human review, and resume when approved — without polling, timeouts, or custom callback infrastructure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.graph&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;
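&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.checkpoint.memory&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MemorySaver&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Command&lt;/span&gt;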

&lt;span class="c1"&gt;# The interrupt_before parameter pauses execution before the specified node
&lt;/span&gt;&lt;span class="n"&gt;graph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;checkpointer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;MemorySaver&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;interrupt_before&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;human_review&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Resume after human approval
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ainvoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nc"&gt;Command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resume&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;approved&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;configurable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thread_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;thread_id&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Real-Time Streaming
&lt;/h3&gt;

&lt;p&gt;The platform streams LLM tokens, tool calls, state updates, and node transitions as they happen. For interactive applications, this means users see partial responses as the agent works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Assumes client = get_client() (from langgraph_sdk) and thread = await client.threads.create()
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;runs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;thread_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thread_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;assistant_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What happened in the market today?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]},&lt;/span&gt;
    &lt;span class="n"&gt;stream_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;updates&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  LangGraph Platform vs. Alternatives
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;LangGraph Platform (Managed)&lt;/th&gt;
&lt;th&gt;Self-Hosted LangGraph&lt;/th&gt;
&lt;th&gt;Temporal Cloud&lt;/th&gt;
&lt;th&gt;Inngest Pro&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary use&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Stateful AI agent deployment&lt;/td&gt;
&lt;td&gt;AI agent development&lt;/td&gt;
&lt;td&gt;Durable workflow orchestration&lt;/td&gt;
&lt;td&gt;Event-driven durable workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Open-source core&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MIT (library free)&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;td&gt;Proprietary cloud&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Managed hosting&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (Plus/Enterprise)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Free tier&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;100K nodes/month (self-hosted)&lt;/td&gt;
&lt;td&gt;Unlimited (self-hosted)&lt;/td&gt;
&lt;td&gt;Dev tier (limits apply)&lt;/td&gt;
&lt;td&gt;100K executions/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Paid entry&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~$39/user/month (LangSmith Plus) + compute&lt;/td&gt;
&lt;td&gt;Infrastructure cost only&lt;/td&gt;
&lt;td&gt;$200/month (Growth)&lt;/td&gt;
&lt;td&gt;$75/month (Pro)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Graph-based agent control&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Browser visual debugger&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Studio v2&lt;/td&gt;
&lt;td&gt;Studio v2 (local)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Checkpoint/time-travel&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;td&gt;Durable execution (different model)&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Survives server restart&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (platform-managed)&lt;/td&gt;
&lt;td&gt;Requires external checkpointer&lt;/td&gt;
&lt;td&gt;Yes (core feature)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Human-in-the-loop&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;First-class API&lt;/td&gt;
&lt;td&gt;First-class API&lt;/td&gt;
&lt;td&gt;Via signals/queries&lt;/td&gt;
&lt;td&gt;Via pause/resume steps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Production autoscaling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;td&gt;Manual (Kubernetes)&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM-specific tooling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deep (LangSmith tracing)&lt;/td&gt;
&lt;td&gt;Via LangSmith&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Teams deploying LangGraph agents to prod&lt;/td&gt;
&lt;td&gt;Local dev and research&lt;/td&gt;
&lt;td&gt;Long-running infra-level workflows&lt;/td&gt;
&lt;td&gt;Engineering-managed event pipelines&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A note on Temporal specifically: it is often positioned as a direct competitor to LangGraph Platform, but the relationship is more nuanced. Temporal handles durable orchestration at the infrastructure layer — it is good at keeping a workflow alive for days or weeks, surviving server restarts and worker rollouts. LangGraph handles agent reasoning at the application layer — cyclical tool use, dynamic routing, state accumulation across turns.&lt;/p&gt;

&lt;p&gt;A pattern that appears in production stacks is using both: a Temporal workflow activity spins up a LangGraph agent as a subtask. Temporal owns the macro lifecycle; LangGraph owns the agent control flow within each task.&lt;/p&gt;

&lt;p&gt;The key practical difference: LangGraph's checkpointed state persists within a single deployment, while Temporal's state survives across worker rollouts and infrastructure events. If your agents run for minutes, LangGraph Platform's checkpointing is sufficient. If they run for hours or days across infrastructure changes, Temporal (or a hybrid of the two) is worth evaluating.&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting Started: Free Tier
&lt;/h2&gt;

&lt;p&gt;There are two free ways to use LangGraph: the open-source library and the self-hosted Developer plan.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Open-source library (no account required):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;langgraph langchain-anthropic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You get the full framework: stateful graphs, built-in checkpointing, human-in-the-loop, streaming, and LangGraph Studio v2 locally via &lt;code&gt;langgraph dev&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Developer plan (free, self-hosted):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Up to 100,000 node executions per month&lt;/li&gt;
&lt;li&gt;One free Developer deployment included&lt;/li&gt;
&lt;li&gt;Requires a LangSmith account (free tier available)&lt;/li&gt;
&lt;li&gt;Self-hosted: you manage the infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The managed cloud (where LangGraph Platform handles scaling, persistence, and infrastructure) requires the Plus plan. Plus requires a LangSmith Plus subscription, priced at $39 per user per month. Compute costs on Plus are billed per node executed ($0.001/node) plus standby time. Enterprise pricing is custom.&lt;/p&gt;
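&lt;p&gt;As a rough budgeting aid, the arithmetic above can be sketched in a few lines of Python. This is only an illustration built from the two figures quoted in this article ($39 per user per month and $0.001 per node). The helper name &lt;code&gt;estimate_monthly_cost&lt;/code&gt; is hypothetical, and standby time is left as a number you supply yourself because no rate for it is quoted here. It also assumes that one pass through a 6-node graph counts as 6 node executions, which you should confirm against the official pricing page.&lt;/p&gt;

```python
# Rough monthly cost sketch for the Plus plan, using only figures quoted in this article.
# Standby-time charges are a caller-supplied number because no rate is quoted here.

SEAT_PRICE_USD = 39.00   # LangSmith Plus, per user per month (from the text)
NODE_PRICE_USD = 0.001   # per node executed (from the text)


def estimate_monthly_cost(users, node_executions, standby_usd=0.0):
    """Return an estimated monthly bill in USD (hypothetical helper)."""
    seats = users * SEAT_PRICE_USD
    compute = node_executions * NODE_PRICE_USD
    return round(seats + compute + standby_usd, 2)


# Example: 3 seats, a graph that executes 6 nodes per run, 50,000 runs per month.
# Assumption: each node visited in a run is billed as one node execution.
nodes = 6 * 50_000
print(estimate_monthly_cost(users=3, node_executions=nodes))
```

&lt;p&gt;The point of the sketch is that the per-node term can dominate quickly: at the same price, doubling the number of nodes a run passes through doubles the compute portion of the bill, independent of seat count.&lt;/p&gt;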

&lt;p&gt;Note: third-party pricing summaries vary and some figures in secondary sources may reflect pre-rename billing units. For current pricing, the authoritative source is &lt;a href="https://www.langchain.com/pricing" rel="noopener noreferrer"&gt;langchain.com/pricing&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To try Studio v2 locally with the free tier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install CLI&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"langgraph-cli[inmem]"&lt;/span&gt;

&lt;span class="c"&gt;# Create a project&lt;/span&gt;
langgraph new my-first-agent &lt;span class="nt"&gt;--template&lt;/span&gt; react-agent-python
&lt;span class="nb"&gt;cd &lt;/span&gt;my-first-agent

&lt;span class="c"&gt;# Start local server with Studio v2&lt;/span&gt;
langgraph dev
&lt;span class="c"&gt;# Serves the API at http://127.0.0.1:2024 by default and opens Studio in your browser&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From there you can build a graph, run it, and step through execution in the Studio v2 interface without any cloud account.&lt;/p&gt;




&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Is LangGraph Platform the same as LangSmith Deployment?
&lt;/h3&gt;

&lt;p&gt;Functionally, yes. In October 2025, LangChain rebranded the managed infrastructure product from "LangGraph Platform" to "LangSmith Deployment" as part of unifying three pillars under LangSmith (Observability, Evaluation, and Deployment). However, the May 2026 GA announcement retained the "LangGraph Platform" name in official blog URLs and the changelog, so both names appear in active documentation. For practical purposes, they refer to the same managed hosting product.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do I need LangSmith to use LangGraph?
&lt;/h3&gt;

&lt;p&gt;No. The open-source LangGraph library works without LangSmith. LangSmith is LangChain's observability and evaluation platform — it provides tracing, the Studio v2 debugger at scale, and the managed deployment product. If you are self-hosting and want tracing, LangSmith has a free tier. If you want the managed cloud runtime, you need a LangSmith Plus or Enterprise account.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does LangGraph's checkpointing compare to Temporal's durable execution?
&lt;/h3&gt;

&lt;p&gt;LangGraph checkpointers save state at each node boundary within a deployment. If the agent server restarts, the agent resumes from the last checkpoint. Temporal's durability model survives across worker rollouts and infrastructure changes — state persists even if the entire worker pool is replaced. For agents that run for minutes to an hour, LangGraph Platform's built-in checkpointing is sufficient. For workflows that run for hours or days across infrastructure events, Temporal offers stronger durability guarantees. Many production teams use both together.&lt;/p&gt;
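&lt;p&gt;To make "save state at each node boundary" concrete, here is a toy sketch in plain Python. It does not use LangGraph's API: the node functions, the &lt;code&gt;run&lt;/code&gt; helper, and the JSON checkpoint file are all hypothetical stand-ins chosen for illustration. It only shows the general idea of persisting state after every step and skipping completed steps on restart.&lt;/p&gt;

```python
import json
import os
import tempfile

# Toy illustration only: NOT LangGraph's API. Each "node" is a plain function,
# and the checkpoint is a JSON file rewritten after every node completes.

def fetch(state):
    return {**state, "docs": ["a", "b"]}

def summarize(state):
    return {**state, "summary": f"{len(state['docs'])} docs"}

NODES = [("fetch", fetch), ("summarize", summarize)]


def run(checkpoint_path, crash_after=None):
    """Run NODES in order, resuming from checkpoint_path if it exists."""
    state, done = {}, []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            saved = json.load(f)
        state, done = saved["state"], saved["done"]
    for name, node in NODES:
        if name in done:
            continue  # finished before the restart, so skip it
        state = node(state)
        done.append(name)
        with open(checkpoint_path, "w") as f:
            json.dump({"state": state, "done": done}, f)  # node-boundary checkpoint
        if name == crash_after:
            raise RuntimeError("simulated server restart")
    return state


path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
try:
    run(path, crash_after="fetch")
except RuntimeError:
    pass  # the process "died" after the first node finished
final_state = run(path)  # resumes at "summarize" using the saved state
print(final_state)
```

&lt;p&gt;In this sketch the recovery guarantee is only as strong as the place the checkpoint lives. A file on the same host, like an in-memory store, disappears with that host, which is the gap the comparison above is pointing at when it contrasts deployment-scoped checkpoints with Temporal's infrastructure-level durability.&lt;/p&gt;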

&lt;h3&gt;
  
  
  What happened to LangGraph Studio v1 (the desktop app)?
&lt;/h3&gt;

&lt;p&gt;Studio v1 required a macOS desktop application. Studio v2 is entirely browser-based — access it by running &lt;code&gt;langgraph dev&lt;/code&gt; and navigating to the local URL it prints. The desktop app is no longer the recommended path. Some third-party guides still reference the desktop app; those reflect the pre-v2 setup.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is the &lt;code&gt;langgraph.prebuilt&lt;/code&gt; module still available in LangGraph 1.0?
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;langgraph.prebuilt&lt;/code&gt; module was deprecated as of LangGraph 1.0 (October 2025). Its functionality moved to &lt;code&gt;langchain.agents&lt;/code&gt;. If your code imports from &lt;code&gt;langgraph.prebuilt&lt;/code&gt;, migration involves updating those imports. The 1.0 release carried a no-breaking-changes guarantee for the core API, but this deprecation is the notable exception to account for.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LangGraph Platform reached GA on May 14, 2026&lt;/strong&gt;, after nearly 400 companies used it in beta. Klarna, Uber, and LinkedIn are among the referenced enterprise users.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Studio v2 eliminates the desktop app.&lt;/strong&gt; The browser-based debugger lets you pull production traces, step through per-node state, replay checkpoints, and edit prompts — without a redeploy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The free tier covers serious development.&lt;/strong&gt; The open-source library and self-hosted Developer plan (100K nodes/month) give you the full framework, Studio v2 locally, and LangSmith's free observability tier. Managed cloud requires Plus or Enterprise.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangGraph and Temporal solve different layers.&lt;/strong&gt; LangGraph handles agent reasoning and control flow; Temporal handles durable macro-level orchestration. They are complementary in production stacks, not direct substitutes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The naming is confusing but stabilizing.&lt;/strong&gt; "LangGraph Platform" and "LangSmith Deployment" refer to the same managed product post-October 2025 rebrand. The open-source framework remains "LangGraph."&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Verdict: Worth evaluating if you are already using LangGraph in development.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Studio v2's production trace replay and time-travel debugging address a real gap in the agent debugging workflow. The one-click deploy and managed autoscaling lower the barrier to getting LangGraph agents into production without Kubernetes expertise. The free tier is genuinely useful — not a trial with a short clock.&lt;/p&gt;

&lt;p&gt;The main friction point is pricing complexity: per-node billing requires understanding what a "node" means in your specific graph, and third-party pricing summaries conflict enough that you should verify figures directly at langchain.com/pricing before budgeting. For teams that need stronger durability guarantees than LangGraph Platform's checkpointing provides, Temporal remains the cleaner infrastructure-layer choice — but the two can work together.&lt;/p&gt;

</description>
      <category>langgraph</category>
      <category>agentdeployment</category>
      <category>aiframeworks</category>
      <category>developertools</category>
    </item>
    <item>
      <title>TypeScript Zod v4 + Claude API: A Complete Guide to Type-Safe LLM Response Parsing</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Tue, 02 Jun 2026 06:44:59 +0000</pubDate>
      <link>https://dev.to/jangwook_kim_e31e7291ad98/typescript-zod-v4-claude-api-a-complete-guide-to-type-safe-llm-response-parsing-6gb</link>
      <guid>https://dev.to/jangwook_kim_e31e7291ad98/typescript-zod-v4-claude-api-a-complete-guide-to-type-safe-llm-response-parsing-6gb</guid>
      <description>&lt;p&gt;I once trusted a raw &lt;code&gt;JSON.parse()&lt;/code&gt; call on a Claude API response and got burned by a runtime error. When you pull &lt;code&gt;content[0].text&lt;/code&gt; and parse it, there's no guarantee the resulting object has the fields you expect. LLMs sometimes ignore instructions, quietly rename fields, or mix types. Zod v4 catches that at runtime, at the parse boundary, and hands you a statically typed result before the data ever reaches your business logic.&lt;/p&gt;

&lt;p&gt;This article covers practical patterns for safely parsing Claude API responses, tested against Zod 4.4.3 and &lt;code&gt;@anthropic-ai/sdk 0.100.1&lt;/code&gt;. I ran a 100,000-iteration parse benchmark myself and checked the v3-to-v4 API changes against actual behavior in code.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Changed Between Zod v3 and v4
&lt;/h2&gt;

&lt;p&gt;The headline numbers are impressive: string parsing 14x faster, arrays 7x, objects 6.5x. Bundle size is down 57%, and TypeScript type instantiations are cut by up to 100x. That said, you don't need to migrate immediately just because the numbers look good.&lt;/p&gt;

&lt;p&gt;After hands-on use, three changes are the ones you actually feel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First, error messages are more readable.&lt;/strong&gt; The old pattern of passing separate &lt;code&gt;required_error&lt;/code&gt; and &lt;code&gt;invalid_type_error&lt;/code&gt; options is replaced by a single &lt;code&gt;error&lt;/code&gt; parameter. Default message formats also changed. What was &lt;code&gt;"String must contain at least 1 character(s)"&lt;/code&gt; in v3 is now &lt;code&gt;"Too small: expected string to have &amp;gt;=1 characters"&lt;/code&gt; in v4. If any of your tests do string comparisons on Zod error messages, they will break.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second, number validation is stricter.&lt;/strong&gt; &lt;code&gt;Infinity&lt;/code&gt; and &lt;code&gt;-Infinity&lt;/code&gt; used to pass &lt;code&gt;z.number()&lt;/code&gt; in v3. In v4, they return &lt;code&gt;success: false&lt;/code&gt;. Integers exceeding &lt;code&gt;Number.MAX_SAFE_INTEGER&lt;/code&gt; are also rejected by &lt;code&gt;z.number().int()&lt;/code&gt;. Worth noting if your code might receive extreme values from external APIs or LLM responses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Third, the API surface got cleaner.&lt;/strong&gt; The v4 style is &lt;code&gt;z.email()&lt;/code&gt; instead of &lt;code&gt;z.string().email()&lt;/code&gt;. Use &lt;code&gt;z.intersection(A, B)&lt;/code&gt; over &lt;code&gt;.and()&lt;/code&gt;. And there's a new &lt;code&gt;.check()&lt;/code&gt; method for inline custom validation.&lt;/p&gt;

&lt;p&gt;Honest caveat: v4 is not always faster than v3. Community benchmarks show a handful of deeply nested schema scenarios where v3 is actually quicker. The headline numbers reflect typical patterns, not every workload.&lt;/p&gt;

&lt;h3&gt;
  
  
  APIs Officially Removed (But Still There)
&lt;/h3&gt;

&lt;p&gt;The migration docs say &lt;code&gt;.and()&lt;/code&gt; was removed. In practice, testing against 4.4.3, it exists and works fine. The documentation appears to have gotten ahead of the actual release. Same story with &lt;code&gt;required_error&lt;/code&gt; — it technically still works, but the message format changed. These look more like quiet deprecations than hard removals.&lt;/p&gt;

&lt;p&gt;When planning a migration, verify against the actual version you're running rather than taking the docs at face value.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installation and Basic Setup
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install &lt;/span&gt;zod@^4.4.3
npm &lt;span class="nb"&gt;install&lt;/span&gt; @anthropic-ai/sdk@^0.100.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a TypeScript project, &lt;code&gt;strict: true&lt;/code&gt; in &lt;code&gt;tsconfig.json&lt;/code&gt; is required for Zod's type inference to work properly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"compilerOptions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"strict"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"target"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ES2022"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"module"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ESNext"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"moduleResolution"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bundler"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's the minimal check to confirm things work after installation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;zod&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;UserSchema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;email&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;         &lt;span class="c1"&gt;// v4 style: replaces z.string().email()&lt;/span&gt;
  &lt;span class="na"&gt;age&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;number&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enum&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;admin&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;viewer&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;User&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;infer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;UserSchema&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;UserSchema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;safeParse&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Jangwook&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;kim.jangwook@example.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;age&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;admin&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;success&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// type: string&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;issues&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



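&lt;p&gt;Logging &lt;code&gt;result.success&lt;/code&gt; and the serialized &lt;code&gt;result.data&lt;/code&gt; for this input gives exactly what you'd expect:&lt;br&gt;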
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;success:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;parsed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;data:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Jangwook"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"email"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"kim.jangwook@example.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"age"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"admin"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;code&gt;@zod/mini&lt;/code&gt; Is a Separate Package
&lt;/h3&gt;

&lt;p&gt;The release announcement includes &lt;code&gt;@zod/mini&lt;/code&gt;, a tree-shakeable build at roughly 1.9KB gzip. Useful if you care about frontend bundle size. The API surface is different from the main &lt;code&gt;zod&lt;/code&gt; package, though. Since this article focuses on server-side Claude API integration, everything here uses the main package.&lt;/p&gt;

&lt;h2&gt;
  
  
  Designing Schemas for LLM Responses
&lt;/h2&gt;

&lt;p&gt;Schemas for LLM responses need a different design philosophy than schemas for form data. The key difference is &lt;strong&gt;defensive handling of optional fields&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;An LLM may not return every field you asked for. Response quality is variable, and prompt changes can shift the structure. Your schema should reflect that reality.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Basic LLM Response Schema
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;zod&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Schema for blog post analysis response&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;BlogAnalysisSchema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;sentiment&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enum&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;positive&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;neutral&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;negative&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
  &lt;span class="na"&gt;readingTimeMinutes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;number&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="c1"&gt;// Fields the LLM may not always return&lt;/span&gt;
  &lt;span class="na"&gt;seoScore&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;number&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;optional&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
  &lt;span class="na"&gt;suggestedImprovements&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;optional&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;BlogAnalysis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;infer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;BlogAnalysisSchema&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Nested Schemas with Metadata
&lt;/h3&gt;

&lt;p&gt;Sometimes you want metadata about the response itself alongside the actual content, such as a confidence score or the name of the model that produced it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;LLMResponseSchema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="c1"&gt;// Actual content&lt;/span&gt;
  &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
    &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="c1"&gt;// Response metadata (optional)&lt;/span&gt;
  &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;number&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="na"&gt;processingTimeMs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;number&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;positive&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;optional&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In my tests, a nested object marked &lt;code&gt;.optional()&lt;/code&gt; behaved as expected: parsing succeeds whether or not &lt;code&gt;metadata&lt;/code&gt; is present.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;LLM response (with metadata) success&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;title: Zod v4&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;A Deep Dive into Schema Validation&lt;/span&gt;
&lt;span class="na"&gt;confidence&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.92&lt;/span&gt;
&lt;span class="na"&gt;LLM response (no metadata) success&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Using z.string().check() to Validate LLM Response Format
&lt;/h3&gt;

&lt;p&gt;The new &lt;code&gt;.check()&lt;/code&gt; API in v4 is genuinely useful when an LLM is supposed to follow a specific format or prefix convention.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// LLM responses must always start with "RESULT:"&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;LLMResultSchema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;check&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startsWith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;RESULT:&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;issues&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;custom&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;LLM response must start with "RESULT:"&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;valid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;LLMResultSchema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;safeParse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;RESULT: analysis complete&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;invalid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;LLMResultSchema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;safeParse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;analysis complete&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;valid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;success&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;   &lt;span class="c1"&gt;// true&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;invalid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;success&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// false&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One rough edge worth knowing: TypeScript autocomplete inside the &lt;code&gt;.check()&lt;/code&gt; callback is thin. The issue object you push into &lt;code&gt;ctx.issues&lt;/code&gt; needs &lt;code&gt;code: 'custom'&lt;/code&gt;, &lt;code&gt;message&lt;/code&gt;, and &lt;code&gt;input&lt;/code&gt;, but the editor won't hint these fields reliably. It's an easy place to make a typo the first time through.&lt;/p&gt;
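&lt;p&gt;One way to work around the thin autocomplete is to build the issue object through a small typed helper, so the compiler checks the field names for you. The sketch below is an assumption on my part, not something Zod ships: &lt;code&gt;CustomIssue&lt;/code&gt; and &lt;code&gt;customIssue&lt;/code&gt; are hypothetical names, the type lists only the three fields mentioned above, and a plain array stands in for &lt;code&gt;ctx.issues&lt;/code&gt; so the snippet runs without Zod installed.&lt;/p&gt;

```typescript
// Hypothetical shape: only the three fields named in the text above.
type CustomIssue = {
  code: 'custom';
  message: string;
  input: unknown;
};

// Hypothetical helper: a misspelled or missing field is a compile error here.
function customIssue(message: string, input: unknown): CustomIssue {
  return { code: 'custom', message, input };
}

// A plain array stands in for ctx.issues so the sketch runs without Zod.
const issues: CustomIssue[] = [];
const value = 'analysis complete';

if (!value.startsWith('RESULT:')) {
  issues.push(customIssue('LLM response must start with "RESULT:"', value));
}

console.log(issues.length, issues[0]?.code);
```

&lt;p&gt;Inside a real &lt;code&gt;.check()&lt;/code&gt; callback the same helper would presumably be called as &lt;code&gt;ctx.issues.push(customIssue(message, ctx.value))&lt;/code&gt;. I have not verified the helper's return type against Zod's own issue type, so treat that as a starting point.&lt;/p&gt;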

&lt;h2&gt;
  
  
  Parsing Claude API Responses with Zod
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Pattern 1: Prompt for JSON, Then Parse
&lt;/h3&gt;

&lt;p&gt;This is the simplest approach: specify the JSON format in the system prompt, extract the response text, run it through &lt;code&gt;JSON.parse()&lt;/code&gt;, then validate the result with Zod.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;Anthropic&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@anthropic-ai/sdk&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;zod&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;// Define the expected response structure&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ArticleAnalysisSchema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;mainTopics&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;difficulty&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enum&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;beginner&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;intermediate&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;advanced&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
  &lt;span class="na"&gt;estimatedReadTime&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;number&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;positive&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
  &lt;span class="na"&gt;hasCodeExamples&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;boolean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;ArticleAnalysis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;infer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;ArticleAnalysisSchema&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;analyzeArticle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;ArticleAnalysis&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;system&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`You are a technical document analyzer.
Respond only with JSON in this exact format:
{
  "title": "document title",
  "mainTopics": ["topic1", "topic2"],
  "difficulty": "beginner" | "intermediate" | "advanced",
  "estimatedReadTime": number (minutes),
  "hasCodeExamples": true | false
}
Do not include any text outside the JSON.`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Analyze the following document:\n\n&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="c1"&gt;// Extract the text content&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;textContent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;block&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;text&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;textContent&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;textContent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;text&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;No text response received&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// Parse JSON&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="na"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;unknown&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;parsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;textContent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`JSON parse failed: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;textContent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// Zod validation&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ArticleAnalysisSchema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;safeParse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;success&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;errorSummary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;issues&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;issue&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;issue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;issue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;, &lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Schema validation failed: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;errorSummary&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The weak point here is that &lt;code&gt;JSON.parse()&lt;/code&gt; fails whenever the LLM wraps its JSON in markdown code fences or adds explanatory text around it. You need a bit of defensive extraction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;extractJsonFromResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Extract JSON from ```&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="nx"&gt;endraw&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;json&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="nx"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="s2"&gt;``` blocks
  const codeBlockMatch = text.match(/```&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="nx"&gt;endraw&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;(?:&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;)?&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="nx"&gt;S&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;?)&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="nx"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="s2"&gt;```/);
  if (codeBlockMatch) {
    return codeBlockMatch[1];
  }

  // Extract anything wrapped in curly braces
  const jsonMatch = text.match(/\{[\s\S]*\}/);
  if (jsonMatch) {
    return jsonMatch[0];
  }

  return text;
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
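&lt;p&gt;To see how the extraction step fits together with &lt;code&gt;JSON.parse()&lt;/code&gt;, here is a minimal, dependency-free sketch. &lt;code&gt;parseLlmJson&lt;/code&gt; and &lt;code&gt;ParseOutcome&lt;/code&gt; are hypothetical names, and the sample replies are made up for illustration. The fence marker is built with &lt;code&gt;repeat()&lt;/code&gt; only to keep literal backticks out of the snippet. The matching logic is the same fallback as above: a fenced block first, then the outermost braces, then the raw text.&lt;/p&gt;

```typescript
// Three backticks, built at runtime so the snippet contains no literal fence.
const FENCE = '`'.repeat(3);
const FENCED_BLOCK = new RegExp(FENCE + '(?:json)?\\s*([\\s\\S]*?)\\s*' + FENCE);

// Hypothetical result type: either the parsed value or the reason it failed.
type ParseOutcome =
  | { ok: true; value: unknown }
  | { ok: false; error: string };

// Hypothetical helper: pick the JSON candidate, then attempt JSON.parse.
function parseLlmJson(text: string): ParseOutcome {
  const fenced = text.match(FENCED_BLOCK);
  const braces = text.match(/\{[\s\S]*\}/);
  // Prefer a fenced block, then the outermost braces, then the raw text.
  const candidate = fenced?.[1] ?? braces?.[0] ?? text;
  try {
    return { ok: true, value: JSON.parse(candidate) };
  } catch (err) {
    return { ok: false, error: String(err) };
  }
}

// Made-up replies covering the failure modes mentioned above.
const fencedReply = 'Here you go:\n' + FENCE + 'json\n{"title": "Zod"}\n' + FENCE;
const chattyReply = 'Sure! {"title": "Zod"} Hope that helps.';
const brokenReply = 'I could not analyze this document.';

console.log(parseLlmJson(fencedReply).ok);
console.log(parseLlmJson(chattyReply).ok);
console.log(parseLlmJson(brokenReply).ok);
```

&lt;p&gt;The returned &lt;code&gt;value&lt;/code&gt; is still typed as &lt;code&gt;unknown&lt;/code&gt;, so it would go through the schema's &lt;code&gt;safeParse()&lt;/code&gt; next, exactly as in Pattern 1.&lt;/p&gt;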



&lt;h3&gt;
  
  
  Pattern 2: Force Structured Output via Tool Use
&lt;/h3&gt;

&lt;p&gt;As covered in &lt;a href="https://dev.to/en/blog/en/claude-agent-sdk-tool-use-complete-guide-2026"&gt;Claude Agent SDK Tool Use Complete Guide&lt;/a&gt;, using &lt;code&gt;tool_use&lt;/code&gt; lets you enforce JSON structure. The LLM "calls" a tool and returns structured data as the tool input.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;Anthropic&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@anthropic-ai/sdk&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;zod&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;// Zod schema for the tool's input&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ArticleMetadataSchema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;The article&lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="s1"&gt;s core title&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;List of relevant tags (up to 5)&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;number&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Analysis confidence (0-1)&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Tool definition in Anthropic format&lt;/span&gt;
&lt;span class="c1"&gt;// (written manually here, without zodToJsonSchema)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;extractMetadataTool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Tool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;extract_metadata&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Extract metadata from a document&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;input_schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;The article&lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="s1"&gt;s core title&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;array&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;List of relevant tags (up to 5)&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="na"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;number&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;minimum&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;maximum&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Analysis confidence (0-1)&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;title&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tags&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;confidence&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;extractMetadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;extractMetadataTool&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
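    &lt;span class="c1"&gt;// Force this tool so the response always contains a tool_use block&lt;/span&gt;
    &lt;span class="na"&gt;tool_choice&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tool&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;extract_metadata&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;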
    &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Extract metadata from the following:\n\n&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="c1"&gt;// Find the tool_use block&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;toolUseBlock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nx"&gt;block&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tool_use&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;extract_metadata&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;toolUseBlock&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;toolUseBlock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tool_use&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Tool was not called&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// tool_use input is unknown — validate with Zod&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ArticleMetadataSchema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;safeParse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;toolUseBlock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;success&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="s2"&gt;`tool_use input validation failed: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;())}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tool Use is more reliable than Pattern 1 for a clear reason. Claude structures its JSON directly into the tool &lt;code&gt;input&lt;/code&gt; field. There's no room for markdown fences or stray explanatory text. The SDK handles JSON parsing internally, so you don't need to catch &lt;code&gt;JSON.parse()&lt;/code&gt; failures separately.&lt;/p&gt;

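&lt;p&gt;That said, never skip Zod validation even with Tool Use. &lt;code&gt;toolUseBlock.input&lt;/code&gt; is typed as &lt;code&gt;unknown&lt;/code&gt;, so any cast you apply to it is unchecked. If Claude returns an unexpected type, the compiler cannot warn you and the mismatch only surfaces at runtime, often far from the call that produced it.&lt;/p&gt;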
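
&lt;p&gt;To make that failure mode concrete, here is a minimal sketch that uses no Zod at all. The sample input and the helper names (&lt;code&gt;rawInput&lt;/code&gt;, &lt;code&gt;countTagsUnchecked&lt;/code&gt;, &lt;code&gt;hasStringArrayTags&lt;/code&gt;) are made up for illustration, and the hand-written check only stands in for what &lt;code&gt;safeParse()&lt;/code&gt; does for the whole schema.&lt;/p&gt;

```typescript
// Hypothetical tool input where the model returned tags as one string, not an array.
const rawInput: unknown = { title: 'Zod v4 notes', tags: 'zod,typescript', confidence: 0.9 };

// A type assertion compiles, but nothing checks the data at runtime.
const unchecked = rawInput as { title: string; tags: string[]; confidence: number };

function countTagsUnchecked(): number | string {
  try {
    // tags is really a string here, so .map does not exist and this call throws.
    return unchecked.tags.map(function (tag) { return tag.trim(); }).length;
  } catch (err) {
    return err instanceof Error ? err.name : 'unknown error';
  }
}

// Minimal hand-rolled check, standing in for a real schema validation step.
function hasStringArrayTags(value: unknown): boolean {
  if (typeof value !== 'object' || value === null) return false;
  const tags = (value as { tags?: unknown }).tags;
  if (!Array.isArray(tags)) return false;
  return tags.every(function (tag) { return typeof tag === 'string'; });
}
```

&lt;p&gt;The cast compiles cleanly, yet &lt;code&gt;countTagsUnchecked()&lt;/code&gt; hits a &lt;code&gt;TypeError&lt;/code&gt; because the value is really a string. A check at the boundary, whether hand-written like this or done by the Zod schema, rejects the input before any downstream code touches it.&lt;/p&gt;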

&lt;h2&gt;
  
  
  Production Error Handling Patterns
&lt;/h2&gt;

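&lt;p&gt;LLM response parsing can fail at two distinct layers: JSON parsing and Zod schema validation. Distinguishing between them makes debugging much faster, because a JSON failure points at the response format (markdown fences, stray prose, truncated output) while a schema failure points at the content of specific fields.&lt;/p&gt;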

&lt;h3&gt;
  
  
  Separating Error Layers
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;ParseResult&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
  &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;success&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;T&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;success&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;schema&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;parseLLMResponse&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ZodType&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nx"&gt;ParseResult&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Layer 1: JSON parsing&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="na"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;unknown&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;jsonText&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extractJsonFromResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;parsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;jsonText&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;success&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt; &lt;span class="k"&gt;instanceof&lt;/span&gt; &lt;span class="nb"&gt;Error&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="na"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// Layer 2: Zod schema validation&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;safeParse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;success&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;success&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;schema&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;formatZodError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="na"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;success&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;formatZodError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ZodError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;issues&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;issue&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;issue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="s2"&gt;`[&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;issue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;]`&lt;/span&gt;
        &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[root]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;issue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;; &lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
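
&lt;p&gt;One way to put the &lt;code&gt;stage&lt;/code&gt; field to work, sketched under the assumption that failures are logged as single lines. &lt;code&gt;ParseFailure&lt;/code&gt; is a local copy of the failure branch of &lt;code&gt;ParseResult&lt;/code&gt; so the snippet runs on its own, and &lt;code&gt;describeFailure&lt;/code&gt; is a hypothetical helper, not part of Zod or the Anthropic SDK.&lt;/p&gt;

```typescript
// Local copy of the failure branch of ParseResult, so the sketch is self-contained.
type ParseFailure = {
  success: false;
  stage: 'json' | 'schema';
  error: string;
  raw?: string;
};

// Hypothetical helper: build a log line that names the layer that failed.
function describeFailure(failure: ParseFailure): string {
  if (failure.stage === 'json') {
    // The text was not parseable JSON, so the start of the raw response is the useful clue.
    const preview = (failure.raw ?? '').slice(0, 60);
    return `json layer: ${failure.error} | raw starts with: ${preview}`;
  }
  // JSON parsed fine but the shape was wrong, so the field paths in the error are the useful clue.
  return `schema layer: ${failure.error}`;
}
```

&lt;p&gt;Keeping the two branches separate in logs makes it easy to count how often each layer fails, which in turn suggests whether to work on the prompt format or on the schema.&lt;/p&gt;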



&lt;h3&gt;
  
  
  Structured Errors with error.format()
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;error.format()&lt;/code&gt; still works in v4 and returns errors organized by field, although the v4 docs mark it as deprecated in favor of &lt;code&gt;z.treeifyError()&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;BlogAnalysisSchema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;safeParse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;badData&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;success&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;formatted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="c1"&gt;// Example output:&lt;/span&gt;
  &lt;span class="c1"&gt;// {&lt;/span&gt;
  &lt;span class="c1"&gt;//   _errors: [],&lt;/span&gt;
  &lt;span class="c1"&gt;//   title: { _errors: ['Too small: expected string to have &amp;gt;=1 characters'] },&lt;/span&gt;
  &lt;span class="c1"&gt;//   tags: { _errors: ['Too small: expected array to have &amp;gt;=1 items'] }&lt;/span&gt;
  &lt;span class="c1"&gt;// }&lt;/span&gt;

  &lt;span class="c1"&gt;// Pull errors for a specific field&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;titleErrors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;formatted&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;title&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;_errors&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tagsErrors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;formatted&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;_errors&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you need per-field structure for client responses or logs, &lt;code&gt;error.format()&lt;/code&gt; is clean. For a flat list of issues, reading &lt;code&gt;error.issues&lt;/code&gt; directly is simpler.&lt;/p&gt;
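
&lt;p&gt;If a client only needs the fields that actually failed, the formatted tree can be reduced to a plain map. The sketch below assumes the shape shown in the example output above (each node carries an &lt;code&gt;_errors&lt;/code&gt; array) and only walks top-level fields. &lt;code&gt;FormattedNode&lt;/code&gt; and &lt;code&gt;collectFieldErrors&lt;/code&gt; are hypothetical names, not Zod exports.&lt;/p&gt;

```typescript
// Assumed shape of a formatted error node: its own messages plus nested field nodes.
type FormattedNode = { _errors?: string[]; [field: string]: unknown };

// Hypothetical helper: keep only top-level fields that have at least one message.
function collectFieldErrors(formatted: FormattedNode): { [field: string]: string[] } {
  const out: { [field: string]: string[] } = {};
  for (const key of Object.keys(formatted)) {
    // Skip the root-level message list; only field nodes are of interest here.
    if (key === '_errors') continue;
    const node = formatted[key];
    if (typeof node !== 'object' || node === null) continue;
    const messages = (node as FormattedNode)._errors;
    if (Array.isArray(messages)) {
      if (messages.length !== 0) out[key] = messages;
    }
  }
  return out;
}
```

&lt;p&gt;Nested objects and arrays would need a recursive walk. For anything deeper than one level, building the map from &lt;code&gt;error.issues&lt;/code&gt; and each issue's &lt;code&gt;path&lt;/code&gt; is probably the simpler route.&lt;/p&gt;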

&lt;h3&gt;
  
  
  Retry Logic with Feedback
&lt;/h3&gt;

&lt;p&gt;When parsing fails, you can retry with the error message injected back into the prompt so the LLM can self-correct.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;analyzeWithRetry&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ZodType&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;maxRetries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;lastError&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="nx"&gt;maxRetries&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;attempt&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;systemPrompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
      &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nx"&gt;BASE_SYSTEM_PROMPT&lt;/span&gt;
      &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;BASE_SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;\n\nThe previous response caused this error: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;lastError&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;\nRespond only with the JSON format specified.`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;system&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;systemPrompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;content&lt;/span&gt; &lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;textBlock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;text&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
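    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;textBlock&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;textBlock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;text&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="c1"&gt;// Record why this attempt failed so the retry prompt is not left with an empty error&lt;/span&gt;
      &lt;span class="nx"&gt;lastError&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Response contained no text block&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;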

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;parseResult&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parseLLMResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;textBlock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;parseResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;success&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;parseResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="nx"&gt;lastError&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;parseResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Attempt &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; failed: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;lastError&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Parse failed after &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;maxRetries&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; attempts: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;lastError&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



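&lt;p&gt;Keep &lt;code&gt;maxRetries&lt;/code&gt; at 2 or fewer. Each retry is another full API call, so a request that keeps failing validation costs up to &lt;code&gt;maxRetries + 1&lt;/code&gt; calls, and those costs add up quickly.&lt;/p&gt;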

&lt;h2&gt;
  
  
  Performance: What Zod v4 Speed Actually Looks Like
&lt;/h2&gt;

&lt;p&gt;I ran this on Apple Silicon with a 4-field object schema, 100,000 &lt;code&gt;safeParse()&lt;/code&gt; iterations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;UserSchema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;email&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
  &lt;span class="na"&gt;age&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;number&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enum&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;admin&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;viewer&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;testData&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Jangwook&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;kim.jangwook@example.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;age&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;admin&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;iterations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="nx"&gt;_000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;performance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;iterations&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;UserSchema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;safeParse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;testData&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;duration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;performance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;parsesPerSecond&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;iterations&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;duration&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`duration: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toFixed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;ms`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`parses/second: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;parsesPerSecond&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toLocaleString&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;iterations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;100,000&lt;/span&gt;
&lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;45.78ms&lt;/span&gt;
&lt;span class="na"&gt;parses/second&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2,184,481&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;2.18 million parses per second, or roughly 0.46 microseconds per parse. That's overkill for Claude API response handling: the API call itself takes hundreds of milliseconds to seconds, so Zod parsing will never be your bottleneck.&lt;/p&gt;

&lt;p&gt;Where the speed matters is batch processing. If you're running Zod validation across millions of log entries or event records, v4's throughput improvement is genuinely noticeable. For LLM response parsing alone, the performance case for migrating from v3 to v4 is weak.&lt;/p&gt;

&lt;p&gt;My current position: start new projects on v4. No urgent reason to migrate existing v3 codebases. v4 is production-ready, but if v3 is working fine, there's no fire.&lt;/p&gt;

&lt;h3&gt;
  
  
  Environment Variance
&lt;/h3&gt;

&lt;p&gt;These numbers came from an Apple Silicon M-series machine. AWS or GCP Linux x86 instances will differ. If you need performance guarantees in CI, measure directly in your actual environment. Don't take official benchmarks as ground truth for your setup.&lt;/p&gt;
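
&lt;p&gt;One way to take that measurement is a small harness that warms up first and reports the median of several runs. This is a minimal sketch, not the script behind the numbers above: &lt;code&gt;benchmark&lt;/code&gt; and its parameters are hypothetical names, and it assumes a runtime with a global &lt;code&gt;performance.now()&lt;/code&gt; (Node.js 16+ or a browser).&lt;/p&gt;

```typescript
// Hypothetical helper, not part of Zod or the original benchmark script.
// Times a synchronous function over several runs and reports the median,
// which is less sensitive to one slow run than a single measurement.
function benchmark(fn: () => void, iterations = 100_000, runs = 5, warmup = 1_000) {
  // Warmup pass so the JIT can optimize fn before timing starts.
  for (let i = warmup; i > 0; i--) fn();

  const durations: number[] = [];
  for (let run = runs; run > 0; run--) {
    const start = performance.now();
    for (let i = iterations; i > 0; i--) fn();
    durations.push(performance.now() - start);
  }

  durations.sort((a, b) => a - b);
  const medianMs = durations[Math.floor(durations.length / 2)];
  // Guard against a 0 ms reading on runtimes with coarse timers.
  const perSecond = medianMs > 0 ? Math.round(iterations / (medianMs / 1000)) : Infinity;
  return { medianMs, perSecond };
}

// Usage with a stand-in workload; swap in () => UserSchema.safeParse(testData).
const result = benchmark(() => JSON.parse('{"age":30}'));
console.log(`median: ${result.medianMs.toFixed(2)}ms, per second: ${result.perSecond.toLocaleString()}`);
```

&lt;p&gt;Run it on the same instance type your CI uses, and compare medians rather than single runs.&lt;/p&gt;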

&lt;h2&gt;
  
  
  Practical Integration: Blog Post Metadata Extractor
&lt;/h2&gt;

&lt;p&gt;Here's a working example combining the patterns above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;Anthropic&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@anthropic-ai/sdk&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;zod&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;// Blog post metadata schema&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;PostMetadataSchema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;difficulty&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enum&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;beginner&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;intermediate&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;advanced&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
  &lt;span class="na"&gt;estimatedReadingTime&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;number&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;hasCodeExamples&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;boolean&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
  &lt;span class="na"&gt;targetAudience&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;PostMetadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;infer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;PostMetadataSchema&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;extractPostMetadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;markdownContent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;PostMetadata&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;system&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Analyze a technical blog post and return its metadata as JSON.
You must follow this exact format:
{
  "title": "core post title (under 100 characters)",
  "description": "SEO description (50-200 characters)",
  "tags": ["tag1", "tag2"] (1-5 non-empty tags),
  "difficulty": "beginner" | "intermediate" | "advanced",
  "estimatedReadingTime": integer number of minutes (1-60),
  "hasCodeExamples": true | false,
  "targetAudience": "description of intended readers (10-100 characters)"
}`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Analyze the following markdown content:\n\n&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;markdownContent&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;textBlock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;text&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;textBlock&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;textBlock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;text&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;No text response received&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;parseResult&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parseLLMResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;textBlock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;PostMetadataSchema&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;parseResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;success&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="s2"&gt;`Metadata extraction failed [&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;parseResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;]: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;parseResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;parseResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same pattern drops directly into MCP tool handlers from &lt;a href="https://dev.to/en/blog/en/mcp-server-typescript-sdk-step-by-step-2026"&gt;TypeScript MCP Server Step-by-Step&lt;/a&gt;. Call an LLM inside the handler, validate the response with Zod, return structured output.&lt;/p&gt;

&lt;p&gt;When unit testing this function as described in &lt;a href="https://dev.to/en/blog/en/vitest-4-ai-agent-testing-patterns-2026"&gt;Vitest 4 AI Agent Testing Patterns&lt;/a&gt;, mock &lt;code&gt;client.messages.create()&lt;/code&gt; and assert on the &lt;code&gt;safeParse()&lt;/code&gt; result. Having a Zod schema makes it easy to build test fixtures that match the schema exactly.&lt;/p&gt;
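
&lt;p&gt;A dependency-free way to sketch that test is to pass the client in as a parameter instead of importing it. Everything named below is a hypothetical stand-in: &lt;code&gt;fakeClient&lt;/code&gt; mimics only the call shape of &lt;code&gt;client.messages.create()&lt;/code&gt;, and &lt;code&gt;parseTitle&lt;/code&gt; plays the role of &lt;code&gt;parseLLMResponse&lt;/code&gt; plus a Zod schema. In a real Vitest suite you would more likely mock the method with &lt;code&gt;vi.fn()&lt;/code&gt;.&lt;/p&gt;

```typescript
// Hypothetical stand-in for the Anthropic client: same call shape as
// client.messages.create(), but returns a canned response and records calls.
const requests: unknown[] = [];
const fakeClient = {
  messages: {
    create: async (request: unknown) => {
      requests.push(request);
      return { content: [{ type: 'text', text: '{"title":"Zod v4 notes"}' }] };
    },
  },
};

// Stand-in for parseLLMResponse + schema: returns a safeParse-style result.
function parseTitle(raw: string) {
  try {
    const value = JSON.parse(raw);
    if (typeof value.title === 'string') {
      return { success: true as const, data: { title: value.title } };
    }
    return { success: false as const, error: 'title must be a string' };
  } catch {
    return { success: false as const, error: 'invalid JSON' };
  }
}

// Function under test: the client is a parameter, so tests can pass the fake.
async function extractTitle(client: typeof fakeClient, markdown: string) {
  const response = await client.messages.create({ content: markdown });
  const textBlock = response.content.find(b => b.type === 'text');
  if (!textBlock) throw new Error('No text response received');
  const result = parseTitle(textBlock.text);
  if (!result.success) throw new Error(`Extraction failed: ${result.error}`);
  return result.data;
}

extractTitle(fakeClient, '# Zod v4 notes').then(data => {
  console.log(data.title, requests.length);
});
```

&lt;p&gt;Because the fake records every request, the same test can also assert on what was sent to the API, not only on what came back.&lt;/p&gt;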

&lt;h3&gt;
  
  
  Migration Checklist: v3 to v4
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Find any code that validates &lt;code&gt;Infinity&lt;/code&gt; or &lt;code&gt;-Infinity&lt;/code&gt; with &lt;code&gt;z.number()&lt;/code&gt;; v4 rejects infinite values&lt;/li&gt;
&lt;li&gt;Replace &lt;code&gt;required_error&lt;/code&gt; and &lt;code&gt;invalid_type_error&lt;/code&gt; options with the unified &lt;code&gt;error&lt;/code&gt; parameter&lt;/li&gt;
&lt;li&gt;Update test assertions that compare Zod error message strings directly&lt;/li&gt;
&lt;li&gt;Gradually replace &lt;code&gt;z.string().email()&lt;/code&gt; with &lt;code&gt;z.email()&lt;/code&gt; (old API still works, but v4 style is preferred)&lt;/li&gt;
&lt;li&gt;Replace &lt;code&gt;.and()&lt;/code&gt; with &lt;code&gt;z.intersection(A, B)&lt;/code&gt; (still works, but officially deprecated)&lt;/li&gt;
&lt;li&gt;For large codebases, evaluate the &lt;code&gt;zod-v3-to-v4&lt;/code&gt; community codemod&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the migration feels like a lot, start by auditing just the &lt;code&gt;z.number()&lt;/code&gt; breaking changes. The rest can be handled incrementally.&lt;/p&gt;
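
&lt;p&gt;To make that audit concrete, a plain-text scan can flag the patterns from the checklist before you touch any code. This is a rough sketch under stated assumptions: &lt;code&gt;findV3Patterns&lt;/code&gt; and its rule list are hypothetical, and it matches raw text rather than parsing TypeScript. For the &lt;code&gt;z.number()&lt;/code&gt; item it can only point at call sites, because whether they ever receive &lt;code&gt;Infinity&lt;/code&gt; depends on runtime data. Treat hits as places to review, not as a complete report.&lt;/p&gt;

```typescript
// Hypothetical audit helper: flags lines containing Zod v3 patterns named in
// the checklist. Regex-based, so expect false positives and misses.
const rules = [
  { label: 'required_error / invalid_type_error option', pattern: /\b(required_error|invalid_type_error)\b/ },
  { label: 'z.string().email() (prefer z.email())', pattern: /\.string\(\)\s*\.email\(/ },
  { label: '.and() (consider z.intersection)', pattern: /\.and\(/ },
  { label: 'z.number() (check for Infinity inputs)', pattern: /\bz\.number\(/ },
];

function findV3Patterns(source: string) {
  const findings: { line: number; label: string }[] = [];
  source.split('\n').forEach((text, index) => {
    for (const rule of rules) {
      // Line numbers are 1-based to match editor gutters.
      if (rule.pattern.test(text)) findings.push({ line: index + 1, label: rule.label });
    }
  });
  return findings;
}

// Sample input standing in for the contents of a source file.
const sample = [
  "const Name = z.string({ required_error: 'Name is required' });",
  'const Email = z.string().email();',
  'const Score = z.number().min(0);',
].join('\n');

for (const finding of findV3Patterns(sample)) {
  console.log(`line ${finding.line}: ${finding.label}`);
}
```

&lt;p&gt;For anything beyond a first pass, the community codemod from the checklist is the better tool, since it works on the syntax tree instead of raw text.&lt;/p&gt;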

&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;Zod v4 is a solid choice for LLM response parsing. The type safety from &lt;code&gt;safeParse()&lt;/code&gt;, nested schema support, and consolidated error API all fit naturally with Claude API integration. The performance improvement won't be noticeable in LLM response handling, but the TypeScript compilation speedup makes a real difference in larger projects.&lt;/p&gt;

&lt;p&gt;The one rough edge: &lt;code&gt;.check()&lt;/code&gt; TypeScript support is not quite there yet. When pushing custom issues via &lt;code&gt;ctx.issues.push()&lt;/code&gt;, you're writing without autocomplete. That needs improvement.&lt;/p&gt;

&lt;p&gt;For new projects, go with Zod v4. For existing v3 codebases, review the breaking changes list and migrate incrementally.&lt;/p&gt;

</description>
      <category>typescript</category>
      <category>zod</category>
      <category>claudeapi</category>
    </item>
  </channel>
</rss>
