<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: pavelgj</title>
    <description>The latest articles on DEV Community by pavelgj (@pavelgj).</description>
    <link>https://dev.to/pavelgj</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3891156%2Fa3252f12-0605-49c1-a084-4b9be04d136b.jpg</url>
      <title>DEV Community: pavelgj</title>
      <link>https://dev.to/pavelgj</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pavelgj"/>
    <language>en</language>
    <item>
      <title>Stop Your AI Agents From Crashing, Looping, and Burning Through Tokens</title>
      <dc:creator>pavelgj</dc:creator>
      <pubDate>Tue, 05 May 2026 01:31:50 +0000</pubDate>
      <link>https://dev.to/pavelgj/stop-your-ai-agents-from-crashing-looping-and-burning-through-tokens-2g70</link>
      <guid>https://dev.to/pavelgj/stop-your-ai-agents-from-crashing-looping-and-burning-through-tokens-2g70</guid>
      <description>&lt;p&gt;If you've built agentic workflows with LLMs — the kind where a model calls tools, reasons over results, and loops back for more — you've hit &lt;em&gt;the wall&lt;/em&gt;. Not the conceptual wall. The very real, very expensive wall where your agent crashes at turn 47 because a model returned a 503, or silently burns $12 calling the same search tool in an infinite loop, or stuffs 200K tokens of context into a request that could've been 20K.&lt;/p&gt;

&lt;p&gt;These aren't edge cases. They're the default behavior of every agentic loop that runs long enough. And until now, the fix was the same every time: wrap everything in try/catch, add a turn counter, pray.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;There's a better way now.&lt;/strong&gt; Google's &lt;a href="https://genkit.dev" rel="noopener noreferrer"&gt;Genkit&lt;/a&gt; just shipped a &lt;code&gt;generateMiddleware()&lt;/code&gt; API that lets you intercept and modify AI generation at every level — model calls, tool execution, and the entire generate loop. Think of it as Express middleware, but for LLM inference. And it changes how you build resilient agents.&lt;/p&gt;

&lt;p&gt;I built three middleware on top of it — &lt;code&gt;softFail&lt;/code&gt;, &lt;code&gt;smartMaxTurns&lt;/code&gt;, and &lt;code&gt;contextCompression&lt;/code&gt; — that solve the three problems I kept hitting in production. This article walks through the middleware API itself, why it matters, and how each middleware works.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Middleware API: Intercept Everything
&lt;/h2&gt;

&lt;p&gt;Genkit's &lt;code&gt;generateMiddleware()&lt;/code&gt; gives you hooks into three layers of the generation process:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;generateMiddleware&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;genkit/beta&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;myMiddleware&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generateMiddleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;myMiddleware&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;configSchema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="cm"&gt;/* per-call config */&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;pluginConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ai&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="c1"&gt;// Wraps each model call (request → response)&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;next&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="c1"&gt;// modify request, call next(), modify response&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;

    &lt;span class="c1"&gt;// Wraps each turn of the generate loop&lt;/span&gt;
    &lt;span class="na"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;envelope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;next&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="c1"&gt;// access envelope.currentTurn, modify messages, tools, etc.&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;envelope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;

    &lt;span class="c1"&gt;// Wraps each tool execution&lt;/span&gt;
    &lt;span class="na"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;next&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="c1"&gt;// intercept tool calls, modify inputs/outputs&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Three hooks, three levels:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;model&lt;/code&gt;&lt;/strong&gt; — fires for every model API call. You see the raw request and response. Perfect for catching errors, tracking token usage, or modifying model config on the fly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;generate&lt;/code&gt;&lt;/strong&gt; — fires for each turn of the agentic loop. You get the full conversation, the current turn number, and can modify messages, tools, or short-circuit the loop entirely.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;tool&lt;/code&gt;&lt;/strong&gt; — fires for every tool execution. You can catch errors, modify inputs/outputs, or skip tools entirely.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key insight: &lt;strong&gt;the generate hook is recursive&lt;/strong&gt;. In a multi-turn agentic loop, each turn's generate hook runs &lt;em&gt;inside&lt;/em&gt; the previous turn's &lt;code&gt;next()&lt;/code&gt; call. The model hook, on the other hand, runs before the framework processes tool results and calls the next turn. Understanding this execution order is what makes powerful middleware possible.&lt;/p&gt;

&lt;p&gt;You also register the middleware as a plugin so it shows up in the Genkit Dev UI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ai&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;genkit&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;plugins&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;myMiddleware&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plugin&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="cm"&gt;/* plugin-level options */&lt;/span&gt; &lt;span class="p"&gt;})],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Then use it per-call:&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;googleai/gemini-flash-latest&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Research this topic&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;searchTool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;analyzeTool&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;use&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;myMiddleware&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="cm"&gt;/* per-call config */&lt;/span&gt; &lt;span class="p"&gt;})],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is genuinely powerful. You're not monkey-patching or wrapping &lt;code&gt;ai.generate()&lt;/code&gt; with utility functions. You're composing behavior at the framework level, with type-safe config schemas, proper lifecycle management, and full access to the Genkit runtime.&lt;/p&gt;

&lt;p&gt;Now let's look at the three problems and their middleware solutions.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;code&gt;softFail&lt;/code&gt;: Stop Crashing, Start Recovering
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem
&lt;/h3&gt;

&lt;p&gt;Your agent is 15 turns into a complex research task. It's called six tools, accumulated useful results, and is about to synthesize an answer. Then the model API returns a 503. &lt;code&gt;ai.generate()&lt;/code&gt; throws. All that accumulated context and tool output? Gone. Your flow crashes and the user gets an error.&lt;/p&gt;

&lt;p&gt;Or maybe a tool throws — a database query times out, an API returns an unexpected format. Same result: the whole agentic loop crashes.&lt;/p&gt;

&lt;p&gt;Or the agent hits &lt;code&gt;maxTurns&lt;/code&gt;. The framework throws a &lt;code&gt;GenerationResponseError&lt;/code&gt;. The model's last response — which might contain useful partial results — is buried inside the error object.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Solution
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;softFail&lt;/code&gt; catches all three failure modes and returns a clean &lt;code&gt;GenerateResponse&lt;/code&gt; with &lt;code&gt;finishReason: 'aborted'&lt;/code&gt; instead of throwing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;softFail&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;genkitx-misc/soft-fail&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;googleai/gemini-flash-latest&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Do something complex&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;riskyTool&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;use&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;softFail&lt;/span&gt;&lt;span class="p"&gt;()],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;finishReason&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aborted&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// No crash. No lost context. Just a clean signal.&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;details&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;custom&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="kr"&gt;any&lt;/span&gt;&lt;span class="p"&gt;)?.&lt;/span&gt;&lt;span class="nx"&gt;softFail&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Failed: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;details&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; — &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;details&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  How It Works
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;softFail&lt;/code&gt; uses all three middleware hooks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Model hook&lt;/strong&gt;: Wraps the model call in a try/catch. If the model throws, it returns a synthetic response with &lt;code&gt;finishReason: 'aborted'&lt;/code&gt; and stashes the error details in &lt;code&gt;response.custom.softFail&lt;/code&gt;. You can optionally filter by error status — only catch &lt;code&gt;UNAVAILABLE&lt;/code&gt; and &lt;code&gt;RESOURCE_EXHAUSTED&lt;/code&gt;, for instance, and let validation errors throw normally.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tool hook&lt;/strong&gt;: Wraps each tool execution. If a tool throws, the error message is returned to the model &lt;em&gt;as a normal tool response&lt;/em&gt; (&lt;code&gt;"Tool 'search' failed: connection timeout"&lt;/code&gt;). The model sees this and can recover — retry the tool, skip it, or wrap up with what it has. &lt;code&gt;ToolInterruptError&lt;/code&gt;s are never caught; those are intentional control flow.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Generate hook&lt;/strong&gt;: Catches the &lt;code&gt;GenerationResponseError&lt;/code&gt; that the framework throws when &lt;code&gt;maxTurns&lt;/code&gt; is exceeded. Instead of losing the model's last response, it extracts it from the error and returns it with &lt;code&gt;finishReason: 'aborted'&lt;/code&gt;. It also acts as a safety net for the model hook — if a synthetic aborted response triggers a downstream schema validation error, the generate hook re-surfaces the original aborted response.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Only catch specific model errors&lt;/span&gt;
&lt;span class="nx"&gt;use&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;softFail&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;modelStatuses&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;UNAVAILABLE&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;RESOURCE_EXHAUSTED&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;})]&lt;/span&gt;

&lt;span class="c1"&gt;// Don't catch tool errors — let them throw&lt;/span&gt;
&lt;span class="nx"&gt;use&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;softFail&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt; &lt;span class="p"&gt;})]&lt;/span&gt;

&lt;span class="c1"&gt;// Only handle max turns gracefully&lt;/span&gt;
&lt;span class="nx"&gt;use&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;softFail&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt; &lt;span class="p"&gt;})]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What Can You Do With an Aborted Response?
&lt;/h3&gt;

&lt;p&gt;The key insight is that an aborted response is still a valid &lt;code&gt;GenerateResponse&lt;/code&gt;. The conversation history — &lt;code&gt;response.messages&lt;/code&gt; — contains everything the agent accumulated up to the failure point: all the tool calls, tool responses, and model messages. You can feed that right back into &lt;code&gt;ai.generate()&lt;/code&gt; to pick up where you left off:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;googleai/gemini-flash-latest&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Research this topic thoroughly&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;searchTool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;analyzeTool&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;use&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;softFail&lt;/span&gt;&lt;span class="p"&gt;()],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;finishReason&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aborted&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;details&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;custom&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="kr"&gt;any&lt;/span&gt;&lt;span class="p"&gt;)?.&lt;/span&gt;&lt;span class="nx"&gt;softFail&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;details&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;reason&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;model-error&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Model had a transient error — retry with the full conversation intact&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Model failed, retrying with accumulated context...&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;retryResponse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;googleai/gemini-flash-latest&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// All prior context preserved&lt;/span&gt;
      &lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;searchTool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;analyzeTool&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="na"&gt;use&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;softFail&lt;/span&gt;&lt;span class="p"&gt;()],&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;details&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;reason&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;max-turns&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Agent ran out of turns — prompt user or continue later&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Agent needs more turns. Continue?&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// ... prompt user, then resume:&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;continued&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;googleai/gemini-flash-latest&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;searchTool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;analyzeTool&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="na"&gt;use&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;softFail&lt;/span&gt;&lt;span class="p"&gt;()],&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No accumulated work is lost. The agent's 15 turns of tool calls and reasoning are all in &lt;code&gt;response.messages&lt;/code&gt;, ready to be continued immediately, after a delay, or after prompting the user to check their connection.&lt;/p&gt;

&lt;h3&gt;
  
  
  Composing with Retry and Fallback
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;softFail&lt;/code&gt; composes naturally with retry and fallback middleware. Put &lt;code&gt;softFail&lt;/code&gt; outermost so it catches anything that still throws after retries are exhausted:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;use&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="nf"&gt;softFail&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;        &lt;span class="c1"&gt;// Last line of defense&lt;/span&gt;
  &lt;span class="nf"&gt;retry&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;    &lt;span class="c1"&gt;// Retry transient errors first&lt;/span&gt;
  &lt;span class="nf"&gt;fallback&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt; &lt;span class="c1"&gt;// Try alternate models&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;code&gt;smartMaxTurns&lt;/code&gt;: Detect Loops, Not Just Count Turns
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;maxTurns: 10&lt;/code&gt; is a blunt instrument. Set it too low and your agent can't finish complex tasks. Set it too high and a looping agent burns through tokens calling the same tool with the same arguments 47 times before hitting the limit. There's no way to say "stop when you're stuck, not when you've used N turns."&lt;/p&gt;

&lt;h3&gt;
  
  
  The Solution
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;smartMaxTurns&lt;/code&gt; replaces the rigid counter with intelligent loop detection. It watches the conversation and terminates when it detects the agent is stuck — not when an arbitrary number is reached:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;smartMaxTurns&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;genkitx-misc/smart-max-turns&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;googleai/gemini-flash-latest&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Research and summarize...&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;searchTool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;analyzeTool&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;use&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;smartMaxTurns&lt;/span&gt;&lt;span class="p"&gt;()],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;meta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;custom&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="kr"&gt;any&lt;/span&gt;&lt;span class="p"&gt;)?.&lt;/span&gt;&lt;span class="nx"&gt;smartMaxTurns&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Terminated: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; after &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;turnsUsed&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; turns`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  How It Works
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;smartMaxTurns&lt;/code&gt; takes ownership of turn management. It overrides the framework's &lt;code&gt;maxTurns&lt;/code&gt; to effectively infinite, then uses its generate hook to apply intelligent checks on every turn:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Two heuristic detectors (enabled by default, zero cost):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Exact loop detection&lt;/strong&gt; — Hashes tool calls across consecutive turns. If the agent calls the same tools with the same arguments N times in a row (default: 2), it's looping.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Response repetition&lt;/strong&gt; — Detects when tools return identical outputs across consecutive turns. If the same tool keeps returning the same result, the agent isn't making progress.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;One optional LLM judge (opt-in):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sends the conversation to a separate model and asks: "Is this agent making progress or stuck?" The judge responds &lt;code&gt;PROGRESSING&lt;/code&gt; or &lt;code&gt;STUCK&lt;/code&gt;. You can configure how often it checks (&lt;code&gt;every: 3&lt;/code&gt; = every 3 turns after &lt;code&gt;minTurns&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Three termination strategies:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Abort immediately (default) — return aborted response&lt;/span&gt;
&lt;span class="nx"&gt;use&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;smartMaxTurns&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;onDetection&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;abort&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;})]&lt;/span&gt;

&lt;span class="c1"&gt;// Wrap up — remove tools, ask model for a final answer&lt;/span&gt;
&lt;span class="nx"&gt;use&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;smartMaxTurns&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;onDetection&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;wrapUp&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;})]&lt;/span&gt;

&lt;span class="c1"&gt;// Prune — remove only the looping tools, let the agent continue with others&lt;/span&gt;
&lt;span class="nx"&gt;use&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;smartMaxTurns&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;onDetection&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;pruneTools&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;})]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;wrapUp&lt;/code&gt; strategy is particularly useful. Instead of hard-stopping, it strips all tools from the request and injects a message: &lt;em&gt;"You have spent several turns working on this task. Please provide your best final answer now based on what you have learned so far."&lt;/em&gt; The model gets one final toolless turn to synthesize everything it's gathered.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pruneTools&lt;/code&gt; is even more nuanced — it only removes the tools that were involved in the loop. If the agent was looping on &lt;code&gt;searchTool&lt;/code&gt; but also has &lt;code&gt;analyzeTool&lt;/code&gt; available, it removes &lt;code&gt;searchTool&lt;/code&gt; and lets the agent continue with &lt;code&gt;analyzeTool&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;use&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;smartMaxTurns&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;maxTurns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;// Hard ceiling (safety net)&lt;/span&gt;
  &lt;span class="na"&gt;minTurns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;// Don't check until turn 5&lt;/span&gt;
  &lt;span class="na"&gt;onDetection&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;wrapUp&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// Ask for a final answer&lt;/span&gt;
  &lt;span class="na"&gt;detect&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;exactLoops&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;       &lt;span class="c1"&gt;// 3 identical calls to trigger&lt;/span&gt;
    &lt;span class="na"&gt;responseRepetition&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="c1"&gt;// 3 identical responses to trigger&lt;/span&gt;
    &lt;span class="na"&gt;llmJudge&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;every&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;             &lt;span class="c1"&gt;// Check every 2 turns after minTurns&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;})]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;code&gt;contextCompression&lt;/code&gt;: Shrink the Conversation, Keep the Knowledge
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem
&lt;/h3&gt;

&lt;p&gt;Long-running agents accumulate context fast. Each tool call adds a request and response to the message history. By turn 20, you might have 150K tokens of context — most of which is verbose tool output from early turns that the model doesn't need anymore. You're paying for all of it on every subsequent turn. And eventually you hit the model's context window limit.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Solution
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;contextCompression&lt;/code&gt; monitors token usage and automatically compresses the conversation when it gets too large. It triggers based on the &lt;em&gt;actual&lt;/em&gt; &lt;code&gt;inputTokens&lt;/code&gt; reported by the model — no custom tokenizer needed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;contextCompression&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;genkitx-misc/context-compression&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;googleai/gemini-flash-latest&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Research and summarize...&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;searchTool&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;use&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;contextCompression&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;maxInputTokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;80000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;toolResponses&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;maxChars&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2000&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;summarize&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;googleai/gemini-flash-lite-latest&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;})],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  How It Works
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;contextCompression&lt;/code&gt; uses both the model and generate hooks in a coordinated dance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Model hook&lt;/strong&gt;: After each model call, records &lt;code&gt;inputTokens&lt;/code&gt; from the response usage metadata. This is what the generate hook checks on the &lt;em&gt;next&lt;/em&gt; turn to decide whether to compress. The model hook also attaches compression metadata to &lt;code&gt;response.custom&lt;/code&gt; so it propagates through to the final &lt;code&gt;GenerateResponse&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Generate hook&lt;/strong&gt;: On each turn, checks if the previous turn's &lt;code&gt;inputTokens&lt;/code&gt; exceeded &lt;code&gt;maxInputTokens&lt;/code&gt;. If so, applies compression strategies in order:&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Three composable strategies:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tool response truncation&lt;/strong&gt; — The cheapest option. Truncates verbose tool outputs to a character limit (&lt;code&gt;maxChars: 2000&lt;/code&gt;), preserving the N most recent tool responses untouched. No LLM call needed. A 50KB API response becomes a 2KB excerpt with a &lt;code&gt;…[truncated]&lt;/code&gt; marker.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Message truncation&lt;/strong&gt; — Drops the oldest messages beyond a hard cap (&lt;code&gt;maxMessages: 30&lt;/code&gt;), always preserving system messages and recent messages. Blunt but effective when you just need to stay under a token limit.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;LLM summarization&lt;/strong&gt; — Replaces older messages with a condensed summary generated by a (cheap, fast) model. The summary preserves important facts, decisions, and tool results while dramatically reducing token count. Summaries are cached across turns — if no new messages have shifted into the summarization window, the cached summary is reused without another LLM call.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;use&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;contextCompression&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;maxInputTokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;80000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

  &lt;span class="c1"&gt;// Strategy 1: Truncate tool responses beyond 2000 chars&lt;/span&gt;
  &lt;span class="c1"&gt;// (keep last 2 tool responses intact)&lt;/span&gt;
  &lt;span class="na"&gt;toolResponses&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;maxChars&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;preserveRecent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;

  &lt;span class="c1"&gt;// Strategy 2: Hard cap at 40 messages&lt;/span&gt;
  &lt;span class="na"&gt;maxMessages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

  &lt;span class="c1"&gt;// Strategy 3: Summarize old messages with a cheap model&lt;/span&gt;
  &lt;span class="na"&gt;summarize&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;googleai/gemini-flash-lite-latest&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;preserveRecent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// Keep last 6 messages un-summarized&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;})]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The strategies compose. On a compression trigger, tool responses get truncated first, then messages are capped, then remaining old messages are summarized. You can use any combination — just tool response truncation for a zero-LLM-cost option, or the full pipeline for maximum compression.&lt;/p&gt;

&lt;p&gt;The summary caching is worth highlighting: after the first summarization, the middleware tracks which messages have been summarized. On subsequent turns, if the summary message is still the oldest non-system message, the cached summary is reused. Only when new messages shift into the summarization window does it regenerate — and even then, it uses incremental summarization (&lt;code&gt;[Previous summary] + [New messages]&lt;/code&gt;) rather than re-summarizing everything.&lt;/p&gt;




&lt;h2&gt;
  
  
  Composing Middleware
&lt;/h2&gt;

&lt;p&gt;These three middleware are designed to work together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;googleai/gemini-flash-latest&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Research and write a comprehensive report on...&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;searchTool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;analyzeTool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;writeTool&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;use&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nf"&gt;softFail&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;                                    &lt;span class="c1"&gt;// Catch crashes&lt;/span&gt;
    &lt;span class="nf"&gt;smartMaxTurns&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;onDetection&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;wrapUp&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;      &lt;span class="c1"&gt;// Detect loops&lt;/span&gt;
    &lt;span class="nf"&gt;contextCompression&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;                           &lt;span class="c1"&gt;// Manage context size&lt;/span&gt;
      &lt;span class="na"&gt;maxInputTokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;80000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;toolResponses&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;maxChars&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2000&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="na"&gt;summarize&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;googleai/gemini-flash-lite-latest&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The agent won't crash if the model or a tool throws&lt;/li&gt;
&lt;li&gt;It won't loop forever calling the same tool&lt;/li&gt;
&lt;li&gt;It won't burn through tokens with ever-growing context&lt;/li&gt;
&lt;li&gt;If it does get stuck, it'll wrap up with a final answer instead of hard-stopping&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of this with &lt;strong&gt;zero changes to your tools, prompts, or flow logic&lt;/strong&gt;. Just &lt;code&gt;use: [...]&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  First-Party Middleware
&lt;/h2&gt;

&lt;p&gt;Genkit also ships a set of middleware out of the box in the &lt;a href="https://www.npmjs.com/package/@genkit-ai/middleware" rel="noopener noreferrer"&gt;&lt;code&gt;@genkit-ai/middleware&lt;/code&gt;&lt;/a&gt; package:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;retry&lt;/code&gt;&lt;/strong&gt; — Automatic retries with exponential backoff on transient errors (&lt;code&gt;RESOURCE_EXHAUSTED&lt;/code&gt;, &lt;code&gt;UNAVAILABLE&lt;/code&gt;, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;fallback&lt;/code&gt;&lt;/strong&gt; — Switch to a backup model when the primary fails on specific error codes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;toolApproval&lt;/code&gt;&lt;/strong&gt; — Restrict tool execution to an approved list; unapproved tools trigger a &lt;code&gt;ToolInterruptError&lt;/code&gt; for human-in-the-loop confirmation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;filesystem&lt;/code&gt;&lt;/strong&gt; — Grant the model access to the local filesystem with sandboxed file manipulation tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;skills&lt;/code&gt;&lt;/strong&gt; — Auto-inject &lt;code&gt;SKILL.md&lt;/code&gt; files into the system prompt and provide a &lt;code&gt;use_skill&lt;/code&gt; tool for on-demand skill retrieval&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These compose with the middleware in this article. For example, &lt;code&gt;softFail&lt;/code&gt; + &lt;code&gt;retry&lt;/code&gt; + &lt;code&gt;fallback&lt;/code&gt; is a natural stack: retry transient errors, fall back to a cheaper model, and if everything still fails, return a clean aborted response instead of crashing.&lt;/p&gt;

&lt;p&gt;Learn more: &lt;a href="https://genkit.dev/docs/js/middleware/" rel="noopener noreferrer"&gt;Genkit Middleware docs&lt;/a&gt; · &lt;a href="https://www.npmjs.com/package/@genkit-ai/middleware" rel="noopener noreferrer"&gt;&lt;code&gt;@genkit-ai/middleware&lt;/code&gt; on npm&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install &lt;/span&gt;genkitx-misc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;softFail&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;genkitx-misc/soft-fail&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;smartMaxTurns&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;genkitx-misc/smart-max-turns&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;contextCompression&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;genkitx-misc/context-compression&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;genkitx-misc&lt;/code&gt; package also includes &lt;a href="https://github.com/pavelgj/genkitx-misc/blob/main/docs/quota.md" rel="noopener noreferrer"&gt;quota&lt;/a&gt;, &lt;a href="https://github.com/pavelgj/genkitx-misc/blob/main/docs/cache.md" rel="noopener noreferrer"&gt;cache&lt;/a&gt;, and &lt;a href="https://github.com/pavelgj/genkitx-misc/blob/main/docs/router.md" rel="noopener noreferrer"&gt;router&lt;/a&gt; middleware — all built on the same &lt;code&gt;generateMiddleware()&lt;/code&gt; API.&lt;/p&gt;

&lt;p&gt;Full docs, examples, and source: &lt;a href="https://github.com/pavelgj/genkitx-misc" rel="noopener noreferrer"&gt;github.com/pavelgj/genkitx-misc&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The &lt;code&gt;generateMiddleware()&lt;/code&gt; API is available in &lt;code&gt;genkit/beta&lt;/code&gt;. These middleware work with any Genkit-compatible model — Gemini, Claude, OpenAI, Ollama, or any custom model plugin.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>genkit</category>
    </item>
  </channel>
</rss>
