Stop Your AI Agents From Crashing, Looping, and Burning Through Tokens

If you've built agentic workflows with LLMs — the kind where a model calls tools, reasons over results, and loops back for more — you've hit the wall. Not the conceptual wall. The very real, very expensive wall where your agent crashes at turn 47 because a model returned a 503, or silently burns $12 calling the same search tool in an infinite loop, or stuffs 200K tokens of context into a request that could've been 20K.

These aren't edge cases. They're the default behavior of every agentic loop that runs long enough. And until now, the fix was the same every time: wrap everything in try/catch, add a turn counter, pray.

There's a better way now. Google's Genkit just shipped a generateMiddleware() API that lets you intercept and modify AI generation at every level — model calls, tool execution, and the entire generate loop. Think of it as Express middleware, but for LLM inference. And it changes how you build resilient agents.

I built three middleware on top of it — softFail, smartMaxTurns, and contextCompression — that solve the three problems I kept hitting in production. This article walks through the middleware API itself, why it matters, and how each middleware works.


The Middleware API: Intercept Everything

Genkit's generateMiddleware() gives you hooks into three layers of the generation process:

import { generateMiddleware } from 'genkit/beta';
import { z } from 'genkit';

const myMiddleware = generateMiddleware(
  {
    name: 'myMiddleware',
    configSchema: z.object({ /* per-call config */ }),
  },
  ({ config, pluginConfig, ai }) => ({
    // Wraps each model call (request → response)
    model: async (req, ctx, next) => {
      // modify request, call next(), modify response
      return next(req, ctx);
    },

    // Wraps each turn of the generate loop
    generate: async (envelope, ctx, next) => {
      // access envelope.currentTurn, modify messages, tools, etc.
      return next(envelope, ctx);
    },

    // Wraps each tool execution
    tool: async (req, ctx, next) => {
      // intercept tool calls, modify inputs/outputs
      return next(req, ctx);
    },
  })
);

Three hooks, three levels:

  • model — fires for every model API call. You see the raw request and response. Perfect for catching errors, tracking token usage, or modifying model config on the fly.
  • generate — fires for each turn of the agentic loop. You get the full conversation, the current turn number, and can modify messages, tools, or short-circuit the loop entirely.
  • tool — fires for every tool execution. You can catch errors, modify inputs/outputs, or skip tools entirely.

The key insight: the generate hook is recursive. In a multi-turn agentic loop, each turn's generate hook runs inside the previous turn's next() call. The model hook, on the other hand, runs before the framework processes tool results and calls the next turn. Understanding this execution order is what makes powerful middleware possible.
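
To make that ordering visible, here's a minimal tracing middleware. This is a sketch that reuses the skeleton above; the traceOrder name and the log lines are purely illustrative:

import { generateMiddleware } from 'genkit/beta';
import { z } from 'genkit';

// Sketch: log when each hook fires to watch the nesting order.
const traceOrder = generateMiddleware(
  { name: 'traceOrder', configSchema: z.object({}) },
  () => ({
    generate: async (envelope, ctx, next) => {
      console.log(`-> generate hook, turn ${envelope.currentTurn}`);
      const result = await next(envelope, ctx);
      console.log(`<- generate hook, turn ${envelope.currentTurn}`);
      return result;
    },
    model: async (req, ctx, next) => {
      console.log('  -> model hook (before the API call)');
      const res = await next(req, ctx);
      console.log('  <- model hook (after the API call)');
      return res;
    },
    tool: async (req, ctx, next) => {
      console.log('    tool hook fired');
      return next(req, ctx);
    },
  })
);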

You also register the middleware as a plugin so it shows up in the Genkit Dev UI:

import { genkit } from 'genkit';

const ai = genkit({
  plugins: [myMiddleware.plugin({ /* plugin-level options */ })],
});

// Then use it per-call:
const response = await ai.generate({
  model: 'googleai/gemini-flash-latest',
  prompt: 'Research this topic',
  tools: [searchTool, analyzeTool],
  use: [myMiddleware({ /* per-call config */ })],
});

This is genuinely powerful. You're not monkey-patching or wrapping ai.generate() with utility functions. You're composing behavior at the framework level, with type-safe config schemas, proper lifecycle management, and full access to the Genkit runtime.

Now let's look at the three problems and their middleware solutions.


softFail: Stop Crashing, Start Recovering

The Problem

Your agent is 15 turns into a complex research task. It's called six tools, accumulated useful results, and is about to synthesize an answer. Then the model API returns a 503. ai.generate() throws. All that accumulated context and tool output? Gone. Your flow crashes and the user gets an error.

Or maybe a tool throws — a database query times out, an API returns an unexpected format. Same result: the whole agentic loop crashes.

Or the agent hits maxTurns. The framework throws a GenerationResponseError. The model's last response — which might contain useful partial results — is buried inside the error object.

The Solution

softFail catches all three failure modes and returns a clean GenerateResponse with finishReason: 'aborted' instead of throwing:

import { softFail } from 'genkitx-misc/soft-fail';

const response = await ai.generate({
  model: 'googleai/gemini-flash-latest',
  prompt: 'Do something complex',
  tools: [riskyTool],
  use: [softFail()],
});

if (response.finishReason === 'aborted') {
  // No crash. No lost context. Just a clean signal.
  const details = (response.custom as any)?.softFail;
  console.log(`Failed: ${details.reason}: ${details.error}`);
}

How It Works

softFail uses all three middleware hooks:

  • Model hook: Wraps the model call in a try/catch. If the model throws, it returns a synthetic response with finishReason: 'aborted' and stashes the error details in response.custom.softFail. You can optionally filter by error status — only catch UNAVAILABLE and RESOURCE_EXHAUSTED, for instance, and let validation errors throw normally.

  • Tool hook: Wraps each tool execution. If a tool throws, the error message is returned to the model as a normal tool response ("Tool 'search' failed: connection timeout"). The model sees this and can recover — retry the tool, skip it, or wrap up with what it has. ToolInterruptErrors are never caught; those are intentional control flow.

  • Generate hook: Catches the GenerationResponseError that the framework throws when maxTurns is exceeded. Instead of losing the model's last response, it extracts it from the error and returns it with finishReason: 'aborted'. It also acts as a safety net for the model hook — if a synthetic aborted response triggers a downstream schema validation error, the generate hook re-surfaces the original aborted response.

// Only catch specific model errors
use: [softFail({ modelStatuses: ['UNAVAILABLE', 'RESOURCE_EXHAUSTED'] })]

// Don't catch tool errors — let them throw
use: [softFail({ tools: false })]

// Only handle max turns gracefully
use: [softFail({ model: false, tools: false })]
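
Under the hood, the model hook boils down to a try/catch that turns a thrown error into a synthetic response. Here's a rough sketch of that pattern, plugged into the middleware skeleton from earlier (not the actual source; the synthetic response fields are indicative of Genkit's model response shape):

// Sketch only: catch the model error and return a clean aborted response.
model: async (req, ctx, next) => {
  try {
    return await next(req, ctx);
  } catch (err) {
    return {
      finishReason: 'aborted' as const,
      message: { role: 'model' as const, content: [] },
      custom: { softFail: { reason: 'model-error', error: String(err) } },
    };
  }
},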

What Can You Do With an Aborted Response?

The key insight is that an aborted response is still a valid GenerateResponse. The conversation history — response.messages — contains everything the agent accumulated up to the failure point: all the tool calls, tool responses, and model messages. You can feed that right back into ai.generate() to pick up where you left off:

const response = await ai.generate({
  model: 'googleai/gemini-flash-latest',
  prompt: 'Research this topic thoroughly',
  tools: [searchTool, analyzeTool],
  use: [softFail()],
});

if (response.finishReason === 'aborted') {
  const details = (response.custom as any)?.softFail;

  if (details?.reason === 'model-error') {
    // Model had a transient error — retry with the full conversation intact
    console.log('Model failed, retrying with accumulated context...');
    const retryResponse = await ai.generate({
      model: 'googleai/gemini-flash-latest',
      messages: response.messages, // All prior context preserved
      tools: [searchTool, analyzeTool],
      use: [softFail()],
    });
  }

  if (details?.reason === 'max-turns') {
    // Agent ran out of turns — prompt user or continue later
    console.log('Agent needs more turns. Continue?');
    // ... prompt user, then resume:
    const continued = await ai.generate({
      model: 'googleai/gemini-flash-latest',
      messages: response.messages,
      tools: [searchTool, analyzeTool],
      use: [softFail()],
    });
  }
}

No accumulated work is lost. The agent's 15 turns of tool calls and reasoning are all in response.messages, ready to be continued immediately, after a delay, or after prompting the user to check their connection.

Composing with Retry and Fallback

softFail composes naturally with retry and fallback middleware. Put softFail outermost so it catches anything that still throws after retries are exhausted:

use: [
  softFail(),        // Last line of defense
  retry({ ... }),    // Retry transient errors first
  fallback({ ... }), // Try alternate models
]

smartMaxTurns: Detect Loops, Not Just Count Turns

The Problem

maxTurns: 10 is a blunt instrument. Set it too low and your agent can't finish complex tasks. Set it too high and a looping agent burns through tokens calling the same tool with the same arguments 47 times before hitting the limit. There's no way to say "stop when you're stuck, not when you've used N turns."

The Solution

smartMaxTurns replaces the rigid counter with intelligent loop detection. It watches the conversation and terminates when it detects the agent is stuck — not when an arbitrary number is reached:

import { smartMaxTurns } from 'genkitx-misc/smart-max-turns';

const response = await ai.generate({
  model: 'googleai/gemini-flash-latest',
  prompt: 'Research and summarize...',
  tools: [searchTool, analyzeTool],
  use: [smartMaxTurns()],
});

const meta = (response.custom as any)?.smartMaxTurns;
if (meta) {
  console.log(`Terminated: ${meta.reason} after ${meta.turnsUsed} turns`);
}

How It Works

smartMaxTurns takes ownership of turn management. It overrides the framework's maxTurns with an effectively unlimited value, then uses its generate hook to apply intelligent checks on every turn:

Two heuristic detectors (enabled by default, zero cost):

  • Exact loop detection — Hashes tool calls across consecutive turns. If the agent calls the same tools with the same arguments N times in a row (default: 2), it's looping.
  • Response repetition — Detects when tools return identical outputs across consecutive turns. If the same tool keeps returning the same result, the agent isn't making progress.

One optional LLM judge (opt-in):

  • Sends the conversation to a separate model and asks: "Is this agent making progress or stuck?" The judge responds PROGRESSING or STUCK. You can configure how often it checks (every: 3 = every 3 turns after minTurns).
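
To make the exact-loop heuristic concrete, here's a rough standalone version of the idea: hash each turn's tool calls and count consecutive repeats. The helper names are hypothetical; in the middleware this logic runs inside the generate hook:

import { createHash } from 'node:crypto';

// Hash one turn's tool calls (names + arguments) into a stable fingerprint.
function hashToolCalls(calls: { name: string; input: unknown }[]): string {
  const canonical = JSON.stringify(calls.map((c) => [c.name, c.input]));
  return createHash('sha256').update(canonical).digest('hex');
}

// Looping if the last `threshold` turns produced identical tool calls.
function isExactLoop(turnHashes: string[], threshold = 2): boolean {
  if (turnHashes.length < threshold) return false;
  const recent = turnHashes.slice(-threshold);
  return recent.every((h) => h === recent[0]);
}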

Three termination strategies:

// Abort immediately (default) — return aborted response
use: [smartMaxTurns({ onDetection: 'abort' })]

// Wrap up — remove tools, ask model for a final answer
use: [smartMaxTurns({ onDetection: 'wrapUp' })]

// Prune — remove only the looping tools, let the agent continue with others
use: [smartMaxTurns({ onDetection: 'pruneTools' })]

The wrapUp strategy is particularly useful. Instead of hard-stopping, it strips all tools from the request and injects a message: "You have spent several turns working on this task. Please provide your best final answer now based on what you have learned so far." The model gets one final turn, with no tools available, to synthesize everything it has gathered.
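
In generate-hook terms, wrapUp amounts to rewriting the next turn's request: drop the tools and append a wrap-up instruction. A hypothetical sketch (the envelope and message shapes are assumptions, and loopDetected stands in for the detectors described above):

generate: async (envelope, ctx, next) => {
  if (loopDetected(envelope)) {
    return next(
      {
        ...envelope,
        tools: [], // strip all tools so the model must answer directly
        messages: [
          ...envelope.messages,
          {
            role: 'user',
            content: [{
              text:
                'You have spent several turns working on this task. Please ' +
                'provide your best final answer now based on what you have ' +
                'learned so far.',
            }],
          },
        ],
      },
      ctx
    );
  }
  return next(envelope, ctx);
},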

pruneTools is even more nuanced — it only removes the tools that were involved in the loop. If the agent was looping on searchTool but also has analyzeTool available, it removes searchTool and lets the agent continue with analyzeTool.

use: [smartMaxTurns({
  maxTurns: 25,          // Hard ceiling (safety net)
  minTurns: 5,           // Don't check until turn 5
  onDetection: 'wrapUp', // Ask for a final answer
  detect: {
    exactLoops: { threshold: 3 },       // 3 identical calls to trigger
    responseRepetition: { threshold: 3 }, // 3 identical responses to trigger
    llmJudge: { every: 2 },             // Check every 2 turns after minTurns
  },
})]

contextCompression: Shrink the Conversation, Keep the Knowledge

The Problem

Long-running agents accumulate context fast. Each tool call adds a request and response to the message history. By turn 20, you might have 150K tokens of context — most of which is verbose tool output from early turns that the model doesn't need anymore. You're paying for all of it on every subsequent turn. And eventually you hit the model's context window limit.

The Solution

contextCompression monitors token usage and automatically compresses the conversation when it gets too large. It triggers based on the actual inputTokens reported by the model — no custom tokenizer needed:

import { contextCompression } from 'genkitx-misc/context-compression';

const response = await ai.generate({
  model: 'googleai/gemini-flash-latest',
  prompt: 'Research and summarize...',
  tools: [searchTool],
  use: [contextCompression({
    maxInputTokens: 80000,
    toolResponses: { maxChars: 2000 },
    summarize: {
      model: { name: 'googleai/gemini-flash-lite-latest' },
    },
  })],
});

How It Works

contextCompression uses both the model and generate hooks in a coordinated dance:

  • Model hook: After each model call, records inputTokens from the response usage metadata. This is what the generate hook checks on the next turn to decide whether to compress. The model hook also attaches compression metadata to response.custom so it propagates through to the final GenerateResponse.

  • Generate hook: On each turn, checks if the previous turn's inputTokens exceeded maxInputTokens. If so, applies compression strategies in order:

Three composable strategies:

  1. Tool response truncation — The cheapest option. Truncates verbose tool outputs to a character limit (maxChars: 2000), leaving the N most recent tool responses untouched. No LLM call needed. A 50KB API response becomes a 2KB excerpt with a …[truncated] marker.

  2. Message truncation — Drops the oldest messages beyond a hard cap (maxMessages: 30), always preserving system messages and recent messages. Blunt but effective when you just need to stay under a token limit.

  3. LLM summarization — Replaces older messages with a condensed summary generated by a (cheap, fast) model. The summary preserves important facts, decisions, and tool results while dramatically reducing token count. Summaries are cached across turns — if no new messages have shifted into the summarization window, the cached summary is reused without another LLM call.

use: [contextCompression({
  maxInputTokens: 80000,

  // Strategy 1: Truncate tool responses beyond 2000 chars
  // (keep last 2 tool responses intact)
  toolResponses: { maxChars: 2000, preserveRecent: 2 },

  // Strategy 2: Hard cap at 40 messages
  maxMessages: 40,

  // Strategy 3: Summarize old messages with a cheap model
  summarize: {
    model: { name: 'googleai/gemini-flash-lite-latest' },
    preserveRecent: 6, // Keep last 6 messages un-summarized
  },
})]

The strategies compose. On a compression trigger, tool responses get truncated first, then messages are capped, then remaining old messages are summarized. You can use any combination — just tool response truncation for a zero-LLM-cost option, or the full pipeline for maximum compression.
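
As a standalone illustration of strategy 1, here's roughly what tool response truncation does. The real middleware operates on Genkit message parts; this sketch uses a simplified local shape:

type ToolOutput = { toolName: string; text: string };

// Truncate older tool outputs to maxChars, leaving the `preserveRecent`
// most recent outputs untouched (mirroring the config above).
function truncateToolOutputs(
  outputs: ToolOutput[],
  maxChars = 2000,
  preserveRecent = 2
): ToolOutput[] {
  const cutoff = outputs.length - preserveRecent;
  return outputs.map((o, i) =>
    i < cutoff && o.text.length > maxChars
      ? { ...o, text: o.text.slice(0, maxChars) + ' …[truncated]' }
      : o
  );
}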

The summary caching is worth highlighting: after the first summarization, the middleware tracks which messages have been summarized. On subsequent turns, if the summary message is still the oldest non-system message, the cached summary is reused. Only when new messages shift into the summarization window does it regenerate — and even then, it uses incremental summarization ([Previous summary] + [New messages]) rather than re-summarizing everything.
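
The incremental step is essentially one cheap generate call that folds the newly shifted messages into the existing summary. A hypothetical sketch (the prompt wording is illustrative, not the middleware's actual prompt):

import type { Genkit } from 'genkit';

async function updateSummary(
  ai: Genkit,
  previousSummary: string,
  newMessages: string[]
): Promise<string> {
  // Fold only the new messages into the prior summary instead of
  // re-summarizing the whole conversation.
  const { text } = await ai.generate({
    model: 'googleai/gemini-flash-lite-latest',
    prompt:
      `Previous summary:\n${previousSummary}\n\n` +
      `New messages:\n${newMessages.join('\n')}\n\n` +
      'Update the summary. Preserve important facts, decisions, and tool results.',
  });
  return text;
}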


Composing Middleware

These three middleware are designed to work together:

const response = await ai.generate({
  model: 'googleai/gemini-flash-latest',
  prompt: 'Research and write a comprehensive report on...',
  tools: [searchTool, analyzeTool, writeTool],
  use: [
    softFail(),                                    // Catch crashes
    smartMaxTurns({ onDetection: 'wrapUp' }),      // Detect loops
    contextCompression({                           // Manage context size
      maxInputTokens: 80000,
      toolResponses: { maxChars: 2000 },
      summarize: { model: { name: 'googleai/gemini-flash-lite-latest' } },
    }),
  ],
});

With this stack:

  • The agent won't crash if the model or a tool throws
  • It won't loop forever calling the same tool
  • It won't burn through tokens with ever-growing context
  • If it does get stuck, it'll wrap up with a final answer instead of hard-stopping

All of this with zero changes to your tools, prompts, or flow logic. Just use: [...].


First-Party Middleware

Genkit also ships a set of middleware out of the box in the @genkit-ai/middleware package:

  • retry — Automatic retries with exponential backoff on transient errors (RESOURCE_EXHAUSTED, UNAVAILABLE, etc.)
  • fallback — Switch to a backup model when the primary fails on specific error codes
  • toolApproval — Restrict tool execution to an approved list; unapproved tools trigger a ToolInterruptError for human-in-the-loop confirmation
  • filesystem — Grant the model access to the local filesystem with sandboxed file manipulation tools
  • skills — Auto-inject SKILL.md files into the system prompt and provide a use_skill tool for on-demand skill retrieval

These compose with the middleware in this article. For example, softFail + retry + fallback is a natural stack: retry transient errors, fall back to a cheaper model, and if everything still fails, return a clean aborted response instead of crashing.
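
A minimal sketch of that stack, assuming retry and fallback are exported from @genkit-ai/middleware (check the package docs for the exact option names, which are elided here):

import { retry, fallback } from '@genkit-ai/middleware';
import { softFail } from 'genkitx-misc/soft-fail';

const response = await ai.generate({
  model: 'googleai/gemini-flash-latest',
  prompt: 'Summarize the latest findings',
  use: [
    softFail(),                        // outermost: clean aborted response if all else fails
    retry({ /* backoff options */ }),  // retry transient errors first
    fallback({ /* backup model */ }),  // then try an alternate model
  ],
});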

Learn more: Genkit Middleware docs · @genkit-ai/middleware on npm


Getting Started

npm install genkitx-misc

import { softFail } from 'genkitx-misc/soft-fail';
import { smartMaxTurns } from 'genkitx-misc/smart-max-turns';
import { contextCompression } from 'genkitx-misc/context-compression';

The genkitx-misc package also includes quota, cache, and router middleware — all built on the same generateMiddleware() API.

Full docs, examples, and source: github.com/pavelgj/genkitx-misc


The generateMiddleware() API is available in genkit/beta. These middleware work with any Genkit-compatible model — Gemini, Claude, OpenAI, Ollama, or any custom model plugin.
