<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Avinash Sangle</title>
    <description>The latest articles on DEV Community by Avinash Sangle (@aavisangle).</description>
    <link>https://dev.to/aavisangle</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3878490%2F16544dab-61bc-4ca8-823e-58734c16fcd0.png</url>
      <title>DEV Community: Avinash Sangle</title>
      <link>https://dev.to/aavisangle</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aavisangle"/>
    <language>en</language>
    <item>
      <title>Gemini 3.5 Flash for Agentic Coding: A Claude Coder's Guide</title>
      <dc:creator>Avinash Sangle</dc:creator>
      <pubDate>Mon, 01 Jun 2026 05:06:59 +0000</pubDate>
      <link>https://dev.to/aavisangle/gemini-35-flash-for-agentic-coding-a-claude-coders-guide-56o3</link>
      <guid>https://dev.to/aavisangle/gemini-35-flash-for-agentic-coding-a-claude-coders-guide-56o3</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This article was originally published on &lt;a href="https://avinashsangle.com/blog/gemini-3-5-flash-agentic-coding-guide" rel="noopener noreferrer"&gt;avinashsangle.com&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Gemini 3.5 Flash is Google's new Flash-tier coding model, generally available since May 19, 2026. It scores 76.2% on Terminal-Bench 2.1 and 83.6% on MCP Atlas, beating Gemini 3.1 Pro on 11 of 15 benchmarks. Pricing is $1.50 input and $9 output per 1M tokens. For Claude Code users, it's the right model for tool-heavy agent loops, not a replacement for production code edits.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What it is:&lt;/strong&gt; Gemini 3.5 Flash (GA May 19, 2026) is a Flash-tier model that outperforms Gemini 3.1 Pro on agentic benchmarks while costing 25% less per token than the Pro tier.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing reality:&lt;/strong&gt; $1.50/$9 per 1M tokens looks cheap, but it's 3x the price of Gemini 3 Flash Preview and runs about 5.5x more expensive per full benchmark suite according to Artificial Analysis.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The thinking_level trap:&lt;/strong&gt; the default dropped from &lt;code&gt;high&lt;/code&gt; to &lt;code&gt;medium&lt;/code&gt;. Copy-pasted code from &lt;code&gt;gemini-3-flash-preview&lt;/code&gt; silently produces dumber outputs. For agentic coding, set &lt;code&gt;thinking_level: "low"&lt;/code&gt; explicitly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Where Flash wins:&lt;/strong&gt; MCP tool orchestration (83.6% MCP Atlas, beats Claude Opus 4.7 by 4.5 points), parallel function calling, fast iterative agent loops.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Where Claude Code still wins:&lt;/strong&gt; production codebase editing (Sonnet 4.6 leads SWE-Bench Verified), defensive code, long-context retrieval past 128k tokens.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Routing rule:&lt;/strong&gt; keep Claude Code for &lt;code&gt;Edit&lt;/code&gt; and &lt;code&gt;Write&lt;/code&gt; tasks; route MCP-heavy planning and tool fan-out to Gemini 3.5 Flash via OpenRouter or a thin custom MCP server.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What is Gemini 3.5 Flash and what changed on May 19, 2026
&lt;/h2&gt;

&lt;p&gt;Gemini 3.5 Flash is a Flash-tier Gemini model that Google announced at I/O 2026 and shipped straight to GA on the same day. It is the first Flash-tier model to outperform the previous Pro tier on real agentic coding benchmarks. The launch lives on the &lt;a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-5/" rel="noopener noreferrer"&gt;official Google blog&lt;/a&gt; and the technical details on the &lt;a href="https://deepmind.google/models/model-cards/gemini-3-5-flash/" rel="noopener noreferrer"&gt;Google DeepMind model card&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The model is available on the Gemini API, AI Studio, Antigravity CLI (the successor to Gemini CLI), Vertex AI, the Gemini app, AI Mode in Search, and now GitHub Copilot per the &lt;a href="https://github.blog/changelog/2026-05-19-gemini-3-5-flash-is-generally-available-for-github-copilot/" rel="noopener noreferrer"&gt;May 19 changelog&lt;/a&gt;. The context window is 1,048,576 input tokens with a 65,536 output cap.&lt;/p&gt;

&lt;p&gt;Why this matters for a Claude Code user: the cheap model is now smart enough to handle production agent loops. That changes routing math, not loyalty. If you already run Sonnet 4.6 or Opus 4.7 inside Claude Code, you don't throw the stack away. You ask which subtasks now belong on a cheaper, faster Gemini call.&lt;/p&gt;

&lt;h2&gt;
  
  
  Gemini 3.5 Flash benchmarks: where it beats Gemini 3.1 Pro
&lt;/h2&gt;

&lt;p&gt;Gemini 3.5 Flash wins 11 of 15 published benchmarks against Gemini 3.1 Pro, including the ones that matter most for agentic coding. The headline numbers from the &lt;a href="https://deepmind.google/models/model-cards/gemini-3-5-flash/" rel="noopener noreferrer"&gt;Google DeepMind model card&lt;/a&gt; and the &lt;a href="https://wavespeed.ai/blog/posts/gemini-3-5-flash-shipped-leads-agent-benchmarks/" rel="noopener noreferrer"&gt;WaveSpeed roundup&lt;/a&gt; are below.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Gemini 3.5 Flash&lt;/th&gt;
&lt;th&gt;Gemini 3.1 Pro&lt;/th&gt;
&lt;th&gt;Claude Opus 4.7&lt;/th&gt;
&lt;th&gt;GPT-5.5&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Terminal-Bench 2.1&lt;/td&gt;
&lt;td&gt;76.2%&lt;/td&gt;
&lt;td&gt;70.3%&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;td&gt;78.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCP Atlas&lt;/td&gt;
&lt;td&gt;83.6%&lt;/td&gt;
&lt;td&gt;78.2%&lt;/td&gt;
&lt;td&gt;79.1%&lt;/td&gt;
&lt;td&gt;75.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GDPval-AA (Elo)&lt;/td&gt;
&lt;td&gt;1656&lt;/td&gt;
&lt;td&gt;1314&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;td&gt;1769&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SWE-Bench Pro&lt;/td&gt;
&lt;td&gt;55.1%&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;td&gt;64.3%&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ARC-AGI-2&lt;/td&gt;
&lt;td&gt;72.1%&lt;/td&gt;
&lt;td&gt;~77%&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;td&gt;84.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;128k retrieval&lt;/td&gt;
&lt;td&gt;-7.6 pts vs 3.1 Pro&lt;/td&gt;
&lt;td&gt;baseline&lt;/td&gt;
&lt;td&gt;strong&lt;/td&gt;
&lt;td&gt;strong&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The single most important number on that table for Claude Code users is the 83.6% MCP Atlas score. MCP Atlas measures how reliably a model chains multi-step tool calls without stalling on a malformed or out-of-order call. For anyone running an MCP-heavy stack, that score predicts task-completion rate more directly than SWE-bench does. The current Flash score beats Claude Opus 4.7 by 4.5 points and GPT-5.5 by 8.3 points.&lt;/p&gt;

&lt;p&gt;The honest other side: Gemini 3.5 Flash regresses 7.6 points on 128k-token retrieval versus Gemini 3.1 Pro, and gives up 5 points on ARC-AGI-2 versus the prior Pro tier (12.5 points to GPT-5.5). If you have a million-token context refactor, or a problem that looks like ARC-style abstract reasoning, Flash is the wrong answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Gemini 3.5 Flash pricing: cheap per token, expensive per task
&lt;/h2&gt;

&lt;p&gt;Gemini 3.5 Flash is $1.50 per 1M input tokens, $9 per 1M output tokens, and $0.15 per 1M cached input tokens (see &lt;a href="https://openrouter.ai/google/gemini-3.5-flash" rel="noopener noreferrer"&gt;OpenRouter&lt;/a&gt; for live pricing). On its face the Flash tier looks cheap. Per task it is not.&lt;/p&gt;

&lt;p&gt;Simon Willison's &lt;a href="https://simonwillison.net/2026/May/19/gemini-35-flash/" rel="noopener noreferrer"&gt;May 19, 2026 analysis&lt;/a&gt; cites Artificial Analysis benchmark-suite costs: running their full evaluation cost $1,551.60 on Gemini 3.5 Flash versus $892.28 on Gemini 3.1 Pro. Cheaper per token, more expensive per workload, because thinking tokens persist across turns and agent loops chew more output tokens. NxCode reports a similar multiplier: &lt;a href="https://www.nxcode.io/resources/news/gemini-3-5-flash-developer-guide-agentic-coding-2026" rel="noopener noreferrer"&gt;roughly 9x the cost of gemini-3-flash on equivalent eval jobs ($1,552 vs $278)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The pricing comparison that matters for routing:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input ($/1M)&lt;/th&gt;
&lt;th&gt;Output ($/1M)&lt;/th&gt;
&lt;th&gt;Cached input ($/1M)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 3.5 Flash&lt;/td&gt;
&lt;td&gt;$1.50&lt;/td&gt;
&lt;td&gt;$9.00&lt;/td&gt;
&lt;td&gt;$0.15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 3.1 Pro&lt;/td&gt;
&lt;td&gt;$2.50&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 3 Flash Preview (deprecated)&lt;/td&gt;
&lt;td&gt;$0.50&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;td&gt;$0.30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.7&lt;/td&gt;
&lt;td&gt;$5.00&lt;/td&gt;
&lt;td&gt;$25.00&lt;/td&gt;
&lt;td&gt;$0.50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.5&lt;/td&gt;
&lt;td&gt;$1.25&lt;/td&gt;
&lt;td&gt;$10.00&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;One trap to call out before the next section. GitHub Copilot launched Gemini 3.5 Flash with a 14x premium-request multiplier (&lt;a href="https://github.blog/changelog/2026-05-19-gemini-3-5-flash-is-generally-available-for-github-copilot/" rel="noopener noreferrer"&gt;GitHub Changelog, May 19 2026&lt;/a&gt;). A 300-request Copilot Pro quota becomes about 21 Flash calls before overage. If you already have Claude Code and an OpenRouter or AI Studio API key, calling Flash directly at roughly $0.015 per call is almost always cheaper than burning Copilot quota.&lt;/p&gt;

&lt;h2&gt;
  
  
  The thinking_level default trap that breaks copy-pasted code
&lt;/h2&gt;

&lt;p&gt;Google replaced the integer &lt;code&gt;thinking_budget&lt;/code&gt; parameter with a string enum &lt;code&gt;thinking_level&lt;/code&gt; and quietly dropped the default from &lt;code&gt;high&lt;/code&gt; to &lt;code&gt;medium&lt;/code&gt;. Code copy-pasted from &lt;code&gt;gemini-3-flash-preview&lt;/code&gt; still runs, but it produces measurably worse outputs unless you set the new field. The official notes live on &lt;a href="https://ai.google.dev/gemini-api/docs/whats-new-gemini-3.5" rel="noopener noreferrer"&gt;Google AI Developers - What's new in Gemini 3.5&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The four values are &lt;code&gt;minimal&lt;/code&gt;, &lt;code&gt;low&lt;/code&gt;, &lt;code&gt;medium&lt;/code&gt; (new default), and &lt;code&gt;high&lt;/code&gt;. Google retuned &lt;code&gt;low&lt;/code&gt; specifically for coding and tool-calling workloads. For agent loops with MCP tools, &lt;code&gt;thinking_level: "low"&lt;/code&gt; is faster, cheaper, and on coding benchmarks roughly equivalent to &lt;code&gt;medium&lt;/code&gt;. For hard reasoning, set &lt;code&gt;high&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Before and after diff
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before - gemini-3-flash-preview
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.genai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;

&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GenerateContentConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;thinking_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ThinkingConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;thinking_budget&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;# was "dynamic" / high
&lt;/span&gt;    &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                                            &lt;span class="c1"&gt;# ignored by 3.5
&lt;/span&gt;    &lt;span class="n"&gt;top_p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                                                 &lt;span class="c1"&gt;# ignored by 3.5
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# After - gemini-3.5-flash, explicit and tuned for agent loops
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.genai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;

&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GenerateContentConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;thinking_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ThinkingConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;thinking_level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;low&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;# for MCP agent loops
&lt;/span&gt;    &lt;span class="c1"&gt;# for hard reasoning tasks, use thinking_level="high"
&lt;/span&gt;    &lt;span class="c1"&gt;# for latency-sensitive work, use thinking_level="minimal"
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two cleanup notes from the migration. &lt;code&gt;temperature&lt;/code&gt;, &lt;code&gt;top_p&lt;/code&gt;, and &lt;code&gt;top_k&lt;/code&gt; are no longer recommended controls in the new SDK profile. Leaving them in your config is not an error, but they are silently ignored - delete them so the next reader of your code doesn't assume they still work. And inspect &lt;code&gt;response.usage_metadata&lt;/code&gt; on your first run: thinking tokens now persist across multi-turn conversations, and the per-task token count for an agent loop can climb 30 to 50 percent versus the preview model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Gemini 3.5 Flash vs Claude Code (Sonnet 4.6, Opus 4.7) for coding
&lt;/h2&gt;

&lt;p&gt;The short version: Flash wins agent orchestration and MCP tool chains. Claude Code wins repo-level edits and defensive code generation. Pick by task, not by model loyalty.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task type&lt;/th&gt;
&lt;th&gt;Best model&lt;/th&gt;
&lt;th&gt;Reason&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MCP tool orchestration, parallel function calling&lt;/td&gt;
&lt;td&gt;Gemini 3.5 Flash&lt;/td&gt;
&lt;td&gt;83.6% MCP Atlas, ~289 tok/sec, $1.50 input&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-file refactor in a real repo&lt;/td&gt;
&lt;td&gt;Claude Sonnet 4.6 in Claude Code&lt;/td&gt;
&lt;td&gt;Default Claude Code model; strong SWE-Bench Verified&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ARC-style abstract reasoning&lt;/td&gt;
&lt;td&gt;Claude Opus 4.7 or GPT-5.5&lt;/td&gt;
&lt;td&gt;Flash gives up 5 pts ARC-AGI-2 vs prior Pro&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Long-context retrieval beyond 128k&lt;/td&gt;
&lt;td&gt;Gemini 3.1 Pro or Sonnet 4.6 (1M ctx)&lt;/td&gt;
&lt;td&gt;Flash regresses 7.6 pts on 128k retrieval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cheap intermediate planning inside an agent&lt;/td&gt;
&lt;td&gt;Gemini 3.5 Flash&lt;/td&gt;
&lt;td&gt;Cached input at $0.15/1M is the lowest among frontier models&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Production code review with defensive patches&lt;/td&gt;
&lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
&lt;td&gt;Anthropic models add error handling more naturally&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The defensive-code observation isn't hand-wavy. Multiple head-to-head reviews this month converge on the same pattern. &lt;a href="https://www.mindstudio.ai/blog/gemini-3-5-flash-vs-claude-opus-4-7-agentic-workflows" rel="noopener noreferrer"&gt;MindStudio&lt;/a&gt; and &lt;a href="https://www.buildfastwithai.com/blogs/gemini-3-5-flash-vs-gpt-5-5-claude-deepseek-2026" rel="noopener noreferrer"&gt;BuildFastWithAI&lt;/a&gt; both report that Claude Opus 4.7 anticipates edge cases and adds error handling more naturally, while Gemini 3.5 Flash produces more concise code that occasionally skips defensive patterns. That maps to my own experience: I trust Sonnet 4.6 to write production patches; I lean on Flash to coordinate the 30 tool calls that fetch the inputs.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to route tasks from Claude Code to Gemini 3.5 Flash
&lt;/h2&gt;

&lt;p&gt;My default: I keep Claude Code with Sonnet 4.6 as the editor for anything that touches the repo. The &lt;code&gt;Edit&lt;/code&gt;, &lt;code&gt;Write&lt;/code&gt;, &lt;code&gt;Glob&lt;/code&gt;, and &lt;code&gt;Grep&lt;/code&gt; tools stay where they are. That is the production path and it doesn't need a different model today.&lt;/p&gt;

&lt;p&gt;Where I route to Gemini 3.5 Flash is the supporting cast of tasks around the editor:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MCP-heavy planning subtasks&lt;/strong&gt; where an agent fans out 10 to 100 tool calls to query an API, hit a database, or coordinate with another agent. The 83.6% MCP Atlas score shows up here as fewer retries and fewer stalled tool calls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-running background tasks&lt;/strong&gt; where speed beats defensive depth: linting summaries, log triage, doc generation, scheduled cron-style agents. Flash's ~289 tok/sec output throughput is roughly 4x what Opus 4.7 delivers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cheap intermediate planning steps&lt;/strong&gt; inside a larger agent loop where Sonnet 4.6 is overkill. Use Flash to pick which tool to call next, then hand control back to Sonnet for the actual code change.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel sub-agent fan-out&lt;/strong&gt; like the 93 parallel agents in Antigravity's demo described in the &lt;a href="https://www.nxcode.io/resources/news/gemini-3-5-flash-developer-guide-agentic-coding-2026" rel="noopener noreferrer"&gt;NxCode developer guide&lt;/a&gt;. Cached input pricing at $0.15/1M makes the fan-out economically viable.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Three ways I actually route
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;OpenRouter as a routing proxy.&lt;/strong&gt; Configure Claude Code or any Claude SDK call to dispatch specific tool calls to &lt;code&gt;google/gemini-3.5-flash&lt;/code&gt; on OpenRouter. You keep one API key, one billing surface, and you can swap models without code changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A thin custom MCP server&lt;/strong&gt; that wraps &lt;code&gt;client.models.generate_content&lt;/code&gt; with &lt;code&gt;gemini-3.5-flash&lt;/code&gt; as an exposed tool, then mount it inside Claude Code via &lt;code&gt;~/.claude.json&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Antigravity CLI for hybrid teams.&lt;/strong&gt; If your team already migrated from Gemini CLI to &lt;code&gt;agy&lt;/code&gt;, Flash is the default model. Use Antigravity for parallel agents and keep Claude Code as your primary editor.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Build an MCP agent with Gemini 3.5 Flash in 40 lines of Python
&lt;/h2&gt;

&lt;p&gt;The Google GenAI SDK has native MCP support. You hand the SDK a connected MCP &lt;code&gt;ClientSession&lt;/code&gt;, and it auto-executes tool calls and feeds the responses back to the model in a loop until the agent finishes. The official reference lives on &lt;a href="https://ai.google.dev/gemini-api/docs/function-calling" rel="noopener noreferrer"&gt;Google AI Developers - Function calling&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Install the SDKs
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"google-genai&amp;gt;=2.0"&lt;/span&gt; &lt;span class="s2"&gt;"mcp&amp;gt;=1.4"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GEMINI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-key-from-aistudio"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Working agent example
&lt;/h3&gt;

&lt;p&gt;The script below connects to an MCP server, hands the session to Gemini 3.5 Flash with &lt;code&gt;thinking_level="low"&lt;/code&gt;, and runs a real triage prompt. Replace &lt;code&gt;your_mcp_server&lt;/code&gt; with the module path to whatever MCP server you already run.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.genai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mcp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ClientSession&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StdioServerParameters&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mcp.client.stdio&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;stdio_client&lt;/span&gt;


&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;server&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StdioServerParameters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-m&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your_mcp_server&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;stdio_client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nf"&gt;as &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;ClientSession&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;initialize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

            &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;aio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-3.5-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Triage the 5 most recent open PRs in this repo. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;For each, return: PR number, risk score (low/med/high), &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;and a one-line reason. Use the tools available.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GenerateContentConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;thinking_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ThinkingConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;thinking_level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;low&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# SDK auto-executes MCP tool calls
&lt;/span&gt;                &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage_metadata&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why every choice is what it is
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;thinking_level="low"&lt;/code&gt;: Google retuned &lt;code&gt;low&lt;/code&gt; for code and tool-calling. It is faster, cheaper, and on coding benchmarks comparable to &lt;code&gt;medium&lt;/code&gt;. The default &lt;code&gt;medium&lt;/code&gt; would quietly inflate cost without improving the tool-call sequence.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tools=[session]&lt;/code&gt;: the SDK accepts an MCP &lt;code&gt;ClientSession&lt;/code&gt; directly. It introspects the server's tool list, calls each tool when the model requests it, matches the &lt;code&gt;FunctionResponse&lt;/code&gt; by id and name, and continues the loop until the model stops asking for tool calls.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;response.usage_metadata&lt;/code&gt;: log this on every run. Inspect &lt;code&gt;ThoughtsTokenCount&lt;/code&gt;. Thinking tokens persist across turns and can inflate input costs 30 to 50 percent on long agent loops.&lt;/li&gt;
&lt;li&gt;No &lt;code&gt;temperature&lt;/code&gt;, no &lt;code&gt;top_p&lt;/code&gt;: these parameters are silently ignored in Gemini 3.5. Leaving them in your config will confuse the next person to read it.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Gemini 3.5 Flash in Antigravity, GitHub Copilot, and the raw API
&lt;/h2&gt;

&lt;p&gt;Flash ships across four meaningful surfaces. The right one depends on what you already pay for and how you build.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Surface&lt;/th&gt;
&lt;th&gt;Cost model&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Raw Gemini API&lt;/td&gt;
&lt;td&gt;$1.50 / $9 per 1M (cached $0.15)&lt;/td&gt;
&lt;td&gt;Custom agents, MCP servers, routing layers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Antigravity CLI (agy)&lt;/td&gt;
&lt;td&gt;Free weekly cap, Pro $19.99/mo, Ultra $249.99/mo&lt;/td&gt;
&lt;td&gt;Hybrid teams on Google's stack&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitHub Copilot&lt;/td&gt;
&lt;td&gt;14x premium-request multiplier&lt;/td&gt;
&lt;td&gt;Existing Copilot users with light volume&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenRouter&lt;/td&gt;
&lt;td&gt;$1.50 / $9 per 1M + small markup&lt;/td&gt;
&lt;td&gt;Routing inside Claude Code or multi-model proxies&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;One opinionated note: for a Claude Code user with even one active OpenRouter or AI Studio key, raw API plus OpenRouter is almost always cheaper than burning Copilot quota at the 14x multiplier. If you don't already pay for Copilot, the decision is easy. If you do, do the math once on your own workload before changing anything.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations and gotchas
&lt;/h2&gt;

&lt;p&gt;The honest list. None of these are deal-breakers, but each one is worth knowing before you swap an existing agent over.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No Computer Use yet.&lt;/strong&gt; Flash doesn't drive a browser. For browser-driving agents, use a Pro-tier Gemini or Claude with Computer Use.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Knowledge cutoff January 2025.&lt;/strong&gt; Tool-augmented prompts and web search are the standard workarounds for fresh facts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Text-only output.&lt;/strong&gt; Multimodal input works. Output is text only - no image or audio generation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;128k retrieval regressed.&lt;/strong&gt; If you have million-token contexts and need exact-recall retrieval at scale, Sonnet 4.6 with its 1M context or Gemini 3.1 Pro are stronger picks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thought-token inflation.&lt;/strong&gt; Thinking tokens persist across multi-turn conversations and can inflate input costs 30 to 50 percent on agent loops. Track &lt;code&gt;ThoughtsTokenCount&lt;/code&gt; from &lt;code&gt;response.usage_metadata&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;thinking_level: medium is the silent default.&lt;/strong&gt; Set it explicitly in every config. The previous &lt;code&gt;high&lt;/code&gt; default is gone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TPU capacity hiccups.&lt;/strong&gt; Multiple developers reported 503 errors during the first week. Build retry-with-backoff into any production caller.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is Gemini 3.5 Flash?
&lt;/h3&gt;

&lt;p&gt;Gemini 3.5 Flash is Google's Flash-tier coding and agent model, generally available since May 19, 2026. It ships across the Gemini API, AI Studio, Antigravity CLI, Vertex AI, GitHub Copilot, and the Gemini app. It beats Gemini 3.1 Pro on 11 of 15 published agent benchmarks while pricing at $1.50 input and $9 output per 1M tokens.&lt;/p&gt;

&lt;h3&gt;
  
  
  How much does Gemini 3.5 Flash cost per 1M tokens?
&lt;/h3&gt;

&lt;p&gt;Gemini 3.5 Flash costs $1.50 per 1M input tokens, $9 per 1M output tokens, and $0.15 per 1M cached input tokens. That is 25 percent cheaper than Gemini 3.1 Pro, but 3x the price of the Gemini 3 Flash Preview it replaces and 6x the price of Gemini 3.1 Flash-Lite.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is Gemini 3.5 Flash better than Gemini 3.1 Pro?
&lt;/h3&gt;

&lt;p&gt;On agent benchmarks, yes. Gemini 3.5 Flash beats Gemini 3.1 Pro on Terminal-Bench 2.1 (76.2 vs 70.3), MCP Atlas (83.6 vs 78.2), and GDPval-AA Elo (1656 vs 1314). It regresses on 128k-token retrieval by 7.6 points and ARC-AGI-2 by 5 points, so long-context or pure reasoning work still wants Pro.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does Gemini 3.5 Flash compare to Claude Code for coding?
&lt;/h3&gt;

&lt;p&gt;Flash leads MCP tool orchestration at 83.6 percent MCP Atlas, beating Claude Opus 4.7 by 4.5 points. Claude Sonnet 4.6 still leads production code editing on SWE-Bench Verified and is the default model in Claude Code. The practical answer is to route: Claude Code for repository edits, Gemini 3.5 Flash for tool-heavy agent loops.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the thinking_level default in Gemini 3.5 Flash and why does it matter?
&lt;/h3&gt;

&lt;p&gt;Google replaced the integer &lt;code&gt;thinking_budget&lt;/code&gt; with a string enum &lt;code&gt;thinking_level&lt;/code&gt; and dropped the default from &lt;code&gt;high&lt;/code&gt; to &lt;code&gt;medium&lt;/code&gt;. Copy-pasting code from &lt;code&gt;gemini-3-flash-preview&lt;/code&gt; silently produces worse outputs. For agentic coding with MCP tools, set &lt;code&gt;thinking_level: "low"&lt;/code&gt;. For hard reasoning, set &lt;code&gt;high&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can Gemini 3.5 Flash call MCP tools?
&lt;/h3&gt;

&lt;p&gt;Yes. The Google GenAI SDK has built-in MCP support that auto-executes tool calls and feeds responses back in a loop until the agent finishes. Gemini 3.5 Flash scored 83.6 percent on MCP Atlas, the benchmark that measures multi-step tool-call reliability. It is currently the strongest published score on that benchmark among major frontier models.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why is Gemini 3.5 Flash 3x more expensive than Gemini 3 Flash Preview?
&lt;/h3&gt;

&lt;p&gt;Google retuned Flash to handle frontier-grade agent loops and is pricing it accordingly. Simon Willison observed all three major labs probing API price tolerance at the same time. Artificial Analysis reported their benchmark suite cost $1,551.60 on Gemini 3.5 Flash versus $892.28 on Gemini 3.1 Pro. Cheaper per token, more expensive per workload.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the GitHub Copilot premium multiplier for Gemini 3.5 Flash?
&lt;/h3&gt;

&lt;p&gt;GitHub Copilot launched Gemini 3.5 Flash with a 14x premium-request multiplier across Copilot Pro, Pro Plus, Business, and Enterprise plans. A 300-request monthly quota becomes about 21 Gemini 3.5 Flash calls before overage. For most Claude Code users, calling the raw API through OpenRouter or AI Studio is cheaper than burning Copilot quota.&lt;/p&gt;

&lt;h3&gt;
  
  
  Should I switch from Claude Code to Gemini 3.5 Flash?
&lt;/h3&gt;

&lt;p&gt;Not as a wholesale swap. Claude Code with Sonnet 4.6 is still the strongest tool for production repository edits and long-context refactors. Gemini 3.5 Flash is the right routing target for MCP-heavy agent loops, parallel sub-agent fan-out, and cheap intermediate planning steps. The high-leverage move is a hybrid stack, not a switch.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I call Gemini 3.5 Flash from a Python script?
&lt;/h3&gt;

&lt;p&gt;Install the &lt;code&gt;google-genai&lt;/code&gt; SDK, set &lt;code&gt;GEMINI_API_KEY&lt;/code&gt;, and call &lt;code&gt;client.models.generate_content&lt;/code&gt; with model &lt;code&gt;gemini-3.5-flash&lt;/code&gt;. Set &lt;code&gt;thinking_level&lt;/code&gt; explicitly via &lt;code&gt;ThinkingConfig&lt;/code&gt;. Drop &lt;code&gt;temperature&lt;/code&gt;, &lt;code&gt;top_p&lt;/code&gt;, and &lt;code&gt;top_k&lt;/code&gt; from your config. For MCP, pass the session object into the &lt;code&gt;tools&lt;/code&gt; list.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claudecode</category>
      <category>gemini</category>
      <category>mcp</category>
    </item>
    <item>
      <title>Claude Managed Agents Outcomes: Auto-Grading Agent Work</title>
      <dc:creator>Avinash Sangle</dc:creator>
      <pubDate>Wed, 27 May 2026 10:31:51 +0000</pubDate>
      <link>https://dev.to/aavisangle/claude-managed-agents-outcomes-auto-grading-agent-work-22np</link>
      <guid>https://dev.to/aavisangle/claude-managed-agents-outcomes-auto-grading-agent-work-22np</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This article was originally published on &lt;a href="https://avinashsangle.com/blog/claude-managed-agents-outcomes" rel="noopener noreferrer"&gt;avinashsangle.com&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Claude Managed Agents Outcomes is a public-beta feature, launched on May 6, 2026, that lets you hand the agent a rubric and have a separate grader model check every draft against it. If the grader returns &lt;code&gt;needs_revision&lt;/code&gt;, the gaps flow back to the writer for another pass, up to &lt;code&gt;max_iterations&lt;/code&gt; (default 3, max 20). Same hosted harness, no human in the loop.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Outcomes is a rubric-graded iteration loop built into the Managed Agents harness. You send one event, &lt;code&gt;user.define_outcome&lt;/code&gt;, and the agent works until the grader says &lt;code&gt;satisfied&lt;/code&gt; or hits &lt;code&gt;max_iterations&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;A separate grader (same model and tools as the writer, fresh context window) evaluates every draft. Its feedback is the only signal the writer gets back on each revision.&lt;/li&gt;
&lt;li&gt;Anthropic's internal benchmarks report up to &lt;strong&gt;+10 points overall task success&lt;/strong&gt;, &lt;strong&gt;+10.1% on .pptx generation&lt;/strong&gt;, and &lt;strong&gt;+8.4% on .docx&lt;/strong&gt; (&lt;a href="https://claude.com/blog/new-in-claude-managed-agents" rel="noopener noreferrer"&gt;Anthropic, May 2026&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;The cost trap is the iteration count, not a per-outcome fee. Each revision multiplies writer plus grader tokens against the same $0.08-per-session-hour line item from the underlying Managed Agents pricing.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What Are Claude Managed Agents Outcomes?
&lt;/h2&gt;

&lt;p&gt;Outcomes is the part of &lt;a href="https://avinashsangle.com/blog/claude-managed-agents" rel="noopener noreferrer"&gt;Claude Managed Agents&lt;/a&gt; that lets the agent verify its own work. Instead of running until it self-assesses as done, the session runs against a markdown rubric, and a second Claude (the grader) inspects each artifact with no access to the writer's reasoning. Anthropic launched Outcomes in public beta on May 6, 2026, alongside two sibling features: dreaming (research preview) and multiagent orchestration (public beta).&lt;/p&gt;

&lt;p&gt;The framing matters. Anthropic describes it as &lt;em&gt;"agents do their best work when they know what 'good' looks like - a structural framework, a presentation standard, or a set of requirements"&lt;/em&gt; (&lt;a href="https://claude.com/blog/new-in-claude-managed-agents" rel="noopener noreferrer"&gt;Anthropic blog, May 6 2026&lt;/a&gt;). The earlier Managed Agents flow asked you to write transcripts and review output yourself. Outcomes replaces that loop with a grader process and a rubric, so the agent keeps iterating without paging a human.&lt;/p&gt;

&lt;p&gt;On the launch list, three companies were explicitly named as production users of Outcomes: &lt;strong&gt;Harvey&lt;/strong&gt; (legal document drafting), &lt;strong&gt;Spiral by Every&lt;/strong&gt; (writing quality against editorial principles), and &lt;strong&gt;Wisedocs&lt;/strong&gt; (document quality checks against internal guidelines). Per Anthropic's internal benchmarks, the loop lifts task success rates by up to &lt;strong&gt;10 percentage points&lt;/strong&gt; over a standard prompting loop, with the largest gains on the hardest tasks (&lt;a href="https://www.mindstudio.ai/blog/claude-outcomes-feature-rubric-grading-agent-powerpoint-quality" rel="noopener noreferrer"&gt;MindStudio, 2026&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;The beta header is &lt;code&gt;managed-agents-2026-04-01&lt;/code&gt;. Every Managed Agents API call carries it, and the official SDKs set it for you when you pass &lt;code&gt;betas=BETAS&lt;/code&gt;. If you forget the header on a raw HTTP call, the session API returns 400 before you even get to the outcome event.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the Outcome Grader Works
&lt;/h2&gt;

&lt;p&gt;The flow is small and predictable. You create an environment and a writer agent. You start a session and send one event, &lt;code&gt;user.define_outcome&lt;/code&gt;, carrying the task description and the rubric. The writer drafts. After each writer turn, the harness emits &lt;code&gt;span.outcome_evaluation_start&lt;/code&gt; and spins up a grader in a fresh context window. The grader reads only the rubric, inspects the artifact (it has the same model and tools as the writer), and emits &lt;code&gt;span.outcome_evaluation_end&lt;/code&gt; with a verdict. If the verdict is &lt;code&gt;needs_revision&lt;/code&gt;, the explanation flows back into the writer's next turn.&lt;/p&gt;

&lt;p&gt;Two design choices make this useful rather than gimmicky. First, the grader runs with no visibility into the writer's internal reasoning, so it cannot be talked into approving an artifact that does not meet the rubric. Second, the grader re-checks the full artifact on every iteration, not the diff, so a fix that breaks a previously-passing criterion gets caught on the next round. The &lt;a href="https://platform.claude.com/docs/en/managed-agents/define-outcomes" rel="noopener noreferrer"&gt;official define-outcomes reference&lt;/a&gt; states this plainly: the grader uses a separate context window to avoid being influenced by the main agent's implementation choices.&lt;/p&gt;

&lt;p&gt;The benchmark numbers are useful context. On Anthropic's internal eval set, file generation specifically saw &lt;strong&gt;+8.4% on .docx outputs&lt;/strong&gt; and &lt;strong&gt;+10.1% on .pptx outputs&lt;/strong&gt; over a standard prompting loop (&lt;a href="https://claude.com/blog/new-in-claude-managed-agents" rel="noopener noreferrer"&gt;Anthropic, May 2026&lt;/a&gt;). Those are not headline-chart numbers; they are the difference between a slide deck that ships and one that doesn't. The gain is largest on the hardest tasks, which fits the pattern: easy work looks fine on the first pass anyway.&lt;/p&gt;

&lt;h2&gt;
  
  
  Writing a Rubric the Grader Will Actually Enforce
&lt;/h2&gt;

&lt;p&gt;The rubric is the only lever you have on the grader. The default failure mode is a grader that approves everything, and the reason is almost always vague criteria. The Anthropic docs are blunt about it: structure the rubric as explicit, gradeable criteria, such as &lt;em&gt;the CSV contains a price column with numeric values&lt;/em&gt; rather than &lt;em&gt;the data looks good&lt;/em&gt;. The grader scores each criterion independently, so vague criteria produce noisy evaluations (&lt;a href="https://platform.claude.com/docs/en/managed-agents/define-outcomes" rel="noopener noreferrer"&gt;Define outcomes, Anthropic&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;A working rubric has five properties. Each criterion is &lt;strong&gt;checkable&lt;/strong&gt; by the grader using its tools. The target is the artifact's &lt;strong&gt;structure and completeness&lt;/strong&gt;, not a fact the grader cannot independently confirm. The rubric &lt;strong&gt;anticipates shortcuts&lt;/strong&gt; (for example, blocks corroboration via search snippets and mirrors when you want a primary source). It &lt;strong&gt;mandates a feedback format&lt;/strong&gt; so you can parse the explanation downstream. And it &lt;strong&gt;tells the grader what to ignore&lt;/strong&gt;, so you do not burn iterations on style nits.&lt;/p&gt;

&lt;p&gt;Anthropic ships a working DCF model rubric on the docs page. It's worth reading because it shows what "explicit and gradeable" looks like in practice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# DCF Model Rubric&lt;/span&gt;

&lt;span class="gu"&gt;## Revenue Projections&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Uses historical revenue data from the last 5 fiscal years
&lt;span class="p"&gt;-&lt;/span&gt; Projects revenue for at least 5 years forward
&lt;span class="p"&gt;-&lt;/span&gt; Growth rate assumptions are explicitly stated and reasonable

&lt;span class="gu"&gt;## Cost Structure&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; COGS and operating expenses are modeled separately
&lt;span class="p"&gt;-&lt;/span&gt; Margins are consistent with historical trends or deviations are justified

&lt;span class="gu"&gt;## Discount Rate&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; WACC is calculated with stated assumptions for cost of equity and cost of debt
&lt;span class="p"&gt;-&lt;/span&gt; Beta, risk-free rate, and equity risk premium are sourced or justified

&lt;span class="gu"&gt;## Terminal Value&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Uses either perpetuity growth or exit multiple method (stated which)
&lt;span class="p"&gt;-&lt;/span&gt; Terminal growth rate does not exceed long-term GDP growth

&lt;span class="gu"&gt;## Output Quality&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; All figures are in a single .xlsx file with clearly labeled sheets
&lt;span class="p"&gt;-&lt;/span&gt; Key assumptions are on a separate "Assumptions" sheet
&lt;span class="p"&gt;-&lt;/span&gt; Sensitivity analysis on WACC and terminal growth rate is included
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice what the rubric does not say. It never asks the grader to verify that the input revenue figures are factually accurate. The grader has no way to confirm that a 2023 revenue number is real without going off and looking it up, and even if it did, you cannot easily test that part of the work. The rubric checks &lt;em&gt;that history was used&lt;/em&gt;, not that the numbers are true. That is the right line.&lt;/p&gt;

&lt;p&gt;If you do not have a rubric and you are starting from scratch, the docs offer a bootstrap trick worth stealing: hand Claude a known-good artifact and ask it to write the rubric. The output is usually better than the rubric you would have written from a blank page, because it can name what makes the good artifact good. I run this once per document type and keep the rubric in a markdown file uploaded via the Files API with the &lt;code&gt;files-api-2025-04-14&lt;/code&gt; beta. That way you can pass &lt;code&gt;rubric: {type: "file", file_id: ...}&lt;/code&gt; and reuse it across sessions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Define an Outcome: Python Code Walkthrough
&lt;/h2&gt;

&lt;p&gt;The setup is three calls plus one event. Create the environment. Create the writer agent with whatever tools the task needs. Create the session. Send a single &lt;code&gt;user.define_outcome&lt;/code&gt; event carrying a description string and the rubric, and the writer starts on receipt. No separate &lt;code&gt;user.message&lt;/code&gt; event is needed to kick it off.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;

&lt;span class="n"&gt;BETAS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;managed-agents-2026-04-01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;files-api-2025-04-14&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;MODEL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Environment - the sandbox the agent runs in
&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research-brief&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic_cloud&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;networking&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unrestricted&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
    &lt;span class="n"&gt;betas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;BETAS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Writer agent - same model and tools the grader will use
&lt;/span&gt;&lt;span class="n"&gt;writer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agents&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Research Analyst&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a research analyst. You write one-page business briefs. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cite every factual claim with an inline footnote [n].&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_toolset_20260401&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;configs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;web_search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;web_fetch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;read&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;write&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;betas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;BETAS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Upload the rubric once, reuse across sessions
&lt;/span&gt;&lt;span class="n"&gt;rubric&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dcf-rubric.md&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;betas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;BETAS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Uploaded rubric: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;rubric&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 4. Session + the one event that starts everything
&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sessions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;environment_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Brief: EV fast-charging unit economics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;betas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;BETAS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sessions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;betas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;BETAS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user.define_outcome&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Build a one-page business brief on EV fast-charging unit economics in .docx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rubric&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;rubric&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="c1"&gt;# or inline: {"type": "text", "content": RUBRIC_MD},
&lt;/span&gt;            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_iterations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# optional, default 3, max 20
&lt;/span&gt;        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The rubric field accepts either inline text or a file reference. For one-off notebook work I keep the rubric inline as a long string, because the round-trip is faster and the rubric is right there in the source. For anything I run more than once I upload it once and pass the &lt;code&gt;file_id&lt;/code&gt;, so updates to the rubric do not require re-pasting it everywhere.&lt;/p&gt;

&lt;p&gt;Two notes on the agent definition that bite people. The grader uses the same model and the same tools as the writer agent. If the writer has &lt;code&gt;read&lt;/code&gt; and &lt;code&gt;write&lt;/code&gt;, so does the grader, and the grader can open every file the writer produced. If you scope the writer too tightly (no &lt;code&gt;read&lt;/code&gt;, for example) the grader will not be able to verify the artifact and you will get noisy verdicts. Give the grader the tools it needs to confirm what the rubric demands.&lt;/p&gt;

&lt;h2&gt;
  
  
  Grader Feedback: The Five Result States
&lt;/h2&gt;

&lt;p&gt;Every grader pass ends with a &lt;code&gt;span.outcome_evaluation_end&lt;/code&gt; event. The &lt;code&gt;result&lt;/code&gt; field on that event takes one of five values and tells you exactly what the harness will do next. Memorize this table once and you will save yourself a lot of stream-parsing.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;th&gt;What happens next&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;satisfied&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;All criteria met. Session transitions to &lt;code&gt;idle&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;needs_revision&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Writer starts another iteration with the grader's explanation as feedback.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;max_iterations_reached&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;No further evaluation. Writer may run one final revision before the session goes idle.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;failed&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Rubric fundamentally does not match the task. Session goes idle.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;interrupted&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;A &lt;code&gt;user.interrupt&lt;/code&gt; event landed mid-evaluation. You can start a new outcome.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In practice you watch the stream and react to two events: &lt;code&gt;span.outcome_evaluation_start&lt;/code&gt; tells you the writer finished a draft, and &lt;code&gt;span.outcome_evaluation_end&lt;/code&gt; carries the verdict. A heartbeat event, &lt;code&gt;span.outcome_evaluation_ongoing&lt;/code&gt;, fires while the grader works, but the grader's internal reasoning is opaque - you see that it is working, not what it is thinking.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;TERMINAL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;satisfied&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_iterations_reached&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;interrupted&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sessions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;betas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;BETAS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ev&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;span.outcome_evaluation_start&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[iter &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iteration&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] grader evaluating draft...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;span.outcome_evaluation_end&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[iter &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iteration&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] result: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;explanation&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# per-criterion feedback
&lt;/span&gt;            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;TERMINAL&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;break&lt;/span&gt;

&lt;span class="c1"&gt;# After the loop, fetch deliverables from /mnt/session/outputs/
&lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;betas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;BETAS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;download&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;betas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;BETAS&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;write_to_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output files live at &lt;code&gt;/mnt/session/outputs/&lt;/code&gt; inside the container and you fetch them via the Files API with &lt;code&gt;scope_id=session.id&lt;/code&gt;. The grader's &lt;code&gt;explanation&lt;/code&gt; field is the part you actually want to log for postmortems - it's the verbatim feedback the writer used for the next pass, so if a session looped to &lt;code&gt;max_iterations_reached&lt;/code&gt;, that field tells you what the grader kept catching.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tuning max_iterations vs Fixing the Rubric
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;max_iterations&lt;/code&gt; defaults to 3 and the cap is 20. The cookbook recommends starting at 5 for strict rubrics. The mistake I see most is people raising the cap when they should be rewriting the rubric. There's a simple decision rule that catches the difference.&lt;/p&gt;

&lt;p&gt;Log every iteration's &lt;code&gt;explanation&lt;/code&gt; field and look at the failures across passes. If the grader is flagging &lt;strong&gt;the same criterion every time&lt;/strong&gt; and the writer is not closing it, the rubric is the problem - either the criterion is unverifiable, or the grader and writer are interpreting it differently. Raise the cap and you just pay for more iterations of the same loop. If the grader is flagging &lt;strong&gt;different criteria each pass&lt;/strong&gt;, with the failures converging on the last unsolved item, the rubric is fine and you need a higher cap. That is real progress and another iteration will close it out.&lt;/p&gt;

&lt;p&gt;The other anti-patterns are easier to spot once you know what to look for. A rubric that prescribes specific steps instead of describing the goal will over-constrain the writer, and the grader will mark novel approaches as failed. A description and rubric that contradict each other returns &lt;code&gt;result: failed&lt;/code&gt; on the first pass, before any work is done - check the explanation, it is usually unambiguous about which one is wrong. A single criterion that packs four ideas together produces noisy per-criterion verdicts because the grader cannot tell which of the four is failing on a given draft.&lt;/p&gt;

&lt;p&gt;Treat &lt;code&gt;max_iterations&lt;/code&gt; as a circuit breaker, not a knob. Set it once based on how strict your rubric is, and let repeated &lt;code&gt;max_iterations_reached&lt;/code&gt; events tell you when the rubric needs work. Raising the cap from 5 to 20 to mask a bad rubric doubles your token spend and surfaces nothing useful.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Outcomes Actually Cost
&lt;/h2&gt;

&lt;p&gt;Outcomes does not add a separate per-outcome fee. The cost driver is iteration count: every revision adds writer tokens plus grader tokens and keeps the same Managed Agents &lt;strong&gt;$0.08-per-session-hour&lt;/strong&gt; clock running. There is no standalone grader bill or rubric bill. There is just more of the same line items.&lt;/p&gt;

&lt;p&gt;Worked example for a research brief task. The writer takes about ten minutes per draft. The grader takes about a minute per evaluation. A session that goes two iterations to &lt;code&gt;satisfied&lt;/code&gt; runs roughly 22 minutes of wall-clock session time. At $0.08 per hour, that is about $0.029 in session-hours. The token spend is whatever the writer and grader cost across two passes (typically the dominant line in this kind of work). For comparison, a manual human review of the same brief at, say, $25 per round, blows past the entire outcome-driven session cost on the first review.&lt;/p&gt;

&lt;p&gt;Two cost levers actually move the bill. First, &lt;code&gt;max_iterations&lt;/code&gt;. A run that loops six times when three would have done it doubles the writer plus grader tokens. Track the average iteration count per task type and tune accordingly. Second, the grader's tools. The grader uses whatever the writer agent was created with - if you gave the writer &lt;code&gt;web_search&lt;/code&gt; and the rubric does not require cross-checking, you are paying for grader web searches it does not need. Strip unused tools from the writer config and the grader stops calling them.&lt;/p&gt;

&lt;p&gt;For broader cost-tracking patterns across Claude Code and Managed Agents work, my &lt;a href="https://avinashsangle.com/blog/claude-code-cost-tracking" rel="noopener noreferrer"&gt;Claude Code cost tracking&lt;/a&gt; post covers the JSONL logs and ccusage workflow I run weekly. The same approach works for Managed Agents sessions: dump the events stream to a file per session and roll up iteration counts and token usage from the &lt;code&gt;usage&lt;/code&gt; field on &lt;code&gt;span.outcome_evaluation_end&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Outcomes vs LLM-as-Judge vs Codex /goal
&lt;/h2&gt;

&lt;p&gt;LLM-as-judge is a category in 2026, not a single product. Tools like Galileo, DeepEval, Langfuse, and G-Eval all let you score agent or model output against a rubric using an LLM, and they do it well. Strong LLM judges in current research achieve roughly &lt;strong&gt;80% agreement with human evaluators&lt;/strong&gt;, matching human-to-human consistency on many quality dimensions (&lt;a href="https://galileo.ai/blog/agent-evaluation-framework-metrics-rubrics-benchmarks" rel="noopener noreferrer"&gt;Galileo, 2026&lt;/a&gt;). What you get from those tools is the score. What you do with it is up to you.&lt;/p&gt;

&lt;p&gt;What Outcomes adds is the wiring. The grader runs inside the harness, the explanation flows back into the writer's next turn without any code on your side, and the iteration loop stops when the grader is satisfied. With a standalone judge, you build that loop yourself: capture the score, decide if it is good enough, format the gaps as a prompt, and restart the agent. That wiring is the difference between a one-off evaluation script and a self-correcting agent in production.&lt;/p&gt;

&lt;p&gt;On the OpenAI side, Codex &lt;code&gt;/goal&lt;/code&gt; is the closest analogue. Both attach a success target to an autonomous run. The difference is the verdict shape. Outcomes leans on a markdown rubric and natural-language gap explanations. Codex &lt;code&gt;/goal&lt;/code&gt; leans on verifier scripts and structured pass-fail signals, which works well for code where you can run tests. Practitioners comparing them note that &lt;code&gt;/goal&lt;/code&gt; fits programmatic tasks better, while Outcomes fits qualitative artifacts (documents, decks, prose) better (&lt;a href="https://www.developersdigest.tech/blog/codex-goal-vs-claude-managed-outcomes-practical-differences" rel="noopener noreferrer"&gt;Developers Digest, 2026&lt;/a&gt;). They are not interchangeable, they target different shapes of work.&lt;/p&gt;

&lt;p&gt;Pick Outcomes if you live in the Managed Agents harness already and your artifacts are document-like. Pick a standalone judge if you want to evaluate offline across a corpus, or you need cross-provider scoring. Pick Codex &lt;code&gt;/goal&lt;/code&gt; if your success criterion is "does the test suite pass" and you are already on OpenAI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What are Claude Managed Agents Outcomes?
&lt;/h3&gt;

&lt;p&gt;Outcomes is a public-beta Managed Agents feature launched May 6, 2026. You attach a markdown rubric to a session via a &lt;code&gt;user.define_outcome&lt;/code&gt; event, and a separate grader model evaluates each draft in its own context window. If the grader returns &lt;code&gt;needs_revision&lt;/code&gt;, the feedback goes back to the writer for another iteration.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does the Claude outcome grader work?
&lt;/h3&gt;

&lt;p&gt;The grader runs in a fresh context window using the same model and tools as the writer agent. It reads only the rubric, inspects the artifact, and returns a per-criterion verdict on every iteration. Its reasoning is opaque, but its &lt;code&gt;explanation&lt;/code&gt; field carries the gaps that the writer must close on the next pass.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I write a good rubric for Claude Outcomes?
&lt;/h3&gt;

&lt;p&gt;Use explicit, gradeable criteria like "the CSV has a numeric price column," not vibes like "the data looks good." Anchor the rubric in verifiable structure and completeness, anticipate shortcuts the writer might take, mandate a feedback format, and tell the grader what to ignore so it does not thrash on style nits.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the default value of max_iterations in Claude Outcomes?
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;max_iterations&lt;/code&gt; field defaults to 3 and accepts values up to 20. For strict rubrics, the Anthropic cookbook recommends starting at 5. If the loop hits the cap with the same failures every iteration, the rubric is wrong; if it hits the cap with failures that converge, raise the cap instead of rewriting.&lt;/p&gt;

&lt;h3&gt;
  
  
  What are the result states of a Claude outcome evaluation?
&lt;/h3&gt;

&lt;p&gt;Five values appear on &lt;code&gt;span.outcome_evaluation_end.result&lt;/code&gt;: &lt;code&gt;satisfied&lt;/code&gt; (criteria met, session goes idle), &lt;code&gt;needs_revision&lt;/code&gt; (writer starts another pass), &lt;code&gt;max_iterations_reached&lt;/code&gt; (one final revision allowed before idle), &lt;code&gt;failed&lt;/code&gt; (rubric contradicts the task description), and &lt;code&gt;interrupted&lt;/code&gt; (a &lt;code&gt;user.interrupt&lt;/code&gt; event landed mid-evaluation).&lt;/p&gt;

&lt;h3&gt;
  
  
  How much do Claude Outcomes cost on top of session-hours?
&lt;/h3&gt;

&lt;p&gt;Outcomes has no separate per-outcome fee. The real cost driver is iterations: each revision adds writer tokens plus grader tokens and keeps the $0.08-per-session-hour clock running. A 20-minute session that iterates twice still bills around $0.027 in session-hours, plus the writer-and-grader tokens for both rounds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I use Claude Outcomes with the Agent SDK or only Managed Agents?
&lt;/h3&gt;

&lt;p&gt;Outcomes is a Managed Agents feature. The grader, the iteration loop, and the &lt;code&gt;span.outcome_evaluation_*&lt;/code&gt; events all live in the hosted harness. If you run the Agent SDK locally, you can still build an LLM-as-judge yourself with a separate Anthropic API call, but the wiring back to the writer is on you.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the difference between Claude Outcomes and Codex /goal?
&lt;/h3&gt;

&lt;p&gt;Both attach a success target to an autonomous agent run. Outcomes uses a rubric plus a separate grader and feeds the gaps back as natural-language revision notes. OpenAI's Codex &lt;code&gt;/goal&lt;/code&gt; favors verifier scripts and structured pass-fail signals. Outcomes leans qualitative, &lt;code&gt;/goal&lt;/code&gt; leans test-driven, and the runtime substrates differ.&lt;/p&gt;




&lt;p&gt;If you're still deciding between Managed Agents and the Agent SDK, start with &lt;a href="https://avinashsangle.com/blog/claude-managed-agents" rel="noopener noreferrer"&gt;Claude Managed Agents vs Agent SDK&lt;/a&gt;. And if you're running agents in CI, my &lt;a href="https://avinashsangle.com/blog/hardening-ai-agents-cicd-prompt-injection" rel="noopener noreferrer"&gt;prompt-injection defense guide for GitHub Actions&lt;/a&gt; applies to outcome-driven sessions too.&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>ai</category>
      <category>agents</category>
      <category>llm</category>
    </item>
    <item>
      <title>Getting Started with the ant CLI: Deploy Claude Agents</title>
      <dc:creator>Avinash Sangle</dc:creator>
      <pubDate>Wed, 22 Apr 2026 05:25:26 +0000</pubDate>
      <link>https://dev.to/aavisangle/getting-started-with-the-ant-cli-deploy-claude-agents-50ml</link>
      <guid>https://dev.to/aavisangle/getting-started-with-the-ant-cli-deploy-claude-agents-50ml</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This article was originally published on &lt;a href="https://avinashsangle.com/blog/ant-cli-getting-started" rel="noopener noreferrer"&gt;avinashsangle.com&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The ant CLI is Anthropic's official command-line client for the Claude API, and it's the fastest way to create, configure, and manage cloud-hosted agents without writing application code. From install to a running managed agent in under 10 minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;ant CLI&lt;/strong&gt; is Anthropic's official Go-based CLI for the Claude API, launched April 2026. It manages agents, environments, and sessions from your terminal.&lt;/li&gt;
&lt;li&gt;Install on macOS with &lt;code&gt;brew install anthropics/tap/ant&lt;/code&gt;. Linux and Go installs are also supported.&lt;/li&gt;
&lt;li&gt;Define agents as &lt;strong&gt;YAML files&lt;/strong&gt;, check them into Git, and deploy through CI - full GitOps for your agent configs.&lt;/li&gt;
&lt;li&gt;Sessions cost &lt;strong&gt;$0.08/hour&lt;/strong&gt; (billed to the millisecond) plus standard Claude token rates. Idle time is free.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What Is the ant CLI?
&lt;/h2&gt;

&lt;p&gt;The ant CLI shipped alongside &lt;a href="https://avinashsangle.com/blog/claude-managed-agents" rel="noopener noreferrer"&gt;Claude Managed Agents&lt;/a&gt; on April 8, 2026, and it's built specifically for developers who want to create, configure, and run cloud-hosted agents without writing wrapper code. The &lt;a href="https://github.com/anthropics/anthropic-cli" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; already has over 300 stars in its first ten days.&lt;/p&gt;

&lt;p&gt;It follows a resource-based command structure: &lt;code&gt;ant [resource] &amp;lt;command&amp;gt; [flags...]&lt;/code&gt;. Think of it like &lt;code&gt;kubectl&lt;/code&gt; for Claude agents. You can pipe YAML into it, extract fields with GJSON transforms, and chain commands in shell scripts. If you've worked with any modern infrastructure CLI, the patterns will feel familiar.&lt;/p&gt;

&lt;p&gt;One thing to clarify early: the ant CLI and Claude Code solve different problems. Claude Code is your interactive coding assistant in the terminal - you talk to it, it writes code, and you pay through a subscription. The ant CLI is a programmatic API client for managing hosted agent infrastructure. You authenticate with an API key, and you're billed at standard API rates. I use both daily, and they complement each other well. Claude Code even understands how to shell out to &lt;code&gt;ant&lt;/code&gt; natively.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Install the ant CLI
&lt;/h2&gt;

&lt;p&gt;There are three installation paths depending on your platform. If you're on macOS, Homebrew is the fastest route.&lt;/p&gt;

&lt;h3&gt;
  
  
  macOS (Homebrew)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install from Anthropic's tap&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;anthropics/tap/ant

&lt;span class="c"&gt;# Clear the macOS quarantine flag (required)&lt;/span&gt;
xattr &lt;span class="nt"&gt;-d&lt;/span&gt; com.apple.quarantine &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;brew &lt;span class="nt"&gt;--prefix&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;/bin/ant"&lt;/span&gt;

&lt;span class="c"&gt;# Verify&lt;/span&gt;
ant &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That quarantine step trips people up. macOS flags unsigned binaries downloaded by Homebrew, and without clearing it you'll get a "cannot be opened because the developer cannot be verified" error. It's a one-time thing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Linux / WSL (curl)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;VERSION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1.2.1
&lt;span class="nv"&gt;OS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;uname&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; | &lt;span class="nb"&gt;tr&lt;/span&gt; &lt;span class="s1"&gt;'[:upper:]'&lt;/span&gt; &lt;span class="s1"&gt;'[:lower:]'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;ARCH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;uname&lt;/span&gt; &lt;span class="nt"&gt;-m&lt;/span&gt; | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s1"&gt;'s/x86_64/amd64/'&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s1"&gt;'s/aarch64/arm64/'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"https://github.com/anthropics/anthropic-cli/releases/download/v&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VERSION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/ant_&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VERSION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;_&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;OS&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;_&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;ARCH&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.tar.gz"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;sudo tar&lt;/span&gt; &lt;span class="nt"&gt;-xz&lt;/span&gt; &lt;span class="nt"&gt;-C&lt;/span&gt; /usr/local/bin ant
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  From Source (Go 1.22+)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;go &lt;span class="nb"&gt;install &lt;/span&gt;github.com/anthropics/anthropic-cli/cmd/ant@latest
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PATH&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;go &lt;span class="nb"&gt;env &lt;/span&gt;GOPATH&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;/bin"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Set Your API Key
&lt;/h3&gt;

&lt;p&gt;Once installed, set your Anthropic API key. The CLI reads it from the &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt; environment variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"sk-ant-your-key-here"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can generate an API key from the &lt;a href="https://console.anthropic.com/settings/keys" rel="noopener noreferrer"&gt;Anthropic Console&lt;/a&gt;. I keep mine in a &lt;code&gt;.env&lt;/code&gt; file that my shell sources on startup, but any secret management approach works.&lt;/p&gt;

&lt;h3&gt;
  
  
  Shell Completions
&lt;/h3&gt;

&lt;p&gt;The ant CLI supports completions for bash, zsh, fish, and PowerShell. For zsh (the default macOS shell):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Generate zsh completions&lt;/span&gt;
ant completion zsh &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; ~/.zfunc/_ant

&lt;span class="c"&gt;# Add to your .zshrc if not already there&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'fpath=(~/.zfunc $fpath)'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.zshrc
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'autoload -Uz compinit &amp;amp;&amp;amp; compinit'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.zshrc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tab completion saves a lot of time when working with the &lt;code&gt;beta:&lt;/code&gt; namespaced commands, which can get long.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Concepts - Agents, Environments, and Sessions
&lt;/h2&gt;

&lt;p&gt;Before you create anything, it helps to understand how the four core pieces fit together.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent&lt;/strong&gt; - A versioned configuration defining the model, system prompt, tools, and MCP server connections. Think of it as a blueprint. Each update creates a new version, so you can roll back if needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Environment&lt;/strong&gt; - A container template specifying pre-installed packages (pip, npm) and networking rules. Create it once, reference it by ID. Multiple sessions can share one environment config, but each gets its own isolated container.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Session&lt;/strong&gt; - A running instance that pairs an agent with an environment. It has its own container, filesystem, and conversation history. Sessions are where the actual work happens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Events&lt;/strong&gt; - The communication protocol. You send user events (messages, interrupts, tool confirmations) and receive agent events (messages, tool calls, thinking). Everything is event-based and streamable.&lt;/p&gt;

&lt;p&gt;The flow works like this: you create an agent (the what), create an environment (the where), start a session linking them together, and then communicate through events. Anthropic handles the container orchestration, tool execution, and conversation state. According to the &lt;a href="https://platform.claude.com/docs/en/managed-agents/overview" rel="noopener noreferrer"&gt;official docs&lt;/a&gt;, sessions cost $0.08 per session-hour billed to the millisecond, and idle time doesn't count.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating Your First Agent with the ant CLI
&lt;/h2&gt;

&lt;p&gt;Let's build a simple code review agent. I'll walk through each step so you can see exactly what the CLI does at each stage. All managed agent commands sit under the &lt;code&gt;beta:&lt;/code&gt; prefix since the feature is still in beta.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Create the Agent
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ant beta:agents create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; &lt;span class="s2"&gt;"Code Reviewer"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model&lt;/span&gt; claude-sonnet-4-6 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--system&lt;/span&gt; &lt;span class="s2"&gt;"You are a senior code reviewer. Read the code carefully, check for bugs, security issues, and style problems. Be specific about line numbers and provide fix suggestions."&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tool&lt;/span&gt; &lt;span class="s1"&gt;'{"type": "agent_toolset_20260401"}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response comes back as JSON with the agent ID and version. I like to extract just the ID for scripting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Extract the agent ID&lt;/span&gt;
&lt;span class="nv"&gt;AGENT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;ant beta:agents create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; &lt;span class="s2"&gt;"Code Reviewer"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model&lt;/span&gt; claude-sonnet-4-6 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--system&lt;/span&gt; &lt;span class="s2"&gt;"You are a senior code reviewer."&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tool&lt;/span&gt; &lt;span class="s1"&gt;'{"type": "agent_toolset_20260401"}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--transform&lt;/span&gt; &lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="nt"&gt;--format&lt;/span&gt; raw&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Created agent: &lt;/span&gt;&lt;span class="nv"&gt;$AGENT_ID&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--transform&lt;/code&gt; flag uses GJSON syntax to pluck a specific field from the response, and &lt;code&gt;--format raw&lt;/code&gt; strips the quotes. This is one of the CLI's best features for scripting.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Create an Environment
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;ENV_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;ant beta:environments create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; &lt;span class="s2"&gt;"python-dev"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--pip-packages&lt;/span&gt; &lt;span class="s1"&gt;'["pytest", "ruff", "mypy"]'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--networking&lt;/span&gt; unrestricted &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--transform&lt;/span&gt; &lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="nt"&gt;--format&lt;/span&gt; raw&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Created environment: &lt;/span&gt;&lt;span class="nv"&gt;$ENV_ID&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Environments define what's pre-installed in the container. I'm giving this one Python linting tools since it's a code review agent. The &lt;code&gt;unrestricted&lt;/code&gt; networking flag lets the agent fetch external resources if needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Start a Session
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;SESSION_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;ant beta:sessions create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--agent-id&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$AGENT_ID&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--environment-id&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ENV_ID&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--transform&lt;/span&gt; &lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="nt"&gt;--format&lt;/span&gt; raw&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Started session: &lt;/span&gt;&lt;span class="nv"&gt;$SESSION_ID&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Send a Message and Stream the Response
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Send a review request&lt;/span&gt;
ant beta:sessions:events send &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--session-id&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$SESSION_ID&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--type&lt;/span&gt; user.message &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--content-type&lt;/span&gt; text &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--content-text&lt;/span&gt; &lt;span class="s2"&gt;"Review this Python function for bugs:

def divide(a, b):
    return a / b
"&lt;/span&gt;

&lt;span class="c"&gt;# Stream the agent's response in real-time&lt;/span&gt;
ant beta:sessions stream &lt;span class="nt"&gt;--session-id&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$SESSION_ID&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;stream&lt;/code&gt; command opens a real-time SSE connection to the session. You'll see the agent's thinking, tool calls (it might run the code through ruff), and its final review - all printed to your terminal as they happen.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; Want to explore the response interactively? Replace &lt;code&gt;--format raw&lt;/code&gt; with &lt;code&gt;--format explore&lt;/code&gt; on any command to open the TUI explorer. It lets you navigate nested JSON with arrow keys - really useful when debugging agent responses.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  YAML Version Control for Agents
&lt;/h2&gt;

&lt;p&gt;This is the ant CLI's best feature, and the one I haven't seen anyone write about yet. Instead of passing flags inline, you can define agents and environments as YAML files, check them into Git, and deploy through your CI pipeline.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# code-reviewer.agent.yaml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Code Reviewer&lt;/span&gt;
&lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;
&lt;span class="na"&gt;system&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
  &lt;span class="s"&gt;You are a senior code reviewer. Read the code carefully,&lt;/span&gt;
  &lt;span class="s"&gt;check for bugs, security issues, and style problems.&lt;/span&gt;
  &lt;span class="s"&gt;Be specific about line numbers and provide fix suggestions.&lt;/span&gt;
&lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;agent_toolset_20260401&lt;/span&gt;
    &lt;span class="na"&gt;configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;web_fetch&lt;/span&gt;
        &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# code-reviewer.environment.yaml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python-dev&lt;/span&gt;
&lt;span class="na"&gt;pip_packages&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;pytest&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ruff&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;mypy&lt;/span&gt;
&lt;span class="na"&gt;networking&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;unrestricted&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can create the agent directly from the file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create from YAML&lt;/span&gt;
ant beta:agents create &amp;lt; code-reviewer.agent.yaml

&lt;span class="c"&gt;# Update an existing agent (version is required for safety)&lt;/span&gt;
ant beta:agents update &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--agent-id&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$AGENT_ID&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--version&lt;/span&gt; 1 &lt;span class="se"&gt;\&lt;/span&gt;
  &amp;lt; code-reviewer.agent.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The versioning requirement matters. When you update an agent, you must pass the current version number. If someone else updated it since you last pulled, the command fails rather than silently overwriting. It's optimistic concurrency control - the same pattern you'd find in Kubernetes or Terraform.&lt;/p&gt;

&lt;p&gt;This YAML approach is where the ant CLI really shines for teams. Your agent configs live in the same repo as your application code, go through pull request review, and deploy through the same pipeline. I wrote more about the broader Managed Agents architecture in my &lt;a href="https://avinashsangle.com/blog/claude-managed-agents" rel="noopener noreferrer"&gt;Managed Agents vs Agent SDK comparison&lt;/a&gt;, but the YAML workflow is what makes the CLI my preferred interface.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;According to the &lt;a href="https://platform.claude.com/docs/en/api/sdks/cli" rel="noopener noreferrer"&gt;official CLI docs&lt;/a&gt;, Anthropic designed the YAML workflow specifically for GitOps-style agent management. If you're already doing infrastructure as code, this slots right in.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  ant CLI vs curl vs SDK - Why Use the CLI?
&lt;/h2&gt;

&lt;p&gt;You can hit the Managed Agents API three ways: raw HTTP with curl, a language SDK (Python, TypeScript, Go, etc.), or the ant CLI. Each has its place.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;curl&lt;/th&gt;
&lt;th&gt;ant CLI&lt;/th&gt;
&lt;th&gt;Python SDK&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Setup time&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;2 minutes&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JSON body authoring&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;td&gt;Typed flags / YAML&lt;/td&gt;
&lt;td&gt;Typed objects&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auto-pagination&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;File references&lt;/td&gt;
&lt;td&gt;Manual base64&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;@path&lt;/code&gt; syntax&lt;/td&gt;
&lt;td&gt;File objects&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Response filtering&lt;/td&gt;
&lt;td&gt;Pipe to jq&lt;/td&gt;
&lt;td&gt;&lt;code&gt;--transform&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Shell scripting&lt;/td&gt;
&lt;td&gt;Verbose&lt;/td&gt;
&lt;td&gt;Ergonomic&lt;/td&gt;
&lt;td&gt;Requires Python&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CI/CD fit&lt;/td&gt;
&lt;td&gt;OK&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best for&lt;/td&gt;
&lt;td&gt;Quick tests&lt;/td&gt;
&lt;td&gt;Ops / automation&lt;/td&gt;
&lt;td&gt;App integration&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The ant CLI sits in a sweet spot. It's faster than writing curl commands by hand (no JSON body construction, no header management), and lighter than pulling in a full SDK when you just want to script some agent operations. For anything that lives in a shell script or CI workflow, it's the right tool.&lt;/p&gt;

&lt;p&gt;If you're building an application that embeds agent interactions - a web app, a Slack bot, a data pipeline - use the SDK. The ant CLI is for the operational layer: provisioning agents, rotating credentials, monitoring sessions, deploying config changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scripting and Automation Patterns
&lt;/h2&gt;

&lt;p&gt;Here are a few patterns I've found useful when automating agent workflows with the ant CLI.&lt;/p&gt;

&lt;h3&gt;
  
  
  Extract IDs from Create Commands
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-euo&lt;/span&gt; pipefail

&lt;span class="c"&gt;# Create agent and capture the ID&lt;/span&gt;
&lt;span class="nv"&gt;AGENT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;ant beta:agents create &lt;span class="se"&gt;\&lt;/span&gt;
  &amp;lt; agents/reviewer.agent.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--transform&lt;/span&gt; &lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="nt"&gt;--format&lt;/span&gt; raw&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Create environment and capture the ID&lt;/span&gt;
&lt;span class="nv"&gt;ENV_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;ant beta:environments create &lt;span class="se"&gt;\&lt;/span&gt;
  &amp;lt; agents/reviewer.environment.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--transform&lt;/span&gt; &lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="nt"&gt;--format&lt;/span&gt; raw&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Agent: &lt;/span&gt;&lt;span class="nv"&gt;$AGENT_ID&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Environment: &lt;/span&gt;&lt;span class="nv"&gt;$ENV_ID&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# Store for later use&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"AGENT_ID=&lt;/span&gt;&lt;span class="nv"&gt;$AGENT_ID&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; .env.agents
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"ENV_ID=&lt;/span&gt;&lt;span class="nv"&gt;$ENV_ID&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; .env.agents
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  GitHub Actions Deployment
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploy Agents&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;main&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;agents/**'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install ant CLI&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;curl -fsSL \&lt;/span&gt;
            &lt;span class="s"&gt;"https://github.com/anthropics/anthropic-cli/releases/download/v1.2.1/ant_1.2.1_linux_amd64.tar.gz" \&lt;/span&gt;
            &lt;span class="s"&gt;| sudo tar -xz -C /usr/local/bin ant&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Update agent config&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.ANTHROPIC_API_KEY }}&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;ant beta:agents update \&lt;/span&gt;
            &lt;span class="s"&gt;--agent-id "${{ vars.AGENT_ID }}" \&lt;/span&gt;
            &lt;span class="s"&gt;--version "${{ vars.AGENT_VERSION }}" \&lt;/span&gt;
            &lt;span class="s"&gt;&amp;lt; agents/reviewer.agent.yaml&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  List All Agents and Environments
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# List agents in a readable table&lt;/span&gt;
ant beta:agents list &lt;span class="nt"&gt;--format&lt;/span&gt; yaml

&lt;span class="c"&gt;# List environments with just names and IDs&lt;/span&gt;
ant beta:environments list &lt;span class="nt"&gt;--transform&lt;/span&gt; &lt;span class="s2"&gt;"data.#.{id,name}"&lt;/span&gt; &lt;span class="nt"&gt;--format&lt;/span&gt; yaml

&lt;span class="c"&gt;# Check session status&lt;/span&gt;
ant beta:sessions retrieve &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--session-id&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$SESSION_ID&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--transform&lt;/span&gt; status &lt;span class="nt"&gt;--format&lt;/span&gt; raw
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--transform&lt;/code&gt; flag accepts full GJSON path syntax. You can filter arrays, project specific fields, and even do conditional extraction. It's much cleaner than piping to &lt;code&gt;jq&lt;/code&gt; for simple extractions, though for complex transformations I still reach for jq.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Tools Can Managed Agents Use?
&lt;/h2&gt;

&lt;p&gt;When you include &lt;code&gt;{"type": "agent_toolset_20260401"}&lt;/code&gt; in your agent config, it gets access to a standard set of tools: bash, read, write, edit, glob, grep, and web_fetch. All are enabled by default.&lt;/p&gt;

&lt;p&gt;You can selectively disable tools you don't want the agent to have. For a read-only code review agent, you might disable write and edit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# readonly-reviewer.agent.yaml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Read-Only Reviewer&lt;/span&gt;
&lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;
&lt;span class="na"&gt;system&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Review code without modifying it.&lt;/span&gt;
&lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;agent_toolset_20260401&lt;/span&gt;
    &lt;span class="na"&gt;configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;
        &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;edit&lt;/span&gt;
        &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;web_fetch&lt;/span&gt;
        &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or flip the default and whitelist only what you need:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;agent_toolset_20260401&lt;/span&gt;
    &lt;span class="na"&gt;default_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
    &lt;span class="na"&gt;configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bash&lt;/span&gt;
        &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;read&lt;/span&gt;
        &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Agents can also connect to external MCP servers for tools beyond the built-in set. If you've built a custom MCP server, a managed agent can use it by adding an &lt;code&gt;mcp_servers&lt;/code&gt; block to the agent config.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is the ant CLI from Anthropic?
&lt;/h3&gt;

&lt;p&gt;The ant CLI is Anthropic's official command-line client for the Claude API. Written in Go, it provides a resource-based command structure for managing agents, environments, and sessions. It supports typed flags, YAML input, auto-pagination, and multiple output formats including an interactive TUI explorer.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I install the ant CLI on macOS?
&lt;/h3&gt;

&lt;p&gt;Install via Homebrew: run &lt;code&gt;brew install anthropics/tap/ant&lt;/code&gt;, then clear the macOS quarantine flag with &lt;code&gt;xattr -d com.apple.quarantine "$(brew --prefix)/bin/ant"&lt;/code&gt;. Set your &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt; environment variable and verify with &lt;code&gt;ant --version&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the difference between the ant CLI and Claude Code?
&lt;/h3&gt;

&lt;p&gt;Claude Code is an interactive agentic coding assistant that runs in your terminal and uses a subscription. The ant CLI is a programmatic API client for managing Managed Agents resources, uses an API key, and is built for scripting and CI/CD automation. They're complementary - Claude Code can even shell out to &lt;code&gt;ant&lt;/code&gt; commands.&lt;/p&gt;

&lt;h3&gt;
  
  
  How much does it cost to run a managed agent session?
&lt;/h3&gt;

&lt;p&gt;Sessions cost $0.08 per session-hour, billed to the millisecond. Idle time is free. You also pay standard Claude API token rates on top. A typical 1-hour coding session with Opus costs roughly $0.70 total including both tokens and session runtime.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I version control agents with the ant CLI?
&lt;/h3&gt;

&lt;p&gt;Yes. Define agents as YAML files (e.g. &lt;code&gt;reviewer.agent.yaml&lt;/code&gt;), check them into Git, and deploy via CI. Use &lt;code&gt;ant beta:agents create&lt;/code&gt; to create from YAML and &lt;code&gt;ant beta:agents update&lt;/code&gt; with the version flag to push updates. This gives you full GitOps for agent configurations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can managed agents connect to MCP servers?
&lt;/h3&gt;

&lt;p&gt;Yes. Agents support remote MCP server connections via the &lt;code&gt;--mcp-server&lt;/code&gt; flag. You specify the server URL and name, then add an &lt;code&gt;mcp_toolset&lt;/code&gt; tool entry referencing that server. This lets agents use tools from GitHub, Slack, or custom MCP servers you've built.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I use the ant CLI in CI/CD pipelines?
&lt;/h3&gt;

&lt;p&gt;Define agents and environments as YAML files in your repo. In CI, use &lt;code&gt;ant beta:agents create &amp;lt; agent.yaml&lt;/code&gt; to provision and &lt;code&gt;ant beta:agents update&lt;/code&gt; to deploy changes. The &lt;code&gt;--transform&lt;/code&gt; flag extracts IDs for scripting, and &lt;code&gt;--format&lt;/code&gt; controls output parsing.&lt;/p&gt;

&lt;h3&gt;
  
  
  What tools are available to managed agents?
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;agent_toolset_20260401&lt;/code&gt; built-in toolset includes bash, read, write, edit, glob, grep, and web_fetch. You can enable or disable individual tools, or disable all by default and whitelist specific ones. Agents can also connect to external MCP servers for custom tool integrations.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Read the full tutorial&lt;/strong&gt; with interactive code examples and component-based layout on the original post: &lt;a href="https://avinashsangle.com/blog/ant-cli-getting-started" rel="noopener noreferrer"&gt;Getting Started with the ant CLI on avinashsangle.com&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>ai</category>
      <category>devops</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Claude Code Cost Tracking: Monitor and Cut Your Spending</title>
      <dc:creator>Avinash Sangle</dc:creator>
      <pubDate>Fri, 17 Apr 2026 05:04:30 +0000</pubDate>
      <link>https://dev.to/aavisangle/claude-code-cost-tracking-monitor-and-cut-your-spending-4cge</link>
      <guid>https://dev.to/aavisangle/claude-code-cost-tracking-monitor-and-cut-your-spending-4cge</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This article was originally published on &lt;a href="https://avinashsangle.com/blog/claude-code-cost-tracking" rel="noopener noreferrer"&gt;avinashsangle.com&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  How Much Does Claude Code Actually Cost?
&lt;/h2&gt;

&lt;p&gt;The pricing structure is straightforward. Claude Code Pro runs $20 per month (or $17 annually). The Max plan comes in two tiers: $100/month for 5x the Pro usage allowance, and $200/month for 20x. If you are on the API, you pay per token - Sonnet 4.6 at $3/$15 per million input/output tokens, and Opus 4.6 at $15/$75.&lt;/p&gt;

&lt;p&gt;Across enterprise deployments, the average lands between $150 and $250 per developer per month, according to Anthropic's published benchmarks. Ninety percent of users stay under $12 per day. But that top 10% can burn through tokens fast, especially with extended thinking enabled and Opus as the default model.&lt;/p&gt;

&lt;p&gt;The real issue? Tracking is scattered. Subscription users can't see dollar costs in the Console. API users get billing data but not per-session breakdowns. And everyone has local JSONL files sitting on their machine that most people don't even know exist.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Track costs with built-in commands:&lt;/strong&gt; &lt;code&gt;/cost&lt;/code&gt; for API users, &lt;code&gt;/stats&lt;/code&gt; for subscribers, &lt;code&gt;/usage&lt;/code&gt; for rate limit status&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Find your hidden usage data:&lt;/strong&gt; Claude Code logs every session to &lt;code&gt;~/.claude/projects/&lt;/code&gt; as JSONL files with full token counts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use third-party tools for real visibility:&lt;/strong&gt; ccusage (4.8k GitHub stars) gives you daily, monthly, and per-session cost reports&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cut costs by 50% with 7 practical changes:&lt;/strong&gt; default to Sonnet, cap thinking tokens, clear context between tasks, and write specific prompts&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Built-In Cost Tracking Commands You Should Know
&lt;/h2&gt;

&lt;p&gt;Claude Code ships with three commands for checking usage. Which one you should use depends on whether you are paying through the API or a subscription plan.&lt;/p&gt;

&lt;h3&gt;
  
  
  /cost - Session API Spend
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;/cost&lt;/code&gt; command shows your current session's token usage and estimated dollar cost. Designed for API users. Subscription users still see token counts, which is useful for understanding consumption patterns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Total cost:            $0.55
Total duration (API):  6m 19.7s
Total duration (wall): 6h 33m 10.2s
Total code changes:    127 lines added, 43 lines removed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  /stats - Subscriber Usage Dashboard
&lt;/h3&gt;

&lt;p&gt;If you are on Pro or Max, &lt;code&gt;/stats&lt;/code&gt; opens a dashboard with a usage heatmap, session counts, token totals by model, and activity streaks. No dollar costs (flat-rate plan), but you see exactly how much of your allowance you are burning.&lt;/p&gt;

&lt;h3&gt;
  
  
  /usage - Rate Limit Status
&lt;/h3&gt;

&lt;p&gt;Shows your plan limits and current rate limit status. Check this when Claude Code feels slow or you suspect throttling. Shows both 5-hour and 1-week usage windows.&lt;/p&gt;

&lt;h3&gt;
  
  
  Status Line Configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Show cost in the status line (API users)&lt;/span&gt;
claude config &lt;span class="nb"&gt;set &lt;/span&gt;status_line.show_cost &lt;span class="nb"&gt;true&lt;/span&gt;

&lt;span class="c"&gt;# Show token count in the status line&lt;/span&gt;
claude config &lt;span class="nb"&gt;set &lt;/span&gt;status_line.show_tokens &lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;When to use which:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;API users:&lt;/strong&gt; Use &lt;code&gt;/cost&lt;/code&gt; for dollar amounts and &lt;code&gt;/usage&lt;/code&gt; for rate limits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pro/Max subscribers:&lt;/strong&gt; Use &lt;code&gt;/stats&lt;/code&gt; for usage patterns and &lt;code&gt;/usage&lt;/code&gt; for rate limits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Everyone:&lt;/strong&gt; Configure the status line for passive monitoring&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Where Claude Code Stores Your Usage Data
&lt;/h2&gt;

&lt;p&gt;Every session gets logged to your local filesystem as JSONL files. These contain detailed token counts for every API call - input tokens, output tokens, cache creation tokens, cache read tokens, and the model used. This is the same data third-party tools read to build their dashboards.&lt;/p&gt;

&lt;h3&gt;
  
  
  Session Logs
&lt;/h3&gt;

&lt;p&gt;Claude Code writes one JSONL file per session to &lt;code&gt;~/.claude/projects/&lt;/code&gt;. If you are on a subscription plan, these local logs are the only way to get granular cost data since the Console doesn't expose it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Find your session logs&lt;/span&gt;
&lt;span class="nb"&gt;ls&lt;/span&gt; ~/.claude/projects/

&lt;span class="c"&gt;# Look at the most recent session&lt;/span&gt;
&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-lt&lt;/span&gt; ~/.claude/projects/ | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-5&lt;/span&gt;

&lt;span class="c"&gt;# Count tokens in a session with jq&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; ~/.claude/projects/&amp;lt;session-file&amp;gt;.jsonl | &lt;span class="se"&gt;\&lt;/span&gt;
  jq &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="s1"&gt;'[.[].message.usage // empty] |
    { total_input: (map(.input_tokens) | add),
      total_output: (map(.output_tokens) | add),
      cache_read: (map(.cache_read_input_tokens // 0) | add),
      cache_creation: (map(.cache_creation_input_tokens // 0) | add) }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Status Line Snapshots
&lt;/h3&gt;

&lt;p&gt;Second file most people miss: &lt;code&gt;~/.claude/statusline.jsonl&lt;/code&gt;. Contains periodic snapshots with server-reported cumulative cost and your 5-hour and 1-week rate-limit usage percentages. This data is only in this local file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# View recent status line snapshots&lt;/span&gt;
&lt;span class="nb"&gt;tail&lt;/span&gt; &lt;span class="nt"&gt;-5&lt;/span&gt; ~/.claude/statusline.jsonl | jq &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;span class="c"&gt;# Extract cost progression over time&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; ~/.claude/statusline.jsonl | &lt;span class="se"&gt;\&lt;/span&gt;
  jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'[.timestamp, .cost_usd] | @csv'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Third-Party Tools for Claude Code Usage Analytics
&lt;/h2&gt;

&lt;p&gt;The built-in commands give a snapshot. For real visibility into trends, per-project breakdowns, and forecasting, you need more.&lt;/p&gt;

&lt;h3&gt;
  
  
  ccusage - The Most Popular Option
&lt;/h3&gt;

&lt;p&gt;4,800+ GitHub stars. CLI that reads your local JSONL files and produces clean tables with daily, monthly, or per-session cost breakdowns. Tracks cache tokens separately, supports billing window analysis, works offline with cached pricing data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install and run - no setup needed&lt;/span&gt;
npx ccusage              &lt;span class="c"&gt;# Daily report (default)&lt;/span&gt;
npx ccusage daily        &lt;span class="c"&gt;# Detailed daily breakdown&lt;/span&gt;
npx ccusage monthly      &lt;span class="c"&gt;# Monthly aggregated totals&lt;/span&gt;
npx ccusage session      &lt;span class="c"&gt;# Cost per conversation session&lt;/span&gt;
npx ccusage blocks       &lt;span class="c"&gt;# 5-hour billing window analysis&lt;/span&gt;

&lt;span class="c"&gt;# Filter by project&lt;/span&gt;
npx ccusage &lt;span class="nt"&gt;--instances&lt;/span&gt;  &lt;span class="c"&gt;# Group usage by project&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  claude-usage - Local Web Dashboard
&lt;/h3&gt;

&lt;p&gt;Reads the same local log files but renders them as charts with cost estimates, session timelines, and model breakdowns. Pro and Max subscribers get a progress bar for their allowance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Claude-Code-Usage-Monitor - Real-Time Alerts
&lt;/h3&gt;

&lt;p&gt;Real-time chart of token consumption with predictions about when you will hit your limits. Good for Max plan users who want early warnings before getting throttled.&lt;/p&gt;

&lt;h3&gt;
  
  
  ccost - Per-Request Granularity
&lt;/h3&gt;

&lt;p&gt;Analyzes per-request JSONL logs with detailed token counts using LiteLLM pricing data. Use when you want to know exactly which requests were the most expensive.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Interface&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;th&gt;GitHub Stars&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ccusage&lt;/td&gt;
&lt;td&gt;CLI&lt;/td&gt;
&lt;td&gt;Daily/monthly reports, billing windows&lt;/td&gt;
&lt;td&gt;4,800+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;claude-usage&lt;/td&gt;
&lt;td&gt;Web dashboard&lt;/td&gt;
&lt;td&gt;Visual charts, subscriber progress&lt;/td&gt;
&lt;td&gt;1,200+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Usage-Monitor&lt;/td&gt;
&lt;td&gt;CLI (real-time)&lt;/td&gt;
&lt;td&gt;Limit predictions, early warnings&lt;/td&gt;
&lt;td&gt;500+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ccost&lt;/td&gt;
&lt;td&gt;CLI&lt;/td&gt;
&lt;td&gt;Per-request cost analysis&lt;/td&gt;
&lt;td&gt;200+&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  How to Set a Budget Limit for Claude Code
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Per-Command Budget Cap
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;--max-budget-usd&lt;/code&gt; flag caps the maximum dollar amount for a single print-mode command. Useful in CI/CD pipelines or automated scripts where a runaway agent could burn through tokens.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Cap a single command at $5&lt;/span&gt;
claude &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="nt"&gt;--max-budget-usd&lt;/span&gt; 5.00 &lt;span class="s2"&gt;"Refactor the auth module"&lt;/span&gt;

&lt;span class="c"&gt;# Combine with max-turns for double protection&lt;/span&gt;
claude &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="nt"&gt;--max-budget-usd&lt;/span&gt; 10.00 &lt;span class="nt"&gt;--max-turns&lt;/span&gt; 5 &lt;span class="s2"&gt;"Fix failing tests in src/"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Workspace Rate Limits for Teams
&lt;/h3&gt;

&lt;p&gt;Claude Code creates a workspace called "Claude Code" when you first authenticate with Console. Set rate limits on this workspace in the Console's Limits page to cap Claude Code's share of your API allocation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Agent SDK Cost Tracking
&lt;/h3&gt;

&lt;p&gt;If you are building on the Claude Agent SDK, every result message includes a &lt;code&gt;total_cost_usd&lt;/code&gt; field.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;query&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@anthropic-ai/claude-agent-sdk&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;totalSpend&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;prompts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Read the files in src/ and summarize the architecture&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;List all exported functions in src/auth.ts&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;

&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;prompt&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;prompts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;await &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;message&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;prompt&lt;/span&gt; &lt;span class="p"&gt;}))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;result&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;totalSpend&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;total_cost_usd&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`This call: $&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;total_cost_usd&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Total spend: $&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;totalSpend&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toFixed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  7 Ways to Cut Claude Code Costs by 50%
&lt;/h2&gt;

&lt;p&gt;After tracking my spending for a few weeks, I identified the patterns that were burning tokens fastest. These seven changes brought my daily average from ~$12 down to $5-6, with zero quality loss.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Default to Sonnet, Switch to Opus Only When Needed
&lt;/h3&gt;

&lt;p&gt;Sonnet 4.6 costs $3/$15 per million input/output tokens. Opus 4.6 costs $15/$75. That's 5x more expensive. For most coding tasks, Sonnet produces results that are just as good.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Switch models on the fly&lt;/span&gt;
/model sonnet    &lt;span class="c"&gt;# For everyday tasks&lt;/span&gt;
/model opus      &lt;span class="c"&gt;# For complex reasoning only&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Set MAX_THINKING_TOKENS to 10,000
&lt;/h3&gt;

&lt;p&gt;Extended thinking is the single biggest cost lever. Uncapped thinking tokens can generate tens of thousands of tokens per request. A 10,000 limit still gives Claude enough room to reason.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Set thinking token limit&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;MAX_THINKING_TOKENS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;10000

&lt;span class="c"&gt;# Or lower the effort level for simple tasks&lt;/span&gt;
/effort low       &lt;span class="c"&gt;# Significant token savings&lt;/span&gt;
/effort medium    &lt;span class="c"&gt;# Balance of cost and quality&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Use /clear Between Tasks
&lt;/h3&gt;

&lt;p&gt;Stale context is a silent cost multiplier. Every message includes the full conversation history as input tokens. Run &lt;code&gt;/clear&lt;/code&gt; when you switch to unrelated work. Use &lt;code&gt;/rename&lt;/code&gt; first if you want to come back to the session later with &lt;code&gt;/resume&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Use /compact When Context Grows
&lt;/h3&gt;

&lt;p&gt;If you are mid-task and can't clear, use &lt;code&gt;/compact&lt;/code&gt; to summarize the conversation history. Reduces token count while preserving important context.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Write Specific Prompts
&lt;/h3&gt;

&lt;p&gt;Vague prompts are expensive. "Make this better" forces Claude to spend tokens figuring out what you want. "Extract the hardcoded strings in src/auth.js into constants" gets the job done in one pass.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Use Plan Mode Before Expensive Operations
&lt;/h3&gt;

&lt;p&gt;Press Shift+Tab twice to enter plan mode before starting a big task. Claude outlines its approach before writing code. Costs a few hundred tokens for the plan but saves thousands by preventing costly rework.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Break Work Into Scoped Sessions
&lt;/h3&gt;

&lt;p&gt;One session for everything is the most expensive way to use Claude Code. Context accumulates, cache misses increase, and irrelevant history gets sent with every request. Work in task-scoped sessions: one for fixing the login bug, another for adding the new API endpoint, a third for writing tests.&lt;/p&gt;

&lt;h2&gt;
  
  
  Claude Code API vs Subscription: Which Costs Less?
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Usage Profile&lt;/th&gt;
&lt;th&gt;API Cost/Month&lt;/th&gt;
&lt;th&gt;Best Plan&lt;/th&gt;
&lt;th&gt;Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Light (1-2 hrs/day)&lt;/td&gt;
&lt;td&gt;$30-50/mo&lt;/td&gt;
&lt;td&gt;API or Pro ($20)&lt;/td&gt;
&lt;td&gt;Pay-per-use wins&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Moderate (3-5 hrs/day)&lt;/td&gt;
&lt;td&gt;$100-180/mo&lt;/td&gt;
&lt;td&gt;Max 5x ($100)&lt;/td&gt;
&lt;td&gt;Up to 44% savings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Heavy (6+ hrs/day)&lt;/td&gt;
&lt;td&gt;$200-400/mo&lt;/td&gt;
&lt;td&gt;Max 20x ($200)&lt;/td&gt;
&lt;td&gt;Up to 50% savings&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The API makes more sense with sporadic usage or when you need fine-grained budget controls like &lt;code&gt;--max-budget-usd&lt;/code&gt;. It's also the only option for per-project cost allocation when billing clients. The subscription wins on predictability.&lt;/p&gt;

&lt;p&gt;My approach: Max 5x plan for day-to-day, API key configured for automated scripts and CI pipelines where I want hard budget caps. Hybrid setup gives predictable costs for interactive work and strict controls for automation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How do I check my Claude Code costs?
&lt;/h3&gt;

&lt;p&gt;Use &lt;code&gt;/cost&lt;/code&gt; in any session for API spend totals with token counts and dollar estimates. Subscribers should use &lt;code&gt;/stats&lt;/code&gt; for a usage dashboard with heatmaps and model breakdowns, or &lt;code&gt;/usage&lt;/code&gt; for rate limit status. You can also configure the status line to show costs continuously.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where does Claude Code store usage data locally?
&lt;/h3&gt;

&lt;p&gt;Claude Code writes one JSONL file per session to &lt;code&gt;~/.claude/projects/&lt;/code&gt; with full token counts for every API call. It also writes periodic snapshots to &lt;code&gt;~/.claude/statusline.jsonl&lt;/code&gt; containing cumulative cost and rate-limit usage percentages. Third-party tools like ccusage read these files.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is ccusage and how do I use it?
&lt;/h3&gt;

&lt;p&gt;ccusage is an open-source CLI tool with 4,800+ GitHub stars that analyzes Claude Code usage from local JSONL files. Run &lt;code&gt;npx ccusage&lt;/code&gt; for a daily report, &lt;code&gt;npx ccusage monthly&lt;/code&gt; for monthly totals, or &lt;code&gt;npx ccusage session&lt;/code&gt; to see costs per conversation. Works offline with cached pricing data.&lt;/p&gt;

&lt;h3&gt;
  
  
  How much does Claude Code cost per day on average?
&lt;/h3&gt;

&lt;p&gt;Anthropic reports the average at about $6 per developer per day, with 90% of users under $12 per day. Enterprise deployments average $150 to $250 per developer per month. Heavy Opus sessions with extended thinking can spike past $20 in a single day.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I set a budget limit for Claude Code API usage?
&lt;/h3&gt;

&lt;p&gt;Use &lt;code&gt;--max-budget-usd&lt;/code&gt; in print mode to cap spending per command: &lt;code&gt;claude -p --max-budget-usd 5.00 "your prompt"&lt;/code&gt;. For team-wide limits, set workspace rate limits in the Claude Console. You can also use &lt;code&gt;--max-turns&lt;/code&gt; to indirectly limit costs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is Claude Code Max plan worth it vs API pricing?
&lt;/h3&gt;

&lt;p&gt;If your API equivalent spend exceeds $100/month, the Max 5x plan at $100/month saves money. If you spend over $200/month on API, Max 20x is the better deal. For sporadic usage under $50/month, pay-per-token API pricing usually costs less overall.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does prompt caching reduce Claude Code costs?
&lt;/h3&gt;

&lt;p&gt;Claude Code automatically caches repeated content like system prompts and CLAUDE.md files. Cached tokens cost 90% less than fresh input tokens. The cache has a 5-minute TTL, so keeping sessions under 5 minutes apart maximizes savings. Track cache hit rates in local JSONL logs.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Read the full version&lt;/strong&gt; (with extra examples and updates) on the original post: &lt;a href="https://avinashsangle.com/blog/claude-code-cost-tracking" rel="noopener noreferrer"&gt;Claude Code Cost Tracking on avinashsangle.com&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>ai</category>
      <category>devops</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Claude Managed Agents vs Agent SDK: Which Should You Use?</title>
      <dc:creator>Avinash Sangle</dc:creator>
      <pubDate>Tue, 14 Apr 2026 11:38:39 +0000</pubDate>
      <link>https://dev.to/aavisangle/claude-managed-agents-vs-agent-sdk-which-should-you-use-4112</link>
      <guid>https://dev.to/aavisangle/claude-managed-agents-vs-agent-sdk-which-should-you-use-4112</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This article was originally published on &lt;a href="https://avinashsangle.com/blog/claude-managed-agents" rel="noopener noreferrer"&gt;avinashsangle.com&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Anthropic launched &lt;strong&gt;Claude Managed Agents&lt;/strong&gt; in beta on April 8, 2026. It's a hosted service that runs long-horizon Claude agents in Anthropic's infrastructure - sandboxed, persistent, and integrated with MCP servers out of the box.&lt;/p&gt;

&lt;p&gt;If you're choosing between Managed Agents and the Agent SDK, the short answer is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pick &lt;strong&gt;Managed Agents&lt;/strong&gt; for multi-hour production workloads&lt;/li&gt;
&lt;li&gt;Pick the &lt;strong&gt;Agent SDK&lt;/strong&gt; when you need full control over the runtime&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's the breakdown after digging through the docs and API.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Managed Agents&lt;/strong&gt; = Anthropic runs the agent harness, sandbox, and runtime for you (hosted, beta)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent SDK&lt;/strong&gt; = you run the same engine yourself, with full control over infrastructure&lt;/li&gt;
&lt;li&gt;Pricing: standard token rates + &lt;strong&gt;$0.08 per session-hour&lt;/strong&gt; of active runtime + $10 per 1,000 web searches&lt;/li&gt;
&lt;li&gt;Early adopters: Notion, Rakuten, Asana - focused on long-running enterprise workflows&lt;/li&gt;
&lt;li&gt;Beta header required: &lt;code&gt;managed-agents-2026-04-01&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Core Difference
&lt;/h2&gt;

&lt;p&gt;Think of it like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Managed Agents = Vercel&lt;/strong&gt; (hosted, opinionated, pay-per-use)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent SDK = self-hosted Next.js&lt;/strong&gt; (you run it on your infra)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Same underlying engine. Different operational trade-offs.&lt;/p&gt;

&lt;p&gt;Managed Agents handles the agent loop, sandboxed code execution, file system access, web browsing, persistent sessions, and checkpointing for you. You send a prompt, connect your MCP servers, and the agent runs - even for hours - without you maintaining any of that runtime infrastructure.&lt;/p&gt;

&lt;p&gt;The Agent SDK exposes the same engine for self-hosted runtimes. You get local file access, private network connectivity, custom tool execution, and full runtime control. No session-hour charges - just token costs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pricing
&lt;/h2&gt;

&lt;p&gt;Managed Agents pricing on top of standard Claude API rates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;$0.08 per session-hour&lt;/strong&gt; of active runtime&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;$10 per 1,000 web searches&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Idle time is free&lt;/strong&gt; - sessions can wait for input without billing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a 2-hour research task, you're looking at roughly &lt;strong&gt;$0.16 in compute&lt;/strong&gt; plus token costs. For zero infrastructure management, that's a strong tradeoff for production workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Pick Which
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Pick Managed Agents when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have multi-hour production workloads (research, batch processing, monitoring)&lt;/li&gt;
&lt;li&gt;You need sandboxed code execution out of the box&lt;/li&gt;
&lt;li&gt;Web browsing + MCP integrations matter&lt;/li&gt;
&lt;li&gt;You don't want to build or maintain agent infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pick Agent SDK when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need local file access (working against repos)&lt;/li&gt;
&lt;li&gt;Private network access required&lt;/li&gt;
&lt;li&gt;Custom tool execution logic&lt;/li&gt;
&lt;li&gt;You want predictable token-only costs without session-hour pricing&lt;/li&gt;
&lt;li&gt;Development and debugging - the SDK lets you inspect everything&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What You Get Out of the Box with Managed Agents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Sandboxed containers with code execution, file system, and web access&lt;/li&gt;
&lt;li&gt;Sessions can run for hours with &lt;strong&gt;checkpointing&lt;/strong&gt; for fault tolerance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP server support&lt;/strong&gt; - any MCP server you've built for Claude Desktop or Claude Code can be configured for a Managed Agent session&lt;/li&gt;
&lt;li&gt;Built-in web browsing and search&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Beta Status
&lt;/h2&gt;

&lt;p&gt;Managed Agents is currently in beta. All endpoints require the beta header:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;anthropic-beta: managed-agents-2026-04-01
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The official Anthropic SDKs set this automatically when you use the beta namespace. Some features like multi-agent orchestration remain in limited research preview.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;Notion, Rakuten, and Asana are early adopters - all using Managed Agents for enterprise workflows where the agent needs to run for extended periods, integrate with internal tools via MCP, and survive infrastructure failures.&lt;/p&gt;

&lt;p&gt;This is Anthropic moving up the value chain: instead of just selling the model, they're selling the complete runtime that wraps it. For teams without dedicated AI infrastructure, that's a meaningful shift.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Read the full deep-dive&lt;/strong&gt; with code examples, pricing math, and a decision flowchart on the original post: &lt;a href="https://avinashsangle.com/blog/claude-managed-agents" rel="noopener noreferrer"&gt;Claude Managed Agents vs Agent SDK on avinashsangle.com&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>ai</category>
      <category>agents</category>
      <category>mcp</category>
    </item>
  </channel>
</rss>
