DEV Community: Avinash Sangle

Gemini 3.5 Flash for Agentic Coding: A Claude Coder's Guide

Avinash Sangle — Mon, 01 Jun 2026 05:06:59 +0000

This article was originally published on avinashsangle.com.

Gemini 3.5 Flash is Google's new Flash-tier coding model, generally available since May 19, 2026. It scores 76.2% on Terminal-Bench 2.1 and 83.6% on MCP Atlas, beating Gemini 3.1 Pro on 11 of 15 benchmarks. Pricing is $1.50 input and $9 output per 1M tokens. For Claude Code users, it's the right model for tool-heavy agent loops, not a replacement for production code edits.

TL;DR

What it is: Gemini 3.5 Flash (GA May 19, 2026) is a Flash-tier model that outperforms Gemini 3.1 Pro on agentic benchmarks while costing 25% less per token than the Pro tier.
Pricing reality: $1.50/$9 per 1M tokens looks cheap, but it's 3x the price of Gemini 3 Flash Preview and runs about 5.5x more expensive per full benchmark suite according to Artificial Analysis.
The thinking_level trap: the default dropped from high to medium. Copy-pasted code from gemini-3-flash-preview silently produces dumber outputs. For agentic coding, set thinking_level: "low" explicitly.
Where Flash wins: MCP tool orchestration (83.6% MCP Atlas, beats Claude Opus 4.7 by 4.5 points), parallel function calling, fast iterative agent loops.
Where Claude Code still wins: production codebase editing (Sonnet 4.6 leads SWE-Bench Verified), defensive code, long-context retrieval past 128k tokens.
Routing rule: keep Claude Code for Edit and Write tasks; route MCP-heavy planning and tool fan-out to Gemini 3.5 Flash via OpenRouter or a thin custom MCP server.

What is Gemini 3.5 Flash and what changed on May 19, 2026

Gemini 3.5 Flash is a Flash-tier Gemini model that Google announced at I/O 2026 and shipped straight to GA on the same day. It is the first Flash-tier model to outperform the previous Pro tier on real agentic coding benchmarks. The launch lives on the official Google blog and the technical details on the Google DeepMind model card.

The model is available on the Gemini API, AI Studio, Antigravity CLI (the successor to Gemini CLI), Vertex AI, the Gemini app, AI Mode in Search, and now GitHub Copilot per the May 19 changelog. The context window is 1,048,576 input tokens with a 65,536 output cap.

Why this matters for a Claude Code user: the cheap model is now smart enough to handle production agent loops. That changes routing math, not loyalty. If you already run Sonnet 4.6 or Opus 4.7 inside Claude Code, you don't throw the stack away. You ask which subtasks now belong on a cheaper, faster Gemini call.

Gemini 3.5 Flash benchmarks: where it beats Gemini 3.1 Pro

Gemini 3.5 Flash wins 11 of 15 published benchmarks against Gemini 3.1 Pro, including the ones that matter most for agentic coding. The headline numbers from the Google DeepMind model card and the WaveSpeed roundup are below.

Benchmark	Gemini 3.5 Flash	Gemini 3.1 Pro	Claude Opus 4.7	GPT-5.5
Terminal-Bench 2.1	76.2%	70.3%	n/a	78.2%
MCP Atlas	83.6%	78.2%	79.1%	75.3%
GDPval-AA (Elo)	1656	1314	n/a	1769
SWE-Bench Pro	55.1%	n/a	64.3%	n/a
ARC-AGI-2	72.1%	~77%	n/a	84.6%
128k retrieval	-7.6 pts vs 3.1 Pro	baseline	strong	strong

The single most important number on that table for Claude Code users is the 83.6% MCP Atlas score. MCP Atlas measures how reliably a model chains multi-step tool calls without stalling on a malformed or out-of-order call. For anyone running an MCP-heavy stack, that score predicts task-completion rate more directly than SWE-bench does. The current Flash score beats Claude Opus 4.7 by 4.5 points and GPT-5.5 by 8.3 points.

The honest other side: Gemini 3.5 Flash regresses 7.6 points on 128k-token retrieval versus Gemini 3.1 Pro, and gives up 5 points on ARC-AGI-2 versus the prior Pro tier (12.5 points to GPT-5.5). If you have a million-token context refactor, or a problem that looks like ARC-style abstract reasoning, Flash is the wrong answer.

Gemini 3.5 Flash pricing: cheap per token, expensive per task

Gemini 3.5 Flash is $1.50 per 1M input tokens, $9 per 1M output tokens, and $0.15 per 1M cached input tokens (see OpenRouter for live pricing). On its face the Flash tier looks cheap. Per task it is not.

Simon Willison's May 19, 2026 analysis cites Artificial Analysis benchmark-suite costs: running their full evaluation cost $1,551.60 on Gemini 3.5 Flash versus $892.28 on Gemini 3.1 Pro. Cheaper per token, more expensive per workload, because thinking tokens persist across turns and agent loops chew more output tokens. NxCode reports a similar multiplier: roughly 9x the cost of gemini-3-flash on equivalent eval jobs ($1,552 vs $278).

The pricing comparison that matters for routing:

Model	Input ($/1M)	Output ($/1M)	Cached input ($/1M)
Gemini 3.5 Flash	$1.50	$9.00	$0.15
Gemini 3.1 Pro	$2.50	$15.00	-
Gemini 3 Flash Preview (deprecated)	$0.50	$3.00	-
Claude Sonnet 4.6	$3.00	$15.00	$0.30
Claude Opus 4.7	$5.00	$25.00	$0.50
GPT-5.5	$1.25	$10.00	-

One trap to call out before the next section. GitHub Copilot launched Gemini 3.5 Flash with a 14x premium-request multiplier (GitHub Changelog, May 19 2026). A 300-request Copilot Pro quota becomes about 21 Flash calls before overage. If you already have Claude Code and an OpenRouter or AI Studio API key, calling Flash directly at roughly $0.015 per call is almost always cheaper than burning Copilot quota.

The thinking_level default trap that breaks copy-pasted code

Google replaced the integer thinking_budget parameter with a string enum thinking_level and quietly dropped the default from high to medium. Code copy-pasted from gemini-3-flash-preview still runs, but it produces measurably worse outputs unless you set the new field. The official notes live on Google AI Developers - What's new in Gemini 3.5.

The four values are minimal, low, medium (new default), and high. Google retuned low specifically for coding and tool-calling workloads. For agent loops with MCP tools, thinking_level: "low" is faster, cheaper, and on coding benchmarks roughly equivalent to medium. For hard reasoning, set high.

Before and after diff

# Before - gemini-3-flash-preview
from google import genai
from google.genai import types

config = types.GenerateContentConfig(
    thinking_config=types.ThinkingConfig(thinking_budget=-1),  # was "dynamic" / high
    temperature=0.2,                                            # ignored by 3.5
    top_p=0.95,                                                 # ignored by 3.5
)

# After - gemini-3.5-flash, explicit and tuned for agent loops
from google import genai
from google.genai import types

config = types.GenerateContentConfig(
    thinking_config=types.ThinkingConfig(thinking_level="low"),  # for MCP agent loops
    # for hard reasoning tasks, use thinking_level="high"
    # for latency-sensitive work, use thinking_level="minimal"
)

Two cleanup notes from the migration. temperature, top_p, and top_k are no longer recommended controls in the new SDK profile. Leaving them in your config is not an error, but they are silently ignored - delete them so the next reader of your code doesn't assume they still work. And inspect response.usage_metadata on your first run: thinking tokens now persist across multi-turn conversations, and the per-task token count for an agent loop can climb 30 to 50 percent versus the preview model.

Gemini 3.5 Flash vs Claude Code (Sonnet 4.6, Opus 4.7) for coding

The short version: Flash wins agent orchestration and MCP tool chains. Claude Code wins repo-level edits and defensive code generation. Pick by task, not by model loyalty.

Task type	Best model	Reason
MCP tool orchestration, parallel function calling	Gemini 3.5 Flash	83.6% MCP Atlas, ~289 tok/sec, $1.50 input
Multi-file refactor in a real repo	Claude Sonnet 4.6 in Claude Code	Default Claude Code model; strong SWE-Bench Verified
ARC-style abstract reasoning	Claude Opus 4.7 or GPT-5.5	Flash gives up 5 pts ARC-AGI-2 vs prior Pro
Long-context retrieval beyond 128k	Gemini 3.1 Pro or Sonnet 4.6 (1M ctx)	Flash regresses 7.6 pts on 128k retrieval
Cheap intermediate planning inside an agent	Gemini 3.5 Flash	Cached input at $0.15/1M is the lowest among frontier models
Production code review with defensive patches	Claude Sonnet 4.6	Anthropic models add error handling more naturally

The defensive-code observation isn't hand-wavy. Multiple head-to-head reviews this month converge on the same pattern. MindStudio and BuildFastWithAI both report that Claude Opus 4.7 anticipates edge cases and adds error handling more naturally, while Gemini 3.5 Flash produces more concise code that occasionally skips defensive patterns. That maps to my own experience: I trust Sonnet 4.6 to write production patches; I lean on Flash to coordinate the 30 tool calls that fetch the inputs.

When to route tasks from Claude Code to Gemini 3.5 Flash

My default: I keep Claude Code with Sonnet 4.6 as the editor for anything that touches the repo. The Edit, Write, Glob, and Grep tools stay where they are. That is the production path and it doesn't need a different model today.

Where I route to Gemini 3.5 Flash is the supporting cast of tasks around the editor:

MCP-heavy planning subtasks where an agent fans out 10 to 100 tool calls to query an API, hit a database, or coordinate with another agent. The 83.6% MCP Atlas score shows up here as fewer retries and fewer stalled tool calls.
Long-running background tasks where speed beats defensive depth: linting summaries, log triage, doc generation, scheduled cron-style agents. Flash's ~289 tok/sec output throughput is roughly 4x what Opus 4.7 delivers.
Cheap intermediate planning steps inside a larger agent loop where Sonnet 4.6 is overkill. Use Flash to pick which tool to call next, then hand control back to Sonnet for the actual code change.
Parallel sub-agent fan-out like the 93 parallel agents in Antigravity's demo described in the NxCode developer guide. Cached input pricing at $0.15/1M makes the fan-out economically viable.

Three ways I actually route

OpenRouter as a routing proxy. Configure Claude Code or any Claude SDK call to dispatch specific tool calls to google/gemini-3.5-flash on OpenRouter. You keep one API key, one billing surface, and you can swap models without code changes.
A thin custom MCP server that wraps client.models.generate_content with gemini-3.5-flash as an exposed tool, then mount it inside Claude Code via ~/.claude.json.
Antigravity CLI for hybrid teams. If your team already migrated from Gemini CLI to agy, Flash is the default model. Use Antigravity for parallel agents and keep Claude Code as your primary editor.

Build an MCP agent with Gemini 3.5 Flash in 40 lines of Python

The Google GenAI SDK has native MCP support. You hand the SDK a connected MCP ClientSession, and it auto-executes tool calls and feeds the responses back to the model in a loop until the agent finishes. The official reference lives on Google AI Developers - Function calling.

Install the SDKs

pip install "google-genai>=2.0" "mcp>=1.4"
export GEMINI_API_KEY="your-key-from-aistudio"

Working agent example

The script below connects to an MCP server, hands the session to Gemini 3.5 Flash with thinking_level="low", and runs a real triage prompt. Replace your_mcp_server with the module path to whatever MCP server you already run.

import asyncio
from google import genai
from google.genai import types
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main() -> None:
    server = StdioServerParameters(
        command="python",
        args=["-m", "your_mcp_server"],
    )

    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            client = genai.Client()
            response = await client.aio.models.generate_content(
                model="gemini-3.5-flash",
                contents=(
                    "Triage the 5 most recent open PRs in this repo. "
                    "For each, return: PR number, risk score (low/med/high), "
                    "and a one-line reason. Use the tools available."
                ),
                config=types.GenerateContentConfig(
                    thinking_config=types.ThinkingConfig(thinking_level="low"),
                    tools=[session],  # SDK auto-executes MCP tool calls
                ),
            )

            print(response.text)
            print(response.usage_metadata)


if __name__ == "__main__":
    asyncio.run(main())

Why every choice is what it is

thinking_level="low": Google retuned low for code and tool-calling. It is faster, cheaper, and on coding benchmarks comparable to medium. The default medium would quietly inflate cost without improving the tool-call sequence.
tools=[session]: the SDK accepts an MCP ClientSession directly. It introspects the server's tool list, calls each tool when the model requests it, matches the FunctionResponse by id and name, and continues the loop until the model stops asking for tool calls.
response.usage_metadata: log this on every run. Inspect ThoughtsTokenCount. Thinking tokens persist across turns and can inflate input costs 30 to 50 percent on long agent loops.
No temperature, no top_p: these parameters are silently ignored in Gemini 3.5. Leaving them in your config will confuse the next person to read it.

Gemini 3.5 Flash in Antigravity, GitHub Copilot, and the raw API

Flash ships across four meaningful surfaces. The right one depends on what you already pay for and how you build.

Surface	Cost model	Best for
Raw Gemini API	$1.50 / $9 per 1M (cached $0.15)	Custom agents, MCP servers, routing layers
Antigravity CLI (agy)	Free weekly cap, Pro $19.99/mo, Ultra $249.99/mo	Hybrid teams on Google's stack
GitHub Copilot	14x premium-request multiplier	Existing Copilot users with light volume
OpenRouter	$1.50 / $9 per 1M + small markup	Routing inside Claude Code or multi-model proxies

One opinionated note: for a Claude Code user with even one active OpenRouter or AI Studio key, raw API plus OpenRouter is almost always cheaper than burning Copilot quota at the 14x multiplier. If you don't already pay for Copilot, the decision is easy. If you do, do the math once on your own workload before changing anything.

Limitations and gotchas

The honest list. None of these are deal-breakers, but each one is worth knowing before you swap an existing agent over.

No Computer Use yet. Flash doesn't drive a browser. For browser-driving agents, use a Pro-tier Gemini or Claude with Computer Use.
Knowledge cutoff January 2025. Tool-augmented prompts and web search are the standard workarounds for fresh facts.
Text-only output. Multimodal input works. Output is text only - no image or audio generation.
128k retrieval regressed. If you have million-token contexts and need exact-recall retrieval at scale, Sonnet 4.6 with its 1M context or Gemini 3.1 Pro are stronger picks.
Thought-token inflation. Thinking tokens persist across multi-turn conversations and can inflate input costs 30 to 50 percent on agent loops. Track ThoughtsTokenCount from response.usage_metadata.
thinking_level: medium is the silent default. Set it explicitly in every config. The previous high default is gone.
TPU capacity hiccups. Multiple developers reported 503 errors during the first week. Build retry-with-backoff into any production caller.

Frequently Asked Questions

What is Gemini 3.5 Flash?

Gemini 3.5 Flash is Google's Flash-tier coding and agent model, generally available since May 19, 2026. It ships across the Gemini API, AI Studio, Antigravity CLI, Vertex AI, GitHub Copilot, and the Gemini app. It beats Gemini 3.1 Pro on 11 of 15 published agent benchmarks while pricing at $1.50 input and $9 output per 1M tokens.

How much does Gemini 3.5 Flash cost per 1M tokens?

Gemini 3.5 Flash costs $1.50 per 1M input tokens, $9 per 1M output tokens, and $0.15 per 1M cached input tokens. That is 25 percent cheaper than Gemini 3.1 Pro, but 3x the price of the Gemini 3 Flash Preview it replaces and 6x the price of Gemini 3.1 Flash-Lite.

Is Gemini 3.5 Flash better than Gemini 3.1 Pro?

On agent benchmarks, yes. Gemini 3.5 Flash beats Gemini 3.1 Pro on Terminal-Bench 2.1 (76.2 vs 70.3), MCP Atlas (83.6 vs 78.2), and GDPval-AA Elo (1656 vs 1314). It regresses on 128k-token retrieval by 7.6 points and ARC-AGI-2 by 5 points, so long-context or pure reasoning work still wants Pro.

How does Gemini 3.5 Flash compare to Claude Code for coding?

Flash leads MCP tool orchestration at 83.6 percent MCP Atlas, beating Claude Opus 4.7 by 4.5 points. Claude Sonnet 4.6 still leads production code editing on SWE-Bench Verified and is the default model in Claude Code. The practical answer is to route: Claude Code for repository edits, Gemini 3.5 Flash for tool-heavy agent loops.

What is the thinking_level default in Gemini 3.5 Flash and why does it matter?

Google replaced the integer thinking_budget with a string enum thinking_level and dropped the default from high to medium. Copy-pasting code from gemini-3-flash-preview silently produces worse outputs. For agentic coding with MCP tools, set thinking_level: "low". For hard reasoning, set high.

Can Gemini 3.5 Flash call MCP tools?

Yes. The Google GenAI SDK has built-in MCP support that auto-executes tool calls and feeds responses back in a loop until the agent finishes. Gemini 3.5 Flash scored 83.6 percent on MCP Atlas, the benchmark that measures multi-step tool-call reliability. It is currently the strongest published score on that benchmark among major frontier models.

Why is Gemini 3.5 Flash 3x more expensive than Gemini 3 Flash Preview?

Google retuned Flash to handle frontier-grade agent loops and is pricing it accordingly. Simon Willison observed all three major labs probing API price tolerance at the same time. Artificial Analysis reported their benchmark suite cost $1,551.60 on Gemini 3.5 Flash versus $892.28 on Gemini 3.1 Pro. Cheaper per token, more expensive per workload.

What is the GitHub Copilot premium multiplier for Gemini 3.5 Flash?

GitHub Copilot launched Gemini 3.5 Flash with a 14x premium-request multiplier across Copilot Pro, Pro Plus, Business, and Enterprise plans. A 300-request monthly quota becomes about 21 Gemini 3.5 Flash calls before overage. For most Claude Code users, calling the raw API through OpenRouter or AI Studio is cheaper than burning Copilot quota.

Should I switch from Claude Code to Gemini 3.5 Flash?

Not as a wholesale swap. Claude Code with Sonnet 4.6 is still the strongest tool for production repository edits and long-context refactors. Gemini 3.5 Flash is the right routing target for MCP-heavy agent loops, parallel sub-agent fan-out, and cheap intermediate planning steps. The high-leverage move is a hybrid stack, not a switch.

How do I call Gemini 3.5 Flash from a Python script?

Install the google-genai SDK, set GEMINI_API_KEY, and call client.models.generate_content with model gemini-3.5-flash. Set thinking_level explicitly via ThinkingConfig. Drop temperature, top_p, and top_k from your config. For MCP, pass the session object into the tools list.

Claude Managed Agents Outcomes: Auto-Grading Agent Work

Avinash Sangle — Wed, 27 May 2026 10:31:51 +0000

This article was originally published on avinashsangle.com.

Claude Managed Agents Outcomes is a public-beta feature, launched on May 6, 2026, that lets you hand the agent a rubric and have a separate grader model check every draft against it. If the grader returns needs_revision, the gaps flow back to the writer for another pass, up to max_iterations (default 3, max 20). Same hosted harness, no human in the loop.

TL;DR

Outcomes is a rubric-graded iteration loop built into the Managed Agents harness. You send one event, user.define_outcome, and the agent works until the grader says satisfied or hits max_iterations.
A separate grader (same model and tools as the writer, fresh context window) evaluates every draft. Its feedback is the only signal the writer gets back on each revision.
Anthropic's internal benchmarks report up to +10 points overall task success, +10.1% on .pptx generation, and +8.4% on .docx (Anthropic, May 2026).
The cost trap is the iteration count, not a per-outcome fee. Each revision multiplies writer plus grader tokens against the same $0.08-per-session-hour line item from the underlying Managed Agents pricing.

What Are Claude Managed Agents Outcomes?

Outcomes is the part of Claude Managed Agents that lets the agent verify its own work. Instead of running until it self-assesses as done, the session runs against a markdown rubric, and a second Claude (the grader) inspects each artifact with no access to the writer's reasoning. Anthropic launched Outcomes in public beta on May 6, 2026, alongside two sibling features: dreaming (research preview) and multiagent orchestration (public beta).

The framing matters. Anthropic describes it as "agents do their best work when they know what 'good' looks like - a structural framework, a presentation standard, or a set of requirements" (Anthropic blog, May 6 2026). The earlier Managed Agents flow asked you to write transcripts and review output yourself. Outcomes replaces that loop with a grader process and a rubric, so the agent keeps iterating without paging a human.

On the launch list, three companies were explicitly named as production users of Outcomes: Harvey (legal document drafting), Spiral by Every (writing quality against editorial principles), and Wisedocs (document quality checks against internal guidelines). Per Anthropic's internal benchmarks, the loop lifts task success rates by up to 10 percentage points over a standard prompting loop, with the largest gains on the hardest tasks (MindStudio, 2026).

The beta header is managed-agents-2026-04-01. Every Managed Agents API call carries it, and the official SDKs set it for you when you pass betas=BETAS. If you forget the header on a raw HTTP call, the session API returns 400 before you even get to the outcome event.

How the Outcome Grader Works

The flow is small and predictable. You create an environment and a writer agent. You start a session and send one event, user.define_outcome, carrying the task description and the rubric. The writer drafts. After each writer turn, the harness emits span.outcome_evaluation_start and spins up a grader in a fresh context window. The grader reads only the rubric, inspects the artifact (it has the same model and tools as the writer), and emits span.outcome_evaluation_end with a verdict. If the verdict is needs_revision, the explanation flows back into the writer's next turn.

Two design choices make this useful rather than gimmicky. First, the grader runs with no visibility into the writer's internal reasoning, so it cannot be talked into approving an artifact that does not meet the rubric. Second, the grader re-checks the full artifact on every iteration, not the diff, so a fix that breaks a previously-passing criterion gets caught on the next round. The official define-outcomes reference states this plainly: the grader uses a separate context window to avoid being influenced by the main agent's implementation choices.

The benchmark numbers are useful context. On Anthropic's internal eval set, file generation specifically saw +8.4% on .docx outputs and +10.1% on .pptx outputs over a standard prompting loop (Anthropic, May 2026). Those are not headline-chart numbers; they are the difference between a slide deck that ships and one that doesn't. The gain is largest on the hardest tasks, which fits the pattern: easy work looks fine on the first pass anyway.

Writing a Rubric the Grader Will Actually Enforce

The rubric is the only lever you have on the grader. The default failure mode is a grader that approves everything, and the reason is almost always vague criteria. The Anthropic docs are blunt about it: structure the rubric as explicit, gradeable criteria, such as the CSV contains a price column with numeric values rather than the data looks good. The grader scores each criterion independently, so vague criteria produce noisy evaluations (Define outcomes, Anthropic).

A working rubric has five properties. Each criterion is checkable by the grader using its tools. The target is the artifact's structure and completeness, not a fact the grader cannot independently confirm. The rubric anticipates shortcuts (for example, blocks corroboration via search snippets and mirrors when you want a primary source). It mandates a feedback format so you can parse the explanation downstream. And it tells the grader what to ignore, so you do not burn iterations on style nits.

Anthropic ships a working DCF model rubric on the docs page. It's worth reading because it shows what "explicit and gradeable" looks like in practice:

# DCF Model Rubric

## Revenue Projections
- Uses historical revenue data from the last 5 fiscal years
- Projects revenue for at least 5 years forward
- Growth rate assumptions are explicitly stated and reasonable

## Cost Structure
- COGS and operating expenses are modeled separately
- Margins are consistent with historical trends or deviations are justified

## Discount Rate
- WACC is calculated with stated assumptions for cost of equity and cost of debt
- Beta, risk-free rate, and equity risk premium are sourced or justified

## Terminal Value
- Uses either perpetuity growth or exit multiple method (stated which)
- Terminal growth rate does not exceed long-term GDP growth

## Output Quality
- All figures are in a single .xlsx file with clearly labeled sheets
- Key assumptions are on a separate "Assumptions" sheet
- Sensitivity analysis on WACC and terminal growth rate is included

Notice what the rubric does not say. It never asks the grader to verify that the input revenue figures are factually accurate. The grader has no way to confirm that a 2023 revenue number is real without going off and looking it up, and even if it did, you cannot easily test that part of the work. The rubric checks that history was used, not that the numbers are true. That is the right line.

If you do not have a rubric and you are starting from scratch, the docs offer a bootstrap trick worth stealing: hand Claude a known-good artifact and ask it to write the rubric. The output is usually better than the rubric you would have written from a blank page, because it can name what makes the good artifact good. I run this once per document type and keep the rubric in a markdown file uploaded via the Files API with the files-api-2025-04-14 beta. That way you can pass rubric: {type: "file", file_id: ...} and reuse it across sessions.

Define an Outcome: Python Code Walkthrough

The setup is three calls plus one event. Create the environment. Create the writer agent with whatever tools the task needs. Create the session. Send a single user.define_outcome event carrying a description string and the rubric, and the writer starts on receipt. No separate user.message event is needed to kick it off.

import anthropic
from pathlib import Path

BETAS = ["managed-agents-2026-04-01", "files-api-2025-04-14"]
MODEL = "claude-opus-4-7"

client = anthropic.Anthropic()

# 1. Environment - the sandbox the agent runs in
env = client.beta.environments.create(
    name="research-brief",
    config={"type": "anthropic_cloud", "networking": {"type": "unrestricted"}},
    betas=BETAS,
)

# 2. Writer agent - same model and tools the grader will use
writer = client.beta.agents.create(
    name="Research Analyst",
    model=MODEL,
    system=(
        "You are a research analyst. You write one-page business briefs. "
        "Cite every factual claim with an inline footnote [n]."
    ),
    tools=[
        {
            "type": "agent_toolset_20260401",
            "configs": [
                {"name": "web_search"},
                {"name": "web_fetch"},
                {"name": "read"},
                {"name": "write"},
            ],
        }
    ],
    betas=BETAS,
)

# 3. Upload the rubric once, reuse across sessions
rubric = client.beta.files.upload(file=Path("dcf-rubric.md"), betas=BETAS)
print(f"Uploaded rubric: {rubric.id}")

# 4. Session + the one event that starts everything
session = client.beta.sessions.create(
    agent={"type": "agent", "id": writer.id, "version": writer.version},
    environment_id=env.id,
    title="Brief: EV fast-charging unit economics",
    betas=BETAS,
)

client.beta.sessions.events.send(
    session.id,
    betas=BETAS,
    events=[
        {
            "type": "user.define_outcome",
            "description": "Build a one-page business brief on EV fast-charging unit economics in .docx",
            "rubric": {"type": "file", "file_id": rubric.id},
            # or inline: {"type": "text", "content": RUBRIC_MD},
            "max_iterations": 5,  # optional, default 3, max 20
        }
    ],
)

The rubric field accepts either inline text or a file reference. For one-off notebook work I keep the rubric inline as a long string, because the round-trip is faster and the rubric is right there in the source. For anything I run more than once I upload it once and pass the file_id, so updates to the rubric do not require re-pasting it everywhere.

Two notes on the agent definition that bite people. The grader uses the same model and the same tools as the writer agent. If the writer has read and write, so does the grader, and the grader can open every file the writer produced. If you scope the writer too tightly (no read, for example) the grader will not be able to verify the artifact and you will get noisy verdicts. Give the grader the tools it needs to confirm what the rubric demands.

Grader Feedback: The Five Result States

Every grader pass ends with a span.outcome_evaluation_end event. The result field on that event takes one of five values and tells you exactly what the harness will do next. Memorize this table once and you will save yourself a lot of stream-parsing.

result	What happens next
`satisfied`	All criteria met. Session transitions to `idle`.
`needs_revision`	Writer starts another iteration with the grader's explanation as feedback.
`max_iterations_reached`	No further evaluation. Writer may run one final revision before the session goes idle.
`failed`	Rubric fundamentally does not match the task. Session goes idle.
`interrupted`	A `user.interrupt` event landed mid-evaluation. You can start a new outcome.

In practice you watch the stream and react to two events: span.outcome_evaluation_start tells you the writer finished a draft, and span.outcome_evaluation_end carries the verdict. A heartbeat event, span.outcome_evaluation_ongoing, fires while the grader works, but the grader's internal reasoning is opaque - you see that it is working, not what it is thinking.

TERMINAL = {"satisfied", "max_iterations_reached", "failed", "interrupted"}

with client.beta.sessions.events.stream(session.id, betas=BETAS) as stream:
    for ev in stream:
        if ev.type == "span.outcome_evaluation_start":
            print(f"[iter {ev.iteration}] grader evaluating draft...")
        elif ev.type == "span.outcome_evaluation_end":
            print(f"[iter {ev.iteration}] result: {ev.result}")
            print(ev.explanation)  # per-criterion feedback
            if ev.result in TERMINAL:
                break

# After the loop, fetch deliverables from /mnt/session/outputs/
files = client.beta.files.list(scope_id=session.id, betas=BETAS)
for f in files.data:
    client.beta.files.download(f.id, betas=BETAS).write_to_file(f.filename)

Output files live at /mnt/session/outputs/ inside the container and you fetch them via the Files API with scope_id=session.id. The grader's explanation field is the part you actually want to log for postmortems - it's the verbatim feedback the writer used for the next pass, so if a session looped to max_iterations_reached, that field tells you what the grader kept catching.

Tuning max_iterations vs Fixing the Rubric

max_iterations defaults to 3 and the cap is 20. The cookbook recommends starting at 5 for strict rubrics. The mistake I see most is people raising the cap when they should be rewriting the rubric. There's a simple decision rule that catches the difference.

Log every iteration's explanation field and look at the failures across passes. If the grader is flagging the same criterion every time and the writer is not closing it, the rubric is the problem - either the criterion is unverifiable, or the grader and writer are interpreting it differently. Raise the cap and you just pay for more iterations of the same loop. If the grader is flagging different criteria each pass, with the failures converging on the last unsolved item, the rubric is fine and you need a higher cap. That is real progress and another iteration will close it out.

The other anti-patterns are easier to spot once you know what to look for. A rubric that prescribes specific steps instead of describing the goal will over-constrain the writer, and the grader will mark novel approaches as failed. A description and rubric that contradict each other returns result: failed on the first pass, before any work is done - check the explanation, it is usually unambiguous about which one is wrong. A single criterion that packs four ideas together produces noisy per-criterion verdicts because the grader cannot tell which of the four is failing on a given draft.

Treat max_iterations as a circuit breaker, not a knob. Set it once based on how strict your rubric is, and let repeated max_iterations_reached events tell you when the rubric needs work. Raising the cap from 5 to 20 to mask a bad rubric doubles your token spend and surfaces nothing useful.

What Outcomes Actually Cost

Outcomes does not add a separate per-outcome fee. The cost driver is iteration count: every revision adds writer tokens plus grader tokens and keeps the same Managed Agents $0.08-per-session-hour clock running. There is no standalone grader bill or rubric bill. There is just more of the same line items.

Worked example for a research brief task. The writer takes about ten minutes per draft. The grader takes about a minute per evaluation. A session that goes two iterations to satisfied runs roughly 22 minutes of wall-clock session time. At $0.08 per hour, that is about $0.029 in session-hours. The token spend is whatever the writer and grader cost across two passes (typically the dominant line in this kind of work). For comparison, a manual human review of the same brief at, say, $25 per round, blows past the entire outcome-driven session cost on the first review.

Two cost levers actually move the bill. First, max_iterations. A run that loops six times when three would have done it doubles the writer plus grader tokens. Track the average iteration count per task type and tune accordingly. Second, the grader's tools. The grader uses whatever the writer agent was created with - if you gave the writer web_search and the rubric does not require cross-checking, you are paying for grader web searches it does not need. Strip unused tools from the writer config and the grader stops calling them.

For broader cost-tracking patterns across Claude Code and Managed Agents work, my Claude Code cost tracking post covers the JSONL logs and ccusage workflow I run weekly. The same approach works for Managed Agents sessions: dump the events stream to a file per session and roll up iteration counts and token usage from the usage field on span.outcome_evaluation_end.

Outcomes vs LLM-as-Judge vs Codex /goal

LLM-as-judge is a category in 2026, not a single product. Tools like Galileo, DeepEval, Langfuse, and G-Eval all let you score agent or model output against a rubric using an LLM, and they do it well. Strong LLM judges in current research achieve roughly 80% agreement with human evaluators, matching human-to-human consistency on many quality dimensions (Galileo, 2026). What you get from those tools is the score. What you do with it is up to you.

What Outcomes adds is the wiring. The grader runs inside the harness, the explanation flows back into the writer's next turn without any code on your side, and the iteration loop stops when the grader is satisfied. With a standalone judge, you build that loop yourself: capture the score, decide if it is good enough, format the gaps as a prompt, and restart the agent. That wiring is the difference between a one-off evaluation script and a self-correcting agent in production.

On the OpenAI side, Codex /goal is the closest analogue. Both attach a success target to an autonomous run. The difference is the verdict shape. Outcomes leans on a markdown rubric and natural-language gap explanations. Codex /goal leans on verifier scripts and structured pass-fail signals, which works well for code where you can run tests. Practitioners comparing them note that /goal fits programmatic tasks better, while Outcomes fits qualitative artifacts (documents, decks, prose) better (Developers Digest, 2026). They are not interchangeable, they target different shapes of work.

Pick Outcomes if you live in the Managed Agents harness already and your artifacts are document-like. Pick a standalone judge if you want to evaluate offline across a corpus, or you need cross-provider scoring. Pick Codex /goal if your success criterion is "does the test suite pass" and you are already on OpenAI.

Frequently Asked Questions

What are Claude Managed Agents Outcomes?

Outcomes is a public-beta Managed Agents feature launched May 6, 2026. You attach a markdown rubric to a session via a user.define_outcome event, and a separate grader model evaluates each draft in its own context window. If the grader returns needs_revision, the feedback goes back to the writer for another iteration.

How does the Claude outcome grader work?

The grader runs in a fresh context window using the same model and tools as the writer agent. It reads only the rubric, inspects the artifact, and returns a per-criterion verdict on every iteration. Its reasoning is opaque, but its explanation field carries the gaps that the writer must close on the next pass.

How do I write a good rubric for Claude Outcomes?

Use explicit, gradeable criteria like "the CSV has a numeric price column," not vibes like "the data looks good." Anchor the rubric in verifiable structure and completeness, anticipate shortcuts the writer might take, mandate a feedback format, and tell the grader what to ignore so it does not thrash on style nits.

What is the default value of max_iterations in Claude Outcomes?

The max_iterations field defaults to 3 and accepts values up to 20. For strict rubrics, the Anthropic cookbook recommends starting at 5. If the loop hits the cap with the same failures every iteration, the rubric is wrong; if it hits the cap with failures that converge, raise the cap instead of rewriting.

What are the result states of a Claude outcome evaluation?

Five values appear on span.outcome_evaluation_end.result: satisfied (criteria met, session goes idle), needs_revision (writer starts another pass), max_iterations_reached (one final revision allowed before idle), failed (rubric contradicts the task description), and interrupted (a user.interrupt event landed mid-evaluation).

How much do Claude Outcomes cost on top of session-hours?

Outcomes has no separate per-outcome fee. The real cost driver is iterations: each revision adds writer tokens plus grader tokens and keeps the $0.08-per-session-hour clock running. A 20-minute session that iterates twice still bills around $0.027 in session-hours, plus the writer-and-grader tokens for both rounds.

Can I use Claude Outcomes with the Agent SDK or only Managed Agents?

Outcomes is a Managed Agents feature. The grader, the iteration loop, and the span.outcome_evaluation_* events all live in the hosted harness. If you run the Agent SDK locally, you can still build an LLM-as-judge yourself with a separate Anthropic API call, but the wiring back to the writer is on you.

What is the difference between Claude Outcomes and Codex /goal?

Both attach a success target to an autonomous agent run. Outcomes uses a rubric plus a separate grader and feeds the gaps back as natural-language revision notes. OpenAI's Codex /goal favors verifier scripts and structured pass-fail signals. Outcomes leans qualitative, /goal leans test-driven, and the runtime substrates differ.

If you're still deciding between Managed Agents and the Agent SDK, start with Claude Managed Agents vs Agent SDK. And if you're running agents in CI, my prompt-injection defense guide for GitHub Actions applies to outcome-driven sessions too.

Getting Started with the ant CLI: Deploy Claude Agents

Avinash Sangle — Wed, 22 Apr 2026 05:25:26 +0000

This article was originally published on avinashsangle.com.

The ant CLI is Anthropic's official command-line client for the Claude API, and it's the fastest way to create, configure, and manage cloud-hosted agents without writing application code. From install to a running managed agent in under 10 minutes.

TL;DR

The ant CLI is Anthropic's official Go-based CLI for the Claude API, launched April 2026. It manages agents, environments, and sessions from your terminal.
Install on macOS with brew install anthropics/tap/ant. Linux and Go installs are also supported.
Define agents as YAML files, check them into Git, and deploy through CI - full GitOps for your agent configs.
Sessions cost $0.08/hour (billed to the millisecond) plus standard Claude token rates. Idle time is free.

What Is the ant CLI?

The ant CLI shipped alongside Claude Managed Agents on April 8, 2026, and it's built specifically for developers who want to create, configure, and run cloud-hosted agents without writing wrapper code. The GitHub repo already has over 300 stars in its first ten days.

It follows a resource-based command structure: ant [resource] <command> [flags...]. Think of it like kubectl for Claude agents. You can pipe YAML into it, extract fields with GJSON transforms, and chain commands in shell scripts. If you've worked with any modern infrastructure CLI, the patterns will feel familiar.

One thing to clarify early: the ant CLI and Claude Code solve different problems. Claude Code is your interactive coding assistant in the terminal - you talk to it, it writes code, and you pay through a subscription. The ant CLI is a programmatic API client for managing hosted agent infrastructure. You authenticate with an API key, and you're billed at standard API rates. I use both daily, and they complement each other well. Claude Code even understands how to shell out to ant natively.

How to Install the ant CLI

There are three installation paths depending on your platform. If you're on macOS, Homebrew is the fastest route.

macOS (Homebrew)

# Install from Anthropic's tap
brew install anthropics/tap/ant

# Clear the macOS quarantine flag (required)
xattr -d com.apple.quarantine "$(brew --prefix)/bin/ant"

# Verify
ant --version

That quarantine step trips people up. macOS flags unsigned binaries downloaded by Homebrew, and without clearing it you'll get a "cannot be opened because the developer cannot be verified" error. It's a one-time thing.

Linux / WSL (curl)

VERSION=1.2.1
OS=$(uname -s | tr '[:upper:]' '[:lower:]')
ARCH=$(uname -m | sed -e 's/x86_64/amd64/' -e 's/aarch64/arm64/')

curl -fsSL \
  "https://github.com/anthropics/anthropic-cli/releases/download/v${VERSION}/ant_${VERSION}_${OS}_${ARCH}.tar.gz" \
  | sudo tar -xz -C /usr/local/bin ant

From Source (Go 1.22+)

go install github.com/anthropics/anthropic-cli/cmd/ant@latest
export PATH="$PATH:$(go env GOPATH)/bin"

Set Your API Key

Once installed, set your Anthropic API key. The CLI reads it from the ANTHROPIC_API_KEY environment variable:

export ANTHROPIC_API_KEY="sk-ant-your-key-here"

You can generate an API key from the Anthropic Console. I keep mine in a .env file that my shell sources on startup, but any secret management approach works.

Shell Completions

The ant CLI supports completions for bash, zsh, fish, and PowerShell. For zsh (the default macOS shell):

# Generate zsh completions
ant completion zsh > ~/.zfunc/_ant

# Add to your .zshrc if not already there
echo 'fpath=(~/.zfunc $fpath)' >> ~/.zshrc
echo 'autoload -Uz compinit && compinit' >> ~/.zshrc

Tab completion saves a lot of time when working with the beta: namespaced commands, which can get long.

Core Concepts - Agents, Environments, and Sessions

Before you create anything, it helps to understand how the four core pieces fit together.

Agent - A versioned configuration defining the model, system prompt, tools, and MCP server connections. Think of it as a blueprint. Each update creates a new version, so you can roll back if needed.

Environment - A container template specifying pre-installed packages (pip, npm) and networking rules. Create it once, reference it by ID. Multiple sessions can share one environment config, but each gets its own isolated container.

Session - A running instance that pairs an agent with an environment. It has its own container, filesystem, and conversation history. Sessions are where the actual work happens.

Events - The communication protocol. You send user events (messages, interrupts, tool confirmations) and receive agent events (messages, tool calls, thinking). Everything is event-based and streamable.

The flow works like this: you create an agent (the what), create an environment (the where), start a session linking them together, and then communicate through events. Anthropic handles the container orchestration, tool execution, and conversation state. According to the official docs, sessions cost $0.08 per session-hour billed to the millisecond, and idle time doesn't count.

Creating Your First Agent with the ant CLI

Let's build a simple code review agent. I'll walk through each step so you can see exactly what the CLI does at each stage. All managed agent commands sit under the beta: prefix since the feature is still in beta.

Step 1: Create the Agent

ant beta:agents create \
  --name "Code Reviewer" \
  --model claude-sonnet-4-6 \
  --system "You are a senior code reviewer. Read the code carefully, check for bugs, security issues, and style problems. Be specific about line numbers and provide fix suggestions." \
  --tool '{"type": "agent_toolset_20260401"}'

The response comes back as JSON with the agent ID and version. I like to extract just the ID for scripting:

# Extract the agent ID
AGENT_ID=$(ant beta:agents create \
  --name "Code Reviewer" \
  --model claude-sonnet-4-6 \
  --system "You are a senior code reviewer." \
  --tool '{"type": "agent_toolset_20260401"}' \
  --transform id --format raw)

echo "Created agent: $AGENT_ID"

The --transform flag uses GJSON syntax to pluck a specific field from the response, and --format raw strips the quotes. This is one of the CLI's best features for scripting.

Step 2: Create an Environment

ENV_ID=$(ant beta:environments create \
  --name "python-dev" \
  --pip-packages '["pytest", "ruff", "mypy"]' \
  --networking unrestricted \
  --transform id --format raw)

echo "Created environment: $ENV_ID"

Environments define what's pre-installed in the container. I'm giving this one Python linting tools since it's a code review agent. The unrestricted networking flag lets the agent fetch external resources if needed.

Step 3: Start a Session

SESSION_ID=$(ant beta:sessions create \
  --agent-id "$AGENT_ID" \
  --environment-id "$ENV_ID" \
  --transform id --format raw)

echo "Started session: $SESSION_ID"

Step 4: Send a Message and Stream the Response

# Send a review request
ant beta:sessions:events send \
  --session-id "$SESSION_ID" \
  --type user.message \
  --content-type text \
  --content-text "Review this Python function for bugs:

def divide(a, b):
    return a / b
"

# Stream the agent's response in real-time
ant beta:sessions stream --session-id "$SESSION_ID"

The stream command opens a real-time SSE connection to the session. You'll see the agent's thinking, tool calls (it might run the code through ruff), and its final review - all printed to your terminal as they happen.

Tip: Want to explore the response interactively? Replace --format raw with --format explore on any command to open the TUI explorer. It lets you navigate nested JSON with arrow keys - really useful when debugging agent responses.

YAML Version Control for Agents

This is the ant CLI's best feature, and the one I haven't seen anyone write about yet. Instead of passing flags inline, you can define agents and environments as YAML files, check them into Git, and deploy through your CI pipeline.

# code-reviewer.agent.yaml
name: Code Reviewer
model: claude-sonnet-4-6
system: |
  You are a senior code reviewer. Read the code carefully,
  check for bugs, security issues, and style problems.
  Be specific about line numbers and provide fix suggestions.
tools:
  - type: agent_toolset_20260401
    configs:
      - name: web_fetch
        enabled: false

# code-reviewer.environment.yaml
name: python-dev
pip_packages:
  - pytest
  - ruff
  - mypy
networking: unrestricted

Now you can create the agent directly from the file:

# Create from YAML
ant beta:agents create < code-reviewer.agent.yaml

# Update an existing agent (version is required for safety)
ant beta:agents update \
  --agent-id "$AGENT_ID" \
  --version 1 \
  < code-reviewer.agent.yaml

The versioning requirement matters. When you update an agent, you must pass the current version number. If someone else updated it since you last pulled, the command fails rather than silently overwriting. It's optimistic concurrency control - the same pattern you'd find in Kubernetes or Terraform.

This YAML approach is where the ant CLI really shines for teams. Your agent configs live in the same repo as your application code, go through pull request review, and deploy through the same pipeline. I wrote more about the broader Managed Agents architecture in my Managed Agents vs Agent SDK comparison, but the YAML workflow is what makes the CLI my preferred interface.

According to the official CLI docs, Anthropic designed the YAML workflow specifically for GitOps-style agent management. If you're already doing infrastructure as code, this slots right in.

ant CLI vs curl vs SDK - Why Use the CLI?

You can hit the Managed Agents API three ways: raw HTTP with curl, a language SDK (Python, TypeScript, Go, etc.), or the ant CLI. Each has its place.

Feature	curl	ant CLI	Python SDK
Setup time	None	2 minutes	5 minutes
JSON body authoring	Manual	Typed flags / YAML	Typed objects
Auto-pagination	Manual	Built-in	Built-in
File references	Manual base64	`@path` syntax	File objects
Response filtering	Pipe to jq	`--transform`	Code
Shell scripting	Verbose	Ergonomic	Requires Python
CI/CD fit	OK	Excellent	Good
Best for	Quick tests	Ops / automation	App integration

The ant CLI sits in a sweet spot. It's faster than writing curl commands by hand (no JSON body construction, no header management), and lighter than pulling in a full SDK when you just want to script some agent operations. For anything that lives in a shell script or CI workflow, it's the right tool.

If you're building an application that embeds agent interactions - a web app, a Slack bot, a data pipeline - use the SDK. The ant CLI is for the operational layer: provisioning agents, rotating credentials, monitoring sessions, deploying config changes.

Scripting and Automation Patterns

Here are a few patterns I've found useful when automating agent workflows with the ant CLI.

Extract IDs from Create Commands

#!/bin/bash
set -euo pipefail

# Create agent and capture the ID
AGENT_ID=$(ant beta:agents create \
  < agents/reviewer.agent.yaml \
  --transform id --format raw)

# Create environment and capture the ID
ENV_ID=$(ant beta:environments create \
  < agents/reviewer.environment.yaml \
  --transform id --format raw)

echo "Agent: $AGENT_ID"
echo "Environment: $ENV_ID"

# Store for later use
echo "AGENT_ID=$AGENT_ID" >> .env.agents
echo "ENV_ID=$ENV_ID" >> .env.agents

GitHub Actions Deployment

name: Deploy Agents
on:
  push:
    branches: [main]
    paths: ['agents/**']

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install ant CLI
        run: |
          curl -fsSL \
            "https://github.com/anthropics/anthropic-cli/releases/download/v1.2.1/ant_1.2.1_linux_amd64.tar.gz" \
            | sudo tar -xz -C /usr/local/bin ant

      - name: Update agent config
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          ant beta:agents update \
            --agent-id "${{ vars.AGENT_ID }}" \
            --version "${{ vars.AGENT_VERSION }}" \
            < agents/reviewer.agent.yaml

List All Agents and Environments

# List agents in a readable table
ant beta:agents list --format yaml

# List environments with just names and IDs
ant beta:environments list --transform "data.#.{id,name}" --format yaml

# Check session status
ant beta:sessions retrieve \
  --session-id "$SESSION_ID" \
  --transform status --format raw

The --transform flag accepts full GJSON path syntax. You can filter arrays, project specific fields, and even do conditional extraction. It's much cleaner than piping to jq for simple extractions, though for complex transformations I still reach for jq.

What Tools Can Managed Agents Use?

When you include {"type": "agent_toolset_20260401"} in your agent config, it gets access to a standard set of tools: bash, read, write, edit, glob, grep, and web_fetch. All are enabled by default.

You can selectively disable tools you don't want the agent to have. For a read-only code review agent, you might disable write and edit:

# readonly-reviewer.agent.yaml
name: Read-Only Reviewer
model: claude-sonnet-4-6
system: Review code without modifying it.
tools:
  - type: agent_toolset_20260401
    configs:
      - name: write
        enabled: false
      - name: edit
        enabled: false
      - name: web_fetch
        enabled: false

Or flip the default and whitelist only what you need:

tools:
  - type: agent_toolset_20260401
    default_config:
      enabled: false
    configs:
      - name: bash
        enabled: true
      - name: read
        enabled: true

Agents can also connect to external MCP servers for tools beyond the built-in set. If you've built a custom MCP server, a managed agent can use it by adding an mcp_servers block to the agent config.

Frequently Asked Questions

What is the ant CLI from Anthropic?

The ant CLI is Anthropic's official command-line client for the Claude API. Written in Go, it provides a resource-based command structure for managing agents, environments, and sessions. It supports typed flags, YAML input, auto-pagination, and multiple output formats including an interactive TUI explorer.

How do I install the ant CLI on macOS?

Install via Homebrew: run brew install anthropics/tap/ant, then clear the macOS quarantine flag with xattr -d com.apple.quarantine "$(brew --prefix)/bin/ant". Set your ANTHROPIC_API_KEY environment variable and verify with ant --version.

What is the difference between the ant CLI and Claude Code?

Claude Code is an interactive agentic coding assistant that runs in your terminal and uses a subscription. The ant CLI is a programmatic API client for managing Managed Agents resources, uses an API key, and is built for scripting and CI/CD automation. They're complementary - Claude Code can even shell out to ant commands.

How much does it cost to run a managed agent session?

Sessions cost $0.08 per session-hour, billed to the millisecond. Idle time is free. You also pay standard Claude API token rates on top. A typical 1-hour coding session with Opus costs roughly $0.70 total including both tokens and session runtime.

Can I version control agents with the ant CLI?

Yes. Define agents as YAML files (e.g. reviewer.agent.yaml), check them into Git, and deploy via CI. Use ant beta:agents create to create from YAML and ant beta:agents update with the version flag to push updates. This gives you full GitOps for agent configurations.

Can managed agents connect to MCP servers?

Yes. Agents support remote MCP server connections via the --mcp-server flag. You specify the server URL and name, then add an mcp_toolset tool entry referencing that server. This lets agents use tools from GitHub, Slack, or custom MCP servers you've built.

How do I use the ant CLI in CI/CD pipelines?

Define agents and environments as YAML files in your repo. In CI, use ant beta:agents create < agent.yaml to provision and ant beta:agents update to deploy changes. The --transform flag extracts IDs for scripting, and --format controls output parsing.

What tools are available to managed agents?

The agent_toolset_20260401 built-in toolset includes bash, read, write, edit, glob, grep, and web_fetch. You can enable or disable individual tools, or disable all by default and whitelist specific ones. Agents can also connect to external MCP servers for custom tool integrations.

Read the full tutorial with interactive code examples and component-based layout on the original post: Getting Started with the ant CLI on avinashsangle.com.

Claude Code Cost Tracking: Monitor and Cut Your Spending

Avinash Sangle — Fri, 17 Apr 2026 05:04:30 +0000

This article was originally published on avinashsangle.com.

How Much Does Claude Code Actually Cost?

The pricing structure is straightforward. Claude Code Pro runs $20 per month (or $17 annually). The Max plan comes in two tiers: $100/month for 5x the Pro usage allowance, and $200/month for 20x. If you are on the API, you pay per token - Sonnet 4.6 at $3/$15 per million input/output tokens, and Opus 4.6 at $15/$75.

Across enterprise deployments, the average lands between $150 and $250 per developer per month, according to Anthropic's published benchmarks. Ninety percent of users stay under $12 per day. But that top 10% can burn through tokens fast, especially with extended thinking enabled and Opus as the default model.

The real issue? Tracking is scattered. Subscription users can't see dollar costs in the Console. API users get billing data but not per-session breakdowns. And everyone has local JSONL files sitting on their machine that most people don't even know exist.

TL;DR

Track costs with built-in commands: /cost for API users, /stats for subscribers, /usage for rate limit status
Find your hidden usage data: Claude Code logs every session to ~/.claude/projects/ as JSONL files with full token counts
Use third-party tools for real visibility: ccusage (4.8k GitHub stars) gives you daily, monthly, and per-session cost reports
Cut costs by 50% with 7 practical changes: default to Sonnet, cap thinking tokens, clear context between tasks, and write specific prompts

Built-In Cost Tracking Commands You Should Know

Claude Code ships with three commands for checking usage. Which one you should use depends on whether you are paying through the API or a subscription plan.

/cost - Session API Spend

The /cost command shows your current session's token usage and estimated dollar cost. Designed for API users. Subscription users still see token counts, which is useful for understanding consumption patterns.

Total cost:            $0.55
Total duration (API):  6m 19.7s
Total duration (wall): 6h 33m 10.2s
Total code changes:    127 lines added, 43 lines removed

/stats - Subscriber Usage Dashboard

If you are on Pro or Max, /stats opens a dashboard with a usage heatmap, session counts, token totals by model, and activity streaks. No dollar costs (flat-rate plan), but you see exactly how much of your allowance you are burning.

/usage - Rate Limit Status

Shows your plan limits and current rate limit status. Check this when Claude Code feels slow or you suspect throttling. Shows both 5-hour and 1-week usage windows.

Status Line Configuration

# Show cost in the status line (API users)
claude config set status_line.show_cost true

# Show token count in the status line
claude config set status_line.show_tokens true

When to use which:

API users: Use /cost for dollar amounts and /usage for rate limits
Pro/Max subscribers: Use /stats for usage patterns and /usage for rate limits
Everyone: Configure the status line for passive monitoring

Where Claude Code Stores Your Usage Data

Every session gets logged to your local filesystem as JSONL files. These contain detailed token counts for every API call - input tokens, output tokens, cache creation tokens, cache read tokens, and the model used. This is the same data third-party tools read to build their dashboards.

Session Logs

Claude Code writes one JSONL file per session to ~/.claude/projects/. If you are on a subscription plan, these local logs are the only way to get granular cost data since the Console doesn't expose it.

# Find your session logs
ls ~/.claude/projects/

# Look at the most recent session
ls -lt ~/.claude/projects/ | head -5

# Count tokens in a session with jq
cat ~/.claude/projects/<session-file>.jsonl | \
  jq -s '[.[].message.usage // empty] |
    { total_input: (map(.input_tokens) | add),
      total_output: (map(.output_tokens) | add),
      cache_read: (map(.cache_read_input_tokens // 0) | add),
      cache_creation: (map(.cache_creation_input_tokens // 0) | add) }'

Status Line Snapshots

Second file most people miss: ~/.claude/statusline.jsonl. Contains periodic snapshots with server-reported cumulative cost and your 5-hour and 1-week rate-limit usage percentages. This data is only in this local file.

# View recent status line snapshots
tail -5 ~/.claude/statusline.jsonl | jq .

# Extract cost progression over time
cat ~/.claude/statusline.jsonl | \
  jq -r '[.timestamp, .cost_usd] | @csv'

Third-Party Tools for Claude Code Usage Analytics

The built-in commands give a snapshot. For real visibility into trends, per-project breakdowns, and forecasting, you need more.

ccusage - The Most Popular Option

4,800+ GitHub stars. CLI that reads your local JSONL files and produces clean tables with daily, monthly, or per-session cost breakdowns. Tracks cache tokens separately, supports billing window analysis, works offline with cached pricing data.

# Install and run - no setup needed
npx ccusage              # Daily report (default)
npx ccusage daily        # Detailed daily breakdown
npx ccusage monthly      # Monthly aggregated totals
npx ccusage session      # Cost per conversation session
npx ccusage blocks       # 5-hour billing window analysis

# Filter by project
npx ccusage --instances  # Group usage by project

claude-usage - Local Web Dashboard

Reads the same local log files but renders them as charts with cost estimates, session timelines, and model breakdowns. Pro and Max subscribers get a progress bar for their allowance.

Claude-Code-Usage-Monitor - Real-Time Alerts

Real-time chart of token consumption with predictions about when you will hit your limits. Good for Max plan users who want early warnings before getting throttled.

ccost - Per-Request Granularity

Analyzes per-request JSONL logs with detailed token counts using LiteLLM pricing data. Use when you want to know exactly which requests were the most expensive.

Tool	Interface	Best For	GitHub Stars
ccusage	CLI	Daily/monthly reports, billing windows	4,800+
claude-usage	Web dashboard	Visual charts, subscriber progress	1,200+
Usage-Monitor	CLI (real-time)	Limit predictions, early warnings	500+
ccost	CLI	Per-request cost analysis	200+

How to Set a Budget Limit for Claude Code

Per-Command Budget Cap

The --max-budget-usd flag caps the maximum dollar amount for a single print-mode command. Useful in CI/CD pipelines or automated scripts where a runaway agent could burn through tokens.

# Cap a single command at $5
claude -p --max-budget-usd 5.00 "Refactor the auth module"

# Combine with max-turns for double protection
claude -p --max-budget-usd 10.00 --max-turns 5 "Fix failing tests in src/"

Workspace Rate Limits for Teams

Claude Code creates a workspace called "Claude Code" when you first authenticate with Console. Set rate limits on this workspace in the Console's Limits page to cap Claude Code's share of your API allocation.

Agent SDK Cost Tracking

If you are building on the Claude Agent SDK, every result message includes a total_cost_usd field.

import { query } from "@anthropic-ai/claude-agent-sdk";

let totalSpend = 0;

const prompts = [
  "Read the files in src/ and summarize the architecture",
  "List all exported functions in src/auth.ts"
];

for (const prompt of prompts) {
  for await (const message of query({ prompt })) {
    if (message.type === "result") {
      totalSpend += message.total_cost_usd;
      console.log(`This call: $${message.total_cost_usd}`);
    }
  }
}

console.log(`Total spend: $${totalSpend.toFixed(4)}`);

7 Ways to Cut Claude Code Costs by 50%

After tracking my spending for a few weeks, I identified the patterns that were burning tokens fastest. These seven changes brought my daily average from ~$12 down to $5-6, with zero quality loss.

1. Default to Sonnet, Switch to Opus Only When Needed

Sonnet 4.6 costs $3/$15 per million input/output tokens. Opus 4.6 costs $15/$75. That's 5x more expensive. For most coding tasks, Sonnet produces results that are just as good.

# Switch models on the fly
/model sonnet    # For everyday tasks
/model opus      # For complex reasoning only

2. Set MAX_THINKING_TOKENS to 10,000

Extended thinking is the single biggest cost lever. Uncapped thinking tokens can generate tens of thousands of tokens per request. A 10,000 limit still gives Claude enough room to reason.

# Set thinking token limit
export MAX_THINKING_TOKENS=10000

# Or lower the effort level for simple tasks
/effort low       # Significant token savings
/effort medium    # Balance of cost and quality

3. Use /clear Between Tasks

Stale context is a silent cost multiplier. Every message includes the full conversation history as input tokens. Run /clear when you switch to unrelated work. Use /rename first if you want to come back to the session later with /resume.

4. Use /compact When Context Grows

If you are mid-task and can't clear, use /compact to summarize the conversation history. Reduces token count while preserving important context.

5. Write Specific Prompts

Vague prompts are expensive. "Make this better" forces Claude to spend tokens figuring out what you want. "Extract the hardcoded strings in src/auth.js into constants" gets the job done in one pass.

6. Use Plan Mode Before Expensive Operations

Press Shift+Tab twice to enter plan mode before starting a big task. Claude outlines its approach before writing code. Costs a few hundred tokens for the plan but saves thousands by preventing costly rework.

7. Break Work Into Scoped Sessions

One session for everything is the most expensive way to use Claude Code. Context accumulates, cache misses increase, and irrelevant history gets sent with every request. Work in task-scoped sessions: one for fixing the login bug, another for adding the new API endpoint, a third for writing tests.

Claude Code API vs Subscription: Which Costs Less?

Usage Profile	API Cost/Month	Best Plan	Savings
Light (1-2 hrs/day)	$30-50/mo	API or Pro ($20)	Pay-per-use wins
Moderate (3-5 hrs/day)	$100-180/mo	Max 5x ($100)	Up to 44% savings
Heavy (6+ hrs/day)	$200-400/mo	Max 20x ($200)	Up to 50% savings

The API makes more sense with sporadic usage or when you need fine-grained budget controls like --max-budget-usd. It's also the only option for per-project cost allocation when billing clients. The subscription wins on predictability.

My approach: Max 5x plan for day-to-day, API key configured for automated scripts and CI pipelines where I want hard budget caps. Hybrid setup gives predictable costs for interactive work and strict controls for automation.

Frequently Asked Questions

How do I check my Claude Code costs?

Use /cost in any session for API spend totals with token counts and dollar estimates. Subscribers should use /stats for a usage dashboard with heatmaps and model breakdowns, or /usage for rate limit status. You can also configure the status line to show costs continuously.

Where does Claude Code store usage data locally?

Claude Code writes one JSONL file per session to ~/.claude/projects/ with full token counts for every API call. It also writes periodic snapshots to ~/.claude/statusline.jsonl containing cumulative cost and rate-limit usage percentages. Third-party tools like ccusage read these files.

What is ccusage and how do I use it?

ccusage is an open-source CLI tool with 4,800+ GitHub stars that analyzes Claude Code usage from local JSONL files. Run npx ccusage for a daily report, npx ccusage monthly for monthly totals, or npx ccusage session to see costs per conversation. Works offline with cached pricing data.

How much does Claude Code cost per day on average?

Anthropic reports the average at about $6 per developer per day, with 90% of users under $12 per day. Enterprise deployments average $150 to $250 per developer per month. Heavy Opus sessions with extended thinking can spike past $20 in a single day.

How do I set a budget limit for Claude Code API usage?

Use --max-budget-usd in print mode to cap spending per command: claude -p --max-budget-usd 5.00 "your prompt". For team-wide limits, set workspace rate limits in the Claude Console. You can also use --max-turns to indirectly limit costs.

Is Claude Code Max plan worth it vs API pricing?

If your API equivalent spend exceeds $100/month, the Max 5x plan at $100/month saves money. If you spend over $200/month on API, Max 20x is the better deal. For sporadic usage under $50/month, pay-per-token API pricing usually costs less overall.

How does prompt caching reduce Claude Code costs?

Claude Code automatically caches repeated content like system prompts and CLAUDE.md files. Cached tokens cost 90% less than fresh input tokens. The cache has a 5-minute TTL, so keeping sessions under 5 minutes apart maximizes savings. Track cache hit rates in local JSONL logs.

Read the full version (with extra examples and updates) on the original post: Claude Code Cost Tracking on avinashsangle.com.

Claude Managed Agents vs Agent SDK: Which Should You Use?

Avinash Sangle — Tue, 14 Apr 2026 11:38:39 +0000

This article was originally published on avinashsangle.com.

Anthropic launched Claude Managed Agents in beta on April 8, 2026. It's a hosted service that runs long-horizon Claude agents in Anthropic's infrastructure - sandboxed, persistent, and integrated with MCP servers out of the box.

If you're choosing between Managed Agents and the Agent SDK, the short answer is:

Pick Managed Agents for multi-hour production workloads
Pick the Agent SDK when you need full control over the runtime

Here's the breakdown after digging through the docs and API.

TL;DR

Managed Agents = Anthropic runs the agent harness, sandbox, and runtime for you (hosted, beta)
Agent SDK = you run the same engine yourself, with full control over infrastructure
Pricing: standard token rates + $0.08 per session-hour of active runtime + $10 per 1,000 web searches
Early adopters: Notion, Rakuten, Asana - focused on long-running enterprise workflows
Beta header required: managed-agents-2026-04-01

The Core Difference

Think of it like this:

Managed Agents = Vercel (hosted, opinionated, pay-per-use)
Agent SDK = self-hosted Next.js (you run it on your infra)

Same underlying engine. Different operational trade-offs.

Managed Agents handles the agent loop, sandboxed code execution, file system access, web browsing, persistent sessions, and checkpointing for you. You send a prompt, connect your MCP servers, and the agent runs - even for hours - without you maintaining any of that runtime infrastructure.

The Agent SDK exposes the same engine for self-hosted runtimes. You get local file access, private network connectivity, custom tool execution, and full runtime control. No session-hour charges - just token costs.

Pricing

Managed Agents pricing on top of standard Claude API rates:

$0.08 per session-hour of active runtime
$10 per 1,000 web searches
Idle time is free - sessions can wait for input without billing

For a 2-hour research task, you're looking at roughly $0.16 in compute plus token costs. For zero infrastructure management, that's a strong tradeoff for production workloads.

When to Pick Which

Pick Managed Agents when:

You have multi-hour production workloads (research, batch processing, monitoring)
You need sandboxed code execution out of the box
Web browsing + MCP integrations matter
You don't want to build or maintain agent infrastructure

Pick Agent SDK when:

You need local file access (working against repos)
Private network access required
Custom tool execution logic
You want predictable token-only costs without session-hour pricing
Development and debugging - the SDK lets you inspect everything

What You Get Out of the Box with Managed Agents

Sandboxed containers with code execution, file system, and web access
Sessions can run for hours with checkpointing for fault tolerance
MCP server support - any MCP server you've built for Claude Desktop or Claude Code can be configured for a Managed Agent session
Built-in web browsing and search

Beta Status

Managed Agents is currently in beta. All endpoints require the beta header:

anthropic-beta: managed-agents-2026-04-01

The official Anthropic SDKs set this automatically when you use the beta namespace. Some features like multi-agent orchestration remain in limited research preview.

The Bigger Picture

Notion, Rakuten, and Asana are early adopters - all using Managed Agents for enterprise workflows where the agent needs to run for extended periods, integrate with internal tools via MCP, and survive infrastructure failures.

This is Anthropic moving up the value chain: instead of just selling the model, they're selling the complete runtime that wraps it. For teams without dedicated AI infrastructure, that's a meaningful shift.

Read the full deep-dive with code examples, pricing math, and a decision flowchart on the original post: Claude Managed Agents vs Agent SDK on avinashsangle.com.