<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jangwook Kim</title>
    <description>The latest articles on DEV Community by Jangwook Kim (@jangwook_kim_e31e7291ad98).</description>
    <link>https://dev.to/jangwook_kim_e31e7291ad98</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1909290%2F60a8c15f-b2b5-4189-8578-78b8ab78900b.jpg</url>
      <title>DEV Community: Jangwook Kim</title>
      <link>https://dev.to/jangwook_kim_e31e7291ad98</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jangwook_kim_e31e7291ad98"/>
    <language>en</language>
    <item>
      <title>nanobot: Build AI Agents in 4,000 Lines You Can Actually Read</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Sat, 25 Apr 2026 08:25:43 +0000</pubDate>
      <link>https://dev.to/jangwook_kim_e31e7291ad98/nanobot-build-ai-agents-in-4000-lines-you-can-actually-read-3pi0</link>
      <guid>https://dev.to/jangwook_kim_e31e7291ad98/nanobot-build-ai-agents-in-4000-lines-you-can-actually-read-3pi0</guid>
      <description>&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;There is a recurring complaint in developer forums about modern AI agent frameworks: you spend more time understanding the framework than building your actual agent. LangGraph's dependency graph, OpenClaw's 430,000+ lines of code, CrewAI's layered abstractions — these are powerful, but they impose a learning cliff that slows down experimentation.&lt;/p&gt;

&lt;p&gt;nanobot, released February 2, 2026 by the Data Intelligence Lab at the University of Hong Kong (HKUDS), takes the opposite bet. The entire core agent loop — message routing, LLM calls, memory management, tool execution, cron scheduling — fits in roughly 4,000 lines of Python. You can read it in an afternoon. You can fork it by lunch.&lt;/p&gt;

&lt;p&gt;That constraint is not a limitation. It is the design. nanobot implements around 90% of OpenClaw's core capabilities with 99% less code. By April 2026 it had accumulated over 34,000 GitHub stars, making it one of the fastest-growing open-source agent frameworks of the year.&lt;/p&gt;

&lt;p&gt;This guide walks through what nanobot actually does, how to get it running, and where it fits compared to heavier alternatives like &lt;a href="https://dev.to/articles/smolagents-huggingface-code-agents-guide-2026"&gt;smolagents&lt;/a&gt; and &lt;a href="https://dev.to/articles/openclaw-local-ai-gateway-multi-platform-guide-2026"&gt;OpenClaw&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is nanobot? Core Architecture
&lt;/h2&gt;

&lt;p&gt;nanobot is a personal AI agent that you deploy as a long-running process. It listens on one or more messaging channels (Telegram, Discord, WhatsApp, Slack, and others), routes incoming messages to an LLM, executes tool calls via MCP or custom skills, and persists memory between sessions.&lt;/p&gt;

&lt;p&gt;The architecture is deliberately flat:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Incoming message
        ↓
  Channel adapter (Telegram / Discord / WhatsApp / ...)
        ↓
    Message bus
        ↓
   Agent loop
   ├── LLM call (11+ providers supported)
   ├── Tool execution (MCP stdio / HTTP)
   └── Memory read/write (session + Dream)
        ↓
  Response dispatch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each of these layers is a small, standalone module. There is no hidden orchestration engine, no graph traversal, no pre-built DAG. The agent loop itself is the kind of code you can step through in a debugger in ten minutes.&lt;/p&gt;
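
&lt;p&gt;To make that concrete, here is a rough sketch of the shape such a loop takes. This is illustrative pseudo-structure, not nanobot's actual source; the channel, LLM, tool, and memory objects are hypothetical stand-ins:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative only: a minimal agent loop in the spirit of nanobot's flat design.
# `channel`, `llm`, `tools`, and `memory` are hypothetical objects, not nanobot APIs.

def agent_loop(channel, llm, tools, memory):
    for incoming in channel.poll():                  # 1. receive a message
        history = memory.load(incoming.session_id)   # 2. read session history
        reply = llm.chat(history + [incoming.text])  # 3. call the model

        while reply.wants_tool:                      # 4. run tool calls until the model is done
            result = tools.call(reply.tool_name, reply.tool_args)
            reply = llm.chat(history + [incoming.text, result])

        memory.append(incoming.session_id, incoming.text, reply.text)  # 5. persist the turn
        channel.send(incoming.chat_id, reply.text)   # 6. dispatch the response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
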

&lt;p&gt;The project requires Python 3.11 or higher and is licensed under MIT. The latest release at time of writing is &lt;strong&gt;v0.1.5.post2&lt;/strong&gt; (April 21, 2026), which added Windows and Python 3.14 support, Office document reading, SSE streaming for the OpenAI-compatible API endpoint, and improved session reliability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started: Install in Under 5 Minutes
&lt;/h2&gt;

&lt;p&gt;Three installation paths are available. PyPI is recommended for most users:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;nanobot-ai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or with uv for faster installs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv tool &lt;span class="nb"&gt;install &lt;/span&gt;nanobot-ai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To track the latest development branch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/HKUDS/nanobot
&lt;span class="nb"&gt;cd &lt;/span&gt;nanobot
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Minimal YAML Configuration
&lt;/h3&gt;

&lt;p&gt;nanobot is configured entirely via a YAML (or JSON) file. A minimal setup with Telegram and Claude:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;llm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;anthropic&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;
  &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;YOUR_ANTHROPIC_KEY&lt;/span&gt;

&lt;span class="na"&gt;channels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;telegram&lt;/span&gt;
    &lt;span class="na"&gt;token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;YOUR_TELEGRAM_BOT_TOKEN&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Save this as &lt;code&gt;~/.nanobot/config.yaml&lt;/code&gt; and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nanobot start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the entire setup. nanobot will start listening on Telegram and responding with Claude Sonnet 4.6. No Docker, no Kubernetes, no environment config beyond the YAML file.&lt;/p&gt;

&lt;h3&gt;
  
  
  Supported LLM Providers
&lt;/h3&gt;

&lt;p&gt;nanobot ships with adapters for 11+ providers out of the box:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cloud APIs&lt;/strong&gt;: Anthropic (Claude), OpenAI (GPT), Google Gemini, DeepSeek, Moonshot, Groq, AiHubMix&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aggregators&lt;/strong&gt;: OpenRouter, DashScope, Zhipu (智谱)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-hosted&lt;/strong&gt;: vLLM (for local models on your own GPU)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Switching providers is one line in the config. You can also configure multiple providers and route specific channels to specific models.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Dream Memory System
&lt;/h2&gt;

&lt;p&gt;One of the more interesting engineering choices in nanobot is its two-tier memory architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Session history&lt;/strong&gt; stores the raw conversation turns for each active chat thread in JSON files under &lt;code&gt;sessions/&lt;/code&gt;. This is the short-term buffer — what the agent knows right now, in this conversation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dream&lt;/strong&gt; is the long-term consolidation layer. It runs as a background process that periodically reads session history and extracts durable facts, summaries, and user preferences into a &lt;code&gt;MEMORY.md&lt;/code&gt; file. Think of it as the agent sleeping on what it learned and writing notes before waking up.&lt;/p&gt;

&lt;p&gt;The underlying storage for Dream is git-versioned, which means every memory state is recoverable. You can roll back to any earlier memory checkpoint with a standard &lt;code&gt;git checkout&lt;/code&gt;. This is an elegant solution to a real problem: long-running agents accumulate incorrect or stale memories, and having a full audit trail makes debugging far less painful.&lt;/p&gt;
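
&lt;p&gt;Because the store is an ordinary git repository, inspection and rollback need nothing beyond git itself. A sketch, assuming the default memory location (adjust the path to wherever your &lt;code&gt;MEMORY.md&lt;/code&gt; lives):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: inspect and roll back Dream memory through its git history.
# The directory below is an assumption; point it at wherever MEMORY.md is stored.
import subprocess
from pathlib import Path

MEMORY_DIR = Path.home() / ".nanobot" / "memory"   # assumed default location

# List recent memory checkpoints (one commit per consolidation run)
log = subprocess.run(
    ["git", "-C", str(MEMORY_DIR), "log", "--oneline", "-10", "--", "MEMORY.md"],
    capture_output=True, text=True, check=True,
)
print(log.stdout)

# Restore MEMORY.md as it was at a given commit (use a hash from the log above)
subprocess.run(
    ["git", "-C", str(MEMORY_DIR), "checkout", "abc1234", "--", "MEMORY.md"],
    check=True,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
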

&lt;p&gt;Dream's behavior is configured via &lt;code&gt;DreamConfig&lt;/code&gt; in your YAML:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;dream&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;interval_minutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
    &lt;span class="na"&gt;max_facts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;200&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For agents running multi-day workflows — the kind described in guides on &lt;a href="https://temporal.io" rel="noopener noreferrer"&gt;Temporal durable execution&lt;/a&gt; patterns — the Dream system fills a gap that most lightweight frameworks ignore entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  MCP Integration: External Tools Without the Overhead
&lt;/h2&gt;

&lt;p&gt;nanobot connects to MCP (Model Context Protocol) servers via two transport modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;stdio&lt;/strong&gt; — for local MCP servers running as child processes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HTTP&lt;/strong&gt; — for remote servers with optional custom authentication headers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tools from connected MCP servers are auto-discovered and registered at startup. The agent can call any exposed tool the same way it would call a built-in skill, with no additional plumbing required.&lt;/p&gt;

&lt;p&gt;Example YAML configuration for an MCP server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;mcp_servers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;filesystem&lt;/span&gt;
    &lt;span class="na"&gt;transport&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;stdio&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npx&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-y"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@modelcontextprotocol/server-filesystem"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/tmp"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;brave-search&lt;/span&gt;
    &lt;span class="na"&gt;transport&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http&lt;/span&gt;
    &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://your-mcp-server.example.com&lt;/span&gt;
    &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;Authorization&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;YOUR_KEY"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This direct MCP support is meaningful in 2026, when the &lt;a href="https://dev.to/articles/mcp-ecosystem-growth-100-million-installs-2026"&gt;MCP ecosystem has crossed 97 million monthly SDK downloads&lt;/a&gt;. You can drop in any of the thousands of public MCP servers — GitHub, Brave Search, databases, home automation — without modifying the agent's core logic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-Platform Messaging
&lt;/h2&gt;

&lt;p&gt;nanobot connects to 8+ messaging platforms through channel adapters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Telegram&lt;/strong&gt; — most mature, best supported&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Discord&lt;/strong&gt; — full slash command support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WhatsApp&lt;/strong&gt; — via WhatsApp Business API&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slack&lt;/strong&gt; — workspace bot&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feishu / DingTalk&lt;/strong&gt; — enterprise Chinese platforms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Email&lt;/strong&gt; — IMAP/SMTP polling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;QQ&lt;/strong&gt; — via Lagrange bridge&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each channel runs as an independent adapter. You can run multiple channels simultaneously and isolate production from testing environments by spinning up separate instances on the same machine.&lt;/p&gt;

&lt;p&gt;This breadth matters for teams building internal AI assistants. Your sales team uses Slack, your ops team uses DingTalk, and nanobot can serve both from a single config file.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cron Scheduling and Subagents
&lt;/h2&gt;

&lt;p&gt;nanobot includes a cron system built on &lt;code&gt;apscheduler&lt;/code&gt; for time-based automation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Schedule a daily briefing at 9am&lt;/span&gt;
nanobot cron add &lt;span class="nt"&gt;--name&lt;/span&gt; &lt;span class="s2"&gt;"morning"&lt;/span&gt; &lt;span class="nt"&gt;--message&lt;/span&gt; &lt;span class="s2"&gt;"Summarize today's GitHub activity"&lt;/span&gt; &lt;span class="nt"&gt;--cron&lt;/span&gt; &lt;span class="s2"&gt;"0 9 * * *"&lt;/span&gt;

&lt;span class="c"&gt;# Check a webhook every hour&lt;/span&gt;
nanobot cron add &lt;span class="nt"&gt;--name&lt;/span&gt; &lt;span class="s2"&gt;"monitor"&lt;/span&gt; &lt;span class="nt"&gt;--message&lt;/span&gt; &lt;span class="s2"&gt;"Check deployment status"&lt;/span&gt; &lt;span class="nt"&gt;--every&lt;/span&gt; 3600
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cron jobs can also be defined in YAML:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;cron_jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;daily_digest&lt;/span&gt;
    &lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;18&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*"&lt;/span&gt;
    &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Prepare&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;send&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;daily&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;digest&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;team&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Telegram"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Subagents&lt;/strong&gt; allow the main agent to spin up specialized child agents for scoped tasks. Subagents work in CLI mode and communicate via the internal message bus. A common pattern is a routing agent that delegates research to a subagent, code editing to another, and aggregates their results before responding to the user.&lt;/p&gt;
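
&lt;p&gt;The delegation pattern itself is simple enough to sketch. The code below is conceptual; &lt;code&gt;spawn_subagent&lt;/code&gt; is a hypothetical helper, not nanobot's actual subagent API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Conceptual sketch of the routing pattern described above.
# `spawn_subagent` is a hypothetical stand-in, not nanobot's actual subagent API.

def handle_request(user_message, spawn_subagent):
    research = spawn_subagent("researcher", task=f"Gather sources for: {user_message}")
    edits = spawn_subagent("editor", task=f"Apply code changes based on: {research}")
    # The routing agent aggregates both results before replying to the user.
    return f"{edits}\n\nSources used:\n{research}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
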

&lt;h2&gt;
  
  
  Skills: Extending nanobot With Pre-Built Behaviors
&lt;/h2&gt;

&lt;p&gt;Beyond MCP tools, nanobot has a skills system. Skills are Markdown files that describe repeatable behaviors — think of them as stored prompts with light tool wiring. The agent loads skills from a &lt;code&gt;skills/&lt;/code&gt; directory and can invoke them by name.&lt;/p&gt;
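
&lt;p&gt;The mechanics are close to what you would write yourself. A minimal sketch of the idea (not nanobot's implementation; only the &lt;code&gt;skills/&lt;/code&gt; directory convention comes from the docs):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: skills as Markdown files keyed by filename, loaded at startup.
from pathlib import Path

def load_skills(skills_dir="skills"):
    skills = {}
    for path in Path(skills_dir).glob("*.md"):
        skills[path.stem] = path.read_text(encoding="utf-8")  # name -&gt; prompt body
    return skills

skills = load_skills()
# Invoking a skill by name is essentially injecting its text into the prompt:
# prompt = skills["github"] + "\n\n" + user_message
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
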

&lt;p&gt;nanobot ships with pre-bundled skills for GitHub, weather, system commands, and general task management. Community skills can be discovered and installed through the ClawHub skill registry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nanobot skill search web-scraper
nanobot skill &lt;span class="nb"&gt;install &lt;/span&gt;web-scraper
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unlike the massive &lt;a href="https://dev.to/articles/hermes-agent-nous-research-self-improving-developer-guide-2026"&gt;Hermes Agent skill ecosystem&lt;/a&gt; (118 bundled skills, self-improving closed learning loop), nanobot keeps the default skill set minimal. That is intentional — you install exactly what you need, and the codebase stays readable.&lt;/p&gt;

&lt;h2&gt;
  
  
  nanobot vs The Alternatives
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;nanobot&lt;/th&gt;
&lt;th&gt;smolagents&lt;/th&gt;
&lt;th&gt;LangGraph&lt;/th&gt;
&lt;th&gt;OpenClaw&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Core codebase&lt;/td&gt;
&lt;td&gt;~4,000 lines&lt;/td&gt;
&lt;td&gt;~1,000 lines&lt;/td&gt;
&lt;td&gt;50,000+ lines&lt;/td&gt;
&lt;td&gt;430,000+ lines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;License&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Messaging platforms&lt;/td&gt;
&lt;td&gt;8+ built-in&lt;/td&gt;
&lt;td&gt;None built-in&lt;/td&gt;
&lt;td&gt;None built-in&lt;/td&gt;
&lt;td&gt;20+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCP support&lt;/td&gt;
&lt;td&gt;Yes (stdio + HTTP)&lt;/td&gt;
&lt;td&gt;Yes (limited)&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;Yes (full)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Long-term memory&lt;/td&gt;
&lt;td&gt;Dream (git-versioned)&lt;/td&gt;
&lt;td&gt;External only&lt;/td&gt;
&lt;td&gt;Checkpoints&lt;/td&gt;
&lt;td&gt;Full memory graph&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cron scheduling&lt;/td&gt;
&lt;td&gt;Built-in (apscheduler)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;External&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-host difficulty&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Very low&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best for&lt;/td&gt;
&lt;td&gt;Personal agents, hackable infra&lt;/td&gt;
&lt;td&gt;Code-first prototyping&lt;/td&gt;
&lt;td&gt;Complex multi-agent graphs&lt;/td&gt;
&lt;td&gt;Production multi-platform&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The choice comes down to what you're optimizing for. smolagents is the right tool when you need a minimal code-execution agent fast. LangGraph wins on complex stateful multi-agent graphs with human-in-the-loop requirements. OpenClaw has the widest platform coverage and a large community skills ecosystem.&lt;/p&gt;

&lt;p&gt;nanobot's sweet spot is the developer who wants a &lt;strong&gt;persistent personal agent&lt;/strong&gt; — one that lives on a cheap VPS, watches multiple channels, handles cron jobs, and grows with you — without reading 430,000 lines of someone else's code to understand what's happening.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Mistakes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Running without a persistent process manager.&lt;/strong&gt; nanobot is a long-running daemon. Run it with &lt;code&gt;systemd&lt;/code&gt;, &lt;code&gt;supervisord&lt;/code&gt;, or at minimum &lt;code&gt;nohup&lt;/code&gt;. Crashes in Telegram or Discord adapters will disconnect your channels and you won't notice until someone complains.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Over-engineering the skill system early.&lt;/strong&gt; Skills are for repeatable, well-defined tasks. Using them as a workaround for prompt quality issues just adds indirection. Fix the prompts first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ignoring session history growth.&lt;/strong&gt; The &lt;code&gt;sessions/&lt;/code&gt; directory grows indefinitely if Dream consolidation is disabled. A 3-month-old agent on an active Telegram group can accumulate gigabytes of JSON. Configure Dream with a reasonable &lt;code&gt;interval_minutes&lt;/code&gt; and cap &lt;code&gt;max_facts&lt;/code&gt;.&lt;/p&gt;
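
&lt;p&gt;A quick way to see whether this is already biting you (standard library only; the path is an assumed default, adjust to your install):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Measure how much disk the session history is using.
from pathlib import Path

sessions = Path.home() / ".nanobot" / "sessions"   # assumed default location
total_bytes = sum(f.stat().st_size for f in sessions.rglob("*") if f.is_file())
print(f"sessions/ is using {total_bytes / 1_048_576:.1f} MB")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
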

&lt;p&gt;&lt;strong&gt;Misconfiguring MCP server lifecycles.&lt;/strong&gt; Stdio MCP servers are child processes of nanobot. If nanobot crashes, those servers crash with it. HTTP MCP servers survive independently. Structure your architecture accordingly when uptime matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Running multiple instances against the same config directory.&lt;/strong&gt; Session and memory files are not designed for concurrent writes. Use separate config directories for each instance.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Q: Does nanobot work with local LLMs?
&lt;/h3&gt;

&lt;p&gt;Yes. Configure vLLM as the provider with a local endpoint URL and model name. Any OpenAI-compatible API server works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;llm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai_compatible&lt;/span&gt;
  &lt;span class="na"&gt;base_url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://localhost:8000/v1&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llama-4-scout&lt;/span&gt;
  &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;none&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Q: How does nanobot compare to OpenClaw in terms of stability?
&lt;/h3&gt;

&lt;p&gt;OpenClaw is more battle-tested in high-volume production deployments with community skill ecosystems. nanobot is newer (February 2026) but has been moving fast — v0.1.5.post2 added Windows support and SSE streaming in April 2026. For a personal agent or small team, nanobot's stability is adequate. For enterprise scale, OpenClaw or LangGraph are safer bets today.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Can I run nanobot without connecting it to a messaging platform?
&lt;/h3&gt;

&lt;p&gt;Yes. nanobot ships with a CLI interface — run &lt;code&gt;nanobot chat&lt;/code&gt; to interact directly in the terminal without configuring any channel adapter. This is useful for testing skills and memory behavior before deploying to Telegram or Discord.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: How does the Dream memory consolidation handle incorrect facts?
&lt;/h3&gt;

&lt;p&gt;Dream stores memory as Markdown files under git version control. To remove or correct a fact, edit &lt;code&gt;MEMORY.md&lt;/code&gt; directly and commit the change. The agent picks up the updated file on the next session. There is currently no automated conflict resolution — incorrect memories require manual intervention.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: What is the minimum VPS spec to run nanobot?
&lt;/h3&gt;

&lt;p&gt;nanobot's own footprint is small — approximately 100MB RAM for the process itself. A cloud LLM provider adds no local memory cost (inference happens over the network); the only extra footprint comes from any MCP server processes you run alongside it. A $4/month VPS with 512MB RAM runs nanobot comfortably. For local LLM inference via vLLM, the hardware requirements of the model dominate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;nanobot is a ~4,000-line MIT-licensed Python agent framework from HKUDS, released February 2, 2026&lt;/li&gt;
&lt;li&gt;Install with &lt;code&gt;pip install nanobot-ai&lt;/code&gt;, configure with a single YAML file, and start receiving messages in under five minutes&lt;/li&gt;
&lt;li&gt;Supports 8+ messaging platforms, 11+ LLM providers, MCP (stdio and HTTP), cron scheduling, subagents, and a two-tier Dream memory system&lt;/li&gt;
&lt;li&gt;The entire core codebase is readable in an afternoon — this is a feature, not a constraint&lt;/li&gt;
&lt;li&gt;Best suited for personal agents and developer experiments where understanding and forking the code matters more than enterprise-scale throughput&lt;/li&gt;
&lt;li&gt;For production multi-agent orchestration with complex state, LangGraph or OpenClaw remain stronger choices&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Bottom Line&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;nanobot delivers a genuinely useful personal AI agent in a codebase you can actually read, fork, and understand. If the opacity of larger frameworks has been your reason for not deploying an agent yet, nanobot removes that excuse. Ship it to a VPS, wire up Telegram, and extend it from there.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Prefer a deep-dive walkthrough? &lt;a href="https://www.youtube.com/watch?v=_L0nIj3r-b8" rel="noopener noreferrer"&gt;Watch the full video on YouTube&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>python</category>
      <category>opensource</category>
      <category>mcp</category>
    </item>
    <item>
      <title>MCP vs A2A vs Open Responses — AI Agent Communication Protocols in 2026: What to Actually Use</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Sat, 25 Apr 2026 06:40:00 +0000</pubDate>
      <link>https://dev.to/jangwook_kim_e31e7291ad98/mcp-vs-a2a-vs-open-responses-ai-agent-communication-protocols-in-2026-what-to-actually-use-5goh</link>
      <guid>https://dev.to/jangwook_kim_e31e7291ad98/mcp-vs-a2a-vs-open-responses-ai-agent-communication-protocols-in-2026-what-to-actually-use-5goh</guid>
      <description>&lt;p&gt;Since late 2025, AI agent standards have been arriving in a cluster. Anthropic donated MCP to the Linux Foundation, Google announced A2A, and OpenAI published the Open Responses spec. That's great news for the ecosystem — but it's also confusing as hell. What does each one do? Are they competing? Can they coexist?&lt;/p&gt;

&lt;p&gt;My first reaction was "another protocol war." Then I built a few MCP servers myself, read through the A2A spec, and my view changed. These three protocols aren't competing — they occupy &lt;strong&gt;different layers&lt;/strong&gt;. The confusion comes from the fact that all three sound like "agent communication standards" when you read the names.&lt;/p&gt;

&lt;p&gt;In this post I'll break down each protocol and give my honest take on when to use what.&lt;/p&gt;




&lt;h2&gt;
  
  
  MCP: Giving Agents Hands
&lt;/h2&gt;

&lt;p&gt;MCP (Model Context Protocol) was published by Anthropic in late 2024 and donated to the Linux Foundation's Agentic AI Initiative (AAIF) in December 2025. Its core purpose is singular: &lt;strong&gt;standardize how AI models access external tools and data.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The "USB-C for AI" analogy holds up. Before USB-C, every laptop had a different charging port. Before MCP, Claude's tool connections and GPT's tool connections were separate implementations. MCP created a common connector.&lt;/p&gt;

&lt;p&gt;What MCP standardizes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tools&lt;/strong&gt;: Functions or actions an agent can invoke (file reads, API calls, code execution)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resources&lt;/strong&gt;: Data the agent can read (documents, DB records, file system)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompts&lt;/strong&gt;: Reusable prompt templates the server provides&lt;/li&gt;
&lt;/ul&gt;
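
&lt;p&gt;To make the three primitives concrete, here is a minimal server built with the official Python SDK's FastMCP helper (&lt;code&gt;pip install mcp&lt;/code&gt;). Treat it as a sketch and check the SDK docs for the current decorator signatures:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# One tool, one resource, one prompt: the three things MCP standardizes.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-server")

@mcp.tool()
def add(a: int, b: int) -&gt; int:
    """Add two numbers."""          # Tool: an action the agent can invoke
    return a + b

@mcp.resource("config://app")
def app_config() -&gt; str:
    """Static app configuration."""  # Resource: data the agent can read
    return "theme=dark"

@mcp.prompt()
def summarize(text: str) -&gt; str:
    """Reusable summarization template."""  # Prompt: a template the server provides
    return f"Summarize the following text:\n\n{text}"

if __name__ == "__main__":
    mcp.run()  # stdio transport by default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
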

&lt;p&gt;As of April 2026, there are over 5,000 MCP servers — GitHub Actions, Notion, PostgreSQL, Brave Search, browser automation, and almost every major tool you can think of.&lt;/p&gt;

&lt;p&gt;When I first wired MCP into this blog's automation system, the thing that surprised me most was &lt;strong&gt;framework agnosticism&lt;/strong&gt;. MCP servers I built for Claude Code worked in other MCP-compatible clients without modification. In practice there are edge cases where client feature sets differ, but the direction is sound.&lt;/p&gt;

&lt;h3&gt;
  
  
  What the 2026 MCP Roadmap Focuses On
&lt;/h3&gt;

&lt;p&gt;The most important item on the 2026 MCP roadmap is solving &lt;strong&gt;horizontal scaling&lt;/strong&gt;. Current Streamable HTTP transport maintains stateful sessions — which fights with load balancers. When requests get routed to different server instances, sessions break. The roadmap aims to make MCP servers genuinely stateless.&lt;/p&gt;

&lt;p&gt;The second priority is &lt;strong&gt;discovery standardization&lt;/strong&gt; via &lt;code&gt;.well-known&lt;/code&gt;. Right now you have to connect to an MCP server to know what it offers. The goal is to serve capability metadata without a live connection.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/en/blog/en/webmcp-chrome-146-ai-tool-server"&gt;My earlier post on WebMCP&lt;/a&gt; gets into how MCP server implementation works under the hood, if you want a concrete picture.&lt;/p&gt;




&lt;h2&gt;
  
  
  A2A: Agents Talking to Each Other
&lt;/h2&gt;

&lt;p&gt;A2A (Agent2Agent) was announced by Google in April 2025 and donated to the Linux Foundation in June 2025. The purpose is different from MCP: &lt;strong&gt;standardize how AI agents discover, communicate with, and delegate to each other.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If MCP is "agent ↔ tool," A2A is "agent ↔ agent."&lt;/p&gt;

&lt;p&gt;The problem A2A solves: suppose you have a travel booking agent, a hotel search specialist agent, and a flight search specialist agent. How does the booking agent delegate tasks to the specialists? MCP doesn't handle this. That's A2A's domain.&lt;/p&gt;

&lt;h3&gt;
  
  
  A2A v1.0 Core Concepts
&lt;/h3&gt;

&lt;p&gt;A2A v1.0, released in early 2026:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent Card&lt;/strong&gt;: A JSON document where an agent advertises its capabilities. When a client agent needs to find the right specialist, it reads Agent Cards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Task-based communication&lt;/strong&gt;: Interactions are oriented around Tasks. Tasks can complete immediately or run long, with state synchronization built in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Signed Agent Cards (the v1.0 headline feature)&lt;/strong&gt;: Cryptographic signatures allow receiving agents to verify that an Agent Card was issued by the domain owner. This makes decentralized agent discovery viable — you can filter fake agents.&lt;/p&gt;
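
&lt;p&gt;For a feel of what an Agent Card carries, here is an illustrative one as a Python dict. Field names are simplified for explanation; the normative schema lives in the A2A spec:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative Agent Card (simplified); consult the A2A spec for the real schema.
agent_card = {
    "name": "hotel-search-agent",
    "description": "Finds and compares hotel availability",
    "url": "https://agents.example.com/hotels",   # where A2A requests are served
    "capabilities": {"streaming": True},
    "skills": [
        {"id": "search_hotels", "description": "Search hotels by city and date range"},
    ],
    # v1.0: a signature lets callers verify the card was issued by the domain owner
    "signature": "base64-encoded-signature",
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
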

&lt;p&gt;By April 2026, 150+ organizations have adopted A2A, with production deployments at Microsoft, AWS, Salesforce, SAP, and ServiceNow.&lt;/p&gt;

&lt;p&gt;Honest take: when I first read the A2A spec, I was skeptical about practical safety. Agents delegating directly to other agents sounds elegant, but the trust model gets complicated fast. v1.0's Signed Agent Cards are moving in the right direction, but I'd want to see more production validation before treating it as battle-hardened infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/en/blog/en/a2a-mcp-hybrid-architecture-production-guide"&gt;A separate post covers A2A + MCP production hybrid architectures&lt;/a&gt; — specifically how to layer these two protocols without creating a mess.&lt;/p&gt;




&lt;h2&gt;
  
  
  Open Responses: OpenAI's Bet on API Compatibility
&lt;/h2&gt;

&lt;p&gt;Open Responses is an open spec published by OpenAI in February 2026. It operates at a different level from MCP and A2A. Those two address &lt;strong&gt;how agents communicate&lt;/strong&gt;; Open Responses addresses &lt;strong&gt;how to standardize agentic workflow APIs&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The spec is built on OpenAI's Responses API — the successor to Chat Completions — and the pitch is: let's open this standard so that other model providers can offer the same interface. If you write agentic code against the Responses API, it should run against Hugging Face models, local inference, or any other compliant provider without rewriting your integration.&lt;/p&gt;
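
&lt;p&gt;In client code, the bet looks like this. A sketch using the OpenAI Python SDK's Responses API; the local base_url and model name are assumptions standing in for any compliant provider:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Same call shape, different provider: only base_url changes.
from openai import OpenAI

# client = OpenAI()                                     # hosted OpenAI
client = OpenAI(base_url="http://localhost:11434/v1",   # assumed local compliant server
                api_key="unused")

response = client.responses.create(
    model="llama3.3",   # whatever model the target provider serves
    input="List three risks of multi-agent delegation.",
)
print(response.output_text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
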

&lt;p&gt;Ecosystem support: Hugging Face, Vercel, and OpenRouter have signed on. Ollama, vLLM, and LM Studio support it for local inference. The spec documentation and conformance testing tools are at openresponses.org.&lt;/p&gt;

&lt;p&gt;My honest take: Open Responses is complementary to MCP and A2A, not competing. But I don't see a compelling reason to prioritize it in most production stacks right now. Large-scale production validation is thin. The bet that other vendors will adopt OpenAI's API design as a universal standard is real, but unproven at scale.&lt;/p&gt;




&lt;h2&gt;
  
  
  Side-by-Side Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;MCP&lt;/th&gt;
&lt;th&gt;A2A&lt;/th&gt;
&lt;th&gt;Open Responses&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Purpose&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agent ↔ Tool connectivity&lt;/td&gt;
&lt;td&gt;Agent ↔ Agent collaboration&lt;/td&gt;
&lt;td&gt;Agentic API loop standardization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Analogy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;USB-C (universal connector)&lt;/td&gt;
&lt;td&gt;HTTP (for agent networks)&lt;/td&gt;
&lt;td&gt;REST API design standard&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Origin&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Anthropic → AAIF&lt;/td&gt;
&lt;td&gt;Google → Linux Foundation&lt;/td&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Current version&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2025-11-25&lt;/td&gt;
&lt;td&gt;v1.0 (early 2026)&lt;/td&gt;
&lt;td&gt;Beta&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ecosystem maturity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High (5,000+ servers)&lt;/td&gt;
&lt;td&gt;High (150+ orgs)&lt;/td&gt;
&lt;td&gt;Low (early stage)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Transport&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Streamable HTTP, stdio&lt;/td&gt;
&lt;td&gt;JSON-RPC, gRPC&lt;/td&gt;
&lt;td&gt;WebSocket, HTTP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Security model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;OAuth, per-server auth&lt;/td&gt;
&lt;td&gt;Signed Agent Cards&lt;/td&gt;
&lt;td&gt;Under specification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;When to use&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Any time tool access is needed&lt;/td&gt;
&lt;td&gt;Multi-agent task delegation&lt;/td&gt;
&lt;td&gt;OpenAI-compatible workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The most important thing to understand: &lt;strong&gt;MCP and A2A are AND, not OR.&lt;/strong&gt; Most production multi-agent systems in 2026 use both. Each agent connects to its own tools via MCP; agents coordinate via A2A.&lt;/p&gt;




&lt;h2&gt;
  
  
  How They Layer in Practice
&lt;/h2&gt;

&lt;p&gt;A concrete architecture example:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario: Automated research system&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Orchestrator Agent
├── (A2A) → Web research specialist agent
│   └── (MCP) → Brave Search MCP server
│   └── (MCP) → Web scraping MCP server
├── (A2A) → Document analysis specialist agent
│   └── (MCP) → File system MCP server
│   └── (MCP) → PDF processing MCP server
└── (MCP) → Results storage MCP server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The orchestrator delegates via A2A; each specialist accesses its own tools via MCP. Open Responses could sit at the orchestrator's external API interface if you need OpenAI-compatible endpoint exposure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/en/blog/en/claude-code-agentic-workflow-patterns-5-types"&gt;My breakdown of Claude Code agentic workflow patterns&lt;/a&gt; goes deeper on implementing this kind of layered architecture.&lt;/p&gt;




&lt;h2&gt;
  
  
  What to Learn Right Now
&lt;/h2&gt;

&lt;p&gt;Practical priority order:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Learn immediately: MCP&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you're building agents, start here. Reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;5,000+ server ecosystem already exists&lt;/li&gt;
&lt;li&gt;Claude Code, OpenAI Agents SDK, LangGraph all support it natively&lt;/li&gt;
&lt;li&gt;Streamable HTTP is the settled standard; spec is stable enough for production&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/en/blog/en/anthropic-agent-skills-standard"&gt;Anthropic's Agent Skills standard&lt;/a&gt; builds directly on MCP, creating increasingly powerful patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Medium-term: A2A&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you're planning multi-agent production systems, study A2A. Adoption by 150+ organizations, Linux Foundation governance, v1.0 stability — it's ready. But I'd still want to see more validated production case studies before relying on the Signed Agent Cards trust model for anything security-critical.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitor: Open Responses&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Unless you have a specific OpenAI-compatibility requirement today, there's no urgency. Subscribe to updates; don't architect around it yet.&lt;/p&gt;

&lt;p&gt;One more thing worth noting: both MCP and A2A are now under the Linux Foundation. This isn't a standards war — it's the same foundation solving two different layers of the same problem. That's the clearest signal that 2026 is different from 2024.&lt;/p&gt;




&lt;h2&gt;
  
  
  My Take
&lt;/h2&gt;

&lt;p&gt;MCP is the tool to use right now. It's the layer that gives agents access to the external world, and the ecosystem is mature. A2A is worth learning seriously if you're thinking about multi-agent systems — v1.0 is production-ready in most respects. Open Responses is worth following but not yet worth building around.&lt;/p&gt;

&lt;p&gt;Stop thinking about these as competing standards. They solve different problems and most serious systems need all three eventually. My working heuristic: MCP first, A2A when you need multi-agent delegation, Open Responses when the ecosystem catches up.&lt;/p&gt;

&lt;p&gt;And &lt;a href="https://dev.to/en/blog/en/ai-agent-framework-comparison-2026-langgraph-crewai-dapr-production"&gt;your choice of AI agent framework&lt;/a&gt; is tightly coupled to this — different frameworks have significantly different levels of MCP and A2A support.&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>a2a</category>
      <category>aiagents</category>
      <category>protocolcomparison</category>
    </item>
    <item>
      <title>Meta Llama Stack: Deploy Llama 4 With OpenAI-Compatible API</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Sat, 25 Apr 2026 04:21:19 +0000</pubDate>
      <link>https://dev.to/jangwook_kim_e31e7291ad98/meta-llama-stack-deploy-llama-4-with-openai-compatible-api-m0m</link>
      <guid>https://dev.to/jangwook_kim_e31e7291ad98/meta-llama-stack-deploy-llama-4-with-openai-compatible-api-m0m</guid>
      <description>&lt;h2&gt;
  
  
  Why Llama Stack Changes the Open-Source Deployment Story
&lt;/h2&gt;

&lt;p&gt;Running open-source LLMs in production has always had a catch: you pick a backend (Ollama, vLLM, llama.cpp), write your integration code against that backend's specific API, and then find yourself locked in. Swap the backend for performance or cost reasons and you're rewriting client code.&lt;/p&gt;

&lt;p&gt;Meta Llama Stack solves exactly this problem. It's an open-source AI application server that sits in front of any backend and exposes a single, OpenAI-compatible API layer. The same &lt;code&gt;/v1/chat/completions&lt;/code&gt; call that works against your local Ollama instance in development routes to vLLM or AWS Bedrock in production — with zero application-code changes.&lt;/p&gt;

&lt;p&gt;As of April 2026, the repository has over 8,200 GitHub stars and is under active development by Meta's open-source team. It ships with native support for Llama 4 Scout and Llama 4 Maverick, alongside older Llama 3.x models. If you're running or planning to run open-weight Llama models in production, Llama Stack is the infrastructure layer worth knowing.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Llama Stack Actually Is
&lt;/h2&gt;

&lt;p&gt;Most frameworks for LLM deployment focus on one thing — inference serving. Llama Stack goes wider. It provides a unified API layer covering seven concerns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Inference&lt;/strong&gt; — run Llama models against any supported backend&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAG&lt;/strong&gt; — vector store integration for retrieval-augmented generation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agents&lt;/strong&gt; — multi-step agent orchestration with tool use&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools&lt;/strong&gt; — web search, code interpreter, custom tool registration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safety&lt;/strong&gt; — Llama Guard integration for prompt/output filtering&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evals&lt;/strong&gt; — built-in evaluation harness&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Telemetry&lt;/strong&gt; — OpenTelemetry-based tracing and logging&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The architecture has two layers: a distribution (a pre-configured bundle of provider implementations) and the Llama Stack server (a single process that routes API calls to whichever providers are configured in that distribution).&lt;/p&gt;

&lt;p&gt;Your application only ever talks to the Llama Stack server. Swapping backends, safety models, or vector stores is a config change, not a code change.&lt;/p&gt;

&lt;h2&gt;
  
  
  Distributions: The Core Concept
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;distribution&lt;/strong&gt; is the unit of deployment in Llama Stack. It bundles together one provider for each API component and packages them into a runnable server.&lt;/p&gt;

&lt;p&gt;Meta ships several official distributions out of the box:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Distribution&lt;/th&gt;
&lt;th&gt;Inference Backend&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ollama&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Ollama&lt;/td&gt;
&lt;td&gt;Local development, CPU/Apple Silicon&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;vllm&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;vLLM&lt;/td&gt;
&lt;td&gt;GPU production servers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tgi&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;HuggingFace TGI&lt;/td&gt;
&lt;td&gt;HuggingFace-native stacks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;together&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Together AI&lt;/td&gt;
&lt;td&gt;Managed API, no GPU needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fireworks&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fireworks AI&lt;/td&gt;
&lt;td&gt;Low-latency managed inference&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;bedrock&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;AWS Bedrock&lt;/td&gt;
&lt;td&gt;AWS-native production&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;openai&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OpenAI API&lt;/td&gt;
&lt;td&gt;Hybrid open/closed LLM routing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The pattern: develop with &lt;code&gt;ollama&lt;/code&gt;, deploy with &lt;code&gt;vllm&lt;/code&gt; or a managed service. The API your application uses doesn't change.&lt;/p&gt;

&lt;p&gt;You can also build custom distributions if you need to mix providers — for example, using Fireworks for inference but self-hosted ChromaDB for vector storage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installation and Quick Start
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Install the Client
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;llama-stack-client
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Run the Server with Ollama (Local Development)
&lt;/h3&gt;

&lt;p&gt;First, make sure Ollama is running and has pulled the model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull llama3.3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then start the Llama Stack server pointing at the Ollama distribution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;INFERENCE_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"llama3.3"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;LLAMA_STACK_PORT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;8321

docker run &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="nv"&gt;$LLAMA_STACK_PORT&lt;/span&gt;:&lt;span class="nv"&gt;$LLAMA_STACK_PORT&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; ~/.llama:/root/.llama &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;INFERENCE_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$INFERENCE_MODEL&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  llamastack/distribution-ollama:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The server is now running at &lt;code&gt;http://localhost:8321&lt;/code&gt; and exposes standard OpenAI-compatible endpoints.&lt;/p&gt;

&lt;h3&gt;
  
  
  Your First API Call
&lt;/h3&gt;

&lt;p&gt;Use the OpenAI Python client directly — point it at the Llama Stack server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8321/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;not-required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# Llama Stack handles auth separately
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama3.3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain attention mechanisms in one paragraph.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Any existing code that uses the OpenAI Python SDK can point at a Llama Stack server instead, with no further changes.&lt;/p&gt;
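
&lt;p&gt;Streaming goes through the same compatibility layer; the client side below is plain OpenAI SDK usage against the server started above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Stream tokens from the Llama Stack server using the standard OpenAI SDK.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="not-required")

stream = client.chat.completions.create(
    model="llama3.3",
    messages=[{"role": "user", "content": "Write a haiku about open weights."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
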

&lt;h3&gt;
  
  
  Or Use the Native Llama Stack Client
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llama_stack_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LlamaStackClient&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LlamaStackClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8321&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inference&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat_completion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama3.3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What are the Llama 4 model sizes?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completion_message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The native client exposes additional Llama Stack-specific APIs (agents, memory, safety) that aren't part of the OpenAI SDK interface.&lt;/p&gt;

&lt;h2&gt;
  
  
  Production Deployment with vLLM
&lt;/h2&gt;

&lt;p&gt;For GPU production deployments, swap the distribution from &lt;code&gt;ollama&lt;/code&gt; to &lt;code&gt;vllm&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Start the vLLM Backend
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--runtime&lt;/span&gt; nvidia &lt;span class="nt"&gt;--gpus&lt;/span&gt; all &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; ~/.cache/huggingface:/root/.cache/huggingface &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 8000:8000 &lt;span class="se"&gt;\&lt;/span&gt;
  vllm/vllm-openai:latest &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model&lt;/span&gt; meta-llama/Llama-4-Scout-17B-16E-Instruct &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tensor-parallel-size&lt;/span&gt; 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Start Llama Stack Pointed at vLLM
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;INFERENCE_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"meta-llama/Llama-4-Scout-17B-16E-Instruct"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;VLLM_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:8000"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;LLAMA_STACK_PORT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;8321

docker run &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="nv"&gt;$LLAMA_STACK_PORT&lt;/span&gt;:&lt;span class="nv"&gt;$LLAMA_STACK_PORT&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;INFERENCE_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$INFERENCE_MODEL&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;VLLM_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$VLLM_URL&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  llamastack/distribution-vllm:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now your application code is unchanged — it still talks to &lt;code&gt;http://your-server:8321/v1&lt;/code&gt;. The only thing that moved was the backend.&lt;/p&gt;
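
&lt;p&gt;One simple way to keep that swap entirely out of application code is to read the server URL from the environment. The variable name here is our own convention, not something Llama Stack requires:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Dev and prod differ only in an environment variable, never in code.
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ.get("LLAMA_STACK_URL", "http://localhost:8321/v1"),
    api_key="not-required",
)
# Dev:  LLAMA_STACK_URL unset  (local Ollama-backed server)
# Prod: LLAMA_STACK_URL=http://your-server:8321/v1  (vLLM-backed server)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
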

&lt;h3&gt;
  
  
  Llama 4 Model Options
&lt;/h3&gt;

&lt;p&gt;Llama Stack supports the full Llama 4 family. The two currently available production models:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Parameters&lt;/th&gt;
&lt;th&gt;Active Params&lt;/th&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Llama 4 Scout&lt;/td&gt;
&lt;td&gt;109B total / 17B active&lt;/td&gt;
&lt;td&gt;17B&lt;/td&gt;
&lt;td&gt;10M tokens&lt;/td&gt;
&lt;td&gt;Single GPU, balanced tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 4 Maverick&lt;/td&gt;
&lt;td&gt;400B total / 52B active&lt;/td&gt;
&lt;td&gt;52B&lt;/td&gt;
&lt;td&gt;10M tokens&lt;/td&gt;
&lt;td&gt;Multi-GPU, high-quality output&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Both are MoE (Mixture of Experts) models under Meta's open-weight license. Scout runs on a single A100 80GB; Maverick requires 2–4 GPUs depending on quantization.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agents and Tool Use
&lt;/h2&gt;

&lt;p&gt;Llama Stack's agents API goes well beyond basic chat completion. An agent maintains a session, executes multi-step plans, and calls tools.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llama_stack_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LlamaStackClient&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LlamaStackClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8321&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create an agent with web search enabled
&lt;/span&gt;&lt;span class="n"&gt;agent_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama3.3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;instructions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a research assistant. Use search when you need current information.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;brave_search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_BRAVE_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_infer_iters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agents&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;agent_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agents&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sessions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;session_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research-session&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Turn 1
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agents&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;turns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What are the latest Llama 4 benchmarks?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;hasattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;turn_complete&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;turn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent automatically decides when to call the search tool, reads the results, and synthesizes a final answer — all within Llama Stack's orchestration layer.&lt;/p&gt;

&lt;p&gt;Built-in tools include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;brave_search&lt;/code&gt; — web search via Brave Search API&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;wolfram_alpha&lt;/code&gt; — math and science queries&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;code_interpreter&lt;/code&gt; — sandboxed Python execution&lt;/li&gt;
&lt;li&gt;Custom tools via function registration (sketched below)&lt;/li&gt;
&lt;/ul&gt;
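
&lt;p&gt;Custom tools are ordinary Python functions that the client wraps and advertises to the model. The registration hook differs across &lt;code&gt;llama_stack_client&lt;/code&gt; versions, so treat the decorator and parameter names below as a hypothetical sketch rather than the documented API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical sketch of a custom tool: the @client_tool decorator, its import
# path, and the client_tools parameter are assumptions; check the docs for your
# installed llama_stack_client version.
from llama_stack_client.lib.agents.client_tool import client_tool

@client_tool
def get_ticket_status(ticket_id: str) -&gt; str:
    """Look up the status of an internal support ticket.

    :param ticket_id: Ticket identifier, e.g. "OPS-1234".
    :returns: A short status string for the model to read.
    """
    # Replace with a real lookup; the signature and docstring are what the model sees.
    return f"Ticket {ticket_id}: open, assigned to on-call"

# Hypothetical registration alongside the agent config shown earlier:
# agent = client.agents.create(**agent_config, client_tools=[get_ticket_status])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;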

&lt;h2&gt;
  
  
  Safety: Llama Guard Integration
&lt;/h2&gt;

&lt;p&gt;Every Llama Stack distribution can run Llama Guard as the safety provider, filtering both inputs and outputs against configurable policy categories.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Check a response against safety policy
&lt;/span&gt;&lt;span class="n"&gt;safety_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;safety&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_shield&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;shield_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meta-llama/Llama-Guard-3-8B&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response_text&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;safety_response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;violation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Safety violation: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;safety_response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;violation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Safety categories can be tuned per deployment. Production use cases often enable all defaults; internal developer tools might relax some restrictions.&lt;/p&gt;

&lt;p&gt;Additional safety capabilities in Llama Stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt injection detection&lt;/li&gt;
&lt;li&gt;Output validation before streaming&lt;/li&gt;
&lt;li&gt;Rate limiting (configurable per session or API key)&lt;/li&gt;
&lt;li&gt;Human-in-the-loop controls for high-stakes actions&lt;/li&gt;
&lt;li&gt;Audit logging for compliance&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Memory and RAG
&lt;/h2&gt;

&lt;p&gt;The Memory API supports four storage types for different use cases:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Vector (FAISS, ChromaDB, Weaviate)&lt;/td&gt;
&lt;td&gt;Semantic similarity search, RAG&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Key-Value (Redis, PostgreSQL)&lt;/td&gt;
&lt;td&gt;Session state, structured lookup&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Keyword (BM25)&lt;/td&gt;
&lt;td&gt;Exact-match and hybrid search&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Graph (Neo4j)&lt;/td&gt;
&lt;td&gt;Relationship-based retrieval&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Adding RAG to an agent is a configuration change:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create a memory bank (vector store)
&lt;/span&gt;&lt;span class="n"&gt;memory_bank&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory_banks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;memory_bank_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;memory_bank_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embedding_model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;all-MiniLM-L6-v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunk_size_in_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overlap_size_in_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Insert documents
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;bank_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doc-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Llama 4 Scout has a 10M token context window...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
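
&lt;p&gt;Once documents are in the bank, retrieval happens at query time. As a hypothetical sketch of querying the bank directly (the &lt;code&gt;memory.query&lt;/code&gt; method name and response shape are assumptions; the Memory API surface has shifted between client versions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical sketch -- verify the exact Memory API surface for your
# llama_stack_client version before relying on these names.
results = client.memory.query(
    bank_id="my-docs",
    query="What is Llama 4 Scout's context window?",
)
for chunk in results.chunks:
    print(chunk.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;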



&lt;p&gt;In production, PostgreSQL is the recommended backend for both vector storage and key-value persistence, replacing in-memory FAISS for durability across restarts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Telemetry and Observability
&lt;/h2&gt;

&lt;p&gt;Llama Stack ships a complete OpenTelemetry-native telemetry system. Traces, spans, and events flow from the server to any OTEL-compatible backend (Jaeger, Grafana Tempo, Datadog, etc.).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Enable OTEL tracing in your distribution config&lt;/span&gt;
&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://jaeger:4317
&lt;span class="nv"&gt;OTEL_SERVICE_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;llama-stack-prod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every inference call, agent step, tool invocation, and safety check becomes a traced span. This gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Per-request token counts and latency&lt;/li&gt;
&lt;li&gt;Agent tool-call traces with reasoning steps&lt;/li&gt;
&lt;li&gt;Safety shield evaluation times&lt;/li&gt;
&lt;li&gt;Provider-level error attribution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For teams already using Langfuse or other LLM observability tools, Llama Stack's OTEL output integrates cleanly with existing dashboards.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Mistakes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Using the wrong distribution for your hardware.&lt;/strong&gt; The &lt;code&gt;ollama&lt;/code&gt; distribution works fine on CPU and Apple Silicon, but for A100/H100 servers, &lt;code&gt;vllm&lt;/code&gt; gives 3–5x better throughput. Don't use CPU-tier distributions for GPU production workloads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not setting model IDs consistently.&lt;/strong&gt; The model identifier in your API call must match exactly what the provider backend has loaded. With vLLM this is usually the full HuggingFace path (&lt;code&gt;meta-llama/Llama-4-Scout-17B-16E-Instruct&lt;/code&gt;); with Ollama it's the short tag (&lt;code&gt;llama3.3&lt;/code&gt;). Mismatches return a 404 that looks like a server error.&lt;/p&gt;
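
&lt;p&gt;A quick way to catch this is to ask the endpoint which model IDs it actually registered. Since the server exposes an OpenAI-compatible API, the standard models listing works (a sketch; the localhost URL and dummy key assume a local, unauthenticated deployment):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from openai import OpenAI

# Print the model IDs the running distribution exposes so the model= string
# in your requests can be copied verbatim.
client = OpenAI(base_url="http://localhost:8321/v1", api_key="unused")
for m in client.models.list():
    print(m.id)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;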

&lt;p&gt;&lt;strong&gt;Skipping Safety in development.&lt;/strong&gt; Llama Guard evaluation adds ~50ms latency. Developers sometimes disable it locally to speed up iteration, then forget to re-enable it before production. Treat safety configuration as part of your deployment checklist, not a late addition.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ignoring session management for agents.&lt;/strong&gt; Agent sessions accumulate context across turns. For production services that handle many concurrent users, set &lt;code&gt;session_ttl&lt;/code&gt; and clean up sessions explicitly, or you'll see memory growth over time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mounting volumes incorrectly for model weights.&lt;/strong&gt; The Docker images expect model weights at specific paths. If the volume mount doesn't match, the container downloads models on startup — slow, expensive, and fragile in autoscaling environments. Pre-pull weights and mount them at the documented path.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Q: Can I use Llama Stack with non-Llama models?
&lt;/h3&gt;

&lt;p&gt;Yes. Llama Stack supports any model available through its provider backends. The &lt;code&gt;openai&lt;/code&gt; distribution lets you route to GPT-4o or GPT-6, the &lt;code&gt;anthropic&lt;/code&gt; provider connects to Claude, and the &lt;code&gt;vllm&lt;/code&gt; distribution serves any HuggingFace-compatible model. The "Llama" branding is about the defaults, not a hard constraint.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: How does Llama Stack compare to LiteLLM?
&lt;/h3&gt;

&lt;p&gt;LiteLLM focuses on unified API routing to managed providers (OpenAI, Anthropic, Azure, etc.) with cost tracking and fallback logic. Llama Stack is broader: it includes self-hosting, agent orchestration, safety, RAG, and evaluation. For a team running managed cloud providers, LiteLLM is simpler. For teams self-hosting Llama models who need agents and safety, Llama Stack adds significant value beyond what LiteLLM offers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Is Llama Stack production-ready?
&lt;/h3&gt;

&lt;p&gt;The core inference and OpenAI-compatible endpoints are stable and used in production deployments. The agent, memory, and evaluation APIs are under more active development. As of version 0.2.x (April 2026), production use is most reliable for inference + safety use cases. Agents work well but have more API surface area that can change between minor versions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: What's the minimum hardware for running Llama 4 Scout with Llama Stack?
&lt;/h3&gt;

&lt;p&gt;Llama 4 Scout (17B active parameters) requires approximately 35GB VRAM in BF16 or 20GB with 4-bit quantization. A single A100 40GB handles it comfortably. For Apple Silicon, M4 Pro (48GB unified memory) or M4 Max runs it at reduced throughput. Maverick needs 2–4 A100/H100 GPUs depending on quantization level.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Can I run Llama Stack without Docker?
&lt;/h3&gt;

&lt;p&gt;Yes. Install via pip: &lt;code&gt;pip install llama-stack&lt;/code&gt; and run &lt;code&gt;llama-stack start --config path/to/config.yaml&lt;/code&gt;. Docker is recommended for production for isolation and reproducibility, but the Python package works for development and custom deployments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;Meta Llama Stack is the cleanest path from local Llama model experimentation to production deployment. Its distribution model — develop with Ollama, deploy with vLLM, never change your API client — removes the most common painful rewrite in open-source LLM adoption.&lt;/p&gt;

&lt;p&gt;The OpenAI compatibility layer is the practical unlock: teams already using the OpenAI Python SDK can switch to self-hosted Llama 4 by changing one line (&lt;code&gt;base_url&lt;/code&gt;). Combined with built-in agents, safety, RAG, and telemetry, Llama Stack positions itself as a full infrastructure layer, not just a model server.&lt;/p&gt;

&lt;p&gt;For the current state of production deployments: inference and safety are solid; agents and RAG are functional with active API evolution. A reasonable approach is to start with inference + safety in production, then evaluate the agents API for lower-stakes workloads while it stabilizes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bottom line:&lt;/strong&gt; Llama Stack is the best available open-source infrastructure layer for Llama 4 production deployments. The OpenAI-compatible API and swappable distribution model eliminate the usual vendor lock-in trade-off of self-hosting — you get full control without rewriting your application code when you scale up or change backends.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Prefer a deep-dive walkthrough? &lt;a href="https://www.youtube.com/watch?v=zyW0tmpuWIU" rel="noopener noreferrer"&gt;Watch the full video on YouTube&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llamastack</category>
      <category>metallama</category>
      <category>opensource</category>
      <category>llmdeployment</category>
    </item>
    <item>
      <title>DeepSeek V4-Pro and V4-Flash: Migration Guide and API Setup</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Sat, 25 Apr 2026 00:19:37 +0000</pubDate>
      <link>https://dev.to/jangwook_kim_e31e7291ad98/deepseek-v4-pro-and-v4-flash-migration-guide-and-api-setup-3bpb</link>
      <guid>https://dev.to/jangwook_kim_e31e7291ad98/deepseek-v4-pro-and-v4-flash-migration-guide-and-api-setup-3bpb</guid>
      <description>&lt;p&gt;DeepSeek dropped two new models on April 24, 2026: &lt;strong&gt;V4-Pro&lt;/strong&gt;, a 1.6-trillion-parameter MoE flagship, and &lt;strong&gt;V4-Flash&lt;/strong&gt;, a 284-billion-parameter workhorse optimized for throughput. Both support a one-million-token context window, dual Thinking/Non-Thinking modes, and an OpenAI-compatible API — available immediately on DeepSeek's platform and across third-party providers.&lt;/p&gt;

&lt;p&gt;There's a deadline attached. The legacy &lt;code&gt;deepseek-chat&lt;/code&gt; and &lt;code&gt;deepseek-reasoner&lt;/code&gt; model names are being retired on &lt;strong&gt;July 24, 2026, 15:59 UTC&lt;/strong&gt;. If your application targets either of those strings, you have roughly three months to update a single line of code.&lt;/p&gt;

&lt;p&gt;This guide covers what changed architecturally, how V4-Pro and V4-Flash differ, how the benchmarks and pricing compare to frontier alternatives, and exactly how to migrate — with copy-paste code examples.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;DeepSeek's V4 release lands at a moment when the frontier pricing war has broken wide open. GPT-5.5 output tokens cost $30/M; Claude Opus 4.7 costs $25/M. V4-Pro output is &lt;strong&gt;$3.48/M&lt;/strong&gt; — roughly one-seventh the price of GPT-5.5 — while scoring within single-digit percentage points of both models on most coding and reasoning benchmarks.&lt;/p&gt;

&lt;p&gt;For cost-sensitive production deployments — high-volume agents, RAG pipelines, code review bots, document analysis systems — the V4 release changes the default calculus. The question is no longer "is open-source good enough?" but "which workloads still justify the closed-source premium?"&lt;/p&gt;

&lt;p&gt;The migration urgency adds a second dimension: any team still using &lt;code&gt;deepseek-chat&lt;/code&gt; or &lt;code&gt;deepseek-reasoner&lt;/code&gt; in production needs to act before July 24.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model Overview: V4-Pro vs V4-Flash
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
  &lt;th&gt;Spec&lt;/th&gt;
  &lt;th&gt;DeepSeek V4-Pro&lt;/th&gt;
  &lt;th&gt;DeepSeek V4-Flash&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total Parameters&lt;/td&gt;
&lt;td&gt;1.6T (49B active)&lt;/td&gt;
&lt;td&gt;284B (13B active)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Architecture&lt;/td&gt;
&lt;td&gt;MoE + Hybrid Attention&lt;/td&gt;
&lt;td&gt;MoE + Hybrid Attention&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context Window&lt;/td&gt;
&lt;td&gt;1,000,000 tokens&lt;/td&gt;
&lt;td&gt;1,000,000 tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reasoning Modes&lt;/td&gt;
&lt;td&gt;Thinking + Non-Thinking&lt;/td&gt;
&lt;td&gt;Thinking + Non-Thinking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Input Pricing (cache miss)&lt;/td&gt;
&lt;td&gt;$1.74/M tokens&lt;/td&gt;
&lt;td&gt;$0.14/M tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Input Pricing (cache hit)&lt;/td&gt;
&lt;td&gt;$0.145/M tokens&lt;/td&gt;
&lt;td&gt;$0.028/M tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output Pricing&lt;/td&gt;
&lt;td&gt;$3.48/M tokens&lt;/td&gt;
&lt;td&gt;$0.28/M tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;License&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Weights on Hugging Face&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best For&lt;/td&gt;
&lt;td&gt;Complex reasoning, agents, coding&lt;/td&gt;
&lt;td&gt;High-throughput, cost-sensitive tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Both models also receive a 50% discount during Beijing off-peak hours — relevant for batch jobs that don't need real-time response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;V4-Pro&lt;/strong&gt; is designed for tasks where quality ceiling matters: complex multi-step agents, competitive programming, research-grade reasoning, long-document analysis across the full 1M context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;V4-Flash&lt;/strong&gt; replaces both &lt;code&gt;deepseek-chat&lt;/code&gt; (non-thinking mode) and &lt;code&gt;deepseek-reasoner&lt;/code&gt; (thinking mode) in the transition mapping. At $0.28/M output tokens, it's the right default for classification, summarization, extraction, customer-facing chat, and high-volume pipelines where V4-Pro's quality headroom goes unused.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's New: The Hybrid Attention Architecture
&lt;/h2&gt;

&lt;p&gt;The architectural upgrade that makes V4's 1M context practical is the &lt;strong&gt;Hybrid Attention Architecture (HAA)&lt;/strong&gt; — a combination of two complementary attention strategies applied layer by layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Compressed Sparse Attention (CSA)
&lt;/h3&gt;

&lt;p&gt;CSA first compresses KV caches along the sequence dimension (compression rate 4 in V4), then applies DeepSeek Sparse Attention. A "lightning indexer" selects the top-k most relevant compressed KV entries per query: V4-Pro selects the top 1,024; V4-Flash selects the top 512.&lt;/p&gt;

&lt;p&gt;This gives the model a high-precision view of the most relevant context chunks — similar to how a search index retrieves only the best-matching documents rather than scanning everything.&lt;/p&gt;

&lt;h3&gt;
  
  
  Heavily Compressed Attention (HCA)
&lt;/h3&gt;

&lt;p&gt;HCA applies a much more aggressive compression rate of 128, then performs dense attention over that smaller representation. Every layer gets a cheap, global view of distant tokens — the model always knows roughly what happened 800K tokens ago, even if it can't recall exact details.&lt;/p&gt;
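
&lt;p&gt;To make the mechanism concrete, here is a toy NumPy sketch of the compress-then-select idea for a single query vector. The mean-pooling compression and dot-product scoring are illustrative stand-ins, not DeepSeek's lightning indexer; the HCA path is the same pooling idea at a rate of 128 with no top-k step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def csa_toy(q, k, v, compress=4, top_k=1024):
    """Toy Compressed Sparse Attention for one query vector q of shape (d,)."""
    seq_len, d = k.shape
    n_blocks = seq_len // compress
    # 1) Compress K/V along the sequence axis (mean-pool blocks of `compress` tokens).
    k_c = k[:n_blocks * compress].reshape(n_blocks, compress, d).mean(axis=1)
    v_c = v[:n_blocks * compress].reshape(n_blocks, compress, d).mean(axis=1)
    # 2) Indexer stand-in: score compressed entries, keep the top-k for this query.
    scores = k_c @ q / np.sqrt(d)                      # shape (n_blocks,)
    keep = np.argsort(scores)[-min(top_k, n_blocks):]  # best-matching blocks
    # 3) Dense attention over only the selected compressed entries.
    w = np.exp(scores[keep] - scores[keep].max())
    w /= w.sum()
    return w @ v_c[keep]                               # attended output, shape (d,)

rng = np.random.default_rng(0)
d, seq_len = 64, 65_536            # toy 64K-token context
q = rng.standard_normal(d)
k = rng.standard_normal((seq_len, d))
v = rng.standard_normal((seq_len, d))
out = csa_toy(q, k, v)             # attends over 1,024 of 16,384 compressed entries
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;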

&lt;h3&gt;
  
  
  The Combined Effect
&lt;/h3&gt;

&lt;p&gt;By routing between CSA and HCA at every depth, V4 avoids the standard memory explosion of full attention at 1M tokens. The result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;27% of single-token inference FLOPs&lt;/strong&gt; compared to DeepSeek-V3.2 at equivalent context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;10% of KV cache memory&lt;/strong&gt; compared to V3.2&lt;/li&gt;
&lt;li&gt;Usable 1M context on standard inference hardware rather than requiring specialized memory configurations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;DeepSeek also trained V4 on 32T+ tokens using FP4 + FP8 mixed precision (MoE experts at FP4, most parameters at FP8), which contributes to the efficiency advantage over V3.2.&lt;/p&gt;

&lt;h3&gt;
  
  
  Manifold-Constrained Hyper-Connections (mHC)
&lt;/h3&gt;

&lt;p&gt;A secondary architectural addition: mHC strengthens conventional residual connections to improve signal propagation stability across the model's many layers. The practical effect is more stable training and better performance on tasks requiring deep multi-step reasoning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmarks: How V4-Pro Stacks Up
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
  &lt;th&gt;Benchmark&lt;/th&gt;
  &lt;th&gt;V4-Pro&lt;/th&gt;
  &lt;th&gt;Claude Opus 4.7&lt;/th&gt;
  &lt;th&gt;GPT-5.5&lt;/th&gt;
  &lt;th&gt;Gemini 3.1 Pro&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SWE-bench Verified&lt;/td&gt;
&lt;td&gt;80.6%&lt;/td&gt;
&lt;td&gt;~80.4%&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LiveCodeBench&lt;/td&gt;
&lt;td class="highlight"&gt;93.5&lt;/td&gt;
&lt;td&gt;88.8&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;91.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codeforces Rating&lt;/td&gt;
&lt;td class="highlight"&gt;3206&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;3168 (GPT-5.4)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BrowseComp (agentic search)&lt;/td&gt;
&lt;td&gt;83.4%&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCPAtlas (tool orchestration)&lt;/td&gt;
&lt;td&gt;73.6%&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HMMT 2026 math&lt;/td&gt;
&lt;td&gt;95.2%&lt;/td&gt;
&lt;td&gt;96.2%&lt;/td&gt;
&lt;td&gt;97.7% (GPT-5.4)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SimpleQA factual recall&lt;/td&gt;
&lt;td&gt;57.9%&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td class="highlight"&gt;75.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output cost / M tokens&lt;/td&gt;
&lt;td class="highlight"&gt;$3.48&lt;/td&gt;
&lt;td&gt;$25.00&lt;/td&gt;
&lt;td&gt;$30.00&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Where V4-Pro leads&lt;/strong&gt;: coding. LiveCodeBench 93.5 puts it ahead of both Gemini 3.1 Pro (91.7) and Claude (88.8). On real-world competitive programming via Codeforces rating (3206), it beats GPT-5.4 (3168). SWE-bench Verified (80.6%) lands within 0.2 points of Claude Opus 4.7 (~80.4%).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where it trails&lt;/strong&gt;: factual knowledge retrieval. SimpleQA-Verified at 57.9% versus Gemini's 75.6% is a meaningful gap for applications that need reliable factual recall (knowledge base Q&amp;amp;A, citation-heavy document work). Advanced math competition problems (HMMT 2026) show Claude (96.2%) and GPT-5.4 (97.7%) pulling ahead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The agentic benchmarks are the headline&lt;/strong&gt;: BrowseComp 83.4% and MCPAtlas 73.6% suggest V4-Pro is genuinely competitive at autonomous multi-step tasks — not just raw text generation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Migration Guide: Updating from deepseek-chat and deepseek-reasoner
&lt;/h2&gt;

&lt;p&gt;The migration is a one-line change in most codebases. DeepSeek kept the base URL and request/response shapes identical. Only the &lt;code&gt;model&lt;/code&gt; parameter changes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Legacy model mapping
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Old model name&lt;/th&gt;
&lt;th&gt;New equivalent&lt;/th&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;deepseek-chat&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;deepseek-v4-flash&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Non-Thinking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;deepseek-reasoner&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;deepseek-v4-flash&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Thinking&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;During the transition window (until July 24), the legacy names are already silently routing to V4-Flash. After the deadline, they will return errors.&lt;/p&gt;
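
&lt;p&gt;For larger codebases, a small shim that maps the retired names and emits a warning keeps things working through the window while you fix call sites (plain Python, not DeepSeek tooling):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import warnings

# Mapping from the table above; both legacy names route to V4-Flash.
LEGACY_MODELS = {
    "deepseek-chat": "deepseek-v4-flash",
    "deepseek-reasoner": "deepseek-v4-flash",
}

def resolve_model(name: str) -&gt; str:
    """Return the V4 replacement for a retired model name, warning if one was used."""
    if name in LEGACY_MODELS:
        warnings.warn(
            f"{name} is retired on 2026-07-24; use {LEGACY_MODELS[name]} instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return LEGACY_MODELS[name]
    return name

# model = resolve_model("deepseek-chat")  # "deepseek-v4-flash", plus a warning
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;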

&lt;h3&gt;
  
  
  Python migration (OpenAI SDK)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Before:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-deepseek-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.deepseek.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deepseek-chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# ← retiring July 24
&lt;/span&gt;    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Review this code: ...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;After (drop-in replacement):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-deepseek-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.deepseek.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deepseek-v4-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# ← updated
&lt;/span&gt;    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Review this code: ...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To upgrade to V4-Pro for higher-quality outputs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deepseek-v4-pro&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# ← flagship model
&lt;/span&gt;    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Review this code: ...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Enabling Thinking mode
&lt;/h3&gt;

&lt;p&gt;Both V4-Pro and V4-Flash support explicit Thinking mode. Enable it with a &lt;code&gt;thinking&lt;/code&gt; suffix on the model field or via the &lt;code&gt;reasoning_effort&lt;/code&gt; parameter, which DeepSeek exposes at three levels:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Thinking mode: for complex reasoning chains
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deepseek-v4-pro&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Solve this step by step: ...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;extra_body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thinking&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="c1"&gt;# enable extended reasoning
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Non-Thinking mode (default): for faster, cheaper completions
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deepseek-v4-pro&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize this document: ...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="c1"&gt;# thinking defaults to False
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you were previously using &lt;code&gt;deepseek-reasoner&lt;/code&gt; specifically for its chain-of-thought behavior, migrate to &lt;code&gt;deepseek-v4-flash&lt;/code&gt; with &lt;code&gt;"thinking": True&lt;/code&gt; — that maps directly to the same reasoning capability at the same price tier.&lt;/p&gt;
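
&lt;p&gt;Concretely, that is the earlier Thinking-mode call with the model swapped to V4-Flash (same request shape and &lt;code&gt;extra_body&lt;/code&gt; flag as above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# deepseek-reasoner replacement: V4-Flash with extended reasoning enabled
response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Solve this step by step: ..."}],
    extra_body={"thinking": True},
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;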

&lt;h3&gt;
  
  
  TypeScript / Node.js migration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;OpenAI&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;openai&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;DEEPSEEK_API_KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;baseURL&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://api.deepseek.com&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// V4-Flash: fast, cheap, suitable for most tasks&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;deepseek-v4-flash&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// V4-Pro: best quality for coding/agent tasks&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;proResponse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;deepseek-v4-pro&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Environment variable approach for easy switching
&lt;/h3&gt;

&lt;p&gt;If you centralize your model name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="c1"&gt;# Set in .env or deployment config:
# DEEPSEEK_MODEL=deepseek-v4-pro
# DEEPSEEK_MODEL=deepseek-v4-flash
&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DEEPSEEK_MODEL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deepseek-v4-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[...]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This makes it easy to A/B test Pro vs Flash across deployments without touching application code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Third-Party Providers
&lt;/h2&gt;

&lt;p&gt;DeepSeek V4 is available on multiple inference providers — useful if you need lower latency, specific geographic regions, or pay-per-use billing alternatives:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Models Available&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DeepSeek API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;V4-Pro, V4-Flash&lt;/td&gt;
&lt;td&gt;Official, cheapest at peak pricing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Together AI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;V4-Pro&lt;/td&gt;
&lt;td&gt;&lt;code&gt;deepseek-ai/DeepSeek-V4-Pro&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenRouter&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;V4-Pro, V4-Flash&lt;/td&gt;
&lt;td&gt;Unified key across providers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DeepInfra&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;V4-Pro&lt;/td&gt;
&lt;td&gt;Low-latency EU and US endpoints&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;APIYI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;V4-Flash&lt;/td&gt;
&lt;td&gt;5-minute migration guide available&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hugging Face&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Both (weights)&lt;/td&gt;
&lt;td&gt;Self-host via vLLM or TGI&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For most teams, the DeepSeek API is the default. But if you're already using Together AI or OpenRouter for provider routing, both models are available there without a separate key.&lt;/p&gt;

&lt;h2&gt;
  
  
  Self-Hosting Options
&lt;/h2&gt;

&lt;p&gt;Both models have weights published on Hugging Face:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;deepseek-ai/DeepSeek-V4-Pro&lt;/code&gt; (Apache 2.0)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;deepseek-ai/DeepSeek-V4-Flash&lt;/code&gt; (MIT)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;V4-Flash is the realistic self-host option for most teams. At 284B total parameters with 13B active per forward pass, it runs on multi-GPU setups in FP4/FP8 with reasonably sized hardware. V4-Pro at 1.6T is a data-center-scale deployment — full-scale self-hosting requires significant infrastructure, though quantized versions via Unsloth (&lt;code&gt;unsloth/DeepSeek-V4-Pro&lt;/code&gt;) reduce that burden.&lt;/p&gt;

&lt;p&gt;vLLM added native support for V4's Hybrid Attention Architecture shortly after release, making it the preferred inference framework for self-hosted deployments. DeepSeek also noted close integration with Huawei's Ascend chips for organizations running on Chinese cloud infrastructure.&lt;/p&gt;
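
&lt;p&gt;If you go the self-hosted route, vLLM's offline Python API is a minimal starting point. This is a sketch assuming the Hugging Face model ID listed above and a multi-GPU host with enough memory for the checkpoint; &lt;code&gt;tensor_parallel_size&lt;/code&gt; is a placeholder to adjust for your hardware:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from vllm import LLM, SamplingParams

# Loads the open-weight checkpoint across GPUs; sized for V4-Flash, not V4-Pro.
llm = LLM(
    model="deepseek-ai/DeepSeek-V4-Flash",
    tensor_parallel_size=8,      # adjust to your GPU count
    trust_remote_code=True,
)

params = SamplingParams(temperature=0.2, max_tokens=512)
outputs = llm.generate(["Summarize the V4-Flash architecture in two sentences."], params)
print(outputs[0].outputs[0].text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;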

&lt;h2&gt;
  
  
  Cost Comparison: When to Use Which Tier
&lt;/h2&gt;

&lt;p&gt;At $0.28/M output tokens, &lt;strong&gt;V4-Flash&lt;/strong&gt; is the right default for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High-volume classification and extraction pipelines&lt;/li&gt;
&lt;li&gt;Customer support chatbots&lt;/li&gt;
&lt;li&gt;Real-time summarization&lt;/li&gt;
&lt;li&gt;Any task where &lt;code&gt;deepseek-chat&lt;/code&gt; was already sufficient&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At $3.48/M output tokens, &lt;strong&gt;V4-Pro&lt;/strong&gt; makes sense when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're building coding agents that need reliable SWE-bench-level performance&lt;/li&gt;
&lt;li&gt;Tasks require multi-step agentic reasoning (MCPAtlas-class tool orchestration)&lt;/li&gt;
&lt;li&gt;Documents approach 100K+ tokens and you need deep contextual understanding&lt;/li&gt;
&lt;li&gt;Long-context retrieval across 500K–1M token windows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even V4-Pro at $3.48/M is a significant discount from frontier alternatives. For a team generating 100M output tokens/month:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;V4-Flash: &lt;strong&gt;$28/month&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;V4-Pro: &lt;strong&gt;$348/month&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Claude Opus 4.7: &lt;strong&gt;$2,500/month&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;GPT-5.5: &lt;strong&gt;$3,000/month&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
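
&lt;p&gt;The arithmetic behind those monthly figures is just token volume times list price; a quick sanity check in Python (ignoring input tokens, cache hits, and off-peak discounts):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;output_tokens_per_month = 100_000_000  # 100M output tokens

# Output prices per million tokens, from the comparison above.
price_per_million = {
    "deepseek-v4-flash": 0.28,
    "deepseek-v4-pro": 3.48,
    "claude-opus-4.7": 25.00,
    "gpt-5.5": 30.00,
}

for model, price in price_per_million.items():
    cost = output_tokens_per_month / 1_000_000 * price
    print(f"{model}: ${cost:,.0f}/month")
# deepseek-v4-flash: $28/month ... gpt-5.5: $3,000/month
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;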

&lt;p&gt;The cost-to-quality tradeoff genuinely favors V4-Pro for most mid-complexity workloads that aren't specifically HMMT-level math or heavy factual recall.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Mistakes to Avoid
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Migrating to the wrong tier&lt;/strong&gt;: &lt;code&gt;deepseek-reasoner&lt;/code&gt; → &lt;code&gt;deepseek-v4-flash&lt;/code&gt; (not &lt;code&gt;v4-pro&lt;/code&gt;). V4-Flash with thinking mode is the direct functional equivalent of &lt;code&gt;deepseek-reasoner&lt;/code&gt;. You don't need to upgrade to V4-Pro just because you were using the reasoning model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not setting cache_control breakpoints&lt;/strong&gt;: V4-Pro cache-hit input is $0.145/M versus $1.74/M for cache-miss — a 12x difference. For agent loops that repeat the same system prompt and tool definitions, prompt caching can cut input costs by 90%. Structure your messages to keep the cacheable prefix stable (system prompt → tools → documents → conversation history → current user message).&lt;/p&gt;
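
&lt;p&gt;A sketch of a cache-friendly message layout: the stable prefix (system prompt, tool definitions, reference documents) stays byte-identical across calls, and only the conversation tail varies. The exact &lt;code&gt;cache_control&lt;/code&gt; breakpoint syntax is provider-specific and omitted here.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Keep the cacheable prefix identical across requests so repeated agent turns
# hit the prompt cache. Only the tail (history + new user message) changes.
STABLE_SYSTEM_PROMPT = "You are a coding agent..."  # never changes
REFERENCE_DOCS = "..."                              # changes rarely

def build_messages(history, user_message):
    return [
        {"role": "system", "content": STABLE_SYSTEM_PROMPT},
        # tool definitions go in the `tools` parameter and should also stay stable
        {"role": "user", "content": "Reference material:\n" + REFERENCE_DOCS},
        *history,                                    # grows over the session
        {"role": "user", "content": user_message},   # always last
    ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;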

&lt;p&gt;&lt;strong&gt;Ignoring the 50% off-peak discount&lt;/strong&gt;: If you're running batch jobs, scheduling them during Beijing off-peak hours halves the cost. For teams in UTC±8 time zones this is trivial to configure; for US/EU teams, a simple cron job targeting the overnight batch window captures the discount without user-facing latency impact.&lt;/p&gt;
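
&lt;p&gt;A minimal scheduling sketch; the off-peak hours below are placeholders, so copy the real window from DeepSeek's pricing page before relying on it, and &lt;code&gt;submit_batch_jobs&lt;/code&gt; stands in for your own submission entry point:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Defer batch submission until the provider's off-peak window.
import time
from datetime import datetime, timezone

# Placeholder hours (UTC) -- replace with the window published on the pricing page.
OFF_PEAK_HOURS_UTC = {16, 17, 18, 19, 20, 21, 22, 23}

def wait_for_off_peak(poll_seconds=300):
    """Block until the current UTC hour falls inside the off-peak window."""
    while datetime.now(timezone.utc).hour not in OFF_PEAK_HOURS_UTC:
        time.sleep(poll_seconds)

wait_for_off_peak()
submit_batch_jobs()  # hypothetical: your existing batch submission function
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;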

&lt;p&gt;&lt;strong&gt;Assuming Thinking mode is always better&lt;/strong&gt;: V4-Pro and V4-Flash both support Thinking mode, but enabling it adds latency and cost. Use it for complex multi-step problems where the reasoning chain genuinely helps — not for simple extraction or summarization tasks where it adds overhead without quality benefit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Testing only on your original benchmark&lt;/strong&gt;: SimpleQA-Verified at 57.9% is a real gap. If your application depends on factual knowledge retrieval (especially for niche or recent information), test V4-Pro against your specific dataset before committing — Gemini 3.1 Pro may outperform here even at higher cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Q: Is July 24, 2026 a hard cutoff for deepseek-chat and deepseek-reasoner?
&lt;/h3&gt;

&lt;p&gt;Yes. DeepSeek has stated that &lt;code&gt;deepseek-chat&lt;/code&gt; and &lt;code&gt;deepseek-reasoner&lt;/code&gt; will be &lt;strong&gt;fully retired and inaccessible&lt;/strong&gt; after July 24, 2026, 15:59 UTC. Any request using those model names after that time will return an error. Plan your migration with time to test — migrating in the final week is risky for production systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Do I need to change my base_url when migrating?
&lt;/h3&gt;

&lt;p&gt;No. The base URL (&lt;code&gt;https://api.deepseek.com&lt;/code&gt;) remains the same. Only the &lt;code&gt;model&lt;/code&gt; parameter in your request body changes. The request and response shapes are unchanged, so existing parsing code requires no modification.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: What is the difference between V4-Pro and V4-Flash in thinking mode?
&lt;/h3&gt;

&lt;p&gt;Both models support Thinking and Non-Thinking modes. The difference is capability ceiling: V4-Pro has 49B active parameters versus V4-Flash's 13B, which gives it substantially better performance on complex reasoning and coding tasks even in thinking mode. V4-Flash thinking mode is appropriate for moderately complex problems at lower cost; V4-Pro thinking mode is for tasks where you need the highest quality available.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Can I self-host DeepSeek V4-Flash without special hardware?
&lt;/h3&gt;

&lt;p&gt;V4-Flash at 284B total parameters (13B active per forward pass) is feasible on multi-GPU servers with 80GB+ VRAM total. Quantized variants via Unsloth lower the memory requirements further. V4-Pro self-hosting requires significantly more resources due to the 1.6T total parameter count.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Is DeepSeek V4 suitable for agentic workflows with tool calling?
&lt;/h3&gt;

&lt;p&gt;Yes, this is one of V4-Pro's demonstrated strengths. MCPAtlas score of 73.6% measures tool orchestration performance; BrowseComp at 83.4% covers autonomous search-and-retrieval agents. V4-Pro is competitive with frontier closed-source models on these benchmarks while costing 7-9x less per token.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Does V4 support multimodal inputs?
&lt;/h3&gt;

&lt;p&gt;Currently, both V4-Pro and V4-Flash are text-only. DeepSeek stated they are "working on incorporating multimodal capabilities," but no release date has been announced. For vision tasks, Gemini 3.1 or GPT-5.5 remain the options.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Migrate before July 24, 2026&lt;/strong&gt;: &lt;code&gt;deepseek-chat&lt;/code&gt; → &lt;code&gt;deepseek-v4-flash&lt;/code&gt;, &lt;code&gt;deepseek-reasoner&lt;/code&gt; → &lt;code&gt;deepseek-v4-flash&lt;/code&gt; with thinking mode enabled. One line of code change.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;V4-Flash is the default upgrade&lt;/strong&gt;: cheaper than any frontier alternative at $0.28/M output, with 1M context and dual thinking modes built in.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;V4-Pro for coding and agents&lt;/strong&gt;: SWE-bench Verified 80.6%, LiveCodeBench 93.5, Codeforces 3206 — leads the field on coding benchmarks at $3.48/M output (1/7th of GPT-5.5).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid Attention Architecture&lt;/strong&gt; makes 1M context practical: 27% of the FLOPs and 10% of the KV cache of V3.2, enabling long-context retrieval at an inference cost that previously required far more hardware.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-hosting is viable for Flash&lt;/strong&gt;: Apache 2.0 (V4-Pro) and MIT (V4-Flash) licenses, weights on Hugging Face, vLLM support. Flash's 13B active parameters make it runnable on a multi-GPU server.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch for factual recall gaps&lt;/strong&gt;: SimpleQA at 57.9% means V4-Pro isn't the right choice for factual knowledge-heavy applications. Test on your specific dataset before committing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Bottom Line&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;DeepSeek V4-Pro is the best cost-per-quality option for coding and agentic workloads in April 2026, period. V4-Flash replaces deepseek-chat at the same price tier with a much larger context window. Migrate both before July 24 — it's one line of code and there's no reason to wait.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Prefer a deep-dive walkthrough? &lt;a href="https://www.youtube.com/watch?v=76PErINCvIg" rel="noopener noreferrer"&gt;Watch the full video on YouTube&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>deepseek</category>
      <category>llm</category>
      <category>api</category>
      <category>opensource</category>
    </item>
    <item>
      <title>GPT-5.5 Spud: Unified Multimodal API — Developer Integration Guide</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Fri, 24 Apr 2026 08:27:56 +0000</pubDate>
      <link>https://dev.to/jangwook_kim_e31e7291ad98/gpt-55-spud-unified-multimodal-api-developer-integration-guide-375o</link>
      <guid>https://dev.to/jangwook_kim_e31e7291ad98/gpt-55-spud-unified-multimodal-api-developer-integration-guide-375o</guid>
      <description>&lt;p&gt;OpenAI shipped GPT-5.5 on April 23, 2026 — six weeks after GPT-5.4 and one week after Anthropic released Claude Opus 4.7. The internal codename is "Spud," and the model is something genuinely different from the incremental updates that preceded it.&lt;/p&gt;

&lt;p&gt;GPT-5.5 is the first fully retrained base model OpenAI has released since GPT-4.5. Every prior 5.x release was a tuned derivative of the same underlying architecture. Spud is not. It processes text, images, audio, and video inside a single unified system — no Whisper call for transcription, no DALL-E endpoint for image generation, no separate pipeline stitching the modalities together. One model, one API endpoint, all four modalities.&lt;/p&gt;

&lt;p&gt;For developers building production applications, that architectural shift matters more than the benchmark numbers. Here is what you need to know.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why GPT-5.5 Is a Different Kind of Release
&lt;/h2&gt;

&lt;p&gt;The GPT-5.x lineage started with GPT-5 (August 2025), then cycled through 5.1, 5.2, 5.3, 5.4, and now 5.5. Most of those were targeted improvements — better reasoning in 5.2, faster latency in 5.3, stronger computer-use in 5.4. They shared the same fundamental architecture.&lt;/p&gt;

&lt;p&gt;GPT-5.5 breaks from that pattern. OpenAI rebuilt the token embedding layer to unify all four modalities at the representation level. Text, audio frames, image patches, and video keyframes are projected into the same vector space from the start. Previous OpenAI models encoded modalities separately and fused them at a later layer; Spud does not. The result is that the model reasons &lt;em&gt;across&lt;/em&gt; modalities rather than translating between them.&lt;/p&gt;

&lt;p&gt;The practical consequence: you can send an audio file, a screenshot, and a text question in a single request, and the model understands the relationships between all three without any preprocessing on your side. If you have been building pipelines that pass audio through Whisper first, then feed the transcript to GPT, you have a genuine opportunity to simplify your stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmark Performance: Where Spud Leads and Where It Doesn't
&lt;/h2&gt;

&lt;p&gt;GPT-5.5 landed well on agent-oriented benchmarks and general knowledge, but the picture is more nuanced when you look at code quality head-to-head against Claude Opus 4.7.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;GPT-5.5&lt;/th&gt;
&lt;th&gt;Claude Opus 4.7&lt;/th&gt;
&lt;th&gt;GPT-5.4&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Terminal-Bench 2.0&lt;/td&gt;
&lt;td class="highlight"&gt;82.7%&lt;/td&gt;
&lt;td&gt;69.4%&lt;/td&gt;
&lt;td&gt;~71%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SWE-Bench Pro&lt;/td&gt;
&lt;td&gt;58.6%&lt;/td&gt;
&lt;td class="highlight"&gt;64.3%&lt;/td&gt;
&lt;td&gt;54.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Expert-SWE&lt;/td&gt;
&lt;td&gt;73.1%&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OSWorld-Verified&lt;/td&gt;
&lt;td&gt;78.7%&lt;/td&gt;
&lt;td&gt;78.0%&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCP-Atlas&lt;/td&gt;
&lt;td&gt;75.3%&lt;/td&gt;
&lt;td class="highlight"&gt;79.1%&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MMLU&lt;/td&gt;
&lt;td class="highlight"&gt;92.4%&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;89.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hallucination Rate&lt;/td&gt;
&lt;td class="highlight"&gt;60% lower vs 5.4&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best For&lt;/td&gt;
&lt;td&gt;Agentic workflows&lt;/td&gt;
&lt;td&gt;Precision code review&lt;/td&gt;
&lt;td&gt;Lower cost&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Terminal-Bench 2.0 measures complex command-line workflows — multi-step operations that require planning, tool invocation, iteration, and recovery from failed steps. GPT-5.5's 82.7% vs Opus 4.7's 69.4% is a meaningful gap for developers building autonomous agents that operate over CLI tools, file systems, or APIs.&lt;/p&gt;

&lt;p&gt;SWE-Bench Pro is the inverse story. Claude Opus 4.7 holds a 5.7-point lead (64.3% vs 58.6%) on solving real GitHub issues. That benchmark rewards careful, precise code generation — the kind of output you want when a human will review the PR.&lt;/p&gt;

&lt;p&gt;The practical read: GPT-5.5 is the stronger choice for autonomous, long-running agentic tasks. Opus 4.7 remains stronger for code generation where precision matters and a human reviews the output. OpenAI's own &lt;a href="https://dev.to/articles/openai-agents-sdk-sandbox-memory-guide-2026"&gt;Agents SDK&lt;/a&gt; is built to run on GPT-5.5 by default.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pricing and Tier Structure
&lt;/h2&gt;

&lt;p&gt;GPT-5.5 is priced higher than GPT-5.4 but offset by significantly better token efficiency — OpenAI reports the model completes equivalent Codex tasks with fewer tokens and fewer retries.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Input (per 1M tokens)&lt;/th&gt;
&lt;th&gt;Output (per 1M tokens)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.5 Standard&lt;/td&gt;
&lt;td&gt;$5.00&lt;/td&gt;
&lt;td&gt;$30.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.5 Pro&lt;/td&gt;
&lt;td&gt;$30.00&lt;/td&gt;
&lt;td&gt;$180.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Batch / Flex&lt;/td&gt;
&lt;td&gt;$2.50&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Priority&lt;/td&gt;
&lt;td&gt;$12.50&lt;/td&gt;
&lt;td&gt;$75.00&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For context: &lt;a href="https://dev.to/articles/gpt-5-4-api-developer-guide-2026"&gt;GPT-5.4&lt;/a&gt; runs at $2.50/$15 standard, so GPT-5.5 doubles the per-token rate. OpenAI's position is that the token efficiency gain offsets the price increase in most workloads — and for multimodal tasks that previously required multiple API calls (Whisper + GPT + DALL-E), the consolidation often results in lower total cost.&lt;/p&gt;

&lt;p&gt;GPT-5.5 Pro is aimed at scientific research, complex analysis, and the highest-stakes production tasks. For most teams, standard GPT-5.5 or batch mode will be the right starting point.&lt;/p&gt;

&lt;h2&gt;
  
  
  Context Window and Available Modalities
&lt;/h2&gt;

&lt;p&gt;The context window is &lt;strong&gt;1 million tokens&lt;/strong&gt; in ChatGPT and the upcoming Responses/Chat Completions API. In Codex CLI, the window is fixed at 400K tokens across all subscription plans.&lt;/p&gt;

&lt;p&gt;GPT-5.5 handles four modalities natively:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Text&lt;/strong&gt;: Standard token-based processing, same API interface as GPT-5.4&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image&lt;/strong&gt;: Send base64-encoded or URL-referenced images; the model generates images natively as output (no separate DALL-E call)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audio&lt;/strong&gt;: Send audio files directly; transcription and speech synthesis are handled within the same request&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Video&lt;/strong&gt;: Pass video files or frame sequences; the model analyzes temporal content and can describe, summarize, or reason about video&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One million tokens of context is substantial. For reference, a typical 100K-line codebase fits in roughly 700K tokens. You can now send your entire medium-sized repository in a single API call — relevant for automated code review, architecture analysis, or repo-level refactoring tasks.&lt;/p&gt;
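
&lt;p&gt;A rough sketch of what that looks like, assuming the general API rollout has completed: it uses a crude 4-characters-per-token estimate to stay under the window and only collects Python files (adjust the glob for your stack).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: send a medium-sized repository to GPT-5.5 in a single request.
# Assumes general API access to the `gpt-5.5` model id. The 4-chars-per-token
# budget is a rough heuristic, not an exact tokenizer count.
from pathlib import Path
from openai import OpenAI

client = OpenAI()
TOKEN_BUDGET = 900_000   # leave headroom under the 1M window
CHARS_PER_TOKEN = 4      # rough heuristic

parts, used = [], 0
for path in sorted(Path("my-repo").rglob("*.py")):  # adjust the glob to your stack
    text = path.read_text(errors="ignore")
    cost = len(text) // CHARS_PER_TOKEN
    if used + cost &amp;gt; TOKEN_BUDGET:
        break
    parts.append(f"### {path}\n{text}")
    used += cost

response = client.chat.completions.create(
    model="gpt-5.5",
    messages=[
        {"role": "system", "content": "You are reviewing a codebase for inconsistencies."},
        {"role": "user", "content": "\n\n".join(parts) + "\n\nList undocumented public functions."},
    ],
)
print(response.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;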

&lt;h2&gt;
  
  
  API Integration Guide
&lt;/h2&gt;

&lt;p&gt;As of April 24, 2026, full Responses API and Chat Completions access is staged — currently available via Codex sign-in for developers, with the general API rollout described as "very soon." The model IDs to watch for are &lt;code&gt;gpt-5.5&lt;/code&gt; and &lt;code&gt;gpt-5.5-pro&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Text-Only Request (Chat Completions)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5.5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a senior software engineer.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Review this function for edge cases.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2048&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Multimodal Request: Image + Text
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;screenshot.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;image_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;b64encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5.5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data:image/png;base64,&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;image_data&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What errors are visible in this UI? List them with severity.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Multimodal Request: Audio Transcription + Analysis
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meeting.mp3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;audio_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;b64encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5.5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;audio_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mp3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize the action items from this recording.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that the audio modality interface follows the same pattern as image input — no separate Whisper API call needed. The transcript, analysis, and any follow-up reasoning happen in a single model inference.&lt;/p&gt;

&lt;h3&gt;
  
  
  Batch Processing for Cost Reduction
&lt;/h3&gt;

&lt;p&gt;For high-volume workloads where latency is not critical, the Batch API cuts costs by half:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="n"&gt;batch_requests&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;custom_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;request-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;method&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;POST&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/v1/chat/completions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5.5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;document&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;batch_requests.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;batch_requests&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;batch_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;batch_requests.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;purpose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;batch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;batches&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;input_file_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;batch_file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/v1/chat/completions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;completion_window&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;24h&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Batch results arrive within 24 hours at $2.50 per million input tokens — the most cost-efficient path for document processing, classification, or any workload that tolerates latency.&lt;/p&gt;
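
&lt;p&gt;Retrieving results follows the standard Batch API flow: poll the batch status, then download the output file once the job completes. A sketch continuing from the code above, using the current OpenAI Batch API calls:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Continuing from the batch created above: poll until done, then fetch results.
import json
import time

while True:
    batch = client.batches.retrieve(batch.id)
    if batch.status in ("completed", "failed", "expired", "cancelled"):
        break
    time.sleep(60)  # batch jobs can take minutes to hours

if batch.status == "completed":
    output = client.files.content(batch.output_file_id)
    for line in output.text.splitlines():
        result = json.loads(line)
        print(result["custom_id"], result["response"]["status_code"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;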

&lt;h2&gt;
  
  
  Agentic Use Cases Where GPT-5.5 Excels
&lt;/h2&gt;

&lt;p&gt;The Terminal-Bench 2.0 lead is not an accident. GPT-5.5 was trained specifically for multi-step autonomous task completion. Three categories stand out:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Long-running CLI agent workflows.&lt;/strong&gt; Tasks that involve reading a file, deciding what to do, invoking a shell command, observing the output, retrying on failure, and producing a final result. GPT-5.5 handles the iteration loop with significantly fewer stalls than 5.4. OpenAI reports it "uses significantly fewer tokens to complete the same Codex tasks" — meaning the model is better at deciding when it has enough information to act rather than asking for clarification.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Multimodal investigation pipelines.&lt;/strong&gt; Security analysts reviewing screenshots, logs, and audio recordings in a single context window. Accessibility auditors comparing UI screenshots to specs. QA engineers sending screen recordings and asking the model to identify regressions. These workloads become dramatically simpler without the orchestration layer between modalities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Code + documentation cross-referencing.&lt;/strong&gt; Sending an entire codebase (within the 1M context) alongside its documentation and asking the model to identify inconsistencies. At GPT-5.4's context limit, this required chunking or vector search. At 1M tokens, many medium repos fit directly.&lt;/p&gt;

&lt;p&gt;For &lt;a href="https://dev.to/articles/claude-sonnet-4-6-developer-guide-2026"&gt;Claude Code&lt;/a&gt;-style agentic coding, GPT-5.5 via Codex is OpenAI's answer. The Terminal-Bench lead suggests it is competitive for autonomous task completion; SWE-Bench Pro results suggest human-in-the-loop code review still favors Opus 4.7.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Mistakes When Integrating GPT-5.5
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Assuming audio input replaces structured data.&lt;/strong&gt; Sending audio files works, but for structured extraction (dates, numbers, names), the transcription quality benefits from an explicit instruction in the text part of the message. Don't assume the model will automatically apply structured output constraints to audio-derived content without being told.&lt;/p&gt;
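
&lt;p&gt;For example, pairing the audio part with an explicit extraction instruction makes the expected structure unambiguous. A sketch reusing the audio content shape from the earlier example; whether &lt;code&gt;response_format&lt;/code&gt; applies to audio-derived output on &lt;code&gt;gpt-5.5&lt;/code&gt; is an assumption to verify once general API access opens:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: explicit extraction instructions for audio-derived content.
# Reuses `audio_data` from the earlier audio example; the response_format
# behavior on gpt-5.5 audio input is an assumption -- verify against the docs.
response = client.chat.completions.create(
    model="gpt-5.5",
    response_format={"type": "json_object"},
    messages=[{
        "role": "user",
        "content": [
            {"type": "audio", "audio": {"data": audio_data, "format": "mp3"}},
            {"type": "text",
             "text": "Extract every date, amount, and person name mentioned. "
                     "Return JSON with keys: dates, amounts, names."},
        ],
    }],
)
print(response.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;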

&lt;p&gt;&lt;strong&gt;Ignoring batch mode for classification tasks.&lt;/strong&gt; If you are running GPT-5.5 on thousands of documents for classification, tagging, or summarization, and latency is not a constraint, not using the Batch API means you are paying 2× unnecessarily. Batch mode is half the price with no quality trade-off.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Over-estimating token efficiency gains for simple tasks.&lt;/strong&gt; The token efficiency improvement is most pronounced on complex, multi-step tasks. For simple Q&amp;amp;A or single-turn completions, GPT-5.4 at half the price is often the better call. Reserve GPT-5.5 for workloads where the capability improvement justifies the cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Using Chat Completions when Responses API is available.&lt;/strong&gt; The Responses API is the unified interface OpenAI is investing in going forward. It supports tool calling, multimodal output, and streaming in a more consistent way than Chat Completions. For new production integrations, build against Responses API from the start.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sending full videos when frame sampling suffices.&lt;/strong&gt; Video input works natively but large video files consume tokens proportionally to their duration. For most use cases, sampling key frames (every 5-10 seconds) and sending those as image inputs gives similar analysis quality at a fraction of the token cost.&lt;/p&gt;
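
&lt;p&gt;A sketch of that pattern, assuming OpenCV is installed and general API access to &lt;code&gt;gpt-5.5&lt;/code&gt;: sample one frame every few seconds, encode each as a base64 JPEG, and send the frames as image parts alongside the question.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: sample key frames from a video and send them as image inputs
# instead of uploading the full video. Assumes OpenCV (pip install opencv-python).
import base64
import cv2

def sample_frames(video_path, every_seconds=5):
    """Return base64-encoded JPEGs, one frame every `every_seconds` seconds."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30
    step = int(fps * every_seconds)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            ok, buf = cv2.imencode(".jpg", frame)
            if ok:
                frames.append(base64.b64encode(buf.tobytes()).decode("utf-8"))
        index += 1
    cap.release()
    return frames

content = [{"type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{frame}"}}
           for frame in sample_frames("screen_recording.mp4")]
content.append({"type": "text", "text": "Identify any UI regressions across these frames."})

response = client.chat.completions.create(
    model="gpt-5.5",
    messages=[{"role": "user", "content": content}],
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;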

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Q: When will GPT-5.5 be available in the general API?
&lt;/h3&gt;

&lt;p&gt;OpenAI's announcement states it will be in the Responses and Chat Completions APIs "very soon." As of April 24, developers can access it through Codex. Watch the &lt;a href="https://developers.openai.com/api/docs/changelog" rel="noopener noreferrer"&gt;OpenAI API changelog&lt;/a&gt; for the rollout announcement.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Is GPT-5.5 better than GPT-6 for everyday tasks?
&lt;/h3&gt;

&lt;p&gt;GPT-6 (released April 14, 2026) sits above GPT-5.5 in OpenAI's model hierarchy with a 2M token context window and Symphony architecture. For the highest-stakes tasks, &lt;a href="https://dev.to/articles/gpt-6-api-developer-guide-2026"&gt;GPT-6&lt;/a&gt; is the more capable option. For most production workloads — agentic coding, document analysis, multimodal pipelines — GPT-5.5 offers a better price-to-capability ratio.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Does GPT-5.5 replace the Whisper API for audio transcription?
&lt;/h3&gt;

&lt;p&gt;For most use cases, yes. GPT-5.5 handles audio input natively with equivalent transcription quality. The Whisper API remains available for dedicated transcription-only workloads where you do not need subsequent language model reasoning on the output.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: What is the model ID for GPT-5.5?
&lt;/h3&gt;

&lt;p&gt;The model identifier is &lt;code&gt;gpt-5.5&lt;/code&gt; for standard and &lt;code&gt;gpt-5.5-pro&lt;/code&gt; for the Pro tier. These identifiers are confirmed in the OpenAI changelog and will be available once the general API rollout completes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: How does GPT-5.5 handle the Claude Mythos comparison?
&lt;/h3&gt;

&lt;p&gt;VentureBeat's Terminal-Bench 2.0 results show GPT-5.5 at 82.7% narrowly ahead of Claude Mythos Preview. Mythos remains in gated access (approximately 50 organizations). For most developers, the relevant comparison is GPT-5.5 vs Claude Opus 4.7, where the trade-off is agentic tasks (GPT-5.5) vs precision code generation (Opus 4.7).&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPT-5.5 is a genuine architectural rebase&lt;/strong&gt;, not an incremental fine-tune. It is the first fully retrained OpenAI model since GPT-4.5.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Native omnimodal processing&lt;/strong&gt; means text, audio, image, and video share a single embedding space. One API call, no preprocessing pipelines.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benchmark positioning is task-dependent&lt;/strong&gt;: Terminal-Bench 2.0 (82.7%) and MMLU (92.4%) favor GPT-5.5; SWE-Bench Pro (58.6% vs 64.3%) still favors Claude Opus 4.7.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API pricing&lt;/strong&gt;: $5/$30 per million tokens standard; half that on Batch. Worth the upgrade for complex multimodal or long-running agentic tasks; overkill for simple Q&amp;amp;A.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1M token context window&lt;/strong&gt; opens up whole-repo analysis that previously required RAG or chunking.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;General API access is staged&lt;/strong&gt; — currently through Codex, full Responses/Chat Completions rollout imminent.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Bottom Line&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;GPT-5.5 is the right choice for developers building autonomous agents that operate over extended workflows, multimodal inputs, or large codebases. The unified omnimodal architecture genuinely simplifies application code that previously required multiple specialized API calls. For precision-focused code generation with human review, Claude Opus 4.7's SWE-Bench Pro advantage is still worth considering.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Prefer a deep-dive walkthrough? &lt;a href="https://www.youtube.com/watch?v=m5zJK5-MEIU" rel="noopener noreferrer"&gt;Watch the full video on YouTube&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>openai</category>
      <category>gpt55</category>
      <category>multimodalai</category>
      <category>apiintegration</category>
    </item>
    <item>
      <title>&gt;-</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Fri, 24 Apr 2026 06:50:05 +0000</pubDate>
      <link>https://dev.to/jangwook_kim_e31e7291ad98/--2bj4</link>
      <guid>https://dev.to/jangwook_kim_e31e7291ad98/--2bj4</guid>
      <description>&lt;p&gt;Yesterday (April 23), OpenAI released GPT-5.5. One sentence in the announcement stood out.&lt;/p&gt;

&lt;p&gt;"This is the first GPT flagship model designed as an agent runtime, not a chat assistant."&lt;/p&gt;

&lt;p&gt;Whether that is marketing copy or a genuine architectural shift is hard to judge right away. GPT-5.1 through 5.4 were iterative improvements on the GPT-5 base — fine-tuning and RLHF layered on top of the same foundation. GPT-5.5, according to OpenAI, is the first fully retrained base model since GPT-4.5. MMLU 92.4%, SWE-bench 88.7%, Terminal-Bench 2.0 82.7% — those are the numbers they shipped alongside the announcement.&lt;/p&gt;

&lt;p&gt;Setting the claim aside, April has been an unusually busy month for AI agents. Anthropic launched Claude Managed Agents into public beta on April 8, followed by the Claude Advisor Tool on April 9. GitHub Copilot Agent Mode reached GA in Q1. Cursor 3.0 Glass dropped in early April. In the span of a few weeks, every major AI coding and agent platform shipped a significant update. In that context, GPT-5.5 is worth examining carefully — particularly how it stacks up against Claude.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Verdict: A Real Shift, But Not One That Demands Immediate Action
&lt;/h2&gt;

&lt;p&gt;GPT-5.5 is a meaningful update. But the conclusion "every developer should switch right now" is wrong. Three reasons.&lt;/p&gt;

&lt;p&gt;First, the API is not available yet. Right now, only ChatGPT Plus/Pro/Business/Enterprise users can access it. The API comes "after additional cybersecurity guardrail review," with no specific date given. Anyone saying they've "tested GPT-5.5 and it's great" has tested it through the ChatGPT interface — not in their own agent pipelines with tool calls and custom system prompts.&lt;/p&gt;

&lt;p&gt;Second, the price doubled. Justifying that requires performance gains that clearly outweigh the cost increase. That cannot be verified independently until the API ships and third-party benchmarks appear.&lt;/p&gt;

&lt;p&gt;Third, Anthropic's concurrent releases — Managed Agents and the Advisor Tool — are not just model improvements. They are infrastructure-layer upgrades: checkpointing, credential management, scoped permissions, long-running sessions, multi-agent coordination. "A smarter model" and "more reliable agent infrastructure" serve different needs, and the right choice depends on what problem your team is actually solving.&lt;/p&gt;

&lt;p&gt;That said, the hype around GPT-5.5 is not baseless. SWE-bench 88.7% clears a threshold that no widely available model has crossed before on that benchmark, and a 6-week release cadence signals OpenAI is taking this competition seriously. Once production data accumulates post-API release, this assessment may change. For now, it is provisional.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Changed from Previous Models
&lt;/h2&gt;

&lt;p&gt;To understand GPT-5.5, it helps to understand what GPT-5.1 through 5.4 were.&lt;/p&gt;

&lt;p&gt;They were variations on a theme: take the GPT-5 base, apply targeted fine-tuning and reinforcement learning, push specific capabilities higher. Faster reasoning, more stable multimodal processing, better domain accuracy. This approach ships improvements quickly, but has a fundamental ceiling. Fine-tuning cannot fully instill patterns that require the base model to have internalized them during pre-training: complex multi-step tool call sequences, self-correction loops, long-context coherence.&lt;/p&gt;

&lt;p&gt;GPT-5.5 retrains from the ground up. Two key changes per the announcement:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pre-training data composition optimized for agentic tasks.&lt;/strong&gt; Rather than predicting text in isolation, the model learned from substantially more multi-step tool-call sequences and self-correction trajectories. OpenAI did not release the exact data mix, but said "agentic workflow data representation was significantly increased relative to prior generations." What that actually means in practice — code execution traces, API call/response pairs, error recovery loops — is not public.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speed and capability improved simultaneously.&lt;/strong&gt; Response latency matches GPT-5.4 while benchmark scores are higher. This is OpenAI's claim from their announcement materials; independent latency measurement awaits API availability. Simple scaling would not typically achieve this — it likely reflects architectural optimization. The specific mechanism is outside my expertise, and I won't pretend otherwise.&lt;/p&gt;

&lt;p&gt;The timing matters too. Six weeks from GPT-5.4. For context, previous major GPT releases were spaced 2–4 months apart. Anthropic announced Managed Agents and the Advisor Tool days before this. That is not a coincidence. The release cadence across the industry is compressing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the Benchmark Numbers Deserve Skepticism
&lt;/h2&gt;

&lt;p&gt;SWE-bench 88.7% is genuinely impressive. But using it to conclude "GPT-5.5 is way better at coding than Claude" is premature.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MMLU 92.4%&lt;/strong&gt; — a knowledge recall benchmark. It measures how much a model has memorized, not how effectively it acts on multi-step problems. High MMLU and strong agentic performance are correlated but not the same thing. An agent that knows a lot but hallucinates tool arguments is still a bad agent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SWE-bench 88.7%&lt;/strong&gt; — more directly relevant. But the comparison point often cited is Claude Sonnet 4.6 plus the Opus advisor at 74.8% on SWE-bench Multilingual. GPT-5.5's 88.7% is on the original English SWE-bench. These are different test distributions. Comparing them is not an apples-to-apples exercise. Fair comparison requires identical evaluation conditions, which we do not have.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Terminal-Bench 2.0 at 82.7%&lt;/strong&gt; — the most credible datapoint for the "agent runtime" claim. This benchmark measures what actually matters for agents: execute a command, interpret output, decide what to do next. Scoring 82.7% on this is consistent with the positioning and represents a meaningful capability for CLI-based agents and CI/CD integrations. That said, this is still an OpenAI self-reported number, and independent replication has not happened yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GDPval 84.9%&lt;/strong&gt; — OpenAI's own benchmark, which most developers had not heard of before yesterday. Self-designed benchmarks can favor the model that designed them. Cite the source when you cite the number.&lt;/p&gt;

&lt;p&gt;This pattern is familiar from &lt;a href="https://dev.to/en/blog/en/llm-api-pricing-comparison-2026-gpt5-claude-gemini-deepseek"&gt;the LLM API pricing comparison I did earlier this year&lt;/a&gt;: every major lab leads with the benchmarks that favor them. The situation has gotten worse, not better.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Price Doubled — Who Can Absorb That?
&lt;/h2&gt;

&lt;p&gt;This is the part that bothers me most about this release.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPT-5.4&lt;/strong&gt;: $2.50/1M input, $15/1M output&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt;: $5/1M input, $30/1M output&lt;/p&gt;

&lt;p&gt;Exactly 2x. "Performance goes up, price goes up" sounds reasonable in the abstract. But agentic workflows disproportionately generate output tokens — multi-step reasoning chains, tool call results being processed, intermediate state being recorded, final responses being synthesized. Output costs hit agents harder than they hit chat applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPT-5.5 Pro&lt;/strong&gt;: $30/1M input, $180/1M output. That is the highest output-token price of any major publicly available LLM. High-stakes reasoning tasks might justify it, but for most teams this tier is not realistically accessible.&lt;/p&gt;

&lt;p&gt;Let me make this concrete. Assume 500 agent tasks per day, 8,000 output tokens each:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPT-5.4: 500 × 8,000 × $15/1M = $60/day, ~$1,800/month&lt;/li&gt;
&lt;li&gt;GPT-5.5: 500 × 8,000 × $30/1M = $120/day, ~$3,600/month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is a $1,800/month difference. Whether that is worth it depends on whether the task success rate improvement, the reduction in error handling costs, and the faster iteration time add up to more than $1,800/month of value. That calculation is team-specific and requires real production data — not benchmark comparisons.&lt;/p&gt;

&lt;p&gt;Compare this to Anthropic's Claude Managed Agents at $0.08/session-hour plus standard token costs. Time-based billing is more predictable for long-running agent tasks, where token consumption can be hard to estimate in advance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Claude vs. GPT-5.5: Which One to Use
&lt;/h2&gt;

&lt;p&gt;There is no universal answer. But the decision factors are concrete enough to be useful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When GPT-5.5 makes more sense.&lt;/strong&gt; If your team is already deep in the OpenAI ecosystem — Azure OpenAI, Vercel AI SDK on OpenAI, GitHub Copilot integration — switching costs are lower. If raw coding performance on SWE-bench style tasks is your primary metric, GPT-5.5 has the higher self-reported numbers. If you are building a product on top of ChatGPT, GPT-5.5 aligns with what your end users already have access to through their ChatGPT subscription.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When Claude still makes more sense.&lt;/strong&gt; As covered in &lt;a href="https://dev.to/en/blog/en/claude-code-agentic-workflow-patterns-5-types"&gt;Claude Code's 5 agentic workflow patterns&lt;/a&gt;, Claude's tool-use behavior is more granular and context handling is more stable across long sessions. Claude Managed Agents combined with the Advisor Tool offers meaningful cost efficiency: Sonnet 4.6 as the executor plus Opus as the advisor improves task success rates while reducing per-task cost by 11.9%, according to Anthropic's data. For agent workflows that run for minutes or hours, Claude's infrastructure layer — checkpointing, credential management, scoped permissions — makes a practical difference that model benchmarks do not capture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The bigger factor: ecosystem and workflow integration.&lt;/strong&gt; A few percentage points on a benchmark matters less than which SDK your existing codebase depends on and which tooling your team already understands. Switching models is not an API key swap. Prompt engineering, error handling patterns, tool schema design, retry strategies — all of these are model-specific and require rework. I have seen teams underestimate this and lose days to re-engineering what was already working.&lt;/p&gt;

&lt;p&gt;For my own projects, I am staying with Claude for now. The &lt;a href="https://dev.to/en/blog/en/vercel-ai-sdk-claude-streaming-agent-2026"&gt;Vercel AI SDK + Claude streaming agent work&lt;/a&gt; I did recently confirmed that Claude's streaming behavior is stable even when tool calls are interleaved with generation — a common requirement in production agents that is easy to underestimate until it breaks. Once GPT-5.5 API access opens up, I plan to run the same tasks and compare directly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Decision Framework
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Existing codebase deeply tied to OpenAI SDK? → Consider GPT-5.5&lt;/li&gt;
&lt;li&gt;Agent infrastructure (checkpointing, long sessions, multi-agent) is the key need? → Claude Managed Agents&lt;/li&gt;
&lt;li&gt;Cost predictability matters? → Claude Managed Agents' time-based billing is more stable&lt;/li&gt;
&lt;li&gt;Cannot wait for independent benchmark validation? → Claude API is available now&lt;/li&gt;
&lt;li&gt;Planning to compare when GPT-5.5 API ships? → Run on Claude now, evaluate later&lt;/li&gt;
&lt;li&gt;Coding agents are the primary use case and cost is manageable? → Worth experimenting with GPT-5.5 when API opens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The real question is not "which model is better" — it is "which tool fits my team's current constraints and technical context." Both ecosystems are evolving fast. Assessments 3–6 months from now will look different.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agent Model vs. Agent Infrastructure — Not the Same Problem
&lt;/h2&gt;

&lt;p&gt;This is the part of the GPT-5.5 announcement I find most frustrating.&lt;/p&gt;

&lt;p&gt;OpenAI positioned GPT-5.5 as an "agent runtime." But what Anthropic shipped with Managed Agents is a different layer entirely. Anthropic's answer is agent infrastructure, not just a better agent model: checkpointing, credential management, scoped permissions, multi-agent coordination, long-running session support — all provided at the platform level.&lt;/p&gt;

&lt;p&gt;If GPT-5.5 is "a model optimized for agent runtimes," Managed Agents is "the infrastructure for running agents." Smarter engine versus more reliable rails. Both matter, but conflating them is a category error.&lt;/p&gt;

&lt;p&gt;My read — and I hold this with appropriate uncertainty — is that whoever establishes the infrastructure standard for production agents will have a durable advantage over whoever just has the best benchmark scores. Benchmarks are quarterly. Infrastructure lock-in is much stickier. As I explored in &lt;a href="https://dev.to/en/blog/en/ai-agent-framework-comparison-2026-langgraph-crewai-dapr-production"&gt;the 2026 AI agent framework comparison&lt;/a&gt;, the agent ecosystem is converging toward integrated framework-plus-infrastructure stacks, and early positioning there tends to be self-reinforcing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Questions This Release Leaves Open
&lt;/h2&gt;

&lt;p&gt;Several things remain unclear after the announcement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;API availability timeline is vague.&lt;/strong&gt; "After additional cybersecurity guardrail review" provides no schedule. Positioning a model as an agent runtime while keeping it inaccessible to API developers creates friction between the marketing claim and the developer experience. Claude Managed Agents was available to API users from day one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The "agent runtime" claim lacks infrastructure specifics.&lt;/strong&gt; The benchmarks support the claim at the model performance layer. But what does "agent runtime design" mean for checkpointing? For long-running sessions? For credential handling? The announcement is heavy on numbers and light on architectural detail.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pro tier pricing is difficult to justify without independent data.&lt;/strong&gt; $180/1M output is the highest in the mainstream LLM market. Justifying that requires demonstrably superior task completion rates — which we cannot evaluate until the API ships and production data accumulates.&lt;/p&gt;

&lt;p&gt;One more thing worth noting: if GPT-5.5 is heavily optimized for agentic tasks, it may show little improvement over GPT-5.4 for simple conversation use cases. For users who interact with ChatGPT as a writing assistant or search tool rather than an agent platform, GPT-5.5 is likely to feel like a more expensive GPT-5.4.&lt;/p&gt;




&lt;p&gt;GPT-5.5 is a real release that deserves attention. The benchmark numbers, the positioning pivot toward agent runtimes, the accelerated release cadence — taken together, they signal that the competition for the agentic AI platform is intensifying faster than most predicted.&lt;/p&gt;

&lt;p&gt;But there is no reason to move production workloads to GPT-5.5 today. The API is not available. The price doubled. Independent evaluation is pending. And Anthropic is not standing still — they are building infrastructure that complements the model layer in ways GPT-5.5's announcement did not match.&lt;/p&gt;

&lt;p&gt;Who wins the production agent standard war will not be decided by a benchmark press release. Developer experience, pricing, reliability, and infrastructure depth will be the actual differentiators. That race is still very open.&lt;/p&gt;

&lt;p&gt;When GPT-5.5 API access opens, I intend to test the same agent workflows against the Claude Managed Agents stack — same prompts, same tasks, same success criteria. That comparison will be more informative than anything in today's announcement. My conclusion will follow from what actually happens when the rubber meets the road.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Cursor 2.0: 8 Parallel AI Agents and Visual Editor Bridge</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Fri, 24 Apr 2026 04:22:43 +0000</pubDate>
      <link>https://dev.to/jangwook_kim_e31e7291ad98/cursor-20-8-parallel-ai-agents-and-visual-editor-bridge-50nk</link>
      <guid>https://dev.to/jangwook_kim_e31e7291ad98/cursor-20-8-parallel-ai-agents-and-visual-editor-bridge-50nk</guid>
      <description>&lt;h1&gt;
  
  
  Cursor 2.0: 8 Parallel AI Agents and Visual Editor Bridge
&lt;/h1&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;8.6
/ 10



  &amp;lt;span&amp;gt;Parallel Agent Performance&amp;lt;/span&amp;gt;

  &amp;lt;span&amp;gt;9.0&amp;lt;/span&amp;gt;


  &amp;lt;span&amp;gt;Composer Model Speed&amp;lt;/span&amp;gt;

  &amp;lt;span&amp;gt;8.8&amp;lt;/span&amp;gt;


  &amp;lt;span&amp;gt;Visual Editor Usability&amp;lt;/span&amp;gt;

  &amp;lt;span&amp;gt;8.2&amp;lt;/span&amp;gt;


  &amp;lt;span&amp;gt;Value for Money&amp;lt;/span&amp;gt;

  &amp;lt;span&amp;gt;8.5&amp;lt;/span&amp;gt;


  &amp;lt;span&amp;gt;Workflow Integration&amp;lt;/span&amp;gt;

  &amp;lt;span&amp;gt;8.7&amp;lt;/span&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Cursor 2.0 launched on October 29, 2025, and it drew a hard line in AI-assisted development. For the first time, Anysphere shipped its own proprietary coding model — Composer — and rebuilt the agent interface around one idea: multiple agents working in parallel, each in its own isolated environment, without stepping on each other.&lt;/p&gt;

&lt;p&gt;By April 2026, when Cursor 3 arrived with its full Agents Window redesign, the foundation was already established in 2.0. If you want to understand &lt;em&gt;why&lt;/em&gt; Cursor's parallel agent architecture works the way it does — and how to use it effectively — 2.0 is where it started. This review covers what Cursor 2.0 introduced, how it evolved into 2.2's Visual Editor Bridge, and whether the approach holds up.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is Cursor 2.0?
&lt;/h2&gt;

&lt;p&gt;Cursor is a VS Code fork that puts AI agents at the center of the development workflow. Unlike GitHub Copilot, which suggests code inline, or &lt;a href="https://dev.to/articles/ai-coding-market-share-claude-code-cursor-copilot-2026"&gt;Claude Code&lt;/a&gt;, which runs in your terminal, Cursor embeds the agent directly into the IDE with access to your entire codebase.&lt;/p&gt;

&lt;p&gt;Cursor 2.0 was the first version to ship a parallel agent UI alongside Anysphere's own model. Before 2.0, agents ran one at a time — you had to wait for one task to finish before starting another. The 2.0 release broke that constraint entirely.&lt;/p&gt;

&lt;p&gt;Three things defined the 2.0 milestone:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Composer&lt;/strong&gt; — Anysphere's first proprietary coding model, built specifically for low-latency agentic work&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel Agents UI&lt;/strong&gt; — a new interface that makes running up to 8 agents simultaneously feel manageable rather than chaotic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Git Worktrees integration&lt;/strong&gt; — the under-the-hood mechanism that makes true isolation possible&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Everything in &lt;a href="https://dev.to/articles/cursor-3-review-background-agents-2026"&gt;Cursor 3&lt;/a&gt; — the Agents Window, Design Mode, cloud agents — built on top of what 2.0 proved out.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Composer Model: Cursor's First Proprietary AI
&lt;/h2&gt;

&lt;p&gt;Until Cursor 2.0, the IDE routed all requests to third-party models: Claude, GPT-4o, Gemini. Composer changed that. It is Anysphere's own frontier model, trained specifically for agentic coding tasks inside Cursor.&lt;/p&gt;

&lt;p&gt;The headline numbers: Composer generates at 250 tokens per second and completes most agent turns in under 30 seconds. In benchmarks comparing average task completion time, Composer finished in 62 seconds against GitHub Copilot's 89 seconds. The trade-off is accuracy — Composer sits at around 51.7% success rate on SWE-Bench style tasks compared to Copilot's 56.5%. Speed wins over correctness in many interactive development loops, where the developer is reviewing and steering the agent anyway.&lt;/p&gt;

&lt;p&gt;What makes Composer different from simply routing to a fast model is that it was trained with codebase-aware tools: semantic search across the full repo, file reading, diff generation, and test execution. It is not a general-purpose model given a system prompt — it is a purpose-built model that understands the specific operations a coding agent needs.&lt;/p&gt;

&lt;p&gt;In practice, Composer is the default for agent tasks in Auto mode. You can still choose Claude Sonnet 4.6, GPT-6, or any other model from the model picker, but doing so draws from your monthly credit pool. Composer in Auto mode is unlimited on all paid plans.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running 8 Agents in Parallel
&lt;/h2&gt;

&lt;p&gt;The parallel agent interface is the feature most developers asked about after the 2.0 release. Here is how it actually works.&lt;/p&gt;

&lt;p&gt;When you open the Agents panel in Cursor 2.0, you see a list view of active agents, each with their task description, status, and current diff. You can start a new agent with &lt;code&gt;Cmd+Shift+A&lt;/code&gt; (macOS) or &lt;code&gt;Ctrl+Shift+A&lt;/code&gt; (Windows/Linux). Each agent gets its own conversation thread and its own working environment.&lt;/p&gt;

&lt;p&gt;You can run up to 8 agents simultaneously. In practice, the most useful pattern is task specialization: one agent handles a feature branch, another fixes failing tests, a third does documentation updates. They work simultaneously and independently.&lt;/p&gt;

&lt;p&gt;Starting a parallel agent session looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Agent 1: Add feature
Refactor the UserService class to support OAuth token refresh. 
Tests are in tests/user_service_test.py.

# Agent 2: Fix tests
The integration tests in tests/api/ are failing with a 401 
after the recent middleware changes. Diagnose and fix.

# Agent 3: Update docs
Update the API reference in docs/api.md to reflect the 
/auth/refresh endpoint added in PR #412.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each agent runs in parallel. You switch between them in the sidebar like switching browser tabs.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Git Worktrees Enable True Isolation
&lt;/h3&gt;

&lt;p&gt;The reason parallel agents do not conflict with each other is git worktrees. This is the technical foundation that makes 2.0's parallel agents actually safe.&lt;/p&gt;

&lt;p&gt;A git worktree creates a separate working copy of your repository that shares the same &lt;code&gt;.git&lt;/code&gt; directory — same commit history, same branches — but with independent file state. When Cursor spawns an agent, it creates a dedicated worktree for that agent. Changes the agent makes exist only in that worktree until you explicitly merge them.&lt;/p&gt;

&lt;p&gt;You can also create a worktree manually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Cursor creates these automatically, but you can manage them directly&lt;/span&gt;
git worktree add ../cursor-agent-feature feature/oauth-refresh
git worktree add ../cursor-agent-fix-tests fix/integration-401
git worktree list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/Users/you/myproject          abc1234 [main]
/Users/you/cursor-agent-feature  def5678 [feature/oauth-refresh]
/Users/you/cursor-agent-fix-tests  789abcd [fix/integration-401]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each agent sees a clean state. If agent 2 modifies &lt;code&gt;src/auth.ts&lt;/code&gt;, agent 1 does not see that change. When each agent finishes, you review the diff and merge — or discard — independently. Cursor manages worktree creation and cleanup automatically when agents start and stop.&lt;/p&gt;

&lt;h3&gt;
  
  
  The /best-of-n Command
&lt;/h3&gt;

&lt;p&gt;Cursor 2.0 also shipped &lt;code&gt;/best-of-n&lt;/code&gt;, a command that runs the same task across multiple models simultaneously. You invoke it in the agent chat:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/best-of-n Write a function that validates JWT tokens and handles 
clock skew within a 30-second window
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cursor spins up N agents (default: 3, configurable), each using a different model — Composer, Claude Sonnet 4.6, GPT-6 — and runs them in parallel in separate worktrees. When they finish, you get a side-by-side comparison of the three results and choose which to keep.&lt;/p&gt;

&lt;p&gt;For harder tasks where correctness matters more than speed, best-of-n materially improves output quality. Anysphere found that comparing results across models and picking the best one increased success rates by roughly 20% on complex coding tasks.&lt;/p&gt;
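&lt;p&gt;To make the idea concrete (this is an illustrative sketch of the pattern, not Cursor's internal implementation), the snippet below sends the same task to several OpenAI-compatible endpoints in parallel and collects the candidates for manual review. The base URL and model names are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Illustrative best-of-n sketch (not Cursor's implementation).
# Assumes an OpenAI-compatible endpoint; base_url and model names are placeholders.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
MODELS = ["model-a", "model-b", "model-c"]  # hypothetical model identifiers

def run_task(model, task):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": task}],
    )
    return model, response.choices[0].message.content

task = "Write a function that validates JWT tokens and handles clock skew."
with ThreadPoolExecutor(max_workers=len(MODELS)) as pool:
    candidates = list(pool.map(lambda m: run_task(m, task), MODELS))

for model, answer in candidates:
    print(f"=== {model} ===")  # review the candidates side by side and keep the best
    print(answer)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;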

&lt;h2&gt;
  
  
  The Visual Editor Bridge
&lt;/h2&gt;

&lt;p&gt;Cursor 2.2, released after the initial 2.0 launch, added the Visual Editor for Cursor Browser — commonly called the "visual editor bridge." This feature connects the agent's code output to a visual representation of the running UI.&lt;/p&gt;

&lt;p&gt;When you have Cursor's built-in browser open (accessible via &lt;code&gt;Cmd+Shift+B&lt;/code&gt;), the Visual Editor lets you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Click on any element in the rendered UI to select it&lt;/li&gt;
&lt;li&gt;Annotate that element with a note (e.g., "move this button 8px to the right")&lt;/li&gt;
&lt;li&gt;Add the annotation directly to the agent conversation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of describing UI changes in text ("the submit button in the footer needs more padding"), you click the button, add a comment, and the agent receives precise element-level context. It reads the annotation, locates the relevant component in the codebase, and makes the change.&lt;/p&gt;

&lt;p&gt;This is particularly useful for front-end work where the gap between "I see the problem" and "I can describe the problem in words" costs time. The visual editor bridge closes that gap.&lt;/p&gt;

&lt;p&gt;Note: In &lt;a href="https://dev.to/articles/cursor-3-review-background-agents-2026"&gt;Cursor 3&lt;/a&gt;, this concept expanded significantly into Design Mode, which works directly in the Agents Window and supports drag-select annotations, keyboard shortcuts for targeting elements, and canvas visualizations. But the visual editor bridge in 2.2 established the pattern.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Strengths
&amp;lt;ul&amp;gt;
  &amp;lt;li&amp;gt;Up to 8 true parallel agents via git worktrees — no file conflicts between agents&amp;lt;/li&amp;gt;
  &amp;lt;li&amp;gt;Composer generates at 250 tokens/sec, completing most turns in under 30 seconds&amp;lt;/li&amp;gt;
  &amp;lt;li&amp;gt;/best-of-n improves output quality on complex tasks by comparing across models&amp;lt;/li&amp;gt;
  &amp;lt;li&amp;gt;Visual editor bridge removes the verbosity of describing UI changes in text&amp;lt;/li&amp;gt;
  &amp;lt;li&amp;gt;Auto mode is unlimited on all paid plans — Composer has no credit cost&amp;lt;/li&amp;gt;
&amp;lt;/ul&amp;gt;


Limitations
&amp;lt;ul&amp;gt;
  &amp;lt;li&amp;gt;Composer accuracy (51.7%) trails some third-party models — matters for critical tasks&amp;lt;/li&amp;gt;
  &amp;lt;li&amp;gt;Managing 8 agents simultaneously is cognitively demanding without a clear task system&amp;lt;/li&amp;gt;
  &amp;lt;li&amp;gt;Visual editor requires Cursor's built-in browser — doesn't work with external browsers&amp;lt;/li&amp;gt;
  &amp;lt;li&amp;gt;Credit pool system can be confusing: manual model selection draws from pool, Auto doesn't&amp;lt;/li&amp;gt;
&amp;lt;/ul&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Pricing Breakdown
&lt;/h2&gt;

&lt;p&gt;Cursor's credit system, introduced in June 2025, ties your monthly usage allowance to your plan price in dollars. Auto mode (using Composer) is unlimited and costs zero credits on all paid plans. Manually picking a premium model draws from your pool at rates that vary by model.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Plan&lt;/th&gt;
&lt;th&gt;Price&lt;/th&gt;
&lt;th&gt;Credit Pool&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Hobby&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Trying it out&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pro&lt;/td&gt;
&lt;td&gt;$20/mo&lt;/td&gt;
&lt;td&gt;$20/mo&lt;/td&gt;
&lt;td&gt;Solo developers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pro+&lt;/td&gt;
&lt;td&gt;$60/mo&lt;/td&gt;
&lt;td&gt;$60/mo&lt;/td&gt;
&lt;td&gt;Power users, heavy parallel agent use&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ultra&lt;/td&gt;
&lt;td&gt;$200/mo&lt;/td&gt;
&lt;td&gt;$200/mo&lt;/td&gt;
&lt;td&gt;Intensive daily use, 20x Pro usage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Teams&lt;/td&gt;
&lt;td&gt;$40/user/mo&lt;/td&gt;
&lt;td&gt;Per user&lt;/td&gt;
&lt;td&gt;Engineering teams, SSO, admin controls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Enterprise&lt;/td&gt;
&lt;td&gt;Custom&lt;/td&gt;
&lt;td&gt;Pooled org usage&lt;/td&gt;
&lt;td&gt;Large organizations, compliance needs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For most individual developers who rely primarily on Composer in Auto mode, Pro at $20/month covers the vast majority of work. The credit pool only matters when you deliberately choose premium third-party models. If you run /best-of-n frequently with GPT-6 or Claude Sonnet 4.6, Pro+ makes more financial sense.&lt;/p&gt;

&lt;p&gt;Annual billing saves 20% on Pro ($16/month instead of $20).&lt;/p&gt;

&lt;h2&gt;
  
  
  Who Should Use Cursor 2.0?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Good fit
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Full-stack developers&lt;/strong&gt; who regularly switch between features, tests, and documentation — parallel agents let you delegate each to a separate agent simultaneously&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Teams moving from GitHub Copilot&lt;/strong&gt; who want a more complete agentic experience beyond inline suggestions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developers who value speed over perfect first-pass accuracy&lt;/strong&gt; — Composer is fast and Cursor's review interface makes iterating on agent output quick&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Front-end developers&lt;/strong&gt; who want to give agents visual UI feedback without writing long textual descriptions&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Less ideal
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Developers who need maximum accuracy on complex tasks&lt;/strong&gt; and cannot afford iteration — in that scenario, routing manually to Claude Sonnet 4.6 or GPT-6 (with the credit cost) makes more sense than defaulting to Composer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vim or Neovim users&lt;/strong&gt; who do not want to leave their editor — Cursor is VS Code-based&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Projects with strict data residency requirements&lt;/strong&gt; that haven't yet evaluated Cursor's enterprise data handling policies&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Q: How does Cursor 2.0 differ from Cursor 3?
&lt;/h3&gt;

&lt;p&gt;Cursor 2.0 introduced parallel agents, the Composer model, and git worktrees for isolation. Cursor 3 (April 2, 2026) rebuilt the entire interface into the Agents Window, added Design Mode for precise UI annotations, and introduced cloud agents and the /best-of-n improvements. The 2.0 features are all still present in 3.0 — 3.0 is an evolution, not a replacement architecture.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Does using 8 parallel agents multiply my credit consumption?
&lt;/h3&gt;

&lt;p&gt;Only if you are using premium third-party models manually. Composer in Auto mode costs zero credits, so 8 parallel agents running in Auto mode are effectively free of per-token cost. If you use /best-of-n with Claude or GPT-6, each model run draws from your pool. For heavy /best-of-n users, Pro+ at $60/month is the practical entry point.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Can I use Cursor 2.0's parallel agents on monorepos?
&lt;/h3&gt;

&lt;p&gt;Yes, and monorepos are actually a strong use case. Because each agent runs in a git worktree, they share the repository's full history and branch state but maintain independent file copies. Multiple agents can work on different packages in the monorepo simultaneously without conflicts. Cursor handles worktree creation and cleanup automatically.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Does the visual editor bridge work with React, Vue, and other frameworks?
&lt;/h3&gt;

&lt;p&gt;The visual editor works with any framework that renders in Cursor's built-in Chromium browser. React, Vue, Svelte, Angular — if it renders in the browser, the visual editor can target it. The agent uses the DOM structure and component metadata to locate the relevant code in your project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;Cursor 2.0 changed the fundamental unit of AI-assisted development from a single conversation to a parallel team of agents. The Composer model provided the speed necessary to make that practical — waiting 4 minutes for a single response makes parallel agents useless; waiting 62 seconds makes them very useful.&lt;/p&gt;

&lt;p&gt;The git worktree architecture is the part that makes it real rather than theoretical. True file isolation means you can genuinely run 8 agents without fear of them corrupting each other's work. The visual editor bridge, arriving in 2.2, added a dimension of feedback precision that text prompts cannot match.&lt;/p&gt;

&lt;p&gt;If you are evaluating Cursor today in April 2026, you are looking at Cursor 3's interface on top of this foundation. But understanding what 2.0 built explains why the platform is structured the way it is — and how to use parallel agents effectively rather than just spawning them and hoping for the best.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bottom Line&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Cursor 2.0 earned its reputation by making parallel agents actually work — not as a demo, but as a daily workflow tool backed by git worktrees and a purpose-built model fast enough to justify the parallel overhead. If your development workflow involves managing more than one concern at a time (and whose doesn't?), this is the release that made Cursor worth serious consideration.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Prefer a deep-dive walkthrough? &lt;a href="https://www.youtube.com/watch?v=OLFK2BM_Qns" rel="noopener noreferrer"&gt;Watch the full video on YouTube&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aitools</category>
      <category>cursor</category>
      <category>codeeditor</category>
      <category>parallelagents</category>
    </item>
    <item>
      <title>Llama 4 Maverick: 400B MoE Model — Self-Hosting and API Guide</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Fri, 24 Apr 2026 00:18:51 +0000</pubDate>
      <link>https://dev.to/jangwook_kim_e31e7291ad98/llama-4-maverick-400b-moe-model-self-hosting-and-api-guide-2bc5</link>
      <guid>https://dev.to/jangwook_kim_e31e7291ad98/llama-4-maverick-400b-moe-model-self-hosting-and-api-guide-2bc5</guid>
      <description>&lt;h2&gt;
  
  
  Why Llama 4 Maverick Matters
&lt;/h2&gt;

&lt;p&gt;In April 2025, Meta shipped the Llama 4 family and reset what open-weight models can do. Llama 4 Maverick — the mid-tier model in the three-model lineup (Scout, Maverick, Behemoth) — packs 400 billion total parameters into a mixture-of-experts design that only activates 17 billion at inference time. That combination delivers near-frontier multimodal performance at a fraction of the compute cost compared to running a dense 400B model.&lt;/p&gt;

&lt;p&gt;What makes Maverick uniquely interesting for infrastructure teams: the weights are free. You can run it inside your own VPC, own the full data pipeline, and still get a model that beats GPT-4o and Gemini 2.0 Flash on most multimodal benchmarks. For teams that cannot send data to a third-party API — finance, healthcare, defense — Maverick is the strongest open-weight option available.&lt;/p&gt;

&lt;p&gt;This guide covers everything you need to put Maverick into production: the architecture, the real hardware costs, step-by-step vLLM setup, which managed API to pick if you don't want to self-host, and how benchmarks stack up against proprietary alternatives.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture: MoE in Practice
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Mixture of Experts — the active parameter trick
&lt;/h3&gt;

&lt;p&gt;Llama 4 Maverick uses an &lt;strong&gt;alternating dense and MoE layer architecture&lt;/strong&gt;. In every MoE layer, each token activates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One &lt;strong&gt;shared expert&lt;/strong&gt; (always active)&lt;/li&gt;
&lt;li&gt;One of 128 &lt;strong&gt;routed experts&lt;/strong&gt; (selected per token by a learned router)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The router is a small linear layer that picks which expert processes each token. At inference time, only around 17 billion of the 400 billion parameters do actual computation. The remaining ~383B parameters sit in VRAM, cold.&lt;/p&gt;

&lt;p&gt;This creates the central hardware trade-off: you pay in &lt;strong&gt;memory&lt;/strong&gt; for all 400B parameters, but you pay in &lt;strong&gt;compute&lt;/strong&gt; only for 17B. For batch inference with large batches, this is a significant throughput advantage. For latency-sensitive, small-batch workloads, the advantage shrinks because you still need all weights loaded.&lt;/p&gt;
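&lt;p&gt;A toy PyTorch sketch helps make the routing idea concrete: one always-on shared expert plus one routed expert chosen per token by a small linear router. The dimensions below are tiny placeholders, not Maverick's real configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Toy top-1 MoE layer: shared expert + one routed expert per token.
# Dimensions are illustrative placeholders, not Llama 4 Maverick's real config.
import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=8):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.router = nn.Linear(d_model, n_experts)   # learned router: one score per expert

    def forward(self, x):                              # x shape: (tokens, d_model)
        expert_ids = self.router(x).argmax(dim=-1)     # top-1 routing decision per token
        out = self.shared(x)                           # shared expert is always active
        for i, expert in enumerate(self.experts):
            mask = expert_ids == i
            if mask.any():                             # only the selected expert computes for its tokens
                out[mask] = out[mask] + expert(x[mask])
        return out

tokens = torch.randn(10, 64)
print(ToyMoELayer()(tokens).shape)                     # torch.Size([10, 64])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;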

&lt;h3&gt;
  
  
  Early fusion multimodality
&lt;/h3&gt;

&lt;p&gt;Unlike previous Llama releases that handled vision through a separate projection head bolted onto the text model, Llama 4 uses &lt;strong&gt;early fusion&lt;/strong&gt;: image patches and text tokens are encoded into the same embedding space from the first transformer layer. There is no separate visual encoder like a CLIP tower. The practical effect is more natural interleaving of visual and textual reasoning — the model can reference specific image regions mid-sentence rather than treating vision as a separate step.&lt;/p&gt;

&lt;h3&gt;
  
  
  Context window: Maverick vs Scout
&lt;/h3&gt;

&lt;p&gt;A common point of confusion: Maverick supports up to &lt;strong&gt;1 million tokens&lt;/strong&gt;. Scout — the smaller sibling with 16 experts instead of 128 — is the one that reaches 10 million tokens. The trade-off is quality: Maverick's 128-expert depth produces significantly stronger scores on reasoning and coding benchmarks. For most production workloads under 500K tokens, Maverick is the right pick. If you genuinely need multi-million-token context, Scout is the tool.&lt;/p&gt;

&lt;p&gt;In practice, hardware determines how close to Maverick's 1M ceiling you can operate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;8× H100 80GB (BF16)&lt;/strong&gt;: approximately 430K tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;8× H200 141GB (BF16)&lt;/strong&gt;: full 1M tokens&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Hardware Requirements
&lt;/h2&gt;

&lt;p&gt;Self-hosting Maverick is not a single-GPU affair. Here are the real numbers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;VRAM Required&lt;/th&gt;
&lt;th&gt;Minimum Hardware&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;BF16 (full precision)&lt;/td&gt;
&lt;td&gt;~800 GB total&lt;/td&gt;
&lt;td&gt;8× H100 80GB or 10× A100 80GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FP8 quantized&lt;/td&gt;
&lt;td&gt;~400 GB total&lt;/td&gt;
&lt;td&gt;4× H100 80GB or 4× H200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;INT4 quantized&lt;/td&gt;
&lt;td&gt;~200 GB total&lt;/td&gt;
&lt;td&gt;2× H200 or 8× A40 48GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: VRAM requirements above cover model weights only. Add 20–50% headroom for the KV cache at production context lengths.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For most teams, &lt;strong&gt;FP8 Maverick on 4× H100 80GB&lt;/strong&gt; is the practical entry point. Benchmark quality versus BF16 on your specific use case before committing — FP8 typically shows near-identical MMLU and HumanEval scores with roughly half the memory cost.&lt;/p&gt;

&lt;p&gt;On cloud hardware, an 8× H100 node runs approximately $32/hour on-demand from the major GPU clouds. FP8 on 4× H100s cuts that roughly in half. If you expect sustained load, reserved instances reduce the effective rate to $12–16/hour.&lt;/p&gt;
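&lt;p&gt;A rough way to budget the KV cache headroom mentioned above is to compute it from the transformer hyperparameters. The values in the example call below are placeholders; read the real layer count, KV-head count, and head dimension from the model's config file before trusting the number:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Back-of-the-envelope KV-cache sizing. Hyperparameter values below are
# placeholders, not Maverick's published config; read them from the model's config.json.
def kv_cache_gib(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # 2x for keys and values, stored per layer, per KV head, per token
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len
    return total_bytes / 1024**3

# Example with placeholder hyperparameters at a 200K-token context in BF16 (2 bytes/element)
print(round(kv_cache_gib(seq_len=200_000, n_layers=48, n_kv_heads=8, head_dim=128), 1), "GiB")
# prints roughly 36.6 GiB for these illustrative values
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;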




&lt;h2&gt;
  
  
  Self-Hosting with vLLM
&lt;/h2&gt;

&lt;p&gt;vLLM v0.8.3 introduced full Llama 4 Maverick support, including FP8 quantization and multimodal inputs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Install vLLM
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"vllm&amp;gt;=0.8.3"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Download weights from Hugging Face
&lt;/h3&gt;

&lt;p&gt;Accept Meta's Llama 4 Community License at huggingface.co before downloading:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;huggingface-cli login
huggingface-cli download meta-llama/Llama-4-Maverick-17B-128E-Instruct &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--local-dir&lt;/span&gt; ./llama4-maverick
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Launch the inference server
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;BF16 on 8× H100:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tensor-parallel-size&lt;/span&gt; 8 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-model-len&lt;/span&gt; 400000 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dtype&lt;/span&gt; bfloat16 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--trust-remote-code&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;FP8 on 4× H100 (recommended starting point):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tensor-parallel-size&lt;/span&gt; 4 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-model-len&lt;/span&gt; 200000 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dtype&lt;/span&gt; fp8 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--quantization&lt;/span&gt; fp8 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--trust-remote-code&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;--trust-remote-code&lt;/code&gt; is required. Maverick's MoE routing uses custom code that vLLM cannot load without it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Test with the OpenAI-compatible endpoint
&lt;/h3&gt;

&lt;p&gt;vLLM exposes a &lt;code&gt;/v1/chat/completions&lt;/code&gt; endpoint. Any library built for the OpenAI SDK works without changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8000/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;none&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meta-llama/Llama-4-Maverick-17B-128E-Instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain MoE routing in under 100 words.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Multimodal requests
&lt;/h3&gt;

&lt;p&gt;Pass image URLs alongside text in the content array — no separate endpoint needed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meta-llama/Llama-4-Maverick-17B-128E-Instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example.com/architecture-diagram.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What components does this architecture diagram show?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Early fusion means the model handles interleaved text and images in a single pass with no additional latency overhead from a separate vision model.&lt;/p&gt;




&lt;h2&gt;
  
  
  API Providers (No Self-Hosting Required)
&lt;/h2&gt;

&lt;p&gt;If 8× H100s are not in your budget, four managed providers offer production Maverick access:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Input (per M tokens)&lt;/th&gt;
&lt;th&gt;Output (per M tokens)&lt;/th&gt;
&lt;th&gt;Max Context&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Groq&lt;/td&gt;
&lt;td&gt;$0.50&lt;/td&gt;
&lt;td&gt;$0.77&lt;/td&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;Lowest latency, dev/test&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fireworks AI&lt;/td&gt;
&lt;td&gt;$0.22&lt;/td&gt;
&lt;td&gt;$0.88&lt;/td&gt;
&lt;td&gt;1M&lt;/td&gt;
&lt;td&gt;Production, full context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Together AI&lt;/td&gt;
&lt;td&gt;$0.27&lt;/td&gt;
&lt;td&gt;$0.85&lt;/td&gt;
&lt;td&gt;1M&lt;/td&gt;
&lt;td&gt;Inference + fine-tuning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenRouter&lt;/td&gt;
&lt;td&gt;$0.15&lt;/td&gt;
&lt;td&gt;$0.60&lt;/td&gt;
&lt;td&gt;1M&lt;/td&gt;
&lt;td&gt;Cost-optimized routing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Groq&lt;/strong&gt; uses LPU inference hardware that delivers near-instant token generation for short responses. The 128K context cap is the main constraint — workloads requiring long context need a different provider.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fireworks AI&lt;/strong&gt; was among the first to serve Maverick (minutes after Meta published the weights) and has a well-tested production configuration. Full 1M context, competitive pricing, strong reliability track record.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Together AI&lt;/strong&gt; is the best pick if you plan to fine-tune Maverick later. They offer both inference and fine-tuning on the same platform, so you keep the full model lifecycle in one place.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenRouter&lt;/strong&gt; routes requests dynamically to the cheapest available backend. Lowest floor price, but you sacrifice control over which provider handles each request.&lt;/p&gt;

&lt;p&gt;All four expose OpenAI-compatible endpoints. Change &lt;code&gt;base_url&lt;/code&gt; and &lt;code&gt;api_key&lt;/code&gt; and your existing code runs without modification.&lt;/p&gt;
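&lt;p&gt;In practice the swap looks like this. The base URLs below reflect each provider's OpenAI-compatible endpoint at the time of writing; double-check them, and treat the model identifier as a placeholder to confirm against the provider's model catalog:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Same client code, different provider: only base_url, api_key, and the
# provider-specific model ID change. The model ID below is a placeholder.
import os
from openai import OpenAI

PROVIDERS = {
    "groq":       {"base_url": "https://api.groq.com/openai/v1",        "key_env": "GROQ_API_KEY"},
    "fireworks":  {"base_url": "https://api.fireworks.ai/inference/v1", "key_env": "FIREWORKS_API_KEY"},
    "openrouter": {"base_url": "https://openrouter.ai/api/v1",          "key_env": "OPENROUTER_API_KEY"},
}

def make_client(name):
    cfg = PROVIDERS[name]
    return OpenAI(base_url=cfg["base_url"], api_key=os.environ[cfg["key_env"]])

client = make_client("fireworks")
response = client.chat.completions.create(
    model="accounts/fireworks/models/llama4-maverick-instruct-basic",  # placeholder ID
    messages=[{"role": "user", "content": "Summarize this deployment guide in one line."}],
)
print(response.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;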




&lt;h2&gt;
  
  
  Benchmarks
&lt;/h2&gt;

&lt;p&gt;Meta released Llama 4 Maverick on April 5, 2025. Here are the official benchmark scores from Meta's release announcement (0-shot, temperature=0, no majority voting):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Llama 4 Maverick&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MMLU Pro&lt;/td&gt;
&lt;td&gt;80.5&lt;/td&gt;
&lt;td&gt;Knowledge reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPQA Diamond&lt;/td&gt;
&lt;td&gt;69.8&lt;/td&gt;
&lt;td&gt;Graduate-level science&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HumanEval&lt;/td&gt;
&lt;td&gt;82.4%&lt;/td&gt;
&lt;td&gt;Code generation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DocVQA (test)&lt;/td&gt;
&lt;td&gt;94.4&lt;/td&gt;
&lt;td&gt;Document understanding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ChartQA&lt;/td&gt;
&lt;td&gt;90.0&lt;/td&gt;
&lt;td&gt;Chart comprehension&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MMMU&lt;/td&gt;
&lt;td&gt;73.4&lt;/td&gt;
&lt;td&gt;Multimodal understanding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multilingual MMLU&lt;/td&gt;
&lt;td&gt;84.6&lt;/td&gt;
&lt;td&gt;12-language coverage&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At launch, Maverick outperformed GPT-4o and Gemini 2.0 Flash on the multimodal benchmarks (DocVQA, ChartQA, MMMU) while scoring comparably to DeepSeek V3 on coding. In 2026, newer frontier models (GPT-5.4, Gemini 3.1 Pro) have raised the ceiling on reasoning benchmarks, but Maverick's multimodal document understanding scores remain competitive with commercial APIs.&lt;/p&gt;

&lt;p&gt;The comparison that matters for most cost-conscious teams: at $0.50/$0.77 on Groq, Maverick delivers strong multimodal capability at 9–23x lower per-token cost than comparable proprietary alternatives. If your workload is document processing, visual analysis, or multilingual generation rather than cutting-edge math competitions, Maverick's price-performance ratio is difficult to match with any closed-source model.&lt;/p&gt;




&lt;h2&gt;
  
  
  Common Mistakes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Loading all weights onto one GPU&lt;/strong&gt;: 400B parameters never fit a single GPU at any practical precision. Always specify &lt;code&gt;--tensor-parallel-size&lt;/code&gt; matching your GPU count.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Confusing Maverick and Scout context windows&lt;/strong&gt;: Scout reaches 10 million tokens; Maverick reaches 1 million. If you need to ingest multi-million-token codebases, use Scout. For most production workloads under 500K tokens, Maverick's stronger reasoning quality wins.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skipping FP8 without benchmarking&lt;/strong&gt;: FP8 Maverick on 4× H100 80GB achieves near-identical MMLU and HumanEval scores versus BF16 for most tasks. Defaulting to BF16 without checking doubles your hardware cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Underestimating KV cache memory&lt;/strong&gt;: A 400K token context in BF16 requires roughly 160GB of KV cache alone, on top of model weights. Budget VRAM with KV cache included, not just model size.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Omitting &lt;code&gt;--trust-remote-code&lt;/code&gt;&lt;/strong&gt;: Maverick's custom routing code will not load without this flag. The server silently fails model initialization if you forget it.&lt;/p&gt;




&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Q: Does Llama 4 Maverick support function calling?
&lt;/h3&gt;

&lt;p&gt;Yes. The Instruct variant was trained with tool use. The format mirrors the OpenAI function calling spec — pass a &lt;code&gt;tools&lt;/code&gt; array in the chat completion request. Groq, Fireworks AI, and Together AI all support it in their Maverick deployments.&lt;/p&gt;
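&lt;p&gt;A minimal sketch of the request shape, reusing the OpenAI-compatible &lt;code&gt;client&lt;/code&gt; from the vLLM example earlier; the &lt;code&gt;get_weather&lt;/code&gt; tool is a made-up example for illustration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Function calling via the OpenAI-compatible tools array.
# The get_weather tool is a made-up example for illustration.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct",
    messages=[{"role": "user", "content": "What is the weather in Seoul?"}],
    tools=tools,
)

tool_calls = response.choices[0].message.tool_calls
if tool_calls:
    print(tool_calls[0].function.name, tool_calls[0].function.arguments)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;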

&lt;h3&gt;
  
  
  Q: What's the practical difference between Llama 4 Scout and Maverick?
&lt;/h3&gt;

&lt;p&gt;Scout has 16 experts (Maverick has 128), a 10M token context window (Maverick has 1M), and runs on a single H100 80GB node. Maverick scores 4–8% higher on reasoning and coding benchmarks. Use Scout when context length is the bottleneck; use Maverick when quality is.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Can I fine-tune Llama 4 Maverick?
&lt;/h3&gt;

&lt;p&gt;Yes, under the Llama 4 Community License. Fine-tuning with LoRA at FP8 requires at least 4× H100 80GB. Together AI and Fireworks AI both offer managed fine-tuning pipelines. A full fine-tune on a 50K sample dataset typically takes 3–7 days on that configuration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Is Llama 4 Maverick production-ready in 2026?
&lt;/h3&gt;

&lt;p&gt;Yes. Fireworks AI reports Maverick as one of their five most-requested models. Multiple organizations are running it in production. The main operational concern is node-level failure: if one H100 in an 8-GPU setup fails, the full model instance goes down. Plan for multi-node redundancy if you have uptime SLA requirements.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: How does Maverick perform on long-context tasks?
&lt;/h3&gt;

&lt;p&gt;At 1M token context, retrieval quality degrades noticeably past ~400K tokens on needle-in-a-haystack benchmarks. For RAG workloads, semantic retrieval to a 200K or smaller context window still outperforms stuffing 800K raw tokens into the model. Use full 1M context for tasks where the model needs to reason across the entire corpus, not for general production RAG.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Architecture&lt;/strong&gt;: 400B total / 17B active parameters, 128 experts, alternating dense and MoE layers. Early fusion handles text and images in a single model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context window&lt;/strong&gt;: 1M tokens for Maverick (Scout is the 10M variant).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hardware floor&lt;/strong&gt;: 4× H100 80GB at FP8, 8× H100 80GB at BF16. Multi-GPU is mandatory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vLLM setup&lt;/strong&gt;: v0.8.3 or later, &lt;code&gt;--tensor-parallel-size&lt;/code&gt;, &lt;code&gt;--trust-remote-code&lt;/code&gt; required.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Managed APIs&lt;/strong&gt;: Groq for speed, Fireworks and Together for full 1M context, OpenRouter for lowest cost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benchmark position&lt;/strong&gt;: Beats GPT-4o on multimodal document tasks; slightly behind on pure math and coding. 9–23x cheaper per token on managed APIs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Bottom Line&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Llama 4 Maverick is the strongest open-weight multimodal model available and the only practical choice for teams that need frontier-class performance with full data ownership. The 4× H100 FP8 entry point is real infrastructure investment, but it buys you a model that beats GPT-4o on document understanding at a fraction of the API cost. If self-hosting is off the table, Fireworks AI or Groq give you managed access at 9–23x lower per-token pricing than comparable proprietary models.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Prefer a deep-dive walkthrough? &lt;a href="https://www.youtube.com/watch?v=FgMfZt-sOOo" rel="noopener noreferrer"&gt;Watch the full video on YouTube&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llama4</category>
      <category>metaai</category>
      <category>opensourcellm</category>
      <category>mixtureofexperts</category>
    </item>
    <item>
      <title>Databricks Unity AI Gateway: MCP Agent Governance Guide</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Thu, 23 Apr 2026 08:26:31 +0000</pubDate>
      <link>https://dev.to/jangwook_kim_e31e7291ad98/databricks-unity-ai-gateway-mcp-agent-governance-guide-4d48</link>
      <guid>https://dev.to/jangwook_kim_e31e7291ad98/databricks-unity-ai-gateway-mcp-agent-governance-guide-4d48</guid>
      <description>&lt;p&gt;Enterprise AI adoption has hit a governance wall. Organizations that rushed to deploy LLM-powered applications now face an uncomfortable reality: dozens of agents making API calls across multiple providers, MCP servers accessing sensitive data without proper audit trails, and no unified way to track what any of it costs. Databricks calls this "agent sprawl," and in April 2026 they shipped a direct answer: Unity AI Gateway.&lt;/p&gt;

&lt;p&gt;This guide covers what Unity AI Gateway actually does, how its MCP governance model works in practice, and where it fits in the broader enterprise AI infrastructure stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters: The Agent Sprawl Problem
&lt;/h2&gt;

&lt;p&gt;The shift to agentic AI workflows created a governance gap that earlier tooling wasn't designed to handle. A single production agent might:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Call three different LLM providers in the same session&lt;/li&gt;
&lt;li&gt;Invoke five external MCP servers to access Slack, GitHub, and internal databases&lt;/li&gt;
&lt;li&gt;Run as a shared service account with broader permissions than any human would be granted&lt;/li&gt;
&lt;li&gt;Generate costs that get attributed to a catch-all "AI budget" line item&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Traditional cloud IAM controls weren't designed for this pattern. You can restrict what a service account can do at the infrastructure level, but you can't easily say "this agent can use Claude for reasoning tasks but must route code generation to GPT-6, and can only access the GitHub MCP server if the requesting user has write access to that repo."&lt;/p&gt;

&lt;p&gt;That's the problem Unity AI Gateway is designed to solve—not by adding another governance layer on top of your existing stack, but by extending the Unity Catalog permission model you may already use for data governance directly into your AI layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Unity AI Gateway?
&lt;/h2&gt;

&lt;p&gt;Unity AI Gateway is the AI governance component of Databricks' Unity Catalog, extended to cover LLM endpoints, MCP servers, and coding agents. It was previously branded as Mosaic AI Gateway—the April 2026 rename to "Unity AI Gateway" signals the deeper integration with Unity Catalog's existing access control and audit infrastructure.&lt;/p&gt;

&lt;p&gt;The core architecture positions AI Gateway as a proxy layer that sits between your agents and the external systems they call. Every request—whether it's an LLM completion from Anthropic's API or a tool call to a GitHub MCP server—passes through AI Gateway, where it's evaluated against access policies, monitored for compliance, and logged to a centralized audit table.&lt;/p&gt;

&lt;p&gt;From a developer perspective, this is similar to how API gateways work in microservices architectures, but with two enterprise-specific additions: identity propagation (so the gateway knows &lt;em&gt;who&lt;/em&gt; initiated the request, not just &lt;em&gt;which service&lt;/em&gt; is making it) and Unity Catalog integration (so permissions are expressed in the same terms your data teams already use).&lt;/p&gt;

&lt;h2&gt;
  
  
  MCP Governance: The Key Differentiator
&lt;/h2&gt;

&lt;p&gt;The most significant April 2026 addition is first-class MCP server governance. Model Context Protocol has gone from a research curiosity to standard infrastructure—97 million monthly SDK downloads as of March 2026—and most enterprise AI deployments now involve agents that use MCP servers to access internal systems.&lt;/p&gt;

&lt;p&gt;The problem is that MCP servers are typically authenticated with service account credentials, which means every agent that connects gets the same access level regardless of who initiated the request. An agent helping a junior analyst might access the same financial data that a senior analyst would.&lt;/p&gt;

&lt;p&gt;Unity AI Gateway addresses this with &lt;strong&gt;on-behalf-of (OBO) execution&lt;/strong&gt;: when an agent calls an MCP server through AI Gateway, the server receives the &lt;em&gt;requesting user's&lt;/em&gt; identity and permissions, not the agent's service account. The MCP server then enforces Unity Catalog permissions based on that user identity.&lt;/p&gt;

&lt;p&gt;Every MCP server accessible through the workspace is registered in Unity Catalog as a catalog object. This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Discovery&lt;/strong&gt;: Teams can browse available MCP servers in the same interface they use to find datasets and tables.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Access control&lt;/strong&gt;: Admins grant or revoke MCP server access with the same &lt;code&gt;GRANT&lt;/code&gt; and &lt;code&gt;REVOKE&lt;/code&gt; syntax used for table permissions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit logging&lt;/strong&gt;: Every MCP call logs the requesting identity, connection name, HTTP method, and OBO status to a centralized audit table queryable via SQL.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This last point matters more than it might seem. When your compliance team asks "which agents accessed the customer data MCP server last quarter, and on whose behalf?", the answer is a SQL query rather than a multi-week log analysis project.&lt;/p&gt;

&lt;h3&gt;
  
  
  Managed vs. External MCP Servers
&lt;/h3&gt;

&lt;p&gt;Databricks distinguishes between two server types:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Managed MCP servers&lt;/strong&gt; are hosted by Databricks and pre-integrated with Unity Catalog. The initial set includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Genie&lt;/strong&gt;: Natural language queries against your Databricks data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vector Search&lt;/strong&gt;: Semantic retrieval from indexed documents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UC Functions&lt;/strong&gt;: Custom tools registered as Unity Catalog functions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DBSQL&lt;/strong&gt;: Direct SQL execution against Unity Catalog tables&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Managed servers inherit Unity Catalog permissions automatically—there's no additional configuration needed to enforce row-level security or column masking policies that already exist on your tables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;External MCP servers&lt;/strong&gt; are third-party or self-hosted (GitHub, Slack, internal APIs). These are registered in Unity Catalog with a connection definition, and AI Gateway applies OBO auth when routing requests to them. Unity Catalog permissions control which users and service principals can access each external server.&lt;/p&gt;

&lt;h2&gt;
  
  
  LLM Safeguards: Beyond Simple Rate Limiting
&lt;/h2&gt;

&lt;p&gt;AI Gateway's guardrail system has expanded significantly in 2026. The current feature set covers:&lt;/p&gt;

&lt;h3&gt;
  
  
  Rate Limiting
&lt;/h3&gt;

&lt;p&gt;Rate limits apply at three granularities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Endpoint level&lt;/strong&gt;: Maximum requests per minute across all callers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User level&lt;/strong&gt;: Per-identity limits to prevent runaway costs from a single misconfigured agent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Group level&lt;/strong&gt;: Department or team-scoped budgets enforced at the request layer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When a request exceeds a rate limit, it receives a 429 response. Other agents sharing the endpoint are unaffected.&lt;/p&gt;
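&lt;p&gt;On the client side, a plain exponential backoff loop is usually enough to absorb those 429s. The sketch below is generic code against any OpenAI-compatible endpoint, not a Databricks-specific API, and the base URL is a placeholder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Generic client-side handling for 429s from a rate-limited gateway endpoint.
# Works against any OpenAI-compatible endpoint; the base_url is a placeholder.
import time
from openai import OpenAI, RateLimitError

client = OpenAI(base_url="https://YOUR-WORKSPACE-URL/serving-endpoints", api_key="...")  # placeholder

def complete_with_backoff(messages, model, max_retries=5):
    delay = 1.0
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except RateLimitError:
            time.sleep(delay)   # wait, then retry with exponential backoff
            delay = delay * 2
    raise RuntimeError("still rate-limited after retries; check the per-user limits on the endpoint")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;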

&lt;h3&gt;
  
  
  Automatic Failover
&lt;/h3&gt;

&lt;p&gt;AI Gateway supports multi-model endpoints where multiple LLM providers are listed in priority order. When the primary model returns a 429 (rate limited) or 5XX (server error), the gateway automatically routes to the next listed model—no application code changes needed.&lt;/p&gt;

&lt;p&gt;This is useful for reliability, but it's also a cost optimization mechanism: you can list an expensive frontier model first and a faster, cheaper model as fallback, catching cases where the premium model is unavailable rather than failing the request entirely.&lt;/p&gt;

&lt;h3&gt;
  
  
  LLM-Judge Guardrails
&lt;/h3&gt;

&lt;p&gt;The guardrail system uses an LLM-judge approach—configurable with custom models and prompts—to enforce policies that can't be expressed as simple rules. Available checks include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PII detection and redaction&lt;/strong&gt;: Identify and mask personal information in inputs or outputs before logging&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content safety&lt;/strong&gt;: Block or flag outputs that violate configured policies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt injection defense&lt;/strong&gt;: Detect attempts to override system instructions through user input&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data exfiltration prevention&lt;/strong&gt;: Flag requests that appear to be extracting bulk data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hallucination checks&lt;/strong&gt;: Evaluate output confidence against retrieved context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each guardrail is independently configurable. Violations result in request rejection or data masking, and all enforcement actions are logged. You can run guardrails on input, output, or both.&lt;/p&gt;

&lt;h2&gt;
  
  
  End-to-End Observability with MLflow Tracing
&lt;/h2&gt;

&lt;p&gt;Governance without observability is incomplete—you need to know not just &lt;em&gt;whether&lt;/em&gt; your policies are enforced but &lt;em&gt;what your agents are actually doing&lt;/em&gt; at execution time. MLflow Tracing provides the second half of this picture.&lt;/p&gt;

&lt;p&gt;When an agent runs through Databricks, MLflow automatically captures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LLM calls&lt;/strong&gt;: Model, prompt, response, token count, latency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP tool calls&lt;/strong&gt;: Which server, which tool, inputs and outputs, execution time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent reasoning steps&lt;/strong&gt;: The sequence of decisions that led to each tool call&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval operations&lt;/strong&gt;: Documents fetched, similarity scores, chunk boundaries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This trace data is OpenTelemetry-compatible, so it flows naturally into existing observability infrastructure. The Unity Catalog audit logs and MLflow traces complement each other: audit logs answer security and compliance questions ("who accessed what?"), while traces answer debugging and performance questions ("why did this agent make that tool call?").&lt;/p&gt;
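
&lt;p&gt;Getting traces flowing takes very little code. Here is a minimal sketch of enabling tracing in an agent process that calls an OpenAI-compatible endpoint (which is how AI Gateway is exposed); the experiment path and the wrapped function are illustrative, not part of any Databricks example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import mlflow

# Record traces to an MLflow experiment (the path here is just an example).
mlflow.set_experiment("/Shared/agent-traces")

# Auto-capture every chat.completions call made through the OpenAI client
# as a span: model, prompt, response, token counts, latency.
mlflow.openai.autolog()

# Wrap your own steps so tool calls show up in the same trace tree.
@mlflow.trace(name="lookup_customer")
def lookup_customer(customer_id: str) -&gt; dict:
    # ... call an MCP tool or internal API here ...
    return {"customer_id": customer_id, "tier": "enterprise"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;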

&lt;h3&gt;
  
  
  Cost Attribution
&lt;/h3&gt;

&lt;p&gt;One of the more practical capabilities is request tagging for cost attribution. Teams can attach custom tags to requests—project code, team name, user ID, deployment environment—and the system aggregates costs by those dimensions in Unity Catalog system tables.&lt;/p&gt;

&lt;p&gt;This moves AI spend from a catch-all line item to something your finance team can actually work with. Product teams can see their LLM costs broken down by feature. Platform teams can identify which agents are consuming disproportionate resources. Budget alerts can trigger at the team or project level rather than only at the account level.&lt;/p&gt;
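
&lt;p&gt;As a sketch of what that looks like in a Databricks notebook (where &lt;code&gt;spark&lt;/code&gt; and &lt;code&gt;display&lt;/code&gt; are predefined), the query below aggregates DBU consumption by a hypothetical &lt;code&gt;project&lt;/code&gt; tag from the billing system table; adjust the tag key to whatever convention your team adopts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Aggregate DBU usage by a custom "project" tag (illustrative tag key).
# system.billing.usage exposes custom tags as a map column.
usage_by_project = spark.sql("""
    SELECT
        custom_tags['project'] AS project,
        sku_name,
        SUM(usage_quantity)    AS dbus
    FROM system.billing.usage
    WHERE usage_date &gt;= '2026-04-01'
    GROUP BY custom_tags['project'], sku_name
    ORDER BY dbus DESC
""")
display(usage_by_project)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;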

&lt;p&gt;The DBU rate for Foundation Model API workloads starts at approximately $0.07 per DBU in the 2026 pricing, but the more significant value is the attribution clarity rather than the rate itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Setup: Adding MCP Governance to an Existing Agent
&lt;/h2&gt;

&lt;p&gt;Here's how the integration works in practice for a team that already has a Databricks workspace and wants to add governance to an agent that calls external MCP servers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Register External MCP Servers
&lt;/h3&gt;

&lt;p&gt;External MCP servers are registered as Unity Catalog connections, using SQL, the Databricks UI, or Terraform:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;CONNECTION&lt;/span&gt; &lt;span class="n"&gt;github_mcp&lt;/span&gt;
&lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="n"&gt;HTTP&lt;/span&gt;
&lt;span class="k"&gt;OPTIONS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;host&lt;/span&gt; &lt;span class="s1"&gt;'https://api.github.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;port&lt;/span&gt; &lt;span class="s1"&gt;'443'&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;USAGE&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;CONNECTION&lt;/span&gt; &lt;span class="n"&gt;github_mcp&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="nv"&gt;`data-engineering-team`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once registered, the server appears in the MCP Servers tab of the Agents workspace and is discoverable by other teams.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Configure AI Gateway on LLM Endpoints
&lt;/h3&gt;

&lt;p&gt;Enable AI Gateway on a serving endpoint through the UI or API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;databricks.sdk&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;WorkspaceClient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;databricks.sdk.service.serving&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;AiGatewayConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;AiGatewayGuardrails&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;AiGatewayRateLimit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;AiGatewayUsageTrackingConfig&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;WorkspaceClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;serving_endpoints&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put_ai_gateway&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production-llm-endpoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ai_gateway&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;AiGatewayConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;usage_tracking_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;AiGatewayUsageTrackingConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;rate_limits&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="nc"&gt;AiGatewayRateLimit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;calls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;renewal_period&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;minute&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;guardrails&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;AiGatewayGuardrails&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;input_safety&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;pii_detection&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Route Agent Traffic Through AI Gateway
&lt;/h3&gt;

&lt;p&gt;Agents call the AI Gateway endpoint rather than provider APIs directly. The endpoint URL is OpenAI-compatible, so most frameworks require only a base URL change:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;databricks_token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;workspace_host&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/serving-endpoints/production-llm-endpoint/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# All requests now flow through AI Gateway
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;databricks-claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize Q1 sales data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Query Audit Logs
&lt;/h3&gt;

&lt;p&gt;Audit data lands in Unity Catalog system tables, queryable via standard SQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;user_identity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mcp_connection_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;on_behalf_of_user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;response_status_code&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ai_gateway&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mcp_audit_logs&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'2026-04-01'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;mcp_connection_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'github_mcp'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Common Mistakes to Avoid
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Using service accounts for all agent traffic&lt;/strong&gt;: OBO auth only works if agents pass user identity through the request chain. If your agent framework authenticates with a shared service account and doesn't propagate user context, all MCP calls will appear to originate from that account in the audit logs. Check that your agent framework supports identity forwarding before deploying.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configuring guardrails in blocking mode without testing&lt;/strong&gt;: LLM-judge guardrails have non-zero latency and false positive rates. Start guardrails in monitoring mode to understand the false positive rate on your actual traffic before switching to blocking mode in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skipping rate limit configuration for internal tools&lt;/strong&gt;: Teams often configure rate limits on external-facing endpoints but skip them for internal tools. A misconfigured internal agent can generate the same runaway costs—set limits everywhere, not just at the perimeter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Over-permissioning managed MCP servers&lt;/strong&gt;: The convenience of managed servers can lead to blanket grants ("grant &lt;code&gt;data-team&lt;/code&gt; access to all MCP servers") instead of the principle of least privilege. Audit which servers each team actually uses and grant accordingly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not tagging requests for cost attribution&lt;/strong&gt;: Tags need to be set when the request is made—retroactive attribution isn't possible. Establish a tagging convention at project start, not after the first billing surprise.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bottom Line
&lt;/h2&gt;

&lt;p&gt;Unity AI Gateway is the most complete enterprise AI governance platform available in 2026 for teams already on Databricks. The Unity Catalog integration means you're not adding a separate permission system—you're extending existing data governance to cover LLM calls and MCP tools. For organizations outside the Databricks ecosystem, the switching cost is high; alternatives like LiteLLM or Cloudflare AI Gateway provide a subset of the governance features without the platform lock-in.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Compares to Alternatives
&lt;/h2&gt;

&lt;p&gt;Enterprise teams evaluating AI governance platforms typically consider three options besides Unity AI Gateway:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LiteLLM&lt;/strong&gt; is the open-source alternative—140+ provider support, budget management, semantic caching, and self-hosted deployment. It lacks Unity Catalog integration and the OBO auth model for MCP servers, but it's a strong choice for teams that need multi-cloud LLM routing without vendor lock-in. We covered &lt;a href="https://dev.to/articles/litellm-ai-gateway-llm-proxy-guide-2026"&gt;LiteLLM's setup in detail here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloudflare AI Gateway&lt;/strong&gt; handles edge caching, rate limiting, and spend controls at the CDN layer—zero code changes for basic observability. The governance model is simpler (no per-user identity propagation), making it better suited for customer-facing applications than internal agent workflows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Native provider controls&lt;/strong&gt; (Anthropic's system prompt policies, OpenAI's organization settings) provide some guardrails but don't unify multi-provider deployments and don't address the MCP governance problem at all.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Unity AI Gateway&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;th&gt;Cloudflare AI Gateway&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Multi-provider LLM routing&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes (140+ providers)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCP server governance&lt;/td&gt;
&lt;td&gt;Yes (OBO auth)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per-user rate limiting&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM-judge guardrails&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;End-to-end traces&lt;/td&gt;
&lt;td&gt;Yes (MLflow)&lt;/td&gt;
&lt;td&gt;Yes (Langfuse/Helicone)&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unity Catalog integration&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-host option&lt;/td&gt;
&lt;td&gt;No (Databricks managed)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best for&lt;/td&gt;
&lt;td&gt;Databricks-native enterprises&lt;/td&gt;
&lt;td&gt;Multi-cloud / open-source&lt;/td&gt;
&lt;td&gt;Edge / customer-facing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Q: Does Unity AI Gateway work with agents built outside Databricks?
&lt;/h3&gt;

&lt;p&gt;Yes—any agent that can make HTTP requests to an OpenAI-compatible endpoint can route through AI Gateway. The gateway doesn't require Databricks-native agent frameworks. Identity propagation for OBO auth requires passing a user token in the request header, which most frameworks support via custom headers.&lt;/p&gt;
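
&lt;p&gt;For example, with the OpenAI Python client you can forward a user token through &lt;code&gt;default_headers&lt;/code&gt;. The header name below is a placeholder for illustration; check the Databricks documentation for the exact header your workspace expects for OBO forwarding:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from openai import OpenAI

# Hypothetical header name for illustration only -- the real OBO header
# is defined by Databricks, not by this snippet.
# end_user_token: the requesting user's token, obtained from your app's auth flow.
client = OpenAI(
    api_key=databricks_token,
    base_url=f"https://{workspace_host}/serving-endpoints/production-llm-endpoint/v1",
    default_headers={"X-Forwarded-Access-Token": end_user_token},
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;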

&lt;h3&gt;
  
  
  Q: How does OBO auth work when an agent initiates a multi-step workflow without a human in the loop?
&lt;/h3&gt;

&lt;p&gt;For fully automated workflows without an active user session, OBO auth falls back to the service principal identity of the agent. The audit log records this as a service principal call rather than an end-user call. If your compliance requirements mandate user-level attribution for automated workflows, you'll need to either redesign the workflow to include human approval steps or accept service principal attribution for background tasks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Can I use Unity AI Gateway with Claude Code or Cursor?
&lt;/h3&gt;

&lt;p&gt;Yes, as of April 2026, AI Gateway explicitly supports coding agent governance. The "Governing Coding Agent Sprawl" blog post from Databricks covers this use case in detail—you can route Claude Code and Cursor traffic through AI Gateway to enforce the same policies applied to other agents in your workspace.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: What's the latency overhead of routing through AI Gateway?
&lt;/h3&gt;

&lt;p&gt;Databricks hasn't published precise latency benchmarks for AI Gateway, but in practice the proxy layer adds single-digit milliseconds of overhead for policy evaluation on cached decisions. Guardrail evaluation—particularly LLM-judge checks—adds meaningful latency proportional to the complexity of the check. For latency-sensitive applications, configure guardrails in async monitoring mode rather than synchronous blocking mode.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Is there a free tier for experimenting with Unity AI Gateway?
&lt;/h3&gt;

&lt;p&gt;Unity AI Gateway is available on all Databricks workspace tiers, including the free trial. The Foundation Model API, which is the primary LLM endpoint type, charges at DBU rates that start around $0.07 per DBU. External provider pass-through endpoints (where you bring your own API key for Anthropic, OpenAI, etc.) incur DBU charges for the gateway itself but you pay the provider directly for model usage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Unity AI Gateway is Unity Catalog's governance layer extended to LLM endpoints and MCP servers—the same permissions and audit infrastructure, applied to AI.&lt;/li&gt;
&lt;li&gt;MCP governance is the standout April 2026 addition: every MCP server is a Unity Catalog object with fine-grained permissions and full audit logging, with OBO auth ensuring agents act with the requesting user's identity rather than a shared service account.&lt;/li&gt;
&lt;li&gt;Rate limiting (endpoint/user/group), automatic failover across providers, and LLM-judge guardrails are all configurable without application code changes.&lt;/li&gt;
&lt;li&gt;MLflow Tracing provides the debugging and performance visibility layer; Unity Catalog audit logs provide the compliance layer. They address different questions and are used together.&lt;/li&gt;
&lt;li&gt;For teams outside the Databricks ecosystem, LiteLLM covers most of the LLM routing and cost control use cases; the Unity Catalog integration is the primary reason to stay on Databricks AI Gateway specifically.&lt;/li&gt;
&lt;li&gt;Cost attribution via request tags is opt-in and must be configured before deployment—retroactive tagging isn't supported.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Prefer a deep-dive walkthrough? &lt;a href="https://www.youtube.com/watch?v=CxaEUiP4x_Y" rel="noopener noreferrer"&gt;Watch the full video on YouTube&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>databricks</category>
      <category>mcp</category>
      <category>aigovernance</category>
      <category>llmobservability</category>
    </item>
    <item>
      <title>Building a Claude Streaming Agent with Vercel AI SDK</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Thu, 23 Apr 2026 06:43:08 +0000</pubDate>
      <link>https://dev.to/jangwook_kim_e31e7291ad98/building-a-claude-streaming-agent-with-vercel-ai-sdk-5jo</link>
      <guid>https://dev.to/jangwook_kim_e31e7291ad98/building-a-claude-streaming-agent-with-vercel-ai-sdk-5jo</guid>
      <description>&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;streamText&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Hello, streaming test&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;await &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;chunk&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;textStream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first time I ran this, I had an odd reaction. Not amazement that Claude was streaming text character by character — surprise that it actually worked in a handful of lines. Compared to using the Anthropic SDK directly, it takes less than half the setup code.&lt;/p&gt;

&lt;p&gt;I first heard about the Vercel AI SDK from a link someone shared in our work Slack. It had the usual "AI chat in 10 minutes with Next.js" sort of title — the kind that normally falls apart at the dependency installation step. I tried it skeptically, and it actually worked quickly. I've been reaching for it during prototyping ever since.&lt;/p&gt;

&lt;p&gt;After using the Vercel AI SDK seriously for a while, I have a clearer picture of the tradeoffs. This post explains how to use it while being honest about where it gets complicated. If you've already used it, jump to the "Tool Calling" and "Production Considerations" sections.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Vercel AI SDK — A Direct Comparison
&lt;/h2&gt;

&lt;p&gt;I tried the alternatives first. Direct Anthropic SDK, LangChain.js, then Vercel AI SDK.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Direct Anthropic SDK&lt;/strong&gt; is the most flexible option, but the boilerplate for getting streaming responses to the frontend is more than you'd expect. You're writing SSE formatting, frontend hooks, and error handling from scratch. The underlying task is simple, but the amount of code grows out of proportion.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Direct Anthropic SDK streaming setup — gets longer than this&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;prompt&lt;/span&gt; &lt;span class="p"&gt;}],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Manually construct SSE response&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;encoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;TextEncoder&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;readable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ReadableStream&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;controller&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;await &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;chunk&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;content_block_delta&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;text_delta&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;controller&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enqueue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;encoder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`data: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;\n\n`&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nx"&gt;controller&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add the frontend parsing code and you're looking at a real volume of boilerplate. This approach makes sense when you need fine-grained control, but it's too much overhead for building a single chat app.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LangChain.js&lt;/strong&gt; — I gave up on this one. Frequent API changes between versions, documentation that doesn't match actual behavior. Digging through GitHub issues often turns up "this feature has been removed" answers. Might work for complex pipelines, but not right for fast prototyping.&lt;/p&gt;

&lt;p&gt;Vercel AI SDK's practical advantages come down to three things:&lt;/p&gt;

&lt;p&gt;First, &lt;code&gt;streamText()&lt;/code&gt; + &lt;code&gt;useChat()&lt;/code&gt; gets server-to-client streaming wired up in under 10 lines. Second, switching between Claude, OpenAI, Gemini, and Mistral is a single provider line change — which turns out to be genuinely useful for comparing model outputs on the same code. Third, &lt;code&gt;generateObject()&lt;/code&gt; with Zod schema validation gives you clean structured output handling.&lt;/p&gt;

&lt;p&gt;The downsides: it's optimized for Vercel's platform, which creates friction in other deployment environments. And when you need fine-grained control over the agent loop, you lose some flexibility compared to using the Anthropic SDK directly. More on that later.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/en/blog/en/claude-managed-agents-production-deployment-guide"&gt;Compared to building directly with Claude Managed Agents&lt;/a&gt;, Managed Agents are easier to start without infrastructure but hit customization ceilings quickly. Vercel AI SDK sits between the two — more abstracted than raw SDK, more controllable than Managed Agents.&lt;/p&gt;

&lt;h2&gt;
  
  
  Environment Setup
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Node.js 20+&lt;/li&gt;
&lt;li&gt;Anthropic API key (&lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Next.js 15 (App Router)
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# New project&lt;/span&gt;
npx create-next-app@latest my-claude-app &lt;span class="nt"&gt;--typescript&lt;/span&gt; &lt;span class="nt"&gt;--app&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;my-claude-app

&lt;span class="c"&gt;# Core AI SDK packages&lt;/span&gt;
npm &lt;span class="nb"&gt;install &lt;/span&gt;ai @ai-sdk/anthropic zod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add your API key to &lt;code&gt;.env.local&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-ant-api03-...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One thing I ran into: &lt;code&gt;@ai-sdk/anthropic&lt;/code&gt; installed fine, but a TypeScript type error appeared. The &lt;code&gt;moduleResolution&lt;/code&gt; in &lt;code&gt;tsconfig.json&lt;/code&gt; needs to be &lt;code&gt;bundler&lt;/code&gt; or &lt;code&gt;node16&lt;/code&gt; or higher. &lt;code&gt;create-next-app&lt;/code&gt; handles this in the default config.&lt;/p&gt;

&lt;p&gt;Directory structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;app/
├── api/
│   ├── chat/
│   │   └── route.ts      # Streaming chat API
│   └── extract/
│       └── route.ts      # generateObject API
├── page.tsx              # Chat UI
└── components/
    └── Message.tsx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's deliberately simple — and this structure really is sufficient for a functional chat app.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementing Claude Streaming with streamText
&lt;/h2&gt;

&lt;p&gt;Start with the server-side API route.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;app/api/chat/route.ts&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;streamText&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ai&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;anthropic&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@ai-sdk/anthropic&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;POST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;streamText&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="na"&gt;system&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`You are a helpful technical blog assistant.
Answer code questions practically, and be honest when you don't know something.
Keep responses concise but complete.`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;maxTokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2048&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toUIMessageStreamResponse&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;toUIMessageStreamResponse()&lt;/code&gt; is the key piece. This single method handles SSE header setup, chunk formatting, and stream termination. Doing this manually with the Anthropic SDK takes 20+ lines minimum.&lt;/p&gt;

&lt;p&gt;Frontend &lt;code&gt;app/page.tsx&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;use client&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;useChat&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ai/react&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;useEffect&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;useRef&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;react&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;ChatPage&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;handleInputChange&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;handleSubmit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;isLoading&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;useChat&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;api&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/api/chat&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;bottomRef&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;useRef&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;HTMLDivElement&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="nf"&gt;useEffect&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;bottomRef&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;current&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;scrollIntoView&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;behavior&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;smooth&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;

  &lt;span class="k"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt; &lt;span class="na"&gt;className&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"flex flex-col h-screen max-w-2xl mx-auto p-4"&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt; &lt;span class="na"&gt;className&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"flex-1 overflow-y-auto space-y-4 pb-4"&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
          &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;p&lt;/span&gt; &lt;span class="na"&gt;className&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"text-gray-400 text-center mt-8"&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
            How can I help you?
          &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;p&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
        &lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
          &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;
            &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
            &lt;span class="na"&gt;className&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;`p-3 rounded-lg max-w-[85%] &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;
              &lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;role&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;bg-blue-100 ml-auto&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;bg-gray-100&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
          &lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
            &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;p&lt;/span&gt; &lt;span class="na"&gt;className&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"text-xs text-gray-400 mb-1"&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
              &lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;role&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;You&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Claude&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;p&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
            &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt; &lt;span class="na"&gt;className&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"whitespace-pre-wrap text-sm"&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
              &lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
          &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
        &lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;isLoading&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
          &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt; &lt;span class="na"&gt;className&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"bg-gray-100 p-3 rounded-lg text-gray-400 text-sm"&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
            Claude is typing...
          &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
        &lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
          &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt; &lt;span class="na"&gt;className&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"text-red-500 text-sm p-2 bg-red-50 rounded"&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
            Error: &lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
          &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt; &lt;span class="na"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;bottomRef&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt; &lt;span class="p"&gt;/&amp;gt;&lt;/span&gt;
      &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;

      &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;form&lt;/span&gt; &lt;span class="na"&gt;onSubmit&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;handleSubmit&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt; &lt;span class="na"&gt;className&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"flex gap-2 mt-4 border-t pt-4"&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;input&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
          &lt;span class="na"&gt;onChange&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;handleInputChange&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
          &lt;span class="na"&gt;placeholder&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Type a message..."&lt;/span&gt;
          &lt;span class="na"&gt;className&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"flex-1 border rounded-lg px-4 py-2 focus:outline-none focus:ring-2 focus:ring-blue-400"&lt;/span&gt;
          &lt;span class="na"&gt;disabled&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;isLoading&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;/&amp;gt;&lt;/span&gt;
        &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;button&lt;/span&gt;
          &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"submit"&lt;/span&gt;
          &lt;span class="na"&gt;disabled&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;isLoading&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
          &lt;span class="na"&gt;className&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"bg-blue-500 text-white px-6 py-2 rounded-lg disabled:opacity-50"&lt;/span&gt;
        &lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
          Send
        &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;button&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;form&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;useChat&lt;/code&gt; handles message state, streaming updates, loading state, and error handling. Building this from scratch means writing &lt;code&gt;useState&lt;/code&gt;, &lt;code&gt;useRef&lt;/code&gt;, &lt;code&gt;AbortController&lt;/code&gt;, SSE parsing, and retry logic.&lt;/p&gt;

&lt;p&gt;Run &lt;code&gt;npm run dev&lt;/code&gt; and the chat works at &lt;code&gt;localhost:3000&lt;/code&gt;. Claude's response streams in character by character.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tool Calling — Making Claude Actually Do Things
&lt;/h2&gt;

&lt;p&gt;To go beyond chat and let Claude call external tools, add the &lt;code&gt;tools&lt;/code&gt; option with &lt;code&gt;maxSteps&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;streamText&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;tool&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ai&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;anthropic&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@ai-sdk/anthropic&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;zod&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;POST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;streamText&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="na"&gt;system&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;You are an assistant that manages weather info and todo lists.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;maxSteps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;getWeather&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Gets the current weather for a specific city&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
          &lt;span class="na"&gt;city&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;City name to get weather for&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
          &lt;span class="na"&gt;unit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enum&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;celsius&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;fahrenheit&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;celsius&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;}),&lt;/span&gt;
        &lt;span class="na"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;unit&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="c1"&gt;// Real implementation would call a weather API&lt;/span&gt;
          &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nx"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;unit&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;celsius&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="mi"&gt;22&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;72&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Sunny&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;humidity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;65&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;feelsLike&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;unit&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;celsius&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;68&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="p"&gt;};&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;}),&lt;/span&gt;
      &lt;span class="na"&gt;addTodo&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Adds a new item to the todo list&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
          &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Todo title&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
          &lt;span class="na"&gt;priority&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enum&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;low&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;medium&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;high&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;medium&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
          &lt;span class="na"&gt;dueDate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;optional&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Due date (YYYY-MM-DD)&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;}),&lt;/span&gt;
        &lt;span class="na"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;priority&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;dueDate&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;36&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
          &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;priority&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;dueDate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;created&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toISOString&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;}),&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toUIMessageStreamResponse&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;maxSteps: 5&lt;/code&gt; matters. Without it the SDK defaults to a single step: tools execute, but Claude never loops back to turn their results into a reply. Setting it higher lets the SDK drive the tool-call loop automatically, with &lt;code&gt;maxSteps&lt;/code&gt; as the hard cap on iterations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/en/blog/en/ai-agent-collaboration-patterns"&gt;The pattern of agents combining multiple tools to solve problems&lt;/a&gt; depends heavily on &lt;code&gt;maxSteps&lt;/code&gt; and the quality of each tool's &lt;code&gt;description&lt;/code&gt;. Vague descriptions mean Claude can't decide when to use which tool. In an early version of mine, weather and todo requests kept triggering the wrong tool; adding explicit usage scenarios to the system prompt (sketched below) fixed it.&lt;/p&gt;
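
&lt;p&gt;As an illustration of that fix, a hypothetical system prompt that spells out when each tool applies might look like this. The wording, the &lt;code&gt;./tools&lt;/code&gt; module, and the extracted &lt;code&gt;tools&lt;/code&gt; object are my assumptions, not a prescribed format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical sketch: the same chat route, with per-tool usage scenarios in the system prompt.
import { streamText } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';

// Assumed: `tools` bundles the getWeather / addTodo definitions shown above,
// moved into their own module so the prompt and the tools stay side by side.
import { tools } from './tools';

const system = [
  'You are an assistant that manages weather info and todo lists.',
  'Use getWeather ONLY when the user asks about current weather in a specific city',
  '(e.g. "How hot is it in Osaka right now?").',
  'Use addTodo ONLY when the user asks to remember, track, or schedule a task',
  '(e.g. "Remind me to renew my passport by Friday").',
  'If a request fits neither scenario, answer directly without calling a tool.',
].join('\n');

export async function POST(req: Request) {
  const { messages } = await req.json();

  const result = streamText({
    model: anthropic('claude-sonnet-4-6'),
    system,
    messages,
    maxSteps: 5,
    tools,
  });

  return result.toUIMessageStreamResponse();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;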

&lt;p&gt;Real-time tool call progress in the frontend:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;role&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;assistant&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;
      &lt;span class="nb"&gt;Array&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isArray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;toolInvocations&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;
      &lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;toolInvocations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;ti&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;ti&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;toolCallId&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt; &lt;span class="na"&gt;className&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"text-xs text-gray-400 italic mb-1"&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
          &lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;ti&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;call&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="s2"&gt;`⚙ Calling &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;ti&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;toolName&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;...`&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
          &lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;ti&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;result&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="s2"&gt;`✓ &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;ti&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;toolName&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; complete`&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="si"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt; &lt;span class="na"&gt;className&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"text-sm"&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;))}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Extracting Structured Output with generateObject
&lt;/h2&gt;

&lt;p&gt;Separate from streaming chat, use &lt;code&gt;generateObject()&lt;/code&gt; when you need to pull specific structured data out of Claude's response.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// app/api/extract/route.ts&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;generateObject&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ai&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;anthropic&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@ai-sdk/anthropic&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;zod&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ArticleMetaSchema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Post title (under 60 chars)&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;3-4 sentence summary&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Related technology tags&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;difficulty&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enum&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;beginner&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;intermediate&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;advanced&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
  &lt;span class="na"&gt;estimatedReadTime&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;number&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Estimated read time in minutes&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;hasCodeExamples&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;boolean&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
  &lt;span class="na"&gt;mainTopics&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Up to 3 main topics&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;POST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;content&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;object&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;generateObject&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ArticleMetaSchema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Analyze the following technical blog post and extract metadata.
Primary readers are backend and fullstack developers.

---
&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;
---`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;object&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Analysis failed&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response comes back as a type-safe object matching the Zod schema. No JSON parse errors or type mismatches to handle separately.&lt;/p&gt;

&lt;p&gt;This pattern works well for: automated blog post tagging, user input classification, structured information extraction from documents, and form auto-completion.&lt;/p&gt;

&lt;p&gt;I'm using a similar pattern for this blog's category score extraction. Writing good &lt;code&gt;describe()&lt;/code&gt; text on each Zod schema field is the key to better output quality. This is &lt;a href="https://dev.to/en/blog/en/context-engineering-production-ai-agents"&gt;context engineering&lt;/a&gt; in miniature: schema design and prompt quality determine roughly 80% of the extraction accuracy.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;streamObject()&lt;/code&gt; is also available — useful when you want fields in a large schema to appear progressively in the UI without waiting for the full response.&lt;/p&gt;
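
&lt;p&gt;A minimal sketch of what that could look like, assuming the same &lt;code&gt;ArticleMetaSchema&lt;/code&gt; is exported from a shared module (the route path and the import location are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// app/api/extract-stream/route.ts  (illustrative path)
import { streamObject } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';

// Assumed: ArticleMetaSchema is the Zod schema from the extract route above,
// moved into a shared module so both routes can import it.
import { ArticleMetaSchema } from '@/lib/article-meta-schema';

export async function POST(req: Request) {
  const { content } = await req.json();

  const result = streamObject({
    model: anthropic('claude-sonnet-4-6'),
    schema: ArticleMetaSchema,
    prompt: `Analyze the following technical blog post and extract metadata.\n---\n${content}\n---`,
  });

  // Streams partial JSON as fields are filled in, so the UI can render
  // title / tags / difficulty progressively instead of waiting for the full object.
  return result.toTextStreamResponse();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;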

&lt;h2&gt;
  
  
  Production Issues I've Encountered
&lt;/h2&gt;

&lt;p&gt;After enough use, a few constraints surface.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Edge Runtime Limitations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Running on Vercel Edge Functions means no Node.js-specific packages. &lt;code&gt;@ai-sdk/anthropic&lt;/code&gt; works on Edge, but importing Node.js packages inside tool functions breaks deployment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Explicitly declare at top of route.ts&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;runtime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;nodejs&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// Use Node.js runtime, not Edge&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For most cases, setting &lt;code&gt;runtime = 'nodejs'&lt;/code&gt; is the practical choice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Serverless Timeout&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Vercel's free tier serverless function timeout is 10 seconds. Long Claude outputs or complex tool loops can exceed this. Pro tier raises it to 60 seconds.&lt;/p&gt;
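
&lt;p&gt;When your plan allows longer executions, Next.js lets you raise the per-route ceiling with the &lt;code&gt;maxDuration&lt;/code&gt; route segment config. The value below is only an example; the maximum you can actually set depends on your Vercel plan:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// app/api/chat/route.ts
// Example values; the ceiling for maxDuration depends on your Vercel plan.
export const runtime = 'nodejs';
export const maxDuration = 60; // seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;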

&lt;p&gt;For longer-running tasks, the architecture needs to change. &lt;a href="https://dev.to/en/blog/en/mcp-server-build-practical-guide-2026"&gt;Building a separate MCP server to offload long-running work&lt;/a&gt; is one approach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context Accumulation Cost&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As conversations grow, the full message history accumulates in context, and token costs climb fast. Check usage from the &lt;code&gt;streamText&lt;/code&gt; result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;streamText&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;inputCost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;promptTokens&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="nx"&gt;_000_000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;3.0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;outputCost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completionTokens&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="nx"&gt;_000_000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;15.0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Tokens: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;promptTokens&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; in, &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completionTokens&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; out`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Cost: $&lt;/span&gt;&lt;span class="p"&gt;${(&lt;/span&gt;&lt;span class="nx"&gt;inputCost&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;outputCost&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toFixed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A real service needs a context management strategy. Simplest approach: only pass the last N turns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;recentMessages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// last 10 turns each direction&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;streamText&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;recentMessages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Rate Limits&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Hitting Anthropic API rate limits returns &lt;code&gt;429 Too Many Requests&lt;/code&gt;. Multi-user environments need request queuing or backoff logic. The &lt;code&gt;ai&lt;/code&gt; package doesn't include retry logic, so you'll build it or add middleware.&lt;/p&gt;
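
&lt;p&gt;A minimal sketch of the do-it-yourself route: a small wrapper that retries on 429 with exponential backoff before giving up. The helper name, attempt limits, and error-shape checks are my assumptions, not part of the SDK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical helper: retry a call on 429 responses with exponential backoff.
async function withBackoff&amp;lt;T&amp;gt;(fn: () =&amp;gt; Promise&amp;lt;T&amp;gt;, maxAttempts = 3): Promise&amp;lt;T&amp;gt; {
  let delayMs = 1_000;
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (error: any) {
      // Error shape varies by provider; adjust the status check to what you observe.
      const status = error?.status ?? error?.statusCode;
      if (status !== 429 || attempt &amp;gt;= maxAttempts) throw error;
      await new Promise((resolve) =&amp;gt; setTimeout(resolve, delayMs));
      delayMs *= 2; // 1s, 2s, 4s, ...
    }
  }
}

// Usage with a non-streaming call; streaming traffic needs queuing further upstream.
// const { object } = await withBackoff(() =&amp;gt; generateObject({ /* ... */ }));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;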

&lt;h2&gt;
  
  
  When Should You Use This?
&lt;/h2&gt;

&lt;p&gt;Vercel AI SDK fits well when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You want to add AI chat to a Next.js app quickly&lt;/li&gt;
&lt;li&gt;You want to test Claude, OpenAI, and Gemini on the same codebase&lt;/li&gt;
&lt;li&gt;You want &lt;code&gt;useChat&lt;/code&gt; to handle frontend state management&lt;/li&gt;
&lt;li&gt;You're deploying to Vercel and the timeout limits aren't a bottleneck for your scale&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Skip it when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need full control over the agent loop internals — use Anthropic SDK directly&lt;/li&gt;
&lt;li&gt;You have a Python backend — this SDK is TypeScript only&lt;/li&gt;
&lt;li&gt;Your service baseline is long conversations with dozens of turns per user — context management becomes a significant architecture concern&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Personally, I reach for it when validating new AI feature ideas. Going from idea to working prototype in under 30 minutes is real. Where it falls short is when you need fine-grained production control — at that point, you're either using Anthropic SDK directly or building abstractions on top of the AI SDK.&lt;/p&gt;

&lt;p&gt;Vercel AI SDK made a tradeoff between convenience and flexibility. That tradeoff fits a lot of use cases, just not all of them. "Start with this, migrate when needed" is the realistic mental model.&lt;/p&gt;




&lt;p&gt;What I want to test next is the human-in-the-loop tool approval flow added in AI SDK 6. The idea is that an agent can pause before calling certain tools and wait for human approval. Whether this is reliable enough for production is something I haven't fully validated yet. Finding the right balance between fully autonomous agents and manual workflows is one of the core challenges in agent development right now.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>GitLab 18.11: Agentic AI for Security, CI, and Analytics</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Thu, 23 Apr 2026 04:22:35 +0000</pubDate>
      <link>https://dev.to/jangwook_kim_e31e7291ad98/gitlab-1811-agentic-ai-for-security-ci-and-analytics-4ncp</link>
      <guid>https://dev.to/jangwook_kim_e31e7291ad98/gitlab-1811-agentic-ai-for-security-ci-and-analytics-4ncp</guid>
      <description>&lt;p&gt;GitLab 18.11 landed on April 16, 2026, and it is the most agentic release the platform has shipped. Three separate AI agents — one for security vulnerability remediation, one for CI pipeline configuration, and one for delivery analytics — moved from concept to either general availability or public beta. If you are running GitLab on any tier and have the Duo Agent Platform enabled, at least one of these agents is available to you today.&lt;/p&gt;

&lt;p&gt;This review covers what each agent actually does, who can use it, what the limitations are, and whether the hype holds up.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Overall score: 8.4 / 10&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Criterion&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Security Automation&lt;/td&gt;
&lt;td&gt;9.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CI/CD Improvement&lt;/td&gt;
&lt;td&gt;8.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Analytics Depth&lt;/td&gt;
&lt;td&gt;8.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tier Accessibility&lt;/td&gt;
&lt;td&gt;7.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost Controls&lt;/td&gt;
&lt;td&gt;9.0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h2&gt;
  
  
  What Is GitLab 18.11?
&lt;/h2&gt;

&lt;p&gt;GitLab 18.11 is the eleventh monthly release in the GitLab 18.x series. It builds directly on the GitLab Duo Agent Platform, which reached general availability in GitLab 18.8 (January 2026). The Duo Agent Platform is the runtime layer that hosts GitLab's foundational agents — pre-built, domain-specific AI assistants that can take multi-step actions inside the GitLab platform without requiring the developer to orchestrate each step manually.&lt;/p&gt;

&lt;p&gt;Prior to 18.11, the platform had agents for planning and for security analysis (Security Analyst Agent). This release adds three more:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Agentic SAST Vulnerability Resolution&lt;/strong&gt; — automatically generates merge requests that fix confirmed security vulnerabilities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI Expert Agent&lt;/strong&gt; — proposes a complete CI/CD pipeline from a natural-language description of your project&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Analyst Agent&lt;/strong&gt; — answers natural-language questions about your delivery metrics with visual charts&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All three are backed by the same underlying agent infrastructure, which means they share the same Credits consumption model, the same IDE access points (VS Code, JetBrains, GitLab UI), and the same access controls.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agentic SAST Vulnerability Resolution
&lt;/h2&gt;

&lt;p&gt;This is the headline feature of 18.11, and it earns that billing. Agentic SAST Vulnerability Resolution is now generally available for &lt;strong&gt;GitLab Ultimate&lt;/strong&gt; customers who have the Duo Agent Platform enabled.&lt;/p&gt;

&lt;h3&gt;
  
  
  How it works
&lt;/h3&gt;

&lt;p&gt;When a SAST scan completes on the main branch, the agent:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Reviews each detected vulnerability and filters out likely false positives&lt;/li&gt;
&lt;li&gt;For confirmed High and Critical severity findings, analyzes the root cause using multi-shot reasoning&lt;/li&gt;
&lt;li&gt;Generates a context-aware code fix targeting that specific root cause&lt;/li&gt;
&lt;li&gt;Opens a merge request with the proposed fix, a confidence score, and a short explanation&lt;/li&gt;
&lt;li&gt;Runs the pipeline automatically to validate the fix resolves the issue without introducing regressions&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Developers receive a ready-to-review MR in their inbox. They can inspect the diff, see the confidence score, and merge or close it — without ever switching to a separate security dashboard.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why this matters
&lt;/h3&gt;

&lt;p&gt;The bottleneck in most DevSecOps workflows is not scanning. Scans run automatically. The bottleneck is the gap between "scan found a vulnerability" and "developer fixed it." Security teams file tickets, developers deprioritize them, and vulnerabilities sit unresolved for weeks. This agent collapses that gap by delivering the fix alongside the finding.&lt;/p&gt;

&lt;p&gt;The confidence score is an important detail. GitLab is not silently auto-merging fixes — it surfaces a signal that lets the developer make an informed call. A 92% confidence fix for a SQL injection is a different decision than a 61% confidence fix for a complex deserialization issue.&lt;/p&gt;

&lt;h3&gt;
  
  
  Caveats
&lt;/h3&gt;

&lt;p&gt;This feature is &lt;strong&gt;Ultimate only&lt;/strong&gt;. Teams on Free or Premium do not get auto-remediation. Additionally, the agent currently targets SAST findings — DAST, dependency scanning, and secret detection are not yet covered by auto-remediation. Incremental SAST scanning (which analyzes only changed files rather than the full codebase) is a separate 18.11 improvement that speeds up scans generally, but that is distinct from the agentic remediation feature.&lt;/p&gt;

&lt;h2&gt;
  
  
  CI Expert Agent (Beta)
&lt;/h2&gt;

&lt;p&gt;The CI Expert Agent is in public beta in 18.11. It is aimed at a specific problem that has blocked teams for years: the blank &lt;code&gt;.gitlab-ci.yml&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;Writing a CI pipeline from scratch requires knowing GitLab's YAML syntax, understanding which stages your project needs, knowing your test runner commands, and figuring out caching and parallelization. Developers who have not configured CI before either copy from an existing project (with mismatches), stitch together docs examples (fragile), or wait for someone who has done it.&lt;/p&gt;

&lt;h3&gt;
  
  
  What the CI Expert Agent does
&lt;/h3&gt;

&lt;p&gt;The agent inspects your repository — file structure, detected language and framework, existing scripts — and proposes a complete build-and-test pipeline in natural language. It targets a running pipeline in under five minutes without manual YAML authoring.&lt;/p&gt;

&lt;p&gt;Beyond initial setup, the CI Expert Agent can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Debug failing jobs by reading pipeline logs and explaining what went wrong&lt;/li&gt;
&lt;li&gt;Suggest optimizations: caching, &lt;code&gt;needs&lt;/code&gt; dependencies so jobs start earlier, and parallelization&lt;/li&gt;
&lt;li&gt;Help migrate from other CI systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a &lt;strong&gt;beta&lt;/strong&gt; feature, which means it works but is not recommended for production pipelines without human review of the generated YAML. GitLab's docs are explicit: test in a fork or staging branch first.&lt;/p&gt;

&lt;h3&gt;
  
  
  Practical example
&lt;/h3&gt;

&lt;p&gt;You tell the agent: "This is a Python FastAPI project using pytest for tests and Docker for deployment." It reads your repo, identifies the framework, and proposes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;stages&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;test&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;build&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;deploy&lt;/span&gt;

&lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python:3.12&lt;/span&gt;
  &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;pip install -r requirements.txt&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;pytest --cov=app tests/&lt;/span&gt;
  &lt;span class="na"&gt;cache&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;.pip_cache/&lt;/span&gt;

&lt;span class="na"&gt;build-docker&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;build&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker:24&lt;/span&gt;
  &lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;docker:dind&lt;/span&gt;
  &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;docker build -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA .&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You review it, adjust any project-specific details, and push. The agent handles the structural thinking; you own the configuration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Analyst Agent
&lt;/h2&gt;

&lt;p&gt;The Data Analyst Agent is generally available in 18.11 and is notable for its tier availability: &lt;strong&gt;Free, Premium, and Ultimate&lt;/strong&gt; customers with the Duo Agent Platform enabled can use it.&lt;/p&gt;

&lt;h3&gt;
  
  
  What it does
&lt;/h3&gt;

&lt;p&gt;You ask questions in natural language. The agent translates them into GitLab Query Language (GLQL) queries, runs them against your project or group data, and returns visual charts or tables.&lt;/p&gt;

&lt;p&gt;Example questions you can ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"How many merge requests did the backend team close last month?"&lt;/li&gt;
&lt;li&gt;"What is our average MR cycle time this quarter compared to last quarter?"&lt;/li&gt;
&lt;li&gt;"Which pipelines are failing most often and on which branches?"&lt;/li&gt;
&lt;li&gt;"Show me deployment frequency for the production environment over the last 90 days."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent covers the four DORA metrics (deployment frequency, lead time, change failure rate, mean time to restore), merge request analytics, pipeline health, and team contribution summaries.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why this is useful
&lt;/h3&gt;

&lt;p&gt;Engineering managers previously needed to export data from GitLab, load it into a BI tool, and write queries manually. Teams with dedicated analytics tooling (like Grafana or Tableau) could build dashboards, but smaller teams could not justify the overhead.&lt;/p&gt;

&lt;p&gt;The Data Analyst Agent brings that capability to anyone on the platform with a Duo subscription. The answers are not static dashboards — you can ask follow-up questions, drill into specific time ranges, or compare teams.&lt;/p&gt;

&lt;h3&gt;
  
  
  Limitations
&lt;/h3&gt;

&lt;p&gt;The agent is powered by GLQL, which has coverage gaps. Not every data point across the GitLab platform is queryable through natural language yet. Complex cross-project analytics (for example, correlating security findings across five repos) may require falling back to manual GLQL queries or the API.&lt;/p&gt;

&lt;h2&gt;
  
  
  GitLab Credits Spending Controls
&lt;/h2&gt;

&lt;p&gt;18.11 also ships a practical addition for teams worried about AI cost creep: subscription-level and per-user spending caps for GitLab Credits.&lt;/p&gt;

&lt;p&gt;GitLab Credits are the consumption unit for on-demand AI features on the platform. Prior to 18.11, teams had visibility into usage but limited enforcement controls.&lt;/p&gt;

&lt;p&gt;Now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Billing account managers&lt;/strong&gt; can set a monthly Credits cap at the subscription level. When the cap is hit, Duo Agent Platform features are suspended until the next billing period or until an admin adjusts the cap.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-user caps&lt;/strong&gt; let admins prevent any single user from consuming the full team allocation. If one user hits their cap, only that user is suspended — other team members are unaffected.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Free namespaces&lt;/strong&gt; have an automatic on-demand cap of $25,000 per calendar month as a safety floor.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This matters because agentic features (particularly the SAST remediation agent, which can open many MRs at once) consume more Credits than a standard Duo Chat query. Having hard stops prevents a single automated scan from generating an unexpected bill.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Good and What's Not
&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Liked&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SAST auto-remediation GA — genuinely closes the scan-to-fix gap without requiring a separate security tool&lt;/li&gt;
&lt;li&gt;Confidence scores on generated fixes prevent blind merges&lt;/li&gt;
&lt;li&gt;Data Analyst Agent available on all tiers — not locked behind Ultimate&lt;/li&gt;
&lt;li&gt;Per-user and subscription-level credit caps give CFOs and engineering managers real spend controls&lt;/li&gt;
&lt;li&gt;Incremental SAST scanning reduces pipeline time for large repos independently of the agentic features&lt;/li&gt;
&lt;li&gt;Kubernetes 1.35 support keeps GitLab current with the k8s release schedule&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Didn't Like&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SAST auto-remediation is Ultimate-only — teams on Premium still have to fix vulnerabilities manually&lt;/li&gt;
&lt;li&gt;CI Expert Agent is beta only — generated YAML needs human review before shipping to production&lt;/li&gt;
&lt;li&gt;Auto-remediation covers SAST only; DAST and dependency scanning are not yet included&lt;/li&gt;
&lt;li&gt;Data Analyst Agent has GLQL coverage gaps for complex cross-project queries&lt;/li&gt;
&lt;li&gt;Duo Agent Platform still requires a paid add-on even on GitLab.com free tier&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Pricing Breakdown
&lt;/h2&gt;

&lt;p&gt;Understanding which features you can access requires mapping your GitLab tier to the Duo Agent Platform availability.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
  &lt;th&gt;Feature&lt;/th&gt;
  &lt;th&gt;Free&lt;/th&gt;
  &lt;th&gt;Premium ($29/user/mo)&lt;/th&gt;
  &lt;th&gt;Ultimate ($99/user/mo)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
  &lt;td&gt;Data Analyst Agent&lt;/td&gt;
  &lt;td&gt;Yes (Duo required)&lt;/td&gt;
  &lt;td&gt;Yes (Duo required)&lt;/td&gt;
  &lt;td&gt;Yes (Duo required)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;CI Expert Agent (Beta)&lt;/td&gt;
  &lt;td&gt;Yes (Duo required)&lt;/td&gt;
  &lt;td&gt;Yes (Duo required)&lt;/td&gt;
  &lt;td&gt;Yes (Duo required)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Agentic SAST Remediation&lt;/td&gt;
  &lt;td&gt;No&lt;/td&gt;
  &lt;td&gt;No&lt;/td&gt;
  &lt;td&gt;Yes (Duo required)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Credits Spending Caps&lt;/td&gt;
  &lt;td&gt;Yes ($25K auto-cap)&lt;/td&gt;
  &lt;td&gt;Yes (manual cap)&lt;/td&gt;
  &lt;td&gt;Yes (manual cap)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Best Value Tier&lt;/td&gt;
  &lt;td&gt;Analytics only&lt;/td&gt;
  &lt;td&gt;Analytics + CI Agent&lt;/td&gt;
  &lt;td&gt;Full agentic suite&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;GitLab Duo add-on pricing sits at approximately $19/user/month on top of your base plan. For Ultimate customers, the Duo add-on is bundled starting in certain enterprise arrangements — check your GitLab contract for specifics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who It's For — and Who Should Skip
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Go all-in on 18.11 if you are:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A security-conscious team on GitLab Ultimate: the SAST auto-remediation agent alone justifies the upgrade if you have a backlog of unresolved vulnerabilities&lt;/li&gt;
&lt;li&gt;A mid-size team without a dedicated DevOps engineer: the CI Expert Agent reduces the YAML expertise required to maintain pipelines&lt;/li&gt;
&lt;li&gt;An engineering manager who wants DORA metrics without a separate BI tool: the Data Analyst Agent handles most common delivery questions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The upgrade is less compelling if you are:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On Free or Premium with no plans to upgrade: you get the Data Analyst Agent and CI Expert Agent beta, but the highest-value feature (SAST remediation) is out of reach&lt;/li&gt;
&lt;li&gt;Running a very small project with a simple pipeline: the CI Expert Agent adds value primarily for teams maintaining complex multi-stage pipelines&lt;/li&gt;
&lt;li&gt;Relying on DAST or dependency scanning for your primary security workflow: auto-remediation does not cover those vectors yet&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Q: Does Agentic SAST Vulnerability Resolution work on self-managed GitLab?
&lt;/h3&gt;

&lt;p&gt;Yes. The Duo Agent Platform is available on GitLab.com, GitLab Self-Managed, and GitLab Dedicated. For self-managed instances, you need to enable the Duo Agent Platform in your admin settings and ensure your instance has network access to GitLab's AI gateway.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Can the CI Expert Agent write pipelines for non-standard languages or frameworks?
&lt;/h3&gt;

&lt;p&gt;The agent identifies language and framework from your repository structure. For common stacks (Python, Node, Go, Java, Ruby, PHP), it performs well. For less common stacks or custom toolchains, the generated YAML provides a reasonable starting point but will likely need manual adjustments. The docs recommend treating it as a first draft, not a final config.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: What counts as a GitLab Credit?
&lt;/h3&gt;

&lt;p&gt;Credits are consumed when you use on-demand AI features — Duo Agent Platform agent actions, certain Duo Chat queries, and AI-powered MR summaries. Standard GitLab features (CI minutes, storage, Packages) use separate consumption units and do not draw from your Credits balance. The Credits dashboard in the Admin area shows a per-feature breakdown.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Is the Data Analyst Agent aware of data across multiple projects?
&lt;/h3&gt;

&lt;p&gt;It can query data from projects and groups you have access to. Cross-group analytics (comparing metrics across separate top-level groups) is limited. For complex multi-group rollups, the GitLab Analytics API remains the more powerful path.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: How does the SAST agent handle false positives?
&lt;/h3&gt;

&lt;p&gt;The agent filters out likely false positives before generating a fix. It assigns a confidence score to each fix it does generate. Developers are expected to review the MR and decide whether to merge — the agent does not auto-merge anything without human approval.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;GitLab 18.11 is a meaningful release, not a marketing refresh. The SAST auto-remediation agent solves a workflow problem that security teams have been complaining about for years. The Data Analyst Agent's broad tier availability means most teams can use it today without an upgrade. The CI Expert Agent is beta — useful, but not ready to hand off without review.&lt;/p&gt;

&lt;p&gt;The per-user credit caps are an underrated addition. Agentic features are inherently higher-consumption than chat-style AI, and organizations need budget controls before deploying them at scale. GitLab shipping the controls in the same release as the agents is the right sequencing.&lt;/p&gt;

&lt;p&gt;The outstanding gap is coverage: SAST auto-remediation is excellent, but teams that depend on DAST or dependency scanning for their primary security findings will not see the same workflow improvement yet. That feels like an obvious 18.12 candidate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bottom Line&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;GitLab 18.11 is the best DevSecOps release of 2026 so far. If you are on Ultimate, enable the Duo Agent Platform and let the SAST remediation agent start clearing your vulnerability backlog. Everyone else gets the Data Analyst Agent for free — that alone is worth turning on Duo.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Prefer a deep-dive walkthrough? &lt;a href="https://www.youtube.com/watch?v=Zt_cFgZjazc" rel="noopener noreferrer"&gt;Watch the full video on YouTube&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>gitlab</category>
      <category>devsecops</category>
      <category>agenticai</category>
      <category>cicd</category>
    </item>
    <item>
      <title>Kimi Code K2.6: Moonshot AI's Coding Model vs Claude Code</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Thu, 23 Apr 2026 00:19:51 +0000</pubDate>
      <link>https://dev.to/jangwook_kim_e31e7291ad98/kimi-code-k26-moonshot-ais-coding-model-vs-claude-code-1jab</link>
      <guid>https://dev.to/jangwook_kim_e31e7291ad98/kimi-code-k26-moonshot-ais-coding-model-vs-claude-code-1jab</guid>
<description>&lt;p&gt;&lt;strong&gt;Overall score: 8.7 / 10&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Criterion&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Benchmark Performance&lt;/td&gt;
&lt;td&gt;9.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agentic Capabilities&lt;/td&gt;
&lt;td&gt;9.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost Efficiency&lt;/td&gt;
&lt;td&gt;9.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Instruction Following&lt;/td&gt;
&lt;td&gt;7.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ecosystem &amp;amp; Tooling&lt;/td&gt;
&lt;td&gt;7.5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;p&gt;Moonshot AI shipped Kimi Code K2.6 as generally available on April 20, 2026 — one week after beta testers ran the Code Preview. The release is significant: K2.6 tops SWE-Bench Pro at 58.6%, outscoring GPT-5.4 (57.7%) and Claude Opus 4.6 (53.4%) on the benchmark that comes closest to measuring real-world GitHub issue resolution. It does this while running fully open weights under a Modified MIT License and charging $0.60 per million input tokens — roughly 5x cheaper than Claude Sonnet 4.6.&lt;/p&gt;

&lt;p&gt;That combination — top-tier coding benchmarks, open weights, and aggressive pricing — makes K2.6 the most credible challenger to Claude Code that developers have seen in 2026.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Kimi Code K2.6 Is
&lt;/h2&gt;

&lt;p&gt;Kimi K2.6 is Moonshot AI's flagship model, built from the ground up for agentic software engineering. Architecturally, it uses the same Mixture-of-Experts design as K2.5: 1 trillion total parameters with only 32 billion activated per forward pass. The full architecture details: 384 experts in total, 8 selected per token (plus one shared expert that is always active), 61 layers, an attention hidden dimension of 7,168, and 64 attention heads.&lt;/p&gt;

&lt;p&gt;What K2.6 changes from K2.5 is execution depth. Kimi K2.5 could reliably follow 30–50 sequential tool calls before losing coherence. K2.6 extends that to 200–300 calls. Agent swarm capacity grows from 100 to 300 simultaneous sub-agents, each capable of executing across up to 4,000 coordinated steps. Moonshot AI demonstrated the practical implications with a real test: K2.6 autonomously overhauled an 8-year-old financial matching engine over 13 hours, achieving a 185% throughput improvement without human intervention.&lt;/p&gt;

&lt;p&gt;That's not a benchmark. That's a production refactoring job that would normally take a senior engineer a week.&lt;/p&gt;

&lt;p&gt;If you've been following the &lt;a href="https://dev.to/articles/best-ai-coding-agents-2026"&gt;AI coding tools landscape in 2026&lt;/a&gt;, Kimi K2.6 lands in the tier just below Claude Mythos but well above the open-weight field. It's Moonshot AI's direct answer to &lt;a href="https://dev.to/articles/claude-sonnet-4-6-developer-guide-2026"&gt;Claude Sonnet 4.6&lt;/a&gt; and the &lt;a href="https://dev.to/articles/cursor-3-review-background-agents-2026"&gt;Cursor background agent&lt;/a&gt; ecosystem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmarks: Where K2.6 Actually Leads
&lt;/h2&gt;

&lt;p&gt;Numbers first, context after.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Kimi K2.6&lt;/th&gt;
&lt;th&gt;GPT-5.4&lt;/th&gt;
&lt;th&gt;Claude Opus 4.6&lt;/th&gt;
&lt;th&gt;Gemini 3.1 Pro&lt;/th&gt;
&lt;th&gt;Kimi K2.5&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SWE-Bench Pro&lt;/td&gt;
&lt;td class="highlight"&gt;58.6%&lt;/td&gt;
&lt;td&gt;57.7%&lt;/td&gt;
&lt;td&gt;53.4%&lt;/td&gt;
&lt;td&gt;54.2%&lt;/td&gt;
&lt;td&gt;50.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SWE-Bench Verified&lt;/td&gt;
&lt;td class="highlight"&gt;80.2%&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LiveCodeBench v6&lt;/td&gt;
&lt;td class="highlight"&gt;89.6&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;88.8&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HLE-Full (with tools)&lt;/td&gt;
&lt;td class="highlight"&gt;54.0&lt;/td&gt;
&lt;td&gt;52.1&lt;/td&gt;
&lt;td&gt;53.0&lt;/td&gt;
&lt;td&gt;51.4&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSearchQA (F1)&lt;/td&gt;
&lt;td class="highlight"&gt;92.5%&lt;/td&gt;
&lt;td&gt;78.6%&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Terminal-Bench 2.0&lt;/td&gt;
&lt;td&gt;66.7%&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API Input Price&lt;/td&gt;
&lt;td&gt;$0.60/M&lt;/td&gt;
&lt;td&gt;varies&lt;/td&gt;
&lt;td&gt;$3.00/M&lt;/td&gt;
&lt;td&gt;varies&lt;/td&gt;
&lt;td&gt;$0.60/M&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;SWE-Bench Pro is currently the most credible coding evaluation because it tests models on real GitHub issues — bugs filed by actual developers, not synthetic problems. K2.6's 58.6% means it correctly resolves more than half of those issues autonomously, placing it ahead of every closed-weight model in this comparison.&lt;/p&gt;

&lt;p&gt;The HLE-Full with tools result (54.0) is perhaps more surprising. Humanity's Last Exam tests genuinely hard multi-domain reasoning, and K2.6 leads there too — which suggests that Moonshot AI's improvements to tool call reliability have broader reasoning implications, not just code execution effects.&lt;/p&gt;

&lt;p&gt;One important caveat: BenchLM currently ranks K2.6 at #6 out of 111 models for overall coding, with an average score of 89.9. It leads the open-weight category by a significant margin, but it is not the top coding model across the board.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Good
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Strengths
&amp;lt;ul&amp;gt;
  &amp;lt;li&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Top SWE-Bench Pro score&lt;/strong&gt; — 58.6% on real GitHub issues beats every frontier model, including GPT-5.4 and Claude Opus 4.6&lt;br&gt;
      &lt;/p&gt;
&lt;li&gt;

&lt;strong&gt;Genuine long-horizon execution&lt;/strong&gt; — 200–300 reliable tool calls, 13-hour autonomous sessions, 300-agent swarm with 4,000 coordinated steps&lt;/li&gt;
&lt;br&gt;
      &lt;li&gt;

&lt;strong&gt;Price-to-performance ratio&lt;/strong&gt; — $0.60/M input vs Claude Sonnet 4.6's $3/M input. For batch coding pipelines, this difference compounds quickly&lt;/li&gt;
&lt;br&gt;
      &lt;li&gt;

&lt;strong&gt;Open weights, actual commercial use&lt;/strong&gt; — Modified MIT allows commercial deployment without per-token fees, hardware cost aside&lt;/li&gt;
&lt;br&gt;
      &lt;li&gt;

&lt;strong&gt;Claude Code drop-in&lt;/strong&gt; — Three environment variables and you're running K2.6 through the Claude Code interface&lt;/li&gt;
&lt;br&gt;
      &lt;li&gt;

&lt;strong&gt;Multi-language depth&lt;/strong&gt; — Tested and reliable across Python, Go, Rust, and front-end (HTML/CSS/JS motion generation)&lt;/li&gt;
&lt;br&gt;
    
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Weaknesses
&amp;lt;ul&amp;gt;
  &amp;lt;li&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;English instruction following lags Claude&lt;/strong&gt; — Complex multi-part English prompts with nuanced constraints show more drift than Claude Sonnet 4.6&lt;br&gt;
      &lt;/p&gt;
&lt;li&gt;

&lt;strong&gt;Ecosystem is maturing, not mature&lt;/strong&gt; — IDE integrations, plugin coverage, and community tooling lag Claude Code and Cursor by 12+ months&lt;/li&gt;
&lt;br&gt;
      &lt;li&gt;

&lt;strong&gt;Self-hosting requires serious hardware&lt;/strong&gt; — Full weights need H100-class GPUs; GGUF quantizations help but reduce performance noticeably&lt;/li&gt;
&lt;br&gt;
      &lt;li&gt;

&lt;strong&gt;Revenue credit requirement&lt;/strong&gt; — Modified MIT's "display Kimi K2.6 credit" clause for $20M+/month revenue companies creates unexpected branding obligations&lt;/li&gt;
&lt;br&gt;
    

&lt;h2&gt;
  
  
  Pricing Breakdown
&lt;/h2&gt;

&lt;p&gt;Kimi K2.6 is available through four channels with different economics:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Managed API (platform.kimi.ai)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input: $0.60 per million tokens&lt;/li&gt;
&lt;li&gt;Output: $2.50 per million tokens&lt;/li&gt;
&lt;li&gt;Zero infrastructure overhead; recommended for teams under 10M tokens/month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;OpenRouter (moonshotai/kimi-k2.6)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Slightly higher margin on OpenRouter's standard passthrough&lt;/li&gt;
&lt;li&gt;Useful if you're already routing multiple providers through OpenRouter&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Microsoft Azure AI Foundry&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Available as a managed deployment in Azure infrastructure&lt;/li&gt;
&lt;li&gt;Pricing follows Azure AI model marketplace rates; better for enterprises with existing Azure commitments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Self-Hosted (Hugging Face weights)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zero per-token cost after hardware&lt;/li&gt;
&lt;li&gt;Requires transformers ≥4.57.1&lt;/li&gt;
&lt;li&gt;Recommended inference: vLLM or SGLang&lt;/li&gt;
&lt;li&gt;Community GGUF quantizations (ubergarm) available for lower VRAM configurations&lt;/li&gt;
&lt;li&gt;Practical for teams running &amp;gt;50M tokens/month with H100-class access&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For context: at $0.60 input / $2.50 output, K2.6 is 5x cheaper on input and 6x cheaper on output than Claude Sonnet 4.6 ($3/$15). Against Claude Opus 4.6 or 4.7, the gap widens further. For agentic pipelines that generate thousands of tool-call roundtrips, this pricing difference translates directly to project economics.&lt;/p&gt;
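&lt;p&gt;As a rough illustration of how that compounds, here is a back-of-the-envelope comparison in Python. The monthly token volumes are hypothetical; the per-million prices are the list rates quoted above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Back-of-the-envelope monthly cost comparison (illustrative volumes only).
# Prices are USD per million tokens, as quoted above.
PRICES = {
    "kimi-k2.6": {"input": 0.60, "output": 2.50},
    "claude-sonnet-4.6": {"input": 3.00, "output": 15.00},
}

# Hypothetical agentic pipeline: 40M input tokens, 8M output tokens per month.
monthly_input_m = 40
monthly_output_m = 8

for model, price in PRICES.items():
    cost = monthly_input_m * price["input"] + monthly_output_m * price["output"]
    print(f"{model}: ${cost:,.2f}/month")

# kimi-k2.6: $44.00/month
# claude-sonnet-4.6: $240.00/month
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;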

&lt;p&gt;The Modified MIT License allows unrestricted commercial use with one exception: if your product exceeds 100 million monthly active users &lt;strong&gt;or&lt;/strong&gt; $20 million in monthly revenue, you must display a visible "Kimi K2.6" attribution in your user interface. Most developer teams won't hit that threshold, but SaaS companies building on top of K2.6 should check the license terms against their own scale before deploying.&lt;/p&gt;

&lt;h2&gt;
  
  
  API Integration: Get K2.6 Running in 10 Minutes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Option 1: Direct Kimi API (OpenAI-compatible)
&lt;/h3&gt;

&lt;p&gt;Kimi's API is OpenAI SDK-compatible. If you're already calling OpenAI endpoints, the switch is a base URL change:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-moonshot-api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.moonshot.ai/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kimi-k2.6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Refactor this Python class to use dataclasses.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Get your API key at platform.kimi.ai/console/api-keys.&lt;/p&gt;
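&lt;p&gt;Because the article's focus is agentic use, it's worth sketching what a single tool-call roundtrip looks like through the same endpoint. This assumes K2.6 honors OpenAI-style function calling via the compatible API; the &lt;code&gt;get_file_tree&lt;/code&gt; tool and its handler are hypothetical placeholders.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from openai import OpenAI

client = OpenAI(api_key="your-moonshot-api-key", base_url="https://api.moonshot.ai/v1")

# Hypothetical tool the agent may call; any local function works.
tools = [{
    "type": "function",
    "function": {
        "name": "get_file_tree",
        "description": "Return the repository file tree as a newline-separated list.",
        "parameters": {"type": "object", "properties": {}, "required": []},
    },
}]

def get_file_tree():
    return "src/main.py\nsrc/utils.py\ntests/test_main.py"

messages = [{"role": "user", "content": "Which test files exist in this repo?"}]
response = client.chat.completions.create(model="kimi-k2.6", messages=messages, tools=tools)

# Sketch assumes the model chose to call the tool on the first turn.
call = response.choices[0].message.tool_calls[0]

# Execute the tool locally and feed the result back for the final answer.
messages.append(response.choices[0].message)
messages.append({"role": "tool", "tool_call_id": call.id, "content": get_file_tree()})
final = client.chat.completions.create(model="kimi-k2.6", messages=messages, tools=tools)
print(final.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;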

&lt;h3&gt;
  
  
  Option 2: Run K2.6 Inside Claude Code
&lt;/h3&gt;

&lt;p&gt;This is the integration that's gained the most traction. Set three core environment variables (plus two optional model-default overrides, shown below) and Claude Code's entire interface — slash commands, subagents, CLAUDE.md — runs against K2.6's backend:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Linux / macOS&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"https://api.moonshot.ai/anthropic"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_AUTH_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-moonshot-api-key"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"kimi-k2.6"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_DEFAULT_OPUS_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"kimi-k2.6"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_DEFAULT_SONNET_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"kimi-k2.6"&lt;/span&gt;

&lt;span class="c"&gt;# Then launch Claude Code normally&lt;/span&gt;
claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Kimi maintains an Anthropic-compatible API endpoint at &lt;code&gt;api.moonshot.ai/anthropic&lt;/code&gt;, which means Claude Code's tool call format, context compaction, and session management work without modification. The practical advantage: you get Claude Code's polished UX at K2.6's pricing.&lt;/p&gt;

&lt;p&gt;If you're already using &lt;a href="https://dev.to/articles/how-to-use-claude-code-guide-2026"&gt;Claude Code for advanced workflows&lt;/a&gt;, this is the fastest way to evaluate K2.6 without changing your tooling setup.&lt;/p&gt;
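&lt;p&gt;The same endpoint can also be hit programmatically. Here is a minimal sketch with the Anthropic Python SDK, assuming Kimi's Anthropic-compatible endpoint accepts standard Messages API requests:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from anthropic import Anthropic

# Point the standard Anthropic SDK at Kimi's Anthropic-compatible endpoint.
# Assumption: the endpoint accepts standard Messages API requests.
client = Anthropic(
    api_key="your-moonshot-api-key",
    base_url="https://api.moonshot.ai/anthropic",
)

message = client.messages.create(
    model="kimi-k2.6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain this traceback and suggest a fix."}],
)
print(message.content[0].text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;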

&lt;h3&gt;
  
  
  Option 3: Kimi Code CLI
&lt;/h3&gt;

&lt;p&gt;Moonshot AI ships its own terminal agent built on K2.6:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;kimi-cli
kimi /login   &lt;span class="c"&gt;# OAuth via browser&lt;/span&gt;
kimi          &lt;span class="c"&gt;# Start coding session&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CLI includes repository-aware context, MCP tool integration (&lt;code&gt;kimi mcp add&lt;/code&gt;), cron scheduling, and shell mode toggle with Ctrl-X. It supports 256K context tuned for repository-scale codebases and outputs at ~100 tokens/second. For teams comfortable with &lt;a href="https://dev.to/articles/openai-codex-cli-terminal-coding-agent-guide-2026"&gt;terminal-first AI coding agents&lt;/a&gt;, this is the most direct path.&lt;/p&gt;

&lt;h2&gt;
  
  
  Self-Hosting K2.6 with vLLM
&lt;/h2&gt;

&lt;p&gt;For teams wanting zero per-token cost:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install dependencies&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;vllm transformers&amp;gt;&lt;span class="o"&gt;=&lt;/span&gt;4.57.1

&lt;span class="c"&gt;# Launch vLLM server with K2.6 weights&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; vllm.entrypoints.openai.api_server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model&lt;/span&gt; moonshotai/Kimi-K2.6 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tensor-parallel-size&lt;/span&gt; 4 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-model-len&lt;/span&gt; 65536 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dtype&lt;/span&gt; bfloat16
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hardware baseline: 4× H100 80GB for the full model in bfloat16. For lower-budget setups, community GGUF quantizations from ubergarm reduce VRAM requirements significantly, though at reduced accuracy on complex reasoning tasks.&lt;/p&gt;

&lt;p&gt;The recommended inference stack is vLLM or SGLang. vLLM's MRV2 architecture (released March 2026) handles MoE routing well; SGLang is faster for structured output generation. If you're already running &lt;a href="https://dev.to/articles/vllm-production-inference-guide-2026"&gt;vLLM in production&lt;/a&gt;, K2.6 slots in without configuration changes beyond the model path.&lt;/p&gt;
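&lt;p&gt;Once the server is up, it exposes the same OpenAI-compatible API locally, so the client code from earlier works with only a base URL change. A minimal sketch, assuming vLLM's default port and no API key configured:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from openai import OpenAI

# vLLM's OpenAI-compatible server listens on port 8000 by default.
# Without --api-key, any placeholder token is accepted.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    # Model name matches the --model path passed to the vLLM server.
    model="moonshotai/Kimi-K2.6",
    messages=[{"role": "user", "content": "Write a unit test for a FIFO queue class."}],
)
print(response.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;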

&lt;h2&gt;
  
  
  Real-World Performance: What Developers Are Reporting
&lt;/h2&gt;

&lt;p&gt;The 13-hour financial engine refactor is the headline, but production reports are more nuanced.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where K2.6 genuinely wins:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Long refactoring sessions that cross 50+ file touches — K2.6 maintains context coherence that previous open-weight models couldn't sustain&lt;/li&gt;
&lt;li&gt;Python and Go codebases — these appear to be the training-data sweet spots, with clean output and minimal hallucinated APIs&lt;/li&gt;
&lt;li&gt;Cost-sensitive batch pipelines — teams running nightly code analysis, automated PR review, or large-scale code generation report meaningful cost reductions at K2.6's pricing versus equivalent Claude Sonnet usage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Where Claude Code still has the edge:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complex English-language system prompts with layered constraints — Claude Sonnet 4.6's instruction following is measurably tighter on prompts with 5+ simultaneous requirements&lt;/li&gt;
&lt;li&gt;Sensitive code contexts (security, compliance) — Anthropic's Constitutional AI training shows in how Claude handles edge cases; K2.6 is more willing to generate code that might have subtle issues&lt;/li&gt;
&lt;li&gt;IDE integrations — the JetBrains, VS Code, and Cursor ecosystems are built around Anthropic's API; K2.6 works as a drop-in but surface-level polish differences are noticeable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The hybrid workflow gaining traction: K2.6 for code generation and bulk execution, Claude Opus 4.7 for planning, validation, and anything requiring precise instruction adherence. Running K2.6 via the OpenAI-compatible endpoint alongside tools like &lt;a href="https://dev.to/articles/litellm-ai-gateway-llm-proxy-guide-2026"&gt;LiteLLM's proxy&lt;/a&gt; makes provider switching transparent to application code.&lt;/p&gt;
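&lt;p&gt;A minimal sketch of that hybrid pattern with the LiteLLM SDK is below. The model identifiers and the planning/execution split are illustrative, and it assumes your Anthropic key is already configured in the environment.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import litellm

# Route planning to Claude and bulk code generation to K2.6.
# Model strings and the planning/execution split are illustrative only.

def plan(task):
    response = litellm.completion(
        model="anthropic/claude-opus-4-7",  # hypothetical model id; uses ANTHROPIC_API_KEY
        messages=[{"role": "user", "content": f"Break this task into steps, one per line: {task}"}],
    )
    return response.choices[0].message.content

def execute(step):
    response = litellm.completion(
        model="openai/kimi-k2.6",           # OpenAI-compatible passthrough
        api_base="https://api.moonshot.ai/v1",
        api_key="your-moonshot-api-key",
        messages=[{"role": "user", "content": step}],
    )
    return response.choices[0].message.content

for step in plan("Migrate the config loader from YAML to TOML").splitlines():
    if step.strip():
        print(execute(step))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;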

&lt;h2&gt;
  
  
  Who It's For
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;K2.6 is the right choice if you're:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Running cost-sensitive agentic pipelines at scale (&amp;gt;10M tokens/month where pricing compounds)&lt;/li&gt;
&lt;li&gt;Building on open-weight infrastructure where you need weights you actually control&lt;/li&gt;
&lt;li&gt;Doing large-scale refactoring, automated PR review, or repository-level code analysis&lt;/li&gt;
&lt;li&gt;Evaluating a Claude Code alternative without locking into Anthropic's pricing&lt;/li&gt;
&lt;li&gt;Already familiar with MoE model deployment and have H100-class access for self-hosting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Stick with Claude Code if you're:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Writing complex English-language system prompts with nuanced multi-part constraints&lt;/li&gt;
&lt;li&gt;Building in an IDE-first workflow where JetBrains or VS Code integrations matter&lt;/li&gt;
&lt;li&gt;Prioritizing safety and compliance behavior over raw benchmark performance&lt;/li&gt;
&lt;li&gt;A solo developer where the tooling ecosystem difference matters more than per-token costs&lt;/li&gt;
&lt;li&gt;Working in domains (legal, medical, security) where Anthropic's safety tuning is a practical requirement&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Compare K2.6 alongside other capable open-weight agents like &lt;a href="https://dev.to/articles/goose-open-source-ai-agent-review-2026"&gt;Goose by Block&lt;/a&gt; and &lt;a href="https://dev.to/articles/hermes-agent-nous-research-self-improving-developer-guide-2026"&gt;Hermes Agent&lt;/a&gt; if your priority is moving away from proprietary model dependencies entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Q: Is Kimi K2.6 actually open source?
&lt;/h3&gt;

&lt;p&gt;The weights are publicly available on Hugging Face under a Modified MIT License. "Modified" because of the revenue/MAU attribution requirement — but for the vast majority of developers and teams, it's functionally open source with commercial use allowed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Can I use Kimi K2.6 with existing Claude Code projects?
&lt;/h3&gt;

&lt;p&gt;Yes. Set &lt;code&gt;ANTHROPIC_BASE_URL=https://api.moonshot.ai/anthropic&lt;/code&gt; and &lt;code&gt;ANTHROPIC_AUTH_TOKEN=&amp;lt;your-kimi-key&amp;gt;&lt;/code&gt; and &lt;code&gt;ANTHROPIC_MODEL=kimi-k2.6&lt;/code&gt;. Claude Code's UI, slash commands, and CLAUDE.md handling all work against K2.6's backend via Kimi's Anthropic-compatible endpoint.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: How does the agent swarm work in practice?
&lt;/h3&gt;

&lt;p&gt;The 300 sub-agent, 4,000 coordinated step architecture is accessible via Kimi Code CLI and the managed API. You define an orchestration prompt describing the overall task; K2.6's planning layer spawns sub-agents for parallelizable work (e.g., different modules or files) and coordinates their outputs. Direct programmatic control over individual sub-agent allocation is not yet exposed in the API — it's handled internally by the model.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: What's the context window?
&lt;/h3&gt;

&lt;p&gt;The Kimi Code CLI is tuned for 256K tokens on repository-scale codebases. Via the managed API, current documentation shows 128K. Self-hosted configurations depend on your &lt;code&gt;--max-model-len&lt;/code&gt; setting and available VRAM.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: How does K2.6 compare to DeepSeek V3.2?
&lt;/h3&gt;

&lt;p&gt;Both are competitive open-weight coding models at aggressive price points. &lt;a href="https://dev.to/articles/deepseek-v3-2-developer-guide-2026"&gt;DeepSeek V3.2&lt;/a&gt; has the unique capability of simultaneous thinking + tool use in one API call. K2.6 leads on SWE-Bench Pro and on agent swarm scale. For pure coding throughput and agentic workflows, K2.6 currently has the benchmark edge.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Kimi K2.6 posts 58.6% on SWE-Bench Pro, the highest score among publicly listed frontier models as of April 2026&lt;/li&gt;
&lt;li&gt;The core improvement over K2.5 is execution reliability: 200–300 sequential tool calls without drift, versus 30–50 previously&lt;/li&gt;
&lt;li&gt;API pricing at $0.60/M input is 5x cheaper than Claude Sonnet 4.6 — significant for agentic pipelines at scale&lt;/li&gt;
&lt;li&gt;Claude Code integration requires three environment variables; K2.6 runs transparently through Claude's interface&lt;/li&gt;
&lt;li&gt;Open weights on Hugging Face under Modified MIT; self-hosting requires H100-class hardware for the full model&lt;/li&gt;
&lt;li&gt;Instruction following for complex English prompts remains a gap versus Claude; hybrid workflows mitigate this&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Bottom Line&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Kimi Code K2.6 is the most capable open-weight coding model available in April 2026, and its pricing makes it a serious Claude alternative for cost-sensitive agentic pipelines. The benchmark lead is real and the Claude Code drop-in integration removes most switching friction. The honest caveat: complex instruction following and ecosystem maturity still favor Anthropic — but for teams primarily doing code generation at scale, K2.6 earns its place in the stack.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Prefer a deep-dive walkthrough? &lt;a href="https://www.youtube.com/watch?v=xCxDT-54adA" rel="noopener noreferrer"&gt;Watch the full video on YouTube&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kimicode</category>
      <category>moonshotai</category>
      <category>codingmodel</category>
      <category>aiagents</category>
    </item>
  </channel>
</rss>
