<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ramsis Hammadi</title>
    <description>The latest articles on DEV Community by Ramsis Hammadi (@rams901).</description>
    <link>https://dev.to/rams901</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1140118%2F1f844f4d-35c1-4a93-b31e-651c0d27cc6e.png</url>
      <title>DEV Community: Ramsis Hammadi</title>
      <link>https://dev.to/rams901</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rams901"/>
    <language>en</language>
    <item>
      <title>OpenAI Agents SDK: Sandbox Execution and Model-Native Harness in 2026</title>
      <dc:creator>Ramsis Hammadi</dc:creator>
      <pubDate>Sat, 16 May 2026 10:30:00 +0000</pubDate>
      <link>https://dev.to/rams901/openai-agents-sdk-sandbox-execution-and-model-native-harness-in-2026-37jn</link>
      <guid>https://dev.to/rams901/openai-agents-sdk-sandbox-execution-and-model-native-harness-in-2026-37jn</guid>
      <description>&lt;h2&gt;
  
  
  OpenAI Agents SDK: Sandbox Execution and Model-Native Harness in 2026
&lt;/h2&gt;

&lt;h2&gt;
  
  
  TL;DR Summary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The OpenAI Agents SDK now includes &lt;strong&gt;sandbox execution&lt;/strong&gt; — agents run code, access files, and use shell commands in isolated container-based workspaces&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;model-native harness&lt;/strong&gt; replaces custom orchestration code: the SDK handles tool dispatch, state persistence, and multi-step workflows&lt;/li&gt;
&lt;li&gt;Sandboxes support &lt;strong&gt;filesystem, shell, package installs, Git repos, mounted storage (S3/GCS/R2), exposed ports, snapshots, and resumable state&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;agent and sandbox are deliberately separate&lt;/strong&gt; — harness owns the control plane (model calls, tool routing, approvals), sandbox owns execution (files, commands)&lt;/li&gt;
&lt;li&gt;Deploy on &lt;strong&gt;Unix-local (dev), Docker (local container), or hosted providers&lt;/strong&gt; (Cloudflare, Vercel) with the same agent definition&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Direct Answer Block
&lt;/h2&gt;

&lt;p&gt;The OpenAI Agents SDK is a code-first framework for building production AI agents in TypeScript or Python. Its sandbox feature gives agents an isolated Unix-like workspace with filesystem, shell, mounted data, and resumable state. The model-native harness handles tool dispatch, multi-step execution, and state persistence — replacing the custom orchestration code you'd otherwise write yourself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Before the Agents SDK's sandbox update, building a production AI agent that could safely execute code required stitching together: a model API client, a container runtime, credential isolation, state persistence, tool routing, and approval logic. Each piece was custom code. The SDK collapses that stack: define your agent with a manifest describing the workspace, attach capabilities (shell, filesystem, skills, memory), and pick a sandbox client. The harness handles everything between model turns.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is the OpenAI Agents SDK's "model-native harness" and how does it change agent development?
&lt;/h2&gt;

&lt;p&gt;The model-native harness is a runtime layer built around how models naturally use tools and context. According to the newsletter reporting OpenAI's announcement, it "runs agents in a way that matches how models naturally use tools and context."&lt;/p&gt;

&lt;p&gt;In practice, this means the harness owns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tool dispatch&lt;/strong&gt;: when the model calls &lt;code&gt;shell&lt;/code&gt; or &lt;code&gt;file_read&lt;/code&gt;, the harness routes the call to the correct sandbox tool&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State persistence&lt;/strong&gt;: conversation state, tool results, and workspace state survive across model turns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-step execution&lt;/strong&gt;: the agent loop continues across turns, with each step observable and cancellable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming&lt;/strong&gt;: responses stream back to the application as the agent works&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recovery&lt;/strong&gt;: if a sandbox session stops, the harness can resume from serialized state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pre-harness approach required developers to write this orchestration themselves — wrapping every tool call, managing conversation state, handling tool errors, and building resumption logic. The harness replaces that with a structured runtime.&lt;/p&gt;
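The orchestration the harness takes over can be pictured as a plain loop. The sketch below is a toy illustration, not the SDK's implementation: the model is a scripted stub, and the tool names are only illustrative.

```python
# Toy sketch of what a harness does between model turns:
# route tool calls, record results, and persist state across steps.

def fake_model(state):
    # Scripted stand-in for a model: ask for one tool call, then finish.
    if not state["results"]:
        return {"tool": "file_read", "args": {"path": "notes.txt"}}
    return {"final": "done: " + state["results"][-1]}

TOOLS = {
    "file_read": lambda args: "contents of " + args["path"],
    "shell": lambda args: "ran: " + args["cmd"],
}

def run_agent(model, state):
    # The agent loop: call the model, dispatch tools, persist results,
    # stop when the model emits a final answer.
    while True:
        turn = model(state)
        if "final" in turn:
            return turn["final"]
        result = TOOLS[turn["tool"]](turn["args"])
        state["results"].append(result)  # state survives across turns

state = {"results": []}
print(run_agent(fake_model, state))  # done: contents of notes.txt
```

Everything in this loop — dispatch, persistence, error handling, resumption — is what the harness provides so you don't hand-roll it.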

&lt;p&gt;OpenAI's Agents SDK documentation positions it as the code-first path: "use the SDK track when your server owns orchestration, tool execution, state, and approvals." For hosted workflow creation without code, use Agent Builder. For direct model API access, use the client libraries.&lt;/p&gt;

&lt;p&gt;The SDK separates agent definitions from execution boundaries. A &lt;code&gt;SandboxAgent&lt;/code&gt; is still an &lt;code&gt;Agent&lt;/code&gt; — it keeps instructions, prompt, tools, handoffs, MCP servers, model settings, and hooks. What changes is where execution happens: a live sandbox session with its own filesystem, commands, and ports.&lt;/p&gt;

&lt;h2&gt;
  
  
  How does sandbox execution work — and how does it keep agent code safe in production?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo1mv0ls5q91wvqmwerue.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo1mv0ls5q91wvqmwerue.png" alt="Diagram showing how sandbox isolates agent code execution from host — file system tools, shell commands, network access, credential isolation" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The sandbox is an &lt;strong&gt;isolated, Unix-like execution environment&lt;/strong&gt; with filesystem, shell, installed packages, mounted data, exposed ports, and resumable state. The key architectural decision: the agent harness and sandbox compute are &lt;strong&gt;separate&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The key split is the boundary between the harness and compute. The harness is the control plane around the model: it owns the agent loop, model calls, tool routing, handoffs, approvals, tracing, recovery, and run state. Compute is the sandbox execution plane where model-directed work reads and writes files, runs commands, installs dependencies, uses mounted storage, exposes ports, and snapshots state." — OpenAI Sandbox Agents documentation&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This separation matters for production safety:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Control plane stays in trusted infrastructure&lt;/strong&gt; — the harness keeps auth, billing, audit logs, human review, and recovery state outside any single container&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sandbox is an execution environment, not the control plane&lt;/strong&gt; — it runs commands and edits files but doesn't own model decisions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Credentials are isolated from agent code&lt;/strong&gt; — OpenAI's docs explicitly warn: "Treat sandbox credentials as runtime configuration, not prompt content."&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Whether the harness runs &lt;em&gt;inside&lt;/em&gt; the sandbox or &lt;em&gt;separate&lt;/em&gt; from it is a product decision. Running it inside is convenient for prototypes. Keeping it separate is the production pattern — the harness keeps sensitive control-plane operations in your infrastructure while sandboxes handle provider-specific execution.&lt;/p&gt;

&lt;p&gt;According to the newsletter, the SDK "keeps credentials outside execution environments where model-generated code runs" — a critical security boundary when agents can generate and execute arbitrary code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sandbox clients
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Client&lt;/th&gt;
&lt;th&gt;Use case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;UnixLocal&lt;/td&gt;
&lt;td&gt;Local development on macOS/Linux. Creates temp workspace, cleans up after run&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Docker&lt;/td&gt;
&lt;td&gt;Local container isolation with custom images&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hosted providers&lt;/td&gt;
&lt;td&gt;Cloudflare, Vercel — production deployment with provider-specific isolation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The sandbox client is part of &lt;strong&gt;run configuration, not agent definition&lt;/strong&gt;. Keep the agent, manifest, and capabilities stable, then swap the client per environment.&lt;/p&gt;
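Treating the client as run configuration might look like the following sketch. The client names are placeholder strings standing in for real client objects; the selection logic is the point, not the identifiers.

```python
import os

def make_sandbox_client(env):
    # Placeholder stand-ins for sandbox clients; the real SDK classes differ.
    clients = {
        "dev": "UnixLocal",        # temp workspace on the host
        "test": "Docker",          # local container isolation
        "prod": "HostedProvider",  # e.g. Cloudflare or Vercel
    }
    return clients[env]

# The agent definition stays the same; only the client swaps per environment.
env = os.environ.get("APP_ENV", "dev")
client = make_sandbox_client(env)
print(client)
```

The agent, manifest, and capabilities never change between environments; the factory is the only environment-aware piece.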

&lt;h2&gt;
  
  
  What file system tools, MCP integration, and storage systems does the SDK support?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  File system tools
&lt;/h3&gt;

&lt;p&gt;The SDK provides file system primitives that the agent uses to interact with workspace files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;File reads and writes&lt;/strong&gt; — read project directories, edit source files, create new files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apply patch&lt;/strong&gt; — apply diffs to workspace files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;View image&lt;/strong&gt; — inspect local images in the sandbox&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shell commands&lt;/strong&gt; — execute arbitrary commands with interactive input support&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  MCP integration
&lt;/h3&gt;

&lt;p&gt;MCP (Model Context Protocol) enables structured tool use for external APIs and services, as the newsletter notes.&lt;/p&gt;

&lt;p&gt;MCP servers connect through the SDK's integration layer, allowing agents to use tools from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Communication (Slack, Discord)&lt;/li&gt;
&lt;li&gt;Project management (Linear, Jira)&lt;/li&gt;
&lt;li&gt;Data sources (databases, Google Drive)&lt;/li&gt;
&lt;li&gt;Custom APIs (your internal services)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Storage systems
&lt;/h3&gt;

&lt;p&gt;The manifest supports mounting external storage directly into the sandbox:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mount type&lt;/th&gt;
&lt;th&gt;Use case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;S3 Mount&lt;/td&gt;
&lt;td&gt;Data room files, generated artifacts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GCS Mount&lt;/td&gt;
&lt;td&gt;Google Cloud Storage datasets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;R2 Mount&lt;/td&gt;
&lt;td&gt;Cloudflare storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Azure Blob&lt;/td&gt;
&lt;td&gt;Azure data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Box Mount&lt;/td&gt;
&lt;td&gt;Box cloud storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;S3 Files Mount&lt;/td&gt;
&lt;td&gt;Individual files from S3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;OpenAI's docs recommend: "Keep mounted storage scoped to the inputs the agent should read or write. Treat mount entries as ephemeral workspace entries."&lt;/p&gt;

&lt;h3&gt;
  
  
  Manifest
&lt;/h3&gt;

&lt;p&gt;The manifest describes the workspace contract for a fresh sandbox session — files, repos, input artifacts, output directories, environment variables, and OS users/groups. It's treated as a starting-point contract, not the full source of truth.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do you define an agent manifest with inputs, outputs, directory structure, and provider config?
&lt;/h2&gt;

&lt;p&gt;A manifest defines what the agent sees when a sandbox session starts. Here's a practical example from OpenAI's sandbox quickstart:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TypeScript:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;manifest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Manifest&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;entries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;account_brief.md&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;file&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;# Northwind Health&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;- Segment: Mid-market healthcare analytics provider.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;- Renewal date: 2026-04-15.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}),&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;implementation_risks.md&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;file&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;# Delivery risks&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;- Security questionnaire is not complete.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;- Procurement requires final legal language by April 1.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Python:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;manifest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Manifest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;entries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;account_brief.md&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;File&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;# Northwind Health&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;implementation_risks.md&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;File&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;# Delivery risks&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Manifest inputs cover:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Input type&lt;/th&gt;
&lt;th&gt;What it provides&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;File&lt;/code&gt; / &lt;code&gt;Dir&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Synthetic inputs, helper files, output directories&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local file/directory&lt;/td&gt;
&lt;td&gt;Host files materialized into sandbox&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Git repo&lt;/td&gt;
&lt;td&gt;Repository cloned into workspace&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage mounts&lt;/td&gt;
&lt;td&gt;S3, GCS, R2, Azure Blob, Box&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;environment&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Startup environment variables&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;users&lt;/code&gt; / &lt;code&gt;groups&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Sandbox-local OS accounts&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Design rules from OpenAI's docs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Put repos, input artifacts, and output directories in the manifest&lt;/li&gt;
&lt;li&gt;Put task specs and instructions in workspace files (&lt;code&gt;repo/task.md&lt;/code&gt;, &lt;code&gt;AGENTS.md&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Use relative workspace paths in instructions&lt;/li&gt;
&lt;li&gt;Keep mounts scoped to inputs the agent should use&lt;/li&gt;
&lt;li&gt;Avoid saving secrets, tokens, or sensitive files in the manifest&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How does credential isolation work across Cloudflare, Vercel, and custom deployment environments?
&lt;/h2&gt;

&lt;p&gt;Credential isolation is a first-class design concern in the sandbox architecture. The principle: &lt;strong&gt;credentials are runtime configuration, not prompt content.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;OpenAI's sandbox docs specify three rules:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Prefer provider-native secret systems&lt;/strong&gt; for hosted sandbox providers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep cloud storage credentials scoped&lt;/strong&gt; to the specific mount or provider option&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use &lt;code&gt;Manifest.environment&lt;/code&gt;&lt;/strong&gt; for startup values, marking sensitive entries as ephemeral&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;According to the newsletter, the SDK "keeps credentials outside execution environments where model-generated code runs." This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The agent prompt never contains API keys, tokens, or secrets&lt;/li&gt;
&lt;li&gt;Sandbox environment variables are injected by the provider, not by the model&lt;/li&gt;
&lt;li&gt;Cloud provider deployments (Cloudflare Workers, Vercel Functions) isolate credentials from sandbox compute&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The provider is part of run configuration, not agent definition. The same agent with the same manifest can run on UnixLocal for development, Docker for local container testing, and a hosted provider for production — credentials are configured per provider, per environment.&lt;/p&gt;
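In practice, "runtime configuration, not prompt content" means secrets enter through the environment and are handed to the run configuration, never to the model. A minimal sketch (the variable and function names here are illustrative, not SDK APIs):

```python
import os

# Credentials come from the runtime environment; in real use the
# deployment platform injects this value, not your code.
os.environ["STORAGE_TOKEN"] = "example-token"

def build_prompt(task):
    # The prompt carries only the task; no keys, tokens, or secrets.
    return "Task: " + task

def build_run_config():
    # Secrets are read at run time and passed as configuration,
    # scoped to the specific mount or provider that needs them.
    return {"storage_token": os.environ["STORAGE_TOKEN"]}

prompt = build_prompt("summarize the data room files")
config = build_run_config()
assert "example-token" not in prompt  # the secret never reaches the model
```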

&lt;p&gt;OpenAI's documentation warns: "Review artifacts before moving them out of the sandbox, especially when the agent can read private documents or mounted storage." The sandbox can access mounted data — your application should verify what comes out.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do you orchestrate multi-agent workflows with handoffs, guardrails, and human-in-the-loop approvals?
&lt;/h2&gt;

&lt;p&gt;The Agents SDK includes orchestration primitives that layer on top of the sandbox foundation:&lt;/p&gt;

&lt;h3&gt;
  
  
  Handoffs
&lt;/h3&gt;

&lt;p&gt;When a task requires multiple specialists, handoffs transfer control between agents. Each agent owns its domain. The harness routes based on the handoff target.&lt;/p&gt;

&lt;h3&gt;
  
  
  Guardrails
&lt;/h3&gt;

&lt;p&gt;Guardrails run before or after model turns to validate output or block unsafe actions. According to the SDK docs, guardrails and human review "block or pause before risky work continues."&lt;/p&gt;

&lt;h3&gt;
  
  
  Human-in-the-loop
&lt;/h3&gt;

&lt;p&gt;For high-risk operations, the workflow pauses for human approval. The sandbox state persists during the pause — when approved, the agent continues in the same workspace with the same files and context.&lt;/p&gt;
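The pause-and-resume behavior can be sketched as a gate in the loop, with workspace state held unchanged across the pause. This is a toy model of the pattern, not the SDK's approval API.

```python
# Toy approval gate: high-risk steps pause until approved, and the
# workspace dict persists unchanged across the pause.

def run_step(step, workspace, approve):
    if step["risky"] and not approve(step):
        return "paused"          # workspace is kept as-is while waiting
    workspace[step["writes"]] = step["value"]
    return "done"

workspace = {"report.md": "draft"}
step = {"risky": True, "writes": "report.md", "value": "final"}

# First attempt: the approver declines, so nothing changes.
assert run_step(step, workspace, lambda s: False) == "paused"
assert workspace["report.md"] == "draft"

# After approval, the agent continues in the same workspace.
assert run_step(step, workspace, lambda s: True) == "done"
assert workspace["report.md"] == "final"
```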

&lt;h3&gt;
  
  
  Capabilities
&lt;/h3&gt;

&lt;p&gt;Each sandbox agent gets capabilities attached to its definition:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;What it adds&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Shell&lt;/td&gt;
&lt;td&gt;Command execution with interactive input&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Filesystem&lt;/td&gt;
&lt;td&gt;File edits (apply_patch) and image viewing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Skills&lt;/td&gt;
&lt;td&gt;Skill discovery and materialization from local dirs or Git repos&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory&lt;/td&gt;
&lt;td&gt;Persist memory artifacts across runs (requires Shell + Filesystem)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compaction&lt;/td&gt;
&lt;td&gt;Context trimming for long-running flows&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;By default, a &lt;code&gt;SandboxAgent&lt;/code&gt; includes filesystem, shell, and compaction. If you pass a custom capabilities list, it replaces the defaults — include them explicitly if needed.&lt;/p&gt;
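The replace-not-merge behavior is worth internalizing. A toy model of it (the capability names follow the table above; the resolution logic is illustrative, not the SDK's):

```python
DEFAULTS = ["filesystem", "shell", "compaction"]

def resolve_capabilities(custom=None):
    # A custom list REPLACES the defaults rather than extending them.
    if custom is None:
        return list(DEFAULTS)
    return list(custom)

# Passing only ["memory"] silently drops filesystem, shell, compaction...
assert resolve_capabilities(["memory"]) == ["memory"]

# ...so re-include the defaults explicitly when adding a capability.
assert resolve_capabilities(DEFAULTS + ["memory"]) == [
    "filesystem", "shell", "compaction", "memory"
]
```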

&lt;h3&gt;
  
  
  Advanced patterns (from OpenAI's examples)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data room Q&amp;amp;A&lt;/strong&gt;: Answer questions over mounted documents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Repository code review&lt;/strong&gt;: Clone a repo, inspect it, produce review artifacts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vision website clone&lt;/strong&gt;: Clone a website using Vision API and screenshot feedback&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sandbox resume&lt;/strong&gt;: Resume work in a pre-existing sandbox session&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Q: Do I need a sandbox for every agent?
&lt;/h3&gt;

&lt;p&gt;No. If your agent only needs model responses without files, commands, or persistent state, use the Responses API directly or the basic Agents SDK runtime. Sandboxes are for when the answer depends on workspace work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Can I use the Agents SDK with non-OpenAI models?
&lt;/h3&gt;

&lt;p&gt;The SDK supports provider configuration, allowing different model providers per agent. Sandbox execution is independent of model choice — the harness handles tool routing regardless of which model generates the tool calls.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: How much do sandbox runs cost?
&lt;/h3&gt;

&lt;p&gt;Sandbox pricing depends on the provider (UnixLocal is free, hosted providers bill per session). OpenAI's API usage is separate from sandbox compute costs. Check provider-specific pricing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Can sandbox state survive between runs?
&lt;/h3&gt;

&lt;p&gt;Yes. Three persistence levels: RunState (harness-side state), serialized session state (reconnect to same sandbox), and snapshots (save workspace contents to seed a fresh session). Use snapshots to skip dependency installation on subsequent runs.&lt;/p&gt;
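The snapshot level can be pictured with a toy flow, where a dict stands in for the workspace; the real mechanism is provider-specific.

```python
import copy

# Session 1: do slow setup work, then snapshot the workspace contents.
workspace = {}
workspace["node_modules"] = "installed"   # expensive dependency install
workspace["notes.md"] = "first draft"
snapshot = copy.deepcopy(workspace)       # saved workspace state

# Session 2: seed a fresh sandbox from the snapshot and skip setup.
fresh = copy.deepcopy(snapshot)
assert fresh["node_modules"] == "installed"   # no reinstall needed
fresh["notes.md"] = "second draft"            # new work in the new session
assert snapshot["notes.md"] == "first draft"  # snapshot itself is unchanged
```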

&lt;h3&gt;
  
  
  Q: Is sandbox execution available in both TypeScript and Python SDKs?
&lt;/h3&gt;

&lt;p&gt;Yes. Both SDKs support the same sandbox primitives with language-idiomatic APIs. Official examples exist for both.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: How does this differ from Claude Code's sandbox approach?
&lt;/h3&gt;

&lt;p&gt;Both separate agent from execution, but OpenAI's SDK is a code-first framework you integrate into your application, while Claude Code is a product you run. OpenAI's approach gives you programmatic control over the harness, manifests, and provider selection.&lt;/p&gt;

&lt;h2&gt;
  
  
  Glossary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model-native harness&lt;/strong&gt;: The SDK runtime layer that handles tool dispatch, state persistence, and multi-step execution in a way that matches model behavior&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sandbox&lt;/strong&gt;: An isolated, Unix-like execution environment with filesystem, shell, packages, mounts, ports, and resumable state&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manifest&lt;/strong&gt;: The workspace contract describing what files, repos, mounts, and environment variables a fresh sandbox session starts with&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capabilities&lt;/strong&gt;: Sandbox-native behaviors attached to an agent (shell, filesystem, skills, memory, compaction)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Handoff&lt;/strong&gt;: Transfer of control between specialized agents within a multi-agent workflow&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Snapshot&lt;/strong&gt;: A saved workspace state used to seed a fresh sandbox session, skipping redundant setup&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Author
&lt;/h2&gt;

&lt;p&gt;Ramsis Hammadi — AI/ML engineer specializing in GenAI, LLM engineering, and automation. &lt;a href="https://dev.to/ramsishammadi"&gt;Full bio →&lt;/a&gt;&lt;/p&gt;

</description>
      <category>openai</category>
      <category>ai</category>
      <category>claude</category>
      <category>agents</category>
    </item>
    <item>
      <title>CLAUDE.md Rules: How to Cut AI Coding Mistakes from 40% to 3% in 2026</title>
      <dc:creator>Ramsis Hammadi</dc:creator>
      <pubDate>Fri, 15 May 2026 06:21:00 +0000</pubDate>
      <link>https://dev.to/rams901/claudemd-rules-how-to-cut-ai-coding-mistakes-from-40-to-3-in-2026-2j7o</link>
      <guid>https://dev.to/rams901/claudemd-rules-how-to-cut-ai-coding-mistakes-from-40-to-3-in-2026-2j7o</guid>
      <description>&lt;h2&gt;
  
  
  CLAUDE.md Rules: How to Cut AI Coding Mistakes from 40% to 3% in 2026
&lt;/h2&gt;

&lt;h2&gt;
  
  
  TL;DR Summary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Andrej Karpathy's original 4-rule CLAUDE.md cut Claude coding errors from &lt;strong&gt;~40% to ~11%&lt;/strong&gt; by enforcing clarification, simplicity, surgical scope, and verification&lt;/li&gt;
&lt;li&gt;The 12-rule extension (claude-code-pro-pack) adds 8 more rules targeting agent-orchestration failures and pushes error rates to &lt;strong&gt;~3%&lt;/strong&gt; — a ~10x improvement over no rules&lt;/li&gt;
&lt;li&gt;Two leading open-source implementations exist: the &lt;strong&gt;12-Rule Pro Pack&lt;/strong&gt; (~700 tokens, 5 skill templates, Karpathy-provenance) and &lt;strong&gt;Ten Commandments for Coding Agents&lt;/strong&gt; (~400 tokens, portable across all agents.md tools)&lt;/li&gt;
&lt;li&gt;The key insight: past ~200 lines of CLAUDE.md, &lt;strong&gt;compliance drops sharply&lt;/strong&gt; — rules get buried. 12 rules with minimal boilerplate is the sweet spot&lt;/li&gt;
&lt;li&gt;These are drop-in files. Copy one into your project root. The agent picks it up on the next run. No framework, no config.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Direct Answer Block
&lt;/h2&gt;

&lt;p&gt;CLAUDE.md is a markdown file in your project root that AI coding agents read at session start. Karpathy's original 4 rules addressed the highest-frequency failure modes: silent assumptions, overbuilt code, unintended edits, and unverified claims. The 12-rule extension layers agent-orchestration safeguards: token budget limits to stop debugging spirals, conflict-surfacing to prevent "averaging" two codebase patterns, and read-before-write to block uninformed edits. Together they form a behavioral contract between you and the AI agent — and the data says it works.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;You've experienced it: you ask an AI coding agent to fix a one-line bug, and it rewrites three functions, reformats adjacent code, adds a "helpful" abstraction layer, and introduces two new edge cases. The problem isn't the model — it's the absence of constraints. AI coding agents are &lt;strong&gt;prompt-optimizers&lt;/strong&gt;: they fill ambiguity with creativity. CLAUDE.md removes the ambiguity. It replaces "be careful" with concrete, actionable, negative-example-rich directives that survive long conversational contexts. This article breaks down the rules that actually work, the failure mode each one closes, and how to choose between the two leading implementations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why do AI coding agents keep making the same mistakes — and how does CLAUDE.md fix this at the system level?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fczmawukka8uvs2g2frqj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fczmawukka8uvs2g2frqj.png" alt="Bar chart showing coding error rates dropping from ~40% (no rules) to ~11% (4 rules) to ~3% (12 rules)" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AI coding agents fail in predictable patterns. The Claude Code Pro Pack's documentation — built from real-world agent failures across 30+ codebases — identifies four root causes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Silent assumptions&lt;/strong&gt;: The agent guesses your intent when requirements are vague. It builds what it &lt;em&gt;thinks&lt;/em&gt; you want, not what you &lt;em&gt;actually&lt;/em&gt; want.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Overbuilt code&lt;/strong&gt;: A simple feature request triggers a cascade of "while I'm here" improvements — abstractions, refactors, helper utilities — none of which you asked for.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unintended edits&lt;/strong&gt;: The agent touches adjacent code, renames variables, reformats files, and cleans up "messy" patterns that were intentional.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scope creep&lt;/strong&gt;: A focused task ("add error logging to the payment handler") expands into a system-wide logging framework with configurable backends.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;CLAUDE.md works as a &lt;strong&gt;behavioral control layer&lt;/strong&gt; rather than a prompt. Traditional prompting says "please do X carefully." CLAUDE.md says:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Surface uncertainty — if requirements are unclear, ask"&lt;/li&gt;
&lt;li&gt;"Keep changes surgical — touch only what the task requires"&lt;/li&gt;
&lt;li&gt;"Choose simplicity — write the minimum code that correctly solves the problem"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The difference is specificity. "Be careful" doesn't survive 50 turns of conversation. "Do not refactor, rename, reformat, or clean unrelated code" does.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Past ~200 lines of CLAUDE.md, compliance drops sharply — rules get buried. The pack holds at 12 rules + minimal boilerplate so the agent actually reads and follows the file." — claude-code-pro-pack README&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This token-efficiency constraint is underappreciated. CLAUDE.md is prepended to every agent context, so every line costs tokens on every call. The 12-rule pack clocks in at ~700 tokens total, roughly the cost of a couple of paragraphs of prose. The Ten Commandments version is even leaner at ~400 tokens.&lt;/p&gt;
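&lt;p&gt;You can sanity-check that overhead yourself. A minimal sketch, assuming the rough 4-characters-per-token heuristic rather than a real tokenizer:&lt;/p&gt;

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough per-call overhead of a rules file, using the common
    ~4-characters-per-token heuristic (an approximation, not a tokenizer)."""
    return round(len(text) / chars_per_token)

# A 12-rule file of one-sentence directives stays small.
twelve_rules = "\n".join(
    f"{i}. One imperative sentence per rule keeps the file lean." for i in range(1, 13)
)
print(estimate_tokens(twelve_rules))
```

&lt;p&gt;Run it against your own CLAUDE.md before committing new rules.&lt;/p&gt;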

&lt;p&gt;According to Anthropic's Claude Code documentation, CLAUDE.md is one of the primary customization mechanisms alongside skills, hooks, and MCP servers. It's the first thing Claude reads when a session starts. The file sits in your project root or &lt;code&gt;~/.claude/&lt;/code&gt; and is automatically loaded — no plugin, no &lt;code&gt;/import&lt;/code&gt;, no configuration.&lt;/p&gt;

&lt;h2&gt;
  
  
  What were Karpathy's original 4 rules, and how did they cut error rates from 40% to 11%?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmewc7k32dorfimh624vc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmewc7k32dorfimh624vc.png" alt="Numbered list of 4 rules with short code examples showing before/after of each rule being applied" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Karpathy's original CLAUDE.md established four rules as the minimum viable constraint set:&lt;/p&gt;

&lt;h3&gt;
  
  
  Rule 1: Clarify before implementing
&lt;/h3&gt;

&lt;p&gt;The agent must restate the problem, goal, and expected outcome before writing code. This blocks the silent assumption failure mode. If the agent restates something wrong, you catch it before a single file changes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rule 2: Simplicity first
&lt;/h3&gt;

&lt;p&gt;The agent must write the minimum code that solves the problem. No speculative features, no generic abstractions, no "future-proofing." This blocks overbuilt code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rule 3: Surgical changes only
&lt;/h3&gt;

&lt;p&gt;The agent must touch only what the task requires. Match existing style. Do not refactor, rename, reformat, or clean unrelated code. This blocks unintended edits.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rule 4: Verify before claiming success
&lt;/h3&gt;

&lt;p&gt;The agent must run tests, lint, type checks, and confirm output before reporting completion. This blocks the "I fixed it" (didn't run anything) failure.&lt;/p&gt;

&lt;p&gt;The 4 rules cut error rates from ~40% to ~11% because they target the four highest-frequency failure categories. Each rule is a &lt;strong&gt;negative constraint&lt;/strong&gt; — it tells the agent what NOT to do — which in practice steers agent behavior more reliably than positive guidance ("be helpful").&lt;/p&gt;
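&lt;p&gt;Condensed into CLAUDE.md form, the four rules read as short imperative directives (a paraphrase for illustration, not Karpathy's verbatim wording):&lt;/p&gt;

```markdown
1. Restate the problem, goal, and expected outcome before writing code. Ask if anything is unclear.
2. Write the minimum code that correctly solves the problem. No speculative features or abstractions.
3. Touch only what the task requires. Match existing style. Do not refactor, rename, reformat, or clean unrelated code.
4. Run tests, lint, and type checks before reporting completion. Never claim success without verified output.
```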

&lt;p&gt;The remaining ~11% of errors come from failure modes the original rules don't cover: debugging spirals (the agent loops on a bug, burning tokens), pattern pollution (the agent sees two codebase patterns and averages them), silent partial failures (the agent catches one error but misses its downstream effects), and duplicate-function drift (creating near-identical functions in different files).&lt;/p&gt;

&lt;h2&gt;
  
  
  What 8 additional rules does the 12-rule pro pack add, and which failure mode does each address?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvw072czloxq612fgs1yk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvw072czloxq612fgs1yk.png" alt="A diagram showing 4 original rules plus 8 new rules organized by failure mode category (reasoning, execution, validation, safety)" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The claude-code-pro-pack extends Karpathy's 4 rules with 8 more, each targeting a specific agent-orchestration failure:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rule&lt;/th&gt;
&lt;th&gt;What it addresses&lt;/th&gt;
&lt;th&gt;The failure it closes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;5. Hard token budget&lt;/td&gt;
&lt;td&gt;Token-spiral debugging&lt;/td&gt;
&lt;td&gt;Agent loops 20+ iterations on a bug, burning 100K tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6. Surface conflicts, don't average&lt;/td&gt;
&lt;td&gt;Two-pattern pollution&lt;/td&gt;
&lt;td&gt;Agent sees two conventions in codebase and produces a third&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7. Read before you write&lt;/td&gt;
&lt;td&gt;Uninformed edits&lt;/td&gt;
&lt;td&gt;Agent modifies a function without understanding its callers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8. Tests gated by correctness, not "pass"&lt;/td&gt;
&lt;td&gt;Fake green tests&lt;/td&gt;
&lt;td&gt;Agent writes a test that passes trivially but doesn't verify the fix&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9. Long-running operations need checkpoints&lt;/td&gt;
&lt;td&gt;Lost progress on failure&lt;/td&gt;
&lt;td&gt;A 50-file refactor fails at file 47 with no saved state&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10. Convention beats novelty&lt;/td&gt;
&lt;td&gt;Inconsistent codebase&lt;/td&gt;
&lt;td&gt;Agent introduces new patterns that clash with existing conventions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11. Fail visibly, not silently&lt;/td&gt;
&lt;td&gt;Silent partial failures&lt;/td&gt;
&lt;td&gt;Error swallowed by try/catch, agent reports success&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12. Don't make the model do non-language work&lt;/td&gt;
&lt;td&gt;Inefficient task routing&lt;/td&gt;
&lt;td&gt;Agent uses LLM loop for retries/validation instead of deterministic code&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The most impactful of these in practice is rule 5 — &lt;strong&gt;hard token budget&lt;/strong&gt;. The agent's natural response to a failing test is "try again." Without a budget, this becomes a spiral: try, fail, try differently, fail, until context exhaustion. The rule forces the agent to stop after a defined number of attempts and surface the impasse to the user.&lt;/p&gt;
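&lt;p&gt;One possible phrasing of that budget rule (illustrative wording, not the pack's exact text):&lt;/p&gt;

```markdown
5. Hard debugging budget: after 3 failed fix attempts on the same bug, stop, summarize what was tried, and ask how to proceed.
```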

&lt;p&gt;Rule 7 — &lt;strong&gt;read before you write&lt;/strong&gt; — prevents the most common "confident wrong answer" scenario: the agent modifies a function signature without checking its call sites, breaking the build in files it never touched.&lt;/p&gt;

&lt;p&gt;The full rationale for each rule is documented in the pro pack's &lt;code&gt;docs/why-12-rules.md&lt;/code&gt;, with every rule citing a real failure it closes rather than a preference.&lt;/p&gt;
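&lt;p&gt;Rule 12 in practice means moving retries and validation out of the model loop and into ordinary code. A minimal sketch of a deterministic retry wrapper (the function names are illustrative, not from the pack):&lt;/p&gt;

```python
import time

def call_with_retry(fn, max_attempts: int = 3, base_delay: float = 0.5):
    """Deterministic retry with exponential backoff.

    The harness decides when to retry and when to give up; the model
    is only asked to do language work inside fn.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # fail visibly (rule 11) instead of swallowing the error
            time.sleep(base_delay * 2 ** (attempt - 1))
```

&lt;p&gt;Three attempts, visible failure, and zero LLM tokens spent on the retry logic itself.&lt;/p&gt;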

&lt;h2&gt;
  
  
  How do the "Ten Commandments for Coding Agents" differ from the 12-rule approach — and which should you use?
&lt;/h2&gt;

&lt;p&gt;Both approaches are drop-in, open-source, MIT-licensed constraint files. They differ in philosophy, scope, and tooling:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;12-Rule Pro Pack&lt;/th&gt;
&lt;th&gt;Ten Commandments&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Rule count&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Token cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~700 tokens&lt;/td&gt;
&lt;td&gt;~400 tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Philosophy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Extension of Karpathy's work&lt;/td&gt;
&lt;td&gt;"Smallest set of rules that blocks all failures"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Skill templates&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;5 example skills (TDD, debugging, PR workflow, etc.)&lt;/td&gt;
&lt;td&gt;None — rules only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Install method&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Copy file or GitHub Action&lt;/td&gt;
&lt;td&gt;curl one-liner or git clone + symlink&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cross-tool support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Claude, Codex, Cursor, Hermes, Copilot&lt;/td&gt;
&lt;td&gt;All agents.md readers (Claude, Codex, Gemini CLI, OpenCode, Cursor)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Negative examples&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Per-rule failure modes in separate doc&lt;/td&gt;
&lt;td&gt;Inline within some rules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Repository rules section&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Project-specific block at bottom (edit for your team)&lt;/td&gt;
&lt;td&gt;Same — project conventions section&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Standout feature&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Includes &lt;code&gt;docs/adoption-guide.md&lt;/code&gt; for 10-min team setup&lt;/td&gt;
&lt;td&gt;Symlink strategy for single-source-of-truth across multiple CLIs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Which should you use?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Use the 12-Rule Pro Pack if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You want the most comprehensive coverage (every known failure mode addressed)&lt;/li&gt;
&lt;li&gt;You want skill templates (TDD loop, systematic debugging, PR workflow) included&lt;/li&gt;
&lt;li&gt;Your team is 3+ developers and needs a shared behavior baseline&lt;/li&gt;
&lt;li&gt;You want explicit Karpathy provenance (built on the original 4 rules)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use the Ten Commandments if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You use multiple AI coding tools across your workflow (the symlink trick is elegant)&lt;/li&gt;
&lt;li&gt;Token efficiency matters — 400 tokens is about half the cost of the 12-rule pack&lt;/li&gt;
&lt;li&gt;You prefer the "commandments" framing — imperative directives with named failure modes inline&lt;/li&gt;
&lt;li&gt;You're a solo developer who wants minimal overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both work. The Ten Commandments author themselves note: "If your fork grows past ~20 rules, you have a wiki, not a system prompt." The 12-rule pack author says: "Use all three — pack for behavior, anthropic/skills for domain tasks, addyosmani/agent-skills for lifecycle flow."&lt;/p&gt;
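&lt;p&gt;The symlink strategy looks roughly like this (the directories below are stand-ins; point them at your real dotfiles repo and project checkouts):&lt;/p&gt;

```shell
# One canonical rules file; each tool-specific filename is a symlink to it.
RULES_DIR=$(mktemp -d)     # stand-in for e.g. ~/agent-rules (your cloned repo)
PROJECT_DIR=$(mktemp -d)   # stand-in for a project checkout
printf '%s\n' '1. Clarify before implementing.' > "$RULES_DIR/AGENTS.md"
ln -s "$RULES_DIR/AGENTS.md" "$PROJECT_DIR/AGENTS.md"  # Codex, Gemini CLI, OpenCode, Cursor
ln -s "$RULES_DIR/AGENTS.md" "$PROJECT_DIR/CLAUDE.md"  # Claude Code
cat "$PROJECT_DIR/CLAUDE.md"   # both names resolve to the same content
```

&lt;p&gt;Editing the canonical file updates every tool at once — the single-source-of-truth property the table above refers to.&lt;/p&gt;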

&lt;h2&gt;
  
  
  What does a surgical change look like in practice (and what happens when agents ignore the surgical-change rule)?
&lt;/h2&gt;

&lt;p&gt;Rule 3 of the 12-rule pack and its Ten Commandments counterpart address the same failure: &lt;strong&gt;keep changes surgical&lt;/strong&gt;. Here's a concrete example.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Task:&lt;/strong&gt; Fix a null-pointer exception in &lt;code&gt;PaymentService.processRefund()&lt;/code&gt; when &lt;code&gt;transaction.merchant&lt;/code&gt; is null after a failed payment gateway response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent without surgical-change rules (what happens):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="gd"&gt;- public RefundResult processRefund(Transaction transaction) {
&lt;/span&gt;&lt;span class="gi"&gt;+ public RefundResult processRefund(Transaction transaction) {
+     // Added null safety
+     if (transaction == null) throw new IllegalArgumentException("transaction required");
&lt;/span&gt;      RefundRequest request = buildRequest(transaction);
&lt;span class="gd"&gt;-     GatewayResponse response = gateway.refund(request);
&lt;/span&gt;&lt;span class="gi"&gt;+     GatewayResponse response = gateway.refund(request);
+     // Extracted to separate method for testability
+     return handleGatewayResponse(response, transaction);
+ }
+
+ private RefundResult handleGatewayResponse(GatewayResponse response, Transaction tx) {
+     if (response.isSuccess()) {
+         return RefundResult.success(tx.getId());
+     }
+     // Added merchant null handling
+     if (tx.getMerchant() == null) {
+         log.warn("Merchant information missing for transaction {}", tx.getId());
+     }
+     return RefundResult.failure(response.getError());
&lt;/span&gt;  }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three things happened that weren't asked for: (1) the method was split in two, (2) a new null check was added at the top, (3) the &lt;code&gt;transaction&lt;/code&gt; parameter was renamed to &lt;code&gt;tx&lt;/code&gt; inside the extracted helper. This touches lines that didn't need changing and introduces a new method the team didn't agree on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent with surgical-change rules (what was asked for):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;  public RefundResult processRefund(Transaction transaction) {
      RefundRequest request = buildRequest(transaction);
      GatewayResponse response = gateway.refund(request);
&lt;span class="gd"&gt;-     return RefundResult.success(transaction.getId());
&lt;/span&gt;&lt;span class="gi"&gt;+     if (response.isSuccess()) {
+         return RefundResult.success(transaction.getId());
+     }
+     if (transaction.getMerchant() == null) {
+         log.warn("Merchant information missing for transaction {}", transaction.getId());
+     }
+     return RefundResult.failure(response.getError());
&lt;/span&gt;  }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One change, directly addressing the null pointer. No extracted methods, no input validation refactor, no variable renames.&lt;/p&gt;

&lt;p&gt;The surgical approach isn't about writing worse code — it's about &lt;strong&gt;scope discipline&lt;/strong&gt;. The refactored version might be genuinely better code. But when an AI agent introduces structural changes you didn't ask for, you lose the ability to reason about what else might have changed. The surgical rule preserves your ability to review the diff with confidence that everything you see is intentional.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do you customize CLAUDE.md rules for your specific stack without breaking the system?
&lt;/h2&gt;

&lt;p&gt;Customization follows two levels: repository rules and fork-and-extend.&lt;/p&gt;

&lt;h3&gt;
  
  
  Level 1: Repository rules (edit in place)
&lt;/h3&gt;

&lt;p&gt;Both the 12-rule pack and Ten Commandments include a "Repository Rules" section at the bottom for project-specific conventions. Edit these without touching the core rules:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Repository Rules&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; We use pnpm, not npm or yarn. Use &lt;span class="sb"&gt;`pnpm install`&lt;/span&gt;, &lt;span class="sb"&gt;`pnpm test`&lt;/span&gt;, etc.
&lt;span class="p"&gt;-&lt;/span&gt; Never modify &lt;span class="sb"&gt;`schema.prisma`&lt;/span&gt; directly — use &lt;span class="sb"&gt;`pnpm db migrate`&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Test files live next to their source files, not in a &lt;span class="sb"&gt;`__tests__`&lt;/span&gt; directory
&lt;span class="p"&gt;-&lt;/span&gt; Prefer server components over client components. Only add 'use client' when necessary
&lt;span class="p"&gt;-&lt;/span&gt; Auth is handled by NextAuth.js with the credentials provider. Do not add new auth libraries
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These should be &lt;strong&gt;imperative directives&lt;/strong&gt;, not descriptions. "Use X, not Y" works. "We use X for Y" gets ignored by the agent after 20 turns of context.&lt;/p&gt;

&lt;h3&gt;
  
  
  Level 2: Fork and extend (add custom rules)
&lt;/h3&gt;

&lt;p&gt;If you encounter a failure mode the existing rules don't cover, fork and add a rule. The criterion for adding a new rule:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;One sentence&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maps to a real incident&lt;/strong&gt; (not a hypothetical preference)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Does not duplicate an existing rule&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Example of a good custom rule:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="p"&gt;13.&lt;/span&gt; Never import from barrel files in package internals. Use direct imports to avoid circular dependency cycles.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This maps to a real incident (your build broke from a circular dependency), is one sentence, and doesn't duplicate any existing rule.&lt;/p&gt;

&lt;p&gt;Example of a bad custom rule:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="p"&gt;13.&lt;/span&gt; Write good code that follows best practices and is maintainable over time.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a preference, not a directive. It doesn't map to a specific failure mode. The agent will ignore it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Anti-patterns to avoid
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Don't add tool-specific rules&lt;/strong&gt;: "Use &lt;code&gt;npm test&lt;/code&gt; not &lt;code&gt;jest&lt;/code&gt;" belongs in Repository Rules, not as a new commandment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't add style rules&lt;/strong&gt;: Prettier and ESLint handle formatting; CLAUDE.md shouldn't&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't go past ~15 rules&lt;/strong&gt;: If you have 20 rules, audit them. Cut the ones that haven't prevented a real incident&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't describe your architecture&lt;/strong&gt;: "We use hexagonal architecture with domain-driven design" is a wiki page, not a behavioral constraint&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;cc-audit&lt;/code&gt; tool (from the pro pack ecosystem) scores any CLAUDE.md against the 12-rule baseline — use it in CI to enforce rule quality across your team.&lt;/p&gt;
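&lt;p&gt;If you can't run &lt;code&gt;cc-audit&lt;/code&gt; in your pipeline, even a small deterministic check catches the worst drift. A hypothetical stand-in (not &lt;code&gt;cc-audit&lt;/code&gt; itself) that enforces the compliance-cliff thresholds discussed above:&lt;/p&gt;

```python
import re

def audit_rules_file(text: str, max_lines: int = 200, max_rules: int = 15):
    """Flag a rules file that has drifted past the compliance cliff."""
    problems = []
    lines = text.splitlines()
    if len(lines) > max_lines:
        problems.append(f"{len(lines)} lines exceeds {max_lines}; rules get buried")
    numbered = [ln for ln in lines if re.match(r"\s*\d+\.", ln)]
    if len(numbered) > max_rules:
        problems.append(f"{len(numbered)} numbered rules exceeds {max_rules}; audit and cut")
    return problems  # empty list means pass
```

&lt;p&gt;Read &lt;code&gt;CLAUDE.md&lt;/code&gt; in a CI step and exit non-zero when the returned list is non-empty.&lt;/p&gt;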

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Q: Does CLAUDE.md work with non-Claude tools like Cursor or Codex?
&lt;/h3&gt;

&lt;p&gt;Yes. Both Cursor and Codex read AGENTS.md or CLAUDE.md from your project root. The Ten Commandments repo maintains identical content under both filenames specifically for cross-tool compatibility. The 12-rule pack provides both CLAUDE.md and AGENTS.md variants.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Can CLAUDE.md rules conflict with my existing .cursorrules or copilot-instructions?
&lt;/h3&gt;

&lt;p&gt;They can. If your .cursorrules says "add comprehensive error handling" and your CLAUDE.md says "choose simplicity," the agent may produce inconsistent output. Pick one behavioral baseline and use it everywhere. The &lt;code&gt;arai&lt;/code&gt; tool can enforce instruction files via hooks to prevent conflicts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Will these rules make the agent too conservative and miss edge cases?
&lt;/h3&gt;

&lt;p&gt;No. The rules block unwanted behavior, not necessary behavior. An agent with surgical-change rules will still handle edge cases — it just won't restructure your codebase while doing it. The hard token budget rule prevents spiraling, not standard error handling.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: How do I verify my CLAUDE.md is actually working?
&lt;/h3&gt;

&lt;p&gt;Watch for reduced chatter. An effective CLAUDE.md produces fewer clarifying questions, shorter diffs, and higher first-attempt success rates. The &lt;code&gt;cc-audit&lt;/code&gt; tool provides quantitative scoring. Empirically, if your agent produces 3-line diffs instead of 30-line diffs for bug fixes, the rules are working.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Can I use different rule sets per project?
&lt;/h3&gt;

&lt;p&gt;Yes. CLAUDE.md files are project-scoped. Have a strict 12-rule set for your production monorepo and a lightweight 4-rule set for your experimental side projects. You can also have a global &lt;code&gt;~/.claude/CLAUDE.md&lt;/code&gt; with baseline rules that all projects inherit.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Do these rules work with non-English prompts?
&lt;/h3&gt;

&lt;p&gt;The rules are language-agnostic — they constrain behavior, not output language. The Ten Commandments repository includes a Korean translation (README.ko.md) demonstrating cross-language applicability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Glossary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CLAUDE.md&lt;/strong&gt;: A markdown file in your project root or &lt;code&gt;~/.claude/&lt;/code&gt; that Claude Code reads at the start of every session, containing behavioral rules and project conventions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AGENTS.md&lt;/strong&gt;: The emerging cross-tool equivalent of CLAUDE.md, read by Codex, Gemini CLI, OpenCode, and Cursor&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Surgical change&lt;/strong&gt;: A code modification that touches only what the task requires, matching existing style without refactoring adjacent code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token budget&lt;/strong&gt;: A hard limit on consecutive debugging attempts, preventing the agent from spiraling into infinite retry loops&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Two-pattern pollution&lt;/strong&gt;: When an agent encounters two different conventions in a codebase and produces a third, averaging them instead of picking one&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rule compliance cliff&lt;/strong&gt;: The threshold (~200 lines or ~15 rules) beyond which AI agents stop consistently following CLAUDE.md directives&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Author
&lt;/h2&gt;

&lt;p&gt;Ramsis Hammadi — AI/ML engineer specializing in GenAI, LLM engineering, and automation. &lt;a href="https://dev.to/ramsishammadi"&gt;Full bio →&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Claude Code Ultraplan: Cloud-Based AI Planning in 2026 — A Hands-On Tutorial</title>
      <dc:creator>Ramsis Hammadi</dc:creator>
      <pubDate>Thu, 14 May 2026 08:33:19 +0000</pubDate>
      <link>https://dev.to/rams901/claude-code-ultraplan-cloud-based-ai-planning-in-2026-a-hands-on-tutorial-4id6</link>
      <guid>https://dev.to/rams901/claude-code-ultraplan-cloud-based-ai-planning-in-2026-a-hands-on-tutorial-4id6</guid>
      <description>&lt;h2&gt;
  
  
  Claude Code Ultraplan: Cloud-Based AI Planning in 2026 — A Hands-On Tutorial
&lt;/h2&gt;

&lt;h2&gt;
  
  
  TL;DR Summary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Ultraplan offloads Claude Code's planning phase to a &lt;strong&gt;cloud session&lt;/strong&gt;, keeping your terminal free while a structured plan is drafted remotely&lt;/li&gt;
&lt;li&gt;You review plans in your &lt;strong&gt;browser&lt;/strong&gt; with inline comments, emoji reactions, and section-level navigation — a richer surface than terminal text&lt;/li&gt;
&lt;li&gt;Three ways to launch: the &lt;code&gt;/ultraplan&lt;/code&gt; command, the &lt;code&gt;ultraplan&lt;/code&gt; keyword in any prompt, or from a local plan's approval dialog&lt;/li&gt;
&lt;li&gt;You choose where to execute: &lt;strong&gt;in the cloud&lt;/strong&gt; (with PR creation) or &lt;strong&gt;teleport back to your terminal&lt;/strong&gt; (with full local environment access)&lt;/li&gt;
&lt;li&gt;Requires Claude Code v2.1.91+, a GitHub repo, and a Claude.ai account. Not available on Bedrock, Vertex, or Foundry&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Direct Answer Block
&lt;/h2&gt;

&lt;p&gt;Ultraplan is Anthropic's research preview feature that separates AI planning from execution by drafting structured plans in a cloud-based Claude Code session. You type a task in your CLI, Claude researches and drafts a plan remotely on Anthropic's infrastructure, and you review the plan in your browser — commenting on specific sections, asking for revisions, then choosing whether to execute in the cloud or pull the plan back to your terminal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Most AI coding tools conflate planning and execution. You describe a task, the agent starts editing files immediately, and if the plan is wrong you discover it 15 minutes into a broken implementation. Ultraplan breaks that cycle. It gives you a browser-based review surface where you can inspect every section of a plan, comment on specific parts, and iterate before a single line of code changes. For complex multi-step changes — migrations, refactors, architectural shifts — this changes the review dynamic entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Anthropic Ultraplan and what problem does it actually solve?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbcbg5fp9hafdprahoaa5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbcbg5fp9hafdprahoaa5.png" alt="Architecture diagram showing CLI -&amp;gt; Cloud Session -&amp;gt; Browser Review -&amp;gt; Execution flow" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ultraplan is not just "plan mode in the cloud." It's a &lt;strong&gt;structural separation of planning from execution&lt;/strong&gt; that solves a specific terminal UX limitation: long plans are hard to review in a 24-line scrollable prompt window.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"You write a task in the CLI, and Claude drafts a structured plan in a cloud session you can review in your browser. This separates planning from execution and gives you a clearer way to inspect and edit multi-step changes before running them." — AlphaSignal summary of Anthropic's Ultraplan announcement&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The core problem it addresses is &lt;strong&gt;plan review friction&lt;/strong&gt;. In local plan mode, Claude produces a plan in the terminal — you read paragraphs of text, type a response, and Claude re-drafts. If you want to comment on a specific section, you have to quote it or describe its position. With Ultraplan, the plan appears in a web interface where you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Highlight any passage and leave an inline comment&lt;/strong&gt; for Claude to address&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;React with emojis&lt;/strong&gt; (thumbs up, thinking face) to signal approval or concern without writing full feedback&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Jump between sections&lt;/strong&gt; via an outline sidebar&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This matters most for changes spanning 5+ files with architectural implications. The terminal review of such a plan takes patience; the browser review takes seconds.&lt;/p&gt;

&lt;p&gt;According to Anthropic's docs, the cloud session runs on your account's default cloud environment. If you don't have one, Ultraplan creates it automatically on first launch.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do you launch Ultraplan from the CLI (and what are the three ways to trigger it)?
&lt;/h2&gt;

&lt;p&gt;There are three trigger methods, from explicit to incidental:&lt;/p&gt;

&lt;h3&gt;
  
  
  Method 1: The &lt;code&gt;/ultraplan&lt;/code&gt; command (explicit)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/ultraplan migrate the auth service from sessions to JWTs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the most intentional path. Type the slash command with your task, confirm the dialog, and Claude launches a remote session. The CLI shows a status indicator:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;◇ ultraplan&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Claude is researching your codebase and drafting the plan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;◇ ultraplan needs your input&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Claude has a clarifying question; open the session link&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;◆ ultraplan ready&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The plan is ready to review in your browser&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Method 2: The &lt;code&gt;ultraplan&lt;/code&gt; keyword (implicit)
&lt;/h3&gt;

&lt;p&gt;Include the word "ultraplan" anywhere in a normal prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Help me plan a refactor of the payment service — use ultraplan
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same result, less typing. This path also shows a confirmation dialog before launching.&lt;/p&gt;

&lt;h3&gt;
  
  
  Method 3: From a local plan (iterative)
&lt;/h3&gt;

&lt;p&gt;When Claude finishes a local plan and presents the approval dialog, select &lt;strong&gt;No, refine with Ultraplan on Claude Code on the web&lt;/strong&gt;. This sends your existing draft to the cloud for richer iteration. This path skips the confirmation dialog since selecting the option is already consent.&lt;/p&gt;

&lt;p&gt;Run &lt;code&gt;/tasks&lt;/code&gt; at any point to see the Ultraplan entry, open detail view with the session link, or stop the plan. Stopping archives the cloud session and clears the indicator; nothing is saved to your terminal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important constraint:&lt;/strong&gt; If Remote Control is active, it disconnects when Ultraplan starts because both features occupy the claude.ai/code interface — only one can be connected at a time.&lt;/p&gt;

&lt;h2&gt;
  
  
  How does reviewing and iterating on a plan work in the browser?
&lt;/h2&gt;

&lt;p&gt;When the status changes to &lt;code&gt;◆ ultraplan ready&lt;/code&gt;, open the session link. The plan appears in three zones:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The plan document&lt;/strong&gt; — a structured breakdown of the proposed changes, organized by sections (migration steps, file changes, testing strategy, risks)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inline comments&lt;/strong&gt; — highlight any text, leave a comment, and Claude revises that specific section in response&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outline sidebar&lt;/strong&gt; — navigable section index for jumping between parts of the plan&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here's the iteration loop:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You highlight a section like "Proposed database migration: 3-step rollout with rollback" and comment: &lt;em&gt;This assumes zero-downtime. Can we add a step for staging validation first?&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Claude revises the plan, inserting the staging validation step&lt;/li&gt;
&lt;li&gt;You react with a thinking face emoji on the rollback strategy, signaling uncertainty&lt;/li&gt;
&lt;li&gt;Claude proposes an alternative rollback mechanism&lt;/li&gt;
&lt;li&gt;You approve&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is fundamentally different from terminal-based iteration. In the terminal, you'd need to read the full plan, type a revision request covering multiple sections, and hope Claude understood which sections you meant. In the browser, your feedback is surgically attached to specific text.&lt;/p&gt;

&lt;p&gt;The plan document also supports &lt;strong&gt;emoji reactions&lt;/strong&gt; at the section level. These are lightweight signals — a thumbs up means "this section looks right," a thinking face means "reconsider this" — that let you communicate without typing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Should you execute your Ultraplan in the cloud or teleport it back to your terminal?
&lt;/h2&gt;

&lt;p&gt;When the plan is approved, you choose between two execution paths in the browser:&lt;/p&gt;

&lt;h3&gt;
  
  
  Execute on the web
&lt;/h3&gt;

&lt;p&gt;Select &lt;strong&gt;Approve Claude's plan and start coding&lt;/strong&gt; to have Claude implement the plan in the same cloud session. Your terminal shows a confirmation, the status indicator clears, and work continues in the cloud. When done, you review the diff and create a pull request from the web interface.&lt;/p&gt;

&lt;p&gt;Best for: When you don't need local access (environment variables, private dependencies, local services). The cloud session has your repo but not your machine's runtime.&lt;/p&gt;

&lt;h3&gt;
  
  
  Teleport back to terminal
&lt;/h3&gt;

&lt;p&gt;Select &lt;strong&gt;Approve plan and teleport back to terminal&lt;/strong&gt; to pull the plan into your local CLI session. The cloud session is archived, and your terminal shows three options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Implement here&lt;/strong&gt;: inject the plan into your current conversation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Start new session&lt;/strong&gt;: clear context, begin fresh with only the plan&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cancel&lt;/strong&gt;: save the plan to a file without executing (Claude prints the file path)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you start a new session, Claude prints a &lt;code&gt;claude --resume&lt;/code&gt; command so you can return to your previous conversation.&lt;/p&gt;

&lt;p&gt;Best for: When you need local environment access, private dependencies, or are running integration tests against local services. The plan lands in your terminal with full access to your machine.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Factor&lt;/th&gt;
&lt;th&gt;Web Execution&lt;/th&gt;
&lt;th&gt;Terminal Execution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Environment access&lt;/td&gt;
&lt;td&gt;GitHub repo only&lt;/td&gt;
&lt;td&gt;Full local machine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PR creation&lt;/td&gt;
&lt;td&gt;Built-in from web UI&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Terminal stays free&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No (implementation uses it)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Review surface&lt;/td&gt;
&lt;td&gt;Browser diff view&lt;/td&gt;
&lt;td&gt;Terminal or IDE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context preservation&lt;/td&gt;
&lt;td&gt;Cloud session&lt;/td&gt;
&lt;td&gt;Local session&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  When should you use Ultraplan instead of local plan mode (and when should you NOT)?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Use Ultraplan when:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The change spans &lt;strong&gt;5+ files with architectural implications&lt;/strong&gt; — you need a rich review surface&lt;/li&gt;
&lt;li&gt;You want &lt;strong&gt;hands-off drafting&lt;/strong&gt; — Ultraplan runs remotely, your terminal stays free for other work&lt;/li&gt;
&lt;li&gt;You're on a &lt;strong&gt;team and want async plan review&lt;/strong&gt; — share the browser link, get comments before execution&lt;/li&gt;
&lt;li&gt;The plan needs &lt;strong&gt;multiple iterations&lt;/strong&gt; — inline comments are faster than terminal-based revision prompts&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Use local plan mode (&lt;code&gt;/plan&lt;/code&gt; or &lt;code&gt;Shift+Tab&lt;/code&gt; into plan mode) when:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The change is &lt;strong&gt;small and self-contained&lt;/strong&gt; — 2-3 files, quick to review in terminal&lt;/li&gt;
&lt;li&gt;You're on &lt;strong&gt;Bedrock, Vertex, or Foundry&lt;/strong&gt; — Ultraplan requires Anthropic direct API and is not available on these providers&lt;/li&gt;
&lt;li&gt;You don't have a &lt;strong&gt;Claude.ai web account&lt;/strong&gt; — Ultraplan runs on Claude Code on the web infrastructure&lt;/li&gt;
&lt;li&gt;You want &lt;strong&gt;instant iteration&lt;/strong&gt; — local plan mode has no remote session startup time (typically 15-30 seconds for Ultraplan)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Do NOT use Ultraplan when:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Your organization requires &lt;strong&gt;Zero Data Retention&lt;/strong&gt; (ZDR) — Ultraplan runs on cloud infrastructure where ZDR is not available&lt;/li&gt;
&lt;li&gt;Your repository is &lt;strong&gt;not on GitHub&lt;/strong&gt; — the cloud session needs a GitHub remote to clone and operate&lt;/li&gt;
&lt;li&gt;You're working on &lt;strong&gt;sensitive code that cannot leave your machine&lt;/strong&gt; — the repo is bundled and uploaded to Anthropic's cloud sandbox&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How does Ultraplan compare to Ultrareview, and when should you use both?
&lt;/h2&gt;

&lt;p&gt;Ultraplan and Ultrareview are siblings: one plans before work, the other reviews before merge.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Ultraplan&lt;/th&gt;
&lt;th&gt;Ultrareview&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Stage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Before implementation&lt;/td&gt;
&lt;td&gt;Before merge&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Structured plan document&lt;/td&gt;
&lt;td&gt;Verified bug findings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agents&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Single cloud session&lt;/td&gt;
&lt;td&gt;Fleet of reviewer agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Verification&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Human review of plan&lt;/td&gt;
&lt;td&gt;Independent reproduction of each finding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Duration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1-5 minutes&lt;/td&gt;
&lt;td&gt;5-10 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Included in plan usage&lt;/td&gt;
&lt;td&gt;Free runs then $5-$20/review&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Trigger&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;/ultraplan&lt;/code&gt; or keyword&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;/ultrareview&lt;/code&gt; or &lt;code&gt;/ultrareview &amp;lt;PR#&amp;gt;&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  The ideal workflow pairs both:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;/ultraplan&lt;/code&gt; — plan a complex feature in the browser, iterate on architecture&lt;/li&gt;
&lt;li&gt;Implement — execute in cloud or terminal&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/ultrareview&lt;/code&gt; — run a multi-agent deep review before merging&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Ultrareview has one distinct advantage over local &lt;code&gt;/review&lt;/code&gt;: every reported finding is &lt;strong&gt;independently reproduced and verified&lt;/strong&gt; by the agent fleet, so results focus on real bugs rather than style suggestions. It supports both branch diff mode (reviews changes against default branch) and PR mode (clones the PR from GitHub directly).&lt;/p&gt;

&lt;p&gt;Ultrareview includes a non-interactive mode for CI: &lt;code&gt;claude ultrareview&lt;/code&gt; runs headless, prints findings to stdout, and exits with code 0 on success or 1 on failure. Pass &lt;code&gt;--json&lt;/code&gt; for raw output or &lt;code&gt;--timeout &amp;lt;minutes&amp;gt;&lt;/code&gt; to limit wait time.&lt;/p&gt;
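&lt;p&gt;As a sketch of how the headless mode could gate a CI pipeline: the workflow structure, job name, and setup steps below are illustrative assumptions, not a documented recipe — only &lt;code&gt;claude ultrareview&lt;/code&gt;, its exit codes, and the &lt;code&gt;--json&lt;/code&gt; and &lt;code&gt;--timeout&lt;/code&gt; flags come from the behavior described above.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical GitHub Actions job (job name and setup steps are assumptions)
review:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    # Assumes the Claude Code CLI is already installed and authenticated in CI
    - name: Ultrareview gate
      # Exits 0 on success, 1 on failure, so verified findings fail the job.
      # Add --json for machine-readable output.
      run: claude ultrareview --timeout 15
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Because CI runners treat any non-zero exit code as a step failure, no extra scripting is needed to block the merge when findings are reported.&lt;/p&gt;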

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Q: Do I need a paid Claude subscription to use Ultraplan?
&lt;/h3&gt;

&lt;p&gt;Ultraplan requires a Claude Code on the web account, which is tied to a Claude subscription (Pro, Max, Team, or Enterprise). It's not available with API-key-only authentication.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: What happens to my code during Ultraplan?
&lt;/h3&gt;

&lt;p&gt;Your repository state is bundled and uploaded to Anthropic's cloud sandbox for plan drafting. The sandbox is ephemeral — destroyed when the session ends. For Ultrareview in PR mode, the sandbox clones directly from GitHub rather than uploading your local state.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Can I use Ultraplan without a GitHub repo?
&lt;/h3&gt;

&lt;p&gt;No. The cloud session needs a GitHub remote to clone and operate on your codebase. If your repo is on GitLab or Bitbucket, Ultraplan is not available.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: How much does Ultraplan cost?
&lt;/h3&gt;

&lt;p&gt;Ultraplan counts toward your subscription's included usage. It does not bill as extra usage the way Ultrareview does; the only cost is the token consumption of the planning session.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Can multiple people review the same Ultraplan?
&lt;/h3&gt;

&lt;p&gt;Yes. Share the browser session link with teammates. They can view the plan, leave comments, and react to sections. Only one person's feedback drives Claude's revisions at a time, but multiple people can participate in the review.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Is Ultraplan available on the VS Code extension?
&lt;/h3&gt;

&lt;p&gt;Ultraplan is launched from the CLI. If you're using Claude Code inside VS Code's integrated terminal, the &lt;code&gt;/ultraplan&lt;/code&gt; command works there too. The browser review interface is separate from VS Code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Glossary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ultraplan&lt;/strong&gt;: A cloud-based planning feature that drafts Claude Code plans remotely, reviewable in a browser&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ultrareview&lt;/strong&gt;: A cloud-based code review feature that uses multiple AI agents to find and verify bugs before merge&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plan mode&lt;/strong&gt;: Local Claude Code mode (&lt;code&gt;/plan&lt;/code&gt; or &lt;code&gt;Shift+Tab&lt;/code&gt;) that researches and proposes changes without editing files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Teleport&lt;/strong&gt;: The mechanism for pulling a cloud-drafted plan back into a local terminal session for execution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud session&lt;/strong&gt;: A Claude Code session running on Anthropic's managed infrastructure, not your local machine&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ultraplan ready&lt;/strong&gt;: The status indicator confirming the cloud-drafted plan is available for browser review&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Author
&lt;/h2&gt;

&lt;p&gt;Ramsis Hammadi — AI/ML engineer specializing in GenAI, LLM engineering, and automation. &lt;a href="https://dev.to/rams901"&gt;Full bio →&lt;/a&gt;&lt;/p&gt;

</description>
      <category>claude</category>
      <category>cli</category>
      <category>tutorial</category>
      <category>cloud</category>
    </item>
  </channel>
</rss>
