<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Creeta</title>
    <description>The latest articles on DEV Community by Creeta (@creeta).</description>
    <link>https://dev.to/creeta</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3988151%2F79e19311-8408-48c6-b177-4544ce358e93.png</url>
      <title>DEV Community: Creeta</title>
      <link>https://dev.to/creeta</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/creeta"/>
    <language>en</language>
    <item>
      <title>openai-codex Python SDK v0.1.0b2: Install, Authenticate, and Run</title>
      <dc:creator>Creeta</dc:creator>
      <pubDate>Thu, 18 Jun 2026 13:32:16 +0000</pubDate>
      <link>https://dev.to/creeta/openai-codex-python-sdk-v010b2-install-authenticate-and-run-k9b</link>
      <guid>https://dev.to/creeta/openai-codex-python-sdk-v010b2-install-authenticate-and-run-k9b</guid>
      <description>&lt;h2&gt;
  
  
  v0.1.0b2 in Brief: Sandbox Presets and a Renamed Config Class
&lt;/h2&gt;

&lt;p&gt;Released May 28, 2026 as GitHub tag &lt;code&gt;python-v0.1.0b2&lt;/code&gt;, &lt;code&gt;openai-codex&lt;/code&gt; v0.1.0b2 is the second public beta of OpenAI's Python SDK for programmatically driving Codex agents. The headline addition over v0.1.0b1 is three named &lt;code&gt;Sandbox&lt;/code&gt; presets (&lt;code&gt;READ_ONLY&lt;/code&gt;, &lt;code&gt;WORKSPACE_WRITE&lt;/code&gt;, and &lt;code&gt;FULL_ACCESS&lt;/code&gt;) that replace the raw permission strings you previously had to construct by hand. A secondary change, &lt;code&gt;CodexConfig&lt;/code&gt; replacing &lt;code&gt;AppServerConfig&lt;/code&gt;, is a one-line find-and-replace with no behavioral difference. The package depends on &lt;code&gt;openai-codex-cli-bin&lt;/code&gt; pinned to 0.132.0; the binary is not bundled in the wheel. This is also the first publicly accessible iteration under the &lt;code&gt;openai-codex&lt;/code&gt; package name, which was renamed from the internal &lt;code&gt;codex_app_server&lt;/code&gt; module in Codex CLI v0.131.0, approximately May 16–18, 2026.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quick Answer:&lt;/strong&gt; Install with &lt;code&gt;pip install openai-codex==0.1.0b2&lt;/code&gt; on Python 3.10+. The key addition over v0.1.0b1 is three named &lt;code&gt;Sandbox&lt;/code&gt; presets that control filesystem access. Authenticate headlessly with &lt;code&gt;codex.login_api_key('sk-...')&lt;/code&gt;. Pin the exact version — API surface changes between betas with no stability guarantee.&lt;/p&gt;

&lt;p&gt;If you used &lt;code&gt;AppServerConfig&lt;/code&gt; in any prior code, rename it to &lt;code&gt;CodexConfig&lt;/code&gt;; nothing else breaks. Note also that a separate package, &lt;code&gt;codex-sdk-python&lt;/code&gt; (version 0.117.0 on PyPI as of March 2026), follows a distinct versioning track aligned to CLI runtime versions and is not the same library. Don't install both.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites and pip Installation
&lt;/h2&gt;

&lt;p&gt;The SDK requires Python 3.10 or later. Confirm your version first, then install pinned to the exact beta:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;--version&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;openai-codex&lt;span class="o"&gt;==&lt;/span&gt;0.1.0b2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
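&lt;p&gt;If the install runs from a script rather than by hand, the version floor can be checked in code before calling pip. The sketch below is one way to do that with the standard library only; &lt;code&gt;meets_minimum&lt;/code&gt; and &lt;code&gt;MINIMUM_PYTHON&lt;/code&gt; are illustrative names, not part of the SDK.&lt;/p&gt;

```python
import sys

# Floor stated in this article for openai-codex 0.1.0b2.
MINIMUM_PYTHON = (3, 10)


def meets_minimum(version_info=None, minimum=MINIMUM_PYTHON):
    """Return True when the interpreter version is at or above `minimum`.

    `version_info` defaults to the running interpreter; pass a tuple such as
    (3, 9, 18) to check some other version.
    """
    info = sys.version_info if version_info is None else version_info
    # Compare only (major, minor); the patch level does not matter here.
    return tuple(info[:2]) >= minimum
```

&lt;p&gt;Calling &lt;code&gt;meets_minimum()&lt;/code&gt; with no arguments checks the running interpreter, so a bootstrap script can stop early with a readable message instead of failing partway through the install.&lt;/p&gt;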



&lt;p&gt;The wheel pulls in &lt;code&gt;openai-codex-cli-bin&lt;/code&gt; as a dependency but does not embed the Codex binary directly. In most environments — macOS, Linux, a typical CI runner — the binary resolves automatically from that companion package. In containers or Jupyter notebooks where you need explicit control, bootstrap the binary before any other call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai_codex&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Codex&lt;/span&gt;

&lt;span class="n"&gt;Codex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;install&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rust-v0.132.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No Rust toolchain or additional system dependencies are needed on the consumer side — the binary ships pre-compiled. Verify the install landed correctly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"from openai_codex import Codex, Sandbox, CodexConfig; print('OK')"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Starting a Thread and Reading the Response
&lt;/h2&gt;

&lt;p&gt;The SDK is organized around threads. Each &lt;code&gt;Thread&lt;/code&gt; maps to a persistent session stored in &lt;code&gt;~/.codex/sessions&lt;/code&gt;. A single &lt;code&gt;thread.run()&lt;/code&gt; call is one turn: one prompt in, one &lt;code&gt;TurnResult&lt;/code&gt; out. The &lt;code&gt;TurnResult&lt;/code&gt; exposes &lt;code&gt;.final_response&lt;/code&gt; (the agent's text reply), &lt;code&gt;.collected_items&lt;/code&gt; (all intermediate tool calls in order), &lt;code&gt;.timing&lt;/code&gt;, and &lt;code&gt;.usage&lt;/code&gt; (token counts). Here is the synchronous path, step by step:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Open the context manager and start a thread.&lt;/strong&gt; Always use &lt;code&gt;Codex()&lt;/code&gt; as a context manager — it manages the underlying process lifecycle. Pass a &lt;code&gt;Sandbox&lt;/code&gt; preset at &lt;code&gt;thread_start()&lt;/code&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai_codex&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Codex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Sandbox&lt;/span&gt;

   &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;Codex&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;codex&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
       &lt;span class="n"&gt;thread&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;codex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;thread_start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sandbox&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Sandbox&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WORKSPACE_WRITE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
       &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain this repository in three bullets.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
       &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;final_response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
       &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# token counts
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;
&lt;strong&gt;Pick the right sandbox preset.&lt;/strong&gt; &lt;code&gt;FULL_ACCESS&lt;/code&gt; is the default but grants unrestricted filesystem access. For most coding tasks that touch your project, &lt;code&gt;WORKSPACE_WRITE&lt;/code&gt; (writes inside CWD only) is the appropriate choice. Use &lt;code&gt;READ_ONLY&lt;/code&gt; for analysis and auditing tasks where no writes should occur.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inspect the result.&lt;/strong&gt; &lt;code&gt;result.final_response&lt;/code&gt; is the agent's final text. &lt;code&gt;result.collected_items&lt;/code&gt; gives you every intermediate tool call — useful for auditing exactly what the agent read or modified.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resume the thread in a later session.&lt;/strong&gt; Save &lt;code&gt;result.thread_id&lt;/code&gt; after the first turn. In a new Python process, call &lt;code&gt;codex.resume_thread(thread_id)&lt;/code&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="n"&gt;thread_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;thread_id&lt;/span&gt;  &lt;span class="c1"&gt;# persist this string
&lt;/span&gt;
   &lt;span class="c1"&gt;# in a later session:
&lt;/span&gt;   &lt;span class="n"&gt;thread&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;codex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resume_thread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;thread_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Continue from where we left off.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
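&lt;p&gt;The "persist this string" step is left open above. A minimal sketch of one option, assuming a small JSON file is an acceptable store: the helper names &lt;code&gt;save_thread_id&lt;/code&gt; and &lt;code&gt;load_thread_id&lt;/code&gt; and the file layout are hypothetical, and the code does not touch the SDK at all.&lt;/p&gt;

```python
import json
from pathlib import Path


def save_thread_id(thread_id, path):
    """Write the thread ID to a small JSON file so a later process can find it."""
    Path(path).write_text(json.dumps({"thread_id": thread_id}), encoding="utf-8")


def load_thread_id(path):
    """Return the stored thread ID, or None when nothing has been saved yet."""
    file = Path(path)
    if not file.exists():
        return None
    return json.loads(file.read_text(encoding="utf-8"))["thread_id"]
```

&lt;p&gt;In the first session you would call &lt;code&gt;save_thread_id(result.thread_id, "thread.json")&lt;/code&gt;; in the later one, pass the value from &lt;code&gt;load_thread_id("thread.json")&lt;/code&gt; to &lt;code&gt;codex.resume_thread()&lt;/code&gt; and start a fresh thread when it returns &lt;code&gt;None&lt;/code&gt;.&lt;/p&gt;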



&lt;p&gt;&lt;strong&gt;Async path.&lt;/strong&gt; For non-blocking applications, use &lt;code&gt;AsyncCodex&lt;/code&gt;. Do not mix &lt;code&gt;Codex&lt;/code&gt; and &lt;code&gt;AsyncCodex&lt;/code&gt; instances in the same event loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai_codex&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AsyncCodex&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;AsyncCodex&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;codex&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;thread&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;codex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;thread_start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;codex-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Refactor this module for clarity.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;final_response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Streaming.&lt;/strong&gt; For long-running agent turns, &lt;code&gt;run_streamed()&lt;/code&gt; yields incremental events. Branch on &lt;code&gt;event.type&lt;/code&gt;: &lt;code&gt;"turn.delta"&lt;/code&gt; carries partial text, and &lt;code&gt;"turn.completed"&lt;/code&gt; carries the final usage summary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;streamed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_streamed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Diagnose the CI failure&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;streamed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;turn.delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;turn.completed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Usage:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
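&lt;p&gt;When the caller needs the whole reply as one string rather than printed fragments, the same branching can accumulate the deltas. The sketch below runs without the SDK: &lt;code&gt;FakeEvent&lt;/code&gt; is a hypothetical stand-in that only copies the &lt;code&gt;type&lt;/code&gt;, &lt;code&gt;content&lt;/code&gt;, and &lt;code&gt;usage&lt;/code&gt; attributes used in the snippet above, and the loop is synchronous for brevity where the real stream is consumed with &lt;code&gt;async for&lt;/code&gt;.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeEvent:
    """Hypothetical stand-in for a stream event; not the SDK's real class."""
    type: str
    content: str = ""
    usage: Optional[dict] = None


def collect_stream(events):
    """Join the text of "turn.delta" events and capture the final usage.

    Returns a (text, usage) pair; usage stays None if no "turn.completed"
    event was seen. Events of any other type are ignored.
    """
    parts = []
    usage = None
    for event in events:
        if event.type == "turn.delta":
            parts.append(event.content)
        elif event.type == "turn.completed":
            usage = event.usage
    return "".join(parts), usage
```

&lt;p&gt;Ignoring unknown event types keeps the loop tolerant of new events a later beta might add, which seems prudent given the lack of a stability guarantee.&lt;/p&gt;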



&lt;p&gt;The end-to-end snippet below creates a fresh venv, installs the SDK, and runs a smoke test. Treat it as illustrative: it has not been executed against the live API, though the install command and import path match v0.1.0b2:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CODEX_DEMO_VENV&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;venv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.codex-demo-venv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;check_call&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;executable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-m&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;venv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;venv&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;
    &lt;span class="n"&gt;py&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;venv&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Scripts/python.exe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bin/python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;check_call&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-m&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pip&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;install&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-q&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai-codex==0.1.0b2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;__file__&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CODEX_DEMO_VENV&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai_codex&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Codex&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;Codex&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;codex&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;codex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;login_api_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;authenticated with OPENAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;using existing Codex auth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;thread&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;codex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;thread_start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Reply with exactly: Codex SDK OK&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;final_response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Authenticating Without a Browser
&lt;/h2&gt;

&lt;p&gt;v0.1.0b2 ships four authentication modes, ranging from interactive desktop sessions to fully headless CI containers. First-class auth support landed alongside Codex CLI v0.132.0. Automatic mode, which reuses existing &lt;code&gt;codex login&lt;/code&gt; credentials, requires no extra code. For CI or headless containers, API-key mode is the direct path and requires no browser.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Method call&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Automatic&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;Codex()&lt;/code&gt; — no extra call needed&lt;/td&gt;
&lt;td&gt;Existing &lt;code&gt;codex login&lt;/code&gt; session on the machine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API key&lt;/td&gt;
&lt;td&gt;&lt;code&gt;codex.login_api_key("sk-...")&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CI/headless; standard OpenAI API key, no browser&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ChatGPT browser&lt;/td&gt;
&lt;td&gt;&lt;code&gt;login = codex.login_chatgpt(); print(login.auth_url); login.wait()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Interactive desktop with a ChatGPT account&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Device-code&lt;/td&gt;
&lt;td&gt;&lt;code&gt;login = codex.login_chatgpt_device_code(); print(login.verification_url, login.user_code); login.wait()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Browserless container or remote host with a ChatGPT account; a person enters the code on another device&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For fully automated container deployments, set the &lt;code&gt;CODEX_AUTH_JSON&lt;/code&gt; environment variable and skip the programmatic login call entirely. The SDK reads it at startup, which makes this the cleanest path for Kubernetes or Docker pipelines where injecting secrets as env vars is already standard.&lt;/p&gt;

&lt;p&gt;Device-code flow is worth noting for restricted environments: your code prints &lt;code&gt;login.verification_url&lt;/code&gt; and &lt;code&gt;login.user_code&lt;/code&gt; to stdout, a human visits the URL and enters the code, and &lt;code&gt;login.wait()&lt;/code&gt; blocks until the flow completes. Once confirmed, the session is stored locally and reused on subsequent runs without repeating the flow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rough Edges and Patterns Worth Exploring
&lt;/h2&gt;

&lt;p&gt;A few practical caveats before building on this beta:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pin the version.&lt;/strong&gt; The API surface changes between beta releases with no stability guarantee. Use &lt;code&gt;pip install openai-codex==0.1.0b2&lt;/code&gt; and read the &lt;a href="https://developers.openai.com/codex/changelog" rel="noopener noreferrer"&gt;Codex changelog&lt;/a&gt; before bumping. Test upgrades in isolation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Default sandbox is &lt;code&gt;FULL_ACCESS&lt;/code&gt;.&lt;/strong&gt; If the prompt text is not fully trusted (user-provided input, external data, PR comment bodies), set &lt;code&gt;Sandbox.READ_ONLY&lt;/code&gt; or &lt;code&gt;Sandbox.WORKSPACE_WRITE&lt;/code&gt; explicitly. The Codex CLI v0.135.0 changelog flags this as a security consideration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session files accumulate.&lt;/strong&gt; Every thread persists to &lt;code&gt;~/.codex/sessions&lt;/code&gt;. Prune the directory manually if you run many short test sessions — there is no built-in TTL or cleanup.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming under long turns is undocumented.&lt;/strong&gt; The &lt;code&gt;run_streamed()&lt;/code&gt; behavior for agent tasks exceeding a few minutes has no documented guarantees in this beta. Treat it as experimental for long-running pipelines.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing and rate limits are not stated in the SDK.&lt;/strong&gt; They inherit from your Codex subscription tier — check your account dashboard rather than the SDK README.&lt;/li&gt;
&lt;/ul&gt;
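&lt;p&gt;Since session cleanup is manual, a small age-based pruner is one way to keep the directory in check. This is a sketch under assumptions: it treats everything under the given root as plain files that are safe to delete once old, and the on-disk layout of &lt;code&gt;~/.codex/sessions&lt;/code&gt; is not described here, so inspect the dry-run output before deleting anything. &lt;code&gt;prune_old_files&lt;/code&gt; is a hypothetical helper, not an SDK function.&lt;/p&gt;

```python
import time
from pathlib import Path


def prune_old_files(root, max_age_days, now=None, dry_run=True):
    """Return files under `root` last modified more than `max_age_days` ago.

    With dry_run=True (the default) nothing is deleted; pass dry_run=False
    to remove the stale files. A missing root yields an empty list.
    """
    root = Path(root)
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds in a day
    stale = []
    if not root.is_dir():
        return stale
    for path in sorted(root.rglob("*")):
        # Only files are considered; directories are left in place.
        if path.is_file() and cutoff > path.stat().st_mtime:
            stale.append(path)
            if not dry_run:
                path.unlink()
    return stale
```

&lt;p&gt;A typical call would be &lt;code&gt;prune_old_files(Path.home() / ".codex" / "sessions", 14)&lt;/code&gt; to list candidates first, then the same call with &lt;code&gt;dry_run=False&lt;/code&gt;. Note that pruning a session file presumably makes its thread impossible to resume.&lt;/p&gt;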

&lt;p&gt;&lt;strong&gt;What to try next:&lt;/strong&gt; use &lt;code&gt;CodexConfig&lt;/code&gt; to set a custom working directory and model selection:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai_codex&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Codex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CodexConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Sandbox&lt;/span&gt;

&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CodexConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;codex-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;working_directory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/project&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MY_VAR&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;Codex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;codex&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;thread&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;codex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;thread_start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sandbox&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Sandbox&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WORKSPACE_WRITE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Audit the test coverage.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;final_response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For structured output, pair &lt;code&gt;run()&lt;/code&gt; with a Pydantic v2 schema via &lt;code&gt;output_schema=MyModel.model_json_schema()&lt;/code&gt; and validate with &lt;code&gt;MyModel.model_validate_json(result.final_response)&lt;/code&gt;. Wiring &lt;code&gt;run_streamed()&lt;/code&gt; into a FastAPI SSE endpoint is a natural next step for developer-facing tools that surface real-time agent output. The full API surface is documented in the &lt;a href="https://github.com/openai/codex/tree/main/sdk/python" rel="noopener noreferrer"&gt;SDK source on GitHub&lt;/a&gt; and the &lt;a href="https://pypi.org/project/openai-codex/" rel="noopener noreferrer"&gt;PyPI release page&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Does pip install openai-codex also install the Codex CLI binary?
&lt;/h3&gt;

&lt;p&gt;No. The wheel declares &lt;code&gt;openai-codex-cli-bin&lt;/code&gt; as a dependency but does not bundle the binary itself. In most environments the dependency resolves and installs the binary automatically. In containers or notebooks where you need explicit control over the bootstrap, call &lt;code&gt;Codex.install(version='rust-v0.132.0')&lt;/code&gt; before any other SDK call. If the binary is missing, most SDK methods will raise immediately with a clear error.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I use a standard OpenAI API key instead of a ChatGPT account?
&lt;/h3&gt;

&lt;p&gt;Yes. &lt;code&gt;codex.login_api_key('sk-...')&lt;/code&gt; accepts a standard OpenAI API key and works in headless CI without any browser interaction. Set &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; as an environment variable and call &lt;code&gt;codex.login_api_key(os.environ["OPENAI_API_KEY"])&lt;/code&gt; in your startup code. This is the recommended path for automated pipelines and container deployments.&lt;/p&gt;
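&lt;p&gt;As a rough sketch of the startup check described above, the helper below reads the key from the environment and fails fast when it is missing. The helper name &lt;code&gt;load_api_key&lt;/code&gt; is hypothetical and not part of the SDK; the &lt;code&gt;login_api_key()&lt;/code&gt; call appears only in a comment, as named in this article.&lt;/p&gt;

```python
import os


def load_api_key(env=None, var="OPENAI_API_KEY"):
    """Return the API key from the environment, or raise a clear error if it is unset."""
    env = os.environ if env is None else env
    key = env.get(var, "").strip()
    if not key:
        raise RuntimeError(f"{var} is not set; export it before starting the pipeline")
    return key


# Hypothetical wiring, based on the call named in this article:
#     codex.login_api_key(load_api_key())
demo_key = load_api_key({"OPENAI_API_KEY": "sk-example"})
print("key loaded:", demo_key.startswith("sk-"))
```

&lt;p&gt;Passing a plain dict as &lt;code&gt;env&lt;/code&gt; keeps the check testable without touching the real process environment.&lt;/p&gt;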

&lt;h3&gt;
  
  
  What is the difference between Sandbox.READ_ONLY and Sandbox.WORKSPACE_WRITE?
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;Sandbox.READ_ONLY&lt;/code&gt; permits the agent to read files but not write or modify them — appropriate for analysis, code review, and question-answering tasks where you want a zero-write guarantee. &lt;code&gt;Sandbox.WORKSPACE_WRITE&lt;/code&gt; allows writes inside the current working directory only, which covers most coding and refactoring tasks. &lt;code&gt;Sandbox.FULL_ACCESS&lt;/code&gt; is the default and is unrestricted — use it only when you fully control and trust the input prompt.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I continue a thread from a previous Python session?
&lt;/h3&gt;

&lt;p&gt;Save &lt;code&gt;result.thread_id&lt;/code&gt; (a string) after the first turn — to a database, file, or environment variable. In a new Python session, instantiate &lt;code&gt;Codex()&lt;/code&gt; as usual, then call &lt;code&gt;codex.resume_thread(thread_id)&lt;/code&gt; to reload the conversation. Sessions are stored on disk in &lt;code&gt;~/.codex/sessions&lt;/code&gt; and persist across Python restarts. Prune that directory periodically if you run many short test threads, as there is no automatic cleanup.&lt;/p&gt;
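&lt;p&gt;One way to persist the ID between sessions is a small JSON state file. The sketch below uses only the standard library; the helper names and the file name are assumptions for illustration, and the SDK calls appear only in comments, as named in this article.&lt;/p&gt;

```python
import json
import tempfile
from pathlib import Path


def save_thread_id(path, thread_id):
    """Persist the thread ID as JSON so a later process can resume the thread."""
    Path(path).write_text(json.dumps({"thread_id": thread_id}), encoding="utf-8")


def load_thread_id(path):
    """Return the saved thread ID, or None when nothing has been saved yet."""
    p = Path(path)
    if not p.exists():
        return None
    return json.loads(p.read_text(encoding="utf-8")).get("thread_id")


# Demo with a temporary directory and a made-up ID.
with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "codex_thread.json"
    print(load_thread_id(state_file))
    save_thread_id(state_file, "thread-123")
    print(load_thread_id(state_file))

# In a real session (calls as named in this article, not verified here):
#     save_thread_id(state_file, result.thread_id)
#     codex.resume_thread(load_thread_id(state_file))
```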

&lt;h3&gt;
  
  
  Is openai-codex v0.1.0b2 stable enough for production?
&lt;/h3&gt;

&lt;p&gt;It is a public beta with no API stability guarantees between beta versions, as noted on the &lt;a href="https://pypi.org/project/openai-codex/" rel="noopener noreferrer"&gt;PyPI release page&lt;/a&gt;. The right approach: pin &lt;code&gt;openai-codex==0.1.0b2&lt;/code&gt;, monitor the &lt;a href="https://github.com/openai/codex/releases" rel="noopener noreferrer"&gt;Codex release log&lt;/a&gt;, and test any version bump in isolation before shipping. For workloads that cannot tolerate breaking API changes, wait for the 1.0 release.&lt;/p&gt;
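&lt;p&gt;A pin is only useful if something checks it. The sketch below compares the installed distribution against the pinned version using the standard library's &lt;code&gt;importlib.metadata&lt;/code&gt;; the function name &lt;code&gt;check_pin&lt;/code&gt; and the injectable &lt;code&gt;lookup&lt;/code&gt; parameter are assumptions added for illustration.&lt;/p&gt;

```python
from importlib import metadata

PINNED = "0.1.0b2"


def check_pin(package="openai-codex", expected=PINNED, lookup=metadata.version):
    """Return (ok, installed); installed is None if the package is absent."""
    try:
        installed = lookup(package)
    except metadata.PackageNotFoundError:
        return False, None
    return installed == expected, installed


ok, installed = check_pin()
print(f"openai-codex installed={installed} matches pin={ok}")
```

&lt;p&gt;Running this as an early CI step surfaces an accidental upgrade before any agent code executes.&lt;/p&gt;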

&lt;h2&gt;
  
  
  What to Do With It Now
&lt;/h2&gt;

&lt;p&gt;Two things in v0.1.0b2 are immediately actionable. First, if you have existing code that passes raw permission strings to the Codex SDK, swap them for the named &lt;code&gt;Sandbox&lt;/code&gt; presets — it is a one-line change per thread that meaningfully reduces filesystem blast radius and makes intent explicit in code review. Second, if you are running Codex in CI, the &lt;code&gt;login_api_key()&lt;/code&gt; path eliminates the browser-auth workaround and makes the authentication model match how you handle every other API key in your pipeline.&lt;/p&gt;

&lt;p&gt;The published version is now decoupled from the source tree: &lt;code&gt;pyproject.toml&lt;/code&gt; carries &lt;code&gt;version = "0.0.0-dev"&lt;/code&gt; in source, and the published version is injected from the &lt;code&gt;python-v*&lt;/code&gt; git tag at release time. This means beta releases can ship faster without requiring source-tree commits for each version bump. Track the &lt;a href="https://github.com/openai/codex/releases/tag/python-v0.1.0b2" rel="noopener noreferrer"&gt;v0.1.0b2 release notes&lt;/a&gt; and the &lt;a href="https://developers.openai.com/codex/sdk" rel="noopener noreferrer"&gt;official SDK docs&lt;/a&gt; for the shape of 1.0; the cadence is likely to accelerate from here.&lt;/p&gt;
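&lt;p&gt;To make the tag-to-version mapping concrete, here is a minimal sketch of how a &lt;code&gt;python-v*&lt;/code&gt; tag could be turned into a package version. This is not the project's actual release tooling; the regular expression and the function name are assumptions.&lt;/p&gt;

```python
import re

# Assumed tag shape: "python-v" followed by a PEP 440-style version such as 0.1.0b2.
TAG_PATTERN = re.compile(r"^python-v(\d+\.\d+\.\d+(?:(?:a|b|rc)\d+)?)$")


def version_from_tag(tag):
    """Extract the package version from a python-v* tag, or raise ValueError."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"not a python release tag: {tag!r}")
    return match.group(1)


print(version_from_tag("python-v0.1.0b2"))
```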

&lt;p&gt;&lt;em&gt;Last updated: 2026-05-31. Based on &lt;a href="https://pypi.org/project/openai-codex/" rel="noopener noreferrer"&gt;openai-codex v0.1.0b2 on PyPI&lt;/a&gt; and the Codex CLI v0.135.0 changelog reviewed on this date.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>openaicodex</category>
      <category>python</category>
      <category>codex</category>
      <category>developertools</category>
    </item>
    <item>
      <title>Claude Code v2.1.157: Live Plugin Loading and Worktree Unlock</title>
      <dc:creator>Creeta</dc:creator>
      <pubDate>Thu, 18 Jun 2026 13:32:16 +0000</pubDate>
      <link>https://dev.to/creeta/claude-code-v21157-live-plugin-loading-and-worktree-unlock-4i9</link>
      <guid>https://dev.to/creeta/claude-code-v21157-live-plugin-loading-and-worktree-unlock-4i9</guid>
      <description>&lt;h2&gt;
  
  
  What v2.1.157 Delivers
&lt;/h2&gt;

&lt;p&gt;Claude Code v2.1.157 landed on May 29, 2026, part of a weekly cadence that has moved the tool from v2.1.83 in March to v2.1.157 in under three months. The release carries no breaking changes and touches four surface areas: the plugin/skills system, dispatched session configuration, OpenTelemetry tool-parameter logging, and a 20+ item bug-fix round that closes several regressions from the prior two weeks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quick Answer:&lt;/strong&gt; Claude Code v2.1.157 (May 29, 2026) adds automatic plugin loading from &lt;code&gt;.claude/skills/&lt;/code&gt; — no &lt;code&gt;--plugin-dir&lt;/code&gt; flag needed — alongside dispatched-session agent config, OTEL tool-parameter logging, and 20+ bug fixes including the tmux clipboard regression from v2.1.153.&lt;/p&gt;

&lt;p&gt;The pace matters for teams pinning versions. Going from v2.1.83 to v2.1.157 in ~11 weeks is 74 patch levels, or roughly 6 to 7 per week, so a pinned install can fall a dozen or more patch levels behind within a single two-week sprint. If you're locking a CI image or a shared dev container to a specific Claude Code build, factor in that updates now routinely touch the SDK surface, the sessions UX, and the plugin loader simultaneously.&lt;/p&gt;
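&lt;p&gt;The arithmetic behind that drift is worth spelling out. The sketch below uses only the two version numbers and the approximate 11-week span quoted above; it counts patch numbers, which may not map one-to-one to published releases.&lt;/p&gt;

```python
first_patch, latest_patch = 83, 157   # v2.1.83 and v2.1.157, from the text above
weeks = 11                            # the article's approximate span

patch_levels = latest_patch - first_patch
per_week = patch_levels / weeks

print(f"{patch_levels} patch levels, about {per_week:.1f} per week")
print(f"two-week sprint: about {per_week * 2:.0f} patch levels")
```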

&lt;p&gt;The changelog entry for plugin auto-loading reads: &lt;em&gt;"Plugins placed in &lt;code&gt;.claude/skills/&lt;/code&gt; are automatically loaded on session start — no marketplace publish or explicit install step required."&lt;/em&gt; What that replaces: previously you either published a plugin to the marketplace or passed a &lt;code&gt;--plugin-dir&lt;/code&gt; path at every invocation. Neither option was usable for project-local tooling you didn't want to publish. The new behavior makes &lt;code&gt;.claude/skills/&lt;/code&gt; a live-load directory: drop something in, start a new session, and it's there.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Companion features shipped alongside: &lt;code&gt;claude plugin init &amp;lt;name&amp;gt;&lt;/code&gt; scaffolds a plugin skeleton directly inside &lt;code&gt;.claude/skills/&lt;/code&gt;; &lt;code&gt;/plugin&lt;/code&gt; autocompletion now includes subcommands, installed names, and marketplace entries." — &lt;a href="https://github.com/anthropics/claude-code/blob/main/CHANGELOG.md" rel="noopener noreferrer"&gt;Claude Code CHANGELOG, May 29 2026&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Before v2.1.157&lt;/th&gt;
&lt;th&gt;After v2.1.157&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Local plugin loading&lt;/td&gt;
&lt;td&gt;Requires &lt;code&gt;--plugin-dir &amp;lt;path&amp;gt;&lt;/code&gt; on every invocation or marketplace publish&lt;/td&gt;
&lt;td&gt;Drop into &lt;code&gt;.claude/skills/&lt;/code&gt;; loaded automatically on session start&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude-managed worktree unlock&lt;/td&gt;
&lt;td&gt;Worktrees left locked at session end; &lt;code&gt;git worktree remove&lt;/code&gt; requires manual lock-breaking&lt;/td&gt;
&lt;td&gt;Worktrees unlocked on session close; &lt;code&gt;git worktree prune&lt;/code&gt; works without intervention&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OTEL tool parameters&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;tool_decision&lt;/code&gt; events carry no command detail&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;OTEL_LOG_TOOL_DETAILS=1&lt;/code&gt; adds &lt;code&gt;tool_parameters&lt;/code&gt; (bash commands, skill names) to events&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Zero-byte image handling&lt;/td&gt;
&lt;td&gt;Corrupt or empty image crashes the entire request&lt;/td&gt;
&lt;td&gt;Replaced with a text placeholder; request continues&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Step 1 — Drop Plugins into .claude/skills/
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;.claude/skills/&lt;/code&gt; directory is now a first-class plugin loader. On session start, Claude Code scans every subdirectory inside it and loads whatever plugins it finds, with no flags, no publish step, and no restart of a background process. To scaffold a new plugin, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;claude plugin init my-deploy-tools
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That creates &lt;code&gt;.claude/skills/my-deploy-tools/&lt;/code&gt; with the following skeleton:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.claude/skills/my-deploy-tools/
├── plugin.json          # metadata: name, version, description
├── skills/
│   └── deploy.md        # one skill per .md file
└── hooks/
    └── pre-tool.js      # optional lifecycle hooks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;.claude/skills/&lt;/code&gt; doesn't exist yet, &lt;code&gt;plugin init&lt;/code&gt; creates it. You can also create the directory manually and place a plugin directory inside it; the loader doesn't care how it got there.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;skill&lt;/strong&gt; is a single Markdown file with YAML frontmatter. A &lt;strong&gt;plugin&lt;/strong&gt; bundles one or more skills plus hooks and a metadata manifest into a shareable directory unit. The distinction matters for tooling: &lt;code&gt;claude plugin install&lt;/code&gt; installs a full plugin (from marketplace or a local path); you can also author a raw &lt;code&gt;.md&lt;/code&gt; skill file in &lt;code&gt;.claude/skills/&lt;/code&gt; directly and it will be picked up as a standalone skill without a manifest.&lt;/p&gt;

&lt;p&gt;A key frontmatter flag worth knowing is &lt;code&gt;disable-model-invocation: true&lt;/code&gt;. Set this on any skill that wraps a destructive command — a deploy script, a database migration, a force-push — and Claude will not auto-trigger it based on context. The skill becomes user-initiated only. Example frontmatter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deploy-staging&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Push current branch to staging environment&lt;/span&gt;
&lt;span class="na"&gt;disable-model-invocation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;Deploy branch to staging&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
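&lt;p&gt;If a repository holds several skills, it can help to audit which ones carry the flag. The sketch below is a deliberately minimal frontmatter reader that handles only flat &lt;code&gt;key: value&lt;/code&gt; lines; the function names are hypothetical, and it is not how Claude Code itself parses skills.&lt;/p&gt;

```python
def parse_frontmatter(text):
    """Parse simple `key: value` frontmatter between the first two `---` lines.

    Deliberately minimal: no nesting, lists, or quoting. Use a real YAML
    parser for anything beyond flat scalar fields.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {}  # no closing delimiter, so treat the frontmatter as invalid


def is_user_initiated_only(text):
    """True when the skill sets disable-model-invocation: true."""
    return parse_frontmatter(text).get("disable-model-invocation", "").lower() == "true"


skill = """---
name: deploy-staging
description: Push current branch to staging environment
disable-model-invocation: true
---
Deploy branch to staging:
"""
print(is_user_initiated_only(skill))
```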



</description>
      <category>claudecode</category>
      <category>plugins</category>
      <category>worktree</category>
      <category>opentelemetry</category>
    </item>
    <item>
      <title>Claude Code v2.1.156: Opus 4.8 Thinking Block Hotfix Explained</title>
      <dc:creator>Creeta</dc:creator>
      <pubDate>Thu, 18 Jun 2026 12:38:13 +0000</pubDate>
      <link>https://dev.to/creeta/claude-code-v21156-opus-48-thinking-block-hotfix-explained-4k5g</link>
      <guid>https://dev.to/creeta/claude-code-v21156-opus-48-thinking-block-hotfix-explained-4k5g</guid>
      <description>&lt;h2&gt;
  
  
  What Broke: Thinking Block Corruption on Opus 4.8
&lt;/h2&gt;

&lt;p&gt;Claude Code v2.1.154, released May 28, 2026, made Opus 4.8 the default model and immediately surfaced a silent payload bug. Thinking blocks were being mutated during retry or multi-turn sequences, causing the Anthropic API to reject the modified payload with HTTP 400 errors on the following turn. Sessions appeared to hang or crash with no actionable error message. v2.1.156, released at approximately 01:42 UTC on May 29, 2026, is a single-surface hotfix that patches this exact mutation path and nothing else.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quick Answer:&lt;/strong&gt; If you're on Claude Code v2.1.154 or v2.1.155 with Opus 4.8 and extended thinking active, update now. v2.1.156 patches thinking block mutation during multi-turn sequences that caused silent payload corruption and unrecoverable HTTP 400 errors. Run &lt;code&gt;npm install -g @anthropic-ai/claude-code@latest&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The failure mode was subtle: on turn N, a thinking block was modified before being echoed back in the next request — whitespace stripped, bytes compacted. The Anthropic API enforces that signed thinking blocks must be replayed byte-for-byte; any modification causes rejection. This meant the error always surfaced on turn N+1, making it indistinguishable from a logic error in the session itself. In agentic workflows with many turns, callers had no reliable way to attribute the 400 to an infrastructure issue rather than their own code.&lt;/p&gt;

&lt;p&gt;Affected scope: any session on Opus 4.8 with extended thinking active (&lt;code&gt;/effort high&lt;/code&gt; or higher). Opus 4.7, all Sonnet variants, and Haiku were not affected — the mutation was specific to the Opus 4.8 thinking block payload format during replay and compaction.&lt;/p&gt;

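&lt;p&gt;The Python snippet below is a toy model of the failure, not Claude Code's actual replay logic; it ran successfully (exit 0). A stand-in &lt;code&gt;send_to_api&lt;/code&gt; function rejects any block whose text changed while its signature stayed the same, and a frozen dataclass keeps the original block from being mutated in place:&lt;br&gt;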
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;replace&lt;/span&gt;


&lt;span class="nd"&gt;@dataclass&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frozen&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ThinkingBlock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;signature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;send_to_api&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ThinkingBlock&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;original&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ThinkingBlock&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;signature&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;original&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;signature&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;original&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;API 400: signed thinking blocks cannot be modified&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accepted&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;


&lt;span class="n"&gt;original&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ThinkingBlock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hidden reasoning bytes from Opus 4.8&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;signature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sig:abc123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Claude Code v2.1.156 hotfix demo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bad replay:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;send_to_api&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;original&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;original&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt; &lt;span class="n"&gt;original&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;ValueError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fixed replay:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;send_to_api&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;original&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;original&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;explanation: preserve signed thinking blocks byte-for-byte during replay/compaction.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Captured output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Claude Code v2.1.156 hotfix demo
bad replay: API 400: signed thinking blocks cannot be modified
fixed replay: accepted
explanation: preserve signed thinking blocks byte-for-byte during replay/compaction.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;"Resolves a critical bug where thinking blocks were modified during multi-turn Opus 4.8 sessions with extended thinking enabled, causing API errors that broke agentic sessions mid-execution." — &lt;a href="https://github.com/anthropics/claude-code/blob/main/CHANGELOG.md" rel="noopener noreferrer"&gt;Claude Code CHANGELOG, v2.1.156&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Prerequisites and How to Update to v2.1.156
&lt;/h2&gt;

&lt;p&gt;The update is a single npm command. No configuration file changes are required — the fix is entirely within the client's thinking block replay logic. Confirm your current version first, then update and verify.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Check your installed version:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   claude &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the output is &lt;code&gt;v2.1.154&lt;/code&gt; or &lt;code&gt;v2.1.155&lt;/code&gt; and you use Opus 4.8 with extended thinking, update before your next multi-turn session.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;
&lt;strong&gt;Update via npm:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @anthropic-ai/claude-code@latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Restart any active Claude Code sessions after the install completes. Using &lt;code&gt;@latest&lt;/code&gt; resolves to the newest available release — as of May 29, 2026 this also pulls v2.1.157 (same-day plugin system overhaul). No config changes needed alongside the update.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;&lt;strong&gt;Verify the hotfix is active:&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Open a fresh Opus 4.8 session (the default since v2.1.154), run &lt;code&gt;/effort high&lt;/code&gt; to activate extended thinking, and send a two-turn exchange — a question followed by a follow-up in the same session. Confirm no HTTP 400 errors appear in the session log. The second turn is the critical test: that is exactly where the corrupted payload was rejected pre-hotfix.&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;&lt;strong&gt;Confirm plugin behavior after v2.1.157:&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Because &lt;code&gt;@latest&lt;/code&gt; also installs v2.1.157, run &lt;code&gt;/reload-skills&lt;/code&gt; and verify your &lt;code&gt;.claude/skills/&lt;/code&gt; entries still resolve correctly. The plugin overhaul changed auto-load semantics — skills in that directory now load without marketplace registration or restart, which is a behavioral change if you previously relied on explicit enablement via &lt;code&gt;/plugin&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rough Edges to Expect After Updating
&lt;/h2&gt;

&lt;p&gt;Three behavioral changes in the v2.1.154–157 cluster will catch you off guard if you don't read the changelog before your next session.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lean system prompt is now default for Opus 4.8.&lt;/strong&gt; If your prompts or tool configurations relied on the verbose context previously injected at session start, those assumptions break on v2.1.154+. Haiku, Sonnet, and Opus 4.7 and earlier still use the verbose prompt; only Opus 4.8 sessions are affected. Audit any prompts that reference injected context before assuming something is broken.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The word "workflow" now silently triggers Dynamic Orchestration&lt;/strong&gt; on Max, Team, and Enterprise plans. Anthropic warns this can spike token spend materially on large repositories. If you're not ready to use Dynamic Orchestration, disable the keyword trigger before writing prompts that mention "workflow" in passing: navigate to &lt;code&gt;/config&lt;/code&gt; → "Workflow keyword trigger" and turn it off. This setting was added in v2.1.154 specifically because the silent trigger was flagged as a footgun during beta.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fast mode pricing changed with Opus 4.8.&lt;/strong&gt; Standard pricing is $5 per million input tokens and $25 per million output tokens. Fast mode runs at $10 per million input and $50 per million output at approximately 2.5× speed. Anthropic describes fast mode as 3× cheaper per unit than the previous fast-mode tier, but it is still double the standard rate for both input and output. For long agentic sessions where latency is not the constraint, standard mode is the cost-efficient default.&lt;/p&gt;
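&lt;p&gt;To see what those rates mean for a concrete session, the sketch below applies the per-million-token prices quoted above to a made-up workload. The token counts are hypothetical, and the calculation ignores anything not mentioned in this article, such as caching discounts.&lt;/p&gt;

```python
# Per-million-token prices in USD, as quoted in the paragraph above.
PRICES = {
    "standard": {"input": 5.0, "output": 25.0},
    "fast": {"input": 10.0, "output": 50.0},
}


def session_cost(mode, input_tokens, output_tokens):
    """Return the USD cost of a session at the quoted per-million-token rates."""
    rate = PRICES[mode]
    return (input_tokens * rate["input"] + output_tokens * rate["output"]) / 1_000_000


# Hypothetical session: 2M input tokens, 200k output tokens.
for mode in PRICES:
    print(mode, session_cost(mode, 2_000_000, 200_000))
```

&lt;p&gt;For this workload the standard rate comes to $15 and fast mode to $30, so the premium is a flat 2× regardless of the input-to-output mix.&lt;/p&gt;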

&lt;p&gt;&lt;strong&gt;Pro plan users:&lt;/strong&gt; Dynamic Orchestration is not available on Pro. Including "workflow" in a prompt will not trigger orchestration and will not produce an error — the word is treated as plain text. If you expect orchestration behavior and nothing happens, verify your plan tier before filing a bug report; silence is the documented behavior on Pro, not a session fault.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dynamic Orchestration and the Plugin Overhaul: What to Explore
&lt;/h2&gt;

&lt;p&gt;With v2.1.156 stabilizing Opus 4.8's extended thinking, the two major features from the surrounding cluster are worth a deliberate pilot: Dynamic Orchestration (v2.1.154) and the plugin auto-load system (v2.1.157).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dynamic Orchestration:&lt;/strong&gt; Include "workflow" in any prompt, or navigate directly to &lt;code&gt;/workflows&lt;/code&gt; to start a run. Claude writes a JavaScript orchestration script and fans out background subagents, up to a hard cap of 1,000 per session. Monitor live agents at &lt;code&gt;claude agents&lt;/code&gt; (keyboard shortcut: &lt;code&gt;←←&lt;/code&gt;). Run status is visible at &lt;code&gt;/workflows&lt;/code&gt;. Anthropic explicitly recommends piloting on a single, scoped module before pointing a workflow at an entire repository, because token consumption scales faster than intuition suggests once subagent fan-out begins. Anthropic has not yet published per-subagent overhead costs beyond that general consumption warning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Plugin auto-load (v2.1.157):&lt;/strong&gt; Drop a skill file into &lt;code&gt;.claude/skills/&lt;/code&gt; and it loads automatically — no marketplace registration, no restart. Skills can declare &lt;code&gt;disallowed-tools&lt;/code&gt; in frontmatter to scope which model tools are available during that skill's execution. Scaffold a new plugin with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude plugin init my-plugin-name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This places a scaffold directly into &lt;code&gt;.claude/skills/&lt;/code&gt;. To require explicit opt-in, set &lt;code&gt;defaultEnabled: false&lt;/code&gt; in &lt;code&gt;plugin.json&lt;/code&gt;; enable with &lt;code&gt;/plugin&lt;/code&gt; or &lt;code&gt;claude plugin enable my-plugin-name&lt;/code&gt;. Run &lt;code&gt;/reload-skills&lt;/code&gt; to rescan mid-session without restarting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Opus 4.8 benchmark reference&lt;/strong&gt; (from &lt;a href="https://www.anthropic.com/news/claude-opus-4-8" rel="noopener noreferrer"&gt;Anthropic's release publication&lt;/a&gt; — independent third-party replication is still limited as of May 31, 2026):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Opus 4.8 Result&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Online-Mind2Web (browser automation)&lt;/td&gt;
&lt;td&gt;84%&lt;/td&gt;
&lt;td&gt;Beats Opus 4.7 and GPT-5.5; May 2026&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Legal Agent Benchmark (all-pass)&lt;/td&gt;
&lt;td&gt;First to exceed 10%&lt;/td&gt;
&lt;td&gt;First model to clear the bar; May 2026&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code review flaw detection vs. Opus 4.7&lt;/td&gt;
&lt;td&gt;~4× less likely to overlook flaws&lt;/td&gt;
&lt;td&gt;Anthropic-sourced; third-party validation sparse as of 2026-05-31&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What was the thinking block bug introduced in v2.1.154?
&lt;/h3&gt;

&lt;p&gt;Thinking blocks were being mutated — typically stripped or compacted — during retry or multi-turn sequences on Opus 4.8 sessions with extended thinking enabled. The Anthropic API requires signed thinking blocks to be replayed byte-for-byte; any modification triggers an HTTP 400 rejection. Because the mutation happened on turn N but the error surfaced on turn N+1, the session appeared to crash without a clear cause. v2.1.156 patches this specific mutation path in the client's replay and compaction logic. Details are in the &lt;a href="https://github.com/anthropics/claude-code/releases" rel="noopener noreferrer"&gt;v2.1.156 release notes&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do I need v2.1.156 if I'm not using Opus 4.8?
&lt;/h3&gt;

&lt;p&gt;No. The bug was scoped entirely to Opus 4.8 sessions with extended thinking active. Opus 4.7, all Sonnet variants, and Haiku were not affected. If you're on an older model or have extended thinking disabled, there is no urgent reason to update specifically for this hotfix — though keeping up with &lt;code&gt;@latest&lt;/code&gt; is generally advisable to pick up the broader v2.1.154–157 cluster changes.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I confirm extended thinking is working correctly after updating?
&lt;/h3&gt;

&lt;p&gt;Run &lt;code&gt;claude --version&lt;/code&gt; to confirm you're on v2.1.156 or later. Open a fresh Opus 4.8 session, run &lt;code&gt;/effort high&lt;/code&gt; to activate extended thinking, and send at least two turns in sequence. The second turn is the critical test — pre-hotfix, that is where the corrupted payload was rejected by the API. No HTTP 400 errors in the session log confirms the fix is active.&lt;/p&gt;
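&lt;p&gt;For fleets or CI images, the version check can be scripted. The sketch below classifies a version string against the releases this article names; the exact output format of &lt;code&gt;claude --version&lt;/code&gt; is an assumption, so the parser just looks for the first &lt;code&gt;x.y.z&lt;/code&gt; triple, and the function names are hypothetical.&lt;/p&gt;

```python
import re

AFFECTED = {(2, 1, 154), (2, 1, 155)}   # releases this article names as affected
FIXED = (2, 1, 156)


def parse_version(output):
    """Pull the first x.y.z triple out of a version string such as 'v2.1.156'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError(f"no version found in {output!r}")
    return tuple(int(part) for part in match.groups())


def hotfix_status(output):
    """Classify a version string as 'affected', 'fixed', or 'older'."""
    version = parse_version(output)
    if version in AFFECTED:
        return "affected"
    return "fixed" if version >= FIXED else "older"


for sample in ("v2.1.154", "2.1.156", "v2.1.157", "v2.1.83"):
    print(sample, hotfix_status(sample))
```

&lt;p&gt;Comparing integer tuples rather than strings avoids the classic mistake of sorting "2.1.83" after "2.1.156".&lt;/p&gt;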

&lt;h3&gt;
  
  
  Which subscription tiers support Dynamic Orchestration?
&lt;/h3&gt;

&lt;p&gt;Max, Team, and Enterprise plans only. Pro plan users do not have access to Dynamic Orchestration. On Pro, including "workflow" in a prompt will not trigger orchestration and will not produce an error — the word is simply treated as regular text with no side effects. If you see no orchestration behavior after using the "workflow" keyword, verify your plan tier before assuming a session bug.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the cost difference between standard Opus 4.8 and fast mode?
&lt;/h3&gt;

&lt;p&gt;Standard pricing is $5 per million input tokens and $25 per million output tokens. Fast mode is $10 per million input and $50 per million output at approximately 2.5× speed. Anthropic describes fast mode as 3× cheaper per unit than the previous fast-mode pricing tier, but it remains double the standard rate for both input and output. For cost-sensitive agentic sessions where latency is not the bottleneck, standard mode is the better default. See &lt;a href="https://www.anthropic.com/news/claude-opus-4-8" rel="noopener noreferrer"&gt;Anthropic's Opus 4.8 release page&lt;/a&gt; for full pricing details.&lt;/p&gt;
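
&lt;p&gt;To make the rate difference concrete, here is a minimal sketch that prices one session at both modes using the per-million-token rates quoted above. The session size (2M input tokens, 400k output tokens) and the names &lt;code&gt;RATES&lt;/code&gt; and &lt;code&gt;session_cost&lt;/code&gt; are assumptions chosen for illustration, not figures or APIs from Anthropic.&lt;/p&gt;

```python
# Per-million-token rates in USD, as quoted in the answer above.
RATES = {
    "standard": {"input": 5.00, "output": 25.00},
    "fast": {"input": 10.00, "output": 50.00},
}


def session_cost(mode: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a session at the given mode's rates."""
    rate = RATES[mode]
    total = input_tokens * rate["input"] + output_tokens * rate["output"]
    return total / 1_000_000


# Hypothetical session size, chosen only for illustration.
standard = session_cost("standard", 2_000_000, 400_000)
fast = session_cost("fast", 2_000_000, 400_000)
print(f"standard: ${standard:.2f}  fast: ${fast:.2f}  ratio: {fast / standard:.1f}x")
# prints: standard: $20.00  fast: $40.00  ratio: 2.0x
```

&lt;p&gt;Because both rates double, the ratio stays at 2× regardless of the input/output mix; only the absolute gap grows with session size.&lt;/p&gt;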

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;p&gt;The immediate action is straightforward: run &lt;code&gt;npm install -g @anthropic-ai/claude-code@latest&lt;/code&gt;, confirm your version is v2.1.156 or later, and restart any active multi-turn Opus 4.8 sessions. That closes the thinking block corruption window entirely.&lt;/p&gt;

&lt;p&gt;Beyond the hotfix, the two features worth a deliberate trial are Dynamic Orchestration and the plugin auto-load system. Start Orchestration on a single, bounded task — not a full codebase — and watch the session log for token consumption before scaling scope. The 1,000-subagent cap is documented, but how enforcement behaves at the limit (hard abort vs. graceful queue) is not yet publicly specified. For the plugin system, dropping a skill file into &lt;code&gt;.claude/skills/&lt;/code&gt; and running &lt;code&gt;/reload-skills&lt;/code&gt; is low-risk and immediately useful if you've been building local tools. Full release details are at &lt;a href="https://github.com/anthropics/claude-code/releases" rel="noopener noreferrer"&gt;GitHub Releases&lt;/a&gt; and the &lt;a href="https://github.com/anthropics/claude-code/blob/main/CHANGELOG.md" rel="noopener noreferrer"&gt;CHANGELOG&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Last updated: 2026-05-31. Reflects Claude Code v2.1.156 as the current stable hotfix, with context drawn from the v2.1.154–v2.1.157 release cluster. Benchmark figures sourced from &lt;a href="https://www.anthropic.com/news/claude-opus-4-8" rel="noopener noreferrer"&gt;Anthropic's Opus 4.8 publication&lt;/a&gt;; independent third-party replication of the code review claim is still in progress as of this date.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>opus48</category>
      <category>hotfix</category>
      <category>extendedthinking</category>
    </item>
    <item>
      <title>langchain-fireworks 1.4.2: Annotated + ChatFireworks Quickstart</title>
      <dc:creator>Creeta</dc:creator>
      <pubDate>Thu, 18 Jun 2026 12:38:12 +0000</pubDate>
      <link>https://dev.to/creeta/langchain-fireworks-142-annotated-chatfireworks-quickstart-2k37</link>
      <guid>https://dev.to/creeta/langchain-fireworks-142-annotated-chatfireworks-quickstart-2k37</guid>
      <description>&lt;p&gt;Three patches in eight days: &lt;a href="https://pypi.org/project/langchain-fireworks/" rel="noopener noreferrer"&gt;langchain-fireworks&lt;/a&gt; moved from 1.3.x to 1.4.2 between May 20 and May 27, 2026, shipping an SDK migration, a typed context-overflow exception, tighter retry ownership, and a fix for a serialization bug that quietly broke cross-provider pipelines in earlier builds. This tutorial unpacks what changed and walks you to a running &lt;code&gt;ChatFireworks&lt;/code&gt; instance with tool calling.&lt;/p&gt;

&lt;h2&gt;
  
  
  1.4.0–1.4.2 Annotated: Dependency Bump, Serialization Cleanup, and Retry Rewiring
&lt;/h2&gt;

&lt;p&gt;The 1.4.x series is a coherent hardening sprint across three rapid releases. Version 1.4.0 (May 20) migrated the integration from &lt;code&gt;fireworks-ai&lt;/code&gt; 0.x to the 1.x pre-release line (PR #37581) and introduced &lt;code&gt;FireworksContextOverflowError&lt;/code&gt;, a typed wrapper around the raw &lt;code&gt;BadRequestError&lt;/code&gt; previously raised when a prompt exceeded the model's context window. Version 1.4.1 (May 21) moved retry ownership entirely to LangChain's decorator layer: &lt;code&gt;max_retries=2&lt;/code&gt; is now the default (PR #37602), and the underlying HTTP client is initialized with &lt;code&gt;max_retries=0&lt;/code&gt; to prevent double-counting. Version 1.4.2 (May 27) is the most broadly impactful: it strips non-wire keys (Anthropic's &lt;code&gt;index&lt;/code&gt; on text blocks, LangChain's internal &lt;code&gt;caller&lt;/code&gt; on tool_use blocks) before sending messages to the Fireworks wire API (PR #37714). Pre-1.4.2, those extra keys triggered validation errors in multi-provider pipelines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quick Answer:&lt;/strong&gt; langchain-fireworks 1.4.2 (May 27, 2026) fixes cross-provider validation errors by stripping non-wire content-part keys (&lt;code&gt;index&lt;/code&gt;, &lt;code&gt;caller&lt;/code&gt;) before sending to the Fireworks API. Paired with 1.4.1's retry rewiring (&lt;code&gt;max_retries=2&lt;/code&gt; default, HTTP client at &lt;code&gt;max_retries=0&lt;/code&gt;) and 1.4.0's upgrade to &lt;code&gt;fireworks-ai&lt;/code&gt; 1.x, the patch sequence makes &lt;code&gt;ChatFireworks&lt;/code&gt; substantially more robust in multi-provider pipelines.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;th&gt;Release Date&lt;/th&gt;
&lt;th&gt;Key PRs&lt;/th&gt;
&lt;th&gt;Net User Impact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1.4.0&lt;/td&gt;
&lt;td&gt;May 20, 2026&lt;/td&gt;
&lt;td&gt;#37581, #37574&lt;/td&gt;
&lt;td&gt;SDK upgraded from &lt;code&gt;fireworks-ai&lt;/code&gt; 0.x → 1.x; &lt;code&gt;FireworksContextOverflowError&lt;/code&gt; added for context-length failures&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1.4.1&lt;/td&gt;
&lt;td&gt;May 21, 2026&lt;/td&gt;
&lt;td&gt;#37602, #37590&lt;/td&gt;
&lt;td&gt;Retries on &lt;code&gt;APIConnectionError&lt;/code&gt;; &lt;code&gt;max_retries=2&lt;/code&gt; default; HTTP client forced to &lt;code&gt;max_retries=0&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1.4.2&lt;/td&gt;
&lt;td&gt;May 27, 2026&lt;/td&gt;
&lt;td&gt;#37714, #37650&lt;/td&gt;
&lt;td&gt;Non-wire keys (&lt;code&gt;index&lt;/code&gt;, &lt;code&gt;caller&lt;/code&gt;) stripped before wire API call; cross-provider validation errors fixed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;"Strip non-wire keys — e.g. &lt;code&gt;index&lt;/code&gt; on Anthropic text blocks, &lt;code&gt;caller&lt;/code&gt; on LangChain tool_use blocks — before sending to the Fireworks wire API; previously these triggered validation errors in cross-provider pipelines." — PR #37714 description, &lt;a href="https://github.com/langchain-ai/langchain/releases" rel="noopener noreferrer"&gt;langchain-ai/langchain&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Before You Start: Python 3.10+, fireworks-ai 1.x Alpha, and a Fireworks Account
&lt;/h2&gt;

&lt;p&gt;You need Python 3.10 or later and a Fireworks API key, which you can obtain at &lt;a href="https://docs.langchain.com/oss/python/integrations/providers/fireworks" rel="noopener noreferrer"&gt;app.fireworks.ai/login&lt;/a&gt;. One dependency caveat deserves its own paragraph.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;fireworks-ai&lt;/code&gt; 1.x is still in alpha. The latest pre-release as of late May 2026 is &lt;code&gt;1.2.0a73&lt;/code&gt;; the last stable release was &lt;code&gt;0.19.20&lt;/code&gt; from October 2025. In production, pin to an exact alpha version (e.g., &lt;code&gt;fireworks-ai==1.2.0a73&lt;/code&gt;) rather than a range like &lt;code&gt;&amp;gt;=1.0&lt;/code&gt;. Alpha builds can introduce breaking API changes from one pre-release to the next without a semver major-bump signal.&lt;/p&gt;
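
&lt;p&gt;One way to enforce the exact-pin rule is a small check in CI over your requirement lines. The sketch below is an illustration only: &lt;code&gt;is_exact_pin&lt;/code&gt; and the &lt;code&gt;EXACT_PIN&lt;/code&gt; pattern are hypothetical helpers, and the pattern deliberately handles only the simple &lt;code&gt;name==version&lt;/code&gt; form (no extras, markers, or hashes).&lt;/p&gt;

```python
import re

# Matches "name==version" and nothing else: no ranges, wildcards, or markers.
# Simplified on purpose; this is not a full requirement-specifier parser.
EXACT_PIN = re.compile(r"^[A-Za-z0-9._-]+==[A-Za-z0-9.]+$")


def is_exact_pin(requirement: str) -> bool:
    """Return True when a requirement line pins one exact version."""
    return bool(EXACT_PIN.match(requirement.strip()))


for line in ["fireworks-ai==1.2.0a73", "fireworks-ai>=1.0", "fireworks-ai==1.*"]:
    print(line, is_exact_pin(line))
```

&lt;p&gt;Only the first line passes; the range and the wildcard are both rejected, which is the behavior you want while the dependency is still in alpha.&lt;/p&gt;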

&lt;h2&gt;
  
  
  Install, Instantiate, and Invoke ChatFireworks: Step-by-Step
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://python.langchain.com/api_reference/fireworks/chat_models/langchain_fireworks.chat_models.ChatFireworks.html" rel="noopener noreferrer"&gt;&lt;code&gt;ChatFireworks&lt;/code&gt;&lt;/a&gt; is the primary &lt;code&gt;BaseChatModel&lt;/code&gt; interface for Fireworks-hosted models . Model identifiers follow the pattern &lt;code&gt;accounts/fireworks/models/&amp;lt;slug&amp;gt;&lt;/code&gt;. Follow these five steps for a working setup.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Install.&lt;/strong&gt; Upgrade to 1.4.2 and verify:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-qU&lt;/span&gt; langchain-fireworks
   pip show langchain-fireworks   &lt;span class="c"&gt;# expect: Version: 1.4.2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or with uv: &lt;code&gt;uv add langchain-fireworks&lt;/code&gt;&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;
&lt;strong&gt;Credential.&lt;/strong&gt; Export the API key or pass &lt;code&gt;api_key=&lt;/code&gt; directly to the constructor. The environment variable approach is preferred:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   &lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;FIREWORKS_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'fw_...'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;
&lt;strong&gt;Instantiate.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_fireworks&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatFireworks&lt;/span&gt;

   &lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatFireworks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
       &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accounts/fireworks/models/llama-v3p1-8b-instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;# LangChain decorator layer; HTTP client uses max_retries=0
&lt;/span&gt;       &lt;span class="n"&gt;stream_usage&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# include token counts in streamed chunks
&lt;/span&gt;   &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="4"&gt;
&lt;li&gt;
&lt;strong&gt;Invoke.&lt;/strong&gt; &lt;code&gt;.invoke()&lt;/code&gt; returns an &lt;code&gt;AIMessage&lt;/code&gt;; &lt;code&gt;.content&lt;/code&gt; is the text string:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
       &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant that translates English to French.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
       &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;human&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I love programming.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
   &lt;span class="p"&gt;]&lt;/span&gt;
   &lt;span class="n"&gt;ai_msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ai_msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# "J'adore la programmation."
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="5"&gt;
&lt;li&gt;
&lt;strong&gt;Streaming with usage tracking.&lt;/strong&gt; Since 1.2.0 (carried through 1.4.2), &lt;code&gt;stream_usage=True&lt;/code&gt; opts into &lt;code&gt;stream_options.include_usage&lt;/code&gt;. The final chunk now surfaces as an &lt;code&gt;AIMessageChunk&lt;/code&gt; with &lt;code&gt;usage_metadata&lt;/code&gt; rather than being silently dropped:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tell me a joke&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
       &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
       &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage_metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
           &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Tokens used: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage_metadata&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Pitfalls to Anticipate: Alpha Pinning, Cross-Provider Messages, and Retry Ownership
&lt;/h2&gt;

&lt;p&gt;Four areas where the 1.4.x upgrade requires deliberate handling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Alpha instability.&lt;/strong&gt; Pin &lt;code&gt;fireworks-ai&lt;/code&gt; to an exact alpha version (e.g., &lt;code&gt;fireworks-ai==1.2.0a73&lt;/code&gt;) rather than a range. Alpha builds can introduce breaking API changes from one pre-release to the next without a semver signal. Run integration tests before upgrading.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-provider pipelines.&lt;/strong&gt; The 1.4.2 serialization fix silently strips &lt;code&gt;index&lt;/code&gt; and &lt;code&gt;caller&lt;/code&gt; keys before sending to the wire API. If you were working around pre-1.4.2 validation errors by manually sanitizing messages, remove that workaround after upgrading — double-stripping is harmless but adds noise to the pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retry ownership.&lt;/strong&gt; &lt;code&gt;max_retries&lt;/code&gt; on &lt;code&gt;ChatFireworks&lt;/code&gt; controls the LangChain decorator layer only. The underlying &lt;code&gt;fireworks.Fireworks()&lt;/code&gt; HTTP client is initialized with &lt;code&gt;max_retries=0&lt;/code&gt; by design, ensuring each attempt is visible to &lt;code&gt;run_manager.on_retry&lt;/code&gt; and avoiding double-counting. Do not attempt to override &lt;code&gt;max_retries&lt;/code&gt; at the HTTP client level.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image input is not supported.&lt;/strong&gt; &lt;code&gt;ChatFireworks&lt;/code&gt; raises an error for multimodal (image) message content as of 1.4.2. Verify the capability matrix before routing vision workflows through this integration — use a different LangChain chat model for vision tasks.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Extending Your Setup: Tool Calling, Structured Outputs, and Async
&lt;/h2&gt;

&lt;p&gt;Once basic invocation is working, the most useful next steps for a production integration are tool calling, structured outputs, and explicit error handling. The snippet below is illustrative (it was not executed against a live API) but reflects the current documented interface for &lt;code&gt;bind_tools()&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_fireworks&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatFireworks&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;ImportError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;missing dependency: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;SystemExit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FIREWORKS_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Set FIREWORKS_API_KEY to run this ChatFireworks example.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;SystemExit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;multiply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;first factor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;second factor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Multiply two integers.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;


&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatFireworks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accounts/fireworks/models/llama-v3p1-8b-instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;llm_with_tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bind_tools&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;multiply&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm_with_tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is 6 times 7? Use the tool.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Beyond tool calling, the integration supports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Structured output.&lt;/strong&gt; &lt;code&gt;llm.with_structured_output(MyPydanticModel)&lt;/code&gt; works with Pydantic v2 schemas for schema-validated JSON extraction from model responses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Async.&lt;/strong&gt; &lt;code&gt;llm.ainvoke(messages)&lt;/code&gt; for asyncio contexts; &lt;code&gt;llm.astream(messages)&lt;/code&gt; for async streaming. The interface mirrors the sync API exactly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context overflow handling.&lt;/strong&gt; Wrap calls in &lt;code&gt;try/except FireworksContextOverflowError&lt;/code&gt; (added 1.4.0) to catch prompt-too-long conditions explicitly rather than letting a raw HTTP error bubble up:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_fireworks&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatFireworks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;FireworksContextOverflowError&lt;/span&gt;

&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;long_messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;FireworksContextOverflowError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# truncate context, switch to a larger-window model, or summarize before retrying
&lt;/span&gt;    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fallback_llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;summarize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;long_messages&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Is fireworks-ai 1.x stable enough for production use?
&lt;/h3&gt;

&lt;p&gt;Not yet. The 1.x series remains in active alpha: the latest release as of late May 2026 is &lt;code&gt;1.2.0a73&lt;/code&gt;, and the last stable release was &lt;code&gt;0.19.20&lt;/code&gt; from October 2025. If deploying to production, pin to an exact alpha version and run integration tests before each upgrade. Floating on a semver range like &lt;code&gt;&amp;gt;=1.0&lt;/code&gt; risks pulling in silent breaking changes between alpha builds.&lt;/p&gt;

&lt;h3&gt;
  
  
  What does the serialization strip in 1.4.2 actually fix?
&lt;/h3&gt;

&lt;p&gt;When assembling messages from multiple providers, content part dicts can carry provider-specific keys: Anthropic text blocks attach an &lt;code&gt;index&lt;/code&gt; key; LangChain's internal tool_use blocks attach a &lt;code&gt;caller&lt;/code&gt; key. Pre-1.4.2, those extra keys passed through to the Fireworks wire API and caused validation errors. Version 1.4.2 introduces sanitization functions — built from an allowlist derived from the Fireworks SDK's own TypedDict — that strip non-wire keys before every API call. Upgrading existing pipelines should be transparent; the only observable change is the removal of those validation errors.&lt;/p&gt;
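
&lt;p&gt;The allowlist idea described above can be sketched in a few lines. This is a conceptual illustration, not the integration's actual code: the &lt;code&gt;ALLOWED_KEYS&lt;/code&gt; table and the &lt;code&gt;strip_non_wire_keys&lt;/code&gt; function are hypothetical, and the key sets are assumed for the example rather than taken from the Fireworks SDK's TypedDict definitions.&lt;/p&gt;

```python
from typing import Any

# Hypothetical allowlists for illustration only; the real integration is
# described as deriving its allowlist from the Fireworks SDK's TypedDict.
ALLOWED_KEYS: dict[str, set[str]] = {
    "text": {"type", "text"},
    "tool_use": {"type", "id", "name", "input"},
}


def strip_non_wire_keys(part: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of a content part keeping only allowlisted keys."""
    allowed = ALLOWED_KEYS.get(part.get("type", ""))
    if allowed is None:
        return dict(part)  # unknown part type: pass through unchanged
    return {key: value for key, value in part.items() if key in allowed}


parts = [
    {"type": "text", "text": "Calling the tool.", "index": 0},
    {"type": "tool_use", "id": "call_1", "name": "multiply",
     "input": {"a": 6, "b": 7}, "caller": {"type": "direct"}},
]
clean = [strip_non_wire_keys(part) for part in parts]
print(clean)
```

&lt;p&gt;An allowlist fails safe in the direction that matters here: a new provider-specific key is dropped by default instead of reaching the wire API and triggering a validation error.&lt;/p&gt;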

&lt;h3&gt;
  
  
  How does retry logic work in 1.4.x compared to earlier versions?
&lt;/h3&gt;

&lt;p&gt;Since 1.4.1, retry ownership belongs entirely to LangChain's decorator layer. The underlying &lt;code&gt;fireworks.Fireworks()&lt;/code&gt; HTTP client is initialized with &lt;code&gt;max_retries=0&lt;/code&gt; — it performs no retries of its own. The &lt;code&gt;max_retries=2&lt;/code&gt; default on &lt;code&gt;ChatFireworks&lt;/code&gt; means up to two retries through the LangChain path, each visible to &lt;code&gt;run_manager.on_retry&lt;/code&gt; callbacks. Version 1.4.1 also added retry coverage for bare &lt;code&gt;APIConnectionError&lt;/code&gt; conditions, which earlier versions did not retry.&lt;/p&gt;
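
&lt;p&gt;The single-owner pattern described above can be sketched without any LangChain or Fireworks imports. Everything in this block is hypothetical (&lt;code&gt;call_with_retries&lt;/code&gt;, &lt;code&gt;ConnectionFailure&lt;/code&gt;, &lt;code&gt;flaky_request&lt;/code&gt;); it shows only why an inner client with zero retries keeps every attempt visible to an outer retry callback, and it is not how LangChain implements its decorator.&lt;/p&gt;

```python
from typing import Callable


class ConnectionFailure(Exception):
    """Stand-in for a transient connection error."""


def call_with_retries(
    request: Callable[[], str],
    max_retries: int = 2,
    on_retry: Callable[[int, Exception], None] = lambda attempt, exc: None,
) -> str:
    """Own all retries in one place; `request` itself must not retry."""
    attempt = 0
    while True:
        try:
            return request()
        except ConnectionFailure as exc:
            if attempt >= max_retries:
                raise  # retry budget exhausted: surface the error
            attempt += 1
            on_retry(attempt, exc)  # every retry is observable here


calls = {"n": 0}


def flaky_request() -> str:
    """Fail on the first two calls, then succeed."""
    calls["n"] += 1
    if calls["n"] in (1, 2):
        raise ConnectionFailure("connection dropped")
    return "ok"


seen: list[int] = []
result = call_with_retries(
    flaky_request, max_retries=2, on_retry=lambda attempt, exc: seen.append(attempt)
)
print(result, seen, calls["n"])  # prints: ok [1, 2] 3
```

&lt;p&gt;If the inner request also retried on its own, the callback would see fewer attempts than actually hit the network, which is the double-counting problem the &lt;code&gt;max_retries=0&lt;/code&gt; client setting avoids.&lt;/p&gt;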

&lt;h3&gt;
  
  
  Can ChatFireworks process images or multimodal inputs?
&lt;/h3&gt;

&lt;p&gt;No. Image input is not supported as of 1.4.2 and raises an error. &lt;code&gt;ChatFireworks&lt;/code&gt; supports text input, tool calling, structured output, streaming, and logprobs — but not vision. For multimodal workflows, use a LangChain chat model integration that supports image content blocks, then route to Fireworks only for the text-only steps.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is FireworksContextOverflowError and when does it get raised?
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;FireworksContextOverflowError&lt;/code&gt; was added in 1.4.0 as a typed wrapper around the raw &lt;code&gt;BadRequestError&lt;/code&gt; returned when a prompt exceeds the model's context window. Before 1.4.0, that condition surfaced as an untyped HTTP error requiring string-matching to detect. Catching it explicitly lets you branch cleanly to a fallback: truncate context, switch to a model with a larger window, or summarize the conversation before retrying.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Try Next
&lt;/h2&gt;

&lt;p&gt;With 1.4.2 installed and a working invoke path, the productive next steps are: wire in &lt;code&gt;with_structured_output()&lt;/code&gt; for JSON extraction use cases, add &lt;code&gt;FireworksContextOverflowError&lt;/code&gt; handling at call sites if you're operating near context limits, and explicitly test cross-provider message round-trips (Anthropic or OpenAI → Fireworks) to confirm the 1.4.2 serialization fix covers your specific message shapes. If you're migrating from &lt;code&gt;fireworks-ai&lt;/code&gt; 0.x, the 1.4.0 SDK bump is the change most likely to surface compatibility gaps — review the &lt;a href="https://docs.langchain.com/oss/python/integrations/chat/fireworks" rel="noopener noreferrer"&gt;updated integration docs&lt;/a&gt; and the &lt;a href="https://reference.langchain.com/python/integrations/langchain_fireworks/" rel="noopener noreferrer"&gt;API reference&lt;/a&gt; before upgrading a production dependency.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://fireworks.ai/blog/fireworks-ai-now-available-on-langchain-prompt-playground" rel="noopener noreferrer"&gt;Fireworks LangChain integration overview&lt;/a&gt; covers available model slugs and tier options. For the complete parameter list including &lt;code&gt;service_tier&lt;/code&gt;, &lt;code&gt;logprobs&lt;/code&gt;, and &lt;code&gt;timeout&lt;/code&gt;, see the &lt;a href="https://github.com/langchain-ai/langchain/blob/master/libs/partners/fireworks/langchain_fireworks/chat_models.py" rel="noopener noreferrer"&gt;source on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Last updated: 2026-05-31. Based on &lt;a href="https://pypi.org/project/langchain-fireworks/" rel="noopener noreferrer"&gt;langchain-fireworks 1.4.2&lt;/a&gt; (released May 27, 2026) and fireworks-ai 1.2.0a73.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>langchain</category>
      <category>fireworksai</category>
      <category>chatfireworks</category>
      <category>llmframework</category>
    </item>
    <item>
      <title>openai-codex v0.1.0b1: First Beta Install and Thread Walkthrough</title>
      <dc:creator>Creeta</dc:creator>
      <pubDate>Thu, 18 Jun 2026 11:38:49 +0000</pubDate>
      <link>https://dev.to/creeta/openai-codex-v010b1-first-beta-install-and-thread-walkthrough-2p44</link>
      <guid>https://dev.to/creeta/openai-codex-v010b1-first-beta-install-and-thread-walkthrough-2p44</guid>
      <description>&lt;h2&gt;
  
  
  What openai-codex v0.1.0b1 Ships
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://pypi.org/project/openai-codex/0.1.0b1/" rel="noopener noreferrer"&gt;&lt;code&gt;openai-codex&lt;/code&gt; v0.1.0b1&lt;/a&gt; is OpenAI's first officially published Python SDK for embedding Codex agent capabilities directly in application code. It is distinct from the earlier community package &lt;a href="https://pypi.org/project/openai-codex-sdk/" rel="noopener noreferrer"&gt;&lt;code&gt;openai-codex-sdk&lt;/code&gt;&lt;/a&gt;, which wrapped the CLI binary via JSONL over stdin/stdout and first appeared in December 2025 . The official beta landed May 28, 2026  alongside &lt;a href="https://developers.openai.com/codex/changelog" rel="noopener noreferrer"&gt;Codex CLI v0.135.0&lt;/a&gt;, and the SDK version tracks the CLI directly. Both &lt;code&gt;0.1.0b1&lt;/code&gt; and &lt;code&gt;0.1.0b2&lt;/code&gt; appeared on PyPI the same day under Apache-2.0 .&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quick Answer:&lt;/strong&gt; &lt;code&gt;openai-codex&lt;/code&gt; v0.1.0b1 is OpenAI's official Python SDK for Codex agent threads, released May 28, 2026 under Apache-2.0. Install with &lt;code&gt;pip install openai-codex==0.1.0b1&lt;/code&gt;. Requires Python ≥3.10. The CLI binary is installed automatically as a pinned dependency, so no separate CLI install is needed.&lt;/p&gt;

&lt;p&gt;Three PRs stabilized the API surface before the public release: schema-generated types (PR #18862, April 21, 2026), runtime wheel publishing (PR #18865), and Codex-pinned versioning (PR #18996, April 27, 2026). The package ships as &lt;strong&gt;Development Status 4 (beta)&lt;/strong&gt;; public APIs are not stable until 1.0.&lt;/p&gt;

&lt;p&gt;The headlining new surface is three named &lt;strong&gt;Sandbox presets&lt;/strong&gt; at the thread level, replacing the older &lt;code&gt;--profile&lt;/code&gt; flag approach:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Preset&lt;/th&gt;
&lt;th&gt;Filesystem effect&lt;/th&gt;
&lt;th&gt;Recommended use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;read_only&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Reads only; no writes permitted&lt;/td&gt;
&lt;td&gt;Audit, inspection, explain-this-repo tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;workspace_write&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Write access to workspace root (default)&lt;/td&gt;
&lt;td&gt;Standard agentic coding tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;full_access&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Unrestricted filesystem&lt;/td&gt;
&lt;td&gt;Self-modifying experiments — isolated containers only&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
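
&lt;p&gt;One way to make the container-only recommendation for &lt;code&gt;full_access&lt;/code&gt; enforceable is a small guard in front of &lt;code&gt;thread_start()&lt;/code&gt;. The sketch below is illustrative only: &lt;code&gt;resolve_sandbox&lt;/code&gt; and the &lt;code&gt;CODEX_ISOLATED&lt;/code&gt; environment variable are hypothetical names that are not part of the SDK, and the preset strings are the ones listed in the table.&lt;/p&gt;

```python
import os

# Preset names as listed in the table above; they may change before 1.0.
PRESETS = ("read_only", "workspace_write", "full_access")


def resolve_sandbox(requested="workspace_write", env=None):
    """Return a preset name, refusing full_access outside an isolated container.

    CODEX_ISOLATED is a hypothetical opt-in flag you would set yourself
    in throwaway containers; the SDK does not read it.
    """
    env = os.environ if env is None else env
    if requested not in PRESETS:
        raise ValueError(f"unknown sandbox preset: {requested!r}")
    if requested == "full_access" and env.get("CODEX_ISOLATED") != "1":
        raise PermissionError("full_access requires CODEX_ISOLATED=1")
    return requested


print(resolve_sandbox("read_only", env={}))
```

&lt;p&gt;The returned string would then be passed as the &lt;code&gt;sandbox&lt;/code&gt; argument when starting a thread.&lt;/p&gt;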

&lt;h2&gt;
  
  
  Installation and Prerequisites
&lt;/h2&gt;

&lt;p&gt;The only hard requirement is Python ≥3.10; the package is tested against 3.10, 3.11, 3.12, and 3.13. No separate Codex CLI install is needed; the package auto-installs &lt;code&gt;openai-codex-cli-bin&lt;/code&gt; as a pinned runtime dependency. Pin the exact version at install time to avoid silently resolving to &lt;code&gt;0.1.0b2&lt;/code&gt;, which shares the same May 28, 2026 release date:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;openai-codex&lt;span class="o"&gt;==&lt;/span&gt;0.1.0b1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
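
&lt;p&gt;After installing, it can be worth confirming that the environment really holds the version you pinned. The following is a minimal sketch using only the standard library; &lt;code&gt;check_pin&lt;/code&gt; is a hypothetical helper name, and the distribution name and version are the ones used in the install command above.&lt;/p&gt;

```python
from importlib import metadata


def check_pin(dist_name, expected):
    """Compare the installed version of a distribution with the pinned one."""
    try:
        installed = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return "missing"
    return "ok" if installed == expected else f"mismatch: {installed}"


# Prints "ok", "missing", or "mismatch: ..." depending on the environment.
print(check_pin("openai-codex", "0.1.0b1"))
```

&lt;p&gt;Running a check like this at the start of a CI job surfaces an accidental upgrade before any agent code executes.&lt;/p&gt;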



&lt;p&gt;Four authentication flows are available:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reuse existing CLI credentials&lt;/strong&gt;: simplest for local dev; instantiate &lt;code&gt;Codex()&lt;/code&gt; with no arguments and it picks up cached credentials automatically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API key&lt;/strong&gt;: call &lt;code&gt;codex.login_api_key('sk-...')&lt;/code&gt; after instantiation, with the key supplied through the &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; environment variable. Suited for CI/CD pipelines.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Browser login&lt;/strong&gt;: &lt;code&gt;codex.login_chatgpt()&lt;/code&gt; returns an auth URL to open in a browser; suited for interactive terminals.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Device-code flow&lt;/strong&gt;: &lt;code&gt;codex.login_chatgpt_device_code()&lt;/code&gt; returns a verification URL and user code; suited for headless servers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Device-code and API-key flows became first-class in CLI v0.132.0, released May 20, 2026. If you are reusing credentials from an older CLI install, run an explicit login step; silent failures are a known issue (see Gotchas below).&lt;/p&gt;
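
&lt;p&gt;The choice between the four flows can be made explicit in code. The sketch below only decides which login method name to call; the selection policy and the &lt;code&gt;CODEX_FORCE_LOGIN&lt;/code&gt; variable are assumptions made for illustration, while the method names are the ones listed above.&lt;/p&gt;

```python
import os
import sys


def pick_auth_flow(env=None, interactive=None):
    """Return the login method name to call, or None to reuse cached credentials.

    The policy here is an illustrative assumption, not SDK behavior.
    CODEX_FORCE_LOGIN is a hypothetical flag for forcing a fresh login.
    """
    env = os.environ if env is None else env
    if interactive is None:
        interactive = sys.stdin.isatty()
    if env.get("OPENAI_API_KEY"):
        return "login_api_key"  # CI/CD: key supplied by the environment
    if env.get("CODEX_FORCE_LOGIN") != "1":
        return None  # local dev: rely on cached CLI credentials
    # Fresh ChatGPT login: browser when a terminal is attached, device code otherwise.
    return "login_chatgpt" if interactive else "login_chatgpt_device_code"


print(pick_auth_flow({"CODEX_FORCE_LOGIN": "1"}, interactive=False))
```

&lt;p&gt;Keeping the decision in one function makes it easy to log which flow a given environment ended up using.&lt;/p&gt;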

&lt;h2&gt;
  
  
  From Import to First Thread: A Walkthrough
&lt;/h2&gt;

&lt;p&gt;The snippet below is illustrative — the &lt;code&gt;codex_app_server&lt;/code&gt; module is auto-installed as part of the &lt;code&gt;openai-codex&lt;/code&gt; package but was not available in the test environment at time of writing. Run the install comment first, then the code. The &lt;a href="https://github.com/openai/codex/blob/main/sdk/python/docs/getting-started.md" rel="noopener noreferrer"&gt;getting-started guide&lt;/a&gt; has a fully executed variant.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;First beta install: python -m pip install openai-codex==0.1.0b1&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;codex_app_server&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Codex&lt;/span&gt;


&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;Codex&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;codex&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;thread&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;codex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;thread_start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5.4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;first&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Reply with exactly: first beta thread started&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;first:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;final_response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;followup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Reply with exactly: same thread follow-up&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;followup:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;followup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;final_response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Step by step:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Import and open.&lt;/strong&gt; The &lt;code&gt;with Codex() as codex:&lt;/code&gt; context manager starts a local JSON-RPC app-server on entry and shuts it down cleanly on exit. No manual server lifecycle management required.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Start a thread.&lt;/strong&gt; &lt;code&gt;codex.thread_start(model="gpt-5.4")&lt;/code&gt; creates a persistent conversation unit. Supported models include &lt;code&gt;gpt-5.5&lt;/code&gt;, &lt;code&gt;gpt-5.4&lt;/code&gt;, &lt;code&gt;gpt-5.4-mini&lt;/code&gt;, &lt;code&gt;gpt-5.3-codex&lt;/code&gt;, and &lt;code&gt;gpt-5.3-codex-spark&lt;/code&gt; (ChatGPT Pro only). Pass &lt;code&gt;sandbox="read_only"&lt;/code&gt; here to scope the agent's filesystem access for the thread's lifetime.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run a turn.&lt;/strong&gt; &lt;code&gt;thread.run("...")&lt;/code&gt; sends a prompt and blocks until completion. Plain strings auto-convert to &lt;code&gt;TextInput&lt;/code&gt;; pass &lt;code&gt;ImageInput(url=...)&lt;/code&gt; or &lt;code&gt;LocalImageInput(path=...)&lt;/code&gt; for vision prompts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inspect the result.&lt;/strong&gt; The returned &lt;a href="https://github.com/openai/codex/blob/main/sdk/python/docs/api-reference.md" rel="noopener noreferrer"&gt;&lt;code&gt;TurnResult&lt;/code&gt;&lt;/a&gt; exposes &lt;code&gt;final_response&lt;/code&gt; (str or None), &lt;code&gt;items&lt;/code&gt; (list of &lt;code&gt;ThreadItem&lt;/code&gt;), &lt;code&gt;usage&lt;/code&gt; (&lt;code&gt;ThreadTokenUsage&lt;/code&gt;), &lt;code&gt;duration_ms&lt;/code&gt;, &lt;code&gt;started_at&lt;/code&gt;, and &lt;code&gt;completed_at&lt;/code&gt;. Log &lt;code&gt;result.usage&lt;/code&gt; and &lt;code&gt;result.duration_ms&lt;/code&gt; from your first run so you have a baseline before switching models or sandbox presets.&lt;/li&gt;
&lt;/ol&gt;
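
&lt;p&gt;The logging advice in the last step can be sketched without depending on the SDK at all. The helper below reads the fields named above through &lt;code&gt;getattr&lt;/code&gt; with defaults, since field names may shift during the beta. &lt;code&gt;summarize_turn&lt;/code&gt; is a hypothetical helper, and the stand-in object uses made-up values (including a plain dict for &lt;code&gt;usage&lt;/code&gt;, whose real shape is not shown here).&lt;/p&gt;

```python
from types import SimpleNamespace


def summarize_turn(result, label):
    """Build a flat log record from a TurnResult-like object.

    getattr with defaults keeps the logger working if a field is renamed.
    """
    return {
        "label": label,
        "duration_ms": getattr(result, "duration_ms", None),
        "usage": getattr(result, "usage", None),
        "has_response": getattr(result, "final_response", None) is not None,
    }


# Stand-in with made-up values; a real run would pass the object returned by thread.run().
fake = SimpleNamespace(final_response="ok", usage={"total_tokens": 42}, duration_ms=1234)
print(summarize_turn(fake, "gpt-5.4/read_only"))
```

&lt;p&gt;Tagging each record with the model and sandbox preset makes later comparisons straightforward.&lt;/p&gt;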

&lt;p&gt;For async contexts, swap &lt;code&gt;Codex&lt;/code&gt; for &lt;code&gt;AsyncCodex&lt;/code&gt;, replace &lt;code&gt;with&lt;/code&gt; with &lt;code&gt;async with&lt;/code&gt;, and &lt;code&gt;await&lt;/code&gt; both &lt;code&gt;thread_start()&lt;/code&gt; and &lt;code&gt;thread.run()&lt;/code&gt;. Lazy initialization keeps import-time overhead negligible. The interface is otherwise identical, making it a straightforward drop-in for FastAPI routes or asyncio pipelines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai_codex&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AsyncCodex&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;AsyncCodex&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;codex&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;thread&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;codex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;thread_start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5.4-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Generate a docstring for utils.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;final_response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
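
&lt;p&gt;In an asyncio pipeline you will usually want to cap how many turns are in flight at once. The sketch below shows one generic way to do that with a semaphore. &lt;code&gt;run_bounded&lt;/code&gt; and &lt;code&gt;fake_turn&lt;/code&gt; are hypothetical names, and &lt;code&gt;fake_turn&lt;/code&gt; stands in for an awaited SDK call; whether one thread accepts overlapping turns is not covered here, so the pattern assumes one thread per prompt.&lt;/p&gt;

```python
import asyncio


async def run_bounded(prompts, run_one, limit=2):
    """Run run_one(prompt) for every prompt with at most `limit` in flight."""
    gate = asyncio.Semaphore(limit)

    async def guarded(prompt):
        async with gate:
            return await run_one(prompt)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(guarded(p) for p in prompts))


async def fake_turn(prompt):
    """Stand-in for an awaited SDK turn; echoes instead of calling a model."""
    await asyncio.sleep(0)
    return f"done: {prompt}"


results = asyncio.run(run_bounded(["a", "b", "c"], fake_turn))
print(results)
```

&lt;p&gt;Swapping &lt;code&gt;fake_turn&lt;/code&gt; for a coroutine that starts a thread and runs a turn keeps the concurrency cap in one place.&lt;/p&gt;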



&lt;h2&gt;
  
  
  Gotchas and Beta Caveats
&lt;/h2&gt;

&lt;p&gt;Five issues worth knowing before you take this beyond a prototype:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Version collision on install day.&lt;/strong&gt; &lt;code&gt;v0.1.0b1&lt;/code&gt; and &lt;code&gt;v0.1.0b2&lt;/code&gt; share the same May 28, 2026 PyPI release date. Without an explicit pin, pip may silently resolve to the newer beta, &lt;code&gt;0.1.0b2&lt;/code&gt;. Always lock to the version you tested: &lt;code&gt;openai-codex==0.1.0b1&lt;/code&gt; in &lt;code&gt;requirements.txt&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unstable public API.&lt;/strong&gt; Development Status 4 (beta) means &lt;code&gt;TurnResult&lt;/code&gt; field names, &lt;code&gt;ThreadTokenUsage&lt;/code&gt; shape, and sandbox preset identifiers may change before 1.0. Check the &lt;a href="https://developers.openai.com/codex/changelog" rel="noopener noreferrer"&gt;Codex changelog&lt;/a&gt; before each version bump.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Heavier install size.&lt;/strong&gt; The bundled CLI binary adds roughly 50–100 MB depending on platform. The binary layer changes infrequently, so isolate it early in your Docker build to keep CI caches effective.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;full_access&lt;/code&gt; documentation gap.&lt;/strong&gt; The &lt;code&gt;full_access&lt;/code&gt; preset documentation was incomplete at launch. Always test agentic loops using this preset inside an isolated container before pointing it at a real repository.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stale credential failures.&lt;/strong&gt; Credentials cached by CLI versions older than v0.132.0 (released before May 20, 2026) may fail silently with the new auth flows. Run &lt;code&gt;codex.login_api_key('sk-...')&lt;/code&gt; or the device-code flow explicitly to force a fresh credential set.&lt;/li&gt;
&lt;/ul&gt;
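
&lt;p&gt;The pinning advice in the first item is easy to check mechanically. The following is a deliberately simple sketch, not a full requirements parser: &lt;code&gt;unpinned&lt;/code&gt; is a hypothetical helper that flags any non-comment line lacking an exact &lt;code&gt;==&lt;/code&gt; pin, and it would also flag option lines such as &lt;code&gt;-r other.txt&lt;/code&gt;.&lt;/p&gt;

```python
def unpinned(requirements_text):
    """Return requirement lines that do not use an exact == pin."""
    loose = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if line and "==" not in line:
            loose.append(line)
    return loose


sample = """
openai-codex==0.1.0b1
requests~=2.31   # compatible-release specifier, not an exact pin
"""
print(unpinned(sample))
```

&lt;p&gt;Failing a CI step when the returned list is non-empty keeps a beta dependency from drifting between builds.&lt;/p&gt;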

&lt;h2&gt;
  
  
  Further Experiments: Async, Image Inputs, and Thread Configuration
&lt;/h2&gt;

&lt;p&gt;Once a basic thread is running, these four patterns cover the most productive next steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Async drop-in.&lt;/strong&gt; Replace &lt;code&gt;Codex&lt;/code&gt; with &lt;code&gt;AsyncCodex&lt;/code&gt; for integration with FastAPI or any asyncio loop. Lazy initialization keeps import-time overhead low; the rest of the interface is identical.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Screenshot-to-fix loop.&lt;/strong&gt; Pass &lt;code&gt;LocalImageInput(path='screenshot.png')&lt;/code&gt; alongside a text prompt to &lt;code&gt;thread.run()&lt;/code&gt;. Useful for UI regression triage — the model sees the screenshot directly without a written description of the diff.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Subagent model routing.&lt;/strong&gt; Use &lt;code&gt;gpt-5.4-mini&lt;/code&gt; for high-frequency sub-tasks where latency and cost matter; reserve &lt;code&gt;gpt-5.4&lt;/code&gt; or &lt;code&gt;gpt-5.5&lt;/code&gt; for the outer planning and reasoning layer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sandbox scoping per task.&lt;/strong&gt; Pass &lt;code&gt;sandbox='read_only'&lt;/code&gt; to &lt;code&gt;thread_start()&lt;/code&gt; for safe inspection and explanation tasks; use &lt;code&gt;full_access&lt;/code&gt; only in throwaway containers for self-modifying experiments.&lt;/li&gt;
&lt;/ul&gt;
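
&lt;p&gt;Model routing and sandbox scoping can be combined into one lookup that produces the keyword arguments for &lt;code&gt;thread_start()&lt;/code&gt;. The routing table below is an illustrative assumption; only the model names, preset names, and the &lt;code&gt;model&lt;/code&gt; and &lt;code&gt;sandbox&lt;/code&gt; argument names come from the text above. &lt;code&gt;thread_kwargs&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
# Illustrative routing table: task kinds and pairings are assumptions.
ROUTES = {
    "plan": {"model": "gpt-5.4", "sandbox": "read_only"},
    "subtask": {"model": "gpt-5.4-mini", "sandbox": "workspace_write"},
    "explain": {"model": "gpt-5.4-mini", "sandbox": "read_only"},
}


def thread_kwargs(task_kind):
    """Return keyword arguments intended for thread_start() for a task kind."""
    try:
        return dict(ROUTES[task_kind])  # copy so callers cannot mutate the table
    except KeyError:
        raise ValueError(f"no route for task kind {task_kind!r}") from None


print(thread_kwargs("subtask"))
```

&lt;p&gt;A call site would then unpack the result, for example &lt;code&gt;codex.thread_start(**thread_kwargs("explain"))&lt;/code&gt;.&lt;/p&gt;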

&lt;p&gt;The &lt;a href="https://github.com/openai/codex/blob/main/sdk/python/docs/api-reference.md" rel="noopener noreferrer"&gt;SDK API reference&lt;/a&gt; also documents &lt;code&gt;TurnHandle&lt;/code&gt; and &lt;code&gt;AsyncTurnHandle&lt;/code&gt; for fine-grained mid-flight control: streaming events as they arrive, steering a running turn with updated instructions, or interrupting it outright — relevant for long agentic loops where checkpoints reduce unnecessary token spend.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Do I need to install the Codex CLI separately before using the openai-codex Python package?
&lt;/h3&gt;

&lt;p&gt;No. The &lt;code&gt;openai-codex&lt;/code&gt; package auto-installs &lt;code&gt;openai-codex-cli-bin&lt;/code&gt; as a pinned runtime dependency. The CLI binary is installed and managed alongside the SDK, so no separate install step is required.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Python versions does openai-codex v0.1.0b1 support?
&lt;/h3&gt;

&lt;p&gt;Python 3.10, 3.11, 3.12, and 3.13. Python versions below 3.10 are not supported.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does v0.1.0b1 differ from v0.1.0b2?
&lt;/h3&gt;

&lt;p&gt;Both landed on PyPI on May 28, 2026. No public release notes distinguish the two at time of writing. Since the SDK tracks CLI versions and each beta bump may adjust bundled binary behavior, the safest practice is to pin whichever version you tested in &lt;code&gt;requirements.txt&lt;/code&gt; and upgrade deliberately.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is openai-codex v0.1.0b1 ready for production?
&lt;/h3&gt;

&lt;p&gt;Development Status 4 (beta) means public APIs are not stable. &lt;code&gt;TurnResult&lt;/code&gt; field names, &lt;code&gt;ThreadTokenUsage&lt;/code&gt; shape, and sandbox preset identifiers may change before 1.0. Pin the exact version, treat the API surface as unstable, and keep it out of critical production paths until a stable release is tagged.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the difference between Codex and AsyncCodex?
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;Codex&lt;/code&gt; is synchronous and context-manager-safe, which makes it the right choice for scripts, CLIs, and synchronous frameworks. &lt;code&gt;AsyncCodex&lt;/code&gt; is its async counterpart with lazy initialization, designed for asyncio-based frameworks like FastAPI. The thread and turn interfaces are otherwise identical across both classes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Build Next
&lt;/h2&gt;

&lt;p&gt;With a first thread running, the most useful immediate step is to log &lt;code&gt;result.usage&lt;/code&gt; and &lt;code&gt;result.duration_ms&lt;/code&gt; across a handful of real tasks before adding any complexity. That baseline tells you concretely what switching models or sandbox presets costs in tokens and latency — information that is hard to reconstruct later.&lt;/p&gt;
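
&lt;p&gt;A baseline is only useful once it is aggregated. The sketch below groups per-run records by configuration and reports medians. &lt;code&gt;baseline&lt;/code&gt; is a hypothetical helper, the record layout is an assumption, and the numbers are made up purely to show the shape of the output.&lt;/p&gt;

```python
from collections import defaultdict
from statistics import median


def baseline(records):
    """Group (config, duration_ms, total_tokens) records; report medians per config."""
    grouped = defaultdict(list)
    for config, duration_ms, total_tokens in records:
        grouped[config].append((duration_ms, total_tokens))
    return {
        config: {
            "runs": len(rows),
            "median_ms": median(d for d, _ in rows),
            "median_tokens": median(t for _, t in rows),
        }
        for config, rows in grouped.items()
    }


# Made-up numbers, not measurements.
runs = [
    ("gpt-5.4/read_only", 1200, 900),
    ("gpt-5.4/read_only", 1000, 1100),
    ("gpt-5.4-mini/read_only", 400, 700),
]
print(baseline(runs))
```

&lt;p&gt;Medians are used instead of means so that a single slow turn does not distort the comparison.&lt;/p&gt;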

&lt;p&gt;From there, the two patterns worth exploring early are the &lt;code&gt;LocalImageInput&lt;/code&gt; vision path for screenshot-driven workflows, and &lt;code&gt;TurnHandle&lt;/code&gt; streaming for long agentic loops where mid-flight interruption can save significant token spend. Both are covered in the &lt;a href="https://github.com/openai/codex/blob/main/sdk/python/docs/api-reference.md" rel="noopener noreferrer"&gt;SDK API reference&lt;/a&gt; and the &lt;a href="https://github.com/openai/codex/tree/main/sdk/python" rel="noopener noreferrer"&gt;Python SDK repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Given that &lt;code&gt;0.1.0b1&lt;/code&gt; and &lt;code&gt;0.1.0b2&lt;/code&gt; shipped the same day, the release cadence at this beta stage is fast. Subscribe to the &lt;a href="https://developers.openai.com/codex/changelog" rel="noopener noreferrer"&gt;Codex changelog&lt;/a&gt; and review it before each version bump — a field rename in &lt;code&gt;TurnResult&lt;/code&gt; will not announce itself loudly at runtime.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Last updated: 2026-05-31. Based on &lt;a href="https://pypi.org/project/openai-codex/0.1.0b1/" rel="noopener noreferrer"&gt;openai-codex v0.1.0b1&lt;/a&gt; and Codex CLI v0.135.0 as released May 28, 2026.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>openaicodex</category>
      <category>python</category>
      <category>sdk</category>
      <category>beta</category>
    </item>
    <item>
      <title>'Gemini Omni 3.5' doesn't exist. Here's the real split.</title>
      <dc:creator>Creeta</dc:creator>
      <pubDate>Thu, 18 Jun 2026 11:38:48 +0000</pubDate>
      <link>https://dev.to/creeta/gemini-omni-35-doesnt-exist-heres-the-real-split-5339</link>
      <guid>https://dev.to/creeta/gemini-omni-35-doesnt-exist-heres-the-real-split-5339</guid>
      <description>&lt;h2&gt;
  
  
  Which Products Hide Behind 'Gemini Omni 3.5'?
&lt;/h2&gt;

&lt;p&gt;"Gemini Omni 3.5" is not a real product name. The shorthand conflates two distinct models Google shipped at &lt;a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-omni/" rel="noopener noreferrer"&gt;Google I/O 2026&lt;/a&gt; on May 19, 2026 : &lt;strong&gt;Gemini 3.5 Flash&lt;/strong&gt;, a fast text-and-multimodal model optimized for agentic coding, and &lt;strong&gt;Gemini Omni Flash&lt;/strong&gt; (model ID: &lt;code&gt;gemini-omni-flash&lt;/code&gt; ), a world model built for video generation with conversational editing. They share no endpoint, no output type, and no plan tier. Using the wrong model ID produces a 404 before you write a useful line of logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quick Answer:&lt;/strong&gt; "Gemini Omni 3.5" does not exist. Google I/O 2026 (May 19) shipped two models: Gemini 3.5 Flash (&lt;code&gt;gemini-3.5-flash&lt;/code&gt;) for fast text and agentic coding on the free tier, and Gemini Omni Flash (&lt;code&gt;gemini-omni-flash&lt;/code&gt;) for video generation on a paid subscription. Free-tier API keys return 403 on video calls.&lt;/p&gt;

&lt;p&gt;The following snippet (verified to run, exit 0) checks the name against a hard-coded table of model families rather than querying the live API. It illustrates that &lt;code&gt;gemini-omni-3.5&lt;/code&gt; is not among the listed identifiers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-3-pro-preview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Gemini: multimodal input -&amp;gt; text output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;imagen-3.0-generate-002&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Imagen: image generation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;veo-3.0-generate-preview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Veo: video generation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash-preview-native-audio-dialog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Gemini Live/audio: realtime voice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;fake&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-omni-3.5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;fake&lt;/span&gt;&lt;span class="si"&gt;!r}&lt;/span&gt;&lt;span class="s"&gt; exists? &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;fake&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Real split:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;role&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;- &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;'gemini-omni-3.5' exists? False
Real split:
- gemini-3-pro-preview: Gemini: multimodal input -&amp;gt; text output
- imagen-3.0-generate-002: Imagen: image generation
- veo-3.0-generate-preview: Veo: video generation
- gemini-2.5-flash-preview-native-audio-dialog: Gemini Live/audio: realtime voice
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Video rendering lives in its own model family (Gemini Omni or Veo), separate from the general-purpose text/reasoning tier. The table below shows the split at a glance.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Property&lt;/th&gt;
&lt;th&gt;Gemini 3.5 Flash&lt;/th&gt;
&lt;th&gt;Gemini Omni Flash&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Model ID&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gemini-3.5-flash&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gemini-omni-flash&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary output&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Text, code, multimodal&lt;/td&gt;
&lt;td&gt;Video (MP4 URI)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary use case&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agentic coding, long-horizon tool use&lt;/td&gt;
&lt;td&gt;Video generation, conversational editing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Plan requirement&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Free tier (AI Studio)&lt;/td&gt;
&lt;td&gt;Google AI Plus, Pro, or Ultra&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
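
&lt;p&gt;The split in the table can be encoded as a small lookup so that application code never builds a model ID from a product nickname. This is a sketch only: &lt;code&gt;model_for&lt;/code&gt; is a hypothetical helper, and the IDs and plan notes are copied from the table above.&lt;/p&gt;

```python
# IDs and plan notes mirror the table above; the helper itself is hypothetical.
MODEL_FOR_OUTPUT = {
    "text": ("gemini-3.5-flash", "free tier"),
    "video": ("gemini-omni-flash", "paid plan"),
}


def model_for(output_kind):
    """Return (model_id, plan_note) for the kind of output you need."""
    if output_kind not in MODEL_FOR_OUTPUT:
        raise ValueError(f"unsupported output kind: {output_kind!r}")
    return MODEL_FOR_OUTPUT[output_kind]


print(model_for("video"))
```

&lt;p&gt;Choosing by output type rather than by name avoids the conflated identifier entirely.&lt;/p&gt;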

&lt;blockquote&gt;
&lt;p&gt;"Gemini Omni is a world model — it generates any output from any input type." — Koray Kavukcuoglu, VP Research &amp;amp; Technology, &lt;a href="https://deepmind.google/models/gemini-omni/" rel="noopener noreferrer"&gt;Google DeepMind&lt;/a&gt;, Google I/O 2026&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Environment Setup: Install and Authenticate
&lt;/h2&gt;

&lt;p&gt;Both models are served through the same SDK. Install it once and you can reach either endpoint from the same client object. Obtain your API key from &lt;a href="https://aistudio.google.com/" rel="noopener noreferrer"&gt;Google AI Studio&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Python&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;google-genai

&lt;span class="c"&gt;# Node.js&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; @google/genai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_GENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-key-here"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Smoke-test before touching the video endpoint.&lt;/strong&gt; A lightweight text call to &lt;code&gt;gemini-3.5-flash&lt;/code&gt; validates auth on the free tier. A 403 here means the key itself is wrong, not that a subscription is missing. That distinction matters because the video endpoint adds a second gate on top:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GOOGLE_GENAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-3.5-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# any response = key is live
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If this passes, credentials are wired correctly. The video endpoint's plan check is addressed in §4.&lt;/p&gt;
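
&lt;p&gt;The status codes mentioned so far map to distinct causes, and writing that mapping down makes failures quicker to triage. The function below is a sketch under the assumptions stated in this article (404 for a wrong model ID, 403 on the text call for a bad key, 403 on video calls for a plan restriction). &lt;code&gt;diagnose&lt;/code&gt; and the &lt;code&gt;endpoint&lt;/code&gt; labels are hypothetical names, not SDK features.&lt;/p&gt;

```python
def diagnose(status_code, endpoint):
    """Map an HTTP status to the likely cause described in this article.

    `endpoint` is "text" or "video"; both labels are illustrative.
    """
    if status_code == 404:
        return "model ID not found: check for a conflated name such as gemini-omni-3.5"
    if status_code == 403 and endpoint == "text":
        return "API key rejected: fix the key before anything else"
    if status_code == 403 and endpoint == "video":
        return "key may be valid, but the plan does not include video generation"
    return f"unclassified status {status_code}"


print(diagnose(403, "video"))
```

&lt;p&gt;Running the text smoke test first means a later 403 on the video call can be attributed to the plan rather than the key.&lt;/p&gt;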

&lt;h2&gt;
  
  
  Generating and Refining Video in Conversation
&lt;/h2&gt;

&lt;p&gt;Work through these four steps in order. Each one validates a prerequisite for the next. All calls use &lt;code&gt;gemini-omni-flash&lt;/code&gt; and require a paid Google AI subscription.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Step 1 — Basic text-to-video.&lt;/strong&gt; The response returns an MP4 URI, not inline bytes. You must poll for &lt;code&gt;COMPLETED&lt;/code&gt; status before fetching the file:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
   &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;

   &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GOOGLE_GENAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

   &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_video&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
       &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-omni-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A slow-motion close-up of coffee being poured over ice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
   &lt;span class="p"&gt;)&lt;/span&gt;

   &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;COMPLETED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
       &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
       &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_video_operation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;operation_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

   &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;video_uri&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# download the MP4 from this URI
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Illustrative; structure follows the google-genai SDK conventions documented at &lt;a href="https://ourcodeworld.com/articles/read/3304/getting-started-with-the-gemini-omni-api-a-node-js-and-python-tutorial" rel="noopener noreferrer"&gt;ourCodeWorld&lt;/a&gt;. Never block synchronously in production — wrap the poll loop in an async worker.&lt;/em&gt;&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;
&lt;strong&gt;Step 2 — Multimodal input.&lt;/strong&gt; Pass a JPEG to anchor the opening frame, or an MP3 to generate narration-synced video. Both can be combined in a single call alongside the text prompt. Image sets a reference frame; audio locks the timing of visuals to the voice track before rendering begins.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 3 — Conversational refinement.&lt;/strong&gt; Initialize a chat session. Each subsequent &lt;code&gt;generate_video()&lt;/code&gt; call in the same session is an incremental edit; the model does not restart generation from scratch:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chats&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-omni-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

   &lt;span class="n"&gt;v1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_video&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A golden retriever on a beach&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}])&lt;/span&gt;
   &lt;span class="c1"&gt;# poll v1 for COMPLETED...
&lt;/span&gt;
   &lt;span class="n"&gt;v2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_video&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Now make it snowing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}])&lt;/span&gt;
   &lt;span class="c1"&gt;# model edits v1 in context — does not re-render from blank slate
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Follow-up turns like &lt;code&gt;"swap the dog for a cat"&lt;/code&gt; or &lt;code&gt;"shift to golden-hour lighting"&lt;/code&gt; continue the chain. Session context persists until you close the object or it times out.&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;
&lt;strong&gt;Step 4 — Motion transfer.&lt;/strong&gt; Supply an MP4 reference clip alongside a text description of the target scene. Gemini Omni extracts motion patterns and aesthetic style from the reference and applies them to a new generation. Useful for maintaining consistent camera pacing across a multi-clip project.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;SynthID + C2PA are always on.&lt;/strong&gt; An imperceptible digital watermark and C2PA Content Credentials are embedded in every output automatically. There is no API flag to suppress them. Determine your distribution platform's AI disclosure requirements before going live.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pitfalls: Generation Time, Quotas, and Plan Gates
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Generation is slow at launch.&lt;/strong&gt; Expect 60–180 seconds per clip. Never block synchronously; implement async polling with exponential backoff on the status endpoint. A synchronous wait inside a web request handler will time out.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Burst quota is tight in the launch window.&lt;/strong&gt; Generating five clips sequentially will likely hit rate limits. Add random jitter (2–5 s) between requests and handle quota-exhaustion errors (HTTP 429 &lt;code&gt;RESOURCE_EXHAUSTED&lt;/code&gt;; the exception class name varies by SDK version) with retry-with-backoff logic rather than letting exceptions surface to users.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plan check is per-request, not per-key issuance.&lt;/strong&gt; A key provisioned before a plan upgrade may behave inconsistently. If you see unexpected 403s after upgrading your subscription, regenerate the key from AI Studio so that the server receives a fresh token bound to the new entitlement.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SynthID and C2PA cannot be stripped.&lt;/strong&gt; Factor this into any stock footage submission, social media scheduling pipeline, or legal disclosure workflow before production. Some platforms have explicit policies around AI-generated and watermarked content.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Physics reasoning degrades on complex multi-object scenes.&lt;/strong&gt; Single-subject prompts produce the most coherent results at launch. Validate with simple inputs first and increase scene complexity incrementally.&lt;/li&gt;
&lt;/ul&gt;
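
&lt;p&gt;The backoff-with-jitter advice above can be sketched without touching any SDK. This is a minimal, synchronous illustration meant to run inside a background worker, not a definitive implementation: &lt;code&gt;poll_with_backoff&lt;/code&gt;, its parameters, and &lt;code&gt;fake_check&lt;/code&gt; are hypothetical names, and &lt;code&gt;check&lt;/code&gt; stands in for whatever status request your client exposes.&lt;/p&gt;

```python
import random
import time


def poll_with_backoff(check, base=2.0, cap=60.0, max_attempts=8,
                      sleep=time.sleep, rng=random.random):
    """Call check() until it returns a non-None result, backing off between tries.

    check is a hypothetical callable standing in for a status request;
    it returns None while the job is still running.
    """
    for attempt in range(max_attempts):
        result = check()
        if result is not None:
            return result
        # Exponential delay capped at `cap`, scaled by a random factor (full jitter).
        delay = min(cap, base * (2 ** attempt)) * rng()
        sleep(delay)
    raise TimeoutError("job did not finish within the retry budget")


# Usage with a fake job that completes on the third status check.
calls = {"n": 0}


def fake_check():
    calls["n"] += 1
    return "video-uri" if calls["n"] == 3 else None


# Inject a recording sleep and a fixed jitter factor so the run is deterministic.
delays = []
outcome = poll_with_backoff(fake_check, sleep=delays.append, rng=lambda: 1.0)
print(outcome, delays)  # video-uri [2.0, 4.0]
```

&lt;p&gt;Injecting &lt;code&gt;sleep&lt;/code&gt; and &lt;code&gt;rng&lt;/code&gt; keeps the retry schedule testable without real waiting; in production, leave both at their defaults.&lt;/p&gt;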

&lt;h2&gt;
  
  
  Beyond Text Prompts: Multimodal Inputs and Production Patterns
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Audio-first generation.&lt;/strong&gt; Supply a recorded MP3 narration alongside a visual brief. Gemini Omni generates video already synchronized to the voice track — no manual audio alignment step required. Practical for explainer content and ad production where the script is finalized before visuals are designed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Google Flow for multi-shot sequences.&lt;/strong&gt; &lt;a href="https://flow.google/" rel="noopener noreferrer"&gt;Google Flow&lt;/a&gt; wraps the same &lt;code&gt;gemini-omni-flash&lt;/code&gt; endpoint with a structured shot-by-shot interface. It reduces manual session management for longer sequences and includes a direct &lt;a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-omni-3-5-videos/" rel="noopener noreferrer"&gt;YouTube Shorts publish path&lt;/a&gt; available from launch. Teams already on Google Workspace can skip the export step entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Two-model pipeline.&lt;/strong&gt; Use Gemini 3.5 Flash upstream for script generation, scene structuring, and metadata extraction; hand off scene prompts to Gemini Omni Flash for rendering. The economics make sense: 3.5 Flash handles reasoning-heavy work on the free tier; Omni Flash consumes paid quota only at render time. Keeping planning and rendering in separate stages also makes each independently testable — you can iterate on scene descriptions cheaply before touching video quota.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is the difference between Gemini Omni and Gemini 3.5 Flash?
&lt;/h3&gt;

&lt;p&gt;They are two separate products announced at Google I/O 2026 on May 19. Gemini 3.5 Flash (&lt;code&gt;gemini-3.5-flash&lt;/code&gt;) is a fast text-and-multimodal model for agentic coding and long-horizon tool use, available on the free tier via AI Studio. Gemini Omni Flash (&lt;code&gt;gemini-omni-flash&lt;/code&gt;) is a world model built specifically for video generation and conversational video editing, and it requires a paid Google AI subscription. They share no endpoint and have no overlapping use case.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do I need a paid plan to call the Gemini Omni video generation endpoint?
&lt;/h3&gt;

&lt;p&gt;Yes. A Google AI Plus, Pro, or Ultra subscription is required. The check runs server-side on every request, not just at key issuance. An API key created on a free account returns HTTP 403 on video generation calls regardless of remaining daily quota. If you see 403s after upgrading, regenerate the key from AI Studio to bind it to the new subscription entitlement.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does conversational video editing work via the API?
&lt;/h3&gt;

&lt;p&gt;Create a chat session with &lt;code&gt;client.chats.create(model="gemini-omni-flash")&lt;/code&gt;. The first &lt;code&gt;generate_video()&lt;/code&gt; call in that session produces the initial clip. Each subsequent call sends a natural-language edit instruction (for example, "change the lighting to dusk" or "replace the car with a bicycle"), and the model applies it as an incremental edit without restarting generation from scratch. Session context persists until you close the session object or it times out server-side.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is SynthID and does it affect commercial use of Gemini Omni outputs?
&lt;/h3&gt;

&lt;p&gt;SynthID is an imperceptible digital watermark embedded in every Gemini Omni output alongside C2PA Content Credentials for provenance verification. It cannot be removed through the API: there is no opt-out flag or post-processing path that strips it. Before commercial distribution or stock footage submission, confirm whether your target platform's terms of service require AI-generated content disclosure or prohibit embedded watermarks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I supply an existing video clip as input to Gemini Omni?
&lt;/h3&gt;

&lt;p&gt;Yes. MP4 clips are accepted as reference input for motion transfer. The model extracts motion patterns and aesthetic style from the reference and applies them to a newly generated scene, combined with a text description of the target output. This produces a new generation informed by the reference — not a direct transformation of the source file. Useful for maintaining consistent camera movement or visual pacing across a multi-clip project.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Build Next
&lt;/h2&gt;

&lt;p&gt;The naming confusion has a practical consequence worth spelling out: any tutorial or sample repo titled "Gemini Omni 3.5" is almost certainly combining advice for two different endpoints with two different plan gates. Read those resources with that filter in mind before adapting the code.&lt;/p&gt;

&lt;p&gt;For new projects, the clearest path is a two-stage pipeline — Gemini 3.5 Flash for planning and scripting on the free tier, Gemini Omni Flash for renders on a paid tier. That keeps iteration cheap and quota consumption predictable. Note that per-second video output pricing for Gemini Omni Flash at Vertex AI has not been published as of 2026-05-31; test with small batches before committing to production scale.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Last updated: 2026-05-31. Based on Google I/O 2026 announcements (May 19, 2026) and google-genai SDK documentation available at launch.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>geminiomni</category>
      <category>videogeneration</category>
      <category>googleio2026</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>You don't pick the RL algorithm — SIA's Feedback loop does</title>
      <dc:creator>Creeta</dc:creator>
      <pubDate>Thu, 18 Jun 2026 10:37:33 +0000</pubDate>
      <link>https://dev.to/creeta/you-dont-pick-the-rl-algorithm-sias-feedback-loop-does-48ki</link>
      <guid>https://dev.to/creeta/you-dont-pick-the-rl-algorithm-sias-feedback-loop-does-48ki</guid>
      <description>&lt;p&gt;SIA (Self Improving AI), released by Hexo Labs on May 26, 2026, is the first open-source framework that co-evolves both an agent's scaffold and its model weights inside a single iterative loop. The MIT-licensed code is on &lt;a href="https://github.com/hexo-ai/sia" rel="noopener noreferrer"&gt;github.com/hexo-ai/sia&lt;/a&gt;. This tutorial walks through the feedback loop logic, prerequisites, and a runnable five-generation LawBench experiment.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Feedback Loop That Decides PPO, GRPO, or EAW
&lt;/h2&gt;

&lt;p&gt;SIA's Feedback-Agent reads full execution trajectories, reward metrics, and task descriptions each generation, then decides whether the next step should be a scaffold edit, a LoRA weight update, or both, and selects the RL algorithm automatically based on the reward shape of the current task. Before SIA, harness-update systems (Darwin Gödel Machine, Hyperagents) and test-time training systems (TTRL, Discover-TTT) were entirely separate research directions. SIA is the first framework to combine both levers in a single self-improving loop, per the &lt;a href="https://arxiv.org/abs/2605.27276" rel="noopener noreferrer"&gt;SIA paper (arXiv:2605.27276)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quick Answer:&lt;/strong&gt; SIA (arXiv:2605.27276, MIT license, May 2026) co-evolves agent scaffold and LoRA weights in a single loop. Run &lt;code&gt;sia --task lawbench --max_gen 5&lt;/code&gt;; the Feedback-Agent picks PPO+GAE, GRPO, or Entropic Advantage Weighting based on reward shape, so no RL algorithm choice is required. On LawBench, the combined harness+weights variant reached 70.1% accuracy, 25.1 percentage points over prior SOTA.&lt;/p&gt;

&lt;p&gt;The three-agent loop: &lt;strong&gt;Meta-Agent&lt;/strong&gt; generates the initial scaffold from a task description and reference implementation; &lt;strong&gt;Task-Specific Agent&lt;/strong&gt; executes against the eval dataset in a sandbox with every step logged as a trajectory; &lt;strong&gt;Feedback-Agent&lt;/strong&gt; (Claude Sonnet 4.6) receives source code, trajectories, metrics, and sample task descriptions, then emits &lt;code&gt;improvement.md&lt;/code&gt; and the next-generation agent.&lt;/p&gt;

&lt;p&gt;RL algorithm selection is driven by reward shape:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PPO+GAE&lt;/strong&gt; — dense step-level rewards, training stability is the binding constraint (LawBench)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GRPO&lt;/strong&gt; — cheap rollouts, episode-end verification (RNA denoising)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Entropic Advantage Weighting (EAW)&lt;/strong&gt; — right-skewed rewards with rare correct solutions (GPU kernels)&lt;/li&gt;
&lt;li&gt;Also available: REINFORCE+KL-to-base, DPO, Best-of-N behavioral cloning&lt;/li&gt;
&lt;/ul&gt;
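
&lt;p&gt;As a reading aid, the mapping in the list above can be written as a small rule table. This is a hypothetical sketch only: in SIA the choice is made by the Feedback-Agent, not by fixed rules, and the function name, the boolean descriptors, and the order in which the rules are checked are all assumptions made for illustration.&lt;/p&gt;

```python
def pick_algorithm(dense_step_rewards, cheap_rollouts, right_skewed_rewards):
    """Hypothetical rule table mirroring the list above; not SIA's actual logic."""
    if right_skewed_rewards:
        return "EAW"        # rare correct solutions dominate the reward signal
    if dense_step_rewards:
        return "PPO+GAE"    # step-level rewards, training stability is the constraint
    if cheap_rollouts:
        return "GRPO"       # cheap rollouts verified at the end of the episode
    return "undecided"      # other cases fall to the remaining listed options


print(pick_algorithm(True, False, False))   # PPO+GAE (LawBench-like)
print(pick_algorithm(False, True, False))   # GRPO (RNA-denoising-like)
print(pick_algorithm(False, False, True))   # EAW (GPU-kernel-like)
```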

&lt;p&gt;SIA benchmark results, May 2026&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Baseline&lt;/th&gt;
&lt;th&gt;Prior SOTA&lt;/th&gt;
&lt;th&gt;SIA-H (harness only)&lt;/th&gt;
&lt;th&gt;SIA-W+H (harness + weights)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LawBench (191-class accuracy)&lt;/td&gt;
&lt;td&gt;13.5%&lt;/td&gt;
&lt;td&gt;45.0%&lt;/td&gt;
&lt;td&gt;50.0%&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;70.1%&lt;/strong&gt; (+25.1 pp over SOTA)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TriMul CUDA kernel (μs, lower=better)&lt;/td&gt;
&lt;td&gt;~13,500 μs&lt;/td&gt;
&lt;td&gt;1,161 μs&lt;/td&gt;
&lt;td&gt;1,017 μs&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;1,017 μs&lt;/strong&gt; (−12.4% vs SOTA)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MAGIC scRNA-seq denoising (mse_norm, higher=better)&lt;/td&gt;
&lt;td&gt;0.048&lt;/td&gt;
&lt;td&gt;0.240&lt;/td&gt;
&lt;td&gt;0.241&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;0.289&lt;/strong&gt; (+20.4% over SOTA)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
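
&lt;p&gt;The deltas quoted in the last column follow directly from the Prior SOTA and SIA-W+H columns. The short check below recomputes them from the table's own figures; the variable names are ours.&lt;/p&gt;

```python
# Figures copied from the benchmark table above.
lawbench_sota, lawbench_sia = 45.0, 70.1     # accuracy in percent
trimul_sota, trimul_sia = 1161.0, 1017.0     # microseconds, lower is better
magic_sota, magic_sia = 0.240, 0.289         # mse_norm, higher is better

# Absolute gain in percentage points for LawBench.
lawbench_gain_pp = round(lawbench_sia - lawbench_sota, 1)
# Relative changes against prior SOTA for the other two tasks.
trimul_reduction_pct = round((trimul_sota - trimul_sia) / trimul_sota * 100, 1)
magic_gain_pct = round((magic_sia - magic_sota) / magic_sota * 100, 1)

print(lawbench_gain_pp, trimul_reduction_pct, magic_gain_pct)  # 25.1 12.4 20.4
```

&lt;p&gt;Note that the LawBench figure is an absolute difference in percentage points, while the other two are relative changes.&lt;/p&gt;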

&lt;blockquote&gt;
&lt;p&gt;"Harness changes and weight updates do not overlap in their effect space: harness iterations produce externalized infrastructure improvements — better parsing, tools, retry logic — while weight updates encode internalized domain knowledge that no prompt engineering alone can reach." — Hexo Labs research team, &lt;a href="https://arxiv.org/html/2605.27276v2" rel="noopener noreferrer"&gt;SIA: Self Improving AI (arXiv:2605.27276v2)&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What You Need: venv, Credentials, and Modal
&lt;/h2&gt;

&lt;p&gt;The Claude backend runs entirely on CPU, so no local GPU is required. Install the package, export your API key, and all four bundled tasks work immediately. LoRA weight updates (rank 32, learning rate 4×10⁻⁵, applied to gpt-oss-120b) run on Modal H100s provisioned on demand. Skip Modal entirely and the loop still runs harness-only iterations, which are cheaper and sufficient to see meaningful eval gains in early generations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude backend (all bundled tasks, no GPU needed):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s1"&gt;'sia-agent[claude]'&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"sk-ant-..."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;OpenHands backend (multi-provider task execution):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s1"&gt;'sia-agent[openhands]'&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GEMINI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Prerequisites at a glance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Anthropic API key&lt;/strong&gt; — required for both backends; runs the harness and Feedback-Agent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini + OpenAI keys&lt;/strong&gt; — only needed with &lt;code&gt;--backend openhands&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Modal account with H100 credits&lt;/strong&gt; — only needed for LoRA weight updates; harness-only runs use no GPU time&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Running a LawBench Generation: Step-by-Step
&lt;/h2&gt;

&lt;p&gt;Three commands take you from a clean environment to a live five-generation self-improving loop on the bundled LawBench task.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Create and activate a virtual environment:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv .venv &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;
&lt;strong&gt;Install SIA with the Claude backend:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s1"&gt;'sia-agent[claude]'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;
&lt;strong&gt;Run 5 generations on LawBench:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   sia &lt;span class="nt"&gt;--task&lt;/span&gt; lawbench &lt;span class="nt"&gt;--max_gen&lt;/span&gt; 5 &lt;span class="nt"&gt;--run_id&lt;/span&gt; 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each generation writes output to &lt;code&gt;runs/run_1/gen_N/&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;target_agent.py&lt;/code&gt; — the evolved scaffold for this generation&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;agent_execution.json&lt;/code&gt; — full execution log and per-step trajectory&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;improvement.md&lt;/code&gt; — Feedback-Agent's rationale for the next change (appears from generation 2 onward)&lt;/li&gt;
&lt;/ul&gt;
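
&lt;p&gt;To see at a glance which of these files each generation produced, a small helper can walk the run directory. This is a sketch under stated assumptions: &lt;code&gt;summarize_run&lt;/code&gt; is a hypothetical helper, not part of the &lt;code&gt;sia&lt;/code&gt; CLI, and it only checks that the files exist because the JSON schema of &lt;code&gt;agent_execution.json&lt;/code&gt; is not described here.&lt;/p&gt;

```python
from pathlib import Path

# File names taken from the list above.
EXPECTED = ("target_agent.py", "agent_execution.json", "improvement.md")


def summarize_run(run_dir):
    """Return {generation_dir_name: [expected files that exist]} for one run."""
    summary = {}
    # Lexicographic sort is fine for single-digit generation counts.
    for gen_dir in sorted(Path(run_dir).glob("gen_*")):
        if gen_dir.is_dir():
            summary[gen_dir.name] = [
                name for name in EXPECTED if (gen_dir / name).is_file()
            ]
    return summary


# Prints one line per generation; prints nothing if the run directory is absent.
for gen, files in summarize_run("runs/run_1").items():
    print(gen, files)
```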

&lt;p&gt;All four bundled tasks run with &lt;code&gt;--task &amp;lt;name&amp;gt;&lt;/code&gt;: &lt;code&gt;gpqa&lt;/code&gt;, &lt;code&gt;lawbench&lt;/code&gt;, &lt;code&gt;longcot-chess&lt;/code&gt;, &lt;code&gt;spaceship-titanic&lt;/code&gt;. Key flags to know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--max_gen&lt;/code&gt; — number of self-improvement generations (default: 3)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--backend claude|openhands&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--meta_model&lt;/code&gt; — model for Feedback/Meta agents (default: &lt;code&gt;haiku&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--task_model&lt;/code&gt; — model for the task-specific agent (default: &lt;code&gt;claude-haiku-4-5-20251001&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

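&lt;p&gt;The snippet below is a runnable toy analogy for the switching behavior, not SIA code: a controller keeps a running reward signal for each available strategy (here two bandit policies, &lt;code&gt;epsilon_greedy&lt;/code&gt; and &lt;code&gt;ucb&lt;/code&gt;, standing in for RL algorithms) and switches when another strategy accumulates a better signal. This code ran to completion (exit 0):&lt;br&gt;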
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;epsilon_greedy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pulls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randrange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ucb&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pulls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pulls&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="n"&gt;algorithms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;epsilon_greedy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;epsilon_greedy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ucb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ucb&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;pulls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;feedback&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;algorithms&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# SIA's feedback loop picks the RL algorithm with the best live reward signal.
&lt;/span&gt;    &lt;span class="n"&gt;chosen_algo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;feedback&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;feedback&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;epsilon_greedy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;algorithms&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;chosen_algo&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pulls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;reward&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.55&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.08&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.08&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;pulls&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reward&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;pulls&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;feedback&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;chosen_algo&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;feedback&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;chosen_algo&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;reward&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;feedback&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ucb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;  &lt;span class="c1"&gt;# new feedback changes the controller's choice
&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;step=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;02&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; sia_selected=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;chosen_algo&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; action=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; reward=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;reward&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Takeaway: you provide feedback; SIA&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s loop chooses the RL algorithm.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Watch the steps right after 05: the boost added to &lt;code&gt;ucb&lt;/code&gt;'s feedback score at step 5 takes effect at the next selection, step 06, where the controller can switch algorithms. SIA's Feedback-Agent applies the same logic at generation granularity: accumulated reward signals reshape algorithm selection each generation, not just each step.&lt;/p&gt;

&lt;h2&gt;
  
  
  Custom Eval Directories: The Expected Layout
&lt;/h2&gt;

&lt;p&gt;To run SIA on your own benchmark, create a directory with this minimum structure and point &lt;code&gt;--task_dir&lt;/code&gt; at it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my-task/
├── data/
│   ├── public/
│   │   ├── task.md          # scoring function + evaluation loop
│   │   └── ...
│   └── private/             # held-out answers (never in scaffold context)
└── reference/
    ├── reference_target_agent.py   # working baseline for Meta-Agent
    └── SAMPLE_TASK_DESCRIPTIONS.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;sia &lt;span class="nt"&gt;--task_dir&lt;/span&gt; ./my-task &lt;span class="nt"&gt;--max_gen&lt;/span&gt; 5 &lt;span class="nt"&gt;--run_id&lt;/span&gt; 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three things worth knowing about this layout:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;task.md&lt;/code&gt; defines the scoring function and evaluation loop — this is what tells SIA what a correct answer looks like, and it is the primary lever for guiding the Feedback loop.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;reference_target_agent.py&lt;/code&gt; gives the Meta-Agent a working starting point. Omit it and the Meta-Agent generates a scaffold from scratch — viable, but slower and lower quality on the first generation.&lt;/li&gt;
&lt;li&gt;Private data in &lt;code&gt;data/private/&lt;/code&gt; stays outside the scaffold's context window at all times. Only the public task description is visible to the running agent — no eval-set contamination.&lt;/li&gt;
&lt;/ul&gt;
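
&lt;p&gt;Before launching a run, the layout above can be sanity-checked with a few lines of Python. This is a minimal sketch: &lt;code&gt;check_task_dir&lt;/code&gt; is a hypothetical helper, not part of the SIA CLI, and the split between required and optional entries is an assumption based on the notes above (only &lt;code&gt;reference_target_agent.py&lt;/code&gt; is described as safe to omit).&lt;/p&gt;

```python
from pathlib import Path

# Paths copied from the layout shown above. Which entries count as
# "required" is an assumption; this helper is not part of SIA itself.
REQUIRED_FILES = [
    "data/public/task.md",
    "reference/SAMPLE_TASK_DESCRIPTIONS.md",
]
REQUIRED_DIRS = ["data/private"]
OPTIONAL_FILES = ["reference/reference_target_agent.py"]


def check_task_dir(task_dir):
    """Return (missing, warnings) for a candidate --task_dir."""
    root = Path(task_dir)
    missing = [p for p in REQUIRED_FILES if not (root / p).is_file()]
    missing += [p for p in REQUIRED_DIRS if not (root / p).is_dir()]
    warnings = [p for p in OPTIONAL_FILES if not (root / p).is_file()]
    return missing, warnings


if __name__ == "__main__":
    missing, warnings = check_task_dir("./my-task")
    for path in missing:
        print(f"missing: {path}")
    for path in warnings:
        print(f"optional, not found: {path}")
```

&lt;p&gt;An empty &lt;code&gt;missing&lt;/code&gt; list only confirms the files exist; it says nothing about whether &lt;code&gt;task.md&lt;/code&gt; defines a usable scoring function.&lt;/p&gt;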

&lt;h2&gt;
  
  
  Friction Points and Extending the Loop
&lt;/h2&gt;

&lt;p&gt;Four patterns that appear reliably in early runs, and what to do about them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Modal H100 cost scales with trajectory length.&lt;/strong&gt; Profile cost on runs 1–3 before committing to 20+ generations with LoRA enabled. Harness-only iterations use no GPU time and produce measurable improvements on their own — establish your harness ceiling first.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Haiku produces shallow reports after a few generations.&lt;/strong&gt; When &lt;code&gt;improvement.md&lt;/code&gt; starts repeating the same edits verbatim, switch to &lt;code&gt;--meta_model claude-sonnet-4-5-20251001&lt;/code&gt;. Sonnet produces richer harness rewrites and more substantive RL algorithm reasoning at higher cost per generation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flat eval scores signal harness exhaustion.&lt;/strong&gt; Unchanged scores across 2–3 consecutive generations mean the Feedback loop has used up the accessible scaffold changes. If you have been running harness-only, this is the signal to enable weight updates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long LoRA runs can exceed 30 min/gen on H100.&lt;/strong&gt; Check &lt;code&gt;agent_execution.json&lt;/code&gt; for trajectory length before pushing &lt;code&gt;--max_gen&lt;/code&gt; beyond 10. Trajectory length is the main driver of per-generation wall time.&lt;/li&gt;
&lt;/ul&gt;
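
&lt;p&gt;The "flat scores" rule can be made mechanical. The sketch below assumes you have already collected one eval score per generation into a plain list; how you extract those numbers from &lt;code&gt;agent_execution.json&lt;/code&gt; depends on its schema, which is not shown here. &lt;code&gt;is_plateau&lt;/code&gt; is a hypothetical helper and the scores are made up for illustration.&lt;/p&gt;

```python
def is_plateau(scores, window=3, tolerance=0.0):
    """True when the last `window` scores differ by at most `tolerance`."""
    if window > len(scores):
        return False  # not enough generations to judge yet
    recent = scores[-window:]
    return tolerance >= max(recent) - min(recent)


# Made-up per-generation scores, oldest first.
history = [0.20, 0.35, 0.41, 0.41, 0.41]

if is_plateau(history, window=3):
    print("Scores flat for 3 generations: consider enabling weight updates.")
else:
    print("Harness edits are still moving the score.")
```

&lt;p&gt;Set &lt;code&gt;window=2&lt;/code&gt; for the more aggressive end of the 2–3 generation rule, or raise &lt;code&gt;tolerance&lt;/code&gt; if your scoring function is noisy.&lt;/p&gt;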

&lt;p&gt;For independent analysis of SIA's architecture and benchmark methodology, see the &lt;a href="https://www.marktechpost.com/2026/05/29/hexo-labs-open-sources-sia-a-self-improving-agent-that-updates-both-the-harness-and-the-model-weights/" rel="noopener noreferrer"&gt;MarkTechPost writeup&lt;/a&gt; and the &lt;a href="https://www.themoonlight.io/en/review/sia-self-improving-ai-with-harness-weight-updates" rel="noopener noreferrer"&gt;Moonlight review&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Does SIA require GPU access to run?
&lt;/h3&gt;

&lt;p&gt;No. Harness edits run entirely on CPU via the Claude API — install &lt;code&gt;sia-agent[claude]&lt;/code&gt;, export &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt;, and run. LoRA weight updates require a Modal account with H100 credits. Skip weight updates entirely by not configuring Modal; the loop still runs and improves the scaffold across generations at no GPU cost.&lt;/p&gt;

&lt;h3&gt;
  
  
  What RL algorithm does SIA select by default for LawBench?
&lt;/h3&gt;

&lt;p&gt;PPO with GAE. LawBench produces dense step-level rewards, and the Feedback loop consistently selects PPO for tasks with that reward structure. GRPO and Entropic Advantage Weighting appear on tasks with sparse or right-skewed reward distributions — RNA denoising and GPU kernel optimization respectively.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I use my own base model instead of gpt-oss-120b for LoRA?
&lt;/h3&gt;

&lt;p&gt;Not out-of-the-box. The LoRA RL loop targets gpt-oss-120b by default. Substituting a different base requires editing the run config and ensuring Modal can load those weights. The MIT license keeps the door open for community contributions supporting alternative bases.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I verify that each generation actually improved?
&lt;/h3&gt;

&lt;p&gt;Read &lt;code&gt;runs/run_{id}/gen_{n}/improvement.md&lt;/code&gt; for the Feedback loop's rationale for that generation. Compare eval scores in &lt;code&gt;agent_execution.json&lt;/code&gt; across generation directories. Flat scores paired with shallow or repetitive improvement notes are the signal to switch to &lt;code&gt;--meta_model sonnet&lt;/code&gt; or enable weight updates.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why does SIA default to haiku for the Feedback and Meta loops?
&lt;/h3&gt;

&lt;p&gt;Cost and latency. Haiku is cheap enough to run across many generations without API costs dominating the experiment budget. Override with &lt;code&gt;--meta_model claude-sonnet-4-5-20251001&lt;/code&gt; when you need richer harness rewrites or more substantive RL algorithm reasoning — typically after generation 3 or 4 when haiku's improvement reports start repeating themselves.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Try Next
&lt;/h2&gt;

&lt;p&gt;Start with a harness-only run on a bundled task, &lt;code&gt;gpqa&lt;/code&gt; or &lt;code&gt;lawbench&lt;/code&gt;, to calibrate generation cost and see what &lt;code&gt;improvement.md&lt;/code&gt; looks like before enabling Modal. The harness-only variant already reaches 50.0% on LawBench against a 13.5% baseline, so it is worth knowing your harness ceiling before spending GPU time on weight updates.&lt;/p&gt;

&lt;p&gt;Once harness gains plateau — flat scores for 2–3 consecutive generations — enable weight updates and compare &lt;code&gt;SIA-H&lt;/code&gt; vs &lt;code&gt;SIA-W+H&lt;/code&gt; performance directly. For custom domains, invest time in &lt;code&gt;task.md&lt;/code&gt; first: a well-specified verifier is what gives the Feedback loop a meaningful signal. A weak or noisy scoring function limits how far either harness edits or weight updates can go, regardless of how many generations you run.&lt;/p&gt;

&lt;p&gt;Full paper: &lt;a href="https://arxiv.org/abs/2605.27276" rel="noopener noreferrer"&gt;arXiv:2605.27276&lt;/a&gt;. Code, task authoring guide, and bundled tasks: &lt;a href="https://github.com/hexo-ai/sia" rel="noopener noreferrer"&gt;github.com/hexo-ai/sia&lt;/a&gt;. Background on Hexo Labs' research program (Stanford, UC Santa Barbara, Oxford partnerships): &lt;a href="https://tfir.io/self-improving-ai-sia-hexo-labs-kunal-bhatia/" rel="noopener noreferrer"&gt;tFiR interview with Hexo Labs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Last updated: 2026-06-01. Article reflects SIA arXiv:2605.27276v2, revised May 28, 2026.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>sia</category>
      <category>selfimprovingai</category>
      <category>hexolabs</category>
      <category>lora</category>
    </item>
    <item>
      <title>Qwen3.6-35B NVFP4 runs on one H100 — A100 owners are out</title>
      <dc:creator>Creeta</dc:creator>
      <pubDate>Thu, 18 Jun 2026 10:37:32 +0000</pubDate>
      <link>https://dev.to/creeta/qwen36-35b-nvfp4-runs-on-one-h100-a100-owners-are-out-e60</link>
      <guid>https://dev.to/creeta/qwen36-35b-nvfp4-runs-on-one-h100-a100-owners-are-out-e60</guid>
      <description>&lt;p&gt;NVIDIA published &lt;a href="https://huggingface.co/nvidia/Qwen3.6-35B-A3B-NVFP4" rel="noopener noreferrer"&gt;nvidia/Qwen3.6-35B-A3B-NVFP4&lt;/a&gt; on May 28, 2026: a post-training FP4-quantized variant of Alibaba's 35B MoE model that fits on a single H100 by cutting VRAM from ~71 GB to ~23 GB. If you're on an A100 or consumer GPU, jump to the "Where It Breaks" section first, because this quantization format does not run on your hardware.&lt;/p&gt;

&lt;h2&gt;
  
  
  71 GB → 23 GB: What Gets Quantized and What Doesn't
&lt;/h2&gt;

&lt;p&gt;NVFP4 quantization targets the weights and activations of linear operators inside transformer and MoE blocks specifically; LayerNorms, embeddings, and biases stay in BF16/F32 for numerical stability. The selective 4-bit compression yields a &lt;strong&gt;3.06× reduction&lt;/strong&gt; in disk footprint and VRAM versus the BF16 base, dropping from roughly 71 GB to ~23 GB on Hopper hardware.&lt;/p&gt;
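
&lt;p&gt;A back-of-envelope calculation shows where numbers of this size come from. The sketch below is an estimate, not NVIDIA's accounting: it assumes exactly 35 billion parameters, and it assumes the NVFP4 layout of 4-bit values in blocks of 16 with one 8-bit scale per block (4.5 bits per quantized weight). Because some tensors stay in BF16/F32, the real checkpoint should land above the all-quantized floor.&lt;/p&gt;

```python
PARAMS = 35e9  # assumption: taken from the "35B" in the model name

# Assumption: 4-bit values in blocks of 16 plus one 8-bit scale per block.
BITS_BF16 = 16
BITS_NVFP4 = 4 + 8 / 16


def weight_gb(params, bits_per_weight):
    """Weight storage in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


bf16_gb = weight_gb(PARAMS, BITS_BF16)
nvfp4_floor_gb = weight_gb(PARAMS, BITS_NVFP4)

print(f"BF16 weights: {bf16_gb:.1f} GB")
print(f"NVFP4 floor if every weight were quantized: {nvfp4_floor_gb:.1f} GB")
print(f"Ratio of the rounded figures 71 / 23: {71 / 23:.2f}x")
```

&lt;p&gt;The gap between the all-quantized floor and the reported ~23 GB is plausibly the unquantized tensors, though the model card is the authority on the exact breakdown.&lt;/p&gt;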

&lt;p&gt;&lt;strong&gt;Quick Answer:&lt;/strong&gt; nvidia/Qwen3.6-35B-A3B-NVFP4 fits a 35B MoE reasoning model on a single H100 by applying 4-bit quantization to linear operator weights and activations, reducing VRAM from ~71 GB to ~23 GB (3.06×) with under 1-point accuracy loss on standard benchmarks. Hopper or Blackwell required — A100 and RTX 4090 lack FP4 compute paths entirely.&lt;/p&gt;

&lt;p&gt;The calibration pipeline used two datasets: &lt;code&gt;cnn_dailymail&lt;/code&gt; (300K+ English news articles) and NVIDIA's &lt;code&gt;Nemotron-Post-Training-Dataset-v2&lt;/code&gt; for multi-turn dialogue coverage, processed with NVIDIA Model Optimizer v0.44.0. The dual-dataset approach is worth noting: a quantization calibrated only on news articles would likely regress on structured, multi-turn instruction-following, and the IFBench result in the table below is consistent with the mix working.&lt;/p&gt;

&lt;p&gt;NVIDIA's official eval suite shows the accuracy gap is narrow. NVFP4 stays within 0.8 points of BF16 on the reasoning and agentic benchmarks (drops of 0.1 to 0.8 points), and marginally outperforms on instruction-following and multimodal tasks:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;BF16&lt;/th&gt;
&lt;th&gt;NVFP4&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MMLU Pro&lt;/td&gt;
&lt;td&gt;85.6&lt;/td&gt;
&lt;td&gt;85.0&lt;/td&gt;
&lt;td&gt;−0.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPQA Diamond&lt;/td&gt;
&lt;td&gt;84.9&lt;/td&gt;
&lt;td&gt;84.8&lt;/td&gt;
&lt;td&gt;−0.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AIME 2025&lt;/td&gt;
&lt;td&gt;89.2&lt;/td&gt;
&lt;td&gt;88.8&lt;/td&gt;
&lt;td&gt;−0.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;τ²-Bench Telecom&lt;/td&gt;
&lt;td&gt;95.5&lt;/td&gt;
&lt;td&gt;94.7&lt;/td&gt;
&lt;td&gt;−0.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SciCode&lt;/td&gt;
&lt;td&gt;40.8&lt;/td&gt;
&lt;td&gt;40.6&lt;/td&gt;
&lt;td&gt;−0.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IFBench&lt;/td&gt;
&lt;td&gt;62.3&lt;/td&gt;
&lt;td&gt;62.8&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+0.5&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MMMU Pro&lt;/td&gt;
&lt;td&gt;74.1&lt;/td&gt;
&lt;td&gt;74.5&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+0.4&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
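
&lt;p&gt;The Delta column can be recomputed directly from the two score columns. The snippet below copies the table rows as-is and derives the spread; nothing in it comes from outside the table.&lt;/p&gt;

```python
# (benchmark, BF16, NVFP4) rows copied from the table above.
ROWS = [
    ("MMLU Pro", 85.6, 85.0),
    ("GPQA Diamond", 84.9, 84.8),
    ("AIME 2025", 89.2, 88.8),
    ("τ²-Bench Telecom", 95.5, 94.7),
    ("SciCode", 40.8, 40.6),
    ("IFBench", 62.3, 62.8),
    ("MMMU Pro", 74.1, 74.5),
]

# Round to one decimal to match the precision of the published scores.
deltas = {name: round(nvfp4 - bf16, 1) for name, bf16, nvfp4 in ROWS}

for name, delta in deltas.items():
    print(f"{name:20s} {delta:+.1f}")

print("largest drop:", min(deltas.values()))
print("largest gain:", max(deltas.values()))
```

&lt;p&gt;Five rows drop by 0.1 to 0.8 points and two rows gain 0.4 to 0.5, so every difference is under one point in either direction.&lt;/p&gt;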

&lt;blockquote&gt;
&lt;p&gt;"The NVFP4 quantized model achieves nearly identical accuracy to the BF16 original while reducing memory requirements by 3.06×, enabling deployment on hardware that would otherwise require tensor parallelism across multiple GPUs." — NVIDIA Model Optimization Team, &lt;a href="https://huggingface.co/nvidia/Qwen3.6-35B-A3B-NVFP4" rel="noopener noreferrer"&gt;nvidia/Qwen3.6-35B-A3B-NVFP4 model card&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Hopper or Blackwell: Why Other Cards Won't Work
&lt;/h2&gt;

&lt;p&gt;FP4 tensor core execution paths exist only on Hopper (H100, H200) and Blackwell (GB200, GB300, DGX Spark GB10) architectures. The RTX 4090 (Ada Lovelace, sm_89) and A100 (Ampere, sm_80) have no native FP4 compute units. Passing &lt;code&gt;--quantization modelopt&lt;/code&gt; on those cards will produce an error at load time or, worse, silently wrong output.&lt;/p&gt;

&lt;p&gt;Your fallback options on non-Hopper/Blackwell hardware:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;BF16 base model&lt;/strong&gt;: Requires ~71 GB VRAM — an RTX PRO 6000 (96 GB) or H100/A100 80 GB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Community GGUF quantizations&lt;/strong&gt;: Run on consumer hardware via llama.cpp. &lt;a href="https://huggingface.co/unsloth/Qwen3.6-35B-A3B-NVFP4" rel="noopener noreferrer"&gt;unsloth/Qwen3.6-35B-A3B-NVFP4&lt;/a&gt; and &lt;a href="https://huggingface.co/AEON-7/Qwen3.6-35B-A3B-heretic-NVFP4" rel="noopener noreferrer"&gt;AEON-7/Qwen3.6-35B-A3B-heretic-NVFP4&lt;/a&gt; offer different quantization trade-offs and broader hardware coverage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;DGX Spark (Blackwell, sm_120/121a) is officially supported but needs extra setup: CUDA 13.0 and the &lt;code&gt;vllm/vllm-openai:cu130-nightly&lt;/code&gt; Docker image. Stable vLLM releases do not yet include the FlashInfer CUTLASS MoE kernels for that architecture. Verify your vLLM build has compressed-tensors NVFP4 support before attempting to serve; a mismatched build will silently fall back or crash at model load.&lt;/p&gt;

&lt;h2&gt;
  
  
  The vLLM Serve Commands: Standard and DGX Spark
&lt;/h2&gt;

&lt;p&gt;The minimum viable Hopper command follows. Two flags matter here: &lt;code&gt;--quantization modelopt&lt;/code&gt; activates NVIDIA Model Optimizer's compressed-tensors backend, and &lt;code&gt;--reasoning-parser qwen3&lt;/code&gt; separates &lt;code&gt;&amp;lt;think&amp;gt;...&amp;lt;/think&amp;gt;&lt;/code&gt; chain-of-thought blocks from the answer text in API responses so callers see clean completions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vllm serve nvidia/Qwen3.6-35B-A3B-NVFP4 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--port&lt;/span&gt; 8000 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--quantization&lt;/span&gt; modelopt &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-model-len&lt;/span&gt; 262144 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--reasoning-parser&lt;/span&gt; qwen3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
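
&lt;p&gt;Once the server is up, any OpenAI-compatible client can talk to it. The sketch below uses only the Python standard library and assumes the server is reachable at &lt;code&gt;localhost:8000&lt;/code&gt; as in the command above. The helper names are hypothetical, and the exact name of the field that carries the parsed reasoning text varies between vLLM versions, so inspect the returned message rather than relying on a fixed key.&lt;/p&gt;

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: matches --port 8000 above
MODEL = "nvidia/Qwen3.6-35B-A3B-NVFP4"


def build_chat_request(prompt, max_tokens=256):
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the request and return the assistant message as a dict."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]


# Usage, with the server running:
#   message = ask("Summarize NVFP4 in one sentence.")
#   print(message["content"])   # answer text without the think block
#   print(sorted(message))      # shows which other fields the server returned
```

&lt;p&gt;Keeping request construction separate from the network call makes the payload easy to inspect or unit-test without a GPU.&lt;/p&gt;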



&lt;p&gt;DGX Spark (Blackwell) requires three environment variables set before launching. Omitting any of them causes a FlashInfer MoE kernel mismatch at startup. The error message is not always explicit about which variable is missing, so set all three:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;VLLM_USE_FLASHINFER_MOE_FP4&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;VLLM_FP8_MOE_BACKEND&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;flashinfer_cutlass
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;FLASHINFER_DISABLE_VERSION_CHECK&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1

vllm serve nvidia/Qwen3.6-35B-A3B-NVFP4 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--quantization&lt;/span&gt; modelopt &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--kv-cache-dtype&lt;/span&gt; fp8 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--attention-backend&lt;/span&gt; flashinfer &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--moe-backend&lt;/span&gt; marlin &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--gpu-memory-utilization&lt;/span&gt; 0.85 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-model-len&lt;/span&gt; 65536 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-num-seqs&lt;/span&gt; 4 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--enable-chunked-prefill&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--enable-prefix-caching&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--speculative-config&lt;/span&gt; &lt;span class="s1"&gt;'{"method":"mtp","num_speculative_tokens":3,"moe_backend":"triton"}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Flag-by-flag breakdown:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--kv-cache-dtype fp8&lt;/code&gt; — halves KV-cache memory versus BF16, directly enabling longer usable context at 0.85 VRAM utilization&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--moe-backend marlin&lt;/code&gt; — selects the Marlin MoE kernel for Blackwell; the default selection may not be optimal on this architecture&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--max-num-seqs 4&lt;/code&gt; — keeps total concurrent sequence memory predictable on constrained VRAM; raise cautiously and watch OOM behavior&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--enable-chunked-prefill&lt;/code&gt; — required on DGX Spark; without it, long prompts OOM well before the 65536-token cap&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--enable-prefix-caching&lt;/code&gt; — reduces time-to-first-token for repeated system prompts in multi-turn chat workloads&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--speculative-config '{"method":"mtp",...}'&lt;/code&gt; — enables the built-in Multi-Token Prediction head; no separate draft model required or loaded&lt;/li&gt;
&lt;/ul&gt;
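
&lt;p&gt;For scripted deployments it can help to keep the environment variables and flags together in one place, so a launch never goes out with one of the three variables missing. The sketch below mirrors the shell commands above exactly; the helper names (&lt;code&gt;missing_env&lt;/code&gt;, &lt;code&gt;launch_env&lt;/code&gt;) are hypothetical and it does not start the server on import.&lt;/p&gt;

```python
import os
import shlex

# Values copied from the export lines above.
SPARK_ENV = {
    "VLLM_USE_FLASHINFER_MOE_FP4": "0",
    "VLLM_FP8_MOE_BACKEND": "flashinfer_cutlass",
    "FLASHINFER_DISABLE_VERSION_CHECK": "1",
}

# Flags copied from the DGX Spark serve command above.
SPARK_ARGS = [
    "vllm", "serve", "nvidia/Qwen3.6-35B-A3B-NVFP4",
    "--quantization", "modelopt",
    "--kv-cache-dtype", "fp8",
    "--attention-backend", "flashinfer",
    "--moe-backend", "marlin",
    "--gpu-memory-utilization", "0.85",
    "--max-model-len", "65536",
    "--max-num-seqs", "4",
    "--enable-chunked-prefill",
    "--enable-prefix-caching",
    "--speculative-config",
    '{"method":"mtp","num_speculative_tokens":3,"moe_backend":"triton"}',
]


def missing_env(env):
    """Names of the three variables that are absent or set differently."""
    return [k for k, v in SPARK_ENV.items() if env.get(k) != v]


def launch_env(base=None):
    """Copy of the environment with all three variables applied."""
    env = dict(os.environ if base is None else base)
    env.update(SPARK_ENV)
    return env


print(shlex.join(SPARK_ARGS))
# To start the server:
#   import subprocess
#   subprocess.run(SPARK_ARGS, env=launch_env(), check=True)
```

&lt;p&gt;Calling &lt;code&gt;missing_env(os.environ)&lt;/code&gt; in a pre-flight check gives a clearer message than the kernel mismatch error at startup.&lt;/p&gt;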

&lt;p&gt;The snippet below (illustrative, not executed; running it requires a CUDA-enabled environment with &lt;code&gt;torch&lt;/code&gt; and &lt;code&gt;transformers&lt;/code&gt; installed) shows how to verify your GPU is Hopper-class before attempting to load the model. The &lt;code&gt;major &amp;lt; 9&lt;/code&gt; check is the key gate, since H100 reports sm_90 and A100 reports sm_80:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;

&lt;span class="n"&gt;MODEL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nvidia/Qwen3.6-35B-A3B-NVFP4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_available&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;SystemExit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CUDA GPU required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_device_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;major&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;minor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_device_capability&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GPU: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (sm_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;major&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="n"&gt;minor&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;major&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;SystemExit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NVFP4 path requires H100-class sm_90+; A100 is sm_80&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;tok&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MODEL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="nf"&gt;tok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen NVFP4 fits on one H100 because&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;max_new_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An A100 hits the &lt;code&gt;SystemExit&lt;/code&gt; before wasting time on a multi-minute model download. Run this check before provisioning storage or bandwidth for the weights.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where It Breaks: Consumer GPUs and Wrong Flags
&lt;/h2&gt;

&lt;p&gt;Four failure modes worth knowing before you lose an hour to a non-obvious error:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;--quantization modelopt&lt;/code&gt; on A100 or RTX 40-series hardware&lt;/strong&gt;: The FP4 matmul path does not exist on Ampere or Ada Lovelace. You get an error at load time or, worse, silently degraded output. Use BF16 or GGUF on those cards; there is no workaround.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Missing env vars on DGX Spark&lt;/strong&gt;: Omitting &lt;code&gt;VLLM_FP8_MOE_BACKEND&lt;/code&gt; or &lt;code&gt;FLASHINFER_DISABLE_VERSION_CHECK&lt;/code&gt; before launch triggers a FlashInfer MoE kernel mismatch. The startup error does not always name the specific missing variable — set all three unconditionally before touching vLLM on Blackwell.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Omitting &lt;code&gt;--reasoning-parser qwen3&lt;/code&gt;&lt;/strong&gt;: The model emits raw &lt;code&gt;&amp;lt;think&amp;gt;...&amp;lt;/think&amp;gt;&lt;/code&gt; blocks in every completion response. Clients parsing JSON completions will see malformed output; streaming clients will surface the thinking chain directly to end users. This flag is not optional.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No &lt;code&gt;--enable-chunked-prefill&lt;/code&gt; on DGX Spark&lt;/strong&gt;: Long prompts OOM well before the 65536-token cap at 0.85 VRAM utilization. Chunked prefill is not a performance optimization on that platform — it is a correctness requirement for any long-context workload.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One operational caveat for DGX Spark production: the &lt;code&gt;vllm/vllm-openai:cu130-nightly&lt;/code&gt; image is not a stable release. Pin to a specific build hash for any deployment you need to reproduce, or wait for a stable vLLM release that includes full NVFP4 Blackwell support upstream.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pushing Further: MTP and Longer Prompts
&lt;/h2&gt;

&lt;p&gt;The built-in MTP speculative decoding head achieves an &lt;strong&gt;85.4% token acceptance rate&lt;/strong&gt; at single-user baseline (512-token outputs), rising to 92.8% at 4,096-token outputs. There is no second draft model to load or manage; the MTP head is baked into the base checkpoint. At concurrency 1, output throughput is 55.9 tokens/s; at concurrency 32, it scales to 433.4 tokens/s. The community AEON-7 DFlash variant reports 117 tok/s greedy decoding on DGX Spark with 62–78% draft acceptance and 2.7–4.4 mean accepted tokens per target step.&lt;/p&gt;
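
&lt;p&gt;To get a feel for what an acceptance rate means in practice, the sketch below applies the standard speculative-decoding estimate for expected tokens per target-model step. It rests on a simplifying assumption that is mine, not NVIDIA's: each drafted token is accepted independently with the same probability, and the published percentages are per-token rates. Treat the result as a rough guide, not a reproduction of the reported throughput.&lt;/p&gt;

```python
def expected_tokens_per_step(acceptance, num_speculative_tokens):
    """Expected tokens emitted per target-model step.

    Assumes each drafted token is accepted independently with probability
    `acceptance`. Even when the first draft is rejected, the target model
    still contributes one token, so the minimum is 1.
    """
    k = num_speculative_tokens
    if acceptance >= 1.0:
        return float(k + 1)
    # Geometric series: 1 + a + a^2 + ... + a^k
    return (1 - acceptance ** (k + 1)) / (1 - acceptance)


# Acceptance rates quoted above, with num_speculative_tokens=3 as in the
# serve command.
for rate in (0.854, 0.928):
    tokens = expected_tokens_per_step(rate, 3)
    print(f"acceptance={rate:.3f} -> about {tokens:.2f} tokens per step")
```

&lt;p&gt;With three speculative tokens the ceiling is four tokens per step, so gains from a higher acceptance rate flatten out quickly; raising &lt;code&gt;num_speculative_tokens&lt;/code&gt; lifts the ceiling at the cost of more draft work per step.&lt;/p&gt;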

&lt;p&gt;The native context window is 131K tokens, extended to 262,144 via RoPE scaling. On DGX Spark, cap &lt;code&gt;--max-model-len&lt;/code&gt; at 65536 to stay within safe VRAM margins at 0.85 utilization. The full 262K context is accessible on H100/H200 with more VRAM headroom. Note that long-context RAG quality under NVFP4 versus BF16 at the 262K limit has not been independently benchmarked as of June 2026; treat that range as best-effort until data appears.&lt;/p&gt;

&lt;p&gt;The same vLLM endpoint handles image and video inputs alongside text once the server is running on Hopper or Blackwell — no additional flags needed for multimodal prompts. Multimodal inference quality under NVFP4 quantization is also unbenchmarked publicly, so evaluate against your specific workload rather than relying on text benchmark results as a proxy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Does Qwen3.6-35B-A3B-NVFP4 work on an RTX 4090 or A100?
&lt;/h3&gt;

&lt;p&gt;No. FP4 tensor core paths require Hopper (H100, H200) or Blackwell (GB200, GB300, DGX Spark) architecture. The RTX 4090 is Ada Lovelace (sm_89) and the A100 is Ampere (sm_80) — neither has native FP4 compute units. On those cards, use community GGUF quantizations via llama.cpp or the BF16 base model if you have 71+ GB VRAM available (RTX PRO 6000 96 GB or H100/A100 80 GB).&lt;/p&gt;

&lt;h3&gt;
  
  
  What does &lt;code&gt;--quantization modelopt&lt;/code&gt; actually do?
&lt;/h3&gt;

&lt;p&gt;It tells vLLM to route weight loading through NVIDIA Model Optimizer's compressed-tensors backend, which understands the NVFP4 format and dispatches matrix multiplications through FP4 tensor cores. Without this flag, vLLM will not recognize the quantization scheme and will either throw an error at startup or attempt to interpret the weights as a different format — neither produces usable output.&lt;/p&gt;

&lt;h3&gt;
  
  
  How much accuracy do you lose with NVFP4 vs BF16?
&lt;/h3&gt;

&lt;p&gt;Under one point on the benchmarks cited here, per NVIDIA's official eval suite. MMLU Pro drops from 85.6 to 85.0 (0.6 points); GPQA Diamond from 84.9 to 84.8 (0.1); AIME 2025 from 89.2 to 88.8 (0.4). On instruction-following (IFBench: 62.8 vs 62.3) and multimodal reasoning (MMMU Pro: 74.5 vs 74.1), NVFP4 marginally outperforms BF16, likely a calibration dataset effect from the multi-turn Nemotron data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do I need a separate draft model for MTP speculative decoding?
&lt;/h3&gt;

&lt;p&gt;No. The Multi-Token Prediction head is embedded in the Qwen3.6-35B-A3B checkpoint itself. Pass &lt;code&gt;--speculative-config '{"method":"mtp","num_speculative_tokens":3,"moe_backend":"triton"}'&lt;/code&gt; to activate it — vLLM uses the model's own MTP head without downloading or loading a second checkpoint.&lt;/p&gt;

&lt;h3&gt;
  
  
  Which vLLM version and CUDA version are required for DGX Spark?
&lt;/h3&gt;

&lt;p&gt;CUDA 13.0 and the &lt;code&gt;vllm/vllm-openai:cu130-nightly&lt;/code&gt; Docker image. Current stable vLLM releases lack FlashInfer CUTLASS MoE kernels for Blackwell sm_120/121a. Pin to a specific nightly build hash for any production deployment: a stable vLLM release with full NVFP4 Blackwell support had not shipped as of June 2026.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Try Next
&lt;/h2&gt;

&lt;p&gt;On a Hopper card, the path is now practical: one &lt;code&gt;vllm serve&lt;/code&gt; command with &lt;code&gt;--quantization modelopt&lt;/code&gt; and &lt;code&gt;--reasoning-parser qwen3&lt;/code&gt;, and you have a 35B reasoning model with 262K context, built-in chain-of-thought handling, and native tool calling — on a single GPU. The 3.06× memory reduction is the operational threshold between needing four-way tensor parallelism and fitting on one card.&lt;/p&gt;

&lt;p&gt;Extend the baseline from here: add &lt;code&gt;--enable-auto-tool-choice --tool-call-parser qwen3&lt;/code&gt; for structured tool calling in agent workloads; toggle thinking mode off for latency-sensitive paths with &lt;code&gt;--default-chat-template-kwargs '{"enable_thinking": false}'&lt;/code&gt;; stress-test the 262K RAG path against your actual document lengths. A &lt;a href="https://huggingface.co/RedHatAI/Qwen3.6-35B-A3B-NVFP4" rel="noopener noreferrer"&gt;RedHatAI mirror&lt;/a&gt; is also on Hugging Face for enterprise environments with registry requirements.&lt;/p&gt;

&lt;p&gt;On DGX Spark: the nightly image dependency is the main operational risk. Track the &lt;a href="https://github.com/AEON-7/Qwen3.6-NVFP4-DFlash" rel="noopener noreferrer"&gt;AEON-7/Qwen3.6-NVFP4-DFlash&lt;/a&gt; repository for community patch status and watch upstream vLLM releases for when Blackwell sm_120/121a kernels land in a stable build. Until then, pin your nightly image hash.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Last updated: 2026-06-01. Based on the &lt;a href="https://huggingface.co/nvidia/Qwen3.6-35B-A3B-NVFP4" rel="noopener noreferrer"&gt;nvidia/Qwen3.6-35B-A3B-NVFP4 model card&lt;/a&gt; (released 2026-05-28) and community deployment reports reviewed as of June 2026.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>qwen3</category>
      <category>nvfp4</category>
      <category>vllm</category>
      <category>nvidia</category>
    </item>
    <item>
      <title>Step 3.7 Flash is a drop-in — except for one endpoint detail</title>
      <dc:creator>Creeta</dc:creator>
      <pubDate>Thu, 18 Jun 2026 09:36:50 +0000</pubDate>
      <link>https://dev.to/creeta/step-37-flash-is-a-drop-in-except-for-one-endpoint-detail-bcf</link>
      <guid>https://dev.to/creeta/step-37-flash-is-a-drop-in-except-for-one-endpoint-detail-bcf</guid>
      <description>&lt;p&gt;Step 3.7 Flash shipped on May 29, 2026  as a structural upgrade to 3.5 Flash: same OpenAI-compatible SDK, new vision encoder, new runtime escalation, and a compute-control flag you can set per request. The migration from 3.5 is two environment variables. One of them has to be exactly right — or every call returns a silent 401.&lt;/p&gt;

&lt;h2&gt;
  
  
  What 3.7 brings that 3.5 didn't
&lt;/h2&gt;

&lt;p&gt;Step 3.7 Flash adds three net-new capabilities over 3.5 Flash: a native 1.8B-parameter ViT encoder that injects image representations directly into the language backbone without a separate model call, an automatic Advisor Mode that routes failure-prone subtasks to a larger model at runtime, and a &lt;code&gt;reasoning_effort&lt;/code&gt; parameter (low / medium / high) as a first-class API flag rather than a prompt-engineering convention. The figure most relevant to production is variance: 3.5 Flash scores ranged from 43% to 73% across different harnesses; 3.7 narrows that to 64.5–71.5%, which matters more for production scheduling than the raw score improvement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quick Answer:&lt;/strong&gt; Step 3.7 Flash is an OpenAI-SDK-compatible model — model string &lt;code&gt;step-3.7-flash&lt;/code&gt;, base URL &lt;code&gt;https://api.stepfun.ai/v1&lt;/code&gt; (global) or &lt;code&gt;https://api.stepfun.com/v1&lt;/code&gt; (China region). New over 3.5: native vision input, automatic Advisor Mode escalation, and a &lt;code&gt;reasoning_effort&lt;/code&gt; flag. The only breaking change from 3.5: base URL must match your account region exactly, or you get a 401 with no error body.&lt;/p&gt;

&lt;p&gt;The architecture is a 198B sparse MoE model with roughly 11B parameters active per forward pass, which means dense-10B compute cost at much larger capacity. SWE-Bench Pro improved to 56.3% from 51.3%; Terminal-Bench 2.1 improved to 59.5% from 53.4%, suggesting the planning and shell-operation gains that matter for coding agents are consistent across benchmarks.&lt;/p&gt;

&lt;p&gt;Advisor Mode carries the headline cost claim from StepFun's internal harness: 97% of Claude Opus 4.6's coding performance at $0.19 vs. $1.76 per task. That's a vendor figure on a first-party SWE-Bench Verified run; treat it as directional until independent replication appears.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Step 3.5 Flash&lt;/th&gt;
&lt;th&gt;Step 3.7 Flash&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Vision input&lt;/td&gt;
&lt;td&gt;External model call&lt;/td&gt;
&lt;td&gt;Native 1.8B ViT encoder&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SWE-Bench Pro&lt;/td&gt;
&lt;td&gt;51.3%&lt;/td&gt;
&lt;td&gt;56.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Benchmark spread&lt;/td&gt;
&lt;td&gt;43–73%&lt;/td&gt;
&lt;td&gt;64.5–71.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;reasoning_effort&lt;/code&gt; flag&lt;/td&gt;
&lt;td&gt;Not available&lt;/td&gt;
&lt;td&gt;low / medium / high&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Advisor Mode&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Automatic (runtime)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context window&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;256k tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Before you send a request
&lt;/h2&gt;

&lt;p&gt;Two environment variables are all that's required: your API key and the base URL for your account's region. The URL is the part that fails silently, so verify it before writing any code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Account region and base URL.&lt;/strong&gt; StepFun runs two separate API domains that share no authentication state:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Global account&lt;/strong&gt; — register at &lt;a href="https://platform.stepfun.ai/docs/en/guides/models/step-3.7-flash" rel="noopener noreferrer"&gt;platform.stepfun.ai&lt;/a&gt;; set &lt;code&gt;STEP_BASE_URL=https://api.stepfun.ai/v1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;China-region account&lt;/strong&gt; — register at platform.stepfun.com; set &lt;code&gt;STEP_BASE_URL=https://api.stepfun.com/v1&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Export both before running any code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;STEP_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"sk-..."&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;STEP_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"https://api.stepfun.ai/v1"&lt;/span&gt;   &lt;span class="c"&gt;# global account&lt;/span&gt;
&lt;span class="c"&gt;# China-region: export STEP_BASE_URL="https://api.stepfun.com/v1"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
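
&lt;p&gt;A typo in the base URL is easy to catch before the first request. The following is a minimal sketch, not part of any StepFun tooling: the helper name &lt;code&gt;check_base_url&lt;/code&gt; is hypothetical, and the allowed values are simply the two URLs listed above.&lt;/p&gt;

```python
import os

# The two regional base URLs named in this article; anything else is treated as a typo.
KNOWN_BASE_URLS = {
    "https://api.stepfun.ai/v1": "global",
    "https://api.stepfun.com/v1": "china",
}


def check_base_url(url):
    """Return the region label for a known base URL, or raise ValueError."""
    normalized = url.strip().rstrip("/")
    if normalized not in KNOWN_BASE_URLS:
        raise ValueError(f"Unexpected STEP_BASE_URL: {url!r}")
    return KNOWN_BASE_URLS[normalized]


if __name__ == "__main__":
    # Falls back to the global URL only so the sketch runs without configuration.
    region = check_base_url(os.environ.get("STEP_BASE_URL", "https://api.stepfun.ai/v1"))
    print(f"STEP_BASE_URL points at the {region} endpoint")
```

&lt;p&gt;This only validates the URL string. It cannot tell which region issued your key, so a well-formed URL for the wrong region still returns a 401.&lt;/p&gt;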



&lt;p&gt;&lt;strong&gt;OpenRouter alternative.&lt;/strong&gt; If you want to skip a StepFun account or consolidate all model routing behind a single proxy, &lt;a href="https://openrouter.ai/stepfun/step-3.7-flash/api" rel="noopener noreferrer"&gt;OpenRouter lists Step 3.7 Flash&lt;/a&gt; under model ID &lt;code&gt;stepfun/step-3.7-flash&lt;/code&gt;. Set base URL to &lt;code&gt;https://openrouter.ai/api/v1&lt;/code&gt; and use your existing OpenRouter key. No StepFun registration required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NVIDIA NIM.&lt;/strong&gt; For enterprise GPU inference, &lt;a href="https://developer.nvidia.com/blog/run-step-3-7-flash-on-nvidia-gpus-with-enterprise-ready-multimodal-ai/" rel="noopener noreferrer"&gt;NVIDIA's NIM containerized endpoint&lt;/a&gt; runs Step 3.7 Flash on Hopper-class GPUs at up to 600 tokens/second, exposes the same OpenAI-compatible interface at &lt;code&gt;http://0.0.0.0:8000/v1&lt;/code&gt;, and supports NeMo-based fine-tuning. Requires an NVIDIA enterprise license.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python dependency:&lt;/strong&gt; &lt;code&gt;pip install openai&lt;/code&gt;. No StepFun-specific SDK or plugin needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Chat, image, and effort level: runnable examples
&lt;/h2&gt;

&lt;p&gt;All four steps below use the standard &lt;code&gt;openai&lt;/code&gt; Python client without modification. The only constructor differences from a standard OpenAI call are &lt;code&gt;api_key&lt;/code&gt; and &lt;code&gt;base_url&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Basic call.&lt;/strong&gt; The SDK call structure is identical for any OpenAI-compatible endpoint. The first snippet is illustrative (not executed in this context) and targets Google's Gemini 2.5 Flash through its OpenAI-compatible endpoint, purely to show the structural pattern; only the credentials, base URL, and model string change for Step 3.7 Flash:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="c1"&gt;# Illustration only: Gemini 2.5 Flash through Google's OpenAI-compatible endpoint (note the /openai/ path).
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GEMINI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://generativelanguage.googleapis.com/v1beta/openai/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Say hello in five words.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Step 3.7 Flash, substitute your StepFun credentials in the constructor and set the model string to &lt;code&gt;step-3.7-flash&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;STEP_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;STEP_BASE_URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;   &lt;span class="c1"&gt;# https://api.stepfun.ai/v1
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;completion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;step-3.7-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain the actor model of concurrency.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;completion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2: &lt;code&gt;reasoning_effort&lt;/code&gt;.&lt;/strong&gt; Pass the parameter directly to &lt;code&gt;create()&lt;/code&gt;. Use &lt;code&gt;high&lt;/code&gt; for complex code review or multi-step planning; use &lt;code&gt;low&lt;/code&gt; for extraction, summarization, or rewriting where latency matters more than depth; omit it entirely to default to &lt;code&gt;medium&lt;/code&gt; for general-purpose tasks. If you later switch base models, test the parameter explicitly, because it may be accepted without error but silently ignored on models that don't support it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;completion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;step-3.7-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;reasoning_effort&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# low | medium | high
&lt;/span&gt;    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Review this code for race conditions: ...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3: Image input.&lt;/strong&gt; Replace the string content with a content array. Add a &lt;code&gt;text&lt;/code&gt; dict and an &lt;code&gt;image_url&lt;/code&gt; dict, the same shape as GPT-4o vision calls. The native 1.8B ViT encoder handles the image directly in the language backbone without routing to an external vision model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;completion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;step-3.7-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;List the interactive elements in this UI screenshot.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example.com/screenshot.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 4: Advisor Mode.&lt;/strong&gt; No parameter required. When the model detects high failure probability on a subtask (repeated errors, complex architectural reasoning), it automatically routes that subtask to a larger model at runtime without any caller intervention. To check whether escalation occurred in a given turn, inspect the response's &lt;code&gt;usage&lt;/code&gt; or &lt;code&gt;metadata&lt;/code&gt; fields; unexpectedly high per-step token counts relative to your base-model baseline are a useful, if indirect, indicator. There is no flag in the current public API to force or suppress escalation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where it breaks and why
&lt;/h2&gt;

&lt;p&gt;Most Step 3.7 Flash failures trace to one of four predictable sources. None produce descriptive error bodies — you have to know what to check.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;401 from the wrong domain.&lt;/strong&gt; The single most common integration failure. &lt;code&gt;STEP_BASE_URL&lt;/code&gt; must match your account's registration domain exactly: global keys work only against &lt;code&gt;api.stepfun.ai/v1&lt;/code&gt;; China-region keys work only against &lt;code&gt;api.stepfun.com/v1&lt;/code&gt;. The 401 response body is empty — no hint about the actual cause. Check the env var before investigating anything else.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;reasoning_effort&lt;/code&gt; silently ignored.&lt;/strong&gt; If the model string has a typo or uses an undocumented alias, the parameter may be accepted with HTTP 200 but have no effect. The only confirmed model string is &lt;code&gt;step-3.7-flash&lt;/code&gt; exactly; no version aliases are currently documented in the official API reference. Verify the string before debugging effort-parameter behavior.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single-run benchmark scores are noise.&lt;/strong&gt; The 7-percentage-point spread (64.5–71.5%) means two runs of the same eval can differ by up to 7 points, so a single run can land well above or below the true baseline. Run at least 5 passes on your task distribution before making a production decision based on a benchmark number.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advisor Mode cost figures are first-party only.&lt;/strong&gt; The $0.19 vs. $1.76 per-task comparison against Claude Opus 4.6 comes from StepFun's own SWE-Bench Verified harness. No independent replication has been published as of May 29, 2026. Don't anchor infrastructure cost projections to this number until third-party results exist.&lt;/li&gt;
&lt;/ul&gt;
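
&lt;p&gt;The multi-pass advice above can be made concrete with a few lines of standard-library Python. This is a minimal sketch: &lt;code&gt;summarize_runs&lt;/code&gt; is a hypothetical helper, and the five scores are made-up placeholders rather than real measurements.&lt;/p&gt;

```python
from statistics import mean


def summarize_runs(scores):
    """Return (mean, spread) for pass rates from repeated eval runs."""
    if not scores:
        raise ValueError("need at least one run")
    return mean(scores), max(scores) - min(scores)


# Hypothetical pass rates (percent) from five runs of the same eval; not real measurements.
runs = [64.5, 68.0, 71.5, 66.0, 70.0]
average, spread = summarize_runs(runs)
print(f"mean={average:.1f}% spread={spread:.1f} points over {len(runs)} runs")
```

&lt;p&gt;Reporting the spread next to the mean makes it obvious when a difference between two models is smaller than the run-to-run variation.&lt;/p&gt;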

&lt;h2&gt;
  
  
  What to explore once it's working
&lt;/h2&gt;

&lt;p&gt;Once the basic call runs, four experiments will give you grounded data on whether Step 3.7 Flash fits your actual workload rather than StepFun's harness.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Measure &lt;code&gt;reasoning_effort&lt;/code&gt; tradeoffs on your own task distribution.&lt;/strong&gt; Run a representative sample at &lt;code&gt;low&lt;/code&gt;, &lt;code&gt;medium&lt;/code&gt;, and &lt;code&gt;high&lt;/code&gt;, and record latency, cost, and quality score for each tier. The optimal setting is workload-specific — the vendor benchmarks don't answer this for your data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test native ViT grounding on layout-sensitive tasks.&lt;/strong&gt; Send a UI screenshot or a chart alongside a structured extraction prompt. The native 1.8B ViT encoder should outperform a tool-chained vision call on tasks where spatial layout matters: form parsing, diagram annotation, UI diffing. Measure it on your actual data; this is a testable, falsifiable claim.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instrument Advisor Mode over a 10+ step agentic loop.&lt;/strong&gt; Log per-step costs across a full run and compare total cost against a single-model baseline at &lt;code&gt;reasoning_effort=high&lt;/code&gt;. The auto-escalation cost delta is invisible on a single call; it becomes meaningful at the loop level.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Head-to-head against DeepSeek V4 Flash on your eval set.&lt;/strong&gt; ClawEval-1.1 shows a roughly 9-point tool-calling robustness gap: 67.1% for Step 3.7 vs. 57.8% for DeepSeek V4 Flash. Domain-specific results vary considerably, so run the comparison on your own task set before committing to either model for production tool-calling workloads.&lt;/li&gt;
&lt;/ul&gt;
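
&lt;p&gt;The first experiment in the list above can be sketched as a small timing harness. Everything here is an assumption for illustration: &lt;code&gt;time_effort_tiers&lt;/code&gt; and &lt;code&gt;fake_call&lt;/code&gt; are hypothetical names, and the stand-in function makes no network request, so the latencies it prints are meaningless until you swap in a real call.&lt;/p&gt;

```python
import time


def time_effort_tiers(call, prompt, tiers=("low", "medium", "high")):
    """Call `call(prompt, effort)` once per tier and record wall-clock latency."""
    results = {}
    for effort in tiers:
        start = time.perf_counter()
        output = call(prompt, effort)
        results[effort] = {
            "latency_s": time.perf_counter() - start,
            "output": output,
        }
    return results


def fake_call(prompt, effort):
    # Stand-in for a real request; replace with a function that wraps
    # client.chat.completions.create(..., reasoning_effort=effort).
    return f"[{effort}] {prompt}"


report = time_effort_tiers(fake_call, "Summarize this changelog.")
for effort, row in report.items():
    print(effort, f"{row['latency_s']:.6f}s", row["output"])
```

&lt;p&gt;Passing the request function in as an argument keeps the harness independent of any one provider, which also makes it reusable for the head-to-head comparison.&lt;/p&gt;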

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why does my Step 3.7 Flash request return a 401 error?
&lt;/h3&gt;

&lt;p&gt;Endpoint region mismatch. Keys issued from &lt;a href="https://platform.stepfun.ai/docs/en/guides/models/step-3.7-flash" rel="noopener noreferrer"&gt;platform.stepfun.ai&lt;/a&gt; (global) only authenticate against &lt;code&gt;api.stepfun.ai/v1&lt;/code&gt;. Keys from the China-region platform only work with &lt;code&gt;api.stepfun.com/v1&lt;/code&gt;. The 401 response body is empty — there is no hint in the error itself about the cause. Fix: confirm &lt;code&gt;STEP_BASE_URL&lt;/code&gt; exactly matches the domain where you registered your account before investigating anything else in your request chain.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is Step 3.7 Flash a true drop-in for the OpenAI API?
&lt;/h3&gt;

&lt;p&gt;Structurally yes. The same &lt;code&gt;openai&lt;/code&gt; Python client, the same &lt;code&gt;messages&lt;/code&gt; array shape, and the same &lt;code&gt;image_url&lt;/code&gt; content format all carry over without modification. Three things differ from a plain OpenAI call: the model string (&lt;code&gt;step-3.7-flash&lt;/code&gt; instead of a GPT variant), the base URL (your regional StepFun endpoint), and &lt;code&gt;reasoning_effort&lt;/code&gt; semantics — OpenAI's o-series uses it as a reasoning-chain depth hint, while Step 3.7 Flash uses it as a direct compute-allocation tier that controls inference cost and speed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do I have to configure Advisor Mode, or is it automatic?
&lt;/h3&gt;

&lt;p&gt;Automatic. No API parameter enables, disables, or triggers it. The model identifies subtasks it predicts will fail (recovering from repeated errors, deep architectural planning steps) and routes them to a larger model at runtime without any caller-side configuration. StepFun's own SWE-Bench Verified harness reports this blended approach reaches 97% of Claude Opus 4.6's coding performance at $0.19 vs. $1.76 per task. Independent replication of that figure has not been published as of the time of writing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I use Step 3.7 Flash for video as well as images?
&lt;/h3&gt;

&lt;p&gt;The ViT architecture is described as video-capable in StepFun's materials, but video input via the public API should be verified against current &lt;a href="https://platform.stepfun.ai/docs/en/guides/models/step-3.7-flash" rel="noopener noreferrer"&gt;platform API documentation&lt;/a&gt; before building on it. Static &lt;code&gt;image_url&lt;/code&gt; objects in the &lt;code&gt;messages&lt;/code&gt; content array are confirmed working today via the native encoder. Don't assume video parity from the architecture description alone — check the current API reference first.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does Step 3.7 Flash compare in price to similar models?
&lt;/h3&gt;

&lt;p&gt;Input tokens cost $0.20/M tokens (cache miss) and $0.04/M tokens (cache hit); output is $1.15/M tokens as of May 2026. For agentic workflows, the more meaningful unit is per-task cost: StepFun claims $0.19 per task with Advisor Mode vs. $1.76 for Claude Opus 4.6 alone. Compare to DeepSeek V4 Flash and similar sparse-MoE models at the task level rather than the token level, because actual token consumption per task varies widely with prompt length, context reuse, and workflow structure.&lt;/p&gt;
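
&lt;p&gt;To turn the per-token prices into a per-request estimate, the arithmetic looks roughly like this. It is a sketch under stated assumptions: the prices are the ones quoted above, while &lt;code&gt;request_cost&lt;/code&gt; and the token counts are hypothetical.&lt;/p&gt;

```python
# Per-million-token prices quoted in this article (USD, as of May 2026).
PRICE_INPUT_MISS = 0.20
PRICE_INPUT_HIT = 0.04
PRICE_OUTPUT = 1.15


def request_cost(miss_tokens, hit_tokens, output_tokens):
    """Estimate the USD cost of one request from its token counts."""
    total = (
        miss_tokens * PRICE_INPUT_MISS
        + hit_tokens * PRICE_INPUT_HIT
        + output_tokens * PRICE_OUTPUT
    )
    return total / 1_000_000


# Hypothetical agent step: 20k fresh input tokens, 80k cached, 4k generated.
cost = request_cost(20_000, 80_000, 4_000)
print(f"${cost:.4f} per request")
```

&lt;p&gt;Summing this over every step of an agent loop gives a per-task figure you can set against the vendor's claim.&lt;/p&gt;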

&lt;h2&gt;
  
  
  The one endpoint detail that's not a drop-in
&lt;/h2&gt;

&lt;p&gt;The migration from Step 3.5 Flash is mechanical: one model string, one env var, and you get vision input, Advisor Mode, and &lt;code&gt;reasoning_effort&lt;/code&gt; without any other code changes. The SDK, the message shape, and the image format are identical to what you already use.&lt;/p&gt;

&lt;p&gt;The only non-drop-in detail is the regional base URL. It produces a silent 401, it has no descriptive error, and it catches most developers on first integration. Set &lt;code&gt;STEP_BASE_URL&lt;/code&gt; to match the domain where you registered, confirm the model string is exactly &lt;code&gt;step-3.7-flash&lt;/code&gt;, and the rest of the call works as written. Track independent benchmark results as they emerge at &lt;a href="https://benchable.ai/models/stepfun/step-3.7-flash-20260528" rel="noopener noreferrer"&gt;Benchable&lt;/a&gt; and monitor API parameter additions via the &lt;a href="https://github.com/stepfun-ai/Step-3.7-Flash" rel="noopener noreferrer"&gt;official GitHub repository&lt;/a&gt; as the API stabilizes.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Last updated: 2026-06-01. Reflects Step 3.7 Flash as released May 29, 2026. Benchmark claims are vendor-reported unless otherwise noted; Advisor Mode cost figures are from StepFun's internal harness and have not been independently replicated as of this date.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>stepfun</category>
      <category>step37flash</category>
      <category>llm</category>
      <category>moe</category>
    </item>
    <item>
      <title>llama-bench skipped FA on capable GPUs — b9437 corrects it</title>
      <dc:creator>Creeta</dc:creator>
      <pubDate>Thu, 18 Jun 2026 09:36:49 +0000</pubDate>
      <link>https://dev.to/creeta/llama-bench-skipped-fa-on-capable-gpus-b9437-corrects-it-42ik</link>
      <guid>https://dev.to/creeta/llama-bench-skipped-fa-on-capable-gpus-b9437-corrects-it-42ik</guid>
      <description>&lt;h2&gt;
  
  
  What flipped in b9437
&lt;/h2&gt;

&lt;p&gt;Build &lt;a href="https://github.com/ggml-org/llama.cpp/releases" rel="noopener noreferrer"&gt;b9437&lt;/a&gt;, published on May 30, 2026 at 20:56 UTC, ships two targeted default-value corrections to &lt;code&gt;llama-bench&lt;/code&gt;. Flash attention (&lt;code&gt;-fa&lt;/code&gt;) shifts from a hard-coded &lt;code&gt;off&lt;/code&gt; to &lt;code&gt;auto&lt;/code&gt; (&lt;code&gt;LLAMA_FLASH_ATTN_TYPE_AUTO&lt;/code&gt;), and the GPU-layer count (&lt;code&gt;-ngl&lt;/code&gt;) changes from the legacy sentinel &lt;code&gt;99&lt;/code&gt; to &lt;code&gt;-1&lt;/code&gt;. Both values now match what &lt;code&gt;llama-server&lt;/code&gt; and &lt;code&gt;llama-cli&lt;/code&gt; already used; the bench tool had simply not been updated to track them until this build.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quick Answer:&lt;/strong&gt; Before b9437 (published May 30, 2026), &lt;code&gt;llama-bench&lt;/code&gt; hard-coded &lt;code&gt;-fa off&lt;/code&gt;, silently skipping flash attention even on CUDA, Metal, and Vulkan hardware. Build b9437 sets the default to &lt;code&gt;-fa auto&lt;/code&gt; and &lt;code&gt;-ngl -1&lt;/code&gt;, matching &lt;code&gt;llama-server&lt;/code&gt; and &lt;code&gt;llama-cli&lt;/code&gt;. Any pre-b9437 baseline on FA-capable hardware needs a flag-matched re-run to remain valid.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/ggml-org/llama.cpp/pull/23714" rel="noopener noreferrer"&gt;PR #23714&lt;/a&gt;, reviewed and merged by maintainers JohannesGaessler and pwilkin, adds the same &lt;code&gt;-fa auto|off|on&lt;/code&gt; tri-state flag to &lt;code&gt;llama-bench&lt;/code&gt; that the rest of the toolchain already supported. With &lt;code&gt;LLAMA_FLASH_ATTN_TYPE_AUTO&lt;/code&gt; as the new default, flash attention activates automatically when the runtime detects a capable backend (CUDA, Metal, Vulkan); on CPU-only hosts it stays off with no error and no output change.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Before b9437&lt;/th&gt;
&lt;th&gt;After b9437&lt;/th&gt;
&lt;th&gt;Behavioral impact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;-fa&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;off&lt;/code&gt; (hard-coded)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;auto&lt;/code&gt; (&lt;code&gt;LLAMA_FLASH_ATTN_TYPE_AUTO&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;GPU-capable hosts bench with FA active by default; pre/post comparisons require explicit flag-matching&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;99&lt;/code&gt; (offload-all sentinel)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;-1&lt;/code&gt; (runtime decides)&lt;/td&gt;
&lt;td&gt;CPU-only builds no longer attempt full GPU offload; eliminates spurious CUDA errors when no GPU is present&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The following toy script does not call &lt;code&gt;llama-bench&lt;/code&gt;; it models the two sets of defaults to show the behavioral gap in concrete terms. On a capable GPU, the pre-b9437 defaults schedule zero FA rows while the b9437 defaults schedule one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;old_llama_bench&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Before b9437, the defaults were ngl=99 and FA off, so no FA rows were scheduled.
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;device&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ngl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fa&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;b9437_llama_bench&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# b9437: default ngl=-1 and -fa auto, which enables FA on capable GPUs.
&lt;/span&gt;    &lt;span class="n"&gt;fa&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kind&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;flash_attn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;device&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ngl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fa&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;fa&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;


&lt;span class="n"&gt;gpu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CUDA0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kind&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;flash_attn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;old&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;old_llama_bench&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gpu&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;new&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;b9437_llama_bench&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gpu&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;capable GPU: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;gpu&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; flash_attn=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;gpu&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;flash_attn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pre-b9437 scheduled FA rows: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fa&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;old&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;b9437 scheduled FA rows: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fa&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;new&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fa&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;old&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fa&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;new&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What you need on hand
&lt;/h2&gt;

&lt;p&gt;Before compiling, confirm you have Git, CMake 3.14+, and a C++17-capable compiler: GCC 11+ or clang 13+ on Linux/macOS, MSVC 2022 on Windows. These are current project minimums; newer versions work fine.&lt;/p&gt;

&lt;p&gt;You also need a GGUF model file. A practical starting point is &lt;code&gt;qwen3-8b-q4_k_m.gguf&lt;/code&gt; — fetch it with &lt;code&gt;huggingface-cli download&lt;/code&gt; or let &lt;code&gt;llama-server&lt;/code&gt;'s &lt;code&gt;--hf&lt;/code&gt; flag pull it at startup. The path goes into &lt;code&gt;llama-bench&lt;/code&gt;'s &lt;code&gt;-m&lt;/code&gt; argument.&lt;/p&gt;

&lt;p&gt;A GPU is optional but required for &lt;code&gt;-fa auto&lt;/code&gt; to activate flash attention. Three backends support it: CUDA for NVIDIA cards, Metal for macOS (enabled by default), and Vulkan for AMD, Intel, and older NVIDIA hardware. On a CPU-only host, &lt;code&gt;-fa auto&lt;/code&gt; stays off — no error, no change to the output format, just standard attention.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hands-on: compile and run the corrected bench
&lt;/h2&gt;

&lt;p&gt;These steps target Linux/macOS. On macOS, where &lt;code&gt;nproc&lt;/code&gt; is not installed by default, use &lt;code&gt;-j$(sysctl -n hw.ncpu)&lt;/code&gt; instead of &lt;code&gt;-j$(nproc)&lt;/code&gt;. On Windows, replace &lt;code&gt;-j$(nproc)&lt;/code&gt; with &lt;code&gt;-j%NUMBER_OF_PROCESSORS%&lt;/code&gt; and run from a Developer Command Prompt for MSVC builds. Full platform-specific options are in &lt;a href="https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md" rel="noopener noreferrer"&gt;docs/build.md&lt;/a&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Clone the repository and confirm your build is at b9437 or later.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   git clone https://github.com/ggml-org/llama.cpp
   &lt;span class="nb"&gt;cd &lt;/span&gt;llama.cpp
   git log &lt;span class="nt"&gt;--oneline&lt;/span&gt; &lt;span class="nt"&gt;-1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The top commit should be the &lt;code&gt;-fa bench&lt;/code&gt; PR or a later commit. Commit hashes are not ordered and continuous builds don't carry semantic version tags, so cross-check against the &lt;a href="https://github.com/ggml-org/llama.cpp/releases" rel="noopener noreferrer"&gt;releases page&lt;/a&gt; if you're unsure.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;&lt;strong&gt;Compile for your backend.&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;CPU-only:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;--config&lt;/span&gt; Release &lt;span class="nt"&gt;-j&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;nproc&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;CUDA (NVIDIA):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="nt"&gt;-DGGML_CUDA&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;--config&lt;/span&gt; Release &lt;span class="nt"&gt;-j&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;nproc&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Metal is on by default on macOS — no extra flag needed. Vulkan (cross-platform AMD/Intel/NVIDIA):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="nt"&gt;-DGGML_VULKAN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;--config&lt;/span&gt; Release &lt;span class="nt"&gt;-j&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;nproc&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;
&lt;strong&gt;Run the benchmark with the b9437 defaults made explicit.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   ./build/bin/llama-bench &lt;span class="nt"&gt;-m&lt;/span&gt; ./models/qwen3-8b-q4_k_m.gguf &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-ngl&lt;/span&gt; &lt;span class="nt"&gt;-1&lt;/span&gt; &lt;span class="nt"&gt;-fa&lt;/span&gt; auto &lt;span class="nt"&gt;-p&lt;/span&gt; 512 &lt;span class="nt"&gt;-n&lt;/span&gt; 128 &lt;span class="nt"&gt;-r&lt;/span&gt; 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;-p 512&lt;/code&gt; sets prompt tokens (prefill throughput), &lt;code&gt;-n 128&lt;/code&gt; sets generated tokens (generation throughput), &lt;code&gt;-r 3&lt;/code&gt; repeats the run three times and averages. Passing these explicitly makes your results reproducible against any build, not just b9437+.&lt;/p&gt;
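&lt;p&gt;If you drive the benchmark from a script, it helps to build the argument list in one place so no flag is left to a build-dependent default. The sketch below only assembles the command shown above; &lt;code&gt;build_bench_cmd&lt;/code&gt; is a hypothetical helper name, and the binary and model paths are the ones used in this walkthrough.&lt;/p&gt;

```python
import shlex


def build_bench_cmd(model_path, fa="auto", ngl=-1, prompt=512, gen=128, reps=3):
    """Build a llama-bench argument list with every default spelled out."""
    if fa not in ("auto", "off", "on"):
        raise ValueError(f"unsupported -fa value: {fa!r}")
    return [
        "./build/bin/llama-bench",
        "-m", str(model_path),
        "-ngl", str(ngl),
        "-fa", fa,
        "-p", str(prompt),
        "-n", str(gen),
        "-r", str(reps),
    ]


cmd = build_bench_cmd("./models/qwen3-8b-q4_k_m.gguf")
print(shlex.join(cmd))
```

&lt;p&gt;The resulting list can be passed to &lt;code&gt;subprocess.run(cmd, check=True)&lt;/code&gt;; logging the printed line next to each result records exactly which flags produced it.&lt;/p&gt;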

&lt;ol start="4"&gt;
&lt;li&gt;
&lt;strong&gt;Confirm FA actually activated.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   ./build/bin/llama-bench &lt;span class="nt"&gt;-m&lt;/span&gt; ./models/qwen3-8b-q4_k_m.gguf &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-ngl&lt;/span&gt; &lt;span class="nt"&gt;-1&lt;/span&gt; &lt;span class="nt"&gt;-fa&lt;/span&gt; auto &lt;span class="nt"&gt;-p&lt;/span&gt; 512 &lt;span class="nt"&gt;-n&lt;/span&gt; 128 &lt;span class="nt"&gt;-r&lt;/span&gt; 3 &lt;span class="nt"&gt;--verbose&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look for &lt;code&gt;flash_attn = 1&lt;/code&gt; in the model load output. If you see &lt;code&gt;flash_attn = 0&lt;/code&gt; on a CUDA host, the backend was compiled without &lt;code&gt;-DGGML_CUDA=ON&lt;/code&gt; — delete your build directory and recompile with the flag.&lt;/p&gt;
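&lt;p&gt;The same check can be automated when the verbose output is captured to a file or a string. This is a rough sketch that assumes the log contains a line of the form &lt;code&gt;flash_attn = 0&lt;/code&gt; or &lt;code&gt;flash_attn = 1&lt;/code&gt; as described above; the exact log text may differ between builds, and the sample excerpt is made up.&lt;/p&gt;

```python
import re

# Matches the "flash_attn = 0/1" line described above; exact log text may vary by build.
FLASH_ATTN_RE = re.compile(r"flash_attn\s*=\s*(\d+)")


def flash_attn_state(log_text):
    """Return True/False for the last reported flash_attn value, or None if absent."""
    matches = FLASH_ATTN_RE.findall(log_text)
    if not matches:
        return None
    return matches[-1] != "0"


sample = "llama_context: n_ctx = 512\nllama_context: flash_attn = 1\n"  # made-up excerpt
print(flash_attn_state(sample))
```

&lt;p&gt;A &lt;code&gt;None&lt;/code&gt; result means the line was missing or not numeric, in which case read the log by hand rather than assume either state.&lt;/p&gt;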

&lt;ol start="5"&gt;
&lt;li&gt;
&lt;strong&gt;Reproduce pre-b9437 behavior for a direct comparison.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   ./build/bin/llama-bench &lt;span class="nt"&gt;-m&lt;/span&gt; ./models/qwen3-8b-q4_k_m.gguf &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-fa&lt;/span&gt; off &lt;span class="nt"&gt;-ngl&lt;/span&gt; 99
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This resets both flags to their pre-b9437 defaults, giving you an apples-to-apples baseline if you have historical numbers to compare against.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where old comparisons break
&lt;/h2&gt;

&lt;p&gt;Any &lt;code&gt;llama-bench&lt;/code&gt; run before b9437 used &lt;code&gt;-fa off&lt;/code&gt; as the implicit default — even on hardware that fully supports flash attention. If you have recorded t/s numbers from those builds and your hardware supports FA, those figures captured the slower attention path without indicating it. To align old results with new defaults, either re-run old baselines with &lt;code&gt;-fa off -ngl 99&lt;/code&gt; (matching the original behavior) or re-run everything with &lt;code&gt;-fa auto&lt;/code&gt; to get forward-comparable numbers. In either case, make the &lt;code&gt;-fa&lt;/code&gt; state an explicit column in your benchmark output going forward.&lt;/p&gt;
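&lt;p&gt;One way to make the &lt;code&gt;-fa&lt;/code&gt; state explicit is to store it with every result and refuse to compare runs whose flags differ. The sketch below is one possible shape for that record; the class and function names are hypothetical and the throughput numbers are placeholders, not measurements.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchResult:
    build: str          # e.g. "b9437"
    fa: str             # "off", "on", or "auto" as passed on the command line
    ngl: int
    tokens_per_s: float


def speedup(baseline: BenchResult, candidate: BenchResult) -> float:
    """Return candidate/baseline throughput, refusing flag-mismatched pairs."""
    if (baseline.fa, baseline.ngl) != (candidate.fa, candidate.ngl):
        raise ValueError(
            f"flag mismatch: baseline fa={baseline.fa} ngl={baseline.ngl}, "
            f"candidate fa={candidate.fa} ngl={candidate.ngl}"
        )
    return candidate.tokens_per_s / baseline.tokens_per_s


# Placeholder numbers for illustration only, not measurements.
old = BenchResult(build="pre-b9437", fa="off", ngl=99, tokens_per_s=100.0)
new = BenchResult(build="b9437", fa="off", ngl=99, tokens_per_s=110.0)
print(f"{speedup(old, new):.2f}x")
```

&lt;p&gt;Raising on a mismatch is a deliberate choice here: a silently computed ratio between an FA-off baseline and an FA-auto run is exactly the comparison this section warns about.&lt;/p&gt;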

&lt;p&gt;The &lt;code&gt;-ngl 99&lt;/code&gt; legacy default also caused a quiet footgun on CPU-only hosts: with no &lt;code&gt;-ngl&lt;/code&gt; flag set, the runtime attempted to load all 99 layers to GPU, triggering CUDA initialization errors even with no GPU present. With &lt;code&gt;-ngl -1&lt;/code&gt;, the runtime skips GPU offload when no backend is detected, removing that noise from logs entirely.&lt;/p&gt;

&lt;p&gt;Multi-Token Prediction gains for Qwen 3.6 27B dense (approximately 77 to 96 t/s on an RTX 4090, a 24% throughput increase via PR #22673) were measured in a separate context from b9437's defaults change. If you're trying to reproduce those figures, verify the &lt;code&gt;-fa&lt;/code&gt; state from the original run; a mismatch gives you a result that is neither the clean MTP baseline nor a combined MTP+FA measurement.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to explore beyond b9437
&lt;/h2&gt;

&lt;p&gt;Three nearby builds are worth pulling alongside b9437:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;b9436 (May 30, 2026, 14:25 UTC):&lt;/strong&gt; The OpenCL backend gains BF16-via-FP16 conversion. If you run BF16-format models on AMD or Intel hardware via the OpenCL or Vulkan path, this expands compatibility without requiring native BF16 support in the GPU.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;b9439 (May 31, 2026, 06:57 UTC):&lt;/strong&gt; Multi-GPU hosts now default to using only one integrated GPU, preventing automatic selection of a low-performance iGPU alongside a discrete card. If you run a hybrid system (a laptop with a discrete GPU and Intel UHD, for example), verify your device selection is still correct after updating.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Token Prediction for Qwen 3.6 (PR #22673):&lt;/strong&gt; Approximately 24% throughput gain on dense 27B models. Enable with &lt;code&gt;--mtp-n-draft&lt;/code&gt; and confirm your GGUF quant is compatible. MoE variants (Qwen 3.6 35B-A3B) show mixed results, because expert-union verifier overhead can negate the gains on consumer hardware.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What does -fa auto mean in llama-bench after b9437?
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;-fa auto&lt;/code&gt; sets flash attention to &lt;code&gt;LLAMA_FLASH_ATTN_TYPE_AUTO&lt;/code&gt;, telling the runtime to enable flash attention when the backend supports it. Before b9437, &lt;code&gt;llama-bench&lt;/code&gt; always defaulted to &lt;code&gt;-fa off&lt;/code&gt; — unlike &lt;code&gt;llama-server&lt;/code&gt; and &lt;code&gt;llama-cli&lt;/code&gt;, which already had the tri-state &lt;code&gt;auto|off|on&lt;/code&gt; flag. After b9437, all three tools use the same flag semantics.&lt;/p&gt;

&lt;h3&gt;
  
  
  Are pre-b9437 llama-bench numbers still valid?
&lt;/h3&gt;

&lt;p&gt;Yes, with caveats. If your original run explicitly passed &lt;code&gt;-fa off&lt;/code&gt;, or the host hardware does not support flash attention, the numbers remain comparable. If you relied on the default and ran on FA-capable hardware — CUDA, Metal, or Vulkan — those measurements were taken without flash attention even though the GPU supported it. Re-run with matched flags to produce a clean, apples-to-apples comparison.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why did -ngl default to 99 instead of -1?
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;99&lt;/code&gt; was a legacy sentinel meaning "offload all layers to GPU." The project later standardized &lt;code&gt;-1&lt;/code&gt; as the runtime-decides value across the toolchain. &lt;code&gt;llama-bench&lt;/code&gt; was simply never updated to match until b9437 brought it into alignment with &lt;code&gt;llama-server&lt;/code&gt; and &lt;code&gt;llama-cli&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do I need to recompile from source to get b9437?
&lt;/h3&gt;

&lt;p&gt;Yes, for a local source build: pull the latest commit from &lt;a href="https://github.com/ggml-org/llama.cpp" rel="noopener noreferrer"&gt;ggml-org/llama.cpp&lt;/a&gt; and recompile. Tagged binary releases lag the continuous builds. Check the &lt;a href="https://github.com/ggml-org/llama.cpp/releases" rel="noopener noreferrer"&gt;GitHub releases page&lt;/a&gt; for a pre-built artifact if you want to skip compilation, but verify the build number includes the b9437 changes before treating it as current.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does -fa auto now behave the same across llama-bench, llama-cli, and llama-server?
&lt;/h3&gt;

&lt;p&gt;Yes — b9437 closes the gap. &lt;code&gt;llama-cli&lt;/code&gt; and &lt;code&gt;llama-server&lt;/code&gt; already supported the &lt;code&gt;-fa auto|off|on&lt;/code&gt; tri-state. b9437 brings &lt;code&gt;llama-bench&lt;/code&gt; into parity, so flag semantics are now consistent across all three tools. A flag value you validated in &lt;code&gt;llama-server&lt;/code&gt; means exactly the same thing when passed to &lt;code&gt;llama-bench&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rebaseline before your next regression run
&lt;/h2&gt;

&lt;p&gt;After pulling b9437 or later, the immediate action is straightforward: re-baseline any &lt;code&gt;llama-bench&lt;/code&gt; results used for regression tracking, and make the &lt;code&gt;-fa&lt;/code&gt; state an explicit column in your output going forward. The default change is a minor toolchain alignment, but its effect on benchmark validity is concrete — any pre-b9437 run on CUDA, Metal, or Vulkan was silently measuring the slower attention path.&lt;/p&gt;

&lt;p&gt;If you're on a multi-GPU system, pull at least b9439 alongside for the iGPU default fix. And if Qwen 3.6 throughput is in your test matrix, keep Multi-Token Prediction's &lt;code&gt;--mtp-n-draft&lt;/code&gt; flag in scope: the roughly 24% gain on dense 27B is worth measuring, but MoE variant results vary enough that you'll want numbers from your own hardware and quant configuration.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Last updated: 2026-06-01. Based on llama.cpp continuous builds b9436–b9439 (May 30–31, 2026).&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llamacpp</category>
      <category>llm</category>
      <category>gguf</category>
      <category>flashattention</category>
    </item>
    <item>
      <title>Opus 4.8 kills budget_tokens — here's what else moved</title>
      <dc:creator>Creeta</dc:creator>
      <pubDate>Thu, 18 Jun 2026 08:57:34 +0000</pubDate>
      <link>https://dev.to/creeta/opus-48-kills-budgettokens-heres-what-else-moved-5fl6</link>
      <guid>https://dev.to/creeta/opus-48-kills-budgettokens-heres-what-else-moved-5fl6</guid>
      <description>&lt;h2&gt;
  
  
  What Opus 4.8 Adds Over 4.7
&lt;/h2&gt;

&lt;p&gt;Anthropic shipped Claude Opus 4.8 on May 28, 2026, 41 days after Opus 4.7, a compressed cycle driven partly by competitive pressure from OpenAI Codex and Google Gemini Flash. The headline change is the removal of &lt;code&gt;budget_tokens&lt;/code&gt;: passing it now returns a 400. Beyond that, four additions land: a fast-throughput mode, mid-session system messages, a lower prompt-cache floor, and publicly documented refusal &lt;code&gt;stop_details&lt;/code&gt;. Code running cleanly on Opus 4.7 requires no other changes to move to 4.8.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quick Answer:&lt;/strong&gt; Swap &lt;code&gt;claude-opus-4-7&lt;/code&gt; for &lt;code&gt;claude-opus-4-8&lt;/code&gt; and remove any &lt;code&gt;budget_tokens&lt;/code&gt; usage — it returns a 400. New in this release: &lt;code&gt;speed: "fast"&lt;/code&gt; (up to 2.5× throughput at $10/$50 per million tokens), mid-session &lt;code&gt;role: "system"&lt;/code&gt; messages, a 1,024-token prompt-cache floor (down from ~2,000), and documented refusal &lt;code&gt;stop_details&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Change&lt;/th&gt;
&lt;th&gt;Opus 4.7&lt;/th&gt;
&lt;th&gt;Opus 4.8&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Throughput mode&lt;/td&gt;
&lt;td&gt;Standard only&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;speed: "fast"&lt;/code&gt; — up to 2.5× output rate, $10/$50 per M tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mid-session system messages&lt;/td&gt;
&lt;td&gt;Not supported&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;role: "system"&lt;/code&gt; in messages array, no beta header&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompt cache minimum&lt;/td&gt;
&lt;td&gt;~2,000 tokens&lt;/td&gt;
&lt;td&gt;1,024 tokens — no code changes needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;stop_details&lt;/code&gt; on refusals&lt;/td&gt;
&lt;td&gt;Present but undocumented&lt;/td&gt;
&lt;td&gt;Publicly documented; categorizes refusal type&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Standard pricing&lt;/td&gt;
&lt;td&gt;$5/$25 per M tokens&lt;/td&gt;
&lt;td&gt;$5/$25 per M tokens — unchanged&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
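&lt;p&gt;Before switching the model ID, it can help to locate every place a request still carries &lt;code&gt;budget_tokens&lt;/code&gt;. The sketch below works on plain request dictionaries and does not call the API. The helper names are hypothetical, and the &lt;code&gt;thinking&lt;/code&gt; layout in the sample request is an assumption about where the field appears in 4.7-era code; it deliberately only reports the field rather than guessing what should replace it.&lt;/p&gt;

```python
def find_key(obj, key, path="request"):
    """Yield dotted paths to every occurrence of key in nested dicts/lists."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            child = f"{path}.{k}"
            if k == key:
                yield child
            yield from find_key(v, key, child)
    elif isinstance(obj, list):
        for i, item in enumerate(obj):
            yield from find_key(item, key, f"{path}[{i}]")


def migrate_model_id(params):
    """Return a copy of params with the 4.7 model ID swapped for 4.8."""
    updated = dict(params)
    if updated.get("model") == "claude-opus-4-7":
        updated["model"] = "claude-opus-4-8"
    return updated


# Hypothetical 4.7-era request; the "thinking" layout is an assumption.
legacy = {
    "model": "claude-opus-4-7",
    "max_tokens": 1024,
    "thinking": {"type": "enabled", "budget_tokens": 2048},
    "messages": [{"role": "user", "content": "Hello"}],
}
print(list(find_key(legacy, "budget_tokens")))
print(migrate_model_id(legacy)["model"])
```

&lt;p&gt;Running the finder over stored request templates in CI is one way to catch a stray &lt;code&gt;budget_tokens&lt;/code&gt; before it turns into a 400 in production.&lt;/p&gt;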

&lt;h2&gt;
  
  
  Paid Plans and Enrollment Checklist
&lt;/h2&gt;

&lt;p&gt;Claude Code and fast mode are not available on Anthropic's free tier. Before writing any Opus 4.8 code, work through this checklist to avoid hitting enrollment errors in production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude Code access&lt;/strong&gt;: Requires a paid Anthropic subscription (Pro, Max, Teams, or Enterprise) or API console credits — the free tier does not include it (video: Tech With Tim).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fast mode enrollment&lt;/strong&gt;: Research preview, gated separately. Enroll via the &lt;a href="https://www.anthropic.com/claude/opus" rel="noopener noreferrer"&gt;Anthropic console&lt;/a&gt; before adding &lt;code&gt;speed: "fast"&lt;/code&gt; to any request. Sending the parameter without prior enrollment returns an error.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Platform and context window&lt;/strong&gt;: Claude API, Amazon Bedrock, and Vertex AI each provide a 1 million token context window. Microsoft Foundry caps at 200k tokens; verify your target platform before designing long-context pipelines.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output limits&lt;/strong&gt;: Maximum synchronous output is 128k tokens. The Message Batches API raises this to 300k tokens per call with the beta header &lt;code&gt;output-300k-2026-03-24&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Knowledge cutoff&lt;/strong&gt;: January 2026.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SDK version&lt;/strong&gt;: Confirm &lt;code&gt;anthropic&amp;gt;=0.51&lt;/code&gt; is installed before deploying.&lt;/li&gt;
&lt;/ul&gt;
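&lt;p&gt;The SDK version item in the checklist can be verified from Python itself. This is a minimal sketch using only the standard library; &lt;code&gt;version_tuple&lt;/code&gt; and &lt;code&gt;meets_minimum&lt;/code&gt; are hypothetical helper names, and the comparison ignores pre-release suffixes, so treat a borderline result with care.&lt;/p&gt;

```python
import re
from importlib import metadata


def version_tuple(version):
    """Turn '0.51.0' into (0, 51, 0); only the first three numeric parts are used."""
    return tuple(int(part) for part in re.findall(r"\d+", version)[:3])


def meets_minimum(installed, minimum="0.51"):
    """Return True when the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


try:
    installed = metadata.version("anthropic")
    print(installed, "ok" if meets_minimum(installed) else "too old")
except metadata.PackageNotFoundError:
    print("anthropic is not installed")
```

&lt;p&gt;Running this at deploy time fails early on an outdated environment instead of surfacing as an unexpected parameter error later.&lt;/p&gt;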

&lt;h2&gt;
  
  
  How to Call Each Capability in Opus 4.8
&lt;/h2&gt;

&lt;p&gt;Five call patterns cover everything new or changed in Opus 4.8. All examples use the Python SDK (&lt;code&gt;anthropic&amp;gt;=0.51&lt;/code&gt;). Steps 4 and 5 are the most migration-sensitive; read the next section before deploying either.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Basic call — update the model ID&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Change &lt;code&gt;model&lt;/code&gt; to &lt;code&gt;claude-opus-4-8&lt;/code&gt;. All other request structure from 4.7 carries over unchanged.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;

   &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
   &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
       &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain the migration in one sentence.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
   &lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;&lt;strong&gt;Fast mode — add &lt;code&gt;speed="fast"&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Pass &lt;code&gt;speed="fast"&lt;/code&gt; as a top-level parameter (not inside the model string). Fast mode doubles the per-token cost to $10 input / $50 output per million tokens. Confirm console enrollment before deploying; unenrolled requests return an error.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
       &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;speed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fast&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Generate a 500-word summary.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
   &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;&lt;strong&gt;Mid-session system messages&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Insert &lt;code&gt;{"role": "system", "content": "..."}&lt;/code&gt; into the &lt;code&gt;messages&lt;/code&gt; array immediately after any user turn. No beta header required. Earlier turns stay cached, so you pay only for the injected delta.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
       &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Analyze this codebase.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
       &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Starting with the entry points...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
       &lt;span class="c1"&gt;# Inject updated instruction mid-session — no beta header needed
&lt;/span&gt;       &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Focus only on security-critical paths from here.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
       &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Continue with the auth module.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
   &lt;span class="p"&gt;]&lt;/span&gt;

   &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
       &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2048&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;
   &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
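&lt;p&gt;In a longer agent loop, the injection step can be wrapped in a helper so earlier turns are never rewritten. This is a sketch under assumptions: &lt;code&gt;with_system_update&lt;/code&gt; is a hypothetical name, and the placement of the system entry simply mirrors the example in this step. It only builds the list and makes no API call.&lt;/p&gt;

```python
def with_system_update(history, instruction):
    """Return a new message list with a system entry appended.

    The original list is left untouched so earlier turns can be reused as-is.
    """
    return [*history, {"role": "system", "content": instruction}]


history = [
    {"role": "user", "content": "Analyze this codebase."},
    {"role": "assistant", "content": "Starting with the entry points..."},
]
updated = with_system_update(history, "Focus only on security-critical paths from here.")
updated.append({"role": "user", "content": "Continue with the auth module."})
print([message["role"] for message in updated])
```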



&lt;ol start="4"&gt;
&lt;li&gt;&lt;strong&gt;Adaptive thinking — omit &lt;code&gt;budget_tokens&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Pass &lt;code&gt;thinking={"type": "adaptive"}&lt;/code&gt; and set the effort level separately through &lt;code&gt;output_config={"effort": "..."}&lt;/code&gt;. Accepted values: &lt;code&gt;"low"&lt;/code&gt;, &lt;code&gt;"high"&lt;/code&gt; (default), &lt;code&gt;"max"&lt;/code&gt;. The model decides per turn whether extended reasoning is warranted (video: Skill Leap AI). Passing &lt;code&gt;budget_tokens&lt;/code&gt; returns a 400.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
       &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;thinking&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;adaptive&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
       &lt;span class="n"&gt;output_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;effort&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;  &lt;span class="c1"&gt;# or "low", "max"
&lt;/span&gt;       &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Audit this function for security issues.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
   &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
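&lt;p&gt;To avoid typos in the effort value, the two parameters can be generated together. A minimal sketch, assuming the three effort values listed in this step are the complete set; &lt;code&gt;adaptive_thinking_kwargs&lt;/code&gt; and &lt;code&gt;ALLOWED_EFFORT&lt;/code&gt; are hypothetical names, not SDK features.&lt;/p&gt;

```python
ALLOWED_EFFORT = ("low", "high", "max")  # values listed in this article


def adaptive_thinking_kwargs(effort="high"):
    """Build the thinking-related keyword arguments for an adaptive request."""
    if effort not in ALLOWED_EFFORT:
        raise ValueError(f"effort must be one of {ALLOWED_EFFORT}, got {effort!r}")
    return {
        "thinking": {"type": "adaptive"},
        "output_config": {"effort": effort},
    }


print(adaptive_thinking_kwargs("low"))
```

&lt;p&gt;The returned dictionary can be unpacked into the call, for example &lt;code&gt;client.messages.create(**adaptive_thinking_kwargs("max"), ...)&lt;/code&gt;, which keeps &lt;code&gt;budget_tokens&lt;/code&gt; out of the request by construction.&lt;/p&gt;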



&lt;ol start="5"&gt;
&lt;li&gt;&lt;strong&gt;Reading &lt;code&gt;stop_details&lt;/code&gt; on refusals&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Check &lt;code&gt;response.stop_details.type&lt;/code&gt; to branch by category rather than parsing free-text content. The following snippet is illustrative — see the &lt;a href="https://platform.claude.com/docs/en/about-claude/models/whats-new-claude-4-8" rel="noopener noreferrer"&gt;Opus 4.8 changelog&lt;/a&gt; for all documented type values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_refusal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
       &lt;span class="c1"&gt;# Wrapped in a function so `return` is valid; `client` comes from step 1
&lt;/span&gt;       &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
           &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
       &lt;span class="p"&gt;)&lt;/span&gt;

       &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stop_reason&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;refusal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
           &lt;span class="n"&gt;refusal_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stop_details&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;
           &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;refusal_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;harmful_content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
               &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content_policy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;action&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;revise_prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
           &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;refusal_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;privacy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
               &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;privacy_policy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;action&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;remove_pii&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
           &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
               &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;refusal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;refusal_type&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
       &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
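&lt;p&gt;As the number of categories grows, a lookup table may be easier to maintain than an &lt;code&gt;if&lt;/code&gt;/&lt;code&gt;elif&lt;/code&gt; chain. The sketch below is an assumption-laden illustration: &lt;code&gt;REFUSAL_ACTIONS&lt;/code&gt; and &lt;code&gt;classify&lt;/code&gt; are hypothetical names, the two category strings are taken from the illustrative snippet above rather than from a verified list, and a &lt;code&gt;SimpleNamespace&lt;/code&gt; stands in for a real response so that it runs offline.&lt;/p&gt;

```python
from types import SimpleNamespace

# Refusal categories taken from the illustrative snippet in this article;
# the real set of values may differ.
REFUSAL_ACTIONS = {
    "harmful_content": {"error": "content_policy", "action": "revise_prompt"},
    "privacy": {"error": "privacy_policy", "action": "remove_pii"},
}


def classify(response):
    """Map a response to an error dict, or None when it is not a refusal."""
    if response.stop_reason != "refusal":
        return None
    details = getattr(response, "stop_details", None)
    refusal_type = getattr(details, "type", None)
    # Unknown or missing categories fall back to a generic refusal entry.
    return REFUSAL_ACTIONS.get(refusal_type, {"error": "refusal", "type": refusal_type})


# Stand-in object instead of a live API response.
fake = SimpleNamespace(stop_reason="refusal", stop_details=SimpleNamespace(type="privacy"))
print(classify(fake))
```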



&lt;p&gt;The following script prints a before/after summary of the request changes from an Opus 4.6-style request to Opus 4.8. It only builds and prints dictionaries (it makes no API call), and it ran to completion with exit code 0:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="n"&gt;old&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thinking&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enabled&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;budget_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;32000&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;new&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thinking&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;adaptive&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_config&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;effort&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;speed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fast&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;budget_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;removed: use adaptive thinking + effort instead&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;before&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;old&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;after&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;new&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;also_moved&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;effort_default&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sampling&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;omit temperature/top_p/top_k non-defaults&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_cache_min_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_output_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;128000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output from the above script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"budget_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"removed: use adaptive thinking + effort instead"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"before"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"claude-opus-4-6"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"thinking"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"enabled"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"budget_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;32000&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"after"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"claude-opus-4-8"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"thinking"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"adaptive"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"output_config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"effort"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"high"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"speed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fast"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"also_moved"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"effort_default"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"high"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"sampling"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"omit temperature/top_p/top_k non-defaults"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"prompt_cache_min_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"max_output_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;128000&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  400 Errors and Other Failure Points
&lt;/h2&gt;

&lt;p&gt;Opus 4.8 inherits the same sampling constraints as Opus 4.7. These four failure points account for the majority of migration bugs, particularly when upgrading from Opus 3 or Sonnet:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;thinking: {"type": "enabled", "budget_tokens": N}&lt;/code&gt; → 400.&lt;/strong&gt; Deprecated since Opus 4.7 and unsupported in 4.8. Replace with &lt;code&gt;thinking={"type": "adaptive"}&lt;/code&gt; plus &lt;code&gt;output_config={"effort": "..."}&lt;/code&gt;. This is the single most common mistake when switching from extended-thinking-era code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Any non-default &lt;code&gt;temperature&lt;/code&gt;, &lt;code&gt;top_p&lt;/code&gt;, or &lt;code&gt;top_k&lt;/code&gt; → 400.&lt;/strong&gt; Omit these parameters entirely. Steer output style through prompting techniques instead — this constraint is unchanged from 4.7.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;speed: "fast"&lt;/code&gt; without enrollment → API error.&lt;/strong&gt; This is not a 400 but a distinct enrollment-check error. Confirm console access before shipping any fast-mode code path.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Microsoft Foundry: 200k token context, not 1M.&lt;/strong&gt; If your pipeline was designed around the 1M window on the Claude API, Bedrock, or Vertex AI, it behaves differently on Foundry. Verify your target platform before committing to a retrieval or long-document architecture.&lt;/li&gt;
&lt;/ul&gt;
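&lt;p&gt;The first two failure points can be caught before a request is sent. The following is a rough sketch of such a lint step, with hypothetical names (&lt;code&gt;find_400_risks&lt;/code&gt;, &lt;code&gt;SAMPLING_KEYS&lt;/code&gt;). It is deliberately stricter than the rule described above: it flags any sampling parameter that is present at all, since the default values are not listed here.&lt;/p&gt;

```python
SAMPLING_KEYS = ("temperature", "top_p", "top_k")


def find_400_risks(request):
    """Return warnings for parameters this article says are rejected."""
    warnings = []
    thinking = request.get("thinking") or {}
    if "budget_tokens" in thinking or thinking.get("type") == "enabled":
        warnings.append("thinking: switch to type 'adaptive' and drop budget_tokens")
    # Stricter than the documented rule: presence alone triggers a warning.
    for key in SAMPLING_KEYS:
        if key in request:
            warnings.append(f"{key}: omit this parameter")
    return warnings


legacy = {
    "model": "claude-opus-4-8",
    "thinking": {"type": "enabled", "budget_tokens": 32000},
    "temperature": 0.2,
}
for warning in find_400_risks(legacy):
    print(warning)
```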

&lt;p&gt;On the behavioral side: per &lt;a href="https://techcrunch.com/2026/05/28/anthropic-releases-opus-4-8-with-new-dynamic-workflow-tool/" rel="noopener noreferrer"&gt;TechCrunch&lt;/a&gt;'s reporting on early enterprise testing, Opus 4.8 is approximately four times less likely than Opus 4.7 to allow code flaws to pass unremarked, and it proactively flags input/output uncertainties in analytical pipelines, a pattern confirmed by Bridgewater Associates during pre-launch evaluation. If your evals measure output volume rather than accuracy, recalibrate before moving to production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Going Further With Opus 4.8
&lt;/h2&gt;

&lt;p&gt;Three capabilities in Opus 4.8 are worth queuing up for exploration, though two remain in research preview as of June 2026 with no confirmed GA date:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic Workflows (Claude Code, research preview)&lt;/strong&gt;: Decomposes a large task into a plan, then fans it across hundreds of parallel subagents in a single session. Designed for codebase-scale work: migrations across hundreds of thousands of lines, from kickoff to merge, using your existing test suite as the quality bar. Token consumption is substantially higher than single-agent flows; budget accordingly before enabling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Message Batches API: 300k output tokens per call&lt;/strong&gt;: Add the beta header &lt;code&gt;output-300k-2026-03-24&lt;/code&gt; to raise the per-call output ceiling from 128k to 300k tokens. Practical for large-document generation or batch summarization at scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mid-session system messages for cost reduction&lt;/strong&gt;: In multi-turn agentic loops, inject only the changed instruction slice after each turn while leaving earlier cached turns intact. Savings compound with session length, because there is no need to retransmit the full system prompt on every call.&lt;/li&gt;
&lt;/ul&gt;
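&lt;p&gt;For the Message Batches item, the request entries and the beta header can be prepared separately from the network call. This sketch only builds the data and sends nothing. It assumes the beta value is passed in an &lt;code&gt;anthropic-beta&lt;/code&gt; header, that each batch entry uses the &lt;code&gt;custom_id&lt;/code&gt; plus &lt;code&gt;params&lt;/code&gt; shape, and that "300k" means 300,000; &lt;code&gt;build_batch_requests&lt;/code&gt; is a hypothetical helper name.&lt;/p&gt;

```python
# Header value from this article; the header name is an assumption.
BETA_HEADER = {"anthropic-beta": "output-300k-2026-03-24"}


def build_batch_requests(prompts, max_tokens=300_000):
    """Build one batch entry per prompt, keyed by a generated custom_id."""
    return [
        {
            "custom_id": f"doc-{index}",
            "params": {
                "model": "claude-opus-4-8",
                "max_tokens": max_tokens,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        for index, prompt in enumerate(prompts)
    ]


requests = build_batch_requests(["Summarize report A.", "Summarize report B."])
print(len(requests), requests[0]["custom_id"], BETA_HEADER)
```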

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is the model ID for Claude Opus 4.8?
&lt;/h3&gt;

&lt;p&gt;Use &lt;code&gt;claude-opus-4-8&lt;/code&gt;. The model is available on the Claude API, Amazon Bedrock, and Vertex AI with a 1 million token context window, and on Microsoft Foundry with a 200k token limit. Standard pricing is $5 per million input tokens and $25 per million output tokens.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why does setting &lt;code&gt;budget_tokens&lt;/code&gt; cause a 400 error on Opus 4.8?
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;thinking: {"type": "enabled", "budget_tokens": N}&lt;/code&gt; syntax was deprecated starting with Opus 4.7 and is unsupported in 4.8. The correct form is &lt;code&gt;thinking={"type": "adaptive"}&lt;/code&gt; with a separate &lt;code&gt;output_config={"effort": "..."}&lt;/code&gt; parameter — values are &lt;code&gt;"low"&lt;/code&gt;, &lt;code&gt;"high"&lt;/code&gt;, or &lt;code&gt;"max"&lt;/code&gt;, defaulting to &lt;code&gt;"high"&lt;/code&gt;. Omit &lt;code&gt;budget_tokens&lt;/code&gt; entirely.&lt;/p&gt;
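
&lt;p&gt;A minimal sketch of that migration, working on plain request dictionaries rather than any SDK call. The helper names &lt;code&gt;build_thinking_params&lt;/code&gt; and &lt;code&gt;migrate_legacy_request&lt;/code&gt; are hypothetical; the field names and effort values are the ones quoted in the answer above.&lt;/p&gt;

```python
ALLOWED_EFFORT = ("low", "high", "max")  # effort values listed in this FAQ


def build_thinking_params(effort="high"):
    """Return the adaptive-thinking fields in the form quoted above."""
    if effort not in ALLOWED_EFFORT:
        raise ValueError(f"effort must be one of {ALLOWED_EFFORT}, got {effort!r}")
    # No budget_tokens key: the FAQ says to omit it entirely.
    return {
        "thinking": {"type": "adaptive"},
        "output_config": {"effort": effort},
    }


def migrate_legacy_request(request, effort="high"):
    """Copy a request dict, replacing a budget_tokens thinking block if present."""
    migrated = dict(request)
    thinking = migrated.get("thinking") or {}
    if "budget_tokens" in thinking:
        migrated.update(build_thinking_params(effort))
    return migrated


old = {
    "model": "claude-opus-4-8",
    "thinking": {"type": "enabled", "budget_tokens": 8000},
}
new = migrate_legacy_request(old)
print(new["thinking"], new["output_config"])
```

&lt;p&gt;The original dictionary is left untouched, so the old and new payloads can be compared side by side during the migration.&lt;/p&gt;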

&lt;h3&gt;
  
  
  What does fast mode cost compared to standard Opus 4.8?
&lt;/h3&gt;

&lt;p&gt;Standard Opus 4.8 costs $5 input / $25 output per million tokens. Fast mode costs $10 input / $50 output per million tokens, exactly double, but delivers up to 2.5× higher output token rate. Fast mode is a research preview and requires separate enrollment in the Anthropic console before any requests will succeed.&lt;/p&gt;
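
&lt;p&gt;To make the doubling concrete, here is a small cost calculation using the prices quoted above. The token counts (200,000 input, 20,000 output) are made-up figures for illustration, and &lt;code&gt;request_cost&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
# Per-million-token prices quoted in this FAQ (USD).
PRICES = {
    "standard": {"input": 5.00, "output": 25.00},
    "fast": {"input": 10.00, "output": 50.00},
}


def request_cost(mode, input_tokens, output_tokens):
    """Return the USD cost of one request at the given mode's rates."""
    price = PRICES[mode]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Illustrative workload: 200k input tokens, 20k output tokens.
standard = request_cost("standard", 200_000, 20_000)
fast = request_cost("fast", 200_000, 20_000)
print(f"standard: ${standard:.2f}, fast: ${fast:.2f}")
```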

&lt;h3&gt;
  
  
  Can I set &lt;code&gt;temperature&lt;/code&gt; or &lt;code&gt;top_p&lt;/code&gt; with Opus 4.8?
&lt;/h3&gt;

&lt;p&gt;No. Any non-default &lt;code&gt;temperature&lt;/code&gt;, &lt;code&gt;top_p&lt;/code&gt;, or &lt;code&gt;top_k&lt;/code&gt; value returns a 400 error — the same constraint as Opus 4.7. Omit these parameters entirely and control output style through prompting instead.&lt;/p&gt;
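
&lt;p&gt;One way to guard against this during migration is to strip the three parameters before a request is sent. The sketch below works on a plain dictionary; &lt;code&gt;strip_sampling_params&lt;/code&gt; is a hypothetical helper, not part of any SDK.&lt;/p&gt;

```python
SAMPLING_KEYS = ("temperature", "top_p", "top_k")


def strip_sampling_params(request):
    """Return a copy of the request without sampling keys, plus the keys removed."""
    cleaned = {k: v for k, v in request.items() if k not in SAMPLING_KEYS}
    removed = [k for k in SAMPLING_KEYS if k in request]
    return cleaned, removed


request = {"model": "claude-opus-4-8", "max_tokens": 1024, "temperature": 0.2}
cleaned, removed = strip_sampling_params(request)
print(cleaned, removed)
```

&lt;p&gt;Returning the removed keys lets you log which call sites still set them, so they can be cleaned up at the source.&lt;/p&gt;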

&lt;h3&gt;
  
  
  Do I need to change any code to benefit from the lower prompt cache threshold?
&lt;/h3&gt;

&lt;p&gt;No code changes required. Opus 4.8 drops the minimum cacheable prompt from approximately 2,000 tokens to 1,024 tokens automatically. Prompts between 1,024 and 2,000 tokens that previously missed caching now qualify, cutting repeat-call input costs without any migration work on your end.&lt;/p&gt;

&lt;h2&gt;
  
  
  Watch / Sources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=ntDIxaeo3Wg" rel="noopener noreferrer"&gt;Tech With Tim — Claude Code - Full Tutorial for Beginners&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=jeA-KBv0b68" rel="noopener noreferrer"&gt;Fireship — Claude just got another superpower...&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=2bJK7DckfcY" rel="noopener noreferrer"&gt;Skill Leap AI — The New Claude Opus 4.7 Can Actually Do This Now&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What to Migrate, What to Monitor
&lt;/h2&gt;

&lt;p&gt;The Opus 4.8 upgrade is one line for most codebases: change the model ID. The substantive migration work is removing &lt;code&gt;budget_tokens&lt;/code&gt; if you were on extended thinking, and confirming no sampling parameters are set. The four new capabilities — fast mode, mid-session system prompts, the lower cache floor, and &lt;code&gt;stop_details&lt;/code&gt; routing — each add a handful of lines, not a structural rearchitecture.&lt;/p&gt;

&lt;p&gt;Fast mode and Dynamic Workflows remain research previews with no confirmed GA date. The Mythos-class models expected to outperform Opus 4.8 on coding benchmarks were described as arriving "in the coming weeks" as of the release date, with safety reviews still ongoing. For now, &lt;code&gt;claude-opus-4-8&lt;/code&gt; is the production ceiling on every supported platform.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Last updated: 2026-06-01. Based on &lt;a href="https://platform.claude.com/docs/en/about-claude/models/whats-new-claude-4-8" rel="noopener noreferrer"&gt;Anthropic's Opus 4.8 release notes&lt;/a&gt; and model documentation published May 28, 2026.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claude</category>
      <category>opus48</category>
      <category>anthropic</category>
      <category>llm</category>
    </item>
    <item>
      <title>Composer 2.5 hits near-frontier at 60x lower spend</title>
      <dc:creator>Creeta</dc:creator>
      <pubDate>Thu, 18 Jun 2026 08:57:33 +0000</pubDate>
      <link>https://dev.to/creeta/composer-25-hits-near-frontier-at-60x-lower-spend-3fgc</link>
      <guid>https://dev.to/creeta/composer-25-hits-near-frontier-at-60x-lower-spend-3fgc</guid>
      <description>&lt;p&gt;Cursor's Composer 2.5 landed on May 18, 2026, and the interesting part isn't a new model — it's what the team did to an old one. The gains come almost entirely from post-training, not a fresh checkpoint.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Cursor Trained Into 2.5: RL-Heavy Post-Training on Kimi K2.5
&lt;/h2&gt;

&lt;p&gt;Composer 2.5 runs on the &lt;strong&gt;same open-weights base as Composer 2&lt;/strong&gt;: Moonshot AI's Kimi K2.5, a 1.04T-parameter mixture-of-experts model with 32B active parameters. There is no architecture swap. Cursor's launch graphic states that &lt;strong&gt;~85% of Composer 2.5's total compute went into additional Composer training and RL&lt;/strong&gt; rather than a new checkpoint, so every capability gain lives in the fine-tuning stack, not the weights it started from.&lt;/p&gt;

&lt;p&gt;What that stack added is concrete:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;25× more synthetic tasks than Composer 2&lt;/strong&gt;, including feature-deletion exercises grounded in real codebases.&lt;/li&gt;
&lt;li&gt;More complex RL environments, plus &lt;strong&gt;targeted textual feedback injected at the exact trajectory point of each error&lt;/strong&gt; (tool-call failures, style deviations, instruction drift), implemented via an on-policy distillation KL loss.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One infrastructure detail is worth noting for anyone tracking large-model training: Cursor reports &lt;strong&gt;sharded Muon with distributed orthogonalization and dual-mesh HSDP&lt;/strong&gt;, hitting a 0.2s optimizer step on a 1T-parameter model. Read against the independent &lt;a href="https://artificialanalysis.ai/articles/cursor-composer-2-5-coding-agent-index" rel="noopener noreferrer"&gt;Artificial Analysis Coding Agent Index&lt;/a&gt;, where Composer 2.5 jumped +14 points over Composer 2, the takeaway is that a heavier post-training pass, not a bigger or different base, drove the bulk of the improvement.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Need Before Invoking Composer 2.5
&lt;/h2&gt;

&lt;p&gt;Before you can select Composer 2.5, you need a current Cursor build and a clear picture of how it bills. The model requires &lt;strong&gt;Cursor 3.4 or newer&lt;/strong&gt; (version 3.5 was current as of May 20, 2026), so run &lt;code&gt;Cursor → Check for Updates&lt;/code&gt; and relaunch so that the model name appears in the picker.&lt;/p&gt;

&lt;p&gt;Composer 2.5 ships in two tiers that run &lt;strong&gt;identical model weights&lt;/strong&gt;: the tier changes per-token cost and latency only, not output quality. &lt;strong&gt;Fast&lt;/strong&gt; is the launch default and runs on hotter, pricier hardware, so cost-conscious users should switch to &lt;strong&gt;Standard&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Input ($/M)&lt;/th&gt;
&lt;th&gt;Output ($/M)&lt;/th&gt;
&lt;th&gt;Cache read ($/M)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Standard&lt;/td&gt;
&lt;td&gt;$0.50&lt;/td&gt;
&lt;td&gt;$2.50&lt;/td&gt;
&lt;td&gt;$0.20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fast (default)&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;td&gt;$0.20&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Both tiers draw from one shared &lt;strong&gt;Auto + Composer usage pool&lt;/strong&gt; on individual plans, with a $0.20/M cache-read rate either way. A first-week double-usage promo began May 18, 2026; as of June 2, 2026, treat it as expired and confirm under Usage in your Cursor dashboard before counting on it.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Select, Prompt, and Verify Composer 2.5
&lt;/h2&gt;

&lt;p&gt;Selecting Composer 2.5 is a three-click path inside a current Cursor build, but the payoff comes from how you prompt it: write a verifiable success condition into the task and the model will self-correct toward it, because it was RL-trained against test verification [1][4]. The workflow below moves from updating the editor to encoding conventions and keeping a durable rollback path.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Update Cursor.&lt;/strong&gt; Composer 2.5 needs a recent build — Cursor 3.4 or newer, with 3.5 current as of May 20, 2026. Run &lt;code&gt;Cursor → Check for Updates&lt;/code&gt; and relaunch when prompted.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Select the model.&lt;/strong&gt; Open Agent with &lt;code&gt;Cmd/Ctrl+I&lt;/code&gt; (or a chat session), click the model name in the prompt input, and pick &lt;code&gt;composer-2.5&lt;/code&gt;. The same dropdown appears in the inline-edit editor under &lt;code&gt;Cmd/Ctrl+K&lt;/code&gt; [1][4].&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set billing per workload.&lt;/strong&gt; Click the tier selector and switch to &lt;strong&gt;Standard&lt;/strong&gt; for background, cloud, or long agent loops; enable &lt;strong&gt;Fast&lt;/strong&gt; only when you need lower latency on short inline edits. Both tiers run identical weights and draw from one usage pool, so the choice changes per-token cost and first-token latency, not output quality [1][4].&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write a success condition into every substantive prompt.&lt;/strong&gt; Give the agent a real end state — for example, &lt;em&gt;"all existing tests stay green and the endpoint returns 422 on invalid input."&lt;/em&gt; The model was trained against test verification and iterates toward that target instead of stopping at first-draft code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ask it to plan before touching files.&lt;/strong&gt; For a safer first pass, prompt: &lt;em&gt;"identify the likely files, propose a minimal test-backed patch — do not write code yet."&lt;/em&gt; Reviewing the plan before edits land catches misreads early.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Encode conventions in rules.&lt;/strong&gt; Put project standards in &lt;code&gt;.cursor/rules&lt;/code&gt; (&lt;code&gt;.mdc&lt;/code&gt; files) or &lt;code&gt;AGENTS.md&lt;/code&gt;. Cursor supports Project, User, and Team Rules; keep each rule actionable, scoped, under 500 lines, and checked into git when it is project-specific.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep a rollback path.&lt;/strong&gt; The agent creates checkpoints — local snapshots you can roll back to — and &lt;code&gt;agent resume&lt;/code&gt; recovers an interrupted session. Treat these as convenience layers: Git remains the durable version control, so commit before long autonomous runs [3][4].&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;One pro tip from the Cursor team is to lean on the agent's clarifying questions rather than over-specifying upfront (video: Cursor) — describe the goal and the verification, and let it surface the files it needs to read.&lt;/p&gt;

&lt;h2&gt;
  
  
  Before You Rely on It: Billing Reality and the Independent Verdict
&lt;/h2&gt;

&lt;p&gt;Before you trust Composer 2.5 with a long session, check your billing selector: &lt;strong&gt;Fast is the launch default&lt;/strong&gt;, and it runs the same model weights as Standard at a higher per-token rate. Fast costs $3.00/M input and $15.00/M output versus Standard's $0.50/M and $2.50/M, which is 6× the rate for no quality gain, only faster first tokens. Switch to Standard for everything but latency-sensitive inline edits.&lt;/p&gt;
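
&lt;p&gt;A quick way to see what the tier choice means for a real session is to price the same token counts at both rates. The sketch below uses the per-million prices from the table earlier in this article; the session size (2M input, 300k output, 5M cache-read tokens) is an invented example and &lt;code&gt;session_cost&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
# Per-million-token prices from the pricing table (USD).
TIERS = {
    "standard": {"input": 0.50, "output": 2.50, "cache_read": 0.20},
    "fast": {"input": 3.00, "output": 15.00, "cache_read": 0.20},
}


def session_cost(tier, input_tokens, output_tokens, cache_read_tokens=0):
    """Return the USD cost of a session at the given tier's rates."""
    price = TIERS[tier]
    total = (
        input_tokens * price["input"]
        + output_tokens * price["output"]
        + cache_read_tokens * price["cache_read"]
    )
    return total / 1_000_000


# Invented session: 2M input, 300k output, 5M cache-read tokens.
standard = session_cost("standard", 2_000_000, 300_000, 5_000_000)
fast = session_cost("fast", 2_000_000, 300_000, 5_000_000)
print(f"standard: ${standard:.2f}, fast: ${fast:.2f}")
```

&lt;p&gt;Because cache reads cost the same on both tiers, the effective multiple for a cache-heavy session comes out below the 6× headline rate; substitute your own token counts from the Usage dashboard to see where you land.&lt;/p&gt;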

&lt;p&gt;Is the capability claim real? Independent measurement broadly corroborates Cursor without matching it exactly. Artificial Analysis's Coding Agent Index scores Composer 2.5 at &lt;strong&gt;62&lt;/strong&gt;, third overall, behind Claude Code with Opus 4.7 (66) and Codex with GPT-5.5 (65). That is a +14-point jump over Composer 2's 48, driven by SWE-Bench-Pro-Hard-AA rising +35 points, from 12% to 47%. By contrast, CursorBench v3.1, where Cursor reports 63.2%, is an internal benchmark built from real Cursor sessions, not a neutral public leaderboard.&lt;/p&gt;

&lt;p&gt;Two caveats deserve weight before you build a workflow on it. First, supply chain: the base weights are Moonshot AI's Kimi K2.5, a third-party open checkpoint, and the from-scratch model Cursor is training with SpaceXAI on Colossus 2 (targeting ~10× the compute) has not shipped and is not part of 2.5. Second, there is no independent confirmation that Fast and Standard are quality-equivalent on real production repositories: Cursor claims identical weights, and that is the only evidence on the table. Treat the cost win as proven and the equivalence as a vendor assertion until your own diffs say otherwise.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where to Take It: Worktrees, Rules Files, and Mixing With Other Models
&lt;/h2&gt;

&lt;p&gt;The strongest setup is not Composer 2.5 alone but Composer 2.5 as the cheap default in a multi-model rotation. A common routing pattern: send refactors and medium agent loops to Composer 2.5 Standard, route architectural reasoning to a frontier model like Opus 4.7, and hand terminal-heavy or multi-shell work to GPT-5.5. Git worktrees let all three operate on one repository in parallel without stepping on each other, since each task gets its own isolated branch.&lt;/p&gt;
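
&lt;p&gt;The routing pattern can be written down as a small lookup table. This is only a sketch: the task categories and the &lt;code&gt;pick_model&lt;/code&gt; helper are invented for illustration, and the strings are human-readable labels rather than real model identifiers.&lt;/p&gt;

```python
DEFAULT_MODEL = "Composer 2.5 Standard"

# Hypothetical task categories mapped to the models named in the pattern above.
ROUTES = {
    "refactor": DEFAULT_MODEL,
    "agent_loop": DEFAULT_MODEL,
    "architecture": "Opus 4.7",
    "terminal": "GPT-5.5",
}


def pick_model(task_kind):
    """Return the model label for a task kind, falling back to the cheap default."""
    return ROUTES.get(task_kind, DEFAULT_MODEL)


for kind in ("refactor", "architecture", "terminal", "docs"):
    print(f"{kind}: {pick_model(kind)}")
```

&lt;p&gt;Falling back to the cheap default keeps escalation deliberate: a task reaches a frontier model only when someone has added a route for it.&lt;/p&gt;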

&lt;p&gt;Three CLI flags make that workflow practical: &lt;code&gt;--mode=plan&lt;/code&gt; inspects and proposes before touching files, &lt;code&gt;--worktree&lt;/code&gt; isolates each task on its own branch, and &lt;code&gt;--print&lt;/code&gt; runs non-interactively for scripting and CI.&lt;/p&gt;
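
&lt;p&gt;For scripting, those flags can be assembled into an argument list before anything is executed. The sketch below only builds and prints the command. It assumes the executable is named &lt;code&gt;agent&lt;/code&gt; (as in &lt;code&gt;agent resume&lt;/code&gt;), that the three flags can be combined, and that the prompt is passed as a trailing positional argument; none of that is confirmed here, so verify it against your installed CLI's help output.&lt;/p&gt;

```python
import shlex


def build_agent_command(prompt, plan_only=True, worktree=True, non_interactive=True):
    """Assemble an argv list from the three flags described above (hypothetical helper)."""
    argv = ["agent"]  # assumed executable name
    if plan_only:
        argv.append("--mode=plan")
    if worktree:
        argv.append("--worktree")
    if non_interactive:
        argv.append("--print")
    argv.append(prompt)  # assumed: prompt is a positional argument
    return argv


command = build_agent_command("Propose a minimal test-backed patch for the failing login test")
print(shlex.join(command))
```

&lt;p&gt;Building a list instead of a shell string avoids quoting problems when the prompt contains spaces or punctuation, and the same list can be handed to &lt;code&gt;subprocess.run&lt;/code&gt; once the flags are verified.&lt;/p&gt;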

&lt;p&gt;Don't take the cost claim on faith; calibrate with your own data. Run one real refactor on Standard, record token spend, then run the identical task on a frontier alternative and compare quality and spend side by side. The arithmetic that motivates the experiment is simple. The snippet uses normalized, illustrative values (a frontier score of 100 at 60 units of spend, Composer 2.5 at 97 and 1 unit) rather than measured data:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;frontier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Frontier&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;100.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spend&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;60.0&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;composer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Composer 2.5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;97.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spend&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;score_gap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;frontier&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;composer&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;spend_ratio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;frontier&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spend&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;composer&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spend&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;composer&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;composer&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;% of frontier quality&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Gap to frontier: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;score_gap&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; points&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Spend reduction: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;spend_ratio&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;x lower spend&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The takeaway: at roughly $0.07 per task on Standard versus $4.10 for Opus 4.7, the savings across a day of agent loops can justify a modest quality gap. Make Composer 2.5 Standard your default, escalate deliberately, and let your own diffs decide the rest.&lt;/p&gt;
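
&lt;p&gt;Plugging the two per-task figures quoted above into the same arithmetic shows where the headline multiple comes from. The per-task costs are the article's numbers; the figure of 200 tasks per day is an invented workload for illustration.&lt;/p&gt;

```python
standard_per_task = 0.07  # USD, Composer 2.5 Standard (quoted above)
opus_per_task = 4.10      # USD, Opus 4.7 (quoted above)
tasks_per_day = 200       # hypothetical workload

spend_ratio = opus_per_task / standard_per_task
daily_saving = (opus_per_task - standard_per_task) * tasks_per_day

print(f"Spend ratio: {spend_ratio:.1f}x")
print(f"Daily saving at {tasks_per_day} tasks: ${daily_saving:.2f}")
```

&lt;p&gt;The ratio works out to a little under 60, so the title's figure is a rounded value.&lt;/p&gt;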

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Is Composer 2.5 the same model as Composer 2?
&lt;/h3&gt;

&lt;p&gt;Yes and no. Composer 2.5 runs on the same open-weights base checkpoint as Composer 2: Moonshot AI's Kimi K2.5, a 1.04T-parameter mixture-of-experts model with 32B active parameters. The difference is post-training, not architecture: Cursor states that roughly 85% of Composer 2.5's total compute came from additional Composer training and reinforcement learning, including about 25× more synthetic tasks than Composer 2. So the gains are in fine-tuning on the same checkpoint, not a new base model.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the real difference between Composer 2.5 Standard and Fast?
&lt;/h3&gt;

&lt;p&gt;They run identical model weights, so per Cursor there is no quality difference, only latency and cost. Fast runs on hotter, more expensive hardware for faster first tokens and is priced at $3.00/M input and $15.00/M output, versus Standard's $0.50/M input and $2.50/M output, which is 6× the per-token price. Fast was the launch default, so for long agent sessions switch to Standard unless you specifically need low-latency inline edits.&lt;/p&gt;

&lt;h3&gt;
  
  
  How accurate are Cursor's benchmark claims for Composer 2.5?
&lt;/h3&gt;

&lt;p&gt;Independent testing broadly corroborates them. Artificial Analysis placed Composer 2.5 at 62 on its Coding Agent Index: third overall, behind Claude Opus 4.7 in Claude Code at 66 and GPT-5.5 in Codex at 65, and a +14-point jump over Composer 2's 48. Treat Cursor's CursorBench figures more cautiously: it is an internal benchmark built from real Cursor engineering sessions, not a neutral public leaderboard.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does Composer 2.5 work in the Cursor CLI?
&lt;/h3&gt;

&lt;p&gt;Yes. Composer 2.5 is available in both the Agent UI and the terminal CLI. The agent CLI supports &lt;code&gt;--mode=plan&lt;/code&gt; for inspect-before-edit runs, &lt;code&gt;--worktree&lt;/code&gt; to isolate changes on a separate branch, and &lt;code&gt;--print&lt;/code&gt; for non-interactive scripting, alongside Ask mode, slash commands, &lt;code&gt;@&lt;/code&gt; context selection, and &lt;code&gt;agent resume&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is the launch double-usage promotion still active?
&lt;/h3&gt;

&lt;p&gt;Almost certainly not. The first-week double-usage promotion began on May 18, 2026. As of June 2, 2026, it should be treated as expired unless your Cursor dashboard says otherwise; check under Usage to confirm your current rate before planning around it.&lt;/p&gt;

</description>
      <category>cursor</category>
      <category>composer25</category>
      <category>codingagent</category>
      <category>kimik2</category>
    </item>
  </channel>
</rss>
