DEV Community: Lavelle Hatcher Jr

Bringing Scientific Rigor to LLM Comparison

Lavelle Hatcher Jr — Sun, 31 May 2026 09:52:16 +0000

Why I built Cli Modelarium, and why it belongs in your terminal, not a dashboard

Note: This is a personal project, not affiliated with any company. This does not constitute financial or investment advice.

Every time I wanted to compare two LLMs, I had to choose between a quick spot check in a chat window and spinning up an entire evaluation platform.

One tells you nothing useful.

The other takes longer to set up than the comparison is worth.

So I built a CLI that does it from the terminal. It's called Cli Modelarium, and it's live on PyPI today under Apache 2.0.

pip install cli-modelarium

In the rest of this post I'll walk through what it does, why I built it, what's actually under the hood, and how you can use it for your own LLM comparison work in under a minute.

The Problem No One Talks About

The LLM tooling landscape has two ends.

On one end you have the chat-window spot check. You paste a prompt into Claude, then into GPT, then into Gemini, eyeball the outputs, and decide which one is "better." This is what most developers actually do. It feels productive. It produces nothing trustworthy.

The problem with spot checks is that LLM output has variance. You can run the same prompt twice and get different answers. You can also run the same prompt across two models, get answers that look similar, and miss the fact that one is hallucinating subtle facts. Eyeballing single outputs is not a comparison. It's a vibe.

On the other end you have enterprise evaluation platforms. These exist and they're powerful. They also require you to set up an account, configure an integration, define a dataset schema, write evaluators, plug in providers, and orchestrate runs through a dashboard. By the time you've finished onboarding, the question you wanted to answer has changed.

Most LLM comparison questions don't justify that overhead. You want to know: which model produces better outputs for this specific prompt at this specific cost. You don't want a dashboard. You want an answer.

That's the gap Cli Modelarium fills.

What It Does in 30 Seconds

You install it. You set your provider keys. You run a comparison. You get statistically rigorous results in your terminal.

Here's a real example:

cli-modelarium "Explain quantum entanglement." \
  --models claude-haiku-4-5,gpt-4o-mini \
  --max-cost 0.10

That command sends the prompt to both models through their official APIs, tracks cost per call against your --max-cost cap so you don't accidentally spend more than 10 cents, measures time to first token and total latency for each model, and returns a side-by-side comparison with timing, cost, and full outputs.

No infrastructure. No dashboard. No account onboarding. Just a CLI that returns an answer.

If you want statistical rigor on top of that, you add a few flags:

cli-modelarium "Explain quantum entanglement." \
  --models claude-haiku-4-5,gpt-4o-mini,gemini-2.0-flash \
  --runs 10 \
  --significance \
  --hallucination-check \
  --judge claude-opus-4 \
  --max-cost 1.00

Now you're running 10 trials against each of three models, computing bootstrap confidence intervals, running paired significance tests, checking outputs for hallucination patterns, and using a separate model as a judge to score quality, all while staying under a $1 cost cap.

That's the gap I wanted to close. Publication-grade methodology, terminal-grade ergonomics.

What's Actually Under the Hood

The headline features are easy to list. The details are where the work lives.

Multi-Provider Support. Cli Modelarium supports 8 cloud LLM providers plus local models through a unified interface: OpenAI, Anthropic, Google, xAI, DeepSeek, Mistral, Groq, and OpenRouter. Each provider has its own SDK, its own auth pattern, its own error semantics, its own rate-limit behavior. The interface hides all of that. You specify --models claude-haiku-4-5,gpt-4o-mini and the CLI figures out which provider to route each call to, handles credentials, and returns normalized outputs.

You only need API keys for the providers you actually want to use. You set them once with cli-modelarium configure or via environment variables.
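As a rough illustration of the routing described above, a prefix lookup is one way a CLI could map a model name to a provider. The table and the `route_model` and `route_all` names are hypothetical and are not taken from Cli Modelarium's source; only the model names and the `local:` prefix come from the examples in this post.

```python
# Hypothetical prefix table; the real CLI may resolve providers differently.
PROVIDER_PREFIXES = {
    "local:": "local",
    "claude-": "anthropic",
    "gpt-": "openai",
    "gemini-": "google",
}


def route_model(model: str) -> str:
    """Return the provider name for a model ID, or raise if no prefix matches."""
    for prefix, provider in PROVIDER_PREFIXES.items():
        if model.startswith(prefix):
            return provider
    raise ValueError(f"No provider registered for model {model!r}")


def route_all(models_flag: str) -> dict[str, str]:
    """Split a comma-separated --models value and route each entry."""
    return {m: route_model(m) for m in models_flag.split(",") if m}
```

Failing loudly on an unknown name is a deliberate choice in this sketch: a silent default provider would spend money on the wrong API.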

Statistical Rigor. This is where Cli Modelarium differs from every other LLM comparison tool I've used. LLM outputs have variance. To compare them rigorously you need actual statistics, not visual inspection. The CLI implements bootstrap confidence intervals using the BCa method, paired statistical tests including McNemar's test for binary outcomes, multiple comparison corrections including Bonferroni and Holm methods, and effect sizes using Cohen's d. These aren't decorative additions. They're the methods you'd use if you were writing a research paper comparing LLMs. The CLI just makes them invokable through flags.
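To make two of those methods concrete, here is a minimal standard-library sketch of Cohen's d (independent samples, pooled standard deviation) and Holm's step-down correction. It follows the textbook formulas and is not the CLI's own implementation; the function names are illustrative.

```python
import math
import statistics


def cohens_d(a: list[float], b: list[float]) -> float:
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    n_a, n_b = len(a), len(b)
    var_a = statistics.variance(a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd


def holm_adjust(p_values: list[float]) -> list[float]:
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted
```

With three pairwise model comparisons, `holm_adjust` takes the three raw p-values and returns the values to compare against your significance level.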

Hallucination Detection. When models generate plausible-sounding nonsense, statistical tests don't catch it. Hallucination detection runs additional checks on outputs to flag responses that contain markers of fabrication: invented citations, contradictory claims within the same response, fabricated names or dates, and other patterns that experienced reviewers learn to spot. It's not perfect. No hallucination detector is. But it surfaces high-risk outputs for human review, which is far better than flying blind.

LLM-as-Judge Panels. For subjective quality questions, you can use a separate model as a judge. The CLI supports panels with multiple judge models voting independently to reduce single-judge bias.

Cost and Latency Tracking. Every comparison tracks cost per call, total cost, time to first token, and total latency per model. The --max-cost flag enforces a hard cap. If your comparison would exceed the budget, the CLI stops before the next call and reports what it did.
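The cap and the timing measurements can be sketched in a few lines. This version assumes each call has a cost estimate available up front and that a response arrives as an iterable of text chunks; both are simplifications, and `run_with_budget` and `time_stream` are illustrative names rather than the CLI's internals.

```python
import time
from typing import Callable, Iterable


def run_with_budget(calls: Iterable[tuple[float, Callable[[], str]]], max_cost: float):
    """Run calls in order, stopping before any call that would exceed max_cost.

    Each item is (estimated_cost_usd, zero-argument callable).
    Returns (outputs, spent_usd, stopped_early).
    """
    outputs, spent = [], 0.0
    for estimated_cost, call in calls:
        if spent + estimated_cost > max_cost:
            return outputs, spent, True  # cap reached: skip this and later calls
        outputs.append(call())
        spent += estimated_cost
    return outputs, spent, False


def time_stream(chunks: Iterable[str]) -> tuple[str, float, float]:
    """Consume a chunk stream and return (text, time_to_first_token, total_latency)."""
    start = time.perf_counter()
    first = None
    parts = []
    for chunk in chunks:
        if first is None:
            first = time.perf_counter() - start  # first chunk has arrived
        parts.append(chunk)
    total = time.perf_counter() - start
    return "".join(parts), (first if first is not None else total), total
```

Checking the budget before each call, rather than after, is what keeps the cap hard: the run reports what it completed instead of overshooting by one request.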

The Engineering Behind It

This is the part that usually gets skipped. I think it matters because it's where a CLI either earns trust or doesn't.

917 Automated Tests. Every commit runs 917 automated tests covering provider integration, statistical computation accuracy, CLI behavior, error handling, and edge cases. Zero CI failures since v0.1.0.

9 OS/Python Combinations. The CI matrix runs on Linux, macOS, and Windows across Python 3.11, 3.12, and 3.13. That's 9 combinations on every push. If something works on Linux but breaks on Windows, I know before users do.

Statistical Validation Against Literature. For the statistical methods, "passes tests" isn't enough. I cross-validated outputs against reference implementations: bootstrap CIs against scipy's bootstrap method with BCa correction, McNemar's test against binomtest for small samples and chi2.sf with Edwards correction for larger samples, and effect sizes against published formulas for Cohen's d. When my implementation disagreed with the reference, I traced the discrepancy to its source and fixed it.
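For readers who want to reproduce that kind of cross-check by hand, both McNemar variants fit in the standard library. This sketch uses the textbook formulas (an exact two-sided binomial test, and the Edwards-corrected chi-square with one degree of freedom); it is not the project's code, and the function names are illustrative.

```python
import math


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts b and c."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Under H0 the smaller count follows Binomial(n, 0.5); double the lower tail.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


def mcnemar_edwards(b: int, c: int) -> float:
    """McNemar p-value with Edwards continuity correction (chi-square, 1 df)."""
    n = b + c
    if n == 0:
        return 1.0
    statistic = (abs(b - c) - 1) ** 2 / n
    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(statistic / 2))
```

Here `b` is the number of prompts where only the first model succeeded and `c` the number where only the second did; concordant pairs do not enter the test.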

README in 9 Languages. The README is available in 9 languages so developers across different regions can read about the project in their preferred language.

Why I Built This

Every time I evaluate a new LLM for one of my projects, I run into the same problem: I want to know if Claude is better than GPT for this specific task, or if Gemini is fast enough for that other use case, or whether DeepSeek is worth the cost savings.

I'd open three browser tabs. I'd paste prompts. I'd squint at outputs. I'd make a call. Then later I would revisit the decision and realize I didn't remember why I picked what I picked.

The CLI started as a script for my own evaluation work. Then I added statistical methods because I wanted to know if differences I was seeing were real or noise. Then I added cost tracking because I was burning through API credits. Then I added the test suite because I kept introducing regressions.

At some point I looked at the project and realized I'd built something other people would find useful, so I open sourced it under Apache 2.0.

Getting Started

The full workflow takes about 30 seconds.

Install:

pip install cli-modelarium

Works on Linux, macOS, and Windows. Python 3.11 or newer.

Set your provider keys (you only need keys for providers you'll actually use):

cli-modelarium configure

Run your first comparison:

cli-modelarium "Explain quantum entanglement." \
  --models claude-haiku-4-5,gpt-4o-mini \
  --max-cost 0.10

With statistical rigor:

cli-modelarium "Explain quantum entanglement." \
  --models claude-haiku-4-5,gpt-4o-mini,gemini-2.0-flash \
  --runs 10 \
  --significance \
  --hallucination-check \
  --output results.json \
  --max-cost 1.00

Compare against a local model:

cli-modelarium "Explain quantum entanglement." \
  --models claude-haiku-4-5,local:llama-3.1-8b \
  --max-cost 0.10

What's Next

The launch version (v0.1.3) is feature-complete for the use cases I had when I built it. Additional language support is on my radar. The architecture is designed for it, so stay tuned.

If you use it and find something missing, open an issue.

Try It

pip install cli-modelarium

I built this because I needed it. I open sourced it because if I needed it, other people probably do too.

If you find it useful, a star on the repo helps surface it to other developers.

I built an offline Chrome extension that reads webpages aloud with AI voices and zero cloud calls

Lavelle Hatcher Jr — Sun, 10 May 2026 23:58:58 +0000

Every text-to-speech Chrome extension I tried had one of two problems. Either it sent my text to a server, or it used the browser's built-in voices that sound like a GPS from 2012.

I wanted TTS that stays on my machine and doesn't sound terrible. So I built one.

What it does

GlowReadTTS is a Chrome extension that reads text aloud using AI voices bundled directly in the extension package. No cloud, no accounts, no API keys, no network calls at all.

Two ways to use it:

Right-click mode: Select text on any webpage, right-click, choose "Read with GlowReadTTS." It reads the text aloud and highlights each sentence on the page as it goes. A floating stop button appears at the top-right of the page so you can halt playback without opening the popup.

Popup mode: Open the extension popup, paste or type text, hit play.

15 AI voices (American and British English), speed control from 0.25x to 2x, streaming playback so audio starts quickly.

How a 96MB model fits in a Chrome extension

The AI voice model is about 96MB and ships entirely inside the .crx package. After install, there are no runtime downloads and no network calls. You can turn off wifi and it still works.

The voices sound significantly better than built-in browser TTS. 15 voices are bundled covering American and British English.

Architecture

The whole thing is vanilla JavaScript. No React, no build step, no bundler. Manifest V3.

background/    → service worker, context menu, message routing
content/       → in-page sentence highlighting, floating stop button
offscreen/     → audio playback + TTS inference
popup/         → extension UI (voice picker, text input, controls)
options/       → settings (speed, voice, performance toggle)
libs/          → bundled AI voice model + inference code

The offscreen/ part is worth explaining. Chrome extensions can't play audio from a service worker, so an offscreen document handles the TTS inference and pipes audio out. This is a Manifest V3 pattern that trips people up the first time they see it.

The performance toggle

Cold-starting a 96MB model takes a few seconds. To avoid that delay on the first right-click read of a session, GlowReadTTS can optionally pre-warm the model whenever you select text on a page. This is on by default.

If you'd rather keep idle RAM minimal, switch it off in Settings. The first read will be slower, but subsequent reads in the same session stay fast.

Privacy

This is the whole point of the project.

Zero data collection. No analytics, no telemetry, no tracking.
Text never leaves the device. 100% local processing.
No accounts. No sign-up, no API keys.
The extension doesn't even have permission to make network requests. If you're reading an article, nothing gets sent anywhere.

Status

License: Apache 2.0
Source: github.com/lavellehatcherjr/GlowReadTTS
Chrome Web Store: Submitted, pending review. I'll update this post with the install link once it's approved. If you find it useful, a ⭐ on the repo helps more than you'd think.

What's next

Additional language support is on my radar. The architecture is designed for it, so stay tuned.

Questions, feedback, or bugs? Open an issue.

How I Stopped Getting "Stream Idle Timeout" Errors in Claude Code

Lavelle Hatcher Jr — Sat, 25 Apr 2026 11:59:59 +0000

The fix is five lines in your CLAUDE.md, not a settings change

Note: This is a personal workaround based on my own experience. Your mileage may vary.

If you use Claude Code for anything longer than a short conversation, you have probably seen this:

API Error: Stream idle timeout - partial response received

It cuts off mid-response. Your work disappears. The retry often fails the same way. There is no recovery button. As of April 2026, it is one of the most reported bugs on the Claude Code GitHub repo, with multiple open issues going back months.

The issue has been more common since the launch of Claude Opus 4.7. Several GitHub issues filed since mid-April specifically name Opus 4.7 and the 1M context variant as triggers, and recent Claude Code changelogs show stream-handling improvements shipping. The bug also shows up in regular Claude chat sessions during long outputs, but Claude Code is where it hits hardest because of the heavy tool-call chains.

I hit it repeatedly while using Claude Code for multi-file projects. After losing work three or four times in a row, I started experimenting with prompt-level instructions that prevent the timeout from firing in the first place. The trick is not to fix the timeout. The trick is to never trigger it.

Why it happens

The timeout fires when Claude Code's streaming connection goes idle for too long during a single response. Long outputs are the trigger. If Claude tries to write a 300-line file in one tool call, or runs a grep that dumps hundreds of lines, or chains multiple heavy tool calls without pausing, the stream stalls and the connection drops.

The bug is worse in longer sessions. After 20 or more tool calls in a single conversation, the probability of hitting it goes up noticeably.

The fix: add these instructions to your CLAUDE.md

Create or open a CLAUDE.md file in your project root. Add this block:

## Stream Timeout Prevention

1. Do each numbered task ONE AT A TIME. Complete one task fully,
   confirm it worked, then move to the next.
2. Never write a file longer than ~150 lines in a single tool call.
   If a file will be longer, write it in multiple append/edit passes.
3. Start a fresh session if the conversation gets long (20+ tool calls).
   The error gets worse as the session grows.
4. Keep individual grep/search outputs short. Use flags like
   `--include` and `-l` (list files only) to limit output size.
5. If you do hit the timeout, retry the same step in a shorter form.
   Don't repeat the entire task from scratch.

That is it. Claude Code reads CLAUDE.md at the start of every session and follows the instructions as constraints. These five rules keep each streaming chunk small enough that the idle timeout never fires.

Why this works

Each rule targets a specific trigger:

Rule 1 prevents Claude from batching multiple tasks into one giant response. Instead of "create three files, run tests, and fix the errors" in a single output, it does one step, confirms, then moves on. Smaller outputs, no stall.

Rule 2 is the most important one. A 300-line file write is the single most common trigger for the timeout. Splitting it into two 150-line passes keeps each chunk under the threshold.

Rule 3 addresses session degradation. I have not seen Anthropic document this publicly, but in my experience the timeout becomes almost guaranteed after about 20 tool calls in a single session. Starting fresh resets whatever internal state is accumulating.

Rule 4 catches the other common trigger: unbounded search output. A recursive grep that returns 500 lines of matches will stall the stream just as badly as a long file write.

Rule 5 saves you from the retry death spiral. When you hit the timeout and retry the exact same prompt, you get the exact same stall. Retrying with a shorter version of the same step usually works on the first try.
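Rule 2's chunking is easy to picture as code. This is only an illustration of what the rule asks Claude to do, not something you need to run yourself; `split_into_passes` is a hypothetical helper.

```python
def split_into_passes(text: str, max_lines: int = 150) -> list[str]:
    """Split file content into consecutive chunks of at most max_lines lines."""
    if max_lines < 1:
        raise ValueError("max_lines must be at least 1")
    lines = text.splitlines(keepends=True)  # keep newlines so chunks rejoin exactly
    return ["".join(lines[i:i + max_lines]) for i in range(0, len(lines), max_lines)]
```

A 300-line file becomes two passes: the first written as a new file, the second appended to it.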

What I tried that did not work

Before landing on the CLAUDE.md approach, I tried several other things:

Increasing CLAUDE_STREAM_IDLE_TIMEOUT_MS: This is a terminal CLI environment variable and does not always resolve the issue.
Switching browsers: Same behavior in Chrome, Firefox, and Safari.
Switching models: Happens on both Opus and Sonnet.
Shorter prompts: The prompt length is not the issue. The output length is.

The CLAUDE.md approach works because it constrains the output at the source. Claude follows the instructions before it starts generating, so the stream never gets long enough to stall.

Worth noting

Recent Claude Code changelogs show stream-handling improvements shipping regularly. This CLAUDE.md workaround is a bridge for the meantime, not a permanent solution. Once the platform-level fix ships, you can remove the block.

If you are using Claude Code for real work today, adding these five lines saves a lot of frustration while improvements are in progress.


Serving Qwen3.6-35B-A3B With vLLM and Building a Coding Agent With Tool Calling

Lavelle Hatcher Jr — Sun, 19 Apr 2026 05:42:43 +0000

Alibaba's Qwen team released Qwen3.6-35B-A3B on April 16, 2026 under Apache 2.0. It is a sparse mixture-of-experts model with 35 billion total parameters but only about 3 billion active per token. It scores 73.4% on SWE-bench Verified and 37.0 on MCPMark, which makes it one of the strongest open-weight models for agentic coding right now.

This post walks through serving it locally with vLLM, calling it from Python with the OpenAI SDK, and wiring up tool calling so the model can act as a coding agent.

Note: This is a personal summary based on publicly available information, not the official view of any company.

Prerequisites

vLLM 0.19.0 or later (required for Qwen3.6 architecture support)
NVIDIA GPU (RTX 4090 24GB works for single-GPU, multi-GPU for larger context)
Python 3.12
The model downloads automatically from Hugging Face on first launch

Install vLLM

python -m venv qwen36-env
source qwen36-env/bin/activate
pip install "vllm>=0.19.0"

Older vLLM versions do not support the Qwen3.6 MoE architecture. If you hit errors about Qwen3MoeSparseMoeBlock, your vLLM is too old.
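If you would rather check the installed version before launching the server, a small standard-library sketch follows. The version parsing is deliberately simplified (it ignores pre-release and local suffixes), and the helper names are illustrative.

```python
from importlib import metadata


def version_tuple(version: str) -> tuple[int, ...]:
    """Parse the leading numeric release segments, e.g. '0.19.0.post1' -> (0, 19, 0)."""
    parts = []
    for segment in version.split("+")[0].split("."):
        digits = ""
        for ch in segment:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric segment such as 'post1'
        parts.append(int(digits))
    return tuple(parts)


def vllm_is_new_enough(minimum: str = "0.19.0") -> bool:
    """Return True if the installed vllm package meets the minimum version."""
    try:
        installed = metadata.version("vllm")
    except metadata.PackageNotFoundError:
        return False
    return version_tuple(installed) >= version_tuple(minimum)
```

Call `vllm_is_new_enough()` inside the activated virtual environment; it returns False when vLLM is missing or older than 0.19.0.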

Start the vLLM Server

Basic (inference only)

vllm serve Qwen/Qwen3.6-35B-A3B \
  --port 8000 \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --reasoning-parser qwen3

The --reasoning-parser qwen3 flag enables thinking mode, where the model generates internal reasoning steps before its final answer. This improves accuracy on coding tasks.

On a single RTX 4090, keep --max-model-len at 32768 or 65536. The full 262,144 context will OOM.

With tool calling

vllm serve Qwen/Qwen3.6-35B-A3B \
  --port 8000 \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder

--tool-call-parser qwen3_coder is mandatory. Without it, the model generates tool call JSON but vLLM will not parse it into structured tool_calls objects. This is the most common setup mistake and it fails silently.
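Because the failure is silent, a client-side guard can help. The heuristic below flags responses where no structured tool calls arrived but the text still looks like one. The marker strings are assumptions about what unparsed output might contain, so adjust them to whatever your server actually returns.

```python
def looks_like_unparsed_tool_call(content, tool_calls,
                                  markers=("<tool_call>", '"arguments"')):
    """Heuristic: True when the text seems to hold a tool call that was not parsed."""
    if tool_calls:  # structured tool calls arrived, so parsing worked
        return False
    text = content or ""
    # Assumed markers; they are guesses, not a documented vLLM output format.
    return any(marker in text for marker in markers)
```

In the client code, this would be called with `message.content` and `message.tool_calls`; a True result is a hint to recheck the server's startup flags.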

Multi-GPU (example: 4 GPUs)

vllm serve Qwen/Qwen3.6-35B-A3B \
  --port 8000 \
  --tensor-parallel-size 4 \
  --max-model-len 262144 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder

Once running, the OpenAI-compatible API is available at http://localhost:8000/v1.
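A quick way to confirm the server is up is to list the models it serves through the OpenAI-compatible `/models` route. This sketch uses only the standard library; the `opener` parameter is an addition of mine so the function can be exercised without a running server.

```python
import json
import urllib.request


def list_models(base_url: str = "http://localhost:8000/v1",
                opener=urllib.request.urlopen) -> list[str]:
    """Return the model IDs reported by an OpenAI-compatible /models endpoint."""
    with opener(f"{base_url}/models", timeout=5) as response:
        payload = json.load(response)
    return [entry["id"] for entry in payload.get("data", [])]
```

If the server started correctly, `list_models()` should include `Qwen/Qwen3.6-35B-A3B`; a connection error means the server is not listening yet.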

Basic Chat From Python

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy",  # vLLM local does not require a real key
)

response = client.chat.completions.create(
    model="Qwen/Qwen3.6-35B-A3B",
    messages=[
        {"role": "user", "content": "Write a Python generator for the Fibonacci sequence."}
    ],
    temperature=0.7,
    max_tokens=2048,
)

print(response.choices[0].message.content)

Since vLLM exposes an OpenAI-compatible endpoint, the standard OpenAI SDK works directly. Swapping from gpt-4o to a local Qwen3.6 is a one-line base_url change.

Tool Calling (Function Calling)

This is where it gets interesting. Qwen3.6-35B-A3B was explicitly trained on tool-use patterns, scoring 37.0 on MCPMark compared to 18.1 for Gemma 4-31B.

import json
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy",
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "search_files",
            "description": "Search for files in the project by keyword",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "Search keyword"
                    },
                    "file_extension": {
                        "type": "string",
                        "description": "File extension filter (e.g. .py, .ts)"
                    }
                },
                "required": ["query"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read the contents of a file at a given path",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {
                        "type": "string",
                        "description": "File path"
                    }
                },
                "required": ["path"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "write_file",
            "description": "Write content to a file at a given path",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {
                        "type": "string",
                        "description": "File path"
                    },
                    "content": {
                        "type": "string",
                        "description": "Content to write"
                    }
                },
                "required": ["path", "content"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="Qwen/Qwen3.6-35B-A3B",
    messages=[
        {
            "role": "system",
            "content": "You are a coding agent. Search, read, and modify files to complete the user's request."
        },
        {
            "role": "user",
            "content": "Find all Python files that handle database connections and change the pool size from 5 to 20."
        }
    ],
    tools=tools,
    tool_choice="auto",
    temperature=1.0,  # recommended for thinking mode
    max_tokens=4096,
)

message = response.choices[0].message

if message.tool_calls:
    for call in message.tool_calls:
        print(f"Tool: {call.function.name}")
        print(f"Args: {call.function.arguments}")
        print("---")
else:
    print(message.content)

Agent Loop (Multi-Turn Tool Calling)

A real coding agent needs to call a tool, feed the result back to the model, then let it decide the next action. Here is a minimal loop:

def run_agent(user_request: str, tools: list, max_steps: int = 10):
    messages = [
        {
            "role": "system",
            "content": "You are a coding agent. Use the available tools to complete the task."
        },
        {"role": "user", "content": user_request}
    ]

    for step in range(max_steps):
        response = client.chat.completions.create(
            model="Qwen/Qwen3.6-35B-A3B",
            messages=messages,
            tools=tools,
            tool_choice="auto",
            temperature=1.0,
            max_tokens=4096,
        )

        assistant_message = response.choices[0].message
        messages.append(assistant_message)

        if not assistant_message.tool_calls:
            print(f"[Done] {assistant_message.content}")
            return assistant_message.content

        for call in assistant_message.tool_calls:
            tool_name = call.function.name
            tool_args = json.loads(call.function.arguments)

            print(f"[Step {step+1}] {tool_name}({tool_args})")

            result = execute_tool(tool_name, tool_args)

            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(result),
            })

    return "Reached maximum steps"


def execute_tool(name: str, args: dict) -> dict:
    """Replace with real file system operations."""
    if name == "search_files":
        return {
            "files": [
                {"path": "src/db/connection.py", "match": "pool_size=5"},
                {"path": "src/db/config.py", "match": "POOL_SIZE = 5"},
            ]
        }
    elif name == "read_file":
        return {"content": f"# Contents of {args['path']}"}
    elif name == "write_file":
        return {"status": "ok", "path": args["path"]}
    return {"error": "unknown tool"}

Replace execute_tool with real file system calls and you have a working local coding agent.
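As a starting point for that replacement, here is one possible file-system version. It is a sketch, not a hardened implementation: `make_executor` is a hypothetical factory, search is a plain substring match, and every path is confined to a project root so the model cannot read or write outside it.

```python
from pathlib import Path


def make_executor(root: str):
    """Build an execute_tool function whose file access is confined to root."""
    base = Path(root).resolve()

    def resolve(path: str) -> Path:
        target = (base / path).resolve()
        if not target.is_relative_to(base):  # block ../ escapes
            raise ValueError(f"path escapes project root: {path}")
        return target

    def execute_tool(name: str, args: dict) -> dict:
        try:
            if name == "search_files":
                suffix = args.get("file_extension", "")
                hits = []
                for file in base.rglob(f"*{suffix}"):
                    if not file.is_file():
                        continue
                    text = file.read_text(encoding="utf-8", errors="ignore")
                    if args["query"] in text:
                        hits.append({"path": file.relative_to(base).as_posix()})
                return {"files": hits}
            if name == "read_file":
                return {"content": resolve(args["path"]).read_text(encoding="utf-8")}
            if name == "write_file":
                target = resolve(args["path"])
                target.parent.mkdir(parents=True, exist_ok=True)
                target.write_text(args["content"], encoding="utf-8")
                return {"status": "ok", "path": args["path"]}
            return {"error": "unknown tool"}
        except (OSError, ValueError, KeyError) as exc:
            return {"error": str(exc)}  # report failures to the model instead of crashing

    return execute_tool
```

Create the executor once with `execute_tool = make_executor("path/to/project")` and the agent loop can use it unchanged. Returning errors as data lets the model see what went wrong and try a different step.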

Thinking Mode

Qwen3.6 has two inference modes:

Mode	Temperature	Use case
Thinking (recommended)	1.0	Complex coding, debugging, design decisions
Non-thinking	0.7	Simple completions, quick answers

When --reasoning-parser qwen3 is set at server startup, thinking mode is on by default.
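A small helper can keep the per-mode settings in one place. The temperatures come from the table above; the `chat_template_kwargs` switch for turning thinking off is the mechanism used with earlier Qwen3 models on vLLM, and I am assuming it carries over to Qwen3.6, so verify it against the model card.

```python
# Temperatures come from the table above.
MODE_TEMPERATURE = {"thinking": 1.0, "non-thinking": 0.7}


def sampling_kwargs(mode: str) -> dict:
    """Build keyword arguments for client.chat.completions.create for a given mode."""
    if mode not in MODE_TEMPERATURE:
        raise ValueError(f"unknown mode: {mode!r}")
    kwargs = {"temperature": MODE_TEMPERATURE[mode]}
    if mode == "non-thinking":
        # Assumption: the Qwen3-style chat-template switch also applies to Qwen3.6.
        kwargs["extra_body"] = {"chat_template_kwargs": {"enable_thinking": False}}
    return kwargs
```

The result can be unpacked into a request, for example `client.chat.completions.create(model=..., messages=..., **sampling_kwargs("non-thinking"))`.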

Hardware Guide

Setup	max-model-len	Notes
RTX 4090 (24GB) × 1	32,768	Handles most coding tasks
RTX 4090 × 2 (TP=2)	65,536	Enough for repo-wide context
A100 80GB × 1	131,072	Comfortable single-GPU setup
H100 × 4 (TP=4)	262,144	Full context, production use

The FP8 variant (Qwen/Qwen3.6-35B-A3B-FP8) uses less VRAM with nearly identical performance.

Key Takeaways

vLLM 0.19.0+ serves Qwen3.6-35B-A3B as an OpenAI-compatible API at localhost:8000/v1
Tool calling requires --enable-auto-tool-choice --tool-call-parser qwen3_coder at startup — without it, tool calls silently fail
The standard OpenAI Python SDK works directly, so switching from a cloud API to local inference is a one-line change
Use temperature=1.0 with thinking mode for best coding accuracy
Apache 2.0 license: free for commercial use

The model is three days old, so tool calling stability is still being validated by the community. Test on your own workloads before shipping to production.

References

Qwen/Qwen3.6-35B-A3B - Hugging Face
Qwen3.5 & Qwen3.6 Usage Guide - vLLM Recipes
QwenLM/Qwen3.6 - GitHub

Anthropic Releases Claude Opus 4.7: Key Changes and Migration Guide for Developers

Lavelle Hatcher Jr — Fri, 17 Apr 2026 08:07:59 +0000

Here is a developer-focused summary of what changed in Claude Opus 4.7, released on April 16, 2026.

Note: This article is a personal summary based on publicly available information, not the official view of any company. This article does not constitute financial or investment advice.

Where Opus 4.7 Sits

Claude Opus 4.7 is Anthropic's most capable generally available model. It sits below Claude Mythos Preview on benchmarks, but Mythos Preview remains restricted to a handful of platform partners through Project Glasswing and is not available for general use.

Pricing is unchanged from Opus 4.6: $5 per million input tokens and $25 per million output tokens. The model ID is claude-opus-4-7. It is available across all Claude products, the Anthropic API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry.

Benchmark Results

Key numbers from the release and third-party evaluations:

SWE-bench Verified: 87.6% (significant improvement over Opus 4.6)
SWE-bench Pro: 64.3% (Opus 4.6: 53.4%, GPT-5.4: 57.7%)
CursorBench: 70% (Opus 4.6: 58%)
MCP-Atlas (multi-tool orchestration): 77.3% (best in class)
CharXiv visual reasoning: 82.1% (Opus 4.6: 69.1%)
XBOW visual acuity: 98.5% (Opus 4.6: 54.5%)

Rakuten reported 3x more production tasks resolved compared to Opus 4.6. CodeRabbit noted recall improved by over 10 percent, with the model being slightly faster than GPT-5.4 at xhigh effort.

New Features

High-Resolution Image Support

Opus 4.7 is the first Claude model with high-resolution image support. Maximum image resolution increased from 1,568 pixels on the long edge (about 1.15 megapixels) to 2,576 pixels (about 3.75 megapixels), which is roughly 3x the visual capacity of previous Claude models.

For computer use workflows, pixel coordinates now map 1:1 with actual screen pixels, eliminating the scale-factor math that was previously required. Document analysis benefits from the ability to read smaller text and finer details in scanned documents, slides, and diagrams.
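If you pre-size screenshots or scans client-side, the arithmetic can be sketched as follows. It assumes both figures quoted above (2,576 pixels on the long edge and about 3.75 megapixels) act as limits at the same time; that reading is my assumption, and `fit_within_limits` is a hypothetical helper.

```python
import math

MAX_LONG_EDGE = 2576      # pixels, from the figure quoted above
MAX_PIXELS = 3_750_000    # about 3.75 megapixels, from the figure quoted above


def fit_within_limits(width: int, height: int) -> tuple[int, int]:
    """Scale dimensions down (never up) so both limits hold, keeping aspect ratio."""
    scale = min(
        1.0,
        MAX_LONG_EDGE / max(width, height),
        math.sqrt(MAX_PIXELS / (width * height)),
    )
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))
```

For a 4000 by 3000 image the pixel-count limit is the tighter one, so the result is smaller than a long-edge-only resize would give.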

xhigh Effort Level

The effort parameter now has five levels: low, medium, high, xhigh, and max. The new xhigh level sits between high and max, providing deeper reasoning than high without the full cost of max.

Claude Code defaults to xhigh for all plans. Anthropic recommends starting with high or xhigh for coding and agentic use cases.

Task Budgets (Public Beta)

Task budgets let developers set a token allowance for an entire agentic loop rather than a single turn. The model sees a running countdown and uses it to prioritize work, skip low-value steps, and finish gracefully as the budget runs out. This is useful for preventing cost runaway in long-running agent sessions.

Claude Code `/ultrareview` Command

This is a new dedicated code review command. It performs a multi-pass review looking for bugs, edge cases, security issues, and logic errors with more depth than a standard review pass.

Breaking API Changes

Three changes that will cause errors if not addressed:

1. Extended Thinking Budgets Removed

Setting thinking: {"type": "enabled", "budget_tokens": N} now returns a 400 error. The only supported thinking mode on Opus 4.7 is thinking: {"type": "adaptive"}. Note that adaptive thinking is off by default; requests with no thinking field run without thinking. You must set it explicitly to enable it.

2. Sampling Parameters Removed

Setting temperature, top_p, or top_k to any non-default value returns a 400 error. Use prompting to guide output behavior instead.

3. Thinking Content Hidden by Default

Thinking blocks still appear in the response stream, but their content is empty unless you opt in with "display": "summarized". If your product streams reasoning to users, the new default will appear as a long pause before output begins.

Migration Code Example

# Before (Opus 4.6)
model = "claude-opus-4-6"
thinking = {"type": "enabled", "budget_tokens": 8192}
temperature = 0.7

# After (Opus 4.7)
model = "claude-opus-4-7"
thinking = {"type": "adaptive"}
# Remove temperature entirely — use prompting instead
# Increase max_tokens for headroom (new tokenizer uses more tokens)
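The same migration can be applied to a whole request dictionary. This sketch only encodes the changes listed in this article (model ID, adaptive thinking, dropped sampling parameters); `migrate_request` is a hypothetical helper, not part of any SDK.

```python
REMOVED_SAMPLING_PARAMS = ("temperature", "top_p", "top_k")


def migrate_request(params: dict) -> dict:
    """Return a copy of Opus 4.6 request parameters adjusted for Opus 4.7."""
    migrated = dict(params)
    migrated["model"] = "claude-opus-4-7"
    # Fixed thinking budgets are rejected; adaptive is the only supported mode.
    if (migrated.get("thinking") or {}).get("type") == "enabled":
        migrated["thinking"] = {"type": "adaptive"}
    # Non-default sampling values are rejected, so drop them entirely.
    for key in REMOVED_SAMPLING_PARAMS:
        migrated.pop(key, None)
    return migrated
```

Requests that never set `thinking` are left without it, which matches the note that thinking stays off unless you enable it explicitly.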

Behavior Changes

These are not API breaking changes but may require prompt adjustments:

More literal instruction following, particularly at lower effort levels. The model will not silently generalize an instruction from one item to another
Response length calibrates to perceived task complexity rather than defaulting to a fixed verbosity
Fewer tool calls by default. Raise effort to increase tool usage
More direct, opinionated tone with less validation-forward phrasing than Opus 4.6
More regular progress updates during long agentic traces. If you added scaffolding to force interim status messages, try removing it
Fewer subagents spawned by default. Steerable through prompting

Tokenizer Change

Opus 4.7 uses a new tokenizer that may produce roughly 1.0 to 1.35x as many tokens for the same input, depending on content type. Per-token prices are unchanged, but the same prompt may cost more in practice. Test your workloads before switching production traffic.
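
To put a number on that, here is a rough sketch of the arithmetic. The function name and the price used in the example are placeholders I made up for illustration, not real Opus pricing. Only the 1.0 to 1.35 range comes from the paragraph above.

```python
def cost_range(old_input_tokens: int, price_per_million: float) -> tuple[float, float]:
    """Estimate the (low, high) input cost after the tokenizer change."""
    low_factor, high_factor = 1.0, 1.35  # range quoted above
    base = old_input_tokens / 1_000_000 * price_per_million
    return base * low_factor, base * high_factor


# Placeholder numbers: 2M tokens per day under the old tokenizer, $10 per million tokens.
low, high = cost_range(old_input_tokens=2_000_000, price_per_million=10.0)
print(f"${low:.2f} to ${high:.2f} per day")
```

Where your workload lands inside that range depends on content type, which is why measuring real traffic beats estimating.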

Cybersecurity Safeguards

Opus 4.7 includes automated safeguards that detect and block requests involving prohibited or high-risk cybersecurity uses. Cyber capabilities were deliberately reduced compared to Mythos Preview. Security professionals who want to use the model for legitimate purposes such as vulnerability research and penetration testing can apply through the Cyber Verification Program.

Who Should Migrate and When

Teams running production coding agents: The SWE-bench gains are large enough that the upgrade likely pays for itself in reduced human review cycles. Pair with task budgets to control costs
Teams using computer use or image-heavy workflows: The 3.75 megapixel vision support alone justifies the switch
Simple Q&A or FAQ bots: Haiku 4.5 or Sonnet 4.6 are more cost-effective. No need to move to Opus for these workloads

The safe migration approach is to keep Opus 4.6 as a fallback for one to two weeks while validating Opus 4.7 on your production workloads in parallel.
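
One way to keep Opus 4.6 around as a fallback is a thin wrapper around whatever function sends your request. This is a sketch under assumptions: `send` is a hypothetical callable that takes a model name and returns text, and catching every exception is deliberately broad for brevity. In real code you would narrow it to the errors you actually want to fall back on.

```python
from typing import Callable


def call_with_fallback(send: Callable[[str], str], primary: str, fallback: str) -> tuple[str, str]:
    """Try `primary` first and retry once with `fallback` on failure.

    Returns (model_used, result).
    """
    try:
        return primary, send(primary)
    except Exception as exc:
        print(f"{primary} failed ({exc!r}); falling back to {fallback}")
        return fallback, send(fallback)


def fake_send(model: str) -> str:
    """Stand-in for a real API call: simulates a 400 on the new model."""
    if model == "claude-opus-4-7":
        raise RuntimeError("400: temperature is not supported")
    return f"response from {model}"


used, result = call_with_fallback(fake_send, "claude-opus-4-7", "claude-opus-4-6")
print(used, "->", result)
```

Logging which model actually served each request gives you the comparison data for the validation window.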

Calling Anthropic's Advisor Tool in 50 Lines of Python

Lavelle Hatcher Jr — Mon, 13 Apr 2026 11:39:09 +0000

This article reflects my own experience and research. It is not the official view of any company mentioned.

When I read Anthropic's Advisor Strategy post earlier this week, my first thought was: can a single /v1/messages call really let one Claude model consult another mid-generation? I wanted to see the actual wire format and the token accounting before I trusted it in production, so I sat down and wrote the smallest working example I could. That is what this article is.

Versions used

Python 3.11
anthropic Python SDK 0.94.0 (released 2026-04-10)
Claude API, advisor tool in public beta since 2026-04-09
Beta header: advisor-tool-2026-03-01
Tool type: advisor_20260301

If you are reading this later, double check the beta header and tool type against the official docs. Beta names change when features move toward GA.

What the advisor tool actually does

Most server side tools the executor can call (web search, code execution) perform an action and return data. The advisor tool is different. When the executor invokes it, the server runs a separate sub-inference on a stronger model using the entire transcript so far, then injects the advice back into the executor's stream. No extra round trip on your side.

The mechanics are slightly unusual. The executor emits a server_tool_use block with name: "advisor" and an empty input, so the executor only decides the timing. The server constructs the advisor's view automatically from the full transcript (system prompt, tool definitions, prior turns, prior tool results). The advisor then runs without tools and without its own context management. Its thinking blocks are stripped, and only the advice text lands back in the executor's prompt as an advisor_tool_result block. The executor then resumes generating.

The pairing Anthropic recommends is Sonnet 4.6 (executor) plus Opus 4.6 (advisor). Haiku 4.5 also works as an executor. The only advisor model available today is claude-opus-4-6, and the advisor must be at least as capable as the executor.

The minimal call

Here is the smallest viable request, using client.beta.messages.create:

import anthropic

client = anthropic.Anthropic()

response = client.beta.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    betas=["advisor-tool-2026-03-01"],
    tools=[
        {
            "type": "advisor_20260301",
            "name": "advisor",
            "model": "claude-opus-4-6",
        }
    ],
    messages=[
        {
            "role": "user",
            "content": "Build a concurrent worker pool in Go with graceful shutdown.",
        }
    ],
)

print(response)

Four things worth pointing at:

betas=["advisor-tool-2026-03-01"] turns the feature on. This is the SDK shortcut for the anthropic-beta header.
The tool type is advisor_20260301, and name must literally be the string advisor.
model inside the tool definition is the advisor model. The top level model is the executor.
You call client.beta.messages.create, not client.messages.create.
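
If you are not using the SDK, the same request can be expressed as plain headers and a JSON body. This is a sketch that only builds the request and does not send it. `build_advisor_request` is my own helper name, and the `anthropic-version` value is an assumption you should check against the current API docs.

```python
import json
import os

API_URL = "https://api.anthropic.com/v1/messages"


def build_advisor_request(api_key: str, prompt: str) -> tuple[dict[str, str], dict]:
    """Build headers and body for a raw advisor-enabled request (not sent here)."""
    headers = {
        "x-api-key": api_key,
        "anthropic-version": "2023-06-01",  # assumption: verify against the docs
        "anthropic-beta": "advisor-tool-2026-03-01",  # what betas=[...] sets in the SDK
        "content-type": "application/json",
    }
    body = {
        "model": "claude-sonnet-4-6",  # executor
        "max_tokens": 4096,
        "tools": [
            {
                "type": "advisor_20260301",
                "name": "advisor",  # must be this literal string
                "model": "claude-opus-4-6",  # advisor
            }
        ],
        "messages": [{"role": "user", "content": prompt}],
    }
    return headers, body


headers, body = build_advisor_request(
    os.environ.get("ANTHROPIC_API_KEY", "sk-placeholder"),
    "Build a concurrent worker pool in Go with graceful shutdown.",
)
print("POST", API_URL)
print(json.dumps(body, indent=2))
```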

Reading what came back

When the executor decides to consult the advisor, two new content blocks appear in the response: a server_tool_use block with an empty input, followed by an advisor_tool_result block carrying the advice. This loop walks the content array and pulls each piece out:

for block in response.content:
    if block.type == "text":
        print("EXECUTOR:", block.text)
    elif block.type == "server_tool_use" and block.name == "advisor":
        print("ADVISOR CALL:", block.id)
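    elif block.type == "advisor_tool_result":
        content = block.content
        if content.type == "advisor_result":
            # Plaintext advice
            print("ADVISOR SAID:", content.text)
        elif content.type == "advisor_redacted_result":
            # Opaque payload: keep it in history verbatim, do not parse it
            print("ADVISOR SAID: <redacted>")
        elif content.type == "advisor_tool_result_error":
            print("ADVISOR FAILED:", content.error_code)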

Notice the two success variants. advisor_result carries human readable text. advisor_redacted_result carries encrypted_content that you round trip verbatim on the next turn. Opus 4.6 returns plaintext today, but other advisor models may not. If the sub-inference fails, you get advisor_tool_result_error with an error_code such as overloaded, too_many_requests, max_uses_exceeded, prompt_too_long, or execution_time_exceeded. The whole request does not fail in that case. The executor keeps going without further advice.

Counting the tokens properly

This is the part I wanted to see with my own eyes. Usage is split between executor and advisor, and the top level usage.input_tokens does not include the advisor's tokens at all. Everything lives in usage.iterations[]:

usage = response.usage
print(f"Executor output tokens (top level): {usage.output_tokens}")

for i, it in enumerate(usage.iterations):
    if it.type == "advisor_message":
        print(
            f"  [{i}] advisor ({it.model}): "
            f"in={it.input_tokens} out={it.output_tokens}"
        )
    else:
        print(
            f"  [{i}] executor: "
            f"in={it.input_tokens} out={it.output_tokens}"
        )

Advisor tokens bill at the advisor model's rate, so rolling them into the executor numbers would give you the wrong cost. The docs spell out the aggregation rules: top level output_tokens is the sum across executor iterations, and top level input_tokens reflects the first executor iteration only. For anything resembling billing, loop over iterations and group by type.
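
Here is a rough sketch of what grouping by type can look like for cost. It works on plain dictionaries so it runs without the SDK. The per-million rates are placeholders I picked to make the arithmetic easy, not real prices, and the "message" type on executor iterations is my assumption. Only advisor_message comes from the code above.

```python
from collections import defaultdict


def cost_by_role(iterations: list[dict], executor_model: str,
                 rates: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Sum cost per role, pricing each iteration at its own model's rate.

    rates maps model name -> (input $/MTok, output $/MTok).
    """
    totals: dict[str, float] = defaultdict(float)
    for it in iterations:
        is_advisor = it["type"] == "advisor_message"
        # Advisor iterations carry their own model; executor ones use the top level model.
        model = it["model"] if is_advisor else executor_model
        in_rate, out_rate = rates[model]
        cost = (it["input_tokens"] * in_rate + it["output_tokens"] * out_rate) / 1_000_000
        totals["advisor" if is_advisor else "executor"] += cost
    return dict(totals)


# Placeholder rates, chosen only for readable arithmetic.
RATES = {"claude-sonnet-4-6": (1.0, 2.0), "claude-opus-4-6": (10.0, 20.0)}

iterations = [
    {"type": "message", "input_tokens": 1000, "output_tokens": 200},
    {"type": "advisor_message", "model": "claude-opus-4-6",
     "input_tokens": 1500, "output_tokens": 500},
    {"type": "message", "input_tokens": 2000, "output_tokens": 800},
]
costs = cost_by_role(iterations, "claude-sonnet-4-6", RATES)
print(costs)
```

With the real SDK response you would read the same fields as attributes on each item in usage.iterations instead of dictionary keys.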

Capping cost with max_uses

The advisor tool ships without a conversation level cap, but it does support a per request max_uses:

tools = [
    {
        "type": "advisor_20260301",
        "name": "advisor",
        "model": "claude-opus-4-6",
        "max_uses": 2,
    }
]

Once the executor hits that cap, additional advisor calls return an advisor_tool_result_error with error_code: "max_uses_exceeded". This is per request, so on a multi turn conversation you still need a client side counter if you want a total ceiling. When you decide to stop offering the advisor, the docs are explicit: remove it from tools AND strip every advisor_tool_result block from your message history before the next request. Leaving the blocks behind without the tool returns a 400 invalid_request_error.
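
A client side counter plus the history cleanup could look roughly like this. It is a sketch over plain dictionaries: the helper names and TOTAL_ADVISOR_CEILING are mine, and the id and tool_use_id fields in the sample history are assumptions about the block shape. I only strip advisor_tool_result because that is what the docs name. Whether the paired server_tool_use block also has to go is worth confirming before relying on this.

```python
def count_advisor_calls(content: list[dict]) -> int:
    """Count advisor invocations in one assistant turn's content blocks."""
    return sum(
        1 for block in content
        if block.get("type") == "server_tool_use" and block.get("name") == "advisor"
    )


def strip_advisor_results(messages: list[dict]) -> list[dict]:
    """Return a copy of the history without advisor_tool_result blocks."""
    cleaned = []
    for msg in messages:
        content = msg["content"]
        if isinstance(content, list):
            content = [b for b in content if b.get("type") != "advisor_tool_result"]
        cleaned.append({**msg, "content": content})
    return cleaned


TOTAL_ADVISOR_CEILING = 1  # hypothetical conversation-wide limit

history = [
    {"role": "user", "content": "Plan the refactor."},
    {"role": "assistant", "content": [
        {"type": "server_tool_use", "id": "srvtoolu_1", "name": "advisor", "input": {}},
        {"type": "advisor_tool_result", "tool_use_id": "srvtoolu_1",
         "content": {"type": "advisor_result", "text": "Start with the tests."}},
        {"type": "text", "text": "I will start with the tests."},
    ]},
]

used = sum(count_advisor_calls(m["content"]) for m in history if isinstance(m["content"], list))
offer_advisor = used < TOTAL_ADVISOR_CEILING
if not offer_advisor:
    # Ceiling reached: drop the tool from the next request and clean the history.
    history = strip_advisor_results(history)
print(used, offer_advisor, [b["type"] for b in history[1]["content"]])
```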

Advisor side caching

For long agent loops where the advisor fires three or more times, you can enable caching on the advisor's own transcript:

tools = [
    {
        "type": "advisor_20260301",
        "name": "advisor",
        "model": "claude-opus-4-6",
        "caching": {"type": "ephemeral", "ttl": "5m"},
    }
]

The shape is fixed: type must be ephemeral and ttl is 5m or 1h. Unlike cache_control on normal content blocks, this is just an on or off switch. The server decides where the cache boundaries go. The documented break even point is about three advisor calls per conversation. Below that, the write cost exceeds the read savings.
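
Since the only decisions are on or off and which ttl, I find it easier to build the tool definition from the expected number of advisor calls. This is a small sketch: `build_advisor_tool` and the `expected_calls` parameter are hypothetical names of my own, and the threshold of three is just the break even point quoted above.

```python
from typing import Optional

VALID_TTLS = {"5m", "1h"}
CACHING_BREAK_EVEN_CALLS = 3  # documented break even point


def build_advisor_tool(expected_calls: int, max_uses: Optional[int] = None,
                       ttl: str = "5m") -> dict:
    """Build the advisor tool definition, enabling caching only when it should pay off."""
    if ttl not in VALID_TTLS:
        raise ValueError(f"ttl must be one of {sorted(VALID_TTLS)}, got {ttl!r}")
    tool: dict = {
        "type": "advisor_20260301",
        "name": "advisor",
        "model": "claude-opus-4-6",
    }
    if max_uses is not None:
        tool["max_uses"] = max_uses
    if expected_calls >= CACHING_BREAK_EVEN_CALLS:
        tool["caching"] = {"type": "ephemeral", "ttl": ttl}
    return tool


short_task = build_advisor_tool(expected_calls=1, max_uses=2)
long_loop = build_advisor_tool(expected_calls=5, ttl="1h")
print(short_task)
print(long_loop)
```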

Things to watch out for

Streaming pauses. The advisor sub-inference does not stream. While it runs, your executor stream sits idle except for standard SSE ping keepalives roughly every 30 seconds. Short advisor calls may show no pings at all. Your UI needs to handle that silence without timing out.
max_tokens bounds the executor only. It does not cap advisor output. Budget for an extra 1,400 to 1,800 tokens per advisor call (400 to 700 text plus thinking).
Rate limits draw from two buckets. Executor rate limits fail the whole request with HTTP 429. Advisor rate limits come back as too_many_requests inside the advisor_tool_result block, and the request continues.
Invalid pairings return 400. The advisor must be at least as capable as the executor. Today that means Opus as advisor for any executor. Haiku as advisor is not supported.
Do not rewrite redacted results. If the advisor returns advisor_redacted_result, pass the opaque encrypted_content back on the next turn verbatim. The server decrypts it server side. Reading it or substituting text will break the conversation.
Context editing has sharp edges. clear_thinking with any keep value other than "all" shifts the advisor's quoted transcript each turn and kills advisor side caching. If you use extended thinking alongside the advisor, set keep: "all" explicitly.
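
For the max_tokens point in the list above, the budgeting is simple arithmetic. A sketch, with a function name of my own choosing and the per call range taken from that bullet:

```python
ADVISOR_TOKENS_PER_CALL = (1_400, 1_800)  # range quoted in the list above


def output_token_budget(max_tokens: int, advisor_calls: int) -> tuple[int, int]:
    """Return the (low, high) total output tokens to plan for in one request."""
    low, high = ADVISOR_TOKENS_PER_CALL
    return max_tokens + advisor_calls * low, max_tokens + advisor_calls * high


# Executor capped at 4096 tokens, advisor allowed to fire twice.
print(output_token_budget(max_tokens=4096, advisor_calls=2))
```

Pairing this with max_uses gives you a worst case figure per request, since max_uses bounds advisor_calls.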

Is it worth wiring in?

From a single request cost angle, the advisor is cheaper than Opus solo whenever your task is mostly mechanical output with a few key decisions. It is more expensive than Sonnet solo whenever those decisions are unnecessary. That tradeoff lives in your prompt and your workload, not in the API. I would not blindly turn it on for chat, but for agent loops with dozens of turns it is the right knob to have.

Anthropic's own guidance in the docs is specific about timing: call the advisor early, after a few exploratory reads are in the transcript but before substantive work begins, and call it again near the end after file writes and test outputs are available. That matches what I see in practice. The advisor adds almost all of its value in the first call, before your approach crystallizes. If you wait until the executor is three quarters of the way through a wrong solution, the advisor will politely tell you so and you will still have to redo the work.

The part I underestimated before writing this example was the usage accounting. If you have a billing pipeline that reads usage.input_tokens and usage.output_tokens directly, it will silently undercount advisor time. Migrate to iterations before you flip this on in production.

What would you use a second opinion model for in your own agent loops? I am curious whether people are reaching for this more on planning or on verification.

What I Learned Calling 4 Different LLM APIs From the Same Codebase

Lavelle Hatcher Jr — Thu, 09 Apr 2026 13:03:06 +0000

Most comparison articles give you benchmark scores. This one gives you the practical details benchmarks don't cover: response format differences, streaming implementations, and cost considerations I encountered while building a tool that lets users pick their own LLM provider.

Why one codebase, four APIs?

I build browser-based dev tools. One of my projects needed to support multiple LLM providers so users could choose whichever API they prefer. The user picks their provider, enters their own API key, and the tool handles the rest.

Sounds simple. It turned out to be more nuanced than expected.

Supporting OpenAI, Google Gemini, Anthropic Claude, and any OpenAI-compatible endpoint (like local Ollama or LM Studio) from a single codebase taught me a lot about the practical differences between these APIs.

Response format differences

Every provider returns responses in a slightly different structure.

OpenAI (Chat Completions API):

{
  "choices": [{"message": {"content": "..."}}]
}

OpenAI also offers a newer Responses API (/v1/responses) which returns a different format:

{
  "output": [{"type": "message", "content": [{"type": "output_text", "text": "..."}]}]
}

Claude:

{
  "content": [{"type": "text", "text": "..."}]
}

Gemini:

{
  "candidates": [{"content": {"parts": [{"text": "..."}]}}]
}

This seems trivial until you realize your entire downstream pipeline depends on extracting that text reliably. I ended up writing a normalizer function early on:

function extractText(provider, response) {
  switch (provider) {
    case 'openai':
      // Chat Completions API format
      return response.choices?.[0]?.message?.content ?? '';
    case 'openai-responses':
      // Responses API format
      return response.output
        ?.filter(b => b.type === 'message')
        .flatMap(b => b.content)
        .filter(c => c.type === 'output_text')
        .map(c => c.text)
        .join('\n') ?? '';
    case 'claude':
      return response.content
        ?.filter(b => b.type === 'text')
        .map(b => b.text)
        .join('\n') ?? '';
    case 'gemini':
      return response.candidates?.[0]?.content?.parts
        ?.map(p => p.text)
        .join('\n') ?? '';
    default:
      // OpenAI-compatible fallback
      return response.choices?.[0]?.message?.content ?? '';
  }
}

Write this normalizer first. Before you build anything else. Trust me.

Streaming implementations

All four providers support streaming, but each has its own implementation worth understanding.

OpenAI and compatible endpoints use data: [DONE] to signal the end of a stream. Claude uses event: message_stop. Gemini has its own SSE format.

The chunk structure is different too. OpenAI sends delta.content. Claude sends delta.text inside a content_block_delta event. Gemini sends partial text inside candidates[0].content.parts.

If you're building a UI that shows streaming text, you'll need a parser for each provider's format.
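
As an illustration of what "a parser for each provider" means, here is a minimal Python sketch that handles one SSE data payload at a time. It only covers the fields mentioned above. `parse_sse_data` is a hypothetical helper, and for Gemini I assume the stream simply ends when the connection closes, since no explicit end marker is described here.

```python
import json


def parse_sse_data(provider: str, data: str) -> tuple[str, bool]:
    """Return (text_delta, done) for one SSE `data:` payload."""
    if provider == "claude":
        event = json.loads(data)
        if event.get("type") == "message_stop":
            return "", True
        if event.get("type") == "content_block_delta":
            return event.get("delta", {}).get("text", ""), False
        return "", False  # other event types carry no text

    if provider == "gemini":
        chunk = json.loads(data)
        candidates = chunk.get("candidates") or []
        parts = candidates[0].get("content", {}).get("parts", []) if candidates else []
        # Assumption: no explicit end marker, the stream ends when the connection closes.
        return "".join(p.get("text", "") for p in parts), False

    # OpenAI and OpenAI-compatible endpoints
    if data.strip() == "[DONE]":
        return "", True
    chunk = json.loads(data)
    choices = chunk.get("choices") or []
    delta = choices[0].get("delta", {}) if choices else {}
    return delta.get("content") or "", False


print(parse_sse_data("openai", '{"choices":[{"delta":{"content":"Hi"}}]}'))
print(parse_sse_data("claude", '{"type":"content_block_delta","delta":{"type":"text_delta","text":"Hi"}}'))
print(parse_sse_data("gemini", '{"candidates":[{"content":{"parts":[{"text":"Hi"}]}}]}'))
```

Returning the same (text, done) pair for every provider keeps the UI code free of provider checks.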

System prompt handling

OpenAI Chat Completions API accepts a system role in the messages array:

{"role": "system", "content": "You are a helpful assistant."}

OpenAI's newer Responses API uses a top-level instructions parameter instead:

{
  "instructions": "You are a helpful assistant.",
  "input": [...]
}

Claude takes the system prompt as a separate top-level parameter:

{
  "system": "You are a helpful assistant.",
  "messages": [...]
}

Gemini uses system_instruction as a separate field:

{
  "system_instruction": {"parts": [{"text": "You are a helpful assistant."}]},
  "contents": [...]
}

If you're abstracting this behind a single interface, you need to intercept the system message and route it to the correct location before sending the request. This ensures the model properly receives your system prompt regardless of provider.
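
The routing described above can be sketched as one function that takes a system prompt plus OpenAI-style messages and returns the provider-specific body. This is a minimal sketch, assuming plain text messages only. `build_payload` is a hypothetical name, model names and other required fields are left out, and the mapping of the assistant role to "model" for Gemini is how I understand its API.

```python
def build_payload(provider: str, system: str, messages: list[dict]) -> dict:
    """Place the system prompt where each provider expects it."""
    if provider == "openai-responses":
        return {"instructions": system, "input": messages}
    if provider == "claude":
        return {"system": system, "messages": messages}
    if provider == "gemini":
        contents = [
            {
                # Gemini uses "model" where the others use "assistant".
                "role": "model" if m["role"] == "assistant" else "user",
                "parts": [{"text": m["content"]}],
            }
            for m in messages
        ]
        return {"system_instruction": {"parts": [{"text": system}]}, "contents": contents}
    # Chat Completions and OpenAI-compatible endpoints: system goes in the messages array.
    return {"messages": [{"role": "system", "content": system}, *messages]}


system = "You are a helpful assistant."
chat = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there"},
    {"role": "user", "content": "Summarize our chat."},
]
for name in ("openai", "openai-responses", "claude", "gemini"):
    print(name, build_payload(name, system, chat))
```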

Token counting and cost

Each provider has its own pricing structure, so it's worth understanding the differences.

OpenAI charges separately for input and output tokens. Claude does the same but with different pricing tiers per model. Gemini has a free tier with rate limits, and a paid tier.

An interesting observation: the same prompt can produce different output lengths depending on the provider. Each model has its own default verbosity level. This means your cost per request varies even when the input is identical.

If you're letting users bring their own API key, make this transparent. Show estimated token counts before sending the request if possible.

Error handling differences

Each API returns errors differently.

OpenAI returns error.message with HTTP status codes you'd expect (429 for rate limit, 401 for bad key).

Claude returns errors in an error.type and error.message structure. Rate limits come back as rate_limit_error.

Gemini sometimes returns 200 OK with an error inside the response body, so it's important to check the response content as well as the HTTP status code.

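// Check both the HTTP status and the response body
async function parseResponse(response) {
  // Error bodies are not always valid JSON, so fall back to an empty object
  const data = await response.json().catch(() => ({}));
  // Gemini can return 200 with an error in the body
  if (!response.ok || data.error) {
    throw new Error(data.error?.message ?? `HTTP ${response.status}`);
  }
  return data;
}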

What I'd do differently

If I started over today, here's what I'd do from day one:

Normalize everything immediately. Keep provider-specific response formats contained in an adapter layer so your application logic stays clean.
Test with the cheapest model from each provider. Save your token budget by using GPT-4o-mini (or the newer GPT-5 series mini models), Claude Haiku, and Gemini Flash during development.
OpenAI-compatible is your best friend. If a provider supports the OpenAI format (and many do, including local tools like Ollama and LM Studio), treat them all as one integration. That covers 80% of providers with one code path.
Stream from the start. Adding streaming to a synchronous architecture later requires significant refactoring. Build for streaming on day one even if you don't need it yet.
Log raw responses during development. Having the raw API response saved makes debugging much faster when investigating unexpected behavior.

The bigger picture

The LLM API landscape in 2026 has evolved significantly since 2024.

Most providers now support function calling, structured outputs, and vision. The baseline quality is high enough that the "best" model depends more on your specific use case than on benchmark rankings. OpenAI now offers both the Chat Completions API and the newer Responses API, adding another dimension to consider when integrating.

At the same time, each provider continues to add features with their own implementations. MCP, tool use, multimodal inputs, and structured outputs all have differences across providers, which makes a good abstraction layer increasingly valuable.

The key advantage in this environment isn't picking the "best" LLM. It's building clean abstractions that let you switch between providers without rewriting your application.

I built an AI browser extension and website builder: here is what went into each decision

Lavelle Hatcher Jr — Fri, 03 Apr 2026 23:07:28 +0000

The problem I kept running into

Every browser extension project starts the same way. You need a manifest, a background script, popup HTML, icons. Before you have written a single line of real logic you have already written a hundred lines of setup.

AI can handle that. But the harder problem is what comes after. You get a working extension, you want to change one file, and either you regenerate everything or you edit manually with no context. Neither is good.

I built NuModeX Ext Maker to solve both.

What it does

NuModeX Ext Maker generates Manifest V3 browser extensions and static websites from text prompts. Output is code structured for Chrome, Edge, Firefox, Whale, Opera, or Safari.

Available now on the Chrome Web Store, Firefox Add-ons, Edge Add-ons, and the Whale Store. The interface is available in English, Japanese, Spanish, French, Korean, Chinese, German, Portuguese, and Italian.

Import from ZIP

The most requested feature before launch. Load an existing browser extension or website from a ZIP file directly into the tool and edit it with AI. No manual file-by-file importing, no rebuilding from scratch.

This matters because a lot of people have extensions they built years ago, by hand or with other tools, and want to maintain or extend them without starting over.

The post-build editing toolkit

Edit File - select a file from the tree, describe the change, AI rewrites only that file.

Improve Extension - passes the entire project to the AI. Describe a multi-file change in one prompt and it figures out what to update.

Add File - create new files in an existing project via plain language.

View Changes - diff viewer showing every file the AI modified, line-by-line or side-by-side, before you accept anything.

Undo - reverse the last AI edit in one click.

Import Files - bring individual existing files into the project for AI editing.

Manual inline editor - direct code editing when you want full control.

Live preview

A sandboxed iframe preview renders the output before you download. Multi-project support with auto-naming from the manifest. Projects auto-save and restore on reopen.

The AI setup

Not tied to any specific provider. Cloud AI models via your own API key, on-device AI models where your browser supports them with no API key required, or a custom model on a local or remote server running the /v1/chat/completions API.

One honest note on on-device models: they handle chat and editing well but cannot build full extensions from scratch. Use a cloud or custom model for initial generation.

No accounts, no connections, no setup maze

A lot of AI tools require you to create an account, connect a workspace, authorize third-party integrations, and navigate a settings flow before you can do anything. NuModeX Ext Maker has none of that. Install the extension, paste in your API key from your chosen cloud AI provider, and you are building. Everything runs in your browser. There is no dashboard to log into, no external service to connect, and no platform sitting between you and your AI provider.

The licensing decision

NuModeX Ext Maker is licensed under BSL 1.1 (Business Source License 1.1). Source is publicly available on GitHub.

What this means in practice: free for personal use and internal business use. You can copy it, modify it, study it, run a modified version internally. The one restriction is redistribution of NuModeX Ext Maker itself to browser extension marketplaces - that requires written permission from SoraVantia GK.

Extensions you generate with the tool are entirely yours. No restrictions on what you build or publish.

I chose BSL 1.1 because the marketplace restriction is the one thing that actually matters for protecting the product. It does not affect the vast majority of use cases.

Try it

Website: https://numodex.com/numodexextmaker
Chrome Web Store: https://chromewebstore.google.com/detail/numodex-ext-maker/amkcpiiepjfmcichnkniiabhdcieidpf
Firefox Add-ons: https://addons.mozilla.org/firefox/addon/numodex-ext-maker/
Edge Add-ons: https://microsoftedge.microsoft.com/addons/detail/jkdimfdgngcachpaggijnegdmmokdhkc
Whale Store: https://store.whale.naver.com/detail/jmbkjagjlhbnagganjfjeknboiagnnmk
GitHub: https://github.com/SoraVantia/NuModeX-Ext-Maker
Built by SoraVantia GK, Japan.

What part of the browser extension workflow causes you the most friction? Curious about what to prioritize next.
