<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Lavelle Hatcher Jr</title>
    <description>The latest articles on DEV Community by Lavelle Hatcher Jr (@lavellehatcherjr).</description>
    <link>https://dev.to/lavellehatcherjr</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3860079%2F668ec800-7cf1-492e-baf0-0ce71e77a1ef.png</url>
      <title>DEV Community: Lavelle Hatcher Jr</title>
      <link>https://dev.to/lavellehatcherjr</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/lavellehatcherjr"/>
    <language>en</language>
    <item>
      <title>Cli-Modelarium 0.1.4: 10 LLM providers now, with Qwen and GLM</title>
      <dc:creator>Lavelle Hatcher Jr</dc:creator>
      <pubDate>Wed, 24 Jun 2026 22:08:07 +0000</pubDate>
      <link>https://dev.to/lavellehatcherjr/cli-modelarium-014-10-llm-providers-now-with-qwen-and-glm-dg5</link>
      <guid>https://dev.to/lavellehatcherjr/cli-modelarium-014-10-llm-providers-now-with-qwen-and-glm-dg5</guid>
      <description>&lt;p&gt;Quick release note. Cli-Modelarium 0.1.4 just shipped, and the headline is two new providers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two new providers, ten in total
&lt;/h2&gt;

&lt;p&gt;You can now compare Alibaba's Qwen models (via DashScope) and Z.AI's GLM models side by side with the rest of the lineup: OpenAI, Anthropic, Google, xAI, DeepSeek, Mistral, Groq, OpenRouter, plus your local models. That brings it to 10 cloud providers.&lt;/p&gt;

&lt;p&gt;If you have wanted to benchmark the open-weight models against the frontier ones on your own prompts, it is now a single command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--upgrade&lt;/span&gt; cli-modelarium

cli-modelarium &lt;span class="s2"&gt;"Write a haiku about garbage collection in programming"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--models&lt;/span&gt; qwen3.7-max,glm-5.2,gpt-5.4,claude-opus-4-8 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--runs&lt;/span&gt; 10 &lt;span class="nt"&gt;--max-cost&lt;/span&gt; 0.50
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You get a side by side table with cost and latency per model. With &lt;code&gt;--runs&lt;/code&gt; greater than 1 it repeats the trials and runs the statistical tests automatically, so you can tell a real difference from noise instead of eyeballing one output. The &lt;code&gt;--max-cost&lt;/code&gt; flag is a hard cap, so a multi-model run does not surprise your API bill.&lt;/p&gt;

&lt;h2&gt;
  
  
  Also in this release
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Refreshed all pricing to current provider rates&lt;/li&gt;
&lt;li&gt;Added Qwen and GLM to the model groups (all-flagship, all-budget, all-fast, all-cheap), plus GLM to all-reasoning, so you can pull them in by group&lt;/li&gt;
&lt;li&gt;Added Python 3.14 support&lt;/li&gt;
&lt;li&gt;A few model id updates to track provider renames&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  New here?
&lt;/h2&gt;

&lt;p&gt;Cli-Modelarium is a command line tool for comparing LLM outputs side by side, with real statistics (bootstrap confidence intervals, paired significance tests, McNemar's), CI-ready assertions, hallucination detection, LLM-as-judge scoring, and cost tracking. One pip install, no infrastructure, Apache 2.0.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/lavellehatcherjr/cli-modelarium" rel="noopener noreferrer"&gt;https://github.com/lavellehatcherjr/cli-modelarium&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;PyPI: &lt;a href="https://pypi.org/project/cli-modelarium/" rel="noopener noreferrer"&gt;https://pypi.org/project/cli-modelarium/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Would love to hear how the new providers work for your use case.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>python</category>
      <category>opensource</category>
    </item>
    <item>
      <title>The 5 Things Your LLM Benchmark Misses That Actually Decide the Winner</title>
      <dc:creator>Lavelle Hatcher Jr</dc:creator>
      <pubDate>Tue, 23 Jun 2026 02:12:32 +0000</pubDate>
      <link>https://dev.to/lavellehatcherjr/the-5-things-your-llm-benchmark-misses-that-actually-decide-the-winner-12be</link>
      <guid>https://dev.to/lavellehatcherjr/the-5-things-your-llm-benchmark-misses-that-actually-decide-the-winner-12be</guid>
      <description>&lt;p&gt;&lt;em&gt;A practical guide to choosing the right LLM for your use case, before a generic ranking talks you into the wrong one.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Picture this. You switch to the LLM sitting at the top of every leaderboard. It costs four times what you were paying. Two weeks later you switch back, because on your actual prompts it was worse: it broke your output format about a third of the time, and the cheaper model you had been using almost never did. The leaderboard was not wrong. It just was not measuring anything your project cared about.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the leaderboard keeps lying to you
&lt;/h2&gt;

&lt;p&gt;Public leaderboards are useful for exactly one thing: a rough sense of which models are in the same general tier. Past that, they answer a question you are probably not asking.&lt;/p&gt;

&lt;p&gt;A leaderboard measures aggregate performance across a fixed set of tasks, usually academic-flavored ones: reasoning puzzles, exam questions, coding challenges, broad trivia. Your use case is almost certainly narrower than that. Maybe you need a model that reliably returns clean JSON. Maybe you need one that holds a very specific tone. Maybe you need one that is fast and cheap because you are running it ten thousand times a day, and "a little smarter" is worth nothing to you if it doubles your latency.&lt;/p&gt;

&lt;p&gt;The model ranked third overall might be first on your prompts. The leaderboard cannot tell you that, because it never saw your prompts. It also says nothing about cost, nothing about speed, and nothing about consistency, which is the thing that quietly wrecks production systems. A model that is brilliant ninety percent of the time and bizarre the other ten will look great in a demo and cause you pain for a year.&lt;/p&gt;

&lt;p&gt;People lean on leaderboards anyway for an understandable reason. Building your own benchmark feels like work, the ranking is right there, it has numbers, and numbers feel like truth. So they pick the top of the list, ship it, and find out the hard way.&lt;/p&gt;

&lt;p&gt;Here is the better approach. It is not complicated. It is mostly a matter of being deliberate about five things.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Build the test set from your real prompts
&lt;/h2&gt;

&lt;p&gt;This is the whole game, and it is &lt;strong&gt;the step people skip&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Do not benchmark on generic questions. Pull the actual prompts your application sends, or write a couple dozen that closely mirror them. If your product summarizes support tickets, your test set is real support tickets. If it writes product descriptions, it is real product data. The closer your test set is to your live traffic, the better your benchmark predicts real behavior.&lt;/p&gt;

&lt;p&gt;You do not need thousands. Twenty to fifty well-chosen prompts that cover your common cases plus your ugliest edge cases will tell you more than any giant academic benchmark. Include the weird ones on purpose. Edge cases are where models actually diverge, and they are exactly what a generic ranking averages away.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Decide what "better" actually means, in writing
&lt;/h2&gt;

&lt;p&gt;"Better" is the most expensive word in this entire process, because it hides the fact that you have not defined success.&lt;/p&gt;

&lt;p&gt;Before you compare anything, write down the conditions a good answer has to meet, and make them checkable. Not "the summary should be high quality," but things a machine or a careful reader can verify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does the output contain the fields it is supposed to?&lt;/li&gt;
&lt;li&gt;Is it valid JSON, or does it match your schema?&lt;/li&gt;
&lt;li&gt;Is it under the length you can ship, or over the minimum you need?&lt;/li&gt;
&lt;li&gt;Did it come back under your latency budget, and under your cost ceiling?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some of these are mechanical, and you can check them automatically. Others are judgment calls about tone or factual accuracy, and for those you either read the outputs yourself or have a strong model grade them against your written criteria (an approach known as LLM-as-a-judge). Either way the point holds: turn the fuzzy idea of quality into a set of specific things you can score. If you cannot say what "better" means before the test, you will just pick whichever output you happened to like, and quietly call your preference a result.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Hold everything else constant
&lt;/h2&gt;

&lt;p&gt;Short section, because the idea is simple and constantly ignored.&lt;/p&gt;

&lt;p&gt;Run every model on the same prompts, with the same settings, at the same temperature, on the same day if you can. Change the prompt between models, or test one at temperature zero and another at zero point seven, and you are no longer measuring the models. You are measuring your own inconsistency. Controlled comparison is the entire reason a benchmark means anything. The moment a variable moves that you did not intend to move, your result stops being evidence and starts being a story.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Run it more than once, because the model will not give you the same answer twice
&lt;/h2&gt;

&lt;p&gt;This is the part that trips up almost everyone, including people who absolutely know better.&lt;/p&gt;

&lt;p&gt;LLMs are not deterministic. Send the same prompt through the same model five times and you can get five different answers, sometimes meaningfully different ones. So a single run is not a measurement. It is an anecdote. If model A beats model B once, that tells you very little, because you could run it again and watch the result flip.&lt;/p&gt;

&lt;p&gt;The caution worth tattooing somewhere: &lt;strong&gt;a difference you saw once is not a difference until you have shown it holds up.&lt;/strong&gt; Run each model on each prompt several times. Look at how consistent each one is, not just how good its single best answer was. And if you want to claim one model is genuinely better than another, rather than just luckier on the day, that is a statistics question and not a vibes question. Significance testing exists precisely so you can tell a real gap from random noise. Consistency, not peak performance, is usually the thing you are actually buying.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Cost and speed are part of the answer, not a footnote
&lt;/h2&gt;

&lt;p&gt;The best model for your use case is almost never the smartest one available. It is the one that is good enough, at a price and speed you can live with.&lt;/p&gt;

&lt;p&gt;Once you have quality numbers, put cost and latency right next to them and look at the trade honestly. A two percent quality bump that costs ten times as much and runs at half the speed is a terrible deal for most workloads, and a perfectly good deal for a few. Which one you are depends entirely on your use case, which is the whole theme here. A high-volume background job and a low-volume, high-stakes legal summarizer should not pick the same model, and a leaderboard would happily aim both at the same expensive option. Decide what you are optimizing for, then let the cheapest model that clears your quality bar win.&lt;/p&gt;

&lt;h2&gt;
  
  
  The honest problem: doing all of this by hand is a slog
&lt;/h2&gt;

&lt;p&gt;Everything above is straightforward in principle and genuinely tedious in practice.&lt;/p&gt;

&lt;p&gt;To actually run it, you are writing code against each provider's API, all of which differ in their own small infuriating ways. You are normalizing responses into something comparable. You are counting tokens and translating them into cost per model. You are running everything several times to deal with variance, gathering the results, then doing the statistics to find out whether the gap you are looking at is real. One comparison is an afternoon. Doing it properly, across several models, with repeats and significance, is a couple of days you do not get back. And then next month three new models drop and you want to re-run the whole thing, so you do.&lt;/p&gt;

&lt;p&gt;I got tired of rebuilding that harness for every project, so I built a command-line tool that does it. It runs the same prompts across models from ten providers side by side, applies the kind of pass-or-fail checks from step two, repeats runs to handle variance, runs the significance tests for you (with confidence intervals, so you can tell a real gap from noise), and tracks cost and latency per model so the trade in step five is just sitting there in the table. You point it at your prompts, set your criteria, and it hands you the comparison.&lt;/p&gt;

&lt;p&gt;But &lt;strong&gt;the technique matters more than the tool.&lt;/strong&gt; Everything in this article works whether you use software or a spreadsheet and a lot of patience. The five steps are the skill. A tool just makes them faster, and spares you from rewriting the same plumbing every time a new model ships. The judgment (what to test, what "better" means for you, what trade you are willing to accept) is yours, and it should stay that way. No tool decides that for you, and you should be a little suspicious of any that claims to.&lt;/p&gt;

&lt;p&gt;If you want to try it, it is one install:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;cli-modelarium
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Source and docs are on GitHub: &lt;a href="https://github.com/lavellehatcherjr/cli-modelarium" rel="noopener noreferrer"&gt;https://github.com/lavellehatcherjr/cli-modelarium&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The package page is on PyPI: &lt;a href="https://pypi.org/project/cli-modelarium/" rel="noopener noreferrer"&gt;https://pypi.org/project/cli-modelarium/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The bottom line
&lt;/h2&gt;

&lt;p&gt;Leaderboards rank models against a generic idea of "good." You are not building for a generic idea of good. You are building for a specific task, with specific constraints, a specific budget, and specific failure modes that matter intensely to you and to almost nobody on that leaderboard.&lt;/p&gt;

&lt;p&gt;The skill is being deliberate: test on your real prompts, define what better means before you look, hold your variables steady, run things enough times to trust the answer, and weigh quality against cost and speed. Do that and you will sometimes find the expensive top-ranked model really is the best fit. More often you will find something cheaper, faster, and steadier that the ranking buried at number five. Either way you will know, instead of guessing.&lt;/p&gt;

&lt;p&gt;The leaderboard is a starting point. Your own benchmark is the answer.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I build cli-modelarium, an open-source CLI for comparing LLM outputs side by side with statistics, assertions, and cost tracking. If this was useful, the tool is on &lt;a href="https://github.com/lavellehatcherjr/cli-modelarium" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; and &lt;a href="https://pypi.org/project/cli-modelarium/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt; under the same name.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>llm</category>
      <category>programming</category>
    </item>
    <item>
      <title>The 6 Forensic Red Flags Hiding in SEC Filings Most Screeners Ignore</title>
      <dc:creator>Lavelle Hatcher Jr</dc:creator>
      <pubDate>Tue, 23 Jun 2026 00:57:54 +0000</pubDate>
      <link>https://dev.to/lavellehatcherjr/the-6-forensic-red-flags-hiding-in-sec-filings-most-screeners-ignore-2knk</link>
      <guid>https://dev.to/lavellehatcherjr/the-6-forensic-red-flags-hiding-in-sec-filings-most-screeners-ignore-2knk</guid>
      <description>&lt;p&gt;&lt;em&gt;A practical guide to reading the warning signs in a small company's own disclosures, before they show up in the price.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Note: This is educational content about how to read SEC filings, not investment advice. It does not recommend buying, selling, or holding any security. Always do your own research and consult a licensed professional before making financial decisions. Micro-cap stocks carry extreme risk, including the total loss of your capital.&lt;/p&gt;

&lt;p&gt;A small company can file an 8-K disclosing that it has fallen out of compliance with its exchange's continued-listing rules, and the next morning most retail screeners will still show it sorted neatly by price and volume, with no indication anything is wrong. The warning was right there in the filing. The screener never opened it.&lt;/p&gt;

&lt;p&gt;That gap is the whole reason I started reading filings by hand, and eventually the reason I wrote a tool to do the reading for me. But the tool matters less than the skill. If you understand what these red flags look like in a filing, you can spot them yourself with nothing but a browser and the SEC's free EDGAR database.&lt;/p&gt;

&lt;p&gt;This is a walk through the six red flags I look for first in a small, thinly covered company, where they live in the filings, and why each one matters. None of this is exotic. It is all sitting in public documents that anyone can read for free. Most people just never do, because reading filings is tedious and screeners are easy.&lt;/p&gt;

&lt;p&gt;Let's fix that.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why screeners miss this
&lt;/h2&gt;

&lt;p&gt;A stock screener reads numbers that someone else has already aggregated: price, market cap, P/E, maybe short interest. That feels like research, but it is reading a summary, not reading the company.&lt;/p&gt;

&lt;p&gt;For a large, heavily covered company, that is often fine. Analysts pore over every disclosure, and problems surface quickly in the price. But for a small, thinly traded micro-cap, the things that actually matter, dilution, a going-concern warning, a delisting notice, a restatement, are disclosed in the company's SEC filings long before they reach the screener's columns. The signal is in the primary documents. The screener never opens them.&lt;/p&gt;

&lt;p&gt;So if you want an edge in this corner of the market, the edge is simply this: read the filings the screener skips. Here is what to look for.&lt;/p&gt;

&lt;h2&gt;
  
  
  Red Flag 1: Delisting and continued-listing deficiency notices
&lt;/h2&gt;

&lt;p&gt;This one lives in 8-K filings under Item 3.01, and it is one of the clearest warning signs in all of micro-cap investing, hiding in a structured, searchable field that almost nobody checks.&lt;/p&gt;

&lt;p&gt;When a company falls out of compliance with its exchange's listing standards, for example its share price stays below the minimum bid requirement for too long, or its market value drops below a threshold, the exchange sends a deficiency notice. The company is generally required to disclose that notice in an 8-K under Item 3.01 ("Notice of Delisting or Failure to Satisfy a Continued Listing Rule or Standard"). An Item 3.01 does not guarantee the company will be delisted, companies often regain compliance, but it tells you the company is on the clock and that the exchange has formally put it on notice.&lt;/p&gt;

&lt;p&gt;To find one yourself, pull the company's filing history on EDGAR, look through its recent 8-K filings, and check the item numbers on each cover. An Item 3.01 is the one you want. Read the body to see which rule was tripped (minimum bid price, market value, stockholders' equity, and so on) and what the company says it plans to do about it.&lt;/p&gt;

&lt;p&gt;One honest caution: the notice tells you the company is in a compliance window, but it usually does not, by itself, tell you exactly how many days are left on the clock. That depends on the specific rule and the company's price history over time. So treat a 3.01 as "this company is in trouble with its exchange, dig deeper," never as a precise countdown.&lt;/p&gt;

&lt;h2&gt;
  
  
  Red Flag 2: Financial distress and solvency stress
&lt;/h2&gt;

&lt;p&gt;A micro-cap that is burning cash and running low on solvency is a fundamentally different risk than one that is merely small, and the filings tell you which one you are looking at if you know where to look. The relevant documents are the financial statements in the 10-K and 10-Q, the auditor's report, and the notes.&lt;/p&gt;

&lt;p&gt;The most direct signal is a going-concern qualification. Auditors are required to flag substantial doubt about a company's ability to keep operating, and when they do, it shows up in the auditor's report and the notes to the financial statements. The phrase to search for is literally "going concern." If you find it, stop and read that section in full, because it is one of the loudest warnings a set of financials can carry.&lt;/p&gt;

&lt;p&gt;Beyond the explicit warning, you can assess distress quantitatively, and there are well-established published models for exactly that. The best known is the Altman Z-Score (with variants for non-manufacturing companies), which folds several balance-sheet and income-statement ratios into a single solvency reading that lands in a "safe," "grey," or "distress" zone. It has been studied and used for decades, and you can compute it by hand from the line items in the statements: working capital, retained earnings (a large accumulated deficit is telling on its own), operating income, and total liabilities against equity. Open the most recent 10-K and 10-Q, read the auditor's report for going-concern language, then run the ratios. It is not hard, just tedious, and that tedium compounds fast across a watchlist. (More on that later.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Red Flag 3: Accounting quality and earnings-manipulation signals
&lt;/h2&gt;

&lt;p&gt;This is the subtle one, and it is where forensic accounting earns its name. The question shifts from "is the company healthy" to something harder: "are the reported numbers even trustworthy, or are there statistical fingerprints suggesting the financials are being massaged?" You answer it by reading the financial statements across several periods, not from a single snapshot.&lt;/p&gt;

&lt;p&gt;Two published models are the standard tools. The Beneish M-Score combines eight ratios (days-sales-in-receivables, gross-margin changes, asset-quality changes, total accruals, and others) into one number designed to flag a heightened probability of earnings manipulation. It was famously cited in connection with detecting the Enron irregularities before they became public. The Piotroski F-Score is the complement: a nine-point checklist of fundamental-health signals across profitability, leverage and liquidity, and operating efficiency, used to separate companies that are genuinely improving from ones quietly deteriorating.&lt;/p&gt;

&lt;p&gt;Neither is a verdict, and it is important to hold them that way. A high Beneish M-Score does not prove manipulation; a low Piotroski F-Score does not prove a company is doomed. They are screening signals that tell you where to look harder. But they are rigorous, published, decades old, and computable from the numbers in the filings.&lt;/p&gt;

&lt;p&gt;The practical obstacle is data. Both models depend on period-over-period changes, so you need at least two fiscal years of financials to compute them. EDGAR exposes structured financial data (the XBRL tags behind the statements) that makes this possible, but assembling two years of it by hand, for several companies, is exactly the kind of laborious bookkeeping nobody actually wants to do.&lt;/p&gt;

&lt;h2&gt;
  
  
  Red Flag 4: Dilution and aggressive capital raising
&lt;/h2&gt;

&lt;p&gt;For a micro-cap, dilution is arguably the single most common way shareholders quietly lose money. The company issues new shares to raise cash, and every existing share is worth a little less. Done repeatedly, it grinds value down even if the business itself is fine. And it is all disclosed, if you read for it. The signals live in registration statements (S-1, S-3, F-1, F-3 and their variants), prospectus supplements (the 424B family), the share-count figures across filings, and 8-K disclosures of corporate actions.&lt;/p&gt;

&lt;p&gt;The signals to watch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Shelf registrations (typically Form S-3, or S-1 / F-3 for some companies) let a company register a large block of securities to sell over time. A live shelf is a loaded mechanism for future dilution.&lt;/li&gt;
&lt;li&gt;At-the-market ("ATM") drip-selling, which you can often infer from repeated prospectus-supplement filings (the 424B family) over a short window. A steady stream of these can indicate the company is continuously feeding shares into the market.&lt;/li&gt;
&lt;li&gt;Rising share counts over time. Pull the diluted-share figure from several consecutive filings and look at the trajectory. A rapidly climbing count is dilution in action; an accelerating one is worse.&lt;/li&gt;
&lt;li&gt;Serial reverse splits. A company that repeatedly does reverse splits (consolidating shares to prop up the per-share price) is often masking ongoing dilution underneath. You can detect this from the share-count history, no price chart required.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To check it yourself in EDGAR, look for S-3 / S-1 / F-3 filings (shelf capacity), watch the cadence of 424B filings (ATM activity), and track the share count across the last several 10-Qs and 10-Ks. The trend tells the story.&lt;/p&gt;

&lt;h2&gt;
  
  
  Red Flag 5: Insider activity, the right way
&lt;/h2&gt;

&lt;p&gt;Insider behavior is one of the most cited and most misunderstood signals. The key is to read it carefully, because the raw data is full of noise that looks like signal. It lives in Forms 3, 4, and 5 (insider transactions), Form 144 (proposed sales), and Schedules 13D and 13G (large ownership stakes).&lt;/p&gt;

&lt;p&gt;The genuinely informative event is open-market buying: an insider using their own money to buy shares on the open market. That is a conviction signal, an insider voting with their wallet. In a Form 4, this is a transaction coded P (open-market purchase).&lt;/p&gt;

&lt;p&gt;The trap is that a lot of what shows up as "insider acquisition" is not open-market buying. Stock grants (code A), option exercises (code M), and shares withheld to cover taxes (code F) are routine compensation events, not signals of conviction. If you count those as bullish "buying," you will badly misread the data. Genuine analysis separates real open-market purchases from routine compensation.&lt;/p&gt;

&lt;p&gt;On the other side, Form 144 discloses an insider's intent to sell, and a cluster of these can indicate forward selling pressure ("overhang"). And Schedules 13D and 13G disclose when someone crosses a 5% ownership threshold, 13D signals an activist or strategic intent, 13G a passive holding.&lt;/p&gt;

&lt;p&gt;To do this yourself, pull the company's Form 4 filings and read the transaction codes, do not just count "acquisitions." A genuine open-market P purchase by an officer or director means something; a tax-withholding F does not. This distinction is the whole game with insider data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Red Flag 6: Material events in the 8-K record
&lt;/h2&gt;

&lt;p&gt;The 8-K is the "something material just happened" filing, and its real power is that every event is filed under a specific, structured item code. Once you know the codes, a company's 8-K history stops being a pile of documents and becomes a readable timeline of its material events. These are the codes I weight most heavily for a small company:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Item 4.01: a change of auditor. Frequent or abrupt auditor changes can be a warning sign.&lt;/li&gt;
&lt;li&gt;Item 4.02: non-reliance on previously issued financials, in plain terms, a restatement. This is one of the most serious items there is.&lt;/li&gt;
&lt;li&gt;Item 5.02: departure or appointment of directors and officers. A wave of executive departures is worth understanding.&lt;/li&gt;
&lt;li&gt;Items 2.03 / 2.04: the creation of a major financial obligation, or a triggering event that accelerates one. Both point to debt and liquidity stress.&lt;/li&gt;
&lt;li&gt;Item 1.03: bankruptcy or receivership. (Self-explanatory.)&lt;/li&gt;
&lt;li&gt;Item 3.02: unregistered sales of equity securities (another dilution-related disclosure).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The real skill is not spotting one 8-K, but reading the pattern over time, weighted by how serious each item is. A restatement (4.02) plus an auditor change (4.01) plus a string of officer departures (5.02) tells a very different story than a single routine filing. To do it yourself, pull the company's 8-K list on EDGAR, note the item number on each, and lay them out in order. The timeline usually speaks for itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Putting it together (and the tedious part)
&lt;/h2&gt;

&lt;p&gt;None of these six red flags is hidden. They are all in public filings, filed under structured forms and item codes, free to read on EDGAR. The "edge" is not access, it is the willingness to actually open the documents and read them the way a screener never will.&lt;/p&gt;

&lt;p&gt;Here is the honest catch, though, the one that sent me down this road in the first place: doing all six of these, by hand, for every company you are curious about, is genuinely laborious. Pulling the 8-K item codes, computing the Altman and Beneish ratios across two years of XBRL data, tracking share counts, reading every Form 4 transaction code, one company is an afternoon; a watchlist is a weekend you will not get back.&lt;/p&gt;

&lt;p&gt;So I eventually built a free, open-source command-line tool, PennyTune, that does this reading automatically: it pulls a company's filings from SEC EDGAR and surfaces these same forensic signals, computed from the public filings, with the 8-K item codes named. It runs on public data with no API key, and it is deliberately built to surface evidence for your own due diligence, not buy or sell recommendations, which is exactly the spirit of this article.&lt;/p&gt;

&lt;p&gt;If it is useful to you, it installs in one command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;pennytune
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/lavellehatcherjr/pennytune" rel="noopener noreferrer"&gt;https://github.com/lavellehatcherjr/pennytune&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;PyPI: &lt;a href="https://pypi.org/project/pennytune/" rel="noopener noreferrer"&gt;https://pypi.org/project/pennytune/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But I want to be clear about the order of importance: the technique matters more than the tool. Whether you run a CLI or read the filings by hand, the six red flags above are the same, and learning to recognize them is the durable skill. A tool just makes the reading faster. The judgment is, and should remain, yours.&lt;/p&gt;

&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;For a small, thinly covered company, the risks that matter most are disclosed in SEC filings long before they reach a screener: delisting-deficiency notices (8-K Item 3.01), financial distress and going-concern warnings (the financials and auditor's report, plus models like Altman Z), accounting-quality and earnings-manipulation signals (Beneish M and Piotroski F across multiple periods), dilution and aggressive capital raising (shelf registrations, ATM activity, rising share counts, serial reverse splits), genuine insider buying versus routine compensation noise (Form 4 transaction codes), and the pattern of material events in the 8-K record (auditor changes, restatements, officer departures, and more). All of it is free to read on EDGAR. The only thing standing between you and it is the willingness to open the filings.&lt;/p&gt;

&lt;p&gt;PennyTune is a free, open-source tool I built for this kind of analysis; it is on &lt;a href="https://github.com/lavellehatcherjr/pennytune" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; (&lt;code&gt;pip install pennytune&lt;/code&gt;). This is educational content, not investment advice, and micro-caps carry extreme risk including total loss. Verify everything against the primary filings before acting on it.&lt;/p&gt;

</description>
      <category>stocks</category>
      <category>investing</category>
      <category>finance</category>
      <category>fintech</category>
    </item>
    <item>
      <title>Penny Stock Due Diligence From Your Terminal: A Free CLI</title>
      <dc:creator>Lavelle Hatcher Jr</dc:creator>
      <pubDate>Mon, 15 Jun 2026 23:49:48 +0000</pubDate>
      <link>https://dev.to/lavellehatcherjr/penny-stock-due-diligence-from-your-terminal-a-free-cli-134b</link>
      <guid>https://dev.to/lavellehatcherjr/penny-stock-due-diligence-from-your-terminal-a-free-cli-134b</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F362tqz3ispz4z9vvz196.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F362tqz3ispz4z9vvz196.png" alt=" " width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A free, open-source CLI that reads a stock's SEC filings. pip install pennytune&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Note: This is a personal, independent project, not affiliated with or endorsed by any company or regulator. PennyTune surfaces evidence for your own due diligence. It does not constitute financial or investment advice.&lt;/p&gt;

&lt;p&gt;A penny stock can file an 8-K disclosing a continued-listing deficiency, and the next morning most retail screeners will still show it as a green "buy." The warning was right there in the filing. The screener never read it.&lt;/p&gt;

&lt;p&gt;That gap is the whole reason this tool exists.&lt;/p&gt;

&lt;p&gt;Most penny stock and micro-cap screeners sort by price, volume, and a handful of ratios. They don't open the filings. But for a small, thinly covered company, the things that actually matter, dilution, going-concern doubt, a delisting notice, a restatement, are disclosed in SEC filings long before they show up in the price.&lt;/p&gt;

&lt;p&gt;So I built a CLI that does the reading. It's called PennyTune, it's &lt;a href="https://github.com/lavellehatcherjr/pennytune" rel="noopener noreferrer"&gt;open source on GitHub&lt;/a&gt;, and it's &lt;a href="https://pypi.org/project/pennytune/" rel="noopener noreferrer"&gt;live on PyPI&lt;/a&gt; today under the MIT license.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;pennytune
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the rest of this post I'll walk through what it does, why I built it, what's actually under the hood, and how you can use it for your own due-diligence work in under a minute.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Gap Most Screeners Miss
&lt;/h2&gt;

&lt;p&gt;Retail tools for vetting small, cheap stocks fall into two buckets.&lt;/p&gt;

&lt;p&gt;On one end you have the screener. You sort a few thousand tickers by market cap, price, and maybe a P/E or a short-interest column, and you get a tidy table. It feels like research. But a screener is reading numbers someone else already aggregated. It is not reading the company's own disclosures. A stock can look statistically fine in a screener while its latest 8-K announces it has 30 days to regain compliance or be delisted.&lt;/p&gt;

&lt;p&gt;On the other end you have reading every filing by hand. This is where the real signal is. But pulling up EDGAR, finding the right 10-K, 10-Q, and 8-K, and working through dilution, financial distress, insider transactions, and going-concern language, for every name you're curious about, takes serious time. Most people don't do it, because it doesn't scale.&lt;/p&gt;

&lt;p&gt;So the signal that matters most is the signal almost nobody looks at, because looking at it is tedious.&lt;/p&gt;

&lt;p&gt;That's the gap PennyTune fills. It does the filing-reading for you and surfaces the forensic red flags in a penny stock's disclosures, so you can spend your time on judgment instead of document retrieval.&lt;/p&gt;

&lt;h2&gt;
  
  
  What It Does in 30 Seconds
&lt;/h2&gt;

&lt;p&gt;You install it. You set a one-time SEC identity (more on that below). You point it at a ticker. You get a forensic breakdown in your terminal.&lt;/p&gt;

&lt;p&gt;Here's a real example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pennytune inspect TICKER
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That command pulls the company's filings from SEC EDGAR and computes a set of forensic risk signals from them: financial-distress and accounting-quality scores, dilution and capital-raising activity, insider buying versus selling, 8-K material events, and delisting and trading-suspension risk. It then shows you a decomposed score, so you can see which factors are dragging it down and which aren't.&lt;/p&gt;

&lt;p&gt;If you want to rank a set of names instead of inspecting one, there's a scan:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pennytune scan TICKER1 TICKER2 TICKER3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That ranks a curated set of tickers you choose by their filing-derived risk signals, so you can triage a watchlist instead of going one by one.&lt;/p&gt;

&lt;p&gt;No infrastructure. No dashboard. No account with a data vendor. No API key. Just a CLI that reads filings and returns an answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Actually Under the Hood
&lt;/h2&gt;

&lt;p&gt;The headline is "it reads filings." The details are where the work lives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One data source, on purpose: SEC EDGAR.&lt;/strong&gt; PennyTune pulls exclusively from the SEC's public EDGAR system, the same filings the company is legally required to submit. No paid data feeds, no scraped third-party aggregators, no opaque "alternative data." Everything it reports is computed from public filings, which means the inputs are auditable and free.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Forensic risk signals.&lt;/strong&gt; Rather than a single black-box score, PennyTune decomposes a company into separate signals so you can see what's driving the picture: financial-distress and accounting-quality scoring drawn from established, published models; dilution and shelf/capital-raise activity; insider transaction patterns; 8-K material events; and delisting and trading-suspension risk. For event-driven red flags, it names the specific 8-K item behind the flag (for example, a 3.01 continued-listing deficiency), so an event flag isn't just an abstract number.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evidence, not verdicts.&lt;/strong&gt; This is the part I want to be precise about, because it's a deliberate design choice and it shapes everything. PennyTune does not tell you a stock is "clean" or "a landmine." It does not give buy or sell signals. It does not assess tradeability or fetch live prices. It surfaces the forensic signals that are in the filings and leaves the judgment to you. The output is framed as evidence for your own due diligence, full stop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No API key, no account.&lt;/strong&gt; A lot of finance tooling assumes you'll sign up somewhere and manage a secret key. PennyTune doesn't. The only identity it needs is the contact string the SEC itself asks programmatic EDGAR users to send (a name and email in the request header, so the SEC can reach you if your requests cause load). That's the SEC's own fair-access requirement, not an account with me. It's stored locally on your machine, redacted in config output, and never sent to the author or anyone else.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-platform.&lt;/strong&gt; It runs on Linux, macOS, and Windows, on current Python versions, as a pure-Python package.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Engineering Behind It
&lt;/h2&gt;

&lt;p&gt;This is the part I think matters most. For a tool that touches financial data, "trust me" isn't good enough.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;365 automated tests.&lt;/strong&gt; Every commit runs 365 automated tests covering the scoring logic, the filing-parsing, CLI behavior, error handling, and edge cases. They run fully offline against fixtures, so the suite doesn't depend on a live network.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;12 OS/Python combinations.&lt;/strong&gt; The CI matrix runs on Linux, macOS, and Windows across Python 3.11, 3.12, 3.13, and 3.14. That's 12 combinations on every push, with no platform-specific skips, so if something works on Linux but breaks on Windows, I find out before users do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Established models, not invented math.&lt;/strong&gt; The financial-distress and accounting-quality scoring is built on published academic models, not formulas I made up. The code is open source, so anyone can read exactly how each signal is computed and check it against the source models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Honest about its limits.&lt;/strong&gt; PennyTune doesn't claim to be a verdict machine. It doesn't fetch live prices, it doesn't assess whether a stock is tradeable, and it's explicit in its own output and documentation that it's a research tool, not advice. Being upfront about what it does not do is part of the design.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I Built This
&lt;/h2&gt;

&lt;p&gt;I kept wanting a fast way to answer one question about a small, cheap stock: is it cheap for a reason that's sitting in its filings?&lt;/p&gt;

&lt;p&gt;The honest answer to that question is almost always in the SEC filings, and almost never in a screener. So I'd end up on EDGAR, opening 10-Ks and 8-Ks, reading through dilution and going-concern language by hand. It worked, but it was slow, and it didn't scale past a name or two.&lt;/p&gt;

&lt;p&gt;The tool started as a way to automate the filing-reading I was doing manually. Then I added the decomposed scoring so I could see what was driving a flag instead of just getting a number. Then I added the watchlist scan so I could triage more than one name. Then I added the test suite, because for anything that parses financial filings, a quiet bug is worse than a loud one.&lt;/p&gt;

&lt;p&gt;At some point it was useful enough that I figured other people poking at micro-caps would want it too, so I open sourced it under MIT.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;The full workflow takes about a minute.&lt;/p&gt;

&lt;p&gt;Install:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;pennytune
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Works on Linux, macOS, and Windows. Python 3.11 or newer.&lt;/p&gt;

&lt;p&gt;Do the one-time setup (this sets the SEC-required contact identity and your preferences):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pennytune init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Inspect a single ticker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pennytune inspect TICKER
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rank a watchlist:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pennytune scan TICKER1 TICKER2 TICKER3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See exactly what data sources it touches and the free-tier limits:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pennytune sources
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Read the full disclaimer (also shown on first run):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pennytune disclaimer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The launch version is feature-complete for the due-diligence workflow I had when I built it. One thing I'm considering for a future version: the filing references (the specific form, date, and accession) already exist internally where the data is fetched, so surfacing them next to each flag, so you could click straight through to the exact filing, is a natural next step. If you'd find that useful, that's good signal for me to prioritize it.&lt;/p&gt;

&lt;p&gt;If you use it and find something missing, open an issue.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;pennytune
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I built this because I wanted it. Putting it out there in case other people find it useful too.&lt;/p&gt;

&lt;p&gt;PennyTune surfaces evidence for your own due diligence. It is not investment advice, and it doesn't tell you what to buy or sell. The judgment stays with you.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/lavellehatcherjr/pennytune" rel="noopener noreferrer"&gt;PennyTune on GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pypi.org/project/pennytune/" rel="noopener noreferrer"&gt;PennyTune on PyPI&lt;/a&gt;&lt;/p&gt;

</description>
      <category>fintech</category>
      <category>python</category>
      <category>opensource</category>
      <category>stocks</category>
    </item>
    <item>
      <title>Bringing Scientific Rigor to LLM Comparison</title>
      <dc:creator>Lavelle Hatcher Jr</dc:creator>
      <pubDate>Sun, 31 May 2026 09:52:16 +0000</pubDate>
      <link>https://dev.to/lavellehatcherjr/bringing-scientific-rigor-to-llm-comparison-5a6l</link>
      <guid>https://dev.to/lavellehatcherjr/bringing-scientific-rigor-to-llm-comparison-5a6l</guid>
      <description>&lt;h2&gt;
  
  
  Why I built Cli Modelarium, and why it belongs in your terminal, not a dashboard
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Note: This is a personal project, not affiliated with any company. This does not constitute financial or investment advice.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Every time I wanted to compare two LLMs, I had to pick between a quick spot check in a chat window or spinning up an entire evaluation platform.&lt;/p&gt;

&lt;p&gt;One tells you nothing useful.&lt;/p&gt;

&lt;p&gt;The other takes longer to set up than the comparison is worth.&lt;/p&gt;

&lt;p&gt;So I built a CLI that does it from the terminal. It's called &lt;strong&gt;Cli Modelarium&lt;/strong&gt;, and it's live on PyPI today under Apache 2.0.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;cli-modelarium
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the rest of this post I'll walk through what it does, why I built it, what's actually under the hood, and how you can use it for your own LLM comparison work in under a minute.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem No One Talks About
&lt;/h2&gt;

&lt;p&gt;The LLM tooling landscape has two ends.&lt;/p&gt;

&lt;p&gt;On one end you have the chat-window spot check. You paste a prompt into Claude, then into GPT, then into Gemini, eyeball the outputs, and decide which one is "better." This is what most developers actually do. It feels productive. It produces nothing trustworthy.&lt;/p&gt;

&lt;p&gt;The problem with spot checks is that LLM output has variance. You can run the same prompt twice and get different answers. You can also run the same prompt across two models, get answers that look similar, and miss the fact that one is hallucinating subtle facts. Eyeballing single outputs is not a comparison. It's a vibe.&lt;/p&gt;

&lt;p&gt;On the other end you have enterprise evaluation platforms. These exist and they're powerful. They also require you to set up an account, configure an integration, define a dataset schema, write evaluators, plug in providers, and orchestrate runs through a dashboard. By the time you've finished onboarding, the question you wanted to answer has changed.&lt;/p&gt;

&lt;p&gt;Most LLM comparison questions don't justify that overhead. You want to know: which model produces better outputs for this specific prompt at this specific cost. You don't want a dashboard. You want an answer.&lt;/p&gt;

&lt;p&gt;That's the gap Cli Modelarium fills.&lt;/p&gt;




&lt;h2&gt;
  
  
  What It Does in 30 Seconds
&lt;/h2&gt;

&lt;p&gt;You install it. You set your provider keys. You run a comparison. You get statistically rigorous results in your terminal.&lt;/p&gt;

&lt;p&gt;Here's a real example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cli-modelarium &lt;span class="s2"&gt;"Explain quantum entanglement."&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--models&lt;/span&gt; claude-haiku-4-5,gpt-4o-mini &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-cost&lt;/span&gt; 0.10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That command sends the prompt to both models through their official APIs, tracks cost per call against your &lt;code&gt;--max-cost&lt;/code&gt; cap so you don't accidentally spend more than 10 cents, measures time to first token and total latency for each model, and returns a side-by-side comparison with timing, cost, and full outputs.&lt;/p&gt;

&lt;p&gt;No infrastructure. No dashboard. No account onboarding. Just a CLI that returns an answer.&lt;/p&gt;

&lt;p&gt;If you want statistical rigor on top of that, you add a few flags:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cli-modelarium &lt;span class="s2"&gt;"Explain quantum entanglement."&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--models&lt;/span&gt; claude-haiku-4-5,gpt-4o-mini,gemini-2.0-flash &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--runs&lt;/span&gt; 10 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--significance&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--hallucination-check&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--judge&lt;/span&gt; claude-opus-4 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-cost&lt;/span&gt; 1.00
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you're running 10 trials against each of three models, computing bootstrap confidence intervals, running paired significance tests, checking outputs for hallucination patterns, and using a separate model as a judge to score quality, all while staying under a $1 cost cap.&lt;/p&gt;

&lt;p&gt;That's the gap I wanted to close. Publication-grade methodology, terminal-grade ergonomics.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Actually Under the Hood
&lt;/h2&gt;

&lt;p&gt;The headline features are easy to list. The details are where the work lives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-Provider Support.&lt;/strong&gt; Cli Modelarium supports 8 cloud LLM providers plus local models through a unified interface: OpenAI, Anthropic, Google, xAI, DeepSeek, Mistral, Groq, and OpenRouter. Each provider has its own SDK, its own auth pattern, its own error semantics, its own rate-limit behavior. The interface hides all of that. You specify &lt;code&gt;--models claude-haiku-4-5,gpt-4o-mini&lt;/code&gt; and the CLI figures out which provider to route each call to, handles credentials, and returns normalized outputs.&lt;/p&gt;

&lt;p&gt;You only need API keys for the providers you actually want to use. You set them once with &lt;code&gt;cli-modelarium configure&lt;/code&gt; or via environment variables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Statistical Rigor.&lt;/strong&gt; This is where Cli Modelarium differs from every other LLM comparison tool I've used. LLM outputs have variance. To compare them rigorously you need actual statistics, not visual inspection. The CLI implements bootstrap confidence intervals using the BCa method, paired statistical tests including McNemar's test for binary outcomes, multiple comparison corrections including Bonferroni and Holm methods, and effect sizes using Cohen's d. These aren't decorative additions. They're the methods you'd use if you were writing a research paper comparing LLMs. The CLI just makes them invokable through flags.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hallucination Detection.&lt;/strong&gt; When models generate plausible-sounding nonsense, statistical tests don't catch it. Hallucination detection runs additional checks on outputs to flag responses that contain markers of fabrication: invented citations, contradictory claims within the same response, fabricated names or dates, and other patterns that experienced reviewers learn to spot. It's not perfect. No hallucination detector is. But it surfaces high-risk outputs for human review, which is far better than flying blind.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM-as-Judge Panels.&lt;/strong&gt; For subjective quality questions, you can use a separate model as a judge. The CLI supports panels with multiple judge models voting independently to reduce single-judge bias.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost and Latency Tracking.&lt;/strong&gt; Every comparison tracks cost per call, total cost, time to first token, and total latency per model. The &lt;code&gt;--max-cost&lt;/code&gt; flag enforces a hard cap. If your comparison would exceed the budget, the CLI stops before the next call and reports what it did.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Engineering Behind It
&lt;/h2&gt;

&lt;p&gt;This is the part that usually gets skipped. I think it matters because it's where a CLI either earns trust or doesn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;917 Automated Tests.&lt;/strong&gt; Every commit runs 917 automated tests covering provider integration, statistical computation accuracy, CLI behavior, error handling, and edge cases. Zero CI failures since v0.1.0.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;9 OS/Python Combinations.&lt;/strong&gt; The CI matrix runs on Linux, macOS, and Windows across Python 3.11, 3.12, and 3.13. That's 9 combinations on every push. If something works on Linux but breaks on Windows, I know before users do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Statistical Validation Against Literature.&lt;/strong&gt; For the statistical methods, "passes tests" isn't enough. I cross-validated outputs against reference implementations: bootstrap CIs against scipy's bootstrap method with BCa correction, McNemar's test against binomtest for small samples and chi2.sf with Edwards correction for larger samples, and effect sizes against published formulas for Cohen's d. When my implementation disagreed with the reference, I traced the discrepancy to its source and fixed it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;README in 9 Languages.&lt;/strong&gt; The README is available in 9 languages so developers across different regions can read about the project in their preferred language.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why I Built This
&lt;/h2&gt;

&lt;p&gt;Every time I evaluate a new LLM for one of my projects, I run into the same problem: I want to know if Claude is better than GPT for this specific task, or if Gemini is fast enough for that other use case, or whether DeepSeek is worth the cost savings.&lt;/p&gt;

&lt;p&gt;I'd open three browser tabs. I'd paste prompts. I'd squint at outputs. I'd make a call. Then later I would revisit the decision and realize I didn't remember why I picked what I picked.&lt;/p&gt;

&lt;p&gt;The CLI started as a script for my own evaluation work. Then I added statistical methods because I wanted to know if differences I was seeing were real or noise. Then I added cost tracking because I was burning through API credits. Then I added the test suite because I kept introducing regressions.&lt;/p&gt;

&lt;p&gt;At some point I looked at the project and realized I'd built something other people would find useful, so I open sourced it under Apache 2.0.&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;The full workflow takes about 30 seconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Install:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;cli-modelarium
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Works on Linux, macOS, and Windows. Python 3.11 or newer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Set your provider keys&lt;/strong&gt; (you only need keys for providers you'll actually use):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cli-modelarium configure
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Run your first comparison:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cli-modelarium &lt;span class="s2"&gt;"Explain quantum entanglement."&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--models&lt;/span&gt; claude-haiku-4-5,gpt-4o-mini &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-cost&lt;/span&gt; 0.10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;With statistical rigor:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cli-modelarium &lt;span class="s2"&gt;"Explain quantum entanglement."&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--models&lt;/span&gt; claude-haiku-4-5,gpt-4o-mini,gemini-2.0-flash &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--runs&lt;/span&gt; 10 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--significance&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--hallucination-check&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; results.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-cost&lt;/span&gt; 1.00
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Compare against a local model:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cli-modelarium &lt;span class="s2"&gt;"Explain quantum entanglement."&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--models&lt;/span&gt; claude-haiku-4-5,local:llama-3.1-8b &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-cost&lt;/span&gt; 0.10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The launch version (v0.1.3) is feature-complete for the use cases I had when I built it. Additional language support is on my radar. The architecture is designed for it, so stay tuned.&lt;/p&gt;

&lt;p&gt;If you use it and find something missing, open an issue.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;cli-modelarium
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I built this because I needed it. I open sourced it because if I needed it, other people probably do too.&lt;/p&gt;

&lt;p&gt;If you find it useful, a star on the repo helps surface it to other developers.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/lavellehatcherjr/cli-modelarium" rel="noopener noreferrer"&gt;Cli Modelarium on GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pypi.org/project/cli-modelarium/" rel="noopener noreferrer"&gt;Cli Modelarium on PyPI&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>opensource</category>
      <category>ai</category>
      <category>cli</category>
    </item>
    <item>
      <title>I built an offline Chrome extension that reads webpages aloud with AI voices and zero cloud calls</title>
      <dc:creator>Lavelle Hatcher Jr</dc:creator>
      <pubDate>Sun, 10 May 2026 23:58:58 +0000</pubDate>
      <link>https://dev.to/lavellehatcherjr/i-built-an-offline-chrome-extension-that-reads-webpages-aloud-with-ai-voices-and-zero-cloud-calls-57cp</link>
      <guid>https://dev.to/lavellehatcherjr/i-built-an-offline-chrome-extension-that-reads-webpages-aloud-with-ai-voices-and-zero-cloud-calls-57cp</guid>
      <description>&lt;p&gt;Every text-to-speech Chrome extension I tried had one of two problems. Either it sent my text to a server, or it used the browser's built-in voices that sound like a GPS from 2012.&lt;/p&gt;

&lt;p&gt;I wanted TTS that stays on my machine and doesn't sound terrible. So I built one.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it does
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GlowReadTTS&lt;/strong&gt; is a Chrome extension that reads text aloud using AI voices bundled directly in the extension package. No cloud, no accounts, no API keys, no network calls at all.&lt;/p&gt;

&lt;p&gt;Two ways to use it:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Right-click mode:&lt;/strong&gt; Select text on any webpage, right-click, choose "Read with GlowReadTTS." It reads the text aloud and highlights each sentence on the page as it goes. A floating stop button appears at the top-right of the page so you can halt playback without opening the popup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Popup mode:&lt;/strong&gt; Open the extension popup, paste or type text, hit play.&lt;/p&gt;

&lt;p&gt;15 AI voices (American and British English), speed control from 0.25x to 2x, streaming playback so audio starts quickly.&lt;/p&gt;

&lt;h2&gt;
  
  
  How a 96MB model fits in a Chrome extension
&lt;/h2&gt;

&lt;p&gt;The AI voice model is about 96MB and ships entirely inside the &lt;code&gt;.crx&lt;/code&gt; package. After install, there are no runtime downloads and no network calls. You can turn off wifi and it still works.&lt;/p&gt;

&lt;p&gt;The voices sound significantly better than built-in browser TTS. 15 voices are bundled covering American and British English.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;The whole thing is vanilla JavaScript. No React, no build step, no bundler. Manifest V3.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;background/    → service worker, context menu, message routing
content/       → in-page sentence highlighting, floating stop button
offscreen/     → audio playback + TTS inference
popup/         → extension UI (voice picker, text input, controls)
options/       → settings (speed, voice, performance toggle)
libs/          → bundled AI voice model + inference code
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;offscreen/&lt;/code&gt; part is worth explaining. Chrome extensions can't play audio from a service worker, so an offscreen document handles the TTS inference and pipes audio out. This is a Manifest V3 pattern that trips people up if you haven't seen it before.&lt;/p&gt;

&lt;h2&gt;
  
  
  The performance toggle
&lt;/h2&gt;

&lt;p&gt;Cold-starting a 96MB model takes a few seconds. To avoid that delay on the first right-click read of a session, GlowReadTTS can optionally pre-warm the model whenever you select text on a page. This is on by default.&lt;/p&gt;

&lt;p&gt;If you'd rather keep idle RAM minimal, switch it off in Settings. The first read will be slower, but subsequent reads in the same session stay fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  Privacy
&lt;/h2&gt;

&lt;p&gt;This is the whole point of the project.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zero data collection. No analytics, no telemetry, no tracking.&lt;/li&gt;
&lt;li&gt;Text never leaves the device. 100% local processing.&lt;/li&gt;
&lt;li&gt;No accounts. No sign-up, no API keys.&lt;/li&gt;
&lt;li&gt;The extension doesn't even have permission to make network requests.
If you're reading a article nothing gets sent anywhere.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Status
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;License:&lt;/strong&gt; Apache 2.0&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Source:&lt;/strong&gt; &lt;a href="https://github.com/lavellehatcherjr/GlowReadTTS" rel="noopener noreferrer"&gt;github.com/lavellehatcherjr/GlowReadTTS&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chrome Web Store:&lt;/strong&gt; Submitted, pending review. I'll update this post with the install link once it's approved.
If you find it useful, a ⭐ on the repo helps more than you'd think.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;Additional language support is on my radar. The architecture is designed for it, so stay tuned.&lt;/p&gt;




&lt;p&gt;Questions, feedback, or bugs? Open an issue.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>javascript</category>
      <category>a11y</category>
      <category>ai</category>
    </item>
    <item>
      <title>How I Stopped Getting "Stream Idle Timeout" Errors in Claude Code</title>
      <dc:creator>Lavelle Hatcher Jr</dc:creator>
      <pubDate>Sat, 25 Apr 2026 11:59:59 +0000</pubDate>
      <link>https://dev.to/lavellehatcherjr/how-i-stopped-getting-stream-idle-timeout-errors-in-claude-code-hf9</link>
      <guid>https://dev.to/lavellehatcherjr/how-i-stopped-getting-stream-idle-timeout-errors-in-claude-code-hf9</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzdk5mdun5cga2fvym0wj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzdk5mdun5cga2fvym0wj.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix is five lines in your CLAUDE.md, not a settings change
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Note: This is a personal workaround based on my own experience. Your mileage may vary.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you use Claude Code for anything longer than a short conversation, you have probably seen this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;API Error: Stream idle timeout - partial response received
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It cuts off mid-response. Your work disappears. The retry often fails the same way. There is no recovery button. As of April 2026, it is one of the most reported bugs on the Claude Code GitHub repo, with multiple open issues going back months.&lt;/p&gt;

&lt;p&gt;The issue has been more common since the launch of &lt;strong&gt;Claude Opus 4.7&lt;/strong&gt;. Several GitHub issues filed since mid-April specifically name Opus 4.7 and the 1M context variant as triggers, and recent Claude Code changelogs show stream-handling improvements shipping. The bug also shows up in regular Claude chat sessions during long outputs, but Claude Code is where it hits hardest because of the heavy tool-call chains.&lt;/p&gt;

&lt;p&gt;I hit it repeatedly while using Claude Code for multi-file projects. After losing work three or four times in a row, I started experimenting with prompt-level instructions that prevent the timeout from firing in the first place. The trick is not to fix the timeout. The trick is to never trigger it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why it happens
&lt;/h2&gt;

&lt;p&gt;The timeout fires when Claude Code's streaming connection goes idle for too long during a single response. Long outputs are the trigger. If Claude tries to write a 300-line file in one tool call, or runs a grep that dumps hundreds of lines, or chains multiple heavy tool calls without pausing, the stream stalls and the connection drops.&lt;/p&gt;

&lt;p&gt;The bug is worse in longer sessions. After 20 or more tool calls in a single conversation, the probability of hitting it goes up noticeably.&lt;/p&gt;




&lt;h2&gt;
  
  
  The fix: add these instructions to your CLAUDE.md
&lt;/h2&gt;

&lt;p&gt;Create or open a &lt;code&gt;CLAUDE.md&lt;/code&gt; file in your project root. Add this block:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Stream Timeout Prevention&lt;/span&gt;
&lt;span class="p"&gt;
1.&lt;/span&gt; Do each numbered task ONE AT A TIME. Complete one task fully,
   confirm it worked, then move to the next.
&lt;span class="p"&gt;2.&lt;/span&gt; Never write a file longer than ~150 lines in a single tool call.
   If a file will be longer, write it in multiple append/edit passes.
&lt;span class="p"&gt;3.&lt;/span&gt; Start a fresh session if the conversation gets long (20+ tool calls).
   The error gets worse as the session grows.
&lt;span class="p"&gt;4.&lt;/span&gt; Keep individual grep/search outputs short. Use flags like
   &lt;span class="sb"&gt;`--include`&lt;/span&gt; and &lt;span class="sb"&gt;`-l`&lt;/span&gt; (list files only) to limit output size.
&lt;span class="p"&gt;5.&lt;/span&gt; If you do hit the timeout, retry the same step in a shorter form.
   Don't repeat the entire task from scratch.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is it. Claude Code reads &lt;code&gt;CLAUDE.md&lt;/code&gt; at the start of every session and follows the instructions as constraints. These five rules keep each streaming chunk small enough that the idle timeout never fires.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why this works
&lt;/h2&gt;

&lt;p&gt;Each rule targets a specific trigger:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule 1&lt;/strong&gt; prevents Claude from batching multiple tasks into one giant response. Instead of "create three files, run tests, and fix the errors" in a single output, it does one step, confirms, then moves on. Smaller outputs, no stall.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule 2&lt;/strong&gt; is the most important one. A 300-line file write is the single most common trigger for the timeout. Splitting it into two 150-line passes keeps each chunk under the threshold.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule 3&lt;/strong&gt; addresses session degradation. I have not seen Anthropic document this publicly, but in my experience the timeout becomes almost guaranteed after about 20 tool calls in a single session. Starting fresh resets whatever internal state is accumulating.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule 4&lt;/strong&gt; catches the other common trigger: unbounded search output. A recursive grep that returns 500 lines of matches will stall the stream just as badly as a long file write.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule 5&lt;/strong&gt; saves you from the retry death spiral. When you hit the timeout and retry the exact same prompt, you get the exact same stall. Retrying with a shorter version of the same step usually works on the first try.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I tried that did not work
&lt;/h2&gt;

&lt;p&gt;Before landing on the CLAUDE.md approach, I tried several other things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Increasing &lt;code&gt;CLAUDE_STREAM_IDLE_TIMEOUT_MS&lt;/code&gt;&lt;/strong&gt;: This is a terminal CLI environment variable and does not always resolve the issue.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Switching browsers&lt;/strong&gt;: Same behavior in Chrome, Firefox, and Safari.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Switching models&lt;/strong&gt;: Happens on both Opus and Sonnet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shorter prompts&lt;/strong&gt;: The prompt length is not the issue. The output length is.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The CLAUDE.md approach works because it constrains the output at the source. Claude follows the instructions before it starts generating, so the stream never gets long enough to stall.&lt;/p&gt;




&lt;h2&gt;
  
  
  Worth noting
&lt;/h2&gt;

&lt;p&gt;Recent Claude Code changelogs show stream-handling improvements shipping regularly. This CLAUDE.md workaround is a bridge for the meantime, not a permanent solution. Once the platform-level fix ships, you can remove the block.&lt;/p&gt;

&lt;p&gt;If you are using Claude Code for real work today, adding these five lines saves a lot of frustration while improvements are in progress.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/anthropics/claude-code/issues/49619" rel="noopener noreferrer"&gt;GitHub Issue #49619: Stream idle timeout during long tool-use turns on Opus 4.7&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/anthropics/claude-code/issues/47841" rel="noopener noreferrer"&gt;GitHub Issue #47841: Stream idle timeout on Claude Code Web&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/anthropics/claude-code/issues/47252" rel="noopener noreferrer"&gt;GitHub Issue #47252: Ultraplan repeated stream idle timeout errors&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/anthropics/claude-code/issues/46987" rel="noopener noreferrer"&gt;GitHub Issue #46987: Stream idle timeout - multiple times today&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>claudecode</category>
      <category>ai</category>
      <category>productivity</category>
      <category>devtools</category>
    </item>
    <item>
      <title>Serving Qwen3.6-35B-A3B With vLLM and Building a Coding Agent With Tool Calling</title>
      <dc:creator>Lavelle Hatcher Jr</dc:creator>
      <pubDate>Sun, 19 Apr 2026 05:42:43 +0000</pubDate>
      <link>https://dev.to/lavellehatcherjr/serving-qwen36-35b-a3b-with-vllm-and-building-a-coding-agent-with-tool-calling-2kob</link>
      <guid>https://dev.to/lavellehatcherjr/serving-qwen36-35b-a3b-with-vllm-and-building-a-coding-agent-with-tool-calling-2kob</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuyhcdu0cs99qsbc81uuc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuyhcdu0cs99qsbc81uuc.png" alt=" " width="800" height="419"&gt;&lt;/a&gt;&lt;br&gt;
Alibaba's Qwen team released &lt;strong&gt;Qwen3.6-35B-A3B&lt;/strong&gt; on April 16, 2026 under Apache 2.0. It is a sparse mixture-of-experts model with 35 billion total parameters but only about 3 billion active per token. It scores 73.4% on SWE-bench Verified and 37.0 on MCPMark, which makes it one of the strongest open-weight models for agentic coding right now.&lt;/p&gt;

&lt;p&gt;This post walks through serving it locally with vLLM, calling it from Python with the OpenAI SDK, and wiring up tool calling so the model can act as a coding agent.&lt;/p&gt;

&lt;p&gt;Note: This is a personal summary based on publicly available information, not the official view of any company.&lt;/p&gt;
&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;vLLM 0.19.0 or later (required for Qwen3.6 architecture support)&lt;/li&gt;
&lt;li&gt;NVIDIA GPU (RTX 4090 24GB works for single-GPU, multi-GPU for larger context)&lt;/li&gt;
&lt;li&gt;Python 3.12&lt;/li&gt;
&lt;li&gt;The model downloads automatically from Hugging Face on first launch
## Install vLLM
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; venv qwen36-env
&lt;span class="nb"&gt;source &lt;/span&gt;qwen36-env/bin/activate
pip &lt;span class="nb"&gt;install &lt;/span&gt;vllm&amp;gt;&lt;span class="o"&gt;=&lt;/span&gt;0.19.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Older vLLM versions do not support the Qwen3.6 MoE architecture. If you hit errors about &lt;code&gt;Qwen3MoeSparseMoeBlock&lt;/code&gt;, your vLLM is too old.&lt;/p&gt;
&lt;h2&gt;
  
  
  Start the vLLM Server
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Basic (inference only)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vllm serve Qwen/Qwen3.6-35B-A3B &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--port&lt;/span&gt; 8000 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tensor-parallel-size&lt;/span&gt; 1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-model-len&lt;/span&gt; 32768 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--reasoning-parser&lt;/span&gt; qwen3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The &lt;code&gt;--reasoning-parser qwen3&lt;/code&gt; flag enables thinking mode, where the model generates internal reasoning steps before its final answer. This improves accuracy on coding tasks.&lt;/p&gt;

&lt;p&gt;On a single RTX 4090, keep &lt;code&gt;--max-model-len&lt;/code&gt; at 32768 or 65536. The full 262,144 context will OOM.&lt;/p&gt;
&lt;h3&gt;
  
  
  With tool calling
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vllm serve Qwen/Qwen3.6-35B-A3B &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--port&lt;/span&gt; 8000 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tensor-parallel-size&lt;/span&gt; 1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-model-len&lt;/span&gt; 32768 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--reasoning-parser&lt;/span&gt; qwen3 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--enable-auto-tool-choice&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tool-call-parser&lt;/span&gt; qwen3_coder
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;&lt;code&gt;--tool-call-parser qwen3_coder&lt;/code&gt; is mandatory.&lt;/strong&gt; Without it, the model generates tool call JSON but vLLM will not parse it into structured &lt;code&gt;tool_calls&lt;/code&gt; objects. This is the most common setup mistake and it fails silently.&lt;/p&gt;
&lt;h3&gt;
  
  
  Multi-GPU (example: 4 GPUs)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vllm serve Qwen/Qwen3.6-35B-A3B &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--port&lt;/span&gt; 8000 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tensor-parallel-size&lt;/span&gt; 4 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-model-len&lt;/span&gt; 262144 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--reasoning-parser&lt;/span&gt; qwen3 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--enable-auto-tool-choice&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tool-call-parser&lt;/span&gt; qwen3_coder
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Once running, the OpenAI-compatible API is available at &lt;code&gt;http://localhost:8000/v1&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Basic Chat From Python
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8000/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dummy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# vLLM local does not require a real key
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen/Qwen3.6-35B-A3B&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write a Python generator for the Fibonacci sequence.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2048&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Since vLLM exposes an OpenAI-compatible endpoint, the standard OpenAI SDK works directly. Swapping from &lt;code&gt;gpt-4o&lt;/code&gt; to a local Qwen3.6 is a one-line base_url change.&lt;/p&gt;
&lt;h2&gt;
  
  
  Tool Calling (Function Calling)
&lt;/h2&gt;

&lt;p&gt;This is where it gets interesting. Qwen3.6-35B-A3B was explicitly trained on tool-use patterns, scoring 37.0 on MCPMark compared to 18.1 for Gemma 4-31B.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8000/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dummy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_files&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search for files in the project by keyword&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search keyword&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="p"&gt;},&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_extension&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;File extension filter (e.g. .py, .ts)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;read_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Read the contents of a file at a given path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;File path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;write_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write content to a file at a given path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;File path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="p"&gt;},&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content to write&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen/Qwen3.6-35B-A3B&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a coding agent. Search, read, and modify files to complete the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s request.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Find all Python files that handle database connections and change the pool size from 5 to 20.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tool_choice&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# recommended for thinking mode
&lt;/span&gt;    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;call&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tool: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Args: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;---&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Agent Loop (Multi-Turn Tool Calling)
&lt;/h2&gt;

&lt;p&gt;A real coding agent needs to call a tool, feed the result back to the model, then let it decide the next action. Here is a minimal loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_steps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a coding agent. Use the available tools to complete the task.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_request&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_steps&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen/Qwen3.6-35B-A3B&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;tool_choice&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;assistant_message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;assistant_message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;assistant_message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Done] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;assistant_message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;assistant_message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;call&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;assistant_message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;tool_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
            &lt;span class="n"&gt;tool_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Step &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;(&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tool_args&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;execute_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_call_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Reached maximum steps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Replace with real file system operations.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_files&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;files&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;src/db/connection.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;match&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pool_size=5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;src/db/config.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;match&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;POOL_SIZE = 5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;read_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;# Contents of &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;write_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ok&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unknown tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace &lt;code&gt;execute_tool&lt;/code&gt; with real file system calls and you have a working local coding agent.&lt;/p&gt;

&lt;h2&gt;
  
  
  Thinking Mode
&lt;/h2&gt;

&lt;p&gt;Qwen3.6 has two inference modes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Temperature&lt;/th&gt;
&lt;th&gt;Use case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Thinking (recommended)&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;Complex coding, debugging, design decisions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Non-thinking&lt;/td&gt;
&lt;td&gt;0.7&lt;/td&gt;
&lt;td&gt;Simple completions, quick answers&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;When &lt;code&gt;--reasoning-parser qwen3&lt;/code&gt; is set at server startup, thinking mode is on by default.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hardware Guide
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setup&lt;/th&gt;
&lt;th&gt;max-model-len&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;RTX 4090 (24GB) × 1&lt;/td&gt;
&lt;td&gt;32,768&lt;/td&gt;
&lt;td&gt;Handles most coding tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 4090 × 2 (TP=2)&lt;/td&gt;
&lt;td&gt;65,536&lt;/td&gt;
&lt;td&gt;Enough for repo-wide context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A100 80GB × 1&lt;/td&gt;
&lt;td&gt;131,072&lt;/td&gt;
&lt;td&gt;Comfortable single-GPU setup&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;H100 × 4 (TP=4)&lt;/td&gt;
&lt;td&gt;262,144&lt;/td&gt;
&lt;td&gt;Full context, production use&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The FP8 variant (&lt;code&gt;Qwen/Qwen3.6-35B-A3B-FP8&lt;/code&gt;) uses less VRAM with nearly identical performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;vLLM 0.19.0+ serves Qwen3.6-35B-A3B as an OpenAI-compatible API at &lt;code&gt;localhost:8000/v1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Tool calling requires &lt;code&gt;--enable-auto-tool-choice --tool-call-parser qwen3_coder&lt;/code&gt; at startup — without it, tool calls silently fail&lt;/li&gt;
&lt;li&gt;The standard OpenAI Python SDK works directly, so switching from a cloud API to local inference is a one-line change&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;temperature=1.0&lt;/code&gt; with thinking mode for best coding accuracy&lt;/li&gt;
&lt;li&gt;Apache 2.0 license — free for commercial use
The model is three days old, so tool calling stability is still being validated by the community. Test on your own workloads before shipping to production.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Qwen/Qwen3.6-35B-A3B - Hugging Face&lt;/li&gt;
&lt;li&gt;Qwen3.5 &amp;amp; Qwen3.6 Usage Guide - vLLM Recipes&lt;/li&gt;
&lt;li&gt;QwenLM/Qwen3.6 - GitHub&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Anthropic Releases Claude Opus 4.7: Key Changes and Migration Guide for Developers</title>
      <dc:creator>Lavelle Hatcher Jr</dc:creator>
      <pubDate>Fri, 17 Apr 2026 08:07:59 +0000</pubDate>
      <link>https://dev.to/lavellehatcherjr/anthropic-releases-claude-opus-47-key-changes-and-migration-guide-for-developers-3an4</link>
      <guid>https://dev.to/lavellehatcherjr/anthropic-releases-claude-opus-47-key-changes-and-migration-guide-for-developers-3an4</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzotccb34v1aria8b0vpp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzotccb34v1aria8b0vpp.png" alt=" " width="800" height="419"&gt;&lt;/a&gt;&lt;br&gt;
Here is a developer-focused summary of what changed in Claude Opus 4.7, released on April 16, 2026.&lt;/p&gt;

&lt;p&gt;Note: This article is a personal summary based on publicly available information, not the official view of any company. This article does not constitute financial or investment advice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Opus 4.7 Sits
&lt;/h2&gt;

&lt;p&gt;Claude Opus 4.7 is Anthropic's most capable generally available model. It sits below Claude Mythos Preview on benchmarks, but Mythos Preview remains restricted to a handful of platform partners through Project Glasswing and is not available for general use.&lt;/p&gt;

&lt;p&gt;Pricing is unchanged from Opus 4.6: $5 per million input tokens and $25 per million output tokens. The model ID is &lt;code&gt;claude-opus-4-7&lt;/code&gt;. It is available across all Claude products, the Anthropic API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmark Results
&lt;/h2&gt;

&lt;p&gt;Key numbers from the release and third-party evaluations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SWE-bench Verified: 87.6% (significant improvement over Opus 4.6)&lt;/li&gt;
&lt;li&gt;SWE-bench Pro: 64.3% (Opus 4.6: 53.4%, GPT-5.4: 57.7%)&lt;/li&gt;
&lt;li&gt;CursorBench: 70% (Opus 4.6: 58%)&lt;/li&gt;
&lt;li&gt;MCP-Atlas (multi-tool orchestration): 77.3% (best in class)&lt;/li&gt;
&lt;li&gt;CharXiv visual reasoning: 82.1% (Opus 4.6: 69.1%)&lt;/li&gt;
&lt;li&gt;XBOW visual acuity: 98.5% (Opus 4.6: 54.5%)
Rakuten reported 3x more production tasks resolved compared to Opus 4.6. CodeRabbit noted recall improved by over 10 percent, with the model being slightly faster than GPT-5.4 at xhigh effort.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  New Features
&lt;/h2&gt;

&lt;h3&gt;
  
  
  High-Resolution Image Support
&lt;/h3&gt;

&lt;p&gt;Opus 4.7 is the first Claude model with high-resolution image support. Maximum image resolution increased from 1,568 pixels on the long edge (about 1.15 megapixels) to 2,576 pixels (about 3.75 megapixels), which is roughly 3x the visual capacity of previous Claude models.&lt;/p&gt;

&lt;p&gt;For computer use workflows, pixel coordinates now map 1:1 with actual screen pixels, eliminating the scale-factor math that was previously required. Document analysis benefits from the ability to read smaller text and finer details in scanned documents, slides, and diagrams.&lt;/p&gt;

&lt;h3&gt;
  
  
  xhigh Effort Level
&lt;/h3&gt;

&lt;p&gt;The effort parameter now has five levels: low, medium, high, xhigh, and max. The new xhigh level sits between high and max, providing deeper reasoning than high without the full cost of max.&lt;/p&gt;

&lt;p&gt;Claude Code defaults to xhigh for all plans. Anthropic recommends starting with high or xhigh for coding and agentic use cases.&lt;/p&gt;

&lt;h3&gt;
  
  
  Task Budgets (Public Beta)
&lt;/h3&gt;

&lt;p&gt;Task budgets let developers set a token allowance for an entire agentic loop rather than a single turn. The model sees a running countdown and uses it to prioritize work, skip low-value steps, and finish gracefully as the budget runs out. This is useful for preventing cost runaway in long-running agent sessions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Claude Code &lt;code&gt;/ultrareview&lt;/code&gt; Command
&lt;/h3&gt;

&lt;p&gt;A new dedicated code review command that performs a multi-pass review looking for bugs, edge cases, security issues, and logic errors with more depth than a standard review pass.&lt;/p&gt;

&lt;h2&gt;
  
  
  Breaking API Changes
&lt;/h2&gt;

&lt;p&gt;Three changes that will cause errors if not addressed:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Extended Thinking Budgets Removed
&lt;/h3&gt;

&lt;p&gt;Setting &lt;code&gt;thinking: {"type": "enabled", "budget_tokens": N}&lt;/code&gt; now returns a 400 error. The only supported thinking mode on Opus 4.7 is &lt;code&gt;thinking: {"type": "adaptive"}&lt;/code&gt;. Note that adaptive thinking is off by default; requests with no &lt;code&gt;thinking&lt;/code&gt; field run without thinking. You must set it explicitly to enable it.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Sampling Parameters Removed
&lt;/h3&gt;

&lt;p&gt;Setting &lt;code&gt;temperature&lt;/code&gt;, &lt;code&gt;top_p&lt;/code&gt;, or &lt;code&gt;top_k&lt;/code&gt; to any non-default value returns a 400 error. Use prompting to guide output behavior instead.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Thinking Content Hidden by Default
&lt;/h3&gt;

&lt;p&gt;Thinking blocks still appear in the response stream, but their content is empty unless you opt in with &lt;code&gt;"display": "summarized"&lt;/code&gt;. If your product streams reasoning to users, the new default will appear as a long pause before output begins.&lt;/p&gt;

&lt;h2&gt;
  
  
  Migration Code Example
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before (Opus 4.6)
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;thinking&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enabled&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;budget_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;8192&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;temperature&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;

&lt;span class="c1"&gt;# After (Opus 4.7)
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;thinking&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;adaptive&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="c1"&gt;# Remove temperature entirely — use prompting instead
# Increase max_tokens for headroom (new tokenizer uses more tokens)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Behavior Changes
&lt;/h2&gt;

&lt;p&gt;These are not API breaking changes but may require prompt adjustments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More literal instruction following, particularly at lower effort levels. The model will not silently generalize an instruction from one item to another&lt;/li&gt;
&lt;li&gt;Response length calibrates to perceived task complexity rather than defaulting to a fixed verbosity&lt;/li&gt;
&lt;li&gt;Fewer tool calls by default. Raise effort to increase tool usage&lt;/li&gt;
&lt;li&gt;More direct, opinionated tone with less validation-forward phrasing than Opus 4.6&lt;/li&gt;
&lt;li&gt;More regular progress updates during long agentic traces. If you added scaffolding to force interim status messages, try removing it&lt;/li&gt;
&lt;li&gt;Fewer subagents spawned by default. Steerable through prompting
## Tokenizer Change&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Opus 4.7 uses a new tokenizer that may produce roughly 1.0 to 1.35x as many tokens for the same input, depending on content type. Per-token prices are unchanged, but the same prompt may cost more in practice. Test your workloads before switching production traffic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cybersecurity Safeguards
&lt;/h2&gt;

&lt;p&gt;Opus 4.7 includes automated safeguards that detect and block requests involving prohibited or high-risk cybersecurity uses. Cyber capabilities were deliberately reduced compared to Mythos Preview. Security professionals who want to use the model for legitimate purposes such as vulnerability research and penetration testing can apply through the Cyber Verification Program.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who Should Migrate and When
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Teams running production coding agents&lt;/strong&gt;: The SWE-bench gains are large enough that the upgrade likely pays for itself in reduced human review cycles. Pair with task budgets to control costs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Teams using computer use or image-heavy workflows&lt;/strong&gt;: The 3.75 megapixel vision support alone justifies the switch&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simple Q&amp;amp;A or FAQ bots&lt;/strong&gt;: Haiku 4.5 or Sonnet 4.6 are more cost-effective. No need to move to Opus for these workloads
The safe migration approach is to keep Opus 4.6 as a fallback for one to two weeks while validating Opus 4.7 on your production workloads in parallel.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.anthropic.com/news/claude-opus-4-7" rel="noopener noreferrer"&gt;Introducing Claude Opus 4.7 (Anthropic)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.claude.com/docs/en/about-claude/models/whats-new-claude-4-7" rel="noopener noreferrer"&gt;What's new in Claude Opus 4.7 (Anthropic API Docs)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.cnbc.com/2026/04/16/anthropic-claude-opus-4-7-model-mythos.html" rel="noopener noreferrer"&gt;Anthropic rolls out Claude Opus 4.7 (CNBC)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/aws/introducing-anthropics-claude-opus-4-7-model-in-amazon-bedrock/" rel="noopener noreferrer"&gt;Introducing Anthropic's Claude Opus 4.7 model in Amazon Bedrock (AWS)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>llm</category>
      <category>news</category>
    </item>
    <item>
      <title>Calling Anthropic's Advisor Tool in 50 Lines of Python</title>
      <dc:creator>Lavelle Hatcher Jr</dc:creator>
      <pubDate>Mon, 13 Apr 2026 11:39:09 +0000</pubDate>
      <link>https://dev.to/lavellehatcherjr/calling-anthropics-advisor-tool-in-50-lines-of-python-5fk0</link>
      <guid>https://dev.to/lavellehatcherjr/calling-anthropics-advisor-tool-in-50-lines-of-python-5fk0</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh8374szbwvkmc97frvft.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh8374szbwvkmc97frvft.png" alt=" " width="800" height="336"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;This article reflects my own experience and research. It is not the official view of any company mentioned.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When I first read Anthropic's Advisor Strategy post earlier this week, my first thought was: can a single &lt;code&gt;/v1/messages&lt;/code&gt; call really let one Claude model consult another one mid-generation? I wanted to see the actual wire format and the token accounting before I trusted it in production, so I sat down and wrote the smallest working example I could. That is what this article is.&lt;/p&gt;
&lt;h2&gt;
  
  
  Versions used
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.11&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;anthropic&lt;/code&gt; Python SDK 0.94.0 (released 2026-04-10)&lt;/li&gt;
&lt;li&gt;Claude API, advisor tool in public beta since 2026-04-09&lt;/li&gt;
&lt;li&gt;Beta header: &lt;code&gt;advisor-tool-2026-03-01&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Tool type: &lt;code&gt;advisor_20260301&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you are reading this later, double check the beta header and tool type against the official docs. Beta names change when features move toward GA.&lt;/p&gt;
&lt;h2&gt;
  
  
  What the advisor tool actually does
&lt;/h2&gt;

&lt;p&gt;Most server side tools the executor can call (web search, code execution) perform an action and return data. The advisor tool is different. When the executor invokes it, the server runs a separate sub-inference on a stronger model using the entire transcript so far, then injects the advice back into the executor's stream. No extra round trip on your side.&lt;/p&gt;

&lt;p&gt;The mechanics are slightly unusual. The executor emits a &lt;code&gt;server_tool_use&lt;/code&gt; block with &lt;code&gt;name: "advisor"&lt;/code&gt; and, unusually, an empty &lt;code&gt;input&lt;/code&gt;. The executor only decides the timing. The server constructs the advisor's view automatically from the full transcript (system prompt, tool definitions, prior turns, prior tool results). Then the advisor runs without tools and without its own context management, its thinking blocks are stripped, and only the advice text lands back in the executor's prompt as an &lt;code&gt;advisor_tool_result&lt;/code&gt; block. The executor resumes generating.&lt;/p&gt;

&lt;p&gt;The pairing Anthropic recommends is Sonnet 4.6 (executor) plus Opus 4.6 (advisor). Haiku 4.5 also works as an executor. The only advisor model available today is &lt;code&gt;claude-opus-4-6&lt;/code&gt;, and the advisor must be at least as capable as the executor.&lt;/p&gt;
&lt;h2&gt;
  
  
  The minimal call
&lt;/h2&gt;

&lt;p&gt;Here is the smallest viable request, using &lt;code&gt;client.beta.messages.create&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;betas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;advisor-tool-2026-03-01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;advisor_20260301&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;advisor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Build a concurrent worker pool in Go with graceful shutdown.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four things worth pointing at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;betas=["advisor-tool-2026-03-01"]&lt;/code&gt; turns the feature on. This is the SDK shortcut for the &lt;code&gt;anthropic-beta&lt;/code&gt; header.&lt;/li&gt;
&lt;li&gt;The tool type is &lt;code&gt;advisor_20260301&lt;/code&gt;, and &lt;code&gt;name&lt;/code&gt; must literally be the string &lt;code&gt;advisor&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;model&lt;/code&gt; inside the tool definition is the advisor model. The top level &lt;code&gt;model&lt;/code&gt; is the executor.&lt;/li&gt;
&lt;li&gt;You call &lt;code&gt;client.beta.messages.create&lt;/code&gt;, not &lt;code&gt;client.messages.create&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Reading what came back
&lt;/h2&gt;

&lt;p&gt;When the executor decides to consult the advisor, two new content blocks appear in the response: a &lt;code&gt;server_tool_use&lt;/code&gt; block with an empty &lt;code&gt;input&lt;/code&gt;, followed by an &lt;code&gt;advisor_tool_result&lt;/code&gt; block carrying the advice. This loop walks the content array and pulls each piece out:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EXECUTOR:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;server_tool_use&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;advisor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ADVISOR CALL:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;advisor_tool_result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;advisor_result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ADVISOR SAID:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;advisor_tool_result_error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ADVISOR FAILED:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;error_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the two success variants. &lt;code&gt;advisor_result&lt;/code&gt; carries human readable &lt;code&gt;text&lt;/code&gt;. &lt;code&gt;advisor_redacted_result&lt;/code&gt; carries &lt;code&gt;encrypted_content&lt;/code&gt; that you round trip verbatim on the next turn. Opus 4.6 returns plaintext today, but other advisor models may not. If the sub-inference fails, you get &lt;code&gt;advisor_tool_result_error&lt;/code&gt; with an &lt;code&gt;error_code&lt;/code&gt; such as &lt;code&gt;overloaded&lt;/code&gt;, &lt;code&gt;too_many_requests&lt;/code&gt;, &lt;code&gt;max_uses_exceeded&lt;/code&gt;, &lt;code&gt;prompt_too_long&lt;/code&gt;, or &lt;code&gt;execution_time_exceeded&lt;/code&gt;. The whole request does not fail in that case. The executor keeps going without further advice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Counting the tokens properly
&lt;/h2&gt;

&lt;p&gt;This is the part I wanted to see with my own eyes. Usage is split between executor and advisor, and the top level &lt;code&gt;usage.input_tokens&lt;/code&gt; does not include the advisor's tokens at all. Everything lives in &lt;code&gt;usage.iterations[]&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;usage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Executor output tokens (top level): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;it&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iterations&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;advisor_message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] advisor (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;): &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;in=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;input_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; out=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] executor: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;in=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;input_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; out=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Advisor tokens bill at the advisor model's rate, so rolling them into the executor numbers would give you the wrong cost. The docs spell out the aggregation rules: top level &lt;code&gt;output_tokens&lt;/code&gt; is the sum across executor iterations, and top level &lt;code&gt;input_tokens&lt;/code&gt; reflects the first executor iteration only. For anything resembling billing, loop over &lt;code&gt;iterations&lt;/code&gt; and group by &lt;code&gt;type&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Capping cost with max_uses
&lt;/h2&gt;

&lt;p&gt;The advisor tool ships without a conversation level cap, but it does support a per request &lt;code&gt;max_uses&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;advisor_20260301&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;advisor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_uses&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the executor hits that cap, additional advisor calls return an &lt;code&gt;advisor_tool_result_error&lt;/code&gt; with &lt;code&gt;error_code: "max_uses_exceeded"&lt;/code&gt;. This is per request, so on a multi turn conversation you still need a client side counter if you want a total ceiling. When you decide to stop offering the advisor, the docs are explicit: remove it from &lt;code&gt;tools&lt;/code&gt; AND strip every &lt;code&gt;advisor_tool_result&lt;/code&gt; block from your message history before the next request. Leaving the blocks behind without the tool returns a &lt;code&gt;400 invalid_request_error&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Advisor side caching
&lt;/h2&gt;

&lt;p&gt;For long agent loops where the advisor fires three or more times, you can enable caching on the advisor's own transcript:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;advisor_20260301&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;advisor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;caching&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ephemeral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ttl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5m&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The shape is fixed: &lt;code&gt;type&lt;/code&gt; must be &lt;code&gt;ephemeral&lt;/code&gt; and &lt;code&gt;ttl&lt;/code&gt; is &lt;code&gt;5m&lt;/code&gt; or &lt;code&gt;1h&lt;/code&gt;. Unlike &lt;code&gt;cache_control&lt;/code&gt; on normal content blocks, this is just an on or off switch. The server decides where the cache boundaries go. The documented break even point is about three advisor calls per conversation. Below that, the write cost exceeds the read savings.&lt;/p&gt;

&lt;h2&gt;
  
  
  Things to watch out for
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Streaming pauses.&lt;/strong&gt; The advisor sub-inference does not stream. While it runs, your executor stream sits idle except for standard SSE &lt;code&gt;ping&lt;/code&gt; keepalives roughly every 30 seconds. Short advisor calls may show no pings at all. Your UI needs to handle that silence without timing out.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;max_tokens&lt;/code&gt; bounds the executor only.&lt;/strong&gt; It does not cap advisor output. Budget for an extra 1,400 to 1,800 tokens per advisor call (400 to 700 text plus thinking).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate limits draw from two buckets.&lt;/strong&gt; Executor rate limits fail the whole request with HTTP 429. Advisor rate limits come back as &lt;code&gt;too_many_requests&lt;/code&gt; inside the &lt;code&gt;advisor_tool_result&lt;/code&gt; block, and the request continues.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Invalid pairings return 400.&lt;/strong&gt; The advisor must be at least as capable as the executor. Today that means Opus as advisor for any executor. Haiku as advisor is not supported.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Do not rewrite redacted results.&lt;/strong&gt; If the advisor returns &lt;code&gt;advisor_redacted_result&lt;/code&gt;, pass the opaque &lt;code&gt;encrypted_content&lt;/code&gt; back on the next turn verbatim. The server decrypts it server side. Reading it or substituting text will break the conversation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context editing has sharp edges.&lt;/strong&gt; &lt;code&gt;clear_thinking&lt;/code&gt; with any &lt;code&gt;keep&lt;/code&gt; value other than &lt;code&gt;"all"&lt;/code&gt; shifts the advisor's quoted transcript each turn and kills advisor side caching. If you use extended thinking alongside the advisor, set &lt;code&gt;keep: "all"&lt;/code&gt; explicitly.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Is it worth wiring in?
&lt;/h2&gt;

&lt;p&gt;From a single request cost angle, the advisor is cheaper than Opus solo whenever your task is mostly mechanical output with a few key decisions. It is more expensive than Sonnet solo whenever those decisions are unnecessary. That tradeoff lives in your prompt and your workload, not in the API. I would not blindly turn it on for chat, but for agent loops with dozens of turns it is the right knob to have.&lt;/p&gt;

&lt;p&gt;Anthropic's own guidance in the docs is specific about timing: call the advisor early, after a few exploratory reads are in the transcript but before substantive work begins, and call it again near the end after file writes and test outputs are available. That matches what I see in practice. The advisor adds almost all of its value in the first call, before your approach crystallizes. If you wait until the executor is three quarters of the way through a wrong solution, the advisor will politely tell you so and you will still have to redo the work.&lt;/p&gt;

&lt;p&gt;The part I underestimated before writing this example was the usage accounting. If you have a billing pipeline that reads &lt;code&gt;usage.input_tokens&lt;/code&gt; and &lt;code&gt;usage.output_tokens&lt;/code&gt; directly, it will silently undercount advisor time. Migrate to &lt;code&gt;iterations&lt;/code&gt; before you flip this on in production.&lt;/p&gt;

&lt;p&gt;What would you use a second opinion model for in your own agent loops? I am curious whether people are reaching for this more on planning or on verification.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.claude.com/docs/en/agents-and-tools/tool-use/advisor-tool" rel="noopener noreferrer"&gt;Advisor tool - Claude API docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://claude.com/blog/the-advisor-strategy" rel="noopener noreferrer"&gt;The advisor strategy: Give Sonnet an intelligence boost with Opus - Anthropic&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pypi.org/project/anthropic/" rel="noopener noreferrer"&gt;anthropic PyPI package&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/anthropics/anthropic-sdk-python/releases" rel="noopener noreferrer"&gt;anthropic-sdk-python GitHub releases&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>What I Learned Calling 4 Different LLM APIs From the Same Codebase</title>
      <dc:creator>Lavelle Hatcher Jr</dc:creator>
      <pubDate>Thu, 09 Apr 2026 13:03:06 +0000</pubDate>
      <link>https://dev.to/lavellehatcherjr/what-i-learned-calling-4-different-llm-apis-from-the-same-codebase-42jb</link>
      <guid>https://dev.to/lavellehatcherjr/what-i-learned-calling-4-different-llm-apis-from-the-same-codebase-42jb</guid>
      <description>&lt;p&gt;Most comparison articles give you benchmark scores. This one gives you the practical details benchmarks don't cover: response format differences, streaming implementations, and cost considerations I encountered while building a tool that lets users pick their own LLM provider.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why one codebase, four APIs?
&lt;/h2&gt;

&lt;p&gt;I build browser-based dev tools. One of my projects needed to support multiple LLM providers so users could choose whichever API they prefer. The user picks their provider, enters their own API key, and the tool handles the rest.&lt;/p&gt;

&lt;p&gt;Sounds simple. It turned out to be more nuanced than expected.&lt;/p&gt;

&lt;p&gt;Supporting OpenAI, Google Gemini, Anthropic Claude, and any OpenAI-compatible endpoint (like local Ollama or LM Studio) from a single codebase taught me a lot about the practical differences between these APIs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Response format differences
&lt;/h2&gt;

&lt;p&gt;Every provider returns responses in a slightly different structure.&lt;/p&gt;

&lt;p&gt;OpenAI (Chat Completions API):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"choices"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;}}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;OpenAI also offers a newer Responses API (&lt;code&gt;/v1/responses&lt;/code&gt;) which returns a different format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"output"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"output_text"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;}]}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gemini:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"candidates"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"parts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;}]}}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This seems trivial until you realize your entire downstream pipeline depends on extracting that text reliably. I ended up writing a normalizer function early on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;extractText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;switch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;openai&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;// Chat Completions API format&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;?.[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]?.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;openai-responses&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;// Responses API format&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;output&lt;/span&gt;
        &lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;message&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flatMap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;output_text&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;claude&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;
        &lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;text&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gemini&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;?.[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]?.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;parts&lt;/span&gt;
        &lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;default&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;// OpenAI-compatible fallback&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;?.[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]?.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Write this normalizer first. Before you build anything else. Trust me.&lt;/p&gt;

&lt;h2&gt;
  
  
  Streaming implementations
&lt;/h2&gt;

&lt;p&gt;All four providers support streaming, but each has its own implementation worth understanding.&lt;/p&gt;

&lt;p&gt;OpenAI and compatible endpoints use &lt;code&gt;data: [DONE]&lt;/code&gt; to signal the end of a stream. Claude uses &lt;code&gt;event: message_stop&lt;/code&gt;. Gemini has its own SSE format.&lt;/p&gt;

&lt;p&gt;The chunk structure is different too. OpenAI sends &lt;code&gt;delta.content&lt;/code&gt;. Claude sends &lt;code&gt;delta.text&lt;/code&gt; inside a &lt;code&gt;content_block_delta&lt;/code&gt; event. Gemini sends partial &lt;code&gt;text&lt;/code&gt; inside &lt;code&gt;candidates[0].content.parts&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If you're building a UI that shows streaming text, you'll need a parser for each provider's format.&lt;/p&gt;

&lt;h2&gt;
  
  
  System prompt handling
&lt;/h2&gt;

&lt;p&gt;OpenAI Chat Completions API accepts a &lt;code&gt;system&lt;/code&gt; role in the messages array:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"system"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"You are a helpful assistant."&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;OpenAI's newer Responses API uses a top-level &lt;code&gt;instructions&lt;/code&gt; parameter instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"instructions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"You are a helpful assistant."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"input"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude takes the system prompt as a separate top-level parameter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"system"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"You are a helpful assistant."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"messages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gemini uses &lt;code&gt;system_instruction&lt;/code&gt; as a separate field:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"system_instruction"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"parts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"You are a helpful assistant."&lt;/span&gt;&lt;span class="p"&gt;}]},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"contents"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're abstracting this behind a single interface, you need to intercept the system message and route it to the correct location before sending the request. This ensures the model properly receives your system prompt regardless of provider.&lt;/p&gt;

&lt;h2&gt;
  
  
  Token counting and cost
&lt;/h2&gt;

&lt;p&gt;Each provider has its own pricing structure, so it's worth understanding the differences.&lt;/p&gt;

&lt;p&gt;OpenAI charges separately for input and output tokens. Claude does the same but with different pricing tiers per model. Gemini has a free tier with rate limits, and a paid tier.&lt;/p&gt;

&lt;p&gt;An interesting observation: the same prompt can produce different output lengths depending on the provider. Each model has its own default verbosity level. This means your cost per request varies even when the input is identical.&lt;/p&gt;

&lt;p&gt;If you're letting users bring their own API key, make this transparent. Show estimated token counts before sending the request if possible.&lt;/p&gt;

&lt;h2&gt;
  
  
  Error handling differences
&lt;/h2&gt;

&lt;p&gt;Each API returns errors differently.&lt;/p&gt;

&lt;p&gt;OpenAI returns &lt;code&gt;error.message&lt;/code&gt; with HTTP status codes you'd expect (429 for rate limit, 401 for bad key).&lt;/p&gt;

&lt;p&gt;Claude returns errors in a &lt;code&gt;error.type&lt;/code&gt; and &lt;code&gt;error.message&lt;/code&gt; structure. Rate limits come back as &lt;code&gt;rate_limit_error&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Gemini sometimes returns 200 OK with an error inside the response body, so it's important to check the response content as well as the HTTP status code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Check both status codes and response body&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="c1"&gt;// Gemini can return 200 with an error&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What I'd do differently
&lt;/h2&gt;

&lt;p&gt;If I started over today, here's what I'd do from day one:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Normalize everything immediately.&lt;/strong&gt; Keep provider-specific response formats contained in an adapter layer so your application logic stays clean.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Test with the cheapest model from each provider.&lt;/strong&gt; Save your token budget by using GPT-4o-mini (or the newer GPT-5 series mini models), Claude Haiku, and Gemini Flash during development.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;OpenAI-compatible is your best friend.&lt;/strong&gt; If a provider supports the OpenAI format (and many do, including local tools like Ollama and LM Studio), treat them all as one integration. That covers 80% of providers with one code path.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Stream from the start.&lt;/strong&gt; Adding streaming to a synchronous architecture later requires significant refactoring. Build for streaming on day one even if you don't need it yet.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Log raw responses during development.&lt;/strong&gt; Having the raw API response saved makes debugging much faster when investigating unexpected behavior.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The bigger picture
&lt;/h2&gt;

&lt;p&gt;The LLM API landscape in 2026 has evolved significantly since 2024.&lt;/p&gt;

&lt;p&gt;Most providers now support function calling, structured outputs, and vision. The baseline quality is high enough that the "best" model depends more on your specific use case than on benchmark rankings. OpenAI now offers both the Chat Completions API and the newer Responses API, adding another dimension to consider when integrating.&lt;/p&gt;

&lt;p&gt;At the same time, each provider continues to add features with their own implementations. MCP, tool use, multimodal inputs, and structured outputs all have differences across providers, which makes a good abstraction layer increasingly valuable.&lt;/p&gt;

&lt;p&gt;The key advantage in this environment isn't picking the "best" LLM. It's building clean abstractions that let you switch between providers without rewriting your application.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.openai.com/docs/api-reference" rel="noopener noreferrer"&gt;OpenAI API Reference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developers.openai.com/api/docs/guides/migrate-to-responses" rel="noopener noreferrer"&gt;OpenAI Responses API Migration Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.anthropic.com/en/api" rel="noopener noreferrer"&gt;Anthropic API Reference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ai.google.dev/docs" rel="noopener noreferrer"&gt;Google Gemini API Reference&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>webdev</category>
      <category>javascript</category>
    </item>
    <item>
      <title>I built an AI browser extension and website builder here is what went into each decision</title>
      <dc:creator>Lavelle Hatcher Jr</dc:creator>
      <pubDate>Fri, 03 Apr 2026 23:07:28 +0000</pubDate>
      <link>https://dev.to/lavellehatcherjr/i-built-an-ai-browser-extension-and-website-builder-here-is-what-went-into-each-decision-3ao9</link>
      <guid>https://dev.to/lavellehatcherjr/i-built-an-ai-browser-extension-and-website-builder-here-is-what-went-into-each-decision-3ao9</guid>
      <description>&lt;h3&gt;
  
  
  The problem I kept running into
&lt;/h3&gt;

&lt;p&gt;Every browser extension project starts the same way. You need a manifest, a background script, popup HTML, icons. Before you have written a single line of real logic you have already written a hundred lines of setup.&lt;/p&gt;

&lt;p&gt;AI can handle that. But the harder problem is what comes after. You get a working extension, you want to change one file, and either you regenerate everything or you edit manually with no context. Neither is good.&lt;/p&gt;

&lt;p&gt;I built NuModeX Ext Maker to solve both.&lt;/p&gt;




&lt;h3&gt;
  
  
  What it does
&lt;/h3&gt;

&lt;p&gt;NuModeX Ext Maker generates Manifest V3 browser extensions and static websites from text prompts. Output is code structured for Chrome, Edge, Firefox, Whale, Opera, or Safari.&lt;/p&gt;

&lt;p&gt;Available now on the Chrome Web Store, Firefox Add-ons, Edge Add-ons, and the Whale Store. The interface is available in English, Japanese, Spanish, French, Korean, Chinese, German, Portuguese, and Italian.&lt;/p&gt;




&lt;h3&gt;
  
  
  Import from ZIP
&lt;/h3&gt;

&lt;p&gt;The most requested feature before launch. Load an existing browser extension or website from a ZIP file directly into the tool and edit it with AI. No manual file-by-file importing, no rebuilding from scratch.&lt;/p&gt;

&lt;p&gt;This matters because a lot of people have extensions they already built by hand, with other tools, years ago and want to maintain or extend them without starting over.&lt;/p&gt;




&lt;h3&gt;
  
  
  The post-build editing toolkit
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Edit File&lt;/strong&gt; - select a file from the tree, describe the change, AI rewrites only that file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Improve Extension&lt;/strong&gt; - passes the entire project to the AI. Describe a multi-file change in one prompt and it figures out what to update.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add File&lt;/strong&gt; - create new files in an existing project via plain language.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;View Changes&lt;/strong&gt; - diff viewer showing every file the AI modified, line-by-line or side-by-side, before you accept anything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Undo&lt;/strong&gt; - reverse the last AI edit in one click.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Import Files&lt;/strong&gt; - bring individual existing files into the project for AI editing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Manual inline editor&lt;/strong&gt; - direct code editing when you want full control.&lt;/p&gt;




&lt;h3&gt;
  
  
  Live preview
&lt;/h3&gt;

&lt;p&gt;A sandboxed iframe preview renders the output before you download. Multi-project support with auto-naming from the manifest. Projects auto-save and restore on reopen.&lt;/p&gt;




&lt;h3&gt;
  
  
  The AI setup
&lt;/h3&gt;

&lt;p&gt;Not tied to any specific provider. Cloud AI models via your own API key, on-device AI models where your browser supports them with no API key required, or a custom model on a local or remote server running the /v1/chat/completions API.&lt;/p&gt;

&lt;p&gt;One honest note on on-device models: they handle chat and editing well but cannot build full extensions from scratch. Use a cloud or custom model for initial generation.&lt;/p&gt;




&lt;h3&gt;
  
  
  No accounts, no connections, no setup maze
&lt;/h3&gt;

&lt;p&gt;A lot of AI tools require you to create an account, connect a workspace, authorize third-party integrations, and navigate a settings flow before you can do anything. NuModeX Ext Maker has none of that. Install the extension, paste in your API key from your chosen cloud AI provider, and you are building. Everything runs in your browser. There is no dashboard to log into, no external service to connect, and no platform sitting between you and your AI provider.&lt;/p&gt;




&lt;h3&gt;
  
  
  The licensing decision
&lt;/h3&gt;

&lt;p&gt;NuModeX Ext Maker is licensed under &lt;strong&gt;BSL 1.1&lt;/strong&gt; (Business Source License 1.1). Source is publicly available on GitHub.&lt;/p&gt;

&lt;p&gt;What this means in practice: free for personal use and internal business use. You can copy it, modify it, study it, run a modified version internally. The one restriction is redistribution of NuModeX Ext Maker itself to browser extension marketplaces - that requires written permission from SoraVantia GK.&lt;/p&gt;

&lt;p&gt;Extensions you generate with the tool are entirely yours. No restrictions on what you build or publish.&lt;/p&gt;

&lt;p&gt;I chose BSL 1.1 because the marketplace restriction is the one thing that actually matters for protecting the product. It does not affect the vast majority of use cases.&lt;/p&gt;




&lt;h3&gt;
  
  
  Try it
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Website: &lt;a href="https://numodex.com/numodexextmaker" rel="noopener noreferrer"&gt;https://numodex.com/numodexextmaker&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Chrome Web Store: &lt;a href="https://chromewebstore.google.com/detail/numodex-ext-maker/amkcpiiepjfmcichnkniiabhdcieidpf" rel="noopener noreferrer"&gt;https://chromewebstore.google.com/detail/numodex-ext-maker/amkcpiiepjfmcichnkniiabhdcieidpf&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Firefox Add-ons: &lt;a href="https://addons.mozilla.org/firefox/addon/numodex-ext-maker/" rel="noopener noreferrer"&gt;https://addons.mozilla.org/firefox/addon/numodex-ext-maker/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Edge Add-ons: &lt;a href="https://microsoftedge.microsoft.com/addons/detail/jkdimfdgngcachpaggijnegdmmokdhkc" rel="noopener noreferrer"&gt;https://microsoftedge.microsoft.com/addons/detail/jkdimfdgngcachpaggijnegdmmokdhkc&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Whale Store: &lt;a href="https://store.whale.naver.com/detail/jmbkjagjlhbnagganjfjeknboiagnnmk" rel="noopener noreferrer"&gt;https://store.whale.naver.com/detail/jmbkjagjlhbnagganjfjeknboiagnnmk&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/SoraVantia/NuModeX-Ext-Maker" rel="noopener noreferrer"&gt;https://github.com/SoraVantia/NuModeX-Ext-Maker&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Built by SoraVantia GK, Japan.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;What part of the browser extension workflow causes you the most friction? Curious about what to prioritize next.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>nocode</category>
      <category>extensions</category>
      <category>ai</category>
      <category>buildinpublic</category>
    </item>
  </channel>
</rss>
