<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jangwook Kim</title>
    <description>The latest articles on DEV Community by Jangwook Kim (@jangwook_kim_e31e7291ad98).</description>
    <link>https://dev.to/jangwook_kim_e31e7291ad98</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1909290%2F60a8c15f-b2b5-4189-8578-78b8ab78900b.jpg</url>
      <title>DEV Community: Jangwook Kim</title>
      <link>https://dev.to/jangwook_kim_e31e7291ad98</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jangwook_kim_e31e7291ad98"/>
    <language>en</language>
    <item>
      <title>Why My Local Agent Forgot Its System Prompt — Measuring Ollama num_ctx Silent Truncation</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Sun, 28 Jun 2026 06:50:17 +0000</pubDate>
      <link>https://dev.to/jangwook_kim_e31e7291ad98/why-my-local-agent-forgot-its-system-prompt-measuring-ollama-numctx-silent-truncation-3f94</link>
      <guid>https://dev.to/jangwook_kim_e31e7291ad98/why-my-local-agent-forgot-its-system-prompt-measuring-ollama-numctx-silent-truncation-3f94</guid>
      <description>&lt;p&gt;A few days ago, a meeting-notes summarizer I run locally started misbehaving. Short transcripts were fine, but feed it a long transcript and it ignored the instruction I'd written at the top ("answer in JSON only") and rambled in plain prose. My first guess was that the model was just dumb. What nagged me was that the same model obeyed the instruction perfectly on short inputs. So maybe the model hadn't gotten dumber. Maybe it had never &lt;strong&gt;seen&lt;/strong&gt; my instruction at all.&lt;/p&gt;

&lt;p&gt;In a &lt;a href="https://dev.to/en/blog/en/local-llm-prefill-generation-latency-experiment"&gt;previous post that decomposed prefill and generation cost&lt;/a&gt;, I measured how a longer context delays the first token. That was about speed. What I suspected this time was something else. Past a certain length, maybe it isn't speed that suffers but the &lt;strong&gt;content&lt;/strong&gt; that vanishes. So I measured it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hiding a secret at the front, then growing the length
&lt;/h2&gt;

&lt;p&gt;The method is a small twist on needle-in-haystack. Drop a secret code on the first line of the prompt (the head). Below it, lay down a long stretch of filler text, like a meeting transcript, to inflate the token count. Then at the very end, ask "what was the secret code written at the top?" If the model answers it correctly, it saw the head. If it can't, the head is gone.&lt;/p&gt;

&lt;p&gt;The key is to &lt;strong&gt;keep the same prompt and only change num_ctx&lt;/strong&gt;. If recall breaks while the input is identical, the culprit isn't the model. It's the context window setting.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;

&lt;span class="n"&gt;SECRET&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ALPHA-7723-ZULU&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;HEAD&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;IMPORTANT: a secret code is hidden somewhere in this document. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The secret code is &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;SECRET&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Read the notes below, but answer only the final question.&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;FILLER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The meeting covered the quarterly roadmap, deploy schedule, on-call rotation, and cost cuts. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
          &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Each team shared progress and reordered next sprint&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s priorities.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;Question: what is the secret code written at the top of this document? Answer with the code only.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_filler&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;HEAD&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FILLER&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;n_filler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;Q&lt;/span&gt;
    &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;melavisions/gemma4:latest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;options&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;num_predict&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;num_ctx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;num_ctx&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;
    &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434/api/generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;urlopen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;600&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_eval_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;SECRET&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I used the &lt;code&gt;gemma4:latest&lt;/code&gt; (3.2B, Q4_K_M) quantized build. It's small and fast, and since this experiment tests input preservation rather than model intelligence, a tiny model was plenty. Forty repetitions of the filler put the whole prompt at 3464 tokens. Remember that number.&lt;/p&gt;

&lt;h2&gt;
  
  
  Changing only num_ctx broke recall
&lt;/h2&gt;

&lt;p&gt;I threw the same 3464-token prompt four times, varying only num_ctx across 1024, 2048, 4096, and 8192. The results split cleanly.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;num_ctx&lt;/th&gt;
&lt;th&gt;prompt_eval_count&lt;/th&gt;
&lt;th&gt;secret recall&lt;/th&gt;
&lt;th&gt;model answer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;td&gt;1023&lt;/td&gt;
&lt;td&gt;failed&lt;/td&gt;
&lt;td&gt;"the secret code is None"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2048&lt;/td&gt;
&lt;td&gt;2047&lt;/td&gt;
&lt;td&gt;failed&lt;/td&gt;
&lt;td&gt;"the secret code is ro"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4096&lt;/td&gt;
&lt;td&gt;3464&lt;/td&gt;
&lt;td&gt;succeeded&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ALPHA-7723-ZULU&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8192&lt;/td&gt;
&lt;td&gt;3464&lt;/td&gt;
&lt;td&gt;succeeded&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ALPHA-7723-ZULU&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/..%2F..%2F..%2Fassets%2Fblog%2Follama-num-ctx-silent-truncation-experiment%2Fchart.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/..%2F..%2F..%2Fassets%2Fblog%2Follama-num-ctx-silent-truncation-experiment%2Fchart.png" alt="prompt_eval_count and recall success/failure across num_ctx values" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One thing jumped out. With num_ctx at 1024, &lt;code&gt;prompt_eval_count&lt;/code&gt; was exactly 1023; at 2048, exactly 2047. My prompt was clearly 3464 tokens, yet the number of tokens the model actually read was clipped to fit num_ctx. Ollama had shaved the over-window input down to the window size. And because the shaved-off side was the head, the secret code hidden at the top disappeared entirely.&lt;/p&gt;

&lt;p&gt;No error. No warning. It just said "None," or "ro," with a straight face. Honestly, that's the scariest part. Nothing in the model's answer signals that the input was cut. From num_ctx 4096 up, 3464 fits inside the window, so &lt;code&gt;prompt_eval_count&lt;/code&gt; reads a full 3464 and recall is correct. The threshold sat between 2048 and 4096, right on top of my prompt length of 3464.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the front, not the back
&lt;/h2&gt;

&lt;p&gt;It felt backwards at first. When something "gets cut," you'd expect the tail to go, yet here the head vanishes. The reason is in how autoregressive inference works. When an LLM produces the next token, what it leans on most directly is the recent tokens just before it. So when the window runs short, the runtime keeps the newest tokens (the tail) and drops the oldest (the head). The same logic applies to multi-turn chat through &lt;code&gt;/api/chat&lt;/code&gt;: the &lt;a href="https://docs.ollama.com/faq" rel="noopener noreferrer"&gt;Ollama FAQ&lt;/a&gt; notes that when context overflows, it quietly drops the oldest messages first.&lt;/p&gt;

&lt;p&gt;The trouble is that the things you least want cut all live at the front. System prompt, role instructions, tool definitions, output-format rules. By convention they go right at the top. And the trimming starts precisely there. If an agent that was cruising through a long conversation suddenly loses its persona or breaks its tool-call format, the model didn't get moody. The system prompt may have been pushed out of the window. That was exactly my notes agent.&lt;/p&gt;

&lt;p&gt;There's a practical defense in this. Any instruction you truly cannot afford to lose, put it again near the &lt;strong&gt;end&lt;/strong&gt;, just before the question, instead of only at the front. You're placing it where truncation can't reach. Not elegant, but when you can't control num_ctx it worked surprisingly well.&lt;/p&gt;

&lt;h2&gt;
  
  
  prompt_eval_count snitches on the truncation
&lt;/h2&gt;

&lt;p&gt;The most useful thing I got from this is elsewhere: &lt;strong&gt;truncation leaves a trace in the response.&lt;/strong&gt; The &lt;code&gt;prompt_eval_count&lt;/code&gt; that Ollama returns in the &lt;code&gt;/api/generate&lt;/code&gt; response is the number of input tokens the model actually prefilled. If that value is smaller than your sent prompt's token count and clings to num_ctx, it was almost certainly truncated.&lt;/p&gt;

&lt;p&gt;Why it matters: nobody normally looks at this number. If the answer comes out plausible, you assume the whole input went in. But even if you've &lt;a href="https://dev.to/en/blog/en/ollama-structured-outputs-pydantic-local-llm-guide-2026"&gt;stabilized answers with structured outputs&lt;/a&gt;, if the model only saw half the input, you get an answer that's schema-clean but factually wrong. Schema validation passes while the facts are off, which is the nastiest kind of bug to debug.&lt;/p&gt;

&lt;h2&gt;
  
  
  But the default wasn't 4096
&lt;/h2&gt;

&lt;p&gt;If I'd stopped there, I'd have landed on the usual "Ollama's default num_ctx is 4096, so watch out." But with &lt;strong&gt;no num_ctx set at all&lt;/strong&gt;, the 3464-token prompt recalled fine, and &lt;code&gt;prompt_eval_count&lt;/code&gt; read a full 3464. Under the 4096-default lore, 3464 obviously passes, so that's consistent so far. So I grew the input.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;filler repeats&lt;/th&gt;
&lt;th&gt;prompt_eval_count at default num_ctx&lt;/th&gt;
&lt;th&gt;note&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;70&lt;/td&gt;
&lt;td&gt;5911&lt;/td&gt;
&lt;td&gt;fits whole&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;8431&lt;/td&gt;
&lt;td&gt;fits whole&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;150&lt;/td&gt;
&lt;td&gt;12631&lt;/td&gt;
&lt;td&gt;fits whole&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;td&gt;16383&lt;/td&gt;
&lt;td&gt;clipped at 16384&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;250&lt;/td&gt;
&lt;td&gt;16383 (recall failed, answer "secret")&lt;/td&gt;
&lt;td&gt;head dropped&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If the default were 4096, it should have clipped already at 5911. Instead, 12631 tokens went in fine and it hit the ceiling at 16383 (= 16384 − 1). So on my MacBook, the &lt;strong&gt;default num_ctx that Ollama 0.30.7 chose was not 4096 but 16384&lt;/strong&gt;. Checking the &lt;a href="https://docs.ollama.com/faq" rel="noopener noreferrer"&gt;Ollama FAQ&lt;/a&gt; and community write-ups, recent versions auto-size the default context to available memory. On a 16GB M1, it landed on 16384.&lt;/p&gt;

&lt;p&gt;This isn't trivia. It's a portability problem. An agent that ran great on my 32GB desktop, moved to an 8GB cloud instance, gets a smaller default num_ctx, and the same code silently truncates the same input. You get a hard-to-reproduce incident: fine in local testing, quality collapses after deploy. I land firmly on "don't trust the default, always set it." If a default differs from machine to machine, it's effectively a value you can't trust.&lt;/p&gt;

&lt;h2&gt;
  
  
  So I added one guard to the code
&lt;/h2&gt;

&lt;p&gt;Nothing fancy. Two things. First, I set &lt;code&gt;num_ctx&lt;/code&gt; explicitly on every request. Second, once the response comes back, I check whether &lt;code&gt;prompt_eval_count&lt;/code&gt; got close to num_ctx, to catch a likely truncation in the logs right away.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;guarded_generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_ctx&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8192&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;melavisions/gemma4:latest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;options&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;num_ctx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;num_ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;num_predict&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;
    &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434/api/generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;urlopen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;600&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="n"&gt;used&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_eval_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="c1"&gt;# if it filled over 98% of the window, assume it likely truncated and warn
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;used&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;num_ctx&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.98&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[warn] prompt_eval_count=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;used&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ~ num_ctx=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;num_ctx&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
              &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input may be truncated. Raise num_ctx or shrink the input.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This guard doesn't prevent truncation. It just stops a truncation from slipping past in silence. In my case, this one line answered within five minutes why the summarizer ignored instructions only on long inputs. Input tokens were crossing the default window and the system prompt at the front was getting cut. Raise num_ctx enough and the same input obeyed the instruction again.&lt;/p&gt;

&lt;p&gt;If a post-hoc guard that only fires after the response feels off, you can also count tokens before sending. Ollama has no separate tokenizer endpoint, so I measure length up front by hitting &lt;code&gt;/api/generate&lt;/code&gt; with &lt;code&gt;num_predict: 0&lt;/code&gt;, generating nothing and reading just &lt;code&gt;prompt_eval_count&lt;/code&gt;. One cheap prefill tells me whether the input fits the window. In a RAG pipeline with variable input, that pre-measurement lets you branch: bump num_ctx dynamically, or cut the number of context chunks.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this experiment doesn't tell you
&lt;/h2&gt;

&lt;p&gt;Let me draw the boundaries honestly. First, I measured with one 3.2B quantized model. Truncation itself is runtime-level behavior independent of the model, but "how plausibly the model hallucinates when the head is cut" will differ by model. A larger one might answer "I don't know."&lt;/p&gt;

&lt;p&gt;Second, putting the secret at the very front is a deliberate worst case. In real RAG or agents, the important information isn't always in the head. But the system prompt and tool definitions almost always come first, and their loss is the most damaging, which is why I designed it this way.&lt;/p&gt;

&lt;p&gt;Third, the default num_ctx landing on 16384 is a product of my 16GB M1 plus Ollama 0.30.7. It varies with version, memory, and how many models are loaded at once. So the lesson I take isn't "the default is 16384" but "the default differs by environment, so don't trust it."&lt;/p&gt;

&lt;p&gt;And honestly, one thing I couldn't resolve remains. I ran the same test through the OpenAI-compatible endpoint (&lt;code&gt;/v1/chat/completions&lt;/code&gt;) too, where there's no way to pass &lt;code&gt;options.num_ctx&lt;/code&gt; per request, and &lt;code&gt;usage.prompt_tokens&lt;/code&gt; reported a different number from &lt;code&gt;/api/generate&lt;/code&gt;'s &lt;code&gt;prompt_eval_count&lt;/code&gt; (3464 versus 2384 for the same text). On top of that, on my machine the head survived even on long input and recall worked. I get that the token accounting differs so the two endpoints can't be compared one-to-one, but why the truncation behavior looked different too, I can't cleanly explain. Either way, &lt;a href="https://github.com/ollama/ollama/issues/2714" rel="noopener noreferrer"&gt;a reported issue where num_ctx isn't honored on the OpenAI-compatible API and it silently truncates at 4096&lt;/a&gt; does exist, so if you go through &lt;code&gt;/v1&lt;/code&gt;, remember you're fully at the mercy of the server default (&lt;code&gt;OLLAMA_CONTEXT_LENGTH&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;It's the same thread as &lt;a href="https://dev.to/en/blog/en/local-llm-cold-start-load-duration-experiment"&gt;tracking load_duration during cold starts&lt;/a&gt;. The numbers Ollama quietly tucks into each response, however thinly documented, are the most honest clues to its real behavior. Just as &lt;code&gt;load_duration&lt;/code&gt; snitched on cold starts, &lt;code&gt;prompt_eval_count&lt;/code&gt; snitches on truncation. If you're running local models seriously, give these numbers a look.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Why a local LLM''s first reply sometimes takes 10 seconds — I measured the cold start (load_duration)</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Fri, 26 Jun 2026 06:41:12 +0000</pubDate>
      <link>https://dev.to/jangwook_kim_e31e7291ad98/why-a-local-llms-first-reply-sometimes-takes-10-seconds-i-measured-the-cold-start-2dfl</link>
      <guid>https://dev.to/jangwook_kim_e31e7291ad98/why-a-local-llms-first-reply-sometimes-takes-10-seconds-i-measured-the-cold-start-2dfl</guid>
      <description>&lt;p&gt;I have been running a local agent on my MacBook for a few days. When I step away to do something else and come back to the same agent, the first reply is noticeably sluggish. The second and third are fine; only that first one drags. While writing &lt;a href="https://dev.to/en/blog/en/local-llm-prefill-generation-latency-experiment"&gt;yesterday's post decomposing prefill and generation cost&lt;/a&gt;, I wrote that I warmed the model before measuring "so model load time (load_duration) would not contaminate the numbers." That line nagged at me. The thing I deliberately threw away was exactly the delay I feel most often in daily use.&lt;/p&gt;

&lt;p&gt;So today I measured the cost I excluded yesterday. The time it takes a model to land in memory, the thing we casually call cold start.&lt;/p&gt;

&lt;h2&gt;
  
  
  load_duration: the line item you usually don't see
&lt;/h2&gt;

&lt;p&gt;Ollama's &lt;code&gt;/api/generate&lt;/code&gt; returns a bundle of timestamps on every response. Yesterday I looked at &lt;code&gt;prompt_eval_duration&lt;/code&gt; (prefill) and &lt;code&gt;eval_duration&lt;/code&gt; (generation). There is one more at the front: &lt;code&gt;load_duration&lt;/code&gt;. As the name says, it is the time spent loading the model.&lt;/p&gt;

&lt;p&gt;There is a reason you rarely notice it. Call the same model back to back and from the second call on the model is already resident, so &lt;code&gt;load_duration&lt;/code&gt; reads near zero. Leave it idle for a while and Ollama evicts the model from memory (five minutes by default), and the next call resurrects the load cost. That eviction is exactly what I was feeling when "stepping away and coming back" felt slow.&lt;/p&gt;

&lt;p&gt;I kept the method simple. To isolate load time, the prompt is one line, &lt;code&gt;Reply with the single word: ok&lt;/code&gt;, and &lt;code&gt;num_predict&lt;/code&gt; is capped at 8 so generation collapses toward zero. To force a cold state, I call &lt;code&gt;ollama stop &amp;lt;model&amp;gt;&lt;/code&gt; right before the request. Then the &lt;code&gt;load_duration&lt;/code&gt; of the first call is the cold start.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;gen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keep_alive&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5m&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Reply with the single word: ok&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;keep_alive&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;keep_alive&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;options&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;num_predict&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OLLAMA&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;urlopen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;600&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;load_duration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;1e6&lt;/span&gt;  &lt;span class="c1"&gt;# nanoseconds -&amp;gt; milliseconds
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you want to check it yourself, one curl line does it. Stop the model, call it once, and pull out &lt;code&gt;load_duration&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama stop gemma4:12b-it-qat
curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:11434/api/generate &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
  "model": "gemma4:12b-it-qat", "prompt": "ok", "stream": false
}'&lt;/span&gt; | python3 &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s1"&gt;'import sys,json; print(json.load(sys.stdin)["load_duration"]/1e9, "s")'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The value comes back in nanoseconds, so divide by 1e9 for seconds. Run that line across a few models and you immediately feel how the table below shifts on your own hardware.&lt;/p&gt;

&lt;p&gt;There is one reason to trust this measurement. Ollama returns &lt;code&gt;load_duration&lt;/code&gt; separately from &lt;code&gt;prompt_eval_duration&lt;/code&gt; and &lt;code&gt;eval_duration&lt;/code&gt;, so load time does not bleed into the prefill or generation numbers. The response's &lt;code&gt;total_duration&lt;/code&gt; came out close to the sum of those three, which let me isolate the load cleanly. Yesterday I looked at the middle two; today I focus on just the first.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cold start by model size
&lt;/h2&gt;

&lt;p&gt;I lined up four Gemma 4 models I had pulled, ordered by size. For each I ran &lt;code&gt;ollama stop&lt;/code&gt;, then called it cold three times, plus once warm with the model resident. In seconds:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;On-disk size&lt;/th&gt;
&lt;th&gt;Cold #1&lt;/th&gt;
&lt;th&gt;Cold #3&lt;/th&gt;
&lt;th&gt;Warm&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;melavisions/gemma4&lt;/td&gt;
&lt;td&gt;2.0 GB&lt;/td&gt;
&lt;td&gt;3.33s&lt;/td&gt;
&lt;td&gt;1.55s&lt;/td&gt;
&lt;td&gt;0.20s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;yinw1590/gemma4-e2b&lt;/td&gt;
&lt;td&gt;3.1 GB&lt;/td&gt;
&lt;td&gt;3.57s&lt;/td&gt;
&lt;td&gt;1.79s&lt;/td&gt;
&lt;td&gt;0.38s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gemma4:12b-it-qat&lt;/td&gt;
&lt;td&gt;7.2 GB&lt;/td&gt;
&lt;td&gt;9.00s&lt;/td&gt;
&lt;td&gt;2.82s&lt;/td&gt;
&lt;td&gt;0.37s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gemma4:e4b&lt;/td&gt;
&lt;td&gt;9.6 GB&lt;/td&gt;
&lt;td&gt;9.71s&lt;/td&gt;
&lt;td&gt;3.86s&lt;/td&gt;
&lt;td&gt;0.37s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/..%2F..%2F..%2Fassets%2Fblog%2Flocal-llm-cold-start-load-duration-experiment%2Fhero.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/..%2F..%2F..%2Fassets%2Fblog%2Flocal-llm-cold-start-load-duration-experiment%2Fhero.png" alt="Cold-start load_duration by model size" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The broad pattern is what you would expect: bigger model, longer load. The 9.6GB model's first cold start was 9.7 seconds; the same call warm was 0.37 seconds. A 26x gap. In practice, that means if you leave a 7.2GB local chatbot idle past five minutes and speak to it again, you burn several seconds before a single token appears.&lt;/p&gt;

&lt;p&gt;What jumps out is the warm column. Whether the model is 2GB or 9.6GB, warm &lt;code&gt;load_duration&lt;/code&gt; sat at 0.2 to 0.4 seconds, basically flat. It does not scale with size. The way I read it, this is not actually re-reading weights; it is the keep_alive bookkeeping overhead of Ollama confirming "this model is still up." It is not a real load, so it ignores size. I won't claim to know exactly what work it represents. But for practical purposes, 0.4 seconds warm is "effectively no load cost," and that is the conclusion the measurement supports.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why "cold" #1 and #3 differ by 2x
&lt;/h2&gt;

&lt;p&gt;Look at the table again and something is off. I stopped the model and re-measured every single time, yet Cold #1 is nearly twice as slow as Cold #3. The 7.2GB model went from 9.00 to 2.82 seconds, the 9.6GB one from 9.71 to 3.86. Both are labeled "cold," but the numbers disagree.&lt;/p&gt;

&lt;p&gt;I got stuck here for a while. I first suspected a measurement bug. The answer was the operating system's page cache. &lt;code&gt;ollama stop&lt;/code&gt; only evicts the model from the Ollama process's memory; the OS keeps the model file it already read sitting in RAM as page cache. So Cold #2 and #3 re-read the file from RAM, not disk. Drop the disk I/O entirely and it speeds up.&lt;/p&gt;

&lt;p&gt;This matters because the thing we casually call "cold start" is really two different things.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Truly cold: right after a reboot, or when memory pressure has flushed the cache. Weights are read from disk for the first time. This is Cold #1.&lt;/li&gt;
&lt;li&gt;Cached cold: the model is evicted from Ollama but the file is still in the page cache. This is Cold #3.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you do not separate these when benchmarking, the second measurement onward quietly picks up the cached value, and you reach the rosy conclusion "cold start is faster than I thought." A real production server reboots, and swapping between several models flushes the page cache. So when setting an SLA or a cold-start budget, base it on Cold #1, the post-reboot worst case, not Cold #3. Had I not known this and measured once, I would have written the 7.2GB cold start as "2.8 seconds" when the real worst case was 9.&lt;/p&gt;

&lt;p&gt;The interesting part is that the gap is much smaller on small models. The 2.0GB model's Cold #1 (3.33s) and Cold #3 (1.55s) differ by about 1.8 seconds, while the 9.6GB model's 9.71s and 3.86s differ by almost 6. More bytes to read from disk means the page cache saves you more time. The bigger the model, the steeper the penalty the "first user after reboot" absorbs. If you plan to serve 13B-class or larger locally, treat this cache dependency as a real operational variable.&lt;/p&gt;

&lt;h2&gt;
  
  
  keep_alive splits the bill
&lt;/h2&gt;

&lt;p&gt;The most direct lever against cold start is &lt;code&gt;keep_alive&lt;/code&gt;: how long to hold the model in memory. I put it at two extremes and hit the same 7.2GB model three times each.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Request&lt;/th&gt;
&lt;th&gt;keep_alive="0" (unload each time)&lt;/th&gt;
&lt;th&gt;keep_alive="10m" (stay warm)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Request #1&lt;/td&gt;
&lt;td&gt;7.10s&lt;/td&gt;
&lt;td&gt;2.56s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Request #2&lt;/td&gt;
&lt;td&gt;2.55s&lt;/td&gt;
&lt;td&gt;0.38s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Request #3&lt;/td&gt;
&lt;td&gt;2.55s&lt;/td&gt;
&lt;td&gt;0.38s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/..%2F..%2F..%2Fassets%2Fblog%2Flocal-llm-cold-start-load-duration-experiment%2Fkeepalive.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/..%2F..%2F..%2Fassets%2Fblog%2Flocal-llm-cold-start-load-duration-experiment%2Fkeepalive.png" alt="load_duration by keep_alive setting" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The contrast is sharp. &lt;code&gt;keep_alive="0"&lt;/code&gt; unloads the model immediately after serving a request, so every request is cold. Each one eats 2.5 seconds or more of load up front. Check &lt;code&gt;ollama ps&lt;/code&gt; between requests and the model is not in memory.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;keep_alive="10m"&lt;/code&gt; pays the cold value (2.56s) only on the first request, then drops to 0.38 seconds. It shoves the cold start into request one and serves the rest warm. Request #1 of the keep_alive=0 run spiked to 7.1 seconds because the page cache was also empty at that point, a truly cold start. The effect from the previous section shows up here too.&lt;/p&gt;

&lt;p&gt;On the command line, the &lt;code&gt;OLLAMA_KEEP_ALIVE&lt;/code&gt; environment variable or the API's &lt;code&gt;keep_alive&lt;/code&gt; field controls the same thing. Set it to &lt;code&gt;-1&lt;/code&gt; to keep the model resident indefinitely.&lt;/p&gt;

&lt;h2&gt;
  
  
  So how should I run a local agent?
&lt;/h2&gt;

&lt;p&gt;Measuring turned a few vague operational hunches into something concrete.&lt;/p&gt;

&lt;p&gt;First, for chat or agent use, give &lt;code&gt;keep_alive&lt;/code&gt; plenty of room. If the model is evicted every time a user speaks, every turn is a cold start. Adding 2.5 seconds per turn on a 7.2GB model wrecks the conversation. As long as memory allows, set a long &lt;code&gt;keep_alive&lt;/code&gt; or pin it with &lt;code&gt;-1&lt;/code&gt;. This is a setting you can drop straight onto the deployment from my &lt;a href="https://dev.to/en/blog/en/ollama-fastapi-production-deployment-guide-2026"&gt;Ollama plus FastAPI production serving guide&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Second, warm the model once at startup. Have your boot script fire a dummy prompt to pay the cold start in advance, so the first real user never eats the 9 seconds. You cannot avoid paying the cold cost on the first request, but that first request does not have to be a real user.&lt;/p&gt;

&lt;p&gt;Third, routing across several models is pricier than it looks. Calling a different model per request triggers a reload each time, and if memory is tight they flush each other's page cache down to a true cold (#1 level). If you build a router that rotates four models, compute load cost times switch count up front.&lt;/p&gt;

&lt;p&gt;Fourth, benchmark inference speed only after warming. That is precisely why I warmed before measuring yesterday. Measure once while cold and &lt;code&gt;load_duration&lt;/code&gt; stacks 9 seconds on top of prefill and generation, so you cannot tell whether the model is slow or the load is. The same principle held in my &lt;a href="https://dev.to/en/blog/en/llm-determinism-temperature-seed-experiment"&gt;output reproducibility experiment&lt;/a&gt;. Fix every variable except the one you are measuring.&lt;/p&gt;

&lt;p&gt;Fifth, memory and responsiveness are a trade. A long &lt;code&gt;keep_alive&lt;/code&gt; erases cold starts past the first request, but that model occupies RAM the whole time. Pin a 9.6GB model indefinitely and you shrink what other work can use; load another model and the page cache gets flushed, reviving cold. So I decided which models stay up first, gave a long &lt;code&gt;keep_alive&lt;/code&gt; to the one or two I use most, and kept the rest short. Holding every model warm is a luxury only enough memory affords. Hammering in &lt;code&gt;keep_alive=-1&lt;/code&gt; to drive cold start to zero just returns it as a bigger cold on the next model's load.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limits and what I still don't know
&lt;/h2&gt;

&lt;p&gt;Let me draw the boundary honestly. These numbers come from one MacBook (Apple Silicon, unified memory). A server with a CUDA GPU has an extra step in the load path, copying from disk through system RAM into VRAM, so the absolute values will differ. Don't transplant my numbers onto other hardware. That said, the structural conclusions should survive a hardware change: cold scales with size while warm does not, page cache splits cold into two kinds, and keep_alive decides every cost past the first request.&lt;/p&gt;

&lt;p&gt;Also, I did not dig deep enough into Ollama internals to claim exactly what &lt;code&gt;load_duration&lt;/code&gt; sums up. It may include initialization like graph construction, not just the file read. What I can observe is the number the API returns and how it responds to model size, page cache, and keep_alive. That range is the scope of today's measurement. The 0.37 seconds that shows up even when warm is something I guessed at, not confirmed.&lt;/p&gt;

&lt;p&gt;Finally, page cache behavior depends on how much free RAM you have. On a memory-tight server, even Cold #2 and #3 could get their cache flushed quickly and slow back down toward Cold #1. My measurement leans toward the optimistic case with ample RAM. Next I want to apply artificial memory pressure and see how long the cache holds. Cold start is not a measure-once topic; it is the kind of cost you have to re-measure per environment.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Why a Local LLM Slows Down as the Conversation Grows — I Split Prefill From Generation</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Thu, 25 Jun 2026 06:48:26 +0000</pubDate>
      <link>https://dev.to/jangwook_kim_e31e7291ad98/why-a-local-llm-slows-down-as-the-conversation-grows-i-split-prefill-from-generation-28l4</link>
      <guid>https://dev.to/jangwook_kim_e31e7291ad98/why-a-local-llm-slows-down-as-the-conversation-grows-i-split-prefill-from-generation-28l4</guid>
      <description>&lt;p&gt;When I run a local agent on my MacBook, responses get noticeably sluggish as the conversation drags on. I knew it felt slower, but I had no idea which stage was slowing down or by how much. So I cracked open the timing fields Ollama returns with every response and measured it.&lt;/p&gt;

&lt;p&gt;The punchline first. I sent the same 9,700-token prompt twice. The first call took about 55 seconds to produce its first token. The second took 65 milliseconds. Same input, roughly a 396x difference. That single fact explains almost everything about local LLM latency.&lt;/p&gt;

&lt;h2&gt;
  
  
  The stopwatch Ollama hides in every response
&lt;/h2&gt;

&lt;p&gt;Most people only touch Ollama through a chat UI or &lt;code&gt;ollama run&lt;/code&gt;. But if you call &lt;code&gt;/api/generate&lt;/code&gt; with &lt;code&gt;stream:false&lt;/code&gt;, the response JSON ships with precise timing fields. They are documented in the &lt;a href="https://github.com/ollama/ollama/blob/main/docs/api.md" rel="noopener noreferrer"&gt;Ollama API reference&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;prompt_eval_count&lt;/code&gt;: number of tokens in the input prompt&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;prompt_eval_duration&lt;/code&gt;: time spent processing the prompt (this is prefill)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;eval_count&lt;/code&gt;: number of tokens generated&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;eval_duration&lt;/code&gt;: time spent generating those tokens (this is generation)&lt;/li&gt;
&lt;li&gt;all durations are returned in nanoseconds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The important part is that inference splits into two stages with completely different cost shapes. &lt;strong&gt;Prefill&lt;/strong&gt; reads my entire prompt in one pass and fills the KV cache. Everything up to just before the first token lives here. &lt;strong&gt;Generation&lt;/strong&gt; then emits tokens one at a time, autoregressively. The two behave differently as context grows, and I wanted to see them apart, not blended into one "it's slow" number.&lt;/p&gt;

&lt;p&gt;The measurement script is short and uses only the standard library. It varies the context length, asks the same question, and converts the timing fields into tokens per second.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;

&lt;span class="n"&gt;OLLAMA&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434/api/generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;MODEL&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4:e4b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;options&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;num_predict&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;seed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OLLAMA&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;urlopen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;900&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;pe_n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pe_d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_eval_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_eval_duration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;1e9&lt;/span&gt;
    &lt;span class="n"&gt;ev_n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ev_d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval_duration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;1e9&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ctx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pe_n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prefill_s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pe_d&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prefill_tps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pe_n&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;pe_d&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# prefill throughput
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_tps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ev_n&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;ev_d&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;# generation throughput
&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One trap had to be handled first. If you repeat the same prompt, the cache drives prefill to nearly zero (more on that later). To measure cold prefill honestly, every call has to differ from the very &lt;strong&gt;first byte&lt;/strong&gt;. So I prepended a fresh random ID to each prompt and seeded the filler body differently per run. I picked &lt;code&gt;gemma4:e4b&lt;/code&gt; to keep iterations light, and I warmed the model up once before measuring so model load time (&lt;code&gt;load_duration&lt;/code&gt;) wouldn't leak into the numbers.&lt;/p&gt;

&lt;h2&gt;
  
  
  A longer context means a later first token
&lt;/h2&gt;

&lt;p&gt;Cold prefill first. I grew the context from about 200 tokens to 9,700 and timed how long it took to reach the first output token.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Context (tokens)&lt;/th&gt;
&lt;th&gt;Cold prefill&lt;/th&gt;
&lt;th&gt;Prefill tok/s&lt;/th&gt;
&lt;th&gt;Generation tok/s&lt;/th&gt;
&lt;th&gt;ms per generated token&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;td&gt;1.0s&lt;/td&gt;
&lt;td&gt;197.8&lt;/td&gt;
&lt;td&gt;16.81&lt;/td&gt;
&lt;td&gt;59.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;644&lt;/td&gt;
&lt;td&gt;3.2s&lt;/td&gt;
&lt;td&gt;198.4&lt;/td&gt;
&lt;td&gt;16.61&lt;/td&gt;
&lt;td&gt;60.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1,244&lt;/td&gt;
&lt;td&gt;6.3s&lt;/td&gt;
&lt;td&gt;197.6&lt;/td&gt;
&lt;td&gt;16.38&lt;/td&gt;
&lt;td&gt;61.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2,476&lt;/td&gt;
&lt;td&gt;12.7s&lt;/td&gt;
&lt;td&gt;194.6&lt;/td&gt;
&lt;td&gt;16.40&lt;/td&gt;
&lt;td&gt;61.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4,852&lt;/td&gt;
&lt;td&gt;25.7s&lt;/td&gt;
&lt;td&gt;188.6&lt;/td&gt;
&lt;td&gt;15.87&lt;/td&gt;
&lt;td&gt;63.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9,716&lt;/td&gt;
&lt;td&gt;54.8s&lt;/td&gt;
&lt;td&gt;177.3&lt;/td&gt;
&lt;td&gt;15.43&lt;/td&gt;
&lt;td&gt;64.8&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/..%2F..%2F..%2Fassets%2Fblog%2Flocal-llm-prefill-generation-latency-experiment%2Fresults-chart.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/..%2F..%2F..%2Fassets%2Fblog%2Flocal-llm-prefill-generation-latency-experiment%2Fresults-chart.png" alt="Prefill time and generation rate against context length" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The left chart is nearly a straight line. As context grew 48x (200 to 9,716), prefill grew from 1 second to 55 seconds, about 54x. Roughly proportional. Intuitive enough: more tokens, more to read.&lt;/p&gt;

&lt;p&gt;The interesting bit is the prefill &lt;strong&gt;rate&lt;/strong&gt; in tok/s. At short context it ran at 198 tok/s, but near 10k tokens it fell to 177, about 10% slower. So the cost of processing a single token itself rises with context. As I understand it, attention scales quadratically with sequence length, so reading one word at the end of a long document, while re-attending over everything before it, is heavier than reading a word in a short one. That makes prefill climb a touch steeper than linear.&lt;/p&gt;

&lt;p&gt;Here is the first practical lesson. The usual culprit behind a slow local agent is not generation speed, it is prefill. Cram five RAG documents in, or replay the last 20 turns wholesale, and tens of seconds evaporate before the model writes a single character. On cloud APIs this cost is &lt;a href="https://dev.to/en/blog/en/llm-token-cost-data-format-experiment"&gt;billed as tokens that vary with your data format&lt;/a&gt;; locally it is billed straight to my wall clock.&lt;/p&gt;

&lt;p&gt;If you use a streaming UI, think of this prefill time as exactly how long the user stares at a blank screen or a loading spinner. Once tokens start flowing, they fill in fairly briskly at 16 a second. The problem is everything up to that first character. The longer the context, the longer the user has to sit through a silence that feels like "did it freeze?" What governs the felt responsiveness of a local chatbot is the length of that silence, not the speed tokens stream at. If nothing appears for 55 seconds at 9,700 tokens of context, that is not a tool you can use conversationally.&lt;/p&gt;

&lt;h2&gt;
  
  
  Generation quietly gets slower too
&lt;/h2&gt;

&lt;p&gt;The right chart looks almost flat, which is exactly why I want to flag it. Generation fell from 16.81 tok/s to 15.43 tok/s, about 8%. Per token, that is 59.5ms creeping up to 64.8ms.&lt;/p&gt;

&lt;p&gt;Why? Each new token in the generation stage re-attends over the entire KV cache built so far. A longer context means more to attend over, so the per-token time inches up. Same root cause: attention is sensitive to length.&lt;/p&gt;

&lt;p&gt;Honestly, though, that 8% is a side dish next to prefill. Generating 64 tokens at 10k context took about 4 seconds, but the prefill in front of it was 55 seconds. More than 90% of the felt latency happens before the first token. That is why I see "make the prompt short, and make it cacheable" as a far bigger lever than "swap in a faster model for generation."&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the second call was 396x faster
&lt;/h2&gt;

&lt;p&gt;The cache was the most striking part of this experiment. I sent the same 4,859-token prompt twice in a row.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Call&lt;/th&gt;
&lt;th&gt;prompt_eval_count&lt;/th&gt;
&lt;th&gt;Prefill time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;First (cold)&lt;/td&gt;
&lt;td&gt;4,859&lt;/td&gt;
&lt;td&gt;25,751ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Second (warm)&lt;/td&gt;
&lt;td&gt;4,859&lt;/td&gt;
&lt;td&gt;65ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;prompt_eval_count&lt;/code&gt; reported 4,859 both times. The token count was unchanged, yet prefill time dropped about 396x. The model did not skip reading the tokens; it reused a KV cache it had already computed, so there was nothing to recompute.&lt;/p&gt;

&lt;p&gt;This is the &lt;a href="https://github.com/ggml-org/llama.cpp/discussions/13606" rel="noopener noreferrer"&gt;prefix KV cache in llama.cpp&lt;/a&gt;. Ollama runs on top of llama.cpp, so it inherits the behavior. Two requests that share a leading prefix produce a bit-identical KV cache for that span, so the second request skips the shared prefix and processes only from the point where they diverge. With an identical prompt there is no divergence, so prefill effectively vanishes.&lt;/p&gt;

&lt;p&gt;One thing that confused me, worth noting. Even on a cache hit, &lt;code&gt;prompt_eval_count&lt;/code&gt; still returns the full 4,859. At first I stared at that number and assumed caching wasn't working. The field to watch is not the count but &lt;code&gt;prompt_eval_duration&lt;/code&gt;. To confirm the cache is doing its job, look at prefill time, not token count: send the same prompt twice, and if the second prefill drops to a few milliseconds, you got a hit. Miss this and you can misdiagnose "caching is broken" in an environment where it works fine.&lt;/p&gt;

&lt;p&gt;Now the random ID I prepended to measure cold prefill makes sense. The cache only reuses the span that is &lt;strong&gt;common from the very front&lt;/strong&gt;. Change the first byte and everything after it must be recomputed. Put a per-request changing value at the head of your prompt and you break the whole cache.&lt;/p&gt;

&lt;h2&gt;
  
  
  So how should I build the agent?
&lt;/h2&gt;

&lt;p&gt;After this measurement I changed how I assemble prompts for my local agents. The shortlist:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Stable content up front, volatile content at the back.&lt;/strong&gt; System prompt, tool definitions, fixed instructions, anything identical every turn, goes at the very front. The user's new question or a fresh search result, anything that changes, goes after. That alone lets the leading prefill ride the cache from the second turn on, nearly for free.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. No timestamps or random IDs at the head of the prompt.&lt;/strong&gt; Pin a line like "Current time: 2026-06-25 15:23:06" to the top of your system prompt and the first line differs every request, busting the cache each time. If you must include it, push it to the end. This one detail can save tens of seconds of prefill.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. "Fits in the context window" is not "usable."&lt;/strong&gt; A model can support 32k, but on my laptop even 10k tokens meant 55 seconds to the first token. For interactive use, size your context budget by measured prefill time, not by the supported limit. When I want conversational responsiveness locally, I keep context within a few thousand tokens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. If you truly need long context, pay prefill once and reuse it.&lt;/strong&gt; For RAG where you ask several questions about the same document, keep the document at a fixed position up front and vary only the question at the back. The first question pays the full prefill, but the rest nearly skip it thanks to the cache. The same ordering helps when you &lt;a href="https://dev.to/en/blog/en/local-llm-private-mcp-server-gemma4-fastmcp"&gt;wire a local model to an MCP server to build an agent&lt;/a&gt;, so you do not re-prefill the system prompt on every tool-call round trip.&lt;/p&gt;

&lt;h2&gt;
  
  
  The prompt I actually changed
&lt;/h2&gt;

&lt;p&gt;Abstract rules don't land well, so here is how I reordered the prompt for my local tool-calling agent. Before, the order looked like this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Current time: 2026-06-25 15:23:06    &amp;lt;- changes every request (cache breaker)
Session ID: 9f3a-...                  &amp;lt;- changes every request
[system prompt, 800 tokens]
[tool definitions, 1,200 tokens]
[recent conversation history]
[the user's new question]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The problem is the first two lines. With the time and session ID at the very front, the first byte of the prompt differs on every request. The 800-token system prompt and 1,200-token tool definitions behind them never change a character turn to turn, yet because the cache broke up front, prefill had to run from scratch every time. By the table above, a 2,000-token prefill is about 10 seconds. Ten seconds a turn, thrown away purely on a layout mistake.&lt;/p&gt;

&lt;p&gt;After, I did this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[system prompt, 800 tokens]          &amp;lt;- fixed, at the front
[tool definitions, 1,200 tokens]     &amp;lt;- fixed
[recent conversation history]        &amp;lt;- mostly fixed (only appended to)
Current time: 2026-06-25 15:23:06    &amp;lt;- changing values pushed to the end
Session ID: 9f3a-...
[the user's new question]            &amp;lt;- the part that changes every time, last
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the fixed block moved up front, from the second turn on the 2,000 tokens of system prompt and tool definitions rode the cache wholesale. The conversation history only appends new messages at the tail, so the common prefix stays long. The result: the only thing that needs re-prefilling each turn is the few hundred newly added tokens. Same model, same hardware, but the per-turn latency dropped visibly. Not one line of code got faster. I just changed the order in which I concatenate strings.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limits of this measurement
&lt;/h2&gt;

&lt;p&gt;Let me draw the boundary honestly. These numbers come from one MacBook, one model (&lt;code&gt;gemma4:e4b&lt;/code&gt;), and one runtime (Ollama). The absolute figures (55 seconds, 16 tok/s) change wholesale with GPU, memory, quantization, and runtime. A bigger model, or &lt;a href="https://dev.to/en/blog/en/ollama-structured-outputs-pydantic-local-llm-guide-2026"&gt;structured outputs that return typed objects&lt;/a&gt;, would add their own variables.&lt;/p&gt;

&lt;p&gt;What I trust is the &lt;strong&gt;shape&lt;/strong&gt;, not the absolutes. Prefill grows nearly in proportion to context, generation slows a little, and an identical prefix becomes nearly free through the cache. I expect all three trends to point the same direction in any environment. I have not yet measured how the cache contends under concurrent requests, or how quantization level affects prefill speed. I'm leaving those for the next experiment.&lt;/p&gt;

&lt;p&gt;Stop lumping local LLMs into "fast" or "slow." Split them into prefill-to-first-token and per-token generation, and where to fix things becomes obvious. And most of the fixes, it turned out, were not about changing the model. They were about laying out the prompt so it stays cached.&lt;/p&gt;

&lt;p&gt;The single most useful line I took from this experiment: when building a local agent, the first thing to check is not the GPU or the model size, it is whether the front of my prompt stays identical every turn. Pin the front and prefill nearly vanishes from the second turn on, on the same hardware. It is free acceleration that costs not one dollar and not one extra GPU. Until I measured it, I had filed this away as a vague "feeling" and let it slide.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Same Article, 1.4x the Tokens in Korean: Measuring the Non-English Token Tax Across 285 of My Posts</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Tue, 23 Jun 2026 06:42:22 +0000</pubDate>
      <link>https://dev.to/jangwook_kim_e31e7291ad98/same-article-14x-the-tokens-in-korean-measuring-the-non-english-token-tax-across-285-of-my-posts-2npn</link>
      <guid>https://dev.to/jangwook_kim_e31e7291ad98/same-article-14x-the-tokens-in-korean-measuring-the-non-english-token-tax-across-285-of-my-posts-2npn</guid>
      <description>&lt;p&gt;I fed one English sentence in. "I refactored the agent loop to cut token usage." Twelve tokens under OpenAI's o200k_base. Then the same meaning in Korean: "토큰 사용량을 줄이려고 에이전트 루프를 다시 짰다." Twenty tokens. Less than half the characters, but 1.7x the tokens.&lt;/p&gt;

&lt;p&gt;I wanted to know whether that was a fluke of one sentence or a structural cost baked into my entire blog. I happened to have the perfect test bed. This blog ships every article in four versions: Korean, Japanese, English, Chinese. That means 285 pairs of semantically identical documents across four languages. A dataset where I can cleanly measure how many more tokens the same content costs when only the language changes, with no translation-quality argument in the way.&lt;/p&gt;

&lt;p&gt;So I tokenized all 285 articles times four languages with three real tokenizers. Up front: the non-English token tax is real, it's bigger than I expected, and switching models changes the rate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I had to measure this myself
&lt;/h2&gt;

&lt;p&gt;"Korean costs more tokens than English" is folklore in the community. But ask "how much more" and you get scattered answers. Some say 2x, some say 1.2x. Of course they differ. The text measured was different, the tokenizer was different, and whether code blocks were mixed in was different too.&lt;/p&gt;

&lt;p&gt;What I needed wasn't a floating number but the cost that lands when &lt;em&gt;my blog&lt;/em&gt; runs through &lt;em&gt;the models I actually use&lt;/em&gt;. Translation, summarization, embedding, RAG context injection: nearly every step of my automation pipeline is metered in tokens. If Korean costs 1.4x English, that's a number printed straight onto my monthly bill.&lt;/p&gt;

&lt;p&gt;I picked three tokenizers as targets. The modern OpenAI line, o200k_base (GPT-4o / GPT-5 generation); the older cl100k_base (GPT-4 / 3.5 generation); and the Claude tokenizer. All three are BPE family, but they learned vocabulary from different data, so they handle non-English differently. The models I reach for daily fall roughly into these three buckets.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I measured it (frontmatter stripped, full body in)
&lt;/h2&gt;

&lt;p&gt;No fancy tooling. tiktoken runs offline out of the box, and I loaded the Claude tokenizer from Hugging Face's &lt;code&gt;Xenova/claude-tokenizer&lt;/code&gt;. For each Markdown file I cut only the YAML frontmatter and tokenized the entire body (code blocks included). What actually goes into an LLM is the whole body, so I deliberately left it unclean.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tiktoken&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;

&lt;span class="n"&gt;o200k&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tiktoken&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_encoding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;o200k_base&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# GPT-4o / GPT-5 gen
&lt;/span&gt;&lt;span class="n"&gt;cl100k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tiktoken&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_encoding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cl100k_base&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# GPT-4 / 3.5 gen
&lt;/span&gt;&lt;span class="n"&gt;claude&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Xenova/claude-tokenizer&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;strip_frontmatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;---&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;---&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# only slugs present in all four languages (285 sets)
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;lang&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ko&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ja&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;en&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;zh&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;strip_frontmatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lang&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;slug&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;o200k_tokens&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o200k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;cl100k_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cl100k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;claude_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;claude&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I only counted the 285 articles that have the same filename in all four languages. That removes any sampling imbalance from articles that exist in English but not Korean. It's a strict 1:1:1:1 comparison of the same article's four versions.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the numbers say
&lt;/h2&gt;

&lt;p&gt;Totals across all 285 articles. Unit is tokens.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;o200k (modern)&lt;/th&gt;
&lt;th&gt;cl100k (older)&lt;/th&gt;
&lt;th&gt;Claude&lt;/th&gt;
&lt;th&gt;Characters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;English (en)&lt;/td&gt;
&lt;td&gt;908,938&lt;/td&gt;
&lt;td&gt;915,128&lt;/td&gt;
&lt;td&gt;1,003,948&lt;/td&gt;
&lt;td&gt;3,859,685&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chinese (zh)&lt;/td&gt;
&lt;td&gt;1,045,977&lt;/td&gt;
&lt;td&gt;1,267,007&lt;/td&gt;
&lt;td&gt;1,340,943&lt;/td&gt;
&lt;td&gt;2,493,687&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Japanese (ja)&lt;/td&gt;
&lt;td&gt;1,217,284&lt;/td&gt;
&lt;td&gt;1,502,403&lt;/td&gt;
&lt;td&gt;1,579,075&lt;/td&gt;
&lt;td&gt;2,584,255&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Korean (ko)&lt;/td&gt;
&lt;td&gt;1,256,718&lt;/td&gt;
&lt;td&gt;1,668,007&lt;/td&gt;
&lt;td&gt;1,882,800&lt;/td&gt;
&lt;td&gt;3,076,489&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Set English to 1.0 and the tax snaps into focus.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;o200k ratio&lt;/th&gt;
&lt;th&gt;cl100k ratio&lt;/th&gt;
&lt;th&gt;Claude ratio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;English&lt;/td&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chinese&lt;/td&gt;
&lt;td&gt;1.15&lt;/td&gt;
&lt;td&gt;1.39&lt;/td&gt;
&lt;td&gt;1.34&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Japanese&lt;/td&gt;
&lt;td&gt;1.34&lt;/td&gt;
&lt;td&gt;1.64&lt;/td&gt;
&lt;td&gt;1.57&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Korean&lt;/td&gt;
&lt;td&gt;1.38&lt;/td&gt;
&lt;td&gt;1.82&lt;/td&gt;
&lt;td&gt;1.88&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Even on the modern tokenizer (o200k), Korean is 1.38x English and Japanese 1.34x. On the older one (cl100k), Korean stretches to 1.82x. On the Claude tokenizer, Korean is the most expensive of the three at 1.88x.&lt;/p&gt;

&lt;p&gt;The part I found striking is that character count and token count pull apart. The English versions are the longest by characters, at 3.86M. Korean is shorter at 3.08M. Yet Korean uses 38% more tokens. Fewer characters, more tokens. By tokens-per-character: English 0.235, Korean 0.408, Chinese 0.419, Japanese 0.471. English BPE crams common words into a single token each, but Hangul, kana, and Han characters weren't learned that way, so they shatter into more tokens.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/..%2F..%2F..%2Fassets%2Fblog%2Fmultilingual-llm-token-tax-experiment%2Fhero.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/..%2F..%2F..%2Fassets%2Fblog%2Fmultilingual-llm-token-tax-experiment%2Fhero.png" alt="Token ratios across 285 articles x 4 languages, the non-English token tax" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Korean eats more tokens: I cut "에이전트" apart
&lt;/h2&gt;

&lt;p&gt;I wanted to see the cause with my own eyes, so I fed one word to the tokenizer and decoded how it splits.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;에이전트&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;   &lt;span class="c1"&gt;# = agent
&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;o200k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;o200k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="c1"&gt;# ['에', '이', '전', '트']   -&amp;gt; 4 tokens
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;English "agent" is one token. Korean "에이전트" splits into four tokens, one per syllable block. cl100k gave the same result. In English a word is roughly a token; in Korean a character (syllable block) is closer to a token. That 4:1 gap on a single word, accumulated across a whole article, becomes 1.4x.&lt;/p&gt;

&lt;p&gt;One short sentence across all three tokenizers makes it even clearer. Every version means "I refactored the agent loop to cut token usage."&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;Chars&lt;/th&gt;
&lt;th&gt;o200k&lt;/th&gt;
&lt;th&gt;cl100k&lt;/th&gt;
&lt;th&gt;Claude&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;English&lt;/td&gt;
&lt;td&gt;47&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chinese&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Japanese&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;28&lt;/td&gt;
&lt;td&gt;27&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Korean&lt;/td&gt;
&lt;td&gt;28&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;31&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;English is 12 across all three. Once you go non-English, the tokenizers diverge. The same Korean sentence is 20 on o200k and 32 on Claude. A 1.6x swing decided by one model choice.&lt;/p&gt;

&lt;p&gt;Japanese has its own texture. By tokens-per-character it's the highest of the three at 0.471. Han characters, hiragana, and katakana mix in a single sentence, and katakana loanwords ("エージェント") split syllable by syllable. Yet by full-document total, Korean uses more tokens than Japanese. Japanese packs heavy meaning into single Han characters so documents come out shorter, while Korean documents are simply longer, so even with better per-character efficiency than Japanese it overtakes on total volume. Efficiency and total don't point the same way.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenAI's tokenizer jump was really a non-English discount
&lt;/h2&gt;

&lt;p&gt;This was the most unexpected find of the experiment. I computed how much each language's token count dropped when the tokenizer moved from cl100k to o200k.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;English: down 0.7%&lt;/li&gt;
&lt;li&gt;Chinese: down 17.4%&lt;/li&gt;
&lt;li&gt;Japanese: down 19.0%&lt;/li&gt;
&lt;li&gt;Korean: down 24.7%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For English users, swapping tokenizers barely moves the count. 0.7% is noise. But Korean got a quarter cheaper on the same articles. What OpenAI actually did by growing the vocabulary for o200k was quietly hand non-English a discount. English was already near-optimal, so there was nothing left to squeeze.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/..%2F..%2F..%2Fassets%2Fblog%2Fmultilingual-llm-token-tax-experiment%2Ftokenizer-discount.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/..%2F..%2F..%2Fassets%2Fblog%2Fmultilingual-llm-token-tax-experiment%2Ftokenizer-discount.png" alt="Token reduction from cl100k to o200k, a discount concentrated on non-English" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This feeds a real decision. For a workload that runs a lot of CJK text, when you pick a model don't just read benchmark scores; check the tokenizer generation too. Same price tag, but on non-English text the actual bill is decided by the tokenizer. My earlier &lt;a href="https://dev.to/en/blog/en/llm-token-cost-data-format-experiment"&gt;measurement of how data formats move token cost&lt;/a&gt; taught the same lesson. A model's price tag is only the start; the real cost is set by how the input turns into tokens.&lt;/p&gt;

&lt;h2&gt;
  
  
  The tax varies per article, so don't trust the average
&lt;/h2&gt;

&lt;p&gt;The totals are tidy, but split by article the variance is large. For each of the 285 articles I computed the Korean-to-English token ratio and took median and mean separately (o200k).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;Per-article median&lt;/th&gt;
&lt;th&gt;Per-article mean&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Chinese&lt;/td&gt;
&lt;td&gt;1.10&lt;/td&gt;
&lt;td&gt;1.14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Japanese&lt;/td&gt;
&lt;td&gt;1.41&lt;/td&gt;
&lt;td&gt;1.40&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Korean&lt;/td&gt;
&lt;td&gt;1.31&lt;/td&gt;
&lt;td&gt;1.40&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Korean has a median of 1.31 but a mean of 1.40. Mean above median means a handful of token-heavy articles drag the average up. Looking closer, those were articles dense in Korean prose with little English code or proper nouns. Tutorial articles full of code blocks sat near 1.1, because code is English anyway and tokenizes almost identically across all four versions.&lt;/p&gt;

&lt;p&gt;A practical lesson: estimate one article's cost from the corpus-average ratio and you'll miss badly on prose-heavy, code-light pieces. For a single job where cost matters, tokenize that specific text.&lt;/p&gt;

&lt;h2&gt;
  
  
  So what shows up on my bill
&lt;/h2&gt;

&lt;p&gt;This blog starts a new post from an English draft, renders it into three other languages, and embeds the whole corpus for related-post recommendations and search. Almost every step is token-metered.&lt;/p&gt;

&lt;p&gt;Across the corpus, the 285 English versions are about 0.91M tokens on o200k. All four languages together are about 4.43M tokens. Versus running English-only, multilingual isn't simply 4x; the non-English tax stacks on top. Look at just the three translations (ko+ja+zh) and it's about 3.52M tokens, where estimating from three English copies (2.73M) would have undercounted by about 29%.&lt;/p&gt;

&lt;p&gt;Here's the mistake I kept making: eyeballing Korean cost from English token counts. That's off by 28% even on a modern model, and nearly half on an older one. Quote in English, get billed in Korean, and the gap compounds every month.&lt;/p&gt;

&lt;p&gt;Concretely: picture translating one new post from an English draft into Korean. The input is the English source (about 3,200 tokens) and the output is the Korean translation. The Korean output carries 1.38x the tokens for the same content. Output tokens usually cost more than input, so the non-English tax lands on the most expensive side. Push that into Japanese and Chinese and the tax applies three separate times. In my one-article-four-languages structure, non-English output tokens are a much bigger chunk of total publishing cost than English's share. When I &lt;a href="https://dev.to/en/blog/en/adding-chinese-support"&gt;added Chinese&lt;/a&gt; and the article count quadrupled, the reason cost jumped by more than exactly 4x lives right here.&lt;/p&gt;

&lt;p&gt;The fixes I'm making to shrink the tax. My recommendation pipeline resends the same context repeatedly, so I &lt;a href="https://dev.to/en/blog/en/claude-api-prompt-caching-cost-optimization-guide"&gt;turned on prompt caching&lt;/a&gt; to keep the non-English tax from re-applying on every call. RAG chunks get cut by real token count, not character count. Invert the tokens-per-character figure and you get chunk size: with a 512-token embedding context, English fits about 2,170 characters but Korean hits the same limit at about 1,250. Cut Korean at the same "1,000 characters" as English and the chunk holds 1.7x the tokens, silently overrunning the context window or getting truncated. That's exactly why I revisited chunk sizing in my &lt;a href="https://dev.to/en/blog/en/sentence-transformers-korean-rag-embedding-guide-2026"&gt;Korean RAG embedding writeup&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this measurement doesn't reach
&lt;/h2&gt;

&lt;p&gt;Honest limits. First, token count is only one axis of cost. The same token has a different price per model, and input and output prices differ too. This post measured "how many tokens" only; "so how many dollars" you have to plug into your own model and plan.&lt;/p&gt;

&lt;p&gt;Second, more tokens isn't automatically a loss. CJK fits the same information into fewer characters. The proof is that the Korean versions have fewer characters than English yet carry the same meaning. Token efficiency and information density are separate stories.&lt;/p&gt;

&lt;p&gt;Third, my corpus is a tech blog, so the bodies are heavy with English code, proper nouns, and technical terms. Pure everyday Korean prose could show a bigger multiplier. So this 1.38x is "my environment, technical-document baseline," not a universal constant. If you want to dig deeper into tokenizer behavior, my &lt;a href="https://dev.to/en/blog/en/llama-cpp-iq-quantization-merge"&gt;post on BPE quantization and merging&lt;/a&gt; is an adjacent starting point.&lt;/p&gt;

&lt;p&gt;Fourth, models keep changing. If the next tokenizer generation grows its CJK vocabulary, this gap narrows again. So the real conclusion here isn't "Korean is 1.38x." It's "don't estimate; run your own text through your own model's tokenizer." The code is all above. Wiring it to your own corpus takes about ten minutes.&lt;/p&gt;

&lt;p&gt;Why this token-level sense of measuring and reproducing output matters runs through my &lt;a href="https://dev.to/en/blog/en/llm-determinism-temperature-seed-experiment"&gt;experiment on output reproducibility with temperature and seed&lt;/a&gt; in the same spirit. Working with an LLM is, in the end, counting tokens. Knowing the counting unit differs by language before you start, versus finding out later, is the difference your monthly bill remembers.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Does a Local LLM Ever Repeat Itself? Measuring Output Reproducibility with Temperature and Seed</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Mon, 22 Jun 2026 06:57:17 +0000</pubDate>
      <link>https://dev.to/jangwook_kim_e31e7291ad98/does-a-local-llm-ever-repeat-itself-measuring-output-reproducibility-with-temperature-and-seed-1oh0</link>
      <guid>https://dev.to/jangwook_kim_e31e7291ad98/does-a-local-llm-ever-repeat-itself-measuring-output-reproducibility-with-temperature-and-seed-1oh0</guid>
      <description>&lt;p&gt;I ran my evaluation script twice and got two different scores. Same code, same prompt, same model. Nothing changed, yet one previously passing case failed.&lt;/p&gt;

&lt;p&gt;My first guess was that I had touched something. But a re-run passed again. That moved my suspicion to the model itself. An LLM is not a function that maps the same input to the same answer. I knew this intellectually, but once my eval pipeline started wobbling, the question got concrete fast: what exactly do I need to pin down to make the output reproduce?&lt;/p&gt;

&lt;p&gt;So I measured it myself. Not on a cloud API, but in a local Ollama + Gemma 4 setup I could control end to end. I sent the same prompt dozens of times and bucketed the outputs by hash to count how many distinct ones appeared. Up front: in my environment, reproducibility came down to exactly two knobs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why results wobble when you changed nothing
&lt;/h2&gt;

&lt;p&gt;The final step where an LLM picks a token is sampling, drawing one option from a probability distribution. The knob that changes the character of that sampling is &lt;code&gt;temperature&lt;/code&gt;. At temperature 0 the model just takes the highest-probability token every time (greedy). There is no room for randomness, so in theory it should be deterministic. Raising temperature opens room for the second- and third-ranked tokens, and the thing that governs that lottery is &lt;code&gt;seed&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;There is one more knob worth naming: &lt;code&gt;num_predict&lt;/code&gt;, the maximum number of tokens to generate. When it is short, there are fewer points where the model can diverge, so it looks more deterministic; when it is long, tiny differences have more room to accumulate toward the tail. So I grabbed a clean signal first with a short tagline (about 40 tokens) and tested long outputs separately. That long-output test is where I hit the empty-response problem I will get to later.&lt;/p&gt;

&lt;p&gt;All of that is straight from the docs. The question is whether the theory actually holds on my laptop. Just browsing the ollama issue tracker, you find a steady stream of reports: "I fixed the seed but the answer changes" (#4660), "temperature=0 with a fixed seed still differs between the first and second run" (#586). So I decided not to trust it and to measure instead.&lt;/p&gt;

&lt;h2&gt;
  
  
  The experiment: the same prompt, dozens of times
&lt;/h2&gt;

&lt;p&gt;I built the test environment in a temporary directory outside the repo. Ollama 0.30.7, Apple Silicon, two models: a small 2GB Gemma 4 build and the 9.6GB &lt;code&gt;gemma4:e4b&lt;/code&gt;. The point of using two sizes was to see whether the same pattern shows up regardless of model scale.&lt;/p&gt;

&lt;p&gt;The method is simple. Send the same prompt 12 to 15 times per condition, hash each output with SHA-256, and count how many distinct hashes appear. One means fully deterministic; a larger number means the output is scattering.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;gen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_predict&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;opts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;num_predict&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;num_predict&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;seed&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;seed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;seed&lt;/span&gt;
    &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;options&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434/api/chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;urlopen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;count_unique&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;outs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;gen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
    &lt;span class="n"&gt;hashes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()[:&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;outs&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hashes&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="nc"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hashes&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;most_common&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I split this into five conditions. For temperature=0 and temperature&amp;gt;0, a fixed seed and no seed, plus a final case where I keep the seed fixed but spin up a fresh Python process and run once more. That last condition matters, because "reproducing inside the same process" and "reproducing after you kill and restart the process" are completely different guarantees from an eval and CI standpoint.&lt;/p&gt;

&lt;p&gt;I picked a generative prompt with room to vary: "Write a single short marketing tagline for a new AI coding assistant." Diversity only shows up on a task that does not collapse to a single right answer. Had I used a closed question like "what is 2+2," raising temperature would barely scatter anything, and the seed's effect would be invisible.&lt;/p&gt;

&lt;p&gt;I tracked two metrics together. One is the distinct count above; the other is majority share, the fraction of all N runs taken by the single most frequent output. A distinct count of 5 with one output appearing 11 times is effectively stable; five outputs scattered evenly drops majority share toward 0.3. To compress the shape of the distribution into one number, watching both felt safer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results: there were only two knobs
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/..%2F..%2F..%2Fassets%2Fblog%2Fllm-determinism-temperature-seed-experiment%2Fhero.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/..%2F..%2F..%2Fassets%2Fblog%2Fllm-determinism-temperature-seed-experiment%2Fhero.png" alt="Distinct output counts per condition across two local models" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The table makes the pattern sharper.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Condition&lt;/th&gt;
&lt;th&gt;gemma4 ~2GB (N=15)&lt;/th&gt;
&lt;th&gt;gemma4:e4b 9.6GB (N=12)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;temperature=0, no seed&lt;/td&gt;
&lt;td&gt;1 (deterministic)&lt;/td&gt;
&lt;td&gt;1 (deterministic)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;temperature=0, fixed seed&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;temperature&amp;gt;0, no seed&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;7&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;temperature&amp;gt;0, fixed seed&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;temperature&amp;gt;0, fixed seed (process rerun)&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/..%2F..%2F..%2Fassets%2Fblog%2Fllm-determinism-temperature-seed-experiment%2Fresults-table.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/..%2F..%2F..%2Fassets%2Fblog%2Fllm-determinism-temperature-seed-experiment%2Fresults-table.png" alt="Per-condition measurement log" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is the story it tells. At temperature 0 both models produced exactly one kind of output. Fifteen runs, twelve runs, not a character off, the same sentence every time. Seed or no seed, the result was identical. That confirms directly that the seed has nothing to do under greedy decoding.&lt;/p&gt;

&lt;p&gt;The only cell that scattered was temperature&amp;gt;0 with no seed. The 2GB model split into 5 distinct outputs out of 15, the 9.6GB model into 7 out of 12. But fix the seed to 42 at the same temperature and it snapped back to 1. The most striking part is the last row. I spun up a brand-new Python process and ran it again, and with the same seed it produced the same sentence: "Code faster, effortlessly smart." Two independent executions matched character for character.&lt;/p&gt;

&lt;p&gt;The actual sentences make it tangible. At temperature 0 the 2GB model emitted only "Code Smarter, Not Harder" every time. Raise temperature to 0.8 and drop the seed and variations like "Code Smarter, Not Harder with Ada" slipped in. Pin the seed to 42 and it froze on that one line, which held across the process restart. The 9.6GB model behaved the same, splitting into 7 variants at seed-less temperature 0.8 and converging to "Code faster, effortlessly smart" once the seed was fixed. The contrast is even clearer in majority share: the 9.6GB model at seed-less temperature 0.8 sat at 0.333, meaning even its most common output showed up only one run in three. Every other condition was 1.0, all identical.&lt;/p&gt;

&lt;p&gt;The result was clean enough that I doubted it again, so I swapped models and reran. Two models that differ 5x in size showed the same pattern. In my environment at least, the two knobs of temperature and seed were enough to control reproducibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  One empty response taught the bigger lesson
&lt;/h2&gt;

&lt;p&gt;I had originally meant to lean on the 12B &lt;code&gt;gemma4:12b-it-qat&lt;/code&gt; as the main model. But that community build returned nothing but an empty string. On both &lt;code&gt;/api/generate&lt;/code&gt; and &lt;code&gt;/api/chat&lt;/code&gt;, &lt;code&gt;done_reason&lt;/code&gt; came back &lt;code&gt;length&lt;/code&gt; and &lt;code&gt;eval_count&lt;/code&gt; climbed to 40 or 200, yet the &lt;code&gt;content&lt;/code&gt; was an empty string.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;content repr: ''
done_reason: length   eval_count: 200
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tokens were clearly generated. The GPU spun for 28 seconds each time. But nothing reached the user-visible text. The chat template on this QAT build is probably broken, or the model emitted only invisible control tokens. The exact cause is outside my expertise, so I will not assert one.&lt;/p&gt;

&lt;p&gt;The lesson, though, is clear. "The model ran" and "the model answered" are different statements. A pipeline that treats a rising eval_count as success would have let this empty response slip straight into the evaluation data. Why you need one guard that rejects zero-length responses, this failure argued more convincingly than any line of code. Failure is content too, and I felt it again here. I have hit this kind of packaging variable before with local models, in a similar shape as the small-model schema limits I wrote about in the &lt;a href="https://dev.to/en/blog/en/ollama-structured-outputs-pydantic-local-llm-guide-2026"&gt;post on Ollama structured outputs&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Right locally does not mean right on the cloud
&lt;/h2&gt;

&lt;p&gt;Let me draw the line clearly. What I measured is determinism "on local Ollama, sending requests one at a time, sequentially." Move that condition onto a cloud API like OpenAI or Claude and the story changes.&lt;/p&gt;

&lt;p&gt;The most convincing explanation of why comes from Thinking Machines' "Defeating Nondeterminism in LLM Inference" (September 2025). People often say "GPU floating-point math is non-deterministic, that's why," but the piece locates the real cause elsewhere. Inference servers batch many users' requests together, and the batch size shifts unpredictably with server load at that moment. Because the core kernels take a slightly different numerical path depending on batch size, the same prompt can diverge into different tokens even under greedy decoding. As I understand it, the real culprit of non-determinism is not randomness but the fact that your request lands in a differently sized batch each time.&lt;/p&gt;

&lt;p&gt;Local ollama is not a perfectly safe zone either. Issue #586 reports that with the same seed, same temperature=0, and same num_ctx, the output still differed slightly between the first and second run, and more interestingly that the same code produced a different "fixed" output on Ubuntu versus Windows. In other words, determinism may be a platform-bound property. My measurements were probably clean because they ran on one Mac, with one version of ollama, on short outputs. The longer the output and the larger num_ctx, the more room tiny numerical differences have to accumulate and diverge.&lt;/p&gt;

&lt;p&gt;I did not reproduce this batch non-determinism directly. I never built an environment that applies concurrent load. So I cite that part from the writeup only. The range I verified by hand is strictly "sequential requests, local, short outputs." Blur that boundary and my article becomes one more piece selling unverified claims as fact.&lt;/p&gt;

&lt;p&gt;So honestly, building eval reproducibility on top of a cloud API is trickier than it sounds. Even an API that accepts a seed parameter does not protect you from batch non-determinism. That, I think, is the root reason LLM evaluation never settles as cleanly as a unit test.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I applied to evaluation and agent testing right away
&lt;/h2&gt;

&lt;p&gt;Here is what I folded into my workflow as soon as the measurement was done.&lt;/p&gt;

&lt;p&gt;First, pin regression evals at temperature=0 with a fixed seed. For a regression test that checks "did it change" rather than "how good is it," you do not need the model's creativity. Better to lock onto one reproducible output and catch the moment it shifts. In my environment this combination returned the same sentence even after a process restart, which is enough to put in CI.&lt;/p&gt;

&lt;p&gt;Second, never conclude from a single run. Features that run at a higher temperature, where diversity is the value, scatter by nature. To evaluate that output you cannot run it once and call it pass or fail; you run it N times and look at the distribution. Seeing it split 7 out of 12 in my measurement tells you how risky it is to judge a feature on one lucky output.&lt;/p&gt;

&lt;p&gt;Third, put an output-validity guard in front of the eval. The empty-response 12B model is the direct reason. Unless you explicitly classify zero length, JSON parse failure, or a missing expected field as a failure, a broken model masquerades as a healthy score. The skeleton I use for regression tests is about this simple.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;assert_reproducible&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expected_hash&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;outs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;gen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
    &lt;span class="c1"&gt;# 1) empty-response guard: separate "ran" from "answered"
&lt;/span&gt;    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;outs&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;empty output detected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="c1"&gt;# 2) does it lock to one kind within a single run?
&lt;/span&gt;    &lt;span class="n"&gt;hashes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()[:&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;outs&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hashes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;non-deterministic: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hashes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; variants&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="c1"&gt;# 3) does it match the baseline I pinned earlier?
&lt;/span&gt;    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;hashes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;expected_hash&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output drifted from baseline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pin the expected hash once and CI catches the moment a model version or an ollama upgrade changes the output. That is my rebuttal to the common resignation that "you can't test an LLM." You can't test all of it, but the part where you control the reproducibility conditions, you certainly can.&lt;/p&gt;

&lt;p&gt;It extends to agents the same way. To regression-test an agent's sequence of tool calls, that sequence has to reproduce. If you have built a &lt;a href="https://dev.to/en/blog/en/local-llm-private-mcp-server-gemma4-fastmcp"&gt;fully offline MCP server on a local model&lt;/a&gt;, fixing the seed on top of it to reproduce tool calls is relatively controllable. An agent on a cloud LLM, by contrast, struggles to get the same guarantee because of batch non-determinism. In the end, "where you run inference" decides "how tightly you can write the test," and that was the most practical takeaway from this experiment.&lt;/p&gt;

&lt;p&gt;Next I plan to build an environment that applies concurrent load and check directly whether batch non-determinism reproduces even on local ollama. If it does, my tentative conclusion that "local is safe" will earn a footnote.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Stop Feeding Raw JSON to Your LLM — I Measured Token Cost Across 9 Data Formats</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Sun, 21 Jun 2026 06:34:53 +0000</pubDate>
      <link>https://dev.to/jangwook_kim_e31e7291ad98/stop-feeding-raw-json-to-your-llm-i-measured-token-cost-across-9-data-formats-275o</link>
      <guid>https://dev.to/jangwook_kim_e31e7291ad98/stop-feeding-raw-json-to-your-llm-i-measured-token-cost-across-9-data-formats-275o</guid>
      <description>&lt;p&gt;I needed to hand an agent a 50-row product catalog as context. Out of habit I dumped it with &lt;code&gt;json.dumps(records, indent=2)&lt;/code&gt;, and the token counter read past 4,000. The data itself was tiny. I started to suspect the indentation and quotes were eating close to half my tokens. So I serialized the exact same data into nine formats and counted the real tokens.&lt;/p&gt;

&lt;p&gt;Up front: &lt;strong&gt;for flat data, TSV is 62% cheaper than pretty JSON.&lt;/strong&gt; But the moment the data nests, that conclusion reverses entirely. I went looking for where that boundary actually sits.&lt;/p&gt;

&lt;h2&gt;
  
  
  The setup: count with tiktoken, don't guess
&lt;/h2&gt;

&lt;p&gt;Token cost gets tossed around as "roughly chars × 0.75," but that heuristic completely misses per-format differences. So I used OpenAI's own &lt;a href="https://github.com/openai/tiktoken" rel="noopener noreferrer"&gt;tiktoken&lt;/a&gt;. I ran two encodings side by side: &lt;code&gt;o200k_base&lt;/code&gt; (GPT-4o, the o-series, the GPT-5 family) and the older &lt;code&gt;cl100k_base&lt;/code&gt; (GPT-4 and 3.5).&lt;/p&gt;

&lt;p&gt;The test data mimics a realistic "tool result." Fifty product records, each a flat object with nine fields: &lt;code&gt;id&lt;/code&gt;, &lt;code&gt;sku&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;category&lt;/code&gt;, &lt;code&gt;price&lt;/code&gt;, &lt;code&gt;stock&lt;/code&gt;, &lt;code&gt;warehouse&lt;/code&gt;, &lt;code&gt;status&lt;/code&gt;, &lt;code&gt;rating&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;yaml&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tomli_w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tiktoken&lt;/span&gt;

&lt;span class="n"&gt;records&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;records.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  &lt;span class="c1"&gt;# 50 flat records
&lt;/span&gt;&lt;span class="n"&gt;enc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tiktoken&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_encoding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;o200k_base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# GPT-4o / GPT-5 family
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;tok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;enc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pretty :&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;tok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compact:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;tok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;separators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))))&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;csv    :&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;tok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I ran the sandbox in a throwaway &lt;code&gt;mktemp -d&lt;/code&gt; directory outside the repo, kept only the result logs and the chart, then wiped the environment. The habit of isolating one-off experiments like this hardened during the stretch where I &lt;a href="https://dev.to/en/blog/en/ai-agent-cost-reality"&gt;ran eight agents and tracked their real cost&lt;/a&gt;. Cutting tokens is, in the end, one line in that same ledger.&lt;/p&gt;

&lt;h2&gt;
  
  
  Flat data: TSV wins by a mile
&lt;/h2&gt;

&lt;p&gt;Here are the measured results for serializing 50 flat records nine ways. On &lt;code&gt;o200k_base&lt;/code&gt;, pretty JSON (4,128 tokens) is the 0% baseline.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Chars&lt;/th&gt;
&lt;th&gt;o200k tokens&lt;/th&gt;
&lt;th&gt;cl100k tokens&lt;/th&gt;
&lt;th&gt;vs pretty JSON&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;TSV&lt;/td&gt;
&lt;td&gt;3,742&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1,568&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1,663&lt;/td&gt;
&lt;td&gt;−62.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CSV&lt;/td&gt;
&lt;td&gt;3,742&lt;/td&gt;
&lt;td&gt;1,650&lt;/td&gt;
&lt;td&gt;1,650&lt;/td&gt;
&lt;td&gt;−60.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Markdown table&lt;/td&gt;
&lt;td&gt;4,766&lt;/td&gt;
&lt;td&gt;1,897&lt;/td&gt;
&lt;td&gt;1,897&lt;/td&gt;
&lt;td&gt;−54.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JSON (compact)&lt;/td&gt;
&lt;td&gt;7,985&lt;/td&gt;
&lt;td&gt;2,578&lt;/td&gt;
&lt;td&gt;2,593&lt;/td&gt;
&lt;td&gt;−37.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;key: value lines&lt;/td&gt;
&lt;td&gt;6,982&lt;/td&gt;
&lt;td&gt;2,708&lt;/td&gt;
&lt;td&gt;2,708&lt;/td&gt;
&lt;td&gt;−34.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;YAML&lt;/td&gt;
&lt;td&gt;7,834&lt;/td&gt;
&lt;td&gt;3,159&lt;/td&gt;
&lt;td&gt;3,159&lt;/td&gt;
&lt;td&gt;−23.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TOML&lt;/td&gt;
&lt;td&gt;8,533&lt;/td&gt;
&lt;td&gt;3,176&lt;/td&gt;
&lt;td&gt;3,191&lt;/td&gt;
&lt;td&gt;−23.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JSON (pretty)&lt;/td&gt;
&lt;td&gt;10,986&lt;/td&gt;
&lt;td&gt;4,128&lt;/td&gt;
&lt;td&gt;4,143&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;XML&lt;/td&gt;
&lt;td&gt;13,654&lt;/td&gt;
&lt;td&gt;4,777&lt;/td&gt;
&lt;td&gt;4,778&lt;/td&gt;
&lt;td&gt;+15.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/..%2F..%2F..%2Fassets%2Fblog%2Fllm-token-cost-data-format-experiment.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/..%2F..%2F..%2Fassets%2Fblog%2Fllm-token-cost-data-format-experiment.png" alt="Bar chart of LLM token cost by data format — the same 50 records cost 62% fewer tokens as TSV than as pretty JSON" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The thing that jumps out is the 2.6x gap between pretty JSON and TSV. Same information. Not one fact the model receives changes. And yet two-space indents, repeated key names, quotes, and braces inflate the token count 2.6 times. XML goes further still, writing every field twice thanks to closing tags, landing 16% above even pretty JSON. I think using XML as an LLM input format is a near-guaranteed loss.&lt;/p&gt;

&lt;p&gt;Why CSV, TSV, and Markdown tables are cheap is simple. &lt;strong&gt;They write the field names once in a header row, then list only values across the 50 rows.&lt;/strong&gt; JSON-family formats repeat a key like &lt;code&gt;"warehouse":&lt;/code&gt; 50 times, once per record. The more fields and the more rows, the worse that repetition tax gets.&lt;/p&gt;

&lt;p&gt;Break it down per record and it's starker. Pretty JSON's 4,128 tokens spread across 50 records is about 82 tokens each. Strip TSV's header row (~18 tokens) and 1,550 tokens over 50 rows is about 31 tokens per record. The same nine fields on one line, and one side spends 82 tokens while the other spends 31. The difference isn't the data. It's nine key names repeated 50 times, plus the quotes and braces wrapping them. To the model, &lt;code&gt;"category":"books"&lt;/code&gt; and &lt;code&gt;books&lt;/code&gt; are the same fact, but the former spends three or four times the tokens to convey it.&lt;/p&gt;

&lt;h2&gt;
  
  
  It's not the repeated keys that cost, it's the boilerplate
&lt;/h2&gt;

&lt;p&gt;I want to correct a likely misread here. The accurate framing isn't "JSON is slow," it's "structural punctuation is expensive." Compact JSON beats pretty JSON by 37.5% not because it has fewer keys. The keys still repeat 50 times. What disappears is the indentation whitespace and the line breaks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2-space indent "  "  -&amp;gt; 1 token (id 220)
3 commas       ",,,"  -&amp;gt; 1 token
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On &lt;code&gt;o200k_base&lt;/code&gt;, a two-space indent is itself a single token (id 220). When 50 records each indent nine fields, that whitespace token alone gets laid down hundreds of times. Add a newline per line and an opening and closing brace per object. To a human that's readability; to the model it's pure cost. So I've made "no pretty-printing unless a human is going to read it" my default.&lt;/p&gt;

&lt;h2&gt;
  
  
  With nested data, the conclusion flips
&lt;/h2&gt;

&lt;p&gt;If I'd stopped here I'd have walked away with the wrong lesson, "always CSV." So I measured a differently shaped dataset. Twenty orders, where each order holds a customer object (with a nested address) and a variable-length array of line items.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;o200k tokens&lt;/th&gt;
&lt;th&gt;vs pretty JSON&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;JSON (compact)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1,538&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;−45.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;YAML&lt;/td&gt;
&lt;td&gt;1,958&lt;/td&gt;
&lt;td&gt;−30.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TOML&lt;/td&gt;
&lt;td&gt;2,021&lt;/td&gt;
&lt;td&gt;−28.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JSON (pretty)&lt;/td&gt;
&lt;td&gt;2,835&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;CSV, TSV, and Markdown tables drop out of the running entirely. There's no way to cram variable-length item arrays and nested objects into a two-dimensional grid. And compact JSON, which sat mid-pack at −34% on flat data, takes first place at −45.7% on nested data. YAML, flat or nested, was never as cheap as I'd assumed, thanks to its indentation cost. The folk wisdom that "YAML is both human-readable and token-light" did not hold up, at least not in this measurement.&lt;/p&gt;

&lt;p&gt;So the boundary is this. &lt;strong&gt;Uniform rows, use a tabular format (CSV/TSV/Markdown); nested structure, use compact JSON.&lt;/strong&gt; That one line is the most useful rule I pulled out of today's experiment.&lt;/p&gt;

&lt;h2&gt;
  
  
  In an agent loop, this compounds
&lt;/h2&gt;

&lt;p&gt;A few thousand tokens sounds like nothing, but an agent resends the same context every turn. Say you pin a 50-item catalog into the system prompt as pretty JSON and run a 30-turn conversation. Switching the format to TSV alone drops about 2,560 tokens per turn (4,128 → 1,568). Over 30 turns that's 76,000 tokens. On a model with a tight context window, that's the difference between fitting and not; on a metered model, it's input-token cost, dollar for dollar.&lt;/p&gt;

&lt;p&gt;Here's a likely objection: "Doesn't prompt caching make the same context cheap anyway?" It does. If the catalog is pinned in the system prompt, a cache hit cuts the cost sharply. But caching doesn't reduce the token count itself. Cached tokens still occupy the full context window, and caches usually expire after a short TTL and then refill at full price. On top of that, tool results that change every turn can't be cached in the first place. Trimming tokens via format isn't a technique that competes with caching, it's one that complements it. Turn both on and you win twice.&lt;/p&gt;

&lt;p&gt;This matters most when an MCP tool returns a large result. When a &lt;a href="https://dev.to/en/blog/en/fastmcp-python-mcp-server-build-guide-2026"&gt;server you built with FastMCP&lt;/a&gt; returns a DB query as raw JSON, that format is the model's input cost. One small decision on the server side, serializing a flat result as CSV or TSV, shifts the token ledger of the whole agent.&lt;/p&gt;

&lt;p&gt;The limits are real, of course. I only measured token counts; &lt;strong&gt;I did not measure whether the model understands each format equally well.&lt;/strong&gt; That's as far as I could reproduce without API calls. My intuition is that a format like CSV, where the header sits far from the values, could confuse the model on field-heavy data. With the header only once at the top, the model has to count by position to know what the 7th value in the 30th row means. Save tokens and lose accuracy and you haven't gained anything. So before applying this for real, run token savings and response quality together as an A/B at least once. I'm deferring that check to the next experiment.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'm applying starting tomorrow
&lt;/h2&gt;

&lt;p&gt;Today's measurements changed my defaults to this.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Flat record arrays going into context: CSV or a Markdown table. No pretty JSON.&lt;/li&gt;
&lt;li&gt;Nested structures: &lt;code&gt;json.dumps(x, separators=(",",":"))&lt;/code&gt;, compact. &lt;code&gt;indent=2&lt;/code&gt; only when a human is debugging.&lt;/li&gt;
&lt;li&gt;No XML as LLM input. It spends the most tokens on the same information.&lt;/li&gt;
&lt;li&gt;If a format change cut tokens significantly, verify once that model accuracy holds in that format.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Data format is a value your code picks almost automatically, so you rarely think about it. But the moment it enters an LLM context, that thoughtless &lt;code&gt;indent=2&lt;/code&gt; can become half your token bill. Until I measured it myself, I was underestimating the size of it too.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The measurement code and full logs were run once in the sandbox and preserved in &lt;code&gt;docs/evidence/llm-token-cost-data-format-experiment.md&lt;/code&gt;. Measured on tiktoken 0.12.0, Python 3.12.8.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
    </item>
    <item>
      <title>Building a TypeScript MCP Client from Scratch — @modelcontextprotocol/sdk v1.29 in Practice</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Fri, 19 Jun 2026 06:38:30 +0000</pubDate>
      <link>https://dev.to/jangwook_kim_e31e7291ad98/building-a-typescript-mcp-client-from-scratch-modelcontextprotocolsdk-v129-in-practice-13a8</link>
      <guid>https://dev.to/jangwook_kim_e31e7291ad98/building-a-typescript-mcp-client-from-scratch-modelcontextprotocolsdk-v129-in-practice-13a8</guid>
      <description>&lt;p&gt;I've always been curious what Claude Desktop does internally when it connects to an MCP server. "It connects via stdio" is the short answer, but I needed to see the code to actually understand it. So today I installed &lt;code&gt;@modelcontextprotocol/sdk&lt;/code&gt;, built a TypeScript MCP client from scratch, and ran it against a server I wrote myself.&lt;/p&gt;

&lt;p&gt;The verdict: it's a lot simpler than I expected. There was also one behavior that surprised me — error handling works differently from what I assumed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The idea: do what Claude Desktop does, manually
&lt;/h2&gt;

&lt;p&gt;MCP (Model Context Protocol) is the standard interface for AI agents to access external tools and data. I've written a lot about building MCP servers — including &lt;a href="https://dev.to/en/blog/en/mcp-server-typescript-sdk-step-by-step-2026"&gt;how to build one in TypeScript&lt;/a&gt; and &lt;a href="https://dev.to/en/blog/en/fastmcp-python-mcp-server-build-guide-2026"&gt;spinning one up with Python FastMCP in 30 minutes&lt;/a&gt;. But I've never written about implementing the client side myself.&lt;/p&gt;

&lt;p&gt;Thinking about production use cases, there are clear situations where a custom MCP client makes sense:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Calling MCP server tools automatically from a CI/CD pipeline&lt;/li&gt;
&lt;li&gt;Integrating an MCP server into a custom agent layer you're building yourself&lt;/li&gt;
&lt;li&gt;Using MCP server capabilities as a library inside existing Python or TypeScript code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You don't need Claude Desktop or Claude Code. The &lt;code&gt;@modelcontextprotocol/sdk&lt;/code&gt; package contains everything needed to build a client.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installing the SDK — two classes are all you need
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; @modelcontextprotocol/sdk zod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The installed version is &lt;code&gt;1.29.0&lt;/code&gt; as of today. The SDK includes both server and client implementations.&lt;/p&gt;

&lt;p&gt;There are two core classes for building an MCP client.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;Client&lt;/code&gt;&lt;/strong&gt; — Manages the logical connection to a server. Provides methods like &lt;code&gt;listTools()&lt;/code&gt;, &lt;code&gt;callTool()&lt;/code&gt;, &lt;code&gt;listResources()&lt;/code&gt;, and &lt;code&gt;readResource()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;StdioClientTransport&lt;/code&gt;&lt;/strong&gt; — The transport layer for communicating with stdio-based MCP servers. It spawns the server process directly using &lt;code&gt;command&lt;/code&gt; and &lt;code&gt;args&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To connect to remote MCP servers (HTTP/SSE), you'd use a different Transport class. This guide covers stdio only.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting up the demo — server and client, both from scratch
&lt;/h2&gt;

&lt;p&gt;I built a simple MCP server for the demo. It has two tools: &lt;code&gt;calculate&lt;/code&gt; (basic arithmetic) and &lt;code&gt;transform_text&lt;/code&gt; (string transformations).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// server.mjs&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;McpServer&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@modelcontextprotocol/sdk/server/mcp.js&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;StdioServerTransport&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@modelcontextprotocol/sdk/server/stdio.js&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;zod&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;server&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;McpServer&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;demo-tools&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;1.0.0&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;calculate&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Basic arithmetic: add, subtract, multiply, divide&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;operation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enum&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;add&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;subtract&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;multiply&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;divide&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
    &lt;span class="na"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;number&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;number&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;operation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ops&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;subtract&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;multiply&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;divide&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ops&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;operation&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;text&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
          &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Error: division by zero&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
          &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;operation&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; = &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;transform_text&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Text transformation: uppercase, lowercase, reverse, word_count&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;op&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enum&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;uppercase&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;lowercase&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;reverse&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;word_count&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;op&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;uppercase&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toUpperCase&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
      &lt;span class="na"&gt;lowercase&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toLowerCase&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
      &lt;span class="na"&gt;reverse&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[...&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;reverse&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="na"&gt;word_count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Word count: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="sr"&gt;+/&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;text&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;op&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;}]&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;server-info&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;mcp://demo/info&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;uri&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
      &lt;span class="na"&gt;uri&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;uri&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;href&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;mimeType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;text/plain&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;MCP demo server v1.0.0 — stdio transport&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}],&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;transport&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;StdioServerTransport&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;transport&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There's no need to run the server separately. The client's &lt;code&gt;StdioClientTransport&lt;/code&gt; spawns the server process automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  Client implementation — listTools, callTool, listResources
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// client.mjs&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Client&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@modelcontextprotocol/sdk/client/index.js&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;StdioClientTransport&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@modelcontextprotocol/sdk/client/stdio.js&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;transport&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;StdioClientTransport&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;node&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;server.mjs&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;cwd&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cwd&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;demo-client&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;1.0.0&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;capabilities&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;transport&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The moment &lt;code&gt;client.connect(transport)&lt;/code&gt; is called, &lt;code&gt;node server.mjs&lt;/code&gt; is spawned as a subprocess. The client and server then exchange JSON-RPC 2.0 messages over stdin/stdout pipes.&lt;/p&gt;

&lt;p&gt;Once connected, I ran three operations in sequence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. List available tools&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;tools&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;listTools&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;t&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;inputSchema&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;properties&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;, &lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`  • &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;(&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;) — &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;description&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Call a tool&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;callTool&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;calculate&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;operation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;multiply&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. List and read resources&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;resources&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;listResources&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;info&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;readResource&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;uri&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;mcp://demo/info&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;contents&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's the actual output from the sandbox run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;=== MCP Client Demo — @modelcontextprotocol/sdk v1.29.0 ===

✓ Connected to MCP server

Found 2 tool(s):
  • calculate(operation, a, b) — Basic arithmetic: add, subtract, multiply, divide
  • transform_text(text, op) — Text transformation: uppercase, lowercase, reverse, word_count

--- calculate tool ---
  42 multiply 7 = 294
  100 divide 4 = 25
  999 add 1 = 1000

--- transform_text tool ---
  "Model Context Protocol" → MODEL CONTEXT PROTOCOL
  "BUILD ONCE RUN EVERYWHERE" → build once run everywhere
  "hello world from MCP" → Word count: 4

Found 1 resource(s): mcp://demo/info
  Content: MCP demo server v1.0.0 — stdio transport

✓ Client closed cleanly.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The server was never started separately. The client spawned &lt;code&gt;node server.mjs&lt;/code&gt;, communicated with it, and terminated both processes cleanly when &lt;code&gt;client.close()&lt;/code&gt; was called.&lt;/p&gt;

&lt;h2&gt;
  
  
  Errors come back as isError, not as exceptions
&lt;/h2&gt;

&lt;p&gt;This is the behavior that surprised me. When you call a tool that doesn't exist, &lt;code&gt;callTool()&lt;/code&gt; doesn't throw — it returns a response object with &lt;code&gt;isError: true&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;callTool&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;nonexistent_tool&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// {&lt;/span&gt;
&lt;span class="c1"&gt;//   content: [{ type: "text", text: "MCP error -32602: Tool nonexistent_tool not found" }],&lt;/span&gt;
&lt;span class="c1"&gt;//   isError: true&lt;/span&gt;
&lt;span class="c1"&gt;// }&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No need for &lt;code&gt;try/catch&lt;/code&gt; around tool calls. Instead, check &lt;code&gt;result.isError&lt;/code&gt; in every tool call handler.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;callToolSafe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;callTool&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;args&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;isError&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]?.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Unknown MCP error&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]?.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is intentional per the MCP spec. Tool execution errors and protocol errors are kept separate: protocol-level errors might throw, but tool-level errors come back in the content. Once I understood this, it made sense — it matches how Claude agents receive tool output. Even when a tool fails, the LLM gets the error text as part of the context and can reason about it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Parallel callTool — 4 calls in 1ms
&lt;/h2&gt;

&lt;p&gt;I also tested parallel calls with &lt;code&gt;Promise.all&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ops&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;add&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;multiply&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;subtract&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;37&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;divide&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;144&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;ops&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(([&lt;/span&gt;&lt;span class="nx"&gt;op&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
    &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;callTool&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;calculate&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;operation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;op&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;ops&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; parallel calls completed in &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;ms`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Parallel calls (4 ops) in 1ms:
  1 add 1 = 2
  12 multiply 12 = 144
  100 subtract 37 = 63
  144 divide 12 = 12
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even over stdio, the SDK handles request multiplexing internally. Four concurrent calls go out and get matched to their responses correctly. Keep in mind that with stdio the server still processes them one at a time — if your tools are CPU-heavy, parallel calling has limited upside.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three real-world situations where a custom client is useful
&lt;/h2&gt;

&lt;p&gt;Implementing this clarified where a custom client actually makes sense.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automation scripts calling MCP tools&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If your MCP server exposes code linting, file conversion, or external API lookups, you can invoke those tools from GitHub Actions or a local shell script — just a small Node.js program running &lt;code&gt;node client.mjs&lt;/code&gt;. No GUI required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Building a custom agent framework&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you're writing your own agent loop without LangGraph or LlamaIndex, a custom MCP client slots in as the tool execution layer. Pull the tool list with &lt;code&gt;listTools()&lt;/code&gt;, inject it into your LLM prompt, parse the model's tool call decision, and run it with &lt;code&gt;callTool()&lt;/code&gt;. The &lt;a href="https://dev.to/en/blog/en/mcp-gateway-agent-traffic-control"&gt;MCP Gateway post&lt;/a&gt; is a natural follow-up if you need to route traffic across multiple servers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Testing and debugging MCP servers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;During MCP server development, a custom client lets you verify tool behavior quickly without Claude Desktop. Call &lt;code&gt;listTools()&lt;/code&gt; to check the generated &lt;code&gt;inputSchema&lt;/code&gt;, then fire &lt;code&gt;callTool()&lt;/code&gt; with various parameter combinations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Connecting to an existing public MCP server
&lt;/h2&gt;

&lt;p&gt;So far I've only connected to my own server. The same client works with any public MCP server package.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; @modelcontextprotocol/server-filesystem
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then update the transport:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;transport&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;StdioClientTransport&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;npx&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;-y&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@modelcontextprotocol/server-filesystem&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/path/to/allowed/directory&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this you can call &lt;code&gt;read_file&lt;/code&gt;, &lt;code&gt;write_file&lt;/code&gt;, and &lt;code&gt;list_directory&lt;/code&gt; tools via &lt;code&gt;callTool()&lt;/code&gt;. The &lt;code&gt;command&lt;/code&gt;/&lt;code&gt;args&lt;/code&gt; in Claude Desktop's config file maps directly to what &lt;code&gt;StdioClientTransport&lt;/code&gt; accepts — copy and paste and it works.&lt;/p&gt;

&lt;p&gt;If you need multiple servers, create a separate &lt;code&gt;Client&lt;/code&gt; instance per server. Client-to-server is a 1:1 relationship.&lt;/p&gt;

&lt;h2&gt;
  
  
  Handling content types safely
&lt;/h2&gt;

&lt;p&gt;One thing to be careful about: &lt;code&gt;callTool()&lt;/code&gt; and &lt;code&gt;readResource()&lt;/code&gt; return a &lt;code&gt;content&lt;/code&gt; array where each item's structure depends on its &lt;code&gt;type&lt;/code&gt; field.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;text&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;image&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Image: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;mimeType&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;resource&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Resource: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;uri&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you blindly access &lt;code&gt;content[0].text&lt;/code&gt; without checking &lt;code&gt;type&lt;/code&gt;, you'll get &lt;code&gt;undefined&lt;/code&gt; when the tool returns an image or embedded resource. With third-party MCP servers, always check the &lt;code&gt;type&lt;/code&gt; field first.&lt;/p&gt;

&lt;p&gt;The SDK's &lt;code&gt;.d.ts&lt;/code&gt; files define &lt;code&gt;TextContent&lt;/code&gt;, &lt;code&gt;ImageContent&lt;/code&gt;, and &lt;code&gt;EmbeddedResource&lt;/code&gt; as separate types. In TypeScript, importing and using these for explicit type narrowing is cleaner than relying on runtime checks alone.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest limitations
&lt;/h2&gt;

&lt;p&gt;A few friction points I ran into:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thin TypeScript generics.&lt;/strong&gt; The return type of &lt;code&gt;callTool()&lt;/code&gt; is &lt;code&gt;{ content: Content[], isError?: boolean }&lt;/code&gt; — &lt;code&gt;Content&lt;/code&gt; being a union type. Narrowing it to &lt;code&gt;TextContent&lt;/code&gt; requires explicit checks. Not a blocker, but not ergonomic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SSE/HTTP needs a different transport.&lt;/strong&gt; For remote MCP servers (HTTP-based), you'll need &lt;code&gt;StreamableHTTPClientTransport&lt;/code&gt; or &lt;code&gt;SSEClientTransport&lt;/code&gt;. The setup differs slightly and isn't covered in as many examples.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Process lifecycle management.&lt;/strong&gt; &lt;code&gt;StdioClientTransport&lt;/code&gt; doesn't spawn a new process per call — the server process stays alive for the duration of the connection. Always call &lt;code&gt;client.close()&lt;/code&gt; at the end of a script, or use &lt;code&gt;process.on('exit', ...)&lt;/code&gt; to clean up, otherwise the server process lingers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;zod v4 compatibility warnings.&lt;/strong&gt; The SDK uses zod internally, and mixing it with zod v4 in your project may produce deprecation warnings. With SDK 1.29.0 and zod 4.4.3 everything ran fine for me, but it's worth watching.&lt;/p&gt;

&lt;h2&gt;
  
  
  My take: the client side is underrepresented
&lt;/h2&gt;

&lt;p&gt;There's a lot of content about building MCP servers. Custom client implementations — doing what Claude Desktop does programmatically — are far less documented.&lt;/p&gt;

&lt;p&gt;The use cases are real. Developers building AI agent pipelines from scratch, teams integrating MCP into existing code, engineers debugging server behavior without a GUI. The &lt;code&gt;Client&lt;/code&gt; class in &lt;code&gt;@modelcontextprotocol/sdk&lt;/code&gt; is stable and the API is clean.&lt;/p&gt;

&lt;p&gt;Seventy lines of TypeScript is enough to connect to an MCP server, list its tools, call them, read its resources, and close cleanly. I wish this had been in the official docs as a worked example from the start. Since it wasn't, here it is.&lt;/p&gt;

&lt;p&gt;Next up: attaching this client to a real-world MCP server — probably the filesystem one or the GitHub MCP server — and building a small automation script around it.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Building AI Agents with Agno — I Actually Ran It with Gemini and Built-in Tools</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Thu, 18 Jun 2026 06:41:11 +0000</pubDate>
      <link>https://dev.to/jangwook_kim_e31e7291ad98/building-ai-agents-with-agno-i-actually-ran-it-with-gemini-and-built-in-tools-49in</link>
      <guid>https://dev.to/jangwook_kim_e31e7291ad98/building-ai-agents-with-agno-i-actually-ran-it-with-gemini-and-built-in-tools-49in</guid>
      <description>&lt;p&gt;If you've ever felt like LangChain was too heavy, you're not alone. The dependency tree is enormous. Abstraction layers pile up. At some point you lose track of what's actually happening underneath. That frustration has pushed a lot of people toward lighter alternatives — frameworks that prove you can build a capable agent without a hundred transitive dependencies.&lt;/p&gt;

&lt;p&gt;Agno is one of those alternatives. It started as Phidata and rebranded in early 2025. I spent an afternoon installing Agno v2.6.17 in a clean sandbox and running through Calculator tools, Wikipedia retrieval, Pydantic structured output, and a two-agent Team. I'll share the real execution logs and, more importantly, the traps I hit that the docs don't warn you about.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Agno Is and Where It Came from
&lt;/h2&gt;

&lt;p&gt;Phidata built a solid reputation as "the Python framework for AI assistants." When it rebranded to Agno in 2025, the design philosophy got articulated more clearly around three ideas.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model-agnostic from day one.&lt;/strong&gt; Over 70 LLMs — OpenAI, Anthropic, Google, Ollama, Cohere — can plug in with the same code structure. Swap the model, keep the agent logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multimodal as a default.&lt;/strong&gt; Text, image, audio, video agents all use the same API surface. You don't need a different abstraction layer for each modality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-agent orchestration as a first-class citizen.&lt;/strong&gt; The &lt;code&gt;Team&lt;/code&gt; class is built in. You can switch between &lt;code&gt;coordinate&lt;/code&gt;, &lt;code&gt;route&lt;/code&gt;, and &lt;code&gt;collaborate&lt;/code&gt; modes with a single parameter change.&lt;/p&gt;

&lt;p&gt;Reading that, I thought: "How is this different from LangChain?" The answer showed up when I actually wrote code. Agno favors composition over class inheritance. One agent takes about 6 lines to set up. There's far less boilerplate to wade through.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installation: No Dependency Hell
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;agno google-genai ddgs wikipedia
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;agno&lt;/code&gt; package installs just the core. Tools require their own extra dependencies — &lt;code&gt;wikipedia&lt;/code&gt; for the Wikipedia tool, &lt;code&gt;google-genai&lt;/code&gt; for Gemini. This lazy-loading approach keeps the base install clean.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;python3 &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"import agno; print(agno.__version__)"&lt;/span&gt;
2.6.17
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I used the Gemini API key from my project &lt;code&gt;.env&lt;/code&gt;. Agno auto-initializes the Gemini client from either &lt;code&gt;GOOGLE_API_KEY&lt;/code&gt; or &lt;code&gt;GEMINI_API_KEY&lt;/code&gt;. If both are set, it uses &lt;code&gt;GOOGLE_API_KEY&lt;/code&gt; and prints a warning to stdout. Not a big deal, but you can't suppress it easily from code.&lt;/p&gt;

&lt;h2&gt;
  
  
  First Agent: Calculator Tool
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agno.agent&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agno.models.google&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Gemini&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agno.tools.calculator&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CalculatorTools&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Gemini&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;CalculatorTools&lt;/span&gt;&lt;span class="p"&gt;()],&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A math helper agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is 2^10 + 3^5? Please use the calculator.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2^10 is 1024 and 3^5 is 243. Adding them gives 1267.
⏱ 8.98s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The math is right: 1024 + 243 = 1267. The agent invoked the Calculator tool rather than letting the LLM guess. The 9-second latency includes a Gemini API round-trip plus the tool call overhead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trap #1: &lt;code&gt;show_tool_calls&lt;/code&gt; is gone.&lt;/strong&gt; Older Agno tutorials use &lt;code&gt;show_tool_calls=True&lt;/code&gt;. In v2.6.17, that raises &lt;code&gt;TypeError: Agent.__init__() got an unexpected keyword argument 'show_tool_calls'&lt;/code&gt;. Use &lt;code&gt;debug_mode=True&lt;/code&gt; instead if you want verbose output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trap #2: &lt;code&gt;gemini-2.0-flash&lt;/code&gt; is deprecated.&lt;/strong&gt; Using that model ID throws a 404:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;ERROR&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="err"&gt;Error&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Gemini&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;API:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;404&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;NOT_FOUND.&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;'error':&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;'message':&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;'This&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;model&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;models/gemini&lt;/span&gt;&lt;span class="mf"&gt;-2.0&lt;/span&gt;&lt;span class="err"&gt;-flash&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;is&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;no&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;longer&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;available.'&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use &lt;code&gt;gemini-2.5-flash&lt;/code&gt;. Always verify the current model IDs on Google's docs before hardcoding them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wikipedia Agent: Automatic Search Retry
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agno.tools.wikipedia&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;WikipediaTools&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Gemini&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;WikipediaTools&lt;/span&gt;&lt;span class="p"&gt;()],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;attention mechanism&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; in neural networks? 2 sentences only.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Execution log:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INFO Searching wikipedia for: attention mechanism neural networks
ERROR Error searching Wikipedia for 'attention mechanism neural networks':
      Page id "attention mechanism neural network" does not match any pages.
INFO Searching wikipedia for: attention (machine learning)
⏱ 9.98s

In machine learning, attention is a method that determines the importance
of each component in a sequence relative to the other components...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The interesting bit: the first search failed, and the agent automatically reformulated the query (&lt;code&gt;attention (machine learning)&lt;/code&gt;) and retried. No extra code required. Agno runs a ReAct loop internally — plan, act, observe, adjust. Tool failures are handled gracefully.&lt;/p&gt;

&lt;p&gt;Compared with the code-execution approach in Smolagents (covered in the &lt;a href="https://dev.to/en/blog/en/python-ai-agent-library-comparison-2026"&gt;Python AI agent library comparison post&lt;/a&gt;), Agno leans more toward tool composition than code generation. Neither is strictly better; it depends on what you're building.&lt;/p&gt;

&lt;h2&gt;
  
  
  Structured Output: Use &lt;code&gt;output_schema&lt;/code&gt;, Not &lt;code&gt;output_model&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;This is the most confusing naming in the API. There's a parameter called &lt;code&gt;output_model&lt;/code&gt;. Naturally you'd think: "Put the Pydantic model here." That's wrong.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This fails
&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Gemini&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;output_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;DeveloperProfile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# ← WRONG
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# ValueError: Model must be a Model instance, string, or None
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;output_model&lt;/code&gt; expects an LLM model instance (or string model ID). For Pydantic structured output, use &lt;code&gt;output_schema&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Skill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;year_since&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DeveloperProfile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;skills&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Skill&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Gemini&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;output_schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;DeveloperProfile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# ← CORRECT
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Create a developer profile for &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Kim Jangwook&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, a Korean developer &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;specializing in Claude Code, MCP, Python, TypeScript.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;   &lt;span class="c1"&gt;# &amp;lt;class '__main__.DeveloperProfile'&amp;gt;
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="c1"&gt;# Kim Jangwook
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Actual output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;⏱ 4.00s&lt;/span&gt;
&lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DeveloperProfile&lt;/span&gt;
&lt;span class="na"&gt;Name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Kim Jangwook&lt;/span&gt;
&lt;span class="na"&gt;Skills&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Claude Code&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Expert (since 2022)&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;MCP&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Certified (since 2019)&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Python&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Senior (since 2018)&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;TypeScript&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Intermediate (since 2020)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;response.content&lt;/code&gt; returns an actual Pydantic instance. Parsing is handled internally; you get full IDE autocomplete on the result. The 4-second latency (vs 9 seconds for the Calculator agent) reflects the absence of tool call round-trips.&lt;/p&gt;

&lt;p&gt;This is similar in spirit to &lt;a href="https://dev.to/en/blog/en/pydantic-ai-type-safe-agent-tutorial-2026"&gt;PydanticAI's &lt;code&gt;output_type&lt;/code&gt; parameter&lt;/a&gt;, but the naming diverges. When jumping between frameworks, you need to memorize each one's vocabulary — that's friction that accumulates.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-Agent Team: &lt;code&gt;members=&lt;/code&gt;, Not &lt;code&gt;agents=&lt;/code&gt;
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agno.team&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Team&lt;/span&gt;

&lt;span class="n"&gt;researcher&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Researcher&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Gemini&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;WikipediaTools&lt;/span&gt;&lt;span class="p"&gt;()],&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;researcher&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;calculator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Calculator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Gemini&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;CalculatorTools&lt;/span&gt;&lt;span class="p"&gt;()],&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;calculator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;team&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Team&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;members&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;researcher&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;calculator&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# ← NOT agents=
&lt;/span&gt;    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Gemini&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Research &amp;amp; Calc Team&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;coordinate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Trap #3: &lt;code&gt;agents=&lt;/code&gt; doesn't exist in Team.&lt;/strong&gt; Writing &lt;code&gt;Team(agents=[...])&lt;/code&gt; throws:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TypeError: Team.__init__() got an unexpected keyword argument 'agents'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The correct parameter is &lt;code&gt;members=[...]&lt;/code&gt;. You'd only know this from reading the source — the documentation still shows &lt;code&gt;agents=&lt;/code&gt; in some places.&lt;/p&gt;

&lt;p&gt;Running the team:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;team&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Find the year &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Attention is All You Need&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; was published on Wikipedia, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;then calculate how many years ago that was from 2026.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INFO Searching wikipedia for: Attention is All You Need
⏱ 13.83s

'Attention is All You Need' was published in 2017. 
As of 2026, that is 9 years ago.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The team leader (the Gemini model passed to &lt;code&gt;Team&lt;/code&gt;) analyzed the task, routed the Wikipedia lookup to the Researcher agent and the subtraction to the Calculator agent. Both returned correct results. The 14-second latency reflects sequential agent execution — &lt;code&gt;coordinate&lt;/code&gt; mode doesn't parallelize in v2.6.17.&lt;/p&gt;

&lt;h2&gt;
  
  
  100+ Built-in Tools
&lt;/h2&gt;

&lt;p&gt;One of Agno's practical advantages: the built-in tool library.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;agno.tools&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pkgutil&lt;/span&gt;
&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pkgutil&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;iter_modules&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__path__&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="c1"&gt;# Over 100 entries including: 'arxiv', 'bravesearch', 'calculator',
# 'docker', 'duckduckgo', 'email', 'github', 'gmail',
# 'google_bigquery', 'jira', 'mcp', 'mem0', 'notion',
# 'postgres', 'slack', 'sql', 'tavily', 'wikipedia',
# 'yfinance', 'youtube', 'zoom', ...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In practice: give Agno a &lt;code&gt;BRAVE_API_KEY&lt;/code&gt; and you have a web search agent running in under 5 minutes without writing any API wrapper code. Same story for Slack, Notion, GitHub, and Postgres integrations.&lt;/p&gt;

&lt;p&gt;The catch: not all tools are zero-install. Each tool module has its own dependency. &lt;code&gt;agno.tools.duckduckgo&lt;/code&gt; needs &lt;code&gt;ddgs&lt;/code&gt;, &lt;code&gt;agno.tools.wikipedia&lt;/code&gt; needs &lt;code&gt;wikipedia&lt;/code&gt;, and so on. If you import before installing, you get &lt;code&gt;ImportError&lt;/code&gt; at the import line for some tools and at first use for others. Inconsistent behavior across modules.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Think Are the Actual Limitations
&lt;/h2&gt;

&lt;p&gt;The latency is real. 9 seconds for a single Calculator call, 14 for a two-agent team — this is Gemini API round-trip cost compounded by tool calls, not an Agno inefficiency per se. But it matters for production APIs where users expect sub-second responses.&lt;/p&gt;

&lt;p&gt;Debugging is manual. &lt;code&gt;debug_mode=True&lt;/code&gt; spits out unstructured logs. I haven't found an official integration guide for LangSmith or LangFuse. If observability matters, you'll need to wire it up yourself.&lt;/p&gt;

&lt;p&gt;The docs lag behind the API. &lt;code&gt;show_tool_calls&lt;/code&gt;, &lt;code&gt;output_model&lt;/code&gt;, &lt;code&gt;agents=&lt;/code&gt; — these are examples of parameters whose behavior in the docs doesn't match the current codebase. Always check the GitHub &lt;code&gt;examples/&lt;/code&gt; directory for the latest version, not the tutorial blog posts.&lt;/p&gt;

&lt;p&gt;The Team's &lt;code&gt;coordinate&lt;/code&gt; mode is sequential. If you need parallel agent execution or complex conditional branching across many agents, &lt;a href="https://dev.to/en/blog/en/google-adk-vs-langgraph-agent-framework-comparison-2026"&gt;Google ADK or LangGraph&lt;/a&gt; are better fits.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Agno Makes Sense
&lt;/h2&gt;

&lt;p&gt;Three scenarios where I'd reach for Agno:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rapid prototyping.&lt;/strong&gt; An API key plus 10 lines of Python and you have a working agent. Great for PoCs, internal tools, solo projects where speed to first working version matters more than architectural elegance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-tool agents.&lt;/strong&gt; When you need an agent that touches Slack, reads from Postgres, sends an email, and searches the web — Agno's tool library means you spend time on the agent logic, not on writing API wrappers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Small agent teams (2-4 agents).&lt;/strong&gt; Agno's &lt;code&gt;Team&lt;/code&gt; class handles small coordination problems cleanly. Once you get into dozens of agents with complex dependency graphs, a state-graph framework gives you more explicit control.&lt;/p&gt;

&lt;p&gt;Where I wouldn't use Agno: real-time streaming UIs, production workflows requiring precise error handling and retry guarantees, or systems where you need to audit exactly what each agent decided and why at every step.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next to Explore
&lt;/h2&gt;

&lt;p&gt;Two things I didn't test today:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent memory.&lt;/strong&gt; Agno has &lt;code&gt;enable_agentic_memory=True&lt;/code&gt; with SQLite-backed storage. Cross-session memory persistence is the piece that would make agents feel genuinely stateful rather than starting fresh each time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP tool integration.&lt;/strong&gt; &lt;code&gt;agno.tools.mcp&lt;/code&gt; exists. If Agno agents can connect to MCP servers as tool sources, that means reusing existing MCP server infrastructure without rewriting anything. Worth testing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Agno v2.6.17 installs cleanly via &lt;code&gt;pip install agno&lt;/code&gt;; Calculator, Wikipedia, Team all ran successfully&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;gemini-2.5-flash&lt;/code&gt; as the model ID — &lt;code&gt;gemini-2.0-flash&lt;/code&gt; is deprecated and returns 404&lt;/li&gt;
&lt;li&gt;Structured output uses &lt;code&gt;output_schema=YourPydanticModel&lt;/code&gt;, not &lt;code&gt;output_model&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Team&lt;/code&gt; takes &lt;code&gt;members=[...]&lt;/code&gt;, not &lt;code&gt;agents=[...]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;100+ built-in tools, each requiring its own dependency install&lt;/li&gt;
&lt;li&gt;Strong for prototyping and multi-tool agents; reach for LangGraph for complex state machines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you've been frustrated by LangChain's weight and want a Python agent framework that gets out of your way, Agno is worth an afternoon of experimentation.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Ollama Structured Outputs in Practice — Getting Type-Safe JSON from Local LLMs with Pydantic</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Wed, 17 Jun 2026 06:38:48 +0000</pubDate>
      <link>https://dev.to/jangwook_kim_e31e7291ad98/ollama-structured-outputs-in-practice-getting-type-safe-json-from-local-llms-with-pydantic-m38</link>
      <guid>https://dev.to/jangwook_kim_e31e7291ad98/ollama-structured-outputs-in-practice-getting-type-safe-json-from-local-llms-with-pydantic-m38</guid>
      <description>&lt;p&gt;&lt;code&gt;json.loads(response)&lt;/code&gt; fails at a certain point. You told the model "return JSON only," but it added a&lt;br&gt;
&lt;br&gt;
 ```json markdown code fence around everything. A quick regex strips it — until that regex hits an edge case, and that edge case blows up in production.&lt;/p&gt;

&lt;p&gt;Since Ollama 0.3.0, passing a JSON schema to the &lt;code&gt;format&lt;/code&gt; parameter eliminates this problem at the root. The model's inference itself is constrained by the schema, so no code fences, no explanatory text, no mid-thought artifacts. Just parseable JSON.&lt;/p&gt;

&lt;p&gt;I ran these tests locally with Gemma4 and Ollama 0.30.7 to see how well it holds up in practice.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why LLM Response Parsing Is Tricky
&lt;/h2&gt;

&lt;p&gt;The most common problem when running Ollama locally — without a cloud LLM API — is JSON parsing. Two reasons.&lt;/p&gt;

&lt;p&gt;First, text generation models are trained toward "natural text." Even if you ask for JSON only, they'll often wrap it in &lt;code&gt;&lt;/code&gt;&lt;code&gt;&lt;br&gt;
&lt;br&gt;
json ...&lt;br&gt;
&lt;br&gt;
&lt;/code&gt;&lt;code&gt;&lt;/code&gt; blocks or prepend "Of course! Here is the JSON you requested:" style text. Here's what I reproduced directly:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
json
Input: 'Give me 3 Python tips as JSON with keys: tips (array), difficulty (1-5)'
Model output (no format parameter):


```json
{
  "tips": [
    "Master the fundamentals first...",
    ...
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;JSON parse: FAILED&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;Python&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s `json.loads()` can&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="n"&gt;handle&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;markdown&lt;/span&gt; &lt;span class="n"&gt;wrapper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="n"&gt;The&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JSON only&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="n"&gt;instruction&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="n"&gt;unreliable&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;production&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;

&lt;span class="n"&gt;Second&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;speed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="n"&gt;I&lt;/span&gt; &lt;span class="n"&gt;measured&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;same&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="n"&gt;both&lt;/span&gt; &lt;span class="n"&gt;ways&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt; &lt;span class="n"&gt;seconds&lt;/span&gt; &lt;span class="n"&gt;without&lt;/span&gt; &lt;span class="n"&gt;structured&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="n"&gt;seconds&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="n"&gt;More&lt;/span&gt; &lt;span class="n"&gt;on&lt;/span&gt; &lt;span class="n"&gt;why&lt;/span&gt; &lt;span class="n"&gt;below&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;

&lt;span class="c1"&gt;## How the Ollama format Parameter Works
&lt;/span&gt;
&lt;span class="n"&gt;Ollama&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s `/api/generate` endpoint has a `format` field. Pass a JSON schema object and Ollama applies **constrained decoding** during inference.

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
python&lt;br&gt;
import json&lt;br&gt;
import urllib.request&lt;/p&gt;

&lt;p&gt;def ollama_structured(prompt, schema, model="gemma4:e4b"):&lt;br&gt;
    payload = {&lt;br&gt;
        "model": model,&lt;br&gt;
        "prompt": prompt,&lt;br&gt;
        "format": schema,     # ← pass JSON schema object directly&lt;br&gt;
        "stream": False,&lt;br&gt;
        "options": {"temperature": 0}&lt;br&gt;
    }&lt;br&gt;
    data = json.dumps(payload).encode()&lt;br&gt;
    req = urllib.request.Request(&lt;br&gt;
        "&lt;a href="http://localhost:11434/api/generate" rel="noopener noreferrer"&gt;http://localhost:11434/api/generate&lt;/a&gt;",&lt;br&gt;
        data=data,&lt;br&gt;
        headers={"Content-Type": "application/json"}&lt;br&gt;
    )&lt;br&gt;
    with urllib.request.urlopen(req, timeout=60) as resp:&lt;br&gt;
        result = json.loads(resp.read())&lt;br&gt;
    return result["response"]&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Constrained decoding sets the probability of any token that would violate the schema to zero at each generation step. So even if the model "wants" to generate a markdown fence, the schema makes it physically impossible. That's also where the speed gain comes from — the model doesn't waste tokens on formatting decisions.

Here are the measured numbers:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
bash&lt;/p&gt;
&lt;h1&gt;
  
  
  Direct measurement (Ollama 0.30.7, Gemma4:e4b, MacBook)
&lt;/h1&gt;
&lt;h1&gt;
  
  
  Same prompt, with and without format
&lt;/h1&gt;

&lt;p&gt;Without structured output:&lt;br&gt;
  Raw (first 200 chars):&lt;br&gt;
&lt;br&gt;
 ```json\n{\n  "tips": ["Master the fundamentals first...&lt;br&gt;
  Time: 31.84s&lt;br&gt;
  JSON parse: FAILED (markdown wrapper)&lt;/p&gt;

&lt;p&gt;With structured output:&lt;br&gt;
  Structured: {"tips": ["Understand the concept of indentation...", ...], "difficulty": 2}&lt;br&gt;
  Time: 4.99s&lt;br&gt;
  JSON parse: SUCCESS&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;


6.4x difference. Local LLMs are already slow, and adding unreliable parsing on top makes the whole pipeline feel worse.

## Wiring Pydantic Models

Writing JSON schema objects by hand is tedious. With Pydantic models, `model_json_schema()` generates the schema automatically.



```python
from pydantic import BaseModel
from typing import List, Dict, Any, Literal

class CodeReview(BaseModel):
    severity: str  # "critical", "warning", "info"
    file: str
    line: int
    message: str
    suggestion: str

class ReviewResult(BaseModel):
    total_issues: int
    critical_count: int
    reviews: List[CodeReview]

# Pydantic → JSON schema, automatically
schema = ReviewResult.model_json_schema()

raw = ollama_structured(prompt, schema)

# Parses and validates in one step
result = ReviewResult.model_validate_json(raw)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;code&gt;model_validate_json&lt;/code&gt; parses the JSON string and runs Pydantic validation simultaneously. If &lt;code&gt;severity&lt;/code&gt; gets an integer or &lt;code&gt;line&lt;/code&gt; gets a string, it throws &lt;code&gt;ValidationError&lt;/code&gt;. Catching that and retrying with a modified prompt is the common pattern in real agents.&lt;/p&gt;

&lt;p&gt;Actual output from the code review test:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;===&lt;/span&gt; Code Review Output &lt;span class="o"&gt;===&lt;/span&gt;
Total issues: 3
Critical: 2
  &lt;span class="o"&gt;[&lt;/span&gt;CRITICAL] process_user_data:2 - SQL Injection Vulnerability &lt;span class="o"&gt;(&lt;/span&gt;Direct String Formatting&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;[&lt;/span&gt;CRITICAL] process_user_data:3 - Storing Passwords &lt;span class="k"&gt;in &lt;/span&gt;Plain Text &lt;span class="o"&gt;(&lt;/span&gt;Data Leakage&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;[&lt;/span&gt;HIGH] process_user_data:4 - Potential Unused/Incomplete Database Interaction
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;total_issues: 3&lt;/code&gt; and &lt;code&gt;critical_count: 2&lt;/code&gt; come in as integers. &lt;code&gt;if result.critical_count &amp;gt; 0&lt;/code&gt; branches safely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Pattern: Agent Tool Dispatch
&lt;/h2&gt;

&lt;p&gt;The strongest use case for structured output is an agent deciding which tool to call next. You pass the tool list and current situation, and get back a type-safe tool call selection.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ToolCall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;web_search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;read_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;write_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;execute_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;reasoning&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;

&lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ToolCall&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_json_schema&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;user_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Find the current Bitcoin price and save it to btc_price.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are an AI agent. Decide which tool to call next.
Available tools: web_search, read_file, write_file, execute_code
Task: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_task&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
Choose ONE tool call.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;ollama_structured&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tool_call&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ToolCall&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_validate_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tool: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Params: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;===&lt;/span&gt; Agent tool dispatch &lt;span class="o"&gt;===&lt;/span&gt;
Tool: web_search
Params: &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'query'&lt;/span&gt;: &lt;span class="s1"&gt;'current Bitcoin price'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
Reasoning: The task requires finding real-time information...

Dispatch: OK &lt;span class="o"&gt;(&lt;/span&gt;type-safe&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because &lt;code&gt;tool_name&lt;/code&gt; is typed as &lt;code&gt;Literal["web_search", "read_file", ...]&lt;/code&gt;, &lt;code&gt;tool_call.tool_name&lt;/code&gt; is always one of those four values. If the model invents a nonexistent tool name, Pydantic throws &lt;code&gt;ValidationError&lt;/code&gt;. The &lt;code&gt;if tool_call.tool_name == "web_search"&lt;/code&gt; branch is safe to write.&lt;/p&gt;

&lt;p&gt;This is architecturally the same as function calling in cloud APIs. Comparing it with &lt;a href="https://dev.to/en/blog/en/claude-agent-sdk-tool-use-complete-guide-2026"&gt;Claude Agent SDK's Tool Use patterns&lt;/a&gt; shows an interesting design difference: cloud LLMs handle tool selection natively at the model level, while local Ollama needs an explicit JSON schema + Pydantic validation layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Gemma4 and Schema Complexity: Limitations I Found
&lt;/h2&gt;

&lt;p&gt;Honestly, it doesn't work perfectly in every case. Testing with Gemma4:e4b (4-bit quantized, 4B parameters), I found a few real constraints.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deeply nested schemas.&lt;/strong&gt; JSON schemas nested 3+ levels deep (&lt;code&gt;List[Dict[str, List[BaseModel]]]&lt;/code&gt;) sometimes return empty arrays at intermediate levels. The 12B model (&lt;code&gt;gemma4:12b-it-qat&lt;/code&gt;) reduces this, but doesn't eliminate it. This is a fundamental limitation of the model's context handling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Optional field handling.&lt;/strong&gt; Fields declared as &lt;code&gt;Optional[str]&lt;/code&gt; sometimes get filled with empty string &lt;code&gt;""&lt;/code&gt; instead of &lt;code&gt;null&lt;/code&gt;. Pydantic validation passes, but semantics differ. You need &lt;code&gt;@validator&lt;/code&gt; post-processing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Schema size.&lt;/strong&gt; A large Pydantic model's JSON schema can reach hundreds of tokens. That occupies context window space, reducing the room available for the actual prompt. Complex schemas need stronger models.&lt;/p&gt;

&lt;p&gt;Once you've deployed Ollama as an API server (covered in the &lt;a href="https://dev.to/en/blog/en/ollama-fastapi-production-deployment-guide-2026"&gt;Ollama FastAPI production guide&lt;/a&gt;), switching models at runtime based on schema complexity becomes a viable optimization.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pattern Reference: When to Use What
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Situation&lt;/th&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Simple data extraction (1-2 levels)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;format&lt;/code&gt; + &lt;code&gt;json.loads()&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Fast, no overhead&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Type validation needed&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;format&lt;/code&gt; + Pydantic&lt;/td&gt;
&lt;td&gt;ValidationError catches issues early&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agent tool selection&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;format&lt;/code&gt; + Pydantic &lt;code&gt;Literal&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Blocks invalid tool names&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complex nested schema&lt;/td&gt;
&lt;td&gt;Consider larger model&lt;/td&gt;
&lt;td&gt;Small local model limitations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Simple text response&lt;/td&gt;
&lt;td&gt;No &lt;code&gt;format&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Avoid unnecessary constrained decoding overhead&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I think of this as a switch that moves JSON parse reliability from "unreliable" to "near 100%." There was a time I was appending "JSON only please" to every prompt and hoping for the best. Measuring the actual difference made clear how fragile that approach was.&lt;/p&gt;

&lt;h2&gt;
  
  
  Copy-Paste Starter Code
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;urllib.request&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ollama_structured&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_cls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; 
                      &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4:e4b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Helper that combines Ollama structured output + Pydantic validation.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model_cls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_json_schema&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;options&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434/api/generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;urlopen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;model_cls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_validate_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;


&lt;span class="c1"&gt;# Usage example
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SentimentAnalysis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;sentiment&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;       &lt;span class="c1"&gt;# "positive", "negative", "neutral"
&lt;/span&gt;    &lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;    &lt;span class="c1"&gt;# 0.0 ~ 1.0
&lt;/span&gt;    &lt;span class="n"&gt;key_phrases&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;ollama_structured&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Analyze sentiment: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;This new MacBook is amazing but too expensive&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;SentimentAnalysis&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sentiment: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sentiment&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Key phrases: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;key_phrases&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What to Try Next
&lt;/h2&gt;

&lt;p&gt;This only covers the simplest cases. A real agent needs a bit more.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retry logic.&lt;/strong&gt; When Pydantic &lt;code&gt;ValidationError&lt;/code&gt; fires, retry with a slightly modified prompt — ideally including the error message. Models often self-correct when they can see why they were wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Streaming.&lt;/strong&gt; With &lt;code&gt;stream: true&lt;/code&gt;, you can receive the JSON incrementally as it generates. Pair with a streaming JSON parser like &lt;code&gt;ijson&lt;/code&gt; for memory-efficient handling of large responses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model switching.&lt;/strong&gt; Route simple extractions to &lt;code&gt;gemma4:e4b&lt;/code&gt; (fast) and complex nested schemas to &lt;code&gt;gemma4:12b-it-qat&lt;/code&gt; (accurate) at runtime. &lt;a href="https://dev.to/en/blog/en/pydantic-ai-type-safe-agent-tutorial-2026"&gt;Structuring an entire agent with Pydantic AI&lt;/a&gt; shows how to abstract this decision to the framework level.&lt;/p&gt;

&lt;p&gt;If you're already running a Gemma4-based agent locally, adding the &lt;code&gt;format&lt;/code&gt; parameter today is a one-line change with a measurable reliability improvement. Especially anywhere in the agent loop where an invalid response immediately causes a downstream error.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Testing RAG Embeddings Hands-On with sentence-transformers — Why Korean Queries Drop Accuracy by 67%</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Tue, 16 Jun 2026 06:41:12 +0000</pubDate>
      <link>https://dev.to/jangwook_kim_e31e7291ad98/testing-rag-embeddings-hands-on-with-sentence-transformers-why-korean-queries-drop-accuracy-by-67-16io</link>
      <guid>https://dev.to/jangwook_kim_e31e7291ad98/testing-rag-embeddings-hands-on-with-sentence-transformers-why-korean-queries-drop-accuracy-by-67-16io</guid>
      <description>&lt;p&gt;When I first learned about RAG, embeddings were an abstraction I accepted without questioning. "Sentences get converted to vectors," "similar meanings end up close together in vector space" — all true, but none of it clicked until I actually measured the numbers. So I installed &lt;code&gt;sentence-transformers&lt;/code&gt; locally, measured cosine similarities, ran a mini retrieval simulation, and checked what happens with Korean queries specifically.&lt;/p&gt;

&lt;p&gt;The short answer: &lt;strong&gt;building a Korean RAG system with an English-optimized embedding model dropped accuracy by 67% in my tests.&lt;/strong&gt; Two out of three queries returned the wrong document at rank 1. This post documents that experiment and how switching to a multilingual model fixed it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it actually means to run an embedding model locally with pip
&lt;/h2&gt;

&lt;p&gt;I didn't believe it was possible without a cloud API at first. OpenAI and Gemini embeddings require API keys and round-trips to external servers. But sentence-transformers just downloads model weights locally.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;sentence-transformers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentence_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SentenceTransformer&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SentenceTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;all-MiniLM-L6-v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;building an AI agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# (384,)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; downloads from Hugging Face Hub on first run. Load time on my machine: &lt;strong&gt;9.52 seconds&lt;/strong&gt;. After that, it's cached and loads in under 1 second. Model size is around 22MB — genuinely lightweight.&lt;/p&gt;

&lt;p&gt;The vector internals are worth inspecting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Type: numpy.ndarray
Shape: (384,)
dtype: float32
L2 norm: 1.000000
Min: -0.147123
Max: 0.183166
First 5 dims: [-0.0216, 0.0593, -0.0049, -0.0172, 0.0079]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The L2 norm being exactly 1.0 is interesting. sentence-transformers normalizes embeddings by default, which means cosine similarity and dot product return identical results. This is a nice property — you don't need to remember to normalize before computing similarity.&lt;/p&gt;

&lt;p&gt;384 dimensions is a deliberate tradeoff. Modern large embedding models use 1536〜3072 dimensions, which is more expressive but costs more to store and search. For most semantic search use cases, 384 is plenty.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cosine similarity: what the numbers actually look like
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentence_transformers.util&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cos_sim&lt;/span&gt;

&lt;span class="n"&gt;pairs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AI agent uses tools to complete tasks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;An autonomous agent invokes functions to achieve goals&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AI agent uses tools to complete tasks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The cat sat on the warm mat by the window&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How do I install Python packages?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the command to add a Python library?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How do I install Python packages?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The stock market closed higher today&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Measured similarity scores:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0.6489 ████████████  AI agent vs autonomous agent (same meaning, different words)
-0.0112            AI agent vs cat on mat (unrelated)
0.6248 ████████████  install packages vs add library (same meaning)
-0.0163            install packages vs stock market (unrelated)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Semantically similar pairs land at 0.62〜0.65. Unrelated pairs are near zero or slightly negative. The negative values surprised me at first — cosine similarity ranges from -1 to 1, so truly unrelated sentences naturally cluster around 0, sometimes dipping slightly negative.&lt;/p&gt;

&lt;p&gt;For context on what "high" looks like in practice: the &lt;a href="https://dev.to/en/blog/en/vector-db-comparison-2026-qdrant-chroma-pgvector"&gt;vector DB benchmark post&lt;/a&gt; notes that RAG systems typically use 0.3〜0.5 as a retrieval threshold. So 0.65 is a strong semantic match.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mini RAG simulation — two out of three queries failed
&lt;/h2&gt;

&lt;p&gt;I ran a retrieval simulation with 10 Korean blog post titles as the knowledge base and 3 Korean queries. Model: English-optimized &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Knowledge base included titles like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Reduce LLM costs 70% with Claude API Prompt Caching"&lt;/li&gt;
&lt;li&gt;"MCP vs A2A vs Open Responses — agent protocol comparison"&lt;/li&gt;
&lt;li&gt;"Use Node.js built-in SQLite without external packages"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Queries:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;"How do I reduce LLM API costs?"&lt;/li&gt;
&lt;li&gt;"Compare agent-to-agent communication protocols"&lt;/li&gt;
&lt;li&gt;"Lightweight database without Python"&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Query: "How do I reduce LLM API costs?"
  #1 [0.5649] Anthropic Message Batches API for bulk processing  ← expected: Prompt Caching
  #2 [0.4534] Claude API Prompt Caching — 70% cost reduction

Query: "Compare agent-to-agent communication protocols"
  #1 [0.5605] AI Agent Observability Guide  ← completely wrong topic
  (MCP vs A2A vs Open Responses ranked outside top 3)

Query: "Lightweight database without Python"
  #1 [0.5172] LangGraph multi-agent state management  ← irrelevant
  (Node.js SQLite not in top 3)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One query found the right answer at rank 2. The other two were completely off. The "agent protocol comparison" failure is particularly bad — there's a document with "protocol comparison" literally in its title, and the model ranked an observability guide above it. This is what an English model mishandling Korean semantics looks like.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why English models fail on Korean and how multilingual models fix it
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; comes from the &lt;a href="https://arxiv.org/abs/1908.10084" rel="noopener noreferrer"&gt;SBERT paper&lt;/a&gt; and was trained primarily on English sentence pairs. It can process Korean text but doesn't represent Korean semantics with the same fidelity as English.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;paraphrase-multilingual-MiniLM-L12-v2&lt;/code&gt; was trained on parallel corpora across 50+ languages. It's explicitly designed to place the same meaning in different languages close together in vector space.&lt;/p&gt;

&lt;p&gt;Same 3 queries, both models:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Query&lt;/th&gt;
&lt;th&gt;English model #1 (correct?)&lt;/th&gt;
&lt;th&gt;Multilingual #1 (correct?)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Reduce LLM API costs&lt;/td&gt;
&lt;td&gt;Message Batches [0.453] ✗&lt;/td&gt;
&lt;td&gt;Prompt Caching [0.720] ✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agent protocol comparison&lt;/td&gt;
&lt;td&gt;Observability [0.561] ✗&lt;/td&gt;
&lt;td&gt;MCP vs A2A [0.647] ✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lightweight DB without Python&lt;/td&gt;
&lt;td&gt;LangGraph [0.461] ✗&lt;/td&gt;
&lt;td&gt;Node.js SQLite [0.439] ✓&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;English model: 1/3 (33%). Multilingual: 3/3 (100%).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The similarity scores are also telling. The multilingual model assigned 0.720 to the correct cost-reduction document; the English model only managed 0.453 for the same pair. Stronger semantic connection, better ranked.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://dev.to/en/blog/en/dena-llm-study-part4-rag"&gt;RAG architecture post&lt;/a&gt; argues that retrieval quality determines generation quality. This experiment shows that principle applies at the embedding model selection step — before you've written a single line of retrieval code.&lt;/p&gt;

&lt;p&gt;My conclusion: &lt;strong&gt;for Korean or multilingual RAG pipelines, start with a multilingual model.&lt;/strong&gt; Switching later means re-embedding your entire document collection, which is a meaningful operational cost for large corpora.&lt;/p&gt;

&lt;h2&gt;
  
  
  Batch encoding: the 2.4x throughput gap
&lt;/h2&gt;

&lt;p&gt;Sequential encoding is the obvious first implementation. Batch encoding is faster for reasons that have to do with CPU/GPU parallelism and memory bandwidth — but how much faster exactly?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sentences&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;This is test sentence number &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

&lt;span class="c1"&gt;# Sequential: 1.075 seconds
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sentences&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Batch: 0.455 seconds
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentences&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;show_progress_bar&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Sequential (100 sentences): 1.075s
Batch (100 sentences):      0.455s → 2.4x faster
Throughput (batch):         220 sentences/sec
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;220 sentences/second on CPU. For 10,000 documents at 5 sentences each, that's 50,000 sentences — about 227 seconds, under 4 minutes. With GPU (CUDA or Apple Silicon MPS), throughput reportedly scales 10〜50x higher, though I didn't test that directly.&lt;/p&gt;

&lt;p&gt;The practical implication: use &lt;code&gt;model.encode(batch, batch_size=N)&lt;/code&gt; instead of calling encode in a loop. This is especially true for initial indexing jobs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Putting this into an actual RAG pipeline
&lt;/h2&gt;

&lt;p&gt;Today's experiment covered the R in RAG. A complete pipeline looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Step 1: Index documents (offline, run once)
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentence_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SentenceTransformer&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SentenceTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;paraphrase-multilingual-MiniLM-L12-v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_your_documents&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;doc_embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Store in a vector DB (Chroma, Qdrant, pgvector)
&lt;/span&gt;
&lt;span class="c1"&gt;# Step 2: Retrieve at query time (online)
&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Fetch top-k by cosine similarity from vector DB
&lt;/span&gt;
&lt;span class="c1"&gt;# Step 3: Generate with context (online)
&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retrieved_docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Pass context to Claude or another LLM
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One practical detail: store the model name as metadata alongside your embeddings. If you ever change models, you need to re-embed everything. Without metadata tracking, you risk mixing vectors from different embedding spaces in the same database — a subtle bug that produces silently degraded retrieval.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I still don't know and what's next
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Limitation 1&lt;/strong&gt;: My knowledge base had only 10 documents. Real retrieval performance over thousands of documents could look quite different, and I don't have that data yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limitation 2&lt;/strong&gt;: 3/3 accuracy for the multilingual model is based on 3 test cases. Not statistically meaningful. I'd need hundreds of test pairs to make a confident claim.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limitation 3&lt;/strong&gt;: Embeddings have a known weakness: they struggle when specific keywords matter more than semantic similarity. "Without Python" as a constraint is the kind of thing BM25 handles better than dense retrieval. Hybrid search (BM25 + embeddings via reciprocal rank fusion) is the standard solution here, but I haven't implemented it yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's next&lt;/strong&gt;: testing &lt;code&gt;multilingual-e5-large&lt;/code&gt; on the same benchmark, wiring up a persistent ChromaDB store, and measuring hybrid search accuracy against pure vector retrieval.&lt;/p&gt;

&lt;p&gt;I went into this experiment thinking embedding models were interchangeable plug-ins. The 67% accuracy gap changed that view. For non-English RAG, model selection is a first-order decision that affects the entire pipeline downstream.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Mastra.ai Practical Guide — Running a TypeScript AI Agent in 5 Minutes</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Sun, 14 Jun 2026 06:42:49 +0000</pubDate>
      <link>https://dev.to/jangwook_kim_e31e7291ad98/mastraai-practical-guide-running-a-typescript-ai-agent-in-5-minutes-2c42</link>
      <guid>https://dev.to/jangwook_kim_e31e7291ad98/mastraai-practical-guide-running-a-typescript-ai-agent-in-5-minutes-2c42</guid>
      <description>&lt;p&gt;"If you're a TypeScript developer building AI agents, you're stuck with LangChain.js or Vercel AI SDK." I've heard that line a lot. It felt accurate enough that I never seriously questioned it — until Mastra.ai popped up on my radar.&lt;/p&gt;

&lt;p&gt;Mastra hit v1.0 in January 2026 after graduating from YC W25 with $13M in funding. By then it had already passed 22,000 GitHub stars and 300,000 weekly npm downloads. The Gatsby.js team built it, which means it comes from people who understand what "framework" should actually mean for JavaScript developers.&lt;/p&gt;

&lt;p&gt;I had heard the name. Today I finally installed it, wired it to Google Gemini, and ran an actual agent with tool calls. This is the record of what worked, what didn't, and whether the "5-minute agent" claim holds up.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Mastra.ai Actually Is
&lt;/h2&gt;

&lt;p&gt;Mastra is a TypeScript-first framework that bundles agents, workflows, memory, and observability into a single SDK. The pitch is simple: stop gluing packages together. Define your agent, its tools, and its memory in one place, and let the framework handle the plumbing.&lt;/p&gt;

&lt;p&gt;It supports any LLM provider that Vercel AI SDK covers — OpenAI, Anthropic, Google Gemini, Meta Llama, and more. Mastra uses Vercel AI SDK as its underlying layer, so if you've built &lt;a href="https://dev.to/en/blog/en/vercel-ai-sdk-claude-streaming-agent-2026"&gt;a streaming Claude agent with Vercel AI SDK&lt;/a&gt; before, you're already familiar with the foundation Mastra sits on.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Now?
&lt;/h3&gt;

&lt;p&gt;The Python agent ecosystem got mature fast. LangGraph, CrewAI, PydanticAI — these tools accumulated years of production use, community plugins, and battle-tested patterns. The TypeScript side lagged. Mastra is an attempt to close that gap, and v1.42 (the version I installed today) suggests it's serious about catching up.&lt;/p&gt;

&lt;p&gt;I expected to be disappointed. I wasn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installation: One Command
&lt;/h2&gt;

&lt;p&gt;I followed the official quickstart. Node.js v22.22.0 on my machine.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm create mastra@latest mastra-lab &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="nt"&gt;--components&lt;/span&gt; agents,tools &lt;span class="nt"&gt;--llm&lt;/span&gt; google &lt;span class="nt"&gt;--example&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The flags: &lt;code&gt;--components agents,tools&lt;/code&gt; pulls in the agent and tool scaffolding, &lt;code&gt;--llm google&lt;/code&gt; sets Google Gemini as the provider, &lt;code&gt;--example&lt;/code&gt; generates a working weather agent template.&lt;/p&gt;

&lt;p&gt;Setup takes 2〜3 minutes. The CLI walks through each step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;◇  Project structure created
◇  npm dependencies installed
◇  Mastra CLI installed
◇  Mastra dependencies installed
◇  .gitignore added
└  Project created successfully

◇  Mastra initialized successfully!

   Rename .env.example to .env
   and add your GOOGLE_API_KEY
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Core dependencies in the generated &lt;code&gt;package.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"dependencies"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"@mastra/core"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"^1.42.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"@mastra/memory"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"^1.20.3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"@mastra/libsql"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"^1.13.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"@mastra/observability"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"^1.14.1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"zod"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"^4.4.3"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"devDependencies"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"typescript"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"^6.0.3"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;TypeScript 6.0.3 and Zod v4. Both bumped major versions in early 2026. The fact that Mastra ships with current versions of both is a good signal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mastra-lab/
├── src/
│   └── mastra/
│       ├── index.ts          # Mastra instance setup
│       ├── agents/
│       │   └── weather-agent.ts  # Agent definition
│       └── tools/
│           └── weather-tool.ts   # Tool definition
├── .env.example
├── package.json
└── tsconfig.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The separation is clean. &lt;code&gt;agents/&lt;/code&gt; holds agent definitions. &lt;code&gt;tools/&lt;/code&gt; handles external API interfaces. &lt;code&gt;index.ts&lt;/code&gt; wires everything together into a Mastra instance.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Code: Tools and Agents
&lt;/h2&gt;

&lt;p&gt;Looking at the generated code reveals Mastra's design priorities.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tool Definition
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/mastra/tools/weather-tool.ts&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;createTool&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@mastra/core/tools&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;zod&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;weatherTool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;createTool&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;get-weather&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Get current weather for a location&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;inputSchema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;location&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;City name&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="na"&gt;outputSchema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;number&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;feelsLike&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;number&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;humidity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;number&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;windSpeed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;number&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;conditions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;location&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
  &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="na"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;inputData&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;getWeather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;inputData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;location&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Zod schemas for I/O typing. If you've seen &lt;a href="https://dev.to/en/blog/en/pydantic-ai-type-safe-agent-tutorial-2026"&gt;PydanticAI's type-safe agent approach&lt;/a&gt; in Python, this is structurally similar — "type definition as documentation and validation logic" across different languages.&lt;/p&gt;

&lt;p&gt;The weather tool uses Open-Meteo, which is free and needs no API key. It geocodes city names to coordinates, then fetches current weather data. Clean separation of concerns.&lt;/p&gt;

&lt;h3&gt;
  
  
  Agent Definition
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/mastra/agents/weather-agent.ts&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Agent&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@mastra/core/agent&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Memory&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@mastra/memory&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;weatherTool&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;../tools/weather-tool&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;weatherAgent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;weather-agent&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Weather Agent&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;instructions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`You are a helpful weather assistant...`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;google/gemini-2.5-pro&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;weatherTool&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Memory&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model string &lt;code&gt;'google/gemini-2.5-pro'&lt;/code&gt; is all Mastra needs to wire up the right provider. &lt;code&gt;@ai-sdk/google&lt;/code&gt; handles the actual API calls underneath.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running It: Seoul vs Tokyo Weather
&lt;/h2&gt;

&lt;p&gt;I ran the agent directly without going through the full Mastra server setup. Memory was excluded initially — more on why below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Agent&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@mastra/core/agent&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;weather-agent&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Weather Agent&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;instructions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Provide concise weather information.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;google/gemini-2.5-flash&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;weatherTool&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Compare the current weather in Seoul and Tokyo. Which city is hotter right now?&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output (2026-06-14, response time: 5,866ms):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;It is 27.3°C and feels like 30.1°C in Seoul with mainly clear conditions.
In Tokyo, it is 25.6°C and feels like 27.1°C with overcast conditions.
Seoul is hotter right now.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent called &lt;code&gt;get-weather&lt;/code&gt; twice — once for Seoul, once for Tokyo — synthesized the results, and answered the comparison question. Two external API calls plus LLM reasoning in under six seconds. Seoul was indeed hotter.&lt;/p&gt;

&lt;p&gt;Location resolution worked correctly too. "Seoul" mapped to coordinates 37.566, 126.978 without any prompting.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Error I Hit: Memory Needs Storage
&lt;/h3&gt;

&lt;p&gt;My first attempt included &lt;code&gt;memory: new Memory()&lt;/code&gt; in the agent definition. This errored immediately:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MastraError: Memory requires a storage provider to function.
Add a storage configuration to Memory or to your Mastra instance.
https://mastra.ai/en/docs/memory/overview
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The official example includes Memory, but Memory needs a storage backend — LibSQL or DuckDB. The generated &lt;code&gt;index.ts&lt;/code&gt; has this configured, but when running an agent standalone, that configuration is missing.&lt;/p&gt;

&lt;p&gt;This is a real friction point for new users. The error message points to docs, but someone just trying to run the quickstart example will hit this and wonder what went wrong. Better scaffolding here would help.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture: Four Layers
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/..%2F..%2F..%2Fassets%2Fblog%2Fmastra-ai-architecture-diagram.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/..%2F..%2F..%2Fassets%2Fblog%2Fmastra-ai-architecture-diagram.png" alt="Mastra Architecture Diagram"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Mastra organizes around four layers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Agent Layer&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Calls the LLM and decides whether to invoke tools. A single &lt;code&gt;generate()&lt;/code&gt; call can involve multiple internal LLM↔tool round trips.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Tools/Integrations&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Zod-typed interfaces to external APIs, databases, or any other service. The LLM fills in arguments according to the schema.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Memory&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Conversation history, semantic search, and working memory. Requires LibSQL or PostgreSQL as backing store.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Observability&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
OpenTelemetry-based tracing, spans, and logging for every agent execution. &lt;code&gt;MastraStorageExporter&lt;/code&gt; persists locally; &lt;code&gt;MastraPlatformExporter&lt;/code&gt; sends to Mastra's cloud platform.&lt;/p&gt;
&lt;h2&gt;
  
  
  Mastra Studio
&lt;/h2&gt;

&lt;p&gt;Running &lt;code&gt;npm run dev&lt;/code&gt; opens Mastra Studio at &lt;code&gt;http://localhost:4111&lt;/code&gt;. It's a web UI for chatting with agents, inspecting tool call traces, and testing workflows. During development, it's genuinely useful — you can see exactly which tools fired and what they returned, without digging through logs.&lt;/p&gt;
&lt;h2&gt;
  
  
  How It Compares
&lt;/h2&gt;

&lt;p&gt;The closest comparison in TypeScript is Vercel AI SDK. Vercel AI SDK handles LLM calls and streaming well; Mastra adds agent lifecycle management, memory, and observability on top. It's not a replacement — it's a higher-level abstraction that uses AI SDK underneath.&lt;/p&gt;

&lt;p&gt;Against the &lt;a href="https://dev.to/en/blog/en/google-adk-vs-langgraph-agent-framework-comparison-2026"&gt;Google ADK vs LangGraph comparison&lt;/a&gt; I did earlier, both were Python-only. Mastra is occupying that same space in TypeScript.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Mastra&lt;/th&gt;
&lt;th&gt;Vercel AI SDK&lt;/th&gt;
&lt;th&gt;LangGraph.js&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Language&lt;/td&gt;
&lt;td&gt;TypeScript&lt;/td&gt;
&lt;td&gt;TypeScript&lt;/td&gt;
&lt;td&gt;TypeScript&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agent loop&lt;/td&gt;
&lt;td&gt;✅ Built-in&lt;/td&gt;
&lt;td&gt;⚠️ Manual&lt;/td&gt;
&lt;td&gt;✅ Built-in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory&lt;/td&gt;
&lt;td&gt;✅ Built-in (storage req'd)&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;⚠️ Manual&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Workflows&lt;/td&gt;
&lt;td&gt;✅ Graph-based&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅ Graph-based&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Observability&lt;/td&gt;
&lt;td&gt;✅ OpenTelemetry&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;⚠️ External tools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Learning curve&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h2&gt;
  
  
  What I'd Change
&lt;/h2&gt;

&lt;p&gt;Two things stood out as friction:&lt;/p&gt;

&lt;p&gt;The Memory setup barrier. New users following the official example will hit an error the first time they try to use Memory standalone. The error is informative but the setup path isn't obvious. A cleaner getting-started experience here would make a real difference.&lt;/p&gt;

&lt;p&gt;Production deployment documentation is thin. Mastra Studio is a development tool, but the path from "working locally" to "deployed on my server" isn't well documented outside of the Vercel deployment guide. Docker and self-hosted setups require figuring things out yourself.&lt;/p&gt;
&lt;h2&gt;
  
  
  Is It Worth Trying Right Now?
&lt;/h2&gt;

&lt;p&gt;Yes, with caveats.&lt;/p&gt;

&lt;p&gt;If you're starting a new TypeScript agent project, Mastra is worth a serious look. The setup is fast, the structure is sensible, and the core abstractions hold up. It took me about ten minutes from nothing to a working agent with real tool calls — the "5-minute" claim is close to accurate.&lt;/p&gt;

&lt;p&gt;For production: I'd wait. Mastra v1.0 shipped in January 2026. Six months of production exposure is not much. The ecosystem is still thin compared to LangChain. API stability is not yet something I'd bet a production service on.&lt;/p&gt;

&lt;p&gt;For a side project or internal tool: use it now. For a production service: revisit around v1.5.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Get started&lt;/span&gt;
npm create mastra@latest my-agent-app &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="nt"&gt;--components&lt;/span&gt; agents,tools &lt;span class="nt"&gt;--llm&lt;/span&gt; google &lt;span class="nt"&gt;--example&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;my-agent-app
&lt;span class="c"&gt;# Add GOOGLE_API_KEY to .env&lt;/span&gt;
npm run dev
&lt;span class="c"&gt;# → Open http://localhost:4111 to chat with your first agent&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The TypeScript AI agent ecosystem finally has a real contender. It's not fully there yet, but it's the most promising thing I've seen on the TypeScript side in a while.&lt;/p&gt;

</description>
      <category>mastra</category>
      <category>typescript</category>
      <category>agents</category>
      <category>gemini</category>
    </item>
    <item>
      <title>Anthropic and OpenAI Filed for IPO in the Same Month — What the Token Price War Means for Developers</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Sat, 13 Jun 2026 06:41:31 +0000</pubDate>
      <link>https://dev.to/jangwook_kim_e31e7291ad98/anthropic-and-openai-filed-for-ipo-in-the-same-month-what-the-token-price-war-means-for-developers-9ai</link>
      <guid>https://dev.to/jangwook_kim_e31e7291ad98/anthropic-and-openai-filed-for-ipo-in-the-same-month-what-the-token-price-war-means-for-developers-9ai</guid>
      <description>&lt;p&gt;On June 1st, news broke that Anthropic had quietly filed a confidential S-1 registration statement with the SEC. One week later, on June 8th, OpenAI did the same thing. Two of the most important companies in AI filing for IPO in the same month — that's never happened before in this space, and it's not just a capital markets story. It has real implications for how much developers pay to build with these models.&lt;/p&gt;

&lt;h2&gt;
  
  
  The short version: this is good for you right now
&lt;/h2&gt;

&lt;p&gt;Let me be direct. In the short term, this competitive dynamic works in developers' favor. Both companies need to demonstrate strong growth metrics before they go public, and one lever they can pull is lowering prices to attract more API usage. OpenAI is reportedly considering "drastic" cuts to token prices. Anthropic restructured toward consumption-based revenue earlier this year, meaning their ARR grows the more developers use the API.&lt;/p&gt;

&lt;p&gt;The concern I have is about what happens after the IPO. Once there are public shareholders to answer to, the pressure on margins increases. And the competitive threat from Chinese open-source models, which is driving much of this pricing discussion, doesn't disappear once either company is listed.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually happened in the first week of June
&lt;/h2&gt;

&lt;p&gt;Anthropic filed a confidential draft S-1 with the SEC on June 1, 2026. The confidential process lets the SEC review before the full prospectus goes public. Here's what we already know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Anthropic's most recent Series H: $65 billion raised&lt;/li&gt;
&lt;li&gt;Post-money valuation from that round: approximately $965 billion&lt;/li&gt;
&lt;li&gt;Annualized revenue run rate (ARR): around $47 billion per reports&lt;/li&gt;
&lt;li&gt;Likely IPO timeline: as early as October 2026 on Nasdaq or NYSE&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;OpenAI followed on June 8, with a valuation around $852 billion based on its March 2026 raise — at that point, lower than Anthropic for the first time.&lt;/p&gt;

&lt;p&gt;The timing isn't coincidental. Both companies are chasing the same pool of institutional investors while AI enthusiasm is still high. If one company lists first and locks up investor attention, the other's roadshow gets harder. The race to file is, in part, a race to get on the public markets before the window closes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the IPO race is creating pricing pressure right now
&lt;/h2&gt;

&lt;p&gt;To build a compelling IPO story, you need revenue, growth rate, and a credible market position narrative. Right now, both companies are being squeezed on that third point — not by each other, but by Chinese open-source models.&lt;/p&gt;

&lt;p&gt;DeepSeek V3.2 delivers GPT-5-level coding performance at $0.28 input / $0.42 output per million tokens. That's roughly 10x cheaper than Claude Sonnet 4.6 ($3.00/$15.00) on the input side, and 18x cheaper than Opus 4.8 ($5.00/$25.00). Qwen3-Max sits in a similar price range. If enterprise customers can get comparable quality at that price, there's a real pull to route non-sensitive workloads there.&lt;/p&gt;

&lt;p&gt;I covered &lt;a href="https://dev.to/en/blog/en/anthropic-usage-caps-llm-pricing-disruption-analysis-2026"&gt;Anthropic's earlier pricing shift away from third-party subscriptions&lt;/a&gt; in April. That move was partly about converting OpenClaw and similar tools from cheap subscription access to consumption billing. The logic is the same now: the more API tokens developers burn, the better the ARR story looks heading into an IPO roadshow.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Anthropic's IPO financials actually reveal
&lt;/h2&gt;

&lt;p&gt;The ~$47B ARR figure that's been reported isn't just a big number — it tells you something about how Anthropic has been building its revenue structure heading into a public offering.&lt;/p&gt;

&lt;p&gt;Anthropic has three meaningful revenue streams now: API consumption, enterprise contracts, and Claude.ai subscriptions. The fastest-growing channel is API consumption. This is directly connected to the strategic moves we saw earlier in the year, including &lt;a href="https://dev.to/en/blog/en/anthropic-usage-caps-llm-pricing-disruption-analysis-2026"&gt;blocking third-party subscription access&lt;/a&gt; through products like OpenClaw. When you cut off cheap subscription API routing and force developers onto direct API billing, every token burned shows up in the ARR.&lt;/p&gt;

&lt;p&gt;For IPO purposes, what investors care about isn't just current revenue — it's revenue growth rate and its predictability. If Anthropic's ARR was $20B six months ago and is now $47B, that's a 2.35x multiplier over six months. Maintaining that trajectory into the IPO roadshow is worth a lot. And one way to sustain it is to lower prices just enough to increase volume without tanking margins.&lt;/p&gt;

&lt;p&gt;This means the price discounts developers are seeing right now are a feature, not a bug, of Anthropic's IPO strategy. I'm not saying that to be cynical — it's genuinely useful to understand the incentive structure. Anthropic wants high API usage because it shows the market that Claude is winning.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the current API costs actually look like
&lt;/h2&gt;

&lt;p&gt;I installed @anthropic-ai/sdk 0.104.1 in a sandbox and worked through some token cost scenarios. The numbers below are based on official published pricing as of June 13, 2026. OpenAI's is pre-any announced cut.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/..%2F..%2F..%2Fassets%2Fblog%2Fanthropic-openai-ipo-token-price-war-developer-guide-june-2026-pricing.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/..%2F..%2F..%2Fassets%2Fblog%2Fanthropic-openai-ipo-token-price-war-developer-guide-june-2026-pricing.png" alt="AI API Pricing Comparison — June 2026" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Caching changes things meaningfully. When prompt cache hits, input token cost drops to 10% of the standard rate (a 90% discount). For a 50K input + 2K output scenario:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Without caching&lt;/strong&gt;: Claude Sonnet 4.6 → $0.18 / Claude Haiku 4.5 → $0.06&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;With 80% cache hit rate&lt;/strong&gt;: Claude Sonnet 4.6 → ~$0.072 / Claude Haiku 4.5 → ~$0.024&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're repeatedly sending large codebases or long system prompts, effective costs can come down to 30-40% of sticker price. &lt;a href="https://dev.to/en/blog/en/claude-api-prompt-caching-cost-optimization-guide"&gt;The Claude prompt caching optimization guide&lt;/a&gt; has examples of 70% real-world reductions. That's meaningful — though it still doesn't close the full gap with DeepSeek at scale.&lt;/p&gt;

&lt;p&gt;Looking at &lt;a href="https://dev.to/en/blog/en/claude-fable-5-mythos-public-api-developer-analysis-2026"&gt;Claude Fable 5's $10/$50 pricing&lt;/a&gt; next to these other tiers, you can see Anthropic is running a dual strategy: premium pricing for frontier capability (Fable 5, Opus 4.8) while keeping Haiku competitive on the low end. The competitive pricing pressure probably won't touch Fable 5 much — workloads that need that level of performance tend to be price-inelastic. It's Sonnet and Haiku where the real price competition happens.&lt;/p&gt;

&lt;p&gt;If OpenAI cuts GPT-5.4 from $2.50 to $1.50, that puts Sonnet 4.6 at $3.00 in an awkward position. Anthropic would likely respond. That's the scenario that would actually benefit developers most, and it becomes more likely the closer both companies get to their respective IPOs.&lt;/p&gt;

&lt;h2&gt;
  
  
  The risks developers should actually worry about post-IPO
&lt;/h2&gt;

&lt;p&gt;I'll be candid here: I think the current pricing pressure is in large part a pre-IPO phenomenon. There are structural reasons to expect it looks different afterward.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shareholder pressure changes the calculus.&lt;/strong&gt; Private companies can sustain losses to build market share. Public companies face quarterly margin scrutiny. AWS, Azure, and GCP all went through a period of aggressive pricing before consolidating around higher margins once they dominated their markets. There's no reason to assume Anthropic or OpenAI won't follow a similar arc.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lock-in is accumulating.&lt;/strong&gt; Claude Code, Cursor, Windsurf, and other AI coding tools are increasingly built around Claude. The more your workflows are optimized for a specific model's behavior, the higher the cost of switching when pricing changes. I've been watching this — the &lt;a href="https://dev.to/en/blog/en/claude-code-june-2026-new-features-changelog-developer-guide"&gt;June 2026 Claude Code update&lt;/a&gt; added features that deepen that integration further.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Chinese model data problem isn't going away.&lt;/strong&gt; DeepSeek might be 10-30x cheaper, but routing your proprietary codebase or customer data through Chinese infrastructure is a conversation most enterprise legal and security teams won't approve. EU GDPR, US defense sector regulations, healthcare — all of these narrow the viable options. Anthropic and OpenAI carry enterprise trust that open-source models can't easily replicate.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'm doing about it
&lt;/h2&gt;

&lt;p&gt;Three things I've actually changed in how I build:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First, I'm keeping model selection in config, not code.&lt;/strong&gt; Every place where I used to write &lt;code&gt;claude-sonnet-4-6&lt;/code&gt; directly, I've moved to environment variables or config files. When pricing changes or a better option appears, switching costs drop significantly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Before: model name baked in&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// After: externalized, switchable without code changes&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;MODEL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;CLAUDE_MODEL&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This looks trivial but matters when you're running workflows across multiple services. Changing one environment variable rather than searching a codebase is the difference between a five-minute update and a half-day refactor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second, I've prioritized prompt caching implementation.&lt;/strong&gt; For anything involving repeated large context — codebase analysis, document review, multi-turn agents — the 90% cache discount is the fastest win available. Waiting for prices to drop while ignoring this seems backward.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;system&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;text&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;largeSystemPrompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// Your long context goes here&lt;/span&gt;
      &lt;span class="na"&gt;cache_control&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ephemeral&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="c1"&gt;// One line to activate caching&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;userInput&lt;/span&gt; &lt;span class="p"&gt;}],&lt;/span&gt;
  &lt;span class="na"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2048&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;cache_control&lt;/code&gt; field is all it takes. If your system prompt is 50K tokens and you're making 100 requests per hour, you've just saved 90% on 50K × 100 = 5M tokens worth of input cost per hour. Do the math for your actual usage — for many codebases-as-context use cases, caching alone makes Anthropic competitive with cheaper alternatives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Third, I'm routing workloads by risk profile.&lt;/strong&gt; Tasks that don't touch internal code or customer data are candidates for DeepSeek or Qwen testing. Tasks that do, stay on Anthropic or OpenAI. It's not a value judgment on the models — it's practical risk management.&lt;/p&gt;

&lt;p&gt;My current decision rule is simple: does this request contain customer names, internal source code, or business strategy? If yes, Anthropic or OpenAI. If it's processing public technical documentation, general text summarization, or analysis of public data, DeepSeek goes on the testing list. Following this rule alone can reduce monthly costs 20-30% for many teams without creating meaningful compliance exposure.&lt;/p&gt;

&lt;h2&gt;
  
  
  My take: you're a price war beneficiary, but stay clear-eyed about the structure
&lt;/h2&gt;

&lt;p&gt;Anthropic and OpenAI filing for IPO in the same month creates real short-term tailwinds for API users. Both companies have strong incentives to drive usage before they go public, and that means lower prices and faster feature development.&lt;/p&gt;

&lt;p&gt;But I'd call this market situation overblown if you treat the current pricing as permanent. Post-IPO shareholder pressure, gradual lock-in effects, and the data sovereignty issues around cheap Chinese models are all structural constraints that a price drop doesn't resolve.&lt;/p&gt;

&lt;p&gt;The right stance is simple: take advantage of falling prices, but build in a way that keeps you mobile. The developer who maintains clean provider abstraction and optimizes aggressively for what's available now will come out ahead regardless of what the post-IPO pricing looks like.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
