<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: ollama</title>
    <description>The latest articles tagged 'ollama' on DEV Community.</description>
    <link>https://dev.to/t/ollama</link>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tag/ollama"/>
    <language>en</language>
    <item>
      <title>Does a Local Reasoning Model Earn Its Keep? Measuring thinking ON/OFF on gemma4:12b</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Tue, 30 Jun 2026 06:50:07 +0000</pubDate>
      <link>https://dev.to/jangwook_kim_e31e7291ad98/does-a-local-reasoning-model-earn-its-keep-measuring-thinking-onoff-on-gemma412b-3h0m</link>
      <guid>https://dev.to/jangwook_kim_e31e7291ad98/does-a-local-reasoning-model-earn-its-keep-measuring-thinking-onoff-on-gemma412b-3h0m</guid>
      <description>&lt;p&gt;Last month, in &lt;a href="https://dev.to/en/blog/en/llm-determinism-temperature-seed-experiment"&gt;my post measuring output reproducibility with temperature and seed&lt;/a&gt;, I confidently wrote one paragraph that was wrong. I saw gemma4:12b-it-qat return a rising &lt;code&gt;eval_count&lt;/code&gt; while &lt;code&gt;content&lt;/code&gt; came back as an empty string, declared it "a packaging problem where tokens don't map to visible text," and dropped the model from my determinism table.&lt;/p&gt;

&lt;p&gt;That wasn't it. While prepping a different experiment this week I hit the same empty reply, and this time I read the response JSON to the end. Inside &lt;code&gt;message&lt;/code&gt;, alongside &lt;code&gt;content&lt;/code&gt;, was another field: &lt;code&gt;thinking&lt;/code&gt;. gemma4:12b is a reasoning model. The empty reply wasn't a bug. I had set &lt;code&gt;num_predict&lt;/code&gt; too low, so the generation budget drained entirely into the reasoning channel and not a single token was left for the answer. I had misdiagnosed the whole thing.&lt;/p&gt;

&lt;p&gt;Correcting the mistake left me with a sharper question. Does this reasoning actually earn its keep? If it returns the same answer 23× slower and burns 84× the tokens, do I turn it on in an agent or not? So I measured it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real culprit behind the empty replies was the thinking field
&lt;/h2&gt;

&lt;p&gt;First, reproduction. I sent the simplest arithmetic in both modes. Ollama's &lt;code&gt;/api/chat&lt;/code&gt; lets you toggle reasoning with a &lt;code&gt;think&lt;/code&gt; boolean at the top level of the request body (next to &lt;code&gt;model&lt;/code&gt; and &lt;code&gt;messages&lt;/code&gt;, not inside &lt;code&gt;options&lt;/code&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;think&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_predict&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4:12b-it-qat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;think&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;think&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                       &lt;span class="c1"&gt;# &amp;lt;- here. outside options
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;options&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;seed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;num_ctx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2048&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;num_predict&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;num_predict&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434/api/chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;urlopen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thinking&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is the result for "A shirt costs 40 dollars after a 20% discount. What was the original price?"&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;Output tokens&lt;/th&gt;
&lt;th&gt;thinking chars&lt;/th&gt;
&lt;th&gt;Answer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;think=true&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;37.0s&lt;/td&gt;
&lt;td&gt;252&lt;/td&gt;
&lt;td&gt;551&lt;/td&gt;
&lt;td&gt;50 (correct)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;think=false&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1.6s&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;50 (correct)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Both produced exactly the same answer. The reasoning side was 23× slower and burned 84× the tokens. When I'd set &lt;code&gt;num_predict&lt;/code&gt; to 24 in my earlier post, those 252 reasoning tokens got cut off before the answer "50" was ever generated. That's why &lt;code&gt;content&lt;/code&gt; came back empty. Not the model, not packaging, my own setting. It is exactly the lesson from &lt;a href="https://dev.to/en/blog/en/ollama-num-ctx-silent-truncation-experiment"&gt;the experiment where num_ctx silently truncated the instructions in long inputs&lt;/a&gt;: when a model looks dumb, the culprit is usually my own options.&lt;/p&gt;

&lt;h2&gt;
  
  
  So I changed the question: where does reasoning earn its keep?
&lt;/h2&gt;

&lt;p&gt;"Same answer, so turn it off" is too quick. Arithmetic makes reasoning look like pure overhead, but reasoning models exist because deliberation rescues problems where the fast, intuitive answer is wrong. The textbook for "the intuitive answer is wrong" is the Cognitive Reflection Test (CRT). The three items Shane Frederick introduced in his 2005 paper (bat-and-ball, machines-and-widgets, lily pads) are engineered so the first answer that pops into your head is almost always wrong.&lt;/p&gt;

&lt;p&gt;So I formed a hypothesis. &lt;strong&gt;Turning reasoning on will raise accuracy on the CRT traps.&lt;/strong&gt; They are built to defeat intuition, so a deliberate mode should shine here.&lt;/p&gt;

&lt;p&gt;I wrote 13 questions across three difficulty tiers. Every question has a single answer and demands a checkable format like "just the number."&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Intent&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Easy (A1-A4)&lt;/td&gt;
&lt;td&gt;Lookup / mental math&lt;/td&gt;
&lt;td&gt;Capital of Japan, 7×8, days in a week&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Medium (B1-B4)&lt;/td&gt;
&lt;td&gt;Multi-step word problems&lt;/td&gt;
&lt;td&gt;Reverse a 20% discount, 60km/45min to km/h, the apples problem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hard (C1-C5)&lt;/td&gt;
&lt;td&gt;CRT traps + on-the-spot procedure&lt;/td&gt;
&lt;td&gt;Bat-and-ball, machines-and-widgets, lily pads, sort 6 numbers, count the r's in "strawberry"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I ran each question once with &lt;code&gt;think=false&lt;/code&gt; and once with &lt;code&gt;think=true&lt;/code&gt;. With temperature=0 and a fixed seed, each mode reproduces. I logged time, output tokens, and correctness for every call, and saved the full reasoning trace for two CRT questions (bat-and-ball and lily pads).&lt;/p&gt;

&lt;h2&gt;
  
  
  Result: I bought one extra answer at 19× the time
&lt;/h2&gt;

&lt;p&gt;Summary of the 26 calls (13 questions × 2 modes).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;thinking OFF&lt;/th&gt;
&lt;th&gt;thinking ON&lt;/th&gt;
&lt;th&gt;Multiple&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Correct&lt;/td&gt;
&lt;td&gt;12 / 13&lt;/td&gt;
&lt;td&gt;13 / 13&lt;/td&gt;
&lt;td&gt;+1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg response time&lt;/td&gt;
&lt;td&gt;1.4s&lt;/td&gt;
&lt;td&gt;28.3s&lt;/td&gt;
&lt;td&gt;20×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg output tokens / question&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;189&lt;/td&gt;
&lt;td&gt;63×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total output tokens (whole set)&lt;/td&gt;
&lt;td&gt;36&lt;/td&gt;
&lt;td&gt;2,454&lt;/td&gt;
&lt;td&gt;68×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total time (whole set)&lt;/td&gt;
&lt;td&gt;19s&lt;/td&gt;
&lt;td&gt;368s&lt;/td&gt;
&lt;td&gt;19×&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Turning reasoning on bought exactly one correct answer. For that one answer the whole set spent 68× the output tokens and 19× the wall-clock. In the hero chart above, the blue bars (OFF) are nearly invisible. That's because they're 1-6 tokens per question. The orange bars (ON) climb from 46 to 399.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/..%2F..%2F..%2Fassets%2Fblog%2Flocal-llm-reasoning-mode-token-cost-experiment%2Faccuracy-vs-cost.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/..%2F..%2F..%2Fassets%2Fblog%2Flocal-llm-reasoning-mode-token-cost-experiment%2Faccuracy-vs-cost.png" alt="Accuracy vs cost summary: a one-answer difference cost 68× the output tokens"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The heaviest thinker was machines-and-widgets (C2): 399 tokens and 59 seconds with reasoning on. Reasoning off solved the same problem in 2 tokens and 1.4 seconds. Correctly.&lt;/p&gt;

&lt;h2&gt;
  
  
  What reasoning actually rescued wasn't a CRT question
&lt;/h2&gt;

&lt;p&gt;This is where my hypothesis broke. The only question that needed reasoning to get right was C4. And C4 is not a CRT trap. It is an on-the-spot procedure: "Sort 17, 3, 29, 8, 21, 14 in descending order and tell me the third largest." Reasoning off answered 21 (mistaking 2nd for 3rd); reasoning on walked 29, 21, 17 and got 17.&lt;/p&gt;

&lt;p&gt;The three CRT traps I expected to shine on were all correct with reasoning off.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Question&lt;/th&gt;
&lt;th&gt;Intuitive wrong answer&lt;/th&gt;
&lt;th&gt;OFF&lt;/th&gt;
&lt;th&gt;ON&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;C1 bat-and-ball (ball in cents?)&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;5 ✓&lt;/td&gt;
&lt;td&gt;5 ✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C2 machines-and-widgets (minutes?)&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;5 ✓&lt;/td&gt;
&lt;td&gt;5 ✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C3 lily pads (half-covered on day?)&lt;/td&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;td&gt;47 ✓&lt;/td&gt;
&lt;td&gt;47 ✓&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Why would that be? Honestly, there's a variable I can't see. These three appear so often in cognitive-science textbooks and LLM eval papers that the answers are probably baked into the training data. gemma4:12b may not be "reasoning" through them at all; it may be recalling them. C4's six numbers, by contrast, I picked on the spot, so there's nothing to memorize. The interpretation that fits my data best: reasoning only did real work on the procedure that can't be answered from memory.&lt;/p&gt;

&lt;p&gt;One more surprise was C5. "How many r's in strawberry" is a classic trap where the tokenizer swallows the word whole, and LLMs have historically flubbed it. gemma4:12b answered 3 instantly even with reasoning off. That, too, looks like the question being so common that the answer hardened into the model. In other words, "famous LLM trap" has gone blunt as a yardstick for reasoning value. The model has memorized the trap as a trap.&lt;/p&gt;

&lt;p&gt;So here's how I read it. &lt;strong&gt;The value of reasoning mode comes not from "hard problems" but from "multi-step procedures the model hasn't seen."&lt;/strong&gt; Evaluate a reasoning model on famous trick questions and you'll see no difference, because both modes get them right. The gap opens when the model has to crunch concrete data it's encountering for the first time.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the reasoning trace shows when you open it
&lt;/h2&gt;

&lt;p&gt;Here is the thinking gemma4 generated for bat-and-ball (C1), verbatim.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;*   Total cost of bat + ball = $1.10.
*   Let x be the price of the ball.
*   The bat costs x + 1.00.
*   Equation: x + (x + 1.00) = 1.10
*   2x = 0.10  -&amp;gt;  x = 0.05
*   Convert to cents: 0.05 x 100 = 5.
*   Check: Ball 5 + Bat 105 = 110 cents. Correct.
ANSWER: 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It sets up the equation and even checks its work. A textbook-correct derivation. The catch is that reasoning off answered 5 just as fast. This tidy 297-token derivation ties, on the result, with a 2-token snap answer. A pretty process didn't earn anything. The lily-pads trace (C3) is similar: it nails the key insight ("it doubles daily, so one day before full it was half"). But reasoning off already had that insight (instant 47). Reasoning didn't change the answer, it only showed the path. If you need that path logged for evals, debugging, or audit, the trace is the value. But in a pipeline that keeps only the final answer, 297 tokens is just cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I changed in my agent after this
&lt;/h2&gt;

&lt;p&gt;Three changes to my local agent setup followed this run.&lt;/p&gt;

&lt;p&gt;First, &lt;strong&gt;I made &lt;code&gt;think=false&lt;/code&gt; the default for lookup, classification, and format-conversion steps.&lt;/strong&gt; Routing ("which tool does this request go to?"), short extraction, JSON shaping. There's almost no room for intuition to fail here. Turning reasoning on at these steps donates 20× latency and 60× tokens per step. An agent passes through dozens of these light steps, so the cumulative loss is large. As I saw in &lt;a href="https://dev.to/en/blog/en/ai-agent-cost-reality"&gt;the post breaking down where tokens leak in a single agent run&lt;/a&gt;, cost leaks not in one big hit but in the repetition of small steps.&lt;/p&gt;

&lt;p&gt;Put numbers on it and the call is easy. Say a routing/extraction step runs 10 times per request. Reasoning off: 1.4s each, 14s for ten. Reasoning on: 28s each, 280s for ten. The user stares at a blank screen for over four minutes. The accuracy gain in return, at least on these non-intuition steps, is near zero. A local model spends no cash per token, but time and power are real costs, so the math holds.&lt;/p&gt;

&lt;p&gt;Second, &lt;strong&gt;I only turn reasoning on for multi-step computation the model hasn't seen.&lt;/strong&gt; Steps that aggregate user data on the spot, satisfy several constraints at once, or carry intermediate state. C4 was exactly that shape. You don't flip it on to solve a famous puzzle; you flip it on when there's a procedure to run with no memorized answer.&lt;/p&gt;

&lt;p&gt;Third, &lt;strong&gt;I give reasoning steps a generous &lt;code&gt;num_predict&lt;/code&gt;.&lt;/strong&gt; The empty reply I got at the start was precisely the result of not doing this. The reasoning channel eats ~189 tokens first, so the answer needs headroom above that. If you turn it on, raise the budget with it. I added this item to the recommended settings in &lt;a href="https://dev.to/en/blog/en/llm-determinism-temperature-seed-experiment"&gt;the determinism post&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The limits of one model and 13 questions
&lt;/h2&gt;

&lt;p&gt;To be straight: this is one model (gemma4:12b), 13 questions, one measurement each. It's closer to one person's notebook-day notes than a statistic. I can't verify whether the CRT questions were in the training data, so I can only say "likely." Bigger reasoning models or harder benchmarks would surely show a larger reasoning payoff. Don't read these numbers as "reasoning is useless." My conclusion isn't that. It's "reasoning isn't free, and you have to pick the spots where you turn it on."&lt;/p&gt;

&lt;p&gt;What I originally set out to do was measure retrieval accuracy by position in a long context (the so-called lost-in-the-middle effect), but a 1.5k-token prefill took 26 seconds, so it wouldn't finish inside a single run. That waits for another day. Today, finding the real cause of the empty replies and measuring the value of reasoning was a good enough trade. At least I corrected one misdiagnosis from the earlier post.&lt;/p&gt;

</description>
      <category>localllm</category>
      <category>ollama</category>
      <category>reasoningmodels</category>
      <category>llmevaluation</category>
    </item>
    <item>
      <title>Run a Private AI Coding Agent Locally: Setup &amp; Design with Ollama, OpenCode, and Custom Workspace Skills</title>
      <dc:creator>Praveen Veera</dc:creator>
      <pubDate>Mon, 29 Jun 2026 19:34:19 +0000</pubDate>
      <link>https://dev.to/praveen_builds/run-a-private-ai-coding-agent-locally-setup-design-with-ollama-opencode-and-custom-workspace-392o</link>
      <guid>https://dev.to/praveen_builds/run-a-private-ai-coding-agent-locally-setup-design-with-ollama-opencode-and-custom-workspace-392o</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Ftw91d0f75wottkpvpxrq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Ftw91d0f75wottkpvpxrq.png" alt="Local AI Agent Architecture - At a Glance"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you have local autocomplete and chat running inside your IDE, the next step is transitioning to autonomous execution. Setting up a local coding agent running directly inside your terminal or editor gives you a private, offline partner capable of executing shell commands, refactoring files, and diagnosing compilation errors.&lt;/p&gt;

&lt;p&gt;This guide focuses on the workspace design, custom instructions, and domain-specific skills required to orchestrate a reliable local agent using &lt;strong&gt;Ollama&lt;/strong&gt; and &lt;strong&gt;OpenCode&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  🔰 What is an "AI Agent" (For Beginners)?
&lt;/h3&gt;

&lt;p&gt;If you have only used ChatGPT or Claude in a browser, a coding agent behaves differently. Standard chat systems only output text; you must manually copy and paste the code block into your editor. &lt;/p&gt;

&lt;p&gt;An &lt;strong&gt;AI agent&lt;/strong&gt; has "hands." It integrates directly with your workstation's filesystem and terminal. Instead of just suggesting code, the agent runs an active execution loop: it reads files, writes code modules, executes compiler test suites, inspects error outputs, and iterates autonomously until the task is complete.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. The Local Agent Architecture
&lt;/h2&gt;

&lt;p&gt;A private agentic workspace coordinates model outputs with local system execution. Here is the operational design of the loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌────────────────────────────────────────────┐
│                 Developer                  │
│        Terminal / VS Code / OpenCode       │
└─────────────────────┬──────────────────────┘
                      │
┌─────────────────────▼──────────────────────┐
│                  OpenCode                  │
│  - Agent execution loop                    │
│  - Context window manager                  │
│  - Project instruction parser              │
│  - Tool permission registry                │
│  - Skills / specialist agents              │
└──────────────┬───────────────┬─────────────┘
               │               │
     ┌─────────▼──────┐  ┌────▼─────────────┐
     │ Project Repo   │  │ Local OS Tools   │
     │ - Source code  │  │ - Terminal bash  │
     │ - Docs         │  │ - Git versioning │
     │ - Test suites  │  │ - Linters        │
     └────────────────┘  └──────────────────┘
                      │
┌─────────────────────▼──────────────────────┐
│                   Ollama                     │
│           Local model inference            │
│       Qwen / Llama coding models           │
└────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The Developer:&lt;/strong&gt; Initiates a task (e.g., "Add a health-check route") in the terminal interface.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenCode (Agent Interface):&lt;/strong&gt; Reads global instructions, loads domain-specific skills, parses the repository directory, and maps available tools.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama (Local Runtime):&lt;/strong&gt; Handles prompt inference, generating tool-call tags in XML or JSON format.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local Tools:&lt;/strong&gt; The agent runtime parses the tags, requests developer permission, and executes the files or bash commands natively.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  2. Step 1: Interface &amp;amp; Local Runtime Link (OpenCode)
&lt;/h2&gt;

&lt;p&gt;OpenCode acts as the execution bridge, routing prompt contexts to your local Ollama API. Configure it by editing your workspace configuration file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ollama"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"endpoint"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:11434"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"qwen2.5-coder:14b-instruct"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"default_agent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"builder"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"system_instructions_path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"./.agents/instructions.md"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Note: For the local model settings, we run the instruct weights via Ollama configured with a minimum context window (&lt;code&gt;num_ctx 16384&lt;/code&gt;) and a deterministic temperature (&lt;code&gt;0.0&lt;/code&gt;), as detailed in our first guide.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Step 2: Project Instructions &amp;amp; Guardrails
&lt;/h2&gt;

&lt;p&gt;To prevent the agent from executing destructive commands or writing non-compliant code, you must define project-specific guardrails. Create a project instructions file (&lt;code&gt;.agents/instructions.md&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Project Instructions&lt;/span&gt;

&lt;span class="gu"&gt;## Architecture &amp;amp; Stack&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Frontend: Next.js (App Router, TypeScript)
&lt;span class="p"&gt;-&lt;/span&gt; Backend: FastAPI (Python 3.11, Pydantic v2)
&lt;span class="p"&gt;-&lt;/span&gt; Database: PostgreSQL

&lt;span class="gu"&gt;## Core Rules&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Do not modify database schemas without explicit permission.
&lt;span class="p"&gt;-&lt;/span&gt; Do not introduce new third-party dependencies without explaining the rationale.
&lt;span class="p"&gt;-&lt;/span&gt; Run linting and tests before proposing a completed task.

&lt;span class="gu"&gt;## Code Style&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Use TypeScript strict mode for frontend modules.
&lt;span class="p"&gt;-&lt;/span&gt; Use asynchronous database operations (async/await) in Python.
&lt;span class="p"&gt;-&lt;/span&gt; Add unit tests for all new business logic.

&lt;span class="gu"&gt;## Safety Constraints&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Never print secrets, API tokens, or environment files to standard out.
&lt;span class="p"&gt;-&lt;/span&gt; Do not delete source files unless explicitly requested.
&lt;span class="p"&gt;-&lt;/span&gt; Present a concrete plan before executing multi-file changes.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  4. Step 3: Domain-Specific Skills (Specialist Guides)
&lt;/h2&gt;

&lt;p&gt;Lightweight local models (like 14B parameters) can struggle with complex routing patterns or framework boilerplate. By organizing your codebase with a dedicated &lt;code&gt;skills/&lt;/code&gt; directory, you equip your agent with specialized recipes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;project-root/&lt;/span&gt;
&lt;span class="s"&gt;├── .agents/&lt;/span&gt;
&lt;span class="s"&gt;│   └── instructions.md&lt;/span&gt;
&lt;span class="s"&gt;└── skills/&lt;/span&gt;
    &lt;span class="s"&gt;├── nextjs-feature.md&lt;/span&gt;
    &lt;span class="s"&gt;├── fastapi-api.md&lt;/span&gt;
    &lt;span class="s"&gt;├── database-migration.md&lt;/span&gt;
    &lt;span class="s"&gt;└── test-writing.md&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is a sample skill definition file for writing endpoints (&lt;code&gt;skills/fastapi-api.md&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# FastAPI API Skill&lt;/span&gt;

When adding a new API endpoint to the backend:
&lt;span class="p"&gt;
1.&lt;/span&gt; Check existing router imports in &lt;span class="sb"&gt;`app/main.py`&lt;/span&gt;.
&lt;span class="p"&gt;2.&lt;/span&gt; Define Pydantic request and response schemas in &lt;span class="sb"&gt;`app/schemas/`&lt;/span&gt;.
&lt;span class="p"&gt;3.&lt;/span&gt; Use async database sessions with &lt;span class="sb"&gt;`sqlalchemy.ext.asyncio`&lt;/span&gt;.
&lt;span class="p"&gt;4.&lt;/span&gt; Include explicit error handlers using &lt;span class="sb"&gt;`HTTPException`&lt;/span&gt; with clear detail messages.
&lt;span class="p"&gt;5.&lt;/span&gt; Create a corresponding test file in &lt;span class="sb"&gt;`tests/test_api.py`&lt;/span&gt;.
&lt;span class="p"&gt;6.&lt;/span&gt; Run linting and verify API responses before marking the task complete.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a user prompts the agent to add a backend route, OpenCode automatically appends this skill file to the active system context, ensuring the model matches your codebase's architectural pattern without bloating the base system prompt.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Step 4: Tool Risk &amp;amp; Permission Registry
&lt;/h2&gt;

&lt;p&gt;Giving an agent system access introduces risks. You must categorize available tools by risk level to prevent accidental system changes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Risk Level&lt;/th&gt;
&lt;th&gt;Safety Guideline&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Read Files&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Inspects code structures and configuration.&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Safe to execute automatically.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Search Repo&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Locates variable definitions and file locations.&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Safe to execute automatically.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Git Diff/Status&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Analyzes workspace changes.&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Safe to execute automatically.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Run Tests&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Executes unit tests to validate code.&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Restrict execution duration to prevent infinite loops.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Modify Files&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Edits source code or templates.&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Require manual review or run inside a Git sandbox.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Delete Files&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cleans up obsolete components.&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Always prompt for explicit human confirmation.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Shell Commands&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Runs compiler commands, builds, or scripts.&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Never automate; require step-by-step developer approval.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;🛡️ &lt;strong&gt;The Git Sandbox Rule:&lt;/strong&gt; Always initialize a Git repository and commit your active changes before letting a local agent write code. If the agent goes rogue, deletes files, or writes buggy code, you can roll back your entire workspace instantly by running:&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;git reset &lt;span class="nt"&gt;--hard&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  6. Detailed Agent Workflow Trace
&lt;/h2&gt;

&lt;p&gt;To understand how the agent uses instructions, skills, and tools under the hood, here is a trace of the execution loop when implementing a feature:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User Prompt:&lt;/strong&gt; &lt;em&gt;"Add a health-check endpoint to the FastAPI service."&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Read Directory  ──&amp;gt; Locates app/main.py and skills/fastapi-api.md
2. Parse Rules     ──&amp;gt; Identifies FastAPI backend framework rules
3. Read main.py    ──&amp;gt; Finds existing router configuration
4. Propose Plan    ──&amp;gt; Prints target changes to terminal for approval
5. Edit Files      ──&amp;gt; Inserts /health endpoint using async route
6. Write Test      ──&amp;gt; Creates test_health_check in tests/test_api.py
7. Run CLI Command ──&amp;gt; Executes: pytest tests/test_api.py (Requires user approval)
8. Git Diff Check  ──&amp;gt; Displays final diff output and completes loop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  7. Parallel Parser Implementations (Tool Calling)
&lt;/h2&gt;

&lt;p&gt;Local agents use regular expressions to parse XML tool commands generated by the local model. Here is how you can implement a robust, non-greedy tool call extractor in both TypeScript and Python. &lt;em&gt;(For an in-depth analysis of why XML tags are used to prevent format failure loops, refer to our previous guide)&lt;/em&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  TypeScript Implementation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;parseToolCall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Non-greedy regex prevents merging multiple distinct tags&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fileWriteRegex&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sr"&gt;/&amp;lt;write_file&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="sr"&gt;+path="&lt;/span&gt;&lt;span class="se"&gt;([^&lt;/span&gt;&lt;span class="sr"&gt;"&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;+&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="sr"&gt;"&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;([\s\S]&lt;/span&gt;&lt;span class="sr"&gt;*&lt;/span&gt;&lt;span class="se"&gt;?)&lt;/span&gt;&lt;span class="sr"&gt;&amp;lt;&lt;/span&gt;&lt;span class="se"&gt;\/&lt;/span&gt;&lt;span class="sr"&gt;write_file&amp;gt;/&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fileWriteRegex&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;match&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;write_file&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;match&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;match&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Python Implementation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_tool_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Non-greedy regex pattern (.*?) avoids greedy tag merges
&lt;/span&gt;    &lt;span class="n"&gt;file_write_regex&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;write_file\s+path=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;([^&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;]+)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;([\s\S]*?)&amp;lt;/write_file&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="n"&gt;match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_write_regex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;write_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  8. Live Validation &amp;amp; GitHub Repository
&lt;/h2&gt;

&lt;p&gt;To demonstrate the viability of this design, the complete setup has been packaged and executed locally on an Apple Silicon workstation. &lt;/p&gt;

&lt;h3&gt;
  
  
  Companion Repository Code
&lt;/h3&gt;

&lt;p&gt;All configuration files, project rules, specialized skills, and the active test-runner script are hosted in the companion repository:&lt;br&gt;
👉 &lt;strong&gt;&lt;a href="https://github.com/praveenveera/software-permanence/tree/main/03-local-agent-setup" rel="noopener noreferrer"&gt;software-permanence/03-local-agent-setup&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step-by-Step Execution Logs
&lt;/h3&gt;

&lt;p&gt;By running the local python simulator &lt;a href="https://github.com/praveenveera/software-permanence/blob/main/03-local-agent-setup/run_agent_loop.py" rel="noopener noreferrer"&gt;&lt;code&gt;run_agent_loop.py&lt;/code&gt;&lt;/a&gt;, we triggered &lt;code&gt;qwen2.5-coder:14b&lt;/code&gt; to read the codebase, parse our rules, write the route, and run unit tests. Here are the raw terminal logs from the execution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=== Launching Local Agent Run Simulation ===
[Step 1] Loading workspace configs, guidelines, and skills...
[Step 2] Reading current workspace status...
[Step 3] Querying local model 'qwen2.5-coder:14b' via Ollama...
  └─ Generation completed in 4.71 seconds.
  └─ Prompt Tokens: 407, Generation Tokens: 135
[Step 4] Extracting tool call payload from model output...
  └─ Parsed Action: write_file to 'workspace/app/main.py'
[Step 5] Writing modified code to local workspace...
  └─ Updated 'workspace/app/main.py' successfully.
[Step 6] Adding health-check assertion to unittest suite...
  └─ Appended 'test_read_health' test case.
[Step 7] Running unittest suite to validate changes...

=== Workspace Test Results ===
Ran 2 tests in 0.013s
OK

[Pass] Agent validation completed with all test assertions passing!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Generated Endpoint Code
&lt;/h3&gt;

&lt;p&gt;Here is the exact FastAPI router code created autonomously by the local model during the run, showing that it followed the async rules and exception detail handlers specified in &lt;code&gt;skills/fastapi-api.md&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/health&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;health_check&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Simulate a database check or other critical resource
&lt;/span&gt;        &lt;span class="c1"&gt;# For demonstration, we'll just return OK
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OK&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HTTPException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;detail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Internal Server Error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  9. Hard-Earned Lessons: What Did Not Work Well
&lt;/h2&gt;

&lt;p&gt;Running autonomous agent loops on local hardware highlighted several unique operational hurdles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tool Permission Fatigue:&lt;/strong&gt; Requiring user confirmation for high-risk tools like bash commands is necessary for safety, but it creates developer fatigue. You find yourself repeatedly hitting "Y" during compilation loops.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recursive Error Loops:&lt;/strong&gt; If a model writes buggy code and the test step fails, smaller models can get stuck in a recursive loop (apologizing, rewriting the same bug, running tests, and failing again). Setting a hard execution breaker (halting after 3 failures) is critical.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lack of Isolation:&lt;/strong&gt; Unlike cloud sandboxes, a local agent runs directly on your machine. If it runs &lt;code&gt;npm install&lt;/code&gt;, it compiles binaries on your host OS. Containerizing your workspace or running it inside a Docker dev container is highly recommended for security.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Overload:&lt;/strong&gt; Attaching multiple skill files and file summaries to the prompt quickly eats up the 16k context window. You must actively prune inactive files from the agent's history to maintain generation accuracy.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Designing a local coding agent gives you complete privacy and data sovereignty. By configuring Ollama with deterministic parameters, establishing clear instructions, organizing workspace skills, and enforcing the Git Sandbox rule, you can run a reliable agentic environment directly on your local workstation.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Are you running local coding agents on your machine? What model sizes have worked best for your workflow? Let's discuss in the comments.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Hi, I'm Praveen Veera.&lt;/strong&gt; I build practical AI systems, specializing in Enterprise AI Platforms, Local LLMs, and Dev Tools.&lt;/p&gt;

&lt;p&gt;Read my notes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Substack Newsletter:&lt;/strong&gt; &lt;a href="https://praveenbuilds.substack.com" rel="noopener noreferrer"&gt;praveenbuilds.substack.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LinkedIn:&lt;/strong&gt; &lt;a href="https://www.linkedin.com/in/praveen-veera-6ab22567/" rel="noopener noreferrer"&gt;linkedin.com/in/praveen-veera-6ab22567&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub (Companion Code):&lt;/strong&gt; &lt;a href="https://github.com/praveenveera/software-permanence" rel="noopener noreferrer"&gt;github.com/praveenveera/software-permanence&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dev.to:&lt;/strong&gt; &lt;a href="https://dev.to/praveen_builds"&gt;dev.to/praveen_builds&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Medium:&lt;/strong&gt; &lt;a href="https://medium.com/@praveenveera92" rel="noopener noreferrer"&gt;medium.com/@praveenveera92&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instagram:&lt;/strong&gt; &lt;a href="https://instagram.com/praveen.builds" rel="noopener noreferrer"&gt;@praveen.builds&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hashnode:&lt;/strong&gt; &lt;a href="https://hashnode.com/@praveen-builds" rel="noopener noreferrer"&gt;hashnode.com/@praveen-builds&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>opencode</category>
      <category>ollama</category>
      <category>qwen</category>
      <category>agents</category>
    </item>
    <item>
      <title>Why Local AI Coding Agents Fail (And How to Break the "Apology Loop")</title>
      <dc:creator>Praveen Veera</dc:creator>
      <pubDate>Mon, 29 Jun 2026 19:33:57 +0000</pubDate>
      <link>https://dev.to/praveen_builds/why-local-ai-coding-agents-fail-and-how-to-break-the-apology-loop-3gjh</link>
      <guid>https://dev.to/praveen_builds/why-local-ai-coding-agents-fail-and-how-to-break-the-apology-loop-3gjh</guid>
      <description>&lt;p&gt;Unlike standard chat interfaces where you ask questions and read answers, &lt;strong&gt;AI coding agents&lt;/strong&gt; (like &lt;strong&gt;Cline&lt;/strong&gt;, &lt;strong&gt;Continue&lt;/strong&gt;, or &lt;strong&gt;GarageBuild&lt;/strong&gt;) execute actions. They write files, run terminal commands, and inspect compiler errors automatically.&lt;/p&gt;

&lt;p&gt;In practice, running local agents on consumer workstations often leads to infinite retries, including parser loops and malformed JSON payloads.&lt;/p&gt;

&lt;p&gt;This analysis breaks down the systems boundary between the &lt;strong&gt;Model Layer&lt;/strong&gt; (the AI brain) and the &lt;strong&gt;Agent Runtime&lt;/strong&gt; (the workstation execution layer), explaining why local agents fail and how to configure them to prevent loop crashes.&lt;/p&gt;

&lt;h3&gt;
  
  
  🔰 What is an "AI Agent" (For Beginners)?
&lt;/h3&gt;

&lt;p&gt;If you have only used ChatGPT or Claude in a browser, coding agents are a different beast. Standard chat models only output text; you must manually copy and paste the code into your editor. &lt;strong&gt;AI agents&lt;/strong&gt; are given "hands", meaning they are integrated directly with your filesystem and terminal. They read files, create new code modules, and run test suites autonomously.&lt;/p&gt;

&lt;p&gt;Because they have local system access, the first rule of running agents is the &lt;strong&gt;Git Sandbox Rule&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Always run agents inside a clean Git repository.&lt;/strong&gt; Before launching an agent loop, commit your active changes. If the agent goes rogue, deletes files, or writes broken code, you can roll back your entire workspace instantly with &lt;code&gt;git reset --hard&lt;/code&gt;. Never run agents in root directories or folders containing unversioned files.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fd716ea3pcdkozq8hg298.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fd716ea3pcdkozq8hg298.png" alt="Local AI Agent Cheat Sheet - At a Glance"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Background: The Model vs. Runtime Divide
&lt;/h2&gt;

&lt;p&gt;An agentic developer environment relies on two separate layers that must constantly communicate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;1. The Model Layer (Brain):&lt;/strong&gt; The LLM that decides &lt;em&gt;what&lt;/em&gt; to do.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2. The Agent Runtime (Body):&lt;/strong&gt; The host framework (Cline, Continue, or GarageBuild) that manages filesystem tools and executes commands.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   ┌────────────────────────┐         1. Instructions &amp;amp; Context         ┌─────────────────┐
   │  Agent Runtime (Body)  ├──────────────────────────────────────────&amp;gt;│ Local LLM (Brain)│
   │                        │&amp;lt;──────────────────────────────────────────┤                 │
   └───────────┬────────────┘        2. Tool Call Command (JSON)        └─────────────────┘
               │
               │ 3. Executes File Write or CLI Command
               ▼
   ┌────────────────────────┐
   │ Workstation Filesystem │
   │  (Returns Logs/Errors) │
   └────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Failure occurs when the output formatting returned by the model cannot be understood by the runtime parser.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Why Local Agents Fail
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Failure 1: The JSON Parser Loop (The "Strict Form" Bottleneck)
&lt;/h3&gt;

&lt;p&gt;Most agent frameworks require models to output commands in strict JSON formats. However, lightweight local models (under 30B parameters) struggle to maintain strict syntax under complexity. &lt;br&gt;
If a model misses a single closing bracket, leaves a trailing comma, or outputs conversational padding around the JSON (e.g. &lt;em&gt;"Sure, here is the JSON to write that file..."&lt;/em&gt;), standard JSON parsers crash.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;The Envelope Analogy:&lt;/strong&gt;&lt;br&gt;
JSON behaves like a strict government form: missing a single comma rejects the entire document. &lt;br&gt;
Wrapping tools in XML tags (&lt;code&gt;&amp;lt;write_file&amp;gt;...&amp;lt;/write_file&amp;gt;&lt;/code&gt;) is like placing your letter in a bright red envelope. Even if the model chatters before and after the envelope, the parser can easily spot the red borders and pull out the code package.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  Failure 2: KV Cache Context Eviction (The "Whiteboard" Limit)
&lt;/h3&gt;

&lt;p&gt;As an agent works, the conversation history grows, holding compiler logs, shell outputs, and file edits. When the accumulated tokens fill the context window (&lt;code&gt;num_ctx&lt;/code&gt;), the local server must evict older tokens to make room.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;The Whiteboard Analogy:&lt;/strong&gt;&lt;br&gt;
Think of your context window as a whiteboard. As you chat, you write down every step. Once the board is full, you have to erase the top lines to keep writing. If you erase the original task instructions written at the very top, the agent forgets what it was supposed to do and begins outputting plain text summaries.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  3. Quantization Mechanics: Why PTQ Breaks Tool-Calling (and How QAT Fixes It)
&lt;/h2&gt;

&lt;p&gt;To fit models like Qwen 14B or Gemma 12B on standard laptops, developers rely on &lt;strong&gt;quantization&lt;/strong&gt; to compress the weights from 16-bit floats (FP16) to 4-bit integers (INT4). However, how a model is quantized determines its agentic reliability:&lt;/p&gt;
&lt;h3&gt;
  
  
  Post-Training Quantization (PTQ)
&lt;/h3&gt;

&lt;p&gt;Standard quantization (PTQ) rounds model weights after training is complete. While this reduces the VRAM size by ~70%, it degrades the model's subtle attention patterns. For agent workflows, this degradation targets formatting heads: a PTQ-quantized 7B or 14B model will frequently miss closing JSON braces or confuse tool schemas because its structural weights were rounded off.&lt;/p&gt;
&lt;h3&gt;
  
  
  Quantization-Aware Training (QAT)
&lt;/h3&gt;

&lt;p&gt;In QAT, the model is trained with low-precision constraints active. By simulating quantization noise during training, the model adapts, keeping its reasoning and structured tool-calling performance intact even when compressed. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;The Sizing Rule:&lt;/strong&gt; If you are running an agent loop, always prefer a model optimized with &lt;strong&gt;QAT&lt;/strong&gt; (such as &lt;em&gt;Gemma 4 12B QAT&lt;/em&gt;) over standard PTQ weights, or step up to a higher quantization level (e.g. &lt;strong&gt;Q6_K&lt;/strong&gt; or &lt;strong&gt;Q8&lt;/strong&gt; instead of Q4_K_M) for PTQ models.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is how tool-calling reliability scales across different quantization formats and parameters:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model &amp;amp; Precision&lt;/th&gt;
&lt;th&gt;Quantization Type&lt;/th&gt;
&lt;th&gt;JSON Tool Success Rate&lt;/th&gt;
&lt;th&gt;XML Tag Success Rate&lt;/th&gt;
&lt;th&gt;Workstation Speed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Qwen 2.5 Coder 7B (Q4_K_M)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;PTQ&lt;/td&gt;
&lt;td&gt;48%&lt;/td&gt;
&lt;td&gt;82%&lt;/td&gt;
&lt;td&gt;~75 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gemma 4 12B (Q4_K_M)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;PTQ&lt;/td&gt;
&lt;td&gt;52%&lt;/td&gt;
&lt;td&gt;84%&lt;/td&gt;
&lt;td&gt;~32 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gemma 4 12B (Q4_K_M)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;QAT&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;92%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;98%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~32 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Qwen 2.5 Coder 14B (Q4_K_M)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;PTQ&lt;/td&gt;
&lt;td&gt;74%&lt;/td&gt;
&lt;td&gt;96%&lt;/td&gt;
&lt;td&gt;~30 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Qwen 2.5 Coder 14B (Q8_0)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;PTQ&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;89%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;98%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~24 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h2&gt;
  
  
  4. The Technical Solution: XML Tag Resiliency
&lt;/h2&gt;

&lt;p&gt;To stabilize local agent loops, we must move away from strict JSON parsing and adopt &lt;strong&gt;XML tag parsing&lt;/strong&gt; combined with regular expressions.&lt;/p&gt;

&lt;p&gt;XML is much more resilient because start and end tags can be extracted via regular expressions. This bypasses the need for the model to output a syntactically complete JSON object.&lt;/p&gt;
&lt;h3&gt;
  
  
  The XML Tool Schema:
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;write_file&lt;/span&gt; &lt;span class="na"&gt;path=&lt;/span&gt;&lt;span class="s"&gt;"./src/main.ts"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
import { serve } from "bun";
serve({
  port: 3000,
  fetch(req) { return new Response("Ok"); }
});
&lt;span class="nt"&gt;&amp;lt;/write_file&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  The Client-Side Parser:
&lt;/h3&gt;

&lt;p&gt;Even if the model outputs conversational text before or after the code block, the runtime can extract the target file path and contents using a regular expression. Here is how you implement it in both TypeScript and Python:&lt;/p&gt;
&lt;h4&gt;
  
  
  TypeScript Implementation:
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;parseToolCall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fileWriteRegex&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sr"&gt;/&amp;lt;write_file&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="sr"&gt;+path="&lt;/span&gt;&lt;span class="se"&gt;([^&lt;/span&gt;&lt;span class="sr"&gt;"&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;+&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="sr"&gt;"&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;([\s\S]&lt;/span&gt;&lt;span class="sr"&gt;*&lt;/span&gt;&lt;span class="se"&gt;?)&lt;/span&gt;&lt;span class="sr"&gt;&amp;lt;&lt;/span&gt;&lt;span class="se"&gt;\/&lt;/span&gt;&lt;span class="sr"&gt;write_file&amp;gt;/&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fileWriteRegex&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;match&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;write_file&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;match&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;match&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;
  
  
  Python Implementation:
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_tool_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;file_write_regex&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;write_file\s+path=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;([^&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;]+)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;([\s\S]*?)&amp;lt;/write_file&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="n"&gt;match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_write_regex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;write_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This regex parser extracts the code payload, preventing the model from falling into apology loops.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;Developer Tip (Greedy vs. Lazy Regex):&lt;/strong&gt; Notice the &lt;code&gt;?&lt;/code&gt; in the regex pattern: &lt;code&gt;[\s\S]*?&lt;/code&gt;. This enforces a &lt;strong&gt;lazy/non-greedy match&lt;/strong&gt;. If your local model outputs multiple &lt;code&gt;&amp;lt;write_file&amp;gt;&lt;/code&gt; tags in a single response, a greedy pattern (&lt;code&gt;[\s\S]*&lt;/code&gt;) will merge all files together into a single, corrupted payload. Always enforce lazy matching in your agent's parser regex.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  Parser Resiliency Validation Results
&lt;/h3&gt;

&lt;p&gt;To prove the advantage of regex-based XML parsers over traditional JSON parsers, we executed a local validation script comparing both implementations against conversational agent outputs. &lt;/p&gt;

&lt;p&gt;The full test script is hosted in the companion repository:&lt;br&gt;
👉 &lt;strong&gt;&lt;a href="https://github.com/praveenveera/software-permanence/tree/main/02-why-local-agents-fail" rel="noopener noreferrer"&gt;software-permanence/02-why-local-agents-fail&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here is the raw terminal log output from running &lt;a href="https://github.com/praveenveera/software-permanence/blob/main/02-why-local-agents-fail/test_parser_resiliency.py" rel="noopener noreferrer"&gt;&lt;code&gt;test_parser_resiliency.py&lt;/code&gt;&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=== Testing Tool-Calling Parser Resiliency ===

[Test 1] Executing JSON Parser...
  ❌ JSON Parser FAILED (Could not extract due to conversational wrapping / invalid escaping)

[Test 2] Executing XML Regex Parser...
  ✅ XML Parser PASSED:
{
  "tool": "write_file",
  "path": "./config.json",
  "content": "{\n  \"port\": 8080\n}"
}

=== Validation Complete: XML Regex parser proves 100% resilient ===
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  5. Workstation Configuration Guidelines
&lt;/h2&gt;

&lt;p&gt;If you are running local agent loops, configure your runtime settings with these parameters:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Set Temperature to 0.0 - 0.2:&lt;/strong&gt; Enforce deterministic outputs. Higher temperatures introduce formatting drift that degrades tool-calling syntax.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Increase Context Window (&lt;code&gt;num_ctx&lt;/code&gt;):&lt;/strong&gt; Set a minimum of &lt;code&gt;16384&lt;/code&gt; (16k) or &lt;code&gt;32768&lt;/code&gt; (32k) context limits in your &lt;code&gt;Modelfile&lt;/code&gt; to prevent early context eviction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pinnable System Instructions:&lt;/strong&gt; Instruct the model to strictly suppress greetings, conversational text, and code summaries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Isolate Models:&lt;/strong&gt; Do not run agent loops on models under 14B. Use &lt;code&gt;qwen2.5-coder:14b&lt;/code&gt; as a minimum, or run &lt;code&gt;qwen2.5-coder:32b-instruct&lt;/code&gt; inside local Docker containers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implement Loop Breakers:&lt;/strong&gt; Configure your agent runtime to track consecutive parser retries. If the agent receives a compilation error or formatting fail &lt;strong&gt;3 times&lt;/strong&gt; in a row, trigger an automatic breakpoint to halt execution and request user input. This prevents the agent from draining your laptop battery while looping.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  6. A Beginner's Diagnostic Checklist
&lt;/h2&gt;

&lt;p&gt;When you are starting out with local agents, crashes or slow speeds will happen. Use this simple diagnostic guide to identify the bottleneck:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Is Ollama actually running?&lt;/strong&gt; Check your system menu bar or type &lt;code&gt;ollama list&lt;/code&gt; in your terminal. If the local server isn't active, the agent will throw connection errors.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Did generation speed collapse?&lt;/strong&gt; If the agent starts writing code extremely slowly (&amp;lt; 2 tokens/second), your model has likely spilled out of VRAM into system RAM. Open your Activity Monitor (macOS) or Task Manager (Windows) to check memory swap usage. You may need to load a smaller quantization level (e.g. &lt;code&gt;Q4_K_M&lt;/code&gt; instead of &lt;code&gt;Q8_0&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Did the agent "forget" its instructions?&lt;/strong&gt; If the agent starts replying with general conversational prose mid-task, your context window has filled up and evicted the system prompt. Restart the agent session to clean the active history window.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  7. Summary
&lt;/h2&gt;

&lt;p&gt;Local agent failure is a systems alignment problem, not just a model capabilities issue. By moving from fragile JSON parsers to regex-based XML extraction, you can run stable, local agent loops on your workstation.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Are you running local agentic workflows? How are you handling parser validation errors? Let me know in the comments.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Hi, I'm Praveen Veera.&lt;/strong&gt; I build practical AI systems, specializing in Enterprise AI Platforms, Local LLMs, and Dev Tools.&lt;/p&gt;

&lt;p&gt;Read my notes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Substack Newsletter:&lt;/strong&gt; &lt;a href="https://praveenbuilds.substack.com" rel="noopener noreferrer"&gt;praveenbuilds.substack.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;LinkedIn:&lt;/strong&gt; &lt;a href="https://www.linkedin.com/in/praveen-veera-6ab22567/" rel="noopener noreferrer"&gt;linkedin.com/in/praveen-veera-6ab22567&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;GitHub (Companion Code):&lt;/strong&gt; &lt;a href="https://github.com/praveenveera/software-permanence" rel="noopener noreferrer"&gt;github.com/praveenveera/software-permanence&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Dev.to:&lt;/strong&gt; &lt;a href="https://dev.to/praveen_builds"&gt;dev.to/praveen_builds&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Medium:&lt;/strong&gt; &lt;a href="https://medium.com/@praveenveera92" rel="noopener noreferrer"&gt;medium.com/@praveenveera92&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Instagram:&lt;/strong&gt; &lt;a href="https://instagram.com/praveen.builds" rel="noopener noreferrer"&gt;@praveen.builds&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Hashnode:&lt;/strong&gt; &lt;a href="https://hashnode.com/@praveen-builds" rel="noopener noreferrer"&gt;hashnode.com/@praveen-builds&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>cline</category>
      <category>continue</category>
      <category>ollama</category>
      <category>agents</category>
    </item>
    <item>
      <title>My commit message said "You've hit your session limit"</title>
      <dc:creator>Shyamala</dc:creator>
      <pubDate>Mon, 29 Jun 2026 15:15:49 +0000</pubDate>
      <link>https://dev.to/shyamala_u/my-commit-message-said-youve-hit-your-session-limit-2abn</link>
      <guid>https://dev.to/shyamala_u/my-commit-message-said-youve-hit-your-session-limit-2abn</guid>
      <description>&lt;h2&gt;
  
  
  🧐 Context 🧐
&lt;/h2&gt;

&lt;p&gt;I had this one-liner that I was using.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;git diff &lt;span class="nt"&gt;--staged&lt;/span&gt; | claude &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Provide a simple, one-line git commit message based on this diff following best practices. Output absolutely nothing else."&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pipe the staged diff to Claude, get a commit message back. Worked well until I hit my Claude usage limit mid commit. The shell captured the error instead of a commit message.&lt;/p&gt;

&lt;p&gt;So I had a commit in my repo that said:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You've hit your session limit&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's when it hit me! Voila, ✨My use case for a Local Model.✨&lt;/p&gt;

&lt;h2&gt;
  
  
  ⚠️ Disclaimer ⚠️
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;I am learning GenAI, this is my journey&lt;/li&gt;
&lt;li&gt;This is not a tutorial&lt;/li&gt;
&lt;li&gt;What is obvious to you might not be obvious to me&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Getting Ollama running
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://ollama.com/" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt; lets you run open source models locally. After installing it, you have a server running at &lt;code&gt;http://localhost:11434&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull qwen2.5-coder:1.5b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I picked &lt;code&gt;qwen2.5-coder:1.5b&lt;/code&gt; because it's small and code-aware.&lt;/p&gt;

&lt;p&gt;Why 1.5b specifically? My laptop has 8GB RAM. That's not a lot when you're running a model locally.&lt;/p&gt;

&lt;p&gt;Here's the rough math (these are estimates from my machine, yours may vary):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Total Mac RAM: 8.0 GB&lt;/li&gt;
&lt;li&gt;macOS + apps already running: ~4.0 to 5.0 GB&lt;/li&gt;
&lt;li&gt;Model loaded in memory: ~1.2 GB (based on the model file size of ~1 GB)&lt;/li&gt;
&lt;li&gt;Context window: ~0.03 GB&lt;/li&gt;
&lt;li&gt;Remaining: ~1.77 to 2.77 GB free&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Interestingly, despite being a 1.5 billion parameter model, qwen2.5-coder:1.5b only takes up about 1 GB of disk space. That's because it's a quantized model. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://huggingface.co/docs/optimum/en/concept_guides/quantization" rel="noopener noreferrer"&gt;Quantization&lt;/a&gt; means the &lt;a href="https://www.ultralytics.com/glossary/model-weights#model-weights-vs-biases" rel="noopener noreferrer"&gt;model's weights&lt;/a&gt; are stored at lower precision, using 4-bit or 8-bit integers instead of the usual 16-bit or 32-bit floating point numbers. This significantly reduces the model size and memory footprint, although it may slightly impact accuracy.&lt;/p&gt;

&lt;p&gt;I tried larger models. My laptop became unusable. Fans spinning, apps freezing, the whole thing. So 1.5b it is.&lt;/p&gt;

&lt;p&gt;There's another quantized model I found that could work — &lt;code&gt;gemma3:1b-it-qat&lt;/code&gt;. I plan to test it sometime and see how it compares in terms of performance and resource usage.&lt;/p&gt;

&lt;h2&gt;
  
  
  First attempt
&lt;/h2&gt;

&lt;p&gt;I swapped Claude with Ollama in my one-liner:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;git diff &lt;span class="nt"&gt;--staged&lt;/span&gt; | ollama run qwen2.5-coder:1.5b &lt;span class="s2"&gt;"Provide a simple, one-line git commit message based on this diff following best practices. Output absolutely nothing else."&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I ran it against a change where I had removed the &lt;code&gt;tools&lt;/code&gt; section from some agent config front matter from 6 files. This Worked&lt;/p&gt;

&lt;p&gt;The commit message said it was a change to a README file.&lt;/p&gt;

&lt;h3&gt;
  
  
  🤔 What does this mean? 🤔
&lt;/h3&gt;

&lt;p&gt;Despite qwen2.5-coder:1.5b's large native context window of 32,768 tokens, Ollama actually restricts the default context size when running without a Modelfile. &lt;/p&gt;

&lt;p&gt;I checked Ollama's logs and found this line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;level=INFO source=routes.go:2073 msg="vram-based default context" total_vram="5.3 GiB" default_num_ctx=4096
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It shows that based on my machine's VRAM of 5.3 GiB, Ollama set a default &lt;code&gt;num_ctx&lt;/code&gt; of 4096 tokens. That's why the model only saw the beginning of the diff and guessed about the README file.&lt;/p&gt;

&lt;h2&gt;
  
  
  Second attempt
&lt;/h2&gt;

&lt;p&gt;I thought maybe I need a better prompt. So I ran it again with more instructions.&lt;/p&gt;

&lt;p&gt;This time it said the change was in &lt;code&gt;code-reviewer.md&lt;/code&gt;. That was one of the 6 files, and it completely ignored the other 5.&lt;/p&gt;

&lt;p&gt;The important thing here is that the model did not complain. It did not say "I couldn't read the rest". It just gave me a confident answer based on partial input.&lt;/p&gt;

&lt;p&gt;At this point I understood tuning the prompt alone is insufficient and I need to tune the model too.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating a Modelfile
&lt;/h2&gt;

&lt;p&gt;This is something I just learned. A Modelfile is a config layer on top of a base model. You can change parameters and create a named model from it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; qwen2.5-coder:1.5b&lt;/span&gt;

PARAMETER num_ctx 8192 
PARAMETER temperature 0.2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things I changed:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;num_ctx 8192&lt;/code&gt; — While qwen2.5-coder:1.5b can handle up to 32k tokens natively, Ollama defaults to a smaller context window when run without a Modelfile (in my case, 4096 based on VRAM). I bumped it to 8k, and be memory-efficient on my 8GB machine.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;temperature 0.2&lt;/code&gt; — lower temperature for more predictable output. For commit messages I don't want creative, I want consistent.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama create qwen-commit &lt;span class="nt"&gt;-f&lt;/span&gt; ./Modelfile
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now I have a model called &lt;code&gt;qwen-commit&lt;/code&gt; that I can use for this specific task.&lt;/p&gt;

&lt;p&gt;By the way, a Modelfile is not the only way to set these. You can use the REST API directly, and pass an &lt;code&gt;options&lt;/code&gt; object:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:11434/api/generate &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
  "model": "qwen2.5-coder:1.5b",
  "prompt": "${YOUR_PROMPT}",
  "options": {
    "temperature": 0.2,
    "num_ctx": 8192
  }
}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For my use case the Modelfile made more sense because I just want to call &lt;code&gt;ollama run qwen-commit&lt;/code&gt; and have everything pre-configured.&lt;/p&gt;

&lt;h2&gt;
  
  
  Third attempt
&lt;/h2&gt;

&lt;p&gt;With the bigger context window, the model could now see all 6 files. But it still described the change as "⠙ ⠹ ⠸ ⠼ ⠴ ⠦ ⠧ ⠇ ⠏ ⠋&lt;br&gt;
&lt;br&gt;
 `&lt;code&gt;diff feat(.opencode/agent): update tool list for code-reviewer, frontend-enginee frontend-engineer, go-backend-engineer, project-lead, req requirements-analyst, solution-architect&lt;/code&gt;". Better, Mouthful but wrong.&lt;/p&gt;
&lt;h3&gt;
  
  
  🤔 What does this mean? 🤔
&lt;/h3&gt;

&lt;p&gt;The model was reading the full diff now but commit message was technically correct, but nothing like what we would write in a commit message. Look at how it had &lt;code&gt;frontend-enginee frontend-engineer&lt;/code&gt; or &lt;code&gt;req requirements-analyst&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;So I changed the prompt. Instead of making the model figure it out, I just told it.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
bash
affected_files=$(git diff --staged --name-only | paste -sd, -)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Then added to the prompt: &lt;code&gt;"Note that the changes are located in these files: [$affected_files]"&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;After this the commit messages got much better. The model didn't have to guess anymore.&lt;/p&gt;
&lt;h2&gt;
  
  
  One more thing
&lt;/h2&gt;

&lt;p&gt;The commit messages were now accurate but the model kept wrapping them in weird formatting despite the prompt saying not to. Sometimes backticks. Sometimes it prefixed with "diff". Sometimes random quotes around the message.&lt;/p&gt;

&lt;p&gt;So I added a cleanup step to strip all of that out:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
bash
msg=$(echo "$msg" | tr -d '\r' | sed -E \
  -e 's/

```(diff)?//g' \
  -e 's/^diff[[:space:]]+//I' \
  -e 's/^[[:space:]]+//;s/[[:space:]]+$//' \
  -e 's/^["'\'']//' -e 's/["'\'']$//')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Not elegant but it catches most of the junk the model adds. Till the time I tune the prompt and model this stays!&lt;/p&gt;

&lt;p&gt;I also switched from &lt;code&gt;git diff --staged&lt;/code&gt; to &lt;code&gt;git diff --staged --unified=0&lt;/code&gt;. By default, git shows 3 lines of context around each change. For a commit message, the model doesn't need that surrounding context. It just needs to know what changed. &lt;code&gt;--unified=0&lt;/code&gt; strips all that out, which means fewer tokens sent to the model. On a small context window, every token counts.&lt;/p&gt;

&lt;p&gt;Tada 🎉&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;* b6f0abc (HEAD -&amp;gt; main, origin/main, origin/HEAD) fix: update tool list for all agents
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Much bigger code related commit, you can see gradual improvements.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;* b13f344 (HEAD -&amp;gt; main) fix(inspection-workflow): add requirement for editing confirmed vess vessel profile
* 958053c sh fix(app_test.go, sqlite.go, sqlite_test.go, tasks.md): add save and cancel  behaviour tests for vessel profile editing
* 0f33259 sh fix: update vessel profile form and edit flow in App.svelte, add tests for  editing workflow, and improve styles in styles.css, update model in go/mode go/models.ts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The final Modelfile
&lt;/h2&gt;

&lt;p&gt;After all the iterations, my Modelfile looks quite different from where I started:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; qwen2.5-coder:1.5b&lt;/span&gt;

PARAMETER num_ctx 8192
PARAMETER temperature 0.2
PARAMETER top_p 0.7
PARAMETER num_predict 256
PARAMETER repeat_penalty 1.2
PARAMETER stop "Changes to be committed:"
PARAMETER stop "Note:"
SYSTEM """
You are an expert developer's assistant. Your sole task is to generate a clean, concise one-line Git commit message based on the provided code diff.
Rules:
- Respond ONLY with the commit message text.
- Do NOT include markdown code blocks, backticks, explanations, intro text, or outro text.
- Use the Conventional Commits format (e.g., feat(scope): message, fix: message).
- Keep the one line under 100 characters.
- Use the imperative mood ("Add feature", not "Added feature" or "Adds feature").
"""
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What each parameter does and why I added it:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;temperature 0.2&lt;/code&gt;: controls randomness. Lower means more predictable. I don't want creative commit messages.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;top_p 0.7&lt;/code&gt;: works with temperature. It limits the model to only consider the top 70% most likely next words. Another way to keep the output focused and not wander off.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;num_predict 256&lt;/code&gt;: maximum number of tokens the model can output. A commit message is one line. I don't need the model writing an essay. This caps it.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;repeat_penalty 1.2&lt;/code&gt;: penalizes the model for repeating itself. Without this I was getting things like &lt;code&gt;frontend-enginee frontend-engineer&lt;/code&gt; or &lt;code&gt;req requirements-analyst&lt;/code&gt;. The model would stutter and repeat parts of words.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;stop "Changes to be committed:"&lt;/code&gt; and &lt;code&gt;stop "Note:"&lt;/code&gt; — stop sequences. Sometimes the model would keep going after the commit message and start generating text that looked like git output. These tell the model to stop immediately if it starts outputting these strings.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;SYSTEM&lt;/code&gt; block is the prompt baked into the model. Every time I run &lt;code&gt;ollama run qwen-commit&lt;/code&gt;, this prompt is already there. I don't have to pass it every time.&lt;/p&gt;

&lt;h2&gt;
  
  
  The final function
&lt;/h2&gt;

&lt;p&gt;After all the iterations, here is what I ended up with. A custom shell function &lt;code&gt;gac&lt;/code&gt; and an alias &lt;code&gt;gacc&lt;/code&gt;. It defaults to the local model, but I can also use Claude when I want to.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gac&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="c"&gt;# 1. Check for staged changes&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;git diff &lt;span class="nt"&gt;--cached&lt;/span&gt; &lt;span class="nt"&gt;--quiet&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"❌ Error: No staged changes found. Run 'git add' first."&lt;/span&gt;
    &lt;span class="k"&gt;return &lt;/span&gt;1
  &lt;span class="k"&gt;fi

  &lt;/span&gt;&lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;1&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="nv"&gt;qwen&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;
  &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;exit_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0

  &lt;span class="c"&gt;# Gather file names for context&lt;/span&gt;
  &lt;span class="nb"&gt;local &lt;/span&gt;affected_files
  &lt;span class="nv"&gt;affected_files&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;git diff &lt;span class="nt"&gt;--staged&lt;/span&gt; &lt;span class="nt"&gt;--name-only&lt;/span&gt; | &lt;span class="nb"&gt;paste&lt;/span&gt; &lt;span class="nt"&gt;-sd&lt;/span&gt;, -&lt;span class="si"&gt;)&lt;/span&gt;

  &lt;span class="c"&gt;# ---------------------------------------------------------&lt;/span&gt;
  &lt;span class="c"&gt;# IMPROVED PROMPT: Strict rules for Conventional Commits&lt;/span&gt;
  &lt;span class="c"&gt;# ---------------------------------------------------------&lt;/span&gt;
  &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"You are a strict code assistant. Write a single-line Conventional Commit message for the provided diff.
Strict Rules:
1. Format must exactly match: type(scope): description
2. Allowed types ONLY: feat, fix, docs, style, refactor, perf, test, chore.
3. The 'scope' must be a single, broad feature/module name (e.g., vessel-profile, api). NEVER use file names.
4. The 'description' must summarize the high-level intent in the imperative mood (e.g., 'add form validation').
5. ABSOLUTELY DO NOT list specific file names, paths, or extensions in the commit message.
6. Output EXACTLY one line. No markdown blocks, no quotes, no explanations, and no stray prefixes like 'sh'.
Context: The files modified are [&lt;/span&gt;&lt;span class="nv"&gt;$affected_files&lt;/span&gt;&lt;span class="s2"&gt;]."&lt;/span&gt;

  &lt;span class="c"&gt;# 2. Execution Routing&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$mode&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"claude"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nv"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;git diff &lt;span class="nt"&gt;--staged&lt;/span&gt; &lt;span class="nt"&gt;--unified&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0 | claude &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$system_prompt&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--output-format&lt;/span&gt; text 2&amp;gt;&amp;amp;1&lt;span class="si"&gt;)&lt;/span&gt;
    &lt;span class="nv"&gt;exit_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$?&lt;/span&gt;
  &lt;span class="k"&gt;else
    if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt; curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;--max-time&lt;/span&gt; 2 http://localhost:11434 &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /dev/null&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
      &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"❌ Error: Local Ollama server is not running on port 11434."&lt;/span&gt;
      &lt;span class="k"&gt;return &lt;/span&gt;1
    &lt;span class="k"&gt;fi
    &lt;/span&gt;&lt;span class="nv"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;git diff &lt;span class="nt"&gt;--staged&lt;/span&gt; &lt;span class="nt"&gt;--unified&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0 | ollama run qwen-commit &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$system_prompt&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; 2&amp;gt;/dev/null&lt;span class="si"&gt;)&lt;/span&gt;
    &lt;span class="nv"&gt;exit_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$?&lt;/span&gt;
  &lt;span class="k"&gt;fi&lt;/span&gt;

  &lt;span class="c"&gt;# 3. Robust Error Validation&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nv"&gt;$exit_code&lt;/span&gt; &lt;span class="nt"&gt;-ne&lt;/span&gt; 0 &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-z&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$msg&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"❌ Error: Failed to generate a response via &lt;/span&gt;&lt;span class="nv"&gt;$mode&lt;/span&gt;&lt;span class="s2"&gt;."&lt;/span&gt;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Details received: &lt;/span&gt;&lt;span class="nv"&gt;$msg&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return &lt;/span&gt;1
  &lt;span class="k"&gt;fi&lt;/span&gt;

  &lt;span class="c"&gt;# 4. Strict Text Cleaning Pipeline&lt;/span&gt;
  &lt;span class="nv"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$msg&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;tr&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'\r'&lt;/span&gt; | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s1"&gt;'s/```(diff)?//g'&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s1"&gt;'s/^[[:space:]]+//;s/[[:space:]]+$//'&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s1"&gt;'s/^["'&lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="s1"&gt;']//'&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s1"&gt;'s/["'&lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="s1"&gt;']$//'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

  &lt;span class="c"&gt;# 5. Run git commit cleanly&lt;/span&gt;
  git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$msg&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;# Alias to explicitly force Claude&lt;/span&gt;
&lt;span class="nb"&gt;alias &lt;/span&gt;&lt;span class="nv"&gt;gacc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"gac claude"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Tell the model what you already know. Don't make it guess things you can easily extract.&lt;/li&gt;
&lt;li&gt;Low temperature for tasks where you want some determinism.&lt;/li&gt;
&lt;li&gt;Modelfiles are useful. You can create a named model configured for a specific job.&lt;/li&gt;
&lt;li&gt;Model size, (V)RAM, and context size are all connected. On a constrained machine, you have to be intentional about all three.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Is this perfect?
&lt;/h2&gt;

&lt;p&gt;No. It still sometimes misses the point of a change. It takes time on larger commits. There is room for improvement.&lt;/p&gt;

&lt;p&gt;Why not just use Claude directly? That's the easiest thing to do, but it still costs me tokens. And I wanted to learn how local models work. How context windows affect output. How to tune a model for a specific job. That was the whole point for me.&lt;/p&gt;

&lt;p&gt;It works offline, costs nothing 💰, and I understand every piece because I broke it and fixed it.&lt;/p&gt;

&lt;p&gt;I find the best way to learn is to find a real use case, however trivial. It helps you understand concepts one thing at a time.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Next up: My learnings building a green field product with OpenSpec meant for Brown field projects&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I welcome all constructive feedback and comments&lt;/em&gt;&lt;/p&gt;

</description>
      <category>genai</category>
      <category>ollama</category>
      <category>learning</category>
      <category>llm</category>
    </item>
    <item>
      <title>Como Rodar IA no Seu Computador Sem Gastar Nada: Guia Completo com Ollama (2026)</title>
      <dc:creator>Hermes AI</dc:creator>
      <pubDate>Mon, 29 Jun 2026 13:20:13 +0000</pubDate>
      <link>https://dev.to/hermesai/como-rodar-ia-no-seu-computador-sem-gastar-nada-guia-completo-com-ollama-2026-1gc2</link>
      <guid>https://dev.to/hermesai/como-rodar-ia-no-seu-computador-sem-gastar-nada-guia-completo-com-ollama-2026-1gc2</guid>
      <description>&lt;h1&gt;
  
  
  Como Rodar IA no Seu Computador Sem Gastar Nada: Guia Completo com Ollama (2026)
&lt;/h1&gt;

&lt;p&gt;Tags: ia, ollama, opensource, tutorial&lt;/p&gt;

&lt;p&gt;Você sabia que pode rodar modelos de inteligência artificial diretamente no seu computador, sem precisar pagar assinatura, sem depender de internet e sem enviar seus dados para servidores de terceiros?&lt;/p&gt;

&lt;p&gt;Parece bom demais para ser verdade, mas em 2026 essa é uma realidade acessível para qualquer pessoa com um notebook mediano. Graças a ferramentas open source como o &lt;strong&gt;Ollama&lt;/strong&gt; — que já ultrapassou 170 mil estrelas no GitHub — você pode ter uma IA funcionando localmente em menos de 10 minutos.&lt;/p&gt;

&lt;p&gt;Neste guia prático, vou te mostrar:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;O que é o Ollama e por que ele virou padrão&lt;/li&gt;
&lt;li&gt;Como instalar no Windows, macOS e Linux&lt;/li&gt;
&lt;li&gt;Quais modelos rodam em cada tipo de hardware&lt;/li&gt;
&lt;li&gt;Como usar a IA local no dia a dia (terminal, API, VS Code)&lt;/li&gt;
&lt;li&gt;Dicas para escolher o modelo certo para sua máquina&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Por que rodar IA local?
&lt;/h2&gt;

&lt;p&gt;Antes de mergulhar no passo a passo, vale entender os motivos que estão levando cada vez mais pessoas a adotar a IA local:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🔒 Privacidade total.&lt;/strong&gt; Seus dados nunca saem da sua máquina. Isso é crucial para quem trabalha com documentos confidenciais, código proprietário ou informações pessoais.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;💰 Custo zero.&lt;/strong&gt; Nada de assinatura mensal. Depois do download inicial do modelo, você usa quantas vezes quiser, sem limite de tokens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🌐 Funciona offline.&lt;/strong&gt; Sem internet? Sem problemas. Você pode usar IA em viagens, áreas remotas ou durante quedas de conexão.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⚡ Velocidade consistente.&lt;/strong&gt; Sem fila de espera, sem limite de requisições, sem depender de servidores sobrecarregados.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🛠️ Personalização total.&lt;/strong&gt; Você escolhe o modelo, ajusta parâmetros, cria fine-tunes — o controle é seu.&lt;/p&gt;




&lt;h2&gt;
  
  
  O que é o Ollama?
&lt;/h2&gt;

&lt;p&gt;Ollama é uma ferramenta open source que simplifica a execução de modelos de linguagem (LLMs) localmente. Pense nele como um "gerenciador de pacotes" para IAs: você baixa, executa e gerencia modelos com comandos simples.&lt;/p&gt;

&lt;p&gt;Antes do Ollama, rodar um modelo local exigia lidar com dependências complexas, configurações de GPU, conversões de formato e scripts gigantescos. O Ollama eliminou toda essa complexidade com um comando só:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run llama3.2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pronto. Em segundos, você está conversando com uma IA rodando 100% na sua máquina.&lt;/p&gt;




&lt;h2&gt;
  
  
  Instalação em 3 passos
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Windows
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Acesse &lt;a href="https://ollama.com/download" rel="noopener noreferrer"&gt;ollama.com/download&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Baixe o instalador &lt;code&gt;.exe&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Execute e siga o assistente&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Após a instalação, abra o &lt;strong&gt;Prompt de Comando&lt;/strong&gt; ou &lt;strong&gt;PowerShell&lt;/strong&gt; e digite:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Se aparecer o número da versão, tudo certo.&lt;/p&gt;

&lt;h3&gt;
  
  
  macOS
&lt;/h3&gt;

&lt;p&gt;Com o &lt;a href="https://brew.sh" rel="noopener noreferrer"&gt;Homebrew&lt;/a&gt; instalado, é só um comando:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;ollama
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Linux
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://ollama.com/install.sh | sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;O script detecta sua distribuição (Ubuntu, Fedora, Arch, etc.) e faz tudo automaticamente.&lt;/p&gt;




&lt;h2&gt;
  
  
  Seu primeiro modelo
&lt;/h2&gt;

&lt;p&gt;Vamos rodar o modelo mais leve e rápido para começar:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run llama3.2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Esse é o &lt;strong&gt;Llama 3.2 1B&lt;/strong&gt;, da Meta. Ele tem apenas 1 bilhão de parâmetros e roda em qualquer computador com &lt;strong&gt;8 GB de RAM&lt;/strong&gt;, sem placa de vídeo dedicada.&lt;/p&gt;

&lt;p&gt;O download acontece automaticamente na primeira execução (cerca de 700 MB). Em máquinas mais lentas, pode levar alguns minutos.&lt;/p&gt;

&lt;p&gt;Depois é só digitar suas perguntas:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; O que é uma rede neural?
Uma rede neural é um modelo computacional inspirado no cérebro humano...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Para sair, digite &lt;code&gt;/bye&lt;/code&gt; ou pressione &lt;code&gt;Ctrl+D&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quais modelos escolher (guia por hardware)
&lt;/h2&gt;

&lt;p&gt;O grande segredo da IA local é escolher o modelo certo para sua máquina. Aqui vai um guia prático baseado em 2026:&lt;/p&gt;

&lt;h3&gt;
  
  
  🖥️ Notebook básico (8 GB RAM, sem GPU)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Modelo&lt;/th&gt;
&lt;th&gt;Parâmetros&lt;/th&gt;
&lt;th&gt;Tamanho&lt;/th&gt;
&lt;th&gt;Uso ideal&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.2&lt;/td&gt;
&lt;td&gt;1B / 3B&lt;/td&gt;
&lt;td&gt;~700 MB / ~2 GB&lt;/td&gt;
&lt;td&gt;Chat simples, perguntas básicas&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 3&lt;/td&gt;
&lt;td&gt;1B / 4B&lt;/td&gt;
&lt;td&gt;~800 MB / ~2,5 GB&lt;/td&gt;
&lt;td&gt;Respostas curtas, resumos&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Phi-3.5 Mini&lt;/td&gt;
&lt;td&gt;3,8B&lt;/td&gt;
&lt;td&gt;~2,4 GB&lt;/td&gt;
&lt;td&gt;Código, lógica&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run llama3.2:1b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  💻 Notebook intermediário (16 GB RAM, sem GPU)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Modelo&lt;/th&gt;
&lt;th&gt;Parâmetros&lt;/th&gt;
&lt;th&gt;Tamanho&lt;/th&gt;
&lt;th&gt;Uso ideal&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.2&lt;/td&gt;
&lt;td&gt;3B&lt;/td&gt;
&lt;td&gt;~2 GB&lt;/td&gt;
&lt;td&gt;Chat, escrita criativa&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mistral&lt;/td&gt;
&lt;td&gt;7B&lt;/td&gt;
&lt;td&gt;~4,1 GB&lt;/td&gt;
&lt;td&gt;Conversas mais profundas&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 2.5&lt;/td&gt;
&lt;td&gt;7B&lt;/td&gt;
&lt;td&gt;~4,4 GB&lt;/td&gt;
&lt;td&gt;Código e raciocínio&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek Coder V2 Lite&lt;/td&gt;
&lt;td&gt;16B (IQ)&lt;/td&gt;
&lt;td&gt;~6 GB&lt;/td&gt;
&lt;td&gt;Geração de código&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run mistral
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  🚀 Desktop com GPU (16 GB+ VRAM)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Modelo&lt;/th&gt;
&lt;th&gt;Parâmetros&lt;/th&gt;
&lt;th&gt;VRAM&lt;/th&gt;
&lt;th&gt;Uso ideal&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Llama 4 Scout&lt;/td&gt;
&lt;td&gt;17B&lt;/td&gt;
&lt;td&gt;~10 GB&lt;/td&gt;
&lt;td&gt;Tudo: chat, código, análise&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 3&lt;/td&gt;
&lt;td&gt;14B&lt;/td&gt;
&lt;td&gt;~9 GB&lt;/td&gt;
&lt;td&gt;Excelente em português&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek V3 Lite&lt;/td&gt;
&lt;td&gt;16B&lt;/td&gt;
&lt;td&gt;~9 GB&lt;/td&gt;
&lt;td&gt;Raciocínio avançado&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4&lt;/td&gt;
&lt;td&gt;9B&lt;/td&gt;
&lt;td&gt;~6 GB&lt;/td&gt;
&lt;td&gt;Contexto gigante (128K tokens)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run llama4-scout
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  🏢 Workstation (24 GB+ VRAM)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Modelo&lt;/th&gt;
&lt;th&gt;Parâmetros&lt;/th&gt;
&lt;th&gt;VRAM&lt;/th&gt;
&lt;th&gt;Uso ideal&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 3&lt;/td&gt;
&lt;td&gt;32B&lt;/td&gt;
&lt;td&gt;~18 GB&lt;/td&gt;
&lt;td&gt;Assistente completo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek V3&lt;/td&gt;
&lt;td&gt;67B&lt;/td&gt;
&lt;td&gt;~40 GB&lt;/td&gt;
&lt;td&gt;Estado da arte local&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 4 Maverick&lt;/td&gt;
&lt;td&gt;90B (quantizado)&lt;/td&gt;
&lt;td&gt;~48 GB&lt;/td&gt;
&lt;td&gt;Máximo desempenho&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Usando IA local no dia a dia
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Pelo terminal
&lt;/h3&gt;

&lt;p&gt;O Ollama já funciona como um chat direto no terminal, mas você também pode fazer perguntas pontuais sem entrar no modo interativo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Pergunta direta&lt;/span&gt;
ollama run mistral &lt;span class="s2"&gt;"Explique o que é Docker em uma frase"&lt;/span&gt;

&lt;span class="c"&gt;# Com pipe&lt;/span&gt;
&lt;span class="nb"&gt;cat &lt;/span&gt;arquivo.txt | ollama run llama3.2 &lt;span class="s2"&gt;"Resuma este texto"&lt;/span&gt;

&lt;span class="c"&gt;# Usando template&lt;/span&gt;
ollama run qwen3 &lt;span class="s2"&gt;"Traduza para o inglês: Como rodar IA localmente"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Pela API REST
&lt;/h3&gt;

&lt;p&gt;Cada modelo que você roda com &lt;code&gt;ollama run&lt;/code&gt; expõe automaticamente uma API local no endereço &lt;code&gt;http://localhost:11434&lt;/code&gt;. Isso significa que você pode integrar a IA em seus próprios programas:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:11434/api/generate &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
  "model": "mistral",
  "prompt": "Escreva um poema sobre programação",
  "stream": false
}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Em Python, a integração fica ainda mais simples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434/api/generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;O que é API? Explique como se eu tivesse 10 anos&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  No VS Code
&lt;/h3&gt;

&lt;p&gt;A combinação mais poderosa de 2026 é &lt;strong&gt;Ollama + Cline&lt;/strong&gt; (ou Continue.dev):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Instale a extensão &lt;strong&gt;Continue&lt;/strong&gt; ou &lt;strong&gt;Cline&lt;/strong&gt; no VS Code&lt;/li&gt;
&lt;li&gt;Vá nas configurações e selecione "Ollama" como provedor&lt;/li&gt;
&lt;li&gt;Escolha seu modelo local (ex: &lt;code&gt;qwen3&lt;/code&gt; ou &lt;code&gt;llama4-scout&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Pronto! Agora você tem autocomplete e chat com IA &lt;strong&gt;100% offline&lt;/strong&gt; dentro do editor&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Isso significa que você pode gerar código, refatorar funções, escrever testes e documentar projetos sem que nenhuma linha de código saia do seu computador. Perfeito para quem trabalha com código proprietário.&lt;/p&gt;




&lt;h2&gt;
  
  
  Comandos essenciais do Ollama
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Listar modelos baixados&lt;/span&gt;
ollama list

&lt;span class="c"&gt;# Baixar um modelo sem executar&lt;/span&gt;
ollama pull llama4-scout

&lt;span class="c"&gt;# Remover um modelo&lt;/span&gt;
ollama &lt;span class="nb"&gt;rm &lt;/span&gt;modelo-antigo

&lt;span class="c"&gt;# Ver modelo em execução&lt;/span&gt;
ollama ps

&lt;span class="c"&gt;# Criar um modelo personalizado (Modelfile)&lt;/span&gt;
ollama create meu-modelo &lt;span class="nt"&gt;--file&lt;/span&gt; Modelfile

&lt;span class="c"&gt;# Atualizar Ollama&lt;/span&gt;
&lt;span class="c"&gt;# Linux:&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://ollama.com/install.sh | sh
&lt;span class="c"&gt;# macOS:&lt;/span&gt;
brew upgrade ollama
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Modelfile: criando seu próprio modelo
&lt;/h3&gt;

&lt;p&gt;Você pode personalizar o comportamento de qualquer modelo com um &lt;code&gt;Modelfile&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; mistral&lt;/span&gt;

&lt;span class="c"&gt;# Define a personalidade&lt;/span&gt;
SYSTEM "Você é um assistente especializado em direito brasileiro. Responda sempre citando artigos de lei quando possível."

# Ajusta temperatura (0 = determinístico, 1 = criativo)
PARAMETER temperature 0.3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama create direito-br &lt;span class="nt"&gt;--file&lt;/span&gt; Modelfile
ollama run direito-br
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Dicas para extrair o máximo
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Menos é mais.&lt;/strong&gt; Comece com modelos pequenos (1B-3B). Eles são rápidos e suficientes para 80% das tarefas do dia a dia.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Contexto importa.&lt;/strong&gt; Modelos locais têm limite de contexto (normalmente 8K a 32K tokens). Para textos longos, divida em partes ou use modelos maiores como Gemma 4 (128K).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;GPU acelera, mas não é obrigatória.&lt;/strong&gt; Modelos até 7B rodam bem só com CPU e 16 GB de RAM. A diferença é que com GPU as respostas saem em segundos em vez de minutos.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Atualize os modelos periodicamente.&lt;/strong&gt; A cada mês surgem versões melhores. &lt;code&gt;ollama pull&lt;/code&gt; atualiza para a última versão disponível.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Combine ferramentas.&lt;/strong&gt; Ollama + Open WebUI dá uma interface estilo ChatGPT para seus modelos locais. Ollama + AnythingLLM cria um RAG (busca em documentos) local completo.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Conclusão
&lt;/h2&gt;

&lt;p&gt;Rodar IA localmente deixou de ser coisa de entusiasta para se tornar uma ferramenta prática e acessível. Com o Ollama, você instala em minutos, escolhe entre dezenas de modelos gratuitos e mantém o controle total sobre seus dados.&lt;/p&gt;

&lt;p&gt;Não importa se você tem um notebook básico ou uma workstation potente — existe um modelo que roda na sua máquina e atende suas necessidades.&lt;/p&gt;

&lt;p&gt;Em 2026, com a privacidade se tornando cada vez mais rara no mundo digital, ter sua própria IA local não é apenas uma opção interessante: é um passo rumo à &lt;strong&gt;autonomia tecnológica&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Teste você mesmo. Abra o terminal e digite:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run llama3.2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Em menos de 2 minutos você terá uma IA conversando com você, rodando 100% no seu computador, sem pagar nada, sem depender de internet, sem compartilhar seus dados.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IA na Prática — tecnologia que você consegue usar hoje.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Gostou do artigo? Deixe seus comentários abaixo e compartilhe qual modelo você está usando localmente!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ia</category>
      <category>ollama</category>
      <category>opensource</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Stop Paying for Copilot: Run Local LLMs in VS Code &amp; CLI (For Free)</title>
      <dc:creator>Praveen Veera</dc:creator>
      <pubDate>Mon, 29 Jun 2026 13:03:02 +0000</pubDate>
      <link>https://dev.to/praveen_builds/stop-paying-for-copilot-run-local-llms-in-vs-code-cli-for-free-cbp</link>
      <guid>https://dev.to/praveen_builds/stop-paying-for-copilot-run-local-llms-in-vs-code-cli-for-free-cbp</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fxvritu9rd5a74kr190g1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fxvritu9rd5a74kr190g1.png" alt="Local AI Reference Card - At a Glance" width="800" height="1000"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Running generative AI assistants locally on your workstation is the most direct way to protect code privacy, maintain compliance, and eliminate monthly API subscription costs.&lt;/p&gt;

&lt;p&gt;However, moving off the cloud is not as simple as installing an extension. A misconfigured setup can introduce frustrating latency, drain your workstation battery, and fail to provide accurate autocomplete suggestions.&lt;/p&gt;

&lt;p&gt;This guide provides a conceptual overview of the local AI landscape followed by an actionable &lt;strong&gt;five-step guide&lt;/strong&gt; to move your setup from the cloud to a fully local workstation.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Local vs. Cloud: Engineering Tradeoffs
&lt;/h2&gt;

&lt;p&gt;Choosing a local setup is not a pure upgrade; it involves a series of engineering tradeoffs. While local models offer absolute data privacy and near-zero latency, they compromise on reasoning capacity and context across multiple files compared to models hosted in the cloud. Understanding these boundaries is critical to knowing when to keep development local and when to leverage the cloud:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Local Assistant (e.g., Qwen 14B / Gemma 12B)&lt;/th&gt;
&lt;th&gt;Cloud Assistant (e.g., Claude 3.5 Sonnet / GPT-4o)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Privacy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;100% Private (No data leaves your workstation)&lt;/td&gt;
&lt;td&gt;Subject to compliance review (Data sent to third party servers)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Token Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;$0 / month&lt;/strong&gt; (Runs entirely on local electricity)&lt;/td&gt;
&lt;td&gt;$10–$20/mo subscription or fees based on token usage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Autocomplete Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;~150ms&lt;/strong&gt; (Instant, zero network delay)&lt;/td&gt;
&lt;td&gt;~500ms - 1.2s (Depends on network stability and cloud congestion)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Offline Capability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (Works on planes, trains, or secure offline VPCs)&lt;/td&gt;
&lt;td&gt;No (Crashes instantly without active internet connection)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cognitive Ceiling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Low to Medium&lt;/strong&gt; (Struggles with reasoning across multiple files)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;High&lt;/strong&gt; (Resolves complex logic across different modules)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Where Local Models Fail
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Abstract Ceiling:&lt;/strong&gt; A 14B model lacks the neural density to construct deep mental abstractions of complex codebases. If you ask a local model to resolve circular dependencies across three separate modules, it will likely output syntax-valid but logically broken code.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rare Libraries &amp;amp; Edge Cases:&lt;/strong&gt; Cloud models are pre-trained on terabytes of code, including obscure libraries and legacy documentation. Local models are far more narrow; they struggle with undocumented frameworks, internal APIs, or specialized languages (like COBOL or Rust edge-cases).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-Modal Limitations:&lt;/strong&gt; Local setups cannot parse wireframes or UI mockups to generate front-end CSS layouts on consumer GPUs without immediately triggering out-of-memory (OOM) errors.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Local Model Landscape
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Qwen2.5-Coder&lt;/code&gt; &lt;strong&gt;(The Gold Standard):&lt;/strong&gt; Google-rivaling coding performance. It is optimized specifically for &lt;em&gt;Fill-in-the-Middle&lt;/em&gt; autocomplete tasks, making it the most fluent local coding weight available today.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;DeepSeek-Coder&lt;/code&gt; &lt;strong&gt;(The Alternative):&lt;/strong&gt; Highly optimized for Python and C++ structures. However, its older codebase context means it slightly lags behind Qwen on modern multi-language syntax.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Gemma 4 QAT&lt;/code&gt; &lt;strong&gt;(The Logic Specialist):&lt;/strong&gt; Excellent logic capabilities and a robust 32k context capability, though it requires custom parameter configuration in Ollama to run smoothly.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  2. The Systems Metrics That Matter
&lt;/h2&gt;

&lt;p&gt;When running local models, developer experience is governed by three primary systems metrics:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Time to First Token (TTFT) / Context Pre-fill Latency:&lt;/strong&gt; The delay (in milliseconds) between triggering an autocomplete completion and the model generating its first character. In autocomplete, a TTFT above &lt;strong&gt;250ms&lt;/strong&gt; breaks your visual typing flow.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Token Generation Throughput (Tokens/Second):&lt;/strong&gt; The speed at which the model streams its output text once it starts writing. For real-time reading, you need at least &lt;strong&gt;20–30 tokens/second&lt;/strong&gt;. For autocomplete, the model should complete lines instantly (&lt;strong&gt;75+ tokens/second&lt;/strong&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;VRAM Footprint vs. System Memory Swap:&lt;/strong&gt; If a model fits 100% inside VRAM, it runs at full speed. If it overflows by even &lt;strong&gt;10MB&lt;/strong&gt;, the OS pages the remaining weights to system RAM, creating a massive memory bus bottleneck. This drops speeds from 30 tokens/sec to &lt;strong&gt;under 2 tokens/sec&lt;/strong&gt;. Always size your models to fit within 70% of your total VRAM, leaving 30% headroom for your OS and browser.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  🚀 The Local AI Developer Journey
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  ├── Step 1: Audit Your Hardware (VRAM Sizing)
  ├── Step 2: Spin Up the Model Runner (Ollama)
  ├── Step 3: Link the IDE Interface (Continue config.json)
  ├── Step 4: Protect Workspace CPU (.continueignore)
  └── Step 5: Expand to the Command Line (CLI Pipes)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Step 1: Audit Your Hardware (The "Kitchen Counter" Rule)
&lt;/h3&gt;

&lt;p&gt;Running models locally requires matching model parameters to your system's memory (VRAM/RAM).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;The Kitchen Counter Analogy:&lt;/strong&gt; Think of VRAM (GPU memory) as your kitchen counter, and system RAM/swap as the pantry down the hall. If all your ingredients fit on the counter (VRAM), you prepare the meal instantly. If the ingredients are too large and overflow the counter, you have to run back and forth to the pantry (RAM) for every single step. Your cooking speed collapses. Keep your models strictly within VRAM bounds.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here is your hardware compatibility reference sheet:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;System VRAM (Kitchen Counter)&lt;/th&gt;
&lt;th&gt;Model Parameter Size&lt;/th&gt;
&lt;th&gt;Recommended Models&lt;/th&gt;
&lt;th&gt;Quantization&lt;/th&gt;
&lt;th&gt;VRAM Footprint&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;8 GB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1B - 3B&lt;/td&gt;
&lt;td&gt;&lt;code&gt;qwen2.5-coder:1.5b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Q4_K_M&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~1.6 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;16 GB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7B - 8B&lt;/td&gt;
&lt;td&gt;&lt;code&gt;qwen2.5-coder:7b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Q4_K_M&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~4.7 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;24 GB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;12B - 14B&lt;/td&gt;
&lt;td&gt;&lt;code&gt;qwen2.5-coder:14b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Q4_K_M&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~9.3 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;32 GB+&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;14B - 22B&lt;/td&gt;
&lt;td&gt;&lt;code&gt;codestral:22b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Q4_K_M&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~15.1 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Sizing Models to Task Complexity
&lt;/h3&gt;

&lt;p&gt;To optimize compute resources, structure your workflow by mapping developer tasks to model sizes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Simple Tasks (Tab Autocomplete &amp;amp; Syntax Matching):&lt;/strong&gt; Single-line completions, closing parentheses, standard imports, variable assignments. Requires &amp;lt; 200ms latency. Sized at &lt;strong&gt;1.5B to 3B parameters&lt;/strong&gt; (e.g., &lt;code&gt;Qwen2.5-Coder-1.5B-Base&lt;/code&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Medium Tasks (Context-Aware Chat &amp;amp; Unit Testing):&lt;/strong&gt; Writing utility functions, refactoring single files, generating test suites, explaining compilation errors. Sized at &lt;strong&gt;7B to 14B parameters&lt;/strong&gt; (e.g., &lt;code&gt;Qwen2.5-Coder-14B-Instruct&lt;/code&gt; or &lt;code&gt;Gemma-4-12B&lt;/code&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Complex Tasks (Multi-File Debugging &amp;amp; System Architecture):&lt;/strong&gt; Architectural planning, debugging cross-module dependencies, codebase index search. Sized at &lt;strong&gt;22B+ parameters&lt;/strong&gt; (e.g., &lt;code&gt;Codestral-22B&lt;/code&gt; or private VPC-hosted 70B+ models).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Step 2: Spin Up the Model Runner (Ollama)
&lt;/h3&gt;

&lt;p&gt;Ollama acts as the engine room of your setup. It manages model weights, schedules GPU memory allocation, and exposes local API endpoints.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Download and install &lt;a href="https://ollama.com" rel="noopener noreferrer"&gt;Ollama for macOS&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Pull the two models we need (one lightweight model optimized for tab autocomplete, and one larger model for reasoning in chat):&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Pull the lightweight autocomplete model (Base model)&lt;/span&gt;
ollama pull qwen2.5-coder:1.5b-base

&lt;span class="c"&gt;# Pull the chat sidebar reasoning model (Instruct model)&lt;/span&gt;
ollama pull qwen2.5-coder:14b-instruct
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  (Optional) Tuning Parameters via a Custom Modelfile
&lt;/h3&gt;

&lt;p&gt;If you need custom parameters, such as running &lt;strong&gt;Gemma 4 12B QAT&lt;/strong&gt; with an expanded 32k context window:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Locate your local GGUF file directory and create a &lt;code&gt;Modelfile&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; /path/to/local/gemma-4-12b-it-QAT.gguf&lt;/span&gt;
PARAMETER num_ctx 32768
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Build the model in Ollama:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama create gemma4:12b-qat-32k &lt;span class="nt"&gt;-f&lt;/span&gt; Modelfile
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  Step 3: Link the IDE Interface (Continue config.json)
&lt;/h3&gt;

&lt;p&gt;Now we connect VS Code to your local Ollama engine using the open-source &lt;strong&gt;Continue.dev&lt;/strong&gt; extension.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Install the &lt;code&gt;Continue&lt;/code&gt; extension in VS Code.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Open the Continue settings (&lt;code&gt;config.json&lt;/code&gt;) and configure it to point to your local Ollama instance:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"models"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Ollama - Qwen 14B Coder"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ollama"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"qwen2.5-coder:14b"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"apiBase"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:11434"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Ollama - Gemma 4 QAT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ollama"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gemma4:12b-qat-32k"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"apiBase"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:11434"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tabAutocompleteModel"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Ollama - Autocomplete"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ollama"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"qwen2.5-coder:1.5b-base"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"apiBase"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:11434"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Enabling the VS Code CLI Command
&lt;/h3&gt;

&lt;p&gt;To open your configuration file directly from your terminal, enable the VS Code shell utility:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Open VS Code, open the Command Palette (&lt;code&gt;Cmd+Shift+P&lt;/code&gt; on macOS, &lt;code&gt;Ctrl+Shift+P&lt;/code&gt; on Windows/Linux).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Run: &lt;code&gt;Shell Command: Install 'code' command in PATH&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Now, you can open and edit your configuration file directly from your terminal:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;code ~/.continue/config.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Replacing Copilot Features 1-to-1
&lt;/h3&gt;

&lt;p&gt;Once Continue is connected to your local model runner, here is how you trigger the models to replace Copilot's core capabilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Inline Autocomplete (Ghost Text):&lt;/strong&gt; As you write code, the lightweight &lt;code&gt;Qwen-1.5B-Base&lt;/code&gt; model streams single-line completions inline. Press &lt;code&gt;Tab&lt;/code&gt; to accept.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;In-Place Code Editing (&lt;code&gt;Cmd+I&lt;/code&gt; / &lt;code&gt;Ctrl+I&lt;/code&gt;):&lt;/strong&gt; Select a block of code, press &lt;code&gt;Cmd+I&lt;/code&gt; (macOS) or &lt;code&gt;Ctrl+I&lt;/code&gt; (Windows/Linux), type your editing instruction (e.g. &lt;em&gt;"Convert this loop to a list comprehension"&lt;/em&gt;), and press Enter. The model will edit the file inline.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sidebar Chat &amp;amp; Context (&lt;code&gt;Cmd+L&lt;/code&gt; / &lt;code&gt;Ctrl+L&lt;/code&gt;):&lt;/strong&gt; Press &lt;code&gt;Cmd+L&lt;/code&gt; to open the chat panel. Type &lt;code&gt;@&lt;/code&gt; to reference specific files, terminal shell commands, or your entire codebase index, routing the queries to your larger &lt;code&gt;Qwen-14B-Instruct&lt;/code&gt; model.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;ℹ️ &lt;strong&gt;Isolate Autocomplete from Chat:&lt;/strong&gt; Do not route both chat and autocomplete to the same model. Tab autocomplete requires immediate responses. Use &lt;code&gt;Qwen-1.5B-Base&lt;/code&gt; for autocomplete (optimized for fast, inline Fill-in-the-Middle tasks) and &lt;code&gt;Qwen-14B-Instruct&lt;/code&gt; for the chat sidebar.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Workstation Benchmark Results (Measured Live on Apple M5 Pro)
&lt;/h3&gt;

&lt;p&gt;To prove local viability, we measured prompt pre-fill speeds (Time to First Token) and token generation throughput (text output speed) using your hardware configuration:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model Configuration&lt;/th&gt;
&lt;th&gt;Parameter Size&lt;/th&gt;
&lt;th&gt;VRAM Footprint&lt;/th&gt;
&lt;th&gt;Quantization&lt;/th&gt;
&lt;th&gt;Context Pre-fill Speed&lt;/th&gt;
&lt;th&gt;Token Generation Speed&lt;/th&gt;
&lt;th&gt;Sizing Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Qwen2.5-Coder (Base)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.5B&lt;/td&gt;
&lt;td&gt;1.6 GB&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Q4_K_M&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;190.6 tok/s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;188.4 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt; 80ms (Real-time autocomplete)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gemma 4 QAT&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;12B&lt;/td&gt;
&lt;td&gt;7.0 GB&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Q4_K_M&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;129.5 tok/s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;34.8 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Real-time reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Qwen2.5-Coder (Instruct)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;14B&lt;/td&gt;
&lt;td&gt;9.0 GB&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Q4_K_M&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;214.8 tok/s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;30.0 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cloud-parity chat speed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Benchmark Test Script &amp;amp; Code Reference
&lt;/h4&gt;

&lt;p&gt;The benchmark tests were executed locally using the companion test script. The full source code is hosted in the companion repository:&lt;br&gt;
👉 &lt;strong&gt;&lt;a href="https://github.com/praveenveera/software-permanence/tree/main/01-local-llm-vscode" rel="noopener noreferrer"&gt;software-permanence/01-local-llm-vscode&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here is the raw terminal log output of running &lt;a href="https://github.com/praveenveera/software-permanence/blob/main/01-local-llm-vscode/test_local_llm.py" rel="noopener noreferrer"&gt;&lt;code&gt;test_local_llm.py&lt;/code&gt;&lt;/a&gt; against Ollama:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=== Running Local LLM Workstation Benchmark ===
Target model: qwen2.5-coder:14b (Q4_K_M)

[Step 1] Measuring Context Pre-fill Speed (Time to First Token)
  - Processing prompt size: 8192 tokens
  - Pre-fill throughput: 214.8 tokens/second

[Step 2] Measuring Text Generation Speed (Output Throughput)
  - Generating 500 response tokens
  - Generation throughput: 30.0 tokens/second

[Step 3] Verifying Tool-Calling Parse Compliance
  - XML Tool Extraction: PASSED (Regex matched 100% output)
  - JSON Tool Extraction: FAILED (Output wrapped in Markdown fences)

=== Validation Complete: Qwen 14B behaves at cloud-parity speed ===
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Step 4: Protect Workspace CPU (.continueignore)
&lt;/h3&gt;

&lt;p&gt;By default, Continue tries to index every file in your workspace to build local vector embeddings for chat retrieval. On large projects, this causes your CPU usage to spike to 100% and chokes autocomplete.&lt;/p&gt;

&lt;p&gt;To prevent this, create a &lt;code&gt;.continueignore&lt;/code&gt; file in the root of your project directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.git/
node_modules/
dist/
build/
.svelte-kit/
*.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Fixing Context Shifting Latency
&lt;/h3&gt;

&lt;p&gt;Autocomplete can freeze for 2-3 seconds when you switch tabs because Continue is parsing the entire contents of the new file.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;The Fix:&lt;/strong&gt; In VS Code settings, search for &lt;code&gt;Continue: Tab Autocomplete Options&lt;/code&gt;, and set &lt;code&gt;Prefix Length&lt;/code&gt; to &lt;code&gt;500&lt;/code&gt; and &lt;code&gt;Suffix Length&lt;/code&gt; to &lt;code&gt;250&lt;/code&gt;. Reducing these boundaries limits context parsing size, giving you instant tab completions upon tab switching.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Step 5: Expand to the Command Line (Terminal Agents &amp;amp; Pipes)
&lt;/h3&gt;

&lt;p&gt;Once your local model runner is set up, you aren't restricted to the IDE. Ollama’s desktop interface includes a native &lt;strong&gt;Launch&lt;/strong&gt; registry that allows you to spin up open-source terminal agents directly from your CLI.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;Beginner Warning (The Git Sandbox Rule):&lt;/strong&gt; Terminal-native agents (&lt;code&gt;opencode&lt;/code&gt;, &lt;code&gt;claude&lt;/code&gt;) execute edits and run commands directly on your local system. Before launching an agent from your CLI, &lt;strong&gt;always ensure you are running it inside a clean Git repository.&lt;/strong&gt; If the agent runs a destructive command or writes broken code, you can roll back your workspace instantly via &lt;code&gt;git reset --hard&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  1. Launching Terminal-Native Coding Agents
&lt;/h3&gt;

&lt;p&gt;Instead of paid cloud services, you can run autonomous command-line developers directly inside your shell:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;OpenCode (Anomaly's open-source coding agent):&lt;/strong&gt; An autonomous terminal coder that reads build logs, refactors files, and handles tasks locally:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama launch opencode
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Copilot CLI (Terminal helper agent):&lt;/strong&gt; Explains shell commands, generates commands from natural language, and handles prompt operations in your terminal:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama launch copilot-cli
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Claude Code (Subagent coding CLI):&lt;/strong&gt; Anthropic’s subagent developer interface configured to run locally:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama launch claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Piping Logs for Custom Debugging
&lt;/h3&gt;

&lt;p&gt;For quick troubleshooting, you can pipe compiler errors or log dumps directly into the model without copying and pasting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Pipe an execution error log to Ollama&lt;/span&gt;
&lt;span class="nb"&gt;cat &lt;/span&gt;error.log | ollama run qwen2.5-coder:14b &lt;span class="s2"&gt;"Explain this error and suggest a fix"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Direct Programmatic API Access
&lt;/h3&gt;

&lt;p&gt;You can call your local models directly inside your applications or custom tooling. Here is how to execute a generation request using Curl and Python:&lt;/p&gt;

&lt;h4&gt;
  
  
  Using Curl:
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:11434/api/generate &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
  "model": "qwen2.5-coder:14b",
  "prompt": "Convert this bash script to a Python script: $(cat build.sh)",
  "stream": false
}'&lt;/span&gt; | jq &lt;span class="s1"&gt;'.response'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Using Python:
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;urllib.request&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-coder:14b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Convert this bash script to a Python script.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434/api/generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;urlopen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Pro-Tips &amp;amp; Troubleshooting
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Issue: Port 11434 is Already in Use
&lt;/h3&gt;

&lt;p&gt;On macOS, Ollama runs as a background service and will block port &lt;code&gt;11434&lt;/code&gt; even if the app UI is closed.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Manually kill the background process via terminal:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pkill Ollama
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Issue: Zero-Lag Loading (keep_alive)
&lt;/h3&gt;

&lt;p&gt;By default, Ollama unloads models from memory after 5 minutes of inactivity. When you trigger code completion later, you face a 5–10 second delay as the model loads back into VRAM.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;The fix:&lt;/strong&gt; Set the model to remain permanently loaded in GPU memory by configuring the &lt;code&gt;keep_alive&lt;/code&gt; parameter to &lt;code&gt;-1&lt;/code&gt; (always stay in memory) or &lt;code&gt;30m&lt;/code&gt; (30 minutes) in your API settings.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🔰 Beginner's Troubleshooting Checklist
&lt;/h3&gt;

&lt;p&gt;If your local development setup is failing, use this diagnostic guide to find the cause:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Is Ollama running?&lt;/strong&gt; Open your terminal and run &lt;code&gt;ollama list&lt;/code&gt;. If it fails with a connection error, the Ollama application service is shut down.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Is autocomplete lagging?&lt;/strong&gt; If suggestions take more than 2-3 seconds, check if your model is spilling into system RAM. In Activity Monitor (macOS) or Task Manager (Windows), look at memory swap. If swap is active, you are running a model too large for your VRAM.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Is Continue forgetting instructions?&lt;/strong&gt; If the sidebar chat stops responding or behaves erratically, you have hit the context limit of the loaded model. Restart the chat session to clean the active history window.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Running local models provides code privacy and offline capabilities. By combining &lt;strong&gt;Ollama&lt;/strong&gt;, &lt;strong&gt;LM Studio&lt;/strong&gt;, and &lt;strong&gt;Continue&lt;/strong&gt;, you can configure a usable local developer environment in both your IDE and terminal.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What models are you running locally for autocomplete? Let me know in the comments.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Hi, I'm Praveen Veera.&lt;/strong&gt; I build practical AI systems, specializing in Enterprise AI Platforms, Local LLMs, and Dev Tools.&lt;/p&gt;

&lt;p&gt;Read my notes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Substack Newsletter:&lt;/strong&gt; &lt;a href="https://praveenbuilds.substack.com" rel="noopener noreferrer"&gt;praveenbuilds.substack.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;LinkedIn:&lt;/strong&gt; &lt;a href="https://www.linkedin.com/in/praveen-veera-6ab22567/" rel="noopener noreferrer"&gt;linkedin.com/in/praveen-veera-6ab22567&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;GitHub (Companion Code):&lt;/strong&gt; &lt;a href="https://github.com/praveenveera/software-permanence" rel="noopener noreferrer"&gt;github.com/praveenveera/software-permanence&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Dev.to:&lt;/strong&gt; &lt;a href="https://dev.to/praveen_builds"&gt;dev.to/praveen_builds&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Medium:&lt;/strong&gt; &lt;a href="https://medium.com/@praveenveera92" rel="noopener noreferrer"&gt;medium.com/@praveenveera92&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Instagram:&lt;/strong&gt; &lt;a href="https://instagram.com/praveen.builds" rel="noopener noreferrer"&gt;@praveen.builds&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Hashnode:&lt;/strong&gt; &lt;a href="https://hashnode.com/@praveen-builds" rel="noopener noreferrer"&gt;hashnode.com/@praveen-builds&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ollama</category>
      <category>continue</category>
      <category>qwen</category>
      <category>githubcopilot</category>
    </item>
    <item>
      <title>Ollama 'llama runner process has terminated'? Read the Exit Code, Then Fix It (2026)</title>
      <dc:creator>Jovan Chan</dc:creator>
      <pubDate>Mon, 29 Jun 2026 07:06:02 +0000</pubDate>
      <link>https://dev.to/jovan_chan_9500711396d4e6/ollama-llama-runner-process-has-terminated-read-the-exit-code-then-fix-it-2026-4b6h</link>
      <guid>https://dev.to/jovan_chan_9500711396d4e6/ollama-llama-runner-process-has-terminated-read-the-exit-code-then-fix-it-2026-4b6h</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This article was originally published on &lt;a href="https://runaihome.com/blog/ollama-llama-runner-process-terminated-fix-2026/" rel="noopener noreferrer"&gt;runaihome.com&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: &lt;code&gt;Error: llama runner process has terminated&lt;/code&gt; means the backend that actually runs the model died before it could load. The fix depends entirely on the code after it — &lt;code&gt;exit status 2&lt;/code&gt; is usually a GPU/VRAM or driver-library mismatch, &lt;code&gt;0xc0000409&lt;/code&gt; on Windows is an illegal CPU instruction (no AVX), and &lt;code&gt;signal: killed&lt;/code&gt; on Linux is the kernel's OOM killer reclaiming system RAM. Read the code first; don't reinstall blindly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you'll be able to do after this guide:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Decode the four termination codes you'll actually see in 2026 and map each to a root cause&lt;/li&gt;
&lt;li&gt;Pull the one line from the Ollama server log that tells you what really happened&lt;/li&gt;
&lt;li&gt;Apply the specific fix — context size, GPU layers, quant, or driver — instead of guessing&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Honest take&lt;/strong&gt;: This error scares people because it looks like a crash deep in C++ land, but 90% of cases are one of three boring things: the model doesn't fit in memory, your CPU is too old for the prebuilt binary, or a GPU library got swapped under Ollama's feet. The exit code narrows it to one of those in about ten seconds. Find the code, then read the matching section below.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Step 1: Read the exit code (this is the whole diagnosis)
&lt;/h2&gt;

&lt;p&gt;The full error always has the same shape:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;ollama run llama3.1:8b
Error: llama runner process has terminated: &lt;span class="nb"&gt;exit &lt;/span&gt;status 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That trailing token — &lt;code&gt;exit status 2&lt;/code&gt;, &lt;code&gt;exit status 0xc0000409&lt;/code&gt;, &lt;code&gt;signal: killed&lt;/code&gt;, &lt;code&gt;signal: aborted&lt;/code&gt; — is not noise. It's the operating system reporting &lt;em&gt;how&lt;/em&gt; the runner subprocess died, and it points straight at the cause. Here's the map:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What you see&lt;/th&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Almost always means&lt;/th&gt;
&lt;th&gt;Jump to&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;exit status 2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Any&lt;/td&gt;
&lt;td&gt;GPU library/driver mismatch, VRAM overflow, or bad GGUF&lt;/td&gt;
&lt;td&gt;Cause A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;exit status 0xc0000409&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Windows&lt;/td&gt;
&lt;td&gt;CPU lacks AVX/AVX2 (illegal instruction) or a GPU runtime fault&lt;/td&gt;
&lt;td&gt;Cause B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;signal: killed&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Linux/Docker&lt;/td&gt;
&lt;td&gt;Kernel OOM killer — system RAM exhausted&lt;/td&gt;
&lt;td&gt;Cause C&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;signal: aborted&lt;/code&gt; / SIGABRT&lt;/td&gt;
&lt;td&gt;Linux/Mac&lt;/td&gt;
&lt;td&gt;Internal assertion failed (often a corrupt or unsupported model)&lt;/td&gt;
&lt;td&gt;Cause D&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These codes are stable across Ollama versions — they come from the OS, not Ollama. As of this writing the current release is &lt;strong&gt;Ollama v0.30.8 (June 12, 2026)&lt;/strong&gt;, and the behavior below was confirmed against the 0.30.x line. If you're more than a few versions behind, updating is a legitimate first move (see the bottom of Cause A) — but read your code first so you know what you're actually chasing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Get the real reason from the server log
&lt;/h2&gt;

&lt;p&gt;The one-line CLI error is a summary. The runner writes its actual death note to the server log before it dies. Find it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Linux (systemd):&lt;/strong&gt; &lt;code&gt;journalctl -u ollama --no-pager | tail -n 50&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;macOS:&lt;/strong&gt; &lt;code&gt;cat ~/.ollama/logs/server.log | tail -n 50&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Windows:&lt;/strong&gt; open &lt;code&gt;%LOCALAPPDATA%\Ollama\server.log&lt;/code&gt; (i.e. &lt;code&gt;C:\Users\&amp;lt;you&amp;gt;\AppData\Local\Ollama\server.log&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Scroll to the lines just before the termination. You're hunting for one of these tells:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SIGILL: illegal instruction
CUDA error: out of memory
cudaMalloc failed: out of memory
entering low vram mode
error loading model: unable to allocate backend buffer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Whichever line shows up confirms which cause below applies. Don't skip this step — it's the difference between a five-minute fix and an afternoon of reinstalling drivers you didn't need to touch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cause A — &lt;code&gt;exit status 2&lt;/code&gt;: VRAM, driver libraries, or a bad model
&lt;/h2&gt;

&lt;p&gt;This is the catch-all crash, and it has three common flavors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A1. The model doesn't fit (most common).&lt;/strong&gt; If the log shows &lt;code&gt;CUDA error: out of memory&lt;/code&gt;, &lt;code&gt;cudaMalloc failed&lt;/code&gt;, or &lt;code&gt;entering low vram mode&lt;/code&gt; right before the crash, the runner tried to allocate more VRAM than the card has and died. This is the same root cause covered in depth in our &lt;a href="https://dev.to/blog/cuda-out-of-memory-local-ai-fix-2026/"&gt;CUDA out of memory fix guide&lt;/a&gt; — the short version:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Shrink the context.&lt;/strong&gt; The KV cache scales with context length and quietly dominates VRAM at long contexts. Cap it:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  &lt;span class="c"&gt;# per-session&lt;/span&gt;
  &lt;span class="nv"&gt;$ OLLAMA_CONTEXT_LENGTH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;4096 ollama serve
  &lt;span class="c"&gt;# or in the systemd service: Environment="OLLAMA_CONTEXT_LENGTH=4096"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Drop to a smaller quant.&lt;/strong&gt; A &lt;code&gt;q4_K_M&lt;/code&gt; build of an 8B model needs ~6–7 GB; the &lt;code&gt;q8_0&lt;/code&gt; of the same model needs ~9 GB. If you're at the edge, the smaller quant is the cheapest win. (If you're unsure which quant to pick, see &lt;a href="https://dev.to/blog/local-llm-quantization-explained/"&gt;quantization explained&lt;/a&gt;.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Let some layers spill to CPU on purpose.&lt;/strong&gt; Setting &lt;code&gt;num_gpu&lt;/code&gt; to a value lower than the model's layer count offloads the rest to RAM — slower, but it loads instead of crashing:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  &lt;span class="nv"&gt;$ &lt;/span&gt;ollama run llama3.1:8b &lt;span class="nt"&gt;--num-gpu&lt;/span&gt; 28
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;A2. A swapped GPU library (AMD/ROCm and custom builds).&lt;/strong&gt; A frequently reported version of &lt;code&gt;exit status 2&lt;/code&gt; happens after someone manually replaces Ollama's bundled ROCm libraries to force support for an unsupported architecture — for example dropping &lt;code&gt;gfx1031&lt;/code&gt; files in to make a Radeon RX 6750 XT work. When the patched library and the runner disagree, the runner faults on load. If you've hand-edited anything under Ollama's &lt;code&gt;lib/&lt;/code&gt; directory, reinstall Ollama cleanly to restore the matched binaries, then let it auto-detect the GPU rather than forcing an architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A3. A corrupt or partially downloaded model.&lt;/strong&gt; If the crash is specific to one model and only after an interrupted pull or an offline copy, the GGUF blob may be truncated. Re-pull it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;ollama &lt;span class="nb"&gt;rm &lt;/span&gt;llama3.1:8b
&lt;span class="nv"&gt;$ &lt;/span&gt;ollama pull llama3.1:8b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If A1–A3 don't apply and you're several releases behind, update Ollama — GGUF/llama.cpp hardware support broadens with nearly every release, and v0.30.8 specifically expanded the set of cards and quant formats the runner accepts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cause B — &lt;code&gt;exit status 0xc0000409&lt;/code&gt; on Windows: your CPU, not your GPU
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;0xc0000409&lt;/code&gt; is a Windows NTSTATUS code for an &lt;strong&gt;illegal-instruction exception&lt;/strong&gt;. Despite how it reads, this is usually not a memory bug — it's the CPU being asked to execute an instruction it doesn't have. In practice that means &lt;strong&gt;the prebuilt Ollama runner uses AVX/AVX2 and your processor doesn't support it.&lt;/strong&gt; This has been reported across model families (phi3, llama3.2) on older Intel and budget CPUs going back to Ollama 0.1.x, and the SIGILL line in the log is the confirmation.&lt;/p&gt;

&lt;p&gt;What works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Confirm the CPU is the issue.&lt;/strong&gt; In the log, an &lt;code&gt;illegal instruction&lt;/code&gt; / &lt;code&gt;SIGILL&lt;/code&gt; line right before the exit confirms AVX is the culprit. You can also check your CPU's spec sheet for "AVX2" support.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Force a GPU load so the CPU path is never taken.&lt;/strong&gt; If you have a supported NVIDIA/AMD GPU large enough for the model, make sure Ollama is actually using it (run &lt;code&gt;ollama ps&lt;/code&gt; and look for &lt;code&gt;100% GPU&lt;/code&gt;). When the model runs entirely on the GPU, the AVX-dependent CPU kernels aren't exercised. If &lt;code&gt;ollama ps&lt;/code&gt; shows a CPU/GPU split, you're back in CPU territory — shrink the model until it fits fully on the GPU. Our &lt;a href="https://dev.to/blog/ollama-not-using-gpu-fix-cpu-2026/"&gt;Ollama not using GPU guide&lt;/a&gt; walks through forcing GPU detection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If there's no AVX and no usable GPU,&lt;/strong&gt; that machine genuinely can't run the prebuilt binary. The honest answer is to run inference somewhere else — a different box, or a rented cloud GPU. For occasional jobs, &lt;a href="https://runpod.io?ref=cjrwwd27" rel="noopener noreferrer"&gt;RunPod&lt;/a&gt; is cheaper than buying a new CPU.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A second, rarer flavor of &lt;code&gt;0xc0000409&lt;/code&gt; is a GPU &lt;strong&gt;runtime&lt;/strong&gt; fault — a mismatched or corrupted CUDA/driver install rather than a CPU issue. If the log shows CUDA errors instead of SIGILL, update your NVIDIA driver and reinstall Ollama, the same way you'd treat Cause A2.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cause C — &lt;code&gt;signal: killed&lt;/code&gt; on Linux: the OOM killer got you
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;signal: killed&lt;/code&gt; is SIGKILL, and on Linux the usual sender is the kernel's &lt;strong&gt;out-of-memory (OOM) killer&lt;/strong&gt;. When loading a model pushes total system RAM past the limit, the kernel picks a process and terminates it instantly — no cleanup, no error message from Ollama, the runner just vanishes. Confirm it:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
bash
$ dmesg | grep -i "killed process"
[ 4823.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>ollama</category>
      <category>troubleshooting</category>
      <category>localllm</category>
      <category>cuda</category>
    </item>
    <item>
      <title>Running a Whole RAG Agent Offline: LangGraph + Ollama + Embedded Qdrant (Zero API Keys)</title>
      <dc:creator>duke</dc:creator>
      <pubDate>Mon, 29 Jun 2026 01:22:31 +0000</pubDate>
      <link>https://dev.to/javaking1129/running-a-whole-rag-agent-offline-langgraph-ollama-embedded-qdrant-zero-api-keys-2hfd</link>
      <guid>https://dev.to/javaking1129/running-a-whole-rag-agent-offline-langgraph-ollama-embedded-qdrant-zero-api-keys-2hfd</guid>
      <description>&lt;p&gt;Most RAG tutorials open with "set your &lt;code&gt;OPENAI_API_KEY&lt;/code&gt;." This one doesn't need it. In &lt;a href="https://dev.to/javaking1129/running-a-langgraph-react-agent-in-production-openai-compatible-api-multi-model-gateway--emi"&gt;Part 1&lt;/a&gt; I claimed the LLM and embeddings are behind a swappable boundary — "switch providers via config, not code." Part 3 is me &lt;em&gt;cashing that claim&lt;/em&gt;: running the entire RAG agent — ingestion, retrieval, the ReAct loop, source citations — on a laptop with &lt;strong&gt;zero API keys and no Docker&lt;/strong&gt;, just Ollama and an embedded Qdrant.&lt;/p&gt;

&lt;p&gt;Everything below is real output from an actual run. Including the one thing that broke.&lt;/p&gt;




&lt;h2&gt;
  
  
  What "offline" actually requires
&lt;/h2&gt;

&lt;p&gt;Three pieces, all local:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ollama&lt;/strong&gt; running two models — one for chat, one for embeddings:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  ollama pull qwen3.5:9b   &lt;span class="c"&gt;# chat / reasoning&lt;/span&gt;
  ollama pull bge-m3       &lt;span class="c"&gt;# embeddings (1024-dim, multilingual)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Embedded Qdrant&lt;/strong&gt; — no server, no container. The vector store writes to a local directory.&lt;/li&gt;
&lt;li&gt;A one-line config flip so chat goes to Ollama instead of the gateway:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  &lt;span class="nv"&gt;CHAT_PROVIDER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ollama
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No &lt;code&gt;OPENAI_API_KEY&lt;/code&gt;, no &lt;code&gt;docker compose up&lt;/code&gt;. The reason this is a &lt;em&gt;flip&lt;/em&gt; and not a rewrite is the provider-swap design from Part 1 — let's look at the three factories that make it work.&lt;/p&gt;




&lt;h2&gt;
  
  
  The embeddings factory — swap by config
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# app/llm/embeddings.py
&lt;/span&gt;&lt;span class="nd"&gt;@lru_cache&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_embeddings&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Embeddings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_settings&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;provider&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embedding_provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;provider&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ollama&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_ollama&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OllamaEmbeddings&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;OllamaEmbeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embedding_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ollama_url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;provider&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAIEmbeddings&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;OpenAIEmbeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;litellm_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;litellm_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embedding_model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unknown embedding_provider: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embedding_provider&lt;/span&gt;&lt;span class="si"&gt;!r}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both branches return the same LangChain &lt;code&gt;Embeddings&lt;/code&gt; interface, so the ingestion and retrieval code never knows which one it got. Local dev → Ollama (offline). Production → OpenAI via the gateway. &lt;strong&gt;One caveat that matters later:&lt;/strong&gt; the two providers produce &lt;em&gt;different vector dimensions&lt;/em&gt;, so you can't mix vectors ingested with one and queried with the other. More on that in the gotchas.&lt;/p&gt;

&lt;h2&gt;
  
  
  The vector store — embedded vs. remote, also by config
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# app/rag/store.py
&lt;/span&gt;&lt;span class="nd"&gt;@lru_cache&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;QdrantClient&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_settings&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;qdrant_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;QdrantClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;qdrant_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;qdrant_api_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# remote (prod)
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;QdrantClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;qdrant_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                             &lt;span class="c1"&gt;# embedded (local)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No &lt;code&gt;QDRANT_URL&lt;/code&gt;? You get an embedded client that persists to &lt;code&gt;s.qdrant_path&lt;/code&gt; — a plain directory. Set &lt;code&gt;QDRANT_URL&lt;/code&gt; in prod and the &lt;em&gt;same code&lt;/em&gt; talks to a real Qdrant service. The trade-off of embedded mode: it &lt;strong&gt;locks the directory to a single process&lt;/strong&gt;, which becomes gotcha #2.&lt;/p&gt;




&lt;h2&gt;
  
  
  Ingestion: docs → chunks → vectors
&lt;/h2&gt;

&lt;p&gt;The ingest script is the whole pipeline in ~30 lines: load files, split them, probe the embedding dimension, create the collection, upsert.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# scripts/ingest.py (trimmed)
&lt;/span&gt;&lt;span class="n"&gt;splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;splitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# probe the embedding dimension so the collection matches the provider
&lt;/span&gt;&lt;span class="n"&gt;dim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;get_embeddings&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;embed_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;probe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nf"&gt;ensure_collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;get_vector_store&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;add_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;embed_query("probe")&lt;/code&gt; trick is worth pausing on: instead of hard-coding &lt;code&gt;1024&lt;/code&gt; for bge-m3 (or &lt;code&gt;1536&lt;/code&gt; for OpenAI), it asks the active embedder for one vector and measures it. Swap the provider and the collection is created with the right size automatically.&lt;/p&gt;

&lt;p&gt;Running it for real:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;python scripts/ingest.py &lt;span class="nt"&gt;--reset&lt;/span&gt;
&lt;span class="go"&gt;[ingest] source=docs  collection=docs  embed=ollama:bge-m3
[ingest] 5 documents → 53 chunks
[ingest] embedding dim = 1024
[ingest] done — 53 points in collection
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Five markdown files, 53 chunks, 1024-dim vectors from bge-m3, written to the local Qdrant directory. No network calls left the machine.&lt;/p&gt;




&lt;h2&gt;
  
  
  Running the agent — no server needed
&lt;/h2&gt;

&lt;p&gt;You can hit the FastAPI endpoint, but to &lt;em&gt;see the graph think&lt;/em&gt; you can also invoke it directly. Here's a real run, asking about something that lives in the docs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ainvoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How is short-term vs long-term memory implemented in this project?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]})&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nf"&gt;type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;
&lt;span class="c1"&gt;# ['HumanMessage', 'AIMessage', 'ToolMessage', 'AIMessage']
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That message sequence &lt;em&gt;is&lt;/em&gt; the ReAct loop, visible in the state:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;HumanMessage&lt;/code&gt; — the question&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AIMessage&lt;/code&gt; with &lt;code&gt;tool_calls=[search_docs(...)]&lt;/code&gt; — the model decides to retrieve&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ToolMessage&lt;/code&gt; — the retrieved chunks come back&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AIMessage&lt;/code&gt; — the final synthesized answer&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And the answer itself, generated entirely by a 9B model on the laptop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Short-term memory: PostgreSQL (PostgresSaver) stores per-thread
  conversation state; swappable to Redis (RedisSaver) if needed.
Long-term memory: Zep manages the user's persistent knowledge,
  recalled by the app on later turns.

Sources: &amp;lt;doc-a&amp;gt;.md, &amp;lt;doc-b&amp;gt;.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Grounded in the actual docs, with source attribution, zero API keys. That's the win. Now the part the tutorials skip.&lt;/p&gt;




&lt;h2&gt;
  
  
  Gotchas (the part that's actually worth reading)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. The empty synthesis turn — the local model, not the pipeline
&lt;/h3&gt;

&lt;p&gt;On one run, the &lt;em&gt;exact same question&lt;/em&gt; produced this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[1] AIMessage   content=''   tool_calls=[search_docs(...)]   finish_reason='tool_calls'
[2] ToolMessage content='[1] (source: ...) ## memory layers ...'   ← retrieval worked
[3] AIMessage   content=''   tool_calls=[]   finish_reason='stop'  ← empty answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Retrieval succeeded. The chunks were right there in step 2. But step 3 — the model's job to &lt;em&gt;read the chunks and answer&lt;/em&gt; — came back &lt;strong&gt;empty&lt;/strong&gt;. &lt;code&gt;finish_reason='stop'&lt;/code&gt;, no tokens, no error. Re-running the same question gave a perfectly good 280-character answer with citations. So it's &lt;strong&gt;intermittent&lt;/strong&gt;: a small local model occasionally produces an empty turn after a tool call.&lt;/p&gt;

&lt;p&gt;Two things to take away:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It's the &lt;em&gt;model&lt;/em&gt;, not your graph. The pipeline (routing → retrieval → state) was flawless; the synthesis step just whiffed.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;saw_token&lt;/code&gt; fallback from &lt;a href="https://dev.to/javaking1129/streaming-a-langgraph-agent-as-openai-compatible-sse-with-a-thinking-panel-2928"&gt;Part 2&lt;/a&gt; &lt;strong&gt;won't save you here&lt;/strong&gt; — that fallback calls &lt;code&gt;ainvoke&lt;/code&gt; when no tokens stream, but here &lt;code&gt;ainvoke&lt;/code&gt; &lt;em&gt;is&lt;/em&gt; the empty result. The real mitigations are a larger/better tool-tuned local model, or accepting some flakiness as the price of fully offline. Worth knowing before you demo it live.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Embedded Qdrant locks the directory
&lt;/h3&gt;

&lt;p&gt;Embedded mode keeps the store in one process. Run the ingest script while the server is up and you'll get a lock error. Order matters: &lt;strong&gt;ingest first → let it exit → then start the server.&lt;/strong&gt; The ingest script even closes the client explicitly to avoid a noisy shutdown traceback.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Embedding dimensions must match end to end
&lt;/h3&gt;

&lt;p&gt;bge-m3 is 1024-dim; OpenAI's &lt;code&gt;text-embedding-3-small&lt;/code&gt; is 1536. If you ingest with one provider and query with another, the dimensions don't line up and search breaks. Switching &lt;code&gt;embedding_provider&lt;/code&gt; means &lt;strong&gt;re-ingesting&lt;/strong&gt; (&lt;code&gt;--reset&lt;/code&gt;). The &lt;code&gt;embed_query("probe")&lt;/code&gt; dimension check is exactly what keeps the collection honest per provider.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. The first call is slow
&lt;/h3&gt;

&lt;p&gt;Ollama loads the model into memory on first use. The first request eats that cost; subsequent ones are fast. Don't benchmark the cold start.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;You can build, debug, and demo the &lt;em&gt;entire&lt;/em&gt; RAG agent — graph, retrieval, citations — on a plane with no wifi. Then, for production, you flip two config values (&lt;code&gt;CHAT_PROVIDER&lt;/code&gt;, &lt;code&gt;QDRANT_URL&lt;/code&gt;) and the same code talks to a hosted model and a real Qdrant cluster. Part 1 &lt;em&gt;claimed&lt;/em&gt; the provider boundary; Part 3 &lt;em&gt;ran on both sides of it&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The flip side is honesty about local models: retrieval is rock-solid, but a 9B model's synthesis step is the weak link, and it'll occasionally hand you an empty answer. Know that going in.&lt;/p&gt;

&lt;p&gt;Next: persisting conversation threads with a checkpointer — so the agent remembers across requests — and what that adds to the message log you just saw.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Part 3 of a series on running LangGraph in production. &lt;a href="https://dev.to/javaking1129/running-a-langgraph-react-agent-in-production-openai-compatible-api-multi-model-gateway--emi"&gt;Part 1&lt;/a&gt; · &lt;a href="https://dev.to/javaking1129/streaming-a-langgraph-agent-as-openai-compatible-sse-with-a-thinking-panel-2928"&gt;Part 2&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>langchain</category>
      <category>llm</category>
      <category>rag</category>
      <category>ollama</category>
    </item>
    <item>
      <title>My RAG Benchmark is lying to me</title>
      <dc:creator>Dogukan Karademir</dc:creator>
      <pubDate>Sun, 28 Jun 2026 21:45:58 +0000</pubDate>
      <link>https://dev.to/mido-dev/my-rag-benchmark-is-lying-to-me-20co</link>
      <guid>https://dev.to/mido-dev/my-rag-benchmark-is-lying-to-me-20co</guid>
      <description>&lt;p&gt;I built a benchmark to find the best local LLM for my RAG system. After some runs, I'm less confident in the results than when I started — and I think that's the more useful story.&lt;/p&gt;

&lt;p&gt;Here's the specific problem that broke my assumptions.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Kenning&lt;/strong&gt; is a Spring Boot RAG backend: Spring AI, pgvector, Ollama, Apache Tika for PDF parsing. You upload a document, ask questions, get answers grounded only in that document.&lt;/p&gt;

&lt;p&gt;I built a benchmark to test six local models: &lt;code&gt;llama3.1:8b&lt;/code&gt;, &lt;code&gt;llama3.2:3b&lt;/code&gt;, &lt;code&gt;qwen2.5:7b&lt;/code&gt;, &lt;code&gt;gemma2:9b&lt;/code&gt;, &lt;code&gt;mistral:7b&lt;/code&gt;, &lt;code&gt;phi4:14b&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Four question categories, judged blind by &lt;code&gt;qwen2.5:14b&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;IN_CONTEXT&lt;/strong&gt; — answer is in the document&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;OUT_OF_CONTEXT&lt;/strong&gt; — answer isn't; model must refuse&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;PARTIAL_CONTEXT&lt;/strong&gt; — partial information; model must say what it found and what's missing&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;MULTI_CHUNK&lt;/strong&gt; — answer spans multiple sections&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Maximum 875 points per model at 35 questions.&lt;/p&gt;




&lt;h2&gt;
  
  
  First Problem: The Ceiling Effect
&lt;/h2&gt;

&lt;p&gt;First run, 20 questions on &lt;em&gt;Attention Is All You Need&lt;/em&gt; (the Transformer paper):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;qwen2.5:7b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;481/500 — 96.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;llama3.1:8b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;475/500 — 95.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;phi4:14b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;474/500 — 94.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gemma2:9b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;473/500 — 94.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;llama3.2:3b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;466/500 — 93.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;mistral:7b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;463/500 — 92.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;IN_CONTEXT category: every single model averaged 25/25. Perfect score.&lt;/p&gt;

&lt;p&gt;This is what a useless benchmark looks like. Questions like &lt;em&gt;"How many attention heads does the Transformer use?"&lt;/em&gt; are trivially easy if the retrieved chunk contains &lt;code&gt;h = 8&lt;/code&gt;. I wasn't measuring model capability — I was measuring whether models can read.&lt;/p&gt;

&lt;p&gt;I added 15 harder questions and rewrote the chunking.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Rewrite That Changed Everything
&lt;/h2&gt;

&lt;p&gt;The original code used &lt;code&gt;TokenTextSplitter&lt;/code&gt; with default settings. I changed it to 200-token chunks with 100-token overlap between adjacent chunks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;TokenTextSplitter&lt;/span&gt; &lt;span class="n"&gt;splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TokenTextSplitter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withChunkSize&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withKeepSeparator&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

&lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;splitter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;apply&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;overlapped&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;overlapAppender&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;addOverlap&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The idea: information lost at chunk boundaries (a sentence split across two chunks is fully represented in neither) would be preserved by overlapping.&lt;/p&gt;

&lt;p&gt;New results on 35 questions, same document:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;phi4:14b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;839/875 — 95.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;qwen2.5:7b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;822/875 — 93.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gemma2:9b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;818/875 — 93.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;llama3.1:8b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;815/875 — 93.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;mistral:7b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;780/875 — 89.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;llama3.2:3b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;771/875 — 88.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The ranking changed. phi4:14b, which was 3rd before, now leads. The spread grew from 3.6 to 7.8 percentage points.&lt;/p&gt;




&lt;h2&gt;
  
  
  Here's the Problem
&lt;/h2&gt;

&lt;p&gt;I changed &lt;strong&gt;two things&lt;/strong&gt; at the same time: the chunking strategy and the question difficulty. I can't isolate which change drove the ranking shift.&lt;/p&gt;

&lt;p&gt;And I can prove the chunking changed what models actually saw.&lt;/p&gt;

&lt;p&gt;Question q01: &lt;em&gt;"How many attention heads does the base Transformer use?"&lt;/em&gt; — categorized as IN_CONTEXT because the answer (&lt;code&gt;h = 8&lt;/code&gt;) is in the paper.&lt;/p&gt;

&lt;p&gt;Original chunking: retrieved a chunk containing &lt;code&gt;h = 8&lt;/code&gt;. Model answered correctly.&lt;/p&gt;

&lt;p&gt;New chunking: retrieved chunks about multi-head attention applications. The specific &lt;code&gt;h = 8&lt;/code&gt; chunk was no longer in the top 5 by similarity score. phi4:14b correctly said: &lt;em&gt;"The provided context does not specify the number of attention heads."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Judge score: &lt;strong&gt;25/25&lt;/strong&gt;. The model isn't lying — it answered correctly given what it received.&lt;/p&gt;

&lt;p&gt;But the system failed the user. That question is answerable. The document has the answer. The retrieval missed it.&lt;/p&gt;

&lt;p&gt;So here's what I was actually measuring: &lt;strong&gt;model behavior given what my chunking strategy retrieved&lt;/strong&gt; — not model capability. The "model benchmark" was really a "chunking configuration benchmark." I just didn't realize it until the results changed.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Second Document Made It Worse
&lt;/h2&gt;

&lt;p&gt;I added a second document — &lt;strong&gt;NIST SP 800-63B&lt;/strong&gt;, a US federal authentication standard. ~70 pages of SHALL/SHOULD requirements, distributed across sections and tables. Nothing like an academic paper.&lt;/p&gt;

&lt;p&gt;Same questions structure, same judge, same chunking.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Transformer paper&lt;/th&gt;
&lt;th&gt;NIST&lt;/th&gt;
&lt;th&gt;Drop&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;phi4:14b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;95.9%&lt;/td&gt;
&lt;td&gt;90.9%&lt;/td&gt;
&lt;td&gt;−5.0 pp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;mistral:7b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;89.1%&lt;/td&gt;
&lt;td&gt;88.6%&lt;/td&gt;
&lt;td&gt;−&lt;strong&gt;0.5&lt;/strong&gt; pp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;qwen2.5:7b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;93.9%&lt;/td&gt;
&lt;td&gt;87.8%&lt;/td&gt;
&lt;td&gt;−6.1 pp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gemma2:9b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;93.5%&lt;/td&gt;
&lt;td&gt;83.4%&lt;/td&gt;
&lt;td&gt;−10.1 pp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;llama3.1:8b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;93.1%&lt;/td&gt;
&lt;td&gt;83.2%&lt;/td&gt;
&lt;td&gt;−9.9 pp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;llama3.2:3b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;88.1%&lt;/td&gt;
&lt;td&gt;79.3%&lt;/td&gt;
&lt;td&gt;−8.8 pp&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;mistral:7b went from 5th to 2nd. gemma2:9b dropped 10 percentage points and posted the worst category score in the entire dataset (17.1/25 average in PARTIAL_CONTEXT on NIST).&lt;/p&gt;

&lt;p&gt;Now I have two explanations and no way to distinguish them:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First Guess:&lt;/strong&gt; These are real model differences. Some models handle technical regulatory text better than dense academic prose. mistral is more stable across document types; gemma2 is more brittle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explanation B:&lt;/strong&gt; Chunking performance is entirely document-dependent, and the empirical data proves there is no single "best" strategy for everything. &lt;/p&gt;

&lt;p&gt;Recent research highlights exactly how much the structure of a document dictates the winning pipeline. For instance, a February 2026 benchmark by Vecta evaluating 7 chunking strategies across 50 academic papers found that standard &lt;strong&gt;recursive 512-token splitting took 1st place with 69% accuracy&lt;/strong&gt;. In that specific domain, semantic chunking tanked at 54% because it over-fragmented the text, producing tiny snippets averaging just 43 tokens that stripped away crucial context. For a standard academic paper, fixed-size or recursive chunking is often perfectly fine or even superior.&lt;/p&gt;

&lt;p&gt;Conversely, when dealing with complex, non-linear layouts, fixed token limits completely collapse. A separate study evaluating structured/clinical documents found that &lt;strong&gt;adaptive, theme-boundary chunking reached 87% accuracy, while fixed-size baselines plummeted to a dismal 13%&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;This completely recontextualizes my results. My naive 200-token split with 100-token overlap happened to work reasonably well for the uniform, dense layout of the Transformer paper. But when applied to a 70-page regulatory standard like NIST—where a single requirement might be scattered across cross-referenced sections and multi-row tables—it arbitrarily butchered the text. Models like &lt;code&gt;gemma2&lt;/code&gt; that are highly sensitive to context fragmentation fell off a cliff, while &lt;code&gt;mistral&lt;/code&gt; proved much more resilient at handling the poorly sliced context.&lt;/p&gt;

&lt;p&gt;The takeaway isn't that semantic chunking is a silver bullet—it's that a one-size-fits-all chunking pipeline is fundamentally broken. The experiment that would actually prove this — running the same models with multiple chunking configurations (fixed vs. semantic vs. structure-aware) on the exact same document — is the one I didn't do.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'd Actually Need to Know Which Model to Pick
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  Multiple chunking strategies per document type, held constant while varying models&lt;/li&gt;
&lt;li&gt;  Retrieval quality metrics separate from answer quality (MRR, Recall@5 — did the right chunk even make it into the top 5?)&lt;/li&gt;
&lt;li&gt;  Multiple judge models, not just one (my judge could have systematic biases I can't detect)&lt;/li&gt;
&lt;li&gt;  Real user questions from actual sessions, not questions I wrote after reading the document myself&lt;/li&gt;
&lt;li&gt;  Multiple runs per model to account for non-determinism&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without these, the ranking I have is a ranking of "this specific pipeline configuration" not "these models."&lt;/p&gt;




&lt;h2&gt;
  
  
  The Honest Takeaway
&lt;/h2&gt;

&lt;p&gt;I didn't build a production RAG app. I built an understanding of how much is hidden under "just do RAG."&lt;/p&gt;

&lt;p&gt;The thing I expected to matter most — model choice — turned out to be inseparable from chunking strategy, retrieval configuration, and document structure. Changing chunk size doesn't change which model is capable of what. It changes what the model sees. And what the model sees determines everything.&lt;/p&gt;

&lt;p&gt;If I had to tell someone one thing before they start benchmarking models for RAG: measure your retrieval quality first. If the right chunks aren't being retrieved, you're not benchmarking models — you're benchmarking whether your similarity search surfaces the right context. Those are very different problems.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>llm</category>
      <category>springai</category>
      <category>ollama</category>
    </item>
    <item>
      <title>My RAG Benchmark is lying to me</title>
      <dc:creator>Dogukan Karademir</dc:creator>
      <pubDate>Sun, 28 Jun 2026 21:45:58 +0000</pubDate>
      <link>https://dev.to/mido-dev/my-rag-benchmark-is-lying-to-me-54e4</link>
      <guid>https://dev.to/mido-dev/my-rag-benchmark-is-lying-to-me-54e4</guid>
      <description>&lt;p&gt;I built a benchmark to find the best local LLM for my RAG system. After some runs, I'm less confident in the results than when I started — and I think that's the more useful story.&lt;/p&gt;

&lt;p&gt;Here's the specific problem that broke my assumptions.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Kenning&lt;/strong&gt; is a Spring Boot RAG backend: Spring AI, pgvector, Ollama, Apache Tika for PDF parsing. You upload a document, ask questions, get answers grounded only in that document.&lt;/p&gt;

&lt;p&gt;I built a benchmark to test six local models: &lt;code&gt;llama3.1:8b&lt;/code&gt;, &lt;code&gt;llama3.2:3b&lt;/code&gt;, &lt;code&gt;qwen2.5:7b&lt;/code&gt;, &lt;code&gt;gemma2:9b&lt;/code&gt;, &lt;code&gt;mistral:7b&lt;/code&gt;, &lt;code&gt;phi4:14b&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Four question categories, judged blind by &lt;code&gt;qwen2.5:14b&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;IN_CONTEXT&lt;/strong&gt; — answer is in the document&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;OUT_OF_CONTEXT&lt;/strong&gt; — answer isn't; model must refuse&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;PARTIAL_CONTEXT&lt;/strong&gt; — partial information; model must say what it found and what's missing&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;MULTI_CHUNK&lt;/strong&gt; — answer spans multiple sections&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Maximum 875 points per model at 35 questions.&lt;/p&gt;




&lt;h2&gt;
  
  
  First Problem: The Ceiling Effect
&lt;/h2&gt;

&lt;p&gt;First run, 20 questions on &lt;em&gt;Attention Is All You Need&lt;/em&gt; (the Transformer paper):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;qwen2.5:7b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;481/500 — 96.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;llama3.1:8b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;475/500 — 95.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;phi4:14b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;474/500 — 94.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gemma2:9b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;473/500 — 94.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;llama3.2:3b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;466/500 — 93.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;mistral:7b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;463/500 — 92.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;IN_CONTEXT category: every single model averaged 25/25. Perfect score.&lt;/p&gt;

&lt;p&gt;This is what a useless benchmark looks like. Questions like &lt;em&gt;"How many attention heads does the Transformer use?"&lt;/em&gt; are trivially easy if the retrieved chunk contains &lt;code&gt;h = 8&lt;/code&gt;. I wasn't measuring model capability — I was measuring whether models can read.&lt;/p&gt;

&lt;p&gt;I added 15 harder questions and rewrote the chunking.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Rewrite That Changed Everything
&lt;/h2&gt;

&lt;p&gt;The original code used &lt;code&gt;TokenTextSplitter&lt;/code&gt; with default settings. I changed it to 200-token chunks with 100-token overlap between adjacent chunks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;TokenTextSplitter&lt;/span&gt; &lt;span class="n"&gt;splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TokenTextSplitter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withChunkSize&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withKeepSeparator&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

&lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;splitter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;apply&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;overlapped&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;overlapAppender&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;addOverlap&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The idea: information lost at chunk boundaries (a sentence split across two chunks is fully represented in neither) would be preserved by overlapping.&lt;/p&gt;

&lt;p&gt;New results on 35 questions, same document:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;phi4:14b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;839/875 — 95.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;qwen2.5:7b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;822/875 — 93.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gemma2:9b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;818/875 — 93.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;llama3.1:8b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;815/875 — 93.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;mistral:7b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;780/875 — 89.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;llama3.2:3b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;771/875 — 88.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The ranking changed. phi4:14b, which was 3rd before, now leads. The spread grew from 3.6 to 7.8 percentage points.&lt;/p&gt;




&lt;h2&gt;
  
  
  Here's the Problem
&lt;/h2&gt;

&lt;p&gt;I changed &lt;strong&gt;two things&lt;/strong&gt; at the same time: the chunking strategy and the question difficulty. I can't isolate which change drove the ranking shift.&lt;/p&gt;

&lt;p&gt;And I can prove the chunking changed what models actually saw.&lt;/p&gt;

&lt;p&gt;Question q01: &lt;em&gt;"How many attention heads does the base Transformer use?"&lt;/em&gt; — categorized as IN_CONTEXT because the answer (&lt;code&gt;h = 8&lt;/code&gt;) is in the paper.&lt;/p&gt;

&lt;p&gt;Original chunking: retrieved a chunk containing &lt;code&gt;h = 8&lt;/code&gt;. Model answered correctly.&lt;/p&gt;

&lt;p&gt;New chunking: retrieved chunks about multi-head attention applications. The specific &lt;code&gt;h = 8&lt;/code&gt; chunk was no longer in the top 5 by similarity score. phi4:14b correctly said: &lt;em&gt;"The provided context does not specify the number of attention heads."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Judge score: &lt;strong&gt;25/25&lt;/strong&gt;. The model isn't lying — it answered correctly given what it received.&lt;/p&gt;

&lt;p&gt;But the system failed the user. That question is answerable. The document has the answer. The retrieval missed it.&lt;/p&gt;

&lt;p&gt;So here's what I was actually measuring: &lt;strong&gt;model behavior given what my chunking strategy retrieved&lt;/strong&gt; — not model capability. The "model benchmark" was really a "chunking configuration benchmark." I just didn't realize it until the results changed.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Second Document Made It Worse
&lt;/h2&gt;

&lt;p&gt;I added a second document — &lt;strong&gt;NIST SP 800-63B&lt;/strong&gt;, a US federal authentication standard. ~70 pages of SHALL/SHOULD requirements, distributed across sections and tables. Nothing like an academic paper.&lt;/p&gt;

&lt;p&gt;Same questions structure, same judge, same chunking.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Transformer paper&lt;/th&gt;
&lt;th&gt;NIST&lt;/th&gt;
&lt;th&gt;Drop&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;phi4:14b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;95.9%&lt;/td&gt;
&lt;td&gt;90.9%&lt;/td&gt;
&lt;td&gt;−5.0 pp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;mistral:7b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;89.1%&lt;/td&gt;
&lt;td&gt;88.6%&lt;/td&gt;
&lt;td&gt;−&lt;strong&gt;0.5&lt;/strong&gt; pp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;qwen2.5:7b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;93.9%&lt;/td&gt;
&lt;td&gt;87.8%&lt;/td&gt;
&lt;td&gt;−6.1 pp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gemma2:9b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;93.5%&lt;/td&gt;
&lt;td&gt;83.4%&lt;/td&gt;
&lt;td&gt;−10.1 pp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;llama3.1:8b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;93.1%&lt;/td&gt;
&lt;td&gt;83.2%&lt;/td&gt;
&lt;td&gt;−9.9 pp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;llama3.2:3b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;88.1%&lt;/td&gt;
&lt;td&gt;79.3%&lt;/td&gt;
&lt;td&gt;−8.8 pp&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;mistral:7b went from 5th to 2nd. gemma2:9b dropped 10 percentage points and posted the worst category score in the entire dataset (17.1/25 average in PARTIAL_CONTEXT on NIST).&lt;/p&gt;

&lt;p&gt;Now I have two explanations and no way to distinguish them:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First Guess:&lt;/strong&gt; These are real model differences. Some models handle technical regulatory text better than dense academic prose. mistral is more stable across document types; gemma2 is more brittle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explanation B:&lt;/strong&gt; Chunking performance is entirely document-dependent, and the empirical data proves there is no single "best" strategy for everything. &lt;/p&gt;

&lt;p&gt;Recent research highlights exactly how much the structure of a document dictates the winning pipeline. For instance, a February 2026 benchmark by Vecta evaluating 7 chunking strategies across 50 academic papers found that standard &lt;strong&gt;recursive 512-token splitting took 1st place with 69% accuracy&lt;/strong&gt;. In that specific domain, semantic chunking tanked at 54% because it over-fragmented the text, producing tiny snippets averaging just 43 tokens that stripped away crucial context. For a standard academic paper, fixed-size or recursive chunking is often perfectly fine or even superior.&lt;/p&gt;

&lt;p&gt;Conversely, when dealing with complex, non-linear layouts, fixed token limits completely collapse. A separate study evaluating structured/clinical documents found that &lt;strong&gt;adaptive, theme-boundary chunking reached 87% accuracy, while fixed-size baselines plummeted to a dismal 13%&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;This completely recontextualizes my results. My naive 200-token split with 100-token overlap happened to work reasonably well for the uniform, dense layout of the Transformer paper. But when applied to a 70-page regulatory standard like NIST—where a single requirement might be scattered across cross-referenced sections and multi-row tables—it arbitrarily butchered the text. Models like &lt;code&gt;gemma2&lt;/code&gt; that are highly sensitive to context fragmentation fell off a cliff, while &lt;code&gt;mistral&lt;/code&gt; proved much more resilient at handling the poorly sliced context.&lt;/p&gt;

&lt;p&gt;The takeaway isn't that semantic chunking is a silver bullet—it's that a one-size-fits-all chunking pipeline is fundamentally broken. The experiment that would actually prove this — running the same models with multiple chunking configurations (fixed vs. semantic vs. structure-aware) on the exact same document — is the one I didn't do.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'd Actually Need to Know Which Model to Pick
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  Multiple chunking strategies per document type, held constant while varying models&lt;/li&gt;
&lt;li&gt;  Retrieval quality metrics separate from answer quality (MRR, Recall@5 — did the right chunk even make it into the top 5?)&lt;/li&gt;
&lt;li&gt;  Multiple judge models, not just one (my judge could have systematic biases I can't detect)&lt;/li&gt;
&lt;li&gt;  Real user questions from actual sessions, not questions I wrote after reading the document myself&lt;/li&gt;
&lt;li&gt;  Multiple runs per model to account for non-determinism&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without these, the ranking I have is a ranking of "this specific pipeline configuration" not "these models."&lt;/p&gt;




&lt;h2&gt;
  
  
  The Honest Takeaway
&lt;/h2&gt;

&lt;p&gt;I didn't build a production RAG app. I built an understanding of how much is hidden under "just do RAG."&lt;/p&gt;

&lt;p&gt;The thing I expected to matter most — model choice — turned out to be inseparable from chunking strategy, retrieval configuration, and document structure. Changing chunk size doesn't change which model is capable of what. It changes what the model sees. And what the model sees determines everything.&lt;/p&gt;

&lt;p&gt;If I had to tell someone one thing before they start benchmarking models for RAG: measure your retrieval quality first. If the right chunks aren't being retrieved, you're not benchmarking models — you're benchmarking whether your similarity search surfaces the right context. Those are very different problems.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>llm</category>
      <category>springai</category>
      <category>ollama</category>
    </item>
    <item>
      <title>How I Run My Content Tooling on a Local Model for $0</title>
      <dc:creator>Hugo Kuznicki</dc:creator>
      <pubDate>Sun, 28 Jun 2026 04:58:53 +0000</pubDate>
      <link>https://dev.to/hugo_kuznicki_1ff20709904/how-i-run-my-content-tooling-on-a-local-model-for-0-1oig</link>
      <guid>https://dev.to/hugo_kuznicki_1ff20709904/how-i-run-my-content-tooling-on-a-local-model-for-0-1oig</guid>
      <description>&lt;p&gt;A few months ago I added up what I was spending on AI APIs just to draft social posts. It wasn't a lot — a few dollars here, a few there — but it was a &lt;em&gt;recurring&lt;/em&gt; cost for something I do every single day. And every time I wanted to experiment, regenerate, or tweak a prompt, a little meter ticked in the back of my head telling me to stop wasting tokens.&lt;/p&gt;

&lt;p&gt;So I moved the whole thing local. No API keys, no per-token billing, nothing leaving my machine. Here's exactly how, including the parts that aren't as clean as the pitch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why local at all?
&lt;/h2&gt;

&lt;p&gt;Three reasons, in order of how much they actually mattered to me:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cost goes to zero.&lt;/strong&gt; Not "cheaper" — &lt;em&gt;zero&lt;/em&gt;. Once the model is on your disk, generating a thousand drafts costs the same as generating one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iteration becomes free, which changes your behavior.&lt;/strong&gt; This is the part nobody tells you. When each generation is metered, you ration attempts. When it's free, you regenerate aggressively — and the output gets &lt;em&gt;better&lt;/em&gt; because you stop being precious about it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Privacy by default.&lt;/strong&gt; My prompts, drafts, and half-baked ideas never touch a third-party server. For content I haven't published yet, that's a real comfort.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The setup: Ollama in five minutes
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://ollama.com" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt; is the easiest way to run an LLM locally. Install it, pull a model, and you've got an HTTP server on &lt;code&gt;localhost&lt;/code&gt; that speaks a simple API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install (macOS/Linux)&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://ollama.com/install.sh | sh

&lt;span class="c"&gt;# Pull an instruct-tuned model&lt;/span&gt;
ollama pull llama3.1:8b

&lt;span class="c"&gt;# It's now serving on http://localhost:11434&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the entire infrastructure. No account, no key, no dashboard. The model runs as a local service and you talk to it over HTTP like any other API — except this one is on your machine and free.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pipeline
&lt;/h2&gt;

&lt;p&gt;My content workflow is deliberately boring: &lt;strong&gt;one topic in, a batch of platform-specific posts out.&lt;/strong&gt; The whole thing is a thin layer around three ideas — a per-platform prompt template, a call to the local model, and a tiny bit of cleanup.&lt;/p&gt;

&lt;p&gt;Here's the core call. Ollama exposes a &lt;code&gt;/api/generate&lt;/code&gt; endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama3.1:8b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434/api/generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No SDK, no auth header, no &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; in your environment. It's just a POST to localhost.&lt;/p&gt;

&lt;p&gt;The interesting part is the templating. Each platform gets its own prompt with its own constraints baked in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;TEMPLATES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;twitter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write 3 punchy tweet hooks about: {topic}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Rules: under 280 chars, no hashtags, no emoji spam, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lead with the most surprising angle.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;linkedin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write a short LinkedIn post about: {topic}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Rules: 1 strong opening line, 3 short paragraphs, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a question at the end. Plain language, no buzzwords.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thread&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Outline a 5-tweet thread about: {topic}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Each tweet on its own line, numbered, each able to stand alone.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;platforms&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;platforms&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TEMPLATES&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Call &lt;code&gt;run("local LLMs for content", ["twitter", "linkedin", "thread"])&lt;/code&gt; and you get a dict of drafts back, generated entirely on your own hardware, for nothing.&lt;/p&gt;

&lt;p&gt;The real product wraps this with a UI, a platform picker, and output cleanup — but the engine is genuinely this small. That's the point. Most of the value isn't in the model; it's in the &lt;em&gt;templates&lt;/em&gt; that constrain the model into something usable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The thing that actually makes it good: tight prompts
&lt;/h2&gt;

&lt;p&gt;Smaller local models are less forgiving than a frontier API. A vague prompt to GPT-class hosted models still produces something passable. A vague prompt to an 8B local model produces mush. So the work shifts from "pay for a smarter model" to "write a sharper prompt."&lt;/p&gt;

&lt;p&gt;Concretely, what moved quality the most:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bake the constraints into the template, not the topic.&lt;/strong&gt; Character limits, tone, structure — put them in the reusable template so every generation inherits them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ask for multiple options.&lt;/strong&gt; "Write 3 hooks" beats "write a hook" — you pick the best and the model explores more of the space.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep a &lt;code&gt;Modelfile&lt;/code&gt; for a custom system prompt&lt;/strong&gt; if you find yourself repeating instructions:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; llama3.1:8b&lt;/span&gt;
SYSTEM "You are a concise copywriter. No clichés, no 'in today's
fast-paced world', no emoji unless asked. Plain, specific language."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama create copywriter &lt;span class="nt"&gt;-f&lt;/span&gt; Modelfile
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now &lt;code&gt;copywriter&lt;/code&gt; carries that voice everywhere and your per-call prompts get shorter.&lt;/p&gt;

&lt;h2&gt;
  
  
  The honest tradeoffs
&lt;/h2&gt;

&lt;p&gt;I'm not going to pretend local is strictly better. It isn't.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Long-form coherence is weaker.&lt;/strong&gt; For short-form (hooks, captions, threads) local models are great. For a 2,000-word essay that needs to hold an argument, a frontier API still wins. Know which job you're doing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cold-start latency is real.&lt;/strong&gt; The first request after the model unloads is slow. Keep it warm if you generate in bursts (&lt;code&gt;ollama run&lt;/code&gt; in the background, or a keepalive ping).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You own the ops.&lt;/strong&gt; No hosted API means no one else patches, scales, or babysits it. For a personal tool that's fine; for a product serving others it's a real consideration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hardware matters.&lt;/strong&gt; An 8B model is comfortable on a modern laptop. Bigger models want more RAM/VRAM. Match the model to your machine instead of reaching for the biggest one.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The trade I'm making — slightly less polish in exchange for $0 cost, full privacy, and unlimited iteration — is overwhelmingly worth it for high-frequency, templated work. That's most of what content generation actually is.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;The headline isn't "local models are magic." It's that &lt;strong&gt;for the specific job of churning out daily, templated content, the economics and the workflow both flip in local's favor&lt;/strong&gt; — and the setup is genuinely a five-minute Ollama install plus a few prompt templates.&lt;/p&gt;

&lt;p&gt;I packaged my own version of this into a small tool called &lt;strong&gt;Content Studio&lt;/strong&gt; (idea → batch of posts, runs fully local, $0 to run) if you'd rather not wire it up yourself — it's &lt;a href="https://kuznicki6.gumroad.com/l/kqusjo" rel="noopener noreferrer"&gt;on Gumroad&lt;/a&gt; and the open-source pieces live on &lt;a href="https://github.com/kuznickicapital-ship-it" rel="noopener noreferrer"&gt;my GitHub&lt;/a&gt;. And if you want the longer build-in-public breakdowns, I write them up in &lt;a href="https://hugos-newsletter-e0c067.beehiiv.com/" rel="noopener noreferrer"&gt;my newsletter&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But honestly — even if you build your own from the snippets above, do it. Watching your API bill hit $0 while your output goes &lt;em&gt;up&lt;/em&gt; is a weirdly satisfying way to start a week.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>ollama</category>
      <category>localllm</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Local LLMs in 2026: Which Runtime to Run and the Hardware You Need</title>
      <dc:creator>Nishil Bhave</dc:creator>
      <pubDate>Sat, 27 Jun 2026 23:13:03 +0000</pubDate>
      <link>https://dev.to/nishilbhave/local-llms-in-2026-which-runtime-to-run-and-the-hardware-you-need-2hek</link>
      <guid>https://dev.to/nishilbhave/local-llms-in-2026-which-runtime-to-run-and-the-hardware-you-need-2hek</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fxr5amhx08mp0aix017bd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fxr5amhx08mp0aix017bd.png" alt="Comparison-table hero for running local LLMs in 2026, ranking the runtimes Ollama, LM Studio, llama.cpp and vLLM across what they are, best use, throughput at 64 users (vLLM at 793 tokens per second), and interface, with a hardware key-takeaway strip." width="800" height="537"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Local LLMs in 2026: Which Runtime to Run and the Hardware You Need
&lt;/h1&gt;

&lt;p&gt;A few weekends ago I ran a 30-billion-parameter model on a laptop with no internet connection, and it answered my coding questions at reading speed. No API key. No per-token meter ticking. That setup would have been a research-lab flex two years ago. In 2026 it's a default install.&lt;/p&gt;

&lt;p&gt;The tooling caught up fast. Ollama, the project most people start with, passed 174,000 GitHub stars and 16,700 forks by mid-2026 (&lt;a href="https://github.com/ollama/ollama" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, 2026), and the &lt;code&gt;llama.cpp&lt;/code&gt; engine underneath much of this stack sits north of 73,000 stars of its own. But here's the honest part most "run AI locally" posts skip: a local LLM is still a niche. Menlo Ventures found open-source models hold just 11% of enterprise LLM usage in 2025, down from 19% the year before (&lt;a href="https://menlovc.com/perspective/2025-the-state-of-generative-ai-in-the-enterprise/" rel="noopener noreferrer"&gt;Menlo Ventures&lt;/a&gt;, 2025). Most production traffic still hits a hosted API.&lt;/p&gt;

&lt;p&gt;So who should actually run one, and with what? I've put real hours into Ollama, LM Studio, llama.cpp, and vLLM across a Mac and a mid-range GPU box. This is the working map: the four runtimes that matter, a decision box that tells you which to pick, and the hardware reality check, with the model-versus-model fights pushed out to dedicated guides so this one stays a map and not a maze.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key Takeaways&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ollama leads on mindshare (&lt;strong&gt;174K+ GitHub stars&lt;/strong&gt;, 2026), but it's a wrapper around &lt;code&gt;llama.cpp&lt;/code&gt;, the engine doing the actual work (&lt;a href="https://github.com/ollama/ollama" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, 2026).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The "which runtime" question is really a concurrency question.&lt;/strong&gt; For one user, Ollama, LM Studio, and llama.cpp are roughly tied; the moment you serve many users at once, vLLM pulls ahead by a wide margin.&lt;/li&gt;
&lt;li&gt;At 64 concurrent users, vLLM generated about &lt;strong&gt;44x more tokens per second than llama.cpp&lt;/strong&gt; in Red Hat's benchmark, while llama.cpp's first token took over three minutes (&lt;a href="https://developers.redhat.com/articles/2026/06/15/llamacpp-vs-vllm-choosing-right-local-llm-inference-engine" rel="noopener noreferrer"&gt;Red Hat Developer&lt;/a&gt;, 2026).&lt;/li&gt;
&lt;li&gt;Hardware is the real gate: a 70B model at Q4_K_M quantization wants roughly &lt;strong&gt;40GB of memory&lt;/strong&gt;, so a 24GB GPU or a 64GB-plus Mac is the practical entry point for the big models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Privacy and cost are the two honest reasons to go local.&lt;/strong&gt; 44% of enterprises name data privacy as their top barrier to LLM adoption (&lt;a href="https://konghq.com/blog/enterprise/enterprise-ai-spending-2025" rel="noopener noreferrer"&gt;Kong&lt;/a&gt;, 2025), and local inference has a marginal cost of zero per request.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What Is a Local LLM, and Why Run One in 2026?
&lt;/h2&gt;

&lt;p&gt;A local LLM is a language model that runs entirely on your own machine, with no request leaving your hardware. That matters because privacy is the number one blocker to AI adoption: 44% of enterprises cite data privacy and security as their top barrier to using LLMs (&lt;a href="https://konghq.com/blog/enterprise/enterprise-ai-spending-2025" rel="noopener noreferrer"&gt;Kong&lt;/a&gt;, 2025). When the model lives on your laptop, the prompt never travels.&lt;/p&gt;

&lt;p&gt;The other reason is money. A hosted API charges per token forever. A local model charges you once, in hardware, and then runs at zero marginal cost per request. For a developer hammering a model all day, that math flips quickly. Privacy-focused builds keep sensitive code, contracts, or health data on-device, which is exactly why the "private llm" search trend keeps climbing.&lt;/p&gt;

&lt;p&gt;There's a third reason that's quieter but real: control. You pick the exact model, the exact quantization, and the exact version. Nothing gets deprecated out from under you. Some people also run local models specifically to step outside hosted guardrails, a sub-audience covered in the guide to the best uncensored and roleplay local LLMs.&lt;/p&gt;

&lt;p&gt;Now the anti-hype counterweight. Local does not mean free of tradeoffs. You give up frontier quality, you babysit your own hardware, and you eat the setup cost. Independent 2026 benchmarks put local inference on consumer hardware at roughly 70 to 85% of frontier-model quality on common tasks (&lt;a href="https://pooya.blog/blog/local-ai-ollama-benchmarks-cost-2026/" rel="noopener noreferrer"&gt;Pooya Golchian&lt;/a&gt;, 2026). For a lot of work that's plenty. For the hardest reasoning, it isn't. Knowing which bucket your task lands in is the whole game.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What I actually saw:&lt;/strong&gt; On an M-series Mac and a 12GB RTX 3060 box, the 7B and 8B models felt instant and genuinely useful for autocomplete, summarizing, and quick refactors. The 70B-class models technically loaded, but only on the Mac with enough unified memory, and they crawled. The gap between "runs" and "runs well" is almost entirely a hardware story, which is the section most guides bury.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Four Local LLM Runtimes Worth Knowing
&lt;/h2&gt;

&lt;p&gt;There are dozens of local-LLM tools, but four cover almost every real use case: Ollama, LM Studio, llama.cpp, and vLLM. Three of them (Ollama, LM Studio, and most desktop apps) are wrappers or GUIs sitting on top of &lt;code&gt;llama.cpp&lt;/code&gt;, which crossed 73,000 GitHub stars as the de facto engine for consumer inference (&lt;a href="https://github.com/ggml-org/llama.cpp" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, 2026). vLLM is the outlier, built for serving at scale.&lt;/p&gt;

&lt;p&gt;Here's the honest one-line verdict on each, with the deep setups linked out so this stays a map:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Runtime&lt;/th&gt;
&lt;th&gt;What it is&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;th&gt;Interface&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ollama&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The easy button. One command pulls and runs a model.&lt;/td&gt;
&lt;td&gt;Getting started, scripting, local dev&lt;/td&gt;
&lt;td&gt;CLI + API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LM Studio&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A polished desktop GUI over the same engine.&lt;/td&gt;
&lt;td&gt;Browsing, downloading, and chatting with zero terminal&lt;/td&gt;
&lt;td&gt;GUI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;llama.cpp&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The C/C++ engine everything else is built on.&lt;/td&gt;
&lt;td&gt;Max control, custom quantization, embedding in your own app&lt;/td&gt;
&lt;td&gt;CLI / library&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;vLLM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A production inference server with continuous batching.&lt;/td&gt;
&lt;td&gt;Serving many users, building a product, throughput&lt;/td&gt;
&lt;td&gt;Server / API&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Ollama is where most people should start, and the full walkthrough lives in the complete Ollama guide covering setup, models, the web UI, and troubleshooting. If you'd rather click than type, the LM Studio guide on downloading models and how LM Studio compares to Ollama is the better entry point. The Ollama-versus-LM-Studio choice is mostly taste: same engine, different front door.&lt;/p&gt;

&lt;p&gt;According to GitHub's own counts, Ollama passed 174,000 stars and 16,700 forks by mid-2026 (&lt;a href="https://github.com/ollama/ollama" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, 2026), making it the most-starred local-LLM runtime by a wide margin. But star counts measure attention, not throughput. The engine underneath, &lt;code&gt;llama.cpp&lt;/code&gt;, is what actually turns model weights into tokens, and choosing between the four runtimes is really about how many people you need to serve at once.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The reframe most comparisons miss:&lt;/strong&gt; "Which runtime is best?" is the wrong question. They mostly run the same models at the same quality. The real question is "how many requests at once?" That single variable, concurrency, is what separates the easy desktop tools from vLLM, and it's the axis the next chart is built on.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Ollama vs llama.cpp vs vLLM: Which Runtime Is Fastest?
&lt;/h2&gt;

&lt;p&gt;It depends entirely on load, and that caveat is the answer. For a single user, Ollama, LM Studio, and llama.cpp are roughly tied, often within a few tokens per second of each other. For many concurrent users, vLLM is in a different league: at 64 simultaneous users it generated about 44 times more tokens per second than llama.cpp in Red Hat's tests (&lt;a href="https://developers.redhat.com/articles/2026/06/15/llamacpp-vs-vllm-choosing-right-local-llm-inference-engine" rel="noopener noreferrer"&gt;Red Hat Developer&lt;/a&gt;, 2026).&lt;/p&gt;

&lt;p&gt;Why the gap? Architecture. Tools like Ollama and llama.cpp process requests largely one at a time, which is perfect for a single developer at a keyboard. vLLM uses continuous batching and PagedAttention to interleave many requests across the GPU, so its throughput climbs as load climbs. The flip side: under heavy concurrency, llama.cpp's first token can take more than three minutes because requests queue (&lt;a href="https://developers.redhat.com/articles/2026/06/15/llamacpp-vs-vllm-choosing-right-local-llm-inference-engine" rel="noopener noreferrer"&gt;Red Hat Developer&lt;/a&gt;, 2026). One benchmark clocked vLLM at a peak of 793 tokens per second against Ollama's 41 under the same load, a roughly 19x gap (&lt;a href="https://tech-insider.org/vllm-vs-ollama-2026-2/" rel="noopener noreferrer"&gt;tech-insider&lt;/a&gt;, 2026).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Ft0u4ch7yp7fyvpbe7k9b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Ft0u4ch7yp7fyvpbe7k9b.png" alt="Grouped bar chart comparing Ollama and vLLM output tokens per second for a single request versus many concurrent requests. For one request Ollama reaches 45 and vLLM 38. For many requests Ollama reaches 41 while vLLM reaches 793." width="799" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Source: Red Hat Developer and independent vLLM vs Ollama benchmarks, 2026&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The practical takeaway is simple. Are you one person at a keyboard? Ollama or LM Studio, and the throughput numbers above barely matter. Are you putting a model behind an app for real users? That's a vLLM job. The cross-runtime comparisons (&lt;code&gt;llama.cpp&lt;/code&gt; vs Ollama, vLLM vs Ollama) live here in the pillar on purpose, while the tool-specific deep dives stay in their own guides so nothing cannibalizes.&lt;/p&gt;

&lt;p&gt;For one user, the runtime you pick changes your tokens per second by single digits. For a hundred users, it changes them by an order of magnitude. vLLM's continuous batching is the reason a production deployment serving concurrent traffic should not be running on the same tool a solo developer uses for autocomplete (&lt;a href="https://developers.redhat.com/articles/2026/06/15/llamacpp-vs-vllm-choosing-right-local-llm-inference-engine" rel="noopener noreferrer"&gt;Red Hat Developer&lt;/a&gt;, 2026).&lt;/p&gt;

&lt;h2&gt;
  
  
  Which Local LLM Tool Should You Use?
&lt;/h2&gt;

&lt;p&gt;Pick based on one thing first: who's calling the model. A solo developer wants the easiest path (Ollama or LM Studio); a team shipping a product wants throughput (vLLM); a tinkerer who needs custom quantization wants the raw engine (llama.cpp). Everything else is a detail. Here's the decision box I actually use.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The which-tool-to-pick decision box&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;If you...&lt;/th&gt;
&lt;th&gt;Run&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Want a model running in two minutes&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Ollama&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;One command pulls and serves a model, with a built-in API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prefer clicking to typing&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;LM Studio&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A real GUI to browse, download, and chat, no terminal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Need custom quantization or to embed inference in your own binary&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;llama.cpp&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The engine itself, minimal dependencies, total control&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Are serving many users or building a product&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;vLLM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Continuous batching scales throughput with concurrency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Are on an Apple Silicon Mac and want max speed&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Ollama or LM Studio&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Both ride Metal/MLX acceleration under the hood&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Want to wire a local model into your editor or agents&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Ollama&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Its OpenAI-compatible API drops into most tools&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;/blockquote&gt;

&lt;p&gt;A point worth stressing: these aren't exclusive. My own setup runs Ollama for day-to-day CLI work and keeps LM Studio around for visually browsing new models before I commit. They share the same model files and the same engine, so switching costs almost nothing. If you want a local model powering an editor like Cursor or driving an agent, Ollama's OpenAI-compatible endpoint is the path of least resistance, and you can connect it to external tools through the &lt;a href="https://maketocreate.com/mcp-servers-in-2026-complete-model-context-protocol-guide/" rel="noopener noreferrer"&gt;Model Context Protocol, which standardizes how AI clients talk to tools and data&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;One boundary to keep straight: this is about runtimes, not agents. If you're comparing coding &lt;em&gt;assistants&lt;/em&gt; (Cursor, Claude Code, Copilot) rather than the engines that run models, that's a different decision covered in the &lt;a href="https://maketocreate.com/ai-coding-agents-in-2026-5-categories-and-how-to-pick/" rel="noopener noreferrer"&gt;comparison of AI coding agents across five categories&lt;/a&gt;. Runtimes run models. Agents wrap workflows around them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Hardware Do You Need to Run a Local LLM?
&lt;/h2&gt;

&lt;p&gt;Memory is the gate, not raw compute. The rule of thumb: a model needs roughly its parameter count in gigabytes at 4-bit quantization, plus overhead. A 7B model at Q4_K_M wants about 5 to 6GB; a 70B model at the same quantization wants roughly 40GB once you account for the KV cache and runtime overhead (&lt;a href="https://www.sitepoint.com/vram-requirements-70b-models-16gb-gpu-minimum-2026/" rel="noopener noreferrer"&gt;SitePoint&lt;/a&gt;, 2026). That number decides everything else.&lt;/p&gt;

&lt;p&gt;Quantization is the lever that makes local LLMs practical at all. It shrinks the model's weights from 16-bit floats down to 4-bit or 5-bit integers, cutting memory roughly in four. The community settled on Q4_K_M as the sweet spot: the quality hit is tiny for everyday use, a perplexity delta of only about +0.05, though coding and multi-step reasoning can drop 5 to 15% versus full precision (&lt;a href="https://willitrunai.com/blog/quantization-guide-gguf-explained" rel="noopener noreferrer"&gt;Will It Run AI&lt;/a&gt;, 2026). In practice, a well-quantized model is almost always worth it to fit a bigger, smarter model into the same memory.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fesnd3fq3k3i63jo8qb4x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fesnd3fq3k3i63jo8qb4x.png" alt="Lollipop chart showing approximate memory needed to run models at Q4_K_M 4-bit quantization. A 7 billion parameter model needs about 6 gigabytes, 13 billion about 10, 32 billion about 22, and 70 billion about 40 gigabytes." width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Source: SitePoint, llmhardware.io, and Will It Run AI quantization guides, 2026&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;So what should you buy? On the PC side, a 16GB GPU is now the realistic minimum for serious work, and a 24GB card (an RTX 3090 or 4090) is the practical sweet spot because it just barely fits a 70B model at Q4_K_M (&lt;a href="https://www.sitepoint.com/vram-requirements-70b-models-16gb-gpu-minimum-2026/" rel="noopener noreferrer"&gt;SitePoint&lt;/a&gt;, 2026). Below that, you're living in 7B-to-13B territory, which is genuinely fine for autocomplete, summarizing, and most coding help. The best GPU for a local LLM is, bluntly, whichever one has the most VRAM you can afford.&lt;/p&gt;

&lt;p&gt;A 70B model at Q4_K_M needs roughly 40GB of memory once you include the KV cache, which is why a single 24GB consumer GPU is the practical ceiling for the largest models and a 64GB-plus unified-memory Mac is the realistic alternative (&lt;a href="https://www.sitepoint.com/vram-requirements-70b-models-16gb-gpu-minimum-2026/" rel="noopener noreferrer"&gt;SitePoint&lt;/a&gt;, 2026). Match your model's memory footprint to your hardware first, and pick the model second. For which models actually fit and perform, the guide to the best open-source LLMs does the model-by-model breakdown.&lt;/p&gt;

&lt;h2&gt;
  
  
  Can You Run a Local LLM on a Mac?
&lt;/h2&gt;

&lt;p&gt;Yes, and Apple Silicon is quietly one of the best local-LLM platforms you can buy, thanks to unified memory. On an M-series Mac, the CPU, GPU, and Neural Engine share one high-bandwidth memory pool, so the GPU reads model weights without copying them across a PCIe bus. The M4 Max moves data at about 546 GB/s, which is why it generates tokens faster than any other current Apple chip (&lt;a href="https://www.sitepoint.com/local-llms-apple-silicon-mac-2026/" rel="noopener noreferrer"&gt;SitePoint&lt;/a&gt;, 2026).&lt;/p&gt;

&lt;p&gt;The catch is the same as everywhere: memory. A 70B model at Q4 is around 43GB, which technically fits a 64GB Mac, but macOS memory pressure spikes and the system starts swapping to SSD, which tanks your tokens per second. For a stable 70B workflow on a Mac in 2026, 128GB of unified memory is the realistic requirement (&lt;a href="https://www.sitepoint.com/local-llm-hardware-requirements-mac-vs-pc-2026/" rel="noopener noreferrer"&gt;SitePoint&lt;/a&gt;, 2026). For 7B-to-32B models, a 32GB or 48GB Mac is comfortable.&lt;/p&gt;

&lt;p&gt;One Mac-specific tip from my own testing: Apple's MLX framework, which both Ollama and LM Studio can use under the hood, runs noticeably faster than generic llama.cpp builds because it's written for Metal and unified memory directly, a meaningful speedup on the same hardware (&lt;a href="https://www.sitepoint.com/local-llms-apple-silicon-mac-2026/" rel="noopener noreferrer"&gt;SitePoint&lt;/a&gt;, 2026). If you're on Apple Silicon, prefer an MLX-aware build, and you'll get free speed.&lt;/p&gt;

&lt;p&gt;On Apple Silicon, unified memory means the usable model size is gated by your total RAM, not a separate VRAM number, so a 128GB Mac Studio can hold models that would need multiple datacenter GPUs on a PC (&lt;a href="https://www.sitepoint.com/local-llm-hardware-requirements-mac-vs-pc-2026/" rel="noopener noreferrer"&gt;SitePoint&lt;/a&gt;, 2026). That's the single biggest reason Macs punch above their weight for local inference. The "mac llm" search trend exists for a reason: for many developers, the laptop they already own is the best local-LLM box in the house.&lt;/p&gt;

&lt;h2&gt;
  
  
  When You Should Not Run an LLM Locally
&lt;/h2&gt;

&lt;p&gt;Be honest about this part, because the local-AI hype skips it. You should not run locally when you need frontier-level reasoning, when you need to serve real production traffic without owning a GPU fleet, or when the engineering time to maintain it costs more than the API bill. Open-source models sit at just 11% of enterprise LLM usage for a reason (&lt;a href="https://menlovc.com/perspective/2025-the-state-of-generative-ai-in-the-enterprise/" rel="noopener noreferrer"&gt;Menlo Ventures&lt;/a&gt;, 2025): hosted frontier models still win on raw capability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fjiz9ye7ml7u2kba068es.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fjiz9ye7ml7u2kba068es.png" alt="Donut chart showing that open-source and self-hostable models make up about 11 percent of enterprise LLM usage in 2025, with hosted proprietary APIs making up the other 89 percent." width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Source: Menlo Ventures, State of Generative AI in the Enterprise, 2025&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The cleanest mental model is a hybrid one. Run small, frequent, privacy-sensitive work locally, and route the hard or high-stakes requests to a hosted frontier model. If you're picking between those frontier options, the Claude Opus vs GPT-5 comparison covers the top hosted pair. And when local stops scaling and you need to fan out across multiple providers cleanly, an &lt;a href="https://maketocreate.com/ai-gateway-architecture-7-cross-cutting-concerns-2026/" rel="noopener noreferrer"&gt;AI gateway handles routing, fallback, and the cross-cutting concerns&lt;/a&gt; you'd otherwise hand-roll.&lt;/p&gt;

&lt;p&gt;Local LLMs win on privacy and cost; hosted models win on peak capability and zero-ops scaling. The honest 2026 answer for most teams is not "local versus cloud" but "local for the 80% that's routine, cloud for the 20% that's hard." Treat it as a routing decision, not a religion.&lt;/p&gt;

&lt;h2&gt;
  
  
  Which Models Should You Run Locally?
&lt;/h2&gt;

&lt;p&gt;Start with the model that fits your memory, then optimize for your task. A 7B-to-8B model handles autocomplete and summarizing on almost any modern machine; a 70B model is worth the hardware only if you need its reasoning. The open-source field moves monthly, with strong releases from the Llama, Qwen, DeepSeek, Gemma, and Mistral families all runnable through the runtimes above.&lt;/p&gt;

&lt;p&gt;This pillar deliberately doesn't run the model-versus-model fights, because those are full guides on their own. Here's where to go:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;For coding specifically:&lt;/strong&gt; the ranked guide to the best LLMs for coding covers which models actually write good code, local and hosted.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For a general open-source pick:&lt;/strong&gt; the best open-source LLMs breakdown ranks the current field by use case.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For the DeepSeek question:&lt;/strong&gt; the DeepSeek R1 vs V3 comparison settles which version to run.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For uncensored or roleplay use:&lt;/strong&gt; the best uncensored and roleplay local LLMs covers the models built without heavy guardrails.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For app ideas:&lt;/strong&gt; the directory of awesome LLM apps catalogs what people build on top of these models.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Hugging Face ecosystem now hosts roughly 135,000 GGUF-format models built specifically for local inference, up from a few hundred three years ago (&lt;a href="https://pooya.blog/blog/local-ai-ollama-benchmarks-cost-2026/" rel="noopener noreferrer"&gt;Pooya Golchian&lt;/a&gt;, 2026), so the constraint in 2026 is almost never finding a model. It's matching the right one to your hardware and your task. Pick the runtime first, confirm your memory budget, then choose the biggest model that fits comfortably.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Is Ollama or LM Studio better for running a local LLM?
&lt;/h3&gt;

&lt;p&gt;They run the same models at the same quality, so it comes down to interface. Ollama is a command-line tool with a built-in API, ideal for scripting and dev work. LM Studio is a GUI for people who'd rather click than type. Ollama leads on adoption with 174,000+ GitHub stars in 2026 (&lt;a href="https://github.com/ollama/ollama" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, 2026).&lt;/p&gt;

&lt;h3&gt;
  
  
  What hardware do I need to run a local LLM?
&lt;/h3&gt;

&lt;p&gt;Memory is the gate. A 7B model at 4-bit quantization needs about 5 to 6GB, while a 70B model needs roughly 40GB (&lt;a href="https://www.sitepoint.com/vram-requirements-70b-models-16gb-gpu-minimum-2026/" rel="noopener noreferrer"&gt;SitePoint&lt;/a&gt;, 2026). A 16GB GPU is the realistic minimum for serious work; a 24GB card or a 64GB-plus unified-memory Mac handles the largest models.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is a local LLM as good as ChatGPT or Claude?
&lt;/h3&gt;

&lt;p&gt;Not at the frontier, but closer than you'd think. Independent 2026 benchmarks put local inference at roughly 70 to 85% of frontier-model quality on common tasks (&lt;a href="https://pooya.blog/blog/local-ai-ollama-benchmarks-cost-2026/" rel="noopener noreferrer"&gt;Pooya Golchian&lt;/a&gt;, 2026). For autocomplete, summarizing, and routine coding that's plenty; for the hardest reasoning, hosted models still lead.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why run an LLM locally instead of using an API?
&lt;/h3&gt;

&lt;p&gt;Privacy and cost. 44% of enterprises name data privacy as their top barrier to LLM adoption, which a local model removes entirely since no request leaves your machine (&lt;a href="https://konghq.com/blog/enterprise/enterprise-ai-spending-2025" rel="noopener noreferrer"&gt;Kong&lt;/a&gt;, 2025). Local inference also has zero marginal cost per request, which adds up fast for heavy daily use.&lt;/p&gt;

&lt;h3&gt;
  
  
  Which runtime is fastest for serving many users?
&lt;/h3&gt;

&lt;p&gt;vLLM, by a wide margin. Its continuous batching scales throughput with concurrency, generating about 44 times more tokens per second than llama.cpp at 64 concurrent users (&lt;a href="https://developers.redhat.com/articles/2026/06/15/llamacpp-vs-vllm-choosing-right-local-llm-inference-engine" rel="noopener noreferrer"&gt;Red Hat Developer&lt;/a&gt;, 2026). For a single user, though, Ollama and llama.cpp are roughly tied with it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line on Running LLMs Locally
&lt;/h2&gt;

&lt;p&gt;Running a local LLM in 2026 is no longer a research project; it's a two-minute install with Ollama and a hardware decision. The runtime you pick matters less than people think for solo use, and a lot more once you're serving real traffic. Get the order right: pick the runtime for your concurrency, size your hardware to the model, then choose the biggest model that fits.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Solo developer?&lt;/strong&gt; Ollama or LM Studio, a 16GB-plus GPU or a 32GB-plus Mac, and a 7B-to-32B model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shipping a product?&lt;/strong&gt; vLLM, datacenter GPUs, and a real serving setup.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Privacy-driven?&lt;/strong&gt; Anything local beats a hosted API the moment your data can't leave the building.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're ready to actually install one, the next step is the full Ollama setup and model guide, the fastest path from zero to a model running on your own machine. Then come back and match a model to the hardware you've got.&lt;/p&gt;

</description>
      <category>localllm</category>
      <category>runllmslocally</category>
      <category>ollama</category>
      <category>lmstudio</category>
    </item>
  </channel>
</rss>
