<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: SomeOddCodeGuy</title>
    <description>The latest articles on DEV Community by SomeOddCodeGuy (@someoddcodeguy).</description>
    <link>https://dev.to/someoddcodeguy</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3490530%2F9cdfc762-b1a2-45b2-b90c-252cf15f6fea.png</url>
      <title>DEV Community: SomeOddCodeGuy</title>
      <link>https://dev.to/someoddcodeguy</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/someoddcodeguy"/>
    <language>en</language>
    <item>
      <title>Qwen3.6, and WilmerAI OpenCode workflows</title>
      <dc:creator>SomeOddCodeGuy</dc:creator>
      <pubDate>Mon, 20 Apr 2026 03:10:20 +0000</pubDate>
      <link>https://dev.to/someoddcodeguy/qwen36-and-wilmerai-opencode-workflows-oj7</link>
      <guid>https://dev.to/someoddcodeguy/qwen36-and-wilmerai-opencode-workflows-oj7</guid>
      <description>&lt;p&gt;Just a random note, but Qwen3.6 35b a3b is putting a smile on my face. This little model feels like a big upgrade over 3.5's 27b or 35b a3b.&lt;/p&gt;

&lt;p&gt;Also- the Wilmer workflow for OpenCode is going really well. I need to test it more, because I had to do a big refactor on it, but so far, between that and Qwen3.6, the level of quality I'm seeing from OpenCode now feels &lt;strong&gt;reliable&lt;/strong&gt;. I won't exaggerate the situation by making any claims about it feeling similar in quality to X or Y proprietary cloud models; instead I'll say that up until now, I had not felt like a local model that ran at any kind of decent speed was particularly reliable for power-user level agentic coding. This model, plus jamming my Wilmer workflow between MLX and OpenCode, has changed that. I have more work to do, and a lot more testing to do, but I'm feeling really good about this right now.&lt;/p&gt;

&lt;p&gt;And on a side note: the M5 Max with MLX is absolutely destroying my M3 Ultra in terms of speed when running Qwen3.6 35b. I currently have that model running at bf16 on the M5 Max, and I'm watching it process prompts at insane (for Mac) speeds.&lt;/p&gt;

&lt;p&gt;M5 Max 128GB MacBook Pro MLX Qwen3.6 35b a3b bf16 - 4k tokens&lt;br&gt;
Total Time: ~1.1 seconds&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2026-04-19 22:56:00,920 - INFO - Prompt processing progress: 322/4010
2026-04-19 22:56:01,475 - INFO - Prompt processing progress: 2370/4010
2026-04-19 22:56:01,972 - INFO - Prompt processing progress: 4006/4010
2026-04-19 22:56:02,004 - INFO - Prompt processing progress: 4009/4010
2026-04-19 22:56:02,029 - INFO - Prompt processing progress: 4010/4010
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;M5 Max 128GB MacBook Pro MLX Qwen3.6 35b a3b bf16 - 32k tokens&lt;br&gt;
Total time: ~11 seconds&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2026-04-19 22:56:18,074 - INFO - Prompt processing progress: 2048/32137
2026-04-19 22:56:18,652 - INFO - Prompt processing progress: 4096/32137
2026-04-19 22:56:19,259 - INFO - Prompt processing progress: 6144/32137
2026-04-19 22:56:19,896 - INFO - Prompt processing progress: 8192/32137
2026-04-19 22:56:20,561 - INFO - Prompt processing progress: 10240/32137
2026-04-19 22:56:21,249 - INFO - Prompt processing progress: 12288/32137
2026-04-19 22:56:21,971 - INFO - Prompt processing progress: 14336/32137
2026-04-19 22:56:22,714 - INFO - Prompt processing progress: 16384/32137
2026-04-19 22:56:23,485 - INFO - Prompt processing progress: 18432/32137
2026-04-19 22:56:24,288 - INFO - Prompt processing progress: 20480/32137
2026-04-19 22:56:25,122 - INFO - Prompt processing progress: 22528/32137
2026-04-19 22:56:25,989 - INFO - Prompt processing progress: 24576/32137
2026-04-19 22:56:26,879 - INFO - Prompt processing progress: 26624/32137
2026-04-19 22:56:27,800 - INFO - Prompt processing progress: 28672/32137
2026-04-19 22:56:28,761 - INFO - Prompt processing progress: 30720/32137
2026-04-19 22:56:29,542 - INFO - Prompt processing progress: 32136/32137
2026-04-19 22:56:29,581 - INFO - Prompt processing progress: 32137/32137
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
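
&lt;p&gt;For a rough sense of what that means in throughput terms, here's some napkin math from the totals above (nothing precise, just tokens divided by seconds):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Rough prompt processing throughput implied by the two runs above
echo "4010 / 1.1" | bc -l    # roughly 3600 tokens/sec on the 4k prompt
echo "32137 / 11" | bc -l    # roughly 2900 tokens/sec on the 32k prompt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;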



&lt;p&gt;Anyhow, I have a very busy week coming up, so I'm unlikely to post much for a little bit, but I will be testing this workflow up a storm and really putting this little Qwen through its paces.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>coding</category>
      <category>llm</category>
    </item>
    <item>
      <title>Wilmer Tool Calling</title>
      <dc:creator>SomeOddCodeGuy</dc:creator>
      <pubDate>Mon, 13 Apr 2026 03:53:26 +0000</pubDate>
      <link>https://dev.to/someoddcodeguy/wilmer-tool-calling-492g</link>
      <guid>https://dev.to/someoddcodeguy/wilmer-tool-calling-492g</guid>
      <description>&lt;p&gt;So some year and a half after the request was made for me to put tool calling into Wilmer, I've finally got it in there.&lt;/p&gt;

&lt;p&gt;First off- it was a huge pain to implement; if I didn't have Wilmer itself and agentic coders to help, I'm not sure I'd have done it. The way streaming works with tool calling is a bit odd, too, so that was interesting to navigate. Really, this was something I couldn't have pulled off without the earlier workflow engine refactor for the Execution Context.&lt;/p&gt;

&lt;p&gt;The idea is straightforward: Wilmer sits in between the frontend and the LLM, so it just needs to pass tool definitions from the frontend through to the model, and pass tool call responses from the model back to the frontend. Wilmer itself doesn't need to understand or execute the tools. The tricky part was that Wilmer has a whole pipeline of nodes doing different things (&lt;em&gt;memory lookups, categorization, summarization, context gathering&lt;/em&gt;) and you really don't want tool calls accidentally hitting nodes that are just doing internal processing. So I had to put per-node controls in place. Only the nodes you explicitly flag will pass tools through; the rest strip them out and do their job, with one exception: for some internal nodes that use chat_user_prompt_*, just the tool call outputs get pulled out and handed to them. &lt;/p&gt;

&lt;p&gt;Format conversion between OpenAI, Claude, and Ollama backends was also a headache since they all handle tool calling differently, and streaming tool calls needed their own handling to keep the structured data from getting mangled by the normal text processing pipeline.&lt;/p&gt;
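
&lt;p&gt;To give a feel for the kind of reshaping involved (an illustrative one-liner, not Wilmer's actual conversion code): OpenAI-style responses nest the tool name and a JSON string of arguments under a "function" key, while Anthropic-style responses use a flat "tool_use" content block with a parsed "input" object.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Illustration only: reshape one OpenAI-style tool call into an Anthropic-style
# tool_use block. Note that the arguments string has to be parsed into an object.
echo '{"id":"call_1","type":"function","function":{"name":"get_weather","arguments":"{\"city\":\"Tokyo\"}"}}' \
  | jq '{type: "tool_use", id: .id, name: .function.name, input: (.function.arguments | fromjson)}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;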

&lt;p&gt;But the reason I finally sat down and did this is that I've been using OpenCode more lately. Up until summer of last year I had pretty much written off agentic coding, but once Claude Code got good I found myself sucked in like everyone else. Even though I'm usually a very local-first oriented guy, I've stuck with Claude Code ever since because the quality is so great.&lt;/p&gt;

&lt;p&gt;A month or so ago I started dabbling in OpenCode, to have something for when the net goes out, and I have to say that Qwen3.5 27b combined with it is pretty nice... but nowhere near the quality of Claude (&lt;em&gt;obviously&lt;/em&gt;). My goal hasn't changed since 2023: finding ways to bring the quality of local tools up to that of proprietary ones, even if it means sacrificing speed for quality. So, as with all things, after trying OpenCode for a while, my answer is: shove Wilmer into the flow.&lt;/p&gt;

&lt;p&gt;Now that tool calling works end to end, I can do just that. The OpenCode calls pass through Wilmer, hit my workflows, and the tool calls get forwarded through to one of N models in llama.cpp and back, without Wilmer needing to know anything about what the tools actually do. It slows everything down a lot, but the result is far less engagement from me, because it gets things right in far fewer tries; especially when combined with things like the manual CoT approach from my earlier Qwen post.&lt;/p&gt;

&lt;p&gt;I've had really great luck getting Qwen3.5 122b to give much better results than stock this way, but Qwen3.5 27b has been a bit harder to wrangle. Getting it to play nice with my decision trees has been fairly challenging so far.&lt;/p&gt;

&lt;p&gt;I'm going to tinker with these OpenCode workflows for a month or so and then start putting them out for folks. Updating the example workflows in the repo is next on the list.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>showdev</category>
    </item>
    <item>
      <title>A Quick Note on Gemma 4 Image Settings in Llama.cpp</title>
      <dc:creator>SomeOddCodeGuy</dc:creator>
      <pubDate>Fri, 03 Apr 2026 01:50:48 +0000</pubDate>
      <link>https://dev.to/someoddcodeguy/a-quick-note-on-gemma-4-image-settings-in-llamacpp-39ng</link>
      <guid>https://dev.to/someoddcodeguy/a-quick-note-on-gemma-4-image-settings-in-llamacpp-39ng</guid>
      <description>&lt;p&gt;In my last post, I mentioned &lt;a href="https://www.someoddcodeguy.dev/a-few-tips-for-ocr-with-qwen3-5-through-llama-cpp/" rel="noopener noreferrer"&gt;using --image-min-tokens to increase the quality of image responses from Qwen3.5&lt;/a&gt;. I went to load Gemma 4 the same way, and hit an error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;[58175] srv  process_chun: processing image...
[58175] encoding image slice...
[58175] image slice encoded in 7490 ms
[58175] decoding image batch 1/2, n_tokens_batch = 2048
&lt;/span&gt;&lt;span class="gp"&gt;[58175] /Users/socg/llama.cpp-b8639/src/llama-context.cpp:1597: GGML_ASSERT((cparams.causal_attn || cparams.n_ubatch &amp;gt;&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; n_tokens_all&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="s2"&gt;"non-causal attention requires n_ubatch &amp;gt;= n_tokens"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; failed
&lt;span class="go"&gt;[58175] WARNING: Using native backtrace. Set GGML_BACKTRACE_LLDB for more info.
[58175] WARNING: GGML_BACKTRACE_LLDB may cause native MacOS Terminal.app to crash.
[58175] See: https://github.com/ggml-org/llama.cpp/pull/17869
[58175] 0   libggml-base.0.9.11.dylib           0x0000000103a6136c ggml_print_backtrace + 276
[58175] 1   libggml-base.0.9.11.dylib           0x0000000103a61558 ggml_abort + 156
[58175] 2   libllama.0.0.0.dylib                0x0000000103eacd70 _ZN13llama_context6decodeERK11llama_batch + 5484
[58175] 3   libllama.0.0.0.dylib                0x0000000103eb098c llama_decode + 20
[58175] 4   libmtmd.0.0.0.dylib                 0x0000000103b8f7e8 mtmd_helper_decode_image_chunk + 948
[58175] 5   libmtmd.0.0.0.dylib                 0x0000000103b8fea4 mtmd_helper_eval_chunk_single + 536
[58175] 6   llama-server                        0x0000000102fb4d94 _ZNK13server_tokens13process_chunkEP13llama_contextP12mtmd_contextmiiRm + 256
[58175] 7   llama-server                        0x0000000102fe3318 _ZN19server_context_impl12update_slotsEv + 8396
[58175] 8   llama-server                        0x0000000102faaca0 _ZN12server_queue10start_loopEx + 504
[58175] 9   llama-server                        0x0000000102f3a610 main + 14376
[58175] 10  dyld                                0x00000001968edd54 start + 7184
srv    operator(): http client error: Failed to read connection
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 500
srv    operator(): instance name=gemma-4-31B-it-UD-Q8_K_XL exited with status 1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, the crash is caused by the fact that I'm not setting ubatch.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;58175&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;Users&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;socg&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;llama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cpp&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;b8639&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;llama&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cpp&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1597&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;GGML_ASSERT&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;cparams&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;causal_attn&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;cparams&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n_ubatch&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;n_tokens_all&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="s"&gt;"non-causal attention requires n_ubatch &amp;gt;= n_tokens"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;failed&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The reason is that Gemma 4's vision encoder uses non-causal attention for image tokens, which means all of the image tokens have to fit within a single ubatch. Since I specified a minimum of 2048 image tokens, that's a problem: ubatch defaults to 512.&lt;/p&gt;

&lt;p&gt;First, we need to make sure the model actually supports going that high. &lt;a href="https://unsloth.ai/docs/models/gemma-4" rel="noopener noreferrer"&gt;If we peek over at Unsloth's page, we'll see that's not the case&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Gemma 4 supports multiple visual token budgets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;70&lt;/li&gt;
&lt;li&gt;140&lt;/li&gt;
&lt;li&gt;280&lt;/li&gt;
&lt;li&gt;560&lt;/li&gt;
&lt;li&gt;1120&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use them like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;70 / 140: classification, captioning, fast video understanding&lt;/li&gt;
&lt;li&gt;280 / 560: general multimodal chat, charts, screens, UI reasoning&lt;/li&gt;
&lt;li&gt;1120: OCR, document parsing, handwriting, small text&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;So our max is actually 1120 here. For my case, I'm going to set both --image-min-tokens and --image-max-tokens to 1120, and then buffer up the batch and ubatch sizes to 2048.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./llama-server &lt;span class="nt"&gt;-ngl&lt;/span&gt; 200 &lt;span class="nt"&gt;--ctx-size&lt;/span&gt; 65535 &lt;span class="nt"&gt;--models-dir&lt;/span&gt; /Users/socg/models &lt;span class="nt"&gt;--models-max&lt;/span&gt; 1 &lt;span class="nt"&gt;--port&lt;/span&gt; 5001 &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--jinja&lt;/span&gt; &lt;span class="nt"&gt;--image-min-tokens&lt;/span&gt; 1120 &lt;span class="nt"&gt;--image-max-tokens&lt;/span&gt; 1120 &lt;span class="nt"&gt;--ubatch-size&lt;/span&gt; 2048 &lt;span class="nt"&gt;--batch-size&lt;/span&gt; 2048
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>ai</category>
      <category>google</category>
      <category>llm</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>A Few Tips for OCR With Qwen3.5 through Llama.cpp</title>
      <dc:creator>SomeOddCodeGuy</dc:creator>
      <pubDate>Tue, 31 Mar 2026 02:27:00 +0000</pubDate>
      <link>https://dev.to/someoddcodeguy/a-few-tips-for-ocr-with-qwen35-through-llamacpp-7de</link>
      <guid>https://dev.to/someoddcodeguy/a-few-tips-for-ocr-with-qwen35-through-llamacpp-7de</guid>
      <description>&lt;p&gt;Just a couple of quick tips. I am using the Unsloth Qwen3.5 27b gguf, and also tried the 122b gguf.&lt;/p&gt;

&lt;p&gt;First: The difference between the bf16 and fp32 mmproj is night and day. I was getting multiple hallucinations, errors, etc with the bf16. I swapped to the fp32 mmproj and it fixed up a lot of that almost instantly. Drastic improvement. The vision projector may have components that benefit from fp32's additional mantissa bits &lt;em&gt;(23 bits vs bf16's 7 bits)&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Second: Forcing the model to kick up the minimum number of visual tokens. For example, I was trying to run OCR on an old image of a Japanese newspaper article from 1957 that I found. It was something like 733x1024, and the model was really struggling to read the body of the text; tons of hallucinations, just making up entire sections of text. Forcing image-min-tokens up to 2048 made the model use 3x the visual processing, and the quality went up MASSIVELY. All of a sudden it could read the paper, with only a handful of small issues.&lt;/p&gt;

&lt;p&gt;This is what I added to the llama-server command: &lt;code&gt;--image-min-tokens 2048 --image-max-tokens 8192&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;I did have to toss a 1.1 repetition penalty in there, as it was having a hard time transcribing Japanese without failing, but otherwise it is doing a great job now.&lt;/p&gt;
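
&lt;p&gt;Putting those pieces together, the llama-server invocation ends up looking something like this (the model and mmproj filenames are placeholders for whichever Unsloth ggufs you grabbed; the fp32 mmproj, the image token flags, and the repetition penalty are the parts that mattered here):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Filenames below are placeholders; swap in your own model and fp32 mmproj
./llama-server -m Qwen3.5-27b-Q8_0.gguf --mmproj mmproj-Qwen3.5-27b-F32.gguf \
  --image-min-tokens 2048 --image-max-tokens 8192 \
  --repeat-penalty 1.1 --ctx-size 32768 --port 5001
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;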

</description>
    </item>
    <item>
      <title>Wrangling Qwen's Overthinking with Workflows</title>
      <dc:creator>SomeOddCodeGuy</dc:creator>
      <pubDate>Sat, 28 Mar 2026 17:45:00 +0000</pubDate>
      <link>https://dev.to/someoddcodeguy/wrangling-qwens-overthinking-with-workflows-3hhm</link>
      <guid>https://dev.to/someoddcodeguy/wrangling-qwens-overthinking-with-workflows-3hhm</guid>
      <description>&lt;p&gt;So I've been running Qwen3.5 122b a10b lately on the M2 Ultra (currently GLM 5 is sitting on the M3), and if you've used any of the Qwen3.5 family, you've probably seen or heard about the overthinking issue. The models are great if you either have a lot of time to kill while you wait for a response, or for more straight forward work if you kill the reasoning. The 35b a3b with reasoning disabled has been my workhorse for the past couple of weeks and &lt;strong&gt;it is the greatest thing since sliced bread&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Anyhow, now that I want to use the 122b for actual hobby work, I've realized how painful the overthinking really is. I had a conversation a few days ago where I asked it to translate something simple. Not anything complex, just a straightforward translation request. It spat out over 5,000 tokens of reasoning before giving me the actual answer. I tested, and actually got a faster response by sending my request to GLM 5 with reasoning enabled, despite it being a 744b a40b model. It just thought so much less, because the request wasn't THAT complex.&lt;/p&gt;

&lt;p&gt;I tried all of the Qwen recommended samplers, and even kicked up repetition penalty alongside their recommended presence penalty just to see what it would do. But nope; think think think. I also sleuthed around the net a bit and saw that several folks ultimately solved this with forceful thinking budgets in the newer llama.cpp, but I'm not a huge fan of that; if the reasoning isn't done, then it just gets cut off mid-thought and you really aren't getting the benefit of reasoning at all.&lt;/p&gt;

&lt;p&gt;So after banging my head on this for a bit, I went back to something I used to do when reasoning models were newer and their CoT actually hurt more than it helped: Wilmer workflows to the rescue.&lt;/p&gt;

&lt;p&gt;What I ended up doing was disabling Qwen3.5's native reasoning entirely. I'm passing &lt;code&gt;enable_thinking: false&lt;/code&gt; into &lt;code&gt;chat_template_kwargs&lt;/code&gt; through the llama.cpp server payload to disable thinking, then I built a workflow that handles the chain-of-thought process manually.&lt;/p&gt;
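
&lt;p&gt;For anyone who hasn't used that option before, it's just an extra field on the normal chat completions request body (the port and model name below are placeholders, and this assumes the server is running with --jinja so the template kwargs actually get applied):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# enable_thinking rides along in chat_template_kwargs; model/port are placeholders
curl http://localhost:5001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3.5-122b-a10b",
        "messages": [{"role": "user", "content": "Translate \"good morning\" into Japanese."}],
        "chat_template_kwargs": {"enable_thinking": false}
      }'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;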

&lt;p&gt;The workflow does the usual context gathering that my setups always do, and then right before the final response there's a dedicated "thinking" node. This node gets all the context and produces a chain-of-thought analysis that then feeds into the responder node.&lt;/p&gt;

&lt;p&gt;Rather than wing the CoT, since things have probably changed a bit since the last time I did that in 2024 (lol), I had Claude do a deep research pass on how DeepSeek and GLM 4.7 structure their reasoning internally, to see if I could get some ideas. In my experience, both of those do amazingly well at CoT.&lt;/p&gt;

&lt;p&gt;DeepSeek-R1 ended up having the most info available; it followed a four-phase pattern of problem definition, decomposition, reconstruction cycles, and final decision. The reconstruction cycles are where it either ruminates or genuinely tries new approaches. GLM 4.7 does something called interleaved thinking, where it reasons before each response and each tool call, not just at the start.&lt;/p&gt;

&lt;p&gt;The research I found showed something interesting. Incorrect solutions have more and longer reconstruction cycles than correct ones. There's a problem-specific sweet spot for reasoning length. As we already knew: more reasoning doesn't always mean better answers. In fact, R1 had a bad habit of ruminating, re-examining the same formulations repeatedly, which actually hurts its ability to find novel solutions.&lt;/p&gt;

&lt;p&gt;It was an overthinker, too; just not as bad as Qwen.&lt;/p&gt;

&lt;p&gt;Anyhow, long story long: I took all that and threw together a new CoT prompt in a new node just before the responder. The model has to assess complexity first and scale its effort accordingly; a simple greeting gets maybe two or three sentences of thought, while a multi-step coding problem gets a thorough breakdown. Then it has to work through the problem, verify its reasoning, and output a response plan. If it catches itself repeating the same line of reasoning, it's instructed to stop and either move on or try a genuinely different approach.&lt;/p&gt;

&lt;p&gt;Despite Qwen3.5 122b not being trained for this, the results have been solid. Instead of 5,000+ tokens of circular thinking on a simple translation, I'm seeing 900 to 1500 tokens now on that same request. The quality of the final responses seems about the same, maybe slightly better because the thinking is actually structured rather than meandering. And despite making two separate model calls instead of one, the total response time is lower because I'm not burning tokens on endless rumination.&lt;/p&gt;

&lt;p&gt;This isn't a new idea. I had to do this two years ago as well; it's just funny that I'm circling back to it now with one of the most powerful models out there.&lt;/p&gt;

&lt;p&gt;Anyhow, that's how I got Qwen3.5 to behave. Your mileage may vary. But if you've got a workflow system set up and you're willing to spend some time on prompt engineering, there's a lot you can do to tame a model that doesn't self-regulate well.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>A New Toy...</title>
      <dc:creator>SomeOddCodeGuy</dc:creator>
      <pubDate>Tue, 17 Mar 2026 23:41:00 +0000</pubDate>
      <link>https://dev.to/someoddcodeguy/a-new-toy-4f56</link>
      <guid>https://dev.to/someoddcodeguy/a-new-toy-4f56</guid>
      <description>&lt;p&gt;The M5 Max Macbook Pro just arrived. First thing I did was fling llama.cpp, Wilmer and Open WebUI on it.&lt;/p&gt;

&lt;p&gt;Honestly, the speeds are really impressive, even considering that llama.cpp hasn't fully integrated the hardware changes yet (at least, that's my understanding). Here's a comparison of Qwen3.5 35b a3b on the M5 Max MacBook Pro vs the M3 Ultra Mac Studio:&lt;/p&gt;

&lt;h3&gt;
  
  
  M5 Max MacBook Pro:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1450 t/s processing, 68 t/s generation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;prompt eval time =  3202.80 ms / 4654 tokens (0.69 ms per token, 1453.10 tokens per second)
       eval time =  7098.19 ms /  483 tokens (14.70 ms per token, 68.05 tokens per second)
      total time = 10300.99 ms / 5137 tokens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  M3 Ultra Mac Studio:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1647 t/s processing, 48 t/s generation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;prompt eval time =  3810.74 ms / 6280 tokens (0.61 ms per token, 1647.97 tokens per second)
       eval time = 14695.00 ms /  704 tokens (20.87 ms per token, 47.91 tokens per second)
      total time = 18505.75 ms / 6984 tokens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So yea- the Studio processes prompts faster (&lt;em&gt;at this size of model and this amount of tokens, though I think that it actually saturates better on the M5 Max at larger prompts&lt;/em&gt;), but generates tokens slower than the M5 Max.&lt;/p&gt;

&lt;p&gt;Super excited to play with this. I got rid of the M2 Max MacBook, so this is my main travel machine now.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Slimming Down the Homelab Software Footprint</title>
      <dc:creator>SomeOddCodeGuy</dc:creator>
      <pubDate>Mon, 16 Mar 2026 03:09:00 +0000</pubDate>
      <link>https://dev.to/someoddcodeguy/slimming-down-the-homelab-software-footprint-5c6g</link>
      <guid>https://dev.to/someoddcodeguy/slimming-down-the-homelab-software-footprint-5c6g</guid>
      <description>&lt;p&gt;So my homelab setup post from a while back is already outdated. Not as much on the hardware part; rather the software side has consolidated dramatically.&lt;/p&gt;

&lt;p&gt;The original setup had somewhere around 20 to 30 separate WilmerAI instances running across my network. Each one was configured for a specific purpose: coding assistance, general chat, RAG workflows, reasoning-heavy tasks, fast responses, and so on. Each instance pointed at one of my three main inference machines (the M2 Ultras and M3 Ultra). If I wanted a different usecase, I spun up a different Wilmer instance and pointed at the appropriate models on the appropriate machine.&lt;/p&gt;

&lt;p&gt;This worked, but it was wasteful. Wilmer is lightweight at around 150 megabytes per instance, but multiply that by 25 or 30 instances and you're burning some memory. More importantly, it was fragile. If I fired off two different workflow requests that both targeted the same Mac, they could hit the LLM simultaneously and either slow down the machine massively or crash it entirely. Apple Silicon doesn't handle parallel LLM inference well at all, so I had to tiptoe around my own setup, mentally tracking which workflows were in use before triggering another one.&lt;/p&gt;

&lt;p&gt;Two changes have collapsed this down to something far more manageable.&lt;/p&gt;

&lt;p&gt;The first is actually a Llama.cpp change; lcpp server recently added router mode (think llama-swap), which lets a single instance manage multiple models. You start the server without specifying a model, point it at a directory of GGUF files, and then specify the model in each API request. The server handles loading, unloading, and LRU eviction automatically. For my use case, I now run two llama.cpp instances per physical machine: one for a large model (the responders) and one for a small model (the workers). Both stay loaded and pinned with mlock so there is no cold start penalty. The model field in the request tells llama.cpp which one to use. That took me from an average of 5 llama.cpp instances per machine down to 2.&lt;/p&gt;
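
&lt;p&gt;Day to day, that just means the request itself picks the model. Something like this against the router-mode server, where the model name is a placeholder for whatever gguf lives in --models-dir:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# The "model" field tells the router-mode server which gguf to load and use
curl http://localhost:5001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "worker-small", "messages": [{"role": "user", "content": "hello"}]}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;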

&lt;p&gt;By doing two lcpp instances, I can work it out so that the memory balances. I'll make sure my largest responder model leaves enough memory headroom for my largest worker model; if that combination can load side by side, then I'm golden. With the Mac's memory caching, that makes it super quick to swap models around as needed.&lt;/p&gt;

&lt;p&gt;The second big change for me is on the Wilmer side; specifically, the multi-user support I just finished building into Wilmer.&lt;/p&gt;

&lt;p&gt;Instead of running a separate Wilmer process for each workflow, I now run a single Wilmer instance per physical machine with multiple users configured via the --User flag. Each "user" is really just a configuration profile: a set of endpoints, presets, memory settings, and workflow folders. The front-end selects which configuration to use by setting the model field to something like chris-openwebui-m3:coding or chris-openwebui-m3:general. Wilmer parses that prefix, loads the appropriate user config, and runs the shared workflow under that configuration.&lt;/p&gt;

&lt;p&gt;The shared workflows are also a new feature. They expose workflow folders through the /v1/models and /api/tags endpoints, so frontends like Open WebUI just see them as models in a dropdown. Selecting one tells Wilmer which workflow to run. &lt;/p&gt;

&lt;p&gt;In multi-user mode, the username prefix determines which user's endpoints and settings get used. So bob:openwebui-coding runs the same workflow as alice:openwebui-coding (assuming both are using shared workflows), but each hits their own configured LLM backends and presets.&lt;/p&gt;

&lt;p&gt;The result is that my M3 Ultra now has a single Wilmer instance pointed to it, serving about a dozen different shared workflows, plus Roland and a Wikipedia researcher. The M2 Ultras are set up similarly. This cleaned up a LOT of memory on the Mac mini.&lt;/p&gt;

&lt;p&gt;Concurrency limiting is the last big item. The --concurrency flag (defaulting to 1) queues incoming requests so only one hits the LLM at a time. I can now fire off multiple requests to different workflows on the same machine without worrying about crashing anything. Wilmer queues them and processes them sequentially, meaning I no longer have to keep track of what's hitting what.&lt;/p&gt;

&lt;p&gt;I still have separate instances for my mobile setup on the MacBook Pro. That one runs independently when I am on the road. &lt;/p&gt;

&lt;p&gt;This is all something I've meant to do forever; this and the new memory features (like the memory condenser I mentioned in an earlier post). It's a little headache that I've put up with for years, because scoping individual users was so challenging. But after the massive refactor I did in 2025, I could finally move almost all of the workflow/user related global variables into the new execution context and finally ensure there was no bleed or crossover on multi-user setups.&lt;/p&gt;

&lt;p&gt;Up until now, Wilmer was absolutely built for one person running it on their own machine. Now it's finally at a point where a single instance can properly handle multiple people at once.&lt;/p&gt;

&lt;p&gt;The multi-user and concurrency features are not released yet. Shared workflows shipped earlier this year; the rest is coming in the next update.&lt;/p&gt;

&lt;p&gt;I know deployments have slowed down a lot on Wilmer lately, but I haven't given up on it; it's just that Wilmer is finally in a spot where I can work on some of the other projects I've always wanted to, so I've kicked those off as well. Now my precious free time is split like 5 ways lol.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>The Right Monitor is Hard to Come By</title>
      <dc:creator>SomeOddCodeGuy</dc:creator>
      <pubDate>Fri, 13 Mar 2026 23:22:00 +0000</pubDate>
      <link>https://dev.to/someoddcodeguy/the-right-monitor-is-hard-to-come-by-57jj</link>
      <guid>https://dev.to/someoddcodeguy/the-right-monitor-is-hard-to-come-by-57jj</guid>
      <description>&lt;p&gt;It is shocking how difficult it is to find a 34" curved Ultrawide that is either 2560x1080 or 5120x2160. Back in 2020 or 2021, Spectre made one; it's been discontinued now though.&lt;/p&gt;

&lt;p&gt;The big issue for me is twofold, because I have a triple monitor setup: the monitors to the left and right of my main monitor are both 1920x1080 27" monitors. A 34" ultrawide is physically identical in height to those monitors, and 2560x1080 matches their vertical resolution exactly. So with a 34" 1080p ultrawide, it's just a really nice setup.&lt;/p&gt;

&lt;p&gt;My main issue with the current stock you can find on Amazon is that macOS &lt;em&gt;REALLY&lt;/em&gt; struggles with landing on that resolution if the monitor isn't either set to it natively or is &lt;strong&gt;5K2K&lt;/strong&gt;. If you get a 3440x1440 monitor... well, I haven't been able to find one that lets me select 2560x1080 as a resolution in standard macOS.&lt;/p&gt;

&lt;p&gt;I did try &lt;code&gt;BetterDisplay&lt;/code&gt;, but I had some issues with it that I couldn't work through, so I'm back on the prowl for a monitor that fits my needs.&lt;/p&gt;

&lt;p&gt;Resolution selection is definitely one of the areas where Windows has macOS beat. That and Microsoft Paint. Omg, I can't tell you how spoiled having that application made me. I grabbed Gimp for the Mac, but it's overpowered for what I want to do with it; I really just need to manipulate screenshots or something now and then.&lt;/p&gt;

&lt;p&gt;Oh, and network file sharing. I made the mistake of trying to use a Mac as a local NAS. Never again.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>My Foray Back Into Linux...</title>
      <dc:creator>SomeOddCodeGuy</dc:creator>
      <pubDate>Sun, 08 Mar 2026 18:21:00 +0000</pubDate>
      <link>https://dev.to/someoddcodeguy/my-foray-back-into-linux-4o64</link>
      <guid>https://dev.to/someoddcodeguy/my-foray-back-into-linux-4o64</guid>
      <description>&lt;p&gt;So I decided to make use of one of the mini-pcs I had gotten for the homelab to build a little web browsing box. My first iteration of the web browsing box was a Windows 11 machine, which is the same machine that got me banned from reddit for VPN use (oops), but I've finally decided it was time to graduate from Windows and move to the more private OSes.&lt;/p&gt;

&lt;p&gt;The goal was straightforward enough. I wanted something separate from my main machine that I could use for general web browsing. Something isolated, so if I picked up some nasty malware or clicked a bad link, my actual workstation would be fine. Something that wasn't Windows. And I wanted to remote into it from my Mac Studio so I didn't need yet another monitor on my desk.&lt;/p&gt;

&lt;p&gt;The last time I seriously touched Linux was probably 15 years ago. Back then, getting a Linux box to just work was an adventure that usually ended poorly. There's a reason there were so many memes about the ridiculous complexity of doing simple things in Linux. And it especially didn't help that I wanted to dual boot with Windows... I swear, it seems like Windows kills the Linux bootloader by design sometimes.&lt;/p&gt;

&lt;p&gt;So walking into this, I was mentally preparing for that same experience. I figured I'd brick the machine at least three times before I got anything usable.&lt;/p&gt;

&lt;p&gt;I ended up using a Kamrui mini PC. AMD Ryzen 7 5700U, 32GB of RAM, 1TB of storage. Small enough to tuck away somewhere, powerful enough to handle a browser without breaking a sweat. And I went with Linux Mint with Cinnamon because multiple folks told me it was the easiest transition from Windows.&lt;/p&gt;

&lt;p&gt;Altogether, the process was WAY easier in the age of LLMs. What used to be an arduous process of digging through tutorials and forum posts was actually a pretty painless task of just having GLM 5 and Claude talk me through various issues as they came up.&lt;/p&gt;

&lt;p&gt;The installation was painless. LUKS disk encryption is now just a checkbox in the installer. No hunting down proprietary drivers, either. I had to use Ethernet because the WiFi card in this thing has no mainline Linux driver support, but that's fine.&lt;/p&gt;

&lt;p&gt;Where things got interesting was the hardening. Because I'm me, I couldn't just install the OS and call it a day. I wanted this thing locked down. UFW firewall, OpenSnitch for outbound traffic monitoring, NordVPN with a kill switch, Firefox hardened, AppArmor running, unnecessary services stripped out, etc.&lt;/p&gt;

&lt;p&gt;In the past, I would have absolutely bricked this machine multiple times. The robits helped with all of that. When xrdp kept failing with a sesman connection error, when NordVPN's kill switch locked me out of the machine entirely, when xrdp kept killing the WebGL process in Firefox and causing it to crash over and over... the bots had an answer for everything.&lt;/p&gt;

&lt;p&gt;In the end, I still did a full refresh, just because I had gone to town on some of the config files in this thing trying to get it the way I wanted, and I couldn't tell if I'd made a mess or not. But another nice thing with the bots was that as I did stuff, I was telling them, so in the end I got them to spit out all the highlights and write up a doc that I could use to replicate the whole process.&lt;/p&gt;

&lt;p&gt;A few things I learned along the way:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;NordLynx doesn't work with OpenSnitch; at least as of the time of this writing. Both manipulate iptables at the kernel level, and they fight each other. I had to switch to OpenVPN, which runs in userspace and plays nice with the firewall, though it's slower.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The xrdp 0.9.24 version in Mint repos has an IPv6 binding issue that causes intermittent connection failures. The fix is checking the sesman binding after every reboot and restarting services if it's wrong.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Firefox's built-in fingerprinting protection sounds great to have, but when I enabled it, Firefox would hang on JavaScript-heavy sites. I eventually dropped it, especially with uBlock Origin blocking tracking scripts anyway.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Right Ctrl gets stuck when you switch virtual desktops on macOS while the Linux Mint RDP window is in focus. Linux sees the key press but not the release. I had to disable Right Ctrl entirely within Linux via xmodmap to fix it (there's a sketch of the commands after this list). Took me way too long to figure out what was happening there. But if you think about it... when do you ever use Right Ctrl? I didn't until I started using Mac more, and that's just for virtual desktop swapping.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
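
&lt;p&gt;The Right Ctrl fix from that last bullet is a one-liner once you know the keycode. On most X11 setups Control_R is keycode 105, but verify it with xev before touching anything:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Press Right Ctrl inside the xev window and note the keycode it reports
xev

# Drop Right Ctrl from the control modifier and clear its mapping (105 is the usual keycode)
xmodmap -e "remove control = Control_R"
xmodmap -e "keycode 105 ="
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;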

&lt;p&gt;The final result is a machine that boots up, connects to VPN automatically, and sits there waiting for me to RDP in from my Mac. All traffic goes through NordVPN. DNS queries go through NordVPN's DNS servers. WebRTC is disabled in Firefox. Third-party outbound connections are blocked unless explicitly allowed. The firewall only accepts inbound SSH and RDP connections from my local subnet.&lt;/p&gt;
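
&lt;p&gt;For the curious, the inbound side of that policy only takes a few UFW rules. The subnet below is a placeholder for your own LAN, and xrdp listens on 3389 by default:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Deny inbound by default, then allow SSH and RDP only from the local subnet
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow from 192.168.1.0/24 to any port 22 proto tcp    # SSH
sudo ufw allow from 192.168.1.0/24 to any port 3389 proto tcp  # RDP (xrdp)
sudo ufw enable
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;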

&lt;p&gt;At this point, I've relegated Windows to gaming only; which I really don't do a lot of these days, but it's nice to have around anyhow. I had been putting off the Windows 11 upgrade (there's an extension for Win 10 security updates until Oct 2026, and I had signed up for that). Now that I've got everything personal off my Windows box, I'll get it updated to Win 11.&lt;/p&gt;

&lt;p&gt;Most of the house is now Mac and Linux. Huzzah. I used to love Windows, but they've just been too weird lately about OneDrive. I still really like Outlook and O365; I use both a lot. But my personal machine doesn't need to be so closely tied to the cloud, and if the core Windows experience is going to be a cloud-centric OS, then it's really just not for me anymore.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Wilmer and Token Management</title>
      <dc:creator>SomeOddCodeGuy</dc:creator>
      <pubDate>Sat, 07 Mar 2026 02:04:00 +0000</pubDate>
      <link>https://dev.to/someoddcodeguy/wilmer-and-token-management-4ha3</link>
      <guid>https://dev.to/someoddcodeguy/wilmer-and-token-management-4ha3</guid>
      <description>&lt;p&gt;One of the big keys to running LLMs on a Mac is token management. That's what a lot of Wilmer is built around.&lt;/p&gt;

&lt;p&gt;Wilmer started out because I wanted to make the most of Llama 2 finetunes, but eventually its workflows became a way for me to keep overall token counts down. Macs handle large prompts slowly, and the smaller the prompts, the easier that is to deal with.&lt;/p&gt;

&lt;p&gt;For example, consider a really long conversation with an LLM. I was working with GLM 5 on my M3 Ultra to help me set up a new Linux box in the house. I know Mac and Windows well enough, but my last true foray into Linux was 15 years ago or more, so I needed help.&lt;/p&gt;

&lt;p&gt;Eventually I hit a point where the overall conversation was about 300 messages or more. If I had been sending the whole conversation, it would have been at least 100,000 tokens. Any standard sliding cache could keep it quick, but at the cost of losing the start of the conversation. When you're on a Mac, a 20k token prompt is already in frustrating territory, so you don't want to send much more than that. This means you'd lose 4/5 of the conversation.&lt;/p&gt;

&lt;p&gt;You could rely solely on vector memory, but now you're playing with fire on the sliding cache, hoping you don't accidentally cause it to reset because too much context changed on it.&lt;/p&gt;

&lt;p&gt;So with Wilmer, I've been focused on a handful of context management techniques. Some have been in it since early 2024, and some I'm adding in now.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;File memories&lt;/strong&gt; are JSON files that tie summaries to chunks of messages. The summary prompts can be anything, so it depends on the conversation type. For the Linux conversation, I set it to capture what changes we made successfully: packages installed, configs edited, services started or stopped. The system generates these automatically every 6000 tokens or so, which keeps each chunk focused and digestible.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Chat Summary&lt;/strong&gt; is similar, but rolls everything into one running overview. I use this to capture the 100-mile-high view of where we're at - what the overall goal is, what phase of the project we're in, what big decisions we've made.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Vector memories&lt;/strong&gt; are where the LLM generates individual facts as the conversation progresses and stores them for semantic search. This is more nuanced detail about what's going on: specific commands that worked, error messages we encountered and how we fixed them, configuration values we settled on.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Conversation condensing&lt;/strong&gt; is the newer piece. I configured it to keep my most recent 7000 tokens as raw, untouched messages. Then it takes the next 7000 tokens after that and summarizes them with awareness of the current topic. So if we're troubleshooting a networking issue, it'll lean into preserving networking details. Everything beyond that gets rolled into a neutral summary that captures the broad strokes without topic bias. This lets me keep the immediate context sharp while still holding onto the shape of a long conversation.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On top of this, I give the LLMs persistent files they can read from and write to. Things like my speech preferences, behaviors to avoid, recent events in my life, and a persona file that defines how the AI presents itself. One problem for LLMs is losing that internal train of thought and having to re-reason what its stance or goal was each time. Not so with this. The AI can jot down notes between messages and pick up where it left off.&lt;/p&gt;

&lt;p&gt;Separating out the image processor also lets me use a different vision model from the main thinkers, but more importantly it lets me cache previous vision responses. Once I send an image, the LLM doesn't have to reprocess it but can still answer questions about it. That's super helpful, and something that I don't see a lot of front-ends doing; with most of them, after just a few messages the model no longer has the context of that image.&lt;/p&gt;

&lt;p&gt;All of this gives me the ability to have massive conversations, hundreds of messages long, while maintaining consistency in knowledge; all while barely sending 15-20k tokens to the LLM in any given message. Overall I process more tokens than if I just left it all to sliding cache, but in return I get an assistant that can continue answering questions during message 300 about something way back in the first 20 messages.&lt;/p&gt;

&lt;p&gt;The real advantage is that I can use smaller models for most of the heavy lifting. During my Linux setup, what I really wanted was the final response from GLM 5. That's the model walking me through everything. But parsing through memories, updating summaries, deciding whether to pull from Wikipedia, condensing old conversation chunks? That gets pawned off to weaker models, sometimes down to the 4-billion-parameter range. They finish in no time at all. Then when GLM 5 kicks off, it's been handed everything it could hope for in terms of context, and it only has to work with 20k tokens or less.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>If You Have the Hardware- Use it to Learn!</title>
      <dc:creator>SomeOddCodeGuy</dc:creator>
      <pubDate>Tue, 03 Mar 2026 03:51:40 +0000</pubDate>
      <link>https://dev.to/someoddcodeguy/if-you-have-the-hardware-use-it-to-learn-30j1</link>
      <guid>https://dev.to/someoddcodeguy/if-you-have-the-hardware-use-it-to-learn-30j1</guid>
      <description>&lt;p&gt;If you've never messed with open source LLMs and you jumped on the ClawdBot/OpenClaw hype train: take some time to learn more about how local models work. You likely went through the trouble of getting a Mac Mini, so you now have a nice little test box to play with. Just do it. Turn off Clawdbot/OpenClaw, and make OTHER things with it. Just for a few hours, even.&lt;/p&gt;

&lt;p&gt;For the vast majority of folks using AI to vibe code, make agents, etc. - right now they are the equivalent of people building websites using the heaviest no-code/low-code solutions, or just slapping ALL the biggest libraries in, without a care in the world for performance. You're probably wasting a ton of efficiency in your current setups because you don't understand how a lot of it works under the hood. You don't understand samplers well, or what tokenization is doing. You may not have a good feel for what small and weak models can really do, or what you absolutely have to have large models for &lt;em&gt;(when I say small models, I'm talking models that make Claude Sonnet 3.7 look like a genius)&lt;/em&gt;. &lt;/p&gt;

&lt;p&gt;Whatever efficiencies you're aiming for are probably a drop in the bucket compared to what you could be doing if you really had a feel for all that. And the only thing holding you back from that knowledge is just taking the time to learn it.&lt;/p&gt;

&lt;p&gt;The easiest way to learn this stuff is doing. You have the hardware now, so why not? Forget the little hype-bot that LinkedIn convinced you to install. Set it aside and use that Mac Mini to learn how LLMs work at a deeper level by trying to wrangle local models to do complex work. &lt;/p&gt;

&lt;p&gt;THAT will be worth its weight in gold.&lt;/p&gt;

&lt;p&gt;Also, don't cheat yourself. Yes, the local ecosystem is easier now. 10 minutes + an LM Studio install and tada: all done! But what did you really learn? No no; I'm saying to do it the long way around. Grab Open WebUI. Grab llama.cpp. Get em hooked up together. Use a little model like one of the new Qwen3.5 8b models. Get the responses to be actually good; try to find ways to make the model stop repeating itself. Things like that. &lt;/p&gt;

&lt;p&gt;Next: write a small agent. Do it with that crappy little 8b or less model, and try to get something of value out of it. &lt;/p&gt;

&lt;p&gt;This is all possible to do, but I promise it'll be harder than accomplishing the same thing with some 2026 proprietary API model. And that's the point.&lt;/p&gt;

&lt;p&gt;Once you've done all that, you'll later go back and revisit what you think right now is great work with LLMs, and suddenly have the same realization every developer does when they go back to their old code: &lt;em&gt;"Wow, I can do a lot better than this now."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Much like developers first learning to code, and thinking that just writing 500x "if statements" is good enough- you're only just now scratching the surface of how you should properly use LLMs. Now you need to start learning the more complex stuff. Don't settle for the novice approaches you're doing so far. There's SO MUCH MORE out there.&lt;/p&gt;

&lt;p&gt;And who knows- you may just find that local models are fun enough to be worth obsessing over a bit ;)&lt;/p&gt;

</description>
      <category>ai</category>
      <category>learning</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
    <item>
      <title>An Analogy to Help Understand Mixture of Experts</title>
      <dc:creator>SomeOddCodeGuy</dc:creator>
      <pubDate>Thu, 26 Feb 2026 03:55:22 +0000</pubDate>
      <link>https://dev.to/someoddcodeguy/an-analogy-to-help-understand-mixture-of-experts-3a4o</link>
      <guid>https://dev.to/someoddcodeguy/an-analogy-to-help-understand-mixture-of-experts-3a4o</guid>
      <description>&lt;p&gt;If you're having a hard time understanding MoE strength vs dense models, and roughly where they might land when comparing them, think about this super oversimplified analogy. I'm hoping it makes sense:&lt;/p&gt;

&lt;h2&gt;
  
  
  The Scenario
&lt;/h2&gt;

&lt;p&gt;Imagine a paid trivia competition, but all the questions are about carpentry regulations: you're given a piece of paper, you fill out the paper and then hand it in. &lt;/p&gt;

&lt;p&gt;There are two "teams" competing with each other, except one team just has a single dude on it. Both teams need a place to sit in the building while the competition is going on.&lt;/p&gt;

&lt;h4&gt;
  
  
  Team 1 (10b Dense Model)
&lt;/h4&gt;

&lt;p&gt;Team 1 is just some fairly experienced carpenter with 10 years of experience. He gets the paper, works through every question himself, and turns it in.&lt;/p&gt;

&lt;p&gt;He really likes his personal space, so he reserved 10 seats all to himself. &lt;em&gt;(Bear with me...)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Total experience on the team: &lt;strong&gt;10 years&lt;/strong&gt;&lt;br&gt;
Experience applied to each question: &lt;strong&gt;10 years&lt;/strong&gt;&lt;br&gt;
Total Seats Needed: &lt;strong&gt;10 seats&lt;/strong&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Team 2 (40b a10b MoE Model)
&lt;/h4&gt;

&lt;p&gt;Team 2 is a large crew of 40 first-year apprentices. None of them know the full trade; each one has only learned a few specific things about carpentry during their year.&lt;/p&gt;

&lt;p&gt;Each question has multiple parts to it, and for each part, 10 of the apprentices are picked based on whoever among them has the most relevant knowledge to that specific part. Once a part is answered, those ten return to the group, and the process repeats for the next part. By the time a single question is fully answered, dozens of different apprentices may have contributed.&lt;/p&gt;

&lt;p&gt;When answering, each set of ten apprentices that get called up aren't huddling up and collaborating; they each independently write their own answer to the question part on a small piece of paper, and then all of those answers get blended together to create one combined response. The final answer written on the trivia paper for that part of the question will be a mix of what they all came up with.&lt;/p&gt;

&lt;p&gt;Once all of the questions have been answered in this fashion, they turn it in.&lt;/p&gt;

&lt;p&gt;Total aggregate experience on the team: &lt;strong&gt;40 years&lt;/strong&gt;&lt;br&gt;
Experience applied to each question: &lt;strong&gt;10 years&lt;/strong&gt; &lt;em&gt;(10 apprentices x 1 year each)&lt;/em&gt;&lt;br&gt;
Total Seats Needed: &lt;strong&gt;40 seats&lt;/strong&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Comparing the Teams
&lt;/h4&gt;

&lt;p&gt;Now, technically you could say that each team is applying the same number of years of experience to each question, even though the way the teams are structured is totally different. For each question, they are bringing an aggregate total of 10 years of experience. &lt;/p&gt;

&lt;p&gt;But beyond that: Team 2's &lt;strong&gt;combined aggregate&lt;/strong&gt; knowledge and experience of 40 years is much larger.&lt;/p&gt;

&lt;p&gt;Team 2's setup is so powerful because even though their team is full of apprentices who each only know a slice of the trade, they are hand-picking the best ten people for each question part. Depending on what all the different apprentices studied, you could end up with Team 2's total knowledge including information Team 1's carpenter doesn't know; and they may reason through things that the carpenter struggled with alone.&lt;/p&gt;

&lt;p&gt;The downside to team 2's setup is that they need 40 seats, while Team 1 only needs 10 seats. Team 2 takes up a LOT more space than Team 1.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Socg's note: The seats are memory. In case you missed that lol. I couldn't figure out a better way to shoehorn that into the analogy.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Team 3 (40b Dense Model)
&lt;/h4&gt;

&lt;p&gt;Now, imagine a third team with a single master carpenter who has 40 years of experience; the same number of years as all of Team 2 combined. He absolutely loves his space, so he also got 40 seats. But it's one really, really experienced and smart carpenter doing all the work.&lt;/p&gt;

&lt;p&gt;Even though Team 2 has a combined total of 40 years of experience, and the master carpenter has 40 years, and even though both teams required 40 seats: the quality difference is going to be significant. The master carpenter will likely have 'seen it all' and experienced it, too, while the apprentices are only ever applying 10 aggregate years of experience at a time. &lt;/p&gt;

&lt;p&gt;This means that not only is that master carpenter likely going to make better use of their overall knowledge, but they will understand the questions much better and be able to really comprehend what is being asked at a level the apprentices likely won't.&lt;/p&gt;

&lt;p&gt;Total experience on the team: &lt;strong&gt;40 years&lt;/strong&gt;&lt;br&gt;
Experience applied to each question: &lt;strong&gt;40 years&lt;/strong&gt;&lt;br&gt;
Total Seats Needed: &lt;strong&gt;40 seats&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;When comparing models, it's pretty safe to say:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All things being equal, an MoE will likely outperform a dense model that is the same size as the MoE's active parameters. So a 30b a3b MoE (30b model, but only 3b active) will beat out a 3b dense model. &lt;/li&gt;
&lt;li&gt;All things being equal, an MoE will likely have worse overall comprehension than a dense model the same size as its total parameters. Even if their knowledge might be similar, the dense model will simply "&lt;em&gt;get&lt;/em&gt;" things better than the MoE. For example, a 120b a5b MoE will likely misunderstand statements far more often than a 120b dense model, which will "&lt;em&gt;read between the lines&lt;/em&gt;" on what you want far better and understand inferred speech better.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Anyhow, that's majorly over-simplified, but hopefully it helps paint a better picture.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>beginners</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
