<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: SomeOddCodeGuy</title>
    <description>The latest articles on DEV Community by SomeOddCodeGuy (@someoddcodeguy).</description>
    <link>https://dev.to/someoddcodeguy</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3490530%2F9cdfc762-b1a2-45b2-b90c-252cf15f6fea.png</url>
      <title>DEV Community: SomeOddCodeGuy</title>
      <link>https://dev.to/someoddcodeguy</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/someoddcodeguy"/>
    <language>en</language>
    <item>
      <title>A Quick-ish Rundown of LLM Basics</title>
      <dc:creator>SomeOddCodeGuy</dc:creator>
      <pubDate>Sat, 25 Apr 2026 21:36:11 +0000</pubDate>
      <link>https://dev.to/someoddcodeguy/a-quick-ish-rundown-of-llm-basics-4n14</link>
      <guid>https://dev.to/someoddcodeguy/a-quick-ish-rundown-of-llm-basics-4n14</guid>
      <description>&lt;p&gt;Over the past few days, I've realized that there are a lot of folks out there using LLMs that haven't had an opportunity to dig, even a little, into the basics of how LLMs really work. And I guess that makes sense; for the most part, the average person doesn't have a lot of reason to know this. But if you're going to be a power user, there are things that would really help you to understand.&lt;/p&gt;

&lt;p&gt;Below are the most basic basics. I'm not covering everything, just the stuff that, once you get it, will help the rest start to make sense for you as well. Hopefully it helps someone out there.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tokens
&lt;/h3&gt;

&lt;p&gt;When you write something to an LLM, it doesn't break your text down character by character; it breaks it down into groups of characters called "Tokens". Every LLM has its own tokenizer, so not all choose the same tokens. &lt;/p&gt;

&lt;p&gt;Here's a real world example of what tokenization might look like using Qwen3.6 27b's tokenizer: &lt;a href="https://huggingface.co/Qwen/Qwen3.6-27B/blob/main/tokenizer.json" rel="noopener noreferrer"&gt;https://huggingface.co/Qwen/Qwen3.6-27B/blob/main/tokenizer.json&lt;/a&gt;. If you open that file, you'll see the full list of tokens that Qwen3.6 27b utilizes.&lt;/p&gt;

&lt;p&gt;As for how tokens work... here's an example:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"This is a token"&lt;br&gt;
    - That's 15 characters&lt;/p&gt;

&lt;p&gt;'This' 'Ġis' 'Ġa' 'Ġtoken'&lt;br&gt;
    - That's 4 tokens. You'll notice 'Ġ' at the start of the last three; that's what &lt;br&gt;
GPT-2/GPT-3/GPT-4 use to mark a leading space in tokenization&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;These map to numbers, which the LLM then uses to do matrix math to determine the right output. If we go back to the link I gave you above, you can see the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;This   == 1919
Ġis    == 369
Ġa     == 264
Ġtoken == 3817
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So Qwen3.6 27b would see your sentence as (1919, 369, 264, 3817). It then does matrix math and other cool pattern-y stuff to determine the best tokens to respond to you with.&lt;/p&gt;

&lt;p&gt;So remember this when you hear that an LLM has a context window of 1,000,000 tokens: it's talking about those things. Sometimes whole words are tokens, sometimes not. Don't just assume every word is a token; they try to create tokens off the most commonly used words. &lt;em&gt;This&lt;/em&gt;, &lt;em&gt;is&lt;/em&gt;, &lt;em&gt;a&lt;/em&gt; are all very common in the English language. &lt;em&gt;Token&lt;/em&gt; is very common when talking about LLMs.&lt;/p&gt;
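
&lt;p&gt;If you want to poke at this yourself, here's a minimal sketch using the Hugging Face &lt;code&gt;transformers&lt;/code&gt; library. The repo name is just the one linked above, and the IDs in the comments are illustrative; every tokenizer will give you different ones:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Minimal tokenization sketch using the Hugging Face "transformers" library.
# The repo name is the one linked above; the exact IDs depend on the model.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.6-27B")

text = "This is a token"
ids = tokenizer.encode(text)                   # e.g. [1919, 369, 264, 3817]
pieces = tokenizer.convert_ids_to_tokens(ids)  # e.g. ['This', 'Ġis', 'Ġa', 'Ġtoken']

print(f"{len(text)} characters became {len(ids)} tokens")
for piece, token_id in zip(pieces, ids):
    print(piece, token_id)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;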

&lt;h3&gt;
  
  
  Context Windows
&lt;/h3&gt;

&lt;p&gt;The way I usually describe context windows is to imagine the full Song of Ice and Fire book series printed out on one really long parchment, and you have a piece of cardboard with a window cut in it that you can read text through. All you know is whatever's currently in that window. If someone asks you about something outside the window? Tough luck, you don't know it.&lt;/p&gt;

&lt;p&gt;Now, the obvious thought is "well just make the window bigger". The problem is that if you cut the window too big, you have a harder time finding any specific thing in there, and you start mixing details up. You've learned how to read a certain amount within that window, and pushing past that doesn't go great. If the full book was the length of a parking lot, and someone asked you for details that could exist anywhere in that whole parking lot worth of text... well, good luck.&lt;/p&gt;

&lt;p&gt;That's pretty much how it works with LLMs. You'll see models advertise huge context windows like 1,000,000 tokens, but the real-world practical use of that is a lot smaller than the marketing implies. The bigger you stuff that window, the worse the model gets at pinpointing specific information inside it. There's a whole pile of benchmarks (needle in a haystack tests, NoLiMa, RULER, etc.) showing accuracy dropping as the context fills up. So a 200k token context window is not an invitation to dump 200k tokens in there and expect great results. You'll generally get a much better answer giving the model 8k of really relevant tokens than 200k of "everything I have on the topic".&lt;/p&gt;

&lt;p&gt;To get a better visualization, check this benchmark out: &lt;a href="https://fiction.live/stories/Fiction-liveBench-Feb-21-2025/oQdzQvKHw8JyXbN87" rel="noopener noreferrer"&gt;https://fiction.live/stories/Fiction-liveBench-Feb-21-2025/oQdzQvKHw8JyXbN87&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Scroll down to the results section and you'll see a table- the numbers in there represent how well the model pulls the right info out based on the context size it was fed. You can see that some models, like GPT-5.2 or Opus 4.6, did great all the way up to 120k (except 5.2 pro for some reason...). But look at something like minimax 2.5, for example: by the time you hit 60k tokens, you have less than a 50% chance to get all the right info you asked for.&lt;/p&gt;

&lt;p&gt;This is a struggle a lot of us running local models deal with, and it usually means you want to account for that with a lot of great wrapper software or middleware.&lt;/p&gt;

&lt;h3&gt;
  
  
  Model Sizes (ie- parameters)
&lt;/h3&gt;

&lt;p&gt;When we talk about models, we size them based on the number of parameters they have. 1M is a 1 Million parameter model. That's itty bitty. 1b is 1 billion parameters- also itty bitty. Many modern models release in really huge sizes like 397b to 1T (1 Trillion parameters).&lt;/p&gt;

&lt;p&gt;The easiest way to imagine parameters is as data points that can correspond to several pieces of data at once. So 1 parameter doesn't necessarily equate to a single fact like "When did the first Ford car release?"; it could correspond to several other pieces of info at the same time.&lt;/p&gt;

&lt;p&gt;Models are generally created in BF16 format to start with. Size wise- BF16 equates to about 2GB per 1b; so a 20b model would be 40GB. If you "quantize" the model (the easiest way to think of it is 'compressing' the model) to 8bpw, or ~q8_0, that becomes 1GB per 1b. If you go further to 4bpw, or ~q4_0, you get down to 0.5GB per 1b. That's how we fit big models on smaller hardware.&lt;/p&gt;
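
&lt;p&gt;To put rough numbers on that math (this is back-of-the-napkin sizing only; real model files add some overhead on top):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Back-of-the-napkin model sizing using the rough ratios above:
# ~2GB per 1b params at BF16, ~1GB at ~q8_0, ~0.5GB at ~q4_0.
# Real files carry extra overhead, so treat these as ballpark numbers.
GB_PER_BILLION = {"bf16": 2.0, "q8_0": 1.0, "q4_0": 0.5}

def rough_size_gb(params_in_billions, quant):
    return params_in_billions * GB_PER_BILLION[quant]

for quant in ("bf16", "q8_0", "q4_0"):
    print(f"20b at {quant}: ~{rough_size_gb(20, quant):.0f}GB")
# 20b at bf16: ~40GB, at q8_0: ~20GB, at q4_0: ~10GB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;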

&lt;p&gt;As you can imagine, the more you quantize, the more mistakes the model will likely make.&lt;/p&gt;

&lt;h3&gt;
  
  
  Open Weight Models
&lt;/h3&gt;

&lt;p&gt;These are models that you can download and run yourself. There are a few ways to do it, and here are some examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Raw transformers&lt;/strong&gt; - this is the original format of the models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GGUF&lt;/strong&gt; - This is a model that has been converted to run in llama.cpp&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MLX&lt;/strong&gt; - This is converted to run in Apple's MLX&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Many applications, like Ollama or LM Studio, wrap some of these and then have their own repositories to pull models from. For best speed and the fastest updates for model support, you generally want to avoid that. You can find all models here: &lt;a href="https://huggingface.co" rel="noopener noreferrer"&gt;https://huggingface.co&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mixture of Experts (ie- MoE)
&lt;/h3&gt;

&lt;p&gt;This section is only really relevant to Open Weight models, so you can skip this if you never plan to host your own.&lt;/p&gt;

&lt;p&gt;Parameter count doesn't just affect knowledge, it also affects speed. The bigger the model, the more matrix math the computer has to do per token. So a 70b model running at the same quantization on the same hardware as a 7b is going to be a whole lot slower; you're doing roughly 10x the math per token. That's also why video cards handle LLMs better than CPUs: it's a lot of floating point math, and GPUs eat that up. Which means when you're trying to figure out if you can fit a model on your machine, the real question is how much you can fit into VRAM.&lt;/p&gt;

&lt;p&gt;Up until a year or two ago, pretty much every model you used was what we call a "dense" model. Dense means every single parameter in the model gets activated for every token it produces. A 70b dense model is doing 70b worth of math, every single token.&lt;/p&gt;

&lt;p&gt;Then Mixture of Experts (MoE) models started taking off. You'll see them named like Qwen3.5-397b-a17b, or Qwen3.6-35b-a3b. The "a" in the first one stands for "active parameters". The way MoE works is the model is split up into a bunch of smaller "experts", and for each token, a "router" picks just a few of those experts to use. So Qwen3.5-397b-a17b has 397 billion total parameters, but only 17 billion get used for any given token.&lt;/p&gt;
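&lt;p&gt;Here's a toy sketch of that routing idea. This isn't any particular model's implementation, and real models do this inside every MoE layer rather than once overall, but it shows the shape of it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

# Toy sketch of MoE routing: a "router" scores every expert for the current
# token, and only the top-k highest scoring experts actually run.
# All numbers here are made up for illustration.
num_experts, top_k, hidden_size = 64, 4, 1024

hidden_state = np.random.randn(hidden_size)            # the current token
router_weights = np.random.randn(num_experts, hidden_size)

scores = router_weights @ hidden_state                 # one score per expert
chosen = np.argsort(scores)[-top_k:]                   # keep only the top-k

print(f"{top_k} of {num_experts} experts run for this token: {sorted(chosen.tolist())}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
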

&lt;p&gt;What this means in practice: an MoE model runs at roughly the speed of its active parameter count, not its total. So Qwen3.5-397b-a17b runs only a little slower than the speed of a 17b dense model, even though it has 397b worth of parameters. &lt;/p&gt;

&lt;p&gt;That's a big deal for performance, especially on local hardware. It really made those of us who invested in Macs early very happy. I almost, ALMOST, started to regret my first Mac Studio back in 2023... then not long after Mixtral 8x7B came out and that changed everything. It's only gotten better since.&lt;/p&gt;

&lt;p&gt;The catch with MoEs is really on the knowledge side. An MoE with 397b total isn't as smart as a dense 397b model would be; the smarts land somewhere in between the active count and the total count. Where exactly is debated and varies by model, but the rule of thumb is to expect noticeably better than a dense model at the active size, and nowhere near a dense model at the total size. So Qwen3.6-35b-a3b isn't going to behave like a 35b dense; it'll feel like something north of a 3b but well short of a 35b.&lt;/p&gt;

&lt;p&gt;The other catch, and this one matters a lot if you're running locally, is that even though MoE only uses a fraction of params per token, you still have to load ALL the params into memory. That 397b model still needs somewhere around 200GB at q4 to run, even though only 17b worth is doing math at any given moment. Llama.cpp does have a clever way to offload the inactive expert layers to system RAM so you can run these things on regular gaming hardware, but that's a deeper topic. I have a &lt;a href="https://www.someoddcodeguy.dev/understanding-moe-offloading/" rel="noopener noreferrer"&gt;whole writeup on MoE offloading&lt;/a&gt; if you want to go down that rabbit hole.&lt;/p&gt;

&lt;h3&gt;
  
  
  Training
&lt;/h3&gt;

&lt;p&gt;LLMs learn by being "trained". It's a complex process that, at the absolute highest level, involves the LLM seeing billions upon billions of tokens of information and learning patterns from it. "When I see someone say this, it usually involves someone responding with that" kind of thing. This is why people constantly harp about good data in training being the most important thing- if you have really clean examples of speech, knowledge, etc, it is easier for the LLM to find the right patterns.&lt;/p&gt;

&lt;p&gt;Eventually, more powerful LLMs start to infer new patterns that they haven't seen before. Remember the old math problems like &lt;code&gt;if A == B and B == C, then A == C&lt;/code&gt;? Imagine that on a MASSIVE scale, where it creates connections between information many many many many layers deep to get from A to Z.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Training a commercially viable model takes ungodly amounts of money and data, and you need really smart people to do it. Companies spend millions to billions of dollars making some of the most powerful models.&lt;/li&gt;
&lt;li&gt;Training data is hard to come by. If you've heard about how some companies scraped the internet for data? That's why. They are looking for examples of speech, knowledge, etc. When an LLM wants to train on your data, it is less that the company wants to include your personal PII in the model (they generally don't; they don't want that bad publicity if someone makes the model spit it out) and more that they want nice clean interactions to give to the LLM to look at and learn more patterns.&lt;/li&gt;
&lt;li&gt;This is also why AI companies are mad at each other for "distilling" their products. Distilling is the act of interacting with an LLM over and over again to get examples of the LLM's speaking or thinking process, then creating training data to teach another LLM to act or reason that same way. An example of this from recently was that DeepSeek, Moonshot AI, and MiniMax got accused of doing this by Anthropic. The accusation was that they were using thousands of fraudulent accounts to interact with Claude millions of times, then using those interactions to teach their own models to think and speak similarly.&lt;/li&gt;
&lt;li&gt;It's possible to train little fun models pretty cheaply. One guy recently trained a small model from scratch on 1800s text, with nothing at all modern in it. This little model has no concept of anything past the industrial age. &lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Finetuning / Post-Training
&lt;/h3&gt;

&lt;p&gt;When you hear a non-tech company say they are "training a model", they most likely mean finetuning or post-training an open weight model.&lt;/p&gt;

&lt;p&gt;Imagine an LLM as a big calculator for matrix math. Numbers go in, one number comes out. Do that over and over and you get a response. The neat thing about matrix math is something called rank factorization- the idea that you can represent a matrix &lt;code&gt;m*n&lt;/code&gt; with rank &lt;code&gt;r&lt;/code&gt; by using smaller matrices &lt;code&gt;m*r&lt;/code&gt; and &lt;code&gt;r*n&lt;/code&gt;. Some super smart folks figured out that this allowed us to have LoRAs, which you can think of like add-on components to LLMs that modify the weight distribution.&lt;/p&gt;

&lt;p&gt;In other words- rather than retraining the entire model to try to add more information, you train an itty bitty version of that model with the info you want, and then you can load the original model + LoRA at the same time to get a post-trained model.&lt;/p&gt;
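
&lt;p&gt;A toy numpy sketch of that idea, with made-up sizes (real LoRA training involves a lot more than this, but the shape of the math is the point):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

# Toy sketch of the LoRA idea: instead of retraining the full m x n weight
# matrix, train two small matrices A (m x r) and B (r x n) and add their
# product on top of the frozen original weights. Sizes here are made up.
m, n, r = 4096, 4096, 16

W = np.random.randn(m, n)         # frozen original weights (not trained)
A = np.random.randn(m, r) * 0.01  # the trained LoRA pieces (tiny by comparison)
B = np.random.randn(r, n) * 0.01

W_adapted = W + A @ B             # what the post-trained model effectively uses

print("full matrix params:", W.size)           # 16,777,216
print("LoRA params:       ", A.size + B.size)  # 131,072
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;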

&lt;p&gt;Truthfully- I am pretty staunchly in the camp that you can't reliably train new knowledge into a model this way. That's a very common but &lt;strong&gt;not&lt;/strong&gt; a universal view within the deeper LLM tinkering community; some companies have made post-training their bread and butter. I do believe that you CAN train styles, tones, etc really well into it (&lt;em&gt;for example: training a model to handle documentation a certain way, or think a certain way&lt;/em&gt;), but ultimately I've yet to see a good example of a post-trained model outside of basic Instruct models from the same manufacturer that has actually been worth the effort. Maybe there are some out there, but I'm not familiar with them.&lt;/p&gt;

&lt;p&gt;Anyhow, long story short- you CAN post-train a small model for $100 or less, but I wouldn't even recommend it unless you really understand what you want to get out of it and why. There's very little a post-trained model can do that you can't do with a good workflow, prompt and data to RAG against.&lt;/p&gt;

&lt;h3&gt;
  
  
  How LLMs Respond
&lt;/h3&gt;

&lt;p&gt;When you boil it down, LLMs work in a really simple loop. You give it a chunk of tokens. It processes them and spits out one new token. Then it takes all your original tokens plus that one new token it just spit out, and processes the whole thing again, and spits out the next token. Then it takes all your tokens plus the two new tokens, processes again, spits out the next. On and on, one token at a time, until it decides it is done and sends a stop token. You now have your response.&lt;/p&gt;

&lt;p&gt;To simplify it- LLMs don't think about the response all at once- they think 1 token at a time. Over and over and over until they are done. That's it.&lt;/p&gt;
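
&lt;p&gt;In code, that loop looks something like this. &lt;code&gt;pick_next_token&lt;/code&gt; is a stand-in for all the matrix math; the point is that everything generated so far gets fed back in to pick the next token:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch of the loop described above. "model.pick_next_token" is a stand-in
# for the real machinery; what matters is the shape of the loop.
def generate(model, prompt_tokens, stop_token, max_new_tokens=512):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = model.pick_next_token(tokens)  # one new token per pass
        if next_token == stop_token:
            break                                   # the model decided it's done
        tokens.append(next_token)                   # gets fed back in next pass
    return tokens[len(prompt_tokens):]              # just the response tokens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;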

&lt;p&gt;This is also why "reasoning" works. If you ask a model to just answer a hard math problem cold, it can fumble it, because by the time it gets to the answer it's already locked into early tokens it picked. But if you tell it to think out loud first- write out the problem, work through it step by step- then while it's writing all that, it's still just predicting one token at a time, except now each new token gets to "see" all the work it just laid out. If it makes a mistake at step 2, it can sometimes catch it at step 4 and shift the line of thinking before it commits to a final answer.&lt;/p&gt;

&lt;p&gt;If you ever watch an LLM think, and it constantly goes "But wait...", that's because it was trained to do that in order to stop itself from locking in. It says its response, then it challenges the response, and in doing so gives itself a chance to realize the response was wrong.&lt;/p&gt;

&lt;p&gt;That's basically what chain of thought and reasoning models are. The model writing out its work so it has more to reference when generating each next token. It's not magic, it's just giving the model more useful context to predict from. The flip side is that more reasoning means more tokens, which means more time and more cost. And some models, like Qwen3.5/3.6 and Gemma 4, overthink badly. With those, you want to use a workflow app to manually apply CoT, if you can. Since I use Wilmer everywhere, I have workflows specifically to use Qwen/Gemma with thinking disabled, and then have a manual CoT step. That helps with overthinking massively.&lt;/p&gt;

&lt;h3&gt;
  
  
  RAG - Retrieval Augmented Generation
&lt;/h3&gt;

&lt;p&gt;This is a $5 term for a $0.05 concept. When we talk about RAG, it boils down to a very simple concept: give the LLM the answer before it responds. Everything else, when talking about RAG, is talking about a design pattern.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Simplest example:&lt;/strong&gt; The simplest form of RAG would be copying the text of an article or tutorial, putting it in your prompt, and asking the LLM to answer a question about that. The LLM will use the article to answer you.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Next level of simplicity:&lt;/strong&gt; You might ask an LLM a question, the LLM uses a tool (web search, local wiki search, whatever) to pull the article, concatenates it into your prompt, and answers your question.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What a lot of folks think of when they think of RAG:&lt;/strong&gt; You have a program that takes thousands, or even millions, of documents and turns them into "embeddings"- ie breaks the document into logical chunks and stores them somewhere easy to retrieve from, such as a vector database. Then, when you ask a question, it does some fancy stuff in the background to find the right chunks and answer your question with them. Since putting 1,000,000 files into your context all at once is impossible, this is how you go about the oft-advertised "chat with your documents" situation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But all together, RAG comes down to a very simple concept: give the LLM the answer before it responds. That's it. LLMs are very, very strong at this, and it's a great way to avoid hallucinations.&lt;/p&gt;
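
&lt;p&gt;As a sketch, the whole concept fits in one function. &lt;code&gt;search_my_documents&lt;/code&gt; and &lt;code&gt;ask_llm&lt;/code&gt; are placeholders for whatever retrieval and LLM call you actually use (vector DB lookup, web search, llama.cpp, etc):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# The whole RAG idea in one function: find the relevant text first, then put
# it in the prompt so the model answers from it instead of from memory.
# "search_my_documents" and "ask_llm" are placeholders for your own retrieval
# and LLM call.
def answer_with_rag(question, search_my_documents, ask_llm):
    chunks = search_my_documents(question)   # retrieval: grab the relevant text
    context = "\n\n".join(chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return ask_llm(prompt)                   # augmented generation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;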

&lt;p&gt;For the most part, RAG solutions are not an LLM problem, they're a software problem. If you're struggling with RAG, you probably need to revisit HOW you're feeding the data to your LLM and whether you're giving it too much unnecessary stuff along with the right stuff.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hallucinations
&lt;/h3&gt;

&lt;p&gt;A hallucination is when the LLM responds with something that's flat wrong. The reason it happens comes back to that loop in the &lt;code&gt;How LLMs Respond&lt;/code&gt; section: an LLM doesn't actually know anything. It's a pattern matcher predicting the most likely next token from what came before, using the patterns it learned in training to determine "when I see X, I usually see a response of Y". If the most likely next token happens to be the wrong one, well, that's what you get. This especially happens with information that there isn't a lot of great data out there for, so the LLM had to infer the relationships. Asking a detailed question about Excel means it has millions of example questions, articles, documents, etc from the internet to have learned from; asking a question about FIS' Relius Administration has far, far fewer examples, so it likely inferred a lot of things based on other patterns, and it will hallucinate like mad.&lt;/p&gt;

&lt;p&gt;LLMs, as a technology, don't have a built-in "I'm not sure about this" lever they can pull. They just generate whatever the patterns say to generate, and confidence isn't really part of the equation. The answer you got is 'right' from the perspective that it generated the most likely pattern. Whether that pattern is of any use to you has nothing to do with the LLM lol.&lt;/p&gt;

&lt;p&gt;The most common reasons you see hallucinations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The training data was wrong, so the pattern the model learned is wrong.&lt;/li&gt;
&lt;li&gt;The training data didn't cover the topic well, so the model is filling in gaps with whatever sounds plausible.&lt;/li&gt;
&lt;li&gt;You asked something outside what the model was really trained for, and it tries to answer anyway because that's what it was trained to do- give an answer.&lt;/li&gt;
&lt;li&gt;Your context window is huge or messy, and the model is losing track of what's actually relevant in there.&lt;/li&gt;
&lt;li&gt;The model is over-quantized and just making more mistakes generally (going back to that earlier section).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Reasoning models hallucinate a bit less on certain types of problems because they get a chance to second-guess themselves while writing things out, but they absolutely still hallucinate. The single best mitigation is to put the answer in the context for it, which is RAG.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using That Info
&lt;/h3&gt;

&lt;p&gt;Knowing all this should hopefully help you start to narrow down why some of the "pro tips" of using LLMs exist. When you want a factual answer, you don't just ask the LLM. Right or wrong, you're getting a confident response. Instead, make sure you are injecting the right answer in before it responds- this often means tool use such as web search or, even better, "Deep Research" features you find on commercial LLMs.&lt;/p&gt;

&lt;p&gt;This also hopefully will help you see why jamming ALL of your codebase into the LLM is a mistake, and why constantly asking "What model has a bigger context window?" is the wrong question. It's lazy to just look for bigger context windows, and that laziness will bite you. Instead, focus on how you can break the data apart so that the LLM can work in the confines of what it handles best. That means writing or downloading some supporting software.&lt;/p&gt;

&lt;p&gt;Anyhow, good luck folks. Hope this helps the like 4 people that might read this far.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Qwen3.6, and WilmerAI OpenCode workflows</title>
      <dc:creator>SomeOddCodeGuy</dc:creator>
      <pubDate>Mon, 20 Apr 2026 03:10:20 +0000</pubDate>
      <link>https://dev.to/someoddcodeguy/qwen36-and-wilmerai-opencode-workflows-oj7</link>
      <guid>https://dev.to/someoddcodeguy/qwen36-and-wilmerai-opencode-workflows-oj7</guid>
      <description>&lt;p&gt;Just a random note, but Qwen3.6 35b a3b is putting a smile on my face. This little model feels like a big upgrade over 3.5's 27b or 35b a3b.&lt;/p&gt;

&lt;p&gt;Also- the Wilmer workflow for OpenCode is really going well. I need to test it more, because I had to do a big refactor on it, but so far between that and Qwen3.6, the level of quality I'm seeing from OpenCode now feels &lt;strong&gt;reliable&lt;/strong&gt;. I won't over-exaggerate the situation by making any claims about it feeling similar in quality to X or Y proprietary cloud models; instead I'll say that up until now, I had not felt like a local model that ran at any kind of a decent speed was particularly reliable for power-user level agentic coding. This model + jamming my Wilmer workflow between MLX and OpenCode has now changed that. I have more work to do, a lot more testing to do, but I'm feeling really good about this right now.&lt;/p&gt;

&lt;p&gt;And on a side note: the M5 Max with MLX is absolutely destroying my M3 Ultra in terms of speeds when running Qwen3.6 35b. I currently have that model running at bf16 on the M5 Max, and I'm watching it process prompts at insane (for Mac) speeds.&lt;/p&gt;

&lt;p&gt;M5 Max 128GB Macbook Pro MLX Qwen3.6 35b a3b bf16 - 4k tokens&lt;br&gt;
Total Time: ~1.1 seconds&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2026-04-19 22:56:00,920 - INFO - Prompt processing progress: 322/4010
2026-04-19 22:56:01,475 - INFO - Prompt processing progress: 2370/4010
2026-04-19 22:56:01,972 - INFO - Prompt processing progress: 4006/4010
2026-04-19 22:56:02,004 - INFO - Prompt processing progress: 4009/4010
2026-04-19 22:56:02,029 - INFO - Prompt processing progress: 4010/4010
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;M5 Max 128GB Macbook Pro MLX Qwen3.6 35b a3b bf16 - 32k tokens&lt;br&gt;
Total time: ~11 seconds&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2026-04-19 22:56:18,074 - INFO - Prompt processing progress: 2048/32137
2026-04-19 22:56:18,652 - INFO - Prompt processing progress: 4096/32137
2026-04-19 22:56:19,259 - INFO - Prompt processing progress: 6144/32137
2026-04-19 22:56:19,896 - INFO - Prompt processing progress: 8192/32137
2026-04-19 22:56:20,561 - INFO - Prompt processing progress: 10240/32137
2026-04-19 22:56:21,249 - INFO - Prompt processing progress: 12288/32137
2026-04-19 22:56:21,971 - INFO - Prompt processing progress: 14336/32137
2026-04-19 22:56:22,714 - INFO - Prompt processing progress: 16384/32137
2026-04-19 22:56:23,485 - INFO - Prompt processing progress: 18432/32137
2026-04-19 22:56:24,288 - INFO - Prompt processing progress: 20480/32137
2026-04-19 22:56:25,122 - INFO - Prompt processing progress: 22528/32137
2026-04-19 22:56:25,989 - INFO - Prompt processing progress: 24576/32137
2026-04-19 22:56:26,879 - INFO - Prompt processing progress: 26624/32137
2026-04-19 22:56:27,800 - INFO - Prompt processing progress: 28672/32137
2026-04-19 22:56:28,761 - INFO - Prompt processing progress: 30720/32137
2026-04-19 22:56:29,542 - INFO - Prompt processing progress: 32136/32137
2026-04-19 22:56:29,581 - INFO - Prompt processing progress: 32137/32137
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Anyhow, I have a very busy week coming up, so I'm unlikely to post much for a little bit, but I will be testing this workflow up a storm and really putting this little Qwen through its paces.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>coding</category>
      <category>llm</category>
    </item>
    <item>
      <title>Wilmer Tool Calling</title>
      <dc:creator>SomeOddCodeGuy</dc:creator>
      <pubDate>Mon, 13 Apr 2026 03:53:26 +0000</pubDate>
      <link>https://dev.to/someoddcodeguy/wilmer-tool-calling-492g</link>
      <guid>https://dev.to/someoddcodeguy/wilmer-tool-calling-492g</guid>
      <description>&lt;p&gt;So some year and a half after the request was made for me to put tool calling into Wilmer, I've finally got it in there.&lt;/p&gt;

&lt;p&gt;First off- it was a huge pain to implement; if I didn't have Wilmer itself and agentic coders to help, I'm not sure I'd have done it. The way streaming works with tool calling is a bit odd, too, so that was interesting to navigate. Really, this was something I couldn't have pulled off without the earlier workflow engine refactor for the Execution Context.&lt;/p&gt;

&lt;p&gt;The idea is straightforward: Wilmer sits in between the frontend and the LLM, so it just needs to pass tool definitions from the frontend through to the model, and pass tool call responses from the model back to the frontend. Wilmer itself doesn't need to understand or execute the tools. The tricky part was that Wilmer has a whole pipeline of nodes doing different things (&lt;em&gt;memory lookups, categorization, summarization, context gathering&lt;/em&gt;) and you really don't want tool calls accidentally hitting nodes that are just doing internal processing. So I had to put per-node controls in place. Only the nodes you explicitly flag will pass tools through; the rest strip them out and do their job. The one exception is internal nodes using chat_user_prompt_*, which still get just the tool call outputs pulled out for them. &lt;/p&gt;
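
&lt;p&gt;Conceptually, the per-node control is just deciding whether the OpenAI-style &lt;code&gt;tools&lt;/code&gt; field gets forwarded. This isn't Wilmer's actual code, just the shape of the idea:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Not Wilmer's actual code; just the shape of the idea. An OpenAI-style
# request carries "tools" / "tool_choice" fields, and each node either
# forwards them untouched or strips them before calling its backend model.
def build_backend_request(frontend_request, node_allows_tools):
    backend_request = dict(frontend_request)    # shallow copy of the payload
    if not node_allows_tools:
        backend_request.pop("tools", None)      # internal nodes never see tools
        backend_request.pop("tool_choice", None)
    return backend_request
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;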

&lt;p&gt;Format conversion between OpenAI, Claude, and Ollama backends was also a headache since they all handle tool calling differently, and streaming tool calls needed their own handling to keep the structured data from getting mangled by the normal text processing pipeline.&lt;/p&gt;

&lt;p&gt;But the reason I finally sat down and did this is that I've been using OpenCode more lately. Up until summer of last year I had pretty much written off agentic coding, but once Claude Code got good I found myself sucked in like everyone else. Even though I'm usually a very local-first oriented guy, I've just stuck with it ever since because the quality is so great.&lt;/p&gt;

&lt;p&gt;A month or so ago I started dabbling in OpenCode, to have something for when the net goes out, and I have to say that Qwen3.5 27b combined with it is pretty nice... but nowhere near the quality of Claude (&lt;em&gt;obviously&lt;/em&gt;). My goal hasn't changed since 2023: trying to find ways to improve the quality of local tools to that of proprietary, even if it means sacrificing speed for quality. So as with all things, after trying OpenCode for a while, my answer is: shove Wilmer into the flow.&lt;/p&gt;

&lt;p&gt;Now that tool calling works end to end, I can do just that. The OpenCode calls pass through Wilmer, hit my workflows, and the tool calls get forwarded through to one of N number of models in llama.cpp and back without Wilmer needing to know anything about what the tools actually do. It slows everything down a lot, but the result is far less engagement from me because it gets things right in far fewer tries. Especially doing things like the earlier Qwen improvements of manually applying CoT.&lt;/p&gt;

&lt;p&gt;I've had really great luck with getting Qwen3.5 122b to give a lot better results than stock like this, but Qwen3.5 27b has been a bit harder to wrangle. Getting it to play nice with my decision trees is fairly challenging so far.&lt;/p&gt;

&lt;p&gt;I'm going to tinker with these OpenCode workflows for a month or so and then start putting them out for folks. Updating the example workflows in the repo is next on the list.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>showdev</category>
    </item>
    <item>
      <title>A Quick Note on Gemma 4 Image Settings in Llama.cpp</title>
      <dc:creator>SomeOddCodeGuy</dc:creator>
      <pubDate>Fri, 03 Apr 2026 01:50:48 +0000</pubDate>
      <link>https://dev.to/someoddcodeguy/a-quick-note-on-gemma-4-image-settings-in-llamacpp-39ng</link>
      <guid>https://dev.to/someoddcodeguy/a-quick-note-on-gemma-4-image-settings-in-llamacpp-39ng</guid>
      <description>&lt;p&gt;In my last post, I mentioned &lt;a href="https://www.someoddcodeguy.dev/a-few-tips-for-ocr-with-qwen3-5-through-llama-cpp/" rel="noopener noreferrer"&gt;using --image-min-tokens to increase the quality of image responses from Qwen3.5&lt;/a&gt;. I went to load Gemma 4 the same way, and hit an error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;[58175] srv  process_chun: processing image...
[58175] encoding image slice...
[58175] image slice encoded in 7490 ms
[58175] decoding image batch 1/2, n_tokens_batch = 2048
&lt;/span&gt;&lt;span class="gp"&gt;[58175] /Users/socg/llama.cpp-b8639/src/llama-context.cpp:1597: GGML_ASSERT((cparams.causal_attn || cparams.n_ubatch &amp;gt;&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; n_tokens_all&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="s2"&gt;"non-causal attention requires n_ubatch &amp;gt;= n_tokens"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; failed
&lt;span class="go"&gt;[58175] WARNING: Using native backtrace. Set GGML_BACKTRACE_LLDB for more info.
[58175] WARNING: GGML_BACKTRACE_LLDB may cause native MacOS Terminal.app to crash.
[58175] See: https://github.com/ggml-org/llama.cpp/pull/17869
[58175] 0   libggml-base.0.9.11.dylib           0x0000000103a6136c ggml_print_backtrace + 276
[58175] 1   libggml-base.0.9.11.dylib           0x0000000103a61558 ggml_abort + 156
[58175] 2   libllama.0.0.0.dylib                0x0000000103eacd70 _ZN13llama_context6decodeERK11llama_batch + 5484
[58175] 3   libllama.0.0.0.dylib                0x0000000103eb098c llama_decode + 20
[58175] 4   libmtmd.0.0.0.dylib                 0x0000000103b8f7e8 mtmd_helper_decode_image_chunk + 948
[58175] 5   libmtmd.0.0.0.dylib                 0x0000000103b8fea4 mtmd_helper_eval_chunk_single + 536
[58175] 6   llama-server                        0x0000000102fb4d94 _ZNK13server_tokens13process_chunkEP13llama_contextP12mtmd_contextmiiRm + 256
[58175] 7   llama-server                        0x0000000102fe3318 _ZN19server_context_impl12update_slotsEv + 8396
[58175] 8   llama-server                        0x0000000102faaca0 _ZN12server_queue10start_loopEx + 504
[58175] 9   llama-server                        0x0000000102f3a610 main + 14376
[58175] 10  dyld                                0x00000001968edd54 start + 7184
srv    operator(): http client error: Failed to read connection
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 500
srv    operator(): instance name=gemma-4-31B-it-UD-Q8_K_XL exited with status 1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, the crash is caused by the fact that I'm not setting ubatch.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;58175&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;Users&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;socg&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;llama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cpp&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;b8639&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;llama&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cpp&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1597&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;GGML_ASSERT&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;cparams&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;causal_attn&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;cparams&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n_ubatch&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;n_tokens_all&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="s"&gt;"non-causal attention requires n_ubatch &amp;gt;= n_tokens"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;failed&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The reason is that Gemma 4's vision encoder uses non-causal attention for image tokens, which means all the image tokens have to fit within a single ubatch; since I specified a minimum of 2048 image tokens, that's a problem, because ubatch defaults to 512.&lt;/p&gt;

&lt;p&gt;First, we need to make sure the model actually supports going that high. &lt;a href="https://unsloth.ai/docs/models/gemma-4" rel="noopener noreferrer"&gt;If we peek over at Unsloth's page, we'll see that's not the case&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Gemma 4 supports multiple visual token budgets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;70&lt;/li&gt;
&lt;li&gt;140&lt;/li&gt;
&lt;li&gt;280&lt;/li&gt;
&lt;li&gt;560&lt;/li&gt;
&lt;li&gt;1120&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use them like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;70 / 140: classification, captioning, fast video understanding&lt;/li&gt;
&lt;li&gt;280 / 560: general multimodal chat, charts, screens, UI reasoning&lt;/li&gt;
&lt;li&gt;1120: OCR, document parsing, handwriting, small text&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;So our max is actually 1120 here. For my case, I'm going to want to set the --image-min-tokens and --image-max-tokens both to 1120, and then I'll buffer up the batch and ubatch to 2048.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./llama-server &lt;span class="nt"&gt;-ngl&lt;/span&gt; 200 &lt;span class="nt"&gt;--ctx-size&lt;/span&gt; 65535 &lt;span class="nt"&gt;--models-dir&lt;/span&gt; /Users/socg/models &lt;span class="nt"&gt;--models-max&lt;/span&gt; 1 &lt;span class="nt"&gt;--port&lt;/span&gt; 5001 &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--jinja&lt;/span&gt; &lt;span class="nt"&gt;--image-min-tokens&lt;/span&gt; 1120 &lt;span class="nt"&gt;--image-max-tokens&lt;/span&gt; 1120 &lt;span class="nt"&gt;--ubatch-size&lt;/span&gt; 2048 &lt;span class="nt"&gt;--batch-size&lt;/span&gt; 2048
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>ai</category>
      <category>google</category>
      <category>llm</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>A Few Tips for OCR With Qwen3.5 through Llama.cpp</title>
      <dc:creator>SomeOddCodeGuy</dc:creator>
      <pubDate>Tue, 31 Mar 2026 02:27:00 +0000</pubDate>
      <link>https://dev.to/someoddcodeguy/a-few-tips-for-ocr-with-qwen35-through-llamacpp-7de</link>
      <guid>https://dev.to/someoddcodeguy/a-few-tips-for-ocr-with-qwen35-through-llamacpp-7de</guid>
      <description>&lt;p&gt;Just a couple of quick tips. I am using the Unsloth Qwen3.5 27b gguf, and also tried the 122b gguf.&lt;/p&gt;

&lt;p&gt;First: The difference between the bf16 and fp32 mmproj is night and day. I was getting multiple hallucinations, errors, etc with the bf16. I swapped to the fp32 mmproj and it fixed up a lot of that almost instantly. Drastic improvement. The vision projector may have components that benefit from fp32's additional mantissa bits &lt;em&gt;(23 bits vs bf16's 7 bits)&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Second: Forcing the model to kick up the minimum number of visual tokens. For example, I was trying to run OCR on an old image of a Japanese newspaper article from 1957 that I found. It was something like 733x1024, and the model was really struggling to read the body of the text; tons of hallucinations, just making up entire sections of text. By forcing the image-min-tokens up to 2048, it forced the model to use 3x the visual processing, and the quality went up MASSIVELY. All of a sudden it could read the paper, with only a handful of small issues.&lt;/p&gt;

&lt;p&gt;This is what I added to the llama-server command: &lt;code&gt;--image-min-tokens 2048 --image-max-tokens 8192&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;I did have to toss 1.1 repetition penalty in there, as it was having a hard time transcribing Japanese without failing, but otherwise it is doing a great job now.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Wrangling Qwen's Overthinking with Workflows</title>
      <dc:creator>SomeOddCodeGuy</dc:creator>
      <pubDate>Sat, 28 Mar 2026 17:45:00 +0000</pubDate>
      <link>https://dev.to/someoddcodeguy/wrangling-qwens-overthinking-with-workflows-3hhm</link>
      <guid>https://dev.to/someoddcodeguy/wrangling-qwens-overthinking-with-workflows-3hhm</guid>
      <description>&lt;p&gt;So I've been running Qwen3.5 122b a10b lately on the M2 Ultra (currently GLM 5 is sitting on the M3), and if you've used any of the Qwen3.5 family, you've probably seen or heard about the overthinking issue. The models are great if you either have a lot of time to kill while you wait for a response, or for more straight forward work if you kill the reasoning. The 35b a3b with reasoning disabled has been my workhorse for the past couple of weeks and &lt;strong&gt;it is the greatest thing since sliced bread&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Anyhow, now that I want to use the 122b for actual hobby work, I've realized how painful the overthinking really is. I had a conversation a few days ago where I asked it to translate something simple. Not anything complex, just a straightforward translation request. It spat out over 5,000 tokens of reasoning before giving me the actual answer. I tested, and actually got a faster response by sending my request to GLM 5 with reasoning enabled, despite it being a 744b a40b model. It just thought so much less, because the request wasn't THAT complex.&lt;/p&gt;

&lt;p&gt;I tried all of the Qwen-recommended samplers, and even kicked up repetition penalty alongside their recommended presence penalty just to see what it would do. But nope; think think think. I also sleuthed around the net a bit and saw that several folks ultimately solved this with forceful thinking budgets in the newer llama.cpp, but I'm not a huge fan of that; if the reasoning isn't done, then it just gets cut off mid-thought and you really aren't getting the benefit of reasoning at all.&lt;/p&gt;

&lt;p&gt;So after banging my head on this for a bit, I went back to something I used to do when reasoning models were newer and their CoT actually hurt more than helped: Wilmer workflows to the rescue.&lt;/p&gt;

&lt;p&gt;What I ended up doing was disabling Qwen3.5's native reasoning entirely. I'm passing &lt;code&gt;enable_thinking: false&lt;/code&gt; into &lt;code&gt;chat_template_kwargs&lt;/code&gt; through the llama.cpp server payload to disable thinking, then I built a workflow that handles the chain-of-thought process manually.&lt;/p&gt;
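
&lt;p&gt;For reference, that payload looks something like this against llama.cpp's OpenAI-style endpoint (the host, port, and model name here are just placeholders for my own setup):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests

# Disabling Qwen's native reasoning by passing enable_thinking: false into
# chat_template_kwargs on a llama.cpp server request. Host, port, and model
# name are placeholders; swap in your own.
payload = {
    "model": "qwen3.5-122b-a10b",
    "messages": [{"role": "user", "content": "Translate 'good morning' to Japanese."}],
    "chat_template_kwargs": {"enable_thinking": False},
}
response = requests.post("http://localhost:5001/v1/chat/completions", json=payload)
print(response.json()["choices"][0]["message"]["content"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;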

&lt;p&gt;The workflow does the usual context gathering that my setups always do, and then right before the final response there's a dedicated "thinking" node. This node gets all the context and produces a chain-of-thought analysis that then feeds into the responder node.&lt;/p&gt;

&lt;p&gt;Rather than wing the CoT, since things have probably changed a bit since the last time I did that in 2024 (lol), I had Claude do a deep research pass on how DeepSeek and GLM 4.7 structure their reasoning internally, to see if I could get some ideas. In my experience, both of those do amazingly at CoT.&lt;/p&gt;

&lt;p&gt;DeepSeek-R1 ended up having the most info available; it followed a four-phase pattern of problem definition, decomposition, reconstruction cycles, and final decision. The reconstruction cycles are where it either ruminates or genuinely tries new approaches. GLM 4.7 does something called interleaved thinking, where it reasons before each response and each tool call, not just at the start.&lt;/p&gt;

&lt;p&gt;The research I found showed something interesting. Incorrect solutions have more and longer reconstruction cycles than correct ones. There's a problem-specific sweet spot for reasoning length. As we already knew: more reasoning doesn't always mean better answers. In fact, R1 had a bad habit of ruminating, re-examining the same formulations repeatedly, which actually hurts its ability to find novel solutions.&lt;/p&gt;

&lt;p&gt;It was an overthinker, too; just not as bad as Qwen.&lt;/p&gt;

&lt;p&gt;Anyhow, long story long: I took all that and threw together a new CoT prompt in a new node just before the responder. The model has to assess complexity first and scale its effort accordingly; a simple greeting gets maybe two or three sentences of thought, while a multi-step coding problem gets a thorough breakdown. Then it has to work through the problem, verify its reasoning, and output a response plan. If it catches itself repeating the same line of reasoning, it's instructed to stop and either move on or try a genuinely different approach.&lt;/p&gt;

&lt;p&gt;Despite Qwen3.5 122b not being trained for this, the results have been solid. Instead of 5,000+ tokens of circular thinking on a simple translation, I'm seeing 900 to 1500 tokens now on that same request. The quality of the final responses seems about the same, maybe slightly better because the thinking is actually structured rather than meandering. And despite making two separate model calls instead of one, the total response time is lower because I'm not burning tokens on endless rumination.&lt;/p&gt;

&lt;p&gt;This isn't a new idea. I had to do this two years ago as well; it's just funny that I'm circling back to it now with one of the most powerful models out there.&lt;/p&gt;

&lt;p&gt;Anyhow, that's how I got Qwen3.5 to behave. Your mileage may vary. But if you've got a workflow system set up and you're willing to spend some time on prompt engineering, there's a lot you can do to tame a model that doesn't self-regulate well.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>A New Toy...</title>
      <dc:creator>SomeOddCodeGuy</dc:creator>
      <pubDate>Tue, 17 Mar 2026 23:41:00 +0000</pubDate>
      <link>https://dev.to/someoddcodeguy/a-new-toy-4f56</link>
      <guid>https://dev.to/someoddcodeguy/a-new-toy-4f56</guid>
      <description>&lt;p&gt;The M5 Max Macbook Pro just arrived. First thing I did was fling llama.cpp, Wilmer and Open WebUI on it.&lt;/p&gt;

&lt;p&gt;Honestly, the speeds are really impressive, even considering that llama.cpp hasn't fully integrated the hardware changes yet (at least, that's my understanding). Here's a comparison of Qwen3.5 35b a3b between the M5 Max Macbook and the M3 Ultra Mac Studio.&lt;/p&gt;

&lt;h3&gt;
  
  
  M5 Max MacBook Pro:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1450 t/s processing, 68 t/s generation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;prompt eval time =    
    3202.80 ms /  4654 tokens 
    (0.69 ms per token,  1453.10 tokens per second)
eval time =    
    7098.19 ms /   483 tokens 
   (14.70 ms per token,    68.05 tokens per second)
total time =   10300.99 ms /  5137 tokens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  M3 Ultra Mac Studio:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1647 t/s processing, 48 t/s generation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;prompt eval time = 
    3810.74 ms / 6280 tokens 
    (0.61 ms per token, 1647.97 tokens per second)
eval time = 
    14695.00 ms / 704 tokens 
    (20.87 ms per token, 47.91 tokens per second)
total time = 
    18505.75 ms / 6984 tokens 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So yea- the Studio processes prompts faster (&lt;em&gt;at this size of model and this amount of tokens, though I think that it actually saturates better on the M5 Max at larger prompts&lt;/em&gt;), but generates tokens slower than the M5 Max.&lt;/p&gt;

&lt;p&gt;Super excited to play with this. I got rid of the M2 Max Macbook, so this is my main travel machine now.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Slimming Down the Homelab Software Footprint</title>
      <dc:creator>SomeOddCodeGuy</dc:creator>
      <pubDate>Mon, 16 Mar 2026 03:09:00 +0000</pubDate>
      <link>https://dev.to/someoddcodeguy/slimming-down-the-homelab-software-footprint-5c6g</link>
      <guid>https://dev.to/someoddcodeguy/slimming-down-the-homelab-software-footprint-5c6g</guid>
      <description>&lt;p&gt;So my homelab setup post from a while back is already outdated. Not as much on the hardware part; rather the software side has consolidated dramatically.&lt;/p&gt;

&lt;p&gt;The original setup had somewhere around 20 to 30 separate WilmerAI instances running across my network. Each one was configured for a specific purpose: coding assistance, general chat, RAG workflows, reasoning-heavy tasks, fast responses, and so on. Each instance pointed at one of my three main inference machines (the M2 Ultras and M3 Ultra). If I wanted a different usecase, I spun up a different Wilmer instance and pointed at the appropriate models on the appropriate machine.&lt;/p&gt;

&lt;p&gt;This worked, but it was wasteful. Wilmer is lightweight at around 150 megabytes per instance, but multiply that by 25 or 30 instances and you're burning some memory. More importantly, it was fragile. If I fired off two different workflow requests that both targeted the same Mac, they could hit the LLM simultaneously and either slow down the machine massively or crash it entirely. Apple Silicon doesn't handle parallel LLM inference well at all, so I had to tiptoe around my own setup, mentally tracking which workflows were in use before triggering another one.&lt;/p&gt;

&lt;p&gt;Two changes have collapsed this down to something far more manageable.&lt;/p&gt;

&lt;p&gt;The first is actually a Llama.cpp change; lcpp server recently added router mode (think llama-swap), which lets a single instance manage multiple models. You start the server without specifying a model, point it at a directory of GGUF files, and then specify the model in each API request. The server handles loading, unloading, and LRU eviction automatically. For my use case, I now run two llama.cpp instances per physical machine: one for a large model (the responders) and one for a small model (the workers). Both stay loaded and pinned with mlock so there is no cold start penalty. The model field in the request tells llama.cpp which one to use. That took me from an average of 5 llama.cpp instances per machine down to 2.&lt;/p&gt;

&lt;p&gt;By doing two lcpp instances, I can work it out so that the memory balances. I'll make sure my largest responder model leaves enough memory headroom for my largest worker model; if that combination can load side by side, then I'm golden. With the Mac's memory caching, that makes it super quick to swap models around as needed.&lt;/p&gt;

&lt;p&gt;The second big change for me is on the Wilmer-side; specifically the multi-user support I just finished building into Wilmer.&lt;/p&gt;

&lt;p&gt;Instead of running a separate Wilmer process for each workflow, I now run a single Wilmer instance per physical machine with multiple users configured via the --User flag. Each "user" is really just a configuration profile: a set of endpoints, presets, memory settings, and workflow folders. The front-end selects which configuration to use by setting the model field to something like chris-openwebui-m3:coding or chris-openwebui-m3:general. Wilmer parses that prefix, loads the appropriate user config, and runs the shared workflow under that configuration.&lt;/p&gt;
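
&lt;p&gt;From the frontend's point of view, that selection is nothing more than setting the model field on a normal OpenAI-style request, assuming you're hitting Wilmer's OpenAI-compatible chat completions endpoint. Something like this, with the host and port as placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests

# Picking a Wilmer user config + shared workflow is just the model field.
# The host and port here are placeholders for wherever Wilmer is running.
payload = {
    "model": "chris-openwebui-m3:coding",   # "user config":"shared workflow"
    "messages": [{"role": "user", "content": "Can you review this function for me?"}],
}
response = requests.post("http://localhost:8765/v1/chat/completions", json=payload)
print(response.json()["choices"][0]["message"]["content"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;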

&lt;p&gt;The shared workflows are also a new feature. They expose workflow folders through the /v1/models and /api/tags endpoints, so frontends like Open WebUI just see them as models in a dropdown. Selecting one tells Wilmer which workflow to run. &lt;/p&gt;

&lt;p&gt;In multi-user mode, the username prefix determines which user's endpoints and settings get used. So bob:openwebui-coding runs the same workflow as alice:openwebui-coding (assuming both are using shared workflows), but each hits their own configured LLM backends and presets.&lt;/p&gt;

&lt;p&gt;The result is that my M3 Ultra now has a single Wilmer instance pointed to it, serving about a dozen different shared workflows, plus Roland and a Wikipedia researcher. The M2 Ultras are set up similarly. This cleaned up a LOT of memory on the Mac mini.&lt;/p&gt;

&lt;p&gt;Concurrency limiting is the last big item. The --concurrency flag (defaulting to 1) queues incoming requests so only one hits the LLM at a time. I can now fire off multiple requests to different workflows on the same machine without worrying about crashing anything. Wilmer queues them and processes them sequentially, meaning I no longer have to keep track of what's hitting what.&lt;/p&gt;

&lt;p&gt;I still have separate instances for my mobile setup on the MacBook Pro. That one runs independently when I am on the road. &lt;/p&gt;

&lt;p&gt;This is all something I've meant to do forever; this and the new memory features (like the memory condenser I mentioned in an earlier post). It's a little headache that I've put up with for years, because scoping individual users was so challenging. But after the massive refactor I did in 2025, I could finally move almost all of the workflow/user related global variables into the new execution context and finally ensure there was no bleed/crossover on multi-user setups.&lt;/p&gt;

&lt;p&gt;Up until now, Wilmer was absolutely built for 1 person running it on their own machine. Now it's finally just about in a state where it can actually handle multiple people at once in a single instance appropriately.&lt;/p&gt;

&lt;p&gt;The multi-user and concurrency features are not released yet. Shared workflows got deployed out earlier this year. The rest is coming in the next update.&lt;/p&gt;

&lt;p&gt;I know deployments have slowed down a lot on Wilmer lately, but I haven't given up on it; it's just that it's in a spot where I can do some of the other projects I always wanted to, so I've kicked those off as well. Now my precious free time is split like 5 ways lol.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>The Right Monitor is Hard to Come By</title>
      <dc:creator>SomeOddCodeGuy</dc:creator>
      <pubDate>Fri, 13 Mar 2026 23:22:00 +0000</pubDate>
      <link>https://dev.to/someoddcodeguy/the-right-monitor-is-hard-to-come-by-57jj</link>
      <guid>https://dev.to/someoddcodeguy/the-right-monitor-is-hard-to-come-by-57jj</guid>
      <description>&lt;p&gt;It is shocking how difficult it is to find a 34" curved Ultrawide that is either 2560x1080 or 5120x2160. Back in 2020 or 2021, Spectre made one; it's been discontinued now though.&lt;/p&gt;

&lt;p&gt;The big issue for me is twofold because I have a triple monitor setup: The monitors to the left and right of my main monitor are both 1920x1080 27" monitors. A 34" ultrawide is physically identical in height to those monitors. 2560x1080 is also identical in resolution height. So with a 34" 1080p monitor, it's just a really nice setup.&lt;/p&gt;

&lt;p&gt;My main issue with the current stock you can find on Amazon is that MacOS &lt;em&gt;REALLY&lt;/em&gt; struggles with landing on that resolution if the monitor isn't either set to it natively, or is &lt;strong&gt;5K2K&lt;/strong&gt;. If you get a 3440x1440 monitor... well, I haven't been able to find one that lets me select 2560x1080 as a resolution in standard MacOS.&lt;/p&gt;

&lt;p&gt;I did try &lt;code&gt;BetterDisplay&lt;/code&gt;, but I had some issues that I couldn't work through on it, so I'm back on the prowl for a monitor that fits my needs.&lt;/p&gt;

&lt;p&gt;Resolution selection is definitely one of the areas that Windows has MacOS beat on. That and Microsoft Paint. Omg, I can't tell you how spoiled having that application has made me. I grabbed GIMP for the Mac, but it's overpowered for what I want to do with it; I really just need it to manipulate screenshots or something now and then.&lt;/p&gt;

&lt;p&gt;Oh, and network file sharing. I made the mistake of trying to use a Mac as a local NAS. Never again.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>My Foray Back Into Linux...</title>
      <dc:creator>SomeOddCodeGuy</dc:creator>
      <pubDate>Sun, 08 Mar 2026 18:21:00 +0000</pubDate>
      <link>https://dev.to/someoddcodeguy/my-foray-back-into-linux-4o64</link>
      <guid>https://dev.to/someoddcodeguy/my-foray-back-into-linux-4o64</guid>
      <description>&lt;p&gt;So I decided to make use of one of the mini-pcs I had gotten for the homelab to build a little web browsing box. My first iteration of the web browsing box was a Windows 11 machine, which is the same machine that got me banned from reddit for VPN use (oops), but I've finally decided it was time to graduate from Windows and move to the more private OSes.&lt;/p&gt;

&lt;p&gt;The goal was straightforward enough. I wanted something separate from my main machine that I could use for general web browsing. Something isolated, so if I picked up some nasty malware or clicked a bad link, my actual workstation would be fine. Something that wasn't Windows. And I wanted to remote into it from my Mac Studio so I didn't need yet another monitor on my desk.&lt;/p&gt;

&lt;p&gt;The last time I seriously touched Linux was probably 15 years ago. Back then, getting a Linux box to just work was an adventure that usually ended poorly. There's a reason there were so many memes about the ridiculous complexity of doing simple things in Linux. And it especially didn't help that I wanted to dual boot with Windows... I swear, it seems like Windows kills the Linux bootloader by design sometimes.&lt;/p&gt;

&lt;p&gt;So walking into this, I was mentally preparing for that same experience. I figured I'd brick the machine at least three times before I got anything usable.&lt;/p&gt;

&lt;p&gt;I ended up using a Kamrui mini PC. AMD Ryzen 7 5700U, 32GB of RAM, 1TB of storage. Small enough to tuck away somewhere, powerful enough to handle a browser without breaking a sweat. And I went with Linux Mint with Cinnamon because multiple folks told me it was the easiest transition from Windows.&lt;/p&gt;

&lt;p&gt;Altogether, the process was WAY easier this time around, in the age of LLMs. What used to be an arduous process of digging through tutorials and forum posts was actually a pretty painless task of just having GLM 5 and Claude talk me through various issues as they came up.&lt;/p&gt;

&lt;p&gt;The installation was painless. LUKS disk encryption is now just a checkbox in the installer. No hunting down proprietary drivers, either. I had to use Ethernet because the WiFi card in this thing has no mainline Linux driver support, but that's fine.&lt;/p&gt;

&lt;p&gt;Where things got interesting was the hardening. Because I'm me, I couldn't just install the OS and call it a day. I wanted this thing locked down. UFW firewall, OpenSnitch for outbound traffic monitoring, NordVPN with a kill switch, Firefox hardened, AppArmor running, unnecessary services stripped out, etc.&lt;/p&gt;

&lt;p&gt;In the past, I would have absolutely bricked this machine multiple times. The robits helped with all of that. When xrdp kept failing with a sesman connection error, when NordVPN's kill switch locked me out of the machine entirely, when xrdp kept killing the WebGL process in Firefox and causing it to crash over and over... the bots had an answer for everything.&lt;/p&gt;

&lt;p&gt;In the end, I still did a full refresh, just because I had gone to town on some of the config files in this thing trying to get it the way I wanted, and I couldn't tell if I'd made a mess or not. But another nice thing with the bots was that as I did stuff, I was telling them, so in the end I got them to spit out all the highlights and write up a doc that I could use to replicate the whole process.&lt;/p&gt;

&lt;p&gt;A few things I learned along the way:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;NordLynx doesn't work with OpenSnitch; at least as of the time of this writing. Both manipulate iptables at the kernel level, and they fight each other. I had to switch to OpenVPN, which runs in userspace and plays nice with the firewall, though it's slower.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The xrdp 0.9.24 version in the Mint repos has an IPv6 binding issue that causes intermittent connection failures. The fix is checking the sesman binding after every reboot and restarting the services if it's wrong (there's a rough sketch of that check right after this list).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Firefox's built-in fingerprinting protection sounds great to have, but when I enabled it, Firefox would hang on JavaScript-heavy sites. I eventually dropped it, especially with uBlock Origin blocking tracking scripts anyway.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Right Ctrl gets stuck when you switch virtual desktops on macOS while the MintOS RDP window is in focus. Linux sees the key press but not the release. I had to disable Right Ctrl entirely within Linux via xmodmap to fix it. Took me way too long to figure out what was happening there. But if you think about it... when do you ever use right ctrl? I didn't until I started using Mac more, and that's just for virtual desktop swapping.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
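
&lt;p&gt;For the xrdp item above, the post-reboot check can be as simple as a small script along these lines. The port (3350) and service names assume a stock xrdp install, so treat this as a sketch rather than a drop-in fix:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: confirm xrdp-sesman is listening on IPv4 (default port 3350) and
# bounce the services if it isn't. Assumes a stock xrdp install on Mint.
import subprocess

def sesman_on_ipv4() -&gt; bool:
    out = subprocess.run(["ss", "-ltn"], capture_output=True, text=True).stdout
    return any("3350" in line and ("127.0.0.1" in line or "0.0.0.0" in line)
               for line in out.splitlines())

if not sesman_on_ipv4():
    subprocess.run(["sudo", "systemctl", "restart", "xrdp-sesman", "xrdp"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;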

&lt;p&gt;The final result is a machine that boots up, connects to VPN automatically, and sits there waiting for me to RDP in from my Mac. All traffic goes through NordVPN. DNS queries go through NordVPN's DNS servers. WebRTC is disabled in Firefox. Third-party outbound connections are blocked unless explicitly allowed. The firewall only accepts inbound SSH and RDP connections from my local subnet.&lt;/p&gt;

&lt;p&gt;At this point, I've relegated Windows to gaming only, which I really don't do a lot of these days, but it's nice to have around anyhow. I had been putting off the Windows 11 upgrade (there's an extension available for Win 10 security updates until Oct 2026, so I had taken that). Now that I've got everything personal off my Windows box, I'll get it updated to Win 11.&lt;/p&gt;

&lt;p&gt;Most of the house is now Mac and Linux. Huzzah. I used to love Windows, but they've just been too weird lately about OneDrive. I still really like Outlook and O365; I use both a lot. But my personal machine doesn't need to be so closely tied to the cloud, and if the core Windows experience is going to be a cloud-centric OS, then it's really just not for me anymore.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Wilmer and Token Management</title>
      <dc:creator>SomeOddCodeGuy</dc:creator>
      <pubDate>Sat, 07 Mar 2026 02:04:00 +0000</pubDate>
      <link>https://dev.to/someoddcodeguy/wilmer-and-token-management-4ha3</link>
      <guid>https://dev.to/someoddcodeguy/wilmer-and-token-management-4ha3</guid>
      <description>&lt;p&gt;One of the big keys to running LLMs on a Mac is token management. That's what a lot of Wilmer is built around.&lt;/p&gt;

&lt;p&gt;Wilmer started out because I wanted to make the most of Llama 2 finetunes, but eventually its workflows became a way for me to keep overall token counts down. Macs handle large prompts slowly, and the smaller the prompts, the easier that is to deal with.&lt;/p&gt;

&lt;p&gt;For example, consider a really long conversation with an LLM. I was working with GLM 5 on my M3 Ultra to help me set up a new Linux box in the house. I know Mac and Windows well enough, but my last true foray into Linux was 15 years ago or more, so I needed help.&lt;/p&gt;

&lt;p&gt;Eventually I hit a point where the overall conversation was about 300 messages or more. If I had been sending the whole conversation, it would have been at least 100,000 tokens. Any standard sliding cache could keep it quick, but at the cost of losing the start of the conversation. When you're on a Mac, a 20k token prompt is already in frustrating territory, so you don't want to send much more than that. This means you'd lose 4/5 of the conversation.&lt;/p&gt;

&lt;p&gt;You could rely solely on vector memory, but now you're playing with fire on the sliding cache, hoping you don't accidentally cause it to reset because too much context changed on it.&lt;/p&gt;

&lt;p&gt;So with Wilmer, I've been focused on a handful of context management techniques. Some have been in it since early 2024, and some I'm adding in now.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;File memories&lt;/strong&gt; are JSON files that tie summaries to chunks of messages. The summary prompts can be anything, so it depends on the conversation type. For the Linux conversation, I set it to capture what changes we made successfully: packages installed, configs edited, services started or stopped. The system generates these automatically every 6000 tokens or so, which keeps each chunk focused and digestible.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Chat Summary&lt;/strong&gt; is similar, but rolls everything into one running overview. I use this to capture the 100-mile-high view of where we're at - what the overall goal is, what phase of the project we're in, what big decisions we've made.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Vector memories&lt;/strong&gt; are where the LLM generates individual facts as the conversation progresses and stores them for semantic search. This is more nuanced detail about what's going on: specific commands that worked, error messages we encountered and how we fixed them, configuration values we settled on.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Conversation condensing&lt;/strong&gt; is the newer piece. I configured it to keep my most recent 7000 tokens as raw, untouched messages. Then it takes the next 7000 tokens after that and summarizes them with awareness of the current topic. So if we're troubleshooting a networking issue, it'll lean into preserving networking details. Everything beyond that gets rolled into a neutral summary that captures the broad strokes without topic bias. This lets me keep the immediate context sharp while still holding onto the shape of a long conversation. (There's a rough sketch of this tiering right after the list.)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
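
&lt;p&gt;Here's the rough shape of that condensing split, with stand-in helpers for token counting; the real logic lives in Wilmer's workflows, so treat this purely as an illustration of the tiering:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Illustration of the three-tier condensing split (stand-in helpers, not Wilmer's code).
RAW_BUDGET = 7000      # newest messages, passed through untouched
TOPIC_BUDGET = 7000    # next chunk, summarized with awareness of the current topic

def count_tokens(msg: str) -&gt; int:
    return max(1, len(msg) // 4)   # crude stand-in for a real tokenizer

def split_tiers(messages: list[str]):
    raw, topic_aware, older = [], [], []
    spent = 0
    for msg in reversed(messages):   # walk from newest to oldest
        spent += count_tokens(msg)
        if spent &lt;= RAW_BUDGET:
            raw.append(msg)
        elif spent &lt;= RAW_BUDGET + TOPIC_BUDGET:
            topic_aware.append(msg)
        else:
            older.append(msg)
    return list(reversed(raw)), list(reversed(topic_aware)), list(reversed(older))

# raw goes in verbatim; topic_aware gets a topic-aware summary;
# older gets rolled into one neutral summary.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;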

&lt;p&gt;On top of this, I give the LLMs persistent files they can read from and write to. Things like my speech preferences, behaviors to avoid, recent events in my life, and a persona file that defines how the AI presents itself. One problem for LLMs is losing that internal train of thought and having to re-reason what its stance or goal was each time. Not so with this. The AI can jot down notes between messages and pick up where it left off.&lt;/p&gt;
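
&lt;p&gt;The mechanics of that can be as simple as a scratch file that gets injected into the prompt and rewritten after each turn. Something like this hypothetical sketch (the file name and helpers are made up):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical sketch of a persistent "notes to self" file for the assistant.
from pathlib import Path

NOTES = Path("assistant_notes.txt")

def build_system_prompt(persona: str) -&gt; str:
    notes = NOTES.read_text() if NOTES.exists() else "(no notes yet)"
    return persona + "\n\nYour notes from earlier in this conversation:\n" + notes

def save_notes(updated_notes: str) -&gt; None:
    NOTES.write_text(updated_notes)   # the model is asked to rewrite its notes each turn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;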

&lt;p&gt;Separating out the image processor also lets me use a different vision model from the main thinkers, but more importantly it lets me cache previous vision responses. Once I send an image, the LLM doesn't have to reprocess it but can still answer questions about it. That's super helpful, and something that I don't see a lot of front-ends doing; with most of them, the model loses the context of that image after just a few messages.&lt;/p&gt;
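
&lt;p&gt;A bare-bones version of that caching might look like the sketch below: hash the image, store the vision model's description, and let later turns reuse it as plain text. All the names here are made up for illustration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Bare-bones illustration of caching vision output so an image is only processed once.
import hashlib

_vision_cache = {}

def describe_image(image_bytes: bytes, vision_model) -&gt; str:
    key = hashlib.sha256(image_bytes).hexdigest()
    if key not in _vision_cache:
        _vision_cache[key] = vision_model(image_bytes)   # expensive call happens once
    return _vision_cache[key]   # later turns inject this text instead of the raw image
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;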

&lt;p&gt;All of this gives me the ability to have massive conversations, hundreds of messages long, while maintaining consistency in knowledge; all while barely sending 15-20k tokens to the LLM in any given message. Overall I process more tokens than if I just left it all to sliding cache, but in return I get an assistant that can continue answering questions during message 300 about something way back in the first 20 messages.&lt;/p&gt;

&lt;p&gt;The real advantage is that I can use smaller models for most of the heavy lifting. During my Linux setup, what I really wanted was the final response from GLM 5. That's the model walking me through everything. But parsing through memories, updating summaries, deciding whether to pull from Wikipedia, condensing old conversation chunks? That gets pawned off to weaker models, sometimes down to the 4-billion-parameter range. They finish in no time at all. Then when GLM 5 kicks off, it's been handed everything it could hope for in terms of context, and it only has to work with 20k tokens or less.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>If You Have the Hardware- Use it to Learn!</title>
      <dc:creator>SomeOddCodeGuy</dc:creator>
      <pubDate>Tue, 03 Mar 2026 03:51:40 +0000</pubDate>
      <link>https://dev.to/someoddcodeguy/if-you-have-the-hardware-use-it-to-learn-30j1</link>
      <guid>https://dev.to/someoddcodeguy/if-you-have-the-hardware-use-it-to-learn-30j1</guid>
      <description>&lt;p&gt;If you've never messed with open source LLMs and you jumped on the ClawdBot/OpenClaw hype train: take some time to learn more about how local models work. You likely went through the trouble of getting a Mac Mini, so you now have a nice little test box to play with. Just do it. Turn off Clawdbot/OpenClaw, and make OTHER things with it. Just for a few hours, even.&lt;/p&gt;

&lt;p&gt;For the vast majority of folks using AI to vibe code, make agents, and so on: right now they are the equivalent of people building websites using the heaviest no-code/low-code solutions, or just slapping in ALL the biggest libraries without a care in the world for performance. You're probably wasting a ton of efficiency in your current setups because you don't understand how a lot of it works under the hood. You don't understand samplers well, or what tokenization is doing. You may not have a good feel for what small and weak models can really do, or what you absolutely have to have large models for &lt;em&gt;(when I say small models, I'm talking models that make Claude Sonnet 3.7 look like a genius)&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Whatever efficiencies you're aiming for are probably a drop in the bucket compared to what you could be doing if you really had a feel for all that. And the only thing holding you back from that knowledge is just taking the time to learn it.&lt;/p&gt;

&lt;p&gt;The easiest way to learn this stuff is doing. You have the hardware now, so why not? Forget the little hype-bot that LinkedIn convinced you to install. Set it aside and use that Mac Mini to learn how LLMs work at a deeper level by trying to wrangle local models to do complex work. &lt;/p&gt;

&lt;p&gt;THAT will be worth its weight in gold.&lt;/p&gt;

&lt;p&gt;Also, don't cheat yourself. Yes, the local ecosystem is easier now. 10 minutes + an LM Studio install and tada: all done! But what did you really learn? No no; I'm saying to do it the long way around. Grab Open WebUI. Grab llama.cpp. Get 'em hooked up together. Use a little model like one of the new Qwen3.5 8b models. Get the responses to be actually good; try to find ways to make the model stop repeating itself. Things like that.&lt;/p&gt;
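
&lt;p&gt;If you go the llama.cpp server route, the front-end just talks to its OpenAI-compatible endpoint. A quick smoke test from Python might look something like this; the port and model are whatever you started the server with, so adjust for your setup:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Quick smoke test against a local llama.cpp server's OpenAI-compatible endpoint.
# Assumes something like: llama-server -m your-model.gguf --port 8080
import json, urllib.request

payload = {
    "messages": [{"role": "user", "content": "Say hello in five words."}],
    "temperature": 0.7,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;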

&lt;p&gt;Next: write a small agent. Do it with that crappy little 8b or less model, and try to get something of value out of it. &lt;/p&gt;
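
&lt;p&gt;Even a loop as small as this hypothetical sketch (with &lt;code&gt;chat()&lt;/code&gt; standing in for however you call your local model) will teach you a lot about where small models fall over:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical minimal agent loop: the model "uses a tool" by prefixing its reply.
def chat(messages) -&gt; str:
    raise NotImplementedError("wire this to your local server")

TOOLS = {"CALC": lambda expr: str(eval(expr, {"__builtins__": {}}, {}))}  # toy tool

def run_agent(task: str, max_steps: int = 5) -&gt; str:
    messages = [
        {"role": "system", "content": "Answer the task. To use a tool, reply "
         "'CALC: &lt;expression&gt;'. Reply 'DONE: &lt;answer&gt;' when finished."},
        {"role": "user", "content": task},
    ]
    for _ in range(max_steps):
        reply = chat(messages)
        messages.append({"role": "assistant", "content": reply})
        if reply.startswith("DONE:"):
            return reply[5:].strip()
        if reply.startswith("CALC:"):
            result = TOOLS["CALC"](reply[5:].strip())
            messages.append({"role": "user", "content": "CALC result: " + result})
    return "(gave up)"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;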

&lt;p&gt;This is all possible to do, but I promise it'll be harder than accomplishing the same thing with some 2026 proprietary API model. And that's the point.&lt;/p&gt;

&lt;p&gt;Once you've done all that, you'll later go back and revisit what you think right now is great work with LLMs, and suddenly have the same realization every developer does when they go back to their old code: &lt;em&gt;"Wow, I can do a lot better than this now."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Much like developers first learning to code, who think that just writing 500 "if statements" is good enough, you're only just now scratching the surface of how you should properly use LLMs. Now you need to start learning the more complex stuff. Don't settle for the novice approaches you've been using so far. There's SO MUCH MORE out there.&lt;/p&gt;

&lt;p&gt;And who knows- you may just find that local models are fun enough to be worth obsessing over a bit ;)&lt;/p&gt;

</description>
      <category>ai</category>
      <category>learning</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
