<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: SomeOddCodeGuy</title>
    <description>The latest articles on DEV Community by SomeOddCodeGuy (@someoddcodeguy).</description>
    <link>https://dev.to/someoddcodeguy</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3490530%2F9cdfc762-b1a2-45b2-b90c-252cf15f6fea.png</url>
      <title>DEV Community: SomeOddCodeGuy</title>
      <link>https://dev.to/someoddcodeguy</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/someoddcodeguy"/>
    <language>en</language>
    <item>
      <title>A Quick-ish Rundown of LLM Basics</title>
      <dc:creator>SomeOddCodeGuy</dc:creator>
      <pubDate>Sat, 25 Apr 2026 21:36:11 +0000</pubDate>
      <link>https://dev.to/someoddcodeguy/a-quick-ish-rundown-of-llm-basics-4n14</link>
      <guid>https://dev.to/someoddcodeguy/a-quick-ish-rundown-of-llm-basics-4n14</guid>
      <description>&lt;p&gt;Over the past few days, I've realized that there are a lot of folks out there using LLMs that haven't had an opportunity to dig, even a little, into the basics of how LLMs really work. And I guess that makes sense; for the most part, the average person doesn't have a lot of reason to know this. But if you're going to be a power user, there are things that would really help you to understand.&lt;/p&gt;

&lt;p&gt;Below are the most basic basics. I'm not covering everything, just the stuff that, once you get it, will help the rest start to make sense for you as well. Hopefully it helps someone out there.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tokens
&lt;/h3&gt;

&lt;p&gt;When you write something to an LLM, it doesn't break your text down character by character; it breaks it down into groups of characters called "Tokens". Every LLM has its own tokenizer, so not all choose the same tokens. &lt;/p&gt;

&lt;p&gt;Here's a real world example of what tokenization might look like using Qwen3.6 27b's tokenizer: &lt;a href="https://huggingface.co/Qwen/Qwen3.6-27B/blob/main/tokenizer.json" rel="noopener noreferrer"&gt;https://huggingface.co/Qwen/Qwen3.6-27B/blob/main/tokenizer.json&lt;/a&gt;. If you open that file, you'll see the full list of tokens that Qwen3.6 27b utilizes.&lt;/p&gt;

&lt;p&gt;As for how tokens work... here's an example:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"This is a token"&lt;br&gt;
    - That's 15 characters&lt;/p&gt;

&lt;p&gt;'This' 'Ġis' 'Ġa' 'Ġtoken'&lt;br&gt;
    - That's 4 tokens. You'll notice 'Ġ' at the start of the last three; that's what &lt;br&gt;
GPT-2/GPT-3/GPT-4 use to mark a leading space in tokenization&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;These map to numbers, which the LLM then uses to do matrix math to determine the right output. If we go back to the link I gave you above, you can see the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;This   == 1919
Ġis    == 369
Ġa     == 264
Ġtoken == 3817
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So Qwen3.6 27b would see your sentence as (1919, 369, 264, 3817). It then does matrix math and other cool pattern-y stuff to determine the best tokens to respond to you with.&lt;/p&gt;

&lt;p&gt;So remember this when you hear that an LLM has a context window of 1,000,000 tokens: it's talking about those things. Sometimes whole words are tokens, sometimes not. Don't just assume every word is a token; they try to create tokens off the most commonly used words. &lt;em&gt;This&lt;/em&gt;, &lt;em&gt;is&lt;/em&gt;, &lt;em&gt;a&lt;/em&gt; are all very common in the English language. &lt;em&gt;Token&lt;/em&gt; is very common when talking about LLMs.&lt;/p&gt;
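
&lt;p&gt;If you want to poke at this yourself, here's a minimal sketch using the Hugging Face &lt;code&gt;transformers&lt;/code&gt; library. The repo name is just the one linked above, and the IDs in the comments are illustrative; every tokenizer will give you different ones:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Minimal tokenization sketch using the Hugging Face "transformers" library.
# The repo name is the one linked above; the exact IDs depend on the model.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.6-27B")

text = "This is a token"
ids = tokenizer.encode(text)                   # e.g. [1919, 369, 264, 3817]
pieces = tokenizer.convert_ids_to_tokens(ids)  # e.g. ['This', 'Ġis', 'Ġa', 'Ġtoken']

print(f"{len(text)} characters became {len(ids)} tokens")
for piece, token_id in zip(pieces, ids):
    print(piece, token_id)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;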

&lt;h3&gt;
  
  
  Context Windows
&lt;/h3&gt;

&lt;p&gt;The way I usually describe context windows is to imagine the full Song of Ice and Fire book series printed out on one really long parchment, and you have a piece of cardboard with a window cut in it that you can read text through. All you know is whatever's currently in that window. If someone asks you about something outside the window? Tough luck, you don't know it.&lt;/p&gt;

&lt;p&gt;Now, the obvious thought is "well just make the window bigger". The problem is that if you cut the window too big, you have a harder time finding any specific thing in there, and you start mixing details up. You've learned how to read a certain amount within that window, and pushing past that doesn't go great. If the full book was the length of a parking lot, and someone asked you for details that could exist anywhere in that whole parking lot worth of text... well, good luck.&lt;/p&gt;

&lt;p&gt;That's pretty much how it works with LLMs. You'll see models advertise huge context windows like 1,000,000 tokens, but the real-world practical use of that is a lot smaller than the marketing implies. The bigger you stuff that window, the worse the model gets at pinpointing specific information inside it. There's a whole pile of benchmarks (needle in a haystack tests, NoLiMa, RULER, etc.) showing accuracy dropping as the context fills up. So a 200k token context window is not an invitation to dump 200k tokens in there and expect great results. You'll generally get a much better answer giving the model 8k of really relevant tokens than 200k of "everything I have on the topic".&lt;/p&gt;

&lt;p&gt;To get a better visualization, check this benchmark out: &lt;a href="https://fiction.live/stories/Fiction-liveBench-Feb-21-2025/oQdzQvKHw8JyXbN87" rel="noopener noreferrer"&gt;https://fiction.live/stories/Fiction-liveBench-Feb-21-2025/oQdzQvKHw8JyXbN87&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Scroll down to the results section and you'll see a table- the numbers in there represent how well the model pulls the right info out based on the context size it was fed. You can see that some models, like GPT-5.2 or Opus 4.6, did great all the way up to 120k (except 5.2 pro for some reason...). But look at something like minimax 2.5, for example: by the time you hit 60k tokens, you have less than a 50% chance to get all the right info you asked for.&lt;/p&gt;

&lt;p&gt;This is a struggle a lot of us running local models deal with, and it usually means you want to account for that with a lot of great wrapper software or middleware.&lt;/p&gt;

&lt;h3&gt;
  
  
  Model Sizes (ie- parameters)
&lt;/h3&gt;

&lt;p&gt;When we talk about models, we size them based on the number of parameters they have. 1M is a 1 Million parameter model. That's itty bitty. 1b is 1 billion parameters- also itty bitty. Many modern models release in really huge sizes like 397b to 1T (1 Trillion parameters).&lt;/p&gt;

&lt;p&gt;The easiest way to imagine parameters is as data points that can correspond to several pieces of data at once. So 1 parameter doesn't necessarily equate to a single fact like "When did the first Ford car release?"; it could correspond to several other pieces of info at the same time.&lt;/p&gt;

&lt;p&gt;Models are generally created in BF16 format to start with. Size wise- BF16 equates to about 2GB per 1b; so a 20b model would be 40GB. If you "quantize" the model (the easiest way to think of it is 'compressing' the model) to 8bpw, or ~q8_0, that becomes 1GB per 1b. If you go further to 4bpw, or ~q4_0, you get down to 0.5GB per 1b. That's how we fit big models on smaller hardware.&lt;/p&gt;
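
&lt;p&gt;To put rough numbers on that math (this is back-of-the-napkin sizing only; real model files add some overhead on top):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Back-of-the-napkin model sizing using the rough ratios above:
# ~2GB per 1b params at BF16, ~1GB at ~q8_0, ~0.5GB at ~q4_0.
# Real files carry extra overhead, so treat these as ballpark numbers.
GB_PER_BILLION = {"bf16": 2.0, "q8_0": 1.0, "q4_0": 0.5}

def rough_size_gb(params_in_billions, quant):
    return params_in_billions * GB_PER_BILLION[quant]

for quant in ("bf16", "q8_0", "q4_0"):
    print(f"20b at {quant}: ~{rough_size_gb(20, quant):.0f}GB")
# 20b at bf16: ~40GB, at q8_0: ~20GB, at q4_0: ~10GB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;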

&lt;p&gt;As you can imagine, the more you quantize, the more mistakes the model will likely make.&lt;/p&gt;

&lt;h3&gt;
  
  
  Open Weight Models
&lt;/h3&gt;

&lt;p&gt;These are models that you can download and run yourself. There are a few ways to do it, and here are some examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Raw transformers&lt;/strong&gt; - this is the original format of the models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GGUF&lt;/strong&gt; - This is a model that has been converted to run in llama.cpp&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MLX&lt;/strong&gt; - This is converted to run in Apple's MLX&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Many applications, like Ollama or LM Studio, wrap some of these and then have their own repositories to pull models from. For best speed and the fastest updates for model support, you generally want to avoid that. You can find all models here: &lt;a href="https://huggingface.co" rel="noopener noreferrer"&gt;https://huggingface.co&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mixture of Experts (ie- MoE)
&lt;/h3&gt;

&lt;p&gt;This section is only really relevant to Open Weight models, so you can skip this if you never plan to host your own.&lt;/p&gt;

&lt;p&gt;Parameter count doesn't just affect knowledge, it also affects speed. The bigger the model, the more matrix math the computer has to do per token. So a 70b model running at the same quantization on the same hardware as a 7b is going to be a whole lot slower; you're doing roughly 10x the math per token. That's also why video cards handle LLMs better than CPUs: it's a lot of floating point math, and GPUs eat that up. Which means when you're trying to figure out if you can fit a model on your machine, the real question is how much you can fit into VRAM.&lt;/p&gt;

&lt;p&gt;Up until a year or two ago, pretty much every model you used was what we call a "dense" model. Dense means every single parameter in the model gets activated for every token it produces. A 70b dense model is doing 70b worth of math, every single token.&lt;/p&gt;

&lt;p&gt;Then Mixture of Experts (MoE) models started taking off. You'll see them named like Qwen3.5-397b-a17b, or Qwen3.6-35b-a3b. The "a" in the first one stands for "active parameters". The way MoE works is the model is split up into a bunch of smaller "experts", and for each token, a "router" picks just a few of those experts to use. So Qwen3.5-397b-a17b has 397 billion total parameters, but only 17 billion get used for any given token.&lt;/p&gt;
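&lt;p&gt;Here's a toy sketch of that routing idea. This isn't any particular model's implementation, and real models do this inside every MoE layer rather than once overall, but it shows the shape of it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

# Toy sketch of MoE routing: a "router" scores every expert for the current
# token, and only the top-k highest scoring experts actually run.
# All numbers here are made up for illustration.
num_experts, top_k, hidden_size = 64, 4, 1024

hidden_state = np.random.randn(hidden_size)            # the current token
router_weights = np.random.randn(num_experts, hidden_size)

scores = router_weights @ hidden_state                 # one score per expert
chosen = np.argsort(scores)[-top_k:]                   # keep only the top-k

print(f"{top_k} of {num_experts} experts run for this token: {sorted(chosen.tolist())}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
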

&lt;p&gt;What this means in practice: an MoE model runs at roughly the speed of its active parameter count, not its total. So Qwen3.5-397b-a17b runs only a little slower than the speed of a 17b dense model, even though it has 397b worth of parameters. &lt;/p&gt;

&lt;p&gt;That's a big deal for performance, especially on local hardware. It really made those of us who invested in Macs early very happy. I almost, ALMOST, started to regret my first Mac Studio back in 2023... then not long after Mixtral 8x7B came out and that changed everything. It's only gotten better since.&lt;/p&gt;

&lt;p&gt;The catch with MoEs is really on the knowledge side. An MoE with 397b total isn't as smart as a dense 397b model would be; the smarts land somewhere in between the active count and the total count. Where exactly is debated and varies by model, but the rule of thumb is to expect noticeably better than a dense model at the active size, and nowhere near a dense model at the total size. So Qwen3.6-35b-a3b isn't going to behave like a 35b dense; it'll feel like something north of a 3b but well short of a 35b.&lt;/p&gt;

&lt;p&gt;The other catch, and this one matters a lot if you're running locally, is that even though MoE only uses a fraction of params per token, you still have to load ALL the params into memory. That 397b model still needs somewhere around 200GB at q4 to run, even though only 17b worth is doing math at any given moment. Llama.cpp does have a clever way to offload the inactive expert layers to system RAM so you can run these things on regular gaming hardware, but that's a deeper topic. I have a &lt;a href="https://www.someoddcodeguy.dev/understanding-moe-offloading/" rel="noopener noreferrer"&gt;whole writeup on MoE offloading&lt;/a&gt; if you want to go down that rabbit hole.&lt;/p&gt;

&lt;h3&gt;
  
  
  Training
&lt;/h3&gt;

&lt;p&gt;LLMs learn by being "trained". It's a complex process that, at the absolute highest level, involves the LLM seeing billions upon billions of tokens of information and learning patterns from it. "When I see someone say this, it usually involves someone responding with that" kind of thing. This is why people constantly harp about good data in training being the most important thing- if you have really clean examples of speech, knowledge, etc, it is easier for the LLM to find the right patterns.&lt;/p&gt;

&lt;p&gt;Eventually, more powerful LLMs start to infer new patterns that they haven't seen before. Remember the old math problems like &lt;code&gt;if A == B and B == C, then A == C&lt;/code&gt;? Imagine that on a MASSIVE scale, where it creates connections between information many many many many layers deep to get from A to Z.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Training a commercially viable model takes ungodly amounts of money and data, and you need really smart people to do it. Companies spend millions to billions of dollars making some of the most powerful models.&lt;/li&gt;
&lt;li&gt;Training data is hard to come by. If you've heard about how some companies scraped the internet for data? That's why. They are looking for examples of speech, knowledge, etc. When an LLM wants to train on your data, it is less that the company wants to include your personal PII in the model (they generally don't; they don't want that bad publicity if someone makes the model spit it out) and more that they want nice clean interactions to give to the LLM to look at and learn more patterns.&lt;/li&gt;
&lt;li&gt;This is also why AI companies are mad at each other for "distilling" their products. Distilling is the act of interacting with an LLM over and over again to get examples of the LLM's speaking or thinking process, then creating training data to teach another LLM to act or reason that same way. An example of this from recently was that DeepSeek, Moonshot AI, and MiniMax got accused of doing this by Anthropic. The accusation was that they were using thousands of fraudulent accounts to interact with Claude millions of times, then using those interactions to teach their own models to think and speak similarly.&lt;/li&gt;
&lt;li&gt;It's possible to train little fun models pretty cheaply. One guy recently trained a small model from scratch on 1800s text, with nothing at all modern in it. This little model has no concept of anything past the industrial age. &lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Finetuning / Post-Training
&lt;/h3&gt;

&lt;p&gt;When you hear a non-tech company say they are "training a model", they most likely mean finetuning or post-training an open weight model.&lt;/p&gt;

&lt;p&gt;Imagine an LLM as a big calculator for matrix math. Numbers go in, one number comes out. Do that over and over and you get a response. The neat thing about matrix math is something called rank factorization- the idea that you can represent a matrix &lt;code&gt;m*n&lt;/code&gt; with rank &lt;code&gt;r&lt;/code&gt; by using smaller matrices &lt;code&gt;m*r&lt;/code&gt; and &lt;code&gt;r*n&lt;/code&gt;. Some super smart folks figured out that this allowed us to have LoRAs, which you can think of like add-on components to LLMs that modify the weight distribution.&lt;/p&gt;

&lt;p&gt;In other words- rather than retraining the entire model to try to add more information, you train an itty bitty version of that model with the info you want, and then you can load the original model + LoRA at the same time to get a post-trained model.&lt;/p&gt;
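
&lt;p&gt;A toy numpy sketch of that idea, with made-up sizes (real LoRA training involves a lot more than this, but the shape of the math is the point):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

# Toy sketch of the LoRA idea: instead of retraining the full m x n weight
# matrix, train two small matrices A (m x r) and B (r x n) and add their
# product on top of the frozen original weights. Sizes here are made up.
m, n, r = 4096, 4096, 16

W = np.random.randn(m, n)         # frozen original weights (not trained)
A = np.random.randn(m, r) * 0.01  # the trained LoRA pieces (tiny by comparison)
B = np.random.randn(r, n) * 0.01

W_adapted = W + A @ B             # what the post-trained model effectively uses

print("full matrix params:", W.size)           # 16,777,216
print("LoRA params:       ", A.size + B.size)  # 131,072
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;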

&lt;p&gt;Truthfully- I am pretty staunchly in the camp that you can't reliably train new knowledge into a model this way. That's a very common but &lt;strong&gt;not&lt;/strong&gt; a universal view within the deeper LLM tinkering community; some companies have made post-training their bread and butter. I do believe that you CAN train styles, tones, etc really well into it (&lt;em&gt;for example: training a model to handle documentation a certain way, or think a certain way&lt;/em&gt;), but ultimately I've yet to see a good example of a post-trained model outside of basic Instruct models from the same manufacturer that has actually been worth the effort. Maybe there are some out there, but I'm not familiar with them.&lt;/p&gt;

&lt;p&gt;Anyhow, long story short- you CAN post-train a small model for $100 or less, but I wouldn't even recommend it unless you really understand what you want to get out of it and why. There's very little a post-trained model can do that you can't do with a good workflow, prompt and data to RAG against.&lt;/p&gt;

&lt;h3&gt;
  
  
  How LLMs Respond
&lt;/h3&gt;

&lt;p&gt;When you boil it down, LLMs work in a really simple loop. You give it a chunk of tokens. It processes them and spits out one new token. Then it takes all your original tokens plus that one new token it just spit out, and processes the whole thing again, and spits out the next token. Then it takes all your tokens plus the two new tokens, processes again, spits out the next. On and on, one token at a time, until it decides it is done and sends a stop token. You now have your response.&lt;/p&gt;

&lt;p&gt;To simplify it- LLMs don't think about the response all at once- they think 1 token at a time. Over and over and over until they are done. That's it.&lt;/p&gt;
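
&lt;p&gt;In code, that loop looks something like this. &lt;code&gt;pick_next_token&lt;/code&gt; is a stand-in for all the matrix math; the point is that everything generated so far gets fed back in to pick the next token:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch of the loop described above. "model.pick_next_token" is a stand-in
# for the real machinery; what matters is the shape of the loop.
def generate(model, prompt_tokens, stop_token, max_new_tokens=512):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = model.pick_next_token(tokens)  # one new token per pass
        if next_token == stop_token:
            break                                   # the model decided it's done
        tokens.append(next_token)                   # gets fed back in next pass
    return tokens[len(prompt_tokens):]              # just the response tokens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;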

&lt;p&gt;This is also why "reasoning" works. If you ask a model to just answer a hard math problem cold, it can fumble it, because by the time it gets to the answer it's already locked into early tokens it picked. But if you tell it to think out loud first- write out the problem, work through it step by step- then while it's writing all that, it's still just predicting one token at a time, except now each new token gets to "see" all the work it just laid out. If it makes a mistake at step 2, it can sometimes catch it at step 4 and shift the line of thinking before it commits to a final answer.&lt;/p&gt;

&lt;p&gt;If you ever watch an LLM think, and it constantly goes "But wait...", that's because it was trained to do that in order to stop itself from locking in. It says its response, then it challenges the response, and in doing so gives itself a chance to realize the response was wrong.&lt;/p&gt;

&lt;p&gt;That's basically what chain of thought and reasoning models are. The model writing out its work so it has more to reference when generating each next token. It's not magic, it's just giving the model more useful context to predict from. The flip side is that more reasoning means more tokens, which means more time and more cost. And some models, like Qwen3.5/3.6 and Gemma 4, overthink badly. With those, you want to use a workflow app to manually apply CoT, if you can. Since I use Wilmer everywhere, I have workflows specifically to use Qwen/Gemma with thinking disabled, and then have a manual CoT step. That helps with overthinking massively.&lt;/p&gt;

&lt;h3&gt;
  
  
  RAG - Retrieval Augmented Generation
&lt;/h3&gt;

&lt;p&gt;This is a $5 term for a $0.05 concept. When we talk about RAG, it boils down to a very simple concept: give the LLM the answer before it responds. Everything else, when talking about RAG, is talking about a design pattern.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Simplest example:&lt;/strong&gt; The simplest form of RAG would be copying the text of an article or tutorial, putting it in your prompt, and asking the LLM to answer a question about that. The LLM will use the article to answer you.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Next level of simplicity:&lt;/strong&gt; You might ask an LLM a question, the LLM uses a tool (web search, local wiki search, whatever) to pull the article, concatenates it into your prompt, and answers your question.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What a lot of folks think of when they think of RAG:&lt;/strong&gt; You have a program that takes thousands, or even millions, of documents and turns them into "embeddings"- ie breaks the document into logical chunks and stores them somewhere easy to retrieve from, such as a vector database. Then, when you ask a question, it does some fancy stuff in the background to find the right chunks and answer your question with them. Since putting 1,000,000 files into your context all at once is impossible, this is how you go about the oft-advertised "chat with your documents" situation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But all together, RAG comes down to a very simple concept: give the LLM the answer before it responds. That's it. LLMs are very, very strong at this, and it's a great way to avoid hallucinations.&lt;/p&gt;
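
&lt;p&gt;As a sketch, the whole concept fits in one function. &lt;code&gt;search_my_documents&lt;/code&gt; and &lt;code&gt;ask_llm&lt;/code&gt; are placeholders for whatever retrieval and LLM call you actually use (vector DB lookup, web search, llama.cpp, etc):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# The whole RAG idea in one function: find the relevant text first, then put
# it in the prompt so the model answers from it instead of from memory.
# "search_my_documents" and "ask_llm" are placeholders for your own retrieval
# and LLM call.
def answer_with_rag(question, search_my_documents, ask_llm):
    chunks = search_my_documents(question)   # retrieval: grab the relevant text
    context = "\n\n".join(chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return ask_llm(prompt)                   # augmented generation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;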

&lt;p&gt;For the most part, RAG solutions are not an LLM problem, they're a software problem. If you're struggling with RAG, you probably need to revisit HOW you're feeding the data to your LLM and whether you're giving it too much unnecessary stuff along with the right stuff.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hallucinations
&lt;/h3&gt;

&lt;p&gt;A hallucination is when the LLM responds with something that's flat wrong. The reason it happens comes back to that loop in the &lt;code&gt;How LLMs Respond&lt;/code&gt; section: an LLM doesn't actually know anything. It's a pattern matcher predicting the most likely next token from what came before, using the patterns it learned in training to determine "when I see X, I usually see a response of Y". If the most likely next token happens to be the wrong one, well, that's what you get. This especially happens with information that there isn't a lot of great data out there for, so the LLM had to infer the relationships. Asking a detailed question about Excel means it has millions of example questions, articles, documents, etc from the internet to have learned from; asking a question about FIS' Relius Administration has far, far fewer examples, so it likely inferred a lot of things based on other patterns, and it will hallucinate like mad.&lt;/p&gt;

&lt;p&gt;LLMs, as a technology, don't have a built-in "I'm not sure about this" lever they can pull. They just generate whatever the patterns say to generate, and confidence isn't really part of the equation. The answer you got is 'right' from the perspective that it generated the most likely pattern. Whether that pattern is of any use to you has nothing to do with the LLM lol.&lt;/p&gt;

&lt;p&gt;The most common reasons you see hallucinations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The training data was wrong, so the pattern the model learned is wrong.&lt;/li&gt;
&lt;li&gt;The training data didn't cover the topic well, so the model is filling in gaps with whatever sounds plausible.&lt;/li&gt;
&lt;li&gt;You asked something outside what the model was really trained for, and it tries to answer anyway because that's what it was trained to do- give an answer.&lt;/li&gt;
&lt;li&gt;Your context window is huge or messy, and the model is losing track of what's actually relevant in there.&lt;/li&gt;
&lt;li&gt;The model is over-quantized and just making more mistakes generally (going back to that earlier section).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Reasoning models hallucinate a bit less on certain types of problems because they get a chance to second-guess themselves while writing things out, but they absolutely still hallucinate. The single best mitigation is to put the answer in the context for it, which is RAG.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using That Info
&lt;/h3&gt;

&lt;p&gt;Knowing all this should hopefully help you start to narrow down why some of the "pro tips" of using LLMs exist. When you want a factual answer, you don't just ask the LLM. Right or wrong, you're getting a confident response. Instead, make sure you are injecting the right answer in before it responds- this often means tool use such as web search or, even better, "Deep Research" features you find on commercial LLMs.&lt;/p&gt;

&lt;p&gt;This also hopefully will help you see why jamming ALL of your codebase into the LLM is a mistake, and why constantly asking "What model has a bigger context window?" is the wrong question. It's lazy to just look for bigger context windows, and that laziness will bite you. Instead, focus on how you can break the data apart so that the LLM can work in the confines of what it handles best. That means writing or downloading some supporting software.&lt;/p&gt;

&lt;p&gt;Anyhow, good luck folks. Hope this helps the like 4 people that might read this far.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Qwen3.6, and WilmerAI OpenCode workflows</title>
      <dc:creator>SomeOddCodeGuy</dc:creator>
      <pubDate>Mon, 20 Apr 2026 03:10:20 +0000</pubDate>
      <link>https://dev.to/someoddcodeguy/qwen36-and-wilmerai-opencode-workflows-oj7</link>
      <guid>https://dev.to/someoddcodeguy/qwen36-and-wilmerai-opencode-workflows-oj7</guid>
      <description>&lt;p&gt;Just a random note, but Qwen3.6 35b a3b is putting a smile on my face. This little model feels like a big upgrade over 3.5's 27b or 35b a3b.&lt;/p&gt;

&lt;p&gt;Also- the Wilmer workflow for OpenCode is really going well. I need to test it more, because I had to do a big refactor on it, but so far between that and Qwen3.6, the level of quality I'm seeing from OpenCode now feels &lt;strong&gt;reliable&lt;/strong&gt;. I won't over-exaggerate the situation by making any claims about it feeling similar in quality to X or Y proprietary cloud models; instead I'll say that up until now, I had not felt like a local model that ran at any kind of a decent speed was particularly reliable for power-user level agentic coding. This model + jamming my Wilmer workflow between MLX and OpenCode has now changed that. I have more work to do, a lot more testing to do, but I'm feeling really good about this right now.&lt;/p&gt;

&lt;p&gt;And on a side note: the M5 Max with MLX is absolutely destroying my M3 Ultra in terms of speeds when running Qwen3.6 35b. I currently have that model running at bf16 on the M5 Max, and I'm watching it process prompts at insane (for Mac) speeds.&lt;/p&gt;

&lt;p&gt;M5 Max 128GB Macbook Pro MLX Qwen3.6 35b a3b bf16 - 4k tokens&lt;br&gt;
Total Time: ~1.1 seconds&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2026-04-19 22:56:00,920 - INFO - Prompt processing progress: 322/4010
2026-04-19 22:56:01,475 - INFO - Prompt processing progress: 2370/4010
2026-04-19 22:56:01,972 - INFO - Prompt processing progress: 4006/4010
2026-04-19 22:56:02,004 - INFO - Prompt processing progress: 4009/4010
2026-04-19 22:56:02,029 - INFO - Prompt processing progress: 4010/4010
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;M5 Max 128GB Macbook Pro MLX Qwen3.6 35b a3b bf16 - 32k tokens&lt;br&gt;
Total time: ~11 seconds&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2026-04-19 22:56:18,074 - INFO - Prompt processing progress: 2048/32137
2026-04-19 22:56:18,652 - INFO - Prompt processing progress: 4096/32137
2026-04-19 22:56:19,259 - INFO - Prompt processing progress: 6144/32137
2026-04-19 22:56:19,896 - INFO - Prompt processing progress: 8192/32137
2026-04-19 22:56:20,561 - INFO - Prompt processing progress: 10240/32137
2026-04-19 22:56:21,249 - INFO - Prompt processing progress: 12288/32137
2026-04-19 22:56:21,971 - INFO - Prompt processing progress: 14336/32137
2026-04-19 22:56:22,714 - INFO - Prompt processing progress: 16384/32137
2026-04-19 22:56:23,485 - INFO - Prompt processing progress: 18432/32137
2026-04-19 22:56:24,288 - INFO - Prompt processing progress: 20480/32137
2026-04-19 22:56:25,122 - INFO - Prompt processing progress: 22528/32137
2026-04-19 22:56:25,989 - INFO - Prompt processing progress: 24576/32137
2026-04-19 22:56:26,879 - INFO - Prompt processing progress: 26624/32137
2026-04-19 22:56:27,800 - INFO - Prompt processing progress: 28672/32137
2026-04-19 22:56:28,761 - INFO - Prompt processing progress: 30720/32137
2026-04-19 22:56:29,542 - INFO - Prompt processing progress: 32136/32137
2026-04-19 22:56:29,581 - INFO - Prompt processing progress: 32137/32137
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Anyhow, I have a very busy week coming up, so I'm unlikely to post much for a little bit, but I will be testing this workflow up a storm and really putting this little Qwen through its paces.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>coding</category>
      <category>llm</category>
    </item>
    <item>
      <title>Wilmer Tool Calling</title>
      <dc:creator>SomeOddCodeGuy</dc:creator>
      <pubDate>Mon, 13 Apr 2026 03:53:26 +0000</pubDate>
      <link>https://dev.to/someoddcodeguy/wilmer-tool-calling-492g</link>
      <guid>https://dev.to/someoddcodeguy/wilmer-tool-calling-492g</guid>
      <description>&lt;p&gt;So some year and a half after the request was made for me to put tool calling into Wilmer, I've finally got it in there.&lt;/p&gt;

&lt;p&gt;First off- it was a huge pain to implement; if I didn't have Wilmer itself and agentic coders to help, I'm not sure I'd have done it. The way streaming works with tool calling is a bit odd, too, so that was interesting to navigate. Really, this was something I couldn't have pulled off without the earlier workflow engine refactor for the Execution Context.&lt;/p&gt;

&lt;p&gt;The idea is straightforward: Wilmer sits in between the frontend and the LLM, so it just needs to pass tool definitions from the frontend through to the model, and pass tool call responses from the model back to the frontend. Wilmer itself doesn't need to understand or execute the tools. The tricky part was that Wilmer has a whole pipeline of nodes doing different things (&lt;em&gt;memory lookups, categorization, summarization, context gathering&lt;/em&gt;) and you really don't want tool calls accidentally hitting nodes that are just doing internal processing. So I had to put per-node controls in place. Only the nodes you explicitly flag will pass tools through; the rest strip them out and do their job. The one exception is internal nodes using chat_user_prompt_*, which still get just the tool call outputs pulled out for them. &lt;/p&gt;
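
&lt;p&gt;Conceptually, the per-node control is just deciding whether the OpenAI-style &lt;code&gt;tools&lt;/code&gt; field gets forwarded. This isn't Wilmer's actual code, just the shape of the idea:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Not Wilmer's actual code; just the shape of the idea. An OpenAI-style
# request carries "tools" / "tool_choice" fields, and each node either
# forwards them untouched or strips them before calling its backend model.
def build_backend_request(frontend_request, node_allows_tools):
    backend_request = dict(frontend_request)    # shallow copy of the payload
    if not node_allows_tools:
        backend_request.pop("tools", None)      # internal nodes never see tools
        backend_request.pop("tool_choice", None)
    return backend_request
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;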

&lt;p&gt;Format conversion between OpenAI, Claude, and Ollama backends was also a headache since they all handle tool calling differently, and streaming tool calls needed their own handling to keep the structured data from getting mangled by the normal text processing pipeline.&lt;/p&gt;

&lt;p&gt;But the reason I finally sat down and did this is that I've been using OpenCode more lately. Up until summer of last year I had pretty much written off agentic coding, but once Claude Code got good I found myself sucked in like everyone else. Even though I'm usually a very local-first oriented guy, I've just stuck with it ever since because the quality is so great.&lt;/p&gt;

&lt;p&gt;A month or so ago I started dabbling in OpenCode, to have something for when the net goes out, and I have to say that Qwen3.5 27b combined with it is pretty nice... but nowhere near the quality of Claude (&lt;em&gt;obviously&lt;/em&gt;). My goal hasn't changed since 2023: trying to find ways to improve the quality of local tools to that of proprietary, even if it means sacrificing speed for quality. So as with all things, after trying OpenCode for a while, my answer is: shove Wilmer into the flow.&lt;/p&gt;

&lt;p&gt;Now that tool calling works end to end, I can do just that. The OpenCode calls pass through Wilmer, hit my workflows, and the tool calls get forwarded through to one of N number of models in llama.cpp and back without Wilmer needing to know anything about what the tools actually do. It slows everything down a lot, but the result is far less engagement from me because it gets things right in far fewer tries. Especially doing things like the earlier Qwen improvements of manually applying CoT.&lt;/p&gt;

&lt;p&gt;I've had really great luck with getting Qwen3.5 122b to give a lot better results than stock like this, but Qwen3.5 27b has been a bit harder to wrangle. Getting it to play nice with my decision trees is fairly challenging so far.&lt;/p&gt;

&lt;p&gt;I'm going to tinker with these OpenCode workflows for a month or so and then start putting them out for folks. Updating the example workflows in the repo is next on the list.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>showdev</category>
    </item>
    <item>
      <title>A Quick Note on Gemma 4 Image Settings in Llama.cpp</title>
      <dc:creator>SomeOddCodeGuy</dc:creator>
      <pubDate>Fri, 03 Apr 2026 01:50:48 +0000</pubDate>
      <link>https://dev.to/someoddcodeguy/a-quick-note-on-gemma-4-image-settings-in-llamacpp-39ng</link>
      <guid>https://dev.to/someoddcodeguy/a-quick-note-on-gemma-4-image-settings-in-llamacpp-39ng</guid>
      <description>&lt;p&gt;In my last post, I mentioned &lt;a href="https://www.someoddcodeguy.dev/a-few-tips-for-ocr-with-qwen3-5-through-llama-cpp/" rel="noopener noreferrer"&gt;using --image-min-tokens to increase the quality of image responses from Qwen3.5&lt;/a&gt;. I went to load Gemma 4 the same way, and hit an error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;[58175] srv  process_chun: processing image...
[58175] encoding image slice...
[58175] image slice encoded in 7490 ms
[58175] decoding image batch 1/2, n_tokens_batch = 2048
&lt;/span&gt;&lt;span class="gp"&gt;[58175] /Users/socg/llama.cpp-b8639/src/llama-context.cpp:1597: GGML_ASSERT((cparams.causal_attn || cparams.n_ubatch &amp;gt;&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; n_tokens_all&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="s2"&gt;"non-causal attention requires n_ubatch &amp;gt;= n_tokens"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; failed
&lt;span class="go"&gt;[58175] WARNING: Using native backtrace. Set GGML_BACKTRACE_LLDB for more info.
[58175] WARNING: GGML_BACKTRACE_LLDB may cause native MacOS Terminal.app to crash.
[58175] See: https://github.com/ggml-org/llama.cpp/pull/17869
[58175] 0   libggml-base.0.9.11.dylib           0x0000000103a6136c ggml_print_backtrace + 276
[58175] 1   libggml-base.0.9.11.dylib           0x0000000103a61558 ggml_abort + 156
[58175] 2   libllama.0.0.0.dylib                0x0000000103eacd70 _ZN13llama_context6decodeERK11llama_batch + 5484
[58175] 3   libllama.0.0.0.dylib                0x0000000103eb098c llama_decode + 20
[58175] 4   libmtmd.0.0.0.dylib                 0x0000000103b8f7e8 mtmd_helper_decode_image_chunk + 948
[58175] 5   libmtmd.0.0.0.dylib                 0x0000000103b8fea4 mtmd_helper_eval_chunk_single + 536
[58175] 6   llama-server                        0x0000000102fb4d94 _ZNK13server_tokens13process_chunkEP13llama_contextP12mtmd_contextmiiRm + 256
[58175] 7   llama-server                        0x0000000102fe3318 _ZN19server_context_impl12update_slotsEv + 8396
[58175] 8   llama-server                        0x0000000102faaca0 _ZN12server_queue10start_loopEx + 504
[58175] 9   llama-server                        0x0000000102f3a610 main + 14376
[58175] 10  dyld                                0x00000001968edd54 start + 7184
srv    operator(): http client error: Failed to read connection
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 500
srv    operator(): instance name=gemma-4-31B-it-UD-Q8_K_XL exited with status 1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, the crash is caused by the fact that I'm not setting ubatch.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;58175&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;Users&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;socg&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;llama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cpp&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;b8639&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;llama&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cpp&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1597&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;GGML_ASSERT&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;cparams&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;causal_attn&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;cparams&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n_ubatch&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;n_tokens_all&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="s"&gt;"non-causal attention requires n_ubatch &amp;gt;= n_tokens"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;failed&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The reason is that Gemma 4's vision encoder uses non-causal attention for image tokens, which means all the image tokens have to fit within a single ubatch; since I specified a minimum of 2048 image tokens, that's a problem, because ubatch defaults to 512.&lt;/p&gt;

&lt;p&gt;First, we need to make sure the model actually supports going that high. &lt;a href="https://unsloth.ai/docs/models/gemma-4" rel="noopener noreferrer"&gt;If we peek over at Unsloth's page, we'll see that's not the case&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Gemma 4 supports multiple visual token budgets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;70&lt;/li&gt;
&lt;li&gt;140&lt;/li&gt;
&lt;li&gt;280&lt;/li&gt;
&lt;li&gt;560&lt;/li&gt;
&lt;li&gt;1120&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use them like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;70 / 140: classification, captioning, fast video understanding&lt;/li&gt;
&lt;li&gt;280 / 560: general multimodal chat, charts, screens, UI reasoning&lt;/li&gt;
&lt;li&gt;1120: OCR, document parsing, handwriting, small text&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;So our max is actually 1120 here. For my case, I'm going to want to set the --image-min-tokens and --image-max-tokens both to 1120, and then I'll buffer up the batch and ubatch to 2048.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./llama-server &lt;span class="nt"&gt;-ngl&lt;/span&gt; 200 &lt;span class="nt"&gt;--ctx-size&lt;/span&gt; 65535 &lt;span class="nt"&gt;--models-dir&lt;/span&gt; /Users/socg/models &lt;span class="nt"&gt;--models-max&lt;/span&gt; 1 &lt;span class="nt"&gt;--port&lt;/span&gt; 5001 &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--jinja&lt;/span&gt; &lt;span class="nt"&gt;--image-min-tokens&lt;/span&gt; 1120 &lt;span class="nt"&gt;--image-max-tokens&lt;/span&gt; 1120 &lt;span class="nt"&gt;--ubatch-size&lt;/span&gt; 2048 &lt;span class="nt"&gt;--batch-size&lt;/span&gt; 2048
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>ai</category>
      <category>google</category>
      <category>llm</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>A Few Tips for OCR With Qwen3.5 through Llama.cpp</title>
      <dc:creator>SomeOddCodeGuy</dc:creator>
      <pubDate>Tue, 31 Mar 2026 02:27:00 +0000</pubDate>
      <link>https://dev.to/someoddcodeguy/a-few-tips-for-ocr-with-qwen35-through-llamacpp-7de</link>
      <guid>https://dev.to/someoddcodeguy/a-few-tips-for-ocr-with-qwen35-through-llamacpp-7de</guid>
      <description>&lt;p&gt;Just a couple of quick tips. I am using the Unsloth Qwen3.5 27b gguf, and also tried the 122b gguf.&lt;/p&gt;

&lt;p&gt;First: The difference between the bf16 and fp32 mmproj is night and day. I was getting multiple hallucinations, errors, etc with the bf16. I swapped to the fp32 mmproj and it fixed up a lot of that almost instantly. Drastic improvement. The vision projector may have components that benefit from fp32's additional mantissa bits &lt;em&gt;(23 bits vs bf16's 7 bits)&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Second: Forcing the model to kick up the minimum number of visual tokens. For example, I was trying to run OCR on an old image of a Japanese newspaper article from 1957 that I found. It was something like 733x1024, and the model was really struggling to read the body of the text; tons of hallucinations, just making up entire sections of text. By forcing the image-min-tokens up to 2048, it forced the model to use 3x the visual processing, and the quality went up MASSIVELY. All of a sudden it could read the paper, with only a handful of small issues.&lt;/p&gt;

&lt;p&gt;This is what I added to the llama-server command: &lt;code&gt;--image-min-tokens 2048 --image-max-tokens 8192&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;I did have to toss 1.1 repetition penalty in there, as it was having a hard time transcribing Japanese without failing, but otherwise it is doing a great job now.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Wrangling Qwen's Overthinking with Workflows</title>
      <dc:creator>SomeOddCodeGuy</dc:creator>
      <pubDate>Sat, 28 Mar 2026 17:45:00 +0000</pubDate>
      <link>https://dev.to/someoddcodeguy/wrangling-qwens-overthinking-with-workflows-3hhm</link>
      <guid>https://dev.to/someoddcodeguy/wrangling-qwens-overthinking-with-workflows-3hhm</guid>
      <description>&lt;p&gt;So I've been running Qwen3.5 122b a10b lately on the M2 Ultra (currently GLM 5 is sitting on the M3), and if you've used any of the Qwen3.5 family, you've probably seen or heard about the overthinking issue. The models are great if you either have a lot of time to kill while you wait for a response, or for more straight forward work if you kill the reasoning. The 35b a3b with reasoning disabled has been my workhorse for the past couple of weeks and &lt;strong&gt;it is the greatest thing since sliced bread&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Anyhow, now that I want to use the 122b for actual hobby work, I've realized how painful the overthinking really is. I had a conversation a few days ago where I asked it to translate something simple. Not anything complex, just a straightforward translation request. It spat out over 5,000 tokens of reasoning before giving me the actual answer. I tested, and actually got a faster response by sending my request to GLM 5 with reasoning enabled, despite it being a 744b a40b model. It just thought so much less, because the request wasn't THAT complex.&lt;/p&gt;

&lt;p&gt;I tried all of the Qwen-recommended samplers, and even kicked up repetition penalty alongside their recommended presence penalty just to see what it would do. But nope; think think think. I also sleuthed around the net a bit and saw that several folks ultimately solved this with forceful thinking budgets in the newer llama.cpp, but I'm not a huge fan of that; if the reasoning isn't done, then it just gets cut off mid-thought and you really aren't getting the benefit of reasoning at all.&lt;/p&gt;

&lt;p&gt;So after banging my head on this for a bit, I went back to something I used to do when reasoning models were newer and their CoT actually hurt more than helped: Wilmer workflows to the rescue.&lt;/p&gt;

&lt;p&gt;What I ended up doing was disabling Qwen3.5's native reasoning entirely. I'm passing &lt;code&gt;enable_thinking: false&lt;/code&gt; into &lt;code&gt;chat_template_kwargs&lt;/code&gt; through the llama.cpp server payload to disable thinking, then I built a workflow that handles the chain-of-thought process manually.&lt;/p&gt;
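
&lt;p&gt;For reference, that payload looks something like this against llama.cpp's OpenAI-style endpoint (the host, port, and model name here are just placeholders for my own setup):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests

# Disabling Qwen's native reasoning by passing enable_thinking: false into
# chat_template_kwargs on a llama.cpp server request. Host, port, and model
# name are placeholders; swap in your own.
payload = {
    "model": "qwen3.5-122b-a10b",
    "messages": [{"role": "user", "content": "Translate 'good morning' to Japanese."}],
    "chat_template_kwargs": {"enable_thinking": False},
}
response = requests.post("http://localhost:5001/v1/chat/completions", json=payload)
print(response.json()["choices"][0]["message"]["content"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;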

&lt;p&gt;The workflow does the usual context gathering that my setups always do, and then right before the final response there's a dedicated "thinking" node. This node gets all the context and produces a chain-of-thought analysis that then feeds into the responder node.&lt;/p&gt;

&lt;p&gt;Rather than wing the CoT, since things have probably changed a bit since the last time I did that in 2024 (lol), I had Claude do a deep research pass on how DeepSeek and GLM 4.7 structure their reasoning internally, to see if I could get some ideas. In my experience, both of those do amazingly at CoT.&lt;/p&gt;

&lt;p&gt;DeepSeek-R1 ended up having the most info available; it followed a four-phase pattern of problem definition, decomposition, reconstruction cycles, and final decision. The reconstruction cycles are where it either ruminates or genuinely tries new approaches. GLM 4.7 does something called interleaved thinking, where it reasons before each response and each tool call, not just at the start.&lt;/p&gt;

&lt;p&gt;The research I found showed something interesting. Incorrect solutions have more and longer reconstruction cycles than correct ones. There's a problem-specific sweet spot for reasoning length. As we already knew: more reasoning doesn't always mean better answers. In fact, R1 had a bad habit of ruminating, re-examining the same formulations repeatedly, which actually hurts its ability to find novel solutions.&lt;/p&gt;

&lt;p&gt;It was an overthinker, too; just not as bad as Qwen.&lt;/p&gt;

&lt;p&gt;Anyhow, long story long: I took all that and threw together a new CoT prompt in a new node just before the responder. The model has to assess complexity first and scale its effort accordingly; a simple greeting gets maybe two or three sentences of thought, while a multi-step coding problem gets a thorough breakdown. Then it has to work through the problem, verify its reasoning, and output a response plan. If it catches itself repeating the same line of reasoning, it's instructed to stop and either move on or try a genuinely different approach.&lt;/p&gt;

&lt;p&gt;Despite Qwen3.5 122b not being trained for this, the results have been solid. Instead of 5,000+ tokens of circular thinking on a simple translation, I'm seeing 900 to 1500 tokens now on that same request. The quality of the final responses seems about the same, maybe slightly better because the thinking is actually structured rather than meandering. And despite making two separate model calls instead of one, the total response time is lower because I'm not burning tokens on endless rumination.&lt;/p&gt;

&lt;p&gt;This isn't a new idea. I had to do this two years ago as well; it's just funny that I'm circling back to it now with one of the most powerful models out there.&lt;/p&gt;

&lt;p&gt;Anyhow, that's how I got Qwen3.5 to behave. Your mileage may vary. But if you've got a workflow system set up and you're willing to spend some time on prompt engineering, there's a lot you can do to tame a model that doesn't self-regulate well.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>A New Toy...</title>
      <dc:creator>SomeOddCodeGuy</dc:creator>
      <pubDate>Tue, 17 Mar 2026 23:41:00 +0000</pubDate>
      <link>https://dev.to/someoddcodeguy/a-new-toy-4f56</link>
      <guid>https://dev.to/someoddcodeguy/a-new-toy-4f56</guid>
      <description>&lt;p&gt;The M5 Max Macbook Pro just arrived. First thing I did was fling llama.cpp, Wilmer and Open WebUI on it.&lt;/p&gt;

&lt;p&gt;Honestly, the speeds are really impressive, even considering that llama.cpp hasn't fully integrated the hardware changes yet (at least, that's my understanding). Here's a comparison of Qwen3.5 35b a3b between the M5 Max Macbook and the M3 Ultra Mac Studio.&lt;/p&gt;

&lt;h3&gt;
  
  
  M5 Max MacBook Pro:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1450 t/s processing, 68 t/s generation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;prompt eval time =    
    3202.80 ms /  4654 tokens 
    (0.69 ms per token,  1453.10 tokens per second)
eval time =    
    7098.19 ms /   483 tokens 
   (14.70 ms per token,    68.05 tokens per second)
total time =   10300.99 ms /  5137 tokens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  M3 Ultra Mac Studio:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1647 t/s processing, 48 t/s generation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;prompt eval time = 
    3810.74 ms / 6280 tokens 
    (0.61 ms per token, 1647.97 tokens per second)
eval time = 
    14695.00 ms / 704 tokens 
    (20.87 ms per token, 47.91 tokens per second)
total time = 
    18505.75 ms / 6984 tokens 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So yea- the Studio processes prompts faster (&lt;em&gt;at this size of model and this amount of tokens, though I think that it actually saturates better on the M5 Max at larger prompts&lt;/em&gt;), but generates tokens slower than the M5 Max.&lt;/p&gt;

&lt;p&gt;Super excited to play with this. I got rid of the M2 Max Macbook, so this is my main travel machine now.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Slimming Down the Homelab Software Footprint</title>
      <dc:creator>SomeOddCodeGuy</dc:creator>
      <pubDate>Mon, 16 Mar 2026 03:09:00 +0000</pubDate>
      <link>https://dev.to/someoddcodeguy/slimming-down-the-homelab-software-footprint-5c6g</link>
      <guid>https://dev.to/someoddcodeguy/slimming-down-the-homelab-software-footprint-5c6g</guid>
      <description>&lt;p&gt;So my homelab setup post from a while back is already outdated. Not as much on the hardware part; rather the software side has consolidated dramatically.&lt;/p&gt;

&lt;p&gt;The original setup had somewhere around 20 to 30 separate WilmerAI instances running across my network. Each one was configured for a specific purpose: coding assistance, general chat, RAG workflows, reasoning-heavy tasks, fast responses, and so on. Each instance pointed at one of my three main inference machines (the M2 Ultras and M3 Ultra). If I wanted a different usecase, I spun up a different Wilmer instance and pointed at the appropriate models on the appropriate machine.&lt;/p&gt;

&lt;p&gt;This worked, but it was wasteful. Wilmer is lightweight at around 150 megabytes per instance, but multiply that by 25 or 30 instances and you're burning some memory. More importantly, it was fragile. If I fired off two different workflow requests that both targeted the same Mac, they could hit the LLM simultaneously and either slow down the machine massively or crash it entirely. Apple Silicon doesn't handle parallel LLM inference well at all, so I had to tiptoe around my own setup, mentally tracking which workflows were in use before triggering another one.&lt;/p&gt;

&lt;p&gt;Two changes have collapsed this down to something far more manageable.&lt;/p&gt;

&lt;p&gt;The first is actually a Llama.cpp change; lcpp server recently added router mode (think llama-swap), which lets a single instance manage multiple models. You start the server without specifying a model, point it at a directory of GGUF files, and then specify the model in each API request. The server handles loading, unloading, and LRU eviction automatically. For my use case, I now run two llama.cpp instances per physical machine: one for a large model (the responders) and one for a small model (the workers). Both stay loaded and pinned with mlock so there is no cold start penalty. The model field in the request tells llama.cpp which one to use. That took me from an average of 5 llama.cpp instances per machine down to 2.&lt;/p&gt;

&lt;p&gt;By doing two lcpp instances, I can work it out so that the memory balances. I'll make sure my largest responder model leaves enough memory headroom for my largest worker model; if that combination can load side by side, then I'm golden. With the Mac's memory caching, that makes it super quick to swap models around as needed.&lt;/p&gt;

&lt;p&gt;The second big change for me is on the Wilmer-side; specifically the multi-user support I just finished building into Wilmer.&lt;/p&gt;

&lt;p&gt;Instead of running a separate Wilmer process for each workflow, I now run a single Wilmer instance per physical machine with multiple users configured via the --User flag. Each "user" is really just a configuration profile: a set of endpoints, presets, memory settings, and workflow folders. The front-end selects which configuration to use by setting the model field to something like chris-openwebui-m3:coding or chris-openwebui-m3:general. Wilmer parses that prefix, loads the appropriate user config, and runs the shared workflow under that configuration.&lt;/p&gt;
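
&lt;p&gt;From the frontend's point of view, that selection is nothing more than setting the model field on a normal OpenAI-style request, assuming you're hitting Wilmer's OpenAI-compatible chat completions endpoint. Something like this, with the host and port as placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests

# Picking a Wilmer user config + shared workflow is just the model field.
# The host and port here are placeholders for wherever Wilmer is running.
payload = {
    "model": "chris-openwebui-m3:coding",   # "user config":"shared workflow"
    "messages": [{"role": "user", "content": "Can you review this function for me?"}],
}
response = requests.post("http://localhost:8765/v1/chat/completions", json=payload)
print(response.json()["choices"][0]["message"]["content"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;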

&lt;p&gt;The shared workflows are also a new feature. They expose workflow folders through the /v1/models and /api/tags endpoints, so frontends like Open WebUI just see them as models in a dropdown. Selecting one tells Wilmer which workflow to run. &lt;/p&gt;

&lt;p&gt;In multi-user mode, the username prefix determines which user's endpoints and settings get used. So bob:openwebui-coding runs the same workflow as alice:openwebui-coding (assuming both are using shared workflows), but each hits their own configured LLM backends and presets.&lt;/p&gt;

&lt;p&gt;The result is that my M3 Ultra now has a single Wilmer instance pointed to it, serving about a dozen different shared workflows, plus Roland and a Wikipedia researcher. The M2 Ultras are set up similarly. This cleaned up a LOT of memory on the Mac mini.&lt;/p&gt;

&lt;p&gt;Concurrency limiting is the last big item. The --concurrency flag (defaulting to 1) queues incoming requests so only one hits the LLM at a time. I can now fire off multiple requests to different workflows on the same machine without worrying about crashing anything. Wilmer queues them and processes them sequentially, meaning I no longer have to keep track of what's hitting what.&lt;/p&gt;

&lt;p&gt;I still have separate instances for my mobile setup on the MacBook Pro. That one runs independently when I am on the road. &lt;/p&gt;

&lt;p&gt;This is all something I've meant to do forever; this and the new memory features (like the memory condenser I mentioned in an earlier post). It's a little headache that I've put up with for years, because scoping individual users was so challenging. But after the massive refactor I did in 2025, I could finally move almost all of the workflow/user related global variables into the new execution context and finally ensure there was no bleed/crossover on multi-user setups.&lt;/p&gt;

&lt;p&gt;Up until now, Wilmer was absolutely built for 1 person running it on their own machine. Now it's finally just about in a state where it can actually handle multiple people at once in a single instance appropriately.&lt;/p&gt;

&lt;p&gt;The multi-user and concurrency features are not released yet. Shared workflows got deployed out earlier this year. The rest is coming in the next update.&lt;/p&gt;

&lt;p&gt;I know deployments have slowed down a lot on Wilmer lately, but I haven't given up on it; it's just that it's in a spot where I can do some of the other projects I always wanted to, so I've kicked those off as well. Now my precious free time is split like 5 ways lol.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>The Right Monitor is Hard to Come By</title>
      <dc:creator>SomeOddCodeGuy</dc:creator>
      <pubDate>Fri, 13 Mar 2026 23:22:00 +0000</pubDate>
      <link>https://dev.to/someoddcodeguy/the-right-monitor-is-hard-to-come-by-57jj</link>
      <guid>https://dev.to/someoddcodeguy/the-right-monitor-is-hard-to-come-by-57jj</guid>
      <description>&lt;p&gt;It is shocking how difficult it is to find a 34" curved Ultrawide that is either 2560x1080 or 5120x2160. Back in 2020 or 2021, Spectre made one; it's been discontinued now though.&lt;/p&gt;

&lt;p&gt;The big issue for me is twofold because I have a triple monitor setup: The monitors to the left and right of my main monitor are both 1920x1080 27" monitors. A 34" ultrawide is physically identical in height to those monitors. 2560x1080 is also identical in resolution height. So with a 34" 1080p monitor, it's just a really nice setup.&lt;/p&gt;

&lt;p&gt;My main issue with the current stock you can find on Amazon is that MacOS &lt;em&gt;REALLY&lt;/em&gt; struggles with landing on that resolution if the monitor isn't either set to it natively, or is &lt;strong&gt;5K2K&lt;/strong&gt;. If you get a 3440x1440 monitor... well, I haven't been able to find one that lets me select 2560x1080 as a resolution in standard MacOS.&lt;/p&gt;

&lt;p&gt;I did try &lt;code&gt;BetterDisplay&lt;/code&gt;, but I had some issues that I couldn't work through on it, so I'm back on the prowl for a monitor that fits my needs.&lt;/p&gt;

&lt;p&gt;Resolution selection is definitely one of the areas that Windows has MacOS beat on. That and Microsoft Paint. Omg, I can't tell you how spoiled having that application has made me. I grabbed GIMP for the Mac, but it's overpowered for what I want to do with it; I really just need it to manipulate screenshots or something now and then.&lt;/p&gt;

&lt;p&gt;Oh, and network file sharing. I made the mistake of trying to use a Mac as a local NAS. Never again.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>My Foray Back Into Linux...</title>
      <dc:creator>SomeOddCodeGuy</dc:creator>
      <pubDate>Sun, 08 Mar 2026 18:21:00 +0000</pubDate>
      <link>https://dev.to/someoddcodeguy/my-foray-back-into-linux-4o64</link>
      <guid>https://dev.to/someoddcodeguy/my-foray-back-into-linux-4o64</guid>
      <description>&lt;p&gt;So I decided to make use of one of the mini-pcs I had gotten for the homelab to build a little web browsing box. My first iteration of the web browsing box was a Windows 11 machine, which is the same machine that got me banned from reddit for VPN use (oops), but I've finally decided it was time to graduate from Windows and move to the more private OSes.&lt;/p&gt;

&lt;p&gt;The goal was straightforward enough. I wanted something separate from my main machine that I could use for general web browsing. Something isolated, so if I picked up some nasty malware or clicked a bad link, my actual workstation would be fine. Something that wasn't Windows. And I wanted to remote into it from my Mac Studio so I didn't need yet another monitor on my desk.&lt;/p&gt;

&lt;p&gt;The last time I seriously touched Linux was probably 15 years ago. Back then, getting a Linux box to just work was an adventure that usually ended poorly. There's a reason there were so many memes about the ridiculous complexity of doing simple things in Linux. And it especially didn't help that I wanted to dual boot with Windows... I swear, it seems like Windows kills the Linux bootloader by design sometimes.&lt;/p&gt;

&lt;p&gt;So walking into this, I was mentally preparing for that same experience. I figured I'd brick the machine at least three times before I got anything usable.&lt;/p&gt;

&lt;p&gt;I ended up using a Kamrui mini PC. AMD Ryzen 7 5700U, 32GB of RAM, 1TB of storage. Small enough to tuck away somewhere, powerful enough to handle a browser without breaking a sweat. And I went with Linux Mint with Cinnamon because multiple folks told me it was the easiest transition from Windows.&lt;/p&gt;

&lt;p&gt;Altogether, the process was WAY easier this time around, in the age of LLMs. What used to be an arduous process of digging through tutorials and forum posts was actually a pretty painless task of just having GLM 5 and Claude talk me through various issues as they came up.&lt;/p&gt;

&lt;p&gt;The installation was painless. LUKS disk encryption is now just a checkbox in the installer. No hunting down proprietary drivers, either. I had to use Ethernet because the WiFi card in this thing has no mainline Linux driver support, but that's fine.&lt;/p&gt;

&lt;p&gt;Where things got interesting was the hardening. Because I'm me, I couldn't just install the OS and call it a day. I wanted this thing locked down. UFW firewall, OpenSnitch for outbound traffic monitoring, NordVPN with a kill switch, Firefox hardened, AppArmor running, unnecessary services stripped out, etc.&lt;/p&gt;

&lt;p&gt;In the past, I would have absolutely bricked this machine multiple times. The robits helped with all of that. When xrdp kept failing with a sesman connection error, when NordVPN's kill switch locked me out of the machine entirely, when xrdp kept killing the WebGL process in Firefox and causing it to crash over and over... the bots had an answer for everything.&lt;/p&gt;

&lt;p&gt;In the end, I still did a full refresh, just because I had gone to town on some of the config files in this thing trying to get it the way I wanted, and I couldn't tell if I'd made a mess or not. But another nice thing with the bots was that as I did stuff, I was telling them, so in the end I got them to spit out all the highlights and write up a doc that I could use to replicate the whole process.&lt;/p&gt;

&lt;p&gt;A few things I learned along the way:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;NordLynx doesn't work with OpenSnitch; at least as of the time of this writing. Both manipulate iptables at the kernel level, and they fight each other. I had to switch to OpenVPN, which runs in userspace and plays nice with the firewall, though it's slower.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The xrdp 0.9.24 version in the Mint repos has an IPv6 binding issue that causes intermittent connection failures. The fix is checking the sesman binding after every reboot and restarting the services if it's wrong (there's a rough sketch of that check right after this list).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Firefox's built-in fingerprinting protection sounds great to have, but when I enabled it, Firefox would hang on JavaScript-heavy sites. I eventually dropped it, especially with uBlock Origin blocking tracking scripts anyway.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Right Ctrl gets stuck when you switch virtual desktops on macOS while the MintOS RDP window is in focus. Linux sees the key press but not the release. I had to disable Right Ctrl entirely within Linux via xmodmap to fix it. Took me way too long to figure out what was happening there. But if you think about it... when do you ever use right ctrl? I didn't until I started using Mac more, and that's just for virtual desktop swapping.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
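
&lt;p&gt;For the xrdp item above, the post-reboot check can be as simple as a small script along these lines. The port (3350) and service names assume a stock xrdp install, so treat this as a sketch rather than a drop-in fix:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: confirm xrdp-sesman is listening on IPv4 (default port 3350) and
# bounce the services if it isn't. Assumes a stock xrdp install on Mint.
import subprocess

def sesman_on_ipv4() -&gt; bool:
    out = subprocess.run(["ss", "-ltn"], capture_output=True, text=True).stdout
    return any("3350" in line and ("127.0.0.1" in line or "0.0.0.0" in line)
               for line in out.splitlines())

if not sesman_on_ipv4():
    subprocess.run(["sudo", "systemctl", "restart", "xrdp-sesman", "xrdp"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;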

&lt;p&gt;The final result is a machine that boots up, connects to VPN automatically, and sits there waiting for me to RDP in from my Mac. All traffic goes through NordVPN. DNS queries go through NordVPN's DNS servers. WebRTC is disabled in Firefox. Third-party outbound connections are blocked unless explicitly allowed. The firewall only accepts inbound SSH and RDP connections from my local subnet.&lt;/p&gt;

&lt;p&gt;At this point, I've relegated Windows to gaming only, which I really don't do a lot of these days, but it's nice to have around anyhow. I had been putting off the Windows 11 upgrade (there's an extension available for Win 10 security updates until Oct 2026, so I had taken that). Now that I've got everything personal off my Windows box, I'll get it updated to Win 11.&lt;/p&gt;

&lt;p&gt;Most of the house is now Mac and Linux. Huzzah. I used to love Windows, but they've just been too weird lately about OneDrive. I still really like Outlook and O365; I use both a lot. But my personal machine doesn't need to be so closely tied to the cloud, and if the core Windows experience is going to be a cloud-centric OS, then it's really just not for me anymore.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Wilmer and Token Management</title>
      <dc:creator>SomeOddCodeGuy</dc:creator>
      <pubDate>Sat, 07 Mar 2026 02:04:00 +0000</pubDate>
      <link>https://dev.to/someoddcodeguy/wilmer-and-token-management-4ha3</link>
      <guid>https://dev.to/someoddcodeguy/wilmer-and-token-management-4ha3</guid>
      <description>&lt;p&gt;One of the big keys to running LLMs on a Mac is token management. That's what a lot of Wilmer is built around.&lt;/p&gt;

&lt;p&gt;Wilmer started out because I wanted to make the most of Llama 2 finetunes, but eventually its workflows became a way for me to keep overall token counts down. Macs handle large prompts slowly, and the smaller the prompts, the easier that is to deal with.&lt;/p&gt;

&lt;p&gt;For example, consider a really long conversation with an LLM. I was working with GLM 5 on my M3 Ultra to help me set up a new Linux box in the house. I know Mac and Windows well enough, but my last true foray into Linux was 15 years ago or more, so I needed help.&lt;/p&gt;

&lt;p&gt;Eventually I hit a point where the overall conversation was about 300 messages or more. If I had been sending the whole conversation, it would have been at least 100,000 tokens. Any standard sliding cache could keep it quick, but at the cost of losing the start of the conversation. When you're on a Mac, a 20k token prompt is already in frustrating territory, so you don't want to send much more than that. This means you'd lose 4/5 of the conversation.&lt;/p&gt;

&lt;p&gt;You could rely solely on vector memory, but now you're playing with fire on the sliding cache, hoping you don't accidentally cause it to reset because too much context changed on it.&lt;/p&gt;

&lt;p&gt;So with Wilmer, I've been focused on a handful of context management techniques. Some have been in it since early 2024, and some I'm adding in now.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;File memories&lt;/strong&gt; are JSON files that tie summaries to chunks of messages. The summary prompts can be anything, so it depends on the conversation type. For the Linux conversation, I set it to capture what changes we made successfully: packages installed, configs edited, services started or stopped. The system generates these automatically every 6000 tokens or so, which keeps each chunk focused and digestible.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Chat Summary&lt;/strong&gt; is similar, but rolls everything into one running overview. I use this to capture the 100-mile-high view of where we're at - what the overall goal is, what phase of the project we're in, what big decisions we've made.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Vector memories&lt;/strong&gt; are where the LLM generates individual facts as the conversation progresses and stores them for semantic search. This is more nuanced detail about what's going on: specific commands that worked, error messages we encountered and how we fixed them, configuration values we settled on.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Conversation condensing&lt;/strong&gt; is the newer piece. I configured it to keep my most recent 7000 tokens as raw, untouched messages. Then it takes the next 7000 tokens after that and summarizes them with awareness of the current topic. So if we're troubleshooting a networking issue, it'll lean into preserving networking details. Everything beyond that gets rolled into a neutral summary that captures the broad strokes without topic bias. This lets me keep the immediate context sharp while still holding onto the shape of a long conversation. (There's a rough sketch of this tiering right after the list.)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
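
&lt;p&gt;Here's the rough shape of that condensing split, with stand-in helpers for token counting; the real logic lives in Wilmer's workflows, so treat this purely as an illustration of the tiering:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Illustration of the three-tier condensing split (stand-in helpers, not Wilmer's code).
RAW_BUDGET = 7000      # newest messages, passed through untouched
TOPIC_BUDGET = 7000    # next chunk, summarized with awareness of the current topic

def count_tokens(msg: str) -&gt; int:
    return max(1, len(msg) // 4)   # crude stand-in for a real tokenizer

def split_tiers(messages: list[str]):
    raw, topic_aware, older = [], [], []
    spent = 0
    for msg in reversed(messages):   # walk from newest to oldest
        spent += count_tokens(msg)
        if spent &lt;= RAW_BUDGET:
            raw.append(msg)
        elif spent &lt;= RAW_BUDGET + TOPIC_BUDGET:
            topic_aware.append(msg)
        else:
            older.append(msg)
    return list(reversed(raw)), list(reversed(topic_aware)), list(reversed(older))

# raw goes in verbatim; topic_aware gets a topic-aware summary;
# older gets rolled into one neutral summary.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;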

&lt;p&gt;On top of this, I give the LLMs persistent files they can read from and write to. Things like my speech preferences, behaviors to avoid, recent events in my life, and a persona file that defines how the AI presents itself. One problem for LLMs is losing that internal train of thought and having to re-reason what its stance or goal was each time. Not so with this. The AI can jot down notes between messages and pick up where it left off.&lt;/p&gt;
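
&lt;p&gt;The mechanics of that can be as simple as a scratch file that gets injected into the prompt and rewritten after each turn. Something like this hypothetical sketch (the file name and helpers are made up):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical sketch of a persistent "notes to self" file for the assistant.
from pathlib import Path

NOTES = Path("assistant_notes.txt")

def build_system_prompt(persona: str) -&gt; str:
    notes = NOTES.read_text() if NOTES.exists() else "(no notes yet)"
    return persona + "\n\nYour notes from earlier in this conversation:\n" + notes

def save_notes(updated_notes: str) -&gt; None:
    NOTES.write_text(updated_notes)   # the model is asked to rewrite its notes each turn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;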

&lt;p&gt;Separating out the image processor also lets me use a different vision model from the main thinkers, but more importantly it lets me cache previous vision responses. Once I send an image, the LLM doesn't have to reprocess it but can still answer questions about it. That's super helpful, and something that I don't see a lot of front-ends doing; with most of them, the model loses the context of that image after just a few messages.&lt;/p&gt;
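
&lt;p&gt;A bare-bones version of that caching might look like the sketch below: hash the image, store the vision model's description, and let later turns reuse it as plain text. All the names here are made up for illustration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Bare-bones illustration of caching vision output so an image is only processed once.
import hashlib

_vision_cache = {}

def describe_image(image_bytes: bytes, vision_model) -&gt; str:
    key = hashlib.sha256(image_bytes).hexdigest()
    if key not in _vision_cache:
        _vision_cache[key] = vision_model(image_bytes)   # expensive call happens once
    return _vision_cache[key]   # later turns inject this text instead of the raw image
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;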

&lt;p&gt;All of this gives me the ability to have massive conversations, hundreds of messages long, while maintaining consistency in knowledge; all while barely sending 15-20k tokens to the LLM in any given message. Overall I process more tokens than if I just left it all to sliding cache, but in return I get an assistant that can continue answering questions during message 300 about something way back in the first 20 messages.&lt;/p&gt;

&lt;p&gt;The real advantage is that I can use smaller models for most of the heavy lifting. During my Linux setup, what I really wanted was the final response from GLM 5. That's the model walking me through everything. But parsing through memories, updating summaries, deciding whether to pull from Wikipedia, condensing old conversation chunks? That gets pawned off to weaker models, sometimes down to the 4-billion-parameter range. They finish in no time at all. Then when GLM 5 kicks off, it's been handed everything it could hope for in terms of context, and it only has to work with 20k tokens or less.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>If You Have the Hardware- Use it to Learn!</title>
      <dc:creator>SomeOddCodeGuy</dc:creator>
      <pubDate>Tue, 03 Mar 2026 03:51:40 +0000</pubDate>
      <link>https://dev.to/someoddcodeguy/if-you-have-the-hardware-use-it-to-learn-30j1</link>
      <guid>https://dev.to/someoddcodeguy/if-you-have-the-hardware-use-it-to-learn-30j1</guid>
      <description>&lt;p&gt;If you've never messed with open source LLMs and you jumped on the ClawdBot/OpenClaw hype train: take some time to learn more about how local models work. You likely went through the trouble of getting a Mac Mini, so you now have a nice little test box to play with. Just do it. Turn off Clawdbot/OpenClaw, and make OTHER things with it. Just for a few hours, even.&lt;/p&gt;

&lt;p&gt;For the vast majority of folks using AI to vibe code, make agents, and so on: right now they are the equivalent of people building websites using the heaviest no-code/low-code solutions, or just slapping in ALL the biggest libraries without a care in the world for performance. You're probably wasting a ton of efficiency in your current setups because you don't understand how a lot of it works under the hood. You don't understand samplers well, or what tokenization is doing. You may not have a good feel for what small and weak models can really do, or what you absolutely have to have large models for &lt;em&gt;(when I say small models, I'm talking models that make Claude Sonnet 3.7 look like a genius)&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Whatever efficiencies you're aiming for are probably a drop in the bucket compared to what you could be doing if you really had a feel for all that. And the only thing holding you back from that knowledge is just taking the time to learn it.&lt;/p&gt;

&lt;p&gt;The easiest way to learn this stuff is doing. You have the hardware now, so why not? Forget the little hype-bot that LinkedIn convinced you to install. Set it aside and use that Mac Mini to learn how LLMs work at a deeper level by trying to wrangle local models to do complex work. &lt;/p&gt;

&lt;p&gt;THAT will be worth its weight in gold.&lt;/p&gt;

&lt;p&gt;Also, don't cheat yourself. Yes, the local ecosystem is easier now. 10 minutes + an LM Studio install and tada: all done! But what did you really learn? No no; I'm saying to do it the long way around. Grab Open WebUI. Grab llama.cpp. Get 'em hooked up together. Use a little model like one of the new Qwen3.5 8b models. Get the responses to be actually good; try to find ways to make the model stop repeating itself. Things like that.&lt;/p&gt;
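
&lt;p&gt;If you go the llama.cpp server route, the front-end just talks to its OpenAI-compatible endpoint. A quick smoke test from Python might look something like this; the port and model are whatever you started the server with, so adjust for your setup:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Quick smoke test against a local llama.cpp server's OpenAI-compatible endpoint.
# Assumes something like: llama-server -m your-model.gguf --port 8080
import json, urllib.request

payload = {
    "messages": [{"role": "user", "content": "Say hello in five words."}],
    "temperature": 0.7,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;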

&lt;p&gt;Next: write a small agent. Do it with that crappy little 8b or less model, and try to get something of value out of it. &lt;/p&gt;
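
&lt;p&gt;Even a loop as small as this hypothetical sketch (with &lt;code&gt;chat()&lt;/code&gt; standing in for however you call your local model) will teach you a lot about where small models fall over:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical minimal agent loop: the model "uses a tool" by prefixing its reply.
def chat(messages) -&gt; str:
    raise NotImplementedError("wire this to your local server")

TOOLS = {"CALC": lambda expr: str(eval(expr, {"__builtins__": {}}, {}))}  # toy tool

def run_agent(task: str, max_steps: int = 5) -&gt; str:
    messages = [
        {"role": "system", "content": "Answer the task. To use a tool, reply "
         "'CALC: &lt;expression&gt;'. Reply 'DONE: &lt;answer&gt;' when finished."},
        {"role": "user", "content": task},
    ]
    for _ in range(max_steps):
        reply = chat(messages)
        messages.append({"role": "assistant", "content": reply})
        if reply.startswith("DONE:"):
            return reply[5:].strip()
        if reply.startswith("CALC:"):
            result = TOOLS["CALC"](reply[5:].strip())
            messages.append({"role": "user", "content": "CALC result: " + result})
    return "(gave up)"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;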

&lt;p&gt;This is all possible to do, but I promise it'll be harder than accomplishing the same thing with some 2026 proprietary API model. And that's the point.&lt;/p&gt;

&lt;p&gt;Once you've done all that, you'll later go back and revisit what you think right now is great work with LLMs, and suddenly have the same realization every developer does when they go back to their old code: &lt;em&gt;"Wow, I can do a lot better than this now."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Much like developers first learning to code, who think that just writing 500 "if statements" is good enough, you're only just now scratching the surface of how you should properly use LLMs. Now you need to start learning the more complex stuff. Don't settle for the novice approaches you've been using so far. There's SO MUCH MORE out there.&lt;/p&gt;

&lt;p&gt;And who knows- you may just find that local models are fun enough to be worth obsessing over a bit ;)&lt;/p&gt;

</description>
      <category>ai</category>
      <category>learning</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
