Originally published at deepu.tech.
One of my most popular posts of all time was when I wrote about my beautiful Linux development machine in 2019...
I have a similar machine, but a desktop one (Minisforum 395+, 128GB). I've never looked into its BIOS, but I thought the whole point of these machines was to have unified memory similar to what the DGX Spark has, for example (and I have one of those too). Is there any reason you had to explicitly split 64GB of memory here and there, as opposed to letting the machine/OS handle that for you? In particular, the DS4 project (which I love and use on the DGX Spark) requires 96GB minimum to run, though it doesn't necessarily need to take all that space; I believe that project should run with a 32GB CPU split and 96GB for the GPU. Still, I'm curious to learn why nobody on macOS needs to worry about this, and neither do I on my DGX Spark (or maybe it comes pre-configured to handle that automatically). Thanks!
That being said, nice post ... I feel you for the AMD ROCm state but it's really getting better day by day, can't wait to have it more reliable/robust to make it the mac alternative for developers!
Last I tried, there were some issues loading models larger than RAM. But I think it's not an issue on newer kernels. I'm planning on disabling the split to see how my previous use cases work now.
i love that you're focusing on a fully offline setup for AI-assisted development. it's cool to see how you've customized your environment with arch and niri. if you're ever interested in quickly spinning up a web app, moonshift lets you deploy a next.js + postgres + auth build in about 7 minutes, and you keep the code on your github. let me know if you want to give it a shot for free.
You should also give omp.sh a try.
I found it much better in speed and management than opencode.
Looks great! I like to use Linux, or at least a Unix-based terminal. For example, my company laptop runs Windows 11, but installing Ubuntu 22.04 under WSL partially solves my development workflow. I know that is far from this handcrafted solution, but the company requirements are strict; I can't even reach dev.to from my work computer because of some weird company policy. Anyway, I like your work!
Is it a company laptop?
2020 Dell i5, 16GB RAM, worn English-layout keyboard, but I always use the US layout, so there's minor confusion.
The good news is that Copilot CLI runs in the cloud, so the laptop's capacity doesn't matter.
Neah
That's a helluva broputer...
I'm gonna steal broputer, although not sure if I should be offended or not 🤣
Nah, no offense, that's a really cool setup made with lots of love and dedication, I'm pretty sure it pays off big time
Lovely setup!
Have you considered using Qwen3.6 35B A3B?
I use it on my MI50 32GB and basically get a 3x boost in tokens/s (both in and out) for not much intelligence penalty. Also probably worth turning on the feature to remember its thinking, given that you can support its full context window.
Once I saw that kind of tokens/s it was hard to justify the slower dense models.
I haven't personally tried it, since I saw someone compare it with dense models on long-context tasks, and the MoE models hallucinated way more when the context was big. I will try it when I have time and see.
What context are you using?
This is the dream setup for anyone who cares about owning their stack. The llama.cpp + ROCm combo on the Flow Z13 is impressive; 128GB unified memory changes the calculus for local AI entirely. I've been thinking about a similar local-first approach for some of my financial data analysis pipelines where I really don't want prompts hitting third-party APIs. The tradeoff you mentioned about context-length slowdown with 27B models matches what I've seen too. Qwen3.6 Q8_0 at 256k context is a solid sweet spot. Thanks for sharing the bench numbers and the archdots repo, exactly the kind of practical detail that's hard to find.
Nice article. I never thought about this approach before
How has this setup performed under real traffic?
I have been using it for reviews, quick fixes, repo research, etc., and it has been quite good. Right now I'm building a full-fledged filesystem management TUI in Rust. Will report back my findings. So far very impressed: I'm 3 prompts in and it's fixing issues after the first iteration.
Check out my repo. I've got the keyboard and back window RGB working.
github.com/th3cavalry/GZ302-Linux-...
Super cool. Thanks for sharing. I will use it as a benchmark to test the model.
Never even knew one could do something like this. So creative. I appear to have a long way to go.
Try Krusader or similar two-pane, keyboard-heavy file managers.
For coding assistance, a well-quantized 4B model at 40+ tok/s beats a 27B model at 8 tok/s in actual productivity. The bottleneck isn't intelligence; it's iteration speed. At 50-100 completions per hour, latency compounds fast. The practical setup: small+fast model for flow state (completions, quick edits), big+slow model for architecture planning and code review invoked 2-3 times per session. Two-tier local beats single-tier every time.
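To put rough numbers on the latency argument above, here is a minimal Python sketch. It only counts generation time at the quoted 40 and 8 tok/s rates; the average completion length of 120 tokens is a hypothetical value chosen for illustration, not a figure from the comment, and prompt processing time is ignored.

```python
def wait_minutes_per_hour(completions_per_hour, tokens_per_completion, tokens_per_second):
    """Minutes per hour spent waiting on token generation (decode time only)."""
    total_seconds = completions_per_hour * tokens_per_completion / tokens_per_second
    return total_seconds / 60.0


# Hypothetical average completion length; an assumption, not a measured value.
TOKENS_PER_COMPLETION = 120

# Generation speeds quoted in the comment: 40 tok/s (small model), 8 tok/s (27B model).
for label, tok_per_s in (("4B at 40 tok/s", 40.0), ("27B at 8 tok/s", 8.0)):
    # The comment's range of 50-100 completions per hour.
    for completions in (50, 100):
        minutes = wait_minutes_per_hour(completions, TOKENS_PER_COMPLETION, tok_per_s)
        print(f"{label}, {completions} completions/hour: {minutes:.1f} min waiting")
```

Under these assumptions the faster model costs 2.5 to 5 minutes of waiting per hour, against 12.5 to 25 minutes for the slower one. Whether that gap outweighs the difference in output quality depends on the task.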
Have you tried those models? In my experience, anything other than the dense models isn't that useful for serious coding. Maybe OK for minor stuff, but not for generating entire apps or fixing complex issues.
Also, 8 tok/s is not that bad when you're generating code. The 35B A3B will get you around 40 tok/s if the task doesn't need the best intelligence. 4B models are not useful, at least for what I do. Again, you do you. Everyone has different needs.
Fully-offline AI dev is more practical now than most people realize, and posts like this matter because they show the quality floor of local models has crossed "actually useful" for a lot of day-to-day work. Privacy, zero marginal cost, no rate limits, works on a plane - the tradeoffs increasingly favor local for the bulk of tasks.
The honest setup most people land on is hybrid: local model handles the high-volume mechanical work offline/free, and you reach for a frontier API only on the rare genuinely-hard problem where the local model's ceiling shows. Even then, you've moved 80% of your usage off the meter. Your writeup is a good blueprint for that - curious which local model you settled on and where you still felt the need to phone home to a bigger one. Great build.
A fully offline AI-assisted setup sounds cool, but wouldn't keeping updates and dependencies for tools like llama.cpp be a hassle without cloud access? How do you manage version control on your offline setup? I can see isolation making versioning tricky. If you're getting your Linux environment ready for job interviews, take a look at prachub.com. They have company-tagged coding banks that could be useful, especially if you're targeting a specific tech role.
Fascinating local AI setup. While great for development, true accessibility in health AI, especially for global, voice-first users, needs to move beyond local inference. The next billion users will speak symptoms like 'kaaichal' (Tamil for fever) in their mother tongue, not type them.

This demands robust, scalable voice models that understand...