
Amjad Abujamous

The Local AI Powerhouse

Introduction

Keeping up with everything AI-related in recent months got me wondering about the magic behind it, so I researched running it locally and eventually devised a plan to run AI models on my own hardware.
Once the hardware research was done, I settled on a mini PC running AMD Strix Halo, the GMKTec Evo X2. After ordering it from AliExpress (which ships worldwide) and setting it up with Linux Mint, llama.cpp, and its dependencies, I managed to run models like GPT-OSS-120B, Qwen3-Next-80B-A3B, and GLM 4.7 Flash.
This article walks you through how the whole thing was done and the reasoning behind each decision.

Motivation

Running AI locally gives you ownership, control, the ability to work offline, and maximum flexibility. The idea is to have a device powerful enough for the tasks at hand without needing a cloud subscription or an internet connection to function. Further, since research is heading towards getting the same value out of smaller models, such a device is an investment that should only grow more capable as AI technology advances.

Hardware Research

Now that the motivation is clear, it is time to choose the ideal device for the task, keeping both cost and benefit in mind. After following brilliant people like Alex Ziskind, NetworkChuck, and many others on YouTube over several months, I settled on a mini PC, both for its portability and for its ability to deliver decent performance while maintaining a reasonable temperature.
The question that follows: which mini PC do I buy for this purpose? The answer is one of three:

  • Mac Mini. The ideal option, since Apple's M chips are best in class. The only con: it gets too expensive once you need enough RAM.
  • Nvidia GB10. Also great, but by the time NVIDIA's GB10 became available, AMD's Strix Halo had already established itself in the market.
  • GMKTec Evo X2 (or the Framework Desktop). This one delivers performance nearly as good as the other two at a much more affordable price, making it ideal both for tinkerers and for small organizations that want to run AI on their own hardware. The unit I ordered came with 128GB of RAM, and what makes the device special is that up to 96GB of it can be allocated to the integrated GPU, giving it enough memory to run models with 80B+ parameters at reasonable speeds.

With that in mind, I went for choice #3, the local AI powerhouse.
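As a back-of-envelope check that 96GB is enough for an 80B+ model: llama.cpp runs quantized GGUF weights, typically around 4 to 5 bits per weight. Here is a rough sketch of the math; the bits-per-weight and overhead figures are assumptions for illustration, not measurements:

```python
def model_memory_gb(params_billion: float, bits_per_weight: float,
                    overhead_gb: float = 5.0) -> float:
    """Rough GGUF footprint: quantized weights plus KV-cache/runtime overhead."""
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb

# A 120B model at ~4.5 bits/weight (a Q4_K_M-style quant) fits in 96GB:
print(model_memory_gb(120, 4.5))   # 72.5
# An unquantized 16-bit version would not:
print(model_memory_gb(120, 16.0))  # 245.0
```

In other words, the 96GB iGPU allocation comfortably holds a quantized 120B model with room left over for context.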

The AI runtime

There are multiple runtimes for local inference on the market. The main contenders, LM Studio and Ollama, both run llama.cpp under the hood, which makes llama.cpp itself the ideal choice: it is blazing fast, and it recently got a built-in Web UI, which is a massive plus. Setting up llama.cpp on Linux Mint required installing Vulkan drivers and building from source with Vulkan support enabled, since the Radeon 8060S iGPU relies on Vulkan rather than CUDA for GPU acceleration. Once that was done, inference worked out of the box.
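For reference, the build roughly looked like the following. The package names are Ubuntu/Mint-style and may differ on your distro, so treat this as a sketch and check llama.cpp's own build docs for the authoritative steps:

```shell
# Vulkan loader, dev headers, and the GLSL compiler the Vulkan backend needs
sudo apt install libvulkan-dev vulkan-tools glslc

# Build llama.cpp from source with the Vulkan backend enabled
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j

# Serve a GGUF model; this exposes both the Web UI and an
# OpenAI-compatible API on the given port
./build/bin/llama-server -m /path/to/model.gguf --port 8080
```

Once `llama-server` is running, the Web UI is available at http://127.0.0.1:8080 in a browser.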
Another client used with this local runtime was OpenCode, for agentic coding purposes. Its configuration was a little tricky, but it was done nonetheless: connecting OpenCode to a local llama.cpp instance involved creating an opencode.json file under ~/.config/opencode/ with the provider set to @ai-sdk/openai-compatible and the base URL pointing to http://127.0.0.1:8080/v1. It took some trial and error, but once configured, OpenCode picks up the local model automatically.
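A config along these lines is what the description above amounts to. The provider key and model ID here are illustrative placeholders (llama-server serves whichever GGUF it was started with regardless of the name), so adapt them to your setup and double-check the exact schema against OpenCode's documentation:

```json
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "llamacpp": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "llama.cpp (local)",
      "options": {
        "baseURL": "http://127.0.0.1:8080/v1"
      },
      "models": {
        "qwen3-next-80b-a3b": {
          "name": "Qwen3-Next-80B-A3B"
        }
      }
    }
  }
}
```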

The models of choice

Three models were chosen, each for a distinct reason:

  • Industry Standard: I went for GPT-OSS-120B by OpenAI, since OpenAI currently sets the industry standard.
  • Multilingual Capabilities: Qwen3-Next-80B-A3B is a surprisingly capable model that can do essentially anything, and you can speak to it in any language of your choice.
  • Coding: GLM 4.7 Flash has a good reputation online as a capable local model. I tested it, connected it to OpenCode, and built a 2D Mario game with it. I have to say it is decent, but not as good as Opus by Anthropic or Codex by OpenAI; the cloud models still rule here.

After using all three for a few weeks, GPT-OSS-120B felt slow for everyday use at 120B parameters, and GLM 4.7 Flash, while great for coding, was too narrow in scope for general tasks. Qwen3-Next struck the best balance: fast inference thanks to its mixture-of-experts architecture (only 3B parameters active at a time), strong multilingual support, and solid performance across both conversation and code. So I settled on Qwen3-Next. That thing rocks!
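The speed win from the mixture-of-experts design can be sanity-checked with napkin math: token generation is roughly memory-bandwidth-bound, and each generated token only needs to read the active parameters. Assuming ~256 GB/s for Strix Halo's LPDDR5X and ~4.5 bits per weight (both rough assumptions on my part), the theoretical ceiling looks like this:

```python
def est_tokens_per_sec(active_params_b: float, bits_per_weight: float,
                       mem_bw_gb_s: float) -> float:
    """Upper bound on decode speed: memory bandwidth / bytes read per token."""
    gb_per_token = active_params_b * bits_per_weight / 8
    return mem_bw_gb_s / gb_per_token

dense_80b = est_tokens_per_sec(80, 4.5, 256)  # all 80B weights read per token
moe_3b = est_tokens_per_sec(3, 4.5, 256)      # only ~3B active per token

print(round(dense_80b, 1), round(moe_3b, 1))  # 5.7 151.7
```

Real-world throughput comes in well below these ceilings (KV-cache reads, compute, and scheduling all eat into it), but the roughly 27x gap between the two numbers is why a 3B-active MoE feels so much faster than a dense model of the same total size.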

Conclusion

What surprised me most was how capable local models have become: for most day-to-day tasks, the experience is comparable to the cloud providers. If I were to do it over, I would have gone with Linux Mint from day one rather than experimenting with other distros first. As for what is next, I am keeping an eye on smaller, more efficient models as they continue to improve, and I plan to integrate this setup more deeply into my development workflow using tools like OpenCode and n8n. If you are considering running AI locally and deliberating whether it is a good choice, I hope this article was of help to you. Good luck.
