Peter Grishechkin

Posted on Jul 1

I Ran an LLM Locally on My ASUS ROG Ally and Here's What I Actually Learned

#ai #llm #beginners

TL;DR:

I ran an LLM locally on my ASUS ROG Ally for a few weeks, expecting a fun tinkering project and got a real lesson in hardware limits instead. The fastest model wasn't the best choice, the "obvious" memory fixes mostly didn't work, and the actual value showed up only after I stopped treating local AI as a cloud replacement and started treating it as a specialized tool with a narrow, honest job description.

My ROG Ally was sitting on the shelf, doing nothing between gaming sessions. That felt wasteful. It has a decent APU, 16 GB of shared memory, and it's basically a tiny PC with a screen glued to it. So one evening I thought: why not turn it into a local AI server?

What followed was not the smooth "download model, type prompt, get magic" experience you see in demo videos. It was hours of memory errors, weirdly slow generation, and a slow realization that most of my intuitions about how this should work were wrong.

Here's the long version, with the specific fixes that worked, the ones that didn't, and the model choices that surprised me.

Why the ROG Ally Seemed Like a Good Idea

The pitch is simple. A handheld gaming PC running Windows or Linux, an APU (a chip combining CPU and GPU on one die) with shared memory, sitting idle most of the day. Running a local large language model (LLM – an AI model trained to understand and generate text) on it means:

• No cloud subscription fees for basic tasks

• No sending your notes, code, or drafts to someone else's server

• A private assistant that works even without internet

On paper, that's a great use for hardware that otherwise just collects dust between gaming sessions. In practice, the first thing I hit was a wall I didn't even know existed.

UMA Frame Buffer

If you've never touched an APU-based system before, this part will save you the same headache I went through.

Devices like the ROG Ally don't have dedicated video memory (VRAM). Instead, they use UMA – Unified Memory Architecture. The CPU and the integrated GPU share one big pool of RAM, and the system decides how much of that pool the GPU is allowed to touch. That allocation is called the frame buffer.

By default, that buffer is small. On my Ally, it was reserving a tiny slice out of the 16 GB total, nowhere near enough for anything past a couple of gigabytes. That doesn't sound like a huge deal until you remember: quantized LLM weight files (compressed versions of the model, usually in GGUF format) for anything beyond a small model start at 4-6 GB and go up fast.

The result: the moment my model didn't fit into that tiny GPU-visible slice, the system quietly fell back to running inference on the CPU. And CPU inference on a handheld chip is not "a bit slower." It's several times slower. Painfully, watch-the-progress-bar slower.

The fix turned out to be simple but far from obvious if you haven't run into it before: go into the BIOS and manually raise the UMA frame buffer. I pushed it to 4 GB out of the 16 GB total shared memory. That single change did more for generation speed than every other tweak I tried combined.

If you're setting up a similar box, check this first. Before touching swap settings, before messing with compression, before anything – confirm your GPU-visible memory allocation actually matches what your model needs. It's the one setting most guides skip entirely because on a desktop with a dedicated GPU, this problem simply doesn't exist.

The Fix That Sounded Smart and Wasn't

Here's where my intuition led me straight into a wall for the second time.

zRAM is a Linux feature that compresses data in memory instead of pushing it to disk, giving you the effect of more usable RAM without physically adding sticks. My thinking went: my model doesn't fully fit, zRAM compresses stuff in memory, so maybe it squeezes the model in too.

It doesn't work that way, and here's the technical reason. GGUF files – the quantized model weights most local LLM runners use – are already compressed. Quantization (reducing the precision of the model's numbers to make it smaller) packs the weights about as tightly as they're going to get for that format. Feeding already-compressed data into a compression layer gives you close to nothing back. There's no extra "air" in that data for zRAM to squeeze out.

So the intuitive fix – "let's add compressed memory and the model will fit" – turned out to be a dead end. It's a fix that solves a different problem (general system memory pressure from lots of small apps), not the specific problem of a large model weight file that needs to be loaded whole.

I still kept zRAM enabled, because it wasn't hurting anything, but I stopped expecting it to be the answer to memory shortages for model weights specifically. If your model doesn't fit, the fix is a smaller model, a more aggressive quantization, or more real memory allocated to the GPU frame buffer – not a compression layer.

Disk Swap

Next intuition that got corrected: disk swap.

Swap is when the operating system uses a chunk of your storage drive as overflow memory when RAM runs out. My assumption was that adding swap space would somehow help performance on borderline model sizes. It doesn't. If anything, the opposite happens.

When your system actually starts swapping model data to disk during inference, generation doesn't just slow down – it becomes borderline unusable. Disk read/write speeds, even on a fast SSD, are nowhere close to RAM speeds. Once your model starts getting paged out mid-generation, you're watching single words appear every few seconds.

So why keep swap enabled at all? Because of what happens without it. On a system running close to its memory ceiling, without any swap space configured, the Linux OOM-killer (out-of-memory killer, a system process that terminates programs to prevent a full crash when memory runs out) steps in and kills your inference process outright. No graceful slowdown, no warning – the Ollama process (the tool I used to run and serve local models) just dies mid-request.

Swap doesn't make things faster. It gives the system a buffer zone so that a borderline memory situation ends in "this response took a while" instead of "your entire session got killed." That distinction matters a lot when you're running something unattended, like a background agent that's supposed to answer a notification an hour from now.

Fixing the "Choppy" Generation with vm.swappiness

Even after sorting the frame buffer, swap, and giving up on zRAM as a memory-fitting trick, generation still felt off on some runs. Not slow exactly – choppy. Words would come out in bursts, then stall, then burst again.

The culprit was a Linux kernel setting called vm.swappiness. It controls how aggressively the kernel decides to move memory pages to swap versus keeping them in RAM. On distributions that ship with zRAM enabled out of the box, this value is often tuned aggressively high, on the assumption that swapping to compressed memory is cheap and should happen early and often.

That assumption is reasonable for a general desktop workload with lots of small background apps. It's a bad assumption for a single process trying to hold a multi-gigabyte model weight file in memory continuously during a generation run. The kernel was proactively swapping out parts of the model process before it strictly needed to, causing exactly that stutter I was seeing.

Manually lowering vm.swappiness calmed this down immediately. The system stopped being trigger-happy about paging things out, and generation went from choppy to smooth-but-steady. This isn't a dramatic fix like the frame buffer change, but it's the difference between a model that feels broken and one that feels merely slow.

If you're running into inconsistent generation speed – not consistently slow, but jumpy – check this setting before you start blaming the model itself.

Picking a Model

Here's the part of this whole experiment that surprised me the most, and it has nothing to do with hardware tuning.

My first instinct, like most people's, was: find the fastest model that still gives decent answers, and run that. Under that logic, Qwen 3.5 at 9B parameters (a smaller model, roughly 9 billion internal parameters) was the clear winner. It generated noticeably faster than the alternatives and the output quality was close enough to the bigger options for most everyday tasks.

I ended up not using it as my daily driver. Instead, I settled on Phi-4 at 14B parameters – a bigger, slower model. That decision only makes sense once you understand how I was actually planning to use this setup.

Why Speed Stopped Being the Priority

Most local LLM comparisons you'll find online treat generation speed – tokens per second, response latency – as close to the deciding factor. That makes sense if you're sitting in front of a chat window, typing a question, and waiting for the answer to stream back in real time. In that scenario, every extra second of latency is annoying.

My setup doesn't work that way. I run a small agent that receives notifications and responds to them asynchronously – I'm not staring at a terminal, waiting. I send a request, go do something else, and get a notification back when the model is done. Under that usage pattern, whether the response takes eight seconds or forty seconds barely registers. What registers is whether the answer is actually useful.

That reframing changes the entire calculus of model choice. If latency doesn't matter to your workflow, you can afford the "slower but sharper" model every time. Most benchmark charts don't split this out – they report tokens-per-second as if every use case is a synchronous chat, when for a background agent, this metric is close to irrelevant.

If you're building something similar – a background assistant, a batch summarizer, an overnight processing job – ask yourself honestly whether you're actually watching the screen wait for tokens, or just checking results later. The answer changes which model is "best" for your box.

The Reasoning Model Trap

I also tried a reasoning-focused model – DeepSeek R1 Distill, a variant built to work through problems using an explicit step-by-step chain of thought before giving a final answer. On paper, reasoning models should shine on a slower, asynchronous setup exactly like mine, since I'm not watching a timer.

In practice, on this hardware, it was close to a trap. The chain-of-thought process – the model essentially talking itself through a problem before answering – burns a disproportionate amount of time relative to the quality gain it delivers. On a beefy cloud GPU, that overhead is invisible because raw compute is abundant. On a handheld APU running quantized weights off shared memory, that same reasoning chain turns a ten-second task into a multi-minute one, and the final answer often wasn't meaningfully better than what a straightforward instruction-following model produced in a fraction of the time.

Phi-4, running as a plain instruction-following model without an explicit reasoning chain, gave better results faster on this specific hardware for the tasks I cared about. The lesson here isn't "reasoning models are bad" – it's that the value of extended reasoning scales with the compute you throw at it, and on constrained local hardware, that trade stops paying off.

Building a Two-Device Setup: Brain and Muscle

Once the model choice settled, the next problem was workflow. Sitting with a terminal window open on the Ally, waiting for prompts, wasn't sustainable. I wanted to write and manage things from my Mac and let the Ally handle the actual computation in the background.

The fix was to run Ollama's OpenAI-compatible API (a local server that mimics the same request format as OpenAI's cloud API, so existing tools can talk to it without modification) on the Ally, and point my client tools on the Mac at it over the local network.

This split turned the Ally into a dedicated inference server sitting quietly in another room, while the Mac stayed the interface I actually interact with – writing prompts, reviewing responses, managing the agent. No terminal window needed to stay open on the handheld itself; the server process just runs, and requests come in over the network whenever something needs generating.

The localhost Trap

One detail tripped me up longer than it should have: the difference between localhost and the device's actual local IP address on the network.

When the client and the server run on the same machine, pointing your tool at localhost works fine – it's shorthand for "this same computer." The moment you split the setup across two devices, that shorthand breaks. The Mac's localhost refers to the Mac itself, not the Ally sitting in another room. Requests sent to localhost from the Mac just fail silently or connect to nothing, because there's no server running there.

The fix was pointing the client at the Ally's actual IP address on the home network instead – something like its assigned local network address rather than the loopback address. Once I made that swap, requests started flowing correctly between devices.

This sounds obvious written out, but it's an easy trap to fall into if you copy configuration examples from single-device tutorials without adjusting them for a networked setup. If your client and server live on different machines, double-check this is set to an actual network address, not the loopback shortcut.

What Local Models Are Actually Good For

After all the tuning, here's where the honest evaluation starts. Not the hardware fixes – the actual usefulness of what I built.

A 16 GB memory ceiling, running quantized models in the 9B-14B range, is fine for a specific band of tasks:

• Drafting emails and short messages that just need a decent first pass

• Reviewing simple code snippets or spotting obvious bugs in short files

• Rough planning – outlining a day, breaking down a small task list

• Short, self-contained requests that don't require remembering a long backstory

Where it consistently disappointed me was anything requiring depth or a long working memory. The context window (how much text the model can "remember" at once during a conversation) on models this size is short enough that long back-and-forth conversations start losing earlier details. Feed it a full document and ask for detailed analysis, and it either forgets the beginning by the time it reaches the end, or gives a surface-level pass that reads fine at a glance but falls apart under scrutiny.

Anything resembling a real SEO strategy with competitor research, a full landing page write-up, or a multi-step code refactor across several files – the model either loses the thread partway through or produces something that needs heavy manual editing anyway. At that point, the time saved by generating locally gets eaten by the time spent fixing what came out.

The 2026-Feels-Like-Early-GPT-3 Feeling

There's a specific sensation running these small local models that I didn't expect: it feels like traveling backward in time. The responses come out confident, quick, and reasonably fluent – right up until the moment they aren't. Then you get a hallucinated fact stated with the same confident tone as a correct one, or the model quietly drops the actual constraint you gave it three messages ago.

That gap – fluent-sounding output paired with occasional confident wrongness – is close to what early large cloud models felt like, before the current generation of top-tier cloud models got noticeably better at staying on-task and self-correcting. Running a small local model in 2026 genuinely feels like stepping back into that earlier phase. It's a useful, honest way to frame the real distance between where local small models sit today and where the best cloud models already are. The gap isn't small, and pretending otherwise sets you up for a bad afternoon when you hand it something it can't handle.

So Is It Worth It?

Yes, but no.

Here's my actual verdict after weeks of running this setup daily.

A local server on hardware that would otherwise sit idle is worth doing – for the right slice of tasks. Zero ongoing subscription cost for routine generation, and anything sensitive never leaves my own network. For drafts, quick code checks, and planning notes, that trade is a clear win.

It doesn't replace the cloud. I still reach for a cloud model the moment a task needs real depth, long context, or multi-step precision. The local Ally setup earned a permanent spot in my workflow, but strictly as a supporting tool that handles the routine, lightweight stuff while the heavy lifting still goes to the cloud.

What I'd Tell Someone Starting This Today

If you're thinking about running an LLM locally on similar handheld or APU-based hardware, here's the condensed version of everything above, in the order I'd actually check things:

• Raise the UMA frame buffer in BIOS first, before anything else – this fixes the single biggest performance killer on shared-memory hardware

• Don't rely on zRAM to fit a larger model; already-quantized weights don't compress meaningfully further

• Keep disk swap enabled as a safety net against the OOM-killer, but don't expect it to help speed

• Lower vm.swappiness if generation feels choppy rather than uniformly slow

• Pick your model based on how you'll actually use it – synchronous chat needs speed, async/background agents can afford a slower, sharper model

• Skip reasoning models on constrained hardware unless you've confirmed the quality gain is worth the time cost for your specific tasks

• If splitting client and server across devices, point requests at the real local network IP, not localhost

• Set your expectations at "good for short, routine, private tasks" and you'll be happy with the result

Running an LLM locally on an ASUS ROG Ally taught me more about memory architecture and model selection than any spec sheet could have. The setup earned its place as a permanent background tool on my network. It's not a cloud replacement, and it was never going to be – but for the narrow band of tasks it's actually good at, it's already paying for the idle hardware it's running on.

DEV Community