From OOM to 262K Context: Running Qwen3-Coder 30B Locally on 8GB VRAM

Recently, I got tired of depending on paid cloud models for every coding experiment.

Cloud models are great. They are fast, convenient, and usually very capable.

But they also come with the usual baggage: cost, rate limits, internet dependency, privacy questions, and that small feeling that every serious coding workflow is rented from someone else's GPU.

So I started exploring local LLMs properly.

Not in the casual "can I run a small chat model?" way.

I wanted to know:

  • How capable are local coding models now?
  • Can they help with real code generation, debugging, refactoring, and repo Q&A?
  • Can they plug into editor agents through an OpenAI-compatible API?
  • And most importantly, what actually stops them from being useful?

After enough research, the answer became pretty obvious.

The wall is hardware.

More specifically: VRAM.

You can have the model file. You can have the runtime. You can have Docker. You can have the scripts. But once the model weights, routed experts, KV cache, context window, and compute buffers start fighting for GPU memory, everything gets painful very quickly.

That made me curious.

Was there a practical workaround?

Fortunately, I had a very normal consumer rig available.

The hardware:

  • GPU: NVIDIA RTX 3060 Ti
  • VRAM: 8 GB
  • OS: Windows
  • RAM: about 32 GB
  • CPU: Intel i5-14600KF

This is not a 4090 box. It is not a workstation. It is exactly the kind of machine where most people would say, "Just run a 7B model and move on."

So I turned it into a challenge:

Can I run a proper 30B coding model locally on consumer-grade hardware, with enough context to actually be useful?

The model target was ambitious:

Qwen3-Coder-30B-A3B-Instruct

Specifically, the GGUF from:

unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF

The quant I used:

Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf

That is a roughly 30B-parameter, coding-specialized MoE model. The important part is MoE: Mixture of Experts. The total parameter count is large, but only a small subset of expert weights is active per token; the A3B in the name refers to roughly 3B active parameters.

That changes the whole local inference strategy.

For a dense 30B model, 8 GB VRAM is not where I would start. For a compact MoE coding model, the question becomes more interesting:

Can I keep the always-active parts fast, keep the routed experts mostly in system RAM, and still get usable speed?

Short answer: yes.

Long answer: it took a bunch of false starts.

First, the boring audit

Before downloading anything huge, I checked the machine.

This sounds obvious, but local AI setup gets messy fast if you skip it.

I verified:

  • Windows version
  • GPU model
  • NVIDIA driver
  • nvidia-smi in PowerShell
  • WSL2
  • Docker Desktop
  • Docker GPU passthrough
  • CUDA container access to the GPU
  • system RAM
  • disk space
  • CPU
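
Most of that list is one-liners in PowerShell. A representative sketch of the spot checks (standard commands, shown as illustration rather than a full script):

```powershell
nvidia-smi                                  # driver version, GPU model, VRAM
wsl --status                                # WSL2 installed and default
docker info                                 # Docker Desktop running, resources visible
Get-CimInstance Win32_ComputerSystem |
  Select-Object TotalPhysicalMemory         # system RAM
Get-PSDrive C | Select-Object Free          # free disk space
```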

Docker GPU passthrough worked:

docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

That meant the clean first path was:

Docker + llama.cpp CUDA server

The initial server image:

ghcr.io/ggml-org/llama.cpp:server-cuda

I also checked llama-server --help before trusting any command from the internet.

That became a recurring theme.

Do not assume the flag exists. Ask the binary.

Downloading the model

The target model repo was:

unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF

I verified the actual file name before downloading:

Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf
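
One way to pull exactly that file into the project tree is the Hugging Face CLI (a sketch; assumes huggingface-cli is installed via pip):

```powershell
# Download one specific GGUF from the repo into the local models folder.
huggingface-cli download unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF `
  Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf `
  --local-dir .\models\qwen3-coder-30b-a3b
```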

The downloaded file size was:

17,665,334,432 bytes

Everything went under one local project folder:

local-qwen-coder/
  models/
  scripts/
  configs/
  docs/

No global mystery folder. No "where did this 17 GB file go?" moment.

Small win.

First real blocker: Docker memory

The first serious issue was not the GPU.

It was Docker memory.

Windows had about 32 GB RAM available, but Docker Desktop was exposing only about 16 GB RAM plus 4 GB swap to its Linux VM.

That mattered because my first instinct was to use:

--no-mmap
--mlock

That is a good idea when you want the model loaded into RAM instead of page-faulting from disk later.

Except the container did not have enough RAM.

It got killed.

Exit code:

137

Docker inspect confirmed:

OOMKilled=true

So the first fix was not glamorous:

Keep mmap enabled for the Docker path.

The "technically better" flag was wrong for the actual container memory limit.
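
For completeness, the other way out is raising the VM ceiling in .wslconfig, which Docker Desktop's WSL2 backend inherits after a wsl --shutdown. A sketch, with illustrative values:

```powershell
# Sketch: raise the WSL2 memory ceiling that Docker Desktop inherits.
# Takes effect after `wsl --shutdown`; 28 GB / 8 GB are illustrative values.
@"
[wsl2]
memory=28GB
swap=8GB
"@ | Set-Content "$env:UserProfile\.wslconfig"
```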

Getting a stable stock llama.cpp server

With stock llama.cpp Docker, the model loaded and served an OpenAI-compatible endpoint.

Base URL:

http://127.0.0.1:8080/v1

The important MoE flag was:

--cpu-moe

This keeps MoE expert weights on CPU.
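
For reference, the stock launch had roughly this shape (a sketch: the volume path and context size are placeholders, not the exact command I ran):

```powershell
# Inside the container the server must bind 0.0.0.0 for the port mapping to work.
docker run --rm --gpus all -p 8080:8080 `
  -v "${PWD}\models:/models" `
  ghcr.io/ggml-org/llama.cpp:server-cuda `
  -m /models/qwen3-coder-30b-a3b/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf `
  --host 0.0.0.0 --port 8080 `
  --jinja --cpu-moe --ctx-size 32768
```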

The model became usable, but not fast enough yet.

Baseline:

| Mode | Prompt eval | Generation |
| --- | --- | --- |
| --cpu-moe | ~2.78 tok/s | ~13.38 tok/s |

Generation was okay. Prompt eval was painful.

Then came the next knob:

--n-cpu-moe N

This keeps the expert weights of the first N MoE layers on CPU, so the remaining layers' experts can live on GPU.

Lower N means more expert weights resident on GPU: usually more speed, but less free VRAM.

So I benchmarked it.
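
The sweep itself was mechanical: relaunch with each value, run the same prompt, note the numbers. Roughly this loop (a hypothetical sketch, not the exact script I used):

```powershell
foreach ($n in 48, 46, 44, 42, 40, 38) {
    Write-Host "=== --n-cpu-moe $n ==="
    # Relaunch llama-server with --n-cpu-moe $n here, send a fixed prompt,
    # then record VRAM residency alongside the server's printed timings.
    nvidia-smi --query-gpu=memory.used,memory.free --format=csv,noheader
}
```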

MoE offload tuning

Here are the useful results:

| Mode | VRAM used | VRAM free | Prompt eval | Generation |
| --- | --- | --- | --- | --- |
| --cpu-moe | 4388 MiB | 3637 MiB | 2.78 tok/s | 13.38 tok/s |
| --n-cpu-moe 48 | 4392 MiB | 3633 MiB | 2.51 tok/s | 13.83 tok/s |
| --n-cpu-moe 46 | 5224 MiB | 2801 MiB | 6.03 tok/s | 18.75 tok/s |
| --n-cpu-moe 44 | 5893 MiB | 2132 MiB | 38.36 tok/s | 29.40 tok/s |
| --n-cpu-moe 42 | 6568 MiB | 1457 MiB | 44.49 tok/s | 30.26 tok/s |
| --n-cpu-moe 40 | 7265 MiB | 760 MiB | 51.63 tok/s | 32.49 tok/s |
| --n-cpu-moe 38 | 7664 MiB | 361 MiB | 53.14 tok/s | 33.64 tok/s |

The fastest tested value was:

--n-cpu-moe 38

But it only left around 361 MiB free VRAM.

Too tight.

The practical winner was:

--n-cpu-moe 40

That gave around 32.49 tok/s generation with about 760 MiB free VRAM.

At this point, I had a good local coding backend.

But I did not have the thing I actually wanted.

The real target: 262K context

Qwen3-Coder-30B-A3B supports long context natively.

The model metadata showed:

n_ctx_train = 262144

So the question became:

Can I actually run it at 262K context on 8 GB VRAM?

The stock Docker build could not get me there in the way I wanted.

I could lower KV cache precision using normal llama.cpp types like:

q8_0
q4_0
iq4_nl
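
Those plug into the stock server as KV-cache flags, appended to the launch command (a sketch; llama.cpp needs flash attention enabled to quantize the V cache):

```powershell
# Appended to the server launch command (not standalone):
--flash-attn on `
--cache-type-k q8_0 `
--cache-type-v q8_0
```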

But the video that originally sent me down this path was talking about TurboQuant.

That was the key difference.

And this is where I almost fooled myself.

I was not actually using TurboQuant yet

I checked the stock Docker image:

docker run --rm --gpus all ghcr.io/ggml-org/llama.cpp:server-cuda --help
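
Filtering the help output makes the answer unambiguous (a sketch):

```powershell
# Print only the lines documenting the KV cache type flags.
docker run --rm --gpus all ghcr.io/ggml-org/llama.cpp:server-cuda --help 2>&1 |
  Select-String "cache-type"
```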

The supported KV cache types were:

f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1

No turbo3.

No turbo4.

No tbq3_0.

No tbq4_0.

So the answer was clear:

The stock runtime was not doing TurboQuant.

TurboQuant is not model-weight quantization. It does not require changing the GGUF model file.

It changes how the runtime stores the KV cache.

Same model.

Different runtime.

Different cache format.

That was the real pivot.

Finding a TurboQuant runtime

I found a Windows CUDA runtime build:

atomicmilkshake/llama-cpp-turboquant-binaries

The downloaded file:

llama-turboquant-triattention-win-cu13-x64.zip

I extracted it under:

runtimes/turboquant/win-cu13

Then I tried:

.\llama-server.exe --help

It failed instantly.

No useful output.

The process exit code was:

0xc0000135

That usually means a missing DLL on Windows.

The README confirmed the likely issue:

cublasLt64_13.dll

The build needed the CUDA 13 cuBLASLt runtime.

I did not want to install the full CUDA Toolkit globally just for one DLL.

So I pulled the official NVIDIA cuBLAS wheel:

python -m pip download nvidia-cublas==13.4.0.1 --only-binary=:all:

Then I extracted:

cublasLt64_13.dll

and copied it into the local runtime folder next to llama-server.exe.
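
Spelled out, the whole dance is a few lines of PowerShell (a sketch: the exact wheel filename and internal layout may differ, hence the wildcard and the recursive search):

```powershell
# Fetch the wheel without installing anything globally.
python -m pip download nvidia-cublas==13.4.0.1 --only-binary=:all:

# Wheels are plain zip archives; Expand-Archive wants a .zip extension.
$wheel = Get-ChildItem -Filter "nvidia_cublas*.whl" | Select-Object -First 1
Copy-Item $wheel.FullName wheel.zip
Expand-Archive wheel.zip -DestinationPath wheel-contents

# Find the DLL wherever it sits inside the wheel and drop it
# next to llama-server.exe.
Get-ChildItem wheel-contents -Recurse -Filter "cublasLt64_13.dll" |
  Copy-Item -Destination .\runtimes\turboquant\win-cu13\
```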

After that:

.\llama-server.exe --help

worked.

And this time the cache types included:

turbo2, turbo3, turbo4

for both:

--cache-type-k
--cache-type-v

That was the moment where the setup changed from "normal llama.cpp tuning" to "actual TurboQuant path."

The final 262K launch

The final command shape was:

.\runtimes\turboquant\win-cu13\llama-server.exe `
  -m .\models\qwen3-coder-30b-a3b\Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf `
  --alias qwen3-coder-30b-a3b-turbo-262k `
  --host 127.0.0.1 `
  --port 8080 `
  --jinja `
  --gpu-layers all `
  --cpu-moe `
  --flash-attn on `
  --ctx-size 262144 `
  --cache-type-k turbo4 `
  --cache-type-v turbo3 `
  --parallel 1 `
  --batch-size 256 `
  --ubatch-size 64 `
  --temp 0.3 `
  --top-p 0.8 `
  --top-k 20 `
  --repeat-penalty 1.05 `
  --fit off `
  --cache-ram 0 `
  --no-mmap `
  --mlock

I forced:

--fit off

because I did not want llama.cpp quietly shrinking the context and pretending everything was fine.

If it loaded, it had to really load at 262144.

And it did.

The proof

The runtime logs showed:

llama_context: n_ctx         = 262144
llama_context: n_ctx_seq     = 262144
llama_context: n_batch       = 256
llama_context: n_ubatch      = 64

The KV cache line was the real proof:

llama_kv_cache: size = 5664.00 MiB (262144 cells, 48 layers, 1/1 seqs), K (turbo4): 3264.00 MiB, V (turbo3): 2400.00 MiB

VRAM after load:

7525 MiB used
500 MiB free

Very tight.

But loaded.

Then I sent a small coding prompt through the OpenAI-compatible endpoint.
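
In PowerShell, that kind of smoke test is a few lines (a sketch; the model field must match the --alias from the launch command):

```powershell
$body = @{
    model    = "qwen3-coder-30b-a3b-turbo-262k"
    messages = @(@{ role = "user"; content = "Write a function that reverses a string." })
} | ConvertTo-Json -Depth 5

$resp = Invoke-RestMethod -Uri "http://127.0.0.1:8080/v1/chat/completions" `
    -Method Post -ContentType "application/json" -Body $body
$resp.choices[0].message.content
```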

It answered.

Timings:

prompt eval time = 1125.54 ms / 46 tokens = 40.87 tokens per second
eval time        = 3672.56 ms / 107 tokens = 29.13 tokens per second

That was the win.

Qwen3-Coder-30B-A3B.

262K context.

8 GB VRAM.

Local endpoint.

Same model file.

TurboQuant KV cache.

The repeatable script

I wrapped the TurboQuant launch into:

scripts/run-qwen-coder-turboquant.ps1

So the repeatable command is:

.\scripts\run-qwen-coder-turboquant.ps1 -Replace
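
Internally the wrapper is thin. Roughly this shape (a sketch of the idea, not the exact script; the -Replace switch just clears any previous server before relaunching):

```powershell
param([switch]$Replace)

# If -Replace was passed, stop any llama-server that is already running.
if ($Replace) {
    Get-Process llama-server -ErrorAction SilentlyContinue | Stop-Process -Force
}

# Relaunch with the verified 262K TurboQuant profile
# (same flags as the full launch command shown earlier).
& .\runtimes\turboquant\win-cu13\llama-server.exe `
    -m .\models\qwen3-coder-30b-a3b\Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf `
    --ctx-size 262144 --cache-type-k turbo4 --cache-type-v turbo3 `
    --cpu-moe --flash-attn on --fit off --no-mmap --mlock
```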

The stock Docker fallback still exists:

.\scripts\run-qwen-coder-docker.ps1 -Profile daily-fast

The Docker route is useful for a safer daily profile.

The TurboQuant route is the full-context profile.

Important caveats

This is not magic.

The 262K profile is VRAM-tight.

It leaves roughly 500 MiB free on my RTX 3060 Ti. That means:

  • single client only
  • do not run multiple editor agents at once
  • close GPU-heavy apps
  • expect this to be less forgiving than the 32K profile

Also, I have not yet proven that this setup is great at real-world coding tasks.

The infrastructure works.

The endpoint works.

The context loads.

The smoke test passes.

But the next test is actual development work:

  • Can it refactor a real repo?
  • Can it debug Unity C# sanely?
  • Can it handle multi-file context without drifting?
  • Can it stay stable across longer sessions?

That is the next milestone.

What I learned

The big lesson is that local AI infra is not just:

download model
run server
profit

The defaults are often the bottleneck.

In this setup:

  • MoE placement mattered.
  • Docker memory limits mattered.
  • KV cache format mattered.
  • Runtime build mattered.
  • llama-server --help mattered a lot.

The 30B model was not the whole problem.

The runtime strategy was.

And sometimes the difference between "impossible" and "working" is one missing DLL plus the right KV cache type.

Repo

I published the setup as a GitHub repo with:

  • launch scripts
  • benchmark notes
  • troubleshooting docs
  • client settings
  • reproducible setup notes

GitHub link:

UpayanGhosh / local-qwen-coder-turboquant

Local Qwen3-Coder 30B TurboQuant setup for 8GB VRAM coding workflows

Local Qwen Coder TurboQuant Setup

Practical Windows setup notes and scripts for running Qwen3-Coder-30B-A3B-Instruct as a local coding-only OpenAI-compatible backend on an 8 GB NVIDIA GPU.

This repo documents the journey from a stable stock llama.cpp Docker setup to a full-context TurboQuant KV-cache runtime:

  • RTX 3060 Ti, 8 GB VRAM
  • Windows
  • Qwen3-Coder-30B-A3B-Instruct GGUF
  • MoE expert CPU/GPU residency tuning
  • OpenAI-compatible local endpoint
  • Verified 262144 context with TurboQuant KV cache

What Is Included

  • PowerShell scripts for launching and testing the backend
  • Client settings for Cline, Continue, Roo Code, OpenCode, and generic OpenAI-compatible clients
  • Benchmark notes
  • TurboQuant research and troubleshooting notes
  • LinkedIn post draft documenting the build story

What Is Not Included

This repo intentionally does not track:

  • GGUF model files
  • CUDA/runtime DLLs
  • downloaded wheels/zips
  • logs
  • local caches

Those files are large and/or machine-specific. See .gitignore.

Key Result

Verified TurboQuant profile:

Context: 262144
KV cache: K=turbo4, V=turbo3
VRAM: ~7525 MiB used / ~500 MiB free


Closing thought

This started as:

"Can I make a useful local coding backend?"

Then it became:

"Can I get the full 262K context working on 8 GB VRAM?"

The first version merely ran.

The final version actually hit the target.

I am calling that a win.
