From OOM to 262K Context: Running Qwen3-Coder 30B Locally on 8GB VRAM

Recently, I got tired of depending on paid cloud models for every coding experiment.

Cloud models are great. They are fast, convenient, and usually very capable.

But they also come with the usual baggage: cost, rate limits, internet dependency, privacy questions, and that small feeling that every serious coding workflow is rented from someone else's GPU.

So I started exploring local LLMs properly.

Not in the casual "can I run a small chat model?" way.

I wanted to know:

  • How capable are local coding models now?
  • Can they help with real code generation, debugging, refactoring, and repo Q&A?
  • Can they plug into editor agents through an OpenAI-compatible API?
  • And most importantly, what actually stops them from being useful?

After enough research, the answer became pretty obvious.

The wall is hardware.

More specifically: VRAM.

You can have the model file. You can have the runtime. You can have Docker. You can have the scripts. But once the model weights, routed experts, KV cache, context window, and compute buffers start fighting for GPU memory, everything gets painful very quickly.

That made me curious.

Was there a practical workaround?

Fortunately, I had a very normal consumer rig available.

The hardware:

  • GPU: NVIDIA RTX 3060 Ti
  • VRAM: 8 GB
  • OS: Windows
  • RAM: about 32 GB
  • CPU: Intel i5-14600KF

This is not a 4090 box. It is not a workstation. It is exactly the kind of machine where most people would say, "Just run a 7B model and move on."

So I turned it into a challenge:

Can I run a proper 30B coding model locally on consumer-grade hardware, with enough context to actually be useful?

The model target was ambitious:

Qwen3-Coder-30B-A3B-Instruct

Specifically, the GGUF from:

unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF

The quant I used:

Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf

That is a roughly 30B-parameter, coding-specialized MoE model. The important part is MoE: Mixture of Experts. The total parameter count is large, but only a small subset of expert weights is active per token; the A3B in the name refers to roughly 3B active parameters.

That changes the whole local inference strategy.

For a dense 30B model, 8 GB VRAM is not where I would start. For a compact MoE coding model, the question becomes more interesting:

Can I keep the always-active parts fast, keep the routed experts mostly in system RAM, and still get usable speed?

Short answer: yes.

Long answer: it took a bunch of false starts.

First, the boring audit

Before downloading anything huge, I checked the machine.

This sounds obvious, but local AI setup gets messy fast if you skip it.

I verified:

  • Windows version
  • GPU model
  • NVIDIA driver
  • nvidia-smi in PowerShell
  • WSL2
  • Docker Desktop
  • Docker GPU passthrough
  • CUDA container access to the GPU
  • system RAM
  • disk space
  • CPU
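
Most of that list is one-liners in PowerShell. A representative sketch of the spot checks (standard commands, shown as illustration rather than a full script):

```powershell
nvidia-smi                                  # driver version, GPU model, VRAM
wsl --status                                # WSL2 installed and default
docker info                                 # Docker Desktop running, resources visible
Get-CimInstance Win32_ComputerSystem |
  Select-Object TotalPhysicalMemory         # system RAM
Get-PSDrive C | Select-Object Free          # free disk space
```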

Docker GPU passthrough worked:

docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

That meant the clean first path was:

Docker + llama.cpp CUDA server

The initial server image:

ghcr.io/ggml-org/llama.cpp:server-cuda

I also checked llama-server --help before trusting any command from the internet.

That became a recurring theme.

Do not assume the flag exists. Ask the binary.

Downloading the model

The target model repo was:

unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF

I verified the actual file name before downloading:

Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf
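
One way to pull exactly that file into the project tree is the Hugging Face CLI (a sketch; assumes huggingface-cli is installed via pip):

```powershell
# Download one specific GGUF from the repo into the local models folder.
huggingface-cli download unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF `
  Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf `
  --local-dir .\models\qwen3-coder-30b-a3b
```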

The downloaded file size was:

17,665,334,432 bytes

Everything went under one local project folder:

local-qwen-coder/
  models/
  scripts/
  configs/
  docs/

No global mystery folder. No "where did this 17 GB file go?" moment.

Small win.

First real blocker: Docker memory

The first serious issue was not the GPU.

It was Docker memory.

Windows had about 32 GB RAM available, but Docker Desktop was exposing only about 16 GB RAM plus 4 GB swap to its Linux VM.

That mattered because my first instinct was to use:

--no-mmap
--mlock

That is a good idea when you want the model loaded into RAM instead of page-faulting from disk later.

Except the container did not have enough RAM.

It got killed.

Exit code:

137

Docker inspect confirmed:

OOMKilled=true

So the first fix was not glamorous:

Keep mmap enabled for the Docker path.

The "technically better" flag was wrong for the actual container memory limit.
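
For completeness, the other way out is raising the VM ceiling in .wslconfig, which Docker Desktop's WSL2 backend inherits after a wsl --shutdown. A sketch, with illustrative values:

```powershell
# Sketch: raise the WSL2 memory ceiling that Docker Desktop inherits.
# Takes effect after `wsl --shutdown`; 28 GB / 8 GB are illustrative values.
@"
[wsl2]
memory=28GB
swap=8GB
"@ | Set-Content "$env:UserProfile\.wslconfig"
```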

Getting a stable stock llama.cpp server

With stock llama.cpp Docker, the model loaded and served an OpenAI-compatible endpoint.

Base URL:

http://127.0.0.1:8080/v1

The important MoE flag was:

--cpu-moe

This keeps MoE expert weights on CPU.
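
For reference, the stock launch had roughly this shape (a sketch: the volume path and context size are placeholders, not the exact command I ran):

```powershell
# Inside the container the server must bind 0.0.0.0 for the port mapping to work.
docker run --rm --gpus all -p 8080:8080 `
  -v "${PWD}\models:/models" `
  ghcr.io/ggml-org/llama.cpp:server-cuda `
  -m /models/qwen3-coder-30b-a3b/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf `
  --host 0.0.0.0 --port 8080 `
  --jinja --cpu-moe --ctx-size 32768
```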

The model became usable, but not fast enough yet.

Baseline:

| Mode | Prompt eval | Generation |
| --- | --- | --- |
| --cpu-moe | ~2.78 tok/s | ~13.38 tok/s |

Generation was okay. Prompt eval was painful.

Then came the next knob:

--n-cpu-moe N

This keeps the expert weights of the first N MoE layers on CPU, so the remaining layers' experts can live on GPU.

Lower N means more expert weights resident on GPU: usually more speed, but less free VRAM.

So I benchmarked it.
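
The sweep itself was mechanical: relaunch with each value, run the same prompt, note the numbers. Roughly this loop (a hypothetical sketch, not the exact script I used):

```powershell
foreach ($n in 48, 46, 44, 42, 40, 38) {
    Write-Host "=== --n-cpu-moe $n ==="
    # Relaunch llama-server with --n-cpu-moe $n here, send a fixed prompt,
    # then record VRAM residency alongside the server's printed timings.
    nvidia-smi --query-gpu=memory.used,memory.free --format=csv,noheader
}
```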

MoE offload tuning

Here are the useful results:

| Mode | VRAM used | VRAM free | Prompt eval | Generation |
| --- | --- | --- | --- | --- |
| --cpu-moe | 4388 MiB | 3637 MiB | 2.78 tok/s | 13.38 tok/s |
| --n-cpu-moe 48 | 4392 MiB | 3633 MiB | 2.51 tok/s | 13.83 tok/s |
| --n-cpu-moe 46 | 5224 MiB | 2801 MiB | 6.03 tok/s | 18.75 tok/s |
| --n-cpu-moe 44 | 5893 MiB | 2132 MiB | 38.36 tok/s | 29.40 tok/s |
| --n-cpu-moe 42 | 6568 MiB | 1457 MiB | 44.49 tok/s | 30.26 tok/s |
| --n-cpu-moe 40 | 7265 MiB | 760 MiB | 51.63 tok/s | 32.49 tok/s |
| --n-cpu-moe 38 | 7664 MiB | 361 MiB | 53.14 tok/s | 33.64 tok/s |

The fastest tested value was:

--n-cpu-moe 38

But it only left around 361 MiB free VRAM.

Too tight.

The practical winner was:

--n-cpu-moe 40

That gave around 32.49 tok/s generation with about 760 MiB free VRAM.

At this point, I had a good local coding backend.

But I did not have the thing I actually wanted.

The real target: 262K context

Qwen3-Coder-30B-A3B supports long context natively.

The model metadata showed:

n_ctx_train = 262144

So the question became:

Can I actually run it at 262K context on 8 GB VRAM?

The stock Docker build could not get me there in the way I wanted.

I could lower KV cache precision using normal llama.cpp types like:

q8_0
q4_0
iq4_nl
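
Those plug into the stock server as KV-cache flags, appended to the launch command (a sketch; llama.cpp needs flash attention enabled to quantize the V cache):

```powershell
# Appended to the server launch command (not standalone):
--flash-attn on `
--cache-type-k q8_0 `
--cache-type-v q8_0
```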

But the video that originally sent me down this path was talking about TurboQuant.

That was the key difference.

And this is where I almost fooled myself.

I was not actually using TurboQuant yet

I checked the stock Docker image:

docker run --rm --gpus all ghcr.io/ggml-org/llama.cpp:server-cuda --help
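
Filtering the help output makes the answer unambiguous (a sketch):

```powershell
# Print only the lines documenting the KV cache type flags.
docker run --rm --gpus all ghcr.io/ggml-org/llama.cpp:server-cuda --help 2>&1 |
  Select-String "cache-type"
```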

The supported KV cache types were:

f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1

No turbo3.

No turbo4.

No tbq3_0.

No tbq4_0.

So the answer was clear:

The stock runtime was not doing TurboQuant.

TurboQuant is not model-weight quantization. It does not require changing the GGUF model file.

It changes how the runtime stores the KV cache.

Same model.

Different runtime.

Different cache format.

That was the real pivot.

Finding a TurboQuant runtime

I found a Windows CUDA runtime build:

atomicmilkshake/llama-cpp-turboquant-binaries

The downloaded file:

llama-turboquant-triattention-win-cu13-x64.zip

I extracted it under:

runtimes/turboquant/win-cu13

Then I tried:

.\llama-server.exe --help

It failed instantly.

No useful output.

The process exit code was:

0xc0000135

That usually means a missing DLL on Windows.

The README confirmed the likely issue:

cublasLt64_13.dll

The build needed the CUDA 13 cuBLASLt runtime.

I did not want to install the full CUDA Toolkit globally just for one DLL.

So I pulled the official NVIDIA cuBLAS wheel:

python -m pip download nvidia-cublas==13.4.0.1 --only-binary=:all:

Then I extracted:

cublasLt64_13.dll

and copied it into the local runtime folder next to llama-server.exe.
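
Spelled out, the whole dance is a few lines of PowerShell (a sketch: the exact wheel filename and internal layout may differ, hence the wildcard and the recursive search):

```powershell
# Fetch the wheel without installing anything globally.
python -m pip download nvidia-cublas==13.4.0.1 --only-binary=:all:

# Wheels are plain zip archives; Expand-Archive wants a .zip extension.
$wheel = Get-ChildItem -Filter "nvidia_cublas*.whl" | Select-Object -First 1
Copy-Item $wheel.FullName wheel.zip
Expand-Archive wheel.zip -DestinationPath wheel-contents

# Find the DLL wherever it sits inside the wheel and drop it
# next to llama-server.exe.
Get-ChildItem wheel-contents -Recurse -Filter "cublasLt64_13.dll" |
  Copy-Item -Destination .\runtimes\turboquant\win-cu13\
```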

After that:

.\llama-server.exe --help

worked.

And this time the cache types included:

turbo2, turbo3, turbo4

for both:

--cache-type-k
--cache-type-v

That was the moment where the setup changed from "normal llama.cpp tuning" to "actual TurboQuant path."

The final 262K launch

The final command shape was:

.\runtimes\turboquant\win-cu13\llama-server.exe `
  -m .\models\qwen3-coder-30b-a3b\Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf `
  --alias qwen3-coder-30b-a3b-turbo-262k `
  --host 127.0.0.1 `
  --port 8080 `
  --jinja `
  --gpu-layers all `
  --cpu-moe `
  --flash-attn on `
  --ctx-size 262144 `
  --cache-type-k turbo4 `
  --cache-type-v turbo3 `
  --parallel 1 `
  --batch-size 256 `
  --ubatch-size 64 `
  --temp 0.3 `
  --top-p 0.8 `
  --top-k 20 `
  --repeat-penalty 1.05 `
  --fit off `
  --cache-ram 0 `
  --no-mmap `
  --mlock

I forced:

--fit off

because I did not want llama.cpp quietly shrinking the context and pretending everything was fine.

If it loaded, it had to really load at 262144.

And it did.

The proof

The runtime logs showed:

llama_context: n_ctx         = 262144
llama_context: n_ctx_seq     = 262144
llama_context: n_batch       = 256
llama_context: n_ubatch      = 64

The KV cache line was the real proof:

llama_kv_cache: size = 5664.00 MiB (262144 cells, 48 layers, 1/1 seqs), K (turbo4): 3264.00 MiB, V (turbo3): 2400.00 MiB

VRAM after load:

7525 MiB used
500 MiB free

Very tight.

But loaded.

Then I sent a small coding prompt through the OpenAI-compatible endpoint.
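
In PowerShell, that kind of smoke test is a few lines (a sketch; the model field must match the --alias from the launch command):

```powershell
$body = @{
    model    = "qwen3-coder-30b-a3b-turbo-262k"
    messages = @(@{ role = "user"; content = "Write a function that reverses a string." })
} | ConvertTo-Json -Depth 5

$resp = Invoke-RestMethod -Uri "http://127.0.0.1:8080/v1/chat/completions" `
    -Method Post -ContentType "application/json" -Body $body
$resp.choices[0].message.content
```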

It answered.

Timings:

prompt eval time = 1125.54 ms / 46 tokens = 40.87 tokens per second
eval time        = 3672.56 ms / 107 tokens = 29.13 tokens per second

That was the win.

Qwen3-Coder-30B-A3B.

262K context.

8 GB VRAM.

Local endpoint.

Same model file.

TurboQuant KV cache.

The repeatable script

I wrapped the TurboQuant launch into:

scripts/run-qwen-coder-turboquant.ps1

So the repeatable command is:

.\scripts\run-qwen-coder-turboquant.ps1 -Replace
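
Internally the wrapper is thin. Roughly this shape (a sketch of the idea, not the exact script; the -Replace switch just clears any previous server before relaunching):

```powershell
param([switch]$Replace)

# If -Replace was passed, stop any llama-server that is already running.
if ($Replace) {
    Get-Process llama-server -ErrorAction SilentlyContinue | Stop-Process -Force
}

# Relaunch with the verified 262K TurboQuant profile
# (same flags as the full launch command shown earlier).
& .\runtimes\turboquant\win-cu13\llama-server.exe `
    -m .\models\qwen3-coder-30b-a3b\Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf `
    --ctx-size 262144 --cache-type-k turbo4 --cache-type-v turbo3 `
    --cpu-moe --flash-attn on --fit off --no-mmap --mlock
```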

The stock Docker fallback still exists:

.\scripts\run-qwen-coder-docker.ps1 -Profile daily-fast

The Docker route is useful for a safer daily profile.

The TurboQuant route is the full-context profile.

Important caveats

This is not magic.

The 262K profile is VRAM-tight.

It leaves roughly 500 MiB free on my RTX 3060 Ti. That means:

  • single client only
  • do not run multiple editor agents at once
  • close GPU-heavy apps
  • expect this to be less forgiving than the 32K profile

Also, I have not yet proven that this setup is great at real-world coding tasks.

The infrastructure works.

The endpoint works.

The context loads.

The smoke test passes.

But the next test is actual development work:

  • Can it refactor a real repo?
  • Can it debug Unity C# sanely?
  • Can it handle multi-file context without drifting?
  • Can it stay stable across longer sessions?

That is the next milestone.

What I learned

The big lesson is that local AI infra is not just:

download model
run server
profit

The defaults are often the bottleneck.

In this setup:

  • MoE placement mattered.
  • Docker memory limits mattered.
  • KV cache format mattered.
  • Runtime build mattered.
  • llama-server --help mattered a lot.

The 30B model was not the whole problem.

The runtime strategy was.

And sometimes the difference between "impossible" and "working" is one missing DLL plus the right KV cache type.

Repo

I published the setup as a GitHub repo with:

  • launch scripts
  • benchmark notes
  • troubleshooting docs
  • client settings
  • reproducible setup notes

GitHub link:

UpayanGhosh / local-qwen-coder-turboquant

Local Qwen3-Coder 30B TurboQuant setup for 8GB VRAM coding workflows

Local Qwen Coder TurboQuant Setup

Practical Windows setup notes and scripts for running Qwen3-Coder-30B-A3B-Instruct as a local coding-only OpenAI-compatible backend on an 8 GB NVIDIA GPU.

This repo documents the journey from a stable stock llama.cpp Docker setup to a full-context TurboQuant KV-cache runtime:

  • RTX 3060 Ti, 8 GB VRAM
  • Windows
  • Qwen3-Coder-30B-A3B-Instruct GGUF
  • MoE expert CPU/GPU residency tuning
  • OpenAI-compatible local endpoint
  • Verified 262144 context with TurboQuant KV cache

What Is Included

  • PowerShell scripts for launching and testing the backend
  • Client settings for Cline, Continue, Roo Code, OpenCode, and generic OpenAI-compatible clients
  • Benchmark notes
  • TurboQuant research and troubleshooting notes
  • LinkedIn post draft documenting the build story

What Is Not Included

This repo intentionally does not track:

  • GGUF model files
  • CUDA/runtime DLLs
  • downloaded wheels/zips
  • logs
  • local caches

Those files are large and/or machine-specific. See .gitignore.

Key Result

Verified TurboQuant profile:

Context: 262144
KV cache: K=turbo4, V=turbo3
VRAM: ~7525 MiB used / ~500 MiB free


Closing thought

This started as:

"Can I make a useful local coding backend?"

Then it became:

"Can I get the full 262K context working on 8 GB VRAM?"

The first version merely ran.

The final version actually hit the target.

I am calling that a win.
