<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Upayan Ghosh</title>
    <description>The latest articles on DEV Community by Upayan Ghosh (@upayanghosh).</description>
    <link>https://dev.to/upayanghosh</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3914189%2Fd9fbab6b-8660-4a21-b03a-c1362c2f745f.jpeg</url>
      <title>DEV Community: Upayan Ghosh</title>
      <link>https://dev.to/upayanghosh</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/upayanghosh"/>
    <language>en</language>
    <item>
      <title>From OOM to 262K Context: Running Qwen3-Coder 30B Locally on 8GB VRAM</title>
      <dc:creator>Upayan Ghosh</dc:creator>
      <pubDate>Tue, 05 May 2026 14:18:02 +0000</pubDate>
      <link>https://dev.to/upayanghosh/from-oom-to-262k-context-running-qwen3-coder-30b-locally-on-8gb-vram-1ej1</link>
      <guid>https://dev.to/upayanghosh/from-oom-to-262k-context-running-qwen3-coder-30b-locally-on-8gb-vram-1ej1</guid>
      <description>&lt;p&gt;Recently, I got tired of depending on paid cloud models for every coding experiment.&lt;/p&gt;

&lt;p&gt;Cloud models are great. They are fast, convenient, and usually very capable.&lt;/p&gt;

&lt;p&gt;But they also come with the usual baggage: cost, rate limits, internet dependency, privacy questions, and that small feeling that every serious coding workflow is rented from someone else's GPU.&lt;/p&gt;

&lt;p&gt;So I started exploring local LLMs properly.&lt;/p&gt;

&lt;p&gt;Not in the casual "can I run a small chat model?" way.&lt;/p&gt;

&lt;p&gt;I wanted to know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How capable are local coding models now?&lt;/li&gt;
&lt;li&gt;Can they help with real code generation, debugging, refactoring, and repo Q&amp;amp;A?&lt;/li&gt;
&lt;li&gt;Can they plug into editor agents through an OpenAI-compatible API?&lt;/li&gt;
&lt;li&gt;And most importantly, what actually stops them from being useful?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After enough research, the answer became pretty obvious.&lt;/p&gt;

&lt;p&gt;The wall is hardware.&lt;/p&gt;

&lt;p&gt;More specifically: VRAM.&lt;/p&gt;

&lt;p&gt;You can have the model file. You can have the runtime. You can have Docker. You can have the scripts. But once the model weights, routed experts, KV cache, context window, and compute buffers start fighting for GPU memory, everything gets painful very quickly.&lt;/p&gt;

&lt;p&gt;That made me curious.&lt;/p&gt;

&lt;p&gt;Was there a practical workaround?&lt;/p&gt;

&lt;p&gt;Fortunately, I had a very normal consumer rig available.&lt;/p&gt;

&lt;p&gt;The specs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPU: NVIDIA RTX 3060 Ti&lt;/li&gt;
&lt;li&gt;VRAM: 8 GB&lt;/li&gt;
&lt;li&gt;OS: Windows&lt;/li&gt;
&lt;li&gt;RAM: about 32 GB&lt;/li&gt;
&lt;li&gt;CPU: Intel i5-14600KF&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not a 4090 box. It is not a workstation. It is exactly the kind of machine where most people would say, "Just run a 7B model and move on."&lt;/p&gt;

&lt;p&gt;So I turned it into a challenge:&lt;/p&gt;

&lt;p&gt;Can I run a proper 30B coding model locally on consumer-grade hardware, with enough context to actually be useful?&lt;/p&gt;

&lt;p&gt;The model target was ambitious:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Qwen3-Coder-30B-A3B-Instruct&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Specifically, the GGUF from:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The quant I used:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;That is a 30B-ish coding-specialized MoE model. The important part is MoE: Mixture of Experts. The total parameter count is large, but only a small subset of expert weights is active per token; the A3B in the name refers to roughly 3B active parameters.&lt;/p&gt;

&lt;p&gt;That changes the whole local inference strategy.&lt;/p&gt;

&lt;p&gt;For a dense 30B model, 8 GB VRAM is not where I would start. For a compact MoE coding model, the question becomes more interesting:&lt;/p&gt;

&lt;p&gt;Can I keep the always-active parts fast, keep the routed experts mostly in system RAM, and still get usable speed?&lt;/p&gt;

&lt;p&gt;Short answer: yes.&lt;/p&gt;

&lt;p&gt;Long answer: it took a bunch of false starts.&lt;/p&gt;

&lt;h2&gt;
  First, the boring audit
&lt;/h2&gt;

&lt;p&gt;Before downloading anything huge, I checked the machine.&lt;/p&gt;

&lt;p&gt;This sounds obvious, but local AI setup gets messy fast if you skip it.&lt;/p&gt;

&lt;p&gt;I verified:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Windows version&lt;/li&gt;
&lt;li&gt;GPU model&lt;/li&gt;
&lt;li&gt;NVIDIA driver&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;nvidia-smi&lt;/code&gt; in PowerShell&lt;/li&gt;
&lt;li&gt;WSL2&lt;/li&gt;
&lt;li&gt;Docker Desktop&lt;/li&gt;
&lt;li&gt;Docker GPU passthrough&lt;/li&gt;
&lt;li&gt;CUDA container access to the GPU&lt;/li&gt;
&lt;li&gt;system RAM&lt;/li&gt;
&lt;li&gt;disk space&lt;/li&gt;
&lt;li&gt;CPU&lt;/li&gt;
&lt;/ul&gt;
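
&lt;p&gt;Most of those checks are one-liners. A minimal PowerShell pass (a sketch; command availability varies slightly by Windows version):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;# GPU, driver version, and VRAM in one shot
nvidia-smi

# WSL2 installed and healthy?
wsl --status

# total physical RAM
Get-CimInstance Win32_ComputerSystem | Select-Object TotalPhysicalMemory

# free space on the drive that will hold a 17 GB model file
Get-PSDrive C | Select-Object Free
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
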

&lt;p&gt;Docker GPU passthrough worked:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;docker&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;run&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;--rm&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;--gpus&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;all&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;nvidia/cuda:12.4.1-base-ubuntu22.04&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;nvidia-smi&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;That meant the clean first path was:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Docker + llama.cpp CUDA server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The initial server image:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ghcr.io/ggml-org/llama.cpp:server-cuda
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;I also checked &lt;code&gt;llama-server --help&lt;/code&gt; before trusting any command from the internet.&lt;/p&gt;

&lt;p&gt;That became a recurring theme.&lt;/p&gt;

&lt;p&gt;Do not assume the flag exists. Ask the binary.&lt;/p&gt;
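&lt;p&gt;In practice that means piping the help text through a filter whenever a post or video mentions a flag (assuming the help goes to stdout; add stderr redirection if your build prints it there):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;docker run --rm ghcr.io/ggml-org/llama.cpp:server-cuda --help | Select-String "cache-type"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
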
&lt;h2&gt;
  Downloading the model
&lt;/h2&gt;

&lt;p&gt;The target model repo was:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;I verified the actual file name before downloading:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
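
&lt;p&gt;One reproducible way to pull exactly that file into the project layout below is the Hugging Face CLI (adjust the in-repo path if the file has moved; any downloader that preserves the file name works):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;pip install huggingface_hub
huggingface-cli download unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF `
  Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf `
  --local-dir .\models\qwen3-coder-30b-a3b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
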


&lt;p&gt;The downloaded file size was:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;17,665,334,432 bytes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Everything went under one local project folder:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;local-qwen-coder/
  models/
  scripts/
  configs/
  docs/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;No global mystery folder. No "where did this 17 GB file go?" moment.&lt;/p&gt;

&lt;p&gt;Small win.&lt;/p&gt;
&lt;h2&gt;
  First real blocker: Docker memory
&lt;/h2&gt;

&lt;p&gt;The first serious issue was not the GPU.&lt;/p&gt;

&lt;p&gt;It was Docker memory.&lt;/p&gt;

&lt;p&gt;Windows had about 32 GB RAM available, but Docker Desktop was exposing only about 16 GB RAM plus 4 GB swap to its Linux VM.&lt;/p&gt;

&lt;p&gt;That mattered because my first instinct was to use:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--no-mmap
--mlock
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;That pair is a good idea when you want the whole model loaded and pinned in RAM up front instead of page-faulting from disk mid-generation.&lt;/p&gt;

&lt;p&gt;Except the container did not have enough RAM.&lt;/p&gt;

&lt;p&gt;It got killed.&lt;/p&gt;

&lt;p&gt;Exit code:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;137
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Docker inspect confirmed:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;OOMKilled=true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
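
&lt;p&gt;For reference, exit code 137 is 128 plus SIGKILL, the classic OOM-kill signature, and the confirmation is one command (the container name here is a placeholder):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;docker inspect llama-server --format "{{.State.OOMKilled}} {{.State.ExitCode}}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
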


&lt;p&gt;So the first fix was not glamorous:&lt;/p&gt;

&lt;p&gt;Keep mmap enabled for the Docker path.&lt;/p&gt;

&lt;p&gt;The "technically better" flag was wrong for the actual container memory limit.&lt;/p&gt;
&lt;h2&gt;
  Getting a stable stock llama.cpp server
&lt;/h2&gt;

&lt;p&gt;With stock llama.cpp Docker, the model loaded and served an OpenAI-compatible endpoint.&lt;/p&gt;

&lt;p&gt;Base URL:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://127.0.0.1:8080/v1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The important MoE flag was:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--cpu-moe
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This keeps all routed expert weights in system RAM, so only the always-active dense parts of the model compete for VRAM.&lt;/p&gt;
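
&lt;p&gt;Putting it together, a representative stock launch looks like this. The llama.cpp flags are the ones discussed in this post; the Docker port and volume wiring is illustrative, and 32K is the safer daily context rather than the 262K target:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;# sketch of the stock Docker path; note: mmap stays enabled here
docker run --rm --gpus all -p 8080:8080 `
  -v ${PWD}\models:/models `
  ghcr.io/ggml-org/llama.cpp:server-cuda `
  -m /models/qwen3-coder-30b-a3b/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf `
  --host 0.0.0.0 --port 8080 `
  --gpu-layers 99 `
  --cpu-moe `
  --ctx-size 32768
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
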

&lt;p&gt;The model became usable, but not fast enough yet.&lt;/p&gt;

&lt;p&gt;Baseline:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Prompt eval&lt;/th&gt;
&lt;th&gt;Generation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--cpu-moe&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~2.78 tok/s&lt;/td&gt;
&lt;td&gt;~13.38 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Generation was okay. Prompt eval was painful.&lt;/p&gt;

&lt;p&gt;Then came the next knob:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--n-cpu-moe N
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This keeps the expert weights of the first &lt;code&gt;N&lt;/code&gt; MoE layers on CPU, which lets the remaining layers' experts live on GPU.&lt;/p&gt;

&lt;p&gt;Lower &lt;code&gt;N&lt;/code&gt; usually means more GPU residency, more speed, and less VRAM headroom.&lt;/p&gt;

&lt;p&gt;So I benchmarked it.&lt;/p&gt;
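
&lt;p&gt;The VRAM columns in the table below can be captured per run with a plain nvidia-smi query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;nvidia-smi --query-gpu=memory.used,memory.free --format=csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
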
&lt;h2&gt;
  MoE offload tuning
&lt;/h2&gt;

&lt;p&gt;Here are the useful results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;VRAM used&lt;/th&gt;
&lt;th&gt;VRAM free&lt;/th&gt;
&lt;th&gt;Prompt eval&lt;/th&gt;
&lt;th&gt;Generation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--cpu-moe&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4388 MiB&lt;/td&gt;
&lt;td&gt;3637 MiB&lt;/td&gt;
&lt;td&gt;2.78 tok/s&lt;/td&gt;
&lt;td&gt;13.38 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--n-cpu-moe 48&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4392 MiB&lt;/td&gt;
&lt;td&gt;3633 MiB&lt;/td&gt;
&lt;td&gt;2.51 tok/s&lt;/td&gt;
&lt;td&gt;13.83 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--n-cpu-moe 46&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;5224 MiB&lt;/td&gt;
&lt;td&gt;2801 MiB&lt;/td&gt;
&lt;td&gt;6.03 tok/s&lt;/td&gt;
&lt;td&gt;18.75 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--n-cpu-moe 44&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;5893 MiB&lt;/td&gt;
&lt;td&gt;2132 MiB&lt;/td&gt;
&lt;td&gt;38.36 tok/s&lt;/td&gt;
&lt;td&gt;29.40 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--n-cpu-moe 42&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;6568 MiB&lt;/td&gt;
&lt;td&gt;1457 MiB&lt;/td&gt;
&lt;td&gt;44.49 tok/s&lt;/td&gt;
&lt;td&gt;30.26 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--n-cpu-moe 40&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;7265 MiB&lt;/td&gt;
&lt;td&gt;760 MiB&lt;/td&gt;
&lt;td&gt;51.63 tok/s&lt;/td&gt;
&lt;td&gt;32.49 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--n-cpu-moe 38&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;7664 MiB&lt;/td&gt;
&lt;td&gt;361 MiB&lt;/td&gt;
&lt;td&gt;53.14 tok/s&lt;/td&gt;
&lt;td&gt;33.64 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The fastest tested value was:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--n-cpu-moe 38
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;But it left only around 361 MiB of VRAM free.&lt;/p&gt;

&lt;p&gt;Too tight.&lt;/p&gt;

&lt;p&gt;The practical winner was:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--n-cpu-moe 40
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;That gave around 32.49 tok/s generation with about 760 MiB free VRAM.&lt;/p&gt;

&lt;p&gt;At this point, I had a good local coding backend.&lt;/p&gt;

&lt;p&gt;But I did not have the thing I actually wanted.&lt;/p&gt;
&lt;h2&gt;
  The real target: 262K context
&lt;/h2&gt;

&lt;p&gt;Qwen3-Coder-30B-A3B supports long context natively.&lt;/p&gt;

&lt;p&gt;The model metadata showed:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;n_ctx_train = 262144
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;So the question became:&lt;/p&gt;

&lt;p&gt;Can I actually run it at 262K context on 8 GB VRAM?&lt;/p&gt;

&lt;p&gt;The stock Docker build could not get me there in the way I wanted.&lt;/p&gt;

&lt;p&gt;I could lower KV cache precision using normal llama.cpp types like:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;q8_0
q4_0
iq4_nl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
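
&lt;p&gt;Those are set with the same flags the TurboQuant build uses later; one caveat worth knowing is that llama.cpp requires flash attention to be enabled before it will quantize the V cache:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--flash-attn on --cache-type-k q8_0 --cache-type-v q8_0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
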


&lt;p&gt;But a video I had watched was talking about TurboQuant.&lt;/p&gt;

&lt;p&gt;That was the key difference.&lt;/p&gt;

&lt;p&gt;And this is where I almost fooled myself.&lt;/p&gt;
&lt;h2&gt;
  I was not actually using TurboQuant yet
&lt;/h2&gt;

&lt;p&gt;I checked the stock Docker image:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;docker&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;run&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;--rm&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;--gpus&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;all&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;ghcr.io/ggml-org/llama.cpp:server-cuda&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;--help&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The supported KV cache types were:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;No &lt;code&gt;turbo3&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;No &lt;code&gt;turbo4&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;No &lt;code&gt;tbq3_0&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;No &lt;code&gt;tbq4_0&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;So the answer was clear:&lt;/p&gt;

&lt;p&gt;The stock runtime was not doing TurboQuant.&lt;/p&gt;

&lt;p&gt;TurboQuant is not model-weight quantization. It does not require changing the GGUF model file.&lt;/p&gt;

&lt;p&gt;It changes how the runtime stores the KV cache.&lt;/p&gt;

&lt;p&gt;Same model.&lt;/p&gt;

&lt;p&gt;Different runtime.&lt;/p&gt;

&lt;p&gt;Different cache format.&lt;/p&gt;

&lt;p&gt;That was the real pivot.&lt;/p&gt;
&lt;h2&gt;
  Finding a TurboQuant runtime
&lt;/h2&gt;

&lt;p&gt;I found a Windows CUDA runtime build:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;atomicmilkshake/llama-cpp-turboquant-binaries
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The downloaded file:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;llama-turboquant-triattention-win-cu13-x64.zip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;I extracted it under:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;runtimes/turboquant/win-cu13
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Then I tried:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;\llama-server.exe&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;--help&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;It failed instantly.&lt;/p&gt;

&lt;p&gt;No useful output.&lt;/p&gt;

&lt;p&gt;The process exit code was:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0xc0000135
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
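
&lt;p&gt;Depending on the shell, that shows up as a raw negative decimal; formatting &lt;code&gt;$LASTEXITCODE&lt;/code&gt; as hex is what makes it searchable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;.\llama-server.exe --help
"0x{0:X8}" -f $LASTEXITCODE   # prints 0xC0000135 when a DLL fails to load
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
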


&lt;p&gt;That usually means a missing DLL on Windows.&lt;/p&gt;

&lt;p&gt;The README confirmed the likely issue:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cublasLt64_13.dll
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The build needed the CUDA 13 cuBLASLt runtime.&lt;/p&gt;

&lt;p&gt;I did not want to install the full CUDA Toolkit globally just for one DLL.&lt;/p&gt;

&lt;p&gt;So I pulled the official NVIDIA cuBLAS wheel:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-m&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;pip&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;download&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;nvidia-cublas&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="mf"&gt;13.4&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;0&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;--only-binary&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;all:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Then I extracted:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cublasLt64_13.dll
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;and copied it into the local runtime folder next to &lt;code&gt;llama-server.exe&lt;/code&gt;.&lt;/p&gt;
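
&lt;p&gt;A wheel is just a zip, so no CUDA installer is needed. The wheel file name and internal layout depend on the version pip grabbed, so treat these as placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;# Expand-Archive insists on a .zip extension, so copy first
Copy-Item .\nvidia_cublas-13.4.0.1-py3-none-win_amd64.whl .\cublas.zip
Expand-Archive .\cublas.zip -DestinationPath .\cublas-wheel

# the DLL location inside the wheel may vary; search for it
Get-ChildItem .\cublas-wheel -Recurse -Filter cublasLt64_13.dll |
  Copy-Item -Destination .\runtimes\turboquant\win-cu13\
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
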

&lt;p&gt;After that:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;\llama-server.exe&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;--help&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;worked.&lt;/p&gt;

&lt;p&gt;And this time the cache types included:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;turbo2, turbo3, turbo4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;for both:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--cache-type-k
--cache-type-v
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;That was the moment where the setup changed from "normal llama.cpp tuning" to "actual TurboQuant path."&lt;/p&gt;
&lt;h2&gt;
  The final 262K launch
&lt;/h2&gt;

&lt;p&gt;The final command shape was:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;\runtimes\turboquant\win-cu13\llama-server.exe&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;-m&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;\models\qwen3-coder-30b-a3b\Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--alias&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;qwen3-coder-30b-a3b-turbo-262k&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--host&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;127.0.0.1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--port&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;8080&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--jinja&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--gpu-layers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;all&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--cpu-moe&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--flash-attn&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--ctx-size&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;262144&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--cache-type-k&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;turbo4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--cache-type-v&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;turbo3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--parallel&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--batch-size&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;256&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--ubatch-size&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;64&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--temp&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;0.3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--top-p&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;0.8&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--top-k&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;20&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--repeat-penalty&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;1.05&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--fit&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;off&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--cache-ram&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--no-mmap&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--mlock&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;I forced:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--fit off
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;because I did not want llama.cpp quietly shrinking the context and pretending everything was fine.&lt;/p&gt;

&lt;p&gt;If it loaded, it had to really load at 262144.&lt;/p&gt;

&lt;p&gt;And it did.&lt;/p&gt;
&lt;h2&gt;
  The proof
&lt;/h2&gt;

&lt;p&gt;The runtime logs showed:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;llama_context: n_ctx         = 262144
llama_context: n_ctx_seq     = 262144
llama_context: n_batch       = 256
llama_context: n_ubatch      = 64
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The KV cache line was the real proof:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;llama_kv_cache: size = 5664.00 MiB (262144 cells, 48 layers, 1/1 seqs), K (turbo4): 3264.00 MiB, V (turbo3): 2400.00 MiB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
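
&lt;p&gt;Those numbers are internally consistent. Assuming the published Qwen3 GQA layout of 4 KV heads with head size 128 (an assumption from the model config, not from these logs), each layer stores 512 K values and 512 V values per token, and a quick sanity check recovers the per-value bit widths:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;# 262144 cells x 48 layers x 512 values per layer per token
$values = 262144 * 48 * 512

(3264MB * 8) / $values    # K (turbo4): 4.25 bits per value
(2400MB * 8) / $values    # V (turbo3): 3.125 bits per value

# the same cache at f16 (2 bytes per value, K plus V) for comparison:
($values * 2 * 2) / 1GB   # 24 GiB, versus ~5.5 GiB with TurboQuant
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
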


&lt;p&gt;VRAM after load:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;7525 MiB used
500 MiB free
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Very tight.&lt;/p&gt;

&lt;p&gt;But loaded.&lt;/p&gt;

&lt;p&gt;Then I sent a small coding prompt through the OpenAI-compatible endpoint.&lt;/p&gt;
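
&lt;p&gt;Any OpenAI-style client works; a raw PowerShell call is enough for a smoke test (the model name matches the &lt;code&gt;--alias&lt;/code&gt; from the launch command):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;$body = @{
    model    = "qwen3-coder-30b-a3b-turbo-262k"
    messages = @(@{ role = "user"; content = "Write a PowerShell one-liner that counts lines in every .cs file." })
} | ConvertTo-Json -Depth 5

Invoke-RestMethod -Uri "http://127.0.0.1:8080/v1/chat/completions" -Method Post -ContentType "application/json" -Body $body |
  ForEach-Object { $_.choices[0].message.content }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
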

&lt;p&gt;It answered.&lt;/p&gt;

&lt;p&gt;Timings:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;prompt eval time = 1125.54 ms / 46 tokens = 40.87 tokens per second
eval time        = 3672.56 ms / 107 tokens = 29.13 tokens per second
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;That was the win.&lt;/p&gt;

&lt;p&gt;Qwen3-Coder-30B-A3B.&lt;/p&gt;

&lt;p&gt;262K context.&lt;/p&gt;

&lt;p&gt;8 GB VRAM.&lt;/p&gt;

&lt;p&gt;Local endpoint.&lt;/p&gt;

&lt;p&gt;Same model file.&lt;/p&gt;

&lt;p&gt;TurboQuant KV cache.&lt;/p&gt;
&lt;h2&gt;
  The repeatable script
&lt;/h2&gt;

&lt;p&gt;I wrapped the TurboQuant launch into:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scripts/run-qwen-coder-turboquant.ps1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;So the repeatable command is:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;\scripts\run-qwen-coder-turboquant.ps1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-Replace&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
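
&lt;p&gt;The real script lives in the repo; the core is small enough to sketch here. The &lt;code&gt;-Replace&lt;/code&gt; switch just kills any leftover server before starting a new one (a simplified sketch, not the repo version; sampling flags omitted for brevity):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;param([switch]$Replace)

if ($Replace) {
    # stop any llama-server left over from a previous run
    Get-Process llama-server -ErrorAction SilentlyContinue | Stop-Process -Force
}

# same flags as the final 262K launch above
.\runtimes\turboquant\win-cu13\llama-server.exe `
    -m .\models\qwen3-coder-30b-a3b\Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf `
    --alias qwen3-coder-30b-a3b-turbo-262k `
    --host 127.0.0.1 --port 8080 --jinja `
    --gpu-layers all --cpu-moe --flash-attn on `
    --ctx-size 262144 --cache-type-k turbo4 --cache-type-v turbo3 `
    --parallel 1 --batch-size 256 --ubatch-size 64 `
    --fit off --cache-ram 0 --no-mmap --mlock
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
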


&lt;p&gt;The stock Docker fallback still exists:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;\scripts\run-qwen-coder-docker.ps1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Profile&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;daily-fast&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The Docker route is useful for a safer daily profile.&lt;/p&gt;

&lt;p&gt;The TurboQuant route is the full-context profile.&lt;/p&gt;
&lt;h2&gt;
  Important caveats
&lt;/h2&gt;

&lt;p&gt;This is not magic.&lt;/p&gt;

&lt;p&gt;The 262K profile is VRAM-tight.&lt;/p&gt;

&lt;p&gt;It leaves roughly 500 MiB free on my RTX 3060 Ti. That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;single client only&lt;/li&gt;
&lt;li&gt;do not run multiple editor agents at once&lt;/li&gt;
&lt;li&gt;close GPU-heavy apps&lt;/li&gt;
&lt;li&gt;expect this to be less forgiving than the 32K profile&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also, I have not yet proven that this setup is great at real-world coding tasks.&lt;/p&gt;

&lt;p&gt;The infrastructure works.&lt;/p&gt;

&lt;p&gt;The endpoint works.&lt;/p&gt;

&lt;p&gt;The context loads.&lt;/p&gt;

&lt;p&gt;The smoke test passes.&lt;/p&gt;

&lt;p&gt;But the next test is actual development work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can it refactor a real repo?&lt;/li&gt;
&lt;li&gt;Can it debug Unity C# sanely?&lt;/li&gt;
&lt;li&gt;Can it handle multi-file context without drifting?&lt;/li&gt;
&lt;li&gt;Can it stay stable across longer sessions?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the next milestone.&lt;/p&gt;
&lt;h2&gt;
  What I learned
&lt;/h2&gt;

&lt;p&gt;The big lesson is that local AI infra is not just:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;download model
run server
profit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The defaults are often the bottleneck.&lt;/p&gt;

&lt;p&gt;In this setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MoE placement mattered.&lt;/li&gt;
&lt;li&gt;Docker memory limits mattered.&lt;/li&gt;
&lt;li&gt;KV cache format mattered.&lt;/li&gt;
&lt;li&gt;Runtime build mattered.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;llama-server --help&lt;/code&gt; mattered a lot.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The 30B model was not the whole problem.&lt;/p&gt;

&lt;p&gt;The runtime strategy was.&lt;/p&gt;

&lt;p&gt;And sometimes the difference between "impossible" and "working" is one missing DLL plus the right KV cache type.&lt;/p&gt;
&lt;h2&gt;
  Repo
&lt;/h2&gt;

&lt;p&gt;I published the setup as a GitHub repo with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;launch scripts&lt;/li&gt;
&lt;li&gt;benchmark notes&lt;/li&gt;
&lt;li&gt;troubleshooting docs&lt;/li&gt;
&lt;li&gt;client settings&lt;/li&gt;
&lt;li&gt;reproducible setup notes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;GitHub link:&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/UpayanGhosh" rel="noopener noreferrer"&gt;
        UpayanGhosh
      &lt;/a&gt; / &lt;a href="https://github.com/UpayanGhosh/local-qwen-coder-turboquant" rel="noopener noreferrer"&gt;
        local-qwen-coder-turboquant
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Local Qwen3-Coder 30B TurboQuant setup for 8GB VRAM coding workflows
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Local Qwen Coder TurboQuant Setup&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;Practical Windows setup notes and scripts for running &lt;code&gt;Qwen3-Coder-30B-A3B-Instruct&lt;/code&gt; as a local coding-only OpenAI-compatible backend on an 8 GB NVIDIA GPU.&lt;/p&gt;

&lt;p&gt;This repo documents the journey from a stable stock llama.cpp Docker setup to a full-context TurboQuant KV-cache runtime:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RTX 3060 Ti, 8 GB VRAM&lt;/li&gt;
&lt;li&gt;Windows&lt;/li&gt;
&lt;li&gt;Qwen3-Coder-30B-A3B-Instruct GGUF&lt;/li&gt;
&lt;li&gt;MoE expert CPU/GPU residency tuning&lt;/li&gt;
&lt;li&gt;OpenAI-compatible local endpoint&lt;/li&gt;
&lt;li&gt;Verified &lt;code&gt;262144&lt;/code&gt; context with TurboQuant KV cache&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;What Is Included&lt;/h2&gt;
&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;PowerShell scripts for launching and testing the backend&lt;/li&gt;
&lt;li&gt;Client settings for Cline, Continue, Roo Code, OpenCode, and generic OpenAI-compatible clients&lt;/li&gt;
&lt;li&gt;Benchmark notes&lt;/li&gt;
&lt;li&gt;TurboQuant research and troubleshooting notes&lt;/li&gt;
&lt;li&gt;LinkedIn post draft documenting the build story&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;What Is Not Included&lt;/h2&gt;
&lt;/div&gt;

&lt;p&gt;This repo intentionally does not track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GGUF model files&lt;/li&gt;
&lt;li&gt;CUDA/runtime DLLs&lt;/li&gt;
&lt;li&gt;downloaded wheels/zips&lt;/li&gt;
&lt;li&gt;logs&lt;/li&gt;
&lt;li&gt;local caches&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those files are large and/or machine-specific. See &lt;code&gt;.gitignore&lt;/code&gt;.&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Key Result&lt;/h2&gt;

&lt;/div&gt;

&lt;p&gt;Verified TurboQuant profile:&lt;/p&gt;

&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;
&lt;pre class="notranslate"&gt;&lt;code&gt;Context: 262144
KV cache: K=turbo4, V=turbo3
VRAM: ~7525 MiB used /&lt;/code&gt;&lt;/pre&gt;…&lt;/div&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/UpayanGhosh/local-qwen-coder-turboquant" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;



&lt;p&gt;The repo does not include the GGUF model, CUDA DLLs, wheels, or downloaded binaries. Those are too large and machine-specific.&lt;/p&gt;

&lt;h2&gt;
  Closing thought
&lt;/h2&gt;

&lt;p&gt;This started as:&lt;/p&gt;

&lt;p&gt;"Can I make a useful local coding backend?"&lt;/p&gt;

&lt;p&gt;Then it became:&lt;/p&gt;

&lt;p&gt;"Can I get the full 262K context working on 8 GB VRAM?"&lt;/p&gt;

&lt;p&gt;The first version merely ran.&lt;/p&gt;

&lt;p&gt;The final version actually hit the target.&lt;/p&gt;

&lt;p&gt;I am calling that a win.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>coding</category>
      <category>llm</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
