This article was originally published on runaihome.com
TL;DR: Kimi K2.7 Code is the June 12, 2026 coding-focused refresh of K2.6 — same 1T-parameter MoE, 32B active, but it burns roughly 30% fewer "thinking" tokens per task. Its smallest usable Unsloth quant (2-bit) is ~325GB, so no single consumer GPU runs it; you need a 384GB DDR5 CPU build, a 4× RTX 3090 + 256GB RAM rig, or the API at $0.95/$4 per million tokens. The token cut lowers your effective cost-per-task more than it changes the hardware math.
| | CPU build (384GB DDR5) | 4× RTX 3090 + 256GB RAM | Kimi API |
|---|---|---|---|
| Best for | Always-on private coding server | Fastest consumer local path | Most developers, today |
| Est. cost | ~$3,500–$4,500 | ~$5,500–$6,500 (used GPUs) | $0 upfront, pay-per-use |
| Speed (2-bit) | ~8–11 tok/s | ~8–12 tok/s at 32k ctx | 20–60 tok/s (managed) |
| Memory needed | 384GB+ RAM | 96GB VRAM + 256GB RAM | None |
| The catch | Slow prefill on long prompts | Multi-GPU wiring + PCIe limits | Prompts leave your machine |
Honest take: For nearly every home-lab developer, the Kimi API is the right answer — K2.7 Code's token efficiency makes it cheaper per finished task than K2.6 while needing zero hardware. Build local only if your data can't leave the building or you're burning tens of millions of tokens a month.
What actually changed from K2.6
Moonshot AI released Kimi K2.7 Code on June 12, 2026 under a Modified MIT license. The architecture is unchanged from Kimi K2.6: approximately 1 trillion total parameters, 32B active per token, 384 experts (8 routed plus 1 shared per forward pass), and a 256K-token context window. If you already mapped K2.6 onto your hardware, K2.7 Code drops into the same memory footprint.
The headline change is behavioral, not structural. K2.7 Code is tuned coding-first and uses roughly 30% fewer "thinking" tokens than K2.6 to reach an answer on agentic software tasks. Moonshot reports gains of +21.8% on Kimi Code Bench v2, +11% on Program Bench, and +31.5% on MLS Bench Lite versus K2.6.
Here's the catch worth stating up front: all three of those are Moonshot's own proprietary benchmarks. As of mid-June 2026 there are no independent third-party results for K2.7 Code on the public suites — SWE-bench Verified, SWE-bench Pro, Terminal-Bench, LiveCodeBench, or Aider. VentureBeat's coverage quoted practitioners who said the vendor deltas don't obviously match hands-on behavior. Treat the numbers as directional until the community re-runs them. K2.6, by contrast, has a verified 80.2% SWE-bench Verified score — so for now K2.6 is the model with the stronger independent track record, and K2.7 Code is the bet on token efficiency.
That token cut is the actually-useful part for a home lab. Fewer thinking tokens means lower output-token billing per task on the API, and on local hardware it means each task finishes in fewer generation steps — which matters a lot when your rig only does 8–12 tok/s.
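To make the effect concrete, here is a minimal sketch of the per-task arithmetic. The task size (20k prompt tokens, 10k output tokens) is hypothetical, and the sketch assumes that $0.95 is the input price and $4 the output price, that the same prices apply to both models, and that the ~30% reduction applies to all output tokens rather than only the "thinking" portion.

```python
# Prices quoted in this article, in USD per million tokens.
# Assumption: $0.95 is the input price and $4 is the output price.
INPUT_USD_PER_M = 0.95
OUTPUT_USD_PER_M = 4.00


def task_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """API cost of one task in USD."""
    total = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000


def generation_seconds(output_tokens: int, tok_per_s: float) -> float:
    """Wall-clock decode time on local hardware, ignoring prefill."""
    return output_tokens / tok_per_s


# Hypothetical agentic task: 20k prompt tokens, 10k output tokens on K2.6.
PROMPT_TOKENS = 20_000
K26_OUTPUT_TOKENS = 10_000
# Assumption: the ~30% reduction applies to every output token.
K27_OUTPUT_TOKENS = round(K26_OUTPUT_TOKENS * 0.70)

k26_cost = task_cost_usd(PROMPT_TOKENS, K26_OUTPUT_TOKENS)
k27_cost = task_cost_usd(PROMPT_TOKENS, K27_OUTPUT_TOKENS)
k26_time = generation_seconds(K26_OUTPUT_TOKENS, 10.0)
k27_time = generation_seconds(K27_OUTPUT_TOKENS, 10.0)

print(f"API cost per task: ${k26_cost:.3f} -> ${k27_cost:.3f}")
print(f"Local decode time at 10 tok/s: {k26_time:.0f}s -> {k27_time:.0f}s")
```

Under these assumptions the saving on the API is smaller than 30% because input tokens are billed unchanged, while local decode time shrinks by the full 30%. Swap in your own token counts to see where your workload lands.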
The 1T-parameter reality: why this isn't a single-GPU job
A Mixture-of-Experts model only computes 32B parameters per token, so the arithmetic per step is comparable to a 32B dense model. But it has to store all 1T parameters in memory, because the router can call any expert on any token. You cannot skip loading experts that don't happen to fire. Memory capacity, not compute, is the wall.
At its native shipping precision, K2.7 Code's GGUF weights total roughly 605GB on disk. Moonshot ships the MoE weights at native INT4 with BF16 attention, so a 4-bit GGUF stores them at essentially training precision, which is why the lossless Q8 quant (~595GB) is only about 10GB larger than Q4. The savings only start once you go below 4-bit. Unsloth's Dynamic 2-bit quant (UD-Q2_K_XL) lands at ~325GB, roughly 45% smaller than those near-lossless quants, by keeping critical attention and routing layers at higher precision while squeezing the MoE experts.
325GB is still 10× the VRAM of an RTX 5090. This is the same structural problem every trillion-parameter open-weight model hits — see the parallel analysis in our GLM 5.2 hardware guide and MiniMax M3 guide.
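The capacity wall can be checked with a few lines of arithmetic. This is a rough sketch that treats 1GB as 10^9 bytes, uses the ~325GB and ~1T figures quoted above, and ignores KV cache, activations, and runtime overhead, so the card counts are lower bounds. The helper names are illustrative only.

```python
import math

MODEL_GB = 325        # UD-Q2_K_XL size quoted in this article
TOTAL_PARAMS = 1e12   # ~1 trillion total parameters


def avg_bits_per_weight(model_gb: float, total_params: float) -> float:
    """Average storage cost per parameter, in bits."""
    return model_gb * 1e9 * 8 / total_params


def cards_needed(model_gb: float, vram_gb_per_card: float) -> int:
    """Minimum number of cards whose combined VRAM holds the weights alone."""
    return math.ceil(model_gb / vram_gb_per_card)


print(f"Average bits per weight: {avg_bits_per_weight(MODEL_GB, TOTAL_PARAMS):.1f}")
print(f"RTX 5090 (32GB) cards for weights only: {cards_needed(MODEL_GB, 32)}")
print(f"RTX 3090 (24GB) cards for weights only: {cards_needed(MODEL_GB, 24)}")
```

Even the "2-bit" quant averages about 2.6 bits per weight because some layers are kept at higher precision, and holding it purely in VRAM would take a double-digit number of consumer cards.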
Quantization options: the GGUF table
All sizes are for the Unsloth Dynamic GGUF release (unsloth/Kimi-K2.7-Code-GGUF on Hugging Face). Dynamic quantization upcasts attention and routing layers, so quality loss at a given bit-width is lower than uniform quantization.
| Quantization | Disk size | Min RAM+VRAM | Expected speed | Notes |
|---|---|---|---|---|
| UD-TQ1 (~1.8-bit) | ~290 GB | ~310 GB | ~9–13 tok/s | Smallest; reasoning quality drops noticeably |
| UD-Q2_K_XL (2-bit) | ~325 GB | ~350 GB | ~8–12 tok/s | Practical floor; best size/quality tradeoff |
| UD-Q4_K_XL (4-bit) | ~585 GB | ~600 GB | ~5–8 tok/s | Near-lossless (native INT4 MoE) |
| UD-Q8_K_XL (8-bit) | ~595 GB | ~610 GB | ~4–6 tok/s | Lossless; server-class memory only |
| Full BF16 | ~2 TB | 2+ TB | Impractical | H100/B200 cluster territory |
For local use, UD-Q2_K_XL is the only realistic starting point. Everything above it needs 600GB+ of combined memory, which is dual-socket server territory, not a home tower. Dropping from Q2 to TQ1 saves ~35GB and gains roughly 1 tok/s, but for a model you picked specifically for coding accuracy, eating that quality hit defeats the purpose.
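As a quick way to apply the table to your own machine, the sketch below picks the largest quant whose "Min RAM+VRAM" figure fits a given memory budget. The thresholds are copied from the table above; the function name and structure are illustrative, and the check says nothing about speed or about how much context you can afford on top of the weights.

```python
from typing import Optional

# (quant name, minimum combined RAM+VRAM in GB), ordered smallest to largest.
# Values are the "Min RAM+VRAM" column from the table in this article.
QUANTS = [
    ("UD-TQ1", 310),
    ("UD-Q2_K_XL", 350),
    ("UD-Q4_K_XL", 600),
    ("UD-Q8_K_XL", 610),
]


def largest_quant_that_fits(ram_gb: float, vram_gb: float = 0) -> Optional[str]:
    """Return the largest quant that fits in RAM + VRAM, or None if none fits."""
    budget = ram_gb + vram_gb
    best = None
    for name, min_gb in QUANTS:
        if min_gb <= budget:
            best = name  # keep upgrading while the next size still fits
    return best


print(largest_quant_that_fits(384))        # CPU-only build
print(largest_quant_that_fits(256, 96))    # 4x RTX 3090 + 256GB RAM
print(largest_quant_that_fits(128, 24))    # typical enthusiast desktop
```

Both hardware paths discussed in this article land on UD-Q2_K_XL, and the 4× RTX 3090 rig clears the threshold by only about 2GB, which is worth knowing before you plan for long contexts.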
GPU tiers: what speed to actually expect
Because no consumer GPU holds 325GB, every "GPU path" here is really partial offload — the card holds whatever layers fit, system RAM holds the rest, and your throughput is dominated by the slowest memory tier the model has to route through. The figures below are projections scaled from measured K2.6 community runs (K2.7 Code shares K2.6's 32B-active architecture, so per-token throughput is effectively identical at the same quant); treat them as estimates, not lab benchmarks.
| Setup | Memory | Est. speed (Q2) | Verdict |
|---|---|---|---|
| RTX 4060 Ti 16GB + 320GB RAM | 16GB VRAM | ~3 tok/s | Painful — GPU holds <5% of weights |
| Single RTX 3090/4090 + 320GB RAM | 24GB VRAM | ~5–7 tok/s | Marginal; GPU barely helps |
| 4× RTX 3090 + 256GB RAM | 96GB VRAM | ~8–12 tok/s | Best consumer GPU path |
| 384GB DDR5, no GPU | 384GB RAM | ~8–11 tok/s | Simplest; full model in RAM |
The pattern is blunt: a single 24GB card holds under 10% of the model, so its 936 GB/s of bandwidth only applies to a sliver of each token's work, while the other 90% crawls at DDR5's ~100 GB/s. You don't cross into comfortable territory until you either (a) put the whole model in fast unified/system memory or (b) stack enough VRAM (4× cards, 96GB, roughly 30% of the 2-bit quant) that a meaningful share of the weights lives on the GPU. An RTX 4060 Ti 16GB technically "runs" it, but ~3 tok/s is slideshow speed for agentic coding.
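A back-of-the-envelope bandwidth model shows why a single card adds so little. The sketch assumes decode is purely memory-bandwidth bound, that each token reads 32B active parameters at the quant's average of ~2.6 bits per weight (325GB spread over ~1T parameters), and that reads split between GPU and system RAM in proportion to where the weights sit. It ignores PCIe transfers, synchronization, and compute, so it gives an optimistic ceiling, not a prediction; real partial-offload runs, like the estimates in the table, come in lower.

```python
MODEL_GB = 325                 # UD-Q2_K_XL size quoted in this article
TOTAL_PARAMS = 1e12            # ~1T total parameters
ACTIVE_PARAMS = 32e9           # 32B active per token
DDR5_GBPS = 100.0              # system RAM bandwidth figure used in this article
RTX3090_GBPS = 936.0           # per-card VRAM bandwidth figure used in this article

# Assumption: active weights are stored at the quant's average bit-width.
ACTIVE_GB_PER_TOKEN = ACTIVE_PARAMS * (MODEL_GB * 1e9 / TOTAL_PARAMS) / 1e9


def decode_ceiling_tok_s(vram_gb: float) -> float:
    """Upper bound on decode speed when vram_gb of the weights sit in VRAM."""
    fast_fraction = min(vram_gb / MODEL_GB, 1.0)
    seconds_per_token = ACTIVE_GB_PER_TOKEN * (
        fast_fraction / RTX3090_GBPS + (1.0 - fast_fraction) / DDR5_GBPS
    )
    return 1.0 / seconds_per_token


for label, vram in [("CPU only", 0), ("1x 24GB card", 24), ("4x 24GB cards", 96)]:
    print(f"{label}: ceiling ~{decode_ceiling_tok_s(vram):.1f} tok/s")
```

In this simplified model, adding one 24GB card lifts the ceiling by well under 1 tok/s over RAM alone, which is the arithmetic behind "GPU barely helps". Runtimes that pin the always-active layers on the GPU can beat the proportional-split assumption, so treat the output as intuition rather than a benchmark.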
If you go the multi-GPU route, read our multi-GPU NVLink vs PCIe guide first — cheap risers can halve effective Gen4 bandwidth across four cards and quietly kill your throughput.
Hardware path 1: 384GB DDR5 CPU build
The cheapest way to hold the whole 2-bit quant in fast memory is a CPU build with enough DDR5 to fit it with headroom.
- 384GB DDR5 (8× 48GB, or 12× 32GB on high-capacity boards)
- A modern high-core-count Ryzen or Threadripper Pro CPU
- Fast NVMe for model storage (the GGUF is 325GB)
Expected throughput on llama.cpp with a 16-core CPU and the full model in RAM: ~8–11 tok/s on UD-Q2_K_XL, scaling from K2.6 community numbers in the same memory class. The limitation is prefill, not generation. At 32K context with a 10K-token prompt you're waiting minutes before the first token. Keep context to 8K–16K for interactive work; KV-cache memory and prefill time both scale with it.
Rough cost: 8× 48GB DDR5-5600 (~$1,600 in mid-2026's elevated DRAM market — see