Kimi K2.7 Code Local Setup 2026: vLLM, SGLang, GGUF

#kimi #llm #moe #selfhosted

This article was originally published on aifoss.dev

TL;DR: Kimi K2.7 Code is Moonshot AI's June 12 2026 coding model — 1T total parameters, 32B active, 256K context, Modified MIT license. The native INT4 checkpoint needs 8×H200 to serve at full speed via vLLM or SGLang. There is exactly one way onto consumer hardware: Unsloth's 1.8-bit GGUF, which runs on a single 24GB GPU by offloading every expert layer to system RAM — slowly.

What you'll have running after this guide:

An OpenAI-compatible Kimi K2.7 Code endpoint on a multi-GPU node (vLLM or SGLang, INT4).
A single-GPU fallback using llama.cpp + Unsloth GGUF for anyone without a datacenter.
A clear-eyed sense of whether self-hosting this model is worth it versus the free API.

Honest take: If you have 8×H200, vLLM with expert parallelism is the move. If you don't, run the GGUF for tinkering and point your editor at Moonshot's API for real work — the math almost never favors buying the hardware.

What K2.7 Code actually is

K2.7 Code is a coding-first agentic model built on Kimi K2.6, released June 12 2026. The headline change is efficiency: Moonshot reports roughly 30% fewer thinking tokens than K2.6 for the same task quality, which directly cuts your inference cost and latency.

The architecture is a Mixture-of-Experts: 1 trillion total parameters, 384 experts, with 32 billion parameters activated per token. It uses Multi-head Latent Attention (MLA) and ships with a 256K-token context window. The weights are open on HuggingFace and ModelScope, and the model is also available through Moonshot's API.

One detail that matters more than the parameter count: K2.7 Code ships as a native INT4 checkpoint. Moonshot quantized it themselves rather than releasing BF16 and leaving you to convert. At 4 bits per quantized weight instead of 16, that cuts the storage and VRAM footprint to well under half of a BF16 release out of the box, and it is the format vLLM, SGLang, and KTransformers are all tuned for.

The license, in plain terms

K2.7 Code uses Moonshot's Modified MIT License. For practical purposes it behaves like standard MIT — use it commercially, modify it, deploy it, no fees. The single modification: if your product serves more than 100 million monthly active users or generates more than $20M/month in revenue, you must display "Kimi K2" visibly in your UI. No indie developer or normal company hits that ceiling, so for self-hosting it is effectively unrestricted. That's a meaningfully cleaner license than the Llama Community License or Qwen's attribution terms.

The honest hardware reality

This is the part most "setup guides" gloss over. K2.7 Code is a datacenter model. Here is what each path actually costs you.

	vLLM / SGLang (INT4)	Unsloth GGUF (consumer)	Moonshot API
Hardware	8×H200 (TP=8) or 4×MI300X	1×24GB GPU + 256GB+ RAM	Any machine
Aggregate VRAM	~640GB	24GB VRAM + RAM offload	None
Speed	Production (tens of tok/s)	Single-digit tok/s	Fast, hosted
Truly local?	Yes	Yes	No
Best for	Teams, SLA, air-gapped	Experiments, privacy tests	Most people

The INT4 checkpoint runs on 8×H200 with --tensor-parallel-size 8, or on NVIDIA B300 (8×, TP=8) and GB300 (4×, TP=4). On AMD, the verified configs are MI300X/MI325X and MI350X/MI355X at 4×, TP=4. The INT4 weights occupy roughly 640GB aggregate; if you instead run FP8 on 8×H200 SXM5 (~1128GB HBM3e total), the weights eat about 1TB, leaving ~128GB for KV cache at 256K context with small batches.
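The headroom arithmetic above is easy to rerun for your own node. This is a minimal sketch using only the figures quoted in this section; the 141GB per-GPU capacity is inferred from the ~1128GB total, and `kv_headroom_gb` is a hypothetical helper, not part of vLLM or SGLang.

```python
# Rough VRAM budget for an 8-GPU node, using the figures quoted above.
H200_GB = 141  # per-GPU capacity; 8 x 141 = 1128, matching the ~1128GB total


def kv_headroom_gb(num_gpus: int, gb_per_gpu: float, weights_gb: float,
                   utilization: float = 1.0) -> float:
    """Memory left for KV cache and activations once the weights are loaded.

    `utilization` mirrors the idea behind --gpu-memory-utilization: the
    fraction of each GPU the server is allowed to claim.
    """
    usable = num_gpus * gb_per_gpu * utilization
    return usable - weights_gb


fp8_headroom = kv_headroom_gb(8, H200_GB, weights_gb=1000)
int4_headroom = kv_headroom_gb(8, H200_GB, weights_gb=640)
int4_headroom_90 = kv_headroom_gb(8, H200_GB, weights_gb=640, utilization=0.90)

print(f"FP8 headroom:  {fp8_headroom:.0f} GB")
print(f"INT4 headroom: {int4_headroom:.0f} GB")
print(f"INT4 headroom at 0.90 utilization: {int4_headroom_90:.1f} GB")
```

The gap between the FP8 and INT4 numbers is the practical argument for the native checkpoint: the same node has several times more room for KV cache, which is what long contexts and larger batches consume.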

If you're renting rather than buying, an 8×H200 node on a provider like RunPod is the realistic way to test this without a five-figure hardware order. For the full hardware breakdown across Kimi releases, runaihome.com has a Kimi K2 local inference hardware guide.

Path 1: vLLM on a multi-GPU node

K2.7 Code is new enough that parser support may not be in a stable vLLM release yet. Pin a nightly build in your startup script rather than relying on pip install vllm:

pip install -U vllm --pre \
  --extra-index-url https://wheels.vllm.ai/nightly
# pin the exact nightly date once it works:
# pip install vllm==<nightly-build-date>

Serve the model. The two flags people forget are the parsers — without them, agentic tool calls and the reasoning trace come back as raw text instead of structured output:

# --enable-auto-tool-choice is needed for --tool-call-parser to take effect
vllm serve moonshotai/Kimi-K2.7-Code \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.90 \
  --enable-auto-tool-choice \
  --tool-call-parser kimi_k2 \
  --reasoning-parser kimi_k2 \
  --trust-remote-code \
  --port 8000

--enable-expert-parallel is worth understanding. With 384 experts spread across an 8-GPU node, expert parallelism keeps each expert whole on a single GPU (48 per GPU) instead of slicing every expert across all eight, which cuts communication overhead compared to tensor parallelism alone. The benefit is largest on long generation sequences, which is exactly what coding output is, so for this model it's not optional tuning, it's the default you want.
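To make the sharding concrete, here is a small sketch of how 384 experts could map onto 8 GPUs. The contiguous-block layout in `owner_gpu` is an assumption for illustration only; vLLM's actual placement may differ.

```python
NUM_EXPERTS = 384  # from the architecture description
NUM_GPUS = 8       # one 8-GPU node


def experts_per_gpu(num_experts: int, num_gpus: int) -> int:
    """Whole experts hosted by each GPU under expert parallelism."""
    if num_experts % num_gpus != 0:
        raise ValueError("experts do not divide evenly across GPUs")
    return num_experts // num_gpus


def owner_gpu(expert_id: int, num_experts: int = NUM_EXPERTS,
              num_gpus: int = NUM_GPUS) -> int:
    """GPU holding a given expert, assuming contiguous blocks (hypothetical layout)."""
    return expert_id // experts_per_gpu(num_experts, num_gpus)


per_gpu = experts_per_gpu(NUM_EXPERTS, NUM_GPUS)
print(f"{per_gpu} experts per GPU")
print([owner_gpu(e) for e in (0, 47, 48, 383)])
```

Each token is routed to only a handful of experts, so under this layout a token's hidden state travels to the GPUs that own those experts rather than every GPU computing a slice of every expert.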

Once it's up, it speaks the OpenAI API:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/Kimi-K2.7-Code",
    "messages": [{"role": "user", "content": "Write a Python LRU cache with a TTL."}]
  }'

Expected response shape:

{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "choices": [{"message": {"role": "assistant", "content": "import time\nfrom collections import OrderedDict\n..."}}],
  "usage": {"prompt_tokens": 18, "completion_tokens": 240, "total_tokens": 258}
}
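The same call can be made from Python with only the standard library. This is a minimal sketch, assuming the server from the serve command is listening on localhost:8000; `build_request`, `extract_text`, and `chat` are hypothetical helper names, not part of any SDK.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # the port used in the serve command
MODEL = "moonshotai/Kimi-K2.7-Code"


def build_request(prompt: str, model: str = MODEL,
                  base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the same POST that the curl example sends."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(response: dict) -> str:
    """Pull the assistant text out of a chat.completion response."""
    return response["choices"][0]["message"]["content"]


def chat(prompt: str, timeout: float = 600.0) -> str:
    """Send one prompt and return the reply text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        return extract_text(json.load(resp))
```

With the server up, `print(chat("Write a Python LRU cache with a TTL."))` reproduces the curl call. The generous timeout is deliberate: a long coding answer can take a while to finish generating.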

Path 2: SGLang

SGLang is the other engine Moonshot recommends, and the launch is a one-liner once installed:

pip install "sglang[all]"

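# If your SGLang build rejects --quantization int4, drop the flag:
# the quantization method is read from the checkpoint config by default.
python -m sglang.launch_server \
  --model-path moonshotai/Kimi-K2.7-Code \
  --quantization int4 \
  --tp 8 \
  --context-length 262144 \
  --tool-call-parser kimi_k2 \
  --reasoning-parser kimi_k2 \
  --trust-remote-code \
  --host 0.0.0.0 --port 8000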

Both engines give you an OpenAI-compatible endpoint, so picking between them is about which you already run. For deeper vLLM tuning — auth, multiple models, Nginx — see the vLLM setup guide.

A real problem you'll hit on AMD

If you try --tp 8 on a 4-GPU AMD box, or copy an 8×H200 command onto MI300X hardware, the server fails to start. Two constraints are in play: the TP size cannot exceed the number of GPUs in the node, and it has to divide the attention head count evenly. K2.7 Code has 64 attention heads, so TP=4 gives each GPU 16 heads, which is valid. TP=8 would also divide the heads, but AMD's supported configs have four GPUs, so there is nothing to run the other four ranks on. On the MoE INT4 path, expert parallelism is rejected on AMD as well. The fix is concrete: on AMD, keep --tp 4 and drop --enable-expert-parallel for the INT4 MoE path. On NVIDIA 8-GPU nodes, TP=8 with expert parallel is correct.
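A small preflight check can catch a bad TP value before the server spends minutes loading weights. This sketch encodes two assumptions about what the launcher requires: the TP size cannot exceed the GPUs on the node, and the attention head count must divide evenly by it. `check_tensor_parallel` is a hypothetical helper, not an engine API.

```python
def check_tensor_parallel(num_heads: int, tp_size: int,
                          gpus_available: int) -> list[str]:
    """Return a list of problems; an empty list means the TP setting looks sane."""
    problems = []
    if tp_size > gpus_available:
        problems.append(
            f"TP={tp_size} needs {tp_size} GPUs, node has {gpus_available}"
        )
    if num_heads % tp_size != 0:
        problems.append(
            f"{num_heads} heads do not split evenly across TP={tp_size}"
        )
    return problems


# 64 attention heads, as stated above.
for tp, gpus in [(4, 4), (8, 4), (8, 8)]:
    issues = check_tensor_parallel(64, tp, gpus)
    print(f"TP={tp} on {gpus} GPUs:", issues or "ok")
```

For 64 heads, TP=4 on four GPUs passes and TP=8 on eight GPUs passes, while TP=8 on a four-GPU box is flagged.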

Path 3: the consumer GGUF (single 24GB GPU)

This is the only way K2.7 Code touches a normal machine, and it works because of how Unsloth quantizes MoE models. The trick: keep the active path in VRAM and offload all the rarely-touched expert layers to system RAM or a fast SSD.

Unsloth's dynamic GGUF sizes for K2.7 Code:

Quant	Size	Runs on
Full precision	605GB	Datacenter only
UD-Q8_K_XL	595GB	Lossless, multi-node
UD-Q4_K_XL	~585GB	Multi-GPU server
UD-Q2_K_XL	345GB	Best size/quality balance
UD-TQ1_0 (1.8-bit)	~325GB	1×24GB GPU + 256GB RAM

The rule of thumb: your combined RAM + VRAM should be at least roughly the quant size. It still runs if you're short, just slower as it pages from disk. The 1.8-bit UD-TQ1_0 quant will run on a single 24GB GPU if you offload every MoE layer to system RAM or a fast SSD, which is why this needs 256GB+ of RAM to be tolerable rather than painful.
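The rule of thumb reduces to one subtraction. A minimal sketch, using the approximate sizes from the table above; `memory_gap_gb` is a hypothetical helper and the 384GB configuration is only an example.

```python
QUANT_SIZES_GB = {  # approximate sizes from the table above
    "UD-Q8_K_XL": 595,
    "UD-Q4_K_XL": 585,
    "UD-Q2_K_XL": 345,
    "UD-TQ1_0": 325,
}


def memory_gap_gb(quant: str, vram_gb: float, ram_gb: float) -> float:
    """Positive: spare memory. Negative: amount that must page from disk."""
    return (vram_gb + ram_gb) - QUANT_SIZES_GB[quant]


for ram in (256, 384):
    gap = memory_gap_gb("UD-TQ1_0", vram_gb=24, ram_gb=ram)
    if gap >= 0:
        status = f"fits with {gap:.0f} GB to spare"
    else:
        status = f"short by {-gap:.0f} GB, will page from disk"
    print(f"24GB GPU + {ram}GB RAM: {status}")
```

By these numbers a 24GB GPU with 256GB of RAM is still short of the 1.8-bit quant, so some paging from disk is expected at that floor. Leave room for the operating system and KV cache on top of the raw weight size.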

Download and run with llama.cpp:


# grab the 1.8-bit quant
huggingface-cli download unsloth/Kimi-K2.7-Code-GGUF \
  --include "*UD-TQ1_0*" --local-dir kimi-k2.7

# run, offloading experts to