Here's a fact that should stop every AI infrastructure engineer in their tracks: as of mid-2026, the de facto standard for serving the 671B-parameter DeepSeek-R1 model in production still requires 8x H100 GPUs and roughly $200,000 of hardware. Meanwhile, an open-source project from MADSys Lab at Tsinghua University has been quietly running 236B-parameter MoE models on a single workstation since 2024, and has hit 286 tokens/s prefill on DeepSeek-R1 671B on commodity hardware. That project is kvcache-ai/ktransformers, and as of 2026-06-12 it has 17,264 Stars, 1,313 Forks, and an Apache-2.0 license.
The 2026 AI infrastructure conversation has been dominated by NVIDIA rack-scale systems and the ever-growing VRAM bill. KTransformers is the open-source counter-narrative: it lets you run frontier-class MoE models on a mix of consumer GPUs and CPU RAM, and it does this with five production-grade techniques that almost nobody talks about.
Context: Why CPU/GPU Hybrid Inference Matters in 2026
In 2026, Mixture-of-Experts (MoE) has become the default architecture for frontier open-weight models. DeepSeek-V3/R1, Qwen3-235B-A22B, Kimi-K2.5, GLM-4.7, and the new DeepSeek-V4-Flash are all MoE. The naive assumption is that MoE inference still needs H100-class GPUs because the total parameter count is enormous (671B for DeepSeek-R1, 1T for Kimi-K2.5). But each token only activates a few experts, so the active parameter count per token is small. The CPU-GPU hybrid approach exploits this by moving the "cold" experts to CPU RAM and keeping the "hot" experts on the GPU. KTransformers has turned this idea into a production framework that supports nine different MoE models as of v0.6.2 (released 2026-05-03). The 2026 ACM SIGOPS paper "KTransformers: Unleashing the Full Potential of CPU/GPU Hybrid Inference for MoE Models" formally describes the architecture.
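To put rough numbers on the total-versus-active gap, here is a back-of-the-envelope sketch using Qwen3-235B-A22B, whose name encodes 235B total and 22B activated parameters. The 8-bit weight size and the 24 GB VRAM budget (one RTX 4090-class card) are assumptions chosen for illustration, and KV cache and activations are ignored.

```python
def weight_gb(params_billion, bytes_per_param):
    """Approximate weight footprint in GB (1e9 params x bytes per param / 1e9)."""
    return params_billion * bytes_per_param

TOTAL_B, ACTIVE_B = 235, 22   # Qwen3-235B-A22B: total vs. activated parameters (billions)
BYTES_PER_PARAM = 1           # assumption: 8-bit weights
VRAM_GB = 24                  # assumption: a single RTX 4090-class GPU

total_gb = weight_gb(TOTAL_B, BYTES_PER_PARAM)
active_gb = weight_gb(ACTIVE_B, BYTES_PER_PARAM)
print(f"all weights: {total_gb} GB | touched per token: {active_gb} GB | VRAM: {VRAM_GB} GB")
```

The catch is that the activated experts change from token to token, so you can't simply pin 22B parameters on the GPU. That is why the bulk of the experts has to stay reachable in CPU RAM.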
Hidden Use #1: CPU-GPU Expert Scheduling with Frequency-Aware Placement
What most people do: They treat the GPU as a black box and try to fit the entire MoE model into VRAM. When the model is too large, they either buy more GPUs or use a smaller model.
The hidden trick: KTransformers exposes four explicit expert placement strategies via the --kt-expert-placement-strategy flag. The frequency strategy records expert activation statistics, then places only the most frequently activated experts on the GPU while keeping cold experts in CPU RAM. You can also enable --kt-enable-dynamic-expert-update to redistribute experts at runtime when the prefill token count exceeds a threshold.
# Start the server with frequency-based placement
python -m sglang.launch_server \
--model /path/to/qwen3-next-80b \
--kt-num-gpu-experts 8 \
--kt-expert-placement-strategy frequency \
--init-expert-location /path/to/activation_stats.pt
# Add dynamic redistribution for long-context workloads
python -m sglang.launch_server \
--model /path/to/qwen3-next-80b \
--kt-num-gpu-experts 8 \
--kt-expert-placement-strategy frequency \
--init-expert-location /path/to/activation_stats.pt \
--kt-enable-dynamic-expert-update \
--kt-gpu-prefill-token-threshold 512
The result: On Qwen3-Next-80B-A3B-Instruct-FP8 with 4x RTX 4090 + Intel Xeon Gold 6454S, the official benchmark table shows that at a 50% GPU expert ratio, the frequency strategy delivers 76.19 tokens/s, and dynamic-expert-update pushes that to 81.17 tokens/s (versus 65.25 tokens/s for the default uniform strategy). At 80% GPU ratio, the frequency strategy hits 100.67 tokens/s.
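The idea behind frequency-aware placement can be sketched in a few lines of Python. This is a toy illustration, not KTransformers' implementation: the function names and the routing-trace format are hypothetical, and the real framework works from the recorded activation statistics file passed via --init-expert-location.

```python
from collections import Counter

def plan_expert_placement(routing_trace, num_experts, num_gpu_experts):
    """Split expert ids into a GPU-resident list and a CPU-resident list.

    routing_trace: iterable of per-token lists of activated expert ids.
    """
    counts = Counter()
    for token_experts in routing_trace:
        counts.update(token_experts)
    # Rank every expert (never-activated ones count as 0) by activation count;
    # ties break toward the lower id so the plan is deterministic.
    ranked = sorted(range(num_experts), key=lambda e: (-counts[e], e))
    return sorted(ranked[:num_gpu_experts]), sorted(ranked[num_gpu_experts:])

def gpu_hit_rate(routing_trace, gpu_experts):
    """Fraction of expert activations that would be served from the GPU."""
    gpu_set = set(gpu_experts)
    total = hits = 0
    for token_experts in routing_trace:
        for expert in token_experts:
            total += 1
            hits += expert in gpu_set
    return hits / total if total else 0.0

# Six tokens, two experts activated per token, eight experts in the layer.
trace = [[0, 3], [0, 5], [3, 0], [7, 0], [3, 5], [0, 3]]
gpu, cpu = plan_expert_placement(trace, num_experts=8, num_gpu_experts=2)
rate = gpu_hit_rate(trace, gpu)
print(gpu, cpu, rate)
```

In this toy trace, keeping just two of eight experts on the GPU covers 9 of the 12 activations. Skewed routing distributions are what make the frequency strategy pay off compared with a uniform split.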
Data sources: kvcache-ai/ktransformers GitHub 17,264 Stars, 1,313 Forks, Apache-2.0, last push 2026-06-07, v0.6.2 released 2026-05-03; benchmark table from doc/en/kt-kernel/experts-sched-Tutorial.md; HN "Show HN: KTransformers-236B Model and 1M Context LLM Inference" 20 points (story from 2024-08-29, 3 comments).
Hidden Use #2: 3-Layer (GPU-CPU-Disk) Prefix Cache Reuse
What most people do: They rebuild the KV cache from scratch for every request. For long-context workloads (a 100K token system prompt plus a 50K token conversation), this is a multi-minute cold start every single time.
The hidden trick: KTransformers' balance_serve engine implements a 3-layer KV cache hierarchy. Hot prefixes live on the GPU, warm prefixes live in CPU RAM, and cold prefixes live on disk. The attn.page_size and kvc2.cpu_memory_size_GB parameters control the split. Once you enable it, repeated requests that share a system prompt only compute the KV cache for the delta, not the full context.
# ktransformers/configs/config.yaml
attn:
  page_size: 16                # Size of a page in KV Cache
  chunk_size: 256
kvc2:
  gpu_only: false              # false = Disk + CPU + GPU KV storage
  utilization_percentage: 1.0
  cpu_memory_size_GB: 500      # Amount of CPU memory allocated for KV Cache
  disk_path: /mnt/data/kvc     # Path to store KV Cache on disk
After editing the config, recompile with prefix cache mode enabled:
git submodule update --init --recursive
USE_BALANCE_SERVE=1 bash ./install.sh
# For dual-NUMA systems with 1TB+ RAM:
USE_BALANCE_SERVE=1 USE_NUMA=1 bash ./install.sh
The result: Multi-turn agent workflows and RAG pipelines with a stable system prompt reuse the cached prefix across thousands of requests. The CPU-GPU-Disk split means you can serve models whose total context window is far larger than GPU VRAM, with the disk layer acting as a transparent extension.
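To make the "only compute the delta" behavior concrete, here is a minimal sketch of page-granular prefix matching across three tiers. It is an illustration under stated assumptions, not the balance_serve code: the class, its methods, and the chained-hash keying are hypothetical, and the tiers are plain in-memory sets rather than real GPU, RAM, or disk storage.

```python
import hashlib

PAGE_SIZE = 16  # tokens per KV page, mirroring attn.page_size in the config

def page_keys(tokens, page_size=PAGE_SIZE):
    """One chained hash per full page, so each key identifies the whole prefix up to it."""
    keys, h = [], hashlib.sha256()
    full = len(tokens) - len(tokens) % page_size  # a trailing partial page is never cached
    for start in range(0, full, page_size):
        h.update(repr(tokens[start:start + page_size]).encode())
        keys.append(h.copy().hexdigest())
    return keys

class TieredPrefixCache:
    def __init__(self):
        # Searched in insertion order: fastest tier first.
        self.tiers = {"gpu": set(), "cpu": set(), "disk": set()}

    def insert(self, tokens, tier="gpu"):
        self.tiers[tier].update(page_keys(tokens))

    def match(self, tokens):
        """Return (reused_token_count, tier_per_page) for the longest cached prefix."""
        reused, hit_tiers = 0, []
        for key in page_keys(tokens):
            tier = next((name for name, keys in self.tiers.items() if key in keys), None)
            if tier is None:
                break  # first miss ends the reusable prefix
            hit_tiers.append(tier)
            reused += PAGE_SIZE
        return reused, hit_tiers

cache = TieredPrefixCache()
system_prompt = list(range(64))                # 4 full pages of token ids
cache.insert(system_prompt, tier="cpu")

request = system_prompt + [1000, 1001, 1002]   # shared prefix plus a new user turn
reused, tiers = cache.match(request)
print(reused, tiers, len(request) - reused)
```

Here 64 of the 67 tokens come from the cache and only the 3-token delta would need a fresh prefill. A request that changes even the first token of the system prompt misses on page one and reuses nothing, which is why a stable prefix matters.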
Data sources: kvcache-ai/ktransformers GitHub 17,264 Stars, Apache-2.0; configuration format from doc/en/prefix_cache.md; release notes from doc/en/balance-serve.md documenting v0.2.4 multi-concurrency architecture refactor.
Hidden Use #3: AMX BF16/INT8 Acceleration (8x Faster Than AVX-512)
What most people do: They run CPU matrix multiplications on AVX-512 instructions, which is the default in llama.cpp and most other inference stacks. On consumer CPUs, this caps MoE inference at 60-80 tokens/s.
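The hidden trick: KTransformers v0.3+ ships native AMX (Intel Advanced Matrix Extensions) kernels for BF16 and INT8 quantization. AMX introduces 8 dedicated Tile registers (tmm0-tmm7) per CPU core, each holding up to 16 rows x 64 bytes. A single TDPBF16PS instruction multiplies a 16x32 BF16 tile into a 16x16 FP32 accumulator, which works out to 16 x 16 x 32 = 8,192 multiply-adds; the INT8 tile instructions pack twice as many values per row and reach 16,384 multiply-adds, or 32,768 operations if multiplies and adds are counted separately. At one tile instruction per 16 CPU cycles, that is up to 2,048 INT8 operations per cycle per core, roughly 8x the peak throughput of AVX-512 on the same silicon.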
# Install with AMX support
USE_BALANCE_SERVE=1 bash ./install.sh
# Run Qwen3MoE with the AMX backend
python ktransformers/server/main.py \
--architectures Qwen3MoeForCausalLM \
--model_path <model_dir> \
--gguf_path <gguf_dir> \
--optimize_config_path ktransformers/optimize/optimize_rules/Qwen3Moe-serve-amx.yaml \
--backend_type balance_serve
The result: On a workstation with Xeon 4th Gen + RTX 4090, KTransformers with AMX hits 347 tokens/s prefill on Qwen3MoE-235B-A22B. The smaller Qwen3MoE-30B-A3B runs smoothly on a consumer i9-14900KF + DDR5-4000, with a high-end gaming laptop as the lower bound. KTransformers also offers an AVX2-only path (selected via --kt-method for non-AMX CPUs), making the same MoE inference stack usable across Sapphire Rapids servers, EPYC workstations, and consumer desktops.
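Before picking a backend, it helps to check which instruction sets your CPU actually exposes. The sketch below parses the flags line of Linux's /proc/cpuinfo, where the kernel reports AMX as amx_tile, amx_bf16, and amx_int8. The function names and the returned labels are hypothetical and are not the values KTransformers expects for --kt-method.

```python
def cpu_flags(cpuinfo_text):
    """Collect the feature flags from /proc/cpuinfo-style text (Linux, x86)."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()

def pick_cpu_path(flags):
    """Map CPU flags to a kernel family, preferring the widest one available."""
    if {"amx_tile", "amx_bf16", "amx_int8"} <= flags:
        return "amx"
    if "avx512f" in flags:
        return "avx512"
    if "avx2" in flags:
        return "avx2"
    return "unsupported"

# Abbreviated sample lines; a real flags line is much longer.
sapphire_rapids = "flags\t\t: fpu sse2 avx2 avx512f avx512_bf16 amx_bf16 amx_tile amx_int8"
consumer_desktop = "flags\t\t: fpu sse2 avx avx2 fma"

print(pick_cpu_path(cpu_flags(sapphire_rapids)))   # amx
print(pick_cpu_path(cpu_flags(consumer_desktop)))  # avx2
```

On a real Linux machine you would pass in open("/proc/cpuinfo").read() instead of the sample strings.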
Data sources: kvcache-ai/ktransformers GitHub 17,264 Stars, Apache-2.0; AMX instruction details and 347 tokens/s prefill benchmark from doc/en/AMX.md; Intel AMX intrinsic reference from the same doc; HN "Show HN: KTransformers-671B DeepSeek-R1 on a Single Machine" 14 points (story from 2025-02-10, 0 comments at time of indexing).
Hidden Use #4: Multi-Concurrency balance_serve with Continuous Batching
What most people do: They run inference with a single request at a time, treating the LLM like a batch script. Throughput is limited to whatever one user can squeeze out of the GPU.
The hidden trick: KTransformers v0.2.4 introduced balance_serve, an SGLang-inspired C++ engine with three architectural layers: Server (handles OpenAI-compatible HTTP), Inference Engine (executes chunked prefill), and Scheduler (continuous batching in FCFS order). Combined with custom flashinfer kernels and variable batch size CUDA Graphs, this design lifts aggregate throughput by 130% under 4-way concurrency on DeepSeek-R1 0528. Intel engineers validated it on Xeon6 + MRDIMM-8800, going from 17 tokens/s single-user to 40 tokens/s aggregate output throughput, with the bottleneck shifting to the GPU side.
# Pull and run the v0.2.4+ multi-concurrency Docker image
docker pull approachingai/ktransformers:v0.2.4-AVX512
docker run -it --gpus all --privileged --shm-size 64g \
--name ktrans --network=host -v /mnt:/mnt \
approachingai/ktransformers:v0.2.4-AVX512 /bin/bash
# Open a second terminal and exec in
docker exec -it ktrans bash
# Start the multi-concurrency server
python ktransformers/server/main.py \
--architectures Qwen3MoeForCausalLM \
--model_path <model_dir> \
--gguf_path <gguf_dir> \
--optimize_config_path ktransformers/optimize/optimize_rules/Qwen3Moe-serve.yaml \
--backend_type balance_serve
# Hit it with multiple concurrent requests
for i in 1 2 3 4; do
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"model-name","messages":[{"role":"user","content":"Hello!"}],"stream":true}' &
done
wait
The result: A single KTransformers server now serves an entire team's interactive LLM workloads. On a Xeon6 + MRDIMM-8800 testbed, the multi-concurrency path bumped total output throughput from 17 tokens/s to 40 tokens/s, a 2.35x lift, by amortizing GPU cost across concurrent users. The OpenAI-compatible /v1/chat/completions API means existing tooling (LangChain, LlamaIndex, Cursor, Continue.dev) drops in unchanged.
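To see why continuous batching helps, here is a toy step-counting model of an FCFS scheduler with chunked prefill. It is a sketch, not the balance_serve scheduler: the Request class and run function are hypothetical, and it treats every engine step as equal cost, which real hardware does not.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: int
    new_tokens: int
    prefilled: int = 0
    generated: int = 0

    def done(self):
        return self.prefilled >= self.prompt_tokens and self.generated >= self.new_tokens

def run(requests, max_batch, chunk_size=256):
    """Count the engine steps needed to finish every request.

    Each step, every running request advances: one prefill chunk while its
    prompt is still being processed, otherwise one decoded token.
    """
    waiting, running, steps = deque(requests), [], 0
    while waiting or running:
        # FCFS admission: fill free batch slots in arrival order.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        for r in running:
            if r.prefilled < r.prompt_tokens:
                r.prefilled = min(r.prompt_tokens, r.prefilled + chunk_size)
            else:
                r.generated += 1
        # Finished requests leave right away, freeing slots for the next step.
        running = [r for r in running if not r.done()]
        steps += 1
    return steps

def make_requests():
    return [Request(prompt_tokens=512, new_tokens=100) for _ in range(4)]

serial_steps = run(make_requests(), max_batch=1)
batched_steps = run(make_requests(), max_batch=4)
print(serial_steps, batched_steps)
```

The step count drops 4x in this idealized model because a batched step is assumed to cost the same as a single-request step. In practice a batched step is more expensive, which is consistent with the measured 4-way gain above being 17 to 40 tokens/s rather than a full 4x.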
Data sources: kvcache-ai/ktransformers GitHub 17,264 Stars, 1,313 Forks, Apache-2.0; 130% throughput gain and 17 to 40 tokens/s benchmark from doc/en/balance-serve.md; v0.2.4 release notes from the same doc; HN "Show HN: KTransformers-236B Model and 1M Context LLM Inference on Local Machines" 20 points (2024-08-29).
Hidden Use #5: LLaMA-Factory SFT for MoE LoRA Fine-Tuning (6-12x Faster Than ZeRO-Offload)
What most people do: They fine-tune MoE models with ZeRO-Offload in DeepSpeed. It works, but the CPU offload makes training painfully slow because every optimizer step shuttles hundreds of GB of gradients through the PCIe bus.
The hidden trick: KTransformers v0.6.1 ships a ktransformers[sft] extra that integrates directly with LLaMA-Factory. The integration uses KT-Kernel's CPU-optimized INT8/INT4 quantization on the optimizer states, plus FSDP2 with intelligent sharding. The result is 6-12x training speedup over ZeRO-Offload in benchmarked MoE SFT workloads, with roughly half the CPU memory.
# Install the SFT stack
conda create -n kt-sft python=3.11 -y
conda activate kt-sft
pip install --extra-index-url https://download.pytorch.org/whl/cu130 \
torch==2.9.1 torchvision==0.24.1 torchaudio==2.9.1
# Install LLaMA-Factory + KT SFT
cd /path/to/LLaMA-Factory
pip install -e .
pip install -r requirements/ktransformers.txt
# Launch MoE LoRA SFT on the GPUs listed in CUDA_VISIBLE_DEVICES (four here)
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch \
--config_file examples/ktransformers/accelerate/fsdp2_kt_int8.yaml \
src/train.py \
examples/ktransformers/train_lora/qwen3_5moe_lora_sft_kt.yaml
The result: On DeepSeek-V3 and DeepSeek-R1, KT SFT runs at 3.7 it/s with ~80GB total GPU memory on 4x RTX 4090. Qwen3-30B-A3B trains at 8+ it/s on a single RTX 4090 with ~24GB total. This makes it feasible to fine-tune frontier MoE models on one to four consumer-grade GPUs instead of an 8x H100 cluster.
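For readers unfamiliar with the INT8 quantization mentioned above, here is a minimal sketch of symmetric absmax quantization in pure Python. It shows the general technique only: KT-Kernel's actual scheme is not described in this article and may differ (per-group scales or different rounding, for example), and the function names are hypothetical.

```python
def quantize_int8(values):
    """Symmetric absmax quantization of a list of floats to int8 codes."""
    absmax = max((abs(v) for v in values), default=0.0)
    scale = absmax / 127 if absmax else 1.0  # avoid dividing by zero for all-zero input
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize_int8(codes, scale):
    """Map int8 codes back to approximate float values."""
    return [c * scale for c in codes]

state = [0.02, -0.5, 0.13, 0.0, 0.49]
codes, scale = quantize_int8(state)
restored = dequantize_int8(codes, scale)
max_err = max(abs(a - b) for a, b in zip(state, restored))
print(codes, scale, max_err)
```

Each value shrinks from 4 bytes (FP32) to 1 byte plus a shared scale, and the rounding error per value is bounded by half the scale. That trade of a little precision for a 4x smaller footprint is what makes CPU-side storage and transfers cheaper.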
Data sources: kvcache-ai/ktransformers GitHub 17,264 Stars, Apache-2.0; 6-12x speedup claim and 3.7 it/s / 8+ it/s benchmarks from doc/en/SFT/KTransformers-Fine-Tuning_Quick-Start.md and the SFT introduction in the main README; integration PR at hiyouga/LLaMA-Factory#10430; HN Show HN 20+14 points across the two launch stories (2024-08-29 and 2025-02-10).
Summary
Five production-grade techniques that turn KTransformers from a research curiosity into a 2026 AI infrastructure workhorse:
- CPU-GPU Expert Scheduling with frequency-aware placement and dynamic redistribution (114.26 tokens/s at 100% GPU ratio; 81.17 tokens/s with dynamic update at 50% ratio)
- 3-Layer Prefix Cache Reuse spanning GPU, CPU RAM, and disk (turns multi-minute cold starts into incremental updates)
- AMX BF16/INT8 Acceleration delivering 8x the throughput of AVX-512 (347 tokens/s prefill on a 4th Gen Xeon + RTX 4090 workstation)
- Multi-Concurrency balance_serve with continuous batching (130% throughput gain under 4-way concurrency, 17 to 40 tokens/s aggregate on Xeon6)
- LLaMA-Factory SFT Integration for MoE LoRA fine-tuning (6-12x faster than ZeRO-Offload, 3.7 it/s on 4x RTX 4090 for DeepSeek-V3)
If you have read the other articles in this series, these will feel familiar: Agent Skills: 5 Hidden Uses in 49K Stars of Workflow Magic shows a similar "framework hides 5 production tricks" pattern for engineering skills, MemPalace: 5 Hidden Uses That Make It the Best-Benchmarked AI Memory System tackles memory infrastructure with comparable depth, and Goose's 5 Hidden Uses That Turn It Into a Production AI Agent Stack demonstrates the same "production tricks" pattern for the agent orchestration layer.
What is the most underrated MoE inference optimization you have hit in 2026? Drop a comment with the throughput number, the hardware, and the model, and we will dig into it in a future article.