Mooncake is a service-layer system designed to support LLM execution by separating the PREFILL phase (initial context construction) from the DECODE phase (token generation).
It leverages CPU, SSD, and DRAM resources to efficiently manage the KVCache generated during prompt execution on vLLM, enabling reuse of previously computed data and reducing GPU workload during inference.
In this post, we will explore what Mooncake is, its core components, its purpose, and how it integrates into the model execution pipeline.
We will then review how to build and run the system, what dependencies are required, and the issues you may encounter — along with their solutions.
What is Mooncake?
Mooncake is a distributed, high-performance storage system designed specifically for managing KVCache used in Large Language Model (LLM) inference.
Its main goal is to make LLM execution faster and more scalable by allowing multiple servers and GPUs to share precomputed context, instead of recalculating it each time.
What Problem Does Mooncake Solve?
When an LLM processes a prompt, it generates a structure called KVCache (key–value cache).
This cache stores the internal attention states of the model and is required for generating the next tokens.
However:
- KVCache is large.
- Recomputing it for every request is expensive.
- Passing it between servers is normally slow.
- GPU memory is limited.
Mooncake provides an efficient way to store, share, and reuse this KVCache across machines.
Core Ideas (Simplified)
1. Split between Prefill and Decode
Mooncake separates the LLM workflow into two phases:
- Prefill: the model processes the prompt and generates the KVCache.
- Decode: token generation uses the already-computed KVCache.
With Mooncake, once Prefill is done, the KVCache can be saved and reused by any other server.
This means Decode does **not** need to recompute anything — reducing GPU load.
2. Distributed Memory Store
Mooncake includes a Store Cluster made up of many worker nodes.
Each worker contributes:
- DRAM (fast memory)
- SSD (persistent storage)
Together they form a single, shared memory pool for holding KVCache objects.
3. Fast Data Transfer (Transfer Engine)
Mooncake uses a high-speed communication engine supporting:
- RDMA
- NVMe-over-Fabric
- TCP
This allows “zero-copy” or near-zero-copy transfer of KVCache segments between machines.
The result is extremely high throughput with low latency.
4. Replication and Resilience
Mooncake automatically replicates KVCache objects across multiple workers.
This ensures:
- No “hotspots” (one overloaded server)
- Data availability even if a node fails
As long as the system has an active master and a reachable client, Mooncake continues operating.
5. Smart Memory Management
The system includes:
- LRU eviction (least recently used items are removed first)
- Soft pinning (prevent eviction of important cache objects)
- Persistence (optional SSD-based storage)
This keeps memory usage predictable and efficient.
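These policies can be pictured with a small Python sketch of an LRU cache with soft pinning (illustrative only, not Mooncake's actual implementation): pinned keys are skipped during eviction unless nothing else can be freed.

```python
from collections import OrderedDict

class LRUCacheWithPinning:
    """Illustrative LRU cache with soft pinning (not Mooncake's real code)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = OrderedDict()   # key -> value, oldest first
        self.pinned = set()          # keys protected from eviction

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        while len(self.items) > self.capacity:
            self._evict_one()

    def get(self, key):
        if key in self.items:
            self.items.move_to_end(key)   # mark as recently used
            return self.items[key]
        return None

    def pin(self, key):
        self.pinned.add(key)

    def _evict_one(self):
        # Evict the oldest unpinned entry; fall back to the oldest entry
        # if everything is pinned ("soft" pinning, not a hard guarantee).
        for key in self.items:
            if key not in self.pinned:
                del self.items[key]
                return
        self.items.popitem(last=False)
```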
6. Simple Developer API
Clients can communicate with Mooncake using:
- C++ API
- Python API
The client can run as:
- an embedded library inside an inference service, or
- a standalone process.
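To give a feel for the client workflow, here is a minimal put/get sketch in Python. The class name, setup() signature, and argument order are assumptions based on the Mooncake Store Python API and may differ in your installed version, so check the project documentation before relying on them.

```python
# Hedged sketch of a Mooncake Store client (names and signature may differ by version).
from mooncake.store import MooncakeDistributedStore

store = MooncakeDistributedStore()

# Connect to the cluster; the values mirror the mooncake.json example later in this post.
store.setup(
    "10.1.222.133",               # local hostname
    "etcd://10.1.222.133:2379",   # metadata server
    274877906944,                 # global segment size (256 GiB)
    274877906944,                 # local buffer size (256 GiB)
    "tcp",                        # protocol (tcp or rdma)
    "",                           # device name (empty for tcp)
    "10.1.222.133:10001",         # master server address
)

# Store a KVCache blob under a key, then fetch it from any other client.
store.put("prompt-123/kvcache", b"...serialized kvcache bytes...")
data = store.get("prompt-123/kvcache")
```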
System Architecture (Simplified)
1. Inference Cluster
Runs LLM engines (e.g., vLLM). Creates KVCache.
2. Transfer Engine
Moves KVCache between inference nodes and Mooncake quickly.
3. Mooncake Store Cluster
Distributed memory pool storing KVCache.
4. Metadata Server (e.g., etcd/Redis)
Tracks where each KVCache object is stored and manages replicas.
How It Works (Step by Step)
1. Prefill
An LLM server processes the prompt → produces KVCache → saves it to Mooncake.
2. Share
Another server retrieves the same KVCache from Mooncake.
3. Decode
The second server generates tokens using the retrieved KVCache instead of recomputing it.
4. Eviction/Persistence
Mooncake cleans up old objects or saves them to SSD based on policy.
Key Advantages
- Higher throughput for LLM inference
- Lower GPU memory usage since KVCache can reside in DRAM/SSD
- Easy scaling by adding more worker nodes
- Fault tolerance through replication
- Optimized for long-context and multi-server LLM workloads
How to Build and Run Mooncake?
Install uv:
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
Use a specific Python version:
uv venv --python 3.10 --seed
source .venv/bin/activate
# (run "deactivate" to exit)
Install the required Python packages, choosing the PyTorch build that matches the CUDA version installed on your server (the example below is for CUDA 12.9):
uv pip install quart httpx matplotlib aiohttp pandas datasets modelscope setuptools openpyxl pynvml xlsxwriter
uv pip install --index-url https://download.pytorch.org/whl/cu129 torch torchvision torchaudio
Install Mooncake with uv:
uv pip install mooncake-transfer-engine
Install vLLM at a specific version:
git clone -b v0.8.5 https://github.com/vllm-project/vllm.git --recursive
cd vllm
python use_existing_torch.py
Install the build requirements:
uv pip install -r requirements/build.txt
Add the following environment variables to your shell configuration file (e.g., ~/.bashrc).
(Make sure to update all paths and version numbers to match the CUDA installation and directory structure on your server.)
export LD_LIBRARY_PATH=/usr/local/cuda-12.9/lib64:$LD_LIBRARY_PATH
export CUDA_HOME="/usr/local/cuda-12.9"
export PATH="$CUDA_HOME/bin:${PATH:+:${PATH}}"
export CUDACXX="$CUDA_HOME/bin/nvcc"
export CMAKE_CUDA_COMPILER="$CUDA_HOME/bin/nvcc"
export TORCH_CUDA_ARCH_LIST="8.9"
export MAX_JOBS=128
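Before compiling vLLM, a quick sanity check (my own addition, not part of the official steps) that PyTorch actually sees the GPU and the expected CUDA version can save a long failed build:

```python
# Quick sanity check that the installed PyTorch wheel matches your CUDA setup.
import torch

print("CUDA available:", torch.cuda.is_available())
print("Torch CUDA version:", torch.version.cuda)          # e.g. "12.9"
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("Compute capability:", torch.cuda.get_device_capability(0))  # e.g. (8, 9)
```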
Compile vLLM:
uv pip install --no-build-isolation -e .
Write a mooncake.json file and replace the IP address with your own.
Make sure to update the root_fs_dir path according to your server's directory structure.
{
"local_hostname": "10.1.222.133",
"metadata_server": "etcd://10.1.222.133:2379",
"global_segment_size": 274877906944,
"local_buffer_size": 274877906944,
"protocol": "tcp",
"device_name": "",
"master_server_address": "10.1.222.133:10001",
"cluster_id": "mooncake_cluster",
"root_fs_dir": "/mnt/mooncake_data"
}
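Optionally, a small helper script (mine, not part of Mooncake) can sanity-check the file before you start anything: it confirms the JSON parses, prints the segment sizes in GiB (274877906944 bytes = 256 GiB), and warns if root_fs_dir does not exist yet.

```python
# Sanity-check mooncake.json before launching the cluster (helper script, not part of Mooncake).
import json
import os

with open("mooncake.json") as f:
    cfg = json.load(f)

for key in ("global_segment_size", "local_buffer_size"):
    print(f"{key}: {cfg[key] / 2**30:.0f} GiB")

if not os.path.isdir(cfg["root_fs_dir"]):
    print(f"Warning: root_fs_dir {cfg['root_fs_dir']} does not exist yet")

print("metadata_server:", cfg["metadata_server"])
print("master_server_address:", cfg["master_server_address"])
```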
Download the model:
git lfs install
git clone "https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4"
Install etcd and check whether it is already running (kill the process if it is):
sudo apt install etcd-server
sudo lsof -i -P -n | grep etcd
sudo kill <PID>
Activate the venv if it isn't active:
source .venv/bin/activate
Check that the required ports are free:
lsof -t -i:8000
ps -ef | grep 'vllm.entrypoints.openai.api_server' | grep "port 8100" | awk -F ' ' '{print $2}'
ps -ef | grep 'vllm.entrypoints.openai.api_server' | grep "port 8200" | awk -F ' ' '{print $2}'
sudo lsof -i -P -n | grep mooncake_
sudo lsof -i -P -n | grep etcd
Show all occupied ports:
sudo lsof -i -P -n
Kill the processes if the ports are occupied:
sudo kill <PID1> <PID2> ...
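If you prefer a single command over the lsof/ps combinations above, this small helper (a convenience script of mine, assuming the default ports used in this post) reports which ports are already in use:

```python
# Convenience helper: report which of the relevant ports are already bound.
import socket

PORTS = {
    2379: "etcd",
    8000: "proxy",
    8100: "vLLM prefill",
    8200: "vLLM decode",
    10001: "mooncake_master",
}

for port, name in PORTS.items():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # connect_ex returns 0 if something is already listening on the port.
        in_use = s.connect_ex(("127.0.0.1", port)) == 0
    print(f"{name:15s} port {port}: {'OCCUPIED' if in_use else 'free'}")
```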
Run etcd:
nohup etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://localhost:2379 > etcd_output.log 2>&1 &
Run the Mooncake master (create the logs directory first with mkdir -p logs):
nohup mooncake_master \
--port 10001 \
--root_fs_dir /mnt/mooncake_data \
--cluster_id mooncake_cluster \
> logs/master.txt 2>&1 &
Run prefill:
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export VLLM_USE_V1=0
CUDA_VISIBLE_DEVICES=0 \
MOONCAKE_CONFIG_PATH=mooncake.json \
python3 -m vllm.entrypoints.openai.api_server \
--model /home/vllm/Qwen2.5-7B-Instruct-GPTQ-Int4 \
--port 8100 \
--max-model-len 10000 \
--gpu-memory-utilization 0.4 \
--kv-transfer-config '{"kv_connector":"MooncakeStoreConnector","kv_role":"kv_producer"}' \
> logs/prefill-0.txt 2>&1 &
Run decode:
CUDA_VISIBLE_DEVICES=0 \
MOONCAKE_CONFIG_PATH=mooncake.json \
python3 -m vllm.entrypoints.openai.api_server \
--model /home/vllm/Qwen2.5-7B-Instruct-GPTQ-Int4 \
--port 8200 \
--max-model-len 10000 \
--gpu-memory-utilization 0.4 \
--kv-transfer-config '{"kv_connector":"MooncakeStoreConnector","kv_role":"kv_consumer"}' \
> logs/decode-0.txt 2>&1 &
Run the proxy (replace --model with the path to your model):
python3 ../proxy_demo.py \
--model /home/vllm/Qwen2.5-7B-Instruct-GPTQ-Int4 \
--prefill localhost:8100 \
--decode localhost:8200 \
--port 8000 \
2>&1 | tee logs/proxy-1-1.txt
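Before sending real traffic, you can check that the prefill and decode servers answer on the OpenAI-compatible /v1/models endpoint. The proxy on port 8000 is a demo script and may not expose that route, so it is left out of this quick check:

```python
# Quick liveness check for the prefill and decode servers (httpx is already installed above).
import httpx

for name, port in [("prefill", 8100), ("decode", 8200)]:
    try:
        r = httpx.get(f"http://localhost:{port}/v1/models", timeout=5)
        print(f"{name} ({port}): HTTP {r.status_code}")
    except httpx.HTTPError as exc:
        print(f"{name} ({port}): not reachable ({exc})")
```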
Errors and issues that may arise while building and running Mooncake
- ISSUE:
ValueError: No available memory for the cache blocks. Try increasing gpu_memory_utilization when initializing the engine.
- SOLUTION:
Increase the --gpu-memory-utilization parameter: when the value is too low, vLLM cannot allocate enough GPU memory for the KV cache blocks.
Before raising it, make sure the GPU is actually free.
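To confirm the GPU really is free, you can query it with pynvml (already included in the package list above); this is a quick check of mine, not a required step:

```python
# Check how much GPU memory is actually free before tuning --gpu-memory-utilization.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # GPU 0
info = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"total: {info.total / 2**30:.1f} GiB")
print(f"used:  {info.used / 2**30:.1f} GiB")
print(f"free:  {info.free / 2**30:.1f} GiB")
pynvml.nvmlShutdown()
```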
- ISSUE:
You run into an AssertionError (issubclass(connector_cls, KVConnectorBase_V1)) when starting the prefill process with MooncakeStoreConnector.
- SOLUTION:
Set these environment variables before the run command:
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export VLLM_USE_V1=0
- ISSUE:
File "/home/project/.venv/lib/python3.12/site-packages/torchvision/datasets/__init__.py", line 1, in <module>
from ._optical_flow import FlyingChairs, FlyingThings3D, HD1K, KittiFlow, Sintel
File "/home/project/.venv/lib/python3.12/site-packages/torchvision/datasets/_optical_flow.py", line 14, in <module>
from .utils import _read_pfm, verify_str_arg
File "/home/project/.venv/lib/python3.12/site-packages/torchvision/datasets/utils.py", line 4, in <module>
import lzma
File "/usr/local/lib/python3.12/lzma.py", line 27, in <module>
from _lzma import *
ModuleNotFoundError: No module named '_lzma'
- SOLUTION:
This error occurs when another Python version has been installed on top of the server's base Python.
In my case, someone installed Python 3.12 without uv, which broke all Python 3.12 virtual environments: instead of using the system libraries built for the base Python 3.10, the environment tried to use the Python 3.12 libraries, but on Ubuntu 22 there is no compiled lzma module for that Python 3.12 installation.
The proper fix is to reinstall Python on the server, but that is a time-consuming process.
Therefore, if the base Python on your server is not 3.12, you can try creating the virtual environment with the version that matches your server, for example:
uv venv --python 3.10 --seed
instead of:
uv venv --python 3.12 --seed
and work around the problem that way.
On my server, this resolved the issue.
- ISSUE:
Errors when importing packages
- SOLUTION:
CUDA may not be installed correctly on your system. Install the appropriate CUDA version and then reinstall the required packages following the build instructions above, matching the version you installed.
- ISSUE:
You receive an error when running both PREFILL and DECODE in two separate processes.
- POSSIBLE SOLUTION:
You may not have enough GPU resources on the server. If you have only one GPU, it is not possible to run both PREFILL and DECODE on the same GPU. Therefore, run only PREFILL and do not run PROXY or DECODE.
Alternatively, use another server that has multiple GPUs.
After everything is working as required, all that remains is to send requests and view the results:
Simple request structure:
curl http://127.0.0.1:8100/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "/home/vllm/Qwen2.5-7B-Instruct-GPTQ-Int4",
"prompt": "what is Mooncake?",
"max_tokens": 30
}'
You can also try more complex requests with a Python script:
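For example, a minimal sketch using httpx (installed earlier), assuming the proxy on port 8000 forwards /v1/completions the same way the vLLM servers do:

```python
# Illustrative client script: send a completion request through the proxy on port 8000.
import httpx

payload = {
    "model": "/home/vllm/Qwen2.5-7B-Instruct-GPTQ-Int4",
    "prompt": "Explain what Mooncake is and why KVCache reuse matters.",
    "max_tokens": 200,
    "temperature": 0.7,
}

response = httpx.post(
    "http://127.0.0.1:8000/v1/completions",
    json=payload,
    timeout=120,
)
response.raise_for_status()
print(response.json()["choices"][0]["text"])
```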
