<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sara_T</title>
    <description>The latest articles on DEV Community by Sara_T (@__82e06472cd325ef306e6).</description>
    <link>https://dev.to/__82e06472cd325ef306e6</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3651593%2Fd3699654-14a2-4f3f-b4db-0eb879bfc9ee.png</url>
      <title>DEV Community: Sara_T</title>
      <link>https://dev.to/__82e06472cd325ef306e6</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/__82e06472cd325ef306e6"/>
    <language>en</language>
    <item>
      <title>Mooncake Memory Deep Dive: KVCache, Token Cost, DRAM Usage, and Saturation Analysis</title>
      <dc:creator>Sara_T</dc:creator>
      <pubDate>Thu, 18 Dec 2025 16:36:40 +0000</pubDate>
      <link>https://dev.to/__82e06472cd325ef306e6/mooncake-memory-deep-dive-kvcache-token-cost-dram-usage-and-saturation-analysis-1n88</link>
      <guid>https://dev.to/__82e06472cd325ef306e6/mooncake-memory-deep-dive-kvcache-token-cost-dram-usage-and-saturation-analysis-1n88</guid>
      <description>&lt;p&gt;This is Part 2 of our explanation about Mooncake.&lt;br&gt;
To learn more and get started with Mooncake, please refer to &lt;a href="https://dev.to/__82e06472cd325ef306e6/getting-started-with-mooncake-installation-execution-troubleshooting-1hh9"&gt;Part 1.&lt;/a&gt;&lt;br&gt;
In this part, we take a deep dive into advanced memory analysis, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to measure DRAM consumption&lt;/li&gt;
&lt;li&gt;How to calculate token cost&lt;/li&gt;
&lt;li&gt;How to check how many tokens are retained&lt;/li&gt;
&lt;li&gt;How to detect unexpected saturation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Whether you're optimizing performance or tracking resource efficiency, this guide gives you the tools to move forward with confidence.&lt;/p&gt;
&lt;h2&gt;
  
  
  How to measure DRAM consumption?
&lt;/h2&gt;

&lt;p&gt;To accurately measure how much DRAM is consumed by Mooncake, the system must be up and running with real requests being processed.&lt;/p&gt;

&lt;p&gt;Mooncake allocates DRAM dynamically as prompts are received and KVCache entries are created.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step-by-Step Method
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Run the system with logging &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Start the mooncake_master process with output redirected to a log file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nohup mooncake_master \
  --port 10001 \
  --root_fs_dir /mnt/mooncake_data \
  --cluster_id mooncake_cluster \
  &amp;gt; logs/master.txt 2&amp;gt;&amp;amp;1 &amp;amp;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;Send sample requests&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Start with lightweight prompts to establish a low baseline of memory usage.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Review the logs&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Open the &lt;code&gt;logs/master.txt&lt;/code&gt; file and look for DRAM-related metrics.&lt;/p&gt;

&lt;p&gt;The log includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Total DRAM usage&lt;/li&gt;
&lt;li&gt;Number of KVCache objects stored&lt;/li&gt;
&lt;li&gt;Internal write operations&lt;/li&gt;
&lt;li&gt;Count of requests per API type&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For this section, we’ll focus only on DRAM usage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example: Log Output from mooncake_master&lt;/strong&gt;&lt;br&gt;
Below is a sample log excerpt from the mooncake_master process, captured during runtime.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjh0kgvz0g748wgkvurks.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjh0kgvz0g748wgkvurks.png" alt="Example: Log Output from mooncake_master&amp;lt;br&amp;gt;
" width="800" height="250"&gt;&lt;/a&gt;&lt;/p&gt;
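&lt;p&gt;As a cross-check on the figures reported in the log, you can also sample the master process's resident memory directly from the operating system. This is a generic Linux sketch (reading &lt;code&gt;/proc&lt;/code&gt;), not a Mooncake API:&lt;/p&gt;

```python
import os

def rss_kib(pid):
    """Return a process's resident set size (VmRSS) in KiB, read from /proc (Linux)."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # /proc reports the value in kB
    raise RuntimeError("VmRSS not found")

# Example: sample this script's own process; substitute the mooncake_master
# PID (e.g. from pgrep mooncake_master) to track the master instead.
print(rss_kib(os.getpid()), "KiB resident")
```

&lt;p&gt;If the OS-level number sits far above what the log reports, memory is being held outside the KVCache accounting, which is worth investigating on its own.&lt;/p&gt;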
&lt;h2&gt;
  
  
  How to Calculate Token Cost (KVCache Memory)
&lt;/h2&gt;

&lt;p&gt;Every token processed by a large language model consumes memory — primarily in the KVCache, which stores attention Key/Value tensors per token, per layer, and per attention head.&lt;/p&gt;

&lt;p&gt;Understanding how much memory each token uses is essential for capacity planning, optimization, and preventing resource saturation in inference systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Calculate how much memory a single token consumes in the model’s internal KVCache, based on architectural parameters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Formula&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To estimate token memory cost, use:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;head_dim = hidden_size / num_attention_heads&lt;/code&gt;&lt;br&gt;
&lt;code&gt;Memory_per_token = num_kv_heads × head_dim × 2 (K+V) × bytes_per_value × num_layers&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;: Qwen2 Configuration&lt;/p&gt;

&lt;p&gt;From the model's config:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;{&lt;br&gt;
  "hidden_size": 3584,&lt;br&gt;
  "num_attention_heads": 28,&lt;br&gt;
  "num_key_value_heads": 4,&lt;br&gt;
  "num_hidden_layers": 28,&lt;br&gt;
  "torch_dtype": "float16"&lt;br&gt;
}&lt;/code&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step-by-step:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Calculate head_dim:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;head_dim = 3584 / 28 = 128&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Plug into the formula:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Memory_per_token = 4 × 128 × 2 × 2 × 28 = 57344 bytes ≈ 56 KB&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;Result&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Each token consumes &lt;strong&gt;~56 KB&lt;/strong&gt; of KVCache memory (in float16).&lt;/p&gt;
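&lt;p&gt;The calculation above is easy to script. Here is a minimal sketch; the config values are the Qwen2 numbers quoted above, and &lt;code&gt;bytes_per_value=2&lt;/code&gt; corresponds to float16:&lt;/p&gt;

```python
def kvcache_bytes_per_token(cfg, bytes_per_value=2):
    """Estimate per-token KVCache cost from a Hugging Face-style model config."""
    head_dim = cfg["hidden_size"] // cfg["num_attention_heads"]
    # one K and one V vector per KV head, per layer
    return (cfg["num_key_value_heads"] * head_dim * 2
            * bytes_per_value * cfg["num_hidden_layers"])

qwen2 = {
    "hidden_size": 3584,
    "num_attention_heads": 28,
    "num_key_value_heads": 4,
    "num_hidden_layers": 28,
}

print(kvcache_bytes_per_token(qwen2))  # 57344 bytes ≈ 56 KB per token
```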

&lt;p&gt;&lt;strong&gt;Why does this matter?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;KVCache memory scales linearly with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Number of tokens&lt;/li&gt;
&lt;li&gt;Batch size&lt;/li&gt;
&lt;li&gt;Model depth&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Examples:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;100 tokens = ~5.6 MB&lt;/code&gt;&lt;br&gt;
&lt;code&gt;100 tokens × batch size 4 = ~22.4 MB&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This is per prompt, and accumulates across concurrent users and context retention.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Cases&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Capacity planning for inference servers&lt;/li&gt;
&lt;li&gt;Monitoring for unexpected saturation&lt;/li&gt;
&lt;li&gt;Comparing model footprints during evaluation&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  How to Check How Many Tokens Are Retained
&lt;/h2&gt;

&lt;p&gt;After understanding how much memory &lt;strong&gt;a single token consumes&lt;/strong&gt;, the next step is to determine &lt;strong&gt;how many tokens are actually retained in memory&lt;/strong&gt; for a given prompt or request.&lt;/p&gt;

&lt;p&gt;Since Mooncake stores attention state in the KVCache, the number of retained tokens directly affects total DRAM usage.&lt;/p&gt;
&lt;h3&gt;
  
  
  Counting Tokens Using the Model Tokenizer
&lt;/h3&gt;

&lt;p&gt;The most reliable way to check how many tokens are retained for a prompt is to tokenize the input using the &lt;strong&gt;exact tokenizer of the model.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(your_model_path)

token_ids = tokenizer(
    your_prompt,
    add_special_tokens=False
)["input_ids"]

log_entry = {
    "event": "token_retention",
    "retained_tokens": len(token_ids)
}

with open("token_usage.log", "a") as f:
    f.write(json.dumps(log_entry) + "\n")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What does this number represent?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;len(token_ids)&lt;/code&gt; is the number of tokens generated from the prompt&lt;/li&gt;
&lt;li&gt;Each of these tokens creates one KV entry per layer&lt;/li&gt;
&lt;li&gt;As long as the tokens are not evicted (e.g. by LRU), they are retained in KVCache&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token count = number of KVCache entries retained for that prompt&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Estimating Retained Memory&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once you know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Number of retained tokens&lt;/li&gt;
&lt;li&gt;Memory cost per token (e.g. ~56 KB from the previous section)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can estimate total KVCache usage:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Total_KV_Memory ≈ Retained_Tokens × Memory_per_Token&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Prompt length: 120 tokens&lt;/code&gt;&lt;br&gt;
&lt;code&gt;Memory per token: ~56 KB&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
&lt;code&gt;120 × 56 KB ≈ 6.7 MB&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why This Matters&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Long prompts increase retention linearly&lt;/li&gt;
&lt;li&gt;Multi-user or batched inference multiplies memory usage&lt;/li&gt;
&lt;li&gt;Retained tokens accumulate until eviction occurs (e.g. via LRU)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;This makes token counting a critical diagnostic step when investigating:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High DRAM usage&lt;/li&gt;
&lt;li&gt;Unexpected saturation&lt;/li&gt;
&lt;li&gt;Memory growth over time&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to detect unexpected saturation
&lt;/h2&gt;

&lt;p&gt;After calculating token cost, measuring DRAM usage, and tracking retained tokens, the final step is to determine whether the observed memory saturation is expected behavior or an indication of a problem.&lt;/p&gt;

&lt;p&gt;This validation is done by cross-checking theoretical memory estimates against actual runtime measurements from Mooncake.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Verify that:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observed memory usage ≈ Expected memory usage&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If they match → the system is behaving correctly.&lt;br&gt;
If they do not → further investigation is required.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Calculate Expected Memory Usage
&lt;/h3&gt;

&lt;p&gt;Using the previous sections:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Count retained tokens (via tokenizer-based token counting)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use the calculated cost per token&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;code&gt;Expected_KV_Memory = Retained_Tokens × Memory_per_Token&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retained tokens: 120&lt;/li&gt;
&lt;li&gt;Memory per token: 56 KB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;Expected_KV_Memory = 120 × 56 KB ≈ 6.7 MB&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This is the &lt;strong&gt;theoretical KVCache memory footprint&lt;/strong&gt; for the request.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Check Actual Memory Usage in Mooncake Logs
&lt;/h3&gt;

&lt;p&gt;Next, inspect the &lt;code&gt;mooncake_master&lt;/code&gt; logs and locate the memory usage reported for the same request.&lt;/p&gt;

&lt;p&gt;Look for log entries indicating storage or DRAM usage associated with the request.&lt;/p&gt;

&lt;p&gt;This represents the actual memory retained by Mooncake.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Compare Expected vs. Actual
&lt;/h3&gt;

&lt;p&gt;Now compare:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expected memory: from the token calculation&lt;/li&gt;
&lt;li&gt;Actual memory: from the Mooncake logs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Case 1: Values Match (Within Reasonable Margin)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expected ≈ Actual&lt;/li&gt;
&lt;li&gt;Minor differences due to alignment or metadata&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt;&lt;br&gt;
The saturation is &lt;strong&gt;expected&lt;/strong&gt;.&lt;br&gt;
The system is operating correctly.&lt;/p&gt;

&lt;p&gt;✔ KVCache behaves as designed&lt;br&gt;
✔ Token retention matches architecture&lt;br&gt;
✔ No memory leak detected&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Case 2: Values Do Not Match&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Actual memory is significantly higher than expected&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt;&lt;br&gt;
The saturation is unexpected and requires investigation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;At this point, further debugging is needed to identify the root cause.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Insight&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token count × cost per token is the ground truth baseline.&lt;/strong&gt;&lt;br&gt;
Any persistent deviation from this baseline indicates abnormal memory behavior.&lt;/p&gt;
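&lt;p&gt;The expected-vs-actual check can be sketched as a small helper. The 10% tolerance here is an assumption; tune it to the alignment and metadata overhead you observe in your deployment:&lt;/p&gt;

```python
def saturation_check(retained_tokens, bytes_per_token, observed_bytes, tolerance=0.10):
    """Compare observed DRAM usage against the theoretical KVCache baseline."""
    expected = retained_tokens * bytes_per_token
    deviation = (observed_bytes - expected) / expected
    return {
        "expected_bytes": expected,
        "deviation": deviation,               # fraction above/below the baseline
        "unexpected": deviation > tolerance,  # flag a significant overshoot
    }

# 120 retained tokens at 56 KB each, with ~7 MB observed: within margin
print(saturation_check(120, 57344, 7_000_000))
```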

&lt;p&gt;&lt;strong&gt;Final Outcome&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By following this process, you can clearly determine whether:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Memory growth is normal and expected, or&lt;/li&gt;
&lt;li&gt;The system is experiencing unexpected saturation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the numbers align, the system is working as intended. Success. ✔&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In this part, we explored a practical and systematic approach to &lt;strong&gt;analyzing memory behavior in Mooncake.&lt;/strong&gt;&lt;br&gt;
By combining architectural understanding with real runtime measurements, we showed how to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Measure actual DRAM usage from mooncake_master logs&lt;/li&gt;
&lt;li&gt;Calculate KVCache memory cost per token based on model configuration&lt;/li&gt;
&lt;li&gt;Determine how many tokens are retained for a given request&lt;/li&gt;
&lt;li&gt;Validate whether observed memory saturation is expected or abnormal&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key takeaway is that token &lt;strong&gt;count multiplied by cost per token provides a reliable&lt;/strong&gt; baseline for expected memory usage. Comparing this theoretical estimate against real storage metrics allows you to quickly distinguish between healthy system behavior and potential issues such as excessive retention or eviction problems.&lt;/p&gt;

&lt;p&gt;With this methodology, memory growth becomes explainable, predictable, and debuggable — enabling confident optimization and troubleshooting of large-scale inference workloads in Mooncake.&lt;/p&gt;

</description>
      <category>performance</category>
      <category>llm</category>
      <category>backend</category>
      <category>ai</category>
    </item>
    <item>
      <title>Deploying Mooncake for LLMs: Installation &amp; Optimization</title>
      <dc:creator>Sara_T</dc:creator>
      <pubDate>Thu, 11 Dec 2025 09:30:43 +0000</pubDate>
      <link>https://dev.to/__82e06472cd325ef306e6/getting-started-with-mooncake-installation-execution-troubleshooting-1hh9</link>
      <guid>https://dev.to/__82e06472cd325ef306e6/getting-started-with-mooncake-installation-execution-troubleshooting-1hh9</guid>
      <description>&lt;p&gt;Mooncake is a service-layer system designed to support LLM execution by separating the PREFILL phase (initial context construction) from the DECODE phase (token generation).&lt;br&gt;
It leverages CPU, SSD, and DRAM resources to efficiently manage the KVCache generated during prompt execution on vLLM, enabling reuse of previously computed data and reducing GPU workload during inference.&lt;/p&gt;

&lt;p&gt;In this post, we will explore what Mooncake is, its core components, its purpose, and how it integrates into the model execution pipeline.&lt;br&gt;
We will then review how to build and run the system, what dependencies are required, and the issues you may encounter — along with their solutions.&lt;/p&gt;
&lt;h1&gt;
  
  
  What is Mooncake?
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr4ejypvkv8e1y4nhlz0s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr4ejypvkv8e1y4nhlz0s.png" alt=" " width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  MOONCAKE — Clear Technical Overview
&lt;/h3&gt;

&lt;p&gt;Mooncake is a distributed, high-performance storage system designed specifically for managing KVCache used in Large Language Model (LLM) inference.&lt;br&gt;
Its main goal is to make LLM execution faster and more scalable by allowing multiple servers and GPUs to share precomputed context, instead of recalculating it each time.&lt;/p&gt;
&lt;h3&gt;
  
  
  What Problem Does Mooncake Solve?
&lt;/h3&gt;

&lt;p&gt;When an LLM processes a prompt, it generates a structure called KVCache (key–value cache).&lt;br&gt;
This cache stores the internal attention states of the model and is required for generating the next tokens.&lt;/p&gt;

&lt;p&gt;However:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;KVCache is large.&lt;/li&gt;
&lt;li&gt;Recomputing it for every request is expensive.&lt;/li&gt;
&lt;li&gt;Passing it between servers is normally slow.&lt;/li&gt;
&lt;li&gt;GPU memory is limited.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Mooncake provides an efficient way to store, share, and reuse this KVCache across machines.&lt;/strong&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Core Ideas (Simplified)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;1. Split between Prefill and Decode&lt;/strong&gt;&lt;br&gt;
Mooncake separates the LLM workflow into two phases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Prefill&lt;/strong&gt;&lt;br&gt;
  The model processes the prompt and generates KVCache.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Decode&lt;/strong&gt;&lt;br&gt;
  Token generation uses the already-computed KVCache.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With Mooncake, once Prefill is done, the KVCache can be saved and reused by any other server.&lt;br&gt;
This means Decode does &lt;strong&gt;not&lt;/strong&gt; need to recompute anything, reducing GPU load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Distributed Memory Store&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Mooncake includes a &lt;strong&gt;Store Cluster&lt;/strong&gt; made up of many worker nodes.&lt;br&gt;
Each worker contributes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;DRAM&lt;/strong&gt; (fast memory)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SSD&lt;/strong&gt; (persistent storage)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together they form a single, shared memory pool for holding KVCache objects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Fast Data Transfer (Transfer Engine)&lt;/strong&gt;&lt;br&gt;
Mooncake uses a high-speed communication engine supporting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RDMA&lt;/li&gt;
&lt;li&gt;NVMe-over-Fabric&lt;/li&gt;
&lt;li&gt;TCP&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This allows “zero-copy” or near-zero-copy transfer of KVCache segments between machines.&lt;br&gt;
The result is extremely high throughput with low latency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Replication and Resilience&lt;/strong&gt;&lt;br&gt;
Mooncake automatically replicates KVCache objects across multiple workers.&lt;br&gt;
This ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No “hotspots” (one overloaded server)&lt;/li&gt;
&lt;li&gt;Data availability even if a node fails&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As long as the system has an active master and a reachable client, Mooncake continues operating.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Smart Memory Management&lt;/strong&gt;&lt;br&gt;
The system includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LRU eviction (old items removed first)&lt;/li&gt;
&lt;li&gt;Soft pinning (prevent eviction of important cache objects)&lt;/li&gt;
&lt;li&gt;Persistence (optional SSD-based storage)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This keeps memory usage predictable and efficient.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Simple Developer API&lt;/strong&gt;&lt;br&gt;
Clients can communicate with Mooncake using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;C++ API&lt;/li&gt;
&lt;li&gt;Python API&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The client can run as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;an embedded library inside an inference service, or&lt;/li&gt;
&lt;li&gt;a standalone process.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  System Architecture (Simplified)
&lt;/h3&gt;

&lt;p&gt;1. &lt;strong&gt;Inference Cluster&lt;/strong&gt;&lt;br&gt;
Runs LLM engines (e.g., vLLM). Creates KVCache.&lt;/p&gt;

&lt;p&gt;2. &lt;strong&gt;Transfer Engine&lt;/strong&gt;&lt;br&gt;
Moves KVCache between inference nodes and Mooncake quickly.&lt;/p&gt;

&lt;p&gt;3. &lt;strong&gt;Mooncake Store Cluster&lt;/strong&gt;&lt;br&gt;
Distributed memory pool storing KVCache.&lt;/p&gt;

&lt;p&gt;4. &lt;strong&gt;Metadata Server (e.g., etcd/Redis)&lt;/strong&gt;&lt;br&gt;
Tracks where each KVCache object is stored and manages replicas.&lt;/p&gt;
&lt;h3&gt;
  
  
  How It Works (Step by Step)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Prefill&lt;/strong&gt;&lt;br&gt;
An LLM server processes the prompt → produces KVCache → saves it to Mooncake.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Share&lt;/strong&gt;&lt;br&gt;
Another server retrieves the same KVCache from Mooncake.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Decode&lt;/strong&gt;&lt;br&gt;
The second server generates tokens using the retrieved KVCache instead of recomputing it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Eviction/Persistence&lt;/strong&gt;&lt;br&gt;
Mooncake cleans up old objects or saves them to SSD based on policy.&lt;/p&gt;
&lt;h3&gt;
  
  
  Key Advantages
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Higher throughput for LLM inference&lt;/li&gt;
&lt;li&gt;Lower GPU memory usage since KVCache can reside in DRAM/SSD&lt;/li&gt;
&lt;li&gt;Easy scaling by adding more worker nodes&lt;/li&gt;
&lt;li&gt;Fault tolerance through replication&lt;/li&gt;
&lt;li&gt;Optimized for long-context and multi-server LLM workloads&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;
  
  
  How to Build and Run Mooncake?
&lt;/h1&gt;
&lt;h4&gt;
  
  
  install uv
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;
  
  
  use specific python
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;uv venv --python 3.10 --seed
source .venv/bin/activate 
# (run "deactivate" to exit)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;
  
  
  install CUDA packages matching the CUDA version on your server (example below: CUDA 12.9)
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;uv pip install quart httpx matplotlib aiohttp pandas datasets modelscope setuptools openpyxl pynvml xlsxwriter
uv pip install --index-url https://download.pytorch.org/whl/cu129 torch torchvision torchaudio
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;
  
  
  install mooncake with uv
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;uv pip install mooncake-transfer-engine
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;
  
  
  install vllm with specific version
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone -b v0.8.5 https://github.com/vllm-project/vllm.git --recursive
cd vllm
python use_existing_torch.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;
  
  
  install requirements
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;uv pip install -r requirements/build.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;
  
  
  Update these environment variables in your shell configuration file (e.g. ~/.bashrc):
&lt;/h4&gt;

&lt;p&gt;(Make sure to update all paths and version numbers to match the CUDA installation and directory structure on your server).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export LD_LIBRARY_PATH=/usr/local/cuda-12.9/lib64:$LD_LIBRARY_PATH
export CUDA_HOME="/usr/local/cuda-12.9"
export PATH="$CUDA_HOME/bin${PATH:+:${PATH}}"
export CUDACXX="$CUDA_HOME/bin/nvcc"
export CMAKE_CUDA_COMPILER="$CUDA_HOME/bin/nvcc"
export TORCH_CUDA_ARCH_LIST="8.9"
export MAX_JOBS=128
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  compile vllm
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;uv pip install --no-build-isolation -e .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Write a mooncake.json file, replacing the IP addresses with your own.
&lt;/h4&gt;

&lt;p&gt;Make sure to update the ROOT_FS_DIR path according to your server’s directory structure.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "local_hostname": "10.1.222.133",          
  "metadata_server": "etcd://10.1.222.133:2379", 
  "global_segment_size": 274877906944,    
  "local_buffer_size": 274877906944,         
  "protocol": "tcp",                       
  "device_name": "",                         
  "master_server_address": "10.1.222.133:10001",  
  "cluster_id": "mooncake_cluster",     
  "root_fs_dir": "/mnt/mooncake_data" 
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
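&lt;p&gt;A quick way to catch typos before launching the master is to parse the file and verify the fields you rely on. This is a sketch; the &lt;code&gt;REQUIRED&lt;/code&gt; set simply mirrors the example above and is not an official schema. (For reference, 274877906944 bytes is 256 GiB.)&lt;/p&gt;

```python
import json

# Fields used in the example config above (an assumption, not an official schema)
REQUIRED = {
    "local_hostname", "metadata_server", "global_segment_size",
    "local_buffer_size", "protocol", "master_server_address",
}

def check_config(path):
    """Parse a mooncake.json file and return (config, missing required fields)."""
    with open(path) as f:
        cfg = json.load(f)
    return cfg, REQUIRED - cfg.keys()
```

&lt;p&gt;Call &lt;code&gt;check_config("mooncake.json")&lt;/code&gt; before starting the services; an empty second element means all expected fields are present.&lt;/p&gt;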



&lt;h4&gt;
  
  
  download model:
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git lfs install
git clone "https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  install etcd and check whether it is already running (kill the process if it is)
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt install etcd-server
sudo lsof -i -P -n | grep etcd
sudo kill &amp;lt;PID&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Start the venv if it isn't active.
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;source .venv/bin/activate 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  check if ports are free:
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lsof -t -i:8000
ps -ef | grep 'vllm.entrypoints.openai.api_server' | grep "port 8100" | awk -F ' ' '{print $2}'
ps -ef | grep 'vllm.entrypoints.openai.api_server' | grep "port 8200" | awk -F ' ' '{print $2}'
sudo lsof -i -P -n | grep mooncake_
sudo lsof -i -P -n | grep etcd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  show all occupied ports:
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo lsof -i -P -n 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  kill the ports if they are occupied:
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo kill &amp;lt;PID1&amp;gt; &amp;lt;PID2&amp;gt; ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  run etcd:
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nohup etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://localhost:2379 &amp;gt; etcd_output.log 2&amp;gt;&amp;amp;1 &amp;amp;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  run mooncake master:
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nohup mooncake_master \
  --port 10001 \
  --root_fs_dir /mnt/mooncake_data \
  --cluster_id mooncake_cluster \
  &amp;gt; logs/master.txt 2&amp;gt;&amp;amp;1 &amp;amp;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  run prefill:
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export VLLM_WORKER_MULTIPROC_METHOD=spawn
export VLLM_USE_V1=0
CUDA_VISIBLE_DEVICES=0 \
MOONCAKE_CONFIG_PATH=mooncake.json \
python3 -m vllm.entrypoints.openai.api_server \
  --model /home/vllm/Qwen2.5-7B-Instruct-GPTQ-Int4 \
  --port 8100 \
  --max-model-len 10000 \
  --gpu-memory-utilization 0.4 \
  --kv-transfer-config '{"kv_connector":"MooncakeStoreConnector","kv_role":"kv_producer"}' \
  &amp;gt; logs/prefill-0.txt 2&amp;gt;&amp;amp;1 &amp;amp;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  run decode:
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;UDA_VISIBLE_DEVICES=0 \
MOONCAKE_CONFIG_PATH=mooncake.json \
python3 -m vllm.entrypoints.openai.api_server \
  --model /home/vllm/Qwen2.5-7B-Instruct-GPTQ-Int4 \
  --port 8200 \
  --max-model-len 10000 \
  --gpu-memory-utilization 0.4 \
  --kv-transfer-config '{"kv_connector":"MooncakeStoreConnector","kv_role":"kv_consumer"}' \
  &amp;gt; logs/decode-0.txt 2&amp;gt;&amp;amp;1 &amp;amp;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  run proxy (replace --model with the path to your model):
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python3 ../proxy_demo.py \
  --model /home/vllm/Qwen2.5-7B-Instruct-GPTQ-Int4 \
  --prefill localhost:8100 \
  --decode localhost:8200 \
  --port 8000 \
  2&amp;gt;&amp;amp;1 | tee logs/proxy-1-1.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Errors and malfunctions that may arise while building and running Mooncake
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ISSUE:&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;ValueError: No available memory for the cache blocks. Try increasing &lt;code&gt;gpu_memory_utilization&lt;/code&gt; when initializing the engine.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SOLUTION:&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Increase the &lt;code&gt;--gpu-memory-utilization&lt;/code&gt; value: when it is too low, no memory is left to allocate for the KV cache.&lt;br&gt;
However, first make sure the GPU is actually free.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ISSUE:&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You run into an &lt;code&gt;AssertionError (issubclass(connector_cls, KVConnectorBase_V1))&lt;/code&gt; when starting the prefill process with MooncakeStoreConnector.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SOLUTION:&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Export these environment variables before running the command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export VLLM_WORKER_MULTIPROC_METHOD=spawn
export VLLM_USE_V1=0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ISSUE:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;File "/home/project/.venv/lib/python3.12/site-packages/torchvision/datasets/__init__.py", line 1, in &amp;lt;module&amp;gt;
    from ._optical_flow import FlyingChairs, FlyingThings3D, HD1K, KittiFlow, Sintel
  File "/home/project/.venv/lib/python3.12/site-packages/torchvision/datasets/_optical_flow.py", line 14, in &amp;lt;module&amp;gt;
    from .utils import _read_pfm, verify_str_arg
  File "/home/project/.venv/lib/python3.12/site-packages/torchvision/datasets/utils.py", line 4, in &amp;lt;module&amp;gt;
    import lzma
  File "/usr/local/lib/python3.12/lzma.py", line 27, in &amp;lt;module&amp;gt;
    from _lzma import *
ModuleNotFoundError: No module named '_lzma'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SOLUTION:&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This error occurs when another Python version has been installed on top of the server's base Python.&lt;br&gt;
In my case, someone had installed Python 3.12 outside of uv, which broke every 3.12 virtual environment: instead of the system libraries built for the base Python 3.10, the environments tried to use Python 3.12 libraries, and on Ubuntu 22 there is no compiled lzma library for Python 3.12.&lt;br&gt;&lt;br&gt;
The clean fix is to reinstall Python on the server, but that is a time-consuming process.&lt;br&gt;
Therefore, if the base Python on your server is not 3.12, you can try creating the virtual environment with the version that is actually installed, for example:&lt;br&gt;
&lt;code&gt;uv venv --python 3.10 --seed&lt;br&gt;
&lt;/code&gt;instead of:&lt;br&gt;
&lt;code&gt;uv venv --python 3.12 --seed&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
and work around the problem that way.&lt;br&gt;
On my server, this resolved the issue.&lt;/p&gt;
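&lt;p&gt;Before rebuilding anything, you can quickly check whether a given interpreter is affected: the traceback ends at the compiled &lt;code&gt;_lzma&lt;/code&gt; C module, so testing for its presence tells you whether that Python was built against liblzma (a small illustrative check, not part of Mooncake):&lt;/p&gt;

```python
import importlib.util

def lzma_available():
    """True if this interpreter was built with liblzma support (the '_lzma' C module exists)."""
    return importlib.util.find_spec("_lzma") is not None

if __name__ == "__main__":
    if lzma_available():
        print("lzma support: OK")
    else:
        print("lzma support: MISSING - use a different Python version or rebuild")
```

&lt;p&gt;Run this with each candidate interpreter (e.g. &lt;code&gt;python3.10&lt;/code&gt; vs. &lt;code&gt;python3.12&lt;/code&gt;) to decide which version to pass to &lt;code&gt;uv venv&lt;/code&gt;.&lt;/p&gt;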

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ISSUE:&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You get import errors when loading the required Python packages.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SOLUTION:&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;CUDA may not be installed correctly on your system. Install the appropriate CUDA version, then download the required packages per the installation instructions above, making sure they match the CUDA version you installed.&lt;/p&gt;
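&lt;p&gt;One hedged way to narrow this down, assuming the failing imports are torch-related, is to print the interpreter's view of the torch install and its CUDA build; &lt;code&gt;cuda_diagnostics&lt;/code&gt; is an illustrative helper, not an official API:&lt;/p&gt;

```python
import importlib.util

def cuda_diagnostics():
    """Collect basic facts about the torch install and its CUDA build."""
    info = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if info["torch_installed"]:
        import torch
        info["torch_version"] = torch.__version__
        info["cuda_build"] = torch.version.cuda      # None on CPU-only builds
        info["cuda_available"] = torch.cuda.is_available()
    return info

if __name__ == "__main__":
    for key, value in cuda_diagnostics().items():
        print(f"{key}: {value}")
```

&lt;p&gt;If &lt;code&gt;cuda_build&lt;/code&gt; does not match the CUDA toolkit on the host (compare with &lt;code&gt;nvcc --version&lt;/code&gt;), reinstall the packages for the matching CUDA version.&lt;/p&gt;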

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ISSUE:&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You receive an error when running both PREFILL and DECODE in two separate processes.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;POSSIBLE SOLUTION:&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You may not have enough GPU resources on the server. With only a single GPU, it is not possible to run both PREFILL and DECODE on the same device, so run only PREFILL and skip PROXY and DECODE.&lt;br&gt;
Alternatively, use another server that has multiple GPUs.&lt;/p&gt;
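&lt;p&gt;To confirm how many GPUs the process can actually see before deciding what to run, a quick check (illustrative, assuming torch; &lt;code&gt;nvidia-smi&lt;/code&gt; on the host gives the same answer):&lt;/p&gt;

```python
import importlib.util

def visible_gpu_count():
    """Number of CUDA devices visible to this process (0 if torch or CUDA is unusable)."""
    if importlib.util.find_spec("torch") is None:
        return 0
    import torch
    if not torch.cuda.is_available():
        return 0
    return torch.cuda.device_count()

if __name__ == "__main__":
    n = visible_gpu_count()
    print(f"visible GPUs: {n}")
    if n >= 2:
        print("enough devices to run PREFILL and DECODE separately")
    else:
        print("run PREFILL only, or move to a multi-GPU server")
```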
&lt;h2&gt;
  
  
  Once everything is running as required, all that remains is to send requests and view the results:
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Simple request structure:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl http://127.0.0.1:8100/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/home/vllm/Qwen2.5-7B-Instruct-GPTQ-Int4",
        "prompt": "what is Mooncake?",
        "max_tokens": 30
      }'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;For a more complex request structure, you can use a Python script.&lt;/strong&gt;&lt;/p&gt;
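&lt;p&gt;As a minimal sketch of such a script (stdlib only; the model path, port, and prompt mirror the curl example above and should be adjusted to your setup):&lt;/p&gt;

```python
import json
import urllib.request

def build_completion_payload(model, prompt, max_tokens=30):
    """Assemble the JSON body for a /v1/completions request."""
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}

def send_completion(base_url, payload, timeout=60):
    """POST the payload to the OpenAI-compatible completions endpoint."""
    req = urllib.request.Request(
        base_url + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Example (requires a running server):
#   payload = build_completion_payload("/home/vllm/Qwen2.5-7B-Instruct-GPTQ-Int4",
#                                      "what is Mooncake?")
#   result = send_completion("http://127.0.0.1:8100", payload)
#   print(result["choices"][0]["text"])
```

&lt;p&gt;From here you can loop over prompts, vary &lt;code&gt;max_tokens&lt;/code&gt;, or time the responses to compare prefill and decode behavior.&lt;/p&gt;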

&lt;h2&gt;
  
  
  Next: Advanced Memory Analysis
&lt;/h2&gt;

&lt;p&gt;In Part 2, we move from architecture and setup into practical memory analysis.&lt;br&gt;
We examine how Mooncake consumes DRAM, how KVCache memory scales with token count, how to calculate per-token cost, and how to detect expected versus abnormal memory saturation at runtime.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Continue to &lt;a href="https://dev.to/__82e06472cd325ef306e6/mooncake-memory-deep-dive-kvcache-token-cost-dram-usage-and-saturation-analysis-1n88"&gt;Part 2&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  GOOD LUCK!
&lt;/h1&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>architecture</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
