This is Part 2 of our series on Mooncake. For an introduction and setup instructions, please refer to Part 1.
In this part, we take a deep dive into advanced memory analysis, including:
- How to measure DRAM consumption
- How to calculate token cost
- How to check how many tokens are retained
- How to detect unexpected saturation
Whether you're optimizing performance or tracking resource efficiency, this guide gives you the tools to move forward with confidence.
How to Measure DRAM Consumption
To accurately measure how much DRAM is consumed by Mooncake, the system must be up and running with real requests being processed.
Mooncake allocates DRAM dynamically as prompts are received and KVCache entries are created.
Step-by-Step Method
- Run the system with logging
Start the mooncake_master process with output redirected to a log file:
nohup mooncake_master \
    --port 10001 \
    --root_fs_dir /mnt/mooncake_data \
    --cluster_id mooncake_cluster \
    > logs/master.txt 2>&1 &
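To confirm the process is up and writing to the log, standard shell tools are enough:

# Verify the master process is running
pgrep -af mooncake_master

# Watch the log in real time while requests arrive
tail -f logs/master.txt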
- Send sample requests
Start with lightweight prompts to establish a low baseline of memory usage.
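How you send requests depends on your serving frontend. As a minimal sketch, assuming Mooncake backs a vLLM server exposing an OpenAI-compatible endpoint on port 8000 (the port, endpoint, and model name here are illustrative assumptions, not Mooncake defaults):

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen/Qwen2-7B-Instruct", "prompt": "Hello, Mooncake!", "max_tokens": 16}'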
- Review the logs
Open the logs/master.txt file and look for DRAM-related metrics.
The log includes:
- Total DRAM usage
- Number of KVCache objects stored
- Internal write operations
- Count of requests per API type
For this section, we’ll focus only on DRAM usage.
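The exact wording of these log lines varies between Mooncake versions, so a case-insensitive keyword search is a practical first pass (the keywords below are guesses to adapt to your build):

grep -inE "dram|memory|kvcache" logs/master.txt | tail -n 20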
Example: Log Output from mooncake_master
Below is a sample log excerpt from the mooncake_master process, captured during runtime:

[log excerpt: reports total DRAM usage together with the KVCache object count, write operations, and per-API request counts]
How to Calculate Token Cost (KVCache Memory)
Every token processed by a large language model consumes memory — primarily in the KVCache, which stores attention Key/Value tensors per token, per layer, and per attention head.
Understanding how much memory each token uses is essential for capacity planning, optimization, and preventing resource saturation in inference systems.
Goal
Calculate how much memory a single token consumes in the model’s internal KVCache, based on architectural parameters.
Formula
To estimate token memory cost, use:
head_dim = hidden_size / num_attention_heads
Memory_per_token = num_kv_heads × head_dim × 2 (K+V) × bytes_per_value × num_layers
Example: Qwen2 Configuration
From the model's config:
{
  "hidden_size": 3584,
  "num_attention_heads": 28,
  "num_key_value_heads": 4,
  "num_hidden_layers": 28,
  "torch_dtype": "float16"
}
Step-by-step:
Calculate head_dim:
head_dim = 3584 / 28 = 128
Plug into the formula (bytes_per_value = 2 for float16):
Memory_per_token = 4 × 128 × 2 × 2 × 28 = 57344 bytes ≈ 56 KB
Result
Each token consumes ~56 KB of KVCache memory (in float16).
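This calculation is easy to automate. Here is a minimal Python sketch that reads a Hugging Face config.json and applies the formula above (the dtype-to-bytes mapping covers only the common cases and is our own simplification):

import json

DTYPE_BYTES = {"float16": 2, "bfloat16": 2, "float32": 4}

def kv_bytes_per_token(config_path):
    """Estimate KVCache bytes per token from a model's config.json."""
    with open(config_path) as f:
        cfg = json.load(f)
    head_dim = cfg["hidden_size"] // cfg["num_attention_heads"]
    bytes_per_value = DTYPE_BYTES[cfg["torch_dtype"]]
    # K and V (factor of 2) per KV head, per layer
    return (cfg["num_key_value_heads"] * head_dim * 2
            * bytes_per_value * cfg["num_hidden_layers"])

# With the Qwen2 config above:
# kv_bytes_per_token("config.json") -> 57344 bytes (~56 KB)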
Why does this matter?
KVCache memory scales linearly with:
- Number of tokens
- Batch size
- Model depth
Examples:
100 tokens = ~5.6 MB
100 tokens × batch size 4 = ~22.4 MB
This is per prompt, and accumulates across concurrent users and context retention.
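These figures follow directly from the per-token cost, as a quick sanity check shows:

tokens = 100
kb_per_token = 56
print(tokens * kb_per_token)        # 5600 KB  ≈ 5.6 MB
print(tokens * kb_per_token * 4)    # 22400 KB ≈ 22.4 MB (batch size 4)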
Use Cases
- Capacity planning for inference servers
- Monitoring for unexpected saturation
- Comparing model footprints during evaluation
How to Check How Many Tokens Are Retained
After understanding how much memory a single token consumes, the next step is to determine how many tokens are actually retained in memory for a given prompt or request.
Since Mooncake stores attention state in the KVCache, the number of retained tokens directly affects total DRAM usage.
Counting Tokens Using the Model Tokenizer
The most reliable way to check how many tokens are retained for a prompt is to tokenize the input using the exact tokenizer of the model.
import json
from transformers import AutoTokenizer

# Load the exact tokenizer used by the model
tokenizer = AutoTokenizer.from_pretrained(your_model_path)

# Tokenize the prompt; skip special tokens so the count matches the raw prompt
token_ids = tokenizer(
    your_prompt,
    add_special_tokens=False
)["input_ids"]

# Record the retained-token count for later analysis
log_entry = {
    "event": "token_retention",
    "retained_tokens": len(token_ids)
}
with open("token_usage.log", "a") as f:
    f.write(json.dumps(log_entry) + "\n")
What does this number represent?
- len(token_ids) is the number of tokens generated from the prompt
- Each of these tokens creates one KV entry per layer
- As long as the tokens are not evicted (e.g. by LRU), they are retained in the KVCache
In other words:
Token count = number of KVCache entries retained for that prompt
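If you log one entry per request as shown above, a small follow-up script can aggregate retention across requests (a sketch assuming the token_usage.log format from the snippet):

import json

total = 0
with open("token_usage.log") as f:
    for line in f:
        total += json.loads(line)["retained_tokens"]
print(f"Total retained tokens across logged requests: {total}")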
Estimating Retained Memory
Once you know:
- Number of retained tokens
- Memory cost per token (e.g. ~56 KB from the previous section)
You can estimate total KVCache usage:
Total_KV_Memory ≈ Retained_Tokens × Memory_per_Token
Example:
Prompt length: 120 tokens
Memory per token: ~56 KB
120 × 56 KB ≈ 6.7 MB
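In code, combining the tokenizer count with the per-token cost (values from the example above):

retained_tokens = 120      # from the tokenizer
kb_per_token = 56          # from the previous section
total_kb = retained_tokens * kb_per_token
print(f"~{total_kb / 1000:.1f} MB")   # ~6.7 MB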
Why This Matters
- Long prompts increase retention linearly
- Multi-user or batched inference multiplies memory usage
- Retained tokens accumulate until eviction occurs (e.g. via LRU)
This makes token counting a critical diagnostic step when investigating:
- High DRAM usage
- Unexpected saturation
- Memory growth over time
How to Detect Unexpected Saturation
After calculating token cost, measuring DRAM usage, and tracking retained tokens, the final step is to determine whether the observed memory saturation is expected behavior or an indication of a problem.
This validation is done by cross-checking theoretical memory estimates against actual runtime measurements from Mooncake.
Goal
Verify that:
Observed memory usage ≈ Expected memory usage
If they match → the system is behaving correctly.
If they do not → further investigation is required.
Step 1: Calculate Expected Memory Usage
Using the previous sections:
- Count retained tokens (via tokenizer-based token counting)
- Use the calculated cost per token
Expected_KV_Memory = Retained_Tokens × Memory_per_Token
Example:
- Retained tokens: 120
- Memory per token: 56 KB
Expected_KV_Memory = 120 × 56 KB ≈ 6.7 MB
This is the theoretical KVCache memory footprint for the request.
Step 2: Check Actual Memory Usage in Mooncake Logs
Next, inspect the mooncake_master logs and locate the memory usage reported for the same request.
Look for log entries indicating storage or DRAM usage associated with the request.
This represents the actual memory retained by Mooncake.
Step 3: Compare Expected vs. Actual
Now compare:
- Expected memory: from the token calculation
- Actual memory: from the Mooncake logs
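A minimal sketch of this check, assuming you have already extracted both numbers from your logs (the 20% tolerance is an arbitrary starting point, not a Mooncake-defined threshold):

def saturation_is_expected(expected_bytes, actual_bytes, tolerance=0.20):
    """Return True if actual usage is within tolerance of the estimate."""
    return actual_bytes <= expected_bytes * (1 + tolerance)

# Example: ~6.7 MB expected vs. 6.9 MB observed -> within margin
print(saturation_is_expected(6.7e6, 6.9e6))   # True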
Case 1: Values Match (Within Reasonable Margin)
- Expected ≈ Actual
- Minor differences due to alignment or metadata
Conclusion:
The saturation is expected.
The system is operating correctly.
✔ KVCache behaves as designed
✔ Token retention matches architecture
✔ No memory leak detected
Case 2: Values Do Not Match
- Actual memory is significantly higher than expected
Conclusion:
The saturation is unexpected and requires investigation.
At this point, further debugging is needed to identify the root cause.
Key Insight
Token count × cost per token is the ground truth baseline.
Any persistent deviation from this baseline indicates abnormal memory behavior.
Final Outcome
By following this process, you can clearly determine whether:
- Memory growth is normal and expected, or
- The system is experiencing unexpected saturation
If the numbers align, the system is working as intended. Success. ✔
Summary
In this part, we explored a practical and systematic approach to analyzing memory behavior in Mooncake.
By combining architectural understanding with real runtime measurements, we showed how to:
- Measure actual DRAM usage from mooncake_master logs
- Calculate KVCache memory cost per token based on model configuration
- Determine how many tokens are retained for a given request
- Validate whether observed memory saturation is expected or abnormal
The key takeaway is that token count multiplied by cost per token provides a reliable baseline for expected memory usage. Comparing this theoretical estimate against real storage metrics allows you to quickly distinguish between healthy system behavior and potential issues such as excessive retention or eviction problems.
With this methodology, memory growth becomes explainable, predictable, and debuggable — enabling confident optimization and troubleshooting of large-scale inference workloads in Mooncake.
