Sara_T

Posted on • Originally published at dev.to

Mooncake Memory Deep Dive: KVCache, Token Cost, DRAM Usage, and Saturation Analysis

This is Part 2 of our explanation about Mooncake.
To learn more and get started with Mooncake, please refer to Part 1.
In this part, we take a deep dive into advanced memory analysis, including:

  • How to measure DRAM consumption
  • How to calculate token cost
  • How to check how many tokens are retained
  • How to detect unexpected saturation

Whether you're optimizing performance or tracking resource efficiency – this guide gives you the tools to move forward with confidence.

How to Measure DRAM Consumption

To accurately measure how much DRAM is consumed by Mooncake, the system must be up and running with real requests being processed.

Mooncake allocates DRAM dynamically as prompts are received and KVCache entries are created.

Step-by-Step Method

  1. Run the system with logging

Start the mooncake_master process with output redirected to a log file:

nohup mooncake_master \
  --port 10001 \
  --root_fs_dir /mnt/mooncake_data \
  --cluster_id mooncake_cluster \
  > logs/master.txt 2>&1 &

  2. Send sample requests

Start with lightweight prompts to establish a low baseline of memory usage.

  3. Review the logs

Open the logs/master.txt file and look for DRAM-related metrics.

The log includes:

  • Total DRAM usage
  • Number of KVCache objects stored
  • Internal write operations
  • Count of requests per API type

For this section, we’ll focus only on DRAM usage.
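If you want to automate this step, a minimal Python sketch like the one below can filter the master log for DRAM-related lines. Note that the exact log format varies between Mooncake versions, so the case-insensitive "dram" keyword match here is an assumption you may need to adapt:

```python
# Minimal sketch: pull DRAM-related lines out of the master log.
# Assumption: relevant lines mention "dram" (case-insensitive);
# adjust the keyword to match your Mooncake version's log format.

def extract_dram_lines(log_path):
    dram_lines = []
    with open(log_path, encoding="utf-8", errors="replace") as f:
        for line in f:
            if "dram" in line.lower():
                dram_lines.append(line.rstrip("\n"))
    return dram_lines
```

Run it against logs/master.txt after sending a few sample requests, and compare the reported usage before and after each batch of prompts.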

Example: Log Output from mooncake_master

(Screenshot: a sample log excerpt from the mooncake_master process, captured during runtime.)

How to Calculate Token Cost (KVCache Memory)

Every token processed by a large language model consumes memory — primarily in the KVCache, which stores attention Key/Value tensors per token, per layer, and per attention head.

Understanding how much memory each token uses is essential for capacity planning, optimization, and preventing resource saturation in inference systems.

Goal

Calculate how much memory a single token consumes in the model’s internal KVCache, based on architectural parameters.

Formula

To estimate token memory cost, use:

head_dim = hidden_size / num_attention_heads
Memory_per_token = num_kv_heads × head_dim × 2 (K+V) × bytes_per_value × num_layers

Example: Qwen2 Configuration

From the model's config:

{
"hidden_size": 3584,
"num_attention_heads": 28,
"num_key_value_heads": 4,
"num_hidden_layers": 28,
"torch_dtype": "float16"
}

Step-by-step:

Calculate head_dim:

head_dim = 3584 / 28 = 128

Plug into the formula:

Memory_per_token = 4 × 128 × 2 × 2 × 28 = 57344 bytes ≈ 56 KB

Result

Each token consumes ~56 KB of KVCache memory (in float16).
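The calculation above can be wrapped in a small helper that reads the architectural parameters straight from a Hugging Face style config dict. This is a sketch: the dtype-to-bytes map is a simplifying assumption covering common dtypes only.

```python
# Sketch: per-token KVCache cost from a Hugging Face style config dict.
# Formula: num_kv_heads * head_dim * 2 (K+V) * bytes_per_value * num_layers.

DTYPE_BYTES = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}  # assumption

def kv_bytes_per_token(config):
    head_dim = config["hidden_size"] // config["num_attention_heads"]
    return (config["num_key_value_heads"]
            * head_dim
            * 2                                   # K and V tensors
            * DTYPE_BYTES[config["torch_dtype"]]
            * config["num_hidden_layers"])

# The Qwen2 configuration quoted above:
qwen2 = {
    "hidden_size": 3584,
    "num_attention_heads": 28,
    "num_key_value_heads": 4,
    "num_hidden_layers": 28,
    "torch_dtype": "float16",
}
print(kv_bytes_per_token(qwen2))  # 57344 bytes, i.e. ~56 KB
```

Swapping in another model's config.json lets you compare footprints during evaluation without touching the model itself.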

Why does this matter?

KVCache memory scales linearly with:

  • Number of tokens
  • Batch size
  • Model depth

Examples:

100 tokens = ~5.6 MB
100 tokens × batch size 4 = ~22.4 MB

This is per prompt, and accumulates across concurrent users and context retention.

Use Cases

  • Capacity planning for inference servers
  • Monitoring for unexpected saturation
  • Comparing model footprints during evaluation

How to Check How Many Tokens Are Retained

After understanding how much memory a single token consumes, the next step is to determine how many tokens are actually retained in memory for a given prompt or request.

Since Mooncake stores attention state in the KVCache, the number of retained tokens directly affects total DRAM usage.

Counting Tokens Using the Model Tokenizer

The most reliable way to check how many tokens are retained for a prompt is to tokenize the input using the exact tokenizer of the model.

import json

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(your_model_path)

token_ids = tokenizer(
    your_prompt,
    add_special_tokens=False
)["input_ids"]

log_entry = {
    "event": "token_retention",
    "retained_tokens": len(token_ids)
}

with open("token_usage.log", "a") as f:
    f.write(json.dumps(log_entry) + "\n")

What does this number represent?

  • len(token_ids) is the number of tokens generated from the prompt
  • Each of these tokens creates one KV entry per layer
  • As long as the tokens are not evicted (e.g. by LRU), they are retained in KVCache

In other words:

Token count = number of KVCache entries retained for that prompt

Estimating Retained Memory

Once you know:

  • Number of retained tokens
  • Memory cost per token (e.g. ~56 KB from the previous section)

You can estimate total KVCache usage:

Total_KV_Memory ≈ Retained_Tokens × Memory_per_Token

Example:

Prompt length: 120 tokens
Memory per token: ~56 KB

120 × 56 KB ≈ 6.7 MB
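As a sketch, the two numbers can be combined in one helper. The token count would come from the tokenizer snippet above; the default per-token cost of 57344 bytes is the Qwen2/float16 figure derived earlier and should be replaced for other models:

```python
# Sketch: total KVCache estimate for one prompt, given the retained
# token count and the per-token cost (57344 bytes = Qwen2, float16).

def estimate_kv_bytes(retained_tokens, bytes_per_token=57344):
    return retained_tokens * bytes_per_token

total = estimate_kv_bytes(120)
print(total, "bytes", f"(~{total / 2**20:.1f} MiB)")  # 6881280 bytes (~6.6 MiB)
```

For batched inference, multiply the result by the batch size, since each sequence in the batch retains its own KV entries.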

Why This Matters

  • Long prompts increase retention linearly
  • Multi-user or batched inference multiplies memory usage
  • Retained tokens accumulate until eviction occurs (e.g. via LRU)

This makes token counting a critical diagnostic step when investigating:

  • High DRAM usage
  • Unexpected saturation
  • Memory growth over time

How to Detect Unexpected Saturation

After calculating token cost, measuring DRAM usage, and tracking retained tokens, the final step is to determine whether the observed memory saturation is expected behavior or an indication of a problem.

This validation is done by cross-checking theoretical memory estimates against actual runtime measurements from Mooncake.

Goal

Verify that:

Observed memory usage ≈ Expected memory usage

If they match → the system is behaving correctly.
If they do not → further investigation is required.

Step 1: Calculate Expected Memory Usage

Using the previous sections:

  1. Count retained tokens (via tokenizer-based token counting)

  2. Use the calculated cost per token

Expected_KV_Memory = Retained_Tokens × Memory_per_Token

Example:

  • Retained tokens: 120
  • Memory per token: 56 KB

Expected_KV_Memory = 120 × 56 KB ≈ 6.7 MB

This is the theoretical KVCache memory footprint for the request.

Step 2: Check Actual Memory Usage in Mooncake Logs

Next, inspect the mooncake_master logs and locate the memory usage reported for the same request.

Look for log entries indicating storage or DRAM usage associated with the request.

This represents the actual memory retained by Mooncake.

Step 3: Compare Expected vs. Actual

Now compare:

  • Expected memory: from the token calculation
  • Actual memory: from the Mooncake logs
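This comparison is easy to automate. In the sketch below, the 20% tolerance is an assumption, not a Mooncake default; tune it to the metadata and alignment overhead you actually observe on your deployment:

```python
# Sketch: flag unexpected saturation by comparing expected vs. actual
# memory for a request. margin=0.20 is an assumed tolerance for
# metadata/alignment overhead; adjust it to your deployment.

def is_saturation_expected(expected_bytes, actual_bytes, margin=0.20):
    if expected_bytes == 0:
        return actual_bytes == 0
    deviation = (actual_bytes - expected_bytes) / expected_bytes
    return deviation <= margin  # actual far above expected => suspicious

print(is_saturation_expected(6_881_280, 7_100_000))   # True: within margin
print(is_saturation_expected(6_881_280, 20_000_000))  # False: investigate
```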

Case 1: Values Match (Within Reasonable Margin)

  • Expected ≈ Actual
  • Minor differences due to alignment or metadata

Conclusion:
The saturation is expected.
The system is operating correctly.

✔ KVCache behaves as designed
✔ Token retention matches architecture
✔ No memory leak detected

Case 2: Values Do Not Match

Actual memory is significantly higher than expected

Conclusion:
The saturation is unexpected and requires investigation.

At this point, further debugging is needed to identify the root cause.

Key Insight

Token count × cost per token is the ground truth baseline.
Any persistent deviation from this baseline indicates abnormal memory behavior.

Final Outcome

By following this process, you can clearly determine whether:

  • Memory growth is normal and expected, or
  • The system is experiencing unexpected saturation

If the numbers align — the system is working as intended. Success. ✔

Summary

In this part, we explored a practical and systematic approach to analyzing memory behavior in Mooncake.
By combining architectural understanding with real runtime measurements, we showed how to:

  • Measure actual DRAM usage from mooncake_master logs
  • Calculate KVCache memory cost per token based on model configuration
  • Determine how many tokens are retained for a given request
  • Validate whether observed memory saturation is expected or abnormal

The key takeaway is that token count multiplied by cost per token provides a reliable baseline for expected memory usage. Comparing this theoretical estimate against real storage metrics allows you to quickly distinguish between healthy system behavior and potential issues such as excessive retention or eviction problems.

With this methodology, memory growth becomes explainable, predictable, and debuggable — enabling confident optimization and troubleshooting of large-scale inference workloads in Mooncake.
