<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sara_T</title>
    <description>The latest articles on DEV Community by Sara_T (@__82e06472cd325ef306e6).</description>
    <link>https://dev.to/__82e06472cd325ef306e6</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3651593%2Fd3699654-14a2-4f3f-b4db-0eb879bfc9ee.png</url>
      <title>DEV Community: Sara_T</title>
      <link>https://dev.to/__82e06472cd325ef306e6</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/__82e06472cd325ef306e6"/>
    <language>en</language>
    <item>
      <title>Mooncake Memory Deep Dive: KVCache, Token Cost, DRAM Usage, and Saturation Analysis</title>
      <dc:creator>Sara_T</dc:creator>
      <pubDate>Thu, 18 Dec 2025 16:36:40 +0000</pubDate>
      <link>https://dev.to/__82e06472cd325ef306e6/mooncake-memory-deep-dive-kvcache-token-cost-dram-usage-and-saturation-analysis-1n88</link>
      <guid>https://dev.to/__82e06472cd325ef306e6/mooncake-memory-deep-dive-kvcache-token-cost-dram-usage-and-saturation-analysis-1n88</guid>
      <description>&lt;p&gt;This is Part 2 of our explanation about Mooncake.&lt;br&gt;
To learn more and get started with Mooncake, please refer to &lt;a href="https://dev.to/__82e06472cd325ef306e6/getting-started-with-mooncake-installation-execution-troubleshooting-1hh9"&gt;Part 1.&lt;/a&gt;&lt;br&gt;
In this part, we take a deep dive into advanced memory analysis, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to measure DRAM consumption&lt;/li&gt;
&lt;li&gt;How to calculate token cost&lt;/li&gt;
&lt;li&gt;How to check how many tokens are retained&lt;/li&gt;
&lt;li&gt;How to detect unexpected saturation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Whether you're optimizing performance or tracking resource efficiency, this guide gives you the tools to move forward with confidence.&lt;/p&gt;
&lt;h2&gt;
  
  
  How to measure DRAM consumption?
&lt;/h2&gt;

&lt;p&gt;To accurately measure how much DRAM is consumed by Mooncake, the system must be up and running with real requests being processed.&lt;/p&gt;

&lt;p&gt;Mooncake allocates DRAM dynamically as prompts are received and KVCache entries are created.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step-by-Step Method
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Run the system with logging &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Start the mooncake_master process with output redirected to a log file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nohup mooncake_master \
  --port 10001 \
  --root_fs_dir /mnt/mooncake_data \
  --cluster_id mooncake_cluster \
  &amp;gt; logs/master.txt 2&amp;gt;&amp;amp;1 &amp;amp;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;Send sample requests&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Start with lightweight prompts to establish a low baseline of memory usage.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Review the logs&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Open the &lt;code&gt;logs/master.txt&lt;/code&gt; file and look for DRAM-related metrics.&lt;/p&gt;

&lt;p&gt;The log includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Total DRAM usage&lt;/li&gt;
&lt;li&gt;Number of KVCache objects stored&lt;/li&gt;
&lt;li&gt;Internal write operations&lt;/li&gt;
&lt;li&gt;Count of requests per API type&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For this section, we’ll focus only on DRAM usage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example: Log Output from mooncake_master&lt;/strong&gt;&lt;br&gt;
Below is a sample log excerpt from the mooncake_master process, captured during runtime.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjh0kgvz0g748wgkvurks.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjh0kgvz0g748wgkvurks.png" alt="Example: Log Output from mooncake_master&amp;lt;br&amp;gt;
" width="800" height="250"&gt;&lt;/a&gt;&lt;/p&gt;
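&lt;p&gt;As a cross-check on the figures reported in the log, you can also sample the master process's resident memory directly from the operating system. This is a generic Linux sketch (reading &lt;code&gt;/proc&lt;/code&gt;), not a Mooncake API:&lt;/p&gt;

```python
import os

def rss_kib(pid):
    """Return a process's resident set size (VmRSS) in KiB, read from /proc (Linux)."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # /proc reports the value in kB
    raise RuntimeError("VmRSS not found")

# Example: sample this script's own process; substitute the mooncake_master
# PID (e.g. from pgrep mooncake_master) to track the master instead.
print(rss_kib(os.getpid()), "KiB resident")
```

&lt;p&gt;If the OS-level number sits far above what the log reports, memory is being held outside the KVCache accounting, which is worth investigating on its own.&lt;/p&gt;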
&lt;h2&gt;
  
  
  How to Calculate Token Cost (KVCache Memory)
&lt;/h2&gt;

&lt;p&gt;Every token processed by a large language model consumes memory — primarily in the KVCache, which stores attention Key/Value tensors per token, per layer, and per attention head.&lt;/p&gt;

&lt;p&gt;Understanding how much memory each token uses is essential for capacity planning, optimization, and preventing resource saturation in inference systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Calculate how much memory a single token consumes in the model’s internal KVCache, based on architectural parameters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Formula&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To estimate token memory cost, use:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;head_dim = hidden_size / num_attention_heads&lt;/code&gt;&lt;br&gt;
&lt;code&gt;Memory_per_token = num_kv_heads × head_dim × 2 (K+V) × bytes_per_value × num_layers&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;: Qwen2 Configuration&lt;/p&gt;

&lt;p&gt;From the model's config:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;{&lt;br&gt;
  "hidden_size": 3584,&lt;br&gt;
  "num_attention_heads": 28,&lt;br&gt;
  "num_key_value_heads": 4,&lt;br&gt;
  "num_hidden_layers": 28,&lt;br&gt;
  "torch_dtype": "float16"&lt;br&gt;
}&lt;/code&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step-by-step:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Calculate head_dim:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;head_dim = 3584 / 28 = 128&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Plug into the formula:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Memory_per_token = 4 × 128 × 2 × 2 × 28 = 57344 bytes ≈ 56 KB&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;Result&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Each token consumes &lt;strong&gt;~56 KB&lt;/strong&gt; of KVCache memory (in float16).&lt;/p&gt;
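&lt;p&gt;The calculation above is easy to script. Here is a minimal sketch; the config values are the Qwen2 numbers quoted above, and &lt;code&gt;bytes_per_value=2&lt;/code&gt; corresponds to float16:&lt;/p&gt;

```python
def kvcache_bytes_per_token(cfg, bytes_per_value=2):
    """Estimate per-token KVCache cost from a Hugging Face-style model config."""
    head_dim = cfg["hidden_size"] // cfg["num_attention_heads"]
    # one K and one V vector per KV head, per layer
    return (cfg["num_key_value_heads"] * head_dim * 2
            * bytes_per_value * cfg["num_hidden_layers"])

qwen2 = {
    "hidden_size": 3584,
    "num_attention_heads": 28,
    "num_key_value_heads": 4,
    "num_hidden_layers": 28,
}

print(kvcache_bytes_per_token(qwen2))  # 57344 bytes ≈ 56 KB per token
```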

&lt;p&gt;&lt;strong&gt;Why does this matter?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;KVCache memory scales linearly with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Number of tokens&lt;/li&gt;
&lt;li&gt;Batch size&lt;/li&gt;
&lt;li&gt;Model depth&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Examples:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;100 tokens = ~5.6 MB&lt;/code&gt;&lt;br&gt;
&lt;code&gt;100 tokens × batch size 4 = ~22.4 MB&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This is per prompt, and accumulates across concurrent users and context retention.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Cases&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Capacity planning for inference servers&lt;/li&gt;
&lt;li&gt;Monitoring for unexpected saturation&lt;/li&gt;
&lt;li&gt;Comparing model footprints during evaluation&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  How to Check How Many Tokens Are Retained
&lt;/h2&gt;

&lt;p&gt;After understanding how much memory &lt;strong&gt;a single token consumes&lt;/strong&gt;, the next step is to determine &lt;strong&gt;how many tokens are actually retained in memory&lt;/strong&gt; for a given prompt or request.&lt;/p&gt;

&lt;p&gt;Since Mooncake stores attention state in the KVCache, the number of retained tokens directly affects total DRAM usage.&lt;/p&gt;
&lt;h3&gt;
  
  
  Counting Tokens Using the Model Tokenizer
&lt;/h3&gt;

&lt;p&gt;The most reliable way to check how many tokens are retained for a prompt is to tokenize the input using the &lt;strong&gt;exact tokenizer of the model.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(your_model_path)

token_ids = tokenizer(
    your_prompt,
    add_special_tokens=False
)["input_ids"]

log_entry = {
    "event": "token_retention",
    "retained_tokens": len(token_ids)
}

with open("token_usage.log", "a") as f:
    f.write(json.dumps(log_entry) + "\n")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What does this number represent?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;len(token_ids)&lt;/code&gt; is the number of tokens generated from the prompt&lt;/li&gt;
&lt;li&gt;Each of these tokens creates one KV entry per layer&lt;/li&gt;
&lt;li&gt;As long as the tokens are not evicted (e.g. by LRU), they are retained in KVCache&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token count = number of KVCache entries retained for that prompt&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Estimating Retained Memory&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once you know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Number of retained tokens&lt;/li&gt;
&lt;li&gt;Memory cost per token (e.g. ~56 KB from the previous section)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can estimate total KVCache usage:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Total_KV_Memory ≈ Retained_Tokens × Memory_per_Token&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Prompt length: 120 tokens&lt;/code&gt;&lt;br&gt;
&lt;code&gt;Memory per token: ~56 KB&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
&lt;code&gt;120 × 56 KB ≈ 6.7 MB&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why This Matters&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Long prompts increase retention linearly&lt;/li&gt;
&lt;li&gt;Multi-user or batched inference multiplies memory usage&lt;/li&gt;
&lt;li&gt;Retained tokens accumulate until eviction occurs (e.g. via LRU)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;This makes token counting a critical diagnostic step when investigating:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High DRAM usage&lt;/li&gt;
&lt;li&gt;Unexpected saturation&lt;/li&gt;
&lt;li&gt;Memory growth over time&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to detect unexpected saturation
&lt;/h2&gt;

&lt;p&gt;After calculating token cost, measuring DRAM usage, and tracking retained tokens, the final step is to determine whether the observed memory saturation is expected behavior or an indication of a problem.&lt;/p&gt;

&lt;p&gt;This validation is done by cross-checking theoretical memory estimates against actual runtime measurements from Mooncake.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Verify that:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observed memory usage ≈ Expected memory usage&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If they match → the system is behaving correctly.&lt;br&gt;
If they do not → further investigation is required.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Calculate Expected Memory Usage
&lt;/h3&gt;

&lt;p&gt;Using the previous sections:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Count retained tokens (via tokenizer-based token counting)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use the calculated cost per token&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;code&gt;Expected_KV_Memory = Retained_Tokens × Memory_per_Token&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retained tokens: 120&lt;/li&gt;
&lt;li&gt;Memory per token: 56 KB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;Expected_KV_Memory = 120 × 56 KB ≈ 6.7 MB&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This is the &lt;strong&gt;theoretical KVCache memory footprint&lt;/strong&gt; for the request.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Check Actual Memory Usage in Mooncake Logs
&lt;/h3&gt;

&lt;p&gt;Next, inspect the &lt;code&gt;mooncake_master&lt;/code&gt; logs and locate the memory usage reported for the same request.&lt;/p&gt;

&lt;p&gt;Look for log entries indicating storage or DRAM usage associated with the request.&lt;/p&gt;

&lt;p&gt;This represents the actual memory retained by Mooncake.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Compare Expected vs. Actual
&lt;/h3&gt;

&lt;p&gt;Now compare:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expected memory: from the token calculation&lt;/li&gt;
&lt;li&gt;Actual memory: from the Mooncake logs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Case 1: Values Match (Within Reasonable Margin)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expected ≈ Actual&lt;/li&gt;
&lt;li&gt;Minor differences due to alignment or metadata&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt;&lt;br&gt;
The saturation is &lt;strong&gt;expected&lt;/strong&gt;.&lt;br&gt;
The system is operating correctly.&lt;/p&gt;

&lt;p&gt;✔ KVCache behaves as designed&lt;br&gt;
✔ Token retention matches architecture&lt;br&gt;
✔ No memory leak detected&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Case 2: Values Do Not Match&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Actual memory is significantly higher than expected&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt;&lt;br&gt;
The saturation is unexpected and requires investigation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;At this point, further debugging is needed to identify the root cause.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Insight&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token count × cost per token is the ground truth baseline.&lt;/strong&gt;&lt;br&gt;
Any persistent deviation from this baseline indicates abnormal memory behavior.&lt;/p&gt;
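&lt;p&gt;The expected-vs-actual check can be sketched as a small helper. The 10% tolerance here is an assumption; tune it to the alignment and metadata overhead you observe in your deployment:&lt;/p&gt;

```python
def saturation_check(retained_tokens, bytes_per_token, observed_bytes, tolerance=0.10):
    """Compare observed DRAM usage against the theoretical KVCache baseline."""
    expected = retained_tokens * bytes_per_token
    deviation = (observed_bytes - expected) / expected
    return {
        "expected_bytes": expected,
        "deviation": deviation,               # fraction above/below the baseline
        "unexpected": deviation > tolerance,  # flag a significant overshoot
    }

# 120 retained tokens at 56 KB each, with ~7 MB observed: within margin
print(saturation_check(120, 57344, 7_000_000))
```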

&lt;p&gt;&lt;strong&gt;Final Outcome&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By following this process, you can clearly determine whether:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Memory growth is normal and expected, or&lt;/li&gt;
&lt;li&gt;The system is experiencing unexpected saturation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the numbers align, the system is working as intended. Success. ✔&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In this part, we explored a practical and systematic approach to &lt;strong&gt;analyzing memory behavior in Mooncake.&lt;/strong&gt;&lt;br&gt;
By combining architectural understanding with real runtime measurements, we showed how to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Measure actual DRAM usage from mooncake_master logs&lt;/li&gt;
&lt;li&gt;Calculate KVCache memory cost per token based on model configuration&lt;/li&gt;
&lt;li&gt;Determine how many tokens are retained for a given request&lt;/li&gt;
&lt;li&gt;Validate whether observed memory saturation is expected or abnormal&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key takeaway is that token &lt;strong&gt;count multiplied by cost per token provides a reliable&lt;/strong&gt; baseline for expected memory usage. Comparing this theoretical estimate against real storage metrics allows you to quickly distinguish between healthy system behavior and potential issues such as excessive retention or eviction problems.&lt;/p&gt;

&lt;p&gt;With this methodology, memory growth becomes explainable, predictable, and debuggable — enabling confident optimization and troubleshooting of large-scale inference workloads in Mooncake.&lt;/p&gt;

</description>
      <category>performance</category>
      <category>llm</category>
      <category>backend</category>
      <category>ai</category>
    </item>
    <item>
      <title>Deploying Mooncake for LLMs: Installation &amp; Optimization</title>
      <dc:creator>Sara_T</dc:creator>
      <pubDate>Thu, 11 Dec 2025 09:30:43 +0000</pubDate>
      <link>https://dev.to/__82e06472cd325ef306e6/getting-started-with-mooncake-installation-execution-troubleshooting-1hh9</link>
      <guid>https://dev.to/__82e06472cd325ef306e6/getting-started-with-mooncake-installation-execution-troubleshooting-1hh9</guid>
      <description>&lt;p&gt;Mooncake is a service-layer system designed to support LLM execution by separating the PREFILL phase (initial context construction) from the DECODE phase (token generation).&lt;br&gt;
It leverages CPU, SSD, and DRAM resources to efficiently manage the KVCache generated during prompt execution on vLLM, enabling reuse of previously computed data and reducing GPU workload during inference.&lt;/p&gt;

&lt;p&gt;In this post, we will explore what Mooncake is, its core components, its purpose, and how it integrates into the model execution pipeline.&lt;br&gt;
We will then review how to build and run the system, what dependencies are required, and the issues you may encounter — along with their solutions.&lt;/p&gt;
&lt;h1&gt;
  
  
  What is Mooncake?
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr4ejypvkv8e1y4nhlz0s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr4ejypvkv8e1y4nhlz0s.png" alt=" " width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  MOONCAKE — Clear Technical Overview
&lt;/h3&gt;

&lt;p&gt;Mooncake is a distributed, high-performance storage system designed specifically for managing KVCache used in Large Language Model (LLM) inference.&lt;br&gt;
Its main goal is to make LLM execution faster and more scalable by allowing multiple servers and GPUs to share precomputed context, instead of recalculating it each time.&lt;/p&gt;
&lt;h3&gt;
  
  
  What Problem Does Mooncake Solve?
&lt;/h3&gt;

&lt;p&gt;When an LLM processes a prompt, it generates a structure called KVCache (key–value cache).&lt;br&gt;
This cache stores the internal attention states of the model and is required for generating the next tokens.&lt;/p&gt;

&lt;p&gt;However:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;KVCache is large.&lt;/li&gt;
&lt;li&gt;Recomputing it for every request is expensive.&lt;/li&gt;
&lt;li&gt;Passing it between servers is normally slow.&lt;/li&gt;
&lt;li&gt;GPU memory is limited.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Mooncake provides an efficient way to store, share, and reuse this KVCache across machines.&lt;/strong&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Core Ideas (Simplified)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;1. Split between Prefill and Decode&lt;/strong&gt;&lt;br&gt;
Mooncake separates the LLM workflow into two phases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Prefill&lt;/strong&gt;&lt;br&gt;
  The model processes the prompt and generates KVCache.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Decode&lt;/strong&gt;&lt;br&gt;
  Token generation uses the already-computed KVCache.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With Mooncake, once Prefill is done, the KVCache can be saved and reused by any other server.&lt;br&gt;
This means Decode does &lt;strong&gt;not&lt;/strong&gt; need to recompute anything, reducing GPU load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Distributed Memory Store&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Mooncake includes a &lt;strong&gt;Store Cluster&lt;/strong&gt; made up of many worker nodes.&lt;br&gt;
Each worker contributes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;DRAM&lt;/strong&gt; (fast memory)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SSD&lt;/strong&gt; (persistent storage)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together they form a single, shared memory pool for holding KVCache objects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Fast Data Transfer (Transfer Engine)&lt;/strong&gt;&lt;br&gt;
Mooncake uses a high-speed communication engine supporting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RDMA&lt;/li&gt;
&lt;li&gt;NVMe-over-Fabric&lt;/li&gt;
&lt;li&gt;TCP&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This allows “zero-copy” or near-zero-copy transfer of KVCache segments between machines.&lt;br&gt;
The result is extremely high throughput with low latency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Replication and Resilience&lt;/strong&gt;&lt;br&gt;
Mooncake automatically replicates KVCache objects across multiple workers.&lt;br&gt;
This ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No “hotspots” (one overloaded server)&lt;/li&gt;
&lt;li&gt;Data availability even if a node fails&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As long as the system has an active master and a reachable client, Mooncake continues operating.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Smart Memory Management&lt;/strong&gt;&lt;br&gt;
The system includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LRU eviction (old items removed first)&lt;/li&gt;
&lt;li&gt;Soft pinning (prevent eviction of important cache objects)&lt;/li&gt;
&lt;li&gt;Persistence (optional SSD-based storage)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This keeps memory usage predictable and efficient.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Simple Developer API&lt;/strong&gt;&lt;br&gt;
Clients can communicate with Mooncake using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;C++ API&lt;/li&gt;
&lt;li&gt;Python API&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The client can run as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;an embedded library inside an inference service, or&lt;/li&gt;
&lt;li&gt;a standalone process.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  System Architecture (Simplified)
&lt;/h3&gt;

&lt;p&gt;1. &lt;strong&gt;Inference Cluster&lt;/strong&gt;&lt;br&gt;
Runs LLM engines (e.g., vLLM). Creates KVCache.&lt;/p&gt;

&lt;p&gt;2. &lt;strong&gt;Transfer Engine&lt;/strong&gt;&lt;br&gt;
Moves KVCache between inference nodes and Mooncake quickly.&lt;/p&gt;

&lt;p&gt;3. &lt;strong&gt;Mooncake Store Cluster&lt;/strong&gt;&lt;br&gt;
Distributed memory pool storing KVCache.&lt;/p&gt;

&lt;p&gt;4. &lt;strong&gt;Metadata Server (e.g., etcd/Redis)&lt;/strong&gt;&lt;br&gt;
Tracks where each KVCache object is stored and manages replicas.&lt;/p&gt;
&lt;h3&gt;
  
  
  How It Works (Step by Step)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Prefill&lt;/strong&gt;&lt;br&gt;
An LLM server processes the prompt → produces KVCache → saves it to Mooncake.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Share&lt;/strong&gt;&lt;br&gt;
Another server retrieves the same KVCache from Mooncake.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Decode&lt;/strong&gt;&lt;br&gt;
The second server generates tokens using the retrieved KVCache instead of recomputing it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Eviction/Persistence&lt;/strong&gt;&lt;br&gt;
Mooncake cleans up old objects or saves them to SSD based on policy.&lt;/p&gt;
&lt;h3&gt;
  
  
  Key Advantages
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Higher throughput for LLM inference&lt;/li&gt;
&lt;li&gt;Lower GPU memory usage since KVCache can reside in DRAM/SSD&lt;/li&gt;
&lt;li&gt;Easy scaling by adding more worker nodes&lt;/li&gt;
&lt;li&gt;Fault tolerance through replication&lt;/li&gt;
&lt;li&gt;Optimized for long-context and multi-server LLM workloads&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;
  
  
  How to Build and Run Mooncake?
&lt;/h1&gt;
&lt;h4&gt;
  
  
  install uv
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;
  
  
  use specific python
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;uv venv --python 3.10 --seed
source .venv/bin/activate 
# (run "deactivate" to exit)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;
  
  
  install CUDA packages matching the CUDA version on your server (example below: CUDA 12.9)
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;uv pip install quart httpx matplotlib aiohttp pandas datasets modelscope setuptools openpyxl pynvml xlsxwriter
uv pip install --index-url https://download.pytorch.org/whl/cu129 torch torchvision torchaudio
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;
  
  
  install mooncake with uv
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;uv pip install mooncake-transfer-engine
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;
  
  
  install vllm with specific version
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone -b v0.8.5 https://github.com/vllm-project/vllm.git --recursive
cd vllm
python use_existing_torch.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;
  
  
  install requirements
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;uv pip install -r requirements/build.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;
  
  
  Update these environment variables in your shell configuration file (e.g. ~/.bashrc):
&lt;/h4&gt;

&lt;p&gt;(Make sure to update all paths and version numbers to match the CUDA installation and directory structure on your server).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export LD_LIBRARY_PATH=/usr/local/cuda-12.9/lib64:$LD_LIBRARY_PATH
export CUDA_HOME="/usr/local/cuda-12.9"
export PATH="$CUDA_HOME/bin${PATH:+:${PATH}}"
export CUDACXX="$CUDA_HOME/bin/nvcc"
export CMAKE_CUDA_COMPILER="$CUDA_HOME/bin/nvcc"
export TORCH_CUDA_ARCH_LIST="8.9"
export MAX_JOBS=128
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  compile vllm
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;uv pip install --no-build-isolation -e .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Write a mooncake.json file, replacing the IP addresses with your own.
&lt;/h4&gt;

&lt;p&gt;Make sure to update the ROOT_FS_DIR path according to your server’s directory structure.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "local_hostname": "10.1.222.133",          
  "metadata_server": "etcd://10.1.222.133:2379", 
  "global_segment_size": 274877906944,    
  "local_buffer_size": 274877906944,         
  "protocol": "tcp",                       
  "device_name": "",                         
  "master_server_address": "10.1.222.133:10001",  
  "cluster_id": "mooncake_cluster",     
  "root_fs_dir": "/mnt/mooncake_data" 
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
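&lt;p&gt;A quick way to catch typos before launching the master is to parse the file and verify the fields you rely on. This is a sketch; the &lt;code&gt;REQUIRED&lt;/code&gt; set simply mirrors the example above and is not an official schema. (For reference, 274877906944 bytes is 256 GiB.)&lt;/p&gt;

```python
import json

# Fields used in the example config above (an assumption, not an official schema)
REQUIRED = {
    "local_hostname", "metadata_server", "global_segment_size",
    "local_buffer_size", "protocol", "master_server_address",
}

def check_config(path):
    """Parse a mooncake.json file and return (config, missing required fields)."""
    with open(path) as f:
        cfg = json.load(f)
    return cfg, REQUIRED - cfg.keys()
```

&lt;p&gt;Call &lt;code&gt;check_config("mooncake.json")&lt;/code&gt; before starting the services; an empty second element means all expected fields are present.&lt;/p&gt;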



&lt;h4&gt;
  
  
  download model:
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git lfs install
git clone "https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  install etcd and check whether it is already running (kill the process if it is)
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt install etcd-server
sudo lsof -i -P -n | grep etcd
sudo kill &amp;lt;PID&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Start the venv if it isn't active.
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;source .venv/bin/activate 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  check if ports are free:
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lsof -t -i:8000
ps -ef | grep 'vllm.entrypoints.openai.api_server' | grep "port 8100" | awk -F ' ' '{print $2}'
ps -ef | grep 'vllm.entrypoints.openai.api_server' | grep "port 8200" | awk -F ' ' '{print $2}'
sudo lsof -i -P -n | grep mooncake_
sudo lsof -i -P -n | grep etcd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  show all occupied ports:
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo lsof -i -P -n 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  kill the ports if they are occupied:
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo kill &amp;lt;PID1&amp;gt; &amp;lt;PID2&amp;gt; ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  run etcd:
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nohup etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://localhost:2379 &amp;gt; etcd_output.log 2&amp;gt;&amp;amp;1 &amp;amp;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  run mooncake master:
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nohup mooncake_master \
  --port 10001 \
  --root_fs_dir /mnt/mooncake_data \
  --cluster_id mooncake_cluster \
  &amp;gt; logs/master.txt 2&amp;gt;&amp;amp;1 &amp;amp;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  run prefill:
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export VLLM_WORKER_MULTIPROC_METHOD=spawn
export VLLM_USE_V1=0
CUDA_VISIBLE_DEVICES=0 \
MOONCAKE_CONFIG_PATH=mooncake.json \
python3 -m vllm.entrypoints.openai.api_server \
  --model /home/vllm/Qwen2.5-7B-Instruct-GPTQ-Int4 \
  --port 8100 \
  --max-model-len 10000 \
  --gpu-memory-utilization 0.4 \
  --kv-transfer-config '{"kv_connector":"MooncakeStoreConnector","kv_role":"kv_producer"}' \
  &amp;gt; logs/prefill-0.txt 2&amp;gt;&amp;amp;1 &amp;amp;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  run decode:
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;UDA_VISIBLE_DEVICES=0 \
MOONCAKE_CONFIG_PATH=mooncake.json \
python3 -m vllm.entrypoints.openai.api_server \
  --model /home/vllm/Qwen2.5-7B-Instruct-GPTQ-Int4 \
  --port 8200 \
  --max-model-len 10000 \
  --gpu-memory-utilization 0.4 \
  --kv-transfer-config '{"kv_connector":"MooncakeStoreConnector","kv_role":"kv_consumer"}' \
  &amp;gt; logs/decode-0.txt 2&amp;gt;&amp;amp;1 &amp;amp;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  run proxy (replace --model with the path to your model):
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python3 ../proxy_demo.py \
  --model /home/vllm/Qwen2.5-7B-Instruct-GPTQ-Int4 \
  --prefill localhost:8100 \
  --decode localhost:8200 \
  --port 8000 \
  2&amp;gt;&amp;amp;1 | tee logs/proxy-1-1.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Errors and malfunctions that may arise while building and running Mooncake
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ISSUE:&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;ValueError: No available memory for the cache blocks. Try increasing &lt;code&gt;gpu_memory_utilization&lt;/code&gt; when initializing the engine.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SOLUTION:&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Increase the &lt;code&gt;--gpu-memory-utilization&lt;/code&gt; value: when it is too low, no memory is left to allocate for the KV cache.&lt;br&gt;
However, first make sure the GPU is actually free.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ISSUE:&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You run into an &lt;code&gt;AssertionError (issubclass(connector_cls, KVConnectorBase_V1))&lt;/code&gt; when starting the prefill process with MooncakeStoreConnector.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SOLUTION:&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Export these environment variables before running the command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export VLLM_WORKER_MULTIPROC_METHOD=spawn
export VLLM_USE_V1=0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ISSUE:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;File "/home/project/.venv/lib/python3.12/site-packages/torchvision/datasets/__init__.py", line 1, in &amp;lt;module&amp;gt;
    from ._optical_flow import FlyingChairs, FlyingThings3D, HD1K, KittiFlow, Sintel
  File "/home/project/.venv/lib/python3.12/site-packages/torchvision/datasets/_optical_flow.py", line 14, in &amp;lt;module&amp;gt;
    from .utils import _read_pfm, verify_str_arg
  File "/home/project/.venv/lib/python3.12/site-packages/torchvision/datasets/utils.py", line 4, in &amp;lt;module&amp;gt;
    import lzma
  File "/usr/local/lib/python3.12/lzma.py", line 27, in &amp;lt;module&amp;gt;
    from _lzma import *
ModuleNotFoundError: No module named '_lzma'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SOLUTION:&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This error occurs when another Python version has been installed on top of the server's base Python.&lt;br&gt;
In my case, someone had installed Python 3.12 outside of uv, which broke every 3.12 virtual environment: instead of the system libraries built for the base Python 3.10, the environments tried to use Python 3.12 libraries, and on Ubuntu 22 there is no compiled lzma library for Python 3.12.&lt;br&gt;&lt;br&gt;
The clean fix is to reinstall Python on the server, but that is a time-consuming process.&lt;br&gt;
Therefore, if the base Python on your server is not 3.12, you can try creating the virtual environment with the version that is actually installed, for example:&lt;br&gt;
&lt;code&gt;uv venv --python 3.10 --seed&lt;br&gt;
&lt;/code&gt;instead of:&lt;br&gt;
&lt;code&gt;uv venv --python 3.12 --seed&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
and work around the problem that way.&lt;br&gt;
On my server, this resolved the issue.&lt;/p&gt;
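&lt;p&gt;Before rebuilding anything, you can quickly check whether a given interpreter is affected: the traceback ends at the compiled &lt;code&gt;_lzma&lt;/code&gt; C module, so testing for its presence tells you whether that Python was built against liblzma (a small illustrative check, not part of Mooncake):&lt;/p&gt;

```python
import importlib.util

def lzma_available():
    """True if this interpreter was built with liblzma support (the '_lzma' C module exists)."""
    return importlib.util.find_spec("_lzma") is not None

if __name__ == "__main__":
    if lzma_available():
        print("lzma support: OK")
    else:
        print("lzma support: MISSING - use a different Python version or rebuild")
```

&lt;p&gt;Run this with each candidate interpreter (e.g. &lt;code&gt;python3.10&lt;/code&gt; vs. &lt;code&gt;python3.12&lt;/code&gt;) to decide which version to pass to &lt;code&gt;uv venv&lt;/code&gt;.&lt;/p&gt;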

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ISSUE:&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You get import errors when loading the required Python packages.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SOLUTION:&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;CUDA may not be installed correctly on your system. Install the appropriate CUDA version, then download the required packages per the installation instructions above, making sure they match the CUDA version you installed.&lt;/p&gt;
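&lt;p&gt;One hedged way to narrow this down, assuming the failing imports are torch-related, is to print the interpreter's view of the torch install and its CUDA build; &lt;code&gt;cuda_diagnostics&lt;/code&gt; is an illustrative helper, not an official API:&lt;/p&gt;

```python
import importlib.util

def cuda_diagnostics():
    """Collect basic facts about the torch install and its CUDA build."""
    info = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if info["torch_installed"]:
        import torch
        info["torch_version"] = torch.__version__
        info["cuda_build"] = torch.version.cuda      # None on CPU-only builds
        info["cuda_available"] = torch.cuda.is_available()
    return info

if __name__ == "__main__":
    for key, value in cuda_diagnostics().items():
        print(f"{key}: {value}")
```

&lt;p&gt;If &lt;code&gt;cuda_build&lt;/code&gt; does not match the CUDA toolkit on the host (compare with &lt;code&gt;nvcc --version&lt;/code&gt;), reinstall the packages for the matching CUDA version.&lt;/p&gt;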

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ISSUE:&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You receive an error when running both PREFILL and DECODE in two separate processes.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;POSSIBLE SOLUTION:&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You may not have enough GPU resources on the server. With only a single GPU, it is not possible to run both PREFILL and DECODE on the same device, so run only PREFILL and skip PROXY and DECODE.&lt;br&gt;
Alternatively, use another server that has multiple GPUs.&lt;/p&gt;
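&lt;p&gt;To confirm how many GPUs the process can actually see before deciding what to run, a quick check (illustrative, assuming torch; &lt;code&gt;nvidia-smi&lt;/code&gt; on the host gives the same answer):&lt;/p&gt;

```python
import importlib.util

def visible_gpu_count():
    """Number of CUDA devices visible to this process (0 if torch or CUDA is unusable)."""
    if importlib.util.find_spec("torch") is None:
        return 0
    import torch
    if not torch.cuda.is_available():
        return 0
    return torch.cuda.device_count()

if __name__ == "__main__":
    n = visible_gpu_count()
    print(f"visible GPUs: {n}")
    if n >= 2:
        print("enough devices to run PREFILL and DECODE separately")
    else:
        print("run PREFILL only, or move to a multi-GPU server")
```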
&lt;h2&gt;
  
  
  Once everything is running as required, all that remains is to send requests and view the results:
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Simple request structure:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl http://127.0.0.1:8100/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/home/vllm/Qwen2.5-7B-Instruct-GPTQ-Int4",
        "prompt": "what is Mooncake?",
        "max_tokens": 30
      }'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;For a more complex request structure, you can use a Python script.&lt;/strong&gt;&lt;/p&gt;
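&lt;p&gt;As a minimal sketch of such a script (stdlib only; the model path, port, and prompt mirror the curl example above and should be adjusted to your setup):&lt;/p&gt;

```python
import json
import urllib.request

def build_completion_payload(model, prompt, max_tokens=30):
    """Assemble the JSON body for a /v1/completions request."""
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}

def send_completion(base_url, payload, timeout=60):
    """POST the payload to the OpenAI-compatible completions endpoint."""
    req = urllib.request.Request(
        base_url + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Example (requires a running server):
#   payload = build_completion_payload("/home/vllm/Qwen2.5-7B-Instruct-GPTQ-Int4",
#                                      "what is Mooncake?")
#   result = send_completion("http://127.0.0.1:8100", payload)
#   print(result["choices"][0]["text"])
```

&lt;p&gt;From here you can loop over prompts, vary &lt;code&gt;max_tokens&lt;/code&gt;, or time the responses to compare prefill and decode behavior.&lt;/p&gt;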

&lt;h2&gt;
  
  
  Next: Advanced Memory Analysis
&lt;/h2&gt;

&lt;p&gt;In Part 2, we move from architecture and setup into practical memory analysis.&lt;br&gt;
We examine how Mooncake consumes DRAM, how KVCache memory scales with token count, how to calculate per-token cost, and how to detect expected versus abnormal memory saturation at runtime.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Continue to &lt;a href="https://dev.to/__82e06472cd325ef306e6/mooncake-memory-deep-dive-kvcache-token-cost-dram-usage-and-saturation-analysis-1n88"&gt;Part 2&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  GOOD LUCK!
&lt;/h1&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>architecture</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
