<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: eyanpen</title>
    <description>The latest articles on DEV Community by eyanpen (@eyanpen).</description>
    <link>https://dev.to/eyanpen</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3893228%2F3dc88537-5bc9-4c8b-acbb-8dcc4932177d.png</url>
      <title>DEV Community: eyanpen</title>
      <link>https://dev.to/eyanpen</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/eyanpen"/>
    <language>en</language>
    <item>
      <title>Runtime Backends: A Deep Dive into qwrap vs Container Isolation Modes</title>
      <dc:creator>eyanpen</dc:creator>
      <pubDate>Tue, 16 Jun 2026 00:41:08 +0000</pubDate>
      <link>https://dev.to/eyanpen/runtime-backends-a-deep-dive-into-qwrap-vs-container-isolation-modes-ib4</link>
      <guid>https://dev.to/eyanpen/runtime-backends-a-deep-dive-into-qwrap-vs-container-isolation-modes-ib4</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;In sandbox runtimes, "isolation" is the core requirement. qwrap (based on bwrap user namespace) and Container (podman/docker) are two mainstream backends. They solve the same problem — running code in a restricted environment — but take completely different paths. This article uses extensive analogies to help you understand the similarities and differences.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Building Intuition: Two Ways to "Lock the Door"
&lt;/h2&gt;

&lt;p&gt;Imagine you need to confine someone you don't fully trust in a room to do work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;qwrap approach&lt;/strong&gt;: In your existing house, you put up a partition to wall off a corner, leaving only a small window to pass materials through. The walls are still the original walls, the floor is still the original floor, but the person can only see what's inside the partition.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Container approach&lt;/strong&gt;: You build a shipping container with its own independent power, water, and ventilation systems. Put the person inside, close the door. They feel like they're in a complete little house, completely unaware of what's outside.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the most fundamental difference: &lt;strong&gt;qwrap is lightweight view isolation, Container is complete environment encapsulation&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is qwrap (bwrap user namespace)
&lt;/h2&gt;

&lt;p&gt;qwrap uses &lt;code&gt;bubblewrap&lt;/code&gt; (bwrap) under the hood, a sandboxing tool that leverages Linux user namespaces.&lt;/p&gt;

&lt;h3&gt;
  
  
  How it Works
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Host filesystem
├── /usr/bin/python3          ← Host's Python
├── /home/user/project/       ← User project
└── /tmp/secrets/             ← Sensitive files

qwrap sandbox view (what the process sees)
├── /usr/bin/python3          ← bind-mounted in, read-only
├── /workspace/               ← Only the project directory is exposed
└── (/tmp/secrets/ doesn't exist) ← Completely invisible
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key mechanisms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;User Namespace&lt;/strong&gt;: The process thinks it's root, but actually maps to an unprivileged user on the host&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mount Namespace&lt;/strong&gt;: Only bind-mounts necessary directories in, everything else is invisible&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No images, no layers, no network namespace&lt;/strong&gt; (unless explicitly configured)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Analogy: VPN Split Tunneling
&lt;/h3&gt;

&lt;p&gt;qwrap is like split-tunneling rules on your phone — you're not wrapping the entire phone in a VPN, just routing specific apps through the proxy. The system is still the same system, just with a restricted "field of view."&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Container (podman/docker)
&lt;/h2&gt;

&lt;p&gt;A Container is a complete isolated runtime environment, using multiple Linux namespaces (pid, net, mount, uts, ipc) plus cgroups for resource limits.&lt;/p&gt;

&lt;h3&gt;
  
  
  How it Works
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Host
└── Running podman/docker daemon (or rootless direct fork)

Container interior
├── /usr/bin/python3          ← Shipped with the image, may differ from host version
├── /workspace/               ← Volume mounted in
├── Independent PID 1         ← Process tree starts from 1
├── Independent network stack ← Has its own eth0, IP address
└── Independent hostname      ← Not the host's name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key mechanisms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OCI Image&lt;/strong&gt;: Environment fully packaged, including OS base layer, dependency libraries, toolchain&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-dimensional Namespaces&lt;/strong&gt;: PID, network, mount, hostname all isolated&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cgroups&lt;/strong&gt;: CPU, memory, IO can be capped&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layered filesystem&lt;/strong&gt;: OverlayFS, writes don't affect the base image&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Analogy: A "Poor Man's VM"
&lt;/h3&gt;

&lt;p&gt;A Container is like a "lightweight virtual machine" — without the overhead of hardware virtualization, but giving the process an experience nearly equivalent to owning a dedicated machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Differences
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Startup Speed
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;qwrap&lt;/strong&gt;: Millisecond-level. Essentially just &lt;code&gt;clone()&lt;/code&gt; + set up a few namespaces + exec, similar to starting a regular process.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Container&lt;/strong&gt;: Hundreds of milliseconds to seconds. Needs to prepare rootfs (extract layers/mount overlay), configure networking, start init process.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example: You have an AI Agent that repeatedly executes user-submitted Python snippets, each needing isolation. With Container, doing &lt;code&gt;docker run&lt;/code&gt; then &lt;code&gt;docker rm&lt;/code&gt; each time becomes unsustainable at one call per second. qwrap can launch dozens of sandbox instances per second.&lt;/p&gt;

&lt;h3&gt;
  
  
  Isolation Strength
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;qwrap&lt;/strong&gt;: Medium. The process still shares the host kernel, network is not isolated by default (can access the internet), only filesystem view trimming and privilege reduction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Container&lt;/strong&gt;: Strong. Network, PID, and filesystem are comprehensively isolated. Combined with a seccomp profile, even syscalls can be restricted.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example: If sandboxed code attempts &lt;code&gt;kill -9 1&lt;/code&gt; (kill the init process):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;qwrap: Since it lacks CAP_KILL privileges over host processes in the user namespace, the kernel rejects the operation, but the process can "see" host PIDs (unless PID namespace is added).&lt;/li&gt;
&lt;li&gt;Container: PID 1 as seen by the process is the container's own init — killing it only crashes the container itself, the host is unharmed.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Environment Consistency
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;qwrap&lt;/strong&gt;: Depends on the host environment. If the host doesn't have &lt;code&gt;numpy&lt;/code&gt; installed, the sandbox doesn't either (unless you mount a virtualenv directory in).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Container&lt;/strong&gt;: Self-contained environment. Whatever is installed in the image is available, regardless of what's on the host.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example: Your CI runs on an Ubuntu 22.04 machine, but the project needs Python 3.12 + CUDA 12.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;qwrap approach: You must install Python 3.12 and CUDA on the host first, then qwrap just restricts visible scope.&lt;/li&gt;
&lt;li&gt;Container approach: Simply &lt;code&gt;FROM nvidia/cuda:12.0-python3.12&lt;/code&gt;, everything is in the image, even if the host is CentOS 7.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Resource Overhead
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;qwrap&lt;/strong&gt;: Near-zero overhead. No extra processes, no overlay filesystem, no virtual bridge. The sandbox is just a "process with a restricted view."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Container&lt;/strong&gt;: Lightweight but perceptible. Each container has its own mount stack, possibly a veth pair, and a cgroup controller tracking it. Running a few is fine, but running hundreds starts accumulating network and storage overhead.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Portability
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;qwrap&lt;/strong&gt;: Linux-only (depends on user namespace), requiring kernel version ≥ 3.8. Different distributions have different user namespace policies (Ubuntu enables by default, Debian/RHEL may need sysctl adjustments).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Container&lt;/strong&gt;: Cross-platform. macOS/Windows can run them through VM layers (Docker Desktop, Podman Machine). Images are standard OCI format, deployable anywhere.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  When to Choose qwrap
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Need extremely fast startup/teardown cycles (Agent spawns a sandbox for every tool call)&lt;/li&gt;
&lt;li&gt;Host environment is already prepared, just need to "restrict visibility"&lt;/li&gt;
&lt;li&gt;Don't need network isolation (or willing to manage with iptables manually)&lt;/li&gt;
&lt;li&gt;Resource-sensitive, don't want extra memory/storage overhead for isolation&lt;/li&gt;
&lt;li&gt;Runtime environment is definitely Linux with user namespace support&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Typical scenario: &lt;strong&gt;Code execution sandbox&lt;/strong&gt;. An AI coding assistant runs LLM-generated code in qwrap, discards it when done. May run dozens of times per second, each needing only a Python interpreter + limited file access.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Choose Container
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Need a complete, reproducible runtime environment (the "works on my machine" problem disappears)&lt;/li&gt;
&lt;li&gt;Need strong isolation (untrusted code, multi-tenant scenarios)&lt;/li&gt;
&lt;li&gt;Need network isolation (each task gets an independent network stack)&lt;/li&gt;
&lt;li&gt;Need cross-platform deployment&lt;/li&gt;
&lt;li&gt;Longer lifecycle (service processes, long-running tasks)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Typical scenario: &lt;strong&gt;CI/CD Pipeline&lt;/strong&gt;. Each build runs in a clean container ensuring environment consistency. Or &lt;strong&gt;multi-tenant SaaS&lt;/strong&gt;, where each tenant's custom logic runs in an isolated container with full resource and network separation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Can You Combine Them?
&lt;/h2&gt;

&lt;p&gt;Yes, and this is a very common pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Outer Container + Inner qwrap&lt;/strong&gt;: Container provides environment consistency and coarse-grained isolation, qwrap provides fine-grained per-process sandboxing inside the container. For example, an Agent service runs inside a container, and each tool invocation spawns a qwrap sandbox.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;qwrap as a "lightweight container" substitute&lt;/strong&gt;: In development environments where you don't want to install Docker but need some isolation, qwrap can serve as a minimal alternative.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Summary at a Glance
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Startup latency: qwrap milliseconds / Container hundreds of ms to seconds&lt;/li&gt;
&lt;li&gt;Isolation dimensions: qwrap filesystem + user privileges / Container filesystem + network + PID + resources&lt;/li&gt;
&lt;li&gt;Environment dependency: qwrap depends on host / Container self-contained image&lt;/li&gt;
&lt;li&gt;Resource overhead: qwrap near-zero / Container lightweight but perceptible&lt;/li&gt;
&lt;li&gt;Portability: qwrap Linux-only / Container cross-platform&lt;/li&gt;
&lt;li&gt;Best for: qwrap high-frequency short-lived / Container long-lived + strong isolation&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Choosing between qwrap and Container is fundamentally a tradeoff between "light" and "complete":&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you want to "quickly blindfold a process" — choose qwrap&lt;/li&gt;
&lt;li&gt;If you want to "lock a process in an independent shipping container" — choose Container&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Understanding this distinction, you can make sound layered decisions when designing sandbox systems: use Containers to solve environment consistency, use qwrap to solve high-frequency isolated execution, and combine both to cover everything from CI to Agent Runtime.&lt;/p&gt;

</description>
      <category>sandbox</category>
      <category>container</category>
      <category>bubblewrap</category>
      <category>namespace</category>
    </item>
    <item>
      <title>Don't Rush to Clear History — Understanding KV Cache Will Change How You Think About LLM Conversation Strategy</title>
      <dc:creator>eyanpen</dc:creator>
      <pubDate>Tue, 09 Jun 2026 01:03:51 +0000</pubDate>
      <link>https://dev.to/eyanpen/dont-rush-to-clear-history-understanding-kv-cache-will-change-how-you-think-about-llm-3a89</link>
      <guid>https://dev.to/eyanpen/dont-rush-to-clear-history-understanding-kv-cache-will-change-how-you-think-about-llm-3a89</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Many people have an intuition when using LLMs: longer conversations mean more expensive tokens, so you should summarize and compress history early. When building Agent Loops, some merge multi-turn conversations into a single "stateless message" to save tokens. Both approaches seem clever but are actually anti-optimizations. This article explains from KV Cache principles why keeping the original history intact is the optimal strategy.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Most Common Misconception: Proactively Summarizing to Compress History
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Scenario
&lt;/h3&gt;

&lt;p&gt;You've chatted with an LLM for 20 turns, using 8K out of 128K in the context window. You start worrying: "Such a long history, sending it with every request — isn't that wasteful?"&lt;/p&gt;

&lt;p&gt;So you make an "optimization": have the LLM summarize the previous conversation into a digest, then start a new conversation with that digest.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Original conversation (20 turns, 8000 tokens):
  [system] [user_1] [asst_1] [user_2] [asst_2] ... [user_20] [asst_20]

"Optimized" (summary, 500 tokens):
  [system] [user: Here's a summary of the previous conversation: ...500 words...]  [user_21]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It looks like input dropped from 8000 tokens to 600, saving 93%?&lt;/p&gt;

&lt;h3&gt;
  
  
  Why This Is an Anti-Optimization
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. You Destroyed the KV Cache&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the original conversation, the KV for the first 19 turns was already computed and cached in GPU memory during the last request. When the 21st turn arrives:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Original approach:
  [system][user_1][asst_1]...[user_20][asst_20] ← all cache hits (0 computation)
  [user_21]                                      ← only compute this one (tens of tokens)

Summary approach:
  [system][summary...500 tokens][user_21]        ← entirely new content, full recomputation (550 tokens)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The original approach only needs to compute tens of tokens (the new message), while the summary approach computes 550 tokens. &lt;strong&gt;You created ten times the computational overhead to "save tokens."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The Summary Itself Is Extra Overhead&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When creating the summary, although the previous 8000 tokens are covered by cache (low compute cost), you still need the LLM to generate 500 tokens of summary output. More critically, these 500 summary tokens will be fully computed as new input in the new conversation (with zero cache). You essentially spent 500 tokens generating the summary, then another 500 tokens recomputing it — a net increase in overhead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Irreversible Information Loss&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When summarizing, you can't predict which details future conversation turns will need. The LLM might need a specific parameter from turn 3 at turn 30, but it was already lost during summarization.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Correct Mental Model
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Existing history = free (covered by KV Cache, 0 computation)
Only the new tail content = actual computational cost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An analogy: You're reading a 200-page book and have reached page 180. Each new page only requires reading 1 page. If you tear out the first 180 pages, write a one-page summary, then claim "I only need to read 1 page of summary" — but you only needed to read 1 new page anyway! The act of tearing the book wasted time.&lt;/p&gt;

&lt;h3&gt;
  
  
  When Should You Actually Summarize?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Only when you're truly approaching the context window limit.&lt;/strong&gt; For example, a 128K window has used 120K, and adding new messages would overflow — then you have no choice but to compress.&lt;/p&gt;

&lt;p&gt;But before that point (e.g., only using 10%~50%), keeping the original history intact is the optimal strategy. Don't fight against KV Cache.&lt;/p&gt;

&lt;h3&gt;
  
  
  Impact on API Billing
&lt;/h3&gt;

&lt;p&gt;You might say: "Even if cache hits, doesn't the API provider still charge by input token count?"&lt;/p&gt;

&lt;p&gt;In fact, major providers already offer significant discounts for cached tokens (far more than half off):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;New Input Token&lt;/th&gt;
&lt;th&gt;Cached Input Token&lt;/th&gt;
&lt;th&gt;Cache Discount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://openai.com/api/pricing/" rel="noopener noreferrer"&gt;OpenAI&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;GPT-5 Series&lt;/td&gt;
&lt;td&gt;$1.25&lt;/td&gt;
&lt;td&gt;$0.125&lt;/td&gt;
&lt;td&gt;90%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://openai.com/api/pricing/" rel="noopener noreferrer"&gt;OpenAI&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;GPT-4.1&lt;/td&gt;
&lt;td&gt;$2.00&lt;/td&gt;
&lt;td&gt;$0.50&lt;/td&gt;
&lt;td&gt;75%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://openai.com/api/pricing/" rel="noopener noreferrer"&gt;OpenAI&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;GPT-4.1 Mini&lt;/td&gt;
&lt;td&gt;$0.40&lt;/td&gt;
&lt;td&gt;$0.10&lt;/td&gt;
&lt;td&gt;75%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://www.anthropic.com/api" rel="noopener noreferrer"&gt;Anthropic&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Claude Sonnet 4.x&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;td&gt;$0.30&lt;/td&gt;
&lt;td&gt;90%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://www.anthropic.com/api" rel="noopener noreferrer"&gt;Anthropic&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Claude Opus 4.x&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;td&gt;$1.50&lt;/td&gt;
&lt;td&gt;90%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://www.anthropic.com/api" rel="noopener noreferrer"&gt;Anthropic&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Claude Haiku&lt;/td&gt;
&lt;td&gt;$0.80&lt;/td&gt;
&lt;td&gt;$0.08&lt;/td&gt;
&lt;td&gt;90%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://ai.google.dev/pricing" rel="noopener noreferrer"&gt;Google AI Studio&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Gemini 2.5 Pro&lt;/td&gt;
&lt;td&gt;$1.25&lt;/td&gt;
&lt;td&gt;$0.125&lt;/td&gt;
&lt;td&gt;90%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://ai.google.dev/pricing" rel="noopener noreferrer"&gt;Google AI Studio&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Gemini 2.5 Flash&lt;/td&gt;
&lt;td&gt;$0.15&lt;/td&gt;
&lt;td&gt;$0.015&lt;/td&gt;
&lt;td&gt;90%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://ai.google.dev/pricing" rel="noopener noreferrer"&gt;Google AI Studio&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Gemini 2.0 Flash&lt;/td&gt;
&lt;td&gt;$0.10&lt;/td&gt;
&lt;td&gt;$0.025&lt;/td&gt;
&lt;td&gt;75%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Chinese providers typically offer even more aggressive cache discounts, especially the DeepSeek series (cached token prices as low as 1/10 or even lower than new tokens).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This means:&lt;/strong&gt; At the API billing level, keeping the original history intact is equally economical. Suppose you have 8000 tokens of history:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keep as-is: 8000 × cached price (10~25% of full price) + new message × full price&lt;/li&gt;
&lt;li&gt;Replace with summary: 500 × full price (summary is new content, no cache) + new message × full price + summary generation output cost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On the surface 8000 → 500 seems like savings, but 8000 tokens at 10% pricing = equivalent to 800 tokens at full price. Adding summary output costs and information loss, the benefit is minimal or even negative.&lt;/p&gt;

&lt;p&gt;For &lt;strong&gt;self-deployed models (vLLM/TGI)&lt;/strong&gt;: There's no per-token billing; overhead purely depends on GPU computation. Here the advantage of keeping original history is overwhelming — cache hit = zero extra computation.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Same Problem in Agentic Loops
&lt;/h2&gt;

&lt;p&gt;The above misconception has a variant in Agent Loop design: merging multi-turn tool call history into a single "stateless message" to "save tokens." Let's analyze this with a concrete example.&lt;/p&gt;

&lt;h3&gt;
  
  
  Background
&lt;/h3&gt;

&lt;p&gt;In an Agentic RAG iterative search scenario, the Agent calls LLM each round to decide the next action (search, discard, finish). The LLM needs to know:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The user's original question&lt;/li&gt;
&lt;li&gt;Which tool calls were previously executed&lt;/li&gt;
&lt;li&gt;What evidence has been collected so far&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The question is: &lt;strong&gt;How do you pass this information to the LLM?&lt;/strong&gt; This is fundamentally the same question as "should you compress history."&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Approaches
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Approach A: Full Merge (Stateless Merge)
&lt;/h3&gt;

&lt;p&gt;Each time calling the LLM, compress all history into one or two user messages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_messages&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;msgs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="c1"&gt;# Merge all traces into one text
&lt;/span&gt;    &lt;span class="n"&gt;msgs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Executed tool calls]&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;trace_text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="c1"&gt;# Merge all evidence into one JSON
&lt;/span&gt;    &lt;span class="n"&gt;msgs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Current evidence]&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;evidence_json&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;msgs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Motivation&lt;/strong&gt;: Fewer messages, simpler structure, and omits the LLM's assistant replies from history (which may include verbose thinking/reasoning) — intuitively saving tokens.&lt;/p&gt;

&lt;h3&gt;
  
  
  Approach B: Standard Multi-Turn Conversation (Stateful Messages)
&lt;/h3&gt;

&lt;p&gt;Maintain the complete conversation structure, appending assistant tool_call + tool result each round:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;each&lt;/span&gt; &lt;span class="n"&gt;iteration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;...)&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# assistant with tool_calls
&lt;/span&gt;    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;execute_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_call_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;...})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  A Concrete Example
&lt;/h2&gt;

&lt;p&gt;Suppose the Agent runs 3 rounds, each tool returning ~500 tokens of evidence, with ~200 tokens of LLM reasoning per round.&lt;/p&gt;

&lt;h3&gt;
  
  
  Approach A: Input Tokens Across 3 Rounds
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Round 1: system(100) + user(50)                                     = 150
Round 2: system(100) + user(50) + trace(30) + evidence(500)         = 680
Round 3: system(100) + user(50) + trace(60) + evidence(1000)        = 1210
                                                        Total input = 2040
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every time it's entirely new content → &lt;strong&gt;KV Cache hit rate ≈ 0%&lt;/strong&gt; → all 2040 tokens require full GPU computation from scratch.&lt;/p&gt;

&lt;h3&gt;
  
  
  Approach B: Input Tokens Across 3 Rounds
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Round 1: system(100) + user(50)                                     = 150
Round 2: system(100) + user(50) + asst_1(200) + tool_1(500)         = 850
Round 3: system(100) + user(50) + asst_1(200) + tool_1(500)
         + asst_2(200) + tool_2(500)                                = 1550
                                                        Total input = 2550
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;More assistant messages (+400 tokens), but the key difference:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Round 2's first 150 tokens are identical to Round 1 → &lt;strong&gt;cache hit&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Round 3's first 850 tokens are identical to Round 2 → &lt;strong&gt;cache hit&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tokens actually needing computation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Round 1: 150 (full computation)
Round 2: 700 (first 150 cache hit, only compute new 700)
Round 3: 700 (first 850 cache hit, only compute new 700)
                                    Actual computation = 1550
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Comparison Table
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Approach A (Full Merge)&lt;/th&gt;
&lt;th&gt;Approach B (Standard Multi-Turn)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total input tokens&lt;/td&gt;
&lt;td&gt;2040&lt;/td&gt;
&lt;td&gt;2550&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;KV Cache hit rate&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;~60%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Actual GPU computation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2040&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1550&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM comprehension difficulty&lt;/td&gt;
&lt;td&gt;Higher (non-standard format)&lt;/td&gt;
&lt;td&gt;Low (native training format)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Conclusion: Approach A appears to have fewer tokens but actually requires more computation.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Deep Dive: Prefill, Decode, and KV Cache
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Two Phases of LLM Inference
&lt;/h3&gt;

&lt;p&gt;You've surely noticed: after the LLM receives input, the first token comes out slowly, but subsequent tokens stream quickly. This reflects the two phases:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Prefill&lt;/strong&gt;: Process all input tokens, computing Key and Value vectors for each token at every Transformer layer, storing them in the KV Cache. This is &lt;strong&gt;compute-intensive&lt;/strong&gt; — requiring full attention matrix operations on N tokens, with complexity O(N²).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Decode&lt;/strong&gt;: Generate output tokens one by one. For each new token generated, only its Query needs attention against existing Keys in the KV Cache, with complexity O(N). Then the new token's K and V are appended to the cache for the next token.&lt;/p&gt;

&lt;p&gt;An analogy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prefill = Reading an entire book and taking notes (time-consuming, corresponds to slow TTFT)&lt;/li&gt;
&lt;li&gt;Decode = Writing answers based on notes (relatively easy, corresponds to fast subsequent tokens)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the "pause then stream" you experience is the Prefill → Decode boundary.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Is KV Cache?
&lt;/h3&gt;

&lt;p&gt;The Self-Attention computation at each Transformer layer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tex"&gt;&lt;code&gt;Attention(Q, K, V) = softmax(Q × K&lt;span class="p"&gt;^&lt;/span&gt;T / √d) × V
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a model with 32 layers, Key dimension 128, and 32 attention heads (similar to LLaMA-7B), the KV Cache size for 1000 tokens:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;32 layers × 2(K and V) × 32 heads × 1000 tokens × 128 dims × 2 bytes(fp16)
≈ 512 MB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once computed, these K and V vectors can be repeatedly reused during the Decode phase when generating subsequent tokens — no need to recompute for historical tokens. This is the core value of KV Cache.&lt;/p&gt;

&lt;h3&gt;
  
  
  Decode Phase: Only One Token Computed Per Step
&lt;/h3&gt;

&lt;p&gt;During Decode, each step always computes Q/K/V for &lt;strong&gt;exactly 1 new token&lt;/strong&gt;. The new token's KV is directly appended to the next slot in the cache:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Block5 (capacity 16):
  slot 0: token_a's KV  ← already computed
  slot 1: token_b's KV  ← already computed
  slot 2: token_c's KV  ← new token, only compute this one, write here
  slot 3~15: empty
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A Block is the &lt;strong&gt;storage management&lt;/strong&gt; unit for KV Cache (similar to memory paging), not a computation unit. When a block isn't full, the new token's KV is directly written to the next slot in the same block without affecting existing values or requiring the entire block to be recomputed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cross-Request Prefix Caching
&lt;/h3&gt;

&lt;p&gt;Key insight: &lt;strong&gt;If two requests share the same prefix, the KV vectors for the prefix are identical and don't need recomputation.&lt;/strong&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Example: Standard Multi-Turn Conversation in Agent Loop
&lt;/h4&gt;

&lt;p&gt;Assume system prompt = "You are a search assistant", user question = "What is GraphRAG?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Round 1 request:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[system: You are a search assistant] [user: What is GraphRAG?]
 ←────────── 150 tokens ───────────→
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Prefill computes KV for 150 tokens → stored in cache, key = hash("You are a search assistant|What is GraphRAG?")&lt;/p&gt;

&lt;p&gt;LLM returns: call search({"query": "GraphRAG"})&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Round 2 request:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[system: You are a search assistant] [user: What is GraphRAG?] [asst: search(...)] [tool: Result A]
 ←──── identical to Round 1 ────→ ←────── new 700 tokens ──────→
 ←────────────────────── 850 tokens ──────────────────────────→
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The inference engine discovers: the hash of the first 150 tokens matches the cache!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Cached: KV for tokens 1~150 (directly reused, 0 computation)
To compute: KV for tokens 151~850 (only compute new 700 tokens)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Round 3 request:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[same 850 tokens above] [asst: search(...)] [tool: Result B]
 ←─ cache hit ─→ ←── new 700 ──→
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cache hits 850 tokens, only need to compute 700 tokens.&lt;/p&gt;

&lt;h4&gt;
  
  
  With the Full Merge Approach
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Round 2 request:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[system: You are a search assistant] [user: What is GraphRAG?] [user: [Executed tools]\n search→5 results] [user: [evidence]\n{...500 chars...}]
 ←──── same as Round 1 ────→ ←─────────── entirely new content ───────────────→
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;First 150 tokens match, the remaining 530 tokens are new content.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Round 3 request:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[system: You are a search assistant] [user: What is GraphRAG?] [user: [Executed tools]\n search→5 results\n search→3 results] [user: [evidence]\n{...1000 chars...}]
 ←──── same as Round 1 ────→ ←───────── content changed! ─────────────────→
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The third message's content changed from &lt;code&gt;"search→5 results"&lt;/code&gt; to &lt;code&gt;"search→5 results\n search→3 results"&lt;/code&gt; — &lt;strong&gt;cache is fully invalidated from this point&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Cache hit: 150 tokens (only system + user query)
To compute: 1060 tokens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compare with Approach B which only needs to compute 700 tokens in the same round. The gap accelerates with more iterations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Strict Sequential Nature of Prefix Matching
&lt;/h3&gt;

&lt;p&gt;Prefix caching is &lt;strong&gt;sequentially matched block by block from the beginning&lt;/strong&gt;. The reason is positional encoding in the attention mechanism — the same token at position 0 and position 16 has different KV values.&lt;/p&gt;

&lt;p&gt;This means: &lt;strong&gt;If new tokens are inserted at the beginning, the entire cache is invalidated and everything must be recomputed.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;In cache:    [block0][block1][block2][block3][block4]
New request: [new_block][block0'][block1'][block2'][block5][block6]
                ✗ → first block doesn't match, subsequent blocks can't be reused even if content is identical
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You cannot skip ahead to match later blocks — positions changed, so KV values changed.&lt;/p&gt;

&lt;p&gt;This also explains why &lt;strong&gt;placing the system prompt at the very beginning is beneficial&lt;/strong&gt; — it's the fixed prefix shared by all requests, ensuring the beginning portion always has cache hits.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prefix Caching Implementation Mechanism (vLLM)
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Block hashing&lt;/strong&gt;: Divide the token sequence into fixed-size blocks (e.g., 16 tokens), compute hash for each block's content&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sequential block matching&lt;/strong&gt;: When a new request arrives, compare hashes block by block from the start to find the longest matching prefix&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reuse KV Blocks&lt;/strong&gt;: Matched blocks directly reference cached KV data in GPU memory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Only compute the tail&lt;/strong&gt;: Start prefill from the first non-matching block
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Cached request:  [block0][block1][block2][block3][block4]
New request:     [block0][block1][block2][block5][block6]
                    ✓       ✓       ✓      ✗ → start computing from here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Visual Comparison
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Approach B (Standard Multi-Turn) — only compute new tail content each round

Round 1: [████████]                    compute 150
Round 2: [--------][██████████████]    compute 700  (first 150 cache hit)
Round 3: [--------------------][████]  compute 700  (first 850 cache hit)
                              Total computation = 1550

Approach A (Full Merge) — content changes from the 3rd message each round

Round 1: [████████]                    compute 150
Round 2: [--------][██████████████]    compute 530  (first 150 cache hit)
Round 3: [--------][████████████████]  compute 1060 (first 150 cache hit, rest all changed)
                              Total computation = 1740
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As rounds increase, Approach A's disadvantage accelerates.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hidden Costs of Approach A
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Decreased Model Comprehension
&lt;/h3&gt;

&lt;p&gt;The tool use format LLMs see during training is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;assistant: I'll search for... [tool_call: search({query: "..."})]
tool: [results...]
assistant: Based on results, I'll now... [tool_call: ...]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simulating this with plain text:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;user:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;Executed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;tool&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;calls&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;search(&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;→&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;results&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;search(&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;→&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;results&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model needs extra "cognitive overhead" to understand this non-standard format, potentially leading to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Repeating already-executed tool calls (because the structure isn't as clear as native format)&lt;/li&gt;
&lt;li&gt;Inability to correctly distinguish which information comes from tools vs. from the user&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Cannot Express Tool Failures
&lt;/h3&gt;

&lt;p&gt;In the standard approach, tool failures can be explicitly returned:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tool"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Error: timeout after 10s"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tool_call_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The LLM sees this and adjusts its strategy. In Approach A, you can only write &lt;code&gt;→ 0 results&lt;/code&gt;, and the LLM can't distinguish "no results found" from "search error."&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Loss of Parallel Tool Call Capability
&lt;/h3&gt;

&lt;p&gt;The standard format supports returning multiple tool_calls at once, and the inference engine knows they're parallel calls from the same round. Approach A's flat trace text cannot express this structure.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Does Approach A Have an Advantage?
&lt;/h2&gt;

&lt;p&gt;To be fair, there are a few scenarios where Approach A makes more sense:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The inference engine doesn't support prefix caching&lt;/strong&gt; (rare — mainstream engines all support it)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Each round's assistant reasoning is extremely long&lt;/strong&gt; (e.g., DeepSeek's thinking often exceeds 2000+ tokens), and you're certain this reasoning doesn't help subsequent decisions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-session recovery needed&lt;/strong&gt; — stateless design allows recovery from any intermediate state without depending on complete conversation history&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For point 2, a better approach is: maintain the standard multi-turn format, but &lt;strong&gt;truncate the reasoning portion&lt;/strong&gt; when appending historical assistant messages, keeping only the tool_call structure. This saves tokens while preserving cache and format advantages.&lt;/p&gt;

&lt;h2&gt;
  
  
  Recommended Implementation
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_iterations&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools_schema&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;assistant_msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;assistant_msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;

    &lt;span class="c1"&gt;# Append assistant message (optional: truncate reasoning to save tokens)
&lt;/span&gt;    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;assistant_msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_dump&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="c1"&gt;# Execute tools and append results
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tool_call&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;assistant_msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_call_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ensure_ascii&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple, standard, cache-friendly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Full Merge&lt;/th&gt;
&lt;th&gt;Standard Multi-Turn&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Token count&lt;/td&gt;
&lt;td&gt;Slightly fewer&lt;/td&gt;
&lt;td&gt;Slightly more&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Actual inference cost&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Higher&lt;/strong&gt; (no cache)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Lower&lt;/strong&gt; (high cache hit rate)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model comprehension accuracy&lt;/td&gt;
&lt;td&gt;Lower&lt;/td&gt;
&lt;td&gt;Good (native format)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engineering complexity&lt;/td&gt;
&lt;td&gt;Manual serialization needed&lt;/td&gt;
&lt;td&gt;Framework-native support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Observability&lt;/td&gt;
&lt;td&gt;Poor (lost structure)&lt;/td&gt;
&lt;td&gt;Good (each round is clear)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Don't sacrifice the enormous advantages of KV Cache and native format to save a few hundred tokens.&lt;/strong&gt; The apparent "optimization" is actually an anti-optimization — like disabling CPU cache to save memory, the cost far outweighs the benefit.&lt;/p&gt;

&lt;h2&gt;
  
  
  One-Line Summary
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Existing history is free; only new content costs. Don't destroy the cache yourself.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>kvcache</category>
      <category>llminferenceoptimization</category>
      <category>prefixcaching</category>
      <category>agenticloop</category>
    </item>
    <item>
      <title>The "Ghost Clone" of Community Reports in GraphRAG: Why the Same Report Gets Created Twice</title>
      <dc:creator>eyanpen</dc:creator>
      <pubDate>Tue, 26 May 2026 01:52:57 +0000</pubDate>
      <link>https://dev.to/eyanpen/the-ghost-clone-of-community-reports-in-graphrag-why-the-same-report-gets-created-twice-2k29</link>
      <guid>https://dev.to/eyanpen/the-ghost-clone-of-community-reports-in-graphrag-why-the-same-report-gets-created-twice-2k29</guid>
      <description>&lt;h2&gt;
  
  
  Symptom
&lt;/h2&gt;

&lt;p&gt;When querying the Top 10 nodes by &lt;code&gt;HAS_REPORT&lt;/code&gt; edge count in FalkorDB, we found 4 community_report nodes each with &lt;strong&gt;4&lt;/strong&gt; &lt;code&gt;HAS_REPORT&lt;/code&gt; edges pointing to them. By design, each community should map to exactly one report — so why the one-to-many relationship?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Edge type: HAS_REPORT
Rank  Title                                                          Count
1     Tech Dept Core Team: Backend Architecture &amp;amp; System Design        4
2     Product Dept: User Growth &amp;amp; Monetization Strategy                4
3     Ops Dept: Service Stability &amp;amp; Monitoring System                  4
4     QA Dept: Quality Assurance &amp;amp; Test Automation                     4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In theory each community has one report, each report belongs to one community, and &lt;code&gt;HAS_REPORT&lt;/code&gt; should be a 1:1 relationship.&lt;/p&gt;




&lt;h2&gt;
  
  
  An Intuitive Example
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Imagine You're Managing a Company's Org Chart
&lt;/h3&gt;

&lt;p&gt;Suppose your company has this department structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tech Dept (278 people)
  └── Backend Team (253 people)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;"Backend Team" is a sub-department of "Tech Dept." Now HR needs to write a &lt;strong&gt;department brief&lt;/strong&gt; for each.&lt;/p&gt;

&lt;p&gt;HR discovers that the core members of "Backend Team" heavily overlap with "Tech Dept" (the backend team IS the main force of the tech department), so the AI generates &lt;strong&gt;nearly identical briefs&lt;/strong&gt; for both:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Department&lt;/th&gt;
&lt;th&gt;Brief Title&lt;/th&gt;
&lt;th&gt;Headcount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tech Dept (community 1491)&lt;/td&gt;
&lt;td&gt;"Core Tech Team: Backend Architecture &amp;amp; System Design"&lt;/td&gt;
&lt;td&gt;278&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Backend Team (community 2790)&lt;/td&gt;
&lt;td&gt;"Core Tech Team: Backend Architecture &amp;amp; System Design"&lt;/td&gt;
&lt;td&gt;253&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The two briefs have &lt;strong&gt;identical titles and content&lt;/strong&gt; (because they essentially describe the same group of people), differing only in "headcount" (size).&lt;/p&gt;

&lt;p&gt;Because the content is identical, the system computes the &lt;strong&gt;same ID&lt;/strong&gt; for both (content-based hash).&lt;/p&gt;

&lt;p&gt;Mapping to the 4 actual problem groups we found:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dept Analogy&lt;/th&gt;
&lt;th&gt;Actual community&lt;/th&gt;
&lt;th&gt;Brief Title&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tech Dept&lt;/td&gt;
&lt;td&gt;community 1491&lt;/td&gt;
&lt;td&gt;"Tech Dept Core Team: Backend Architecture &amp;amp; System Design"&lt;/td&gt;
&lt;td&gt;278&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;└── Backend Team&lt;/td&gt;
&lt;td&gt;community 2790&lt;/td&gt;
&lt;td&gt;"Tech Dept Core Team: Backend Architecture &amp;amp; System Design"&lt;/td&gt;
&lt;td&gt;253&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Product Dept&lt;/td&gt;
&lt;td&gt;community 200&lt;/td&gt;
&lt;td&gt;"Product Dept: User Growth &amp;amp; Monetization Strategy"&lt;/td&gt;
&lt;td&gt;796&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;└── Product Team 1&lt;/td&gt;
&lt;td&gt;community 1100&lt;/td&gt;
&lt;td&gt;"Product Dept: User Growth &amp;amp; Monetization Strategy"&lt;/td&gt;
&lt;td&gt;631&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ops Dept&lt;/td&gt;
&lt;td&gt;community 1909&lt;/td&gt;
&lt;td&gt;"Ops Dept: Service Stability &amp;amp; Monitoring System"&lt;/td&gt;
&lt;td&gt;180&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;└── Ops Team 1&lt;/td&gt;
&lt;td&gt;community 3073&lt;/td&gt;
&lt;td&gt;"Ops Dept: Service Stability &amp;amp; Monitoring System"&lt;/td&gt;
&lt;td&gt;178&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;QA Dept&lt;/td&gt;
&lt;td&gt;community 953&lt;/td&gt;
&lt;td&gt;"QA Dept: Quality Assurance &amp;amp; Test Automation"&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;└── QA Team 1&lt;/td&gt;
&lt;td&gt;community 2343&lt;/td&gt;
&lt;td&gt;"QA Dept: Quality Assurance &amp;amp; Test Automation"&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Where's the Problem?
&lt;/h3&gt;

&lt;p&gt;When importing this data into the graph database:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Create report nodes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Taking "Tech Dept" and "Backend Team" as an example. The system sees two rows in the parquet with the same ID but different communities, and blindly creates &lt;strong&gt;two nodes&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Report Node A: {id: "abc123", community: 1491, size: 278}  -- Tech Dept's brief
Report Node B: {id: "abc123", community: 2790, size: 253}  -- Backend Team's brief
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2: Create HAS_REPORT edges&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The system iterates over each report record and matches report nodes by &lt;code&gt;id&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="o"&gt;--&lt;/span&gt; &lt;span class="n"&gt;Processing&lt;/span&gt; &lt;span class="n"&gt;Tech&lt;/span&gt; &lt;span class="n"&gt;Dept&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;community&lt;/span&gt; &lt;span class="mi"&gt;1491&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;c:&lt;/span&gt;&lt;span class="n"&gt;communities&lt;/span&gt; &lt;span class="ss"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;community&lt;/span&gt;&lt;span class="dl"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1491&lt;/span&gt;&lt;span class="ss"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;r:&lt;/span&gt;&lt;span class="n"&gt;community_reports&lt;/span&gt; &lt;span class="ss"&gt;{&lt;/span&gt;&lt;span class="py"&gt;id:&lt;/span&gt; &lt;span class="s2"&gt;"abc123"&lt;/span&gt;&lt;span class="ss"&gt;})&lt;/span&gt;  &lt;span class="o"&gt;--&lt;/span&gt; &lt;span class="n"&gt;Matches&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="nf"&gt;nodes&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:HAS_REPORT&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;--&lt;/span&gt; &lt;span class="py"&gt;Result:&lt;/span&gt; &lt;span class="n"&gt;Tech&lt;/span&gt; &lt;span class="n"&gt;Dept&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;Node&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Tech&lt;/span&gt; &lt;span class="n"&gt;Dept&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;Node&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="n"&gt;edges&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;

&lt;span class="o"&gt;--&lt;/span&gt; &lt;span class="n"&gt;Processing&lt;/span&gt; &lt;span class="n"&gt;Backend&lt;/span&gt; &lt;span class="n"&gt;Team&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;community&lt;/span&gt; &lt;span class="mi"&gt;2790&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;c:&lt;/span&gt;&lt;span class="n"&gt;communities&lt;/span&gt; &lt;span class="ss"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;community&lt;/span&gt;&lt;span class="dl"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2790&lt;/span&gt;&lt;span class="ss"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;r:&lt;/span&gt;&lt;span class="n"&gt;community_reports&lt;/span&gt; &lt;span class="ss"&gt;{&lt;/span&gt;&lt;span class="py"&gt;id:&lt;/span&gt; &lt;span class="s2"&gt;"abc123"&lt;/span&gt;&lt;span class="ss"&gt;})&lt;/span&gt;  &lt;span class="o"&gt;--&lt;/span&gt; &lt;span class="n"&gt;Also&lt;/span&gt; &lt;span class="n"&gt;matches&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:HAS_REPORT&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;--&lt;/span&gt; &lt;span class="py"&gt;Result:&lt;/span&gt; &lt;span class="n"&gt;Backend&lt;/span&gt; &lt;span class="n"&gt;Team&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;Node&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Backend&lt;/span&gt; &lt;span class="n"&gt;Team&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;Node&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="n"&gt;edges&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Final result&lt;/strong&gt;: This report title has 4 HAS_REPORT edges (2 departments × 2 same-ID nodes = 4).&lt;/p&gt;

&lt;p&gt;The correct result should be: Tech Dept → Tech Dept's brief (1 edge), Backend Team → Backend Team's brief (1 edge), totaling 2 edges.&lt;/p&gt;




&lt;h2&gt;
  
  
  Root Cause Analysis
&lt;/h2&gt;

&lt;p&gt;The problem is caused by two factors compounding:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Leiden Hierarchical Clustering Produces Identical Reports
&lt;/h3&gt;

&lt;p&gt;GraphRAG uses the Leiden algorithm for hierarchical community detection. When a sub-community's members heavily overlap with its parent community, the LLM generates nearly identical reports for both. Since report IDs are content-based hashes, identical content → identical IDs.&lt;/p&gt;

&lt;p&gt;Actual data verification:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;report id&lt;/th&gt;
&lt;th&gt;communities&lt;/th&gt;
&lt;th&gt;sizes&lt;/th&gt;
&lt;th&gt;Hierarchy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;6516e2f4...&lt;/td&gt;
&lt;td&gt;2790, 1491&lt;/td&gt;
&lt;td&gt;253, 278&lt;/td&gt;
&lt;td&gt;2790 is a sub-community of 1491&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;feda9fa0...&lt;/td&gt;
&lt;td&gt;1100, 200&lt;/td&gt;
&lt;td&gt;631, 796&lt;/td&gt;
&lt;td&gt;1100 is a sub-community of 200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;d8f25d09...&lt;/td&gt;
&lt;td&gt;2343, 953&lt;/td&gt;
&lt;td&gt;19, 21&lt;/td&gt;
&lt;td&gt;2343 is a sub-community of 953&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;223c76c6...&lt;/td&gt;
&lt;td&gt;3073, 1909&lt;/td&gt;
&lt;td&gt;178, 180&lt;/td&gt;
&lt;td&gt;3073 is a sub-community of 1909&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  2. Import Logic Lacks Deduplication and Precise Matching
&lt;/h3&gt;

&lt;p&gt;In the import code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Node creation: unconditional CREATE, no deduplication
&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UNWIND $batch AS p CREATE (n:community_reports) SET n = p&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Edge creation: matches only by id, no community condition
&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MATCH (r:community_reports {id: p.rid})&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# Matches multiple same-ID nodes → Cartesian product
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Solution
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Precise Matching When Creating HAS_REPORT
&lt;/h3&gt;

&lt;p&gt;When creating &lt;code&gt;HAS_REPORT&lt;/code&gt; edges, match on both &lt;code&gt;id&lt;/code&gt; and &lt;code&gt;community&lt;/code&gt; to avoid the Cartesian product:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before (buggy)
&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MATCH (r:community_reports {id: p.rid}) &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# After (fixed)
&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MATCH (r:community_reports {id: p.rid, community: p.cnum}) &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This way each community only matches the report node that belongs to it, creating exactly 1 edge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson learned&lt;/strong&gt;: When using the &lt;code&gt;MATCH&lt;/code&gt; + &lt;code&gt;CREATE&lt;/code&gt; pattern to create relationships in a graph database, if the match condition isn't precise enough (target nodes have duplicates), you'll get unexpected Cartesian products. Always ensure &lt;code&gt;MATCH&lt;/code&gt; conditions can uniquely locate the target node.&lt;/p&gt;

</description>
      <category>graphrag</category>
      <category>communityreport</category>
      <category>leidenalgorithm</category>
      <category>hierarchicalclustering</category>
    </item>
    <item>
      <title>Known Pitfall in DeepEval Faithfulness Metric: "idk" Verdicts Don't Penalize the Score</title>
      <dc:creator>eyanpen</dc:creator>
      <pubDate>Fri, 22 May 2026 02:29:23 +0000</pubDate>
      <link>https://dev.to/eyanpen/known-pitfall-in-deepeval-faithfulness-metric-idk-verdicts-dont-penalize-the-score-2l1p</link>
      <guid>https://dev.to/eyanpen/known-pitfall-in-deepeval-faithfulness-metric-idk-verdicts-dont-penalize-the-score-2l1p</guid>
      <description>&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;While using DeepEval to evaluate a GraphRAG system in a no-reference setting, we discovered that &lt;code&gt;FaithfulnessMetric&lt;/code&gt; can produce misleading perfect scores under certain conditions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Observed Behavior
&lt;/h2&gt;

&lt;p&gt;We asked GraphRAG a complex question about the 5GC PDU Session establishment procedure. The system returned a detailed technical answer (covering specific responsibilities of AMF, SMF, UPF, PCF, etc.), but the retrieved context contained only the &lt;strong&gt;table of contents&lt;/strong&gt; from 3GPP documents, such as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The document contains a section '5.6 Session Management' with several sub-subsections.
The document contains a section '5.2 Network Access Control' with several sub-subsections.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The context contained no substantive technical content, yet the Faithfulness score was &lt;strong&gt;1.00 (perfect)&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Root Cause Analysis
&lt;/h2&gt;

&lt;p&gt;The Faithfulness metric evaluation consists of 4 steps:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1. Truths extraction&lt;/td&gt;
&lt;td&gt;Extract factual statements from retrieval_context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2. Claims extraction&lt;/td&gt;
&lt;td&gt;Extract claims from actual_output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3. Verdicts&lt;/td&gt;
&lt;td&gt;Compare each claim against context, assign &lt;code&gt;yes&lt;/code&gt;/&lt;code&gt;no&lt;/code&gt;/&lt;code&gt;idk&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4. Score calculation&lt;/td&gt;
&lt;td&gt;Compute final score from verdicts&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key lies in Step 3's verdict rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;yes&lt;/code&gt;&lt;/strong&gt; — claim is consistent with context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;no&lt;/code&gt;&lt;/strong&gt; — claim &lt;strong&gt;directly contradicts&lt;/strong&gt; context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;idk&lt;/code&gt;&lt;/strong&gt; — context contains no relevant information to judge&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And Step 4's &lt;strong&gt;default scoring formula&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;no_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;code&gt;idk&lt;/code&gt; does not count as a penalty.&lt;/strong&gt; Only explicit contradictions (&lt;code&gt;no&lt;/code&gt;) reduce the score.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Example
&lt;/h2&gt;

&lt;p&gt;In our evaluation, the LLM judge (after switching to a stricter model) assigned &lt;code&gt;idk&lt;/code&gt; to all 20 claims:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"verdicts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"verdict"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"idk"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"verdict"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"idk"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;all&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;idk&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Score calculation: &lt;code&gt;score = (20 - 0) / 20 = 1.00&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The final reason output:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The score is 1.00 because there are no contradictions; the actual output fully aligns with the retrieval context."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is clearly misleading — none of the claims in the answer are &lt;strong&gt;supported by the context&lt;/strong&gt;, but since none are "contradicted" either, the score is perfect.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fundamental Issue
&lt;/h2&gt;

&lt;p&gt;Faithfulness measures &lt;strong&gt;"is there a contradiction with the context"&lt;/strong&gt;, not &lt;strong&gt;"is the answer supported by the context"&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;These are entirely different dimensions:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Faithfulness&lt;/th&gt;
&lt;th&gt;Groundedness&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Answer fully based on context&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Answer correct but context irrelevant&lt;/td&gt;
&lt;td&gt;High (no contradiction)&lt;/td&gt;
&lt;td&gt;Low (no support)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Answer contradicts context&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;When retrieval context contains only table-of-contents or summary-level information, it's nearly impossible for any specific claim to "directly contradict" it, so Faithfulness will always be perfect.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solutions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Solution 1: Enable &lt;code&gt;penalize_ambiguous_claims&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;DeepEval provides a built-in parameter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nc"&gt;FaithfulnessMetric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;penalize_ambiguous_claims&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this enabled, the scoring formula becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;no_count&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;idk_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now 20 claims all judged &lt;code&gt;idk&lt;/code&gt; yields: &lt;code&gt;(20 - 0 - 20) / 20 = 0.00&lt;/code&gt;, which more accurately reflects how well the context supports the answer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution 2: Add a Groundedness Metric
&lt;/h3&gt;

&lt;p&gt;Use GEval to define a custom Groundedness metric that directly evaluates whether the answer is supported by context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nc"&gt;GEval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Groundedness&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;criteria&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Determine whether the actual output is fully supported and grounded by the retrieval context. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
             &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Penalize claims in the output that cannot be traced back to specific information in the retrieval context.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;evaluation_params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;SingleTurnParams&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INPUT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SingleTurnParams&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ACTUAL_OUTPUT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SingleTurnParams&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RETRIEVAL_CONTEXT&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Recommendation
&lt;/h3&gt;

&lt;p&gt;Use both solutions together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keep Faithfulness (with &lt;code&gt;penalize_ambiguous_claims&lt;/code&gt; enabled) to detect contradictions and unsupported claims&lt;/li&gt;
&lt;li&gt;Add Groundedness to positively evaluate support coverage&lt;/li&gt;
&lt;li&gt;Note Faithfulness limitations in reports to avoid misinterpretation&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Additional Pitfall: Summary Claims Misjudged as "idk"
&lt;/h2&gt;

&lt;p&gt;Even when the context contains specific detailed information, if the actual output summarizes those details, the judge may still assign &lt;code&gt;idk&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real-World Example
&lt;/h3&gt;

&lt;p&gt;The context contained specific procedural details about PDU Session establishment (AMF handling registration, SMF selecting UPF, N4 session setup, etc.), while the actual output included a summary claim:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"From the UE attempting to access a specific DNN to achieving effective user plane forwarding, the entire process involves close cooperation among multiple core network elements, each playing an indispensable role."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The judge's verdict:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"verdict"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"idk"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The claim is a summary statement; the context provides specific procedural details but does not directly confirm this overall description."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Cause
&lt;/h3&gt;

&lt;p&gt;The Faithfulness prompt imposes strict constraints on the judge:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Only use 'no' if retrieval context DIRECTLY CONTRADICTS the claim — never use prior knowledge."&lt;br&gt;&lt;br&gt;
"Use 'idk' for claims not backed up by context — do not assume your knowledge."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The judge is required to perform &lt;strong&gt;literal-level matching&lt;/strong&gt;, not &lt;strong&gt;semantic-level reasoning&lt;/strong&gt;. Even though the context details fully support the summary through logical inference, since the context doesn't "directly confirm" the statement, the judge can only assign &lt;code&gt;idk&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Impact
&lt;/h3&gt;

&lt;p&gt;For RAG systems, answers are expected to synthesize and summarize context — this is normal and desired behavior. However, Faithfulness's literal-level matching treats such reasonable summaries as "unsupported," causing scores to drop when &lt;code&gt;penalize_ambiguous_claims&lt;/code&gt; is enabled.&lt;/p&gt;

&lt;h3&gt;
  
  
  Possible Improvement
&lt;/h3&gt;

&lt;p&gt;DeepEval's &lt;code&gt;FaithfulnessMetric&lt;/code&gt; supports an &lt;code&gt;evaluation_template&lt;/code&gt; parameter. You can inherit from &lt;code&gt;FaithfulnessTemplate&lt;/code&gt; and modify the verdict guidelines to include "summaries that can be reasonably inferred from context details" in the &lt;code&gt;yes&lt;/code&gt; category. However, this changes the semantics of the evaluation criteria and should be used cautiously.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The Faithfulness metric was designed to detect hallucination — whether the model fabricates information that contradicts the context. However, it has limitations on two levels:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;"idk" doesn't penalize by default&lt;/strong&gt; — always perfect when context is irrelevant (solved with &lt;code&gt;penalize_ambiguous_claims=True&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Literal-level matching is too strict&lt;/strong&gt; — reasonable summaries are judged as unsupported (requires custom templates or supplementary Groundedness metrics)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When evaluating RAG systems, both Faithfulness and Groundedness dimensions must be considered to comprehensively assess answer quality.&lt;/p&gt;

</description>
      <category>deepeval</category>
      <category>faithfulness</category>
      <category>ragevaluation</category>
      <category>evaluationmetrics</category>
    </item>
    <item>
      <title>How FalkorDB Stores Edges: Why Neighbor Lookup Is O(degree)</title>
      <dc:creator>eyanpen</dc:creator>
      <pubDate>Wed, 20 May 2026 07:35:07 +0000</pubDate>
      <link>https://dev.to/eyanpen/how-falkordb-stores-edges-why-neighbor-lookup-is-odegree-3013</link>
      <guid>https://dev.to/eyanpen/how-falkordb-stores-edges-why-neighbor-lookup-is-odegree-3013</guid>
      <description>&lt;p&gt;Many people have a question when they first see FalkorDB's architecture:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;It doesn't use traditional adjacency lists but maintains edges with sparse matrices — how does it efficiently find all edges of a given node?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And a follow-up question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If neighbor data is already stored contiguously, why is the query complexity still &lt;code&gt;O(degree)&lt;/code&gt; instead of &lt;code&gt;O(1)&lt;/code&gt;?&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h1&gt;
  
  
  1. How Traditional Graph Databases Store Edges
&lt;/h1&gt;

&lt;p&gt;Traditional graph databases (like Neo4j) typically use:&lt;/p&gt;

&lt;h1&gt;
  
  
  Adjacency List
&lt;/h1&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A -&amp;gt; B
A -&amp;gt; C
A -&amp;gt; D
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Internally it looks more like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A:
  edge1 -&amp;gt; edge2 -&amp;gt; edge3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each node maintains its own edge linked list&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;To find all edges of a node:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simply traverse the linked list&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Therefore the complexity is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;O(degree)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;degree = number of edges
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;out_degree&lt;/code&gt;&lt;br&gt;
Number of outgoing edges&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;in_degree&lt;/code&gt;&lt;br&gt;
Number of incoming edges&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  2. FalkorDB Is Completely Different: Sparse Matrix
&lt;/h1&gt;

&lt;p&gt;FalkorDB's core design is not an adjacency list.&lt;/p&gt;

&lt;p&gt;It is based on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sparse Matrix&lt;/li&gt;
&lt;li&gt;GraphBLAS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;to maintain the entire graph.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A(id=0) -&amp;gt; B(id=1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Internal representation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;M[0,1] = edge_id
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Meaning:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;source=0
target=1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An edge exists.&lt;/p&gt;




&lt;h1&gt;
  
  
  3. One Matrix Per Edge Type
&lt;/h1&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;:User&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:FRIEND&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;:User&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;:User&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:LIKES&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;:Post&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;FalkorDB maintains:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FRIEND matrix
LIKES matrix
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This way during traversal:&lt;/p&gt;

&lt;p&gt;No need to scan the entire graph.&lt;/p&gt;




&lt;h1&gt;
  
  
  4. How Multi-edges Are Maintained
&lt;/h1&gt;

&lt;p&gt;FalkorDB supports:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A -[:CALL]-&amp;gt; B
A -[:CALL]-&amp;gt; B
A -[:CALL]-&amp;gt; B
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Therefore a matrix cell cannot simply be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;M[0,1] = 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's more like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;M[0,1] = [3,8,15]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;edge ids
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Essentially similar to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;sparse tensor&lt;/li&gt;
&lt;li&gt;compressed adjacency structure&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  5. How to Efficiently Find Edges?
&lt;/h1&gt;

&lt;p&gt;Many people mistakenly think:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0 0 0 1 0 0 1 1 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Means:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You must scan the entire row to find the 1s.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's completely wrong.&lt;/p&gt;

&lt;p&gt;Because:&lt;/p&gt;

&lt;h1&gt;
  
  
  Sparse Matrix Doesn't Store Zeros At All
&lt;/h1&gt;




&lt;h1&gt;
  
  
  6. What Does a Sparse Matrix Actually Store?
&lt;/h1&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[0,0,0,1,0,0,1,1,0]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The actual storage looks more like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[3,6,7]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Meaning:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;index 3 has an edge
index 6 has an edge
index 7 has an edge
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Zeros don't exist at all.&lt;/p&gt;

&lt;p&gt;Therefore:&lt;/p&gt;

&lt;p&gt;Finding neighbors of node A:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;neighbors(A) = [3,6,7]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Return directly.&lt;/p&gt;




&lt;h1&gt;
  
  
  7. CSR / CSC: Industrial-Grade Sparse Matrix Structures
&lt;/h1&gt;

&lt;p&gt;Real implementations typically use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CSR (Compressed Sparse Row)&lt;/li&gt;
&lt;li&gt;CSC (Compressed Sparse Column)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;p&gt;Matrix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A: 0 0 0 1 0 0 1 1
B: 1 0 0 0 0 0 0 0
C: 0 1 0 0 1 0 0 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;CSR might store it as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;indices = [3,6,7,0,1,4]
row_ptr = [0,3,4,6]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Explanation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A's data is at &lt;code&gt;indices[0:3]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;B's data is at &lt;code&gt;indices[3:4]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;C's data is at &lt;code&gt;indices[4:6]&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So:&lt;/p&gt;

&lt;p&gt;Finding all edges of A:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;indices[0:3]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gives us:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[3,6,7]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h1&gt;
  
  
  8. Why Is the Complexity Still O(degree)?
&lt;/h1&gt;

&lt;p&gt;This is the most commonly misunderstood point.&lt;/p&gt;

&lt;p&gt;Many people ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Since &lt;code&gt;[3,6,7]&lt;/code&gt; is already in contiguous memory,&lt;br&gt;
isn't a direct memcpy just O(1)?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The answer:&lt;/p&gt;

&lt;h1&gt;
  
  
  Locating the array is O(1)
&lt;/h1&gt;

&lt;p&gt;But:&lt;/p&gt;

&lt;h1&gt;
  
  
  Traversing the array is still O(k)
&lt;/h1&gt;

&lt;p&gt;Where:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;k = degree
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h1&gt;
  
  
  9. What Does Algorithmic Complexity Actually Measure?
&lt;/h1&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The database doesn't just return:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;an array pointer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It must:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Traverse each edge&lt;/li&gt;
&lt;li&gt;Decode the edge object&lt;/li&gt;
&lt;li&gt;Construct the result set&lt;/li&gt;
&lt;li&gt;Return to the client&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Therefore:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for edge in neighbors:
    emit(edge)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Must execute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;degree times
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So the overall complexity is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;O(degree)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h1&gt;
  
  
  10. Output-sensitive Complexity
&lt;/h1&gt;

&lt;p&gt;This is a classic concept:&lt;/p&gt;

&lt;h1&gt;
  
  
  The size of the output itself counts toward complexity
&lt;/h1&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;p&gt;If:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A has 1 million edges
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even if:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;finding the array start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Only takes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;O(1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But:&lt;/p&gt;

&lt;p&gt;Returning 1 million edges:&lt;/p&gt;

&lt;p&gt;Cannot be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;O(1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because:&lt;/p&gt;

&lt;p&gt;You must at least "look at" each element.&lt;/p&gt;




&lt;h1&gt;
  
  
  11. Why Is FalkorDB Still Fast?
&lt;/h1&gt;

&lt;p&gt;Because:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[3,6,7]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Contiguous memory&lt;/li&gt;
&lt;li&gt;Cache-friendly&lt;/li&gt;
&lt;li&gt;SIMD-friendly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The CPU can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prefetch&lt;/li&gt;
&lt;li&gt;Vector load&lt;/li&gt;
&lt;li&gt;Branch prediction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While traditional adjacency lists:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;edge1 -&amp;gt; edge2 -&amp;gt; edge3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Involve:&lt;/p&gt;

&lt;h1&gt;
  
  
  Pointer chasing
&lt;/h1&gt;

&lt;p&gt;Which causes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cache misses&lt;/li&gt;
&lt;li&gt;Memory stalls&lt;/li&gt;
&lt;li&gt;Branch mispredictions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Therefore:&lt;/p&gt;

&lt;p&gt;FalkorDB has clear advantages in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High fan-out traversal&lt;/li&gt;
&lt;li&gt;Multi-hop pattern matching&lt;/li&gt;
&lt;li&gt;Graph analytics&lt;/li&gt;
&lt;li&gt;GraphRAG&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;scenarios.&lt;/p&gt;




&lt;h1&gt;
  
  
  12. Neo4j vs FalkorDB: The Essential Difference
&lt;/h1&gt;

&lt;p&gt;Neo4j is more like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nodes + edge linked lists
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Suited for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OLTP&lt;/li&gt;
&lt;li&gt;Single-hop queries&lt;/li&gt;
&lt;li&gt;High-frequency edge updates&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;FalkorDB is more like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;a graph computation engine
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Suited for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-hop traversal&lt;/li&gt;
&lt;li&gt;Pattern matching&lt;/li&gt;
&lt;li&gt;Graph analytics&lt;/li&gt;
&lt;li&gt;Vectorized computation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:F&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:F&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Neo4j:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pointer traversal
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;FalkorDB:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;matrix multiply
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;F × F
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is its biggest architectural difference.&lt;/p&gt;




&lt;h1&gt;
  
  
  13. Final Summary
&lt;/h1&gt;

&lt;p&gt;FalkorDB's core philosophy:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Don't store "empty"&lt;br&gt;
Only store "existing edges"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Therefore:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0 0 0 1 0 0 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Actually becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[3,6]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Querying all edges of a node:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Locating adjacency data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;O(1)&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Returning all edges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;O(degree)&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Where:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;degree = number of edges for the current node
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;total number of edges in the entire graph
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the core performance model of a Sparse Matrix graph database.&lt;/p&gt;




&lt;h1&gt;
  
  
  14. Does Splitting Edges Into Multiple Types vs. a Single Type Affect Query Speed?
&lt;/h1&gt;

&lt;p&gt;A common question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Since locating edges is O(1) and returning edges is O(degree),&lt;br&gt;
does categorizing edges into one type vs. multiple types affect query speed?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The answer: &lt;strong&gt;It depends on whether the query specifies an edge type.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  When the Query Specifies an Edge Type
&lt;/h2&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:FRIEND&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;FalkorDB only scans the &lt;code&gt;FRIEND&lt;/code&gt; matrix.&lt;/p&gt;

&lt;p&gt;If all edges are categorized as a single type (e.g., &lt;code&gt;:REL&lt;/code&gt;), the matrix contains all edges, making the degree larger.&lt;/p&gt;

&lt;p&gt;Multiple types = smaller matrices = less traversal = &lt;strong&gt;faster&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  When the Query Does Not Specify an Edge Type
&lt;/h2&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;FalkorDB needs to merge results from multiple matrices.&lt;/p&gt;

&lt;p&gt;In this case:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Total traversal volume is the same (total degree)&lt;/li&gt;
&lt;li&gt;Multiple types have slight merge overhead&lt;/li&gt;
&lt;li&gt;Single type traverses one matrix directly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The difference is minimal, approximately &lt;strong&gt;no impact&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Single Type vs. Multiple Types&lt;/th&gt;
&lt;th&gt;Impact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Query specifies edge type&lt;/td&gt;
&lt;td&gt;Multiple types faster&lt;/td&gt;
&lt;td&gt;Only scans the corresponding matrix, smaller degree&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query does not specify edge type&lt;/td&gt;
&lt;td&gt;Nearly no difference&lt;/td&gt;
&lt;td&gt;Same total degree, slight merge overhead with multiple types&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Practical modeling recommendation:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Splitting into multiple types is the better practice.&lt;br&gt;
Most real-world queries specify a relationship type, and splitting types significantly reduces the number of edges that need to be traversed.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>falkordb</category>
      <category>sparsematrix</category>
      <category>graphdatabase</category>
      <category>graphblas</category>
    </item>
    <item>
      <title>Orphan Communities in GraphRAG Hierarchical Clustering: Why Some Communities Have No PARENT_OF Edges</title>
      <dc:creator>eyanpen</dc:creator>
      <pubDate>Wed, 20 May 2026 04:33:03 +0000</pubDate>
      <link>https://dev.to/eyanpen/orphan-communities-in-graphrag-hierarchical-clustering-why-some-communities-have-no-parentof-edges-5a7b</link>
      <guid>https://dev.to/eyanpen/orphan-communities-in-graphrag-hierarchical-clustering-why-some-communities-have-no-parentof-edges-5a7b</guid>
      <description>&lt;h2&gt;
  
  
  The Phenomenon
&lt;/h2&gt;

&lt;p&gt;After building a knowledge graph with GraphRAG, you query a community node and discover it has no &lt;code&gt;PARENT_OF&lt;/code&gt; relationships — neither a parent nor any children. Yet the graph clearly contains many &lt;code&gt;PARENT_OF&lt;/code&gt; edges. Why was this community "forgotten"?&lt;/p&gt;




&lt;h2&gt;
  
  
  Background: GraphRAG's Hierarchical Community Structure
&lt;/h2&gt;

&lt;p&gt;GraphRAG uses the Leiden algorithm to perform &lt;strong&gt;hierarchical clustering&lt;/strong&gt; on the entity graph. To make this intuitive, let's use a "world map" analogy to explain the entire process.&lt;/p&gt;

&lt;h3&gt;
  
  
  Imagine You're Grouping Everyone in the World
&lt;/h3&gt;

&lt;p&gt;Suppose you have a massive social network graph where each node is a person and edges represent "these two people are connected." Now you need to group them:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Level 0 (coarsest granularity)&lt;/strong&gt;: First divide by the largest circles — equivalent to splitting everyone into "continents." People within the same continent are closely connected; connections between continents are sparse.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Level 1&lt;/strong&gt;: Further divide within each continent — equivalent to splitting into "countries."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Level 2&lt;/strong&gt;: Divide within each country — equivalent to "provinces/states."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Level 3, 4, ...&lt;/strong&gt;: Continue dividing into "cities," "neighborhoods"...&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The higher the level, the finer the granularity.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Each layer connects to the next through &lt;code&gt;PARENT_OF&lt;/code&gt; edges (coarse → fine):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Continent ──PARENT_OF──&amp;gt; Country ──PARENT_OF──&amp;gt; Province ──PARENT_OF──&amp;gt; City
(level 0)              (level 1)             (level 2)              (level 3)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  A Complete Example
&lt;/h2&gt;

&lt;p&gt;Suppose we run GraphRAG hierarchical clustering on a "Global Cuisine Knowledge Graph." The entities are various ingredients, dishes, and cooking techniques, with edges representing their associations.&lt;/p&gt;

&lt;h3&gt;
  
  
  First Round of Clustering (Level 0): 5 Major Groups
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Community&lt;/th&gt;
&lt;th&gt;Representative Entities&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Continent A "Asian Cuisine"&lt;/td&gt;
&lt;td&gt;Rice, soy sauce, wok, tofu, miso...&lt;/td&gt;
&lt;td&gt;800&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Continent B "European Cuisine"&lt;/td&gt;
&lt;td&gt;Olive oil, cheese, bread, red wine, butter...&lt;/td&gt;
&lt;td&gt;600&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Continent C "American Cuisine"&lt;/td&gt;
&lt;td&gt;Corn, chili peppers, avocado, BBQ...&lt;/td&gt;
&lt;td&gt;400&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Continent D "African Cuisine"&lt;/td&gt;
&lt;td&gt;Cassava, peanut sauce, couscous...&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Continent E "Antarctic Research Station Cafeteria"&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Canned food, hardtack, instant coffee&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Second Round of Clustering (Level 1): Subdividing Within Groups
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Continent A "Asian Cuisine" (800 entities)&lt;/strong&gt; has complex internal structure and can be further divided:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Continent A "Asian Cuisine" (level 0, size=800)
  ├── PARENT_OF → Country A1 "Chinese Cuisine" (level 1, size=300)
  │     ├── PARENT_OF → Province A1a "Sichuan Cuisine" (level 2, size=80)
  │     ├── PARENT_OF → Province A1b "Cantonese Cuisine" (level 2, size=70)
  │     └── PARENT_OF → Province A1c "Shandong Cuisine" (level 2, size=50)
  ├── PARENT_OF → Country A2 "Japanese Cuisine" (level 1, size=200)
  ├── PARENT_OF → Country A3 "Southeast Asian Cuisine" (level 1, size=150)
  └── PARENT_OF → Country A4 "Korean Cuisine" (level 1, size=100)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What about Continent E "Antarctic Research Station Cafeteria" (3 entities)?&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Continent E "Antarctic Research Station Cafeteria" (level 0, size=3)
  ├── Canned food
  ├── Hardtack
  └── Instant coffee

  (That's it — no outgoing PARENT_OF edges)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The relationships among these 3 entities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Canned food ↔ Hardtack (both are long-shelf-life foods)&lt;/li&gt;
&lt;li&gt;Canned food ↔ Instant coffee (both are ready-to-eat items)&lt;/li&gt;
&lt;li&gt;Hardtack ↔ Instant coffee (both are research station staples)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They're closely related, so they're grouped together. But with only 3 members — you can't split 3 people into "departments" and "teams." That would be absurd.&lt;/p&gt;

&lt;p&gt;Meanwhile, Continent E's external connections are extremely sparse — only "canned food" has one weak link to Continent B's "canned olive oil." This connection is too weak for the algorithm to merge Continent E into Continent B.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result: Continent E becomes an orphan — it can neither be subdivided downward nor merged into another group.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Do Orphans Occur? Two Conditions Must Be Met Simultaneously
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                    ┌─────────────────────────┐
                    │  Community too small     │
                    │  (2~9 entities)          │
                    │  Cannot subdivide further│
                    └───────────┬─────────────┘
                                │
                                ▼
                    ┌─────────────────────────┐
                    │  Becomes an orphan       │
                    │  Community               │
                    │  No PARENT_OF edges      │
                    └───────────┬─────────────┘
                                │
                    ┌───────────┴─────────────┐
                    │  Extremely weak external │
                    │  connections             │
                    │  (1~2 cross-group edges) │
                    │  Not worth merging into  │
                    │  another group           │
                    └─────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Leiden algorithm's criterion is &lt;strong&gt;modularity&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Subdivide downward&lt;/strong&gt;: Split 3 people into 2 groups? Each group would have 1-2 people — no statistical significance, modularity won't improve. Abandoned.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Merge into others&lt;/strong&gt;: Only 1 weak connection to the nearest large group; forcing a merge would reduce that group's cohesion. Abandoned.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Data Speaks
&lt;/h2&gt;

&lt;p&gt;Returning to real GraphRAG data, the statistics perfectly confirm this pattern:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Orphan communities (no PARENT_OF edges):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Community&lt;/th&gt;
&lt;th&gt;Size (entity count)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Orphan 1&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Orphan 2&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Orphan 3&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Orphan 4&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Orphan 5&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Normal communities (have PARENT_OF edges, participate in hierarchical subdivision):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Community&lt;/th&gt;
&lt;th&gt;Size (entity count)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Normal 1&lt;/td&gt;
&lt;td&gt;2,511&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Normal 2&lt;/td&gt;
&lt;td&gt;2,330&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Normal 3&lt;/td&gt;
&lt;td&gt;1,571&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Normal 4&lt;/td&gt;
&lt;td&gt;688&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Normal 5&lt;/td&gt;
&lt;td&gt;685&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The pattern is crystal clear: &lt;strong&gt;the larger the size, the more likely it participates in the hierarchy; the smaller the size, the more likely it becomes an orphan.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In one real knowledge graph, level 0 had 41 communities total — 23 participated normally in hierarchical subdivision, while 18 became orphans. All orphans had sizes between 2 and 9.&lt;/p&gt;




&lt;h2&gt;
  
  
  Impact on GraphRAG Queries
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Global Search
&lt;/h3&gt;

&lt;p&gt;Global Search traverses community reports at a certain level to answer questions. If it chooses to traverse level 1 reports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Normal communities' information appears in level 1 sub-community reports&lt;/li&gt;
&lt;li&gt;❌ Orphan communities have no level 1 sub-communities; their information &lt;strong&gt;won't appear in any level 1+ reports&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Analogy: If you only read "country-level" reports, the Antarctic research station cafeteria's information won't appear in any country's report — because it doesn't belong to any country.&lt;/p&gt;

&lt;h3&gt;
  
  
  Local Search
&lt;/h3&gt;

&lt;p&gt;Local Search finds relevant entities directly through entity vector matching, independent of the hierarchical structure. So entities within orphan communities can still be retrieved by Local Search.&lt;/p&gt;

&lt;h3&gt;
  
  
  Practical Impact
&lt;/h3&gt;

&lt;p&gt;Since orphan communities are very small (2-9 entities) and contain limited information, their impact on most queries is minimal. But if your query happens to involve this "edge knowledge," you should be aware of this blind spot.&lt;/p&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Normal Community&lt;/th&gt;
&lt;th&gt;Orphan Community&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Size&lt;/td&gt;
&lt;td&gt;Tens to thousands&lt;/td&gt;
&lt;td&gt;2~9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Analogy&lt;/td&gt;
&lt;td&gt;Continents/Countries/Provinces (large populations)&lt;/td&gt;
&lt;td&gt;Antarctic research station (3 people)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Internal structure&lt;/td&gt;
&lt;td&gt;Complex, can be subdivided layer by layer&lt;/td&gt;
&lt;td&gt;Too simple, cannot be subdivided&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;External connections&lt;/td&gt;
&lt;td&gt;Extensive interactions with other groups&lt;/td&gt;
&lt;td&gt;Almost isolated from the outside&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PARENT_OF edges&lt;/td&gt;
&lt;td&gt;Yes (pointing to finer sub-communities)&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Global Search visibility&lt;/td&gt;
&lt;td&gt;Information propagates through reports at all levels&lt;/td&gt;
&lt;td&gt;Only visible in level 0 reports&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The Leiden hierarchical clustering algorithm's behavior&lt;/strong&gt; is just like the real world, where the Antarctic research station truly doesn't belong to any country's administrative division — it's too small and too isolated; forcing it into some country would be unreasonable. The algorithm makes the same judgment: &lt;strong&gt;communities too small cannot be further subdivided, and communities with connections too weak to the outside won't be forcibly merged.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>graphrag</category>
      <category>leidenalgorithm</category>
      <category>communitydetection</category>
      <category>hierarchicalclustering</category>
    </item>
    <item>
      <title>GraphRAG Local Search Text Unit Selection Strategy: Design Trade-offs and Improvement Directions</title>
      <dc:creator>eyanpen</dc:creator>
      <pubDate>Fri, 15 May 2026 00:49:08 +0000</pubDate>
      <link>https://dev.to/eyanpen/graphrag-local-search-text-unit-selection-strategy-design-trade-offs-and-improvement-directions-16c4</link>
      <guid>https://dev.to/eyanpen/graphrag-local-search-text-unit-selection-strategy-design-trade-offs-and-improvement-directions-16c4</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;GraphRAG's Local Search needs to select the most relevant raw text fragments (Text Units) associated with the knowledge graph to fill the LLM context window during query time. This selection strategy seems simple — sort by entity similarity, fill one by one — but in real-world scenarios it exposes a significant limitation: &lt;strong&gt;popular entities can monopolize the entire Text Unit budget, causing key text from other entities to be truncated&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This article provides an in-depth analysis of the root cause of this problem, the core problem it was designed to solve, and possible improvement directions.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is the Current Strategy
&lt;/h2&gt;

&lt;p&gt;Local Search's Text Unit selection has four steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Iterate through selected entities (ranked by vector similarity), collecting each entity's associated &lt;code&gt;text_unit_ids&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Deduplication: each TU is attributed only to the first entity encountered&lt;/li&gt;
&lt;li&gt;Sorting: by &lt;code&gt;(entity_index, -num_relationships)&lt;/code&gt; — entity order takes priority, within the same entity sorted by relationship density in descending order&lt;/li&gt;
&lt;li&gt;Fill into context one by one until reaching the token limit (default 50% of total budget, approximately 6000 tokens)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Core code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;entity&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;selected_entities&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;entity_relationships&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;rel&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;rel&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;relationships&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;entity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;entity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;text_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;entity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text_unit_ids&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;[]:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;text_id&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;text_unit_ids_set&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;text_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text_units&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;num_relationships&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;count_relationships&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entity_relationships&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text_units&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;text_id&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="n"&gt;text_unit_ids_set&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;unit_info_list&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text_units&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;text_id&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_relationships&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;unit_info_list&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Problem Scenario: Popular Entities Monopolize the Budget
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Concrete Example
&lt;/h3&gt;

&lt;p&gt;Suppose the user asks: "What is the anti-inflammatory mechanism of chamazulene?"&lt;/p&gt;

&lt;p&gt;Entities returned by vector search:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Entity&lt;/th&gt;
&lt;th&gt;Associated TU Count&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Chamomile&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;High-frequency entity, mentioned in almost all herbal documents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Chamazulene&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Active component of chamomile, fewer specialized references&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;NF-κB pathway&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Specific anti-inflammatory molecular mechanism&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;TU attribution after deduplication:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;index 0 "Chamomile": TU1, TU2, TU3, ..., TU50  (50 items)
index 1 "Chamazulene": TU51, TU52              (TU1, TU5 already claimed by Chamomile)
index 2 "NF-κB":  TU53                    (only 1 unclaimed)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sorting result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TU1(index=0, rel=5) → TU2(index=0, rel=4) → ... → TU50(index=0, rel=0)
→ TU51(index=1, rel=2) → TU52(index=1, rel=1)
→ TU53(index=2, rel=1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Assuming a token budget of 6000 tokens and each TU averaging 300 tokens, only about 20 TUs can fit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: All top 20 positions are occupied by "Chamomile" TUs. The text about "chamazulene's anti-inflammatory mechanism" that the user actually cares about (TU51, TU52, TU53) is entirely truncated. The context fed to the LLM is filled with generic introductions about "Chamomile" but contains no original text supporting chamazulene's specific molecular mechanisms.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why It Was Designed This Way: What Problem It Solves
&lt;/h2&gt;

&lt;p&gt;This strategy was not designed arbitrarily — it solves a more fundamental problem: &lt;strong&gt;ensuring that the most semantically relevant entities receive the most comprehensive original text support&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Scenario It Addresses
&lt;/h3&gt;

&lt;p&gt;Suppose the user asks: "What is the status of chamomile in European traditional medicine?"&lt;/p&gt;

&lt;p&gt;Vector search returns:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Entity&lt;/th&gt;
&lt;th&gt;Associated TU Count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Chamomile&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;European Herbalism&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Lavender&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In this scenario, "Chamomile" is indeed the most core entity — the user is asking about it. If a round-robin strategy were used (taking 1 TU from each entity in turn), then "Lavender's" 30 TUs would split the budget equally with "Chamomile" — but the user never asked about lavender.&lt;/p&gt;

&lt;p&gt;The advantages of the current strategy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Respects semantic ranking&lt;/strong&gt;: The entity with the highest vector similarity gets the most original text support, which is correct in most cases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relationship density sorting ensures quality&lt;/strong&gt;: Among multiple TUs for the same entity, the most information-dense ones come first&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deduplication avoids redundancy&lt;/strong&gt;: The same TU won't appear repeatedly because it's associated with multiple entities&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Core Trade-off
&lt;/h3&gt;

&lt;p&gt;This is a classic &lt;strong&gt;relevance depth vs. coverage breadth&lt;/strong&gt; trade-off:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The current strategy chooses &lt;strong&gt;depth&lt;/strong&gt;: ensuring the most relevant entity has sufficient original text evidence&lt;/li&gt;
&lt;li&gt;The cost is &lt;strong&gt;breadth&lt;/strong&gt;: secondary entities may have no original text support at all&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For most "questions about a specific entity" (the design target of Local Search), depth-first is reasonable. The problem emerges when queries involve cross-entity relationships.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Essence of the Problem: A Single Sorting Dimension Cannot Express Multi-Objective Optimization
&lt;/h2&gt;

&lt;p&gt;Text Unit selection is fundamentally a multi-objective optimization problem:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Relevance&lt;/strong&gt;: The semantic relevance of a TU to the query (expressed indirectly through entity ranking)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Information density&lt;/strong&gt;: The number of relationships contained in a TU&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coverage&lt;/strong&gt;: Ensuring every selected entity has original text support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Diversity&lt;/strong&gt;: Avoiding homogeneous content flooding the context&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The current strategy uses a single tuple &lt;code&gt;(entity_index, -num_relationships)&lt;/code&gt; attempting to optimize the first two objectives simultaneously, but completely ignores the latter two.&lt;/p&gt;




&lt;h2&gt;
  
  
  Improvement Directions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Approach 1: Per-Entity Cap
&lt;/h3&gt;

&lt;p&gt;The simplest improvement — set a TU contribution cap for each entity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;MAX_TU_PER_ENTITY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;entity&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;selected_entities&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;text_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;entity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text_unit_ids&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;[]:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;MAX_TU_PER_ENTITY&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;text_id&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;text_unit_ids_set&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;text_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text_units&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# ... addition logic unchanged
&lt;/span&gt;            &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;: Simple to implement, guarantees each entity at least has a chance to contribute TUs&lt;br&gt;
&lt;strong&gt;Cons&lt;/strong&gt;: Cap value is hard to determine; if an entity genuinely needs extensive original text support, it gets artificially limited&lt;/p&gt;
&lt;h3&gt;
  
  
  Approach 2: Round-Robin
&lt;/h3&gt;

&lt;p&gt;Each round takes 1 TU from each entity (selecting the best by relationship density), cycling until the budget is exhausted:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;entity_queues&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sorted_tus_for_entity_i&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;selected_entities&lt;/span&gt;&lt;span class="p"&gt;))}&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;budget&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entity_queues&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;()):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;selected_entities&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;entity_queues&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="n"&gt;tu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;entity_queues&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tu&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;budget&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="nf"&gt;token_count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tu&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;: Guarantees coverage, every entity has original text support&lt;br&gt;
&lt;strong&gt;Cons&lt;/strong&gt;: Depth of the most relevant entity is diluted; lower-ranked irrelevant entities also receive equal budget&lt;/p&gt;
&lt;h3&gt;
  
  
  Approach 3: Weighted Quota Allocation
&lt;/h3&gt;

&lt;p&gt;Allocate TU quotas based on entity vector similarity scores:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Assuming similarity scores: [0.95, 0.82, 0.71]
&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.82&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.71&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;quotas&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_tus&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="c1"&gt;# quotas ≈ [15, 13, 11] (assuming max_tus=39)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;: Balances depth and breadth; higher-relevance entities get more quota without monopolizing&lt;br&gt;
&lt;strong&gt;Cons&lt;/strong&gt;: Increased implementation complexity; requires preserving similarity scores from vector search results (not retained in current code)&lt;/p&gt;
&lt;h3&gt;
  
  
  Approach 4: Minimum Guarantee + Remaining Competition
&lt;/h3&gt;

&lt;p&gt;Guarantee each entity at least N TUs (e.g., 2), with remaining budget competed for using the current strategy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Phase 1: Guarantee 2 best TUs per entity
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;entity&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;selected_entities&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;guaranteed_tus&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;top_2_by_relationship_density&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;guaranteed_tus&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Phase 2: Fill remaining budget using original sorting strategy
&lt;/span&gt;&lt;span class="n"&gt;remaining&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;all_tus&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;guaranteed_tus&lt;/span&gt;
&lt;span class="n"&gt;remaining&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;entity_index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;num_relationships&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nf"&gt;fill_until_budget&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;remaining&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;: Guarantees coverage while preserving the depth advantage of the original strategy&lt;br&gt;
&lt;strong&gt;Cons&lt;/strong&gt;: If many entities are selected, the guarantee phase may consume significant budget&lt;/p&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Current Strategy&lt;/th&gt;
&lt;th&gt;Issue&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Relevance depth&lt;/td&gt;
&lt;td&gt;✅ Excellent&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Information density&lt;/td&gt;
&lt;td&gt;✅ Excellent&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Coverage breadth&lt;/td&gt;
&lt;td&gt;❌ Missing&lt;/td&gt;
&lt;td&gt;Popular entities monopolize budget&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Content diversity&lt;/td&gt;
&lt;td&gt;❌ Missing&lt;/td&gt;
&lt;td&gt;Homogenization risk&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;GraphRAG's current Text Unit selection strategy is a "depth-first" design that performs well for "questions about a single entity" scenarios, but exposes insufficient coverage when queries involve multi-entity cross-relationships.&lt;/p&gt;

&lt;p&gt;The most pragmatic improvement is &lt;strong&gt;Approach 4 (Minimum Guarantee + Remaining Competition)&lt;/strong&gt; — it guarantees that every selected entity has at least some original text support with minimal code changes, without breaking the original strategy's advantages in mainstream scenarios.&lt;/p&gt;

</description>
      <category>graphrag</category>
      <category>localsearch</category>
      <category>textunit</category>
      <category>contextwindow</category>
    </item>
    <item>
      <title>Why Gold Answers Are Becoming Less Important in GraphRAG Systems</title>
      <dc:creator>eyanpen</dc:creator>
      <pubDate>Tue, 12 May 2026 08:10:41 +0000</pubDate>
      <link>https://dev.to/eyanpen/why-gold-answers-are-becoming-less-important-in-graphrag-systems-jbn</link>
      <guid>https://dev.to/eyanpen/why-gold-answers-are-becoming-less-important-in-graphrag-systems-jbn</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Traditional RAG evaluation relies on human-annotated "standard answers," but in the GraphRAG era, this approach is losing its relevance.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What Is a Gold Answer?
&lt;/h2&gt;

&lt;p&gt;A Gold Answer is a human-annotated "standard correct answer." In traditional NLP and RAG system evaluation, the process typically goes like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Prepare a batch of test questions&lt;/li&gt;
&lt;li&gt;Have humans write the "correct answer" for each question&lt;/li&gt;
&lt;li&gt;Let the system answer the same questions&lt;/li&gt;
&lt;li&gt;Compare system answers against Gold Answers, calculating F1, BLEU, ROUGE, and other scores&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This approach has worked for years in search engines and simple Q&amp;amp;A systems. But in complex systems like GraphRAG, the value of Gold Answers is declining rapidly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Knowledge Graphs Evolve Continuously — Gold Answers Can't Keep Up
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Core Problem
&lt;/h3&gt;

&lt;p&gt;The heart of GraphRAG is the knowledge graph. Graphs aren't static — every document update, every re-extraction of entities and relationships changes the graph. Today's "correct answer" might be outdated tomorrow.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example
&lt;/h3&gt;

&lt;p&gt;Suppose your company has an internal technical architecture document:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;January version&lt;/strong&gt;: The document states "the order service uses MySQL"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;March version&lt;/strong&gt;: After an architecture upgrade, it now reads "the order service uses PostgreSQL + Redis cache"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Gold Answer you annotated in January is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Q: What database does the order service use?&lt;br&gt;&lt;br&gt;
A: MySQL&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;By March, the GraphRAG system has re-indexed the new documents and correctly answers "PostgreSQL + Redis." But if you still evaluate against the January Gold Answer, the system gets marked as "wrong."&lt;/p&gt;

&lt;h3&gt;
  
  
  A More Realistic Scenario
&lt;/h3&gt;

&lt;p&gt;In enterprise environments, document update frequency is much higher than most people imagine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API documentation changes weekly&lt;/li&gt;
&lt;li&gt;Organizational structures are adjusted quarterly&lt;/li&gt;
&lt;li&gt;Technology choices may be overhauled every six months&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After each document update, you need to re-annotate Gold Answers. For an evaluation set with 500 test questions, each update might require modifying 30% of the answers — that means re-reviewing 150 answers every time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Human Annotation of Gold Answers Is Extremely Costly and Unreliable
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Core Problem
&lt;/h3&gt;

&lt;p&gt;The questions GraphRAG handles often involve multi-hop reasoning and cross-document correlation. For such questions, even human experts struggle to provide a single "uniquely correct" answer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example
&lt;/h3&gt;

&lt;p&gt;Suppose the question is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Among the projects Zhang San is responsible for, which ones use EOL (End of Life) technology stacks?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To answer this, annotators need to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Find which projects Zhang San is responsible for (possibly scattered across 5 documents)&lt;/li&gt;
&lt;li&gt;Find the technology stack for each project (yet more documents)&lt;/li&gt;
&lt;li&gt;Determine which stacks are EOL (requires external information)&lt;/li&gt;
&lt;li&gt;Synthesize all the above into an answer&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Suppose the ground truth is that Zhang San is responsible for 4 projects, 3 of which use EOL tech stacks. After an hour of document review, the annotator writes this Gold Answer:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Project A (Spring Boot 2.5), Project B (Log4j 1.x), Project C (Python 2.7)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But the annotator missed Project D — because Zhang San's responsibility for Project D was documented in meeting minutes, not in the official project assignment sheet.&lt;/p&gt;

&lt;p&gt;Now look at the evaluation results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;System&lt;/th&gt;
&lt;th&gt;Answer&lt;/th&gt;
&lt;th&gt;Score Against Gold Answer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Traditional RAG&lt;/td&gt;
&lt;td&gt;Found Projects A, B (missed C)&lt;/td&gt;
&lt;td&gt;Recall 2/3 = 0.67&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GraphRAG&lt;/td&gt;
&lt;td&gt;Found Projects A, B, C, D (discovered D through relationship reasoning in meeting minutes)&lt;/td&gt;
&lt;td&gt;Recall 3/3 = 1.0, but Precision 3/4 = 0.75 (D judged as "extraneous")&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The irony: GraphRAG gets penalized for being &lt;strong&gt;more correct than the Gold Answer&lt;/strong&gt;. It discovered information through the graph's relationship chain (Zhang San → attended meeting → meeting resolution → responsible for Project D) that even the annotator missed, but in the evaluation framework, this "extra correct answer" is treated as an error.&lt;/p&gt;

&lt;p&gt;Final F1 scores:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Traditional RAG: F1 = 0.80&lt;/li&gt;
&lt;li&gt;GraphRAG: F1 = 0.86&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;GraphRAG clearly found more complete and accurate results, yet its score advantage is negligible — and in some evaluation settings (like strict exact matching), it might even score lower than traditional RAG. &lt;strong&gt;The Gold Answer ceiling limits the ability to identify superior systems.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Cost Calculation
&lt;/h3&gt;

&lt;p&gt;Annotating a single complex GraphRAG test question might take a domain expert 30-60 minutes (requiring cross-referencing multiple documents). If you need 200 test questions, that's 100-200 hours of expert time. And these answers might only remain valid for a few months (see the first point above).&lt;/p&gt;

&lt;h2&gt;
  
  
  GraphRAG Answers Are Inherently Diverse in Form
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Core Problem
&lt;/h3&gt;

&lt;p&gt;Traditional RAG typically answers factual questions ("What is X?"), where answers are relatively fixed. But GraphRAG excels at relationship reasoning and comprehensive analysis — questions where the "correct answer" naturally has multiple valid expressions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example
&lt;/h3&gt;

&lt;p&gt;Question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"In our microservices architecture, which services have circular dependencies?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;GraphRAG might answer:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Answer A&lt;/strong&gt;: Service A → Service B → Service C → Service A forms a cycle; Service D and Service E call each other.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer B&lt;/strong&gt;: Two groups of circular dependencies exist: (1) A-B-C triangular cycle (2) D-E bidirectional dependency. Recommend prioritizing decoupling the A-B-C cycle as it involves the core transaction path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer C&lt;/strong&gt;: Circular dependency path detected: A→B→C→A. Additionally, D↔E has bidirectional calls, but since they use asynchronous messaging, the actual impact is minimal.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;All three answers are "correct," but with different emphases. Using any single one as the Gold Answer would unfairly penalize other equally correct responses.&lt;/p&gt;

&lt;h3&gt;
  
  
  Traditional Metrics Fail
&lt;/h3&gt;

&lt;p&gt;Comparing the three answers above using ROUGE scores:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Answer A vs Answer B: ROUGE-L might be only 0.3 (completely different wording)&lt;/li&gt;
&lt;li&gt;Answer A vs Answer C: ROUGE-L might be 0.5 (some overlap)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But from an information correctness perspective, all three should receive full marks. The Gold Answer + text similarity metric combination completely fails here.&lt;/p&gt;

&lt;h2&gt;
  
  
  LLM-as-Judge Is Replacing Gold Answers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Core Problem
&lt;/h3&gt;

&lt;p&gt;Given all these issues with Gold Answers, the industry is shifting toward a new evaluation paradigm: using LLMs as judges (LLM-as-Judge), directly evaluating answer quality rather than comparing against "standard answers."&lt;/p&gt;

&lt;h3&gt;
  
  
  Example
&lt;/h3&gt;

&lt;p&gt;Traditional approach:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;System answer: "PostgreSQL + Redis"
Gold Answer: "MySQL"
ROUGE score: 0.0  → Judged as incorrect ❌
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;LLM-as-Judge approach:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Question: "What database does the order service use?"
System answer: "PostgreSQL + Redis"
Reference document: [Latest architecture doc, clearly states PostgreSQL + Redis]

LLM judgment: Answer is consistent with documentation, information is accurate, score 5/5 ✅
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Advantages of LLM-as-Judge:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Gold Answer&lt;/th&gt;
&lt;th&gt;LLM-as-Judge&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Requires human annotation&lt;/td&gt;
&lt;td&gt;Extensive manual work&lt;/td&gt;
&lt;td&gt;Not needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Adapts to document updates&lt;/td&gt;
&lt;td&gt;Requires re-annotation&lt;/td&gt;
&lt;td&gt;Automatically adapts (references latest docs)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Handles multiple valid expressions&lt;/td&gt;
&lt;td&gt;Cannot&lt;/td&gt;
&lt;td&gt;Can (understands semantic equivalence)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Evaluation cost&lt;/td&gt;
&lt;td&gt;High (manual)&lt;/td&gt;
&lt;td&gt;Low (API calls)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Evaluation speed&lt;/td&gt;
&lt;td&gt;Slow (days/weeks)&lt;/td&gt;
&lt;td&gt;Fast (minutes)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  GraphRAG Evaluation Dimensions Far Exceed "Answer Correctness"
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Core Problem
&lt;/h3&gt;

&lt;p&gt;Gold Answers can only evaluate one dimension: whether the answer content is correct. But GraphRAG system quality depends on many other factors that Gold Answers simply cannot measure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example
&lt;/h3&gt;

&lt;p&gt;For the same question, two GraphRAG systems both give correct answers, but the quality differs dramatically:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;System A's response&lt;/strong&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Zhang San is responsible for Project X, which uses Spring Boot 2.5 (EOL).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;System B's response&lt;/strong&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Zhang San is responsible for Project X, which uses Spring Boot 2.5 (maintenance ended November 2023). Additionally, the project depends on Log4j 1.x (EOL since 2015, with known security vulnerability CVE-2019-17571). Recommend referring to the internal migration guide [link] for upgrading.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Both answers might score identically against the Gold Answer, but System B is clearly more valuable — it provides more complete information, security risk alerts, and actionable recommendations.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Dimensions That Actually Matter
&lt;/h3&gt;

&lt;p&gt;For GraphRAG systems, we should focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Graph coverage&lt;/strong&gt;: Are entities and relationships being completely extracted?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning path explainability&lt;/strong&gt;: Which nodes and edges did the system traverse to reach its conclusion?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Information completeness&lt;/strong&gt;: Are important related details being missed?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timeliness&lt;/strong&gt;: Is the referenced information current?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Actionability&lt;/strong&gt;: Does the answer provide executable recommendations?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these dimensions can be evaluated by Gold Answers.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Should We Evaluate GraphRAG Then?
&lt;/h2&gt;

&lt;p&gt;Since Gold Answers are no longer a silver bullet, here are evaluation strategies better suited for GraphRAG:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;LLM-as-Judge + dimension decomposition&lt;/strong&gt;: Have LLMs score separately on accuracy, completeness, relevance, and other dimensions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Source document fact-checking&lt;/strong&gt;: Verify whether each fact in the answer can be traced back to source documents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graph quality metrics&lt;/strong&gt;: Directly evaluate knowledge graph entity coverage and relationship accuracy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;End-to-end user satisfaction&lt;/strong&gt;: Have real users evaluate whether answers solved their problems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regression testing over absolute scoring&lt;/strong&gt;: Focus on quality changes before and after system updates, rather than pursuing absolute scores&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Gold Answers aren't entirely worthless — for simple factual Q&amp;amp;A and system cold-start phases, they remain a useful baseline. But in complex systems like GraphRAG, over-reliance on Gold Answers introduces three risks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;False sense of security&lt;/strong&gt;: High Gold Answer scores don't mean the system is actually useful&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintenance burden&lt;/strong&gt;: The cost of continuously updating Gold Answers may exceed the value they provide&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluation blind spots&lt;/strong&gt;: Gold Answers cannot cover GraphRAG's most important quality dimensions&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Rather than spending enormous effort maintaining a set of "standard answers" destined to become outdated, invest that energy into more modern, comprehensive evaluation systems. GraphRAG evaluation should be like GraphRAG itself — dynamic, multi-dimensional, and based on understanding rather than rote memorization.&lt;/p&gt;

</description>
      <category>goldanswer</category>
      <category>graphrag</category>
      <category>ragevaluation</category>
      <category>llmasjudge</category>
    </item>
    <item>
      <title>Why Does Semantic Chunking Need an Embedding API?</title>
      <dc:creator>eyanpen</dc:creator>
      <pubDate>Mon, 04 May 2026 05:54:39 +0000</pubDate>
      <link>https://dev.to/eyanpen/why-does-semantic-chunking-need-an-embedding-api-4dei</link>
      <guid>https://dev.to/eyanpen/why-does-semantic-chunking-need-an-embedding-api-4dei</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Fixed-length chunking requires no external services, yet semantic chunking absolutely needs an Embedding API — why?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Short Answer
&lt;/h2&gt;

&lt;p&gt;The core idea of semantic chunking is to &lt;strong&gt;split text at semantic boundaries&lt;/strong&gt;. Determining whether "two pieces of text belong to the same topic" requires converting text into vectors and computing similarity — that's exactly what the Embedding API does.&lt;/p&gt;

&lt;h2&gt;
  
  
  Traditional Chunking vs Semantic Chunking
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Fixed-Length / Recursive&lt;/th&gt;
&lt;th&gt;Semantic Chunking&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Split criteria&lt;/td&gt;
&lt;td&gt;Character count, token count, delimiters&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Semantic similarity&lt;/strong&gt; between adjacent sentences&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Requires Embedding&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Split quality&lt;/td&gt;
&lt;td&gt;May break in the middle of a topic&lt;/td&gt;
&lt;td&gt;Splits at topic transitions, preserving semantic coherence&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Fixed-length chunking is like measuring paper with a ruler — regardless of content, it cuts every 500 characters. Semantic chunking is like a reader who, after finishing a paragraph, asks "is the next part still about the same thing?" If not, that's where the cut goes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Mainstream Semantic Chunking Strategies
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Strategy 1: Adjacent Similarity (Kamradt Method)
&lt;/h3&gt;

&lt;p&gt;Core idea: Compute semantic distances between adjacent sentences and split where distances spike.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Process:
1. Split text into small sentences
2. For each sentence, concatenate buffer_size adjacent sentences as context
3. Call Embedding API to get vectors for each combined sentence
4. Compute cosine distances between adjacent combined sentences
5. Use binary search to find a threshold, split where distance exceeds it
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pseudocode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Step 1: Build context windows
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentences&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;combined&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;buffer_size&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;combined&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;sentences&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;combined&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;buffer_size&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
        &lt;span class="n"&gt;combined&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;sentences&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;combined_texts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;combined&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 2: Get embeddings for all combined sentences (one batch call)
&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embedding_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embed_texts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;combined_texts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;embedding_matrix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 3: Compute cosine distances only between adjacent sentences
&lt;/span&gt;&lt;span class="n"&gt;distances&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentences&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;similarity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding_matrix&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;embedding_matrix&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;distances&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;similarity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Higher distance = greater topic difference
&lt;/span&gt;
&lt;span class="c1"&gt;# Step 4: Binary search for threshold targeting total_size / avg_chunk_size cuts
&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;binary_search_threshold&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;distances&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_cuts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 5: Split where distance exceeds threshold
&lt;/span&gt;&lt;span class="n"&gt;breakpoints&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;distances&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Intuition: Imagine reading an article sentence by sentence, asking yourself after each one: "Is the next sentence still about the same thing?" When you feel the topic has jumped, you cut there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key characteristic: Only looks at adjacent relationships.&lt;/strong&gt; It only computes the distance between sentence[i] and sentence[i+1] — a &lt;strong&gt;local greedy&lt;/strong&gt; strategy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Strategy 2: Cluster Optimal Segmentation (Dynamic Programming Method)
&lt;/h3&gt;

&lt;p&gt;Core idea: Build a similarity matrix between all sentence pairs and use dynamic programming to find the segmentation that maximizes intra-cluster similarity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Process:
1. Split text into small sentences
2. Call Embedding API to get vectors for all sentences
3. Build an N×N similarity matrix
4. Normalize the matrix by subtracting the mean (prevents degeneration into one giant cluster)
5. Use dynamic programming to find the optimal segmentation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pseudocode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Step 1: Get embeddings for all sentences (note: no buffer concatenation)
&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embedding_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embed_texts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentences&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;embedding_matrix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 2: Build N×N similarity matrix
&lt;/span&gt;&lt;span class="n"&gt;similarity_matrix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding_matrix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding_matrix&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 3: Mean normalization to prevent DP from putting everything in one cluster
&lt;/span&gt;&lt;span class="n"&gt;mean_sim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;upper_triangle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;similarity_matrix&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;similarity_matrix&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="n"&gt;mean_sim&lt;/span&gt;
&lt;span class="nf"&gt;fill_diagonal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;similarity_matrix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 4: Dynamic programming for optimal segmentation
# dp[i] = maximum intra-cluster similarity sum for the first i+1 sentences
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;cluster_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;max_chunk_size&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="n"&gt;reward&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;similarity_matrix&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;reward&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;dp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;dp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;reward&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 5: Backtrack to get optimal segmentation
&lt;/span&gt;&lt;span class="n"&gt;clusters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;backtrack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;segmentation&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key characteristic: Globally optimal.&lt;/strong&gt; It considers relationships between all sentence pairs and uses DP to find the overall best segmentation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deep Comparison of the Two Strategies
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Fundamental Algorithmic Differences
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Kamradt (Adjacent Similarity)&lt;/th&gt;
&lt;th&gt;Cluster (Dynamic Programming)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Scope&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Local&lt;/strong&gt; — only adjacent sentences&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Global&lt;/strong&gt; — all sentence pairs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Decision method&lt;/td&gt;
&lt;td&gt;Greedy: cut when distance exceeds threshold&lt;/td&gt;
&lt;td&gt;Optimization: maximize intra-cluster similarity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Threshold&lt;/td&gt;
&lt;td&gt;Binary search for target cut count&lt;/td&gt;
&lt;td&gt;No threshold needed, DP decides automatically&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context enhancement&lt;/td&gt;
&lt;td&gt;✅ buffer_size concatenation&lt;/td&gt;
&lt;td&gt;❌ Uses raw sentences directly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Size constraints&lt;/td&gt;
&lt;td&gt;avg_chunk_size + max_chunk_size dual constraint&lt;/td&gt;
&lt;td&gt;max_chunk_size hard constraint&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The core difference in one sentence:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kamradt asks: "Is there a topic transition between these two adjacent sentences?"&lt;/li&gt;
&lt;li&gt;Cluster asks: "Which grouping makes sentences within each group most similar to each other?"&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  An Intuitive Example
&lt;/h3&gt;

&lt;p&gt;Consider 6 sentences with the following topic distribution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Sentence 1: Discussing Apple's earnings report
Sentence 2: Discussing Apple's new products
Sentence 3: Discussing the weather forecast
Sentence 4: Discussing tomorrow's temperature
Sentence 5: Discussing Apple's stock price
Sentence 6: Discussing Apple's competitors
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Kamradt's approach:&lt;/strong&gt; Compare adjacent pairs&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sentence 2→3: Topic jump (Apple → weather), cut!&lt;/li&gt;
&lt;li&gt;Sentence 4→5: Topic jump (weather → Apple), cut!&lt;/li&gt;
&lt;li&gt;Result: [1,2] [3,4] [5,6]&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cluster's approach:&lt;/strong&gt; The global similarity matrix shows sentences 1,2,5,6 are highly similar to each other&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;But since DP requires contiguous segmentation (can't skip around), it can only cut contiguous spans&lt;/li&gt;
&lt;li&gt;Result is likely also [1,2] [3,4] [5,6], but the reasoning is different&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The key difference emerges when boundaries are fuzzy:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Consider an article that gradually transitions from "EV technology" to "energy policy":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Sentence 1: Tesla released a new generation of battery technology
Sentence 2: The new battery's energy density improved by 50%
Sentence 3: Higher energy density means longer driving range
Sentence 4: Range anxiety has been a barrier for consumers buying EVs
Sentence 5: The government introduced charging station subsidies to address this
Sentence 6: Subsidies cover both residential and commercial charging facilities
Sentence 7: Commercial charging uses time-of-use electricity pricing
Sentence 8: Time-of-use pricing is a key component of electricity market reform
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What Kamradt sees (adjacent distances):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1→2: 0.08  (both about batteries)
2→3: 0.10  (battery → range, very close)
3→4: 0.12  (range → range anxiety, very close)
4→5: 0.15  (consumers → government policy, slightly far but not outstanding)
5→6: 0.09  (both about subsidies)
6→7: 0.13  (subsidies → pricing, somewhat far)
7→8: 0.11  (both about pricing)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No single distance clearly "spikes" — the topic slides gradually. Kamradt's binary search struggles to find a reasonable threshold, potentially producing suboptimal splits like [1-4][5-8] or [1-3][4-6][7-8].&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Cluster sees (global similarity matrix summary):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        S1    S2    S3    S4    S5    S6    S7    S8
S1      --   0.9   0.7   0.4   0.2   0.1   0.1   0.05
S2           --    0.8   0.5   0.2   0.15  0.1   0.05
S3                 --    0.6   0.3   0.2   0.15  0.1
S4                       --    0.5   0.4   0.3   0.2
S5                             --    0.8   0.6   0.4
S6                                   --    0.7   0.5
S7                                         --    0.8
S8                                               --
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The global view clearly shows: sentences 1-3 are highly similar to each other (battery/range technology), sentences 5-8 are highly similar to each other (policy/pricing), and sentence 4 is a transition. DP optimization discovers that [1-3][4-8] or [1-4][5-8] maximizes intra-cluster similarity, producing a more reasonable split.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The essential difference:&lt;/strong&gt; Kamradt only looks at "the gap between adjacent sentences" — in a gradual transition, each step's gap is small, like the boiling frog metaphor. Cluster looks at "the overall similarity within each group" — even when the transition is smooth, it can still detect that sentence 1 and sentence 8 are essentially unrelated.&lt;/p&gt;

&lt;h3&gt;
  
  
  Embedding Cost Comparison
&lt;/h3&gt;

&lt;p&gt;This is one of the most important practical differences between the two strategies:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Kamradt&lt;/th&gt;
&lt;th&gt;Cluster&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Embedding input&lt;/td&gt;
&lt;td&gt;combined_sentence (with buffer context)&lt;/td&gt;
&lt;td&gt;Raw sentences (no buffer)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Embedding call count&lt;/td&gt;
&lt;td&gt;N texts, 1 batch call&lt;/td&gt;
&lt;td&gt;N texts, 1 batch call&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Average text length&lt;/td&gt;
&lt;td&gt;Longer (~7 sentences, buffer_size=3)&lt;/td&gt;
&lt;td&gt;Shorter (1 sentence)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total token consumption&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Higher&lt;/strong&gt; (buffer causes input inflation)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Lower&lt;/strong&gt; (no redundancy)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Post-embedding computation&lt;/td&gt;
&lt;td&gt;O(N) — only adjacent distances&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;O(N²)&lt;/strong&gt; — full similarity matrix&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DP computation&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;O(N × max_cluster_size)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Concrete Numbers (1000 sentences, ~30 tokens each)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Kamradt:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Embedding input: 1000 combined_sentences, each ~7×30 = 210 tokens&lt;/li&gt;
&lt;li&gt;Total token consumption: 1000 × 210 = &lt;strong&gt;210,000 tokens&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Distance computation: 999 dot products → negligible&lt;/li&gt;
&lt;li&gt;Memory: 1000 × embedding_dim matrix&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cluster:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Embedding input: 1000 raw sentences, each ~30 tokens&lt;/li&gt;
&lt;li&gt;Total token consumption: 1000 × 30 = &lt;strong&gt;30,000 tokens&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Similarity matrix: 1000 × 1000 = &lt;strong&gt;1 million floats&lt;/strong&gt; (~8MB)&lt;/li&gt;
&lt;li&gt;DP computation: O(1000 × max_cluster_size) iterations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Conclusions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Embedding API cost&lt;/strong&gt;: Kamradt consumes ~7x more tokens (due to buffer concatenation), higher API cost&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compute resources&lt;/strong&gt;: Cluster's O(N²) matrix and DP are more expensive on local CPU/memory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network latency&lt;/strong&gt;: Same for both (both use 1 batch call, or multiple calls based on batch_size)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Large-Scale Scenario (100,000 sentences)
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Kamradt&lt;/th&gt;
&lt;th&gt;Cluster&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total embedding tokens&lt;/td&gt;
&lt;td&gt;~21 million tokens&lt;/td&gt;
&lt;td&gt;~3 million tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API calls (batch_size=500)&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Similarity computation&lt;/td&gt;
&lt;td&gt;99,999 dot products&lt;/td&gt;
&lt;td&gt;10 billion dot products (N² matrix)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory usage&lt;/td&gt;
&lt;td&gt;~400MB (embedding matrix)&lt;/td&gt;
&lt;td&gt;~40GB (N² similarity matrix) ⚠️&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;At 100K sentences, Cluster's N² matrix will blow up memory&lt;/strong&gt; — this is its hard limitation. In practice, Cluster is better suited for medium-length documents (hundreds to thousands of sentences), while Kamradt can handle any length.&lt;/p&gt;

&lt;h3&gt;
  
  
  Split Quality Comparison
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Kamradt&lt;/th&gt;
&lt;th&gt;Cluster&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Clear topic boundaries&lt;/td&gt;
&lt;td&gt;✅ Excellent, obvious distance spikes&lt;/td&gt;
&lt;td&gt;✅ Excellent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gradual topic transitions&lt;/td&gt;
&lt;td&gt;⚠️ May fail to find split points&lt;/td&gt;
&lt;td&gt;✅ Global optimization still finds best split&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Short documents (&amp;lt;50 sentences)&lt;/td&gt;
&lt;td&gt;✅ Fast&lt;/td&gt;
&lt;td&gt;✅ Higher quality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Long documents (&amp;gt;10K sentences)&lt;/td&gt;
&lt;td&gt;✅ Linear scaling&lt;/td&gt;
&lt;td&gt;❌ Memory explosion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Very short sentences&lt;/td&gt;
&lt;td&gt;⚠️ Needs buffer for context&lt;/td&gt;
&lt;td&gt;⚠️ Short sentence embeddings are low quality&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  How to Choose?
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Your Scenario&lt;/th&gt;
&lt;th&gt;Recommended&lt;/th&gt;
&lt;th&gt;Reason&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Unknown document length, need general solution&lt;/td&gt;
&lt;td&gt;Kamradt&lt;/td&gt;
&lt;td&gt;Linear complexity, won't blow memory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Short documents (&amp;lt;2000 sentences), want optimal splits&lt;/td&gt;
&lt;td&gt;Cluster&lt;/td&gt;
&lt;td&gt;Globally optimal, higher quality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Embedding API charges per token&lt;/td&gt;
&lt;td&gt;Cluster&lt;/td&gt;
&lt;td&gt;No buffer inflation, 7x fewer tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Limited local compute resources&lt;/td&gt;
&lt;td&gt;Kamradt&lt;/td&gt;
&lt;td&gt;O(N) computation, memory-friendly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fuzzy topic boundaries, need precise splits&lt;/td&gt;
&lt;td&gt;Cluster&lt;/td&gt;
&lt;td&gt;DP global optimization is more robust&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Why Can't Other Methods Replace Embedding?
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Alternative&lt;/th&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Keyword overlap / TF-IDF&lt;/td&gt;
&lt;td&gt;Cannot capture synonyms or contextual semantics ("automobile" and "vehicle" would be considered unrelated)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rule-based delimiters (paragraphs, periods)&lt;/td&gt;
&lt;td&gt;One paragraph may contain multiple topics; different paragraphs may discuss the same topic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM direct judgment&lt;/td&gt;
&lt;td&gt;Too expensive, high latency, unsuitable for batch processing tens of thousands of sentences&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Embedding maps text into a high-dimensional semantic space where semantically similar texts have small vector distances and dissimilar texts have large distances. This is currently the optimal approach for semantic similarity measurement, balancing &lt;strong&gt;cost, speed, and quality&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  buffer_size: The Role of the Context Window
&lt;/h2&gt;

&lt;p&gt;Semantic chunking has a key parameter &lt;code&gt;buffer_size&lt;/code&gt; (default: 3) that determines how much context is concatenated when generating embeddings for each sentence.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Concatenation logic
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;each&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;combined&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# 3 before
&lt;/span&gt;              &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;                                     &lt;span class="c1"&gt;# current
&lt;/span&gt;              &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# 3 after
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key point: buffer_size does not affect the number of Embedding calls — only the length of each input text.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With 10 sentences, whether buffer_size is 1 or 10, you still embed 10 combined_sentences. The difference is how much context each text contains:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;buffer_size&lt;/th&gt;
&lt;th&gt;Avg sentences per text&lt;/th&gt;
&lt;th&gt;Effect&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;~3&lt;/td&gt;
&lt;td&gt;Less context, may misjudge&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3 (default)&lt;/td&gt;
&lt;td&gt;~7&lt;/td&gt;
&lt;td&gt;Balance point&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;~21&lt;/td&gt;
&lt;td&gt;Rich context, but may exceed model token limit&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Note: Embedding models have input length limits (e.g., BGE-M3 max 8192 tokens). If buffer_size is too large, texts get truncated, potentially losing the current sentence's information.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance at Scale
&lt;/h2&gt;

&lt;p&gt;Suppose a long document is split into 100,000 sentences:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Texts to embed = &lt;strong&gt;100,000&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;With batch_size of 500, actual API calls = 100,000 ÷ 500 = &lt;strong&gt;200 HTTP requests&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The performance bottleneck is API call count (determined by total sentences and batch_size), independent of buffer_size.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fallback Strategy: What If Embedding Is Unavailable?
&lt;/h2&gt;

&lt;p&gt;Good system design should account for Embedding service unavailability. The common approach: when Embedding calls fail, automatically fall back to recursive chunking (pure rule-based splitting, no Embedding needed).&lt;/p&gt;

&lt;p&gt;This means semantic chunking is an &lt;strong&gt;enhancement&lt;/strong&gt;, not a &lt;strong&gt;dependency&lt;/strong&gt; — the system still works without the Embedding service, just with lower split quality.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Question&lt;/th&gt;
&lt;th&gt;Answer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Why is Embedding needed?&lt;/td&gt;
&lt;td&gt;Judging semantic similarity requires vector representations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Can rules replace it?&lt;/td&gt;
&lt;td&gt;No, rules cannot capture semantics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Can LLM replace it?&lt;/td&gt;
&lt;td&gt;Theoretically yes, but cost and latency are unacceptable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kamradt vs Cluster core difference?&lt;/td&gt;
&lt;td&gt;Local adjacent comparison vs global optimal segmentation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Which has higher Embedding cost?&lt;/td&gt;
&lt;td&gt;Kamradt: higher token consumption (buffer inflation); Cluster: higher compute cost (N² matrix)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Which for large documents?&lt;/td&gt;
&lt;td&gt;Kamradt — linear complexity, won't blow memory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Which for optimal splits?&lt;/td&gt;
&lt;td&gt;Cluster — global DP optimization, but limited to medium-length documents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;What if service is unavailable?&lt;/td&gt;
&lt;td&gt;Both fall back to rule-based chunking&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The Embedding API is the "eyes" of semantic chunking — without it, the chunking algorithm is a blind person cutting a cake. The two strategies "see" text differently: Kamradt is like a line-by-line scanner, Cluster is like an editor with a bird's-eye view. Which to choose depends on your document scale and split quality requirements.&lt;/p&gt;

</description>
      <category>semanticchunking</category>
      <category>embedding</category>
      <category>rag</category>
      <category>textsplitting</category>
    </item>
    <item>
      <title>Multiple Independent Questions: Batch Into One Request or Split Into Many? — An Analysis of LLM Concurrent Processing</title>
      <dc:creator>eyanpen</dc:creator>
      <pubDate>Sun, 03 May 2026 00:19:18 +0000</pubDate>
      <link>https://dev.to/eyanpen/multiple-independent-questions-batch-into-one-request-or-split-into-many-an-analysis-of-llm-1h6m</link>
      <guid>https://dev.to/eyanpen/multiple-independent-questions-batch-into-one-request-or-split-into-many-an-analysis-of-llm-1h6m</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;When you have 5 unrelated questions, should you pack them into one message to the LLM, or send 5 requests simultaneously? Which is faster?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Short Answer
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Splitting into multiple independent parallel requests is almost always faster.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This isn't a gut feeling — it's determined by the underlying inference mechanism of LLMs. Let's walk through the reasoning from first principles.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. How LLMs Generate Text: Autoregressive Decoding
&lt;/h2&gt;

&lt;p&gt;To understand this problem, you first need to know how LLMs "write."&lt;/p&gt;

&lt;p&gt;LLMs (GPT-4, Claude, etc.) use &lt;strong&gt;autoregressive generation&lt;/strong&gt;: they produce one token at a time, append that token back to the input, then generate the next token. This repeats until generation is complete.&lt;/p&gt;

&lt;p&gt;The key insight: &lt;strong&gt;Generating N tokens requires N forward passes.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A 100-token answer requires 100 inference steps&lt;/li&gt;
&lt;li&gt;A 500-token answer requires 500 inference steps&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total output length directly determines total latency&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  2. Batched Request: Output Volumes Stack, Latency Grows Linearly
&lt;/h2&gt;

&lt;p&gt;Suppose you have 5 independent questions, each requiring ~200 tokens to answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Approach A: Combine into one request&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You stuff all 5 questions into a single message:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Please answer the following questions separately:
1. xxx
2. xxx
3. xxx
4. xxx
5. xxx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The LLM needs to generate total output ≈ 5 × 200 = 1000 tokens. Due to autoregressive decoding, these 1000 tokens are generated &lt;strong&gt;sequentially&lt;/strong&gt; — token #201 must wait for the first 200 to finish.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Total latency ≈ 1000 × per-token generation time&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Plus additional overhead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The LLM must maintain context switches between answers ("now answering question 3")&lt;/li&gt;
&lt;li&gt;Longer KV Cache means increasing attention computation at each step&lt;/li&gt;
&lt;li&gt;Actual output often exceeds 1000 tokens (formatting, transition phrases, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  3. Split Requests: Parallel Inference, Latency Equals the Slowest One
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Approach B: Split 5 questions into 5 independent requests, sent simultaneously&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Each request independently generates ~200 tokens. If the server has sufficient concurrent processing capacity (all modern LLM services do), these 5 requests are &lt;strong&gt;processed in parallel&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Total latency ≈ max(individual request latencies) ≈ 200 × per-token generation time&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Comparison:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Total output tokens&lt;/th&gt;
&lt;th&gt;Actual latency (relative)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Combined request&lt;/td&gt;
&lt;td&gt;~1000+&lt;/td&gt;
&lt;td&gt;~1000 steps (sequential)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Split into 5 requests&lt;/td&gt;
&lt;td&gt;~200 each&lt;/td&gt;
&lt;td&gt;~200 steps (parallel)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Theoretical speedup ≈ 5x&lt;/strong&gt; (equals the number of questions).&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Why Does Parallelism Work? — Server-Side Continuous Batching
&lt;/h2&gt;

&lt;p&gt;You might ask: doesn't the LLM server have capacity limits? Won't 5 simultaneous requests queue up?&lt;/p&gt;

&lt;p&gt;Modern LLM inference engines (vLLM, TensorRT-LLM, TGI, etc.) all implement &lt;strong&gt;Continuous Batching&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Multiple requests share the same GPU matrix operation&lt;/strong&gt;: GPUs excel at parallel computation. Combining tokens from 5 requests into one batch allows a single forward pass to generate one token for each request simultaneously.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic scheduling&lt;/strong&gt;: Different requests have different output lengths. Shorter ones finish first, and their slots are immediately given to new requests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Throughput vs. latency decoupling&lt;/strong&gt;: Larger batches mean higher GPU utilization and more total tokens processed per unit time.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;From the server's perspective:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;5 short parallel requests → GPU does 5-way batched inference, producing 5 tokens per step&lt;/li&gt;
&lt;li&gt;1 long request → GPU does single-sequence inference, producing 1 token per step&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The GPU's parallel computing power is wasted when requests are combined.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  5. The Prefill Phase Difference
&lt;/h2&gt;

&lt;p&gt;LLM inference has two phases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Prefill&lt;/strong&gt;: Process the input prompt, computing KV Cache for all input tokens. This step can process all input tokens in parallel, with latency roughly linear to input length.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decode&lt;/strong&gt;: Generate output token by token. This step is sequential.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With combined requests:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prefill phase: Longer input (all 5 questions concatenated), longer prefill time&lt;/li&gt;
&lt;li&gt;Decode phase: Longer output, longer decode time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With split requests:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each request's prefill is shorter, and all 5 prefills can run in parallel or pipelined&lt;/li&gt;
&lt;li&gt;Each request's decode is shorter, and they run in parallel&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Both phases favor splitting.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  6. An Often-Overlooked Factor: Quality
&lt;/h2&gt;

&lt;p&gt;Beyond speed, combining requests carries quality risks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Attention dilution&lt;/strong&gt;: When an LLM processes multiple unrelated tasks in one generation, its "focus" on each task decreases. Research shows that more irrelevant information in the prompt leads to lower answer quality (the "Lost in the Middle" phenomenon).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Format confusion&lt;/strong&gt;: Answers to 5 questions easily suffer from numbering errors, omissions, or mismatched responses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error propagation&lt;/strong&gt;: If the answer to question 2 goes wrong, the LLM may be influenced in subsequent answers (autoregressive "inertia").&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Split requests completely isolate context, giving each question the LLM's "full attention."&lt;/p&gt;

&lt;h2&gt;
  
  
  7. When Is Combining Actually Better?
&lt;/h2&gt;

&lt;p&gt;To be fair, there are a few scenarios where combining may be more appropriate:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Hidden correlations between questions&lt;/strong&gt;: Even if you think they're independent, the LLM might give more consistent answers seeing the full picture (e.g., different sections of the same report).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strict API rate limits&lt;/strong&gt;: If your API quota is 3 requests per minute, you have no choice but to combine 5 questions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network latency far exceeds generation time&lt;/strong&gt;: If each API call has 2 seconds of network round-trip but generation only takes 0.5 seconds, splitting 5 times (5 × 2s = 10s network overhead) might exceed the combined generation time. But this is rare in practice — modern API network latency is typically 100-300ms, far less than generation time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extremely short answers&lt;/strong&gt;: If each question only needs a word or two, prefill overhead dominates, and combining can reduce redundant prefill costs.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  8. How to Verify This Yourself
&lt;/h2&gt;

&lt;p&gt;If you want to test this empirically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;aiohttp&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ask_single&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# Call LLM API
&lt;/span&gt;    &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;API_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;benchmark&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;questions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Question 1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Question 2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Question 3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Question 4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Question 5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;aiohttp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ClientSession&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Approach A: Combined
&lt;/span&gt;        &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;combined&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Please answer separately:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;questions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;ask_single&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;combined&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;time_combined&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;

        &lt;span class="c1"&gt;# Approach B: Parallel
&lt;/span&gt;        &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;ask_single&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;questions&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;time_parallel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Combined: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;time_combined&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Parallel: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;time_parallel&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Speedup: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;time_combined&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;time_parallel&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;x&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In practice, 5 moderately complex independent questions typically achieve 3-5x speedup with parallel requests.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Combined request&lt;/th&gt;
&lt;th&gt;Split parallel requests&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Generation speed&lt;/td&gt;
&lt;td&gt;Slow (sequential output of all answers)&lt;/td&gt;
&lt;td&gt;Fast (parallel generation, latency = slowest)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU utilization&lt;/td&gt;
&lt;td&gt;Low (single-sequence inference)&lt;/td&gt;
&lt;td&gt;High (batched parallel inference)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Answer quality&lt;/td&gt;
&lt;td&gt;May degrade (attention dilution)&lt;/td&gt;
&lt;td&gt;Better (isolated context)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API calls&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;N&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best for&lt;/td&gt;
&lt;td&gt;Rate-limited / extremely short answers&lt;/td&gt;
&lt;td&gt;Independent questions needing detailed answers&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Core principle in one sentence: LLM's autoregressive mechanism means output is sequential; combining requests = forcing all outputs into a single serial stream; splitting requests = leveraging server-side parallelism to generate multiple outputs simultaneously. Splitting independent questions is the classic strategy of trading space (concurrent slots) for time.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>llminference</category>
      <category>autoregressivegeneration</category>
      <category>parallelrequests</category>
      <category>continuousbatching</category>
    </item>
    <item>
      <title>What Is GraphRAG Really Doing? — A Deep Dive into Microsoft's Blog Post</title>
      <dc:creator>eyanpen</dc:creator>
      <pubDate>Fri, 24 Apr 2026 11:57:01 +0000</pubDate>
      <link>https://dev.to/eyanpen/what-is-graphrag-really-doing-a-deep-dive-into-microsofts-blog-post-17m5</link>
      <guid>https://dev.to/eyanpen/what-is-graphrag-really-doing-a-deep-dive-into-microsofts-blog-post-17m5</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Original: &lt;a href="https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/" rel="noopener noreferrer"&gt;GraphRAG: Unlocking LLM discovery on narrative private data - Microsoft Research&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;In early 2024, Microsoft published a technical blog post. The core message boils down to one sentence: &lt;strong&gt;Traditional RAG falls short with complex data, and GraphRAG fills the gap using knowledge graphs + graph clustering.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This isn't an academic paper — it reads more like a "tech pitch" aimed at technical decision-makers and engineers. Let me break it down.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where Does Traditional RAG Fall Short?
&lt;/h2&gt;

&lt;p&gt;To understand what GraphRAG solves, we need to start with the pain points of traditional RAG. The article highlights two scenarios where traditional RAG struggles:&lt;/p&gt;

&lt;h3&gt;
  
  
  Information That Can't Be Connected
&lt;/h3&gt;

&lt;p&gt;Imagine asking an AI: "What has Novorossiya done?"&lt;/p&gt;

&lt;p&gt;Traditional RAG takes the word "Novorossiya" and runs a vector search. But among the 10 text chunks retrieved, none directly mentions that name — the answer is scattered across different documents, connected only through indirect relationships between entities. Vector search only finds text that "looks similar"; it can't handle this kind of reasoning that requires "jumping" between connections.&lt;/p&gt;

&lt;p&gt;GraphRAG works differently: it locates the Novorossiya node in the knowledge graph, then traverses along relationship edges — actions, goals, related organizations — and assembles the complete answer.&lt;/p&gt;

&lt;p&gt;Put simply, vector retrieval is "local matching," while real-world knowledge is often connected indirectly through chains of entity relationships.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can't Answer "Big Questions"
&lt;/h3&gt;

&lt;p&gt;Another example: "What are the top 5 themes in this dataset?"&lt;/p&gt;

&lt;p&gt;Traditional RAG is stumped — the word "themes" is too broad. Vector search doesn't know which direction to look, and ends up matching some irrelevant text that happens to contain the word "theme." The answer naturally goes off track.&lt;/p&gt;

&lt;p&gt;This is fundamentally a granularity problem: vector RAG retrieves at the text chunk level, but "overall themes" require a macro-level understanding of the entire dataset. No single chunk can support that kind of answer.&lt;/p&gt;

&lt;p&gt;GraphRAG handles this easily with pre-built community clusters and community summaries, extracting themes directly from the macro structure.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Does GraphRAG Work?
&lt;/h2&gt;

&lt;p&gt;The entire process has two phases: offline indexing, then online question answering.&lt;/p&gt;

&lt;h3&gt;
  
  
  Offline Indexing: Three Steps
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Raw Documents
    │
    ▼
┌─────────────────────────────┐
│ Step 1: Entity &amp;amp; Relationship│  LLM processes documents chunk
│ Extraction                   │  by chunk, extracting all
│                              │  entities (people, places,
│                              │  organizations, etc.) and
│                              │  their relationships
└─────────────────────────────┘
    │
    ▼
┌─────────────────────────────┐
│ Step 2: Knowledge Graph      │  Assemble extracted entities
│ Construction                 │  and relationships into a
│                              │  complete graph structure
└─────────────────────────────┘
    │
    ▼
┌─────────────────────────────┐
│ Step 3: Community Detection  │  Perform bottom-up hierarchical
│ &amp;amp; Summarization              │  clustering on the graph (e.g.,
│                              │  Leiden algorithm), generate
│                              │  LLM summary reports for each
│                              │  community
└─────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In short: first let the LLM extract all the people, events, things, and their relationships from the documents, assemble them into a large graph, then cluster the graph into groups and write a summary for each group.&lt;/p&gt;

&lt;h3&gt;
  
  
  Online Answering: Choose Strategy by Question Type
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Question Type&lt;/th&gt;
&lt;th&gt;How to Find the Answer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Specific questions (e.g., "What has Novorossiya done?")&lt;/td&gt;
&lt;td&gt;Locate entity in graph → traverse relationships → collect related text → generate answer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Macro questions (e.g., "Top 5 themes")&lt;/td&gt;
&lt;td&gt;Use community summaries directly → aggregate layer by layer → generate global answer&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Technical Points Worth Digging Into
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why Use LLM for Graph Construction Instead of Traditional NLP?
&lt;/h3&gt;

&lt;p&gt;The traditional approach uses NER (Named Entity Recognition) + relation extraction models, but these have hard limitations: you need to predefine entity types and relation types, they break when you switch domains, and they can't capture implicit relationships.&lt;/p&gt;

&lt;p&gt;LLM advantages are clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero-shot capability&lt;/strong&gt; — no need to train separately for each domain&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Can read between the lines&lt;/strong&gt; — for example, extracting the implicit "government attention" relationship from "the Attorney General's office reported the creation of Novorossiya"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not constrained by schema&lt;/strong&gt; — let the LLM discover entity and relationship types on its own&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The trade-off is straightforward: LLM calls are expensive, and the indexing phase needs to process the entire dataset, so computational costs are significant.&lt;/p&gt;

&lt;h3&gt;
  
  
  Community Detection — GraphRAG's Killer Feature
&lt;/h3&gt;

&lt;p&gt;Many approaches use knowledge graphs to enhance RAG, but what truly sets GraphRAG apart is community detection:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uses algorithms like Leiden to partition the knowledge graph into multi-level communities (think of them as "topic clusters")&lt;/li&gt;
&lt;li&gt;Pre-generates an LLM summary report for each community&lt;/li&gt;
&lt;li&gt;Different community levels correspond to different levels of abstraction; choose the right granularity when answering questions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the secret behind its ability to answer "big questions" — no need to traverse the entire graph on the fly, just look up the pre-written summaries.&lt;/p&gt;

&lt;p&gt;When generating community reports, the LLM receives CSV tables of entities and relationships within that community: an Entities table (entity ID, name, description), a Relationships table (source, target, description, combined_degree), and an optional Claims table. Relationships are sorted by &lt;code&gt;combined_degree&lt;/code&gt; in descending order, prioritizing the most important ones, with truncation when the token limit is exceeded.&lt;/p&gt;

&lt;h3&gt;
  
  
  Provenance — Every Statement Is Traceable
&lt;/h3&gt;

&lt;p&gt;GraphRAG places special emphasis on provenance. The complete evidence chain looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Query
    → GraphRAG Answer + [Data: Entities (ID), Relationships (ID)]
        → Relationship IDs point to specific edges in the knowledge graph
            → Edges link back to specific passages in the original source documents
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Answer → entities/relationships in the graph → original documents — fully traceable end to end. For enterprise applications, this capability is critical — you can verify every claim the AI makes.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Were the Experiments Conducted?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Dataset
&lt;/h3&gt;

&lt;p&gt;They used the VIINA dataset (violence information from news articles), chosen deliberately:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Involves multi-party conflict with fragmented information — complex enough&lt;/li&gt;
&lt;li&gt;Includes news sources from both Russian and Ukrainian sides with opposing viewpoints and contradictory information&lt;/li&gt;
&lt;li&gt;Data from June 2023, ensuring it's not in the LLM's training set&lt;/li&gt;
&lt;li&gt;Thousands of articles, far exceeding context window limits — can't be handled without RAG&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Evaluation Results
&lt;/h3&gt;

&lt;p&gt;Four metrics were used for scoring:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;What It Measures&lt;/th&gt;
&lt;th&gt;How It's Evaluated&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Comprehensiveness&lt;/td&gt;
&lt;td&gt;How complete is the answer&lt;/td&gt;
&lt;td&gt;LLM scorer pairwise comparison&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Human Empowerment&lt;/td&gt;
&lt;td&gt;Does it provide sources for verification&lt;/td&gt;
&lt;td&gt;LLM scorer pairwise comparison&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Diversity&lt;/td&gt;
&lt;td&gt;Does it answer from multiple perspectives&lt;/td&gt;
&lt;td&gt;LLM scorer pairwise comparison&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Faithfulness&lt;/td&gt;
&lt;td&gt;Does it hallucinate&lt;/td&gt;
&lt;td&gt;SelfCheckGPT absolute measurement&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The results are interesting: GraphRAG significantly outperforms traditional RAG on the first three metrics, but they're roughly equal on faithfulness. In other words, GraphRAG's improvement is mainly in "finding more comprehensively," not in "hallucinating less."&lt;/p&gt;




&lt;h2&gt;
  
  
  Don't Just Look at the Strengths — Know the Limitations Too
&lt;/h2&gt;

&lt;p&gt;This is a pitch piece after all, so it naturally emphasizes the positives. A few caveats to keep in mind:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;High indexing cost&lt;/strong&gt; — Every document chunk requires an LLM call to extract entities and relationships. For large datasets, this could take hours or even days. With GPT-4 level models, API costs are considerable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Incremental updates are a hard problem&lt;/strong&gt; — The article doesn't mention what happens when data changes. In practice, new documents require re-extraction and merging, community structures may change as a result, requiring re-clustering and re-generating summaries. There's no good engineering solution for this yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extraction quality depends on the LLM&lt;/strong&gt; — LLM entity and relationship extraction isn't 100% accurate. It may miss implicit entities, get relationships wrong, and different models produce varying extraction quality with inconsistent results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Queries will be slower&lt;/strong&gt; — Graph traversal + LLM generation has a longer pipeline than simple vector retrieval + LLM generation, so latency is naturally higher.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not every question needs it&lt;/strong&gt; — The article itself acknowledges that for simple factual queries (like "What is Novorossiya?"), traditional RAG is sufficient. GraphRAG's advantages are concentrated in multi-hop reasoning and global summarization scenarios.&lt;/p&gt;




&lt;h2&gt;
  
  
  An Analogy to Build Your Intuition
&lt;/h2&gt;

&lt;p&gt;Imagine you're a new employee at a company, and you want to understand "the most important project developments in the last three months."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Traditional RAG is like searching through a filing cabinet&lt;/strong&gt;: You walk into the archive room and search using "project developments" as a keyword. You find dozens of files scattered across different drawers — meeting minutes, emails, reports. You have to piece the fragments together yourself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GraphRAG is like asking a colleague who knows everything&lt;/strong&gt;: They've not only read every document but also remember that "Zhang San's Project A and Li Si's Project B are actually related," and know that "last month's budget adjustment affected three departments." They can give you an organized, complete answer right away.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Traditional RAG&lt;/th&gt;
&lt;th&gt;GraphRAG&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;How it works&lt;/td&gt;
&lt;td&gt;Search keywords, find relevant passages&lt;/td&gt;
&lt;td&gt;Build a relationship network first, then answer along relationships&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Good at&lt;/td&gt;
&lt;td&gt;"What is X?" "How to do X?"&lt;/td&gt;
&lt;td&gt;"What's the relationship between X and Y?" "What's the overall picture?"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Analogy&lt;/td&gt;
&lt;td&gt;A librarian helping you find books&lt;/td&gt;
&lt;td&gt;A detective connecting clues into a complete story&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Weakness&lt;/td&gt;
&lt;td&gt;Fragmented, lacks global perspective&lt;/td&gt;
&lt;td&gt;Building the relationship network takes time and compute&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;GraphRAG doesn't solve the "search more accurately" problem — it solves the "search dimension" problem&lt;/strong&gt; — expanding from text similarity to entity relationships and global structure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The knowledge graph is the means; community clustering is the real innovation&lt;/strong&gt; — Many approaches use graphs to enhance RAG, but community detection + pre-summarization is GraphRAG's unique weapon for global queries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Provenance is the foundation of trust&lt;/strong&gt; — Every assertion can be traced back to the original document. Enterprise applications can't do without this.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The trade-off is indexing cost&lt;/strong&gt; — Using LLMs to process all data for graph construction is much more expensive than simple vectorization. This must be weighed when deploying in production.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Not a replacement, but a complement&lt;/strong&gt; — Use GraphRAG for complex reasoning and global analysis, traditional RAG for simple factual queries. In real systems, combining both is the right approach.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>graphrag</category>
      <category>rag</category>
      <category>knowledgegraph</category>
      <category>communitydetection</category>
    </item>
    <item>
      <title>The Biggest Pitfall in GraphRAG: One Entity, Seven Identities</title>
      <dc:creator>eyanpen</dc:creator>
      <pubDate>Fri, 24 Apr 2026 11:54:16 +0000</pubDate>
      <link>https://dev.to/eyanpen/the-biggest-pitfall-in-graphrag-one-entity-seven-identities-5d8d</link>
      <guid>https://dev.to/eyanpen/the-biggest-pitfall-in-graphrag-one-entity-seven-identities-5d8d</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;You thought the hardest part of GraphRAG was "building the graph." In reality, the hardest part is "assigning entity types" — even when you've predefined a strict type schema.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  1. A Real-World Dataset
&lt;/h2&gt;

&lt;p&gt;We ran GraphRAG entity extraction on 3GPP TS 23.502 (5G Core Network signaling procedure specification). This document is about 700+ pages and one of the most critical standards in the telecom domain.&lt;/p&gt;

&lt;p&gt;The results were painful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A total of &lt;strong&gt;8,873 distinct entities&lt;/strong&gt; were extracted (deduplicated by title)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1,123 entities were assigned 2 or more types&lt;/strong&gt; — 12.7% of the total&lt;/li&gt;
&lt;li&gt;The most extreme case, &lt;code&gt;PMIC&lt;/code&gt;, was classified into &lt;strong&gt;7 different types&lt;/strong&gt;: &lt;code&gt;ARCHITECTURE_CONCEPT&lt;/code&gt;, &lt;code&gt;DATA_TYPE&lt;/code&gt;, &lt;code&gt;INFORMATION_ELEMENT&lt;/code&gt;, &lt;code&gt;MANAGEMENT_ENTITY&lt;/code&gt;, &lt;code&gt;NETWORK_ELEMENT&lt;/code&gt;, &lt;code&gt;PROCEDURE&lt;/code&gt;, &lt;code&gt;PROTOCOL&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note that this experiment &lt;strong&gt;already used a strictly predefined entity type schema&lt;/strong&gt;, with the prompt explicitly constraining the LLM to only use the specified type set. In other words, this isn't chaos caused by "no constraints" — it's &lt;strong&gt;chaos that persists even after constraints are applied&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;What's worse, these "type conflicts" don't occur across different documents — they happen &lt;strong&gt;within the same document&lt;/strong&gt; and even &lt;strong&gt;within the same chunk&lt;/strong&gt;. When the LLM reads a minimal text segment, even with explicit type constraints, it still assigns different types to the same entity.&lt;/p&gt;

&lt;p&gt;We found &lt;strong&gt;63 text_unit-level overlapping conflicts&lt;/strong&gt; — the same entity annotated with two different types within the same text block. For example:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Entity&lt;/th&gt;
&lt;th&gt;Labeled as&lt;/th&gt;
&lt;th&gt;Also labeled as&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AF&lt;/td&gt;
&lt;td&gt;ORGANIZATION&lt;/td&gt;
&lt;td&gt;NETWORK_FUNCTION&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NRF&lt;/td&gt;
&lt;td&gt;INTERFACE&lt;/td&gt;
&lt;td&gt;NETWORK_FUNCTION&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5G SECURITY CONTEXT&lt;/td&gt;
&lt;td&gt;SECURITY_ELEMENT&lt;/td&gt;
&lt;td&gt;ARCHITECTURE_CONCEPT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HPLMN&lt;/td&gt;
&lt;td&gt;NETWORK_FUNCTION&lt;/td&gt;
&lt;td&gt;ORGANIZATION&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SERVICE REQUEST&lt;/td&gt;
&lt;td&gt;INFORMATION_ELEMENT&lt;/td&gt;
&lt;td&gt;PROCEDURE&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This isn't the LLM making rookie mistakes, nor is the schema poorly designed. Think about it: &lt;code&gt;AF&lt;/code&gt; (Application Function) genuinely is both a "network function" and an "organizational role"; &lt;code&gt;NRF&lt;/code&gt; is both a "network function" and exposes "interfaces." These types are all in our predefined schema, and the LLM picks a "legal" type every time — it just picks different legal types for the same entity. &lt;strong&gt;The problem isn't that the LLM judged wrong, nor that the schema isn't strict enough — it's that real-world entities are inherently not single-typed.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Why Is This Problem So Hard?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  2.1 Entities Are Inherently Multi-Faceted
&lt;/h3&gt;

&lt;p&gt;In 3GPP specifications, the term &lt;code&gt;AMF&lt;/code&gt; (Access and Mobility Management Function):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In architecture diagrams, it's a &lt;strong&gt;NETWORK_FUNCTION&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;In signaling procedures, it's a participant in a &lt;strong&gt;PROCEDURE&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;In deployment descriptions, it's a &lt;strong&gt;NETWORK_ELEMENT&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;In interface definitions, it's an endpoint of an &lt;strong&gt;INTERFACE&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The same entity plays different roles in different contexts. This isn't a bug — it's reality.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.2 LLM Type Judgment Depends on the Context Window
&lt;/h3&gt;

&lt;p&gt;GraphRAG entity extraction is performed chunk by chunk. Each text_unit is roughly a few hundred tokens, and the LLM can only see that small segment.&lt;/p&gt;

&lt;p&gt;The same entity &lt;code&gt;PDU SESSION ESTABLISHMENT&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In a chunk describing signaling procedures, the LLM classifies it as &lt;strong&gt;PROCEDURE&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;In a chunk describing message formats, the LLM classifies it as &lt;strong&gt;INFORMATION_ELEMENT&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both judgments are correct, but they conflict when merged into the knowledge graph.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.3 No Matter How Good the Schema, Type Boundaries Are Inherently Fuzzy
&lt;/h3&gt;

&lt;p&gt;We already predefined a type schema, but who defines the boundary between &lt;code&gt;ARCHITECTURE_CONCEPT&lt;/code&gt; and &lt;code&gt;NETWORK_FUNCTION&lt;/code&gt;? In the 3GPP context, many concepts naturally span multiple categories. &lt;code&gt;POLICY CONTROL&lt;/code&gt; is both a "procedure" (PROCEDURE) and an "architectural concept" (ARCHITECTURE_CONCEPT) — both types are in our schema, and the LLM isn't wrong to pick either one.&lt;/p&gt;

&lt;p&gt;This isn't a problem of poorly written prompts or imprecise schema definitions — it's &lt;strong&gt;a fundamental tension between the granularity of type systems and the complexity of the real world&lt;/strong&gt;. You can make the schema more fine-grained, but a finer schema only creates more boundary issues, not fewer.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.4 Scale Amplifies the Problem
&lt;/h3&gt;

&lt;p&gt;Our data shows that among entities with multiple types, the top 20 entities average 4–7 types and are associated with 10–200 descriptions. A core entity like &lt;code&gt;AF&lt;/code&gt; has 209 descriptions, 192 text_unit references, and 4 types.&lt;/p&gt;

&lt;p&gt;When a knowledge graph contains thousands of such "multi-faceted entities," downstream community detection, relationship reasoning, and summary generation are all affected — because the graph structure is polluted by type noise.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. How Does the Industry Currently Address This?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Approach 1: Predefined Strict Type System (Schema-First) ⚠️ We Already Tried This
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Method&lt;/strong&gt;: Before extraction, manually define a strict entity type schema and explicitly constrain the LLM in the prompt to only use these types.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Representatives&lt;/strong&gt;: Microsoft GraphRAG's default configuration, most enterprise knowledge graph projects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Our actual results&lt;/strong&gt;: All the data at the beginning of this article was produced under Schema-First mode. We predefined the type set and explicitly constrained it in the prompt — yet 1,123 entities still had multi-type conflicts, and 63 text_unit-level overlapping conflicts persisted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it's not enough&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Schema can constrain the LLM to "only pick from these types," but &lt;strong&gt;can't constrain it to "pick only one for the same entity"&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Domain concepts are inherently multi-faceted; &lt;code&gt;AF&lt;/code&gt; in the 3GPP context genuinely is both NETWORK_FUNCTION and ORGANIZATION — no schema, however strict, changes this fact&lt;/li&gt;
&lt;li&gt;Requires domain experts to design the schema — high cost, and you need to redesign for each new domain&lt;/li&gt;
&lt;li&gt;Being too strict loses information — forcing &lt;code&gt;AF&lt;/code&gt; to be &lt;code&gt;NETWORK_FUNCTION&lt;/code&gt; discards its semantics as &lt;code&gt;ORGANIZATION&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;: Schema-First is a necessary condition but not a sufficient one. It reduces the "random naming" problem but doesn't solve the fundamental contradiction of "one entity, multiple identities."&lt;/p&gt;

&lt;h3&gt;
  
  
  Approach 2: Allow Multi-Types, Post-Processing Merge (Multi-Label + Post-Processing)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Method&lt;/strong&gt;: Don't limit the number of types during extraction; allow an entity to have multiple types, then merge, deduplicate, and select a primary type through rules or models in post-processing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Representatives&lt;/strong&gt;: LlamaIndex's PropertyGraphIndex, some academic research.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Preserves multi-faceted entity information&lt;/li&gt;
&lt;li&gt;No information loss during extraction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Post-processing logic is complex; rules are hard to enumerate exhaustively&lt;/li&gt;
&lt;li&gt;"Selecting a primary type" itself requires domain knowledge&lt;/li&gt;
&lt;li&gt;Graph complexity increases; query performance degrades&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Suitable for&lt;/strong&gt;: Exploratory analysis, early stages where domain boundaries are uncertain.&lt;/p&gt;

&lt;h3&gt;
  
  
  Approach 3: Hierarchical Typing
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Method&lt;/strong&gt;: Build a hierarchical type system where, for example, &lt;code&gt;NETWORK_FUNCTION&lt;/code&gt; is a subtype of &lt;code&gt;ARCHITECTURE_CONCEPT&lt;/code&gt;. Extract at the finest granularity; aggregate by hierarchy during queries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Representatives&lt;/strong&gt;: Wikidata's type system, YAGO knowledge base.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Balances precision and flexibility&lt;/li&gt;
&lt;li&gt;Supports queries at different granularities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Designing the hierarchy itself is a major undertaking&lt;/li&gt;
&lt;li&gt;LLMs struggle to accurately determine hierarchical relationships during extraction&lt;/li&gt;
&lt;li&gt;Cross-domain hierarchies are hard to unify&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Suitable for&lt;/strong&gt;: Large-scale, long-term knowledge graph projects.&lt;/p&gt;

&lt;h3&gt;
  
  
  Approach 4: Abandon Explicit Types, Use Embeddings (Type-Free + Embedding)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Method&lt;/strong&gt;: Don't assign discrete type labels to entities; instead, use vector embeddings to represent semantic features. Similar entities naturally cluster in vector space.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Representatives&lt;/strong&gt;: Some recent research, such as GNN-based entity representation learning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Completely avoids the type conflict problem&lt;/li&gt;
&lt;li&gt;Captures subtle semantic differences between entities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Loses interpretability — you can't tell users "this is a network function"&lt;/li&gt;
&lt;li&gt;Downstream community detection and summary generation need redesign&lt;/li&gt;
&lt;li&gt;Difficult to debug&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Suitable for&lt;/strong&gt;: Research projects, scenarios with low interpretability requirements.&lt;/p&gt;

&lt;h3&gt;
  
  
  Approach 5: Context-Aware Dynamic Typing
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Method&lt;/strong&gt;: Don't fix types during extraction; instead, dynamically determine entity types based on query context. For example, when a user asks about architecture, &lt;code&gt;AF&lt;/code&gt; is treated as &lt;code&gt;NETWORK_FUNCTION&lt;/code&gt;; when asking about organization, it's treated as &lt;code&gt;ORGANIZATION&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Representatives&lt;/strong&gt;: Currently mostly in the academic exploration stage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Most aligned with reality — an entity's "identity" truly depends on context&lt;/li&gt;
&lt;li&gt;No difficult type decisions needed during extraction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extremely high engineering complexity&lt;/li&gt;
&lt;li&gt;Graph structure can't be determined during offline graph building; community detection algorithms are hard to apply&lt;/li&gt;
&lt;li&gt;Increased query latency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Suitable for&lt;/strong&gt;: A research direction for next-generation GraphRAG systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. My Recommendation: Schema-First Foundation + Layered Types + Primary Type Voting + Context Preservation
&lt;/h2&gt;

&lt;p&gt;Our experiments have proven that Schema-First is a necessary starting point — without it, types become even more chaotic. But it alone isn't enough. Based on our hands-on experience with 3GPP documents, I recommend layering a &lt;strong&gt;pragmatic post-processing approach&lt;/strong&gt; on top of Schema-First:&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 0: Keep Schema-First (Already in Place)
&lt;/h3&gt;

&lt;p&gt;Continue using the predefined type schema to constrain the LLM. This step is already done; its value lies in keeping types within a finite set, preventing the LLM from freely inventing meaningless types like &lt;code&gt;THINGY&lt;/code&gt; or &lt;code&gt;STUFF&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: Preserve All Types During Extraction
&lt;/h3&gt;

&lt;p&gt;On top of Schema-First, don't force a single type during extraction. If the LLM picks multiple types from the predefined set, keep them all. Preserve every (entity, type, text_unit) triple. This is the raw signal — once lost, it can't be recovered.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 2: Statistical Voting for Primary Type
&lt;/h3&gt;

&lt;p&gt;For each entity, count how many times it's annotated as each type across all text_units, and select the most frequent as the &lt;strong&gt;primary type&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Taking &lt;code&gt;AF&lt;/code&gt; as an example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;NETWORK_FUNCTION: 150 occurrences → &lt;strong&gt;primary type&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;ORGANIZATION: 30 occurrences&lt;/li&gt;
&lt;li&gt;ARCHITECTURE_CONCEPT: 20 occurrences&lt;/li&gt;
&lt;li&gt;NETWORK_ELEMENT: 9 occurrences&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The primary type is used for the knowledge graph's main structure, community detection, and default queries.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 3: Preserve Alternative Types as Properties
&lt;/h3&gt;

&lt;p&gt;Other types aren't discarded — they're stored as the entity's &lt;code&gt;alternative_types&lt;/code&gt; property, available for use during queries as needed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AF"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"primary_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"NETWORK_FUNCTION"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"alternative_types"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"ORGANIZATION"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ARCHITECTURE_CONCEPT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"NETWORK_ELEMENT"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type_distribution"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"NETWORK_FUNCTION"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ORGANIZATION"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ARCHITECTURE_CONCEPT"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"NETWORK_ELEMENT"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Layer 4: Type Conflict Detection and Manual Review
&lt;/h3&gt;

&lt;p&gt;For text_unit-level overlapping conflicts (same entity labeled as different types within the same chunk), flag them as candidates for review. These 63 conflicts are the most worth manually checking — they often reveal blind spots in the type system design.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's the Cost?
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Increased storage&lt;/strong&gt;: Each entity stores multiple types and distribution info; graph data volume increases by roughly 20–30%.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No change to extraction&lt;/strong&gt;: No need to modify prompts or extraction pipelines; no additional cost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Post-processing development needed&lt;/strong&gt;: The voting, merging, and conflict detection pipeline requires additional development — roughly 2–3 days of engineering effort.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slightly more complex queries&lt;/strong&gt;: The query layer needs to decide whether to use the primary type or all types, but this logic can be encapsulated.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Can't be fully automated&lt;/strong&gt;: Text_unit-level conflicts still require human judgment, but the volume is manageable (only 63 in our case).&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  5. Final Thoughts
&lt;/h2&gt;

&lt;p&gt;GraphRAG papers and blog posts always focus on the flashy capabilities like "community detection" and "global queries," but when it comes to real-world deployment, &lt;strong&gt;entity type chaos is the first roadblock&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;One TS 23.502 document, 8,873 entities, 1,123 with multi-type conflicts — and this is &lt;strong&gt;after applying Schema-First constraints&lt;/strong&gt;. This isn't an edge case; it's the norm for all complex domain documents. Predefined type schemas are necessary but far from sufficient.&lt;/p&gt;

&lt;p&gt;There's no silver bullet for this problem. But at least we can: &lt;strong&gt;build on Schema-First, avoid losing information during post-processing, use statistical methods to select primary types, preserve multi-faceted nature for downstream use, and keep the conflicts that truly need human judgment within a manageable scope.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the gap between "running a demo" and "going to production" in GraphRAG — and it's the most important one to fill.&lt;/p&gt;

</description>
      <category>graphrag</category>
      <category>entitytyping</category>
      <category>knowledgegraph</category>
      <category>rag</category>
    </item>
  </channel>
</rss>
