Introduction
GraphRAG's Local Search needs to select the most relevant raw text fragments (Text Units) associated with the knowledge graph to fill the LLM context window during query time. This selection strategy seems simple — sort by entity similarity, fill one by one — but in real-world scenarios it exposes a significant limitation: popular entities can monopolize the entire Text Unit budget, causing key text from other entities to be truncated.
This article provides an in-depth analysis of the root cause of this problem, the core problem it was designed to solve, and possible improvement directions.
What Is the Current Strategy
Local Search's Text Unit selection has four steps:
- Iteration: walk through the selected entities (ranked by vector similarity), collecting each entity's associated `text_unit_ids`
- Deduplication: each Text Unit (TU) is attributed only to the first entity that claims it
- Sorting: by `(entity_index, -num_relationships)`; entity order takes priority, and within the same entity TUs are sorted by relationship density in descending order
- Filling: TUs are added to the context one by one until the token limit is reached (by default 50% of the total budget, approximately 6000 tokens)
Core code:
```python
for index, entity in enumerate(selected_entities):
    entity_relationships = [
        rel
        for rel in relationships
        if rel.source == entity.title or rel.target == entity.title
    ]
    for text_id in entity.text_unit_ids or []:
        if text_id not in text_unit_ids_set and text_id in self.text_units:
            num_relationships = count_relationships(
                entity_relationships, self.text_units[text_id]
            )
            text_unit_ids_set.add(text_id)
            unit_info_list.append(
                (self.text_units[text_id], index, num_relationships)
            )

unit_info_list.sort(key=lambda x: (x[1], -x[2]))
```
Problem Scenario: Popular Entities Monopolize the Budget
Concrete Example
Suppose the user asks: "What is the anti-inflammatory mechanism of chamazulene?"
Entities returned by vector search:
| Rank | Entity | Associated TU Count | Notes |
|---|---|---|---|
| 0 | Chamomile | 50 | High-frequency entity, mentioned in almost all herbal documents |
| 1 | Chamazulene | 4 | Active component of chamomile, fewer specialized references |
| 2 | NF-κB pathway | 2 | Specific anti-inflammatory molecular mechanism |
TU attribution after deduplication:
```
index 0 "Chamomile":   TU1, TU2, TU3, ..., TU50   (50 items)
index 1 "Chamazulene": TU51, TU52                 (TU1, TU5 already claimed by Chamomile)
index 2 "NF-κB":       TU53                       (only 1 unclaimed)
```
Sorting result:
```
TU1(index=0, rel=5) → TU2(index=0, rel=4) → ... → TU50(index=0, rel=0)
  → TU51(index=1, rel=2) → TU52(index=1, rel=1)
  → TU53(index=2, rel=1)
```
Assuming a token budget of 6000 tokens and each TU averaging 300 tokens, only about 20 TUs can fit.
Result: All top 20 positions are occupied by "Chamomile" TUs. The text about "chamazulene's anti-inflammatory mechanism" that the user actually cares about (TU51, TU52, TU53) is entirely truncated. The context fed to the LLM is filled with generic introductions about "Chamomile" but contains no original text supporting chamazulene's specific molecular mechanisms.
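The truncation above can be checked with a minimal simulation, using the made-up numbers from this example (50 TUs for Chamomile, 2 for Chamazulene, 1 for NF-κB, a flat 300 tokens per TU against a 6000-token budget):

```python
# Build the sorted TU list exactly as the example's sort would produce it:
# all of entity 0's TUs first, then entity 1's, then entity 2's.
sorted_tus = (
    [("Chamomile", f"TU{i}") for i in range(1, 51)]
    + [("Chamazulene", "TU51"), ("Chamazulene", "TU52")]
    + [("NF-κB", "TU53")]
)

budget, tokens_per_tu = 6000, 300
selected = sorted_tus[: budget // tokens_per_tu]  # only the first 20 TUs fit

entities_in_context = {entity for entity, _ in selected}
print(len(selected))        # 20
print(entities_in_context)  # {'Chamomile'}: only the popular entity survives
```

Every surviving TU belongs to Chamomile; Chamazulene and NF-κB contribute nothing to the context.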
Why It Was Designed This Way: What Problem It Solves
This strategy was not designed arbitrarily — it solves a more fundamental problem: ensuring that the most semantically relevant entities receive the most comprehensive original text support.
The Scenario It Addresses
Suppose the user asks: "What is the status of chamomile in European traditional medicine?"
Vector search returns:
| Rank | Entity | Associated TU Count |
|---|---|---|
| 0 | Chamomile | 50 |
| 1 | European Herbalism | 8 |
| 2 | Lavender | 30 |
In this scenario, "Chamomile" is indeed the core entity: it is what the user is asking about. If a round-robin strategy were used (taking 1 TU from each entity in turn), Lavender's 30 TUs would split the budget evenly with Chamomile's, even though the user never asked about lavender.
The advantages of the current strategy:
- Respects semantic ranking: The entity with the highest vector similarity gets the most original text support, which is correct in most cases
- Relationship density sorting ensures quality: Among multiple TUs for the same entity, the most information-dense ones come first
- Deduplication avoids redundancy: The same TU won't appear repeatedly because it's associated with multiple entities
Core Trade-off
This is a classic relevance depth vs. coverage breadth trade-off:
- The current strategy chooses depth: ensuring the most relevant entity has sufficient original text evidence
- The cost is breadth: secondary entities may have no original text support at all
For most "questions about a specific entity" (the design target of Local Search), depth-first is reasonable. The problem emerges when queries involve cross-entity relationships.
The Essence of the Problem: A Single Sorting Dimension Cannot Express Multi-Objective Optimization
Text Unit selection is fundamentally a multi-objective optimization problem:
- Relevance: The semantic relevance of a TU to the query (expressed indirectly through entity ranking)
- Information density: The number of relationships contained in a TU
- Coverage: Ensuring every selected entity has original text support
- Diversity: Avoiding homogeneous content flooding the context
The current strategy uses a single tuple (entity_index, -num_relationships) attempting to optimize the first two objectives simultaneously, but completely ignores the latter two.
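The lexicographic nature of that tuple is what makes coverage impossible to express: the second element only ever breaks ties within one entity. A tiny demonstration with hypothetical TUs (`TU_a`/`TU_b`/`TU_c` are made-up names):

```python
# (tu_id, entity_index, num_relationships)
tus = [
    ("TU_a", 0, 0),  # entity 0, zero relationships
    ("TU_b", 1, 9),  # entity 1, very information-dense
    ("TU_c", 0, 3),
]

# The same sort key as GraphRAG's unit_info_list.sort
tus.sort(key=lambda x: (x[1], -x[2]))

print([t[0] for t in tus])  # ['TU_c', 'TU_a', 'TU_b']
```

Even a zero-relationship TU of entity 0 outranks entity 1's densest TU, so no amount of information density can buy a lower-ranked entity a better position.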
Improvement Directions
Approach 1: Per-Entity Cap
The simplest improvement — set a TU contribution cap for each entity:
```python
MAX_TU_PER_ENTITY = 5

for index, entity in enumerate(selected_entities):
    count = 0
    for text_id in entity.text_unit_ids or []:
        if count >= MAX_TU_PER_ENTITY:
            break
        if text_id not in text_unit_ids_set and text_id in self.text_units:
            # ... addition logic unchanged
            count += 1
```
Pros: Simple to implement, guarantees each entity at least has a chance to contribute TUs
Cons: Cap value is hard to determine; if an entity genuinely needs extensive original text support, it gets artificially limited
Approach 2: Round-Robin
Each round takes 1 TU from each entity (selecting the best by relationship density), cycling until the budget is exhausted:
```python
# entity_queues[i]: entity i's TUs, pre-sorted by relationship density
entity_queues = {
    i: sorted_tus_for_entity(i) for i in range(len(selected_entities))
}
result = []
while budget > 0 and any(entity_queues.values()):
    for i in range(len(selected_entities)):
        if not entity_queues[i]:
            continue
        tu = entity_queues[i].pop(0)
        cost = token_count(tu)
        if cost > budget:
            continue  # skip TUs that no longer fit the remaining budget
        result.append(tu)
        budget -= cost
```
Pros: Guarantees coverage, every entity has original text support
Cons: Depth of the most relevant entity is diluted; lower-ranked irrelevant entities also receive equal budget
Approach 3: Weighted Quota Allocation
Allocate TU quotas based on entity vector similarity scores:
```python
# Assuming similarity scores: [0.95, 0.82, 0.71]
scores = [0.95, 0.82, 0.71]
total = sum(scores)
quotas = [int(max_tus * s / total) for s in scores]
# quotas = [14, 12, 11] (assuming max_tus=39; int() truncates,
# leaving 2 unallocated slots to redistribute)
```
Pros: Balances depth and breadth; higher-relevance entities get more quota without monopolizing
Cons: Increased implementation complexity; requires preserving similarity scores from vector search results (not retained in current code)
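One wrinkle with integer quotas is that truncation leaves a few slots unallocated. A standard fix is largest-remainder allocation; the sketch below is a hypothetical helper (not part of GraphRAG) that hands leftover slots to the entities with the largest fractional remainders:

```python
def allocate_quotas(scores, max_tus):
    """Allocate TU quotas proportionally to similarity scores,
    distributing leftover slots by largest fractional remainder."""
    total = sum(scores)
    raw = [max_tus * s / total for s in scores]
    quotas = [int(r) for r in raw]
    leftover = max_tus - sum(quotas)
    # Hand out the leftover slots by descending fractional part
    order = sorted(
        range(len(scores)), key=lambda i: raw[i] - quotas[i], reverse=True
    )
    for i in order[:leftover]:
        quotas[i] += 1
    return quotas

print(allocate_quotas([0.95, 0.82, 0.71], 39))  # [15, 13, 11]
```

This guarantees the quotas sum exactly to `max_tus`, so no budget is wasted by truncation.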
Approach 4: Minimum Guarantee + Remaining Competition
Guarantee each entity at least N TUs (e.g., 2), with remaining budget competed for using the current strategy:
```python
# Phase 1: guarantee each entity its 2 densest TUs
for entity in selected_entities:
    guaranteed_tus = top_2_by_relationship_density(entity)
    result.extend(guaranteed_tus)

# Phase 2: fill the remaining budget using the original sorting strategy
remaining = all_tus - guaranteed_tus
remaining.sort(key=lambda x: (x.entity_index, -x.num_relationships))
fill_until_budget(remaining)
```
Pros: Guarantees coverage while preserving the depth advantage of the original strategy
Cons: If many entities are selected, the guarantee phase may consume significant budget
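The two-phase idea can be made concrete. The following is a self-contained sketch, assuming each TU is a dict with `id`, `entity_index`, and `num_relationships` and a uniform cost of 1 per TU (real code would subtract token counts):

```python
from collections import defaultdict

def select_text_units(tus, guarantee=2, budget=20):
    """Approach 4 sketch: guarantee the densest `guarantee` TUs per
    entity, then fill the rest of the budget with the original sort."""
    by_entity = defaultdict(list)
    for tu in tus:
        by_entity[tu["entity_index"]].append(tu)

    # Phase 1: per-entity guarantee, best TUs by relationship density
    guaranteed = []
    for idx in sorted(by_entity):
        best = sorted(by_entity[idx], key=lambda t: -t["num_relationships"])
        guaranteed.extend(best[:guarantee])

    # Phase 2: remaining TUs compete under the original sort key
    remaining = [t for t in tus if t not in guaranteed]
    remaining.sort(key=lambda t: (t["entity_index"], -t["num_relationships"]))

    selected = guaranteed[:budget]
    selected += remaining[: budget - len(selected)]
    return [t["id"] for t in selected]
```

With the article's shape of data (one entity holding most TUs, two tail entities with a handful), every entity lands at least one TU in the result while the head entity still takes the bulk of the budget.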
Summary
| Dimension | Current Strategy | Issue |
|---|---|---|
| Relevance depth | ✅ Excellent | — |
| Information density | ✅ Excellent | — |
| Coverage breadth | ❌ Missing | Popular entities monopolize budget |
| Content diversity | ❌ Missing | Homogenization risk |
GraphRAG's current Text Unit selection strategy is a "depth-first" design that performs well for "questions about a single entity" scenarios, but exposes insufficient coverage when queries involve multi-entity cross-relationships.
The most pragmatic improvement is Approach 4 (Minimum Guarantee + Remaining Competition) — it guarantees that every selected entity has at least some original text support with minimal code changes, without breaking the original strategy's advantages in mainstream scenarios.