- Business Pain Point: In highly hierarchical, long-document domains such as finance and healthcare, traditional vector search (including hybrid retrieval) suffers from structural recall failure when facing cross-chapter logical comparison queries.
- Architectural Breakthrough: Designed the FoC (Forest of Clauses) architecture, elevating the document's table of contents to a "first-class citizen." It employs a dual-engine concurrent retrieval of Top-down (LLM tree-structure routing) and Bottom-up (vector fragment search), assembling a precision "subtree" in memory.
- Engineering Barriers: Built a custom $O(N)$ stack-based parser to dynamically construct clause forests with non-standard hierarchies (a blind spot for general-purpose commercial parsers); introduced vLLM Prefix Caching to resolve long-context performance bottlenecks, reducing TTFT from seconds to milliseconds at medium-to-high concurrency; integrated Guided Decoding to guarantee 100% structured output.
- Bottom Line: Eliminated the recall blind spot for complex logical Q&A without adding significant inference cost, delivering a domain-specialized RAG retrieval engine.
## 1. The Pain Point
The true measure of a RAG system's maturity is not how well it answers questions from Wikipedia, but whether it breaks down when confronted with real-world business documents—dense with complex sentences and deeply nested structures, such as insurance policies and legal contracts.
Consider a real production case. A user asks:
"What are the differences in coverage between the primary policy and the supplementary rider?"
Relying solely on Vector Search (even Dense + Sparse hybrid retrieval with Rerank), the result is often disastrous:
- System behavior: Retrieved 3–5 chunks, all concentrated in the "Coverage" chapter of the primary policy.
- Critical failure: The supplementary rider's clauses were completely missed! The rider starts on page 15 of the PDF, and from a pure semantic perspective, its embedding distance from "primary policy coverage" is too large.
Architectural Root Cause Analysis:
We came to a clear realization: vector search is fundamentally a Bottom-up "blind man's elephant" approach. It shatters a well-structured document into hundreds of chunks, discarding the global table of contents, chapter nesting, exclusion clauses, and supplementary relationships. When answering questions that require cross-chapter logical association, fragmented chunks simply cannot reconstruct the full picture. General-purpose commercial solutions hit a structural ceiling here.
## 2. The Dual-Engine Architecture and the "God's-Eye View"
If vector search loses document structure, why not let the LLM glance at the "table of contents" first?
This is the core idea behind FoC (Forest of Clauses): instead of blind-searching in embedding space, let the LLM navigate the document's hierarchical structure (the ToC tree) like a human flipping through a book for global routing.
We designed a loosely coupled, dual-path concurrent architecture (`asyncio.gather`) at the retrieval layer:
- Vector Retrieval (Bottom-up): Handles fine-grained entity matching (e.g., finding the exact definition of "thyroid cancer" or "proton therapy").
- FoC Retrieval (Top-down): Handles cross-chapter logical association. Feeds the entire document's "skeleton" (chapter titles) to the LLM, which selects the relevant chapter IDs.
These two retrieval paths run independently and merge their results in memory.
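The concurrent fan-out described above can be sketched in a few lines. This is an illustrative skeleton, not the production code: `vector_search` and `foc_route` are hypothetical stand-ins for the real vector store query and the LLM tree-routing call, and the merge policy (order-preserving deduplication) is one reasonable choice among several.

```python
import asyncio

async def vector_search(query: str) -> list[str]:
    # Bottom-up engine: fine-grained chunk retrieval from the vector store.
    await asyncio.sleep(0)  # placeholder for real async I/O
    return ["chunk_3.10.22", "chunk_1.2.4"]

async def foc_route(query: str) -> list[str]:
    # Top-down engine: LLM selects relevant clause IDs from the ToC skeleton.
    await asyncio.sleep(0)
    return ["chunk_1.2.4", "chunk_7.1"]

async def retrieve(query: str) -> list[str]:
    # Run both engines concurrently and merge their results in memory.
    bottom_up, top_down = await asyncio.gather(
        vector_search(query), foc_route(query)
    )
    seen, merged = set(), []
    for cid in bottom_up + top_down:
        if cid not in seen:          # order-preserving deduplication
            seen.add(cid)
            merged.append(cid)
    return merged
```

A caller would invoke this as `asyncio.run(retrieve("..."))`; because the two paths share no state until the merge, a failure in one can be isolated without blocking the other.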
### 2.1 Overall Architecture
*(Architecture diagrams: Ingestion Phase (Offline) and Retrieval Phase (Online))*
To minimize inference cost, we placed a 9B model as an intent router at the entry point: simple definitional queries (fact-type) take the vector fast path, while only complex clause-comparison queries (logic-type) trigger the full FoC concurrent retrieval, ensuring compute is allocated where it matters.
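The routing decision itself is a small classification step. In production it comes from the 9B model; the sketch below substitutes a keyword heuristic purely so the control flow is visible, and the marker list and labels are illustrative.

```python
# Illustrative stand-in for the 9B intent router. A real deployment would
# replace the keyword check with a small-model classification call.
LOGIC_MARKERS = ("difference", "compare", "versus", "both", "between")

def route(query: str) -> str:
    q = query.lower()
    if any(m in q for m in LOGIC_MARKERS):
        return "foc"     # logic-type: full dual-engine retrieval
    return "vector"      # fact-type: vector fast path only
```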
## 3. Core Engineering: Doing What's Hard but Right
Making this mechanism run reliably and efficiently in production required four key engineering efforts:
### 3.1 Building a Domain Moat: Single-Pass Stack Scan with Non-Standard Parsing
The raw input for insurance clauses is a linear Markdown text stream, from which we needed to reconstruct a tree. General-purpose parsers (e.g., LlamaParse) only recognize standard `#` and `##` headings, but insurance documents use highly non-standard title formats: "Part I", "Article 1", "(i)", and so on.
Rather than compromising with generic tools, we built a custom parser: a set of regex heading matchers driving a single-pass, stack-based parsing algorithm:
- Maintain a node stack; when a heading of the same or higher level is encountered, `pop` to backtrack.
- The core logic is minimal: `while current_level <= self.stack[-1].level: self.stack.pop()`
- Engineering payoff: strict $O(N)$ time complexity, so processing a 50-page policy takes milliseconds. More importantly, node IDs are assigned in monotonically increasing order, establishing the data-structure foundation for $O(\log N)$ `find_node_by_id` lookups during retrieval.
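Fleshed out, the single-pass stack scan looks roughly like this. The heading patterns and node fields are illustrative, a tiny subset of the production regex set, but the backtracking loop is exactly the one quoted above.

```python
import re
from dataclasses import dataclass, field

# Illustrative heading matchers; the real engine covers many more formats.
HEADING_PATTERNS = [
    (1, re.compile(r"^Part\s+[IVX]+")),      # "Part I"
    (2, re.compile(r"^Article\s+\d+")),      # "Article 1"
    (3, re.compile(r"^\(\w+\)")),            # "(i)", "(a)"
]

@dataclass
class Node:
    id: int
    level: int
    title: str
    children: list = field(default_factory=list)

def parse(lines):
    root = Node(id=0, level=0, title="ROOT")
    stack = [root]
    next_id = 1
    for line in lines:
        for level, pat in HEADING_PATTERNS:
            if pat.match(line):
                # Backtrack: pop same-level or deeper nodes off the stack.
                while level <= stack[-1].level:
                    stack.pop()
                node = Node(id=next_id, level=level, title=line.strip())
                next_id += 1               # monotonically increasing IDs
                stack[-1].children.append(node)
                stack.append(node)
                break
        # (non-heading lines would attach to stack[-1] as body text)
    return root
```

Each line is touched once and each node is pushed and popped at most once, which is where the $O(N)$ bound comes from.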
### 3.2 Fragment Tracing: GPS Coordinates for Every Chunk
During the Ingestion phase, we tag every chunk with two critical labels:
- `clause_id`: The ID of the clause node it belongs to.
- `clause_path`: The full ancestor chain from root to that node (e.g., "3.10.22", representing Primary Policy → Exclusions → Intentional Crime).
Architectural significance: With `clause_path`, any isolated fragment retrieved from the vector store can be instantly traced back to its complete lineage (ancestor node chain), achieving a reverse mapping from unordered fragments to ordered structure.
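Because `clause_path` is a dotted ID chain, the reverse mapping is just prefix expansion plus dictionary lookups. A minimal sketch, where `TITLES` is illustrative sample data standing in for the real clause-tree index:

```python
# Illustrative clause-title index; in production this lives in the tree store.
TITLES = {
    "3": "Primary Policy",
    "3.10": "Exclusions",
    "3.10.22": "Intentional Crime",
}

def lineage(clause_path: str) -> list[str]:
    ids = clause_path.split(".")
    # Expand "3.10.22" into ["3", "3.10", "3.10.22"], then resolve titles.
    chain = [".".join(ids[: i + 1]) for i in range(len(ids))]
    return [TITLES[c] for c in chain]
```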
### 3.3 Deterministic Output: LLM Routing + Guided Decoding
At retrieval time, we render the entire tree as a minimal Markdown table of contents and send it to the LLM, which returns a list of relevant IDs.
Engineering pitfall: Under high concurrency, even a 35B model like Qwen can produce truncated or malformed JSON output (~15% failure rate).
Solution: Deep integration with vLLM's Guided Decoding (FSM constraint). By enforcing output conformance to a predefined JSON Schema at the token sampling stage, we pushed the structured parsing success rate to 100%, ensuring production-grade reliability.
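Concretely, the constraint is expressed as a JSON Schema and passed to vLLM's OpenAI-compatible server via the `guided_json` field in `extra_body`. The sketch below builds the request payload; the model name, prompts, and the exact schema shape are placeholders, and the network call itself (`client.chat.completions.create(**payload)` with the openai SDK) is omitted.

```python
# Schema constraining the router to emit exactly {"clause_ids": [int, ...]}.
CLAUSE_IDS_SCHEMA = {
    "type": "object",
    "properties": {
        "clause_ids": {"type": "array", "items": {"type": "integer"}},
    },
    "required": ["clause_ids"],
}

def build_request(toc_markdown: str, question: str) -> dict:
    # Payload for vLLM's OpenAI-compatible endpoint; guided_json triggers
    # FSM-constrained sampling so the reply always parses.
    return {
        "model": "qwen-router",  # placeholder model name
        "messages": [
            {"role": "system", "content": toc_markdown},
            {"role": "user", "content": question},
        ],
        "extra_body": {"guided_json": CLAUSE_IDS_SCHEMA},
    }
```

Because the FSM rejects any token that would leave the schema, truncated or malformed JSON becomes structurally impossible rather than merely unlikely.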
### 3.4 Extreme Context Optimization: Assembling the "Pruned Bonsai"
Finally, we merge the clauses selected by FoC with the fragments retrieved by Vector search. Using `clause_path`, we extract all relevant nodes and their ancestors from the original clause tree and render them as a subtree.
Think of it as providing the final inference model with a precision-pruned bonsai—preserving cross-chapter hierarchical logic (what's a prerequisite for what, what's an exception to what) while stripping away irrelevant branch noise, dramatically reducing the inference model's context pressure and hallucination probability.
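The pruning step itself is small: expand every selected path into its ancestor chain, keep only that node set, and render it with indentation. A minimal sketch with illustrative data in `TITLES` and a simple two-space-per-level rendering:

```python
# Illustrative clause-title index standing in for the real clause tree.
TITLES = {
    "1": "Primary Policy",
    "1.3": "Coverage",
    "2": "Supplementary Rider",
    "2.2": "Coverage",
}

def assemble_subtree(selected_paths: list[str]) -> str:
    keep = set()
    for path in selected_paths:
        ids = path.split(".")
        for i in range(len(ids)):          # include every ancestor
            keep.add(".".join(ids[: i + 1]))
    lines = []
    # Sort numerically by path components so siblings appear in document order.
    for path in sorted(keep, key=lambda p: [int(x) for x in p.split(".")]):
        depth = path.count(".")
        lines.append("  " * depth + f"{path} {TITLES[path]}")
    return "\n".join(lines)
```

For the motivating query, selecting the two "Coverage" nodes yields a subtree that keeps both chapters' lineage while dropping every unrelated branch, which is exactly the context the final inference model sees.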
## 4. The Performance Battle: Surviving 7K Long-Context at Low Cost
The FoC architecture introduced a significant performance challenge: long-context inference cost.
A clause tree skeleton for a policy with primary and supplementary coverage easily reaches 6K tokens. Every retrieval request requires the LLM to read the entire skeleton, and the Prefill computation causes TTFT (Time to First Token) to spike while concurrency capacity collapses.
If we can't solve the cost and latency problem, this architecture is just a toy.
### Solution: Squeezing Every Drop from vLLM Prefix Caching
Analysis revealed that FoC's prompt structure is practically tailor-made for Prefix Caching:
- System Prompt (6K+ tokens): The clause tree skeleton, completely static for a given insurance product.
- User Prompt (~50 tokens): The question, different each time.
By enabling `--enable-prefix-caching` on the inference engine, the 6K+ KV Cache is computed only once on the first request. All subsequent concurrent requests for the same policy hit the cached KV directly in GPU memory, reducing the long-context Prefill computation from $O(N^2)$ to near $O(1)$ (only the few dozen new query tokens need computation).
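Operationally this is a one-flag change on the serving side; a sketch of the launch command, with a placeholder model path (note that recent vLLM versions enable automatic prefix caching by default, so the explicit flag is shown for clarity):

```shell
# Placeholder model path; flag spelled out explicitly for older vLLM versions.
vllm serve /models/qwen-router \
    --enable-prefix-caching \
    --max-model-len 8192
```

The cache only helps if the static skeleton sits at the *front* of the prompt, which is why the ToC goes in the system prompt and the per-request question comes last.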
Benchmark Results (Qwen3.5-35B-A3B full-precision, RTX PRO 6000 96GiB, ~6K input tokens):
| Scenario | TTFT P99 (Disabled) | TTFT P99 (Enabled) | Improvement |
|---|---|---|---|
| Cache-hit single request | 227 ms | 111 ms | 2x |
| c=4 concurrency | ~249 ms | ~205 ms | 1.2x |
| c=22 concurrency | ~4.9 s | ~747 ms | 7x |
| c=28 concurrency | Unavailable | ~996 ms | Qualitative shift |
Core Architectural Gains:
- Free concurrency leverage: At high concurrency (c=22), queuing latency dropped from 4.9s to 0.7s. Without any additional hardware cost, the single-GPU usable concurrency ceiling was pushed from c≤16 to c=32+, truly meeting the stringent requirements for commercial deployment.
- Eliminating the long-context first-token penalty: After system warm-up (cache hit), single long-context inference TTFT is compressed to the 100ms range, completely eliminating user "spinning wheel anxiety."
## 5. Architecture Decision: Why Not RAPTOR?
In the tree-structured RAG space, Stanford's RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval) is another widely discussed approach. Its construction philosophy runs in the opposite direction to FoC's, and in real-world medical claims and financial compliance scenarios we chose to build FoC:
| Dimension | FoC (Our Approach) | RAPTOR |
|---|---|---|
| Construction Direction | Top-down: Leverages a single-pass parsing engine, strictly following the document's physical hierarchy. | Bottom-up: Semantically clusters text chunks and relies on LLMs to generate summary parent nodes. |
| Ingestion Cost | $O(N)$ stack-based parsing, zero LLM calls, zero I/O, millisecond-level generation | Recursive LLM summarization at each level, high cost |
| Traceability | Precise source tracing to original document location | Summary nodes are generated content, no direct mapping to source text |
RAPTOR's emergence in academia validates from the outside that "building a tree for documents (Tree-Organized)" is the correct direction for solving long-document RAG recall blind spots. However, architecture decisions must respect the business context. FoC absorbed RAPTOR's tree-based retrieval vision while replacing non-deterministic clustering with deterministic parsing—eliminating the weaknesses that would be fatal in compliance scenarios, and achieving a fusion of academic frontier with financial-grade engineering practice.
## 6. Conclusion: Structure Is Knowledge
In the world of RAG, we often place too much faith in the magic of Embeddings while overlooking the inherent shape of the data itself.
The core design philosophy of FoC (Forest of Clauses) is: Structure is Knowledge.
When a document possesses a rigorous hierarchical structure, that structure should never be discarded as mere Metadata—it is, in itself, an exceptionally powerful retrieval signal. By elevating the "table of contents" to a first-class citizen, using a Top-down global perspective to compensate for Bottom-up local blind spots, and applying hard engineering (Prefix Caching, FSM constraints, stack parsing) to flatten the resulting performance overhead, we found a workable balance of accuracy, performance, and cost for complex financial document Q&A. We hope it offers some architectural inspiration to fellow practitioners.