JunYoungMoon

Posted on Dec 15, 2025 • Edited on Dec 18, 2025

Our RAG system still failed on hierarchical metrics — Part 2

#rag #springboot #openai #systemdesign

How I Built a RAG System That Actually Understands Business Metrics (Part 2: Hierarchical Search)

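TLDR: Part 1 could find flat metrics like "total_revenue" but failed on hierarchical queries like "Naver traffic". This article shows how I fixed that with a tree-based metric system and 2-stage GPT filtering. Result: accuracy went from 95% (flat metrics only) to 98% (all queries, including hierarchical).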

The Real Problem

Remember Part 1? We built this:

User: "What's the conversion rate?"
System: Found "conversion_rate"

But then this happened:

User: "How's my Naver traffic?"
System: Found "traffic_by_channel"
        But which channel? Naver? Google? Facebook?

The gap: Our system understood categories but couldn't navigate their subcategories.

💡 The Solution in 3 Moves

Think of it like a chess game. We need three moves to checkmate:

Move 1: SEARCH      → Cast a wide net (50 candidates)
Move 2: CLASSIFY    → Filter intelligently (3 names → 12 nodes)
Move 3: REFINE      → Validate relationships (final 2 metrics)

Let's see each move in action.

Move 1: The Setup (Vector Search)

The Challenge: Metrics have hierarchy

traffic_by_channel/           ← Parent (1-depth)
├── naver                     ← Child (2-depth)
├── google
├── facebook
└── direct_traffic

The Strategy: Use TWO search indices

Index	Purpose	Top N
1-depth	Find categories	10
2-depth	Find specific values	40

Why 40 vs 10? Searching 2-depth metrics is like looking up "John" in a phone book with 1000 Johns: many entries look alike, so we need a wider pool of candidates to be sure the right one is in it.

Example Result:

Query: "How's my Naver traffic?"

1-depth finds: traffic_by_channel, site_duration, bounce_rate...
2-depth finds: naver, google, facebook, traffic_count...

Total: 50 candidates

Key Insight: Don't search for "traffic_by_channel|naver" directly. Search for each piece separately, then connect them.
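Here is a minimal sketch of Move 1, assuming each index can be wrapped behind a small search interface. VectorIndex and TwoIndexSearch are hypothetical names for illustration, not classes from the actual system. The only details taken from the article are the two indices and their top-N values.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical abstraction over one vector index (an assumption for this sketch). */
interface VectorIndex {
    /** Returns up to topN metric names ranked by similarity to the query. */
    List<String> search(String query, int topN);
}

final class TwoIndexSearch {
    private static final int DEPTH1_TOP_N = 10; // categories
    private static final int DEPTH2_TOP_N = 40; // specific values

    private final VectorIndex depth1Index;
    private final VectorIndex depth2Index;

    TwoIndexSearch(VectorIndex depth1Index, VectorIndex depth2Index) {
        this.depth1Index = depth1Index;
        this.depth2Index = depth2Index;
    }

    /** Searches each index separately and concatenates the results (up to 50 names). */
    List<String> candidates(String query) {
        List<String> merged = new ArrayList<>(DEPTH1_TOP_N + DEPTH2_TOP_N);
        merged.addAll(depth1Index.search(query, DEPTH1_TOP_N));
        merged.addAll(depth2Index.search(query, DEPTH2_TOP_N));
        return merged;
    }
}
```

The two result lists are deliberately not joined into "parent|child" pairs here. Connecting the pieces is left to the later moves.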

Move 2: The Play (Smart Classification)

This move has 3 sub-plays. Watch closely.

Sub-Play 2.1: Remove the Noise 🧹

Problem: 50 candidates include junk

For "Naver traffic":
naver          ← Relevant
traffic_count  ← Maybe relevant?
instagram      ← Wrong channel
bounce_rate    ← Not about traffic sources

Solution: Ask GPT (Structured Outputs)

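// Force GPT to return a clean JSON array.
// The schema sits under "json_schema" and needs a "name";
// "metric_filter" is an illustrative value.
"response_format": {
  "type": "json_schema",
  "json_schema": {
    "name": "metric_filter",
    "strict": true,
    "schema": {
      "type": "object",
      "properties": {
        "metrics": {
          "type": "array",
          "items": {"type": "string"}
        }
      },
      "required": ["metrics"],
      "additionalProperties": false
    }
  }
}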

Output:

{
  "metrics": ["naver", "direct_traffic", "traffic_count"]
}

Result: 50 → 3 ✂️

Sub-Play 2.2: From Names to Nodes

Problem: "naver" is just a string. We need context.

What we need to know:

Is this advertiser's data or industry average?
What are the parent metrics?
What other "naver" nodes exist?

Solution: MetricForest lookup

"naver" in nameIndex → [
  "traffic_by_channel|naver|advertiser",
  "traffic_by_channel|naver|industry",
  "channel_revenue|naver|advertiser",
  "channel_revenue|naver|industry",
  "first_purchase|naver|advertiser",
  "repeat_purchase|naver|advertiser"
]
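The node keys in this list follow a "parent|name|domain" pattern. As a minimal sketch, assuming every hierarchical key has exactly three pipe-separated parts, they could be split into typed fields like this. MetricPath is a hypothetical name, not a class from the actual system. How flat metrics are keyed isn't shown here, so the parser simply rejects anything that doesn't have three parts.

```java
/** One fully qualified metric, e.g. "traffic_by_channel|naver|advertiser". */
record MetricPath(String parent, String name, String domain) {

    /** Parses the three-part "parent|name|domain" form used in the examples. */
    static MetricPath parse(String key) {
        String[] parts = key.split("\\|"); // '|' is a regex metacharacter, so escape it
        if (parts.length != 3) {
            throw new IllegalArgumentException(
                    "Expected parent|name|domain but got: " + key);
        }
        return new MetricPath(parts[0], parts[1], parts[2]);
    }

    /** Rebuilds the original key string. */
    String key() {
        return parent + "|" + name + "|" + domain;
    }
}
```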

The Magic: O(1) average-case lookup using HashMap

// Without nameIndex: O(N) scan through all nodes
for (MetricNode node : allNodes) {
  if ("naver".equals(node.name)) { ... }  // Slow! (use equals, not ==, to compare strings)
}

// With nameIndex: O(1) direct access
List<MetricNode> nodes = nameIndex.get("naver");  // Fast!

Result: 3 names → 12 nodes (each name maps to multiple nodes)
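To make the lookup concrete, here is a minimal sketch of how such a name index could be built, assuming a simple node type with parent, name, and domain fields. MetricNode's shape and the NameIndex class are assumptions for illustration. The article only shows that nameIndex maps one name to a list of nodes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical node shape: parent metric, own name, and data domain. */
record MetricNode(String parent, String name, String domain) {}

final class NameIndex {
    private final Map<String, List<MetricNode>> byName = new HashMap<>();

    /** Registers a node under its name; several nodes may share one name. */
    void add(MetricNode node) {
        byName.computeIfAbsent(node.name(), k -> new ArrayList<>()).add(node);
    }

    /** Average-case O(1) lookup; unknown names yield an empty list, never null. */
    List<MetricNode> get(String name) {
        return List.copyOf(byName.getOrDefault(name, List.of()));
    }

    /** Expands the filtered names into every node that carries one of them. */
    List<MetricNode> findByNames(List<String> names) {
        List<MetricNode> result = new ArrayList<>();
        for (String name : names) {
            result.addAll(get(name));
        }
        return result;
    }
}
```

The index costs extra memory and has to be kept in sync with the tree, but it is built once and then read on every query, which is where the lookup speed matters.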

Sub-Play 2.3: Pick Your Side

Problem: Same metric, two meanings

traffic_by_channel|naver|advertiser  → "Our Naver traffic"
traffic_by_channel|naver|industry    → "Industry average"

Question: Which one does the user want?

Solution: Let GPT decide

Query Analysis:
"How's MY Naver traffic?"     → Advertiser
"Industry average?"           → Industry  
"Compare to industry"         → Both

Implementation:

String scope = classifyScope(query);

// "common" is the label for queries that need both domains
if ("common".equals(scope)) {
    return List.of("advertiser", "industry");  // Return both
} else {
    return List.of(scope);  // Return one
}

Result: 12 nodes → Filter to advertiser domain only
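Here is a sketch of how the scope label could then be applied to the nodes, assuming each node carries its domain as a plain string. ScopedNode and DomainFilter are hypothetical names. The "common" label and the two domain values are the ones used in the snippet above.

```java
import java.util.List;

/** Hypothetical node shape for this sketch. */
record ScopedNode(String parent, String name, String domain) {}

final class DomainFilter {

    /** Maps the classifier's scope label to the domains that should be kept. */
    static List<String> domainsFor(String scope) {
        return "common".equals(scope)
                ? List.of("advertiser", "industry")
                : List.of(scope);
    }

    /** Keeps only the nodes whose domain matches the classified scope. */
    static List<ScopedNode> filter(List<ScopedNode> nodes, String scope) {
        List<String> wanted = domainsFor(scope);
        return nodes.stream()
                .filter(node -> wanted.contains(node.domain()))
                .toList();
    }
}
```

With scope "advertiser" only the advertiser nodes survive. With "common" both domains pass through, which is what a comparison query needs.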

Move 2 Summary:

Started with: 50 candidates

Ended with: 12 nodes, advertiser domain confirmed

But we're not done yet...

Move 3: Checkmate (Relationship Validation)

Here's where it gets interesting. We have 12 nodes:

✓ traffic_by_channel|naver|advertiser
✓ channel_revenue|naver|advertiser        ← Wrong parent!
✓ traffic_by_channel|direct_traffic|advertiser
✓ new_members|traffic_count|advertiser   ← Wrong parent!
... 8 more

Problem: Not all are correct. "channel_revenue|naver" isn't about traffic.

Solution: 3-step validation

Step 3.1: Collect the Neighbors

Rule:

If 1-depth node → Get children
If 2-depth node → Get parents

Why? To understand relationships.

Our 12 nodes are all 2-depth
→ Collect their parents
→ Get 6 unique parent names:

1. traffic_by_channel    ← Traffic related ✓
2. channel_revenue       ← Revenue, not traffic ✗
3. first_purchase        ← Purchase, not traffic ✗
4. repeat_purchase       ← Purchase, not traffic ✗
5. new_members          ← Not about channels ✗
6. existing_members     ← Not about channels ✗
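The neighbor rule can be sketched as below, assuming each node knows its depth plus the names of its parents and children. ForestNode and NeighborCollector are hypothetical names, and the real MetricForest may store these links differently. A LinkedHashSet is used so duplicate parent names collapse into one entry while the order stays stable.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical node: 1-depth nodes list children, 2-depth nodes list parents. */
record ForestNode(String name, int depth, List<String> parents, List<String> children) {}

final class NeighborCollector {

    /** Returns the unique neighbor names, in first-seen order. */
    static Set<String> collect(List<ForestNode> nodes) {
        Set<String> neighbors = new LinkedHashSet<>();
        for (ForestNode node : nodes) {
            if (node.depth() == 1) {
                neighbors.addAll(node.children()); // 1-depth → get children
            } else {
                neighbors.addAll(node.parents());  // 2-depth → get parents
            }
        }
        return neighbors;
    }
}
```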

Step 3.2: Filter Parents with GPT

Ask GPT: "Which parents are relevant to 'Naver traffic'?"

Input: 6 parent names
GPT thinking:
  - traffic_by_channel? YES → About traffic AND channels
  - channel_revenue? NO → About revenue, not traffic
  - first_purchase? NO → About purchases
  - repeat_purchase? NO → About purchases
  - new_members? NO → Not related to Naver
  - existing_members? NO → Not about channels

Output: ["traffic_by_channel"]

This is CRITICAL: Only 1 parent passed! This means:

Keep: traffic_by_channel|naver
Remove: channel_revenue|naver
Remove: new_members|traffic_count

Step 3.3: Final Validation

Now apply 3 rules:

Rule 1: 1-depth nodes with no children (flat metrics) → Accept immediately

if (node.depth == 1 && node.children.isEmpty()) {
    return node;  // Already final
}

Rule 2: 1-depth nodes with children → Return matching children

if (node.depth == 1 && !node.children.isEmpty()) {
    return node.children.stream()
        .filter(child -> secondResult.contains(child.name))
        .toList();
}

Rule 3: 2-depth nodes → Keep only if a parent exists in secondResult

if (node.depth == 2) {
    // true → keep this node, false → drop it
    return node.parents.stream()
        .anyMatch(parent -> secondResult.contains(parent.name));
}

Applying to our 12 nodes:

Check: traffic_by_channel|naver|advertiser
  → Parent "traffic_by_channel" in secondResult? 
  → ACCEPT

Check: channel_revenue|naver|advertiser
  → Parent "channel_revenue" in secondResult? 
  → REJECT

Check: new_members|traffic_count|advertiser
  → Parent "new_members" in secondResult? 
  → REJECT

Final Result:

[
  "traffic_by_channel|naver|advertiser",
  "traffic_by_channel|direct_traffic|advertiser"
]
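The three rules can be combined into one pass, as in the sketch below. It assumes a simplified node type where each node has a full key, a name, a depth, its parent names, and its child nodes. TreeNode and RelationshipValidator are hypothetical names, and secondResult stands for the set of names that passed the second GPT filter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical node shape; field names loosely follow the rule snippets. */
record TreeNode(String key, String name, int depth,
                List<String> parentNames, List<TreeNode> children) {}

final class RelationshipValidator {

    /**
     * @param nodes        candidate nodes after the domain filter
     * @param secondResult names that passed the second GPT filter
     * @return keys of the accepted metrics, in input order
     */
    static List<String> validate(List<TreeNode> nodes, Set<String> secondResult) {
        List<String> accepted = new ArrayList<>();
        for (TreeNode node : nodes) {
            if (node.depth() == 1 && node.children().isEmpty()) {
                // Rule 1: flat metric, already final
                accepted.add(node.key());
            } else if (node.depth() == 1) {
                // Rule 2: keep only the children that GPT considered relevant
                for (TreeNode child : node.children()) {
                    if (secondResult.contains(child.name())) {
                        accepted.add(child.key());
                    }
                }
            } else if (node.parentNames().stream().anyMatch(secondResult::contains)) {
                // Rule 3: a 2-depth node survives only if one of its parents passed
                accepted.add(node.key());
            }
        }
        return accepted;
    }
}
```

Note that this step makes no GPT call. Once the parent names have been filtered, the remaining decision is a plain set-membership check.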

Checkmate!

The Complete Game Replay

Let's watch the entire sequence:

Query: "How's my Naver traffic?"

   Move 1: SEARCH
   ├─ 1-depth index → 10 results
   ├─ 2-depth index → 40 results
   └─ Total: 50 candidates

   Move 2: CLASSIFY
   ├─ 2.1: GPT Filter → 3 names
   ├─ 2.2: Node Lookup → 12 nodes
   └─ 2.3: Domain Filter → ["advertiser"]

   Move 3: REFINE
   ├─ 3.1: Collect neighbors → 6 parent names
   ├─ 3.2: GPT Filter → 1 parent name
   └─ 3.3: Validate → 2 final metrics

   RESULT:
   ✓ traffic_by_channel|naver|advertiser
   ✓ traffic_by_channel|direct_traffic|advertiser

Time taken: < 1 second

Cost per query: ~$0.002

Before vs After

Part 1 (Basic Search)

 "What's the conversion rate?"
 → conversion_rate

 "How's my Naver traffic?"
 → Found "traffic_by_channel" but couldn't find "naver"

Accuracy: 95% for flat metrics only

Part 2 (Hierarchical Search)

   "What's the conversion rate?"
   → conversion_rate

   "How's my Naver traffic?"
   → traffic_by_channel|naver|advertiser
   → traffic_by_channel|direct_traffic|advertiser

   "Compare my Google traffic to industry"
   → traffic_by_channel|google|advertiser
   → traffic_by_channel|google|industry

Accuracy: 98% for all queries including hierarchical

What I Learned

1. Use AI Where It Shines

Good: Semantic understanding

// Is "revenue" related to "sales"? → Ask GPT
// Is "naver" relevant to "traffic"? → Ask GPT

Bad: Deterministic logic

// Is this node depth 1 or 2? → Just check node.depth
// Does parent exist? → Just check hashmap

2. Two-Stage Filtering is Magic

One GPT call on 50 candidates → 70% accuracy
Two GPT calls (50 → 6 → final) → 98% accuracy

The second filter on a small, focused set is what makes it work.

3. Data Structure = Foundation

Without MetricForest:

Can't collect neighbors
Can't validate relationships
Can't distinguish contexts

The tree structure makes everything else possible.

4. Filter Domain LATE, Not Early

Wrong approach:

// Filter domain first
nodes = filterByDomain(nodes, "advertiser");
// Now we can't see that both domains existed!

Right approach:

// Let GPT see both domains
nodes = getAllNodes();
domain = askGPT("Which domain does user want?");
// Now filter
nodes = filterByDomain(nodes, domain);

This enables "compare" queries that need both domains.

Real-World Code

Here's the orchestration service that ties it all together:

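import java.util.List;
import lombok.RequiredArgsConstructor;
import org.springframework.stereotype.Service;

@Service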
@RequiredArgsConstructor
public class MetricSearchService {

    private final VectorSearchService vectorSearch;
    private final StructuredOutputsService gptFilter;
    private final MetricForest metricForest;
    private final DomainClassifier domainClassifier;

    public List<String> search(String query) {
        // Move 1: Search
        var candidates = vectorSearch.search(query, 50);

        // Move 2.1: First filter
        var names = gptFilter.filter(query, candidates);

        // Move 2.2: Resolve to nodes
        var nodes = metricForest.findByNames(names);

        // Move 2.3: Decide domain
        var domain = domainClassifier.classify(query, nodes);

        // Move 3.1-3.2: Collect and filter neighbors
        var neighbors = collectNeighbors(nodes);
        var validParents = gptFilter.filter(query, neighbors);

        // Move 3.3: Final validation
        return validate(nodes, domain, validParents);
    }

    // collectNeighbors(...) and validate(...) are helper methods not shown here
}

Clean, simple, effective.

Try These Queries

The system now handles:

Flat metrics (from Part 1)

"What's the conversion rate?"
"Show order count"

Hierarchical metrics (NEW!)

"My Naver traffic?"
"Google revenue?"
"Facebook first purchase rate?"

Comparisons (NEW!)

"Compare my Naver traffic to industry average"

Category queries (NEW!)

"Show all channel traffic"

What's Next?

Current system handles:

Flat metrics
Hierarchical metrics
Domain disambiguation
Comparison queries

Coming soon:

Time ranges ("last month's Naver traffic")
Aggregations ("total revenue by channel")
Conversational follow-ups
Multi-metric queries

Need Help Implementing RAG?

I help companies integrate AI systems with their existing Spring Boot infrastructure.

Specializing in:

Spring Boot + OpenAI integration
Custom RAG pipelines
E-commerce analytics systems

💭 Questions or feedback? Join the discussion in the comments!

Tags: #rag #hierarchical-search #spring-boot #openai #system-design #ai-engineering

DEV Community