Building an Explainable Graph RAG System with SAGE (JSON-LD, Percentile Pruning, Multi-Hop Retrieval)

#java #ai #llm #rag

SAGE: Structure Aware Graph Expansion with Pink Floyd Example

Introduction

The SAGE framework (Structure Aware Graph Expansion) is designed to solve the problem of multi-hop retrieval in heterogeneous data. While traditional RAG (Retrieval-Augmented Generation) treats data chunks as isolated units, SAGE builds a structural graph offline to capture relationships that flat retrieval often misses.

Repository: https://github.com/vishalmysore/sagejava

In this article, we illustrate SAGE using a dataset centered around the legendary band Pink Floyd.

The Pink Floyd Knowledge Graph

We model the Pink Floyd universe with 8 heterogeneous data chunks:

Band: Pink Floyd (formed 1965 in London).
Albums: Dark Side of the Moon, The Wall, Wish You Were Here.
Members: David Gilmour, Roger Waters, Syd Barrett.
Solo Work: David Gilmour's About Face.

Phase 1: Offline Graph Construction & Percentile Pruning

SAGE doesn't just connect everything into a "hairball" graph. It computes dense embedding similarity (e.g., cosine) across all chunk pairs, optionally incorporating entity signals (like shared "BAND" or "PERSON" tags).
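
As a rough illustration of the pairwise scoring step, the Java sketch below computes cosine similarity for every unordered pair of chunk embeddings. The class name PairwiseSimilarity, the ScoredPair record, and the toy 3-dimensional vectors are assumptions made for this example, not code taken from the SAGE repository. The optional entity signals are left out.

```java
import java.util.ArrayList;
import java.util.List;

final class PairwiseSimilarity {

    /** One scored, unordered pair of chunk indices (hypothetical helper type). */
    record ScoredPair(int a, int b, double score) {}

    /** Cosine similarity of two equal-length vectors; 0.0 if either has zero norm. */
    static double cosine(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("vector lengths differ");
        }
        double dot = 0.0, normX = 0.0, normY = 0.0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
            normX += x[i] * x[i];
            normY += y[i] * y[i];
        }
        if (normX == 0.0 || normY == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normX) * Math.sqrt(normY));
    }

    /** Scores every unordered pair (i < j), i.e. N * (N - 1) / 2 comparisons. */
    static List<ScoredPair> scoreAllPairs(double[][] embeddings) {
        List<ScoredPair> pairs = new ArrayList<>();
        for (int i = 0; i < embeddings.length; i++) {
            for (int j = i + 1; j < embeddings.length; j++) {
                pairs.add(new ScoredPair(i, j, cosine(embeddings[i], embeddings[j])));
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Toy vectors, made up for illustration only.
        double[][] embeddings = {
            {1.0, 0.0, 0.0},
            {1.0, 1.0, 0.0},
            {0.0, 0.0, 1.0}
        };
        for (ScoredPair p : scoreAllPairs(embeddings)) {
            System.out.printf("%d-%d: %.3f%n", p.a(), p.b(), p.score());
        }
    }
}
```

For the 8 chunks in this article, the double loop would score 28 pairs.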

The O(N²) offline process generates a similarity score for every possible pair. However, to maintain high retrieval precision and avoid "neighbor noise," SAGE applies a rigorous pruning strategy:

The 95th Percentile Rule: Only edges whose similarity score exceeds the 95th percentile of all pairwise scores are retained; every other edge is pruned.

This ensures the graph encodes only the strongest semantic neighborhoods. In our dataset:

Strong Clusters: Members (Gilmour, Waters, Barrett) score high similarity because of their common band history and roles, so the edges between them survive pruning.
Structural Bridges: The relationship between the Band (Node A) and its Albums (Nodes B, E) forms the critical "backbone" for retrieval.
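
The 95th percentile rule can be sketched as follows. This is a minimal illustration, assuming the nearest-rank definition of a percentile and a strict "greater than" comparison; the article does not say which percentile definition SAGE uses. The PercentilePruning class, the Edge record, and the synthetic weights are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PercentilePruning {

    /** A candidate edge with its similarity score (hypothetical helper type). */
    record Edge(String source, String target, double weight) {}

    /** Nearest-rank percentile: the value at rank ceil(p * n / 100) in sorted order. */
    static double percentile(double[] values, int p) {
        if (values.length == 0) {
            throw new IllegalArgumentException("no values");
        }
        if (p < 1 || p > 100) {
            throw new IllegalArgumentException("p must be in 1..100");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int rank = (p * sorted.length + 99) / 100; // integer ceiling of p * n / 100
        return sorted[rank - 1];
    }

    /** Keeps only the edges whose weight is strictly above the p-th percentile. */
    static List<Edge> prune(List<Edge> candidates, int p) {
        if (candidates.isEmpty()) {
            return List.of();
        }
        double[] weights = candidates.stream().mapToDouble(Edge::weight).toArray();
        double threshold = percentile(weights, p);
        return candidates.stream().filter(e -> e.weight() > threshold).toList();
    }

    public static void main(String[] args) {
        // Synthetic weights 0.05, 0.10, ..., 1.00, made up for illustration only.
        List<Edge> candidates = new ArrayList<>();
        for (int i = 1; i <= 20; i++) {
            candidates.add(new Edge("n0", "n" + i, i / 20.0));
        }
        prune(candidates, 95).forEach(System.out::println);
    }
}
```

With these 20 synthetic edges, the nearest-rank threshold is 0.95 and only the edge with weight 1.00 survives. Other percentile definitions (for example, linear interpolation) would give a slightly different cutoff.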

JSON-LD: The Strategic Substrate

The resulting graph is exported in JSON-LD format. This isn't just an export format; JSON-LD enables integration with knowledge graphs, Linked Data systems, and graph-native tooling without complex transformation.

{
  "@context" : {
    "schema" : "https://schema.org/",
    "sage" : "urn:sage:ontology:",
    "isPartOf" : { "@type" : "@id" }
  },
  "@type" : "sage:KnowledgeGraph",
  "sage:framework" : "SAGE (Structure Aware Graph Expansion)",
  "@graph" : [ 
    {
      "@type" : "CreativeWork",
      "@id" : "urn:sage:chunk:pf-band",
      "name" : "Pink Floyd Band Info",
      "description" : "Pink Floyd is a British rock band formed in London in 1965.",
      "mentions" : { "London" : "CITY", "Pink Floyd" : "BAND", "1965" : "DATE" }
    },
    {
      "@type" : "Relationship",
      "source" : { "@id" : "urn:sage:chunk:pf-band" },
      "target" : { "@id" : "urn:sage:chunk:pf-album-wall" },
      "relationshipType" : "DOC_DOC",
      "weight" : 0.85
    }
  ],
  "sage:statistics" : {
    "nodeCount" : 8,
    "edgeCount" : 4
  }
}

This listing is abridged: it shows one node and one edge. The full graph also contains the additional Band ↔ Albums and Members ↔ Members edges.

Phase 2: Online Retrieval & Multi-Hop Discovery

Consider a complex query that flat retrieval often struggles with due to "semantic gaps":

Query: "Which rock opera was released by the band formed in 1965?"

Stage 1 (Seed Selection): The retriever finds Node A (Band Info) because it explicitly matches the attribute "formed in 1965."
Stage 2 (Graph Expansion): The retriever "walks" the structural bridge from the Band to its related albums. It discovers The Wall (Node E), which contains the specific "rock opera" descriptor.
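
The two stages above can be sketched in a few lines of Java. This is an illustration under stated assumptions, not the repository's implementation: seed selection uses plain token overlap as a stand-in for embedding similarity, expansion is limited to one hop, and the TwoStageRetrieval class, the Hit record, the short chunk ids, and the text for The Wall and David Gilmour are made up for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class TwoStageRetrieval {

    /** A retrieved chunk plus the reason it entered the context. */
    record Hit(String chunkId, String reason) {}

    private final Map<String, String> chunks = new LinkedHashMap<>();
    private final Map<String, Set<String>> neighbors = new LinkedHashMap<>();

    void addChunk(String id, String text) {
        chunks.put(id, text);
    }

    /** Adds an undirected edge that survived offline pruning. */
    void addEdge(String a, String b) {
        neighbors.computeIfAbsent(a, k -> new LinkedHashSet<>()).add(b);
        neighbors.computeIfAbsent(b, k -> new LinkedHashSet<>()).add(a);
    }

    private static Set<String> tokens(String text) {
        Set<String> out = new LinkedHashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    List<Hit> retrieve(String query, int seedCount) {
        Set<String> queryTokens = tokens(query);

        // Stage 1: score chunks by shared tokens (stand-in for embedding similarity).
        List<Map.Entry<String, Integer>> scored = new ArrayList<>();
        for (Map.Entry<String, String> chunk : chunks.entrySet()) {
            Set<String> shared = tokens(chunk.getValue());
            shared.retainAll(queryTokens);
            scored.add(Map.entry(chunk.getKey(), shared.size()));
        }
        // List.sort is stable, so ties keep insertion order.
        scored.sort((x, y) -> Integer.compare(y.getValue(), x.getValue()));

        List<Hit> hits = new ArrayList<>();
        Set<String> seen = new LinkedHashSet<>();
        int limit = Math.max(0, Math.min(seedCount, scored.size()));
        for (Map.Entry<String, Integer> s : scored.subList(0, limit)) {
            seen.add(s.getKey());
            hits.add(new Hit(s.getKey(), "seed, token overlap " + s.getValue()));
        }

        // Stage 2: follow graph edges one hop out from each seed.
        for (Hit seed : List.copyOf(hits)) {
            for (String n : neighbors.getOrDefault(seed.chunkId(), Set.of())) {
                if (seen.add(n)) {
                    hits.add(new Hit(n, "expanded from " + seed.chunkId()));
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TwoStageRetrieval r = new TwoStageRetrieval();
        r.addChunk("pf-band", "Pink Floyd is a British rock band formed in London in 1965.");
        r.addChunk("pf-album-wall", "The Wall is a rock opera by Pink Floyd."); // illustrative text
        r.addChunk("pf-gilmour", "David Gilmour is a guitarist and vocalist."); // illustrative text
        r.addEdge("pf-band", "pf-album-wall");
        r.retrieve("Which rock opera was released by the band formed in 1965?", 1)
         .forEach(h -> System.out.println(h.chunkId() + " <- " + h.reason()));
    }
}
```

Each Hit carries the reason it was added, so the final context can be traced back to a seed match or to a specific edge. The unconnected member chunk is never pulled in.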

Why This Works Better Than Flat Retrieval

Flat RAG Failure: A standard retriever might find "The Wall" but rank it lower because it doesn't mention "1965." Or it might find the "1965" band info but miss the album.
SAGE Advantage: The graph becomes an auditable retrieval substrate. Every expansion step is deterministic, inspectable, and explainable, unlike black-box embedding-only retrieval.

The Algorithm Pipeline

[ Query ] ──▶ [ Seed Node: Band ] ──▶ [ Expanded: Albums ] ──▶ [ Final Context ]
   │               (Matches 1965)          (Found via Edge)
   └───────────────────────────────────────────────────────────▶ [ Explainable Answer ]

Tools4AI Integration

With Tools4AI, this retrieval engine is exposed as a callable capability for AI agents. An agent can call searchKnowledgeGraph to select seeds and then decide, based on its own reasoning, whether to call expandNode and explore the semantic clusters built during the offline phase. This turns an academic algorithm into a production-ready, inspectable toolset.
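
To make the shape of these two capabilities concrete, here is a plain-Java sketch. Only the method names searchKnowledgeGraph and expandNode come from the article; the GraphTools class, the parameter and return types, and the keyword-matching seed search are assumptions. The Tools4AI wiring itself is deliberately omitted, since its registration API is not shown here.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class GraphTools {

    private final Map<String, String> chunkText;
    private final Map<String, List<String>> adjacency;

    GraphTools(Map<String, String> chunkText, Map<String, List<String>> adjacency) {
        this.chunkText = chunkText;
        this.adjacency = adjacency;
    }

    /** Tool 1: ids of chunks whose text contains the keyword, in sorted order. */
    List<String> searchKnowledgeGraph(String keyword) {
        String needle = keyword.toLowerCase(Locale.ROOT);
        return chunkText.entrySet().stream()
            .filter(e -> e.getValue().toLowerCase(Locale.ROOT).contains(needle))
            .map(Map.Entry::getKey)
            .sorted()
            .toList();
    }

    /** Tool 2: ids of the chunks directly connected to the given chunk. */
    List<String> expandNode(String chunkId) {
        return adjacency.getOrDefault(chunkId, List.of());
    }

    public static void main(String[] args) {
        GraphTools tools = new GraphTools(
            Map.of("pf-band", "Pink Floyd is a British rock band formed in London in 1965.",
                   "pf-album-wall", "The Wall is a rock opera by Pink Floyd."), // illustrative text
            Map.of("pf-band", List.of("pf-album-wall"),
                   "pf-album-wall", List.of("pf-band")));
        // An agent would decide per seed whether expansion is worth a second call.
        for (String seed : tools.searchKnowledgeGraph("1965")) {
            System.out.println(seed + " -> " + tools.expandNode(seed));
        }
    }
}
```

Keeping the two calls separate leaves the expansion decision with the agent, and every hop it takes shows up as an explicit tool call.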
