SAGE: Structure Aware Graph Expansion with Pink Floyd Example
Introduction
The SAGE (Structure Aware Graph Expansion) framework targets multi-hop retrieval over heterogeneous data. Traditional RAG (Retrieval-Augmented Generation) treats data chunks as isolated units; SAGE instead builds a structural graph offline to capture relationships that flat retrieval often misses.
Code: https://github.com/vishalmysore/sagejava
In this article, we illustrate SAGE using a dataset centered around the legendary band Pink Floyd.
The Pink Floyd Knowledge Graph
We model the Pink Floyd universe with 8 heterogeneous data chunks:
- Band: Pink Floyd (formed 1965 in London).
- Albums: Dark Side of the Moon, The Wall, Wish You Were Here.
- Members: David Gilmour, Roger Waters, Syd Barrett.
- Solo Work: David Gilmour's About Face.
Phase 1: Offline Graph Construction & Percentile Pruning
SAGE doesn't just connect everything into a "hairball" graph. It computes dense embedding similarity (e.g., cosine) across all chunk pairs, optionally incorporating entity signals (like shared "BAND" or "PERSON" tags).
The O(N²) offline process generates a similarity score for every possible pair. However, to maintain high retrieval precision and avoid "neighbor noise," SAGE applies a rigorous pruning strategy:
The 95th Percentile Rule: Edges above the 95th percentile similarity threshold are retained; all others are pruned.
This ensures the graph encodes only the strongest semantic neighborhoods. In our dataset:
- Strong Clusters: Members (Gilmour, Waters, Barrett) share high similarity due to their shared band history and roles, resulting in surviving edges.
- Structural Bridges: The relationship between the Band (Node A) and its Albums (Nodes B, E) forms the critical "backbone" for retrieval.
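The scoring and pruning steps of Phase 1 can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not the code from the sagejava repository: the class name `PercentilePruningSketch`, the `Edge` record, the nearest-rank percentile definition, and the toy 2-D embeddings are all hypothetical. The sketch keeps edges at or above the threshold (rather than strictly above) so that very small graphs retain at least one edge. It assumes Java 17 or later.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of offline scoring and percentile pruning (not the sagejava API).
class PercentilePruningSketch {

    /** A weighted, undirected edge between two chunk indices. */
    record Edge(int source, int target, double weight) {}

    /** Cosine similarity between two non-zero vectors of equal length. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Scores every unordered pair of chunks: the O(N^2) offline step. */
    static List<Edge> scoreAllPairs(double[][] embeddings) {
        List<Edge> edges = new ArrayList<>();
        for (int i = 0; i < embeddings.length; i++) {
            for (int j = i + 1; j < embeddings.length; j++) {
                edges.add(new Edge(i, j, cosine(embeddings[i], embeddings[j])));
            }
        }
        return edges;
    }

    /** Nearest-rank percentile of the edge weights (an assumed definition). */
    static double percentileThreshold(List<Edge> edges, double percentile) {
        double[] sorted = edges.stream().mapToDouble(Edge::weight).sorted().toArray();
        int rank = (int) Math.ceil(percentile * sorted.length / 100.0);
        rank = Math.min(Math.max(rank, 1), sorted.length);
        return sorted[rank - 1];
    }

    /** Keeps only edges whose weight is at or above the percentile threshold. */
    static List<Edge> prune(List<Edge> edges, double percentile) {
        if (edges.isEmpty()) {
            return List.of();
        }
        double threshold = percentileThreshold(edges, percentile);
        return edges.stream().filter(e -> e.weight() >= threshold).toList();
    }

    public static void main(String[] args) {
        // Made-up 2-D vectors standing in for real chunk embeddings.
        double[][] embeddings = { {1.0, 0.1}, {0.9, 0.2}, {0.1, 1.0}, {0.2, 0.9} };
        List<Edge> all = scoreAllPairs(embeddings);
        System.out.println("Scored pairs: " + all.size());
        System.out.println("Surviving edges: " + prune(all, 95));
    }
}
```

In a real pipeline the embeddings would come from an embedding model, and the entity signals mentioned above could be blended into the weight before pruning.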
JSON-LD: The Strategic Substrate
The resulting graph is exported in JSON-LD format. This isn't just an export format; JSON-LD enables integration with knowledge graphs, Linked Data systems, and graph-native tooling without complex transformation.
An abridged excerpt of the export (the remaining Band ↔ Albums and Members ↔ Members edges are present in the full graph but omitted here):
{
  "@context" : {
    "schema" : "https://schema.org/",
    "sage" : "urn:sage:ontology:",
    "isPartOf" : { "@type" : "@id" }
  },
  "@type" : "sage:KnowledgeGraph",
  "sage:framework" : "SAGE (Structure Aware Graph Expansion)",
  "@graph" : [
    {
      "@type" : "CreativeWork",
      "@id" : "urn:sage:chunk:pf-band",
      "name" : "Pink Floyd Band Info",
      "description" : "Pink Floyd is a British rock band formed in London in 1965.",
      "mentions" : { "London" : "CITY", "Pink Floyd" : "BAND", "1965" : "DATE" }
    },
    {
      "@type" : "Relationship",
      "source" : { "@id" : "urn:sage:chunk:pf-band" },
      "target" : { "@id" : "urn:sage:chunk:pf-album-wall" },
      "relationshipType" : "DOC_DOC",
      "weight" : 0.85
    }
  ],
  "sage:statistics" : {
    "nodeCount" : 8,
    "edgeCount" : 4
  }
}
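To make the mapping between an in-memory edge and its JSON-LD form concrete, here is a small Java sketch that renders one edge in the shape of the Relationship object from the export. The class `JsonLdEdgeSketch`, the `Relationship` record, and the `chunkId` helper are hypothetical names for illustration. The sketch assumes identifiers contain no characters that need JSON escaping; a real exporter would more likely use a JSON library.

```java
import java.util.Locale;

// Hypothetical sketch: renders one graph edge as a JSON-LD "Relationship" object.
class JsonLdEdgeSketch {

    /** Builds a chunk identifier in the urn:sage:chunk: scheme used in the export. */
    static String chunkId(String slug) {
        return "urn:sage:chunk:" + slug;
    }

    record Relationship(String sourceId, String targetId, String relationshipType, double weight) {

        /** Serializes the edge; assumes the ids need no JSON escaping. */
        String toJsonLd() {
            return String.format(Locale.ROOT,
                "{ \"@type\": \"Relationship\", \"source\": { \"@id\": \"%s\" }, "
                    + "\"target\": { \"@id\": \"%s\" }, \"relationshipType\": \"%s\", \"weight\": %.2f }",
                sourceId, targetId, relationshipType, weight);
        }
    }

    public static void main(String[] args) {
        Relationship edge = new Relationship(
            chunkId("pf-band"), chunkId("pf-album-wall"), "DOC_DOC", 0.85);
        System.out.println(edge.toJsonLd());
    }
}
```

`Locale.ROOT` keeps the decimal separator a dot regardless of the machine's locale, which matters because JSON numbers do not allow a comma.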
Phase 2: Online Retrieval & Multi-Hop Discovery
Consider a complex query that flat retrieval often struggles with due to "semantic gaps":
Query: "Which rock opera was released by the band formed in 1965?"
- Stage 1 (Seed Selection): The retriever finds Node A (Band Info) because it explicitly matches the attribute "formed in 1965."
- Stage 2 (Graph Expansion): The retriever "walks" the structural bridge from the Band to its related albums. It discovers The Wall (Node E), which contains the specific "rock opera" descriptor.
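The two stages above can be sketched as follows. This is an illustrative Java sketch under stated assumptions rather than the sagejava implementation: the class `GraphExpansionSketch` and its methods are hypothetical, seed selection uses simple keyword overlap as a stand-in for the dense retriever, and the chunk texts other than the band description are invented for the example. It assumes Java 17 or later.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of seed selection followed by one-hop graph expansion.
class GraphExpansionSketch {

    private final Map<String, String> chunkText = new LinkedHashMap<>();
    private final Map<String, Set<String>> neighbors = new LinkedHashMap<>();

    void addChunk(String id, String text) {
        chunkText.put(id, text);
        neighbors.putIfAbsent(id, new LinkedHashSet<>());
    }

    /** Adds an undirected edge that survived offline pruning; both chunks must exist. */
    void addEdge(String a, String b) {
        neighbors.get(a).add(b);
        neighbors.get(b).add(a);
    }

    /** Lowercases the text and splits it into distinct alphanumeric terms. */
    static Set<String> tokenize(String text) {
        Set<String> terms = new LinkedHashSet<>();
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }

    /** Stage 1: picks the chunk sharing the most terms with the query (stand-in for dense retrieval). */
    String selectSeed(String query) {
        Set<String> queryTerms = tokenize(query);
        String best = null;
        int bestScore = -1;
        for (Map.Entry<String, String> chunk : chunkText.entrySet()) {
            Set<String> overlap = tokenize(chunk.getValue());
            overlap.retainAll(queryTerms);
            if (overlap.size() > bestScore) {
                bestScore = overlap.size();
                best = chunk.getKey();
            }
        }
        return best;
    }

    /** Stage 2: returns the seed followed by its direct neighbors in the pruned graph. */
    List<String> expand(String seedId) {
        List<String> context = new ArrayList<>();
        context.add(seedId);
        context.addAll(neighbors.getOrDefault(seedId, Set.of()));
        return context;
    }

    public static void main(String[] args) {
        GraphExpansionSketch graph = new GraphExpansionSketch();
        graph.addChunk("pf-band", "Pink Floyd is a British rock band formed in London in 1965.");
        graph.addChunk("pf-album-wall", "The Wall is a rock opera.");       // illustrative text
        graph.addChunk("pf-gilmour", "David Gilmour is the guitarist of Pink Floyd."); // illustrative text
        graph.addEdge("pf-band", "pf-album-wall");

        String seed = graph.selectSeed("Which rock opera was released by the band formed in 1965?");
        System.out.println("Seed: " + seed);
        System.out.println("Context: " + graph.expand(seed));
    }
}
```

Because the expansion only follows stored edges, each chunk in the final context can be traced back to the seed and the edge that introduced it.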
Why This Works Better Than Flat Retrieval
- Flat RAG Failure: A standard retriever might find "The Wall" but rank it lower because it doesn't mention "1965." Or it might find the "1965" band info but miss the album.
- SAGE Advantage: The graph becomes an auditable retrieval substrate. Every expansion step is deterministic, inspectable, and explainable, unlike black-box embedding-only retrieval.
The Algorithm Pipeline
[ Query ] ──▶ [ Seed Node: Band ] ──▶ [ Expanded: Albums ] ──▶ [ Final Context ]
    │           (Matches 1965)          (Found via Edge)
    └───────────────────────────────────────────────────────────▶ [ Explainable Answer ]
Tools4AI Integration
With Tools4AI, this retrieval engine is exposed as a callable capability for AI agents. An agent can call searchKnowledgeGraph to select seed nodes and then decide, based on its own reasoning, whether to call expandNode to explore the semantic clusters built during the offline phase. This turns an academic algorithm into a production-ready, inspectable toolset.
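As a rough picture of what such a toolset might look like, the sketch below wraps the two operations in a plain Java class and records every call in an audit log. Only the method names `searchKnowledgeGraph` and `expandNode` come from the article; their signatures, the class `KnowledgeGraphToolsSketch`, the substring-based search, and the audit log are assumptions made for illustration. The Tools4AI wiring that would expose these methods to an agent is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical tool facade; signatures are assumed and Tools4AI registration is omitted.
class KnowledgeGraphToolsSketch {

    private final Map<String, String> descriptions;     // chunk id -> chunk text
    private final Map<String, List<String>> adjacency;  // chunk id -> neighbor ids
    private final List<String> auditLog = new ArrayList<>();

    KnowledgeGraphToolsSketch(Map<String, String> descriptions,
                              Map<String, List<String>> adjacency) {
        this.descriptions = descriptions;
        this.adjacency = adjacency;
    }

    /** Returns the ids of chunks whose text contains the keyword (case-insensitive), sorted by id. */
    List<String> searchKnowledgeGraph(String keyword) {
        String needle = keyword.toLowerCase(Locale.ROOT);
        List<String> seeds = descriptions.entrySet().stream()
            .filter(e -> e.getValue().toLowerCase(Locale.ROOT).contains(needle))
            .map(Map.Entry::getKey)
            .sorted()
            .toList();
        auditLog.add("search '" + keyword + "' -> " + seeds);
        return seeds;
    }

    /** Returns the direct neighbors of a node in the pruned graph. */
    List<String> expandNode(String nodeId) {
        List<String> result = adjacency.getOrDefault(nodeId, List.of());
        auditLog.add("expand " + nodeId + " -> " + result);
        return result;
    }

    /** Read-only view of every tool call made so far, in order. */
    List<String> auditLog() {
        return List.copyOf(auditLog);
    }
}
```

An agent would typically call `searchKnowledgeGraph("1965")`, inspect the returned ids, and then call `expandNode` on the ones it wants to follow; the audit log preserves that sequence for later inspection.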