To turn a complex document — like a Game of Thrones script or a Federal Reserve report — into a reasoning-ready index, the Java PageIndexAgent follows a strict two-phase pipeline. This article walks through the exact lifecycle of that data, from raw text input to a cited, section-accurate answer.
https://github.com/vishalmysore/page-index-java
The Big Picture
Phase 1 (Indexing): Full Document → LLM → TOC Tree (JSON)
Phase 2A (Navigate): TOC + Query → LLM → Selected Node IDs
Phase 2B (Extract): Node IDs → Java → Raw Section Text
Phase 2C (Answer): Section Text → LLM → Final Answer + Citation
The full document is sent to the LLM exactly once. Every subsequent LLM call works only on small, targeted excerpts.
Step 1 — The Raw Input (Java String)
The document is loaded into a standard Java String. At this stage, it is just a flat sequence of text — no structure, no hierarchy, no meaning.
String document;
try (InputStream is = getClass().getClassLoader()
        .getResourceAsStream("got_script.txt")) {
    if (is == null) throw new IllegalStateException("got_script.txt not found on classpath");
    document = new String(is.readAllBytes(), StandardCharsets.UTF_8);
}
// document = "EPISODE 1: WINTER IS COMING\nScene 1: The Wall.\nWill, Gared, and Waymar Royce..."
Step 2 — Phase 1: Building the TOC Tree
The buildIndex(String documentText) method is the entry point for Phase 1. It sends the entire document to the LLM along with a strict structured prompt that instructs the model to act as a Document Architect — not a summarizer.
public List<TOCNode> buildIndex(String documentText) {
int maxRetries = 3;
for (int attempt = 1; attempt <= maxRetries; attempt++) {
try {
String response = chatModel.generate(buildIndexPrompt(documentText));
String json = cleanJsonResponse(response);
List<TOCNode> result = mapper.readValue(json, new TypeReference<List<TOCNode>>() {});
System.out.println("[Phase 1] OK — " + result.size() + " top-level nodes parsed.");
return result;
} catch (Exception e) {
System.err.println("[Phase 1 / attempt " + attempt + "] JSON parse error: " + e.getMessage());
if (attempt == maxRetries) {
return buildFallbackIndex(documentText); // Plan B — regex heading scanner
}
}
}
return new ArrayList<>();
}
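The cleanJsonResponse() helper called above is only named in the source, never listed. A minimal sketch of what such a stripper might look like, assuming its job is to remove Markdown fences and trim to the outermost JSON array (the method name is from PageIndexAgent; this body is an illustrative assumption):

```java
// Hypothetical sketch of cleanJsonResponse(): strips Markdown code fences
// and trims the reply to the outermost JSON array. Only the method name
// comes from the source; the body is an assumption.
public class CleanJson {
    public static String cleanJsonResponse(String response) {
        String s = response.trim();
        // Drop ```json ... ``` fences the model may wrap around its output
        s = s.replaceAll("```(?:json)?", "").trim();
        // Trim to the outermost [...] so stray prose before/after is ignored
        int start = s.indexOf('[');
        int end = s.lastIndexOf(']');
        if (start >= 0 && end > start) {
            s = s.substring(start, end + 1);
        }
        return s;
    }
}
```

Even with temperature 0.0, models occasionally wrap JSON in fences or add a leading sentence, so this cheap normalization pass pays for itself before Jackson ever sees the string.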
The Indexing Prompt
The LLM is not simply asked to "summarize." The prompt enforces a rigid JSON contract, telling the model to identify narrative boundaries, assign unique IDs, and write a one-sentence summary per section. The summary becomes the retrieval signal in Phase 2.
private String buildIndexPrompt(String documentText) {
return "You are an expert document analyst. Analyze the following document and generate a JSON Table of Contents tree.\n\n" +
"STRICT RULES:\n" +
"1. Output ONLY a valid JSON array. No prose, no markdown fences, no explanation.\n" +
"2. Each node MUST have exactly these keys: \"title\", \"nodeId\", \"summary\", \"nodes\".\n" +
"3. \"nodes\" MUST always be a JSON array (use [] if no children — NEVER omit it).\n" +
"4. Maximum tree depth: 2 levels.\n" +
"5. nodeId: short kebab-case, e.g. \"scene-1\", \"part-2\", \"section-3-1\".\n" +
"6. summary: exactly 1 sentence, 15–25 words, describing what happens in this section.\n" +
"7. Be meticulous with JSON syntax — every { must close with }, every [ with ].\n\n" +
"EXAMPLE:\n" +
"[\n" +
" {\"title\": \"The Opening\", \"nodeId\": \"opening\", \"summary\": \"Hero is introduced in a marketplace.\", \"nodes\": []},\n" +
" {\"title\": \"Act Two\", \"nodeId\": \"act-2\", \"summary\": \"The conflict escalates dramatically.\", \"nodes\": [\n" +
" {\"title\": \"The Confrontation\", \"nodeId\": \"confrontation\", \"summary\": \"Villain reveals his true plan.\", \"nodes\": []}\n" +
" ]}\n" +
"]\n\n" +
"DOCUMENT:\n" + documentText + "\n\n" +
"OUTPUT (JSON array only):";
}
Step 3 — LLM Structural Generation
The LLM reads the script and identifies natural narrative units — not fixed 500-character chunks. For a Game of Thrones episode it might produce:
[
{
"title": "Prologue: Beyond the Wall",
"nodeId": "scene-1-wall",
"summary": "Night's Watch scouts venture beyond the Wall and encounter White Walkers for the first time.",
"nodes": []
},
{
"title": "Winterfell Introduction",
"nodeId": "act-1-winterfell",
"summary": "The Stark family is introduced, culminating in the arrival of King Robert Baratheon.",
"nodes": [
{
"title": "The Archery Lesson",
"nodeId": "scene-2-archery",
"summary": "Arya upstages Bran during archery practice, establishing her spirited personality.",
"nodes": []
}
]
}
]
Step 4 — Jackson Deserialization into List<TOCNode>
PageIndexAgent maps the JSON string into a recursive Java POJO using Jackson Databind:
private final ObjectMapper mapper = new ObjectMapper()
.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
// Inside buildIndex():
List<TOCNode> result = mapper.readValue(json, new TypeReference<List<TOCNode>>() {});
The target POJO is TOCNode, which is itself recursive via its nodes field:
public class TOCNode {
private String title; // Section heading
private String nodeId; // Unique kebab-case ID e.g. "scene-1-wall"
private String summary; // 1-sentence retrieval signal
private List<TOCNode> nodes; // Recursive child nodes
// getters and setters...
}
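Because nodes is recursive, any lookup over the tree is naturally a recursive walk. The collectTitlesForIds() helper used later in Phase 2 is not listed in the source; a plausible sketch over a minimal TOCNode stand-in (the helper's name is from the source, the implementation is an assumption):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of how collectTitlesForIds() might walk the recursive TOC tree.
// The method name appears in the source; this body is an illustrative
// assumption using a minimal record stand-in for TOCNode.
public class TocWalk {
    public record Node(String title, String nodeId, List<Node> nodes) {}

    public static List<String> collectTitlesForIds(List<Node> tree, List<String> ids) {
        List<String> titles = new ArrayList<>();
        for (Node n : tree) {
            if (ids.contains(n.nodeId())) {
                titles.add(n.title());
            }
            // Recurse into children so nested scene nodes are found too
            titles.addAll(collectTitlesForIds(n.nodes(), ids));
        }
        return titles;
    }
}
```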
At this point, the "map of Westeros" lives in Java memory as a strongly-typed object tree.
Step 5 — Phase 2A: The Reasoning Query (TOC only, no full document)
When the user asks "Who was the first person to see a White Walker?", Phase 2 begins.
What the LLM does NOT receive: the full script.
What it receives: the query + the TOC tree (titles and summaries only).
private String selectRelevantNodes(String query, String indexJson) {
String prompt =
"You are a document navigation agent.\n" +
"You have a Table-of-Contents (TOC) tree of a document. Each node has a 'nodeId' and a 'summary'.\n" +
"Your job: read the user's query, study the TOC summaries, and identify which node IDs are MOST likely\n" +
"to contain the answer. Do NOT guess — reason from the summaries.\n\n" +
"RULES:\n" +
"- Output ONLY a comma-separated list of nodeId values. Nothing else. No explanation.\n" +
"- Return 1 to 3 node IDs maximum. Prefer the most specific match.\n" +
"- Example output: section-3-1, section-4-2\n\n" +
"TOC TREE:\n" + indexJson + "\n\n" +
"USER QUERY: " + query + "\n\n" +
"RELEVANT NODE IDS (comma-separated only):";
return chatModel.generate(prompt).trim();
}
The LLM sees the summary for scene-1-wall:
"Night's Watch scouts venture beyond the Wall and encounter White Walkers for the first time."
It reasons: the answer is in this node — and returns scene-1-wall.
This is logical reasoning, not cosine similarity.
Step 6 — Phase 2B: Targeted Text Extraction (no LLM)
PageIndexAgent resolves the selected node IDs into section titles, then scans the raw document text line-by-line to extract only those sections. No LLM is involved here — this is pure Java:
private String extractSections(String documentText, List<String> titles) {
if (titles.isEmpty()) return documentText; // safety fallback
String[] lines = documentText.split("\\r?\\n");
StringBuilder extracted = new StringBuilder();
boolean capturing = false;
for (String line : lines) {
String upper = line.toUpperCase();
// Does this line match one of our target section headings?
boolean isMatch = titles.stream().anyMatch(title ->
upper.contains(title.toUpperCase()) ||
normalizeHeading(upper).contains(normalizeHeading(title.toUpperCase())));
// Is this line any section heading (to know when to stop capturing)?
boolean isAnyHeading = line.trim().matches("^(SCENE|PART|CHAPTER|ACT|SECTION)\\s+.*")
|| line.trim().matches("^\\d+\\.\\s+[A-Z].*")
|| (line.trim().equals(line.trim().toUpperCase()) && line.trim().length() > 10
&& line.trim().length() < 80);
if (isMatch) {
capturing = true;
extracted.append("\n--- EXTRACTED: ").append(line.trim()).append(" ---\n");
} else if (isAnyHeading && capturing) {
capturing = false; // new section started — stop capturing
}
if (capturing) {
extracted.append(line).append("\n");
}
}
// If nothing was extracted, fall back to full document
return extracted.length() > 50 ? extracted.toString() : documentText;
}
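The normalizeHeading() helper used in the matching logic above is not listed in the source. A reasonable sketch, assuming its job is to make "Scene 1: The Wall." and "SCENE 1 THE WALL" compare equal (the name is from the source, the body is an assumption):

```java
// Sketch of normalizeHeading(): collapses a heading to bare alphanumerics
// so punctuation and spacing differences don't break matching.
// Only the method name comes from the source; this body is an assumption.
public class Headings {
    public static String normalizeHeading(String heading) {
        return heading
                .replaceAll("[^A-Za-z0-9 ]", " ") // drop punctuation
                .replaceAll("\\s+", " ")          // collapse whitespace
                .trim();
    }
}
```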
Token reduction in practice: for the Sholay synopsis (9,203 characters), the extracted section for a single scene averages 600–750 characters, a reduction of roughly 92–93%.
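The arithmetic behind that figure is easy to check with a throwaway helper (not part of the agent):

```java
// Quick sanity check of the stated reduction for the Sholay example:
// a 9,203-char document versus a 600-750 char extracted section.
public class Reduction {
    public static double reductionPercent(int docChars, int extractedChars) {
        return 100.0 * (docChars - extractedChars) / docChars;
    }
}
```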
Step 7 — Phase 2C: Final Answer Synthesis
Only the small, targeted excerpt goes to the LLM with the original query:
private String synthesizeAnswer(String query, String extractedText, List<String> sourceSections) {
String prompt =
"You are a precise question-answering assistant.\n" +
"You have been given a RELEVANT EXCERPT from a document (already pre-selected by a reasoning agent).\n" +
"Answer the user's question using ONLY the information in this excerpt.\n" +
"At the end, cite the section(s) you drew from.\n\n" +
"RELEVANT EXCERPT:\n" + extractedText + "\n\n" +
"USER QUESTION: " + query + "\n\n" +
"ANSWER:";
return chatModel.generate(prompt);
}
Final answer:
"The scout named Will was the first person to see a White Walker during the Night's Watch patrol beyond the Wall."
— Source: Prologue: Beyond the Wall (nodeId: scene-1-wall)
The retrieval is explainable — you can trace the exact section it came from.
Step 8 — The Complete Pipeline at a Glance
The public API exposes a single retrieveAndAnswer() method that orchestrates all three Phase 2 steps and returns a structured RetrievalResult:
public RetrievalResult retrieveAndAnswer(String query, String documentText,
List<TOCNode> index) throws Exception {
// 2A — LLM selects node IDs from TOC (no full doc)
String indexJson = mapper.writerWithDefaultPrettyPrinter().writeValueAsString(index);
String nodeIdsCsv = selectRelevantNodes(query, indexJson);
// 2B — Java extracts matching sections from raw text (no LLM)
List<String> nodeIds = Arrays.stream(nodeIdsCsv.split(","))
.map(String::trim)
.filter(s -> !s.isEmpty())
.collect(Collectors.toList());
List<String> relevantTitles = collectTitlesForIds(index, nodeIds);
String extractedText = extractSections(documentText, relevantTitles);
// 2C — LLM answers from the small excerpt only
String answer = synthesizeAnswer(query, extractedText, relevantTitles);
return new RetrievalResult(nodeIds, relevantTitles, extractedText, answer);
}
RetrievalResult is fully observable — it exposes every intermediate step:
public static class RetrievalResult {
public final List<String> selectedNodeIds; // e.g. ["scene-1-wall"]
public final List<String> sourceSections; // e.g. ["Prologue: Beyond the Wall"]
public final String extractedText; // the raw snippet sent to the LLM
public final String answer; // the final synthesized answer
}
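The listing omits the constructor; assuming a straightforward all-args constructor (an assumption, since only the fields are shown in the source), calling code can log or assert on every intermediate step:

```java
import java.util.List;

// Minimal stand-in for RetrievalResult with the all-args constructor the
// pipeline implies. Field names match the source listing; the constructor
// itself is an assumption.
public class RetrievalDemo {
    public static class RetrievalResult {
        public final List<String> selectedNodeIds;
        public final List<String> sourceSections;
        public final String extractedText;
        public final String answer;

        public RetrievalResult(List<String> selectedNodeIds, List<String> sourceSections,
                               String extractedText, String answer) {
            this.selectedNodeIds = selectedNodeIds;
            this.sourceSections = sourceSections;
            this.extractedText = extractedText;
            this.answer = answer;
        }
    }
}
```

Exposing the raw extracted text alongside the answer is what makes the pipeline debuggable: if an answer looks wrong, you can inspect exactly what the LLM was shown.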
Safety Rails — Why the Pipeline Doesn't Crash
| Mechanism | Where | What it does |
|---|---|---|
| Temperature = 0.0 | Model config | Deterministic JSON output — no creative formatting |
| Retry loop (3×) | buildIndex() | Retries the full LLM call if JSON parsing fails |
| JSON fence stripper | cleanJsonResponse() | Strips Markdown code fences and trims the response to the outermost [...] |
| Fallback heading scanner | buildFallbackIndex() | Regex-scans for SCENE/CHAPTER/PART lines if all retries fail |
| Section extraction fallback | extractSections() | Returns the full document if no section heading matches |
| Unknown property tolerance | Jackson config | FAIL_ON_UNKNOWN_PROPERTIES = false for forward compatibility |
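Of these, only buildFallbackIndex() never appears in a listing. A plausible sketch, consistent with the SCENE/CHAPTER/PART patterns already used in extractSections() (the method name is from the source, this implementation is an assumption):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Sketch of the Plan-B index builder: scan for obvious heading lines and
// build a flat, summary-less TOC. Only the method name comes from the
// source; this implementation is an illustrative assumption.
public class FallbackIndex {
    private static final Pattern HEADING =
            Pattern.compile("^(SCENE|PART|CHAPTER|ACT|SECTION)\\s+.*", Pattern.CASE_INSENSITIVE);

    public record Node(String title, String nodeId) {}

    public static List<Node> buildFallbackIndex(String documentText) {
        List<Node> nodes = new ArrayList<>();
        int i = 1;
        for (String line : documentText.split("\\r?\\n")) {
            String t = line.trim();
            if (HEADING.matcher(t).matches()) {
                // kebab-case nodeId derived from position, e.g. "section-1"
                nodes.add(new Node(t, "section-" + i++));
            }
        }
        return nodes;
    }
}
```

A fallback tree built this way has no summaries, so Phase 2A degrades to matching on titles alone — but the pipeline still completes instead of crashing.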
Handling the Context Window Challenge
| Document Size | Strategy |
|---|---|
| Short–medium (up to ~50 pages) | Single-pass — entire document sent in one Phase 1 call |
| Very long (100+ pages) | Map-Reduce — send document in chunks, each generates a mini-TOC, a final LLM call merges them into one global tree |
The current implementation uses the single-pass strategy, which comfortably handles documents like the 1,700-word Federal Reserve report or a full film synopsis within the context window of models such as nvidia/nemotron-nano-12b-v2-vl or gpt-4o.
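The map-reduce strategy is not implemented in the repository; its "map" half could start with a simple chunker that splits on paragraph boundaries up to a character budget, so no section is cut mid-scene. This is a hypothetical helper, not part of PageIndexAgent:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the "map" half of the map-reduce strategy: split a long
// document into chunks of at most maxChars, breaking on blank lines so
// sections stay intact. Hypothetical helper, not part of PageIndexAgent.
public class Chunker {
    public static List<String> splitIntoChunks(String documentText, int maxChars) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String para : documentText.split("\\n\\s*\\n")) {
            // Start a new chunk if adding this paragraph would exceed the budget
            if (current.length() > 0 && current.length() + para.length() > maxChars) {
                chunks.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) current.append("\n\n");
            current.append(para);
        }
        if (current.length() > 0) chunks.add(current.toString());
        return chunks;
    }
}
```

Each chunk would then get its own Phase 1 call, and a final "reduce" call would merge the resulting mini-TOCs into one global tree.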