To turn a complex document — like a Game of Thrones script or a Federal Reserve report — into a reasoning-ready index, the Java PageIndexAgent follows a strict two-phase pipeline. This article walks through the exact lifecycle of that data, from raw text input to a cited, section-accurate answer.
https://github.com/vishalmysore/page-index-java
The Big Picture
Phase 1 (Indexing): Full Document → LLM → TOC Tree (JSON)
Phase 2A (Navigate): TOC + Query → LLM → Selected Node IDs
Phase 2B (Extract): Node IDs → Java → Raw Section Text
Phase 2C (Answer): Section Text → LLM → Final Answer + Citation
The full document is sent to the LLM exactly once. Every subsequent LLM call works only on small, targeted excerpts.
Step 1 — The Raw Input (Java String)
The document is loaded into a standard Java String. At this stage, it is just a flat sequence of text — no structure, no hierarchy, no meaning.
String document;
try (InputStream is = getClass().getClassLoader()
        .getResourceAsStream("got_script.txt")) {
    if (is == null) throw new IllegalStateException("got_script.txt not found on classpath");
    document = new String(is.readAllBytes(), StandardCharsets.UTF_8);
}
// document = "EPISODE 1: WINTER IS COMING\nScene 1: The Wall.\nWill, Gared, and Waymar Royce..."
Step 2 — Phase 1: Building the TOC Tree
The buildIndex(String documentText) method is the entry point for Phase 1. It sends the entire document to the LLM along with a strict structured prompt that instructs the model to act as a Document Architect — not a summarizer.
public List<TOCNode> buildIndex(String documentText) {
int maxRetries = 3;
for (int attempt = 1; attempt <= maxRetries; attempt++) {
try {
String response = chatModel.generate(buildIndexPrompt(documentText));
String json = cleanJsonResponse(response);
List<TOCNode> result = mapper.readValue(json, new TypeReference<List<TOCNode>>() {});
System.out.println("[Phase 1] OK — " + result.size() + " top-level nodes parsed.");
return result;
} catch (Exception e) {
System.err.println("[Phase 1 / attempt " + attempt + "] JSON parse error: " + e.getMessage());
if (attempt == maxRetries) {
return buildFallbackIndex(documentText); // Plan B — regex heading scanner
}
}
}
return new ArrayList<>();
}
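The cleanJsonResponse() helper called above is only named in the source, never listed. A minimal sketch of what such a stripper might look like, assuming its job is to remove Markdown fences and trim to the outermost JSON array (the method name is from PageIndexAgent; this body is an illustrative assumption):

```java
// Hypothetical sketch of cleanJsonResponse(): strips Markdown code fences
// and trims the reply to the outermost JSON array. Only the method name
// comes from the source; the body is an assumption.
public class CleanJson {
    public static String cleanJsonResponse(String response) {
        String s = response.trim();
        // Drop ```json ... ``` fences the model may wrap around its output
        s = s.replaceAll("```(?:json)?", "").trim();
        // Trim to the outermost [...] so stray prose before/after is ignored
        int start = s.indexOf('[');
        int end = s.lastIndexOf(']');
        if (start >= 0 && end > start) {
            s = s.substring(start, end + 1);
        }
        return s;
    }
}
```

Even with temperature 0.0, models occasionally wrap JSON in fences or add a leading sentence, so this cheap normalization pass pays for itself before Jackson ever sees the string.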
The Indexing Prompt
The LLM is not simply asked to "summarize." The prompt enforces a rigid JSON contract, telling the model to identify narrative boundaries, assign unique IDs, and write a one-sentence summary per section. The summary becomes the retrieval signal in Phase 2.
private String buildIndexPrompt(String documentText) {
return "You are an expert document analyst. Analyze the following document and generate a JSON Table of Contents tree.\n\n" +
"STRICT RULES:\n" +
"1. Output ONLY a valid JSON array. No prose, no markdown fences, no explanation.\n" +
"2. Each node MUST have exactly these keys: \"title\", \"nodeId\", \"summary\", \"nodes\".\n" +
"3. \"nodes\" MUST always be a JSON array (use [] if no children — NEVER omit it).\n" +
"4. Maximum tree depth: 2 levels.\n" +
"5. nodeId: short kebab-case, e.g. \"scene-1\", \"part-2\", \"section-3-1\".\n" +
"6. summary: exactly 1 sentence, 15–25 words, describing what happens in this section.\n" +
"7. Be meticulous with JSON syntax — every { must close with }, every [ with ].\n\n" +
"EXAMPLE:\n" +
"[\n" +
" {\"title\": \"The Opening\", \"nodeId\": \"opening\", \"summary\": \"Hero is introduced in a marketplace.\", \"nodes\": []},\n" +
" {\"title\": \"Act Two\", \"nodeId\": \"act-2\", \"summary\": \"The conflict escalates dramatically.\", \"nodes\": [\n" +
" {\"title\": \"The Confrontation\", \"nodeId\": \"confrontation\", \"summary\": \"Villain reveals his true plan.\", \"nodes\": []}\n" +
" ]}\n" +
"]\n\n" +
"DOCUMENT:\n" + documentText + "\n\n" +
"OUTPUT (JSON array only):";
}
Step 3 — LLM Structural Generation
The LLM reads the script and identifies natural narrative units — not fixed 500-character chunks. For a Game of Thrones episode it might produce:
[
{
"title": "Prologue: Beyond the Wall",
"nodeId": "scene-1-wall",
"summary": "Night's Watch scouts venture beyond the Wall and encounter White Walkers for the first time.",
"nodes": []
},
{
"title": "Winterfell Introduction",
"nodeId": "act-1-winterfell",
"summary": "The Stark family is introduced, culminating in the arrival of King Robert Baratheon.",
"nodes": [
{
"title": "The Archery Lesson",
"nodeId": "scene-2-archery",
"summary": "Arya upstages Bran during archery practice, establishing her spirited personality.",
"nodes": []
}
]
}
]
Step 4 — Jackson Deserialization into List<TOCNode>
PageIndexAgent maps the JSON string into a recursive Java POJO using Jackson Databind:
private final ObjectMapper mapper = new ObjectMapper()
.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
// Inside buildIndex():
List<TOCNode> result = mapper.readValue(json, new TypeReference<List<TOCNode>>() {});
The target POJO is TOCNode, which is itself recursive via its nodes field:
public class TOCNode {
private String title; // Section heading
private String nodeId; // Unique kebab-case ID e.g. "scene-1-wall"
private String summary; // 1-sentence retrieval signal
private List<TOCNode> nodes; // Recursive child nodes
// getters and setters...
}
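Because nodes is recursive, any lookup over the tree is naturally a recursive walk. The collectTitlesForIds() helper used later in Phase 2 is not listed in the source; a plausible sketch over a minimal TOCNode stand-in (the helper's name is from the source, the implementation is an assumption):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of how collectTitlesForIds() might walk the recursive TOC tree.
// The method name appears in the source; this body is an illustrative
// assumption using a minimal record stand-in for TOCNode.
public class TocWalk {
    public record Node(String title, String nodeId, List<Node> nodes) {}

    public static List<String> collectTitlesForIds(List<Node> tree, List<String> ids) {
        List<String> titles = new ArrayList<>();
        for (Node n : tree) {
            if (ids.contains(n.nodeId())) {
                titles.add(n.title());
            }
            // Recurse into children so nested scene nodes are found too
            titles.addAll(collectTitlesForIds(n.nodes(), ids));
        }
        return titles;
    }
}
```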
At this point, the "map of Westeros" lives in Java memory as a strongly-typed object tree.
Step 5 — Phase 2A: The Reasoning Query (TOC only, no full document)
When the user asks "Who was the first person to see a White Walker?", Phase 2 begins.
What the LLM does NOT receive: the full script.
What it receives: the query + the TOC tree (titles and summaries only).
private String selectRelevantNodes(String query, String indexJson) {
String prompt =
"You are a document navigation agent.\n" +
"You have a Table-of-Contents (TOC) tree of a document. Each node has a 'nodeId' and a 'summary'.\n" +
"Your job: read the user's query, study the TOC summaries, and identify which node IDs are MOST likely\n" +
"to contain the answer. Do NOT guess — reason from the summaries.\n\n" +
"RULES:\n" +
"- Output ONLY a comma-separated list of nodeId values. Nothing else. No explanation.\n" +
"- Return 1 to 3 node IDs maximum. Prefer the most specific match.\n" +
"- Example output: section-3-1, section-4-2\n\n" +
"TOC TREE:\n" + indexJson + "\n\n" +
"USER QUERY: " + query + "\n\n" +
"RELEVANT NODE IDS (comma-separated only):";
return chatModel.generate(prompt).trim();
}
The LLM sees the summary for scene-1-wall:
"Night's Watch scouts venture beyond the Wall and encounter White Walkers for the first time."
It reasons: the answer is in this node — and returns scene-1-wall.
This is logical reasoning, not cosine similarity.
Step 6 — Phase 2B: Targeted Text Extraction (no LLM)
PageIndexAgent resolves the selected node IDs into section titles, then scans the raw document text line-by-line to extract only those sections. No LLM is involved here — this is pure Java:
private String extractSections(String documentText, List<String> titles) {
if (titles.isEmpty()) return documentText; // safety fallback
String[] lines = documentText.split("\\r?\\n");
StringBuilder extracted = new StringBuilder();
boolean capturing = false;
for (String line : lines) {
String upper = line.toUpperCase();
// Does this line match one of our target section headings?
boolean isMatch = titles.stream().anyMatch(title ->
upper.contains(title.toUpperCase()) ||
normalizeHeading(upper).contains(normalizeHeading(title.toUpperCase())));
// Is this line any section heading (to know when to stop capturing)?
boolean isAnyHeading = line.trim().matches("^(SCENE|PART|CHAPTER|ACT|SECTION)\\s+.*")
|| line.trim().matches("^\\d+\\.\\s+[A-Z].*")
|| (line.trim().equals(line.trim().toUpperCase()) && line.trim().length() > 10
&& line.trim().length() < 80);
if (isMatch) {
capturing = true;
extracted.append("\n--- EXTRACTED: ").append(line.trim()).append(" ---\n");
} else if (isAnyHeading && capturing) {
capturing = false; // new section started — stop capturing
}
if (capturing) {
extracted.append(line).append("\n");
}
}
// If nothing was extracted, fall back to full document
return extracted.length() > 50 ? extracted.toString() : documentText;
}
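The normalizeHeading() helper used in the matching logic above is not listed in the source. A reasonable sketch, assuming its job is to make "Scene 1: The Wall." and "SCENE 1 THE WALL" compare equal (the name is from the source, the body is an assumption):

```java
// Sketch of normalizeHeading(): collapses a heading to bare alphanumerics
// so punctuation and spacing differences don't break matching.
// Only the method name comes from the source; this body is an assumption.
public class Headings {
    public static String normalizeHeading(String heading) {
        return heading
                .replaceAll("[^A-Za-z0-9 ]", " ") // drop punctuation
                .replaceAll("\\s+", " ")          // collapse whitespace
                .trim();
    }
}
```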
Token reduction in practice: for the Sholay synopsis (9,203 characters), the extracted section for a single scene averages 600–750 characters, a reduction of roughly 92–93%.
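The arithmetic behind that figure is easy to check with a throwaway helper (not part of the agent):

```java
// Quick sanity check of the stated reduction for the Sholay example:
// a 9,203-char document versus a 600-750 char extracted section.
public class Reduction {
    public static double reductionPercent(int docChars, int extractedChars) {
        return 100.0 * (docChars - extractedChars) / docChars;
    }
}
```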
Step 7 — Phase 2C: Final Answer Synthesis
Only the small, targeted excerpt goes to the LLM with the original query:
private String synthesizeAnswer(String query, String extractedText, List<String> sourceSections) {
String prompt =
"You are a precise question-answering assistant.\n" +
"You have been given a RELEVANT EXCERPT from a document (already pre-selected by a reasoning agent).\n" +
"Answer the user's question using ONLY the information in this excerpt.\n" +
"At the end, cite the section(s) you drew from.\n\n" +
"RELEVANT EXCERPT:\n" + extractedText + "\n\n" +
"USER QUESTION: " + query + "\n\n" +
"ANSWER:";
return chatModel.generate(prompt);
}
Final answer:
"The scout named Will was the first person to see a White Walker during the Night's Watch patrol beyond the Wall."
— Source: Prologue: Beyond the Wall (nodeId: scene-1-wall)
The retrieval is explainable — you can trace the exact section it came from.
Step 8 — The Complete Pipeline at a Glance
The public API exposes a single retrieveAndAnswer() method that orchestrates all three Phase 2 steps and returns a structured RetrievalResult:
public RetrievalResult retrieveAndAnswer(String query, String documentText,
List<TOCNode> index) throws Exception {
// 2A — LLM selects node IDs from TOC (no full doc)
String indexJson = mapper.writerWithDefaultPrettyPrinter().writeValueAsString(index);
String nodeIdsCsv = selectRelevantNodes(query, indexJson);
// 2B — Java extracts matching sections from raw text (no LLM)
List<String> nodeIds = Arrays.stream(nodeIdsCsv.split(","))
.map(String::trim)
.filter(s -> !s.isEmpty())
.collect(Collectors.toList());
List<String> relevantTitles = collectTitlesForIds(index, nodeIds);
String extractedText = extractSections(documentText, relevantTitles);
// 2C — LLM answers from the small excerpt only
String answer = synthesizeAnswer(query, extractedText, relevantTitles);
return new RetrievalResult(nodeIds, relevantTitles, extractedText, answer);
}
RetrievalResult is fully observable — it exposes every intermediate step:
public static class RetrievalResult {
public final List<String> selectedNodeIds; // e.g. ["scene-1-wall"]
public final List<String> sourceSections; // e.g. ["Prologue: Beyond the Wall"]
public final String extractedText; // the raw snippet sent to the LLM
public final String answer; // the final synthesized answer
}
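The listing omits the constructor; assuming a straightforward all-args constructor (an assumption, since only the fields are shown in the source), calling code can log or assert on every intermediate step:

```java
import java.util.List;

// Minimal stand-in for RetrievalResult with the all-args constructor the
// pipeline implies. Field names match the source listing; the constructor
// itself is an assumption.
public class RetrievalDemo {
    public static class RetrievalResult {
        public final List<String> selectedNodeIds;
        public final List<String> sourceSections;
        public final String extractedText;
        public final String answer;

        public RetrievalResult(List<String> selectedNodeIds, List<String> sourceSections,
                               String extractedText, String answer) {
            this.selectedNodeIds = selectedNodeIds;
            this.sourceSections = sourceSections;
            this.extractedText = extractedText;
            this.answer = answer;
        }
    }
}
```

Exposing the raw extracted text alongside the answer is what makes the pipeline debuggable: if an answer looks wrong, you can inspect exactly what the LLM was shown.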
Safety Rails — Why the Pipeline Doesn't Crash
| Mechanism | Where | What it does |
|---|---|---|
| Temperature = 0.0 | Model config | Deterministic JSON output — no creative formatting |
| Retry loop (3×) | buildIndex() | Retries the full LLM call if JSON parsing fails |
| JSON fence stripper | cleanJsonResponse() | Strips Markdown code fences and trims the response to the outermost [...] |
| Fallback heading scanner | buildFallbackIndex() | Regex-scans for SCENE/CHAPTER/PART lines if all retries fail |
| Section extraction fallback | extractSections() | Returns the full document if no section heading matches |
| Unknown property tolerance | Jackson config | FAIL_ON_UNKNOWN_PROPERTIES = false for forward compatibility |
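Of these, only buildFallbackIndex() never appears in a listing. A plausible sketch, consistent with the SCENE/CHAPTER/PART patterns already used in extractSections() (the method name is from the source, this implementation is an assumption):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Sketch of the Plan-B index builder: scan for obvious heading lines and
// build a flat, summary-less TOC. Only the method name comes from the
// source; this implementation is an illustrative assumption.
public class FallbackIndex {
    private static final Pattern HEADING =
            Pattern.compile("^(SCENE|PART|CHAPTER|ACT|SECTION)\\s+.*", Pattern.CASE_INSENSITIVE);

    public record Node(String title, String nodeId) {}

    public static List<Node> buildFallbackIndex(String documentText) {
        List<Node> nodes = new ArrayList<>();
        int i = 1;
        for (String line : documentText.split("\\r?\\n")) {
            String t = line.trim();
            if (HEADING.matcher(t).matches()) {
                // kebab-case nodeId derived from position, e.g. "section-1"
                nodes.add(new Node(t, "section-" + i++));
            }
        }
        return nodes;
    }
}
```

A fallback tree built this way has no summaries, so Phase 2A degrades to matching on titles alone — but the pipeline still completes instead of crashing.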
Handling the Context Window Challenge
| Document Size | Strategy |
|---|---|
| Short–medium (up to ~50 pages) | Single-pass — entire document sent in one Phase 1 call |
| Very long (100+ pages) | Map-Reduce — send document in chunks, each generates a mini-TOC, a final LLM call merges them into one global tree |
The current implementation uses the single-pass strategy, which comfortably handles documents like the 1,700-word Federal Reserve report or a full film synopsis within the context window of models such as nvidia/nemotron-nano-12b-v2-vl or gpt-4o.
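The map-reduce strategy is not implemented in the repository; its "map" half could start with a simple chunker that splits on paragraph boundaries up to a character budget, so no section is cut mid-scene. This is a hypothetical helper, not part of PageIndexAgent:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the "map" half of the map-reduce strategy: split a long
// document into chunks of at most maxChars, breaking on blank lines so
// sections stay intact. Hypothetical helper, not part of PageIndexAgent.
public class Chunker {
    public static List<String> splitIntoChunks(String documentText, int maxChars) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String para : documentText.split("\\n\\s*\\n")) {
            // Start a new chunk if adding this paragraph would exceed the budget
            if (current.length() > 0 && current.length() + para.length() > maxChars) {
                chunks.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) current.append("\n\n");
            current.append(para);
        }
        if (current.length() > 0) chunks.add(current.toString());
        return chunks;
    }
}
```

Each chunk would then get its own Phase 1 call, and a final "reduce" call would merge the resulting mini-TOCs into one global tree.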