Modern AI applications rely heavily on Retrieval-Augmented Generation (RAG) to analyze documents and answer questions. Most implementations follow a familiar approach:
- Split documents into chunks
- Generate embeddings
- Store them in a vector database
- Retrieve the most similar chunks
While this architecture works well for small documents, it begins to break down when dealing with long, complex documents such as research papers, legal contracts, financial reports, or technical manuals:
- Important context gets fragmented
- Sections lose their relationships
- Retrieval becomes noisy
To address these problems, PageIndex introduces a fundamentally different approach to document retrieval.
Instead of relying on vector similarity search, PageIndex transforms documents into a hierarchical tree structure and allows large language models to reason over that structure directly.
The result is a vectorless, reasoning-based RAG system that more closely resembles how human experts read and navigate documents.
This article explores how PageIndex works and why it represents a new direction for document intelligence systems.
The Problem with Traditional RAG
Most RAG systems follow this pipeline:
Document
↓
Chunk text
↓
Create embeddings
↓
Store in vector database
↓
Retrieve similar chunks
↓
Send to LLM
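As a concrete illustration of the pipeline above, here is a minimal toy sketch in Python. A bag-of-words counter stands in for a real embedding model, and a list stands in for a vector database; the function names and chunk size are illustrative, not any particular library's API.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real pipeline would call an embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def chunk(document: str, size: int = 8) -> list[str]:
    # Fixed-size word windows: the "arbitrary pieces" the article warns about.
    words = document.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

doc = ("The introduction motivates the study of safety outcomes. "
       "The results section reports that no serious adverse events were observed.")
top = retrieve("What were the safety outcomes?", chunk(doc))
```

Note how even this tiny example can surface the introduction over the results section, since chunks are ranked purely by term overlap.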
This method introduces several problems.
Loss of Structure
Documents are inherently hierarchical.
Document
├ Chapter
│ ├ Section
│ │ ├ Subsection
│ │ └ Subsection
│ └ Section
Chunking destroys this structure by breaking documents into arbitrary pieces.
Context Fragmentation
Important ideas often span multiple paragraphs or sections.
Chunk-based retrieval may return only part of the information needed to answer a question.
Retrieval Noise
Vector similarity can retrieve text that is semantically similar but contextually incorrect.
For example, a query about clinical trial results might retrieve text from the introduction simply because the terminology overlaps.
Infrastructure Complexity
Traditional RAG pipelines require additional infrastructure:
- Vector databases
- Embedding pipelines
- Chunking strategies
- Similarity tuning
PageIndex removes much of this complexity.
Introducing PageIndex
PageIndex is a vectorless, reasoning-based retrieval framework.
Instead of embedding chunks into a vector database, PageIndex converts documents into a tree-structured index.
Each node represents a section of the document.
Document
├ Introduction
│ ├ Background
│ └ Objectives
│
├ Methods
│ ├ Study Design
│ └ Participants
│
└ Results
├ Efficacy
└ Safety
Each node contains:
- Section title
- Sentence boundaries
- Semantic summary
- Parent-child relationships
This structure preserves the original organization of the document.
Rather than searching through fragments, the system can navigate the document hierarchy intelligently.
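A node with the attributes listed above might be modeled like this. The field and method names are illustrative (they mirror the article's description, not PageIndex's actual schema), and `path` shows how parent-child links yield the "Results > Safety"-style paths used later during retrieval.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # Illustrative fields, mirroring the attributes listed above.
    title: str
    summary: str = ""
    start_sentence: int = 0      # sentence-level boundaries into the source text
    end_sentence: int = 0
    children: list["Node"] = field(default_factory=list)

    def path(self, prefix: str = "") -> list[str]:
        # Flatten the tree into hierarchical paths like "Document > Results > Safety".
        here = f"{prefix}{self.title}"
        paths = [here]
        for child in self.children:
            paths.extend(child.path(prefix=f"{here} > "))
        return paths

tree = Node("Document", children=[
    Node("Introduction", children=[Node("Background"), Node("Objectives")]),
    Node("Results", children=[Node("Efficacy"), Node("Safety")]),
])
```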
How PageIndex Works
PageIndex consists of several coordinated components that transform documents into a navigable knowledge structure.
The architecture includes:
- PageIndex API
- Indexer
- Retriever
- Reasoning module
- LLM interface
- JSON tree storage
Together, these components create a reasoning-based retrieval pipeline.
Step 1: Document Indexing
The Indexer converts the raw document into a hierarchical structure.
An LLM analyzes the document and identifies sections and subsections.
Example output:
Document
├ 1. Introduction
│ ├ 1.1 Background
│ └ 1.2 Objectives
│
├ 2. Methods
│ ├ 2.1 Study Design
│ └ 2.2 Participants
Each section is stored with sentence-level indices so the system can retrieve the exact text later.
The tree is cached as JSON for reuse.
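A plausible shape for that cached index, sketched with Python's standard `json` module. The dictionary layout and filename are assumptions for illustration; the point is that the tree round-trips through JSON, so later runs can load it instead of re-indexing.

```python
import json
import pathlib

# Hypothetical indexer output: a nested dict with sentence-span offsets.
index = {
    "title": "Document",
    "nodes": [
        {"title": "1. Introduction", "sentences": [0, 14], "children": [
            {"title": "1.1 Background", "sentences": [0, 6], "children": []},
            {"title": "1.2 Objectives", "sentences": [7, 14], "children": []},
        ]},
        {"title": "2. Methods", "sentences": [15, 40], "children": []},
    ],
}

cache = pathlib.Path("index.json")
cache.write_text(json.dumps(index, indent=2))   # persist the tree for reuse
reloaded = json.loads(cache.read_text())        # later runs skip re-indexing
```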
Step 2: Structure-Aware Retrieval
Instead of performing vector similarity search, the Retriever allows the LLM to reason over the document tree.
The system collects all nodes and sends them to the model with their summaries and hierarchical paths.
Conceptually, the prompt looks like this:
Question:
"What were the safety outcomes?"
Available sections:
- Introduction
- Methods
- Results > Safety
- Results > Efficacy
The LLM selects the most relevant nodes.
This process is traceable and explainable, since the system can show exactly which sections were chosen.
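The prompt-building and node-selection steps can be sketched as follows. The model call is stubbed out with a canned reply; the prompt wording and helper names are assumptions, not PageIndex's actual interface. Filtering the reply against the known node paths is what makes the selection traceable.

```python
def build_retrieval_prompt(question: str, node_paths: list[str]) -> str:
    # The LLM sees every node's hierarchical path (plus, in practice, its
    # summary) and is asked to pick the relevant ones -- no vectors involved.
    sections = "\n".join(f"- {p}" for p in node_paths)
    return (f"Question:\n{question}\n\n"
            f"Available sections:\n{sections}\n\n"
            "Reply with the most relevant section paths, one per line.")

def select_nodes(llm_reply: str, node_paths: list[str]) -> list[str]:
    # Keep only lines that name a real node, so every selection is auditable.
    chosen = [line.strip("- ").strip() for line in llm_reply.splitlines()]
    return [p for p in node_paths if p in chosen]

paths = ["Introduction", "Methods", "Results > Safety", "Results > Efficacy"]
prompt = build_retrieval_prompt("What were the safety outcomes?", paths)
# A stubbed model reply; a real system would send `prompt` to an LLM here.
picked = select_nodes("Results > Safety", paths)
```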
Step 3: Context-Aware Reasoning
Once the relevant sections are identified, the system extracts the corresponding text and sends it to the reasoning module.
The LLM then generates the final answer using only the selected context.
Question
+ Retrieved Sections
↓
LLM Reasoning
↓
Answer
Because the retrieval step already narrowed down the context, the model can focus on the most relevant information.
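This extract-then-reason step can be sketched as below. The sentence spans stand in for the boundaries stored per node during indexing; the helper names and prompt wording are illustrative.

```python
def extract(sentences: list[str], spans: list[tuple[int, int]]) -> str:
    # Pull the exact sentences for each selected node via its stored boundaries.
    return " ".join(" ".join(sentences[a:b + 1]) for a, b in spans)

def answer_prompt(question: str, context: str) -> str:
    # The model answers using only the retrieved sections as context.
    return (f"Context:\n{context}\n\n"
            f"Question: {question}\nAnswer using only the context above.")

sentences = ["The trial enrolled 120 patients.",
             "No serious adverse events were observed.",
             "Mild headaches occurred in 5% of participants."]
context = extract(sentences, [(1, 2)])   # the selected "Safety" node's span
prompt = answer_prompt("What were the safety outcomes?", context)
```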
Why PageIndex Is Different
PageIndex challenges several assumptions in traditional RAG systems.
1. No Vector Database
PageIndex does not require embeddings or similarity search.
This reduces infrastructure complexity.
2. No Chunking
Documents remain intact within their hierarchical structure.
This preserves meaning and context.
3. Reasoning-Based Retrieval
Instead of matching vectors, retrieval is performed by an LLM that evaluates document sections semantically.
4. Explainable Retrieval
Because the system selects explicit nodes from the document tree, the retrieval process is transparent.
Users can trace exactly how the answer was produced.
Example Workflow
A typical PageIndex workflow looks like this:
Upload Document
↓
Structure Extraction
↓
Tree Index Creation
↓
User Question
↓
LLM selects relevant nodes
↓
Context extraction
↓
Reasoned answer
The system behaves much like a human expert scanning a document:
- Identify relevant sections
- Read those sections carefully
- Extract insights
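The full workflow above can be tied together in a compact sketch. Here `ask_llm` is a placeholder that returns canned replies, and the tree is reduced to a flat path-to-text mapping; a real system would call an actual model and walk a real index.

```python
def ask_llm(prompt: str) -> str:
    # Stand-in for a model call: picks the "Safety" node during retrieval,
    # then returns a canned answer during the reasoning step.
    if "Available sections" in prompt:
        return "Results > Safety"
    return "No serious adverse events were observed."

# Flat stand-in for the document tree: hierarchical path -> section text.
tree = {
    "Results > Safety": "No serious adverse events were observed.",
    "Results > Efficacy": "The primary endpoint was met.",
}

def answer(question: str) -> str:
    sections = "\n".join(f"- {p}" for p in tree)               # 1. show the tree
    choice = ask_llm(f"Question: {question}\nAvailable sections:\n{sections}")
    context = tree.get(choice, "")                             # 2. extract chosen text
    return ask_llm(f"Context: {context}\nQuestion: {question}")  # 3. reason

result = answer("What were the safety outcomes?")
```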
Where PageIndex Excels
PageIndex performs particularly well for long and structured documents, including:
- Research papers
- Financial reports
- Clinical trial documents
- Legal contracts
- Technical documentation
In these domains, section hierarchy carries important meaning that chunk-based systems often lose.
Why This Matters for AI Systems
As organizations accumulate massive collections of documents, the ability to analyze them effectively becomes increasingly important.
Vector-based retrieval was an important first step, but it is not always the best approach for structured knowledge.
PageIndex demonstrates a different paradigm:
Retrieval through reasoning rather than similarity search.
By preserving document structure and allowing LLMs to navigate that structure intelligently, PageIndex enables more accurate and explainable document analysis.
Inspiration
PageIndex is an open framework designed to simplify and improve document intelligence systems.
https://pageindex.ai/
https://docs.pageindex.ai/
Conclusion
AI systems are rapidly evolving from simple chat interfaces into powerful research tools capable of analyzing large bodies of information.
The future of document intelligence may not lie in bigger vector databases, but in smarter ways of representing and reasoning over knowledge.
By combining hierarchical indexing with LLM reasoning, PageIndex offers a compelling alternative to traditional RAG pipelines: one that is simpler, more explainable, and closer to how humans actually read documents.