PageIndex caught my eye when it hit GitHub's trending page. It's a RAG framework that ditches vectors entirely in favor of document structure and LLM reasoning. I had to try it.
The open-source version turned out to be pretty bare-bones though. It handles tree generation fine, but if you want to actually query your documents locally, you're on your own. So I forked it, added the missing retrieval pieces, and threw in AWS Bedrock support while I was at it.
This post walks through how to run the full PageIndex pipeline locally with Bedrock as your LLM provider.
Full code: github.com/b-d055/PageIndex
(Clone the repo and run `local_rag.py` with the provided examples to get started right away.)
What is PageIndex?
Traditional vector-based RAG relies heavily on semantic similarity as a proxy for relevance. While this works well for many use cases, it often breaks down with long, structured, or highly technical documents that require domain knowledge and multi-step reasoning. In those cases, retrieving text that merely “sounds similar” to a query isn’t enough. PageIndex takes a different approach by using LLM reasoning to navigate a document’s structure, prioritizing sections based on how a human expert would actually search for an answer.
For a deeper dive into the motivation and design, check out PageIndex's introductory blog post.
PageIndex mimics how a human expert navigates documents:
- **Read the structure** - Parse the document's hierarchy (like a table of contents) to understand what's where
- **Reason over sections** - Use LLM reasoning to identify which sections likely contain relevant information
- **Extract and evaluate** - Pull content from selected sections and assess if it's sufficient to answer the query
- **Iterate or answer** - If more context is needed, revisit the structure and select additional sections. Otherwise, generate the response.
The output is a hierarchical tree that mirrors how a human would navigate a document.
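Steps 2-4 form a loop. Here's a compressed sketch of that loop in Python; it's purely illustrative, and the helper functions are hypothetical, not PageIndex's actual API:

```python
# Illustrative pseudocode -- these helpers are hypothetical, not PageIndex's API
answer, context, seen = None, "", set()
while answer is None:
    node_ids = llm_select_sections(query, tree, exclude=seen)  # reason over the structure
    seen.update(node_ids)
    context += extract_text(tree, node_ids)          # pull content from selected nodes
    if llm_judges_sufficient(query, context):        # enough to answer the query?
        answer = llm_generate_answer(query, context)  # otherwise, loop for more sections
```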
The Problem: Open Source vs API
The PageIndex GitHub repo provides tree generation, but the cookbooks all use their hosted API (`PageIndexClient`) for RAG queries. It's free to start, but costs can grow with usage and features. If you want to run everything locally or use your own LLM provider (like Bedrock), you need to bridge this gap.
What the open-source repo includes:
- `run_pageindex.py` - generates tree structures from PDFs
- `md_to_tree()` - generates trees from Markdown
- Utilities for PDF parsing, token counting, etc.
What's missing:
- Query/retrieval functionality
- Helper functions like `create_node_mapping()` and `print_tree()`
- Support for non-OpenAI providers
Step 1: Generate a Tree Structure
First, generate a tree from your document. This step currently requires OpenAI. I may add alternative provider support in my fork later.
Add your OpenAI API key to a `.env` file:

```
OPENAI_API_KEY=your-key
```
Then run:
```bash
python run_pageindex.py --pdf_path document.pdf
```
Note: The upstream repo uses `CHATGPT_API_KEY` internally, but my fork accepts `OPENAI_API_KEY` and sets it automatically.
This creates a JSON file in `results/` with:

- Hierarchical sections extracted from the document
- Page ranges for each section
- AI-generated summaries
- Full text content (requires `--if-add-node-text yes`)
Example tree structure:
```json
{
  "doc_name": "quarterly-report.pdf",
  "structure": [
    {
      "title": "Financial Results",
      "start_index": 1,
      "end_index": 5,
      "node_id": "0001",
      "summary": "Overview of Q1 financial performance...",
      "text": "Full text content...",
      "nodes": [...]
    }
  ]
}
```
Important: The tree must include the `text` field for retrieval to work. Pass `--if-add-node-text yes` during generation; it's off by default.
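So a generation run that keeps the text looks like this:

```bash
python run_pageindex.py --pdf_path document.pdf --if-add-node-text yes
```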
Step 2: Set Up AWS Bedrock
Generate a Bedrock API Key
AWS Bedrock now supports API key authentication. This simplifies setup significantly.
- Go to the AWS Bedrock Console
- Navigate to Model access and ensure you have access to Claude models (edit: model access page has been retired and is no longer required)
- Go to API keys in the sidebar (you may need to scroll down)
- Create a new API key (these can be short-term or long-term)
- Copy the key. It's only shown once.
For more details, see the AWS documentation on Bedrock API keys.
Configure Environment
```bash
# .env
OPENAI_API_KEY=sk-...              # Required for tree generation
AWS_BEARER_TOKEN_BEDROCK=your-key  # For Bedrock queries
AWS_REGION=us-east-1               # Or your preferred region
```
How Authentication Works
boto3 automatically picks up the `AWS_BEARER_TOKEN_BEDROCK` environment variable:

```python
import os
import boto3

os.environ['AWS_BEARER_TOKEN_BEDROCK'] = "your-api-key"
# boto3 picks up the bearer token from the environment automatically
client = boto3.client('bedrock-runtime', region_name='us-east-1')
response = client.converse(
    modelId="us.anthropic.claude-haiku-4-5-20251001-v1:0",
    messages=[{"role": "user", "content": [{"text": "Hello!"}]}],
)
```
No IAM roles or AWS CLI configuration needed when using API key auth.
Step 3: Query with Bedrock
Now you can query your document using Bedrock:
```bash
python local_rag.py --provider bedrock \
  --model us.anthropic.claude-haiku-4-5-20251001-v1:0 \
  --tree results/document_structure.json \
  --query "What are the main conclusions?"
```
Or use interactive mode for multiple questions:
```bash
python local_rag.py --provider bedrock \
  --model us.anthropic.claude-haiku-4-5-20251001-v1:0 \
  --tree results/document_structure.json \
  -i
```
Some Bedrock Models to Try
Use the `us.` prefix for cross-region inference:

| Model | ID |
|---|---|
| Claude Sonnet 4.5 | `us.anthropic.claude-sonnet-4-5-20250929-v1:0` |
| Claude Haiku 4.5 | `us.anthropic.claude-haiku-4-5-20251001-v1:0` |
| Amazon Nova Pro | `us.amazon.nova-pro-v1:0` |
| Amazon Nova Lite | `us.amazon.nova-lite-v1:0` |
Tip: Claude Haiku 4.5 offers a good balance of speed and cost for RAG queries.
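If you're unsure which model IDs your account can see in a region, you can list them with the Bedrock control-plane client. A quick sketch (note that the cross-region `us.` inference profile IDs are distinct from the base model IDs this call returns, and this assumes your credentials allow control-plane calls):

```python
import boto3

# List foundation model IDs available in this region
bedrock = boto3.client('bedrock', region_name='us-east-1')
for summary in bedrock.list_foundation_models()['modelSummaries']:
    print(summary['modelId'])
```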
How the RAG Pipeline Works
The local RAG script implements a three-step pipeline:
1. Tree Search
Send the tree structure (without the `text` fields) to the LLM and ask it to identify relevant nodes:
prompt = f"""
You are given a question and a tree structure of a document.
Find all nodes that are likely to contain the answer.
Question: {query}
Document tree structure: {tree_json}
Reply with: {{"thinking": "...", "node_list": ["0001", "0002"]}}
"""
2. Content Extraction
Retrieve the full text from the identified nodes:
```python
# Concatenate the text of every node the tree search selected
context = ""
for node_id in search_result['node_list']:
    context += node_map[node_id]['text']
```
3. Answer Generation
Send the extracted content to the LLM to generate an answer:
prompt = f"""
Answer the question based on the context:
Question: {query}
Context: {context}
"""
Key Implementation Details
Helper Functions
The open-source repo doesn't include these, so we implement them:
```python
def create_node_mapping(tree_structure):
    """Create a flat mapping of node_id -> node for easy lookup."""
    node_map = {}
    def traverse(nodes):
        for node in nodes:
            if 'node_id' in node:
                node_map[node['node_id']] = node
            if 'nodes' in node:
                traverse(node['nodes'])
    traverse(tree_structure)
    return node_map
```
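Usage, given a tree loaded from the JSON generated in step 1 (key names match the example structure above):

```python
import json

with open('results/document_structure.json') as f:
    tree = json.load(f)

node_map = create_node_mapping(tree['structure'])
print(node_map['0001']['title'])  # -> "Financial Results"
```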
Bedrock Provider
```python
import boto3

class BedrockProvider:
    def __init__(self, model, region):
        self.client = boto3.client('bedrock-runtime', region_name=region)
        self.model = model

    def call(self, prompt):
        response = self.client.converse(
            modelId=self.model,
            messages=[{"role": "user", "content": [{"text": prompt}]}],
            inferenceConfig={"temperature": 0, "maxTokens": 4096}
        )
        return response['output']['message']['content'][0]['text']
```
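For example, with a model ID from the table above:

```python
provider = BedrockProvider("us.anthropic.claude-haiku-4-5-20251001-v1:0", "us-east-1")
print(provider.call("Reply with exactly: hello"))
```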
Two-Phase Workflow
The key insight is separating tree generation from querying:
| Phase | Provider | What Happens |
|---|---|---|
| Generation | OpenAI (required, for now) | Parse PDF, extract structure, generate summaries |
| Querying | Any (OpenAI/Bedrock) | Tree search, content extraction, answer generation |
This means you can:
- Generate the tree once
- Query many times with any provider (use Haiku or Nova for speed)
- Share tree files across team members
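For instance, after one OpenAI-powered generation run, the same tree file can be queried with Bedrock (as shown earlier) or with OpenAI. A sketch, assuming my fork accepts `--provider openai`; the model name here is illustrative:

```bash
# Query an existing tree with OpenAI instead of Bedrock (model name illustrative)
python local_rag.py --provider openai \
  --model gpt-4o-mini \
  --tree results/document_structure.json \
  --query "What are the main conclusions?"
```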
Files Reference
Notable files in my fork:

| File | Purpose |
|---|---|
| `local_rag.py` | Main script with OpenAI + Bedrock support |
| `run_pageindex.py` | Tree generation from PDFs |
| `.env` | API keys (copy from `.env.example`) |
| `results/*.json` | Generated tree structures |
| `requirements.txt` | Dependencies, including `boto3` |
Conclusion
PageIndex is a refreshing take on RAG: using document structure and LLM reasoning instead of vector similarity can yield smarter retrieval, especially for complex documents.
This implementation is intentionally simple. It's a starting point, not a production-ready system. The two-phase workflow (generate once, query many) keeps things practical. The tree structures are just human-readable JSON, so it's easy to inspect what's happening and build on top of it.
If you're tired of fighting with chunking strategies and embedding quality, give it a shot.
Resources
- My Fork with Bedrock Support
- PageIndex GitHub (upstream)
- PageIndex Documentation
- Bedrock API Keys Documentation
- Bedrock Converse API