Aman Pandey
I got tired of writing 30 lines of LangChain boilerplate every time. So I published a fix.

Every time I started a new project that needed RAG, I wrote the same 30 lines.

Load documents. Split them. Embed them. Store them. Build a retriever. Wire up a prompt template. Build a chain. Handle the response format. Add reranking later when results were bad. Add GraphRAG even later when cross-document queries failed. Add a watchdog when the index went stale.

Every single project. From scratch. Every time.

I got tired of it. So I built ragbox-core and published it to PyPI.

pip install ragbox-core
from ragbox import RAGBox

rag = RAGBox("./docs")
print(rag.query("What is the vacation policy?"))

3 lines. Everything else runs automatically.


What "automatically" actually means

When you point RAGBox at a folder, here's what runs without you touching it:

Document parsing — PDFs, text files, PowerPoints, Python files with AST parsing. It figures out the file type and routes accordingly.
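RAGBox's actual dispatch code isn't shown in this post, but extension-based routing is a small pattern worth seeing. A minimal sketch with stdlib only; the parser functions here are placeholders, not the library's real handlers:

```python
from pathlib import Path

# Placeholder parsers -- stand-ins for real PDF/PPTX/AST handlers.
def parse_pdf(p: Path) -> str: return f"pdf:{p.name}"
def parse_text(p: Path) -> str: return f"text:{p.name}"
def parse_pptx(p: Path) -> str: return f"pptx:{p.name}"
def parse_python(p: Path) -> str: return f"ast:{p.name}"

PARSERS = {
    ".pdf": parse_pdf,
    ".txt": parse_text,
    ".md": parse_text,
    ".pptx": parse_pptx,
    ".py": parse_python,
}

def route(path: str) -> str:
    """Pick a parser by file extension; fall back to plain text."""
    p = Path(path)
    return PARSERS.get(p.suffix.lower(), parse_text)(p)
```

The fallback matters: an unknown extension degrades to plain-text parsing instead of raising.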

Chunking — late chunking with context awareness, not naive 1000-token splits. The chunk boundary problem is real and most tutorials ignore it.
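I won't reproduce RAGBox's chunker here, but the boundary problem itself is easy to demonstrate: a fixed-size split cuts mid-sentence, while even a simple boundary-respecting chunker keeps whole units. A toy sketch, with character budgets standing in for token budgets:

```python
import re

def naive_chunks(text: str, size: int) -> list[str]:
    """Fixed-size split: happily cuts sentences in half."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def boundary_chunks(text: str, size: int) -> list[str]:
    """Greedy packing of whole sentences up to a size budget."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > size:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Run both on the same paragraph and the naive version's first chunk ends mid-word; the boundary version never does.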

Embedding + FAISS indexing — Sentence-BERT embeddings, FAISS ANN index, TTL-cached so repeat queries hit cache instead of re-embedding.

Knowledge graph construction — the non-obvious one. RAGBox runs entity extraction on every document using an LLM, builds a Leiden-clustered knowledge graph, and persists it. This is what makes cross-document queries work.
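RAGBox uses an LLM for entity extraction and Leiden for clustering; neither is reproduced here. But the data structure underneath — entities as nodes, co-occurrence within a document as edges — can be sketched with the stdlib (the entity lists are assumed to come from the LLM step):

```python
from itertools import combinations
from collections import defaultdict

def build_graph(doc_entities: dict[str, list[str]]) -> dict[str, set[str]]:
    """Adjacency map: entities that co-occur in a document get an edge."""
    graph = defaultdict(set)
    for entities in doc_entities.values():
        for a, b in combinations(sorted(set(entities)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

# Entities per document, as an LLM extraction step might return them.
docs = {
    "org_chart.md": ["Maria Santos", "James Wu"],
    "q4_report.md": ["James Wu", "Q4 revenue"],
}
graph = build_graph(docs)
```

Note that "Maria Santos" and "Q4 revenue" never share a document, so cross-document questions about them require traversing through "James Wu" — exactly what vector search alone can't do.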

Dual-mode routing — a simple factual query takes the fast path: skip the graph, answer in ~12ms. A complex relationship or multi-hop query takes the deep path: graph traversal, cross-encoder reranking, multi-query expansion.
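The post doesn't document RAGBox's routing heuristic, but the shape of a keyword-based router is simple to sketch; the trigger phrases below are my guesses, not the library's:

```python
RELATIONAL_CUES = (
    "relationship", "report to", "responsible for",
    "caused", "relate", "both", "who does",
)

def pick_path(query: str) -> str:
    """Route simple lookups to the fast path, relational/multi-hop to deep."""
    q = query.lower()
    if any(cue in q for cue in RELATIONAL_CUES):
        return "deep"   # graph traversal + reranking + query expansion
    return "fast"       # vector search only, skip the graph
```

A production router could also use query length, an entity count, or a small classifier; the point is only that the decision gates an expensive pipeline.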

Self-healing watchdog — background process watches the source folder. File changes? Re-chunks, re-embeds, updates the graph. Index never goes stale.


The thing that actually makes cross-document reasoning work

Most RAG tutorials give you vector search. Vector search is great for factual lookups. It fails on questions like:

  • "Who does Maria Santos report to?" — requires connecting two documents
  • "What caused the Q4 revenue miss and who was responsible?" — requires 3+ documents
  • "How did the infrastructure outage relate to the deployment decision?" — requires causal reasoning across docs

Vector search retrieves the most semantically similar chunks. It doesn't reason about relationships between entities across documents. GraphRAG does.
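The difference is concrete: answering "who does X report to?" via a graph is a path search, not a similarity ranking. A toy breadth-first traversal over an entity graph (hand-built here; RAGBox builds its graph from LLM extraction):

```python
from collections import deque

def find_path(graph: dict[str, list[str]], start: str, goal: str):
    """BFS: shortest chain of entities connecting start to goal."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Hand-built toy graph spanning two "documents".
graph = {
    "Maria Santos": ["Engineering"],             # from the org chart doc
    "Engineering": ["Maria Santos", "James Wu"],
    "James Wu": ["Engineering", "Q4 revenue"],   # from the Q4 report doc
}
```

No single chunk contains both endpoints, so no similarity score can surface the answer — but a three-hop path through the graph can.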

Here's the honest benchmark result:

Relationship Questions (Cross-Document)

"Who does Maria Santos report to?"
  RAGBox:  0.767
  Vanilla: 0.959   ← vanilla wins here

"Which executive is responsible for both security and compliance?"
  RAGBox:  0.836
  Vanilla: 0.819   ← RAGBox wins here

Multi-Hop Questions (3+ Documents)

"Relationship between deployment strategy and the SEV1 incident?"
  RAGBox:  0.000
  Vanilla: 0.802   ← vanilla wins badly

"Plan to grow from $185M to $250M ARR?"
  RAGBox:  0.614
  Vanilla: 0.609   ← effectively tied

I published these results, including the ones where RAGBox loses badly, because if you're deciding whether to use a library, you need real numbers, not cherry-picked wins.

Honest summary: vanilla ChromaDB beats RAGBox on simple factual lookups and some multi-hop queries where graph extraction fails. RAGBox wins when the answer genuinely requires connecting entities across documents. Know what you're optimizing for.


The decisions that weren't obvious

Why Cross-Encoder reranking?

Bi-encoder similarity is fast but blunt — it scores query-document similarity in embedding space. Cross-encoders read the query and document together and produce a fine-grained relevance score. Slower, but dramatically more precise.

RAGBox uses bi-encoder for retrieval speed and ms-marco Cross-Encoder for reranking the top-k results. Wrong results at 5ms are worse than right results at 12ms.
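The two-stage shape is retrieve-wide-then-rerank-narrow. A structural sketch with fake scorers — a real system would use a bi-encoder for stage 1 and an ms-marco cross-encoder from sentence-transformers for stage 2:

```python
def retrieve(query: str, docs: list[str], k: int, fast_score) -> list[str]:
    """Stage 1: cheap bi-encoder-style scoring over the whole corpus."""
    return sorted(docs, key=lambda d: fast_score(query, d), reverse=True)[:k]

def rerank(query: str, candidates: list[str], slow_score) -> list[str]:
    """Stage 2: expensive cross-encoder-style scoring over top-k only."""
    return sorted(candidates, key=lambda d: slow_score(query, d), reverse=True)

# Fake scorers: word overlap for stage 1, exact-phrase bonus for stage 2.
def fast_score(q: str, d: str) -> int:
    return len(set(q.lower().split()) & set(d.lower().split()))

def slow_score(q: str, d: str) -> int:
    return fast_score(q, d) + (10 if q.lower() in d.lower() else 0)

docs = [
    "Vacation policy: 20 days per year.",
    "The vacation policy changed in 2024: now 25 days.",
    "Parking policy: permits required.",
]
top = rerank(
    "vacation policy",
    retrieve("vacation policy", docs, k=2, fast_score=fast_score),
    slow_score,
)
```

The expensive scorer only ever sees k candidates, which is why the pattern scales: cost of stage 2 is bounded regardless of corpus size.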

Why Leiden instead of Louvain?

Leiden guarantees well-connected communities. Louvain can generate disconnected communities in practice. For document knowledge graphs, this shows up in multi-hop queries where the traversal path matters.

Why not just wrap LangChain?

I tried. When something goes wrong in a LangChain chain, the traceback is useless. RAGBox is a direct implementation — every component is inspectable, every failure has a clear source.

Why publish the comparison table that includes where you lose?

Because I'm a library user too. The COMPARISON.md in the repo has the full side-by-side including where LlamaIndex or LangChain is the right call. Use the right tool.


When to use this vs. when not to

Use RAGBox if:

  • You want a working RAG system today, not after three days of wiring LangChain
  • You need cross-document reasoning without building GraphRAG from scratch
  • You're building internal tools, prototypes, or MVPs
  • You want honest benchmarks you can reproduce yourself

Don't use RAGBox if:

  • You need custom retrieval pipelines with specific SLAs
  • You're building a commercial product and need to control every component
  • Your queries are purely simple factual lookups — vanilla vector search will be faster

Reproduce the benchmarks yourself

git clone https://github.com/ixchio/ragbox-core
cd ragbox-core
export GROQ_API_KEY="gsk_..."   # free tier works
python benchmarks/run_benchmark.py

15 questions across 8 interconnected documents. 5 factual, 5 relationship, 5 multi-hop. Scored with sentence-transformer cosine similarity. Real LLM calls, no mocks.
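The benchmark's scoring metric is cosine similarity between the answer embedding and a reference embedding. The formula itself fits in a few lines; the vectors below are toys, where the real benchmark uses sentence-transformer embeddings:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """cos(theta) = dot(a, b) / (|a| * |b|)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

A score of 1.0 means the answer embedding points the same direction as the reference; the 0.000 multi-hop result above means the generated answer was essentially orthogonal to the expected one.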

If you get different results, open an issue. I want to know.


pip install ragbox-core

github.com/ixchio/ragbox-core

pypi.org/project/ragbox-core

MIT license. PRs welcome. If it saves you the boilerplate, give it a star.

Top comments (2)

klement Gunndu

The dual-mode routing between simple factual queries and multi-hop is clever — that 12ms fast path makes a real difference in UX. Worth noting the late chunking approach pairs well with reranking when your corpus has inconsistent document lengths.

Aman Pandey

Exactly right. The fast path exists specifically because GraphRAG adds ~1.5s on simple queries, where it's overkill. And you nailed the chunking-reranking relationship: inconsistent doc lengths are where naive fixed-size chunking breaks hardest, and the reranker catches what slips through.
What corpus are you working with?