Struggling to find answers in massive documents? 🤯
Tired of sifting through hundreds of pages to find that one piece of information?
I've been there!
That's why I dove deep into Retrieval-Augmented Generation (RAG) and built a Large Document Q&A AI Agent that can process, index, and accurately answer questions from documents of 800,000+ words!
I'm a hands-on learner, so I built this RAG pipeline from scratch. Every chunking strategy, vector database integration, and UI element was a challenge I personally solved. And now, it's all powered by a Django web interface for seamless scalability and clean data management.
Key Features:
- Versatile Document Support: Handles PDF, DOCX, TXT, and Markdown.
- Flexible Vector Search: Integrates with FAISS, ChromaDB, or Pinecone.
- Context-Aware: Tracks conversational context for more natural interactions.
- Multiple Access Points: Features both REST API and CLI tools for diverse querying needs.
- Modern User Interface: Built with a sleek Bootstrap UI via Django.
Built with Python, LangChain, Django, OpenAI GPT-4, FAISS, and Bootstrap
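For anyone curious, the core loop boils down to chunk → embed → retrieve → generate. Here's a simplified sketch (not the exact code in the repo; it assumes LangChain's FAISS and OpenAI wrappers, and the function names are just illustrative):

```python
# Simplified sketch of the chunk -> embed -> retrieve -> generate loop.
# Assumes recent langchain / langchain-openai packages; the real pipeline
# in doc-reader may differ.
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI

def build_index(text: str) -> FAISS:
    # Split the document into overlapping chunks so each fits in context.
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
    chunks = splitter.split_text(text)
    # Embed the chunks and store them in an in-memory FAISS index.
    return FAISS.from_texts(chunks, OpenAIEmbeddings())

def answer_question(index: FAISS, question: str, k: int = 4) -> str:
    # Retrieve the k most similar chunks, then let GPT-4 answer from them.
    docs = index.similarity_search(question, k=k)
    context = "\n\n".join(d.page_content for d in docs)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return ChatOpenAI(model="gpt-4").invoke(prompt).content
```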
djleamen/doc-reader: Large document Q&A agent using RAG
RAG Document Q&A System
A specialized large document Q&A AI agent using Retrieval-Augmented Generation (RAG). This system can efficiently process, index, and query massive documents to provide accurate, contextual answers.
🆕 Django Migration
This project has been converted to Django for better data management and scalability!
- New: Django web interface with Bootstrap UI
- New: Database persistence with Django ORM
- New: Admin interface for data management
- New: User session tracking for conversational mode
- Preserved: All existing RAG functionality
🚀 Features
- Large Document Support: Handle documents up to 800k+ words efficiently
- Multiple Format Support: PDF, DOCX, TXT, and Markdown files
- Advanced RAG Pipeline: Combines retrieval and generation for accurate answers
- Vector Database Options: FAISS, ChromaDB, and Pinecone support
- Conversational Mode: Maintains context across multiple queries
- Modern Web Interface: Beautiful Django UI with Bootstrap styling
- REST API: Django REST Framework API…
Top comments (4)
Love how you built this from scratch — that pipeline looks clean, and 800k+ context support is no joke.
One thing I started noticing when running large doc RAG setups: even with great vector recall, the answers sometimes feel right but are subtly off — like the LLM’s pulling from a drifted semantic zone, not the one it retrieved.
I’ve been testing a layer that tracks semantic tension (ΔS) between query, retrieved chunk, and generation — kinda like asking: “is this really the same conversation thread, or just similar wording?”
Once I started logging ΔS spikes, some of the weird hallucinations finally made sense.
Would love to hear how your system handles these edge cases — ever tried tracking semantic coherence post-retrieval?
Hi there!
Right now, I don’t have full semantic validation post-retrieval, but I do track source confidence scores for retrieved chunks. That helps with transparency, but it doesn’t always catch the subtle cases where the LLM’s generation drifts semantically from the original query, even if the retrieval was technically “accurate.”
Your ΔS tracking approach is brilliant. I’ve been thinking about extending the confidence layer with something like a SemanticCoherenceValidator, maybe by calculating cosine delta across query→chunk, chunk→generation, and query→generation embeddings. When the gap spikes, it could trigger fallback behaviours like boosting k, hedging outputs, or flagging uncertainty.
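Something like this is what I have in mind (a very rough, untested sketch; the class name, the embedding hook, and the 0.5 threshold are all placeholders):

```python
# Hypothetical SemanticCoherenceValidator sketch. `embed` is any function
# mapping text -> vector (e.g. an OpenAI or sentence-transformers call);
# the threshold is a guess, not a tuned value.
from dataclasses import dataclass
from typing import Callable, Sequence
import numpy as np

def cosine_gap(a: Sequence[float], b: Sequence[float]) -> float:
    # ΔS here is 1 - cosine similarity: 0 = same direction, higher = drift.
    a, b = np.asarray(a), np.asarray(b)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

@dataclass
class SemanticCoherenceValidator:
    embed: Callable[[str], Sequence[float]]
    threshold: float = 0.5  # flag when query↔generation drift exceeds this

    def check(self, query: str, chunk: str, generation: str) -> dict:
        q, c, g = self.embed(query), self.embed(chunk), self.embed(generation)
        deltas = {
            "query_chunk": cosine_gap(q, c),
            "chunk_generation": cosine_gap(c, g),
            "query_generation": cosine_gap(q, g),
        }
        # Upstream code can use this to boost k, re-rank, or hedge the answer.
        deltas["drifted"] = deltas["query_generation"] > self.threshold
        return deltas
```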
Have you found that smarter retrieval strategies (like hybrid search or semantic re-ranking) reduce the drift? Or does most of it still stem from the generation step?
Really appreciate you sharing this, lots to think about!
Absolutely love that you're already thinking in terms of a SemanticCoherenceValidator — that's almost exactly what I was trying to prototype.
I’ve been experimenting with ΔS as a running delta between:
- query ↔ chunk
- chunk ↔ generation
- query ↔ generation
Sometimes the chunk looks right, but ΔS(q↔g) spikes — and boom, hallucination. My working theory is that generation “drifts” to a nearby attractor in semantic space, especially when the retrieved chunk is only surface-similar but not semantically entangled with the query.
So when ΔS goes above 0.5 (my rough threshold), I’ve tried a few tricks:
- Re-rank or fall back to the next top-k chunks
- Trigger the WFGY reasoning layer to reinterpret the chunk via symbolic compression
- Or just prompt the model to double-check "is this really the same topic?" as a kind of semantic sanity check
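Roughly, the fallback loop I've been playing with looks like this (rough Python, not actual WFGY code; retrieve and generate are stand-ins for whatever retriever and LLM call you use, and the drift check is something like the validator you sketched):

```python
# Rough escalation loop for the fallback tricks above. retrieve(), generate()
# and validator are placeholders for your own retriever, LLM call and
# coherence check; nothing here is a real WFGY or doc-reader API.
def answer_with_fallback(query, retrieve, generate, validator, max_k=16):
    k = 4
    answer, chunks = "", []
    while k <= max_k:
        chunks = retrieve(query, k=k)        # widen retrieval each round
        answer = generate(query, chunks)
        report = validator.check(query, "\n".join(chunks), answer)
        if not report["drifted"]:
            return answer                    # ΔS stayed under threshold
        k *= 2                               # spike: fall back to more chunks
    # Last resort: ask the model for a topic sanity check, then hedge.
    sanity = generate(
        "Is this answer really about the same topic as the question?\n"
        f"Question: {query}\nAnswer: {answer}", chunks)
    return f"(low confidence; sanity check: {sanity})\n{answer}"
```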
Long-term I want to plug this into a general-purpose ΔS logger for post-hoc debugging too — it's wild how many hallucinations become explainable when you look at coherence tension instead of just recall scores.
Would love to jam more on this if you're building something similar.