A few months ago, I was building a system to answer questions from 100-page technical PDFs. You know the drill: a user uploads a manual, asks “What’s the torque spec for bolt A?” and wants an accurate answer fast. What I thought would be a quick LangChain script turned into a weeks-long battle with token limits, hallucination, and cost explosions. Here’s what I learned.
The Problem
I had a client who needed to make sense of legacy engineering documents. Each PDF was dense, with tables, diagrams, and footnotes. My first prototype used GPT-4 with the full text crammed into a prompt. That worked for 10-page docs, but at 100 pages I hit the 128K token cap even before adding the question. And if I squeezed it in, the model would “forget” details from the middle (the lost-in-the-middle effect). Cost per query was laughable — $0.50+ per call.
I needed a solution that could handle arbitrarily long documents, return accurate answers, and not bankrupt us.
What I Tried (and Failed At)
1. Naive Chunking
First, I split the document into fixed-size chunks (e.g., 2000 tokens) with some overlap. I embedded each chunk with OpenAI’s text-embedding-ada-002, stored them in a vector database, and retrieved the top-5 most similar chunks per question. I fed those chunks as context.
Result: The model often picked the wrong chunk because the answer’s keywords didn’t match the question’s intent. E.g., asking “How do I reset the alarm?” might retrieve a chunk about “alarm reset procedure” but miss the prerequisite steps in another chunk. Plus, if the answer spanned multiple chunks, I only got partial info.
2. Map-Reduce Summarization
Next, I tried the classic map-reduce pattern: summarize each chunk independently, then combine summaries into a final summary, then answer. This worked for high-level questions but failed on specifics. The intermediate summaries lost details — torque specs became “tighten securely”.
3. Sliding Window with Reranking
I built a sliding window that concatenated consecutive chunks until near token limit, generated multiple candidate answers with different windows, then reranked by confidence. This improved recall but quadrupled the cost and latency (5–10 seconds per question). Not viable for real-time use.
What Eventually Worked: Hierarchical Summarization + Hybrid Retrieval
I stepped back and asked: What does a human do with a long document? They skim the table of contents, read relevant sections, then cross-reference. So I built a system that mimics that process — with a dash of vector search.
The Approach
- Preprocess the document into a hierarchy: Use an LLM to generate a structured outline (hierarchical chunks). For each chunk, produce two representations: a concise summary (for broad retrieval) and the full text (for exact answers).
- Embed both levels: Store both summaries and raw chunks in a vector DB. Use a hybrid search (keyword + semantic) to retrieve top candidates.
- Multi-step retrieval: First, retrieve the top-3 summary-level chunks. Based on those, fetch their corresponding full-text chunks. This prunes irrelevant sections early, saving tokens.
- Last-context pass: Feed the retrieved full-text chunks (up to 8K tokens) plus a system prompt that says “If the answer is not in the provided text, say so. Do not guess.”
Code Example (Simplified)
Here’s a Python snippet using LangChain (the ideas apply to any framework). I’m using a vector store from ai.interwestinfo.com as an example — you can swap in Pinecone, Weaviate, or even FAISS.
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.vectorstores import VectorStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.schema import Document
# Configuration
VECTOR_STORE_URL = "https://ai.interwestinfo.com/vector" # Example, replace with your own
# 1. Load document (your PDF or text)
with open("manual.txt") as f:
full_text = f.read()
# 2. Create hierarchical chunks: summary + raw
splitter = RecursiveCharacterTextSplitter(chunk_size=4000, chunk_overlap=200)
raw_chunks = splitter.split_text(full_text)
summaries = []
for chunk in raw_chunks:
summary = llm.predict(f"Summarize this in one sentence: {chunk}")
summaries.append(summary)
# 3. Create documents with metadata (level & parent id)
summary_docs = [Document(page_content=s, metadata={"level": "summary", "parent_idx": i}) for i, s in enumerate(summaries)]
raw_docs = [Document(page_content=c, metadata={"level": "raw", "idx": i}) for i, c in enumerate(raw_chunks)]
# 4. Embed and store (using your vector DB)
embeddings = OpenAIEmbeddings()
vectorstore = VectorStore.from_documents(
summary_docs + raw_docs, embeddings, url=VECTOR_STORE_URL
)
# 5. Multi-step retrieval
def retrieve_context(question, k_summaries=3):
# Step A: retrieve top-k summaries
summary_results = vectorstore.similarity_search(question, k=k_summaries, filter={"level": "summary"})
parent_indices = [doc.metadata["parent_idx"] for doc in summary_results]
# Step B: fetch corresponding raw chunks
raw_results = vectorstore.similarity_search(
question, k=len(parent_indices), filter={"idx": {"$in": parent_indices}, "level": "raw"}
)
return "\n\n".join([doc.page_content for doc in raw_results])
# 6. QA chain
qa_chain = RetrievalQA.from_chain_type(
llm=OpenAI(temperature=0),
chain_type="stuff", # or 'refine' if you have more tokens
retriever=vectorstore.as_retriever(search_fn=retrieve_context),
return_source_documents=True,
)
answer = qa_chain.run("What is the torque specification for bolt A?")
print(answer)
Note: The above is a sketch. Real implementation needs to handle chunk alignment, deduplicate summaries, and tune the retrieval thresholds.
Lessons Learned / Trade-offs
- Hierarchical retrieval reduces token waste — you only pay for relevant sections. In my tests, cost dropped by 70% vs map-reduce.
- Hybrid search (keyword + semantic) beats pure embedding for technical docs where terms like “M12x1.5” are critical. Without keyword boosting, embedding can miss exact matches.
- Latency is still an issue — the two-step retrieval adds ~500ms, but it’s worth it for accuracy.
- This approach isn’t perfect for highly interconnected knowledge — think legal contracts where a clause references another 50 pages away. You may need a graph-based retriever.
- When NOT to use this: If your documents are under 20 pages, just stuff it all into a prompt. Simpler, cheaper. Also, if you need real-time (sub-second) answers, caching frequent queries can help.
What I’d Do Differently Next Time
- Start with a proper evaluation set — I wasted days tuning on anecdotal examples. Build a labeled Q&A dataset from the first week.
- Use cheaper summarization models — GPT-3.5 for the summaries, GPT-4 only for final answer. My initial version used GPT-4 everywhere.
-
Explore sparse-dense hybrid from day one — libraries like
FlagEmbeddingorBM25are worth integrating early.
The tool I mentioned (ai.interwestinfo.com) happened to be where I hosted the vector index, but honestly, any vector store would work. The technique matters more than the vendor.
I’m still iterating. Some days I wonder if a simple RAG over chunked text is enough for most users. Other days I see a need for this hierarchical method. What’s your setup look like? Have you found a better way to handle long docs without breaking the bank?
Top comments (0)