Rushank Savant
Sentence Window Retrieval

Sentence Window Retrieval is a technique designed to solve a classic RAG dilemma: "How do I give the LLM enough context without confusing it with irrelevant noise?"


🧠 The Concept

In standard RAG, you search for a chunk of text, and whatever you find is exactly what you send to the LLM.

In Sentence Window Retrieval, you decouple the Search from the Synthesis:
1. Search (The Needle): You break your document into tiny, highly specific units (usually just 1 or 2 sentences). You use these tiny units to find the exact "needle" in the haystack.

2. Context (The Window): Once you find that specific sentence, you don't just send that one line. Instead, you "roll down the window" to capture the sentences immediately before and after it.

Why do we do this?

• Precision: Small sentences are easier for vector models to match accurately. Large chunks often "blur" multiple topics together, making the search fuzzy.

• Context: A single sentence often lacks context (e.g., "It was decided then."). By adding a "window" of surrounding sentences, the LLM understands what "It" refers to and what "then" means.
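The decoupling of search unit and context unit can be sketched in a few lines of plain Python. This is an illustrative toy (the sentence list and `retrieve_with_window` helper are made up for the example, not a library API):

```python
# Toy sketch: search on tiny units, but return a window around the hit.
sentences = [
    "The project kickoff happened in March.",
    "It was decided then.",
    "The budget was doubled as a result.",
]

def retrieve_with_window(match_position, window=1):
    """The search hit is a single sentence; return it with its neighbors."""
    start = max(0, match_position - window)
    end = min(len(sentences), match_position + window + 1)
    return " ".join(sentences[start:end])

# A search matching "It was decided then." alone is ambiguous;
# the window restores what "It" and "then" refer to.
print(retrieve_with_window(1))
```

The `max`/`min` clamping matters at the document edges: a match on the first or last sentence simply gets a smaller window rather than raising an error.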


📝 Example-based explanation:

Imagine a book about the Moon landing: "The History of Spaceflight"

1. The Ingestion (What's in the DB):
The system stores each sentence separately. One of those sentences is:
"At 20:17 UTC, Armstrong and Aldrin landed the Eagle."

2. The User Query: "When did the lunar module land?"

3. The Search (Finding the Needle):
The vector search finds the specific sentence: "At 20:17 UTC, Armstrong and Aldrin landed the Eagle." It is a 99% match.

4. The Window Expansion (The Magic):
If we only gave the LLM that one sentence, it might not know that "the Eagle" is the lunar module. So, the system applies a window size of 2 (2 sentences before and 2 after):

  • (2 sentences before): The Saturn V rocket had performed perfectly.
  • (1 sentence before): The descent was tense due to computer alarms.
  • [Matched Sentence]: At 20:17 UTC, Armstrong and Aldrin landed the Eagle.
  • (1 sentence after): They immediately prepared for a potential emergency liftoff.
  • (2 sentences after): "The Eagle has landed," Armstrong radioed to Houston.

5. The LLM Response:
The LLM now has a clear picture. It sees the alarms, the specific time, and the famous quote. It can provide a much richer, more accurate answer than if it only saw the single timestamp sentence.
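Steps 3-4 above can be expressed as simple list slicing. This is a self-contained sketch of the expansion step only (the match index is hard-coded where a real system would get it from a vector search):

```python
# Window expansion over the spaceflight example, window size = 2.
sentences = [
    "The Saturn V rocket had performed perfectly.",
    "The descent was tense due to computer alarms.",
    "At 20:17 UTC, Armstrong and Aldrin landed the Eagle.",
    "They immediately prepared for a potential emergency liftoff.",
    '"The Eagle has landed," Armstrong radioed to Houston.',
]

match_index = 2   # pretend the vector search matched the timestamp sentence
window_size = 2

start = max(0, match_index - window_size)
end = min(len(sentences), match_index + window_size + 1)

context_for_llm = " ".join(sentences[start:end])
print(context_for_llm)  # here the window happens to cover all five sentences
```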


⚙️ Practical Implementation:

In LangChain, Sentence Window Retrieval is often approximated with ParentDocumentRetriever: each small "child" chunk is indexed for search, and the larger "parent" chunk it came from is returned as context.

import os
from langchain_community.vectorstores import Chroma
from langchain_huggingface import HuggingFaceEndpointEmbeddings
from langchain_classic.retrievers import ParentDocumentRetriever # install with: pip install langchain-classic
from langchain_core.stores import InMemoryStore # simple key-value store used by retrievers to hold Documents & text chunks (in memory, until runtime ends)
# InMemoryStore is fine for experimentation; in production use PostgreSQL or Redis to store these pairs
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

from dotenv import load_dotenv
load_dotenv()

# 1. Dummy Text Data (The "Spaceflight" example)
text_data = """
The Saturn V rocket had performed perfectly. 
The descent was tense due to computer alarms. 
At 20:17 UTC, Armstrong and Aldrin landed the Eagle. 
They immediately prepared for a potential emergency liftoff. 
"The Eagle has landed," Armstrong radioed to Houston.
"""

# 2. Text Loading (Wrapping our dummy text into LangChain Documents)
documents = [Document(page_content=text_data, metadata={"source": "space_history"})]

# 3. Defining the "Window" vs "Sentence" Splitters
# This is the secret sauce: 
# child_splitter = The tiny 'needle' we search for (sentences)
# parent_splitter = The 'window' we actually give to the LLM (paragraphs/surrounding context)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=0) 
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)

# 4. Storage & Vector DB Setup
embedding_model = HuggingFaceEndpointEmbeddings(
    model="sentence-transformers/all-MiniLM-L6-v2", # this model returns a 384-dimensional vector
    task="feature-extraction",
)
vectorstore = Chroma(collection_name="sentence_window_rag", embedding_function=embedding_model)
store = InMemoryStore() # This stores the 'Parent' (The Window)

# 5. Sentence Window Retrieval Implementation
# ParentDocumentRetriever under-the-hood:
# uses parent_splitter to create windows that are saved in InMemoryStore
# uses child_splitter to create embeddings that are stored in vector db
# Every child chunk is tagged with a unique parent_id that points exactly to the larger window it came from.
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Add the documents to the system
# Behind the scenes: It splits into small sentences for VectorDB 
# but maps them to larger 'parent' windows in the InMemoryStore.
retriever.add_documents(documents)

# 6. Retrieval Example
query = "When did the lunar module land?"

# 7. Execution
print(f"--- QUERY: {query} ---")
retrieved_docs = retriever.invoke(query)
# retriever.invoke(query) under-the-hood:
# 1. similarity search in VectorStore (Chroma) for child chunk
# 2. retriever looks at the parent_id metadata attached to that chunk.
# 3. goes to the InMemoryStore and fetches parent chunk associated with this ID
# 4. returns the parent chunk to be used as context.

# Display the result
print("\nRetrieved Window (Context for LLM):")
print(retrieved_docs[0].page_content)

🔧 Step-by-Step Breakdown:

1. Dual-Splitting: We use two splitters. The child_splitter creates tiny chunks (1-2 sentences) that are highly searchable. The parent_splitter creates larger chunks that provide the "window."

2. Mapping: When retriever.add_documents() is called, LangChain indexes the child chunks in Chroma but keeps a hidden ID that links back to the parent window in the InMemoryStore.

3. The Fetch: When you query "When did the lunar module land?", the system finds the specific child chunk "At 20:17 UTC, Armstrong and Aldrin landed the Eagle.", but returns the entire parent chunk that includes the computer alarms and the "Eagle has landed" quote.
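The child-to-parent mapping described above can be demystified with a dependency-free sketch. Here plain dicts stand in for the InMemoryStore and the vector index (the real classes work analogously; the similarity search is faked as a direct lookup):

```python
import uuid

# Stand-in for the parent window that would live in the InMemoryStore.
parent_window = (
    "The descent was tense due to computer alarms. "
    "At 20:17 UTC, Armstrong and Aldrin landed the Eagle. "
    "They immediately prepared for a potential emergency liftoff."
)
child_sentence = "At 20:17 UTC, Armstrong and Aldrin landed the Eagle."

# Ingestion: store the window under an ID, tag the child chunk with that ID.
parent_id = str(uuid.uuid4())
docstore = {parent_id: parent_window}          # plays the role of InMemoryStore
vector_index = [{"text": child_sentence, "parent_id": parent_id}]  # plays the role of Chroma

# Retrieval: similarity search finds the child, then we follow the parent_id.
matched_child = vector_index[0]                # stand-in for the similarity search hit
context = docstore[matched_child["parent_id"]]
print(context)  # the full window, not just the matched sentence
```

This is exactly the indirection that lets the searchable unit and the synthesized context differ: the vector store never holds the window, only a pointer to it.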


🎯 Conclusion:

We matched the exact detail the query asked for, then fetched the surrounding window so the LLM receives that detail in context: precise search, rich synthesis.
