Oleh Halytskyi

Optimizing RAG Context: Chunking and Summarization for Technical Docs

Introduction

In the realm of Retrieval-Augmented Generation (RAG) applications, precise text preparation is crucial, especially when dealing with technical documentation that includes code examples. Effective text chunking is essential to maintain the logical flow and integrity of the content, which directly impacts the quality of the context provided to RAG systems and ensures the generation of accurate and useful outputs.

Inspired by Greg Kamradt's insightful tutorial 5 Levels Of Text Splitting and the valuable resources on FullStackRetrieval.com, I recognized the importance of carefully structuring text chunks to preserve meaning and context. When text is split without attention to its logical structure, it can lose critical context, misrepresent information, or disrupt the flow of understanding, resulting in incomplete or confusing code examples and explanations.

After evaluating numerous existing solutions, including all current LangChain text splitters and several other popular approaches, I found none that fully met my needs. This led me to develop a custom text splitter tailored to address these challenges. My solution ensures that code blocks remain intact and that chunks preserve the logical flow and meaning of the documentation. Maintaining the integrity of the original content is essential when working with technical material. While this custom splitter works well for this particular example, it may need further tuning to handle other sources effectively.

To support this process, I leverage LangChain for the entire pipeline, including loading HTML documents, converting them into Markdown format, and handling the summarization and querying tasks. For my specific task, I found Markdown to be the most suitable format for chunking, as it provides a clear structure and maintains the hierarchy and formatting of the content. Additionally, I utilize Ollama for LLM processing, which enables local use of powerful language models such as llama3.1 for summarization and mxbai-embed-large for embedding. This approach allows me to securely process private documentation without exposing it externally.

In addition, I've adopted an approach to summarization that I have not encountered in existing documentation or solutions. This method generates summaries that reflect not only the content within each chunk but also the hierarchical meaning of headers, ensuring that each chunk retains its full context. This strategic approach to summarization is designed to enhance the efficiency and precision of queries within VectorDB, thereby improving the overall utility of RAG applications.

By storing both the summary and the original text in VectorDB, I can query them in a variety of ways: search by summary, original text, or both, and even retrieve the original text based on summary queries. This flexibility greatly increases the efficiency of RAG applications, especially those designed to aid in code generation and understanding.

Solution Overview

Visual Diagram of the Process

In this guide, I will walk you through the following key steps:

  1. Loading and Converting Documents: Using LangChain to load HTML documents and convert them into Markdown format. Markdown is ideal for chunking due to its simplicity and ability to preserve the hierarchy of content, including headers and code blocks.

  2. Custom Text Splitter: A detailed explanation of how I developed a custom splitter that maintains the logical flow of text, avoids breaking code blocks, and includes hierarchical metadata from headers.

  3. Summarization with Header Hierarchy: Demonstrating how each chunk is summarized while preserving the full meaning by including header context. This ensures that summaries retain the structure and logical flow of the original document.

  4. VectorDB Integration: Storing both summarized and original content in a VectorDB, enabling flexible querying. You’ll see how this approach allows for efficient searches by summary, original content, or both, as well as the ability to query summaries and retrieve the corresponding original text.

  5. Conclusion: I will summarize the key takeaways from the guide, highlighting the importance of each step.

This tutorial is comprehensive and provides an in-depth walkthrough of the process. If you'd prefer to explore the code and outputs directly, you can access the Jupyter notebook here: Jupyter Notebook.

Preparing the Environment

Before diving into the document loading process, it's important to set up the right environment. Using Conda is a great way to manage dependencies and maintain isolated environments for different projects. Below are the steps to set up a Conda environment for our RAG project.

# Create a new environment called 'rag-env'
conda create -n rag-env python=3.11

# Activate the environment
conda activate rag-env

# Install necessary packages
pip install langchain==0.2.16 \
    langchain_community==0.2.16 \
    beautifulsoup4==4.12.3 \
    markdownify==0.13.1 \
    tiktoken==0.7.0 \
    langchain-chroma==0.1.3

Additionally, you’ll need to install Ollama locally if you haven’t already and pull the llama3.1 and mxbai-embed-large models (for example, by running ollama pull llama3.1 and ollama pull mxbai-embed-large). These models are used for summarization and embedding, respectively.

Load Documents

In this step, we use LangChain's RecursiveUrlLoader to load HTML documents. This loader is particularly useful when we want to extract content from web pages, as it allows us to follow links recursively. For this example, we're loading the Python Control Flow Tutorial from the official Python documentation with a depth limit of 1. The max_depth parameter ensures that we only pull content from the specified page without following additional links.

from langchain_community.document_loaders import RecursiveUrlLoader

# Load the document
loader = RecursiveUrlLoader(
    "https://docs.python.org/3/tutorial/controlflow.html",
    max_depth=1,
)

html_docs = loader.load()
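
As a quick optional check, you can confirm that only a single page was loaded and inspect its source URL (this assumes the loader fills in the standard source field of the document metadata):

# Optional check: with max_depth=1 we expect exactly one loaded document
print(f"Loaded {len(html_docs)} document(s)")
print(html_docs[0].metadata.get("source"))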

Convert HTML Document to Markdown Format

After loading the HTML document, the next step is to convert it into Markdown format. Markdown is a great format for processing because of its simplicity and ability to preserve document structure, including headers and code blocks.

In this step, we use a custom transformer based on MarkdownifyTransformer. The custom transformer overrides the transform_documents method to process the content, ensuring that code blocks are handled properly. We use a regular expression to identify code blocks and remove unnecessary newlines before the closing backticks.

from langchain_community.document_transformers import MarkdownifyTransformer
import re

# Custom transformer that extends the MarkdownifyTransformer to strip trailing newlines from code blocks
class CustomMarkdownifyTransformer(MarkdownifyTransformer):
    def transform_documents(self, documents):
        transformed_docs = super().transform_documents(documents)
        for doc in transformed_docs:
            if hasattr(doc, 'page_content'):
                doc.page_content = self._strip_trailing_newline_from_code_blocks(doc.page_content)
        return transformed_docs

    def _strip_trailing_newline_from_code_blocks(self, text):
        # Regex to find code blocks and ensure they end with a newline before the closing backticks
        code_block_pattern = re.compile(r'(```\w*\n[\s\S]*?)```')
        return code_block_pattern.sub(lambda match: match.group(1).rstrip() + '\n```', text)

# Transform the document to Markdown format
md = CustomMarkdownifyTransformer()
md_docs = md.transform_documents(html_docs)
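
To verify the conversion before moving on, a short optional preview of the resulting Markdown is often enough:

# Optional preview: the first 500 characters of the converted Markdown
print(md_docs[0].page_content[:500])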

Split Markdown into Chunks Based on Headers

To prepare the document for querying in a RAG application, it's essential to split it into chunks based on its structure. In this step, we break the Markdown content into manageable sections by using the document headers as delimiters.

While developing this custom splitter, I experimented with several existing approaches, including LangChain's Text Splitters and the ExperimentalMarkdownSyntaxTextSplitter, but none of them met my criteria. These splitters often distorted code blocks by changing whitespace or introducing unwanted newlines, and preserving the original code formatting is critical for providing quality context to an LLM.

To overcome these issues, I developed a custom splitter that:

  • Preserves the logical flow of the text by splitting only on headers;
  • Ensures that code blocks remain intact without altering their original formatting;
  • Filters out irrelevant headers like "Table of Contents", "This Page" and "Navigation";
  • Adds header names to the metadata for each chunk, including all headers by hierarchy. This ensures that each chunk retains the context provided by its parent headers, making it easier to trace back the full structure of the document. This hierarchical metadata is also particularly useful for summarization, as it helps keep the chunk context intact;
  • Stores headers in the metadata, which allows for more flexible querying in specific use cases. For example, filtering in VectorDB on metadata such as header names can retrieve the chunks that correspond to a particular section of the document.

from langchain.schema import Document

sicbh_include_headers_in_content = False
sicbh_filter_headers = ["Table of Contents", "This Page", "Navigation"]
sicbh_show_unwanted_chunks_metadata = False

# Function to divide the Markdown documents into chunks based on headers
def split_into_chunks_by_headers(md_docs):
    chunks = []
    header_pattern = re.compile(r'^(#{1,6})\s+(.*)')
    code_block_pattern = re.compile(r'^\s*```')
    in_code_block = False

    for doc in md_docs:
        if hasattr(doc, 'page_content'):
            lines = doc.page_content.split('\n')
        else:
            raise AttributeError("Document object has no 'page_content' attribute")

        current_chunk = {'metadata': {}, 'content': ''}
        current_headers = {}
        prev_header_level = 0

        for line in lines:
            if code_block_pattern.match(line):
                in_code_block = not in_code_block

            if not in_code_block:
                match = header_pattern.match(line)
                if match:
                    # If there is content in the current chunk, add it to the chunks list
                    if current_chunk['content']:
                        current_chunk['content'] = current_chunk['content'].strip()
                        chunks.append(current_chunk)
                        current_chunk = {'metadata': {}, 'content': ''}

                    # Extract the header level and text
                    header_level = len(match.group(1))
                    header_text = match.group(2)

                    # Clean the header text
                    header_text = re.sub(r'\\', '', header_text)
                    header_text = re.sub(r'\[¶\]\(.*?\)', '', header_text).strip()

                    # Update the current headers
                    header_key = f'Header {header_level}'
                    if header_level > prev_header_level:
                        current_headers[header_key] = header_text
                    else:
                        del current_headers[f'Header {prev_header_level}']
                        current_headers[header_key] = header_text

                    # Add the header line to metadata
                    current_chunk['metadata'] = current_headers.copy()

                    # Optionally add the cleaned header text to content
                    if sicbh_include_headers_in_content:
                        current_chunk['content'] += f"{match.group(1)} {header_text}\n"

                    # Update the previous header level
                    prev_header_level = header_level
                else:
                    current_chunk['content'] += line + '\n'
            else:
                current_chunk['content'] += line + '\n'

        # Add the last chunk to the chunks list
        if current_chunk['content']:
            current_chunk['content'] = current_chunk['content'].strip()
            chunks.append(current_chunk)

    # Convert the chunks to Document objects, filtering out unwanted chunks
    documents = []
    unwanted_chunks = []
    for chunk in chunks:
        metadata = chunk['metadata']
        if metadata and not any(any(unwanted in value for unwanted in sicbh_filter_headers) for value in metadata.values()):
            documents.append(Document(page_content=chunk['content'], metadata=chunk['metadata']))
        else:
            unwanted_chunks.append(chunk['metadata'])

    # Optionally print the unwanted chunks metadata
    if sicbh_show_unwanted_chunks_metadata and unwanted_chunks:
        print(f"Unwanted chunks metadata:")
        for chunk in unwanted_chunks:
            print(chunk)
        print()

    return documents

# Split the Markdown documents into chunks based on headers
chunks_by_headers = split_into_chunks_by_headers(md_docs)
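
To get a feel for the result, an optional look at the chunk count and the hierarchical metadata of the first few chunks is helpful:

# Optional inspection: chunk count and the header hierarchy of the first chunks
print(f"Number of chunks by headers: {len(chunks_by_headers)}")
for chunk in chunks_by_headers[:3]:
    print(chunk.metadata)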

Split Chunks into Smaller Chunks Based on Tokens

At this point, we have already split the document into chunks based on headers. However, for RAG applications, it’s necessary to ensure that the size of each chunk fits within the token limit supported by the language model in use. Technically, this step can be skipped if you're using models with large context lengths, but it is often still useful to split large chunks for optimal performance.

This custom token-based splitter:

  • Splits the document into smaller chunks while preserving sentences and code blocks;
  • Avoids separating a sentence that ends with ":" from the code or text that follows it;
  • Optionally prevents splitting the text directly following a code block (since this text is often an explanation of the code).

It’s more important to preserve the full meaning of the text in each chunk than to hit the token limit exactly.

import tiktoken

tiktoken_encoder = "cl100k_base"
chunk_max_tokens = 500
scbt_text_follow_code_block = True

# Function to split a chunk into smaller parts based on token count
def split_chunk_by_tokens(content, tokenizer, max_tokens):
    # Split content into code blocks and paragraphs
    parts = re.split(r'(\n```\n.*?\n```\n)', content, flags=re.DOTALL)
    final_parts = []
    for part in parts:
        if part.startswith('\n```\n') and part.endswith('\n```\n'):
            final_parts.append(part)
        else:
            final_parts.extend(re.split(r'\n\s*\n', part))
    # Remove newlines from the start and end of each part
    parts = [part.strip() for part in final_parts if part.strip()]

    # Calculate total tokens
    total_tokens = sum(len(tokenizer.encode(part)) for part in parts)
    target_tokens_per_chunk = total_tokens // (total_tokens // max_tokens + 1)

    # Initialize variables
    chunks = []
    current_chunk = ""
    current_token_count = 0

    # Iterate over the parts and merge them if needed
    i = 0
    while i < len(parts):
        part = parts[i]

        # Merge parts if the current part ends with ":" or "```" (if enabled) and has a following part
        while (part.endswith(":") or (scbt_text_follow_code_block and part.endswith("```"))) and i + 1 < len(parts):
            part += "\n\n" + parts[i + 1]
            i += 1  # Skip the next part as it has been merged

        # Calculate the token count of the part
        part_tokens = tokenizer.encode(part)
        part_token_count = len(part_tokens)

        # Split the part into smaller parts if it exceeds the target token count
        if current_token_count + part_token_count > target_tokens_per_chunk and current_chunk:
            chunks.append(current_chunk.strip())
            current_chunk = part
            current_token_count = part_token_count
        else:
            current_chunk += "\n\n" + part if current_chunk else part
            current_token_count += part_token_count

        i += 1

    # Add the last chunk if it has content
    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks

# Function to divide the Markdown documents into chunks based on token count
def split_into_chunks_by_tokens(chunks, tokenizer, max_tokens):
    split_chunks = []

    for chunk in chunks:
        token_count = len(tokenizer.encode(chunk.page_content))
        if token_count > max_tokens:
            split_texts = split_chunk_by_tokens(chunk.page_content, tokenizer, max_tokens)
            for text in split_texts:
                split_chunks.append(Document(page_content=text, metadata=chunk.metadata))
        else:
            split_chunks.append(chunk)

    return split_chunks

# Initialize the tokenizer
tokenizer = tiktoken.get_encoding(tiktoken_encoder)

# Split the chunks into smaller parts based on token count
chunked_docs = split_into_chunks_by_tokens(chunks_by_headers, tokenizer, chunk_max_tokens)
print(f"Number of chunks by tokens: {len(chunked_docs)}")
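
Because preserving meaning takes priority over strict token limits, a few chunks may still exceed chunk_max_tokens (for example, when a single long code block cannot be split). An optional check of the token distribution makes this visible:

# Optional check: token counts after splitting; some chunks may still exceed the limit
token_counts = [len(tokenizer.encode(doc.page_content)) for doc in chunked_docs]
print(f"Largest chunk: {max(token_counts)} tokens")
print(f"Chunks above {chunk_max_tokens} tokens: {sum(1 for c in token_counts if c > chunk_max_tokens)}")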

Summarization Based on Headers and Chunk Text

The next step in the process is to generate summaries for each chunk. We use the llama3.1 model via ChatOllama to create concise summaries that incorporate both the content of the chunk and all the relevant headers in the hierarchy. This ensures that the summary retains the full context of the document, including the structure provided by the headers.

By considering the hierarchical headers, the summaries maintain the logical flow of the document. This is particularly useful for querying from VectorDB, where maintaining the overall meaning and structure of the document is essential for retrieving accurate and relevant information.

from langchain_community.chat_models import ChatOllama
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.messages import SystemMessage
from langchain_core.prompts import HumanMessagePromptTemplate
import uuid

# Initialize the ChatOllama model
llm = ChatOllama(model="llama3.1", temperature=0)

# Create the prompt
prompt = ChatPromptTemplate.from_messages(
    [
        SystemMessage(
            content=(
                "Summarize the following content in a single, concise paragraph. "
                "Include key information from all headers provided, maintaining the overall context and meaning. "
                "Output only the summary text without any introductory phrases, labels, or concluding remarks."
            )
        ),
        HumanMessagePromptTemplate.from_template("{context}"),
    ]
)

# Create the chain
chain = create_stuff_documents_chain(llm, prompt)

# Summarize the content of the documents
summarized_docs = []
for doc in chunked_docs:
    # Merge the metadata and content into a single document
    metadata = "\n".join(f"{key}: {value}" for key, value in doc.metadata.items())
    merged_docs = [Document(page_content=metadata), Document(page_content=doc.page_content)]

    # Invoke the chain to summarize the content
    result = chain.invoke({"context": merged_docs})

    # Create a new Document object with the summary content
    unique_id = str(uuid.uuid4())
    summary_doc = Document(page_content=result, metadata={"type": "summary", "id": unique_id})

    # Create a copy of the metadata and update it
    updated_metadata = doc.metadata.copy()
    updated_metadata.update({"type": "original", "summary_id": unique_id})
    doc.metadata = updated_metadata
    summarized_docs.append(summary_doc)

# Merge the summarized and original documents
all_docs = summarized_docs + chunked_docs

print(f"Number of summarized documents: {len(summarized_docs)}")
print(f"Number of original documents: {len(chunked_docs)}")
print(f"Number of all documents: {len(all_docs)}")
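
Before adding anything to the vector store, it's worth spot-checking one generated summary together with the metadata that links it back to its original chunk (the summary_id on the original matches the id on its summary):

# Optional spot check: the first summary and the metadata linking it to its original chunk
print(summarized_docs[0].page_content)
print(summarized_docs[0].metadata)
print(chunked_docs[0].metadata)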

Add Documents to VectorDB

After summarizing the documents, the next step is to store them in a vector database, which enables efficient querying and retrieval based on both summaries and original content. We use the mxbai-embed-large model from Ollama to generate embeddings for both the original and summarized documents.

The embeddings represent the semantic meaning of the documents, making it easier for the VectorDB to retrieve relevant chunks based on queries. In this example, the documents are stored in Chroma, a popular vector database optimized for efficient querying and retrieval.

The key steps include:

  • Initializing the OllamaEmbeddings to use the mxbai-embed-large model;
  • Clearing any existing documents in the Chroma VectorDB (if the code is run again);
  • Adding the new summarized and original documents into the VectorDB.
from langchain_community.embeddings import OllamaEmbeddings
from langchain_chroma import Chroma

# Initialize the Ollama embedding model
ollama_emb = OllamaEmbeddings(
    model="mxbai-embed-large",
)

# Initialize Chroma vector store and clear existing documents
vectorstore = Chroma(
    collection_name="summarization",
)
vectorstore.delete_collection()

# Add new documents to the collection
vectorstore = Chroma.from_documents(
    documents=all_docs,
    embedding=ollama_emb,
    collection_name="summarization",
)
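
As an optional smoke test before building the retriever, a direct similarity search confirms that the collection is populated and returns documents with the expected metadata (the query string here is only illustrative):

# Optional smoke test: the collection should return documents with our metadata
test_hits = vectorstore.similarity_search("range() function", k=1)
print(test_hits[0].metadata)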

Query Documents from VectorDB

The final step is to query the documents stored in the Chroma vector database and retrieve the relevant results. Before running the queries, we need to prepare the retriever and set up a function to output the results in a readable format.

# Use the vector store as a retriever
retriever = vectorstore.as_retriever()

# Function to print results
def print_results(title, results):
    print(title)
    print()
    for i, result in enumerate(results):
        print(f"---------- Result #{i + 1} ----------")
        print(f"Metadata: {result.metadata}")
        print(f"Content: {result.page_content}")
        print("\n")

With the retriever prepared and the print_results function ready, we can now perform various queries to demonstrate the flexibility of querying both summaries and original content in the vector database:

  • Querying summaries only;
  • Querying original content;
  • Querying both summaries and original content together;
  • Querying summaries but retrieving the original text based on those summaries.

By utilizing the summary and original content filters, we can flexibly query the vector database for the most relevant results based on the user's needs.

Querying Summaries

In this example, we query only the summaries, which allows us to retrieve concise, context-rich overviews of the content.

# Query only summaries
retriever.search_kwargs = {"k": 2, "filter": {"type": "summary"}}
summary_results = retriever.invoke("I want to write a Python script that prints numbers from 1 to 30.")
print_results("Summary Results.", summary_results)

Output:

Summary Results.

---------- Result #1 ----------
Metadata: {'id': 'd0f21c29-decf-405b-828f-8b714eebd459', 'type': 'summary'}
Content: The `range()` function returns an object that behaves like a list but doesn't actually make one, saving space. This iterable object can be used with functions and constructs that expect successive items until the supply is exhausted, such as the `for` statement or the `sum()` function. When printed directly, it displays its start and end values, e.g., `range(0, 10)`.


---------- Result #2 ----------
Metadata: {'id': '1ff36254-aa51-4302-b0f1-aeba92212a96', 'type': 'summary'}
Content: The built-in `range()` function generates arithmetic progressions that can be used for iteration over a sequence of numbers. It takes three parameters: start point, end point, and step (default is 1), and returns an iterator that produces the specified range of values. For example, `range(5)` generates numbers from 0 to 4, while `range(5, 10)` generates numbers from 5 to 9. The `range()` function can also be used in combination with `len()` to iterate over the indices of a sequence, or with other functions like `enumerate()` for more convenient looping techniques.

The output demonstrates two concise summaries related to the query about writing a Python script that prints numbers from 1 to 30. Both summaries provide an overview of Python's range() function, explaining its use in iterating over sequences of numbers. The summaries are context-rich, mentioning how the range() function works in a loop and how it can be utilized with additional constructs like sum() and enumerate().

This is a good match for the query, as the range() function is a common tool for iterating over a sequence of numbers, which directly relates to the task of printing numbers from 1 to 30.

Querying Original Content

Next, we query only the original content to retrieve more detailed information.

# Query only original content
retriever.search_kwargs = {"k": 2, "filter": {"type": "original"}}
original_results = retriever.invoke("I want to write a Python script that prints numbers from 1 to 30.")
print_results("Original Content Results.", original_results)

Output:

Original Content Results.

---------- Result #1 ----------
Metadata: {'Header 1': '4. More Control Flow Tools', 'Header 2': '4.3. The [`range()`](../library/stdtypes.html#range "range") Function', 'summary_id': '1ff36254-aa51-4302-b0f1-aeba92212a96', 'type': 'original'}
Content: If you do need to iterate over a sequence of numbers, the built\-in function
[`range()`](../library/stdtypes.html#range "range") comes in handy. It generates arithmetic progressions:

```
>>> for i in range(5):
...     print(i)
...
0
1
2
3
4
```

...<skipped>...

In most such cases, however, it is convenient to use the [`enumerate()`](../library/functions.html#enumerate "enumerate")
function, see [Looping Techniques](datastructures.html#tut-loopidioms).


---------- Result #2 ----------
Metadata: {'Header 1': '4. More Control Flow Tools', 'Header 2': '4.7. Defining Functions', 'summary_id': '0a569fde-d1df-4f7c-9315-1e6da9b15214', 'type': 'original'}
Content: We can create a function that writes the Fibonacci series to an arbitrary
boundary:

...<skipped>...

The first statement of the function body can optionally be a string literal;
this string literal is the function’s documentation string, or *docstring*.
(More about docstrings can be found in the section [Documentation Strings](#tut-docstrings).)
There are tools which use docstrings to automatically produce online or printed
documentation, or to let the user interactively browse through code; it’s good
practice to include docstrings in code that you write, so make a habit of it.

Note: In some examples, you will see ...<skipped>... to avoid overwhelming the blog with very large outputs. If you'd like to explore the full outputs for each step, please refer to the accompanying Jupyter Notebook.

The Original Content Results contain two results retrieved from the original content, providing more detailed and comprehensive information compared to the summaries.

First Result: The first result is relevant to the query, as it focuses on the range() function, which is directly related to printing numbers in Python. The output provides a detailed explanation of how the range() function can be used to iterate over a sequence of numbers and includes an example that demonstrates its use, which aligns well with the query.

Second Result: The second result, however, is less relevant to the query. It discusses creating a Fibonacci series function and includes information on docstrings. While useful in other contexts, this output doesn't directly relate to the task of writing a Python script to print numbers from 1 to 30, making it a less accurate match compared to the first result.

When comparing this to the Querying Summaries example, the second result in Querying Original Content is not as closely aligned with the original query. The summary query provided concise, context-rich information about range() and its use for iterating over numbers, which was a better match for the specific task.

Querying Both Summaries and Original Content

We can also query both summaries and original content without applying any filters to retrieve a mix of both types.

# Query both summaries and original content
retriever.search_kwargs = {"k": 2}  # No filter to get both types
both_results = retriever.invoke("I want to write a Python script that prints numbers from 1 to 30.")
print_results("Both Results.", both_results)

Output:

Both Results.

---------- Result #1 ----------
Metadata: {'Header 1': '4. More Control Flow Tools', 'Header 2': '4.3. The [`range()`](../library/stdtypes.html#range "range") Function', 'summary_id': '1ff36254-aa51-4302-b0f1-aeba92212a96', 'type': 'original'}
Content: If you do need to iterate over a sequence of numbers, the built\-in function
[`range()`](../library/stdtypes.html#range "range") comes in handy. It generates arithmetic progressions:

```
>>> for i in range(5):
...     print(i)
...
0
1
2
3
4
```

...<skipped>...

In most such cases, however, it is convenient to use the [`enumerate()`](../library/functions.html#enumerate "enumerate")
function, see [Looping Techniques](datastructures.html#tut-loopidioms).


---------- Result #2 ----------
Metadata: {'id': 'd0f21c29-decf-405b-828f-8b714eebd459', 'type': 'summary'}
Content: The `range()` function returns an object that behaves like a list but doesn't actually make one, saving space. This iterable object can be used with functions and constructs that expect successive items until the supply is exhausted, such as the `for` statement or the `sum()` function. When printed directly, it displays its start and end values, e.g., `range(0, 10)`.

The Both Results query retrieves a mix of both original content and summaries, offering different levels of detail.

First Result (Original Content): This result is a detailed explanation of the range() function from the original content, including examples and usage patterns. It provides a comprehensive overview of how to generate sequences of numbers using range(), which directly aligns with the task of printing numbers from 1 to 30. The examples include different ways to use range() with various parameters, making it a very relevant and informative result.

Second Result (Summary): The second result is a concise summary that focuses on the functionality of the range() function, highlighting its efficiency and how it behaves like a list when iterated over. It’s a more compact but accurate explanation relevant to the query.

Both results are highly relevant to the query, with the original content offering depth and the summary offering a concise explanation.

Querying Summaries but Retrieving Original Text

One of the most powerful and interesting aspects of this process is the ability to query using summaries but retrieve the corresponding original text. This method allows you to perform quick, concise queries with summaries, but then retrieve the full original content based on those summaries.

In this step, we query summaries and use the summary IDs to fetch the original content. This method is particularly useful when you want to retrieve detailed content but still benefit from the speed and conciseness of querying summaries.

# Query summary but get original text
retriever.search_kwargs = {"k": 2, "filter": {"type": "summary"}}
summary_for_original_results = retriever.invoke("I want to write a Python script that prints numbers from 1 to 30.")

# Extract original texts based on summary query
original_texts_based_on_summary = []
if summary_for_original_results:
    summary_ids = [result.metadata["id"] for result in summary_for_original_results]
    for summary_id in summary_ids:
        retriever.search_kwargs = {"k": 2, "filter": {"summary_id": summary_id}}
        # The query string can be empty because the summary_id filter narrows
        # the results to the matching original chunk(s)
        original_texts_based_on_summary.extend(
            retriever.invoke("")
        )

print_results("Original Texts based on Summary Query.", original_texts_based_on_summary)

Output:

Original Texts based on Summary Query.

---------- Result #1 ----------
Metadata: {'Header 1': '4. More Control Flow Tools', 'Header 2': '4.3. The [`range()`](../library/stdtypes.html#range "range") Function', 'summary_id': 'd0f21c29-decf-405b-828f-8b714eebd459', 'type': 'original'}
Content: A strange thing happens if you just print a range:

```
>>> range(10)
range(0, 10)
```

In many ways the object returned by [`range()`](../library/stdtypes.html#range "range") behaves as if it is a list,
but in fact it isn’t. It is an object which returns the successive items of
the desired sequence when you iterate over it, but it doesn’t really make
the list, thus saving space.

...<skipped>...

Later we will see more functions that return iterables and take iterables as
arguments. In chapter [Data Structures](datastructures.html#tut-structures), we will discuss in more detail about
[`list()`](../library/stdtypes.html#list "list").


---------- Result #2 ----------
Metadata: {'Header 1': '4. More Control Flow Tools', 'Header 2': '4.3. The [`range()`](../library/stdtypes.html#range "range") Function', 'summary_id': '1ff36254-aa51-4302-b0f1-aeba92212a96', 'type': 'original'}
Content: If you do need to iterate over a sequence of numbers, the built\-in function
[`range()`](../library/stdtypes.html#range "range") comes in handy. It generates arithmetic progressions:

```
>>> for i in range(5):
...     print(i)
...
0
1
2
3
4
```

...<skipped>...

In most such cases, however, it is convenient to use the [`enumerate()`](../library/functions.html#enumerate "enumerate")
function, see [Looping Techniques](datastructures.html#tut-loopidioms).

The Original Texts based on Summary Query feature demonstrates how powerful it can be to query using summaries but retrieve the corresponding original content. In this case, the summaries point to sections of the document that discuss Python's range() function, and the retrieved original content provides detailed explanations, examples, and usage.

First Result: This result explains how the range() function behaves like a list, showing how it works in conjunction with the for loop and the sum() function. The example demonstrates how range() returns an iterable object that saves memory, which directly relates to the query about printing numbers.

Second Result: This result provides more examples of how the range() function can be used to iterate over numbers, including different ways to specify a range, starting point, step size, and more. This content is a good match for the query, as it focuses on iterating over sequences of numbers in Python.

Compared to querying summaries alone, this method provides a quick way to retrieve concise, context-rich summaries and use those to fetch the detailed original content. Both retrieved results are highly relevant to the query about writing a Python script to print numbers from 1 to 30, offering detailed explanations and code examples on how to achieve this.
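
One more possibility worth illustrating: because header names are stored in the metadata of every original chunk, the same filter mechanism can target a specific section of the document. A minimal sketch, assuming the header string in the filter matches the stored metadata value exactly (the query text is only illustrative):

# Filter original chunks by a header value stored in the metadata
retriever.search_kwargs = {
    "k": 2,
    "filter": {"Header 2": "4.7. Defining Functions"},
}
header_filtered_results = retriever.invoke("How should I document a function?")
print_results("Header-Filtered Results.", header_filtered_results)

Only the original chunks carry header metadata, so this filter also implicitly restricts the search to original content.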

Conclusion

In this guide, we walked through the complete process of preparing technical documentation for use in Retrieval-Augmented Generation (RAG) applications. We began by loading and converting documents into Markdown format, ensuring the preservation of structure and code blocks. Next, we split the document into manageable chunks based on headers and token counts to retain context and ensure efficient querying.

The summarization step was particularly important, where we generated summaries based on both chunk text and hierarchical headers to retain full context. These summaries, along with the original content, were stored in a VectorDB (Chroma), which enabled us to query the documents flexibly.

The final and perhaps most powerful aspect was the ability to query summaries while retrieving the original text. Summaries are often more efficient, as they incorporate not only content but also relevant headers to provide additional context. This method facilitates fast, concise queries while ensuring access to detailed, in-depth content when needed.

Each step in this process is crucial to maintaining the logical flow, integrity, and usability of the content. Summarization, in particular, plays a key role in optimizing performance, while querying allows for flexible and efficient information retrieval.

This guide only covered a small part of the overall possibilities. There is plenty of room for improvement, such as testing with different types of documents, incorporating automated testing, and refining the overall implementation to handle more complex use cases. These enhancements can further optimize the system for even more efficient document processing and querying.

I encourage you to explore the accompanying Jupyter Notebook for a more detailed look at the full implementation. It contains key pieces of code that you can review, download, and modify to suit your specific needs in RAG applications, along with the full outputs generated during each step of the process.
