DEV Community: Oleh Halytskyi

Retrieving Original Documents via Summaries with Weaviate and LangChain

Oleh Halytskyi — Sat, 02 Nov 2024 16:45:50 +0000

In the realm of large language models (LLMs) and retrieval-augmented generation (RAG), optimizing the retrieval process is crucial for efficiency and accuracy. In a previous blog post, I discussed how to optimize RAG context by chunking and summarization for technical documents using Chroma VectorDB. Specifically, the post demonstrated how to query summaries while retrieving the original documents.

In this post, we'll explore how to achieve a similar result using Weaviate and its cross-references feature, integrated with LangChain. We'll leverage Weaviate's ability to create cross-references between data objects to efficiently retrieve original documents by querying their summaries.

This tutorial offers a comprehensive and detailed walkthrough of the process. If you prefer to explore the code and outputs directly, you can access the Jupyter notebook here: Jupyter Notebook.

Preparing the Environment

First, set up a Conda environment to manage dependencies and keep the project isolated.

# Create a new environment called 'rag-env'
conda create -n rag-env python=3.12

# Activate the environment
conda activate rag-env

# Install necessary packages
pip install weaviate-client==4.9.0 \
    langchain==0.3.5 \
    langchain-core==0.3.13 \
    langchain-ollama==0.2.0

Note: The following are also required:

Weaviate: Install it locally with Docker, deploy it to a Kubernetes cluster, or use Weaviate Cloud.
Ollama: Ensure you have the mxbai-embed-large model for embeddings and llama3.1 for the RAG example.

Initializing the Weaviate Client

Let's begin by initializing the Weaviate client with authentication.

import getpass
import weaviate
from weaviate.classes.init import Auth

# Prompt for the Weaviate API key
WEAVIATE_API_KEY = getpass.getpass()

# Initialize the Weaviate client with authentication
weaviate_client = weaviate.connect_to_local(
    auth_credentials=Auth.api_key(WEAVIATE_API_KEY)
)

# Check if the client is ready
print("Client is Ready?", weaviate_client.is_ready())

Importing Data with Cross-References

Next, we'll load the original and summarized documents, create Weaviate collections, and insert the data with cross-references.

Note: The files chunked_docs.json and summarized_docs.json are taken from my previous blog post. They were created by adding the following code to the Summarization Based on Headers and Chunk Text section:

# Save the chunked and summarized documents to JSON files
import json
chunked_docs_json = [{'page_content': doc.page_content, 'metadata': doc.metadata} for doc in chunked_docs]
with open('files/generated/chunked_docs.json', 'w') as f:
    json.dump(chunked_docs_json, f, indent=4)

summarized_docs_json = [{'page_content': doc.page_content, 'metadata': doc.metadata} for doc in summarized_docs]
with open('files/generated/summarized_docs.json', 'w') as f:
    json.dump(summarized_docs_json, f, indent=4)

In the summarized_docs.json file, metadata.id was changed to metadata.doc_id to avoid conflicts with Weaviate's id field.
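
The rename itself is not shown in the post. Below is a minimal sketch of one way to do it, assuming each summary is a dict with a `metadata` mapping, as produced by the snippet above. `rename_metadata_key` is a hypothetical helper, not part of LangChain or Weaviate, and the sample entry is made up for illustration.

```python
import copy


def rename_metadata_key(docs, old_key="id", new_key="doc_id"):
    """Return a copy of docs with metadata[old_key] moved to metadata[new_key]."""
    renamed = copy.deepcopy(docs)  # leave the caller's list untouched
    for doc in renamed:
        metadata = doc.get("metadata", {})
        if old_key in metadata:
            metadata[new_key] = metadata.pop(old_key)
    return renamed


# Illustrative data only; real entries would come from summarized_docs.json
sample = [{"page_content": "A summary.", "metadata": {"id": "abc-123"}}]
renamed = rename_metadata_key(sample)
print(renamed[0]["metadata"])
```

The result can then be written back with `json.dump`, in the same way as in the snippet above.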

Now, let's proceed to import the data:

import json
from langchain_ollama import OllamaEmbeddings

# Load the chunked and summarized documents
with open("files/chunked_docs.json", "r") as f:
    chunked_docs = json.load(f)

with open("files/summarized_docs.json", "r") as f:
    summarized_docs = json.load(f)

# Define the collection names
collections = {
    "original": "OriginalDocuments",
    "summary": "SummarizedDocuments",
}

# Delete collections if they already exist
for collection in collections.values():
    if collection in weaviate_client.collections.list_all(simple=True):
        weaviate_client.collections.delete(collection)

# Create the collections
original_collection_db = weaviate_client.collections.create(collections["original"])
summary_collection_db = weaviate_client.collections.create(collections["summary"])

# Initialize the Ollama embedding model
ollama_emb = OllamaEmbeddings(model="mxbai-embed-large")

# Insert the documents into the collections
for summarized_doc in summarized_docs:
    summarized_doc_id = summarized_doc["metadata"]["doc_id"]
    original_doc = next((doc for doc in chunked_docs if doc.get("metadata", {}).get("summary_id") == summarized_doc_id), None)

    if original_doc:
        original_uuid = original_collection_db.data.insert(
            {
                "page_content": original_doc["page_content"],
            },
            vector=ollama_emb.embed_query(original_doc["page_content"]),
        )
        summary_collection_db.data.insert(
            {
                "page_content": summarized_doc["page_content"],
            },
            references={"originalDocument": original_uuid},
            vector=ollama_emb.embed_query(summarized_doc["page_content"]),
        )

# Verify the number of documents in the collections
original_count = len(original_collection_db)
summary_count = len(summary_collection_db)
print(f"Number of documents in the original collection: {original_count}")
print(f"Number of documents in the summary collection: {summary_count}")

Output:

Number of documents in the original collection: 34
Number of documents in the summary collection: 34

Explanation:

Loading Data: The chunked and summarized documents are loaded from JSON files.
Collections: Two collections, OriginalDocuments and SummarizedDocuments, are defined and created in Weaviate.
Embeddings: The embedding model is initialized using Ollama's mxbai-embed-large model.
Data Insertion: Each summarized document is matched with its corresponding original document. The original documents are inserted into the OriginalDocuments collection, and the summarized documents are inserted into the SummarizedDocuments collection, including a cross-reference that links back to their respective original documents.
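
The insertion loop finds each original document with a linear scan over chunked_docs. That is fine for 34 documents, but for larger datasets it may be cheaper to build an index once. The sketch below assumes, as in the code above, that originals carry `metadata.summary_id` and summaries carry `metadata.doc_id`. `index_originals_by_summary_id` and `pair_documents` are hypothetical names, and the sample data is made up.

```python
def index_originals_by_summary_id(chunked_docs):
    """Map summary_id -> original document, keeping the first match as next() would."""
    index = {}
    for doc in chunked_docs:
        summary_id = doc.get("metadata", {}).get("summary_id")
        if summary_id is not None:
            index.setdefault(summary_id, doc)
    return index


def pair_documents(summarized_docs, chunked_docs):
    """Yield (summary, original) pairs, skipping summaries without an original."""
    index = index_originals_by_summary_id(chunked_docs)
    for summary in summarized_docs:
        original = index.get(summary["metadata"]["doc_id"])
        if original is not None:
            yield summary, original


# Illustrative data only
sample_originals = [
    {"page_content": "original one", "metadata": {"summary_id": "s1"}},
    {"page_content": "original two", "metadata": {"summary_id": "s2"}},
]
sample_summaries = [
    {"page_content": "summary one", "metadata": {"doc_id": "s1"}},
    {"page_content": "orphan summary", "metadata": {"doc_id": "s9"}},
]

pairs = list(pair_documents(sample_summaries, sample_originals))
for summary, original in pairs:
    print(summary["page_content"], "->", original["page_content"])
```

Each pair can then be passed to the two `data.insert` calls in the same way as in the loop above.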

Why Create Vectors Manually?

You might wonder why the vectors are created manually with ollama_emb.embed_query rather than through Weaviate's Ollama embeddings integration. In my lab setup, Weaviate runs on a Kubernetes cluster, while Ollama runs in a separate, isolated environment that Weaviate cannot reach. The code therefore generates the embeddings on the client side and supplies them to Weaviate directly at insert time.

Querying with Cross-References

The aim of this example is to demonstrate how to perform queries with cross-references by using Weaviate directly. We'll define a function that retrieves documents by querying the summaries and then obtains the original documents via cross-references. This will allow us to see the outputs for both the summaries and the original documents.

from weaviate.classes.query import QueryReference, MetadataQuery

# Define a function to retrieve documents
def retrieve_documents(query, vector, limit=2, score_threshold=0.8):
    response = summary_collection_db.query.hybrid(
        query,
        vector=vector,
        limit=limit,
        return_references=QueryReference(link_on="originalDocument"),
        return_metadata=MetadataQuery(score=True),
    )

    summary_docs = []
    original_docs = []
    for o in response.objects:
        if o.metadata.score is not None and o.metadata.score >= score_threshold:
            summary_doc = {"page_content": o.properties["page_content"]}
            summary_docs.append(summary_doc)
            for ref_obj in o.references["originalDocument"].objects:
                original_doc = {"page_content": ref_obj.properties["page_content"]}
                original_docs.append(original_doc)

    return summary_docs, original_docs

# Define a query
query = "I want to write a Python script that prints numbers from 1 to 30."
vector = ollama_emb.embed_query(query)
summary_docs, original_docs = retrieve_documents(query, vector)

# Print the summarized and original documents
print("Summarized Documents:")
for i, doc in enumerate(summary_docs, start=1):
    print(f"Summarized Document #{i}")
    print("--------------------")
    print(doc["page_content"])
    print("--------------------")
    print()

print("Original Documents:")
for i, doc in enumerate(original_docs, start=1):
    print(f"Original Document #{i}")
    print("--------------------")
    print(doc["page_content"])
    print("--------------------")
    print()

Output:

Summarized Documents:
Summarized Document #1
--------------------
The `break` statement exits the innermost enclosing for or while loop, stopping execution of the loop and continuing with the next statement. This is demonstrated by a nested for loop that prints factors of numbers from 2 to 9, where the break statement stops the loop when a factor is found. The `continue` statement skips the rest of the current iteration in a loop and moves on to the next one, as shown by a for loop that iterates over numbers from 2 to 9, printing even numbers and skipping odd ones.
--------------------

Summarized Document #2
--------------------
The built-in `range()` function generates arithmetic progressions that can be used for iteration over a sequence of numbers. It takes three parameters: start point, end point, and step (default is 1), and returns an iterator that produces the specified range of values. The end point is never part of the generated sequence. To iterate over the indices of a sequence, `range()` can be combined with `len()`, but in most cases it's more convenient to use the `enumerate()` function for this purpose.
--------------------

Original Documents:
Original Document #1
--------------------
The [`break`](../reference/simple_stmts.html#break) statement breaks out of the innermost enclosing
[`for`](../reference/compound_stmts.html#for) or [`while`](../reference/compound_stmts.html#while) loop:

```
>>> for n in range(2, 10):
...     for x in range(2, n):
...         if n % x == 0:
...             print(f"{n} equals {x} * {n//x}")
...             break
...
4 equals 2 * 2
6 equals 2 * 3
8 equals 2 * 4
9 equals 3 * 3
```

The [`continue`](../reference/simple_stmts.html#continue) statement continues with the next
iteration of the loop:

```
>>> for num in range(2, 10):
...     if num % 2 == 0:
...         print(f"Found an even number {num}")
...         continue
...     print(f"Found an odd number {num}")
...
Found an even number 2
Found an odd number 3
Found an even number 4
Found an odd number 5
Found an even number 6
Found an odd number 7
Found an even number 8
Found an odd number 9
```
--------------------

Original Document #2
--------------------
If you do need to iterate over a sequence of numbers, the built\-in function
[`range()`](../library/stdtypes.html#range "range") comes in handy. It generates arithmetic progressions:

```
>>> for i in range(5):
...     print(i)
...
0
1
2
3
4
```

The given end point is never part of the generated sequence; `range(10)` generates
10 values, the legal indices for items of a sequence of length 10\. It
is possible to let the range start at another number, or to specify a different
increment (even negative; sometimes this is called the ‘step’):

```
>>> list(range(5, 10))
[5, 6, 7, 8, 9]

>>> list(range(0, 10, 3))
[0, 3, 6, 9]

>>> list(range(-10, -100, -30))
[-10, -40, -70]
```

To iterate over the indices of a sequence, you can combine [`range()`](../library/stdtypes.html#range "range") and
[`len()`](../library/functions.html#len "len") as follows:

```
>>> a = ['Mary', 'had', 'a', 'little', 'lamb']
>>> for i in range(len(a)):
...     print(i, a[i])
...
0 Mary
1 had
2 a
3 little
4 lamb
```

In most such cases, however, it is convenient to use the [`enumerate()`](../library/functions.html#enumerate "enumerate")
function, see [Looping Techniques](datastructures.html#tut-loopidioms).
--------------------

Explanation:

Hybrid Query: The retrieve_documents function utilizes Weaviate's hybrid search on the SummarizedDocuments collection. Hybrid search combines the results of a vector similarity search and a keyword (BM25F) search by fusing the two result sets. This approach enhances retrieval accuracy by considering both semantic similarity (from embeddings) and keyword relevance.
Cross-References: It retrieves the summarized documents and accesses the original documents via the originalDocument cross-reference.
Filtering: Only results whose score is at or above score_threshold (0.8 by default) are kept.
Result: The retrieved summarized and original documents are printed to display the outputs.
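
To build some intuition for what a fused score and a threshold mean, here is a deliberately simplified sketch: normalize each result set to the range 0 to 1, combine them with a weight `alpha`, and keep only results at or above the threshold. This is an illustration, not Weaviate's implementation. Its actual fusion algorithms and default weights are described in the Weaviate documentation, and all function names and scores below are made up.

```python
def min_max_normalize(scores):
    """Scale scores into [0, 1]; if all scores are equal, each maps to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse_scores(vector_scores, keyword_scores, alpha=0.5):
    """Weighted sum of normalized scores: alpha=1 is vector-only, alpha=0 is keyword-only."""
    vec = min_max_normalize(vector_scores)
    kw = min_max_normalize(keyword_scores)
    return {
        doc_id: alpha * vec.get(doc_id, 0.0) + (1 - alpha) * kw.get(doc_id, 0.0)
        for doc_id in set(vec) | set(kw)
    }


def apply_threshold(fused, score_threshold=0.8):
    """Keep only results whose fused score is at or above the threshold."""
    return {doc_id: s for doc_id, s in fused.items() if s >= score_threshold}


# Made-up scores for three summaries; "c" was not returned by the keyword search
vector_scores = {"a": 0.9, "b": 0.5, "c": 0.1}
keyword_scores = {"a": 3.0, "b": 1.0}

fused = fuse_scores(vector_scores, keyword_scores)
kept = apply_threshold(fused)
print(sorted(kept))
```

In this toy example, only the summary that ranks highest in both searches survives the threshold. This is why a high score_threshold can return fewer results than limit.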

Creating a Custom Retriever

To integrate this retrieval mechanism with LangChain, we'll implement a custom retriever that leverages Weaviate's cross-references.

from typing import List, Any
from langchain_core.retrievers import BaseRetriever
from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document
from weaviate.classes.query import QueryReference, MetadataQuery

class VectorDBRetrieverCrossReferences(BaseRetriever):
    """A custom retriever that retrieves documents from a Weaviate vector database."""
    summary_collection_db: Any
    ollama_emb: Any
    k: int = 2
    score_threshold: float = 0.8
    return_source_documents: bool = False

    def _get_relevant_documents(self, query: str, *, run_manager: CallbackManagerForRetrieverRun) -> List[Document]:
        """Sync implementation for retriever."""
        vector = self.ollama_emb.embed_query(query)
        response = self.summary_collection_db.query.hybrid(
            query,
            vector=vector,
            limit=self.k,
            return_references=QueryReference(link_on="originalDocument"),
            return_metadata=MetadataQuery(score=True),
        )

        original_docs = []
        for o in response.objects:
            if o.metadata.score is not None and o.metadata.score >= self.score_threshold:
                for ref_obj in o.references["originalDocument"].objects:
                    doc_content = ref_obj.properties["page_content"]
                    metadata = {"source": ref_obj.properties.get("source")} if self.return_source_documents else {}
                    original_docs.append(Document(page_content=doc_content, metadata=metadata))

        return original_docs

# Initialize the custom retriever
retriever = VectorDBRetrieverCrossReferences(
    summary_collection_db=summary_collection_db,
    ollama_emb=ollama_emb
)

# Retrieve documents using the custom retriever
query = "I want to write a Python script that prints numbers from 1 to 30."
documents = retriever.invoke(query)

# Print the retrieved documents
for i, doc in enumerate(documents, start=1):
    print(f"Document #{i}")
    print("--------------------")
    print(doc.page_content)
    print("--------------------")
    print()

Output:

Document #1
--------------------
The [`break`](../reference/simple_stmts.html#break) statement breaks out of the innermost enclosing
[`for`](../reference/compound_stmts.html#for) or [`while`](../reference/compound_stmts.html#while) loop:

```
>>> for n in range(2, 10):
...     for x in range(2, n):
...         if n % x == 0:
...             print(f"{n} equals {x} * {n//x}")
...             break
...
4 equals 2 * 2
6 equals 2 * 3
8 equals 2 * 4
9 equals 3 * 3
```

The [`continue`](../reference/simple_stmts.html#continue) statement continues with the next
iteration of the loop:

```
>>> for num in range(2, 10):
...     if num % 2 == 0:
...         print(f"Found an even number {num}")
...         continue
...     print(f"Found an odd number {num}")
...
Found an even number 2
Found an odd number 3
Found an even number 4
Found an odd number 5
Found an even number 6
Found an odd number 7
Found an even number 8
Found an odd number 9
```
--------------------

Document #2
--------------------
If you do need to iterate over a sequence of numbers, the built\-in function
[`range()`](../library/stdtypes.html#range "range") comes in handy. It generates arithmetic progressions:

```
>>> for i in range(5):
...     print(i)
...
0
1
2
3
4
```

The given end point is never part of the generated sequence; `range(10)` generates
10 values, the legal indices for items of a sequence of length 10\. It
is possible to let the range start at another number, or to specify a different
increment (even negative; sometimes this is called the ‘step’):

```
>>> list(range(5, 10))
[5, 6, 7, 8, 9]

>>> list(range(0, 10, 3))
[0, 3, 6, 9]

>>> list(range(-10, -100, -30))
[-10, -40, -70]
```

To iterate over the indices of a sequence, you can combine [`range()`](../library/stdtypes.html#range "range") and
[`len()`](../library/functions.html#len "len") as follows:

```
>>> a = ['Mary', 'had', 'a', 'little', 'lamb']
>>> for i in range(len(a)):
...     print(i, a[i])
...
0 Mary
1 had
2 a
3 little
4 lamb
```

In most such cases, however, it is convenient to use the [`enumerate()`](../library/functions.html#enumerate "enumerate")
function, see [Looping Techniques](datastructures.html#tut-loopidioms).
--------------------

Explanation:

Custom Retriever: The VectorDBRetrieverCrossReferences class extends LangChain's BaseRetriever.
Method Override: The _get_relevant_documents method performs the hybrid query and retrieves the original documents via cross-references.
Integration: The retriever is initialized with the summary_collection_db and the embedding model, making it compatible with LangChain.
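
The filtering loop inside _get_relevant_documents can be checked without a running Weaviate instance by pulling it into a plain function and feeding it stand-in objects. In the sketch below, `collect_original_texts` and `make_summary` are hypothetical helpers. The stand-ins mimic only the attributes the loop reads (`metadata.score`, `references`, `properties`) and are not real Weaviate response objects.

```python
from types import SimpleNamespace


def collect_original_texts(objects, score_threshold=0.8):
    """Follow the originalDocument reference of every sufficiently scored summary."""
    texts = []
    for obj in objects:
        score = obj.metadata.score
        if score is None or score < score_threshold:
            continue
        for ref_obj in obj.references["originalDocument"].objects:
            texts.append(ref_obj.properties["page_content"])
    return texts


def make_summary(score, original_text):
    """Build a stand-in result with only the attributes the loop reads."""
    original = SimpleNamespace(properties={"page_content": original_text})
    return SimpleNamespace(
        metadata=SimpleNamespace(score=score),
        references={"originalDocument": SimpleNamespace(objects=[original])},
    )


fake_objects = [
    make_summary(0.95, "original A"),
    make_summary(0.40, "original B"),  # below the threshold
    make_summary(None, "original C"),  # no score returned
]

kept = collect_original_texts(fake_objects)
print(kept)
```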

Example of Simple Retrieval-Augmented Generation (RAG)

Finally, let's build a simple RAG pipeline using the custom retriever and a language model.

from langchain_ollama.chat_models import ChatOllama
from langchain.chains import create_retrieval_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain

# Initialize the ChatOllama model
llm = ChatOllama(model="llama3.1", temperature=0, num_ctx=16384)

# Define the system prompt template
system_prompt = (
    "You are an assistant for answering questions. "
    "Use only the exact information provided in the context, do not include external knowledge or guesses. "
    "If the answer cannot be inferred from the context, reply: 'I don't know based on the provided context.' "
    "Do not provide answers that are not based on the context, including code examples or references to other libraries. "
    "Format your entire response in valid Markdown, including code snippets and links. "
    "Always adhere to these rules strictly.\n\n"
    "Context: \n"
    "{context}"
)

# Define the chat prompt
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

# Create the question-answer chain
question_answer_chain = create_stuff_documents_chain(llm, prompt)

# Create the retrieval-augmented generation (RAG) chain
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

# Query #1
query = "I want to write a Python script that prints numbers from 1 to 30."
response = rag_chain.invoke({"input": query})

# Print the response
print("Example #1")
print("--------------------")
print(f"Query: {query}")
print(f"Answer: {response['answer']}")
print("--------------------")
print("\n")

# Query #2 (check that context only is used)
query = "I want to write a Go script that prints numbers from 1 to 30."
response = rag_chain.invoke({"input": query})

# Print the response
print("Example #2")
print("--------------------")
print(f"Query: {query}")
print(f"Answer: {response['answer']}")
print("--------------------")

Output:

Example #1
--------------------
Query: I want to write a Python script that prints numbers from 1 to 30.
Answer: You can use the `range()` function in Python to generate a sequence of numbers and print them.

Here's how you can do it:

```
for i in range(1, 31):
    print(i)
```

This will print numbers from 1 to 30. The `range()` function generates numbers starting from 0 by default, so we start at 1 and end at 30 (which is exclusive).
--------------------


Example #2
--------------------
Query: I want to write a Go script that prints numbers from 1 to 30.
Answer: I don't know based on the provided context. The given text is about Python programming and does not provide any information about writing a Go script. If you need help with a specific task, I'll be happy to assist you in another way.
--------------------

Explanation:

LLM Initialization: ChatOllama is initialized with the llama3.1 model, a temperature of 0, and a context window of 16,384 tokens (num_ctx).
Prompt Setup: A system prompt is defined that instructs the assistant to use only the provided context.
Chain Creation: The question-answer chain and the RAG chain are created.
Testing: The RAG chain is tested with two queries:
- The first query is about writing a Python script, which is covered in the context.
- The second query is about writing a Go script, which is not in the context, so the assistant appropriately replies that it doesn't know based on the provided context.
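
Conceptually, the "stuff" step joins the retrieved documents into one string and substitutes it for {context} in the system prompt. The sketch below shows that idea in plain Python. It is a simplification rather than LangChain's implementation, which also handles document prompts, callbacks, and output parsing. `stuff_context` and `build_messages` are hypothetical names, and the blank-line separator is an assumption.

```python
def stuff_context(page_contents, separator="\n\n"):
    """Join the retrieved document texts into a single context string."""
    return separator.join(page_contents)


def build_messages(system_template, page_contents, user_input):
    """Fill {context} in the system template and pair it with the user's question."""
    context = stuff_context(page_contents)
    return [
        ("system", system_template.format(context=context)),
        ("human", user_input),
    ]


# Shortened stand-ins for the prompt and the retrieved documents
system_template = "Use only the provided context.\n\nContext: \n{context}"
retrieved = ["Text about break and continue.", "Text about range()."]

messages = build_messages(system_template, retrieved, "How do I print 1 to 30?")
for role, content in messages:
    print(f"[{role}]\n{content}\n")
```

Because every retrieved original lands in the prompt verbatim, the retriever's k and score_threshold directly control how much of the model's context window is consumed.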

Conclusion

By leveraging Weaviate's cross-references, we can efficiently retrieve original documents by querying their summaries. Integrating this mechanism with LangChain allows us to build powerful RAG pipelines that provide accurate and context-specific responses.

Feel free to explore and modify the code to suit your own datasets and use cases!

Optimizing RAG Context: Chunking and Summarization for Technical Docs

Oleh Halytskyi — Wed, 18 Sep 2024 15:41:28 +0000

Introduction
Solution Overview
Preparing the Environment
Load Documents
Convert HTML Document to Markdown Format
Split Markdown into Chunks Based on Headers
Split Chunks into Smaller Chunks Based on Tokens
Summarization Based on Headers and Chunk Text
Add Documents to VectorDB
Query Documents from VectorDB
- Querying Summaries
- Querying Original Content
- Querying Both Summaries and Original Content
- Querying Summaries but Retrieving Original Text
Conclusion

Introduction

In the realm of Retrieval-Augmented Generation (RAG) applications, precise text preparation is crucial, especially when dealing with technical documentation that includes code examples. Effective text chunking is essential to maintain the logical flow and integrity of the content, which directly impacts the quality of the context provided to RAG systems and ensures the generation of accurate and useful outputs.

Inspired by Greg Kamradt's insightful tutorial 5 Levels Of Text Splitting and the valuable resources on FullStackRetrieval.com, I recognized the importance of carefully structuring text chunks to preserve meaning and context. When text is split without attention to its logical structure, it can lose critical context, misrepresent information, or disrupt the flow of understanding, resulting in incomplete or confusing code examples and explanations.

After evaluating numerous existing solutions, including all current LangChain text splitters and several other popular approaches, I found none that fully met my needs. This led me to develop a custom text splitter tailored to address these challenges. My solution ensures that code blocks remain intact and that chunks preserve the logical flow and meaning of the documentation. Maintaining the integrity of the original content is essential when working with technical material. While this custom splitter works well for this particular example, it may need further tuning to handle other sources effectively.

To support this process, I leverage LangChain for the entire pipeline, including loading HTML documents, converting them into Markdown format, and handling the summarization and querying tasks. For my specific task, I found Markdown to be the most suitable format for chunking, as it provides a clear structure and maintains the hierarchy and formatting of the content. Additionally, I utilize Ollama for LLM processing, which enables local use of powerful language models such as llama3.1 for summarization and mxbai-embed-large for embedding. This approach allows me to securely process private documentation without exposing it externally.

In addition, I've adopted an approach to summarization that I have not encountered in existing documentation or solutions. This method generates summaries that reflect not only the content within each chunk but also the hierarchical meaning of headers, ensuring that each chunk retains its full context. This strategic approach to summarization is designed to enhance the efficiency and precision of queries within VectorDB, thereby improving the overall utility of RAG applications.

By storing both the summary and the original text in VectorDB, I can query them in a variety of ways - search by summary, original text, or both, and even retrieve the original text based on summary queries. This flexibility greatly increases the efficiency of RAG applications, especially those designed to aid in code generation and understanding.

Solution Overview

In this guide, I will walk you through the following key steps:

Loading and Converting Documents: Using LangChain to load HTML documents and convert them into Markdown format. Markdown is ideal for chunking due to its simplicity and ability to preserve the hierarchy of content, including headers and code blocks.
Custom Text Splitter: A detailed explanation of how I developed a custom splitter that maintains the logical flow of text, avoids breaking code blocks, and includes hierarchical metadata from headers.
Summarization with Header Hierarchy: Demonstrating how each chunk is summarized while preserving the full meaning by including header context. This ensures that summaries retain the structure and logical flow of the original document.
VectorDB Integration: Storing both summarized and original content in a VectorDB, enabling flexible querying. You’ll see how this approach allows for efficient searches by summary, original content, or both, as well as the ability to query summaries and retrieve the corresponding original text.
Conclusion: I will summarize the key takeaways from the guide, highlighting the importance of each step.

This tutorial is comprehensive and provides an in-depth walkthrough of the process. If you'd prefer to explore the code and outputs directly, you can access the Jupyter notebook here: Jupyter Notebook.

Preparing the Environment

Before diving into the document loading process, it's important to set up the right environment. Using Conda is a great way to manage dependencies and maintain isolated environments for different projects. Below are the steps to set up a Conda environment for our RAG (Retrieval-Augmented Generation) project.

# Create a new environment called 'rag-env'
conda create -n rag-env python=3.11

# Activate the environment
conda activate rag-env

# Install necessary packages
pip install langchain==0.2.16 \
    langchain_community==0.2.16 \
    beautifulsoup4==4.12.3 \
    markdownify==0.13.1 \
    tiktoken==0.7.0 \
    langchain-chroma==0.1.3

Additionally, you’ll need to install Ollama locally if you haven’t already. This is required for running the llama3.1 and mxbai-embed-large models for summarization and embedding.

Load Documents

In this step, we use LangChain's RecursiveUrlLoader to load HTML documents. This loader is particularly useful when we want to extract content from web pages, as it allows us to follow links recursively. For this example, we're loading the Python Control Flow Tutorial from the official Python documentation with a depth limit of 1. The max_depth parameter ensures that we only pull content from the specified page without following additional links.

from langchain_community.document_loaders import RecursiveUrlLoader

# Load the document
loader = RecursiveUrlLoader(
    "https://docs.python.org/3/tutorial/controlflow.html",
    max_depth=1,
)

html_docs = loader.load()

Convert HTML Document to Markdown Format

After loading the HTML document, the next step is to convert it into Markdown format. Markdown is a great format for processing because of its simplicity and ability to preserve document structure, including headers and code blocks.

In this step, we use a custom transformer based on MarkdownifyTransformer. The custom transformer overrides the transform_documents method to process the content, ensuring that code blocks are handled properly. We use a regular expression to identify code blocks and remove unnecessary newlines before the closing backticks.

from langchain_community.document_transformers import MarkdownifyTransformer
import re

# Custom transformer that extends the MarkdownifyTransformer to strip trailing newlines from code blocks
class CustomMarkdownifyTransformer(MarkdownifyTransformer):
    def transform_documents(self, documents):
        transformed_docs = super().transform_documents(documents)
        for doc in transformed_docs:
            if hasattr(doc, 'page_content'):
                doc.page_content = self._strip_trailing_newline_from_code_blocks(doc.page_content)
        return transformed_docs

    def _strip_trailing_newline_from_code_blocks(self, text):
        # Match each fenced code block, then strip its trailing whitespace so that
        # exactly one newline precedes the closing backticks
        code_block_pattern = re.compile(r'(```\w*\n[\s\S]*?)```')
        return code_block_pattern.sub(lambda match: match.group(1).rstrip() + '\n```', text)

# Transform the document to Markdown format
md = CustomMarkdownifyTransformer()
md_docs = md.transform_documents(html_docs)
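
To see what the regular expression does in isolation, here is a small standalone sketch that applies the same pattern to a made-up string, with no LangChain involved. The fence is built as `FENCE` only so that this example can itself sit inside a fenced block. `strip_trailing_newlines` is a hypothetical name for the same logic as the method above.

```python
import re

FENCE = "`" * 3  # three backticks, built programmatically

# Same pattern as in the transformer: opening fence, optional language, lazy body
code_block_pattern = re.compile(r"(" + FENCE + r"\w*\n[\s\S]*?)" + FENCE)


def strip_trailing_newlines(text):
    """Leave exactly one newline between the code and the closing fence."""
    return code_block_pattern.sub(
        lambda match: match.group(1).rstrip() + "\n" + FENCE, text
    )


# A code block followed by extra blank lines before the closing fence
before = f"Intro\n\n{FENCE}python\nprint('hi')\n\n\n{FENCE}\n\nOutro"
after = strip_trailing_newlines(before)

print(repr(before))
print(repr(after))
```

Text outside the code block, including the blank lines around it, is left untouched.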

Split Markdown into Chunks Based on Headers

To prepare the document for querying in a RAG application, it's essential to split it into chunks based on its structure. In this step, we break the Markdown content into manageable sections by using the document headers as delimiters.

While developing this custom splitter, I experimented with multiple existing approaches, including LangChain's Text Splitters and the ExperimentalMarkdownSyntaxTextSplitter. Unfortunately, none of them met my criteria: they often distorted code blocks by changing whitespace or introducing unwanted newlines, and preserving code formatting exactly is critical for a quality LLM context.

To overcome these issues, I developed a custom splitter that:

Preserves the logical flow of the text by splitting only on headers;
Ensures that code blocks remain intact without altering their original formatting;
Filters out irrelevant headers like "Table of Contents", "This Page" and "Navigation";
Adds header names to the metadata for each chunk, including all parent headers in the hierarchy. Each chunk therefore retains the context provided by its parent headers, which makes it easier to trace back the full structure of the document and helps keep the chunk context intact during summarization.

Storing headers in the metadata also allows more flexible querying. For example, filtering in VectorDB on header names can retrieve only the chunks that belong to a particular section of the document.
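
As an illustration of header-based filtering, the sketch below selects chunks by a header value using plain dicts, before any vector database is involved. The chunk shape (`metadata` with `Header N` keys plus `content`) follows the splitter in this section. `filter_chunks_by_header` is a hypothetical helper, and the sample chunks are shortened stand-ins rather than real splitter output.

```python
def filter_chunks_by_header(chunks, header_key, header_text):
    """Return the chunks whose metadata has header_key equal to header_text."""
    return [
        chunk for chunk in chunks
        if chunk["metadata"].get(header_key) == header_text
    ]


# Illustrative chunks only
sample_chunks = [
    {
        "metadata": {"Header 1": "4. More Control Flow Tools", "Header 2": "4.1. if Statements"},
        "content": "Perhaps the most well-known statement type is the if statement.",
    },
    {
        "metadata": {"Header 1": "4. More Control Flow Tools", "Header 2": "4.2. for Statements"},
        "content": "The for statement in Python differs a bit from what you may be used to.",
    },
]

for_chunks = filter_chunks_by_header(sample_chunks, "Header 2", "4.2. for Statements")
print(len(for_chunks), for_chunks[0]["metadata"]["Header 2"])
```

A vector database that supports metadata filters can apply the same condition on the server side, so the similarity search only considers chunks from the chosen section.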

from langchain.schema import Document

sicbh_include_headers_in_content = False
sicbh_filter_headers = ["Table of Contents", "This Page", "Navigation"]
sicbh_show_unwanted_chunks_metadata = False

# Function to divide the Markdown documents into chunks based on headers
def split_into_chunks_by_headers(md_docs):
    chunks = []
    header_pattern = re.compile(r'^(#{1,6})\s+(.*)')
    code_block_pattern = re.compile(r'^\s*```')

    for doc in md_docs:
        if hasattr(doc, 'page_content'):
            lines = doc.page_content.split('\n')
        else:
            raise AttributeError("Document object has no 'page_content' attribute")

        # Reset per-document state so an unclosed code fence cannot leak into the next document
        in_code_block = False
        current_chunk = {'metadata': {}, 'content': ''}
        current_headers = {}
        prev_header_level = 0

        for line in lines:
            if code_block_pattern.match(line):
                in_code_block = not in_code_block

            if not in_code_block:
                match = header_pattern.match(line)
                if match:
                    # If there is content in the current chunk, add it to the chunks list
                    if current_chunk['content']:
                        current_chunk['content'] = current_chunk['content'].strip()
                        chunks.append(current_chunk)
                        current_chunk = {'metadata': {}, 'content': ''}

                    # Extract the header level and text
                    header_level = len(match.group(1))
                    header_text = match.group(2)

                    # Clean the header text
                    header_text = re.sub(r'\\', '', header_text)
                    header_text = re.sub(r'\[¶\]\(.*?\)', '', header_text).strip()

                    # Update the current headers: drop any header at the same or a
                    # deeper level, then record the new one, so that no stale
                    # lower-level header survives a jump back up the hierarchy
                    header_key = f'Header {header_level}'
                    for level in range(header_level, 7):
                        current_headers.pop(f'Header {level}', None)
                    current_headers[header_key] = header_text

                    # Attach the current header hierarchy to the chunk's metadata
                    current_chunk['metadata'] = current_headers.copy()

                    # Optionally add the cleaned header text to content
                    if sicbh_include_headers_in_content:
                        current_chunk['content'] += f"{match.group(1)} {header_text}\n"

                    # Update the previous header level
                    prev_header_level = header_level
                else:
                    current_chunk['content'] += line + '\n'
            else:
                current_chunk['content'] += line + '\n'

        # Add the last chunk to the chunks list
        if current_chunk['content']:
            current_chunk['content'] = current_chunk['content'].strip()
            chunks.append(current_chunk)

    # Convert the chunks to Document objects, filtering out unwanted chunks
    documents = []
    unwanted_chunks = []
    for chunk in chunks:
        metadata = chunk['metadata']
        if metadata and not any(any(unwanted in value for unwanted in sicbh_filter_headers) for value in metadata.values()):
            documents.append(Document(page_content=chunk['content'], metadata=chunk['metadata']))
        else:
            unwanted_chunks.append(chunk['metadata'])

    # Optionally print the unwanted chunks metadata
    if sicbh_show_unwanted_chunks_metadata and unwanted_chunks:
        print("Unwanted chunks metadata:")
        for chunk in unwanted_chunks:
            print(chunk)
        print()

    return documents

# Split the Markdown documents into chunks based on headers
chunks_by_headers = split_into_chunks_by_headers(md_docs)
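
To see what the hierarchical metadata looks like in practice, here is a minimal, standalone sketch of the header-tracking idea using only the standard library. The function name `track_headers` and the sample Markdown are illustrative and not part of the notebook. The sketch assumes that a new header invalidates every previously seen header at the same or a deeper level, and it ignores code blocks for brevity.

```python
import re

HEADER_RE = re.compile(r'^(#{1,6})\s+(.*)')

def track_headers(markdown_text):
    """Return the header hierarchy in effect after each header line (illustrative helper)."""
    current = {}
    snapshots = []
    for line in markdown_text.split('\n'):
        match = HEADER_RE.match(line)
        if not match:
            continue
        level = len(match.group(1))
        # Drop headers at the same or a deeper level before recording the new one
        for stale in range(level, 7):
            current.pop(f'Header {stale}', None)
        current[f'Header {level}'] = match.group(2).strip()
        snapshots.append(dict(current))
    return snapshots

sample = "# Guide\n## Install\n### Linux\n## Usage\n# FAQ"
for snapshot in track_headers(sample):
    print(snapshot)
```

Each printed dictionary is the metadata a chunk under that header would carry, so a chunk under "Linux" keeps "Guide" and "Install" as its parents, while a chunk under "FAQ" carries only its own top-level header.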

Split Chunks into Smaller Chunks Based on Tokens

At this point, we have already split the document into chunks based on headers. However, for RAG applications, it’s necessary to ensure that the size of each chunk fits within the token limit supported by the language model in use. Technically, this step can be skipped if you're using models with large context lengths, but it is often still useful to split large chunks for optimal performance.

This custom token-based splitter:

Splits the document into smaller chunks while preserving sentences and code blocks;
Avoids splitting sentences that end with ":" from the following code or text;
Optionally prevents splitting the text directly following a code block (since this text is often an explanation of the code).

It’s more important to preserve the full meaning of the text in each chunk than to hit an exact token count, so an individual chunk may occasionally end up larger than the limit.
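
The two merging rules can be illustrated in isolation before looking at the full splitter. The sketch below is a simplified, standalone version of that idea; the helper name `merge_dependent_parts`, its `keep_text_after_code` flag, and the sample parts are assumptions made for this example.

```python
FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable

def merge_dependent_parts(parts, keep_text_after_code=True):
    """Glue a part ending with ':' (or, optionally, a closing code fence) to the part after it."""
    merged = []
    i = 0
    while i < len(parts):
        part = parts[i]
        # Keep absorbing the next part while the current one "depends" on it
        while i + 1 < len(parts) and (
            part.endswith(":") or (keep_text_after_code and part.endswith(FENCE))
        ):
            part += "\n\n" + parts[i + 1]
            i += 1
        merged.append(part)
        i += 1
    return merged

parts = [
    "Run the script:",
    f"{FENCE}\npython script.py\n{FENCE}",
    "This prints the numbers to the console.",
    "An unrelated paragraph.",
]
for part in merge_dependent_parts(parts):
    print(repr(part))
```

With the flag enabled, the sentence, the code block, and the explanation after it travel together as one unit, and only the unrelated paragraph remains a separate part.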

import tiktoken

tiktoken_encoder = "cl100k_base"
chunk_max_tokens = 500
scbt_text_follow_code_block = True

# Function to split a chunk into smaller parts based on token count
def split_chunk_by_tokens(content, tokenizer, max_tokens):
    # Split content into code blocks and paragraphs
    parts = re.split(r'(\n```\n.*?\n```\n)', content, flags=re.DOTALL)
    final_parts = []
    for part in parts:
        if part.startswith('\n```\n') and part.endswith('\n```\n'):
            final_parts.append(part)
        else:
            final_parts.extend(re.split(r'\n\s*\n', part))
    # Remove newlines from the start and end of each part
    parts = [part.strip() for part in final_parts if part.strip()]

    # Calculate total tokens
    total_tokens = sum(len(tokenizer.encode(part)) for part in parts)
    target_tokens_per_chunk = total_tokens // (total_tokens // max_tokens + 1)

    # Initialize variables
    chunks = []
    current_chunk = ""
    current_token_count = 0

    # Iterate over the parts and merge them if needed
    i = 0
    while i < len(parts):
        part = parts[i]

        # Merge parts if the current part ends with ":" or "```" (if enabled) and has a following part
        while (part.endswith(":") or (scbt_text_follow_code_block and part.endswith("```"))) and i + 1 < len(parts):
            part += "\n\n" + parts[i + 1]
            i += 1  # Skip the next part as it has been merged

        # Calculate the token count of the part
        part_tokens = tokenizer.encode(part)
        part_token_count = len(part_tokens)

        # Start a new chunk if adding this part would exceed the target token count
        if current_token_count + part_token_count > target_tokens_per_chunk and current_chunk:
            chunks.append(current_chunk.strip())
            current_chunk = part
            current_token_count = part_token_count
        else:
            current_chunk += "\n\n" + part if current_chunk else part
            current_token_count += part_token_count

        i += 1

    # Add the last chunk if it has content
    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks

# Function to divide the Markdown documents into chunks based on token count
def split_into_chunks_by_tokens(chunks, tokenizer, max_tokens):
    split_chunks = []

    for chunk in chunks:
        token_count = len(tokenizer.encode(chunk.page_content))
        if token_count > max_tokens:
            split_texts = split_chunk_by_tokens(chunk.page_content, tokenizer, max_tokens)
            for text in split_texts:
                split_chunks.append(Document(page_content=text, metadata=chunk.metadata))
        else:
            split_chunks.append(chunk)

    return split_chunks

# Initialize the tokenizer
tokenizer = tiktoken.get_encoding(tiktoken_encoder)

# Split the chunks into smaller parts based on token count
chunked_docs = split_into_chunks_by_tokens(chunks_by_headers, tokenizer, chunk_max_tokens)
print(f"Number of chunks by tokens: {len(chunked_docs)}")
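
The `target_tokens_per_chunk` formula is worth a second look: rather than filling each chunk up to `max_tokens` and leaving a small remainder, it aims for pieces of roughly equal size. A quick worked example of the same arithmetic (the helper name `target_tokens` is only for illustration):

```python
def target_tokens(total_tokens, max_tokens):
    """Same arithmetic as target_tokens_per_chunk in split_chunk_by_tokens."""
    expected_chunks = total_tokens // max_tokens + 1
    return total_tokens // expected_chunks

for total in (600, 1200, 1499):
    print(total, "->", target_tokens(total, 500))
```

A 600-token chunk is therefore aimed at two pieces of about 300 tokens rather than one of 500 and one of 100. Because parts are never cut mid-paragraph or mid-code-block, an individual piece can still exceed the target.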

Summarization Based on Headers and Chunk Text

The next step in the process is to generate summaries for each chunk. We use the llama3.1 model via ChatOllama to create concise summaries that incorporate both the content of the chunk and all the relevant headers in the hierarchy. This ensures that the summary retains the full context of the document, including the structure provided by the headers.

By considering the hierarchical headers, the summaries maintain the logical flow of the document. This is particularly useful for querying from VectorDB, where maintaining the overall meaning and structure of the document is essential for retrieving accurate and relevant information.

from langchain_community.chat_models import ChatOllama
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate
from langchain_core.messages import SystemMessage
import uuid

# Initialize the ChatOllama model
llm = ChatOllama(model="llama3.1", temperature=0)

# Create the prompt
prompt = ChatPromptTemplate.from_messages(
    [
        SystemMessage(
            content=(
                "Summarize the following content in a single, concise paragraph. "
                "Include key information from all headers provided, maintaining the overall context and meaning. "
                "Output only the summary text without any introductory phrases, labels, or concluding remarks."
            )
        ),
        HumanMessagePromptTemplate.from_template("{context}"),
    ]
)

# Create the chain
chain = create_stuff_documents_chain(llm, prompt)

# Summarize the content of the documents
summarized_docs = []
for doc in chunked_docs:
    # Merge the metadata and content into a single document
    metadata = "\n".join(f"{key}: {value}" for key, value in doc.metadata.items())
    merged_docs = [Document(page_content=metadata), Document(page_content=doc.page_content)]

    # Invoke the chain to summarize the content
    result = chain.invoke({"context": merged_docs})

    # Create a new Document object with the summary content
    unique_id = str(uuid.uuid4())
    summary_doc = Document(page_content=result, metadata={"type": "summary", "id": unique_id})

    # Create a copy of the metadata and update it
    updated_metadata = doc.metadata.copy()
    updated_metadata.update({"type": "original", "summary_id": unique_id})
    doc.metadata = updated_metadata
    summarized_docs.append(summary_doc)

# Merge the summarized and original documents
all_docs = summarized_docs + chunked_docs

print(f"Number of summarized documents: {len(summarized_docs)}")
print(f"Number of original documents: {len(chunked_docs)}")
print(f"Number of all documents: {len(all_docs)}")
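
The linking scheme is independent of the LLM, so it can be sketched without one. The example below uses plain dictionaries in place of `Document` objects and a stand-in `fake_summarize` function instead of the `llama3.1` chain. Both are illustrative assumptions, not part of the notebook.

```python
import uuid

def fake_summarize(headers_text, content):
    """Stand-in for the LLM chain: returns the first sentence prefixed by the headers."""
    first_sentence = content.split(".")[0].strip()
    return f"[{headers_text}] {first_sentence}."

def link_summaries(chunks):
    """Create one summary per chunk and tie the pair together with a shared UUID."""
    summaries = []
    for chunk in chunks:
        headers_text = " > ".join(chunk["metadata"].values())
        summary_id = str(uuid.uuid4())
        summaries.append({
            "content": fake_summarize(headers_text, chunk["content"]),
            "metadata": {"type": "summary", "id": summary_id},
        })
        # Build a new dict so chunks that share one metadata dict stay independent
        chunk["metadata"] = {**chunk["metadata"], "type": "original", "summary_id": summary_id}
    return summaries

shared = {"Header 1": "Control Flow", "Header 2": "range()"}
chunks = [
    {"content": "range() generates arithmetic progressions. It is lazy.", "metadata": shared},
    {"content": "Printing a range shows its bounds. No list is built.", "metadata": shared},
]
summaries = link_summaries(chunks)
for summary, chunk in zip(summaries, chunks):
    print(summary["metadata"]["id"] == chunk["metadata"]["summary_id"], summary["content"])
```

Copying the metadata matters because chunks produced by the token-based splitter share the metadata dictionary of their parent chunk. Updating it in place would give every sibling the same `summary_id`.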

Add Documents to VectorDB

After summarizing the documents, the next step is to store them in a vector database, which enables efficient querying and retrieval based on both summaries and original content. We use the mxbai-embed-large model from Ollama to generate embeddings for both the original and summarized documents.

The embeddings represent the semantic meaning of the documents, making it easier for the VectorDB to retrieve relevant chunks based on queries. In this example, the documents are stored in Chroma, a popular vector database optimized for efficient querying and retrieval.

The key steps include:

Initializing the OllamaEmbeddings to use the mxbai-embed-large model;
Clearing any existing documents in the Chroma VectorDB (if the code is run again);
Adding the new summarized and original documents into the VectorDB.

from langchain_community.embeddings import OllamaEmbeddings
from langchain_chroma import Chroma

# Initialize the Ollama embedding model
ollama_emb = OllamaEmbeddings(
    model="mxbai-embed-large",
)

# Initialize Chroma vector store and clear existing documents
vectorstore = Chroma(
    collection_name="summarization",
)
vectorstore.delete_collection()

# Add new documents to the collection
vectorstore = Chroma.from_documents(
    documents=all_docs,
    embedding=ollama_emb,
    collection_name="summarization",
)

Query Documents from VectorDB

The final step is to query the documents stored in the Chroma vector database and retrieve the relevant results. Before running the queries, we need to prepare the retriever and set up a function to output the results in a readable format.

# Use the vector store as a retriever
retriever = vectorstore.as_retriever()

# Function to print results
def print_results(title, results):
    print(title)
    print()
    for i, result in enumerate(results):
        print(f"---------- Result #{i + 1} ----------")
        print(f"Metadata: {result.metadata}")
        print(f"Content: {result.page_content}")
        print("\n")

With the retriever prepared and the print_results function ready, we can now perform various queries to demonstrate the flexibility of querying both summaries and original content in the vector database:

Querying summaries only;
Querying original content;
Querying both summaries and original content together;
Querying summaries but retrieving the original text based on those summaries.

By utilizing the summary and original content filters, we can flexibly query the vector database for the most relevant results based on the user's needs.
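
Conceptually, a filtered query restricts the candidate set by metadata and then ranks what is left by vector similarity. The toy in-memory store below mimics that behaviour with hand-written 2-D vectors. The records, the vectors, and the `search` helper are invented for illustration and say nothing about how Chroma implements filtering internally.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def search(records, query_vector, k=2, where=None):
    """Keep records whose metadata matches `where`, then return the k most similar."""
    where = where or {}
    candidates = [
        r for r in records
        if all(r["metadata"].get(key) == value for key, value in where.items())
    ]
    candidates.sort(key=lambda r: cosine(r["vector"], query_vector), reverse=True)
    return candidates[:k]

records = [
    {"text": "summary of range()", "vector": [0.9, 0.1], "metadata": {"type": "summary"}},
    {"text": "original range() section", "vector": [0.8, 0.2], "metadata": {"type": "original"}},
    {"text": "summary of docstrings", "vector": [0.1, 0.9], "metadata": {"type": "summary"}},
]
query = [1.0, 0.0]
print([r["text"] for r in search(records, query, where={"type": "summary"})])
print([r["text"] for r in search(records, query, k=2)])
```

With the `type` filter, only summaries compete for the top two places. Without it, a summary and an original chunk can appear side by side, which mirrors the unfiltered query in the examples that follow.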

Querying Summaries

In this example, we query only the summaries, which allows us to retrieve concise, context-rich overviews of the content.

# Query only summaries
retriever.search_kwargs = {"k": 2, "filter": {"type": "summary"}}
summary_results = retriever.invoke("I want to write a Python script that prints numbers from 1 to 30.")
print_results("Summary Results.", summary_results)

Output:

Summary Results.

---------- Result #1 ----------
Metadata: {'id': 'd0f21c29-decf-405b-828f-8b714eebd459', 'type': 'summary'}
Content: The `range()` function returns an object that behaves like a list but doesn't actually make one, saving space. This iterable object can be used with functions and constructs that expect successive items until the supply is exhausted, such as the `for` statement or the `sum()` function. When printed directly, it displays its start and end values, e.g., `range(0, 10)`.


---------- Result #2 ----------
Metadata: {'id': '1ff36254-aa51-4302-b0f1-aeba92212a96', 'type': 'summary'}
Content: The built-in `range()` function generates arithmetic progressions that can be used for iteration over a sequence of numbers. It takes three parameters: start point, end point, and step (default is 1), and returns an iterator that produces the specified range of values. For example, `range(5)` generates numbers from 0 to 4, while `range(5, 10)` generates numbers from 5 to 9. The `range()` function can also be used in combination with `len()` to iterate over the indices of a sequence, or with other functions like `enumerate()` for more convenient looping techniques.

The output demonstrates two concise summaries related to the query about writing a Python script that prints numbers from 1 to 30. Both summaries provide an overview of Python's range() function, explaining its use in iterating over sequences of numbers. The summaries are context-rich, mentioning how the range() function works in a loop and how it can be utilized with additional constructs like sum() and enumerate().

This is a good match for the query, as the range() function is a common tool for iterating over a sequence of numbers, which directly relates to the task of printing numbers from 1 to 30.

Querying Original Content

Next, we query only the original content to retrieve more detailed information.

# Query only original content
retriever.search_kwargs = {"k": 2, "filter": {"type": "original"}}
original_results = retriever.invoke("I want to write a Python script that prints numbers from 1 to 30.")
print_results("Original Content Results.", original_results)

Output:

Original Content Results.

---------- Result #1 ----------
Metadata: {'Header 1': '4. More Control Flow Tools', 'Header 2': '4.3. The [`range()`](../library/stdtypes.html#range "range") Function', 'summary_id': '1ff36254-aa51-4302-b0f1-aeba92212a96', 'type': 'original'}
Content: If you do need to iterate over a sequence of numbers, the built\-in function
[`range()`](../library/stdtypes.html#range "range") comes in handy. It generates arithmetic progressions:

```
>>> for i in range(5):
...     print(i)
...
0
1
2
3
4
```

...<skipped>...

In most such cases, however, it is convenient to use the [`enumerate()`](../library/functions.html#enumerate "enumerate")
function, see [Looping Techniques](datastructures.html#tut-loopidioms).


---------- Result #2 ----------
Metadata: {'Header 1': '4. More Control Flow Tools', 'Header 2': '4.7. Defining Functions', 'summary_id': '0a569fde-d1df-4f7c-9315-1e6da9b15214', 'type': 'original'}
Content: We can create a function that writes the Fibonacci series to an arbitrary
boundary:

...<skipped>...

The first statement of the function body can optionally be a string literal;
this string literal is the function’s documentation string, or *docstring*.
(More about docstrings can be found in the section [Documentation Strings](#tut-docstrings).)
There are tools which use docstrings to automatically produce online or printed
documentation, or to let the user interactively browse through code; it’s good
practice to include docstrings in code that you write, so make a habit of it.

Note: In some examples, you will see ...<skipped>... to avoid overwhelming the blog with very large outputs. If you'd like to explore the full outputs for each step, please refer to the accompanying Jupyter Notebook.

The Original Content Results contain two results retrieved from the original content, providing more detailed and comprehensive information compared to the summaries.

First Result: The first result is relevant to the query, as it focuses on the range() function, which is directly related to printing numbers in Python. The output provides a detailed explanation of how the range() function can be used to iterate over a sequence of numbers and includes an example that demonstrates its use, which aligns well with the query.

Second Result: The second result, however, is less relevant to the query. It discusses creating a Fibonacci series function and includes information on docstrings. While useful in other contexts, this output doesn't directly relate to the task of writing a Python script to print numbers from 1 to 30, making it a less accurate match compared to the first result.

When comparing this to the Querying Summaries example, the second result in Querying Original Content is not as closely aligned with the original query. The summary query provided concise, context-rich information about range() and its use for iterating over numbers, which was a better match for the specific task.

Querying Both Summaries and Original Content

We can also query both summaries and original content without applying any filters to retrieve a mix of both types.

# Query both summaries and original content
retriever.search_kwargs = {"k": 2}  # No filter to get both types
both_results = retriever.invoke("I want to write a Python script that prints numbers from 1 to 30.")
print_results("Both Results.", both_results)

Output:

Both Results.

---------- Result #1 ----------
Metadata: {'Header 1': '4. More Control Flow Tools', 'Header 2': '4.3. The [`range()`](../library/stdtypes.html#range "range") Function', 'summary_id': '1ff36254-aa51-4302-b0f1-aeba92212a96', 'type': 'original'}
Content: If you do need to iterate over a sequence of numbers, the built\-in function
[`range()`](../library/stdtypes.html#range "range") comes in handy. It generates arithmetic progressions:

```
>>> for i in range(5):
...     print(i)
...
0
1
2
3
4
```

...<skipped>...

In most such cases, however, it is convenient to use the [`enumerate()`](../library/functions.html#enumerate "enumerate")
function, see [Looping Techniques](datastructures.html#tut-loopidioms).


---------- Result #2 ----------
Metadata: {'id': 'd0f21c29-decf-405b-828f-8b714eebd459', 'type': 'summary'}
Content: The `range()` function returns an object that behaves like a list but doesn't actually make one, saving space. This iterable object can be used with functions and constructs that expect successive items until the supply is exhausted, such as the `for` statement or the `sum()` function. When printed directly, it displays its start and end values, e.g., `range(0, 10)`.

The Both Results query retrieves a mix of both original content and summaries, offering different levels of detail.

First Result (Original Content): This result is a detailed explanation of the range() function from the original content, including examples and usage patterns. It provides a comprehensive overview of how to generate sequences of numbers using range(), which directly aligns with the task of printing numbers from 1 to 30. The examples include different ways to use range() with various parameters, making it a very relevant and informative result.

Second Result (Summary): The second result is a concise summary that focuses on the functionality of the range() function, highlighting its efficiency and how it behaves like a list when iterated over. It’s a more compact but accurate explanation relevant to the query.

Both results are highly relevant to the query, with the original content offering depth and the summary offering a concise explanation.

Querying Summaries but Retrieving Original Text

One of the most powerful and interesting aspects of this process is the ability to search over summaries but return the corresponding original text. Summaries are short and carry the header context, so they make compact search targets, while the original chunks hold the detail you want to pass on to the LLM.

In this step, we query the summaries and then use each summary's ID to fetch the original chunk whose summary_id matches it.

# Query summary but get original text
retriever.search_kwargs = {"k": 2, "filter": {"type": "summary"}}
summary_for_original_results = retriever.invoke("I want to write a Python script that prints numbers from 1 to 30.")

# Extract original texts based on summary query
original_texts_based_on_summary = []
if summary_for_original_results:
    summary_ids = [result.metadata["id"] for result in summary_for_original_results]
    for summary_id in summary_ids:
        # Only the metadata filter matters here, so the query text is left empty
        retriever.search_kwargs = {"k": 2, "filter": {"summary_id": summary_id}}
        original_texts_based_on_summary.extend(
            retriever.invoke("")
        )

print_results("Original Texts based on Summary Query.", original_texts_based_on_summary)

Output:

Original Texts based on Summary Query.

---------- Result #1 ----------
Metadata: {'Header 1': '4. More Control Flow Tools', 'Header 2': '4.3. The [`range()`](../library/stdtypes.html#range "range") Function', 'summary_id': 'd0f21c29-decf-405b-828f-8b714eebd459', 'type': 'original'}
Content: A strange thing happens if you just print a range:

```
>>> range(10)
range(0, 10)
```

In many ways the object returned by [`range()`](../library/stdtypes.html#range "range") behaves as if it is a list,
but in fact it isn’t. It is an object which returns the successive items of
the desired sequence when you iterate over it, but it doesn’t really make
the list, thus saving space.

...<skipped>...

Later we will see more functions that return iterables and take iterables as
arguments. In chapter [Data Structures](datastructures.html#tut-structures), we will discuss in more detail about
[`list()`](../library/stdtypes.html#list "list").


---------- Result #2 ----------
Metadata: {'Header 1': '4. More Control Flow Tools', 'Header 2': '4.3. The [`range()`](../library/stdtypes.html#range "range") Function', 'summary_id': '1ff36254-aa51-4302-b0f1-aeba92212a96', 'type': 'original'}
Content: If you do need to iterate over a sequence of numbers, the built\-in function
[`range()`](../library/stdtypes.html#range "range") comes in handy. It generates arithmetic progressions:

```
>>> for i in range(5):
...     print(i)
...
0
1
2
3
4
```

...<skipped>...

In most such cases, however, it is convenient to use the [`enumerate()`](../library/functions.html#enumerate "enumerate")
function, see [Looping Techniques](datastructures.html#tut-loopidioms).

The Original Texts based on Summary Query feature demonstrates how powerful it can be to query using summaries but retrieve the corresponding original content. In this case, the summaries point to sections of the document that discuss Python's range() function, and the retrieved original content provides detailed explanations, examples, and usage.

First Result: This result explains how the range() function behaves like a list, showing how it works in conjunction with the for loop and the sum() function. The example demonstrates how range() returns an iterable object that saves memory, which directly relates to the query about printing numbers.

Second Result: This result provides more examples of how the range() function can be used to iterate over numbers, including different ways to specify a range, starting point, step size, and more. This content is a good match for the query, as it focuses on iterating over sequences of numbers in Python.

Compared to querying summaries alone, this method uses the concise, context-rich summaries for matching and then returns the detailed original content they point to. Both retrieved results are highly relevant to the query about writing a Python script to print numbers from 1 to 30, offering detailed explanations and code examples on how to achieve this.
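
Because each summary maps to exactly one original chunk, the second step is really a key lookup rather than a similarity search. As a minimal sketch of that idea, the example below resolves summary hits through a plain dictionary keyed by `summary_id`. The sample data and the `resolve_originals` helper are hypothetical; in the code above, the same lookup is done through the retriever's metadata filter.

```python
def resolve_originals(summary_hits, originals):
    """Map summary search hits back to their original chunks, preserving hit order."""
    by_summary_id = {doc["metadata"]["summary_id"]: doc for doc in originals}
    resolved = []
    for hit in summary_hits:
        original = by_summary_id.get(hit["metadata"]["id"])
        if original is not None:  # skip summaries whose original is missing
            resolved.append(original)
    return resolved

originals = [
    {"content": "Full text about range()", "metadata": {"type": "original", "summary_id": "s-1"}},
    {"content": "Full text about docstrings", "metadata": {"type": "original", "summary_id": "s-2"}},
]
summary_hits = [
    {"content": "range() in brief", "metadata": {"type": "summary", "id": "s-1"}},
]
for doc in resolve_originals(summary_hits, originals):
    print(doc["content"])
```

The order of the resolved originals follows the ranking of the summary hits, so the relevance ordering from the summary search carries over to the returned chunks.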

Conclusion

In this guide, we walked through the complete process of preparing technical documentation for use in Retrieval-Augmented Generation (RAG) applications. We began by loading and converting documents into Markdown format, ensuring the preservation of structure and code blocks. Next, we split the document into manageable chunks based on headers and token counts to retain context and ensure efficient querying.

The summarization step was particularly important, where we generated summaries based on both chunk text and hierarchical headers to retain full context. These summaries, along with the original content, were stored in a VectorDB (Chroma), which enabled us to query the documents flexibly.

The final and perhaps most powerful aspect was the ability to query summaries while retrieving the original text. Summaries often make better search targets, as they incorporate not only content but also relevant headers to provide additional context. This method keeps queries fast and concise while ensuring access to detailed, in-depth content when needed.

Each step in this process is crucial to maintaining the logical flow, integrity, and usability of the content. Summarization, in particular, plays a key role in optimizing performance, while querying allows for flexible and efficient information retrieval.

This guide only covered a small part of the overall possibilities. There is plenty of room for improvement, such as testing with different types of documents, incorporating automated testing, and refining the overall implementation to handle more complex use cases. These enhancements can further optimize the system for even more efficient document processing and querying.

I encourage you to explore the accompanying Jupyter Notebook for a more detailed look at the full implementation. It contains key pieces of code that you can review, download, and modify to suit your specific needs in RAG applications, along with the full outputs generated during each step of the process.