Hady Walied
From Documents to Dialogue: A step-by-step RAG Journey

Welcome to this complete guide on building an advanced Retrieval-Augmented Generation (RAG) system from scratch. In this tutorial series, we'll go from raw PDF documents to a sophisticated chatbot that can answer questions about them, citing its sources.

We'll be using Python, LangChain, and a local LLM (powered by LM Studio) to build our project. Let's get started!


Part 1: The Foundation - Processing Your Documents

Before we can ask questions about our documents, we need to prepare them. Large documents are too big to fit into the context window of most LLMs. The solution is to break them down into smaller, manageable chunks.

The Concept: We'll load PDF files, split them into overlapping text chunks, and save them to a JSON file. This overlap is crucial to ensure that we don't lose context between chunks.

The Code: This is phase1_process_docs.py:

# phase1_process_docs.py
import os
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
import json

def process_documents(source_dir="source_documents", output_file="chunks.json"):
    all_chunks = []
    for filename in os.listdir(source_dir):
        if filename.endswith(".pdf"):
            loader = PyPDFLoader(os.path.join(source_dir, filename))
            documents = loader.load()

            text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
            chunks = text_splitter.split_documents(documents)

            for chunk in chunks:
                all_chunks.append({
                    "content": chunk.page_content,
                    "metadata": chunk.metadata
                })

    with open(output_file, 'w') as f:
        json.dump(all_chunks, f, indent=2)
    print(f"Successfully processed {len(all_chunks)} chunks.")

if __name__ == "__main__":
    process_documents()

Outcome: Run this script, and you'll have a chunks.json file. This is the knowledge base for our RAG system.
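
To sanity-check the output, you can peek at the first chunk. PyPDFLoader typically stores the file path under source and the page number under page in each chunk's metadata (the snippet below is just an illustration of inspecting the file):

# inspect_chunks.py (illustrative)
import json

with open("chunks.json") as f:
    chunks = json.load(f)

print(chunks[0]["content"][:200])   # first 200 characters of the first chunk
print(chunks[0]["metadata"])        # e.g. {'source': 'source_documents/paper.pdf', 'page': 0}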


Part 2: The Classic Approach - Keyword Search with BM25

The simplest way to find relevant information is keyword search. BM25 is a classic ranking function that scores documents by how well their terms match the query, balancing term frequency against document length and how rare each term is across the corpus.

The Concept: We'll use the rank_bm25 library to create a search index from our document chunks. This will allow us to find chunks that contain specific keywords from our query.

The Code: This is phase2_keyword_search.py:

# phase2_keyword_search.py
import json
from rank_bm25 import BM25Okapi

# Load the processed chunks
with open("chunks.json", 'r') as f:
    chunks_data = json.load(f)

# Get the content of each chunk
corpus = [chunk['content'] for chunk in chunks_data]
tokenized_corpus = [doc.split(" ") for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)

def search_keyword(query, top_n=3):
    tokenized_query = query.split(" ")
    doc_scores = bm25.get_scores(tokenized_query)

    top_indexes = sorted(range(len(doc_scores)), key=lambda i: doc_scores[i], reverse=True)[:top_n]

    results = [chunks_data[i] for i in top_indexes]
    return results
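
To try it out, call search_keyword with any query (the query string below is only an example):

if __name__ == "__main__":
    for result in search_keyword("transformer attention mechanism"):
        print(result["metadata"].get("source", "unknown"), "->", result["content"][:120])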

Outcome: You can now perform basic keyword searches. However, you'll notice that this approach fails to capture the meaning behind the words.


Part 3: A Leap in Understanding - Semantic Search with Vector Embeddings

To search by meaning, we need to enter the world of vector embeddings. An embedding is a vector of numbers that represents a piece of text. By comparing these vectors (for example, with cosine similarity), we can find text that is semantically similar.

The Concept: We'll use a local embedding model served by LM Studio to generate embeddings for our chunks. Then, we'll store these embeddings in a vector database (ChromaDB) for efficient similarity search.

The Code: This is phase3_semantic_search.py:

# phase3_semantic_search.py
import json
import os
from typing import List

import requests
from langchain.docstore.document import Document
from langchain_community.vectorstores import Chroma
from langchain_core.embeddings import Embeddings

DB_PATH = "chroma_db"
api_base = "http://26.186.178.211:1234/v1"  # LM Studio's OpenAI-compatible server (usually http://localhost:1234/v1 on your own machine)

class LMStudioEmbeddings(Embeddings):
    # ... (embedding generation logic) ...

def build_or_load_db():
    if os.path.exists(DB_PATH):
        return Chroma(persist_directory=DB_PATH, embedding_function=LMStudioEmbeddings(api_base))

    with open("chunks.json", 'r') as f:
        chunks_data = json.load(f)
    documents = [Document(page_content=chunk['content'], metadata=chunk['metadata']) for chunk in chunks_data]

    db = Chroma.from_documents(
        documents,
        LMStudioEmbeddings(api_base),
        persist_directory=DB_PATH
    )
    return db
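
The embedding class itself is elided above. Here is a minimal sketch of what it can look like, assuming LM Studio exposes its OpenAI-compatible /v1/embeddings endpoint and you have an embedding model loaded (the model name below is a placeholder, not the one used in the repo):

class LMStudioEmbeddings(Embeddings):
    def __init__(self, api_base: str, model: str = "text-embedding-nomic-embed-text-v1.5"):
        self.api_base = api_base
        self.model = model

    def _embed(self, texts: List[str]) -> List[List[float]]:
        # Call LM Studio's OpenAI-compatible embeddings endpoint
        response = requests.post(
            f"{self.api_base}/embeddings",
            json={"model": self.model, "input": texts},
            timeout=120,
        )
        response.raise_for_status()
        return [item["embedding"] for item in response.json()["data"]]

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return self._embed(texts)

    def embed_query(self, text: str) -> List[float]:
        return self._embed([text])[0]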

Outcome: You now have a powerful semantic search engine. You can query for concepts and ideas, and get much more relevant results than with keyword search alone.
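
For instance, once the database is built you can query it by meaning (the query here is only illustrative):

db = build_or_load_db()
for doc in db.similarity_search("how does self-attention work?", k=3):
    print(doc.metadata.get("source", "unknown"), "->", doc.page_content[:120])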


Part 4: The Generative Leap - Building Your First RAG Chatbot

Now it's time to bring in the "G" in RAG: Generation. We'll use an LLM to generate human-like answers based on the documents our retriever finds.

The Concept: We'll create a simple chain using LangChain. The chain will first retrieve relevant documents (using our semantic search from Part 3), then "stuff" them into a prompt for the LLM, and finally, get the answer.

The Code: This is phase4_rag_chat.py:

# phase4_rag_chat.py
# ... (imports and setup) ...

template = """
Answer the question based ONLY on the following context.
If you don't know the answer, just say that you don't know. Do not make up an answer.
Cite the sources used in your answer.

Context:
{context}

Question:
{question}
"""
prompt = ChatPromptTemplate.from_template(template)

def format_docs(docs):
    return "\n\n".join([f"Source: {d.metadata['source']}\n{d.page_content}" for d in docs])

# RAG Chain
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
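
The imports and setup are elided above. One way to wire them up, assuming LM Studio also serves a chat model through its OpenAI-compatible endpoint and that the langchain-openai package is installed (the model name and the import from phase3_semantic_search are assumptions on my side):

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

from phase3_semantic_search import api_base, build_or_load_db

# Chat model served locally by LM Studio via its OpenAI-compatible API
llm = ChatOpenAI(base_url=api_base, api_key="lm-studio", model="local-model", temperature=0)

# Reuse the Chroma database from Part 3 as a retriever
retriever = build_or_load_db().as_retriever(search_kwargs={"k": 4})

With the chain defined as above, calling rag_chain.invoke("What is self-attention?") returns the answer as a plain string.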

Outcome: You have a working chatbot! It can answer questions about your documents and point to the sources it used.


Part 5: The Best of Both Worlds - Advanced Retrieval

Keyword search is good at finding specific terms, while semantic search is good at finding related concepts. Why not use both? This is called hybrid search. We'll also add a re-ranking step to further improve the relevance of our retrieved documents.

The Concept: We'll first retrieve documents using both BM25 and semantic search. Then, we'll use a special type of model called a cross-encoder to re-rank the combined results before passing them to the LLM.

The Code: This is from phase5_advanced_rag.py:

# phase5_advanced_rag.py
# ... (imports and setup) ...

cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def advanced_retriever(query, top_n_hybrid=10, top_n_rerank=3):
    # ... (hybrid search logic) ...

    # Re-ranking
    pairs = [[query, doc['content']] for doc in combined_docs]
    scores = cross_encoder.predict(pairs)

    scored_docs = list(zip(scores, combined_docs))
    scored_docs.sort(key=lambda x: x[0], reverse=True)

    reranked_docs = [doc for score, doc in scored_docs[:top_n_rerank]]
    return reranked_docs
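
The hybrid search step is elided above. One way it can look, assuming the BM25 index and chunks_data from Part 2 and the Chroma database (db) from Part 3 are available in scope, with duplicates dropped by content before re-ranking:

def hybrid_search(query, top_n=10):
    # Keyword side: score all chunks with BM25 and keep the best ones
    bm25_scores = bm25.get_scores(query.split(" "))
    bm25_top = sorted(range(len(bm25_scores)), key=lambda i: bm25_scores[i], reverse=True)[:top_n]
    keyword_docs = [chunks_data[i] for i in bm25_top]

    # Semantic side: nearest neighbours from the Chroma vector store
    semantic_docs = [
        {"content": d.page_content, "metadata": d.metadata}
        for d in db.similarity_search(query, k=top_n)
    ]

    # Merge the two result lists, dropping duplicates by content
    combined, seen = [], set()
    for doc in keyword_docs + semantic_docs:
        if doc["content"] not in seen:
            seen.add(doc["content"])
            combined.append(doc)
    return combined

Inside advanced_retriever, combined_docs would then simply be the output of this function, e.g. combined_docs = hybrid_search(query, top_n_hybrid).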

Outcome: Your RAG system is now noticeably more accurate. The hybrid search reduces the chance of missing relevant documents, and the re-ranker makes sure the LLM gets the best possible context.


The Final Product - A Complete RAG Application

We've built all the components. Now, let's put them together into a single, robust application.

The Concept: We'll refactor our code into classes, making it more organized and reusable. We'll have a DocumentProcessor, a Retriever, a ReRanker, and a main RAGPipeline that orchestrates everything.

The Code: This is the structure of our final total_rag_app.py:

# total_rag_app.py

# --- 1. Configuration ---
# ...

# --- 2. Embedding Model ---
class LMStudioEmbeddings(Embeddings):
    # ...

# --- 3. Document Processing ---
class DocumentProcessor:
    # ...

# --- 4. Retrieval System ---
class Retriever:
    # ...

# --- 5. Re-ranking System ---
class ReRanker:
    # ...

# --- 6. RAG Pipeline ---
class RAGPipeline:
    # ...

# --- 7. Main Execution ---
if __name__ == "__main__":
    # ... (Initialize and run the pipeline) ...
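
The main block is elided above. A minimal sketch of the interactive loop, assuming RAGPipeline exposes an ask(question) method that returns the answer text (the method name is an assumption, not necessarily what the repo uses):

if __name__ == "__main__":
    pipeline = RAGPipeline()
    print("RAG Application Ready. Ask a question about your documents.")
    while True:
        question = input("> ").strip()
        if not question or question.lower() in {"exit", "quit"}:
            break
        print("--- Answer ---")
        print(pipeline.ask(question))
        print("--- End of Answer ---")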

Outcome: You have a complete, command-line RAG application. It's well-structured, easy to run, and represents the culmination of all our work.

Running the Final Application

Full implementation: hadywalied/Total_RAG

  1. Install dependencies:

    pip install -r requirements.txt
    
  2. Run the app:

    python total_rag_app.py
    
  3. Example:

    RAG Application Ready. Ask a question about your documents.
    > what's attention ?
    --- Answer ---
    Answer:
    Attention is a function that maps a query and a set of key-value pairs to an output,
    where the query, keys, values, and output are all vectors. The output is computed as a weighted sum
    of the values, where the weight assigned to each value is determined by the compatibility between the
    query and the corresponding key.
    
    Source(s):
    source_documents\1706.03762v7.pdf
    --- End of Answer ---
    > 
    

Conclusion

Congratulations! You've built an advanced RAG system from the ground up. You've learned how to process documents, use both keyword and semantic search, re-rank results for relevance, and generate answers with a local LLM.

From here, you can explore many improvements, such as a web interface, more advanced retrieval strategies, or support for different document types. The possibilities are endless. Happy coding!
