We'll build a simple RAG system using local-only models. We won't use LangChain: it pulls in many heavy dependencies, is noticeably slower than direct Transformers usage, is far from bug-free, and its documentation is often misleading. Instead, we'll use bare Transformers functions. As the vector database for our document embeddings we'll use FAISS, which is very efficient at similarity search. Note that it lives in RAM, not on disk, and is very fast.
What is RAG?
Retrieval-Augmented Generation (RAG) is an AI framework that improves Large Language Model (LLM) accuracy by retrieving data from external, trusted sources (documents, databases) rather than relying solely on training data. It enables up-to-date, specialized answers, reduces hallucinations, and avoids costly model retraining.
In simple words: it lets an LLM use specialized knowledge without retraining it. We'll build a simple version of it here, which loads a single PDF file and then lets you chat about it. We'll use only local models, with no cloud services. This means a few things:
- it's completely free
- it's completely private (no data exposing to internet)
- it's weaker than cloud models.
Models of choice
That's up to you, and depends mostly on your hardware (GPU and its VRAM). I've used google/gemma-2-9b-it as the LLM and BAAI/bge-large-en-v1.5 for creating embeddings. This setup works without issues on a GPU with 12 GB of VRAM, and it handles different languages: you can, for instance, have a source in Polish and ask questions in English (or vice versa). Keep in mind that to use these models from Hugging Face, you need an account there and you have to accept each model's usage policy.
Some important parameters
When creating a RAG system, there are a few parameters that can have a big impact on how it works. These include:
- The LLM's temperature: it controls the randomness of the LLM's output. The lower the temperature, the more deterministic the answers.
- Whether the LLM can sample: with sampling off, the LLM uses greedy decoding, so it simply picks the token with the highest probability at each step. With sampling on, it draws one of the possible tokens; the choice is weighted by probability, but the most likely token won't necessarily be chosen.
- How many similar chunks to retrieve: when searching for similar chunks in the vector database, how many should be passed on for the answer? In this example, values of 3-5 work well.
All in all, how you set these depends mostly on the type of documents you want to work with: for theory, reports, technical material, instructions, guides and the like, use the settings below. For looser texts, you may want to increase the temperature (say, up to 0.4) and turn sampling on. With even more informal texts you may also want to increase the number of retrieved chunks (even above 10), but be careful: it greatly increases memory usage.
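To make the temperature parameter concrete, here's a small self-contained sketch (not part of the RAG code) of how temperature reshapes token probabilities before sampling: logits are divided by the temperature before the softmax, so a low temperature sharpens the distribution toward the top token.

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature before softmax; lower T sharpens the distribution."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()  # subtract max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = [2.0, 1.0, 0.1]
for t in (1.0, 0.1):
    print(t, np.round(softmax_with_temperature(logits, t), 3))
```

At temperature 1.0 the top token gets roughly two thirds of the probability mass; at 0.1 it gets nearly all of it, which is why low temperatures behave almost like greedy decoding.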
Let's gather it up:
EMBEDDING_MODEL = "BAAI/bge-large-en-v1.5"
LLM_MODEL = "google/gemma-2-9b-it"
LLM_TEMPERATURE = 0.1
LLM_DO_SAMPLE = False
SIMILAR_CHUNKS_COUNT = 3
Building a vector database
Let's start by creating our vector database, which holds our specialized knowledge extracted from the given PDF file. We'll read the file and split the text into smaller chunks (remembering the page number of each chunk, so we can give exact citations in our answers). The chunks will then be converted into embeddings, using SentenceTransformer and our embedding model.
import pypdf

reader = pypdf.PdfReader(pdf_path)
full_text = ""
pages_meta = []  # A list to track which page each character comes from.
for i, page in enumerate(reader.pages):
    page_text = page.extract_text()
    if page_text:
        full_text += page_text
        # For each character on the page, store its page number and source file. This is a bit memory-intensive
        # but allows for accurate source tracking later.
        pages_meta.extend([{'page': i + 1, 'source': pdf_path}] * len(page_text))
Having the text extracted, we'll split the full text into smaller chunks:
chunks = simple_text_splitter(full_text, chunk_size=800, chunk_overlap=150)
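`simple_text_splitter` isn't defined above, so here's a minimal sketch of what such a character-based splitter with overlap might look like (the author's actual implementation may differ, but this matches the `chunk_size`/`chunk_overlap` bookkeeping used in the metadata step below):

```python
def simple_text_splitter(text, chunk_size=800, chunk_overlap=150):
    """Split text into fixed-size character chunks, with overlap between neighbours."""
    chunks = []
    step = chunk_size - chunk_overlap  # advance by this many characters per chunk
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks
```

The overlap means the tail of each chunk is repeated at the head of the next one, so a sentence cut at a chunk boundary still appears whole in at least one chunk.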
We'll create metadata for each chunk by finding the page number corresponding to the middle of the chunk:
chunk_metadatas = []
char_count = 0
for chunk in chunks:
    mid_point = char_count + len(chunk) // 2
    if mid_point < len(pages_meta):
        chunk_metadatas.append(pages_meta[mid_point])
    else:  # A fallback for the very last chunk.
        chunk_metadatas.append(pages_meta[-1])
    char_count += 800 - 150  # Move character counter forward by (chunk_size - chunk_overlap).
Now we can create embeddings for each text chunk:
embedding_model = SentenceTransformer(EMBEDDING_MODEL)
embeddings = embedding_model.encode(chunks, convert_to_tensor=True, show_progress_bar=True)
embeddings = embeddings.cpu().numpy().astype('float32') # FAISS requires float32 numpy arrays.
Normalize the embeddings to unit length. This is necessary for using the Inner Product (IP) as a measure of cosine similarity:
faiss.normalize_L2(embeddings)
Let's populate the FAISS vector store:
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings) # Add the chunk embeddings to the index.
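To demystify what these two FAISS calls do, here's a plain-NumPy sketch of the same operations, L2 normalization and brute-force inner-product top-k search (this is only an illustration; the real code should keep using FAISS, which does this far more efficiently):

```python
import numpy as np

def normalize_l2(x):
    """Scale each row to unit length, like faiss.normalize_L2 (which works in place)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.where(norms == 0, 1.0, norms)

def search_ip(index_vectors, query_vector, k):
    """Brute-force inner-product search, conceptually what IndexFlatIP.search does."""
    scores = index_vectors @ query_vector  # inner products; cosine similarity after normalization
    top = np.argsort(-scores)[:k]          # indices of the k highest scores
    return scores[top], top

vectors = normalize_l2(np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
query = normalize_l2(np.array([[1.0, 0.2]]))[0]
scores, ids = search_ip(vectors, query, k=2)
```

After normalization, the inner product of two vectors equals their cosine similarity, which is why the normalization step before `IndexFlatIP` matters.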
and pack all of this into a handy structure for later use:
return {
"index": index,
"chunks": chunks,
"metadatas": chunk_metadatas,
"embedding_model": embedding_model
}
Feeding this structure into the function below (as the db parameter), we get a very simple and straightforward search mechanism:
def search_vector_db(db, query, k):
    query_embedding = db["embedding_model"].encode([query], convert_to_tensor=True)
    query_embedding = query_embedding.cpu().numpy().astype('float32')
    faiss.normalize_L2(query_embedding)
    distances, indices = db["index"].search(query_embedding, k)
    retrieved_chunks = [db["chunks"][i] for i in indices[0]]
    retrieved_metadatas = [db["metadatas"][i] for i in indices[0]]
    return retrieved_chunks, retrieved_metadatas
query is the user's question; k is the number of chunks to retrieve.
Before we can search, we need a query first, so let's set up a simple chat with the LLM.
Chat with LLM
Loading the LLM is pretty straightforward, as long as we apply the parameters discussed at the beginning (temperature and so on).
We'll load quite a big model, but quantize it to 4-bit precision, which significantly reduces its memory footprint:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
tokenizer = AutoTokenizer.from_pretrained(LLM_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    LLM_MODEL,
    quantization_config=bnb_config,
    device_map="auto",
    low_cpu_mem_usage=True
)
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    max_length=None,
    temperature=LLM_TEMPERATURE,
    repetition_penalty=1.1,
    do_sample=LLM_DO_SAMPLE,
    return_full_text=False
)
return pipe
Preparing the interactive loop to chat
Pretty much everything below lives inside this loop, so we can keep the chat going:
while True:
    query = input("\nAsk a question (type 'exit' or 'quit' to quit): ")
    if query.lower() in ['exit', 'quit']:
        break
Searching the vector database
The first thing to do after getting a query from the user is to find the relevant chunks of text, along with their page numbers, which we'll use for source citations.
context_chunks, context_metadatas = search_vector_db(db, query, k=SIMILAR_CHUNKS_COUNT)
context = "\n\n".join(context_chunks)  # Separate chunks with a blank line so they don't run together.
Having the relevant parts of the text, we have to prepare the LLM's part of the job.
Preparing the prompt template
This part differs slightly between LLMs: models expect different prompt templates for the user -> assistant turns:
template = f"""<start_of_turn>user
You're a helpful assistant. Answer the question based only on the context below.
Answer using the same language the question was asked in.\n
Context:\n
{context}\n
Question: {query}<end_of_turn>
<start_of_turn>model
"""
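To make the template easier to swap for another model's format, the assembly can be isolated in a small helper (`build_gemma_prompt` is a hypothetical name, not from the original code; it produces the same Gemma-style turn markers as the f-string above):

```python
def build_gemma_prompt(context_chunks, query):
    """Assemble a Gemma-style single-turn prompt from retrieved chunks and the question."""
    context = "\n\n".join(context_chunks)  # blank line between chunks keeps them readable
    return (
        "<start_of_turn>user\n"
        "You're a helpful assistant. Answer the question based only on the context below.\n"
        "Answer using the same language the question was asked in.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )
```

For other models you'd only replace this one function; Transformers tokenizers also offer chat-template support that can generate the right format for you.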
If we don't format this properly, the model can start hallucinating the dialogue and talking to itself. Note that we pass not the full text, but only the chunks retrieved for the user's question.
Finally we can pass it to LLM and get the answer.
result = llm_pipeline(template)
answer = result[0]['generated_text'].strip()
print(answer)
Sources citation
We can also quote the pages containing the relevant material in the PDF:
seen_pages = set()  # Use a set to avoid printing duplicate page numbers.
for meta in context_metadatas:
    page_num = meta.get('page', 'N/A')
    if page_num not in seen_pages:
        print(f" - Page: {page_num} (Source File: {meta.get('source')})")
        seen_pages.add(page_num)
Final thoughts
As you can see, it's much simpler than one might think: it's not the LLM searching for the answer.
It's FAISS scanning all our text chunks and finding the best-matching ones. The LLM receives only those chunks, and all it does is recap that small portion of text and phrase it nicely. Simple as that.
Your task now is to put it into some UI, for example a simple Streamlit app, which is perfectly suited to this kind of job.
Just add a button to load the PDF, a text input and an output area, and voilà :)