saara suominen

RAG: experiments with prompting using 3 LLMs

To dig deeper into RAG (Retrieval-Augmented Generation), I tested a few prompting techniques for extracting answers from a PDF.

Code

The code was adapted from a tutorial by Nasser Maronie. This struck me as a good way to implement a RAG system: it is fully customizable, and everything runs locally on your own computer, which ensures data privacy.

Data

As the source I chose an old document I wrote years ago as documentation for a university hackathon. The text is written slightly informally and is not strictly informational; it reads more like a diary and contains subjective thoughts and sarcasm. I thought it would be a good source precisely because it can be a bit challenging for the LLM.
The PDF is about 2 pages long and was chunked and embedded into the vector database. The questions I chose to ask the RAG system were all based on the snippet below.

Snippet from the PDF containing the part the questions are about

Questions

I came up with the four questions below. Q1 and Q2 were intended as the more difficult ones, as the text does not state very clearly what is an advantage and what is a disadvantage. Q3 was constructed as an alternative to Q1-Q2, to see whether it was easier to answer since it only asks about differences, not advantages or disadvantages. Q4 was intended as a fairly easy factual question.

Q1: What were the two benefits of Minerva Square compared to Athena building?
A: you can see the sky (and remember how the daylight looks like), coffee is more near

Q2: What was the one disadvantage of Minerva Square compared to Athena building?
A: lot of stairs

Q3: What were the three differences of Minerva Square compared to Athena building?
A: you can see the sky (and remember how the daylight looks like), coffee is more near, lot of stairs

Q4: Before Monday 28th, what was the location where the group worked?
A: Athena building

Parameters and models

  • Library: LangChain
  • Chunking method: RecursiveCharacterTextSplitter
    • chunk_size 7500 characters
    • chunk_overlap 100 characters
  • Text embedding model: nomic-embed-text
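
For reference, the ingestion step could look roughly like the sketch below. This is a minimal illustration using the parameters above; it assumes Chroma as the vector store and uses a placeholder file path, so it is not the exact code from the tutorial.

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

# Load the roughly two-page PDF (the path is a placeholder)
documents = PyPDFLoader("hackathon_documentation.pdf").load()

# Split the text into chunks with the parameters listed above
splitter = RecursiveCharacterTextSplitter(chunk_size=7500, chunk_overlap=100)
chunks = splitter.split_documents(documents)

# Embed the chunks with nomic-embed-text and store them in the vector database
db = Chroma.from_documents(
    documents=chunks,
    embedding=OllamaEmbeddings(model="nomic-embed-text"),
    persist_directory="./chroma_db",
)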

Three LLMs were compared: llama2-uncensored, mistral and qwen3. Llama2-uncensored is clearly older, and likely less capable, than the other two, whereas mistral and qwen3 both represent newer and stronger models but come from different backgrounds.

Test 1: a plain prompt

Here I used a plain and simple prompt, which instructs the LLM to answer the question based on the context, and if the context does not contain the answer, to say that it doesn't have that information.

Query loop and prompt:
import re
from langchain_community.chat_models import ChatOllama
from langchain.prompts import ChatPromptTemplate
from langchain_core.tracers import ConsoleCallbackHandler

# Strip qwen3's <think>...</think> reasoning blocks from the response text
def remove_think_tags(response):
    return re.sub(r'<think>.*?</think>', '', response, flags=re.DOTALL)

# Initialize the language model (LLM_MODEL holds the model name, e.g. "qwen3");
# get_vector_db() and the user question `input` are defined elsewhere in the script
llm = ChatOllama(model=LLM_MODEL)
# Get the vector database instance
db = get_vector_db()

# Search the DB for the chunks most similar to the question
results = db.similarity_search_with_score(input, k=5)
context = "\n\n---\n\n".join([doc.page_content for doc, _score in results])

PROMPT = """
Answer the question based only on the following context:
{context}
---
Answer the question based on the above context, If the above context does
not have any information about the question asked, Please tell that you do not have
any information regarding the question asked: {question}
"""

prompt_template = ChatPromptTemplate.from_template(PROMPT)
prompt = prompt_template.format(context=context, question=input)

response_text = llm.invoke(prompt, config={'callbacks': [ConsoleCallbackHandler()]})
response = remove_think_tags(response_text.content)

Test 1: results per model

model               Q1  Q2  Q3  Q4
qwen3               c   c   c   c
mistral             pc  pc  c   c
llama2-uncensored   i   i   i   i

i = incorrect
pc = partially correct
c = correct

Analysis of results:

qwen3 was able to answer all questions correctly. mistral answered Q3 and Q4 correctly. For Q1 and Q2, it mentioned the correct answer in its reasoning, but then concluded that it did not have enough information to answer.

llama2-uncensored's answers to questions 1-3 were hallucinations not based on the document. Its answer to Q4 was a real place mentioned in the text, but the wrong one.

Each model's answer for Q1 (test 1):

Q1: What were the two benefits of Minerva Square compared to Athena building?

llama2-uncensored: "The main benefit of moving to Minerva was increased storage space and a more open atmosphere. However, this came at the cost of being less convenient due to the increased distance from the center of Helsinki. In comparison, the Athena Building has limited available office spaces but is located in an easily accessible area close to the city's main attractions. Therefore, it ultimately depends on what type of organization one belongs to and their specific needs when deciding which building to choose for their office space."

mistral: "_ The text does mention a change of location from Athena building to Minerva Square. However, it only mentions two potential advantages: looking up to see the sky (remembering daylight) and the shorter distance to coffee. No specific benefits unique to Minerva Square compared to Athena Building are explicitly stated in the context provided.\n\nSo, while there are some differences mentioned, the text does not provide enough information about these locations to definitively say what the two benefits of Minerva Square over Athena building were._"

qwen3: "The two benefits of Minerva Square compared to Athena building mentioned in the context are: 1.Ability to see the sky and remember how daylight looks (as opposed to being indoors). 2. Shorter distance to coffee (compared to Athena building)."

Test 2: formulate 5 alternative questions

This was the strategy used in the tutorial as well. The LLM is instructed to create 5 variants of the user's question. Documents are then retrieved from the vector database for each of the 5 variants, and the unique union of the retrieved chunks is passed as context for answering the original question. This approach can mitigate some of the limitations of distance-based retrieval, as it aims to make sure that the chunk containing the correct answer is among the retrieved ones (even if some less relevant chunks are retrieved as well).

Query loop and prompt:
from langchain_community.chat_models import ChatOllama
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# LLM_MODEL, get_vector_db() and the user question `input` are defined elsewhere in the script

# Function to get the prompt templates for generating alternative questions and answering based on context
def get_prompt():
    QUERY_PROMPT = PromptTemplate(
        input_variables=["question"],
        template="""You are an AI language model assistant. Your task is to generate five
        different versions of the given user question to retrieve relevant documents from
        a vector database. By generating multiple perspectives on the user question, your
        goal is to help the user overcome some of the limitations of the distance-based
        similarity search. Provide these alternative questions separated by newlines.
        Original question: {question}""",
    )

    template = """Answer the question based ONLY on the following context:
    {context}
    Question: {question}
    """

    prompt = ChatPromptTemplate.from_template(template)

    return QUERY_PROMPT, prompt

# Initialize the language model with the specified model name
llm = ChatOllama(model=LLM_MODEL)
# Get the vector database instance
db = get_vector_db()
# Get the prompt templates
QUERY_PROMPT, prompt = get_prompt()

# Set up the retriever to generate multiple queries using the language model and the query prompt
retriever = MultiQueryRetriever.from_llm(
    db.as_retriever(),
    llm,
    prompt=QUERY_PROMPT
)

# Define the processing chain to retrieve context, generate the answer, and parse the output
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

response = chain.invoke(input)

Test 2: results per model

model               Q1  Q2  Q3  Q4
qwen3               c   c   c   c
mistral             c   c   pc  c
llama2-uncensored   i   i   i   i

i = incorrect
pc = partially correct
c = correct

Analysis of results:

Looking at the five alternative questions generated by each model, they maintained the meaning of the original question for the most part. There were some small differences in how the questions were formulated; for example, mistral's questions contained some guidance on how to answer the question (formulations such as 'focus on most important points', 'pay close attention to', 'with emphasis to'…).
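
For anyone reproducing this: the generated variants are not part of the chain's output, but LangChain logs them. A small sketch, following the logging approach from LangChain's MultiQueryRetriever documentation:

import logging

# Log the alternative questions that MultiQueryRetriever generates,
# so they can be inspected alongside the final answers.
logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)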

Answering the questions, llama2-uncensored again performed poorly, and for Q1-Q3 its answers were not related to the document at all. Q4 was answered with a place from the text, but the wrong one, and with additional hallucinated details.

mistral answered mostly correctly and without extra information. For Q1, it mentioned the stairs as well, even though that was a disadvantage. For Q3, it failed to mention the stairs as a separate point, but came up with a third point that was not stated in the text and could not be directly inferred from it.

qwen3's answers contained its thinking, and it answered each question exactly correctly. It returned no information that was not in the document.

Each model's answer for Q1 (test 2):

Q1: What were the two benefits of Minerva Square compared to Athena building?

llama2-uncensored: "The two main benefits of Minerva Square compared to Athena Building were:\n1) Space - In comparison, Athena was a rather cramped office space with limited seating and limited working areas. The large floor plate in Minerva Square provided plenty of room for everyone to work comfortably without feeling overcrowded or uncomfortable.\n2) Amenities - Athena Building did not have access to any kitchen facilities, whereas Minerva Square had a communal kitchen on each floor which was great for preparing food and keeping it fresh during the working day. Additionally, Minerva Square also offered a breakout room where teams could take breaks or hold meetings without having to leave the building itself."

mistral: "_ Two benefits of Minerva Square compared to Athena building, as mentioned in the document, are: 1. In Minerva Square, you can look up and see the sky, which helps to remember how daylight looks like. 2. The distance to coffee is shorter in Minerva Square. However, there is an infinite number of stairs to climb compared to Athena building._"

qwen3: "The two benefits of Minerva Square compared to Athena building, as mentioned in the context, are: 1. Ability to see the sky and remember daylight**: The text states that in Minerva Square, \"you can look up and see the sky, therefore remembering how daylight looks like.\" 2. Shorter distance to coffee: The group noted that \"the distance to coffee got shorter\" in Minerva Square. These advantages are highlighted as positive changes during their transition from Athena building to Minerva Square."

Conclusion

Here, there were no big differences between the prompting strategies. The questions proved to be fairly easy for the two newer models, so in this case the strategy used in test 2, generating additional questions, was not strictly necessary. Both tests produced similar results.

qwen3 answered all questions correctly in both tests. mistral was uncertain about a few questions in test 1, even though it was able to retrieve the correct answers. Maybe it gave too much weight to the prompt, which instructed it not to answer if the context did not include the answer. In test 2, mistral was a bit unspecific for a few questions and did not return exactly the correct answers. It looks like further prompt engineering, even small tweaks to the prompt, could improve mistral's performance, as sketched below.
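
To illustrate the kind of small tweak meant here, the refusal instruction in the test 1 prompt could be softened so that the model is told to answer whenever the context contains even a partial answer. This is a hypothetical variant that was not tested in these experiments:

# Hypothetical prompt variant (not tested here): the refusal instruction
# only triggers when the context contains no relevant information at all.
PROMPT = """
Answer the question using only the following context:
{context}
---
If the context contains the answer, even partially, state it directly.
Only if the context contains no relevant information at all, say that you
do not have information about the question.
Question: {question}
"""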

llama2-uncensored could not answer the questions correctly. It answered only one of the questions based on the context, and incorrectly; otherwise it relied on hallucination. This is likely due to the small context window of the LLM: only 2k tokens, where mistral's is 32k and qwen3's is 40k. LLMs are moving towards larger context windows, and one likely reason is to make them better suited for RAG.
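
If the default context window is indeed the bottleneck, it can be enlarged when initializing the model; LangChain's ChatOllama exposes a num_ctx option that maps to Ollama's context-length setting. The snippet below is only a sketch of that idea: the value is illustrative, and whether llama2-uncensored actually benefits from it was not tested here.

from langchain_community.chat_models import ChatOllama

# Sketch: request a larger context window than Ollama's default when loading the model.
# num_ctx maps to Ollama's context-length option; 4096 is an illustrative value.
llm = ChatOllama(model="llama2-uncensored", num_ctx=4096)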
