
Eyitayo Itunu Babatope

How to Implement LLM Grounding using the Retrieval Augmented Generation (RAG) Technique

Introduction

Nowadays, when you prompt ChatGPT for information that is not part of its training data, it searches the web, retrieves the information, uses it in context, and returns an appropriate response. Grounding is when Large Language Models (LLMs) use domain-specific information and data to generate accurate and relevant output. This article examines LLM grounding using the RAG technique.

Importance of LLM grounding

Grounding an LLM offers several benefits:

  • Reduces LLM hallucinations.
  • Improves LLM precision and accuracy.
  • Makes the LLM more adept at addressing complex issues quickly.
  • Helps the LLM clarify intricate issues and reduce confusion.

LLM grounding techniques

There are two broad categories of LLM grounding techniques:

  • Retrieval-Augmented Generation (RAG)
  • Fine-tuning.

This article focuses on the RAG technique.

RAG technique

This section shows how to implement the RAG technique for LLM grounding. RAG starts by retrieving information relevant to a query. The retrieved content is then merged with the prompt into the LLM's context window to generate a relevant output.

Implementing the RAG technique starts with creating a vector representation of the external data through a process called embedding: converting data into vectors, long arrays of numbers that capture the semantic meaning of a given sequence of data. Below is an embedding produced by Ollama's "nomic-embed-text" model for the text "Grounding improves LLM output".


[0.079886355,0.026696106,-0.2009385,-0.081955306,0.033845328,0.0026561292,0.030165939,0.013311965,-0.055660687,-0.048723537,0.014457284,0.027321639,0.029530374,0.02201595,0.013614955,0.057640318,0.03209041,-0.051196527,-0.017932769,-0.06712348
...
-0.02952636,-0.053060006,0.051573362,-0.0038687028,0.054432027,-0.0071464307,0.07941523,-0.0049896343,-0.013346551,-0.006801469,-0.02958884,-0.039702114,0.005442398,0.027762491,0.029064095,-0.024355555,0.01312534,0.046164576,-0.045630153,0.014882911,-0.031765144,0.049317453,-0.0023815103,-0.059432093,-0.03721353,-0.014398544,-0.021900289]

The vector can be stored in vector databases such as ChromaDB for later use.
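For example, here is a minimal sketch of storing a single embedding in ChromaDB (the collection name "docs", the document ID, and the embedding_vector variable are illustrative, not fixed choices):

import chromadb

# In-memory ChromaDB instance; use chromadb.PersistentClient(path=...) to store on disk
client = chromadb.Client()
collection = client.get_or_create_collection(name="docs")

# Store the original text alongside its embedding so both can be retrieved later
collection.add(
    ids=["doc-1"],
    documents=["Grounding improves LLM output"],
    embeddings=[embedding_vector],  # the vector produced by your embedding model
)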

There are models trained specifically to create embeddings. OpenAI has the following embedding models:

  • text-embedding-3-small
  • text-embedding-3-large
  • text-embedding-ada-002

Similarly, Ollama has the following embedding models:

  • nomic-embed-text
  • mxbai-embed-large
  • all-minilm

It must be noted that embeddings generated by Ollama models cannot be used with OpenAI models, and vice versa. This is because each platform produces vector embeddings with different dimensions: OpenAI's embedding models output vectors of 1536, 3072 … dimensions, while Ollama's output vectors of 768, 1024 … dimensions.

To create an OpenAI vector embedding, you can use any of its embedding models, as shown below:

from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

# Create an embedding for a short piece of text
embedding = client.embeddings.create(
    model="text-embedding-3-small",
    input="Grounding improves LLM output.",
    encoding_format="float",
)
print(embedding)

For Ollama, it can be done as follows, assuming you have an Ollama installation on your system:

import ollama

# Create an embedding with a local Ollama model (note the quoted model name)
response = ollama.embed(model="nomic-embed-text", input="Grounding improves LLM output")
print(response)
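As noted earlier, the two platforms produce vectors of different dimensions. You can verify this by comparing the lengths of the vectors returned by the two snippets above (a quick check; response field names may vary slightly across client versions):

openai_vector = embedding.data[0].embedding  # 1536 floats for text-embedding-3-small
ollama_vector = response["embeddings"][0]    # 768 floats for nomic-embed-text

# The dimensions differ, so the vectors cannot be compared or stored interchangeably
print(len(openai_vector), len(ollama_vector))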

The embedding models differ in performance, cost, and vector output size.

The next step is retrieval. You create an embedding for the user's prompt and use it to pull the most relevant content from the vector database; both the retrieved content and the prompt are then passed to the LLM to generate a response.
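Here is a minimal retrieval sketch with ChromaDB, assuming the "docs" collection from earlier and a query_embedding created from the user's prompt with the same embedding model:

# Retrieve the stored content most similar to the user's prompt
results = collection.query(
    query_embeddings=[query_embedding],  # embedding of the user's prompt
    n_results=3,                         # top 3 most similar documents
)
data = " ".join(results["documents"][0])  # retrieved content for the LLM context

With the data retrieved, you pass it together with the prompt to the LLM to generate a grounded response, as shown below with OpenAI: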

from openai import OpenAI

client = OpenAI()

# data: content retrieved from the vector database; prompt: the user's question
response = client.responses.create(
    model="gpt-5-nano",
    input=f"You are an assistant. Use this data: {data} to respond to this input: {prompt}",
)

print(response.output_text)

The same can be done using Ollama:

import ollama

# data: content retrieved from the vector database; prompt: the user's question
output = ollama.generate(
    model="llama2",
    prompt=f"Use this data: {data} to respond to this prompt: {prompt}",
)

print(output["response"])  # the generated text

In the snippets above, data refers to the content retrieved from the vector database using the embedding of the user's prompt.
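Putting the pieces together, a minimal end-to-end sketch might look like the following (the model names, the "docs" collection, and the answer helper are illustrative assumptions, not fixed choices):

import chromadb
import ollama

def answer(prompt: str) -> str:
    # 1. Embed the user's prompt
    query_embedding = ollama.embed(model="nomic-embed-text", input=prompt)["embeddings"][0]

    # 2. Retrieve the most relevant stored content
    # (assumes documents were already added to the "docs" collection, as shown earlier)
    collection = chromadb.Client().get_or_create_collection(name="docs")
    results = collection.query(query_embeddings=[query_embedding], n_results=3)
    data = " ".join(results["documents"][0])

    # 3. Generate a grounded response
    output = ollama.generate(
        model="llama2",
        prompt=f"Use this data: {data} to respond to this prompt: {prompt}",
    )
    return output["response"]

print(answer("How does grounding improve LLM output?"))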

LLM grounding use cases:

  1. Customer service and support: question-and-answer chatbots.
  2. Legal research: quickly retrieving relevant case law and statutes from a large database.
  3. Content creation and research.
  4. Code assistant: troubleshooting and fixing errors in code.
  5. Medical assistant: generating medical insights and patient histories.

Conclusion

Grounding improves LLM output. It reduces hallucination, increases accuracy, and reduces confusion in LLM output. In this article, we examined the RAG technique for grounding LLMs, including how to use OpenAI and Ollama embedding models to create embeddings, store and retrieve them with a vector database, and pass the retrieved content to an LLM.
