How Retrieval Augmented Generation (RAG) Works

Retrieval Augmented Generation (RAG, pronounced 'rag') works by fetching relevant data from a custom knowledge base and feeding it into a language model so that the model can produce accurate and up-to-date responses.
You can think of RAG as a ChatGPT-like interface that can use your PDFs, documents, or databases to answer your questions. You can use it as a study assistant, for example, by asking questions about the documents you upload.

Demonstration of how the RAG system works (custom knowledge base to text chunks to vector database to prompt template to user interface)
In this article, we discuss the benefits of using a RAG system, explain its key components, and detail how RAG works to enhance the capabilities of large language models by integrating them with custom knowledge bases to deliver accurate and contextually relevant responses.

Why Use a RAG System When LLMs Already Have Answers?

  1. You get up-to-date, referenced data that understands your context. Large Language Models (LLMs) give responses based on the extensive datasets they were trained on. GPT-3, for example, has 175 billion parameters, yet it may not know which team won the most recently concluded football Premier League season, or the latest lineup changes across the various teams. To add such information, a RAG system can search the web for recent news, or a document with the news can be uploaded to it. When an LLM is augmented with RAG, it can provide responses with up-to-date, in-context, and referenced information.
  2. You get accurate responses. Since you provide the RAG system with files containing accurate information, you get accurate and contextually appropriate answers. Grounding responses in an accurate data source greatly reduces hallucination, which in Artificial Intelligence (AI) is when an LLM presents incorrect or non-existent information in its responses.
  3. Reduced model size requirements for domain-specific expertise. The most obvious way to turn an LLM into a subject-matter expert is to fine-tune it with more data on that subject. This increases the model size and, consequently, the cost of running the LLM, and the cost may be disproportionate to the value gained. Instead of baking the domain-specific data into the model during fine-tuning, a RAG system can serve the same data in a faster and more cost-efficient manner. The system can then handle long domain-specific queries without overwhelming the LLM's capacity.

For RAG to work, it basically requires:

  • A custom knowledge base
  • A large language model
  • An embedding model
  • A user query

To understand the process better, let's examine these components in detail.

Custom Knowledge Base

Examples of file types that can be used in the custom knowledge base
The custom knowledge base is a collection of relevant and up-to-date information that serves as the foundation of the RAG system. It can contain PDFs, Word documents, spreadsheets, databases, Notion templates, code, and more.
Since the release of GPT-4o in ChatGPT, you can also connect your Google Drive or Microsoft OneDrive. This means that if information is already uploaded to your cloud storage, you don't have to download it; you can feed the link directly to your RAG system.
All of these are closed content stores. The internet, an open content store, can also serve as the custom knowledge base. When using the internet, take care: the RAG system may pull in contextually irrelevant data, which degrades the responses it provides.
It is worth noting that, while RAG systems use the knowledge base to generate responses, they do not store the data permanently. You should still be cautious when sharing confidential information to ensure privacy and security.
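To make this concrete, here is a minimal sketch of pulling raw text from a local folder of PDFs into a simple knowledge base. It assumes the pypdf library and a hypothetical docs/ folder; the other file types listed above would need their own loaders.

```python
# Minimal sketch: collect raw text from local PDFs into a knowledge base.
# Assumes the pypdf library and a hypothetical "docs/" folder; other file
# types (Word documents, spreadsheets, databases) need their own loaders.
from pathlib import Path
from pypdf import PdfReader

def load_knowledge_base(folder: str = "docs") -> dict[str, str]:
    documents = {}
    for pdf_path in Path(folder).glob("*.pdf"):
        reader = PdfReader(pdf_path)
        text = "\n".join(page.extract_text() or "" for page in reader.pages)
        documents[pdf_path.name] = text
    return documents

documents = load_knowledge_base()
```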

Chunking

Example text chunks that can be used in a RAG system
In chunking, the large input text from the custom knowledge base is broken down into smaller pieces that machine learning algorithms can consume. The algorithms need the text in small, predefined sizes (chunks) for efficient retrieval.
Several chunking strategies are currently in use. The most popular are character chunking, recursive character chunking, agentic chunking, token-based chunking, document-specific chunking, and semantic chunking.
The smarter the chunking strategy, the more effective the RAG system.
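As a minimal sketch, here is the simplest of those strategies, fixed-size character chunking with overlap. The chunk size and overlap values are illustrative assumptions rather than recommendations.

```python
# Minimal sketch: fixed-size character chunking with overlap.
# chunk_size and overlap are illustrative values, not tuned recommendations.
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# Chunk every document collected in the earlier knowledge-base sketch.
chunks = [c for doc in documents.values() for c in chunk_text(doc)]
```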

Embeddings & Embedding Model

An embedding model and how it relates to text chunks and the vector database in a RAG system
Machine learning models understand numerical vectors, not generated text chunks. The generated chunks of text therefore need to be converted to numerical vectors, a process called embedding. This is done by an embedding model, as illustrated in the figure.
The embedding model can be any model of your choice, based on your chunking strategy and required context length.
Types of embedding models include sparse, dense, long-context dense, variable-dimension, multi-dimension, and code (dense) models. Specific examples include multilingual MiniLM, BERT, all-mpnet-base, BGE embeddings, text-embedding-ada-002, and Jina embeddings.
Both effectiveness and cost efficiency should be weighed when choosing a model.
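As a sketch of this step, the chunks can be embedded with an off-the-shelf model. This example assumes the sentence-transformers library and the all-mpnet-base model mentioned above.

```python
# Minimal sketch: convert text chunks into numerical vectors (embeddings).
# Assumes the sentence-transformers library and the all-mpnet-base-v2 model.
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-mpnet-base-v2")
embeddings = embedder.encode(chunks)  # array of shape (num_chunks, 768)
```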

Vector Databases

An illustration of the vector database, showing how query vectors relate to similar vectors for efficient retrieval in a RAG system
The numerical vectors from the embedding model are stored in a vector database. The vector database is therefore a collection of pre-computed vector representations of text data, built for fast retrieval and similarity search.
If you think of the database as a spreadsheet with rows and columns, typical columns are metadata, index, embedding, and text chunk. The metadata holds information such as date, time, and named entities for each vector entry.
The vector index offers fast and scalable similarity search in the embedding space. Each embedding has an associated text chunk and metadata that together build the retrieved context.
Vector databases can also support CRUD operations, metadata filtering, and horizontal scaling.
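Production systems use a dedicated vector database for this, but as a minimal sketch the same row structure (index, embedding, text chunk, metadata) can be mimicked with an in-memory list built from the chunks and embeddings above.

```python
# Minimal sketch: an in-memory "vector store" mirroring the columns described
# above. A real vector database adds persistence, indexing, metadata
# filtering, and horizontal scaling.
from datetime import datetime, timezone

vector_store = []
for i, (chunk, vector) in enumerate(zip(chunks, embeddings)):
    vector_store.append({
        "index": i,
        "embedding": vector,
        "text_chunk": chunk,
        "metadata": {"ingested_at": datetime.now(timezone.utc).isoformat()},
    })
```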

User Chat Interface

A chain showing how the user interface works with an LLM, the vector database, and a prompt template to produce the final response
RAG systems are mostly designed for chat-like conversations with end users, so a user-friendly interface is needed for users to interact with the RAG system. The interface lets users input queries and receive responses.
When a user submits a query, it is converted to an embedding, which is used to retrieve relevant context from the vector database.
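Here is a minimal sketch of that retrieval step, scoring the stored vectors by cosine similarity against the query embedding. It builds on the embedder and vector_store from the previous sketches, and top_k is an illustrative parameter.

```python
# Minimal sketch: embed the user query and retrieve the most similar chunks
# by cosine similarity, using the embedder and vector_store defined above.
import numpy as np

def retrieve_context(query: str, top_k: int = 3) -> list[str]:
    query_vec = embedder.encode([query])[0]
    scored = []
    for entry in vector_store:
        vec = entry["embedding"]
        score = np.dot(query_vec, vec) / (np.linalg.norm(query_vec) * np.linalg.norm(vec))
        scored.append((score, entry["text_chunk"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]
```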

Prompt Template

How the prompt template relates to the user query and custom knowledge base
In a RAG system, the LLM produces the final response. The custom knowledge base supplies contextual, up-to-date information to this LLM. The user query and the retrieved information from the custom knowledge base are combined in a prompt template, which is then forwarded to the LLM.
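A minimal sketch of filling such a template before calling the LLM is shown below; the template wording and the call_llm helper are illustrative assumptions, not a fixed format.

```python
# Minimal sketch: combine retrieved context and the user query into a prompt.
# The template wording and the call_llm() helper are illustrative placeholders.
PROMPT_TEMPLATE = """Answer the question using only the context below.
If the answer is not in the context, say you don't know.

Context:
{context}

Question: {question}
"""

def answer(question: str) -> str:
    context = "\n\n".join(retrieve_context(question))
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)
    return call_llm(prompt)  # call_llm stands in for your LLM client of choice
```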

Conclusion

RAG enhances the capabilities of LLMs by integrating them with a custom knowledge base to deliver precise and contextually relevant responses. It effectively makes the LLM's static training data dynamic by supplementing it with up-to-date information.
The RAG system works by first utilizing a custom knowledge base to store relevant and up-to-date information. When a user query is received, it is converted into numerical vectors using an embedding model and then matched with similar vectors in a vector database to retrieve relevant context. This retrieved context, combined with the language model's capabilities, generates precise and contextually relevant responses.
With RAG, small, mid-sized, and large LLMs can provide detailed, accurate, and context-aware interactions, making them invaluable tools for leveraging dynamic and domain-specific knowledge.
