Joseph


How Retrieval Augmented Generation (RAG) Works

Retrieval Augmented Generation (RAG, pronounced 'rag') works by fetching relevant data from a custom knowledge base and integrating it into the output of a language model to produce accurate and up-to-date responses.
You can think of RAG as a ChatGPT-like interface that can use your PDFs, documents, or databases to answer your questions. For example, you can use it as a study assistant, asking questions about the documents you provide.

Figure: How the RAG system works (custom knowledge base → text chunks → vector database → prompt template → user interface).
In this article, we discuss the benefits of using a RAG system and explain its key components. We also detail
how RAG works to enhance the capabilities of large language models by integrating them with custom knowledge bases to deliver accurate and contextually relevant responses.

Why Use a RAG System When LLMs Already Have Answers?

  1. You get up-to-date, referenced data that understands your context. Large Language Models (LLMs) generate responses based on the extensive datasets they were trained on. GPT-3, for example, has 175 billion parameters, yet it may not know which team won the recently concluded football Premier League season, or about the latest lineup changes across the various teams. To add such information, a RAG system can search the web for recent news, or a document containing the news can be uploaded to it. When an LLM is augmented with RAG, it can provide responses with up-to-date, in-context, and referenced information.
  2. You get accurate responses. Since you provide the RAG system with files containing accurate information, you get accurate and contextually appropriate answers. A reliable data source also mitigates hallucination, which in Artificial Intelligence (AI) is when an LLM presents incorrect or non-existent information in its responses.
  3. Reduced model size requirements for domain-specific expertise. The most obvious way to make an LLM a subject matter expert is to fine-tune it with more data on that subject. This increases the model size and, consequently, the cost of the LLM, and the cost may be disproportionate to the value gained. Instead of baking domain-specific data into the model during fine-tuning, a RAG system can supply the same data at retrieval time in a faster and more cost-efficient manner. The system can then handle long domain-specific queries without overwhelming the LLM's capacity.

At its core, a RAG system requires:

  • A custom knowledge base
  • A large language model
  • An embedding model
  • A user query

To understand the process better, let's examine these components in detail.

Custom Knowledge Base

Figure: Examples of file types that can be used in the custom knowledge base.
The custom knowledge base is a collection of relevant and up-to-date information that serves as the foundation of RAG. It can contain PDFs, Word documents, spreadsheets, databases, Notion templates, code, and more.
Since ChatGPT's GPT-4o release, you can also connect your Google Drive or Microsoft OneDrive.
All these are closed content stores.
The internet, which is an open content store, can also be used as the custom knowledge base. When using the internet, caution should be taken, as the RAG system may retrieve contextually irrelevant data, which affects the response it provides.
This means that if you have information uploaded to your cloud storage, you don't have to download it; you can feed the link directly to your RAG system.
It is worth noting that, while RAG systems use the knowledge base to generate responses, they do not store the data permanently. You should still be cautious when sharing confidential information, to ensure privacy and security.

Chunking

Figure: Example text chunks used in a RAG system.
In chunking, the large input text from the custom knowledge base is broken down into smaller pieces that machine learning algorithms can consume. The algorithms need the text in small, predefined sizes (chunks) for efficient retrieval.
Several chunking strategies are currently in use. The most popular are character chunking, recursive character chunking, agentic chunking, token-based chunking, document-specific chunking, and semantic chunking.
The smarter the chunking strategy, the more effective the RAG system.
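As an illustration, here is a minimal sketch of the simplest strategy, fixed-size character chunking with overlap, in plain Python. The chunk_text helper and the sample sizes are illustrative assumptions, not a specific library API; real systems often use the smarter strategies listed above.

```python
# A minimal sketch of fixed-size character chunking with overlap.
# The overlap keeps neighbouring chunks sharing some context so that
# sentences are less likely to be cut off mid-thought.

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split `text` into chunks of roughly `chunk_size` characters,
    each overlapping the previous one by `overlap` characters."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

document = "RAG systems break large documents into smaller pieces... " * 40
print(len(chunk_text(document)))  # number of chunks produced
```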

Embeddings & Embedding Model

Figure: An embedding model and how it relates to text chunks and the vector database in a RAG system.
Machine learning models understand numerical vectors, not raw text chunks. The generated text chunks therefore need to be converted into numerical vectors, a process called embedding. This is done by an embedding model, as illustrated in the figure.
The embedding model can be any model of your choice, based on your chunking strategy and required context length.
Types of embedding models include sparse, dense, long-context dense, variable-dimension, multi-dimensional, and code (dense) models. Specific examples include multilingual MiniLM, BERT, all-mpnet-base, BGE embeddings, text-embedding-ada-002, and Jina embeddings, among others.
Both effectiveness and cost efficiency should be considered when choosing a model.
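For example, a minimal embedding sketch, assuming the sentence-transformers package is installed and the all-mpnet-base-v2 model mentioned above is used, might look like this:

```python
# A minimal sketch: convert text chunks into dense numerical vectors.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")

chunks = [
    "RAG retrieves relevant context from a custom knowledge base.",
    "An embedding model converts text chunks into numerical vectors.",
]

embeddings = model.encode(chunks)  # one dense vector per chunk
print(embeddings.shape)            # (2, 768) for this model
```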

Vector Databases

Figure: An illustration of the vector database, showing how query vectors relate to similar vectors for efficient retrieval in a RAG system.
The numerical vectors produced by the embedding model are stored in a vector database. A vector database is therefore a collection of pre-computed vector representations of text data, built for fast retrieval and similarity search.
If you picture the database as a spreadsheet with rows and columns, typical columns would be metadata, index, embedding, and text chunk. The metadata holds information such as date, time, and named entities for each vector entry.
The vector index offers fast, scalable similarity search in the embedding space. Each embedding has an associated text chunk and metadata, which together build the retrieved context.
Vector databases also support CRUD operations, metadata filtering, and horizontal scaling.
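To make the spreadsheet analogy concrete, here is a toy in-memory "vector store" sketch in plain Python and NumPy. The ToyVectorStore class is purely illustrative; a production system would use a real vector database (for example Qdrant, Pinecone, or FAISS).

```python
# A toy vector store: each entry pairs an embedding with its text chunk
# and metadata, mirroring the spreadsheet analogy above.
import numpy as np

class ToyVectorStore:
    def __init__(self):
        self.embeddings = []   # list of numpy vectors
        self.chunks = []       # associated text chunks
        self.metadata = []     # e.g. {"source": ..., "date": ...}

    def add(self, embedding, chunk, meta):
        self.embeddings.append(np.asarray(embedding, dtype=float))
        self.chunks.append(chunk)
        self.metadata.append(meta)

    def search(self, query_embedding, top_k=3):
        """Return the top_k chunks most similar to the query (cosine similarity)."""
        q = np.asarray(query_embedding, dtype=float)
        matrix = np.vstack(self.embeddings)
        scores = matrix @ q / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(q))
        best = np.argsort(scores)[::-1][:top_k]
        return [(self.chunks[i], self.metadata[i], float(scores[i])) for i in best]
```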

User Chat Interface

Figure: A chain showing how the user interface works with an LLM, the vector database, and a prompt template to produce the final response.
RAG systems are mostly designed to allow chat-like conversations with end users. A user-friendly interface is therefore needed so users can interact with the RAG system. The interface lets users input queries and receive responses.
When a user provides a query, it is converted to an embedding which is used to retrieve relevant context from the vector database.
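Sketching that query path, reusing the hypothetical store (ToyVectorStore) and model (SentenceTransformer) objects from the earlier snippets:

```python
# Query path sketch: embed the user's question, then pull the most
# similar chunks from the vector store to use as context.
query = "How does RAG keep LLM answers up to date?"
query_embedding = model.encode([query])[0]

retrieved = store.search(query_embedding, top_k=3)
context = "\n\n".join(chunk for chunk, meta, score in retrieved)
```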

Prompt Template

Figure: How the prompt template relates to the user query and the custom knowledge base.
In a RAG system, the LLM produces the final response. The custom knowledge base supplies contextual and up-to-date information to this LLM: the user query and the retrieved information are combined in a prompt template, and the resulting prompt is forwarded to the LLM.
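Here is a minimal prompt-template sketch, continuing from the snippets above. The template wording and the call_llm() helper are illustrative placeholders, not a specific library API.

```python
# The retrieved context and the user query are slotted into a template;
# the assembled prompt is what the LLM actually sees.
PROMPT_TEMPLATE = """Answer the question using only the context below.
If the context does not contain the answer, say so.

Context:
{context}

Question:
{question}
"""

prompt = PROMPT_TEMPLATE.format(context=context, question=query)
# response = call_llm(prompt)  # hypothetical helper: send the prompt to the LLM
```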

Conclusion

RAG enhances the capabilities of LLMs by integrating them with a custom knowledge base to deliver precise and contextually relevant responses. It effectively makes the LLM's static training data dynamic by supplementing it with up-to-date information.
The RAG system works by first utilizing a custom knowledge base to store relevant and up-to-date information. When a user query is received, it is converted into numerical vectors using an embedding model and then matched with similar vectors in a vector database to retrieve relevant context. This retrieved context, combined with the language model's capabilities, generates precise and contextually relevant responses.
With RAG, small, mid-sized, and large LLMs can provide detailed, accurate, and context-aware interactions, making them invaluable tools for leveraging dynamic and domain-specific knowledge.
