Ďēv Šhãh 🥑

RAG Explained: Ingestion of Data

Disclaimer

This blog reflects my learnings from Augment your LLM Using Retrieval Augmented Generation by NVIDIA. The content is based on the topics covered in the course and my understanding of them. If you are not sure what RAG is, I would suggest checking out my following blog.

Introduction

This blog covers ‘Ingestion of Data’, the process of converting raw data into tokens, and walks through each step involved. Data ingestion is a critical step in RAG pipelines because the primary benefit of RAG is generating up-to-date responses. To achieve this, the LLM must be supplied with relevant chunks of data, which is done by ingesting the most relevant and current data.

Ingestion of Data
Image Credit: NVIDIA

Step 1: Retrieve / Search

The first step is retrieving data from external sources, which include databases, documents, PDFs, HTML, the internet, APIs, and data repositories. In other words, relevant data is gathered from all necessary sources. This step ensures that all possible relevant information is available for the following processes.
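
To make this concrete, here is a minimal Python sketch of gathering text from two common source types. The URL and file name are placeholders made up for illustration; only the requests library is assumed.

    import requests

    def retrieve_sources():
        documents = []

        # Fetch a web page over HTTP; response.text holds the raw HTML.
        # The URL is a placeholder, not a real source.
        response = requests.get("https://example.com/article")
        documents.append(response.text)

        # Read a local document from disk (placeholder file name).
        with open("notes.txt", encoding="utf-8") as f:
            documents.append(f.read())

        return documents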

Step 2: Process Raw Text

Once the data is retrieved, the next step is pre-processing the collected data. This step involves preparing the data to make it usable in the later stages of the pipeline. The process consists of the following sub-steps:

  1. Clean Text: The collected data is cleaned. For instance, if the data is in HTML format, all the HTML tags are removed, leaving only the raw text. Other types of formatting (like headers, footers, or special characters) may also be cleaned to ensure the text is usable.
  2. Chunking: After cleaning, the text is grouped into manageable and meaningful segments (chunks). For instance, if you have a long research paper, the text might be chunked into sections such as "Introduction," "Methods," "Results," and "Conclusion." Alternatively, the text can be chunked based on a specific number of words, like breaking the text into 500-word chunks. This ensures that each chunk is small enough for the model to handle and still contains enough meaningful context for accurate retrieval and response generation.
  3. Removing Duplicates: Lastly, duplicate information is removed from the data to avoid redundancy. A sketch of all three sub-steps follows this list.
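
Putting the three sub-steps together, a simplified sketch might look like this. It uses only Python's standard library, strips HTML tags with a naive regular expression (a real pipeline would use a proper HTML parser), chunks by a fixed word count, and drops exact-duplicate chunks; all function names here are my own.

    import re

    def clean_text(raw_html):
        # Naive tag stripping; a real pipeline would use an HTML parser.
        text = re.sub(r"<[^>]+>", " ", raw_html)
        return re.sub(r"\s+", " ", text).strip()

    def chunk_text(text, chunk_size=500):
        # Break the text into fixed-size word chunks (500 words, as above).
        words = text.split()
        return [" ".join(words[i:i + chunk_size])
                for i in range(0, len(words), chunk_size)]

    def remove_duplicates(chunks):
        # Keep only the first occurrence of each exact-duplicate chunk.
        seen, unique = set(), []
        for chunk in chunks:
            if chunk not in seen:
                seen.add(chunk)
                unique.append(chunk)
        return unique

    chunks = remove_duplicates(chunk_text(clean_text("<p>Some raw HTML ...</p>")))

In practice, pipelines often overlap adjacent chunks slightly so that sentences are not cut off mid-thought at chunk boundaries.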

Step 3: Tokenization

In the final step, the processed chunks are broken down into smaller units known as tokens. Tokens can be words, sub-words, or even characters, depending on the requirements of the model.

Continuing with the research paper example, let's say the "Introduction" section of the paper contains the sentence:

Machine learning models have shown great promise in data analysis.

During tokenization, this sentence would be broken down into individual tokens. These tokens could be full words like "Machine", "learning", and "models", or even smaller units (sub-words) like "Mach-", "ine", "learn-", and "-ing", depending on the tokenizer being used. The model then uses these tokens to generate embeddings, which are used in the retrieval and generation phases of the RAG pipeline.
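
As a concrete illustration, here is a small sketch using the tiktoken library, one of many possible tokenizers; this is my choice for the example, not something the course prescribes.

    import tiktoken

    # cl100k_base is one of several encodings tiktoken provides; the right
    # tokenizer depends on the model being used.
    encoding = tiktoken.get_encoding("cl100k_base")
    sentence = "Machine learning models have shown great promise in data analysis."

    token_ids = encoding.encode(sentence)               # integer IDs the model consumes
    tokens = [encoding.decode([i]) for i in token_ids]  # human-readable pieces

    print(token_ids)
    print(tokens)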

Technical Note: Tokenization is vital because modern LLMs operate on tokens, not on raw text. By transforming the data into tokens, the model can more efficiently process the input and retrieve relevant information during generation.
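
Since tokens ultimately feed into embedding generation, here is a minimal sketch of embedding a chunk, assuming the sentence-transformers library and the all-MiniLM-L6-v2 model (my illustrative choices, not prescribed by the course).

    from sentence_transformers import SentenceTransformer

    # Model choice is illustrative; any embedding model can stand in here.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embedding = model.encode(
        "Machine learning models have shown great promise in data analysis."
    )
    print(embedding.shape)  # a fixed-length vector: (384,) for this model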

Final Words

Ingestion of data is the backbone of the RAG pipeline. It ensures that the model has access to the most relevant and up-to-date information. The steps of retrieval/search, raw-text processing, and tokenization ensure that the data is clean, organized, and in a format the model can understand, making ingestion a key factor in improving the accuracy of responses.

Citation
I would like to acknowledge that I took help from ChatGPT to structure my blog, simplify content, and generate relevant examples.
