RAG Retrieval: Is High Quality Essential for Every Project?

#tutorials #ai #rag #retrieval

The Importance of Quality in RAG Retrieval: A Case Study

In a manufacturing ERP system, an AI-powered module we developed for supply chain optimization consistently returned irrelevant or incomplete information when answering user questions. Asked a perfectly clear question about inventory status, the system would retrieve unrelated data about past orders. This degraded operator decisions and led to delays and errors in production planning. The root cause lay in the retrieval stage. A retrieval mechanism that misreads the context and intent of user queries cannot produce accurate results, no matter how advanced the LLM (Large Language Model) behind it.

Problems like these highlight one of the most critical aspects of Retrieval-Augmented Generation (RAG) systems: retrieval quality. The more relevant and complete the context an LLM receives, the more accurate and useful its responses will be. If the retrieval layer cannot fetch the most suitable information for the user's query, the performance of the entire system degrades. In this post, we will look at why retrieval quality is so crucial in RAG systems, which factors influence it, and how to build projects without compromising on it. Based on my experience, I argue that high retrieval quality is not a luxury but a necessity in every project.

Fundamentals of Retrieval Mechanisms: Vector Databases and Embeddings

The retrieval mechanisms that form the backbone of RAG systems typically rely on vector databases and embeddings. An embedding model converts documents, text snippets, or data records into numerical vectors that encode their meaning, so texts with similar meanings lie close to each other in vector space. When a user submits a query, the query is converted into a vector by the same model, and its nearest neighbors are searched for in the vector database. These neighbors are the documents most likely to be relevant to the query.
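To make the nearest-neighbor idea concrete, here is a minimal sketch in plain Python. It assumes cosine similarity as the distance measure and uses made-up 3-dimensional vectors; the document ids and numbers are hypothetical, and a real vector database would use an approximate index rather than scoring every document.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def nearest_neighbors(query_vec, doc_vecs, k=2):
    """Return the ids of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

# Hypothetical embeddings; real models produce hundreds of dimensions.
doc_vecs = {
    "inventory_report": [0.9, 0.1, 0.0],
    "past_orders": [0.1, 0.9, 0.1],
    "hr_policy": [0.0, 0.1, 0.9],
}
query_vec = [0.8, 0.2, 0.0]  # stands in for an embedded inventory question

top = nearest_neighbors(query_vec, doc_vecs, k=2)
print(top)
```

With these toy vectors the inventory report ranks first because its vector points in almost the same direction as the query.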

For instance, consider my own financial calculator side project. If a user asks a specific question like, "What was my profit margin last month?", the question is passed through an embedding model to become a "query embedding." Each of our past financial reports has already been stored in the vector database as a pre-generated "document embedding." The system then finds the document embeddings most similar to the query embedding. If these correspond to the correct financial reports, the LLM can generate an accurate response from them. However, if the embeddings are generated poorly, or if the database doesn't return the right documents, the result will be unsatisfactory no matter how good a prompt we give the LLM.

The quality of the embedding models used in this process is of great importance. Different models can better capture different languages, terminologies, and semantic nuances. In my own projects, I initially used a simple embedding model. However, due to the complexity of financial terminology and the prevalence of specific terms (e.g., "gross profit margin," "net sales," "operating expenses"), I found that the model couldn't distinguish these terms sufficiently well. This, in turn, reduced the rate at which queries matched the correct documents.

ℹ️ Choosing an Embedding Model

It is critical to test the performance of different embedding models (e.g., text-embedding-ada-002, all-MiniLM-L6-v2, bge-large-en-v1.5) on your own dataset. Each model has its strengths and weaknesses. Some are better with general text, while others might be more suitable for specific domains (e.g., finance, medicine).
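One way to run such a comparison is to label a handful of queries with the document that should come back, then measure recall@k for each candidate model. The sketch below is only an illustration: the two "models" are toy bag-of-words functions standing in for real embedding models, and the documents, queries, and vocabularies are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def make_bag_of_words(vocab):
    """Toy stand-in for an embedding model: counts each vocabulary word."""
    def embed(text):
        words = text.lower().split()
        return [words.count(term) for term in vocab]
    return embed

def recall_at_k(embed, docs, labeled_queries, k=1):
    """Fraction of queries whose known-relevant document is in the top k."""
    doc_vecs = {doc_id: embed(text) for doc_id, text in docs.items()}
    hits = 0
    for query, relevant_id in labeled_queries:
        q = embed(query)
        ranked = sorted(doc_vecs, key=lambda d: cosine(q, doc_vecs[d]), reverse=True)
        if relevant_id in ranked[:k]:
            hits += 1
    return hits / len(labeled_queries)

docs = {
    "gross_margin": "gross profit margin report",
    "net_margin": "net profit margin report",
    "net_sales": "net sales summary",
}
labeled_queries = [
    ("gross margin", "gross_margin"),
    ("net margin", "net_margin"),
    ("net sales", "net_sales"),
]

# The coarse model cannot tell "gross" from "net"; the fine model can.
coarse_model = make_bag_of_words(["margin", "sales", "expenses"])
fine_model = make_bag_of_words(["gross", "net", "margin", "sales"])

coarse_recall = recall_at_k(coarse_model, docs, labeled_queries)
fine_recall = recall_at_k(fine_model, docs, labeled_queries)
print(coarse_recall, fine_recall)
```

To compare real models, replace the toy functions with calls to each model under test and keep the labeled queries fixed, so that the only variable is the embedding.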

Factors Affecting Retrieval Quality: Data Quality and Preprocessing

One of the most significant factors directly impacting retrieval quality is the quality of the data fed into the system. If our database contains inconsistent, incomplete, or erroneous information, the retrieval mechanism will yield incorrect results no matter how advanced it is. I ran into this problem frequently in enterprise software development. In the manufacturing ERP system, for example, product information, stock levels, and order details were sometimes stored inconsistently across different modules.

For example, if a product name appeared as "Product X Model A" in one place and "Product X-A Model" in another, it could cause the embeddings to interpret these two records differently. Such inconsistencies, especially if not resolved during text preprocessing, severely degrade retrieval performance. Preprocessing steps include text cleaning (punctuation, special characters), case conversions, stop word removal, and stemming/lemmatization.
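As a rough sketch of such normalization, the function below builds an order-independent key for product names, so the two spellings above map to the same record. The function name and the token-sorting rule are my own assumptions for illustration. Sorting tokens discards word order, so a key like this suits deduplication and record matching rather than the text that is actually embedded.

```python
import re

def canonical_key(name):
    """Lowercase, turn punctuation into spaces, and sort the tokens."""
    tokens = re.sub(r"[^a-z0-9]+", " ", name.lower()).split()
    return " ".join(sorted(tokens))

key_a = canonical_key("Product X Model A")
key_b = canonical_key("Product X-A Model")
print(key_a)
print(key_a == key_b)
```

Records that share a key can then be merged or mapped to a single canonical name before embeddings are generated.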

In an Android spam blocking application I developed, I used a RAG system to analyze user complaints. Users reported their situations using different phrases like "block my number," "this number is bothering me," "got an ad." Before generating embeddings for these phrases, I had to normalize all the text, remove pronouns like "me" and "this," and eliminate unnecessary words. This way, I enabled the model to better understand that different phrases conveyed the same semantic meaning. If these preprocessing steps are skipped, even similar queries can result in different vectors, and the most relevant documents might not be retrieved.

⚠️ Data Cleaning and Normalization

Data preprocessing is one of the least visible but most critical steps in RAG systems. Skipping it or doing it poorly leads to noticeable drops in retrieval quality. Ensuring the consistency and accuracy of your data should be your first step.

Chunking Strategies: How Should We Split Documents?

Another important consideration in retrieval systems is how we "chunk" documents (split them into pieces). LLMs typically have a context window limit, which refers to the amount of text the model can process at once. Therefore, it's necessary to split long documents into smaller, meaningful pieces. However, this splitting process should not be arbitrary. The "chunking" strategy directly impacts retrieval quality.

One approach is to use fixed-size chunks. For example, splitting the document into pieces of 1000 tokens each. However, this method can lead to splitting in the middle of sentences or distributing a paragraph across two different chunks. This, in turn, causes a chunk to lose its semantic integrity on its own. A better approach is to use semantically meaningful boundaries. For instance, splitting based on paragraph endings, section headings, or specific text structures (like list items).
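The difference between the two approaches is easy to see on a tiny example. This sketch counts words instead of tokens to stay dependency-free, and it treats a blank line as the paragraph boundary; both are simplifying assumptions, and the sample text is made up.

```python
def fixed_size_chunks(text, size):
    """Split into chunks of `size` words, ignoring sentence and paragraph boundaries."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def paragraph_chunks(text):
    """Split on blank lines so each chunk is one whole paragraph."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

report = (
    "Revenues rose in March. Net sales reached a new high.\n\n"
    "Expenses stayed flat. Operating expenses were unchanged."
)

fixed = fixed_size_chunks(report, size=6)
by_paragraph = paragraph_chunks(report)

for chunk in fixed:
    print("fixed:", chunk)
for chunk in by_paragraph:
    print("paragraph:", chunk)
```

The second fixed-size chunk mixes the end of the revenue paragraph with the start of the expense paragraph, while the paragraph-based chunks each stay on one topic.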

When working with financial reports for the custom financial calculators I run on my own VPS, I experimented with different strategies. Initially, I tried to convert each entire report into a single large document embedding. This was not possible because a full report exceeded the input length limit of the embedding model. Then, I simply split each report into 500-word chunks. However, this sometimes cut a profit and loss statement in the middle or left a footnote meaningless on its own, which led to the retrieval of incorrect data. Finally, I decided to split along the natural structure of the report (e.g., headings like "Revenues," "Expenses," "Net Profit," or paragraphs). This way, each chunk carried a coherent meaning on its own.

💡 Chunking Strategies

Experiment with different chunking strategies:

Fixed Size: Split based on a specific word or token count.

Semantic Splitting: Split using paragraph, heading, or sentence boundaries.

Recursive Chunking: First divide into larger pieces, then break these pieces into smaller semantic units.

Whichever strategy you choose, consider the context window size of the embedding model and LLM you will be using.
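The recursive strategy from the list above can be sketched as follows. This is a simplified illustration, not a production splitter: it measures size in words rather than tokens, the separator list is an assumption, it drops the separator it splits on, and it does not merge small neighboring pieces back together the way fuller implementations usually do.

```python
def recursive_chunks(text, max_words, separators=("\n\n", ". ")):
    """Split on the coarsest separator first; recurse with finer ones for oversized pieces."""
    text = text.strip()
    if not text:
        return []
    if len(text.split()) <= max_words:
        return [text]
    if not separators:
        # No separators left: fall back to a hard word-count split.
        words = text.split()
        return [" ".join(words[i:i + max_words])
                for i in range(0, len(words), max_words)]
    chunks = []
    for piece in text.split(separators[0]):
        chunks.extend(recursive_chunks(piece, max_words, separators[1:]))
    return chunks

doc = (
    "Revenues\n\n"
    "Net sales rose. Margins improved across all regions this quarter. Costs fell.\n\n"
    "Expenses were flat."
)

chunks = recursive_chunks(doc, max_words=8)
for chunk in chunks:
    print(repr(chunk))
```

Paragraphs that already fit are kept whole, and only the oversized middle paragraph is broken into sentences.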

The Relationship Between Retrieval Quality and LLM Performance

Low retrieval quality can render even the most advanced LLMs inadequate. LLMs generate responses based on the context provided to them. If this context is irrelevant or incomplete for the query, the LLM's probability of "hallucinating" (generating fabricated information) increases. This is a common and trust-damaging issue in the field of artificial intelligence.

At one point, we were developing a question-answering system for the internal platform of a large e-commerce site. The system was intended to assist users with information about products, shipping status, and return policies. Initially, our retrieval layer focused on fetching the closest documents using product names and SKUs. However, users often asked about product technical specifications or compatibility. The retrieval layer struggled to fetch documents containing these specific technical details. As a result, the LLM provided passive responses like "Insufficient information available for this product" or sometimes fabricated completely incorrect technical specifications.

To solve this problem, we expanded the retrieval layer to index not only product names but also product descriptions, technical documents, and FAQ sections. We also switched to a more advanced embedding model to better capture the keywords and semantic meaning of the queries. As a result, the quality of the context supplied to the LLM increased, and it began to provide much more accurate and detailed answers to user questions. This experience was a concrete example of how retrieval quality directly impacts the LLM's performance.

🔥 Risk of Hallucination

Low-quality retrieval leads to LLMs generating incorrect or fabricated information. This erodes user trust and reduces the system's usefulness. Maintaining high retrieval quality is the first step to ensuring LLM accuracy.

Methods to Improve Retrieval Quality: Re-ranking and Hybrid Approaches

There are various advanced techniques that can be used to improve retrieval quality. One of these is "re-ranking." Documents retrieved in the initial retrieval stage are then re-evaluated by a more complex and costly model. This second-stage model can capture semantic nuances or deeper connections with the query that might have been overlooked in the first retrieval stage. Thus, the most relevant documents are moved to the top of the list.
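A two-stage pipeline of this kind might look like the sketch below. Both scorers are toy stand-ins that I chose for illustration: the first stage counts shared words, and the "more precise" second stage counts shared adjacent word pairs. In practice the second stage would be a heavier model, and the documents here are hypothetical.

```python
def first_stage(query, docs, k):
    """Cheap candidate retrieval: rank by the number of shared words."""
    q = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: len(q & set(docs[d].lower().split())),
                    reverse=True)
    return ranked[:k]

def bigrams(text):
    words = text.lower().split()
    return set(zip(words, words[1:]))

def phrase_score(query, doc_text):
    """Toy 'expensive' scorer: counts adjacent word pairs shared with the query."""
    return len(bigrams(query) & bigrams(doc_text))

def rerank(query, candidates, docs, score_fn):
    """Re-order the first-stage candidates with the slower, more precise scorer."""
    return sorted(candidates, key=lambda d: score_fn(query, docs[d]), reverse=True)

docs = {
    "past_orders": "status of past orders affecting inventory",
    "stock_levels": "inventory status per warehouse",
    "hr": "holiday policy for staff",
}
query = "inventory status"

candidates = first_stage(query, docs, k=2)
reranked = rerank(query, candidates, docs, phrase_score)
print(candidates)
print(reranked)
```

Both candidates share two words with the query, so the first stage cannot separate them; the second stage promotes the document that contains the actual phrase.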

Another important approach is "hybrid retrieval." In this method, not only vector-based similarity search but also traditional keyword-based search (e.g., using algorithms like BM25) is employed. Keyword search is effective in retrieving specific terms or exact matches, while vector search captures broader semantic similarities. Combining these two methods increases the probability of retrieving documents that contain both the correct terms and are semantically relevant.

On my own side project, an anonymous Turkish data platform, I enabled users to query specific datasets. On this platform, both keyword-level precision and semantic understanding were critical. For example, when a user asked for "agricultural production data," I wanted to retrieve not only documents containing these exact words but also semantically similar data like "crop production" or "harvest rates." Initially, when I used only vector search, the results sometimes included data on loosely related topics that did not contain the exact terms. When I combined it with keyword-based search (similar to what engines like Lucene or Elasticsearch provide), the results balanced exact-term matches and semantically similar documents much better.
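One common way to merge a keyword ranking with a vector ranking is reciprocal rank fusion (RRF). I am using it here purely as an illustration; it is not necessarily the combination method used on the platform described above. The two input rankings and the document ids are hypothetical, and k=60 is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each document scores the sum of 1 / (k + rank) over the lists containing it."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

# Hypothetical results for the query "agricultural production data".
keyword_ranking = ["agricultural_production_2023", "agricultural_subsidies", "crop_production"]
vector_ranking = ["crop_production", "agricultural_production_2023", "harvest_rates"]

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(fused)
```

Documents that both retrievers agree on rise to the top, while documents found by only one retriever are still kept further down the list.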

💡 Advanced Retrieval Techniques

Re-ranking: Re-ordering initial retrieval results with a more sophisticated model.

Hybrid Retrieval: Combining vector-based search with keyword-based search.

Query Expansion: Expanding the user's query by adding semantically similar terms.

Document Expansion: Enriching document content with additional semantically relevant information.
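Of the techniques listed above, query expansion is the simplest to sketch. The version below appends synonyms from a hand-written dictionary; the dictionary contents and function name are assumptions for illustration, and real systems often generate the extra terms with a thesaurus or a language model instead.

```python
def expand_query(query, synonyms):
    """Append known synonyms for phrases found in the query, skipping duplicates."""
    lowered = query.lower()
    extra = []
    for phrase, alternatives in synonyms.items():
        if phrase in lowered:
            for alt in alternatives:
                if alt not in lowered and alt not in extra:
                    extra.append(alt)
    return " ".join([query] + extra)

# Hypothetical synonym table.
synonyms = {"agricultural production": ["crop production", "harvest rates"]}

expanded = expand_query("agricultural production data", synonyms)
unchanged = expand_query("stock levels", synonyms)
print(expanded)
print(unchanged)
```

The expanded query is then sent to the retriever in place of the original, which gives keyword search a chance to match documents that use different wording.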

Conclusion: Do Not Compromise on Quality

In Retrieval-Augmented Generation (RAG) systems, retrieval quality is a fundamental requirement for project success. Data quality, proper preprocessing steps, effective chunking strategies, and advanced retrieval techniques are the keys to building a high-quality retrieval mechanism. Insufficient retrieval can mislead even the most advanced LLMs, increase the risk of "hallucination," and negatively impact user experience.

In my experience, not focusing enough on retrieval quality at the start leads to significant losses of time and resources later. The irrelevant order data in our manufacturing ERP, the incorrect reports retrieved in my financial calculator, and the fabricated product information on the e-commerce platform all stemmed from the same fundamental issue: low retrieval quality. To avoid such problems and get the most out of your RAG systems, design the retrieval layer meticulously from the very beginning of your project and keep improving it. Remember, no matter how intelligent LLMs are, they can only be as good as the information provided to them. Therefore, high-quality retrieval is essential for every project.