Improving RAG Retrieval Quality: A Cost-Benefit Analysis
Retrieval-Augmented Generation (RAG) systems enrich the capabilities of Large Language Models (LLMs) with external information sources, allowing them to produce more accurate and contextually appropriate responses. At the heart of these systems lies the "retrieval" phase. When retrieval quality drops, the results produced by the LLM suffer directly; responses filled with incorrect or irrelevant information become inevitable. In this post, I will conduct a cost-benefit analysis of the methods used to improve retrieval quality in RAG systems. Through real-world scenarios, we will examine how much value each improvement adds and in which situations it is worth spending more resources.
As we begin this analysis, it is important to remember that improving retrieval quality is not just a technical optimization but also has a direct impact on business outcomes. For example, a RAG bot used in customer service providing incorrect information can lower customer satisfaction and damage the brand. A RAG system used in financial analysis bringing back faulty data can lead to serious financial losses. Therefore, we must link our efforts to improve retrieval quality not only to technical metrics but also to business goals.
Advanced Vector Database Options and Costs
Vector databases, which form the foundation of RAG systems, enable fast and effective similarity searches by converting documents into vector representations. However, standard vector databases can experience performance issues, especially with large datasets or complex queries. This is where more advanced, scalable, and optimized vector databases come into play.
For instance, open-source solutions like Milvus or Weaviate offer higher query performance thanks to advanced indexing strategies (such as HNSW and IVF) and distributed architectures. However, managing, maintaining, and scaling these databases brings additional operational load and cost. Instead of setting up and managing these databases on our own infrastructure, using managed services like Pinecone or Weaviate Cloud reduces operational overhead, but it introduces vendor lock-in and generally higher monthly costs.
In an ERP integration project for a manufacturing firm, we initially used PostgreSQL with the pgvector extension. When we indexed about 5 million documents, query times began to exceed 2 seconds. This seriously limited the system's ability to respond in real time. When we switched to Pinecone, we reduced query times to an average of 150 milliseconds on the same dataset. However, the monthly cost of this transition was about 5 times that of our previous infrastructure. When making the decision, we evaluated whether the operational efficiency and increased customer satisfaction brought by the query performance boost justified the cost increase. In this case, we decided it did.
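For readers who want to try the self-hosted route first, here is a minimal, illustrative sketch of that kind of pgvector setup. It assumes pgvector 0.5 or later (for HNSW support), psycopg2, and a hypothetical documents table with an embedding vector(1536) column; it is a simplified sketch, not our production schema.

```python
# Minimal pgvector sketch: create an HNSW index and run a cosine-similarity query.
# Assumes pgvector >= 0.5 and a documents(id, content, embedding vector(1536)) table.
import psycopg2

conn = psycopg2.connect("dbname=rag user=rag")  # hypothetical connection string
cur = conn.cursor()

# One-time: an HNSW index speeds up approximate nearest-neighbor search considerably.
cur.execute(
    "CREATE INDEX IF NOT EXISTS documents_embedding_idx "
    "ON documents USING hnsw (embedding vector_cosine_ops);"
)
conn.commit()

def retrieve(query_embedding: list[float], k: int = 5):
    # "<=>" is pgvector's cosine-distance operator; smaller means more similar.
    cur.execute(
        "SELECT id, content FROM documents ORDER BY embedding <=> %s::vector LIMIT %s;",
        (str(query_embedding), k),
    )
    return cur.fetchall()
```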
ℹ️ Cost Analysis Note
The costs of managed vector database services generally vary based on data storage, index size, query volume, and additional features (e.g., metadata filtering, auto-scaling). For solutions you set up on your own infrastructure, you must consider hardware, licensing (if any), maintenance, operations, and engineering costs.
Data Pre-processing and Chunking Strategies
One of the most critical factors affecting retrieval quality is how our dataset is pre-processed and divided into smaller pieces (chunks). Splitting documents in a way that is meaningful and preserves context makes it easier for the LLM to find the most relevant information for the query. However, a poor chunking strategy can lead to the loss of relevant information or fill the LLM's context window with unnecessarily large chunks.
There are different chunking strategies: fixed-size chunking, paragraph-based splitting, sentence-based splitting, or more advanced semantic chunking methods. Fixed-size chunking (e.g., 512 tokens) is the simplest method but carries the risk of splitting in the middle of a sentence or an idea. Paragraph-based splitting can be more meaningful, but some paragraphs may be very long while others are very short. Semantic chunking algorithms attempt to better capture the logical sections within documents.
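As a concrete (and deliberately simplified) illustration, here is a sketch of fixed-size chunking with overlap. It counts whitespace-separated words rather than real tokens; in practice you would count tokens with your model's tokenizer.

```python
# Sketch of fixed-size chunking with overlap (word-based for simplicity).
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + chunk_size]
        if piece:
            chunks.append(" ".join(piece))
        if start + chunk_size >= len(words):
            break
    return chunks

# Toy example: 1,200 "words" become three overlapping chunks.
document = " ".join(f"word{i}" for i in range(1200))
print(len(chunk_text(document)))  # 3 chunks, each sharing 50 words with its neighbor
```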
While developing a RAG system for product descriptions on an e-commerce platform, we initially used fixed-size chunking of 500 tokens. This approach resulted in half of the product features being in one chunk and the other half in the next. Consequently, when a user asked, "Is it water-resistant?", the system could not accurately find the chunk containing this information. To solve this, we started splitting product descriptions based on headings and subheadings. With this strategy, each chunk became more consistent and meaningful. The rate of finding information related to the query rose from 75% to 92%. The cost of this improvement was the more complex parsing logic added to our data processing pipeline.
💡 Recommendation for Chunking
When determining chunk size, consider the context window of your target LLM. Very small chunks can cause a loss of context, while very large chunks can increase the LLM's processing cost and create "noise" by including irrelevant information. Overlap between chunks is also important to prevent the loss of information at the end of one chunk and the beginning of the next. Generally, an overlap of 50-100 tokens is a good starting point.
Selection and Impact of Embedding Models
The embedding models that create vector representations directly affect the RAG system's ability to understand. Different embedding models capture the semantic relationships of words and sentences in different ways. Model selection significantly impacts both retrieval quality and the system's computational costs.
There are many different embedding models on the market. Popular models like text-embedding-ada-002 (OpenAI) offer a good balance, while newer and specialized models like Cohere Embed v3 or Voyage AI may show superior performance in specific languages or domains. Open-source models include Sentence-BERT (SBERT) derivatives or models like BAAI/bge-large-en. Each of these models has a different cost; some are charged via API calls, while others can be run on your own infrastructure, bringing hardware costs.
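As an illustration of the open-source route, here is a small sketch using the sentence-transformers library with the BAAI/bge-large-en model mentioned above. The example texts are invented, and any comparable model can be swapped in.

```python
# Sketch: embedding documents locally with sentence-transformers.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en")  # runs locally; a GPU helps at scale

docs = [
    "Q3 revenue grew 12% year over year, driven by subscription sales.",
    "Management projects free cash flow of $40M for the next fiscal year.",
]
doc_embeddings = model.encode(docs, normalize_embeddings=True)
query_embedding = model.encode("future cash flow projections", normalize_embeddings=True)

# With normalized vectors, cosine similarity is just a dot product.
scores = doc_embeddings @ query_embedding
print(scores)  # the second document should score noticeably higher
```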
In a financial analysis RAG system, we initially used the text-embedding-ada-002 model. We indexed about 10 million financial reports. When users asked questions about a specific company's "future cash flow projections," the system often brought back general revenue trends or past expenditures. This was because the model could not fully capture the nuances of financial terminology. Later, we decided to use a model similar to FinBERT, trained specifically on financial texts. Since this model was not available behind an API, we had to run it on our own servers, which brought GPU costs and operational overhead. However, with the new model, 80% of queries began to be answered with accurate and relevant information, compared with around 55% under the old model. Per-query cost rose from approximately $0.005 to $0.02 once server costs were included, but the accuracy gain more than justified this cost.
⚠️ Points to Consider in Model Selection
When choosing an embedding model, pay attention not only to its performance but also to its cost, ease of use, and scalability. While models you run on your own infrastructure may require a higher initial hardware investment, they can be more economical than API-based solutions in the long run. Additionally, the language and domain knowledge supported by the model are of critical importance.
Query Expansion Techniques
The question asked by the user (the query) is often not optimized for searching directly in a vector database. Query expansion techniques improve retrieval quality by enriching the original query so that it matches relevant documents more reliably. The goal is to retrieve more of the potentially relevant documents.
One common query expansion technique is adding synonyms or related terms to the original query. For example, when a user asks for "Laptop prices," we can expand the query to "Laptop prices, notebook costs, computer deals." Another method is to use an LLM to generate multiple queries that approach the original question from different angles. A related approach is "HyDE" (Hypothetical Document Embeddings), where the LLM generates a hypothetical document in response to the query, and the vector representation of this document is used in the search.
While developing a RAG system for a customer support portal, users' questions were often short and vague. For example, the question "How do I make a return?" could point to many different return policy documents. We used an LLM to expand the query. The LLM took the question "How do I make a return?" and generated more specific questions such as: "What is the process for returning a damaged product?", "What are the return conditions within 14 days?", "Who pays the shipping fee for returns?". Each of these generated queries was searched separately in the vector database, and the results were merged. Thanks to this approach, the rate of reaching the correct information rose from 65% to 90%. The cost of this technique was the additional API calls made to the LLM; each query expansion operation added about $0.001. However, this cost was quite low compared to the reduction in support requests resulting from incorrect answers.
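A stripped-down version of that expansion step might look like the sketch below. It assumes the OpenAI Python client (v1+) and a placeholder search function standing in for your vector-database query; the model name is only an example.

```python
# Sketch of LLM-based multi-query expansion followed by a merged search.
from openai import OpenAI

client = OpenAI()

def expand_query(question: str, n: int = 3) -> list[str]:
    prompt = (
        f"Rewrite the following customer question as {n} more specific questions, "
        f"one per line:\n\n{question}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any inexpensive chat model will do
        messages=[{"role": "user", "content": prompt}],
    )
    lines = resp.choices[0].message.content.splitlines()
    return [line.strip() for line in lines if line.strip()]

def expanded_retrieve(question: str, search, k: int = 5) -> list:
    # Search with the original question plus each expansion, then de-duplicate.
    results, seen = [], set()
    for q in [question] + expand_query(question):
        for doc_id, content in search(q, k):
            if doc_id not in seen:
                seen.add(doc_id)
                results.append((doc_id, content))
    return results
```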
ℹ️ HyDE Technique
HyDE allows the LLM to generate a hypothetical response to a query and performs a similarity search using the embedding of that response. This can be effective especially when the query itself does not carry as much rich information as a document. However, the quality of the hypothetical response generated by the LLM directly affects the retrieval result.
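In code, the idea reduces to roughly the following sketch, reusing the same assumed OpenAI client and embedding model as in the earlier examples, with search_by_embedding as a placeholder for your vector search.

```python
# Sketch of HyDE: embed a hypothetical LLM-written answer instead of the raw query.
from openai import OpenAI
from sentence_transformers import SentenceTransformer

client = OpenAI()
embedder = SentenceTransformer("BAAI/bge-large-en")

def hyde_retrieve(query: str, search_by_embedding, k: int = 5):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Write a short passage that plausibly answers: {query}",
        }],
    )
    hypothetical_doc = resp.choices[0].message.content
    # Search with the hypothetical document's embedding, not the query's.
    embedding = embedder.encode(hypothetical_doc, normalize_embeddings=True)
    return search_by_embedding(embedding, k)
```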
Re-ranking and Ranking Optimizations
After multiple potentially relevant documents are retrieved during the retrieval phase, re-ranking these documents before they are presented to the LLM can further improve retrieval quality. The initial retrieval usually performs a broad similarity search rather than a perfect semantic match. Re-ranking evaluates the relevance of the retrieved documents to the query more precisely and brings the most relevant ones to the top.
Different approaches exist for re-ranking. A simple method is to rank based on keyword frequency or other metadata in the retrieved documents. More advanced methods add a second stage built around a "cross-encoder" model. Rather than embedding the query and each document separately, a cross-encoder processes them together and calculates a much more precise relevance score. Cross-encoders are slower than the first-stage similarity search, but they remain practical when applied only to the small candidate set returned by that first stage (for example, re-scoring the top 50 or 100 candidates rather than the entire corpus).
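A two-stage setup of this kind can be sketched as follows, using the CrossEncoder class from sentence-transformers and a public MS MARCO re-ranking model; first_stage_search is a placeholder for your existing vector search.

```python
# Sketch of retrieve-then-rerank with a cross-encoder.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(query: str, first_stage_search, candidates: int = 50, top_k: int = 5):
    docs = first_stage_search(query, candidates)                # broad, cheap retrieval
    scores = reranker.predict([(query, doc) for doc in docs])   # precise, per-pair scoring
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```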
When creating a RAG system for a software documentation portal, we were retrieving about 50 relevant document snippets in the first retrieval step. These documents were usually presented raw to the LLM. However, due to the complexity of the documentation, the LLM sometimes struggled to find the most relevant information. To solve this, we started using a cross-encoder-based re-ranking service like cohere-rerank. We retrieved the first 50 documents, sent them to the re-ranker, and selected the top 5. This process significantly reduced the amount of information presented to the LLM while increasing the accuracy of the responses. The rate of incorrect or incomplete responses dropped from 30% to 10%. The cost of using the re-ranking service was about $0.003 per query. This was in addition to the initial retrieval cost, but the quality improvement justified it.
The Cost-Benefit Balance: When to Stop?
There are countless techniques to improve retrieval quality, and each brings an additional cost. At this point, the critical question is: "When should we consider it enough?" The answer usually depends on our business goals and tolerance levels.
If our RAG system is a customer service bot and user satisfaction is over 95%, it might not be necessary to make large investments for further improvement. However, if the system is still providing incorrect information 15% of the time and this is leading to customer complaints, more aggressive improvements should be considered. Another example could be a financial reporting RAG. Here, accuracy over 99% is critical, whereas 90% accuracy might be acceptable for a customer service bot.
It is useful to follow these steps when conducting a cost-benefit analysis:
- Measure the Current State: Determine metrics that indicate retrieval quality (e.g., Precision@k, Recall@k, MRR - Mean Reciprocal Rank) and measure the current system against them (see the short sketch after this list).
- Set Business Goals: What is the acceptable error rate or accuracy level your system needs to reach? Determine how these improvements will contribute to your business goals (e.g., increased customer satisfaction, operational efficiency).
- Research Technical Solutions and Their Costs: Estimate the hardware, software, API, and operational costs that each of the techniques mentioned above (or similar ones) will bring.
- Conduct Experiments: Try 1-2 of the most promising improvements on a small scale to measure real-world performance and cost. A/B tests are very valuable at this stage.
- Make a Decision: Based on the results, evaluate whether the cost increase provides the expected benefit and decide whether to continue with the improvements.
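To make the first step concrete, here is a small sketch of two of the metrics mentioned above, Precision@k and MRR, computed from ranked result lists and human relevance judgments.

```python
# Sketch of two common retrieval metrics.
# `retrieved` is the ranked list of document ids returned for a query;
# `relevant` is the set of ids judged relevant for that query.
def precision_at_k(retrieved: list, relevant: set, k: int) -> float:
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k

def mean_reciprocal_rank(all_retrieved: list[list], all_relevant: list[set]) -> float:
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(all_retrieved)

# Example: the first relevant document appears at rank 2 for a single query.
print(precision_at_k(["d3", "d1", "d7"], {"d1"}, k=3))        # ~0.33
print(mean_reciprocal_rank([["d3", "d1", "d7"]], [{"d1"}]))   # 0.5
```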
At one point, I was working on a RAG system for an "Anonymous Turkey Data Platform" I had developed. My goal was to make complex public data understandable. Initially, retrieval accuracy was around 60%. By improving query expansion and the embedding model, I raised it to 85%. Later, I considered adding re-ranking, but this would have added cost and complexity. The 85% rate was sufficient to serve the platform's purpose, and instead of investing more, I focused on improving the user interface and expanding the dataset. This shows that sometimes the best improvement is not more technical optimization, but accepting the current state and turning to other areas.
💡 Importance of Metrics
Metrics used to measure retrieval quality help you evaluate the system's performance objectively. However, ensure these metrics align with your business goals. For example, high recall might be important, but it is not enough on its own if the retrieved documents are irrelevant. Metrics like Precision and MRR should also be considered.
In conclusion, improving retrieval quality in RAG systems is a continuous optimization process that requires careful cost-benefit analysis. Advanced databases, smart data processing, correct embedding models, and effective query strategies play a key role in increasing this quality. However, as always, the most complex solution may not be the best solution. The essential thing is to find the balance that provides the maximum benefit with the minimum effort and cost required to reach our business goals.