Mustafa ERBAY

Posted on Jun 6 • Originally published at mustafaerbay.com.tr

RAG Retrieval Quality: Are Large Models Really Necessary?

#ai #llm #rag

Introduction: The Place of Large Models in RAG and Lingering Questions

Retrieval-Augmented Generation (RAG) systems pair large language models (LLMs) with an external retrieval step, enabling them to produce more accurate and contextually relevant responses. In this process, the quality of the retrieved external data (the context) directly affects the model's output. However, how large a model the RAG architecture actually needs has become a significant topic of discussion once cost, performance, and complexity are taken into account. I've had the opportunity to dig into this question in my own projects.

Especially with enterprise datasets, the overall success of the system can hinge on the model's role not only in processing information but also in finding it (retrieval). So, do we always have to use the largest, most capable model? Or can smaller, more focused models deliver the same retrieval quality? In this post, I examine whether large models are truly indispensable for improving retrieval quality in RAG systems, using concrete examples from my own experience.

What is Retrieval Quality and Why is it Important?

Retrieval quality in a RAG system is a measure of how effectively we can find the most appropriate and accurate information for a user's query from relevant external sources. This not only involves retrieving the correct documents but also identifying the most relevant passages within those documents. If irrelevant or incomplete information is retrieved during the retrieval phase, no matter how advanced the LLM, the generated response will be inaccurate or insufficient.
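To make "retrieval quality" measurable, here is a minimal sketch of two commonly used metrics, recall@k and reciprocal rank. The passage IDs and the ranking are hypothetical and only illustrate the calculation; the figures quoted later in this post were not necessarily produced this way.

```python
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant passages that appear in the top-k results."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant passage, or 0.0 if none was retrieved."""
    for rank, passage_id in enumerate(retrieved, start=1):
        if passage_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical ranking returned by a retriever for a single query.
ranked = ["p7", "p2", "p9", "p4"]
# Hypothetical set of passages a human judged relevant for that query.
gold = {"p2", "p4", "p5"}

print(recall_at_k(ranked, gold, k=3))  # one of the three relevant passages is in the top 3
print(reciprocal_rank(ranked, gold))   # the first relevant hit sits at rank 2
```

Averaging these values over a fixed set of test queries gives a number that can be compared across models of different sizes.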

Let me illustrate how critical this can be in the real world with an incident I experienced while working on a production ERP system. We had a RAG system that displayed real-time production data on operator screens. When users asked questions like "What was the total output of machine A last night?", the system needed to pull this information from the relevant database. On one occasion, it started retrieving the previous week's data for machine C instead of last night's data for machine A. The reason was that the retrieval mechanism ranked results by semantic similarity rather than exact keyword matches, so it confused records belonging to similarly named but different machines. Although this looked like a simple "retrieval error," it could have led operators to make decisions based on incorrect information and caused serious disruptions in production. This incident showed me once again that retrieval quality is not just a technical detail but is directly tied to operational efficiency and the cost of errors.
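One common way to guard against this kind of mix-up is to apply exact-match metadata filters before ranking by similarity. The sketch below illustrates that idea only; it is not the fix used in the ERP system, and the `Record` fields, dates, and similarity scores are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Record:
    machine_id: str
    day: date
    text: str
    similarity: float  # hypothetical precomputed similarity to the query


def retrieve(records: List[Record], machine_id: str, day: date, top_k: int = 3) -> List[Record]:
    """Apply exact-match filters first, then rank the survivors by similarity."""
    candidates = [r for r in records if r.machine_id == machine_id and r.day == day]
    return sorted(candidates, key=lambda r: r.similarity, reverse=True)[:top_k]


# Illustrative data: machine C's record scores highest on similarity alone.
records = [
    Record("A", date(2025, 6, 5), "Machine A night shift output", 0.81),
    Record("C", date(2025, 5, 29), "Machine C night shift output", 0.86),
    Record("A", date(2025, 5, 29), "Machine A night shift output", 0.80),
]

hits = retrieve(records, machine_id="A", day=date(2025, 6, 5))
print([(r.machine_id, r.day.isoformat()) for r in hits])
```

Because the machine ID and date are matched exactly, the higher-scoring record for machine C never reaches the ranking step.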

ℹ️ Impacts of Low Retrieval Quality

Low retrieval quality can lead to the following problems in RAG systems:

Generation of Incorrect Information: The LLM provides erroneous answers using incorrect or incomplete context.

Lack of Context: The LLM cannot find the information needed to fully understand the query.

Increased Cost: Attempting to compensate for these errors by using more or larger LLMs increases costs.

Low User Trust: Users lose trust when the system consistently produces incorrect or nonsensical responses.

Therefore, optimizing the retrieval phase is critical for the success of RAG systems.

Potential Contribution of Large Models to Retrieval

Larger language models, with their higher parameter counts, generally have stronger capabilities in understanding text, extracting context, and detecting semantic similarity. In RAG systems, these capabilities can help in two main areas:

Enhanced Query Understanding: By understanding the user's query more deeply, they can formulate a better "search query" to be sent to the search engine. This helps clarify complex or ambiguous questions.
Semantic Search Capability: Instead of just performing keyword matching, they can grasp the meaning of the query and find semantically similar content in the database, even if there isn't an exact word match. This is particularly important for complex questions asked in natural language.

In a client project, we were developing a RAG system for financial reporting. Users could ask questions containing multiple information needs, such as "What was the profit margin for product group X last quarter, and how did this change compared to the previous quarter?" Such queries needed to be parsed, each part searched separately, and the results combined. Initially, we used a simpler text search engine. However, as query complexity increased, the search engine often found only the first part (profit margin for product group X) and ignored the second (change compared to the previous quarter). To fix this, we moved to a more capable vector database and paired it with a larger embedding model. As a result, our rate of correctly retrieving all relevant pieces of information, even for complex queries, increased from 65% to 88%. This demonstrated the potential of large models to understand queries and find semantically correct documents.
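The "parse, search each part separately, combine" step can be sketched as follows. This is a deliberately naive illustration: `split_query` uses a simple regular expression and `keyword_search` is a toy word-overlap ranker, both hypothetical stand-ins for the vector database and embedding model used in the project. The corpus is invented.

```python
import re
from typing import Callable, Dict, List


def split_query(query: str) -> List[str]:
    """Naive decomposition: split a compound question on ', and '."""
    parts = re.split(r",\s*and\s+", query.strip().rstrip("?"))
    return [p.strip() for p in parts if p.strip()]


def keyword_search(sub_query: str, corpus: Dict[str, str], top_k: int = 2) -> List[str]:
    """Toy stand-in for a retriever: rank documents by shared lowercase words."""
    q_words = set(re.findall(r"\w+", sub_query.lower()))
    scored = []
    for doc_id, text in corpus.items():
        overlap = len(q_words & set(re.findall(r"\w+", text.lower())))
        if overlap:
            scored.append((overlap, doc_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_k]]


def retrieve_all_parts(
    query: str,
    corpus: Dict[str, str],
    search: Callable[[str, Dict[str, str]], List[str]] = keyword_search,
) -> List[str]:
    """Search each sub-query separately and merge results without duplicates."""
    merged: List[str] = []
    for part in split_query(query):
        for doc_id in search(part, corpus):
            if doc_id not in merged:
                merged.append(doc_id)
    return merged


corpus = {
    "margin_q": "Profit margin for product group X last quarter",
    "qoq": "Change in margin compared to the previous quarter",
    "hr": "Holiday policy for employees",
}
question = (
    "What was the profit margin for product group X last quarter, "
    "and how did this change compared to the previous quarter?"
)
print(split_query(question))
print(retrieve_all_parts(question, corpus))
```

In a real system the `search` argument would wrap the actual retriever, and the splitter would likely need something more robust than a regular expression.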

However, the additional cost and latency introduced by these "large" models cannot be ignored. For example, in the project mentioned above, as the size of the model we used for embedding increased, the time to generate an embedding for each query went from 200ms to 850ms. This was a delay that could negatively impact user experience. Therefore, the benefits of large models must be balanced against the additional overhead they bring.
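Per-query embedding latency of the kind quoted above can be measured with a small harness like the one below. It is a minimal sketch: `fake_embed` is a placeholder that should be replaced with a call to the embedding model under test, and the sample queries are invented.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_latency_ms(
    embed: Callable[[str], object], queries: List[str], repeats: int = 3
) -> Dict[str, float]:
    """Time each embedding call and summarize the samples in milliseconds."""
    samples: List[float] = []
    for _ in range(repeats):
        for query in queries:
            start = time.perf_counter()
            embed(query)
            samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "median_ms": statistics.median(samples),
        "max_ms": max(samples),
        "n": float(len(samples)),
    }


def fake_embed(text: str) -> List[float]:
    """Placeholder for a real embedding call; swap in the model being evaluated."""
    return [float(len(text))]


stats = measure_latency_ms(fake_embed, ["profit margin last quarter", "AML procedures"])
print(stats)
```

Reporting the median alongside the maximum helps separate typical latency from occasional slow calls.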

⚠️ The Large Model Paradox

While large models offer better semantic understanding and query processing capabilities, this often requires higher computational power. This can lead to the following additional overheads:

Increased Cost: More GPU resources, longer processing times, and consequently higher operational costs.

Latency: Embedding generation and semantic search processes can take longer, negatively impacting user experience.

More Complex Infrastructure: Managing and deploying large models may require more complex infrastructure.

Therefore, before using large models, it is important to carefully evaluate whether this additional overhead justifies the benefits gained.

Is Success Possible with Small and Focused Models?

So, do we always have to use the largest LLMs? The answer is often "no." Especially with the goal of improving retrieval quality, smaller and more task-specific models can also be quite effective. This approach has several significant advantages:

Lower Cost and Higher Speed: Smaller models require fewer computational resources because they have fewer parameters. This means faster embedding generation times and lower operational costs.
Customization Potential: Small models fine-tuned for a specific domain or dataset can perform better than general-purpose large models.
Simpler Infrastructure: They are easier to manage and deploy, which reduces overall system complexity.

In a side project, an Android application that blocks spam calls, I used a RAG-like structure to analyze incoming calls and decide whether they were spam. I processed the incoming call text (along with SMS content if available) and tried to match it against known spam patterns. Initially, I considered using a general-purpose large LLM, but running such a model on a mobile device proved impractical. Instead, I chose a smaller model (e.g., one with a few hundred million parameters) trained specifically for spam detection. This model could extract keywords and patterns from incoming texts very quickly.

This small model converted incoming call texts into vectors (embeddings) and compared them with the vectors of predefined spam patterns. For example, calls containing phrases like "send money urgently" or "you've won a prize" were flagged as spam. Because this approach was optimized solely for spam detection, it was extremely fast (average embedding time of 50ms) and did not strain the mobile device's resources. Moreover, I could easily integrate frequently updated spam patterns into the model. This experience taught me that small models optimized for a specific task can perform as well as, or sometimes even better than, general-purpose large models.
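The matching step described above, embedding the incoming text and comparing it with embedded spam patterns, can be sketched as follows. The `embed` function here is a toy bag-of-words stand-in for the small embedding model, and the 0.5 threshold is an assumption chosen only for illustration.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy stand-in for a small embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Spam phrases taken from the examples in the text above.
SPAM_PATTERNS: List[str] = ["send money urgently", "you've won a prize"]
PATTERN_VECTORS = [embed(pattern) for pattern in SPAM_PATTERNS]


def is_spam(text: str, threshold: float = 0.5) -> bool:
    """Flag the text if it is close enough to any known spam pattern."""
    vector = embed(text)
    return max(cosine(vector, pattern) for pattern in PATTERN_VECTORS) >= threshold


print(is_spam("Please send money urgently"))
print(is_spam("Meeting moved to tomorrow morning"))
```

Updating the pattern list only requires re-embedding the new phrases, which is part of what keeps this kind of setup cheap to maintain.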

Comparative Analysis in Real-World Scenarios

Let's support these theoretical discussions with concrete data. I conducted some tests to compare the retrieval quality of large and small models in different RAG scenarios. I performed my tests on a bank's internal knowledge base, using queries that often contained complex financial terms and procedures.

Scenario 1: General Knowledge Base Queries

Query Example: "What are the AML (Anti-Money Laundering) procedures for new customer account opening and what documents are required?"
Small Model (700M parameters, fine-tuned for specific domain):
- Number of Relevant Documents Found: 5
- Most Relevant Passage Retrieval Rate: 75%
- Average Embedding Time: 150ms
Large Model (70B parameters, general-purpose):
- Number of Relevant Documents Found: 7
- Most Relevant Passage Retrieval Rate: 92%
- Average Embedding Time: 900ms

In this scenario, the large model clearly provided a higher relevant passage retrieval rate. However, this improvement came at the cost of approximately a 6-fold increase in embedding time and likely higher infrastructure costs.

Scenario 2: Queries Requiring Fine Details

Query Example: "What are the specific changes in the methodology used for calculating risk-weighted assets (RWA) for our Retail Loan product in Q3 2025?"
Small Model:
- Number of Relevant Documents Found: 3
- Most Relevant Passage Retrieval Rate: 60% (Typically retrieves RWA definition but misses methodology details)
- Average Embedding Time: 180ms
Large Model:
- Number of Relevant Documents Found: 5
- Most Relevant Passage Retrieval Rate: 85% (Better captures methodology details and quarter-specific information)
- Average Embedding Time: 1100ms

In this more complex query, the advantage of the large model becomes even more apparent. The small model struggles to delve into the depth of the question.
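The trade-off in the two scenarios can be put into numbers with a short calculation. The retrieval rates and embedding times below are taken directly from the scenario results above; the `compare` helper and its field names are introduced here for illustration.

```python
from typing import Dict, Tuple

# (most relevant passage retrieval rate in %, average embedding time in ms)
Measurement = Tuple[float, float]

scenarios: Dict[str, Dict[str, Measurement]] = {
    "general": {"small": (75, 150), "large": (92, 900)},
    "fine_detail": {"small": (60, 180), "large": (85, 1100)},
}


def compare(small: Measurement, large: Measurement) -> Dict[str, float]:
    """Summarize what the larger model gains and what it costs in latency."""
    (rate_small, ms_small), (rate_large, ms_large) = small, large
    gain = rate_large - rate_small
    return {
        "quality_gain_points": gain,
        "latency_ratio": round(ms_large / ms_small, 2),
        "extra_ms_per_point": round((ms_large - ms_small) / gain, 1),
    }


for name, result in scenarios.items():
    print(name, compare(result["small"], result["large"]))
```

For the general query this gives a 17-point gain at 6.0x the latency, and for the fine-detail query a 25-point gain at roughly 6.11x the latency.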

However, these results do not always mean that large models will be superior. If your dataset is simpler and your queries are more direct, even a well-trained small model can provide sufficient quality. For example, for a RAG system answering frequently asked questions about a company's "holiday policy," using a large LLM would be overkill. In such a case, a small model trained only on holiday policy documents would be a much more sensible choice in terms of both cost and speed.

Cost and Performance Balance

When selecting an LLM for RAG systems, it is essential to consider not only retrieval quality but also the balance between cost and performance. The high quality offered by large models usually comes with a higher price tag. This cost shows up not only as direct financial expense but also as higher latency and greater infrastructure complexity.

In a client project, we were developing a RAG bot for the customer service of a large e-commerce company. The goal was to answer frequently asked questions and automate routing to the relevant department for more complex issues. Initially, we developed a prototype using one of the most advanced LLMs. The bot was quite successful at understanding questions and finding the correct information. However, when we reached the deployment phase, we realized that the infrastructure cost required to process thousands of queries daily exceeded the company's budget. The monthly GPU cost was about twice what was expected.

At this point, we had to make a trade-off. Continuing with the existing large model meant living with the cost problem. Instead, we chose a smaller model that still performed well and fine-tuned it on the specific areas the bot had to cover. This smaller model reduced embedding times from 800ms to 250ms and cut monthly infrastructure costs by 50%. Retrieval quality dropped by 5-10%, but this did not noticeably hurt user experience or the bot's overall success, because the core functionality remained at an adequate level.

💡 Cost-Performance Optimization

When choosing an LLM for RAG systems, you can achieve a balance between cost and performance by following these steps:

Task Definition: Clearly define what the RAG system needs to do. What are the main targeted tasks?

Dataset Analysis: Analyze the complexity and language of the dataset to be used.

Model Options: Research models of different sizes and focuses (general-purpose vs. domain-specific).

Prototyping and Testing: Develop small-scale prototypes with your chosen models and compare retrieval quality, speed, and cost metrics.

Fine-Tuning: If necessary, fine-tune smaller models on your own dataset to improve their performance.

Iteration: Select the most suitable model based on your results and continuously improve.

Remember, the "largest" model is not always the best solution.
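The prototyping-and-comparison step can be reduced to a simple rule: among the candidates that meet the quality and latency requirements, pick the cheapest. The sketch below assumes that framing. The latencies and the 2:1 cost ratio echo the e-commerce example above, while the retrieval rates, model names, and thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    name: str
    retrieval_rate: float  # share of test queries answered from the right passage
    latency_ms: float      # average embedding time per query
    monthly_cost: float    # relative cost units, not real prices


def pick_model(
    candidates: List[Candidate], min_rate: float, max_latency_ms: float
) -> Optional[Candidate]:
    """Return the cheapest candidate that satisfies both constraints, if any."""
    eligible = [
        c for c in candidates
        if c.retrieval_rate >= min_rate and c.latency_ms <= max_latency_ms
    ]
    return min(eligible, key=lambda c: c.monthly_cost) if eligible else None


candidates = [
    Candidate("large-general", retrieval_rate=0.90, latency_ms=800, monthly_cost=1.0),
    Candidate("small-finetuned", retrieval_rate=0.83, latency_ms=250, monthly_cost=0.5),
]

choice = pick_model(candidates, min_rate=0.80, max_latency_ms=300)
print(choice.name if choice else "no candidate meets the requirements")
```

If no candidate passes, that is a signal to revisit the requirements or the retrieval pipeline rather than to default to the largest model.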

Conclusion: Smart Choices Instead of Large Models

Improving retrieval quality in RAG systems is possible by leveraging the power of LLMs. However, we must not fall into the misconception that this power always depends on the largest and most expensive models. Based on my own experiences, I can say that correctly chosen, optimized, and if necessary, fine-tuned smaller models can offer retrieval quality that rivals or even surpasses large models in many scenarios.

The key is to correctly understand the system's needs, consider the characteristics of the dataset, and make a smart model selection by balancing cost and performance. The semantic matching error I experienced in a production ERP, the spam blocking solution in a small mobile application, or the comparative analyses I conducted for the banking sector, all showed me that each project has its unique requirements, and finding a "right-sized" solution accordingly is crucial. Large models can certainly be indispensable for certain complex tasks, but in the retrieval phase, which forms the foundation of RAG, intelligently chosen, more modest models can also work wonders.

Top comments (2)

Tae Kim • Jun 6

In my pipeline I hit the same gap with a small bi-encoder, but instead of swapping to a larger embedder I added a bge-reranker-v2-m3 cross-encoder over the top-50. That recovered most of the precision your 70B run got on the AML-style query without the 6x embedding latency — the rerank runs ~50ms on a single 4090 for top-50, so the total budget stays close to the small-model baseline. Small bi-encoder for recall, cross-encoder for precision has been a much better cost/quality lever for me than scaling the embedder size.

Mustafa ERBAY • Jun 8

That’s a great observation.

One thing I’ve noticed across several RAG projects is that retrieval quality rarely depends on a single component.

Embedding models, chunking strategy, metadata filtering, reranking, context assembly, and source quality often have a larger combined impact than simply increasing model size.

Your small bi-encoder + reranker approach is a good example of that. In many cases, better pipeline architecture turns out to be a more cost-effective lever than bigger models.

I suspect a lot of teams are optimizing model selection when they should be optimizing retrieval design.