DEV Community

Mustafa ERBAY
Mustafa ERBAY

Posted on • Originally published at mustafaerbay.com.tr

RAG Retrieval Quality: Are Large Language Models Always Necessary?

The biggest mistake I see when building RAG (Retrieval-Augmented Generation) architectures is throwing the largest, most expensive language model (LLM) and massive vector databases at every single problem. Last year, on a supply chain search engine for a manufacturing ERP I was working on, we initially designed a pipeline that converted everything into vectors and sent them to a cloud LLM. The result? A 4.5-second latency per user and hundreds of dollars in monthly API bills. Then I paused and simplified the system; I squeezed the LLM only into the final synthesis stage and solved the retrieval stage entirely with local, lightweight tools. The search latency dropped to 85 milliseconds, and the cost was reduced to almost zero.

In this post, I will share my own experiences on how to improve retrieval quality in RAG infrastructures while reducing dependency on large language models, hybrid search techniques, and how to get this done on a budget in production.

The Vector Database and Dense Retrieval Craze

Everywhere I look in the industry, everyone is in a race to set up a vector database. It is widely assumed that taking 1536-dimensional vectors with models like OpenAI's text-embedding-3-small and searching them with Cosine Similarity is a cure-all. In reality, this is not the case at all. While this method, which we call dense retrieval, is highly successful at capturing semantic relationships, it completely fails at specific keywords, part codes, or serial numbers.

For example, in a client project, when a user searched for the part code "SKU-4912-A", the vector search engine would return "SKU-4912-B", which had a similar description. This is because, in the vector space, the semantic difference between these two texts is negligible. In the table below, you can see a resource and accuracy comparison of dense retrieval versus traditional methods using metrics from my own test environment:

Search Type Average Latency (ms) RAM Consumption (1M Rows) SKU/Code Accuracy (%) Semantic Success (%)
Dense Vector (HNSW) 42 ms ~12 GB 45% 92%
Sparse (BM25 / FTS) 4 ms ~800 MB 99% 35%
Hybrid (Combined with RRF) 18 ms ~2.5 GB 98% 90%

As you can see, relying solely on vector search leaves us with high RAM costs in production and incredibly increases our margin of error on technical data. I experienced a similar trade-off situation before in my own side project's infrastructure while indexing large data on a local server; if you don't optimize memory, that server will eventually crash.

⚠️ Memory Savings

When using pgvector on PostgreSQL 16, the RAM consumption of HNSW indexes can reach up to 1.5 times the data size. If you are running on a budget-constrained VPS, you must choose the m and ef_construction parameters carefully when creating indexes.

The Power of BM25 and Classic Inverted Index

If you have technical documents, ERP data, or structured logs, your first resort should be classic inverted index-based search engines. The tsvector and tsquery features introduced in PostgreSQL 14+ do the job perfectly without the hassle of setting up and managing an external Elasticsearch cluster. The BM25 algorithm calculates the frequency of a word within a document (TF) and its rarity across all documents (IDF) to produce a mathematically highly consistent score.

In the SQL example below, you can see how I index and search product names and descriptions in a parts table. Here, I used GIN (Generalized Inverted Index) to reduce the query time to microseconds:

-- Arama vektörlerini saklayacağımız generate edilmiş bir kolon ekliyoruz
ALTER TABLE parts_catalog 
ADD COLUMN search_vector tsvector 
GENERATED ALWAYS AS (
  to_tsvector('english', coalesce(part_name, '') || ' ' || coalesce(description, ''))
) STORED;

-- GIN indeksi oluşturuyoruz
CREATE INDEX idx_parts_search_gin ON parts_catalog USING gin(search_vector);

-- Arama sorgusu ve execution plan analizi
EXPLAIN ANALYZE
SELECT id, part_name, ts_rank(search_vector, query) as rank
FROM parts_catalog, to_tsquery('english', 'SKU-4912-A | copper') query
WHERE search_vector @@ query
ORDER BY rank DESC
LIMIT 5;
Enter fullscreen mode Exit fullscreen mode

Looking at the EXPLAIN ANALYZE output of this query, I saw that on a table with 500,000 rows, it returned the result in 1.8 milliseconds using only a Bitmap Index Scan without reading from disk. If we had tried to perform the same search in a vector database, it would have been slower in terms of milliseconds and would have surfaced irrelevant parts that just happened to contain the word "copper".

Hybrid Search Architecture and Decision Matrix

So, what do we do if we don't want to miss the semantic meaning but also want to maintain keyword precision? The solution: Hybrid Search. This is where the Reciprocal Rank Fusion (RRF) algorithm comes into play. RRF takes the rankings from two different search engines and re-scores them using a common formula. There is absolutely no need for expensive orchestration libraries; you can handle this in your own backend service with just a few lines of code.

Below, you can see a simplified version of the function I wrote using FastAPI and PostgreSQL, which combines both vector search (pgvector) and Full-Text Search (FTS) results using RRF:

from typing import List, Dict

def reciprocal_rank_fusion(vector_results: List[Dict], fts_results: List[Dict], k: int = 60) -> List[Dict]:
    """
    RRF algoritması ile iki farklı sonuç kümesini birleştirir.
    k: Sabit penalizasyon parametresi (genelde 60 seçilir)
    """
    scores = {}

    # Vektör sonuçlarını işle
    for rank, doc in enumerate(vector_results):
        doc_id = doc["id"]
        scores[doc_id] = scores.get(doc_id, 0.0) + (1.0 / (k + rank + 1))

    # FTS sonuçlarını işle
    for rank, doc in enumerate(fts_results):
        doc_id = doc["id"]
        # Eğer döküman her iki aramada da varsa skoru katlanır
        scores[doc_id] = scores.get(doc_id, 0.0) + (1.0 / (k + rank + 1))

    # Skorlara göre sırala
    sorted_docs = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return sorted_docs
Enter fullscreen mode Exit fullscreen mode

The best part of this architecture is that it brings together the best features of both worlds. When a user types "red actuator valve", FTS matches the words "actuator" and "valve" exactly, while vector search captures documents with colors similar to "red". RRF then combines these two signals and brings the most accurate document to the top.

Using Small Models (Cross-Encoders) Instead of Large Language Models

Triggering LLMs with billions of parameters (such as GPT-4 or Claude) on every query to improve retrieval quality is both financial suicide and drives network overhead through the roof. Instead, it is much more logical to quickly pull 50 documents in the first stage (retrieval) and run them through a locally running Cross-Encoder (Re-ranker) model.

While Bi-Encoder models (classic embedding models) vectorize the query and the document separately, Cross-Encoder models feed the query and the document into the processor simultaneously and analyze the relationship between them much more deeply. A model like BAAI/bge-reranker-base, which is only 278 MB in size, can work wonders even on a CPU.

from sentence_transformers import CrossEncoder

# Küçük ve efektif bir re-ranker modeli yüklüyoruz
model = CrossEncoder("BAAI/bge-reranker-base", max_length=512)

query = "How to adjust the pressure limit on model X12?"
documents = [
    "To adjust the pressure limit on model X12, turn the valve clockwise.",
    "Model X12 is a hydraulic actuator designed for high pressure systems.",
    "Standard pressure limits for older models are set at 150 PSI."
]

# Modeli çalıştırıp skorları alıyoruz
pairs = [[query, doc] for doc in documents]
scores = model.predict(pairs)

# Sonuçları ekrana yazdırıyoruz
for doc, score in zip(documents, scores):
    print(f"Score: {score:.4f} | Doc: {doc}")
Enter fullscreen mode Exit fullscreen mode

When I ran this Python script on a standard VPS CPU, the inference time for 3 documents took only 14 milliseconds. Thanks to this process, we reduce the context we send to the LLM from 50 documents to the 3 most relevant ones. This reduces our LLM token cost by 90% and eliminates the LLM's "lost in the middle" problem.

Memory and CPU Management: Resource Limiting in Production

The most common problem I encounter when running local models (re-ranker or small embedding models) on production servers is that PyTorch or ONNX runtime libraries try to consume all the CPU cores on the server. If you don't take precautions, the system load average spikes instantly when a request comes in, and your PostgreSQL or Nginx services on the same server become unresponsive.

To prevent this problem, it is essential to define resource limits at the Docker Compose or SystemD level. I usually run my Python services limited inside Docker:

version: '3.8'

services:
  reranker-service:
    image: python-reranker:latest
    environment:
      - OMP_NUM_THREADS=2
      - MKL_NUM_THREADS=2
    deploy:
      resources:
        limits:
          cpus: '2.00'
          memory: 2G
        reservations:
          memory: 512M
    restart: always
Enter fullscreen mode Exit fullscreen mode

The most critical setting here is the OMP_NUM_THREADS=2 environment variable. By default, PyTorch tends to use all cores on the server (for example, all of them on a 32-core bare-metal server). Limiting this to 2 or 4 ensures a stable throughput without choking the processor during parallel requests.

ℹ️ SystemD Limits

If you are not using Docker, you can set up a similar protection layer at the operating system level by writing MemoryHigh=1.8G and CPUQuota=200% under the [Service] block of your SystemD unit file.

Real-World Scenario: ERP Inventory and Supply Chain RAG Infrastructure

Let me show you what kind of beast emerges when we bring all of this together with a concrete example. In a manufacturing ERP, there were historical maintenance-repair documents and failure records consisting of 150,000 rows. When maintenance technicians spotted a failure in the field, they would type the failure description on their tablets and query past solution suggestions using RAG.

In the first system we built, the flow was as follows:

  1. User query -> OpenAI Embedding API -> Pinecone Vector DB -> GPT-4 API -> Result.
  2. The Problem: The system stopped when the internet was disconnected or when API limits were hit. The average response time was 4.8 seconds.

The local hybrid flow I redesigned:

  1. User query -> Postgres FTS + pgvector (local) -> Top 30 candidates.
  2. Top 30 candidates -> BGE-Reranker-Base running in local Docker -> Top 3 candidates.
  3. Top 3 candidates -> Lightweight LLM running on the server (or Gemini Flash API as a fallback).
  4. The Result: Internet dependency was minimized. The average response time dropped to 112 milliseconds.
# Log çıktısından bir kesit (Yerel hibrit sistemin çalışma süreleri)
2026-06-07 14:22:01,102 - INFO - Received query: "valve pressure drop during startup"
2026-06-07 14:22:01,118 - INFO - Postgres FTS & Vector Search completed in 16ms (30 candidates found)
2026-06-07 14:22:01,152 - INFO - Re-ranking completed in 34ms (Top 3 selected)
2026-06-07 14:22:01,214 - INFO - Local LLM response generated in 62ms
2026-06-07 14:22:01,215 - INFO - Total turnaround time: 112ms
Enter fullscreen mode Exit fullscreen mode

As can be seen, when we remove large language models and cloud services from the center of the system architecture and position them only as a "text synthesizer" in the final stage, the stability and speed of the system are elevated to a whole new level.

In the next step, I will discuss how we structured a similar local optimization process in our own database replication architecture and how we resolved WAL bloat issues.

Top comments (0)