RAG Retrieval Quality: Development and Cost Anatomy in Side Projects

#life #ai #rag #retrieval


Retrieval-Augmented Generation (RAG) systems let large language models (LLMs) draw on external information at query time, leading to more accurate and contextually relevant responses. The heart of these systems is the "retrieval" phase: if the information we retrieve is irrelevant or incomplete, the results will be unsatisfactory no matter how advanced the model is. I've focused on this aspect in my own projects, and today I'll share how to improve RAG retrieval quality, what it costs, and what I've concretely experienced.

In this post, I will delve into strategies for enhancing retrieval quality, their practical applications, and cost analyses. My aim is to guide anyone using or considering using this technology and to provide information about the challenges they might encounter.

1. The Impact of Chunking Strategies on Retrieval

One of the most fundamental steps in RAG systems is dividing large documents into smaller "chunks." The size and content of these chunks directly affect how successful retrieval will be. An incorrect chunking strategy can lead to relevant information being scattered or forming meaningless units. In my experience, strategies that preserve semantic coherence yield better results than fixed-size chunks.

For instance, when we divide a technical document into fixed 500-word segments, a description may be cut off in the middle of a section, or a code snippet may end up incomplete. When a search query retrieves that chunk, the LLM receives a fragment with no usable context. Therefore, I prefer dynamic chunking methods that respect sentence endings, paragraph endings, or section headings. In my personal financial calculator side project, I split documents by section so the system could answer users' complex financial questions. This allowed me to present relevant information more holistically.
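As a rough illustration of boundary-aware chunking, here is a minimal sketch that packs whole paragraphs into chunks under a word budget. The function name `chunk_by_paragraph`, the blank-line paragraph separator, and the `max_words` parameter are assumptions made for this example, not the exact code from my project.

```python
def chunk_by_paragraph(text: str, max_words: int = 500) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_words words.

    A paragraph longer than max_words is kept intact as its own chunk
    rather than being cut mid-sentence.
    """
    # Assumption: paragraphs are separated by a blank line
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_words = 0
    for paragraph in paragraphs:
        n_words = len(paragraph.split())
        # Start a new chunk when adding this paragraph would exceed the budget
        if current and current_words + n_words > max_words:
            chunks.append("\n\n".join(current))
            current, current_words = [], 0
        current.append(paragraph)
        current_words += n_words
    if current:
        chunks.append("\n\n".join(current))
    return chunks


sample = "Intro paragraph here.\n\nSecond paragraph with details.\n\nThird one."
print(chunk_by_paragraph(sample, max_words=6))
# Output: ['Intro paragraph here.', 'Second paragraph with details.\n\nThird one.']
```

The same loop works for section-based splitting if you split on headings instead of blank lines.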

ℹ️ Choosing a Chunking Strategy

Fixed-size chunks can be a simple starting point, but paragraph or section-based splitting strategies are generally more effective for preserving semantic coherence.

2. Embedding Models: Balancing Semantic Similarity and Cost

The embedding models we use to convert our chunks into vectors are another critical component of retrieval quality. Different models have different semantic understanding capabilities and costs. High-performance models are typically more expensive and require more computational power. In my own projects, I've strived to strike a good balance between cost and performance.

In previous projects, I started with smaller and faster models like all-MiniLM-L6-v2. These models could capture sufficient semantic similarity for most general queries. However, especially in situations requiring deeper semantic understanding for technical or niche topics, the performance improvement from more advanced models like text-embedding-ada-002 was noticeable. While API calls to these models might seem more expensive initially, they improved overall system efficiency through fewer false positives and more accurate information retrieval.

💡 Model Selection

Test different embedding models based on your project's needs. Small models may suffice for general-purpose projects, while for situations requiring technical depth, consider more advanced models. Don't forget to factor in costs.
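One way to make "test different models" concrete is to label a handful of queries with the chunks that should come back and compare recall@k across models. The sketch below assumes you already have ranked chunk ids per query from each model; the ids, labels, and function names are hypothetical.

```python
def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for chunk_id in ranked_ids[:k] if chunk_id in relevant_ids)
    return hits / len(relevant_ids)


def mean_recall_at_k(runs: dict[str, list[str]], labels: dict[str, set[str]], k: int) -> float:
    """Average recall@k over every labelled query."""
    scores = [recall_at_k(runs.get(q, []), relevant, k) for q, relevant in labels.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical labels and retrieval output for two candidate models
labels = {"q1": {"c1", "c4"}, "q2": {"c2"}}
small_model_runs = {"q1": ["c1", "c3", "c5"], "q2": ["c7", "c2", "c9"]}
large_model_runs = {"q1": ["c1", "c4", "c5"], "q2": ["c2", "c7", "c9"]}

print(mean_recall_at_k(small_model_runs, labels, k=2))  # 0.75
print(mean_recall_at_k(large_model_runs, labels, k=2))  # 1.0
```

Even 20-30 labelled queries are usually enough to see whether a more expensive model is earning its price on your own data.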

3. Query Transformation Techniques

The user's query often needs processing before it is run against the vector database. Query transformation techniques improve retrieval quality by optimizing the original query or adding context to it. They include query expansion, query simplification, and query rewriting.

In one of my side projects, I noticed users asking questions using complex financial terms. For example, instead of "How is depreciation expense calculated?", they might ask "How is depreciation calculated for passenger cars after VAT deduction?". In such cases, a simple keyword match was not enough; it was more effective to break the query down into key terms and run a broader search on those terms. Sometimes I also generated new queries with potentially relevant keywords and searched with them alongside the original. I tried this to see whether I could retrieve a bit more information, and it improved relevance by around 10-15%.

# Example of query expansion (simple concept)
def expand_query(query: str) -> list[str]:
    # Match lowercase substrings so inflected forms ("calculated", "calculation") still hit
    text = query.lower()
    expanded_queries = [query]
    if "depreciation" in text and "calculat" in text:
        expanded_queries.append("vehicle depreciation calculation VAT")
    if "vat" in text and "deduction" in text:
        expanded_queries.append("VAT base calculation")
    return expanded_queries

# Usage
user_query = "How is depreciation calculated for passenger cars after VAT deduction?"
potential_queries = expand_query(user_query)
print(potential_queries)
# Output: ['How is depreciation calculated for passenger cars after VAT deduction?', 'vehicle depreciation calculation VAT', 'VAT base calculation']
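Once several queries have been run, their result lists need to be merged into one ranking. A common way to do this is reciprocal rank fusion (RRF); I'm naming it here as one reasonable option, not as the method used in the project above. The chunk ids are hypothetical, and `k=60` is the commonly used default constant.

```python
from collections import defaultdict


def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of chunk ids; ids ranked high in several lists rise to the top."""
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, chunk_id in enumerate(results, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties alphabetically for stable output
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


# Hypothetical results for the original query and two expanded queries
original = ["c1", "c2", "c3"]
expanded_a = ["c2", "c4"]
expanded_b = ["c2", "c1"]
print(reciprocal_rank_fusion([original, expanded_a, expanded_b]))
# Output: ['c2', 'c1', 'c4', 'c3']
```

Because RRF only uses ranks, it does not matter that similarity scores from different queries are not directly comparable.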

4. Re-ranking Mechanisms: Selecting the Best Results

The retrieval phase usually returns many document chunks, not all of them equally relevant. Adding a "re-ranking" layer that selects the best few chunks to present to the LLM can further improve retrieval quality. This layer re-evaluates the initial results with a more advanced model and prioritizes the most relevant ones.

In my own projects, I used a fast vector search engine (e.g., FAISS or ChromaDB) for initial retrieval and a more sophisticated model for re-ranking. In a client project, after retrieving 50 document chunks, I re-scored them using a "Cross-Encoder" model. This model evaluates the query and document pair together, measuring semantic alignment more precisely than comparing two separately computed embeddings. This approach let us pass only the top 5 of the 50 retrieved chunks to the LLM, increasing response accuracy by approximately 20%. Sending fewer chunks also kept us within the LLM's token limit and led to more consistent results.
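The two-stage shape described above can be sketched with a pluggable scoring function. To keep the example dependency-free, `token_overlap` below is a toy stand-in I made up for illustration; in a real setup a cross-encoder's (query, chunk) score would take its place. The function names and sample texts are hypothetical.

```python
from typing import Callable


def rerank(query: str, candidates: list[str],
           score_fn: Callable[[str, str], float], top_n: int = 5) -> list[str]:
    """Re-score (query, chunk) pairs and keep the top_n highest-scoring chunks."""
    scored = [(score_fn(query, chunk), chunk) for chunk in candidates]
    # list.sort is stable, so ties keep their first-stage order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]


def token_overlap(query: str, chunk: str) -> float:
    """Toy stand-in scorer: share of query words that also occur in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / len(query_words) if query_words else 0.0


# Hypothetical first-stage results from the vector search
candidates = [
    "vehicle tax rates",
    "depreciation of passenger cars",
    "how depreciation is calculated for cars",
]
print(rerank("how is depreciation calculated", candidates, token_overlap, top_n=2))
# Output: ['how depreciation is calculated for cars', 'depreciation of passenger cars']
```

Keeping the scorer as a parameter makes it easy to measure whether the expensive model actually beats a cheap baseline before paying for it.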

⚠️ Re-ranking Cost

Adding a re-ranking layer means additional computational cost and latency. Therefore, it's important to carefully evaluate whether the performance increase from re-ranking is worth the cost.

5. Cost Anatomy: Development and Operational Perspective

The development and operation of RAG systems require a cost-conscious approach, especially for side projects. Items like embedding models, vector database hosting, API calls, and computational power constitute the costs. I've employed various strategies to minimize these costs in my own projects.

For example, instead of using cloud providers' ready-made embedding services, I ran open-source models on my own servers (VPS). While this incurred initial setup and management costs, it saved on API call fees in the long run. I also opted for a lightweight, self-hostable vector database, ChromaDB. This has saved approximately $150-200 per month. It's also important to consider the cost of the LLMs that answer queries over the retrieved context. Choosing models that use fewer tokens and respond faster not only reduces costs but also improves user experience. For instance, experimenting with more affordable and faster options, such as Gemini Flash models or models served through Groq, can provide a significant advantage in this area.
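Whether self-hosting pays off depends on volume, and the break-even point is simple arithmetic. The sketch below uses placeholder prices that I made up purely for illustration; they are not real provider rates and not my actual bills, so substitute your own numbers.

```python
def api_embedding_cost(tokens_per_month: int, usd_per_million_tokens: float) -> float:
    """Monthly API bill for embedding tokens_per_month tokens."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


def break_even_tokens(server_usd_per_month: float, usd_per_million_tokens: float) -> float:
    """Monthly token volume above which a fixed-price server is cheaper than the API."""
    return server_usd_per_month / usd_per_million_tokens * 1_000_000


# Placeholder prices for illustration only
PRICE_PER_MILLION = 0.10   # hypothetical API price in USD
SERVER_PER_MONTH = 20.0    # hypothetical VPS price in USD

monthly_bill = api_embedding_cost(50_000_000, PRICE_PER_MILLION)
threshold = break_even_tokens(SERVER_PER_MONTH, PRICE_PER_MILLION)
print(f"API bill: ${monthly_bill:.2f}, break-even at {threshold:,.0f} tokens/month")
```

This ignores the time spent on setup and maintenance, which for a side project can easily outweigh the server fee.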

# Example of running embeddings on your own server (concept)
# Shell: create a venv and install the necessary libraries
python -m venv rag_env
source rag_env/bin/activate
pip install sentence-transformers faiss-cpu

# Python script: embedding and vector index creation
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# Load the model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Load documents and chunk them (example)
documents = ["This is the first document.", "This is the second document."]
# Naive sentence split, flattened into one list of non-empty strings
chunks = [part.strip() for doc in documents for part in doc.split('.') if part.strip()]

# Generate embeddings (FAISS expects a float32 matrix)
embeddings = np.asarray(model.encode(chunks), dtype="float32")

# Create FAISS index
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)

# Save (can be used later)
faiss.write_index(index, "my_faiss.index")

6. Data Quality and Freshness: A Constant Struggle

Finally, the effectiveness of RAG systems heavily depends on the quality and freshness of the underlying data. A dataset containing incorrect or incomplete information will lead to erroneous results, no matter how good a retrieval mechanism you build. In my own projects, I've placed great importance on data cleaning and update processes.

In my financial calculator side project, tax rates, legal regulations, and market data are constantly updated. It's critical that these updates are accurately reflected in the RAG system. Therefore, I regularly check my data sources and integrate changes quickly. For example, when a new tax law is enacted, the process of updating the relevant documents and re-generating embeddings takes me 2-3 hours. While this is a constant struggle, it's indispensable for the system's reliability. The way to earn users' trust is by always providing them with the most accurate and up-to-date information.
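Part of the update work can be reduced by re-embedding only what changed. A minimal sketch, assuming you store a content hash per document after each embedding run; the document ids, texts, and function names here are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_stale_documents(documents: dict[str, str], stored_hashes: dict[str, str]) -> list[str]:
    """Return ids of documents that are new or whose text changed since the last run."""
    return sorted(
        doc_id for doc_id, text in documents.items()
        if stored_hashes.get(doc_id) != content_hash(text)
    )


# Hypothetical state: hashes saved after the previous embedding run
stored = {
    "rate_table": content_hash("Rate table, version 1"),
    "depreciation_guide": content_hash("Depreciation guide"),
}
current = {
    "rate_table": "Rate table, version 2",       # edited since last run
    "depreciation_guide": "Depreciation guide",  # unchanged
    "new_law": "Text of the new law",            # newly added
}
print(find_stale_documents(current, stored))
# Output: ['new_law', 'rate_table']
```

Only the returned ids need to be re-chunked and re-embedded; documents that were deleted from the source still have to be removed from the index separately.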

🔥 Data Freshness Risk

Outdated or incorrect data can cause the LLM to generate misinformation. This can severely damage users' trust in your system. Regular data updates and validation processes are critically important.

In conclusion, improving RAG retrieval quality does not depend on a single magic formula. It comes from combining many factors: chunking strategies, embedding models, query transformation, and re-ranking mechanisms. Keeping costs in view throughout is vital for sustainability, especially for side projects. My own experience shows that this field requires continuous learning and adaptation.