5 RAG Architecture Mistakes That Kill Production Accuracy (And How to Fix Them)

I've built RAG systems that hit 96.8% retrieval accuracy in production. I've also built ones that started at 40% and needed emergency rewrites. The difference wasn't the LLM — it was the architecture decisions made before any model was chosen.

Here are the five mistakes I see most often when teams take RAG from prototype to production.

1. Treating Chunking as an Afterthought

Most tutorials show you how to split documents into 512-token chunks with 50-token overlap and move on. This works for demos. It fails catastrophically on real business documents.

The problem: A contract clause that spans three paragraphs gets split across two chunks. Neither chunk contains the complete clause. The LLM gets partial context and hallucinates the rest.

What actually works:

Use semantic chunking that respects document structure. For structured documents (contracts, legal filings, compliance reports), chunk by logical section — not by token count. A 2,000-token chunk that contains a complete clause is far more useful than four 500-token chunks that fragment it.

```python
# Bad: fixed-size chunking
chunks = text_splitter.split_text(document, chunk_size=512)

# Better: structure-aware chunking
chunks = split_by_sections(
    document,
    section_markers=["Article", "Section", "Clause"],
    max_chunk_size=2048,
    preserve_hierarchy=True,
)
```

In production I use a tiered approach: heading-aware splitting for structured documents, semantic similarity-based splitting for unstructured text, and table-preserving extraction for documents with embedded data.
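As a sketch, the routing between those tiers can be a simple dispatcher. Here `split_semantic` and `extract_with_tables` are hypothetical stand-ins for the semantic and table-preserving paths; only the heading-aware path is fleshed out:

```python
import re

def chunk_document(text: str, has_tables: bool = False) -> list[str]:
    """Route a document to the right chunking tier. Only the heading-aware
    path is implemented here; split_semantic and extract_with_tables are
    hypothetical stand-ins for the other two tiers."""
    if has_tables:
        return extract_with_tables(text)  # assumed: table-preserving extraction
    heading_re = re.compile(r"^(Article|Section|Clause)\b.*$", re.MULTILINE)
    starts = [m.start() for m in heading_re.finditer(text)]
    if starts:
        if starts[0] != 0:
            starts.insert(0, 0)  # keep any preamble before the first heading
        bounds = starts + [len(text)]
        # Each chunk runs from one heading to the next, so a clause and
        # its body stay together regardless of token count
        return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return split_semantic(text)  # assumed: embedding-similarity splitting
```

The point is that the strategy decision happens per document, not per corpus.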

2. Using Only Vector Search

Pure vector search is great at finding semantically similar content. It's terrible at exact matches.

Ask a vector database "What is the termination clause in contract #2847?" and it might return clauses from contracts #2845 and #2849 because they're semantically similar. The user asked for a specific document. Semantic similarity isn't what they need.

The fix: hybrid search.

Combine vector search (semantic understanding) with keyword search (exact matching). Weight them based on query type:

```python
def hybrid_search(query, documents, vector_weight=0.6, keyword_weight=0.4):
    vector_results = vector_store.similarity_search(query, k=20)
    keyword_results = bm25_search(query, documents, k=20)

    # Pass the weights into fusion so query type can tilt the balance
    # between semantic and exact matching
    combined = reciprocal_rank_fusion(
        vector_results, keyword_results,
        weights=(vector_weight, keyword_weight),
    )
    return rerank(combined, query, top_k=5)
```
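`reciprocal_rank_fusion` above is not a library call. A minimal implementation of the standard RRF formula (each doc scores the sum of w / (k + rank) over the lists it appears in, with the conventional k = 60) looks like this:

```python
def reciprocal_rank_fusion(*ranked_lists, k=60, weights=None):
    """Fuse ranked lists of doc IDs. Each doc scores sum(w / (k + rank)),
    with rank starting at 1 in every list it appears in. k=60 is the
    conventional constant; weights default to equal."""
    weights = weights or [1.0] * len(ranked_lists)
    scores = {}
    for w, results in zip(weights, ranked_lists):
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + w / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear in both lists accumulate score from both, which is why RRF reliably promotes results the two retrievers agree on.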

In my production systems I use pgvector for vector search and pg_trgm for fuzzy keyword matching, both in the same PostgreSQL database: no external services, no sync issues. Moving from pure vector to hybrid search lifted retrieval accuracy by 23 percentage points.

The reranking step matters too. After fusion, run the top candidates through a cross-encoder reranker. This catches the cases where both retrieval methods ranked a mediocre result highly.
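A sketch of that rerank step with the scorer left pluggable: in production `score_fn` would be a cross-encoder's relevance score (for example, sentence-transformers' `CrossEncoder.predict` over (query, passage) pairs); here it is any callable so the logic stands on its own:

```python
from typing import Callable

def rerank(candidates: list[str], query: str,
           score_fn: Callable[[str, str], float], top_k: int = 5) -> list[str]:
    """Keep the top_k candidates by score_fn(query, candidate).
    score_fn stands in for a cross-encoder relevance model."""
    return sorted(candidates, key=lambda doc: score_fn(query, doc),
                  reverse=True)[:top_k]
```

Because the scorer sees the query and the passage together, it can demote a candidate that both bi-encoder retrievers ranked highly for the wrong reason.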

3. No Source Attribution (The Hallucination Trap)

If your RAG system returns an answer without showing where it came from, you have a hallucination machine with extra steps.

Users need to verify. Especially in high-stakes domains — legal, financial, compliance, healthcare. If the AI says "the penalty clause states a 5% charge" and there's no link to the actual clause, nobody can trust it and nobody will use it.

In production, every answer needs:

  • The exact source document and section
  • A relevance confidence score
  • A clear signal when the system doesn't have enough context to answer
```typescript
interface RAGResponse {
  answer: string;
  sources: {
    document: string;
    section: string;
    pageNumber: number;
    relevanceScore: number;
    extractedText: string;  // The actual text the answer was derived from
  }[];
  confidence: 'high' | 'medium' | 'low';
  caveat?: string;  // "Based on documents uploaded before March 2026"
}
```

The confidence field is critical. When retrieval scores are below your threshold, the system should say "I don't have enough information to answer this reliably" rather than guessing. In production, a confident "I don't know" is worth more than a plausible-sounding hallucination.
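As a sketch of that gating logic, with illustrative thresholds (calibrate against your own eval set) and `generate_answer` as an assumed LLM call:

```python
def answer_with_confidence(query, sources, high=0.75, low=0.45):
    """Gate generation on retrieval quality. Each source dict carries the
    relevanceScore produced by reranking; thresholds are illustrative."""
    best = max((s["relevanceScore"] for s in sources), default=0.0)
    if best < low:
        # Refuse rather than guess when retrieval is weak
        return {"answer": "I don't have enough information to answer this reliably.",
                "sources": [], "confidence": "low"}
    return {"answer": generate_answer(query, sources),  # assumed LLM call
            "sources": sources,
            "confidence": "high" if best >= high else "medium"}
```

Note that the refusal path returns no sources at all: showing weak sources alongside a refusal invites users to second-guess the gate.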

4. Ignoring Temporal Context

Documents have dates. Policies get updated. Contracts expire. Regulations change.

If your RAG system treats a 2023 compliance policy and a 2026 compliance policy as equally relevant, you're serving stale information. Worse — in regulated industries, this creates legal liability.

Build temporal awareness into your retrieval pipeline:

  • Store document dates as metadata and use them in filtering
  • When multiple versions of a document exist, default to the most recent unless the user asks for a specific version
  • Add temporal signals to the system prompt: "The following context is from documents dated between X and Y"
```python
def temporal_search(query, date_context=None):
    results = hybrid_search(query)

    if date_context:
        results = filter_by_date_range(results, date_context)
    else:
        # Prefer recent documents, but don't exclude older ones entirely
        results = boost_recent(results, decay_factor=0.95, half_life_days=180)

    return results
```
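One way to implement the `boost_recent` helper, reading `decay_factor` as the fraction of score a document keeps per `half_life_days` of age (so `decay_factor=0.5` gives a true half-life). That reading is an assumption for illustration, not a fixed definition:

```python
from datetime import date

def boost_recent(results, decay_factor=0.95, half_life_days=180, today=None):
    """Downweight older documents instead of excluding them. With the
    defaults, a document loses 5% of its score every 180 days of age.
    Each result dict carries a float 'score' and a datetime.date 'doc_date'."""
    today = today or date.today()
    for r in results:
        age_days = max((today - r["doc_date"]).days, 0)
        r["score"] *= decay_factor ** (age_days / half_life_days)
    return sorted(results, key=lambda r: r["score"], reverse=True)
```

A stale document can still win if its score advantage outweighs the decay, which is the behaviour you want: recency is a tiebreaker, not a filter.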

This sounds obvious. In practice, I've seen production RAG systems serving answers from superseded documents because nobody thought to add date filtering. The fix takes an afternoon. The risk of not doing it is significant.

5. Building a Monolith Instead of a Pipeline

The prototype RAG loop is simple: embed query → search → stuff context → generate answer. Teams ship this and then discover they can't debug it, can't improve one component without breaking another, and can't add features without rewriting everything.

Production RAG is a pipeline, not a function.

Each stage should be independently testable, measurable, and replaceable:

```
Query Analysis → Retrieval → Reranking → Context Assembly → Generation → Validation
```

Each stage has its own metrics:

| Stage | Key Metric |
| --- | --- |
| Query Analysis | Query classification accuracy |
| Retrieval | Recall@20 |
| Reranking | Precision@5 |
| Context Assembly | Context relevance score |
| Generation | Answer faithfulness |
| Validation | Hallucination detection rate |

When accuracy drops, you can pinpoint which stage is failing. Is retrieval missing relevant documents? Is reranking promoting the wrong ones? Is the LLM hallucinating despite good context? Each answer leads to a different fix.

In a 12-component RAG system I built for enterprise document intelligence, this pipeline approach let us iterate individual components without regression — and we could A/B test retrieval strategies independently from generation strategies.
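A minimal sketch of that pipeline shape: each stage is a named function over a shared state dict, so any stage can be swapped, measured, or A/B tested in isolation. Stage names and outputs here are illustrative:

```python
from typing import Any, Callable

Stage = Callable[[dict], dict]

def run_pipeline(query: str, stages: list[tuple[str, Stage]]) -> dict:
    """Run named stages over a shared state dict, recording a trace so a
    regression can be pinned to the stage that introduced it."""
    state: dict[str, Any] = {"query": query, "trace": []}
    for name, stage in stages:
        state = stage(state)
        state["trace"].append(name)
    return state

# Stages are plain functions: swap one without touching the others.
def analyze(state):
    state["intent"] = "lookup"
    return state

def retrieve(state):
    state["candidates"] = ["doc1", "doc2"]
    return state
```

Per-stage metrics then become a matter of instrumenting `run_pipeline` rather than untangling a monolith.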

The Uncomfortable Truth About RAG in Production

RAG is not a product. It's an architecture pattern. The difference between a demo that impresses stakeholders and a system that handles 10,000 queries a day with consistent accuracy is entirely in the engineering.

The LLM is maybe 20% of the work. The other 80% is chunking strategy, retrieval pipeline, temporal handling, source attribution, monitoring, and the unglamorous work of testing edge cases until your accuracy numbers are something you'd stake your reputation on.

If you're building RAG for production and hitting accuracy walls, the fix is almost always in the retrieval pipeline — not in switching to a bigger model.


Nic Chin is an AI Architect and Fractional CTO specialising in multi-agent systems, RAG architecture, and enterprise AI automation. He provides AI consulting to businesses across the UK, US, Europe, Malaysia, and Singapore. Portfolio and case studies at nicchin.com.
