qodors
Beyond Vector Search: What RAG Actually Needs

Everyone thinks they've built RAG because they threw documents into a vector database and connected an LLM.

You haven't built RAG. You've built a fancy search bar that hallucinates.


The Vector Search Trap

Here's how most RAG implementations go:

Chunk your docs. Embed them. Store in Pinecone or Weaviate. Query with cosine similarity. Feed results to GPT. Done.

Except it's not done. It's broken in ways you won't notice until production.

Vector search finds what's semantically similar. That's not the same as finding what's correct. Or complete. Or relevant to the actual question being asked.

We've debugged enough RAG systems to know — the retrieval is where everything falls apart. Not the generation.


What's Actually Missing

1. Chunking Strategy

Most teams chunk by token count. 500 tokens, overlap 50, move on.

That's lazy. It splits context mid-sentence, separates related concepts, and feeds the LLM fragments that don't make sense on their own.

Better approach: chunk by meaning. Headings, sections, logical boundaries. Smaller chunks for factual lookups. Larger chunks for nuanced questions. One size doesn't fit all.
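The idea fits in a few lines. A minimal sketch under stated assumptions: it expects markdown-style `#` headings, measures size in characters rather than real tokens, and falls back to paragraph boundaries only when a section runs long, so it never splits mid-sentence:

```python
import re

def chunk_by_headings(text: str, max_chars: int = 1500) -> list[str]:
    """Split text on heading boundaries; split oversized sections
    on blank lines (paragraph boundaries), never mid-sentence."""
    # Split at heading lines ("# Title", "## Section") but keep them
    # attached to the section they introduce.
    sections = re.split(r"(?m)^(?=#{1,6}\s)", text)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if len(section) <= max_chars:
            chunks.append(section)
            continue
        # Oversized section: pack whole paragraphs up to the budget.
        current = ""
        for para in re.split(r"\n\s*\n", section):
            if current and len(current) + len(para) > max_chars:
                chunks.append(current.strip())
                current = para
            else:
                current = f"{current}\n\n{para}" if current else para
        if current.strip():
            chunks.append(current.strip())
    return chunks
```

Swap the character budget for a real tokenizer count (e.g. tiktoken) before relying on this; characters are only a rough proxy for tokens.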

2. Retrieval Isn't Just Similarity

Semantic similarity gets you 60% of the way. The other 40% needs:

  • Keyword matching — sometimes the user wants an exact term, not a vibe
  • Metadata filtering — date, source, document type, access level
  • Re-ranking — the top 5 results from vector search aren't always the best 5 after you think harder

Hybrid retrieval (vector + keyword + filters + re-ranking) is where real accuracy lives. Pure vector search is a starting point, not the finish line.
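One well-known way to merge the ranked lists that vector and keyword search produce is reciprocal rank fusion (RRF). A minimal sketch, assuming each retriever has already returned an ordered list of doc IDs (the retrievers themselves, and any metadata filtering before this step, are out of scope here):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of doc IDs into one fused ranking.
    A document ranked highly by multiple retrievers accumulates a
    higher score; k=60 is the conventional damping constant."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Toy usage: two retrievers, partially overlapping results.
vector_hits = ["d3", "d1", "d7"]    # ranked by cosine similarity
keyword_hits = ["d1", "d9", "d3"]   # ranked by BM25 or similar
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

RRF is attractive because it needs only ranks, not comparable scores, so you can fuse retrievers whose scoring scales have nothing to do with each other. A cross-encoder re-ranker can then re-score the fused top-N.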

3. Context Window Management

You retrieved 10 chunks. Now what? Dump all of them into the prompt?

That's how you get bloated context, confused models, and slow responses.

The good RAG systems are selective. They score, filter, and compress. Only the most relevant pieces make it into the prompt. Less noise, better answers.
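The score-and-filter step can be sketched as a greedy packer under a token budget. Everything specific here is a placeholder: the word-count "tokenizer" and the 0.3 cutoff are illustrative, not recommendations:

```python
def select_chunks(scored_chunks: list[tuple[float, str]],
                  token_budget: int,
                  min_score: float = 0.3) -> list[str]:
    """Drop low-scoring chunks, then pack the best remaining ones
    into the budget, highest score first."""
    selected, used = [], 0
    for score, chunk in sorted(scored_chunks, reverse=True):
        if score < min_score:
            break  # everything after this scores even lower
        cost = len(chunk.split())  # crude token estimate; use a real tokenizer
        if used + cost <= token_budget:
            selected.append(chunk)
            used += cost
    return selected
```

In production you'd replace `len(chunk.split())` with an actual tokenizer count and calibrate `min_score` against your retriever's score distribution, but the shape of the logic stays the same: score, filter, then pack.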

4. Query Understanding

User types: "What changed in the refund policy last quarter?"

Your vector search sees: refund, policy, changed.

It misses: last quarter. That's a time filter. Not a semantic concept.

Good RAG rewrites the query before retrieval. Decomposes it. Extracts filters. Sometimes runs multiple retrievals and merges results.

The raw user query is almost never the right search query.
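Here's what filter extraction can look like for exactly that refund-policy example. This is a narrow sketch: only the "last quarter" pattern is handled, and the function name and filter schema are illustrative, not from any particular library:

```python
import re
from datetime import date, timedelta

def parse_query(query: str, today: date) -> dict:
    """Split a user question into a semantic search string plus
    structured filters. A real system would handle many more
    temporal and metadata patterns than this one."""
    filters = {}
    m = re.search(r"\blast quarter\b", query, re.IGNORECASE)
    if m:
        # First day of the current quarter, then step back one day
        # to land in the previous quarter.
        q_start = date(today.year, 3 * ((today.month - 1) // 3) + 1, 1)
        prev_end = q_start - timedelta(days=1)
        prev_start = date(prev_end.year, 3 * ((prev_end.month - 1) // 3) + 1, 1)
        filters["date_range"] = (prev_start, prev_end)
        # Remove the phrase so it doesn't pollute the semantic search.
        query = query[:m.start()] + query[m.end():]
    return {"search_text": query.strip(" ?").strip(), "filters": filters}
```

The point isn't the regex. It's the shape: the retrieval call receives a cleaned search string plus a structured `date_range` filter, instead of hoping the embedding model understands "last quarter".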

5. Knowing When It Doesn't Know

This is the big one.

Most RAG systems answer everything. Confidently. Even when the retrieved documents don't actually contain the answer.

That's not intelligence. That's a liability.

The system needs an "I don't have enough information" response. A confidence threshold. A way to say no.

If your RAG can't refuse, it can't be trusted.
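A no-answer gate can be as small as this. The 0.55 threshold is a made-up placeholder; calibrate it on logged queries with known-good answers before trusting it:

```python
NO_ANSWER = "I don't have enough information to answer that."

def answer_or_refuse(retrieved: list[tuple[float, str]],
                     threshold: float = 0.55) -> str:
    """Refuse when the best retrieval score is below threshold,
    instead of letting the model guess from weak context."""
    if not retrieved or max(score for score, _ in retrieved) < threshold:
        return NO_ANSWER
    # Keep only chunks that clear the bar; in a real pipeline this
    # context string would be handed to the LLM, not returned raw.
    return "\n\n".join(chunk for score, chunk in retrieved if score >= threshold)
```

Pair this with a prompt instruction to answer only from the provided context, and the refusal path covers both failure modes: nothing relevant retrieved, and relevant-looking chunks that don't actually contain the answer.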


Our Take

We've built RAG into production systems for clients across healthcare, operations, and customer support. Every single time, the first version with basic vector search looked impressive in demos and failed in production.

The pattern is always the same. Retrieval quality is the bottleneck. Not the model. Not the prompt.

Fix retrieval and your "dumb" GPT-3.5 setup will outperform a sloppy GPT-4 pipeline every time.

RAG isn't a weekend project. It's an information architecture problem dressed up as an AI feature.


What To Do About It

If you're building or fixing a RAG system, pressure-test these five things:

  • Chunk one document three different ways. See which produces better answers. You'll be surprised.
  • Add keyword search alongside vector search. Hybrid retrieval isn't optional anymore.
  • Log every retrieval. See what the model actually received before generating. Most teams never look.
  • Build a "no answer" path. If retrieved chunks score below threshold, say so. Don't guess.
  • Test with real user queries, not your own. Your questions are clean. Theirs aren't.

Vector search is step one. Not the whole staircase.

The teams getting real value from RAG are the ones treating retrieval as a system — not a single API call.

Build accordingly.


#RAG #VectorSearch #AIEngineering #LLM #RetrievalAugmentedGeneration #StartupCTO #AIArchitecture #QodorsEdge
