Most RAG tutorials show you the happy path: embed your documents, store them in a vector database, do a similarity search, done. Ship it.
Then you put it in front of real users and discover that semantic search alone misses obvious keyword matches, exact product names get hallucinated into approximate nonsense, and your retrieval quality sits at maybe 60%.
This post covers how we fixed it — using nothing but PostgreSQL. pgvector for semantic search, tsvector for full-text search, and Reciprocal Rank Fusion to merge the results. No Pinecone. No Elasticsearch. Just Postgres.
TL;DR — Pure vector search got us ~62% retrieval precision. Adding full-text search + RRF fusion bumped it to ~84%, with near-perfect exact-match queries. Zero additional infrastructure — it's all just SQL.
Why Not Just Use Vector Search?
Vector (semantic) search is great at understanding meaning. If a user asks "how do I reduce cloud costs?" it will find documents about "infrastructure cost optimization" even if those exact words don't appear.
But it has blind spots. Here's how the two approaches compare:
| Query type | Vector search | Full-text search |
|---|---|---|
| "infrastructure cost optimization" | Finds it even if phrased differently | Misses unless exact words match |
| "PCI-DSS v4" (exact standard) | Returns generic "security compliance" docs | Nails it |
| "ECS" (acronym) | Matches "container services" broadly | Finds exact AWS ECS references |
| "SQS DLQ config" (precise query) | Diluted by semantically similar noise | Direct hit |
| "how do I handle errors gracefully?" | Understands intent, finds relevant patterns | Misses if docs use different wording |
Neither approach is sufficient on its own. The answer is to use both.
Architecture Overview
Both searches run in parallel against the same PostgreSQL database. RRF merges the ranked lists into a single result set.
Setting Up the Database
pgvector for Semantic Search
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE document_chunks (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    document_id UUID NOT NULL REFERENCES documents(id),
    content     TEXT NOT NULL,
    -- text-embedding-3-large requested at 1536 dims via the API's
    -- `dimensions` parameter: pgvector can only index vectors with up
    -- to 2,000 dimensions, so the model's native 3,072 won't fit an
    -- HNSW index (unless you switch to halfvec)
    embedding   vector(1536),
    metadata    JSONB DEFAULT '{}',
    created_at  TIMESTAMPTZ DEFAULT now()
);

-- HNSW index for approximate nearest neighbor search
CREATE INDEX idx_chunks_embedding ON document_chunks
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 128);
Why HNSW over IVFFlat? HNSW gives better recall at query time out of the box. IVFFlat is faster to build, but it requires choosing the number of lists upfront and tuning the probe count (`ivfflat.probes`) per query. For production RAG, where retrieval quality matters more than index build time, HNSW is the better default.
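For reference, here is the tuning surface of each index type as session-level settings — a sketch using pgvector's documented knobs (the `lists` value shown is just an example):

```sql
-- HNSW: a single query-time knob; raising it trades latency for recall
SET hnsw.ef_search = 100;   -- default is 40

-- IVFFlat: lists is fixed at index build time...
-- CREATE INDEX ... USING ivfflat (embedding vector_cosine_ops) WITH (lists = 1000);
-- ...and probes must then be tuned per session or per query
SET ivfflat.probes = 10;    -- default is 1; too low silently hurts recall
```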
tsvector for Full-Text Search
ALTER TABLE document_chunks
    ADD COLUMN search_vector tsvector
    GENERATED ALWAYS AS (to_tsvector('english', content)) STORED;

CREATE INDEX idx_chunks_search ON document_chunks USING gin(search_vector);
Using a GENERATED ALWAYS column means the search vector stays in sync with content automatically — no triggers, no application logic to maintain.
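You also get Postgres's stemming and stop-word handling for free. For example, with the built-in english configuration:

```sql
SELECT to_tsvector('english', 'The quick brown foxes jumped');
-- 'brown':3 'fox':4 'jump':5 'quick':2
```

"The" is dropped as a stop word, and "foxes"/"jumped" are stemmed, which is why a query for "fox" still matches.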
The Search Queries
Semantic Search
import uuid

from sqlalchemy import func, select
from sqlalchemy.ext.asyncio import AsyncSession

# Document, DocumentChunk (ORM models) and SearchResult (a result
# dataclass) are defined elsewhere in the application.

async def semantic_search(
    session: AsyncSession,
    query_embedding: list[float],
    workspace_id: uuid.UUID,
    limit: int = 20,
) -> list[SearchResult]:
    """Find chunks by vector similarity using cosine distance."""
    stmt = (
        select(
            DocumentChunk.id,
            DocumentChunk.content,
            DocumentChunk.document_id,
            # cosine_distance emits pgvector's <=> operator; convert
            # distance to similarity so that higher = better
            (1 - DocumentChunk.embedding.cosine_distance(query_embedding)).label(
                "similarity"
            ),
        )
        .join(Document, DocumentChunk.document_id == Document.id)
        .where(Document.workspace_id == workspace_id)
        .order_by(DocumentChunk.embedding.cosine_distance(query_embedding))
        .limit(limit)
    )
    results = await session.execute(stmt)
    return [
        SearchResult(
            chunk_id=row.id,
            content=row.content,
            document_id=row.document_id,
            score=float(row.similarity),
        )
        for row in results
    ]
Full-Text Search
async def fulltext_search(
    session: AsyncSession,
    query: str,
    workspace_id: uuid.UUID,
    limit: int = 20,
) -> list[SearchResult]:
    """Find chunks by full-text relevance using ts_rank."""
    ts_query = func.plainto_tsquery("english", query)
    relevance = func.ts_rank(DocumentChunk.search_vector, ts_query)
    stmt = (
        select(
            DocumentChunk.id,
            DocumentChunk.content,
            DocumentChunk.document_id,
            relevance.label("relevance"),
        )
        .join(Document, DocumentChunk.document_id == Document.id)
        .where(
            Document.workspace_id == workspace_id,
            # @@ is Postgres's "tsvector matches tsquery" operator
            DocumentChunk.search_vector.op("@@")(ts_query),
        )
        .order_by(relevance.desc())
        .limit(limit)
    )
    results = await session.execute(stmt)
    return [
        SearchResult(
            chunk_id=row.id,
            content=row.content,
            document_id=row.document_id,
            score=float(row.relevance),
        )
        for row in results
    ]
Reciprocal Rank Fusion
Here's where the magic happens. RRF merges two ranked lists without needing to normalize their scores — which is important because cosine similarity scores and ts_rank scores are on completely different scales.
The formula is simple:
RRF_score(d) = Σ 1 / (k + rank_i(d))
Where k is a constant (typically 60) and rank_i(d) is the rank of document d in the i-th result list.
Why not just normalize and average the scores? Cosine similarity and ts_rank have fundamentally different distributions. Normalizing them into a common range introduces arbitrary decisions about scaling. RRF side-steps this entirely by only caring about rank position, not raw scores. A document ranked #1 in both lists will always beat a document ranked #5 in one and #1 in another — regardless of what the underlying scores looked like.
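To make the rank-only behavior concrete, here is the formula applied by hand to two small made-up ranked lists (the chunk ids and rankings are invented for illustration):

```python
# RRF by hand on two toy ranked lists of chunk ids, with k = 60.
k = 60
semantic = ["A", "B", "C"]   # ranks 1, 2, 3 from vector search
fulltext = ["A", "D", "B"]   # ranks 1, 2, 3 from full-text search

scores: dict[str, float] = {}
for ranked in (semantic, fulltext):
    for rank, doc in enumerate(ranked, start=1):
        # each list contributes 1 / (k + rank) for every doc it contains
        scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)

merged = sorted(scores, key=scores.get, reverse=True)
print(merged)  # → ['A', 'B', 'D', 'C']
```

"A", ranked first in both lists, wins; "B" appears in both lists and so outranks "D" and "C", which each appear in only one.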
def reciprocal_rank_fusion(
    result_lists: list[list[SearchResult]],
    k: int = 60,
) -> list[SearchResult]:
    """
    Merge multiple ranked result lists using Reciprocal Rank Fusion.

    The k parameter controls how much we trust top-ranked results.
    Higher k = more equal weighting. Default 60 is from the original
    Cormack et al. (2009) paper.
    """
    rrf_scores: dict[uuid.UUID, float] = {}
    results_by_id: dict[uuid.UUID, SearchResult] = {}

    for result_list in result_lists:
        for rank, result in enumerate(result_list, start=1):
            rrf_scores[result.chunk_id] = rrf_scores.get(
                result.chunk_id, 0.0
            ) + 1.0 / (k + rank)
            results_by_id[result.chunk_id] = result

    sorted_ids = sorted(rrf_scores, key=rrf_scores.get, reverse=True)
    return [
        SearchResult(
            chunk_id=chunk_id,
            content=results_by_id[chunk_id].content,
            document_id=results_by_id[chunk_id].document_id,
            score=rrf_scores[chunk_id],
        )
        for chunk_id in sorted_ids
    ]
Putting It Together
import asyncio

from sqlalchemy.ext.asyncio import AsyncSession, async_sessionmaker

async def hybrid_search(
    session_factory: async_sessionmaker[AsyncSession],
    query: str,
    query_embedding: list[float],
    workspace_id: uuid.UUID,
    limit: int = 10,
) -> list[SearchResult]:
    """
    Run semantic and full-text search in parallel,
    then merge results with RRF.

    A single AsyncSession can't execute two statements concurrently,
    so each search gets its own session from the factory.
    """
    async with session_factory() as sem_session, session_factory() as ft_session:
        semantic_results, fulltext_results = await asyncio.gather(
            semantic_search(sem_session, query_embedding, workspace_id, limit=20),
            fulltext_search(ft_session, query, workspace_id, limit=20),
        )

    fused = reciprocal_rank_fusion([semantic_results, fulltext_results])
    return fused[:limit]
Both searches run concurrently with asyncio.gather. We fetch 20 candidates from each, fuse them, and return the top 10. Over-fetching from each source gives RRF more signal to work with.
Why Not a Dedicated Vector Database?
We considered Pinecone, Weaviate, and Qdrant. We stuck with PostgreSQL + pgvector for several reasons:
| Concern | Dedicated vector DB | PostgreSQL + pgvector |
|---|---|---|
| Infrastructure | Separate service to monitor, back up, patch, pay for | Already running Aurora PostgreSQL |
| Consistency | Eventual consistency between stores | Single transaction — deletes are atomic |
| Full-text search | Need a third system (Elasticsearch, etc.) | tsvector is built in |
| Compliance (SOC2) | Every new service adds audit scope | One set of encryption, access controls, audit trails |
| Performance at our scale | Faster at 10M+ vectors | <50ms HNSW search at ~500K chunks |
When would we switch? If we hit tens of millions of vectors and needed sub-10ms p99 latency, a dedicated solution would make sense. At ~500K chunks, pgvector is more than fast enough.
Production Lessons
Here's what we learned running this in production:
1. Over-fetch, then trim.
Pulling 20 results from each source and fusing to 10 consistently outperformed pulling 10 from each. More candidates = better fusion signal.
2. Full-text search needs query preprocessing.
plainto_tsquery handles most cases, but for advanced users we added websearch_to_tsquery support, which understands quoted phrases and boolean operators.
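A sketch of that routing — the helper name and heuristics here are invented for illustration, not from our codebase; it just decides which Postgres parser function to call:

```python
# Hypothetical router between Postgres's two tsquery parser functions.
# websearch_to_tsquery understands "quoted phrases", OR, and -exclusion;
# plainto_tsquery treats the input as plain words ANDed together.

def choose_tsquery_parser(query: str) -> str:
    """Return the name of the Postgres function to parse this query with."""
    advanced_markers = ('"', " OR ", " -")
    if any(m in query for m in advanced_markers) or query.startswith("-"):
        return "websearch_to_tsquery"
    return "plainto_tsquery"
```

In the search function, you would then build the tsquery with the chosen function (e.g. via `getattr(func, choose_tsquery_parser(query))("english", query)` in SQLAlchemy).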
3. Monitor both sides independently.
We track semantic search latency and full-text search latency separately. If pgvector's HNSW index needs a rebuild, we see it in the semantic search p99 without it being masked by fast full-text results.
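One lightweight way to get per-side timings — an illustrative in-process sketch; real code would emit these to a metrics backend (StatsD, Prometheus, etc.) rather than a dict, and the fake search coroutines stand in for the real ones:

```python
import asyncio
import time
from collections import defaultdict

# Hypothetical in-process recorder keyed by search type.
latencies_ms: dict[str, list[float]] = defaultdict(list)

async def timed(name: str, coro):
    """Await a search coroutine and record its latency under its own name."""
    start = time.perf_counter()
    try:
        return await coro
    finally:
        latencies_ms[name].append((time.perf_counter() - start) * 1000)

async def main():
    # Stand-ins for the real semantic_search / fulltext_search calls.
    async def fake_semantic():
        await asyncio.sleep(0.02)
        return ["chunk-1"]

    async def fake_fulltext():
        await asyncio.sleep(0.005)
        return ["chunk-2"]

    return await asyncio.gather(
        timed("semantic", fake_semantic()),
        timed("fulltext", fake_fulltext()),
    )

asyncio.run(main())
```

Because each side is recorded under its own key, a slow HNSW index shows up in the semantic p99 even while full-text stays fast.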
4. Embedding model upgrades require re-indexing.
When we considered switching embedding models, we realized every chunk needs re-embedding. We built this into our pipeline as a background job with progress tracking. Plan for this from day one.
5. RRF's k=60 is a reasonable default, but test it.
We ran evaluations with k values from 10 to 100. For our dataset, 60 was optimal. Your mileage may vary — the important thing is to have an evaluation set so you can actually measure.
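The evaluation loop is simple enough to sketch. Everything below is made up for illustration — the eval set stands in for human-labeled (query, relevant-chunks) pairs, and precision@N stands in for whatever metric your labels support:

```python
# Toy sweep of RRF's k parameter over a tiny labeled eval set.

def rrf(lists: list[list[str]], k: int) -> list[str]:
    """Minimal RRF over lists of chunk ids."""
    scores: dict[str, float] = {}
    for ranked in lists:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def precision_at(n: int, ranked: list[str], relevant: set[str]) -> float:
    """Fraction of the top n results that are labeled relevant."""
    return sum(1 for doc in ranked[:n] if doc in relevant) / n

eval_set = [
    # (semantic results, full-text results, relevant chunk ids)
    (["A", "B", "C", "D"], ["E", "A", "F", "B"], {"A", "B"}),
    (["G", "H", "I", "J"], ["H", "K", "G", "L"], {"G", "H"}),
]

for k in (10, 30, 60, 100):
    p = sum(
        precision_at(2, rrf([sem, ft], k), rel) for sem, ft, rel in eval_set
    ) / len(eval_set)
    print(f"k={k:>3}  precision@2={p:.2f}")
```

With a real eval set of a few hundred labeled queries, this sweep takes seconds and removes the guesswork from choosing k.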
Results
After switching from pure semantic search to hybrid search with RRF:
| Metric | Before (vector only) | After (hybrid + RRF) |
|---|---|---|
| Retrieval precision | ~62% | ~84% |
| Exact-match queries (product names, error codes) | Unreliable | Near-perfect |
| Additional infrastructure | — | None. It's all just SQL. |
Hybrid search isn't novel research — the pattern has been around for years in traditional information retrieval. But in the RAG gold rush, most tutorials skip it entirely in favor of pure vector search. If your RAG pipeline is struggling with retrieval quality, this is likely the highest-ROI improvement you can make.
[Code examples in this post are illustrative patterns based on production experience, not copy-paste from a proprietary codebase.]
