Dexmac

The RAG Illusion: Why PostgreSQL Beats Vector Search for Most AI Applications

A Contrarian View on the Most Hyped Technology in AI Infrastructure


"The best solution to a problem is often realizing you don't have the problem."


The Uncomfortable Question

Before you spin up another Pinecone instance, embed another million documents, and debug another "why didn't it retrieve the right chunk?" issue, ask yourself:

Does my data actually need semantic search, or am I using RAG because everyone else is?

After building a multi-agent AI system for narrative generation, we discovered something counterintuitive: replacing our RAG pipeline with PostgreSQL didn't just simplify our architecture—it made the output better.

This article explains why, and when you should consider the same.


A Brief History: Why RAG Exists

RAG (Retrieval-Augmented Generation) was born from necessity. In 2022-2023, context windows were tiny:

GPT-3.5:     4,096 tokens  (~3,000 words)
GPT-4:       8,192 tokens  (~6,000 words)
Claude 1:    9,000 tokens  (~7,000 words)

If your knowledge base was larger than ~5,000 words, you had a problem. You couldn't fit it in context. RAG was the elegant solution:

  1. Chunk your documents into small pieces
  2. Embed each chunk into a vector
  3. Store vectors in a specialized database
  4. Retrieve the top-k most "similar" chunks to the query
  5. Inject those chunks into the prompt
  6. Generate a response based on the retrieved context
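
In code, the whole pattern fits in a few lines. A minimal sketch, with embed() standing in for any embedding API (hypothetical here) and a plain list as the "vector store":

import numpy as np

def chunk(text: str, size: int = 500) -> list[str]:
    """Split a document into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(query_vec: np.ndarray, store: list, top_k: int = 5):
    """Return the top-k chunks by cosine similarity. store: (text, vector) pairs."""
    scored = [
        (text, float(np.dot(query_vec, vec)
                     / (np.linalg.norm(query_vec) * np.linalg.norm(vec))))
        for text, vec in store
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]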

Brilliant. A necessity became a pattern, the pattern became a best practice, and the best practice became... an assumption.


The World Changed, The Assumption Didn't

Fast forward to December 2025:

GPT-5.2:           400,000 tokens (~300,000 words)
Claude Opus 4.5:   200,000 tokens (~150,000 words)
Claude Sonnet 4.5: 1,000,000 tokens (beta, tier 4+)
Gemini 3 Pro:      1,000,000 tokens (~750,000 words)

That's not a typo. One million tokens. You can fit entire codebases, complete documentation sets, or full novels in a single context window.

Yet we're still chunking, embedding, and retrieving as if it's 2022.


The Hidden Costs of RAG

RAG isn't free. It comes with costs that are rarely discussed in the "RAG is amazing!" tutorials.

1. Chunking Destroys Context

ORIGINAL DOCUMENT:
"The patient presented with symptoms consistent with Type 2 diabetes, 
including elevated blood glucose levels (HbA1c: 8.2%). However, given 
the patient's age (12) and rapid onset, we conducted additional testing 
which revealed GAD65 antibodies, confirming Type 1 diabetes instead."

AFTER CHUNKING (500 token chunks):
Chunk 1: "The patient presented with symptoms consistent with Type 2 
          diabetes, including elevated blood glucose levels (HbA1c: 8.2%)."

Chunk 2: "However, given the patient's age (12) and rapid onset, we 
          conducted additional testing which revealed GAD65 antibodies, 
          confirming Type 1 diabetes instead."

Query: "What type of diabetes does the patient have?"

If your retrieval returns only Chunk 1, your LLM confidently answers "Type 2 diabetes."

Wrong.

The correction was in Chunk 2, which may not have been "semantically similar" enough to the query to make the top-k cut.

2. Embedding Loses Nuance

Embeddings compress meaning into a fixed-length vector of roughly 1,500 floating point numbers (1,536 dimensions for OpenAI's text-embedding-3-small, for example). They're remarkably good at capturing gist, but they lose:

  • Negation: "This is good" and "This is not good" often have similar embeddings
  • Specificity: "Apple Inc." and "apple fruit" may cluster together
  • Relationships: "A loves B" vs "B loves A" look identical to embeddings
  • Recency: A 2020 policy and a 2024 update might have similar embeddings
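
These failure modes are easy to check against your own model. A quick probe, assuming the sentence-transformers package and the all-MiniLM-L6-v2 model (any embedding model can be swapped in):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

pairs = [
    ("This is good", "This is not good"),  # negation
    ("A loves B", "B loves A"),            # relationship direction
]
for a, b in pairs:
    vec_a, vec_b = model.encode([a, b])
    # Opposite or reversed meanings often score surprisingly close to 1.0
    print(f"{a!r} vs {b!r}: {float(util.cos_sim(vec_a, vec_b)):.3f}")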

3. Top-K is Arbitrary

Why retrieve the top 5 chunks? Why not 3? Or 10? Or 50?

The answer is usually: "because that's what the tutorial did."

Top-k retrieval means:

  • You might miss the 6th most relevant chunk (which was the one you needed)
  • You definitely include the 5th chunk (even if it's barely relevant)
  • There's no way to know if you got the right ones

4. "Lost in the Middle"

Research on long-context models ("Lost in the Middle," Liu et al., 2023) shows that LLMs pay less attention to information in the middle of long contexts. If your most relevant chunk ends up sandwiched between less relevant ones, the model might ignore it.

5. The Verification Problem

Here's the question nobody wants to answer:

If I still need to verify the LLM's response (because RAG retrieval might have missed something), what exactly did I gain?

For low-stakes applications, approximate answers are fine. For legal, medical, financial, or any domain requiring accuracy, you're back to manual verification anyway.


The Alternative: Structured State Management

What if, instead of treating your knowledge as "documents to search," you treated it as "state to manage"?

This is the key insight: For most applications, you don't need semantic search. You need structured queries.

The PostgreSQL Approach

-- Instead of: "Find chunks about customer complaints from Q3"
-- You write: 

SELECT 
    c.complaint_id,
    c.description,
    c.resolution,
    cu.name as customer_name,
    p.name as product_name
FROM complaints c
JOIN customers cu ON c.customer_id = cu.id
JOIN products p ON c.product_id = p.id
WHERE c.created_at BETWEEN '2024-07-01' AND '2024-09-30'
  AND c.status = 'resolved'
ORDER BY c.severity DESC
LIMIT 20;

This query is:

  • Deterministic: Same query, same results
  • Debuggable: EXPLAIN shows exactly what happened
  • Complete: You get ALL matching records, not "top-k similar"
  • Relational: You can join customers, products, resolutions in one query
  • Fast: Milliseconds, not seconds

Full Context Injection

Now here's the key: once you have your structured data, you don't "retrieve" it into a tiny context. You dump it entirely into the prompt.

def build_context(scene_requirements):
    """Build complete context for this scene."""
    # Unpack the scene spec (keys shown here are illustrative)
    characters = scene_requirements["characters"]
    location = scene_requirements["location"]
    current_chapter = scene_requirements["chapter"]

    context = db.query("""
        SELECT json_build_object(
            'characters', (
                SELECT json_agg(c.*) 
                FROM characters c 
                WHERE c.name = ANY(%s)
            ),
            'location', (
                SELECT row_to_json(l.*) 
                FROM locations l 
                WHERE l.name = %s
            ),
            'recent_events', (
                SELECT json_agg(e.*) 
                FROM events e 
                WHERE e.chapter >= %s - 1
            ),
            'relationships', (
                SELECT json_agg(r.*) 
                FROM relationships r 
                WHERE r.character_a = ANY(%s) 
                  AND r.character_b = ANY(%s)
            )
        )
    """, [characters, location, current_chapter, characters, characters])

    return context  # ~2,000-10,000 tokens, COMPLETE and COHERENT

This isn't retrieval. It's state serialization.
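
Turning that state into a prompt is a formatting step, not a search step. A minimal sketch (build_context is the function above; the prompt wording is illustrative):

import json

def build_prompt(scene_requirements, instructions):
    """Serialize the complete world state directly into the prompt."""
    context = build_context(scene_requirements)
    return (
        "Here is the COMPLETE current state of the story world:\n\n"
        + json.dumps(context, indent=2, default=str)
        + f"\n\nTask: {instructions}"
    )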


A Real-World Example: The Book Generator

We built a system that generates novels using multiple AI agents (more on this in a follow-up article). The "world" of a typical story includes:

STORY WORLD COMPONENTS:
- 3-5 characters:      ~500 tokens each  = 2,500 tokens
- 5-10 locations:      ~200 tokens each  = 2,000 tokens  
- 20-30 events:        ~100 tokens each  = 3,000 tokens
- Rules and timeline:                    = 1,000 tokens
────────────────────────────────────────────────────────
TOTAL:                                   ~8,500 tokens

Our context window: 200,000-1,000,000 tokens (depending on model).

We could fit the entire world 20-100 times over. Why were we using RAG?

The RAG Version (What We Started With)

class SharedRAG:
    def query(self, question: str, top_k: int = 5):
        # Embed the question
        query_embedding = self.embed(question)

        # Search vector store
        results = self.vector_store.similarity_search(
            query_embedding, 
            top_k=top_k
        )

        return results

Problems we encountered:

  • "Tell me about the café" returned café description + random events that mentioned coffee
  • Character relationships were split across chunks, often incomplete
  • The same location description was retrieved repeatedly (no frequency control)
  • Timeline was scrambled (similarity ≠ chronology)

The PostgreSQL Version (What We Switched To)

from typing import List

class CoherentMemory:
    def get_scene_context(self, chapter: int, scene: int, 
                          characters: List[str], location: str):
        """Return ONLY what's needed for this specific scene."""

        return {
            'location': self._get_location_details(location, already_narrated=True),
            'characters': self._get_characters(characters),
            'relationships': self._get_relationships_between(characters),
            'known_events': self._get_what_characters_know(characters),
            'recent_events': self._get_events_since(chapter - 1),
        }

Results:

  • Location details are complete and not duplicated
  • Only relationships between present characters are included
  • Events are filtered by what each character actually knows
  • Natural frequency control through already_narrated flags
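
For a sense of what those helpers look like, here is a sketch of one of them, assuming a psycopg connection on self.db (the tables match the schema below):

    def _get_relationships_between(self, characters: List[str]):
        """Fetch only the relationships among characters present in this scene."""
        with self.db.cursor() as cur:
            cur.execute("""
                SELECT ca.name, cb.name, cr.relationship_type, cr.description
                FROM character_relationships cr
                JOIN characters ca ON ca.id = cr.character_a
                JOIN characters cb ON cb.id = cr.character_b
                WHERE ca.name = ANY(%s) AND cb.name = ANY(%s)
            """, (characters, characters))
            return [
                {"a": a, "b": b, "type": rel_type, "description": desc}
                for a, b, rel_type, desc in cur.fetchall()
            ]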

The Schema

-- 16 tables for complete narrative state management

-- Characters and their attributes
CREATE TABLE characters (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100) UNIQUE,
    full_name VARCHAR(200),
    age INT,
    occupation VARCHAR(100),
    background TEXT,
    cognitive_style VARCHAR(50),
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE character_traits (
    character_id INT REFERENCES characters(id),
    trait VARCHAR(100),
    PRIMARY KEY (character_id, trait)
);

CREATE TABLE character_secrets (
    id SERIAL PRIMARY KEY,
    character_id INT REFERENCES characters(id),
    secret TEXT,
    known_by INT[] DEFAULT '{}'  -- Array of character IDs who know
);

-- Relationships (bidirectional)
CREATE TABLE character_relationships (
    character_a INT REFERENCES characters(id),
    character_b INT REFERENCES characters(id),
    relationship_type VARCHAR(50),
    description TEXT,
    PRIMARY KEY (character_a, character_b)
);

-- Locations with usage tracking (frequency control!)
CREATE TABLE locations (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100) UNIQUE,
    description TEXT,
    atmosphere TEXT,
    usage_count INT DEFAULT 0,
    max_usage INT DEFAULT 4
);

CREATE TABLE location_details (
    id SERIAL PRIMARY KEY,
    location_id INT REFERENCES locations(id),
    detail TEXT,
    narrated BOOLEAN DEFAULT FALSE  -- Has this been used in the story?
);

-- Events with character knowledge tracking
CREATE TABLE events (
    id SERIAL PRIMARY KEY,
    chapter INT,
    scene INT,
    description TEXT,
    location_id INT REFERENCES locations(id),
    resolved BOOLEAN DEFAULT FALSE
);

CREATE TABLE event_participants (
    event_id INT REFERENCES events(id),
    character_id INT REFERENCES characters(id),
    role VARCHAR(50),  -- 'witness', 'actor', 'mentioned'
    PRIMARY KEY (event_id, character_id)
);

-- Who knows what? Automatic knowledge tracking
CREATE TABLE character_knowledge (
    character_id INT REFERENCES characters(id),
    event_id INT REFERENCES events(id),
    learned_in_chapter INT,
    PRIMARY KEY (character_id, event_id)
);

-- Style tracking (integrated repetition control)
CREATE TABLE style_patterns (
    id SERIAL PRIMARY KEY,
    pattern_type VARCHAR(50),
    pattern_text TEXT,
    chapter INT,
    usage_count INT DEFAULT 0,
    max_per_chapter INT DEFAULT 2
);
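
A nice property of this schema: knowledge propagation becomes a single statement. A sketch of the post-event bookkeeping (cur is a psycopg cursor):

def record_event_knowledge(cur, event_id: int, chapter: int):
    """Every participant of an event now 'knows' it."""
    cur.execute("""
        INSERT INTO character_knowledge (character_id, event_id, learned_in_chapter)
        SELECT ep.character_id, ep.event_id, %s
        FROM event_participants ep
        WHERE ep.event_id = %s
        ON CONFLICT DO NOTHING
    """, (chapter, event_id))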

Key Queries

Get scene context (coherent segmentation):

-- Everything needed for Chapter 2, Scene 3, with Alice and Bob in the Library

SELECT json_build_object(
    'location', (
        SELECT json_build_object(
            'name', l.name,
            'description', l.description,
            'atmosphere', l.atmosphere,
            'unused_details', (
                SELECT json_agg(ld.detail)
                FROM location_details ld
                WHERE ld.location_id = l.id AND ld.narrated = FALSE
            )
        )
        FROM locations l WHERE l.name = 'Library'
    ),
    'characters', (
        SELECT json_agg(json_build_object(
            'name', c.name,
            'occupation', c.occupation,
            'traits', (SELECT array_agg(trait) FROM character_traits WHERE character_id = c.id),
            'cognitive_style', c.cognitive_style
        ))
        FROM characters c WHERE c.name IN ('Alice', 'Bob')
    ),
    'relationship', (
        SELECT json_build_object(
            'type', cr.relationship_type,
            'description', cr.description
        )
        FROM character_relationships cr
        WHERE (cr.character_a = 1 AND cr.character_b = 2)
           OR (cr.character_a = 2 AND cr.character_b = 1)
    ),
    'what_they_know', (
        SELECT json_agg(DISTINCT e.description)
        FROM events e
        JOIN character_knowledge ck ON e.id = ck.event_id
        WHERE ck.character_id IN (1, 2)
    )
) AS scene_context;

Check for overused patterns:

-- Before generating, check what patterns to avoid

SELECT pattern_text 
FROM style_patterns 
WHERE chapter = 2 
  AND usage_count >= max_per_chapter;

-- Returns: ["shoulders slumped", "heart raced"]
-- These get injected as "AVOID:" instructions
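
Wiring this into generation takes a few lines of glue (a sketch; get_conn() is a hypothetical connection factory):

def avoid_instructions(chapter: int) -> str:
    """Build an AVOID instruction from this chapter's overused patterns."""
    with get_conn().cursor() as cur:
        cur.execute("""
            SELECT pattern_text FROM style_patterns
            WHERE chapter = %s AND usage_count >= max_per_chapter
        """, (chapter,))
        patterns = [row[0] for row in cur.fetchall()]
    if not patterns:
        return ""
    return "AVOID these overused phrasings: " + "; ".join(patterns)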

Update after scene generation:

-- Mark location details as narrated
UPDATE location_details 
SET narrated = TRUE 
WHERE location_id = 5 AND detail LIKE '%dusty shelves%';

-- Record that Alice now knows about the secret
INSERT INTO character_knowledge (character_id, event_id, learned_in_chapter)
VALUES (1, 42, 2);

-- Increment location usage
UPDATE locations 
SET usage_count = usage_count + 1 
WHERE name = 'Library';

The Comparison

| Aspect | RAG | PostgreSQL |
| --- | --- | --- |
| Setup complexity | High (embeddings, vector DB) | Medium (schema design) |
| Query type | Semantic similarity | Exact relational |
| Results | Top-k approximate | Complete, exact |
| Debugging | Hard ("why this chunk?") | Easy (EXPLAIN) |
| Relationships | Manual, in text | Native JOINs |
| Frequency control | Add-on | Native (usage_count) |
| Timeline | Lost in embeddings | ORDER BY timestamp |
| Knowledge tracking | Manual | Automatic with FKs |
| Cost | Vector DB + embeddings | Standard PostgreSQL |
| Best for | Unstructured exploration | Structured applications |

When RAG Still Makes Sense

I'm not saying RAG is useless. It has legitimate use cases:

✅ Use RAG When:

  1. Data is truly unstructured — Chat logs, free-form notes, scraped web pages with no consistent format

  2. Volume exceeds context — You have 10GB of documents and even the best chunking gives you 50MB of potentially relevant content

  3. Discovery is the goal — "Find me something related to X" where you don't know what you're looking for

  4. Approximate is acceptable — Recommendations, inspiration, brainstorming assistance

  5. Data changes constantly — News feeds, social media, where re-indexing is expensive

❌ Skip RAG When:

  1. Data has natural structure — Products, customers, transactions, policies, documentation

  2. Total data fits in context — Most knowledge bases under 100k tokens

  3. Precision matters — Legal, medical, financial, compliance

  4. Relationships are important — "Which customers bought X and also complained about Y?"

  5. You need auditability — "Why did the AI say this?" requires deterministic retrieval


The Hybrid Approach

You don't have to choose. Many systems benefit from both:

class HybridMemory:
    def __init__(self):
        self.postgres = PostgresMemory()  # Structured state
        self.rag = RAGMemory()            # Unstructured fallback

    def get_context(self, query, scene_requirements):
        # First: get structured context (deterministic)
        structured = self.postgres.get_scene_context(**scene_requirements)

        # Second: if there's room, add relevant unstructured content
        # (MAX_CONTEXT, count_tokens, and merge are assumed helpers)
        remaining_tokens = MAX_CONTEXT - count_tokens(structured)

        if remaining_tokens > 1000:
            unstructured = self.rag.search(query, max_tokens=remaining_tokens)
            return merge(structured, unstructured)

        return structured

The key insight: Structured first, RAG for gaps.


Implementation Checklist

If you're considering PostgreSQL over RAG, here's a practical checklist:

1. Analyze Your Data

  • [ ] Can it be structured into entities and relationships?
  • [ ] Does it have natural categories, types, or hierarchies?
  • [ ] Are there temporal aspects (timeline, versions, updates)?
  • [ ] Do you need to track "who knows what"?

2. Estimate Token Budget

  • [ ] Total structured data: _____ tokens
  • [ ] Available context window: _____ tokens
  • [ ] Ratio: If data < 50% of context, skip RAG
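
A quick way to fill in those blanks, assuming the tiktoken package (cl100k_base approximates recent OpenAI tokenizers; other providers count slightly differently, so treat the result as an estimate):

import tiktoken

def fits_without_rag(serialized_state: str, context_window: int) -> bool:
    """Apply the 50% rule of thumb from the checklist above."""
    enc = tiktoken.get_encoding("cl100k_base")
    data_tokens = len(enc.encode(serialized_state))
    return data_tokens < context_window // 2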

3. Design Schema

  • [ ] Core entities identified
  • [ ] Relationships mapped
  • [ ] Usage tracking added (for frequency control)
  • [ ] Knowledge tracking added (for multi-agent scenarios)

4. Build Context Serializer

  • [ ] Function to dump relevant state as JSON/text
  • [ ] Filters for scene/query relevance
  • [ ] Token counting to stay within limits

5. Test Determinism

  • [ ] Same query → same results?
  • [ ] Can you explain why each piece of context was included?
  • [ ] Edge cases handled (empty results, too many results)?

Conclusion: Think Before You Embed

RAG was a brilliant solution to the context length problem of 2022. In 2025, with 200k+ token context windows, it's often a solution in search of a problem.

Before building another RAG pipeline, ask:

  1. Does my data fit in context? If yes, just include it all.
  2. Is my data structured? If yes, use SQL queries, not semantic search.
  3. Do I need precision or exploration? Precision → PostgreSQL. Exploration → RAG.
  4. Am I using RAG because it's right, or because it's trendy?

The best architecture is the one that solves your actual problem with minimal complexity. Sometimes that's a sophisticated RAG pipeline with reranking and hybrid search. Sometimes it's a PostgreSQL database with good indexes and a JSON serializer.

Know the difference.


What's Next

In the next article, I'll show how we applied this PostgreSQL-based approach to build a multi-agent book generation system where AI characters have their own consciousness, memories, and voices. The coherent memory system described here is what allows characters to know things, remember events, and maintain consistent relationships across an entire novel.

Spoiler: When we switched from RAG to PostgreSQL, our characters stopped repeating themselves and started having coherent conversations.

Stay tuned.


If you found this useful, the code is open source: [GitHub link]


Appendix: Quick Reference

PostgreSQL Extensions for AI Workloads

-- Full-text search (built-in)
CREATE INDEX idx_fts ON documents 
USING GIN(to_tsvector('english', content));

-- Vector similarity (pgvector extension)
CREATE EXTENSION vector;
CREATE INDEX idx_embedding ON documents 
USING ivfflat (embedding vector_cosine_ops);

-- JSON querying (built-in)
SELECT data->>'name' FROM entities 
WHERE data @> '{"type": "character"}';
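
If you do go hybrid inside Postgres, the pgvector Python bindings keep the similarity side short too. A sketch, assuming psycopg 3 and the pgvector package (connection string, table, and column names are illustrative):

import numpy as np
import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("dbname=app")
register_vector(conn)  # teaches psycopg to adapt numpy arrays to vector columns

query_embedding = np.random.rand(1536).astype(np.float32)  # stand-in for a real embedding
rows = conn.execute(
    "SELECT content FROM documents ORDER BY embedding <=> %s LIMIT 5",  # <=> is cosine distance
    (query_embedding,),
).fetchall()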

When to Use What

┌───────────────────────────────────────────────────────┐
│                  DECISION FLOWCHART                   │
├───────────────────────────────────────────────────────┤
│                                                       │
│   Does your data have structure?                      │
│           │                                           │
│     ┌─────┴─────┐                                     │
│    YES          NO                                    │
│     │            │                                    │
│     ▼            ▼                                    │
│  PostgreSQL   Can you add structure?                  │
│     ▲            │                                    │
│     │      ┌─────┴─────┐                              │
│     │     YES          NO                             │
│     │      │            │                             │
│     └──────┘            ▼                             │
│                    Does it fit in context?            │
│                         │                             │
│                   ┌─────┴─────┐                       │
│                  YES          NO                      │
│                   │            │                      │
│                   ▼            ▼                      │
│              Full dump      RAG                       │
│              (no search)   (with all its tradeoffs)   │
│                                                       │
└───────────────────────────────────────────────────────┘

Cost Comparison (Rough Estimates)

| Solution | Monthly Cost | Setup Time | Maintenance |
| --- | --- | --- | --- |
| Pinecone + OpenAI Embeddings | $70-500 | 1-2 weeks | Ongoing |
| Weaviate Cloud | $50-300 | 1-2 weeks | Ongoing |
| PostgreSQL on RDS | $20-100 | 2-3 days | Minimal |
| PostgreSQL + pgvector | $20-100 | 3-5 days | Minimal |
| SQLite (local) | $0 | 1 day | None |
