Ponsubash Raj R

Posted on Jul 4

RAG Is Easy. Useful RAG Is the Hard Part

#ai #rag #postgres #systemdesign

Everybody says “just add RAG” like it is a button in settings.

It is not. I checked. Very disappointing.

The Brief: Personalized News Feeds

Pulse started as a personal AI intelligence feed.

Not a chatbot with a search bar glued to it. Not another app where an LLM confidently explains an article it has never seen. I wanted something more useful:

collect AI engineering content from RSS, GitHub, arXiv, and Gmail newsletters
summarize and classify articles
store embeddings
support exact, semantic, and hybrid search
answer questions from my own corpus
cite the articles it used
say “I do not know” when the corpus has no answer

That last part is important.

A RAG system that cannot say “I do not know” is not intelligent. It is just overconfident autocomplete in formal clothes.

The simple version looked like this:

Very clean. Very incomplete.

The useful version needed much more.


The Actual System Architecture

Pulse uses a FastAPI backend, PostgreSQL with pgvector, Groq for generation, and an Expo Android app.

At a high level:

For retrieval, the important database columns are:

class Article(Base):
    title: Mapped[str]
    summary: Mapped[str | None]
    category: Mapped[str | None]
    keywords: Mapped[list[str] | None]
    embedding: Mapped[list[float] | None] = mapped_column(Vector(384))
    embedding_model: Mapped[str]
    enrichment_status: Mapped[str]
    hidden: Mapped[bool]

The vector column uses pgvector, which supports vector similarity search inside Postgres, including cosine distance and approximate nearest-neighbor indexes (see the pgvector README).

PostgreSQL also provides built-in full-text search, covered in the full-text search chapter of the PostgreSQL documentation.

So Pulse does not choose between SQL search and vector search.

It uses both.

Because of course one search mode was too peaceful.

Why “Just Use Embeddings” Was Not Enough

Embeddings are useful. They are not magic.

If the user searches:

on-device foundation models

semantic search is great. It can find articles about local AI, small models, mobile inference, and related topics even if the exact words do not match.

But if the user searches:

Anthropic

exact search is often better. The word itself matters. I do not need a poetic interpretation of Anthropic. I need articles that mention Anthropic.

This is where pure vector search becomes annoying.

Vector search is good at meaning. Full-text search is good at exact language. A useful product usually needs both.

So Pulse supports three modes:

Exact      -> PostgreSQL full-text search
Semantic   -> pgvector cosine similarity
Hybrid     -> merge both result sets

Search Mode 1: Exact Search

Exact search uses PostgreSQL full-text search.

This works well for names, tools, companies, and terms that should match literally.

It is also fast and boring.

But boring is underrated. Many production systems are just boring things that work while exciting things are busy timing out.
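
As a rough sketch of what an exact-search query could look like: the table name `articles`, the choice of `title` plus `summary` as the searchable text, and the psycopg-style named placeholders are all assumptions, not Pulse's actual query. `to_tsvector`, `websearch_to_tsquery`, `ts_rank`, and the `@@` match operator are standard PostgreSQL.

```python
# Hypothetical table and column names; the real schema may differ.
EXACT_SEARCH_SQL = """
SELECT id, title,
       ts_rank(
           to_tsvector('english', title || ' ' || coalesce(summary, '')),
           websearch_to_tsquery('english', %(query)s)
       ) AS rank
FROM articles
WHERE hidden = false
  AND to_tsvector('english', title || ' ' || coalesce(summary, ''))
      @@ websearch_to_tsquery('english', %(query)s)
ORDER BY rank DESC
LIMIT %(limit)s
"""


def build_exact_search(query: str, limit: int = 20) -> tuple[str, dict]:
    """Return the SQL text and bound parameters for a full-text query."""
    cleaned = " ".join(query.split())  # collapse stray whitespace
    if not cleaned:
        raise ValueError("query must not be empty")
    return EXACT_SEARCH_SQL, {"query": cleaned, "limit": limit}
```

Computing `to_tsvector` per row is fine for a small personal corpus. A larger one would probably want a stored `tsvector` column with a GIN index instead.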

Search Mode 2: Semantic Search

Semantic search embeds the query and compares it with article embeddings using cosine distance.

query_embedding = await call_embedder(query_text)

distance = Article.embedding.cosine_distance(query_embedding)

rows = await session.execute(
    select(Article, distance)
    .where(
        Article.enrichment_status == "done",
        Article.embedding.is_not(None),
        Article.hidden.is_(False),
    )
    .order_by(distance, Article.ingested_at.desc())
    .limit(limit)
)
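
For intuition, here is what that distance means, written in plain Python. This is only an illustration of the math; in Pulse the database does this work, and the helper names below are made up for the example.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance = 1 - cosine similarity (0 means same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def to_similarity(distance: float) -> float:
    """Convert a cosine distance back to a similarity in [-1, 1]."""
    return 1.0 - distance
```

Smaller distance means a closer match, which is why the query sorts by distance ascending. Identical directions give 0, orthogonal vectors give 1, and opposite directions give 2.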

Search Mode 3: Hybrid Search

Hybrid search combines exact and semantic results using Reciprocal Rank Fusion.

The idea is simple:

score = 1 / (k + rank)

If an article ranks well in exact search and semantic search, it rises. If it ranks well in only one, it still has a chance.

We merge both result lists:

scores[article_id] += rrf_score(exact_rank)
scores[article_id] += rrf_score(semantic_rank)
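
A self-contained sketch of that merge follows. The function names, integer article IDs, and `k = 60` (a commonly used constant for Reciprocal Rank Fusion) are assumptions; the post does not say which value Pulse uses.

```python
from collections import defaultdict


def rrf_score(rank: int, k: int = 60) -> float:
    """Reciprocal rank contribution for a 1-based rank."""
    return 1.0 / (k + rank)


def fuse(exact_ids: list[int], semantic_ids: list[int], k: int = 60) -> list[int]:
    """Merge two ranked ID lists; the highest fused score comes first."""
    scores: dict[int, float] = defaultdict(float)
    for ranked in (exact_ids, semantic_ids):
        for rank, article_id in enumerate(ranked, start=1):
            scores[article_id] += rrf_score(rank, k)
    # Ties break on article ID so the ordering is deterministic.
    return sorted(scores, key=lambda aid: (-scores[aid], aid))


merged = fuse(exact_ids=[10, 20, 30], semantic_ids=[20, 40, 10])
# merged == [20, 10, 40, 30]: articles found by both modes outrank the rest.
```

RRF only looks at ranks, so the full-text rank and the cosine distance never need to be put on the same scale.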

This made hybrid the default.

Why?

Because users do not wake up thinking:

“Today I shall formulate a query that is best served by cosine similarity.”

They type words. The system should adapt.

Hybrid search lets exact names win when they should, while semantic matches still catch broader ideas.

Ask Mode: RAG With Brakes

The Ask mode is where retrieval becomes generation.

The user asks:

What are the recent themes around AI coding tools?

Pulse does this:

Here, the rejection step matters.

If the top retrieved articles are weak, Pulse does not call the LLM.

This is not a failure.

This is the product behaving responsibly.

If I ask:

What is the weather in Mumbai?

Pulse should not produce meteorology fan fiction.

It should say:

I do not have enough relevant context in the corpus.
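
One way such a gate could be written is sketched below. The threshold values, the `Hit` type, and the `call_llm` callable are all hypothetical; the post does not give Pulse's actual cutoffs.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical thresholds; tune them against real queries.
MIN_SIMILARITY = 0.35
MIN_STRONG_HITS = 2

REFUSAL = "I do not have enough relevant context in the corpus."


@dataclass
class Hit:
    article_id: int
    similarity: float  # 1 - cosine distance


def select_context(hits: list[Hit]) -> Optional[list[Hit]]:
    """Return the hits strong enough to ground an answer, or None to refuse."""
    strong = [h for h in hits if h.similarity >= MIN_SIMILARITY]
    if len(strong) < MIN_STRONG_HITS:
        return None
    return strong


def answer(question: str, hits: list[Hit], call_llm: Callable) -> str:
    context = select_context(hits)
    if context is None:
        return REFUSAL  # no LLM call, no quota spent
    return call_llm(question, context)
```

The refusal path returns before the model is ever invoked, so an off-topic question costs one embedding call and one database query, nothing more.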

Prompting With Context, Not Hope

The Ask prompt includes only controlled context:

Article ID
Title
Summary
URL
Similarity score
Recent conversation messages

Not raw HTML. Not full article bodies. Not the entire database. Not “please be accurate” as a magical spell.

A simplified prompt shape:

def build_ask_prompt(question, articles):
    context = "\n\n".join(
        f"[{article.id}]\n"
        f"Title: {article.title}\n"
        f"Summary: {article.summary}\n"
        f"URL: {article.url}"
        for article in articles
    )

    return f"""
Answer the user using only the context below.
If the context is not enough, say so.

Context:
{context}

Question:
{question}
"""

The answer includes citations back to article IDs and URLs.

This keeps the system grounded.

Not perfectly. Nothing with an LLM is perfect. But much better than letting the model free-climb the truth.
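
A cheap extra guard is to check the citations after generation. The sketch below assumes integer article IDs and that the model cites them in the same `[id]` form the prompt uses; both are assumptions, and the function is illustrative rather than Pulse's code.

```python
import re

CITATION_RE = re.compile(r"\[(\d+)\]")


def check_citations(answer: str, retrieved_ids: set[int]) -> tuple[set[int], set[int]]:
    """Split cited IDs into (valid, unknown) against the retrieved set."""
    cited = {int(match) for match in CITATION_RE.findall(answer)}
    return cited & retrieved_ids, cited - retrieved_ids


valid, unknown = check_citations(
    "Agents are trending [12][15]. Also see [99].",
    retrieved_ids={12, 15, 20},
)
# valid == {12, 15}; unknown == {99}, an ID that was never in the context.
```

An answer with unknown IDs, or with no citations at all, can be flagged or rejected before it reaches the user.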

Personalization: Ranking Is Also Retrieval

Search is not the only retrieval problem.

The feed itself is retrieval.

Pulse learns from reading behavior:

short reads are weak signals
longer reads are stronger signals
read categories update category weights
article keywords update interest terms
bookmarks and hidden articles affect what should appear

The engagement score is intentionally simple:

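def engagement_signal(duration_seconds: int) -> float | None:
    """Map read time to a signal in (0, 1]; None means too short to count."""
    if duration_seconds < 5:
        return None  # likely an accidental open; ignore it
    if duration_seconds < 30:
        return 0.2  # skimmed
    if duration_seconds < 120:
        return 0.5  # read with some attention
    return 1.0  # read properly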

No fake machine learning ceremony. No “neural preference engine” because I read one article for 14 seconds.

Category weights use an exponential moving average:

new_weight = old_weight + alpha * (signal - old_weight)
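
A small worked example of that update: the `alpha` of 0.2 and the starting weight of 0.5 are assumed values for illustration only.

```python
def update_weight(old_weight: float, signal: float, alpha: float = 0.2) -> float:
    """Move the weight a fraction alpha of the way toward the new signal."""
    return old_weight + alpha * (signal - old_weight)


weight = 0.5
for signal in (1.0, 1.0, 0.2):  # two long reads, then one short read
    weight = update_weight(weight, signal)
# weight goes 0.5 -> 0.6 -> 0.68 -> about 0.584
```

Each read nudges the weight instead of replacing it, so one skimmed article does not erase a history of long reads. A larger `alpha` makes the feed react faster and forget faster.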

The feed score combines:

importance + category preference + recency + keyword overlap
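
The post does not say how the four terms are weighted, so the sketch below invents weights and a recency half-life purely for illustration. It also assumes every input is already scaled to the range 0 to 1.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical weights and half-life, chosen only for illustration.
W_IMPORTANCE, W_CATEGORY, W_RECENCY, W_KEYWORDS = 0.4, 0.3, 0.2, 0.1
HALF_LIFE_HOURS = 48.0


def recency(published_at: datetime, now: datetime) -> float:
    """Exponential decay: 1.0 when fresh, 0.5 after one half-life."""
    age_hours = max((now - published_at).total_seconds() / 3600.0, 0.0)
    return 0.5 ** (age_hours / HALF_LIFE_HOURS)


def keyword_overlap(article_keywords: set[str], interests: set[str]) -> float:
    """Fraction of the article's keywords that match stored interest terms."""
    if not article_keywords:
        return 0.0
    return len(article_keywords & interests) / len(article_keywords)


def feed_score(importance, category_weight, published_at, keywords, interests, now):
    """Weighted sum of the four signals; everything comes from stored data."""
    return (
        W_IMPORTANCE * importance
        + W_CATEGORY * category_weight
        + W_RECENCY * recency(published_at, now)
        + W_KEYWORDS * keyword_overlap(keywords, interests)
    )


now = datetime(2025, 7, 4, tzinfo=timezone.utc)
score = feed_score(1.0, 0.5, now - timedelta(hours=48), {"rag", "pgvector"}, {"rag"}, now)
```

Nothing here calls a model at request time, which is what keeps the feed fast.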

Learning Features: RAG Was Only One Part Of The Loop

Once articles are cleaned, summarized, embedded, and ranked, other AI features become easier.

Pulse uses the same enriched corpus for:

1. Daily Digest

The digest selects recent high-importance enriched articles and asks Groq for a three-paragraph briefing.

This is not just summarization. It is scheduled synthesis.

2. Trends

Trend detection scans enriched entities from recent articles.

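from collections import defaultdict

mentions = defaultdict(set)  # normalized entity -> IDs of articles mentioning it

for article in recent_articles:  # assumed: enriched articles in the trend window
    for entity in article.entities:
        normalized_entity = entity.strip().lower()  # assumed normalization
        mentions[normalized_entity].add(article.id)

# An entity is a trend once at least three distinct articles mention it.
trends = [
    entity for entity, article_ids in mentions.items()
    if len(article_ids) >= 3
]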

This lets the app show repeated topics like companies, models, tools, or research themes.

3. LangGraph Quiz Agent

For learning retention, Pulse generates three-question quizzes from an article summary and entities.

LangGraph is useful for modeling multi-step agent flows.

Pulse uses the quiz flow for:

Quiz sessions are stored server-side with expiry. The answer key is not trusted from the client.

Because yes, even in a personal app, the client should not grade itself.
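
A minimal sketch of server-side grading with expiry is below. The in-memory dict, the 15-minute window, and the function names are assumptions; the post does not say how Pulse actually stores sessions.

```python
import secrets
import time
from typing import Optional

SESSION_TTL_SECONDS = 15 * 60  # hypothetical expiry window
_sessions: dict[str, dict] = {}


def create_session(answer_key: list[int], now: Optional[float] = None) -> str:
    """Store the answer key server-side and hand the client only a token."""
    session_id = secrets.token_urlsafe(16)
    created = time.time() if now is None else now
    _sessions[session_id] = {
        "answers": answer_key,
        "expires_at": created + SESSION_TTL_SECONDS,
    }
    return session_id


def grade(session_id: str, submitted: list[int], now: Optional[float] = None) -> Optional[int]:
    """Return the number of correct answers, or None if the session is gone."""
    current = time.time() if now is None else now
    session = _sessions.pop(session_id, None)  # one attempt per session
    if session is None or current > session["expires_at"]:
        return None
    return sum(a == b for a, b in zip(session["answers"], submitted))
```

The client only ever sees the token and its own answers. The key never leaves the server, and an expired or reused token simply grades to nothing.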

The Product Rule: Retrieval Before Generation

The biggest design rule became:

Retrieve first. Generate second. Refuse when retrieval is weak.

That rule shows up everywhere:

Search can run without Groq.
Ask mode refuses unrelated questions before spending quota.
Digest uses selected articles, not the entire database.
Quiz generation only works on enriched articles.
Feed ranking uses stored signals, not live model calls.

This made the system cheaper, faster, and less ridiculous.

LLMs are powerful. They are also expensive, rate-limited, and occasionally very committed to being wrong.

So Pulse uses them where they add value, and keeps boring deterministic code around them.

The Final Shape

The final RAG architecture looked like this:

That is more work than:

documents -> embeddings -> chatbot

Takeaway

RAG is easy when the input data is clean, the query is friendly, and nobody asks anything weird.

Useful RAG is different.

Useful RAG needs:

clean source data
validated enrichment
exact search
semantic search
hybrid ranking
relevance thresholds
citations
refusal paths
personalization

The hard part is not putting vectors in a database.

The hard part is deciding when the vector result is not good enough.

The hard part is not calling the LLM.

The hard part is knowing when not to call it.

That is what made Pulse useful.

Not because it could answer everything.

Because it knew when it could not.

DEV Community