DEV Community

JustATalentedGuy

RAG Is Easy. Useful RAG Is the Hard Part

Everybody says “just add RAG” like it is a button in settings.

It is not. I checked. Very disappointing.

The Brief: Personalized News Feeds

Pulse started as a personal AI intelligence feed.

Not a chatbot with a search bar glued to it. Not another app where an LLM confidently explains an article it has never seen. I wanted something more useful:

  • collect AI engineering content from RSS, GitHub, arXiv, and Gmail newsletters
  • summarize and classify articles
  • store embeddings
  • support exact, semantic, and hybrid search
  • answer questions from my own corpus
  • cite the articles it used
  • say “I do not know” when the corpus has no answer

[Figure: Home Page]

That last part is important.

A RAG system that cannot say “I do not know” is not intelligent. It is just overconfident autocomplete in formal clothes.

[Figure: Retrieving a suitable article for the question]

The simple version looked like this:

[Figure: Simple flow]

Very clean. Very incomplete.

The useful version needed much more.

The Actual System Architecture

Pulse uses a FastAPI backend, PostgreSQL with pgvector, Groq for generation, and an Expo Android app.

At a high level:

[Figure: Flowchart]

For retrieval, the important database columns are:

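# Excerpt only: Base, the primary key, and the remaining columns are omitted.
# The imports assume pgvector's SQLAlchemy integration.
from pgvector.sqlalchemy import Vector
from sqlalchemy.orm import Mapped, mapped_column

class Article(Base):
    title: Mapped[str]
    summary: Mapped[str | None]
    category: Mapped[str | None]
    keywords: Mapped[list[str] | None]
    # 384-dimensional embedding stored in a pgvector column
    embedding: Mapped[list[float] | None] = mapped_column(Vector(384))
    embedding_model: Mapped[str]
    enrichment_status: Mapped[str]
    hidden: Mapped[bool]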

The vector column uses pgvector, which supports vector similarity search inside Postgres, including cosine distance and approximate indexes (see the pgvector README).

PostgreSQL also provides full-text search, described in the PostgreSQL full-text search docs.

So Pulse does not choose between SQL search and vector search.

It uses both.

Because of course one search mode was too peaceful.

Why “Just Use Embeddings” Was Not Enough

Embeddings are useful. They are not magic.

If the user searches:

on-device foundation models

semantic search is great. It can find articles about local AI, small models, mobile inference, and related topics even if the exact words do not match.

But if the user searches:

Anthropic

exact search is often better. The word itself matters. I do not need a poetic interpretation of Anthropic. I need articles that mention Anthropic.

This is where pure vector search becomes annoying.

Vector search is good at meaning. Full-text search is good at exact language. A useful product usually needs both.

So Pulse supports three modes:

Exact      -> PostgreSQL full-text search
Semantic   -> pgvector cosine similarity
Hybrid     -> merge both result sets

Search Mode 1: Exact Search

Exact search uses PostgreSQL full-text search.

This works well for names, tools, companies, and terms that should match literally.

It is also fast and boring.

But boring is underrated. Many production systems are just boring things that work while exciting things are busy timing out.
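
For a rough idea of what the exact mode could look like in SQL, here is a minimal sketch. The `articles` table name, the `english` text search configuration, the choice of `plainto_tsquery`, and the helper `exact_search_params` are assumptions for illustration, not Pulse's actual query. `to_tsvector`, `plainto_tsquery`, `ts_rank`, and the `@@` match operator are standard PostgreSQL full-text search features.

```python
# Hypothetical query: table name, text search config, and ranking are assumptions.
EXACT_SEARCH_SQL = """
SELECT id, title,
       ts_rank(doc, plainto_tsquery('english', %(query)s)) AS rank
FROM (
    SELECT id, title, hidden, enrichment_status,
           to_tsvector('english',
                       coalesce(title, '') || ' ' || coalesce(summary, '')) AS doc
    FROM articles
) AS a
WHERE hidden IS FALSE
  AND enrichment_status = 'done'
  AND doc @@ plainto_tsquery('english', %(query)s)
ORDER BY rank DESC
LIMIT %(limit)s
"""


def exact_search_params(query_text: str, limit: int = 20) -> dict:
    """Build bound parameters for EXACT_SEARCH_SQL (psycopg-style named placeholders)."""
    cleaned = " ".join(query_text.split())  # collapse stray whitespace
    if not cleaned:
        raise ValueError("query must not be empty")
    return {"query": cleaned, "limit": limit}
```

Passing the query as a bound parameter keeps user input out of the SQL string. In a real table, a stored `tsvector` column with a GIN index would likely replace the inline `to_tsvector` call.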

Search Mode 2: Semantic Search

Semantic search embeds the query and compares it with article embeddings using cosine distance.

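from sqlalchemy import select

# Embed the query text into a vector
query_embedding = await call_embedder(query_text)

# pgvector cosine distance: lower means more similar
distance = Article.embedding.cosine_distance(query_embedding)

rows = await session.execute(
    select(Article, distance)
    .where(
        # Only enriched, embedded, visible articles are searchable
        Article.enrichment_status == "done",
        Article.embedding.is_not(None),
        Article.hidden.is_(False),
    )
    # Closest first; newer articles break ties
    .order_by(distance, Article.ingested_at.desc())
    .limit(limit)
)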
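
To make the ordering concrete: cosine distance is one minus cosine similarity, so 0 means "same direction" and larger values mean "less related". A small pure-Python sketch, purely for illustration (the function name and toy vectors are made up, and the real comparison happens inside Postgres over 384-dimensional embeddings):

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


same_direction = cosine_distance([1.0, 0.0], [2.0, 0.0])   # close to 0
unrelated = cosine_distance([1.0, 0.0], [0.0, 3.0])        # close to 1
```

Vector length does not matter here, only direction, which is why cosine distance is a common choice for text embeddings.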

Search Mode 3: Hybrid Search

Hybrid search combines exact and semantic results using Reciprocal Rank Fusion.

The idea is simple:

score = 1 / (k + rank)

If an article ranks well in exact search and semantic search, it rises. If it ranks well in only one, it still has a chance.

We merge both result lists:

scores[article_id] += rrf_score(exact_rank)
scores[article_id] += rrf_score(semantic_rank)

This made hybrid the default.

Why?

Because users do not wake up thinking:

“Today I shall formulate a query that is best served by cosine similarity.”

They type words. The system should adapt.

Hybrid search lets exact names win when they should, while semantic matches still catch broader ideas.
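
The fusion step described above can be sketched in a few lines. This is a minimal version under assumptions: ranks are 1-based, `k = 60` is a commonly used constant rather than a value confirmed by Pulse, and `fuse` is a hypothetical helper name.

```python
from collections import defaultdict


def rrf_score(rank: int, k: int = 60) -> float:
    """Reciprocal Rank Fusion contribution for a 1-based rank."""
    return 1.0 / (k + rank)


def fuse(exact_ids: list[int], semantic_ids: list[int], k: int = 60) -> list[int]:
    """Merge two ranked ID lists; the best fused score comes first."""
    scores: dict[int, float] = defaultdict(float)
    for ranked in (exact_ids, semantic_ids):
        for rank, article_id in enumerate(ranked, start=1):
            scores[article_id] += rrf_score(rank, k)
    return sorted(scores, key=lambda article_id: scores[article_id], reverse=True)


# Articles 1 and 3 appear in both lists, so they outrank 2 and 4.
print(fuse([1, 2, 3], [3, 1, 4]))  # prints [1, 3, 2, 4]
```

Because RRF only looks at ranks, it never has to reconcile a `ts_rank` score with a cosine distance, which live on unrelated scales.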

Ask Mode: RAG With Brakes

The Ask mode is where retrieval becomes generation.

The user asks:

What are the recent themes around AI coding tools?

Pulse does this:

[Figure: Answering Flow]

Here, the rejection step matters.

If the top retrieved articles are weak, Pulse does not call the LLM.

This is not a failure.

This is the product behaving responsibly.

If I ask:

What is the weather in Mumbai?

Pulse should not produce meteorology fan fiction.

It should say:

I do not have enough relevant context in the corpus.
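
One way to express this gate in code is a threshold check that runs before any generation call. The sketch below is an illustration, not Pulse's implementation: the `Hit` type, the `ask` function, the injected `call_llm` callable, and the `0.35` threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

REFUSAL = "I do not have enough relevant context in the corpus."


@dataclass
class Hit:
    article_id: int
    similarity: float  # assumed to be 1 - cosine distance, higher is better


def ask(
    question: str,
    hits: list[Hit],
    call_llm: Callable[[str, list[Hit]], str],
    min_similarity: float = 0.35,  # hypothetical threshold, tune on real queries
) -> str:
    """Only spend an LLM call when at least one hit clears the threshold."""
    strong = [hit for hit in hits if hit.similarity >= min_similarity]
    if not strong:
        return REFUSAL  # refuse before generation, so no quota is used
    return call_llm(question, strong)
```

Injecting `call_llm` keeps the gate testable without a network call, and makes it easy to verify that a weak retrieval never reaches the model.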

Prompting With Context, Not Hope

The Ask prompt includes only controlled context:

Article ID
Title
Summary
URL
Similarity score
Recent conversation messages

Not raw HTML. Not full article bodies. Not the entire database. Not “please be accurate” as a magical spell.

A simplified prompt shape:

def build_ask_prompt(question, articles):
    context = "\n\n".join(
        f"[{article.id}]\n"
        f"Title: {article.title}\n"
        f"Summary: {article.summary}\n"
        f"URL: {article.url}"
        for article in articles
    )

    return f"""
Answer the user using only the context below.
If the context is not enough, say so.

Context:
{context}

Question:
{question}
"""

The answer includes citations back to article IDs and URLs.

This keeps the system grounded.

Not perfectly. Nothing with an LLM is perfect. But much better than letting the model free-climb the truth.
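
Citations are also something deterministic code can check after generation. A minimal sketch, assuming the model cites articles in the same `[id]` form used in the prompt and that IDs are integers (both assumptions, as are the helper names):

```python
import re


def cited_ids(answer: str) -> set[int]:
    """Collect every [123]-style article ID mentioned in the answer."""
    return {int(match) for match in re.findall(r"\[(\d+)\]", answer)}


def unknown_citations(answer: str, context_ids: set[int]) -> set[int]:
    """IDs the model cited that were never in the prompt context."""
    return cited_ids(answer) - context_ids


# Article 99 was not part of the context, so it gets flagged.
suspicious = unknown_citations("Agents are trending [12][99].", {12, 40})
```

A non-empty result is a cheap signal that the answer cites something the retrieval step never supplied.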

Personalization: Ranking Is Also Retrieval

Search is not the only retrieval problem.

The feed itself is retrieval.

Pulse learns from reading behavior:

  • short reads are weak signals
  • longer reads are stronger signals
  • read categories update category weights
  • article keywords update interest terms
  • bookmarks and hidden articles affect what should appear

The engagement score is intentionally simple:

def engagement_signal(duration_seconds: int) -> float | None:
    """Map read duration to a signal strength; None means too short to count."""
    if duration_seconds < 5:
        return None
    if duration_seconds < 30:
        return 0.2
    if duration_seconds < 120:
        return 0.5
    return 1.0

No fake machine learning ceremony. No “neural preference engine” because I read one article for 14 seconds.

Category weights use an exponential moving average:

new_weight = old_weight + alpha * (signal - old_weight)
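
As a small worked sketch of that update rule: the formula is equivalent to `(1 - alpha) * old_weight + alpha * signal`, so each read pulls the weight a fraction `alpha` of the way toward the new signal. The `alpha` value and starting weight below are hypothetical, since the article does not state them.

```python
def update_weight(old_weight: float, signal: float, alpha: float = 0.3) -> float:
    """Exponential moving average: move old_weight a fraction alpha toward signal."""
    return old_weight + alpha * (signal - old_weight)


weight = 0.5  # hypothetical starting weight for one category
history = []
for signal in (1.0, 1.0, 0.2):  # two long reads, then a short one
    weight = update_weight(weight, signal)
    history.append(weight)  # rises after the long reads, drops after the short one
```

A smaller `alpha` makes preferences change slowly, while a larger one lets a single read swing the feed.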

The feed score combines:

importance + category preference + recency + keyword overlap
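
The article does not give the exact formula, so the following is only one plausible reading of that sum. Equal weighting, the exponential recency decay with a 48-hour half-life, and Jaccard overlap for keywords are all assumptions, as is the `feed_score` name.

```python
from datetime import datetime, timezone


def feed_score(
    importance: float,              # assumed to be in [0, 1]
    category_weight: float,         # learned weight for the article's category
    published_at: datetime,
    article_keywords: set[str],
    interest_terms: set[str],
    now: datetime,
    half_life_hours: float = 48.0,  # hypothetical recency half-life
) -> float:
    """Add the four ingredients with equal weight (an assumption)."""
    age_hours = max((now - published_at).total_seconds() / 3600.0, 0.0)
    recency = 0.5 ** (age_hours / half_life_hours)  # 1.0 when new, 0.5 after one half-life
    union = article_keywords | interest_terms
    # Jaccard overlap between article keywords and learned interest terms
    overlap = len(article_keywords & interest_terms) / len(union) if union else 0.0
    return importance + category_weight + recency + overlap


now = datetime(2024, 1, 3, tzinfo=timezone.utc)
score = feed_score(
    0.8, 0.6, datetime(2024, 1, 1, tzinfo=timezone.utc),
    {"rag", "agents"}, {"rag", "llm"}, now,
)
```

Every input here is a stored value, which matches the idea that feed ranking should not need a live model call.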

Learning Features: RAG Was Only One Part Of The Loop

Once articles are cleaned, summarized, embedded, and ranked, other AI features become easier.

Pulse uses the same enriched corpus for:

1. Daily Digest

The digest selects recent high-importance enriched articles and asks Groq for a three-paragraph briefing.

[Figure: Daily Digest Flow]

This is not just summarization. It is scheduled synthesis.
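
The selection step before the Groq call could look roughly like this. It is a sketch under assumptions: the `DigestCandidate` type, the 24-hour window, the `0.7` importance cutoff, and the limit of 8 articles are hypothetical values, not Pulse's settings.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DigestCandidate:
    article_id: int
    importance: float
    ingested_at: datetime


def select_for_digest(
    candidates: list[DigestCandidate],
    now: datetime,
    window_hours: int = 24,       # hypothetical "recent" window
    min_importance: float = 0.7,  # hypothetical "high-importance" cutoff
    limit: int = 8,               # hypothetical cap on prompt size
) -> list[DigestCandidate]:
    """Keep recent, important articles and return the top few by importance."""
    cutoff = now - timedelta(hours=window_hours)
    picked = [
        c for c in candidates
        if c.ingested_at >= cutoff and c.importance >= min_importance
    ]
    picked.sort(key=lambda c: c.importance, reverse=True)
    return picked[:limit]
```

Capping the selection keeps the digest prompt small and the cost of each scheduled run predictable.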

2. Trends

Trend detection scans enriched entities from recent articles.

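from collections import defaultdict

# recent_articles is a placeholder name for the enriched articles in the trend window
mentions = defaultdict(set)
for article in recent_articles:
    for entity in article.entities:
        # Assumed normalization so "OpenAI" and "openai " count as one entity
        mentions[entity.strip().lower()].add(article.id)

# An entity is a trend once at least three distinct articles mention it
trends = [
    entity for entity, article_ids in mentions.items()
    if len(article_ids) >= 3
]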

This lets the app show repeated topics like companies, models, tools, or research themes.

3. LangGraph Quiz Agent

For learning retention, Pulse generates three-question quizzes from an article summary and entities.

LangGraph is useful for modeling multi-step agent flows.

Pulse uses it for the quiz flow:

[Figure: Quiz Flow]

Quiz sessions are stored server-side with expiry. The answer key is not trusted from the client.

Because yes, even in a personal app, the client should not grade itself.

The Product Rule: Retrieval Before Generation

The biggest design rule became:

Retrieve first. Generate second. Refuse when retrieval is weak.

That rule shows up everywhere:

  • Search can run without Groq.
  • Ask mode refuses unrelated questions before spending quota.
  • Digest uses selected articles, not the entire database.
  • Quiz generation only works on enriched articles.
  • Feed ranking uses stored signals, not live model calls.

This made the system cheaper, faster, and less ridiculous.

LLMs are powerful. They are also expensive, rate-limited, and occasionally very committed to being wrong.

So Pulse uses them where they add value, and keeps boring deterministic code around them.

The Final Shape

The final RAG architecture looked like this:

[Figure: Ingestion Pipeline]

[Figure: Question Answering Pipeline]

That is more work than:

documents -> embeddings -> chatbot

Takeaway

RAG is easy when the input data is clean, the query is friendly, and nobody asks anything weird.

Useful RAG is different.

Useful RAG needs:

  • clean source data
  • validated enrichment
  • exact search
  • semantic search
  • hybrid ranking
  • relevance thresholds
  • citations
  • refusal paths
  • personalization

The hard part is not putting vectors in a database.

The hard part is deciding when the vector result is not good enough.

The hard part is not calling the LLM.

The hard part is knowing when not to call it.

That is what made Pulse useful.

Not because it could answer everything.

Because it knew when it could not.
