I Built a Self-Hosted RAG System for Journalism: What Production Retrieval Taught Me

#rag #mcp #postgresql #agents

Over the last few months, I built Atlas — a fully self-hosted retrieval system designed for journalism workflows. No paid APIs. No hosted vector databases or AI infrastructure. Just local models, PostgreSQL, pgvector, Celery, and a retrieval pipeline built to survive production traffic.

I originally thought this would mostly be an infrastructure project. It wasn't. The hardest lessons appeared after deployment — when assumptions broke, retrieval quality drifted, and tiny implementation decisions started affecting reliability.

What does Atlas do?

Atlas ingests live RSS feeds from BBC, Guardian, NYT, NPR, Deutsche Welle and more every 15 minutes, embeds content locally using sentence-transformers, stores vectors in PostgreSQL with pgvector, and answers questions with source-grounded citations.

Beyond search it has:

Grounded Q&A — every answer maps to an exact source passage
Claim-level fact-checking — splits text into claims, scores each against evidence
Story brief generation — key facts, open questions, suggested angles for reporters
Multi-format repurposing — one topic becomes newsletter, social post, audio script, headline
A full story workspace — source notebooks, drafts, editorial review, version diff, publish readiness

https://github.com/PreethaRaj/atlas-editorial-intelligence/releases/download/v1.0.0/SearchAnswer.gif

https://github.com/PreethaRaj/atlas-editorial-intelligence/releases/download/v1.0.0/PartnerMode.gif

The retrieval pipeline

Here is the full pipeline before I get into the lessons:

Query string
    │
    ├── embed(query) → vector cosine > 0.30 → top 20 chunks
    ├── websearch_to_tsquery → PostgreSQL FTS → top 20 chunks
    └── Title FTS boost → top 10 articles
              │
              ▼
         RRF merge (k=60)
              │
         recency blend (85% relevance + 15% freshness)
              │
         post-cosine gate > 0.12
              │
         Policy engine (public / partner / paywall)
              │
         Response + inline citations
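
The recency blend step can be sketched as a weighted sum. This is a minimal illustration, not the actual Atlas code: only the 85/15 split comes from the diagram above. The exponential decay, the 24-hour half-life, and the function names are assumptions, and the sketch assumes the relevance score has already been scaled to the 0 to 1 range (raw RRF scores are much smaller than that).

```python
from datetime import datetime, timedelta, timezone

RELEVANCE_WEIGHT = 0.85   # from the pipeline diagram
FRESHNESS_WEIGHT = 0.15   # from the pipeline diagram
HALF_LIFE_HOURS = 24.0    # assumption: freshness halves every 24 hours


def freshness(published_at: datetime, now: datetime) -> float:
    """Exponential decay in (0, 1]; 1.0 for an article published right now."""
    age_hours = max((now - published_at).total_seconds() / 3600.0, 0.0)
    return 0.5 ** (age_hours / HALF_LIFE_HOURS)


def blended_score(relevance: float, published_at: datetime, now: datetime) -> float:
    """Blend a relevance score (assumed to be in [0, 1]) with freshness."""
    return RELEVANCE_WEIGHT * relevance + FRESHNESS_WEIGHT * freshness(published_at, now)


# Two articles with identical relevance but different ages.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
fresh = blended_score(0.60, now - timedelta(hours=1), now)
stale = blended_score(0.60, now - timedelta(days=7), now)
print(f"fresh={fresh:.3f} stale={stale:.3f}")
```

With a 15% weight, freshness can only break near-ties: a week-old article with clearly higher relevance still outranks a fresh but weaker one.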

Lesson 1 — Pure vector search fails for news

This surprised me. I assumed a good embedding model would handle everything. It does not — at least not for current events journalism.

The problem: proper nouns.

Words like Philippines, Kishida, Rafah, Starmer are rare in any model's training data relative to their importance in daily news. The cosine similarity between "Japan missile exports Philippines" and an article titled "Tokyo defence deal with Manila confirmed" was 0.28 — just below my original threshold of 0.30.

The article was clearly relevant. The vector search missed it completely.

Full-text search caught it immediately because Japan, missile, Philippines all appeared in the article text.

The fix was hybrid search. Vector catches semantic similarity. FTS catches proper nouns and exact terminology. Neither is sufficient alone for a news corpus.

# Three search paths merged with RRF
# Path 1: vector cosine similarity
# Path 2: websearch_to_tsquery (handles "Japan Philippines" as two terms)
# Path 3: title-specific FTS (weighted 0.7x to avoid title-only noise)

# RRF merge — no score normalisation needed because it only uses rank position
# final_score = Σ 1 / (60 + rank_i)
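
The RRF formula above fits in a few lines of Python. This is a sketch under stated assumptions rather than the Atlas implementation: it assumes every path returns IDs that can be compared directly (the title path returns articles while the other two return chunks, so some mapping is needed in practice), and it assumes the 0.7x title weight is applied as a multiplier on that path's RRF contribution. The function name and the sample IDs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional

RRF_K = 60  # constant from the pipeline diagram


def rrf_merge(
    ranked_lists: List[List[str]],
    weights: Optional[List[float]] = None,
    k: int = RRF_K,
) -> Dict[str, float]:
    """Reciprocal rank fusion: each list contributes weight / (k + rank)."""
    if weights is None:
        weights = [1.0] * len(ranked_lists)
    scores: Dict[str, float] = defaultdict(float)
    for ids, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ids, start=1):  # ranks start at 1
            scores[doc_id] += weight / (k + rank)
    return dict(scores)


# Hypothetical results from the three paths, best hit first.
vector_hits = ["a2", "a7", "a1"]
fts_hits = ["a1", "a2", "a9"]
title_hits = ["a1"]

scores = rrf_merge([vector_hits, fts_hits, title_hits], weights=[1.0, 1.0, 0.7])
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Because only rank positions enter the sum, a cosine similarity and a ts_rank value never have to be put on the same scale.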

Lesson 2 — Batch embedding is not a micro-optimisation

I was calling embed() once per article for the first two weeks. Here is what that looks like:

17 feeds × 30 articles × embed(1 article) × 100ms = 51 seconds per ingest cycle

After switching to batch embedding — collect all articles, call embed([t1, t2, ..., tN]) once:

17 feeds × 30 articles = 510 articles
embed(510 articles)    = ~3 seconds total

17× faster (51 seconds down to about 3). Much of the inference cost is fixed overhead per call rather than per item, so one batched call amortises it across all 510 articles. This is obvious from the PyTorch documentation, but I had not read it carefully enough.

# Before — slow
for article in articles:
    vec = embed(article.content)
    insert_embedding(article.id, vec)

# After — fast
contents = [a.content for a in articles]
vecs     = embed(contents)   # single call, returns (N, 384) array
for article, vec in zip(articles, vecs):
    insert_embedding(article.id, vec)
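
One caveat when batching: a single call over an unbounded list can exhaust memory if an ingest cycle is unusually large. sentence-transformers' encode() accepts a list and a batch_size argument and chunks internally, so a single call is probably enough there. For embedders that do not chunk, or when you want to write to the database between chunks, a helper along these lines is one option. The names embed_in_batches, chunked, and fake_embed are hypothetical, and fake_embed only stands in for a real model so the sketch runs anywhere.

```python
from typing import Callable, Iterator, List, Sequence

Vector = List[float]
EmbedFn = Callable[[Sequence[str]], List[Vector]]


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(texts: Sequence[str], embed_fn: EmbedFn, batch_size: int = 64) -> List[Vector]:
    """Call embed_fn once per chunk and return vectors in input order."""
    vectors: List[Vector] = []
    for batch in chunked(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors


# Stand-in embedder: records batch sizes and returns a 1-d "vector" per text.
calls: List[int] = []


def fake_embed(batch: Sequence[str]) -> List[Vector]:
    calls.append(len(batch))
    return [[float(len(text))] for text in batch]


texts = ["x" * n for n in range(510)]
vecs = embed_in_batches(texts, fake_embed, batch_size=64)
print(len(vecs), "vectors from", len(calls), "calls")
```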

Lesson 3 — The cosine threshold is your precision-recall dial

Atlas has two thresholds:

COSINE_MIN      = 0.30   # SQL WHERE — pre-filter before leaving DB
POST_COSINE_MIN = 0.12   # post-RRF — sanity gate after merge

What I learned tuning these:

Threshold	Effect
0.45	Missed "Japan missile Philippines" — too restrictive
0.30	Good balance for a news corpus
0.20	Sports results started appearing for political queries

The intuition: news articles about related topics often use completely different vocabulary than the query. A threshold of 0.30 allows the model to bridge that vocabulary gap. A threshold of 0.45 requires the query and article to use nearly identical language — which defeats the purpose of semantic search.

POST_COSINE_MIN = 0.12 exists only to handle FTS-only hits, which never passed through the COSINE_MIN filter. When an article is found by keyword search but has almost no semantic overlap with the query (cosine at or below 0.12), the keyword match was probably accidental. The post-filter removes those.
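
A minimal sketch of the post-merge gate, assuming a cosine value is available for every merged hit, including those found only by FTS. The Hit dataclass, the function name, and the sample scores are hypothetical; the two thresholds are the ones listed above.

```python
from dataclasses import dataclass
from typing import List

COSINE_MIN = 0.30        # applied in SQL on the vector path only
POST_COSINE_MIN = 0.12   # applied to every hit after the RRF merge


@dataclass
class Hit:
    chunk_id: str
    cosine: float      # similarity between query and chunk embeddings
    rrf_score: float   # fused score from the merge step


def post_cosine_gate(hits: List[Hit], minimum: float = POST_COSINE_MIN) -> List[Hit]:
    """Drop hits with cosine <= minimum, then order the rest by fused score."""
    kept = [hit for hit in hits if hit.cosine > minimum]
    return sorted(kept, key=lambda hit: hit.rrf_score, reverse=True)


merged = [
    Hit("strong-match", cosine=0.41, rrf_score=0.032),   # found by vector and FTS
    Hit("fts-only-good", cosine=0.28, rrf_score=0.016),  # below COSINE_MIN, kept via FTS
    Hit("fts-only-noise", cosine=0.05, rrf_score=0.015), # accidental keyword match
]
survivors = [hit.chunk_id for hit in post_cosine_gate(merged)]
print(survivors)
```

The middle hit is the interesting case: at 0.28 it fails the SQL pre-filter on the vector path, but it arrives through FTS and clears the looser post-merge gate.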

Lesson 4 — Celery beat scheduling has a startup timing problem

The beat schedule runs ingest_all_feeds every 15 minutes. But there is a subtle issue with crontab(minute='*/15'): on a fresh deploy, the first beat fires at the next :00, :15, :30, or :45 UTC boundary, not 15 minutes from startup.

Deploy at 14:01 → first ingest at 14:15  ✓ fine
Deploy at 14:14 → first ingest at 14:15  ✓ fine
Deploy at 14:00:01 → first ingest at 14:15  ✗ 15 minute corpus gap on first launch

The fix was timedelta(minutes=15) instead of crontab(minute='*/15'), so the interval is measured from beat startup rather than aligned to the clock.
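
The difference between the two schedule types can be checked with plain datetime arithmetic. This sketch does not call Celery. It assumes beat starts at the moment of deploy with no persisted schedule state, that a crontab schedule fires on the next quarter-hour boundary, and that an interval schedule first fires one full interval after startup. The function names are hypothetical.

```python
from datetime import datetime, timedelta

INTERVAL = timedelta(minutes=15)


def next_quarter_hour(now: datetime) -> datetime:
    """Next :00/:15/:30/:45 boundary strictly after `now` (clock-aligned)."""
    floored = now.replace(minute=now.minute - now.minute % 15, second=0, microsecond=0)
    return floored + INTERVAL


def first_interval_run(started_at: datetime) -> datetime:
    """First run of an interval schedule, measured from startup."""
    return started_at + INTERVAL


for deploy in (
    datetime(2024, 1, 1, 14, 0, 1),
    datetime(2024, 1, 1, 14, 1, 0),
    datetime(2024, 1, 1, 14, 14, 0),
):
    clock_wait = next_quarter_hour(deploy) - deploy
    interval_wait = first_interval_run(deploy) - deploy
    print(f"deploy {deploy:%H:%M:%S}: clock-aligned wait {clock_wait}, interval wait {interval_wait}")
```

Under these assumptions the clock-aligned wait depends entirely on when the deploy happens, while the interval wait is always a predictable 15 minutes that a separate startup ingest can cover.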

The startup_ingest task also checks corpus article count before honouring the Redis dedup flag. Empty corpus → ingest regardless. This handles docker-compose down -v (fresh database) correctly.
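
The decision inside startup_ingest can be sketched as a small pure function. This is an illustration only: the real task reads the count from PostgreSQL and the flag from Redis, whereas here both are passed in as callables so the logic runs without either service. All function and parameter names are hypothetical.

```python
from typing import Callable


def should_run_startup_ingest(article_count: int, dedup_flag_set: bool) -> bool:
    """An empty corpus always ingests; otherwise the dedup flag decides."""
    if article_count == 0:
        return True
    return not dedup_flag_set


def startup_ingest(
    count_articles: Callable[[], int],
    flag_is_set: Callable[[], bool],
    set_flag: Callable[[], None],
    ingest: Callable[[], None],
) -> bool:
    """Run the ingest if needed and report whether it ran."""
    if not should_run_startup_ingest(count_articles(), flag_is_set()):
        return False
    ingest()
    set_flag()  # set only after a successful ingest
    return True


# In-memory stand-ins for the database and Redis.
state = {"articles": 0, "flag": True}  # fresh database, stale flag left in Redis


def fake_ingest() -> None:
    state["articles"] += 510


def fake_set_flag() -> None:
    state["flag"] = True


ran_first = startup_ingest(lambda: state["articles"], lambda: state["flag"], fake_set_flag, fake_ingest)
ran_second = startup_ingest(lambda: state["articles"], lambda: state["flag"], fake_set_flag, fake_ingest)
print(ran_first, ran_second)
```

The first call ingests despite the stale flag because the corpus is empty. The second call is skipped because the corpus is populated and the flag is set.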

from datetime import timedelta

beat_schedule = {
    "ingest-every-15-min": {
        "task":     "tasks.ingest_all_feeds",
        "schedule": timedelta(minutes=15),  # from startup, not clock-aligned
    },
    "startup-ingest-once": {
        "task":     "tasks.startup_ingest",
        "schedule": timedelta(hours=24),    # re-dispatched daily; Redis dedup flag prevents repeat ingests
    },
}

Lesson 5 — The Docker healthcheck dependency chain matters

This one took me an embarrassing amount of time.

celery-beat:
  depends_on:
    celery-worker:
      condition: service_healthy   # ← this line is critical

Without service_healthy, beat starts immediately and dispatches tasks before any worker is ready to consume them. The tasks sit in the queue, beat fires again in 15 minutes, tasks pile up.

With service_healthy, beat waits until a worker is confirmed ready. Clean startup every time.

The worker healthcheck uses celery inspect ping, which confirms the worker is connected to the broker and responding to control commands, not just that the container started.
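
The same check can be expressed in Python with Celery's inspect API, which is handy if the healthcheck should do more than ping. This is a sketch, not the Atlas healthcheck: it assumes a Celery app object is importable (for example from a hypothetical tasks module), and the function names are made up. inspect().ping() returns a mapping of worker name to {"ok": "pong"}, or None when nothing replies.

```python
from typing import Any, Mapping, Optional


def workers_responding(replies: Optional[Mapping[str, Mapping[str, Any]]]) -> bool:
    """True if at least one worker answered the ping with 'pong'."""
    if not replies:
        return False
    return any(reply.get("ok") == "pong" for reply in replies.values())


def check(app: Any, timeout: float = 5.0) -> int:
    """Return a process exit code: 0 if a worker replied, 1 otherwise."""
    replies = app.control.inspect(timeout=timeout).ping()
    return 0 if workers_responding(replies) else 1


# Possible wiring inside the worker container (module name is an assumption):
#
#   import sys
#   from tasks import app
#   sys.exit(check(app))
```

Without a destination, any worker on the broker can answer the ping, so a per-container check would likely need to target that container's own worker name.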

What is next

The infrastructure has a warmup_reranker() stub in main.py for a cross-encoder reranker. That is the highest-impact next upgrade: running cross-encoder/ms-marco-MiniLM-L-6-v2 over the top-20 RRF results before returning to the user. I expect it to add roughly 100ms of latency in exchange for better ranking on ambiguous queries.
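
A rough sketch of what the rerank step could look like. Since the reranker is still a stub, everything here is an assumption: the rerank function, its parameters, and the toy overlap_score scorer are hypothetical. The scorer is injected so the sketch runs without downloading a model; the comment shows how sentence-transformers' CrossEncoder.predict could be plugged in.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]
ScoreFn = Callable[[List[Pair]], Sequence[float]]


def rerank(query: str, passages: List[str], score_fn: ScoreFn, top_n: int = 5) -> List[Tuple[str, float]]:
    """Score every (query, passage) pair and return the top_n by score."""
    pairs = [(query, passage) for passage in passages]
    scores = score_fn(pairs)
    ranked = sorted(zip(passages, scores), key=lambda item: item[1], reverse=True)
    return [(passage, float(score)) for passage, score in ranked[:top_n]]


# With sentence-transformers installed, score_fn could be a cross-encoder:
#   from sentence_transformers import CrossEncoder
#   model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
#   score_fn = model.predict


def overlap_score(pairs: List[Pair]) -> List[float]:
    """Toy stand-in scorer: number of lowercase words shared by query and passage."""
    return [float(len(set(q.lower().split()) & set(p.lower().split()))) for q, p in pairs]


candidates = [
    "Tokyo defence deal with Manila confirmed",
    "Japan approves missile exports to the Philippines",
    "Premier League results",
]
top = rerank("Japan missile exports Philippines", candidates, overlap_score, top_n=2)
print(top)
```

The toy scorer only counts shared words, so it ranks the Tokyo/Manila headline no higher than the football one. A real cross-encoder reads query and passage together, which is where the ranking gain would be expected to come from.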

I am also looking at adding a BM25 path via ParadeDB's pg_bm25 extension (since renamed pg_search) to replace the PostgreSQL FTS path. BM25 handles document length normalisation better than PostgreSQL's built-in ts_rank for longer articles.