Martin Palopoli

Introducing Citai: the RAG engine I built across 6 articles — now free to try

Over the last 6 articles I shared how I built every piece of a production RAG engine: hybrid search, cross-encoder reranking, SSE streaming, multi-tenancy, semantic cache and language detection. Today I'm introducing the finished product: Citai (cite + AI) — a knowledge engine with verifiable citations that anyone can try for free at citai.ai.


The full series

If you're landing on this article first, here's everything I built and documented:

  1. RAG Pipeline — hybrid search (pgvector + BM25), cross-encoder reranking, MMR diversity
  2. Production deploy — Docker multi-stage, Nginx, DigitalOcean, zero-downtime deploys
  3. Multi-tenancy — plans, atomic quotas, Redis rate limiting, data isolation
  4. SSE Streaming — from the LLM to the browser through Nginx, custom 10-event protocol
  5. Semantic cache + FAQ — 3 savings layers that cut ~40% of LLM cost
  6. Language detection — ES/EN/PT heuristics without external APIs, content rules

Every article has real code from the project. Nothing made up for the tutorial.


What is Citai

Citai = cite + AI. A knowledge engine where every answer cites the exact source (page, document, paragraph).

The idea came from a frustration: corporate chatbots that say "according to our documents..." but never show you where. You can't verify anything. You can't trust it.

Citai solves that: upload your documents (PDF, DOCX, TXT, or URLs), the system processes them, and when someone asks a question, the answer comes with the exact reference. If it says "see page 15 of the manual", you can go to page 15 and verify.


Who it's for

  • Companies with internal documentation: manuals, policies, procedures nobody reads
  • Support teams: a knowledge base that answers before the human does
  • Education: study materials with verifiable answers
  • Healthcare and legal: where citing the source isn't optional — it's mandatory

Any use case where an answer without a source isn't enough.


What's included (and what I already showed how it works)

3-stage hybrid search

What I described in article #1 is exactly what runs in production:

Query → Vector (pgvector) + BM25 (tsvector)
     → RRF merge (70/30)
     → Cross-encoder rerank
     → MMR diversity (λ=0.6)
     → Top 5-10 sources with exact citation

It's not "cosine similarity and done". It's a full pipeline with reranking that improves relevance by 25-40% over pure vector search.
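
To make the merge step concrete, here's a minimal sketch of the weighted RRF fusion. The function name and the k constant are illustrative, not Citai's actual code:

def rrf_merge(vector_hits, bm25_hits, k=60, vector_weight=0.7, bm25_weight=0.3):
    """Weighted Reciprocal Rank Fusion over two ranked lists of chunk IDs."""
    scores = {}
    for rank, doc_id in enumerate(vector_hits, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + vector_weight / (k + rank)
    for rank, doc_id in enumerate(bm25_hits, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + bm25_weight / (k + rank)
    # Highest fused score first; this list feeds the cross-encoder reranker
    return sorted(scores, key=scores.get, reverse=True)

A chunk that ranks high in both lists climbs to the top even if neither search alone put it first. That's what makes the hybrid better than either signal on its own.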

Calibrated confidence

Every answer has a confidence score (0-100%) based on cross-encoder scores. If confidence is low, the system says so. It doesn't make things up.

"I couldn't find enough information in the available documentation
to answer with certainty."

That's more valuable than a fabricated answer delivered with confidence.
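
A minimal sketch of how such a score could be derived, assuming a sigmoid over the best rerank logit and an illustrative 40% cutoff; Citai's exact calibration may differ:

import math

def confidence_pct(logits: list[float]) -> int:
    """Collapse cross-encoder logits into a 0-100% score (illustrative formula)."""
    if not logits:
        return 0
    return round(100 / (1 + math.exp(-max(logits))))  # sigmoid of the best score

rerank_logits = [2.1, 0.4, -1.3]  # hypothetical scores for the top candidates
if confidence_pct(rerank_logits) < 40:  # illustrative low-confidence cutoff
    print("I couldn't find enough information in the available "
          "documentation to answer with certainty.")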

Embeddable widget

One line of code:

<script
  src="https://citai.ai/widget.js"
  data-api-key="sk-..."
  data-base-url="https://citai.ai"
></script>

Shadow DOM (won't pollute your styles), real-time SSE streaming, pre-chat forms, business hours, human escalation, quick replies, and 4 industry templates. Everything I described in article #4 about SSE is what runs inside the widget.

No-code RAG configuration

Each knowledge base has its own configuration:

Parameter       Default   What it controls
candidate_k     30        Initial candidates per search
rerank_top_n    15        Documents after the reranker
top_k           10        Final sources (post-MMR)
lambda_param    0.6       Relevance vs diversity balance
bm25_weight     0.3       BM25 weight in the fusion
faq_threshold   0.85      FAQ match threshold
temperature     0.3       LLM creativity
chunk_size      1024      Chunk size in characters

An admin can adjust all of this from the UI, no code required. And the Playground lets you test changes before applying them.
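
As a sketch, the whole table fits in a single settings model. This is an illustrative Pydantic shape mirroring the defaults above, not Citai's actual schema:

from pydantic import BaseModel

class RagConfig(BaseModel):
    """Per-knowledge-base retrieval settings (illustrative)."""
    candidate_k: int = 30      # initial candidates per search
    rerank_top_n: int = 15     # documents kept after the reranker
    top_k: int = 10            # final sources (post-MMR)
    lambda_param: float = 0.6  # relevance vs diversity balance
    bm25_weight: float = 0.3   # BM25 weight in the fusion
    faq_threshold: float = 0.85
    temperature: float = 0.3   # LLM creativity
    chunk_size: int = 1024     # characters per chunk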

Semantic cache + FAQ matching

Article #5 in action:

  • FAQ match → curated answer in ~15ms, cost $0
  • Cache hit (similarity >= 0.95) → cached response in ~20ms, cost $0
  • Cache miss → full pipeline in ~2-4s

In production, FAQ and cache combined avoid 30-45% of LLM calls. That's real money saved.
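
In code, the decision tree is short. A minimal sketch where the helper names are hypothetical stand-ins for Citai's internals:

# Hypothetical helpers standing in for Citai's internals
async def match_faq(query: str): ...
async def semantic_cache_lookup(query: str, min_similarity: float): ...
async def run_rag_pipeline(query: str): ...

CACHE_SIM_THRESHOLD = 0.95  # below this, a miss beats a wrong answer

async def answer(query: str):
    """Try the two free layers before paying for the LLM (sketch)."""
    if (faq := await match_faq(query)):
        return faq                         # curated FAQ hit: ~15ms, $0
    if (hit := await semantic_cache_lookup(query, CACHE_SIM_THRESHOLD)):
        return hit                         # semantic cache hit: ~20ms, $0
    return await run_rag_pipeline(query)   # full pipeline: ~2-4s, LLM cost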

Multilingual without external APIs

Automatic ES/EN/PT detection via heuristics (article #6). Users ask in their language, Citai responds in that language. No extra latency, no extra cost.

The embeddings are multilingual (paraphrase-multilingual-MiniLM-L12-v2), so you can have documents in Spanish and ask in English — it works.
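
You can check this yourself with sentence-transformers: a Spanish document, an English query, same vector space (the example texts are mine, not from Citai):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
doc = model.encode("El manual cubre la política de devoluciones en la página 15.")
query = model.encode("What is the return policy?")
print(util.cos_sim(query, doc))  # high similarity despite the ES/EN mismatch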

Full analytics

  • Resolution rate (resolved / escalated / negative)
  • Top 10 most frequent questions
  • Knowledge gaps (unanswered queries)
  • Tag distribution (auto-tagging via LLM)
  • Cost by provider and FAQ/cache savings
  • Health Score per knowledge base (0-100)
  • CSV/JSON export

Smart Routing

If you have multiple knowledge bases, Citai automatically routes the query to the most relevant KB using embedding centroids. No manual configuration.
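
The core of the idea in a few lines, as a sketch assuming precomputed centroids (the mean of each KB's chunk embeddings):

import numpy as np

def route_query(query_vec: np.ndarray, kb_centroids: dict[str, np.ndarray]) -> str:
    """Return the KB whose embedding centroid is closest to the query (sketch)."""
    def cosine(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(kb_centroids, key=lambda kb: cosine(query_vec, kb_centroids[kb]))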

Content rules

BLOCK, REDIRECT and FILTER rules control which questions get processed and which topics are blocked. They're evaluated before the LLM, at zero cost.
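
A minimal sketch of the evaluation order (the matching here is deliberately simplistic; Citai's rule engine is richer):

from enum import Enum

class RuleAction(str, Enum):
    BLOCK = "block"        # refuse with a canned message
    REDIRECT = "redirect"  # point the user to a fixed answer or URL
    FILTER = "filter"      # exclude the topic from retrieval

def apply_rules(query: str, rules: list[tuple[str, RuleAction, str]]):
    """First matching rule wins; runs before any LLM call, so a match costs nothing."""
    q = query.lower()
    for pattern, action, payload in rules:
        if pattern in q:
            return action, payload
    return None, None  # no rule matched: continue to the pipeline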

Tool Calling

Connect external APIs (weather, calendar, CRM, anything) and the agent invokes them automatically when the question requires it. Templates included for the most common use cases.
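
For context, tool definitions follow the JSON-schema function format that OpenAI-compatible chat APIs like Groq's accept. This weather tool is hypothetical, not one of Citai's templates:

weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Current weather for a given city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}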


The stack (for those who read the whole series)

┌─────────────────────────────────────────┐
│         Frontend (Vue 3 + TS)           │
│    Tailwind · vue-i18n · Chart.js       │
├─────────────────────────────────────────┤
│         Nginx (TLS + HTTP/2)            │
│   Proxy SSE · Gzip · Maintenance mode   │
├─────────────────────────────────────────┤
│        Backend (FastAPI async)          │
│   SQLAlchemy · Alembic · JWT · SSE      │
├─────────────┬────────────┬──────────────┤
│ PostgreSQL  │   Redis    │     Groq     │
│  pgvector   │ Rate limit │  LLaMA 3.3   │
│    BM25     │ Concurrent │ 70B versatile│
└─────────────┴────────────┴──────────────┘
  • LLM: Groq as primary (fast and cheap), with GPT-4o-mini fallback
  • Embeddings: paraphrase-multilingual-MiniLM-L12-v2 (384 dimensions, runs locally)
  • Reranker: cross-encoder/ms-marco-MiniLM-L-6-v2 (also local)
  • DB: PostgreSQL 16 + pgvector (HNSW) + tsvector (BM25)
  • Infra: Docker Compose on a 4GB VPS ($24/month)

Yes, all of this runs on a 4GB droplet. Article #2 explains how.
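
Under the hood, that storage layer implies two indexes on the chunks table. A hypothetical DDL sketch (table and column names are mine, not Citai's actual schema):

HYBRID_SEARCH_DDL = """
CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw
    ON chunks USING hnsw (embedding vector_cosine_ops);

CREATE INDEX IF NOT EXISTS chunks_tsv_gin
    ON chunks USING gin (tsv);
"""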


Plans

                  Free       Starter    Pro        Enterprise
Price             $0         $29/mo     $79/mo     Custom
AI queries/month  50         500        5,000      Unlimited
Users             2          5          15         Unlimited
Knowledge bases   1          5          20         Unlimited
Documents         10         50         200        Unlimited
Storage           100 MB     500 MB     5 GB       50 GB+
API keys          2          10         Unlimited
Widget            Yes        Yes        White-label
FAQs              Unlimited  Unlimited  Unlimited  Unlimited

FAQs are unlimited across all plans. If a question matches a curated FAQ, it's answered for free without consuming AI queries. That's not a gimmick — it's the design from article #5 in action.

The Free plan doesn't require a credit card. Sign up, upload a PDF, and ask it something.


What I learned building this

1. The reranker improves quality more than anything else

I can swap the LLM, the embeddings, the chunk size — nothing improves relevance as much as adding a cross-encoder after retrieval. No other change comes close on cost/benefit.
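
Trying it takes a few lines with the reranker named in the stack section (the query and candidates here are toy examples):

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
query = "What is the return policy?"
candidates = ["Returns are accepted within 30 days of purchase.",
              "Shipping usually takes 3 to 5 business days."]
scores = reranker.predict([(query, doc) for doc in candidates])
ranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]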

2. FAQs are boring but powerful

A table with questions and answers isn't sexy. But each FAQ is a perfect answer, served in 15ms, at zero cost, forever. In a multi-tenant SaaS, that scales better than any LLM.

3. Semantic cache needs a high threshold

0.95 minimum similarity. At 0.90 I got incorrect cache hits. A miss is better than a wrong answer.

4. Multi-tenancy is not "add a tenant_id"

It's plan enforcement with atomic counters, rate limiting, data isolation in every query, and a role system that actually works. It takes more code than the RAG pipeline itself.

5. SSE streaming through Nginx has gotchas

proxy_buffering off and proxy_cache off are obvious. But X-Accel-Buffering: no in the response header is what actually fixed it. Article #4 has the details.
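
A minimal FastAPI sketch of that fix (the endpoint and event payload are placeholders, not Citai's 10-event protocol):

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/chat/stream")
async def chat_stream():
    async def events():
        yield "data: hello\n\n"  # placeholder SSE event

    return StreamingResponse(
        events(),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no",  # tells Nginx not to buffer this response
        },
    )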

6. A 4GB VPS handles more than you'd think

With 1 Uvicorn worker (the embedding model uses ~500MB), a tuned PostgreSQL, and Redis capped at 64MB, the system serves real traffic without issues. You don't need Kubernetes to launch.

7. Calibrated confidence changes user perception

When the system says "I'm not sure" instead of making things up, users trust it even more on the answers where it is sure. Counterintuitive, but measurable.


Try it

citai.ai

  • Sign up for free (no credit card)
  • Upload a PDF or document
  • Ask it something
  • Verify the source

If you've read any of the 6 previous articles, everything I described is what you'll find inside. It's not a prototype — it's production.


What's next

I'm working on:

  • Automatic URL import with periodic sync
  • More industry templates
  • SDK for programmatic integration
  • More LLM providers (Mistral, local Ollama)

If you're interested in any of these features, drop a comment or reach out. If you find a bug, same thing — it's real software, not a demo.


Thank you

To everyone who read, commented, and reacted to the previous articles. This series started as personal documentation and became something worth sharing.

If Citai looks useful to you — or if you're just interested in the stack — a like helps it reach more people.


This is the final installment of the "RAG in Production" series. The 6 previous articles are on my profile. Questions, feedback or ideas → comments are open.
