Over the last 6 articles I shared how I built every piece of a production RAG engine: hybrid search, cross-encoder reranking, SSE streaming, multi-tenancy, semantic cache and language detection. Today I'm introducing the finished product: Citai (cite + AI) — a knowledge engine with verifiable citations that anyone can try for free at citai.ai.
The full series
If you're landing on this article first, here's everything I built and documented:
- RAG Pipeline — hybrid search (pgvector + BM25), cross-encoder reranking, MMR diversity
- Production deploy — Docker multi-stage, Nginx, DigitalOcean, zero-downtime deploys
- Multi-tenancy — plans, atomic quotas, Redis rate limiting, data isolation
- SSE Streaming — from the LLM to the browser through Nginx, custom 10-event protocol
- Semantic cache + FAQ — 3 savings layers that cut ~40% of LLM cost
- Language detection — ES/EN/PT heuristics without external APIs, content rules
Every article has real code from the project. Nothing made up for the tutorial.
What is Citai
Citai = cite + AI. A knowledge engine where every answer cites the exact source (page, document, paragraph).
The idea came from a frustration: corporate chatbots that say "according to our documents..." but never show you where. You can't verify anything. You can't trust it.
Citai solves that: upload your documents (PDF, DOCX, TXT, or URLs), the system processes them, and when someone asks a question, the answer comes with the exact reference. If it says "see page 15 of the manual", you can go to page 15 and verify.
Who it's for
- Companies with internal documentation: manuals, policies, procedures nobody reads
- Support teams: a knowledge base that answers before the human does
- Education: study materials with verifiable answers
- Healthcare and legal: where citing the source isn't optional — it's mandatory
Any use case where an answer without a source isn't enough.
What's included (and what I already showed how it works)
3-stage hybrid search
What I described in article #1 is exactly what runs in production:
Query → Vector (pgvector) + BM25 (tsvector)
→ RRF merge (70/30)
→ Cross-encoder rerank
→ MMR diversity (λ=0.6)
→ Top 5-10 sources with exact citation
It's not "cosine similarity and done". It's a full pipeline with reranking that improves relevance by 25-40% over pure vector search.
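The RRF merge step above can be sketched in a few lines. This is an illustrative implementation of weighted Reciprocal Rank Fusion, not Citai's actual code; the function name and the constant k=60 (a common RRF default) are assumptions, while the 70/30 split matches the article:

```python
# Weighted Reciprocal Rank Fusion (RRF) sketch. Each document's score is the
# weighted sum of 1/(k + rank) across both result lists, so documents that
# rank well in either vector or BM25 search float to the top.
def rrf_merge(vector_ids, bm25_ids, bm25_weight=0.3, k=60):
    """Fuse two ranked lists of doc IDs; 70/30 vector/BM25 by default."""
    scores = {}
    for rank, doc_id in enumerate(vector_ids):
        scores[doc_id] = scores.get(doc_id, 0.0) + (1 - bm25_weight) / (k + rank + 1)
    for rank, doc_id in enumerate(bm25_ids):
        scores[doc_id] = scores.get(doc_id, 0.0) + bm25_weight / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that appears in both lists gets credit from both, which is exactly why hybrid search beats either retriever alone.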
Calibrated confidence
Every answer has a confidence score (0-100%) based on cross-encoder scores. If confidence is low, the system says so. It doesn't make things up.
"I couldn't find enough information in the available documentation
to answer with certainty."
That's more valuable than a fabricated answer delivered with confidence.
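One way to turn raw cross-encoder scores into that 0-100 number is a sigmoid over the top logit plus a refusal threshold. This is a minimal sketch under assumed names and thresholds, not Citai's actual calibration:

```python
import math

# Illustrative confidence gating: squash the best cross-encoder logit to
# (0, 1) with a sigmoid, report it as a percentage, and refuse to answer
# below an assumed floor.
def confidence_from_scores(scores, floor=0.35):
    if not scores:
        return 0, False  # nothing retrieved: refuse outright
    top = max(scores)
    conf = 1 / (1 + math.exp(-top))
    return round(conf * 100), conf >= floor
```

When the second value is False, the system emits the "I couldn't find enough information" message instead of an answer.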
Embeddable widget
One line of code:
<script
src="https://citai.ai/widget.js"
data-api-key="sk-..."
data-base-url="https://citai.ai"
></script>
Shadow DOM (won't pollute your styles), real-time SSE streaming, pre-chat forms, business hours, human escalation, quick replies, and 4 industry templates. Everything I described in article #4 about SSE is what runs inside the widget.
No-code RAG configuration
Each knowledge base has its own configuration:
| Parameter | Default | What it controls |
|---|---|---|
| candidate_k | 30 | Initial candidates per search |
| rerank_top_n | 15 | Documents kept after the reranker |
| top_k | 10 | Final sources (post-MMR) |
| lambda_param | 0.6 | Relevance vs diversity balance |
| bm25_weight | 0.3 | BM25 weight in the fusion |
| faq_threshold | 0.85 | FAQ match similarity threshold |
| temperature | 0.3 | LLM creativity |
| chunk_size | 1024 | Chunk size in characters |
An admin can adjust all of this from the UI, no code required. And the Playground lets you test changes before applying them.
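As a sketch, the table above maps naturally to a per-knowledge-base config object. The class name is hypothetical; the defaults are the ones from the table:

```python
from dataclasses import dataclass

# Per-knowledge-base RAG configuration, with the article's defaults.
@dataclass
class RagConfig:
    candidate_k: int = 30       # initial candidates per search
    rerank_top_n: int = 15      # documents kept after the reranker
    top_k: int = 10             # final sources (post-MMR)
    lambda_param: float = 0.6   # relevance vs diversity balance in MMR
    bm25_weight: float = 0.3    # BM25 weight in the RRF fusion
    faq_threshold: float = 0.85 # FAQ match similarity threshold
    temperature: float = 0.3    # LLM creativity
    chunk_size: int = 1024      # chunk size in characters
```

The UI and the Playground then just read and write instances of this per KB, which is what makes "no code required" possible.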
Semantic cache + FAQ matching
Article #5 in action:
- FAQ match → curated answer in ~15ms, cost $0
- Cache hit (similarity >= 0.95) → cached response in ~20ms, cost $0
- Cache miss → full pipeline in ~2-4s
In production, FAQ and cache combined avoid 30-45% of LLM calls. That's real money saved.
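The three layers evaluate in strict cost order: cheapest first, LLM last. A sketch with stand-in callables (the store interfaces are assumptions, the 0.95 threshold is from the article):

```python
# Three savings layers in lookup order. faq_match returns a curated answer or
# None; cache_lookup returns (answer, similarity); run_pipeline is the full
# RAG path. Only layer 3 costs money.
def answer(query, faq_match, cache_lookup, run_pipeline, cache_threshold=0.95):
    hit = faq_match(query)                # layer 1: curated FAQ, ~15ms, $0
    if hit is not None:
        return hit, "faq"
    cached, sim = cache_lookup(query)     # layer 2: semantic cache, ~20ms, $0
    if cached is not None and sim >= cache_threshold:
        return cached, "cache"
    return run_pipeline(query), "llm"     # layer 3: full pipeline, ~2-4s
```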
Multilingual without external APIs
Automatic ES/EN/PT detection via heuristics (article #6). Users ask in their language, Citai responds in that language. No extra latency, no extra cost.
The embeddings are multilingual (paraphrase-multilingual-MiniLM-L12-v2), so you can have documents in Spanish and ask in English — it works.
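The detection heuristic is essentially stopword scoring. This is a toy version to show the shape of the idea; the real heuristics from article #6 are richer, and these word lists are illustrative, not the production ones:

```python
# Toy stopword-based language scoring for ES/EN/PT: count how many of the
# query's words appear in each language's stopword set and pick the winner.
STOPWORDS = {
    "es": {"el", "la", "los", "que", "de", "y", "cómo", "qué", "para"},
    "en": {"the", "is", "are", "what", "how", "and", "of", "to", "for"},
    "pt": {"o", "os", "que", "e", "como", "não", "para", "uma", "onde"},
}

def detect_language(text, default="en"):
    words = set(text.lower().split())
    scores = {lang: len(words & sw) for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default
```

No API call, no latency, no cost, which is the whole point.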
Full analytics
- Resolution rate (resolved / escalated / negative)
- Top 10 most frequent questions
- Knowledge gaps (unanswered queries)
- Tag distribution (auto-tagging via LLM)
- Cost by provider and FAQ/cache savings
- Health Score per knowledge base (0-100)
- CSV/JSON export
Smart Routing
If you have multiple knowledge bases, Citai automatically routes the query to the most relevant KB using embedding centroids. No manual configuration.
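Centroid routing is conceptually simple: average each KB's document embeddings once, then send the query to the nearest centroid. A sketch with plain Python lists (production would use numpy and the real 384-dim embeddings; the function names are assumptions):

```python
import math

# Route a query to the knowledge base whose embedding centroid is most
# similar (cosine) to the query embedding.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def route(query_vec, centroids):
    """centroids: {kb_name: centroid_vector}. Returns the best-matching KB."""
    return max(centroids, key=lambda kb: cosine(query_vec, centroids[kb]))
```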
Content rules
BLOCK, REDIRECT and FILTER — to control which questions get processed and which topics are blocked. Evaluated before the LLM, at zero cost.
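The three actions can be sketched as a pre-LLM pass over the query. The rule shape and the simple keyword matching here are illustrative, not Citai's actual rule engine:

```python
# Evaluate content rules before any LLM call. BLOCK stops the query,
# REDIRECT points the user elsewhere, FILTER strips the matched text and
# lets the rest through.
def apply_rules(query, rules):
    q = query.lower()
    for rule in rules:
        if rule["keyword"] in q:
            if rule["action"] == "BLOCK":
                return ("blocked", rule.get("message", "Topic not allowed."))
            if rule["action"] == "REDIRECT":
                return ("redirect", rule["target"])
            if rule["action"] == "FILTER":
                q = q.replace(rule["keyword"], "")
    return ("process", q.strip())
```

Because this runs before retrieval and generation, a blocked question costs exactly nothing.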
Tool Calling
Connect external APIs (weather, calendar, CRM, anything) and the agent invokes them automatically when the question requires it. Templates included for the most common use cases.
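The dispatch side of tool calling is just a registry mapping tool names to functions; the LLM picks the name and arguments, the backend invokes. This sketch uses hypothetical names and a fake tool, not Citai's actual tool API:

```python
# Minimal tool registry: decorate a function to expose it as a tool, then
# dispatch by name with the arguments the LLM chose.
TOOLS = {}

def tool(name):
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("get_weather")
def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # a real tool would call an external API here

def invoke(name, **kwargs):
    if name not in TOOLS:
        raise KeyError(f"Unknown tool: {name}")
    return TOOLS[name](**kwargs)
```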
The stack (for those who read the whole series)
┌─────────────────────────────────────────┐
│ Frontend (Vue 3 + TS) │
│ Tailwind · vue-i18n · Chart.js │
├─────────────────────────────────────────┤
│ Nginx (TLS + HTTP/2) │
│ Proxy SSE · Gzip · Maintenance mode │
├─────────────────────────────────────────┤
│ Backend (FastAPI async) │
│ SQLAlchemy · Alembic · JWT · SSE │
├────────────┬────────────┬───────────────┤
│ PostgreSQL │ Redis │ Groq │
│ pgvector │ Rate limit │ LLaMA 3.3 │
│ BM25 │ Concurrent │ 70B versatile│
└────────────┴────────────┴───────────────┘
- LLM: Groq as primary (fast and cheap), with GPT-4o-mini fallback
- Embeddings: paraphrase-multilingual-MiniLM-L12-v2 (384 dimensions, runs locally)
- Reranker: cross-encoder/ms-marco-MiniLM-L-6-v2 (also local)
- DB: PostgreSQL 16 + pgvector (HNSW) + tsvector (BM25)
- Infra: Docker Compose on a 4GB VPS ($24/month)
Yes, all of this runs on a 4GB droplet. Article #2 explains how.
Plans
| | Free | Starter | Pro | Enterprise |
|---|---|---|---|---|
| Price | $0 | $29/mo | $79/mo | Custom |
| AI queries/month | 50 | 500 | 5,000 | Unlimited |
| Users | 2 | 5 | 15 | Unlimited |
| Knowledge bases | 1 | 5 | 20 | Unlimited |
| Documents | 10 | 50 | 200 | Unlimited |
| Storage | 100 MB | 500 MB | 5 GB | 50 GB+ |
| API keys | — | 2 | 10 | Unlimited |
| Widget | — | Yes | Yes | White-label |
| FAQs | Unlimited | Unlimited | Unlimited | Unlimited |
FAQs are unlimited across all plans. If a question matches a curated FAQ, it's answered for free without consuming AI queries. That's not a gimmick — it's the design from article #5 in action.
The Free plan doesn't require a credit card. Sign up, upload a PDF, and ask it something.
What I learned building this
1. The reranker improves quality more than anything else
I can swap the LLM, the embeddings, the chunk size — nothing improves relevance as much as adding a cross-encoder after retrieval. No other change comes close on cost/benefit.
2. FAQs are boring but powerful
A table with questions and answers isn't sexy. But each FAQ is a perfect answer, served in 15ms, at zero cost, forever. In a multi-tenant SaaS, that scales better than any LLM.
3. Semantic cache needs a high threshold
0.95 minimum similarity. At 0.90 I got incorrect cache hits. A miss is better than a wrong answer.
4. Multi-tenancy is not "add a tenant_id"
It's plan enforcement with atomic counters, rate limiting, data isolation in every query, and a role system that actually works. It takes more code than the RAG pipeline itself.
5. SSE streaming through Nginx has gotchas
proxy_buffering off and proxy_cache off are obvious. But X-Accel-Buffering: no in the response header is what actually fixed it. Article #4 has the details.
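Concretely, the backend side of the fix is just the headers on the streaming response (the variable name here is an assumption; the header values are the ones that made SSE work through Nginx):

```python
# Response headers for an SSE stream behind Nginx. X-Accel-Buffering: no is
# the one that actually disables per-response buffering in the proxy;
# proxy_buffering off and proxy_cache off live in the Nginx config itself.
SSE_HEADERS = {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    "Connection": "keep-alive",
    "X-Accel-Buffering": "no",
}
```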
6. A 4GB VPS handles more than you'd think
With 1 Uvicorn worker (the embedding model uses ~500MB), tuned PostgreSQL and Redis at 64MB, the system serves real traffic without issues. You don't need Kubernetes to launch.
7. Calibrated confidence changes user perception
When the system says "I'm not sure" instead of making things up, users trust the answers where it is sure even more. Counter-intuitive but measurable.
Try it
- Sign up for free (no credit card)
- Upload a PDF or document
- Ask it something
- Verify the source
If you've read any of the 6 previous articles, everything I described is what you'll find inside. It's not a prototype — it's production.
What's next
I'm working on:
- Automatic URL import with periodic sync
- More industry templates
- SDK for programmatic integration
- More LLM providers (Mistral, local Ollama)
If you're interested in any of these features, drop a comment or reach out. If you find a bug, same thing — it's real software, not a demo.
Thank you
To everyone who read, commented, and reacted to the previous articles. This series started as personal documentation and became something worth sharing.
If Citai looks useful to you — or if you're just interested in the stack — a like helps it reach more people.
This is the final installment of the "RAG in Production" series. The 6 previous articles are on my profile. Questions, feedback or ideas → comments are open.