Over the last 6 articles I shared how I built every piece of a production RAG engine: hybrid search, cross-encoder reranking, SSE streaming, multi-tenancy, semantic cache and language detection. Today I'm introducing the finished product: Citai (cite + AI) — a knowledge engine with verifiable citations that anyone can try for free at citai.ai.
The full series
If you're landing on this article first, here's everything I built and documented:
- RAG Pipeline — hybrid search (pgvector + BM25), cross-encoder reranking, MMR diversity
- Production deploy — Docker multi-stage, Nginx, DigitalOcean, zero-downtime deploys
- Multi-tenancy — plans, atomic quotas, Redis rate limiting, data isolation
- SSE Streaming — from the LLM to the browser through Nginx, custom 10-event protocol
- Semantic cache + FAQ — 3 savings layers that cut ~40% of LLM cost
- Language detection — ES/EN/PT heuristics without external APIs, content rules
Every article has real code from the project. Nothing made up for the tutorial.
What is Citai
Citai = cite + AI. A knowledge engine where every answer cites the exact source (page, document, paragraph).
The idea came from a frustration: corporate chatbots that say "according to our documents..." but never show you where. You can't verify anything. You can't trust it.
Citai solves that: upload your documents (PDF, DOCX, TXT, or URLs), the system processes them, and when someone asks a question, the answer comes with the exact reference. If it says "see page 15 of the manual", you can go to page 15 and verify.
Who it's for
- Companies with internal documentation: manuals, policies, procedures nobody reads
- Support teams: a knowledge base that answers before the human does
- Education: study materials with verifiable answers
- Healthcare and legal: where citing the source isn't optional — it's mandatory
Any use case where an answer without a source isn't enough.
What's included (and what I already showed how it works)
3-stage hybrid search
What I described in article #1 is exactly what runs in production:
Query → Vector (pgvector) + BM25 (tsvector)
→ RRF merge (70/30)
→ Cross-encoder rerank
→ MMR diversity (λ=0.6)
→ Top 5-10 sources with exact citation
It's not "cosine similarity and done". It's a full pipeline with reranking that improves relevance by 25-40% over pure vector search.
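The RRF merge step above can be sketched in a few lines. This is an illustrative implementation of weighted Reciprocal Rank Fusion, not Citai's actual code; the function name and the constant k=60 (a common RRF default) are assumptions, while the 70/30 split matches the article:

```python
# Weighted Reciprocal Rank Fusion (RRF) sketch. Each document's score is the
# weighted sum of 1/(k + rank) across both result lists, so documents that
# rank well in either vector or BM25 search float to the top.
def rrf_merge(vector_ids, bm25_ids, bm25_weight=0.3, k=60):
    """Fuse two ranked lists of doc IDs; 70/30 vector/BM25 by default."""
    scores = {}
    for rank, doc_id in enumerate(vector_ids):
        scores[doc_id] = scores.get(doc_id, 0.0) + (1 - bm25_weight) / (k + rank + 1)
    for rank, doc_id in enumerate(bm25_ids):
        scores[doc_id] = scores.get(doc_id, 0.0) + bm25_weight / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that appears in both lists gets credit from both, which is exactly why hybrid search beats either retriever alone.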
Calibrated confidence
Every answer has a confidence score (0-100%) based on cross-encoder scores. If confidence is low, the system says so. It doesn't make things up.
"I couldn't find enough information in the available documentation
to answer with certainty."
That's more valuable than a fabricated answer delivered with confidence.
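One way to turn raw cross-encoder scores into that 0-100 number is a sigmoid over the top logit plus a refusal threshold. This is a minimal sketch under assumed names and thresholds, not Citai's actual calibration:

```python
import math

# Illustrative confidence gating: squash the best cross-encoder logit to
# (0, 1) with a sigmoid, report it as a percentage, and refuse to answer
# below an assumed floor.
def confidence_from_scores(scores, floor=0.35):
    if not scores:
        return 0, False  # nothing retrieved: refuse outright
    top = max(scores)
    conf = 1 / (1 + math.exp(-top))
    return round(conf * 100), conf >= floor
```

When the second value is False, the system emits the "I couldn't find enough information" message instead of an answer.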
Embeddable widget
One line of code:
<script
src="https://citai.ai/widget.js"
data-api-key="sk-..."
data-base-url="https://citai.ai"
></script>
Shadow DOM (won't pollute your styles), real-time SSE streaming, pre-chat forms, business hours, human escalation, quick replies, and 4 industry templates. Everything I described in article #4 about SSE is what runs inside the widget.
No-code RAG configuration
Each knowledge base has its own configuration:
| Parameter | Default | What it controls |
|---|---|---|
| candidate_k | 30 | Initial candidates per search |
| rerank_top_n | 15 | Documents kept after the reranker |
| top_k | 10 | Final sources (post-MMR) |
| lambda_param | 0.6 | Relevance vs diversity balance |
| bm25_weight | 0.3 | BM25 weight in the fusion |
| faq_threshold | 0.85 | FAQ match similarity threshold |
| temperature | 0.3 | LLM creativity |
| chunk_size | 1024 | Chunk size in characters |
An admin can adjust all of this from the UI, no code required. And the Playground lets you test changes before applying them.
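As a sketch, the table above maps naturally to a per-knowledge-base config object. The class name is hypothetical; the defaults are the ones from the table:

```python
from dataclasses import dataclass

# Per-knowledge-base RAG configuration, with the article's defaults.
@dataclass
class RagConfig:
    candidate_k: int = 30       # initial candidates per search
    rerank_top_n: int = 15      # documents kept after the reranker
    top_k: int = 10             # final sources (post-MMR)
    lambda_param: float = 0.6   # relevance vs diversity balance in MMR
    bm25_weight: float = 0.3    # BM25 weight in the RRF fusion
    faq_threshold: float = 0.85 # FAQ match similarity threshold
    temperature: float = 0.3    # LLM creativity
    chunk_size: int = 1024      # chunk size in characters
```

The UI and the Playground then just read and write instances of this per KB, which is what makes "no code required" possible.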
Semantic cache + FAQ matching
Article #5 in action:
- FAQ match → curated answer in ~15ms, cost $0
- Cache hit (similarity >= 0.95) → cached response in ~20ms, cost $0
- Cache miss → full pipeline in ~2-4s
In production, FAQ and cache combined avoid 30-45% of LLM calls. That's real money saved.
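The three layers evaluate in strict cost order: cheapest first, LLM last. A sketch with stand-in callables (the store interfaces are assumptions, the 0.95 threshold is from the article):

```python
# Three savings layers in lookup order. faq_match returns a curated answer or
# None; cache_lookup returns (answer, similarity); run_pipeline is the full
# RAG path. Only layer 3 costs money.
def answer(query, faq_match, cache_lookup, run_pipeline, cache_threshold=0.95):
    hit = faq_match(query)                # layer 1: curated FAQ, ~15ms, $0
    if hit is not None:
        return hit, "faq"
    cached, sim = cache_lookup(query)     # layer 2: semantic cache, ~20ms, $0
    if cached is not None and sim >= cache_threshold:
        return cached, "cache"
    return run_pipeline(query), "llm"     # layer 3: full pipeline, ~2-4s
```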
Multilingual without external APIs
Automatic ES/EN/PT detection via heuristics (article #6). Users ask in their language, Citai responds in that language. No extra latency, no extra cost.
The embeddings are multilingual (paraphrase-multilingual-MiniLM-L12-v2), so you can have documents in Spanish and ask in English — it works.
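The detection heuristic is essentially stopword scoring. This is a toy version to show the shape of the idea; the real heuristics from article #6 are richer, and these word lists are illustrative, not the production ones:

```python
# Toy stopword-based language scoring for ES/EN/PT: count how many of the
# query's words appear in each language's stopword set and pick the winner.
STOPWORDS = {
    "es": {"el", "la", "los", "que", "de", "y", "cómo", "qué", "para"},
    "en": {"the", "is", "are", "what", "how", "and", "of", "to", "for"},
    "pt": {"o", "os", "que", "e", "como", "não", "para", "uma", "onde"},
}

def detect_language(text, default="en"):
    words = set(text.lower().split())
    scores = {lang: len(words & sw) for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default
```

No API call, no latency, no cost, which is the whole point.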
Full analytics
- Resolution rate (resolved / escalated / negative)
- Top 10 most frequent questions
- Knowledge gaps (unanswered queries)
- Tag distribution (auto-tagging via LLM)
- Cost by provider and FAQ/cache savings
- Health Score per knowledge base (0-100)
- CSV/JSON export
Smart Routing
If you have multiple knowledge bases, Citai automatically routes the query to the most relevant KB using embedding centroids. No manual configuration.
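Centroid routing is conceptually simple: average each KB's document embeddings once, then send the query to the nearest centroid. A sketch with plain Python lists (production would use numpy and the real 384-dim embeddings; the function names are assumptions):

```python
import math

# Route a query to the knowledge base whose embedding centroid is most
# similar (cosine) to the query embedding.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def route(query_vec, centroids):
    """centroids: {kb_name: centroid_vector}. Returns the best-matching KB."""
    return max(centroids, key=lambda kb: cosine(query_vec, centroids[kb]))
```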
Content rules
BLOCK, REDIRECT and FILTER — to control which questions get processed and which topics are blocked. Evaluated before the LLM, at zero cost.
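The three actions can be sketched as a pre-LLM pass over the query. The rule shape and the simple keyword matching here are illustrative, not Citai's actual rule engine:

```python
# Evaluate content rules before any LLM call. BLOCK stops the query,
# REDIRECT points the user elsewhere, FILTER strips the matched text and
# lets the rest through.
def apply_rules(query, rules):
    q = query.lower()
    for rule in rules:
        if rule["keyword"] in q:
            if rule["action"] == "BLOCK":
                return ("blocked", rule.get("message", "Topic not allowed."))
            if rule["action"] == "REDIRECT":
                return ("redirect", rule["target"])
            if rule["action"] == "FILTER":
                q = q.replace(rule["keyword"], "")
    return ("process", q.strip())
```

Because this runs before retrieval and generation, a blocked question costs exactly nothing.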
Tool Calling
Connect external APIs (weather, calendar, CRM, anything) and the agent invokes them automatically when the question requires it. Templates included for the most common use cases.
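The dispatch side of tool calling is just a registry mapping tool names to functions; the LLM picks the name and arguments, the backend invokes. This sketch uses hypothetical names and a fake tool, not Citai's actual tool API:

```python
# Minimal tool registry: decorate a function to expose it as a tool, then
# dispatch by name with the arguments the LLM chose.
TOOLS = {}

def tool(name):
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("get_weather")
def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # a real tool would call an external API here

def invoke(name, **kwargs):
    if name not in TOOLS:
        raise KeyError(f"Unknown tool: {name}")
    return TOOLS[name](**kwargs)
```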
The stack (for those who read the whole series)
┌─────────────────────────────────────────┐
│ Frontend (Vue 3 + TS) │
│ Tailwind · vue-i18n · Chart.js │
├─────────────────────────────────────────┤
│ Nginx (TLS + HTTP/2) │
│ Proxy SSE · Gzip · Maintenance mode │
├─────────────────────────────────────────┤
│ Backend (FastAPI async) │
│ SQLAlchemy · Alembic · JWT · SSE │
├────────────┬────────────┬───────────────┤
│ PostgreSQL │ Redis │ Groq │
│ pgvector │ Rate limit │ LLaMA 3.3 │
│ BM25 │ Concurrent │ 70B versatile│
└────────────┴────────────┴───────────────┘
- LLM: Groq as primary (fast and cheap), with GPT-4o-mini fallback
- Embeddings: paraphrase-multilingual-MiniLM-L12-v2 (384 dimensions, runs locally)
- Reranker: cross-encoder/ms-marco-MiniLM-L-6-v2 (also local)
- DB: PostgreSQL 16 + pgvector (HNSW) + tsvector (BM25)
- Infra: Docker Compose on a 4GB VPS ($24/month)
Yes, all of this runs on a 4GB droplet. Article #2 explains how.
Plans
| | Free | Starter | Pro | Enterprise |
|---|---|---|---|---|
| Price | $0 | $29/mo | $79/mo | Custom |
| AI queries/month | 50 | 500 | 5,000 | Unlimited |
| Users | 2 | 5 | 15 | Unlimited |
| Knowledge bases | 1 | 5 | 20 | Unlimited |
| Documents | 10 | 50 | 200 | Unlimited |
| Storage | 100 MB | 500 MB | 5 GB | 50 GB+ |
| API keys | — | 2 | 10 | Unlimited |
| Widget | — | Yes | Yes | White-label |
| FAQs | Unlimited | Unlimited | Unlimited | Unlimited |
FAQs are unlimited across all plans. If a question matches a curated FAQ, it's answered for free without consuming AI queries. That's not a gimmick — it's the design from article #5 in action.
The Free plan doesn't require a credit card. Sign up, upload a PDF, and ask it something.
What I learned building this
1. The reranker improves quality more than anything else
I can swap the LLM, the embeddings, the chunk size — nothing improves relevance as much as adding a cross-encoder after retrieval. No other change comes close on cost/benefit.
2. FAQs are boring but powerful
A table with questions and answers isn't sexy. But each FAQ is a perfect answer, served in 15ms, at zero cost, forever. In a multi-tenant SaaS, that scales better than any LLM.
3. Semantic cache needs a high threshold
0.95 minimum similarity. At 0.90 I got incorrect cache hits. A miss is better than a wrong answer.
4. Multi-tenancy is not "add a tenant_id"
It's plan enforcement with atomic counters, rate limiting, data isolation in every query, and a role system that actually works. It takes more code than the RAG pipeline itself.
5. SSE streaming through Nginx has gotchas
proxy_buffering off and proxy_cache off are obvious. But X-Accel-Buffering: no in the response header is what actually fixed it. Article #4 has the details.
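Concretely, the backend side of the fix is just the headers on the streaming response (the variable name here is an assumption; the header values are the ones that made SSE work through Nginx):

```python
# Response headers for an SSE stream behind Nginx. X-Accel-Buffering: no is
# the one that actually disables per-response buffering in the proxy;
# proxy_buffering off and proxy_cache off live in the Nginx config itself.
SSE_HEADERS = {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    "Connection": "keep-alive",
    "X-Accel-Buffering": "no",
}
```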
6. A 4GB VPS handles more than you'd think
With 1 Uvicorn worker (the embedding model uses ~500MB), tuned PostgreSQL and Redis at 64MB, the system serves real traffic without issues. You don't need Kubernetes to launch.
7. Calibrated confidence changes user perception
When the system says "I'm not sure" instead of making things up, users trust the answers where it is sure even more. Counter-intuitive but measurable.
Try it
- Sign up for free (no credit card)
- Upload a PDF or document
- Ask it something
- Verify the source
If you've read any of the 6 previous articles, everything I described is what you'll find inside. It's not a prototype — it's production.
What's next
I'm working on:
- Automatic URL import with periodic sync
- More industry templates
- SDK for programmatic integration
- More LLM providers (Mistral, local Ollama)
If you're interested in any of these features, drop a comment or reach out. If you find a bug, same thing — it's real software, not a demo.
Thank you
To everyone who read, commented, and reacted to the previous articles. This series started as personal documentation and became something worth sharing.
If Citai looks useful to you — or if you're just interested in the stack — a like helps it reach more people.
This is the final installment of the "RAG in Production" series. The 6 previous articles are on my profile. Questions, feedback or ideas → comments are open.