Oluwafemi Adedayo
The 5 Levels of RAG Maturity: How to Know When Your RAG Is Actually Production-Ready

I've watched a lot of RAG projects ship in the last eighteen months. They all follow the same arc.

Week 1: somebody wires up a vector store, drops in a few PDFs, and demos it to the team. Everyone is impressed. The CEO forwards the Loom video to the board.

Week 4: the first real users start asking questions. The answers are... fine? Sometimes. Nobody can quite tell.

Week 8: someone asks "is this actually working?" and the room goes quiet. There's a notebook somewhere with eleven test questions in it. It hasn't been run since week 2.

Week 12: the project is either quietly shelved or it's still in production and nobody trusts it.

The problem isn't that RAG is hard to build. The problem is that RAG is hard to evaluate, and without evaluation you can't tell the difference between "a demo that sometimes works" and "a system you can put in front of customers."

This post is about a framework I've been using to close that gap: a 0-to-5 maturity model for RAG systems, with concrete exit criteria at each level. Call it RMM — the RAG Maturity Model. By the end you should be able to look at any RAG system, including your own, and answer the question "what level is this, and what's the next thing I need to fix?"

Why "production-ready" is a useless phrase

The phrase "production-ready RAG" gets thrown around like everyone agrees what it means. Nobody does.

For one team it means "the API returns 200s." For another it means "no hallucinations on the golden set." For a regulated team it means "we can prove to an auditor that PII never leaves the perimeter." These are wildly different bars, and conflating them is how projects end up shipping at one bar and getting evaluated at another.

What we need is the same thing the rest of software engineering figured out years ago: a maturity model. CMMI did it for process. The Well-Architected Framework did it for cloud. DORA did it for delivery. Each of these works because it gives you:

  1. A small number of named levels
  2. Concrete, measurable exit criteria for each level
  3. A clear ordering — you can't skip rungs

Here's what that looks like for RAG.

The RAG Maturity Model

| Level | Name | Exit Criteria |
|-------|------|---------------|
| RMM-0 | Naive | Basic vector search works |
| RMM-1 | Better Recall | Hybrid search, Recall@5 > 70% |
| RMM-2 | Better Precision | Reranker active, nDCG@10 +10% |
| RMM-3 | Better Trust | Guardrails, faithfulness > 85% |
| RMM-4 | Better Workflow | Caching, P95 < 4s, cost tracking |
| RMM-5 | Enterprise | Drift detection, CI/CD gates, adversarial tests |

Let's walk through each level, what it actually feels like, and how you know you're done.

RMM-0: Naive

This is the demo. You chunked some docs, embedded them with text-embedding-3-small, stuffed them into a vector store, and built a retrieval loop that grabs the top-k and pipes them into a prompt. It works on the questions you tested it with. It feels like magic.
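That naive loop is small enough to sketch in full. Here is a toy version with hand-rolled cosine similarity; the two-dimensional embeddings and document ids are made up for illustration, and a real system would use a vector store rather than a Python list:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=3):
    """Score every (doc_id, embedding) pair and return the k best."""
    scored = [(doc_id, cosine(query_vec, emb)) for doc_id, emb in index]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy index: in practice these come from your embedding model.
index = [("doc-a", [1.0, 0.0]), ("doc-b", [0.7, 0.7]), ("doc-c", [0.0, 1.0])]
results = top_k([1.0, 0.1], index, k=2)
```

The "retrieval loop" at RMM-0 really is this thin: embed the query, rank by similarity, paste the winners into the prompt.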

Exit criteria: You can answer a question end-to-end and the answer references the right document at least some of the time.

Why teams get stuck here: They don't. They get stuck moving past here, because RMM-0 looks so close to working that nobody believes it's only level zero.

The lie RMM-0 tells you: that retrieval quality is mostly solved by picking a good embedding model. It isn't.

RMM-1: Better Recall

The first thing you discover at RMM-0 is that vector search misses obvious matches. Someone asks a question with a specific product code or acronym, and the right document is sitting in your index, but cosine similarity ranks five irrelevant docs above it because the query and the doc don't share enough semantic surface area.

The fix is hybrid search — combine dense (vector) and sparse (BM25 or SPLADE) retrieval, then merge the results with reciprocal rank fusion or a similar scheme. Sparse retrieval catches the literal-match cases that embeddings fumble.
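Reciprocal rank fusion itself is only a few lines: each document's fused score is the sum of 1/(k + rank) over every ranked list it appears in, with k conventionally set to 60. A minimal sketch, with made-up document ids:

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: fuse several ranked lists of doc ids.
    score(d) = sum over lists containing d of 1 / (k + rank_in_list)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["d3", "d1", "d7", "d2"]  # vector search ranking
sparse = ["d7", "d3", "d9"]        # BM25 ranking
fused = rrf([dense, sparse])
```

Documents that appear high in both lists float to the top, which is exactly the behavior you want when dense and sparse retrieval disagree.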

Exit criteria: Recall@5 above 70% on a golden set of at least 50 questions. (Recall@5 = "of the questions where the right answer exists in your corpus, how often is the relevant doc in the top 5 retrieved chunks.")

The honest signal: if you can't measure Recall@5, you're still at RMM-0 regardless of how fancy your retrieval pipeline looks. Measurement is the gate, not the technique.
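Measuring it is cheap once the golden set exists. A sketch, assuming a golden set that maps each question to one known-relevant chunk id:

```python
def recall_at_k(results, golden, k=5):
    """results: {question: [retrieved chunk ids, best first]}
    golden:  {question: the one relevant chunk id}"""
    hits = sum(
        1 for question, relevant in golden.items()
        if relevant in results.get(question, [])[:k]
    )
    return hits / len(golden)

# Toy data: q2's relevant chunk never made the top 5.
golden = {"q1": "c9", "q2": "c2", "q3": "c4"}
results = {
    "q1": ["c1", "c9", "c3", "c4", "c5"],
    "q2": ["c7", "c8", "c1", "c3", "c6"],
    "q3": ["c4"],
}
score = recall_at_k(results, golden)  # 2 of 3 questions hit
```

If your golden set allows multiple relevant chunks per question, swap the membership test for a set intersection; the shape of the metric doesn't change.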

RMM-2: Better Precision

Now retrieval finds the right doc, but it also finds four wrong ones, and the LLM gets confused or hedges. This is where you add a reranker: a second-stage model (Cohere Rerank, BGE Reranker, Voyage, etc.) that takes your top-50 retrieved chunks and reorders them by actual relevance to the query.

The lift from a good reranker is usually larger than the lift from picking a better embedding model, and it's much cheaper to iterate on.

Exit criteria: nDCG@10 improves by at least 10% versus the RMM-1 baseline on the same golden set. (nDCG@10 captures ordering quality, not just whether the right doc is in the set.)

Common mistake: skipping straight from RMM-0 to a reranker without ever measuring recall. If your retrieval has 40% recall, a reranker can't save you — it can only reorder what it's given.

RMM-3: Better Trust

At this point retrieval is solid, but the LLM still occasionally makes things up. It cites sources that don't say what it claims they say. It answers questions it shouldn't answer. It leaks PII it should redact.

RMM-3 is about the answer, not the retrieval. You add:

  • Faithfulness scoring — does the generated answer actually follow from the retrieved context, or is the model improvising?
  • Guardrails — input/output filters for prompt injection, jailbreaks, and policy violations.
  • PII scanning — detect and redact names, emails, SSNs, etc., on the way in and the way out.
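To make the PII layer concrete, here is a toy redaction pass, for illustration only: the two regexes are simplistic stand-ins, and a production system should use a dedicated PII detector (Presidio, a cloud DLP API) rather than hand-rolled patterns.

```python
import re

# Deliberately simplistic patterns; real PII detection needs far more
# than regexes (context, checksums, locale-specific formats).
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text):
    """Replace each detected PII span with a typed placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

clean = redact("Contact jane@example.com, SSN 123-45-6789.")
```

Run the same pass on the way in (user query) and on the way out (generated answer), and log the block/redaction counts so the "measurable block rates" in the exit criteria are actually measured.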

Exit criteria: Faithfulness > 85% on the golden set, plus guardrails and PII scans wired into the request path with measurable block rates.

Why this level matters more than the previous two: RMM-1 and RMM-2 fail visibly — users see bad answers and complain. RMM-3 failures are invisible. The model confidently states something false and the user believes it. These are the failures that destroy trust and end projects.

RMM-4: Better Workflow

Your RAG system answers correctly. Now it has to do that fast, cheaply, and predictably, every day, for a year, while the team keeps shipping changes. RMM-4 is the operational layer:

  • Caching at the embedding, retrieval, and generation tiers
  • P95 latency under 4 seconds end-to-end
  • Cost tracking per query so the finance team doesn't ambush you in month three
  • Distributed tracing (OpenTelemetry, Langfuse) so you can debug a bad answer two weeks after it happened

Exit criteria: P95 < 4s, cost-per-query measured and trending in a dashboard, and a trace for any query you can name.
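The embedding tier of that caching stack can start as a content-addressed map: hash the exact text, and only call the provider on a miss. A sketch, where `embed_fn` is a stand-in for your provider call and the in-memory dict would be Redis or SQLite in a real deployment:

```python
import hashlib

class EmbeddingCache:
    """Content-addressed embedding cache: identical text is embedded once."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.store = {}      # swap for Redis/SQLite in production
        self.hits = 0
        self.misses = 0

    def get(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self.store:
            self.misses += 1
            self.store[key] = self.embed_fn(text)
        else:
            self.hits += 1
        return self.store[key]
```

The hit/miss counters are the point: they feed directly into the cost-per-query dashboard this level asks for, because every hit is a provider call you didn't pay for.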

The insight: at RMM-4 you stop thinking about RAG as a model problem and start thinking about it as a systems problem. The hardest bugs are no longer prompt bugs; they're cache-invalidation bugs.

RMM-5: Enterprise

The final level is one most teams never need, but the teams that do need it can't ship without it:

  • Drift detection — alerts when the embedding distribution, query distribution, or answer quality shifts
  • CI/CD gates — every PR runs the full audit and is blocked if faithfulness, recall, or precision regress beyond a threshold
  • Adversarial test suites — red-team prompts, prompt-injection corpora, and known-bad inputs run on every release

Exit criteria: A merge to main triggers an evaluation run, the run posts to a dashboard, and a regression beyond a configured threshold blocks the merge.

What this really gives you: the ability to keep shipping changes to a RAG system without breaking it. Up to RMM-4 you're building. At RMM-5 you're maintaining, and maintenance is where most RAG projects die.

How to use the model

Three rules.

1. Don't skip levels. RMM is ordered for a reason. A reranker on top of broken recall is wasted compute. Faithfulness scoring on top of bad retrieval just measures how confidently your model is wrong. Drift detection on a system with no golden set has nothing to drift from. The exit criteria are gates, not suggestions.

2. Measure before you optimize. The biggest difference between teams stuck at RMM-0 and teams climbing the model is whether they have a golden set. Fifty questions with known-good answers is enough to start. You can grow it later. You cannot skip it.
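Whatever tooling you use, the golden set itself is just structured data. One hypothetical shape (the actual RAG-Forge format may differ), plus a validator you can run in CI so a malformed entry fails fast:

```python
# Hypothetical golden-set entry shape; adapt the field names to your tool.
GOLDEN_SET = [
    {
        "question": "What is the refund window?",
        "relevant_chunk_ids": ["policy-04#2"],
        "reference_answer": "30 days from delivery.",
    },
]

def validate(entries):
    """Reject entries missing a required field; return the count on success."""
    required = {"question", "relevant_chunk_ids", "reference_answer"}
    for i, entry in enumerate(entries):
        missing = required - entry.keys()
        if missing:
            raise ValueError(f"entry {i} missing {sorted(missing)}")
    return len(entries)
```

Fifty of these is an afternoon of work with a subject-matter expert, and it unblocks every level above RMM-0.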

3. Score honestly. It is much more useful to know you are an honest RMM-1 than to claim you are an aspirational RMM-3. The model only works if the level is verified, not asserted.

Trying this on your own system

I built a CLI called RAG-Forge that implements this model end-to-end. It scaffolds a RAG pipeline, runs evaluation as a CI gate, and scores any audit report against the RMM levels above. It's MIT-licensed and works against your existing pipeline — you don't need to rewrite anything to score it.

The scoring command takes a JSON audit report and tells you which level you're at and what's blocking the next one:

```shell
npm install -g @rag-forge/cli

rag-forge init basic --directory my-rag
cd my-rag
rag-forge index --source ./docs
rag-forge audit --golden-set eval/golden_set.json
rag-forge assess --audit-report reports/audit-report.json
```

The assess command output looks roughly like this:

```
Current level: RMM-2 (Better Precision)
✓ Recall@5: 78% (target: >70%)
✓ nDCG@10 lift: +14% (target: +10%)
✗ Faithfulness: 71% (target: >85%) — blocking RMM-3

Next action: faithfulness is below RMM-3 threshold. Add output
verification or a guardrails layer before claiming Better Trust.
```

That's the whole point of the model — turning "is our RAG good?" into "we are at level 2 and the specific thing blocking level 3 is faithfulness below 85%." You can put that on a roadmap. You can argue about it in a PR review. You can show it to a non-technical stakeholder and they will understand it.

Even if you don't use the tool, you can use the framework. Print the table at the top of this post, score your own system honestly, and you will already know more about where your RAG stands than 90% of teams shipping RAG today.

Where the model breaks down

No maturity model survives contact with reality unmodified, so here are the cases where RMM bends.

  • Agentic RAG with multi-hop retrieval doesn't fit cleanly into per-query Recall@5. You usually need to score each hop separately and roll up.
  • Conversational RAG (long chat history) needs a faithfulness metric that accounts for the running context, not just the latest retrieval.
  • Domain-specific RAG (legal, medical, code) often needs custom quality targets — 85% faithfulness is too low for a legal brief and too high for a brainstorming assistant.

The model is a starting point, not a constitution. If you find a place where it doesn't fit your system, the right move is to write down why and what you replaced it with — that's still better than no framework at all.

The takeaway
If you remember nothing else from this post: build a golden set, score yourself honestly, and stop shipping RAG without knowing what level you're at.

The tools to do this are free. The framework is published. The reason most RAG projects don't have evaluation isn't that it's hard — it's that nobody insists on it. Be the person who insists.

If you try the model on your own system, I'd love to hear where it broke down for you. Drop a comment with your level and what's blocking the next one, or open a discussion in the RAG-Forge repo. The model gets better the more systems we score against it.
