1. The real problem: 3 “projects”, 1 system
If you look at most LLM stacks, they’re organized like this:
- A prompting project (playground experiments, prompt libraries)
- A RAG project (ingestion, chunking, retrieval, reranking)
- An eval project (datasets, metrics, dashboards)
Each is owned by a slightly different group, with its own tools and vocabulary.
On paper that sounds fine.
In practice, it creates systems that are:
- Hard to debug (“Is this a prompt bug or a retrieval bug?”)
- Easy to break with small changes (“Why did this config change tank our quality?”)
- Impossible to reason about end-to-end
The missing piece is a single pipeline view that shows how prompt design, RAG, and eval actually interact.
2. A simple mental model: Prompt → RAG → Eval
Here’s the high-level pipeline we ended up drawing on a whiteboard:
1. Prompt Packs
Reusable prompt templates & patterns that define how the model should behave.
- Task prompts (question answering, classification, generation, etc.)
- System-level instructions (style, safety, format)
- Tool usage hints (how to call retrieval, how to interpret results)
Think of Prompt Packs as your “behavioral contract” with the model.
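As a concrete (if simplified) illustration, a Prompt Pack can start as little more than a versioned template plus metadata. The names below (PromptPack, render) are hypothetical, not from any particular library; a sketch under those assumptions:

```python
from dataclasses import dataclass

@dataclass
class PromptPack:
    """A versioned, reusable prompt template: the 'behavioral contract'."""
    name: str
    version: str
    system: str           # system-level instructions: style, safety, format
    template: str         # task prompt with {placeholders}
    tool_hints: str = ""  # how to call retrieval / how to interpret results

    def render(self, **kwargs) -> list[dict]:
        """Fill the template and return chat-style messages."""
        return [
            {"role": "system", "content": f"{self.system}\n{self.tool_hints}".strip()},
            {"role": "user", "content": self.template.format(**kwargs)},
        ]

qa_pack = PromptPack(
    name="doc-qa",
    version="2024-06-01",
    system="Answer only from the provided context. If unsure, say so.",
    template="Question: {question}\n\nContext:\n{context}",
    tool_hints="Context comes from the retrieval step; cite snippet ids when possible.",
)
```

The point isn’t the exact shape; it’s that the template lives somewhere versioned and named, so “which prompt produced this answer?” has a real answer.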
2. RAG: Ingest → Index → Retrieve
Instead of “RAG” as one box, break it into three explicit stages:
- Ingest – how source data is pulled, cleaned, versioned
- Index – how text is chunked, embedded, stored (and where)
- Retrieve – how queries are built, filtered, reranked
RAG controls “what the model knows right now.”
Prompt Packs + RAG together decide:
- What question we’re really asking
- What context we’re allowed to use to answer it
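A minimal sketch of the three stages above, with naive in-memory stand-ins. These function names are illustrative; in a real stack, index would embed chunks into a vector store and retrieve would query it with reranking:

```python
def ingest(raw_docs: list[str]) -> list[str]:
    """Ingest: pull, clean, and normalize source data."""
    return [d.strip() for d in raw_docs if d.strip()]

def index(docs: list[str], chunk_size: int = 200) -> list[dict]:
    """Index: chunk documents and store them with ids/metadata."""
    chunks = []
    for doc_id, doc in enumerate(docs):
        for i in range(0, len(doc), chunk_size):
            chunks.append({"doc_id": doc_id, "chunk_id": i // chunk_size,
                           "text": doc[i:i + chunk_size]})
    return chunks

def retrieve(query: str, chunks: list[dict], k: int = 3) -> list[dict]:
    """Retrieve: score chunks against the query (here: keyword overlap) and return top k."""
    terms = set(query.lower().split())
    scored = sorted(chunks,
                    key=lambda c: len(terms & set(c["text"].lower().split())),
                    reverse=True)
    return scored[:k]
```

Splitting the stages like this matters because each has its own failure modes and its own configs that “someone just changes.”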
3. Eval loops: Close the behavior gap
Finally, we layer evaluation on top of this pipeline:
Offline evals on curated datasets:
- Golden questions + expected answers
- Checks for hallucination, relevance, style, latency, etc.
Online evals in production:
- User feedback
- Acceptance / rejection events
- Task completion signals (did this actually help someone?)
Eval loops are where we measure the behavior gap: the difference between what we think the system does (based on prompts + RAG) and what it actually does on real traffic.
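For the offline half, a bare-bones harness over golden questions might look like this. Everything here is a placeholder: answer_fn stands in for the real Prompt Pack → RAG → model path, and judge is whatever check you actually use (exact match, containment, an LLM judge):

```python
def run_offline_eval(golden_set, answer_fn, judge):
    """Run golden questions through the pipeline and record pass/fail per case."""
    results = []
    for case in golden_set:
        answer = answer_fn(case["question"])
        results.append({
            "question": case["question"],
            "expected": case["expected"],
            "answer": answer,
            "passed": judge(answer, case["expected"]),
        })
    pass_rate = sum(r["passed"] for r in results) / max(len(results), 1)
    return pass_rate, results

# Example usage with a deliberately naive judge:
golden = [{"question": "What is our refund window?", "expected": "30 days"}]
pass_rate, report = run_offline_eval(
    golden,
    answer_fn=lambda q: "30 days",                      # stand-in for the real pipeline
    judge=lambda ans, exp: exp.lower() in ans.lower(),  # containment check only
)
```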
3. How it all ties together (and feeds back)
Once you see Prompt Packs, RAG, and eval as one pipeline, you can design feedback loops:
Eval → Prompt:
- If answers are structurally correct but off-topic → revise prompt constraints.
- If answers are on-topic but messy → tighten response format, style, or examples.
Eval → RAG:
- If answers are vague or unsupported → improve retrieval (query building, filters, scoring).
- If answers contradict your docs → check ingest/refresh and indexing.
Eval → System design:
- If most failures are in one stage (e.g., ingest) → invest there instead of blindly swapping models.
Instead of arguing “prompt vs RAG vs eval”, you’re debugging a single pipeline.
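One way to make those feedback loops operational is to tag each eval failure with the stage most likely responsible, so the fix lands in the right place. The failure labels below are examples, not a standard taxonomy:

```python
# Route eval failures to the pipeline stage most likely responsible.
FAILURE_ROUTING = {
    "off_topic":        "prompt",    # structurally fine, wrong focus -> revise constraints
    "format_messy":     "prompt",    # right content, wrong shape -> tighten format/examples
    "unsupported":      "retrieve",  # vague or ungrounded -> query building, filters, scoring
    "contradicts_docs": "ingest",    # stale or wrong context -> check refresh and indexing
}

def triage(failures: list[str]) -> dict[str, int]:
    """Count failures per pipeline stage to decide where to invest next."""
    counts: dict[str, int] = {}
    for label in failures:
        stage = FAILURE_ROUTING.get(label, "unknown")
        counts[stage] = counts.get(stage, 0) + 1
    return counts

print(triage(["off_topic", "unsupported", "unsupported", "contradicts_docs"]))
# -> {'prompt': 1, 'retrieve': 2, 'ingest': 1}
```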
4. The Full-System Overview Diagram
The diagram that helped things “click” visually looks like this (simplified):
- User / upstream task
↓
- Prompt Pack selection: choose task template, insert user input & constraints
↓
- RAG query construction: build retrieval query from prompt + task
↓
- RAG pipeline: Ingest → Index → Retrieve
↓
- Model call: Prompt + retrieved context
↓
- Eval loops: offline test sets, online signals
↓
- Feedback: update Prompt Packs; adjust RAG configs; refine eval datasets & metrics
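Expressed as code, the same flow is just a handful of composed steps. This sketch reuses the hypothetical pieces from earlier (PromptPack, retrieve) and assumes call_model and log_event are provided by your stack:

```python
def answer(question: str, pack, chunks, call_model, log_event) -> str:
    """One pass through the pipeline: Prompt Pack -> RAG -> model call -> eval signal."""
    # 1. RAG query construction + retrieval (real query building is usually richer)
    hits = retrieve(question, chunks, k=3)
    context = "\n---\n".join(h["text"] for h in hits)

    # 2. Prompt Pack selection + rendering
    messages = pack.render(question=question, context=context)

    # 3. Model call
    response = call_model(messages)

    # 4. Online eval signal: log enough to trace a failure back to its stage
    log_event({
        "question": question,
        "pack": f"{pack.name}@{pack.version}",
        "retrieved_ids": [(h["doc_id"], h["chunk_id"]) for h in hits],
        "response": response,
    })
    return response
```

The logging in step 4 is what makes the feedback arrows real: without the pack version and retrieved ids attached to every answer, you can’t tell whether a bad answer came from the prompt, the index, or the retrieval step.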
Once we had this on one page, a few things became obvious:
- We could map where issues showed up vs where they actually originated
- We could propose changes to the pipeline without playing blame ping-pong
- We could prioritize work that moved the full system, not a single component
5. How to apply this to your own stack
If you want to try this on your own system:
- Write down your Prompt Packs
  - What prompt templates do you actually use?
  - Where are they stored? Who updates them?
- Draw your RAG pipeline as 3 stages
  - Ingest sources, indexing strategy, retrieval logic
  - Note any places where “someone just changes a config”
- List your eval loops
  - Offline datasets, run cadence, metrics
  - Online signals you already have (thumbs up/down, completion, etc.)
- Put it all on one diagram
  - Connect arrows, mark where feedback really flows
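If it helps, the “one page” can literally start as a small inventory before it becomes a diagram. The structure and values here are placeholders to adapt to your own stack:

```python
# Rough pipeline inventory: a starting point for the one-page diagram.
STACK_INVENTORY = {
    "prompt_packs": {
        "doc-qa": {"stored_in": "repo: prompts/", "owner": "app team"},
    },
    "rag": {
        "ingest":   {"sources": ["wiki", "tickets"], "refresh": "nightly"},
        "index":    {"chunking": "~200 chars", "store": "vector DB"},
        "retrieve": {"top_k": 3, "filters": ["product", "region"]},
    },
    "eval": {
        "offline": {"dataset": "golden_v1", "cadence": "per release", "metrics": ["pass_rate"]},
        "online":  {"signals": ["thumbs", "task_completion"]},
    },
}
```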
You’ll likely discover that some of your “prompt issues” are actually RAG issues, and some “RAG issues” are actually gaps in how you evaluate.