Anindya Obi

Why a “Workflow Pack” for RAG + Eval?

If you ship production RAG systems, you already know the pattern:
• You start with a “simple” retrieval pipeline.
• A week later, you’re buried in ingestion edge cases, chunking mistakes, and silent drift.
• Someone asks, “Is this actually better?” and you realize your eval setup is a mix of eyeballing outputs, a few spot checks, and whatever metrics you had time to hack together.
None of this is because you’re bad at your job.
It’s because the glue work around RAG + evaluation is invisible and rarely reused.
This Workflow Pack V1 is meant to fix that.
Goal: Give you a reusable mental toolbox for RAG + eval – so the next time you build or debug a pipeline, you’re not starting from a blank page.
No code. Just diagrams, flows, and checklists you can apply in your stack of choice.


What’s Inside Workflow Pack V1
The pack bundles two weeks of visual work:
1. Week 1 – RAG Workflow Pack

A. Ingestion Map
Diagram + checklist to answer: “What exactly are we feeding into the system, and how controlled is it?”
• Source inventory (docs, notebooks, tickets, logs, Slack, etc.)
• Ingestion modes (batch, streaming, event-driven)
• Normalization steps (cleaning, de-duplication, PII handling)
• Versioning strategy (doc versions, schema versions, embeddings versions)
• “What can go wrong” checklist (missing fields, broken links, partial uploads)
Here's a link to my blog on this topic: https://dev.to/dowhatmatters/rag-ingestion-the-hidden-bottleneck-behind-retrieval-failures-1idn
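The pack itself stays code-free, but a minimal sketch helps pin the versioning idea down. Here's what a versioned ingestion record could look like in Python (stdlib only; field names like doc_version and embedding_version are illustrative, not prescribed by the pack):

```python
# Minimal sketch of a versioned ingestion record (illustrative field names).
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class IngestedDoc:
    source: str               # e.g. "confluence", "zendesk", "slack-export"
    source_id: str            # stable ID in the source system
    content: str
    doc_version: str          # version of the document itself
    schema_version: str = "v1"       # version of this record's shape
    embedding_version: str = "none"  # which embedder (if any) has seen it
    ingested_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    @property
    def content_hash(self) -> str:
        # Lets you detect "doc changed but we never re-embedded it".
        return hashlib.sha256(self.content.encode("utf-8")).hexdigest()

def validate(doc: IngestedDoc) -> list[str]:
    """Return the 'what can go wrong' findings instead of failing silently."""
    problems = []
    if not doc.content.strip():
        problems.append("empty or whitespace-only content (partial upload?)")
    if not doc.source_id:
        problems.append("missing source_id (can't trace back to the source)")
    return problems
```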

B. Chunking Map
A visual way to reason about chunk strategy vs. user questions:
• Sliding window vs. fixed chunk vs. semantic splitting
• Chunk size vs. model context tradeoffs
• Where you attach IDs, tags, and lineage
• Checklist:
o Is the chunk answerable in isolation?
o Can I reconstruct the original doc if needed?
o Am I leaking unrelated context into the same chunk?
Here's a link to my blog on this topic: https://dev.to/dowhatmatters/chunking-and-segmentation-the-quiet-failure-point-in-retrieval-quality-o8a
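As a concrete anchor for the sliding-window option, here's a rough sketch of chunking with lineage attached. The sizes, overlap, and ID format are illustrative, not recommendations:

```python
# Sliding-window chunking with lineage IDs attached to every chunk.
from dataclasses import dataclass

@dataclass
class Chunk:
    chunk_id: str   # doc_id + position, so you can reconstruct the doc
    doc_id: str
    start: int      # character offsets back into the original document
    end: int
    text: str

def sliding_window(doc_id: str, text: str, size: int = 800, overlap: int = 200) -> list[Chunk]:
    chunks, start, i = [], 0, 0
    step = size - overlap
    while start < len(text):
        end = min(start + size, len(text))
        chunks.append(Chunk(f"{doc_id}:{i}", doc_id, start, end, text[start:end]))
        start += step
        i += 1
    return chunks

# The lineage fields answer two of the checklist questions directly:
# reconstruct the doc by sorting on (doc_id, start); check isolation by
# reading chunk.text without its neighbours.
```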

C. Drift Map
A simple, repeatable mental model for RAG drift:
• Content drift (docs change, embeddings don’t)
• Usage drift (questions change, corpus doesn’t)
• Infra drift (models/embedders updated silently)
• Drift indicators:
o “It used to answer this, now it doesn’t”
o More “I don’t know” or hallucinated answers
o Sharp drop in retrieval relevance for key queries
• Checklist for drift investigation:
o When did we last re-embed?
o When did corpus change?
o Did we change models / hyperparameters?
Here's a link to my blog on this topic: https://dev.to/dowhatmatters/embedding-drift-the-quiet-killer-of-retrieval-quality-in-rag-systems-4l5m
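A drift investigation usually starts by comparing what the index was built from against what exists today. A minimal sketch, assuming you record the embedder version, a corpus hash, and the re-embed timestamp at index build time (all names here are placeholders):

```python
# Compare what's in the index against the corpus and config as they are today.
from dataclasses import dataclass

@dataclass
class IndexState:
    embedder_version: str   # recorded when the index was built
    corpus_hash: str        # hash of the corpus at embed time
    embedded_at: str        # ISO timestamp of the last full re-embed

@dataclass
class CurrentState:
    embedder_version: str
    corpus_hash: str

def drift_report(index: IndexState, now: CurrentState) -> list[str]:
    findings = []
    if index.corpus_hash != now.corpus_hash:
        findings.append(
            f"content drift: corpus changed since last re-embed ({index.embedded_at})"
        )
    if index.embedder_version != now.embedder_version:
        findings.append("infra drift: embedder changed, index built with the old model")
    return findings  # usage drift needs query logs, so it's checked separately
```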

D. Debug Map
Visual debug flow when “RAG is broken”:

  1. Is the question clear and in-scope?
  2. Did we retrieve anything relevant?
  3. Are chunks the right size/granularity?
  4. Is the prompt leaking or overwriting context?
  5. Is the model simply underpowered for this task?
Each node in the diagram comes with a 3–5 bullet checklist of things to log, inspect, or flip.
Here's a link to my blog on this topic: https://dev.to/dowhatmatters/the-boring-debug-checklist-that-fixes-most-rag-failures-201a
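The flow translates naturally into "stop at the first failing assumption." The sketch below is deliberately crude, with made-up thresholds and field names, just to show the ordering of the checks:

```python
# Walk the debug flow in order and stop at the first failing assumption.
def debug_rag(question: str, retrieved: list[dict], answer: str) -> str:
    # 1. Is the question clear and in-scope?
    if len(question.split()) < 3:
        return "question too vague: clarify it before blaming the pipeline"
    # 2. Did we retrieve anything relevant?
    if not retrieved or max(c.get("score", 0.0) for c in retrieved) < 0.3:
        return "retrieval failure: inspect the query, the index, and the filters"
    # 3. Are chunks the right size/granularity?
    if any(len(c.get("text", "")) > 4000 for c in retrieved):
        return "oversized chunks: likely burying the answer in unrelated context"
    # 4. Is the prompt leaking or overwriting context?
    if not answer.strip():
        return "empty answer despite good retrieval: check prompt assembly and truncation"
    # 5. "Model underpowered" is the last resort, not the first guess.
    return "pipeline looks sane: compare against a stronger model before concluding"
```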

E. Metadata Map
One view showing:
• Core metadata to track (source, timestamps, author, product area, permissions)
• Retrieval-time filters (tenant, environment, locale, feature flags)
• Post-hoc analysis fields (labels from evals, human feedback, bug tags)
The checklist forces the question:
“If this answer looks wrong in production, do we have enough metadata to debug it?”
Here's a link to my blog on this topic: https://dev.to/dowhatmatters/chunk-boundary-and-metadata-alignment-the-hidden-source-of-rag-instability-78b
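A minimal sketch of what that looks like as one record, grouped the same way as the map (field names are illustrative and will vary by stack):

```python
# One metadata record, split into the three groups on the Metadata Map.
from dataclasses import dataclass, field

@dataclass
class ChunkMetadata:
    # Core metadata, written at ingestion time
    source: str
    author: str
    product_area: str
    created_at: str
    permissions: list[str] = field(default_factory=list)
    # Retrieval-time filters
    tenant: str = "default"
    environment: str = "prod"
    locale: str = "en"
    # Post-hoc analysis fields, filled in later
    eval_labels: list[str] = field(default_factory=list)
    human_feedback: list[str] = field(default_factory=list)
    bug_tags: list[str] = field(default_factory=list)

# The debug test: given a bad answer in production, can you walk back through
# these fields to the exact source document and version? If any hop is missing,
# that's the gap to close first.
```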


2. Week 2 – Evaluation Workflow Pack

A. Eval Flow Diagram
A high-level eval pipeline that works across stacks:

  1. Define scenarios (what real users are trying to do)
  2. Build test sets (queries, contexts, references)
  3. Choose metrics (automatic + human)
  4. Run evals on:
o Retrieval only
o Full RAG (retrieval + generation)
  5. Inspect failures, update:
o Data
o Retrieval
o Prompts
o Models
Each step includes a small checklist so you’re not guessing the next move.
Here's a link to my blog on this topic: https://dev.to/dowhatmatters/building-a-baseline-evaluation-dataset-when-you-have-nothing-3oa9
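To show how the retrieval-only and full-RAG runs fit together, here's a sketch of the loop, assuming you plug in your own retrieve and generate functions (both are placeholders):

```python
# Eval loop: check retrieval first, run generation only when retrieval worked.
from dataclasses import dataclass

@dataclass
class EvalCase:
    scenario: str                 # what the user is trying to do
    query: str
    relevant_doc_ids: set[str]    # labeled reference docs
    reference_answer: str         # optional gold answer

def run_eval(cases: list[EvalCase], retrieve, generate, k: int = 5) -> dict:
    retrieval_hits, failures = 0, []
    for case in cases:
        retrieved = retrieve(case.query, k=k)       # -> list of doc IDs
        hit = bool(set(retrieved) & case.relevant_doc_ids)
        retrieval_hits += hit
        if not hit:
            failures.append(("retrieval", case))    # fix data/retrieval first
            continue
        answer = generate(case.query, retrieved)    # full RAG only after a hit
        if not answer.strip():
            failures.append(("generation", case))   # then look at prompts/models
    return {"hit_rate@k": retrieval_hits / max(len(cases), 1), "failures": failures}
```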

B. JSON Failure Map
If you’re returning structured JSON from your LLM, you’ve probably seen:
• Random missing fields
• Type mismatches
• Non-JSON “explanations”
• Half-valid / half-garbage responses
The JSON Failure Map gives you:
• A taxonomy of failure modes:
o Schema drift: your JSON schema changed; prompts didn’t.
o Overloaded prompts: too many constraints, model ignores some.
o Context overload: model uses context instead of schema as truth.
o Format forgetting: the classic “here’s your response” blob of text.
• For each failure mode:
o Example patterns
o What to log
o Where to fix (prompt, schema, validator, retry logic)
This is a visual way to stop treating JSON failures as “random LLM stuff” and start treating them as systematic issues.
Here's a link to my blog on this topic: https://dev.to/dowhatmatters/json-eval-failures-why-evaluations-blow-up-and-how-to-fix-them-dj
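Here's a sketch of what the "validator plus retry" leg could look like. The schema and the call_model function are placeholders, but the classification mirrors the taxonomy above:

```python
# Validate structured output, classify the failure mode, retry a bounded number of times.
import json

REQUIRED_FIELDS = {"title": str, "severity": str, "tags": list}

def classify_failure(raw: str) -> str | None:
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return "format forgetting: non-JSON prose or a half-valid blob"
    missing = [f for f in REQUIRED_FIELDS if f not in data]
    if missing:
        return f"schema drift or overloaded prompt: missing fields {missing}"
    wrong_type = [f for f, t in REQUIRED_FIELDS.items() if not isinstance(data[f], t)]
    if wrong_type:
        return f"type mismatch (often context overload): {wrong_type}"
    return None  # valid

def call_with_retry(call_model, prompt: str, max_retries: int = 2) -> dict:
    for attempt in range(max_retries + 1):
        raw = call_model(prompt)
        failure = classify_failure(raw)
        if failure is None:
            return json.loads(raw)
        print(f"attempt {attempt}: {failure}")  # log it: these are systematic, not random
    raise ValueError("structured output kept failing: fix prompt/schema, don't just retry harder")
```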

C. Metrics Map
A compact view organizing metrics into three layers:

  1. Retrieval metrics
o Recall / hit rate on labeled queries
o MRR / nDCG for relevance
o Coverage of key scenarios
  2. Answer quality metrics
o Faithfulness / groundedness
o Task success (did the user get what they came for?)
o Preference models or rubric-based scoring
  3. System metrics
o Latency (end-to-end + per step)
o Cost per answer / per session
o Degradation over time (drift signals connected back to Week 1)
Each metric is attached to:
• Where it’s computed
• When it’s useful
• When it’s misleading / can be ignored
Here's a link to my blog on this topic: https://dev.to/dowhatmatters/metrics-map-for-llm-evaluation-groundedness-structure-correctness-2i7h
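For the retrieval layer, the first two metrics are short enough to sketch directly (labeled queries are assumed; nDCG and the other two layers need more machinery):

```python
# Recall@k and MRR over a single labeled query.
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Example: the right doc shows up at rank 2.
print(recall_at_k(["d9", "d3", "d7"], {"d3"}, k=3))  # 1.0
print(mrr(["d9", "d3", "d7"], {"d3"}))               # 0.5
```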

How to Use the Workflow Pack
You don’t have to adopt all of it at once.
Suggested ways to use it:

1. New RAG project:
Use the Ingestion, Chunking, and Metadata maps as a “pre-mortem” checklist in your design doc.

2. Debugging a flaky system:
Start at the Debug Map and follow the branches until you find the first failing assumption.

3. Making evals less ad-hoc:
Use the Eval Flow + Metrics Map to write a one-pager:
“This is how we say something is good/bad in this project.”

4. Teaching / onboarding:
Use the diagrams as a shared language with new team members so your “tribal knowledge” isn’t locked in Slack threads.


When Not to Use This Pack
This pack won’t be very helpful if:
• You’re only running toy demos / hackathon prototypes.
• You’re okay with “it usually works” and don’t need traceability.
• You don’t have any real user or business constraints yet.
It’s designed for AI engineers who:
• Own RAG systems in production or pre-production,
• Need to justify decisions to PMs / infra / leadership,
• And are tired of rebuilding the same mental scaffolding from scratch.
