DEV Community

Scarlett Attensil for LaunchDarkly

Posted on • Originally published at launchdarkly.com

Offline Evaluation of RAG-Grounded Answers in LaunchDarkly AI Configs

Overview

This tutorial shows you how to run an offline LLM evaluation on the RAG-grounded support agent you built in the Agent Graphs tutorial, using LaunchDarkly AI Configs, the Datasets feature, and built-in LLM-as-a-judge scoring. You'll build a RAG-grounded test dataset, run it through the Playground with a cross-family judge, and learn how to read each failing row as a dataset issue, an agent issue, or judge calibration noise.

Here's how it works. The LaunchDarkly Playground evaluates a single model call against a prompt and dataset you configure. By pre-computing your RAG retrieval offline and baking the chunks directly into each dataset row, you turn that call into a high-value generation test: the model in the Playground receives the same documentation context it would in production, so the eval measures how well your agent reasons over real grounded input.
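The pre-grounding idea can be sketched in a few lines. Here, `build_grounded_input` is a hypothetical helper (not part of the tutorial codebase) showing how chunks retrieved offline from your vector store get bundled into the single prompt string the Playground will evaluate:

```python
# Sketch of pre-computed grounding: retrieval runs offline, and the
# resulting chunks are baked into the prompt the Playground model sees.
# The chunk text below is illustrative.

def build_grounded_input(chunks: list[str], question: str) -> str:
    """Bundle pre-retrieved chunks and the question into one prompt."""
    context = " --- ".join(chunks)
    return f"Documentation context: --- {context} --- Question: {question}"

row_input = build_grounded_input(
    ["We offer a 30-day refund policy for first-time subscribers."],
    "What is the refund policy?",
)
```

Because the chunks are frozen into the dataset, every eval run exercises the same grounding, which is what makes run-to-run score comparisons meaningful.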

What You'll Learn

  • Structure a RAG-grounded test dataset by pre-computing retrieval offline and bundling chunks into each row
  • Pick the right LLM judge for your agent's output shape (Accuracy for natural-language answers, Likeness for structured labels)
  • Avoid same-model bias by running the judge on a different model family than the agent
  • Diagnose failing rows as dataset issues, agent issues, or judge calibration noise

What this tutorial covers, and what it doesn't

Covers:

  • Generation quality over RAG context: does the model produce a correct answer when the right documentation is in the prompt?
  • Regression detection: catching unexpected score drops when you change a prompt or model
  • Variation selection: comparing candidate prompts and models before committing to a new AI Config variation

Does not cover:

  • Retrieval correctness. Whether your vector store is returning the best chunks is tested by your own RAG pipeline, outside LaunchDarkly.
  • End-to-end agent graph behavior. Tool execution, multi-turn conversations, handoffs, and multi-step routing require online evals against real production traffic.

Prerequisites

Before you start, you'll need:

  • The devrel-agents-tutorial codebase from the Agent Graphs tutorial
  • A LaunchDarkly account with AI Configs, plus an SDK key and an API key
  • OpenAI and Anthropic API keys (one for the agent, one for the cross-family judge in Step 5)
  • uv installed to manage dependencies and run the scripts

Step 1: Get the Branch Running

About the branch and the Umbra knowledge base. The feature/offline-evals branch builds on the same Agent Graphs tutorial codebase and the routing, tool, and graph work done in earlier branches — none of that goes away. What this branch adds is a more realistic RAG assessment target: Umbra, a fictional serverless-functions product with an invented knowledge base (refund windows, deployment regions, function timeout limits, rate-limit tiers, and so on).

Because Umbra doesn't exist outside this tutorial, the model under test has no pre-training knowledge to fall back on. A correct answer has to come from the retrieved chunks, which is the only way to honestly measure whether your RAG pipeline is doing its job. The branch also ships a pre-built RAG-grounded test dataset (datasets/answer-tests.csv) and a helper script that regenerates it from your vector store.

cd devrel-agents-tutorial
git checkout feature/offline-evals
cp .env.example .env
# Add LD_SDK_KEY, LD_API_KEY, OPENAI_API_KEY, ANTHROPIC_API_KEY to .env

uv sync
uv run python bootstrap/create_configs.py
uv run python initialize_embeddings.py

Start the API and UI in two terminals:

# Terminal 1
uv run uvicorn api.main:app --reload

# Terminal 2
uv run streamlit run ui/chat_interface.py

Open http://localhost:8501 and ask a question grounded in the Umbra docs (refund policy, deployment regions, function timeout). The agent pulls answers from the knowledge base.

The Umbra support chat UI answering a question grounded in the Umbra knowledge base.

Step 2: Understand the Test Dataset

Open datasets/answer-tests.csv. Every row has three fields:

input,expected_output,original_question
"Documentation context: --- We offer a 30-day refund policy for first-time subscribers... --- Annual subscriptions receive a prorated refund within... --- Question: What is the refund policy?","30-day refund policy for first-time subscribers who haven't deployed production traffic. Usage charges are non-refundable.","What is the refund policy?"
  • input bundles documentation chunks and the question into a single structured prompt, separated by --- dividers. The chunks were retrieved from your production vector store ahead of time by tools/build_rag_dataset.py, so the model in the Playground sees the same grounding the production agent would, even though the Playground never executes your retrieval tools.
  • expected_output is the correct answer, written by a human who read the source docs.
  • original_question is a plain-text copy of the question so you can scan the dataset without parsing the bundled prompt. No judge uses this field.
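A dataset builder only needs to emit those three columns. This is a minimal sketch, not the actual tools/build_rag_dataset.py implementation; the row content is illustrative:

```python
import csv
import io

# Minimal sketch of writing rows in the three-column dataset schema.
# Field names match the CSV header shown above; csv handles the quoting
# that the bundled multi-chunk input strings require.
def write_dataset(rows: list[dict], fh) -> None:
    writer = csv.DictWriter(
        fh, fieldnames=["input", "expected_output", "original_question"]
    )
    writer.writeheader()
    writer.writerows(rows)

buf = io.StringIO()
write_dataset(
    [{
        "input": "Documentation context: --- We offer a 30-day refund policy "
                 "for first-time subscribers. --- Question: What is the refund policy?",
        "expected_output": "30-day refund policy for first-time subscribers.",
        "original_question": "What is the refund policy?",
    }],
    buf,
)
```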

Regenerate the dataset when your knowledge base changes:

uv run python tools/build_rag_dataset.py

For the full reference on dataset format and limits, see Datasets for offline evaluations.

Step 3: Upload the Dataset

Use synthetic data only

Never upload real customer tickets, PII, secrets, or credentials. Replace anything sensitive with synthetic placeholders before upload. See the Playground privacy section for what gets forwarded to model providers.
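One way to enforce this is a scrub pass over the CSV before upload. The patterns below are examples, not an exhaustive PII filter — extend them for whatever sensitive data your source material can contain:

```python
import re

# Illustrative scrub pass: replace email addresses and API-key-shaped
# strings with synthetic placeholders before uploading a dataset.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "user@example.com"),
    (re.compile(r"sk-[A-Za-z0-9]{20,}"), "SYNTHETIC_KEY"),
]

def scrub(text: str) -> str:
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```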

Navigate to AI > Library in LaunchDarkly, select the Datasets tab, and click Upload dataset. Upload datasets/answer-tests.csv and name it answer-tests.

The LaunchDarkly Datasets tab showing the answer-tests dataset uploaded.

Step 4: Add Your Model API Keys

The Playground calls model providers directly, so it needs API keys for both the model running your agent and the model running your judge. These keys live in LaunchDarkly's "AI Config Test Run" integration, not in your AI Config.

  1. In the Playground, click Manage API keys in the upper-right corner.
  2. Click Add integration, pick a provider (e.g. OpenAI), paste your API key, accept the terms, and save.
  3. Repeat for the second provider (Anthropic) so you can run a cross-family judge in Step 5.

See the Playground reference doc for the canonical instructions. API keys are stored per-session, so you may need to re-paste them when you return.

Step 5: Run the Evaluation

From the Datasets list, click into answer-tests to open it in a Playground bound to that dataset.

Configure the test

  • System prompt: paste your support-agent instructions verbatim from the AI Config. Do not edit or simplify them.
  • Agent model: pick the model your support-agent variation uses (or a candidate you're considering swapping to). To compare two candidates, run the eval twice with different agent models and compare scores.
  • Acceptance criteria: attach an Accuracy judge with threshold 0.85. Accuracy scores whether the response correctly addresses the input question, which fits grounded natural-language answers.
  • Evaluation model: uncheck Use same model for evaluation and set the judge to a different model family from the agent. Same-family judging tends to reward output patterns the judge itself produces. A cross-family judge gives you an independent read.
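Conceptually, the acceptance criterion reduces to a per-row threshold check: the judge assigns each row a score, assumed here to be in [0, 1], and a row passes if it meets the threshold. The scores below are made up for illustration:

```python
# Conceptual sketch of the acceptance criterion: one judge score per
# dataset row, compared against the configured threshold.
def apply_threshold(scores: list[float], threshold: float = 0.85) -> list[bool]:
    return [score >= threshold for score in scores]

results = apply_threshold([0.92, 0.71, 0.85])
# 0.92 and 0.85 meet the 0.85 threshold; 0.71 does not.
```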

The Playground configured with the support-agent prompt, OpenAI as the agent, Anthropic as the evaluation model, and an Accuracy judge at 0.85 threshold.

Run the eval.

Reading the results


The example run above had 18 passes and 2 failures. When a row fails, the failure comes from one of three places, and each one sends you in a different direction:

  • The dataset's chunks don't contain the answer. This is a retrieval problem, not a generation problem. Rebuild the dataset with higher top_k, a reranker, or a different chunker, or verify the answer is indexed at all.
  • The chunks contain the answer but the model ignored them. This is the agent-side failure offline evals are designed to catch. Tighten the system prompt to insist on grounding, or switch to a more obedient model.
  • The chunks and the model are both fine but the judge disagreed. This is judge calibration noise. Lower the threshold, try a different judge, or accept it as noise. Don't change your agent based on it.

Sort by score. For each failing row, open the bundled chunks in the input field and ask: was the right answer in there? Yes → fix the prompt or model. No → rebuild the dataset.
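That triage question can be sketched in code. The substring check below is a crude stand-in for reading the bundled chunks yourself, and it can't separate agent issues from judge calibration noise — that last distinction still requires a human look at the model's answer:

```python
# Naive triage sketch: was the right answer in the bundled chunks?
# A substring match is a rough proxy for a human read of the row.
def triage(chunks: str, expected: str, passed: bool) -> str:
    if passed:
        return "ok"
    if expected.lower() not in chunks.lower():
        return "dataset issue: rebuild retrieval (top_k, reranker, indexing)"
    return "agent issue or judge noise: read the model's answer"
```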

What failed in this run

Row 11: "What integrations are available?" (chunks missed the answer). The expected output mentioned monitoring integrations (Datadog, Sentry, LogRocket), but the retrieved chunks only covered databases, storage, and billing. The model correctly listed what it had and said "the documentation does not provide additional information regarding more integrations", which is the correct behavior for an ungrounded claim. Fix: higher top_k or a reranker in build_rag_dataset.py.

Row 12: "Can I get a refund on bandwidth overages?" (judge calibration). The model correctly said bandwidth overages are non-refundable, citing the docs, but omitted a secondary "Review your Usage Dashboard" recommendation from the expected output. Semantically right, just one clause short of the expected output. Fix: lower the threshold or trim the expected output.

Two failures, two different fixes. Without reading the per-row results you'd conflate them and spend time tightening the model when the actual problem lives in the retriever or the dataset.

Where to Go From a Single Run

This tutorial walked you through one run. In practice, a single eval isn't where offline evaluation earns its keep. The real payoff comes from re-running the same dataset against a new prompt, a new model, or a fresh RAG chunker and comparing scores to your last known-good run. A small prompt edit that quietly drops your Accuracy from 0.83 to 0.71 is exactly the kind of regression this pattern is meant to catch, but only if you save the run and compare against it next time.

A reasonable next loop:

  1. Save the run from Step 5 as your reference.
  2. When you change something (prompt, model, chunker, top_k), re-run the same dataset and compare scores.
  3. Add new rows to the dataset as you find failure modes in staging or production.
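The regression check at the heart of that loop is a simple comparison against the saved baseline. This sketch flags any aggregate-score drop beyond a tolerance; the numbers reuse the illustrative 0.83 → 0.71 drop mentioned above:

```python
# Sketch of a baseline comparison: flag a new run whose aggregate
# score drops more than `tolerance` below the saved reference run.
def detect_regression(baseline: float, current: float,
                      tolerance: float = 0.05) -> bool:
    return (baseline - current) > tolerance

detect_regression(0.83, 0.71)  # a 0.12 drop exceeds the 0.05 tolerance
```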

For end-to-end behavior that offline tests can't capture (tool execution, multi-turn conversations, the tail of real production inputs), see online evaluations and the When to add online evals tutorial. Online evaluations are not currently supported for agent-based AI Configs; for agent workflows, the documented path is programmatic judge evaluation via the AI SDK.

Step 6: Track Evaluation History

View saved runs at AI > Evaluations. Toggle Group by dataset to collapse runs under each dataset name so you can see the history for answer-tests alongside any other datasets in the project. Compare pass and fail counts across runs, and distinguish saved runs (indefinite retention) from one-off runs (60-day expiry). For metric definitions, see Monitor AI Configs.

What's Next

  • Progressive rollouts: release your winning variation to 5% of traffic, then 25%, then 100%, watching production metrics before expanding.
  • When to add online evals: decide what to score on live production traffic once you have an offline baseline.

For a deeper look at the multi-agent RAG system this tutorial builds on, see the Agent Graphs tutorial.
