Scarlett Attensil for LaunchDarkly

Posted on Jun 2 • Originally published at launchdarkly.com

AI Experimentation Best Practices: From Evaluation to Safe Production Rollouts

#ai #llm #testing #mlops

Introduction

Artificial intelligence tools, particularly large language models (LLMs), are not like traditional software. AI is probabilistic, so the same instructions and inputs can produce different results, especially when using non-zero temperature or other sampling methods, and those results can shift as your context changes. That unpredictability brings real risks because models can miss the mark, invent facts, or generate unfair or unsafe outputs. They can also incur unexpected costs and slow down under heavy loads, and they must constantly adapt to evolving policies and ethical guidelines.

AI experimentation means iteratively testing data, algorithms, prompts, models, and parameters to optimize model performance and validate hypotheses. You need a clear, repeatable way to try ideas, compare prompts and models, validate how your system finds and uses information, and do safety checks before changes reach real users. Experimentation is not just a nice-to-have; it is essential for shipping AI responsibly, optimizing resource efficiency, reducing costs, and accelerating innovation through rapid, evidence-based iteration cycles.

Throughout this guide, we distinguish evaluation from experimentation. Evaluation means offline benchmarking and scoring, including test sets, human or AI judges, and quality metrics. Experimentation means controlled production changes that affect real users through A/B tests, staged rollouts, or other release strategies. Evaluation tells you whether a variant clears a quality bar; experimentation tells you whether it beats the baseline in production, with statistical confidence and guardrails.

In this article, we cover the core ideas and practical steps for AI experimentation: how to plan a test, evaluate changes, run controlled trials with real users, choose metrics that actually matter to your product, and roll out changes safely. By the end, you will have a process that moves from initial concept to monitored, controlled production release.

AI Experimentation Best Practices

Best Practice	Description
Use experimentation to manage uncertainty	AI outputs can shift over time. Structured experimentation helps teams measure, compare, and validate changes before they reach users. LaunchDarkly AgentControl experiments and release options help turn unpredictability into a controlled process for improvement.
Build trust through evidence, not intuition	Without experimentation, teams rely on gut feeling. Controlled tests provide measurable evidence of what works. Use LaunchDarkly Experimentation, metrics, and AgentControl monitoring to make confident, data-driven decisions.
Detect and reduce hidden risks early	Experimentation surfaces hallucinations, bias, latency regressions, and safety failures before they affect broad audiences. Online evaluations and guarded rollouts help teams detect regressions and pause or roll back unsafe changes.
Enable continuous improvement	AI systems evolve as data, models, and contexts change. Config variations, config targeting, and progressive rollouts give teams a repeatable way to adapt while controlling exposure.
Design experiments with statistical power and variance in mind	Collect multiple observations per variant to account for nondeterminism. Use confidence intervals and statistical significance tests rather than single-run comparisons. Define a minimum detectable effect (MDE), guardrail metrics, and a decision rule before launch. LaunchDarkly experiments and metrics support this evidence-based workflow.
Support responsible and compliant AI	Experimentation frameworks help teams evaluate whether updates align with ethical standards, privacy requirements, and evolving policies. LaunchDarkly role-based access control, approvals, and audit logs help make responsible AI development a built-in process.
Keep track of cost and latency	Track per-session spend and speed, set budgets and max token limits, optimize prompts and context, use caching or streaming where appropriate, and monitor TTFT, p95/p99 latency, retries, and spend. AgentControl monitoring and autogenerated AI metrics help surface cost, latency, and token usage by variation.
Conduct controlled testing with real users	Run A/B tests, sticky cohorts, staged rollouts, or interleaving strategies. Measure satisfaction, task completion, latency, cost, and business impact. Use targeting rules, percentage rollouts, and guarded rollouts to control exposure and rollback thresholds.
Perform evaluation	Define metrics for truthfulness, user experience, reliability, safety, cost, and speed. Test in layers and expand only when stable. Evaluation tells you whether a system meets a bar, while experimentation determines which variant should be trusted in production. LaunchDarkly online evaluations, datasets, and judges support layered AI evaluation workflows.
Use retrieval evaluation for RAG	Evaluate retrieval and grounding quality by measuring recall@k, precision@k, citation accuracy, unsupported claim rate, cost, and latency. After offline quality assessment, use live or shadow traffic for controlled experiments that optimize retrievers, chunking, ranking, or reranking. LaunchDarkly AgentControl experiments and monitoring help compare these changes safely.
Ensure proper governance and safety for AI experimentation	Pre-register your experiment plan, including hypothesis, primary metric, MDE, guardrails, and rollback rules. Version prompts, models, and configurations. LaunchDarkly config management, approvals, and audit logs help preserve compliance, safety, and auditability.

Note: Testing different chunking or embedding models usually requires building and validating separate vector indexes, and sometimes separate databases, because each index holds vectors produced by one embedding model over one chunking scheme, and vectors from different models are not comparable. Swapping these at inference time requires architectural planning, reindexing, and migration.

Why AI Needs Experimentation

Traditional software works like a calculator: same input, same output. AI is more like a conversational assistant that can be helpful and creative but sometimes surprising. Since AI is not fully predictable and small changes in wording can shift results, you cannot judge the quality of an AI feature from a single right answer.

AI features are pipelines with many moving parts: models that may update, prompts that steer behavior, tools and APIs that can fail, and knowledge sources that drift as content changes. All of these can affect accuracy, safety, speed, and cost. A one-time test will not catch issues that appear under real traffic.

That is why experimentation is essential. It gives teams a structured way to observe, measure, and improve AI behavior as conditions change. Through continuous testing, you can detect drift, uncover hidden risks, and build confidence that your system performs reliably and responsibly.

LaunchDarkly helps teams operationalize this workflow with AgentControl, configs, config variations, config targeting, monitoring, and online evaluations.

The Hierarchy of Levers: Where to Focus Your Optimization Efforts

In practice, AI experimentation levers should be optimized in order of impact and reversibility:

System message
Examples
Output format
Context
Retries and fallbacks
Models and parameters

This order matters because many high-impact changes can be made without retraining or rebuilding your system. With AgentControl config variations, teams can version and compare these changes while controlling exposure through targeting.

System Message Variations

The system message is one of the most powerful levers in shaping an AI model’s behavior. It defines the model’s role, tone, and boundaries, setting the personality and guardrails for how it responds.

Small changes here can dramatically affect safety and reliability. Tightening tone or adding an out-of-scope clause can prevent speculative or unsafe content. However, overly rigid instructions can make responses sound robotic or unhelpful.

Experiment with several system-message variations and test how they perform across normal, edge, and adversarial scenarios. The goal is not only to find one prompt that works, but to understand how tone and framing influence quality, cost, safety, and latency. Store and compare these variants with AgentControl config variations and monitor results with config performance monitoring.

Choosing the Right Number of Examples

Compare zero-shot, one-shot, and few-shot prompts, where few-shot typically means 3-5 examples. Mix common cases and edge cases, include “do” and “don’t” examples, and show the exact output format. Short examples teach patterns, but they also add tokens and latency. Measure accuracy, format adherence, generalization, latency, and cost with autogenerated AI metrics.

Output Format

Choose between free text, structured templates, or native structured outputs. Structured outputs are easier to parse and validate but can constrain creativity or break on truncation. Always validate responses, handle partial outputs gracefully, and keep templates simple. During testing, a temporary explain field can help diagnose why one variation performs better than another.
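As a minimal sketch of the validation step, the snippet below parses a JSON response and reports why it was rejected instead of raising. The schema (`answer`, `confidence`) and the helper name `parse_model_output` are hypothetical and chosen only for illustration.

```python
import json

# Hypothetical schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {"answer": str, "confidence": (int, float)}


def parse_model_output(raw):
    """Return {'ok': True, 'data': ...} or {'ok': False, 'error': ...}."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Truncated or malformed JSON lands here instead of crashing the caller.
        return {"ok": False, "error": f"invalid JSON: {exc.msg}"}

    if not isinstance(data, dict):
        return {"ok": False, "error": "expected a JSON object"}

    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            return {"ok": False, "error": f"missing field: {field}"}
        if not isinstance(data[field], expected_type):
            return {"ok": False, "error": f"wrong type for field: {field}"}

    return {"ok": True, "data": data}


complete = parse_model_output('{"answer": "42", "confidence": 0.9}')
truncated = parse_model_output('{"answer": "42", "confid')  # cut off mid-response
print(complete["ok"], truncated["ok"])
```

Counting the `error` reasons per variation is one way to turn format adherence into a metric you can compare.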

Context Window Size

Your experiment should test the cost-benefit tradeoff between precise context and extended context. Increasing context often increases cost and latency without improving output quality. Use AgentControl monitoring to compare variation-level latency and token usage before promoting a longer-context variant.

Retries With Backoff

Allow one or two retries for transient errors such as rate limits, timeouts, or server overload. Add exponential backoff and jitter. Log error rates, latency, and cost. Ensure idempotency, cap retries, enforce timeouts, and offer a polite fallback when limits are hit. For production rollout, pair retry changes with guarded rollouts so latency and error regressions can halt expansion.
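A capped retry loop with exponential backoff and jitter could look roughly like the sketch below. `TransientError` and `call_with_retries` are hypothetical names. In a real integration you would catch your provider's rate-limit and timeout exceptions instead.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures (rate limit, timeout, overload)."""


def call_with_retries(fn, max_retries=2, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call fn(); on TransientError, retry up to max_retries times with backoff."""
    attempt = 0
    while True:
        try:
            return fn()
        except TransientError:
            if attempt >= max_retries:
                raise  # cap reached: let the caller serve a fallback
            # Exponential backoff with full jitter: wait a random time in [0, cap].
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, cap))
            attempt += 1


calls = {"count": 0}


def flaky():
    """Simulated call that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("rate limited")
    return "ok"


# The demo passes a no-op sleep so it finishes instantly.
result = call_with_retries(flaky, sleep=lambda seconds: None)
print(result, calls["count"])
```

Jitter spreads retries out so that many clients hitting the same rate limit do not all retry at the same moment.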

Fallback Chain

Route to a backup model or provider in the event of failures or slowness. Keep prompts and formats aligned so the backup model understands the same prompt structure and returns responses in the same format. Preserve conversation state, verify required features on the fallback, and log reasons for routing. LaunchDarkly config targeting can help route different cohorts to different model or provider variations.
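The routing idea can be sketched as an ordered list of providers that share one call signature. The function names here are hypothetical, and the two providers are stubs rather than real SDK calls.

```python
def complete_with_fallback(prompt, providers):
    """Try each (name, call) pair in order; return the first success plus a routing log."""
    routing_log = []
    for name, call in providers:
        try:
            text = call(prompt)
        except Exception as exc:  # in production, catch provider-specific errors only
            routing_log.append({"provider": name, "outcome": "failed", "reason": str(exc)})
            continue
        routing_log.append({"provider": name, "outcome": "served"})
        return {"provider": name, "text": text, "routing_log": routing_log}
    raise RuntimeError(f"all providers failed: {routing_log}")


def primary(prompt):
    raise TimeoutError("primary timed out")  # simulated outage


def backup(prompt):
    return f"[backup] {prompt}"


result = complete_with_fallback(
    "Summarize the release notes.",
    [("primary", primary), ("backup", backup)],
)
print(result["provider"], result["routing_log"])
```

Because every provider receives the same prompt and returns plain text, the caller does not need to know which one answered, and the routing log records why traffic moved.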

The Expansion Rule

Experimentation should scale based on evidence, not enthusiasm. Once your pilot shows strong performance, expand the rollout to broader audiences. Scale only when metrics justify it: success rates are high, failure rates are low, safety checks pass, and time or cost remains acceptable. Use percentage rollouts, progressive rollouts, or guarded rollouts to expand with controlled risk.
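One way to make the expansion rule explicit is to encode it as a check that runs before each stage. The thresholds and the names `StageMetrics` and `expansion_blockers` below are illustrative assumptions, not recommended values or a LaunchDarkly API.

```python
from dataclasses import dataclass


@dataclass
class StageMetrics:
    success_rate: float
    failure_rate: float
    safety_violations: int
    p95_latency_ms: float
    cost_per_success_usd: float


# Hypothetical limits, agreed on before the rollout starts.
LIMITS = StageMetrics(
    success_rate=0.90,
    failure_rate=0.02,
    safety_violations=0,
    p95_latency_ms=2500.0,
    cost_per_success_usd=0.05,
)


def expansion_blockers(observed, limits=LIMITS):
    """Return the names of metrics that block expansion; an empty list means go."""
    blockers = []
    if observed.success_rate < limits.success_rate:
        blockers.append("success_rate")
    if observed.failure_rate > limits.failure_rate:
        blockers.append("failure_rate")
    if observed.safety_violations > limits.safety_violations:
        blockers.append("safety_violations")
    if observed.p95_latency_ms > limits.p95_latency_ms:
        blockers.append("p95_latency_ms")
    if observed.cost_per_success_usd > limits.cost_per_success_usd:
        blockers.append("cost_per_success_usd")
    return blockers


pilot = StageMetrics(0.94, 0.01, 0, 1800.0, 0.03)
regressed = StageMetrics(0.94, 0.01, 1, 3000.0, 0.03)
print(expansion_blockers(pilot))
print(expansion_blockers(regressed))
```

Returning the list of blockers, rather than a bare yes or no, gives whoever reviews the rollout the reason it stopped.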

Models and Parameters

Models and parameters are the tuning panel for an AI system: the set of dials you use when you want more accuracy, fewer hallucinations, faster responses, or lower cost.

Start with the right model for the job. Use a more capable model for complex reasoning or planning and a smaller, faster model for routine tasks. Match the model’s strength to the complexity and stakes of the task rather than defaulting to the largest model. Lock down the exact model version when possible so results stay reproducible as the model evolves. Version pinning reduces variability, but it does not eliminate drift. Upstream model behavior and real-world inputs can still change, so production experiments and ongoing holdbacks remain necessary.

AgentControl lets teams manage model selection, prompt content, provider configuration, and generation parameters with configs, variations, and AI model configurations.

Temperature

Temperature controls how adventurous or conservative a model’s output is. It is the primary generation setting most users adjust.

Keep it low, around 0-0.3, for code, structured formats, or safety-critical tasks.
Use higher values, around 0.7-1.0, for creativity or brainstorming.
Stay in the middle for everyday conversations.

Other sampling parameters, such as top_p or top_k, also influence output diversity, but temperature usually has the largest and most predictable effect, so it is often the first parameter worth tuning.
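To see why lower temperatures behave more conservatively, it helps to look at the standard definition: raw scores (logits) are divided by the temperature before the softmax. The sketch below uses made-up logits for three candidate tokens. Providers may differ in implementation details, and a temperature of exactly 0 is typically treated as greedy decoding rather than computed with this formula.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; temperature must be > 0 in this sketch."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
low = softmax_with_temperature(logits, 0.2)
high = softmax_with_temperature(logits, 1.0)
print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

At temperature 0.2 the top token takes more than 99% of the probability mass, while at 1.0 it takes roughly 63%, which leaves room for the alternatives to be sampled.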

Retrieval and Search

Do not rely only on keywords because meaning matters. Semantic search matches queries to documents by meaning rather than exact wording, which helps capture intent. Hybrid search, combining semantic and keyword search, often works best for short queries or exact names. Choose an embedding model that fits your language and domain, and keep its version fixed.

A graph database models relationships and traversals, such as “how is X connected to Y?” A vector database or vector-enabled datastore is optimized for similarity search over embeddings to support retrieval in RAG pipelines. When testing retrieval changes, use online evaluations and AgentControl experiments to compare quality, latency, and cost.

Chunking and Metadata

Split documents into natural sections with slight overlaps. Sliding windows help for long text. Add metadata to improve filtering and relevance. When experimenting, start with a baseline and change one variable at a time: temperature, chunk size, number of retrieved chunks (retrieval top-k), reranking, or search type. Evaluate offline using a labeled dataset from your domain, then use controlled rollout strategies such as percentage rollouts or guarded rollouts before broad exposure.
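A sliding-window splitter with overlap can be sketched in a few lines. This version counts words rather than tokens and ignores section boundaries, which is a simplification. `chunk_words` and the `start_word` metadata field are hypothetical names.

```python
def chunk_words(text, chunk_size=200, overlap=20):
    """Split text into word-based chunks; consecutive chunks share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        # Keep the offset as metadata so a chunk can be traced back to its source.
        chunks.append({"start_word": start, "text": " ".join(window)})
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(sample, chunk_size=4, overlap=1):
    print(chunk)
```

Treating `chunk_size` and `overlap` as experiment variables, and changing only one of them per run, keeps retrieval results attributable to a single change.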

Tool and Function Management

Tools are the hands and eyes of your AI. They turn abstract intelligence into real-world action. However, giving an AI system too many tools at once can create reliability, safety, and cost problems. A focused, well-defined toolset keeps the system efficient and predictable.

When experimenting with tools, start small. Give the AI only the tools it truly needs, then expand based on evidence. Simulate tool behavior with mock or historical data before allowing live writes or sensitive operations. Monitor error rates, latency, and cost. Use circuit breakers, fallback paths, and kill switches to keep the system stable when a tool fails.
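A circuit breaker is one of the simpler stability mechanisms to prototype. The sketch below stops calling a tool after repeated failures and returns a fallback until a cooldown passes. The class, its defaults, and the injected clock are illustrative assumptions, and the demo uses a fake clock so it runs instantly.

```python
import time


class CircuitBreaker:
    """Skip a failing tool for `cooldown` seconds after `max_failures` errors in a row."""

    def __init__(self, max_failures=3, cooldown=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, tool, *args, fallback=None):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                return fallback  # circuit open: do not touch the tool
            # Cooldown over: allow one trial call; a single failure reopens the circuit.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = tool(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback
        self.failures = 0
        return result


now = [0.0]  # fake clock for the demo
breaker = CircuitBreaker(max_failures=2, cooldown=10.0, clock=lambda: now[0])
calls = []


def broken_tool(query):
    calls.append(query)
    raise ConnectionError("tool unavailable")


results = [breaker.call(broken_tool, "q", fallback="unavailable") for _ in range(4)]
print(results, len(calls))
```

After the second failure the breaker opens, so the last two requests return the fallback without calling the tool at all.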

LaunchDarkly AgentControl tools, agents, feature flags, and release controls can help teams expose new tool behavior gradually and roll back unsafe changes quickly.

Cost and Latency

Managing cost and latency in AI systems is like tuning a race car: you want speed and performance, but you cannot afford to burn all your fuel in one lap. The trick is knowing where your money and time actually go: input tokens, output tokens, model rates, tool usage, retries, retrieval, and post-processing.

Experiment design also affects cost. Multi-armed bandit approaches can reduce spend by shifting traffic away from losing variants early, while long, fixed-horizon A/B tests can waste budget after a clear loser emerges. Track cost per successful answer rather than cost per call so you know which variants are efficient and useful.
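The difference between cost per call and cost per successful answer is easy to see with a small calculation. The figures below are made up, and `cost_per_success` is a hypothetical helper.

```python
def cost_per_success(records):
    """records: list of dicts with 'cost_usd' (float) and 'success' (bool)."""
    total_cost = sum(r["cost_usd"] for r in records)
    successes = sum(1 for r in records if r["success"])
    if successes == 0:
        return float("inf")  # money was spent but nothing useful came back
    return total_cost / successes


# Variant A is cheaper per call but succeeds only half the time.
variant_a = [{"cost_usd": 0.010, "success": s} for s in (True, False, True, False)]
# Variant B costs more per call but succeeds every time.
variant_b = [{"cost_usd": 0.015, "success": True} for _ in range(4)]

print(cost_per_success(variant_a), cost_per_success(variant_b))
```

Here variant A costs about $0.020 per successful answer and variant B about $0.015, so the variant with the higher per-call price is the more efficient one.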

Several habits help:

Match the model to the job: Use smaller models for routine tasks and larger models for complex reasoning.
Set clear budgets: Cap tokens, cost, and retries per session.
Cache and reuse: Avoid paying twice for the same retrieval or generated output.
Retry wisely: Validate inputs early and use exponential backoff to avoid waste.
Measure what matters: Track cost per successful answer, not just cost per request.
Watch latency signals: Monitor time to first token, p95/p99 latency, and error rates.

LaunchDarkly Monitoring and autogenerated AgentControl metrics help teams compare token usage, duration, and variation-level performance.

Experimentation Before User Exposure

Before any major AI update reaches real users, it deserves a proper dress rehearsal. Catching issues early prevents bad experiences, unnecessary costs, and reputational damage.

Start by building a test set that mirrors real-world scenarios: genuine examples, synthetic edge cases, and adversarial prompts. If you are working with RAG, make sure answers link back to sources so you can evaluate grounding and citation quality. Use an AI judge or evaluation rubric to score correctness, completeness, clarity, safety, and faithfulness. LaunchDarkly datasets, judges, offline evaluations, and online evaluations support this progression from offline testing to production measurement.

Best practices include:

Set clear thresholds: Define what “good enough” means before the test begins.
Shadow test safely: Run the new model alongside the current one on real traffic while hiding results from users.
Control costs: Sample requests, cache results, and limit verbosity.
Protect fairness and privacy: Compare variants across quality, reliability, cost, and speed while respecting data boundaries.

Once the new model shows stable performance (no quality drops, latency spikes, or cost overruns), move to a canary rollout with rollback ready. Use percentage rollouts for fixed exposure, progressive rollouts for scheduled expansion, and guarded rollouts when you want metric-based monitoring and rollback.

Controlled Testing With Real Users

Testing with real users is where theory meets reality. The goal is to gather insight while keeping risk low and user experience intact.

A practical way to do this is A/B testing. By assigning users to consistent cohorts, you can compare different versions of your AI system under real conditions. This supports statistical decision-making, such as confidence intervals and significance testing, rather than anecdotal wins.
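For a binary outcome such as task completion, one common way to quantify the difference between two cohorts is a two-proportion z-test with a confidence interval. The sketch below uses the normal approximation and made-up counts. It illustrates the statistics only and is not how LaunchDarkly Experimentation computes results.

```python
import math


def two_proportion_test(success_a, n_a, success_b, n_b, z_crit=1.96):
    """Compare completion rates of control (a) and treatment (b).

    Returns the difference (b - a), a confidence interval for it (95% by default),
    the z statistic, and a two-sided p-value.
    """
    p_a, p_b = success_a / n_a, success_b / n_b
    diff = p_b - p_a

    # Unpooled standard error for the confidence interval.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    ci = (diff - z_crit * se, diff + z_crit * se)

    # Pooled standard error for the test of "no difference".
    pooled = (success_a + success_b) / (n_a + n_b)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal

    return {"diff": diff, "ci": ci, "z": z, "p_value": p_value}


# Made-up counts: 400 of 1000 control tasks completed, 450 of 1000 treatment tasks.
result = two_proportion_test(400, 1000, 450, 1000)
print(result)
```

With these counts the lift is 5 percentage points, the 95% interval is roughly 0.7 to 9.3 points, and the p-value is about 0.024. Whether that clears your bar depends on the decision rule and minimum detectable effect you fixed before launch.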

To make tests meaningful:

Keep traffic splits representative: Cover different user segments, regions, and use cases with targeting rules.
Tag everything: Include version, prompt, model, parameters, and settings in every request so outcomes are traceable.
Measure real impact: Track satisfaction, edits, retries, task completion, conversion, revenue lift, latency, and cost with LaunchDarkly metrics.

When rolling out updates, start with an internal beta, then gradually expand to 1%, 5%, 10%, and beyond. Watch quality, latency, safety, and failure rates closely. If something goes wrong, roll back and investigate. Creating guarded rollouts gives teams a structured way to tie rollout expansion to live metrics.
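To illustrate why staged percentages work with consistent cohorts, the sketch below hashes each user into a stable bucket, so anyone exposed at 1% stays exposed at 5% and 10%. This is a conceptual example with hypothetical function names, not LaunchDarkly's bucketing algorithm. In practice the platform handles the assignment for you.

```python
import hashlib


def bucket(user_key, experiment_key):
    """Map a user to a stable value in [0, 1) for a given experiment."""
    digest = hashlib.sha256(f"{experiment_key}:{user_key}".encode("utf-8")).hexdigest()
    # 13 hex digits = 52 bits, which a float represents exactly.
    return int(digest[:13], 16) / 16 ** 13


def in_rollout(user_key, experiment_key, percentage):
    """True when the user falls inside the first `percentage` percent of buckets."""
    return bucket(user_key, experiment_key) < percentage / 100.0


for pct in (1, 5, 10):
    exposed = sum(in_rollout(f"user-{i}", "model-upgrade", pct) for i in range(10_000))
    print(pct, exposed)
```

Including the experiment key in the hash keeps assignments independent across experiments, so the same users are not always the first to see every change.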

Not all AI experiments have a fixed end date. Many teams run ongoing control groups, holdbacks, or adaptive allocation strategies that monitor performance as models, data, and user behavior change. Even then, explicit guardrails and rollback thresholds are essential so optimization never sacrifices safety, latency, or cost.

Evaluation

Evaluation is not just checking whether the model runs. It is understanding how well the system serves users, how reliable it is under real conditions, and whether it delivers value within operational limits. A strong evaluation framework balances quality, cost, safety, and performance.

Layer evaluations in stages: offline tests, shadow testing, limited rollout, and broader production experiments. Set clear targets for quality, reliability, and cost. Instrument everything so you can explain wins and diagnose regressions. Expand only when metrics hold steady.

Quality and Accuracy

Start with the basics: Does the model tell the truth? Validate answers against known ground truth using offline tests and side-by-side reviews. AI judges provide scalable signals, but they should be calibrated against human review and used primarily for relative comparison between variants rather than absolute truth. LaunchDarkly judges and online evaluations help automate this scoring.

User Experience

Even a technically accurate model fails if it frustrates users. Focus on fast, helpful first responses and fewer handoffs to humans. Measure satisfaction, task completion, rewrite rates, time to first token, and time to useful answer.

Reliability

Reliability means the system and its tools behave predictably. Check that outputs match expected formats and that retries or timeouts are rare. Track error rates, schema validity, and success ratios. Define service-level objectives and trigger rollback if failures exceed limits. LaunchDarkly guarded rollouts can connect metric regressions to automated release decisions.

Cost and Speed

Every token, retrieval, and retry has a price. Break down latency and cost by stage to identify where resources are spent. Use smaller models or cached responses for routine tasks, stream responses where appropriate, and tighten prompts to reduce waste.

Observability

You cannot improve what you cannot see. Log prompts, parameters, model versions, config versions, and tool calls while masking personal data. Feed this data into dashboards that track cost, speed, quality, and safety. Use AgentControl monitoring, metrics, and observability integrations to detect drift and regressions.
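A structured log record with masking might look like the sketch below. It masks only email addresses, which is far from complete PII handling, and the field names, helper names, and model identifier are placeholders.

```python
import json
import re

# Deliberately simple pattern; real PII masking needs broader coverage than emails.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_emails(text):
    return EMAIL_PATTERN.sub("[EMAIL]", text)


def build_log_record(prompt, model_version, config_version, parameters, tool_calls):
    """Return one JSON line describing a request, with emails masked in the prompt."""
    return json.dumps(
        {
            "prompt": mask_emails(prompt),
            "model_version": model_version,
            "config_version": config_version,
            "parameters": parameters,
            "tool_calls": tool_calls,
        },
        sort_keys=True,
    )


line = build_log_record(
    prompt="Email alice@example.com the Q3 summary.",
    model_version="example-model-2024-01",  # placeholder, not a real model ID
    config_version=7,
    parameters={"temperature": 0.2},
    tool_calls=["send_email"],
)
print(line)
```

Emitting one JSON object per request keeps the records easy to slice by model version or config version in a dashboard.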

Evaluating Retrieval Quality

Great answers depend on great context. Assess retrievers, rerankers, and generators separately and together:

Recall@k shows whether the right documents appear.
Precision@k shows whether retrieved documents are relevant.
nDCG and MRR show how well relevant documents are ranked.
Attributable accuracy connects correct answers to supporting evidence.
Unsupported claim rate flags hallucinations.
Citation correctness, freshness, cost, and latency show whether retrieval adds value.
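The ranking metrics above are straightforward to compute once you have labeled relevant documents per query. The sketch below uses binary relevance (a document is either relevant or not) and made-up document IDs. MRR is the mean of the reciprocal rank over all queries in the test set.

```python
import math


def recall_at_k(retrieved, relevant, k):
    """Share of relevant documents that appear in the top k."""
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant) if relevant else 0.0


def precision_at_k(retrieved, relevant, k):
    """Share of the top k that is relevant."""
    hits = len(set(retrieved[:k]) & relevant)
    return hits / k


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document, or 0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance nDCG: each relevant document has gain 1."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


retrieved = ["d3", "d1", "d7", "d2"]  # ranked results for one query
relevant = {"d1", "d2"}               # labeled ground truth

print(recall_at_k(retrieved, relevant, 3))
print(precision_at_k(retrieved, relevant, 3))
print(reciprocal_rank(retrieved, relevant))
print(ndcg_at_k(retrieved, relevant, 3))
```

In this example only one of the two relevant documents makes the top 3, and it sits at rank 2, so recall@3 is 0.5 and the reciprocal rank is 0.5.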

Use offline QA sets with labeled passages and slice results by topic, query type, and language. Add confidence gating so the system can admit uncertainty instead of fabricating answers.

Observability for Retrieval

Instrument retrieval just like generation. Log query details, index versions, retrieved document IDs, ranking scores, and latency. Use dashboards to visualize recall, accuracy, and latency percentiles. Before rolling out a new index, use canary or shadow testing and control exposure with AgentControl targeting.

Governance and Safety in AI Experimentation

Governance and safety keep AI experimentation trustworthy. The goal is to find measurable improvement while protecting users, respecting constraints, and keeping experiments reproducible.

Security and Access Control

Before any experiment touches real data or users, define who can change what and how. Limit who can modify prompts, deploy models, access production logs, or adjust rollout rules. Use separate environments for development, staging, and production. LaunchDarkly supports these practices through role-based access control, approvals, audit logs, and environments.

Safety Guardrails

Set hard limits that experiments cannot violate. Use content filters, rate limits, token budgets, circuit breakers, and quality thresholds. Define rollback conditions for error rates, latency spikes, toxicity, unsupported claims, or cost overruns. Release policies and guarded rollouts help standardize these controls.

Reproducibility and Compliance

Strong governance means being able to prove what happened. Fix random seeds or sampling settings where supported. Version dataset snapshots, model IDs, prompt templates, guardrails, and configuration files. Store experiment plans, analysis rules, inputs, outputs, parameters, and results. LaunchDarkly config management and config version comparison help preserve reproducibility.

Rollback and Kill Switches

No matter how careful you are, things can go wrong. Keep the previous version ready. Test rollback procedures regularly. Use kill switches that can immediately halt an experiment if safety or quality issues emerge. LaunchDarkly feature flags, guarded rollouts, and guarded rollout management support fast mitigation.

Ongoing Monitoring

Governance does not stop at launch. Continue tracking model performance, user behavior, data distributions, latency, cost, and safety. Periodically rerun safety and quality checks as the system evolves. Maintain a documented process for investigating failures, notifying stakeholders, and implementing fixes.

How LaunchDarkly Helps With AI Experimentation

Where many AI tools stop at evaluation, LaunchDarkly helps enable production experimentation with traffic allocation, metrics, statistical comparison, and controlled release workflows. AI experimentation needs an operational layer that manages prompts, models, parameters, cohorts, traffic allocation, and rollouts safely. Building that layer yourself can quickly become complex.

LaunchDarkly provides:

Instant updates without deployments: Change prompts, swap models, or adjust parameters through AgentControl without redeploying application code.
Safe, gradual rollouts: Test new models on a small percentage of users with percentage rollouts, progressive rollouts, and guarded rollouts.
Centralized control with governance: Use approvals, audit logs, and role-based access control.
Built-in experimentation framework: Run experiments and AgentControl experiments comparing models, prompts, or parameters.
Separation of concerns: Developers can focus on building features while cross-functional teams safely participate in experimentation workflows through controlled configuration changes.

LaunchDarkly feature flags and AgentControl let you treat AI components as dynamic configurations rather than static code, giving you the speed and safety needed for continuous experimentation at scale. The following example shows how to switch between two model configurations with AgentControl.

Example: Switching Between AI Model Variations With AgentControl

In the LaunchDarkly dashboard, open AI, select AgentControl, create a config for the AI workflow, and define variations for each model you want to compare. For implementation details, start with the AgentControl quickstart, then review Create configs, Create and manage config variations, Config targeting, and the Python AI SDK reference.

After setting up config variations, use targeting to control which model variation is served and define a safe default.

Note: This example is simplified for illustration. Production implementations should externalize secrets, define explicit fallbacks, enforce timeouts, and include error handling and guardrails.

Install Dependencies

# Install required dependencies.
# In a notebook, you can run these commands with a leading !.
# In a terminal, run them without the leading !.

!pip install launchdarkly-server-sdk
!pip install launchdarkly-server-sdk-ai
!pip install openai

Import Dependencies

import os

import ldclient
from ldclient import Context
from ldclient.config import Config
from ldai.client import LDAIClient, AICompletionConfigDefault
from openai import OpenAI

Set Up Clients

ld_sdk_key = os.getenv("LAUNCHDARKLY_SDK_KEY")
openai_api_key = os.getenv("OPENAI_API_KEY")

if not ld_sdk_key:
    raise RuntimeError("Missing LAUNCHDARKLY_SDK_KEY")

if not openai_api_key:
    raise RuntimeError("Missing OPENAI_API_KEY")

ldclient.set_config(Config(ld_sdk_key))

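ld_client = ldclient.get()

# Fail fast if the SDK could not initialize with the provided key.
if not ld_client.is_initialized():
    raise RuntimeError("LaunchDarkly SDK failed to initialize")

ai_client = LDAIClient(ld_client)
openai_client = OpenAI(api_key=openai_api_key)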

Create Evaluation Contexts

# Context 1: control group user.
context_user_a = (
    Context.builder("user-alpha-001")
    .kind("user")
    .set("firstName", "Alice")
    .set("lastName", "Anderson")
    .set("email", "alice@example.com")
    .set("experimentGroup", "control")
    .build()
)

# Context 2: treatment group user.
context_user_b = (
    Context.builder("user-beta-002")
    .kind("user")
    .set("firstName", "Bob")
    .set("lastName", "Baker")
    .set("email", "bob@example.com")
    .set("experimentGroup", "treatment")
    .build()
)

Run the Same Query Against Two Config Variations

fallback_value = AICompletionConfigDefault(enabled=False)
user_query = "Write a detailed essay on NASA"


def run_configured_completion(context, label):
    config, tracker = ai_client.completion_config(
        "ai-experimentation",
        context,
        fallback_value,
        {"user_query": user_query}
    )

    if not config.enabled:
        raise RuntimeError(f"AI config is disabled for {label}")

    messages = [message.to_dict() for message in config.messages]
    messages.append({"role": "user", "content": user_query})

    completion = tracker.track_openai_metrics(
        lambda: openai_client.chat.completions.create(
            model=config.model.name,
            messages=messages,
            temperature=config.model.parameters.get("temperature", 0.7),
            max_tokens=config.model.parameters.get("maxTokens", 800)
        )
    )

    print(f"{label}")
    print("-" * 50)
    print(f"Model: {config.model.name}")
    print(f"Response:\n{completion.choices[0].message.content}")
    print()

    return config.model.name, completion.choices[0].message.content


model_a, response_a = run_configured_completion(context_user_a, "User A: Control")
model_b, response_b = run_configured_completion(context_user_b, "User B: Treatment")

print("=" * 50)
print("Comparison")
print("=" * 50)
print(f"User A got: {model_a}")
print(f"User B got: {model_b}")

# Deliver any buffered metric events before the script exits.
ld_client.flush()
ld_client.close()

Using the code above, one user can receive a baseline model variation while another receives an experimental model variation, without requiring a redeploy. This makes it easier to compare quality, latency, and cost under controlled conditions. The AI SDK can also report metrics to LaunchDarkly, which you can review in AgentControl monitoring and use in AgentControl experiments.

Final Thoughts

Experimentation should be part of everyday AI work: a habit, not a one-off project. Keep iterating, version your data, and let real numbers guide decisions instead of hunches. Treat every AI change like a hypothesis. Every hypothesis should map to a traffic allocation strategy, decision rule, and rollback condition.

Change one thing at a time. Start offline, move to shadow testing, then gradually expand through controlled rollouts while tracking quality, cost, latency, safety, and user outcomes. LaunchDarkly AgentControl, config variations, online evaluations, experiments, and guarded rollouts make this process practical by keeping prompts, models, parameters, metrics, and release controls versioned, targetable, measurable, and reversible.