DEV Community: Roman Belov

Human-in-the-Loop for AI Products: When the Model Decides and When a Person Does

Roman Belov — Mon, 20 Jul 2026 02:43:06 +0000

Fully autonomous AI systems in production make critical mistakes. Adding human-in-the-loop reduces them by an order of magnitude, but increases latency and processing cost.

The question is not whether a human is needed in the loop. The question is exactly where to place them so they catch dangerous decisions without turning the AI system into an expensive interface for manual work.

When autonomy works and when it kills the product

Two extremes. Full autonomy: the model decides everything, humans see the result after the fact. Full control: every model response is manually reviewed before reaching the user. The first scales, but destroys trust when errors occur. The second is reliable, but kills the economics.

Real AI products live between these poles. The architect's job: determine which decisions the model makes autonomously and which require human confirmation.

Three factors define the boundary:

Cost of error. A chatbot recommends a restaurant with the wrong rating -- annoying. An AI system approves a $50,000 loan based on hallucinated analysis -- a lawsuit. The higher the cost of an error, the lower the escalation threshold.

Reversibility. A sent push notification cannot be unsent. A drafted email is easy to correct before sending. Irreversible actions require confirmation.

Model confidence. A model with confidence 0.95 makes fewer mistakes than one with 0.6. But even 0.95 means one error in every twenty decisions. The threshold depends on the task.

Confidence Threshold: decision-making framework

A confidence threshold is a numeric cutoff below which the model's decision is sent to a human for review. Sounds simple. In practice it requires calibration per task type.

Base formula

IF confidence >= threshold AND risk_level == "low"
  -> autonomous decision
IF confidence >= threshold AND risk_level == "high"
  -> autonomous decision + async audit
IF confidence < threshold
  -> escalate to human

Thresholds are not fixed. For support ticket classification, 0.7 is sufficient. For a medical triage system, you need 0.95+. For financial decisions -- 0.99.

Threshold calibration

An uncalibrated threshold is worse than no threshold at all. A model that outputs confidence 0.9 on every response (including wrong ones) is useless for HITL.

Calibration process:

Collect 500-1000 model decisions with confidence scores
Label each decision correct/incorrect (manual labeling or LLM-as-Judge)
Plot the curve: at each threshold, what percentage of decisions go to review, and what percentage of errors slip through
Choose the threshold based on business requirements: acceptable error pass-through rate vs. reviewer workload

from sklearn.metrics import precision_recall_curve
import numpy as np

def find_optimal_threshold(
    confidences: list[float],
    is_correct: list[bool],
    max_error_rate: float = 0.02,  # max 2% errors in autonomous decisions
) -> dict:
    """Finds the threshold at which errors in the autonomous zone <= max_error_rate."""
    confidences = np.array(confidences)
    is_correct = np.array(is_correct)

    thresholds = np.arange(0.5, 1.0, 0.01)
    results = []

    for t in thresholds:
        autonomous = confidences >= t
        escalated = confidences < t

        if autonomous.sum() == 0:
            continue

        error_rate = 1 - is_correct[autonomous].mean()
        escalation_rate = escalated.sum() / len(confidences)

        results.append({
            "threshold": round(t, 2),
            "error_rate": round(error_rate, 4),
            "escalation_rate": round(escalation_rate, 4),
            "autonomous_volume": int(autonomous.sum()),
        })

    # Select the lowest threshold at which error_rate <= max_error_rate
    valid = [r for r in results if r["error_rate"] <= max_error_rate]
    if not valid:
        return results[-1]  # strictest threshold

    return min(valid, key=lambda r: r["threshold"])

Typical calibration results:

Threshold	Errors in autonomous	Escalation	Autonomous volume
0.60	8.2%	12%	8800 of 10000
0.75	3.1%	28%	7200 of 10000
0.85	1.4%	41%	5900 of 10000
0.92	0.3%	58%	4200 of 10000

At threshold 0.85, the system handles 59% of requests autonomously with a 1.4% error rate. The remaining 41% go to review. For most B2B products, this is an acceptable balance.

Risk Matrix: the second dimension of filtering

A confidence threshold works on one axis. A risk matrix adds a second: the type and consequences of the decision. High confidence at high risk still warrants oversight.

                    Low Risk          High Risk
                +-----------------+-----------------+
High Confidence |  AUTO           |  AUTO + AUDIT   |
                |  Full           |  Autonomous,    |
                |  autonomy       |  but logged for |
                |                 |  spot checks    |
                |                 |                 |
                +-----------------+-----------------+
Low Confidence  |  QUEUE          |  ESCALATE       |
                |  Batch review   |  Immediate      |
                |  queue          |  escalation to  |
                |                 |  human          |
                |                 |                 |
                +-----------------+-----------------+

Four zones, four behaviors:

AUTO. Low risk, high confidence. The model decides; result is sent immediately. Examples: support ticket classification, content tag generation, search query autocomplete.

AUTO + AUDIT. High risk, high confidence. The model decides autonomously, but every decision is written to an audit log. A reviewer checks a sample (10-20%) after the fact. Examples: content moderation, automated pricing, recommendation personalization.

QUEUE. Low risk, low confidence. Decision goes to a batch review queue. Not urgent, but the model cannot handle it alone. Examples: categorizing non-standard requests, generating descriptions for edge-case products.

ESCALATE. High risk, low confidence. Immediate escalation to a human. Blocks the process until a decision is received. Examples: suspected fraud, medical recommendations with ambiguous symptoms, account deletion decisions.

Determining risk level

Risk is determined statically (by task type) or dynamically (by request content).

Static classification:

RISK_LEVELS = {
    "ticket_classification": "low",
    "content_moderation": "high",
    "price_adjustment": "high",
    "search_autocomplete": "low",
    "fraud_detection": "high",
    "tag_generation": "low",
    "medical_triage": "critical",
    "account_deletion": "critical",
}

Dynamic classification uses a second model or rule set. A $5 return request -- low risk. A $50,000 return request -- high risk. Same task, different level of oversight.

def assess_risk(task_type: str, context: dict) -> str:
    base_risk = RISK_LEVELS.get(task_type, "medium")

    # Dynamic modifiers
    if context.get("amount", 0) > 10_000:
        return "critical"
    if context.get("is_new_user", False):
        return max_risk(base_risk, "high")
    if context.get("affects_multiple_users", False):
        return "critical"

    return base_risk

HITL patterns in production

Five patterns. Each solves a specific architectural problem.

Pattern 1: Pre-action Gate

The model generates a decision but does not execute the action. A human approves or rejects.

Request -> LLM generates decision -> Gate -> [Human approves] -> Action
                                    |
                            [Human rejects] -> Feedback -> Model retries

Use case: any irreversible action. Sending email on behalf of a user, publishing content, financial transactions.

Implementation: the model returns structured JSON with the proposed action. The system writes it to a review queue. The reviewer clicks Approve/Reject. On Approve, the system executes the action. On Reject, feedback is returned to the model for regeneration.

class PreActionGate:
    def __init__(self, queue: ReviewQueue, executor: ActionExecutor):
        self.queue = queue
        self.executor = executor

    async def process(self, llm_decision: dict, context: dict) -> str:
        risk = assess_risk(llm_decision["action_type"], context)
        confidence = llm_decision["confidence"]

        if confidence >= THRESHOLDS[risk] and risk != "critical":
            # Autonomous execution
            return await self.executor.execute(llm_decision)

        # Escalation
        review_id = await self.queue.enqueue(
            decision=llm_decision,
            context=context,
            priority="urgent" if risk == "critical" else "normal",
        )
        return f"Awaiting review: {review_id}"

Pattern 2: Post-action Audit

The model acts autonomously. A human reviews a sample of decisions after the fact. Erroneous decisions are rolled back or compensated.

Request -> LLM -> Action -> Log to audit
                              |
                    Reviewer checks sample
                              |
                    Error -> Rollback + model correction

Use case: high-volume decisions with low per-error cost. Comment moderation, ticket classification, metadata generation.

Sample size depends on model maturity. Initially -- 20-30% of decisions. After stabilization -- 5-10%. If degradation is detected -- back to 20-30%.

Pattern 3: Confidence-based Routing

Requests are distributed between the model and humans based on confidence score. The model handles simple cases; humans handle complex ones.

Request -> LLM evaluates -> confidence >= 0.85 -> Autonomous response
                         -> confidence 0.6-0.85 -> LLM responds + review flag
                         -> confidence < 0.6    -> Human responds

Three zones instead of two. The middle zone: the model responds immediately (no latency impact), but the response goes into a review queue. If the reviewer finds an error, the user receives a corrected response.

This pattern works well for support bots. 60-70% of requests are routine -- the model handles them with confidence 0.9+. 20-25% are medium complexity -- model responds, reviewer checks. 5-15% are complex -- straight to a human.

Pattern 4: Cascading Validation

Multiple layers of checks. Each subsequent layer is more expensive but more accurate. A request passes through as many layers as needed to reach the required confidence.

LLM Generator -> Automated rules (regex, schema validation)
                  | (passed)
              -> LLM-as-Judge (second model evaluates)
                  | (passed)
              -> Autonomous response

              | (failed at any layer)
              -> Escalate to human

More on the second layer: LLM-as-Judge: automated quality gate.

Each layer filters out some errors. Automated rules catch obvious format violations (missing required fields, SQL injection in the response, banned content by stop-words). LLM-as-Judge catches semantic errors: hallucinations, irrelevance, tone violations. The human catches what both automated layers missed.

Economics: automated rules process 100% of requests for $0. LLM-as-Judge processes 85% (15% filtered by rules) at $0.002-$0.01 per request. Humans process 5-10% (the rest passed both layers) at $0.50-$2.00 per request.

Pattern 5: Feedback Loop with learning

Reviewer decisions feed back into the system to improve the model. Over time, the model learns from corrections and requires less review.

LLM -> Decision -> Review -> [Correct] -> +1 to training set
                           -> [Incorrect] -> Correction + training example
                                                   |
                                         Fine-tuning / prompt update
                                                   |
                                         Threshold recalibration

A feedback loop only works with structured data collection. The reviewer does not just click Reject -- they specify the reason for rejection, the correct answer, and the error category. This data builds an evaluation dataset for automated quality monitoring.

Implementing a HITL system: architecture

A minimum production architecture has five components.

+----------+     +--------------+     +--------------+
|  Client   |---->|  AI Service  |---->|  Review Queue |
+----------+     |              |     |  (Redis/SQS)  |
                 |  - LLM call  |     +------+-------+
                 |  - Confidence|            |
                 |  - Risk      |     +------v-------+
                 |  - Routing   |     |  Review UI    |
                 +--------------+     |  (Dashboard)  |
                        |             +------+-------+
                        |                    |
                 +------v-------+     +------v-------+
                 |  Action      |     |  Feedback     |
                 |  Executor    |     |  Store        |
                 +--------------+     +--------------+

AI Service calls the model, receives confidence, determines risk level, and routes the decision.

Review Queue stores decisions awaiting review. Redis for simple cases, SQS/RabbitMQ for distributed systems. Each queued decision contains: original request, model response, confidence, risk level, context, timestamp, and priority.

Review UI shows the reviewer the model's decision with context. The reviewer confirms, corrects, or rejects it. Good UI surfaces confidence, highlights uncertain zones, and suggests alternatives.

Action Executor runs the approved action. Idempotent -- a duplicate call with the same ID does not repeat the action.

Feedback Store collects reviewer decisions for training. Structured data: model decision, human decision, delta, reason for correction.

HITL system metrics

Six metrics for monitoring.

Metric	Formula	Target
Escalation Rate	escalations / all requests	15-30% (domain-dependent)
Auto-resolve Rate	autonomous / all requests	70-85%
Error Escape Rate	errors in autonomous / autonomous	< 2%
Review Latency	average review time	< 5 min (P95)
Reviewer Agreement	agreement with model decision	> 85%
Feedback Utilization	feedback -> model improvement	monthly cycle

Escalation Rate rising means the model is degrading or the threshold is too strict. Falling means the model is improving or the threshold is too lenient.

Error Escape Rate is the primary safety metric. If it rises, recalibrate the threshold or retrain the model.

Reviewer Agreement below 85% signals one of two things: the model performs poorly (reviewers frequently correct it) or reviewers are inconsistent with each other (reviewer calibration needed).

Anti-patterns

Threshold without calibration. A threshold of 0.8 set "because it seems reasonable." Without data on real errors, you cannot determine the right threshold. Start by manually reviewing 100% of decisions, accumulate data, then calibrate.

Review as a bottleneck. The review queue grows faster than reviewers can process it. Decisions wait for hours. Users get no responses. Solution: an automatic fallback -- if a decision has waited longer than the SLA, send the model's response with a note "may be revised."

Blind trust in confidence. Models hallucinate with high confidence. GPT-4 outputs 0.95 on factually wrong answers. Confidence from logprobs and confidence from the model's self-assessment are different things. Logprobs reflect certainty in token selection, not correctness of the answer. External calibration is required.

HITL at every step. A human confirms every model decision. The AI system becomes a UI for manual work. If a reviewer approves 90% of decisions without edits -- the threshold is too strict.

Feedback without action. Reviewers correct decisions, data is collected, but the model does not improve. Every month: analyze common error types, update prompts, recalibrate thresholds.

Checklist: choosing a HITL pattern

Algorithm for choosing the right pattern.

1. Is the action irreversible?
   YES -> Pre-action Gate (pattern 1)
   NO -> continue

2. Volume > 1000 decisions/day?
   YES -> Confidence-based Routing (pattern 3)
   NO -> continue

3. Cost of a single error > $100?
   YES -> Cascading Validation (pattern 4)
   NO -> continue

4. Resources to review 20-30% of decisions?
   YES -> Post-action Audit (pattern 2)
   NO -> Confidence-based Routing with a high threshold (pattern 3)

Always: add a Feedback Loop (pattern 5) on top of whichever pattern you choose.

Summary

Human-in-the-loop is not a fallback for a poor model. It is an architectural pattern that defines where AI works autonomously, where it operates under supervision, and where it yields to a human.

Three tools: a confidence threshold (calibrated on real data), a risk matrix (static + dynamic risk level), and a routing pattern (selected via the checklist). Monitoring metrics show whether the system works and when to recalibrate.

Start with Post-action Audit at 100% coverage. After 1-2 weeks, enough data will exist to calibrate the threshold. Then move to Confidence-based Routing. After a month, add Cascading Validation for high-risk decisions. Each step reduces reviewer workload and increases model autonomy.

Need help designing human-in-the-loop systems for AI products? I help startups build AI products and automate processes — belov.works.

FAQ

How do you prevent the review queue from becoming a bottleneck in high-volume systems?

Three mechanisms prevent queue saturation. First, set an SLA-based automatic fallback: if a decision has been waiting longer than N minutes, send the model's response with a visible "pending review" flag rather than blocking the user. Second, use dynamic threshold relaxation during peak load — temporarily raise the confidence threshold so fewer items enter the queue. Third, maintain a reviewer capacity model: if the Tier A escalation rate consistently exceeds reviewer throughput, the threshold needs recalibration, not more reviewers.

Is LLM-generated confidence reliable enough to use as a routing signal?

Model-reported confidence from self-assessment is not calibrated by default. GPT-4 outputs high confidence scores on factually incorrect answers at a non-trivial rate. Logprobs are more reliable — they reflect certainty in token selection rather than subjective self-assessment — but still require external calibration against labeled data before use. The calibration process described in this article (collecting 500-1000 decisions and comparing scores against correct/incorrect labels) is the minimum required before any production HITL system goes live.

What is a practical approach to handling the Feedback Loop when reviewers are not available daily?

Batch the feedback collection. Instead of requiring real-time feedback from each reviewer, collect decisions daily and run a weekly analysis cycle: export all reviewer corrections, cluster by error category, update the prompt or threshold based on the top 2-3 recurring patterns. This asynchronous approach requires less reviewer discipline while still closing the improvement loop on a cadence that matters — monthly threshold recalibrations are the minimum for a system processing thousands of decisions per week.

Prompt A/B Testing: a scientific approach to improving AI response quality

Roman Belov — Thu, 16 Jul 2026 03:14:45 +0000

Most production prompts never go through formal comparative testing. Teams change wording on intuition, evaluate the result on three examples, and deploy. A week later they notice degradation on edge cases, roll back, and try again. Prompt A/B testing eliminates guesswork and turns prompt optimization into a measurable process.

Why intuitive prompt evaluation does not work

Prompt engineers make three systematic mistakes during manual evaluation.

Small sample error. Checking 5-10 examples does not surface problems. Prompt A may win on simple requests and lose on complex ones. With a sample of 5 requests, the probability of missing a 20% degradation is 33%.

Confirmation bias. The prompt author subconsciously picks examples where the new version looks better. This is not bad intent, it is a cognitive bias. The only way to eliminate it is blind evaluation on a random sample.

No baseline. Without a recorded baseline, it is impossible to know whether the new version improved anything. "Responses seem more accurate" is not a metric. 0.82 to 0.87 on faithfulness over 200 examples is.

A/B testing solves all three: fixed dataset, automated evaluation, statistical verification of the difference.

Architecture of prompt A/B testing

A prompt A/B test differs from a product A/B test. In a product test, you measure user behavior (CTR, conversion). In a prompt test, you measure the quality of model outputs. Users may not be involved at all.

┌─────────────────────────────────────────────────────┐
│              Prompt A/B Test Pipeline                │
├─────────────┬─────────────┬─────────────────────────┤
│  Dataset    │  Execution  │  Evaluation             │
│             │             │                         │
│ Inputs      │ Prompt A    │ LLM-as-Judge            │
│             │ Prompt B    │ Deterministic            │
│ Expected    │ → outputs   │ metrics                 │
│ outputs     │             │ → scores                │
│ (optional)  │             │                         │
├─────────────┼─────────────┼─────────────────────────┤
│  Statistical Analysis → Winner / No Difference      │
└─────────────┴─────────────┴─────────────────────────┘

Three components: dataset, execution, evaluation. Each requires separate attention.

Dataset: the foundation of the experiment

The A/B test dataset contains inputs, optionally expected outputs, and metadata (request category, complexity).

Minimum sample size depends on the expected effect size:

Expected effect	Minimum examples	Notes
Large (>0.15)	50-100	Full prompt rewrite
Medium (0.05-0.15)	200-500	Significant instruction change
Small (<0.05)	500-1000+	Fine-tuning phrasing

The dataset must reflect the distribution of real requests. If 40% of production requests are in Russian but the dataset is English-only, the test results are irrelevant.

# Dataset structure in Langfuse
dataset_items = [
    {
        "input": {"query": "Explain the difference between Docker and Kubernetes"},
        "expected_output": "Docker is containerization, Kubernetes is orchestration...",
        "metadata": {"category": "technical", "complexity": "medium", "locale": "en"}
    },
    {
        "input": {"query": "Write a product description for wireless headphones"},
        "expected_output": None,  # Judge-based evaluation, no reference needed
        "metadata": {"category": "creative", "complexity": "low", "locale": "en"}
    }
]

Dataset construction strategies:

Production sampling. Random sample from real requests. The most relevant approach. Langfuse lets you create dataset items directly from traces.
Stratified sampling. Sample preserving category proportions. If 30% of requests are summarization, 30% Q&A, 40% generation, the dataset maintains those proportions.
Adversarial augmentation. Adding hard and edge cases that rarely appear in production but are critical for quality.

Execution: running both prompt variants

Each dataset item passes through both prompts. Controlling variables is non-negotiable:

from langfuse import Langfuse

langfuse = Langfuse()

dataset = langfuse.get_dataset("eval-dataset-v2")

def run_experiment(prompt_name: str, prompt_version: int, run_name: str):
    prompt = langfuse.get_prompt(prompt_name, version=prompt_version)

    for item in dataset.items:
        # Each run is linked to a dataset item
        with item.observe(run_name=run_name) as trace:
            generation = trace.generation(
                name="main-llm-call",
                model="gpt-4o",
                input=prompt.compile(**item.input),
                metadata={"prompt_version": prompt_version}
            )

            response = call_llm(
                model="gpt-4o",
                messages=prompt.compile(**item.input),
                temperature=0.3  # Fixed temperature
            )

            generation.end(output=response)

# Run both versions
run_experiment("assistant-prompt", version=3, run_name="prompt-v3-baseline")
run_experiment("assistant-prompt", version=4, run_name="prompt-v4-candidate")

Parameters to keep identical between variants:

Model. Same model ID (gpt-4o-2024-08-06, not just gpt-4o).
Temperature. Identical for both. For reproducibility, use 0 or 0.1-0.3.
Seed. If the provider supports it (OpenAI), fix the seed for determinism.
Max tokens. Same limit, so one variant does not win just by being longer.

Quality metrics for prompt evaluation

Metrics fall into two categories: deterministic (computed algorithmically) and LLM-based (evaluated by a judge model).

Deterministic metrics

Fast, free, fully reproducible. Cover a limited set of quality aspects.

Metric	Measures	When to use
ROUGE-L	Reference match (longest common subsequence)	Summarization, extraction
BLEU	N-gram overlap with reference	Translation, paraphrasing
Exact match	Exact string match	Classification, entity extraction
JSON validity	Structure validity	Structured output
Latency	Response time	Any use case
Token count	Response length	Cost optimization

LLM-as-Judge metrics

These evaluate semantic quality. Each evaluation is a model call, but they cover aspects that deterministic metrics cannot. More on the LLM-as-Judge pattern: dedicated guide.

Key metrics implemented in DeepEval:

from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    GEval
)

# Relevancy: how well the answer addresses the question
relevancy = AnswerRelevancyMetric(threshold=0.7, model="gpt-4o")

# Faithfulness: how grounded the answer is in the provided context (for RAG)
faithfulness = FaithfulnessMetric(threshold=0.8, model="gpt-4o")

# Custom metric via G-Eval
tone_consistency = GEval(
    name="Tone Consistency",
    criteria="Evaluate whether the response maintains a professional, "
             "helpful tone throughout. Score 0-1.",
    evaluation_params=["actual_output"],
    model="gpt-4o",
    threshold=0.7
)

G-Eval deserves a closer look. It is a framework for creating custom LLM-as-Judge metrics. You describe a criterion in natural language, G-Eval generates chain-of-thought evaluation steps, and scores the output. This lets you test business-specific aspects: brand guideline compliance, legal phrasing correctness, required disclaimer presence.

Choosing metrics

Do not test everything at once. Each A/B test checks a specific hypothesis, and metrics are chosen to match it.

Hypothesis	Metrics
"New prompt answers questions more accurately"	Answer Relevancy, Faithfulness
"Structured output is more reliable"	JSON validity rate, schema compliance
"Responses are more concise without quality loss"	Token count + Answer Relevancy
"Tone became more professional"	G-Eval with custom criteria

Statistical significance: when the difference is real

Prompt A scored 0.83 on relevancy, Prompt B scored 0.86. Is this a real improvement or noise? A statistical test gives the answer.

Choosing a test

For LLM metrics (continuous values 0-1), use a paired t-test or the Wilcoxon signed-rank test. Paired, because both prompts are evaluated on the same inputs.

import numpy as np
from scipy import stats

# Scores for each input in the dataset
scores_a = np.array([0.82, 0.91, 0.78, 0.85, 0.79, ...])  # Prompt A
scores_b = np.array([0.88, 0.89, 0.84, 0.90, 0.83, ...])  # Prompt B

# Paired t-test (if distribution ≈ normal)
t_stat, p_value = stats.ttest_rel(scores_a, scores_b)

# Wilcoxon signed-rank (non-parametric, no distribution assumptions)
w_stat, p_value_w = stats.wilcoxon(scores_a, scores_b)

# Effect size (Cohen's d for paired samples)
diff = scores_b - scores_a
cohens_d = np.mean(diff) / np.std(diff, ddof=1)

print(f"Mean A: {scores_a.mean():.3f}, Mean B: {scores_b.mean():.3f}")
print(f"P-value (t-test): {p_value:.4f}")
print(f"P-value (Wilcoxon): {p_value_w:.4f}")
print(f"Cohen's d: {cohens_d:.3f}")

Interpreting results

P-value	Cohen's d	Decision
< 0.05	> 0.5	Adopt Prompt B: significant and substantial improvement
< 0.05	0.2-0.5	Consider Prompt B: significant, but moderate effect
< 0.05	< 0.2	Caution: statistically significant but practically negligible
> 0.05	any	No grounds to prefer one version over the other

p-value < 0.05 with Cohen's d < 0.1 means the difference exists but is small enough that it is not worth the deployment risk.

Correction for multiple comparisons

If you are testing three metrics simultaneously (relevancy, faithfulness, tone), the probability of a false positive increases. With three independent tests at alpha=0.05, the probability of at least one false positive is 14%.

Bonferroni correction: divide alpha by the number of tests. For three metrics: 0.05 / 3 = 0.017. A result is only significant at p < 0.017.

Full pipeline with Langfuse and DeepEval

Langfuse manages prompts, datasets, and tracing. DeepEval runs the evaluation. Together they cover the full A/B test cycle. If you have not set up Langfuse yet, here is the setup guide.

Step 1: Preparing the dataset in Langfuse

from langfuse import Langfuse

langfuse = Langfuse()

# Create dataset
dataset = langfuse.create_dataset(
    name="support-bot-eval-v3",
    description="200 real support requests, stratified by category",
    metadata={"source": "production-sampling", "period": "2026-03-01 to 2026-03-15"}
)

# Add items from production traces
for trace in sampled_traces:
    langfuse.create_dataset_item(
        dataset_name="support-bot-eval-v3",
        input=trace.input,
        expected_output=trace.verified_output,  # Human-verified
        metadata={"category": trace.metadata.get("category")}
    )

Step 2: Running the experiment

import openai

client = openai.OpenAI()

def run_prompt_variant(dataset_name: str, prompt_version: int, run_name: str):
    dataset = langfuse.get_dataset(dataset_name)
    prompt = langfuse.get_prompt("support-bot", version=prompt_version)

    results = []

    for item in dataset.items:
        with item.observe(run_name=run_name) as trace:
            messages = prompt.compile(**item.input)

            response = client.chat.completions.create(
                model="gpt-4o-2024-08-06",
                messages=messages,
                temperature=0.2,
                seed=42
            )

            output = response.choices[0].message.content
            trace.generation(
                name="support-response",
                model="gpt-4o-2024-08-06",
                input=messages,
                output=output,
                usage={
                    "input": response.usage.prompt_tokens,
                    "output": response.usage.completion_tokens
                }
            )

            results.append({
                "item_id": item.id,
                "input": item.input,
                "output": output,
                "expected": item.expected_output,
                "metadata": item.metadata
            })

    return results

baseline_results = run_prompt_variant("support-bot-eval-v3", version=3, run_name="v3-baseline")
candidate_results = run_prompt_variant("support-bot-eval-v3", version=4, run_name="v4-candidate")

Step 3: Evaluation with DeepEval

from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric, GEval

metrics = [
    AnswerRelevancyMetric(threshold=0.7, model="gpt-4o"),
    GEval(
        name="Helpfulness",
        criteria="Rate how helpful and actionable the support response is. "
                 "Consider: does it solve the user's problem? Does it provide "
                 "clear next steps? Score 0-1.",
        evaluation_params=["input", "actual_output"],
        model="gpt-4o",
        threshold=0.7
    )
]

def evaluate_results(results: list, run_label: str):
    test_cases = []
    for r in results:
        test_cases.append(LLMTestCase(
            input=r["input"]["query"],
            actual_output=r["output"],
            expected_output=r.get("expected"),
            context=[r["input"].get("context", "")]
        ))

    evaluation = evaluate(test_cases=test_cases, metrics=metrics)
    return evaluation

Step 4: Statistical analysis

import numpy as np
from scipy import stats

def compare_variants(eval_a, eval_b, metric_name: str, alpha: float = 0.05):
    scores_a = np.array([tc.metrics_data[metric_name] for tc in eval_a.test_cases])
    scores_b = np.array([tc.metrics_data[metric_name] for tc in eval_b.test_cases])

    # Paired test
    _, p_value = stats.wilcoxon(scores_a, scores_b)

    # Effect size
    diff = scores_b - scores_a
    cohens_d = np.mean(diff) / np.std(diff, ddof=1) if np.std(diff) > 0 else 0

    result = {
        "metric": metric_name,
        "mean_a": float(scores_a.mean()),
        "mean_b": float(scores_b.mean()),
        "delta": float(scores_b.mean() - scores_a.mean()),
        "p_value": float(p_value),
        "cohens_d": float(cohens_d),
        "significant": p_value < alpha,
        "recommendation": "adopt" if (p_value < alpha and cohens_d > 0.2) else "keep_baseline"
    }

    return result

# Compare by each metric
for metric_name in ["Answer Relevancy", "Helpfulness"]:
    result = compare_variants(eval_baseline, eval_candidate, metric_name)
    print(f"\n{result['metric']}:")
    print(f"  Baseline: {result['mean_a']:.3f} → Candidate: {result['mean_b']:.3f}")
    print(f"  Delta: {result['delta']:+.3f}, p={result['p_value']:.4f}, d={result['cohens_d']:.3f}")
    print(f"  → {result['recommendation']}")

CI/CD integration: automated prompt tests

An A/B test should not be a manual process. It runs automatically whenever a prompt changes.

# .github/workflows/prompt-test.yml
name: Prompt A/B Test

on:
  push:
    paths:
      - 'prompts/**'

jobs:
  prompt-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Detect changed prompts
        id: changes
        run: |
          changed=$(git diff --name-only HEAD~1 -- prompts/)
          echo "prompts=$changed" >> $GITHUB_OUTPUT

      - name: Run A/B test
        env:
          LANGFUSE_PUBLIC_KEY: ${{ secrets.LANGFUSE_PUBLIC_KEY }}
          LANGFUSE_SECRET_KEY: ${{ secrets.LANGFUSE_SECRET_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          python scripts/run_prompt_ab_test.py \
            --prompt-name support-bot \
            --baseline-version current \
            --candidate-version staged \
            --dataset support-bot-eval-v3 \
            --min-effect-size 0.1 \
            --alpha 0.05

      - name: Check results
        run: |
          python scripts/check_ab_results.py \
            --fail-on-regression \
            --min-delta -0.02  # Tolerate no more than 2% degradation

The --fail-on-regression logic: the pipeline passes if the candidate is not worse than the baseline by more than the specified threshold. This allows deploying prompts that improve one metric without degrading the others.

Patterns and anti-patterns in prompt A/B testing

Patterns that work

One prompt, one variable. Change one thing at a time. If you changed both the system prompt and the few-shot examples simultaneously, you cannot tell what drove the outcome.

Versioning in Langfuse. Every prompt version is stored in Langfuse Prompt Management with metadata: who changed it, why, and what hypothesis it tests. This creates an audit trail of optimization.

Segmented analysis. The overall score may look identical, but one prompt may be better for short requests while the other excels on long ones. Break results down by dataset metadata fields.

# Analysis by segment
for category in ["technical", "billing", "general"]:
    segment_a = [s for s, m in zip(scores_a, metadata) if m["category"] == category]
    segment_b = [s for s, m in zip(scores_b, metadata) if m["category"] == category]

    if len(segment_a) >= 20:  # Sufficient segment size
        _, p = stats.wilcoxon(segment_a, segment_b)
        print(f"{category}: A={np.mean(segment_a):.3f}, B={np.mean(segment_b):.3f}, p={p:.4f}")

Cost-aware evaluation. Prompt B may be 3% better in quality but 40% more expensive due to increased context. Calculate cost per quality point.

Anti-patterns

Testing on training examples. If the few-shot examples in the prompt come from the dataset, the results are invalid. The dataset must contain examples the model has never seen.

Ignoring variance. An average score of 0.85 with a standard deviation of 0.25 is worse than 0.82 with a standard deviation of 0.05. High variance means unpredictable quality. Check not just the mean, but also std, min, and the 5th percentile.

Stopping too early. The first 50 examples show improvement, and it is tempting to stop and deploy. But the first 50 may all be from the same category. Wait for the full run.

Cherry-picking results. You test five metrics, one shows p < 0.05, and that does not mean the prompt is better. This is the multiple comparisons problem (described above). Define your primary metric upfront.

Checklist: launching a prompt A/B test

Hypothesis. State exactly what the new prompt improves and why.
Primary metric. Choose one main metric. Others are supplementary.
Dataset. Minimum 100 examples, representative sample from production.
Controlled variables. Model, temperature, seed, max_tokens, all fixed.
Execution. Both prompts run on the full dataset; results logged in Langfuse.
Evaluation. Metrics computed for both variants.
Statistics. Paired test, p-value, effect size. Bonferroni correction if needed.
Segmentation. Results checked across request categories.
Decision. Adopt, reject, or iterate based on data.
Documentation. Outcome recorded: which prompt, what effect, what limitations.

What next

Prompt A/B testing is one component of a mature LLM ops pipeline. It works best in combination with LLM-as-Judge for automated quality gating and Langfuse for tracing and prompt management.

The next level is online A/B testing, where two prompts run in production simultaneously and traffic is split between them. This requires feature flags, routing logic, and real-time monitoring. Offline A/B testing (described in this article) is simpler, cheaper, and covers 90% of prompt optimization needs.

Start small: one dataset of 100 production requests, one metric, one statistical test. That alone will give you more confidence in your decisions than any amount of manual review.

Need help with prompt optimization or AI quality systems? I help startups build AI products and automate processes — belov.works.

FAQ

How often should a prompt A/B test dataset be refreshed?

Refresh the evaluation dataset when the distribution of production requests shifts significantly — a new feature launch, a change in user demographics, or a language expansion. A practical rule: re-sample 30-50% of the dataset every quarter from recent production traffic. A stale dataset from six months ago will produce valid statistics that no longer reflect current usage patterns, making confident decisions about prompt quality misleading.

Can A/B testing be used to compare different LLMs rather than different prompts?

Yes, the same pipeline applies to model comparisons — run both models on the identical dataset with identical prompts and evaluate using the same metrics. The key constraint is cost: running GPT-4o and Claude Sonnet on 500 examples each adds up quickly. For model selection decisions, start with 50-100 stratified examples to identify obvious winners, then run a full statistical test only when the initial results are inconclusive.

What is a realistic time-to-improvement cycle for prompt optimization?

A complete cycle — hypothesis, dataset run, statistical analysis, decision — takes 2-4 hours of wall-clock time for a 200-example dataset with cloud LLM evaluation. The bottleneck is almost always dataset construction (sourcing and labeling representative examples), not the evaluation itself. Teams running weekly prompt optimization cycles typically maintain a standing eval dataset that is continuously updated from production, reducing cycle time to under an hour for each new hypothesis.

ADR Template: How AI Generates Architecture Decision Records Your Future Self Will Thank You For

Roman Belov — Tue, 14 Jul 2026 03:37:38 +0000

Teams make dozens of architectural decisions every month but document almost none of them. The rest dissolve into Slack threads, hallway conversations, and the minds of people who will leave the company within a year.

Six months later, a new developer stares at the code and asks: "Why Redis here instead of PostgreSQL for queues?" Nobody remembers. An archaeological dig through Git history, Slack, and Notion begins. Two hours spent investigating a decision that originally took 15 minutes.

Architecture Decision Records (ADRs) solve this problem. But they don't get written. The reason is simple: drafting an ADR takes 30-40 minutes, and the developer has already moved on to the next task. AI compresses that to 3-5 minutes. This article covers ADR structure, prompts for LLM-based generation, real-world examples, and CI pipeline automation.

What ADRs are and why capturing architectural decisions matters

An ADR (Architecture Decision Record) is a document that captures one specific architectural decision. Not a spec, not an RFC, not a design document. One decision, one file.

Michael Nygard introduced the concept in 2011. The format took hold at large companies (Spotify, Thoughtworks, GitHub) but remains rare in smaller teams. The main reason: the writing overhead feels higher than the value it delivers.

Three situations where the absence of ADRs hurts the most:

Onboarding. A new developer reads the code and encounters an unconventional decision. Without an ADR, they either spend hours investigating, or treat it as a mistake and "fix" it. Both paths are expensive for the team.

Revisiting decisions. Context changes: load increases, new requirements emerge, a dependency goes stale. Without a record of why the current solution was chosen and which alternatives were rejected, the team re-runs the entire analysis from scratch.

Audits and compliance. In regulated industries (fintech, healthtech), architectural decisions require documented justification. ADRs close that gap automatically.

ADR template structure: 7 required sections

A minimum viable ADR contains seven sections. Each answers a specific question.

# ADR-{number}: {Decision title}

## Status
Proposed | Accepted | Deprecated | Superseded by ADR-{number}

## Date
2026-03-26

## Context
What problem or situation forces this decision?
Technical constraints, business requirements, current system state.

## Decision
Exactly what was decided. A concrete statement without vague language.

## Consequences
Positive and negative effects of the decision.
What becomes easier, what becomes harder.

## Alternatives Considered
Which options were evaluated and why they were rejected.
Comparison criteria, trade-offs.

## References
Links to issues, PRs, discussions, documentation, benchmarks.

Status has four values. Proposed means the decision is under discussion. Accepted means it's adopted and in use. Deprecated means it's outdated, but no replacement has been chosen yet. Superseded by ADR-{N} means it was replaced by a newer decision with a direct link.

Context is the most important section. Without context, a decision loses meaning. "We chose Redis for caching" tells you nothing. "We chose Redis for caching because PostgreSQL LISTEN/NOTIFY couldn't deliver sub-millisecond latency for autocomplete at 10K RPS" tells you everything.

Alternatives Considered is the section most often skipped and the one that provides the most value. When the question "why not Kafka?" comes up a year later, the answer is already recorded.

Prompt for generating ADRs with AI

A baseline prompt that works with Claude, GPT-4, and Gemini:

You are a senior software architect. Generate an ADR (Architecture Decision Record)
using the following template.

DECISION: {description of the decision made}

PROJECT CONTEXT:
- Stack: {languages, frameworks, infrastructure}
- Scale: {load, team size, product stage}
- Constraints: {budget, deadlines, compliance, legacy}

OUTPUT FORMAT:
# ADR-{number}: {Title}

## Status
Accepted

## Date
{current date}

## Context
Describe the problem that led to this decision. Include technical and business factors.
Specific metrics where applicable. 3-5 sentences.

## Decision
State the decision in one paragraph. No words like "we decided" or "we will" - facts only.
Specify the scope: what is included, what is not.

## Consequences
### Positive
- List 3-5 concrete improvements

### Negative
- List 2-3 trade-offs or risks

## Alternatives Considered
For each alternative:
### {Alternative name}
- Description in 1-2 sentences
- Reason for rejection (specific, not "didn't fit")

## References
- Links to relevant resources

RULES:
- Active voice, no filler
- Specific metrics instead of "faster/better/simpler"
- If metrics are unavailable, use qualitative criteria
- Every Consequences item must be verifiable

This prompt covers 80% of cases. The remaining 20% require specialized variants.

Advanced prompts: migrations, technology choices, deprecations

Prompt for a migration ADR

Migrations carry the highest risk and the longest consequence horizon.

Generate an ADR for a migration.

FROM: {current solution, version, how long in production}
TO: {target solution}
REASON FOR MIGRATION: {what broke or stopped working}

In addition to the standard template, add these sections:

## Migration Strategy
- Approach: Big Bang | Strangler Fig | Parallel Run
- Justification for the approach
- Migration phases with estimated timelines

## Rollback Plan
- Rollback triggers
- Rollback procedure
- Acceptable rollback window

## Success Criteria
- Metrics that define migration as complete
- Stabilization period after migration

Prompt for a technology selection ADR

Generate an ADR for a technology selection.

TASK: {what problem the chosen technology solves}
CANDIDATES: {list of evaluated options}
CHOICE: {winner}

Additionally:

## Evaluation Criteria
Comparison table of candidates by criteria:
| Criterion | Weight | {Candidate 1} | {Candidate 2} | {Candidate 3} |
Criteria: performance, DX, community/support, cost, learning curve, integration

## Proof of Concept
- What was tested in the PoC
- Results with numbers

Prompt for a deprecation ADR (Deprecated/Superseded)

Generate an ADR recording the rejection of a previously accepted decision.

ORIGINAL ADR: {number and title}
REASON FOR REJECTION: {what changed since the decision was made}
REPLACEMENT: {new decision, if any}

Additionally:

## What Changed
- Specific context changes that made the decision obsolete
- Date or event that made it apparent

## Lessons Learned
- What the team learned from using the previous solution
- How this influences similar future decisions

Example: ADR for a real project

A team is choosing a caching strategy for a travel app API. Here is what AI generates using the prompt above:

# ADR-012: Redis for External API Response Caching

## Status
Accepted

## Date
2026-03-26

## Context
The travel app API calls external services (Foursquare, weather, flights)
for every user request. Average response time: 800ms on a cold call. 40%
of requests are repeat calls to the same data within a 15-minute window
(user switches between tabs, refreshes the page). External API rate limits:
500 req/min (Foursquare), 1000 req/min (weather). At 200 DAU, current load
is at 60% of the limit.

## Decision
Use Redis (Upstash Serverless) as a cache layer between edge functions
and external APIs. TTL: 15 minutes for geo data, 60 minutes for weather,
5 minutes for flight prices. Cache key: `{api}:{endpoint}:{normalized_params_hash}`.
Invalidation strategy: TTL-based, no manual invalidation in the first phase.

## Consequences
### Positive
- Response time for cached requests: 800ms to 15-25ms (Upstash REST API)
- Rate limit consumption drops by 40% with current usage patterns
- Edge functions free up faster, reducing compute consumption

### Negative
- Additional dependency: Upstash (managed, but still a point of failure)
- Stale data within the TTL window: users may see outdated ticket prices
- Cost: ~$5/mo at current load, scales with growth

## Alternatives Considered
### In-memory cache (Map in Deno isolate)
Zero latency, but state is lost on cold start. At the current cold start
frequency (every 3-5 minutes), hit rate would be below 20%. Doesn't
justify the implementation complexity.

### Cloudflare KV
Eventual consistency with up to 60-second delay. Acceptable for flight
price caching, but creates UX issues for geo data (user is moving).
Cost is comparable to Redis.

### PostgreSQL materialized views
Requires reworking the data layer. Not suitable for edge functions due
to connection latency (50-100ms vs 5-15ms for Redis REST API).

## References
- Upstash Redis REST API: https://docs.upstash.com/redis
- Foursquare rate limits: https://docs.foursquare.com/reference/rate-limits

Note the specificity. No phrases like "improves performance" or "reduces load." Numbers instead: 800ms to 15-25ms, 40% reduction in rate limit consumption, $5/mo. Every item can be verified six months from now.

Generating ADRs from Git history and PR descriptions

AI can extract architectural decisions from existing artifacts. This closes the "we made a decision three months ago and forgot to document it" problem.

Prompt for retrospective generation:

Analyze the following PR (diff + description + comments) and determine
whether it contains an architectural decision. If so, generate an ADR.

PR TITLE: {title}
PR DESCRIPTION: {description}
PR DIFF (key files):
{diff}
PR COMMENTS:
{review comments}

Criteria for an architectural decision:
- Adding a new dependency
- Changing data structure or DB schema
- New pattern (caching, queues, retry strategy)
- Changing an API contract
- Choosing between two or more approaches (recorded in the discussion)

If the PR contains no architectural decision, respond "No ADR needed" with
a brief explanation.

For Claude Code, this process can be automated with:

# Fetch the latest PR diff and generate an ADR
gh pr view --json title,body,comments,files | \
  claude -p "Analyze this PR and generate an ADR if you find an architectural decision."

Automation: ADRs as part of the CI/CD pipeline

Manual ADR generation works but requires discipline. CI automation removes the human factor.

GitHub Action: checking for an ADR

# .github/workflows/adr-check.yml
name: ADR Check

on:
  pull_request:
    types: [opened, synchronize]

jobs:
  check-adr:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Detect architectural changes
        id: detect
        run: |
          CHANGED_FILES=$(gh pr diff ${{ github.event.pull_request.number }} --name-only)

          ADR_NEEDED=false

          # New dependencies
          if echo "$CHANGED_FILES" | grep -q "package.json\|Cargo.toml\|go.mod\|requirements.txt"; then
            ADR_NEEDED=true
          fi

          # DB migrations
          if echo "$CHANGED_FILES" | grep -q "migration\|schema"; then
            ADR_NEEDED=true
          fi

          # Infrastructure config
          if echo "$CHANGED_FILES" | grep -q "docker\|terraform\|cloudflare\|nginx"; then
            ADR_NEEDED=true
          fi

          echo "adr_needed=$ADR_NEEDED" >> $GITHUB_OUTPUT

      - name: Check for ADR file
        if: steps.detect.outputs.adr_needed == 'true'
        run: |
          CHANGED_FILES=$(gh pr diff ${{ github.event.pull_request.number }} --name-only)
          if ! echo "$CHANGED_FILES" | grep -q "^docs/adr/"; then
            echo "::warning::This PR contains architectural changes but no ADR. Consider adding one to docs/adr/"
          fi
        env:
          GH_TOKEN: ${{ github.token }}

This workflow warns rather than blocks. Blocking PRs through ADR requirements creates friction that kills adoption.

ADR file structure in the repository

docs/
└── adr/
    ├── README.md           # Index of all ADRs
    ├── template.md         # Template
    ├── 001-use-astro.md
    ├── 002-redis-caching.md
    └── 003-event-driven.md

Three-digit numbering with leading zeros. Files in chronological order. One file, one decision.

Script for creating a new ADR

#!/bin/bash
# scripts/new-adr.sh

ADR_DIR="docs/adr"
LAST_NUM=$(ls "$ADR_DIR" | grep -oP '^\d+' | sort -n | tail -1)
NEXT_NUM=$(printf "%03d" $((10#$LAST_NUM + 1)))

SLUG=$(echo "$1" | tr '[:upper:]' '[:lower:]' | tr ' ' '-' | tr -cd '[:alnum:]-')
FILENAME="${ADR_DIR}/${NEXT_NUM}-${SLUG}.md"

cat > "$FILENAME" << EOF
# ADR-${NEXT_NUM}: $1

## Status
Proposed

## Date
$(date +%Y-%m-%d)

## Context
<!-- What problem forces this decision? -->

## Decision
<!-- What was decided? -->

## Consequences
### Positive
-

### Negative
-

## Alternatives Considered
###
-

## References
-
EOF

echo "Created: $FILENAME"

Usage: ./scripts/new-adr.sh "Switch from REST to GraphQL".

Integrating ADRs with context engineering for AI agents

ADRs gain additional value as context for AI coding. When an AI agent (Claude Code, Cursor, Copilot) works with a codebase, ADRs provide architectural context that does not exist in the code itself.

Adding ADRs to the project's CLAUDE.md:

## Architecture Decisions
Key ADRs to follow when making changes:
- ADR-005: Event-driven architecture for notifications (docs/adr/005-event-driven.md)
- ADR-008: PostgreSQL RLS for multi-tenancy (docs/adr/008-rls-multitenancy.md)
- ADR-012: Redis caching strategy (docs/adr/012-redis-caching.md)

The AI agent respects these decisions when generating code. Instead of suggesting a REST call for notifications, it uses the event bus because ADR-005 recorded that decision. More on structuring context for AI: Context Engineering Guide.

Common mistakes when writing ADRs

Context that is too abstract. "We needed to improve performance" is useless. "API response time grew to 2.3s at p95; SLA requires < 500ms" is useful.

Missing Alternatives. Without an alternatives section, the ADR looks like post-hoc justification rather than a deliberate choice. Even if there was only one alternative (doing nothing), that is worth recording.

Scope that is too broad. An ADR captures one decision. "Moving to microservices" is not one decision. It is ten. Each service, each contract, each communication mechanism deserves its own ADR.

Stale Status. An ADR with Accepted status that was replaced long ago by another decision misleads readers. Updating the status to Superseded by ADR-{N} takes seconds and saves others hours.

Confusing ADRs with documentation. An ADR does not describe how the system works. It describes why the system works the way it does. How is the job of SOPs and operational documentation.

Metrics: measuring ADR practice effectiveness

Four metrics that show whether ADRs are working for the team:

Metric	How to measure	Target
ADR coverage	Number of ADRs / number of architectural PRs per month	> 70%
Time-to-ADR	Time from decision to recorded ADR	< 24 hours
Reference rate	How often ADRs are cited in PRs and discussions	> 2 times/month per ADR
Onboarding feedback	New joiners rate ADR usefulness (1-5)	> 4.0

ADR coverage below 50% means the process has not taken hold. Time-to-ADR above one week means context is being lost and the record becomes a fictional reconstruction.

Checklist: introducing ADRs to a team

Create the docs/adr/ directory and the template
Write 3-5 ADRs for already-made decisions (retrospectively, with AI help)
Add the new-adr.sh script for quick creation
Set up the GitHub Action with a soft warning (not a block)
Include key ADRs in CLAUDE.md / .cursorrules for AI agents
Review ADRs like code: through PRs
Walk through all ADRs quarterly and update statuses

The first three steps take 30 minutes with AI. The rest is a habit that forms over 2-3 sprints.

Need help with architecture decision records or engineering processes? I help startups build AI products and automate processes — belov.works.

FAQ

Should ADRs be reviewed like code — through pull requests?

Yes, this is the most effective approach. Reviewing ADRs via PRs keeps the architectural record auditable, allows team members to challenge or refine the decision before it is locked in, and creates a natural link between the code change and the rationale. The review process also surfaces disagreements early: an ADR under debate in a PR is better than a contested decision discovered six months into implementation.

What is the right granularity for an ADR — one per service or one per significant choice?

One ADR per atomic decision, not per service or per project. Choosing Redis for caching and choosing Upstash specifically as the managed provider are two separate ADRs. "Moving to microservices" is not one ADR — it is at minimum one per service boundary, one for the communication protocol, and one for the deployment strategy. Overly broad ADRs lose precision; overly narrow ones become noise.

How should ADRs be handled when a team is acquired or the codebase is inherited?

Treat inherited ADRs as unverified hypotheses. Read them for context, but audit each against the current state of the system. Some decisions will be stale (the library is deprecated, the load assumption changed), some will still be valid. The fastest path is a one-day "ADR audit sprint": read every ADR, update Status fields, and add a brief note to any where the context has shifted. This investment pays back in the first month of development.

AI Security Audit Checklist: 15 Vulnerabilities Claude Found in Production Code

Roman Belov — Wed, 08 Jul 2026 03:10:45 +0000

Most web applications contain at least one vulnerability from the OWASP Top 10. A typical security audit takes 2-3 weeks and costs upward of $10,000. An LLM can compress the initial audit down to a few hours because it scans code for patterns rather than specific CVEs.

Below are 15 vulnerabilities found while auditing production code with Claude. Each includes the vulnerable code, the fixed version, and a prompt to reproduce the finding. Classification follows OWASP Top 10 (2021). Order reflects frequency of occurrence: most common first.

Methodology: how to run an AI security audit

The audit consists of three passes. First, a broad scan: the LLM receives the entire project and looks for vulnerability patterns. Second, deep analysis: each identified pattern is verified in context (middleware, ORM, framework). Third, verification: manual review of every finding, because LLMs produce false positives.

Prompt for the broad scan:

Perform a security audit of this code. For each finding, include:
1. CWE ID and name
2. OWASP Top 10 category
3. Severity (Critical/High/Medium/Low)
4. The vulnerable code snippet
5. Attack vector -- exactly how an attacker would exploit this
6. Fixed code

Ignore stylistic comments. Focus on security only.
Start with injection attacks, then broken access control, then the rest.

This prompt works because it defines the output structure and prioritizes categories. Without explicit instructions, the LLM mixes critical vulnerabilities with remarks about email validation.

More on structured AI code review: AI Code Review Checklist.

A03:2021 -- Injection

1. SQL Injection via string concatenation

The most common finding. Shows up even in projects using an ORM, because developers switch to raw queries for complex filters.

Vulnerable code:

// API endpoint for user search
app.get('/api/users', async (req, res) => {
  const { search, sortBy } = req.query;
  const query = `
    SELECT id, name, email
    FROM users
    WHERE name LIKE '%${search}%'
    ORDER BY ${sortBy}
  `;
  const result = await db.query(query);
  res.json(result.rows);
});

Attack vector: GET /api/users?search='; DROP TABLE users; --&sortBy=id

Fixed code:

app.get('/api/users', async (req, res) => {
  const { search, sortBy } = req.query;

  const allowedSortColumns = ['id', 'name', 'email', 'created_at'];
  const sanitizedSort = allowedSortColumns.includes(sortBy) ? sortBy : 'id';

  const query = `
    SELECT id, name, email
    FROM users
    WHERE name LIKE $1
    ORDER BY ${sanitizedSort}
  `;
  const result = await db.query(query, [`%${search}%`]);
  res.json(result.rows);
});

Parameterized query for values, whitelist for identifiers (column names). ORDER BY cannot be parameterized in most drivers, so the whitelist is mandatory.

2. NoSQL Injection in MongoDB queries

// Vulnerable: req.body passed directly into the query
app.post('/api/login', async (req, res) => {
  const user = await db.collection('users').findOne({
    username: req.body.username,
    password: req.body.password,
  });
  if (user) return res.json({ token: generateToken(user) });
  res.status(401).json({ error: 'Invalid credentials' });
});

Attack vector: POST /api/login with body {"username": "admin", "password": {"$ne": ""}}. The $ne (not equal) operator turns the password check into "password is not equal to empty string" -- true for any user.

// Fixed: explicit string casting
app.post('/api/login', async (req, res) => {
  const username = String(req.body.username);
  const password = String(req.body.password);

  const user = await db.collection('users').findOne({ username });
  if (!user || !await bcrypt.compare(password, user.passwordHash)) {
    return res.status(401).json({ error: 'Invalid credentials' });
  }
  res.json({ token: generateToken(user) });
});

Two fixes: String() blocks MongoDB operators, and bcrypt.compare replaces plaintext password comparison.

3. Command Injection via child_process

// Vulnerable: user input in a shell command
app.post('/api/convert', async (req, res) => {
  const { filename } = req.body;
  exec(`convert uploads/${filename} -resize 200x200 thumbnails/${filename}`,
    (err, stdout) => {
      if (err) return res.status(500).json({ error: 'Conversion failed' });
      res.json({ status: 'ok' });
    });
});

Attack vector: filename: "image.png; rm -rf /"

// Fixed: execFile instead of exec, filename validation
import { execFile } from 'child_process';

app.post('/api/convert', async (req, res) => {
  const { filename } = req.body;

  if (!/^[a-zA-Z0-9_-]+\.(png|jpg|webp)$/.test(filename)) {
    return res.status(400).json({ error: 'Invalid filename' });
  }

  execFile('convert', [
    `uploads/${filename}`, '-resize', '200x200', `thumbnails/${filename}`
  ], (err) => {
    if (err) return res.status(500).json({ error: 'Conversion failed' });
    res.json({ status: 'ok' });
  });
});

execFile does not spawn a shell, so ; rm -rf / is not interpreted as a separate command. The regex restricts allowable characters.

A01:2021 -- Broken Access Control

4. IDOR -- Insecure Direct Object Reference

// Vulnerable: any authenticated user can view any order
app.get('/api/orders/:id', authMiddleware, async (req, res) => {
  const order = await db.query('SELECT * FROM orders WHERE id = $1', [req.params.id]);
  res.json(order.rows[0]);
});

Attack vector: ID enumeration -- GET /api/orders/1, /api/orders/2, /api/orders/3...

// Fixed: ownership check
app.get('/api/orders/:id', authMiddleware, async (req, res) => {
  const order = await db.query(
    'SELECT * FROM orders WHERE id = $1 AND user_id = $2',
    [req.params.id, req.user.id]
  );
  if (!order.rows[0]) return res.status(404).json({ error: 'Not found' });
  res.json(order.rows[0]);
});

AND user_id = $2 added. If a role legitimately allows access to others' orders (admin, support), the check happens through RBAC middleware rather than through the absence of a condition.

5. Path Traversal when serving files

# Vulnerable: user controls the file path
@app.route('/api/files/<filename>')
def get_file(filename):
    return send_file(f'uploads/{filename}')

Attack vector: GET /api/files/../../etc/passwd

# Fixed: send_from_directory + validation
import os
from werkzeug.utils import secure_filename

@app.route('/api/files/<filename>')
def get_file(filename):
    safe_name = secure_filename(filename)
    if not safe_name:
        abort(400)
    upload_dir = os.path.abspath('uploads')
    file_path = os.path.abspath(os.path.join(upload_dir, safe_name))
    if not file_path.startswith(upload_dir):
        abort(403)
    return send_from_directory(upload_dir, safe_name)

secure_filename strips ../, send_from_directory confines the base directory, and the abspath check prevents bypass via symlinks.

6. Mass Assignment -- overwriting fields via API

// Vulnerable: entire req.body passed to update
app.put('/api/profile', authMiddleware, async (req, res) => {
  await db.query(
    'UPDATE users SET name = $1, email = $2, role = $3 WHERE id = $4',
    [req.body.name, req.body.email, req.body.role, req.user.id]
  );
  res.json({ status: 'updated' });
});

Attack vector: PUT /api/profile with body {"name": "Hacker", "role": "admin"}. User elevates their own role.

// Fixed: whitelist of allowed fields
app.put('/api/profile', authMiddleware, async (req, res) => {
  const { name, email } = req.body;
  await db.query(
    'UPDATE users SET name = $1, email = $2 WHERE id = $3',
    [name, email, req.user.id]
  );
  res.json({ status: 'updated' });
});

Destructuring picks only allowed fields. The role field from the request is silently ignored.

A02:2021 -- Cryptographic Failures

7. JWT without algorithm verification

// Vulnerable: algorithm taken from the token header
const decoded = jwt.verify(token, publicKey);

Attack vector: the attacker crafts a JWT with "alg": "none" or "alg": "HS256", signing it with a symmetric key equal to the server's public key. Some libraries accept such tokens.

// Fixed: algorithm pinned explicitly
const decoded = jwt.verify(token, publicKey, {
  algorithms: ['RS256'],
  issuer: 'https://auth.example.com',
  audience: 'api.example.com',
});

The algorithms parameter prevents the server from accepting JWTs with an arbitrary algorithm. issuer and audience restrict the token's scope.

8. Secrets hardcoded in source code

# Vulnerable: keys in source code
STRIPE_SECRET_KEY = "sk_live_4eC39HqLyjWDarjtT1zdp7dc"
AWS_ACCESS_KEY = "AKIAIOSFODNN7EXAMPLE"
DATABASE_URL = "postgresql://admin:password123@db.example.com/prod"

# Fixed: environment variables + startup validation
import os

def get_required_env(name: str) -> str:
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

STRIPE_SECRET_KEY = get_required_env("STRIPE_SECRET_KEY")
AWS_ACCESS_KEY = get_required_env("AWS_ACCESS_KEY")
DATABASE_URL = get_required_env("DATABASE_URL")

get_required_env fails at startup if the variable is not set, preventing the service from silently starting without credentials and returning errors on every request.

Prompt for finding hardcoded secrets:

Find all hardcoded secrets in the codebase: API keys, passwords,
tokens, connection strings. Check: .env files committed to git;
string literals containing "sk_", "AKIA", "password", "secret";
config files with credentials. For each finding, include the file,
line number, and recommendation.

A05:2021 -- Security Misconfiguration

9. CORS -- allowing all origins

// Vulnerable: any site can make requests to the API
app.use(cors({ origin: '*', credentials: true }));

Attack vector: an attacker hosts JavaScript on their site that makes requests to the API on behalf of an authenticated user (cookies are sent automatically).

// Fixed: origin whitelist
const allowedOrigins = [
  'https://example.com',
  'https://app.example.com',
];

app.use(cors({
  origin: (origin, callback) => {
    if (!origin || allowedOrigins.includes(origin)) {
      callback(null, true);
    } else {
      callback(new Error('Not allowed by CORS'));
    }
  },
  credentials: true,
}));

!origin passes requests without an Origin header (server-to-server). For production APIs, consider removing this check and always requiring Origin.

10. Verbose error messages in production

// Vulnerable: stack trace leaks to the client
app.use((err, req, res, next) => {
  res.status(500).json({
    error: err.message,
    stack: err.stack,
    query: err.query, // SQL query in the error
  });
});

Attack vector: the error exposes database structure, file paths, and library versions, making targeted attacks easier.

// Fixed: log internally, send minimal info to client
app.use((err, req, res, next) => {
  const errorId = crypto.randomUUID();

  logger.error({
    errorId,
    message: err.message,
    stack: err.stack,
    path: req.path,
    method: req.method,
    userId: req.user?.id,
  });

  res.status(500).json({
    error: 'Internal server error',
    errorId,
  });
});

errorId links the client response to the log entry. The user reports the ID to support; the developer finds the full stack trace.

A04:2021 -- Insecure Design

11. SSRF -- Server-Side Request Forgery

// Vulnerable: server makes a request to a user-supplied URL
app.post('/api/preview', async (req, res) => {
  const { url } = req.body;
  const response = await fetch(url);
  const html = await response.text();
  const title = extractTitle(html);
  res.json({ title, url });
});

Attack vector: url: "http://169.254.169.254/latest/meta-data/iam/security-credentials/" -- access to the AWS metadata API from the internal network.

import { URL } from 'url';
import dns from 'dns/promises';

async function isAllowedUrl(input: string): Promise<boolean> {
  const parsed = new URL(input);
  if (!['http:', 'https:'].includes(parsed.protocol)) return false;

  const addresses = await dns.resolve4(parsed.hostname);
  const blocked = ['10.', '172.16.', '192.168.', '169.254.', '127.'];
  return !addresses.some(ip => blocked.some(prefix => ip.startsWith(prefix)));
}

app.post('/api/preview', async (req, res) => {
  const { url } = req.body;
  if (!await isAllowedUrl(url)) {
    return res.status(400).json({ error: 'URL not allowed' });
  }
  const controller = new AbortController();
  setTimeout(() => controller.abort(), 5000);

  const response = await fetch(url, {
    signal: controller.signal,
    redirect: 'error',
  });
  const html = await response.text();
  res.json({ title: extractTitle(html), url });
});

DNS resolution verifies that the target IP does not belong to an internal network. redirect: 'error' blocks redirects to internal addresses. The timeout prevents slowloris attacks.

12. Race Condition in payment processing

// Vulnerable: check-then-act without locking
app.post('/api/withdraw', authMiddleware, async (req, res) => {
  const { amount } = req.body;
  const account = await db.query(
    'SELECT balance FROM accounts WHERE user_id = $1', [req.user.id]
  );

  if (account.rows[0].balance >= amount) {
    await db.query(
      'UPDATE accounts SET balance = balance - $1 WHERE user_id = $2',
      [amount, req.user.id]
    );
    res.json({ status: 'ok' });
  } else {
    res.status(400).json({ error: 'Insufficient funds' });
  }
});

Attack vector: two concurrent withdrawal requests. Both read balance 100, both pass the check, both deduct. Result: balance -100.

// Fixed: atomic operation in a transaction
app.post('/api/withdraw', authMiddleware, async (req, res) => {
  const { amount } = req.body;

  const client = await db.connect();
  try {
    await client.query('BEGIN');
    const account = await client.query(
      'SELECT balance FROM accounts WHERE user_id = $1 FOR UPDATE',
      [req.user.id]
    );

    if (account.rows[0].balance < amount) {
      await client.query('ROLLBACK');
      return res.status(400).json({ error: 'Insufficient funds' });
    }

    await client.query(
      'UPDATE accounts SET balance = balance - $1 WHERE user_id = $2',
      [amount, req.user.id]
    );
    await client.query('COMMIT');
    res.json({ status: 'ok' });
  } catch (e) {
    await client.query('ROLLBACK');
    throw e;
  } finally {
    client.release();
  }
});

FOR UPDATE locks the row for the duration of the transaction. The second request waits for the first to finish and reads the updated balance.

A07:2021 -- Identification and Authentication Failures

13. Timing Attack in token comparison

// Vulnerable: regular string comparison
app.post('/api/webhook', (req, res) => {
  const signature = req.headers['x-webhook-signature'];
  const expected = computeHmac(req.body, WEBHOOK_SECRET);

  if (signature === expected) {
    processWebhook(req.body);
    res.json({ status: 'ok' });
  } else {
    res.status(401).json({ error: 'Invalid signature' });
  }
});

Attack vector: the === operator exits at the first mismatched byte. By measuring response time, an attacker can recover the signature byte by byte.

import crypto from 'crypto';

app.post('/api/webhook', (req, res) => {
  const signature = req.headers['x-webhook-signature'];
  const expected = computeHmac(req.body, WEBHOOK_SECRET);

  if (!signature || !crypto.timingSafeEqual(
    Buffer.from(signature, 'hex'),
    Buffer.from(expected, 'hex')
  )) {
    return res.status(401).json({ error: 'Invalid signature' });
  }
  processWebhook(req.body);
  res.json({ status: 'ok' });
});

timingSafeEqual compares all bytes in constant time, regardless of where the first mismatch occurs.

14. No rate limiting on authentication

// Vulnerable: unlimited password brute force
app.post('/api/login', async (req, res) => {
  const { email, password } = req.body;
  const user = await findUserByEmail(email);
  if (!user || !await bcrypt.compare(password, user.passwordHash)) {
    return res.status(401).json({ error: 'Invalid credentials' });
  }
  res.json({ token: generateToken(user) });
});

// Fixed: rate limiting + account lockout
import rateLimit from 'express-rate-limit';

const loginLimiter = rateLimit({
  windowMs: 15 * 60 * 1000,
  max: 10,
  keyGenerator: (req) => req.body.email || req.ip,
  message: { error: 'Too many login attempts. Try again in 15 minutes.' },
});

app.post('/api/login', loginLimiter, async (req, res) => {
  const { email, password } = req.body;
  const user = await findUserByEmail(email);

  if (user?.lockedUntil && user.lockedUntil > new Date()) {
    return res.status(423).json({ error: 'Account temporarily locked' });
  }

  if (!user || !await bcrypt.compare(password, user.passwordHash)) {
    if (user) {
      await incrementFailedAttempts(user.id);
    }
    return res.status(401).json({ error: 'Invalid credentials' });
  }

  await resetFailedAttempts(user.id);
  res.json({ token: generateToken(user) });
});

Rate limiting by email prevents brute-forcing a single account. Account lockout adds a second layer of defense. keyGenerator uses email rather than just IP, so a distributed attack from multiple IPs is also blocked.

More on resilience patterns for API protection: Circuit Breaker in Deno Edge Functions.

A08:2021 -- Software and Data Integrity Failures

15. Prototype Pollution via deep object merge

// Vulnerable: recursive merge without protection
function deepMerge(target: any, source: any): any {
  for (const key of Object.keys(source)) {
    if (typeof source[key] === 'object' && source[key] !== null) {
      target[key] = deepMerge(target[key] || {}, source[key]);
    } else {
      target[key] = source[key];
    }
  }
  return target;
}

app.put('/api/settings', authMiddleware, async (req, res) => {
  const currentSettings = await getSettings(req.user.id);
  const merged = deepMerge(currentSettings, req.body);
  await saveSettings(req.user.id, merged);
  res.json(merged);
});

Attack vector: PUT /api/settings with body {"__proto__": {"isAdmin": true}}. After the merge, every object in the application inherits isAdmin: true.

function deepMerge(target: any, source: any): any {
  for (const key of Object.keys(source)) {
    if (key === '__proto__' || key === 'constructor' || key === 'prototype') {
      continue;
    }
    if (typeof source[key] === 'object' && source[key] !== null && !Array.isArray(source[key])) {
      target[key] = deepMerge(target[key] || {}, source[key]);
    } else {
      target[key] = source[key];
    }
  }
  return target;
}

Three keys are blocked: __proto__, constructor, prototype. In production, prefer battle-tested libraries: lodash merge starting from 4.17.21 is protected, or use structuredClone for copying.

Prompts for each OWASP category

A broad scan catches obvious vulnerabilities. Deep analysis requires specialized prompts by category.

Injection (A03):

Find all places where user input reaches SQL, NoSQL,
LDAP, OS commands, or ORM raw queries without parameterization.
Consider: query params, request body, headers, cookies, file uploads.
Check ORM methods that use raw SQL.

Access Control (A01):

Review every API endpoint: is there a check that the current user
owns the requested resource? Find endpoints that verify authentication
but not authorization. Pay attention to admin endpoints, bulk
operations, export/download.

SSRF and Insecure Design (A04):

Find all places where the server makes HTTP requests to a URL from
user input. Check: is it possible to reach internal services
(metadata API, localhost, private networks)? Is there URL validation,
DNS rebinding protection, redirect restrictions?

Authentication (A07):

Review the authentication mechanism: password storage (bcrypt/argon2?),
JWT (algorithm pinned? refresh tokens present?), sessions (httpOnly?
secure? sameSite?), rate limiting on login/register/reset-password.
Find endpoints without authentication that should be protected.

Automation: CI pipeline for security audit

Manual audit provides depth. Automated audit in CI provides consistency. Combining both closes most vulnerabilities before production.

# .github/workflows/security-audit.yml
name: AI Security Audit
on:
  pull_request:
    paths:
      - 'src/**'
      - 'api/**'

jobs:
  security-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run SAST
        run: |
          npx semgrep --config=p/owasp-top-ten src/

      - name: Check dependencies
        run: npm audit --audit-level=high

      - name: Check secrets
        run: npx gitleaks detect --source=. --no-git

Semgrep covers OWASP Top 10 with minimal false positives. npm audit catches vulnerable dependencies. Gitleaks finds committed secrets. LLM audit runs separately, at pre-merge review.

Quick audit checklist

Before each release:

All user inputs are parameterized (SQL, NoSQL, shell)
Every endpoint checks resource ownership, not just authentication
File operations are confined to the base directory (path traversal)
Update operations accept a whitelist of fields (mass assignment)
JWT algorithm is pinned in code, not taken from the token
No secrets in source code or build variables
CORS origin is restricted to a whitelist, not *
Error messages do not expose internals (stack trace, SQL)
External URLs go through DNS validation (SSRF)
Financial operations run in transactions with FOR UPDATE
Token comparison uses timingSafeEqual, not ===
Rate limiting on auth endpoints + account lockout
Deep object merges are protected against prototype pollution
CI pipeline includes SAST (semgrep) and dependency audit
LLM security review on every PR touching API/auth

Each item corresponds to one of the 15 vulnerabilities above. If any item fails, it maps to a concrete vulnerability with a known attack vector.

Need help with AI-powered security audits? I help startups build AI products and automate processes — belov.works.

FAQ

Can an LLM replace a professional penetration tester?

No. LLMs excel at pattern recognition across large codebases and surface the majority of OWASP Top 10 issues in hours, but they produce false positives and miss logic-level flaws that require understanding business context. A manual security review by a specialist is still necessary for critical systems — AI compresses the preparation phase and handles the repeatable patterns, freeing the human reviewer for the nuanced findings.

Which model performs best for security auditing — GPT-4o, Claude, or Gemini?

In practice, Claude and GPT-4o produce comparable results for security audits when given a structured prompt. The model matters less than the prompt quality and the completeness of the code submitted for review. What consistently degrades results: sending partial snippets instead of full files, omitting framework and ORM context, and skipping the verification pass against false positives.

How do I handle secrets that were already committed to the repository?

Finding and removing the secret from code is not enough — it remains in Git history. Rotate the exposed credential immediately, then use git filter-repo (not the deprecated git filter-branch) to purge it from all commits. After that, set up pre-commit hooks with Gitleaks or detect-secrets to prevent future commits. Treat any secret that touched a repository as compromised, regardless of how briefly.

Multi-Agent Architecture: When One AI Isn't Enough

Roman Belov — Thu, 21 May 2026 09:11:11 +0000

A single LLM call handles a single task. But real product workflows rarely fit in one call. Analyzing a document, generating a plan, validating quality, formatting the output — those are four separate tasks with different model requirements, context needs, and prompts. Cramming everything into one prompt degrades quality at every step.

Multi-agent architecture breaks a complex workflow into specialized agents. Each agent does one thing well: one analyzes, one generates, one validates. Agents communicate through a protocol, pass results to each other, and can run in parallel.

This article covers how to design a multi-agent system for a startup: which orchestration patterns to choose, how to divide responsibility between agents, and when a single agent is actually better than three.

One Agent vs. Many: Where to Draw the Line

A single agent works fine when the task is linear. Q&A, summarization, classification — one prompt, one call, one result. Problems start when the task contains conflicting requirements.

Example: an AI assistant for a travel app. The user types "plan a trip to Japan for 10 days." A single agent would need to:

Understand the intent and extract parameters (NLU)
Find attractions and assess logistics (search + reasoning)
Build an itinerary accounting for distances and time (optimization)
Generate descriptions for each day (text generation)
Validate the result for errors (validation)

Each step requires a different prompt, different temperature, sometimes a different model. NLU works best with temperature: 0. Text generation — with temperature: 0.7. Route optimization might need a reasoning model like o3. Day descriptions work great with the cheap and fast Gemini Flash.

Stuffing everything into one prompt gives the model contradictory instructions. The result: a bloated 3,000+ token prompt, unpredictable behavior, and no way to debug a specific step.

Rule of thumb: if a task has more than two steps with different model or prompt requirements — split into agents.

Three Orchestration Patterns for Multi-Agent Systems

Sequential Pipeline

The simplest pattern. Agents are chained together — the output of one becomes the input of the next.

Input → [Agent A] → [Agent B] → [Agent C] → Output
         Extract      Process      Format

The implementation is minimal:

async def pipeline(user_input: str) -> str:
    # Step 1: Extract parameters
    params = await agent_extract.run(user_input)

    # Step 2: Generate plan
    plan = await agent_plan.run(params)

    # Step 3: Validate and format
    result = await agent_format.run(plan)

    return result

Pros: easy to debug (each step can be tested independently), predictable execution order, easy to add intermediate logging.

Cons: total time = sum of all steps. If Agent B fails, the whole pipeline fails. No parallelism.

When to use: ETL-like tasks, document processing, any workflow with a fixed step order.

Parallel Fan-Out / Fan-In

Multiple agents work simultaneously on different aspects of the same task. Results are collected and merged.

              ┌─ [Agent A: Safety] ──┐
Input ────────┤  [Agent B: Quality] ──├───► Merge → Output
              └─ [Agent C: Style]  ──┘

async def parallel_review(code: str) -> dict:
    # Run three agents in parallel
    safety, quality, style = await asyncio.gather(
        agent_safety.run(code),
        agent_quality.run(code),
        agent_style.run(code),
    )

    # Merge results
    return merge_reviews(safety, quality, style)

This pattern is used in Claude Concilium for parallel consultations with multiple LLMs. Three models independently analyze the same code; results are compared to find consensus.

Pros: total time = time of the slowest agent (not the sum). Independence: one agent failing doesn't block the others.

Cons: requires result-merging logic. Agents don't see each other's outputs.

When to use: code review, multi-aspect analysis, A/B prompt testing, validation from multiple perspectives.

Router: Dynamic Agent Routing

A router agent analyzes the request and dispatches it to a specialized agent. Think of it as an API gateway in microservices.

              ┌─ [Agent: Travel Planning]
Input → [Router] ─┤─ [Agent: Booking]
              └─ [Agent: General Chat]

ROUTES = {
    "travel_plan": agent_travel,
    "booking": agent_booking,
    "general": agent_chat,
}

async def router(user_input: str) -> str:
    # Classify intent (cheap model)
    intent = await classifier.run(
        f"Classify intent: {user_input}",
        model="gemini-2.0-flash",
        temperature=0,
    )

    # Route to specialized agent
    agent = ROUTES.get(intent, agent_chat)
    return await agent.run(user_input)

The router uses a cheap model for classification. Specialized agents can use expensive models with larger context windows. Savings: 80%+ of requests are handled by cheaper agents.

For more on model routing by task type, see the article on multi-provider LLM architecture.

When to use: chatbots with multiple domains, heterogeneous document processing, systems with clearly defined specialization.

Agent Specialization: Who Does What

A typical breakdown for a product startup:

Agent	Task	Model	Temperature	Prompt size
Classifier	Intent detection / routing	Gemini Flash	0	~200 tokens
Extractor	Data extraction from text	DeepSeek Chat	0	~500 tokens
Planner	Plan and itinerary generation	Claude Sonnet / o3	0.3	~1500 tokens
Writer	User-facing text generation	Gemini Flash	0.7	~800 tokens
Validator	JSON schema and fact checking	GPT-4o	0	~400 tokens
Judge	Output quality scoring	Claude Sonnet	0.2	~600 tokens

The Judge agent deserves special attention. It doesn't generate content — it evaluates other agents' output against criteria: completeness, relevance, safety. More on this pattern in the LLM-as-Judge article.

Each agent is a combination of three things: a system prompt, a model choice, and a set of tools/functions. Example configuration:

from dataclasses import dataclass
from typing import Optional

@dataclass
class AgentConfig:
    name: str
    system_prompt: str
    model: str
    temperature: float = 0.0
    max_tokens: int = 4096
    tools: list[str] | None = None
    timeout_seconds: int = 30

AGENTS = {
    "classifier": AgentConfig(
        name="classifier",
        system_prompt="You are an intent classifier. Return one of: travel_plan, booking, general.",
        model="google/gemini-2.0-flash",
        temperature=0,
        max_tokens=50,
        timeout_seconds=5,
    ),
    "planner": AgentConfig(
        name="planner",
        system_prompt="You are a travel planner. Create detailed day-by-day itineraries.",
        model="anthropic/claude-sonnet-4",
        temperature=0.3,
        max_tokens=8192,
        tools=["search_places", "get_distances", "check_opening_hours"],
        timeout_seconds=60,
    ),
    "validator": AgentConfig(
        name="validator",
        system_prompt="Validate the travel plan. Check: dates are consistent, distances are realistic, no duplicate places.",
        model="openai/gpt-4o",
        temperature=0,
        max_tokens=2048,
        timeout_seconds=15,
    ),
}

Communication Protocol: How Agents Exchange Data

Agents need to exchange data in a predictable format. Two approaches.

Approach 1: Structured Output (JSON)

Each agent returns a JSON object matching a predefined schema. The next agent receives that JSON as part of its prompt.

from pydantic import BaseModel

class TravelParams(BaseModel):
    destination: str
    duration_days: int
    interests: list[str]
    budget_level: str  # "budget" | "mid" | "luxury"

class DayPlan(BaseModel):
    day: int
    activities: list[str]
    estimated_cost_usd: float

class TravelPlan(BaseModel):
    params: TravelParams
    days: list[DayPlan]
    total_cost_usd: float

# Agent A → structured output
params: TravelParams = await agent_extract.run(
    user_input,
    response_format=TravelParams,
)

# Agent B accepts structured input
plan: TravelPlan = await agent_plan.run(
    f"Create a travel plan based on: {params.model_dump_json()}",
    response_format=TravelPlan,
)

Pros: type safety, validation at every step, easy to log and debug.

Cons: structured output overhead (not all models support it equally well), schema rigidity.

Approach 2: MCP (Model Context Protocol)

Agents interact through a standardized protocol. Each agent is an MCP server that exposes tools. The orchestrator calls the tools of the appropriate agent.

Orchestrator (Claude Code)
    │
    ├── MCP: agent-planner
    │     └── tool: create_plan(destination, days, interests)
    │
    ├── MCP: agent-validator
    │     └── tool: validate_plan(plan_json)
    │
    └── MCP: agent-writer
          └── tool: write_description(day_plan)

MCP provides a standard interface for connecting agents. An agent server can be written in any language and run locally or remotely. More on building production MCP servers in the custom MCP servers article.

In practice, MCP is better suited for dev-time agents (IDE assistants), while structured output is better for runtime agents (handling user requests in production).

The Orchestrator: Central Component of the System

The orchestrator decides: which agent to call, in what order, and what to do with errors. A minimal implementation:

import asyncio
import logging
from typing import Any

logger = logging.getLogger(__name__)

class Orchestrator:
    def __init__(self, agents: dict[str, AgentConfig]):
        self.agents = agents

    async def run_agent(self, name: str, input_data: str) -> Any:
        """Run a single agent with timeout and retry."""
        config = self.agents[name]

        for attempt in range(3):
            try:
                result = await asyncio.wait_for(
                    call_llm(
                        model=config.model,
                        system=config.system_prompt,
                        user=input_data,
                        temperature=config.temperature,
                        max_tokens=config.max_tokens,
                        tools=config.tools,
                    ),
                    timeout=config.timeout_seconds,
                )
                logger.info(f"Agent {name} completed", extra={
                    "agent": name,
                    "attempt": attempt + 1,
                    "model": config.model,
                })
                return result

            except asyncio.TimeoutError:
                logger.warning(f"Agent {name} timeout, attempt {attempt + 1}")
                continue
            except Exception as e:
                logger.error(f"Agent {name} failed: {e}")
                if attempt == 2:
                    raise

        raise RuntimeError(f"Agent {name} failed after 3 attempts")

    async def run_pipeline(self, user_input: str) -> str:
        """Sequential pipeline: classify → plan → validate → format."""

        # Step 1: Classification
        intent = await self.run_agent("classifier", user_input)

        if intent == "general":
            return await self.run_agent("writer", user_input)

        # Step 2: Extract parameters
        params = await self.run_agent("extractor", user_input)

        # Step 3: Generate plan
        plan = await self.run_agent("planner", params)

        # Step 4: Parallel validation
        validation, quality_score = await asyncio.gather(
            self.run_agent("validator", plan),
            self.run_agent("judge", plan),
        )

        # Step 5: If validation failed — retry with feedback
        if not validation.get("is_valid"):
            plan = await self.run_agent("planner",
                f"Previous plan had issues: {validation['issues']}. "
                f"Original request: {params}. Fix the plan."
            )

        return plan

Key design decisions in the orchestrator:

Retry with context. On retry, the agent receives information about what went wrong. Not a blind retry, but a retry with validator feedback.
Parallelize where possible. Validator and Judge run concurrently — saving 30–50% of total time.
Graceful degradation. If Judge is unavailable, the pipeline continues without quality scoring. A response without a score beats no response at all.

Observability: Seeing What's Happening Inside

A multi-agent system without observability is a black box. Three essential components.

1. Tracing

Every user request is a trace. Every agent call is a span within that trace. Langfuse is a great fit here:

from langfuse import Langfuse

langfuse = Langfuse()

async def run_with_tracing(user_input: str) -> str:
    trace = langfuse.trace(
        name="travel-planning",
        input=user_input,
    )

    # Classifier span
    classifier_span = trace.span(name="classifier")
    intent = await run_agent("classifier", user_input)
    classifier_span.end(output=intent)

    # Planner span
    planner_span = trace.span(name="planner")
    plan = await run_agent("planner", user_input)
    planner_span.end(output=plan)

    # ... other agents

    trace.update(output=plan)
    return plan

In Langfuse you can see: which agents were called, how long each took, how many tokens each consumed, and what each returned. More on setup in the LLM Observability with Langfuse article.

2. Metrics

A minimal set of metrics for a multi-agent system:

Metric	What it shows	Alert threshold
`agent.latency_p95`	Speed per agent	> 10s for classifier
`agent.error_rate`	Error rate per agent	> 5%
`agent.token_cost`	Cost per agent	> $X/day budget
`pipeline.success_rate`	Successful pipeline runs	< 95%
`pipeline.retry_rate`	How often retries are needed	> 20%
`judge.quality_score`	Average quality score	< 0.7

3. Decision Logging

Every orchestrator decision gets logged: why this agent was chosen, why a retry happened, why the pipeline took an alternate path. Without this, debugging is pure guesswork.

logger.info("routing_decision", extra={
    "trace_id": trace.id,
    "intent": intent,
    "selected_agent": "planner",
    "reason": "intent=travel_plan, confidence=0.95",
    "fallback_agent": "writer",
})

Cost: Multi-Agent System vs. One Big Prompt

More agents = more LLM calls = more cost. But a well-designed architecture can actually be cheaper than a single large prompt.

Cost comparison for "plan a trip":

Approach	Calls	Models	Tokens (input)	Tokens (output)	Estimated cost
Single agent	1	Claude Sonnet	~3000	~2000	~$0.025
Pipeline (5 agents)	5	Mix	~2500 total	~1800 total	~$0.008

The pipeline is cheaper because:

Classifier uses Flash (~$0.0001 per call)
Extractor uses DeepSeek ($0.00014/1M input tokens)
Only the Planner uses an expensive model, and its prompt is shorter (parameters are already extracted)

Result: 60–70% savings on LLM costs with proper task splitting and model routing.

Common Mistakes in Multi-Agent System Design

Over-engineering. Three agents for a task that a single prompt with structured output could handle. If the task is linear and doesn't require different models — one agent is simpler and more reliable.

No fallback. The planner agent goes down — the entire pipeline is dead. Every critical agent needs a fallback: an alternative model, a simplified prompt, or a cache of previous results.

Chains that are too long. Each agent adds latency and a potential failure point. An 8-agent pipeline with P95 latency of 3 seconds per agent means 24 seconds per response. Users won't wait.

Ignoring the context window. Passing the full output of one agent into the next agent's prompt. A 5,000-token planner output plus a 500-token validator system prompt — unnecessary cost. Pass only what the current agent actually needs.

No circuit breaker. One agent returns malformed JSON. The next agent gets broken input, fails, the orchestrator retries — three attempts, three failures, wasted tokens. Circuit breaker: if an agent returns invalid output, don't pass it downstream — surface the error immediately.

Where to Start

Identify the bottleneck. Take your current single-agent pipeline. Find the step that breaks most often or produces poor quality. Extract it into a separate agent.
Start with two agents. Classifier + Worker. The Classifier identifies the task type on a cheap model. The Worker handles it on the right model. This already gives you routing and cost savings.
Add structured output. Pydantic schemas for inputs and outputs of each agent. Validation at every step. Without this, debugging a multi-agent system is nearly impossible.
Connect tracing. Langfuse, LangSmith, or equivalent. See every call, every prompt, every result. Without observability, a multi-agent system is a black box.
Add a Judge agent. Automatic quality scoring at the end of the pipeline. If the score falls below threshold — retry with feedback. This closes the quality loop.
Optimize from metrics. Watch latency, cost, and quality score per agent. Swap models, tweak prompts, and reshape architecture based on real data.

Need help designing multi-agent AI systems? I help startups build AI products and automate processes — belov.works.

AI Code Review Checklist: Correctness, Security, Performance, Readability

Roman Belov — Fri, 15 May 2026 06:31:02 +0000

Most defects missed in code review are logical errors and edge cases — not formatting issues, not naming conventions. Google's "Modern Code Review: A Case Study at Google" (Sadowski et al., 2018) examined review practices at scale, and since then the volume of AI-generated code has grown while reviewers still spend the same 15–30 minutes per PR.

Below is how to structure AI code review across four categories: correctness, security, performance, readability. Priority is in exactly that order. For each category: a checklist, an LLM prompt, and real finding examples. At the end — CI pipeline integration.

Why Category Order Matters

A typical code review starts at the surface. The reviewer notices a poorly named variable, suggests a refactor, discusses style. That consumes 80% of the time. Logical errors and security issues go unnoticed.

A fixed order solves this problem:

Correctness — does the code do what it claims? Are edge cases handled?
Security — any injections, data leaks, or authorization issues?
Performance — any O(n²) where O(n) would do? Any unnecessary allocations?
Readability — will the code make sense in six months? Does it match project conventions?

Each subsequent category is less critical. A bug in authorization logic matters more than a poor variable name. The LLM handles each category in a separate pass, keeping security comments separate from style notes.

Stage 1: Correctness — Logic and Edge Cases

The most expensive category of errors. A business logic bug that slips through review and reaches production costs 10–30x more than a bug caught before merge.

Checklist

[ ] Boundary values: null, empty string, empty array, 0, negative numbers
[ ] Off-by-one: < vs <=, array indices, pagination
[ ] Concurrent access: race conditions on parallel requests
[ ] Error handling: what happens on timeout, 500 error, connection drop
[ ] Data types: integer overflow, float precision loss, implicit coercions
[ ] Contracts: is input validated? Does output match the interface?
[ ] Idempotency: is it safe to call twice?

LLM Prompt

Conduct a code review for correctness. Check ONLY logical errors.

Context:
- Language: {language}
- Function purpose: {purpose}
- Called from: {caller context}

Check each item:
1. Boundary values (null, empty collections, 0, negative)
2. Off-by-one errors in loops and conditions
3. Race conditions on concurrent access
4. Error handling (timeouts, network failures, invalid response)
5. Data types (overflow, precision loss, implicit coercions)
6. Idempotency of repeated calls

For each finding, provide:
- Line of code
- What happens with a specific input
- How to fix it (one sentence)

Do not comment on style, naming, or formatting.

Example Finding

Discount calculation function:

def calculate_discount(price: float, quantity: int) -> float:
    if quantity > 10:
        return price * 0.9
    if quantity > 50:
        return price * 0.8
    return price

The LLM catches: the second condition quantity > 50 will never be reached because quantity > 50 also satisfies quantity > 10, triggering the first branch. The 20% discount is inaccessible. The conditions need to be reversed.

These kinds of bugs pass tests if the test case for quantity=100 only checks for the presence of a discount, not its magnitude.

Stage 2: Security — Vulnerabilities and Data Leaks

Security comes second because a vulnerability in working code is more dangerous than a bug: bugs surface as errors, vulnerabilities are exploited silently.

Checklist

[ ] Injections: SQL, NoSQL, command injection, XSS
[ ] Authentication: is identity verified on every endpoint?
[ ] Authorization: can a user access only their own data?
[ ] Secrets: any hardcoded keys, tokens, or passwords?
[ ] Logging: can PII or secrets end up in logs?
[ ] Deserialization: is input validated before parsing?
[ ] Dependencies: any known CVEs in imported packages?

LLM Prompt

Conduct a security review of the code. Check ONLY vulnerabilities.

Context:
- Stack: {stack}
- This code is accessible as: {public API / internal / edge function}
- Authentication: {auth method}

Check against OWASP Top 10:
1. Injection (SQL, command, XSS) — does user input reach
   queries or HTML without sanitization?
2. Broken Access Control — can someone fetch another user's data
   by swapping user_id / tenant_id?
3. Cryptographic Failures — secrets in code, weak hashing,
   HTTP instead of HTTPS?
4. Security Misconfiguration — CORS *, debug=True, verbose errors?
5. SSRF — can the server be made to reach an internal resource?

For each finding:
- CWE number (if applicable)
- Severity: Critical / High / Medium / Low
- Exploit scenario in one sentence
- Fix in one sentence

Do not comment on performance, style, or logic.

Example Finding

Supabase Edge Function endpoint:

const { data } = await supabase
  .from('documents')
  .select('*')
  .eq('id', req.params.id);

return new Response(JSON.stringify(data));

The LLM catches: no user_id check. Any authenticated user can retrieve a document by ID, even if it belongs to someone else. IDOR (Insecure Direct Object Reference, CWE-639). Severity: High.

Fix: add .eq('user_id', user.id) or an RLS policy at the database level.

For more on how LLMs can automatically evaluate code quality, see LLM-as-Judge: Automated Quality Gate.

Stage 3: Performance — Resources and Scalability

Performance is checked after correctness and security. Fast but incorrect or insecure code is useless.

Checklist

[ ] Algorithm complexity: any nested loops over the same data?
[ ] N+1 queries: a loop with a database call inside
[ ] Unnecessary allocations: object creation in a hot loop
[ ] Missing caching: the same data requested repeatedly
[ ] Response size: SELECT * instead of needed fields
[ ] Indexes: filtering on unindexed fields
[ ] Memory leaks: subscriptions without cleanup, unclosed connections

LLM Prompt

Conduct a performance review. Check ONLY performance issues.

Context:
- Expected load: {requests per second / dataset size}
- Environment: {serverless / server / edge}
- DB: {database}

Check:
1. Algorithmic complexity — O(n²) or worse?
2. N+1 queries — a loop with a query inside?
3. Unnecessary allocations — objects created in the hot path?
4. SELECT * — returning unneeded fields?
5. Missing indexes — WHERE/ORDER BY on an unindexed field?
6. Leaks — unclosed connections, subscriptions without cleanup?

For each finding:
- Current complexity / cost
- At what data volume this becomes a problem
- Fix in one sentence

Do not comment on correctness, security, or style.

Example Finding

Loading user activity:

const users = await getActiveUsers(); // 500 users

const activity = [];
for (const user of users) {
  const logs = await db.query(
    `SELECT * FROM activity_logs WHERE user_id = $1`,
    [user.id]
  );
  activity.push({ user, logs });
}

The LLM catches: classic N+1. With 500 active users, this executes 501 database queries. On serverless, each query has 2–5ms latency — that's 1–2.5 seconds just for queries. Solution: one query with WHERE user_id = ANY($1) and grouping on the application side.

For edge functions, this is critical. The Circuit Breaker in Deno Edge Functions article describes how timeouts from slow queries cause cascading failures.

Stage 4: Readability — Maintainability and Conventions

The last category. Readability comments should not block a merge if the first three stages pass.

Checklist

[ ] Naming: do variable and function names reflect their purpose?
[ ] Function size: more than 30–40 lines — candidate for splitting
[ ] Comments: explain "why", not "what"
[ ] Dead code: unused variables, unreachable branches
[ ] Duplication: the same logic in multiple places
[ ] Conventions: matches the project's style (not generic best practices)
[ ] Typing: concrete types instead of any / object

LLM Prompt

Conduct a readability review. Check ONLY code maintainability.

Context:
- Project style: {link to conventions or examples}
- Patterns: {DI framework, error handling pattern, etc.}

Check:
1. Naming — are variables and functions understandable without context?
2. Size — functions longer than 40 lines?
3. Dead code — unused imports, variables, branches?
4. Duplication — is logic repeated? Is there an existing helper?
5. Typing — any/unknown/object instead of concrete types?
6. Conventions — does the code match the style of the rest of the project?

For each finding:
- Severity: nit / suggestion / convention-violation
- One sentence on what to improve

Do not comment on logic, security, or performance.

Example Finding

const d = await fetchData(id);
const r = processResult(d);
if (r.s === 'ok') {
  await save(r.d);
}

The LLM catches: single-letter variables d, r, and properties s and d without context. Six months later, it's impossible to understand that r.s is a status and r.d is processed data. Severity: suggestion.

Combining Into a Single Review Pipeline

Four passes produce a lot of comments. A structure is needed for the final report.

Output Format

## Code Review Summary

### Correctness (blocking)
- [C1] line 42: quantity > 50 unreachable — reverse condition order
- [C2] line 87: no null response handling from API

### Security (blocking)
- [S1] line 15: IDOR — missing user_id check (CWE-639, High)

### Performance (warnings)
- [P1] line 23-28: N+1 queries, replace with batch query

### Readability (recommendations)
- [R1] line 5: single-letter variables — rename
- [R2] line 33: dead code — remove unused import

Each finding has a unique ID for PR discussion. Correctness and Security block the merge. Performance and Readability are at the author's discretion.

Multi-Agent Review

A single LLM misses errors. Two LLMs working independently miss fewer. In practice, a multi-agent approach finds significantly more defects than a single pass.

Practical implementation: the first agent (Claude) runs all four stages. The second agent (GPT or Gemini) checks only Correctness and Security. Findings are compared. If both agents flagged the same issue, confidence is high. If only one did — manual review is needed.

Claude Concilium implements this approach via MCP: parallel requests to multiple LLMs with result merging.

CI/CD Pipeline Integration

Automated review on every PR. Basic architecture:

# .github/workflows/ai-review.yml
name: AI Code Review
on:
  pull_request:
    types: [opened, synchronize]

jobs:
  ai-review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Get changed files
        id: diff
        run: |
          echo "files=$(git diff --name-only origin/main...HEAD | grep -E '\.(ts|py|go|rs)$' | tr '\n' ' ')" >> $GITHUB_OUTPUT

      - name: Run AI Review
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          for file in ${{ steps.diff.outputs.files }}; do
            diff=$(git diff origin/main...HEAD -- "$file")
            # Prompt with 4 review stages
            claude --print "Review this diff: $diff" >> review_output.md
          done

      - name: Post review comments
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const review = fs.readFileSync('review_output.md', 'utf8');
            await github.rest.pulls.createReview({
              owner: context.repo.owner,
              repo: context.repo.repo,
              pull_number: context.issue.number,
              body: review,
              event: 'COMMENT'
            });

Key Setup Decisions

What to review. Only changed files. Full repo scans on every PR are too costly. Filter by extension: exclude .md, .json, and config files.

How to manage cost. One diff per API call. With 10 files in a PR, that's 10 calls at ~2000 tokens each. With claude-sonnet-4 — roughly $0.10–0.15 per average PR. One hour of a human reviewer costs $50–100.

Whether to block merges. At the start — no. AI review as an informational comment. After calibration (2–4 weeks), you can enable blocking for Correctness and Security at severity >= High.

False positives. They will happen. Typical AI code review accuracy is 70–80%. Of 10 comments, 2–3 are irrelevant. This is acceptable if the review saves time on the remaining 7–8. To reduce false positives, add project context to the prompt: conventions, common patterns, architectural decisions.

AI Review Effectiveness Metrics

Metric	How to measure	Target
True Positive Rate	Accepted comments / All comments	> 70%
Blocked Bugs	Bugs caught by AI before merge	Grows over time
Time to Review	Time from PR to first comment	< 5 minutes
Cost per PR	API call cost per PR	< $0.10
Human Review Time	Manual review time after AI	Drops by 30–50%

Track data from day one. Within a month, you'll know which categories yield the most findings and which prompts need refinement.

Getting Started

Week 1. Take the Correctness prompt. Manually run the last 5 PRs from your team through Claude or GPT. Record how many findings matched what the human reviewer caught (or missed).

Week 2. Add the Security prompt. Compare findings with SAST tools (Semgrep, CodeQL). LLMs often catch logical vulnerabilities that SAST misses, but miss pattern-based ones that SAST finds reliably. They complement each other.

Week 3. Set up the CI pipeline. Start with one repository, one language. PR comment without blocking.

Week 4. Calculate metrics. True Positive Rate below 50%? Refine prompts, add project context. Above 70%? Enable blocking for Critical/High.

Four categories in a fixed order. A separate pass for each. A prompt that forbids commenting on other categories. That's enough to catch most defects before merge.

Need help setting up AI code review pipelines? I help startups build AI products and automate processes — belov.works.

TDD with AI: Claude Writes Tests First, Then the Implementation

Roman Belov — Tue, 12 May 2026 09:53:23 +0000

Most teams know TDD works. Few actually practice it. The friction is real: writing a test before the implementation means thinking through the API, edge cases, and module contract before a single line of logic exists. That's cognitively expensive.

An AI assistant removes this barrier. Claude Code turns test-first from a discipline exercise into a natural development workflow. Tests become the specification, and the implementation is generated against them.

Why TDD Stalls Without AI

The classic TDD cycle: Red → Green → Refactor. Write a failing test, make it pass with minimal code, then refactor. In practice, the cycle breaks at the first step.

A developer sits down to write a test for a new module. But writing the test requires defining the interface, which requires understanding the architecture, which requires at least mentally sketching the implementation. The circle closes on itself.

Three concrete problems:

Designing an API from scratch. A test requires calling a function that doesn't exist yet. What arguments does it take? What does it return? What errors does it throw? Without a prototype implementation, these questions hang in the air.

Edge cases in the dark. Good tests cover boundary conditions. But boundary conditions only become visible once you understand the implementation. Empty arrays, null, timeouts, concurrent access — these are all much easier to spot when looking at code rather than a blank test file.

Motivation. A test that fails with "function not found" gives no sense of progress. The developer switches to the implementation "just to quickly verify the idea" — and never returns to test-first.

AI removes all three barriers. Claude generates a test specification from a description of the task, suggests edge cases based on data types and patterns, and creates an implementation strictly against the test contract.

Workflow: Test-First with Claude Code

The workflow has four steps. Each step has a concrete prompt and expected outcome.

Step 1. Specification via Tests

Claude Code prompt:

Write tests for the parseSchedule module.
Input: a free-form schedule string ("every Monday at 10:00", "daily at 9am").
Output: an object { dayOfWeek?: number, hour: number, minute: number, recurring: boolean }.
Cover: valid strings in EN/RU, invalid input, empty string, null.
Use vitest. Do not write the implementation.

Claude generates the test file:

import { describe, it, expect } from 'vitest';
import { parseSchedule } from './parseSchedule';

describe('parseSchedule', () => {
  describe('english input', () => {
    it('parses "every Monday at 10:00"', () => {
      const result = parseSchedule('every Monday at 10:00');
      expect(result).toEqual({
        dayOfWeek: 1,
        hour: 10,
        minute: 0,
        recurring: true,
      });
    });

    it('parses "daily at 9am"', () => {
      const result = parseSchedule('daily at 9am');
      expect(result).toEqual({
        hour: 9,
        minute: 0,
        recurring: true,
      });
    });

    it('parses "Friday 14:30"', () => {
      const result = parseSchedule('Friday 14:30');
      expect(result).toEqual({
        dayOfWeek: 5,
        hour: 14,
        minute: 30,
        recurring: false,
      });
    });
  });

  describe('russian input', () => {
    it('parses "каждый понедельник в 10:00"', () => {
      const result = parseSchedule('каждый понедельник в 10:00');
      expect(result).toEqual({
        dayOfWeek: 1,
        hour: 10,
        minute: 0,
        recurring: true,
      });
    });

    it('parses "ежедневно в 9:00"', () => {
      const result = parseSchedule('ежедневно в 9:00');
      expect(result).toEqual({
        hour: 9,
        minute: 0,
        recurring: true,
      });
    });
  });

  describe('edge cases', () => {
    it('throws on empty string', () => {
      expect(() => parseSchedule('')).toThrow('Invalid schedule format');
    });

    it('throws on null', () => {
      expect(() => parseSchedule(null as any)).toThrow();
    });

    it('throws on unrecognized format', () => {
      expect(() => parseSchedule('maybe sometime')).toThrow(
        'Invalid schedule format'
      );
    });

    it('handles extra whitespace', () => {
      const result = parseSchedule('  daily   at   9am  ');
      expect(result).toEqual({
        hour: 9,
        minute: 0,
        recurring: true,
      });
    });
  });
});

All tests fail — parseSchedule doesn't exist. That's expected: the Red phase is complete.

What just happened: the tests locked in the contract. Input type, output structure, behavior on edge cases, supported formats. This file is already a full specification for the module.

Step 2. Minimal Implementation

Prompt:

Write an implementation of parseSchedule that passes all tests in parseSchedule.test.ts.
Minimal code. No extra features. Only what the tests require.

Claude reads the test file, extracts the expectations, and generates the implementation:

const EN_DAYS: Record<string, number> = {
  sunday: 0, monday: 1, tuesday: 2, wednesday: 3,
  thursday: 4, friday: 5, saturday: 6,
};

const RU_DAYS: Record<string, number> = {
  воскресенье: 0, понедельник: 1, вторник: 2, среда: 3,
  четверг: 4, пятница: 5, суббота: 6,
};

interface Schedule {
  dayOfWeek?: number;
  hour: number;
  minute: number;
  recurring: boolean;
}

export function parseSchedule(input: string): Schedule {
  if (!input || typeof input !== 'string') {
    throw new Error('Invalid schedule format');
  }

  const normalized = input.trim().replace(/\s+/g, ' ').toLowerCase();
  if (!normalized) throw new Error('Invalid schedule format');

  const recurring =
    normalized.startsWith('every') ||
    normalized.startsWith('каждый') ||
    normalized.startsWith('каждая') ||
    normalized.startsWith('каждое') ||
    normalized.includes('daily') ||
    normalized.includes('ежедневно');

  const timeMatch = normalized.match(/(\d{1,2}):(\d{2})/);
  const amPmMatch = normalized.match(/(\d{1,2})(am|pm)/);

  let hour: number;
  let minute: number;

  if (timeMatch) {
    hour = parseInt(timeMatch[1], 10);
    minute = parseInt(timeMatch[2], 10);
  } else if (amPmMatch) {
    hour = parseInt(amPmMatch[1], 10);
    minute = 0;
    if (amPmMatch[2] === 'pm' && hour !== 12) hour += 12;
    if (amPmMatch[2] === 'am' && hour === 12) hour = 0;
  } else {
    throw new Error('Invalid schedule format');
  }

  const allDays = { ...EN_DAYS, ...RU_DAYS };
  let dayOfWeek: number | undefined;

  for (const [name, num] of Object.entries(allDays)) {
    if (normalized.includes(name)) {
      dayOfWeek = num;
      break;
    }
  }

  const result: Schedule = { hour, minute, recurring };
  if (dayOfWeek !== undefined) result.dayOfWeek = dayOfWeek;

  return result;
}

Run the tests: all green. Green phase complete.

Step 3. Refactoring Under Test Coverage

Prompt:

Refactor parseSchedule. Extract time and day parsing into separate functions.
Tests must not change. All must remain green.

Claude performs the refactoring and runs tests after each change. The tests are the safety net: if the refactoring breaks something, it's immediately visible.

Step 4. Extending Through New Tests

Need to add support for the "every 2 hours" format? Write the test first:

it('parses interval "every 2 hours"', () => {
  const result = parseSchedule('every 2 hours');
  expect(result).toEqual({
    intervalHours: 2,
    recurring: true,
  });
});

Test fails. Claude prompt: "Make this test pass without breaking the existing ones." The cycle repeats.

Claude Code Prompt Templates for Test-First

Generating Tests for a New Module

Write tests for [module/function].
Context: [what the module does, what data it processes].
Input: [types, examples].
Expected output: [structure, types].
Edge cases: [specific cases or "suggest them yourself"].
Framework: [vitest/jest/pytest].
DO NOT write the implementation.

Generating Implementation from Tests

Write an implementation of [module] that passes all tests in [file.test.ts].
Minimal code. No unnecessary abstractions. Only what the tests require.

Expanding Coverage

Analyze [module] and its tests.
What scenarios are not covered? Add tests for missing edge cases.
Do not modify existing tests.

Refactoring

Refactor [module]. Tests must not change.
Goal: [readability / performance / extensibility].
Run tests after changes.

Comparison: Test-First vs Code-First with AI

In practice, developers use AI in two ways: ask it to generate code first, then tests (code-first), or the other way around (test-first). The difference in outcomes is significant.

Criterion	Code-first + AI	Test-first + AI
Test quality	Tests are retrofitted to the implementation and miss bugs	Tests reflect requirements and catch real problems
Edge case coverage	AI tests what it wrote, not what was needed	Edge cases are defined from the specification
Refactoring	Risky to change: tests may break	Safe: tests are bound to the contract, not the implementation
Speed	Faster at the start	Faster in the long run
API quality	API forms organically	API is designed through tests

The key difference: with code-first, AI generates tests that check what the code does. With test-first, AI generates tests that check what the code should do. The difference becomes critical during refactoring and feature expansion.

Real Example: A Validator with LLM-as-Judge

A practical task: an AI response validation module using LLM-as-Judge. The goal is to check that an LLM's response meets quality criteria.

Tests are written first:

describe('ResponseValidator', () => {
  it('accepts response meeting all criteria', async () => {
    const result = await validator.validate({
      response: 'Paris is the capital of France.',
      criteria: {
        factual: true,
        maxLength: 100,
        language: 'en',
      },
    });
    expect(result.passed).toBe(true);
    expect(result.scores.factual).toBeGreaterThan(0.8);
  });

  it('rejects response failing factual check', async () => {
    const result = await validator.validate({
      response: 'Berlin is the capital of France.',
      criteria: { factual: true },
    });
    expect(result.passed).toBe(false);
    expect(result.failures).toContain('factual');
  });

  it('handles LLM provider timeout', async () => {
    const result = await validator.validate({
      response: 'test',
      criteria: { factual: true },
      timeout: 100,
    });
    expect(result.passed).toBe(false);
    expect(result.error).toBe('timeout');
  });
});

The tests defined: the validate() interface, the criteria structure, the result format, and behavior on timeout. This is a complete specification. Claude implements the module strictly against the contract, including error handling that's easy to forget with a code-first approach.

If your project already has multi-agent code review in place, tests go through review alongside the code. The agents verify that tests cover all declared scenarios and that the implementation doesn't go beyond the contract.

Patterns and Anti-Patterns

What Works Well

One test, one behavior. Claude generates a precise implementation when each test checks exactly one aspect. "Parses a date in ISO format" — good. "Parses a date and validates and formats" — bad.

Typed expectations. Instead of expect(result).toBeTruthy() — expect(result).toEqual({ ... }). The more precise the expectation, the more precise the implementation.

Tests as documentation. The test name describes behavior: it('returns empty array when no items match filter'). Claude uses it to understand intent.

What Doesn't Work

Tests on internal implementation. expect(cache.store).toHaveBeenCalledWith('key', 'value') — binding to the implementation. The test will break on refactoring even though behavior didn't change.

Too many mocks. If a test needs five dependencies mocked, the problem is in the architecture, not the tests. Claude will create a working mock, but the implementation will be brittle.

Generating tests and code in one prompt. "Write a function and tests for it" — that's code-first with the illusion of test-first. The tests will be retrofitted to the implementation. Always split into two steps.

TDD for Infrastructure Code

Test-first works not just for business logic. For infrastructure modules, like a circuit breaker for edge functions, tests are especially valuable.

describe('CircuitBreaker', () => {
  it('stays closed after successful calls', async () => {
    const breaker = new CircuitBreaker({ failureThreshold: 3 });
    await breaker.call(() => Promise.resolve('ok'));
    await breaker.call(() => Promise.resolve('ok'));
    expect(breaker.state).toBe('closed');
  });

  it('opens after reaching failure threshold', async () => {
    const breaker = new CircuitBreaker({ failureThreshold: 2 });
    await ignoreError(() => breaker.call(() => Promise.reject('fail')));
    await ignoreError(() => breaker.call(() => Promise.reject('fail')));
    expect(breaker.state).toBe('open');
  });

  it('rejects calls immediately when open', async () => {
    const breaker = openCircuitBreaker();
    await expect(breaker.call(() => Promise.resolve('ok')))
      .rejects.toThrow('Circuit is open');
  });

  it('transitions to half-open after cooldown', async () => {
    vi.useFakeTimers();
    const breaker = openCircuitBreaker({ cooldownMs: 5000 });
    vi.advanceTimersByTime(5000);
    expect(breaker.state).toBe('half-open');
    vi.useRealTimers();
  });
});

A state machine circuit breaker is fully described through tests: state transitions, thresholds, timers. Claude generates an implementation that correctly handles all transitions because every transition is pinned in a test.

Metrics: What to Measure

Defect escape rate. How many bugs slip past tests into production. With test-first, this number drops because edge cases are covered before any code is written.

Refactoring time. With tests bound to the contract, refactoring is safe. Without them, every change requires manual verification.

Coverage on the first pass. With a code-first approach, coverage after the first iteration is usually 40–60%. With test-first — 80–90%, because the tests already define all the main scenarios.

Number of iterations to green. With well-written tests, Claude passes all of them on the first generation. If more than two iterations are needed, the tests are too vague or contradictory.

Getting Started

There's no need to convert an entire project to TDD in one day. Starting with a single new module is enough.

1. Pick an isolated module. A utility function, a validator, a parser. A module without heavy dependencies where inputs and outputs are easy to define.

2. Write tests with a prompt. Use the template from the section above. Describe the input, expected output, and edge cases. Let Claude generate the test file.

3. Confirm the tests fail. Run the tests. All should be red. If something is green without an implementation — the test is incorrect.

4. Generate the implementation. In a separate prompt, ask Claude to write code that passes all the tests. Don't add "and also do X" — only what's in the tests.

5. Run, refactor, repeat. Green tests — refactor. New requirements — new tests. The Red → Green → Refactor cycle runs without friction when AI handles the generation.

AI doesn't replace TDD. AI makes TDD practical. The cognitive load of designing APIs through tests drops, the speed of the cycle increases, and the quality of the contract stays on par with manual design.

More articles on AI-assisted development workflows at futurecraft.pro. Need help building AI products? — belov.works.

LLM-as-Judge: Automated Quality Gate for LLM Outputs in Production

Roman Belov — Tue, 28 Apr 2026 03:46:25 +0000

LLM-as-Judge is a pattern where one language model evaluates another model's outputs against defined criteria. An automatic quality gate: every response gets checked before reaching the user, or after, for monitoring. Standard production monitoring metrics (200 OK, latency 340ms, rate limits within bounds) are useless for assessing quality — the model can hallucinate in 15% of responses while HTTP status codes tell you nothing about it.

Manual review doesn't scale. One person can handle 100 requests a day. At 10,000, nobody can. And quality degradation usually hits at scale: after a prompt update, a model swap, or a silent change on the provider side.

This article covers how LLM-as-Judge works, which metrics to evaluate, and how to plug it into a production pipeline.

How LLM-as-Judge works and its limitations

The judge model receives a prompt with instructions plus the text being evaluated, then returns a score: a number, a category, or structured JSON. The judge doesn't generate content. It classifies and scores. Models handle this more consistently than generation.

User: "Recommend cafes in downtown Moscow"
          |
          v
+--------------------+
|   LLM Generator    | -> "Here are 5 cafes: Coffemania near Patriarshiye..."
|   (GPT-4o-mini)    |
+--------------------+
          |
          v
+--------------------+
|   LLM Judge        | -> { relevance: 0.9, factuality: 0.7,
|   (Claude Sonnet)  |     toxicity: 0.0, completeness: 0.8 }
+--------------------+
          |
          v
   Score < threshold? -> Alert / Block / Log

Research by Zheng et al. (2023, "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena") showed that GPT-4 as a judge agreed with human ratings in 80%+ of cases. Two human annotators agreed with each other about 81% of the time. The gap between an LLM judge and a human is roughly the same as the gap between two humans.

LLM output quality metrics: what to evaluate

Metric choice depends on the task. Main categories below.

Metrics for RAG systems

Metric	What it checks	When you need it
Faithfulness	Response is grounded in context, no fabricated facts	Always for RAG
Answer Relevance	Response matches the question	Always
Context Relevance	Retriever returned relevant documents	Debugging retrieval

Metrics for generative tasks

Metric	What it checks	When you need it
Correctness	Factual accuracy	When a reference answer exists
Completeness	Response covers all aspects of the query	Complex queries
Toxicity	No insults, harmful content	User-facing products
Hallucination	Model doesn't fabricate facts	Always

Metrics for agent pipelines

Metric	What it checks	When you need it
Tool Use Correctness	Right tool with right arguments	Agent pipelines
Task Completion	End result solves the task	Always for agents

In practice, start with two or three metrics. For RAG: faithfulness + answer relevance. For a chatbot: relevance + toxicity. For an agent: task completion. Add more as you find specific problems.

Judge prompt structure for evaluation

Evaluation quality comes down to the prompt. A working template for faithfulness:

FAITHFULNESS_JUDGE_PROMPT = """You are an impartial judge evaluating the faithfulness
of an AI assistant's response.

Faithfulness means: every claim in the response is supported by the provided context.
Claims not found in context = unfaithful.

## Input
**User Question:** {question}
**Retrieved Context:** {context}
**AI Response:** {response}

## Task
1. Extract each factual claim from the AI Response
2. For each claim, check if it is supported by the Retrieved Context
3. A claim is SUPPORTED if the context contains evidence for it
4. A claim is UNSUPPORTED if the context does not mention it or contradicts it

## Output (JSON only)
{{
  "claims": [
    {{"claim": "...", "supported": true/false, "evidence": "..."}}
  ],
  "score": <float 0.0-1.0, ratio of supported claims to total claims>,
  "reasoning": "<one sentence summary>"
}}"""

What makes this work:

Specific criteria. "Rate the response quality" doesn't work. "Check that every fact is backed by context" works. The more specific the instruction, the more stable the scores.

Chain-of-thought. The model first extracts claims, checks each one, then assigns a score. Without intermediate steps, scores are unstable.

Structured output. JSON with a fixed schema, score from 0 to 1, reasoning in one sentence. This makes parsing and aggregation straightforward.

Implementing LLM-as-Judge: three approaches

1. Python + LLM API

Minimal implementation, no frameworks:

import json
from litellm import completion

def evaluate_faithfulness(question: str, context: str, response: str) -> dict:
    judge_response = completion(
        model="anthropic/claude-sonnet-4-20250514",
        messages=[{
            "role": "user",
            "content": FAITHFULNESS_JUDGE_PROMPT.format(
                question=question,
                context=context,
                response=response,
            )
        }],
        response_format={"type": "json_object"},
        temperature=0,
    )

    result = json.loads(judge_response.choices[0].message.content)
    return result

eval_result = evaluate_faithfulness(
    question="Какие кафе в центре Москвы?",
    context="Кофемания: Патриаршие пруды. Сёстры: Покровка 6.",
    response="Рекомендую Кофеманию на Патриарших и Пушкин на Тверском бульваре.",
)
# score: 0.5 (Кофемания confirmed, Пушкин is not)

Pros: full control, minimal dependencies. Cons: you write every metric yourself, no batch processing. If you work with multiple LLM providers, litellm lets you switch between them through a single interface — more on this in the article about multi-provider LLM architecture.

2. DeepEval

Open-source framework with built-in metrics. Works like pytest for LLM outputs.

from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    FaithfulnessMetric,
    AnswerRelevancyMetric,
    HallucinationMetric,
)

faithfulness = FaithfulnessMetric(threshold=0.7, model="gpt-4o")
relevancy = AnswerRelevancyMetric(threshold=0.7, model="gpt-4o")
hallucination = HallucinationMetric(threshold=0.5, model="gpt-4o")

test_case = LLMTestCase(
    input="Какие кафе в центре Москвы?",
    actual_output="Рекомендую Кофеманию на Патриарших...",
    retrieval_context=["Кофемания: Патриаршие пруды. Сёстры: Покровка 6."],
)

results = evaluate([test_case], [faithfulness, relevancy, hallucination])

14+ built-in metrics, pytest integration. LLM quality tests run alongside unit tests:

# test_llm_quality.py
from deepeval import assert_test

def test_travel_recommendations():
    test_case = LLMTestCase(
        input="Кафе в Москве",
        actual_output=run_my_pipeline("Кафе в Москве"),
        retrieval_context=get_retrieved_docs("Кафе в Москве"),
    )
    assert_test(test_case, [faithfulness, relevancy])

3. Langfuse Evaluations

If you already use Langfuse for tracing, evaluations plug in on top. The judge model runs against each trace and attaches a score to it. Scores can be attached to an entire trace or to individual observations. If you haven't set up an observability stack yet, start with the practical guide to LLM observability with Langfuse.

langfuse.score(
    trace_id="trace-abc-123",
    name="faithfulness",
    value=0.85,
    comment="1 of 7 claims not supported by context",
)

For production monitoring, Langfuse fits better than DeepEval: scores are tied to real traces, visible in the dashboard, with day-over-day quality degradation charts.

Integrating LLM-as-Judge into CI/CD and production pipelines

Pre-deploy: prompt regression testing

Prompt changed? Run a dataset through the judge model before deploying. Score below threshold — deploy blocked.

# .github/workflows/llm-quality.yml
name: LLM Quality Gate
on: [pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install deepeval
      - run: deepeval test run test_llm_quality.py
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

Runtime: gate before the response

For high-stakes tasks, evaluate before sending the response:

async def generate_with_quality_gate(question: str) -> str:
    response = await generate_response(question)

    eval_result = await evaluate_faithfulness(
        question=question,
        context=retrieved_context,
        response=response,
    )

    if eval_result["score"] < 0.7:
        return "Sorry, I'm not confident in the accuracy of this answer. Try rephrasing your question."

    return response

An extra LLM call per request. GPT-4o-mini as a judge costs $0.15 per million input tokens. At 10,000 requests per day with a ~500-token prompt: about $0.75/day.

Post-hoc: sample-based monitoring

The most common scenario. Evaluation runs asynchronously:

import random

traces = langfuse.fetch_traces(limit=100)
sample = random.sample(traces.data, min(100, len(traces.data)))

scores = []
for trace in sample:
    result = evaluate_faithfulness(
        question=trace.input,
        context=trace.metadata.get("context", ""),
        response=trace.output,
    )
    scores.append(result["score"])
    langfuse.score(trace_id=trace.id, name="faithfulness", value=result["score"])

avg_score = sum(scores) / len(scores)
if avg_score < 0.75:
    send_alert(f"Faithfulness degraded: {avg_score:.2f}")

Cheaper than a runtime gate, but catches trends. Average faithfulness dropped from 0.88 to 0.71 over a week — something broke: the prompt, the retriever, or a model update on the provider side.

LLM-as-Judge pitfalls and biases

Position bias

Judge models systematically prefer whichever answer appears first in pairwise comparisons. Zheng et al. (2023) measured a shift of up to 10-15%. Fix: run the evaluation twice with swapped order and average the results. Or use pointwise scoring instead of pairwise.

Verbosity bias

Longer answers get higher scores, even when a shorter answer is more accurate. In the judge prompt, explicitly state "response length does not affect the score" and include an example where a short answer receives the highest mark.

Self-enhancement bias

GPT-4 gives higher scores to GPT-4 outputs. Claude prefers Claude outputs. Fix: use a judge model from a different provider than the generator. Generate with GPT-4o, evaluate with Claude Sonnet. Or the other way around. The broader question of trusting LLM outputs is a topic of its own — more on this in TruthGuard: when AI agents lie.

Cost

Every evaluation is an LLM call. A runtime gate on 10,000 requests/day means 10,000 extra calls. Options: a cheap model as judge (GPT-4o-mini, Claude Haiku), sample-based evaluation, caching scores for similar pairs.

The judge hallucinates too

A judge model can give a high score to a response full of fabricated facts, if the hallucination sounds plausible. Partial fix: chain-of-thought + structured output. There is no complete fix. This is a fundamental limitation of the approach.

Choosing the right judge model for each scenario

Scenario	Judge model	Why
Pre-deploy tests	GPT-4o or Claude Sonnet 4	Accuracy matters more than speed
Runtime gate	GPT-4o-mini or Claude Haiku	Cheap and fast
Post-hoc monitoring	GPT-4o-mini	Bulk processing

Rule of thumb: the judge model should be at least as capable as the generator. GPT-4o-mini judging GPT-4o-mini works. GPT-4o-mini judging Claude Opus is unreliable.

temperature=0 for judge calls is mandatory.

Tools for LLM evaluation

Tool	Focus	LLM-as-Judge	Self-hosted	Price
DeepEval	Testing	14+ metrics	Yes (OSS)	Free
Ragas	RAG evaluation	Faithfulness, relevance	Yes (OSS)	Free
Langfuse	Observability + evals	Evaluator templates	Yes (OSS)	Free (self-hosted)
Phoenix (Arize)	Observability + evals	Hallucination, QA	Yes (OSS)	Free
Braintrust	Evals + logging	Custom scorers	Cloud	Free tier

For a startup: DeepEval for pre-deploy tests + Langfuse for production monitoring. Two open-source tools cover the entire cycle.

Production setup for LLM-as-Judge

+---------------------------------------------------+
|                   CI/CD Pipeline                   |
|                                                    |
|  PR with prompt change                             |
|      |                                             |
|      v                                             |
|  DeepEval: dataset x new prompt -> scores          |
|      |                                             |
|      v                                             |
|  Score < threshold? -> Block merge                 |
+---------------------------------------------------+

+---------------------------------------------------+
|                   Production                       |
|                                                    |
|  User request -> LLM -> Response -> User           |
|                   |                                |
|                   v (async)                         |
|              Langfuse trace                         |
|                   |                                |
|                   v (cron, hourly)                  |
|         Judge evaluation (sample)                  |
|                   |                                |
|                   v                                |
|         Score dashboard + alerts                   |
+---------------------------------------------------+

Where to start: step-by-step plan

Pick one metric. For RAG: faithfulness. For a chatbot: answer relevance.
Collect 20-30 examples by hand: questions, answers, ratings (good/bad). A golden dataset for calibration.
Write a judge prompt, run it against the golden dataset. Agreement with human ratings below 70%? Revise the prompt.
Add DeepEval to CI for tests on prompt changes.
Set up Langfuse evaluations for production monitoring.

From zero to a working quality gate: two to three days. Golden dataset + judge prompt: a couple of hours.

Prompt Engineering System: Managing 50+ Prompts in Production

Roman Belov — Fri, 10 Apr 2026 04:04:41 +0000

The average LLM project in production uses 20–50 prompts. Classification, summarization, data extraction, response generation, quality evaluation. Each prompt requires iteration, and each iteration can break something that was working. At 50 prompts, managing them manually becomes chaos: who changed the classifier prompt? Why did summarizer accuracy drop? Which version is in production right now?

This article covers how to build a prompt management system that scales from 5 to 500 prompts.

Why You Can't Store Prompts in Code

A prompt looks like a string. Developers store it in code, next to the call logic. This works fine when there are only a few prompts and iterations are infrequent.

Problems start at scale:

Changing a prompt requires deploying the app. The prompt is hardcoded. To fix a single word in a system prompt, you need a PR, review, merge, deploy. Iteration cycle: hours instead of minutes.

No versioning. Git stores history, but a diff on a 2,000-character prompt is unreadable. There's no fast path to roll back a prompt to a previous version without rolling back the entire app.

No link between version and metrics. Prompt changed, quality dropped. Connecting a specific prompt version to specific metrics is manual work when the prompt lives in code.

Cross-team chaos. The product manager wants to adjust the tone. The ML engineer is optimizing tokens. The developer is refactoring the template. All three are editing the same file, and the outcome is unpredictable.

Anatomy of a Prompt Engineering System

A mature prompt management system has four layers:

┌─────────────────────────────────────────────────┐
│              Prompt Engineering System          │
├────────────┬────────────┬────────────┬──────────┤
│  Registry  │  Testing   │  Deploy    │ Monitor  │
│            │            │            │          │
├────────────┼────────────┼────────────┼──────────┤
│ Storage    │ Pre-deploy │ Canary /   │ Metrics  │
│ + versions │ eval       │ A/B rollout│ + alerts │
└────────────┴────────────┴────────────┴──────────┘

Registry — a centralized prompt store with versioning, metadata, and access control.

Testing — automated quality evaluation of a prompt against test datasets before deploying to production.

Deploy — a mechanism to push a new prompt version to production without deploying the application.

Monitor — tracking quality metrics tied to specific prompt versions.

You don't need to build all four layers at once. A minimum viable system is registry + deploy. Without testing and monitoring, you're flying blind.

Registry: Centralized Prompt Storage

The registry solves the basic problem: a single source of truth for all prompts. Two approaches.

Approach 1: Langfuse Prompt Management

Langfuse provides prompt management out of the box. Each prompt is a named entity with versions, labels, and variables.

from langfuse import Langfuse

langfuse = Langfuse()

# Get the production version of a prompt
prompt = langfuse.get_prompt(
    name="ticket-classifier",
    label="production"  # or "staging", "latest"
)

# Prompt with variables
system_message = prompt.compile(
    categories="billing,technical,general,urgent",
    language="en"
)

Prompt structure in Langfuse:

Field	Purpose	Example
`name`	Unique identifier	`ticket-classifier`
`version`	Auto-increment	`14`
`label`	Environment / status	`production`, `staging`
`type`	Format	`text` or `chat`
`config`	Model parameters	`{"model": "gpt-4o-mini", "temperature": 0}`

The prompt is decoupled from code. A product manager edits the prompt in the UI, assigns the staging label, tests it, and switches to production. The application code stays the same.

Approach 2: Prompts-as-Code

For teams that prefer Git as the single source of truth:

prompts/
├── ticket-classifier/
│   ├── prompt.yaml
│   ├── config.yaml
│   └── tests/
│       ├── dataset.jsonl
│       └── eval.py
├── summarizer/
│   ├── prompt.yaml
│   ├── config.yaml
│   └── tests/
└── prompt_registry.py

# prompts/ticket-classifier/prompt.yaml
name: ticket-classifier
type: chat
model: gpt-4o-mini
temperature: 0
messages:
  - role: system
    content: |
      You are a support ticket classifier.
      Categories: {{categories}}.
      Return JSON: {"category": "...", "confidence": 0.0-1.0, "reasoning": "..."}
      Response language: {{language}}.
  - role: user
    content: "{{ticket_text}}"
variables:
  categories: "billing,technical,general,urgent"
  language: "en"

# prompt_registry.py
import yaml
from pathlib import Path

class PromptRegistry:
    def __init__(self, prompts_dir: str = "prompts"):
        self.prompts_dir = Path(prompts_dir)
        self._cache = {}

    def get(self, name: str) -> dict:
        if name not in self._cache:
            prompt_path = self.prompts_dir / name / "prompt.yaml"
            with open(prompt_path) as f:
                self._cache[name] = yaml.safe_load(f)
        return self._cache[name]

    def compile(self, name: str, **variables) -> list[dict]:
        prompt = self.get(name)
        messages = []
        for msg in prompt["messages"]:
            content = msg["content"]
            for key, value in {**prompt.get("variables", {}), **variables}.items():
                content = content.replace(f"{{{{{key}}}}}", str(value))
            messages.append({"role": msg["role"], "content": content})
        return messages

Both approaches support a hybrid variant: prompts live in Git, and CI/CD syncs them to Langfuse on every merge to main.

# ci/sync_prompts.py — called in CI pipeline
from langfuse import Langfuse
from prompt_registry import PromptRegistry

langfuse = Langfuse()
registry = PromptRegistry()

for prompt_name in ["ticket-classifier", "summarizer", "response-generator"]:
    prompt_data = registry.get(prompt_name)
    langfuse.create_prompt(
        name=prompt_name,
        prompt=prompt_data["messages"],
        config={"model": prompt_data["model"], "temperature": prompt_data["temperature"]},
        labels=["production"],
    )

Testing: Eval Before Deploying a Prompt

A prompt without tests is a gamble. Every change can silently break edge cases. Automated evaluation before deployment catches regressions before they reach users.

Datasets: The Gold Standard

Every prompt needs a test dataset. Minimum size: 20–30 examples covering the main scenarios and edge cases.

{"input": "Can't process payment, card is being declined", "expected": {"category": "billing", "confidence_min": 0.8}}
{"input": "App crashes when opening the chat", "expected": {"category": "technical", "confidence_min": 0.8}}
{"input": "I want to delete my account and all my data", "expected": {"category": "general", "confidence_min": 0.7}}
{"input": "URGENT! Server is down, customers can't log in", "expected": {"category": "urgent", "confidence_min": 0.9}}

Dataset sources:

Production logs. Real requests with labeled responses. The most valuable source.
Manual labeling. For new prompts with no production data yet.
Synthetic data. An LLM generates variations of existing examples. Useful for expanding edge case coverage.

Eval Pipeline

import json
from openai import OpenAI
from prompt_registry import PromptRegistry

client = OpenAI()
registry = PromptRegistry()

def evaluate_prompt(prompt_name: str, dataset_path: str, threshold: float = 0.85):
    """Evaluate a prompt against a dataset. Return pass/fail."""
    with open(dataset_path) as f:
        examples = [json.loads(line) for line in f]

    correct = 0
    total = len(examples)
    failures = []

    for example in examples:
        messages = registry.compile(prompt_name, ticket_text=example["input"])
        response = client.chat.completions.create(
            model=registry.get(prompt_name)["model"],
            messages=messages,
            temperature=0,
        )

        result = json.loads(response.choices[0].message.content)

        if result["category"] == example["expected"]["category"]:
            if result["confidence"] >= example["expected"]["confidence_min"]:
                correct += 1
            else:
                failures.append({
                    "input": example["input"],
                    "reason": f"low confidence: {result['confidence']}",
                })
        else:
            failures.append({
                "input": example["input"],
                "reason": f"wrong category: {result['category']}",
            })

    accuracy = correct / total
    passed = accuracy >= threshold

    return {
        "accuracy": accuracy,
        "threshold": threshold,
        "passed": passed,
        "failures": failures,
    }

For complex cases, LLM-as-Judge fits well. A judge model evaluates response quality against defined criteria: relevance, completeness, tone.

CI/CD Integration

# .github/workflows/prompt-eval.yml
name: Prompt Evaluation
on:
  pull_request:
    paths:
      - 'prompts/**'

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install dependencies
        run: pip install openai langfuse pyyaml

      - name: Run prompt evaluations
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python ci/eval_prompts.py --changed-only

      - name: Comment PR with results
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const results = JSON.parse(fs.readFileSync('eval_results.json'));
            let body = '## Prompt Eval Results\n\n';
            for (const [name, result] of Object.entries(results)) {
              const status = result.passed ? '✅' : '❌';
              body += `| ${name} | ${status} | ${result.accuracy.toFixed(2)} | ${result.threshold} |\n`;
            }
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body
            });

Every PR touching prompts automatically runs the eval pipeline and posts results as a comment.

Deploy: Shipping Prompts Without Deploying Code

Three strategies for delivering a new prompt version to production.

Instant Switch

The simplest option. Flip the production label to a new prompt version.

# In Langfuse UI: assign label "production" to prompt v14
# The app picks it up automatically on the next request

prompt = langfuse.get_prompt(
    name="ticket-classifier",
    label="production",
    cache_ttl_seconds=300,  # 5-minute cache
)

Good for non-critical prompts and quick fixes. Risk: 100% of traffic immediately hits the new version.

Canary Deploy

Gradual traffic shift: 5% → 25% → 50% → 100%.

import random

def get_prompt_with_canary(
    name: str,
    canary_percentage: int = 10,
) -> tuple[dict, str]:
    """Return a prompt and its version (production or canary)."""
    if random.randint(1, 100) <= canary_percentage:
        prompt = langfuse.get_prompt(name=name, label="canary")
        return prompt, "canary"
    else:
        prompt = langfuse.get_prompt(name=name, label="production")
        return prompt, "production"

Canary and production metrics are compared in real time. If canary degrades — automatic rollback.

Feature Flags

For teams with an existing feature flag system (LaunchDarkly, Unleash, or homegrown):

def get_prompt_version(name: str, user_id: str) -> str:
    """Determine the prompt version via feature flag."""
    flag = feature_flags.get(f"prompt_{name}_version")

    if flag.is_enabled(user_id):
        return flag.get_variant(user_id)  # "v14", "v15"
    return "production"

You can also target specific users, segments, or regions.

Monitor: Tying Metrics to Prompt Versions

Monitoring without version context is useless. Quality dropped — but what broke: the prompt, the model, the data?

Tracing with Prompt Version

Every LLM call should include the prompt version in metadata:

trace = langfuse.trace(
    name="ticket-classification",
    metadata={
        "prompt_name": "ticket-classifier",
        "prompt_version": prompt.version,      # 14
        "prompt_label": "production",
        "model": "gpt-4o-mini",
    },
)

generation = trace.generation(
    name="classify",
    model="gpt-4o-mini",
    prompt=prompt,  # Langfuse automatically links the version
    input=messages,
    output=response,
)

Version Dashboard

Key metrics to monitor:

Metric	What it shows	Alert when
Accuracy	Fraction of correct responses	< threshold for prompt
Latency p95	Response time	> 2x baseline
Token usage	Token consumption	> 1.5x vs previous version
Error rate	Fraction of invalid responses	> 5%
Cost per request	Cost per call	> budget

# Example: automatic comparison of two prompt versions
def compare_prompt_versions(
    prompt_name: str,
    version_a: int,
    version_b: int,
    metric: str = "accuracy",
) -> dict:
    """Compare metrics for two prompt versions from Langfuse."""
    traces_a = langfuse.fetch_traces(
        name=f"{prompt_name}-eval",
        metadata={"prompt_version": version_a},
        limit=1000,
    )
    traces_b = langfuse.fetch_traces(
        name=f"{prompt_name}-eval",
        metadata={"prompt_version": version_b},
        limit=1000,
    )

    scores_a = [t.scores[metric] for t in traces_a if metric in t.scores]
    scores_b = [t.scores[metric] for t in traces_b if metric in t.scores]

    return {
        "version_a": {"version": version_a, "mean": sum(scores_a) / len(scores_a)},
        "version_b": {"version": version_b, "mean": sum(scores_b) / len(scores_b)},
        "diff": (sum(scores_b) / len(scores_b)) - (sum(scores_a) / len(scores_a)),
    }

Regression Alerts

# Check metrics every 15 minutes (cron job or Langfuse webhook)
def check_prompt_regression(prompt_name: str):
    current_version = langfuse.get_prompt(name=prompt_name, label="production").version
    recent_scores = get_recent_scores(prompt_name, current_version, hours=1)

    baseline = get_baseline_scores(prompt_name, current_version)

    if recent_scores["accuracy"] < baseline["accuracy"] * 0.9:  # > 10% degradation
        alert(
            channel="slack",
            message=f"Regression detected: {prompt_name} v{current_version}. "
                    f"Accuracy: {recent_scores['accuracy']:.2f} "
                    f"(baseline: {baseline['accuracy']:.2f})",
        )
        # Automatic rollback to previous version
        rollback_prompt(prompt_name, to_version=current_version - 1)

Prompt Organization Patterns

Composition Over Monoliths

A 3,000-token monolithic prompt is hard to test and maintain. Break it into components:

# prompts/components/output-format.yaml
name: output-format-json
content: |
  Respond STRICTLY in JSON. No text before or after the JSON.
  If you cannot determine the answer, return {"error": "unable to classify"}.

# prompts/components/language-rules.yaml
name: language-rules
content: |
  Response language: {{language}}.
  Do not translate proper nouns or technical terms.

def compose_prompt(*component_names: str, **variables) -> str:
    """Assemble a prompt from components."""
    parts = []
    for name in component_names:
        component = registry.get(f"components/{name}")
        content = component["content"]
        for key, value in variables.items():
            content = content.replace(f"{{{{{key}}}}}", str(value))
        parts.append(content)
    return "\n\n".join(parts)

# Usage
system_prompt = compose_prompt(
    "ticket-classifier-core",
    "output-format-json",
    "language-rules",
    categories="billing,technical,general",
    language="en",
)

Naming Convention

At 50+ prompts, consistent naming matters:

{domain}-{task}-{variant}

ticket-classifier-v2
ticket-classifier-multilingual
order-summarizer-short
order-summarizer-detailed
response-generator-formal
response-generator-casual
quality-judge-relevance
quality-judge-toxicity

Prompt Metadata

Each prompt should carry metadata for auditing:

name: ticket-classifier
metadata:
  owner: ml-team
  created: 2026-01-15
  last_tested: 2026-03-20
  model_compatibility:
    - gpt-4o-mini
    - claude-3-5-sonnet-20241022
  avg_tokens: 450
  cost_per_call_usd: 0.002
  test_accuracy: 0.92
  dataset_size: 150

Scaling: From 5 to 500 Prompts

How the system evolves as the number of prompts grows:

Scale	Registry	Testing	Deploy	Monitor
5–10 prompts	YAML in Git	Manual eval	Instant switch	Logs
10–50 prompts	Langfuse + Git sync	CI eval pipeline	Canary	Version dashboard
50–200 prompts	Langfuse + RBAC	CI + LLM-as-Judge	Feature flags	Alerts + auto-rollback
200+ prompts	Custom registry	Eval platform	Progressive rollout	ML monitoring

Key thresholds:

10 prompts — you need a registry. Prompts in code become unmanageable.

30 prompts — you need CI eval. Manual testing doesn't scale; regressions slip through.

50 prompts — you need RBAC. Different teams own different prompts; access control becomes non-optional.

100 prompts — you need auto-rollback. Humans can't respond to regressions fast enough in real time.

Prompt Management Tools

Tool	Type	Strengths
Langfuse	Open-source	Prompt management + tracing + evals in one. Self-hostable
PromptLayer	SaaS	Specialized in prompt management. Good UI
Humanloop	SaaS	Prompt management + eval + annotation. Enterprise
Pezzo	Open-source	Prompt management. Lightweight
Custom	Custom	Git + YAML + CI scripts. Maximum control

Langfuse covers most scenarios: registry with versioning, prompt-to-trace linking, dataset-based evals, MCP server for IDE management. Detailed walkthrough in the Langfuse guide.

Common Mistakes

Prompts in .env or config files. No versioning, no testing, no connection to metrics. Fine for prototypes, falls apart in production.

Testing on three examples. The prompt passes three tests and ships to production. A week later you discover it breaks on long inputs or edge case categories.

No baseline. The new prompt version "works well." Without a baseline, there's nothing to compare against. The previous version may have been better.

Optimizing tokens at the expense of quality. Prompt reduced from 800 to 300 tokens. Cost drops 60%. Accuracy drops from 0.94 to 0.81. Saving $50/month costs dozens of wrong responses every day.

Context Engineering for Prompts

A prompt doesn't exist in isolation. Quality depends on what's fed alongside it: context engineering determines which data enters the context window and in what order.

Three rules for production prompts:

Variables instead of hardcoded values. Anything that might change (categories, languages, formats) goes into variables. The prompt stays stable.
Few-shot examples at the end. Models "see" the end of the context more clearly. Placing examples after instructions improves accuracy.
Minimal context. Every extra token in the prompt dilutes the model's attention. If an instruction doesn't affect quality — remove it.

Where to Start

Week 1. Inventory. Collect all prompts from your codebase into one place — YAML files in Git or Langfuse. Standardize the format: name, version, model, messages, variables.

Week 2. Datasets. For each prompt, collect 20–30 test examples from production logs. Label the expected output.

Week 3. Eval pipeline. A script that runs the prompt against the dataset and outputs accuracy. Triggered in CI when prompts change.

Week 4. Monitoring. Prompt version in every trace's metadata. Dashboard with metrics per version. Alert on > 10% degradation.

After a month — a working system where every prompt change is tested, versioned, and monitored. No chaos, no regressions, no "who changed this prompt?"

Context Engineering: How to Manage Context for AI Models and Agents

Roman Belov — Thu, 09 Apr 2026 04:33:23 +0000

Claude's context window holds 200,000 tokens. Gemini's stretches to two million. But response quality starts degrading long before the window fills up. Window size doesn't solve the context problem — it masks it.

Prompt engineering teaches you how to ask. Context engineering teaches you what to feed the model before asking. And the second one shapes the answer more than the first.

Andrej Karpathy put it this way: "Context engineering — the delicate art and science of filling the context window with just the right information for the next step." Tobi Lütke, CEO of Shopify, popularized the term itself, and Gartner declared in July 2025: "Context engineering is in, and prompt engineering is out."

This piece covers concrete techniques, models, and patterns. Things that actually work when you're using AI agents in development every day.

Prompt vs Context: Where the Line Falls

Here's an analogy that works: you're hiring an expert consultant.

Prompt — your question: "What should I do?"
Context — the briefing you hand them before the question.

You can phrase the question perfectly, but if the briefing contains 500 pages of irrelevant documents, even a strong expert will get lost. Flip it around: hand them exactly the 2 pages they need, and even a simple question yields a precise answer.

Prompt engineering answers questions like: how to frame the task, what role to assign the model, what output format to request. Context engineering answers different ones: feed 100 reviews or pick 15 representative ones? The entire 500-line file or just lines 45–80? All the documentation or extract the facts?

A more technical analogy drives it home. The LLM is a CPU. The context window is RAM. You're the operating system deciding what gets loaded into working memory. The goal: load exactly the data needed for the current operation.

Why More Context Is Worse

This is counterintuitive, but backed by research.

Context Rot

Chroma's research (2025) showed that LLM accuracy drops as the token count in context grows — even when the window is far from full.

The mechanism: attention is a fixed resource. Weights always sum to 1. More tokens means less attention per fragment. Think of a flashlight — the wider the beam, the dimmer the light at any point. And the harder the task, the steeper the drop.

Lost in the Middle

A study found a specific pattern: LLM performance drops 30%+ when critical information sits in the middle of a long context. Beginning and end? Fine. The middle is a blind spot.

Practical takeaway: put the important stuff at the beginning or end. System prompt up top. Few-shot examples at the bottom.

Economics

Every token costs money, and the model rereads the entire context on every request (LLMs are stateless):

Input: ~$3 per 1M tokens (Claude Sonnet)
100K context × 100 requests/day = ~$30/day = $900/month

Context engineering is budget engineering too.

Hallucinations from Overload

With a bloated context, the model tries to "use everything" and starts inventing connections between unrelated parts. Data about Company A gets attributed to Company B. Functions that don't exist get "recalled" from similar code that drifted into the context twenty screens back.

Six Layers of Context

Structure context like an onion — six layers, each with a specific job. This fights degradation by placing the most important information at the beginning and end, instead of spreading it across the middle.

┌─────────────────────────────────────────┐
│  1. SYSTEM — who you are & how to act   │  ← Permanent (beginning)
├─────────────────────────────────────────┤
│  2. PROJECT — project context           │  ← Semi-permanent
├─────────────────────────────────────────┤
│  3. TASK — the specific task            │  ← Per task
├─────────────────────────────────────────┤
│  4. DIFF / CODE — relevant fragments    │  ← Per task
├─────────────────────────────────────────┤
│  5. ACCEPTANCE CRITERIA — exit criteria │  ← Per task
├─────────────────────────────────────────┤
│  6. EXAMPLES (Few-shot) — samples       │  ← Optional (end)
└─────────────────────────────────────────┘

System: Role and Behavior

Who the model is and how it should behave. Always at the very beginning.

You are an experienced backend developer working with Python and FastAPI.
Keep answers concise. Use type hints. Don't add dependencies without asking.

This is where the role, response style, and constraints go. Task details do not belong here — that's layer 3.

Project: Project Context

Tech stack, structure, architecture decisions, code conventions. This layer gets reused across tasks. In Claude Code, it lives in the CLAUDE.md file — the agent reads it automatically on every launch.

Task: What to Do

A clear description of what to do, why, and — this gets forgotten constantly — what not to do.

A good example:

Task: Add rate limiting to /users.
Context: Endpoint is unprotected, bots are overloading it.
Requirements: 100 req/min per IP, Redis for counters, 429 on exceeded.
Out of scope: Changing endpoint logic, adding authorization.

A bad example: "Add rate limiting."

Diff/Code: Only What's Relevant

Provide only the code fragments that relate to the task. Not the entire file. Specify path and lines: app/api/users.py, lines 45–60.

Acceptance Criteria: When to Stop

Clear, verifiable conditions. The model only knows when to stop if you tell it. Skip these and you'll get either a half-finished answer or something wildly overengineered.

- [ ] Return 429 status when limit is exceeded
- [ ] Include Retry-After header in the response
- [ ] Unit tests cover edge cases

Examples (Few-shot): At the End

For nonstandard output formats or a specific style. Place them at the end of the context — the model "sees" the finale better.

What Hurts to Put in Context

A few anti-patterns that will reliably tank your results:

The entire project codebase — signal drowns in noise
Contradictory instructions — "use Redux" + "use Context API" = the model gets confused
Outdated examples — code with deprecated APIs gets reproduced verbatim
Vague phrasing — "make it better" gives the model no direction

Four Strategies for Managing Context

LangChain and Anthropic propose a framework: all context work boils down to four actions.

Strategy	What It Does	Example
Write	Persist externally	Scratchpads, MEMORY.md, progress files
Select	Extract only what's relevant	RAG, grep, code search
Compress	Compact	Compaction, summarization, tool result cleanup
Isolate	Isolate tasks	Subagents with clean context

Everything described below is a specific case of one of these four.

Persistence: Bridging Sessions

Every session with an AI agent starts from scratch. New context window, zero memory of previous work. Anthropic calls it the "shift engineer" problem: each new engineer coming on shift remembers nothing of the previous one's work. No notes left behind? Start over.

Plain Files

The most basic form of memory — markdown notes the agent writes for its future self. Claude Code uses MEMORY.md for this: the agent automatically records project patterns, decisions, and architectural notes.

Git as Memory

Commits with meaningful messages form a changelog and restore points. The agent can experiment freely, knowing it can always roll back.

Structured Notes

Plain files evolve. Instead of a flat log, the agent maintains a structured knowledge base. The pattern: write_to_notes(topic, content) + read_from_notes(topic) — an external hard drive for memory.

An example from Anthropic: an agent playing Pokemon recorded "trained Pikachu 1234 steps, 8 out of 10 levels." After a context reset, it read its own notes and picked up right where it left off.

Scratchpad

Working memory within the current session. The agent "thinks out loud" — storing intermediate results, hypotheses, a plan. Scratchpad is RAM; files are disk.

Simple thought, but it changes everything: stop making the model remember. Give it a notebook.

Context Compaction

When the context fills up, compress it. The model gets the full history and produces a summary. Old conversation gets tossed, compressed version goes at the start of the new context.

Manual compaction at logical breakpoints (after finishing a feature) beats automatic. There's also a lighter variant: cleaning up tool results — strip the verbose command outputs from history, keep just the fact that they ran.

Task Trackers

For long-running projects, the "Initializer + Executor" pattern works well. The first agent doesn't write code — it creates a structured task list in JSON: description, status, dependencies. Each subsequent agent reads the list, picks a pending task, completes it, updates the status, and commits.

Subagents: Isolation as Strategy

The main agent can delegate a subtask to a subagent — a separate process with its own clean context window. Like a manager asking a database specialist to optimize a query: hand them the schema and the slow query, not the entire month's email thread.

Three wins:

Context purity. The subagent isn't polluted by the main agent's history. The main agent might have 85% of its window occupied — the subagent starts at 5%.
Specialization. You can use different models or system prompts for different subagents.
Parallelization. Multiple subagents can work simultaneously.

In Claude Code, subagents are launched via the Task tool. The main agent describes the task, the subagent receives it in a clean context, does the work, and returns a structured result. The main agent's context cost is minimal.

MCP and the Tool Tax

MCP (Model Context Protocol) is an open standard defining how AI agents discover and call tools. Each MCP server adds its tool descriptions to the context. Every description costs tokens.

You feel it the moment you start working for real: connect 5–10 MCP servers (GitHub, Slack, database, analytics, monitoring) and tens of thousands of tokens in tool descriptions land in every request, even when none of them get called.

The fix is lazy loading. Claude Code uses Tool Search: tool descriptions load on demand, only when the agent decides it might need one. Saves around 85% of tokens. Other agents have similar tricks: lazy-mcp, MetaMCP.

Tool design principles:

Self-sufficiency: the description contains everything needed for use. The model doesn't read your README.
Unambiguity: user_email instead of data, validate_payment instead of process.
Minimalism: one tool = one atomic operation. If the description exceeds 200 words, the tool does too much.

Memory Hierarchy

Context in production tools isn't a single file — it's a multi-level system. Claude Code's docs lay out the hierarchy:

System prompt — base instructions (always loaded)
Settings — user preferences
CLAUDE.md — project instructions (loaded from the repository root)
Rules — modular instructions, can be path-specific (loaded only when working with certain files)
Skills — entire folders of instructions and scripts the agent loads at its own discretion
Auto Memory — memory the agent forms for itself

Martin Fowler proposes a useful distinction: Instructions (orders — "write a test for this function") vs Guidance (general rules — "all tests must be independent of each other"). CLAUDE.md and rules are mostly Guidance. Chat prompts are Instructions.

Working with Large Documents

You can't just dump a 50-page PDF into the model. You need a strategy.

Chunking

Break it into pieces of 1,500–3,000 tokens with 10–20% overlap. Semantic chunking (by chapters and sections) works noticeably better than chopping at fixed lengths.

Contextual Retrieval from Anthropic tackles the ripped-from-context problem: before indexing, each fragment gets a description of where it came from and what the section covers. Result: at least 35% fewer retrieval failures, up to 67% with reranking.

Fact Extraction

Skip the full text. Pull a structured list of facts and figures from each chunk instead. Smaller footprint, better accuracy for analysis.

Map-Reduce

For very large documents: split into chunks, summarize each (MAP), assemble the mini-summaries into a final one (REDUCE). The MAP phase can be parallelized — speedup scales with the number of workers.

RAG vs Long Context

With windows getting bigger (Gemini 2M), the question keeps coming up: do we still need RAG? Research (arXiv:2501.01880) says it depends on the task.

RAG wins: the corpus is huge (> 1M tokens), freshness matters, budget is limited.

Long context wins: you need synthesis across sections, structural understanding, document < 200K.

Hybrid (the way to go): RAG for selection, long context for analysis. The cost gap is real: full 2M context on every request runs an order of magnitude more than RAG selection + 50K of relevant context.

Where This Doesn't Work

Wouldn't be honest to stop at the upsides.

Context engineering won't fix a bad model

If the model can't write Rust, no amount of context will help. Context engineering works within what the model can already do. If the task is too hard for the current generation, break it into subtasks or try a different angle.

Preparation overhead

Assembling a perfect six-layer context package for every request takes time. For quick questions ("how does this function work?") it's overkill. Context engineering pays off on repeatable tasks and with agents that chain dozens of operations.

Compaction loses information

Compression is a tradeoff. The model picks what to keep and what to toss. Sometimes it tosses what matters. Manual compaction at logical breakpoints is safer, but needs the operator paying attention.

Lost in the Middle works both ways

You can get so focused on "important stuff at the beginning and end" that the middle turns into a junk drawer. Better to cut the context down than hope positioning saves you.

Subagents add latency

Delegating to a subagent means a separate API call with its own context. On a complex task, one subagent fires dozens of requests. For anything real-time, that's too slow.

Lazy tool loading isn't free

Tool Search saves context but adds a search step. If the agent hunts for a tool before every action, that's extra requests and wasted time. Balancing tools-in-context against search frequency takes tuning.

Common mistakes

Three that come up more than anything else:

Copying an entire file instead of the relevant fragment. The model gets 500 lines when it needed lines 45–60. The other 440 lines are pure noise.
Not saying what NOT to do. Without constraints, the model refactors the whole file when you asked it to fix one function.
Skipping acceptance criteria. The model doesn't know when to stop. You get either undercooked or overcomplicated output.

Checklist

Run through this before every serious request to a model.

Before the request:

Is there a source of truth (docs, code, data) in the context?
Is the task clearly described?
Is the output format specified?
Is what NOT to do specified?

In the prompt:

"Answer only based on the provided context"
"If you don't know — say you don't know"

After the response:

Are the facts verified?
Do the referenced functions and libraries actually exist?
Were characteristics of one object attributed to another?

Five takeaways

Less = better. Quality and relevance of context matter more than quantity. The goal is the smallest set of tokens with the strongest signal.
Structure it. Six layers: System, Project, Task, Code, Criteria, Examples. Important stuff at the beginning and end.
Persist it. Persistence = bridge between sessions. State files, structured notes, git.
Isolate it. Subagents with clean context for specialized tasks.
Compress it. Compaction and tool result cleanup when the context grows.

Start small: assemble a six-layer context package for one typical task and compare the result to what you get from pasting code into the chat. The difference tends to be obvious on the first try.

FAQ

At what token count does context rot become practically noticeable, and is there a threshold to monitor?

Multiple benchmarks (including studies by Chroma and others) show measurable accuracy degradation starting around 20–30K tokens for complex reasoning tasks, with a steeper drop past 50K. For simpler extraction tasks the threshold is higher — around 80–100K. A practical monitoring rule: if your average context exceeds 40K tokens per request and you're seeing inconsistent output quality, context size is the first variable to investigate. The $900/month calculation in the article assumes 100K tokens — most production agents can cut that by 60–70% through selective RAG retrieval without measurable quality loss.

How does lazy tool loading in Claude Code achieve 85% token savings, and what is the actual mechanism?

Without lazy loading, every MCP server's full tool schema is injected into the system prompt on every request — 10 servers with 5 tools each at ~200 tokens per tool description equals 10,000 tokens of overhead per call, regardless of which tools actually get used. Tool Search defers schema injection: the agent first sends a semantic search query to find relevant tool names (~50 tokens), then loads only the matching tool descriptions (~400 tokens for 2 tools). The 85% savings comes from eliminating the full schema dump for 8–9 unused tools per typical request.

When should you use manual context compaction versus automatic, and what information is typically lost?

Manual compaction at logical breakpoints (end of a feature, after a passing test suite) is safer because you control what the summary captures. Automatic compaction triggers on window fill and summarizes whatever is current — which may include half-finished reasoning, temporary debugging state, or contradictory instructions from mid-session pivots. The most common loss is architectural decisions made conversationally: "let's not use Redux here because X" survives a manual summary but gets dropped by automatic compaction which treats it as transient chat rather than binding constraint.

I Built a Lie Detector for AI Coding Agents

Roman Belov — Mon, 09 Mar 2026 13:59:51 +0000

The problem nobody talks about

AI coding agents lie. Not on purpose - they hallucinate.

Claude Code tells you "All tests pass!" when tests were never executed. It says "I updated the file" when the content is byte-for-byte identical. It sneaks in git commit --no-verify to skip the hooks that would catch its mistakes.

This isn't rare. It's a documented bug and it hits every serious Claude Code user. System prompts don't help - the agent just ignores them when it "decides" something is done.

I spent a couple weeks building a fix.

Don't ask the agent to be honest - verify it

That's the whole idea. Claude Code has a hooks API - before and after every tool call, it can run your scripts. Those scripts inspect what actually happened and block the agent if the results don't match the claims.

Agent claims: "I updated utils.ts"
    |
[PostToolUse hook]
    |
Compare SHA256 before/after -> IDENTICAL
    |
BLOCKED: "File was not actually modified. Checksum unchanged."

Can't argue with a checksum. This isn't a prompt the agent can ignore. It's a gate.

Six hooks, zero fluff

Hook	What it catches
Dangerous command blocker	`--no-verify`, `--force push`; warns on `reset --hard`, `clean -f`
Pre-commit test runner	Auto-detects your framework, runs tests before every commit
File checksum recorder	Saves SHA256 before file edit
Exit code verifier	Command failed (exit 1) but agent claims success
Phantom edit detector	File unchanged after a claimed "edit"
Commit verification reminder	Makes the agent prove the fix works before claiming "done"

Two days of real use

I ran TruthGuard on a production Flutter project:

5 commits blocked - agent kept trying to commit with failing tests
3 dangerous commands caught - 2x git push --force, 1x git commit --no-verify
0 false positives

The pre-commit test hook alone stopped me from shipping broken code five times in two days. Five times.

Pre-commit testing is the killer feature

When Claude runs git commit, TruthGuard intercepts it. Detects your project type, runs the right test command, and blocks the commit if tests fail. Simple as that.

# Auto-detection:
# pubspec.yaml     -> flutter test
# package.json     -> npm test
# Cargo.toml       -> cargo test
# go.mod           -> go test ./...
# pyproject.toml   -> python -m pytest

Override if you want:

# .truthguard.yml
test_command: "npm run test:unit"
skip_on_no_tests: false

The subtler problem: wrong fixes

After building the basic hooks, I ran into something trickier. Claude makes real changes, tests pass, but the fix doesn't actually solve the original problem. It genuinely thinks it's done.

So I added a post-commit reminder. After every successful commit:

"You just committed code. STOP and verify: did you actually confirm the fix works?"

A nudge, basically. But it makes the agent pause instead of rushing to "Done."

Install

npx truthguard install
cd your-project
npx truthguard init

Copies scripts to ~/.truthguard/, adds hooks to .claude/settings.json. Restart Claude Code and that's it.

Homebrew works too:

brew tap spyrae/truthguard && brew install truthguard

Agent-agnostic

Scripts read JSON from stdin, write JSON to stdout. Same scripts power both Claude Code and Gemini CLI. Supporting another agent means writing a config file, not rewriting hooks.

What's next

This is the free local-only version. No backend, no telemetry, everything on your machine.

Some ideas I'm thinking about:

A second LLM that checks whether the diff actually solves the described problem
Team dashboard with honesty stats
VS Code extension for Cursor and Copilot users

Links

GitHub: github.com/spyrae/truthguard
npm: npmjs.com/package/truthguard

If your agent lies in ways I haven't covered - open an issue and I'll write a hook for it.