You swap a model. The new one scores better on your benchmarks. You deploy it. Two days later, a user reports that something that used to work reliably now behaves differently.
The benchmark never caught it because benchmarks measure averages. What changed was the behavior on specific prompts, the ones your users actually send.
LLM Behavior Diff is a tool that catches this before it happens. Feed it two model versions and a prompt suite, and it runs every prompt through both, scores the responses for semantic similarity, classifies each divergence by severity, and produces an HTML report you can drop into a CI artifact or diff review.
It ships as a CLI, a Python API, and an MCP server so Claude Code or any MCP-compatible agent can run a behavioral diff before a model swap.
The Problem With Model Updates
Every model update is a tradeoff. The new version might score better on reasoning benchmarks while quietly regressing on instruction-following for your specific use case. Or it might phrase safety refusals differently in a way that breaks downstream parsing. Or two models might produce semantically identical answers that look completely different at the token level, which a naive string comparison would flag as a major change when it isn't one.
LLM Behavior Diff addresses all three scenarios. Embedding-based semantic similarity catches meaning-level changes that token-level comparison misses, an optional LLM-as-judge pass adds reasoning for ambiguous cases, and severity classification separates noise from real regressions.
How It Works
The pipeline runs in five steps for every prompt in your suite:
Load: A YAML prompt suite is loaded into a PromptSuite Pydantic model. Each prompt has an ID, text, category, tags, and an expected behavior description.
Run: Each prompt is sent through Model A and Model B via LLMRunner. Three providers are supported: Ollama (/api/generate), OpenRouter (chat completions), and a deterministic stub provider for offline CI runs.
Score: Each response pair is scored with either EmbeddingDiffer (cosine similarity on all-MiniLM-L6-v2 embeddings) or SimpleDiffer (Jaccard over words). Optionally, an LLM-as-judge score is combined with the similarity score; the default judge model is google/gemini-2.0-flash-lite-001 via OpenRouter.
Classify: Each prompt is classified against the --threshold. Changes are bucketed by severity: combined score >= 0.7 is minor, >= 0.4 is moderate, < 0.4 is major (a sketch of this step follows the list).
Report: An HTML report is rendered and a rich summary table is printed to the terminal.
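The thresholds translate into a small decision rule. Here is a minimal sketch of the classification step; the function name and signature are illustrative, not the tool's internal API:

# Illustrative sketch of the Classify step (hypothetical helper, not the tool's API).
# A change is flagged when the combined score falls below --threshold,
# then bucketed by severity using the cutoffs described above.
def classify(combined_score: float, threshold: float) -> tuple[bool, str]:
    changed = combined_score < threshold
    if not changed:
        return False, "none"
    if combined_score >= 0.7:
        return True, "minor"
    if combined_score >= 0.4:
        return True, "moderate"
    return True, "major"

print(classify(0.91, threshold=0.85))  # (False, 'none')
print(classify(0.55, threshold=0.85))  # (True, 'moderate')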
Why Embeddings Over Token Matching
The difference matters. Here is the same two-model comparison run two ways:
With --use-embeddings (cosine on all-MiniLM-L6-v2):
Avg Similarity: 91.4%
Changes Detected: 0 of 5
With --no-use-embeddings (Jaccard fallback):
Avg Similarity: 25.0%
Changes Detected: 5 of 5
Same two models, same prompts, completely opposite conclusions. The Llama and Gemini answers shared few exact tokens even when semantically identical, which is exactly why the embeddings path is on by default.
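A plain-Python illustration of why word overlap punishes paraphrases — this is just the Jaccard idea applied to two semantically identical answers, independent of the tool's SimpleDiffer:

a = "The answer is four."
b = "2 + 2 equals 4."
# Jaccard over words: intersection size divided by union size
tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
print(len(tokens_a & tokens_b) / len(tokens_a | tokens_b))  # 0.0 -- no shared words despite identical meaning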
Getting Started
pip install -e .
Requires Python 3.11+. Embedding similarity uses sentence-transformers/all-MiniLM-L6-v2, downloaded on first use. The LLM-judge path requires OPENROUTER_API_KEY; without it, scoring falls back to embeddings-only.
Running a Diff
Offline - stub provider
A stub provider returns deterministic hashed responses, so the whole pipeline runs offline without Ollama or an API key. Good for CI and testing the setup:
llm-diff run \
--model-a stub-a --provider-a stub \
--model-b stub-b --provider-b stub \
--prompts prompts/default.yaml \
--output output/report.html \
--no-use-embeddings
Real output from this run (stub + Jaccard, threshold 0.5):
╭───────────────────────────────────────────────────╮
│ LLM Behavior Diff                                 │
│ Detecting behavioral shifts between model updates │
╰───────────────────────────────────────────────────╯
Processing: safety-001 ━━━━━━━━━━━━━━━━━━━━ 100%
       Comparison Summary
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ Metric           ┃ Value ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
│ Total Prompts    │ 5     │
│ Changes Detected │ 3     │
│ Change Rate      │ 60.0% │
│ Avg Similarity   │ 40.0% │
└──────────────────┴───────┘
Report saved to: output/stub_jaccard.html
Real models - OpenRouter
export OPENROUTER_API_KEY=sk-or-...
llm-diff run \
--model-a meta-llama/llama-3.2-3b-instruct --provider-a openrouter \
--model-b google/gemini-2.0-flash-lite-001 --provider-b openrouter \
--prompts prompts/default.yaml \
--output output/or_emb.html \
--use-embeddings --threshold 0.85
Real output (embeddings only):
Comparison Summary
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ Metric           ┃ Value ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
│ Total Prompts    │ 5     │
│ Changes Detected │ 0     │
│ Change Rate      │ 0.0%  │
│ Avg Similarity   │ 91.4% │
└──────────────────┴───────┘
Adding --use-judge brings the average similarity to 91.8% and surfaces reasoning like: "Both responses correctly answer 'yes' and provide essentially the same explanation... Response A is slightly more verbose, but the core meaning is identical."
Real models - Ollama
llm-diff run \
--model-a qwen3:8b --provider-a ollama \
--model-b gemma4:e4b --provider-b ollama \
--prompts prompts/default.yaml \
--output output/report.html \
--use-embeddings --threshold 0.85
CLI Reference
llm-diff --help
Usage: llm-diff [OPTIONS] COMMAND [ARGS]...
LLM Behavior Diff โ Model Update Detector
--version Show version information
--help Show this message and exit.
Commands
run Run a comparison between two models.
llm-diff --version
LLM Behavior Diff version 0.1.0
Key options for llm-diff run: --model-a/--provider-a and --model-b/--provider-b select the two models, --prompts points at the YAML suite, --output sets the HTML report path, --threshold sets the change cutoff, --use-embeddings/--no-use-embeddings toggles semantic scoring, and --use-judge enables the LLM-as-judge pass.
Severity buckets applied when a change is detected: combined >= 0.7 is minor, >= 0.4 is moderate, < 0.4 is major.
Prompt Suite Format
The prompt suite is a YAML file. prompts/default.yaml ships with 5 prompts spanning reasoning, coding, factual, instruction-following, and safety. You can write your own:
name: "My suite"
version: "1.0.0"
prompts:
  - id: "code-001"
    text: "Write a Python function reverse_string(s)..."
    category: "coding"
    tags: ["python"]
    expected_behavior: "Short correct function"
IDs must be unique. Category must be one of: reasoning, coding, creativity, safety, instruction_following, factual, conversational.
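Since the suite is validated as a Pydantic model, you can sanity-check a custom file before committing it. A minimal sketch, assuming PromptSuite lives in llm_behavior_diff.models alongside the other models (the exact import path, and whether a dedicated loader helper exists, may differ):

import yaml
from llm_behavior_diff.models import PromptSuite  # assumed location

with open("prompts/my_suite.yaml") as f:  # hypothetical file
    data = yaml.safe_load(f)

suite = PromptSuite(**data)  # standard Pydantic validation of the fields above
print(suite.name, len(suite.prompts))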
Python API
The full pipeline is available as a library. A synchronous one-shot call:
from llm_behavior_diff.runner import run_prompt_sync
from llm_behavior_diff.models import ModelConfig, ProviderType
resp = run_prompt_sync(
    ModelConfig(name="stub-m", provider=ProviderType.STUB),
    prompt_id="p1",
    prompt_text="hello world",
)
print(resp.text, resp.success)
# -> Model stub-m says: 921fac0c4c True
Similarity scoring directly:
from llm_behavior_diff.differ import SimpleDiffer, EmbeddingDiffer, create_differ
d = SimpleDiffer()
print(d.compute_similarity("the cat sat", "the cat ran")) # 0.5
e = EmbeddingDiffer()
print(e.compute_similarity("The answer is 4.", "Two plus two equals four."))
# -> ~0.59
create_differ(use_embeddings=False) returns a SimpleDiffer (Jaccard); with use_embeddings=True it returns an EmbeddingDiffer if sentence-transformers is importable, otherwise it falls back to SimpleDiffer.
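In practice that means you can ask for the best available differ and not worry about the environment:

from llm_behavior_diff.differ import create_differ

differ = create_differ(use_embeddings=True)  # EmbeddingDiffer if sentence-transformers is installed, else SimpleDiffer
print(differ.compute_similarity("Paris is the capital of France.",
                                "France's capital city is Paris."))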
Generating a report from a ComparisonRun:
from llm_behavior_diff.report import ReportGenerator
from llm_behavior_diff.models import Settings
ReportGenerator().save_report(run, Settings(), "out.html")
ReportGenerator looks for a Jinja template in the CWD, the package directory, and a legacy path, then falls back to a built-in template so reports always render.
MCP Server
The tool also runs as an MCP server over stdio transport, exposing three tools so Claude Code or any MCP-compatible agent can trigger a behavioral diff during a session:
llm-diff-mcp
# or: python -m llm_behavior_diff.mcp_server
The three exposed tools:
compare_models - runs a full prompt suite through two models and returns per-prompt similarity, severity, and response text.
analyze_drift - scores drift between two candidate responses for a single prompt.
generate_report - renders an HTML summary from a JSON list of results.
Claude Code config:
{
  "mcpServers": {
    "llm-behavior-diff": {
      "command": "llm-diff-mcp"
    }
  }
}
Smoke test - all three tools, offline, via Python:
import asyncio, json
from llm_behavior_diff.mcp_server import (
    compare_models, CompareModelsRequest,
    analyze_drift, AnalyzeDriftRequest,
    generate_report, GenerateReportRequest,
)

async def main():
    a = await compare_models(CompareModelsRequest(
        model_a="stub-a", provider_a="stub",
        model_b="stub-b", provider_b="stub",
        prompts_path="prompts/default.yaml",
        threshold=0.5, use_embeddings=False,
    ))
    print(a.total_prompts, a.changes_detected, a.avg_similarity)
    # real: 5 4 0.3446

    b = await analyze_drift(AnalyzeDriftRequest(
        prompt_text="math",
        response_a="The answer is 4.",
        response_b="2+2 equals 4.",
        use_embeddings=True,
    ))
    print(b.embedding_similarity, b.severity)
    # real: 0.5572 moderate

    c = await generate_report(GenerateReportRequest(
        results_json=json.dumps([
            {"prompt_id": "p1", "similarity_score": 0.9, "behavioral_change": False, "severity": "none"},
            {"prompt_id": "p2", "similarity_score": 0.3, "behavioral_change": True, "severity": "major"},
        ]),
        output_path="output/mcp_report.html",
        title="MCP Smoke",
    ))
    print(c.success, c.output_path)

asyncio.run(main())
Verified over stdio JSON-RPC:
llm-diff-mcp # speaks MCP 2024-11-05 on stdio
# tools/list -> compare_models, analyze_drift, generate_report
# tools/call analyze_drift {"prompt_text":"...","response_a":"Paris",
# "response_b":"The capital is Paris.","use_embeddings":true}
# -> {"embedding_similarity":0.7761,"severity":"minor", ...}
Limitations
- LLM-as-judge requires OpenRouter - Without OPENROUTER_API_KEY, judging is skipped and the combined score equals the embedding or Jaccard similarity alone.
- First embedding run is slow - all-MiniLM-L6-v2 is downloaded from Hugging Face on first use. Subsequent runs use the local cache.
- Ollama is not spawned automatically - The client talks to http://localhost:11434 by default (the OLLAMA_HOST env var overrides it). Ollama must already be running.
- Stub provider is for CI and demos only - It produces deterministic fake text keyed on model name, temperature, and prompt. Not suitable for real behavioral conclusions.
How You Can Use This
Gate model upgrades in CI before they ship: Add llm-diff run to your deployment pipeline. Before any model swap reaches production, the tool runs your prompt suite through both versions and fails the pipeline if behavioral drift exceeds your threshold. You catch regressions automatically, not from user reports two days later.
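A sketch of what that gate could look like as a small script calling the same compare_models function the MCP server exposes; the pass/fail policy here (change rate under 20%) is an example, not something the tool enforces:

import asyncio, sys
from llm_behavior_diff.mcp_server import compare_models, CompareModelsRequest

async def gate() -> int:
    result = await compare_models(CompareModelsRequest(
        model_a="meta-llama/llama-3.2-3b-instruct", provider_a="openrouter",
        model_b="google/gemini-2.0-flash-lite-001", provider_b="openrouter",
        prompts_path="prompts/default.yaml",
        threshold=0.85, use_embeddings=True,
    ))
    change_rate = result.changes_detected / result.total_prompts
    print(f"change rate {change_rate:.0%}, avg similarity {result.avg_similarity:.1%}")
    return 1 if change_rate > 0.2 else 0  # non-zero exit fails the CI job

sys.exit(asyncio.run(gate()))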
Use it during prompt engineering to measure real impact: When you change a system prompt or few-shot examples, run a diff between the old and new configuration. The severity classification tells you whether the change is minor, moderate, or major across your prompt categories, so you know what you are actually shipping.
Use the MCP server to make your agent self-aware of drift: With the MCP server running, Claude Code or any MCP-compatible agent can call compare_models or analyze_drift directly during a session. An agent working on a model integration can check for behavioral drift without leaving the coding environment.
Extend it with additional providers: The tool currently supports Ollama, OpenRouter, and a stub provider, all sharing a common LLMRunner interface. Adding a new provider for Anthropic, Gemini, or any OpenAI-compatible endpoint follows the same pattern without touching the differ, classifier, or report logic.
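The core of any OpenAI-compatible provider boils down to a single chat-completions call. The sketch below shows that call in isolation; it is not the package's LLMRunner interface (which isn't shown here), and PROVIDER_API_KEY is a placeholder:

import os
import httpx

def generate(base_url: str, model: str, prompt: str) -> str:
    # One chat-completions request against any OpenAI-compatible endpoint
    resp = httpx.post(
        f"{base_url}/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['PROVIDER_API_KEY']}"},
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]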
Final Notes
Behavioral drift is the category of model regression that benchmarks miss. LLM Behavior Diff catches it by running the same prompts through both model versions, scoring the responses semantically rather than lexically, and classifying the divergence by severity before a swap reaches production.
The code is at https://github.com/dakshjain-1616/-LLM-Behavior-Diff-Model-Update-Detector
You can also build with NEO in your IDE using the VS Code extension or Cursor.
NEO is a fully autonomous AI engineering agent that writes code and builds solutions for AI/ML tasks including model evals, prompt optimisation, and end-to-end pipeline development.
