My team had been arguing for three weeks about which LLM to use for our internal code assistant. Half the team swore by Claude. The other half insisted GPT-4o was obviously better. I had an opinion too — one I'd formed by reading benchmark leaderboards that, I later realized, were largely funded by the same companies whose models topped them.
We were making a decision that would affect every developer on the team, and we were doing it on vibes. Pure, expensive, unverifiable vibes.
So I built model-diff — a CLI that runs the same prompt on multiple LLMs simultaneously and shows you exactly where they agree, where they diverge, and what it costs to find out.
The Real Problem With Model Selection
Benchmarks tell you how a model performs on MMLU, HumanEval, or GSM8K. Useful data, but those tests are not your codebase. They are not your users' questions. And they are definitely not the specific, weird, domain-specific prompts your application sends at 2am when something breaks.
What you actually need is empirical evidence on your prompts. What does GPT-4o say when asked your question? What does Claude say? Where do they converge, and where do they pull in completely different directions? Without that data, you're just picking a model the same way you pick a coffee order — based on brand loyalty and whatever your coworker recommended last month.
Install It and Try It Now
pip install model-diff
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
# Compare default models on any prompt
model-diff "What is the best way to handle errors in Python?"
# Choose specific models
model-diff "Explain recursion" --models gpt-4o,claude-sonnet-4-6,claude-haiku-4-5-20251001
# Cut the noise — show only what differs
model-diff "Explain recursion" --only-diff
No config files. No dashboard. Just a prompt, a few API keys, and an honest answer.
What the Output Looks Like
Here is what a real run looks like when you ask two models about Python error handling:
$ model-diff "What is the best way to handle errors in Python?"
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
MODEL: gpt-4o
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Use specific exception types rather than bare except clauses.
Always log exceptions with context. Prefer EAFP (Easier to Ask
Forgiveness than Permission) over LBYL in most Python code.
Use context managers for resource cleanup. Consider custom
exception hierarchies for large applications.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
MODEL: claude-sonnet-4-6
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Catch the most specific exception possible and handle each
case explicitly. Use finally blocks for cleanup. Avoid
swallowing exceptions silently — either re-raise or log.
For production systems, structured logging with exception
chaining (raise X from Y) gives the best traceability.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
SIMILARITY SCORE: 71%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Shared points:
- Use specific exception types
- Log exceptions with context
- Use context managers / finally for cleanup
Unique to gpt-4o:
- EAFP vs LBYL philosophy
- Custom exception hierarchies for scale
Unique to claude-sonnet-4-6:
- Exception chaining with raise X from Y
- Structured logging for production traceability
Cost:
gpt-4o $0.00031
claude-sonnet-4-6 $0.00018
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Seventy-one percent similar. The overlap is real — both models agree on the fundamentals. But the 29% divergence is where it gets interesting: GPT-4o led with a Python-specific coding philosophy (EAFP), while Claude went straight to production concerns like exception chaining for traceability. Neither answer is wrong. They just reflect different priors about what matters.
How It Works Under the Hood
The technical core is three files:
models.py — Uses Python's concurrent.futures.ThreadPoolExecutor to fire API calls to all selected models in parallel. Sequential calls would take 3-5x longer. Threading gets you all responses in roughly the time of a single call.
differ.py — Computes similarity using difflib.SequenceMatcher at the sentence level. It then does fuzzy matching to bucket sentences into "shared" and "unique" per model. Rich terminal formatting handles the color-coded output.
cli.py — Click-based interface. Handles argument parsing, environment variable validation, and graceful degradation when an API key is missing (it skips that provider rather than crashing).
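The fan-out in models.py comes down to submitting one future per model and collecting results. Here is a minimal sketch of that pattern — call_model is a stand-in for the real provider SDK calls, not the tool's actual code:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def call_model(model: str, prompt: str) -> str:
    # Stand-in for a real API call; simulate network latency.
    time.sleep(0.1)
    return f"{model} answer to: {prompt}"

def fan_out(models: list[str], prompt: str) -> dict[str, str]:
    # One worker per model: all calls run concurrently, so total wall
    # time is roughly the slowest single call, not the sum.
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = {m: pool.submit(call_model, m, prompt) for m in models}
        return {m: f.result() for m, f in futures.items()}

responses = fan_out(["gpt-4o", "claude-sonnet-4-6"], "Explain recursion")
```

Because the work is pure network waiting, threads never contend for the GIL in any way that matters here.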
The whole thing is under 400 lines of Python. No LangChain. No vector database. No YAML config file that requires a PhD to understand.
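To give a feel for the diffing step, the sentence-level bucketing can be approximated with the standard library alone. This is my illustration of the idea, not the tool's exact algorithm — the helper names and the 0.6 threshold are assumptions:

```python
import re
from difflib import SequenceMatcher

def split_sentences(text: str) -> list[str]:
    # Naive splitter on sentence-ending punctuation.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def similarity(a: str, b: str) -> float:
    # Character-level fuzzy ratio between two sentences.
    return SequenceMatcher(None, a, b).ratio()

def bucket(a: str, b: str, threshold: float = 0.6):
    """Split sentences into shared vs unique-per-response buckets."""
    sents_a, sents_b = split_sentences(a), split_sentences(b)
    shared, unique_a = [], []
    for sa in sents_a:
        best = max((similarity(sa, sb) for sb in sents_b), default=0.0)
        (shared if best >= threshold else unique_a).append(sa)
    unique_b = [sb for sb in sents_b
                if max((similarity(sb, sa) for sa in sents_a), default=0.0) < threshold]
    return shared, unique_a, unique_b

shared, ua, ub = bucket("Use specific exceptions. Log errors.",
                        "Use specific exceptions. Prefer EAFP.")
```

Fuzzy matching rather than exact matching is what lets two differently worded sentences about the same point land in the "shared" bucket.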
Where Models Actually Disagreed — Real Findings
Running the same set of prompts across models turned up some genuinely unexpected splits:
On code style: Ask both models to refactor a Python function and GPT-4o will almost always introduce type annotations unprompted. Claude tends to preserve the original style more faithfully unless you ask it to add types. Neither behavior is documented anywhere — you only discover it by running the diff.
On explanation depth: For beginner-level questions like "explain recursion," Claude-Haiku goes straight for the minimal explanation and stops. GPT-4o adds a worked example almost every time. If your users are beginners, that difference matters.
On opinionated vs. balanced answers: Ask "which is better, tabs or spaces?" GPT-4o will give you a diplomatic both-sides answer. Claude will tell you spaces, because PEP 8, and move on. Which approach is more useful depends on your context.
The Cost Comparison Nobody Talks About
Same quality answer, radically different price:
| Model | Approx. cost / 1K tokens output |
|---|---|
| claude-haiku-4-5-20251001 | ~$0.0003 |
| claude-sonnet-4-6 | ~$0.0015 |
| gpt-4o-mini | ~$0.0006 |
| gpt-4o | ~$0.0060 |
For high-volume applications, the difference between Haiku and GPT-4o is not 2x. It is 20x. And if model-diff shows the similarity score between those two is 85% on your specific prompts, you have an empirical case for choosing the cheaper model — not a gut feeling.
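To make the 20x concrete, here is back-of-envelope arithmetic using the rough table prices and an assumed average of 500 output tokens per request (both numbers are illustrative, not measured):

```python
# Approximate USD per 1K output tokens, from the table above.
PRICE_PER_1K_OUTPUT = {
    "claude-haiku-4-5": 0.0003,
    "gpt-4o": 0.0060,
}

def monthly_cost(model: str, requests_per_day: int,
                 avg_output_tokens: int = 500) -> float:
    # 30-day month; cost scales linearly with token volume.
    tokens_per_month = requests_per_day * 30 * avg_output_tokens
    return tokens_per_month / 1000 * PRICE_PER_1K_OUTPUT[model]

haiku = monthly_cost("claude-haiku-4-5", requests_per_day=100_000)
gpt4o = monthly_cost("gpt-4o", requests_per_day=100_000)
# At 100k requests/day: Haiku ≈ $450/month, GPT-4o ≈ $9,000/month.
```

At that volume the model choice is a four-figure monthly line item, which is exactly the kind of decision a similarity score lets you defend.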
What I Learned Building This
- Threading is the right call here. Async would be overkill; the bottleneck is network I/O to external APIs, and ThreadPoolExecutor handles that cleanly with less boilerplate than asyncio.
- difflib.SequenceMatcher is surprisingly capable. I expected to need something fancier for semantic similarity, but for comparing prose responses sentence by sentence, the standard library tool gets you 80% of the way there.
- Model behavior is not documented, it's discovered. The stylistic differences I found between models are not in any README or model card. The only way to know is to run the prompts and look.
- Benchmarks and vibes are both wrong. Neither "this model scored highest on HumanEval" nor "I like the way it writes" is a substitute for running your actual prompts through a diff.
- Small tools with one clear job get used. This does one thing. That made it easy to build, easy to explain, and easy for teammates to actually run.
Try It
If you are in the middle of a model selection decision, or you just want to know what you are actually getting from each provider you are paying for:
pip install model-diff
model-diff "your actual prompt here"
The source is on GitHub: https://github.com/LakshmiSravyaVedantham/model-diff
MIT licensed. PRs welcome. And if you find a prompt where the models give wildly different answers, open an issue — I am collecting interesting divergence cases.