This post documents a real negative result: my trained model worked… but a well-written prompt worked better.
TL;DR
I built a 266-task evaluation benchmark for B2B sales-outreach agents — something existing benchmarks don’t measure well.
Then I trained a small preference-learning judge model using SimPO.
What happened surprised me:
Training accuracy → 100%
Held-out accuracy → 25%
Classic overfitting.
But the real lesson wasn’t about the model.
It was about the data.
After fixing dataset construction:
Held-out accuracy improved to 0.417 (Delta A: +25pp over the untrained baseline)
A carefully prompted untrained model scored 0.833
👉 Conclusion:
At this scale, judging B2B sales tone is mostly a prompt-following problem, not a preference-learning problem.
Project Links
Dataset: https://huggingface.co/datasets/eyorata/tenacious_bench_v0.1
Judge Model: https://huggingface.co/eyorata/tenacious-judge-simpo-qwen25-3b
Code: https://github.com/eyorata/sales_evaluation_bench
Total experiment cost: $0.041
The Problem: Existing Benchmarks Miss Real Sales Failures
Benchmarks like τ²-Bench retail, MT-Bench, or AlpacaEval are excellent at evaluating:
tool use
reasoning
conversation flow
But they don’t measure what actually kills B2B deals.
The agent I wanted to evaluate had to:
interpret hiring signals (funding, layoffs, leadership changes)
segment prospects correctly
write grounded outreach emails
avoid over-promising capacity
respect opt-outs and booking rules
Retail benchmarks simply don’t test these behaviors.
Example real failures from earlier experiments:
Auto-booking meetings when prospects only said “let me check my calendar.”
Re-engaging after opt-out, risking brand damage.
Those failures cost real money — but no public benchmark grades them.
So I built one.
Designing the Benchmark
The rule I set early:
Every rubric must be machine-gradable.
No vague scoring like “sounds professional.”
Instead, tasks check things like:
banned phrases absent
at least one signal referenced
no unsupported commitments
tone markers satisfied
correct action class detected
Each task returns a numeric score between 0 and 1.
No humans needed during evaluation.
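To make "machine-gradable" concrete, here is a minimal sketch of what one such check function can look like. The banned phrases, regexes, and the tone-marker heuristic are illustrative placeholders, not the benchmark's actual rubric configuration.

```python
import re

# Illustrative rubric pieces: each real task defines its own banned phrases,
# required signals, and tone markers.
BANNED_PHRASES = ["i hope this email finds you well", "just checking in"]

def score_email(email: str, required_signals: list[str]) -> float:
    """Return a machine-gradable score in [0, 1] for one outreach draft."""
    text = email.lower()
    checks = [
        # banned phrases absent
        not any(p in text for p in BANNED_PHRASES),
        # at least one prospect signal (funding, layoffs, ...) referenced
        any(sig.lower() in text for sig in required_signals),
        # no unsupported capacity commitments
        not re.search(r"\bguarantee\b|\bwe will fill\b", text),
        # tone marker: a concrete figure, year, or quarter appears
        bool(re.search(r"\$\d|\b\d{4}\b|\bq[1-4]\b", text)),
    ]
    return sum(checks) / len(checks)

draft = "You closed your $14M Series A in February. We guarantee three hires by Q3."
print(score_email(draft, required_signals=["Series A"]))  # -> 0.75 (over-promise fails)
```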
The Dataset
266 tasks across five generation modes:
| Mode | Why it exists |
| --- | --- |
| Programmatic generation | Deterministic coverage |
| Trace-derived tasks | Grounded realism |
| Multi-LLM synthesis | Harder edge cases |
| Hand-authored | Adversarial stress testing |
| Style-guide gold pairs | Real preference ground truth |
Partitions:
Train — 50%
Dev — 30%
Held-out — 20%
Preventing Data Leakage
I enforced three contamination checks:
No shared 8-grams between train and held-out tasks
Embedding similarity between train and held-out tasks kept below a threshold
Time-window filtering for public signals
Result: 0 contamination violations.
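For reference, the first two checks can be implemented in a few lines. The embedding model and the 0.9 cosine cutoff below are illustrative choices, not necessarily what the benchmark ships with.

```python
from sentence_transformers import SentenceTransformer, util

def ngrams(text: str, n: int = 8) -> set:
    """All word n-grams of a task's text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contaminated(train_tasks: list[str], heldout_tasks: list[str],
                 sim_threshold: float = 0.9) -> bool:
    # Check 1: no shared 8-grams between any train and held-out task.
    train_grams = set().union(*(ngrams(t) for t in train_tasks))
    if any(ngrams(t) & train_grams for t in heldout_tasks):
        return True

    # Check 2: max embedding cosine similarity must stay below the threshold.
    # Encoder and cutoff are illustrative, not the benchmark's exact settings.
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    sims = util.cos_sim(encoder.encode(heldout_tasks), encoder.encode(train_tasks))
    return bool((sims > sim_threshold).any())
```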
Why I Chose Preference Training (Path B)
Week 10 analysis showed the model could already write fluent emails.
The real problem was:
👉 it couldn’t judge its own output.
So instead of improving generation, I trained a judge model using SimPO.
Setup (a minimal code sketch follows this list):
Algorithm: SimPO (reference-free preference learning)
Trainer: TRL CPOTrainer
Backbone: Qwen2.5-3B
LoRA fine-tuning
Hardware: free Colab T4
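In TRL, SimPO lives inside the CPO trainer: you set loss_type="simpo" and zero out the CPO regularization term. My actual run went through Unsloth QLoRA on the T4; the sketch below is the plain-TRL equivalent, with illustrative hyperparameters and a hypothetical local pairs file.

```python
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import CPOConfig, CPOTrainer

# Backbone: Qwen2.5-3B (instruct variant assumed here).
model_name = "Qwen/Qwen2.5-3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference pairs with "prompt", "chosen", "rejected" columns.
# Hypothetical local file; adapt to however you store your pairs.
train_ds = load_dataset("json", data_files="train_pairs.jsonl", split="train")

args = CPOConfig(
    output_dir="tenacious-judge-simpo",
    loss_type="simpo",   # SimPO loss inside TRL's CPO trainer
    cpo_alpha=0.0,       # drop the extra NLL term -> pure SimPO
    simpo_gamma=0.5,     # target reward margin (illustrative value)
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-6,
    num_train_epochs=1,
)

trainer = CPOTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    processing_class=tokenizer,   # `tokenizer=` on older TRL releases
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
trainer.train()
```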
The First Run: Perfect Training, Terrible Reality
Training looked amazing:
loss dropped smoothly
train accuracy hit 1.00
reward margins increased
But evaluation stayed stuck:
Train accuracy: 1.00
Held-out accuracy: 0.25
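"Accuracy" for a judge like this means pairwise preference accuracy: on each held-out pair, does the model assign a higher (length-normalized, SimPO-style) score to the chosen email than to the rejected one? A rough evaluation sketch, assuming the fine-tuned model and tokenizer are already loaded:

```python
import torch

@torch.no_grad()
def avg_logprob(model, tokenizer, prompt: str, response: str) -> float:
    """Mean log-prob of the response tokens given the prompt (SimPO's implicit reward).

    Assumes the prompt's tokenization is a prefix of the tokenization of
    prompt + response, which holds approximately for most tokenizers.
    """
    device = model.device
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids.to(device)
    logits = model(full_ids).logits[:, :-1]          # position t predicts token t + 1
    targets = full_ids[:, 1:]
    logps = torch.log_softmax(logits, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return logps[:, prompt_len - 1:].mean().item()   # keep only the response tokens

def pairwise_accuracy(model, tokenizer, pairs) -> float:
    """pairs: iterable of dicts with 'prompt', 'chosen', 'rejected' fields."""
    hits = [
        avg_logprob(model, tokenizer, p["prompt"], p["chosen"])
        > avg_logprob(model, tokenizer, p["prompt"], p["rejected"])
        for p in pairs
    ]
    return sum(hits) / len(hits)
```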
This is the moment many ML projects go wrong.
The instinct is:
bigger model
more steps
different hyperparameters
I almost did that.
Instead, I read the data.
The Real Problem Was the Dataset
Training examples used templated synthetic emails:
“Thank you for your interest…”
Held-out examples were real style-guide drafts:
“You closed your $14M Series A in February…”
The model learned a useless shortcut:
👉 prefer one template phrase over another.
It wasn’t learning tone — it was learning templates.
The Fix
I didn’t retrain immediately.
I fixed the data.
Using a stronger model, I rewrote all training “chosen” examples into authentic Tenacious voice, enforcing:
five tone markers
banned phrase rules
grounded signals
evaluator score ≥ 0.7
Cost: $0.04
Same algorithm. Same setup.
Only the data changed.
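The rewrite was a plain filter loop: ask a stronger model (via OpenRouter) to redraft each "chosen" email in the house voice, keep it only if the evaluator scores it at or above 0.7, otherwise retry. A sketch, with an illustrative model name and with score_email from the rubric sketch above standing in for the real evaluator:

```python
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="...")

REWRITE_PROMPT = """Rewrite this outreach email in the Tenacious voice:
- hit all five tone markers
- use no banned phrases
- only cite signals present in the prospect notes
Prospect notes: {signals}
Draft: {draft}"""

def rewrite_chosen(draft: str, signals: list[str], max_tries: int = 3) -> str | None:
    """Return a rewritten 'chosen' email the evaluator scores >= 0.7, else None."""
    for _ in range(max_tries):
        resp = client.chat.completions.create(
            model="anthropic/claude-3.5-sonnet",   # any stronger model works here
            messages=[{"role": "user", "content": REWRITE_PROMPT.format(
                signals=", ".join(signals), draft=draft)}],
        )
        candidate = resp.choices[0].message.content
        if score_email(candidate, required_signals=signals) >= 0.7:
            return candidate
    return None   # drop the pair from v2 if no rewrite passes the evaluator
```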
The Honest Results
| Metric | v1 | v2 |
| --- | --- | --- |
| Train accuracy | 1.00 | 1.00 |
| Held-out accuracy | 0.25 | 0.417 |
| Delta A vs baseline | 0 | +25pp |
| Prompt baseline | n/a | 0.833 |
| Latency | 258 ms | 417 ms |
Finding #1 — Training Helped
The trained judge beat the untrained backbone.
So the methodology worked.
Finding #2 — Prompting Won Anyway
A carefully designed rubric prompt on the same backbone scored:
0.833 accuracy
No training required.
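The baseline is nothing exotic: the same backbone, zero-shot, with the rubric spelled out and a forced A/B verdict. The wording below is my reconstruction, not the exact prompt:

```python
JUDGE_PROMPT = """You are grading two B2B outreach emails against this rubric:
1. References at least one real prospect signal (funding, layoffs, leadership change).
2. Contains no banned template phrases ("I hope this email finds you well", ...).
3. Makes no commitments the sender cannot support.
4. Respects opt-outs and booking rules.

Prospect notes:
{signals}

Email A:
{email_a}

Email B:
{email_b}

Which email better satisfies the rubric? Answer with exactly one letter: A or B."""

def judge(email_a: str, email_b: str, signals: str, generate) -> str:
    """`generate` is any text-generation callable wrapping the untrained backbone."""
    answer = generate(JUDGE_PROMPT.format(signals=signals, email_a=email_a, email_b=email_b))
    return "A" if answer.strip().upper().startswith("A") else "B"
```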
The Real Lesson
At this scale:
B2B tone judgment is a prompt-following problem more than a preference-learning problem.
The base model already understands tone.
It just needs explicit rules.
This is a legitimate negative result — and an important one.
About Delta C
I didn’t claim cross-benchmark improvement.
The model wasn’t trained on retail tasks, so comparing against τ²-Bench retail would be misleading.
Sometimes the honest result is:
improvement is domain-specific.
Limitations (Important)
Only 12 held-out tasks currently contain preference pairs.
That means:
wide confidence intervals
small-n uncertainty
This limitation is documented rather than hidden.
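To put numbers on "wide": if 0.833 and 0.417 correspond to 10/12 and 5/12 correct pairs, the 95% Wilson intervals come out at roughly (0.55, 0.95) and (0.19, 0.68), which barely separate. The calculation is just the standard formula:

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

print(wilson_interval(10, 12))  # ~ (0.55, 0.95): the 0.833 prompt baseline
print(wilson_interval(5, 12))   # ~ (0.19, 0.68): the 0.417 trained judge
```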
What’s Next
Dataset v0.2
expand preference slice from 12 → 30 tasks
clarify rubric ambiguity detected during calibration
Model v0.2
Qwen2.5-7B SimPO run
same training recipe
Future Ablation
Compare against a strong commercial model using only prompting.
The Big Engineering Lesson
The hardest decision wasn’t choosing the algorithm.
It was not retraining when training metrics looked perfect.
Clean training loss often means:
👉 the model learned something easy, not something useful.
Fixing the data cost $0.04.
Blindly scaling compute would have cost days.
If Your Training Loss Looks Too Good…
It probably is.
Check the data before blaming the model.
Acknowledgements
Work completed within the 10Academy TRP1 program using:
TRL + SimPO
Unsloth QLoRA training
Google Colab T4
OpenRouter multi-LLM routing
Dataset citation:
@dataset{tenacious_bench_v01_2026,
  title   = {Tenacious-Bench},
  author  = {Nebiyu, Eyoel},
  year    = 2026,
  version = {0.1},
  license = {CC-BY-4.0}
}