This post documents a real negative result: my trained model worked… but a well-written prompt worked better.
TL;DR
I built a 266-task evaluation benchmark for B2B sales-outreach agents — something existing benchmarks don’t measure well.
Then I trained a small preference-learning judge model using SimPO.
What happened surprised me:
Training accuracy → 100%
Held-out accuracy → 25%
Classic overfitting.
But the real lesson wasn’t about the model.
It was about the data.
After fixing dataset construction:
Held-out accuracy improved to 0.417 (Delta A: +25pp over the untrained baseline)
A carefully prompted untrained model scored 0.833
👉 Conclusion:
At this scale, judging B2B sales tone is mostly a prompt-following problem, not a preference-learning problem.
Project Links
Dataset: https://huggingface.co/datasets/eyorata/tenacious_bench_v0.1
Judge Model: https://huggingface.co/eyorata/tenacious-judge-simpo-qwen25-3b
Code: https://github.com/eyorata/sales_evaluation_bench
Total experiment cost: $0.041
The Problem: Existing Benchmarks Miss Real Sales Failures
Benchmarks like τ²-Bench retail, MT-Bench, or AlpacaEval are excellent at evaluating:
tool use
reasoning
conversation flow
But they don’t measure what actually kills B2B deals.
The agent I wanted to evaluate had to:
interpret hiring signals (funding, layoffs, leadership changes)
segment prospects correctly
write grounded outreach emails
avoid over-promising capacity
respect opt-outs and booking rules
Retail benchmarks simply don’t test these behaviors.
Example real failures from earlier experiments:
Auto-booking meetings when prospects only said “let me check my calendar.”
Re-engaging after opt-out, risking brand damage.
Those failures cost real money — but no public benchmark grades them.
So I built one.
Designing the Benchmark
The rule I set early:
Every rubric must be machine-gradable.
No vague scoring like “sounds professional.”
Instead, tasks check things like:
banned phrases absent
at least one signal referenced
no unsupported commitments
tone markers satisfied
correct action class detected
Each task returns a numeric score between 0 and 1.
No humans needed during evaluation.
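To make "machine-gradable" concrete, here is a minimal sketch of what one such check function can look like. The banned phrases, regexes, and the tone-marker heuristic are illustrative placeholders, not the benchmark's actual rubric configuration.

```python
import re

# Illustrative rubric pieces: each real task defines its own banned phrases,
# required signals, and tone markers.
BANNED_PHRASES = ["i hope this email finds you well", "just checking in"]

def score_email(email: str, required_signals: list[str]) -> float:
    """Return a machine-gradable score in [0, 1] for one outreach draft."""
    text = email.lower()
    checks = [
        # banned phrases absent
        not any(p in text for p in BANNED_PHRASES),
        # at least one prospect signal (funding, layoffs, ...) referenced
        any(sig.lower() in text for sig in required_signals),
        # no unsupported capacity commitments
        not re.search(r"\bguarantee\b|\bwe will fill\b", text),
        # tone marker: a concrete figure, year, or quarter appears
        bool(re.search(r"\$\d|\b\d{4}\b|\bq[1-4]\b", text)),
    ]
    return sum(checks) / len(checks)

draft = "You closed your $14M Series A in February. We guarantee three hires by Q3."
print(score_email(draft, required_signals=["Series A"]))  # -> 0.75 (over-promise fails)
```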
The Dataset
266 tasks across five generation modes:
| Mode | Why it exists |
| --- | --- |
| Programmatic generation | Deterministic coverage |
| Trace-derived tasks | Grounded realism |
| Multi-LLM synthesis | Harder edge cases |
| Hand-authored | Adversarial stress testing |
| Style-guide gold pairs | Real preference ground truth |
Partitions:
Train — 50%
Dev — 30%
Held-out — 20%
Preventing Data Leakage
I enforced three contamination checks:
No shared 8-grams between train and held-out tasks
Embedding similarity between train and held-out tasks kept below a threshold
Time-window filtering for public signals
Result: 0 contamination violations.
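For reference, the first two checks can be implemented in a few lines. The embedding model and the 0.9 cosine cutoff below are illustrative choices, not necessarily what the benchmark ships with.

```python
from sentence_transformers import SentenceTransformer, util

def ngrams(text: str, n: int = 8) -> set:
    """All word n-grams of a task's text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contaminated(train_tasks: list[str], heldout_tasks: list[str],
                 sim_threshold: float = 0.9) -> bool:
    # Check 1: no shared 8-grams between any train and held-out task.
    train_grams = set().union(*(ngrams(t) for t in train_tasks))
    if any(ngrams(t) & train_grams for t in heldout_tasks):
        return True

    # Check 2: max embedding cosine similarity must stay below the threshold.
    # Encoder and cutoff are illustrative, not the benchmark's exact settings.
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    sims = util.cos_sim(encoder.encode(heldout_tasks), encoder.encode(train_tasks))
    return bool((sims > sim_threshold).any())
```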
Why I Chose Preference Training (Path B)
Week 10 analysis showed the model could already write fluent emails.
The real problem was:
👉 it couldn’t judge its own output.
So instead of improving generation, I trained a judge model using SimPO.
Setup (a minimal code sketch follows this list):
Algorithm: SimPO (reference-free preference learning)
Trainer: TRL CPOTrainer
Backbone: Qwen2.5-3B
LoRA fine-tuning
Hardware: free Colab T4
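In TRL, SimPO lives inside the CPO trainer: you set loss_type="simpo" and zero out the CPO regularization term. My actual run went through Unsloth QLoRA on the T4; the sketch below is the plain-TRL equivalent, with illustrative hyperparameters and a hypothetical local pairs file.

```python
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import CPOConfig, CPOTrainer

# Backbone: Qwen2.5-3B (instruct variant assumed here).
model_name = "Qwen/Qwen2.5-3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference pairs with "prompt", "chosen", "rejected" columns.
# Hypothetical local file; adapt to however you store your pairs.
train_ds = load_dataset("json", data_files="train_pairs.jsonl", split="train")

args = CPOConfig(
    output_dir="tenacious-judge-simpo",
    loss_type="simpo",   # SimPO loss inside TRL's CPO trainer
    cpo_alpha=0.0,       # drop the extra NLL term -> pure SimPO
    simpo_gamma=0.5,     # target reward margin (illustrative value)
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-6,
    num_train_epochs=1,
)

trainer = CPOTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    processing_class=tokenizer,   # `tokenizer=` on older TRL releases
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
trainer.train()
```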
The First Run: Perfect Training, Terrible Reality
Training looked amazing:
loss dropped smoothly
train accuracy hit 1.00
reward margins increased
But evaluation stayed stuck:
Train accuracy: 1.00
Held-out accuracy: 0.25
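"Accuracy" for a judge like this means pairwise preference accuracy: on each held-out pair, does the model assign a higher (length-normalized, SimPO-style) score to the chosen email than to the rejected one? A rough evaluation sketch, assuming the fine-tuned model and tokenizer are already loaded:

```python
import torch

@torch.no_grad()
def avg_logprob(model, tokenizer, prompt: str, response: str) -> float:
    """Mean log-prob of the response tokens given the prompt (SimPO's implicit reward).

    Assumes the prompt's tokenization is a prefix of the tokenization of
    prompt + response, which holds approximately for most tokenizers.
    """
    device = model.device
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids.to(device)
    logits = model(full_ids).logits[:, :-1]          # position t predicts token t + 1
    targets = full_ids[:, 1:]
    logps = torch.log_softmax(logits, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return logps[:, prompt_len - 1:].mean().item()   # keep only the response tokens

def pairwise_accuracy(model, tokenizer, pairs) -> float:
    """pairs: iterable of dicts with 'prompt', 'chosen', 'rejected' fields."""
    hits = [
        avg_logprob(model, tokenizer, p["prompt"], p["chosen"])
        > avg_logprob(model, tokenizer, p["prompt"], p["rejected"])
        for p in pairs
    ]
    return sum(hits) / len(hits)
```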
This is the moment many ML projects go wrong.
The instinct is:
bigger model
more steps
different hyperparameters
I almost did that.
Instead, I read the data.
The Real Problem Was the Dataset
Training examples used templated synthetic emails:
“Thank you for your interest…”
Held-out examples were real style-guide drafts:
“You closed your $14M Series A in February…”
The model learned a useless shortcut:
👉 prefer one template phrase over another.
It wasn’t learning tone — it was learning templates.
The Fix
I didn’t retrain immediately.
I fixed the data.
Using a stronger model, I rewrote all training “chosen” examples into authentic Tenacious voice, enforcing:
five tone markers
banned phrase rules
grounded signals
evaluator score ≥ 0.7
Cost: $0.04
Same algorithm. Same setup.
Only the data changed.
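The rewrite was a plain filter loop: ask a stronger model (via OpenRouter) to redraft each "chosen" email in the house voice, keep it only if the evaluator scores it at or above 0.7, otherwise retry. A sketch, with an illustrative model name and with score_email from the rubric sketch above standing in for the real evaluator:

```python
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="...")

REWRITE_PROMPT = """Rewrite this outreach email in the Tenacious voice:
- hit all five tone markers
- use no banned phrases
- only cite signals present in the prospect notes
Prospect notes: {signals}
Draft: {draft}"""

def rewrite_chosen(draft: str, signals: list[str], max_tries: int = 3) -> str | None:
    """Return a rewritten 'chosen' email the evaluator scores >= 0.7, else None."""
    for _ in range(max_tries):
        resp = client.chat.completions.create(
            model="anthropic/claude-3.5-sonnet",   # any stronger model works here
            messages=[{"role": "user", "content": REWRITE_PROMPT.format(
                signals=", ".join(signals), draft=draft)}],
        )
        candidate = resp.choices[0].message.content
        if score_email(candidate, required_signals=signals) >= 0.7:
            return candidate
    return None   # drop the pair from v2 if no rewrite passes the evaluator
```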
The Honest Results
| Metric | v1 | v2 |
| --- | --- | --- |
| Train accuracy | 1.00 | 1.00 |
| Held-out accuracy | 0.25 | 0.417 |
| Delta A vs baseline | 0 | +25pp |
| Prompt baseline | n/a | 0.833 |
| Latency | 258 ms | 417 ms |
Finding #1 — Training Helped
The trained judge beat the untrained backbone.
So the methodology worked.
Finding #2 — Prompting Won Anyway
A carefully designed rubric prompt on the same backbone scored:
0.833 accuracy
No training required.
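The baseline is nothing exotic: the same backbone, zero-shot, with the rubric spelled out and a forced A/B verdict. The wording below is my reconstruction, not the exact prompt:

```python
JUDGE_PROMPT = """You are grading two B2B outreach emails against this rubric:
1. References at least one real prospect signal (funding, layoffs, leadership change).
2. Contains no banned template phrases ("I hope this email finds you well", ...).
3. Makes no commitments the sender cannot support.
4. Respects opt-outs and booking rules.

Prospect notes:
{signals}

Email A:
{email_a}

Email B:
{email_b}

Which email better satisfies the rubric? Answer with exactly one letter: A or B."""

def judge(email_a: str, email_b: str, signals: str, generate) -> str:
    """`generate` is any text-generation callable wrapping the untrained backbone."""
    answer = generate(JUDGE_PROMPT.format(signals=signals, email_a=email_a, email_b=email_b))
    return "A" if answer.strip().upper().startswith("A") else "B"
```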
The Real Lesson
At this scale:
B2B tone judgment is a prompt-following problem more than a preference-learning problem.
The base model already understands tone.
It just needs explicit rules.
This is a legitimate negative result — and an important one.
About Delta C
I didn’t claim cross-benchmark improvement.
The model wasn’t trained on retail tasks, so comparing against τ²-Bench retail would be misleading.
Sometimes the honest result is:
improvement is domain-specific.
Limitations (Important)
Only 12 held-out tasks currently contain preference pairs.
That means:
wide confidence intervals
small-n uncertainty
This limitation is documented rather than hidden.
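To put numbers on "wide": if 0.833 and 0.417 correspond to 10/12 and 5/12 correct pairs, the 95% Wilson intervals come out at roughly (0.55, 0.95) and (0.19, 0.68), which barely separate. The calculation is just the standard formula:

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

print(wilson_interval(10, 12))  # ~ (0.55, 0.95): the 0.833 prompt baseline
print(wilson_interval(5, 12))   # ~ (0.19, 0.68): the 0.417 trained judge
```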
What’s Next
Dataset v0.2
expand preference slice from 12 → 30 tasks
clarify rubric ambiguity detected during calibration
Model v0.2
Qwen2.5-7B SimPO run
same training recipe
Future Ablation
Compare against a strong commercial model using only prompting.
The Big Engineering Lesson
The hardest decision wasn’t choosing the algorithm.
It was not retraining when training metrics looked perfect.
Clean training loss often means:
👉 the model learned something easy, not something useful.
Fixing the data cost $0.04.
Blindly scaling compute would have cost days.
If Your Training Loss Looks Too Good…
It probably is.
Check the data before blaming the model.
Acknowledgements
Work completed within the 10Academy TRP1 program using:
TRL + SimPO
Unsloth QLoRA training
Google Colab T4
OpenRouter multi-LLM routing
Dataset citation:
@dataset{tenacious_bench_v01_2026,
  title   = {Tenacious-Bench},
  author  = {Nebiyu, Eyoel},
  year    = 2026,
  version = {0.1},
  license = {CC-BY-4.0}
}