<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Eyoel Nebiyu</title>
    <description>The latest articles on DEV Community by Eyoel Nebiyu (@eyorata).</description>
    <link>https://dev.to/eyorata</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3909051%2F0914ec61-85f3-423b-97d3-0dc9931802d9.jpeg</url>
      <title>DEV Community: Eyoel Nebiyu</title>
      <link>https://dev.to/eyorata</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/eyorata"/>
    <language>en</language>
    <item>
      <title>When Your Training Loss Is Lying to You: Building a Tenacious-Specific Sales Outreach Benchmark</title>
      <dc:creator>Eyoel Nebiyu</dc:creator>
      <pubDate>Sat, 02 May 2026 12:51:59 +0000</pubDate>
      <link>https://dev.to/eyorata/when-your-training-loss-is-lying-to-you-building-a-tenacious-specific-sales-outreach-benchmark-2jgd</link>
      <guid>https://dev.to/eyorata/when-your-training-loss-is-lying-to-you-building-a-tenacious-specific-sales-outreach-benchmark-2jgd</guid>
      <description>&lt;p&gt;This post documents a real negative result: my trained model worked… but a well-written prompt worked better.&lt;/p&gt;

&lt;h2&gt;TL;DR&lt;/h2&gt;

&lt;p&gt;I built a 266-task evaluation benchmark for B2B sales-outreach agents — something existing benchmarks don’t measure well.&lt;/p&gt;

&lt;p&gt;Then I trained a small preference-learning judge model using SimPO.&lt;/p&gt;

&lt;p&gt;What happened surprised me:&lt;/p&gt;

&lt;p&gt;Training accuracy → 100%&lt;br&gt;
Held-out accuracy → 25%&lt;/p&gt;

&lt;p&gt;Classic overfitting.&lt;/p&gt;

&lt;p&gt;But the real lesson wasn’t about the model.&lt;/p&gt;

&lt;p&gt;It was about the data.&lt;/p&gt;

&lt;p&gt;After fixing dataset construction:&lt;/p&gt;

&lt;p&gt;Held-out accuracy improved to 0.417 (Delta A +25pp)&lt;br&gt;
A carefully prompted untrained model scored 0.833&lt;/p&gt;

&lt;p&gt;👉 Conclusion:&lt;br&gt;
At this scale, judging B2B sales tone is mostly a prompt-following problem, not a preference-learning problem.&lt;/p&gt;

&lt;h2&gt;Project Links&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Dataset: &lt;a href="https://huggingface.co/datasets/eyorata/tenacious_bench_v0.1" rel="noopener noreferrer"&gt;https://huggingface.co/datasets/eyorata/tenacious_bench_v0.1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Judge Model: &lt;a href="https://huggingface.co/eyorata/tenacious-judge-simpo-qwen25-3b" rel="noopener noreferrer"&gt;https://huggingface.co/eyorata/tenacious-judge-simpo-qwen25-3b&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Code: &lt;a href="https://github.com/eyorata/sales_evaluation_bench" rel="noopener noreferrer"&gt;https://github.com/eyorata/sales_evaluation_bench&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Total experiment cost: $0.041&lt;/p&gt;

&lt;h2&gt;The Problem: Existing Benchmarks Miss Real Sales Failures&lt;/h2&gt;

&lt;p&gt;Benchmarks like τ²-Bench retail, MT-Bench, or AlpacaEval are excellent at evaluating:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;tool use&lt;/li&gt;
&lt;li&gt;reasoning&lt;/li&gt;
&lt;li&gt;conversation flow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But they don’t measure what actually kills B2B deals.&lt;/p&gt;

&lt;p&gt;The agent I wanted to evaluate had to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;interpret hiring signals (funding, layoffs, leadership changes)&lt;/li&gt;
&lt;li&gt;segment prospects correctly&lt;/li&gt;
&lt;li&gt;write grounded outreach emails&lt;/li&gt;
&lt;li&gt;avoid over-promising capacity&lt;/li&gt;
&lt;li&gt;respect opt-outs and booking rules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Retail benchmarks simply don’t test these behaviors.&lt;/p&gt;

&lt;p&gt;Example real failures from earlier experiments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Auto-booking meetings when prospects only said “let me check my calendar.”&lt;/li&gt;
&lt;li&gt;Re-engaging after opt-out, risking brand damage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those failures cost real money — but no public benchmark grades them.&lt;/p&gt;

&lt;p&gt;So I built one.&lt;/p&gt;

&lt;h2&gt;Designing the Benchmark&lt;/h2&gt;

&lt;p&gt;The rule I set early:&lt;/p&gt;

&lt;p&gt;Every rubric must be machine-gradable.&lt;/p&gt;

&lt;p&gt;No vague scoring like “sounds professional.”&lt;/p&gt;

&lt;p&gt;Instead, tasks check things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;banned phrases absent&lt;/li&gt;
&lt;li&gt;at least one signal referenced&lt;/li&gt;
&lt;li&gt;no unsupported commitments&lt;/li&gt;
&lt;li&gt;tone markers satisfied&lt;/li&gt;
&lt;li&gt;correct action class detected&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each task returns a numeric score between 0 and 1.&lt;/p&gt;

&lt;p&gt;No humans needed during evaluation.&lt;/p&gt;
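
&lt;p&gt;To make that concrete, here is a minimal sketch of what one machine-gradable rubric check can look like. The banned phrases, tone markers, and function names are illustrative assumptions, not the benchmark’s actual lists or code.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative sketch of a machine-gradable rubric; the phrase and marker
# lists are made-up examples, not the benchmark's real ones.
BANNED_PHRASES = ["synergy", "game-changing", "revolutionize"]
TONE_MARKERS = ["you", "your team", "specific", "next step"]

def score_email(email, signals):
    """Average of binary rubric checks, so the score lands in [0, 1]."""
    text = email.lower()
    checks = [
        all(p not in text for p in BANNED_PHRASES),  # banned phrases absent
        any(s.lower() in text for s in signals),     # at least one signal referenced
        sum(m in text for m in TONE_MARKERS) &gt;= 2,   # tone markers satisfied
    ]
    return sum(checks) / len(checks)

print(score_email("You closed your $14M Series A in February...",
                  ["Series A"]))  # 0.67 under these toy rules
&lt;/code&gt;&lt;/pre&gt;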

&lt;h2&gt;The Dataset&lt;/h2&gt;

&lt;p&gt;266 tasks across five generation modes:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Mode&lt;/th&gt;&lt;th&gt;Why it exists&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Programmatic generation&lt;/td&gt;&lt;td&gt;deterministic coverage&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Trace-derived tasks&lt;/td&gt;&lt;td&gt;grounded realism&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Multi-LLM synthesis&lt;/td&gt;&lt;td&gt;harder edge cases&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Hand-authored adversarial&lt;/td&gt;&lt;td&gt;stress testing&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Style-guide gold pairs&lt;/td&gt;&lt;td&gt;real preference ground truth&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Partitions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Train — 50%&lt;/li&gt;
&lt;li&gt;Dev — 30%&lt;/li&gt;
&lt;li&gt;Held-out — 20%&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Preventing Data Leakage&lt;/h2&gt;

&lt;p&gt;I enforced three contamination checks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;No shared 8-grams between train and held-out tasks&lt;/li&gt;
&lt;li&gt;Embedding similarity threshold&lt;/li&gt;
&lt;li&gt;Time-window filtering for public signals&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Result: 0 contamination violations.&lt;/p&gt;
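
&lt;p&gt;As a sketch of how the first check can be implemented (the task schema here is an assumption, not the repo’s exact code):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Minimal sketch of the shared-8-gram contamination check.
# The task fields ("id", "text") are assumed, not the dataset's real schema.

def ngrams(text, n=8):
    """Lowercase word-level n-grams of a task's text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_violations(train_tasks, heldout_tasks, n=8):
    """Return (train_id, heldout_id) pairs that share any n-gram."""
    train_grams = {t["id"]: ngrams(t["text"], n) for t in train_tasks}
    violations = []
    for h in heldout_tasks:
        h_grams = ngrams(h["text"], n)
        for tid, grams in train_grams.items():
            if not grams.isdisjoint(h_grams):
                violations.append((tid, h["id"]))
    return violations
&lt;/code&gt;&lt;/pre&gt;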

&lt;h2&gt;Why I Chose Preference Training (Path B)&lt;/h2&gt;

&lt;p&gt;Week 10 analysis showed the model could already write fluent emails.&lt;/p&gt;

&lt;p&gt;The real problem was:&lt;/p&gt;

&lt;p&gt;👉 it couldn’t judge its own output.&lt;/p&gt;

&lt;p&gt;So instead of improving generation, I trained a judge model using SimPO.&lt;/p&gt;

&lt;p&gt;Setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Algorithm: SimPO (reference-free preference learning)&lt;/li&gt;
&lt;li&gt;Trainer: TRL CPOTrainer&lt;/li&gt;
&lt;li&gt;Backbone: Qwen2.5-3B&lt;/li&gt;
&lt;li&gt;LoRA fine-tuning&lt;/li&gt;
&lt;li&gt;Hardware: free Colab T4&lt;/li&gt;
&lt;/ul&gt;
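
&lt;p&gt;For orientation, here is a minimal sketch of that setup, assuming a recent TRL version (TRL implements the SimPO loss through CPOTrainer via loss_type="simpo"). The hyperparameter values and the Instruct variant of the backbone are placeholders, not my actual run config:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch of the training setup: SimPO (via TRL's CPOTrainer) on a LoRA-wrapped
# Qwen2.5-3B. All hyperparameter values below are placeholders.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import CPOConfig, CPOTrainer

model_id = "Qwen/Qwen2.5-3B-Instruct"  # Instruct variant assumed
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Assumes a split with prompt/chosen/rejected columns, as CPOTrainer expects.
dataset = load_dataset("eyorata/tenacious_bench_v0.1", split="train")

config = CPOConfig(
    output_dir="tenacious-judge-simpo",
    loss_type="simpo",  # reference-free SimPO loss
    cpo_alpha=0.0,      # drop the CPO behavior-cloning term, leaving pure SimPO
    simpo_gamma=0.5,    # target reward margin (placeholder)
    per_device_train_batch_size=2,
    learning_rate=8e-7,
)

trainer = CPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
    peft_config=LoraConfig(r=16, lora_alpha=32),  # LoRA instead of full fine-tuning
)
trainer.train()
&lt;/code&gt;&lt;/pre&gt;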
&lt;h2&gt;The First Run: Perfect Training, Terrible Reality&lt;/h2&gt;

&lt;p&gt;Training looked amazing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;loss dropped smoothly&lt;/li&gt;
&lt;li&gt;train accuracy hit 1.00&lt;/li&gt;
&lt;li&gt;reward margins increased&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But evaluation stayed stuck:&lt;/p&gt;

&lt;p&gt;Train accuracy: 1.00&lt;br&gt;
Held-out accuracy: 0.25&lt;/p&gt;

&lt;p&gt;This is the moment many ML projects go wrong.&lt;/p&gt;

&lt;p&gt;The instinct is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;bigger model&lt;/li&gt;
&lt;li&gt;more steps&lt;/li&gt;
&lt;li&gt;different hyperparameters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I almost did that.&lt;/p&gt;

&lt;p&gt;Instead, I read the data.&lt;/p&gt;

&lt;h2&gt;The Real Problem Was the Dataset&lt;/h2&gt;

&lt;p&gt;Training examples used templated synthetic emails:&lt;/p&gt;

&lt;p&gt;“Thank you for your interest…”&lt;/p&gt;

&lt;p&gt;Held-out examples were real style-guide drafts:&lt;/p&gt;

&lt;p&gt;“You closed your $14M Series A in February…”&lt;/p&gt;

&lt;p&gt;The model learned a useless shortcut:&lt;/p&gt;

&lt;p&gt;👉 prefer one template phrase over another.&lt;/p&gt;

&lt;p&gt;It wasn’t learning tone — it was learning templates.&lt;/p&gt;

&lt;h2&gt;The Fix&lt;/h2&gt;

&lt;p&gt;I didn’t retrain immediately.&lt;/p&gt;

&lt;p&gt;I fixed the data.&lt;/p&gt;

&lt;p&gt;Using a stronger model, I rewrote all training “chosen” examples into authentic Tenacious voice, enforcing four constraints (sketched after the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;five tone markers&lt;/li&gt;
&lt;li&gt;banned phrase rules&lt;/li&gt;
&lt;li&gt;grounded signals&lt;/li&gt;
&lt;li&gt;evaluator score ≥ 0.7&lt;/li&gt;
&lt;/ul&gt;
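
&lt;p&gt;A hypothetical sketch of that rewrite-and-filter loop (the helper names stand in for the stronger-model call and the rubric evaluator; neither is the repo’s actual code):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical rewrite-and-filter loop over the training preference pairs.
# rewrite_in_tenacious_voice() and score_email() are stand-ins for the
# stronger-model rewrite call and the rubric evaluator.

def fix_chosen_examples(pairs, min_score=0.7, max_attempts=3):
    """Replace each 'chosen' email with a rewrite that passes the evaluator."""
    for pair in pairs:
        for _ in range(max_attempts):
            draft = rewrite_in_tenacious_voice(pair["chosen"])
            if score_email(draft, pair["signals"]) &gt;= min_score:
                pair["chosen"] = draft
                break
    return pairs
&lt;/code&gt;&lt;/pre&gt;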

&lt;p&gt;Cost: $0.04&lt;/p&gt;

&lt;p&gt;Same algorithm. Same setup.&lt;/p&gt;

&lt;p&gt;Only the data changed.&lt;/p&gt;

&lt;h2&gt;The Honest Results&lt;/h2&gt;

&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Metric&lt;/th&gt;&lt;th&gt;v1&lt;/th&gt;&lt;th&gt;v2&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Train accuracy&lt;/td&gt;&lt;td&gt;1.00&lt;/td&gt;&lt;td&gt;1.00&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Held-out accuracy&lt;/td&gt;&lt;td&gt;0.25&lt;/td&gt;&lt;td&gt;0.417&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Delta A vs baseline&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;+25pp&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Prompt baseline&lt;/td&gt;&lt;td&gt;—&lt;/td&gt;&lt;td&gt;0.833&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Latency&lt;/td&gt;&lt;td&gt;258ms&lt;/td&gt;&lt;td&gt;417ms&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;h2&gt;Finding #1 — Training Helped&lt;/h2&gt;

&lt;p&gt;The trained judge beat the untrained backbone.&lt;/p&gt;

&lt;p&gt;So the methodology worked.&lt;/p&gt;

&lt;h2&gt;Finding #2 — Prompting Won Anyway&lt;/h2&gt;

&lt;p&gt;A carefully designed rubric prompt on the same backbone scored:&lt;/p&gt;

&lt;p&gt;0.833 accuracy&lt;/p&gt;

&lt;p&gt;No training required.&lt;/p&gt;
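
&lt;p&gt;For a sense of what such a baseline looks like, here is an illustrative rubric prompt. The rules and wording of my actual prompt differ, so treat this as a sketch of the shape, not the prompt itself:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative rubric prompt for pairwise judging with the untrained backbone.
# The rules and wording are examples, not my actual prompt.
RUBRIC_PROMPT = """You are judging two B2B outreach emails, A and B.
Prefer the email that:
1. References at least one concrete prospect signal (funding, hiring, layoffs).
2. Avoids banned filler phrases ("touching base", "quick call").
3. Makes no commitments the sender cannot support.
Answer with exactly one letter: A or B.

Email A:
{email_a}

Email B:
{email_b}
"""

def judge_prefers_a(email_a, email_b, generate):
    """generate() is any text-completion callable over the same backbone."""
    answer = generate(RUBRIC_PROMPT.format(email_a=email_a, email_b=email_b))
    return answer.strip().upper().startswith("A")
&lt;/code&gt;&lt;/pre&gt;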

&lt;h2&gt;The Real Lesson&lt;/h2&gt;

&lt;p&gt;At this scale:&lt;/p&gt;

&lt;p&gt;B2B tone judgment is a prompt-following problem more than a preference-learning problem.&lt;/p&gt;

&lt;p&gt;The base model already understands tone.&lt;/p&gt;

&lt;p&gt;It just needs explicit rules.&lt;/p&gt;

&lt;p&gt;This is a legitimate negative result — and an important one.&lt;/p&gt;

&lt;h2&gt;About Delta C&lt;/h2&gt;

&lt;p&gt;I didn’t claim cross-benchmark improvement.&lt;/p&gt;

&lt;p&gt;The model wasn’t trained on retail tasks, so comparing against τ²-Bench retail would be misleading.&lt;/p&gt;

&lt;p&gt;Sometimes the honest result is:&lt;/p&gt;

&lt;p&gt;improvement is domain-specific.&lt;/p&gt;

&lt;h2&gt;Limitations (Important)&lt;/h2&gt;

&lt;p&gt;Only 12 held-out tasks currently contain preference pairs.&lt;/p&gt;

&lt;p&gt;That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;wide confidence intervals&lt;/li&gt;
&lt;li&gt;small-n uncertainty&lt;/li&gt;
&lt;/ul&gt;
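
&lt;p&gt;To put a number on that width: assuming 0.417 on 12 tasks means 5 of 12 correct (my inference from the reported accuracy, not a stated count), a 95% Wilson interval is roughly 0.19 to 0.68:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Wilson 95% interval for held-out accuracy, assuming 5 of 12 correct
# (that count is inferred from the reported 0.417, not stated in the post).
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

lo, hi = wilson_interval(5, 12)
print(f"{lo:.2f} to {hi:.2f}")  # roughly 0.19 to 0.68: very wide
&lt;/code&gt;&lt;/pre&gt;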

&lt;p&gt;This limitation is documented rather than hidden.&lt;/p&gt;

&lt;h2&gt;What’s Next&lt;/h2&gt;

&lt;h3&gt;Dataset v0.2&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;expand preference slice from 12 → 30 tasks&lt;/li&gt;
&lt;li&gt;clarify rubric ambiguity detected during calibration&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Model v0.2&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Qwen2.5-7B SimPO run&lt;/li&gt;
&lt;li&gt;same training recipe&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Future Ablation&lt;/h3&gt;

&lt;p&gt;Compare against a strong commercial model using only prompting.&lt;/p&gt;

&lt;h2&gt;The Big Engineering Lesson&lt;/h2&gt;

&lt;p&gt;The hardest decision wasn’t choosing the algorithm.&lt;/p&gt;

&lt;p&gt;It was not retraining when training metrics looked perfect.&lt;/p&gt;

&lt;p&gt;Clean training loss often means:&lt;/p&gt;

&lt;p&gt;👉 the model learned something easy, not something useful.&lt;/p&gt;

&lt;p&gt;Fixing the data cost $0.04.&lt;/p&gt;

&lt;p&gt;Blindly scaling compute would have cost days.&lt;/p&gt;

&lt;h2&gt;If Your Training Loss Looks Too Good…&lt;/h2&gt;

&lt;p&gt;It probably is.&lt;/p&gt;

&lt;p&gt;Check the data before blaming the model.&lt;/p&gt;

&lt;h2&gt;Acknowledgements&lt;/h2&gt;

&lt;p&gt;Work completed within the 10Academy TRP1 program using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TRL + SimPO&lt;/li&gt;
&lt;li&gt;Unsloth QLoRA training&lt;/li&gt;
&lt;li&gt;Google Colab T4&lt;/li&gt;
&lt;li&gt;OpenRouter multi-LLM routing&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code&gt;@dataset{tenacious_bench_v01_2026,
  title   = {Tenacious-Bench},
  author  = {Nebiyu, Eyoel},
  year    = 2026,
  version = {0.1},
  license = {CC-BY-4.0}
}&lt;/code&gt;&lt;/pre&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
