The Gap
General-purpose LLM benchmarks like τ²-Bench evaluate task completion in retail domains - cancelling orders, processing returns, checking inventory. They cannot answer the question a B2B sales team actually needs answered: does this outreach email say the right thing to the right buyer?
Tenacious Consulting runs four distinct buyer segments - high-growth startups, restructuring companies, mature enterprises, and AI-transformation plays. An email that correctly pitches cost-cutting to a restructuring company is PASS. The identical email sent to a Series B startup that is hiring aggressively is FAIL. τ²-Bench has no rubric for this. No public benchmark does.
The Audit
We documented eight specific failure modes from real pipeline traces that existing benchmarks miss:
Segment misrouting - email pitched to wrong buyer segment despite correct ICP classification
Signal overclaiming - asserting aggressive hiring intent from a single job post
Tone drift - condescension or urgency language that violates the style guide
Injection edge cases - prompt injection via the prospect notes field bypassing ToneGuard
Bench over-commitment - promising consultant availability not in the current bench summary
Competitor gap framing - technically correct gap analysis that reads as arrogant
AI maturity mismatch - pitching ML platform migration to a company with no data layer
Multi-thread leakage - simultaneous outreach to co-founder and VP leaking context
Each failure mode maps to at least three real traces from our Week 10 pipeline run.
Building the Dataset With No Labeled Data
Tenacious had no historical labeled prospects. We built 202 tasks from scratch using a three-mode authoring pipeline:
Programmatic (32%): Templates with structured slots - company size, segment, funding stage, AI maturity score, bench state - populated by combinatorial expansion. One "bench over-commitment" probe becomes 20 tasks by varying inputs (see the expansion sketch after this list).
Multi-LLM synthesis (48%): GPT-4o-mini authored hard cases anchored to the failure taxonomy; Llama-3.1-70B judge-filtered them for coherence and rubric applicability. Using different model families for generation and judging prevents preference leakage (Li et al., 2025).
Hand-authored adversarial (20%): The hardest 40 tasks written manually - XSS payloads in notes fields, subtle wrong-segment framing, deadline-pressure tone that passes surface checks but fails the style guide.
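The programmatic mode is plain combinatorics. Here is a minimal sketch; the slot names and values are hypothetical stand-ins, since the post doesn't publish the real slot vocabulary:

```python
from itertools import product

# Hypothetical slot values for one "bench over-commitment" probe template.
# The actual Tenacious-Bench slot vocabulary is an assumption here.
SLOTS = {
    "segment": ["high-growth startup", "restructuring",
                "mature enterprise", "ai-transformation"],
    "bench_state": ["2 consultants free", "fully booked",
                    "1 consultant free from June"],
    # ... further slots: company size, funding stage, AI maturity score
}

TEMPLATE = (
    "Prospect segment: {segment}. Current bench: {bench_state}. "
    "Draft outreach must not promise availability beyond the bench summary."
)

def expand(template: str, slots: dict) -> list[dict]:
    """Expand one probe template into one task per slot combination."""
    keys = list(slots)
    tasks = []
    for values in product(*(slots[k] for k in keys)):
        filled = dict(zip(keys, values))
        tasks.append({"prompt": template.format(**filled), "slots": filled})
    return tasks

tasks = expand(TEMPLATE, SLOTS)
print(len(tasks))  # 4 x 3 = 12 variants from this two-slot example
```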
Contamination prevention: 8-gram overlap screening, embedding cosine-similarity filtering (reject at >= 0.85), and time-shift verification against a frozen April 2026 signal window. Zero violations before sealing the held-out partition.
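A rough sketch of the two text-level checks follows. The embedding model (all-MiniLM-L6-v2 via sentence-transformers) is an assumption; the post doesn't name the one actually used:

```python
import re
from sentence_transformers import SentenceTransformer, util

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Lowercased word 8-grams, the unit used for overlap screening."""
    toks = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def has_overlap(candidate: str, corpus: list[str], n: int = 8) -> bool:
    """Flag a candidate task if any 8-gram also appears in the reference corpus."""
    cand = ngrams(candidate, n)
    return any(cand & ngrams(doc, n) for doc in corpus)

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical model choice

def too_similar(candidate: str, corpus: list[str], threshold: float = 0.85) -> bool:
    """Reject candidates whose cosine similarity to any corpus doc is >= threshold."""
    emb = embedder.encode([candidate] + corpus, convert_to_tensor=True)
    sims = util.cos_sim(emb[0], emb[1:])
    return bool((sims >= threshold).any())
```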
The Training Experiment
Why Path B (preference-tuned judge)? Our failure modes are judgment failures, not generation failures. The pipeline produces fluent, well-written emails - it just sometimes produces them for the wrong segment. SFT would improve surface quality of already-good emails. A DPO-trained judge learns to catch the judgment errors.
Why DPO over SimPO/ORPO? DPO's full-sequence reward doesn't dilute the signal from 1-2 sentence segment-alignment errors. SimPO's length-normalized reward would. With 279 pairs and one key hyperparameter (β=0.1), DPO was also simpler to debug on a constrained compute budget.
Training: Qwen2.5-0.5B-Instruct + LoRA (r=16, α=32, 8.8M trainable params). Pure PyTorch DPO loop on Google Colab T4, ~47 minutes, loss 1.67 -> 0.009.
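For readers who want the shape of that pure-PyTorch loop, the core is the standard DPO objective with β=0.1. This is a sketch with illustrative function names, not the project's exact training code:

```python
import torch
import torch.nn.functional as F

def sequence_logprob(model, input_ids, attention_mask, labels):
    """Sum of token log-probs over completion tokens (labels == -100 are masked)."""
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    # Shift so the token at position t is predicted from positions < t.
    logits, labels = logits[:, :-1, :], labels[:, 1:]
    logps = torch.log_softmax(logits, dim=-1)
    mask = labels != -100
    token_logps = torch.gather(
        logps, 2, labels.clamp(min=0).unsqueeze(-1)
    ).squeeze(-1)
    return (token_logps * mask).sum(dim=-1)

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO objective: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = (policy_chosen - policy_rejected) - (ref_chosen - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()
```

Because the loss depends only on log-prob margins, β is effectively the one hyperparameter to tune, which is what made debugging tractable on a T4.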
Key implementation detail: The training pairs used email bodies as chosen/rejected completions - not verdict text. This means the model is an implicit reward model, not a verdict generator. The correct evaluation interface is:
reward = β × (log π_DPO(email|prompt) - log π_ref(email|prompt))
Positive reward -> PASS. Negative reward -> FAIL. Asking the model to generate "VERDICT: PASS" produces a 100% PASS bias (22% accuracy); scoring with the implicit reward produces 74%.
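In code, that interface looks roughly like the following, assuming Hugging Face-style causal LMs for both the DPO-tuned policy and the frozen reference (names are illustrative):

```python
import torch

@torch.no_grad()
def completion_logprob(model, tokenizer, prompt: str, email: str) -> torch.Tensor:
    """Sum of log-probs the model assigns to the email tokens, given the prompt."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tokenizer(prompt + email, return_tensors="pt").input_ids
    logits = model(input_ids=ids).logits[:, :-1, :]
    targets = ids[:, 1:]
    logps = torch.log_softmax(logits, dim=-1).gather(
        2, targets.unsqueeze(-1)
    ).squeeze(-1)
    return logps[:, prompt_len - 1:].sum(dim=-1)  # score only the email tokens

@torch.no_grad()
def implicit_reward(policy, ref, tokenizer, prompt, email, beta=0.1) -> float:
    """reward = beta * (log pi_DPO(email | prompt) - log pi_ref(email | prompt))."""
    return float(beta * (completion_logprob(policy, tokenizer, prompt, email)
                         - completion_logprob(ref, tokenizer, prompt, email)))

# verdict = "PASS" if implicit_reward(policy, ref, tok, prompt, email) > 0 else "FAIL"
```

Note that both models must be resident at inference time, which is the 2x VRAM cost flagged in the limitations below.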
The Honest Result
Judge                               | Accuracy | 95% CI
DPO-trained judge (implicit reward) | 74.0%    | [62%, 86%]
Rule evaluator                      | 48.0%    | [34%, 62%]
Prompt judge (qwen3-8b, zero-shot)  | 22.0%    | [12%, 34%]
Delta A = +26pp over the rule evaluator (p=0.0127, paired bootstrap n=10,000, significant at p<0.05).
Delta B (rule vs. zero-shot prompt): +26pp, but p=0.5499 - not significant at n=50. This is a sample-size limitation, not an absence of effect: the zero-shot model predicted PASS for every single task regardless of content. Training is necessary.
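For reference, a paired bootstrap over per-task correctness is only a few lines. This sketch (with assumed 0/1 correctness arrays) mirrors the n=10,000 procedure used above:

```python
import numpy as np

def paired_bootstrap(correct_a: np.ndarray, correct_b: np.ndarray,
                     n_boot: int = 10_000, seed: int = 0) -> float:
    """One-sided p-value for judge A's accuracy gain over judge B being > 0.

    correct_a / correct_b: 0/1 arrays with one entry per held-out task,
    resampled jointly so each bootstrap replicate keeps the pairing intact.
    """
    rng = np.random.default_rng(seed)
    n = len(correct_a)
    idx = rng.integers(0, n, size=(n_boot, n))  # resample task indices
    deltas = correct_a[idx].mean(axis=1) - correct_b[idx].mean(axis=1)
    return float((deltas <= 0).mean())
```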
Honest limitations:
25/50 held-out tasks have a labeling artifact from the LLM synthesis pipeline (GT=FAIL with no failure category). This inflates error counts and suppresses accuracy on synthesis tasks (36% vs 62% on programmatic tasks).
The implicit reward interface requires the reference model to be loaded alongside the trained model (2x VRAM), which limits deployment to GPU endpoints.
n=50 held-out partition is too small for p<0.05 on Delta B. v0.2 needs 300+ tasks per partition.
What's Next
Tenacious-Bench v0.2 should add: multi-turn trajectory tasks, persona-aware tone scoring, live bench inventory validation, and a double-validation step for LLM-synthesis ground truth.
Artifacts
Dataset: https://huggingface.co/datasets/lidya7/tenacious-bench-v01
Trained judge: https://huggingface.co/lidya7/tenacious-judge-lora-v1
Code: https://github.com/lidudagn/the-conversion-engine