General sales benchmarks often miss how real outbound agents fail: overclaiming on weak signals, unsafe “bench” commitments, tone that drifts into pushy follow-ups, and gaps between what the rep promises and what delivery can support. For a class project (TRP1 Week 11), I built Tenacious-Bench v0.1, a compact, machine-scored task set aimed at those failure modes—not generic helpfulness.
What’s in the dataset
The public release is on Hugging Face: https://huggingface.co/datasets/Bnobody/tenacious_bench_v0.1.
It currently exposes 168 rows in the hub viewer, with splits aligned to how I train and evaluate: train (105) and validation (63). Tasks mix several authoring modes—programmatic sweeps, multi-LLM synthesis with judge filtering, trace-informed scenarios, and hand-authored adversarial cases—so the bench isn’t a single-generator monoculture.
Each row includes structured inputs (prospect context, stack, headcount, signal confidence, bench availability, etc.), a candidate outreach payload (subject/body/CTA), explicit ground-truth expectations (e.g. when to hand off vs. qualify), and a versioned scoring rubric so scores are reproducible without hand-waving.
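For anyone who wants to poke at the rows directly, here's a minimal loading sketch with the `datasets` library. The split names match the hub viewer, but the per-row column names are whatever the dataset card defines, so inspect `features` rather than trusting guesses:

```python
from datasets import load_dataset

# Pull the public release from the Hugging Face hub.
ds = load_dataset("Bnobody/tenacious_bench_v0.1")

# Splits as exposed in the hub viewer: train (105 rows), validation (63 rows).
print(ds)

# Print the actual schema and one row; check features rather than assuming
# specific column names.
print(ds["validation"].features)
row = ds["validation"][0]
for key, value in row.items():
    print(f"{key}: {str(value)[:80]}")
```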
Why contamination and provenance matter
Synthetic benchmarks leak in boring ways: near-duplicate phrasing across splits, embedding neighbors that are too close, or “eval” tasks that are effectively the same scenario as training with a date tweak. I run n-gram overlap and embedding-similarity checks, apply an explicit signal-window / provenance policy (train/dev vs. held-out time labeling), and record the outcomes in a JSON report in the repo. The goal isn't perfection; it's to make leakage visible and actionable.
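To make the "near-duplicate phrasing" check concrete, here's a rough sketch of an 8-gram overlap pass between splits. It is not the script from the repo, and the `body` column name is an assumption; swap in whichever field actually carries the outreach text:

```python
from datasets import load_dataset

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Lowercased word n-grams; crude, but enough to flag near-duplicates."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

ds = load_dataset("Bnobody/tenacious_bench_v0.1")

# Assumes a text-bearing column named "body"; substitute the real field name.
train_grams = set().union(*(ngrams(r["body"]) for r in ds["train"]))

flagged = []
for i, row in enumerate(ds["validation"]):
    overlap = ngrams(row["body"]) & train_grams
    if overlap:
        flagged.append((i, len(overlap)))

print(f"{len(flagged)} validation rows share >=1 8-gram with train")
```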
Training angle (Path B)
I’m not publishing a giant SFT corpus here; the project emphasizes a preference-style critic path (ORPO/DPO-style data prep + LoRA training) to catch inconsistency and unsafe commitments. The dataset is the artifact reviewers can actually load; training code and logs live alongside the project README.
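As a rough illustration of the "ORPO/DPO-style data prep" half of that path, this is how bench rows could be mapped into the prompt/chosen/rejected layout that preference trainers expect. The column names (`prospect_context`, `safe_response`, `unsafe_response`) are hypothetical placeholders, not the dataset's actual schema:

```python
from datasets import load_dataset

def to_preference_pair(row: dict) -> dict:
    """Turn one bench row into a prompt/chosen/rejected triple.

    Field names are placeholders; the real columns live in the dataset card.
    "chosen" is a response that respects the ground-truth expectation (e.g.
    hand off instead of overclaiming); "rejected" is the unsafe variant.
    """
    return {
        "prompt": row["prospect_context"],    # hypothetical column name
        "chosen": row["safe_response"],       # hypothetical column name
        "rejected": row["unsafe_response"],   # hypothetical column name
    }

ds = load_dataset("Bnobody/tenacious_bench_v0.1")
pairs = ds["train"].map(to_preference_pair,
                        remove_columns=ds["train"].column_names)
# `pairs` now has the layout TRL's DPO/ORPO trainers expect; from there, wrap
# a base model with a peft LoraConfig and train as usual.
```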
Limitations (stated plainly)
Tasks are synthetic and English-first; they don’t replace live A/B tests or compliance review. The bench is meant as a regression harness for product teams iterating on sales agents, not as proof of real-world lift.
Call to action
If you’re building outbound agents, try grading your model on a slice of these tasks and compare against your internal rubric. I’m especially interested in cases where the model is “fluent” but violates bench/signal safety—those are the rows worth expanding next.
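If it helps, a toy grading loop might look like the sketch below; `my_agent` and `score_with_rubric` are stand-ins you'd replace with your own agent call and a scorer keyed to each row's versioned rubric (the `rubric` column name is also an assumption):

```python
from datasets import load_dataset

def my_agent(task: dict) -> str:
    # Stand-in: call your outbound agent here and return its drafted outreach.
    return "placeholder draft"

def score_with_rubric(draft: str, rubric) -> float:
    # Stand-in: score the draft against the row's versioned rubric
    # (or against your own internal rubric for comparison).
    return 0.0

ds = load_dataset("Bnobody/tenacious_bench_v0.1", split="validation")

scores = []
for task in ds.select(range(20)):   # start with a small slice
    draft = my_agent(task)
    scores.append(score_with_rubric(draft, task.get("rubric")))

print(f"mean rubric score on the slice: {sum(scores) / len(scores):.2f}")
```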