<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: lidya dagnew</title>
    <description>The latest articles on DEV Community by lidya dagnew (@lidya_dagnew_5d2e8f9a63a3).</description>
    <link>https://dev.to/lidya_dagnew_5d2e8f9a63a3</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3432333%2F071a59bc-9a74-4c2a-a6fc-48431488bc5a.png</url>
      <title>DEV Community: lidya dagnew</title>
      <link>https://dev.to/lidya_dagnew_5d2e8f9a63a3</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/lidya_dagnew_5d2e8f9a63a3"/>
    <language>en</language>
    <item>
      <title>Tenacious-Bench: Building a Sales Domain Evaluation Benchmark When No Dataset Exists</title>
      <dc:creator>lidya dagnew</dc:creator>
      <pubDate>Fri, 01 May 2026 19:13:04 +0000</pubDate>
      <link>https://dev.to/lidya_dagnew_5d2e8f9a63a3/tenacious-bench-building-a-sales-domain-evaluation-benchmark-when-no-dataset-exists-5cam</link>
      <guid>https://dev.to/lidya_dagnew_5d2e8f9a63a3/tenacious-bench-building-a-sales-domain-evaluation-benchmark-when-no-dataset-exists-5cam</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;The Gap&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;General-purpose LLM benchmarks like τ²-Bench evaluate task completion in retail domains - cancelling orders, processing returns, checking inventory. They cannot answer the question a B2B sales team actually needs answered: does this outreach email say the right thing to the right buyer?&lt;/p&gt;

&lt;p&gt;Tenacious Consulting runs four distinct buyer segments - high-growth startups, restructuring companies, mature enterprises, and AI-transformation plays. An email that correctly pitches cost-cutting to a restructuring company is a PASS; the identical email sent to a Series B startup that is hiring aggressively is a FAIL. τ²-Bench has no rubric for this. No public benchmark does.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Audit&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We documented eight specific failure modes from real pipeline traces that existing benchmarks miss:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Segment misrouting&lt;/strong&gt; - email pitched to the wrong buyer segment despite correct ICP classification&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Signal overclaiming&lt;/strong&gt; - asserting aggressive hiring intent from a single job post&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tone drift&lt;/strong&gt; - condescension or urgency language that violates the style guide&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Injection edge cases&lt;/strong&gt; - prompt injection via the prospect-notes field bypassing ToneGuard&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bench over-commitment&lt;/strong&gt; - promising consultant availability not in the current bench summary&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Competitor gap framing&lt;/strong&gt; - technically correct gap analysis that reads as arrogant&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AI maturity mismatch&lt;/strong&gt; - pitching ML-platform migration to a company with no data layer&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-thread leakage&lt;/strong&gt; - simultaneous outreach to a co-founder and a VP leaking context between threads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each failure mode maps to at least three real traces from our Week 10 pipeline run.&lt;/p&gt;
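
&lt;p&gt;Each task in the dataset carries one of these modes as its label. A sketch of the label set as a small enum - the string values here are illustrative, and the published dataset may name them differently:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from enum import Enum

class FailureMode(str, Enum):
    # Illustrative label strings - the published dataset may differ.
    SEGMENT_MISROUTING = "segment_misrouting"
    SIGNAL_OVERCLAIMING = "signal_overclaiming"
    TONE_DRIFT = "tone_drift"
    INJECTION_EDGE_CASE = "injection_edge_case"
    BENCH_OVERCOMMITMENT = "bench_overcommitment"
    COMPETITOR_GAP_FRAMING = "competitor_gap_framing"
    AI_MATURITY_MISMATCH = "ai_maturity_mismatch"
    MULTI_THREAD_LEAKAGE = "multi_thread_leakage"
&lt;/code&gt;&lt;/pre&gt;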

&lt;h2&gt;
  
  
  &lt;strong&gt;Building the Dataset With No Labeled Data&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Tenacious had no historical labeled prospects. We built 202 tasks from scratch using a four-mode authoring pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Programmatic (32%):&lt;/strong&gt; templates with structured slots - company size, segment, funding stage, AI maturity score, bench state - populated by combinatorial expansion. One "bench over-commitment" probe becomes 20 tasks by varying inputs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-LLM synthesis (48%):&lt;/strong&gt; GPT-4o-mini authored hard cases anchored to the failure taxonomy; Llama-3.1-70B judge-filtered them for coherence and rubric applicability. Using different model families for generation and judging prevents preference leakage (Li et al., 2025).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hand-authored adversarial (20%):&lt;/strong&gt; the hardest 40 tasks, written manually - XSS payloads in notes fields, subtle wrong-segment framing, deadline-pressure tone that passes surface checks but fails the style guide.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Contamination prevention:&lt;/strong&gt; n-gram overlap (8-gram), embedding cosine similarity (&amp;lt; 0.85), and time-shift verification against a frozen April 2026 signal window. Zero violations before sealing the held-out partition.&lt;/li&gt;
&lt;/ul&gt;
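
&lt;p&gt;To make the programmatic mode concrete, here is a minimal sketch of the slot-expansion idea. The slot values, template text, and ground-truth rule are illustrative stand-ins, not the benchmark's actual inventory:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from itertools import product

# Illustrative slot values - the real slot inventory lives in the dataset repo.
SEGMENTS = ["high_growth_startup", "restructuring", "mature_enterprise", "ai_transformation"]
BENCH_STATES = ["deep", "balanced", "thin"]
SIZES = ["50-200", "200-1000"]

TEMPLATE = ("Prospect: {size}-person {segment} company. "
            "Current bench: {bench}. "
            "Drafted email promises immediate staffing of a four-consultant pod.")

def expand_bench_overcommitment_probe():
    tasks = []
    for size, segment, bench in product(SIZES, SEGMENTS, BENCH_STATES):
        tasks.append({
            "failure_mode": "bench_overcommitment",
            "input": TEMPLATE.format(size=size, segment=segment, bench=bench),
            # Illustrative ground-truth rule: promising a pod off a thin bench fails.
            "gt": "FAIL" if bench == "thin" else "PASS",
        })
    return tasks  # 2 sizes x 4 segments x 3 bench states = 24 variants from one probe
&lt;/code&gt;&lt;/pre&gt;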
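
&lt;p&gt;And a sketch of the contamination gate, assuming a generic &lt;code&gt;embed_fn&lt;/code&gt; that returns unit-normalized vectors; the thresholds are the ones stated above (8-gram overlap, cosine similarity below 0.85):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

def ngrams(text, n=8):
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def violates(candidate, sealed_texts, sealed_embs, embed_fn, cos_max=0.85):
    # Gate 1: any 8-gram shared with the sealed partition is a violation.
    cand = ngrams(candidate)
    if any(not cand.isdisjoint(ngrams(t)) for t in sealed_texts):
        return True
    # Gate 2: embedding cosine similarity must stay below the 0.85 ceiling.
    v = embed_fn(candidate)  # assumed unit-normalized
    return bool((sealed_embs @ v &amp;gt;= cos_max).any())
&lt;/code&gt;&lt;/pre&gt;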

&lt;h2&gt;
  
  
  &lt;strong&gt;The Training Experiment&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Why Path B (preference-tuned judge)?&lt;/strong&gt; Our failure modes are judgment failures, not generation failures. The pipeline produces fluent, well-written emails - it just sometimes produces them for the wrong segment. SFT would improve the surface quality of already-good emails; a DPO-trained judge learns to catch the judgment errors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why DPO over SimPO/ORPO?&lt;/strong&gt; DPO's full-sequence reward doesn't dilute the signal from 1-2 sentence segment-alignment errors; SimPO's length-normalized reward would. With 279 pairs and one key hyperparameter (β=0.1), DPO was also simpler to debug on a constrained compute budget.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Training:&lt;/strong&gt; Qwen2.5-0.5B-Instruct + LoRA (r=16, α=32, 8.8M trainable params). Pure PyTorch DPO loop on a Google Colab T4, ~47 minutes, loss 1.67 -&amp;gt; 0.009.&lt;/p&gt;
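
&lt;p&gt;The heart of such a loop is the loss itself. This is a generic sketch of the standard DPO objective with β=0.1, not the repo's exact code; each argument is a batch of summed per-sequence log-probs:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch.nn.functional as F

def dpo_loss(pi_chosen_logps, pi_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: how far the policy has drifted from the reference
    # on each completion (summed log-probs over the full email, shape (batch,)).
    chosen_r = beta * (pi_chosen_logps - ref_chosen_logps)
    rejected_r = beta * (pi_rejected_logps - ref_rejected_logps)
    # Push the chosen-vs-rejected reward margin apart.
    return -F.logsigmoid(chosen_r - rejected_r).mean()
&lt;/code&gt;&lt;/pre&gt;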

&lt;p&gt;&lt;strong&gt;Key implementation detail:&lt;/strong&gt; the training pairs used email bodies as chosen/rejected completions - not verdict text. The model is therefore an implicit reward model, not a verdict generator. The correct evaluation interface is:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;reward = β × (log π_DPO(email|prompt) - log π_ref(email|prompt))&lt;/code&gt;&lt;br&gt;
Positive reward -&amp;gt; PASS. Negative reward -&amp;gt; FAIL. Asking the model to generate "VERDICT: PASS" produces 100% PASS bias (22% accuracy). Using the implicit reward produces 74%.&lt;/p&gt;
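
&lt;p&gt;A minimal sketch of that scoring interface with Hugging Face transformers. Here &lt;code&gt;policy&lt;/code&gt; (the DPO-tuned model, e.g. base + LoRA adapter) and &lt;code&gt;ref_model&lt;/code&gt; (the frozen base) are illustrative handles, and the prompt+email concatenation is simplified relative to a real chat template:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch

@torch.no_grad()
def email_logprob(model, tokenizer, prompt, email):
    # Summed log-prob of the email tokens, conditioned on the prompt.
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tokenizer(prompt + email, return_tensors="pt").input_ids.to(model.device)
    logits = model(ids).logits[:, :-1]  # logits at position t predict token t+1
    logps = torch.log_softmax(logits, dim=-1)
    token_logps = logps.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_logps[:, prompt_len - 1:].sum().item()  # email tokens only

def verdict(prompt, email, policy, ref_model, tokenizer, beta=0.1):
    # Sign of the implicit DPO reward is the verdict.
    reward = beta * (email_logprob(policy, tokenizer, prompt, email)
                     - email_logprob(ref_model, tokenizer, prompt, email))
    return ("PASS" if reward &amp;gt; 0 else "FAIL"), reward
&lt;/code&gt;&lt;/pre&gt;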

&lt;h2&gt;
  
  
  &lt;strong&gt;The Honest Result&lt;/strong&gt;
&lt;/h2&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Judge&lt;/th&gt;&lt;th&gt;Accuracy&lt;/th&gt;&lt;th&gt;95% CI&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;DPO-trained judge (implicit reward)&lt;/td&gt;&lt;td&gt;74.0%&lt;/td&gt;&lt;td&gt;[62%, 86%]&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Rule evaluator&lt;/td&gt;&lt;td&gt;48.0%&lt;/td&gt;&lt;td&gt;[34%, 62%]&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Prompt judge (qwen3-8b, zero-shot)&lt;/td&gt;&lt;td&gt;22.0%&lt;/td&gt;&lt;td&gt;[12%, 34%]&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Delta A (DPO judge vs rule evaluator): +26pp, p=0.0127 (paired bootstrap, n=10,000 resamples) - significant at p&amp;lt;0.05.&lt;br&gt;
Delta B (rule evaluator vs zero-shot prompt judge): +26pp, but p=0.5499 - not significant at n=50. This is a sample-size limitation, not an absence of effect: the zero-shot model predicted PASS for every single task regardless of content. Training is necessary.&lt;/p&gt;
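
&lt;p&gt;For reference, the paired bootstrap behind these p-values is a few lines of NumPy. A generic sketch - function and argument names are ours, not the repo's:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

def paired_bootstrap_p(correct_a, correct_b, n_resamples=10_000, seed=0):
    # correct_a / correct_b: 0/1 arrays over the same held-out tasks.
    rng = np.random.default_rng(seed)
    a, b = np.asarray(correct_a), np.asarray(correct_b)
    idx = rng.integers(0, len(a), size=(n_resamples, len(a)))  # resample tasks
    deltas = a[idx].mean(axis=1) - b[idx].mean(axis=1)
    # One-sided p-value: fraction of resamples where A fails to beat B.
    return float((deltas &amp;lt;= 0).mean())
&lt;/code&gt;&lt;/pre&gt;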

&lt;p&gt;&lt;strong&gt;Honest limitations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;25/50 held-out tasks have a labeling artifact from the LLM-synthesis pipeline (GT=FAIL with no failure category). This inflates error counts and suppresses accuracy on synthesis tasks (36% vs 62% on programmatic tasks).&lt;/li&gt;
&lt;li&gt;The implicit-reward interface requires the reference model loaded alongside the policy (2x VRAM), which limits deployment to GPU endpoints.&lt;/li&gt;
&lt;li&gt;The n=50 held-out partition is too small to reach p&amp;lt;0.05 on Delta B; v0.2 needs 300+ tasks per partition.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
&lt;strong&gt;What's Next&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Tenacious-Bench v0.2 should add: multi-turn trajectory tasks, persona-aware tone scoring, live bench inventory validation, and a double-validation step for LLM-synthesis ground truth.&lt;/p&gt;

&lt;h2&gt;
  
  
&lt;strong&gt;Artifacts&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Dataset: &lt;a href="https://huggingface.co/datasets/lidya7/tenacious-bench-v01" rel="noopener noreferrer"&gt;https://huggingface.co/datasets/lidya7/tenacious-bench-v01&lt;/a&gt;&lt;br&gt;
Trained judge: &lt;a href="https://huggingface.co/lidya7/tenacious-judge-lora-v1" rel="noopener noreferrer"&gt;https://huggingface.co/lidya7/tenacious-judge-lora-v1&lt;/a&gt;&lt;br&gt;
Code: &lt;a href="https://github.com/lidudagn/the-conversion-engine" rel="noopener noreferrer"&gt;https://github.com/lidudagn/the-conversion-engine&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
