<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nati A</title>
    <description>The latest articles on DEV Community by Nati A (@natnael_alemseged).</description>
    <link>https://dev.to/natnael_alemseged</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2775274%2Fc77efb59-e9ad-4440-a4a8-d841ea0392c7.png</url>
      <title>DEV Community: Nati A</title>
      <link>https://dev.to/natnael_alemseged</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/natnael_alemseged"/>
    <language>en</language>
    <item>
      <title>When Generic Benchmarks Fail: Building a Sales-Domain Evaluation Bench from Scratch</title>
      <dc:creator>Nati A</dc:creator>
      <pubDate>Sat, 02 May 2026 18:16:47 +0000</pubDate>
      <link>https://dev.to/natnael_alemseged/when-generic-benchmarks-fail-building-a-sales-domain-evaluation-bench-from-scratch-1kjf</link>
      <guid>https://dev.to/natnael_alemseged/when-generic-benchmarks-fail-building-a-sales-domain-evaluation-bench-from-scratch-1kjf</guid>
      <description>&lt;p&gt;&lt;em&gt;By Natnael Alemseged&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The gap that τ²-Bench retail cannot measure
&lt;/h2&gt;

&lt;p&gt;Tenacious is a B2B sales automation company. Its agent produces outreach emails for clients — personalized to the prospect's company, calibrated to the signal confidence of the underlying data, and constrained by the actual bench capacity available to fulfill any commitment made in the email. The executive team's question going into Week 11 was simple: how do we know this works for our business, our voice, our segments, our bench? The honest answer was: we don't. Not because the agent was untested, but because the tests we had were the wrong tests.&lt;/p&gt;

&lt;p&gt;τ²-Bench retail measures whether a sales agent can navigate a generic retail conversation. Tenacious needs an agent that checks bench capacity against a real JSON summary, routes prospects to the right ICP segment based on layoff and funding signals, and phrases outreach to match the confidence tier of the underlying data. These are not things any public benchmark grades.&lt;/p&gt;

&lt;p&gt;The audit I ran on Day 1 listed eight probe IDs from the Week 10 failure library that τ²-Bench retail would have passed: P-009 through P-012 (bench overcommitment, 100% trigger rate), P-001 and P-004 (ICP misrouting, 54%), P-005 and P-019 (assertive phrasing under weak signal). A retail benchmark scores those outputs as acceptable because they are fluent. They are not acceptable for Tenacious because they make promises the company cannot keep.&lt;/p&gt;




&lt;h2&gt;
  
  
  How I found the gap: the audit method
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;(Week 10 and Week 11 refer to two consecutive project sprints: Week 10 built the Tenacious sales agent; Week 11 built the evaluator, benchmark, and critic on top of it.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The Week 10 evidence was more useful than I expected. The failure taxonomy shows that &lt;code&gt;bench_overcommitment&lt;/code&gt; triggered on every bench-feasibility probe in that roll-up (&lt;strong&gt;40/40&lt;/strong&gt;; see &lt;code&gt;week_10_data/failure_taxonomy.md&lt;/code&gt;). This is not a distribution problem — it is a systematic absence of a check. The agent's generator never consulted &lt;code&gt;bench_summary&lt;/code&gt; before committing capacity.&lt;/p&gt;

&lt;p&gt;The same pattern held for ICP routing: &lt;strong&gt;20 of 37&lt;/strong&gt; probes in the ICP-misclassification roll-up (&lt;strong&gt;54%&lt;/strong&gt;; same source). In both cases, the structured context fields (&lt;code&gt;bench_summary&lt;/code&gt;, &lt;code&gt;signal_confidence_tier&lt;/code&gt;, &lt;code&gt;icp_segment&lt;/code&gt;) were available in the input. The generator simply did not use them.&lt;/p&gt;

&lt;p&gt;This pointed immediately to Path B (add a critic that rejects bad drafts) rather than Path A (retrain the generator). The outputs were fluent — no generation-quality problem. What was missing was a rejection layer that checks structured context against the draft before it is sent.&lt;/p&gt;

&lt;p&gt;Concretely, five probe traces drove the decision:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Probe ID&lt;/th&gt;
&lt;th&gt;Trace ref&lt;/th&gt;
&lt;th&gt;Failure&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;P-009&lt;/td&gt;
&lt;td&gt;&lt;code&gt;probe-4087895185a9&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Go overcommitment: bench=3, committed=10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P-010&lt;/td&gt;
&lt;td&gt;&lt;code&gt;probe-d5299b421fc8&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;NestJS capacity committed but fully deployed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P-001&lt;/td&gt;
&lt;td&gt;&lt;code&gt;probe-8dc44eb36d33&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Layoff+funding → Segment 1 instead of Segment 2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P-004&lt;/td&gt;
&lt;td&gt;&lt;code&gt;probe-19f0af95e3e2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Zero open roles, still Segment 1 pitch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P-005&lt;/td&gt;
&lt;td&gt;&lt;code&gt;probe-b3388b3c3582&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Assertive opener under medium-confidence signal&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All five share the same pattern: a structured field in the task input encodes the ground truth, and the agent ignored it. A generation-quality fix does not address this. A critic that has bench state and segment rules in its context can.&lt;/p&gt;
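
&lt;p&gt;A minimal sketch of what that rejection check could look like, assuming a &lt;code&gt;bench_summary&lt;/code&gt; dict keyed by stack and a committed-capacity figure already parsed from the draft. The names are illustrative, not the repo's actual API.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch of a deterministic rejection check for bench overcommitment.
# Field names (bench_summary, available) are illustrative, not the actual
# scoring_evaluator.py API.

def check_bench_commitment(draft_commitments, bench_summary):
    """Reject a draft that commits more engineers per stack than the bench holds."""
    rejections = []
    for stack, committed in draft_commitments.items():
        available = bench_summary.get(stack, {}).get("available", 0)
        if committed &amp;gt; available:
            rejections.append({
                "check": "bench_overcommitment",
                "reason": f"{stack}: committed {committed}, bench has {available}",
            })
    return rejections

# Mirrors probe P-009 (Go overcommitment: bench=3, committed=10).
print(check_bench_commitment({"go": 10}, {"go": {"available": 3}}))
&lt;/code&gt;&lt;/pre&gt;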




&lt;h2&gt;
  
  
  Building the benchmark: how dataset construction actually works at small data
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The four authoring modes
&lt;/h3&gt;

&lt;p&gt;Tenacious-Bench v0.2 uses four authoring modes, each with different cost and quality tradeoffs:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trace-derived&lt;/strong&gt; tasks come directly from the Week 10 failure library. The task input is reconstructed from a real probe, and the ground truth is the corrected output from the post-hoc audit. These are the highest-signal tasks — they encode actual failures the agent produced in a real evaluation. The risk is sparse coverage: the probe library covers only the failure modes that were already identified.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Programmatic&lt;/strong&gt; tasks expand the trace-derived set by templatizing the inputs — varying company name, capacity numbers, signal tier, and ICP segment systematically. Coverage is higher but signal lines are often synthetic stubs (&lt;code&gt;Ref=tbv02-0021 Arbor Systems hiring-signal.&lt;/code&gt;) rather than grounded specifics. That creates calibration noise in the evaluator's &lt;code&gt;signal_grounding_check&lt;/code&gt;, documented below.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-LLM synthesis&lt;/strong&gt; routes task generation to a cheap model tier (Qwen via OpenRouter) and judgment to a different family (Claude/OpenAI) — following the preference-leakage prevention protocol from Li et al. (2025). The generator produces the rejected outputs for preference pairs; the judge verifies them. Using the same model for both would inflate apparent pair quality without improving actual learning signal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hand-authored&lt;/strong&gt; tasks cover the long tail of failure modes that neither trace-derived nor programmatic expansion reaches — dual-control coordination failures and edge cases in booking-stage handling.&lt;/p&gt;
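
&lt;p&gt;To make the programmatic mode concrete, here is a hedged sketch of the template expansion: systematic variation of company, signal tier, and bench capacity, including the stub signal line that causes the calibration noise mentioned above. The field names are assumptions, not the exact &lt;code&gt;generation_scripts&lt;/code&gt; schema.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch of programmatic task expansion (field names are assumptions,
# not the exact generation_scripts schema).
import itertools
import json

COMPANIES = ["Arbor Systems", "Northwind Labs"]
TIERS = ["high", "medium", "low"]
BENCH = [0, 3, 7]

def expand(prefix="tbv02"):
    tasks = []
    combos = itertools.product(COMPANIES, TIERS, BENCH)
    for i, (company, tier, bench) in enumerate(combos):
        tasks.append({
            "task_id": f"{prefix}-{i:04d}",
            "input": {
                "company": company,
                "signal_confidence_tier": tier,
                "bench_summary": {"python": {"available": bench}},
                # The weak spot flagged above: a stub, not a grounded signal line.
                "signal_line": f"Ref={prefix}-{i:04d} {company} hiring-signal.",
            },
        })
    return tasks

print(json.dumps(expand()[0], indent=2))
&lt;/code&gt;&lt;/pre&gt;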

&lt;h3&gt;
  
  
  Judge-filter calibration (task inclusion)
&lt;/h3&gt;

&lt;p&gt;Every generated task is supposed to pass an LLM-as-judge gate before it enters the benchmark: pointwise scores on &lt;strong&gt;input coherence&lt;/strong&gt;, &lt;strong&gt;ground-truth verifiability&lt;/strong&gt;, and &lt;strong&gt;rubric-application clarity&lt;/strong&gt; (1–5 each), with documented minimums (&lt;code&gt;generation_scripts/audit_logs/authoring_manifest_*.json&lt;/code&gt;: require &lt;strong&gt;≥3&lt;/strong&gt; on each dimension, reject on malformed JSON). &lt;strong&gt;Generator and judge model families are rotated&lt;/strong&gt; so the same family never both authors and scores the same pool — again following Li et al. (2025). Pairwise tiebreaks handle near-duplicate synthesis paths (Jaccard overlap on subject+body, threshold 0.8).&lt;/p&gt;

&lt;p&gt;The published authoring manifest for the 240-task build records whether live OpenRouter calls were enabled; when the key is absent, the pipeline falls back to a &lt;strong&gt;stub judge&lt;/strong&gt; that only enforces the dimension floor — useful for reproducible CI, but &lt;strong&gt;not&lt;/strong&gt; a substitute for calibrating a frontier judge on a 50-task spot sample. Inter-rater agreement on 30 hand-labeled tasks (24-hour relabel) is what kept the &lt;em&gt;downstream&lt;/em&gt; deterministic rubric honest.&lt;/p&gt;
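
&lt;p&gt;For reference, a minimal sketch of the two mechanical parts of that gate: the dimension floor and the Jaccard near-duplicate tiebreak. The floor (3) and threshold (0.8) come from the manifest described above; the function names are illustrative.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch of the inclusion gate: dimension floor plus Jaccard near-duplicate check.
# The floor (3) and threshold (0.8) come from the post; names are illustrative.

def passes_floor(scores, floor=3):
    """Pointwise judge scores (1-5) on the three inclusion dimensions."""
    dims = ("input_coherence", "ground_truth_verifiability", "rubric_application_clarity")
    return all(scores.get(d, 0) &amp;gt;= floor for d in dims)

def jaccard(a_text, b_text):
    a = set(a_text.lower().split())
    b = set(b_text.lower().split())
    union = a.union(b)
    return len(a.intersection(b)) / len(union) if union else 0.0

def is_near_duplicate(task_a, task_b, threshold=0.8):
    key_a = task_a["subject"] + " " + task_a["body"]
    key_b = task_b["subject"] + " " + task_b["body"]
    return jaccard(key_a, key_b) &amp;gt;= threshold
&lt;/code&gt;&lt;/pre&gt;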

&lt;h3&gt;
  
  
  The routing decision I would make differently
&lt;/h3&gt;

&lt;p&gt;Stub signal lines from cheap synthesis are not interchangeable with realistic briefs. A real signal line reads: "You closed a $14M Series A in February and your Python roles increased from 2 to 7 in 60 days." A stub reads: "Ref=tbv02-0021 Arbor Systems hiring-signal." The evaluator's &lt;code&gt;signal_grounding_check&lt;/code&gt; grades whether the body references tokens from the signal line; stubs have no meaningful tokens to match.&lt;/p&gt;

&lt;p&gt;The fix for the next revision is to author plausible, specific signals (amount, date, role count) at template-expansion time, following Liu et al. (COLM 2024, Section 3): synthetic quality depends on the &lt;strong&gt;specificity of the seed&lt;/strong&gt;, not on volume alone.&lt;/p&gt;
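
&lt;p&gt;For illustration, here is roughly what a token-overlap grounding check looks like and why stubs score poorly under it. The real check lives in &lt;code&gt;scoring_evaluator.py&lt;/code&gt;; the stopword list and threshold in this sketch are assumptions.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative token-overlap check in the spirit of signal_grounding_check.
# The production version lives in scoring_evaluator.py; the stopwords and the
# min_hits threshold here are assumptions.
import re

STOPWORDS = {"the", "a", "an", "and", "in", "your", "you", "from", "to"}

def signal_grounding_check(body, signal_line, min_hits=2):
    tokens = set(re.findall(r"[A-Za-z0-9$.]+", signal_line.lower()))
    tokens = {t for t in tokens if t not in STOPWORDS}
    hits = [t for t in tokens if t in body.lower()]
    return len(hits) &amp;gt;= min_hits, hits

# A real brief has specific tokens to match; a stub line mostly does not.
ok, hits = signal_grounding_check(
    "Congrats on the $14M Series A in February; going from 2 to 7 Python roles is fast.",
    "You closed a $14M Series A in February and your Python roles increased from 2 to 7 in 60 days.",
)
print(ok, hits)
&lt;/code&gt;&lt;/pre&gt;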

&lt;h3&gt;
  
  
  Contamination and inter-rater agreement
&lt;/h3&gt;

&lt;p&gt;The three-check protocol (8-gram overlap on inputs, embedding cosine &lt;strong&gt;&amp;lt; 0.85&lt;/strong&gt;, time-shift verification) targets &lt;strong&gt;input-level&lt;/strong&gt; train vs held-out overlap, not output memorization. For the preference-pair training slice, &lt;code&gt;training_data/contamination_preference_pairs.json&lt;/code&gt; records &lt;strong&gt;91&lt;/strong&gt; pairs checked and &lt;strong&gt;0&lt;/strong&gt; violations.&lt;/p&gt;
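
&lt;p&gt;A sketch of the first of those three checks, the 8-gram input overlap (the embedding-cosine and time-shift checks are not shown); the whitespace tokenization is an assumption.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch of the 8-gram input-overlap check; the embedding-cosine and
# time-shift checks are not shown. Tokenization here is an assumption.

def ngrams(text, n=8):
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(max(len(toks) - n + 1, 0))}

def shares_8gram(train_input, heldout_input):
    return not ngrams(train_input).isdisjoint(ngrams(heldout_input))

# A flagged pair would count toward the violations recorded in
# training_data/contamination_preference_pairs.json.
&lt;/code&gt;&lt;/pre&gt;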

&lt;p&gt;The compliant 24-hour inter-rater pass (30 tasks, 64 check-level comparisons) yielded &lt;strong&gt;0.91&lt;/strong&gt; overall agreement; every dimension cleared &lt;strong&gt;0.80&lt;/strong&gt; after rubric revision (&lt;code&gt;inter_rater_agreement.md&lt;/code&gt;). The weak point was &lt;code&gt;format_check&lt;/code&gt; (&lt;strong&gt;0.87&lt;/strong&gt;): humans penalized filler openers and hollow superlatives while the machine initially used length only. Adding &lt;code&gt;filler_opener&lt;/code&gt; and &lt;code&gt;unsupported_superlative&lt;/code&gt; regexes to &lt;code&gt;scoring_evaluator.py&lt;/code&gt; closed the gap.&lt;/p&gt;
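
&lt;p&gt;The two regexes mentioned above, sketched; the exact patterns in &lt;code&gt;scoring_evaluator.py&lt;/code&gt; may differ.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch of the two checks added to format_check after the relabel pass.
# The exact patterns in scoring_evaluator.py may differ.
import re

FILLER_OPENER = re.compile(
    r"^(i hope this (email )?finds you well|just reaching out|i wanted to touch base)",
    re.IGNORECASE,
)
UNSUPPORTED_SUPERLATIVE = re.compile(
    r"\b(world-class|best-in-class|unparalleled|cutting-edge|industry-leading)\b",
    re.IGNORECASE,
)

def extra_format_flags(body):
    flags = []
    if FILLER_OPENER.search(body.strip()):
        flags.append("filler_opener")
    if UNSUPPORTED_SUPERLATIVE.search(body):
        flags.append("unsupported_superlative")
    return flags
&lt;/code&gt;&lt;/pre&gt;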




&lt;h2&gt;
  
  
  The training experiment
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Path B: SimPO on a text-only Qwen 2.5 0.5B fallback
&lt;/h3&gt;

&lt;p&gt;The project target backbone is Qwen3.5-0.8B. The current Qwen3.5-0.8B HF/Unsloth release is vision-language; TRL CPO routes text prompts through the image processor and breaks on text-only preference pairs. The training notebook uses &lt;code&gt;unsloth/Qwen2.5-0.5B-Instruct&lt;/code&gt; as an operational text-only fallback — an engineering constraint worth stating in public.&lt;/p&gt;

&lt;p&gt;SimPO beats DPO on a free Colab T4 (16 GB): DPO needs a frozen reference model in memory; SimPO is reference-free and fits a workable batch size. SimPO beats ORPO here because the data are &lt;strong&gt;preference pairs only&lt;/strong&gt; — no separate SFT corpus. ORPO's SFT term would drag a 0.5B policy toward Tenacious email prose at the expense of general instruction following; SimPO has no SFT term.&lt;/p&gt;

&lt;p&gt;Preference pairs use each task's &lt;code&gt;ground_truth_output&lt;/code&gt; as &lt;strong&gt;chosen&lt;/strong&gt; and an LLM-generated violation as &lt;strong&gt;rejected&lt;/strong&gt;, validated with &lt;code&gt;scoring_evaluator.py&lt;/code&gt; and logged in &lt;code&gt;training_data/preference_pairs_audit.jsonl&lt;/code&gt;. The rejection generator (Qwen on OpenRouter) and any frontier judge are &lt;strong&gt;different families&lt;/strong&gt; — preference-leakage hygiene per Li et al. (2025).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Training slice:&lt;/strong&gt; &lt;strong&gt;91&lt;/strong&gt; rows in &lt;code&gt;training_data/preference_pairs.jsonl&lt;/code&gt;, &lt;strong&gt;6&lt;/strong&gt; failure categories, &lt;strong&gt;0&lt;/strong&gt; contamination flags in &lt;code&gt;training_data/contamination_preference_pairs.json&lt;/code&gt;. Colab T4: &lt;strong&gt;3&lt;/strong&gt; epochs, &lt;strong&gt;81&lt;/strong&gt; train / &lt;strong&gt;10&lt;/strong&gt; eval pairs, &lt;strong&gt;~129 s&lt;/strong&gt; wall time, fp16 LoRA r=16 / α=32, final train loss &lt;strong&gt;4.878&lt;/strong&gt;. Eval margin sanity check: &lt;strong&gt;10/10&lt;/strong&gt; on the training split. Headline lift is decided on &lt;strong&gt;held-out&lt;/strong&gt; tasks only (&lt;code&gt;ablations/ablation_results.json&lt;/code&gt;, &lt;code&gt;ablations/significance_test.txt&lt;/code&gt;).&lt;/p&gt;
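
&lt;p&gt;For readers who want the shape of the training call, a hedged sketch of the SimPO setup: TRL's &lt;code&gt;CPOTrainer&lt;/code&gt; with &lt;code&gt;loss_type="simpo"&lt;/code&gt; and &lt;code&gt;cpo_alpha=0&lt;/code&gt; (reference-free), LoRA r=16 / α=32 on the text-only fallback. The epoch count and precision follow the numbers above; the other arguments are assumptions and the actual notebook may differ.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hedged sketch of the SimPO run: TRL CPOTrainer with loss_type="simpo" and
# cpo_alpha=0 (reference-free). LoRA r=16 / alpha=32 per the post; the other
# arguments are assumptions, not the notebook's exact settings.
from datasets import load_dataset
from unsloth import FastLanguageModel
from trl import CPOConfig, CPOTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-0.5B-Instruct",
    max_seq_length=2048,
    load_in_4bit=False,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Rows carry "prompt", "chosen", "rejected" fields.
pairs = load_dataset("json", data_files="training_data/preference_pairs.jsonl", split="train")

config = CPOConfig(
    loss_type="simpo",              # SimPO objective
    cpo_alpha=0.0,                  # no SFT-style term, matching the rationale above
    num_train_epochs=3,
    per_device_train_batch_size=2,  # assumption; fits a T4
    fp16=True,
    output_dir="simpo_out",
)
trainer = CPOTrainer(
    model=model,
    args=config,
    train_dataset=pairs,
    processing_class=tokenizer,     # older TRL releases use tokenizer= instead
)
trainer.train()
&lt;/code&gt;&lt;/pre&gt;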




&lt;h2&gt;
  
  
  The honest result
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Delta A: trained LoRA vs deterministic baseline on held-out (same metric)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Definition (paired with &lt;code&gt;ablations/paired_bootstrap_delta_a.py&lt;/code&gt;):&lt;/strong&gt; for each of &lt;strong&gt;47&lt;/strong&gt; held-out tasks, the baseline &lt;strong&gt;succeeds&lt;/strong&gt; if the deterministic &lt;code&gt;scoring_evaluator.py&lt;/code&gt; prefers &lt;code&gt;ground_truth_output&lt;/code&gt; over &lt;code&gt;candidate_output&lt;/code&gt; on its check scores, or if the two bodies are identical. The trained judge &lt;strong&gt;succeeds&lt;/strong&gt; if the LoRA's preference margin agrees with that same ordering (or a tie). This is &lt;strong&gt;one&lt;/strong&gt; metric end to end — not a mix of all-checks-pass for the baseline and preference accuracy for the model.&lt;/p&gt;
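
&lt;p&gt;A sketch of the paired-bootstrap procedure behind the CI and &lt;em&gt;p&lt;/em&gt;-value in the table below; &lt;code&gt;ablations/paired_bootstrap_delta_a.py&lt;/code&gt; is the source of record, and the exact resampling details there may differ.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch of the paired bootstrap over the 47 held-out tasks.
# ablations/paired_bootstrap_delta_a.py is the source of record; details here
# (e.g. the one-sided p-value convention) are assumptions.
import numpy as np

def paired_bootstrap(baseline, trained, n_resamples=50_000, seed=42):
    """baseline / trained: per-task 0/1 success arrays over the same tasks."""
    rng = np.random.default_rng(seed)
    baseline = np.asarray(baseline, dtype=float)
    trained = np.asarray(trained, dtype=float)
    n = len(baseline)
    deltas = np.empty(n_resamples)
    for b in range(n_resamples):
        idx = rng.integers(0, n, size=n)   # resample task indices with replacement
        deltas[b] = trained[idx].mean() - baseline[idx].mean()
    ci = (np.quantile(deltas, 0.025), np.quantile(deltas, 0.975))
    p_one_sided = float((deltas &amp;lt;= 0.0).mean())   # share of resamples with no improvement
    return ci, p_one_sided
&lt;/code&gt;&lt;/pre&gt;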

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Condition&lt;/th&gt;
&lt;th&gt;Preference-aligned rate&lt;/th&gt;
&lt;th&gt;n&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Deterministic baseline&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;14.9%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7/47&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Trained LoRA&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;91.5%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;43/47&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Delta A&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+76.6 pp&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;95% bootstrap CI (50 000 resamples, seed 42)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;[+63.8 pp, +87.2 pp]&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;One-sided paired bootstrap &lt;em&gt;p&lt;/em&gt;
&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&amp;lt; 0.0001&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Descriptive sidebar:&lt;/strong&gt; the Week 10 &lt;strong&gt;candidate&lt;/strong&gt; bodies pass all deterministic checks on &lt;strong&gt;11/47&lt;/strong&gt; tasks (&lt;strong&gt;23.4%&lt;/strong&gt;) — a useful raw quality readout, but &lt;strong&gt;not&lt;/strong&gt; the Delta A numerator. The baseline hits &lt;strong&gt;7/47&lt;/strong&gt; because the evaluator often prefers the reference even when the candidate fails some checks.&lt;/p&gt;

&lt;p&gt;By category, the trained judge reaches 100% on bench_overcommitment, dual_control_coordination, gap_overclaiming, signal_overclaiming, and tone_drift; &lt;strong&gt;icp_misclassification&lt;/strong&gt; stays &lt;strong&gt;2/6 (33.3%)&lt;/strong&gt; — the weakest training slice (six pairs) and an open problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  Delta B: trained LoRA vs prompt-only same backbone
&lt;/h3&gt;

&lt;p&gt;Same held-out preference-margin procedure: base &lt;code&gt;Qwen2.5-0.5B-Instruct&lt;/code&gt; without LoRA scores &lt;strong&gt;48.9%&lt;/strong&gt; (23/47); the trained adapter scores &lt;strong&gt;91.5%&lt;/strong&gt; (43/47) — &lt;strong&gt;+42.6 pp&lt;/strong&gt;, 95% CI &lt;strong&gt;[+29.8 pp, +57.4 pp]&lt;/strong&gt;, &lt;em&gt;p&lt;/em&gt; &amp;lt; 0.0001. Prompt-only already clears dual_control_coordination and signal_overclaiming on this slice; the adapter's lift concentrates in gap_overclaiming and tone_drift, with modest ICP gains (0/6 → 2/6).&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost–latency Pareto
&lt;/h3&gt;

&lt;p&gt;Training used &lt;strong&gt;$0&lt;/strong&gt; billed GPU on Colab T4 (&lt;code&gt;cost_pareto.colab_cost_usd&lt;/code&gt; in &lt;code&gt;ablations/ablation_results.json&lt;/code&gt;; ~&lt;strong&gt;2.16&lt;/strong&gt; minutes wall time). &lt;strong&gt;Inference&lt;/strong&gt; on the held-out preference pass: median &lt;strong&gt;~369 ms&lt;/strong&gt; per task with the LoRA judge vs &lt;strong&gt;~96 ms&lt;/strong&gt; for the prompt-only backbone — higher latency for a stronger rejection layer. Dataset authoring included &lt;strong&gt;live&lt;/strong&gt; OpenRouter calls for preference-pair generation (&lt;code&gt;training_data/preference_pairs_audit.jsonl&lt;/code&gt;, &lt;code&gt;mode: "live"&lt;/code&gt;); API spend is logged in &lt;code&gt;cost_log.csv&lt;/code&gt; — &lt;strong&gt;~$0.02&lt;/strong&gt; for 112 qwen/qwen3-8b calls (67K input + 43K output tokens at $0.10/M).&lt;/p&gt;

&lt;h3&gt;
  
  
  What did not work
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;ICP routing&lt;/strong&gt; remains the failure mode with the fewest pairs and the worst held-out accuracy. &lt;strong&gt;Stub signal lines&lt;/strong&gt; make &lt;code&gt;signal_grounding_check&lt;/code&gt; look worse than real-brief behavior would. &lt;strong&gt;Delta B&lt;/strong&gt; is uneven: training helps most where the prompt-only model was blind, not everywhere.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is next
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Thread-level coherence&lt;/strong&gt; — grade replies against prior turns, not isolated drafts.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing scope&lt;/strong&gt; — enforce &lt;code&gt;pricing_sheet.md&lt;/code&gt; bands on quoted TCV.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LinkedIn-roast heuristic&lt;/strong&gt; — style-guide anti-pattern as an LLM-judge dimension.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-signal calibration&lt;/strong&gt; — score against the &lt;strong&gt;weakest&lt;/strong&gt; signal in a brief, not a single scalar tier.&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;Dataset: &lt;a href="https://huggingface.co/datasets/Natnaela/tenacious-bench" rel="noopener noreferrer"&gt;https://huggingface.co/datasets/Natnaela/tenacious-bench&lt;/a&gt;&lt;/em&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;Code: &lt;a href="https://github.com/Natnael-Alemseged/SalesConversion-Bench" rel="noopener noreferrer"&gt;https://github.com/Natnael-Alemseged/SalesConversion-Bench&lt;/a&gt;&lt;/em&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;Community: &lt;a href="https://github.com/sierra-research/tau2-bench/issues/293" rel="noopener noreferrer"&gt;τ²-Bench issue #293 — structured-context evaluation gaps&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>llm</category>
      <category>benchmarks</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
