<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Melaku Genet</title>
    <description>The latest articles on DEV Community by Melaku Genet (@mella123).</description>
    <link>https://dev.to/mella123</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3908742%2Fb7e6b4d7-59b6-4f29-b604-f375e8419c77.png</url>
      <title>DEV Community: Melaku Genet</title>
      <link>https://dev.to/mella123</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mella123"/>
    <language>en</language>
    <item>
      <title>What Happens When You Evaluate a B2B Sales Agent on Tasks It Was Never Designed For</title>
      <dc:creator>Melaku Genet</dc:creator>
      <pubDate>Sat, 02 May 2026 09:55:44 +0000</pubDate>
      <link>https://dev.to/mella123/tenacious-bench-v01-what-happens-when-you-evaluate-a-b2b-sales-agent-on-tasks-it-was-never-2hc3</link>
      <guid>https://dev.to/mella123/tenacious-bench-v01-what-happens-when-you-evaluate-a-b2b-sales-agent-on-tasks-it-was-never-2hc3</guid>
      <description>&lt;p&gt;Tenacious-Bench v0.1: What Happens When You Evaluate a B2B Sales Agent on Tasks It Was Never Designed For&lt;/p&gt;

&lt;p&gt;By Melaku Y. — May 2026&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Gap&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;General-purpose agent benchmarks are good at measuring what they measure.&lt;br&gt;
τ²-Bench retail tasks test whether an agent can navigate a returns portal,&lt;br&gt;
look up order status, or apply a discount code. HELM tests factual recall&lt;br&gt;
and reasoning. Both are rigorous. Neither tells you whether a B2B sales&lt;br&gt;
development agent will fabricate a funding round that never happened.&lt;/p&gt;

&lt;p&gt;That is the gap this work addresses.&lt;/p&gt;

&lt;p&gt;Our agent — built on Qwen2.5-1.5B-Instruct, deployed for outbound&lt;br&gt;
engineering staffing outreach — scored 0.70 pass@1 on τ²-Bench retail&lt;br&gt;
held-out. Reasonable. The same agent, given an enrichment brief showing&lt;br&gt;
&lt;code&gt;open_roles_estimate=0&lt;/code&gt; and &lt;code&gt;layoff_event=True&lt;/code&gt;, wrote:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"I know that Waverly Biotech recently raised a Series B and is&lt;br&gt;
expanding into three new markets. Your team of engineers will need&lt;br&gt;
support as you scale operations."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Nothing in the brief supported any of those claims. The agent defaulted&lt;br&gt;
to optimistic SDR copy from its training distribution. τ²-Bench score:&lt;br&gt;
0.70. Domain failure rate: ~75%.&lt;/p&gt;
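
&lt;p&gt;For concreteness, here is a minimal sketch of such an enrichment brief.&lt;br&gt;
The field names are the ones used throughout this post; values beyond the&lt;br&gt;
two flagged above are illustrative, not the production schema.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative enrichment brief. Only open_roles_estimate and
# layoff_event come from the probe described above; the other values
# are assumed for the sake of the example.
brief = {
    "company": "Waverly Biotech",
    "open_roles_estimate": 0,    # no open engineering roles
    "layoff_event": True,        # recent layoff on record
    "ai_maturity_score": 1,      # gates whether an AI/LLM pitch is allowed
    "icp_confidence": "low",     # low confidence should trigger hedging
    "low_peer_count": True,
}
&lt;/code&gt;&lt;/pre&gt;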

&lt;p&gt;This is not a criticism of τ²-Bench. It is a measurement of orthogonality.&lt;br&gt;
General benchmarks cannot predict domain-specific failure modes they were&lt;br&gt;
not designed to cover.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Audit Method&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We ran 42 domain-specific probes against the unguarded baseline across&lt;br&gt;
13 failure categories, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;FC-1 Hallucinated firmographics (~30% trigger rate): the agent
fabricates funding rounds, market expansion, or headcount growth not
present in the enrichment brief&lt;/li&gt;
&lt;li&gt;FC-2 Segment-gate bypass (~20%): the agent pitches AI/LLM capabilities
to companies with &lt;code&gt;ai_maturity_score &amp;lt; 2&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;FC-3 Signal overclaim (~40%): banned growth phrases appear regardless
of signal input&lt;/li&gt;
&lt;li&gt;FC-5 Confidence bypass (~25%): no hedging language when
&lt;code&gt;icp_confidence=low&lt;/code&gt; or &lt;code&gt;low_peer_count=True&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After adding prompt guardrails (Style Guide v2, signal hedging rules),&lt;br&gt;
the failure rate dropped to a residual ~30%. Guardrails reduced but could&lt;br&gt;
not eliminate paraphrase variants of each failure class. FC-1 and FC-3&lt;br&gt;
persisted at ~15% even with the full guardrail stack.&lt;/p&gt;

&lt;p&gt;Inter-rater agreement between deterministic scorer (R1) and human&lt;br&gt;
annotator (R3): κ=0.77 — above the 0.75 threshold we set for rubric&lt;br&gt;
validity.&lt;/p&gt;
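
&lt;p&gt;For reference, κ here is Cohen's kappa over per-task pass/fail labels.&lt;br&gt;
A minimal sketch of the computation, assuming scikit-learn; the label&lt;br&gt;
arrays below are illustrative, not our annotation data.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from sklearn.metrics import cohen_kappa_score

# Per-task pass/fail labels from the deterministic scorer (R1) and the
# human annotator (R3) on the same probes. Values are illustrative.
r1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
r3 = [1, 1, 0, 1, 1, 1, 1, 0, 1, 1]

# kappa = (p_o - p_e) / (1 - p_e): observed agreement corrected for
# the agreement two raters would reach by chance.
print(cohen_kappa_score(r1, r3))
&lt;/code&gt;&lt;/pre&gt;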

&lt;p&gt;&lt;strong&gt;The Dataset&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Tenacious-Bench v0.1 contains 408 tasks across four sources:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Source&lt;/th&gt;&lt;th&gt;Count&lt;/th&gt;&lt;th&gt;Role&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Programmatic&lt;/td&gt;&lt;td&gt;300&lt;/td&gt;&lt;td&gt;Parameter sweep coverage&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Synthesis&lt;/td&gt;&lt;td&gt;74&lt;/td&gt;&lt;td&gt;Diverse ICP segments&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Trace-derived&lt;/td&gt;&lt;td&gt;5&lt;/td&gt;&lt;td&gt;Real agent failure traces&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Hand-authored&lt;/td&gt;&lt;td&gt;29&lt;/td&gt;&lt;td&gt;Adversarial edge cases&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Four hard design choices shaped the dataset:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Deterministic scorer, no LLM judge.&lt;br&gt;
We tested LLaMA 3.1 8B as a judge: κ=0.04–0.26 on signal_fidelity,&lt;br&gt;
too unreliable to use. The rubric scorer uses keyword matching and regex&lt;br&gt;
patterns across five weighted dimensions (a minimal sketch follows this&lt;br&gt;
list). Slower to build, but reproducible.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Score==1.0 filter for training data.&lt;br&gt;
The ≥0.75 rubric pass threshold is for evaluation. For training, we&lt;br&gt;
required score==1.0 — every dimension passing with no violations. A task&lt;br&gt;
scoring 0.80 has at least one failing dimension. Training on it teaches&lt;br&gt;
the model that violations are acceptable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multi-source generation with preference leakage fix.&lt;br&gt;
Following Gu et al. (2024), we used DeepSeek for generation and Qwen&lt;br&gt;
72B for judging — not the same model for both. Using the same model to&lt;br&gt;
generate and judge introduces systematic bias toward its own outputs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Contamination protocol.&lt;br&gt;
SHA-256 hash comparison of enrichment briefs between train, dev, and&lt;br&gt;
held-out partitions. Zero collisions confirmed before training (see the&lt;br&gt;
hash-check sketch after this list).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
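
&lt;p&gt;To make items 1 and 2 concrete, here is a minimal sketch of the scorer&lt;br&gt;
shape. The dimension names, weights, and pattern lists are illustrative&lt;br&gt;
stand-ins, not the production rubric.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import re

# Illustrative pattern lists; the production rubric is larger.
BANNED_PHRASES = [r"scale rapidly", r"aggressive growth", r"hypergrowth"]
GROWTH_CLAIM_PATTERNS = [r"raised a Series [A-D]",
                         r"expanding into \d+ new markets"]

# Five weighted dimensions, as integers out of 100 so the score==1.0
# training filter never trips over float-equality issues. Names and
# weights here are assumptions.
WEIGHTS = {"signal_fidelity": 30, "segment_gate": 20, "hedging": 20,
           "style": 15, "structure": 15}

def score(draft: str, brief: dict) -&gt; float:
    checks = {
        # Fail if the draft makes a growth claim the brief cannot support.
        "signal_fidelity": not any(re.search(p, draft, re.I)
                                   for p in GROWTH_CLAIM_PATTERNS),
        # FC-2 gate: no AI/LLM pitch below the maturity threshold.
        "segment_gate": brief.get("ai_maturity_score", 0) &gt;= 2
                        or "AI" not in draft,
        # FC-5 gate: low-confidence briefs require hedging language.
        "hedging": brief.get("icp_confidence") != "low"
                   or "may" in draft.lower(),
        "style": not any(re.search(p, draft, re.I) for p in BANNED_PHRASES),
        "structure": len(draft.split()) &lt; 200,
    }
    return sum(w for dim, w in WEIGHTS.items() if checks[dim]) / 100

# Evaluation passes at score &gt;= 0.75; training data requires exactly 1.0,
# i.e. every dimension passing with no violations.
def is_training_example(draft: str, brief: dict) -&gt; bool:
    return score(draft, brief) == 1.0
&lt;/code&gt;&lt;/pre&gt;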
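
&lt;p&gt;And a sketch of the contamination protocol from item 4. The partition&lt;br&gt;
layout is illustrative; the check itself is a straightforward SHA-256&lt;br&gt;
comparison.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import hashlib
import json

def brief_hash(brief: dict) -&gt; str:
    # Canonical JSON so key ordering cannot hide a duplicate brief.
    payload = json.dumps(brief, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def assert_zero_collisions(partitions: dict) -&gt; None:
    # partitions maps a split name ("train", "dev", "held_out") to its
    # list of enrichment briefs.
    seen = {}  # hash -&gt; split it first appeared in
    for split, briefs in partitions.items():
        for brief in briefs:
            h = brief_hash(brief)
            if h in seen and seen[h] != split:
                raise AssertionError(
                    f"brief {h[:12]} appears in both {seen[h]} and {split}")
            seen.setdefault(h, split)
&lt;/code&gt;&lt;/pre&gt;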

&lt;p&gt;&lt;strong&gt;The Training Experiment&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Path A: Supervised Fine-Tuning on signal-grounded (input,&lt;br&gt;
correct-output) pairs.&lt;/p&gt;

&lt;p&gt;FC-1 and FC-3 are generation-quality failures — the model's output&lt;br&gt;
distribution defaults to optimistic B2B copy regardless of input signals.&lt;br&gt;
Prompt engineering (Path B) can suppress this prior but cannot retrain it.&lt;br&gt;
SFT directly retrains the mapping from enrichment signal to grounded&lt;br&gt;
output.&lt;/p&gt;

&lt;p&gt;Following Zhou et al. (2023) LIMA: quality dominates quantity at small&lt;br&gt;
scale. We filtered 1,016 training examples at score≥0.75 from the 235&lt;br&gt;
train-partition tasks after source oversampling (hand-authored 4×,&lt;br&gt;
trace-derived 2×). That is roughly 6.8×10⁻⁷ pairs per parameter at&lt;br&gt;
1.5B — comparable to LIMA's ratio at 65B.&lt;/p&gt;

&lt;p&gt;Per Lambert et al. (2024) Tülu 3: DPO requires a reliable preference&lt;br&gt;
judge (κ≥0.70). Our judge achieved κ=0.04–0.26. DPO with a noisy judge&lt;br&gt;
degrades grounding. SFT alone was the correct stage-one approach.&lt;/p&gt;

&lt;p&gt;Training configuration: Qwen2.5-1.5B-Instruct, LoRA rank 16, alpha 32,&lt;br&gt;
100 steps, 2e-4 cosine LR, single T4 GPU, 6 minutes wall time.&lt;/p&gt;
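
&lt;p&gt;In peft/trl terms, that configuration looks roughly like the sketch&lt;br&gt;
below, assuming current trl/peft APIs. The dataset path and batch size are&lt;br&gt;
assumptions (the post does not state them); everything else mirrors the&lt;br&gt;
settings above.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Hypothetical path to the 1,016 filtered training pairs.
train_dataset = load_dataset("json", data_files="train_pairs.jsonl")["train"]

lora = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")

args = SFTConfig(
    output_dir="tenacious-lora",
    max_steps=100,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    per_device_train_batch_size=4,  # assumption: not stated in the post
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    args=args,
    train_dataset=train_dataset,
    peft_config=lora,
)
trainer.train()
&lt;/code&gt;&lt;/pre&gt;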

&lt;p&gt;Loss curve: 1.7114 → 0.2152 → 0.1434 → 0.1148. Clean monotonic&lt;br&gt;
decrease. No divergence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Honest Result&lt;/strong&gt;&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Condition&lt;/th&gt;&lt;th&gt;Score&lt;/th&gt;&lt;th&gt;Delta&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Unguarded baseline&lt;/td&gt;&lt;td&gt;0.6976&lt;/td&gt;&lt;td&gt;—&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Prompt-only guarded&lt;/td&gt;&lt;td&gt;0.7992&lt;/td&gt;&lt;td&gt;+0.1016&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Trained adapter&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;&lt;strong&gt;0.8863&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;&lt;strong&gt;+0.1887&lt;/strong&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Delta A: +0.1887 (95% CI [+0.155, +0.224], p&amp;lt;0.0001, paired&lt;br&gt;
bootstrap, n=62 held-out tasks, 10,000 iterations)&lt;/p&gt;
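
&lt;p&gt;The paired bootstrap behind those intervals, as a minimal numpy sketch.&lt;br&gt;
The per-task score arrays are placeholders for the 62 held-out results.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

rng = np.random.default_rng(0)

def paired_bootstrap_ci(base, treated, iters=10_000, alpha=0.05):
    # base, treated: per-task scores for the two conditions on the SAME
    # tasks. Resampling task indices preserves the pairing.
    base, treated = np.asarray(base), np.asarray(treated)
    n = len(base)
    deltas = np.empty(iters)
    for i in range(iters):
        idx = rng.integers(0, n, size=n)  # resample with replacement
        deltas[i] = (treated[idx] - base[idx]).mean()
    lo, hi = np.quantile(deltas, [alpha / 2, 1 - alpha / 2])
    return (treated - base).mean(), lo, hi

# e.g. delta, lo, hi = paired_bootstrap_ci(baseline_scores, adapter_scores)
&lt;/code&gt;&lt;/pre&gt;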

&lt;p&gt;Delta B: +0.0871 (95% CI [+0.057, +0.118], p&amp;lt;0.0001) — training&lt;br&gt;
beat prompt-only. This is the result that validates the SFT path.&lt;br&gt;
Many interventions fail Delta B. This one did not.&lt;/p&gt;

&lt;p&gt;Delta C: +0.1863 relative to the 0.70 τ²-Bench baseline — the&lt;br&gt;
improvement is Tenacious-specific, not general. The adapter was not&lt;br&gt;
re-run on τ²-Bench retail tasks, and we make no claim about general&lt;br&gt;
capability.&lt;/p&gt;

&lt;p&gt;Cost-Pareto: adapter latency 2.98s vs 7.17s for the base model. The&lt;br&gt;
adapter is both 2.4× faster and more accurate; there is no cost-quality&lt;br&gt;
tradeoff.&lt;/p&gt;

&lt;p&gt;Of the 62 held-out tasks, 54 improved, 8 were unchanged, and 0 regressed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Did Not Work&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;TB-TD-001 (Karibu Tech) passed the score==1.0 filter in an earlier&lt;br&gt;
pipeline version despite containing "scale rapidly" and "aggressive&lt;br&gt;
growth" — both banned phrases. The scorer's regex missed these paraphrase&lt;br&gt;
variants. We caught it on manual review and removed it. The scorer gap&lt;br&gt;
remains — future programmatic generation may reintroduce similar&lt;br&gt;
violations without automatic detection. This is the one honest unresolved&lt;br&gt;
failure we are carrying into v0.2.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Is Next&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Scorer hardening — extend GROWTH_CLAIM_PATTERNS and BANNED_PHRASES&lt;br&gt;
to cover paraphrase variants. Add a secondary LLM check specifically&lt;br&gt;
for funding fabrication patterns where confidence is high.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;DPO stage — now that we have a reliable SFT baseline, we can&lt;br&gt;
construct preference pairs (adapter output vs unguarded output) and&lt;br&gt;
run DPO to push signal fidelity further. The judge problem remains&lt;br&gt;
but is more tractable with a stronger base.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multi-turn extension — Tenacious-Bench v0.1 evaluates single&lt;br&gt;
outreach drafts. v0.2 will add follow-up email sequences and reply&lt;br&gt;
handling tasks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Community contribution — we have opened an issue on τ²-Bench&lt;br&gt;
proposing a B2B enterprise task split. If you work on sales agent&lt;br&gt;
evaluation and want to collaborate, the dataset and adapter are&lt;br&gt;
publicly available.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Resources&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dataset: &lt;a href="https://huggingface.co/datasets/Mella123/tenacious-bench-v0.1" rel="noopener noreferrer"&gt;https://huggingface.co/datasets/Mella123/tenacious-bench-v0.1&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Adapter: &lt;a href="https://huggingface.co/Mella123/tenacious-bench-lora" rel="noopener noreferrer"&gt;https://huggingface.co/Mella123/tenacious-bench-lora&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/Melaku-G/week11-tenacious-bench" rel="noopener noreferrer"&gt;https://github.com/Melaku-G/week11-tenacious-bench&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;τ²-Bench issue: &lt;a href="https://github.com/sierra-research/tau2-bench/issues/280" rel="noopener noreferrer"&gt;https://github.com/sierra-research/tau2-bench/issues/280&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>benchmark</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
