<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Natnael Alemseged</title>
    <description>The latest articles on DEV Community by Natnael Alemseged (@natnael_alemseged).</description>
    <link>https://dev.to/natnael_alemseged</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2775274%2F5fc7b9f9-903b-46ed-bba3-2008fff110a9.png</url>
      <title>DEV Community: Natnael Alemseged</title>
      <link>https://dev.to/natnael_alemseged</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/natnael_alemseged"/>
    <language>en</language>
    <item>
      <title>Why Pairing Your Bootstrap Is Necessary — And When It Stops Helping</title>
      <dc:creator>Natnael Alemseged</dc:creator>
      <pubDate>Fri, 08 May 2026 21:39:19 +0000</pubDate>
      <link>https://dev.to/natnael_alemseged/why-pairing-your-bootstrap-is-necessary-and-when-it-stops-helping-2iim</link>
      <guid>https://dev.to/natnael_alemseged/why-pairing-your-bootstrap-is-necessary-and-when-it-stops-helping-2iim</guid>
      <description>&lt;p&gt;A colleague's &lt;code&gt;paired_bootstrap&lt;/code&gt; function resamples one set of 48 task indices and applies it to both the trained LoRA&lt;br&gt;
scores and the baseline scores. The question: what mathematical property makes that the correct procedure — and would an&lt;br&gt;
unpaired bootstrap have changed the reviewer-facing conclusion?&lt;/p&gt;

&lt;p&gt;The short answer: pairing is correct &lt;em&gt;by experimental design&lt;/em&gt;. When the two score vectors have positive covariance,&lt;br&gt;
pairing reduces the model-based standard error; in this specific data the correlation is weak (r = 0.167), so the&lt;br&gt;
paired and unpaired bootstrap CIs are practically identical, and neither changes the reviewer-facing conclusion.&lt;/p&gt;

&lt;p&gt;Here is why, from first principles.&lt;/p&gt;


&lt;h2&gt;
  
  
  The experimental design justification: why pairing is valid at all
&lt;/h2&gt;

&lt;p&gt;The 48 held-out tasks were not drawn independently for the baseline and then re-drawn independently for the trained&lt;br&gt;
LoRA. The &lt;strong&gt;same 48 tasks&lt;/strong&gt; were evaluated under both systems. Each task is a repeated measurement on the same subject —&lt;br&gt;
this is a &lt;strong&gt;within-subject design&lt;/strong&gt; (as opposed to a between-subject design where each group sees different samples),&lt;br&gt;
and it is what makes pairing the correct procedure.&lt;/p&gt;

&lt;p&gt;If the 48 baseline tasks and the 48 trained-LoRA tasks were &lt;em&gt;different&lt;/em&gt; tasks drawn from the same population, unpaired&lt;br&gt;
bootstrap would be correct. But here, resampling index 13 means "draw task 13 for both models together." Resampling each&lt;br&gt;
vector independently breaks that structure and estimates uncertainty for a different experiment: baseline and LoRA&lt;br&gt;
evaluated on unrelated task samples.&lt;/p&gt;

&lt;p&gt;This distinction matters before any formula. The formula follows the design; the design is what you defend to the&lt;br&gt;
reviewer.&lt;/p&gt;


&lt;h2&gt;
  
  
  The variance-reduction mechanism: the math behind why pairing helps
&lt;/h2&gt;

&lt;p&gt;Once you have established that pairing is correct, the question is how much it helps. The &lt;strong&gt;bootstrap&lt;/strong&gt; works by&lt;br&gt;
resampling your data with replacement thousands of times to estimate the sampling distribution of a statistic — here,&lt;br&gt;
the mean lift between two systems (&lt;a href="https://hastie.su.domains/CASI/" rel="noopener noreferrer"&gt;Efron &amp;amp; Hastie, 2016&lt;/a&gt;). The standard error of the&lt;br&gt;
mean paired lift is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SE_paired = sqrt((Var(A) + Var(B) − 2·Cov(A, B)) / n)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;where A is the baseline binary score vector, B is the trained-LoRA binary score vector, and n = 48.&lt;/p&gt;

&lt;p&gt;The unpaired standard error treats A and B as independent, so the covariance term drops:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SE_unpaired = sqrt((Var(A) + Var(B)) / n)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key distinction: a paired design estimates &lt;code&gt;E[B - A]&lt;/code&gt; — expected within-task lift. An unpaired design estimates&lt;br&gt;
&lt;code&gt;E[B] - E[A]&lt;/code&gt; as if the two means came from unrelated samples. Same point estimate, different uncertainty model.&lt;/p&gt;

&lt;p&gt;Pairing helps in proportion to the covariance between the two score vectors. If tasks where the baseline passes tend&lt;br&gt;
also to be tasks where the trained model passes, the covariance is large and positive, the numerator shrinks, and the&lt;br&gt;
paired SE is meaningfully smaller. If the two models fail and pass on largely &lt;em&gt;different&lt;/em&gt; tasks — low covariance —&lt;br&gt;
pairing buys almost nothing in precision, even though it remains the correct design.&lt;/p&gt;
&lt;h3&gt;
  
  
  The actual numbers
&lt;/h3&gt;

&lt;p&gt;From the held-out evaluation traces:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Trained LoRA passes&lt;/th&gt;
&lt;th&gt;Trained LoRA fails&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Baseline passes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Baseline fails&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;Baseline: 16 passes, 32 fails → pass rate p_A = 0.333&lt;/li&gt;
&lt;li&gt;Trained LoRA: 41 passes, 7 fails → pass rate p_B = 0.854&lt;/li&gt;
&lt;li&gt;Pearson r(A, B) = &lt;strong&gt;0.167&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Var(A) = 0.333 · 0.667 = &lt;strong&gt;0.222&lt;/strong&gt;; Var(B) = 0.854 · 0.146 = &lt;strong&gt;0.125&lt;/strong&gt;; Var(A) + Var(B) = &lt;strong&gt;0.347&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Cov(A, B) = 0.167 · sqrt(0.222 · 0.125) ≈ &lt;strong&gt;0.028&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The task-level difference vector makes the paired structure visible:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;+1&lt;/code&gt; on 26 tasks where trained LoRA passes and baseline fails&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-1&lt;/code&gt; on 1 task where baseline passes and trained LoRA fails&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;0&lt;/code&gt; on 21 tasks where both systems agree&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The paired bootstrap resamples this population of task-level differences. The unpaired bootstrap destroys these&lt;br&gt;
relationships by drawing baseline and trained outcomes independently.&lt;/p&gt;

&lt;p&gt;Plugging in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SE_paired   = sqrt((0.347 − 2·0.028) / 48) = sqrt(0.291 / 48) ≈ 0.0779
SE_unpaired = sqrt(0.347 / 48)             ≈ 0.0850
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The paired SE is about &lt;strong&gt;8.4% smaller&lt;/strong&gt; — real but modest, because the covariance is small relative to&lt;br&gt;
&lt;code&gt;Var(A) + Var(B)&lt;/code&gt;.&lt;/p&gt;
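&lt;p&gt;The arithmetic above can be reproduced directly from the four contingency cells. The following is a minimal sketch, not the colleague's code: the helper name &lt;code&gt;se_from_cells&lt;/code&gt; and its argument names are hypothetical, and it assumes the same population-variance convention, p * (1 - p), used in the formulas above.&lt;/p&gt;

```python
import math


def se_from_cells(both_pass, base_only, trained_only, both_fail):
    """Return (se_paired, se_unpaired, r) for two binary score vectors.

    Hypothetical helper: the inputs are the four cells of the 2x2
    pass/fail table, and variances use the p * (1 - p) convention.
    """
    n = both_pass + base_only + trained_only + both_fail
    p_a = (both_pass + base_only) / n        # baseline pass rate
    p_b = (both_pass + trained_only) / n     # trained pass rate
    var_a = p_a * (1 - p_a)
    var_b = p_b * (1 - p_b)
    cov = both_pass / n - p_a * p_b          # E[AB] - E[A] * E[B]
    r = cov / math.sqrt(var_a * var_b)
    se_paired = math.sqrt((var_a + var_b - 2 * cov) / n)
    se_unpaired = math.sqrt((var_a + var_b) / n)
    return se_paired, se_unpaired, r


se_p, se_u, r = se_from_cells(15, 1, 26, 6)
print(f"r = {r:.3f}, SE_paired = {se_p:.4f}, SE_unpaired = {se_u:.4f}")
```

&lt;p&gt;Computing Cov(A, B) as E[AB] - E[A]·E[B] from the both-pass cell gives the same 0.028 without going through r, which is a useful cross-check on the table.&lt;/p&gt;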
&lt;h3&gt;
  
  
  Empirical simulation
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="n"&gt;rng&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;default_rng&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;n_boot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100_000&lt;/span&gt;

&lt;span class="c1"&gt;# Task-level binary outcomes ordered by contingency cell:
# both-pass (15), baseline-only-pass (1), trained-only-pass (26), both-fail (6)
&lt;/span&gt;&lt;span class="n"&gt;baseline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;26&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;trained&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;26&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;paired&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="n"&gt;unpaired&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_boot&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;integers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;48&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;48&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;i_a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;integers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;48&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;48&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;i_b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;integers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;48&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;48&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;paired&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;trained&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;baseline&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;unpaired&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trained&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i_b&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;baseline&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i_a&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;paired&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;2.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;97.5&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unpaired&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;2.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;97.5&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Output (shown with labels and widths added):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Paired CI:   [+35.4, +68.8] percentage points  — width 33.3 pp
Unpaired CI: [+35.4, +66.7] percentage points  — width 31.3 pp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The two CIs are nearly identical, as the weak covariance predicts. The SE formula says pairing&lt;br&gt;
should modestly reduce the model-based standard error, but &lt;strong&gt;percentile bootstrap CIs&lt;/strong&gt; (the 2.5th and 97.5th&lt;br&gt;
percentiles of the resampled distribution) on binary-difference data are not symmetric ±1.96·SE intervals. With n = 48,&lt;br&gt;
every resampled mean is a multiple of 1/48 (about 2.1 pp), so the percentile endpoints move in discrete steps, and the&lt;br&gt;
two widths here differ by exactly one such step. The slight width inversion is therefore not a&lt;br&gt;
contradiction: pairing is still the right design, but here it does not buy meaningful precision.&lt;/p&gt;




&lt;h2&gt;
  
  
  Does this change the reviewer conclusion?
&lt;/h2&gt;

&lt;p&gt;The reviewer-facing claim is: &lt;em&gt;"The LoRA adapter's lift is statistically significant above zero."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The critical boundary is whether the CI lower bound stays positive. Both paired and unpaired bootstrap give a lower&lt;br&gt;
bound of &lt;strong&gt;+35.4 percentage points&lt;/strong&gt; — far above zero. Neither variant threatens the significance verdict. A CI&lt;br&gt;
of [−2, +54] would change the conclusion; [+12, +40] would not. The actual data stays nowhere near the dangerous&lt;br&gt;
boundary regardless of which bootstrap method is used.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pairing is correct by experimental design, and in this experiment it makes no difference to the reviewer conclusion,&lt;br&gt;
because the weak correlation means pairing provides little variance reduction.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Adjacent concepts worth connecting
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;When does pairing matter most?&lt;/strong&gt; (&lt;a href="https://aclanthology.org/P19-1266/" rel="noopener noreferrer"&gt;Dror et al., 2017&lt;/a&gt;) When tasks are&lt;br&gt;
heterogeneous in difficulty and both models are sensitive to that difficulty. If hard tasks fail both models and easy&lt;br&gt;
tasks pass both, r(A,B) is large and paired bootstrapping can cut CI width sharply. In this data, the dominant pattern&lt;br&gt;
is trained LoRA passing where baseline fails — pushing correlation toward zero and making pairing nearly irrelevant for&lt;br&gt;
variance reduction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When does low correlation arise in LLM evals?&lt;/strong&gt; Low r(A,B) often accompanies a large capability gap: the stronger model&lt;br&gt;
succeeds on tasks too hard for the weaker one, so their pass/fail patterns decorrelate. That is good news for the&lt;br&gt;
trained model's lift, but it means paired bootstrapping loses its statistical efficiency advantage precisely when the&lt;br&gt;
lift is largest.&lt;/p&gt;
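&lt;p&gt;One way to make this concrete: for two binary vectors, the pass rates alone cap how large r can be. The sketch below computes that cap under the assumption that every pass of the weaker system is also a pass of the stronger one (the most correlated arrangement possible). The function name &lt;code&gt;max_pearson_r&lt;/code&gt; is hypothetical, and the second call uses made-up pass rates for contrast.&lt;/p&gt;

```python
import math


def max_pearson_r(p_a, p_b):
    """Largest Pearson r two binary vectors can have given their pass rates."""
    lo, hi = min(p_a, p_b), max(p_a, p_b)
    # Correlation peaks when every pass of the weaker system is also a pass
    # of the stronger one, so E[AB] = lo and Cov = lo * (1 - hi).
    return math.sqrt(lo * (1 - hi) / ((1 - lo) * hi))


print(round(max_pearson_r(16 / 48, 41 / 48), 3))   # pass rates from this eval
print(round(max_pearson_r(0.50, 0.55), 3))         # hypothetical small gap
```

&lt;p&gt;With pass rates of 0.333 and 0.854 the cap is about 0.292, so the observed r = 0.167 is already more than half of what these marginals allow. With the hypothetical 0.50 vs 0.55 pair the cap is about 0.905, which is where pairing has room to cut the standard error sharply.&lt;/p&gt;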




&lt;h2&gt;
  
  
  Pointers
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Papers:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Efron, B. &amp;amp; Hastie, T. (2016). &lt;a href="https://hastie.su.domains/CASI/" rel="noopener noreferrer"&gt;&lt;em&gt;Computer Age Statistical
Inference&lt;/em&gt;, Ch. 11 — Bootstrap Confidence Intervals&lt;/a&gt; — authoritative treatment of
bootstrap CIs for paired designs. Freely available via Stanford.&lt;/li&gt;
&lt;li&gt;Dror, R. et al. (2017). &lt;a href="https://aclanthology.org/P19-1266/" rel="noopener noreferrer"&gt;Replicability Analysis for Natural Language Processing: Testing Significance with Multiple Datasets&lt;/a&gt;
&lt;em&gt;(TACL)&lt;/em&gt; — canonical reference for paired bootstrap and permutation tests in NLP evaluation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tool:&lt;/strong&gt; NumPy &lt;code&gt;default_rng&lt;/code&gt; + bootstrap loop — reproducible in a Colab cell with no additional dependencies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Follow-on:&lt;/strong&gt; For a valid one-sided p-value (not just a CI), use a paired permutation test: randomly flip the sign of&lt;br&gt;
each task-pair's difference and count how often the sign-flipped mean equals or exceeds the observed mean. The bootstrap percentile CI&lt;br&gt;
lower bound being positive is consistent with significance but is not a p-value.&lt;/p&gt;
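&lt;p&gt;A minimal sketch of that sign-flip test, using the task-level differences from the contingency table above. The seed and the number of permutations are arbitrary choices, and the +1 correction in the p-value is one common Monte Carlo convention rather than something taken from the original analysis.&lt;/p&gt;

```python
import numpy as np

# Task-level differences (trained minus baseline) from the contingency table:
# +1 on 26 tasks, -1 on 1 task, 0 on 21 tasks.
diffs = np.array([1] * 26 + [-1] * 1 + [0] * 21)
observed = diffs.mean()

rng = np.random.default_rng(0)   # seed is an arbitrary choice
n_perm = 20_000                  # number of random sign assignments

# Under the null of no lift, each task's difference is equally likely
# to be +d or -d, so flip signs at random and recompute the mean.
signs = rng.choice([-1, 1], size=(n_perm, diffs.size))
null_means = (signs * diffs).mean(axis=1)

# One-sided Monte Carlo p-value with the +1 correction, counting null
# means at least as large as the observed mean.
p_value = (1 + np.sum(null_means >= observed)) / (n_perm + 1)
print(f"observed lift = {observed:.3f}, one-sided p = {p_value:.2e}")
```

&lt;p&gt;The Monte Carlo estimate cannot go below 1 / (n_perm + 1). With only 27 nonzero differences the exact tail is easy to state instead: the sign-flipped mean reaches the observed value only when at least 26 of the 27 nonzero signs come out positive, a probability of 28 / 2^27.&lt;/p&gt;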

</description>
      <category>machinelearning</category>
      <category>statistics</category>
      <category>llm</category>
      <category>evaluation</category>
    </item>
    <item>
      <title>DPO vs SimPO: What Your Preference Trainer Is Actually Optimizing</title>
      <dc:creator>Natnael Alemseged</dc:creator>
      <pubDate>Thu, 07 May 2026 20:51:39 +0000</pubDate>
      <link>https://dev.to/natnael_alemseged/dpo-vs-simpo-what-your-preference-trainer-is-actually-optimizing-42b4</link>
      <guid>https://dev.to/natnael_alemseged/dpo-vs-simpo-what-your-preference-trainer-is-actually-optimizing-42b4</guid>
      <description>&lt;p&gt;SalesConversion-Bench had one uncomfortable preference-tuning mismatch: the code trained with TRL &lt;code&gt;DPOTrainer&lt;/code&gt;, while the methodology narrative argued for SimPO.&lt;/p&gt;

&lt;p&gt;That is not just a naming issue. DPO and SimPO turn the same &lt;code&gt;(prompt, chosen, rejected)&lt;/code&gt; pair into different update signals. If the held-out lift is small, like 22.73% vs 18.18%, the project cannot honestly claim whether the model improved because DPO was the right objective, because LoRA rank constrained the update, or because training margins improved without robust held-out behavior.&lt;/p&gt;

&lt;p&gt;The useful answer is not "DPO good, SimPO good, ORPO also good." The useful answer is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Compare the objectives under fixed conditions, control for LoRA rank, and keep the objective whose gains survive held-out evaluation instead of only improving training margins.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The gradient difference
&lt;/h2&gt;

&lt;h3&gt;
  
  
  DPO: reference-relative preference learning
&lt;/h3&gt;

&lt;p&gt;DPO treats preference tuning as a comparison between two log-probability gaps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;policy_gap = log pi_theta(chosen | prompt) - log pi_theta(rejected | prompt)
ref_gap    = log pi_ref(chosen | prompt)   - log pi_ref(rejected | prompt)

loss = -log sigmoid(beta * (policy_gap - ref_gap))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So the update asks:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Has the trainable policy made the chosen answer more preferred than the rejected answer, beyond what the reference policy already believed?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That reference-relative part is the key. DPO does not only ask whether the chosen answer is more likely than the rejected answer. It asks whether the policy improved that preference gap relative to a reference model.&lt;/p&gt;

&lt;p&gt;In a LoRA setup with TRL &lt;code&gt;DPOTrainer(ref_model=None)&lt;/code&gt;, the exact reference handling depends on TRL and PEFT configuration. Some setups avoid loading a separate reference model and compute reference behavior by disabling adapters; others use a frozen reference copy. The implementation detail should be verified in the actual training stack.&lt;/p&gt;

&lt;p&gt;But the conceptual point stays the same: &lt;strong&gt;DPO is anchored to a reference policy&lt;/strong&gt;. That can be helpful if the base instruct model already has useful judgment priors. It can also preserve the wrong shortcut if the reference already favors short, generic, policy-shaped answers.&lt;/p&gt;
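&lt;p&gt;The reference anchor is easiest to see on a single pair. The sketch below evaluates the loss formula above on hypothetical summed log-probabilities; the function name &lt;code&gt;dpo_loss&lt;/code&gt;, the numbers, and &lt;code&gt;beta=0.1&lt;/code&gt; are illustrative assumptions, not values from the SalesConversion-Bench run or from TRL's internals.&lt;/p&gt;

```python
import math


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair, given summed sequence log-probabilities."""
    policy_gap = pol_chosen - pol_rejected
    ref_gap = ref_chosen - ref_rejected
    z = beta * (policy_gap - ref_gap)
    # -log(sigmoid(z)) written as log(1 + exp(-z)); fine for moderate z.
    return math.log1p(math.exp(-z))


# Hypothetical log-probs: the policy prefers "chosen" by 4 nats in both cases.
# Case 1: the reference was indifferent, so the policy gets credit.
print(dpo_loss(-20.0, -24.0, -22.0, -22.0))
# Case 2: the reference already preferred "chosen" by 4 nats, so no credit.
print(dpo_loss(-20.0, -24.0, -20.0, -24.0))
```

&lt;p&gt;In the second case the loss equals log 2, the same value an untrained policy would get, even though the policy ranks the pair correctly. That is the sense in which DPO rewards movement relative to the reference rather than the absolute preference.&lt;/p&gt;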

&lt;h3&gt;
  
  
  SimPO: reference-free margin learning
&lt;/h3&gt;

&lt;p&gt;SimPO removes the reference model and scores each answer using average log-probability per token:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;r(prompt, answer) = (1 / answer_length) * log pi_theta(answer | prompt)

loss = -log sigmoid(beta * (r_chosen - r_rejected) - gamma)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The update asks:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Has the policy made the chosen answer better than the rejected answer by at least the target margin &lt;code&gt;gamma&lt;/code&gt;, using length-normalized scores?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That changes two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;No reference anchor:&lt;/strong&gt; SimPO directly pushes the chosen answer above the rejected answer. It does not ask whether the policy improved relative to the base model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Length normalization:&lt;/strong&gt; A long rejected answer is not punished merely because total log-probability accumulates over more tokens.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That second point matters in preference data where chosen and rejected answers differ in length. If the preferred answer is often shorter, a total-log-prob objective can make brevity look like quality. SimPO's average-log-prob reward reduces that artifact.&lt;/p&gt;
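&lt;p&gt;A small numeric sketch of the length effect, with the margin subtracted outside the beta scaling as in the SimPO paper. The function name &lt;code&gt;simpo_loss&lt;/code&gt;, the token counts, the log-probabilities, and the &lt;code&gt;beta&lt;/code&gt; and &lt;code&gt;gamma&lt;/code&gt; values are all hypothetical and chosen only to make the contrast visible.&lt;/p&gt;

```python
import math


def simpo_loss(logp_chosen, len_chosen, logp_rejected, len_rejected,
               beta=2.0, gamma=1.0):
    """SimPO loss for one pair from summed log-probs and token counts."""
    r_chosen = logp_chosen / len_chosen          # average log-prob per token
    r_rejected = logp_rejected / len_rejected
    z = beta * (r_chosen - r_rejected) - gamma
    return math.log1p(math.exp(-z))              # -log(sigmoid(z))


# Hypothetical pair: a 10-token chosen answer and a 40-token rejected answer,
# both at exactly -1.0 nats per token.
logp_chosen, len_chosen = -10.0, 10
logp_rejected, len_rejected = -40.0, 40

total_gap = logp_chosen - logp_rejected                             # 30.0
avg_gap = logp_chosen / len_chosen - logp_rejected / len_rejected   # 0.0
print(total_gap, avg_gap)
print(simpo_loss(logp_chosen, len_chosen, logp_rejected, len_rejected))
```

&lt;p&gt;The summed log-probabilities favor the shorter answer by 30 nats purely because it has fewer tokens, while the per-token scores are identical. Under the length-normalized reward the pair carries no preference signal yet, so the loss stays high until the policy actually separates the two answers.&lt;/p&gt;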

&lt;p&gt;The falsifiable hypothesis is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If DPO's gains are mostly coming from reference-relative or length artifacts, then SimPO with the same data, seed, train steps, and LoRA rank should produce cleaner held-out margins and accuracy without increasing the train/eval gap.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Where ORPO fits
&lt;/h3&gt;

&lt;p&gt;ORPO combines a supervised chosen-answer term with an odds-ratio preference term. It should not be treated as a co-equal candidate in this comparison: the live mismatch is DPO in code vs SimPO in the methodology.&lt;/p&gt;

&lt;p&gt;ORPO becomes interesting if both DPO and SimPO are unstable, or if the model needs stronger behavior-cloning pressure toward chosen outputs. For this decision, it is a fallback, not the main branch.&lt;/p&gt;

&lt;h2&gt;
  
  
  The overoptimization check
&lt;/h2&gt;

&lt;p&gt;Training loss alone is not enough. In a small preference-tuning run, the warning pattern is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;training preference margins improve,
but held-out accuracy or held-out margins do not improve.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The two metrics to inspect first are:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Diagnostic&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;th&gt;Bad sign&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Train &lt;code&gt;rewards/margins&lt;/code&gt; vs held-out pair accuracy or held-out margins&lt;/td&gt;
&lt;td&gt;Separates real preference learning from training-set margin inflation&lt;/td&gt;
&lt;td&gt;Train margins rise while held-out behavior stays flat or worsens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chosen/rejected reward or log-prob movement&lt;/td&gt;
&lt;td&gt;Shows whether improvement comes from lifting chosen answers, suppressing rejected answers, or drifting oddly from the reference&lt;/td&gt;
&lt;td&gt;Rejected scores collapse while chosen quality does not improve&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If TRL logs &lt;code&gt;rewards/chosen&lt;/code&gt;, &lt;code&gt;rewards/rejected&lt;/code&gt;, and &lt;code&gt;rewards/margins&lt;/code&gt;, use those directly. If it also logs policy/reference log-probs, inspect whether the DPO margin is improving because chosen answers are becoming more likely, or mainly because rejected answers are being pushed down.&lt;/p&gt;

&lt;p&gt;The second case is not automatically reward hacking. It is a review flag. It needs held-out and qualitative confirmation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hands-on pattern: inspect train vs eval margins
&lt;/h2&gt;

&lt;p&gt;Before arguing that DPO or SimPO "worked," add a tiny log inspection step. The goal is not to prove overoptimization from one scalar. The goal is to force the comparison between training margins and held-out behavior.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_jsonl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;read_text&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;splitlines&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;last_number&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;reversed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;review_preference_run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_log&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eval_log&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_jsonl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_log&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;midpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;early_margin&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;last_number&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;midpoint&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rewards/margins&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;train_rewards/margins&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;late_margin&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;last_number&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;midpoint&lt;/span&gt;&lt;span class="p"&gt;:],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rewards/margins&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;train_rewards/margins&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;chosen&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;last_number&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;midpoint&lt;/span&gt;&lt;span class="p"&gt;:],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rewards/chosen&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;train_rewards/chosen&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;rejected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;last_number&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;midpoint&lt;/span&gt;&lt;span class="p"&gt;:],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rewards/rejected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;train_rewards/rejected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;train margin: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;early_margin&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; -&amp;gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;late_margin&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;late chosen/rejected rewards: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;chosen&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; / &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;rejected&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;eval_log&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;eval_rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_jsonl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eval_log&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;eval_margin&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;last_number&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;eval_rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval_rewards/margins&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rewards/margins&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;eval_acc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;last_number&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eval_rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval_accuracy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;held-out margin: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;eval_margin&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;held-out accuracy: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;eval_acc&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;early_margin&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;late_margin&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;late_margin&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;early_margin&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;eval_log&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Review flag: train margin improved. Confirm with held-out pairs.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;late_margin&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;early_margin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Review flag: weak training signal. Check rank, LR, or pair quality.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The useful pattern is the comparison:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;train margins up + held-out behavior up    -&amp;gt; plausible improvement
train margins up + held-out behavior flat  -&amp;gt; likely training-set margin inflation
train margins flat                         -&amp;gt; weak signal, bad data, or too little capacity
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  LoRA rank is a confounder
&lt;/h2&gt;

&lt;p&gt;The current LoRA config is &lt;code&gt;r=16&lt;/code&gt;, &lt;code&gt;alpha=32&lt;/code&gt;, &lt;code&gt;dropout=0.05&lt;/code&gt;. For a 0.5B model, that is plausible. The risk is not that &lt;code&gt;r=16&lt;/code&gt; is obviously wrong. The risk is that a rank effect can masquerade as a difference between objectives.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank failure&lt;/th&gt;
&lt;th&gt;Expected pattern&lt;/th&gt;
&lt;th&gt;First observable&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Rank too low&lt;/td&gt;
&lt;td&gt;Training margins plateau early, train loss barely moves, held-out accuracy is flat&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;rewards/margins&lt;/code&gt; and train loss&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rank too high for small data&lt;/td&gt;
&lt;td&gt;Training margins keep improving while held-out accuracy or margins get noisy or worse&lt;/td&gt;
&lt;td&gt;Train/eval margin gap&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;So rank should stay in the ablation, but lightly. It is not the main theory. It is a control that prevents the false conclusion "SimPO lost" or "DPO won" when the real issue was adapter capacity.&lt;/p&gt;

&lt;h2&gt;
  
  
  The smallest decisive ablation
&lt;/h2&gt;

&lt;p&gt;The cleanest small matrix is 2 objectives x 2 ranks:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Run&lt;/th&gt;
&lt;th&gt;Objective&lt;/th&gt;
&lt;th&gt;LoRA rank&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;A&lt;/td&gt;
&lt;td&gt;DPO&lt;/td&gt;
&lt;td&gt;r=16&lt;/td&gt;
&lt;td&gt;Current baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;B&lt;/td&gt;
&lt;td&gt;SimPO&lt;/td&gt;
&lt;td&gt;r=16&lt;/td&gt;
&lt;td&gt;Isolate objective change at current capacity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C&lt;/td&gt;
&lt;td&gt;DPO&lt;/td&gt;
&lt;td&gt;r=8&lt;/td&gt;
&lt;td&gt;Test whether lower rank regularizes DPO&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;D&lt;/td&gt;
&lt;td&gt;SimPO&lt;/td&gt;
&lt;td&gt;r=8&lt;/td&gt;
&lt;td&gt;Test whether lower rank regularizes SimPO&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Everything else should stay fixed: model, train/validation split, seed, pair data, max length, epochs or steps, learning rate, batch size, and evaluation script.&lt;/p&gt;
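
&lt;p&gt;A minimal sketch of that matrix as run configs, so the four runs differ only in objective and rank. The key names, seed, learning rate, and epoch count are placeholders rather than the project's real settings. The sketch also assumes &lt;code&gt;alpha&lt;/code&gt; stays at 32 for both ranks, which is one reading of "everything else fixed".&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from itertools import product

# Shared settings. alpha and dropout are the values quoted in this post;
# seed, learning rate and epochs are placeholders, not the project's values.
FIXED = {
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "seed": 0,
    "learning_rate": 5e-6,
    "num_epochs": 1,
}

OBJECTIVES = ["dpo", "simpo"]
RANKS = [16, 8]


def build_ablation_matrix():
    """Return one config per (rank, objective) cell, labelled A-D as in the table."""
    runs = []
    cells = product(RANKS, OBJECTIVES)
    for label, (rank, objective) in zip("ABCD", cells):
        runs.append({"run": label, "objective": objective, "lora_r": rank, **FIXED})
    return runs


for run in build_ablation_matrix():
    print(run["run"], run["objective"], run["lora_r"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;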

&lt;p&gt;The decision rule:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Prefer SimPO if:
SimPO beats DPO at the same rank on held-out accuracy or held-out margin,
and the gain is not paired with a larger train/eval margin gap.

Prefer DPO if:
DPO matches or beats SimPO at the same rank,
and DPO has a smaller train/eval gap or better qualitative behavior.

Prefer the lower rank if:
r=8 has slightly lower training margins but equal or better held-out behavior.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important correction is sample size. With only 22 held-out pairs, a 3 percentage-point rule is too fine-grained because one example is about 4.5 percentage points.&lt;/p&gt;

&lt;p&gt;A more defensible rule is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Switch objectives only if the winner improves by at least one additional held-out pair, has better or equal held-out margin behavior, and does not worsen the important qualitative failure slices.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;One extra correct pair alone is a hint. One extra correct pair plus cleaner margins and no slice regression is a decision.&lt;/p&gt;
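
&lt;p&gt;The arithmetic and the rule above can be sketched in a few lines. The function names and boolean inputs are hypothetical; how "margin behavior" and "slice regression" are measured is left to the evaluation script.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def accuracy_step(n_pairs):
    """Percentage points that one held-out example moves accuracy."""
    return 100.0 / n_pairs


def should_switch(extra_correct_pairs, margin_not_worse, slice_regressed):
    """Apply the rule quoted above: at least one extra held-out pair,
    held-out margin behavior better or equal, and no slice regression."""
    return extra_correct_pairs >= 1 and margin_not_worse and not slice_regressed


# One example out of 22 moves accuracy by about 4.5 percentage points,
# so a 3-point threshold can never be met exactly.
print(round(accuracy_step(22), 2))

# One extra pair with clean margins and no slice regression is a decision.
print(should_switch(1, margin_not_worse=True, slice_regressed=False))
# One extra pair with worse held-out margins is only a hint.
print(should_switch(1, margin_not_worse=False, slice_regressed=False))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;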

&lt;h2&gt;
  
  
  Bottom line
&lt;/h2&gt;

&lt;p&gt;The SalesConversion-Bench run should be described as &lt;strong&gt;DPO preference tuning with LoRA&lt;/strong&gt;, not SimPO-first preference tuning.&lt;/p&gt;

&lt;p&gt;The gap closes when the project can say:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;DPO updates the model using a reference-relative chosen-vs-rejected margin.&lt;/li&gt;
&lt;li&gt;SimPO updates the model using a reference-free, length-normalized target margin.&lt;/li&gt;
&lt;li&gt;The objective choice should be decided by a controlled DPO-vs-SimPO ablation at fixed rank, with one lower-rank control to catch LoRA overfitting.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That turns "we prefer SimPO" from a narrative claim into an experiment the project can actually defend.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Rafailov et al. (2023), &lt;a href="https://arxiv.org/abs/2305.18290" rel="noopener noreferrer"&gt;Direct Preference Optimization: Your Language Model is Secretly a Reward Model&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Meng et al. (2024), &lt;a href="https://arxiv.org/abs/2405.14734" rel="noopener noreferrer"&gt;SimPO: Simple Preference Optimization with a Reference-Free Reward&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Hong et al. (2024), &lt;a href="https://arxiv.org/abs/2403.07691" rel="noopener noreferrer"&gt;ORPO: Monolithic Preference Optimization without Reference Model&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Hugging Face TRL, &lt;a href="https://huggingface.co/docs/trl/dpo_trainer" rel="noopener noreferrer"&gt;&lt;code&gt;DPOTrainer&lt;/code&gt; documentation&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>finetuning</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>"Return JSON only" doesn't force JSON. Here's what actually forces it.</title>
      <dc:creator>Natnael Alemseged</dc:creator>
      <pubDate>Wed, 06 May 2026 19:09:36 +0000</pubDate>
      <link>https://dev.to/natnael_alemseged/return-json-only-doesnt-force-json-heres-what-actually-forces-it-9pn</link>
      <guid>https://dev.to/natnael_alemseged/return-json-only-doesnt-force-json-heres-what-actually-forces-it-9pn</guid>
      <description>&lt;p&gt;You have a judge LLM in your pipeline. You've told it:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Return JSON only. No preamble, no explanation. Just the JSON object."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It works great in testing. It works great in staging. Then in production it returns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;Sure!&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Here's&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;my&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;evaluation&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;of&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;response:&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The answer is mostly correct but..."&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your &lt;code&gt;json.loads()&lt;/code&gt; throws. Your pipeline swallows the exception and substitutes &lt;code&gt;None&lt;/code&gt;. Downstream code receives that &lt;code&gt;None&lt;/code&gt; and keeps running. Your evaluation scores are silently wrong for the next 200 requests before anyone notices.&lt;/p&gt;

&lt;p&gt;Was this the model misbehaving? No. Was there ever a way to &lt;em&gt;actually&lt;/em&gt; force JSON output? Yes — but it's not the prompt. Let me show you the real mechanism.&lt;/p&gt;




&lt;h2&gt;
  
  
  What "return JSON only" actually does
&lt;/h2&gt;

&lt;p&gt;When you write a format instruction in a prompt, you are doing exactly one thing: &lt;strong&gt;shifting the probability distribution over the next token.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The model has seen millions of examples during training where that kind of phrasing is followed by &lt;code&gt;{&lt;/code&gt; and a well-formed JSON body. Your instruction loads that pattern strongly into the context. The probability mass on JSON-shaped tokens goes way up — often high enough that you get valid JSON 95–99% of the time on a well-tuned model.&lt;/p&gt;

&lt;p&gt;But probable is not certain.&lt;/p&gt;

&lt;p&gt;At every decoding step, the model selects the next token according to its output distribution. At temperature 0 it picks the argmax — the single highest-probability token — deterministically. At any temperature above 0 it samples, meaning lower-probability tokens can and do get selected. Either way, the instruction only shapes that distribution; it does not remove outcomes from it. A preamble phrase like &lt;code&gt;"Sure! Here's the evaluation:"&lt;/code&gt; has a very small but non-zero probability at step one. If something in the context — a long system prompt, a conversational tone in your input, a model that was fine-tuned to sound helpful — nudges that probability even slightly upward, you get the preamble and your parse fails. Deterministic decoding reduces but does not eliminate the risk: if the highest-probability token at step one genuinely is a preamble token, you still get it.&lt;/p&gt;

&lt;p&gt;This is instruction-following. It is a &lt;strong&gt;soft mechanism&lt;/strong&gt;. It has no hard guarantees.&lt;/p&gt;
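
&lt;p&gt;A toy illustration of why the mechanism is soft. The three-token vocabulary and the logit values are invented for this sketch; the point is only that the preamble token keeps a non-zero probability, and that greedy decoding picks it as soon as context nudges it to the top.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math
import random

# Toy next-token logits at step one (invented values, three-token vocabulary).
LOGITS = {"{": 6.0, "Sure": 1.5, "Here": 0.5}


def softmax(logits, temperature=1.0):
    """Convert logits into probabilities at the given temperature."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def greedy(logits):
    """Temperature 0: always pick the highest-logit token."""
    return max(logits, key=logits.get)


def sample(logits, temperature, rng):
    """Temperature above 0: draw a token in proportion to its probability."""
    probs = softmax(logits, temperature)
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


probs = softmax(LOGITS)
print({tok: round(p, 4) for tok, p in probs.items()})
print("greedy pick:", greedy(LOGITS))

rng = random.Random(0)
draws = [sample(LOGITS, 1.0, rng) for _ in range(10_000)]
print("non-JSON first tokens in 10,000 samples:", sum(tok != "{" for tok in draws))

# If context pushes the preamble token on top, greedy decoding emits it too.
nudged = dict(LOGITS, Sure=6.5)
print("greedy pick after nudge:", greedy(nudged))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;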




&lt;h2&gt;
  
  
  What actually forces JSON: constrained decoding
&lt;/h2&gt;

&lt;p&gt;There is a different mechanism called &lt;strong&gt;constrained decoding&lt;/strong&gt; (also called structured generation or grammar-guided sampling). It does not operate at the prompt layer. It operates at the inference layer — before sampling happens.&lt;/p&gt;

&lt;p&gt;Here is how it works:&lt;/p&gt;

&lt;p&gt;At each decoding step, the system compares the current partial output against a grammar or schema. Any token that would make the output invalid at this parse state gets its logit set to &lt;strong&gt;negative infinity&lt;/strong&gt; — probability zero. The model cannot produce that token. Not unlikely. &lt;em&gt;Cannot.&lt;/em&gt;&lt;/p&gt;
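
&lt;p&gt;The masking step can be sketched on a toy vocabulary. Everything here is illustrative: the logits are invented, and &lt;code&gt;allowed_first_tokens&lt;/code&gt; is a hypothetical stand-in for a real grammar, which would track the parse state across the whole output rather than only the first token.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math

# Toy vocabulary and logits (invented). Unconstrained, "Sure" would win.
LOGITS = {"{": 2.0, "Sure": 3.5, "Here": 1.0, '"': 0.5}


def allowed_first_tokens():
    """Hypothetical grammar state: a JSON object must open with a brace."""
    return {"{"}


def mask_logits(logits, allowed):
    """Set every token the grammar forbids to negative infinity."""
    return {tok: (v if tok in allowed else -math.inf) for tok, v in logits.items()}


def softmax(logits):
    """Convert logits to probabilities; exp(-inf) evaluates to exactly 0.0."""
    peak = max(logits.values())
    exps = {tok: math.exp(v - peak) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


unmasked = softmax(LOGITS)
masked = softmax(mask_logits(LOGITS, allowed_first_tokens()))
print("unmasked argmax:", max(unmasked, key=unmasked.get))
print("masked probabilities:", masked)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;After masking, the forbidden tokens have probability exactly zero, so neither greedy decoding nor sampling can select them.&lt;/p&gt;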

&lt;p&gt;The foundational paper is Willard &amp;amp; Louf (2023), &lt;a href="https://arxiv.org/abs/2307.09702" rel="noopener noreferrer"&gt;&lt;em&gt;Efficient Guided Generation for Large Language Models&lt;/em&gt;&lt;/a&gt;. They show how to compile a JSON schema into a finite-state machine and use it to mask the vocabulary at each decoding step in O(1) time per token on average. That last part matters: the approach is fast enough to use in production without meaningful latency overhead.&lt;/p&gt;

&lt;p&gt;This is implemented today in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/dottxt-ai/outlines" rel="noopener noreferrer"&gt;Outlines&lt;/a&gt;&lt;/strong&gt; — the reference library from the paper authors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;llama.cpp&lt;/strong&gt; via &lt;code&gt;--grammar-file&lt;/code&gt; (GBNF grammar format)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI structured outputs&lt;/strong&gt; (&lt;code&gt;response_format: { type: "json_schema", json_schema: {...} }&lt;/code&gt;). OpenAI's &lt;a href="https://platform.openai.com/docs/guides/structured-outputs" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; describes this as constraining generation to the supplied schema, so every completed, non-refused call returns schema-valid output. Note the qualifiers: a safety refusal returns a refusal message instead of schema-valid JSON, and a response cut off by the token limit can be incomplete. Your boundary code should handle both cases explicitly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The difference from soft prompting is &lt;strong&gt;categorical&lt;/strong&gt;, not quantitative. Instruction-following is a distribution shift. Constrained decoding is a hard exclusion.&lt;/p&gt;




&lt;h2&gt;
  
  
  Soft vs hard: a minimal code comparison
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The soft approach — what most pipelines do:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Evaluate this response. Return JSON only: {&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: int, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: str}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONDecodeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;  &lt;span class="c1"&gt;# silent failure — downstream receives None and keeps running
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;try/except&lt;/code&gt; here is necessary but not sufficient. Catching the error and returning &lt;code&gt;None&lt;/code&gt; just defers the damage — whatever uses &lt;code&gt;result&lt;/code&gt; now has to handle &lt;code&gt;None&lt;/code&gt; everywhere, and if it doesn't, the failure propagates silently and corrupts your scores.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The hard approach — schema enforced at the token level:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Evaluation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-2024-08-06&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Evaluate this response.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;response_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Evaluation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;  &lt;span class="c1"&gt;# a valid Evaluation unless the model refused; on refusal parsed is None
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



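&lt;p&gt;No &lt;code&gt;try/except&lt;/code&gt; around &lt;code&gt;json.loads()&lt;/code&gt;. For any response the model completes without refusing, &lt;code&gt;result&lt;/code&gt; is a typed &lt;code&gt;Evaluation&lt;/code&gt; object, because the schema was enforced at the token level while the response was being generated. The one case left to handle is a refusal, where &lt;code&gt;parsed&lt;/code&gt; is &lt;code&gt;None&lt;/code&gt; and &lt;code&gt;message.refusal&lt;/code&gt; carries the model's explanation.&lt;/p&gt;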
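
&lt;p&gt;Because a refusal can still come back without a schema-valid object, as noted for structured outputs above, the boundary can check for it explicitly. This is a sketch, not the pipeline's actual code: &lt;code&gt;require_parsed&lt;/code&gt; and &lt;code&gt;JudgeRefused&lt;/code&gt; are hypothetical names, and it assumes the message exposes &lt;code&gt;parsed&lt;/code&gt; and &lt;code&gt;refusal&lt;/code&gt; attributes as the OpenAI Python SDK's parse helper does. Stand-in objects replace the real API response so the example runs offline.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from types import SimpleNamespace


class JudgeRefused(RuntimeError):
    """Raised when the judge declines to answer instead of returning a score."""


def require_parsed(message):
    """Return message.parsed, or raise if the model refused or parsed is missing."""
    refusal = getattr(message, "refusal", None)
    if refusal:
        raise JudgeRefused(f"judge refused: {refusal}")
    if message.parsed is None:
        raise ValueError("no parsed object and no refusal text on the message")
    return message.parsed


# Stand-ins for response.choices[0].message, so the sketch runs without an API call.
ok = SimpleNamespace(parsed={"score": 4, "reason": "mostly correct"}, refusal=None)
refused = SimpleNamespace(parsed=None, refusal="I can't help with that.")

print(require_parsed(ok))
try:
    require_parsed(refused)
except JudgeRefused as exc:
    print("handled:", exc)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Raising a named exception keeps a refusal from being scored as if it were an evaluation; the caller decides whether to retry, skip, or alert.&lt;/p&gt;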




&lt;h2&gt;
  
  
  Back to my system: where this broke and what changed
&lt;/h2&gt;

&lt;p&gt;In my LLM judge pipeline, the boundary parsing lives in &lt;code&gt;ledger/agents/credit_analysis_agent.py&lt;/code&gt; (see the &lt;code&gt;_parse_json&lt;/code&gt; helper). The utility function responsible for parsing judge output looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_safe_parse_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONDecodeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;  &lt;span class="c1"&gt;# returned silently — the caller never knew
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The failure case that exposed this: the judge model received an unusually long input passage and responded with a one-sentence acknowledgment before the JSON object. Here is a synthetic but representative example of the failing shape:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="s2"&gt;"Sure! Here's my evaluation:&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;score&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: 0, &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;reason&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;...&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;}"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;_safe_parse_json&lt;/code&gt; returned &lt;code&gt;None&lt;/code&gt;. The scoring loop treated &lt;code&gt;None&lt;/code&gt; as a valid result, defaulted the score to &lt;code&gt;0&lt;/code&gt;, and logged 47 evaluations as failures — all of them wrong, all of them silent.&lt;/p&gt;

&lt;p&gt;The fix had two parts. First, the immediate boundary hardening:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_safe_parse_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Keep only the text from the first { to the last }, dropping any preamble or trailing text
&lt;/span&gt;    &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rfind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No JSON object found in output: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;repr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;stripped&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stripped&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONDecodeError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Judge returned unparseable output: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;repr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Second, and more importantly, the primary judge call was migrated to use &lt;code&gt;response_format&lt;/code&gt; with a Pydantic schema. The stripping logic is now a fallback for open-weight model calls only. For the main judge endpoint, a completed response conforms to the schema because it is enforced at decode time; refusals and responses truncated by the token limit still need explicit handling.&lt;/p&gt;

&lt;p&gt;The model card was also updated to accurately reflect that the judge's output reliability comes from constrained decoding, not prompt engineering. That distinction matters the moment someone considers swapping the underlying model.&lt;/p&gt;




&lt;h2&gt;
  
  
  Three rules for any pipeline acting on structured LLM output
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Validate at every trust boundary.&lt;/strong&gt; Every point where LLM output enters your code as structured data is a trust boundary. Treat a parse failure as a first-class event — log it, alert on it, raise loudly — and never let a &lt;code&gt;None&lt;/code&gt; flow silently downstream.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Use constrained decoding when the output is load-bearing.&lt;/strong&gt; If a score, routing decision, or classification depends on structured output, use a constrained endpoint or library. Soft-prompt failures in the 1–5% range compound hard in multi-step pipelines. A judge that is wrong 2% of the time in isolation is wrong much more often when it runs 10 times in an evaluation chain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Keep the prompt instruction anyway.&lt;/strong&gt; Even with constrained decoding, write the format instruction in your prompt. It improves output quality and serves as documentation of intent for anyone reading the code. But treat it as a hint to the model, not a technical contract. The schema enforcement is the contract.&lt;/p&gt;
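&lt;p&gt;The compounding in rule 2 can be made concrete with a few lines of arithmetic. This sketch assumes each judge call fails independently with the same probability, which is a simplification; &lt;code&gt;chain_failure_probability&lt;/code&gt; is a hypothetical helper, not part of the pipeline.&lt;/p&gt;

```python
def chain_failure_probability(per_call_failure: float, calls: int) -> float:
    """P(at least one failure) across `calls` independent judge calls."""
    return 1.0 - (1.0 - per_call_failure) ** calls


for calls in (1, 5, 10, 20):
    p = chain_failure_probability(0.02, calls)
    print(f"{calls:>2} calls: {p:.1%} chance of at least one bad parse")
```

&lt;p&gt;Under that assumption, a 2% per-call failure rate becomes roughly an 18% chance of at least one failure across 10 calls.&lt;/p&gt;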




&lt;h2&gt;
  
  
  The real lesson
&lt;/h2&gt;

&lt;p&gt;The pipeline didn't break because the model was unreliable. It broke because the system was designed as if a prompt instruction were equivalent to a type constraint. It is not.&lt;/p&gt;

&lt;p&gt;A prompt instruction is a statistical nudge. A grammar enforced at decode time is a guarantee. The moment structured LLM output feeds into code that acts on it — a scoring system, an agent router, a tool-call parser, an extraction pipeline — you need one of the two.&lt;/p&gt;

&lt;p&gt;A nudge is not enough.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The code in the "back to my system" section is drawn from a real LLM judge pipeline built during a structured AI engineering program. The failure described happened in production.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Sources&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Willard, B. &amp;amp; Louf, R. (2023). &lt;em&gt;Efficient Guided Generation for Large Language Models.&lt;/em&gt; arXiv:2307.09702. &lt;a href="https://arxiv.org/abs/2307.09702" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2307.09702&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;OpenAI. &lt;em&gt;Structured Outputs — Platform Documentation.&lt;/em&gt; &lt;a href="https://platform.openai.com/docs/guides/structured-outputs" rel="noopener noreferrer"&gt;https://platform.openai.com/docs/guides/structured-outputs&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>python</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Why Merged LoRA Barely Changes Inference Time</title>
      <dc:creator>Natnael Alemseged</dc:creator>
      <pubDate>Tue, 05 May 2026 14:09:00 +0000</pubDate>
      <link>https://dev.to/natnael_alemseged/why-merged-lora-barely-changes-inference-time-2mhj</link>
      <guid>https://dev.to/natnael_alemseged/why-merged-lora-barely-changes-inference-time-2mhj</guid>
      <description>&lt;p&gt;While my peer was benchmarking a sales conversion classifier fine-tuned on&lt;br&gt;
Qwen3-0.6B, a merged LoRA version of the model took 14,228 ms per&lt;br&gt;
task while the bare base model took 14,045 ms. That 183 ms gap is only&lt;br&gt;
about 1.3%. Why doesn't merging in extra trained weights make inference&lt;br&gt;
slower? And if the adapter is not the thing driving latency, what&lt;br&gt;
actually is?&lt;/p&gt;

&lt;p&gt;The short answer is: &lt;strong&gt;once LoRA is merged, the model is no longer doing&lt;br&gt;
"base model plus adapter" at inference time. It is just doing the base&lt;br&gt;
model computation with a different set of weight values.&lt;/strong&gt; The tensor&lt;br&gt;
shapes do not change, the number of layers does not change, and the&lt;br&gt;
number of bytes that must be moved for each generated token is almost&lt;br&gt;
the same. On modern GPUs, that last point matters most.&lt;/p&gt;

&lt;p&gt;One caution upfront: with only one timing run per system on a shared&lt;br&gt;
Colab T4, you cannot prove that 183 ms is "real." A 1.3% gap is&lt;br&gt;
&lt;strong&gt;plausibly noise&lt;/strong&gt;, not evidence that merged LoRA adds meaningful latency.&lt;br&gt;
The mechanism below explains why we should expect the difference to be&lt;br&gt;
near zero, and the controlled benchmark below confirms it directly.&lt;/p&gt;
&lt;h2&gt;
  
  
  What merged LoRA changes, and what it does not
&lt;/h2&gt;

&lt;p&gt;Before merging, a LoRA-adapted linear layer is effectively:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;y = W₀x + (α/r)BAx&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;where &lt;code&gt;W₀&lt;/code&gt; is the original weight matrix and &lt;code&gt;BA&lt;/code&gt; is the low-rank LoRA&lt;br&gt;
update. In that form, inference really does include extra operations:&lt;br&gt;
you still apply the base matrix, and you also apply the low-rank update.&lt;/p&gt;

&lt;p&gt;After &lt;code&gt;merge_and_unload()&lt;/code&gt;, those two pieces are combined ahead of time:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;W_merged = W₀ + (α/r)BA&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Now inference uses:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;y = W_merged x&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;That matters because the model no longer carries separate adapter&lt;br&gt;
modules through the forward pass. At generation time, there is no "plus&lt;br&gt;
adapter" branch left to execute. The model performs the same sequence of&lt;br&gt;
layer operations it did before, using weight tensors with the same&lt;br&gt;
shapes and usually the same dtype as the base model.&lt;/p&gt;

&lt;p&gt;So the key intuition is not "LoRA weights are free." The key intuition&lt;br&gt;
is: &lt;strong&gt;merged LoRA stops being a separate computation.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the core mechanism described in the original LoRA paper (Hu et al.,&lt;br&gt;
2021, &lt;a href="https://arxiv.org/abs/2106.09685" rel="noopener noreferrer"&gt;arXiv 2106.09685&lt;/a&gt;), which notes&lt;br&gt;
that merging incurs no additional inference latency because the adapter&lt;br&gt;
is folded into the original weights before any forward pass runs.&lt;/p&gt;
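&lt;p&gt;The equivalence of the two forms can be checked numerically. This is a minimal pure-Python sketch with tiny random matrices; the sizes &lt;code&gt;d&lt;/code&gt;, &lt;code&gt;r&lt;/code&gt; and &lt;code&gt;alpha&lt;/code&gt; are illustrative assumptions, not the benchmark's configuration, and it does not call PEFT.&lt;/p&gt;

```python
import random

random.seed(0)

d, r, alpha = 8, 2, 16  # illustrative sizes, not the benchmark's real config


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


W0 = rand_matrix(d, d)  # frozen base weight
B = rand_matrix(d, r)   # LoRA up-projection
A = rand_matrix(r, d)   # LoRA down-projection
x = [random.uniform(-1, 1) for _ in range(d)]
scale = alpha / r

# Unmerged path: base matvec plus a separate low-rank branch.
low_rank = matvec(B, matvec(A, x))
y_unmerged = [w + scale * l for w, l in zip(matvec(W0, x), low_rank)]

# Merged path: fold the update into the weights once, then a single matvec.
BA = matmul(B, A)
W_merged = [[w + scale * u for w, u in zip(w_row, u_row)] for w_row, u_row in zip(W0, BA)]
y_merged = matvec(W_merged, x)

max_diff = max(abs(a - b) for a, b in zip(y_unmerged, y_merged))
print(f"W_merged shape: {len(W_merged)}x{len(W_merged[0])}, max |diff| = {max_diff:.2e}")
```

&lt;p&gt;The merged matrix has the same shape as &lt;code&gt;W₀&lt;/code&gt;, and the two outputs agree up to floating-point rounding.&lt;/p&gt;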
&lt;h2&gt;
  
  
  Where token-generation time actually goes
&lt;/h2&gt;

&lt;p&gt;To understand why this makes almost no latency difference, we need to&lt;br&gt;
separate two phases of inference:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prefill.&lt;/strong&gt; The model processes the input prompt and builds the KV&lt;br&gt;
cache. This phase can use larger matrix-matrix style operations because&lt;br&gt;
many prompt tokens are processed together.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decode.&lt;/strong&gt; The model generates one new token at a time, reusing the KV&lt;br&gt;
cache and running a forward pass for just the next token.&lt;/p&gt;

&lt;p&gt;When people talk about autoregressive generation being slow, they are&lt;br&gt;
usually talking about &lt;strong&gt;decode&lt;/strong&gt;, not prefill. Decode is where latency&lt;br&gt;
becomes dominated by repeated small forward passes over the model's&lt;br&gt;
weights.&lt;/p&gt;

&lt;p&gt;At each layer during decode, the core linear operation is effectively a&lt;br&gt;
matrix-vector multiply: a hidden-state vector for one token multiplied by&lt;br&gt;
a weight matrix. That is a bad regime for GPUs because the computation&lt;br&gt;
per byte of memory moved is low. The GPU spends much of its time waiting&lt;br&gt;
for weights to be read from memory rather than saturating its compute&lt;br&gt;
units with arithmetic.&lt;/p&gt;

&lt;p&gt;That is why merged LoRA usually does not show up in decode latency. If&lt;br&gt;
&lt;code&gt;W_merged&lt;/code&gt; has the same shape and dtype as &lt;code&gt;W₀&lt;/code&gt;, then each token still&lt;br&gt;
requires moving essentially the same amount of model data through memory.&lt;br&gt;
The values inside the matrix changed, but the amount of work the GPU&lt;br&gt;
must schedule and the amount of memory it must read are almost the same.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The expensive part is still streaming the same-sized weight tensors and&lt;br&gt;
running the same decode loop — not "carrying extra learned knowledge."&lt;/strong&gt;&lt;/p&gt;
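&lt;p&gt;A rough way to see the difference between the two phases is to count arithmetic per byte of weights read. The sketch below is a back-of-the-envelope model only: it ignores activations and the KV cache, and the hidden size and prompt length are hypothetical values chosen for illustration.&lt;/p&gt;

```python
BYTES_PER_FP16 = 2


def arithmetic_intensity(d_in: int, d_out: int, tokens: int) -> float:
    """FLOPs per weight byte for `tokens` hidden vectors through one fp16 linear layer.

    Simplification: counts only weight reads, ignoring activations and the KV cache.
    """
    flops = 2 * d_in * d_out * tokens  # one multiply and one add per weight per token
    weight_bytes = d_in * d_out * BYTES_PER_FP16
    return flops / weight_bytes


d = 1024  # hypothetical hidden size, for illustration only

decode = arithmetic_intensity(d, d, tokens=1)     # one new token per step
prefill = arithmetic_intensity(d, d, tokens=256)  # a 256-token prompt processed together

print(f"decode:  {decode:.0f} FLOP per weight byte")
print(f"prefill: {prefill:.0f} FLOP per weight byte")
```

&lt;p&gt;Merging changes neither &lt;code&gt;d_in&lt;/code&gt; nor &lt;code&gt;d_out&lt;/code&gt;, so in this model the merged layer has exactly the same ratio as the base layer.&lt;/p&gt;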
&lt;h2&gt;
  
  
  The controlled benchmark
&lt;/h2&gt;

&lt;p&gt;To go beyond a single-run observation, the following three-way benchmark&lt;br&gt;
was run on a Colab T4: base model vs. unmerged adapter vs. merged adapter,&lt;br&gt;
with 11 runs per condition (the first discarded as warmup, leaving 10&lt;br&gt;
measured runs) and identical generation settings throughout.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup:&lt;/strong&gt; &lt;code&gt;Qwen/Qwen3-0.6B&lt;/code&gt; base model, adapter&lt;br&gt;
&lt;code&gt;Natnaela/my-qwen-0.5b-lora&lt;/code&gt;, &lt;code&gt;MAX_NEW_TOKENS=64&lt;/code&gt;, &lt;code&gt;do_sample=False&lt;/code&gt;,&lt;br&gt;
&lt;code&gt;float16&lt;/code&gt;, PEFT 0.14.0.&lt;/p&gt;

&lt;p&gt;The three conditions are loaded like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;peft&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PeftModel&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_base&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;BASE_MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;torch_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float16&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_unmerged&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;PeftModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;load_base&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;ADAPTER_PATH&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_merged&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PeftModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;load_base&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;ADAPTER_PATH&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;merge_and_unload&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;   &lt;span class="c1"&gt;# adapter folded into weights here
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each condition is timed with 11 runs, first discarded as warmup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;timed_generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;synchronize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;t0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;perf_counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;inference_mode&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_new_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;do_sample&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;synchronize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;perf_counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;t0&lt;/span&gt;

&lt;span class="n"&gt;times&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;timed_generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;)][&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
&lt;span class="n"&gt;mean&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The full reproducible notebook is on&lt;br&gt;
&lt;a href="https://github.com/Natnael-Alemseged/week12-lora-inference-latency/blob/main/instruction.md" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Condition&lt;/th&gt;
&lt;th&gt;Mean latency (s)&lt;/th&gt;
&lt;th&gt;Std dev (s)&lt;/th&gt;
&lt;th&gt;Runs&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Base&lt;/td&gt;
&lt;td&gt;0.027&lt;/td&gt;
&lt;td&gt;0.001&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unmerged LoRA&lt;/td&gt;
&lt;td&gt;0.058&lt;/td&gt;
&lt;td&gt;0.005&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Merged LoRA&lt;/td&gt;
&lt;td&gt;0.026&lt;/td&gt;
&lt;td&gt;0.001&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The pattern matches the prediction:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Merged ≈ base&lt;/strong&gt; (26.5 ms vs 27.1 ms). The 0.6 ms gap is
smaller than the 1 ms standard deviation of either condition. After merging,
the forward pass is identical in structure to the base model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unmerged is 2.15× slower than base&lt;/strong&gt; (58.3 ms vs 27.1 ms). The
extra low-rank matrix multiplications &lt;code&gt;BAx&lt;/code&gt; run on every forward pass,
and at the small batch sizes used in decode they add real cost.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This also recontextualises the original 14,228 ms vs 14,045 ms observation&lt;br&gt;
from the sales classifier benchmark. Those were end-to-end task timings —&lt;br&gt;
prompt processing, tool calls, multi-step generation — not isolated&lt;br&gt;
generation latency. The 183 ms difference was likely noise or tool-call&lt;br&gt;
variance, not evidence that merging adds cost.&lt;/p&gt;

&lt;p&gt;The full benchmark code is available on&lt;br&gt;
&lt;a href="https://github.com/Natnael-Alemseged/week12-lora-inference-latency/blob/main/instruction.md" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;br&gt;
and can be rerun directly in Colab.&lt;/p&gt;

&lt;h2&gt;
  
  
  A simple analogy
&lt;/h2&gt;

&lt;p&gt;Imagine two books with the same number of pages, same paper size, and&lt;br&gt;
same binding weight, but different text printed inside. If your job is&lt;br&gt;
to carry one book from one room to another, the time is determined&lt;br&gt;
mostly by the size and weight of the book, not by which words are on the&lt;br&gt;
pages.&lt;/p&gt;

&lt;p&gt;Merged LoRA is similar. You are still carrying a model of essentially&lt;br&gt;
the same size through the same inference pipeline. The content of the&lt;br&gt;
weights changed, but the "shape of the object" the GPU has to move&lt;br&gt;
through memory did not.&lt;/p&gt;

&lt;h2&gt;
  
  
  What would actually make the model faster
&lt;/h2&gt;

&lt;p&gt;If merged LoRA is not the latency lever, what is?&lt;/p&gt;

&lt;p&gt;The biggest levers are the ones that change memory traffic, parallelism,&lt;br&gt;
or the number of decode steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Quantization.&lt;/strong&gt; Lower-precision weights reduce how many bytes must
be moved per token.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batching.&lt;/strong&gt; More concurrent tokens/sequences can increase hardware
utilization and improve throughput.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speculative decoding.&lt;/strong&gt; A draft model can reduce how often the full
model must do slow one-token-at-a-time work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A smaller model or different architecture.&lt;/strong&gt; Fewer or smaller
weight tensors mean less work and less data movement.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These are meaningful speed levers because they attack the actual&lt;br&gt;
bottleneck. Merged LoRA does not.&lt;/p&gt;

&lt;h2&gt;
  
  
  The corrected claim
&lt;/h2&gt;

&lt;p&gt;The right way to state this in an evaluation report or cost memo is not&lt;br&gt;
"merged LoRA is mathematically free." The right claim is narrower and&lt;br&gt;
more accurate:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Merged LoRA does not materially change the inference graph or memory&lt;br&gt;
footprint per token. &lt;code&gt;merge_and_unload()&lt;/code&gt; folds the low-rank update into&lt;br&gt;
the base weights ahead of inference, so generation runs the same model&lt;br&gt;
structure with the same tensor shapes. A controlled three-way benchmark&lt;br&gt;
(base vs. unmerged vs. merged, 10 runs each on a T4) confirms this:&lt;br&gt;
merged and base land within noise of each other (26 ms vs. 27 ms),&lt;br&gt;
while unmerged is 2.15× slower (58 ms) due to the extra low-rank path&lt;br&gt;
still running at inference time.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Hu et al. (2021). &lt;em&gt;LoRA: Low-Rank Adaptation of Large Language Models.&lt;/em&gt;
&lt;a href="https://arxiv.org/abs/2106.09685" rel="noopener noreferrer"&gt;arXiv:2106.09685&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;HuggingFace PEFT documentation — Conceptual guide to LoRA, including
the &lt;code&gt;merge_and_unload()&lt;/code&gt; API.
&lt;a href="https://huggingface.co/docs/peft/conceptual_guides/lora" rel="noopener noreferrer"&gt;huggingface.co/docs/peft/conceptual_guides/lora&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Benchmark code: PEFT 0.14.0 + Transformers 4.51.3, Colab T4.
&lt;a href="https://github.com/Natnael-Alemseged/week12-lora-inference-latency/blob/main/instruction.md" rel="noopener noreferrer"&gt;github.com/Natnael-Alemseged/week12-lora-inference-latency&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>llm</category>
      <category>benchmarks</category>
      <category>ai</category>
    </item>
    <item>
      <title>When Generic Benchmarks Fail: Building a Sales-Domain Evaluation Bench from Scratch</title>
      <dc:creator>Natnael Alemseged</dc:creator>
      <pubDate>Sat, 02 May 2026 18:16:47 +0000</pubDate>
      <link>https://dev.to/natnael_alemseged/when-generic-benchmarks-fail-building-a-sales-domain-evaluation-bench-from-scratch-1kjf</link>
      <guid>https://dev.to/natnael_alemseged/when-generic-benchmarks-fail-building-a-sales-domain-evaluation-bench-from-scratch-1kjf</guid>
      <description>&lt;p&gt;&lt;em&gt;By Natnael Alemseged&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The gap that τ²-Bench retail cannot measure
&lt;/h2&gt;

&lt;p&gt;Tenacious is a B2B sales automation company. Its agent produces outreach emails for clients — personalized to the prospect's company, calibrated to the signal confidence of the underlying data, and constrained by the actual bench capacity available to fulfill any commitment made in the email. The executive team's question going into Week 11 was simple: how do we know this works for our business, our voice, our segments, our bench? The honest answer was: we don't. Not because the agent was untested, but because the tests we had were the wrong tests.&lt;/p&gt;

&lt;p&gt;τ²-Bench retail measures whether a sales agent can navigate a generic retail conversation. Tenacious needs an agent that checks bench capacity against a real JSON summary, routes prospects to the right ICP segment based on layoff and funding signals, and phrases outreach to match the confidence tier of the underlying data. These are not things any public benchmark grades.&lt;/p&gt;

&lt;p&gt;The audit I ran on Day 1 listed eight probe IDs from the Week 10 failure library that τ²-Bench retail would have passed: P-009 through P-012 (bench overcommitment, 100% trigger rate), P-001 and P-004 (ICP misrouting, 54%), P-005 and P-019 (assertive phrasing under weak signal). A retail benchmark scores those outputs as acceptable because they are fluent. They are not acceptable for Tenacious because they make promises the company cannot keep.&lt;/p&gt;




&lt;h2&gt;
  
  
  How I found the gap: the audit method
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;(Week 10 and Week 11 refer to two consecutive project sprints: Week 10 built the Tenacious sales agent; Week 11 built the evaluator, benchmark, and critic on top of it.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The Week 10 evidence was more useful than I expected. The failure taxonomy shows that &lt;code&gt;bench_overcommitment&lt;/code&gt; triggered on every bench-feasibility probe in that roll-up (&lt;strong&gt;40/40&lt;/strong&gt;; see &lt;code&gt;week_10_data/failure_taxonomy.md&lt;/code&gt;). This is not a distribution problem — it is a systematic absence of a check. The agent's generator never consulted &lt;code&gt;bench_summary&lt;/code&gt; before committing capacity.&lt;/p&gt;

&lt;p&gt;The same pattern held for ICP routing: &lt;strong&gt;20 of 37&lt;/strong&gt; probes in the ICP-misclassification roll-up (&lt;strong&gt;54%&lt;/strong&gt;; same source). In both cases, the structured context fields (&lt;code&gt;bench_summary&lt;/code&gt;, &lt;code&gt;signal_confidence_tier&lt;/code&gt;, &lt;code&gt;icp_segment&lt;/code&gt;) were available in the input. The generator simply did not use them.&lt;/p&gt;

&lt;p&gt;This pointed immediately to Path B rather than Path A. The outputs were fluent — no generation quality problem. What was missing was a rejection layer that checks structured context against the draft before it is sent.&lt;/p&gt;

&lt;p&gt;Concretely, five probe traces drove the decision:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Probe ID&lt;/th&gt;
&lt;th&gt;Trace ref&lt;/th&gt;
&lt;th&gt;Failure&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;P-009&lt;/td&gt;
&lt;td&gt;&lt;code&gt;probe-4087895185a9&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Go overcommitment: bench=3, committed=10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P-010&lt;/td&gt;
&lt;td&gt;&lt;code&gt;probe-d5299b421fc8&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;NestJS capacity committed but fully deployed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P-001&lt;/td&gt;
&lt;td&gt;&lt;code&gt;probe-8dc44eb36d33&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Layoff+funding → Segment 1 instead of Segment 2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P-004&lt;/td&gt;
&lt;td&gt;&lt;code&gt;probe-19f0af95e3e2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Zero open roles, still Segment 1 pitch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P-005&lt;/td&gt;
&lt;td&gt;&lt;code&gt;probe-b3388b3c3582&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Assertive opener under medium-confidence signal&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All five share the same pattern: a structured field in the task input encodes the ground truth, and the agent ignored it. A generation-quality fix does not address this. A critic that has bench state and segment rules in its context can.&lt;/p&gt;




&lt;h2&gt;
  
  
  Building the benchmark: how dataset construction actually works at small data
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The four authoring modes
&lt;/h3&gt;

&lt;p&gt;Tenacious-Bench v0.2 uses four authoring modes, each with different cost and quality tradeoffs:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trace-derived&lt;/strong&gt; tasks come directly from the Week 10 failure library. The task input is reconstructed from a real probe, and the ground truth is the corrected output from the post-hoc audit. These are the highest-signal tasks because they encode actual failures the agent produced in a real evaluation. The risk is sparse coverage: the probe library covers only the failure modes that were already identified.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Programmatic&lt;/strong&gt; tasks expand the trace-derived set by templatizing the inputs — varying company name, capacity numbers, signal tier, and ICP segment systematically. Coverage is higher but signal lines are often synthetic stubs (&lt;code&gt;Ref=tbv02-0021 Arbor Systems hiring-signal.&lt;/code&gt;) rather than grounded specifics. That creates calibration noise in the evaluator's &lt;code&gt;signal_grounding_check&lt;/code&gt;, documented below.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-LLM synthesis&lt;/strong&gt; routes task generation to a cheap model tier (Qwen via OpenRouter) and judgment to a different family (Claude/OpenAI) — following the preference-leakage prevention protocol from Li et al. (2025). The generator produces the rejected outputs for preference pairs; the judge verifies them. Using the same model for both would inflate apparent pair quality without improving actual learning signal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hand-authored&lt;/strong&gt; tasks cover the long tail of failure modes that neither trace-derived nor programmatic expansion reaches — dual-control coordination failures and edge cases in booking-stage handling.&lt;/p&gt;

&lt;h3&gt;
  
  
  Judge-filter calibration (task inclusion)
&lt;/h3&gt;

&lt;p&gt;Every generated task is supposed to pass an LLM-as-judge gate before it enters the benchmark: pointwise scores on &lt;strong&gt;input coherence&lt;/strong&gt;, &lt;strong&gt;ground-truth verifiability&lt;/strong&gt;, and &lt;strong&gt;rubric-application clarity&lt;/strong&gt; (1–5 each), with documented minimums (&lt;code&gt;generation_scripts/audit_logs/authoring_manifest_*.json&lt;/code&gt;: require &lt;strong&gt;≥3&lt;/strong&gt; on each dimension, reject on malformed JSON). &lt;strong&gt;Generator and judge model families are rotated&lt;/strong&gt; so the same family never both authors and scores the same pool — again following Li et al. (2025). Pairwise tiebreaks handle near-duplicate synthesis paths (Jaccard overlap on subject+body, threshold 0.8). The published authoring manifest for the 240-task build records whether live OpenRouter calls were enabled; when the key is absent, the pipeline falls back to a &lt;strong&gt;stub judge&lt;/strong&gt; that only enforces the dimension floor — useful for reproducible CI, but &lt;strong&gt;not&lt;/strong&gt; a substitute for calibrating a frontier judge on a 50-task spot sample. Inter-rater agreement on 30 hand-labeled tasks (24-hour relabel) is what kept the &lt;em&gt;downstream&lt;/em&gt; deterministic rubric honest.&lt;/p&gt;
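&lt;p&gt;The near-duplicate check can be sketched as a token-set Jaccard comparison. This is an illustration under stated assumptions, not the pipeline's code: the tokenization, the &lt;code&gt;subject&lt;/code&gt; and &lt;code&gt;body&lt;/code&gt; field names, the use of &lt;code&gt;&amp;gt;=&lt;/code&gt; at the threshold, and the sample tasks are all hypothetical. Only the 0.8 threshold comes from the text above.&lt;/p&gt;

```python
import re


def token_set(subject: str, body: str) -> set[str]:
    """Lowercased word tokens from subject plus body (tokenization is an assumption)."""
    return set(re.findall(r"[a-z0-9$]+", f"{subject} {body}".lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection over size of the union; 0.0 for two empty sets."""
    union = a.union(b)
    return len(a.intersection(b)) / len(union) if union else 0.0


def is_near_duplicate(task_a: dict, task_b: dict, threshold: float = 0.8) -> bool:
    a = token_set(task_a["subject"], task_a["body"])
    b = token_set(task_b["subject"], task_b["body"])
    return jaccard(a, b) >= threshold


t1 = {"subject": "Python hiring after your Series A",
      "body": "You closed a $14M Series A in February."}
t2 = {"subject": "Python hiring after your Series A",
      "body": "You closed a $14M Series A in March."}
t3 = {"subject": "Bench capacity for Go engineers",
      "body": "We have three Go engineers available next month."}

print(is_near_duplicate(t1, t2), is_near_duplicate(t1, t3))
```

&lt;p&gt;Pairs flagged this way would then go to the pairwise tiebreak rather than both entering the benchmark.&lt;/p&gt;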

&lt;h3&gt;
  
  
  The routing decision I would make differently
&lt;/h3&gt;

&lt;p&gt;Stub signal lines from cheap synthesis are not interchangeable with realistic briefs. A real signal line reads: "You closed a $14M Series A in February and your Python roles increased from 2 to 7 in 60 days." A stub reads: "Ref=tbv02-0021 Arbor Systems hiring-signal." The evaluator's &lt;code&gt;signal_grounding_check&lt;/code&gt; grades whether the body references tokens from the signal line; stubs have no meaningful tokens to match.&lt;/p&gt;
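&lt;p&gt;To make the failure concrete, here is a hypothetical token-overlap check in the spirit of &lt;code&gt;signal_grounding_check&lt;/code&gt;. The tokenizer and the ratio are assumptions for illustration; the evaluator's actual logic may differ.&lt;/p&gt;

```python
import re


def signal_tokens(text: str) -> set[str]:
    """Assumed tokenizer: keep amounts, numbers, and words of four or more letters."""
    return set(re.findall(r"\$?\d[\d.,]*[a-z]*|[a-z]{4,}", text.lower()))


def grounding_ratio(signal_line: str, body: str) -> float:
    """Fraction of signal-line tokens that reappear in the email body."""
    tokens = signal_tokens(signal_line)
    if not tokens:
        return 0.0
    return len(tokens.intersection(signal_tokens(body))) / len(tokens)


real_signal = (
    "You closed a $14M Series A in February and your Python roles "
    "increased from 2 to 7 in 60 days."
)
stub_signal = "Ref=tbv02-0021 Arbor Systems hiring-signal"
# Hypothetical email body written against the real signal line.
body = (
    "Congrats on the $14M Series A in February. Growing Python roles "
    "from 2 to 7 in 60 days is a big hiring push."
)

real_ratio = grounding_ratio(real_signal, body)
stub_ratio = grounding_ratio(stub_signal, body)
```

&lt;p&gt;In this sketch the same body grounds well against the real line and poorly against the stub, because the stub carries a reference ID rather than facts a reply could cite.&lt;/p&gt;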

&lt;p&gt;The fix for the next revision is to author plausible, specific signals (amount, date, role count) at template expansion time. This follows Liu et al. (COLM 2024), Section 3: synthetic quality depends on &lt;strong&gt;specificity of the seed&lt;/strong&gt;, not volume alone.&lt;/p&gt;

&lt;h3&gt;
  
  
  Contamination and inter-rater agreement
&lt;/h3&gt;

&lt;p&gt;The three-check protocol (8-gram overlap on inputs, embedding cosine &lt;strong&gt;&amp;lt; 0.85&lt;/strong&gt;, time-shift verification) targets &lt;strong&gt;input-level&lt;/strong&gt; train vs held-out overlap, not output memorization. For the preference-pair training slice, &lt;code&gt;training_data/contamination_preference_pairs.json&lt;/code&gt; records &lt;strong&gt;91&lt;/strong&gt; pairs checked and &lt;strong&gt;0&lt;/strong&gt; violations.&lt;/p&gt;
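&lt;p&gt;Two of the three checks can be sketched in a few lines. This is an assumption-laden illustration: tokenization is plain whitespace splitting, embeddings are supplied by the caller as lists of floats, a cosine of exactly 0.85 is treated as a violation, and the time-shift check is omitted because the post does not describe its mechanics.&lt;/p&gt;

```python
import math


def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """All word n-grams of a text (empty when the text has fewer than n words)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shares_ngram(train_input: str, heldout_input: str, n: int = 8) -> bool:
    """True if the two inputs have at least one n-gram in common."""
    return not ngrams(train_input, n).isdisjoint(ngrams(heldout_input, n))


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def is_contaminated(
    train_input: str,
    heldout_input: str,
    train_vec: list[float],
    heldout_vec: list[float],
    max_cosine: float = 0.85,
) -> bool:
    """Flag a train/held-out pair on 8-gram overlap or high embedding similarity."""
    if shares_ngram(train_input, heldout_input):
        return True
    return cosine(train_vec, heldout_vec) >= max_cosine
```

&lt;p&gt;Both checks look only at task inputs, which matches the stated scope: they say nothing about whether a model has memorized outputs.&lt;/p&gt;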

&lt;p&gt;The compliant 24-hour inter-rater pass (30 tasks, 64 check-level comparisons) yielded &lt;strong&gt;0.91&lt;/strong&gt; overall agreement; every dimension cleared &lt;strong&gt;0.80&lt;/strong&gt; after rubric revision (&lt;code&gt;inter_rater_agreement.md&lt;/code&gt;). The weak point was &lt;code&gt;format_check&lt;/code&gt; (&lt;strong&gt;0.87&lt;/strong&gt;): humans penalized filler openers and hollow superlatives while the machine initially used length only. Adding &lt;code&gt;filler_opener&lt;/code&gt; and &lt;code&gt;unsupported_superlative&lt;/code&gt; regexes to &lt;code&gt;scoring_evaluator.py&lt;/code&gt; closed the gap.&lt;/p&gt;
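&lt;p&gt;For a sense of what such regexes might look like, here is a sketch. The phrase lists are hypothetical examples chosen for illustration; the patterns actually added to &lt;code&gt;scoring_evaluator.py&lt;/code&gt; are not shown in this post and may differ.&lt;/p&gt;

```python
import re

# Hypothetical patterns; the real ones in scoring_evaluator.py may differ.
FILLER_OPENER = re.compile(
    r"^\s*(i hope this (email|message) finds you well"
    r"|just (checking in|following up)"
    r"|i wanted to reach out)",
    re.IGNORECASE,
)
UNSUPPORTED_SUPERLATIVE = re.compile(
    r"\b(world[- ]class|best[- ]in[- ]class|cutting[- ]edge"
    r"|unparalleled|revolutionary)\b",
    re.IGNORECASE,
)


def format_flags(body: str) -> list[str]:
    """Return the names of the format anti-patterns found in an email body."""
    flags = []
    if FILLER_OPENER.search(body):
        flags.append("filler_opener")
    if UNSUPPORTED_SUPERLATIVE.search(body):
        flags.append("unsupported_superlative")
    return flags
```

&lt;p&gt;A length-only check passes a short email that opens with filler; pattern checks like these capture the same thing the human raters were penalizing.&lt;/p&gt;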




&lt;h2&gt;
  
  
  The training experiment
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Path B: SimPO on a text-only Qwen 2.5 0.5B fallback
&lt;/h3&gt;

&lt;p&gt;The project target backbone is Qwen3.5-0.8B. The current Qwen3.5-0.8B HF/Unsloth release is vision-language; TRL CPO routes text prompts through the image processor and breaks on text-only preference pairs. The training notebook uses &lt;code&gt;unsloth/Qwen2.5-0.5B-Instruct&lt;/code&gt; as an operational text-only fallback — an engineering constraint worth stating in public.&lt;/p&gt;

&lt;p&gt;SimPO is a better fit than DPO on a free Colab T4 (16 GB): DPO needs a frozen reference model in memory, whereas SimPO is reference-free and fits a workable batch size. SimPO is also a better fit than ORPO here because the data are &lt;strong&gt;preference pairs only&lt;/strong&gt;, with no separate SFT corpus. ORPO's SFT term would drag a 0.5B policy toward Tenacious email prose at the expense of general instruction following; SimPO has no SFT term.&lt;/p&gt;
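&lt;p&gt;The reference-free property is visible in the SimPO objective itself: the loss depends only on the policy's length-normalized log-probabilities of the chosen and rejected responses, plus a target margin. The sketch below computes the per-pair loss from precomputed log-probability sums. The &lt;code&gt;beta&lt;/code&gt; and &lt;code&gt;gamma&lt;/code&gt; defaults are illustrative placeholders, not the values used in this run.&lt;/p&gt;

```python
import math


def simpo_loss(
    chosen_logp_sum: float,
    chosen_len: int,
    rejected_logp_sum: float,
    rejected_len: int,
    beta: float = 2.0,   # illustrative value, not the run's setting
    gamma: float = 1.0,  # illustrative target margin
) -> float:
    """Per-pair SimPO loss: -log sigmoid(beta * (avg_chosen - avg_rejected) - gamma).

    No reference-model term appears anywhere, unlike DPO.
    """
    avg_chosen = chosen_logp_sum / chosen_len        # length-normalized reward
    avg_rejected = rejected_logp_sum / rejected_len
    margin = beta * (avg_chosen - avg_rejected) - gamma
    # -log(sigmoid(m)) equals softplus(-m); computed in a numerically stable form.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

&lt;p&gt;DPO's loss, by contrast, needs the same log-probabilities from a frozen reference model, which is the extra memory cost on a 16 GB card.&lt;/p&gt;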

&lt;p&gt;Preference pairs use each task's &lt;code&gt;ground_truth_output&lt;/code&gt; as &lt;strong&gt;chosen&lt;/strong&gt; and an LLM-generated violation as &lt;strong&gt;rejected&lt;/strong&gt;, validated with &lt;code&gt;scoring_evaluator.py&lt;/code&gt; and logged in &lt;code&gt;training_data/preference_pairs_audit.jsonl&lt;/code&gt;. The rejection generator (Qwen on OpenRouter) and any frontier judge are &lt;strong&gt;different families&lt;/strong&gt; — preference-leakage hygiene per Li et al. (2025).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Training slice:&lt;/strong&gt; &lt;strong&gt;91&lt;/strong&gt; rows in &lt;code&gt;training_data/preference_pairs.jsonl&lt;/code&gt;, &lt;strong&gt;6&lt;/strong&gt; failure categories, &lt;strong&gt;0&lt;/strong&gt; contamination flags in &lt;code&gt;training_data/contamination_preference_pairs.json&lt;/code&gt;. Colab T4: &lt;strong&gt;3&lt;/strong&gt; epochs, &lt;strong&gt;81&lt;/strong&gt; train / &lt;strong&gt;10&lt;/strong&gt; eval pairs, &lt;strong&gt;~129 s&lt;/strong&gt; wall time, fp16 LoRA r=16 / α=32, final train loss &lt;strong&gt;4.878&lt;/strong&gt;. Eval margin sanity check: &lt;strong&gt;10/10&lt;/strong&gt; on the training split. Headline lift is decided on &lt;strong&gt;held-out&lt;/strong&gt; tasks only (&lt;code&gt;ablations/ablation_results.json&lt;/code&gt;, &lt;code&gt;ablations/significance_test.txt&lt;/code&gt;).&lt;/p&gt;




&lt;h2&gt;
  
  
  The honest result
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Delta A: trained LoRA vs deterministic baseline on held-out (same metric)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Definition (paired with &lt;code&gt;ablations/paired_bootstrap_delta_a.py&lt;/code&gt;):&lt;/strong&gt; for each of &lt;strong&gt;47&lt;/strong&gt; held-out tasks, the baseline &lt;strong&gt;succeeds&lt;/strong&gt; if the deterministic &lt;code&gt;scoring_evaluator.py&lt;/code&gt; scores &lt;strong&gt;prefer&lt;/strong&gt; &lt;code&gt;ground_truth_output&lt;/code&gt; over &lt;code&gt;candidate_output&lt;/code&gt;, or the two bodies are identical. The trained judge &lt;strong&gt;succeeds&lt;/strong&gt; if the LoRA's preference margin agrees with that same ordering (or tie). This is &lt;strong&gt;one&lt;/strong&gt; metric end to end — not a mix of all-checks-pass for the baseline and preference accuracy for the model.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Condition&lt;/th&gt;
&lt;th&gt;Preference-aligned rate&lt;/th&gt;
&lt;th&gt;n&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Deterministic baseline&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;14.9%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7/47&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Trained LoRA&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;91.5%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;43/47&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Delta A&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+76.6 pp&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;95% bootstrap CI (50 000 resamples, seed 42)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;[+63.8 pp, +87.2 pp]&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;One-sided paired bootstrap &lt;em&gt;p&lt;/em&gt;
&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&amp;lt; 0.0001&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
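&lt;p&gt;A paired bootstrap of this kind can be sketched as below. It is not a copy of &lt;code&gt;ablations/paired_bootstrap_delta_a.py&lt;/code&gt;: the percentile interval, the one-sided &lt;em&gt;p&lt;/em&gt; definition (share of resampled deltas at or below zero), and the example vectors are assumptions. The per-task outcomes are not listed in this post, so the vectors only reproduce the 7/47 and 43/47 marginals with an arbitrary overlap.&lt;/p&gt;

```python
import random


def paired_bootstrap(baseline, trained, n_resamples=50_000, seed=42):
    """Resample tasks with replacement, keeping each task's two scores together.

    Returns (observed_delta, (ci_low, ci_high), one_sided_p).
    """
    if len(baseline) != len(trained):
        raise ValueError("score vectors must cover the same tasks")
    rng = random.Random(seed)
    n = len(baseline)
    # Pairing: work on per-task differences so one draw moves both conditions.
    diffs = [t - b for b, t in zip(baseline, trained)]
    observed = sum(diffs) / n
    deltas = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    ci_low = deltas[int(0.025 * n_resamples)]
    ci_high = deltas[int(0.975 * n_resamples) - 1]
    # Assumed p definition: fraction of resampled deltas that are not positive.
    p_one_sided = sum(1 for d in deltas if 0 >= d) / n_resamples
    return observed, (ci_low, ci_high), p_one_sided


# Hypothetical per-task outcomes matching only the reported marginals.
baseline_scores = [1] * 7 + [0] * 40   # 7/47 successes
trained_scores = [1] * 43 + [0] * 4    # 43/47 successes

delta, ci, p_value = paired_bootstrap(
    baseline_scores, trained_scores, n_resamples=2_000
)
```

&lt;p&gt;Because the overlap between the two success sets is invented here, the interval from this sketch should not be expected to match the table exactly; only the point estimate is fixed by the marginals.&lt;/p&gt;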

&lt;p&gt;&lt;strong&gt;Descriptive sidebar:&lt;/strong&gt; the Week 10 &lt;strong&gt;candidate&lt;/strong&gt; bodies pass all deterministic checks on &lt;strong&gt;11/47&lt;/strong&gt; tasks (&lt;strong&gt;23.4%&lt;/strong&gt;) — a useful raw quality readout, but &lt;strong&gt;not&lt;/strong&gt; the Delta A numerator. The baseline hits &lt;strong&gt;7/47&lt;/strong&gt; because the evaluator often prefers the reference even when the candidate fails some checks.&lt;/p&gt;

&lt;p&gt;By category, the trained judge reaches 100% on bench_overcommitment, dual_control_coordination, gap_overclaiming, signal_overclaiming, and tone_drift; &lt;strong&gt;icp_misclassification&lt;/strong&gt; stays &lt;strong&gt;2/6 (33.3%)&lt;/strong&gt; — the weakest training slice (six pairs) and an open problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  Delta B: trained LoRA vs prompt-only same backbone
&lt;/h3&gt;

&lt;p&gt;Same held-out preference-margin procedure: base &lt;code&gt;Qwen2.5-0.5B-Instruct&lt;/code&gt; without LoRA scores &lt;strong&gt;48.9%&lt;/strong&gt; (23/47); the trained adapter scores &lt;strong&gt;91.5%&lt;/strong&gt; (43/47) — &lt;strong&gt;+42.6 pp&lt;/strong&gt;, 95% CI &lt;strong&gt;[+29.8 pp, +57.4 pp]&lt;/strong&gt;, &lt;em&gt;p&lt;/em&gt; &amp;lt; 0.0001. Prompt-only already clears dual_control_coordination and signal_overclaiming on this slice; the adapter's lift concentrates in gap_overclaiming and tone_drift, with modest ICP gains (0/6 → 2/6).&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost–latency Pareto
&lt;/h3&gt;

&lt;p&gt;Training used &lt;strong&gt;$0&lt;/strong&gt; billed GPU on Colab T4 (&lt;code&gt;cost_pareto.colab_cost_usd&lt;/code&gt; in &lt;code&gt;ablations/ablation_results.json&lt;/code&gt;; ~&lt;strong&gt;2.16&lt;/strong&gt; minutes wall time). &lt;strong&gt;Inference&lt;/strong&gt; on the held-out preference pass: median &lt;strong&gt;~369 ms&lt;/strong&gt; per task with the LoRA judge vs &lt;strong&gt;~96 ms&lt;/strong&gt; for the prompt-only backbone — higher latency for a stronger rejection layer. Dataset authoring included &lt;strong&gt;live&lt;/strong&gt; OpenRouter calls for preference-pair generation (&lt;code&gt;training_data/preference_pairs_audit.jsonl&lt;/code&gt;, &lt;code&gt;mode: "live"&lt;/code&gt;); API spend is logged in &lt;code&gt;cost_log.csv&lt;/code&gt; — &lt;strong&gt;~$0.02&lt;/strong&gt; for 112 qwen/qwen3-8b calls (67K input + 43K output tokens at $0.10/M).&lt;/p&gt;

&lt;h3&gt;
  
  
  What did not work
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;ICP routing&lt;/strong&gt; remains the failure mode with the fewest pairs and the worst held-out accuracy. &lt;strong&gt;Stub signal lines&lt;/strong&gt; make &lt;code&gt;signal_grounding_check&lt;/code&gt; look worse than real-brief behavior would. &lt;strong&gt;Delta B&lt;/strong&gt; is uneven: training helps most where the prompt-only model was blind, not everywhere.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is next
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Thread-level coherence&lt;/strong&gt; — grade replies against prior turns, not isolated drafts.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing scope&lt;/strong&gt; — enforce &lt;code&gt;pricing_sheet.md&lt;/code&gt; bands on quoted TCV.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LinkedIn-roast heuristic&lt;/strong&gt; — style-guide anti-pattern as an LLM-judge dimension.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-signal calibration&lt;/strong&gt; — score against the &lt;strong&gt;weakest&lt;/strong&gt; signal in a brief, not a single scalar tier.&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;Dataset: &lt;a href="https://huggingface.co/datasets/Natnaela/tenacious-bench" rel="noopener noreferrer"&gt;https://huggingface.co/datasets/Natnaela/tenacious-bench&lt;/a&gt;&lt;/em&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;Code: &lt;a href="https://github.com/Natnael-Alemseged/SalesConversion-Bench" rel="noopener noreferrer"&gt;https://github.com/Natnael-Alemseged/SalesConversion-Bench&lt;/a&gt;&lt;/em&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;Community: &lt;a href="https://github.com/sierra-research/tau2-bench/issues/293" rel="noopener noreferrer"&gt;τ²-Bench issue #293 — structured-context evaluation gaps&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>llm</category>
      <category>benchmarks</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
