<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nati A</title>
    <description>The latest articles on DEV Community by Nati A (@natnael_alemseged).</description>
    <link>https://dev.to/natnael_alemseged</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2775274%2Fc77efb59-e9ad-4440-a4a8-d841ea0392c7.png</url>
      <title>DEV Community: Nati A</title>
      <link>https://dev.to/natnael_alemseged</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/natnael_alemseged"/>
    <language>en</language>
    <item>
      <title>When Generic Benchmarks Fail: Building a Sales-Domain Evaluation Bench from Scratch</title>
      <dc:creator>Nati A</dc:creator>
      <pubDate>Sat, 02 May 2026 18:16:47 +0000</pubDate>
      <link>https://dev.to/natnael_alemseged/when-generic-benchmarks-fail-building-a-sales-domain-evaluation-bench-from-scratch-1kjf</link>
      <guid>https://dev.to/natnael_alemseged/when-generic-benchmarks-fail-building-a-sales-domain-evaluation-bench-from-scratch-1kjf</guid>
      <description>&lt;p&gt;&lt;em&gt;By Natnael Alemseged&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The gap that τ²-Bench retail cannot measure
&lt;/h2&gt;

&lt;p&gt;Tenacious is a B2B sales automation company. Its agent produces outreach emails for clients — personalized to the prospect's company, calibrated to the signal confidence of the underlying data, and constrained by the actual bench capacity available to fulfill any commitment made in the email. The executive team's question going into Week 11 was simple: how do we know this works for our business, our voice, our segments, our bench? The honest answer was: we don't. Not because the agent was untested, but because the tests we had were the wrong tests.&lt;/p&gt;

&lt;p&gt;τ²-Bench retail measures whether a sales agent can navigate a generic retail conversation. Tenacious needs an agent that checks bench capacity against a real JSON summary, routes prospects to the right ICP segment based on layoff and funding signals, and phrases outreach to match the confidence tier of the underlying data. These are not things any public benchmark grades.&lt;/p&gt;

&lt;p&gt;The audit I ran on Day 1 listed eight probe IDs from the Week 10 failure library that τ²-Bench retail would have passed: P-009 through P-012 (bench overcommitment, 100% trigger rate), P-001 and P-004 (ICP misrouting, 54%), P-005 and P-019 (assertive phrasing under weak signal). A retail benchmark scores those outputs as acceptable because they are fluent. They are not acceptable for Tenacious because they make promises the company cannot keep.&lt;/p&gt;




&lt;h2&gt;
  
  
  How I found the gap: the audit method
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;(Week 10 and Week 11 refer to two consecutive project sprints: Week 10 built the Tenacious sales agent; Week 11 built the evaluator, benchmark, and critic on top of it.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The Week 10 evidence was more useful than I expected. The failure taxonomy shows that &lt;code&gt;bench_overcommitment&lt;/code&gt; triggered on every bench-feasibility probe in that roll-up (&lt;strong&gt;40/40&lt;/strong&gt;; see &lt;code&gt;week_10_data/failure_taxonomy.md&lt;/code&gt;). This is not a distribution problem — it is a systematic absence of a check. The agent's generator never consulted &lt;code&gt;bench_summary&lt;/code&gt; before committing capacity.&lt;/p&gt;

&lt;p&gt;The same pattern held for ICP routing: &lt;strong&gt;20 of 37&lt;/strong&gt; probes in the ICP-misclassification roll-up (&lt;strong&gt;54%&lt;/strong&gt;; same source). In both cases, the structured context fields (&lt;code&gt;bench_summary&lt;/code&gt;, &lt;code&gt;signal_confidence_tier&lt;/code&gt;, &lt;code&gt;icp_segment&lt;/code&gt;) were available in the input. The generator simply did not use them.&lt;/p&gt;

&lt;p&gt;This pointed immediately to Path B (add a critic that rejects bad drafts) rather than Path A (retrain the generator). The outputs were fluent — no generation-quality problem. What was missing was a rejection layer that checks structured context against the draft before it is sent.&lt;/p&gt;

&lt;p&gt;Concretely, five probe traces drove the decision:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Probe ID&lt;/th&gt;
&lt;th&gt;Trace ref&lt;/th&gt;
&lt;th&gt;Failure&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;P-009&lt;/td&gt;
&lt;td&gt;&lt;code&gt;probe-4087895185a9&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Go overcommitment: bench=3, committed=10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P-010&lt;/td&gt;
&lt;td&gt;&lt;code&gt;probe-d5299b421fc8&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;NestJS capacity committed but fully deployed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P-001&lt;/td&gt;
&lt;td&gt;&lt;code&gt;probe-8dc44eb36d33&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Layoff+funding → Segment 1 instead of Segment 2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P-004&lt;/td&gt;
&lt;td&gt;&lt;code&gt;probe-19f0af95e3e2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Zero open roles, still Segment 1 pitch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P-005&lt;/td&gt;
&lt;td&gt;&lt;code&gt;probe-b3388b3c3582&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Assertive opener under medium-confidence signal&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All five share the same pattern: a structured field in the task input encodes the ground truth, and the agent ignored it. A generation-quality fix does not address this. A critic that has bench state and segment rules in its context can.&lt;/p&gt;
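
&lt;p&gt;A minimal sketch of what that rejection check could look like, assuming a &lt;code&gt;bench_summary&lt;/code&gt; dict keyed by stack and a committed-capacity figure already parsed from the draft. The names are illustrative, not the repo's actual API.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch of a deterministic rejection check for bench overcommitment.
# Field names (bench_summary, available) are illustrative, not the actual
# scoring_evaluator.py API.

def check_bench_commitment(draft_commitments, bench_summary):
    """Reject a draft that commits more engineers per stack than the bench holds."""
    rejections = []
    for stack, committed in draft_commitments.items():
        available = bench_summary.get(stack, {}).get("available", 0)
        if committed &amp;gt; available:
            rejections.append({
                "check": "bench_overcommitment",
                "reason": f"{stack}: committed {committed}, bench has {available}",
            })
    return rejections

# Mirrors probe P-009 (Go overcommitment: bench=3, committed=10).
print(check_bench_commitment({"go": 10}, {"go": {"available": 3}}))
&lt;/code&gt;&lt;/pre&gt;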




&lt;h2&gt;
  
  
  Building the benchmark: how dataset construction actually works at small data
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The four authoring modes
&lt;/h3&gt;

&lt;p&gt;Tenacious-Bench v0.2 uses four authoring modes, each with different cost and quality tradeoffs:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trace-derived&lt;/strong&gt; tasks come directly from the Week 10 failure library. The task input is reconstructed from a real probe, and the ground truth is the corrected output from the post-hoc audit. These are the highest-signal tasks — they encode actual failures the agent produced in a real evaluation. The risk is sparse coverage: the probe library covers only the failure modes that were already identified.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Programmatic&lt;/strong&gt; tasks expand the trace-derived set by templatizing the inputs — varying company name, capacity numbers, signal tier, and ICP segment systematically. Coverage is higher but signal lines are often synthetic stubs (&lt;code&gt;Ref=tbv02-0021 Arbor Systems hiring-signal.&lt;/code&gt;) rather than grounded specifics. That creates calibration noise in the evaluator's &lt;code&gt;signal_grounding_check&lt;/code&gt;, documented below.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-LLM synthesis&lt;/strong&gt; routes task generation to a cheap model tier (Qwen via OpenRouter) and judgment to a different family (Claude/OpenAI) — following the preference-leakage prevention protocol from Li et al. (2025). The generator produces the rejected outputs for preference pairs; the judge verifies them. Using the same model for both would inflate apparent pair quality without improving actual learning signal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hand-authored&lt;/strong&gt; tasks cover the long tail of failure modes that neither trace-derived nor programmatic expansion reaches — dual-control coordination failures and edge cases in booking-stage handling.&lt;/p&gt;
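
&lt;p&gt;To make the programmatic mode concrete, here is a hedged sketch of the template expansion: systematic variation of company, signal tier, and bench capacity, including the stub signal line that causes the calibration noise mentioned above. The field names are assumptions, not the exact &lt;code&gt;generation_scripts&lt;/code&gt; schema.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch of programmatic task expansion (field names are assumptions,
# not the exact generation_scripts schema).
import itertools
import json

COMPANIES = ["Arbor Systems", "Northwind Labs"]
TIERS = ["high", "medium", "low"]
BENCH = [0, 3, 7]

def expand(prefix="tbv02"):
    tasks = []
    combos = itertools.product(COMPANIES, TIERS, BENCH)
    for i, (company, tier, bench) in enumerate(combos):
        tasks.append({
            "task_id": f"{prefix}-{i:04d}",
            "input": {
                "company": company,
                "signal_confidence_tier": tier,
                "bench_summary": {"python": {"available": bench}},
                # The weak spot flagged above: a stub, not a grounded signal line.
                "signal_line": f"Ref={prefix}-{i:04d} {company} hiring-signal.",
            },
        })
    return tasks

print(json.dumps(expand()[0], indent=2))
&lt;/code&gt;&lt;/pre&gt;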

&lt;h3&gt;
  
  
  Judge-filter calibration (task inclusion)
&lt;/h3&gt;

&lt;p&gt;Every generated task is supposed to pass an LLM-as-judge gate before it enters the benchmark: pointwise scores on &lt;strong&gt;input coherence&lt;/strong&gt;, &lt;strong&gt;ground-truth verifiability&lt;/strong&gt;, and &lt;strong&gt;rubric-application clarity&lt;/strong&gt; (1–5 each), with documented minimums (&lt;code&gt;generation_scripts/audit_logs/authoring_manifest_*.json&lt;/code&gt;: require &lt;strong&gt;≥3&lt;/strong&gt; on each dimension, reject on malformed JSON). &lt;strong&gt;Generator and judge model families are rotated&lt;/strong&gt; so the same family never both authors and scores the same pool — again following Li et al. (2025). Pairwise tiebreaks handle near-duplicate synthesis paths (Jaccard overlap on subject+body, threshold 0.8).&lt;/p&gt;

&lt;p&gt;The published authoring manifest for the 240-task build records whether live OpenRouter calls were enabled; when the key is absent, the pipeline falls back to a &lt;strong&gt;stub judge&lt;/strong&gt; that only enforces the dimension floor — useful for reproducible CI, but &lt;strong&gt;not&lt;/strong&gt; a substitute for calibrating a frontier judge on a 50-task spot sample. Inter-rater agreement on 30 hand-labeled tasks (24-hour relabel) is what kept the &lt;em&gt;downstream&lt;/em&gt; deterministic rubric honest.&lt;/p&gt;
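
&lt;p&gt;For reference, a minimal sketch of the two mechanical parts of that gate: the dimension floor and the Jaccard near-duplicate tiebreak. The floor (3) and threshold (0.8) come from the manifest described above; the function names are illustrative.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch of the inclusion gate: dimension floor plus Jaccard near-duplicate check.
# The floor (3) and threshold (0.8) come from the post; names are illustrative.

def passes_floor(scores, floor=3):
    """Pointwise judge scores (1-5) on the three inclusion dimensions."""
    dims = ("input_coherence", "ground_truth_verifiability", "rubric_application_clarity")
    return all(scores.get(d, 0) &amp;gt;= floor for d in dims)

def jaccard(a_text, b_text):
    a = set(a_text.lower().split())
    b = set(b_text.lower().split())
    union = a.union(b)
    return len(a.intersection(b)) / len(union) if union else 0.0

def is_near_duplicate(task_a, task_b, threshold=0.8):
    key_a = task_a["subject"] + " " + task_a["body"]
    key_b = task_b["subject"] + " " + task_b["body"]
    return jaccard(key_a, key_b) &amp;gt;= threshold
&lt;/code&gt;&lt;/pre&gt;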

&lt;h3&gt;
  
  
  The routing decision I would make differently
&lt;/h3&gt;

&lt;p&gt;Stub signal lines from cheap synthesis are not interchangeable with realistic briefs. A real signal line reads: "You closed a $14M Series A in February and your Python roles increased from 2 to 7 in 60 days." A stub reads: "Ref=tbv02-0021 Arbor Systems hiring-signal." The evaluator's &lt;code&gt;signal_grounding_check&lt;/code&gt; grades whether the body references tokens from the signal line; stubs have no meaningful tokens to match.&lt;/p&gt;

&lt;p&gt;The fix for the next revision is to author plausible, specific signals (amount, date, role count) at template-expansion time, following Liu et al. (COLM 2024, Section 3): synthetic quality depends on the &lt;strong&gt;specificity of the seed&lt;/strong&gt;, not on volume alone.&lt;/p&gt;
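
&lt;p&gt;For illustration, here is roughly what a token-overlap grounding check looks like and why stubs score poorly under it. The real check lives in &lt;code&gt;scoring_evaluator.py&lt;/code&gt;; the stopword list and threshold in this sketch are assumptions.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative token-overlap check in the spirit of signal_grounding_check.
# The production version lives in scoring_evaluator.py; the stopwords and the
# min_hits threshold here are assumptions.
import re

STOPWORDS = {"the", "a", "an", "and", "in", "your", "you", "from", "to"}

def signal_grounding_check(body, signal_line, min_hits=2):
    tokens = set(re.findall(r"[A-Za-z0-9$.]+", signal_line.lower()))
    tokens = {t for t in tokens if t not in STOPWORDS}
    hits = [t for t in tokens if t in body.lower()]
    return len(hits) &amp;gt;= min_hits, hits

# A real brief has specific tokens to match; a stub line mostly does not.
ok, hits = signal_grounding_check(
    "Congrats on the $14M Series A in February; going from 2 to 7 Python roles is fast.",
    "You closed a $14M Series A in February and your Python roles increased from 2 to 7 in 60 days.",
)
print(ok, hits)
&lt;/code&gt;&lt;/pre&gt;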

&lt;h3&gt;
  
  
  Contamination and inter-rater agreement
&lt;/h3&gt;

&lt;p&gt;The three-check protocol (8-gram overlap on inputs, embedding cosine &lt;strong&gt;&amp;lt; 0.85&lt;/strong&gt;, time-shift verification) targets &lt;strong&gt;input-level&lt;/strong&gt; train vs held-out overlap, not output memorization. For the preference-pair training slice, &lt;code&gt;training_data/contamination_preference_pairs.json&lt;/code&gt; records &lt;strong&gt;91&lt;/strong&gt; pairs checked and &lt;strong&gt;0&lt;/strong&gt; violations.&lt;/p&gt;
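
&lt;p&gt;A sketch of the first of those three checks, the 8-gram input overlap (the embedding-cosine and time-shift checks are not shown); the whitespace tokenization is an assumption.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch of the 8-gram input-overlap check; the embedding-cosine and
# time-shift checks are not shown. Tokenization here is an assumption.

def ngrams(text, n=8):
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(max(len(toks) - n + 1, 0))}

def shares_8gram(train_input, heldout_input):
    return not ngrams(train_input).isdisjoint(ngrams(heldout_input))

# A flagged pair would count toward the violations recorded in
# training_data/contamination_preference_pairs.json.
&lt;/code&gt;&lt;/pre&gt;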

&lt;p&gt;The compliant 24-hour inter-rater pass (30 tasks, 64 check-level comparisons) yielded &lt;strong&gt;0.91&lt;/strong&gt; overall agreement; every dimension cleared &lt;strong&gt;0.80&lt;/strong&gt; after rubric revision (&lt;code&gt;inter_rater_agreement.md&lt;/code&gt;). The weak point was &lt;code&gt;format_check&lt;/code&gt; (&lt;strong&gt;0.87&lt;/strong&gt;): humans penalized filler openers and hollow superlatives while the machine initially used length only. Adding &lt;code&gt;filler_opener&lt;/code&gt; and &lt;code&gt;unsupported_superlative&lt;/code&gt; regexes to &lt;code&gt;scoring_evaluator.py&lt;/code&gt; closed the gap.&lt;/p&gt;
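
&lt;p&gt;The two regexes mentioned above, sketched; the exact patterns in &lt;code&gt;scoring_evaluator.py&lt;/code&gt; may differ.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch of the two checks added to format_check after the relabel pass.
# The exact patterns in scoring_evaluator.py may differ.
import re

FILLER_OPENER = re.compile(
    r"^(i hope this (email )?finds you well|just reaching out|i wanted to touch base)",
    re.IGNORECASE,
)
UNSUPPORTED_SUPERLATIVE = re.compile(
    r"\b(world-class|best-in-class|unparalleled|cutting-edge|industry-leading)\b",
    re.IGNORECASE,
)

def extra_format_flags(body):
    flags = []
    if FILLER_OPENER.search(body.strip()):
        flags.append("filler_opener")
    if UNSUPPORTED_SUPERLATIVE.search(body):
        flags.append("unsupported_superlative")
    return flags
&lt;/code&gt;&lt;/pre&gt;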




&lt;h2&gt;
  
  
  The training experiment
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Path B: SimPO on a text-only Qwen 2.5 0.5B fallback
&lt;/h3&gt;

&lt;p&gt;The project target backbone is Qwen3.5-0.8B. The current Qwen3.5-0.8B HF/Unsloth release is vision-language; TRL CPO routes text prompts through the image processor and breaks on text-only preference pairs. The training notebook uses &lt;code&gt;unsloth/Qwen2.5-0.5B-Instruct&lt;/code&gt; as an operational text-only fallback — an engineering constraint worth stating in public.&lt;/p&gt;

&lt;p&gt;SimPO beats DPO on a free Colab T4 (16 GB): DPO needs a frozen reference model in memory; SimPO is reference-free and fits a workable batch size. SimPO beats ORPO here because the data are &lt;strong&gt;preference pairs only&lt;/strong&gt; — no separate SFT corpus. ORPO's SFT term would drag a 0.5B policy toward Tenacious email prose at the expense of general instruction following; SimPO has no SFT term.&lt;/p&gt;

&lt;p&gt;Preference pairs use each task's &lt;code&gt;ground_truth_output&lt;/code&gt; as &lt;strong&gt;chosen&lt;/strong&gt; and an LLM-generated violation as &lt;strong&gt;rejected&lt;/strong&gt;, validated with &lt;code&gt;scoring_evaluator.py&lt;/code&gt; and logged in &lt;code&gt;training_data/preference_pairs_audit.jsonl&lt;/code&gt;. The rejection generator (Qwen on OpenRouter) and any frontier judge are &lt;strong&gt;different families&lt;/strong&gt; — preference-leakage hygiene per Li et al. (2025).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Training slice:&lt;/strong&gt; &lt;strong&gt;91&lt;/strong&gt; rows in &lt;code&gt;training_data/preference_pairs.jsonl&lt;/code&gt;, &lt;strong&gt;6&lt;/strong&gt; failure categories, &lt;strong&gt;0&lt;/strong&gt; contamination flags in &lt;code&gt;training_data/contamination_preference_pairs.json&lt;/code&gt;. Colab T4: &lt;strong&gt;3&lt;/strong&gt; epochs, &lt;strong&gt;81&lt;/strong&gt; train / &lt;strong&gt;10&lt;/strong&gt; eval pairs, &lt;strong&gt;~129 s&lt;/strong&gt; wall time, fp16 LoRA r=16 / α=32, final train loss &lt;strong&gt;4.878&lt;/strong&gt;. Eval margin sanity check: &lt;strong&gt;10/10&lt;/strong&gt; on the training split. Headline lift is decided on &lt;strong&gt;held-out&lt;/strong&gt; tasks only (&lt;code&gt;ablations/ablation_results.json&lt;/code&gt;, &lt;code&gt;ablations/significance_test.txt&lt;/code&gt;).&lt;/p&gt;
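
&lt;p&gt;For readers who want the shape of the training call, a hedged sketch of the SimPO setup: TRL's &lt;code&gt;CPOTrainer&lt;/code&gt; with &lt;code&gt;loss_type="simpo"&lt;/code&gt; and &lt;code&gt;cpo_alpha=0&lt;/code&gt; (reference-free), LoRA r=16 / α=32 on the text-only fallback. The epoch count and precision follow the numbers above; the other arguments are assumptions and the actual notebook may differ.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hedged sketch of the SimPO run: TRL CPOTrainer with loss_type="simpo" and
# cpo_alpha=0 (reference-free). LoRA r=16 / alpha=32 per the post; the other
# arguments are assumptions, not the notebook's exact settings.
from datasets import load_dataset
from unsloth import FastLanguageModel
from trl import CPOConfig, CPOTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-0.5B-Instruct",
    max_seq_length=2048,
    load_in_4bit=False,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Rows carry "prompt", "chosen", "rejected" fields.
pairs = load_dataset("json", data_files="training_data/preference_pairs.jsonl", split="train")

config = CPOConfig(
    loss_type="simpo",              # SimPO objective
    cpo_alpha=0.0,                  # no SFT-style term, matching the rationale above
    num_train_epochs=3,
    per_device_train_batch_size=2,  # assumption; fits a T4
    fp16=True,
    output_dir="simpo_out",
)
trainer = CPOTrainer(
    model=model,
    args=config,
    train_dataset=pairs,
    processing_class=tokenizer,     # older TRL releases use tokenizer= instead
)
trainer.train()
&lt;/code&gt;&lt;/pre&gt;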




&lt;h2&gt;
  
  
  The honest result
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Delta A: trained LoRA vs deterministic baseline on held-out (same metric)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Definition (paired with &lt;code&gt;ablations/paired_bootstrap_delta_a.py&lt;/code&gt;):&lt;/strong&gt; for each of &lt;strong&gt;47&lt;/strong&gt; held-out tasks, the baseline &lt;strong&gt;succeeds&lt;/strong&gt; if the deterministic &lt;code&gt;scoring_evaluator.py&lt;/code&gt; prefers &lt;code&gt;ground_truth_output&lt;/code&gt; over &lt;code&gt;candidate_output&lt;/code&gt; on its check scores, or if the two bodies are identical. The trained judge &lt;strong&gt;succeeds&lt;/strong&gt; if the LoRA's preference margin agrees with that same ordering (or a tie). This is &lt;strong&gt;one&lt;/strong&gt; metric end to end — not a mix of all-checks-pass for the baseline and preference accuracy for the model.&lt;/p&gt;
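
&lt;p&gt;A sketch of the paired-bootstrap procedure behind the CI and &lt;em&gt;p&lt;/em&gt;-value in the table below; &lt;code&gt;ablations/paired_bootstrap_delta_a.py&lt;/code&gt; is the source of record, and the exact resampling details there may differ.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch of the paired bootstrap over the 47 held-out tasks.
# ablations/paired_bootstrap_delta_a.py is the source of record; details here
# (e.g. the one-sided p-value convention) are assumptions.
import numpy as np

def paired_bootstrap(baseline, trained, n_resamples=50_000, seed=42):
    """baseline / trained: per-task 0/1 success arrays over the same tasks."""
    rng = np.random.default_rng(seed)
    baseline = np.asarray(baseline, dtype=float)
    trained = np.asarray(trained, dtype=float)
    n = len(baseline)
    deltas = np.empty(n_resamples)
    for b in range(n_resamples):
        idx = rng.integers(0, n, size=n)   # resample task indices with replacement
        deltas[b] = trained[idx].mean() - baseline[idx].mean()
    ci = (np.quantile(deltas, 0.025), np.quantile(deltas, 0.975))
    p_one_sided = float((deltas &amp;lt;= 0.0).mean())   # share of resamples with no improvement
    return ci, p_one_sided
&lt;/code&gt;&lt;/pre&gt;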

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Condition&lt;/th&gt;
&lt;th&gt;Preference-aligned rate&lt;/th&gt;
&lt;th&gt;n&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Deterministic baseline&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;14.9%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7/47&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Trained LoRA&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;91.5%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;43/47&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Delta A&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+76.6 pp&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;95% bootstrap CI (50 000 resamples, seed 42)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;[+63.8 pp, +87.2 pp]&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;One-sided paired bootstrap &lt;em&gt;p&lt;/em&gt;
&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&amp;lt; 0.0001&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Descriptive sidebar:&lt;/strong&gt; the Week 10 &lt;strong&gt;candidate&lt;/strong&gt; bodies pass all deterministic checks on &lt;strong&gt;11/47&lt;/strong&gt; tasks (&lt;strong&gt;23.4%&lt;/strong&gt;) — a useful raw quality readout, but &lt;strong&gt;not&lt;/strong&gt; the Delta A numerator. The baseline hits &lt;strong&gt;7/47&lt;/strong&gt; because the evaluator often prefers the reference even when the candidate fails some checks.&lt;/p&gt;

&lt;p&gt;By category, the trained judge reaches 100% on bench_overcommitment, dual_control_coordination, gap_overclaiming, signal_overclaiming, and tone_drift; &lt;strong&gt;icp_misclassification&lt;/strong&gt; stays &lt;strong&gt;2/6 (33.3%)&lt;/strong&gt; — the weakest training slice (six pairs) and an open problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  Delta B: trained LoRA vs prompt-only same backbone
&lt;/h3&gt;

&lt;p&gt;Same held-out preference-margin procedure: base &lt;code&gt;Qwen2.5-0.5B-Instruct&lt;/code&gt; without LoRA scores &lt;strong&gt;48.9%&lt;/strong&gt; (23/47); the trained adapter scores &lt;strong&gt;91.5%&lt;/strong&gt; (43/47) — &lt;strong&gt;+42.6 pp&lt;/strong&gt;, 95% CI &lt;strong&gt;[+29.8 pp, +57.4 pp]&lt;/strong&gt;, &lt;em&gt;p&lt;/em&gt; &amp;lt; 0.0001. Prompt-only already clears dual_control_coordination and signal_overclaiming on this slice; the adapter's lift concentrates in gap_overclaiming and tone_drift, with modest ICP gains (0/6 → 2/6).&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost–latency Pareto
&lt;/h3&gt;

&lt;p&gt;Training used &lt;strong&gt;$0&lt;/strong&gt; billed GPU on Colab T4 (&lt;code&gt;cost_pareto.colab_cost_usd&lt;/code&gt; in &lt;code&gt;ablations/ablation_results.json&lt;/code&gt;; ~&lt;strong&gt;2.16&lt;/strong&gt; minutes wall time). &lt;strong&gt;Inference&lt;/strong&gt; on the held-out preference pass: median &lt;strong&gt;~369 ms&lt;/strong&gt; per task with the LoRA judge vs &lt;strong&gt;~96 ms&lt;/strong&gt; for the prompt-only backbone — higher latency for a stronger rejection layer. Dataset authoring included &lt;strong&gt;live&lt;/strong&gt; OpenRouter calls for preference-pair generation (&lt;code&gt;training_data/preference_pairs_audit.jsonl&lt;/code&gt;, &lt;code&gt;mode: "live"&lt;/code&gt;); API spend is logged in &lt;code&gt;cost_log.csv&lt;/code&gt; — &lt;strong&gt;~$0.02&lt;/strong&gt; for 112 qwen/qwen3-8b calls (67K input + 43K output tokens at $0.10/M).&lt;/p&gt;

&lt;h3&gt;
  
  
  What did not work
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;ICP routing&lt;/strong&gt; remains the failure mode with the fewest pairs and the worst held-out accuracy. &lt;strong&gt;Stub signal lines&lt;/strong&gt; make &lt;code&gt;signal_grounding_check&lt;/code&gt; look worse than real-brief behavior would. &lt;strong&gt;Delta B&lt;/strong&gt; is uneven: training helps most where the prompt-only model was blind, not everywhere.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is next
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Thread-level coherence&lt;/strong&gt; — grade replies against prior turns, not isolated drafts.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing scope&lt;/strong&gt; — enforce &lt;code&gt;pricing_sheet.md&lt;/code&gt; bands on quoted TCV.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LinkedIn-roast heuristic&lt;/strong&gt; — style-guide anti-pattern as an LLM-judge dimension.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-signal calibration&lt;/strong&gt; — score against the &lt;strong&gt;weakest&lt;/strong&gt; signal in a brief, not a single scalar tier.&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;Dataset: &lt;a href="https://huggingface.co/datasets/Natnaela/tenacious-bench" rel="noopener noreferrer"&gt;https://huggingface.co/datasets/Natnaela/tenacious-bench&lt;/a&gt;&lt;/em&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;Code: &lt;a href="https://github.com/Natnael-Alemseged/SalesConversion-Bench" rel="noopener noreferrer"&gt;https://github.com/Natnael-Alemseged/SalesConversion-Bench&lt;/a&gt;&lt;/em&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;Community: &lt;a href="https://github.com/sierra-research/tau2-bench/issues/293" rel="noopener noreferrer"&gt;τ²-Bench issue #293 — structured-context evaluation gaps&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>llm</category>
      <category>benchmarks</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
