DEV Community: Beamlaka

Why Your Non-Significant Benchmark Result Might Be a Power Problem (Not a Model Problem)

Beamlaka — Fri, 08 May 2026 18:24:16 +0000

In Week 11, Tenacious-Bench reported:

Delta A = -2.34 pts, 95% CI [-11.09, +6.20], p = 0.71 (not significant)
Delta B = +22.18 pts, 95% CI [+14.43, +29.82], p = 0.0 (reported significant)
At first glance, this looks straightforward: one result is meaningful, one is not.
But this interpretation can be wrong if the benchmark is underpowered for the effect sizes we actually care about.

This post answers two practical questions:

With 216 binary pass/fail tasks, what size improvement can this benchmark reliably detect at 80% power?
Is reporting p = 0.0 valid when bootstrapping with 2,000 samples?
1) The key statistical gap: significance without power is incomplete
A p-value tells you whether observed data are unusual under a null model.
It does not tell you whether your benchmark was large enough to detect a small-but-real improvement.

So p = 0.71 can mean either:

there is truly no effect, or
there is a small effect, but your benchmark has low detection power.
Those are very different decisions for model iteration.

2) MDE at current benchmark size (216 tasks)
Using a standard two-proportion planning approximation with:

baseline pass rate ≈ 74%
alpha = 0.05 (two-sided)
power = 0.80
n = 216 tasks
the minimum detectable effect (MDE) is about +10.9 percentage points.

That is the core result.

If your practical target is +3 to +5 points, 216 tasks is too small for reliable detection.
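For reference, one pooled-variance form of the two-proportion planning formula is sketched below. The post does not say which exact variant it used, so treat this form as an assumption; it does appear to reproduce the figures quoted here. It assumes two independent arms of n tasks each, with p_0 the baseline pass rate, \delta the lift, and \bar{p} the average of the two pass rates.

```latex
n \approx \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2} \cdot 2\,\bar{p}\,(1-\bar{p})}{\delta^{2}},
\qquad \bar{p} = p_0 + \frac{\delta}{2}
```

With p_0 = 0.74, \alpha = 0.05 (two-sided), and power 1 - \beta = 0.80, setting n = 216 and solving for \delta gives \delta \approx 0.109, which is the +10.9 point MDE above.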

3) Reinterpreting Delta A (p = 0.71)
Given this power profile, Delta A is better interpreted as inconclusive for small effects, not “definitive no improvement.”

Approximate detection probabilities at n=216:

true +3 pt effect -> ~11% detection chance
true +5 pt effect -> ~23% detection chance
true +8 pt effect -> ~52% detection chance
So failure to reject at +3/+5 is expected most of the time.
This is exactly why a non-significant p-value should not be read as proof of no effect when power is low.
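The detection probabilities above can be approximated with a short power calculation. This is a minimal sketch, assuming two independent arms of n tasks each and a pooled-variance normal approximation; the function name `approx_power` is illustrative and not from the original analysis.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(p0, lift, n, alpha=0.05):
    """Approximate two-sided power of a two-proportion z-test.

    Assumes two independent arms of n tasks each and a pooled-variance
    normal approximation. The tiny probability of rejecting in the wrong
    direction is ignored.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    p_bar = p0 + lift / 2                    # average pass rate of the two arms
    se = sqrt(2 * p_bar * (1 - p_bar) / n)   # standard error of the difference
    return NormalDist().cdf(lift / se - z_crit)


for lift in (0.03, 0.05, 0.08):
    print(f"+{lift * 100:.0f} pt -> power ~ {approx_power(0.74, lift, 216):.2f}")
```

Under these assumptions the three values should land close to the ~11%, ~23%, and ~52% figures listed above.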

4) How large Tenacious-Bench v0.2 should be
At the same baseline and test settings, target task counts are approximately:

+3 pt detection -> 3,226 tasks
+5 pt detection -> 1,128 tasks
+8 pt detection -> 420 tasks
Design implication:

If +5 is your minimum meaningful lift, v0.2 should have roughly 1.1k tasks or more.
If +3 matters, v0.2 needs multi-thousand scale.
If you only care about large lifts (~+8), about 420 tasks can be enough.
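The task counts and the MDE can be recomputed with a few lines of Python. This is a sketch under the same assumed pooled-variance two-proportion approximation (independent arms of equal size); `required_tasks` and `mde` are illustrative names, not functions from the benchmark code.

```python
from statistics import NormalDist


def required_tasks(p0, lift, alpha=0.05, power=0.80):
    """Tasks per arm needed to detect `lift` (pooled-variance approximation)."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    p_bar = p0 + lift / 2
    return z ** 2 * 2 * p_bar * (1 - p_bar) / lift ** 2


def mde(p0, n, alpha=0.05, power=0.80):
    """Smallest lift detectable with n tasks, found by bisection."""
    lo, hi = 1e-6, 1 - p0 - 1e-6
    for _ in range(100):
        mid = (lo + hi) / 2
        if required_tasks(p0, mid, alpha, power) > n:
            lo = mid   # this lift needs more than n tasks, so search larger lifts
        else:
            hi = mid
    return hi


for lift in (0.03, 0.05, 0.08):
    print(f"+{lift * 100:.0f} pt -> ~{required_tasks(0.74, lift):,.1f} tasks")
print(f"MDE at n=216: {mde(0.74, 216) * 100:.1f} pts")
```

The raw values come out fractional. The counts quoted above match these values truncated to whole tasks; rounding up instead would be the slightly more conservative planning choice.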
5) Correcting the bootstrap p-value (p = 0.0)
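With finite bootstrap/Monte Carlo resampling, p = 0.0 is not a valid report. Zero extreme resamples out of B only bounds the p-value from above; the smallest value a finite procedure can support is 1/(B+1).

Use the corrected empirical p-value: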

$$p = \frac{r + 1}{B + 1}$$

where:

B = number of resamples
r = count of resamples at least as extreme as observed statistic

For $B = 2000$, $r = 0$:

$$p = \frac{1}{2001} \approx 0.00050$$
Correct reporting:

bootstrap p ≈ 0.0005, or
bootstrap p <= 1/2001, or
bootstrap p < 0.001
Not correct: p = 0.0.
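A minimal sketch of the correction in code follows. It assumes you already have a list of statistics resampled under the null; how those are produced is outside this sketch. The helper names `empirical_p_value` and `format_p` are illustrative.

```python
def empirical_p_value(resampled_stats, observed):
    """Two-sided empirical p-value with the +1 correction.

    `resampled_stats` are statistics drawn under the null (hypothetical
    input). r counts draws at least as extreme as `observed` in absolute value.
    """
    B = len(resampled_stats)
    r = sum(1 for s in resampled_stats if abs(s) >= abs(observed))
    return (r + 1) / (B + 1)


def format_p(p, B):
    """Report the resolution floor instead of zero when r was 0."""
    floor = 1 / (B + 1)
    return f"p <= 1/{B + 1}" if p <= floor else f"p = {p:.4f}"


null_stats = [0.0] * 2000          # toy null draws, none as extreme as observed
p = empirical_p_value(null_stats, observed=0.2218)
print(format_p(p, len(null_stats)))
```

The point of `format_p` is that the report carries the resolution limit of the resampling procedure rather than implying certainty.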

6) Suggested report rewrite
A defensible rewrite is:

“Under a standard two-proportion planning approximation (baseline ~74%), a 216-task benchmark has an 80%-power MDE of about +10.9 points. Therefore, Delta A = -2.34 pts (95% CI [-11.09, +6.20], p = 0.71) is inconclusive for small practical gains, not definitive evidence of no effect. To detect +3/+5/+8 point gains at 80% power, v0.2 would require approximately 3,226 / 1,128 / 420 tasks. For Delta B with 2,000 bootstrap samples, report p ≈ 0.0005 (or p < 0.001), not p = 0.0.”

Final takeaway
Evaluation gives you a score difference.
Statistics tells you whether your benchmark could detect the difference you care about.

For Tenacious-Bench, the Day 4 conclusion is simple and actionable:

keep reporting CIs and p-values,
but add MDE + power-based sample-size planning as a first-class benchmark design step.
That turns “not significant” from an ambiguous label into a decision-ready result.

Did My LoRA Learn Tenacious Style—or Just Memorize Augmented Patterns?

Beamlaka — Thu, 07 May 2026 18:17:57 +0000

In Week 11 Tenacious-Bench, we trained a LoRA adapter on Tenacious-style B2B sales emails using Supervised Fine-Tuning (SFT).
We got a real performance lift: Delta A = +0.263 (p < 0.0001).

But that result exposed a harder question:

Did the adapter learn how Tenacious writes, or just what repeated Tenacious-like samples looked like?

This post answers that at the mechanism level: cross-entropy token-by-token, LoRA gradient flow, and why low-diversity augmentation can make convergence look better than generalization.

1) What SFT cross-entropy actually optimizes
In autoregressive SFT, the model predicts the next token at each step.
Cross-entropy loss measures how much probability mass the model gave the correct next token.

So the objective is:

not “be honest,”
not “be cautious,”
not “be Tenacious,”
but: assign high probability to target tokens in the training distribution.

If your targets consistently reflect Tenacious behavior, style improves indirectly.
But the optimization target is still token prediction.
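To make the objective concrete, here is a small pure-Python sketch of per-token cross-entropy. It is a toy over hand-written logits, not the training code used for the adapter, and the function names are illustrative.

```python
import math


def token_cross_entropy(logits, target_index):
    """Negative log-probability of the target token under softmax(logits)."""
    m = max(logits)                                  # subtract max for stability
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_norm - logits[target_index]


def sequence_loss(step_logits, target_ids):
    """Mean per-token cross-entropy over one target sequence."""
    losses = [token_cross_entropy(l, t) for l, t in zip(step_logits, target_ids)]
    return sum(losses) / len(losses)


# Two decode steps over a 3-token vocabulary; the targets are tokens 0 and 2.
print(sequence_loss([[2.0, 0.5, 0.1], [0.2, 0.1, 1.5]], [0, 2]))
```

Nothing in this loss refers to honesty, caution, or style. It only rewards probability mass on whichever token the training target contains at each position.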

2) How gradients flow in LoRA when base weights are frozen
For each adapted layer:
W = W0 + BA
W0 is frozen
only A and B are trainable
During backprop, gradients pass through the full forward graph, but updates only change A and B.
That means LoRA acts as a low-rank steering update on top of a fixed backbone.

Practical interpretation: you are not retraining the model’s full knowledge. You are learning a compact directional adjustment that shifts output tendencies.
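The frozen-base update can be illustrated with a tiny hand-rolled example. This is a sketch with plain Python lists and a squared-error loss chosen for simplicity (an assumption; SFT uses cross-entropy). It shows the key property: gradients are computed through the full forward path, but only A and B change.

```python
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def outer(u, v):
    return [[a * b for b in v] for a in u]


def transpose(M):
    return [list(col) for col in zip(*M)]


def lora_forward(W0, A, B, x):
    """y = W0 x + B (A x): frozen base path plus low-rank adapter path."""
    base = matvec(W0, x)
    low_rank = matvec(B, matvec(A, x))
    return [b + l for b, l in zip(base, low_rank)]


def lora_sgd_step(W0, A, B, x, target, lr=0.1):
    """One SGD step on 0.5 * ||y - target||^2 that updates only A and B."""
    y = lora_forward(W0, A, B, x)
    grad_y = [yi - ti for yi, ti in zip(y, target)]      # dL/dy
    grad_B = outer(grad_y, matvec(A, x))                 # dL/dB = g (A x)^T
    grad_A = outer(matvec(transpose(B), grad_y), x)      # dL/dA = (B^T g) x^T
    new_A = [[a - lr * g for a, g in zip(ra, rg)] for ra, rg in zip(A, grad_A)]
    new_B = [[b - lr * g for b, g in zip(rb, rg)] for rb, rg in zip(B, grad_B)]
    return new_A, new_B                                  # W0 is never modified


W0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # frozen 2x3 base weight
A = [[0.1, 0.2, 0.3]]                     # rank-1 down-projection (1x3)
B = [[0.0], [0.0]]                        # up-projection (2x1), zero at init
A, B = lora_sgd_step(W0, A, B, x=[1.0, 2.0, 3.0], target=[2.0, 1.0])
print("B after one step:", B)
```

Because B starts at zero, the first step moves only B; A begins to receive gradient once B is non-zero.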

3) What your seven target modules imply
You adapted:
attention projections: q_proj, k_proj, v_proj, o_proj
feed-forward projections: gate_proj, up_proj, down_proj
A useful diagnostic lens:

Attention-heavy updates often correlate with better context routing (e.g., weak signal -> interrogative phrasing).
MLP-heavy updates often correlate with lexical/phrase-shape adaptation (which can be desired style—or shortcut memorization).
This is why module-level gradient norms matter. Without them, “it improved” is under-explained.

4) Why low diversity is a gradient problem, not just a datasheet warning
Your datasheet states that 94.3% of training pairs are augmented variants of only 128 originals.
That has direct optimization consequences.

Near-duplicate examples produce highly aligned gradient directions repeatedly.
Cross-entropy rewards those repeated token patterns quickly. Training loss falls. Metrics can rise.

But this can represent two different realities:

Generalizable policy learning (what you want)
Surface-pattern reinforcement (what you fear)
Cross-entropy alone cannot tell which one happened.

5) Why your Delta A is real but not fully sufficient
A statistically strong Delta A means the adapter improved on your evaluation distribution.
It does not automatically prove robust style generalization out-of-family.

The defensible claim is:

“The adapter improved predictive behavior on measured data; generalization vs memorization requires additional diagnostics.”

That is stronger science and better engineering.

6) Minimal diagnostics to separate style learning from memorization
A) Grouped holdout by original family
Do not split augmentation siblings across train/held-out.
Keep all variants of one original together in one split.

Stable performance on grouped holdout -> stronger evidence of true style learning
Large drop -> evidence of augmentation-family memorization
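A grouped split can be sketched in a few lines. This assumes each example carries a hypothetical `family_id` field naming the original it was augmented from; the field name and the 20% holdout fraction are illustrative choices, not taken from the project.

```python
import random
from collections import defaultdict


def grouped_split(examples, holdout_fraction=0.2, seed=0):
    """Split so that all variants of one original stay in the same split."""
    families = defaultdict(list)
    for ex in examples:
        families[ex["family_id"]].append(ex)

    family_ids = sorted(families)
    random.Random(seed).shuffle(family_ids)       # shuffle families, not rows
    n_holdout = max(1, round(len(family_ids) * holdout_fraction))

    heldout = [ex for fid in family_ids[:n_holdout] for ex in families[fid]]
    train = [ex for fid in family_ids[n_holdout:] for ex in families[fid]]
    return train, heldout


examples = [{"family_id": i % 10, "text": f"variant {i}"} for i in range(50)]
train, heldout = grouped_split(examples)
print(len(train), "train rows,", len(heldout), "held-out rows")
```

The split is made over family ids rather than rows, so no augmentation sibling of a held-out example can appear in training.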
B) Gradient norm breakdown by LoRA module
Log gradient norms for LoRA params and aggregate by:

q/k/v/o
gate/up/down
This doesn’t “prove style” alone, but it makes your mechanism claim concrete: where did training pressure concentrate?
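One way to aggregate the norms is sketched below. It assumes parameter names contain the module name (for example `q_proj`) and a `lora_` marker, which is a common naming pattern but should be checked against your own model; the input format of `(name, flat gradient list)` pairs is a simplification for illustration.

```python
import math
from collections import defaultdict

ATTENTION = ("q_proj", "k_proj", "v_proj", "o_proj")
MLP = ("gate_proj", "up_proj", "down_proj")


def module_group(param_name):
    """Map a parameter name to 'attention', 'mlp', or 'other'."""
    if any(m in param_name for m in ATTENTION):
        return "attention"
    if any(m in param_name for m in MLP):
        return "mlp"
    return "other"


def grad_norms_by_group(named_grads):
    """L2 gradient norm per module group, over LoRA parameters only."""
    squared = defaultdict(float)
    for name, grad in named_grads:
        if "lora_" not in name:          # skip frozen base weights
            continue
        squared[module_group(name)] += sum(g * g for g in grad)
    return {group: math.sqrt(total) for group, total in squared.items()}


toy_grads = [
    ("layers.0.self_attn.q_proj.lora_A.weight", [3.0, 4.0]),
    ("layers.0.mlp.up_proj.lora_B.weight", [1.0, 0.0]),
    ("layers.0.self_attn.q_proj.weight", [100.0]),   # base weight, ignored
]
print(grad_norms_by_group(toy_grads))
```

In a real training loop you would feed this from the framework's named parameters after the backward pass and log the result per step.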

7) Practical conclusion for FDE fine-tuning work
This issue generalizes to any narrow, augmented SFT project (sales writing, summarization, code style, domain formatting):

loss convergence is necessary,
benchmark gain is valuable,
but neither alone proves intended behavior learning.
If you want to claim “learned policy,” add grouped holdout and module-level gradient diagnostics as standard evidence.

Final takeaway
Your LoRA adapter likely learned a useful steering update.
But with heavy augmentation concentration, the safest conclusion is:

“We improved next-token policy on this distribution; we are validating whether that policy generalizes beyond augmentation families.”

That framing is honest, technically grounded, and production-defensible.

Why DeepSeek V3.2 Tool Calls Can Drift from Ordered System Instructions

Beamlaka — Wed, 06 May 2026 18:32:59 +0000

When my partner asked this question, it sounded very specific—but it exposed a broad engineering issue many of us hit in production agent systems:

When DeepSeek V3.2 selects a tool via tool_choice="auto", what tokens are actually generated, how is that different from older special-token function-calling formats or strict structured calling, and what does that do to ordered system-instruction adherence?

I expected a simple “function-calling behavior” answer.
What I found is more useful: this is not just a model question. It is a protocol + parser + orchestration question.

The core insight
For open-weight DeepSeek V3.2 workflows, tool calling in auto mode is typically:

model emits textual wrapper content (DSML-like blocks),
runtime/parser extracts tool calls from that text,
runtime normalizes into tool_calls[] objects.
So the system is often text-generation first, structure recovery second.

That differs from strict constrained function-calling stacks where decoding itself is grammar/schema constrained and invalid next tokens are masked out during generation.

In one line:
auto parser-based calling is a best-effort protocol; constrained calling is a decode-time enforcement regime.
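The "text-generation first, structure recovery second" pattern can be sketched as follows. The `<tool_call>` wrapper here is a made-up placeholder, not DeepSeek's actual DSML syntax, and `recover_tool_calls` is an illustrative name rather than any runtime's API.

```python
import json
import re

# Hypothetical wrapper format for illustration only.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def recover_tool_calls(generated_text):
    """Best-effort recovery of structured calls from free-form model text."""
    calls, errors = [], []
    blocks = TOOL_CALL_RE.findall(generated_text)
    for raw in blocks:
        try:
            payload = json.loads(raw)
            calls.append({
                "name": payload["name"],
                "arguments": payload.get("arguments", {}),
            })
        except (json.JSONDecodeError, KeyError, TypeError, AttributeError) as exc:
            errors.append(f"unparseable block: {exc!r}")
    # An opening tag without a matching close usually means truncation.
    if generated_text.count("<tool_call>") > len(blocks):
        errors.append("unterminated tool_call wrapper (possible truncation)")
    return calls, errors


ok = 'Checking context. <tool_call>{"name": "tool_a", "arguments": {"q": "x"}}</tool_call>'
cut = 'Checking context. <tool_call>{"name": "tool_a", "argu'
print(recover_tool_calls(ok))
print(recover_tool_calls(cut))
```

Nothing stops the model from emitting the second string. The parser can only report the failure after the fact, which is the difference from decode-time enforcement.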

What the model is actually doing at generation time
To make this concrete, separate three layers that are often mixed together:

1) Surface token generation
The model generates tokens that can become:

normal assistant prose,
reasoning content,
tool-wrapper text signaling an invocation.
2) Prompt serialization
Your system message, tool descriptions/schemas, and user turn are serialized into one prompt context (with model-specific formatting).

3) Parser/runtime recovery
A parser interprets emitted text and converts it into structured tool-call objects.

If your generation crosses boundaries cleanly, everything looks reliable.
If boundaries are malformed, delayed, truncated, or ambiguous, you get drift—even when the model’s intent looked close.

Why ordered instruction adherence can fail
Suppose your system instruction says:

“First inspect context, then call Tool A, then Tool B, then summarize.”

In parser-first auto paths, failures can happen for structural reasons:

Branch competition at decode time
At the action boundary, the model can continue prose/reasoning or begin tool-wrapper output. Without strict masking, this is free competition.

Prompt-distance pressure
Ordered rules often appear far earlier than the local action boundary where tool wrappers begin.

Reasoning/action boundary leakage
If transitions around reasoning tags and tool wrappers are imperfect, parser classification can degrade.

Truncation at sensitive points
Cutting generation inside wrapper syntax can break tool recovery entirely.

Parser coercion side effects
Some runtimes “helpfully” coerce arguments post-hoc. Useful in some cases, but not equivalent to strict schema-safe decoding.

So “instruction drift” here is often not just “model ignored rules.”
It can be “the protocol boundary and recovery path were fragile.”

Comparison: three reliability regimes
A) Parser-based auto mode (common in open/self-hosted paths)
flexible, capable
but more exposed to wrapper malformation, partial emits, and order drift
needs robust validation and retries
B) Named/required tool choice with stronger control
reduces branch ambiguity
improves order and tool-selection predictability
still depends on schema/tool design quality
C) Strict constrained structured decoding
strongest structural guarantee
invalid next tokens masked at decode time
structure reliability is highest, though semantic quality can still vary
Practical engineering implications
If your workflow is order-sensitive or high-risk, treat parser-first auto as best effort, not guaranteed protocol obedience.

The most effective mitigations I’d recommend:

- Keep tool set and schema text compact
Less scaffolding noise near the decision boundary.

- Encode order criteria twice
In system/developer instructions and inside tool descriptions.

- Reserve token headroom
Prevent truncation mid-wrapper or mid-argument.

- Validate after parse
Tool name, required args, arg types, and instruction-order checks.

- Add repair/retry policy
If parse fails or order check fails, reprompt with explicit corrective instruction.

- Checkpoint long chains
Split long multi-tool sequences into staged turns instead of one long free-decoded pass.

- Use stricter selection modes when correctness dominates
Named/required tools or constrained-decoding paths for critical flows.
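The "validate after parse" and order-check steps might look like the sketch below. The schema format (argument name mapped to a Python type) and the tool names are hypothetical simplifications; a production system would more likely use JSON Schema validation.

```python
def validate_calls(calls, schemas, required_order):
    """Return a list of violations; an empty list means the calls pass."""
    problems = []
    for call in calls:
        schema = schemas.get(call["name"])
        if schema is None:
            problems.append(f"unknown tool: {call['name']}")
            continue
        for arg, expected_type in schema.items():
            if arg not in call["arguments"]:
                problems.append(f"{call['name']}: missing arg '{arg}'")
            elif not isinstance(call["arguments"][arg], expected_type):
                problems.append(f"{call['name']}: arg '{arg}' has wrong type")

    # Ordered tools must appear in the same relative order as required_order.
    seen = [c["name"] for c in calls if c["name"] in required_order]
    positions = [required_order.index(name) for name in seen]
    if positions != sorted(positions):
        problems.append(f"order violation: saw {seen}, expected {required_order}")
    return problems


schemas = {"tool_a": {"query": str}, "tool_b": {"limit": int}}
calls = [
    {"name": "tool_b", "arguments": {"limit": "10"}},   # wrong type, wrong order
    {"name": "tool_a", "arguments": {"query": "acme"}},
]
for problem in validate_calls(calls, schemas, ["tool_a", "tool_b"]):
    print(problem)
```

A non-empty result is what would trigger the repair/retry policy: reprompt with the listed violations as the corrective instruction.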

Minimal test you can run this week
A fast A/B experiment will make this real:

Fix one ordered workflow (A -> B -> summarize)
Same prompts, same cases
Compare:
  parser-based auto
  stricter named/required selection (or constrained path where available)
Track:
  order violation rate
  malformed parse rate
  argument schema violation rate
  end-to-end success
  latency overhead
This turns an abstract debate into measurable engineering tradeoffs.
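A small aggregation helper is enough to compare the two arms. This sketch assumes each run is logged as a dict with boolean outcome flags and a latency in seconds; the field names are hypothetical.

```python
def summarize_runs(runs):
    """Aggregate per-run outcome flags into the tracked rates."""
    n = len(runs)

    def rate(key):
        return sum(1 for r in runs if r[key]) / n

    return {
        "order_violation_rate": rate("order_violation"),
        "malformed_parse_rate": rate("malformed_parse"),
        "schema_violation_rate": rate("schema_violation"),
        "success_rate": rate("success"),
        "mean_latency_s": sum(r["latency_s"] for r in runs) / n,
    }


auto_runs = [
    {"order_violation": True, "malformed_parse": False,
     "schema_violation": False, "success": False, "latency_s": 2.0},
    {"order_violation": False, "malformed_parse": False,
     "schema_violation": False, "success": True, "latency_s": 4.0},
]
print(summarize_runs(auto_runs))
```

Run it once per arm on the same cases and compare the dictionaries side by side.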

Scope note
This post focuses on publicly documented DeepSeek V3.2 open-weight behavior and common parser/runtime patterns. It does not claim undocumented hosted internals beyond published API contracts.

Final takeaway
The key lesson for agent/tool-use internals is:

Tool reliability is not only “how smart the model is.”
It is the combined behavior of decoding regime, serialization format, parser recovery, and orchestration checks.

For ordered system-instruction adherence, that distinction matters a lot.
If you need deterministic correctness, pair model capability with protocol discipline and runtime enforcement.

Why front-loaded rules drift in long evaluator and agent loops

Beamlaka — Mon, 04 May 2026 19:46:06 +0000

Subtitle: In context is not the same as in control.

By Beamlak Adane

A colleague asked a sharp question that many of us hit once we move from demos to production evaluators and agents:

In multi-turn evaluation or agent loops, models often begin to ignore the initial rules in the system prompt even though those tokens are still inside the context window. What is happening at the token level during decoding that causes early “anchor” tokens to lose influence over time in a streaming context, and how do attention sinks, KV-cache reuse, and prefix caching affect this? Beyond increasing context length, what can engineers do to preserve instruction fidelity and judgment consistency across long sessions?

This post is my answer after digging into the mechanism and mapping it to something concrete I ship: a sales-email evaluator loop.

The core idea: visible is not the same as influential
Keeping the system prompt inside the window is a storage guarantee, not an influence guarantee.

In a decoder-only transformer, each new token is produced by forming a fresh query and comparing it against keys from all prior positions (under the causal mask). A past token affects the next token only through how much attention mass the current query assigns to that token’s key, after softmax normalization and any positional biases.

So the system rules do not “disappear” in the sense of vanishing from the input. They can stop mattering because the active computation at the current step no longer routes enough probability mass through those early positions—or because later tokens have become fresher, more specific carriers of a different trajectory.
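A toy softmax makes the dilution visible. This is a deliberately simplified sketch: scalar scores stand in for query-key dot products, a single position stands in for the system rule, and the specific score values are arbitrary assumptions. It is not a measurement of any real model.

```python
import math


def attention_weights(scores):
    """Softmax over raw query-key scores for one decode step."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Position 0 is the rule token with a fixed score. Later positions are turn
# content that scores slightly higher against the current query.
rule_score, recent_score = 2.0, 2.5
for n_recent in (4, 64, 1024):
    weights = attention_weights([rule_score] + [recent_score] * n_recent)
    print(f"{n_recent:>5} recent tokens -> mass on rule token: {weights[0]:.4f}")
```

The rule token's score never changes, yet its share of attention shrinks as competitors accumulate, which is the sense in which a token can stay visible while losing influence.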

Why multi-turn loops drift harder than one long completion
Three forces show up again and again in evaluator and agent traces:

Recency competition. Each turn adds user text, tool output, and prior model generations. Those tokens are often semantically closer to the immediate subtask than a long rubric paragraph at the front. They compete for attention mass on every decode step.

Self-conditioning. Autoregressive models always condition on what they already generated. A slightly off-rubric line becomes part of the context and can steer the next line. Drift compounds because the model is partly “arguing with its own last move.”

Turn boundaries change the query geometry. Instruction-following research suggests attention to system tokens can look relatively stable within a single answer, then shift more abruptly across turns. That matches engineering reality: each loop iteration injects new observations that compete directly with the original rubric.

Attention sinks, KV cache, and prefix caching (what they actually do)
Attention sinks
Attention sinks are a subtle but important correction to naive “early tokens always win” stories. In long streaming settings, some early tokens can receive surprisingly persistent attention mass—even when they are not semantically “the rule.” Softmax attention must allocate probability somewhere; when the model does not need to attend strongly to many past tokens, mass can pool in early positions that function partly as normalization anchors rather than semantic instruction consultation.

Practical implication: seeing high attention on the beginning of the prompt does not automatically mean the model is faithfully executing the rubric text that follows.

KV cache reuse
KV caching is primarily an inference optimization: once a token’s key and value tensors are computed, they can be reused during autoregressive decoding instead of recomputing the full prefix every step.

Behaviorally, cached states are frozen snapshots of earlier positions. They remain available to future queries, but they are not continuously reinterpreted as the conversation evolves. Each new token still has to “re-win” attention against an expanding set of competitors—often including very recent, task-specific states.

Prefix caching
Prefix caching (serving systems that reuse KV blocks for a shared static prefix across requests) is different again. It does not magically fix “the model stopped obeying rules inside one endless chat thread.”

Its biggest win is operational: it makes it cheap to replay a canonical rubric and tool schema on every call, which enables an architecture where you stop relying on one ever-growing mutable conversation as the carrier of policy.

A concrete example from my evaluator work
In my Week 11 bench, part of the scoring contract encodes “don’t overcommit under weak evidence” using simple lexical cues—downgrade language versus hard commitment language. That is a deliberate simplification, but it makes the drift mechanism obvious:

If the loop injects urgency language from tool output, the model may echo that tone.
Once it emits an overconfident phrase, that phrase becomes recent context on the next step.
The original rubric may still be present in the window, but the next-token distribution is increasingly steered by the trajectory, not only by the static policy block.
The point is not that lexical checks are enough for production. The point is that decode-time routing + self-conditioning can break instruction fidelity even when the rules are technically still “in context.”

What to do in production (beyond “use a bigger window”)
These are the interventions I take seriously for evaluators, judges, and agent planners:

Prefer stateless or quasi-stateless calls. Rebuild each step from an immutable rubric plus compact task state plus the current item. Do not assume one long thread keeps policy alive just because it fits.

Separate policy memory from episodic memory. Summarize observations and decisions; do not summarize the controlling rules into the same lossy blob unless you are okay with slow policy erosion.

Re-anchor rules near the decision. If a constraint matters for this verdict, repeat a compressed version of it immediately before the judgment step—not only at the front of a huge trace.

Add an explicit rule-recall pass. Make the model name the applicable rubric items before it commits to a final answer. This is a cheap guardrail against silent drift.

Structure the output. Schemas and checklists make violations easier to detect and correct than free-form prose alone.

Keep hard guarantees outside the model when possible. Deterministic validators, allowlists, and post-checks should enforce what you cannot afford to “mostly” follow.

Measure drift, not just accuracy. Track instruction adherence over turns and variance across seeds. Evaluators fail quietly; your metrics should be loud.
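Several of these interventions (stateless rebuild, policy kept separate from episodic state, re-anchoring, rule recall) can be combined in one prompt builder. The sketch below is illustrative: the section labels and the `build_step_prompt` signature are assumptions, not the evaluator's actual prompt format.

```python
def build_step_prompt(rubric, task_state, item, critical_rules):
    """Assemble one stateless evaluator call from immutable and per-step parts."""
    reminder = "\n".join(f"- {rule}" for rule in critical_rules)
    return "\n\n".join([
        f"RUBRIC (immutable):\n{rubric}",                 # replayed verbatim every call
        f"TASK STATE (compact summary):\n{task_state}",   # episodic memory only
        f"ITEM TO JUDGE:\n{item}",
        f"BEFORE JUDGING, restate which of these rules apply:\n{reminder}",
        "VERDICT:",
    ])


prompt = build_step_prompt(
    rubric="Score outreach emails. Do not reward overcommitment under weak evidence.",
    task_state="Items 1-3 judged. No open disputes.",
    item="Subject: Quick idea. Body: We can definitely staff your team next week.",
    critical_rules=["Weak signal requires hedged language."],
)
print(prompt)
```

Because the rubric block is identical on every call, it is also the part a serving stack with prefix caching can reuse cheaply.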

What I am explicitly not claiming
This post focuses on inference-time mechanics and systems patterns. Training-time fixes (instruction hierarchy fine-tuning, verifiable RL-style constraint learning, etc.) matter too, but they are a different lever. Get the loop architecture right first; otherwise you are paying training costs to compensate for a bad control path.

Further reading
Vaswani et al., Attention Is All You Need — attention definition and decoding stack: https://arxiv.org/pdf/1706.03762
Xiao et al., StreamingLLM — attention sinks and long streaming behavior: https://arxiv.org/pdf/2309.17453
Hugging Face Transformers — KV caching / generation: https://huggingface.co/docs/transformers/cache_explanation
vLLM — prefix caching design: https://docs.vllm.ai/en/stable/design/prefix_caching/
Closing
If you take one line away, take this: instruction drift in long loops is often a control-path failure, not a storage failure. Tokens can remain visible while losing authority; sinks can make early attention misleading; caches change cost and architecture, not the fundamental competition between old rules and new trajectory. The strongest fixes replay policy intentionally, separate policy from trajectory, re-anchor at decision time, and enforce hard constraints outside the model’s narrative.

Tenacious-Bench v0.1: a small B2B sales-outreach benchmark with contamination checks

Beamlaka — Sat, 02 May 2026 10:22:22 +0000

General sales benchmarks often miss how real outbound agents fail: overclaiming on weak signals, unsafe “bench” commitments, tone that drifts into pushy follow-ups, and gaps between what the rep promises and what delivery can support. For a class project (TRP1 Week 11), I built Tenacious-Bench v0.1, a compact, machine-scored task set aimed at those failure modes—not generic helpfulness.

What’s in the dataset

The public release is on Hugging Face: https://huggingface.co/datasets/Bnobody/tenacious_bench_v0.1.

It currently exposes 168 rows in the hub viewer, with splits aligned to how I train and evaluate: train (105) and validation (63). Tasks mix several authoring modes—programmatic sweeps, multi-LLM synthesis with judge filtering, trace-informed scenarios, and hand-authored adversarial cases—so the bench isn’t a single-generator monoculture.

Each row includes structured inputs (prospect context, stack, headcount, signal confidence, bench availability, etc.), a candidate outreach payload (subject/body/CTA), explicit ground-truth expectations (e.g. when to hand off vs. qualify), and a versioned scoring rubric so scores are reproducible without hand-waving.

Why contamination and provenance matter

Synthetic benchmarks leak in boring ways: near-duplicate phrasing across splits, embedding neighbors that are too close, or “eval” tasks that are effectively the same scenario as training with a date tweak. I run n-gram overlap, embedding similarity, and an explicit signal-window / provenance policy (train/dev vs. held-out time labeling) and record outcomes in a JSON report in the repo. The goal isn’t perfection—it’s to make leakage visible and actionable.
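As an illustration of the n-gram overlap check, here is a minimal sketch using word trigram Jaccard similarity. The trigram size and the 0.5 threshold are arbitrary assumptions for the example, not the values used in the repo's report.

```python
def word_ngrams(text, n=3):
    """Set of lowercase word n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 0.0


def flag_leaks(train_texts, eval_texts, n=3, threshold=0.5):
    """Return (eval_index, train_index, score) for pairs above the threshold."""
    train_grams = [word_ngrams(t, n) for t in train_texts]
    flagged = []
    for i, text in enumerate(eval_texts):
        grams = word_ngrams(text, n)
        for j, other in enumerate(train_grams):
            score = jaccard(grams, other)
            if score >= threshold:
                flagged.append((i, j, score))
    return flagged


train = ["we saw your team is hiring data engineers this quarter",
         "totally unrelated sentence about pricing tiers and renewals"]
evals = ["we saw your team is hiring data engineers this month",
         "a fresh scenario with no shared phrasing at all"]
print(flag_leaks(train, evals))
```

A pairwise scan like this is quadratic in dataset size, which is fine at a few hundred rows; lexical overlap also misses paraphrases, which is why the embedding-similarity check is run alongside it.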

Training angle (Path B)

I’m not publishing a giant SFT corpus here; the project emphasizes a preference-style critic path (ORPO/DPO-style data prep + LoRA training) to catch inconsistency and unsafe commitments. The dataset is the artifact reviewers can actually load; training code and logs live alongside the project README.

Limitations (stated plainly)

Tasks are synthetic and English-first; they don’t replace live A/B tests or compliance review. The bench is meant as a regression harness for product teams iterating on sales agents, not as proof of real-world lift.

Call to action

If you’re building outbound agents, try grading your model on a slice of these tasks and compare against your internal rubric. I’m especially interested in cases where the model is “fluent” but violates bench/signal safety—those are the rows worth expanding next.