<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Eyoel Nebiyu</title>
    <description>The latest articles on DEV Community by Eyoel Nebiyu (@eyorata).</description>
    <link>https://dev.to/eyorata</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3909051%2F0914ec61-85f3-423b-97d3-0dc9931802d9.jpeg</url>
      <title>DEV Community: Eyoel Nebiyu</title>
      <link>https://dev.to/eyorata</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/eyorata"/>
    <language>en</language>
    <item>
      <title>When Your Training Loss Is Lying to You: Building a Tenacious-Specific Sales Outreach Benchmark</title>
      <dc:creator>Eyoel Nebiyu</dc:creator>
      <pubDate>Sat, 02 May 2026 12:51:59 +0000</pubDate>
      <link>https://dev.to/eyorata/when-your-training-loss-is-lying-to-you-building-a-tenacious-specific-sales-outreach-benchmark-2jgd</link>
      <guid>https://dev.to/eyorata/when-your-training-loss-is-lying-to-you-building-a-tenacious-specific-sales-outreach-benchmark-2jgd</guid>
      <description>&lt;p&gt;This post documents a real negative result: my trained model worked… but a well-written prompt worked better.&lt;/p&gt;

&lt;h2&gt;TL;DR&lt;/h2&gt;

&lt;p&gt;I built a 266-task evaluation benchmark for B2B sales-outreach agents — something existing benchmarks don’t measure well.&lt;/p&gt;

&lt;p&gt;Then I trained a small preference-learning judge model using SimPO.&lt;/p&gt;

&lt;p&gt;What happened surprised me:&lt;/p&gt;

&lt;p&gt;Training accuracy → 100%&lt;br&gt;
Held-out accuracy → 25%&lt;/p&gt;

&lt;p&gt;Classic overfitting.&lt;/p&gt;

&lt;p&gt;But the real lesson wasn’t about the model.&lt;/p&gt;

&lt;p&gt;It was about the data.&lt;/p&gt;

&lt;p&gt;After fixing dataset construction:&lt;/p&gt;

&lt;p&gt;Held-out accuracy improved to 0.417 (Delta A +25pp)&lt;br&gt;
A carefully prompted untrained model scored 0.833&lt;/p&gt;

&lt;p&gt;👉 Conclusion:&lt;br&gt;
At this scale, judging B2B sales tone is mostly a prompt-following problem, not a preference-learning problem.&lt;/p&gt;

&lt;h2&gt;Project Links&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Dataset: &lt;a href="https://huggingface.co/datasets/eyorata/tenacious_bench_v0.1" rel="noopener noreferrer"&gt;https://huggingface.co/datasets/eyorata/tenacious_bench_v0.1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Judge Model: &lt;a href="https://huggingface.co/eyorata/tenacious-judge-simpo-qwen25-3b" rel="noopener noreferrer"&gt;https://huggingface.co/eyorata/tenacious-judge-simpo-qwen25-3b&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Code: &lt;a href="https://github.com/eyorata/sales_evaluation_bench" rel="noopener noreferrer"&gt;https://github.com/eyorata/sales_evaluation_bench&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Total experiment cost: $0.041&lt;/p&gt;

&lt;h2&gt;The Problem: Existing Benchmarks Miss Real Sales Failures&lt;/h2&gt;

&lt;p&gt;Benchmarks like τ²-Bench retail, MT-Bench, or AlpacaEval are excellent at evaluating:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;tool use&lt;/li&gt;
&lt;li&gt;reasoning&lt;/li&gt;
&lt;li&gt;conversation flow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But they don’t measure what actually kills B2B deals.&lt;/p&gt;

&lt;p&gt;The agent I wanted to evaluate had to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;interpret hiring signals (funding, layoffs, leadership changes)&lt;/li&gt;
&lt;li&gt;segment prospects correctly&lt;/li&gt;
&lt;li&gt;write grounded outreach emails&lt;/li&gt;
&lt;li&gt;avoid over-promising capacity&lt;/li&gt;
&lt;li&gt;respect opt-outs and booking rules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Retail benchmarks simply don’t test these behaviors.&lt;/p&gt;

&lt;p&gt;Example real failures from earlier experiments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Auto-booking meetings when prospects only said “let me check my calendar.”&lt;/li&gt;
&lt;li&gt;Re-engaging after opt-out, risking brand damage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those failures cost real money — but no public benchmark grades them.&lt;/p&gt;

&lt;p&gt;So I built one.&lt;/p&gt;

&lt;h2&gt;Designing the Benchmark&lt;/h2&gt;

&lt;p&gt;The rule I set early:&lt;/p&gt;

&lt;p&gt;Every rubric must be machine-gradable.&lt;/p&gt;

&lt;p&gt;No vague scoring like “sounds professional.”&lt;/p&gt;

&lt;p&gt;Instead, tasks check things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;banned phrases absent&lt;/li&gt;
&lt;li&gt;at least one signal referenced&lt;/li&gt;
&lt;li&gt;no unsupported commitments&lt;/li&gt;
&lt;li&gt;tone markers satisfied&lt;/li&gt;
&lt;li&gt;correct action class detected&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each task returns a numeric score between 0 and 1.&lt;/p&gt;

&lt;p&gt;No humans needed during evaluation.&lt;/p&gt;
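
&lt;p&gt;To make that concrete, here is a minimal sketch of what one machine-gradable rubric check can look like. The banned phrases, tone markers, and function names are illustrative assumptions, not the benchmark’s actual lists or code.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative sketch of a machine-gradable rubric; the phrase and marker
# lists are made-up examples, not the benchmark's real ones.
BANNED_PHRASES = ["synergy", "game-changing", "revolutionize"]
TONE_MARKERS = ["you", "your team", "specific", "next step"]

def score_email(email, signals):
    """Average of binary rubric checks, so the score lands in [0, 1]."""
    text = email.lower()
    checks = [
        all(p not in text for p in BANNED_PHRASES),  # banned phrases absent
        any(s.lower() in text for s in signals),     # at least one signal referenced
        sum(m in text for m in TONE_MARKERS) &gt;= 2,   # tone markers satisfied
    ]
    return sum(checks) / len(checks)

print(score_email("You closed your $14M Series A in February...",
                  ["Series A"]))  # 0.67 under these toy rules
&lt;/code&gt;&lt;/pre&gt;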

&lt;h2&gt;The Dataset&lt;/h2&gt;

&lt;p&gt;266 tasks across five generation modes:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Mode&lt;/th&gt;&lt;th&gt;Why it exists&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Programmatic generation&lt;/td&gt;&lt;td&gt;deterministic coverage&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Trace-derived tasks&lt;/td&gt;&lt;td&gt;grounded realism&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Multi-LLM synthesis&lt;/td&gt;&lt;td&gt;harder edge cases&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Hand-authored adversarial&lt;/td&gt;&lt;td&gt;stress testing&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Style-guide gold pairs&lt;/td&gt;&lt;td&gt;real preference ground truth&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Partitions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Train — 50%&lt;/li&gt;
&lt;li&gt;Dev — 30%&lt;/li&gt;
&lt;li&gt;Held-out — 20%&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Preventing Data Leakage&lt;/h2&gt;

&lt;p&gt;I enforced three contamination checks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;No shared 8-grams between train and held-out tasks&lt;/li&gt;
&lt;li&gt;Embedding similarity threshold&lt;/li&gt;
&lt;li&gt;Time-window filtering for public signals&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Result: 0 contamination violations.&lt;/p&gt;
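
&lt;p&gt;As a sketch of how the first check can be implemented (the task schema here is an assumption, not the repo’s exact code):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Minimal sketch of the shared-8-gram contamination check.
# The task fields ("id", "text") are assumed, not the dataset's real schema.

def ngrams(text, n=8):
    """Lowercase word-level n-grams of a task's text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_violations(train_tasks, heldout_tasks, n=8):
    """Return (train_id, heldout_id) pairs that share any n-gram."""
    train_grams = {t["id"]: ngrams(t["text"], n) for t in train_tasks}
    violations = []
    for h in heldout_tasks:
        h_grams = ngrams(h["text"], n)
        for tid, grams in train_grams.items():
            if not grams.isdisjoint(h_grams):
                violations.append((tid, h["id"]))
    return violations
&lt;/code&gt;&lt;/pre&gt;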

&lt;h2&gt;Why I Chose Preference Training (Path B)&lt;/h2&gt;

&lt;p&gt;Week 10 analysis showed the model could already write fluent emails.&lt;/p&gt;

&lt;p&gt;The real problem was:&lt;/p&gt;

&lt;p&gt;👉 it couldn’t judge its own output.&lt;/p&gt;

&lt;p&gt;So instead of improving generation, I trained a judge model using SimPO.&lt;/p&gt;

&lt;p&gt;Setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Algorithm: SimPO (reference-free preference learning)&lt;/li&gt;
&lt;li&gt;Trainer: TRL CPOTrainer&lt;/li&gt;
&lt;li&gt;Backbone: Qwen2.5-3B&lt;/li&gt;
&lt;li&gt;LoRA fine-tuning&lt;/li&gt;
&lt;li&gt;Hardware: free Colab T4&lt;/li&gt;
&lt;/ul&gt;
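
&lt;p&gt;For orientation, here is a minimal sketch of that setup, assuming a recent TRL version (TRL implements the SimPO loss through CPOTrainer via loss_type="simpo"). The hyperparameter values and the Instruct variant of the backbone are placeholders, not my actual run config:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch of the training setup: SimPO (via TRL's CPOTrainer) on a LoRA-wrapped
# Qwen2.5-3B. All hyperparameter values below are placeholders.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import CPOConfig, CPOTrainer

model_id = "Qwen/Qwen2.5-3B-Instruct"  # Instruct variant assumed
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Assumes a split with prompt/chosen/rejected columns, as CPOTrainer expects.
dataset = load_dataset("eyorata/tenacious_bench_v0.1", split="train")

config = CPOConfig(
    output_dir="tenacious-judge-simpo",
    loss_type="simpo",  # reference-free SimPO loss
    cpo_alpha=0.0,      # drop the CPO behavior-cloning term, leaving pure SimPO
    simpo_gamma=0.5,    # target reward margin (placeholder)
    per_device_train_batch_size=2,
    learning_rate=8e-7,
)

trainer = CPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
    peft_config=LoraConfig(r=16, lora_alpha=32),  # LoRA instead of full fine-tuning
)
trainer.train()
&lt;/code&gt;&lt;/pre&gt;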
&lt;h2&gt;The First Run: Perfect Training, Terrible Reality&lt;/h2&gt;

&lt;p&gt;Training looked amazing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;loss dropped smoothly&lt;/li&gt;
&lt;li&gt;train accuracy hit 1.00&lt;/li&gt;
&lt;li&gt;reward margins increased&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But evaluation stayed stuck:&lt;/p&gt;

&lt;p&gt;Train accuracy: 1.00&lt;br&gt;
Held-out accuracy: 0.25&lt;/p&gt;

&lt;p&gt;This is the moment many ML projects go wrong.&lt;/p&gt;

&lt;p&gt;The instinct is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;bigger model&lt;/li&gt;
&lt;li&gt;more steps&lt;/li&gt;
&lt;li&gt;different hyperparameters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I almost did that.&lt;/p&gt;

&lt;p&gt;Instead, I read the data.&lt;/p&gt;

&lt;h2&gt;The Real Problem Was the Dataset&lt;/h2&gt;

&lt;p&gt;Training examples used templated synthetic emails:&lt;/p&gt;

&lt;p&gt;“Thank you for your interest…”&lt;/p&gt;

&lt;p&gt;Held-out examples were real style-guide drafts:&lt;/p&gt;

&lt;p&gt;“You closed your $14M Series A in February…”&lt;/p&gt;

&lt;p&gt;The model learned a useless shortcut:&lt;/p&gt;

&lt;p&gt;👉 prefer one template phrase over another.&lt;/p&gt;

&lt;p&gt;It wasn’t learning tone — it was learning templates.&lt;/p&gt;

&lt;h2&gt;The Fix&lt;/h2&gt;

&lt;p&gt;I didn’t retrain immediately.&lt;/p&gt;

&lt;p&gt;I fixed the data.&lt;/p&gt;

&lt;p&gt;Using a stronger model, I rewrote all training “chosen” examples into authentic Tenacious voice, enforcing four constraints (sketched after the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;five tone markers&lt;/li&gt;
&lt;li&gt;banned phrase rules&lt;/li&gt;
&lt;li&gt;grounded signals&lt;/li&gt;
&lt;li&gt;evaluator score ≥ 0.7&lt;/li&gt;
&lt;/ul&gt;
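
&lt;p&gt;A hypothetical sketch of that rewrite-and-filter loop (the helper names stand in for the stronger-model call and the rubric evaluator; neither is the repo’s actual code):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical rewrite-and-filter loop over the training preference pairs.
# rewrite_in_tenacious_voice() and score_email() are stand-ins for the
# stronger-model rewrite call and the rubric evaluator.

def fix_chosen_examples(pairs, min_score=0.7, max_attempts=3):
    """Replace each 'chosen' email with a rewrite that passes the evaluator."""
    for pair in pairs:
        for _ in range(max_attempts):
            draft = rewrite_in_tenacious_voice(pair["chosen"])
            if score_email(draft, pair["signals"]) &gt;= min_score:
                pair["chosen"] = draft
                break
    return pairs
&lt;/code&gt;&lt;/pre&gt;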

&lt;p&gt;Cost: $0.04&lt;/p&gt;

&lt;p&gt;Same algorithm. Same setup.&lt;/p&gt;

&lt;p&gt;Only the data changed.&lt;/p&gt;

&lt;h2&gt;The Honest Results&lt;/h2&gt;

&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Metric&lt;/th&gt;&lt;th&gt;v1&lt;/th&gt;&lt;th&gt;v2&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Train accuracy&lt;/td&gt;&lt;td&gt;1.00&lt;/td&gt;&lt;td&gt;1.00&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Held-out accuracy&lt;/td&gt;&lt;td&gt;0.25&lt;/td&gt;&lt;td&gt;0.417&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Delta A vs baseline&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;+25pp&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Prompt baseline&lt;/td&gt;&lt;td&gt;—&lt;/td&gt;&lt;td&gt;0.833&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Latency&lt;/td&gt;&lt;td&gt;258ms&lt;/td&gt;&lt;td&gt;417ms&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;h2&gt;Finding #1 — Training Helped&lt;/h2&gt;

&lt;p&gt;The trained judge beat the untrained backbone.&lt;/p&gt;

&lt;p&gt;So the methodology worked.&lt;/p&gt;

&lt;h2&gt;Finding #2 — Prompting Won Anyway&lt;/h2&gt;

&lt;p&gt;A carefully designed rubric prompt on the same backbone scored:&lt;/p&gt;

&lt;p&gt;0.833 accuracy&lt;/p&gt;

&lt;p&gt;No training required.&lt;/p&gt;
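
&lt;p&gt;For a sense of what such a baseline looks like, here is an illustrative rubric prompt. The rules and wording of my actual prompt differ, so treat this as a sketch of the shape, not the prompt itself:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative rubric prompt for pairwise judging with the untrained backbone.
# The rules and wording are examples, not my actual prompt.
RUBRIC_PROMPT = """You are judging two B2B outreach emails, A and B.
Prefer the email that:
1. References at least one concrete prospect signal (funding, hiring, layoffs).
2. Avoids banned filler phrases ("touching base", "quick call").
3. Makes no commitments the sender cannot support.
Answer with exactly one letter: A or B.

Email A:
{email_a}

Email B:
{email_b}
"""

def judge_prefers_a(email_a, email_b, generate):
    """generate() is any text-completion callable over the same backbone."""
    answer = generate(RUBRIC_PROMPT.format(email_a=email_a, email_b=email_b))
    return answer.strip().upper().startswith("A")
&lt;/code&gt;&lt;/pre&gt;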

&lt;h2&gt;The Real Lesson&lt;/h2&gt;

&lt;p&gt;At this scale:&lt;/p&gt;

&lt;p&gt;B2B tone judgment is a prompt-following problem more than a preference-learning problem.&lt;/p&gt;

&lt;p&gt;The base model already understands tone.&lt;/p&gt;

&lt;p&gt;It just needs explicit rules.&lt;/p&gt;

&lt;p&gt;This is a legitimate negative result — and an important one.&lt;/p&gt;

&lt;h2&gt;About Delta C&lt;/h2&gt;

&lt;p&gt;I didn’t claim cross-benchmark improvement.&lt;/p&gt;

&lt;p&gt;The model wasn’t trained on retail tasks, so comparing against τ²-Bench retail would be misleading.&lt;/p&gt;

&lt;p&gt;Sometimes the honest result is:&lt;/p&gt;

&lt;p&gt;improvement is domain-specific.&lt;/p&gt;

&lt;h2&gt;Limitations (Important)&lt;/h2&gt;

&lt;p&gt;Only 12 held-out tasks currently contain preference pairs.&lt;/p&gt;

&lt;p&gt;That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;wide confidence intervals&lt;/li&gt;
&lt;li&gt;small-n uncertainty&lt;/li&gt;
&lt;/ul&gt;
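
&lt;p&gt;To put a number on that width: assuming 0.417 on 12 tasks means 5 of 12 correct (my inference from the reported accuracy, not a stated count), a 95% Wilson interval is roughly 0.19 to 0.68:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Wilson 95% interval for held-out accuracy, assuming 5 of 12 correct
# (that count is inferred from the reported 0.417, not stated in the post).
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

lo, hi = wilson_interval(5, 12)
print(f"{lo:.2f} to {hi:.2f}")  # roughly 0.19 to 0.68: very wide
&lt;/code&gt;&lt;/pre&gt;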

&lt;p&gt;This limitation is documented rather than hidden.&lt;/p&gt;

&lt;h2&gt;What’s Next&lt;/h2&gt;

&lt;h3&gt;Dataset v0.2&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;expand preference slice from 12 → 30 tasks&lt;/li&gt;
&lt;li&gt;clarify rubric ambiguity detected during calibration&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Model v0.2&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Qwen2.5-7B SimPO run&lt;/li&gt;
&lt;li&gt;same training recipe&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Future Ablation&lt;/h3&gt;

&lt;p&gt;Compare against a strong commercial model using only prompting.&lt;/p&gt;

&lt;h2&gt;The Big Engineering Lesson&lt;/h2&gt;

&lt;p&gt;The hardest decision wasn’t choosing the algorithm.&lt;/p&gt;

&lt;p&gt;It was not retraining when training metrics looked perfect.&lt;/p&gt;

&lt;p&gt;Clean training loss often means:&lt;/p&gt;

&lt;p&gt;👉 the model learned something easy, not something useful.&lt;/p&gt;

&lt;p&gt;Fixing the data cost $0.04.&lt;/p&gt;

&lt;p&gt;Blindly scaling compute would have cost days.&lt;/p&gt;

&lt;h2&gt;If Your Training Loss Looks Too Good…&lt;/h2&gt;

&lt;p&gt;It probably is.&lt;/p&gt;

&lt;p&gt;Check the data before blaming the model.&lt;/p&gt;

&lt;h2&gt;Acknowledgements&lt;/h2&gt;

&lt;p&gt;Work completed within the 10Academy TRP1 program using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TRL + SimPO&lt;/li&gt;
&lt;li&gt;Unsloth QLoRA training&lt;/li&gt;
&lt;li&gt;Google Colab T4&lt;/li&gt;
&lt;li&gt;OpenRouter multi-LLM routing&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code&gt;@dataset{tenacious_bench_v01_2026,
  title   = {Tenacious-Bench},
  author  = {Nebiyu, Eyoel},
  year    = 2026,
  version = {0.1},
  license = {CC-BY-4.0}
}&lt;/code&gt;&lt;/pre&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
