<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Eyoel Nebiyu</title>
    <description>The latest articles on DEV Community by Eyoel Nebiyu (@eyorata).</description>
    <link>https://dev.to/eyorata</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3909051%2F0914ec61-85f3-423b-97d3-0dc9931802d9.jpeg</url>
      <title>DEV Community: Eyoel Nebiyu</title>
      <link>https://dev.to/eyorata</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/eyorata"/>
    <language>en</language>
    <item>
      <title>Why "drift_score = 0.0" Is Not Yet Evidence of Semantic Stability, and What Your n=251 vs cap=200 Mismatch Actually Costs</title>
      <dc:creator>Eyoel Nebiyu</dc:creator>
      <pubDate>Fri, 08 May 2026 17:27:05 +0000</pubDate>
      <link>https://dev.to/eyorata/-why-driftscore-00-is-not-yet-evidence-of-semantic-stability-and-what-your-n251-vs-2k9p</link>
      <guid>https://dev.to/eyorata/-why-driftscore-00-is-not-yet-evidence-of-semantic-stability-and-what-your-n251-vs-2k9p</guid>
      <description>&lt;p&gt;&lt;strong&gt;Repo under interrogation:&lt;/strong&gt; &lt;a href="https://github.com/Heban-7/Data-Contract-Enforcer" rel="noopener noreferrer"&gt;Heban-7/Data-Contract-Enforcer&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Files in scope:&lt;/strong&gt; &lt;code&gt;report_final_pdf_ready.md&lt;/code&gt;, &lt;code&gt;contracts/ai_extensions.py&lt;/code&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  The question, anchored
&lt;/h2&gt;

&lt;p&gt;You have two questions stacked on top of each other in the same artifact:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;(a) effective sample size&lt;/strong&gt;: the report says &lt;code&gt;Sample size: 251&lt;/code&gt; but the implementation caps embeddings at &lt;code&gt;200&lt;/code&gt;. Which n is the statistic actually computed over, and what does the discrepancy cost you?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;(b) evidence chain&lt;/strong&gt;: when &lt;code&gt;drift_score = 0.0&lt;/code&gt; from a centroid-based cosine method, what additional evidence do you need before writing "Text content is semantically stable" in the report?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Both have a single answer-shape: &lt;strong&gt;a centroid is a first-moment summary, and a first-moment summary is silent on everything else&lt;/strong&gt; — sample size, dispersion, multi-modality, model identity, and fallback behavior. Each of those silences is a separate place where "0.0" can mean something other than "stable." This explainer narrows to that mechanism, gives you the corrections to make in each file, and ships a numpy script that demonstrates the failure mode in one screen.&lt;/p&gt;


&lt;h2&gt;
  
  
  What centroid-based cosine drift mechanically computes
&lt;/h2&gt;

&lt;p&gt;Given a baseline cohort &lt;code&gt;A&lt;/code&gt; of &lt;code&gt;n_A&lt;/code&gt; embeddings &lt;code&gt;e_1, ..., e_{n_A}&lt;/code&gt; in &lt;code&gt;R^d&lt;/code&gt; and a current cohort &lt;code&gt;B&lt;/code&gt; of &lt;code&gt;n_B&lt;/code&gt; embeddings, the drift statistic is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;c_A&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n_A&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;sum_i&lt;/span&gt; &lt;span class="n"&gt;e_i&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;
&lt;span class="n"&gt;c_B&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n_B&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;sum_j&lt;/span&gt; &lt;span class="n"&gt;e_j&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;
&lt;span class="n"&gt;drift_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;cos&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c_A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c_B&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is a &lt;strong&gt;single scalar derived from the means of two samples&lt;/strong&gt;. The variance of each centroid coordinate is &lt;code&gt;Var(e_k) / n&lt;/code&gt;, so the precision of &lt;code&gt;c_A&lt;/code&gt; and &lt;code&gt;c_B&lt;/code&gt; scales with &lt;code&gt;1/sqrt(n)&lt;/code&gt;. Two consequences fall straight out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The statistic depends on &lt;code&gt;n&lt;/code&gt; — but only through estimator variance, not through what is being measured.&lt;/li&gt;
&lt;li&gt;Every property of the cohort that is not the mean is invisible to it.&lt;/li&gt;
&lt;/ul&gt;
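&lt;p&gt;As a minimal sketch of the statistic above (not the code in &lt;code&gt;contracts/ai_extensions.py&lt;/code&gt;, which is not reproduced here), assume each cohort is an &lt;code&gt;(n, d)&lt;/code&gt; NumPy array. The function name and the synthetic cohorts are illustrative assumptions:&lt;/p&gt;

```python
import numpy as np

def centroid_cosine_drift(a, b):
    """Return 1 - cos(c_A, c_B) for two (n, d) embedding arrays."""
    c_a = a.mean(axis=0)
    c_b = b.mean(axis=0)
    denom = np.linalg.norm(c_a) * np.linalg.norm(c_b)
    if denom == 0.0:
        raise ValueError("a centroid is the zero vector; cosine is undefined")
    return 1.0 - float(c_a @ c_b / denom)

# Synthetic stand-ins for real embeddings: two draws around the same center.
rng = np.random.default_rng(0)
base = rng.normal(size=16)
cohort_a = base + 0.1 * rng.normal(size=(200, 16))
cohort_b = base + 0.1 * rng.normal(size=(200, 16))

drift = centroid_cosine_drift(cohort_a, cohort_b)
print(f"drift_score = {drift:.2e}")
```

&lt;p&gt;Only the two means enter the result; the 200 individual rows of each cohort influence it through nothing else.&lt;/p&gt;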




&lt;h2&gt;
  
  
  (a) Your n=251 vs cap=200 mismatch — what it actually costs
&lt;/h2&gt;

&lt;p&gt;If &lt;code&gt;contracts/ai_extensions.py&lt;/code&gt; enforces a 200-sample cap on the embedding loop but the report cites &lt;code&gt;Sample size: 251&lt;/code&gt;, the report is wrong about precision. The standard error of each centroid coordinate is &lt;code&gt;sigma_k / sqrt(n_eff)&lt;/code&gt;, with &lt;code&gt;n_eff = 200&lt;/code&gt;, not &lt;code&gt;251&lt;/code&gt;. That's a &lt;code&gt;sqrt(251/200) = 1.12x&lt;/code&gt; understatement of uncertainty — small, but it propagates into any downstream confidence interval and into the &lt;em&gt;threshold&lt;/em&gt; below which you treat the score as "no drift."&lt;/p&gt;

&lt;p&gt;The bigger cost is provenance, not precision. When a reader sees &lt;code&gt;Sample size: 251&lt;/code&gt; next to &lt;code&gt;drift_score: 0.0&lt;/code&gt;, the implicit promise is that 251 documents were embedded and contributed to the centroid. If 51 were silently dropped at the cap, that's a sampling decision (was it the first 200? a random 200? the 200 with the fewest tokens?) that determines whether the centroid is even drawn from the population the report claims. &lt;strong&gt;Right fix: rename the existing field to &lt;code&gt;n_reported&lt;/code&gt;, add &lt;code&gt;n_effective&lt;/code&gt; as a separate field in the metric output, and make the report cite the effective number with a one-sentence note on the cap policy.&lt;/strong&gt; This is the cheapest reproducibility win in the whole artifact.&lt;/p&gt;
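&lt;p&gt;One way to make the cap explicit is sketched below, assuming a seeded random subsample is an acceptable policy for your data. The helper &lt;code&gt;cap_cohort&lt;/code&gt; and the &lt;code&gt;cap_policy&lt;/code&gt; field are hypothetical names, not something taken from the repo:&lt;/p&gt;

```python
import math
import random

def cap_cohort(docs, cap, seed=0):
    """Seeded random subsample of at most `cap` documents, plus provenance."""
    n_input = len(docs)
    if n_input > cap:
        kept = random.Random(seed).sample(docs, cap)
        policy = f"random sample of {cap}, seed={seed}"
    else:
        kept = list(docs)
        policy = "no cap applied"
    provenance = {
        "n_reported_input": n_input,
        "n_effective": len(kept),
        "cap_policy": policy,  # hypothetical field name
    }
    return kept, provenance

docs = [f"doc-{i}" for i in range(251)]
kept, provenance = cap_cohort(docs, cap=200)

# Ratio of standard errors when n = 200 is used but n = 251 is reported.
se_ratio = math.sqrt(provenance["n_reported_input"] / provenance["n_effective"])
print(provenance, f"se_ratio = {se_ratio:.2f}")
```

&lt;p&gt;The report can then quote &lt;code&gt;n_effective&lt;/code&gt; and the policy string verbatim instead of the raw input count.&lt;/p&gt;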




&lt;h2&gt;
  
  
  (b) Why drift_score ≈ 0 is not yet "semantic stability"
&lt;/h2&gt;

&lt;p&gt;Run the demo at &lt;a href="//scripts/centroid_drift_demo.py"&gt;&lt;code&gt;day_4/scripts/centroid_drift_demo.py&lt;/code&gt;&lt;/a&gt;. It builds two cohorts of 200 vectors in R^128 with &lt;strong&gt;identical sample means by construction&lt;/strong&gt; but a 10x ratio in dispersion, and prints:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  centroid_cosine_drift(A, B)            = 9.78e-13    &amp;lt;- the contract's drift_score
  within_cohort_dispersion(A) (mean L2)  = 2.2591
  within_cohort_dispersion(B) (mean L2)  = 22.5743
  dispersion ratio B/A                   = 9.99x
  permutation p-value on centroid drift  = 1.0000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The drift score is machine zero. A contract that maps &lt;code&gt;drift_score ~ 0 → "semantically stable"&lt;/code&gt; will say so here. But cohort B is a cloud ten times wider than cohort A, which is clearly a different distribution. Even a permutation test that uses the &lt;strong&gt;same statistic&lt;/strong&gt; can't see it: &lt;code&gt;p = 1.0&lt;/code&gt; because the observed drift already sits at its floor of zero, so every shuffle produces a drift at least as large as the observed one. Any test built on that statistic is blind by construction.&lt;/p&gt;
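&lt;p&gt;If you don't have the demo script to hand, the effect can be reconstructed in a few lines. This is a rough sketch, not the script itself: the helper names are made up, the seed is arbitrary, and the exact digits will differ from the output above. It assumes the same setup (200 vectors in R^128, spreads 0.2 and 2.0, sample means forced equal by centering):&lt;/p&gt;

```python
import numpy as np

def centered_cloud(rng, center, spread, n):
    """n points around `center` whose sample mean equals `center` (up to float error)."""
    z = rng.normal(size=(n, center.size))
    return center + spread * (z - z.mean(axis=0))

def cosine_drift(a, b):
    """1 - cosine similarity between the two cohort centroids."""
    c_a, c_b = a.mean(axis=0), b.mean(axis=0)
    return 1.0 - float(c_a @ c_b / (np.linalg.norm(c_a) * np.linalg.norm(c_b)))

def dispersion(x):
    """Mean L2 distance from each point to the cohort centroid."""
    return float(np.linalg.norm(x - x.mean(axis=0), axis=1).mean())

rng = np.random.default_rng(0)
center = rng.normal(size=128)
cohort_a = centered_cloud(rng, center, 0.2, 200)
cohort_b = centered_cloud(rng, center, 2.0, 200)

print(f"drift            = {cosine_drift(cohort_a, cohort_b):.2e}")
print(f"dispersion ratio = {dispersion(cohort_b) / dispersion(cohort_a):.2f}x")
```

&lt;p&gt;The drift comes out at floating-point noise while the dispersion ratio lands near 10x, which is the whole point: the second number is the one the contract never sees.&lt;/p&gt;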

&lt;p&gt;This is the &lt;strong&gt;first-moment trap&lt;/strong&gt;. There are at least four mechanisms that produce a small &lt;code&gt;drift_score&lt;/code&gt;, only the first of which is semantic stability, and your contract should distinguish all four:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Genuine stability&lt;/strong&gt; (what you want it to mean): the population didn't change.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dispersion shift only&lt;/strong&gt;: same mean, wider/narrower spread. The demo above. Common when content gets more diverse or more templated over time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multimodal redistribution&lt;/strong&gt;: the cohort splits into clusters whose centroids cancel into the same overall mean. A bimodal &lt;code&gt;{p, -p}&lt;/code&gt; cohort and a unimodal cohort at &lt;code&gt;0&lt;/code&gt; have the same centroid.&lt;/li&gt;
&lt;li&gt;
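&lt;strong&gt;Provenance failure&lt;/strong&gt;: the embedding model returned a fallback (zero vector, last good cache, default vector) for some samples. A non-zero fallback that contributes the same constant to both cohorts pulls both centroids toward it, so centroid distance shrinks toward zero artificially. A zero-vector fallback only rescales each centroid, which cosine ignores, so the failures leave no trace in the score at all.&lt;/li&gt;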
&lt;/ol&gt;
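&lt;p&gt;Mechanism 4 is easy to check numerically. The sketch below is an illustration under stated assumptions, not a description of what your pipeline does: the two cohorts are synthetic with orthogonal centers, half of each cohort is assumed to have failed, and the fallback vectors (&lt;code&gt;default_vec&lt;/code&gt;, &lt;code&gt;zero_vec&lt;/code&gt;) are hypothetical:&lt;/p&gt;

```python
import numpy as np

def cosine_drift(a, b):
    """1 - cosine similarity between the two cohort centroids."""
    c_a, c_b = a.mean(axis=0), b.mean(axis=0)
    return 1.0 - float(c_a @ c_b / (np.linalg.norm(c_a) * np.linalg.norm(c_b)))

def with_fallback(x, fallback, n_failed):
    """Copy of x whose first n_failed rows are replaced by the fallback vector."""
    out = x.copy()
    out[:n_failed] = fallback
    return out

rng = np.random.default_rng(0)
# Two cohorts with orthogonal centers, i.e. a real and large centroid shift.
center_a = np.concatenate([np.ones(16), np.zeros(16)])
center_b = np.concatenate([np.zeros(16), np.ones(16)])
cohort_a = center_a + 0.1 * rng.normal(size=(200, 32))
cohort_b = center_b + 0.1 * rng.normal(size=(200, 32))

default_vec = np.ones(32)   # hypothetical non-zero "default" embedding
zero_vec = np.zeros(32)     # hypothetical zero-vector fallback

clean = cosine_drift(cohort_a, cohort_b)
masked = cosine_drift(with_fallback(cohort_a, default_vec, 100),
                      with_fallback(cohort_b, default_vec, 100))
zeroed = cosine_drift(with_fallback(cohort_a, zero_vec, 100),
                      with_fallback(cohort_b, zero_vec, 100))
print(f"clean = {clean:.3f}  default-vector fallback = {masked:.3f}  zero fallback = {zeroed:.3f}")
```

&lt;p&gt;With these inputs the shared default vector drags the score from roughly 1.0 down to roughly 0.2, while the zero-vector fallback leaves it essentially unchanged even though half of each cohort never embedded. Neither case is visible without a logged fallback rate.&lt;/p&gt;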

&lt;p&gt;(4) is the one to interrogate first in your repo. If &lt;code&gt;contracts/ai_extensions.py&lt;/code&gt; has a try/except that catches embedding-API errors and returns a default vector (or skips the sample silently), then &lt;code&gt;drift_score = 0.0&lt;/code&gt; could mean "all 51 over-cap samples failed and got dropped, plus the 200 that did embed are fine." That is a &lt;em&gt;very&lt;/em&gt; different sentence from "Text content is semantically stable."&lt;/p&gt;




&lt;h2&gt;
  
  
  The evidence chain a "semantically stable" claim needs
&lt;/h2&gt;

&lt;p&gt;Before writing that sentence in &lt;code&gt;report_final_pdf_ready.md&lt;/code&gt;, the chain should be:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Embedding model identity pinned&lt;/strong&gt; — the model id and version of the embedding endpoint at baseline-time and at current-time. If the model changed between the two cohorts, the score is comparing apples to apples in a different orchard, and &lt;code&gt;drift = 0&lt;/code&gt; is meaningless. Log it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Effective sample size logged&lt;/strong&gt; (the (a) fix above) — &lt;code&gt;n_effective&lt;/code&gt; not &lt;code&gt;n_reported&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fallback path documented&lt;/strong&gt;: what happens when an embedding call fails? Is the failed sample silently dropped (in which case &lt;code&gt;n_effective&lt;/code&gt; must shrink accordingly), or is a fallback vector substituted? If a fallback is substituted, what fraction of the cohort is the fallback vector?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;At least one 2nd-moment statistic&lt;/strong&gt; — within-cohort mean pairwise distance, or the trace of the covariance, or even just &lt;code&gt;np.std(embeddings, axis=0).mean()&lt;/code&gt;. One number per cohort. The demo's &lt;code&gt;within_cohort_dispersion&lt;/code&gt; is a starter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A distribution-level statistic&lt;/strong&gt; — Maximum Mean Discrepancy (MMD, Gretton et al. 2012) or energy distance (Székely &amp;amp; Rizzo 2013) is the standard upgrade. They're 5–10 lines of numpy on top of what you already compute.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A semantic spot-check&lt;/strong&gt; — &lt;code&gt;k=20&lt;/code&gt; randomly-sampled documents from each cohort, run through an LLM-judge or a human, scored for topic/intent equivalence. The word "semantic" in the claim is doing real work, and only humans or a language model can supply that signal. Centroid distance never can.&lt;/li&gt;
&lt;/ol&gt;
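&lt;p&gt;For step 5, here is a rough sketch of energy distance (the Székely and Rizzo statistic) in plain NumPy. It uses the simple all-pairs estimator with no significance test, and the cohorts are the same kind of synthetic equal-mean clouds used in the demo, so treat the function names and numbers as illustrative only:&lt;/p&gt;

```python
import numpy as np

def mean_pairwise_distance(x, y):
    """Average Euclidean distance over all pairs (x_i, y_j)."""
    sq = (x * x).sum(axis=1)[:, None] + (y * y).sum(axis=1)[None, :] - 2.0 * (x @ y.T)
    return float(np.sqrt(np.maximum(sq, 0.0)).mean())

def energy_distance(x, y):
    """2 E|X-Y| - E|X-X'| - E|Y-Y'|, estimated over all sample pairs."""
    return (2.0 * mean_pairwise_distance(x, y)
            - mean_pairwise_distance(x, x)
            - mean_pairwise_distance(y, y))

def centered_cloud(rng, center, spread, n):
    """n points around `center` whose sample mean equals `center`."""
    z = rng.normal(size=(n, center.size))
    return center + spread * (z - z.mean(axis=0))

rng = np.random.default_rng(0)
center = rng.normal(size=128)
baseline = centered_cloud(rng, center, 0.2, 200)
same_pop = centered_cloud(rng, center, 0.2, 200)
wider = centered_cloud(rng, center, 2.0, 200)

e_same = energy_distance(baseline, same_pop)
e_wide = energy_distance(baseline, wider)
print(f"energy distance, same spread : {e_same:.3f}")
print(f"energy distance, 10x spread  : {e_wide:.3f}")
```

&lt;p&gt;All three clouds share one centroid, so centroid-cosine drift is zero for both comparisons. The energy distance stays near zero only for the same-spread pair and is large for the 10x pair. In a real contract you would pair it with a permutation test to get a threshold rather than eyeballing the value.&lt;/p&gt;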

&lt;p&gt;Only after (1)–(6) is "Text content is semantically stable" a defensible sentence. Until then, the honest claim is the strictly weaker one: &lt;strong&gt;"The first-moment summary of the embedding distribution is unchanged within the noise floor of an n=200 estimator using model M, with fallback rate F."&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What to actually change in your two files
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;contracts/ai_extensions.py&lt;/code&gt;&lt;/strong&gt; should emit three statistics plus provenance fields, not a lone scalar:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;drift_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                  &lt;span class="c1"&gt;# 1 - cos(c_A, c_B)
&lt;/span&gt;  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dispersion_ratio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.02&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;            &lt;span class="c1"&gt;# within-cohort 2nd moment ratio
&lt;/span&gt;  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mmd_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.014&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                  &lt;span class="c1"&gt;# MMD with RBF kernel between cohorts
&lt;/span&gt;  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;n_effective&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;n_reported_input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;251&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embedding_model_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-3-small&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fallback_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;code&gt;report_final_pdf_ready.md&lt;/code&gt;&lt;/strong&gt; — rewrite the drift-results paragraph from:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"drift_score: 0.0 — Text content is semantically stable."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;to:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Centroid-cosine drift = 0.0 over n_effective = 200 (capped from 251 input documents) using &lt;code&gt;text-embedding-3-small&lt;/code&gt; with a 0% fallback rate. Within-cohort dispersion ratio = 1.02; MMD between cohorts = 0.014. Consistent with no shift in the first-moment summary or the second-moment dispersion of the embedding distribution. A direct semantic-equivalence test on a &lt;code&gt;k=20&lt;/code&gt; spot-check sample is queued and not yet reported here."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That paragraph is defensible to a senior reviewer; the original one is not.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I deliberately skipped
&lt;/h2&gt;

&lt;p&gt;Bootstrap CI on the drift score itself; multi-batch streaming drift detectors (KS-windows, ADWIN, Page-Hinkley); contrastive-embedding identifiability; cross-model centroid alignment via Procrustes. Each is its own explainer. The mechanism above — &lt;em&gt;centroid is a first-moment summary, every other property of the cohort is silent&lt;/em&gt; — is what binds your two specific questions (n_eff and evidence chain) to one underlying cause.&lt;/p&gt;




&lt;h2&gt;
  
  
  Pointers
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gretton, Borgwardt, Rasch, Schölkopf, Smola&lt;/strong&gt; — &lt;em&gt;A Kernel Two-Sample Test&lt;/em&gt;, JMLR 2012. The canonical Maximum Mean Discrepancy paper. §3 has the estimator; §6 has the test. Drop-in upgrade for any centroid-only comparison.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Székely, Rizzo&lt;/strong&gt; — &lt;em&gt;Energy statistics: A class of statistics based on distances&lt;/em&gt;, Journal of Statistical Planning and Inference, 2013. Energy distance is MMD's distribution-free cousin; either works.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rabanser, Günnemann, Lipton&lt;/strong&gt; — &lt;em&gt;Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift&lt;/em&gt;, NeurIPS 2019. Empirical comparison of drift detectors. §4 explicitly shows that mean-only statistics miss dispersion shifts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anscombe, F.&lt;/strong&gt; — &lt;em&gt;Graphs in Statistical Analysis&lt;/em&gt;, The American Statistician, 1973. The original demonstration that summary statistics agree across very different distributions. Centroid-only drift is the modern direct descendant of the problem Anscombe was warning about.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Tool used hands-on: the &lt;code&gt;centroid_drift_demo.py&lt;/code&gt; script in this folder. Numpy-only, no embedding-model dependency, runs in ~3 seconds. Modify the &lt;code&gt;0.2&lt;/code&gt; and &lt;code&gt;2.0&lt;/code&gt; spread constants to re-explore: set them equal to confirm that &lt;code&gt;drift_score ~ 0&lt;/code&gt; also appears when both cohorts genuinely come from the same population, which is why the score alone cannot tell the two cases apart.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>nlp</category>
      <category>python</category>
    </item>
    <item>
      <title>What LoRA Actually Adapts and Why Higher Rank Doesn't Always Buy What It Looks Like It Should</title>
      <dc:creator>Eyoel Nebiyu</dc:creator>
      <pubDate>Thu, 07 May 2026 17:31:45 +0000</pubDate>
      <link>https://dev.to/eyorata/-what-lora-actually-adapts-and-why-higher-rank-doesnt-always-buy-what-it-looks-like-it-should-4bfp</link>
      <guid>https://dev.to/eyorata/-what-lora-actually-adapts-and-why-higher-rank-doesnt-always-buy-what-it-looks-like-it-should-4bfp</guid>
      <description>&lt;h2&gt;
  
  
  The question, anchored
&lt;/h2&gt;

&lt;p&gt;You noticed two things in your Week 10 Conversion Engine fine-tunes that look paradoxical: tiny LoRA adapters often shifted model behavior dramatically, while raising LoRA rank sometimes barely helped and sometimes destabilized outputs. Both observations have a single mechanism behind them — the &lt;strong&gt;intrinsic-low-rank hypothesis&lt;/strong&gt; of fine-tuning. This explainer narrows hard to that mechanism, derives why low rank suffices, and shows you with a runnable script what actually changes when you raise rank.&lt;/p&gt;




&lt;h2&gt;
  
  
  What LoRA mechanically adapts
&lt;/h2&gt;

&lt;p&gt;A transformer layer has weight matrices in the attention block (Q, K, V, O projections) and the MLP block (gate, up, down). For a hidden dimension &lt;code&gt;d&lt;/code&gt;, the attention projections are roughly &lt;code&gt;d × d&lt;/code&gt;; the MLP matrices are wider, but the same construction applies, so the square case is used below. Full fine-tuning lets every entry of every matrix update; LoRA &lt;em&gt;freezes&lt;/em&gt; them and adds a parallel learnable correction on a chosen subset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;forward pass at one layer:
  h = W_frozen · x  +  (α / r) · B · A · x
                            ↑          ↑
                       trainable   trainable
                        d × r       r × d
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;B&lt;/code&gt; is initialized to &lt;strong&gt;zero&lt;/strong&gt;. &lt;code&gt;A&lt;/code&gt; is initialized to a small random Gaussian. The combination &lt;code&gt;(α/r) · B · A&lt;/code&gt; is the &lt;em&gt;update&lt;/em&gt; — at training start it equals zero, so the net forward pass is identical to the frozen base model. As training proceeds, only &lt;code&gt;B&lt;/code&gt; and &lt;code&gt;A&lt;/code&gt; get gradients; the base weights never change.&lt;/p&gt;

&lt;p&gt;Two consequences fall out:&lt;/p&gt;

&lt;ul&gt;
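&lt;li&gt;The full-rank weight matrix &lt;code&gt;W_frozen&lt;/code&gt; is never altered. &lt;strong&gt;The pretrained weights stay intact and are recovered exactly by removing the adapter&lt;/strong&gt;, although the adapted model's behavior can still move away from the base model's.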
&lt;/li&gt;
&lt;li&gt;The expressive update lives in a rank-&lt;code&gt;r&lt;/code&gt; subspace. &lt;strong&gt;No update outside that subspace is reachable, no matter how long you train.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The second point is the crux of your question: what does choosing &lt;code&gt;r&lt;/code&gt; do to the set of reachable updates?&lt;/p&gt;
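&lt;p&gt;Both consequences can be seen in a toy NumPy version of the forward pass. This is a sketch under assumptions: the sizes, the &lt;code&gt;alpha&lt;/code&gt; value, and the random matrix standing in for a trained &lt;code&gt;B&lt;/code&gt; are all made up for illustration, and no actual training happens here:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 4, 8.0

w_frozen = rng.normal(size=(d, d))      # pretrained weight, never updated
a = 0.01 * rng.normal(size=(r, d))      # A: small random Gaussian init
b = np.zeros((d, r))                    # B: zero init, so B @ A = 0 at step 0

def lora_forward(x, w, b, a, alpha, r):
    """h = W x + (alpha / r) * B (A x), written for a batch of row vectors x."""
    return x @ w.T + (alpha / r) * (x @ a.T) @ b.T

x = rng.normal(size=(5, d))
h_start = lora_forward(x, w_frozen, b, a, alpha, r)

# Random values stand in for a trained B; only the shape matters for the rank.
b_trained = rng.normal(size=(d, r))
delta_w = (alpha / r) * b_trained @ a

print("identical to base at init:", np.allclose(h_start, x @ w_frozen.T))
print("rank of the update:", np.linalg.matrix_rank(delta_w))
```

&lt;p&gt;Whatever values &lt;code&gt;B&lt;/code&gt; and &lt;code&gt;A&lt;/code&gt; take, the product is a &lt;code&gt;d × r&lt;/code&gt; matrix times an &lt;code&gt;r × d&lt;/code&gt; matrix, so its rank can never exceed &lt;code&gt;r&lt;/code&gt;.&lt;/p&gt;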




&lt;h2&gt;
  
  
  Why low rank works at all — the intrinsic-rank hypothesis
&lt;/h2&gt;

&lt;p&gt;Hu et al. (LoRA, ICLR 2022) didn't argue low rank works because they wanted small models. They argued it works because the &lt;em&gt;update needed to adapt a pretrained model to a downstream task lies on a low-dimensional subspace of weight space&lt;/em&gt;. This claim is empirically grounded by Aghajanyan et al. (ACL 2021), who showed that a pretrained RoBERTa model can be fine-tuned through a randomly projected ~200-dimensional update and still reach about 90% of full fine-tuning performance on MRPC, one of the GLUE tasks. The full weight space has hundreds of millions to billions of dimensions; the &lt;em&gt;task-specific&lt;/em&gt; subspace has hundreds.&lt;/p&gt;

&lt;p&gt;The intuition: the pretrained model already encodes general syntactic, lexical, and semantic structure. Adapting it to a downstream classification or instruction-following task does not require &lt;em&gt;rewriting&lt;/em&gt; that structure — it requires &lt;em&gt;nudging&lt;/em&gt; a small number of directions in weight space that route the existing knowledge differently for the new objective.&lt;/p&gt;

&lt;p&gt;LoRA at rank &lt;code&gt;r&lt;/code&gt; exploits this by allocating exactly &lt;code&gt;r&lt;/code&gt; learnable directions per matrix. If the task's &lt;em&gt;intrinsic rank&lt;/em&gt; is &lt;code&gt;k&lt;/code&gt;, then any &lt;code&gt;r ≥ k&lt;/code&gt; will fit; any &lt;code&gt;r &amp;lt; k&lt;/code&gt; will not. &lt;strong&gt;&lt;code&gt;r&lt;/code&gt; is a cap on expressive capacity, not a smooth quality knob.&lt;/strong&gt;&lt;/p&gt;
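&lt;p&gt;The cap can be made concrete without any training. The sketch below swaps gradient descent for a truncated SVD, which by the Eckart-Young theorem gives the smallest error any rank-&lt;code&gt;r&lt;/code&gt; matrix can achieve, so it is a floor for what a rank-&lt;code&gt;r&lt;/code&gt; adapter could reach on this target. The synthetic target and its singular values (4, 3, 2, 1) are assumptions chosen for easy arithmetic:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
d, true_rank = 64, 4

# Synthetic target update with exactly `true_rank` non-zero singular values.
u, _ = np.linalg.qr(rng.normal(size=(d, true_rank)))
v, _ = np.linalg.qr(rng.normal(size=(d, true_rank)))
target = u @ np.diag([4.0, 3.0, 2.0, 1.0]) @ v.T

def best_rank_r_error(m, r):
    """Relative Frobenius error of the best rank-r approximation (truncated SVD)."""
    s = np.linalg.svd(m, compute_uv=False)
    return float(np.sqrt((s[r:] ** 2).sum() / (s ** 2).sum()))

errors = {r: best_rank_r_error(target, r) for r in (2, 4, 16)}
for r, err in errors.items():
    print(f"r = {r:2d}  best achievable rel err = {err:.4f}")
```

&lt;p&gt;At &lt;code&gt;r = 2&lt;/code&gt; the floor is sqrt((2² + 1²) / (4² + 3² + 2² + 1²)) = sqrt(5/30), about 0.41, and no amount of training gets under it. At &lt;code&gt;r = 4&lt;/code&gt; and &lt;code&gt;r = 16&lt;/code&gt; the floor is zero, so the extra twelve directions buy nothing on this target.&lt;/p&gt;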




&lt;h2&gt;
  
  
  Reproduce it
&lt;/h2&gt;

&lt;p&gt;Run the demo at &lt;a href="//scripts/lora_rank_demo.py"&gt;&lt;code&gt;day_3/scripts/lora_rank_demo.py&lt;/code&gt;&lt;/a&gt;. It builds a synthetic 64×64 "task-specific update" of intrinsic rank 4 (plus a tiny noise floor — real fine-tuning targets are not &lt;em&gt;exactly&lt;/em&gt; low-rank), fits LoRA at r = 2, 4, and 16 by gradient descent, and prints the SVD spectrum of the trained &lt;code&gt;B @ A&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  allocated r  |  final rel err  | top-8 singular values of trained B @ A
  --------------------------------------------------------------------------
            2  |         0.5758  | 37.828  33.113  0.000  0.000  0.000  0.000  0.000  0.000
            4  |         0.0097  | 37.828  33.113  25.829  24.209  0.000  0.000  0.000  0.000
           16  |         0.0069  | 37.828  33.113  25.829  24.209  0.145  0.138  0.134  0.129
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three readings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;r = 2&lt;/strong&gt;: under-parameterized. Target's intrinsic rank is 4; rank-2 cannot reach it. Error stays high.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;r = 4&lt;/strong&gt;: matches intrinsic rank exactly. Four large singular values, tight fit (rel-err 0.01).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;r = 16&lt;/strong&gt;: over-parameterized. Still fits, but only the &lt;em&gt;first four&lt;/em&gt; singular values are large (37.8, 33.1, 25.8, 24.2); the next four collapse to &lt;strong&gt;0.14&lt;/strong&gt; — two orders of magnitude smaller. The optimizer found the four useful directions and drove the other twelve to noise-floor magnitude.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is what your "higher rank only slightly improved performance" observation looks like under the hood. Once &lt;code&gt;r&lt;/code&gt; exceeds the task's intrinsic rank, you are not gaining usable directions — you are allocating parameters that the optimizer drives toward zero, and they only contribute as gradient noise that can destabilize training on small data.&lt;/p&gt;

&lt;h2&gt;
  
  
  The three framings of "what rank controls" — your specific options
&lt;/h2&gt;

&lt;p&gt;All three of your options are formally true, but only one is &lt;em&gt;binding&lt;/em&gt; in practice:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;"Higher rank increases expressive capacity"&lt;/strong&gt;: true, but only up to the task's intrinsic rank. Hu et al. §7.2 (Table 6) shows r = 4 and r = 64 reach similar quality on the GPT-3 adaptation benchmarks reported there.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Allows adaptation across more directions"&lt;/strong&gt; — same answer reframed. r = 64 &lt;em&gt;can&lt;/em&gt; express updates in 64 directions; the optimizer typically does not &lt;em&gt;find&lt;/em&gt; useful gradient signal in all 64.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Reduces the compression constraint"&lt;/strong&gt; — true, but the constraint is rarely binding above the intrinsic rank.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Framing 1 is the binding one. Your observation — "higher rank sometimes barely improves" — is the expected mechanism, not a tuning failure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two adjacent concepts
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;lora_alpha&lt;/code&gt; acts like a learning-rate multiplier for the adapter.&lt;/strong&gt; The forward pass scales the update by &lt;code&gt;α / r&lt;/code&gt;. Raise &lt;code&gt;r&lt;/code&gt; without raising &lt;code&gt;α&lt;/code&gt; and per-direction scaling drops; raise &lt;code&gt;r&lt;/code&gt; while holding &lt;code&gt;α / r&lt;/code&gt; and the optimizer LR fixed and total update magnitude can grow, because more directions are being updated at once. Most "higher rank destabilized training" reports trace here, not to rank capacity. Rule of thumb: keep &lt;code&gt;α/r&lt;/code&gt; constant when you change &lt;code&gt;r&lt;/code&gt; (for example by setting &lt;code&gt;α = r&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Effective rank ≠ allocated rank: audit it post-hoc.&lt;/strong&gt; Run SVD on the trained &lt;code&gt;B @ A&lt;/code&gt; (one line of NumPy). Singular values typically concentrate on the first few directions; the rest decay sharply. If you trained at r = 32 and SVD shows 5 large values + 27 tiny, you can most likely retrain at r = 8 with little or no loss. This audit is one of the cheapest empirical checks you can run on a trained adapter.&lt;/p&gt;
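&lt;p&gt;A possible shape for that audit is sketched below. The 5% relative threshold is an arbitrary assumption you should tune, &lt;code&gt;effective_rank&lt;/code&gt; is a hypothetical helper, and the &lt;code&gt;B&lt;/code&gt; and &lt;code&gt;A&lt;/code&gt; matrices are constructed by hand (four strong directions, twelve weak ones) to stand in for a trained adapter:&lt;/p&gt;

```python
import numpy as np

def effective_rank(b, a, rel_tol=0.05):
    """Count singular values of B @ A that are at least rel_tol times the largest."""
    s = np.linalg.svd(b @ a, compute_uv=False)
    return int((s >= rel_tol * s[0]).sum()), s

rng = np.random.default_rng(0)
d, r = 64, 16

# Stand-in for a trained r = 16 adapter: 4 strong directions, 12 near the noise floor.
scales = np.array([6.0, 5.5, 5.0, 4.5] + [0.1] * 12)
q_b, _ = np.linalg.qr(rng.normal(size=(d, r)))   # orthonormal columns
q_a, _ = np.linalg.qr(rng.normal(size=(d, r)))
b = q_b * scales          # scale each of the r columns of B
a = q_a.T                 # r x d

k, spectrum = effective_rank(b, a)
print("allocated rank:", r, "| effective rank:", k)
print("top-8 singular values:", np.round(spectrum[:8], 3))
```

&lt;p&gt;On a real adapter you would load the saved &lt;code&gt;B&lt;/code&gt; and &lt;code&gt;A&lt;/code&gt; for each target module, include the &lt;code&gt;α / r&lt;/code&gt; factor if you want absolute magnitudes, and compare the effective rank across layers before deciding what to retrain at.&lt;/p&gt;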

&lt;h2&gt;
  
  
  What I deliberately skipped
&lt;/h2&gt;

&lt;p&gt;QLoRA's 4-bit base-weight quantization layer; per-layer rank choice (different &lt;code&gt;r&lt;/code&gt; for attention vs MLP); the &lt;code&gt;target_modules&lt;/code&gt; selection question (which projections to LoRA at all); structured-update variants (DoRA, VeRA, LoRA-XS). Each is its own explainer. The mechanism above is what binds your specific observation — small adapter sufficient + larger rank only marginally helpful — to one underlying cause.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pointers
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hu, Shen, Wallis, Allen-Zhu, Li, Wang, Wang, Chen&lt;/strong&gt;, &lt;em&gt;LoRA: Low-Rank Adaptation of Large Language Models&lt;/em&gt;, ICLR 2022. arXiv: 2106.09685. §4 introduces the parallel-update form; §7.2 (Table 6) has the rank-vs-quality results on GPT-3 that show the "r = 4 is enough" pattern.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aghajanyan, Zettlemoyer, Gupta&lt;/strong&gt; — &lt;em&gt;Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning&lt;/em&gt;, ACL 2021. arXiv: 2012.13255. The empirical foundation of "task-specific updates live on a low-dim manifold" — Table 1 shows GLUE recovery via random projections at d_int as low as 200.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Tool used hands-on: the &lt;code&gt;lora_rank_demo.py&lt;/code&gt; script in this folder. Numpy-only, no PyTorch dependency, runs in ~5 seconds. Modify &lt;code&gt;TRUE_RANK&lt;/code&gt; in the script to test your own intuition — try setting it to 8 and watch r = 4 fail, r = 8 succeed, r = 16 over-allocate.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>deeplearning</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Scaffolding-Driven vs Model-Driven Planning: Where Agent Systems Actually Break</title>
      <dc:creator>Eyoel Nebiyu</dc:creator>
      <pubDate>Wed, 06 May 2026 17:07:10 +0000</pubDate>
      <link>https://dev.to/eyorata/-scaffolding-driven-vs-model-driven-planning-where-agent-systems-actually-breakby-eyoel-nebiyu-50h1</link>
      <guid>https://dev.to/eyorata/-scaffolding-driven-vs-model-driven-planning-where-agent-systems-actually-breakby-eyoel-nebiyu-50h1</guid>
      <description>&lt;p&gt;Most teams building agent systems focus on improving prompts or improving workflow logic. In production, many costly failures come from something else: the boundary between model interpretation and deterministic execution.&lt;/p&gt;

&lt;p&gt;This post explains how to assign planning ownership between scaffolding and model reasoning, why ambiguity handling fails at handoff points, and how to design a safer boundary that still preserves adaptability.&lt;/p&gt;

&lt;h2&gt;
  
  
  The core architecture problem
&lt;/h2&gt;

&lt;p&gt;Hybrid agent systems combine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deterministic scaffolding&lt;/strong&gt;: states, routers, policy gates, retries, execution order.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model judgment&lt;/strong&gt;: semantic interpretation under ambiguous user language.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Neither layer is enough alone. Scaffolding is stable but brittle under messy language. Model reasoning is flexible but probabilistic. The failure surface appears when we treat probabilistic interpretation as execution-ready truth.&lt;/p&gt;

&lt;p&gt;In Gemechis's real setup, two ambiguity patterns repeatedly trigger this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Mixed intent in one turn&lt;/strong&gt;: prospects both accept a meeting direction and ask for clarification in the same message.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Underspecified acceptance&lt;/strong&gt;: prospects indicate acceptance but do not provide enough schedule details (day/time/timezone) to execute safely.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  A practical decision-ownership model
&lt;/h2&gt;

&lt;p&gt;Use three decision classes.&lt;/p&gt;

&lt;h3&gt;
  
  
  1) Deterministic-owned decisions (Class D)
&lt;/h3&gt;

&lt;p&gt;These should stay in scaffolding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;policy and compliance constraints,&lt;/li&gt;
&lt;li&gt;side-effect eligibility checks,&lt;/li&gt;
&lt;li&gt;idempotency/retry policy,&lt;/li&gt;
&lt;li&gt;action sequencing and commit control.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are binary and auditable. If conditions fail, execution does not happen.&lt;/p&gt;

&lt;h3&gt;
  
  
  2) Model-owned decisions (Class M)
&lt;/h3&gt;

&lt;p&gt;These should stay model-mediated:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;intent parsing from messy language,&lt;/li&gt;
&lt;li&gt;ambiguity detection,&lt;/li&gt;
&lt;li&gt;extraction of candidate entities/slots,&lt;/li&gt;
&lt;li&gt;clarification suggestion generation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are probabilistic and should carry uncertainty.&lt;/p&gt;

&lt;h3&gt;
  
  
  3) Hybrid arbitration decisions (Class H)
&lt;/h3&gt;

&lt;p&gt;These require both layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;proceed vs clarify,&lt;/li&gt;
&lt;li&gt;branch selection when multiple intents coexist,&lt;/li&gt;
&lt;li&gt;mapping interpreted intent to an executable action.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A strong operating rule is simple: &lt;strong&gt;model proposes, scaffolding ratifies&lt;/strong&gt; before side effects.&lt;/p&gt;
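
&lt;p&gt;The "model proposes, scaffolding ratifies" rule can be sketched as a small deterministic gate. This is a minimal illustration under stated assumptions: &lt;code&gt;Proposal&lt;/code&gt;, &lt;code&gt;ratify&lt;/code&gt;, &lt;code&gt;SIDE_EFFECT_ACTIONS&lt;/code&gt;, and the 0.8 confidence floor are hypothetical names and values, not taken from any real system.&lt;/p&gt;

```python
from dataclasses import dataclass

# Hypothetical names and values: Proposal, ratify, SIDE_EFFECT_ACTIONS and the
# 0.8 confidence floor are illustrative assumptions, not part of a real system.
SIDE_EFFECT_ACTIONS = {"schedule_meeting"}
CONFIDENCE_FLOOR = 0.8


@dataclass(frozen=True)
class Proposal:
    """Class M output: the model's reading of the turn, with uncertainty."""
    action: str
    confidence: float


def ratify(proposal, opted_out=False):
    """Class D gate: decide (outcome, reason) before any side effect runs."""
    if opted_out:
        return "block", "policy: prospect opted out"
    if proposal.action not in SIDE_EFFECT_ACTIONS:
        return "proceed", "no side effect involved"
    if CONFIDENCE_FLOOR > proposal.confidence:
        return "clarify", "confidence below floor"
    return "proceed", "all gates passed"


print(ratify(Proposal("schedule_meeting", 0.62)))
```

&lt;p&gt;The model never calls the side effect itself; it only hands over a proposal, and the gate returns an auditable outcome with a reason.&lt;/p&gt;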

&lt;h2&gt;
  
  
  Ambiguity pattern 1: mixed intent in one message
&lt;/h2&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Thanks, that works. Can you also clarify whether onboarding support is included?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is not one intent. It is acceptance plus clarification.&lt;/p&gt;

&lt;h3&gt;
  
  
  Common failure
&lt;/h3&gt;

&lt;p&gt;A single-label router forces this into either &lt;code&gt;accept_meeting&lt;/code&gt; or &lt;code&gt;ask_question&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If it picks clarification only, booking momentum is lost.&lt;/li&gt;
&lt;li&gt;If it picks acceptance only, user concern is ignored.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Better design
&lt;/h3&gt;

&lt;p&gt;Represent multi-intent explicitly at the interface, then execute a composite plan:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;acknowledge and answer clarification,&lt;/li&gt;
&lt;li&gt;keep scheduling flow alive,&lt;/li&gt;
&lt;li&gt;request any missing scheduling constraints.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your transition model cannot represent dual-intent turns, this failure is structural, not incidental.&lt;/p&gt;
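
&lt;p&gt;One way to represent a dual-intent turn at the interface is to pass a list of intents and build an ordered plan from it. The sketch below is hypothetical: &lt;code&gt;Intent&lt;/code&gt;, &lt;code&gt;plan_turn&lt;/code&gt;, and the step names are assumptions for illustration; only the &lt;code&gt;accept_meeting&lt;/code&gt; and &lt;code&gt;ask_question&lt;/code&gt; labels come from the text above.&lt;/p&gt;

```python
from dataclasses import dataclass


# Hypothetical schema: Intent, plan_turn and the step names are assumptions.
@dataclass(frozen=True)
class Intent:
    label: str
    confidence: float


def plan_turn(intents):
    """Build an ordered composite plan instead of picking a single branch."""
    labels = {intent.label for intent in intents}
    plan = []
    if "ask_question" in labels:
        plan.append("answer_clarification")
    if "accept_meeting" in labels:
        plan.append("continue_scheduling")
        plan.append("request_missing_constraints")
    # Nothing recognizable: fall back to asking the user what they meant.
    return plan or ["ask_for_clarification"]


turn = [Intent("accept_meeting", 0.9), Intent("ask_question", 0.85)]
print(plan_turn(turn))
```

&lt;p&gt;Because both intents survive the handoff, the router can answer the question without dropping the scheduling flow.&lt;/p&gt;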

&lt;h2&gt;
  
  
  Ambiguity pattern 2: underspecified acceptance
&lt;/h2&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Yes, let's do it next week."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The intent is positive, but execution fields are incomplete.&lt;/p&gt;

&lt;h3&gt;
  
  
  Common failure
&lt;/h3&gt;

&lt;p&gt;The system maps positive sentiment directly to &lt;code&gt;schedule_meeting&lt;/code&gt; and advances to the commit state.&lt;/p&gt;

&lt;p&gt;This causes either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;hard downstream failures, or&lt;/li&gt;
&lt;li&gt;silent wrong assumptions (bad day/time).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Better design
&lt;/h3&gt;

&lt;p&gt;Create an explicit intermediate state (for example &lt;code&gt;accepted_but_incomplete&lt;/code&gt;) and require deterministic completeness checks before execution.&lt;/p&gt;

&lt;p&gt;Acceptance and execution-readiness are different decisions and must remain separate.&lt;/p&gt;
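
&lt;p&gt;A minimal sketch of that separation, assuming the required fields are the day, time, and timezone mentioned earlier. The &lt;code&gt;State&lt;/code&gt; enum, &lt;code&gt;next_state&lt;/code&gt;, and the &lt;code&gt;ready_to_commit&lt;/code&gt; state name are hypothetical.&lt;/p&gt;

```python
from enum import Enum


class State(Enum):
    NEEDS_CLARIFICATION = "needs_clarification"
    ACCEPTED_BUT_INCOMPLETE = "accepted_but_incomplete"
    READY_TO_COMMIT = "ready_to_commit"  # hypothetical name for the commit state


# Assumed completeness rule: all three scheduling fields must be present.
REQUIRED_FIELDS = ("day", "time", "timezone")


def next_state(accepted, fields):
    """Deterministic check: acceptance alone never reaches the commit state."""
    if not accepted:
        return State.NEEDS_CLARIFICATION
    missing = [name for name in REQUIRED_FIELDS if not fields.get(name)]
    if missing:
        return State.ACCEPTED_BUT_INCOMPLETE
    return State.READY_TO_COMMIT


# "Yes, let's do it next week." is accepted, but no day, time, or timezone.
print(next_state(True, {}))
```

&lt;p&gt;The model decides whether the prospect accepted; the scaffolding decides whether that acceptance is executable.&lt;/p&gt;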

&lt;h2&gt;
  
  
  How correctness is lost: one failure path
&lt;/h2&gt;

&lt;p&gt;Input:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Sounds good. Could you clarify pricing tiers? Also maybe Thursday afternoon works."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Interpretation&lt;/strong&gt;: model extracts partial acceptance, clarification intent, and fuzzy time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Routing&lt;/strong&gt;: brittle router collapses to one branch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State update&lt;/strong&gt;: system records only one intent path.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution&lt;/strong&gt;: wrong downstream behavior (premature scheduling or missed conversion).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The loss happens at boundary compression: multiple uncertain signals are reduced to one deterministic action prematurely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why brittleness clusters at handoff points
&lt;/h2&gt;

&lt;p&gt;Three causes repeat across products:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Premature commitment&lt;/strong&gt;: plausible interpretation treated as final intent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Uncertainty loss&lt;/strong&gt;: alternatives/confidence dropped by interface schema.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Syntactic progression over semantic correctness&lt;/strong&gt;: workflow advances because fields exist, not because meaning is resolved.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is the gap between "allowed by workflow" and "correct for user intent."&lt;/p&gt;

&lt;h2&gt;
  
  
  A failure-attribution framework
&lt;/h2&gt;

&lt;p&gt;When reviewing incidents in hybrid agents, separate causes into three linked buckets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scaffolding failures&lt;/strong&gt;: rigid one-intent router, missing clarification states, permissive commit transitions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model failures&lt;/strong&gt;: semantic misreads, overconfidence on vague phrasing, weak modality handling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interface failures&lt;/strong&gt;: lossy model-output schema, no confidence-to-policy mapping, early single-action collapse.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most serious incidents are mixed-cause. Treating them as "just prompt quality" usually misses the fix.&lt;/p&gt;

&lt;h2&gt;
  
  
  Portable architecture rules for FDE teams
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Never let acceptance alone trigger side effects.&lt;/li&gt;
&lt;li&gt;Treat multi-intent turns as first-class state.&lt;/li&gt;
&lt;li&gt;Preserve uncertainty across the model-to-router boundary.&lt;/li&gt;
&lt;li&gt;Add explicit intermediate states (&lt;code&gt;needs_clarification&lt;/code&gt;, &lt;code&gt;accepted_but_incomplete&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Use stricter gates for high-risk writes than for read-only responses.&lt;/li&gt;
&lt;li&gt;Log boundary artifacts (candidate intents, confidence, chosen branch, gate outcome).&lt;/li&gt;
&lt;/ol&gt;
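
&lt;p&gt;Rule 6 is the cheapest to adopt. A sketch of one boundary log record, using only the standard library; the function name and field names are assumptions that mirror the artifacts listed in the rule.&lt;/p&gt;

```python
import json


def boundary_record(candidates, chosen_branch, gate_outcome):
    """Serialize one model-to-router handoff as a single JSON log line."""
    return json.dumps(
        {
            "candidate_intents": [
                {"label": label, "confidence": confidence}
                for label, confidence in candidates
            ],
            "chosen_branch": chosen_branch,
            "gate_outcome": gate_outcome,
        },
        sort_keys=True,
    )


line = boundary_record(
    [("accept_meeting", 0.9), ("ask_question", 0.85)],
    chosen_branch="composite",
    gate_outcome="clarify",
)
print(line)
```

&lt;p&gt;Keeping every candidate intent in the record, not only the chosen one, is what makes later failure attribution possible.&lt;/p&gt;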

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The healthiest hybrid systems are asymmetric by risk:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;model-driven upstream interpretation,&lt;/li&gt;
&lt;li&gt;deterministic downstream commitment.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your team can specify, for each planning decision, &lt;strong&gt;who owns it, what uncertainty is acceptable, and what gate must pass before action&lt;/strong&gt;, you will remove most brittle failures at the scaffolding-model boundary.&lt;/p&gt;

&lt;h2&gt;
  
  
  Research references
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;ReAct: Synergizing Reasoning and Acting in Language Models (Yao et al., 2023)&lt;br&gt;&lt;br&gt;
&lt;a href="https://arxiv.org/abs/2210.03629" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2210.03629&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation (Wu et al., 2023)&lt;br&gt;&lt;br&gt;
&lt;a href="https://arxiv.org/abs/2308.08155" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2308.08155&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;OpenAI Model Spec (2024)&lt;br&gt;&lt;br&gt;
&lt;a href="https://model-spec.openai.com/" rel="noopener noreferrer"&gt;https://model-spec.openai.com/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
    </item>
    <item>
      <title>Why $0.0029 and $0.0047 Can Both Be Right: Prefix Caching for API-Served LLM Judges</title>
      <dc:creator>Eyoel Nebiyu</dc:creator>
      <pubDate>Tue, 05 May 2026 14:39:39 +0000</pubDate>
      <link>https://dev.to/eyorata/-why-00029-and-00047-can-both-be-right-prefix-caching-for-api-served-llm-judges-by-3bd6</link>
      <guid>https://dev.to/eyorata/-why-00029-and-00047-can-both-be-right-prefix-caching-for-api-served-llm-judges-by-3bd6</guid>
      <description>&lt;h2&gt;
  
  
  The question I was asked
&lt;/h2&gt;

&lt;p&gt;Abdulaziz asked a practical evaluation question: in the same benchmark, why do two judge configurations produce different per-task costs (&lt;code&gt;$0.0029&lt;/code&gt; vs &lt;code&gt;$0.0047&lt;/code&gt;) while latency looks nearly flat? He needed a mechanism-level explanation he could defend in a model card or memo, not just "the numbers differ."&lt;/p&gt;

&lt;p&gt;The short answer is: &lt;strong&gt;prefix-cache state&lt;/strong&gt;, not model capability, is the load-bearing mechanism.&lt;/p&gt;

&lt;h2&gt;
  
  
  The mechanism in plain language
&lt;/h2&gt;

&lt;p&gt;In hosted LLM APIs, repeated calls often share a large fixed prefix (system prompt, rubric, instructions). Providers can cache that prefix so they do not recompute it from scratch each time.&lt;/p&gt;

&lt;p&gt;That creates different billing states:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cache write&lt;/strong&gt;: first call with a new prefix (more expensive than a read).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache read&lt;/strong&gt;: repeated call with the same prefix (discounted prompt-side cost).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Completion tokens&lt;/strong&gt;: generated output (usually unchanged by prefix cache state).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So two eval runs can use the same model and same task set, but if one run has a stable prefix and the other has prompt drift, their effective cost per call diverges.&lt;/p&gt;

&lt;p&gt;This is exactly the kind of hidden mechanism that creates cost deltas without obvious quality deltas.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this maps to &lt;code&gt;$0.0029&lt;/code&gt; vs &lt;code&gt;$0.0047&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Use a simple stylized setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;System prompt: 1500 tokens (large enough to be cache-relevant)&lt;/li&gt;
&lt;li&gt;Per task user tokens: 200&lt;/li&gt;
&lt;li&gt;Per task completion tokens: 100&lt;/li&gt;
&lt;li&gt;Prompt and completion rates fixed by provider rate card&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the prefix is stable across 12 calls:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Call 1 pays cache-write behavior&lt;/li&gt;
&lt;li&gt;Calls 2–12 mostly pay cache-read behavior&lt;/li&gt;
&lt;li&gt;Mean cost drops toward the lower number (in the &lt;code&gt;$0.003&lt;/code&gt; range)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If prefix stability is partial (some misses from prompt variation, template drift, or multiple judge variants), average cost rises into a middle regime (around &lt;code&gt;$0.0047&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;If every call is effectively a miss, cost trends higher still.&lt;/p&gt;

&lt;p&gt;So the two observed numbers are consistent with &lt;strong&gt;different cache-hit ratios&lt;/strong&gt; across configurations, not contradictory accounting.&lt;/p&gt;
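
&lt;p&gt;The arithmetic behind that claim can be sketched with the stylized token counts above. Everything about pricing here is an assumption for illustration: the per-token rates are placeholders, the 1.25x write and 0.10x read multipliers should be checked against the provider's rate card, and every miss is treated as a full cache write.&lt;/p&gt;

```python
# Token counts from the stylized setup above.
PREFIX_TOKENS = 1500
USER_TOKENS = 200
COMPLETION_TOKENS = 100
CALLS = 12

# Assumed placeholder rates (dollars per token) and cache multipliers.
INPUT_RATE = 3.00 / 1_000_000
OUTPUT_RATE = 15.00 / 1_000_000
WRITE_MULTIPLIER = 1.25  # assumed surcharge when the prefix is written
READ_MULTIPLIER = 0.10   # assumed discount when the prefix is read


def call_cost(cache_hit):
    """Cost of one judge call; only the prefix price depends on cache state."""
    multiplier = READ_MULTIPLIER if cache_hit else WRITE_MULTIPLIER
    prefix = PREFIX_TOKENS * INPUT_RATE * multiplier
    user = USER_TOKENS * INPUT_RATE
    completion = COMPLETION_TOKENS * OUTPUT_RATE
    return prefix + user + completion


def mean_cost(hits, calls=CALLS):
    """Average per-call cost when `hits` of `calls` read the cached prefix."""
    misses = calls - hits
    return (hits * call_cost(True) + misses * call_cost(False)) / calls


for hits in (11, 7, 0):
    print(f"{hits:2d} hits of {CALLS}: ${mean_cost(hits):.4f} per call")
```

&lt;p&gt;Under these assumed rates, 11 hits out of 12 gives about $0.0030 per call and 7 hits out of 12 gives about $0.0047, so a change in hit ratio alone is enough to move the mean between the two regimes.&lt;/p&gt;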

&lt;h2&gt;
  
  
  Why latency can still look flat
&lt;/h2&gt;

&lt;p&gt;A common expectation is: if caching reduces prompt-side compute, latency should clearly drop. Sometimes it does. But in many API eval paths, end-to-end latency also includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;decode-time variance,&lt;/li&gt;
&lt;li&gt;network RTT,&lt;/li&gt;
&lt;li&gt;provider-side batching/scheduling effects.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When those dominate p50/p95 behavior, cache gains on prompt processing may not appear as dramatic wall-clock separation, especially on small samples.&lt;/p&gt;

&lt;p&gt;So it is defensible to say:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cost changed due to cache-state differences,&lt;/li&gt;
&lt;li&gt;latency looked near-flat because wall-clock latency is multi-component and noisy.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is not hand-waving if you clearly separate observed metrics from inferred internals.&lt;/p&gt;

&lt;h2&gt;
  
  
  The distinction Abdulaziz needed
&lt;/h2&gt;

&lt;p&gt;The most important conceptual cleanup was this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;KV cache (intra-call):&lt;/strong&gt; reuse inside a single generation pass.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prefix cache (inter-call):&lt;/strong&gt; reuse across separate API calls.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;People often mix them together as "cache." For cost explanation in API-served evals, inter-call prefix reuse is usually the key driver.&lt;/p&gt;

&lt;p&gt;That naming clarity matters in peer review because otherwise the explanation sounds technically correct but operationally unfalsifiable.&lt;/p&gt;

&lt;h2&gt;
  
  
  What should be claimed (and what should not)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Defensible claim
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;The per-task cost gap is primarily explained by different prefix-cache hit behavior under the same judge workload; some latency metrics can remain near-flat because decode/network components dominate observed wall-clock variance.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Overclaim to avoid
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;We directly measured internal provider KV hit-rate events from the API logs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In most hosted setups, that internal event stream is not directly exposed. The safer framing is "inferred from token/rate behavior" unless explicit cache telemetry is available.&lt;/p&gt;

&lt;h2&gt;
  
  
  Minimal reproducible demonstration
&lt;/h2&gt;

&lt;p&gt;I also provided a runnable arithmetic demo (&lt;code&gt;day_1/scripts/cache_cost_demo.py&lt;/code&gt;) to make this explanation testable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;stable-prefix regime reproduces lower mean cost,&lt;/li&gt;
&lt;li&gt;partial-hit regime lands near the middle value,&lt;/li&gt;
&lt;li&gt;miss-heavy regime reproduces upper-cost behavior.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This matters because a mechanism explanation is stronger when another engineer can run it and see the same pattern.&lt;/p&gt;

&lt;h2&gt;
  
  
  What changed after this explainer
&lt;/h2&gt;

&lt;p&gt;Before this, Abdulaziz could report the cost numbers but not defend the mechanism. After the explainer, he could:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;name prefix caching as the load-bearing cause,&lt;/li&gt;
&lt;li&gt;distinguish measured facts from inferred internals,&lt;/li&gt;
&lt;li&gt;and write a cleaner, more defensible cost interpretation in downstream artifacts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the real objective of a Week 12 explainer: not just technical correctness, but better grounded communication in portfolio-quality artifacts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Anthropic Prompt Caching Documentation: &lt;a href="https://docs.claude.com/en/docs/build-with-claude/prompt-caching" rel="noopener noreferrer"&gt;https://docs.claude.com/en/docs/build-with-claude/prompt-caching&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Kwon et al. (2023), &lt;em&gt;Efficient Memory Management for Large Language Model Serving with PagedAttention&lt;/em&gt; (SOSP): &lt;a href="https://arxiv.org/abs/2309.06180" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2309.06180&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These two sources are load-bearing: one defines the production API contract, the other grounds the serving mechanism.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>llm</category>
      <category>performance</category>
    </item>
    <item>
      <title>When Your Training Loss Is Lying to You: Building a Tenacious-Specific Sales Outreach Benchmark</title>
      <dc:creator>Eyoel Nebiyu</dc:creator>
      <pubDate>Sat, 02 May 2026 12:51:59 +0000</pubDate>
      <link>https://dev.to/eyorata/when-your-training-loss-is-lying-to-you-building-a-tenacious-specific-sales-outreach-benchmark-2jgd</link>
      <guid>https://dev.to/eyorata/when-your-training-loss-is-lying-to-you-building-a-tenacious-specific-sales-outreach-benchmark-2jgd</guid>
      <description>&lt;p&gt;This post documents a real negative result: my trained model worked… but a well-written prompt worked better.&lt;/p&gt;

&lt;p&gt;TL;DR&lt;/p&gt;

&lt;p&gt;I built a 266-task evaluation benchmark for B2B sales-outreach agents — something existing benchmarks don’t measure well.&lt;/p&gt;

&lt;p&gt;Then I trained a small preference-learning judge model using SimPO.&lt;/p&gt;

&lt;p&gt;What happened surprised me:&lt;/p&gt;

&lt;p&gt;Training accuracy → 100%&lt;br&gt;
Held-out accuracy → 25%&lt;/p&gt;

&lt;p&gt;Classic overfitting.&lt;/p&gt;

&lt;p&gt;But the real lesson wasn’t about the model.&lt;/p&gt;

&lt;p&gt;It was about the data.&lt;/p&gt;

&lt;p&gt;After fixing dataset construction:&lt;/p&gt;

&lt;p&gt;Held-out accuracy improved to 0.417 (Delta A +25pp)&lt;br&gt;
A carefully prompted untrained model scored 0.833&lt;/p&gt;

&lt;p&gt;👉 Conclusion:&lt;br&gt;
At this scale, judging B2B sales tone is mostly a prompt-following problem, not a preference-learning problem.&lt;/p&gt;

&lt;p&gt;Project Links&lt;br&gt;
Dataset: &lt;a href="https://huggingface.co/datasets/eyorata/tenacious_bench_v0.1" rel="noopener noreferrer"&gt;https://huggingface.co/datasets/eyorata/tenacious_bench_v0.1&lt;/a&gt;&lt;br&gt;
Judge Model: &lt;a href="https://huggingface.co/eyorata/tenacious-judge-simpo-qwen25-3b" rel="noopener noreferrer"&gt;https://huggingface.co/eyorata/tenacious-judge-simpo-qwen25-3b&lt;/a&gt;&lt;br&gt;
Code: &lt;a href="https://github.com/eyorata/sales_evaluation_bench" rel="noopener noreferrer"&gt;https://github.com/eyorata/sales_evaluation_bench&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Total experiment cost: $0.041&lt;/p&gt;

&lt;p&gt;The Problem: Existing Benchmarks Miss Real Sales Failures&lt;/p&gt;

&lt;p&gt;Benchmarks like τ²-Bench retail, MT-Bench, or AlpacaEval are excellent at evaluating:&lt;/p&gt;

&lt;p&gt;tool use&lt;br&gt;
reasoning&lt;br&gt;
conversation flow&lt;/p&gt;

&lt;p&gt;But they don’t measure what actually kills B2B deals.&lt;/p&gt;

&lt;p&gt;The agent I wanted to evaluate had to:&lt;/p&gt;

&lt;p&gt;interpret hiring signals (funding, layoffs, leadership changes)&lt;br&gt;
segment prospects correctly&lt;br&gt;
write grounded outreach emails&lt;br&gt;
avoid over-promising capacity&lt;br&gt;
respect opt-outs and booking rules&lt;/p&gt;

&lt;p&gt;Retail benchmarks simply don’t test these behaviors.&lt;/p&gt;

&lt;p&gt;Example real failures from earlier experiments:&lt;/p&gt;

&lt;p&gt;Auto-booking meetings when prospects only said “let me check my calendar.”&lt;br&gt;
Re-engaging after opt-out, risking brand damage.&lt;/p&gt;

&lt;p&gt;Those failures cost real money — but no public benchmark grades them.&lt;/p&gt;

&lt;p&gt;So I built one.&lt;/p&gt;

&lt;p&gt;Designing the Benchmark&lt;/p&gt;

&lt;p&gt;The rule I set early:&lt;/p&gt;

&lt;p&gt;Every rubric must be machine-gradable.&lt;/p&gt;

&lt;p&gt;No vague scoring like “sounds professional.”&lt;/p&gt;

&lt;p&gt;Instead, tasks check things like:&lt;/p&gt;

&lt;p&gt;banned phrases absent&lt;br&gt;
at least one signal referenced&lt;br&gt;
no unsupported commitments&lt;br&gt;
tone markers satisfied&lt;br&gt;
correct action class detected&lt;/p&gt;

&lt;p&gt;Each task returns a numeric score between 0 and 1.&lt;/p&gt;
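
&lt;p&gt;As a rough illustration of what "machine-gradable" means here, the sketch below scores two of the checks listed above. The banned phrases, the equal weighting, and the function name are assumptions, not the benchmark's actual rubric.&lt;/p&gt;

```python
# Hypothetical banned phrases; the real benchmark's list is not shown here.
BANNED_PHRASES = ("just checking in", "circle back")


def score_email(email, signals, banned=BANNED_PHRASES):
    """Return the fraction of checks passed, a number between 0 and 1."""
    text = email.lower()
    checks = [
        # Check 1: no banned phrase appears anywhere in the email.
        not any(phrase in text for phrase in banned),
        # Check 2: at least one grounding signal is referenced.
        any(signal.lower() in text for signal in signals),
    ]
    return sum(checks) / len(checks)


print(score_email("You closed your $14M Series A in February.", ["Series A"]))
```

&lt;p&gt;Each check is a boolean computed from the text alone, which is what lets the evaluation run without a human in the loop.&lt;/p&gt;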

&lt;p&gt;No humans needed during evaluation.&lt;/p&gt;

&lt;p&gt;The Dataset&lt;/p&gt;

&lt;p&gt;266 tasks across five generation modes:&lt;/p&gt;

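&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Why it exists&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Programmatic generation&lt;/td&gt;
&lt;td&gt;deterministic coverage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Trace-derived tasks&lt;/td&gt;
&lt;td&gt;grounded realism&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-LLM synthesis&lt;/td&gt;
&lt;td&gt;harder edge cases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hand-authored adversarial&lt;/td&gt;
&lt;td&gt;stress testing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Style-guide gold pairs&lt;/td&gt;
&lt;td&gt;real preference ground truth&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;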

&lt;p&gt;Partitions:&lt;/p&gt;

&lt;p&gt;Train: 50%&lt;br&gt;
Dev: 30%&lt;br&gt;
Held-out: 20%&lt;/p&gt;

&lt;p&gt;Preventing Data Leakage&lt;/p&gt;

&lt;p&gt;I enforced three contamination checks:&lt;/p&gt;

&lt;p&gt;No shared 8-grams between train and held-out tasks&lt;br&gt;
Embedding similarity threshold&lt;br&gt;
Time-window filtering for public signals&lt;/p&gt;

&lt;p&gt;Result: 0 contamination violations.&lt;/p&gt;
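
&lt;p&gt;The first of those checks, the shared 8-gram test, can be sketched in a few lines. This is an illustration rather than the project's actual script: whitespace tokenization, lowercasing, and the function names are assumptions.&lt;/p&gt;

```python
def ngrams(text, n=8):
    """Set of word n-grams, using naive lowercase whitespace tokenization."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated(train_texts, heldout_texts, n=8):
    """Return the held-out texts that share at least one n-gram with train."""
    train_grams = set()
    for text in train_texts:
        train_grams.update(ngrams(text, n))
    return [t for t in heldout_texts if ngrams(t, n).intersection(train_grams)]


train = ["we noticed you closed a series a round in february this year"]
heldout = [
    "hello, we noticed you closed a series a round in february",
    "congratulations on the new vp of engineering hire",
]
print(contaminated(train, heldout))
```

&lt;p&gt;Texts shorter than eight tokens produce no n-grams and therefore can never be flagged, which is one reason to pair this check with the embedding-similarity threshold.&lt;/p&gt;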

&lt;p&gt;Why I Chose Preference Training (Path B)&lt;/p&gt;

&lt;p&gt;Week 10 analysis showed the model could already write fluent emails.&lt;/p&gt;

&lt;p&gt;The real problem was:&lt;/p&gt;

&lt;p&gt;👉 it couldn’t judge its own output.&lt;/p&gt;

&lt;p&gt;So instead of improving generation, I trained a judge model using SimPO.&lt;/p&gt;

&lt;p&gt;Setup:&lt;/p&gt;

&lt;p&gt;Algorithm: SimPO (reference-free preference learning)&lt;br&gt;
Trainer: TRL CPOTrainer&lt;br&gt;
Backbone: Qwen2.5-3B&lt;br&gt;
LoRA fine-tuning&lt;br&gt;
Hardware: free Colab T4&lt;/p&gt;

&lt;p&gt;The First Run: Perfect Training, Terrible Reality&lt;/p&gt;

&lt;p&gt;Training looked amazing:&lt;/p&gt;

&lt;p&gt;loss dropped smoothly&lt;br&gt;
train accuracy hit 1.00&lt;br&gt;
reward margins increased&lt;/p&gt;

&lt;p&gt;But evaluation stayed stuck:&lt;/p&gt;

&lt;p&gt;Train accuracy: 1.00&lt;br&gt;
Held-out accuracy: 0.25&lt;/p&gt;

&lt;p&gt;This is the moment many ML projects go wrong.&lt;/p&gt;

&lt;p&gt;The instinct is:&lt;/p&gt;

&lt;p&gt;bigger model&lt;br&gt;
more steps&lt;br&gt;
different hyperparameters&lt;/p&gt;

&lt;p&gt;I almost did that.&lt;/p&gt;

&lt;p&gt;Instead, I read the data.&lt;/p&gt;

&lt;p&gt;The Real Problem Was the Dataset&lt;/p&gt;

&lt;p&gt;Training examples used templated synthetic emails:&lt;/p&gt;

&lt;p&gt;“Thank you for your interest…”&lt;/p&gt;

&lt;p&gt;Held-out examples were real style-guide drafts:&lt;/p&gt;

&lt;p&gt;“You closed your $14M Series A in February…”&lt;/p&gt;

&lt;p&gt;The model learned a useless shortcut:&lt;/p&gt;

&lt;p&gt;👉 prefer one template phrase over another.&lt;/p&gt;

&lt;p&gt;It wasn’t learning tone — it was learning templates.&lt;/p&gt;

&lt;p&gt;The Fix&lt;/p&gt;

&lt;p&gt;I didn’t retrain immediately.&lt;/p&gt;

&lt;p&gt;I fixed the data.&lt;/p&gt;

&lt;p&gt;Using a stronger model, I rewrote all training “chosen” examples into authentic Tenacious voice, enforcing:&lt;/p&gt;

&lt;p&gt;five tone markers&lt;br&gt;
banned phrase rules&lt;br&gt;
grounded signals&lt;br&gt;
evaluator score ≥ 0.7&lt;/p&gt;

&lt;p&gt;Cost: $0.04&lt;/p&gt;

&lt;p&gt;Same algorithm. Same setup.&lt;/p&gt;

&lt;p&gt;Only the data changed.&lt;/p&gt;

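&lt;p&gt;The Honest Results&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;v1&lt;/th&gt;
&lt;th&gt;v2&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Train accuracy&lt;/td&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Held-out accuracy&lt;/td&gt;
&lt;td&gt;0.25&lt;/td&gt;
&lt;td&gt;0.417&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Delta A vs baseline&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;+25pp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompt baseline&lt;/td&gt;
&lt;td colspan="2"&gt;0.833&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency&lt;/td&gt;
&lt;td&gt;258 ms&lt;/td&gt;
&lt;td&gt;417 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Finding #1: Training Helped&lt;/p&gt;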

&lt;p&gt;The trained judge beat the untrained backbone.&lt;/p&gt;

&lt;p&gt;So the methodology worked.&lt;/p&gt;

&lt;p&gt;Finding #2: Prompting Won Anyway&lt;/p&gt;

&lt;p&gt;A carefully designed rubric prompt on the same backbone scored:&lt;/p&gt;

&lt;p&gt;0.833 accuracy&lt;/p&gt;

&lt;p&gt;No training required.&lt;/p&gt;

&lt;p&gt;The Real Lesson&lt;/p&gt;

&lt;p&gt;At this scale:&lt;/p&gt;

&lt;p&gt;B2B tone judgment is a prompt-following problem more than a preference-learning problem.&lt;/p&gt;

&lt;p&gt;The base model already understands tone.&lt;/p&gt;

&lt;p&gt;It just needs explicit rules.&lt;/p&gt;

&lt;p&gt;This is a legitimate negative result — and an important one.&lt;/p&gt;

&lt;p&gt;About Delta C&lt;/p&gt;

&lt;p&gt;I didn’t claim cross-benchmark improvement.&lt;/p&gt;

&lt;p&gt;The model wasn’t trained on retail tasks, so comparing against τ²-Bench retail would be misleading.&lt;/p&gt;

&lt;p&gt;Sometimes the honest result is:&lt;/p&gt;

&lt;p&gt;improvement is domain-specific.&lt;/p&gt;

&lt;p&gt;Limitations (Important)&lt;/p&gt;

&lt;p&gt;Only 12 held-out tasks currently contain preference pairs.&lt;/p&gt;

&lt;p&gt;That means:&lt;/p&gt;

&lt;p&gt;wide confidence intervals&lt;br&gt;
small-n uncertainty&lt;/p&gt;

&lt;p&gt;This limitation is documented rather than hidden.&lt;/p&gt;
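
&lt;p&gt;To make "wide confidence intervals" concrete, here is a sketch of a 95% Wilson score interval at n = 12. The success counts are an assumption: 0.833 and 0.417 are read as 10 of 12 and 5 of 12 correct.&lt;/p&gt;

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Assumed counts: 10/12 for the prompt baseline, 5/12 for the trained judge.
for name, correct in (("prompt baseline", 10), ("trained judge v2", 5)):
    low, high = wilson_interval(correct, 12)
    print(f"{name}: {correct}/12, 95% interval {low:.2f} to {high:.2f}")
```

&lt;p&gt;With these counts the intervals come out at roughly 0.55 to 0.95 and 0.19 to 0.68. They overlap, so the gap between prompting and training is suggestive at this sample size rather than settled.&lt;/p&gt;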

&lt;p&gt;What’s Next&lt;/p&gt;

&lt;p&gt;Dataset v0.2:&lt;br&gt;
expand preference slice from 12 → 30 tasks&lt;br&gt;
clarify rubric ambiguity detected during calibration&lt;/p&gt;

&lt;p&gt;Model v0.2:&lt;br&gt;
Qwen2.5-7B SimPO run&lt;br&gt;
same training recipe&lt;/p&gt;

&lt;p&gt;Future Ablation&lt;/p&gt;

&lt;p&gt;Compare against a strong commercial model using only prompting.&lt;/p&gt;

&lt;p&gt;The Big Engineering Lesson&lt;/p&gt;

&lt;p&gt;The hardest decision wasn’t choosing the algorithm.&lt;/p&gt;

&lt;p&gt;It was not retraining when training metrics looked perfect.&lt;/p&gt;

&lt;p&gt;Clean training loss often means:&lt;/p&gt;

&lt;p&gt;👉 the model learned something easy, not something useful.&lt;/p&gt;

&lt;p&gt;Fixing the data cost $0.04.&lt;/p&gt;

&lt;p&gt;Blindly scaling compute would have cost days.&lt;/p&gt;

&lt;p&gt;If Your Training Loss Looks Too Good…&lt;/p&gt;

&lt;p&gt;It probably is.&lt;/p&gt;

&lt;p&gt;Check the data before blaming the model.&lt;/p&gt;

&lt;p&gt;Acknowledgements&lt;/p&gt;

&lt;p&gt;Work completed within the 10Academy TRP1 program using:&lt;/p&gt;

&lt;p&gt;TRL + SimPO&lt;br&gt;
Unsloth QLoRA training&lt;br&gt;
Google Colab T4&lt;br&gt;
OpenRouter multi-LLM routing&lt;/p&gt;

&lt;p&gt;@dataset{tenacious_bench_v01_2026,&lt;br&gt;
  title  = {Tenacious-Bench},&lt;br&gt;
  author = {Nebiyu, Eyoel},&lt;br&gt;
  year   = 2026,&lt;br&gt;
  version = {0.1},&lt;br&gt;
  license = {CC-BY-4.0}&lt;br&gt;
}&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
