<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Marcus Chen</title>
    <description>The latest articles on DEV Community by Marcus Chen (@marcuswwchen).</description>
    <link>https://dev.to/marcuswwchen</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3859428%2F572085fe-831d-498b-854b-41102c7902ee.jpg</url>
      <title>DEV Community: Marcus Chen</title>
      <link>https://dev.to/marcuswwchen</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/marcuswwchen"/>
    <language>en</language>
    <item>
      <title>We shipped a model on a 2-point eval win. It was noise.</title>
      <dc:creator>Marcus Chen</dc:creator>
      <pubDate>Tue, 02 Jun 2026 06:33:10 +0000</pubDate>
      <link>https://dev.to/marcuswwchen/we-shipped-a-model-on-a-2-point-eval-win-it-was-noise-3ml6</link>
      <guid>https://dev.to/marcuswwchen/we-shipped-a-model-on-a-2-point-eval-win-it-was-noise-3ml6</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: We promoted a fine-tuned 7B because it beat the incumbent by 2.1 points on our internal eval. Two weeks later we added bootstrap confidence intervals to the harness and found the gain sat well inside the noise band. The model was not better. We just had no way to tell.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The win that wasn't
&lt;/h2&gt;

&lt;p&gt;Our eval suite at Nexus Labs is 840 prompts. Enterprise agent tasks. Each one is scored pass/fail by an exact-match check against a known-good structured output, so every result is a 1 or a 0.&lt;/p&gt;

&lt;p&gt;The fine-tuned candidate scored 73.4%. The incumbent scored 71.3%. A 2.1-point lift on a suite that size felt real, so we shipped it to staging and started the rollout paperwork.&lt;/p&gt;

&lt;p&gt;It was not real. Or rather, we had zero evidence either way, which is worse, because we acted like we did.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why a single number lies
&lt;/h2&gt;

&lt;p&gt;An eval run is a sample, not a measurement. Run the same 840 prompts against the same model at any temperature above 0 and you get a different number each time. Even at temperature 0, batching order and kernel nondeterminism in vLLM move it.&lt;/p&gt;

&lt;p&gt;The math is not subtle. For a pass rate around 0.73 over n=840, the binomial standard error is &lt;code&gt;sqrt(p(1-p)/n)&lt;/code&gt;, which is about 1.53 points. The standard error of the &lt;em&gt;difference&lt;/em&gt; between two such rates, treated as independent runs, is &lt;code&gt;sqrt(2)&lt;/code&gt; times that, or roughly 2.2 points.&lt;/p&gt;

&lt;p&gt;So our 2.1-point gap was about one standard error wide. A coin flip dressed up as a result.&lt;/p&gt;
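
&lt;p&gt;The arithmetic above can be checked in a few lines. This is a minimal sketch that treats the two runs as independent binomial samples, using the pass rates and suite size quoted in this post; the function name is made up for illustration.&lt;/p&gt;

```python
import math

def binomial_se(p, n):
    """Standard error of a pass rate p measured on n independent prompts."""
    return math.sqrt(p * (1 - p) / n)

n = 840
se_old = binomial_se(0.713, n)
se_new = binomial_se(0.734, n)

# Unpaired comparison: the variances of the two runs add.
se_diff = math.sqrt(se_old**2 + se_new**2)
gap = 0.734 - 0.713

print(f"SE old={se_old:.4f}  SE new={se_new:.4f}  SE diff={se_diff:.4f}")
print(f"gap / SE diff = {gap / se_diff:.2f}")
```

&lt;p&gt;A ratio near 1 means the observed gap is about one standard error, far short of the roughly two standard errors a 95% interval needs to exclude zero.&lt;/p&gt;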

&lt;h2&gt;
  
  
  Bootstrap instead of hand-waving
&lt;/h2&gt;

&lt;p&gt;The fix is cheap. We resample the per-prompt results and look at the distribution of the difference. Because both models ran the same prompts, we pair them, which cuts the variance compared to treating the two runs as independent.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="c1"&gt;# per-prompt correctness, 1/0, aligned by prompt id
&lt;/span&gt;&lt;span class="n"&gt;old&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;old_correct.npy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# shape (840,)
&lt;/span&gt;&lt;span class="n"&gt;new&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;new_correct.npy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;paired_bootstrap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;iters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;rng&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;default_rng&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;diffs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;empty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iters&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iters&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;integers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;diffs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;lo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;diffs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;2.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;97.5&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;diffs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;lo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hi&lt;/span&gt;

&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;paired_bootstrap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;old&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  95% CI=[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;lo&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;hi&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# delta=0.021  95% CI=[-0.004, 0.046]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The 95% interval runs from -0.4 points to +4.6 points. It crosses zero. We could not rule out that the new model was slightly worse.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the numbers actually said
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Incumbent 7B&lt;/th&gt;
&lt;th&gt;Fine-tuned 7B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pass rate&lt;/td&gt;
&lt;td&gt;71.3%&lt;/td&gt;
&lt;td&gt;73.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Paired delta&lt;/td&gt;
&lt;td&gt;baseline&lt;/td&gt;
&lt;td&gt;+2.1 pts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;95% CI on delta&lt;/td&gt;
&lt;td&gt;baseline&lt;/td&gt;
&lt;td&gt;[-0.4, +4.6] pts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Significant at p&amp;lt;0.05?&lt;/td&gt;
&lt;td&gt;baseline&lt;/td&gt;
&lt;td&gt;no&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Reading the table is the whole point. The headline delta is positive. The interval that contains it includes outcomes where we regressed. You do not ship on that.&lt;/p&gt;

&lt;h2&gt;
  
  
  What changed in our process
&lt;/h2&gt;

&lt;p&gt;Three rules now gate any model promotion on my team.&lt;/p&gt;

&lt;p&gt;First, no promotion without a paired bootstrap CI that excludes zero, or a McNemar test under p&amp;lt;0.05. The raw delta is not allowed in the PR description on its own anymore.&lt;/p&gt;

&lt;p&gt;Second, every candidate runs the eval three times. If the three pass rates spread by more than a point at temperature 0, the harness is nondeterministic and we fix that before trusting any comparison. We caught a vLLM &lt;code&gt;max_tokens&lt;/code&gt; truncation bug this way that was silently failing 11 long-output prompts on some runs.&lt;/p&gt;

&lt;p&gt;Third, when we compare a self-hosted candidate against a hosted reference like gpt-4o-mini, we route both through one gateway so the request shape, retries, and timeouts are identical. We use Bifrost (&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;https://github.com/maximhq/bifrost&lt;/a&gt;) for that, since it exposes every provider behind one OpenAI-compatible endpoint and the eval code stops caring who serves the tokens. Same harness, different backend. That removes a confound I used to ignore.&lt;/p&gt;

&lt;p&gt;The cost of all this is one extra function and roughly 2x more eval compute. Against the cost of shipping a regression to an enterprise customer, that is nothing.&lt;/p&gt;
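
&lt;p&gt;For the McNemar option in the first rule, here is a minimal sketch of the exact (binomial) form of the test using only the standard library. It is not our harness code; the function name and the toy data are hypothetical, and the statsmodels implementation linked under Further Reading is the more complete choice.&lt;/p&gt;

```python
from math import comb

def mcnemar_exact(old, new):
    """Two-sided exact McNemar test on paired 1/0 results.

    Only discordant pairs carry information: prompts where exactly one
    model passed. Under the null hypothesis they split 50/50.
    """
    fixed = sum(1 for o, n in zip(old, new) if o == 0 and n == 1)
    broke = sum(1 for o, n in zip(old, new) if o == 1 and n == 0)
    total = fixed + broke
    if total == 0:
        return fixed, broke, 1.0
    k = min(fixed, broke)
    # Probability of a split at least this lopsided, in one direction.
    tail = sum(comb(total, i) for i in range(k + 1)) / 2**total
    return fixed, broke, min(1.0, 2 * tail)

# Hypothetical toy data, not the results from this post.
old = [1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
new = [1, 0, 1, 1, 1, 1, 1, 1, 0, 1]
fixed, broke, p = mcnemar_exact(old, new)
print(f"fixed={fixed} broke={broke} p={p:.3f}")
```

&lt;p&gt;The useful side effect is that the test forces you to look at how many prompts flipped in each direction, which the raw delta hides.&lt;/p&gt;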

&lt;h2&gt;
  
  
  The deeper problem
&lt;/h2&gt;

&lt;p&gt;840 prompts sounds like a lot. For detecting a 5-point difference, it is fine. For detecting a 2-point difference at 95% confidence, you need closer to 3,000 prompts. The required sample size grows with the inverse square of the effect, so for 1 point you need roughly four times that, around 12,000. Most internal evals are too small to resolve the differences people argue about in standups.&lt;/p&gt;

&lt;p&gt;So we also report the minimum detectable effect for our suite. Right now ours is about 4.5 points. Anything smaller, we say out loud that we cannot measure it, and we either grow the suite or stop pretending the comparison means something.&lt;/p&gt;
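
&lt;p&gt;One way to estimate a minimum detectable effect is the standard normal-approximation power calculation sketched below. The 5% significance level, the 80% power target, and the &lt;code&gt;sd_diff&lt;/code&gt; value are all assumptions chosen for illustration, so the output will not necessarily reproduce the figures quoted above; plug in the standard deviation of your own per-prompt differences.&lt;/p&gt;

```python
from math import ceil, sqrt
from statistics import NormalDist

def z_total(alpha=0.05, power=0.80):
    """Sum of the two-sided critical value and the power quantile."""
    nd = NormalDist()
    return nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)

def mde(n, sd_diff, alpha=0.05, power=0.80):
    """Smallest true difference detectable with n paired prompts."""
    return z_total(alpha, power) * sd_diff / sqrt(n)

def prompts_needed(effect, sd_diff, alpha=0.05, power=0.80):
    """Prompts required to detect `effect`; grows as 1 / effect**2."""
    return ceil((z_total(alpha, power) * sd_diff / effect) ** 2)

# sd_diff is the standard deviation of the per-prompt difference (new - old).
# 0.4 is an assumed placeholder; estimate it from your own paired results.
sd_diff = 0.4
print(f"MDE at n=840: {mde(840, sd_diff):.3f}")
for effect in (0.05, 0.02, 0.01):
    print(f"{effect:.2f} -> {prompts_needed(effect, sd_diff)} prompts")
```

&lt;p&gt;Whatever the inputs, halving the effect you want to detect quadruples the number of prompts, which is why small suites cannot settle small differences.&lt;/p&gt;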

&lt;h2&gt;
  
  
  Trade-offs and Limitations
&lt;/h2&gt;

&lt;p&gt;Bootstrap CIs assume your prompts are a representative sample of production. They are usually not. A tight interval on a biased suite is confidently wrong, and no amount of resampling fixes the sample.&lt;/p&gt;

&lt;p&gt;The paired approach needs aligned per-prompt results, so you have to log at the prompt level, not the aggregate. That is more storage and more plumbing.&lt;/p&gt;

&lt;p&gt;And significance is not importance. A real 0.3-point gain can be statistically solid and operationally meaningless. The test tells you the difference exists, not that you should care.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.routledge.com/An-Introduction-to-the-Bootstrap/Efron-Tibshirani/p/book/9780412042317" rel="noopener noreferrer"&gt;An Introduction to the Bootstrap, Efron &amp;amp; Tibshirani&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.statsmodels.org/stable/generated/statsmodels.stats.contingency_tables.mcnemar.html" rel="noopener noreferrer"&gt;McNemar's test, statsmodels docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.vllm.ai/en/latest/" rel="noopener noreferrer"&gt;vLLM sampling and determinism notes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost AI gateway&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/google-research/tuning_playbook" rel="noopener noreferrer"&gt;Deep Learning Tuning Playbook, Google Research&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>mlops</category>
      <category>llm</category>
    </item>
    <item>
      <title>Provider drift broke our regression evals. We pinned versions through Bifrost.</title>
      <dc:creator>Marcus Chen</dc:creator>
      <pubDate>Mon, 01 Jun 2026 16:03:19 +0000</pubDate>
      <link>https://dev.to/marcuswwchen/provider-drift-broke-our-regression-evals-we-pinned-versions-through-bifrost-4nmb</link>
      <guid>https://dev.to/marcuswwchen/provider-drift-broke-our-regression-evals-we-pinned-versions-through-bifrost-4nmb</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: Our nightly agent regression suite dropped 4 points on a tool-calling metric with zero code or prompt changes. The cause was a provider silently rotating the model behind a floating alias. We moved eval traffic through Bifrost, pinned exact model strings per provider, and added Prometheus per-model latency so the next drift shows up as a graph instead of a Slack mystery.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I lead the fine-tuning and eval team at Nexus Labs. Series B, enterprise agent automation. We run a nightly suite of about 2,400 adversarial test cases against whatever models our agents call in production. The suite is the contract. If it moves, something changed.&lt;/p&gt;

&lt;p&gt;On a Tuesday in April it moved. Tool-call accuracy went from 0.91 to 0.87 overnight. No deploy. No prompt edit. Git was clean.&lt;/p&gt;

&lt;h2&gt;
  
  
  The model under you is not stable
&lt;/h2&gt;

&lt;p&gt;We were calling a floating alias on a hosted provider. The kind that maps to "the current version" and gets repointed when the vendor ships an update. Our eval harness recorded the alias string, not the resolved version. So the harness thought it was testing the same thing two nights running. It wasn't.&lt;/p&gt;

&lt;p&gt;That is the part people skip. You can pin your seed, your temperature, your prompt template, your sampling params. The weights still move under you. A 4-point swing on a contract metric is the difference between shipping and not, and we spent a day and a half bisecting our own code for a bug that lived in someone else's deploy.&lt;/p&gt;

&lt;p&gt;The fix is boring. Pin the exact version. Make the gateway enforce it. Alert when the resolved model string changes.&lt;/p&gt;

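
&lt;p&gt;The "alert when the resolved model string changes" step can be as small as the check below. It is a sketch, not our harness: it relies on the &lt;code&gt;model&lt;/code&gt; field that OpenAI-compatible chat completion responses carry, and it assumes whatever sits between you and the provider passes that field through unchanged. The function names and the second version string are hypothetical.&lt;/p&gt;

```python
def strip_provider(model):
    """Drop a leading 'provider/' prefix, if any."""
    return model.split("/", 1)[-1]

def check_resolved_model(pinned, response):
    """Raise if the response reports a different model than the pinned one.

    `response` is the parsed JSON body of an OpenAI-compatible chat
    completion, which reports the serving model in its "model" field.
    """
    resolved = response.get("model", "")
    if strip_provider(resolved) != strip_provider(pinned):
        raise RuntimeError(f"model drift: pinned {pinned!r}, got {resolved!r}")
    return resolved

# Hypothetical response bodies for illustration.
ok = {"model": "gpt-4o-mini-2024-07-18", "choices": []}
print(check_resolved_model("openai/gpt-4o-mini-2024-07-18", ok))

drifted = {"model": "gpt-4o-mini-2025-01-01", "choices": []}
try:
    check_resolved_model("openai/gpt-4o-mini-2024-07-18", drifted)
except RuntimeError as err:
    print(err)
```

&lt;p&gt;Recording the resolved string on every eval row, not just the requested one, is what turns a drift into a diff you can grep for.&lt;/p&gt;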
&lt;h2&gt;
  
  
  Why a gateway and not just a constant in our config
&lt;/h2&gt;

&lt;p&gt;We already had model names in a config file. The problem is enforcement and visibility, not storage. We needed three things at the call layer: the exact model string sent on every request, a metric tagged by resolved model, and failover so a provider 500 mid-suite does not kill a 90-minute run.&lt;/p&gt;

&lt;p&gt;We put Bifrost (&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;https://github.com/maximhq/bifrost&lt;/a&gt;) in front. It is a Go gateway, OpenAI-compatible, so our eval client changed by one base URL. Provider and model become an explicit provider/model string in the request, no floating aliases unless we opt into one.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# bifrost config -- explicit versions, no floating aliases&lt;/span&gt;
&lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;openai&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;env.OPENAI_KEY&lt;/span&gt;
  &lt;span class="na"&gt;anthropic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;env.ANTHROPIC_KEY&lt;/span&gt;

&lt;span class="c1"&gt;# eval client now sends fully-qualified model strings:&lt;/span&gt;
&lt;span class="c1"&gt;#   anthropic/claude-sonnet-4-6&lt;/span&gt;
&lt;span class="c1"&gt;#   openai/gpt-4o-mini-2024-07-18&lt;/span&gt;
&lt;span class="c1"&gt;# a dated string cannot be silently rotated under us&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The request side stays explicit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8080/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "openai/gpt-4o-mini-2024-07-18",
    "messages": [{"role": "user", "content": "..."}]
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Native Prometheus metrics gave us latency and request counts labeled by model. When the dated string stops resolving because the vendor retired it, the suite fails loud on a 4xx instead of quietly testing a substitute. That is the behavior I want. Fail visible, not silent.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failover that does not corrupt the eval
&lt;/h2&gt;

&lt;p&gt;A subtler trap: automatic failover is great for production and dangerous for evals. If provider A times out and Bifrost retries on provider B, your eval row now reflects a different model than the column header says. So we scope it. Production keys get fallbacks. The eval virtual key gets retries on the same model only, no cross-provider fallback. Same gateway, two policies.&lt;/p&gt;

&lt;p&gt;That distinction matters more than the drift fix itself. A gateway that "just works" by silently routing around failures is exactly the kind of thing that poisoned our data in the first place.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest comparison
&lt;/h2&gt;

&lt;p&gt;We looked at LiteLLM and Portkey before landing here.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concern&lt;/th&gt;
&lt;th&gt;Bifrost&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;th&gt;Portkey&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI-compatible single API&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-host, no vendor cloud&lt;/td&gt;
&lt;td&gt;Yes (Go binary/Docker)&lt;/td&gt;
&lt;td&gt;Yes (Python)&lt;/td&gt;
&lt;td&gt;Gateway OSS; control plane leans hosted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per-model Prometheus metrics&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;td&gt;Via callbacks/config&lt;/td&gt;
&lt;td&gt;Hosted dashboards&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Maturity / ecosystem&lt;/td&gt;
&lt;td&gt;Newer, fewer integrations&lt;/td&gt;
&lt;td&gt;Largest provider list, most battle-tested&lt;/td&gt;
&lt;td&gt;Polished hosted UX&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Config surface&lt;/td&gt;
&lt;td&gt;Web UI + JSON&lt;/td&gt;
&lt;td&gt;Python-config heavy&lt;/td&gt;
&lt;td&gt;Hosted-first&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;LiteLLM has the wider provider coverage and far more Stack Overflow answers when something breaks at 2am. If you live in Python and want the longest integration list, it is the safe pick. Portkey's hosted observability is genuinely nicer out of the box than wiring your own Grafana. Bifrost won for us because it is a single Go process we run ourselves, the OpenAI-compatible surface meant a one-line client change, and the Prometheus labels were exactly the cardinality we wanted without a callback plugin. Different teams, different answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and Limitations
&lt;/h2&gt;

&lt;p&gt;A gateway does not detect drift on its own. It records the exact model string; you still have to alert on changes and pin dated versions. If you keep calling floating aliases through Bifrost, you have added a hop and solved nothing.&lt;/p&gt;

&lt;p&gt;It's another process in the path. For our eval traffic that is fine, sub-millisecond overhead against multi-second LLM calls. For ultra-latency-sensitive serving you would benchmark it yourself.&lt;/p&gt;

&lt;p&gt;And it is younger software. We hit one config-reload quirk early. LiteLLM's longer track record is a real argument if you cannot afford to debug a gateway.&lt;/p&gt;

&lt;p&gt;Dated model strings also age out. When a provider retires gpt-4o-mini-2024-07-18, our suite breaks loudly and we re-baseline on purpose. That is the point, but it is maintenance, not magic.&lt;/p&gt;

&lt;p&gt;The model is the easy part. The thing that moved under us was the infrastructure around it, and the only defense is making every change observable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Bifrost GitHub: &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;https://github.com/maximhq/bifrost&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Retries and fallbacks: &lt;a href="https://docs.getbifrost.ai/features/retries-and-fallbacks" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/retries-and-fallbacks&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Observability / Prometheus: &lt;a href="https://docs.getbifrost.ai/features/observability/default" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/observability/default&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Governance and virtual keys: &lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/governance/virtual-keys&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;LiteLLM docs: &lt;a href="https://docs.litellm.ai/" rel="noopener noreferrer"&gt;https://docs.litellm.ai/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>mlops</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>sre</category>
    </item>
    <item>
      <title>Aggregate eval scores hid a 14-point regression in one user segment</title>
      <dc:creator>Marcus Chen</dc:creator>
      <pubDate>Mon, 01 Jun 2026 06:32:22 +0000</pubDate>
      <link>https://dev.to/marcuswwchen/aggregate-eval-scores-hid-a-14-point-regression-in-one-user-segment-3oe0</link>
      <guid>https://dev.to/marcuswwchen/aggregate-eval-scores-hid-a-14-point-regression-in-one-user-segment-3oe0</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: Our agent eval suite reported 87% pass rate before and after a fine-tune. The aggregate didn't move. One customer segment dropped from 91% to 77% and we shipped it anyway. The fix was stratifying every eval run by segment and gating on the worst slice, not the mean.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I lead the fine-tuning and eval team at Nexus Labs. We build agent automation for enterprise customers. Roughly 40 of them in production, each with their own document formats, tool schemas, and edge cases.&lt;/p&gt;

&lt;p&gt;Here's the thing about a single accuracy number. It's an average, and averages lie by construction.&lt;/p&gt;

&lt;h2&gt;
  
  
  What happened
&lt;/h2&gt;

&lt;p&gt;We fine-tuned a Qwen2.5-7B agent on a fresh batch of tool-calling traces. Standard LoRA run in TRL, nothing exotic. Our eval suite had 1,200 cases. Pass rate before: 87.1%. After: 87.4%. Within noise. We shipped.&lt;/p&gt;

&lt;p&gt;Four days later one customer filed a ticket. Their automation was failing on multi-step refund flows. We pulled their slice out of the eval set. 47 cases. The old model passed 43. The new one passed 36. A 14-point drop, completely invisible in the aggregate because that segment was 4% of the total set and the rest had improved slightly.&lt;/p&gt;

&lt;p&gt;The new traces over-represented a different customer's invoice format. The model got better at invoices and worse at refunds. The mean stayed flat. Classic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stratify everything
&lt;/h2&gt;

&lt;p&gt;The change was small in code and large in discipline. Every eval case now carries a &lt;code&gt;segment&lt;/code&gt; tag. The harness reports per-segment pass rates, and CI gates on the minimum slice, not the mean.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# eval_config.yaml&lt;/span&gt;
&lt;span class="na"&gt;gating&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;metric&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pass_rate&lt;/span&gt;
  &lt;span class="na"&gt;aggregate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;min_segment&lt;/span&gt;   &lt;span class="c1"&gt;# not "mean"&lt;/span&gt;
  &lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.85&lt;/span&gt;
  &lt;span class="na"&gt;min_cases_per_segment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20&lt;/span&gt;

&lt;span class="na"&gt;segments&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;refund_flow&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;invoice_parse&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;contract_review&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;escalation_routing&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;min_cases_per_segment&lt;/code&gt; field matters. A slice with 6 cases swings almost 17 points if one flips. We flag any segment under 20 cases as low-confidence and don't gate on it, but we still print it. Silent truncation is how you end up trusting a number that's really three coin flips.&lt;/p&gt;

&lt;p&gt;Here's the reporting we wired into the run output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;segment            n     before   after    delta
refund_flow        47    0.915    0.766    -0.149  ❌
invoice_parse      210   0.838    0.910    +0.072
contract_review    156   0.885    0.891    +0.006
escalation_route   89    0.831    0.843    +0.011
---
mean (weighted)    1200  0.871    0.874    +0.003
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;-0.149&lt;/code&gt; would have blocked the deploy. The weighted mean would have waved it through. Same data, different verdict.&lt;/p&gt;
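
&lt;p&gt;The gating logic itself is only a few lines. Below is a minimal sketch, not our harness: the function name, the result layout, and the numbers are hypothetical, while the threshold and the case floor mirror the config shown earlier.&lt;/p&gt;

```python
def gate(results, threshold=0.85, min_cases=20):
    """Split segments into failing and low-confidence lists.

    `results` maps a segment name to (n_cases, pass_rate).
    """
    failing, low_confidence = [], []
    for name, (n, rate) in sorted(results.items()):
        if min_cases > n:
            low_confidence.append(name)  # still reported, never gated on
        elif threshold > rate:
            failing.append(name)
    return failing, low_confidence

# Hypothetical numbers for illustration.
results = {
    "segment_a": (120, 0.93),
    "segment_b": (47, 0.77),
    "segment_c": (6, 0.67),
}
failing, low_confidence = gate(results)
print("blocked:", bool(failing))
print("failing:", failing)
print("low-confidence:", low_confidence)
```

&lt;p&gt;Keeping the low-confidence list in the output, instead of dropping those segments, is the part that stops a tiny slice from disappearing without anyone noticing.&lt;/p&gt;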

&lt;h2&gt;
  
  
  Where the segments come from
&lt;/h2&gt;

&lt;p&gt;You can't tag what you don't capture. We log every production agent call with the customer ID attached, then sample stratified by customer to build eval sets. Our gateway sits in front of the provider calls and writes structured logs we can replay, so building a new slice is a query, not a data-collection project. We run that through Bifrost, which gives us per-request logging we pull into the eval pipeline. Other teams use a sidecar or their own proxy. The point is the customer dimension has to survive into the log, or you can't reconstruct the slice later.&lt;/p&gt;

&lt;p&gt;One detail that bit us: we were sampling uniformly at random for the eval set. Big customers dominated. Small customers with weird formats had 5 cases each and got rounded into noise. Stratified sampling with a floor per segment fixed the representation problem before the gating could even help.&lt;/p&gt;
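
&lt;p&gt;Stratified sampling with a per-segment floor can be sketched as follows. This assumes log records are plain dicts with a customer field; the function name, the record shape, and the counts are hypothetical.&lt;/p&gt;

```python
import random
from collections import defaultdict

def stratified_sample(records, key, total, floor, seed=0):
    """Sample about `total` records, proportional per group, with a floor.

    Groups smaller than the floor contribute everything they have, and
    the floor can push the result above `total`.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    sample = []
    for name in sorted(groups):
        items = groups[name]
        share = round(total * len(items) / len(records))
        take = min(len(items), max(floor, share))
        sample.extend(rng.sample(items, take))
    return sample

# Hypothetical log records: one big customer, one small one.
records = [{"customer": "big", "id": i} for i in range(950)]
records += [{"customer": "small", "id": i} for i in range(50)]
picked = stratified_sample(records, key="customer", total=100, floor=20)

counts = defaultdict(int)
for rec in picked:
    counts[rec["customer"]] += 1
print(dict(counts))
```

&lt;p&gt;Uniform sampling of 100 records here would give the small customer about 5 cases; the floor lifts that to 20, enough to clear the low-confidence cutoff.&lt;/p&gt;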

&lt;h2&gt;
  
  
  Why the mean is the wrong default
&lt;/h2&gt;

&lt;p&gt;A mean assumes every case is interchangeable. In a multi-tenant product they're not. A 14-point regression for one customer is a churn risk even if 39 other customers improved. The business doesn't experience the average. Each customer experiences their own slice.&lt;/p&gt;

&lt;p&gt;This is the same reason a single benchmark number tells you almost nothing. MMLU at 0.81 doesn't tell you the model fell apart on the 3% of questions your users actually ask. You have to cut the data along the dimensions that matter to the people paying you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Gating strategy&lt;/th&gt;
&lt;th&gt;Catches per-segment regression&lt;/th&gt;
&lt;th&gt;False-block rate&lt;/th&gt;
&lt;th&gt;Setup cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Weighted mean&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Trivial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unweighted mean&lt;/td&gt;
&lt;td&gt;Sometimes&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Trivial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Min segment (floor on n)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per-segment + manual review&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;We run min-segment in CI and route any blocked deploy to a 10-minute human review. The false blocks are real. A small slice flips, CI goes red, and it turns out to be a flaky case. We accept that cost. Shipping a 14-point regression to a paying customer costs more than a few false alarms.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and Limitations
&lt;/h2&gt;

&lt;p&gt;Min-segment gating is noisier than the mean. With 40 segments, the probability that at least one drops by chance on any given run is high, so you will get blocked deploys that aren't real regressions. The &lt;code&gt;min_cases_per_segment&lt;/code&gt; floor helps but doesn't eliminate it.&lt;/p&gt;
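
&lt;p&gt;To put a rough number on "high": if each segment independently has some small chance of dropping past the gate by luck, the chance that at least one of 40 does is one minus the chance that none do. The per-segment rates below are illustrative assumptions, and real segments are unlikely to be fully independent.&lt;/p&gt;

```python
def prob_any_false_block(per_segment_rate, n_segments):
    """P(at least one segment trips the gate by chance), assuming independence."""
    return 1.0 - (1.0 - per_segment_rate) ** n_segments


# Hypothetical per-segment false-drop rates, 40 segments as in the text.
for rate in (0.01, 0.05, 0.10):
    print(rate, round(prob_any_false_block(rate, 40), 2))
```

&lt;p&gt;Even a 1% per-segment false-drop rate gives roughly a one-in-three chance of a spurious block per run, and 5% pushes it to about 87%.&lt;/p&gt;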

&lt;p&gt;It also doesn't scale to thousands of segments without becoming a triage burden. At some point you cluster segments into families and gate on those instead of every individual customer.&lt;/p&gt;

&lt;p&gt;And it tells you a slice regressed, not why. You still need to read the failing traces. The harness points at the wound. It doesn't diagnose it.&lt;/p&gt;

&lt;p&gt;Last thing: stratified eval is only as good as your segment definitions. If you pick the wrong dimension to cut on, you'll get clean-looking slices that hide the real variance. We got customer-segment right and missed document-length entirely for two months.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/docs/trl" rel="noopener noreferrer"&gt;TRL documentation&lt;/a&gt; for the LoRA fine-tuning setup&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.vllm.ai/" rel="noopener noreferrer"&gt;vLLM docs&lt;/a&gt; for serving the eval runs&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; for the per-request logging we pull eval slices from&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.snorkel.org/blog/slicing" rel="noopener noreferrer"&gt;Slice-based learning (Snorkel)&lt;/a&gt; on monitoring critical data subsets&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html" rel="noopener noreferrer"&gt;scikit-learn StratifiedKFold&lt;/a&gt; for the sampling floor&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>mlops</category>
      <category>llm</category>
    </item>
    <item>
      <title>Serving 40 LoRA adapters on one base model: the throughput we got</title>
      <dc:creator>Marcus Chen</dc:creator>
      <pubDate>Fri, 29 May 2026 06:32:33 +0000</pubDate>
      <link>https://dev.to/marcuswwchen/serving-40-lora-adapters-on-one-base-model-the-throughput-we-got-m2n</link>
      <guid>https://dev.to/marcuswwchen/serving-40-lora-adapters-on-one-base-model-the-throughput-we-got-m2n</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: We fine-tune one LoRA adapter per enterprise customer on top of a single Llama 3.1 8B base. Running them as 40 separate deployments would have cost roughly $24k/month in mostly-idle GPU. Multi-LoRA serving in vLLM put all 40 on two A100s. Numbers and the parts that broke below.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At Nexus Labs we run the fine-tuning and eval team for agent automation. Each enterprise customer gets its own adapter because each has a different tool schema and a different house style for responses. Right now that's 40 customers in production. Rank-16 LoRA, about 42MB per adapter on disk, trained with PEFT and TRL on their own trace data.&lt;/p&gt;

&lt;p&gt;The obvious setup is one model server per customer. That's 40 copies of an 8B base. In bf16 the base is around 16GB of weights before KV cache. Forty of those do not fit on anything we can afford, and most customers send fewer than 5 requests a minute. So you're paying for a GPU to sit at 3% utilization. We priced it at about $24k/month across the fleet on reserved A100s. No.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-LoRA: one base, many adapters
&lt;/h2&gt;

&lt;p&gt;vLLM (we're on 0.6.3) loads the base weights once and applies adapter deltas at request time. You turn it on with &lt;code&gt;--enable-lora&lt;/code&gt; and register adapters by name. The base sits in GPU memory once. Each adapter is about 42MB, so dozens fit in the same box.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vllm serve meta-llama/Llama-3.1-8B-Instruct &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--enable-lora&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-loras&lt;/span&gt; 8 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-lora-rank&lt;/span&gt; 16 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-cpu-loras&lt;/span&gt; 64 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-model-len&lt;/span&gt; 8192 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--gpu-memory-utilization&lt;/span&gt; 0.90
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A request picks its adapter through the &lt;code&gt;model&lt;/code&gt; field:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:8000/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"model": "customer_acme_v3", "messages": [...]}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;--max-loras 8&lt;/code&gt; is the number of distinct adapters that can be active in a single batch on the GPU. &lt;code&gt;--max-cpu-loras 64&lt;/code&gt; is the CPU-side pool that adapters get swapped in from. When a 9th distinct adapter shows up in a batch, vLLM evicts the least-recently-used one back to CPU. That swap costs us 30 to 50ms measured at p50. Swapping from disk instead of the CPU pool is much worse, so size the CPU pool to your real customer count.&lt;/p&gt;
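
&lt;p&gt;A toy model of that eviction behavior, useful for estimating how often a given traffic pattern would pay the swap. It is a plain LRU simulation, not vLLM's scheduler; the class name and the choice to count first loads as swaps are assumptions.&lt;/p&gt;

```python
from collections import OrderedDict


class AdapterSlots:
    """Toy LRU pool modelling a fixed number of GPU adapter slots."""

    def __init__(self, max_loras):
        self.max_loras = max_loras
        self.resident = OrderedDict()  # adapter name -> None, oldest first
        self.swaps = 0

    def request(self, adapter):
        """Record one request; return True if it needed a swap-in."""
        if adapter in self.resident:
            self.resident.move_to_end(adapter)  # mark as most recently used
            return False
        if len(self.resident) >= self.max_loras:
            self.resident.popitem(last=False)  # evict the least recently used
        self.resident[adapter] = None
        self.swaps += 1
        return True


slots = AdapterSlots(max_loras=8)
# Hypothetical worst case: 9 customers taking strict turns over 8 slots.
for i in range(27):
    slots.request("customer_%d" % (i % 9))
print(slots.swaps)
```

&lt;p&gt;Round-robin traffic across one more adapter than there are slots is the pathological case for LRU: every request misses. Replaying real request logs through a model like this gives an eviction-rate estimate before touching the server flags.&lt;/p&gt;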

&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;p&gt;Two A100 80GB, base loaded once per box, adapters shared. Load tested at 600 req/min across the 40 adapters with a Poisson arrival mix weighted by real customer traffic.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;40 separate deployments&lt;/th&gt;
&lt;th&gt;Multi-LoRA, 2x A100&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPUs needed&lt;/td&gt;
&lt;td&gt;~40 (or heavy quant + packing)&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Base weights in memory&lt;/td&gt;
&lt;td&gt;40 copies&lt;/td&gt;
&lt;td&gt;2 copies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Adapter memory&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;td&gt;~1.7GB total resident&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Idle cost / month&lt;/td&gt;
&lt;td&gt;~$24k&lt;/td&gt;
&lt;td&gt;~$1.2k&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p50 latency (256 tok)&lt;/td&gt;
&lt;td&gt;410ms&lt;/td&gt;
&lt;td&gt;470ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cold adapter swap (CPU pool)&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;td&gt;30-50ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Aggregate throughput&lt;/td&gt;
&lt;td&gt;bounded by idle waste&lt;/td&gt;
&lt;td&gt;~3,100 tok/s/box&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The latency tax is real but small. About 60ms at p50 from the grouped GEMM the multi-LoRA kernel runs when a batch contains several different adapters. For our agent workloads, where a single tool-call turn is 100 to 400 output tokens, that's noise next to the network round trip.&lt;/p&gt;

&lt;h2&gt;
  
  
  Eval gating, because outputs are not identical
&lt;/h2&gt;

&lt;p&gt;I do not roll out a serving change without an eval gate. Multi-LoRA does not produce bit-identical output to a standalone fine-tuned model. The batched LoRA kernel accumulates differently than the single-adapter path. Greedy decode matched on our set. Sampled decode diverged within tolerance, which is expected, but I wanted it measured, not assumed.&lt;/p&gt;

&lt;p&gt;So before cutover we ran each customer's adversarial eval set, 200 tool-call prompts apiece, scoring exact match on tool name plus a JSON-normalized arg comparison. Gate: no regression above 0.5% versus the standalone deployment. Two adapters tripped it. Both turned out to be rank mismatches in how they were exported, not a serving bug. Fixed the export, re-ran, shipped.&lt;/p&gt;
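
&lt;p&gt;A sketch of what that scoring and gate could look like, assuming each tool call is a dict with a &lt;code&gt;name&lt;/code&gt; string and an &lt;code&gt;arguments&lt;/code&gt; JSON string. The field names and function names are hypothetical, and "JSON-normalized" is read here as "parse, then re-serialize with sorted keys".&lt;/p&gt;

```python
import json


def normalize_args(raw):
    """Parse a JSON argument string and re-serialize it canonically."""
    return json.dumps(json.loads(raw), sort_keys=True, separators=(",", ":"))


def tool_call_matches(expected, actual):
    """Exact match on tool name plus normalized-argument equality."""
    if expected["name"] != actual["name"]:
        return False
    try:
        return normalize_args(expected["arguments"]) == normalize_args(actual["arguments"])
    except (json.JSONDecodeError, TypeError):
        return False  # unparseable arguments count as a miss


def regression_exceeds_gate(baseline_passes, candidate_passes, total, gate=0.005):
    """True when the candidate's pass rate drops by more than `gate`."""
    drop = (baseline_passes - candidate_passes) / total
    return drop > gate
```

&lt;p&gt;With 200 prompts per customer, one extra failure is exactly 0.5% and sits on the threshold; two extra failures trip the gate.&lt;/p&gt;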

&lt;p&gt;In front of the vLLM box we run Bifrost (&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;https://github.com/maximhq/bifrost&lt;/a&gt;) as the gateway. It gives us one OpenAI-compatible endpoint, and if the self-hosted box saturates or drops, it falls back to a hosted provider running the generic adapter so a customer gets a degraded answer instead of a 503. It's one gateway option among several; we picked it for the failover behavior.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and Limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Eviction thrash.&lt;/strong&gt; &lt;code&gt;--max-loras 8&lt;/code&gt; means bursty traffic across more than 8 distinct customers in the same window causes constant swapping. If your concurrency exceeds your active-adapter slots, you pay the 30-50ms swap on a chunk of requests. Watch your eviction rate, not just latency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Uniform rank.&lt;/strong&gt; Mixing rank 8 and rank 64 adapters wastes the padded buffer, which is sized to the max. We standardized on rank 16 across all customers. If one needs more capacity, it doesn't belong in this pool.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Throughput per adapter drops&lt;/strong&gt; when many distinct adapters land in one batch, because the kernel does a grouped GEMM instead of one dense matmul. Few adapters per batch, near-dense speed. Many, you lose some.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One base, one tokenizer.&lt;/strong&gt; Every adapter has to share the same base model and tokenizer. A customer who needs a different base (say a 70B) gets its own deployment. No way around it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Numerical drift means you own an eval set.&lt;/strong&gt; If you don't have per-customer regression tests, you can't safely make this swap. The infra savings assume you can prove output parity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model was the easy part here. Two A100s instead of forty came down to knowing how many adapters are actually hot at once and sizing the slots to that, then proving the outputs didn't move.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.vllm.ai/en/latest/models/lora.html" rel="noopener noreferrer"&gt;vLLM LoRA serving docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2311.03285" rel="noopener noreferrer"&gt;S-LoRA: Serving Thousands of Concurrent LoRA Adapters&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2310.18547" rel="noopener noreferrer"&gt;Punica: Multi-Tenant LoRA Serving&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/huggingface/peft" rel="noopener noreferrer"&gt;Hugging Face PEFT&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost AI Gateway&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>llm</category>
      <category>pytorch</category>
      <category>mlops</category>
    </item>
    <item>
      <title>Shadow-testing a fine-tuned 8B against gpt-4o-mini through Bifrost</title>
      <dc:creator>Marcus Chen</dc:creator>
      <pubDate>Thu, 28 May 2026 16:03:41 +0000</pubDate>
      <link>https://dev.to/marcuswwchen/shadow-testing-a-fine-tuned-8b-against-gpt-4o-mini-through-bifrost-od4</link>
      <guid>https://dev.to/marcuswwchen/shadow-testing-a-fine-tuned-8b-against-gpt-4o-mini-through-bifrost-od4</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: We fine-tuned a Llama 3.1 8B for invoice line-item extraction. Before flipping production over, we mirrored 14 days of live traffic to both the fine-tune and gpt-4o-mini using Bifrost's load balancing, then diffed outputs offline. The 8B won on accuracy by 3.2 points and cut per-call cost by 71%. The interesting bug: 4% of "wins" were the fine-tune hallucinating a field the base model correctly left null.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Our team at Nexus Labs ships an agent that pulls structured fields out of supplier invoices. The previous version hit gpt-4o-mini for every call. Bill was getting unfun.&lt;/p&gt;

&lt;p&gt;I'm not a fan of swapping production models based on benchmark numbers. MT-Bench scores tell you very little about whether your specific eight-field extraction prompt works on the long tail of malformed PDFs that your customers actually send. So we shadow-tested.&lt;/p&gt;

&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;

&lt;p&gt;We needed three things wired together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mirror live production traffic to a second model without affecting the primary response&lt;/li&gt;
&lt;li&gt;Log both responses with a shared request ID&lt;/li&gt;
&lt;li&gt;Replay an offline judge over the diffs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We were already running Bifrost in front of OpenAI for spend visibility. Turns out the load balancing config lets you weight providers across a single virtual model name, and the per-request log includes the full input and output payload. That covered the first two.&lt;/p&gt;

&lt;p&gt;A trimmed slice of the config we used:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;primary_extractor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1.0&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai/gpt-4o-mini&lt;/span&gt;
  &lt;span class="na"&gt;shadow_extractor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1.0&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vllm/llama-3.1-8b-extract-v4&lt;/span&gt;
    &lt;span class="na"&gt;shadow&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;shadow: true&lt;/code&gt; flag is not a built-in Bifrost option. We implemented it as a custom plugin. The Bifrost README documents the plugin architecture, but Bifrost does not ship a built-in mirror mode. Our plugin sends the shadow request async and discards the response from the client path. Both log records share a trace ID so downstream comparison is a join, not a search.&lt;/p&gt;
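
&lt;p&gt;The mirror pattern itself is small. The sketch below shows it in Python with &lt;code&gt;asyncio&lt;/code&gt; purely as an illustration: it is not the Bifrost plugin API (Bifrost plugins are written against its own hooks), and &lt;code&gt;call_primary&lt;/code&gt;, &lt;code&gt;call_shadow&lt;/code&gt;, and the in-memory &lt;code&gt;log&lt;/code&gt; list are hypothetical stand-ins.&lt;/p&gt;

```python
import asyncio
import uuid


async def handle_request(payload, call_primary, call_shadow, log):
    """Return the primary response; mirror to the shadow without waiting on it."""
    trace_id = str(uuid.uuid4())

    async def run_shadow():
        try:
            result = await call_shadow(payload)
            log.append({"trace_id": trace_id, "target": "shadow", "output": result})
        except Exception as exc:  # shadow failures must never reach the client
            log.append({"trace_id": trace_id, "target": "shadow", "error": repr(exc)})

    # Schedule the shadow call, then serve the client from the primary only.
    shadow_task = asyncio.create_task(run_shadow())
    primary = await call_primary(payload)
    log.append({"trace_id": trace_id, "target": "primary", "output": primary})
    # The task is returned so the caller can hold a reference until it finishes.
    return primary, shadow_task
```

&lt;p&gt;Both log records carry the same &lt;code&gt;trace_id&lt;/code&gt;, which is what makes the offline comparison a join. A shadow exception is logged and swallowed, so a broken candidate model cannot change what the client sees.&lt;/p&gt;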

&lt;h2&gt;
  
  
  What we found in 14 days
&lt;/h2&gt;

&lt;p&gt;Fourteen days, 218,400 production requests, mirrored to both targets. The numbers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;gpt-4o-mini&lt;/th&gt;
&lt;th&gt;Fine-tuned 8B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Field-level accuracy (judge)&lt;/td&gt;
&lt;td&gt;94.1%&lt;/td&gt;
&lt;td&gt;97.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency p50&lt;/td&gt;
&lt;td&gt;480ms&lt;/td&gt;
&lt;td&gt;190ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency p99&lt;/td&gt;
&lt;td&gt;1.8s&lt;/td&gt;
&lt;td&gt;410ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost per 1k requests&lt;/td&gt;
&lt;td&gt;$0.42&lt;/td&gt;
&lt;td&gt;$0.12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hallucinated field rate&lt;/td&gt;
&lt;td&gt;0.3%&lt;/td&gt;
&lt;td&gt;1.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The accuracy win is real. The cost win is real. The latency win is mostly because we run the 8B on a single H100 with vLLM continuous batching and there is no network egress.&lt;/p&gt;

&lt;p&gt;The hallucination rate is the part that almost killed the migration. The fine-tune confidently filled in &lt;code&gt;vendor_tax_id&lt;/code&gt; on 1.1% of invoices where the field genuinely did not exist. The base model returned null. Our judge initially scored the hallucinations as correct because the format was valid. That's a separate post.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the judge missed
&lt;/h2&gt;

&lt;p&gt;We were using gpt-4o as the offline judge. It graded outputs against the ground-truth JSON. The grader rewarded any non-null field that matched the schema, which meant a plausible-sounding made-up tax ID got partial credit.&lt;/p&gt;

&lt;p&gt;We swapped to a stricter judge that compared field-by-field against a held-out human-labeled set of 2,400 invoices. The fine-tune still won, but the margin shrank to 1.8 points. Worth the migration. Not worth the marketing pitch our PM wanted.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Bifrost vs LiteLLM or Portkey
&lt;/h2&gt;

&lt;p&gt;I've used all three. Honest comparison:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LiteLLM&lt;/strong&gt; is fine if all you want is the proxy layer. Easier to drop into a Python script. The plugin story is weaker, so you'd be writing more glue for the mirror behavior we needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Portkey&lt;/strong&gt; has nicer observability dashboards out of the box, and its guardrails feature is more mature than what Bifrost ships today. If your priority is policy enforcement on user-facing chat traffic, look there first.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bifrost&lt;/strong&gt; won for us because the Go core handles the request volume without GIL-related weirdness, the plugin hooks let us implement the shadow flag without forking, and the &lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;virtual keys&lt;/a&gt; model already matched how we track team budgets. The &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;semantic caching feature&lt;/a&gt; was not relevant here. Extraction prompts are too input-specific.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'd switch tomorrow if Portkey shipped a documented mirror primitive and a Go core. They haven't.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and Limitations
&lt;/h2&gt;

&lt;p&gt;The shadow approach doubles your inference cost during the test window. We ran for 14 days, which felt long. Five would have been enough for the distribution, but extraction has weekly seasonality (Mondays look different) so we wanted two full cycles.&lt;/p&gt;

&lt;p&gt;vLLM on a single H100 fits our throughput. If your shadow target is a 70B model you'd need cluster routing, and Bifrost's clustering is enterprise-only. The README is explicit about that. Plan accordingly.&lt;/p&gt;

&lt;p&gt;The judge problem cost us a week of confusion. Run your judge against a small human-labeled set first. If it agrees with humans below 90%, the judge is the bottleneck, not the model.&lt;/p&gt;
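
&lt;p&gt;The agreement check is a few lines. A minimal sketch, assuming the judge and the human labelers each produce one verdict per item in the same order; the function name and the label format are hypothetical.&lt;/p&gt;

```python
def judge_agreement(judge_labels, human_labels):
    """Fraction of items where the judge's verdict equals the human verdict."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    if not human_labels:
        raise ValueError("need at least one labeled item")
    matches = sum(1 for j, h in zip(judge_labels, human_labels) if j == h)
    return matches / len(human_labels)


# Hypothetical verdicts on five items: the judge disagrees on one.
agreement = judge_agreement(
    ["pass", "pass", "fail", "pass", "fail"],
    ["pass", "pass", "fail", "fail", "fail"],
)
judge_is_usable = agreement >= 0.90
```

&lt;p&gt;Raw agreement is the crudest version of this check. If one verdict dominates the labeled set, a judge can score high by always giving that verdict, so it is worth looking at agreement on the failing items separately.&lt;/p&gt;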

&lt;p&gt;One last thing. Shadow traffic with the same trace ID means your APM tool sees double the spans. Filter those out at the collector or your dashboards lie.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/retries-and-fallbacks" rel="noopener noreferrer"&gt;Bifrost load balancing and fallbacks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/enterprise/custom-plugins" rel="noopener noreferrer"&gt;Bifrost custom plugins&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2309.06180" rel="noopener noreferrer"&gt;vLLM continuous batching paper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/meta-llama/llama-recipes" rel="noopener noreferrer"&gt;Llama fine-tuning recipes&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>llm</category>
      <category>mlops</category>
      <category>devops</category>
    </item>
    <item>
      <title>Continuous batching wrecked our p99 latency. Here's the trace.</title>
      <dc:creator>Marcus Chen</dc:creator>
      <pubDate>Thu, 28 May 2026 06:33:12 +0000</pubDate>
      <link>https://dev.to/marcuswwchen/continuous-batching-wrecked-our-p99-latency-heres-the-trace-42d1</link>
      <guid>https://dev.to/marcuswwchen/continuous-batching-wrecked-our-p99-latency-heres-the-trace-42d1</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: We turned on vLLM continuous batching for a throughput win and watched p99 latency go 8x in the wrong direction. Long prefills were stalling decodes in the same forward pass. Chunked prefill and a tuned &lt;code&gt;max_num_batched_tokens&lt;/code&gt; got the SLO back at the cost of ~11% of the batched throughput, roughly a quarter of the gain.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We run Llama 3.3 70B as the routing brain for our agent platform at Nexus Labs. ~14 internal services hit it. SLO is 2s p99 for the single-turn routing call.&lt;/p&gt;

&lt;p&gt;Last month we flipped on vLLM 0.7's continuous batching to push more requests through our 4xH100 box. p50 dropped from 340ms to 190ms. We were happy for about 36 hours.&lt;/p&gt;

&lt;p&gt;Then the latency dashboard turned red.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we actually saw
&lt;/h2&gt;

&lt;p&gt;p99 went from 1.2s to 9.8s on the routing endpoint. p50 was still good. p99.9 was unprintable.&lt;/p&gt;

&lt;p&gt;The first alert came off our routing service's p99 panel. We checked the upstream load balancer. Healthy. Then the model server CPU and GPU. Healthy by every coarse metric. GPU utilization was 81%, not saturated. KV cache hit rate held at 67%. The Prometheus exporter from vLLM showed something stranger: &lt;code&gt;vllm:time_per_output_token_seconds&lt;/code&gt; had widened from 32ms to 380ms during peak. The model itself wasn't slow. The scheduler was making everyone wait.&lt;/p&gt;

&lt;p&gt;Long requests with 4k+ token prefills were eating decode slots. Short single-turn routing calls were starving behind them. The forward pass would dedicate ~60ms to a prefill chunk for one user's request, and 23 in-flight decode streams would block on it.&lt;/p&gt;

&lt;p&gt;That's the contract of naive continuous batching. Prefill and decode share one forward pass. A big prefill stops everyone.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix
&lt;/h2&gt;

&lt;p&gt;vLLM ships chunked prefill. It splits a large prefill into ~512-token chunks and interleaves them with decode steps. The tradeoff: total throughput per long request goes down. In exchange, decode never stalls for more than one chunk worth of time.&lt;/p&gt;

&lt;p&gt;The other knob is &lt;code&gt;max_num_batched_tokens&lt;/code&gt;. Set too high and you reintroduce the stall. Set too low and you starve throughput. We landed at 4096 for our workload after a sweep.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# vllm config that ended up in prod&lt;/span&gt;
&lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;meta-llama/Llama-3.3-70B-Instruct&lt;/span&gt;
&lt;span class="na"&gt;tensor_parallel_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4&lt;/span&gt;
&lt;span class="na"&gt;max_model_len&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8192&lt;/span&gt;
&lt;span class="na"&gt;enable_chunked_prefill&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;max_num_batched_tokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4096&lt;/span&gt;
&lt;span class="na"&gt;max_num_seqs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;96&lt;/span&gt;
&lt;span class="na"&gt;gpu_memory_utilization&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.92&lt;/span&gt;
&lt;span class="na"&gt;swap_space&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;16&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Before and after
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;No batching&lt;/th&gt;
&lt;th&gt;Naive CB&lt;/th&gt;
&lt;th&gt;+ chunked prefill&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;p50 latency&lt;/td&gt;
&lt;td&gt;340ms&lt;/td&gt;
&lt;td&gt;190ms&lt;/td&gt;
&lt;td&gt;215ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p99 latency&lt;/td&gt;
&lt;td&gt;1.2s&lt;/td&gt;
&lt;td&gt;9.8s&lt;/td&gt;
&lt;td&gt;1.4s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p99.9 latency&lt;/td&gt;
&lt;td&gt;2.1s&lt;/td&gt;
&lt;td&gt;27s&lt;/td&gt;
&lt;td&gt;3.1s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tokens/sec (cluster)&lt;/td&gt;
&lt;td&gt;2,650&lt;/td&gt;
&lt;td&gt;4,820&lt;/td&gt;
&lt;td&gt;4,310&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost/1M output&lt;/td&gt;
&lt;td&gt;$0.74&lt;/td&gt;
&lt;td&gt;$0.41&lt;/td&gt;
&lt;td&gt;$0.46&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;We gave back ~11% of the naive-CB throughput, which is about a quarter of the gain over no batching. We bought back the SLO. Cheap trade.&lt;/p&gt;
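
&lt;p&gt;The throughput trade can be measured against two different baselines, and the two readings differ by about 2x. A quick check using only the tokens/sec row of the table:&lt;/p&gt;

```python
# Tokens/sec (cluster) from the table above.
no_batching = 2650
naive_cb = 4820
chunked = 4310

gain = naive_cb - no_batching      # what naive continuous batching added
given_back = naive_cb - chunked    # what chunked prefill cost

share_of_naive = given_back / naive_cb       # relative to naive-CB throughput
share_of_gain = given_back / gain            # relative to the gain itself
kept_vs_baseline = chunked / no_batching     # final speedup over no batching

print(round(share_of_naive, 3), round(share_of_gain, 3), round(kept_vs_baseline, 2))
```

&lt;p&gt;That works out to about 10.6% of the naive-CB throughput, about 23.5% of the gain, and a final throughput around 1.63x the unbatched baseline.&lt;/p&gt;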

&lt;h2&gt;
  
  
  Things that didn't help
&lt;/h2&gt;

&lt;p&gt;We tried priority lanes where small requests jump the queue. It cut p99 to 5.2s, still well over the 2s SLO, and cratered p99 for the long requests. It moved the stall around instead of solving the underlying scheduling problem. Routing long requests to separate replicas would have worked, but doubled our GPU footprint. Not worth it for our traffic mix.&lt;/p&gt;

&lt;p&gt;We tried bumping &lt;code&gt;max_num_seqs&lt;/code&gt; to 256 thinking more concurrent decodes would amortize prefills. It made things worse. KV pressure spiked, eviction churn ate compute.&lt;/p&gt;

&lt;p&gt;We tried separating ingress by content length at the gateway layer. Under 1k tokens to one pool, the rest to another. Worked on paper. In practice the small pool got 92% of traffic and we ran out of headroom there. Bin packing prompts isn't free either.&lt;/p&gt;

&lt;p&gt;We added a circuit breaker upstream that sheds to a hosted provider when our internal p99 crosses 3s. We pipe everything through &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; so the failover is one config change instead of a deploy. It catches the edge cases when prefill-heavy traffic spikes faster than autoscaling reacts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and Limitations
&lt;/h2&gt;

&lt;p&gt;Chunked prefill is not free. For workloads with very long prompts and short decodes (think doc-QA over 32k context), per-request latency goes up by 15-25%. If that's your hot path, you'd want to split traffic by class and run two pools with different configs.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;max_num_batched_tokens&lt;/code&gt; is workload-specific. The number we landed on is wrong for someone with a different prompt distribution. There's no shortcut. You run the sweep.&lt;/p&gt;

&lt;p&gt;Continuous batching also makes p99 noisier across deployments. A neighbor service pushing a new feature with 8k prompts can hurt yours. The isolation story at the vLLM layer is real but not airtight. We file this under "things our k8s admission controller now checks."&lt;/p&gt;

&lt;h2&gt;
  
  
  What the eval suite said
&lt;/h2&gt;

&lt;p&gt;The boring point. None of this showed up in offline eval. Eval measured correctness on a fixed batch size. Production measures tail latency under realistic prompt mix. If you only have the first one, you'll ship the dashboard regression we shipped.&lt;/p&gt;

&lt;p&gt;We added a load-shape replay step to our deployment pipeline two weeks ago. It replays a sampled 5-minute window of real traffic shape against the candidate. Catches this class of regression before it touches real users.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.vllm.ai/en/latest/usage/optimization.html" rel="noopener noreferrer"&gt;vLLM chunked prefill docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lmsys.org/blog/2024-01-17-sglang/" rel="noopener noreferrer"&gt;SGLang continuous batching internals&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2407.00079" rel="noopener noreferrer"&gt;Mooncake: A KVCache-centric LLM serving disaggregation paper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2311.18677" rel="noopener noreferrer"&gt;Splitwise: Efficient generative LLM inference using phase splitting&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>llm</category>
      <category>mlops</category>
      <category>sre</category>
    </item>
    <item>
      <title>Virtual keys per tenant: ditching our custom LLM billing layer</title>
      <dc:creator>Marcus Chen</dc:creator>
      <pubDate>Wed, 27 May 2026 16:02:19 +0000</pubDate>
      <link>https://dev.to/marcuswwchen/virtual-keys-per-tenant-ditching-our-custom-llm-billing-layer-2p7b</link>
      <guid>https://dev.to/marcuswwchen/virtual-keys-per-tenant-ditching-our-custom-llm-billing-layer-2p7b</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: We had 11,247 lines of Python middleware handling per-tenant LLM cost attribution, rate limiting, and provider failover. Replaced about 60% of it with Bifrost's virtual keys and governance features. Some honest gaps remain, which is why this is a writeup and not a sales pitch.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The setup we inherited
&lt;/h2&gt;

&lt;p&gt;Nexus Labs runs enterprise agent automation. Each customer gets isolated workloads. Each workload makes between 200 and 50,000 LLM calls per day across OpenAI, Anthropic, Bedrock, and Vertex.&lt;/p&gt;

&lt;p&gt;When I joined, we had a Python middleware doing four things at once: API key rotation per provider, per-tenant rate limits in Redis, cost attribution via request tagging, and fallback logic when a provider returned 429s.&lt;/p&gt;

&lt;p&gt;11,247 lines of Python. Three engineers had touched it. Two had left. One of them had encoded their team-internal pricing assumptions inline. Every model deprecation became a sprint.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we actually needed
&lt;/h2&gt;

&lt;p&gt;Three things, in priority order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Per-customer spend caps that don't require a deploy to update.&lt;/li&gt;
&lt;li&gt;Provider failover that survives Anthropic going down for 23 minutes (it did, last March).&lt;/li&gt;
&lt;li&gt;Cost data we don't have to reconstruct from CloudWatch logs.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I evaluated three gateways before picking one. Here is the comparison after running each through a 2-week eval against our actual traffic shape.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Bifrost&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;th&gt;Portkey&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Per-tenant virtual keys with budgets&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;td&gt;Plugin/config&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-host without external deps&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI-compatible API for all providers&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Built-in Prometheus metrics&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes (newer)&lt;/td&gt;
&lt;td&gt;Hosted preferred&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic caching&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCP gateway&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Built-in web UI for config&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Cloud-first&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;LiteLLM was the real contender. Larger community, more battle-tested in production for some workload shapes. Where it lost for us: setting up hierarchical budgets across customer, team, and workload tiers required more YAML wrangling than we wanted, and the failover behavior on streaming requests was less predictable under our tests.&lt;/p&gt;

&lt;p&gt;Portkey was strong on dashboards. We didn't want a hosted dependency for our cost control path.&lt;/p&gt;

&lt;h2&gt;
  
  
  What changed
&lt;/h2&gt;

&lt;p&gt;The piece that surprised me most was the virtual keys model. From the docs (&lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;governance/virtual-keys&lt;/a&gt;), every tenant gets a virtual key. The key carries the budget cap, rate limit, allowed providers, and allowed models. Our orchestrator stopped caring about provider routing entirely.&lt;/p&gt;

&lt;p&gt;Config that replaced 4,200 lines of Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;virtual_keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vk_acme_prod&lt;/span&gt;
    &lt;span class="na"&gt;customer_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;acme_corp&lt;/span&gt;
    &lt;span class="na"&gt;budget&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;max_per_month_usd&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;12000&lt;/span&gt;
      &lt;span class="na"&gt;reset_duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;monthly&lt;/span&gt;
    &lt;span class="na"&gt;rate_limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;requests_per_minute&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;600&lt;/span&gt;
    &lt;span class="na"&gt;allowed_providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;openai&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;anthropic&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;bedrock&lt;/span&gt;
    &lt;span class="na"&gt;fallbacks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpt-4o&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;anthropic&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bedrock&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;anthropic.claude-sonnet-4-6&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our orchestrator now does one thing: pick a virtual key based on tenant. Send the request. Done.&lt;/p&gt;

&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;p&gt;Before:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;11,247 LOC in &lt;code&gt;gateway_middleware/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;p95 added latency from middleware: 47ms&lt;/li&gt;
&lt;li&gt;Mean time to add a new model: 2 days (testing, rollout, monitoring)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After 4 months:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;4,108 LOC remaining (mostly business logic we still need)&lt;/li&gt;
&lt;li&gt;p95 added latency from Bifrost in front: 8ms&lt;/li&gt;
&lt;li&gt;Mean time to add a new model: under an hour&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The latency number was the other surprise. Bifrost is Go. Our middleware was Python doing synchronous Redis calls. We knew that was a problem. Solving it wasn't on the roadmap.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and Limitations
&lt;/h2&gt;

&lt;p&gt;This isn't free.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Migration was harder than the docs suggest.&lt;/strong&gt; Our cost attribution data didn't map cleanly. We had legacy fields like &lt;code&gt;team_internal_billing_code&lt;/code&gt; baked into every log. Mapping these to virtual key metadata took a full sprint, and the team still grumbles about it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Semantic caching is risky for our workload.&lt;/strong&gt; Our agents call LLMs with tool results embedded in prompts. Two prompts that look 92% similar can require very different responses. We disabled semantic caching for the agent path. Enabled it only for our content generation path, where we saw a 31% hit rate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP gateway integration is newer than the rest.&lt;/strong&gt; We use it for filesystem access from a customer-facing automation agent. Works fine. But debugging when a tool call fails requires more log digging than the rest of the platform.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No native cost-anomaly alerting yet.&lt;/strong&gt; Budget caps work. But "this customer's usage spiked 3x in 2 hours" is still wired up via Prometheus alerts and PagerDuty by hand. Portkey has this in their hosted product. If real-time anomaly alerts are your top requirement, weigh that accordingly.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd tell a peer team
&lt;/h2&gt;

&lt;p&gt;If you have one provider and one customer, you don't need this. Use the provider's SDK.&lt;/p&gt;

&lt;p&gt;If you have 3+ providers, multiple customer tiers, and someone on your team has written &lt;code&gt;class CostTrackingMiddleware&lt;/code&gt; more than once, evaluate. Spin up the Docker container (&lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;quickstart&lt;/a&gt;). Point staging traffic at it for a week. Look at the metrics. Decide.&lt;/p&gt;

&lt;p&gt;The model is the easy part. Cost attribution is the part that wakes you up at 2am when a customer's bill is wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;Bifrost virtual keys docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;Budget management hierarchy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost GitHub repo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.litellm.ai/docs/simple_proxy" rel="noopener noreferrer"&gt;LiteLLM proxy docs&lt;/a&gt; (worth comparing)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/drop-in-replacement" rel="noopener noreferrer"&gt;Drop-in replacement notes&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>llm</category>
      <category>mlops</category>
      <category>infrastructure</category>
      <category>devops</category>
    </item>
    <item>
      <title>LLM-as-judge variance broke our DPO training signal for 3 weeks</title>
      <dc:creator>Marcus Chen</dc:creator>
      <pubDate>Wed, 27 May 2026 06:31:57 +0000</pubDate>
      <link>https://dev.to/marcuswwchen/llm-as-judge-variance-broke-our-dpo-training-signal-for-3-weeks-14j3</link>
      <guid>https://dev.to/marcuswwchen/llm-as-judge-variance-broke-our-dpo-training-signal-for-3-weeks-14j3</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: Our DPO pipeline used a single LLM as the preference judge. Training reward climbed every run. Production accuracy fell 4 points. The judge was flipping its own labels 28% of the time at temperature 0.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;

&lt;p&gt;Nexus Labs ships agents that book travel, file expenses, process insurance claims. Eight engineers on my fine-tuning team. We run DPO on Qwen2.5-32B, target latency under 800ms p95 on a single H100.&lt;/p&gt;

&lt;p&gt;Our preference data pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2,400 prompts sampled from production traces per cycle&lt;/li&gt;
&lt;li&gt;4 completions per prompt from the current checkpoint&lt;/li&gt;
&lt;li&gt;GPT-4o-mini grades pairwise preferences against a 6-axis rubric&lt;/li&gt;
&lt;li&gt;TRL DPO, 3 epochs, lr 5e-7, beta 0.1&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Standard recipe. Worked fine for two months.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we saw
&lt;/h2&gt;

&lt;p&gt;Week 9. Training loss curves looked clean. Reward margins grew run over run. Held-out eval reward climbed 0.62 → 0.71. Internal dashboards were green.&lt;/p&gt;

&lt;p&gt;Then product filed tickets. Latency was fine. Tool use accuracy on our production traffic mirror was down 4 points against the pre-DPO baseline. The thing we shipped to make the agent better made it worse.&lt;/p&gt;

&lt;p&gt;We trusted offline eval. We were wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  The investigation
&lt;/h2&gt;

&lt;p&gt;I rebuilt the judge call as a deterministic test. Same prompt, same two completions, GPT-4o-mini at temperature 0. Fired the API 50 times in a row.&lt;/p&gt;

&lt;p&gt;The judge flipped its preference 14 of 50 times. 28% self-disagreement on a single pair.&lt;/p&gt;

&lt;p&gt;That number alone should have killed the project. We had built a training signal on top of a weighted coin.&lt;/p&gt;

&lt;p&gt;Ran the test across 200 prompt pairs. Median self-disagreement was 19%. The tail was worse. 8% of pairs had over 40% flip rates, and those pairs were exactly the ambiguous multi-step agent traces we cared about most.&lt;/p&gt;
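&lt;p&gt;A minimal sketch of that repeat test, assuming the judge is wrapped in a callable that returns a label such as &lt;code&gt;"A"&lt;/code&gt; or &lt;code&gt;"B"&lt;/code&gt;. The names &lt;code&gt;flip_rate&lt;/code&gt; and &lt;code&gt;make_scripted_judge&lt;/code&gt; are hypothetical, and the scripted judge stands in for the real API call so the arithmetic can be checked offline. Flip rate here means the share of calls that disagree with the majority label.&lt;/p&gt;

```python
from collections import Counter
from typing import Callable, Iterable

# Assumed judge interface: (prompt, completion_a, completion_b) -> label.
Judge = Callable[[str, str, str], str]


def flip_rate(judge: Judge, prompt: str, completion_a: str,
              completion_b: str, trials: int = 50) -> float:
    """Share of repeated judge calls that disagree with the majority label."""
    labels = [judge(prompt, completion_a, completion_b) for _ in range(trials)]
    majority_count = Counter(labels).most_common(1)[0][1]
    return (trials - majority_count) / trials


def make_scripted_judge(labels: Iterable[str]) -> Judge:
    """Hypothetical stand-in for an API-backed judge: replays fixed labels."""
    replay = iter(labels)

    def judge(prompt: str, completion_a: str, completion_b: str) -> str:
        return next(replay)

    return judge


# 36 calls prefer A and 14 prefer B, mirroring the 14-of-50 result above.
scripted = make_scripted_judge(["A"] * 36 + ["B"] * 14)
print(flip_rate(scripted, "prompt", "completion 1", "completion 2"))  # 0.28
```

&lt;p&gt;Running the same function over every pair in a sample gives the per-pair distribution, which is where the median and the tail come from.&lt;/p&gt;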

&lt;h2&gt;
  
  
  What was actually happening
&lt;/h2&gt;

&lt;p&gt;DPO gradients care about margin. When labels are noisy, the model still gets a gradient, but the direction is garbage. Over thousands of pairs you converge on whatever spurious feature the judge weights at temperature 0. Which, surprise, is not what end users want.&lt;/p&gt;

&lt;p&gt;Our offline reward went up because the model learned the judge's quirks. Production accuracy dropped because the quirks weren't the task.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix
&lt;/h2&gt;

&lt;p&gt;Three changes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# preference_judging.yaml&lt;/span&gt;
&lt;span class="na"&gt;judges&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;anthropic&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpt-4o-2024-11-20&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;google&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemini-2.5-pro&lt;/span&gt;
&lt;span class="na"&gt;consensus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;min_agree&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
  &lt;span class="na"&gt;drop_pair_if_split&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;sampling&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;judges_per_pair&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
  &lt;span class="na"&gt;rotate_completion_order&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
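&lt;p&gt;One way to read the &lt;code&gt;consensus&lt;/code&gt; and &lt;code&gt;sampling&lt;/code&gt; settings above, as a hedged sketch rather than our pipeline code. It assumes each judge is a callable that returns &lt;code&gt;"first"&lt;/code&gt;, &lt;code&gt;"second"&lt;/code&gt;, or &lt;code&gt;"tie"&lt;/code&gt;, and that rotation simply alternates presentation order between judges. The function and stub judge names are hypothetical.&lt;/p&gt;

```python
from collections import Counter
from typing import Callable, Optional, Sequence

# Assumed judge interface: (prompt, first, second) -> "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def consensus_label(judges: Sequence[Judge], prompt: str,
                    completion_a: str, completion_b: str,
                    min_agree: int = 2) -> Optional[str]:
    """Return "A" or "B" when min_agree judges agree, else None (drop the pair)."""
    votes = []
    for index, judge in enumerate(judges):
        # Alternate presentation order so position bias does not always
        # favor the same completion.
        swapped = index % 2 == 1
        if swapped:
            verdict = judge(prompt, completion_b, completion_a)
        else:
            verdict = judge(prompt, completion_a, completion_b)
        if verdict == "tie":
            continue
        picked_first = verdict == "first"
        # Map the positional verdict back to the original A/B identity.
        votes.append("B" if picked_first == swapped else "A")
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agree else None


def prefers_longer(prompt: str, first: str, second: str) -> str:
    """Stub judge with a content preference: picks the longer completion."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


def always_first(prompt: str, first: str, second: str) -> str:
    """Stub judge with pure position bias."""
    return "first"


judges = [prefers_longer, prefers_longer, always_first]
print(consensus_label(judges, "prompt", "short", "a much longer answer"))  # B
```

&lt;p&gt;A &lt;code&gt;None&lt;/code&gt; result means the pair is dropped from the training set instead of being labeled by a minority vote.&lt;/p&gt;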



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Three judges, 2-of-3 majority.&lt;/strong&gt; Drop the pair if split. We lose 18% of pairs. Acceptable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rotate completion order per judge.&lt;/strong&gt; Position bias was ~7% on its own. Sonnet was closer to 2%, GPT-4o-mini was the worst offender.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bootstrap CIs on the eval set.&lt;/strong&gt; Report reward with a 95% interval, not a point estimate. Half of our prior "improvement" was inside the noise floor.&lt;/li&gt;
&lt;/ol&gt;
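&lt;p&gt;The interval in the third change can be computed with a percentile bootstrap. This is a minimal standard-library sketch, assuming one scalar score per eval example. &lt;code&gt;bootstrap_ci&lt;/code&gt; and its parameters are hypothetical names, and the scores below are synthetic, not our eval data.&lt;/p&gt;

```python
import random
from typing import Sequence, Tuple


def bootstrap_ci(scores: Sequence[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of per-example scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample the eval set with replacement and record each resample's mean.
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Synthetic pass/fail scores for illustration only.
data_rng = random.Random(1)
scores = [1.0 if data_rng.random() > 0.3 else 0.0 for _ in range(400)]
low, high = bootstrap_ci(scores)
mean_score = sum(scores) / len(scores)
print(f"mean={mean_score:.3f}  95% CI=({low:.3f}, {high:.3f})")
```

&lt;p&gt;If two checkpoints have overlapping intervals on the same eval set, the difference between their point estimates is not yet evidence of an improvement.&lt;/p&gt;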

&lt;p&gt;The judge fleet routes through Bifrost (&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;https://github.com/maximhq/bifrost&lt;/a&gt;). One OpenAI-compatible endpoint, automatic fallback when a provider degrades, per-judge token accounting in one place. We were already running three providers for app traffic, so the judge pool was a config change.&lt;/p&gt;

&lt;h2&gt;
  
  
  Numbers after the fix
&lt;/h2&gt;

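&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Single judge&lt;/th&gt;
&lt;th&gt;3-judge consensus&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Judge self-consistency&lt;/td&gt;
&lt;td&gt;72%&lt;/td&gt;
&lt;td&gt;94%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Production tool-use accuracy&lt;/td&gt;
&lt;td&gt;-4.0 pts&lt;/td&gt;
&lt;td&gt;+2.1 pts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Training pairs retained&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;82%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost per 10k pairs (USD)&lt;/td&gt;
&lt;td&gt;$11&lt;/td&gt;
&lt;td&gt;$34&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Eval-to-prod Spearman correlation&lt;/td&gt;
&lt;td&gt;0.31&lt;/td&gt;
&lt;td&gt;0.78&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;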

&lt;p&gt;Cost tripled. The signal went from misleading to useful. We take that trade every cycle.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;This isn't free and it isn't a silver bullet.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Judge cost.&lt;/strong&gt; 3x judges plus pair retries. Budget for it before you propose this to a director.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consensus isn't truth.&lt;/strong&gt; Three judges can agree on the wrong thing. We still sample 5% of pairs for human review weekly. That review process has caught two systematic biases all three LLM judges shared. Probably trained on overlapping data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency.&lt;/strong&gt; Preference labeling is no longer a same-afternoon job. Two-day turnaround on a full cycle now. Plan the data pipeline schedule around it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bad rubric, no rescue.&lt;/strong&gt; If your scoring criteria don't match what users care about, ensembling judges won't save you. We rewrote the rubric twice during this work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Position bias varies by model.&lt;/strong&gt; Don't assume. Measure.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The deeper point. Most teams I talk to treat the judge as an oracle and the model as the unknown. It's backwards. The model converges on whatever target you point it at. If the target wobbles, the model wobbles with it, and you won't see it in your reward curve.&lt;/p&gt;

&lt;p&gt;We spent three weeks training a model to imitate a noisy judge. The model worked. That was the bug.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/docs/trl/dpo_trainer" rel="noopener noreferrer"&gt;TRL DPO documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2306.05685" rel="noopener noreferrer"&gt;Zheng et al., "Judging LLM-as-a-Judge with MT-Bench"&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2305.17926" rel="noopener noreferrer"&gt;Wang et al., "Large Language Models are not Fair Evaluators"&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.vllm.ai" rel="noopener noreferrer"&gt;vLLM batched scoring patterns&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/retries-and-fallbacks" rel="noopener noreferrer"&gt;Bifrost fallback configuration&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model is the easy part.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>mlops</category>
      <category>llm</category>
      <category>pytorch</category>
    </item>
    <item>
      <title>Token-level eval harness for tool-calling agents: what we wired up</title>
      <dc:creator>Marcus Chen</dc:creator>
      <pubDate>Tue, 26 May 2026 16:03:35 +0000</pubDate>
      <link>https://dev.to/marcuswwchen/token-level-eval-harness-for-tool-calling-agents-what-we-wired-up-1m1b</link>
      <guid>https://dev.to/marcuswwchen/token-level-eval-harness-for-tool-calling-agents-what-we-wired-up-1m1b</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: We replaced our "did the agent finish the task" pass/fail eval with a token-level harness that scores tool selection, argument shape, and recovery behavior separately. Pass rate went from a single 72% number to four signals that actually tell us what broke. Bifrost sits in front as the provider switch so the same eval runs against four models without rewriting the harness.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At Nexus Labs we run agent automation for enterprise workflows. Twelve people on the team, around 40 tool definitions across the production agents, mix of GPT-4.1, Claude Sonnet 4.6, and a fine-tuned Qwen3 32B we serve ourselves on vLLM.&lt;/p&gt;

&lt;p&gt;Last quarter our eval suite told us the new agent build was "72% passing." Shipped it. Two customers reported the agent was silently picking the wrong tool and confabulating success. Pass rate didn't catch it because the final assistant message looked fine.&lt;/p&gt;

&lt;p&gt;So we rebuilt the harness.&lt;/p&gt;

&lt;h2&gt;
  
  
  The four signals
&lt;/h2&gt;

&lt;p&gt;End-to-end pass/fail is one number that hides everything. We split it.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Signal&lt;/th&gt;
&lt;th&gt;What it measures&lt;/th&gt;
&lt;th&gt;Failure mode it catches&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tool selection accuracy&lt;/td&gt;
&lt;td&gt;Did the agent pick the right tool at step N&lt;/td&gt;
&lt;td&gt;Picks &lt;code&gt;search_db&lt;/code&gt; when it should call &lt;code&gt;query_api&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Argument F1&lt;/td&gt;
&lt;td&gt;Token-level F1 on tool arguments vs gold&lt;/td&gt;
&lt;td&gt;Right tool, wrong filter or off-by-one date&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recovery rate&lt;/td&gt;
&lt;td&gt;After a tool returns an error, does the next step make sense&lt;/td&gt;
&lt;td&gt;Loops the same failing call three times&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Trajectory length delta&lt;/td&gt;
&lt;td&gt;Steps taken vs minimum needed&lt;/td&gt;
&lt;td&gt;Wanders for 11 steps on a 3-step task&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;None of these are novel on their own. The point is having all four on every run, per-model, per-tool. When our 72% number dropped to 68% on the new build, the breakdown showed argument F1 collapsed on date-range tools while selection stayed flat. That's a tokenizer regression on the fine-tune, not a reasoning regression. Different fix.&lt;/p&gt;
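&lt;p&gt;The two simplest signals can be sketched in a few lines. This assumes a trajectory is just the ordered list of tool names the agent called and that predicted and gold steps are compared position by position, which is a simplification. &lt;code&gt;search_db&lt;/code&gt; and &lt;code&gt;query_api&lt;/code&gt; come from the table above; &lt;code&gt;format_result&lt;/code&gt; and both function names are hypothetical.&lt;/p&gt;

```python
from typing import Sequence


def tool_selection_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Share of gold steps where the agent called the expected tool at that step."""
    if not gold:
        return 0.0
    hits = sum(1 for pred, want in zip(predicted, gold) if pred == want)
    return hits / len(gold)


def trajectory_length_delta(predicted: Sequence[str], min_steps: int) -> int:
    """Extra steps taken beyond the minimum needed for the task."""
    return len(predicted) - min_steps


gold_steps = ["query_api", "query_api", "format_result"]
agent_steps = ["search_db", "query_api", "format_result"]

print(tool_selection_accuracy(agent_steps, gold_steps))  # 2 of 3 steps match
print(trajectory_length_delta(agent_steps, min_steps=3))  # 0
```

&lt;p&gt;Aggregating these per model and per tool is what turns one pass rate into a breakdown that points at a specific failure.&lt;/p&gt;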

&lt;h2&gt;
  
  
  The eval loop
&lt;/h2&gt;

&lt;p&gt;We needed to run the same suite against four models without writing four clients. Bifrost handles that. One OpenAI-compatible endpoint, swap the &lt;code&gt;model&lt;/code&gt; string.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;eval_targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpt-4-1&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai/gpt-4.1&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sonnet-4-6&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;anthropic/claude-sonnet-4-6&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;qwen3-internal&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ollama/qwen3-32b-tools-v4&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cerebras-llama&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cerebras/llama-3.3-70b&lt;/span&gt;

&lt;span class="na"&gt;gateway&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;base_url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://bifrost:8080/v1&lt;/span&gt;
  &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;x-bf-virtual-key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${EVAL_VK}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The virtual key matters. We give the eval harness its own budget cap through Bifrost's &lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;governance&lt;/a&gt; so a runaway nightly run can't burn $4K on Anthropic before anyone notices. Last month a nightly run did run away. The key capped it at $200 and the rest of the requests were dropped. We got an email at 3am instead of a Slack thread the next morning.&lt;/p&gt;
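&lt;p&gt;To make "swap the model string" concrete, here is a hedged sketch that builds one request per target without sending anything. The model strings, base URL, and &lt;code&gt;x-bf-virtual-key&lt;/code&gt; header come from the config above. The &lt;code&gt;/chat/completions&lt;/code&gt; path is the usual OpenAI-compatible route and is an assumption here, as are the function name, the sample message, and the placeholder key.&lt;/p&gt;

```python
import json
from typing import Dict, List

# Target name -> model string, copied from the eval_targets config.
EVAL_TARGETS = {
    "gpt-4-1": "openai/gpt-4.1",
    "sonnet-4-6": "anthropic/claude-sonnet-4-6",
    "qwen3-internal": "ollama/qwen3-32b-tools-v4",
    "cerebras-llama": "cerebras/llama-3.3-70b",
}


def build_request(model: str, messages: List[Dict[str, str]],
                  base_url: str, virtual_key: str) -> Dict[str, object]:
    """Describe one chat-completions call; only the model string varies per target."""
    return {
        "url": base_url.rstrip("/") + "/chat/completions",
        "headers": {
            "Content-Type": "application/json",
            "x-bf-virtual-key": virtual_key,
        },
        "body": json.dumps({"model": model, "messages": messages}),
    }


messages = [{"role": "user", "content": "List open invoices for May."}]
requests_by_target = {
    name: build_request(model, messages, "http://bifrost:8080/v1", "EVAL_VK_PLACEHOLDER")
    for name, model in EVAL_TARGETS.items()
}
for name, request in requests_by_target.items():
    print(name, json.loads(request["body"])["model"])
```

&lt;p&gt;Each dictionary can be handed to whatever HTTP client the harness already uses. The URL and headers are identical across all four targets.&lt;/p&gt;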

&lt;p&gt;Semantic caching off for eval runs. Obvious reason: cached responses defeat the point. Bifrost lets you disable it per-request via header, &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;docs here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Argument F1, in code
&lt;/h2&gt;

&lt;p&gt;The non-obvious signal is argument F1. Most harnesses do exact-match on the JSON, which is brittle ("2026-05-26" vs "May 26, 2026" both call the right API but exact-match scores zero).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;arg_f1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predicted&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;pred_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tokenize_args&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predicted&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;gold_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tokenize_args&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gold&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
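    &lt;span class="c1"&gt;# A correct no-argument call has nothing to mismatch, so score it as a full match.&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;pred_tokens&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;gold_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;
    &lt;span class="c1"&gt;# Exactly one side is empty, so nothing can overlap.&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;pred_tokens&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;gold_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;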
    &lt;span class="n"&gt;tp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pred_tokens&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;gold_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tp&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
    &lt;span class="n"&gt;precision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tp&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pred_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;recall&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tp&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gold_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;precision&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;recall&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;precision&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;recall&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;tokenize_args&lt;/code&gt; flattens nested JSON and normalizes dates, IDs, and known enums. It's 80 lines. We diff against gold per-key and weight required keys higher than optional ones.&lt;/p&gt;

&lt;p&gt;This caught the Qwen regression. Selection accuracy was 91%, argument F1 dropped from 0.84 to 0.61 in one fine-tune iteration. Turned out the tokenizer was splitting ISO dates differently after we added a new SFT batch.&lt;/p&gt;
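&lt;p&gt;For readers who want a starting point, here is a much smaller stand-in for &lt;code&gt;tokenize_args&lt;/code&gt;. It is not our 80-line version: it only flattens nested arguments into a set of &lt;code&gt;path=value&lt;/code&gt; tokens and normalizes two assumed date formats, with no ID or enum handling and no key weighting. &lt;code&gt;normalize_value&lt;/code&gt; and &lt;code&gt;DATE_FORMATS&lt;/code&gt; are hypothetical names.&lt;/p&gt;

```python
from datetime import datetime
from typing import Any, Set

# Assumed input formats; month names are parsed using the current locale.
DATE_FORMATS = ("%Y-%m-%d", "%B %d, %Y")


def normalize_value(value: Any) -> str:
    """Canonicalize a leaf value; dates in a known format become ISO dates."""
    text = str(value).strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return text.lower()


def tokenize_args(args: Any, prefix: str = "") -> Set[str]:
    """Flatten nested JSON-like arguments into a set of "path=value" tokens."""
    tokens: Set[str] = set()
    if isinstance(args, dict):
        for key, value in args.items():
            tokens |= tokenize_args(value, f"{prefix}{key}.")
    elif isinstance(args, list):
        for index, value in enumerate(args):
            tokens |= tokenize_args(value, f"{prefix}{index}.")
    else:
        tokens.add(f"{prefix.rstrip('.')}={normalize_value(args)}")
    return tokens


predicted = {"filter": {"date": "May 26, 2026"}, "limit": 10}
gold = {"filter": {"date": "2026-05-26"}, "limit": 10}
print(tokenize_args(predicted) == tokenize_args(gold))  # True
```

&lt;p&gt;Because the output is a set, it plugs directly into the intersection in &lt;code&gt;arg_f1&lt;/code&gt;, and the two date spellings from the example above score as a match.&lt;/p&gt;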

&lt;h2&gt;
  
  
  Why Bifrost vs LiteLLM or Portkey
&lt;/h2&gt;

&lt;p&gt;Honest comparison. We tried all three.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Bifrost&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;th&gt;Portkey&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Provider count&lt;/td&gt;
&lt;td&gt;23+&lt;/td&gt;
&lt;td&gt;More (50+)&lt;/td&gt;
&lt;td&gt;~40&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-hosted free tier&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Built-in virtual keys with budget caps&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Plugin/proxy config&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Native Prometheus metrics&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Via callback&lt;/td&gt;
&lt;td&gt;Hosted-first&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency overhead (our measurement, p50)&lt;/td&gt;
&lt;td&gt;~1ms&lt;/td&gt;
&lt;td&gt;~3-4ms&lt;/td&gt;
&lt;td&gt;n/a (hosted)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;LiteLLM has more providers and a larger community. If you need a niche provider that's the safer bet. Portkey's hosted UX is more polished if you don't want to run anything. We picked Bifrost because the &lt;a href="https://docs.getbifrost.ai/features/observability/default" rel="noopener noreferrer"&gt;Prometheus integration&lt;/a&gt; is native (we already run Prometheus + Grafana) and the overhead was the lowest in our test. Your tradeoffs may differ.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and Limitations
&lt;/h2&gt;

&lt;p&gt;Token-level argument F1 needs gold labels. We hand-labeled 1,200 trajectories. That's not free. If your agent universe is huge and changing weekly, this approach gets expensive.&lt;/p&gt;

&lt;p&gt;Recovery rate is the noisiest signal. It needs a judge model to score whether the next step "makes sense" given the error, and judge models disagree with humans about 8% of the time in our spot checks. We use it as a trend indicator, not a gate.&lt;/p&gt;

&lt;p&gt;Adding a gateway adds a hop. ~1ms in our setup, but if your eval is running 50K trajectories overnight, that's still real wall-clock time. We accept it because the centralized rate limiting and budget caps are worth more than the millisecond.&lt;/p&gt;

&lt;p&gt;Bifrost's MCP gateway is enterprise-only. We use the open-source build, so for MCP tool routing we still wire that ourselves outside the gateway.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;Bifrost governance and virtual keys&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/observability/default" rel="noopener noreferrer"&gt;Bifrost observability defaults&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2211.09110" rel="noopener noreferrer"&gt;"Holistic Evaluation of Language Models" (Liang et al.)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.vllm.ai/en/latest/features/tool_calling.html" rel="noopener noreferrer"&gt;vLLM tool calling support&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2406.12045" rel="noopener noreferrer"&gt;Tau-bench: a benchmark for tool-agent-user interaction&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model is the easy part. The harness that tells you which model regressed, and why, is the actual work.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>llm</category>
      <category>mlops</category>
      <category>devops</category>
    </item>
    <item>
      <title>Prefix caching in vLLM under multi-tenant agent traffic</title>
      <dc:creator>Marcus Chen</dc:creator>
      <pubDate>Tue, 26 May 2026 06:35:20 +0000</pubDate>
      <link>https://dev.to/marcuswwchen/prefix-caching-in-vllm-under-multi-tenant-agent-traffic-5e2j</link>
      <guid>https://dev.to/marcuswwchen/prefix-caching-in-vllm-under-multi-tenant-agent-traffic-5e2j</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: We turned on vLLM's prefix cache for our agent workloads at Nexus Labs and watched TTFT drop from 480ms to 110ms on one tenant and stay exactly the same on another. The split wasn't about traffic volume. It was about how each team templated their system prompts.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;

&lt;p&gt;Our fine-tuning team serves 14 enterprise agents through a shared inference cluster. Four H100 nodes, vLLM 0.6.x, Qwen2.5-32B as the workhorse model. Traffic is bursty. One customer's nightly workflow can hit 8k requests in twenty minutes while another trickles through 30 calls an hour.&lt;/p&gt;

&lt;p&gt;Before turning on prefix caching, TTFT across the cluster sat at 410ms p50, 1.2s p95. Cost wasn't the urgent problem. Latency was, because agents loop. A 400ms TTFT on a 12-step plan turns into 4.8 seconds of dead time before the user sees anything.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the cache actually does
&lt;/h2&gt;

&lt;p&gt;vLLM's prefix cache keeps KV blocks for tokens it has already processed. If a new request shares a prefix with something in the cache, those blocks get reused instead of recomputed. The unit is a block (16 tokens by default), so reuse stops at the last full block boundary inside the shared prefix; a partially matching block is recomputed.&lt;/p&gt;

&lt;p&gt;If your system prompt is 1,024 tokens and identical across requests, you skip prefill for 1,024 tokens. At Qwen2.5-32B prefill speeds, that's roughly 90 to 110ms saved per call on our hardware.&lt;/p&gt;
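
&lt;p&gt;A minimal sketch of that block arithmetic, assuming token IDs are plain integers and ignoring everything else vLLM does (hashing, eviction, scheduling). The function names are made up for illustration and are not vLLM's API.&lt;/p&gt;

```python
BLOCK_SIZE = 16  # block size quoted in the text; adjust if you pass --block-size


def shared_prefix_len(a, b):
    """Count the leading token IDs that two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def reusable_tokens(cached, incoming, block_size=BLOCK_SIZE):
    """Tokens of `incoming` covered by whole blocks shared with `cached`."""
    common = shared_prefix_len(cached, incoming)
    # Partial blocks are never reused, so round down to a block boundary.
    return (common // block_size) * block_size


system_prompt = list(range(1024))          # stand-in for a 1,024-token prompt
first_call = system_prompt + [9001, 9002]  # same prompt, one user turn
second_call = system_prompt + [7001]       # same prompt, different user turn

print(reusable_tokens(first_call, second_call))  # 1024

# Change one token at index 20 and only the first full block survives.
edited = system_prompt[:20] + [-1] + system_prompt[21:] + [7001]
print(reusable_tokens(first_call, edited))  # 16
```

&lt;p&gt;The second call is the failure mode to watch for: one early edit costs you almost the entire prefix, not one block.&lt;/p&gt;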

&lt;h2&gt;
  
  
  Where it worked
&lt;/h2&gt;

&lt;p&gt;Tenant A's agent uses a fixed system prompt assembled at deploy time. Same 1,847 tokens for every request, byte-for-byte. After we flipped &lt;code&gt;enable_prefix_caching=True&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TTFT p50: 480ms → 110ms&lt;/li&gt;
&lt;li&gt;TTFT p95: 1.4s → 280ms&lt;/li&gt;
&lt;li&gt;GPU prefill compute dropped by 38%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Their hit rate ran around 94% steady-state. The 6% misses were cold starts after pod restarts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where it didn't
&lt;/h2&gt;

&lt;p&gt;Tenant B's agent rebuilds its system prompt every call. They inject the current timestamp, a session UUID, and a hash of recent tool outputs into the first 200 tokens. Looked stable on paper. In practice, every request had a unique prefix starting at token 47.&lt;/p&gt;

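&lt;p&gt;vLLM caches at block granularity, and a block only matches if every token before it matches too. One differing token invalidates its own block and every block after it. With the first difference at token 47, at most the first two 16-token blocks could ever be reused. Tenant B's hit rate: 0.3%.&lt;/p&gt;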

&lt;p&gt;We didn't catch this in staging because our staging traffic replays canned prompts. The diff between tenants only showed up under real traffic.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix for Tenant B
&lt;/h2&gt;

&lt;p&gt;I talked their team into pushing the volatile fields to the end of the prompt. Took two hours of refactoring on their side. After:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TTFT p50: 510ms → 145ms&lt;/li&gt;
&lt;li&gt;Hit rate: 0.3% → 87%&lt;/li&gt;
&lt;/ul&gt;
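
&lt;p&gt;The reordering can be sketched roughly like this. The prompt text, field names, and layout are hypothetical stand-ins for Tenant B's template, and shared characters are only a rough proxy for shared tokens.&lt;/p&gt;

```python
import os

# Hypothetical stand-in for Tenant B's instructions; the real prompt is longer.
INSTRUCTIONS = "You are a sales-ops agent. Use only the tools listed below. " * 20


def volatile_first(ts, session_id, tool_hash):
    """Old layout: per-request fields sit near the top of the prompt."""
    header = f"ts={ts} session={session_id} tools={tool_hash}\n"
    return header + INSTRUCTIONS


def stable_first(ts, session_id, tool_hash):
    """New layout: static instructions first, per-request fields at the tail."""
    footer = f"\nts={ts} session={session_id} tools={tool_hash}"
    return INSTRUCTIONS + footer


def shared_chars(a, b):
    """Length of the common leading substring of two prompts."""
    return len(os.path.commonprefix([a, b]))


req1 = ("2026-05-26T06:35:20Z", "a1f3", "9c2e")
req2 = ("2026-05-26T06:35:21Z", "b7d0", "41aa")

# Old layout: the prompts diverge inside the timestamp, a few characters in.
print(shared_chars(volatile_first(*req1), volatile_first(*req2)))
# New layout: the whole instruction block is shared before anything differs.
print(shared_chars(stable_first(*req1), stable_first(*req2)))
```

&lt;p&gt;Printing the shared-prefix length for two consecutive production requests is a cheap check to run before blaming the cache.&lt;/p&gt;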

&lt;p&gt;Then they asked why nobody mentioned this in the vLLM docs. The docs do mention it. Nobody reads docs when defaults already look fine on the neighboring tenant.&lt;/p&gt;

&lt;h2&gt;
  
  
  Config
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# vllm serve flags we landed on&lt;/span&gt;
&lt;span class="s"&gt;--model Qwen/Qwen2.5-32B-Instruct&lt;/span&gt;
&lt;span class="s"&gt;--enable-prefix-caching&lt;/span&gt;
&lt;span class="s"&gt;--block-size &lt;/span&gt;&lt;span class="m"&gt;16&lt;/span&gt;
&lt;span class="s"&gt;--gpu-memory-utilization &lt;/span&gt;&lt;span class="m"&gt;0.92&lt;/span&gt;
&lt;span class="s"&gt;--max-num-seqs &lt;/span&gt;&lt;span class="m"&gt;256&lt;/span&gt;
&lt;span class="s"&gt;--swap-space &lt;/span&gt;&lt;span class="m"&gt;16&lt;/span&gt;
&lt;span class="s"&gt;--preemption-mode recompute&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;--preemption-mode recompute&lt;/code&gt; matters under memory pressure. We tried &lt;code&gt;swap&lt;/code&gt; and watched the cache thrash when bursts hit. Recompute drops a preempted sequence's KV blocks and rebuilds them when it is rescheduled, instead of copying them out to CPU memory and back.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload&lt;/th&gt;
&lt;th&gt;Prompt structure&lt;/th&gt;
&lt;th&gt;Hit rate&lt;/th&gt;
&lt;th&gt;TTFT p50 before&lt;/th&gt;
&lt;th&gt;TTFT p50 after&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tenant A (fixed)&lt;/td&gt;
&lt;td&gt;Static 1,847-token prefix&lt;/td&gt;
&lt;td&gt;94%&lt;/td&gt;
&lt;td&gt;480ms&lt;/td&gt;
&lt;td&gt;110ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tenant B (before fix)&lt;/td&gt;
&lt;td&gt;Volatile fields at token 47&lt;/td&gt;
&lt;td&gt;0.3%&lt;/td&gt;
&lt;td&gt;510ms&lt;/td&gt;
&lt;td&gt;505ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tenant B (after fix)&lt;/td&gt;
&lt;td&gt;Volatile fields moved to tail&lt;/td&gt;
&lt;td&gt;87%&lt;/td&gt;
&lt;td&gt;510ms&lt;/td&gt;
&lt;td&gt;145ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Internal eval pipeline&lt;/td&gt;
&lt;td&gt;Per-eval unique prompts&lt;/td&gt;
&lt;td&gt;4%&lt;/td&gt;
&lt;td&gt;390ms&lt;/td&gt;
&lt;td&gt;380ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The eval pipeline row is honest. Prefix caching does nothing for workloads where every prompt is genuinely unique. We left it on anyway because the overhead is negligible.&lt;/p&gt;

&lt;p&gt;For routing across providers when we burst beyond self-hosted capacity, we run a small gateway in front (Bifrost is what we landed on, but the principle works with any of them). The local cache only helps for traffic that lands back on our own nodes, not the failover path.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;The cache costs GPU memory. We reserved roughly 14% of HBM for cached blocks at our &lt;code&gt;max-num-seqs&lt;/code&gt; setting. That's KV-cache memory we can't use for batch concurrency. Worth it for us because TTFT mattered more than throughput. Not worth it if you're optimizing for tokens-per-second on offline batch.&lt;/p&gt;

&lt;p&gt;Cache invalidation is binary at block boundaries. A one-token change at position 0 kills the whole prefix. No fuzzy matching. Semantic-caching products exist for that, but they're a different beast. They cache responses, not KV state, and the failure modes differ.&lt;/p&gt;

&lt;p&gt;The cache is per-node. We have four nodes behind a round-robin LB, so the same prompt hits a cold cache 75% of the time on first contact. We considered sticky routing by prompt hash. Decided the complexity wasn't worth a 200ms improvement on first-contact latency. Maybe later.&lt;/p&gt;
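
&lt;p&gt;For reference, the sticky routing we decided against would look something like this. It is a sketch under assumptions: the node names are hypothetical, and the key is a hash of the prompt's leading characters, not of tokens.&lt;/p&gt;

```python
import hashlib

NODES = ["node-0", "node-1", "node-2", "node-3"]  # hypothetical node names


def pick_node(prompt, nodes=NODES, prefix_chars=1024):
    """Choose a node from a hash of the prompt's leading characters.

    Requests that share their first `prefix_chars` characters land on the
    same node, so that node's prefix cache stays warm for them.
    """
    key = prompt[:prefix_chars].encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


system_prompt = "static instructions " * 100  # 2,000 characters, stand-in
a = pick_node(system_prompt + "user turn one")
b = pick_node(system_prompt + "a completely different user turn")
print(a == b)  # True
```

&lt;p&gt;The catch is the part we called complexity: one heavy tenant now pins its load to a single node, and plain modulo hashing reshuffles every prefix when the node count changes.&lt;/p&gt;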

&lt;p&gt;The model is the easy part. Knowing where your tokens go is the hard part.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.vllm.ai/en/latest/automatic_prefix_caching/apc.html" rel="noopener noreferrer"&gt;vLLM automatic prefix caching docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2309.06180" rel="noopener noreferrer"&gt;PagedAttention paper (Kwon et al., 2023)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/docs/text-generation-inference" rel="noopener noreferrer"&gt;Hugging Face TGI prefix caching notes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lmsys.org/blog/2024-01-17-sglang/" rel="noopener noreferrer"&gt;SGLang RadixAttention writeup&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>llm</category>
      <category>mlops</category>
      <category>infrastructure</category>
      <category>pytorch</category>
    </item>
    <item>
      <title>We Audited Our Agent Tool-Call Traces. Half Our Eval Data Was Garbage.</title>
      <dc:creator>Marcus Chen</dc:creator>
      <pubDate>Mon, 25 May 2026 16:03:44 +0000</pubDate>
      <link>https://dev.to/marcuswwchen/we-audited-our-agent-tool-call-traces-half-our-eval-data-was-garbage-152m</link>
      <guid>https://dev.to/marcuswwchen/we-audited-our-agent-tool-call-traces-half-our-eval-data-was-garbage-152m</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: We pulled 41,000 production agent traces at Nexus Labs to build a fine-tuning dataset. After a manual audit of 1,200 of them, ~48% were unusable: tool calls that "succeeded" but returned wrong data, retries masking provider failures, and silent fallbacks that changed which model answered. Putting Bifrost in front of the agent fleet fixed the trace problem more than any sampling strategy we tried.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We run an enterprise agent product. Sales-ops automations mostly. Each user task ends up as a chain of 8-40 tool calls across a planner model, a worker model, and roughly 12 internal tools.&lt;/p&gt;

&lt;p&gt;For the last quarter my team has been building a fine-tune dataset from real traces. The plan was straightforward. Pull successful task completions. Filter by user thumbs-up. Use the trace as the training signal.&lt;/p&gt;

&lt;p&gt;It did not work.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "successful" actually meant in our traces
&lt;/h2&gt;

&lt;p&gt;The first audit pass was 1,200 traces, two engineers, three weeks. We tagged each trace as "clean", "noisy", or "corrupted".&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;% of traces&lt;/th&gt;
&lt;th&gt;What it meant&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Clean&lt;/td&gt;
&lt;td&gt;52%&lt;/td&gt;
&lt;td&gt;Tool calls returned correct data, model picked the right next step&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Noisy&lt;/td&gt;
&lt;td&gt;31%&lt;/td&gt;
&lt;td&gt;Right answer eventually, but with hidden retries, fallback to a different model, or stale cache hits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Corrupted&lt;/td&gt;
&lt;td&gt;17%&lt;/td&gt;
&lt;td&gt;Trace claimed success, output was wrong. User had not noticed yet.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The noisy category is the one that broke me. We had been treating these as gold-standard data. One example: the planner called &lt;code&gt;crm_lookup&lt;/code&gt;, got a 500, retried twice, then succeeded on a fallback Anthropic key, while the original trace span still pointed at OpenAI gpt-4o. The training pair we would have generated: "given this user input, output this tool call sequence." But the sequence was the result of three providers and two model versions stitched together. No reproducibility.&lt;/p&gt;

&lt;p&gt;Worse: nothing in our trace told us which model actually produced the final answer. We had a &lt;code&gt;model&lt;/code&gt; field. It logged whichever provider was configured at request start.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why we ended up putting a gateway in front of everything
&lt;/h2&gt;

&lt;p&gt;We tried two things first. Both partial fixes.&lt;/p&gt;

&lt;p&gt;The first was logging at the application layer. Wrap every provider call, log model, latency, retry count, fallback path. This works until you have four services calling four SDKs with four retry policies. Our Python service used the official &lt;code&gt;openai&lt;/code&gt; client. Our Go service used a hand-rolled HTTP client. The TypeScript planner used Vercel AI SDK. Three different definitions of "retry".&lt;/p&gt;

&lt;p&gt;The second was forcing all traffic through LiteLLM. It got us to a unified call surface but the observability was thin for our needs, and the failover behaviour was harder to reason about under load. Not a knock on LiteLLM, it just was not the shape we wanted.&lt;/p&gt;

&lt;p&gt;We migrated the fleet behind Bifrost about five months ago. Two reasons specific to our problem:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The &lt;code&gt;Automatic Fallbacks&lt;/code&gt; config makes the fallback chain a first-class object.&lt;/strong&gt; When a request fails over from Anthropic to Bedrock, that is in the response metadata. Not in three different log lines you have to join.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Native Prometheus metrics&lt;/strong&gt; (see the observability docs) mean &lt;code&gt;bifrost_requests_total&lt;/code&gt; is tagged by the provider that actually served the request, not the one we asked for.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here is a chunk of the config that mattered for trace cleanup:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;openai&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;env.OPENAI_API_KEY_1&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.7&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;env.OPENAI_API_KEY_2&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.3&lt;/span&gt;
  &lt;span class="na"&gt;anthropic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;env.ANTHROPIC_API_KEY&lt;/span&gt;

&lt;span class="na"&gt;fallbacks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai/gpt-4o&lt;/span&gt;
    &lt;span class="na"&gt;fallback_to&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;anthropic/claude-sonnet-4-6&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;openai/gpt-4o-mini&lt;/span&gt;

&lt;span class="na"&gt;logging&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;include_fallback_chain&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;include_provider_actual&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The two &lt;code&gt;include_*&lt;/code&gt; flags meant every trace span we emitted downstream had a deterministic answer to "who served this token". Our corrupted-trace rate on the next 5,000 sampled traces dropped from 17% to under 3%.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the audit actually changed about our fine-tuning
&lt;/h2&gt;

&lt;p&gt;We stopped using user thumbs-up as the primary filter. Thumbs-up correlates with "user got what they wanted eventually", not "the model made the right call". Now the filter is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Single-provider, single-model trace (no fallback fired)&lt;/li&gt;
&lt;li&gt;No retry on any tool call&lt;/li&gt;
&lt;li&gt;Tool call result schemas validated post-hoc against a recorded ground truth&lt;/li&gt;
&lt;li&gt;Span timing within 1.5x median for that task class&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That filter throws away about 71% of our raw traces. Painful. But the 29% that survives is data we can actually train on.&lt;/p&gt;
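
&lt;p&gt;As a rough sketch, the four rules translate to something like the following. The &lt;code&gt;Trace&lt;/code&gt; fields are hypothetical names, not our real schema, and computing the median over all raw traces in a task class is an assumption of the sketch.&lt;/p&gt;

```python
from dataclasses import dataclass
from operator import le
from statistics import median


@dataclass
class Trace:
    # Hypothetical field names; map them to whatever your trace store records.
    task_class: str
    providers: tuple    # every provider that served some part of the trace
    models: tuple       # every model that served some part of the trace
    retries: int        # total tool-call retries
    schema_valid: bool  # result of the post-hoc ground-truth validation
    duration_s: float


def trainable(traces, slack=1.5):
    """Return the traces that pass all four filter rules."""
    by_class = {}
    for t in traces:
        by_class.setdefault(t.task_class, []).append(t.duration_s)
    limits = {cls: slack * median(ds) for cls, ds in by_class.items()}

    return [
        t for t in traces
        if len(set(t.providers)) == 1 and len(set(t.models)) == 1  # no fallback
        and t.retries == 0
        and t.schema_valid
        and le(t.duration_s, limits[t.task_class])  # within 1.5x class median
    ]


traces = [
    Trace("crm_update", ("openai",), ("gpt-4o",), 0, True, 10.0),
    Trace("crm_update", ("openai", "anthropic"),
          ("gpt-4o", "claude-sonnet-4-6"), 2, True, 12.0),
    Trace("crm_update", ("openai",), ("gpt-4o",), 0, True, 40.0),
]
print(len(trainable(traces)))  # 1
```

&lt;p&gt;Only the first trace survives: the second had a fallback and retries, the third ran far past 1.5x the class median of 12 seconds.&lt;/p&gt;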

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;Honest take on what this did not solve.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bifrost is not a debugger.&lt;/strong&gt; It tells you which provider served the request and whether a fallback fired. It does not tell you whether the tool result was &lt;em&gt;correct&lt;/em&gt;. We still need the post-hoc schema validation pass.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic caching (docs) made the corruption worse before it got better.&lt;/strong&gt; Cache hits looked like fresh model calls in our old logging. We had to explicitly tag cached responses in the trace pipeline. Once tagged, fine, but the default was confusing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LiteLLM has a larger provider list in the long tail.&lt;/strong&gt; If you need niche providers, check both before committing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Portkey's prompt management UI is nicer.&lt;/strong&gt; We do prompt management elsewhere so it did not matter for us. If you want one tool for both, Portkey is worth a look.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The MCP gateway feature (docs) is interesting but we have not put it in production.&lt;/strong&gt; Cannot vouch for it yet.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model is the easy part. The infrastructure around the trace is where your eval dataset lives or dies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Bifrost retries and fallbacks docs&lt;/li&gt;
&lt;li&gt;Bifrost observability defaults&lt;/li&gt;
&lt;li&gt;LiteLLM proxy docs for honest comparison&lt;/li&gt;
&lt;li&gt;Anthropic's tool use guide — the trace structure section is the relevant one&lt;/li&gt;
&lt;li&gt;OpenTelemetry GenAI semantic conventions — what we wish our old logging had matched&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>mlops</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>Why Your LLM Eval Harness Is Lying to You (And How to Fix It)</title>
      <dc:creator>Marcus Chen</dc:creator>
      <pubDate>Fri, 22 May 2026 06:32:58 +0000</pubDate>
      <link>https://dev.to/marcuswwchen/why-your-llm-eval-harness-is-lying-to-you-and-how-to-fix-it-2dmb</link>
      <guid>https://dev.to/marcuswwchen/why-your-llm-eval-harness-is-lying-to-you-and-how-to-fix-it-2dmb</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: Most eval harnesses I see in production are measuring the wrong thing. They report 87% pass rate on a static suite that hasn't been touched in four months, while the model silently regresses on the queries that actually matter. Here's how we restructured ours at Nexus Labs after a bad week in February.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We shipped a fine-tuned Llama 3.1 70B variant in late January. Eval score: 91.2 on our internal suite. Two weeks later, support tickets spiked. Customers running multi-step agent workflows were getting truncated tool calls roughly 12% of the time. Our eval suite caught zero of these.&lt;/p&gt;

&lt;p&gt;The suite wasn't broken. It was answering a question nobody had asked in months.&lt;/p&gt;

&lt;h2&gt;
  
  
  The static-suite trap
&lt;/h2&gt;

&lt;p&gt;Here's the pattern I keep seeing. Team builds an eval set of 500 examples around the time they ship v1. Each example gets a reference answer and a string-match or embedding-similarity check. The suite becomes the source of truth. CI gates on it. Dashboards graph it. Nobody questions it.&lt;/p&gt;

&lt;p&gt;But your traffic distribution shifts. New customer onboards with a different query pattern. A prompt change upstream alters tool-call frequency. The suite still passes because the suite hasn't moved.&lt;/p&gt;

&lt;p&gt;We pulled three months of production traces and binned them by intent cluster. The original eval suite covered four of the eleven clusters that showed up in real traffic. The four it covered were the easiest ones.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we actually changed
&lt;/h2&gt;

&lt;p&gt;Three things. None of them clever.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Replay-based eval, refreshed weekly.&lt;/strong&gt; We sample 2,000 real production traces per week, strip PII, and run them through the candidate model. We compare structured outputs (tool calls, JSON fields) against the production response using exact match on tool name plus a learned judge for arguments. Free-form text gets a pairwise preference check against the current production model using a separate judge model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Cluster-stratified sampling.&lt;/strong&gt; Embed every trace with &lt;code&gt;text-embedding-3-large&lt;/code&gt;, cluster with HDBSCAN, sample proportionally. This stops the eval from being dominated by the one chatty customer who sends 40% of traffic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Adversarial slices owned by humans.&lt;/strong&gt; Our support team flags any ticket that traces back to a model failure. Those traces get added to a permanent adversarial set. That set grows. It never shrinks. Currently sitting at 847 examples and climbing.&lt;/p&gt;
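
&lt;p&gt;The proportional sampling in step 2 can be sketched without the clustering itself. This assumes you already have one cluster label per trace (for example from HDBSCAN, where treating the noise label &lt;code&gt;-1&lt;/code&gt; as its own stratum is an assumption of the sketch) and that the sample is no larger than the trace pool.&lt;/p&gt;

```python
import random
from collections import defaultdict


def stratified_sample(labels, n, seed=0):
    """Pick n trace indices, giving each cluster slots in proportion to its size."""
    groups = defaultdict(list)
    for idx, label in enumerate(labels):
        groups[label].append(idx)

    total = len(labels)
    quotas = {c: len(ix) * n / total for c, ix in groups.items()}
    alloc = {c: int(q) for c, q in quotas.items()}
    # Hand leftover slots to the clusters with the largest fractional parts.
    leftover = n - sum(alloc.values())
    by_fraction = sorted(groups, key=lambda c: quotas[c] - alloc[c], reverse=True)
    for c in by_fraction[:leftover]:
        alloc[c] += 1

    rng = random.Random(seed)
    picked = []
    for c, ix in groups.items():
        picked.extend(rng.sample(ix, alloc[c]))
    return sorted(picked)


labels = [0] * 60 + [1] * 30 + [2] * 10  # three clusters of 60, 30, 10 traces
sample = stratified_sample(labels, 10)
print([labels[i] for i in sample])  # six 0s, three 1s, one 2
```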

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;eval_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replay&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;sample_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2000&lt;/span&gt;
    &lt;span class="na"&gt;window_days&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;7&lt;/span&gt;
    &lt;span class="na"&gt;strip_pii&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;cluster_method&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hdbscan&lt;/span&gt;
    &lt;span class="na"&gt;min_cluster_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;15&lt;/span&gt;
  &lt;span class="na"&gt;judges&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;structured&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;exact_match_with_arg_judge&lt;/span&gt;
    &lt;span class="na"&gt;freeform&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pairwise_preference&lt;/span&gt;
    &lt;span class="na"&gt;judge_model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;
  &lt;span class="na"&gt;adversarial&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./evals/adversarial_permanent.jsonl&lt;/span&gt;
    &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3.0&lt;/span&gt;
  &lt;span class="na"&gt;gates&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;regression_threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.02&lt;/span&gt;
    &lt;span class="na"&gt;adversarial_floor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.85&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;weight: 3.0&lt;/code&gt; on adversarial is deliberate. Those examples represent real customer pain. A 1% regression on adversarial costs us more than a 1% regression on the easy cases.&lt;/p&gt;
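
&lt;p&gt;One way to read the weight and the two gates, as a sketch. The exact gate semantics here are an assumption: a candidate is blocked if its weighted score drops more than &lt;code&gt;regression_threshold&lt;/code&gt; below the baseline, or if its adversarial pass rate falls under &lt;code&gt;adversarial_floor&lt;/code&gt;. Function names are illustrative.&lt;/p&gt;

```python
from operator import ge, le


def weighted_pass_rate(replay, adversarial, adv_weight=3.0):
    """Pass rate over 0/1 results, counting each adversarial example adv_weight times."""
    passed = sum(replay) + adv_weight * sum(adversarial)
    total = len(replay) + adv_weight * len(adversarial)
    return passed / total


def passes_gates(candidate, baseline, adv_rate,
                 regression_threshold=0.02, adversarial_floor=0.85):
    """True only if the drop is small enough and the adversarial floor holds."""
    drop = baseline - candidate
    return le(drop, regression_threshold) and ge(adv_rate, adversarial_floor)


replay = [1] * 90 + [0] * 10      # 90 of 100 replay traces pass
adversarial = [1] * 8 + [0] * 2   # 8 of 10 adversarial traces pass

candidate = weighted_pass_rate(replay, adversarial)  # (90 + 24) / (100 + 30)
adv_rate = sum(adversarial) / len(adversarial)

# Baseline score of 0.88 is a made-up number for the example.
print(passes_gates(candidate, 0.88, adv_rate))  # False
```

&lt;p&gt;The weighted score is within the regression threshold here. The candidate is still blocked, because 0.80 on adversarial is under the floor.&lt;/p&gt;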

&lt;h2&gt;
  
  
  Routing the eval traffic
&lt;/h2&gt;

&lt;p&gt;Running 2,000 traces against a candidate model plus a judge model plus the production baseline gets expensive fast. We were burning $400/week on judge calls alone before we got serious about caching and routing.&lt;/p&gt;

&lt;p&gt;Two things helped. First, semantic caching on the judge prompts. The same trace evaluated twice against the same model pair should not cost twice. Second, we route across providers based on per-token cost for the judge role specifically. We use Bifrost (&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;https://github.com/maximhq/bifrost&lt;/a&gt;) for this because it gives us one OpenAI-compatible endpoint and lets us shift judge traffic between Anthropic and Google without touching the eval code. LiteLLM works similarly if that's already in your stack.&lt;/p&gt;

&lt;p&gt;Cost dropped to $140/week. Same coverage.&lt;/p&gt;
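
&lt;p&gt;The "should not cost twice" part does not need anything semantic. Here is a sketch of the plain exact-match variant, which is a simpler technique than the semantic caching we actually run; every name in it is hypothetical.&lt;/p&gt;

```python
import hashlib
import json


def judge_cache_key(trace_id, candidate_model, baseline_model, judge_model, run=0):
    """Stable key for one judged comparison; argument names are hypothetical."""
    payload = json.dumps(
        [trace_id, candidate_model, baseline_model, judge_model, run],
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class JudgeCache:
    """In-memory exact-match cache; swap the dict for Redis or disk as needed."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_call(self, key, call_judge):
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        verdict = call_judge()
        self._store[key] = verdict
        return verdict


cache = JudgeCache()
key = judge_cache_key("trace-123", "candidate-v2", "prod-v1", "judge-model")
for _ in range(2):
    cache.get_or_call(key, lambda: "candidate")  # stand-in for a paid judge call
print(cache.hits, cache.misses)  # 1 1
```

&lt;p&gt;The &lt;code&gt;run&lt;/code&gt; index is in the key on purpose. If you sample the judge several times per comparison, each sample needs its own entry or the cache hands back the same vote every time.&lt;/p&gt;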

&lt;h2&gt;
  
  
  Comparison: what we tried
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Coverage of real traffic&lt;/th&gt;
&lt;th&gt;Maintenance cost&lt;/th&gt;
&lt;th&gt;Catches silent regressions&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Static curated suite&lt;/td&gt;
&lt;td&gt;Low (drifts fast)&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Rarely&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pure replay&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Sometimes (misses rare-but-critical)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Replay + cluster sampling + adversarial&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Medium-high&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM-judge-only with no replay&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Inconsistent&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;Replay-based eval has real problems and I don't want to undersell them.&lt;/p&gt;

&lt;p&gt;Judge models are not ground truth. Pairwise preference between two model outputs is noisy. We run each comparison three times with temperature 0.3 and take majority vote. Even then, agreement with human raters sits around 78% on our adversarial slice. Useful, not authoritative.&lt;/p&gt;
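
&lt;p&gt;The voting itself is a few lines. In this sketch &lt;code&gt;judge&lt;/code&gt; is any callable that returns a label; in practice it would wrap a model call at temperature 0.3, which is faked here with scripted answers.&lt;/p&gt;

```python
from collections import Counter


def majority_verdict(judge, prompt, runs=3):
    """Call `judge` several times and return the most common verdict.

    With two possible labels and an odd number of runs there is always a
    strict majority. Also returns the share of runs that agreed with it.
    """
    votes = Counter(judge(prompt) for _ in range(runs))
    verdict, count = votes.most_common(1)[0]
    return verdict, count / runs


# Scripted stand-in for three noisy judge calls on the same comparison.
scripted = iter(["candidate", "baseline", "candidate"])
verdict, agreement = majority_verdict(lambda p: next(scripted), "compare A and B")
print(verdict, round(agreement, 2))  # candidate 0.67
```

&lt;p&gt;Logging the agreement share next to the verdict is worth the extra column. A run of 2-to-1 splits on one slice tells you the judge is guessing there.&lt;/p&gt;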

&lt;p&gt;PII stripping is fragile. We use a regex stack plus a small NER model. We still find leakage occasionally during audits. If your domain has strict data handling rules, you may need synthetic replays instead of real ones, which loses some of the distributional fidelity that makes this work.&lt;/p&gt;

&lt;p&gt;Replay assumes today's traffic looks like tomorrow's. For a stable product, fine. For one shipping new features weekly, you're always one release behind.&lt;/p&gt;

&lt;p&gt;And the adversarial set has a selection bias. We only add examples that humans flagged. Failures nobody noticed don't make it in. We try to compensate by manually sampling 50 random traces per week for human review, but we're not closing the loop completely.&lt;/p&gt;

&lt;h2&gt;
  
  
  What hasn't worked
&lt;/h2&gt;

&lt;p&gt;Tried benchmark suites like MT-Bench and HELM as our primary gate. Useless for our domain. They measure general capability. We don't ship general capability. We ship agent reliability on a narrow task surface.&lt;/p&gt;

&lt;p&gt;Tried a single LLM-as-judge with one rubric. Too much variance. Rubric drift between runs was higher than the signal we were trying to measure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/EleutherAI/lm-evaluation-harness" rel="noopener noreferrer"&gt;Eleuther's lm-evaluation-harness&lt;/a&gt; — good reference for general benchmark plumbing&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/anthropics/anthropic-cookbook/tree/main/misc" rel="noopener noreferrer"&gt;Anthropic's evals cookbook&lt;/a&gt; — pairwise judge patterns worth borrowing&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://hdbscan.readthedocs.io/" rel="noopener noreferrer"&gt;HDBSCAN docs&lt;/a&gt; — clustering algorithm we use for stratification&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://hamel.dev/blog/posts/evals/" rel="noopener noreferrer"&gt;Hamel Husain on evals&lt;/a&gt; — the post that pushed us to take replay seriously&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model is the easy part. The eval is where you find out if you actually shipped what you thought you shipped.&lt;/p&gt;

</description>
      <category>mlops</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
