DEV Community: Marcus Chen

Shadow AI Is Your Biggest Infra Risk: An Endpoint Governance Guide

Marcus Chen — Tue, 14 Jul 2026 15:42:10 +0000

Shadow AI introduces critical data exfiltration, compliance, and security risks. This guide explores how a combined AI gateway and endpoint agent solution, like Bifrost and Bifrost Edge, delivers comprehensive governance across the enterprise.

The proliferation of AI tools in daily workflows presents a significant challenge for enterprise security and IT teams: shadow AI. This refers to the unsanctioned use of AI applications, platforms, and models by employees without oversight or approval from an organization's central IT or security functions. While employees often adopt these tools for increased productivity, their usage creates a critical blind spot that can lead to substantial infrastructure risks. Research indicates that 98% of organizations have employees using AI tools that were never reviewed or approved by IT or security teams.

Understanding Shadow AI: The Hidden Risks

Shadow AI is not a future threat; it is already embedded in everyday work, often driven by well-meaning employees seeking productivity gains. However, this uncontrolled AI adoption outpaces governance, leading to various risks that traditional security measures struggle to address.

A primary concern is data exfiltration. Sensitive data, including personally identifiable information (PII), intellectual property (IP), source code, internal documents, and financial records, can be inadvertently exposed when employees paste it into public AI tools. Once this data enters a third-party AI service, it may be logged, stored, or even used for model training under that platform's terms of service, permanently leaving the organization's control with no audit trail. A 2025 IBM report highlighted that 20% of organizations suffered a breach specifically due to shadow AI, adding an average premium of $670,000 to breach costs.

Compliance failures represent another significant risk. Many organizations operate under strict regulatory frameworks such as GDPR, HIPAA, ISO 27001, and SOC 2. When sensitive data is fed into unvetted AI tools, organizations risk violating these regulations, incurring substantial financial penalties and reputational damage. For instance, GDPR Article 5 mandates lawful, transparent data processing, which shadow AI inherently bypasses due to a lack of visibility.

The overall loss of visibility and control creates significant blind spots for IT leaders. Without centralized monitoring, organizations remain unaware of which AI platforms are being accessed, by whom, from where, and with what types of data or business processes. This absence of oversight impedes regulatory compliance, complicates incident response efforts, and undermines the overall security posture.

The Imperative for Endpoint AI Governance

Traditional AI governance strategies typically focus on the gateway, where platform teams configure applications to send traffic through a centralized AI gateway. This approach effectively manages applications provisioned and controlled by IT. However, it falls short when employees install their own AI applications, use browser-based AI, or run coding agents that connect directly to AI providers, bypassing the central gateway entirely. This is the "shadow AI gap" that endpoint AI governance aims to close.

Endpoint AI governance applies access controls, usage policies, budget limits, guardrails, and audit logging directly at the machine level, covering every device in the organization. It ensures that AI usage is governed regardless of whether it originates from a browser, a desktop application, or a coding agent in the terminal.

How an AI Gateway + Endpoint Agent Delivers Comprehensive Control

A robust solution for managing shadow AI combines the power of a centralized AI gateway with an endpoint agent that extends governance to every machine. Bifrost, an open-source AI gateway from Maxim AI, serves as the control plane and policy engine, offering extensive capabilities for routing, authentication, observability, and governance. Bifrost Edge then extends that same governance directly to the endpoint.

This combined "AI Gateway + Bifrost Edge" architecture ensures that the virtual keys, budgets, rate limits, routing rules, guardrails, and audit logs configured in the Bifrost AI gateway are enforced uniformly across all AI traffic, including traffic that originates from employee machines. Governance follows the user and the device instead of relying on manual per-application configuration. Bifrost Edge is currently in alpha, and its capabilities are still expanding.

Key Capabilities of Endpoint AI Governance with Bifrost Edge

Bifrost Edge, as the endpoint layer of the Bifrost platform, provides specific features to extend central AI governance to individual devices:

App Governance

Administrators can decide which AI applications are permitted across the organization, and Bifrost Edge enforces these policies directly on each device. Allowed applications run normally and are fully governed through Bifrost, while disallowed applications are blocked before any data leaves the machine. This allows for proactive control over the AI tools employees can use, with new app discoveries automatically triggering an approval workflow in the admin console.

MCP Server Governance

Many AI applications increasingly connect to Model Context Protocol (MCP) servers, which are external tools that can read files, call APIs, and take actions. Organizations often lack visibility into these connections, creating a significant blind spot. Bifrost Edge inventories the MCP servers configured within each AI application across the fleet, providing a real-time, fleet-wide catalog. Administrators can then make per-server allow or deny decisions, which are enforced on the device, preventing denied servers from being used even if they were previously configured within an application. This covers major AI apps that support MCP, including Claude Code, Claude Desktop, Gemini CLI, OpenCode, Codex, and Cursor.
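The inventory-then-decide flow described above can be sketched in a few lines. This is an illustration only, not how Bifrost Edge is implemented. It assumes an application keeps its MCP servers under an `mcpServers` key in a JSON config file (the layout Claude Desktop uses); the sample config, the `ALLOWED` and `DENIED` sets, and both function names are hypothetical.

```python
import json

# Hypothetical app config; the "mcpServers" layout mirrors Claude Desktop's JSON config.
SAMPLE_CONFIG = """
{
  "mcpServers": {
    "filesystem": {"command": "npx", "args": ["-y", "@modelcontextprotocol/server-filesystem", "/tmp"]},
    "internal-crm": {"command": "crm-mcp", "args": []},
    "notes": {"command": "notes-mcp", "args": []}
  }
}
"""

# Hypothetical admin decisions, keyed by server name.
ALLOWED = {"filesystem"}
DENIED = {"internal-crm"}


def inventory(config_text: str) -> list[str]:
    """Return the sorted names of MCP servers declared in one app config."""
    config = json.loads(config_text)
    return sorted(config.get("mcpServers", {}))


def apply_policy(servers: list[str]) -> dict[str, str]:
    """Map each discovered server to allow, deny, or pending review."""
    decisions = {}
    for name in servers:
        if name in DENIED:
            decisions[name] = "deny"
        elif name in ALLOWED:
            decisions[name] = "allow"
        else:
            decisions[name] = "pending_review"  # newly discovered, no decision yet
    return decisions


decisions = apply_policy(inventory(SAMPLE_CONFIG))
```

Servers with no recorded decision fall into a pending state here, which mirrors the idea that a new discovery should prompt a review instead of being silently allowed.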

Unified Security & Guardrails

By routing all endpoint AI traffic through the Bifrost gateway, Bifrost Edge ensures that every guardrail already configured at the gateway applies automatically to endpoint AI. This eliminates the need for separate security configurations on individual devices. These guardrails protect prompts and responses from desktop apps, browser AI, and coding agents by catching sensitive content such as secrets or PII before it leaves the machine. Bifrost supports native Secrets Detection, Custom Regex (including PII Detection templates), AWS Bedrock Guardrails, Azure Content Safety, Google Model Armor, CrowdStrike AIDR, GraySwan Cygnal, and Patronus AI.

MDM Deployment for Fleet-Wide Rollout

Bifrost Edge is built for enterprise-scale deployment. Instead of manual installation, organizations can push Edge to every machine using existing Mobile Device Management (MDM) platforms such as Jamf, Microsoft Intune, Kandji, Omnissa Workspace ONE, and JumpCloud. This silent, fleet-wide rollout ensures consistent policy enforcement with minimal user intervention, as a managed configuration pre-points devices to the organization's Bifrost gateway.

Real-World Impact: Moving from Shadow to Secure

Implementing comprehensive endpoint AI governance effectively shifts organizations from a reactive stance against shadow AI to a proactive one. It delivers:

Complete Visibility: Gain a real-time inventory of all AI applications and MCP servers in use across the entire fleet.
Consistent Compliance: Ensure all AI usage adheres to regulatory requirements (GDPR, HIPAA, SOC 2, ISO 27001) with centralized policy enforcement and immutable audit logs.
Enhanced Security: Protect against data exfiltration, prompt injection attacks, and other AI-native threats by applying robust guardrails at the endpoint.
Reduced Risk: Mitigate financial, reputational, and operational risks associated with ungoverned AI usage.

Endpoint AI governance, powered by an AI gateway like Bifrost and extended by Bifrost Edge, provides a comprehensive, proactive, and adaptable solution for securing AI usage across the enterprise. It moves beyond the limitations of traditional network filtering to deliver true visibility, granular control, and robust data protection where employees actually use AI.

9 Metrics to Track for Enterprise AI Governance

Marcus Chen — Thu, 09 Jul 2026 10:31:41 +0000

Organizations scaling AI deployments face increasing pressure to ensure responsible and compliant usage. Tracking key metrics allows teams to measure the effectiveness of their AI governance programs, manage risks, and demonstrate accountability. This article examines essential metrics for enterprise AI governance and highlights how solutions like Bifrost can provide the necessary controls and visibility.

As artificial intelligence systems become integral to enterprise operations, the need for robust AI governance has escalated. Governance goes beyond policy documents; it requires measurable operating signals to assess whether AI systems remain within control boundaries and align with ethical, legal, and operational expectations. Without quantifiable metrics, "we have AI governance" remains a claim rather than a verifiable posture.

Effective enterprise AI governance, especially for large language model (LLM) applications, demands continuous monitoring across the AI lifecycle, from development to production. This includes evaluating model outputs, recording production behavior, controlling access to sensitive AI data, and enforcing policies at runtime. Bifrost, an open-source AI gateway by Maxim AI, provides a control plane for LLM traffic that helps organizations implement many of the technical controls necessary to measure and improve AI governance.

Here are nine critical metrics enterprise teams should track for effective AI governance:

1. AI System Inventory Coverage

One of the foundational metrics for AI governance is understanding the full scope of AI systems in use across an organization. Inventory coverage measures the percentage of known AI use cases documented in a central registry. This includes internally developed models, third-party AI embedded in SaaS tools, and shadow AI adopted by business units without formal IT review.

Why it matters: Low coverage signals shadow AI, weak ownership, and incomplete risk oversight. Without a comprehensive inventory, it is impossible to apply consistent governance policies or assess aggregate risk. Over 80% of organizations report moderate to pervasive shadow AI use, yet only 25% have comprehensive visibility into how employees use AI.

How Bifrost helps: Bifrost Edge, an endpoint agent, actively discovers and inventories AI applications and Model Context Protocol (MCP) servers on employee machines, routing all AI traffic through the central Bifrost gateway. This extends the governance and security controls configured in the Bifrost AI gateway to AI traffic on endpoints, providing admins with a fleet-wide catalog of AI tools and MCP servers in use, effectively combating shadow AI. Administrators can review newly discovered applications and MCP servers through a centralized dashboard, approving or denying their usage across the fleet.
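As a rough sketch of the metric itself, inventory coverage can be computed by comparing the central registry against what discovery finds on endpoints. The tool names and the function name below are hypothetical, and "known" is assumed to mean the union of registered and discovered systems.

```python
def inventory_coverage(registered: set[str], discovered: set[str]) -> float:
    """Fraction of all known AI systems that are documented in the registry."""
    known = registered | discovered  # everything we are aware of
    if not known:
        return 1.0  # nothing to cover
    return len(registered) / len(known)


# Hypothetical data: registry entries vs. tools seen on endpoints.
registered = {"support-bot", "code-assistant", "doc-summarizer"}
discovered = {"code-assistant", "browser-chat", "notes-ai"}

coverage = inventory_coverage(registered, discovered)
unregistered = sorted(discovered - registered)  # candidates for review
```

The `unregistered` list is the actionable part: each entry is a tool in use that nobody has documented yet.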

2. Policy Compliance Rate

The policy compliance rate measures the percentage of AI deployments or model owners adhering to defined governance policies. This includes adherence to data handling rules, model validation requirements, and deployment workflows.

Why it matters: Policy-only governance often fails to demonstrate what is actually happening in production. This metric provides an operational signal of whether governance policies are genuinely enforced and followed. Non-compliance can lead to significant regulatory fines, such as those under the EU AI Act.

How Bifrost helps: Bifrost enables granular policy enforcement through virtual keys, budgets, and rate limits. These controls allow administrators to define per-consumer access permissions, manage spending, and ensure that AI usage aligns with organizational policies. Access profiles provide a mechanism for creating reusable provider, model, budget, rate limit, and MCP policies that automatically allocate virtual keys at scale.

3. Incident Detection & Resolution Time (MTTD/MTTR for AI)

This metric tracks how quickly AI-related incidents—such as bias, model failure, drift, or security breaches—are detected (Mean Time To Detect, MTTD) and subsequently resolved (Mean Time To Resolve, MTTR).

Why it matters: Rapid detection and resolution are critical for mitigating the impact of AI failures and maintaining trust. Ungoverned AI systems can introduce risks like bias in hiring algorithms or privacy violations, carrying quantifiable costs if not addressed promptly. Proactive monitoring through Key Risk Indicators (KRIs) helps identify potential issues before they escalate.

How Bifrost helps: Bifrost provides extensive observability features, including native Prometheus metrics and OpenTelemetry (OTLP) integration for distributed tracing. These capabilities allow for real-time monitoring of AI traffic, enabling teams to quickly identify anomalies, performance degradation, or unexpected model behavior. Integrations like the Datadog connector extend this visibility into existing APM and observability stacks.
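A minimal sketch of the MTTD and MTTR calculation, assuming each incident record carries three timestamps. The `Incident` fields and sample data are hypothetical, and MTTR is measured here from detection to resolution; some teams measure it from the start of the incident instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    started: datetime   # when the problem actually began
    detected: datetime  # when monitoring or a person flagged it
    resolved: datetime  # when the fix was confirmed


def mean_delta(deltas: list[timedelta]) -> timedelta:
    """Average a list of durations."""
    return timedelta(seconds=mean([d.total_seconds() for d in deltas]))


# Hypothetical incident log.
incidents = [
    Incident(datetime(2026, 7, 1, 9, 0), datetime(2026, 7, 1, 9, 30), datetime(2026, 7, 1, 11, 30)),
    Incident(datetime(2026, 7, 2, 14, 0), datetime(2026, 7, 2, 14, 10), datetime(2026, 7, 2, 14, 40)),
]

mttd = mean_delta([i.detected - i.started for i in incidents])
mttr = mean_delta([i.resolved - i.detected for i in incidents])
```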

4. Data Leakage / Guardrail Violation Rate

This metric quantifies the frequency of sensitive data exposure or policy breaches detected in prompts and responses, typically through automated guardrails.

Why it matters: Data leakage poses significant security and compliance risks. Employees may inadvertently paste sensitive information into public LLMs, or models may generate unsafe or low-quality outputs. The global average cost of a data breach was $4.88 million in 2024.

How Bifrost helps: Bifrost's guardrails enforce real-time content safety, preventing sensitive data from reaching models and filtering out undesirable responses. These include native secrets detection and custom regex rules, alongside integrations with third-party guardrail providers such as AWS Bedrock Guardrails, Azure Content Safety, CrowdStrike AIDR, GraySwan Cygnal, and Patronus AI. Bifrost Edge extends these same guardrails to endpoint AI traffic, applying controls before data leaves an employee's machine.
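To show what a regex-based check and the resulting violation rate might look like, here is a small sketch. The patterns are illustrative examples only, not Bifrost's built-in templates, and the function names are hypothetical. Real detectors need far more careful patterns and validation.

```python
import re

# Illustrative patterns only; production rules need tuning and validation.
PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def violations(text: str) -> list[str]:
    """Return the names of every pattern that matches the text."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(text)]


def violation_rate(prompts: list[str]) -> float:
    """Fraction of prompts that trigger at least one pattern."""
    if not prompts:
        return 0.0
    flagged = sum(1 for p in prompts if violations(p))
    return flagged / len(prompts)


# Hypothetical prompt sample.
prompts = [
    "Summarize this meeting note",
    "My key is AKIAABCDEFGHIJKLMNOP",
    "Email jane.doe@example.com about the invoice",
    "What is a monad?",
]
rate = violation_rate(prompts)
```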

5. Cost Per Token / Cost Per Use Case

This metric tracks the expense incurred for processing each token or the total inference cost attributable to a specific AI use case or feature.

Why it matters: LLM costs can escalate rapidly due to token-based pricing and variable usage patterns. Granular cost tracking is essential for optimizing spend, making informed decisions about model selection, and justifying AI investments. Without clear cost-to-outcome data, organizations risk overpaying for AI services.

How Bifrost helps: As an AI gateway, Bifrost centralizes all LLM traffic, capturing comprehensive, structured data for every request. This includes token counts, model usage, and latency. Virtual keys allow for precise cost attribution by user, team, or project, enabling showback and chargeback reporting. Organizations can also use routing rules to direct non-essential traffic to cheaper, smaller models, which can significantly reduce costs without sacrificing quality.
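The attribution step can be sketched as a simple aggregation over request records. Everything here is an assumption for illustration: the record fields, the model names, and the per-million-token prices are hypothetical, not Bifrost's log schema or any provider's real pricing.

```python
from collections import defaultdict

# Hypothetical prices in USD per million tokens: (input, output).
PRICES = {"small-model": (0.15, 0.60), "large-model": (2.50, 10.00)}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request under the hypothetical price table."""
    price_in, price_out = PRICES[model]
    return input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out


def cost_by_key(records: list[dict]) -> dict[str, float]:
    """Sum request costs per virtual key (one key per team in this sketch)."""
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        totals[r["virtual_key"]] += request_cost(
            r["model"], r["input_tokens"], r["output_tokens"]
        )
    return dict(totals)


# Hypothetical request log.
records = [
    {"virtual_key": "support", "model": "small-model", "input_tokens": 1_000_000, "output_tokens": 500_000},
    {"virtual_key": "support", "model": "large-model", "input_tokens": 200_000, "output_tokens": 100_000},
    {"virtual_key": "research", "model": "large-model", "input_tokens": 1_000_000, "output_tokens": 1_000_000},
]
totals = cost_by_key(records)
```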

6. Cache Hit Ratio

The cache hit ratio measures the percentage of requests served directly from a cache rather than being forwarded to an LLM provider.

Why it matters: Caching frequently used responses avoids reprocessing identical or semantically similar queries, leading to significant cost savings and reduced latency. Optimizing prompt engineering and implementing effective caching strategies can reduce inference spend by 50 to 80 percent.

How Bifrost helps: Bifrost's semantic caching intelligently stores and retrieves responses based on semantic similarity, rather than exact text matches. This allows teams to reduce costs and latency on repeated queries. Monitoring the cache hit ratio through Bifrost's observability features helps teams understand the effectiveness of their caching strategy and identify opportunities for further optimization.
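The ratio itself is a one-line calculation once hit and miss counters are available. The function name and counter values below are hypothetical.

```python
def cache_hit_ratio(hits: int, misses: int) -> float:
    """Fraction of requests answered from cache; 0.0 when there is no traffic."""
    total = hits + misses
    return hits / total if total else 0.0


# Hypothetical counters for one day of traffic.
ratio = cache_hit_ratio(hits=300, misses=700)
```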

7. Shadow AI Detection Rate / Coverage

This metric measures the effectiveness of detecting and governing AI tools used by employees without formal IT or security approval.

Why it matters: Shadow AI is a pervasive problem, with significant security implications. It creates blind spots where sensitive corporate data can be exposed to unvetted models, leading to data breaches, compliance violations, and intellectual property leakage. Only 25% of organizations have comprehensive visibility into how employees use AI.

How Bifrost helps: Bifrost Edge is designed specifically to address shadow AI. It operates as an endpoint agent that routes all AI traffic from desktop applications, browser AI, and coding agents through the central Bifrost gateway. This provides real-time visibility into AI tool adoption, tracks usage patterns, and allows administrators to enforce governance policies at the device level, effectively bringing ungoverned AI usage under control.

8. Model Performance & Drift

This category includes metrics such as model accuracy, fairness deviation (e.g., disparate impact, equal opportunity difference), and the rate of model drift detection.

Why it matters: AI models can degrade in performance over time due to changes in data distribution (data drift) or concept shift, leading to inaccurate, biased, or harmful outputs. Continuous monitoring of these metrics is essential to ensure AI systems behave as intended and align with ethical standards. Timely detection of drift allows for intervention before it creates significant business risk.

How Bifrost helps: While Bifrost primarily focuses on traffic routing and governance, its comprehensive observability capabilities provide the raw data necessary for tracking model performance. By capturing every prompt and response, along with metadata such as model used and latency, Bifrost creates a rich dataset that can feed into external evaluation platforms for drift detection and quality assessment. The Mocker plugin can also assist in testing new model versions or configurations against baselines.
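As one example of the fairness metrics named above, disparate impact is the ratio of favorable-outcome rates between two groups. The sketch below uses made-up outcome data and hypothetical function names, and it applies the commonly used four-fifths (0.8) threshold as a rule of thumb, not as a legal test.

```python
def selection_rate(outcomes: list[int]) -> float:
    """Share of favorable outcomes (1 = favorable, 0 = unfavorable)."""
    return sum(outcomes) / len(outcomes)


def disparate_impact(group: list[int], reference: list[int]) -> float:
    """Ratio of the group's selection rate to the reference group's rate."""
    return selection_rate(group) / selection_rate(reference)


# Hypothetical model decisions for two groups.
group_a = [1, 0, 0, 1, 0]  # 40% favorable
group_b = [1, 1, 0, 1, 0]  # 60% favorable

ratio = disparate_impact(group_a, group_b)
flagged = ratio < 0.8  # four-fifths rule of thumb
```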

9. Audit Log Completeness / Evidence Readiness

This metric measures the proportion of AI systems with complete, immutable, and easily accessible audit trails that capture decisions, data processes, and model updates.

Why it matters: Comprehensive audit logs are critical for demonstrating accountability, transparency, and compliance with regulations such as SOC 2, GDPR, HIPAA, and ISO 27001. Auditors increasingly demand verifiable evidence of control enforcement, not just policy statements.

How Bifrost helps: Bifrost automatically generates audit logs for all AI traffic flowing through the gateway, providing an immutable record of every request, response, policy evaluation, and decision. These logs can be exported to various storage systems and data lakes, ensuring they remain within the organization's perimeter and are readily available for compliance reviews. This capability is fundamental to achieving AI audit readiness.

Conclusion

Implementing effective enterprise AI governance is no longer optional; it is a strategic imperative for managing risk, ensuring compliance, and building trust in AI systems. By focusing on these nine metrics—from inventory coverage and policy compliance to cost optimization and audit readiness—organizations can gain the operational visibility needed to steer their AI initiatives responsibly. Tools like Bifrost provide the foundational infrastructure for measuring, monitoring, and enforcing governance policies across the entire AI landscape, from the data center to the endpoint. Teams evaluating AI gateways can request a Bifrost demo or review the open-source repository.

Request tagging for LLM evals with Bifrost dimension headers

Marcus Chen — Thu, 25 Jun 2026 16:01:58 +0000

TL;DR: Request tagging with Bifrost dimension headers (x-bf-dim-*) stamps checkpoint and run metadata onto every LLM eval call, so you slice scores by model version instead of guessing which change moved the aggregate.

We ran roughly 12,000 eval requests across four fine-tuned checkpoints last sprint, and when aggregate accuracy moved three points I couldn't tell which checkpoint produced which response. Our eval harness stored prompts and scores in one table; the routing layer recorded latency and provider somewhere else, and nothing carried the experiment ID end to end. We moved the eval traffic behind Bifrost, the open-source AI gateway from Maxim AI, and used its custom dimension headers to stamp each request with the checkpoint and run ID. Request tagging turned a join-by-timestamp guessing game into a filter.

What request tagging means for LLM evals

Request tagging attaches key-value metadata to each LLM API call so downstream logs, traces, and metrics can be grouped by that metadata. In Bifrost, any header prefixed x-bf-dim-* becomes a custom dimension that is auto-forwarded to logs, traces, and Prometheus, which lets you group eval scores by checkpoint, prompt version, or suite without modifying your harness.

I lead the fine-tuning and evaluation team at Nexus Labs, a Series B company building enterprise agent automation. Our problem was attribution, not measurement. A scoring function that returns 0.81 is useless if you can't tie that number to agentqa-v7-lora-r16 versus agentqa-v6. Most eval setups solve this by threading an experiment ID through every layer of application code, which breaks the moment someone forgets a kwarg. Pushing the metadata into a request header at the gateway means the harness stays dumb and the dimension travels with the request.

Stamping requests with x-bf-dim headers

Bifrost is a drop-in replacement for the OpenAI base URL, so the only changes to our harness were the base_url and three extra headers. The gateway holds the provider keys, so the client API key is unused.

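from openai import OpenAI

# Point the OpenAI client at the Bifrost gateway instead of the provider.
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="unused-bifrost-holds-keys",
)

# eval_case is the current test case from the eval harness (defined elsewhere).
resp = client.chat.completions.create(
    model="openai/gpt-4o-mini",
    messages=[{"role": "user", "content": eval_case.prompt}],
    # Each x-bf-dim-* header becomes a custom dimension on this request.
    extra_headers={
        "x-bf-dim-checkpoint": "agentqa-v7-lora-r16",
        "x-bf-dim-run-id": "eval-2026-06-19-batch3",
        "x-bf-dim-suite": "tool-routing-adversarial",
    },
)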

Every request in that batch now carries three dimensions. When the scorer writes its verdict, I don't need to correlate anything by hand; the gateway already recorded the dimensions next to the latency, token counts, and resolved provider. The same endpoint fronts 20+ providers, so when I shadow a hosted model against a self-hosted checkpoint, both legs of the comparison get tagged identically and land in the same store.

Slicing eval results in observability

The dimensions are only useful if the read path is cheap. Bifrost writes telemetry through async observability with under 0.1ms of added overhead, using SQLite by default and Postgres for production volume. The sinks include Prometheus, OpenTelemetry, Datadog, and BigQuery, so I query the same dimensions from whichever tool the rest of the team already watches.

In practice I pull a Prometheus query grouped by checkpoint and suite, then compute per-slice accuracy from the scorer table joined on run_id. That is where the three-point aggregate move resolved: checkpoint v7 gained on the general suite and lost on the adversarial tool-routing suite, which the average had flattened. This kind of per-segment attribution is the whole reason I distrust single-number eval reports. Aggregate metrics are a summary statistic, and summary statistics hide structure by design. The methodology argument is old; the HELM evaluation work made the case for multi-metric, multi-scenario reporting years ago. Tagging at the gateway is the plumbing that makes per-scenario reporting cheap enough to actually do on every run.
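A toy version of the per-slice calculation shows how an aggregate can hide opposing movements. The rows below are invented for illustration, not our real results, and the tuple layout and function name are hypothetical.

```python
from collections import defaultdict

# Invented scorer rows: (checkpoint, suite, correct).
rows = (
    [("v6", "general", c) for c in (1, 1, 0, 0)]
    + [("v6", "adversarial", c) for c in (1, 1, 1, 0)]
    + [("v7", "general", c) for c in (1, 1, 1, 1)]
    + [("v7", "adversarial", c) for c in (1, 1, 0, 0)]
)


def accuracy_by(rows, key):
    """Group rows with key(row) and return accuracy per group."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in rows:
        k = key(row)
        totals[k] += 1
        hits[k] += row[2]
    return {k: hits[k] / totals[k] for k in totals}


overall = accuracy_by(rows, key=lambda r: r[0])          # per checkpoint
sliced = accuracy_by(rows, key=lambda r: (r[0], r[1]))   # per checkpoint and suite
```

In this toy data v7 looks better overall (0.75 vs 0.625), yet the sliced view shows it gained on the general suite and lost on the adversarial one.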

One detail that saved me time: the dimensions are arbitrary strings, so I tag prompt-template hashes too. When a template edit slipped into a run, the prompt_hash dimension showed two distinct values inside one supposedly clean batch, and I caught a contaminated comparison before it reached a decision.
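A sketch of how such a tag could be derived, assuming a truncated SHA-256 of the template text. The header name `x-bf-dim-prompt-hash`, the 12-character truncation, and the helper names are my assumptions for illustration.

```python
import hashlib


def prompt_hash(template: str) -> str:
    """Short, stable fingerprint of a prompt template."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]


def is_contaminated(hashes: list[str]) -> bool:
    """A clean batch should carry exactly one template hash."""
    return len(set(hashes)) > 1


template_v1 = "Answer the question using the tools provided.\n{question}"
template_v2 = "Answer the question using only the tools provided.\n{question}"

# Header to send with each request in the batch (hypothetical dimension name).
headers = {"x-bf-dim-prompt-hash": prompt_hash(template_v1)}

# Hashes read back from the logs for one supposedly clean batch.
batch_hashes = [prompt_hash(template_v1), prompt_hash(template_v1), prompt_hash(template_v2)]
```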

Trade-offs and limitations

This is not free infrastructure. Bifrost runs as a separate Go service, so you operate one more process, and a serious deployment needs Postgres rather than the default SQLite once you push real eval volume through it. If your stack is pure Python and you want everything in-process, a library like LiteLLM keeps fewer moving parts, at the cost of the gateway-level telemetry I'm describing here. Bifrost's ecosystem is also younger than LiteLLM's, so you will find fewer community examples for edge integrations.

The dimension headers are forwarded, not validated. Nothing stops a typo in x-bf-dim-checkpoint from creating a phantom slice, so I keep the tag values in one constants module and assert against it in the harness. Cluster-mode horizontal scaling is an enterprise feature, not part of the open-source core, which matters if your eval fleet outgrows a single instance. For a four-checkpoint sprint on one box, none of this bit me. Know your scale before you assume it won't.
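The constants-plus-assert guard can be as small as the sketch below. The allowed values, the `FREE_FORM_DIMS` set, and the `dim_headers` helper are hypothetical stand-ins for our constants module; the only thing taken from Bifrost is the `x-bf-dim-` header prefix.

```python
# Hypothetical constants module: the only tag values the harness may send.
ALLOWED_DIMS = {
    "checkpoint": {"agentqa-v6", "agentqa-v7-lora-r16"},
    "suite": {"general", "tool-routing-adversarial"},
}
FREE_FORM_DIMS = {"run-id"}  # any non-empty string is accepted


def dim_headers(dims: dict[str, str]) -> dict[str, str]:
    """Validate tag names and values, then build x-bf-dim-* headers."""
    headers = {}
    for name, value in dims.items():
        if name in ALLOWED_DIMS:
            if value not in ALLOWED_DIMS[name]:
                raise ValueError(f"unknown {name} value: {value!r}")
        elif name in FREE_FORM_DIMS:
            if not value:
                raise ValueError(f"{name} must be non-empty")
        else:
            raise ValueError(f"unknown dimension: {name!r}")
        headers[f"x-bf-dim-{name}"] = value
    return headers


headers = dim_headers({
    "checkpoint": "agentqa-v7-lora-r16",
    "run-id": "eval-2026-06-19-batch3",
    "suite": "tool-routing-adversarial",
})
```

A typo such as `agentqa-v7-lora-r61` now fails in the harness with a `ValueError` instead of creating a phantom slice in the dashboards.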

Wrapping up

Request tagging with x-bf-dim-* dimension headers moved attribution out of my eval code and into the gateway, which is where it belongs when many checkpoints and suites share one pipeline. The model was never the hard part. Knowing which model produced which number was. If you want to see the tagging and observability path end to end, book a demo: https://getmaxim.ai/bifrost/book-a-demo

Position bias in LLM-as-judge flipped 18% of our verdicts

Marcus Chen — Thu, 25 Jun 2026 06:31:28 +0000

TL;DR: Position bias in LLM-as-judge means the model favors whichever answer it reads first. We measured an 18% verdict flip rate from swapping order alone, and dual-pass scoring brought it under 4%.

Our pairwise evaluation harness at Nexus Labs scored answer A over answer B in 18% of cases purely because A appeared first in the judge prompt. We caught it when a regression run on our agent-automation model showed a 6-point win on the leaderboard that vanished the moment a teammate reran the same comparisons with the candidate listed second. Position bias in LLM-as-judge is well documented, but most teams never measure it on their own data, so they ship on numbers that move when you shuffle the prompt. The judge model here was gpt-4o-2024-08-06, scoring 1,200 pairwise comparisons of customer-support agent responses.

This is the part of evaluation that gets skipped because the harness looks like it works. It returns scores. The scores have decimals. They go in a dashboard. Nobody checks whether the decimals mean anything.

What position bias in LLM-as-judge actually is

Position bias in LLM-as-judge is the tendency of a model to prefer a response based on where it sits in the prompt rather than its quality. When you ask a model to pick the better of two answers, listing the same answer first versus second changes the verdict at a measurable rate. The effect was named in Large Language Models are not Fair Evaluators and confirmed across judge models in the MT-Bench paper.

It is not random noise. The bias has a direction. In our runs gpt-4o preferred the first position about 11 points more often than chance would predict, which is consistent with the first-position skew reported in both papers above.

How we measured the flip rate

The measurement is cheap. For every pair, run the judge twice: once with the candidate in slot A, once in slot B. If the verdict changes when only the order changed, that pair is order-sensitive. The flip rate is the fraction of pairs where this happens.

def judge_pair(judge, question, resp_x, resp_y):
    # returns "x", "y", or "tie"
    return judge.compare(question, first=resp_x, second=resp_y)

flips = 0
for q, cand, base in pairs:
    v1 = judge_pair(judge, q, cand, base)   # candidate first
    v2 = judge_pair(judge, q, base, cand)   # candidate second
    # normalize v2 back to candidate-vs-base framing
    v2_norm = {"x": "y", "y": "x", "tie": "tie"}[v2]
    if v1 != v2_norm and "tie" not in (v1, v2_norm):
        flips += 1

flip_rate = flips / len(pairs)

We ran this across three judges. gpt-4o flipped on 18% of pairs, claude-3-5-sonnet on 12%, and a smaller gpt-4o-mini judge flipped on 29%. The smallest judge showed the worst bias in our runs, which tracks with the intuition that weaker models lean harder on surface cues like ordering.

To run the same comparison set against multiple providers without writing a client per vendor, we put the judge calls behind Bifrost and pointed the harness at one OpenAI-compatible endpoint. That is the only infrastructure note here; the method works with any client you already have.

Dual-pass scoring and other fixes

The fix that worked was the boring one: judge every pair in both orders and only count a win when both passes agree. Disagreements become ties. This is the conservative position-swap approach described in the MT-Bench paper, and it dropped our flip-driven verdicts to under 4% of pairs, because a true difference in quality survives the swap while an order artifact does not.

Three approaches we compared:

Dual-pass with agreement gate. Run both orders, count a win only on agreement. Doubles judge cost, removes most order artifacts. This is what we shipped.
Score averaging. Average a numeric score across both orders instead of gating. Cheaper to reason about, but a confident wrong score in one order can still drag the mean.
Reference-anchored scoring. Score each answer independently against a rubric instead of head-to-head, as in G-Eval. Removes pairwise ordering entirely, but rubric scores are noisier and harder to calibrate across raters.
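The agreement gate from the first approach can be sketched as a small pure function. It uses the same "x", "y", "tie" verdict labels as the flip-rate snippet earlier; the function name and the "candidate"/"baseline" return labels are hypothetical.

```python
SWAP = {"x": "y", "y": "x", "tie": "tie"}
LABEL = {"x": "candidate", "y": "baseline", "tie": "tie"}


def dual_pass_verdict(v_candidate_first: str, v_candidate_second: str) -> str:
    """Combine two judge passes; count a win only when both orders agree.

    v_candidate_first:  verdict when the candidate sat in slot x.
    v_candidate_second: verdict when the candidate sat in slot y.
    """
    # Map the second pass back to the candidate-in-slot-x framing.
    normalized = SWAP[v_candidate_second]
    if v_candidate_first == normalized:
        return LABEL[v_candidate_first]
    return "tie"  # the passes disagree, so the result is order-sensitive
```

For example, `dual_pass_verdict("x", "x")` means the judge picked whichever answer came first both times, so the pair is recorded as a tie.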

We also report Cohen's kappa between the two passes as a standing health metric. When kappa drops below 0.6 on a new judge or prompt template, we treat the judge as unreliable for that task and stop trusting its leaderboard until we debug the template.
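For reference, Cohen's kappa between the two passes can be computed without extra dependencies. This is a plain implementation of the standard formula; the verdict lists are invented, and the second pass is assumed to be already normalized to the candidate-first framing.

```python
from collections import Counter


def cohens_kappa(pass_a: list[str], pass_b: list[str]) -> float:
    """Chance-corrected agreement between two equally long label lists."""
    n = len(pass_a)
    observed = sum(a == b for a, b in zip(pass_a, pass_b)) / n
    counts_a, counts_b = Counter(pass_a), Counter(pass_b)
    labels = set(pass_a) | set(pass_b)
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both passes used a single identical label
    return (observed - expected) / (1 - expected)


# Invented verdicts; pass_b is already normalized to the candidate-first framing.
pass_a = ["x", "x", "x", "x", "y", "y", "y", "y", "tie", "tie"]
pass_b = ["x", "x", "x", "y", "y", "y", "y", "x", "tie", "tie"]

kappa = cohens_kappa(pass_a, pass_b)
reliable = kappa >= 0.6
```

Here the passes agree on 8 of 10 pairs and chance agreement is 0.36, so kappa comes out at 0.6875, just above the 0.6 cutoff.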

Trade-offs and limitations

Dual-pass doubles judge token cost, which on 1,200 pairs at our prompt sizes added a few dollars per eval run. That is fine for release gates and unacceptable for per-request online scoring, so we only run it offline.

Gating on agreement inflates the tie count. Roughly a fifth of our previously decisive verdicts became ties, which makes small model improvements harder to detect. That is the correct outcome, not a bug: if a difference does not survive an order swap, calling it a win was the original mistake.

None of this addresses other judge biases. Length bias, self-preference when a model judges its own outputs, and sensitivity to formatting all persist. Position bias is the easiest one to measure, so it is the right place to start, not the place to stop.

Where to go next

If you run any LLM-as-judge pipeline, measure your flip rate before you touch anything else. It takes one extra pass over an existing comparison set and tells you whether your leaderboard reflects model quality or prompt ordering. I would run the swap test on your next eval, log Cohen's kappa between passes, and only then argue about which model won.

Governing AI Apps and the MCP Servers They Connect To From One Dashboard

Marcus Chen — Wed, 24 Jun 2026 18:30:06 +0000

As AI adoption surges, organizations face challenges governing the proliferation of AI apps and the unmanaged MCP servers employees use. Learn how to centralize AI governance with Bifrost and Bifrost Edge for comprehensive control and visibility.

The rapid adoption of AI across enterprises has brought unprecedented efficiency, but it also introduces complex governance challenges. Employees routinely use AI tools and connect to Model Context Protocol (MCP) servers without formal oversight, creating "shadow AI" and significant security and compliance risks. Addressing this requires a unified approach that brings both AI application usage and the underlying MCP server interactions under a single pane of glass. Bifrost, an open-source AI gateway from Maxim AI, provides the core control plane, which is then extended to every endpoint by Bifrost Edge for comprehensive governance.

The Rise of Shadow AI and Ungoverned Endpoints

The proliferation of generative AI tools means employees are increasingly using AI in their daily workflows, often without IT approval. Approximately 67% of employees use AI tools at work, yet only 18% of organizations have formal AI security policies in place. This disparity creates a significant "shadow AI" problem, where sensitive data, including personally identifiable information (PII) and intellectual property, can be exposed. PII is exposed in about 65% of shadow AI-related incidents, while intellectual property is exposed in around 40% of incidents.

Beyond consumer-grade AI chat apps, the Model Context Protocol (MCP) allows AI agents to connect to external tools like databases, APIs, and internal systems, enabling powerful autonomous actions. While beneficial for productivity, ungoverned MCP server usage introduces critical security risks. These include sensitive data exfiltration, unauthorized actions from compromised tool responses, overprivileged agent access, and a lack of audit trails connecting agent actions to human accountability. Many organizations lack comprehensive visibility into how employees use AI, with some reports indicating only 25% have such insight.

Centralized AI Governance with the AI Gateway

An AI gateway functions as a centralized control plane for all AI traffic between applications and LLM providers. It intercepts every request and response, enforcing policies, routing decisions, authentication, and compliance controls. Bifrost, as an AI gateway, offers a robust set of features to establish this central governance:

Virtual Keys: These serve as the primary governance entity, allowing administrators to set per-consumer access permissions, budgets, and rate limits for AI usage.
Routing and Failover: Intelligent routing directs requests to specific models, providers, and keys, ensuring automatic failover in case of provider outages and optimizing performance and cost.
Guardrails: Content safety guardrails can be configured to catch sensitive information like secrets or PII before it leaves the organization's network, supporting compliance standards like SOC 2, GDPR, HIPAA, and ISO 27001.
Audit Logs: Immutable audit logs provide a clear record of all AI interactions, which is crucial for accountability and regulatory compliance.

These controls are configured centrally within the Bifrost AI gateway, establishing a foundational layer of security and policy enforcement for traffic explicitly routed through it.

Extending Governance to Every Machine with Bifrost Edge

While the AI gateway provides robust control for configured traffic, shadow AI persists because many endpoint applications and MCP servers bypass the gateway entirely. This is where Bifrost Edge extends the gateway's governance to every machine in the organization. Bifrost Edge is a lightweight agent that runs on employee macOS, Windows, and Linux devices, routing all AI traffic through the organization's Bifrost AI gateway. This ensures that the same virtual keys, budgets, guardrails, and audit logs configured in the gateway apply to all AI traffic originating from endpoints, regardless of the application used.

Bifrost Edge addresses the core challenge of shadow AI by making endpoint AI usage observable and enforceable from a single dashboard, without requiring users to reconfigure individual applications.

Governing AI Applications at the Endpoint

Bifrost Edge gives administrators granular control over which AI applications are permitted within the organization. Teams can define policies to allow or block specific AI tools, and Edge enforces these decisions directly on each device. When Edge detects a new, unapproved application, it can trigger an approval workflow in the admin console, enabling security teams to review and either approve or deny its use across the fleet. This ensures that only sanctioned applications, fully governed by Bifrost's policies, can operate on company machines. When an application is blocked, users receive a clear notification, and the block itself prevents potential data exfiltration or policy violations.

Gaining Visibility and Control Over MCP Servers

A significant blind spot for many organizations is the unmanaged proliferation of MCP servers that AI agents connect to. Edge closes this gap by providing a live, fleet-wide inventory of all MCP servers configured within AI applications on endpoint devices. Administrators gain unprecedented visibility into which external tools are being used, by whom, and across how many machines.

Once identified, administrators can make per-server allow or deny decisions. A denied MCP server cannot be used, even if an application previously had it configured. This active enforcement prevents agents from connecting to potentially malicious or unvetted external tools, mitigating risks like supply chain exposure and unauthorized command execution. Edge supports discovery for leading AI applications such as Claude Code, Claude Desktop, Gemini CLI, OpenCode, Codex, and Cursor.

Enforcing Security and Guardrails Everywhere

With Bifrost Edge, the robust security guardrails configured in the Bifrost AI gateway automatically apply to endpoint AI traffic. This means that prompts and responses from desktop apps, browser AI, and coding agents are protected by the same rules that secure gateway traffic. Guardrails can detect and prevent the leakage of sensitive content, such as secrets or PII, before it leaves the machine.

These guardrails include native secrets detection (backed by Gitleaks), custom regex patterns for organization-specific redaction, and integrations with third-party solutions like AWS Bedrock Guardrails, Azure Content Safety, and Patronus AI. This comprehensive approach ensures that security policies are consistently applied across all AI interactions, from the data center to the user's laptop.
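To make the custom-regex idea concrete, here is a rough sketch of pattern-based redaction applied to a prompt before it is sent. It is not Bifrost's guardrail engine or configuration format; the pattern names, the `redact` helper, and the placeholder format are all hypothetical, and real deployments would use far more thorough rule sets.

```python
import re

# Hypothetical organization-specific patterns (illustrative, not exhaustive).
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def redact(prompt: str):
    """Replace every match with a placeholder and report which rules fired."""
    findings = []
    for name, pattern in PATTERNS.items():
        prompt, count = pattern.subn(f"[REDACTED:{name}]", prompt)
        if count:
            findings.append(name)
    return prompt, findings

# AKIAIOSFODNN7EXAMPLE is the placeholder key ID used in AWS documentation.
clean, fired = redact("key AKIAIOSFODNN7EXAMPLE sent to dev@example.com")
print(clean)
print(fired)
```

A policy could then block the request outright when `fired` is non-empty, or forward the redacted text, depending on how strict the rule is meant to be.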

Streamlined Deployment and Administration

Bifrost Edge is designed for enterprise-scale deployment. Instead of manual installation on individual machines, organizations can push the Edge agent to every device through existing Mobile Device Management (MDM) platforms. Supported MDM solutions include Jamf, Microsoft Intune, Kandji, Omnissa Workspace ONE, and JumpCloud, covering macOS, Windows, and Linux endpoints.

A managed configuration ensures that devices are pre-pointed at the organization's Bifrost instance upon installation, simplifying rollout. After deployment, administrators manage the entire fleet from a central dashboard. This dashboard provides:

Devices Dashboard: A summary of all machines running Edge, including details like hostname, owner, OS, and installed AI apps/MCP servers.
Approvals Dashboard: A deduplicated catalog of discovered AI apps and MCP servers, allowing for fleet-wide approval or denial with clear status (Pending, Approved, Denied).
Configurations: Centralized settings like the organization certificate (required for routing encrypted AI traffic) and policy sync intervals.

This consolidated view transforms shadow AI from an unmanaged risk into observable and enforceable traffic, enhancing overall security posture and compliance.

The Combined Power: AI Gateway + Bifrost Edge

Effective enterprise AI governance demands a unified strategy. The Bifrost AI gateway serves as the indispensable control plane, where virtual keys, budgets, guardrails, and audit logs are defined. Bifrost Edge then extends this same robust governance directly to the endpoint, ensuring that AI apps and the MCP servers they connect to on every employee's machine adhere to organizational policies. This combined approach eradicates shadow AI, providing a single, consistent framework for visibility, security, and compliance across the entire AI landscape, from the data center to the edge device. Teams can finally gain comprehensive control over all AI interactions, fostering responsible AI adoption at scale.

When Developers Connect Random MCP Servers: How to Regain Control

Marcus Chen — Wed, 24 Jun 2026 18:29:46 +0000

Developers frequently connect AI agents and tools to Model Context Protocol (MCP) servers to extend their capabilities. This article examines the security and governance challenges posed by ungoverned MCP server usage and outlines how an AI gateway combined with endpoint AI governance can help organizations regain control.

AI agents and developer tools increasingly rely on Model Context Protocol (MCP) servers to enhance their functionality, allowing them to interact with external systems, read files, or execute code. While this capability empowers developers, the proliferation of unmanaged MCP server connections across an organization can introduce significant security and compliance risks. Without a clear governance framework, these connections can become a blind spot, leading to "shadow AI" where sensitive company data might be inadvertently exposed or misused.

Understanding the Shadow AI Challenge with Ungoverned MCP Servers

The ease with which developers can configure AI tools and agents to connect to various MCP servers presents a double-edged sword. On one hand, it fosters innovation and productivity. On the other, it creates an environment where IT and security teams lose visibility into critical data flows and potential vulnerabilities.

When developers connect random MCP servers to their AI assistants, the consequences can include:

Data Leakage: Sensitive intellectual property, customer data, or internal documents could be processed by an unapproved MCP server, potentially transmitted to external services without encryption, or stored insecurely.
Compliance Violations: Industry regulations like GDPR, HIPAA, or SOC 2 often mandate strict control over data handling. Ungoverned MCP servers can bypass these controls, leading to non-compliance and hefty fines.
Security Risks: Malicious MCP servers could introduce vulnerabilities, act as an exfiltration vector for data, or execute unauthorized actions within the company's network.
Lack of Auditability: Without centralized logging and control, there is no way to track which data was sent to which MCP server, who accessed it, or how it was used. This absence of an audit trail makes incident response and forensic analysis nearly impossible.

These challenges highlight the critical need for a robust strategy to govern AI traffic, particularly at the endpoint where developers are actively interacting with AI tools. The rise of shadow IT, now manifesting as shadow AI, necessitates a comprehensive approach to visibility and control.

The Bifrost Approach: Centralized AI Governance at Scale

Organizations can begin to address this challenge by routing all AI traffic through a dedicated AI gateway. Bifrost, an open-source AI gateway from Maxim AI, provides a centralized control plane for managing interactions with LLM providers and the MCP ecosystem.

At the gateway layer, Bifrost enables robust governance through features such as:

Virtual Keys: These serve as primary governance entities, allowing administrators to define specific access permissions, budgets, and rate limits for different projects, teams, or individual users. This ensures that even approved MCP interactions operate within defined constraints.
Guardrails: Bifrost offers comprehensive guardrails to detect and prevent sensitive data from leaving the organization. This includes native secrets detection, custom regex patterns for PII, and integrations with third-party content safety solutions like AWS Bedrock Guardrails or Azure Content Safety. These guardrails apply to all traffic passing through the gateway.
Audit Logs: Every interaction routed through Bifrost generates immutable audit logs, providing a complete historical record of prompts, responses, token usage, and policy enforcement actions. This is crucial for compliance reporting and incident investigation.
MCP Tool Filtering: Administrators can define which MCP tools are accessible per virtual key, ensuring that only approved tools can be invoked by agents connecting through the gateway.
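The interaction between a virtual key's budget and its tool allow-list can be pictured with a small model. This is only a conceptual sketch: the `VirtualKey` class, its fields, and the `authorize` method are hypothetical names for illustration and do not mirror Bifrost's actual schema or API.

```python
from dataclasses import dataclass

@dataclass
class VirtualKey:
    """Hypothetical per-consumer policy: a spend budget plus permitted MCP tools."""
    name: str
    budget_usd: float
    spent_usd: float = 0.0
    allowed_tools: frozenset = frozenset()

    def authorize(self, tool: str, estimated_cost_usd: float):
        """Return (allowed, reason) for a single tool invocation."""
        if tool not in self.allowed_tools:
            return False, "tool not allowed for this key"
        if self.spent_usd + estimated_cost_usd > self.budget_usd:
            return False, "budget exceeded"
        return True, "ok"

key = VirtualKey(
    name="team-a",
    budget_usd=10.0,
    spent_usd=9.5,
    allowed_tools=frozenset({"read_file"}),
)
print(key.authorize("read_file", 0.25))
print(key.authorize("run_shell", 0.01))
print(key.authorize("read_file", 1.0))
```

Checking the tool allow-list before the budget means a disallowed tool is rejected regardless of remaining spend, which keeps the denial reason unambiguous in an audit log.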

While these gateway-level controls are powerful, they only govern traffic that is explicitly configured to flow through Bifrost. The core problem of developers connecting random MCP servers often occurs outside this explicit routing.

Extending Control to the Endpoint with Bifrost Edge

To truly regain control over ungoverned MCP server usage and mitigate shadow AI risks, organizations need to extend their governance policies to the devices where AI tools are actually used. This is where Bifrost Edge plays a critical role, complementing the AI gateway by bringing endpoint AI traffic under the same centralized governance.

Bifrost Edge is an endpoint agent that runs natively on macOS, Windows, and Linux machines. It routes all AI traffic from supported applications—including desktop chat apps, AI in the browser, and coding agents—through the organization's Bifrost AI gateway. This ensures that the same virtual keys, budgets, guardrails, and audit logs configured in Bifrost are enforced on every endpoint.

Automated MCP Server Discovery and Approval

One of the most significant challenges with ungoverned MCP servers is a lack of visibility. Edge addresses this by actively inventorying the MCP servers configured within each AI application across the entire fleet of devices. This process builds a live, deduplicated catalog of every MCP server in use.

Fleet-wide Inventory: Administrators gain a clear dashboard view of all discovered MCP servers, noting which ones are in use, on how many devices, and their current approval status. This provides the data necessary to make informed governance decisions.
Centralized Approval Workflow: When Edge detects a new MCP server, it can automatically request approval in the Bifrost admin console. Administrators can then decide to allow, deny, or place the server in a pending state, with the decision enforced instantly across all relevant devices. A denied MCP server cannot be used, even if an application was previously configured to connect to it.

Enforcing Policies on the Device

Bifrost Edge ensures that governance is not advisory but strictly enforced. When an MCP server is denied through the Bifrost control plane, Edge prevents any traffic from reaching that server from the endpoint, regardless of the application's local configuration. This active enforcement is critical for maintaining compliance and security across the organization.
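The decision logic behind per-server enforcement can be sketched as a simple lookup. The `Status` enum, the `is_allowed` helper, and the example URLs are hypothetical. How Edge treats servers that are still pending is not specified here, so the sketch exposes that choice as an explicit `allow_pending` flag and defaults to blocking, which is an assumption.

```python
from enum import Enum

class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    DENIED = "denied"

def is_allowed(server: str, decisions: dict, allow_pending: bool = False) -> bool:
    """Decide whether traffic to an MCP server may leave the device.

    Servers with no recorded decision are treated as pending.
    """
    status = decisions.get(server, Status.PENDING)
    if status is Status.APPROVED:
        return True
    if status is Status.DENIED:
        return False
    # Pending: behavior is a policy choice (assumed default: block).
    return allow_pending

decisions = {
    "https://mcp.example.com/sse": Status.APPROVED,
    "https://unvetted.example.net/mcp": Status.DENIED,
}
print(is_allowed("https://mcp.example.com/sse", decisions))
print(is_allowed("https://unvetted.example.net/mcp", decisions))
print(is_allowed("https://new.example.org/mcp", decisions))
```

Because the check keys on the server rather than on the application, a denial holds even when an app's local configuration still lists that server.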

Guardrails and Security Everywhere

The guardrails configured in the Bifrost AI gateway automatically extend to all AI traffic routed through Edge. This means that sensitive information, PII, or secrets are detected and blocked before they leave an employee's machine, or before an unapproved MCP server can process them. This consistent application of security policies significantly reduces the attack surface and helps achieve compliance with various data protection standards.

Seamless Deployment with MDM

For large organizations, rolling out endpoint agents can be complex. Bifrost Edge is designed for fleet-wide deployment via existing Mobile Device Management (MDM) platforms, including Jamf, Microsoft Intune, Kandji, Omnissa Workspace ONE, and JumpCloud. This allows for silent installation and managed configuration, ensuring broad coverage without requiring manual user setup.

Benefits of Centralized MCP Governance

By combining the power of an AI gateway like Bifrost with the endpoint reach of Bifrost Edge, organizations can achieve comprehensive control over their AI ecosystem:

Eliminate Shadow AI: Gain full visibility and control over all AI tools and MCP server connections, regardless of where they are used.
Enhanced Security: Apply consistent security policies and guardrails to all AI traffic, protecting sensitive data from exfiltration and misuse.
Assured Compliance: Maintain immutable audit trails and enforce data governance policies across the entire AI surface, simplifying compliance with regulations.
Streamlined Operations: Manage AI policies and approvals centrally, with automatic enforcement at the endpoint, reducing manual overhead and risk.
Empowered Developers: Developers can continue to innovate with AI tools, confident that their work aligns with organizational security and governance standards.

For enterprise teams seeking to navigate the complexities of AI governance and ensure secure, compliant AI operations, a unified approach combining a robust AI gateway with endpoint intelligence offers a definitive path to regaining control.

Governing MCP Server Usage in Coding Agents Fleet-Wide

Marcus Chen — Wed, 24 Jun 2026 18:29:18 +0000

Explores the risks of ungoverned Model Context Protocol (MCP) server usage in coding agents and how Bifrost, with its endpoint AI governance capabilities, enables fleet-wide visibility and control.

The rapid adoption of AI coding assistants by development teams has brought unprecedented productivity gains. However, this shift also introduces new governance challenges, particularly concerning the Model Context Protocol (MCP) servers these agents utilize. Ensuring that every instance of an AI coding assistant and its MCP server connections is visible and governed across an entire fleet requires a robust strategy. Bifrost, an open-source AI gateway from Maxim AI, provides the foundational infrastructure to manage and secure AI traffic, extending its capabilities to endpoint governance for comprehensive control over agentic workflows.

The Rise of Agentic Coding and MCP Servers

The Model Context Protocol (MCP) is an open standard designed to connect AI applications, such as large language models (LLMs), with external systems like tools, data sources, and workflows. It acts as a universal adapter, allowing AI assistants to make structured API calls and interact with the outside world beyond their training data. This standardization helps solve the "N×M integration problem," where each AI application would otherwise need custom integrations for every external service.

Coding agents, which leverage LLMs to perform complex development tasks, increasingly rely on MCP servers to execute actions like reading files, running tests, and interacting with APIs. Popular coding agents that support MCP include Claude Code, Codex CLI, Gemini CLI, Cursor, OpenCode, Qwen Code, Roo Code, and Zed Editor. These tools empower developers to automate repetitive tasks and accelerate development cycles.

The Hidden Risks of Ungoverned Tool Usage (Shadow AI)

While powerful, the proliferation of AI coding assistants and their underlying MCP server connections introduces significant security and compliance risks for enterprises. Many organizations find that their developers are using these tools without formal approval or oversight from IT and security teams, a phenomenon widely known as "shadow AI".

The consequences of ungoverned MCP server usage can be severe:

Sensitive Data Exfiltration: MCP sessions often handle highly sensitive data, including API keys, database credentials, and personally identifiable information (PII). Without proper controls, this data can be exfiltrated through compromised or malicious tools. Traditional data loss prevention (DLP) tools are frequently unable to reliably parse the conversational, JSON-based payloads in MCP traffic, creating blind spots.
Unauthorized Agent Actions: A compromised MCP server can lead to an agent performing unintended actions, such as modifying records, initiating transactions, or accessing unauthorized systems. Prompt injection attacks, a novel threat unique to LLMs, can manipulate agents into overriding security safeguards or revealing sensitive information through the tools they access.
Overprivileged Access and Privilege Escalation: Many MCP-enabled tools require broad permissions, potentially violating the principle of least privilege. In multi-agent environments, a single compromised agent could escalate privileges laterally across other agents, turning a vulnerability into an organization-wide exposure.
Supply Chain Exposure: MCP servers rely on software components, making them vulnerable to supply chain attacks. A compromised component could be used to exfiltrate data or manipulate agent instructions.
Missing Audit Trails: Without centralized governance, there is no comprehensive record of which MCP servers were used, what actions were taken, or what data was accessed, making compliance and incident response difficult.

The rise of shadow AI in development teams means that many agent-to-system integrations operate without security review, creating uninventoried blind spots where these risks can materialize undetected.

Bridging the Gap with Endpoint AI Governance

To effectively mitigate these risks, organizations must implement robust AI governance that extends beyond the network perimeter to the endpoint where AI tools are actually used. Endpoint AI governance ensures that controls are applied directly on the device, covering desktop applications, browser-based AI, and coding agents.

A comprehensive approach to governing AI on the endpoint integrates with an AI gateway as the central control plane. The Bifrost AI gateway serves as the policy engine, where virtual keys, budgets, rate limits, routing, guardrails, and audit logs are configured. Bifrost Edge then extends that same governance to the endpoint, enforcing the established policies on AI traffic from every employee machine. Combining the gateway with Edge is what makes AI security consistent and enforceable across the fleet.

Fleet-Wide MCP Server Discovery and Control with Bifrost Edge

Bifrost Edge is an endpoint agent that runs on every computer in an organization, transparently routing all AI traffic through the company's Bifrost gateway. This enables comprehensive visibility and control over MCP server usage.

One of Bifrost Edge's core capabilities is its ability to inventory and govern MCP servers [https://docs.getbifrost.ai/edge/mcp-governance]. It automatically discovers the MCP servers configured within each AI application across the entire fleet, creating a live, centralized inventory for administrators. This provides the crucial visibility needed to answer the question: "What MCP servers are running on our fleet?"

Administrators can then make per-server allow/deny decisions through a centralized approvals dashboard [https://docs.getbifrost.ai/edge/admin-approvals]. A denied MCP server is actively blocked on the device, preventing any data from leaving the machine via that server, even if the application had it configured previously. This enforcement applies to a range of AI applications and coding agents, including Claude Code, Claude Desktop, Gemini CLI, OpenCode, Codex, and Cursor [https://docs.getbifrost.ai/edge/supported-applications]. When Edge detects a new MCP server or application, it automatically requests approval in the admin console, allowing for proactive governance [https://docs.getbifrost.ai/edge/app-governance].

Centralized Policy, Decentralized Enforcement

With Bifrost Edge, the existing governance framework defined within the Bifrost AI gateway seamlessly extends to the endpoint. This means that virtual keys, budget allocations, rate limits, and guardrails configured in Bifrost automatically apply to prompts and responses from desktop apps, browser AI, and coding agents [https://docs.getbifrost.ai/edge/security].

Guardrails, which are configured using reusable profiles and rules at the gateway level [https://docs.getbifrost.ai/enterprise/guardrails], detect and prevent sensitive content—such as secrets or PII—from leaving the machine. This includes native Secrets Detection (Gitleaks-backed) and Custom Regex capabilities, as well as integrations with third-party guardrail providers like AWS Bedrock Guardrails, Azure Content Safety, CrowdStrike AIDR, GraySwan Cygnal, and Patronus AI [https://docs.getbifrost.ai/enterprise/guardrails].

Every AI request, whether from a centrally configured application or an endpoint coding agent, inherits the organization's comprehensive audit logging [https://docs.getbifrost.ai/enterprise/audit-logs], ensuring an immutable trail for compliance standards like SOC 2, GDPR, HIPAA, and ISO 27001.

Seamless Deployment and Continuous Compliance via MDM

Rolling out endpoint AI governance across an enterprise fleet can be complex, but Bifrost Edge simplifies this through native integration with existing mobile device management (MDM) platforms. Organizations can push the Edge agent to every machine using managed configurations, eliminating the need for individual users to download or manually configure anything [https://docs.getbifrost.ai/edge/deployment-mdm].

Bifrost Edge supports major MDM platforms, including Jamf, Microsoft Intune, Kandji, Omnissa Workspace ONE, and JumpCloud, across macOS, Windows, and Linux devices. This streamlines deployment, ensuring that machines are pre-configured to point to the organization's Bifrost instance. The setup process involves a single browser sign-in via the organization's single sign-on (SSO), linking the device to the user and syncing assigned policies without sensitive information residing on the device itself [https://docs.getbifrost.ai/edge/how-it-works].

By actively governing AI at the endpoint, Bifrost Edge helps organizations:

End shadow AI: Bring all user-initiated AI tool usage under governance.
Ensure zero per-app setup: Transparently route traffic without requiring users to reconfigure individual applications.
Achieve compliance everywhere: Extend existing security and governance policies to every laptop, aligning AI operations with regulatory requirements.

Securing the Future of Agentic Workflows

The shift towards agentic coding workflows, where AI assistants interact autonomously with external tools, necessitates a proactive and comprehensive approach to governance. Relying solely on network-level controls is insufficient for the dynamic and distributed nature of modern AI tool usage.

By combining the robust policy engine of the Bifrost AI gateway with the endpoint enforcement capabilities of Bifrost Edge, organizations can gain the visibility and control needed to securely embrace AI coding assistants. This integrated approach ensures that innovation in development proceeds hand-in-hand with enterprise security, compliance, and responsible AI practices. Teams evaluating AI gateways and endpoint governance solutions can request a Bifrost demo to explore these capabilities or review the open-source repository.

How Enterprises Monitor and Control Model Context Protocol Servers

Marcus Chen — Wed, 24 Jun 2026 18:29:07 +0000

Enterprise AI deployments face significant challenges governing Model Context Protocol (MCP) servers used by AI agents. This article examines how organizations can gain visibility and implement robust controls for MCP usage across their fleet.

AI agents are rapidly transforming enterprise workflows, automating tasks and interacting with a multitude of tools. A key enabler of this functionality is the Model Context Protocol (MCP), which allows language models to discover, invoke, and interact with external services. While MCP unlocks powerful capabilities, it also introduces significant governance and security challenges for organizations. Without proper controls, IT and security teams have no visibility into which external tools employees' AI agents are using or what data flows are involved. Bifrost, an open-source AI gateway from Maxim AI, provides a comprehensive approach to gain visibility and enforce policies over MCP server usage within an enterprise.

The Rise of Model Context Protocol (MCP) and Agentic AI

The Model Context Protocol (MCP) defines a standard for how large language models (LLMs) and AI agents can interact with external tools and services. Instead of merely generating text, an AI agent leveraging MCP can read files, call APIs, and execute actions by connecting to specialized MCP servers. This capability is foundational for agents to perform complex, multi-step tasks that require real-world interaction, such as summarizing documents, managing calendars, or integrating with internal systems.

As agentic AI becomes more prevalent, so does the reliance on MCP servers. These servers can be internal (connecting to company APIs) or external (integrating with third-party services). The power of MCP lies in its extensibility, allowing agents to become more versatile and effective. However, this extensibility also presents a governance paradox for enterprises: how can organizations permit the innovative use of agents while maintaining control over data, security, and compliance?

The Shadow AI Problem: Ungoverned MCP Servers

The primary challenge for enterprises is the proliferation of "shadow AI." This refers to AI tool usage by employees that occurs outside the visibility and control of IT and security teams. When employees install popular AI desktop chat applications (such as Claude Desktop or Cursor), utilize coding agents in their terminals (like Claude Code or Gemini CLI), or interact with AI in their browsers, they may configure connections to various MCP servers without explicit oversight. These tools often allow users to specify arbitrary MCP server URLs.

This ungoverned usage creates significant risks:

Data Exfiltration: Sensitive company data could inadvertently be sent to unsanctioned external MCP servers.
Security Vulnerabilities: Malicious or compromised MCP servers could introduce security risks to the corporate network or data.
Compliance Gaps: Without an audit trail or policy enforcement, organizations cannot demonstrate compliance with regulations like SOC 2, GDPR, HIPAA, or ISO 27001 regarding AI usage.
Cost Overruns: Uncontrolled agent activity can lead to unexpected costs from third-party services.

A traditional AI gateway can only govern traffic that is explicitly routed through it. MCP servers configured directly on an employee's machine bypass this central control, creating a critical blind spot that most enterprises are unprepared to address.

Gaining Visibility: Inventory and Discovery of MCP Servers

The first step in controlling MCP server usage is to understand what exists. Manually inventorying every MCP server configured across a fleet of employee machines is an impractical task. Enterprises need automated mechanisms to discover and catalog these connections.

Bifrost Edge, the endpoint AI governance component of the Bifrost AI gateway, addresses this by running an agent natively on macOS, Windows, and Linux devices. This agent automatically identifies AI applications and the MCP servers configured within them. Edge builds a live, fleet-wide inventory of all MCP server connections. This capability allows security and IT teams to answer critical questions such as: "Which MCP servers are currently active across our endpoints?" or "Are employees connecting to any unsanctioned external tools via AI agents?"

The collected data is then centralized in the Bifrost admin console, providing a consolidated view of all discovered applications and MCP servers. This gives administrators the visibility needed to begin formulating and enforcing policies. Edge's MCP governance features provide this discovery for the major AI apps that support MCP, so connections made in those apps do not go unnoticed.

Implementing Control: Centralized Governance for MCP Usage

Once visibility is established, the next step is to implement robust controls. Bifrost, acting as the central AI gateway and policy engine, combined with Bifrost Edge enforcing those policies at the endpoint, provides a comprehensive governance framework.

The Bifrost AI gateway is where virtual keys, budgets, rate limits, routing, guardrails, and audit logs are configured. Bifrost Edge then extends this governance to every machine. This means the same policies that apply to AI traffic routing through the gateway also apply to AI traffic originating from endpoint applications, including their MCP server interactions.

For MCP server control, administrators can leverage the Bifrost admin console to:

Approve or Deny MCP Servers: After discovery, each unique MCP server found across the fleet appears in a catalog. Administrators can then make explicit per-server allow or deny decisions. A denied server cannot be used, even if an AI application on an endpoint was previously configured to connect to it.
Govern AI Applications: Beyond individual MCP servers, administrators can also define which AI applications themselves are permitted for use on company machines. Edge enforces these policies, ensuring only sanctioned applications can operate.
Apply Policies via Virtual Keys: Bifrost's virtual keys allow administrators to assign specific MCP tool filtering policies to different projects, teams, or individual users. This fine-grained control ensures that developers, for example, might have access to a different set of tools than a customer support team.

The decisions made in the central Bifrost console are automatically synchronized to every Bifrost Edge agent. This ensures that policy updates take effect across the entire organization without requiring manual configuration on individual devices.
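To make the allow/deny model concrete, here is a minimal Python sketch of a per-server decision lookup. It is not Bifrost's API or schema: the catalog contents, the `normalize` and `is_permitted` names, and the default-deny treatment of servers that have no decision yet are all assumptions for illustration. It also only covers URL-based servers, while MCP servers can also be launched as local commands.

```python
from urllib.parse import urlsplit

# Hypothetical catalog: normalized server origin -> admin decision.
CATALOG = {
    "https://mcp.internal.example.com": "allow",
    "https://tools.thirdparty.example.net": "deny",
}

def normalize(server_url: str) -> str:
    """Reduce a configured MCP server URL to scheme://host[:port] for lookup."""
    parts = urlsplit(server_url.strip())
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}"

def is_permitted(server_url: str, catalog=CATALOG) -> bool:
    """Assumed default-deny: only an explicit 'allow' decision passes."""
    return catalog.get(normalize(server_url)) == "allow"

for url in [
    "https://MCP.internal.example.com/sse",      # approved
    "https://tools.thirdparty.example.net/mcp",  # explicitly denied
    "https://unknown.example.org/mcp",           # never reviewed
]:
    print(url, is_permitted(url))
```

Keying the lookup on a normalized origin rather than the raw string means a change in letter case or path does not slip past a deny decision.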

Enhancing Security and Compliance with Guardrails and Audit Logs

Controlling which MCP servers are used is crucial, but equally important is governing the content and actions flowing through them. Bifrost extends its powerful guardrail capabilities to endpoint AI traffic, ensuring comprehensive security and compliance.

Guardrails are applied before a prompt reaches an MCP server and before its response is returned to the AI agent. This allows organizations to:

Detect Secrets: Automatically identify and block sensitive information, such as API keys, credentials, or tokens, from being sent in prompts or extracted from responses. Bifrost includes native secrets detection powered by Gitleaks.
Enforce Custom Content Policies: Implement custom regex rules to prevent the transmission of specific types of PII, proprietary code, or other sensitive data unique to the organization.
Integrate Third-Party Content Safety: Leverage existing investments in security tools like AWS Bedrock Guardrails, Azure Content Safety, CrowdStrike AIDR, GraySwan Cygnal, and Patronus AI, with their policies applying to MCP traffic as well.
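As an illustration of the custom regex rules described above, the following Python sketch scans a payload before it would be sent on. It is a generic example, not Bifrost's guardrail configuration and not Gitleaks: the pattern set, the `scan` and `guard` names, and the choice to raise on a match are assumptions, and the three patterns are deliberately simple.

```python
import re

# Illustrative patterns only; a real deployment would use a maintained rule set.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "email_address": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def scan(text: str) -> list[str]:
    """Return the names of every pattern that matches the text."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(text)]

def guard(text: str) -> str:
    """Raise if the payload trips any rule; otherwise return it unchanged."""
    hits = scan(text)
    if hits:
        raise ValueError(f"blocked by content policy: {', '.join(hits)}")
    return text

# AKIAIOSFODNN7EXAMPLE is the placeholder key ID used in AWS documentation.
print(scan("deploy with AKIAIOSFODNN7EXAMPLE please"))
print(scan("summarize the attached design doc"))
```

The same check would run in both directions: on the prompt before it reaches the MCP server and on the response before it is returned to the agent.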

Furthermore, all MCP server interactions governed by Bifrost Edge are captured in immutable audit logs. These logs provide a comprehensive, tamper-proof record of AI usage, which is essential for demonstrating compliance with regulatory requirements like SOC 2, GDPR, HIPAA, and ISO 27001.

Streamlined Deployment with MDM for Fleet-Wide Governance

Deploying endpoint agents across an entire enterprise fleet can be a significant operational challenge. Bifrost Edge is designed for mass deployment through existing Mobile Device Management (MDM) platforms. This eliminates the need for manual installation or complex user-driven setup, ensuring consistent rollout and compliance.

Bifrost Edge supports major MDM platforms including Jamf, Microsoft Intune, Kandji, Omnissa Workspace ONE, and JumpCloud across macOS, Windows, and Linux devices. Administrators can push the Edge agent to every machine with a managed configuration that pre-points it to the organization's Bifrost AI gateway. The first-launch flow is streamlined: silent installation, a single user sign-in via SSO in the browser to link the device to the user, and then immediate policy enforcement.

This MDM-native deployment ensures that AI governance, including MCP server control, is rolled out consistently and automatically to every managed endpoint, closing shadow AI gaps efficiently.

The AI Gateway + Bifrost Edge Approach for Comprehensive MCP Control

Controlling Model Context Protocol servers in an enterprise environment requires a multi-layered strategy that combines centralized policy management with endpoint enforcement. The Bifrost AI gateway serves as the control plane and policy engine, where all governance rules for AI traffic are defined. Bifrost Edge extends this same governance to every endpoint, ensuring that the AI agents and tools employees use on their machines adhere to organizational policies.

This combined "AI Gateway + Bifrost Edge" approach provides unparalleled visibility into MCP server usage, granular control over permitted tools and applications, and robust security and compliance through integrated guardrails and audit logs. For organizations seeking to fully govern their AI landscape, this integrated solution provides a clear path to managing the risks and unlocking the full potential of agentic AI. Teams evaluating AI gateways can request a Bifrost demo or review the open-source repository.

Sources

Bifrost Docs: Govern MCP servers (MCP governance). https://docs.getbifrost.ai/edge/mcp-governance
Bifrost Docs: Admin Approvals. https://docs.getbifrost.ai/edge/admin-approvals
Bifrost Docs: Govern AI apps (app governance). https://docs.getbifrost.ai/edge/app-governance
Bifrost Docs: Security & guardrails. https://docs.getbifrost.ai/edge/security
Bifrost Docs: Deploy with MDM. https://docs.getbifrost.ai/edge/deployment-mdm

Bootstrap confidence intervals for your LLM eval metrics

Marcus Chen — Wed, 24 Jun 2026 06:32:50 +0000

TL;DR: A single eval number hides its own uncertainty. Eval confidence intervals from bootstrap resampling turn a point estimate like 84.2% accuracy into a range, so you stop shipping models on a difference that is noise.

Two checkpoints came back from a fine-tuning run at 84.2% and 85.7% on our 500-example agent eval set. The 1.5 point gap read like a win, and someone wanted to promote the second checkpoint to staging. Before that, I wanted eval confidence intervals on both numbers, because a 500-example set carries more sampling error than most teams admit. At 500 examples, the 95% interval on a single accuracy near 85% spans roughly 3 points on each side. The win sat well inside the noise.

I lead the fine-tuning and evaluation team at Nexus Labs, and the most common mistake I see is treating an eval score as exact. It isn't. Your eval set is a sample drawn from the input space you care about, and a different 500 examples would return a different number. Confidence intervals make that variance visible.

What an eval confidence interval actually tells you

An eval confidence interval is a range around a metric, like accuracy or F1, that quantifies how much the score would move if you resampled the eval set. A 95% bootstrap interval of [81.0%, 87.1%] means that across thousands of resamples of your data, 95% of the recomputed scores fell in that band. It measures sampling noise, not model quality.

That distinction matters. When two checkpoints score 84.2% and 85.7% and each interval is roughly 3 points wide on either side, the eval set gives you no solid evidence that one is better. Card et al. showed in "With Little Power Comes Great Responsibility" that many NLP experiments are underpowered to detect the effect sizes they report.

Computing bootstrap confidence intervals

The bootstrap is resampling with replacement. You take your per-example results, draw N of them with replacement many times, recompute the metric each time, and read percentiles off the resulting distribution. There's no assumption that the metric is normally distributed.

import numpy as np

# per-example correctness, 1 = pass, 0 = fail
results = np.array(eval_pass_flags)  # shape (500,)

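def bootstrap_ci(x, n_boot=10_000, alpha=0.05):
    """Percentile bootstrap CI for the mean of per-example scores."""
    n = len(x)
    rng = np.random.default_rng(0)  # fixed seed so the interval is repeatable
    means = np.empty(n_boot)
    for i in range(n_boot):
        # draw n example indices with replacement and recompute the metric
        sample = x[rng.integers(0, n, n)]
        means[i] = sample.mean()
    lo = np.percentile(means, 100 * alpha / 2)
    hi = np.percentile(means, 100 * (1 - alpha / 2))
    # plain floats so the tuple prints without NumPy scalar wrappers
    return float(x.mean()), float(lo), float(hi)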

print(bootstrap_ci(results))  # (0.842, 0.806, 0.876)

scipy ships scipy.stats.bootstrap if you'd rather not hand-roll it. For 500 examples and 10,000 resamples this runs in under a second, so there's no cost excuse to skip it.

Paired bootstrap for model comparisons

When comparing two checkpoints, don't bootstrap each interval separately and check for overlap. Overlapping intervals can still hide a real difference. Use a paired bootstrap: resample the example indices once per iteration, score both models on the same indices, and record the difference.

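def paired_bootstrap(a, b, n_boot=10_000):
    """95% percentile CI on mean(a) - mean(b); a and b are aligned per-example arrays."""
    n = len(a)
    rng = np.random.default_rng(0)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, n)  # same resampled indices for both models
        diffs[i] = a[idx].mean() - b[idx].mean()
    return np.percentile(diffs, [2.5, 97.5])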

If that interval on the difference contains zero, you can't claim the second checkpoint is better. On our 1.5 point gap it ran from -1.9 to +4.8 points. Zero is in the band, so we did not promote. Dror et al.'s "Hitchhiker's Guide to Testing Statistical Significance in NLP" covers when paired tests apply and which to pick.
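To see why checking two separate intervals for overlap is the wrong test, here is a small synthetic example. The data is made up and is not our checkpoints: model B passes every example A passes plus 15 more, which is an extreme case chosen so the effect is easy to see. The two individual intervals should overlap, while the paired interval on the difference should sit entirely above zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic per-example pass flags (not the article's data).
a = np.zeros(n, dtype=int)
a[:421] = 1            # model A passes 421/500 = 84.2%
b = a.copy()
b[421:436] = 1         # model B passes the same 421 plus 15 more = 87.2%

# One set of resampled indices per iteration, shared by both models.
idx = rng.integers(0, n, size=(2_000, n))
acc_a = a[idx].mean(axis=1)
acc_b = b[idx].mean(axis=1)

ci_a = np.percentile(acc_a, [2.5, 97.5])
ci_b = np.percentile(acc_b, [2.5, 97.5])
ci_diff = np.percentile(acc_b - acc_a, [2.5, 97.5])

intervals_overlap = ci_a[1] > ci_b[0]   # upper end of A vs lower end of B
diff_excludes_zero = ci_diff[0] > 0     # paired interval entirely above zero
print(ci_a, ci_b, ci_diff)
print(intervals_overlap, diff_excludes_zero)
```

The pairing works because most of the resampling noise is shared: both models are scored on the same drawn examples, so it cancels in the difference.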

How many eval examples do you need

Interval width shrinks with the square root of N, so halving it costs four times the labeled data. At 500 examples a near-85% metric carries about plus or minus 3 points; reaching plus or minus 1.5 needs roughly 2,000 labeled examples. That is the real budgeting question for an eval set, and it's why I push for fewer, higher-quality, well-stratified examples instead of chasing a round number.

For rare failure modes the picture is worse. A category with 20 examples in your set has an interval so wide it tells you almost nothing, which is how aggregate scores stay stable while a subpopulation quietly regresses.
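The arithmetic behind those numbers can be checked with the normal approximation for a proportion. This is a back-of-envelope sketch, not the bootstrap itself; the function names are mine, and the approximation gets rough for small N or pass rates near 0 or 1.

```python
import math

def half_width(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% half-width for a pass rate p measured on n examples."""
    return z * math.sqrt(p * (1 - p) / n)

def n_for_half_width(p: float, target: float, z: float = 1.96) -> int:
    """Smallest n whose approximate half-width is at most `target`."""
    return math.ceil(z * z * p * (1 - p) / (target * target))

print(round(100 * half_width(0.85, 500), 1))   # 3.1 points at 500 examples
print(round(100 * half_width(0.85, 20), 1))    # 15.6 points for a 20-example slice
print(n_for_half_width(0.85, 0.015))           # 2177 examples for +/- 1.5 points
```

The last figure is where "roughly 2,000" comes from, and the middle one is why a 20-example category tells you almost nothing.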

Trade-offs and limitations

The bootstrap assumes your eval examples are independent and drawn from the distribution you care about. If they cluster (multiple turns from one conversation, or near-duplicate prompts), the effective sample size is smaller than N and your interval comes out too narrow. Dedup first.
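When the clustering can't be removed by deduplication (multi-turn conversations are the usual case), one common alternative is to resample whole groups instead of single examples. This is a sketch of that cluster bootstrap, not something our harness section above uses; the function name, the `groups` labels, and the synthetic data are assumptions.

```python
import random
from collections import defaultdict

def cluster_bootstrap_ci(flags, groups, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap that resamples whole groups with replacement."""
    by_group = defaultdict(list)
    for flag, group in zip(flags, groups):
        by_group[group].append(flag)
    clusters = list(by_group.values())
    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        drawn = rng.choices(clusters, k=len(clusters))
        passed = sum(sum(c) for c in drawn)
        total = sum(len(c) for c in drawn)
        means.append(passed / total)
    means.sort()
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

# Synthetic worst case: 50 conversations of 10 turns, each all-pass or all-fail.
groups = [g for g in range(50) for _ in range(10)]
flags = [1 if g < 42 else 0 for g in groups]   # 84% pass rate over 500 rows
lo, hi = cluster_bootstrap_ci(flags, groups)
print(lo, hi)
```

In this synthetic case the 500 rows carry only 50 independent outcomes, so the normal approximation puts the honest interval near 10 points on each side instead of the 3 a per-row bootstrap would report.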

It also only measures sampling noise. It says nothing about label error, distribution shift between your eval set and production traffic, or a judge model that's miscalibrated. A tight interval on a biased metric is still wrong, only now you're confident in it. For very low pass rates the percentile bootstrap can misbehave; bias-corrected and accelerated (BCa) intervals are better there but slower to compute.

Wrapping up

Eval confidence intervals are the cheapest reliability upgrade available to an ML team. A dozen lines of NumPy turn every score into a score plus a band, and the band is usually wider than the gap you were about to ship on. Next time a checkpoint wins by a point or two, run the paired bootstrap before you tell anyone. The honest answer is often "we can't tell yet, label more data."

Benchmarking 5 LLM providers on one eval set, no SDK per vendor

Marcus Chen — Tue, 23 Jun 2026 16:01:10 +0000

TL;DR: We run a 1,200-case eval suite for enterprise agent automation at Nexus Labs. Comparing models across OpenAI, Anthropic, Bedrock, Vertex, and Groq used to mean five client libraries and five sets of retry logic. We put Bifrost in front of all of them and now the harness talks to one OpenAI-compatible endpoint. Here's what that bought us, and where it didn't help.

The problem was never the models

Our eval set is 1,200 cases. Tool-call traces, multi-turn agent transcripts, graded against a rubric. The grading is hard. The models are not the hard part.

The hard part was the plumbing around the models. Every time we wanted to score a new candidate, the code branched. openai.ChatCompletion for GPT-4o. anthropic.messages for Claude. The boto3 Bedrock runtime for the Llama variants. Vertex had its own auth dance with service accounts. Groq was OpenAI-shaped but pointed at a different base URL with its own rate limits.

Five providers. Five SDKs. Five retry policies, all written slightly differently by whoever added that provider. When a Vertex call 429'd at case 800 of a run, the harness handled it differently than when Anthropic did. So our results carried noise that had nothing to do with model quality.

One endpoint, five backends

Bifrost is a gateway. It speaks one OpenAI-compatible API and routes to 23+ providers behind it. We run it self-hosted, Docker, on a single box next to the eval workers.

docker run -p 8080:8080 maximhq/bifrost

The harness stopped importing vendor SDKs. It points the OpenAI client at localhost:8080/v1 and changes the model string. That's the whole switch.

import openai

# One OpenAI-compatible client for every provider; only the model string changes.
client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

for model in ["openai/gpt-4o", "anthropic/claude-sonnet-4-6",
              "bedrock/llama-3.3-70b", "vertex/gemini-2.0",
              "groq/llama-3.3-70b"]:
    for case in eval_set:           # 1,200 cases, loaded by the harness
        resp = client.chat.completions.create(
            model=model, messages=case.messages)
        record(case, model, resp)   # harness helper: store the response for grading

The retry and fallback logic moved out of our Python and into the gateway config. One policy, applied to every provider, documented in retries and fallbacks. When Vertex 429s now, it gets the same backoff as every other provider. The eval results stopped carrying our plumbing's personality.

What it changed, with numbers

A full sweep is 1,200 cases × 5 models = 6,000 calls. Before, a sweep failed partway maybe one run in three, usually a provider-specific timeout we hadn't normalized. We babysat it.

After, the gateway's load balancing across multiple API keys per provider cut our 429 retries enough that a clean sweep became the default. We added a second OpenAI key and a second Anthropic key to the config; Bifrost distributes across them. No code change.

Semantic caching helped less than I'd hoped on first runs and more than I expected on re-runs. Roughly 18% of our cases are near-duplicate prompts (same system prompt, tiny input deltas). Semantic caching served those on repeat runs. On a re-run after a rubric tweak, that shaved real wall-clock and provider spend. On a first run, zero benefit. Obvious in hindsight.

How it stacks up against the alternatives

We looked at LiteLLM and Portkey first. Both are good. The honest comparison:

Concern	Bifrost	LiteLLM	Portkey
Deployment	Self-host, Go binary	Self-host, Python proxy	Managed SaaS (self-host limited)
OpenAI-compatible API	Yes	Yes	Yes
Provider breadth	23+	Largest list	Broad
Per-key load balancing	Built in	Built in	Built in
Semantic caching	Built in	Via config	Strong, managed
Observability UI	Prometheus + web UI	Thinner UI	Best-in-class dashboards
Governance / virtual keys	Yes	Basic	Yes

If you want the widest provider list and you're Python-native already, LiteLLM is the pragmatic pick. If you want a managed dashboard and don't want to run infrastructure, Portkey's observability is ahead of what we self-host. We picked Bifrost because the Go gateway's overhead was low enough to not show up in our latency numbers, and we wanted the whole thing inside our network with no per-call data leaving. Virtual keys also let us split eval spend by team, which Stripe-brain me appreciates.

Trade-offs and Limitations

It's another process to run. If your eval suite only hits one provider, a gateway is pure overhead. Don't add it for a single backend.

Semantic caching can lie to you in evals. A cached response means you're scoring an old generation, not the current model. We disable caching on any run that's measuring model quality and only enable it for re-scoring runs where the generations are fixed and the rubric changed. If you forget that, you'll report a model improvement that's actually a cache hit.

The unified API smooths over provider differences you sometimes want to see. Bedrock's stop-reason semantics aren't identical to OpenAI's. The gateway normalizes the response shape, so a few provider-specific fields we used to inspect now need digging. For most eval work that's fine. For debugging a weird truncation, I still drop to the raw provider once in a while.

And it's young. The community is smaller than LiteLLM's. When I hit an edge case with Vertex auth, there were fewer GitHub issues to crib from. We filed one.

What I'd tell my past self

The model swap was never the work. The work was making five providers behave like one so the comparison meant something. A gateway is the boring infrastructure answer, and boring is what you want under an eval harness.

Run the sweep clean. Hold the plumbing constant. Then argue about the models.

temperature=0 didn't make our LLM evals reproducible

Marcus Chen — Tue, 23 Jun 2026 06:31:32 +0000

TL;DR: We set temperature=0 and seed=42 and still got different eval scores on the same 800-prompt suite across runs. The cause wasn't the sampler. It was batch-dependent floating point in the inference engine plus silent provider routing. We chased it for a week. Here's what we found and the three things that actually fixed it.

I lead the eval team at Nexus Labs. We fine-tune small models for enterprise agent automation, and our whole release process hangs on one number: pass rate on an 800-prompt domain suite. Green means ship.

Two weeks ago the same model, same suite, same code gave us 81.4% on Monday and 79.6% on Wednesday. Nobody touched the weights. That's a 14-prompt swing on a frozen artifact. If your eval moves more than your model improvements, you can't ship on it.

temperature=0 is not determinism

First assumption everyone makes: greedy decoding is deterministic. Set temperature to 0, you always pick the argmax token, done.

It isn't. temperature=0 removes sampling randomness. It does nothing about the fact that the logits themselves change depending on what else is in the batch.

vLLM (we run 0.6.x) uses continuous batching. Your prompt gets grouped with whatever other requests are in flight. Matrix multiply reductions over a batch of 4 versus a batch of 32 accumulate floating point in a different order. The result is a logit that differs in the 5th decimal place. Usually harmless. But when two candidate tokens are within ~1e-4 of each other, the argmax flips. One flipped token early in a tool-call response cascades into a different JSON structure, which fails our parser, which drops a point.

So our "deterministic" eval was deterministic per request but not across batch compositions. Run the suite when the cluster is busy, you get a different batch shape, you get a different score.
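The underlying effect is that floating point addition is not associative, so the order of a reduction changes the result. Here is a toy illustration in plain Python. The magnitudes are wildly exaggerated compared to the 5th-decimal differences we saw, the two-token "vocabulary" is made up, and none of this is vLLM code.

```python
def accumulate(values):
    """Sum left to right in plain float arithmetic, one addition at a time."""
    total = 0.0
    for v in values:
        total += v
    return total

def argmax(scores):
    """Return the key with the highest score."""
    return max(scores, key=scores.get)

# The same three terms, reduced in two different orders.
terms = [1e16, 1.0, -1e16]
logit_order_a = accumulate(terms)                            # 1.0 is lost next to 1e16
logit_order_b = accumulate([terms[0], terms[2], terms[1]])   # big terms cancel first

print(logit_order_a, logit_order_b)   # 0.0 1.0

# A competing token sits between the two results, so the greedy pick flips.
print(argmax({"}": logit_order_a, ",": 0.5}))   # ,
print(argmax({"}": logit_order_b, ",": 0.5}))   # }
```

A different batch shape changes how the reduction is grouped in the same way the reordering does here, which is all it takes when two candidate tokens are nearly tied.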

The second source: we didn't know which model answered

The bigger embarrassment. Our eval harness pointed at an internal endpoint that load-balanced across two provider deployments during a migration. About 6% of eval requests were silently hitting a different build of the serving stack with a different quantization. We had no per-request record of which backend served which prompt.

You can't debug a number you can't attribute. The fix for this half was operational, not numerical: route eval traffic through a gateway that logs the exact provider and model per request. We already run Bifrost (https://github.com/maximhq/bifrost) in front of our providers for failover, and its per-request logging gave us the backend attribution we'd been missing. LiteLLM does the equivalent; the point is you need the provenance, not a specific logo.

Once every eval response carried a backend tag, the 6% lit up immediately.

What moved the number

We measured each suspected cause by running the 800-prompt suite 20 times and looking at score variance.

Source	Score range over 20 runs	Fixed by
Batch-dependent FP (continuous batching)	`±1.8 pts`	Pin eval batch size to 1
Silent provider routing	`±2.1 pts`	Per-request backend logging
Parser tolerance on whitespace	`±0.9 pts`	Normalize before compare
Unseeded prompt shuffle in harness	0 pts (red herring)	n/a

The prompt-shuffle thing was where we wasted two days. Order doesn't change per-prompt correctness. We knew that. We checked it anyway because it was easy to check, which is its own lesson about how panic allocates engineering time.

The fix

Three changes. None of them clever.

First, eval runs go through a dedicated config with batch size pinned. Slower, but reductions happen over a fixed shape every time:

# eval-serving.yaml
engine:
  max_num_seqs: 1        # no co-batching during eval
  enforce_eager: true    # disable CUDA graph capture variance
sampling:
  temperature: 0.0
  seed: 42
  top_p: 1.0
logging:
  log_backend_id: true   # which deployment served this request

enforce_eager: true matters more than it looks. CUDA graph capture in vLLM can introduce its own kernel-selection differences across runs. Eager mode is slower but it removed another ±0.4 we hadn't isolated separately.

Second, every eval response is stored with the backend identifier and the raw logprobs of the top 2 tokens at each position. When a score moves now, we diff the logprob traces and find the exact prompt and position where decoding diverged. Takes minutes, not days.
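The diff itself is simple once the traces exist. Below is a minimal sketch; the trace layout (a list of steps, each with the chosen `token` and its `top2` logprobs) and both function names are assumptions for illustration, not our stored format.

```python
def first_divergence(trace_a, trace_b):
    """Return (position, step_a, step_b) for the first differing token, else None."""
    for pos, (step_a, step_b) in enumerate(zip(trace_a, trace_b)):
        if step_a["token"] != step_b["token"]:
            return pos, step_a, step_b
    return None

def top2_margin(step):
    """Logprob gap between the best and second-best token at one position."""
    (_, best), (_, runner_up) = step["top2"]
    return best - runner_up

# Hypothetical traces for one prompt from two runs.
run_a = [
    {"token": "{", "top2": [("{", -0.01), ("[", -4.80)]},
    {"token": '"name"', "top2": [('"name"', -0.6931), ('"tool"', -0.6932)]},
]
run_b = [
    {"token": "{", "top2": [("{", -0.01), ("[", -4.80)]},
    {"token": '"tool"', "top2": [('"tool"', -0.6931), ('"name"', -0.6932)]},
]

hit = first_divergence(run_a, run_b)
if hit is not None:
    pos, step_a, step_b = hit
    print(pos, step_a["token"], step_b["token"], top2_margin(step_a))
```

A tiny margin at the divergence point suggests a numerical near-tie, while a large one points at something else, such as a different backend. Note that `zip` stops at the shorter trace, so two traces where one is a prefix of the other compare as equal in this sketch.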

Third, we report eval scores as a range over 5 runs, not a single number. If the range is wider than 1 point, the result is "inconclusive, rerun," not "regression." We stopped pretending a single float is ground truth.
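A minimal sketch of that reporting rule, assuming scores are pass rates in percentage points. The function name, the returned fields, and the example scores are illustrative, not our harness code.

```python
def summarize_runs(scores, max_range=1.0):
    """Report the range over repeated runs and whether it is tight enough to trust."""
    low, high = min(scores), max(scores)
    spread = high - low
    verdict = "ok" if spread <= max_range else "inconclusive, rerun"
    return {"low": low, "high": high, "spread": round(spread, 2), "verdict": verdict}

# Made-up pass rates from 5 runs of the same suite.
print(summarize_runs([81.4, 80.9, 81.1, 81.0, 81.3]))
print(summarize_runs([81.4, 79.6, 80.5, 81.0, 80.2]))
```

Only a range inside the threshold is compared against the previous release, so a regression call never rests on a single run.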

Trade-offs and Limitations

Batch size 1 for eval is expensive. Our 800-prompt suite went from 4 minutes to 19. We accept that because eval correctness is worth more than eval speed, but if you run evals on every commit, 19 minutes is a real tax. We gate it: fast batched eval on PRs for a rough signal, pinned eval on release candidates only.

Pinning enforce_eager and max_num_seqs: 1 means your eval environment no longer matches production serving conditions. You're measuring the model, not the production system. That's the right call for catching regressions in weights, the wrong call if you're trying to reproduce a user-reported production bug, where batch effects are part of the story.

And storing top-2 logprobs per position roughly tripled our eval artifact storage. Cheap at our 800-prompt scale. Reconsider it at 100k.

None of this makes the eval "correct." It makes it reproducible. Those are different problems. A reproducible eval that measures the wrong thing is still wrong, just consistently. The contents of the suite are still the hard part.

Harvesting a regression test set from gateway logs with a plugin

Marcus Chen — Mon, 22 Jun 2026 16:01:28 +0000

TL;DR: Our eval sets went stale because a human wrote the test cases by hand once and never updated them. We moved the capture point into the gateway. A Bifrost custom plugin logs every production request and response, and we curate a weekly regression set from real traffic instead of inventing inputs at our desks.

I lead the fine-tuning and eval team at Nexus Labs. Six people. We ship enterprise agent automation, and the model is the easy part. The hard part is knowing whether last week's change made anything better or quietly broke a customer's workflow.

For a year our regression suite was 120 hand-written cases. Someone on the team sat down, imagined what a user might ask, and froze it. By month three those inputs looked nothing like what real agents were sending. We were grading ourselves on a test we wrote, not the one production was running.

Why the gateway is the right capture point

We route every model call through Bifrost, an open-source AI gateway in front of OpenAI, Anthropic, and our self-hosted vLLM endpoints. It already sees the full request and response for gpt-4o-mini and our fine-tuned Qwen2.5-7B. That's the natural seam to tap.

Capturing inside the application means touching every service. Capturing at the gateway means one place. Bifrost ships a custom plugin system, a middleware layer with a pre-hook and a post-hook around each call, documented under Custom Plugins. We wrote a plugin that copies request and response into a Postgres table with the model id, latency, and token counts attached.

Here's the shape of it. Go, because Bifrost is Go.

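func (p *EvalCapturePlugin) PostHook(
    ctx context.Context,
    req *schemas.BifrostRequest,
    res *schemas.BifrostResponse,
) (*schemas.BifrostResponse, error) {
    // nothing to capture if the call produced no choices
    if res == nil || len(res.Choices) == 0 {
        return res, nil
    }
    // sample 5% of traffic, skip anything flagged PII
    // (hash is assumed to be a stable helper defined elsewhere in the plugin)
    if hash(req.ID)%20 == 0 && !req.Meta.Sensitive {
        sample := EvalSample{
            Model:   req.Model,
            Input:   req.Input,
            Output:  res.Choices[0].Message,
            Latency: res.ExtraFields.Latency,
            Tokens:  res.Usage.TotalTokens,
        }
        // non-blocking send: drop the sample if the queue is full
        // instead of stalling the request path
        select {
        case p.queue <- sample:
        default:
        }
    }
    return res, nil
}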

The queue drains to Postgres on a separate goroutine so we don't add latency to the request path. We sample 5%, which at our volume is roughly 9,000 captured calls a day.
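The sampling rule is worth getting right, because it has to make the same decision for the same request every time. Here is the same idea sketched in Python, with `hashlib` standing in for whatever hash the plugin uses. The function name and the ID format are assumptions. Python's built-in `hash()` would not work here, since string hashing is salted per process.

```python
import hashlib

def sampled(request_id: str, one_in: int = 20) -> bool:
    """Deterministically select roughly 1 in `one_in` requests by hashing the ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % one_in == 0

# The decision is stable for a given ID and close to 5% across many IDs.
hits = sum(sampled(f"req-{i}") for i in range(20_000))
print(sampled("req-123") == sampled("req-123"), hits / 20_000)
```

Hashing the ID instead of rolling a random number means a retried or replayed request lands on the same side of the sample, which keeps captured rows reproducible.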

Curating, not dumping

Raw traffic is not an eval set. It's a pile. Most of it is repetitive and low-signal.

Each week we pull the captured rows and cluster them by embedding. We keep one representative per cluster, drop near-duplicates, and oversample the tail where the agent hit a tool-call error or returned an empty completion. That tail is where regressions hide. The result is around 400 cases a week, which we human-review down to maybe 250 before it joins the frozen suite.
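For a sense of what the curation pass does, here is a simplified sketch. It swaps real clustering for a greedy similarity threshold, and it keeps every error row instead of oversampling them. The row layout, the 0.95 threshold, and the function names are assumptions, not our pipeline.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def curate(rows, threshold=0.95):
    """Keep one representative per group of near-duplicates, plus every error row."""
    representatives = []
    for row in rows:
        if row["error"]:
            continue
        if all(cosine(row["embedding"], kept["embedding"]) < threshold
               for kept in representatives):
            representatives.append(row)
    tail = [row for row in rows if row["error"]]
    return representatives + tail

# Toy rows with 2-d embeddings; real embeddings have hundreds of dimensions.
rows = [
    {"id": 1, "embedding": [1.0, 0.0], "error": False},
    {"id": 2, "embedding": [0.99, 0.05], "error": False},  # near-duplicate of 1
    {"id": 3, "embedding": [0.0, 1.0], "error": False},
    {"id": 4, "embedding": [0.98, 0.1], "error": True},    # tail case, always kept
]
print([row["id"] for row in curate(rows)])   # [1, 3, 4]
```

The greedy pass compares each row against every representative kept so far, which is fine for a weekly batch of a few thousand rows but would need an index or proper clustering at larger scale.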

The frozen suite is now 1,900 cases and growing from real inputs. Last month it caught a 6-point drop in tool-call accuracy on our Qwen fine-tune that the old hand-written set sailed straight past, because no human had thought to write a case with three nested function calls.

You also get the metrics for free. Bifrost emits native Prometheus counters per model, so we already had latency and token distributions to weight the sampling toward expensive calls.

Bifrost vs LiteLLM vs Portkey

We looked at three gateways for this. Honest read:

Capability	Bifrost	LiteLLM	Portkey
Custom logging hook	Go plugin, in-process	Python callback / custom logger	Hosted logs + feedback API
Self-hosted, full data control	Yes	Yes	Self-host on paid tier
Language	Go	Python	Managed service
Overhead at high QPS	Low	Higher under load	Network hop to their edge
Setup friction	Write Go, compile	`pip install`, edit config	Fastest, UI out of the box

LiteLLM was the obvious pick for an ML team. It's Python, so our existing data tooling drops right in, and its provider list is larger. If you want a callback in 10 minutes, it wins. What ruled it out for us was throughput: under our agent burst traffic we hit limits with LiteLLM that Bifrost's Go path handled without tuning.

Portkey has the most polished logging UI and a real feedback API we didn't have to build. The tradeoff is that the data lives in their system unless you're on the self-hosted plan, and a customer contract ruled that out for us. If you want a managed dashboard and don't have a data-residency clause, Portkey is a reasonable call.

We picked Bifrost because the capture runs in-process with no extra network hop, the plugin is the same binary as the gateway, and the logging plus plugin surface gave us both metrics and raw payloads in one place.

Trade-offs and Limitations

Writing a plugin in Go is more work than a Python callback. If your team doesn't read Go, that's a real cost, and LiteLLM is the saner choice.

Sampling is lossy. At 5% we miss rare inputs, and we've had to bump specific routes to 100% capture when a customer reported a bug we couldn't reproduce.

PII is on you. The gateway sees everything, so the plugin has to redact before it writes, and we still run a scrubbing pass before any human looks at a row. Getting this wrong is worse than a stale eval set.

And capturing traffic doesn't grade it. You still need a scoring harness and labels. The gateway gives you the inputs and outputs. The judgment is the part you can't outsource to infrastructure.
