<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Elise Moreau</title>
    <description>The latest articles on DEV Community by Elise Moreau (@elise_moreau).</description>
    <link>https://dev.to/elise_moreau</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3864909%2F72833c18-30db-4456-82ee-e7d2016cc38f.jpg</url>
      <title>DEV Community: Elise Moreau</title>
      <link>https://dev.to/elise_moreau</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/elise_moreau"/>
    <language>en</language>
    <item>
      <title>Classifier-free guidance above 7.5 oversaturated our product renders</title>
      <dc:creator>Elise Moreau</dc:creator>
      <pubDate>Fri, 26 Jun 2026 05:36:29 +0000</pubDate>
      <link>https://dev.to/elise_moreau/classifier-free-guidance-above-75-oversaturated-our-product-renders-10aj</link>
      <guid>https://dev.to/elise_moreau/classifier-free-guidance-above-75-oversaturated-our-product-renders-10aj</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Classifier-free guidance above a scale of ~7.5 pushed our SDXL product renders into oversaturation and clipped highlights. Adding CFG rescale at 0.7 plus dynamic thresholding fixed it with no retraining.&lt;/p&gt;

&lt;p&gt;Around 18% of our automated product renders at Photoroom came back with blown-out highlights and oversaturated color once we raised the classifier-free guidance scale from 5.0 to 9.0 on our fine-tuned SDXL pipeline. The higher scale gave us sharper adherence to the prompt, which the catalog team wanted, but white backgrounds shifted toward grey-blue and metallic surfaces lost their specular detail. To be precise, the problem was not the prompt and not the fine-tune. It was the guidance arithmetic itself interacting with the noise schedule, and it is well documented if you know where to look.&lt;/p&gt;

&lt;h2&gt;
  
  
  What classifier-free guidance actually does
&lt;/h2&gt;

&lt;p&gt;Classifier-free guidance combines two model predictions at each denoising step: one conditioned on the prompt and one unconditioned. The sampler extrapolates along the vector between them, scaled by a guidance weight. A weight of 1.0 means no guidance, and weights of 5 to 9 are typical for SDXL. Higher weights increase prompt adherence at the cost of pushing latents outside the distribution the model was trained on.&lt;/p&gt;

&lt;p&gt;The method comes from Ho and Salimans in &lt;a href="https://arxiv.org/abs/2207.12598" rel="noopener noreferrer"&gt;Classifier-Free Diffusion Guidance&lt;/a&gt;. The formula at each step is straightforward: take the unconditional prediction, add the guidance scale times the difference between conditional and unconditional. The nuance here is that this extrapolation has no bound. As you raise the scale, the standard deviation of the guided prediction grows past the statistics the model learned, and that excess energy shows up in the decoded image as clipping.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why high guidance scales oversaturate
&lt;/h2&gt;

&lt;p&gt;The decoded pixel range is fixed: the VAE decoder produces values in roughly [-1, 1], which are then clipped and mapped to RGB. When guidance inflates the variance of the predicted noise, the resulting latents carry larger magnitudes than the VAE was trained to reconstruct cleanly. Bright regions saturate to pure white, and color channels drift because the per-channel means shift together. We measured this directly: at guidance 9.0 the per-image latent standard deviation was about 1.4x the standard deviation of the conditional prediction alone.&lt;/p&gt;

&lt;p&gt;This is the same failure mode the Imagen team described in &lt;a href="https://arxiv.org/abs/2205.11487" rel="noopener noreferrer"&gt;Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding&lt;/a&gt;, where high guidance weights produced saturated, unnatural images. Their answer was dynamic thresholding. A second, complementary fix came later from Lin and colleagues in &lt;a href="https://arxiv.org/abs/2305.08891" rel="noopener noreferrer"&gt;Common Diffusion Noise Schedules and Sample Steps are Flawed&lt;/a&gt;, which introduced guidance rescale to bring the guided prediction's variance back in line.&lt;/p&gt;
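&lt;p&gt;The standard-deviation ratio quoted in this section can be checked with a few lines of code. What follows is a minimal sketch, not our production measurement: it uses only the Python standard library, flat lists stand in for the noise-prediction tensors, and the helper name &lt;code&gt;guided_std_ratio&lt;/code&gt; and the toy data are assumptions made for illustration.&lt;/p&gt;

```python
from statistics import pstdev


def guided_std_ratio(cond, uncond, guidance_scale):
    """Return std(guided prediction) / std(conditional prediction) for one sample.

    cond and uncond are flat lists of floats standing in for the two
    noise predictions (hypothetical stand-ins for real tensors).
    """
    # Standard classifier-free guidance, applied elementwise.
    guided = [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]
    return pstdev(guided) / pstdev(cond)


# Toy data (assumption): the unconditional prediction is a damped copy
# of the conditional one, so the two are strongly correlated.
cond = [0.5, -1.0, 0.25, 1.5]
uncond = [0.8 * c for c in cond]

for scale in (1.0, 5.0, 9.0):
    print(scale, round(guided_std_ratio(cond, uncond, scale), 3))
```

&lt;p&gt;On this toy data the ratio is 1.0 at a scale of 1.0 and grows with the scale. Real predictions will give different numbers, so treat the output as a demonstration of the trend only. On tensors, the same ratio can be computed from per-sample &lt;code&gt;.std()&lt;/code&gt; reductions.&lt;/p&gt;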

&lt;h2&gt;
  
  
  Two fixes that stack: CFG rescale and dynamic thresholding
&lt;/h2&gt;

&lt;p&gt;CFG rescale corrects the standard deviation of the guided prediction toward the conditional prediction, then blends between the corrected and raw versions by a factor. We set that factor to 0.7 after a sweep. Here is the core of what we run inside the sampler loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;apply_cfg_rescale&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;noise_cond&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;noise_uncond&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;guidance_scale&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;guidance_rescale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# standard classifier-free guidance
&lt;/span&gt;    &lt;span class="n"&gt;noise_cfg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;noise_uncond&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;guidance_scale&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;noise_cond&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;noise_uncond&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# rescale variance back toward the conditional prediction (Lin et al. 2023)
&lt;/span&gt;    &lt;span class="n"&gt;std_cond&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;noise_cond&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;std&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;keepdim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;std_cfg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;noise_cfg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;std&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;keepdim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;noise_rescaled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;noise_cfg&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std_cond&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;std_cfg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# blend corrected and raw so detail is not fully flattened
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;guidance_rescale&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;noise_rescaled&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;guidance_rescale&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;noise_cfg&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Dynamic thresholding works at a different layer. At each step it predicts the clean sample, computes a high percentile of the absolute pixel values (we use the 99.5th), and clamps to that value before renormalizing. The two corrections address different symptoms. Rescale fixes the variance inflation; thresholding clamps the residual outliers that survive. Running both at guidance 9.0 brought our oversaturation rate from 18% to under 2% on a held-out set of 4,000 SKUs.&lt;/p&gt;
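&lt;p&gt;The thresholding step described above can be sketched in plain Python. This is an illustration under stated assumptions, not our sampler code: a flat list stands in for one predicted clean sample, the percentile uses a simple nearest-rank rule, and the floor of 1.0 on the threshold follows the Imagen paper's description, so samples already inside [-1, 1] pass through unchanged. The name &lt;code&gt;dynamic_threshold&lt;/code&gt; is hypothetical.&lt;/p&gt;

```python
import math


def dynamic_threshold(x0_pred, percentile=99.5):
    """Clamp a predicted clean sample to a per-sample percentile, then renormalize.

    x0_pred is a flat list of floats standing in for one predicted image.
    """
    magnitudes = sorted(abs(v) for v in x0_pred)
    # Nearest-rank percentile of the absolute values.
    rank = math.ceil(percentile / 100.0 * len(magnitudes))
    s = magnitudes[max(rank, 1) - 1]
    # Floor the threshold at 1.0 so in-range samples are left untouched.
    s = max(s, 1.0)
    # Clamp to [-s, s], then divide by s to land back in [-1, 1].
    return [max(-s, min(s, v)) / s for v in x0_pred]


# A low percentile is used here only so the clamp is visible on four values.
print(dynamic_threshold([0.5, -0.5, 3.0, -2.0], percentile=75))
```

&lt;p&gt;In a real pipeline the percentile would be computed per sample over the tensor's non-batch dimensions at every denoising step.&lt;/p&gt;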

&lt;h2&gt;
  
  
  How we chose the rescale factor
&lt;/h2&gt;

&lt;p&gt;We swept the rescale factor across 0.0, 0.3, 0.5, 0.7, and 1.0 and scored each batch on two axes. The first was a saturation metric: the fraction of pixels with channel values above 0.97 after decoding. The second was CLIP image-text similarity, so we did not trade away the prompt adherence we raised guidance to get. A factor of 1.0 fully matched the conditional variance but flattened contrast on glossy products. A factor of 0.0 left the original problem. The factor of 0.7 held CLIP similarity within 0.4% of the unrescaled run while cutting the saturated-pixel fraction by more than half.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;CFG rescale adds two standard deviation reductions and an elementwise blend per step. On our pipeline that is well under 1% of step latency, so cost is not the concern. The real trade-off is contrast. At rescale factors above 0.8 we saw glossy and metallic products lose specular punch, which matters for jewelry and electronics catalogs. Dynamic thresholding has its own edge case: on images that are genuinely meant to be bright and high-key, an aggressive percentile clamps legitimate highlights, so we tuned the percentile per product category rather than globally.&lt;/p&gt;

&lt;p&gt;There is also a simpler path we rejected. You can lower the guidance scale back to 5.0 and avoid the whole question, but you lose the prompt fidelity the catalog team asked for. The corrections let us keep a scale of 8.0 to 9.0 without the artifacts, which was the actual goal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where to go next
&lt;/h2&gt;

&lt;p&gt;If your renders saturate at high classifier-free guidance, measure the per-image latent standard deviation against the conditional-only prediction before reaching for retraining. The fix is almost always at the guidance arithmetic, not the weights. I would start with CFG rescale at 0.7, add dynamic thresholding only if outliers remain, and validate with a saturated-pixel metric alongside CLIP similarity so you do not silently trade away adherence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2207.12598" rel="noopener noreferrer"&gt;Classifier-Free Diffusion Guidance, Ho and Salimans, 2022&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2305.08891" rel="noopener noreferrer"&gt;Common Diffusion Noise Schedules and Sample Steps are Flawed, Lin et al., 2023&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2205.11487" rel="noopener noreferrer"&gt;Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding, Saharia et al., 2022&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/docs/diffusers/en/api/pipelines/stable_diffusion/stable_diffusion_xl" rel="noopener noreferrer"&gt;Diffusers guidance_rescale documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>computervision</category>
      <category>pytorch</category>
    </item>
    <item>
      <title>Async inference for long-running diffusion jobs through Bifrost</title>
      <dc:creator>Elise Moreau</dc:creator>
      <pubDate>Thu, 25 Jun 2026 14:53:19 +0000</pubDate>
      <link>https://dev.to/elise_moreau/async-inference-for-long-running-diffusion-jobs-through-bifrost-4lo7</link>
      <guid>https://dev.to/elise_moreau/async-inference-for-long-running-diffusion-jobs-through-bifrost-4lo7</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Async inference through Bifrost lets long-running diffusion jobs submit and poll with the &lt;code&gt;x-bf-async&lt;/code&gt; header, so SDXL batches survive the 60-second proxy timeouts that were killing our product-photo pipeline.&lt;/p&gt;

&lt;p&gt;A large product-variant batch in our pipeline at Photoroom takes 70 to 110 seconds to render across &lt;a href="https://arxiv.org/abs/2307.01952" rel="noopener noreferrer"&gt;SDXL&lt;/a&gt;, and our &lt;a href="https://docs.aws.amazon.com/elasticloadbalancing/latest/application/application-load-balancers.html#connection-idle-timeout" rel="noopener noreferrer"&gt;AWS ALB closes any connection idle past 60 seconds by default&lt;/a&gt;. When we increased batch sizes to cut per-image GPU cost, the synchronous calls began returning 504s before the diffusion step finished. Clients retried on the 504, which double-queued the same render and roughly doubled GPU load during peak hours. We moved the generation traffic behind &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt;, &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;the open-source AI gateway&lt;/a&gt; from Maxim AI, and switched the slow jobs to async inference so the HTTP connection no longer has to stay open for the full render.&lt;/p&gt;

&lt;h2&gt;
  
  
  What async inference means at an AI gateway
&lt;/h2&gt;

&lt;p&gt;Async inference at an AI gateway lets a client submit a generation job, receive a job ID, and poll for the result instead of holding one HTTP connection open for the whole compute. &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; exposes this with the &lt;code&gt;x-bf-async: true&lt;/code&gt; request header and an &lt;code&gt;x-bf-async-id&lt;/code&gt; returned on submission, so a 100-second diffusion call decouples from any proxy or load-balancer idle limit between the client and the gateway.&lt;/p&gt;

&lt;p&gt;The nuance here is that the GPU work does not get faster. What changes is the connection model. A synchronous request ties the success of a 100-second render to a TCP connection staying healthy for 100 seconds across two network hops. Async breaks that coupling: the submit call returns in milliseconds, and the poll calls are short and idempotent.&lt;/p&gt;

&lt;h2&gt;
  
  
  Submitting and polling jobs with x-bf-async
&lt;/h2&gt;

&lt;p&gt;The submit request looks like a normal call through the OpenAI-compatible endpoint, with one extra header. Bifrost runs as a &lt;a href="https://docs.getbifrost.ai/features/drop-in-replacement" rel="noopener noreferrer"&gt;drop-in replacement&lt;/a&gt;, so our existing image client only changed at the header layer, not the request body.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Submit a long-running generation job&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8080/v1/images/generations &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"x-bf-async: true"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "openai/gpt-image-1",
    "prompt": "studio product shot, white seamless background",
    "n": 8
  }'&lt;/span&gt;
&lt;span class="c"&gt;# Response returns: x-bf-async-id: job_8f2c...&lt;/span&gt;

&lt;span class="c"&gt;# Poll for the result with the returned job id&lt;/span&gt;
curl http://localhost:8080/v1/images/generations &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"x-bf-async-id: job_8f2c..."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To be precise about what we measured: the submit call returns before the model starts decoding, so the client thread is free in well under a second. The poll interval we settled on is two seconds, which keeps the queue worker cheap without adding noticeable tail latency on completion. We retired the old retry-on-504 logic entirely, because there is no long-held connection left to fail.&lt;/p&gt;
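&lt;p&gt;The polling side can be sketched as a small loop. This is a hedged outline, not our client code: &lt;code&gt;fetch_status&lt;/code&gt; is a hypothetical callable that performs the short poll request (for example, the second &lt;code&gt;curl&lt;/code&gt; call above) and returns &lt;code&gt;None&lt;/code&gt; while the job is still running. The actual response shape for an unfinished job is an assumption here, so adapt the check to what your gateway returns.&lt;/p&gt;

```python
import time


def poll_until_done(fetch_status, job_id, interval_s=2.0, max_polls=90, sleep=time.sleep):
    """Poll a submitted job until it returns a result or the poll budget runs out.

    fetch_status(job_id) is a hypothetical callable: it returns None while the
    job is still running and the result payload once the job has finished.
    """
    for _ in range(max_polls):
        result = fetch_status(job_id)
        if result is not None:
            return result
        # Each poll is short and idempotent, so waiting between polls is safe.
        sleep(interval_s)
    raise TimeoutError(f"job {job_id} not finished after {max_polls} polls")


# Usage with a fake status function that finishes on the third poll.
fake_responses = iter([None, None, {"images": 8}])
waits = []
result = poll_until_done(lambda job_id: next(fake_responses), "job_test", sleep=waits.append)
print(result, waits)
```

&lt;p&gt;Injecting &lt;code&gt;sleep&lt;/code&gt; keeps the loop testable without real delays. The two-second default mirrors the interval mentioned above, and &lt;code&gt;max_polls&lt;/code&gt; is an arbitrary illustrative budget.&lt;/p&gt;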

&lt;h2&gt;
  
  
  Tagging and observing jobs in flight
&lt;/h2&gt;

&lt;p&gt;Once jobs run detached, you need a way to attribute each one, otherwise a slow render is invisible until a customer complains. Bifrost forwards custom dimension headers prefixed &lt;code&gt;x-bf-dim-*&lt;/code&gt; into logs, traces, and Prometheus, so we tag every submission with the team and the experiment that created it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"x-bf-dim-team: catalog-enrichment"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"x-bf-dim-experiment: sdxl-batch-v3"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Those tags land in the &lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;observability&lt;/a&gt; layer, which Bifrost writes asynchronously at under 0.1ms overhead per request. We now graph time-to-completion per experiment instead of one aggregate, which is how we found that one prompt template was three times slower than the rest of the batch. For cost attribution across teams, we pair the dimension tags with scoped &lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;virtual keys&lt;/a&gt; so each business unit carries its own budget against the same provider pool.&lt;/p&gt;

&lt;p&gt;Routing also mattered here. The gateway unifies &lt;a href="https://docs.getbifrost.ai/providers/supported-providers/overview" rel="noopener noreferrer"&gt;20+ providers&lt;/a&gt; behind one endpoint, and the same async mechanism works whether the job lands on a self-hosted SDXL deployment or a hosted image model, so we can fail a batch over without rewriting the client.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;Async is the wrong default for fast paths. An interactive thumbnail that renders in 900ms gains nothing from submit-and-poll; you add a second round trip and a polling loop for a job that would have finished inside the original connection. We only route batches above roughly 30 seconds of expected render time through &lt;code&gt;x-bf-async&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The honest limitation on the Bifrost side is operational. Production deployments need Postgres backing the gateway, and you self-host the whole thing, which is real infrastructure to run and patch rather than a managed endpoint. The benchmark numbers are strong: Bifrost sustains &lt;a href="https://www.getmaxim.ai/bifrost/resources/benchmarks" rel="noopener noreferrer"&gt;5,000 RPS on a single instance at 100% success with about 11µs of overhead on a t3.xlarge&lt;/a&gt;, but those figures describe a node you operate. The ecosystem is also younger than that of established proxies like LiteLLM, so some integration paths have fewer community examples to copy from. For our team the trade was clearly worth it, since the alternative was tuning load-balancer timeouts per route and still losing jobs at the tail.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;Async inference did not make our diffusion models faster; it made long renders survivable by removing the dependency on a single long-lived connection. The &lt;code&gt;x-bf-async&lt;/code&gt; submit-and-poll model, plus dimension tags for attribution, turned a class of intermittent 504s into a measurable queue we can reason about. If you run image or video generation jobs that routinely cross your proxy timeout, this is the pattern I would try first.&lt;/p&gt;

&lt;p&gt;If you want to see async inference and the rest of the gateway against your own workload, book a demo: &lt;a href="https://getmaxim.ai/bifrost/book-a-demo" rel="noopener noreferrer"&gt;https://getmaxim.ai/bifrost/book-a-demo&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;Bifrost observability docs&lt;/a&gt; for the async write path and metrics sinks&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.getmaxim.ai/bifrost/resources/benchmarks" rel="noopener noreferrer"&gt;Bifrost benchmarks&lt;/a&gt; for the overhead and throughput figures&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2307.01952" rel="noopener noreferrer"&gt;SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/elasticloadbalancing/latest/application/application-load-balancers.html#connection-idle-timeout" rel="noopener noreferrer"&gt;AWS Application Load Balancer connection idle timeout&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>mlops</category>
      <category>computervision</category>
      <category>ai</category>
    </item>
    <item>
      <title>Best Tools to Secure Endpoint AI Usage in Enterprises</title>
      <dc:creator>Elise Moreau</dc:creator>
      <pubDate>Wed, 24 Jun 2026 17:14:28 +0000</pubDate>
      <link>https://dev.to/elise_moreau/best-tools-to-secure-endpoint-ai-usage-in-enterprises-2ee1</link>
      <guid>https://dev.to/elise_moreau/best-tools-to-secure-endpoint-ai-usage-in-enterprises-2ee1</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F251qmb0sfzee4g4erbuh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F251qmb0sfzee4g4erbuh.png" alt="Best Tools to Secure Endpoint AI Usage in Enterprises" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;As employees adopt desktop AI applications and coding agents, securing that usage is a new priority for enterprise security teams. This post compares the top tools for endpoint AI governance, with &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; and its Edge agent as the most comprehensive solution for visibility, control, and compliance.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The rapid adoption of powerful AI tools like Claude Desktop, ChatGPT, and various coding agents has created a significant security blind spot for many organizations. When employees use these applications on company devices without oversight, it results in "shadow AI," a category of ungoverned technology usage that exposes sensitive data and creates compliance risks. To address this, a new category of tools is emerging to provide endpoint AI governance. These tools aim to extend security policies from the datacenter to every employee's machine.&lt;/p&gt;

&lt;p&gt;Solutions in this space range from specialized agents that govern AI traffic to extensions of existing enterprise security platforms. The goal is to gain visibility into which AI tools are being used, by whom, and for what purpose, and to apply consistent security and compliance controls. For many, the ideal solution combines an AI gateway for central policy management with an endpoint agent for enforcement. &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt;, an &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;open-source AI gateway&lt;/a&gt; from Maxim AI, combined with its Bifrost Edge component, exemplifies this integrated approach.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Criteria for Evaluating Endpoint AI Security Tools
&lt;/h2&gt;

&lt;p&gt;When assessing tools to secure endpoint AI, engineering and security leaders should look for a core set of capabilities that move beyond simple application blocking.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Application &amp;amp; MCP Discovery:&lt;/strong&gt; The tool must first provide visibility. It should be able to inventory all AI-powered desktop applications, browser-based tools, and, critically, the &lt;a href="https://www.getmaxim.ai/bifrost/resources/mcp-gateway" rel="noopener noreferrer"&gt;Model Context Protocol (MCP) servers&lt;/a&gt; they connect to across the entire fleet of devices.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Granular Policy Enforcement:&lt;/strong&gt; Effective governance is more than an on/off switch. The best tools allow administrators to create and enforce nuanced policies, such as allowing an application but blocking it from using unapproved MCP servers or external tools.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Gateway Integration:&lt;/strong&gt; Endpoint policies should not exist in a vacuum. A tool that integrates with a central &lt;a href="https://www.getmaxim.ai/bifrost/resources/buyers-guide" rel="noopener noreferrer"&gt;AI gateway&lt;/a&gt; allows for a unified governance strategy. Budgets, rate limits, and provider routing rules set at the gateway should be inherited by the endpoint agent.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Guardrail Enforcement:&lt;/strong&gt; The solution must apply security guardrails directly on the endpoint. This includes detecting and redacting secrets, PII, and other sensitive data before a prompt is ever sent to an external model.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;MDM Deployment:&lt;/strong&gt; For enterprise-wide adoption, the tool must support silent deployment and configuration management through standard Mobile Device Management (MDM) platforms like Jamf, Microsoft Intune, and Kandji.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fq27pd0rpr8x41xkl5gs8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fq27pd0rpr8x41xkl5gs8.png" alt="A magnifying glass hovering over a computer screen, revealing hidden application icons and data connections that were pr" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Top Endpoint AI Governance Tools for 2026
&lt;/h2&gt;

&lt;p&gt;Based on the criteria above, here is an analysis of the leading tools designed to secure AI usage on enterprise endpoints.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Bifrost with Bifrost Edge
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; offers the most complete solution by combining a powerful AI gateway with a dedicated endpoint agent, Bifrost Edge. This architecture treats the gateway as the central control plane for policy, with Edge acting as the enforcement arm on every macOS, Windows, and Linux device. This model ensures that the same robust &lt;a href="https://www.getmaxim.ai/bifrost/resources/governance" rel="noopener noreferrer"&gt;governance&lt;/a&gt; and security rules apply everywhere.&lt;/p&gt;

&lt;p&gt;The Bifrost platform excels at discovery, providing a fleet-wide inventory of not just AI applications but also the MCP servers configured within them. This allows administrators to make informed decisions, such as approving Claude Code while denying a specific, risky tool it might be configured to use. Policies are enforced on the device, meaning a denied application or MCP server is blocked before any data leaves the machine.&lt;/p&gt;

&lt;p&gt;Because &lt;a href="https://docs.getbifrost.ai/edge/overview" rel="noopener noreferrer"&gt;Bifrost Edge&lt;/a&gt; inherits its configuration from the &lt;a href="https://docs.getbifrost.ai/overview" rel="noopener noreferrer"&gt;Bifrost AI gateway&lt;/a&gt;, every request from a desktop app or coding agent is subject to the same &lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;virtual keys&lt;/a&gt;, budgets, rate limits, and audit logging as server-side AI traffic. This unified approach simplifies compliance and closes the loop between infrastructure and endpoint security.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Enterprises that require a unified and comprehensive AI governance platform that extends from the data center to the endpoint. Its ability to manage not just applications but also the tools and MCP servers they connect to provides an unmatched level of granular control.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Zscaler
&lt;/h3&gt;

&lt;p&gt;Zscaler is a well-established cloud security platform that has extended its capabilities to address AI application usage. Through its Zero Trust Exchange, Zscaler can identify and control access to hundreds of AI and ML web applications. It provides visibility into which users are accessing which services and allows administrators to set policies to allow or block access based on risk.&lt;/p&gt;

&lt;p&gt;The platform's strengths are its deep integration into enterprise network infrastructure and its existing user base. For companies already using Zscaler for web filtering and data loss prevention (DLP), extending policies to cover AI applications is a natural step. It can inspect traffic for data exfiltration and apply tenant restrictions to services like ChatGPT. However, it is primarily focused on web traffic and application-level access control, with less specific functionality around governing the dynamic, tool-based interactions of modern AI agents via MCP.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Organizations already invested in the Zscaler ecosystem that need to quickly gain control over web-based AI application usage. It provides strong, familiar controls for DLP and access management.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Netskope
&lt;/h3&gt;

&lt;p&gt;Netskope is another leader in the Security Service Edge (SSE) and Cloud Access Security Broker (CASB) space. Its platform offers visibility and control over thousands of cloud services, including a wide array of AI applications. Netskope's solution allows security teams to coach users with real-time prompts, for instance, warning them against pasting sensitive data into a public AI chatbot.&lt;/p&gt;

&lt;p&gt;Netskope provides granular control, enabling policies that can differentiate between corporate and personal instances of AI services. It can also apply DLP policies to protect intellectual property and customer data. Like Zscaler, its primary focus is on managing access to cloud applications and protecting data in motion over the network. While effective for web-based AI, it may not offer the same depth of insight into the MCP servers and local tools used by developer-focused agents like Claude Code or Codex CLI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Companies seeking a CASB-centric approach to AI governance with a strong focus on user coaching and granular control over data flow to known cloud AI applications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fhsrmw2lflr579yg69y8g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fhsrmw2lflr579yg69y8g.png" alt="A central control tower (representing an AI gateway) sending out synchronized signals to a fleet of laptops (representin" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparative Analysis
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Bifrost with Bifrost Edge&lt;/th&gt;
&lt;th&gt;Zscaler&lt;/th&gt;
&lt;th&gt;Netskope&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary Approach&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AI Gateway + Endpoint Agent&lt;/td&gt;
&lt;td&gt;Secure Web Gateway / ZTNA&lt;/td&gt;
&lt;td&gt;Cloud Access Security Broker (CASB)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MCP Server Governance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes, deep discovery and control&lt;/td&gt;
&lt;td&gt;No, application-level focus&lt;/td&gt;
&lt;td&gt;No, application-level focus&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Unified Policy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes, endpoint inherits gateway rules&lt;/td&gt;
&lt;td&gt;Separate policy engine&lt;/td&gt;
&lt;td&gt;Separate policy engine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Endpoint Guardrails&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes, secrets, PII, custom regex&lt;/td&gt;
&lt;td&gt;DLP for network traffic&lt;/td&gt;
&lt;td&gt;DLP for network traffic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://docs.getbifrost.ai/edge/deployment-mdm" rel="noopener noreferrer"&gt;MDM-native (Jamf, Intune, etc.)&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Network integration, client connector&lt;/td&gt;
&lt;td&gt;API introspection, forward/reverse proxy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Open Source&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes, core gateway is open source&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Choosing the Right Tool
&lt;/h2&gt;

&lt;p&gt;Securing endpoint AI usage requires a shift in thinking from simply blocking applications to governing their behavior. While established network security platforms like Zscaler and Netskope provide essential controls for web-based AI services, they were not purpose-built for the unique challenges of AI agents and the tools they use.&lt;/p&gt;

&lt;p&gt;The integrated gateway-plus-agent model used by &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; provides a more robust and future-proof solution. By centralizing policy in an &lt;a href="https://www.getmaxim.ai/bifrost/enterprise" rel="noopener noreferrer"&gt;AI gateway&lt;/a&gt; and enforcing it everywhere with an endpoint agent, organizations can gain a complete picture of their AI footprint and apply consistent, granular controls. This approach not only mitigates the risks of shadow AI today but also provides the foundation to securely manage the next generation of autonomous AI agents.&lt;/p&gt;

&lt;p&gt;Teams evaluating solutions for endpoint AI security can &lt;a href="https://getmaxim.ai/bifrost/book-a-demo" rel="noopener noreferrer"&gt;request a Bifrost demo&lt;/a&gt; or review its &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;open-source repository&lt;/a&gt; to understand its architecture.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>governance</category>
      <category>enterprise</category>
    </item>
    <item>
      <title>Top 5 Enterprise AI Governance Tools in 2026</title>
      <dc:creator>Elise Moreau</dc:creator>
      <pubDate>Wed, 24 Jun 2026 17:09:25 +0000</pubDate>
      <link>https://dev.to/elise_moreau/top-5-enterprise-ai-governance-tools-in-2026-3jf3</link>
      <guid>https://dev.to/elise_moreau/top-5-enterprise-ai-governance-tools-in-2026-3jf3</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fpud9jkreh1mlgfx8imxf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fpud9jkreh1mlgfx8imxf.png" alt="Top 5 Enterprise AI Governance Tools in 2026" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;[A comparison of the leading AI governance tools for enterprises in 2026, covering security, compliance, and operational control. This review finds &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; to be the most comprehensive and performant solution for teams managing complex AI ecosystems.]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The rapid adoption of AI has introduced significant governance challenges for enterprises, from managing "shadow AI" usage on employee devices to ensuring production workloads comply with standards like SOC 2 and GDPR. An AI governance platform provides the necessary layer of control, offering visibility, security, and policy enforcement across all AI applications. These tools are now critical for managing costs, mitigating risks, and operating AI reliably at scale.&lt;/p&gt;

&lt;p&gt;This article evaluates the top five enterprise AI governance tools available today, comparing them on key criteria such as policy enforcement, endpoint governance, multi-provider support, and deployment flexibility. The analysis is based on publicly available documentation and technical specifications for each platform. For organizations seeking a complete solution that spans from the data center to the individual developer's machine, &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt;, an &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;open-source AI gateway&lt;/a&gt; from Maxim AI, emerges as the leading choice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Criteria for Evaluating AI Governance Tools
&lt;/h2&gt;

&lt;p&gt;Effective AI governance requires more than just a simple proxy. When evaluating solutions, engineering and security leaders should look for a comprehensive set of capabilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Policy Enforcement:&lt;/strong&gt; The ability to define and enforce fine-grained policies for access control, budgets, rate limits, and model routing.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Security and Compliance:&lt;/strong&gt; Integrated guardrails to detect and block sensitive data, secrets, or harmful content, along with immutable audit logs to meet compliance requirements.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Endpoint Governance:&lt;/strong&gt; The capacity to extend governance beyond the data center to the AI tools employees use daily on their laptops, such as desktop apps and browser-based AI.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Multi-Provider Support:&lt;/strong&gt; Seamless integration with a wide range of LLM providers (OpenAI, Anthropic, Google, AWS, and open-source models) through a unified API.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Deployment Flexibility:&lt;/strong&gt; Support for various deployment environments, including public cloud, in-VPC, on-premise, and air-gapped systems.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Performance:&lt;/strong&gt; Minimal latency overhead to ensure that governance does not become a performance bottleneck for production applications.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Top 5 AI Governance Platforms
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Bifrost
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; is a high-performance AI gateway that provides a unified control plane for AI traffic, combined with an endpoint agent that extends governance to every machine in an organization. Its comprehensive feature set makes it particularly well-suited for enterprises in regulated industries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Enterprises needing a single, integrated solution for both infrastructure and endpoint AI governance, with best-in-class performance and extensive deployment options.&lt;/p&gt;

&lt;p&gt;Bifrost's approach is unique in its two-part structure. The &lt;a href="https://docs.getbifrost.ai/overview" rel="noopener noreferrer"&gt;Bifrost AI gateway&lt;/a&gt; acts as the central policy engine. Here, administrators configure everything from &lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;virtual keys&lt;/a&gt; with specific budgets to complex routing rules and &lt;a href="https://docs.getbifrost.ai/enterprise/guardrails" rel="noopener noreferrer"&gt;security guardrails&lt;/a&gt;. The gateway is built for performance, adding only &lt;a href="https://www.getmaxim.ai/bifrost/resources/benchmarks" rel="noopener noreferrer"&gt;11 microseconds of overhead&lt;/a&gt; at 5,000 requests per second.&lt;/p&gt;

&lt;p&gt;The second component, &lt;strong&gt;Bifrost Edge&lt;/strong&gt;, addresses the growing problem of shadow AI. Edge is an agent that runs on macOS, Windows, and Linux devices and transparently routes all AI traffic from &lt;a href="https://docs.getbifrost.ai/edge/supported-applications" rel="noopener noreferrer"&gt;desktop apps, browser AI, and coding agents&lt;/a&gt; through the gateway. This ensures the same policies, from PII redaction to access controls, are enforced everywhere. Edge provides a fleet-wide inventory of all AI apps and MCP servers in use, allowing administrators to &lt;a href="https://docs.getbifrost.ai/edge/admin-approvals" rel="noopener noreferrer"&gt;approve or deny tools centrally&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fzrd8oba7e81iko5a2rom.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fzrd8oba7e81iko5a2rom.png" alt="A network diagram showing a central hub representing an AI gateway, with secure, organized data packets flowing from it " width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Unified Gateway and Endpoint:&lt;/strong&gt; The &lt;a href="https://docs.getbifrost.ai/edge/overview" rel="noopener noreferrer"&gt;AI Gateway + Bifrost Edge&lt;/a&gt; model provides a complete governance picture, covering both centrally managed services and employee tool usage.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Enterprise-Grade Security:&lt;/strong&gt; Features include native secrets detection, custom regex guardrails, and integrations with AWS Bedrock Guardrails and Azure Content Safety. &lt;a href="https://docs.getbifrost.ai/enterprise/audit-logs" rel="noopener noreferrer"&gt;Immutable audit logs&lt;/a&gt; support compliance with SOC 2, HIPAA, and GDPR.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Flexible Deployment:&lt;/strong&gt; Bifrost supports &lt;a href="https://docs.getbifrost.ai/enterprise/invpc-deployments" rel="noopener noreferrer"&gt;in-VPC and on-premise deployments&lt;/a&gt;, making it suitable for organizations with strict data residency requirements.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Extensive Integrations:&lt;/strong&gt; It supports over 20 LLM providers and offers a &lt;a href="https://docs.getbifrost.ai/features/drop-in-replacement" rel="noopener noreferrer"&gt;drop-in replacement&lt;/a&gt; for OpenAI, Anthropic, and other popular SDKs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Kong AI Gateway
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://konghq.com/products/kong-ai-gateway" rel="noopener noreferrer"&gt;Kong AI Gateway&lt;/a&gt; is an extension of the widely used Kong API Gateway. It focuses on providing a control layer for AI traffic within an existing enterprise API management strategy, offering features like prompt engineering, caching, and observability for AI services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Organizations already heavily invested in the Kong ecosystem for API management that want to extend similar controls to their AI workloads.&lt;/p&gt;

&lt;p&gt;Kong's strength lies in its deep integration with the rest of the Kong platform. It allows teams to apply familiar API management policies (like rate limiting, authentication, and traffic control) to LLM APIs. It also includes an "AI Proxy" plugin that provides a unified interface to multiple providers and enables features like prompt templating and response transformation directly at the gateway layer. However, it does not currently offer a dedicated solution for endpoint governance to manage shadow AI on employee devices.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Google Apigee
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://cloud.google.com/apigee" rel="noopener noreferrer"&gt;Google's Apigee API Management&lt;/a&gt; platform has been extended to manage and secure access to AI services, including Google's own Vertex AI and other third-party models. It functions as a centralized governance layer for enterprises building on Google Cloud.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Companies building their AI applications primarily within the Google Cloud ecosystem or those already using Apigee for general API management.&lt;/p&gt;

&lt;p&gt;Apigee allows organizations to create governed "AI proxies" that enforce access controls, manage traffic, and provide analytics for all AI API calls. This is useful for centralizing authentication and applying consistent policies across different AI services. While powerful for infrastructure-level governance, Apigee's scope is focused on API traffic and, like Kong, does not extend to direct endpoint governance of unmanaged employee applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Cloudflare AI Gateway
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.cloudflare.com/products/ai-gateway/" rel="noopener noreferrer"&gt;Cloudflare's AI Gateway&lt;/a&gt; is a product designed to add a layer of control and observability to AI applications. It sits between an application and the AI models it calls, providing caching, rate limiting, and analytics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams looking for a simple, managed solution to gain visibility and basic control over their AI API traffic, especially those already using Cloudflare's network services.&lt;/p&gt;

&lt;p&gt;As part of the Cloudflare ecosystem, the AI Gateway benefits from the company's global network, offering low-latency connections. It provides valuable insights through logs and analytics, helping teams understand usage patterns, track costs, and identify errors. Its features are geared more toward observability and simple controls rather than the deep policy enforcement and endpoint management required by large enterprises with complex compliance needs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fd2uei2xet5ux5e6axy83.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fd2uei2xet5ux5e6axy83.png" alt="A magnifying glass hovering over a stream of data flowing between a user's computer and a cloud server, highlighting and" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  5. LiteLLM
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.litellm.ai/" rel="noopener noreferrer"&gt;LiteLLM&lt;/a&gt; is a popular open-source library that provides a unified interface for calling over 100 LLM providers. It can be deployed as a proxy server to centralize API key management, routing, and logging.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Development teams and smaller organizations looking for a flexible, open-source tool to standardize LLM API access without the overhead of a full enterprise platform.&lt;/p&gt;

&lt;p&gt;LiteLLM excels at abstracting away the differences between various LLM APIs, allowing developers to switch between models like GPT-4 and Claude 3 with minimal code changes. When deployed as a proxy, it offers a UI for managing virtual keys, viewing logs, and setting budgets. While it provides a solid foundation for gateway functionality and is a strong tool in the open-source community, it lacks the comprehensive endpoint governance, advanced security guardrails, and high-availability clustering found in enterprise-focused solutions like &lt;a href="https://www.getmaxim.ai/bifrost/alternatives/litellm-alternatives" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;As AI becomes more embedded in enterprise operations, a robust governance strategy is no longer optional. While tools like Kong and Apigee extend traditional API management to AI, and LiteLLM offers a flexible open-source alternative, they primarily focus on governing known API traffic. The critical challenge of shadow AI—ungoverned usage on employee devices—remains a significant blind spot.&lt;/p&gt;

&lt;p&gt;Bifrost stands out by providing an integrated solution that addresses both infrastructure and endpoint governance. Its combination of a high-performance gateway and the Bifrost Edge agent delivers a complete visibility and control fabric, making it the most comprehensive choice for enterprises serious about securing and managing their entire AI ecosystem. For teams needing to balance innovation with security and compliance, a holistic approach is essential.&lt;/p&gt;

&lt;p&gt;Teams evaluating AI governance platforms can &lt;a href="https://getmaxim.ai/bifrost/book-a-demo" rel="noopener noreferrer"&gt;request a Bifrost demo&lt;/a&gt; or review the &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;open-source repository&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>aiovernment</category>
      <category>security</category>
      <category>enterprise</category>
      <category>devops</category>
    </item>
    <item>
      <title>Best Tools to Implement Governance and Security in Enterprise AI</title>
      <dc:creator>Elise Moreau</dc:creator>
      <pubDate>Wed, 24 Jun 2026 17:05:00 +0000</pubDate>
      <link>https://dev.to/elise_moreau/best-tools-to-implement-governance-and-security-in-enterprise-ai-581f</link>
      <guid>https://dev.to/elise_moreau/best-tools-to-implement-governance-and-security-in-enterprise-ai-581f</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fx3caqk8yg02bqdb4brni.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fx3caqk8yg02bqdb4brni.png" alt="Best Tools to Implement Governance and Security in Enterprise AI" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As artificial intelligence moves from prototype to production, the challenge for enterprise leaders has shifted from "how do we build this?" to "how do we control this?" In 2026, AI governance is no longer an optional ethical consideration; it is an operational requirement driven by evolving regulations like the EU AI Act and frameworks such as the NIST AI Risk Management Framework (RMF).&lt;/p&gt;

&lt;p&gt;Effective governance requires more than just visibility. It demands enforced control across access layers, data surfaces, and agentic tool usage. The current landscape is crowded, but organizations are increasingly consolidating their strategy around infrastructure-level controls that can manage risk at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Pillars of Enterprise AI Security
&lt;/h2&gt;

&lt;p&gt;Effective AI governance in an enterprise environment relies on three non-negotiable capabilities: &lt;strong&gt;visibility&lt;/strong&gt; into where AI is used, &lt;strong&gt;control&lt;/strong&gt; over who can access specific models and tools, and &lt;strong&gt;enforcement&lt;/strong&gt; of policies across identity and integration layers.&lt;/p&gt;

&lt;p&gt;Most governance tools stop at discovery—they identify risks but fail to prevent them. To move beyond mere observation, organizations need infrastructure that treats security as an architectural requirement rather than a bolt-on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fovd7jg96c4evl65e624y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fovd7jg96c4evl65e624y.png" alt="A high-tech digital control center wall displaying real-time data flows and traffic filtering nodes, symbolizing visibil" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Leading Tools for AI Governance and Security
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Bifrost by Maxim AI
&lt;/h3&gt;

&lt;p&gt;Bifrost has emerged as a leader in infrastructure-level AI governance. By operating as a high-performance AI gateway, it centralizes policy enforcement for LLM routing, access management, and cost control. Its use of "Virtual Keys" allows teams to issue granular, budget-limited access tokens to different business units, so that each unit's limits are enforced by policy rather than through manual key management. Beyond basic routing, it provides MCP (Model Context Protocol) governance, allowing administrators to filter which tools agents can execute at the infrastructure level.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Microsoft Purview
&lt;/h3&gt;

&lt;p&gt;For organizations already embedded in the Microsoft ecosystem, Purview provides robust data governance and compliance capabilities. It excels at discovering and cataloging data across multi-cloud and SaaS environments, which is essential for ensuring that sensitive information does not inadvertently leak into unauthorized AI training sets or LLM prompts.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. IBM Watsonx.governance
&lt;/h3&gt;

&lt;p&gt;IBM’s platform focuses on the lifecycle management of AI models. It is designed for enterprises that need formal risk management, providing tools to track model drift, bias, and compliance with internal standards throughout the model's production lifespan. It is particularly strong for organizations that require certifiable compliance, often aligning with ISO/IEC 42001 standards.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Credo AI
&lt;/h3&gt;

&lt;p&gt;Credo AI differentiates itself through lifecycle governance that automates compliance tasks. It helps teams integrate responsible AI requirements directly into their development workflows, making it easier for large engineering teams to follow policy guidelines without slowing down their release cycles.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fwv94tm039an49ysu3u1g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fwv94tm039an49ysu3u1g.png" alt="Abstract 3D structures symbolizing building blocks or pillars fitting together into a stable, secure foundation under a " width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Integrating Global Frameworks
&lt;/h2&gt;

&lt;p&gt;Successfully deploying these tools requires alignment with established industry frameworks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;NIST AI RMF:&lt;/strong&gt; A voluntary but highly influential framework that organizes governance into four core functions: Govern, Map, Measure, and Manage. It is widely used as a reference for managing AI risk.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;ISO/IEC 42001:&lt;/strong&gt; The first certifiable international standard for AI management systems. It focuses on organizational controls, risk assessments, and documentation, making it attractive for regulated industries that require formal validation.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;EU AI Act:&lt;/strong&gt; A mandatory, risk-based regulatory regime that imposes strict obligations on high-risk AI applications.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Rather than treating these as separate checklists, enterprises are increasingly adopting a "unified approach," using automation platforms to map NIST principles to ISO controls. This strategy allows organizations to satisfy multiple regulatory requirements simultaneously without duplicating compliance efforts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Strategic Recommendations for Implementation
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Prioritize Enforcement over Discovery:&lt;/strong&gt; Select tools that can block unauthorized actions (e.g., stopping a prompt that leaks PII or blocking an unsanctioned tool call) rather than tools that only send email alerts after a policy violation.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Adopt a Zero-Trust Model:&lt;/strong&gt; Assume no input is safe and no agent inherits blanket permissions. Every operation, from a simple LLM query to a complex agentic tool call, should require explicit policy-based authorization.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Standardize at the Infrastructure Level:&lt;/strong&gt; Tools like AI gateways provide a single policy layer that works regardless of which model or provider is being used. This prevents "governance drift," where different teams use different models with inconsistent security postures.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Automate Audit Trails:&lt;/strong&gt; Ensure that every interaction, including tool execution and data access, is logged with sufficient context to satisfy auditors. Immutable audit logs are essential for meeting SOC 2, HIPAA, and GDPR requirements.&lt;/li&gt;
&lt;/ol&gt;
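
&lt;p&gt;As a rough illustration of the zero-trust and audit-trail recommendations above, the sketch below denies every tool call that is not explicitly granted and records each decision. The agent names, tool names, and log format are hypothetical and do not reflect any specific product; an in-memory list also stands in for what would need to be tamper-resistant storage in practice.&lt;/p&gt;

```python
import json

# Hypothetical policy: the agent names, tool names, and log format are
# assumptions made for this example only.
ALLOWED_TOOLS = {
    "support-agent": {"search_docs", "create_ticket"},
    "build-agent": {"run_tests"},
}

# Stand-in for an audit store; a plain list is NOT immutable.
audit_log = []


def authorize(agent, tool):
    """Deny by default: allow only tools explicitly granted to the agent."""
    allowed = tool in ALLOWED_TOOLS.get(agent, set())
    # Record every decision, whether it was allowed or denied.
    audit_log.append(json.dumps({"agent": agent, "tool": tool, "allowed": allowed}))
    return allowed


print(authorize("support-agent", "search_docs"))
print(authorize("support-agent", "delete_database"))
print(authorize("unknown-agent", "run_tests"))
print(len(audit_log))
```

&lt;p&gt;The key design choice is the default: an unknown agent or an unlisted tool is refused without needing a matching deny rule, and the refusal is still logged.&lt;/p&gt;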

&lt;p&gt;As AI agents become more autonomous, they will continue to introduce new attack surfaces. By focusing on infrastructure-level governance and integrating established frameworks into daily workflows, enterprises can harness the power of agentic AI while maintaining a secure and compliant environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://www.nist.gov/itl/ai-risk-management-framework" rel="noopener noreferrer"&gt;AI Risk Management Framework | NIST&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://www.getmaxim.ai/blog/best-tools-for-ai-governance-2026/" rel="noopener noreferrer"&gt;Best 5 tools for AI governance in 2026 - Maxim AI&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://www.optro.ai/blog/integrate-nist-ai-rmf-iso-42001" rel="noopener noreferrer"&gt;How to integrate NIST AI RMF and ISO 42001 - Optro&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://www.getmaxim.ai/blog/bifrost-vs-truefoundry/" rel="noopener noreferrer"&gt;Bifrost vs TrueFoundry: Open-Source vs Enterprise AI Gateway&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>enterprise</category>
      <category>governance</category>
    </item>
    <item>
      <title>Top 5 LLM Gateways in 2026</title>
      <dc:creator>Elise Moreau</dc:creator>
      <pubDate>Wed, 24 Jun 2026 17:02:50 +0000</pubDate>
      <link>https://dev.to/elise_moreau/top-5-llm-gateways-in-2026-3noi</link>
      <guid>https://dev.to/elise_moreau/top-5-llm-gateways-in-2026-3noi</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fcdi3bqezearu4h34uzmq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fcdi3bqezearu4h34uzmq.png" alt="Top 5 LLM Gateways in 2026" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As enterprise adoption of generative AI accelerates, teams are moving away from direct, hard-coded provider integrations. Relying on single-model APIs creates significant operational risks, including fragmented authentication, inconsistent rate limits, and cascading failures during provider outages. To address these challenges, engineering teams are increasingly deploying LLM gateways as a dedicated middleware layer to unify routing, governance, and observability.&lt;/p&gt;

&lt;p&gt;An LLM gateway acts as a reverse proxy, sitting between your application and various model providers. It provides a standardized interface—typically OpenAI-compatible—that allows you to switch underlying models or providers without updating your application code. Beyond simple proxying, modern gateways handle critical production requirements like automatic failover, cost attribution, and security guardrails.&lt;/p&gt;
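
&lt;p&gt;To make the failover idea concrete, here is a minimal sketch of how a gateway might try providers in order and return the first successful response. The provider names, the &lt;code&gt;ProviderError&lt;/code&gt; type, and the call signature are all hypothetical and are not taken from any real gateway's API.&lt;/p&gt;

```python
# Hypothetical sketch: provider names and the call signature are illustrative,
# not the API of any real gateway.


class ProviderError(Exception):
    """Raised by a provider callable when a request fails."""


def route_with_failover(prompt, providers):
    """Try each (name, call) pair in order and return the first success."""
    errors = []
    for name, call in providers:
        try:
            return {"provider": name, "text": call(prompt)}
        except ProviderError as exc:
            # Record the failure and fall through to the next provider.
            errors.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))


def flaky_primary(prompt):
    """Simulated provider that is always down."""
    raise ProviderError("simulated outage")


def stable_backup(prompt):
    """Simulated provider that echoes the prompt back."""
    return "echo: " + prompt


result = route_with_failover(
    "hello",
    [("primary", flaky_primary), ("backup", stable_backup)],
)
print(result)
```

&lt;p&gt;Because the application only calls &lt;code&gt;route_with_failover&lt;/code&gt;, the provider list can change without touching application code, which is the property the standardized interface is meant to provide.&lt;/p&gt;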

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fh34gjp9grnuymw2d2e7m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fh34gjp9grnuymw2d2e7m.png" alt="A cross-section of a high-speed data pipe with diverse, color-coded energy streams flowing through a central processing " width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Evaluating the Gateway Landscape
&lt;/h3&gt;

&lt;p&gt;When choosing a gateway for 2026 production workloads, teams should prioritize the following criteria:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Latency Overhead:&lt;/strong&gt; In agentic workflows or real-time applications, the gateway must add near-zero overhead. High-performance gateways typically contribute less than 20 milliseconds of latency, with specialized Go-based implementations reaching microsecond-level overhead.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Provider Coverage:&lt;/strong&gt; A robust gateway should support a broad catalog of models from major providers (e.g., OpenAI, Anthropic, Google, AWS, Azure) to prevent vendor lock-in.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Operational Control:&lt;/strong&gt; The decision to self-host versus using a managed SaaS often depends on data residency requirements and compliance mandates, such as HIPAA or GDPR.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Governance Features:&lt;/strong&gt; Enterprise readiness requires granular control, including virtual API keys, team-based budget tracking, and real-time guardrails to prevent credential leakage or prompt injection.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fw0b6t16svnrirffboh7h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fw0b6t16svnrirffboh7h.png" alt="A modular infrastructure stack displaying layers of security, routing, and data flow indicators." width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Top 5 LLM Gateways
&lt;/h3&gt;

&lt;p&gt;Based on current production trends and infrastructure benchmarks, these are the five leading LLM gateways for 2026:&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Bifrost
&lt;/h4&gt;

&lt;p&gt;Bifrost stands out as the high-performance option for teams prioritizing scalability and governance. Built in Go, it is engineered for production workloads requiring extreme efficiency, delivering roughly 11 microseconds of overhead even at 5,000+ requests per second. It is particularly well-suited for regulated industries that require air-gapped or VPC-based deployments, providing an enterprise-grade control plane that manages access, budgets, and security across multi-cloud environments.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. LiteLLM
&lt;/h4&gt;

&lt;p&gt;LiteLLM is the industry standard for developer-first, open-source proxying. Because it is Python-based and supports 100+ providers behind a familiar interface, it is a common starting point for teams prototyping AI features. While it offers excellent flexibility for self-hosting, teams should be mindful of its concurrency limitations at scale, which may necessitate more complex infrastructure as request volume grows beyond 500 requests per second.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Kong AI Gateway
&lt;/h4&gt;

&lt;p&gt;For organizations that have already standardized their API management on the Kong ecosystem, the Kong AI Gateway is a logical extension. It leverages Kong's proven plugin architecture to add AI-specific capabilities like prompt introspection and token-based rate limiting to existing API traffic. It is an effective choice for enterprise teams that need to treat AI services as just another microservice within their existing governance and security stack.&lt;/p&gt;

&lt;h4&gt;
  
  
  4. Cloudflare AI Gateway
&lt;/h4&gt;

&lt;p&gt;Cloudflare’s offering excels for teams already embedded in the Cloudflare edge ecosystem. By leveraging their global network, it provides low-latency caching and edge-based security. It is essentially a "zero-ops" proxy that requires minimal configuration, making it ideal for teams that want to offload infrastructure management entirely to a globally distributed platform.&lt;/p&gt;

&lt;h4&gt;
  
  
  5. OpenRouter
&lt;/h4&gt;

&lt;p&gt;OpenRouter functions as a managed gateway and marketplace, providing immediate access to over 300 models through a single, unified API. It is a powerful choice for developers exploring a wide array of models quickly, as it eliminates the need to manage individual provider billing accounts. While it is less focused on deep enterprise governance or self-hosted compliance, its ability to route across free and paid model tiers makes it a popular tool for benchmarking and rapid experimentation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Maxim AI: Top 5 LLM Gateways in 2026 Comparison&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://litellm.ai/" rel="noopener noreferrer"&gt;TechSy: 8 LLM Gateways Ranked for 2026&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://docs.litellm.ai/" rel="noopener noreferrer"&gt;LiteLLM: Enterprise Infrastructure Overview&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://konghq.com/products/kong-ai-gateway" rel="noopener noreferrer"&gt;Kong: Secure, Scalable AI Gateway Connectivity&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://developers.cloudflare.com/ai-gateway/" rel="noopener noreferrer"&gt;Cloudflare: AI Gateway Features and Capabilities&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>backend</category>
      <category>devops</category>
    </item>
    <item>
      <title>Top 5 MCP Gateways in 2026: An Architectural Comparison</title>
      <dc:creator>Elise Moreau</dc:creator>
      <pubDate>Wed, 24 Jun 2026 16:52:52 +0000</pubDate>
      <link>https://dev.to/elise_moreau/top-5-mcp-gateways-in-2026-an-architectural-comparison-43gf</link>
      <guid>https://dev.to/elise_moreau/top-5-mcp-gateways-in-2026-an-architectural-comparison-43gf</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fwo20zc4s200sxn9cvoab.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fwo20zc4s200sxn9cvoab.png" alt="Top 5 MCP Gateways in 2026: An Architectural Comparison" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Compare the leading Model Context Protocol (MCP) gateways for securing and scaling agentic AI. &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; leads this architectural evaluation.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;An MCP gateway is a centralized infrastructure layer that routes, authenticates, and governs connections between AI applications and Model Context Protocol (MCP) servers. In production AI agent systems, letting every agent connect directly to backend tools creates significant security, compliance, and latency risks. &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt;, a high-performance &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;open-source AI gateway&lt;/a&gt; written in Go, addresses these architectural challenges by unifying model routing and tool execution under a single secure control plane. This comparison evaluates the top five MCP gateways in 2026, outlining how each handles developer workflows, access controls, and enterprise scalability.&lt;/p&gt;

&lt;p&gt;The rapid adoption of the &lt;a href="https://modelcontextprotocol.io/" rel="noopener noreferrer"&gt;Model Context Protocol&lt;/a&gt; has standardized how LLMs communicate with external environments, turning static models into autonomous agents. However, as the ecosystem matures, managing tool access across decentralized environments requires dedicated platform engineering solutions. A dedicated &lt;a href="https://www.getmaxim.ai/bifrost/resources/mcp-gateway" rel="noopener noreferrer"&gt;MCP gateway&lt;/a&gt; helps teams secure and orchestrate tool execution without modifying underlying client configurations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Criteria for Evaluating MCP Gateways
&lt;/h2&gt;

&lt;p&gt;When moving AI agent systems from local sandboxes into production, platform engineers must look beyond basic tool-connectivity features. A production-ready gateway serves as a secure proxy between autonomous agents and internal databases, filesystems, and third-party APIs. Without a unified gateway, each desktop client or cloud server handles credentials individually, creating a highly fragmented and insecure environment. The following evaluation framework represents the core architectural dimensions required to run MCP at scale:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Access Control Depth:&lt;/strong&gt; Production environments require granular permissions. Gateways must be evaluated on whether they enforce permissions at the server, tool, or parameter level, rather than adopting an all-or-nothing approach.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Connection Resilience:&lt;/strong&gt; The gateway must support diverse transport protocols, including standard input/output (stdio), HTTP, and Server-Sent Events (SSE), while handling transient network failures gracefully via automatic retry logic.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Resource Efficiency:&lt;/strong&gt; Exposing massive tool catalogs directly to LLMs consumes substantial context window space and increases costs. Platform teams require gateways that optimize prompt structures and tool definitions before payloads are dispatched to models.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Enterprise Administration:&lt;/strong&gt; Features like Single Sign-On (SSO) integration, role-based access controls, immutable audit trails, and multi-node high availability are non-negotiable for regulated industries.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  1. Bifrost
&lt;/h2&gt;

&lt;p&gt;The open-source &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost AI gateway&lt;/a&gt; represents the standard for high-concurrency tool execution and multi-provider model routing. Engineered in Go, Bifrost is built to unify model access and tool execution under a single control plane. The gateway acts as both an MCP client, connecting to any external tool server, and an &lt;a href="https://docs.getbifrost.ai/mcp/gateway" rel="noopener noreferrer"&gt;MCP server&lt;/a&gt; to expose managed tools directly to developer environments.&lt;/p&gt;

&lt;p&gt;Connecting tools is straightforward, as Bifrost supports three connection protocols: &lt;a href="https://docs.getbifrost.ai/mcp/connecting-to-servers" rel="noopener noreferrer"&gt;STDIO, HTTP, and SSE&lt;/a&gt;. Local CLI utilities run through stdio pipelines, while remote web services communicate via standard HTTP or Server-Sent Events (SSE) connections. To manage credentials at scale, Bifrost supports five distinct &lt;a href="https://docs.getbifrost.ai/mcp/auth/overview" rel="noopener noreferrer"&gt;MCP authentication&lt;/a&gt; modes, including static headers, standard OAuth 2.0, and per-user lazy authentication workflows that prompt users for authorization links dynamically.&lt;/p&gt;

&lt;p&gt;A core challenge of typical tool-calling architectures is the sheer volume of tokens consumed when exposing massive tool catalogs to an LLM. Bifrost addresses this with &lt;strong&gt;Code Mode&lt;/strong&gt;, a feature where the model writes Starlark code (a Python dialect) to execute and orchestrate multiple tools inside a secure local sandbox. According to published &lt;a href="https://docs.getbifrost.ai/mcp/code-mode" rel="noopener noreferrer"&gt;benchmarks&lt;/a&gt;, Code Mode reduces input tokens by up to 92.8% and decreases estimated costs by 92.2% in large MCP deployments compared to standard iterative JSON-RPC roundtrips. Additionally, Bifrost includes &lt;a href="https://docs.getbifrost.ai/mcp/agent-mode" rel="noopener noreferrer"&gt;Agent Mode&lt;/a&gt;, an autonomous execution loop where the gateway handles permitted tool executions and feeds results back to the model automatically.&lt;/p&gt;

&lt;p&gt;To safeguard production environments, administrators can set strict &lt;a href="https://docs.getbifrost.ai/features/governance/mcp-tools" rel="noopener noreferrer"&gt;MCP tool filtering&lt;/a&gt; policies per virtual key. This allows organizations to define granular permissions, ensuring that specific virtual keys can only execute pre-approved tools while blocking all unauthorized actions. This capability is fully integrated into Bifrost's broader &lt;a href="https://docs.getbifrost.ai/features/governance" rel="noopener noreferrer"&gt;LLM gateway governance&lt;/a&gt; suite, which enforces hierarchical budgets, rate limits, and provider failover rules. In terms of performance, Bifrost maintains exceptional efficiency, adding just &lt;a href="https://www.getmaxim.ai/bifrost/resources/benchmarks" rel="noopener noreferrer"&gt;11 microseconds of overhead&lt;/a&gt; per request at 5,000 requests per second in sustained benchmarking.&lt;/p&gt;
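&lt;p&gt;The idea of per-key tool filtering can be sketched generically as a deny-by-default allow-list. This is not Bifrost's configuration format or API; the key names, tool names, and both helper functions are hypothetical and exist only to show the shape of the check.&lt;/p&gt;

```python
# Hypothetical policy table: virtual key mapped to the tools it may call.
ALLOWED_TOOLS = {
    "vk-support-bot": {"search_tickets", "read_kb_article"},
    "vk-ops-agent": {"search_tickets", "restart_service"},
}


def filter_tools(virtual_key, tool_catalog):
    """Return only the catalog entries this key is allowed to see."""
    allowed = ALLOWED_TOOLS.get(virtual_key, set())  # unknown key sees nothing
    return [tool for tool in tool_catalog if tool["name"] in allowed]


def authorize_call(virtual_key, tool_name):
    """Deny by default; raise if the key is not cleared for this tool."""
    if tool_name not in ALLOWED_TOOLS.get(virtual_key, set()):
        raise PermissionError(f"{virtual_key} may not call {tool_name}")


catalog = [
    {"name": "search_tickets"},
    {"name": "read_kb_article"},
    {"name": "restart_service"},
]
visible = filter_tools("vk-support-bot", catalog)
```

&lt;p&gt;Filtering the catalog before it reaches the model and checking again at call time are two separate steps here on purpose: the first keeps unapproved tools out of the prompt, the second guards against a model naming a tool it was never shown.&lt;/p&gt;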

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Enterprise engineering teams deploying mission-critical agentic systems that require high-performance tool routing, advanced cost optimization, granular key-based governance, and flexible deployment models like private VPC or air-gapped environments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F3swuk26jl0use4kbgwzu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F3swuk26jl0use4kbgwzu.png" alt="An elegant server cluster rack in an ultra-modern data center, with glowing blue and teal optical lines showcasing paral" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Docker MCP Gateway
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://docs.docker.com/desktop/mcp-toolkit/" rel="noopener noreferrer"&gt;Docker MCP Gateway&lt;/a&gt; acts as an orchestrator and localized proxy for Model Context Protocol servers. It runs each MCP server inside an isolated Docker container with strictly restricted system privileges, network configurations, and resource limits. This containerized design solves the risk of running untrusted, locally installed scripts directly on a developer's machine.&lt;/p&gt;

&lt;p&gt;The toolkit integrates directly with Docker Desktop, allowing developers to manage server lifecycles, configure credentials, and organize tools into project-specific profiles. Organizations can curate internal catalogs of approved servers, complete with cryptographic signatures. Developers can verify these container image signatures during runtime using the &lt;code&gt;docker mcp gateway run --verify-signatures&lt;/code&gt; command.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Local developer environments, desktop client setups, and software teams seeking workstation-level containment and rapid prototyping with pre-packaged catalog tools.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Microsoft MCP Gateway
&lt;/h2&gt;

&lt;p&gt;For teams deploying large-scale agent networks in Kubernetes environments, the &lt;a href="https://github.com/microsoft/mcp-gateway" rel="noopener noreferrer"&gt;Microsoft MCP Gateway&lt;/a&gt; is a purpose-built open-source reverse proxy and management layer. It is built to run on Azure Kubernetes Service (AKS) and integrates natively with Azure Container Registry (ACR) and Microsoft Entra ID. This allows platform engineers to apply enterprise-grade Single Sign-On (SSO) and Role-Based Access Control (RBAC) to tool execution pipelines.&lt;/p&gt;

&lt;p&gt;The gateway manages stateful, session-aware routing to remote tool servers while keeping client applications completely decoupled from backend infrastructure. It features explicit tool allow-lists, rate-limiting policies, and application-layer sandboxing. These controls are explicitly mapped to protect against the OWASP Top 10 vulnerabilities for LLM and agentic systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Platform teams building native Windows and Azure agent integrations on top of enterprise Kubernetes clusters.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Envoy AI Gateway
&lt;/h2&gt;

&lt;p&gt;The CNCF-backed &lt;a href="https://github.com/envoyproxy/ai-gateway" rel="noopener noreferrer"&gt;Envoy AI Gateway&lt;/a&gt; includes native support for the Model Context Protocol, extending Envoy's industry-standard proxy capabilities to agentic workloads. It routes tool requests via an &lt;code&gt;MCPRoute&lt;/code&gt; declarative API, aggregating several independent backend tool servers into a unified client endpoint. This design eliminates the need to configure multiple client-to-server connections on individual devices.&lt;/p&gt;

&lt;p&gt;Envoy handles the Streamable HTTP Transport specified in the June 2025 MCP standard, processing persistent stateful connections and JSON-RPC messaging. The gateway enforces centralized authentication, OAuth flows, and upstream API key injection. It also inherits Envoy Proxy's battle-tested networking layer, providing robust circuit breaking, dynamic load balancing, and OpenTelemetry logging.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; DevOps and site reliability engineers looking for an ingress-centric, CNCF-aligned control plane to manage agentic API traffic.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Lunar.dev MCPX
&lt;/h2&gt;

&lt;p&gt;The open-source &lt;a href="https://github.com/TheLunarCompany/lunar" rel="noopener noreferrer"&gt;Lunar.dev MCPX&lt;/a&gt; gateway functions as a zero-code aggregator and control plane for managing agentic API traffic. It runs as a self-contained Docker container, spawning and managing separate MCP servers within its local environment. This setup simplifies connections by exposing a unified endpoint to AI agents.&lt;/p&gt;

&lt;p&gt;MCPX places a heavy emphasis on security and data sanitization. It features built-in Data Loss Prevention (DLP) filters that inspect prompts and responses to detect and block API keys or personally identifiable information (PII). By routing tool-related API calls through the core Lunar Proxy, the system tracks real-time traffic volume, token usage, and API endpoint errors in a central dashboard.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Security and compliance teams requiring dedicated data loss prevention and sensitive data sanitization for third-party API tool calls.&lt;/p&gt;




&lt;h2&gt;
  
  
  Feature Comparison of the Top MCP Gateways
&lt;/h2&gt;

&lt;p&gt;Evaluating these systems requires looking at where they execute and how they handle security. The following matrix contrasts the five gateways across deployment environments, core isolation mechanics, and resource optimizations:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Gateway&lt;/th&gt;
&lt;th&gt;Primary Environment&lt;/th&gt;
&lt;th&gt;Core Security Mechanism&lt;/th&gt;
&lt;th&gt;Token/Cost Optimization&lt;/th&gt;
&lt;th&gt;License&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bifrost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multi-Cloud, Private VPC, On-Prem&lt;/td&gt;
&lt;td&gt;Virtual Keys, Tool Filtering, TLS&lt;/td&gt;
&lt;td&gt;Yes (Code Mode Sandbox)&lt;/td&gt;
&lt;td&gt;Open Source (Apache 2.0 / Enterprise)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Docker MCP Gateway&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Local Workstations, Desktop&lt;/td&gt;
&lt;td&gt;Container Sandboxing, Image Signatures&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Open Source / Commercial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Microsoft MCP Gateway&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Kubernetes (AKS), Hybrid&lt;/td&gt;
&lt;td&gt;Entra ID, Capability Sandboxes, RBAC&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Open Source (MIT)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Envoy AI Gateway&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cloud-Native Kubernetes&lt;/td&gt;
&lt;td&gt;OAuth, Upstream Auth, Route Filtering&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Open Source (Apache 2.0)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Lunar.dev MCPX&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Local Docker, Cloud-Native&lt;/td&gt;
&lt;td&gt;DLP Safeguards, Tool Access Control&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Open Source (Apache 2.0)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Using &lt;a href="https://www.getmaxim.ai/bifrost/resources/mcp-gateway" rel="noopener noreferrer"&gt;agent-to-tool middleware&lt;/a&gt; reduces the structural friction associated with deploying autonomous workflows in production.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F8gwsex8cf4dao5kof8qq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F8gwsex8cf4dao5kof8qq.png" alt="A clean, modern comparison matrix interface represented by physical translucent plates stacked in layers, each plate sho" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Advanced Considerations for Scale: High Availability and Security
&lt;/h2&gt;

&lt;p&gt;Deploying Model Context Protocol (MCP) systems in production introduces unique architectural challenges compared to standard web APIs. Because many local MCP integrations rely on stateful stdio processes, scaling them across multi-node Kubernetes clusters requires gateways that can handle protocol bridging and stateful translation. A centralized, enterprise-grade gateway like &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; bridges this gap by converting local stdio-based servers into highly available, stateless HTTP or SSE connections.&lt;/p&gt;

&lt;p&gt;To scale reliable networks of AI agents, engineers must focus on three operational priorities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Adaptive Load Balancing:&lt;/strong&gt; Gateways should monitor the health of upstream tool servers dynamically. For example, &lt;a href="https://docs.getbifrost.ai/enterprise/adaptive-load-balancing" rel="noopener noreferrer"&gt;Bifrost's adaptive load balancing&lt;/a&gt; automatically routes requests around degraded endpoints, preventing tool-calling failures from derailing agentic workflows.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Enterprise-Grade Identity:&lt;/strong&gt; In large organizations, tools cannot run with universal root privileges. Integrating OIDC providers like Okta or Microsoft Entra ID is essential. Gateways must support automatic &lt;a href="https://docs.getbifrost.ai/enterprise/user-provisioning" rel="noopener noreferrer"&gt;user provisioning&lt;/a&gt; to sync team-level permissions directly to tool access profiles.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Data Protection and PII Redaction:&lt;/strong&gt; Because agents interact with sensitive corporate data sources, security guardrails are a strict requirement. Implementing comprehensive &lt;a href="https://docs.getbifrost.ai/enterprise/guardrails" rel="noopener noreferrer"&gt;guardrails&lt;/a&gt; at the gateway layer ensures that &lt;a href="https://docs.getbifrost.ai/enterprise/guardrails/secrets-detection" rel="noopener noreferrer"&gt;secrets detection&lt;/a&gt; and &lt;a href="https://docs.getbifrost.ai/enterprise/guardrails/custom-regex" rel="noopener noreferrer"&gt;custom regex redaction&lt;/a&gt; occur before any payload leaves the corporate network boundary.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By resolving these concerns at the gateway layer rather than within individual agent applications, engineering teams can maintain a robust, compliant, and highly performant AI platform.&lt;/p&gt;

&lt;p&gt;Platform engineers evaluating their agentic infrastructure options can &lt;a href="https://getmaxim.ai/bifrost/book-a-demo" rel="noopener noreferrer"&gt;request a Bifrost demo&lt;/a&gt; or inspect the &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;open-source repository&lt;/a&gt; on GitHub to begin securing and scaling tool connections.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://docs.getbifrost.ai/" rel="noopener noreferrer"&gt;Bifrost AI Gateway Documentation&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://modelcontextprotocol.io/" rel="noopener noreferrer"&gt;Model Context Protocol Specification&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://docs.docker.com/desktop/mcp-toolkit/" rel="noopener noreferrer"&gt;Docker Docs: MCP Toolkit and Gateway&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://github.com/envoyproxy/ai-gateway" rel="noopener noreferrer"&gt;CNCF Envoy AI Gateway Repository&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://github.com/microsoft/mcp-gateway" rel="noopener noreferrer"&gt;Microsoft MCP Gateway Repository&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://github.com/TheLunarCompany/lunar" rel="noopener noreferrer"&gt;Lunar.dev MCPX Repository&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>opensource</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Channels-last memory format cut our conv backbone latency by 22%</title>
      <dc:creator>Elise Moreau</dc:creator>
      <pubDate>Wed, 24 Jun 2026 05:36:21 +0000</pubDate>
      <link>https://dev.to/elise_moreau/channels-last-memory-format-cut-our-conv-backbone-latency-22-19l2</link>
      <guid>https://dev.to/elise_moreau/channels-last-memory-format-cut-our-conv-backbone-latency-22-19l2</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Switching our convolutional segmentation backbone to PyTorch's channels-last memory format cut inference latency by about 22% on A100s, with no accuracy change and a four-line code edit.&lt;/p&gt;

&lt;p&gt;Our background-removal model at Photoroom spent roughly 31 ms per 1024x1024 image on an A100, and profiling pointed most of that time at cuDNN convolution kernels rather than our diffusion sampler. The model is a fairly standard U-Net style encoder-decoder, all convolutions, running in float16 under &lt;code&gt;torch.autocast&lt;/code&gt;. Before touching the architecture, I wanted to rule out the cheap wins, and the cheapest one turned out to be tensor memory layout. The channels-last memory format gave us most of the speedup we were chasing, and the change fit in a handful of lines. To be precise, the network math is identical; only the memory layout of the activations changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What channels-last memory format changes
&lt;/h2&gt;

&lt;p&gt;The channels-last memory format stores a 4D activation tensor in NHWC order, keeping the channel values for one spatial position contiguous in memory. PyTorch keeps the logical NCHW shape, so your indexing and your model code stay the same. What changes is the stride pattern, which lets cuDNN select kernels that read contiguous channels and run more efficiently on tensor-core hardware.&lt;/p&gt;

&lt;p&gt;The default PyTorch layout is NCHW (channels-first), where all of one channel's pixels sit together. NVIDIA's tensor cores prefer the NHWC arrangement for convolutions, as documented in their &lt;a href="https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html" rel="noopener noreferrer"&gt;convolution performance guide&lt;/a&gt;. When your tensors arrive in NCHW, cuDNN often inserts transpose passes around each convolution to reshuffle data, and those transposes are pure overhead. Converting once at the input and keeping the format consistent removes that per-layer reshuffling.&lt;/p&gt;
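&lt;p&gt;To make the stride difference concrete, here is a small pure-Python sketch that computes element strides for both layouts and the memory offsets of neighbouring channels at one pixel. The helper names and the example shape are made up for illustration; the stride formulas are the ones PyTorch reports for a contiguous tensor and a channels-last tensor of the same logical shape.&lt;/p&gt;

```python
def strides_nchw(n, c, h, w):
    """Element strides for (N, C, H, W) indexing in channels-first storage."""
    return (c * h * w, h * w, w, 1)


def strides_channels_last(n, c, h, w):
    """Element strides for the same (N, C, H, W) indexing in NHWC storage."""
    return (h * w * c, 1, w * c, c)


def offset(index, strides):
    """Linear memory offset of one element."""
    return sum(i * s for i, s in zip(index, strides))


shape = (2, 64, 8, 8)  # hypothetical activation: N=2, C=64, H=8, W=8
nchw = strides_nchw(*shape)           # (4096, 64, 8, 1)
nhwc = strides_channels_last(*shape)  # (4096, 1, 512, 64)

# Offsets of channels 0..3 at pixel (h=0, w=1) of the first image
nchw_offsets = [offset((0, ch, 0, 1), nchw) for ch in range(4)]  # [1, 65, 129, 193]
nhwc_offsets = [offset((0, ch, 0, 1), nhwc) for ch in range(4)]  # [64, 65, 66, 67]
```

&lt;p&gt;In the channels-first case, adjacent channels of one pixel sit a full image plane apart; in the channels-last case they are adjacent in memory, which is the property the convolution kernels exploit.&lt;/p&gt;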

&lt;h2&gt;
  
  
  Converting a PyTorch model to channels-last
&lt;/h2&gt;

&lt;p&gt;The conversion API has been stable since well before PyTorch 2.3, and the &lt;a href="https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html" rel="noopener noreferrer"&gt;official memory format tutorial&lt;/a&gt; covers the details. Two things need the format: the module parameters and the input tensor. If only one of them is channels-last, the other has to be converted on every call or the convolution falls back to NCHW kernels, and most of the gain disappears.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;

&lt;span class="c1"&gt;# convert the model's conv weights once, at load time
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memory_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;channels_last&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# convert each input batch to match
&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memory_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;channels_last&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;autocast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float16&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# output is channels_last; convert back if a
&lt;/span&gt;                  &lt;span class="c1"&gt;# downstream op needs contiguous NCHW
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One subtlety worth checking: &lt;code&gt;torch.channels_last&lt;/code&gt; is defined only for 4D tensors, and &lt;code&gt;x.to(memory_format=torch.channels_last)&lt;/code&gt; raises a &lt;code&gt;RuntimeError&lt;/code&gt; on a 3D tensor, so make sure your inputs carry an explicit batch dimension. After the forward pass, the output keeps channels-last strides. If you feed it into an operation that assumes contiguous NCHW, call &lt;code&gt;.contiguous()&lt;/code&gt; there rather than reverting the whole pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why NHWC is faster on tensor cores
&lt;/h2&gt;

&lt;p&gt;Tensor cores execute matrix-multiply-accumulate on small tiles, and convolutions get lowered to those tile operations. With NHWC layout the channel dimension, which is the contracting dimension of the implicit matmul, is contiguous, so the kernel loads aligned vectors without gathering strided data. The effect grows with channel count. Our deepest encoder blocks at 512 channels saw the largest per-layer improvement, while the early high-resolution layers at 64 channels barely moved.&lt;/p&gt;
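&lt;p&gt;The implicit matmul can be written down explicitly. The sketch below uses the forward-convolution mapping described in NVIDIA's convolution performance guide, where the contracting dimension is input channels times kernel height times kernel width. The two layer shapes are hypothetical stand-ins for an early and a deep 3x3 block, not measurements from our model.&lt;/p&gt;

```python
def implicit_gemm_dims(batch, c_in, c_out, out_h, out_w, kernel_h, kernel_w):
    """Return (M, N, K) for the forward convolution viewed as one matmul."""
    m = batch * out_h * out_w        # one row per output pixel
    n = c_out                        # one column per output channel
    k = c_in * kernel_h * kernel_w   # contracting dimension
    return m, n, k


# Hypothetical 3x3 layers: an early high-resolution block and a deep block.
early = implicit_gemm_dims(1, 64, 64, 512, 512, 3, 3)  # (262144, 64, 576)
deep = implicit_gemm_dims(1, 512, 512, 64, 64, 3, 3)   # (4096, 512, 4608)
```

&lt;p&gt;The deep block contracts over eight times as many values per output element as the early one, which is consistent with wide layers benefiting most from contiguous channel loads.&lt;/p&gt;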

&lt;p&gt;The gain also depends on precision. Channels-last pairs with float16 or bfloat16, because tensor cores only engage in reduced precision; in pure float32 the kernels often route through CUDA cores where the layout advantage shrinks. We were already running float16 under autocast, so the two optimizations stacked. The nuance here is that channels-last is not a free win in every configuration. It is a win when your convolutions are wide, your precision is reduced, and your hardware has tensor cores.&lt;/p&gt;

&lt;h2&gt;
  
  
  Measuring the speedup without fooling yourself
&lt;/h2&gt;

&lt;p&gt;A layout change is easy to misattribute, so I measured carefully. I ran 200 warmup iterations, then timed 1000 forward passes with &lt;code&gt;torch.cuda.synchronize()&lt;/code&gt; around each measurement window, since CUDA calls are asynchronous and an unsynchronized timer reports queue time rather than kernel time. I also confirmed the output tensors matched the NCHW baseline within float16 tolerance, so I knew I was timing the same computation.&lt;/p&gt;
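&lt;p&gt;The measurement protocol above can be sketched as a small harness. This is a simplified stand-in, not the exact script I ran: &lt;code&gt;benchmark&lt;/code&gt; and its &lt;code&gt;sync&lt;/code&gt; parameter are hypothetical names, and the example times a CPU workload so it runs anywhere. On a GPU you would pass &lt;code&gt;torch.cuda.synchronize&lt;/code&gt; as &lt;code&gt;sync&lt;/code&gt; and the model forward pass as &lt;code&gt;fn&lt;/code&gt;.&lt;/p&gt;

```python
import time


def benchmark(fn, warmup=200, iters=1000, sync=None):
    """Return mean seconds per call of fn, measured after a warmup phase."""
    if sync is None:
        sync = lambda: None  # CPU-only fallback; nothing to wait for
    for _ in range(warmup):
        fn()
    sync()  # drain queued work so warmup does not leak into the timed window
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    sync()  # wait for the last kernel before stopping the clock
    return (time.perf_counter() - start) / iters


# Placeholder workload standing in for a forward pass.
mean_s = benchmark(lambda: sum(range(1000)), warmup=20, iters=100)
print(f"{mean_s * 1e3:.4f} ms per call")
```

&lt;p&gt;Synchronizing once before the timed loop and once after it keeps the per-call overhead of the sync itself out of the number.&lt;/p&gt;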

&lt;p&gt;The headline number was a drop from roughly 31 ms to 24 ms per image, about 22% on our A100. On a V100 the same change gave closer to 14%, which tracks with its older tensor-core generation. I would treat any single-number claim with suspicion until you reproduce it on your own shapes; the benefit is real but hardware-dependent and model-dependent.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;The format is not universally beneficial. Networks dominated by pointwise operations, normalization, or attention rather than spatial convolutions show little or no improvement, because those ops do not hit the cuDNN convolution path that NHWC accelerates. Transformer backbones, for instance, rarely care.&lt;/p&gt;

&lt;p&gt;There is also a correctness trap. Mixing layouts inside a model can silently insert transposes that erase the gain, and some custom operators or older third-party layers assume contiguous NCHW and will either copy or error. If you run &lt;code&gt;torch.compile&lt;/code&gt;, verify the format survives the traced graph rather than assuming it does. For very small channel counts the conversion overhead can outweigh the kernel savings, so profile before committing it everywhere.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;The channels-last memory format is one of the few optimizations that costs almost nothing to try and is straightforward to revert if it does not help. For a convolution-heavy vision model running in float16 on tensor-core GPUs, it is worth measuring before you reach for quantization or architectural surgery. What I would try next is combining it with &lt;code&gt;torch.compile&lt;/code&gt; and a CUDA graph capture, then re-profiling to see how much transpose overhead is actually left in the trace.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html" rel="noopener noreferrer"&gt;PyTorch channels-last memory format tutorial&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html" rel="noopener noreferrer"&gt;NVIDIA convolution performance and NHWC layout guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pytorch.org/docs/stable/amp.html" rel="noopener noreferrer"&gt;PyTorch autocast and mixed precision docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.nvidia.com/deeplearning/cudnn/latest/index.html" rel="noopener noreferrer"&gt;cuDNN developer guide on tensor layouts&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>pytorch</category>
      <category>computervision</category>
      <category>machinelearning</category>
      <category>mlops</category>
    </item>
    <item>
      <title>The SDXL VAE overflow that decoded black images in fp16</title>
      <dc:creator>Elise Moreau</dc:creator>
      <pubDate>Tue, 23 Jun 2026 05:37:00 +0000</pubDate>
      <link>https://dev.to/elise_moreau/the-sdxl-vae-overflow-that-decoded-black-images-in-fp16-46g6</link>
      <guid>https://dev.to/elise_moreau/the-sdxl-vae-overflow-that-decoded-black-images-in-fp16-46g6</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: The SDXL VAE decoder pushes activations past 65504, the max value fp16 can hold, so the last decode step overflows to inf and you get a fully black image. At Photoroom we hit this on roughly 1 in 600 product renders before we caught it. The fix is to upcast only the VAE, or swap in rescaled decoder weights, not to drop the whole pipeline to fp32.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We run SDXL-based pipelines for product photography. A customer uploads a sneaker on a kitchen table, we cut it out, then generate a clean studio background around it. Hundreds of thousands of renders a day, mostly on A10G and A100 GPUs, with the UNet in fp16 to keep the per-image latency under our budget.&lt;/p&gt;

&lt;p&gt;The bug showed up as a thin stream of complaints. Black image. No error, no stack trace, no NaN warning in the logs. Just a 1024x1024 PNG of pure black where a render should be.&lt;/p&gt;

&lt;h2&gt;
  
  
  What was actually happening
&lt;/h2&gt;

&lt;p&gt;I pulled 40 of the failing seeds and replayed them with hooks on every module in the VAE decoder. The UNet output was fine. Latents looked normal, values in the usual range. The decode was where it died.&lt;/p&gt;

&lt;p&gt;To be precise, the overflow lives in the decoder's mid and up blocks. SDXL's VAE has a few residual layers where the post-convolution activations spike hard for certain inputs. fp16 tops out at 65504. I logged a max activation of 3.1e5 inside one of the &lt;code&gt;up_blocks&lt;/code&gt; resblocks on a failing seed. Once a single value hits inf, the following GroupNorm computes an infinite mean, every normalized value in that group becomes NaN, and the layers after it spread the NaN across the whole feature map. You decode garbage that clamps to black.&lt;/p&gt;
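&lt;p&gt;Why one inf is enough to ruin a whole group is easy to check by hand. The sketch below is a plain-Python normalization over a single group, without the learned scale and shift, and it is not the diffusers implementation; &lt;code&gt;normalize_group&lt;/code&gt; is a hypothetical helper written for this illustration.&lt;/p&gt;

```python
import math


def normalize_group(values, eps=1e-5):
    """Normalize one group to zero mean and unit variance, as GroupNorm does."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return [(v - mean) / math.sqrt(var + eps) for v in values]


healthy = normalize_group([0.5, -1.0, 3.0, 2.0])
poisoned = normalize_group([0.5, -1.0, float("inf"), 2.0])

print("healthy: ", [round(v, 3) for v in healthy])
print("poisoned:", poisoned)
```

&lt;p&gt;With one inf in the group the mean is inf, inf minus inf is NaN, the variance becomes NaN, and every output in the group is NaN, including the positions that started out finite.&lt;/p&gt;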

&lt;p&gt;The nuance here is that it's input-dependent. Most latents never come close to the ceiling. High-contrast scenes with bright speculars, like a glossy bottle on white, are the ones that tip over. That's why our QA never saw it and production did.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;

&lt;span class="c1"&gt;# hook to catch the overflow as it happens
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;watch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hook&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;__&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;item&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;6e4&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# fp16 max is 65504
&lt;/span&gt;            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: max activation &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hook&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mod&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vae&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decoder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;named_modules&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;mod&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register_forward_hook&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;watch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That printout is what pointed me at the exact resblock instead of guessing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The options we weighed
&lt;/h2&gt;

&lt;p&gt;There's no single right answer here, and the trade-off is VRAM and latency against correctness. We measured four approaches on the same 500-seed batch.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Fixes overflow&lt;/th&gt;
&lt;th&gt;VAE decode latency&lt;/th&gt;
&lt;th&gt;Extra VRAM&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Full pipeline fp32&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;+210%&lt;/td&gt;
&lt;td&gt;~2x&lt;/td&gt;
&lt;td&gt;Kills our latency budget&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;force_upcast&lt;/code&gt; VAE to fp32&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;+18%&lt;/td&gt;
&lt;td&gt;+1.1 GB&lt;/td&gt;
&lt;td&gt;Only the VAE runs fp32&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VAE in bf16&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;+6%&lt;/td&gt;
&lt;td&gt;+0.1 GB&lt;/td&gt;
&lt;td&gt;Needs Ampere or newer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;fp16-fix decoder weights&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;+0%&lt;/td&gt;
&lt;td&gt;+0 GB&lt;/td&gt;
&lt;td&gt;Rescaled weights, fp16 stays&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Full fp32 was off the table. It doubled memory and blew past the latency we promise. The other three all hold up.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;force_upcast&lt;/code&gt; is the diffusers default for a reason. It keeps the UNet in fp16 and runs only the VAE in fp32. One flag, and the overflow is gone because fp32 has the headroom.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;diffusers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoencoderKL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StableDiffusionXLPipeline&lt;/span&gt;

&lt;span class="n"&gt;pipe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;StableDiffusionXLPipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stabilityai/stable-diffusion-xl-base-1.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;torch_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vae&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;force_upcast&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;  &lt;span class="c1"&gt;# VAE runs fp32, UNet stays fp16
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We landed on bf16 for the VAE on our A100 fleet. bf16 has the same exponent range as fp32, so the 3.1e5 activation fits without issue, and the decode cost was 6% instead of 18%. On the A10G boxes, which are also Ampere but did not get us the bf16 path we wanted, we use the rescaled fp16-fix decoder weights, which shift the activation magnitudes down so they never reach the ceiling in the first place.&lt;/p&gt;

&lt;p&gt;One detail that bit us: if you call &lt;code&gt;pipe.enable_vae_tiling()&lt;/code&gt; for large outputs, the tiling runs before the dtype upcast, so you still need the dtype right. Tiling reduces peak memory; it does not change the numerical range.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the gateway fits
&lt;/h2&gt;

&lt;p&gt;A side note, since people ask how the text side of this connects. Before the diffusion step, we rewrite the user's scene description into a cleaner prompt with an LLM, and we generate alt-text captions after. Those LLM calls go through Bifrost, an open-source gateway that gives us one OpenAI-compatible endpoint with automatic failover across providers. It has nothing to do with the VAE overflow. It just means when one provider has a bad afternoon, the caption step doesn't take the render pipeline down with it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;bf16 is not a free win. It has the range of fp32 but only 7 explicit mantissa bits, fewer than fp16's 10, so you trade overflow safety for a little precision. On our renders the visible difference was nothing, but I would not assume that for every model. Measure SSIM against an fp32 reference before you ship.&lt;/p&gt;
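&lt;p&gt;The range-versus-precision trade can be reproduced with the standard library alone. This is a rough sketch under two assumptions: the helper names are hypothetical, and bf16 is emulated by truncating a float32 encoding to its top 16 bits, whereas real hardware usually rounds to nearest. It shows the 3.1e5 spike overflowing fp16 but surviving in bf16, and a small value keeping more digits in fp16.&lt;/p&gt;

```python
import struct


def fits_fp16(x):
    """True if x can be encoded as an IEEE half float without overflowing."""
    try:
        struct.pack("!e", x)
        return True
    except OverflowError:
        return False


def to_fp16(x):
    """Round-trip x through IEEE half precision."""
    return struct.unpack("!e", struct.pack("!e", x))[0]


def to_bf16(x):
    """Round-trip x through bfloat16 by truncating its float32 encoding."""
    raw = struct.pack("!f", x)  # big-endian: sign and exponent bytes come first
    return struct.unpack("!f", raw[:2] + b"\x00\x00")[0]


spike = 3.1e5  # the activation magnitude logged on the failing seed
print("fits in fp16:", fits_fp16(spike))
print("as bf16:     ", to_bf16(spike))
print("1.001 as fp16:", to_fp16(1.001))
print("1.001 as bf16:", to_bf16(1.001))
```

&lt;p&gt;The spike stays finite in bf16 at the cost of a coarser grid, which is the precision loss the SSIM check is meant to catch.&lt;/p&gt;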

&lt;p&gt;The fp16-fix weights are a community rescaling, not an official release. They work well, and we validated them on 2000 renders, but you're trusting a third-party checkpoint. Pin the exact revision.&lt;/p&gt;

&lt;p&gt;And none of this helps if your latents themselves are out of distribution. We saw two black images that were not VAE overflow at all; they came from a bad LoRA producing extreme latents. The hook above tells you which failure you're looking at, so put it in your eval harness, not only in debugging.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/docs/diffusers/en/api/models/autoencoderkl" rel="noopener noreferrer"&gt;diffusers VAE and &lt;code&gt;force_upcast&lt;/code&gt; docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/madebyollin/sdxl-vae-fp16-fix" rel="noopener noreferrer"&gt;sdxl-vae-fp16-fix rescaled weights&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus" rel="noopener noreferrer"&gt;bfloat16 numerics, the original Google brief&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2307.01952" rel="noopener noreferrer"&gt;SDXL paper, architecture details&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost gateway&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>pytorch</category>
      <category>computervision</category>
      <category>machinelearning</category>
      <category>mlops</category>
    </item>
    <item>
      <title>Unifying image inputs across three vision providers behind Bifrost</title>
      <dc:creator>Elise Moreau</dc:creator>
      <pubDate>Mon, 22 Jun 2026 14:52:41 +0000</pubDate>
      <link>https://dev.to/elise_moreau/unifying-image-inputs-across-three-vision-providers-behind-bifrost-ggd</link>
      <guid>https://dev.to/elise_moreau/unifying-image-inputs-across-three-vision-providers-behind-bifrost-ggd</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: We run an automated visual QA step that scores generated product shots with vision LLMs from OpenAI, Anthropic, and Google. Each provider wanted the image payload shaped differently, and one rate-limit spike could stall the whole batch. Putting &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; in front gave us one OpenAI-compatible image schema and automatic failover, with about 4ms of added latency per call.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At Photoroom I work on the diffusion side of product photography. The model that generates a clean studio shot is only half the job. The other half is deciding, automatically, whether the output is actually usable before it reaches a customer.&lt;/p&gt;

&lt;p&gt;So we built a QA scorer. It sends each generated render to a vision model and asks for a structured verdict: background artifacts, clipped edges, color drift against the source. We send the same image to more than one provider because the failure modes differ, and a single model's blind spots leak through otherwise.&lt;/p&gt;

&lt;p&gt;That is where the mess started.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three providers, three image schemas
&lt;/h2&gt;

&lt;p&gt;The nuance here is that "OpenAI-compatible vision" is not a settled standard. To be precise, the message envelope diverges per provider.&lt;/p&gt;

&lt;p&gt;OpenAI wants &lt;code&gt;image_url&lt;/code&gt; content parts, and you can pass a &lt;code&gt;data:&lt;/code&gt; base64 URI or a real URL. Anthropic's native API wants a &lt;code&gt;source&lt;/code&gt; block with &lt;code&gt;type: base64&lt;/code&gt;, a &lt;code&gt;media_type&lt;/code&gt;, and raw data. Google Vertex wants &lt;code&gt;inline_data&lt;/code&gt; with &lt;code&gt;mime_type&lt;/code&gt;. Our scorer started life with three code paths, three sets of size limits, and three retry policies that drifted out of sync within a month.&lt;/p&gt;
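&lt;p&gt;To show how far apart the three shapes are, here is a minimal sketch that converts one OpenAI-style &lt;code&gt;image_url&lt;/code&gt; part into the other two. The function names are hypothetical, it only handles &lt;code&gt;data:&lt;/code&gt; URIs and not remote URLs, and it is neither our original scorer code nor how any gateway implements the translation.&lt;/p&gt;

```python
def split_data_uri(url):
    """Split 'data:image/png;base64,AAAA' into ('image/png', 'AAAA')."""
    header, payload = url.split(",", 1)
    media_type = header[len("data:"):].split(";")[0]
    return media_type, payload


def to_anthropic(part):
    """OpenAI image_url part to an Anthropic-style base64 image block."""
    media_type, payload = split_data_uri(part["image_url"]["url"])
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": payload},
    }


def to_gemini(part):
    """OpenAI image_url part to a Gemini-style inline_data part."""
    media_type, payload = split_data_uri(part["image_url"]["url"])
    return {"inline_data": {"mime_type": media_type, "data": payload}}


# Truncated placeholder payload, not a real image.
openai_part = {
    "type": "image_url",
    "image_url": {"url": "data:image/png;base64,iVBORw0K"},
}
print(to_anthropic(openai_part))
print(to_gemini(openai_part))
```

&lt;p&gt;Each converter is a few lines, but every one also needs its own size limits and retry handling, which is where our three code paths drifted apart.&lt;/p&gt;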

&lt;p&gt;For a 12-image batch per product across two providers, that branching logic was the part that broke most often. Not the diffusion model. The plumbing.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we changed
&lt;/h2&gt;

&lt;p&gt;We dropped Bifrost in as the gateway and pointed the scorer at a single endpoint. It exposes one OpenAI-compatible API across 23+ providers, so the image part is always written the OpenAI way, and Bifrost translates to whatever the target provider expects. &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/streaming" rel="noopener noreferrer"&gt;Multimodal support&lt;/a&gt; for text and images sits behind that common interface.&lt;/p&gt;

&lt;p&gt;One request shape now. The only thing that changes per call is the &lt;code&gt;model&lt;/code&gt; string.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8080/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "anthropic/claude-sonnet-4-6",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Score this render for edge clipping. Return JSON."},
        {"type": "image_url",
         "image_url": {"url": "data:image/png;base64,iVBORw0K..."}}
      ]
    }]
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Swap &lt;code&gt;anthropic/claude-sonnet-4-6&lt;/code&gt; for &lt;code&gt;openai/gpt-4o&lt;/code&gt; or &lt;code&gt;vertex/gemini-2.5-pro&lt;/code&gt; and the body stays identical. The scorer no longer knows or cares how each provider encodes pixels.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failover instead of a dead batch
&lt;/h2&gt;

&lt;p&gt;The second problem was throughput. When one vision provider returned 429s during a busy stretch, our batch queue used to back up because the scorer kept hammering the same key.&lt;/p&gt;

&lt;p&gt;Bifrost's &lt;a href="https://docs.getbifrost.ai/features/retries-and-fallbacks" rel="noopener noreferrer"&gt;automatic fallbacks&lt;/a&gt; let us declare an ordered list. If the primary returns an error or times out, the request moves to the next provider with the same payload. No code change in the scorer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;fallbacks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;openai/gpt-4o&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;anthropic/claude-sonnet-4-6&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;vertex/gemini-2.5-pro&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Across 30 days the failover fired on roughly 0.8% of calls. Small number. It was the difference between a stalled queue and one that drains.&lt;/p&gt;
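&lt;p&gt;The ordered-fallback idea is simple enough to sketch in a few lines. This is an illustration of the pattern only, not Bifrost's implementation: &lt;code&gt;call_with_fallbacks&lt;/code&gt; and &lt;code&gt;fake_send&lt;/code&gt; are hypothetical names, and the fake transport pretends the first provider is rate limited.&lt;/p&gt;

```python
def call_with_fallbacks(models, send, payload):
    """Try each model in order; return (model, response) for the first success."""
    errors = []
    for model in models:
        try:
            return model, send(model, payload)
        except Exception as exc:  # rate limit, timeout, or provider error
            errors.append((model, exc))
    raise RuntimeError(f"all providers failed: {errors}")


def fake_send(model, payload):
    """Stand-in transport: the first provider in the list is rate limited."""
    if model == "openai/gpt-4o":
        raise TimeoutError("429 from provider")
    return {"model": model, "verdict": "ok"}


order = ["openai/gpt-4o", "anthropic/claude-sonnet-4-6", "vertex/gemini-2.5-pro"]
print(call_with_fallbacks(order, fake_send, {"messages": []}))
```

&lt;p&gt;The payload is passed through unchanged on every attempt, which only works because the request shape is the same for every provider.&lt;/p&gt;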

&lt;p&gt;We also get native &lt;a href="https://docs.getbifrost.ai/features/observability/default" rel="noopener noreferrer"&gt;Prometheus metrics&lt;/a&gt; out of the gateway, so per-provider latency and error rates land in the same Grafana board we already use for GPU utilization. Before, that data lived in three provider dashboards and a spreadsheet.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it compares
&lt;/h2&gt;

&lt;p&gt;We looked at LiteLLM and Portkey before committing. Here is the honest read for our specific multimodal use case.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concern&lt;/th&gt;
&lt;th&gt;Bifrost&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;th&gt;Portkey&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Unified image schema&lt;/td&gt;
&lt;td&gt;Yes, OpenAI-compatible translation&lt;/td&gt;
&lt;td&gt;Yes, very wide provider list&lt;/td&gt;
&lt;td&gt;Yes, managed gateway&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-hosted, single binary&lt;/td&gt;
&lt;td&gt;Yes, Go, &lt;code&gt;npx&lt;/code&gt; or Docker&lt;/td&gt;
&lt;td&gt;Yes, Python proxy&lt;/td&gt;
&lt;td&gt;Self-host available, more involved&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Added latency&lt;/td&gt;
&lt;td&gt;~4ms in our test&lt;/td&gt;
&lt;td&gt;Higher under Python load&lt;/td&gt;
&lt;td&gt;Low, hosted edge&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Provider breadth&lt;/td&gt;
&lt;td&gt;23+&lt;/td&gt;
&lt;td&gt;Largest list of the three&lt;/td&gt;
&lt;td&gt;Broad&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Guardrails / managed cloud&lt;/td&gt;
&lt;td&gt;Enterprise tier&lt;/td&gt;
&lt;td&gt;Lighter&lt;/td&gt;
&lt;td&gt;Strongest managed feature set&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;LiteLLM has the widest provider coverage, and if you live in Python its proxy is genuinely fast to wire up. Portkey's managed guardrails and analytics are more polished than what the open-source Bifrost gives you out of the box. We picked Bifrost because it runs as one self-hosted Go binary next to our inference cluster, and the latency overhead stayed flat under concurrent image traffic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;This is not free.&lt;/p&gt;

&lt;p&gt;You are adding a network hop. We measured about 4ms median, which is noise next to a 2-3 second vision call, but it is not zero, and for pure text streaming you would feel it more.&lt;/p&gt;

&lt;p&gt;It is also one more service to run and patch. If Bifrost goes down without a redundant deployment, every provider goes down with it, so you trade per-provider fragility for a single point you now own. We run two replicas behind a load balancer for that reason.&lt;/p&gt;

&lt;p&gt;And the deep governance pieces, like adaptive load balancing and clustering, sit in the enterprise tier. The open-source core covered our failover and multimodal needs, but check the docs before assuming a specific feature is in the free build.&lt;/p&gt;

&lt;p&gt;The translation layer is also only as good as its provider coverage. Support for a brand-new provider quirk can lag the upstream API by a release.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost on GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/quickstart/gateway/streaming" rel="noopener noreferrer"&gt;Multimodal and streaming docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/retries-and-fallbacks" rel="noopener noreferrer"&gt;Retries and fallbacks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/observability/default" rel="noopener noreferrer"&gt;Observability and Prometheus metrics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.anthropic.com/en/docs/build-with-claude/vision" rel="noopener noreferrer"&gt;Anthropic vision message format&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>computervision</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>You Can't Govern the AI You Can't See</title>
      <dc:creator>Elise Moreau</dc:creator>
      <pubDate>Mon, 22 Jun 2026 06:16:13 +0000</pubDate>
      <link>https://dev.to/elise_moreau/you-cant-govern-the-ai-you-cant-see-1kkj</link>
      <guid>https://dev.to/elise_moreau/you-cant-govern-the-ai-you-cant-see-1kkj</guid>
      <description>&lt;p&gt;&lt;em&gt;AI governance starts with visibility: a policy, a budget, or a guardrail can only act on the AI traffic a team can actually see. This guide explains why so much AI use stays out of IT's view, why that gap stops governance before it starts, and how the Bifrost AI gateway and Bifrost Edge close it by making endpoint AI both visible and governable.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Every AI governance control an organization owns, from budgets and access rules to guardrails and audit trails, can only act on the AI traffic it can actually see. That ability to see what AI is running and what it is sending, often called AI visibility, is the precondition for everything else. The trouble is that most AI used at work now runs on the endpoint, inside desktop apps, browser tabs, and coding agents that reach a model provider directly, so the activity never reaches the systems security teams watch. A request that leaves a laptop for a third-party model without crossing a monitored path is, for governance purposes, a request that did not happen. The gap is wide: a 2025 Gartner survey of cybersecurity leaders found that 69 percent have evidence or suspicion that employees are using public generative AI at work, which is exactly the usage most teams cannot account for.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why you can't govern what you can't see
&lt;/h2&gt;

&lt;p&gt;Governance is a chain of steps, and visibility is the first link. To act on an AI request, a system has to see it, attach an identity and a policy to it, enforce limits on it, and record what happened. When the first step is missing, none of the steps after it can run, because a control that never observes a request has nothing to act on.&lt;/p&gt;

&lt;p&gt;This plays out the same way across every control a security or platform team relies on. A data guardrail that never inspects a prompt cannot redact the secret inside it. A budget that never counts a call cannot cap spending on it. A policy that never sees a tool cannot decide whether the tool is allowed. The result is not weak governance but absent governance, applied with confidence to the fraction of AI traffic that happens to be visible while the rest moves untouched.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where AI goes out of view
&lt;/h2&gt;

&lt;p&gt;AI goes out of view wherever it runs close to the user and connects straight to a provider, which describes most of where it now runs. Four blind spots account for the bulk of it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Desktop assistants such as the ChatGPT app or Claude Desktop, signed in with personal accounts the organization does not manage.&lt;/li&gt;
&lt;li&gt;Browser AI, including in-page assistants and extensions that an employee turns on without review.&lt;/li&gt;
&lt;li&gt;Coding agents such as Claude Code, Codex, and Cursor, which read source code and call external services from the developer's machine.&lt;/li&gt;
&lt;li&gt;MCP servers wired into those tools, which can read files, call APIs, and act on a user's behalf with standing access.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The list of tools an IT team can name is routinely a fraction of what employees actually use, because every new app, browser feature, and MCP server is one more thing to find, and discovery has no natural endpoint. The tools no one tracks are not necessarily malicious; they are simply outside anyone's view, which is what places them beyond the reach of any control. Gartner has predicted that by 2030, more than 40 percent of organizations will experience &lt;a href="https://www.infosecurity-magazine.com/news/gartner-40-firms-hit-shadow-ai/" rel="noopener noreferrer"&gt;security or compliance incidents tied to the use of unauthorized AI&lt;/a&gt;, a direct consequence of governing only the share of activity a team can see.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why traditional tools don't close the gap
&lt;/h2&gt;

&lt;p&gt;Traditional controls do not close the visibility gap because they were built to watch the network, while endpoint AI mostly avoids the network they watch. Network proxies and data loss prevention systems inspect what crosses the corporate perimeter, yet a large share of AI traffic leaves the device for a provider directly, over an encrypted connection that resembles ordinary web browsing and that often never passes through a corporate proxy at all.&lt;/p&gt;

&lt;p&gt;Three gaps recur across these approaches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Network filtering and data loss prevention sit on the corporate network path, so requests sent straight from a device to a provider, including from machines off that network, never reach them.&lt;/li&gt;
&lt;li&gt;Blocklists work from a known list of destinations, and new apps, browser features, and MCP servers appear faster than any list is updated.&lt;/li&gt;
&lt;li&gt;SaaS and expense audits catch tools that bill the company, but they miss free tiers, personal accounts, and anything installed locally.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these methods produces a partial list at a single moment, while the real usage is continuous and changes by the day. Closing the gap calls for visibility at the point where the AI actually runs, which is the endpoint itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the Bifrost AI gateway and Bifrost Edge make AI visible and governable
&lt;/h2&gt;

&lt;p&gt;Making AI governable takes two things in sequence: a place where AI traffic can be seen and governed, and a way to route the AI on every machine into that place. Bifrost, the &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;open-source AI gateway&lt;/a&gt; built by Maxim AI, is that place, and &lt;a href="https://docs.getbifrost.ai/edge/overview" rel="noopener noreferrer"&gt;Bifrost Edge&lt;/a&gt; is what brings the endpoint into it.&lt;/p&gt;

&lt;p&gt;On the gateway, every request that passes through is recorded by &lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;built-in observability&lt;/a&gt;, which captures the prompt, the response, the model, the token counts, the cost, and the latency for each call, with no change to the application. The same gateway holds the &lt;a href="https://docs.getbifrost.ai/deployment-guides/config-json/governance" rel="noopener noreferrer"&gt;virtual keys, budgets, and rate limits&lt;/a&gt; that tie usage to a person or project, along with the &lt;a href="https://docs.getbifrost.ai/enterprise/guardrails" rel="noopener noreferrer"&gt;guardrail profiles&lt;/a&gt; that inspect prompts and responses. The limit, until now, has been reach: the gateway could see and govern only the traffic that something had already pointed at it.&lt;/p&gt;

&lt;p&gt;Bifrost Edge closes that gap in reach by &lt;a href="https://docs.getbifrost.ai/edge/how-it-works" rel="noopener noreferrer"&gt;routing all supported AI traffic on a machine through Bifrost&lt;/a&gt; rather than letting it go straight to the provider. The AI traffic that used to leave the laptop unseen now appears in the same logs, under the same policies, as the rest of an organization's AI. The division of labor is straightforward: Edge supplies the visibility by inventorying endpoint AI and routing it through the gateway, and the gateway supplies the governance by recording, inspecting, and enforcing policy on the traffic it can now see. The gateway stays the single control plane, and Edge becomes its reach to the endpoint, so there is no separate visibility tool and no second policy model to maintain.&lt;/p&gt;

&lt;h3&gt;
  
  
  See what is running across the fleet
&lt;/h3&gt;

&lt;p&gt;Visibility begins with knowing what is present. Bifrost Edge discovers the &lt;a href="https://docs.getbifrost.ai/edge/mcp-governance" rel="noopener noreferrer"&gt;MCP servers configured in each app&lt;/a&gt; and the &lt;a href="https://docs.getbifrost.ai/edge/app-governance" rel="noopener noreferrer"&gt;AI applications in use&lt;/a&gt; on every machine, then assembles a live view across the fleet of which assistants and which servers are running, on which apps, and on how many devices. New apps and servers surface as they appear rather than during a periodic audit, and each one can be allowed or denied from a single console, with the decision enforced on the device.&lt;/p&gt;

&lt;h3&gt;
  
  
  Govern and record the traffic you can now see
&lt;/h3&gt;

&lt;p&gt;Once endpoint AI is visible, the same controls that protect gateway traffic apply to it. The &lt;a href="https://docs.getbifrost.ai/edge/security" rel="noopener noreferrer"&gt;guardrail profiles configured in Bifrost&lt;/a&gt; run before a prompt reaches a model and before a response returns, so secrets and personal data are caught or redacted before they leave the machine. Virtual keys and budgets tie each request to a person and a limit, while an &lt;a href="https://docs.getbifrost.ai/enterprise/audit-logs" rel="noopener noreferrer"&gt;administrative audit trail&lt;/a&gt; records who changed which policy and when, signed and retained for later review.&lt;/p&gt;

&lt;h3&gt;
  
  
  Roll it out and keep it current
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/edge/deployment-mdm" rel="noopener noreferrer"&gt;Bifrost Edge deploys through the device management platforms&lt;/a&gt; an organization already runs, including Jamf, Microsoft Intune, Kandji, Omnissa Workspace ONE, and JumpCloud, across macOS, Windows, and Linux. Identity and keys come from the user's single sign-on, so no secrets sit on the device, and central changes to policy and routing reach the fleet on their own once a machine is signed in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common questions about AI visibility
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is AI visibility?
&lt;/h3&gt;

&lt;p&gt;AI visibility is the ability to see which AI tools, models, and services are in use across an organization, and to see the individual requests they send and receive. Without it, governance controls have nothing to act on, which is why visibility is treated as the first step rather than a report generated at the end.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do you discover shadow AI?
&lt;/h3&gt;

&lt;p&gt;Shadow AI is discovered by observing AI activity where it originates. Because most of it runs on endpoints, an agent on the device, such as Bifrost Edge, can inventory the apps and MCP servers in use and route their traffic through a gateway, which turns a guess about what employees might be using into a current list of what they actually use.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can you get visibility without blocking AI?
&lt;/h3&gt;

&lt;p&gt;Visibility does not have to mean blocking AI. Routing endpoint AI through the Bifrost gateway makes each request visible and subject to guardrails and budgets while the tools keep working normally, so an organization can approve and govern AI rather than ban it. Blocking remains available for tools a team decides to disallow, but that is a policy choice rather than a side effect of gaining visibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  Visibility first, then governance
&lt;/h2&gt;

&lt;p&gt;Shadow AI is, at its core, a visibility problem before it is a policy problem, because the strongest policy in the world cannot reach a request no one can see. The organizations that handle it well start by making endpoint AI visible, then apply the controls they already trust to the usage that visibility reveals.&lt;/p&gt;

&lt;p&gt;Pairing the Bifrost AI gateway with Bifrost Edge gives security and platform teams both halves at once: the gateway records, inspects, and enforces, and Edge, currently in alpha, brings the AI on every machine into view so those controls have something to act on. Teams working through their own visibility gap can see how the combined approach fits together on the &lt;a href="https://docs.getbifrost.ai/edge/overview" rel="noopener noreferrer"&gt;Bifrost Edge overview&lt;/a&gt; and register there for alpha access.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>management</category>
      <category>monitoring</category>
      <category>security</category>
    </item>
    <item>
      <title>The seam our tiled upscaler left on every 4K product render</title>
      <dc:creator>Elise Moreau</dc:creator>
      <pubDate>Fri, 19 Jun 2026 06:51:10 +0000</pubDate>
      <link>https://dev.to/elise_moreau/the-seam-our-tiled-upscaler-left-on-every-4k-product-render-pf5</link>
      <guid>https://dev.to/elise_moreau/the-seam-our-tiled-upscaler-left-on-every-4k-product-render-pf5</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: We tile high-res images through our upscaler because a full 4096×4096 pass blows past 24GB of VRAM. For months every render had a faint cross down the middle. The fix was not a bigger GPU. It was admitting that hard tile boundaries break any model with a receptive field, and feathering the overlap with a raised-cosine weight instead of averaging it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At Photoroom I work on the generative side, mostly diffusion for product photography. One of our smaller models is a convolutional upscaler that takes a 1024px cutout and pushes it to print resolution. Nothing exotic. A residual-in-residual dense block network, the kind of thing that has been around since ESRGAN in 2018.&lt;/p&gt;

&lt;p&gt;It worked fine in the notebook. In production, on large images, it left a seam.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a seam actually is
&lt;/h2&gt;

&lt;p&gt;You cannot run a 4096×4096 image through this model on a single 24GB card. So you tile. Cut the image into 512px squares, upscale each, stitch them back. The naive version of this is three lines of code and it is wrong.&lt;/p&gt;

&lt;p&gt;The reason is the receptive field. To be precise, every output pixel near a tile edge was computed from a partial neighborhood. The convolutions on the right edge of the left tile never saw the pixels that lived in the right tile. So the two halves disagreed by a small amount, maybe 2-3 grey levels, and the human eye is very good at finding a straight vertical line of consistent 2-3 level error. On a flat grey studio background it was obvious. On busy texture it hid.&lt;/p&gt;

&lt;p&gt;We measured it. Sampling 200 renders, the mean absolute difference across the stitch line was 4.1 on an 8-bit scale, versus 0.9 for an adjacent non-seam column. Small number, very visible artifact.&lt;/p&gt;
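
&lt;p&gt;As a rough sketch of that measurement, assuming the metric is the mean absolute difference between the two columns on either side of a vertical stitch line. The function name, the list-of-rows image layout, and the toy values are illustrative, not the production code:&lt;/p&gt;

```python
from statistics import mean

def column_step_mad(image, col):
    """Mean absolute difference between column `col - 1` and column `col`.

    `image` is a list of rows of 8-bit grey values. When `col` is the first
    column of the right-hand tile, the pair straddles the stitch line.
    """
    return mean(abs(row[col] - row[col - 1]) for row in image)

# Toy 4x8 render: flat grey background, right tile is 3 levels brighter.
render = [[128] * 4 + [131] * 4 for _ in range(4)]

seam = column_step_mad(render, 4)      # straddles the stitch line
control = column_step_mad(render, 2)   # inside the left tile
print(seam, control)
```

&lt;p&gt;Running the same function on a column away from any stitch line gives the control value to compare against.&lt;/p&gt;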

&lt;h2&gt;
  
  
  Overlap is necessary but not sufficient
&lt;/h2&gt;

&lt;p&gt;The first fix everyone reaches for is overlapping tiles. Take 512px tiles but step by 448, so each pair shares a 64px strip. Then in the shared region you have two predictions and you blend them.&lt;/p&gt;

&lt;p&gt;The nuance here is how you blend. If you average the overlap with a flat 0.5/0.5 weight, you have moved the discontinuity, not removed it. The blend region now has a soft step at each of its two edges where the weighting suddenly kicks in. Better than before. Still a seam, just blurrier.&lt;/p&gt;

&lt;p&gt;What works is a weight that goes smoothly to zero at the tile border, so a pixel contributes nothing exactly where its receptive field ran out. A raised-cosine (Hann) window does this. Each tile is multiplied by its window and added to a canvas, the windows themselves are accumulated into a weight map, and at the end you divide the canvas by the summed weight.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hann_2d&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;overlap&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# ramp up over the overlap, flat in the middle, ramp down
&lt;/span&gt;    &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ramp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hann_window&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;overlap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;periodic&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)[:&lt;/span&gt;&lt;span class="n"&gt;overlap&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;overlap&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ramp&lt;/span&gt;
    &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;overlap&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ramp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:]&lt;/span&gt;   &lt;span class="c1"&gt;# outer product -&amp;gt; 2D
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;blend_tile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;canvas&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;win&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
    &lt;span class="n"&gt;canvas&lt;/span&gt;&lt;span class="p"&gt;[...,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;tile&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;win&lt;/span&gt;
    &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;[...,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;win&lt;/span&gt;
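    &lt;span class="c1"&gt;# caller does canvas / weight.clamp_min(1e-8) at the end;
&lt;/span&gt;    &lt;span class="c1"&gt;# the ramp is exactly 0 on a tile's outermost pixels, so pad the image
&lt;/span&gt;    &lt;span class="c1"&gt;# (or skip the ramp on border tiles) to avoid a zero-weight frame
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;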

&lt;/div&gt;



&lt;p&gt;After switching to this, the seam difference dropped from 4.1 to 1.0, statistically indistinguishable from the 0.9 we measured on a non-seam column. Same model weights. Same GPU. Just honest about where each tile's information ends.&lt;/p&gt;
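
&lt;p&gt;The difference between the two blends is easiest to see in one dimension. The sketch below is a toy: it assumes two tiles that disagree by a constant 4 grey levels, and it uses a raised-cosine ramp offset by one sample so the weight never reaches exactly zero, which is a small departure from the torch.hann_window slice above. It also skips the ramp on sides that have no neighbouring tile. All function names are illustrative.&lt;/p&gt;

```python
import math

def ramp(overlap):
    """Raised-cosine ramp of length `overlap`, strictly between 0 and 1."""
    return [0.5 * (1 - math.cos(math.pi * (i + 1) / (overlap + 1)))
            for i in range(overlap)]

def window(size, overlap, fade_in, fade_out):
    """1D weight: ramp only on the sides that overlap a neighbouring tile."""
    w = [1.0] * size
    r = ramp(overlap)
    if fade_in:
        w[:overlap] = r
    if fade_out:
        w[size - overlap:] = r[::-1]
    return w

def stitch(tiles, size, overlap, windows):
    """Accumulate weighted tiles, then divide by the summed weight."""
    step = size - overlap
    length = step * (len(tiles) - 1) + size
    canvas = [0.0] * length
    weight = [0.0] * length
    for k, (tile, win) in enumerate(zip(tiles, windows)):
        for i in range(size):
            canvas[k * step + i] += tile[i] * win[i]
            weight[k * step + i] += win[i]
    return [c / w for c, w in zip(canvas, weight)]

def max_jump(signal):
    """Largest difference between neighbouring samples."""
    return max(abs(b - a) for a, b in zip(signal, signal[1:]))

size, overlap = 8, 4
tiles = [[100.0] * size, [104.0] * size]   # two tiles that disagree by 4 levels

flat = stitch(tiles, size, overlap, [[1.0] * size] * 2)
hann = stitch(tiles, size, overlap,
              [window(size, overlap, False, True),
               window(size, overlap, True, False)])

print(max_jump(flat))   # 2.0: the step is split into two smaller steps
print(max_jump(hann))   # smaller: the transition is spread over the overlap
```

&lt;p&gt;With flat weights the 4-level disagreement becomes two 2-level steps, one at each end of the overlap. With the ramp, the same disagreement is spread across the overlap, so no single pair of neighbouring samples carries a step that large.&lt;/p&gt;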

&lt;h2&gt;
  
  
  Catching it before customers do
&lt;/h2&gt;

&lt;p&gt;The annoying part was that nobody noticed the seam for a while because our eval set was mostly 1024px crops that never tiled. The artifact only existed at the resolution we did not test.&lt;/p&gt;

&lt;p&gt;So we built a regression check on full-size output. For each render we compute the per-column mean absolute gradient and flag any column whose value spikes above its neighbors by more than 3x at a known tile boundary. Cheap, deterministic, runs on CPU.&lt;/p&gt;
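
&lt;p&gt;A minimal sketch of that check, in plain Python so it runs anywhere. The 3x ratio comes from the description above; the neighbourhood radius, the noise floor, and the function names are assumptions made for illustration:&lt;/p&gt;

```python
from statistics import mean

def column_gradient(image):
    """Mean absolute horizontal gradient for each pair of adjacent columns.

    Entry c describes the step from column c to column c + 1.
    """
    width = len(image[0])
    return [mean(abs(row[c + 1] - row[c]) for row in image)
            for c in range(width - 1)]

def flag_seams(image, boundaries, ratio=3.0, radius=2, noise_floor=0.1):
    """Return the tile boundaries whose gradient spikes above their neighbours.

    `boundaries` lists the first column of each tile after the leftmost one.
    `radius` and `noise_floor` (in grey levels) are assumed knobs; the floor
    keeps a perfectly flat background from turning any tiny step into a spike.
    """
    grad = column_gradient(image)
    flagged = []
    for b in boundaries:
        seam = grad[b - 1]                       # step from column b - 1 to b
        lo, hi = max(0, b - 1 - radius), min(len(grad), b + radius)
        neighbours = [grad[c] for c in range(lo, hi) if c != b - 1]
        if seam > ratio * max(mean(neighbours), noise_floor):
            flagged.append(b)
    return flagged

# Toy renders: a gentle ramp, with and without a 5-level jump at column 6.
clean = [[c for c in range(12)] for _ in range(4)]
seamed = [[c + (5 if c >= 6 else 0) for c in range(12)] for _ in range(4)]

print(flag_seams(clean, [6]))
print(flag_seams(seamed, [6]))
```

&lt;p&gt;Horizontal stitch lines can be checked the same way by transposing the image first.&lt;/p&gt;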

&lt;p&gt;For the fuzzier cases (texture seams, slight color drift) we run a vision-language model over a sample of outputs and ask it to describe any visible discontinuity. Those calls go through a gateway, Bifrost, which is one of a few ways we keep provider config and rate limits in one place instead of scattered across scripts. The numeric check catches the obvious ones; the VLM catches the ones a metric misses.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Seam MAD (8-bit)&lt;/th&gt;
&lt;th&gt;VRAM (4K)&lt;/th&gt;
&lt;th&gt;Extra compute&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Single pass&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;~31 GB (OOM on 24GB)&lt;/td&gt;
&lt;td&gt;baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hard tiles, no overlap&lt;/td&gt;
&lt;td&gt;4.1&lt;/td&gt;
&lt;td&gt;6 GB&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Overlap + flat average&lt;/td&gt;
&lt;td&gt;2.3&lt;/td&gt;
&lt;td&gt;7 GB&lt;/td&gt;
&lt;td&gt;+14%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Overlap + Hann window&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;7 GB&lt;/td&gt;
&lt;td&gt;+16%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Trade-offs and Limitations
&lt;/h2&gt;

&lt;p&gt;Overlap is not free. With a 64px overlap, each 512px tile advances only 448px, so more tiles are needed to cover the same image. In our runs that came to roughly 14-16% extra compute (see the comparison table), and throughput drops by about that much. Wider overlap blends better and costs more, and past ~96px we saw no further quality gain, only the bill.&lt;/p&gt;

&lt;p&gt;Hann windowing assumes the two predictions in the overlap are both reasonable and close. They usually are for this upscaler. For a diffusion model with stochastic sampling per tile they can diverge enough that blending produces a ghost, and you need a shared noise seed or latent-space tiling instead.&lt;/p&gt;

&lt;p&gt;This also does nothing for semantic seams, where two tiles hallucinate different details. Window blending fixes geometry and color continuity, not content disagreement. That is a harder problem and the honest answer is you tile in latent space or you do not tile at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/1809.00219" rel="noopener noreferrer"&gt;ESRGAN: Enhanced Super-Resolution GANs&lt;/a&gt; — the architecture family this upscaler comes from&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2302.08113" rel="noopener noreferrer"&gt;MultiDiffusion&lt;/a&gt; — fusing overlapping diffusion paths, the latent-space version of this idea&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2302.02412" rel="noopener noreferrer"&gt;Mixture of Diffusers&lt;/a&gt; — region-based blending for tiled generation&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pytorch.org/docs/stable/generated/torch.hann_window.html" rel="noopener noreferrer"&gt;PyTorch torch.hann_window docs&lt;/a&gt; — the window function used above&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost AI gateway&lt;/a&gt; — the gateway we route eval-time VLM calls through&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>mlops</category>
      <category>computervision</category>
      <category>pytorch</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
