<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Deneesh Narayanasamy</title>
    <description>The latest articles on DEV Community by Deneesh Narayanasamy (@deneesh_narayanasamy).</description>
    <link>https://dev.to/deneesh_narayanasamy</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F466999%2F00006e25-9ba0-4632-bf74-5acc27875191.jpeg</url>
      <title>DEV Community: Deneesh Narayanasamy</title>
      <link>https://dev.to/deneesh_narayanasamy</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/deneesh_narayanasamy"/>
    <language>en</language>
    <item>
      <title>LiteLLM Proxy: The Open-Source Alternative for Multi-Provider LLM Failover and Load Balancing</title>
      <dc:creator>Deneesh Narayanasamy</dc:creator>
      <pubDate>Tue, 07 Apr 2026 05:29:00 +0000</pubDate>
      <link>https://dev.to/deneesh_narayanasamy/litellm-proxy-the-open-source-alternative-for-multi-provider-llm-failover-and-load-balancing-54fn</link>
      <guid>https://dev.to/deneesh_narayanasamy/litellm-proxy-the-open-source-alternative-for-multi-provider-llm-failover-and-load-balancing-54fn</guid>
      <description>&lt;h2&gt;
  
  
  Introduction: What If You Could Use ANY LLM Provider?
&lt;/h2&gt;

&lt;p&gt;In my &lt;a href="https://dzone.com/articles/building-resilient-ai-services-implementing-multi" rel="noopener noreferrer"&gt;previous article&lt;/a&gt;, I walked through building a multi-region failover architecture for Azure OpenAI using Azure Front Door and APIM. It works brilliantly - but it's also Azure-specific, requires significant infrastructure, and locks you into a single provider ecosystem.&lt;/p&gt;

&lt;p&gt;What if you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-provider failover&lt;/strong&gt; (&lt;a href="https://azure.microsoft.com/en-us/products/ai-services/openai-service" rel="noopener noreferrer"&gt;Azure OpenAI&lt;/a&gt; -&amp;gt; &lt;a href="https://openai.com/" rel="noopener noreferrer"&gt;OpenAI&lt;/a&gt; -&amp;gt; &lt;a href="https://www.anthropic.com/" rel="noopener noreferrer"&gt;Anthropic&lt;/a&gt; -&amp;gt; &lt;a href="https://deepmind.google/technologies/gemini/" rel="noopener noreferrer"&gt;Gemini&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A simpler deployment&lt;/strong&gt; without managing APIM policies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provider-agnostic architecture&lt;/strong&gt; that works anywhere&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open-source flexibility&lt;/strong&gt; with no vendor lock-in&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Enter &lt;strong&gt;LiteLLM Proxy&lt;/strong&gt; - an open-source unified gateway that gives you all of this out of the box.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is LiteLLM Proxy?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/BerriAI/litellm" rel="noopener noreferrer"&gt;LiteLLM&lt;/a&gt; is an open-source Python library and proxy server that provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unified API:&lt;/strong&gt; One OpenAI-compatible endpoint for 100+ LLM providers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built-in Load Balancing:&lt;/strong&gt; Distribute requests across multiple deployments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic Failover:&lt;/strong&gt; Seamlessly retry on different models/providers when one fails&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate Limit Handling:&lt;/strong&gt; Intelligent retry with &lt;a href="https://en.wikipedia.org/wiki/Exponential_backoff" rel="noopener noreferrer"&gt;exponential backoff&lt;/a&gt; for &lt;a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429" rel="noopener noreferrer"&gt;429 errors&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost Tracking:&lt;/strong&gt; Monitor spend across all providers in one place&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming Support:&lt;/strong&gt; Full &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events" rel="noopener noreferrer"&gt;SSE (Server-Sent Events)&lt;/a&gt; support with proper failover&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The beauty? Your application code doesn't change. You point your OpenAI SDK at LiteLLM Proxy, and it handles the rest.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture: LiteLLM Proxy vs Azure APIM
&lt;/h2&gt;

&lt;p&gt;Here's how LiteLLM Proxy compares to the Azure-native approach:&lt;/p&gt;

&lt;h3&gt;
  
  
  Azure APIM Architecture (Previous Article)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client -&amp;gt; Azure Front Door -&amp;gt; Regional APIM -&amp;gt; Azure OpenAI (Primary)
                                         -&amp;gt; Azure OpenAI (Secondary)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; Native Azure integration, enterprise compliance, WAF protection&lt;br&gt;
&lt;strong&gt;Cons:&lt;/strong&gt; Azure-only, complex policies, expensive at scale&lt;/p&gt;
&lt;h3&gt;
  
  
  LiteLLM Proxy Architecture
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client -&amp;gt; Load Balancer -&amp;gt; LiteLLM Proxy -&amp;gt; Azure OpenAI
                                       -&amp;gt; OpenAI Direct
                                       -&amp;gt; Anthropic Claude
                                       -&amp;gt; Google Gemini
                                       -&amp;gt; AWS Bedrock
                                       -&amp;gt; Any LLM Provider
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Supported Providers:&lt;/strong&gt; &lt;a href="https://azure.microsoft.com/en-us/products/ai-services/openai-service" rel="noopener noreferrer"&gt;Azure OpenAI&lt;/a&gt;, &lt;a href="https://openai.com/" rel="noopener noreferrer"&gt;OpenAI&lt;/a&gt;, &lt;a href="https://www.anthropic.com/" rel="noopener noreferrer"&gt;Anthropic Claude&lt;/a&gt;, &lt;a href="https://deepmind.google/technologies/gemini/" rel="noopener noreferrer"&gt;Google Gemini&lt;/a&gt;, &lt;a href="https://aws.amazon.com/bedrock/" rel="noopener noreferrer"&gt;AWS Bedrock&lt;/a&gt;, and &lt;a href="https://docs.litellm.ai/docs/providers" rel="noopener noreferrer"&gt;100+ more&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; Provider-agnostic, simple configuration, open-source, runs anywhere&lt;br&gt;
&lt;strong&gt;Cons:&lt;/strong&gt; Self-managed infrastructure, requires &lt;a href="https://www.docker.com/resources/what-container/" rel="noopener noreferrer"&gt;containerization&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Getting Started: 5-Minute Setup
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Option 1: &lt;a href="https://www.docker.com/" rel="noopener noreferrer"&gt;Docker&lt;/a&gt; (Recommended for Production)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Pull the official image&lt;/span&gt;
docker pull ghcr.io/berriai/litellm:main-latest

&lt;span class="c"&gt;# Run with your config&lt;/span&gt;
docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; litellm-proxy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 4000:4000 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;/litellm_config.yaml:/app/config.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;AZURE_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-azure-key"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-openai-key"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-anthropic-key"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  ghcr.io/berriai/litellm:main-latest &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--config&lt;/span&gt; /app/config.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Option 2: Python (Quick Testing)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s1"&gt;'litellm[proxy]'&lt;/span&gt;
litellm &lt;span class="nt"&gt;--config&lt;/span&gt; litellm_config.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  The Configuration File
&lt;/h3&gt;

&lt;p&gt;Create &lt;code&gt;litellm_config.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;model_list&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# Primary: Azure OpenAI GPT-4o (West US)&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpt-4o&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;azure/gpt-4o&lt;/span&gt;
      &lt;span class="na"&gt;api_base&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://westus-primary.openai.azure.com/&lt;/span&gt;
      &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;os.environ/AZURE_API_KEY&lt;/span&gt;
      &lt;span class="na"&gt;api_version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2024-08-01-preview"&lt;/span&gt;
    &lt;span class="na"&gt;model_info&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;azure-westus-gpt4o&lt;/span&gt;

  &lt;span class="c1"&gt;# Failover 1: Azure OpenAI GPT-4o (East US)&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpt-4o&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;azure/gpt-4o&lt;/span&gt;
      &lt;span class="na"&gt;api_base&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://eastus-secondary.openai.azure.com/&lt;/span&gt;
      &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;os.environ/AZURE_API_KEY_SECONDARY&lt;/span&gt;
      &lt;span class="na"&gt;api_version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2024-08-01-preview"&lt;/span&gt;
    &lt;span class="na"&gt;model_info&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;azure-eastus-gpt4o&lt;/span&gt;

  &lt;span class="c1"&gt;# Failover 2: OpenAI Direct&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpt-4o&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpt-4o&lt;/span&gt;
      &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;os.environ/OPENAI_API_KEY&lt;/span&gt;
    &lt;span class="na"&gt;model_info&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai-direct-gpt4o&lt;/span&gt;

  &lt;span class="c1"&gt;# Failover 3: Anthropic Claude (ultimate backup)&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpt-4o&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;anthropic/claude-3-5-sonnet-20241022&lt;/span&gt;
      &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;os.environ/ANTHROPIC_API_KEY&lt;/span&gt;
    &lt;span class="na"&gt;model_info&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;anthropic-claude-sonnet&lt;/span&gt;

&lt;span class="na"&gt;litellm_settings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# Enable automatic failover&lt;/span&gt;
  &lt;span class="na"&gt;num_retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
  &lt;span class="na"&gt;retry_after&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;

  &lt;span class="c1"&gt;# Fallback configuration&lt;/span&gt;
  &lt;span class="na"&gt;fallbacks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;gpt-4o&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;gpt-4o&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Retry across all gpt-4o deployments&lt;/span&gt;

  &lt;span class="c1"&gt;# Request timeout&lt;/span&gt;
  &lt;span class="na"&gt;request_timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;120&lt;/span&gt;

  &lt;span class="c1"&gt;# Enable streaming&lt;/span&gt;
  &lt;span class="na"&gt;stream&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="na"&gt;router_settings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# Load balancing strategy&lt;/span&gt;
  &lt;span class="na"&gt;routing_strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;least-busy&lt;/span&gt;

  &lt;span class="c1"&gt;# Enable rate limit awareness&lt;/span&gt;
  &lt;span class="na"&gt;enable_pre_call_checks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

  &lt;span class="c1"&gt;# Cooldown failed deployments&lt;/span&gt;
  &lt;span class="na"&gt;cooldown_time&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;

  &lt;span class="c1"&gt;# Number of retries per deployment&lt;/span&gt;
  &lt;span class="na"&gt;num_retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;

  &lt;span class="c1"&gt;# Retry on these status codes&lt;/span&gt;
  &lt;span class="na"&gt;retry_after&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
  &lt;span class="na"&gt;allowed_fails&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;

&lt;span class="na"&gt;general_settings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# Master key for proxy authentication&lt;/span&gt;
  &lt;span class="na"&gt;master_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;os.environ/LITELLM_MASTER_KEY&lt;/span&gt;

  &lt;span class="c1"&gt;# Database for tracking (optional)&lt;/span&gt;
  &lt;span class="na"&gt;database_url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;os.environ/DATABASE_URL&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
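
&lt;p&gt;With the proxy running, a quick smoke test confirms the config loads and routing works. This sketch assumes the proxy is listening on localhost:4000 and that &lt;code&gt;LITELLM_MASTER_KEY&lt;/code&gt; is exported in your shell; adjust both to your setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal smoke test against a locally running LiteLLM Proxy.
# Assumes the proxy listens on http://localhost:4000 and that
# LITELLM_MASTER_KEY is set in the environment.
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["LITELLM_MASTER_KEY"],  # master key (or a virtual key)
    base_url="http://localhost:4000",
)

# "gpt-4o" is the model_name group from litellm_config.yaml;
# the proxy picks one of the underlying deployments.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "ping"}],
)
print(response.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;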






&lt;h2&gt;
  
  
  The Magic: How Failover Actually Works
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Automatic 429 Handling
&lt;/h3&gt;

&lt;p&gt;When Azure OpenAI returns a 429 (rate limit), LiteLLM automatically:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Reads the &lt;code&gt;Retry-After&lt;/code&gt; header&lt;/li&gt;
&lt;li&gt;Marks that deployment as "cooling down"&lt;/li&gt;
&lt;li&gt;Routes the request to the next available deployment&lt;/li&gt;
&lt;li&gt;Continues until it receives a successful response or all deployments are exhausted
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Your code stays simple - LiteLLM handles everything
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-litellm-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:4000&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# Point to LiteLLM Proxy
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# This request automatically fails over if needed
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# LiteLLM routes to best available
&lt;/span&gt;    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Load Balancing Strategies
&lt;/h3&gt;

&lt;p&gt;LiteLLM supports multiple routing strategies:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;simple-shuffle&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Random selection&lt;/td&gt;
&lt;td&gt;Even distribution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;least-busy&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Route to deployment with fewest active requests&lt;/td&gt;
&lt;td&gt;High throughput&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;latency-based-routing&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Route to fastest responding deployment&lt;/td&gt;
&lt;td&gt;Latency-sensitive apps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cost-based-routing&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Route to cheapest available option&lt;/td&gt;
&lt;td&gt;Cost optimization&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Configure in your YAML:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;router_settings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;routing_strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;latency-based-routing&lt;/span&gt;

  &lt;span class="c1"&gt;# For latency-based routing, set expected latencies&lt;/span&gt;
  &lt;span class="na"&gt;model_group_alias&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;gpt-4o&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;azure/gpt-4o&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.7&lt;/span&gt;  &lt;span class="c1"&gt;# 70% of traffic&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai/gpt-4o&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.3&lt;/span&gt;  &lt;span class="c1"&gt;# 30% of traffic&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
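
&lt;p&gt;The same strategies are also available if you embed LiteLLM directly in Python through its Router class. This is a rough sketch based on the LiteLLM docs rather than the proxy path shown above; parameter names can shift between versions, so verify against the Router documentation for the version you install:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: using LiteLLM's Router in-process instead of the proxy.
# The model_list mirrors litellm_config.yaml; strategy names follow
# the LiteLLM docs (double-check against your installed version).
import os

from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "gpt-4o",
            "litellm_params": {
                "model": "azure/gpt-4o",
                "api_base": "https://westus-primary.openai.azure.com/",
                "api_key": os.environ["AZURE_API_KEY"],
                "api_version": "2024-08-01-preview",
            },
        },
        {
            "model_name": "gpt-4o",
            "litellm_params": {
                "model": "gpt-4o",
                "api_key": os.environ["OPENAI_API_KEY"],
            },
        },
    ],
    routing_strategy="latency-based-routing",
)

response = router.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello from the Router"}],
)
print(response.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;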






&lt;h2&gt;
  
  
  Streaming Support: It Just Works
&lt;/h2&gt;

&lt;p&gt;Unlike the Azure APIM approach where streaming requires special handling, LiteLLM Proxy handles &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events" rel="noopener noreferrer"&gt;SSE (Server-Sent Events)&lt;/a&gt; natively:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-litellm-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:4000&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Streaming works exactly like direct OpenAI
&lt;/span&gt;&lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write a poem about resilience&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the primary provider fails mid-stream, LiteLLM will:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Detect the connection failure&lt;/li&gt;
&lt;li&gt;Automatically retry on the next provider&lt;/li&gt;
&lt;li&gt;Return an error only if all providers fail&lt;/li&gt;
&lt;/ol&gt;
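
&lt;p&gt;In practice that means your client only sees an exception once every deployment has been tried, so a defensive streaming loop can stay small. A minimal sketch (the fallback message is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Defensive streaming: LiteLLM retries other providers behind the scenes,
# so the client only handles the case where every one of them failed.
from openai import APIError, OpenAI

client = OpenAI(api_key="your-litellm-key", base_url="http://localhost:4000")

try:
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Write a poem about resilience"}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
except APIError as exc:
    # Reached only when every configured deployment has failed.
    print(f"\nAll providers exhausted: {exc}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;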




&lt;h2&gt;
  
  
  Production Configuration: Enterprise-Ready Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  High Availability Deployment
&lt;/h3&gt;

&lt;p&gt;For production, deploy multiple LiteLLM instances behind a load balancer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# docker-compose.yml&lt;/span&gt;
&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;3.8'&lt;/span&gt;

&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;litellm-1&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io/berriai/litellm:main-latest&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4001:4000"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./litellm_config.yaml:/app/config.yaml&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;AZURE_API_KEY=${AZURE_API_KEY}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;OPENAI_API_KEY=${OPENAI_API_KEY}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DATABASE_URL=${DATABASE_URL}&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;--config /app/config.yaml&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;always&lt;/span&gt;
    &lt;span class="na"&gt;healthcheck&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CMD"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;curl"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-f"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:4000/health"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30s&lt;/span&gt;
      &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10s&lt;/span&gt;
      &lt;span class="na"&gt;retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;

  &lt;span class="na"&gt;litellm-2&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io/berriai/litellm:main-latest&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4002:4000"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./litellm_config.yaml:/app/config.yaml&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;AZURE_API_KEY=${AZURE_API_KEY}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;OPENAI_API_KEY=${OPENAI_API_KEY}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DATABASE_URL=${DATABASE_URL}&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;--config /app/config.yaml&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;always&lt;/span&gt;
    &lt;span class="na"&gt;healthcheck&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CMD"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;curl"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-f"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:4000/health"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30s&lt;/span&gt;
      &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10s&lt;/span&gt;
      &lt;span class="na"&gt;retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;

  &lt;span class="na"&gt;nginx&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx:alpine&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4000:80"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./nginx.conf:/etc/nginx/nginx.conf&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;litellm-1&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;litellm-2&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;always&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;a href="https://nginx.org/" rel="noopener noreferrer"&gt;Nginx&lt;/a&gt; Load Balancer Configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="c1"&gt;# nginx.conf&lt;/span&gt;
&lt;span class="k"&gt;events&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;worker_connections&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;http&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;upstream&lt;/span&gt; &lt;span class="s"&gt;litellm&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kn"&gt;least_conn&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;server&lt;/span&gt; &lt;span class="nf"&gt;litellm-1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;4000&lt;/span&gt; &lt;span class="s"&gt;weight=1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;server&lt;/span&gt; &lt;span class="nf"&gt;litellm-2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;4000&lt;/span&gt; &lt;span class="s"&gt;weight=1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kn"&gt;server&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kn"&gt;listen&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="kn"&gt;location&lt;/span&gt; &lt;span class="n"&gt;/&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="kn"&gt;proxy_pass&lt;/span&gt; &lt;span class="s"&gt;http://litellm&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="kn"&gt;proxy_http_version&lt;/span&gt; &lt;span class="mf"&gt;1.1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="kn"&gt;proxy_set_header&lt;/span&gt; &lt;span class="s"&gt;Upgrade&lt;/span&gt; &lt;span class="nv"&gt;$http_upgrade&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="kn"&gt;proxy_set_header&lt;/span&gt; &lt;span class="s"&gt;Connection&lt;/span&gt; &lt;span class="s"&gt;"upgrade"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="kn"&gt;proxy_set_header&lt;/span&gt; &lt;span class="s"&gt;Host&lt;/span&gt; &lt;span class="nv"&gt;$host&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="kn"&gt;proxy_set_header&lt;/span&gt; &lt;span class="s"&gt;X-Real-IP&lt;/span&gt; &lt;span class="nv"&gt;$remote_addr&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="kn"&gt;proxy_read_timeout&lt;/span&gt; &lt;span class="s"&gt;300s&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="kn"&gt;proxy_buffering&lt;/span&gt; &lt;span class="no"&gt;off&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;# Important for streaming&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="kn"&gt;location&lt;/span&gt; &lt;span class="n"&gt;/health&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="kn"&gt;proxy_pass&lt;/span&gt; &lt;span class="s"&gt;http://litellm&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="kn"&gt;proxy_connect_timeout&lt;/span&gt; &lt;span class="s"&gt;5s&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="kn"&gt;proxy_read_timeout&lt;/span&gt; &lt;span class="s"&gt;5s&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Advanced Features
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Budget &amp;amp; Rate Limiting
&lt;/h3&gt;

&lt;p&gt;Control spending and prevent runaway costs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;general_settings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;master_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sk-your-master-key&lt;/span&gt;

&lt;span class="c1"&gt;# User-level budgets&lt;/span&gt;
&lt;span class="na"&gt;litellm_settings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;max_budget&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100.00&lt;/span&gt;  &lt;span class="c1"&gt;# $100 max per user&lt;/span&gt;
  &lt;span class="na"&gt;budget_duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;monthly&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create users with specific limits:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="s1"&gt;'http://localhost:4000/user/new'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Authorization: Bearer sk-your-master-key'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "user_id": "user-123",
    "max_budget": 50.00,
    "budget_duration": "monthly",
    "models": ["gpt-4o", "gpt-3.5-turbo"]
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
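
&lt;p&gt;If the call succeeds, the proxy issues a virtual key for that user (the exact response shape depends on your LiteLLM version, so treat this as a sketch). That key then replaces the master key in the application, and the proxy enforces the $50 monthly budget and the model allow-list on every request made with it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: calls made with a per-user virtual key are metered against
# that user's budget and restricted to their allowed models.
from openai import OpenAI

client = OpenAI(
    api_key="sk-user-123-virtual-key",  # placeholder: the key returned for user-123
    base_url="http://localhost:4000",
)

response = client.chat.completions.create(
    model="gpt-4o",  # must appear in the user's "models" list
    messages=[{"role": "user", "content": "Summarize my open tickets"}],
)
print(response.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;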



&lt;h3&gt;
  
  
  2. Request Caching
&lt;/h3&gt;

&lt;p&gt;Reduce costs and latency with semantic caching using &lt;a href="https://redis.io/" rel="noopener noreferrer"&gt;Redis&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;litellm_settings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;cache&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;cache_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;redis&lt;/span&gt;
    &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;localhost&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;6379&lt;/span&gt;
    &lt;span class="na"&gt;ttl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3600&lt;/span&gt;  &lt;span class="c1"&gt;# 1 hour cache&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
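
&lt;p&gt;A quick way to sanity-check the cache is to send the same request twice and compare wall-clock time; if Redis is wired up correctly, the second call should come back in a fraction of the first. A rough sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Rough cache check: an identical prompt should hit the Redis cache
# on the second call and return noticeably faster.
import time

from openai import OpenAI

client = OpenAI(api_key="your-litellm-key", base_url="http://localhost:4000")

def timed_call():
    start = time.perf_counter()
    client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "What is exponential backoff?"}],
    )
    return time.perf_counter() - start

print(f"first call:  {timed_call():.2f}s")   # goes to the provider
print(f"second call: {timed_call():.2f}s")   # should be served from cache
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;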



&lt;h3&gt;
  
  
  3. Custom Callbacks &amp;amp; Logging
&lt;/h3&gt;

&lt;p&gt;Track every request for observability:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;litellm_settings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;success_callback&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;langfuse"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prometheus"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Langfuse &amp;amp; Prometheus integrations&lt;/span&gt;
  &lt;span class="na"&gt;failure_callback&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;langfuse"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;slack"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

  &lt;span class="c1"&gt;# Langfuse integration&lt;/span&gt;
  &lt;span class="na"&gt;langfuse_public_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;os.environ/LANGFUSE_PUBLIC_KEY&lt;/span&gt;
  &lt;span class="na"&gt;langfuse_secret_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;os.environ/LANGFUSE_SECRET_KEY&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Guardrails &amp;amp; Content Moderation
&lt;/h3&gt;

&lt;p&gt;Add safety layers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;litellm_settings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;guardrails&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;guardrail_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content-filter"&lt;/span&gt;
      &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;guardrail&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai_moderation&lt;/span&gt;
        &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pre_call&lt;/span&gt;  &lt;span class="c1"&gt;# Check before sending to LLM&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Comparing Results: LiteLLM vs Azure APIM
&lt;/h2&gt;

&lt;p&gt;I ran the same load test from my Azure article against both architectures:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Azure APIM&lt;/th&gt;
&lt;th&gt;LiteLLM Proxy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Success Rate&lt;/td&gt;
&lt;td&gt;99.4%&lt;/td&gt;
&lt;td&gt;99.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg Latency&lt;/td&gt;
&lt;td&gt;2,184ms&lt;/td&gt;
&lt;td&gt;1,892ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P95 Latency&lt;/td&gt;
&lt;td&gt;4,128ms&lt;/td&gt;
&lt;td&gt;3,456ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Setup Time&lt;/td&gt;
&lt;td&gt;~4 hours&lt;/td&gt;
&lt;td&gt;~30 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monthly Cost&lt;/td&gt;
&lt;td&gt;~$500+&lt;/td&gt;
&lt;td&gt;~$50 (compute only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Provider Lock-in&lt;/td&gt;
&lt;td&gt;Azure only&lt;/td&gt;
&lt;td&gt;Any provider&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Key observations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LiteLLM showed slightly better latency thanks to its simpler request pipeline&lt;/li&gt;
&lt;li&gt;Both achieved similar reliability with proper configuration&lt;/li&gt;
&lt;li&gt;LiteLLM's multi-provider fallback provided an extra safety net&lt;/li&gt;
&lt;li&gt;Cost difference is significant for smaller teams&lt;/li&gt;
&lt;/ul&gt;
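
&lt;p&gt;If you want to run a similar comparison yourself, here's a rough outline of the kind of harness involved. It's illustrative only (the endpoint, key, and request count are placeholders, not the scripts from the previous article) and it measures end-to-end latency from the client's point of view:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal load-test sketch: fire N concurrent chat requests at a gateway
# and report success rate, average, and p95 latency. Placeholders throughout.
import asyncio
import statistics
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(api_key="your-gateway-key", base_url="http://localhost:4000")

async def one_request():
    start = time.perf_counter()
    try:
        await client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": "Say hello in five words."}],
        )
        return time.perf_counter() - start, True
    except Exception:
        return time.perf_counter() - start, False

async def main(total=200):
    results = await asyncio.gather(*(one_request() for _ in range(total)))
    latencies = sorted(r[0] for r in results if r[1])
    ok = len(latencies)
    print(f"success rate: {ok / total:.1%}")
    print(f"avg latency:  {statistics.mean(latencies) * 1000:.0f} ms")
    print(f"p95 latency:  {latencies[int(0.95 * ok) - 1] * 1000:.0f} ms")

asyncio.run(main())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;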




&lt;h2&gt;
  
  
  When to Use Which?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Choose Azure APIM + Front Door When:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;You're &lt;strong&gt;all-in on Azure&lt;/strong&gt; and need native integration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise compliance&lt;/strong&gt; requirements mandate Azure services&lt;/li&gt;
&lt;li&gt;You need &lt;strong&gt;WAF/DDoS protection&lt;/strong&gt; at the edge&lt;/li&gt;
&lt;li&gt;Your organization has existing APIM expertise&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit logging&lt;/strong&gt; must stay within Azure ecosystem&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Choose LiteLLM Proxy When:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;You need &lt;strong&gt;multi-provider failover&lt;/strong&gt; (not just multi-region)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost optimization&lt;/strong&gt; is a priority&lt;/li&gt;
&lt;li&gt;You want &lt;strong&gt;provider flexibility&lt;/strong&gt; to switch easily&lt;/li&gt;
&lt;li&gt;Your team prefers &lt;strong&gt;simple YAML configuration&lt;/strong&gt; over XML policies&lt;/li&gt;
&lt;li&gt;You're running on &lt;strong&gt;&lt;a href="https://kubernetes.io/" rel="noopener noreferrer"&gt;Kubernetes&lt;/a&gt;, AWS, GCP, or on-prem&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;You need &lt;strong&gt;rapid prototyping&lt;/strong&gt; and iteration&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Production Checklist
&lt;/h2&gt;

&lt;p&gt;If you're deploying LiteLLM Proxy to production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] &lt;strong&gt;Deploy Multiple Instances:&lt;/strong&gt; At least 2 behind a load balancer&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Enable Health Checks:&lt;/strong&gt; Configure &lt;code&gt;/health&lt;/code&gt; endpoint monitoring&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Set Up Database:&lt;/strong&gt; &lt;a href="https://www.postgresql.org/" rel="noopener noreferrer"&gt;PostgreSQL&lt;/a&gt; for persistence and analytics&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Configure Caching:&lt;/strong&gt; &lt;a href="https://redis.io/" rel="noopener noreferrer"&gt;Redis&lt;/a&gt; for semantic caching&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Add Monitoring:&lt;/strong&gt; &lt;a href="https://prometheus.io/" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt; + &lt;a href="https://grafana.com/" rel="noopener noreferrer"&gt;Grafana&lt;/a&gt; or &lt;a href="https://langfuse.com/" rel="noopener noreferrer"&gt;Langfuse&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Set Budget Limits:&lt;/strong&gt; Prevent runaway costs&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Secure the Proxy:&lt;/strong&gt; Use master key authentication&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Enable TLS:&lt;/strong&gt; HTTPS in production (via nginx or cloud LB)&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Configure Alerts:&lt;/strong&gt; &lt;a href="https://slack.com/" rel="noopener noreferrer"&gt;Slack&lt;/a&gt;/&lt;a href="https://www.pagerduty.com/" rel="noopener noreferrer"&gt;PagerDuty&lt;/a&gt; for failures&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Test Failover:&lt;/strong&gt; Deliberately fail providers to verify behavior (see the sketch after this checklist)&lt;/li&gt;
&lt;/ul&gt;
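
&lt;p&gt;For the failover test, the simplest approach is to deliberately misconfigure the primary deployment in a staging copy of the config (for example, point its &lt;code&gt;api_base&lt;/code&gt; at an unreachable host) and confirm requests still complete through the remaining deployments. A sketch of the verification side:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Failover verification sketch: run this after breaking the primary
# deployment in a staging config (e.g. an unreachable api_base).
# Every probe should still succeed via the surviving deployments.
from openai import OpenAI

client = OpenAI(api_key="your-litellm-key", base_url="http://localhost:4000")

failures = 0
for i in range(20):
    try:
        client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": f"failover probe {i}"}],
        )
    except Exception:
        failures += 1

print(f"{failures} of 20 probes failed")
assert failures == 0, "failover did not cover the broken deployment"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;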




&lt;h2&gt;
  
  
  Conclusion: The Right Tool for the Job
&lt;/h2&gt;

&lt;p&gt;Both Azure APIM and LiteLLM Proxy solve the same fundamental problem - making LLM services reliable at scale. The choice depends on your constraints:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Azure APIM&lt;/strong&gt; is the enterprise choice when you're committed to Azure and need the full power of the platform's security and compliance features.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LiteLLM Proxy&lt;/strong&gt; is the pragmatic choice when you need flexibility, multi-provider support, or a simpler operational model.&lt;/p&gt;

&lt;p&gt;The best part? These aren't mutually exclusive. You can run LiteLLM Proxy &lt;em&gt;behind&lt;/em&gt; Azure Front Door to get the best of both worlds - enterprise edge security with flexible provider routing.&lt;/p&gt;

&lt;p&gt;📦 &lt;strong&gt;LiteLLM GitHub:&lt;/strong&gt; &lt;a href="https://github.com/BerriAI/litellm" rel="noopener noreferrer"&gt;github.com/BerriAI/litellm&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;📄 &lt;strong&gt;LiteLLM Docs:&lt;/strong&gt; &lt;a href="https://docs.litellm.ai/" rel="noopener noreferrer"&gt;docs.litellm.ai&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The days of single-provider dependency are over. Whether you choose managed Azure services or open-source flexibility, the key is building resilience into your AI infrastructure from day one. Your 3 AM self will thank you.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>openai</category>
      <category>devops</category>
    </item>
    <item>
      <title>Building Resilient AI Services: Implementing Multi-Region Failover for Azure OpenAI at Enterprise Scale</title>
      <dc:creator>Deneesh Narayanasamy</dc:creator>
      <pubDate>Fri, 27 Feb 2026 05:41:05 +0000</pubDate>
      <link>https://dev.to/deneesh_narayanasamy/building-resilient-ai-services-implementing-multi-region-failover-for-azure-openai-at-enterprise-cnd</link>
      <guid>https://dev.to/deneesh_narayanasamy/building-resilient-ai-services-implementing-multi-region-failover-for-azure-openai-at-enterprise-cnd</guid>
      <description>&lt;h2&gt;
  
  
  Introduction: When Your AI Service Goes Down at 3 AM
&lt;/h2&gt;

&lt;p&gt;Picture this: It's 3 AM on a Monday. Your enterprise AI application, the one powering customer support for millions of users, suddenly stops responding. &lt;a href="https://azure.microsoft.com/en-us/products/ai-foundry/models/openai/" rel="noopener noreferrer"&gt;Azure OpenAI&lt;/a&gt; in your primary region is experiencing an outage. Your phone explodes with alerts. Customer complaints flood in. Revenue is bleeding.&lt;/p&gt;

&lt;p&gt;This isn't a hypothetical scenario. It's a reality that every organization building on cloud AI services must prepare for. When you're running production AI workloads at scale, the question isn't &lt;em&gt;if&lt;/em&gt; you'll need failover—it's &lt;em&gt;when&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;In this article, I'll walk you through the exact architecture used to achieve &lt;strong&gt;99.95% uptime&lt;/strong&gt; for Azure OpenAI services serving millions of requests daily. You'll get the actual &lt;a href="https://learn.microsoft.com/en-us/azure/api-management/api-management-howto-policies" rel="noopener noreferrer"&gt;APIM policies&lt;/a&gt;, load testing scripts, and production readiness strategies.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: Why Azure OpenAI Needs Sophisticated Failover
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Reality of Cloud AI Services
&lt;/h3&gt;

&lt;p&gt;Azure OpenAI is remarkable, but it's still a cloud service with real-world constraints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Regional Quota Limits:&lt;/strong&gt; You can't just throw infinite traffic at a single endpoint. Azure enforces TPM (Tokens Per Minute) and RPM (Requests Per Minute) quotas per region.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rate Limiting (429 Errors):&lt;/strong&gt; When you hit quota limits, you get HTTP 429 (Too Many Requests) responses. These aren't service errors—they're expected behavior that you must handle gracefully (a minimal client-side sketch follows this list).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Regional Outages:&lt;/strong&gt; Azure regions can and do experience issues. In Q3 2024 alone, we saw multiple incidents affecting OpenAI availability in specific regions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Deployment Latency Variance:&lt;/strong&gt; A request to westus might take 200ms, while the same request to eastus takes 450ms. Geography matters.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
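
&lt;p&gt;Before getting into the architecture, here's what "handling 429s gracefully" looks like at its most basic on the client side: honor the &lt;code&gt;Retry-After&lt;/code&gt; header when present and back off exponentially otherwise. This is a generic sketch, not the production implementation (that work is pushed down into the APIM layer described below):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Generic client-side 429 handling: honor Retry-After when present,
# otherwise back off exponentially. Illustrative only; in this article
# the same logic lives in the APIM policy layer instead.
import time

import requests

def call_with_backoff(url, headers, payload, max_retries=5):
    for attempt in range(max_retries):
        resp = requests.post(url, headers=headers, json=payload, timeout=60)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        # Prefer the service's own hint, fall back to 2^attempt seconds.
        wait = float(resp.headers.get("Retry-After", 2 ** attempt))
        time.sleep(wait)
    raise RuntimeError("rate limited on every attempt")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;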

&lt;h3&gt;
  
  
  The Business Impact
&lt;/h3&gt;

&lt;p&gt;Let's talk numbers. For an enterprise AI application serving more than 1 million requests per day:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Uptime&lt;/th&gt;
&lt;th&gt;Downtime per Year&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;99%&lt;/td&gt;
&lt;td&gt;87.6 hours (3.65 days)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;99.9%&lt;/td&gt;
&lt;td&gt;8.76 hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;99.95%&lt;/td&gt;
&lt;td&gt;4.38 hours&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
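
&lt;p&gt;These figures follow directly from downtime = (1 - availability) &amp;times; 8,760 hours per year: 99% availability leaves room for roughly 87.6 hours of downtime, while 99.95% leaves only about 4.4 hours.&lt;/p&gt;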

&lt;p&gt;That difference between 99% and 99.95%? That's potentially millions in revenue, thousands of lost customers, and immeasurable damage to brand reputation.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture: A Multi-Layer Resilience Strategy
&lt;/h2&gt;

&lt;p&gt;Here's the complete architecture implemented to achieve high availability:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fggk09u8iax4us6xckp44.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fggk09u8iax4us6xckp44.jpg" alt="Multi-region Azure OpenAI architecture" width="800" height="893"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 1: Multi-region Azure OpenAI architecture with Azure Front Door, APIM, and regional OpenAI instances.&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Architecture Components
&lt;/h3&gt;

&lt;p&gt;📦 &lt;strong&gt;GitHub Repo:&lt;/strong&gt; &lt;a href="https://github.com/deneeshnarayanasamy/azure-openai-multi-region-failover/tree/main" rel="noopener noreferrer"&gt;azure-openai-multi-region-failover&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's break down each layer:&lt;/p&gt;
&lt;h4&gt;
  
  
  Layer 1: &lt;a href="https://learn.microsoft.com/en-us/azure/frontdoor/front-door-overview" rel="noopener noreferrer"&gt;Azure Front Door&lt;/a&gt; + &lt;a href="https://learn.microsoft.com/en-us/azure/web-application-firewall/afds/afds-overview" rel="noopener noreferrer"&gt;WAF&lt;/a&gt; (Global Entry Point)
&lt;/h4&gt;

&lt;p&gt;Azure Front Door serves as the global load balancer, providing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dzone.com/articles/distributed-denial-of-service-ddos-attacks-what-yo" rel="noopener noreferrer"&gt;DDoS protection&lt;/a&gt; and Web Application Firewall&lt;/li&gt;
&lt;li&gt;SSL/TLS termination at the edge&lt;/li&gt;
&lt;li&gt;Geographic routing to nearest APIM instance&lt;/li&gt;
&lt;li&gt;Health probing of backend APIM endpoints&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  Layer 2: &lt;a href="https://azure.microsoft.com/en-us/products/api-management" rel="noopener noreferrer"&gt;Azure API Management&lt;/a&gt; (Regional Intelligence)
&lt;/h4&gt;

&lt;p&gt;APIM instances deployed in multiple regions provide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API key management and authentication&lt;/li&gt;
&lt;li&gt;Rate limiting and throttling policies&lt;/li&gt;
&lt;li&gt;Intelligent failover logic (this is where the magic happens)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dzone.com/articles/application-telemetry-different-objectives" rel="noopener noreferrer"&gt;Telemetry&lt;/a&gt; and &lt;a href="https://dzone.com/articles/top-5-metrics-for-cloud-application-monitoring" rel="noopener noreferrer"&gt;monitoring&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why APIM and not just Front Door?&lt;/strong&gt; Because Front Door doesn't understand HTTP 429 responses. It can't distinguish between a true service failure and a rate limit. APIM gives us the intelligence to react appropriately.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h4&gt;
  
  
  Layer 3: &lt;a href="https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/create-resource?view=foundry-classic&amp;amp;pivots=web-portal" rel="noopener noreferrer"&gt;Azure OpenAI Resources&lt;/a&gt; (Regional Capacity)
&lt;/h4&gt;

&lt;p&gt;Deploy OpenAI resources across multiple Azure regions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Primary regions&lt;/strong&gt; (WestUS, SouthIndia, JapanEast, AustraliaEast) for normal traffic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secondary regions&lt;/strong&gt; (EastUS, CentralIndia, JapanWest, AustraliaSoutheast) as failover targets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;European regions&lt;/strong&gt; (SwedenCentral, SwitzerlandWest, GermanyWestCentral) for GDPR compliance&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  The Implementation: APIM Policy Magic
&lt;/h2&gt;

&lt;p&gt;Here's where things get interesting. The APIM policy is the brain of the failover system. Let me show you the actual policy that handles 429 responses and fails over seamlessly.&lt;/p&gt;

&lt;p&gt;📄 &lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/deneeshnarayanasamy/azure-openai-multi-region-failover/blob/main/policies/basic-failover-policy.xml" rel="noopener noreferrer"&gt;basic-failover-policy.xml&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  The Failover Policy (Complete Implementation)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;policies&amp;gt;
    &amp;lt;inbound&amp;gt;
        &amp;lt;!-- Store the original request path for failover --&amp;gt;
        &amp;lt;set-variable name="originalPath" value="@(context.Request.Url.Path)" /&amp;gt;

        &amp;lt;!-- Extract deployment name from path --&amp;gt;
        &amp;lt;set-variable name="deploymentName" 
                      value="@{
                          var path = context.Request.Url.Path;
                          var match = System.Text.RegularExpressions.Regex.Match(
                              path, 
                              @"/openai/deployments/([^/]+)/");
                          return match.Success ? match.Groups[1].Value : "";
                      }" /&amp;gt;

        &amp;lt;!-- Set primary backend --&amp;gt;
        &amp;lt;set-backend-service base-url="https://westus-primary.openai.azure.com/openai" /&amp;gt;

        &amp;lt;!-- Add request ID for tracing --&amp;gt;
        &amp;lt;set-header name="X-Request-ID" exists-action="override"&amp;gt;
            &amp;lt;value&amp;gt;@(Guid.NewGuid().ToString())&amp;lt;/value&amp;gt;
        &amp;lt;/set-header&amp;gt;

        &amp;lt;!-- Pass through API key (or transform as needed) --&amp;gt;
        &amp;lt;set-header name="api-key" exists-action="override"&amp;gt;
            &amp;lt;value&amp;gt;{{primary-openai-key}}&amp;lt;/value&amp;gt;
        &amp;lt;/set-header&amp;gt;
    &amp;lt;/inbound&amp;gt;

    &amp;lt;backend&amp;gt;
        &amp;lt;!-- Forward to backend --&amp;gt;
        &amp;lt;forward-request buffer-response="true" /&amp;gt;
    &amp;lt;/backend&amp;gt;

    &amp;lt;outbound&amp;gt;
        &amp;lt;!-- Check for rate limit response --&amp;gt;
        &amp;lt;choose&amp;gt;
            &amp;lt;when condition="@(context.Response.StatusCode == 429)"&amp;gt;
                &amp;lt;!-- Log the rate limit event --&amp;gt;
                &amp;lt;trace source="apim-failover"&amp;gt;
                    Primary backend returned 429 for request @(context.Request.Headers.GetValueOrDefault("X-Request-ID", ""))
                &amp;lt;/trace&amp;gt;

                &amp;lt;!-- Attempt failover to secondary region --&amp;gt;
                &amp;lt;send-request mode="new" response-variable-name="failoverResponse" 
                              timeout="120" ignore-error="false"&amp;gt;
                    &amp;lt;set-url&amp;gt;@{
                        var deployment = context.Variables.GetValueOrDefault&amp;lt;string&amp;gt;("deploymentName");
                        return $"https://eastus-secondary.openai.azure.com/openai/deployments/{deployment}/chat/completions?api-version=2024-08-01-preview";
                    }&amp;lt;/set-url&amp;gt;
                    &amp;lt;set-method&amp;gt;POST&amp;lt;/set-method&amp;gt;
                    &amp;lt;set-header name="Content-Type" exists-action="override"&amp;gt;
                        &amp;lt;value&amp;gt;application/json&amp;lt;/value&amp;gt;
                    &amp;lt;/set-header&amp;gt;
                    &amp;lt;set-header name="api-key" exists-action="override"&amp;gt;
                        &amp;lt;value&amp;gt;{{secondary-openai-key}}&amp;lt;/value&amp;gt;
                    &amp;lt;/set-header&amp;gt;
                    &amp;lt;set-header name="X-Failover-Attempt" exists-action="override"&amp;gt;
                        &amp;lt;value&amp;gt;true&amp;lt;/value&amp;gt;
                    &amp;lt;/set-header&amp;gt;
                    &amp;lt;set-body&amp;gt;@(context.Request.Body.As&amp;lt;string&amp;gt;(preserveContent: true))&amp;lt;/set-body&amp;gt;
                &amp;lt;/send-request&amp;gt;

                &amp;lt;!-- Return the failover response --&amp;gt;
                &amp;lt;return-response&amp;gt;
                    &amp;lt;set-status code="@(((IResponse)context.Variables["failoverResponse"]).StatusCode)" 
                                reason="@(((IResponse)context.Variables["failoverResponse"]).StatusReason)" /&amp;gt;
                    &amp;lt;set-header name="X-Served-By" exists-action="override"&amp;gt;
                        &amp;lt;value&amp;gt;secondary-region&amp;lt;/value&amp;gt;
                    &amp;lt;/set-header&amp;gt;
                    &amp;lt;set-header name="X-Failover" exists-action="override"&amp;gt;
                        &amp;lt;value&amp;gt;true&amp;lt;/value&amp;gt;
                    &amp;lt;/set-header&amp;gt;
                    &amp;lt;set-body&amp;gt;@(((IResponse)context.Variables["failoverResponse"]).Body.As&amp;lt;string&amp;gt;())&amp;lt;/set-body&amp;gt;
                &amp;lt;/return-response&amp;gt;
            &amp;lt;/when&amp;gt;
            &amp;lt;when condition="@(context.Response.StatusCode &amp;gt;= 500)"&amp;gt;
                &amp;lt;!-- Handle 5xx errors similarly --&amp;gt;
                &amp;lt;trace source="apim-failover"&amp;gt;
                    Primary backend returned @(context.Response.StatusCode) - attempting failover
                &amp;lt;/trace&amp;gt;
                &amp;lt;!-- Same failover logic as above --&amp;gt;
            &amp;lt;/when&amp;gt;
        &amp;lt;/choose&amp;gt;

        &amp;lt;!-- Add header indicating which backend served the request --&amp;gt;
        &amp;lt;set-header name="X-Served-By" exists-action="skip"&amp;gt;
            &amp;lt;value&amp;gt;primary-region&amp;lt;/value&amp;gt;
        &amp;lt;/set-header&amp;gt;
    &amp;lt;/outbound&amp;gt;

    &amp;lt;on-error&amp;gt;
        &amp;lt;!-- Log errors --&amp;gt;
        &amp;lt;trace source="apim-failover-error"&amp;gt;
            Error occurred: @(context.LastError.Message)
        &amp;lt;/trace&amp;gt;
    &amp;lt;/on-error&amp;gt;
&amp;lt;/policies&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Key Policy Features Explained
&lt;/h3&gt;

&lt;p&gt;Let me walk you through what makes this policy effective:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Request Context Preservation:&lt;/strong&gt; We store the original path and deployment name in variables. This is crucial because when we construct the failover request, we need to maintain the exact same endpoint structure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Buffer Response = True:&lt;/strong&gt; This is critical. APIM needs to read the complete response (including status code) before it can make decisions. Without buffering, we can't inspect the 429 status.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Synchronous Failover:&lt;/strong&gt; I use &lt;code&gt;send-request&lt;/code&gt; with &lt;code&gt;mode="new"&lt;/code&gt; to create a completely new HTTP request to the secondary region. The original request is abandoned.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Header Propagation:&lt;/strong&gt; The &lt;code&gt;X-Served-By&lt;/code&gt; header tells the client which region actually served the request. This is invaluable for debugging and telemetry (see the client-side sketch after this list).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Named Values:&lt;/strong&gt; Notice &lt;code&gt;{{primary-openai-key}}&lt;/code&gt; and &lt;code&gt;{{secondary-openai-key}}&lt;/code&gt;? These are APIM Named Values stored in Azure Key Vault—secure configuration that keeps secrets out of policy XML.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
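
&lt;p&gt;Here's a minimal client-side sketch showing how those &lt;code&gt;X-Served-By&lt;/code&gt; and &lt;code&gt;X-Failover&lt;/code&gt; headers can feed your telemetry. The gateway URL, APIM key, and deployment name are placeholders (the same ones used in the client-retry example later in this article); adjust them to your environment:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import httpx

# Placeholders - replace with your Front Door endpoint, APIM key, and deployment name
GATEWAY = "https://your-afd.azurefd.net"
DEPLOYMENT = "gpt-4o"
API_VERSION = "2024-08-01-preview"

resp = httpx.post(
    f"{GATEWAY}/openai/deployments/{DEPLOYMENT}/chat/completions",
    params={"api-version": API_VERSION},
    headers={"api-key": "your-apim-key"},
    json={"messages": [{"role": "user", "content": "ping"}]},
    timeout=60.0,
)

# The policy stamps these headers, so we can record which region actually answered
served_by = resp.headers.get("X-Served-By", "unknown")
failed_over = resp.headers.get("X-Failover", "false")
print(f"status={resp.status_code} served_by={served_by} failover={failed_over}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Logging these two headers alongside latency gives you per-region success and failover counts for free, which the monitoring section below relies on.&lt;/p&gt;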
&lt;h3&gt;
  
  
  Why This Approach Works
&lt;/h3&gt;

&lt;p&gt;Traditional load balancers fail here because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;They see HTTP 429 as a "successful" response (it's not a 5xx)&lt;/li&gt;
&lt;li&gt;They can't read and interpret the response body&lt;/li&gt;
&lt;li&gt;They can't make intelligent decisions based on API-specific behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;APIM bridges this gap by giving us full control over the request/response pipeline.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Streaming Challenge: Handling SSE in Failover Scenarios
&lt;/h2&gt;

&lt;p&gt;Here's something most failover guides don't tell you: &lt;strong&gt;streaming responses fundamentally change the game&lt;/strong&gt;. When you're calling GPT-4o or similar LLMs, you're not getting a single response—you're getting a continuous stream of tokens via Server-Sent Events (SSE).&lt;/p&gt;
&lt;h3&gt;
  
  
  Why LLMs Use Streaming
&lt;/h3&gt;

&lt;p&gt;In production AI applications, streaming isn't optional—it's essential:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Non-streaming: User waits 10+ seconds staring at a blank screen
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write a detailed analysis...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;  &lt;span class="c1"&gt;# Bad UX!
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Streaming: Tokens appear immediately, feels responsive
&lt;/span&gt;&lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write a detailed analysis...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;  &lt;span class="c1"&gt;# Good UX!
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The UX difference is massive: non-streaming feels like your application is frozen. Streaming gives users immediate feedback and a perception of speed, even if the total response time is similar.&lt;/p&gt;

&lt;h3&gt;
  
  
  The APIM + SSE Problem
&lt;/h3&gt;

&lt;p&gt;Here's where it gets tricky. Remember the &lt;code&gt;buffer-response="true"&lt;/code&gt; setting in the APIM policy? That works great for standard HTTP responses, but it &lt;strong&gt;breaks streaming&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Buffered responses:&lt;/strong&gt; APIM reads the entire response before forwarding. Perfect for inspecting status codes (429), terrible for SSE.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming responses:&lt;/strong&gt; APIM forwards chunks as they arrive. Great for UX, but we can't inspect the status code mid-stream.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can't have both... or can you?&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution: Hybrid Approach with Smart Detection
&lt;/h3&gt;

&lt;p&gt;Microsoft recently documented proper SSE support in APIM (&lt;a href="https://learn.microsoft.com/en-us/azure/api-management/how-to-server-sent-events" rel="noopener noreferrer"&gt;Server-Sent Events in Azure API Management&lt;/a&gt;), and here's how I adapted it for the failover scenario:&lt;/p&gt;

&lt;p&gt;📄 &lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/deneeshnarayanasamy/azure-openai-multi-region-failover/blob/main/policies/streaming-aware-failover-policy.xml" rel="noopener noreferrer"&gt;streaming-aware-failover-policy.xml&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;policies&amp;gt;
    &amp;lt;inbound&amp;gt;
        &amp;lt;!-- Store request details --&amp;gt;
        &amp;lt;set-variable name="originalPath" value="@(context.Request.Url.Path)" /&amp;gt;
        &amp;lt;set-variable name="deploymentName" 
                      value="@{
                          var path = context.Request.Url.Path;
                          var match = System.Text.RegularExpressions.Regex.Match(
                              path, @"/openai/deployments/([^/]+)/");
                          return match.Success ? match.Groups[1].Value : "";
                      }" /&amp;gt;

        &amp;lt;!-- Check if this is a streaming request --&amp;gt;
        &amp;lt;set-variable name="isStreaming" 
                      value="@{
                          var body = context.Request.Body?.As&amp;lt;JObject&amp;gt;(preserveContent: true);
                          return body != null &amp;amp;&amp;amp; 
                                 body["stream"] != null &amp;amp;&amp;amp; 
                                 body["stream"].Value&amp;lt;bool&amp;gt;() == true;
                      }" /&amp;gt;

        &amp;lt;set-backend-service base-url="https://westus-primary.openai.azure.com/openai" /&amp;gt;

        &amp;lt;set-header name="api-key" exists-action="override"&amp;gt;
            &amp;lt;value&amp;gt;{{primary-openai-key}}&amp;lt;/value&amp;gt;
        &amp;lt;/set-header&amp;gt;
    &amp;lt;/inbound&amp;gt;

    &amp;lt;backend&amp;gt;
        &amp;lt;!-- For streaming requests, don't buffer --&amp;gt;
        &amp;lt;forward-request 
            buffer-response="@(!(bool)context.Variables["isStreaming"])" /&amp;gt;
    &amp;lt;/backend&amp;gt;

    &amp;lt;outbound&amp;gt;
        &amp;lt;choose&amp;gt;
            &amp;lt;!-- Only attempt failover for non-streaming 429s --&amp;gt;
            &amp;lt;when condition="@(context.Response.StatusCode == 429 &amp;amp;&amp;amp; 
                              !(bool)context.Variables["isStreaming"])"&amp;gt;
                &amp;lt;trace source="apim-failover"&amp;gt;
                    Primary backend returned 429 for non-streaming request - attempting failover
                &amp;lt;/trace&amp;gt;

                &amp;lt;!-- Standard failover logic here --&amp;gt;
                &amp;lt;send-request mode="new" response-variable-name="failoverResponse" 
                              timeout="120" ignore-error="false"&amp;gt;
                    &amp;lt;set-url&amp;gt;@{
                        var deployment = context.Variables.GetValueOrDefault&amp;lt;string&amp;gt;("deploymentName");
                        return $"https://eastus-secondary.openai.azure.com/openai/deployments/{deployment}/chat/completions?api-version=2024-08-01-preview";
                    }&amp;lt;/set-url&amp;gt;
                    &amp;lt;set-method&amp;gt;POST&amp;lt;/set-method&amp;gt;
                    &amp;lt;set-header name="Content-Type" exists-action="override"&amp;gt;
                        &amp;lt;value&amp;gt;application/json&amp;lt;/value&amp;gt;
                    &amp;lt;/set-header&amp;gt;
                    &amp;lt;set-header name="api-key" exists-action="override"&amp;gt;
                        &amp;lt;value&amp;gt;{{secondary-openai-key}}&amp;lt;/value&amp;gt;
                    &amp;lt;/set-header&amp;gt;
                    &amp;lt;set-body&amp;gt;@(context.Request.Body.As&amp;lt;string&amp;gt;(preserveContent: true))&amp;lt;/set-body&amp;gt;
                &amp;lt;/send-request&amp;gt;

                &amp;lt;return-response&amp;gt;
                    &amp;lt;set-status code="@(((IResponse)context.Variables["failoverResponse"]).StatusCode)" 
                                reason="@(((IResponse)context.Variables["failoverResponse"]).StatusReason)" /&amp;gt;
                    &amp;lt;set-header name="X-Served-By" exists-action="override"&amp;gt;
                        &amp;lt;value&amp;gt;secondary-region&amp;lt;/value&amp;gt;
                    &amp;lt;/set-header&amp;gt;
                    &amp;lt;set-body&amp;gt;@(((IResponse)context.Variables["failoverResponse"]).Body.As&amp;lt;string&amp;gt;())&amp;lt;/set-body&amp;gt;
                &amp;lt;/return-response&amp;gt;
            &amp;lt;/when&amp;gt;

            &amp;lt;!-- For streaming requests, if we get here, just pass through --&amp;gt;
            &amp;lt;when condition="@((bool)context.Variables["isStreaming"])"&amp;gt;
                &amp;lt;set-header name="X-Stream-Mode" exists-action="override"&amp;gt;
                    &amp;lt;value&amp;gt;enabled&amp;lt;/value&amp;gt;
                &amp;lt;/set-header&amp;gt;
                &amp;lt;set-header name="X-Served-By" exists-action="skip"&amp;gt;
                    &amp;lt;value&amp;gt;primary-region-stream&amp;lt;/value&amp;gt;
                &amp;lt;/set-header&amp;gt;
            &amp;lt;/when&amp;gt;
        &amp;lt;/choose&amp;gt;
    &amp;lt;/outbound&amp;gt;
&amp;lt;/policies&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Key Design Decision: Client-Side Retry Strategy
&lt;/h3&gt;

&lt;p&gt;Since we can't fail over mid-stream at the APIM level, I implement retry logic in the client application:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AzureOpenAI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;APIError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RateLimitError&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;stream_with_retry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;30.0&lt;/span&gt;  &lt;span class="c1"&gt;# Detect dead streams quickly
&lt;/span&gt;            &lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;

            &lt;span class="c1"&gt;# Stream completed successfully
&lt;/span&gt;            &lt;span class="k"&gt;return&lt;/span&gt;

        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;RateLimitError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;wait_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;  &lt;span class="c1"&gt;# 1s, 2s, 4s
&lt;/span&gt;                &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wait_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt;

        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;wait_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
                &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wait_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt;

&lt;span class="c1"&gt;# Usage
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AzureOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;azure_endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://your-afd.azurefd.net&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-apim-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2024-08-01-preview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;stream_with_retry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain quantum computing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Load Testing: Proving It Works
&lt;/h2&gt;

&lt;p&gt;Theory is great. Data is better. Here's how I validated the architecture across three scenarios:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Direct OpenAI (Baseline):&lt;/strong&gt; Calling Azure OpenAI endpoints directly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AFD+APIM to Single Region:&lt;/strong&gt; Using Front Door and APIM but with only one region&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AFD+APIM with Failover:&lt;/strong&gt; The complete multi-region architecture&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Test Methodology
&lt;/h3&gt;

&lt;p&gt;📄 &lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/deneeshnarayanasamy/azure-openai-multi-region-failover/blob/main/scripts/load_test.py" rel="noopener noreferrer"&gt;load_test.py&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I built a Python load test that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sends 1M requests with high concurrency&lt;/li&gt;
&lt;li&gt;Uses a 70/30 mix of simple and complex queries&lt;/li&gt;
&lt;li&gt;Measures success rate, latency, and failover events&lt;/li&gt;
&lt;li&gt;Categorizes responses by region served
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simplified load test script
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scenario&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_requests&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;concurrency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;simple_ratio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;  &lt;span class="c1"&gt;# 70% simple queries
&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;batch_start&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_requests&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;concurrency&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;batch_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;concurrency&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_requests&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;batch_start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# Create a mix of simple and complex queries
&lt;/span&gt;            &lt;span class="n"&gt;queries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;query_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;simple&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;simple_ratio&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;complex&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SIMPLE_QUERIES&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SIMPLE_QUERIES&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;query_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;simple&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;COMPLEX_QUERIES&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;COMPLEX_QUERIES&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
                &lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query_type&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

            &lt;span class="c1"&gt;# Send batch of requests concurrently
&lt;/span&gt;            &lt;span class="n"&gt;tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="nf"&gt;send_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scenario&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query_type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query_type&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;queries&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;

            &lt;span class="n"&gt;batch_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Results: Success Rates and Latency
&lt;/h3&gt;

&lt;p&gt;Here's what the data showed:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Success Rate&lt;/th&gt;
&lt;th&gt;Avg Latency&lt;/th&gt;
&lt;th&gt;P95 Latency&lt;/th&gt;
&lt;th&gt;Failover Rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Direct OpenAI&lt;/td&gt;
&lt;td&gt;87.3%&lt;/td&gt;
&lt;td&gt;1,521ms&lt;/td&gt;
&lt;td&gt;2,874ms&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AFD+APIM Single&lt;/td&gt;
&lt;td&gt;88.1%&lt;/td&gt;
&lt;td&gt;1,698ms&lt;/td&gt;
&lt;td&gt;3,056ms&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AFD+APIM Failover&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;99.4%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2,184ms&lt;/td&gt;
&lt;td&gt;4,128ms&lt;/td&gt;
&lt;td&gt;12.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Key findings:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Direct OpenAI suffered from rate limits with no recovery mechanism&lt;/li&gt;
&lt;li&gt;AFD+APIM Single added minimal overhead (~177ms) but didn't improve reliability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AFD+APIM Failover achieved near-perfect reliability&lt;/strong&gt; at the cost of higher P95 latency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The latency increase for failover requests is expected—we're making a second API call when the first one fails. However, this tradeoff is absolutely worth it given the massive improvement in success rate.&lt;/p&gt;
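
&lt;p&gt;A quick back-of-envelope check (using the table above, and assuming the primary-call latency matches the single-region scenario) shows the blended average is consistent with roughly one expensive extra round trip for the ~12% of requests that failed over:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Back-of-envelope check against the table above (approximate figures)
single_region_avg_ms = 1698      # AFD+APIM single region, no failover
failover_rate = 0.122            # 12.2% of requests retried in a second region
failover_scenario_avg_ms = 2184  # AFD+APIM with failover enabled

# blended = (1 - p) * t_primary + p * (t_primary + t_extra)  =&amp;gt;  solve for t_extra
t_extra_ms = (failover_scenario_avg_ms - single_region_avg_ms) / failover_rate
print(f"Implied extra cost of a failed-over request: ~{t_extra_ms:.0f} ms")
# Roughly 4 seconds: wait out the primary 429, then make a full second call elsewhere
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;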




&lt;h2&gt;
  
  
  Production Readiness
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Circuit Breakers Are Essential
&lt;/h3&gt;

&lt;p&gt;Pure failover isn't enough. You need intelligent circuit breakers to avoid hammering overloaded regions:&lt;/p&gt;

&lt;p&gt;📄 &lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/deneeshnarayanasamy/azure-openai-multi-region-failover/blob/main/policies/circuit-breaker-policy.xml" rel="noopener noreferrer"&gt;circuit-breaker-policy.xml&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;set-variable name="failoverCount" value="@{
    string counterKey = "failover-count-" + context.Deployment.Region;
    int count = context.Variables.ContainsKey(counterKey) 
        ? (int)context.Variables[counterKey] 
        : 0;

    if (count &amp;gt; 10) { // Circuit breaker threshold
        // Check if 5 minutes have passed since last circuit break
        if (DateTime.UtcNow &amp;gt; context.Variables.GetValueOrDefault&amp;lt;DateTime&amp;gt;("circuit-breaker-time", DateTime.MinValue).AddMinutes(5)) {
            // Reset counter and allow a test request
            context.Variables[counterKey] = 0;
            return 0;
        }
        // Circuit still open
        return count;
    }
    // Increment counter
    return count + 1;
}" /&amp;gt;

&amp;lt;choose&amp;gt;
    &amp;lt;when condition="@((int)context.Variables["failoverCount"] &amp;gt; 10)"&amp;gt;
        &amp;lt;!-- Circuit is open, return friendly error --&amp;gt;
        &amp;lt;return-response&amp;gt;
            &amp;lt;set-status code="503" reason="Service Unavailable" /&amp;gt;
            &amp;lt;set-body&amp;gt;{"error": "All regions currently at capacity. Please try again in a few minutes."}&amp;lt;/set-body&amp;gt;
        &amp;lt;/return-response&amp;gt;
    &amp;lt;/when&amp;gt;
&amp;lt;/choose&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. &lt;a href="https://sre.google/workbook/monitoring/" rel="noopener noreferrer"&gt;Monitoring&lt;/a&gt; Is Everything
&lt;/h3&gt;

&lt;p&gt;I built a comprehensive monitoring dashboard that tracks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Success Rates:&lt;/strong&gt; Overall, per region, and per failover status&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency Distribution:&lt;/strong&gt; P50/P95/P99 across all scenarios&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failover Metrics:&lt;/strong&gt; Failover count, success rate, and latency impact&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quota Utilization:&lt;/strong&gt; Per-region TPM/RPM usage against limits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Circuit Breaker Status:&lt;/strong&gt; Open/closed state and activation frequency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Alert triggers:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Success rate drops below 99.5% for 5 minutes&lt;/li&gt;
&lt;li&gt;Failover rate exceeds 15% for 10 minutes&lt;/li&gt;
&lt;li&gt;Primary region 429 errors exceed 5% for 5 minutes&lt;/li&gt;
&lt;li&gt;Any region's quota utilization exceeds 85% for 15 minutes&lt;/li&gt;
&lt;/ul&gt;
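
&lt;p&gt;The rate-based triggers above are easy to evaluate once you export per-request results (status code, serving region, failover flag) into your metrics store. Here's a minimal sketch of the evaluation logic for a single window, assuming you already have those counts in hand:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from dataclasses import dataclass

@dataclass
class WindowStats:
    total: int          # requests in the evaluation window
    successes: int      # 2xx responses (including failed-over ones)
    failovers: int      # requests served by a secondary region
    primary_429s: int   # 429s returned by the primary region

def evaluate_alerts(window: WindowStats) -&amp;gt; list:
    """Return the alert names that fire for one evaluation window."""
    alerts = []
    if window.total == 0:
        return alerts
    if window.successes / window.total &amp;lt; 0.995:     # sustained 5 min in practice
        alerts.append("success-rate-below-99.5%")
    if window.failovers / window.total &amp;gt; 0.15:      # sustained 10 min in practice
        alerts.append("failover-rate-above-15%")
    if window.primary_429s / window.total &amp;gt; 0.05:   # sustained 5 min in practice
        alerts.append("primary-429-rate-above-5%")
    return alerts

# Example: one five-minute window of gateway results
stats = WindowStats(total=1000, successes=991, failovers=160, primary_429s=40)
print(evaluate_alerts(stats))  # ['success-rate-below-99.5%', 'failover-rate-above-15%']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The quota-utilization trigger is different: it typically comes from Azure Monitor metrics on the OpenAI resources rather than your own request logs, so it lives in a separate query.&lt;/p&gt;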




&lt;h2&gt;
  
  
  Best Practices: Your Implementation Checklist
&lt;/h2&gt;

&lt;p&gt;If you're implementing this architecture, here's your checklist:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] &lt;strong&gt;Provision Multiple Regions:&lt;/strong&gt; Deploy both primary and failover OpenAI resources&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Set Up Front Door:&lt;/strong&gt; Configure with WAF and geographic routing&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Deploy Regional APIM:&lt;/strong&gt; Use Premium tier for availability sets&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Implement Failover Policy:&lt;/strong&gt; Use my policy template, adjusting for your deployment names&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Configure Named Values:&lt;/strong&gt; Secure your API keys using APIM Named Values&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Set Up Monitoring:&lt;/strong&gt; Track success rates, latency, and failover events&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Implement Circuit Breakers:&lt;/strong&gt; Avoid cascading failures with breaker policies&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Add Client Retries:&lt;/strong&gt; Implement exponential backoff for streaming requests&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Test, Test, Test:&lt;/strong&gt; Load test with your actual traffic patterns&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion: From 3 AM Panic to Peaceful Sleep
&lt;/h2&gt;

&lt;p&gt;The investment in this multi-region, intelligent failover architecture pays for itself many times over—not just in reduced downtime costs, but in customer trust and team sanity.&lt;/p&gt;

&lt;p&gt;Is it perfect? No. We still have the occasional hiccup. But the difference between 87.3% and 99.4% reliability is the difference between an unreliable product and one that users can count on.&lt;/p&gt;

&lt;p&gt;Though there are many sandbox projects available, what I've described is a proven, self-managed method. This solution stems from personal experience, and I acknowledge that high availability can be achieved through various approaches. For instance, purchasing Azure's &lt;a href="https://learn.microsoft.com/en-us/azure/ai-foundry/openai/concepts/provisioned-throughput?view=foundry-classic&amp;amp;tabs=global-ptum" rel="noopener noreferrer"&gt;Provisioned Throughput Units (PTUs)&lt;/a&gt; offers guaranteed capacity but can be costly and still requires a strategy for regional outages. For those exploring alternatives, projects like &lt;a href="https://kgateway.dev/blog/ai-gateway-load-balancing-model-failover/" rel="noopener noreferrer"&gt;KGateway&lt;/a&gt; and &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; offer interesting options worth investigating. I'd love to hear from readers about other projects or approaches you've used to solve this problem.&lt;/p&gt;

&lt;p&gt;As Azure OpenAI continues to evolve, so does the architecture. But the fundamental principles outlined here—multiple layers of resilience, intelligent request routing, and a deep understanding of the service's behavior—will remain essential for any enterprise-scale AI deployment.&lt;/p&gt;

&lt;p&gt;📦 &lt;strong&gt;GitHub Repo:&lt;/strong&gt; &lt;a href="https://github.com/deneeshnarayanasamy/azure-openai-multi-region-failover/tree/main" rel="noopener noreferrer"&gt;azure-openai-multi-region-failover&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Imagine this: instead of being jolted awake at 3 AM by alert notifications, you simply check your dashboard during your morning coffee and see "Incident detected at 03:17, automatic failover initiated, recovery complete by 03:18." That's not fantasy—it's precisely what this architecture delivers. Your system detected the problem, executed the failover, and restored service while you enjoyed uninterrupted sleep. The difference between constant firefighting and confident reliability isn't just technical—it's transformative for your team's wellbeing and your customers' trust.&lt;/p&gt;

</description>
      <category>azure</category>
      <category>openai</category>
      <category>devops</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Google A2UI: The Future of Agentic AI for DevOps &amp; SRE (Goodbye Text-Only ChatOps)</title>
      <dc:creator>Deneesh Narayanasamy</dc:creator>
      <pubDate>Sat, 27 Dec 2025 19:54:27 +0000</pubDate>
      <link>https://dev.to/deneesh_narayanasamy/google-a2ui-the-future-of-agentic-ai-for-devops-sre-goodbye-text-only-chatops-2i4g</link>
      <guid>https://dev.to/deneesh_narayanasamy/google-a2ui-the-future-of-agentic-ai-for-devops-sre-goodbye-text-only-chatops-2i4g</guid>
      <description>&lt;p&gt;&lt;em&gt;The era of "Text-Only" ChatOps is ending. Google's new open-source protocol, &lt;strong&gt;A2UI&lt;/strong&gt;, lets AI agents render native, interactive interfaces. Here is what Platform Engineers and SREs need to know.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🚀 TL;DR (For the Busy Engineer)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;What is it?&lt;/strong&gt; &lt;a href="https://developers.googleblog.com/introducing-a2ui-an-open-project-for-agent-driven-interfaces/" rel="noopener noreferrer"&gt;A2UI (Agent-to-User Interface)&lt;/a&gt; is a new open-source standard by Google that lets AI agents generate UI components (JSON) instead of raw text or HTML.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Why care?&lt;/strong&gt; It solves the "Wall of Text" problem in ChatOps. Agents can now pop up interactive forms, charts, and buttons inside your chat app or internal portal.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Key Tech:&lt;/strong&gt; It uses declarative JSON payloads ("Safe like data, expressive like code") to ensure security: no arbitrary JavaScript execution.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Use Case:&lt;/strong&gt; Perfect for &lt;strong&gt;SRE Incident Response&lt;/strong&gt;, &lt;strong&gt;MLOps Labeling&lt;/strong&gt;, and &lt;strong&gt;Self-Service Infrastructure&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Problem: The "Wall of Text" Bottleneck
&lt;/h2&gt;

&lt;p&gt;We have all been there. It's 3 AM, and you are responding to a P1 incident. You query your Ops bot:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;gt; @ops-bot status service-payments&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The bot responds with &lt;strong&gt;50 lines of unformatted JSON logs&lt;/strong&gt;. To fix the issue, you have to remember specific CLI syntax, type it out, and hope you didn't typo a region flag.&lt;/p&gt;

&lt;p&gt;This is the "Last Mile" problem in AI operations. We have brilliant LLMs that can diagnose complex Kubernetes issues, but they are forced to communicate through dumb text channels. This friction increases cognitive load and slows down Mean Time To Resolution (MTTR).&lt;/p&gt;

&lt;h2&gt;
  
  
  Enter A2UI: "Safe Like Data, Expressive Like Code"
&lt;/h2&gt;

&lt;p&gt;Google released &lt;strong&gt;A2UI&lt;/strong&gt; to bridge this gap. Unlike previous approaches that relied on heavy &lt;code&gt;iframes&lt;/code&gt; or dangerous raw HTML injection, A2UI uses a &lt;strong&gt;standardized JSON schema&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The workflow is simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;The Agent&lt;/strong&gt; analyzes the request and sends a JSON "blueprint."&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;The Client&lt;/strong&gt; (your web portal, mobile app, or chat interface) receives the JSON.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;The Renderer&lt;/strong&gt; converts that JSON into &lt;strong&gt;native components&lt;/strong&gt; (React, Flutter, Angular, etc.) that match your brand's style system.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Why This Architecture Wins for DevOps
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Security First:&lt;/strong&gt; The agent &lt;em&gt;cannot&lt;/em&gt; execute code. It can only request components (like &lt;code&gt;Card&lt;/code&gt;, &lt;code&gt;Button&lt;/code&gt;, &lt;code&gt;Graph&lt;/code&gt;) that exist in your client's "Allow List."&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Native Feel:&lt;/strong&gt; The UI looks and behaves like your internal developer platform, not a disjointed third-party embed.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Bi-Directional Sync:&lt;/strong&gt; When you click "Restart Pod," the state updates instantly in the UI without a page refresh.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Some Use Cases for Platform Teams
&lt;/h2&gt;

&lt;p&gt;If you are building an Internal Developer Platform (IDP), here is how you can use A2UI today.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Interactive Incident Commander (SRE)
&lt;/h3&gt;

&lt;p&gt;Instead of linking to a Grafana dashboard, the agent generates the dashboard &lt;em&gt;in the conversation&lt;/em&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;The Trigger:&lt;/strong&gt; "Alert: High Latency on Checkout."&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;The A2UI Response:&lt;/strong&gt; An interactive Card containing:

&lt;ul&gt;
&lt;li&gt;  📉 &lt;strong&gt;Visual:&lt;/strong&gt; A live mini-chart of error rates over the last 15 minutes.&lt;/li&gt;
&lt;li&gt;  📝 &lt;strong&gt;Context:&lt;/strong&gt; A summary of the last 3 deployments.&lt;/li&gt;
&lt;li&gt;  🔴 &lt;strong&gt;Action:&lt;/strong&gt; A "Rollback" button that triggers a specific GitHub workflow.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Human-in-the-Loop MLOps
&lt;/h3&gt;

&lt;p&gt;MLOps teams often struggle with "edge cases" where a model has low confidence. Building a custom web app for labelers is expensive.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;The Scenario:&lt;/strong&gt; A fraud model flags a transaction with 45% confidence.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;The A2UI Solution:&lt;/strong&gt; The agent pushes a "Review Card" to the Ops channel.

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Content:&lt;/strong&gt; Transaction metadata + User History.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Input:&lt;/strong&gt; [Confirm Fraud] vs [False Positive] buttons.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Outcome:&lt;/strong&gt; The click labels the data and triggers a fine-tuning job instantly.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Self-Service Infrastructure Provisioning
&lt;/h3&gt;

&lt;p&gt;Stop making developers write Terraform for simple resources.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;The Request:&lt;/strong&gt; "I need a Redis instance for staging."&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;The A2UI Response:&lt;/strong&gt; A dynamic form.

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Dropdown:&lt;/strong&gt; Select Environment (Dev/Stage).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Slider:&lt;/strong&gt; Select TTL / Retention.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Validation:&lt;/strong&gt; The agent validates the quota &lt;em&gt;before&lt;/em&gt; the user clicks submit.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Code: Anatomy of a Payload
&lt;/h2&gt;

&lt;p&gt;For the developers reading this, here is what the actual wire protocol looks like. It is incredibly readable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "component": "Card",
  "title": "⚠️ Production Alert: High CPU",
  "children": [
    {
      "component": "Text",
      "content": "Service 'payment-gateway' is at 98% utilization."
    },
    {
      "component": "Row",
      "children": [
        {
          "component": "Button",
          "label": "Scale Up (5 Nodes)",
          "action": "scale_up_action",
          "style": "primary"
        },
        {
          "component": "Button",
          "label": "Snooze Alert",
          "action": "snooze_action",
          "style": "secondary"
        }
      ]
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This JSON is platform-agnostic. Your &lt;strong&gt;React&lt;/strong&gt; frontend renders it as a Material UI card; your &lt;strong&gt;iOS&lt;/strong&gt; app renders it as a native SwiftUI view.&lt;/p&gt;
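
&lt;p&gt;To make the "Allow List" idea concrete, here's a deliberately tiny, hypothetical renderer sketch in Python (a real client would use the official A2UI renderers for React, Flutter, and friends). The point is the dispatch pattern: every component type must be registered up front, and anything unknown is rejected rather than executed:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical allow-list renderer: the agent can request UI,
# but only component types registered here will ever be rendered.
ALLOWED = {}

def component(name):
    def register(fn):
        ALLOWED[name] = fn
        return fn
    return register

@component("Card")
def render_card(node, depth=0):
    print("  " * depth + "[Card] " + node.get("title", ""))
    for child in node.get("children", []):
        render(child, depth + 1)

@component("Text")
def render_text(node, depth=0):
    print("  " * depth + node.get("content", ""))

@component("Row")
def render_row(node, depth=0):
    for child in node.get("children", []):
        render(child, depth)

@component("Button")
def render_button(node, depth=0):
    # A real client would wire node["action"] to a pre-approved handler here
    print("  " * depth + "(" + node.get("label", "Button") + ")")

def render(node, depth=0):
    kind = node.get("component")
    if kind not in ALLOWED:
        raise ValueError("Component not on the allow list: " + str(kind))
    ALLOWED[kind](node, depth)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Feeding the alert payload above through &lt;code&gt;render(...)&lt;/code&gt; just prints an indented text outline; a production renderer uses exactly the same dispatch table, mapping each component to a native widget instead of &lt;code&gt;print()&lt;/code&gt; while still refusing anything outside the schema.&lt;/p&gt;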




&lt;h2&gt;
  
  
  A2UI vs. MCP vs. Standard ChatOps
&lt;/h2&gt;

&lt;p&gt;For those comparing this to Anthropic's &lt;strong&gt;MCP (Model Context Protocol)&lt;/strong&gt; or standard webhooks, here is the breakdown:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Standard ChatOps&lt;/th&gt;
&lt;th&gt;MCP (Model Context Protocol)&lt;/th&gt;
&lt;th&gt;Google A2UI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Text / Static Images&lt;/td&gt;
&lt;td&gt;Resources / Text / Prompts&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Native UI Components&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Interactivity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low (Command Line)&lt;/td&gt;
&lt;td&gt;Medium (Tool Use)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;High (Stateful UI)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Security&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;High (No Code Exec)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Implementation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Easy&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Moderate&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best For&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Simple queries&lt;/td&gt;
&lt;td&gt;Connecting Data Sources&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Human-in-the-loop Workflows&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;Google has open-sourced the specification and renderers. You can clone the repo and run the "Restaurant Finder" sample to see the rendering in action (it translates perfectly to a "Service Finder" for DevOps).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Clone the repository
git clone https://github.com/google/A2UI.git

# Navigate to the client sample
cd A2UI/samples/client/lit/shell

# Install and run
npm install &amp;amp;&amp;amp; npm run dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Final Thoughts: The Shift to Generative UI
&lt;/h2&gt;

&lt;p&gt;We are moving away from &lt;strong&gt;Generic UIs&lt;/strong&gt; (dashboards that show everything) to &lt;strong&gt;Generative UIs&lt;/strong&gt; (interfaces created on-the-fly for the exact problem you are solving).&lt;/p&gt;

&lt;p&gt;For DevOps and SREs, A2UI is the toolkit to build that future. It allows us to keep the "Chat" in ChatOps, but finally ditch the "Ops" headaches.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🔗 Resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://developers.googleblog.com/introducing-a2ui-an-open-project-for-agent-driven-interfaces/" rel="noopener noreferrer"&gt;Official Google Blog Post&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://a2ui.org/" rel="noopener noreferrer"&gt;A2UI Organization &amp;amp; Docs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://github.com/google/A2UI" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Have you tried implementing generative UI in your Ops workflows? Let me know in the comments below!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>ai</category>
      <category>sre</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
