<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Deneesh Narayanasamy</title>
    <description>The latest articles on DEV Community by Deneesh Narayanasamy (@deneesh_narayanasamy).</description>
    <link>https://dev.to/deneesh_narayanasamy</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F466999%2F00006e25-9ba0-4632-bf74-5acc27875191.jpeg</url>
      <title>DEV Community: Deneesh Narayanasamy</title>
      <link>https://dev.to/deneesh_narayanasamy</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/deneesh_narayanasamy"/>
    <language>en</language>
    <item>
      <title>LiteLLM Proxy: The Open-Source Alternative for Multi-Provider LLM Failover and Load Balancing</title>
      <dc:creator>Deneesh Narayanasamy</dc:creator>
      <pubDate>Tue, 07 Apr 2026 05:29:00 +0000</pubDate>
      <link>https://dev.to/deneesh_narayanasamy/litellm-proxy-the-open-source-alternative-for-multi-provider-llm-failover-and-load-balancing-54fn</link>
      <guid>https://dev.to/deneesh_narayanasamy/litellm-proxy-the-open-source-alternative-for-multi-provider-llm-failover-and-load-balancing-54fn</guid>
      <description>&lt;h2&gt;
  
  
  Introduction: What If You Could Use ANY LLM Provider?
&lt;/h2&gt;

&lt;p&gt;In my &lt;a href="https://dzone.com/articles/building-resilient-ai-services-implementing-multi" rel="noopener noreferrer"&gt;previous article&lt;/a&gt;, I walked through building a multi-region failover architecture for Azure OpenAI using Azure Front Door and APIM. It works brilliantly - but it's also Azure-specific, requires significant infrastructure, and locks you into a single provider ecosystem.&lt;/p&gt;

&lt;p&gt;What if you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-provider failover&lt;/strong&gt; (&lt;a href="https://azure.microsoft.com/en-us/products/ai-services/openai-service" rel="noopener noreferrer"&gt;Azure OpenAI&lt;/a&gt; -&amp;gt; &lt;a href="https://openai.com/" rel="noopener noreferrer"&gt;OpenAI&lt;/a&gt; -&amp;gt; &lt;a href="https://www.anthropic.com/" rel="noopener noreferrer"&gt;Anthropic&lt;/a&gt; -&amp;gt; &lt;a href="https://deepmind.google/technologies/gemini/" rel="noopener noreferrer"&gt;Gemini&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A simpler deployment&lt;/strong&gt; without managing APIM policies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provider-agnostic architecture&lt;/strong&gt; that works anywhere&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open-source flexibility&lt;/strong&gt; with no vendor lock-in&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Enter &lt;strong&gt;LiteLLM Proxy&lt;/strong&gt; - an open-source unified gateway that gives you all of this out of the box.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is LiteLLM Proxy?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/BerriAI/litellm" rel="noopener noreferrer"&gt;LiteLLM&lt;/a&gt; is an open-source Python library and proxy server that provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unified API:&lt;/strong&gt; One OpenAI-compatible endpoint for 100+ LLM providers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built-in Load Balancing:&lt;/strong&gt; Distribute requests across multiple deployments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic Failover:&lt;/strong&gt; Seamlessly retry on different models/providers when one fails&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate Limit Handling:&lt;/strong&gt; Intelligent retry with &lt;a href="https://en.wikipedia.org/wiki/Exponential_backoff" rel="noopener noreferrer"&gt;exponential backoff&lt;/a&gt; for &lt;a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429" rel="noopener noreferrer"&gt;429 errors&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost Tracking:&lt;/strong&gt; Monitor spend across all providers in one place&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming Support:&lt;/strong&gt; Full &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events" rel="noopener noreferrer"&gt;SSE (Server-Sent Events)&lt;/a&gt; support with proper failover&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The beauty? Your application code doesn't change. You point your OpenAI SDK at LiteLLM Proxy, and it handles the rest.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture: LiteLLM Proxy vs Azure APIM
&lt;/h2&gt;

&lt;p&gt;Here's how LiteLLM Proxy compares to the Azure-native approach:&lt;/p&gt;

&lt;h3&gt;
  
  
  Azure APIM Architecture (Previous Article)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client -&amp;gt; Azure Front Door -&amp;gt; Regional APIM -&amp;gt; Azure OpenAI (Primary)
                                         -&amp;gt; Azure OpenAI (Secondary)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; Native Azure integration, enterprise compliance, WAF protection&lt;br&gt;
&lt;strong&gt;Cons:&lt;/strong&gt; Azure-only, complex policies, expensive at scale&lt;/p&gt;
&lt;h3&gt;
  
  
  LiteLLM Proxy Architecture
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client -&amp;gt; Load Balancer -&amp;gt; LiteLLM Proxy -&amp;gt; Azure OpenAI
                                       -&amp;gt; OpenAI Direct
                                       -&amp;gt; Anthropic Claude
                                       -&amp;gt; Google Gemini
                                       -&amp;gt; AWS Bedrock
                                       -&amp;gt; Any LLM Provider
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Supported Providers:&lt;/strong&gt; &lt;a href="https://azure.microsoft.com/en-us/products/ai-services/openai-service" rel="noopener noreferrer"&gt;Azure OpenAI&lt;/a&gt;, &lt;a href="https://openai.com/" rel="noopener noreferrer"&gt;OpenAI&lt;/a&gt;, &lt;a href="https://www.anthropic.com/" rel="noopener noreferrer"&gt;Anthropic Claude&lt;/a&gt;, &lt;a href="https://deepmind.google/technologies/gemini/" rel="noopener noreferrer"&gt;Google Gemini&lt;/a&gt;, &lt;a href="https://aws.amazon.com/bedrock/" rel="noopener noreferrer"&gt;AWS Bedrock&lt;/a&gt;, and &lt;a href="https://docs.litellm.ai/docs/providers" rel="noopener noreferrer"&gt;100+ more&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; Provider-agnostic, simple configuration, open-source, runs anywhere&lt;br&gt;
&lt;strong&gt;Cons:&lt;/strong&gt; Self-managed infrastructure, requires &lt;a href="https://www.docker.com/resources/what-container/" rel="noopener noreferrer"&gt;containerization&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Getting Started: 5-Minute Setup
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Option 1: &lt;a href="https://www.docker.com/" rel="noopener noreferrer"&gt;Docker&lt;/a&gt; (Recommended for Production)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Pull the official image&lt;/span&gt;
docker pull ghcr.io/berriai/litellm:main-latest

&lt;span class="c"&gt;# Run with your config&lt;/span&gt;
docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; litellm-proxy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 4000:4000 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;/litellm_config.yaml:/app/config.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;AZURE_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-azure-key"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-openai-key"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-anthropic-key"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  ghcr.io/berriai/litellm:main-latest &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--config&lt;/span&gt; /app/config.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Option 2: Python (Quick Testing)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s1"&gt;'litellm[proxy]'&lt;/span&gt;
litellm &lt;span class="nt"&gt;--config&lt;/span&gt; litellm_config.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  The Configuration File
&lt;/h3&gt;

&lt;p&gt;Create &lt;code&gt;litellm_config.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;model_list&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# Primary: Azure OpenAI GPT-4o (West US)&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpt-4o&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;azure/gpt-4o&lt;/span&gt;
      &lt;span class="na"&gt;api_base&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://westus-primary.openai.azure.com/&lt;/span&gt;
      &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;os.environ/AZURE_API_KEY&lt;/span&gt;
      &lt;span class="na"&gt;api_version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2024-08-01-preview"&lt;/span&gt;
    &lt;span class="na"&gt;model_info&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;azure-westus-gpt4o&lt;/span&gt;

  &lt;span class="c1"&gt;# Failover 1: Azure OpenAI GPT-4o (East US)&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpt-4o&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;azure/gpt-4o&lt;/span&gt;
      &lt;span class="na"&gt;api_base&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://eastus-secondary.openai.azure.com/&lt;/span&gt;
      &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;os.environ/AZURE_API_KEY_SECONDARY&lt;/span&gt;
      &lt;span class="na"&gt;api_version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2024-08-01-preview"&lt;/span&gt;
    &lt;span class="na"&gt;model_info&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;azure-eastus-gpt4o&lt;/span&gt;

  &lt;span class="c1"&gt;# Failover 2: OpenAI Direct&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpt-4o&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpt-4o&lt;/span&gt;
      &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;os.environ/OPENAI_API_KEY&lt;/span&gt;
    &lt;span class="na"&gt;model_info&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai-direct-gpt4o&lt;/span&gt;

  &lt;span class="c1"&gt;# Failover 3: Anthropic Claude (ultimate backup)&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpt-4o&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;anthropic/claude-3-5-sonnet-20241022&lt;/span&gt;
      &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;os.environ/ANTHROPIC_API_KEY&lt;/span&gt;
    &lt;span class="na"&gt;model_info&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;anthropic-claude-sonnet&lt;/span&gt;

&lt;span class="na"&gt;litellm_settings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# Enable automatic failover&lt;/span&gt;
  &lt;span class="na"&gt;num_retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
  &lt;span class="na"&gt;retry_after&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;

  &lt;span class="c1"&gt;# Fallback configuration&lt;/span&gt;
  &lt;span class="na"&gt;fallbacks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;gpt-4o&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;gpt-4o&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Retry across all gpt-4o deployments&lt;/span&gt;

  &lt;span class="c1"&gt;# Request timeout&lt;/span&gt;
  &lt;span class="na"&gt;request_timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;120&lt;/span&gt;

  &lt;span class="c1"&gt;# Enable streaming&lt;/span&gt;
  &lt;span class="na"&gt;stream&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="na"&gt;router_settings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# Load balancing strategy&lt;/span&gt;
  &lt;span class="na"&gt;routing_strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;least-busy&lt;/span&gt;

  &lt;span class="c1"&gt;# Enable rate limit awareness&lt;/span&gt;
  &lt;span class="na"&gt;enable_pre_call_checks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

  &lt;span class="c1"&gt;# Cooldown failed deployments&lt;/span&gt;
  &lt;span class="na"&gt;cooldown_time&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;

  &lt;span class="c1"&gt;# Number of retries per deployment&lt;/span&gt;
  &lt;span class="na"&gt;num_retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;

  &lt;span class="c1"&gt;# Retry on these status codes&lt;/span&gt;
  &lt;span class="na"&gt;retry_after&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
  &lt;span class="na"&gt;allowed_fails&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;

&lt;span class="na"&gt;general_settings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# Master key for proxy authentication&lt;/span&gt;
  &lt;span class="na"&gt;master_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;os.environ/LITELLM_MASTER_KEY&lt;/span&gt;

  &lt;span class="c1"&gt;# Database for tracking (optional)&lt;/span&gt;
  &lt;span class="na"&gt;database_url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;os.environ/DATABASE_URL&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
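
&lt;p&gt;With the proxy running, a quick smoke test confirms the config loads and routing works. This sketch assumes the proxy is listening on localhost:4000 and that &lt;code&gt;LITELLM_MASTER_KEY&lt;/code&gt; is exported in your shell; adjust both to your setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal smoke test against a locally running LiteLLM Proxy.
# Assumes the proxy listens on http://localhost:4000 and that
# LITELLM_MASTER_KEY is set in the environment.
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["LITELLM_MASTER_KEY"],  # master key (or a virtual key)
    base_url="http://localhost:4000",
)

# "gpt-4o" is the model_name group from litellm_config.yaml;
# the proxy picks one of the underlying deployments.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "ping"}],
)
print(response.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;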






&lt;h2&gt;
  
  
  The Magic: How Failover Actually Works
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Automatic 429 Handling
&lt;/h3&gt;

&lt;p&gt;When Azure OpenAI returns a 429 (rate limit), LiteLLM automatically:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Reads the &lt;code&gt;Retry-After&lt;/code&gt; header&lt;/li&gt;
&lt;li&gt;Marks that deployment as "cooling down"&lt;/li&gt;
&lt;li&gt;Routes the request to the next available deployment&lt;/li&gt;
&lt;li&gt;Continues until it receives a successful response or all deployments are exhausted
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Your code stays simple - LiteLLM handles everything
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-litellm-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:4000&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# Point to LiteLLM Proxy
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# This request automatically fails over if needed
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# LiteLLM routes to best available
&lt;/span&gt;    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Load Balancing Strategies
&lt;/h3&gt;

&lt;p&gt;LiteLLM supports multiple routing strategies:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;simple-shuffle&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Random selection&lt;/td&gt;
&lt;td&gt;Even distribution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;least-busy&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Route to deployment with fewest active requests&lt;/td&gt;
&lt;td&gt;High throughput&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;latency-based-routing&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Route to fastest responding deployment&lt;/td&gt;
&lt;td&gt;Latency-sensitive apps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cost-based-routing&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Route to cheapest available option&lt;/td&gt;
&lt;td&gt;Cost optimization&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Configure in your YAML:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;router_settings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;routing_strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;latency-based-routing&lt;/span&gt;

  &lt;span class="c1"&gt;# For latency-based routing, set expected latencies&lt;/span&gt;
  &lt;span class="na"&gt;model_group_alias&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;gpt-4o&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;azure/gpt-4o&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.7&lt;/span&gt;  &lt;span class="c1"&gt;# 70% of traffic&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai/gpt-4o&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.3&lt;/span&gt;  &lt;span class="c1"&gt;# 30% of traffic&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
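
&lt;p&gt;The same strategies are also available if you embed LiteLLM directly in Python through its Router class. This is a rough sketch based on the LiteLLM docs rather than the proxy path shown above; parameter names can shift between versions, so verify against the Router documentation for the version you install:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: using LiteLLM's Router in-process instead of the proxy.
# The model_list mirrors litellm_config.yaml; strategy names follow
# the LiteLLM docs (double-check against your installed version).
import os

from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "gpt-4o",
            "litellm_params": {
                "model": "azure/gpt-4o",
                "api_base": "https://westus-primary.openai.azure.com/",
                "api_key": os.environ["AZURE_API_KEY"],
                "api_version": "2024-08-01-preview",
            },
        },
        {
            "model_name": "gpt-4o",
            "litellm_params": {
                "model": "gpt-4o",
                "api_key": os.environ["OPENAI_API_KEY"],
            },
        },
    ],
    routing_strategy="latency-based-routing",
)

response = router.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello from the Router"}],
)
print(response.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;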






&lt;h2&gt;
  
  
  Streaming Support: It Just Works
&lt;/h2&gt;

&lt;p&gt;Unlike the Azure APIM approach where streaming requires special handling, LiteLLM Proxy handles &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events" rel="noopener noreferrer"&gt;SSE (Server-Sent Events)&lt;/a&gt; natively:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-litellm-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:4000&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Streaming works exactly like direct OpenAI
&lt;/span&gt;&lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write a poem about resilience&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the primary provider fails mid-stream, LiteLLM will:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Detect the connection failure&lt;/li&gt;
&lt;li&gt;Automatically retry on the next provider&lt;/li&gt;
&lt;li&gt;Return an error only if all providers fail&lt;/li&gt;
&lt;/ol&gt;
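
&lt;p&gt;In practice that means your client only sees an exception once every deployment has been tried, so a defensive streaming loop can stay small. A minimal sketch (the fallback message is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Defensive streaming: LiteLLM retries other providers behind the scenes,
# so the client only handles the case where every one of them failed.
from openai import APIError, OpenAI

client = OpenAI(api_key="your-litellm-key", base_url="http://localhost:4000")

try:
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Write a poem about resilience"}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
except APIError as exc:
    # Reached only when every configured deployment has failed.
    print(f"\nAll providers exhausted: {exc}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;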




&lt;h2&gt;
  
  
  Production Configuration: Enterprise-Ready Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  High Availability Deployment
&lt;/h3&gt;

&lt;p&gt;For production, deploy multiple LiteLLM instances behind a load balancer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# docker-compose.yml&lt;/span&gt;
&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;3.8'&lt;/span&gt;

&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;litellm-1&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io/berriai/litellm:main-latest&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4001:4000"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./litellm_config.yaml:/app/config.yaml&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;AZURE_API_KEY=${AZURE_API_KEY}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;OPENAI_API_KEY=${OPENAI_API_KEY}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DATABASE_URL=${DATABASE_URL}&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;--config /app/config.yaml&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;always&lt;/span&gt;
    &lt;span class="na"&gt;healthcheck&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CMD"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;curl"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-f"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:4000/health"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30s&lt;/span&gt;
      &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10s&lt;/span&gt;
      &lt;span class="na"&gt;retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;

  &lt;span class="na"&gt;litellm-2&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io/berriai/litellm:main-latest&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4002:4000"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./litellm_config.yaml:/app/config.yaml&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;AZURE_API_KEY=${AZURE_API_KEY}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;OPENAI_API_KEY=${OPENAI_API_KEY}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DATABASE_URL=${DATABASE_URL}&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;--config /app/config.yaml&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;always&lt;/span&gt;
    &lt;span class="na"&gt;healthcheck&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CMD"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;curl"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-f"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:4000/health"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30s&lt;/span&gt;
      &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10s&lt;/span&gt;
      &lt;span class="na"&gt;retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;

  &lt;span class="na"&gt;nginx&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx:alpine&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4000:80"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./nginx.conf:/etc/nginx/nginx.conf&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;litellm-1&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;litellm-2&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;always&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;a href="https://nginx.org/" rel="noopener noreferrer"&gt;Nginx&lt;/a&gt; Load Balancer Configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="c1"&gt;# nginx.conf&lt;/span&gt;
&lt;span class="k"&gt;events&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;worker_connections&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;http&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;upstream&lt;/span&gt; &lt;span class="s"&gt;litellm&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kn"&gt;least_conn&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;server&lt;/span&gt; &lt;span class="nf"&gt;litellm-1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;4000&lt;/span&gt; &lt;span class="s"&gt;weight=1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;server&lt;/span&gt; &lt;span class="nf"&gt;litellm-2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;4000&lt;/span&gt; &lt;span class="s"&gt;weight=1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kn"&gt;server&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kn"&gt;listen&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="kn"&gt;location&lt;/span&gt; &lt;span class="n"&gt;/&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="kn"&gt;proxy_pass&lt;/span&gt; &lt;span class="s"&gt;http://litellm&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="kn"&gt;proxy_http_version&lt;/span&gt; &lt;span class="mf"&gt;1.1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="kn"&gt;proxy_set_header&lt;/span&gt; &lt;span class="s"&gt;Upgrade&lt;/span&gt; &lt;span class="nv"&gt;$http_upgrade&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="kn"&gt;proxy_set_header&lt;/span&gt; &lt;span class="s"&gt;Connection&lt;/span&gt; &lt;span class="s"&gt;"upgrade"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="kn"&gt;proxy_set_header&lt;/span&gt; &lt;span class="s"&gt;Host&lt;/span&gt; &lt;span class="nv"&gt;$host&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="kn"&gt;proxy_set_header&lt;/span&gt; &lt;span class="s"&gt;X-Real-IP&lt;/span&gt; &lt;span class="nv"&gt;$remote_addr&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="kn"&gt;proxy_read_timeout&lt;/span&gt; &lt;span class="s"&gt;300s&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="kn"&gt;proxy_buffering&lt;/span&gt; &lt;span class="no"&gt;off&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;# Important for streaming&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="kn"&gt;location&lt;/span&gt; &lt;span class="n"&gt;/health&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="kn"&gt;proxy_pass&lt;/span&gt; &lt;span class="s"&gt;http://litellm&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="kn"&gt;proxy_connect_timeout&lt;/span&gt; &lt;span class="s"&gt;5s&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="kn"&gt;proxy_read_timeout&lt;/span&gt; &lt;span class="s"&gt;5s&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Advanced Features
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Budget &amp;amp; Rate Limiting
&lt;/h3&gt;

&lt;p&gt;Control spending and prevent runaway costs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;general_settings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;master_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sk-your-master-key&lt;/span&gt;

&lt;span class="c1"&gt;# User-level budgets&lt;/span&gt;
&lt;span class="na"&gt;litellm_settings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;max_budget&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100.00&lt;/span&gt;  &lt;span class="c1"&gt;# $100 max per user&lt;/span&gt;
  &lt;span class="na"&gt;budget_duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;monthly&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create users with specific limits:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="s1"&gt;'http://localhost:4000/user/new'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Authorization: Bearer sk-your-master-key'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "user_id": "user-123",
    "max_budget": 50.00,
    "budget_duration": "monthly",
    "models": ["gpt-4o", "gpt-3.5-turbo"]
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
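
&lt;p&gt;If the call succeeds, the proxy issues a virtual key for that user (the exact response shape depends on your LiteLLM version, so treat this as a sketch). That key then replaces the master key in the application, and the proxy enforces the $50 monthly budget and the model allow-list on every request made with it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: calls made with a per-user virtual key are metered against
# that user's budget and restricted to their allowed models.
from openai import OpenAI

client = OpenAI(
    api_key="sk-user-123-virtual-key",  # placeholder: the key returned for user-123
    base_url="http://localhost:4000",
)

response = client.chat.completions.create(
    model="gpt-4o",  # must appear in the user's "models" list
    messages=[{"role": "user", "content": "Summarize my open tickets"}],
)
print(response.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;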



&lt;h3&gt;
  
  
  2. Request Caching
&lt;/h3&gt;

&lt;p&gt;Reduce costs and latency with semantic caching using &lt;a href="https://redis.io/" rel="noopener noreferrer"&gt;Redis&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;litellm_settings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;cache&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;cache_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;redis&lt;/span&gt;
    &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;localhost&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;6379&lt;/span&gt;
    &lt;span class="na"&gt;ttl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3600&lt;/span&gt;  &lt;span class="c1"&gt;# 1 hour cache&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
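
&lt;p&gt;A quick way to sanity-check the cache is to send the same request twice and compare wall-clock time; if Redis is wired up correctly, the second call should come back in a fraction of the first. A rough sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Rough cache check: an identical prompt should hit the Redis cache
# on the second call and return noticeably faster.
import time

from openai import OpenAI

client = OpenAI(api_key="your-litellm-key", base_url="http://localhost:4000")

def timed_call():
    start = time.perf_counter()
    client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "What is exponential backoff?"}],
    )
    return time.perf_counter() - start

print(f"first call:  {timed_call():.2f}s")   # goes to the provider
print(f"second call: {timed_call():.2f}s")   # should be served from cache
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;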



&lt;h3&gt;
  
  
  3. Custom Callbacks &amp;amp; Logging
&lt;/h3&gt;

&lt;p&gt;Track every request for observability:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;litellm_settings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;success_callback&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;langfuse"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prometheus"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Langfuse &amp;amp; Prometheus integrations&lt;/span&gt;
  &lt;span class="na"&gt;failure_callback&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;langfuse"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;slack"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

  &lt;span class="c1"&gt;# Langfuse integration&lt;/span&gt;
  &lt;span class="na"&gt;langfuse_public_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;os.environ/LANGFUSE_PUBLIC_KEY&lt;/span&gt;
  &lt;span class="na"&gt;langfuse_secret_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;os.environ/LANGFUSE_SECRET_KEY&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Guardrails &amp;amp; Content Moderation
&lt;/h3&gt;

&lt;p&gt;Add safety layers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;litellm_settings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;guardrails&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;guardrail_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content-filter"&lt;/span&gt;
      &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;guardrail&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai_moderation&lt;/span&gt;
        &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pre_call&lt;/span&gt;  &lt;span class="c1"&gt;# Check before sending to LLM&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Comparing Results: LiteLLM vs Azure APIM
&lt;/h2&gt;

&lt;p&gt;I ran the same load test from my Azure article against both architectures:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Azure APIM&lt;/th&gt;
&lt;th&gt;LiteLLM Proxy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Success Rate&lt;/td&gt;
&lt;td&gt;99.4%&lt;/td&gt;
&lt;td&gt;99.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg Latency&lt;/td&gt;
&lt;td&gt;2,184ms&lt;/td&gt;
&lt;td&gt;1,892ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P95 Latency&lt;/td&gt;
&lt;td&gt;4,128ms&lt;/td&gt;
&lt;td&gt;3,456ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Setup Time&lt;/td&gt;
&lt;td&gt;~4 hours&lt;/td&gt;
&lt;td&gt;~30 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monthly Cost&lt;/td&gt;
&lt;td&gt;~$500+&lt;/td&gt;
&lt;td&gt;~$50 (compute only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Provider Lock-in&lt;/td&gt;
&lt;td&gt;Azure only&lt;/td&gt;
&lt;td&gt;Any provider&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Key observations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LiteLLM showed slightly better latency thanks to its simpler request pipeline&lt;/li&gt;
&lt;li&gt;Both achieved similar reliability with proper configuration&lt;/li&gt;
&lt;li&gt;LiteLLM's multi-provider fallback provided an extra safety net&lt;/li&gt;
&lt;li&gt;Cost difference is significant for smaller teams&lt;/li&gt;
&lt;/ul&gt;
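
&lt;p&gt;If you want to run a similar comparison yourself, here's a rough outline of the kind of harness involved. It's illustrative only (the endpoint, key, and request count are placeholders, not the scripts from the previous article) and it measures end-to-end latency from the client's point of view:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal load-test sketch: fire N concurrent chat requests at a gateway
# and report success rate, average, and p95 latency. Placeholders throughout.
import asyncio
import statistics
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(api_key="your-gateway-key", base_url="http://localhost:4000")

async def one_request():
    start = time.perf_counter()
    try:
        await client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": "Say hello in five words."}],
        )
        return time.perf_counter() - start, True
    except Exception:
        return time.perf_counter() - start, False

async def main(total=200):
    results = await asyncio.gather(*(one_request() for _ in range(total)))
    latencies = sorted(r[0] for r in results if r[1])
    ok = len(latencies)
    print(f"success rate: {ok / total:.1%}")
    print(f"avg latency:  {statistics.mean(latencies) * 1000:.0f} ms")
    print(f"p95 latency:  {latencies[int(0.95 * ok) - 1] * 1000:.0f} ms")

asyncio.run(main())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;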




&lt;h2&gt;
  
  
  When to Use Which?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Choose Azure APIM + Front Door When:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;You're &lt;strong&gt;all-in on Azure&lt;/strong&gt; and need native integration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise compliance&lt;/strong&gt; requirements mandate Azure services&lt;/li&gt;
&lt;li&gt;You need &lt;strong&gt;WAF/DDoS protection&lt;/strong&gt; at the edge&lt;/li&gt;
&lt;li&gt;Your organization has existing APIM expertise&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit logging&lt;/strong&gt; must stay within Azure ecosystem&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Choose LiteLLM Proxy When:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;You need &lt;strong&gt;multi-provider failover&lt;/strong&gt; (not just multi-region)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost optimization&lt;/strong&gt; is a priority&lt;/li&gt;
&lt;li&gt;You want &lt;strong&gt;provider flexibility&lt;/strong&gt; to switch easily&lt;/li&gt;
&lt;li&gt;Your team prefers &lt;strong&gt;simple YAML configuration&lt;/strong&gt; over XML policies&lt;/li&gt;
&lt;li&gt;You're running on &lt;strong&gt;&lt;a href="https://kubernetes.io/" rel="noopener noreferrer"&gt;Kubernetes&lt;/a&gt;, AWS, GCP, or on-prem&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;You need &lt;strong&gt;rapid prototyping&lt;/strong&gt; and iteration&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Production Checklist
&lt;/h2&gt;

&lt;p&gt;If you're deploying LiteLLM Proxy to production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] &lt;strong&gt;Deploy Multiple Instances:&lt;/strong&gt; At least 2 behind a load balancer&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Enable Health Checks:&lt;/strong&gt; Configure &lt;code&gt;/health&lt;/code&gt; endpoint monitoring&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Set Up Database:&lt;/strong&gt; &lt;a href="https://www.postgresql.org/" rel="noopener noreferrer"&gt;PostgreSQL&lt;/a&gt; for persistence and analytics&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Configure Caching:&lt;/strong&gt; &lt;a href="https://redis.io/" rel="noopener noreferrer"&gt;Redis&lt;/a&gt; for semantic caching&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Add Monitoring:&lt;/strong&gt; &lt;a href="https://prometheus.io/" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt; + &lt;a href="https://grafana.com/" rel="noopener noreferrer"&gt;Grafana&lt;/a&gt; or &lt;a href="https://langfuse.com/" rel="noopener noreferrer"&gt;Langfuse&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Set Budget Limits:&lt;/strong&gt; Prevent runaway costs&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Secure the Proxy:&lt;/strong&gt; Use master key authentication&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Enable TLS:&lt;/strong&gt; HTTPS in production (via nginx or cloud LB)&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Configure Alerts:&lt;/strong&gt; &lt;a href="https://slack.com/" rel="noopener noreferrer"&gt;Slack&lt;/a&gt;/&lt;a href="https://www.pagerduty.com/" rel="noopener noreferrer"&gt;PagerDuty&lt;/a&gt; for failures&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Test Failover:&lt;/strong&gt; Deliberately fail providers to verify behavior (see the sketch after this checklist)&lt;/li&gt;
&lt;/ul&gt;
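
&lt;p&gt;For the failover test, the simplest approach is to deliberately misconfigure the primary deployment in a staging copy of the config (for example, point its &lt;code&gt;api_base&lt;/code&gt; at an unreachable host) and confirm requests still complete through the remaining deployments. A sketch of the verification side:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Failover verification sketch: run this after breaking the primary
# deployment in a staging config (e.g. an unreachable api_base).
# Every probe should still succeed via the surviving deployments.
from openai import OpenAI

client = OpenAI(api_key="your-litellm-key", base_url="http://localhost:4000")

failures = 0
for i in range(20):
    try:
        client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": f"failover probe {i}"}],
        )
    except Exception:
        failures += 1

print(f"{failures} of 20 probes failed")
assert failures == 0, "failover did not cover the broken deployment"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;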




&lt;h2&gt;
  
  
  Conclusion: The Right Tool for the Job
&lt;/h2&gt;

&lt;p&gt;Both Azure APIM and LiteLLM Proxy solve the same fundamental problem - making LLM services reliable at scale. The choice depends on your constraints:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Azure APIM&lt;/strong&gt; is the enterprise choice when you're committed to Azure and need the full power of the platform's security and compliance features.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LiteLLM Proxy&lt;/strong&gt; is the pragmatic choice when you need flexibility, multi-provider support, or a simpler operational model.&lt;/p&gt;

&lt;p&gt;The best part? These aren't mutually exclusive. You can run LiteLLM Proxy &lt;em&gt;behind&lt;/em&gt; Azure Front Door to get the best of both worlds - enterprise edge security with flexible provider routing.&lt;/p&gt;

&lt;p&gt;📦 &lt;strong&gt;LiteLLM GitHub:&lt;/strong&gt; &lt;a href="https://github.com/BerriAI/litellm" rel="noopener noreferrer"&gt;github.com/BerriAI/litellm&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;📄 &lt;strong&gt;LiteLLM Docs:&lt;/strong&gt; &lt;a href="https://docs.litellm.ai/" rel="noopener noreferrer"&gt;docs.litellm.ai&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The days of single-provider dependency are over. Whether you choose managed Azure services or open-source flexibility, the key is building resilience into your AI infrastructure from day one. Your 3 AM self will thank you.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>openai</category>
      <category>devops</category>
    </item>
    <item>
      <title>Building Resilient AI Services: Implementing Multi-Region Failover for Azure OpenAI at Enterprise Scale</title>
      <dc:creator>Deneesh Narayanasamy</dc:creator>
      <pubDate>Fri, 27 Feb 2026 05:41:05 +0000</pubDate>
      <link>https://dev.to/deneesh_narayanasamy/building-resilient-ai-services-implementing-multi-region-failover-for-azure-openai-at-enterprise-cnd</link>
      <guid>https://dev.to/deneesh_narayanasamy/building-resilient-ai-services-implementing-multi-region-failover-for-azure-openai-at-enterprise-cnd</guid>
      <description>&lt;h2&gt;
  
  
  Introduction: When Your AI Service Goes Down at 3 AM
&lt;/h2&gt;

&lt;p&gt;Picture this: It's 3 AM on a Monday. Your enterprise AI application, the one powering customer support for millions of users, suddenly stops responding. &lt;a href="https://azure.microsoft.com/en-us/products/ai-foundry/models/openai/" rel="noopener noreferrer"&gt;Azure OpenAI&lt;/a&gt; in your primary region is experiencing an outage. Your phone explodes with alerts. Customer complaints flood in. Revenue is bleeding.&lt;/p&gt;

&lt;p&gt;This isn't a hypothetical scenario. It's a reality that every organization building on cloud AI services must prepare for. When you're running production AI workloads at scale, the question isn't &lt;em&gt;if&lt;/em&gt; you'll need failover—it's &lt;em&gt;when&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;In this article, I'll walk you through the exact architecture used to achieve &lt;strong&gt;99.95% uptime&lt;/strong&gt; for Azure OpenAI services serving millions of requests daily. You'll get the actual &lt;a href="https://learn.microsoft.com/en-us/azure/api-management/api-management-howto-policies" rel="noopener noreferrer"&gt;APIM policies&lt;/a&gt;, load testing scripts, and production readiness strategies.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: Why Azure OpenAI Needs Sophisticated Failover
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Reality of Cloud AI Services
&lt;/h3&gt;

&lt;p&gt;Azure OpenAI is remarkable, but it's still a cloud service with real-world constraints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Regional Quota Limits:&lt;/strong&gt; You can't just throw infinite traffic at a single endpoint. Azure enforces TPM (Tokens Per Minute) and RPM (Requests Per Minute) quotas per region.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rate Limiting (429 Errors):&lt;/strong&gt; When you hit quota limits, you get HTTP 429 (Too Many Requests) responses. These aren't service errors—they're expected behavior that you must handle gracefully (a minimal client-side sketch follows this list).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Regional Outages:&lt;/strong&gt; Azure regions can and do experience issues. In Q3 2024 alone, we saw multiple incidents affecting OpenAI availability in specific regions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Deployment Latency Variance:&lt;/strong&gt; A request to westus might take 200ms, while the same request to eastus takes 450ms. Geography matters.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
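
&lt;p&gt;Before getting into the architecture, here's what "handling 429s gracefully" looks like at its most basic on the client side: honor the &lt;code&gt;Retry-After&lt;/code&gt; header when present and back off exponentially otherwise. This is a generic sketch, not the production implementation (that work is pushed down into the APIM layer described below):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Generic client-side 429 handling: honor Retry-After when present,
# otherwise back off exponentially. Illustrative only; in this article
# the same logic lives in the APIM policy layer instead.
import time

import requests

def call_with_backoff(url, headers, payload, max_retries=5):
    for attempt in range(max_retries):
        resp = requests.post(url, headers=headers, json=payload, timeout=60)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        # Prefer the service's own hint, fall back to 2^attempt seconds.
        wait = float(resp.headers.get("Retry-After", 2 ** attempt))
        time.sleep(wait)
    raise RuntimeError("rate limited on every attempt")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;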

&lt;h3&gt;
  
  
  The Business Impact
&lt;/h3&gt;

&lt;p&gt;Let's talk numbers. For an enterprise AI application serving more than 1 million requests per day:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Uptime&lt;/th&gt;
&lt;th&gt;Downtime per Year&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;99%&lt;/td&gt;
&lt;td&gt;87.6 hours (3.65 days)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;99.9%&lt;/td&gt;
&lt;td&gt;8.76 hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;99.95%&lt;/td&gt;
&lt;td&gt;4.38 hours&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
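
&lt;p&gt;These figures follow directly from downtime = (1 - availability) &amp;times; 8,760 hours per year: 99% availability leaves room for roughly 87.6 hours of downtime, while 99.95% leaves only about 4.4 hours.&lt;/p&gt;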

&lt;p&gt;That difference between 99% and 99.95%? That's potentially millions in revenue, thousands of lost customers, and immeasurable damage to brand reputation.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture: A Multi-Layer Resilience Strategy
&lt;/h2&gt;

&lt;p&gt;Here's the complete architecture implemented to achieve high availability:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fggk09u8iax4us6xckp44.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fggk09u8iax4us6xckp44.jpg" alt="Multi-region Azure OpenAI architecture" width="800" height="893"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 1: Multi-region Azure OpenAI architecture with Azure Front Door, APIM, and regional OpenAI instances.&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Architecture Components
&lt;/h3&gt;

&lt;p&gt;📦 &lt;strong&gt;GitHub Repo:&lt;/strong&gt; &lt;a href="https://github.com/deneeshnarayanasamy/azure-openai-multi-region-failover/tree/main" rel="noopener noreferrer"&gt;azure-openai-multi-region-failover&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's break down each layer:&lt;/p&gt;
&lt;h4&gt;
  
  
  Layer 1: &lt;a href="https://learn.microsoft.com/en-us/azure/frontdoor/front-door-overview" rel="noopener noreferrer"&gt;Azure Front Door&lt;/a&gt; + &lt;a href="https://learn.microsoft.com/en-us/azure/web-application-firewall/afds/afds-overview" rel="noopener noreferrer"&gt;WAF&lt;/a&gt; (Global Entry Point)
&lt;/h4&gt;

&lt;p&gt;Azure Front Door serves as the global load balancer, providing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dzone.com/articles/distributed-denial-of-service-ddos-attacks-what-yo" rel="noopener noreferrer"&gt;DDoS protection&lt;/a&gt; and Web Application Firewall&lt;/li&gt;
&lt;li&gt;SSL/TLS termination at the edge&lt;/li&gt;
&lt;li&gt;Geographic routing to nearest APIM instance&lt;/li&gt;
&lt;li&gt;Health probing of backend APIM endpoints&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  Layer 2: &lt;a href="https://azure.microsoft.com/en-us/products/api-management" rel="noopener noreferrer"&gt;Azure API Management&lt;/a&gt; (Regional Intelligence)
&lt;/h4&gt;

&lt;p&gt;APIM instances deployed in multiple regions provide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API key management and authentication&lt;/li&gt;
&lt;li&gt;Rate limiting and throttling policies&lt;/li&gt;
&lt;li&gt;Intelligent failover logic (this is where the magic happens)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dzone.com/articles/application-telemetry-different-objectives" rel="noopener noreferrer"&gt;Telemetry&lt;/a&gt; and &lt;a href="https://dzone.com/articles/top-5-metrics-for-cloud-application-monitoring" rel="noopener noreferrer"&gt;monitoring&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why APIM and not just Front Door?&lt;/strong&gt; Because Front Door doesn't understand HTTP 429 responses. It can't distinguish between a true service failure and a rate limit. APIM gives us the intelligence to react appropriately.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h4&gt;
  
  
  Layer 3: &lt;a href="https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/create-resource?view=foundry-classic&amp;amp;pivots=web-portal" rel="noopener noreferrer"&gt;Azure OpenAI Resources&lt;/a&gt; (Regional Capacity)
&lt;/h4&gt;

&lt;p&gt;Deploy OpenAI resources across multiple Azure regions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Primary regions&lt;/strong&gt; (WestUS, SouthIndia, JapanEast, AustraliaEast) for normal traffic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secondary regions&lt;/strong&gt; (EastUS, CentralIndia, JapanWest, AustraliaSoutheast) as failover targets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;European regions&lt;/strong&gt; (SwedenCentral, SwitzerlandWest, GermanyWestCentral) for GDPR compliance&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  The Implementation: APIM Policy Magic
&lt;/h2&gt;

&lt;p&gt;Here's where things get interesting. The APIM policy is the brain of the failover system. Let me show you the actual policy that handles 429 responses and fails over seamlessly.&lt;/p&gt;

&lt;p&gt;📄 &lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/deneeshnarayanasamy/azure-openai-multi-region-failover/blob/main/policies/basic-failover-policy.xml" rel="noopener noreferrer"&gt;basic-failover-policy.xml&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  The Failover Policy (Complete Implementation)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;policies&amp;gt;
    &amp;lt;inbound&amp;gt;
        &amp;lt;!-- Store the original request path for failover --&amp;gt;
        &amp;lt;set-variable name="originalPath" value="@(context.Request.Url.Path)" /&amp;gt;

        &amp;lt;!-- Extract deployment name from path --&amp;gt;
        &amp;lt;set-variable name="deploymentName" 
                      value="@{
                          var path = context.Request.Url.Path;
                          var match = System.Text.RegularExpressions.Regex.Match(
                              path, 
                              @"/openai/deployments/([^/]+)/");
                          return match.Success ? match.Groups[1].Value : "";
                      }" /&amp;gt;

        &amp;lt;!-- Set primary backend --&amp;gt;
        &amp;lt;set-backend-service base-url="https://westus-primary.openai.azure.com/openai" /&amp;gt;

        &amp;lt;!-- Add request ID for tracing --&amp;gt;
        &amp;lt;set-header name="X-Request-ID" exists-action="override"&amp;gt;
            &amp;lt;value&amp;gt;@(Guid.NewGuid().ToString())&amp;lt;/value&amp;gt;
        &amp;lt;/set-header&amp;gt;

        &amp;lt;!-- Pass through API key (or transform as needed) --&amp;gt;
        &amp;lt;set-header name="api-key" exists-action="override"&amp;gt;
            &amp;lt;value&amp;gt;{{primary-openai-key}}&amp;lt;/value&amp;gt;
        &amp;lt;/set-header&amp;gt;
    &amp;lt;/inbound&amp;gt;

    &amp;lt;backend&amp;gt;
        &amp;lt;!-- Forward to backend --&amp;gt;
        &amp;lt;forward-request buffer-response="true" /&amp;gt;
    &amp;lt;/backend&amp;gt;

    &amp;lt;outbound&amp;gt;
        &amp;lt;!-- Check for rate limit response --&amp;gt;
        &amp;lt;choose&amp;gt;
            &amp;lt;when condition="@(context.Response.StatusCode == 429)"&amp;gt;
                &amp;lt;!-- Log the rate limit event --&amp;gt;
                &amp;lt;trace source="apim-failover"&amp;gt;
                    Primary backend returned 429 for request @(context.Request.Headers.GetValueOrDefault("X-Request-ID", ""))
                &amp;lt;/trace&amp;gt;

                &amp;lt;!-- Attempt failover to secondary region --&amp;gt;
                &amp;lt;send-request mode="new" response-variable-name="failoverResponse" 
                              timeout="120" ignore-error="false"&amp;gt;
                    &amp;lt;set-url&amp;gt;@{
                        var deployment = context.Variables.GetValueOrDefault&amp;lt;string&amp;gt;("deploymentName");
                        return $"https://eastus-secondary.openai.azure.com/openai/deployments/{deployment}/chat/completions?api-version=2024-08-01-preview";
                    }&amp;lt;/set-url&amp;gt;
                    &amp;lt;set-method&amp;gt;POST&amp;lt;/set-method&amp;gt;
                    &amp;lt;set-header name="Content-Type" exists-action="override"&amp;gt;
                        &amp;lt;value&amp;gt;application/json&amp;lt;/value&amp;gt;
                    &amp;lt;/set-header&amp;gt;
                    &amp;lt;set-header name="api-key" exists-action="override"&amp;gt;
                        &amp;lt;value&amp;gt;{{secondary-openai-key}}&amp;lt;/value&amp;gt;
                    &amp;lt;/set-header&amp;gt;
                    &amp;lt;set-header name="X-Failover-Attempt" exists-action="override"&amp;gt;
                        &amp;lt;value&amp;gt;true&amp;lt;/value&amp;gt;
                    &amp;lt;/set-header&amp;gt;
                    &amp;lt;set-body&amp;gt;@(context.Request.Body.As&amp;lt;string&amp;gt;(preserveContent: true))&amp;lt;/set-body&amp;gt;
                &amp;lt;/send-request&amp;gt;

                &amp;lt;!-- Return the failover response --&amp;gt;
                &amp;lt;return-response&amp;gt;
                    &amp;lt;set-status code="@(((IResponse)context.Variables["failoverResponse"]).StatusCode)" 
                                reason="@(((IResponse)context.Variables["failoverResponse"]).StatusReason)" /&amp;gt;
                    &amp;lt;set-header name="X-Served-By" exists-action="override"&amp;gt;
                        &amp;lt;value&amp;gt;secondary-region&amp;lt;/value&amp;gt;
                    &amp;lt;/set-header&amp;gt;
                    &amp;lt;set-header name="X-Failover" exists-action="override"&amp;gt;
                        &amp;lt;value&amp;gt;true&amp;lt;/value&amp;gt;
                    &amp;lt;/set-header&amp;gt;
                    &amp;lt;set-body&amp;gt;@(((IResponse)context.Variables["failoverResponse"]).Body.As&amp;lt;string&amp;gt;())&amp;lt;/set-body&amp;gt;
                &amp;lt;/return-response&amp;gt;
            &amp;lt;/when&amp;gt;
            &amp;lt;when condition="@(context.Response.StatusCode &amp;gt;= 500)"&amp;gt;
                &amp;lt;!-- Handle 5xx errors similarly --&amp;gt;
                &amp;lt;trace source="apim-failover"&amp;gt;
                    Primary backend returned @(context.Response.StatusCode) - attempting failover
                &amp;lt;/trace&amp;gt;
                &amp;lt;!-- Same failover logic as above --&amp;gt;
            &amp;lt;/when&amp;gt;
        &amp;lt;/choose&amp;gt;

        &amp;lt;!-- Add header indicating which backend served the request --&amp;gt;
        &amp;lt;set-header name="X-Served-By" exists-action="skip"&amp;gt;
            &amp;lt;value&amp;gt;primary-region&amp;lt;/value&amp;gt;
        &amp;lt;/set-header&amp;gt;
    &amp;lt;/outbound&amp;gt;

    &amp;lt;on-error&amp;gt;
        &amp;lt;!-- Log errors --&amp;gt;
        &amp;lt;trace source="apim-failover-error"&amp;gt;
            Error occurred: @(context.LastError.Message)
        &amp;lt;/trace&amp;gt;
    &amp;lt;/on-error&amp;gt;
&amp;lt;/policies&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Key Policy Features Explained
&lt;/h3&gt;

&lt;p&gt;Let me walk you through what makes this policy effective:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Request Context Preservation:&lt;/strong&gt; We store the original path and deployment name in variables. This is crucial because when we construct the failover request, we need to maintain the exact same endpoint structure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Buffer Response = True:&lt;/strong&gt; This is critical. APIM needs to read the complete response (including status code) before it can make decisions. Without buffering, we can't inspect the 429 status.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Synchronous Failover:&lt;/strong&gt; I use &lt;code&gt;send-request&lt;/code&gt; with &lt;code&gt;mode="new"&lt;/code&gt; to create a completely new HTTP request to the secondary region. The original request is abandoned.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Header Propagation:&lt;/strong&gt; The &lt;code&gt;X-Served-By&lt;/code&gt; header tells the client which region actually served the request. This is invaluable for debugging and telemetry (see the client-side sketch after this list).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Named Values:&lt;/strong&gt; Notice &lt;code&gt;{{primary-openai-key}}&lt;/code&gt; and &lt;code&gt;{{secondary-openai-key}}&lt;/code&gt;? These are APIM Named Values stored in Azure Key Vault—secure configuration that keeps secrets out of policy XML.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
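
&lt;p&gt;Here's a minimal client-side sketch showing how those &lt;code&gt;X-Served-By&lt;/code&gt; and &lt;code&gt;X-Failover&lt;/code&gt; headers can feed your telemetry. The gateway URL, APIM key, and deployment name are placeholders (the same ones used in the client-retry example later in this article); adjust them to your environment:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import httpx

# Placeholders - replace with your Front Door endpoint, APIM key, and deployment name
GATEWAY = "https://your-afd.azurefd.net"
DEPLOYMENT = "gpt-4o"
API_VERSION = "2024-08-01-preview"

resp = httpx.post(
    f"{GATEWAY}/openai/deployments/{DEPLOYMENT}/chat/completions",
    params={"api-version": API_VERSION},
    headers={"api-key": "your-apim-key"},
    json={"messages": [{"role": "user", "content": "ping"}]},
    timeout=60.0,
)

# The policy stamps these headers, so we can record which region actually answered
served_by = resp.headers.get("X-Served-By", "unknown")
failed_over = resp.headers.get("X-Failover", "false")
print(f"status={resp.status_code} served_by={served_by} failover={failed_over}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Logging these two headers alongside latency gives you per-region success and failover counts for free, which the monitoring section below relies on.&lt;/p&gt;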
&lt;h3&gt;
  
  
  Why This Approach Works
&lt;/h3&gt;

&lt;p&gt;Traditional load balancers fail here because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;They see HTTP 429 as a "successful" response (it's not a 5xx)&lt;/li&gt;
&lt;li&gt;They can't read and interpret the response body&lt;/li&gt;
&lt;li&gt;They can't make intelligent decisions based on API-specific behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;APIM bridges this gap by giving us full control over the request/response pipeline.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Streaming Challenge: Handling SSE in Failover Scenarios
&lt;/h2&gt;

&lt;p&gt;Here's something most failover guides don't tell you: &lt;strong&gt;streaming responses fundamentally change the game&lt;/strong&gt;. When you're calling GPT-4o or similar LLMs, you're not getting a single response—you're getting a continuous stream of tokens via Server-Sent Events (SSE).&lt;/p&gt;
&lt;h3&gt;
  
  
  Why LLMs Use Streaming
&lt;/h3&gt;

&lt;p&gt;In production AI applications, streaming isn't optional—it's essential:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Non-streaming: User waits 10+ seconds staring at a blank screen
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write a detailed analysis...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;  &lt;span class="c1"&gt;# Bad UX!
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Streaming: Tokens appear immediately, feels responsive
&lt;/span&gt;&lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write a detailed analysis...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;  &lt;span class="c1"&gt;# Good UX!
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The UX difference is massive: non-streaming feels like your application is frozen. Streaming gives users immediate feedback and a perception of speed, even if the total response time is similar.&lt;/p&gt;

&lt;h3&gt;
  
  
  The APIM + SSE Problem
&lt;/h3&gt;

&lt;p&gt;Here's where it gets tricky. Remember the &lt;code&gt;buffer-response="true"&lt;/code&gt; setting in the APIM policy? That works great for standard HTTP responses, but it &lt;strong&gt;breaks streaming&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Buffered responses:&lt;/strong&gt; APIM reads the entire response before forwarding. Perfect for inspecting status codes (429), terrible for SSE.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming responses:&lt;/strong&gt; APIM forwards chunks as they arrive. Great for UX, but we can't inspect the status code mid-stream.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can't have both... or can you?&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution: Hybrid Approach with Smart Detection
&lt;/h3&gt;

&lt;p&gt;Microsoft recently documented proper SSE support in APIM (&lt;a href="https://learn.microsoft.com/en-us/azure/api-management/how-to-server-sent-events" rel="noopener noreferrer"&gt;Server-Sent Events in Azure API Management&lt;/a&gt;), and here's how I adapted it for the failover scenario:&lt;/p&gt;

&lt;p&gt;📄 &lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/deneeshnarayanasamy/azure-openai-multi-region-failover/blob/main/policies/streaming-aware-failover-policy.xml" rel="noopener noreferrer"&gt;streaming-aware-failover-policy.xml&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;policies&amp;gt;
    &amp;lt;inbound&amp;gt;
        &amp;lt;!-- Store request details --&amp;gt;
        &amp;lt;set-variable name="originalPath" value="@(context.Request.Url.Path)" /&amp;gt;
        &amp;lt;set-variable name="deploymentName" 
                      value="@{
                          var path = context.Request.Url.Path;
                          var match = System.Text.RegularExpressions.Regex.Match(
                              path, @"/openai/deployments/([^/]+)/");
                          return match.Success ? match.Groups[1].Value : "";
                      }" /&amp;gt;

        &amp;lt;!-- Check if this is a streaming request --&amp;gt;
        &amp;lt;set-variable name="isStreaming" 
                      value="@{
                          var body = context.Request.Body?.As&amp;lt;JObject&amp;gt;(preserveContent: true);
                          return body != null &amp;amp;&amp;amp; 
                                 body["stream"] != null &amp;amp;&amp;amp; 
                                 body["stream"].Value&amp;lt;bool&amp;gt;() == true;
                      }" /&amp;gt;

        &amp;lt;set-backend-service base-url="https://westus-primary.openai.azure.com/openai" /&amp;gt;

        &amp;lt;set-header name="api-key" exists-action="override"&amp;gt;
            &amp;lt;value&amp;gt;{{primary-openai-key}}&amp;lt;/value&amp;gt;
        &amp;lt;/set-header&amp;gt;
    &amp;lt;/inbound&amp;gt;

    &amp;lt;backend&amp;gt;
        &amp;lt;!-- For streaming requests, don't buffer --&amp;gt;
        &amp;lt;forward-request 
            buffer-response="@(!(bool)context.Variables["isStreaming"])" /&amp;gt;
    &amp;lt;/backend&amp;gt;

    &amp;lt;outbound&amp;gt;
        &amp;lt;choose&amp;gt;
            &amp;lt;!-- Only attempt failover for non-streaming 429s --&amp;gt;
            &amp;lt;when condition="@(context.Response.StatusCode == 429 &amp;amp;&amp;amp; 
                              !(bool)context.Variables["isStreaming"])"&amp;gt;
                &amp;lt;trace source="apim-failover"&amp;gt;
                    Primary backend returned 429 for non-streaming request - attempting failover
                &amp;lt;/trace&amp;gt;

                &amp;lt;!-- Standard failover logic here --&amp;gt;
                &amp;lt;send-request mode="new" response-variable-name="failoverResponse" 
                              timeout="120" ignore-error="false"&amp;gt;
                    &amp;lt;set-url&amp;gt;@{
                        var deployment = context.Variables.GetValueOrDefault&amp;lt;string&amp;gt;("deploymentName");
                        return $"https://eastus-secondary.openai.azure.com/openai/deployments/{deployment}/chat/completions?api-version=2024-08-01-preview";
                    }&amp;lt;/set-url&amp;gt;
                    &amp;lt;set-method&amp;gt;POST&amp;lt;/set-method&amp;gt;
                    &amp;lt;set-header name="Content-Type" exists-action="override"&amp;gt;
                        &amp;lt;value&amp;gt;application/json&amp;lt;/value&amp;gt;
                    &amp;lt;/set-header&amp;gt;
                    &amp;lt;set-header name="api-key" exists-action="override"&amp;gt;
                        &amp;lt;value&amp;gt;{{secondary-openai-key}}&amp;lt;/value&amp;gt;
                    &amp;lt;/set-header&amp;gt;
                    &amp;lt;set-body&amp;gt;@(context.Request.Body.As&amp;lt;string&amp;gt;(preserveContent: true))&amp;lt;/set-body&amp;gt;
                &amp;lt;/send-request&amp;gt;

                &amp;lt;return-response&amp;gt;
                    &amp;lt;set-status code="@(((IResponse)context.Variables["failoverResponse"]).StatusCode)" 
                                reason="@(((IResponse)context.Variables["failoverResponse"]).StatusReason)" /&amp;gt;
                    &amp;lt;set-header name="X-Served-By" exists-action="override"&amp;gt;
                        &amp;lt;value&amp;gt;secondary-region&amp;lt;/value&amp;gt;
                    &amp;lt;/set-header&amp;gt;
                    &amp;lt;set-body&amp;gt;@(((IResponse)context.Variables["failoverResponse"]).Body.As&amp;lt;string&amp;gt;())&amp;lt;/set-body&amp;gt;
                &amp;lt;/return-response&amp;gt;
            &amp;lt;/when&amp;gt;

            &amp;lt;!-- For streaming requests, if we get here, just pass through --&amp;gt;
            &amp;lt;when condition="@((bool)context.Variables["isStreaming"])"&amp;gt;
                &amp;lt;set-header name="X-Stream-Mode" exists-action="override"&amp;gt;
                    &amp;lt;value&amp;gt;enabled&amp;lt;/value&amp;gt;
                &amp;lt;/set-header&amp;gt;
                &amp;lt;set-header name="X-Served-By" exists-action="skip"&amp;gt;
                    &amp;lt;value&amp;gt;primary-region-stream&amp;lt;/value&amp;gt;
                &amp;lt;/set-header&amp;gt;
            &amp;lt;/when&amp;gt;
        &amp;lt;/choose&amp;gt;
    &amp;lt;/outbound&amp;gt;
&amp;lt;/policies&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Key Design Decision: Client-Side Retry Strategy
&lt;/h3&gt;

&lt;p&gt;Since we can't fail over mid-stream at the APIM level, I implement retry logic in the client application:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AzureOpenAI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;APIError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RateLimitError&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;stream_with_retry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;30.0&lt;/span&gt;  &lt;span class="c1"&gt;# Detect dead streams quickly
&lt;/span&gt;            &lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;

            &lt;span class="c1"&gt;# Stream completed successfully
&lt;/span&gt;            &lt;span class="k"&gt;return&lt;/span&gt;

        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;RateLimitError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;wait_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;  &lt;span class="c1"&gt;# 1s, 2s, 4s
&lt;/span&gt;                &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wait_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt;

        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;wait_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
                &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wait_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt;

&lt;span class="c1"&gt;# Usage
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AzureOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;azure_endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://your-afd.azurefd.net&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-apim-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2024-08-01-preview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;stream_with_retry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain quantum computing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Load Testing: Proving It Works
&lt;/h2&gt;

&lt;p&gt;Theory is great. Data is better. Here's how I validated the architecture across three scenarios:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Direct OpenAI (Baseline):&lt;/strong&gt; Calling Azure OpenAI endpoints directly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AFD+APIM to Single Region:&lt;/strong&gt; Using Front Door and APIM but with only one region&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AFD+APIM with Failover:&lt;/strong&gt; The complete multi-region architecture&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Test Methodology
&lt;/h3&gt;

&lt;p&gt;📄 &lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/deneeshnarayanasamy/azure-openai-multi-region-failover/blob/main/scripts/load_test.py" rel="noopener noreferrer"&gt;load_test.py&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I built a Python load test that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sends 1M requests with high concurrency&lt;/li&gt;
&lt;li&gt;Uses a 70/30 mix of simple and complex queries&lt;/li&gt;
&lt;li&gt;Measures success rate, latency, and failover events&lt;/li&gt;
&lt;li&gt;Categorizes responses by region served
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simplified load test script
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scenario&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_requests&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;concurrency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;simple_ratio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;  &lt;span class="c1"&gt;# 70% simple queries
&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;batch_start&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_requests&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;concurrency&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;batch_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;concurrency&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_requests&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;batch_start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# Create a mix of simple and complex queries
&lt;/span&gt;            &lt;span class="n"&gt;queries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;query_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;simple&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;simple_ratio&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;complex&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SIMPLE_QUERIES&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SIMPLE_QUERIES&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;query_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;simple&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;COMPLEX_QUERIES&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;COMPLEX_QUERIES&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
                &lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query_type&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

            &lt;span class="c1"&gt;# Send batch of requests concurrently
&lt;/span&gt;            &lt;span class="n"&gt;tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="nf"&gt;send_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scenario&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query_type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query_type&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;queries&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;

            &lt;span class="n"&gt;batch_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Results: Success Rates and Latency
&lt;/h3&gt;

&lt;p&gt;Here's what the data showed:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Success Rate&lt;/th&gt;
&lt;th&gt;Avg Latency&lt;/th&gt;
&lt;th&gt;P95 Latency&lt;/th&gt;
&lt;th&gt;Failover Rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Direct OpenAI&lt;/td&gt;
&lt;td&gt;87.3%&lt;/td&gt;
&lt;td&gt;1,521ms&lt;/td&gt;
&lt;td&gt;2,874ms&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AFD+APIM Single&lt;/td&gt;
&lt;td&gt;88.1%&lt;/td&gt;
&lt;td&gt;1,698ms&lt;/td&gt;
&lt;td&gt;3,056ms&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AFD+APIM Failover&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;99.4%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2,184ms&lt;/td&gt;
&lt;td&gt;4,128ms&lt;/td&gt;
&lt;td&gt;12.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Key findings:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Direct OpenAI suffered from rate limits with no recovery mechanism&lt;/li&gt;
&lt;li&gt;AFD+APIM Single added minimal overhead (~177ms) but didn't improve reliability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AFD+APIM Failover achieved near-perfect reliability&lt;/strong&gt; at the cost of higher P95 latency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The latency increase for failover requests is expected—we're making a second API call when the first one fails. However, this tradeoff is absolutely worth it given the massive improvement in success rate.&lt;/p&gt;
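
&lt;p&gt;A quick back-of-envelope check (using the table above, and assuming the primary-call latency matches the single-region scenario) shows the blended average is consistent with roughly one expensive extra round trip for the ~12% of requests that failed over:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Back-of-envelope check against the table above (approximate figures)
single_region_avg_ms = 1698      # AFD+APIM single region, no failover
failover_rate = 0.122            # 12.2% of requests retried in a second region
failover_scenario_avg_ms = 2184  # AFD+APIM with failover enabled

# blended = (1 - p) * t_primary + p * (t_primary + t_extra)  =&amp;gt;  solve for t_extra
t_extra_ms = (failover_scenario_avg_ms - single_region_avg_ms) / failover_rate
print(f"Implied extra cost of a failed-over request: ~{t_extra_ms:.0f} ms")
# Roughly 4 seconds: wait out the primary 429, then make a full second call elsewhere
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;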




&lt;h2&gt;
  
  
  Production Readiness
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Circuit Breakers Are Essential
&lt;/h3&gt;

&lt;p&gt;Pure failover isn't enough. You need intelligent circuit breakers to avoid hammering overloaded regions:&lt;/p&gt;

&lt;p&gt;📄 &lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/deneeshnarayanasamy/azure-openai-multi-region-failover/blob/main/policies/circuit-breaker-policy.xml" rel="noopener noreferrer"&gt;circuit-breaker-policy.xml&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;set-variable name="failoverCount" value="@{
    string counterKey = "failover-count-" + context.Deployment.Region;
    int count = context.Variables.ContainsKey(counterKey) 
        ? (int)context.Variables[counterKey] 
        : 0;

    if (count &amp;gt; 10) { // Circuit breaker threshold
        // Check if 5 minutes have passed since last circuit break
        if (DateTime.UtcNow &amp;gt; context.Variables.GetValueOrDefault&amp;lt;DateTime&amp;gt;("circuit-breaker-time", DateTime.MinValue).AddMinutes(5)) {
            // Reset counter and allow a test request
            context.Variables[counterKey] = 0;
            return 0;
        }
        // Circuit still open
        return count;
    }
    // Increment counter
    return count + 1;
}" /&amp;gt;

&amp;lt;choose&amp;gt;
    &amp;lt;when condition="@((int)context.Variables["failoverCount"] &amp;gt; 10)"&amp;gt;
        &amp;lt;!-- Circuit is open, return friendly error --&amp;gt;
        &amp;lt;return-response&amp;gt;
            &amp;lt;set-status code="503" reason="Service Unavailable" /&amp;gt;
            &amp;lt;set-body&amp;gt;{"error": "All regions currently at capacity. Please try again in a few minutes."}&amp;lt;/set-body&amp;gt;
        &amp;lt;/return-response&amp;gt;
    &amp;lt;/when&amp;gt;
&amp;lt;/choose&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. &lt;a href="https://sre.google/workbook/monitoring/" rel="noopener noreferrer"&gt;Monitoring&lt;/a&gt; Is Everything
&lt;/h3&gt;

&lt;p&gt;I built a comprehensive monitoring dashboard that tracks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Success Rates:&lt;/strong&gt; Overall, per region, and per failover status&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency Distribution:&lt;/strong&gt; P50/P95/P99 across all scenarios&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failover Metrics:&lt;/strong&gt; Failover count, success rate, and latency impact&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quota Utilization:&lt;/strong&gt; Per-region TPM/RPM usage against limits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Circuit Breaker Status:&lt;/strong&gt; Open/closed state and activation frequency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Alert triggers:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Success rate drops below 99.5% for 5 minutes&lt;/li&gt;
&lt;li&gt;Failover rate exceeds 15% for 10 minutes&lt;/li&gt;
&lt;li&gt;Primary region 429 errors exceed 5% for 5 minutes&lt;/li&gt;
&lt;li&gt;Any region's quota utilization exceeds 85% for 15 minutes&lt;/li&gt;
&lt;/ul&gt;
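
&lt;p&gt;The rate-based triggers above are easy to evaluate once you export per-request results (status code, serving region, failover flag) into your metrics store. Here's a minimal sketch of the evaluation logic for a single window, assuming you already have those counts in hand:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from dataclasses import dataclass

@dataclass
class WindowStats:
    total: int          # requests in the evaluation window
    successes: int      # 2xx responses (including failed-over ones)
    failovers: int      # requests served by a secondary region
    primary_429s: int   # 429s returned by the primary region

def evaluate_alerts(window: WindowStats) -&amp;gt; list:
    """Return the alert names that fire for one evaluation window."""
    alerts = []
    if window.total == 0:
        return alerts
    if window.successes / window.total &amp;lt; 0.995:     # sustained 5 min in practice
        alerts.append("success-rate-below-99.5%")
    if window.failovers / window.total &amp;gt; 0.15:      # sustained 10 min in practice
        alerts.append("failover-rate-above-15%")
    if window.primary_429s / window.total &amp;gt; 0.05:   # sustained 5 min in practice
        alerts.append("primary-429-rate-above-5%")
    return alerts

# Example: one five-minute window of gateway results
stats = WindowStats(total=1000, successes=991, failovers=160, primary_429s=40)
print(evaluate_alerts(stats))  # ['success-rate-below-99.5%', 'failover-rate-above-15%']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The quota-utilization trigger is different: it typically comes from Azure Monitor metrics on the OpenAI resources rather than your own request logs, so it lives in a separate query.&lt;/p&gt;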




&lt;h2&gt;
  
  
  Best Practices: Your Implementation Checklist
&lt;/h2&gt;

&lt;p&gt;If you're implementing this architecture, here's your checklist:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] &lt;strong&gt;Provision Multiple Regions:&lt;/strong&gt; Deploy both primary and failover OpenAI resources&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Set Up Front Door:&lt;/strong&gt; Configure with WAF and geographic routing&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Deploy Regional APIM:&lt;/strong&gt; Use Premium tier for availability sets&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Implement Failover Policy:&lt;/strong&gt; Use my policy template, adjusting for your deployment names&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Configure Named Values:&lt;/strong&gt; Secure your API keys using APIM Named Values&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Set Up Monitoring:&lt;/strong&gt; Track success rates, latency, and failover events&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Implement Circuit Breakers:&lt;/strong&gt; Avoid cascading failures with breaker policies&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Add Client Retries:&lt;/strong&gt; Implement exponential backoff for streaming requests&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Test, Test, Test:&lt;/strong&gt; Load test with your actual traffic patterns&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion: From 3 AM Panic to Peaceful Sleep
&lt;/h2&gt;

&lt;p&gt;The investment in this multi-region, intelligent failover architecture pays for itself many times over—not just in reduced downtime costs, but in customer trust and team sanity.&lt;/p&gt;

&lt;p&gt;Is it perfect? No. We still have the occasional hiccup. But the difference between 87.3% and 99.4% reliability is the difference between an unreliable product and one that users can count on.&lt;/p&gt;

&lt;p&gt;Though there are many sandbox projects available, what I've described is a proven, self-managed method. This solution stems from personal experience, and I acknowledge that high availability can be achieved through various approaches. For instance, purchasing Azure's &lt;a href="https://learn.microsoft.com/en-us/azure/ai-foundry/openai/concepts/provisioned-throughput?view=foundry-classic&amp;amp;tabs=global-ptum" rel="noopener noreferrer"&gt;Provisioned Throughput Units (PTUs)&lt;/a&gt; offers guaranteed capacity but can be costly and still requires a strategy for regional outages. For those exploring alternatives, projects like &lt;a href="https://kgateway.dev/blog/ai-gateway-load-balancing-model-failover/" rel="noopener noreferrer"&gt;KGateway&lt;/a&gt; and &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; offer interesting options worth investigating. I'd love to hear from readers about other projects or approaches you've used to solve this problem.&lt;/p&gt;

&lt;p&gt;As Azure OpenAI continues to evolve, so does the architecture. But the fundamental principles outlined here—multiple layers of resilience, intelligent request routing, and a deep understanding of the service's behavior—will remain essential for any enterprise-scale AI deployment.&lt;/p&gt;

&lt;p&gt;📦 &lt;strong&gt;GitHub Repo:&lt;/strong&gt; &lt;a href="https://github.com/deneeshnarayanasamy/azure-openai-multi-region-failover/tree/main" rel="noopener noreferrer"&gt;azure-openai-multi-region-failover&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Imagine this: instead of being jolted awake at 3 AM by alert notifications, you simply check your dashboard during your morning coffee and see "Incident detected at 03:17, automatic failover initiated, recovery complete by 03:18." That's not fantasy—it's precisely what this architecture delivers. Your system detected the problem, executed the failover, and restored service while you enjoyed uninterrupted sleep. The difference between constant firefighting and confident reliability isn't just technical—it's transformative for your team's wellbeing and your customers' trust.&lt;/p&gt;

</description>
      <category>azure</category>
      <category>openai</category>
      <category>devops</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Google A2UI: The Future of Agentic AI for DevOps &amp; SRE (Goodbye Text-Only ChatOps)</title>
      <dc:creator>Deneesh Narayanasamy</dc:creator>
      <pubDate>Sat, 27 Dec 2025 19:54:27 +0000</pubDate>
      <link>https://dev.to/deneesh_narayanasamy/google-a2ui-the-future-of-agentic-ai-for-devops-sre-goodbye-text-only-chatops-2i4g</link>
      <guid>https://dev.to/deneesh_narayanasamy/google-a2ui-the-future-of-agentic-ai-for-devops-sre-goodbye-text-only-chatops-2i4g</guid>
      <description>&lt;p&gt;&lt;em&gt;The era of "Text-Only" ChatOps is ending. Google's new open-source protocol, &lt;strong&gt;A2UI&lt;/strong&gt;, lets AI agents render native, interactive interfaces. Here is what Platform Engineers and SREs need to know.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🚀 TL;DR (For the Busy Engineer)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;What is it?&lt;/strong&gt; &lt;a href="https://developers.googleblog.com/introducing-a2ui-an-open-project-for-agent-driven-interfaces/" rel="noopener noreferrer"&gt;A2UI (Agent-to-User Interface)&lt;/a&gt; is a new open-source standard by Google that lets AI agents generate UI components (JSON) instead of raw text or HTML.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Why care?&lt;/strong&gt; It solves the "Wall of Text" problem in ChatOps. Agents can now pop up interactive forms, charts, and buttons inside your chat app or internal portal.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Key Tech:&lt;/strong&gt; It uses declarative JSON payloads ("Safe like data, expressive like code") to ensure security: no arbitrary JavaScript execution.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Use Case:&lt;/strong&gt; Perfect for &lt;strong&gt;SRE Incident Response&lt;/strong&gt;, &lt;strong&gt;MLOps Labeling&lt;/strong&gt;, and &lt;strong&gt;Self-Service Infrastructure&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Problem: The "Wall of Text" Bottleneck
&lt;/h2&gt;

&lt;p&gt;We have all been there. It's 3 AM, and you are responding to a P1 incident. You query your Ops bot:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;gt; @ops-bot status service-payments&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The bot responds with &lt;strong&gt;50 lines of unformatted JSON logs&lt;/strong&gt;. To fix the issue, you have to remember specific CLI syntax, type it out, and hope you didn't typo a region flag.&lt;/p&gt;

&lt;p&gt;This is the "Last Mile" problem in AI operations. We have brilliant LLMs that can diagnose complex Kubernetes issues, but they are forced to communicate through dumb text channels. This friction increases cognitive load and slows down Mean Time To Resolution (MTTR).&lt;/p&gt;

&lt;h2&gt;
  
  
  Enter A2UI: "Safe Like Data, Expressive Like Code"
&lt;/h2&gt;

&lt;p&gt;Google released &lt;strong&gt;A2UI&lt;/strong&gt; to bridge this gap. Unlike previous approaches that relied on heavy &lt;code&gt;iframes&lt;/code&gt; or dangerous raw HTML injection, A2UI uses a &lt;strong&gt;standardized JSON schema&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The workflow is simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;The Agent&lt;/strong&gt; analyzes the request and sends a JSON "blueprint."&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;The Client&lt;/strong&gt; (your web portal, mobile app, or chat interface) receives the JSON.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;The Renderer&lt;/strong&gt; converts that JSON into &lt;strong&gt;native components&lt;/strong&gt; (React, Flutter, Angular, etc.) that match your brand's style system.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Why This Architecture Wins for DevOps
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Security First:&lt;/strong&gt; The agent &lt;em&gt;cannot&lt;/em&gt; execute code. It can only request components (like &lt;code&gt;Card&lt;/code&gt;, &lt;code&gt;Button&lt;/code&gt;, &lt;code&gt;Graph&lt;/code&gt;) that exist in your client's "Allow List."&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Native Feel:&lt;/strong&gt; The UI looks and behaves like your internal developer platform, not a disjointed third-party embed.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Bi-Directional Sync:&lt;/strong&gt; When you click "Restart Pod," the state updates instantly in the UI without a page refresh.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Some Use Cases for Platform Teams
&lt;/h2&gt;

&lt;p&gt;If you are building an Internal Developer Platform (IDP), here is how you can use A2UI today.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Interactive Incident Commander (SRE)
&lt;/h3&gt;

&lt;p&gt;Instead of linking to a Grafana dashboard, the agent generates the dashboard &lt;em&gt;in the conversation&lt;/em&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;The Trigger:&lt;/strong&gt; "Alert: High Latency on Checkout."&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;The A2UI Response:&lt;/strong&gt; An interactive Card containing:

&lt;ul&gt;
&lt;li&gt;  📉 &lt;strong&gt;Visual:&lt;/strong&gt; A live mini-chart of error rates over the last 15 minutes.&lt;/li&gt;
&lt;li&gt;  📝 &lt;strong&gt;Context:&lt;/strong&gt; A summary of the last 3 deployments.&lt;/li&gt;
&lt;li&gt;  🔴 &lt;strong&gt;Action:&lt;/strong&gt; A "Rollback" button that triggers a specific GitHub workflow.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Human-in-the-Loop MLOps
&lt;/h3&gt;

&lt;p&gt;MLOps teams often struggle with "edge cases" where a model has low confidence. Building a custom web app for labelers is expensive.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;The Scenario:&lt;/strong&gt; A fraud model flags a transaction with 45% confidence.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;The A2UI Solution:&lt;/strong&gt; The agent pushes a "Review Card" to the Ops channel.

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Content:&lt;/strong&gt; Transaction metadata + User History.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Input:&lt;/strong&gt; [Confirm Fraud] vs [False Positive] buttons.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Outcome:&lt;/strong&gt; The click labels the data and triggers a fine-tuning job instantly.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Self-Service Infrastructure Provisioning
&lt;/h3&gt;

&lt;p&gt;Stop making developers write Terraform for simple resources.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;The Request:&lt;/strong&gt; "I need a Redis instance for staging."&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;The A2UI Response:&lt;/strong&gt; A dynamic form.

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Dropdown:&lt;/strong&gt; Select Environment (Dev/Stage).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Slider:&lt;/strong&gt; Select TTL / Retention.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Validation:&lt;/strong&gt; The agent validates the quota &lt;em&gt;before&lt;/em&gt; the user clicks submit.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Code: Anatomy of a Payload
&lt;/h2&gt;

&lt;p&gt;For the developers reading this, here is what the actual wire protocol looks like. It is incredibly readable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "component": "Card",
  "title": "⚠️ Production Alert: High CPU",
  "children": [
    {
      "component": "Text",
      "content": "Service 'payment-gateway' is at 98% utilization."
    },
    {
      "component": "Row",
      "children": [
        {
          "component": "Button",
          "label": "Scale Up (5 Nodes)",
          "action": "scale_up_action",
          "style": "primary"
        },
        {
          "component": "Button",
          "label": "Snooze Alert",
          "action": "snooze_action",
          "style": "secondary"
        }
      ]
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This JSON is platform-agnostic. Your &lt;strong&gt;React&lt;/strong&gt; frontend renders it as a Material UI card; your &lt;strong&gt;iOS&lt;/strong&gt; app renders it as a native SwiftUI view.&lt;/p&gt;
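
&lt;p&gt;To make the "Allow List" idea concrete, here's a deliberately tiny, hypothetical renderer sketch in Python (a real client would use the official A2UI renderers for React, Flutter, and friends). The point is the dispatch pattern: every component type must be registered up front, and anything unknown is rejected rather than executed:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical allow-list renderer: the agent can request UI,
# but only component types registered here will ever be rendered.
ALLOWED = {}

def component(name):
    def register(fn):
        ALLOWED[name] = fn
        return fn
    return register

@component("Card")
def render_card(node, depth=0):
    print("  " * depth + "[Card] " + node.get("title", ""))
    for child in node.get("children", []):
        render(child, depth + 1)

@component("Text")
def render_text(node, depth=0):
    print("  " * depth + node.get("content", ""))

@component("Row")
def render_row(node, depth=0):
    for child in node.get("children", []):
        render(child, depth)

@component("Button")
def render_button(node, depth=0):
    # A real client would wire node["action"] to a pre-approved handler here
    print("  " * depth + "(" + node.get("label", "Button") + ")")

def render(node, depth=0):
    kind = node.get("component")
    if kind not in ALLOWED:
        raise ValueError("Component not on the allow list: " + str(kind))
    ALLOWED[kind](node, depth)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Feeding the alert payload above through &lt;code&gt;render(...)&lt;/code&gt; just prints an indented text outline; a production renderer uses exactly the same dispatch table, mapping each component to a native widget instead of &lt;code&gt;print()&lt;/code&gt; while still refusing anything outside the schema.&lt;/p&gt;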




&lt;h2&gt;
  
  
  A2UI vs. MCP vs. Standard ChatOps
&lt;/h2&gt;

&lt;p&gt;For those comparing this to Anthropic's &lt;strong&gt;MCP (Model Context Protocol)&lt;/strong&gt; or standard webhooks, here is the breakdown:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Standard ChatOps&lt;/th&gt;
&lt;th&gt;MCP (Model Context Protocol)&lt;/th&gt;
&lt;th&gt;Google A2UI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Text / Static Images&lt;/td&gt;
&lt;td&gt;Resources / Text / Prompts&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Native UI Components&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Interactivity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low (Command Line)&lt;/td&gt;
&lt;td&gt;Medium (Tool Use)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;High (Stateful UI)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Security&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;High (No Code Exec)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Implementation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Easy&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Moderate&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best For&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Simple queries&lt;/td&gt;
&lt;td&gt;Connecting Data Sources&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Human-in-the-loop Workflows&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;Google has open-sourced the specification and renderers. You can clone the repo and run the "Restaurant Finder" sample to see the rendering in action (it translates perfectly to a "Service Finder" for DevOps).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Clone the repository
git clone https://github.com/google/A2UI.git

# Navigate to the client sample
cd A2UI/samples/client/lit/shell

# Install and run
npm install &amp;amp;&amp;amp; npm run dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Final Thoughts: The Shift to Generative UI
&lt;/h2&gt;

&lt;p&gt;We are moving away from &lt;strong&gt;Generic UIs&lt;/strong&gt; (dashboards that show everything) to &lt;strong&gt;Generative UIs&lt;/strong&gt; (interfaces created on-the-fly for the exact problem you are solving).&lt;/p&gt;

&lt;p&gt;For DevOps and SREs, A2UI is the toolkit to build that future. It allows us to keep the "Chat" in ChatOps, but finally ditch the "Ops" headaches.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🔗 Resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://developers.googleblog.com/introducing-a2ui-an-open-project-for-agent-driven-interfaces/" rel="noopener noreferrer"&gt;Official Google Blog Post&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://a2ui.org/" rel="noopener noreferrer"&gt;A2UI Organization &amp;amp; Docs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://github.com/google/A2UI" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Have you tried implementing generative UI in your Ops workflows? Let me know in the comments below!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>ai</category>
      <category>sre</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
