Your AI pipeline breaks when OpenAI goes down. In 2025, GPT-4 had 12 significant outages. Claude had 8. If your app has no fallback detection, each outage means manual intervention and angry users. Here's how to monitor AI provider status in real time.
The Problem with Official Status Pages
Most AI providers have status pages, but they:
- Update with 15-30 minute delays after incidents start
- Don't distinguish between degraded performance and total outages
- Don't track regional differences (US-East fine, EU degraded)
- Require manual checking
You need automated, real-time detection.
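The core idea is to probe providers yourself and classify the raw measurements, rather than trusting a delayed status page. A minimal sketch of that classification step — the thresholds here are illustrative assumptions, not values published by any provider:

```python
# Illustrative thresholds; tune them for your own latency and error budgets.
DEGRADED_LATENCY_MS = 5000
DEGRADED_ERROR_RATE = 0.01

def classify(latency_p95_ms: float, error_rate: float) -> str:
    """Map raw probe measurements to a coarse status label."""
    if error_rate >= 0.5:
        return "outage"
    if latency_p95_ms > DEGRADED_LATENCY_MS or error_rate > DEGRADED_ERROR_RATE:
        return "degraded"
    return "operational"
```

Feed this with P95 latency and error rates measured from your own traffic (or cheap synthetic probes), and you get per-region, real-time status instead of a global page updated half an hour late.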
Real-Time Status Monitoring
```javascript
const resp = await fetch('https://api.lazy-mac.com/ai-provider-status', {
  headers: { 'Authorization': 'Bearer YOUR_KEY' }
});
const status = await resp.json();
// {
//   "openai":    { "status": "operational", "latency_p95_ms": 2340, "error_rate": 0.001 },
//   "anthropic": { "status": "degraded",    "latency_p95_ms": 8900, "error_rate": 0.043 },
//   "google":    { "status": "operational", "latency_p95_ms": 1200, "error_rate": 0.002 }
// }

if (status.anthropic.status !== 'operational') {
  // Route to a backup provider
  await routeToOpenAI(request);
} else {
  await routeToAnthropic(request);
}
```
Webhook Alerts
Get notified the moment a provider degrades:
```python
# Register a webhook for status changes
import requests

requests.post(
    "https://api.lazy-mac.com/ai-provider-status/webhooks",
    json={
        "url": "https://your-app.com/webhooks/ai-status",
        "providers": ["openai", "anthropic", "google"],
        "alert_on": ["degraded", "outage", "recovered"],
        "latency_threshold_ms": 5000,  # Alert if P95 > 5s
    },
    headers={"Authorization": "Bearer YOUR_KEY"},
)
```
Building a Resilient AI Router
```python
import time

import requests

class ResilientAIRouter:
    def __init__(self, status_api_key: str):
        self.api_key = status_api_key
        self._cache = {}
        self._cache_ts = 0

    def get_operational_providers(self) -> list[str]:
        # Cache status checks for 30 seconds
        if time.time() - self._cache_ts > 30:
            resp = requests.get(
                "https://api.lazy-mac.com/ai-provider-status",
                headers={"Authorization": f"Bearer {self.api_key}"},
            )
            self._cache = resp.json()
            self._cache_ts = time.time()
        return [
            provider for provider, info in self._cache.items()
            if info["status"] == "operational" and info["error_rate"] < 0.01
        ]

    def route(self, request, preferred="anthropic"):
        operational = self.get_operational_providers()
        if preferred in operational:
            return self.call_provider(preferred, request)
        for fallback in ["openai", "google", "anthropic"]:
            if fallback in operational:
                return self.call_provider(fallback, request)
        raise RuntimeError("All AI providers currently unavailable")

    # call_provider (not shown) issues the actual model request
    # against the chosen provider's API.
```
Historical Uptime Data
The API also provides 30-day uptime history per provider — useful for SLA reporting and choosing the right provider for critical workloads.
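Assuming the history endpoint returns per-day records with a downtime figure (the field names here are assumptions; the API's actual schema may differ), the uptime percentage for an SLA report reduces to:

```python
def uptime_percent(days: list[dict]) -> float:
    """Compute uptime %% from per-day records shaped like
    {"date": "2025-06-01", "downtime_minutes": 12}."""
    total_minutes = len(days) * 24 * 60
    down = sum(d.get("downtime_minutes", 0) for d in days)
    return round(100.0 * (total_minutes - down) / total_minutes, 3)
```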