Your AI pipeline breaks when OpenAI goes down. In 2025, GPT-4 had 12 significant outages. Claude had 8. If your app has no fallback detection, each outage means manual intervention and angry users. Here's how to monitor AI provider status in real time.
The Problem with Official Status Pages
Most AI providers have status pages, but they:
- Update with 15-30 minute delays after incidents start
- Don't distinguish between degraded performance and total outages
- Don't track regional differences (US-East fine, EU degraded)
- Require manual checking
You need automated, real-time detection.
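The core idea is to probe providers yourself and classify the raw measurements, rather than trusting a delayed status page. A minimal sketch of that classification step — the thresholds here are illustrative assumptions, not values published by any provider:

```python
# Illustrative thresholds; tune them for your own latency and error budgets.
DEGRADED_LATENCY_MS = 5000
DEGRADED_ERROR_RATE = 0.01

def classify(latency_p95_ms: float, error_rate: float) -> str:
    """Map raw probe measurements to a coarse status label."""
    if error_rate >= 0.5:
        return "outage"
    if latency_p95_ms > DEGRADED_LATENCY_MS or error_rate > DEGRADED_ERROR_RATE:
        return "degraded"
    return "operational"
```

Feed this with P95 latency and error rates measured from your own traffic (or cheap synthetic probes), and you get per-region, real-time status instead of a global page updated half an hour late.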
Real-Time Status Monitoring
```javascript
const resp = await fetch('https://api.lazy-mac.com/ai-provider-status', {
  headers: { 'Authorization': 'Bearer YOUR_KEY' }
});
const status = await resp.json();
// {
//   "openai":    { "status": "operational", "latency_p95_ms": 2340, "error_rate": 0.001 },
//   "anthropic": { "status": "degraded",    "latency_p95_ms": 8900, "error_rate": 0.043 },
//   "google":    { "status": "operational", "latency_p95_ms": 1200, "error_rate": 0.002 }
// }

if (status.anthropic.status !== 'operational') {
  // Route to a backup provider
  await routeToOpenAI(request);
} else {
  await routeToAnthropic(request);
}
```
Webhook Alerts
Get notified the moment a provider degrades:
```python
# Register a webhook for status changes
import requests

requests.post(
    "https://api.lazy-mac.com/ai-provider-status/webhooks",
    json={
        "url": "https://your-app.com/webhooks/ai-status",
        "providers": ["openai", "anthropic", "google"],
        "alert_on": ["degraded", "outage", "recovered"],
        "latency_threshold_ms": 5000,  # Alert if P95 > 5s
    },
    headers={"Authorization": "Bearer YOUR_KEY"},
)
```
Building a Resilient AI Router
```python
import time

import requests

class ResilientAIRouter:
    def __init__(self, status_api_key: str):
        self.api_key = status_api_key
        self._cache = {}
        self._cache_ts = 0

    def get_operational_providers(self) -> list[str]:
        # Cache status checks for 30 seconds
        if time.time() - self._cache_ts > 30:
            resp = requests.get(
                "https://api.lazy-mac.com/ai-provider-status",
                headers={"Authorization": f"Bearer {self.api_key}"},
            )
            self._cache = resp.json()
            self._cache_ts = time.time()
        return [
            provider for provider, info in self._cache.items()
            if info["status"] == "operational" and info["error_rate"] < 0.01
        ]

    def route(self, request, preferred="anthropic"):
        operational = self.get_operational_providers()
        if preferred in operational:
            return self.call_provider(preferred, request)
        for fallback in ["openai", "google", "anthropic"]:
            if fallback in operational:
                return self.call_provider(fallback, request)
        raise RuntimeError("All AI providers currently unavailable")

    # call_provider (not shown) issues the actual model request
    # against the chosen provider's API.
```
Historical Uptime Data
The API also provides 30-day uptime history per provider — useful for SLA reporting and choosing the right provider for critical workloads.
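Assuming the history endpoint returns per-day records with a downtime figure (the field names here are assumptions; the API's actual schema may differ), the uptime percentage for an SLA report reduces to:

```python
def uptime_percent(days: list[dict]) -> float:
    """Compute uptime %% from per-day records shaped like
    {"date": "2025-06-01", "downtime_minutes": 12}."""
    total_minutes = len(days) * 24 * 60
    down = sum(d.get("downtime_minutes", 0) for d in days)
    return round(100.0 * (total_minutes - down) / total_minutes, 3)
```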