TL;DR
Claude AI experienced incidents on 11 out of 16 days in April 2026. Authentication failures (6x) and Sonnet 4.6 errors (5x) were the primary patterns. Here's what the data shows and how to build resilient systems around it.
The Numbers
I tracked every incident from status.claude.com between April 1-16, 2026:
Days with incidents: 11/16
Auth/login failures: 6 times
Sonnet 4.6 errors: 5 times
Worst day (Apr 8): 4 separate incidents
Longest outage: ~6 hours (Admin API, Apr 14)
Official 90-day uptime:
claude.ai: 98.79% (~26 hours downtime)
API: 99.1% (~19 hours downtime)
Claude Code: 99.26%
Pattern 1: Auth System is the Weakest Link
Six incidents directly involved authentication — login failures, email auth broken, session errors. This single system accounts for ~33% of all April incidents.
Practical impact: Claude Code depends on login sessions. When auth goes down, Code goes down too.
Pattern 2: Sonnet 4.6 Instability
The most-used model had the most errors:
Apr 3: Sonnet 4.6 error rate spike (~70 min)
Apr 4: Sonnet 4.6 + Opus 4.6 (~22 min)
Apr 6: Sonnet 4.6 error rate (~20 min)
Apr 8: Sonnet 4.6 error rate (~3 hours)
Apr 9: Sonnet 4.6 error rate (~46 min)
Opus 4.6 was involved in only 1 incident. If you're running production workloads, Opus is currently more stable.
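Given that gap, a model-level fallback is cheap insurance: try Sonnet first and retry on Opus when the call errors. A minimal sketch, where `call_model` stands in for your actual SDK call and the model IDs are illustrative, not authoritative:

```python
# Hypothetical model-level fallback. `call_model(model, prompt)` is a
# stand-in for your real API client; the model IDs are illustrative.
def complete_with_fallback(prompt, call_model,
                           models=("claude-sonnet-4-6-20260301", "claude-opus-4-6")):
    """Try each model in order; return the first successful response."""
    last_err = None
    for model in models:
        try:
            return call_model(model, prompt)
        except Exception as e:
            last_err = e  # remember the failure, try the next model
    raise last_err
```

The ordering encodes the April data: Sonnet is the default for cost/latency, Opus is the stability hedge.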
Pattern 3: API Recovers Before claude.ai
In every multi-service incident during this window, the API came back online before the web interface. The architectural implication: integrations that call the API directly recover sooner than anything that depends on claude.ai or its login sessions.
Mitigation: Multi-Provider Architecture
Route through managed services for production:
```python
# AWS Bedrock - independent infrastructure, 99.9% SLA
import json
import boto3

bedrock = boto3.client('bedrock-runtime', region_name='us-east-1')

response = bedrock.invoke_model(
    modelId='anthropic.claude-sonnet-4-6-20260301-v1:0',
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 4096
    })
)
```
```python
# Google Vertex AI - alternative managed path
from anthropic import AnthropicVertex

client = AnthropicVertex(region="us-east5", project_id="my-project")

message = client.messages.create(
    model="claude-sonnet-4-6-20260301",
    max_tokens=4096,
    messages=[{"role": "user", "content": prompt}]
)
```
Mitigation: Response Caching + Retry
```python
import time
import hashlib

# Simple in-memory response cache
cache = {}

def cached_completion(prompt, **kwargs):
    # Key on both the prompt and the parameters, so calls with
    # different settings don't collide in the cache
    key = hashlib.sha256(
        (prompt + repr(sorted(kwargs.items()))).encode()
    ).hexdigest()
    if key in cache:
        return cache[key]
    for attempt in range(3):
        try:
            response = client.messages.create(
                model="claude-sonnet-4-6-20260301",
                messages=[{"role": "user", "content": prompt}],
                **kwargs
            )
            cache[key] = response
            return response
        except Exception:
            wait = 2 ** attempt  # 1s, 2s, 4s
            time.sleep(wait)
    raise RuntimeError("All retries failed")
```
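A useful variant during an outage is "serve stale on error": if every retry fails but an older response for the same prompt is cached, return that instead of raising. A sketch (the `fetch` callable and helper name are illustrative):

```python
# Hypothetical "serve stale on error" cache. `fetch(prompt)` stands in
# for a live API call, e.g. the retrying cached_completion above.
import hashlib

def cached_or_stale(prompt, fetch, cache):
    key = hashlib.sha256(prompt.encode()).hexdigest()
    try:
        response = fetch(prompt)
        cache[key] = response  # refresh the cache on success
        return response
    except Exception:
        if key in cache:
            return cache[key]  # degrade gracefully during an outage
        raise  # nothing cached: surface the failure
```

Whether a stale answer is acceptable depends on the use case; it fits summaries and classifications better than anything time-sensitive.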
Mitigation: Independent Monitoring
Don't rely solely on status.claude.com — it can lag 15-30 minutes behind actual incidents.
```python
# Simple health check: a minimal Haiku request
import requests

def check_claude_health():
    try:
        response = requests.post(
            "https://api.anthropic.com/v1/messages",
            headers={
                "x-api-key": API_KEY,
                "anthropic-version": "2023-06-01",
                "content-type": "application/json"
            },
            json={
                "model": "claude-haiku-4-5-20251001",
                "max_tokens": 10,
                "messages": [{"role": "user", "content": "ping"}]
            },
            timeout=10
        )
        return response.status_code == 200
    except Exception:
        return False
```
Run this every 5 minutes alongside Downdetector for early detection.
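Wiring the check into a loop is straightforward. A sketch with an injectable sleep and alert callback; the two-consecutive-failures threshold is an assumption to avoid paging on a single blip, not something the status data dictates:

```python
import time

# Hypothetical monitor loop around check_claude_health-style checks.
def monitor(check, rounds, interval_s=300, on_down=print, sleep=time.sleep):
    """Run `check` `rounds` times; alert after 2 consecutive failures."""
    failures = 0
    alerts = 0
    for _ in range(rounds):
        if check():
            failures = 0  # healthy: reset the streak
        else:
            failures += 1
            if failures >= 2:  # ignore a single transient blip
                on_down(f"Claude API down ({failures} consecutive failures)")
                alerts += 1
        sleep(interval_s)
    return alerts
```

Swap `on_down` for a Slack webhook or PagerDuty call in production; the default `print` keeps the sketch self-contained.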
Key Takeaway
Build assuming AI services will go down. Multi-provider routing, response caching, exponential backoff, and independent monitoring aren't nice-to-haves — they're requirements for production AI systems.
Data: status.claude.com