TL;DR
When AI APIs hit rate limits and fail, proper architecture design keeps your core systems running. The key is separating AI dependencies and implementing fallback strategies.
Prerequisites
- Multiple cron jobs using AI APIs (Claude, OpenAI, etc.)
- Core systems (web API, database) that need to stay online
- Need to improve system resilience
The Problem: Everything Breaks at Once
Yesterday at 12:00 PM, Claude API returned "service temporarily overloaded" errors. Within minutes, multiple cron jobs failed simultaneously. Sound familiar?
Common failure pattern:
# All jobs hit the same API at the same time
0 9 * * * /path/to/ai-job1 # AI heavy
0 9 * * * /path/to/ai-job2 # AI heavy
0 9 * * * /path/to/ai-job3 # AI heavy
# Result: 429 Rate Limit Exceeded for everyone
Step 1: Service Tier Architecture
Separate your services by AI dependency level:
# Tier 1: Critical Services (NO AI dependency)
# - Web API server
# - Database operations
# - User authentication
# - Core business logic
# Tier 2: AI-Enhanced Services (AI optional)
# - Content generation with fallback
# - Auto-summarization with default text
# - Smart notifications with basic alerts
# Tier 3: AI-Only Services (AI required)
# - LLM chat features
# - Code generation tools
# - Complex AI analysis
Design principle: Tier 1 services NEVER depend on external AI APIs.
Step 2: Temporal Load Distribution
Spread your cron jobs across time windows:
# Before: API rate limit collision
0 9 * * * /path/to/job1
0 9 * * * /path/to/job2
0 9 * * * /path/to/job3
# After: Staggered execution
0 9 * * * /path/to/job1 # 09:00
15 9 * * * /path/to/job2 # 09:15
30 9 * * * /path/to/job3 # 09:30
Pro tips:
- Calculate your API limit (e.g., 1000 req/min) and divide among jobs
- Prioritize critical jobs for prime time slots
- Avoid peak hours (weekday 9-17 in your provider's timezone)
Step 3: Multi-Provider Fallback Implementation
import time
import random
from typing import Optional
class ResilientAIService:
def __init__(self):
self.providers = ['claude', 'openai', 'gemini']
self.fallback_responses = {
'summary': 'Auto-summary unavailable',
'generation': 'Default content displayed'
}
def call_ai_with_fallback(self, prompt: str, service_type: str) -> str:
for provider in self.providers:
try:
response = self._call_provider(provider, prompt)
if response:
return response
except APIOverloadError:
# Exponential backoff
time.sleep(random.uniform(1, 5))
continue
except Exception as e:
print(f"{provider} failed: {e}")
continue
# All providers failed - return fallback
return self.fallback_responses.get(service_type, 'Processing failed')
Step 4: Health Check Separation
Monitor core systems and AI services separately:
#!/bin/bash
check_core_systems() {
# Database
if ! pg_isready -h localhost -p 5432; then
echo "CRITICAL: Database down"
return 1
fi
# Web API
if ! curl -f http://localhost:8000/health; then
echo "CRITICAL: API server down"
return 1
fi
echo "Core systems: OK"
return 0
}
check_ai_services() {
local ai_failures=0
for provider in claude openai gemini; do
if ! test_ai_provider "$provider"; then
((ai_failures++))
echo "WARNING: $provider unavailable"
fi
done
if [ $ai_failures -eq 3 ]; then
# Alert but don't panic - core systems still work
send_slack_alert "AI services degraded, using fallbacks"
fi
}
Step 5: Graceful Degradation Config
# docker-compose.yml
version: '3'
services:
core-api:
image: myapp/core
restart: always
environment:
- AI_ENABLED=false # Core features work without AI
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
ai-worker:
image: myapp/ai-worker
restart: on-failure
environment:
- MAX_RETRIES=3
- BACKOFF_MULTIPLIER=2
depends_on:
- core-api # AI worker can fail, core cannot
Real-World Incident Response
March 3rd, 2026 - Claude API Overload:
09:01 - Roundtable standup: Normal operation
12:00 - Claude API "service temporarily overloaded"
12:01 - Multiple cron job failures detected
12:05 - Core systems check: Web API still up ✅
12:10 - AI-Enhanced services disabled
12:15 - Fallback responses activated
23:00 - Manual daily memory skill: Success
Result:
- Core systems: Continued operating ✅
- User experience: Limited features but usable ✅
- Data integrity: Maintained ✅
Key Takeaways
| Lesson | Detail |
|---|---|
| Separate AI dependencies | Core functionality should never depend on external AI APIs |
| Temporal distribution | Stagger cron jobs to avoid rate limit collisions |
| Multi-layer fallbacks | Multiple providers + static responses prevent total failure |
| Differentiated monitoring | AI service issues ≠ system-critical alerts |
AI services are powerful tools, but treating them as critical infrastructure is a recipe for outages. Design for AI failure, and your users will thank you when the inevitable happens.
Remember: Your system's resilience matches its weakest external dependency. Make AI enhancement optional, not essential.
Top comments (0)