Published on: 2026-06-06
Reading time: 10 min
Tags: #devops #cloud #fastapi #production
Situation
I deployed 3 FastAPI projects to 3 different clouds. Here's what actually happened (not marketing speak):
- contest-agent → Google Cloud Run
- ai-insight-curator → Railway
- ai-lifelogger → Oracle Cloud Always Free
1. Google Cloud Run: 20+ Deployments Before Discovering the Real Problem
Issue: "Container Failed to Start"
Deployed 20+ times, same error every time:
```text
Build: SUCCESS ✅
Push: SUCCESS ✅
Start: TIMEOUT ❌
Port 8080 binding: TIMEOUT ❌
```
Root cause: the FastAPI lifespan startup awaited network I/O and heavy initialization, and uvicorn doesn't bind the port until lifespan startup finishes.
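```python
# ❌ Problem code (startup delays port binding)
from contextlib import asynccontextmanager
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    # uvicorn doesn't open the port until this function reaches `yield`
    await telegram_client.send_message("Starting...")  # network I/O
    db_check = await db.test_connection()  # network I/O
    scheduler.start()  # heavy init
    yield

app = FastAPI(lifespan=lifespan)
```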
Cloud Run doesn't consider the container started until it is listening on its port (8080 by default). If startup work keeps the port closed past the startup timeout, the deployment fails.
Solution: Lazy Loading
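```python
# ✅ Fixed code (startup returns immediately)
import asyncio
from fastapi import FastAPI, Request

app = FastAPI()  # no heavy lifespan work, so the port binds right away

_initialized = False
_init_lock = asyncio.Lock()  # keeps concurrent first requests from initializing twice

async def lazy_init():
    global _initialized
    if _initialized:
        return
    async with _init_lock:
        if _initialized:  # another request finished init while we waited
            return
        # telegram_client and scheduler are the app's existing objects
        await telegram_client.send_message("Started")
        scheduler.start()
        _initialized = True  # mark done only after init succeeds

@app.post("/webhook")
async def webhook(request: Request):
    await lazy_init()  # init on the first real request
    ...
```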
Result: Startup 100ms (was 60s+ timeout), port binding immediate, health check passes.
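A different technique from the lazy init above, if you'd rather keep initialization in `lifespan`: schedule the heavy work as a background task instead of awaiting it, so startup reaches `yield` immediately and uvicorn can bind the port. A minimal sketch; `slow_init` is a hypothetical stand-in for the Telegram message, DB check, and scheduler start.

```python
import asyncio
import contextlib
from contextlib import asynccontextmanager


async def slow_init():
    # Hypothetical stand-in for telegram notify + DB check + scheduler start
    await asyncio.sleep(2)


@asynccontextmanager
async def lifespan(app):
    # Schedule the heavy work instead of awaiting it, so startup returns at once
    init_task = asyncio.create_task(slow_init())
    try:
        yield
    finally:
        # On shutdown, cancel init if it is still running; a real exception
        # raised by slow_init would surface here when the task is awaited
        init_task.cancel()
        with contextlib.suppress(asyncio.CancelledError):
            await init_task

# Usage: app = FastAPI(lifespan=lifespan)
```

The trade-off is the same as with lazy init: requests that arrive before `slow_init` finishes see uninitialized resources, so handlers that depend on them still need a readiness check.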
Key Lesson: Start Minimal
Don't deploy a complex system all at once. Build it up in phases:
```python
from fastapi import FastAPI

app = FastAPI()

# Phase 1: Just "/" endpoint
@app.get("/")
async def root():
    return {"status": "ok"}
# → Deploy, test, pass ✅

# Phase 2: Add health check
@app.get("/health")
async def health():
    return {"status": "healthy"}
# → Deploy, test, pass ✅

# Phase 3-N: Gradually add features
# Each phase = one deployment test
```
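Each "deploy, test, pass" step can be a scripted check against the deployed URL rather than a manual click-through. A stdlib-only sketch, assuming the expected bodies of the `/` and `/health` endpoints above; the base URL is whatever your platform assigns, and `smoke_test` is a hypothetical helper name.

```python
import json
import urllib.request


def smoke_test(base_url, expected):
    """GET each path; return the paths that didn't answer 200 with the expected JSON."""
    failures = []
    for path, body in expected.items():
        try:
            with urllib.request.urlopen(base_url.rstrip("/") + path, timeout=5) as resp:
                ok = resp.status == 200 and json.load(resp) == body
        except (OSError, ValueError):  # HTTP errors, connection errors, bad JSON
            ok = False
        if not ok:
            failures.append(path)
    return failures
```

After phase 2, for example: `smoke_test(service_url, {"/": {"status": "ok"}, "/health": {"status": "healthy"}})` should return an empty list.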
2. Railway: The "Simple" Illusion
Advantages
- Git push → auto-deploy (very fast)
- PostgreSQL, Redis built-in
- Intuitive dashboard
Reality Check
Cost surprises:
Expected: $10/month
Actual: $25/month (2.5× the estimate, a 150% overage)
Reason:
- 1 vCPU + 512MB RAM always running
- No cold starts, so RAM stays allocated (and billed) around the clock
- Bandwidth costs added up
Memory leak detection is hard:
```text
Hour 1: 150MB ✅
Hour 2: 180MB
Hour 3: 220MB
Hour 4: 260MB (OOM incoming)
```
Cause: RSS feed crawler not releasing memory
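An RSS graph shows *that* memory grows, not *where*. Python's built-in `tracemalloc` can diff two snapshots taken some time apart and point at the source lines responsible for the growth (Python-level allocations only). A minimal sketch; `_feed_cache` and `crawl_once` are hypothetical stand-ins for a crawler that keeps references it never releases. Tracing adds memory and CPU overhead, so enable it for a diagnostic run rather than permanently.

```python
import tracemalloc

_feed_cache = []  # hypothetical leak: entries are appended and never evicted


def crawl_once():
    # Hypothetical stand-in for parsing one RSS feed and keeping the result
    _feed_cache.append(bytearray(100_000))


def top_growth(before, after, limit=5):
    """Return up to `limit` source lines whose allocated size grew between snapshots."""
    diffs = after.compare_to(before, "lineno")  # sorted by largest absolute change
    return [d for d in diffs if d.size_diff > 0][:limit]


# Usage during a diagnostic run:
# tracemalloc.start()
# before = tracemalloc.take_snapshot()
# ... let the crawler run for an hour ...
# after = tracemalloc.take_snapshot()
# for d in top_growth(before, after):
#     print(d)
```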
Auto-deploy is a double-edged sword:
- Pro: every push to main goes live fast
- Con: untested changes go live just as fast
- Con: you need a fast rollback procedure
How I Actually Operate It
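```bash
# Before pushing to main:
pytest                               # run tests
pylint $(git ls-files '*.py')        # lint every tracked Python file
docker build -t myapp . && docker run -p 8080:8080 myapp   # local test

# Only push after everything passes:
git push origin main                 # Railway auto-deploys from main
```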
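The "only push after passing" rule is easy to skip by hand, so one option is to let git enforce it. A minimal sketch of a gate you could call from a `.git/hooks/pre-push` script; the `CHECKS` list is a hypothetical mirror of the manual steps above, and `run_checks` is an illustrative helper, not a git or Railway API.

```python
import subprocess
import sys

# Hypothetical check list; extend with pylint etc. as needed
CHECKS = [
    ["pytest"],
    ["docker", "build", "-t", "myapp", "."],
]


def run_checks(checks):
    """Run each command in order; return the first one that fails, or None if all pass."""
    for cmd in checks:
        print("running:", " ".join(cmd), file=sys.stderr)
        try:
            result = subprocess.run(cmd)
        except FileNotFoundError:  # tool not installed counts as a failure
            return cmd
        if result.returncode != 0:
            return cmd
    return None
```

The hook script exits non-zero when `run_checks(CHECKS)` returns a command, and git aborts the push whenever a pre-push hook fails.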
3. Oracle Cloud Always Free: Free but Demanding
Advantages
- Completely free (4 CPU, 24GB RAM, 200GB storage)
- No time limit (Always Free resources don't expire like trial credits)
- Full SSH control
Real Problems
Problem #1: 1GB instance, pip install fails
```text
MemoryError during pip install
```
Reason: a 1GB RAM instance can't handle installing all the packages at once.
Solution:
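```bash
# Add an 8GB swap file
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile    # mkswap warns if the file is readable by others
sudo mkswap /swapfile
sudo swapon /swapfile
# Optional: keep it across reboots
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab

# Or: install only the essentials, without pip's cache
pip install --no-cache-dir anthropic supabase python-telegram-bot
```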
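Before retrying the install, it helps to confirm the swap actually counts. A small Linux-only sketch that reads `/proc/meminfo` (values are in kB, i.e. 1024-byte units) and adds available RAM to free swap; `headroom_mib` is a hypothetical helper name.

```python
def headroom_mib(meminfo_text):
    """Estimate MemAvailable + SwapFree in MiB from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])  # first token is the value in kB
    return (fields.get("MemAvailable", 0) + fields.get("SwapFree", 0)) // 1024


# On the instance itself (Linux only):
# with open("/proc/meminfo") as f:
#     print(headroom_mib(f.read()), "MiB available including swap")
```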
Problem #2: Docker vs Local Mismatch
```text
Local:  anthropic==0.40.0 (already installed in my dev environment)
Docker: fresh install from requirements.txt, which contains
  - anthropic==0.40.0
  - langchain-anthropic, which needs anthropic>=0.41.0
→ pip can't resolve the conflict
```
Solution: Remove version pins, let pip resolve
DON'T: anthropic==0.40.0, supabase==2.0.0, ...
DO: anthropic, supabase (let pip figure it out)
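The mismatch only shows up in Docker because the local environment was built up incrementally. One way to spot it before building the image is to compare the exact `==` pins in requirements.txt against what's installed locally. A stdlib-only sketch; it handles only simple `name==version` lines (stripping comments, extras, and environment markers), and both function names are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def exact_pins(requirements_text):
    """Map package name -> pinned version for lines of the form name==X."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].split(";", 1)[0].strip()  # drop comments, markers
        if "==" in line:
            name, pinned = line.split("==", 1)
            pins[name.split("[", 1)[0].strip()] = pinned.strip()  # drop extras
    return pins


def drift(requirements_text):
    """Return {name: (pinned, installed_or_None)} where the local env disagrees."""
    out = {}
    for name, pinned in exact_pins(requirements_text).items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        if installed != pinned:
            out[name] = (pinned, installed)
    return out
```

Run `drift(open("requirements.txt").read())` in the local environment; anything it reports is a place where "works on my machine" and the fresh Docker install will diverge.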
Problem #3: SSH Deployment Needs Automation
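```bash
# Manual (every time):
ssh oracle@your-ip
cd /opt/ai-lifelogger
git pull && sudo systemctl restart ai-lifelogger   # assumes the systemd unit is named ai-lifelogger

# Better (automated via GitHub Actions):
ssh -i "$key" "oracle@$ip" "cd /opt/ai-lifelogger && git pull && sudo systemctl restart ai-lifelogger"
```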
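If the deploy step grows beyond one line, building the ssh command in a script keeps paths and names quoted safely. A minimal sketch, assuming the `/opt/ai-lifelogger` directory, an `ai-lifelogger` systemd unit, and passwordless sudo for the remote user; it also swaps in `git pull --ff-only` so a diverged checkout fails instead of creating a merge commit. `deploy_command` and `deploy` are hypothetical helpers.

```python
import shlex
import subprocess


def deploy_command(key_path, host, app_dir, service, user="oracle"):
    """Build the ssh invocation that pulls the latest code and restarts the service."""
    remote = (
        f"cd {shlex.quote(app_dir)} && git pull --ff-only "
        f"&& sudo systemctl restart {shlex.quote(service)}"
    )
    return ["ssh", "-i", key_path, f"{user}@{host}", remote]


def deploy(key_path, host, app_dir="/opt/ai-lifelogger", service="ai-lifelogger"):
    # check=True raises CalledProcessError if any remote step exits non-zero
    subprocess.run(deploy_command(key_path, host, app_dir, service), check=True)
```

Because the remote steps are chained with `&&`, ssh returns the exit status of the first failing step, which in turn fails the CI job.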
Performance Comparison (3-Month Data)
| Metric | Cloud Run | Railway | Oracle |
|---|---|---|---|
| Deploy time | 2-3 min | 30 sec | 5 min |
| Cold start | 3-5 sec | 0 sec | <1 sec |
| Monthly cost | $15 | $25 | $0 |
| CPU limit | 2 cores | 1 core | 4 cores |
| RAM limit | 2GB | 512MB | 24GB |
| Stability | ✅ Solid | ⚠️ Memory issues | ✅ Solid |
Practical Advice
1. Start Minimal, Add Gradually
- Deploy "/" endpoint first
- Test, pass, add next feature
- Repeat
2. Always Test Locally
```bash
docker build -t myapp .
docker run -p 8080:8080 myapp
```
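Cloud Run's default startup check amounts to "is the port accepting TCP connections yet?", so timing that against the local `docker run` catches the startup-timeout problem before deploying. A stdlib sketch that approximates such a probe; the 10-second default is an arbitrary assumption, not Cloud Run's actual timeout.

```python
import socket
import time


def wait_for_port(host, port, timeout=10.0, interval=0.2):
    """Return True once a TCP connection to host:port succeeds, False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:  # refused or timed out: not listening yet
            time.sleep(interval)
    return False
```

With the container starting in another terminal, `wait_for_port("127.0.0.1", 8080)` returning `False` is a strong hint that startup work is delaying the port bind.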
3. Choose Based on Use Case
- High traffic: Cloud Run (autoscales)
- Medium traffic: Railway (simple)
- Low traffic: Oracle (free)
4. Monitoring is Non-Negotiable
- Cloud Run: GCP Logs + Cloud Monitoring
- Railway: Built-in dashboard (limited)
- Oracle: SSH → journalctl + tail -f
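One way to make the same app easier to read on all three platforms is to log one JSON object per line to stdout: Cloud Logging parses JSON lines and maps the `severity` and `message` fields, while under `journalctl` or in a dashboard they remain one greppable line per event. A minimal stdlib sketch; field names other than `severity` and `message` are arbitrary choices.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    def format(self, record):
        payload = {
            "severity": record.levelname,  # Cloud Logging reads this as the entry's severity
            "message": record.getMessage(),
            "logger": record.name,
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def configure_logging(level=logging.INFO):
    """Send all logging to stdout as JSON lines (replaces existing root handlers)."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    root = logging.getLogger()
    root.handlers[:] = [handler]
    root.setLevel(level)
```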
What I Learned
There's no "perfect" platform.
- Cloud Run: startup timeout (solvable with lazy loading)
- Railway: memory leaks (code issue, not platform)
- Oracle: operational overhead (worth it for free tier)
The real skill: Understanding each platform's constraints and designing around them.
The 20+ Cloud Run deployment failures? They taught me more than 10 successful deployments would have.
Bottom Line
If you're deploying to multiple clouds, expect problems—but they're solvable. Document them, learn from them, share them.
Your experience matters to other developers.