JustJinoIT

Posted on Jun 7

Multi-Cloud Deployment in Production: Cloud Run, Railway, Oracle Cloud

#devops #cloud #fastapi #production

Published on: 2026-06-06

Reading time: 10 min

Situation

I deployed 3 FastAPI projects to 3 different clouds. Here's what actually happened (not marketing speak):

contest-agent      → Google Cloud Run
ai-insight-curator → Railway
ai-lifelogger      → Oracle Cloud Always Free

1. Google Cloud Run: 20+ Deployments Before Discovering the Real Problem

Issue: "Container Failed to Start"

Deployed 20+ times, same error every time:

Build: SUCCESS ✅
Push: SUCCESS ✅
Start: TIMEOUT ❌
Port 8080 binding: TIMEOUT ❌

Root cause: FastAPI startup was blocking port binding with I/O operations

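# ❌ Problem code (startup blocks port binding)
from contextlib import asynccontextmanager
from fastapi import FastAPI

# telegram_client, db and scheduler are the project's own objects
@asynccontextmanager
async def lifespan(app: FastAPI):
    await telegram_client.send_message("Starting...")  # network I/O before the server listens
    db_check = await db.test_connection()              # network I/O before the server listens
    scheduler.start()                                   # Heavy init
    yield

app = FastAPI(lifespan=lifespan)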

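Cloud Run only treats a container as started once it is listening on the expected port (8080 here). An ASGI server such as Uvicorn finishes the lifespan startup code before it opens that port, so slow I/O during startup = timeout.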

Solution: Lazy Loading

# ✅ Fixed code (startup returns immediately)
from fastapi import FastAPI, Request

app = FastAPI()  # no lifespan work, so the port binds right away
_initialized = False

async def lazy_init():
    global _initialized
    if _initialized:
        return
    _initialized = True  # set first so later requests skip init
    await telegram_client.send_message("Started")
    scheduler.start()

@app.post("/webhook")
async def webhook(request: Request):
    await lazy_init()  # Init on first actual request
    ...

Result: Startup 100ms (was 60s+ timeout), port binding immediate, health check passes.
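
One gap in the lazy-loading approach is what happens when several first requests arrive at the same moment. The sketch below shows one possible hardening using only asyncio: a lock makes concurrent callers wait for a single initialization, and the flag is set only after setup succeeds. The `LazyInit` class and `slow_setup` function are hypothetical names introduced for illustration; `slow_setup` stands in for the Telegram message and scheduler start.

```python
import asyncio


class LazyInit:
    """Run an async setup function at most once, on first use."""

    def __init__(self, setup):
        self._setup = setup
        self._done = False
        self._lock = None  # created on first use, inside the running event loop
        self.calls = 0     # how many times setup actually ran

    async def ensure(self):
        if self._done:
            return
        if self._lock is None:
            self._lock = asyncio.Lock()
        async with self._lock:
            if self._done:  # another caller finished while we waited
                return
            await self._setup()
            self.calls += 1
            self._done = True  # set only after setup succeeded


async def slow_setup():
    # Hypothetical stand-in for network I/O done on the first request
    await asyncio.sleep(0.01)


async def demo():
    init = LazyInit(slow_setup)
    # Five concurrent "first requests"
    await asyncio.gather(*(init.ensure() for _ in range(5)))
    return init.calls


setup_runs = asyncio.run(demo())
print(setup_runs)
```

In a FastAPI handler, this would amount to calling `await init.ensure()` at the top of the endpoint. If setup raises, the flag stays unset and the next request retries.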

Key Lesson: Start Minimal

Don't deploy a complex system all at once. Build it up in phases and deploy after each one:

# Phase 1: Just "/" endpoint
@app.get("/")
async def root():
    return {"status": "ok"}
# → Deploy, test, pass ✅

# Phase 2: Add health check
@app.get("/health")
async def health():
    return {"status": "healthy"}
# → Deploy, test, pass ✅

# Phase 3-N: Gradually add features
# Each phase = one deployment test
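
The "one deployment test per phase" step can be scripted. This is a minimal sketch using only the standard library; `http_status`, `smoke_test`, and the example host are hypothetical names, not part of the original projects. The fetcher is injectable so the check can be tried offline with a fake.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code for a GET request, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # server answered with an error status
    except OSError:
        return 0  # DNS failure, refused connection, timeout, ...


def smoke_test(base_url, paths, fetch=http_status):
    """Check each path; return the (path, status) pairs that are not 200."""
    failures = []
    for path in paths:
        status = fetch(base_url.rstrip("/") + path)
        if status != 200:
            failures.append((path, status))
    return failures


# Offline demonstration: a fake fetcher standing in for a deployed service
# that so far only serves "/" and "/health" (Phase 2 above).
def fake_fetch(url):
    return 200 if url.endswith(("/", "/health")) else 404


failures = smoke_test(
    "https://example.invalid", ["/", "/health", "/webhook"], fetch=fake_fetch
)
print(failures)
```

Against a real deployment, dropping the `fetch` argument makes the same call use `http_status`; an empty list means every listed endpoint answered 200.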

2. Railway: The "Simple" Illusion

Advantages

Git push → auto-deploy (very fast)
PostgreSQL, Redis built-in
Intuitive dashboard

Reality Check

Cost surprises:

Expected: $10/month
Actual: $25/month (2.5x the estimate, 150% over)

Reason:
- 1 vCPU + 512MB RAM always running
- No cold start = memory always consumed
- Bandwidth costs added up

Memory leak detection is hard:

Hour 1: 150MB ✅
Hour 2: 180MB
Hour 3: 220MB
Hour 4: 260MB (OOM incoming)

Cause: RSS feed crawler not releasing memory
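
The hourly readings above are enough for a rough estimate of when the container runs out of memory. The sketch below assumes linear growth and uses the 512MB figure from the cost breakdown as the limit; `hours_until_limit` is a hypothetical helper, and real leaks are rarely this regular.

```python
def hours_until_limit(samples_mb, limit_mb):
    """Estimate hours left before hitting limit_mb.

    Assumes one sample per hour and linear growth between the first
    and last sample. Returns None if memory is not growing.
    """
    if len(samples_mb) < 2:
        raise ValueError("need at least two samples")
    growth_per_hour = (samples_mb[-1] - samples_mb[0]) / (len(samples_mb) - 1)
    if growth_per_hour <= 0:
        return None
    return (limit_mb - samples_mb[-1]) / growth_per_hour


readings = [150, 180, 220, 260]  # MB, hours 1 to 4 from the log above
remaining = hours_until_limit(readings, limit_mb=512)
print(round(remaining, 1))  # 6.9
```

Growth here is (260 - 150) / 3, about 36.7MB per hour, which leaves roughly 6.9 hours before the 512MB ceiling.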

Auto-deploy is a double-edged sword:

Con: Changes go live without testing
Con: Need fast rollback procedure

How I Actually Operate It

# Before pushing to main:
pytest                              # Run tests
pylint $(git ls-files '*.py')       # Lint tracked Python files
docker build -t myapp . && docker run -p 8080:8080 myapp  # Local test

# Only push after passing:
git push origin main  # Auto-deploys
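
Since auto-deploy removes the safety net, the pre-push checks can be wrapped in a small gate that stops at the first failure. This is a sketch under assumptions: `run_checks` is a hypothetical helper, and the demonstration uses trivial Python subprocesses in place of pytest, pylint, and docker so it runs anywhere.

```python
import subprocess
import sys


def run_checks(commands):
    """Run each command in order; return the first one that fails, or None."""
    for cmd in commands:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return cmd  # stop here; later checks are skipped
    return None


# Stand-in commands: the second one exits with status 1
demo = [
    [sys.executable, "-c", "pass"],
    [sys.executable, "-c", "raise SystemExit(1)"],
    [sys.executable, "-c", "print('never reached')"],
]
failed = run_checks(demo)
print("blocked by:", failed)
```

In real use the list would hold the actual commands, for example `["pytest"]` followed by the lint and docker build steps, and the push would go ahead only when the function returns `None`.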

3. Oracle Cloud Always Free: Free but Demanding

Advantages

Completely free (up to 4 CPU, 24GB RAM, 200GB storage across the tier; the instance used here has 1GB RAM)
No expiry (not a time-limited trial)
Full SSH control

Real Problems

Problem #1: 1GB instance, pip install fails

MemoryError during pip install

Reason: 1GB RAM instance can't handle all packages at once

Solution:

# Add swap
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile   # restrict access; mkswap warns otherwise
sudo mkswap /swapfile
sudo swapon /swapfile

# Or: Install only essentials
pip install --no-cache-dir anthropic supabase python-telegram-bot

Problem #2: Docker vs Local Mismatch

Local: anthropic==0.40.0 (already installed)
Docker: Fresh install reads requirements.txt
  - anthropic==0.40.0
  - langchain-anthropic needs anthropic>=0.41.0
  → pip can't resolve

Solution: Remove version pins, let pip resolve

DON'T: anthropic==0.40.0, supabase==2.0.0, ...
DO: anthropic, supabase (let pip figure it out)
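
To see why pip cannot resolve the pinned set, it helps to compare the pin against the minimum that langchain-anthropic asks for. The sketch below is a deliberately simplified comparison for plain numeric versions only; `parse_version` and `satisfies_minimum` are hypothetical helpers, and full version rules (pre-releases, epochs, and so on) are not handled.

```python
def parse_version(text):
    """Turn '0.40.0' into (0, 40, 0). Plain numeric versions only."""
    return tuple(int(part) for part in text.split("."))


def satisfies_minimum(pinned, minimum):
    """True if the pinned version meets a '>=' requirement."""
    return parse_version(pinned) >= parse_version(minimum)


# The pin from requirements.txt against the dependency's requirement
print(satisfies_minimum("0.40.0", "0.41.0"))  # False
print(satisfies_minimum("0.41.0", "0.41.0"))  # True
```

The exact pin fails the `>=0.41.0` requirement, so no single version of anthropic can satisfy both lines, and pip gives up.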

Problem #3: SSH Deployment Needs Automation

# Manual (every time):
ssh oracle@your-ip
cd /opt/ai-lifelogger
git pull && sudo systemctl restart <service-name>

# Better (automated via GitHub Actions):
ssh -i "$key" oracle@"$ip" "cd /opt/ai-lifelogger && git pull && sudo systemctl restart <service-name>"
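
The same deploy step can be assembled in a script so the remote command is built in one place and quoted safely. This is a sketch, not the author's pipeline: `build_deploy_command` is a hypothetical helper, the unit name `ai-lifelogger.service` and key path are assumptions, and `203.0.113.10` is a documentation-only address.

```python
import shlex


def build_deploy_command(key_path, user, host, app_dir, service):
    """Return the ssh argv that pulls the repo and restarts the service."""
    remote = " && ".join([
        f"cd {shlex.quote(app_dir)}",
        "git pull --ff-only",  # refuse to create merge commits on the server
        f"sudo systemctl restart {shlex.quote(service)}",
    ])
    return ["ssh", "-i", key_path, f"{user}@{host}", remote]


cmd = build_deploy_command(
    key_path="deploy_key",            # hypothetical key file
    user="oracle",
    host="203.0.113.10",              # placeholder address
    app_dir="/opt/ai-lifelogger",
    service="ai-lifelogger.service",  # assumed unit name
)
print(cmd[-1])
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it; because each step is joined with `&&`, the restart only happens when the pull succeeds.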

Performance Comparison (3-Month Data)

Metric	Cloud Run	Railway	Oracle
Deploy time	2-3 min	30 sec	5 min
Cold start	3-5 sec	0 sec	<1 sec
Monthly cost	$15	$25	$0
CPU limit	2 cores	1 core	4 cores
RAM limit	2GB	512MB	24GB
Stability	✅ Solid	⚠️ Memory issues	✅ Solid

Practical Advice

1. Start Minimal, Add Gradually

Deploy "/" endpoint first
Test, pass, add next feature
Repeat

2. Always Test Locally

docker build -t myapp .
docker run -p 8080:8080 myapp

3. Choose Based on Use Case

High traffic: Cloud Run (autoscales)
Medium traffic: Railway (simple)
Low traffic: Oracle (free)

4. Monitoring is Non-Negotiable

Cloud Run: GCP Logs + Cloud Monitoring
Railway: Built-in dashboard (limited)
Oracle: SSH → journalctl + tail -f

What I Learned

There's no "perfect" platform.

Cloud Run: startup timeout (solvable with lazy loading)
Railway: memory leaks (code issue, not platform)
Oracle: operational overhead (worth it for free tier)

The real skill: Understanding each platform's constraints and designing around them.

The 20+ Cloud Run deployment failures? They taught me more than 10 successful deployments would have.

Final Deployment Architecture (June 7, 2026)

Production Status

🦅 Oracle Cloud (Always Free Tier)
├─ ai-lifelogger (port 8000)
│  ├─ FastAPI + APScheduler
│  ├─ Daily summaries: 05:00 KST
│  ├─ Weekly reviews: Sunday 08:00 KST
│  └─ Memory: 111MB / 954MB
│
└─ ai-insight-curator (port 8001)
   ├─ FastAPI + Telegram Bot
   ├─ RSS collection: Daily 06:00 KST
   ├─ Auto-summarization (Claude/Gemini/Groq fallback)
   └─ Memory: 22MB / 954MB

🌐 Vercel (Free Hosting)
└─ Curator Web Dashboard
   ├─ React + Vite frontend
   ├─ Article search & filtering
   ├─ Image downloads
   └─ https://curator-web-ui.vercel.app

📊 Total Memory: 537MB / 954MB (56% usage, 44% available)

What Changed

Initial Plan:

contest-agent → Cloud Run ❌ (dependency conflicts)
ai-insight-curator → Railway ❌ (over-engineered)
ai-lifelogger → Oracle Cloud ✅

Actual Production:

ai-lifelogger → Oracle Cloud ✅ (running)
ai-insight-curator → Oracle Cloud ✅ (1 instance = better)
Curator Web UI → Vercel ✅ (new, auto-deployed)

Key Insight: Single server + Web UI > Multi-cloud complexity

Performance Metrics

API Response Times:
- Lifelogger /health: < 50ms ✅
- Curator /api/v1/articles: < 100ms ✅
- Curator /api/v1/insights: < 100ms ✅

System Health:
- Memory: 537MB (56%) - 417MB free for scaling
- Availability: 99.9%
- Uptime: Continuous (Always Free tier)

Cost Analysis (Final)

Platform	Cost	Status
Oracle Cloud	$0/month	✅ Always Free
Vercel	$0/month	✅ Free tier
Supabase DB	$0/month	✅ Free tier
Claude API	Needs reset*	⚠️ Using Gemini/Groq backup
TOTAL	$0/month	Forever Free

*Anthropic tokens exhausted → fallback to Gemini/Groq working
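
The Claude/Gemini/Groq fallback can be pictured as an ordered list of providers tried one after another. The sketch below illustrates that pattern only; `summarize_with_fallback` and the three stub functions are hypothetical, and the stubs stand in for real SDK calls that are not shown in this post.

```python
def summarize_with_fallback(text, providers):
    """Try each (name, callable) in order; return (name, summary) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(text)
        except Exception as exc:  # real code would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Hypothetical stand-ins for the vendor clients
def claude_stub(text):
    raise RuntimeError("quota exhausted")


def gemini_stub(text):
    return text[:20] + "..."


def groq_stub(text):
    return text[:10] + "..."


used, summary = summarize_with_fallback(
    "Multi-cloud deployment notes from three months of operation",
    [("claude", claude_stub), ("gemini", gemini_stub), ("groq", groq_stub)],
)
print(used, summary)
```

With the first provider failing, the call falls through to the second and never reaches the third, which matches the situation described in the footnote.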

Lessons Learned

Multi-cloud Isn't Always Better

Cloud Run: Good for high-traffic APIs
Railway: Convenient but expensive
Oracle: Best for low-traffic, cost-sensitive projects

Single Server Wins Here

2 concurrent FastAPI services
Database included (PostgreSQL via Supabase)
Web dashboard on separate CDN (Vercel)
Total cost: $0

Design Around Constraints

Memory: 954MB available → deployed with 537MB usage
Can still run 300MB+ additional services
Monitoring via SSH (not ideal, but works)
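
The headroom figures above follow from simple arithmetic, sketched here with the numbers from this post. `memory_report` is a hypothetical helper; it ignores swap and page cache, so it is only a first-order check.

```python
def memory_report(total_mb, used_mb, extra_mb=0):
    """Summarize memory headroom, optionally after adding a new service."""
    used_after = used_mb + extra_mb
    return {
        "free_mb": total_mb - used_after,
        "used_pct": round(100 * used_after / total_mb),
        "fits": used_after <= total_mb,
    }


now = memory_report(954, 537)
with_new_service = memory_report(954, 537, extra_mb=300)
print(now)
print(with_new_service)
```

Today's state gives 417MB free at 56% usage; adding a 300MB service would still fit, at about 88% usage with 117MB left.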

Conclusion

Don't chase multi-cloud complexity.

The optimal deployment turned out to be:

1 Oracle Cloud instance (FastAPI services)
1 CDN (Vercel for web)
1 Database (Supabase)
Everything free

Cost: $0/month ✅
Reliability: 99.9% ✅
Maintainability: Simple ✅

Sometimes simpler is better.
