DEV Community

JustJinoIT


Multi-Cloud Deployment in Production: Cloud Run, Railway, Oracle Cloud (3-Month Report)

Published on: 2026-06-06

Reading time: 10 min

Tags: #devops #cloud #fastapi #production

Situation

I deployed 3 FastAPI projects to 3 different clouds. Here's what actually happened (not marketing speak):

contest-agent      → Google Cloud Run
ai-insight-curator → Railway
ai-lifelogger      → Oracle Cloud Always Free

1. Google Cloud Run: 20+ Deployments Before Discovering the Real Problem

Issue: "Container Failed to Start"

Deployed 20+ times, same error every time:

Build: SUCCESS ✅
Push: SUCCESS ✅
Start: TIMEOUT ❌
Port 8080 binding: TIMEOUT ❌

Root cause: slow I/O in FastAPI's lifespan startup was delaying port binding

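# ❌ Problem code (slow startup delays port binding)
from contextlib import asynccontextmanager
from fastapi import FastAPI

# telegram_client, db and scheduler are this project's own objects
@asynccontextmanager
async def lifespan(app: FastAPI):
    await telegram_client.send_message("Starting...")  # network I/O before the port opens
    db_check = await db.test_connection()              # network I/O before the port opens
    scheduler.start()                                   # heavy init
    yield

app = FastAPI(lifespan=lifespan)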

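Cloud Run treats the container as started only once it is listening on its port (8080 by default). Uvicorn, for example, runs the lifespan startup before it binds that port. Slow startup I/O therefore keeps the port closed until Cloud Run's startup check times out.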

Solution: Lazy Loading

# ✅ Fixed code (startup returns immediately)
from fastapi import FastAPI, Request

app = FastAPI()  # no heavy lifespan work, so the port binds right away
_initialized = False

async def lazy_init():
    global _initialized
    if _initialized:
        return
    # Set the flag before the first await so concurrent requests don't re-run init
    _initialized = True
    await telegram_client.send_message("Started")
    scheduler.start()

@app.post("/webhook")
async def webhook(request: Request):
    await lazy_init()  # Init on first actual request
    ...

Result: startup now takes ~100ms (it previously hit a timeout after 60s+), the port binds immediately, and the health check passes.

Key Lesson: Start Minimal

Don't deploy a complex system all at once. What worked for me was deploying in phases:

from fastapi import FastAPI

app = FastAPI()

# Phase 1: Just "/" endpoint
@app.get("/")
async def root():
    return {"status": "ok"}
# → Deploy, test, pass ✅

# Phase 2: Add health check
@app.get("/health")
async def health():
    return {"status": "healthy"}
# → Deploy, test, pass ✅

# Phase 3-N: Gradually add features
# Each phase = one deployment test
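The "deploy, test" step of each phase can be scripted. The following is a minimal sketch using only the standard library. It assumes each phase's endpoints should return HTTP 200. `smoke_test` and `all_ok` are hypothetical helper names, not part of any platform's tooling.

```python
import http.client
import urllib.error
import urllib.request


def smoke_test(base_url: str, paths: list[str], timeout: float = 5.0) -> dict[str, int]:
    """GET each path; return {path: HTTP status}, with 0 meaning no HTTP response at all."""
    results: dict[str, int] = {}
    for path in paths:
        url = base_url.rstrip("/") + path
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                results[path] = resp.status
        except urllib.error.HTTPError as err:  # 4xx/5xx: the app answered, but not with success
            results[path] = err.code
        except (OSError, http.client.HTTPException):  # refused, DNS failure, timeout, bad reply
            results[path] = 0
    return results


def all_ok(results: dict[str, int]) -> bool:
    """A phase passes only if every endpoint returned 200."""
    return all(status == 200 for status in results.values())
```

After each phase's deploy, run `smoke_test("https://<your-service-url>", ["/", "/health"])` and move on to the next phase only if `all_ok(...)` is true. Here `<your-service-url>` is a placeholder for your deployed service's URL.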

2. Railway: The "Simple" Illusion

Advantages

  • Git push → auto-deploy (very fast)
  • PostgreSQL, Redis built-in
  • Intuitive dashboard

Reality Check

Cost surprises:

Expected: $10/month
Actual: $25/month (2.5x the estimate, i.e. 150% over budget)

Reason:
- 1 vCPU + 512MB RAM always running
- No cold starts also means no scaling to zero: memory is consumed (and billed) around the clock
- Bandwidth costs added up

Memory leak detection is hard:

Hour 1: 150MB ✅
Hour 2: 180MB
Hour 3: 220MB
Hour 4: 260MB (OOM incoming)

Cause: RSS feed crawler not releasing memory
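One way to confirm a leak like this before it reaches production is to run the suspect code in a loop locally. While it runs, watch Python-level allocations with the standard library's `tracemalloc`. The following is a minimal sketch that assumes the crawler can be driven as a plain function. `crawl_once` is a hypothetical stand-in with a deliberate leak. `tracemalloc` only sees memory allocated through Python's allocators, so growth inside C extensions can still show up in the dashboard's memory graph without appearing here.

```python
import gc
import tracemalloc
from typing import Callable

# Hypothetical stand-in for the RSS crawler: it keeps every downloaded feed in a
# module-level list and never clears it, the kind of bug behind hour-by-hour growth.
_seen_feeds: list[bytearray] = []


def crawl_once() -> None:
    feed = bytearray(200_000)  # pretend this is a ~200 KB downloaded feed
    _seen_feeds.append(feed)   # BUG: retained forever


def measure_growth(step: Callable[[], None], rounds: int = 5, top: int = 3):
    """Run `step` `rounds` times; return traced bytes after each round and the top growth sites."""
    tracemalloc.start()
    try:
        baseline = tracemalloc.take_snapshot()
        samples = []
        for _ in range(rounds):
            step()
            gc.collect()  # free unreachable objects so only retained memory is counted
            current, _peak = tracemalloc.get_traced_memory()
            samples.append(current)
        growth = tracemalloc.take_snapshot().compare_to(baseline, "lineno")
        return samples, growth[:top]  # compare_to sorts the biggest changes first
    finally:
        tracemalloc.stop()


samples, hotspots = measure_growth(crawl_once)
print(samples)  # rises by roughly 200 KB every round
for stat in hotspots:
    print(stat)  # file:line with total size and size_diff since the baseline
```

In a real investigation, `step` would run one crawl cycle of the actual crawler. The `file:line` of the top hotspot points at the allocation that is never released.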

Auto-deploy is a double-edged sword:

  • Pro: a push to main goes live in about 30 seconds
  • Con: untested changes go live just as fast
  • Con: you need a fast rollback procedure

How I Actually Operate It

# Before pushing to main:
pytest                                   # Run tests
pylint $(git ls-files '*.py')            # Lint all tracked Python files
docker build -t myapp . && docker run --rm -p 8080:8080 myapp   # Local test

# Only push after everything above passes:
git push origin main                     # Auto-deploys
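Git itself can enforce the "only push after passing" rule. It runs an executable `.git/hooks/pre-push` before every push and aborts the push if that hook exits non-zero. The following is a minimal sketch of such a hook in Python. It assumes `pytest` and `pylint` are installed in the active environment. The check list is illustrative, `app` is a hypothetical package name, and the slower Docker build is left out.

```python
#!/usr/bin/env python3
"""Sketch of a .git/hooks/pre-push hook: refuse the push if any local check fails."""
import subprocess
import sys

# Illustrative checks (assumes pytest and pylint are installed; "app" is a hypothetical package name)
CHECKS: list[list[str]] = [
    [sys.executable, "-m", "pytest", "-q"],
    [sys.executable, "-m", "pylint", "app"],
]


def run_checks(checks: list[list[str]]) -> bool:
    """Run each command in order and stop at the first non-zero exit code."""
    for cmd in checks:
        print("running:", " ".join(cmd), flush=True)
        if subprocess.run(cmd).returncode != 0:
            print("check failed, push blocked:", " ".join(cmd), file=sys.stderr)
            return False
    return True


def main() -> int:
    # Git aborts the push when the pre-push hook exits non-zero
    return 0 if run_checks(CHECKS) else 1
```

To use it, save the file as `.git/hooks/pre-push`, make it executable with `chmod +x`, and end it with `sys.exit(main())`.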

3. Oracle Cloud Always Free: Free but Demanding

Advantages

  • Completely free (up to 4 Arm OCPUs and 24GB RAM on Ampere A1 shapes, plus 200GB block storage)
  • No trial expiry: Always Free resources stay free indefinitely
  • Full SSH control

Real Problems

Problem #1: pip install fails on a 1GB instance

MemoryError during pip install

Reason: a 1GB RAM instance (such as the Always Free AMD micro shape) can't handle installing all packages at once

Solution:

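# Add swap (8GB swap file)
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile        # mkswap warns about insecure permissions otherwise
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab   # Keep swap after reboot

# Or: Install only essentials, without pip's cache
pip install --no-cache-dir anthropic supabase python-telegram-bot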
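Before a large install on a small VM, it can help to check whether available RAM plus free swap is likely to be enough. The following is a minimal sketch that reads Linux's `/proc/meminfo`, whose values are reported in kB. The 1500 MB threshold is a hypothetical number to tune, not a measured requirement of these packages.

```python
from pathlib import Path


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value in kB}."""
    info: dict[str, int] = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def headroom_mb(info: dict[str, int]) -> int:
    """Memory the kernel estimates is available, plus free swap, in MB."""
    return (info.get("MemAvailable", 0) + info.get("SwapFree", 0)) // 1024


def enough_for_install(needed_mb: int = 1500, meminfo_path: str = "/proc/meminfo") -> bool:
    # needed_mb is a hypothetical threshold; tune it for your requirements
    return headroom_mb(parse_meminfo(Path(meminfo_path).read_text())) >= needed_mb
```

If `enough_for_install()` returns `False` on the 1GB instance, add swap before running pip.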

Problem #2: Docker vs Local Mismatch

Local: anthropic==0.40.0 (already installed in my environment)
Docker: a fresh install reads requirements.txt
  - anthropic==0.40.0
  - langchain-anthropic needs anthropic>=0.41.0
  → pip can't resolve the conflict, so the image build fails

Solution: Remove version pins, let pip resolve

DON'T: anthropic==0.40.0, supabase==2.0.0, ...
DO: anthropic, supabase (let pip figure it out)
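However you resolve it, this kind of conflict can be detected mechanically before the Docker build. Compare each exact pin in requirements.txt against the minimum versions other packages need. The following is a minimal sketch that assumes plain `name==X.Y.Z` pins with purely numeric versions. Real tools such as `pip check` or the `packaging` library handle the full version grammar. `find_pin_conflicts` and its inputs are hypothetical.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """'0.40.0' -> (0, 40, 0). Numeric dotted versions only (an assumption)."""
    return tuple(int(part) for part in text.split("."))


def parse_pins(requirements: str) -> dict[str, str]:
    """Collect exact 'name==version' pins, ignoring comments and unpinned lines."""
    pins: dict[str, str] = {}
    for raw in requirements.splitlines():
        line = raw.split("#", 1)[0].strip()
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip().lower()] = version.strip()
    return pins


def find_pin_conflicts(requirements: str, minimums: dict[str, str]) -> list[tuple[str, str, str]]:
    """Return (package, pinned, required_minimum) for every pin below a required minimum."""
    pins = parse_pins(requirements)
    conflicts = []
    for name, minimum in minimums.items():
        pinned = pins.get(name.lower())
        if pinned is not None and parse_version(pinned) < parse_version(minimum):
            conflicts.append((name, pinned, minimum))
    return conflicts


reqs = """
anthropic==0.40.0
supabase==2.0.0
langchain-anthropic
"""
# The minimum that langchain-anthropic needs, as described above
print(find_pin_conflicts(reqs, {"anthropic": "0.41.0"}))
# [('anthropic', '0.40.0', '0.41.0')]
```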

Problem #3: SSH Deployment Needs Automation

# Manual (every time):
ssh oracle@your-ip
cd /opt/ai-lifelogger
git pull && sudo systemctl restart <service-name>

# Better (automated via GitHub Actions):
ssh -i "$key" oracle@"$ip" "cd /opt/ai-lifelogger && git pull && sudo systemctl restart <service-name>"
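The automated one-liner can also live in a small Python deploy script, for example one that a GitHub Actions step runs. The following is a minimal sketch with these assumptions:
- SSH access is key-based.
- The repo is checked out at /opt/ai-lifelogger.
- The SSH user can run `sudo systemctl` without a password.
- You pass in the systemd unit name, since the post doesn't give it.

`build_deploy_command` and `deploy` are hypothetical names. The design choices:
- `shlex.quote` keeps paths with spaces intact.
- `BatchMode=yes` makes ssh fail instead of waiting for a password prompt in CI.
- `git pull --ff-only` refuses to create merge commits on the server.

```python
import shlex
import subprocess


def build_deploy_command(host: str, key_path: str, app_dir: str, service: str,
                         user: str = "oracle") -> list[str]:
    """Build the ssh argv that pulls the latest code and restarts the service."""
    remote = (
        f"cd {shlex.quote(app_dir)} && git pull --ff-only && "
        f"sudo systemctl restart {shlex.quote(service)}"
    )
    return ["ssh", "-i", key_path, "-o", "BatchMode=yes", f"{user}@{host}", remote]


def deploy(host: str, key_path: str, app_dir: str, service: str) -> None:
    # check=True raises CalledProcessError if ssh or any remote step fails
    subprocess.run(build_deploy_command(host, key_path, app_dir, service), check=True)
```

For example: `deploy("203.0.113.10", "/home/ci/.ssh/id_ed25519", "/opt/ai-lifelogger", "my-app")`. The IP, key path, and `my-app` are placeholders.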

Performance Comparison (3-Month Data)

Metric         Cloud Run    Railway            Oracle
Deploy time    2-3 min      30 sec             5 min
Cold start     3-5 sec      0 sec              <1 sec
Monthly cost   $15          $25                $0
CPU limit      2 cores      1 core             4 cores
RAM limit      2GB          512MB              24GB
Stability      ✅ Solid     ⚠️ Memory issues   ✅ Solid

Practical Advice

1. Start Minimal, Add Gradually

  • Deploy "/" endpoint first
  • Test, pass, add next feature
  • Repeat

2. Always Test Locally

docker build -t myapp .
docker run -p 8080:8080 myapp
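You can reproduce Cloud Run's "Container Failed to Start" check before deploying by timing how long the container takes to answer after `docker run`. The following is a minimal sketch that polls an HTTP endpoint until it returns 200. The URL and the 30-second limit are assumptions, not Cloud Run's actual probe settings. It checks HTTP rather than just the TCP port because, with `-p`, Docker's port proxy may accept connections on the host before the app inside the container is listening.

```python
import http.client
import time
import urllib.request
from typing import Optional


def wait_until_ready(url: str, timeout_s: float = 30.0, interval_s: float = 0.5) -> Optional[float]:
    """Poll `url` until it returns HTTP 200; return seconds waited, or None on timeout."""
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        try:
            with urllib.request.urlopen(url, timeout=interval_s) as resp:
                if resp.status == 200:
                    return time.monotonic() - start
        except (OSError, http.client.HTTPException):
            pass  # not listening yet, connection reset, or a non-2xx reply (HTTPError)
        time.sleep(interval_s)
    return None
```

Start `docker run -p 8080:8080 myapp` in another terminal, then call `wait_until_ready("http://127.0.0.1:8080/")`. It returns the seconds until the app answered. `None` means the app never answered, which is the local equivalent of Cloud Run's startup timeout.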

3. Choose Based on Use Case

  • High traffic: Cloud Run (autoscales)
  • Medium traffic: Railway (simple)
  • Low traffic: Oracle (free)

4. Monitoring is Non-Negotiable

Cloud Run: GCP Logs + Cloud Monitoring
Railway: Built-in dashboard (limited)
Oracle: SSH → journalctl + tail -f
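Emitting one JSON object per log line makes the same logs usable on every platform. On Cloud Run, Cloud Logging captures stdout and parses a JSON line into a structured entry, where `severity` sets the log level and `message` becomes the displayed text. On Oracle, a systemd service's stdout goes to the journal, so `journalctl` shows the same lines. The following is a minimal sketch with the standard `logging` module. `JsonFormatter`, `setup_logging`, and the `crawler` logger name are hypothetical.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Format each record as one JSON line (hypothetical helper, stdlib only)."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "severity": record.levelname,  # e.g. "WARNING"; Cloud Logging reads this as the level
            "message": record.getMessage(),
            "logger": record.name,
        }
        if record.exc_info:  # keep tracebacks inside the same JSON line
            entry["message"] += "\n" + self.formatException(record.exc_info)
        return json.dumps(entry, ensure_ascii=False)


def setup_logging(level: int = logging.INFO) -> None:
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    root = logging.getLogger()
    root.handlers[:] = [handler]  # replace any default handlers
    root.setLevel(level)


setup_logging()
logging.getLogger("crawler").warning("feed cache size: %d items", 1200)
```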

What I Learned

There's no "perfect" platform.

  • Cloud Run: startup timeout (solvable with lazy loading)
  • Railway: memory leaks (code issue, not platform)
  • Oracle: operational overhead (worth it for free tier)

The real skill: Understanding each platform's constraints and designing around them.

The 20+ Cloud Run deployment failures? They taught me more than 10 successful deployments would have.

Bottom Line

If you're deploying to multiple clouds, expect problems, but they're solvable. Document them, learn from them, share them.

Your experience matters to other developers.
