JustJinoIT

Posted on Jun 6

Multi-Cloud Deployment in Practice: A 3-Month Operations Log for Cloud Run, Railway, and Oracle Cloud

#devops #cloud #fastapi #production

Published on: 2026-06-06

Reading time: 10 min


Background

I deployed three FastAPI projects to three different clouds. This post shares the real problems I ran into on each platform and how I resolved them.

contest-agent      → Google Cloud Run
ai-insight-curator → Railway  
ai-lifelogger      → Oracle Cloud Always Free

1. Google Cloud Run: A Startup Timeout Found After 20+ Deployment Attempts

Problem: "The container failed to start and listen on port 8080"

I tried deploying more than 20 times, and it failed every time.

Deployment logs:
- Build: SUCCESS ✅
- Push to Registry: SUCCESS ✅
- Container Start: TIMEOUT ❌
- Port 8080 binding: TIMEOUT ❌

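Cause: I/O work in the FastAPI startup phase was blocking port binding

# ❌ Problem code (blocking I/O during startup)
from contextlib import asynccontextmanager
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    await telegram_client.send_message("Starting...")  # waits on I/O
    db_check = await db.test_connection()              # waits on I/O
    scheduler.start()                                   # heavy initialization
    yield

Cloud Run runs its health check only once the container is listening on the port. With a server such as Uvicorn, the port is not opened until the lifespan startup code has finished, so the blocked startup ran into the timeout first.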

Solution: Lazy Loading

# ✅ Fixed code (minimal startup)
from fastapi import FastAPI, Request

app = FastAPI()
_initialized = False

async def lazy_init():
    global _initialized
    if _initialized:
        return
    _initialized = True
    # Heavy work is moved to the time of the first request
    await telegram_client.send_message("Started")
    scheduler.start()

@app.post("/webhook")
async def webhook(request: Request):
    await lazy_init()  # initialize on the first request
    ...

Results:

Startup: 100ms (previously timed out after 60+ seconds)
Port binding: succeeds immediately
Health check: passes

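Lesson: Start from the Bare-Minimum Version

If you deploy a complex system all at once, it is hard to pinpoint the cause of a failure.

# Step 1: only the "/" endpoint
@app.get("/")
async def root():
    return {"status": "ok"}

# Deploy and test → success ✅

# Step 2: add a health check
@app.get("/health")
async def health():
    return {"status": "healthy"}

# Deploy and test → success ✅

# Step 3: add the webhook
# ... keep adding incrementally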
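
Cloud Run passes the port to listen on through the `PORT` environment variable, so even the bare-minimum version is easier to debug if it reads that value instead of hard-coding one. A small sketch of the lookup follows. The helper name `resolve_port` and the fallback to 8080 are assumptions for illustration, not part of the original project.

```python
import os


def resolve_port(default: int = 8080) -> int:
    """Return the port from the PORT environment variable, or a default."""
    raw = os.environ.get("PORT", "").strip()
    # Fall back when PORT is unset or is not a valid TCP port number.
    if raw.isdecimal() and 0 < int(raw) < 65536:
        return int(raw)
    return default
```

The result can then be handed to the server, for example `uvicorn.run(app, host="0.0.0.0", port=resolve_port())`.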

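2. Railway: The Illusion of "Simple"

Pros

Git push → automatic deployment (very fast)
Easy PostgreSQL and Redis integration
Intuitive dashboard

Real problems

1) Costs are hard to predict

Expected: $10/month
Actual: $25/month (250% of the estimate, i.e. 150% over)

Causes:
- 1 vCPU + 512MB RAM running continuously
- Always on (no cold starts, so memory is consumed around the clock)
- Extra charges for bandwidth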
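
The gap between an estimate and a bill can be stated two ways, and the two are easy to mix up. A short worked calculation with the $10 and $25 figures above shows the difference. The function names are illustrative only.

```python
def ratio_percent(expected: float, actual: float) -> float:
    """Actual spend expressed as a percentage of the estimate."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    return actual / expected * 100


def overrun_percent(expected: float, actual: float) -> float:
    """Percentage by which actual spend exceeds the estimate."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    return (actual - expected) / expected * 100


# With the figures from this section:
bill_vs_estimate = ratio_percent(10, 25)    # 250.0: the bill is 250% of the estimate
over_estimate = overrun_percent(10, 25)     # 150.0: the bill is 150% over the estimate
```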

2) Memory leaks are hard to detect

Railway offers only limited per-process memory monitoring.

Memory usage:
Time 1: 150MB ✅
Time 2: 180MB
Time 3: 220MB
Time 4: 260MB (OOM risk)

Cause: memory was not released during RSS crawling

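3) Automatic redeploys are a double-edged sword

Git push → automatic deployment is convenient, but:

There is a risk of deploying without tests
Fast rollback is needed when something breaks

How I actually operate

# Before pushing to GitHub
pytest                      # tests
pylint --recursive=y .      # lint
docker build -t myapp . && docker run --rm -p 8080:8080 myapp  # local test

# Push once everything passes (triggers automatic deployment)
git push origin main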
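
The manual checklist above can be turned into a small gate script so that no step is skipped by accident. This is a sketch under assumptions: the function `run_checks` and the `CHECKS` list are illustrative, and the lint target and image name should be adjusted to the project.

```python
import subprocess
import sys
from typing import Sequence


def run_checks(commands: Sequence[Sequence[str]]) -> bool:
    """Run each command in order; stop and return False at the first failure."""
    for cmd in commands:
        print("running:", " ".join(cmd))
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print("failed:", " ".join(cmd), file=sys.stderr)
            return False
    return True


# Assumed check list mirroring the manual steps; adjust targets to the project.
CHECKS = [
    ["pytest"],
    ["pylint", "--recursive=y", "."],
    ["docker", "build", "-t", "myapp", "."],
]
```

Calling `run_checks(CHECKS)` and pushing only when it returns `True` reproduces the manual gate, and the same function could be wired into a Git pre-push hook.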

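3. Oracle Cloud Always Free: Free, but It Takes Operational Effort

Pros

Completely free (up to 4 CPU, 24GB RAM, 200GB storage)
No restrictions
Full control over SSH

Real problems

1) pip install fails on a 1GB RAM server

During pip install:
MemoryError: Unable to allocate 500MB

Cause: the instance has only 1GB of RAM, and all packages were installed in one go

Solution:

# Add swap
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile   # restrict access before enabling swap
sudo mkswap /swapfile
sudo swapon /swapfile

# Or install only the essential packages
pip install --no-cache-dir anthropic supabase python-telegram-bot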

2) Version compatibility (Docker vs. local)

Locally: anthropic==0.40.0 (already installed)
In Docker: requirements.txt is resolved from scratch
  - anthropic==0.40.0
  - langchain-anthropic requires: anthropic>=0.41.0
  → pip cannot resolve the dependencies
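
The conflict above comes down to an exact pin that falls below another package's lower bound. A simplified illustration of that comparison follows. It handles only plain dotted numeric versions, not the full PEP 440 rules that pip applies, and the function names are hypothetical.

```python
def parse_version(text: str) -> tuple:
    """Parse a plain dotted version such as '0.40.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def satisfies_minimum(pinned: str, minimum: str) -> bool:
    """True if an exact pin meets a '>=' lower bound."""
    return parse_version(pinned) >= parse_version(minimum)


# The pin from requirements.txt against the bound required by langchain-anthropic:
pin_is_compatible = satisfies_minimum("0.40.0", "0.41.0")  # False: 0.40.0 < 0.41.0
```

Comparing integer tuples rather than strings matters here: as strings, "0.9.0" would sort after "0.41.0", which is the wrong order for versions.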

Solution: remove the version pins and let pip resolve them

DO NOT:  anthropic==0.40.0, supabase==2.0.0, ...
DO THIS: anthropic, supabase (pip resolve)

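3) SSH deployment needs automation

# Manual deployment (SSH in every time)
ssh oracle@your-ip
cd /opt/ai-lifelogger
git pull
systemctl restart

# Automated (GitHub Actions)
- name: Deploy to Oracle
  run: |
    # ssh -i expects a key file; this assumes ORACLE_KEY holds the private key contents
    echo "${{ secrets.ORACLE_KEY }}" > key.pem && chmod 600 key.pem
    ssh -i key.pem oracle@${{ secrets.ORACLE_IP }} \
      "cd /opt/ai-lifelogger && git pull && systemctl restart"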

Performance comparison (3 months of operational data)

Metric	Cloud Run	Railway	Oracle
Deploy time	2-3 min	30 s	5 min
Cold start	3-5 s	0 s	<1 s
Monthly cost	$15	$25	$0
CPU limit	2	1	4
Memory limit	2GB	512MB	24GB
Status	Stable ✅	Memory leak	Stable ✅

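Practical tips

1. Start from the bare-minimum version

"Everything will work on the first try" is an illusion
Verify the deployment at every step

2. Local testing is mandatory

docker build -t myapp .
docker run -p 8080:8080 myapp

3. The trade-off between cost and performance

High traffic: Cloud Run (autoscaling)
Medium scale: Railway (convenient, but costly)
Low traffic: Oracle (free, but takes effort to manage)

4. Monitoring is a requirement, not an option

Cloud Run: GCP Logs + Cloud Monitoring
Railway: built-in dashboard (limited)
Oracle: SSH → journalctl + tail -f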

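Conclusion

No platform is ever "perfect."

Cloud Run: startup timeout (solvable)
Railway: cost + memory (depends on the environment)
Oracle: operational effort (solved with automation)

What matters is recognizing the problem and responding to it systematically.

Some problems, like the startup timeout, are solved the moment you realize "oh, that's what it is," while others, like the memory leak, require fixing the underlying code.

The biggest lesson from working with three clouds: start from the bare-minimum version.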

DEV Community
