TL;DR: I started with Socket.IO inside a single Cloud Run service. I was naive about keeping “prolonged sessions” alive on a stateless platform. When emits were flaky, I split into two containers (backend + realtime) as a quick fix, but it was still unreliable during deploys/auto-scale, and min-instances would have raised baseline cost. The final setup that works: CRUD on Cloud Run, sockets on one small GCE VM, with Google Pub/Sub as the glue. Cheap. Stable. And my first “distributed system” that actually behaves.
Why Cloud Run + sockets bit me
Cloud Run is awesome for stateless HTTP. WebSockets technically work, but:
- Multiple revisions/instances: Socket.IO keeps room membership in memory per process. When Cloud Run deploys or scales, sockets can connect to Revision A while your emit happens in Revision B → dropped emits.
- Rollouts: During deploys, two revisions can be live. Your Pub/Sub subscriber may run in one, while connected sockets live in the other.
- Scale-to-zero/cold starts: There are windows where the subscriber isn’t ready or clients haven’t rejoined yet.
I also saw cost drift when keeping a realtime-ish instance “warm”.
The original plan (and my wrong assumptions)
- I put Socket.IO and CRUD API in one Cloud Run service.
- Assumption #1 (naive): “I can just keep a long-lived WebSocket session on Cloud Run and it’ll be fine.” Reality: Cloud Run is stateless & ephemeral; revisions come/go. In-memory room state doesn’t persist across them.
- Assumption #2 (naive): “If it’s flaky, adding capacity will improve reliability.” Reality: More processes without shared state amplify the problem.
Symptoms I saw: sometimes `receiveMessage` arrived, often it didn’t, especially around deploys.
The quick “second container” workaround that also bit me
When things got flaky, I split into two containers: one for backend (CRUD) and one for realtime (Socket.IO). Great separation of concerns… but still unreliable when multiple copies existed (deploys/auto-scale):
- Clients could connect to Realtime Container A, but the publish/emit path could run in Realtime Container B.
- Socket.IO rooms are per-process memory → emits from B don’t reach sockets on A.
- I considered min-instances=1 to keep it warm, but that raises baseline cost and still doesn’t fix cross-process room state.
Lesson: Two containers is fine structurally, but without a single always-on realtime process (or shared state like Redis), emits will still get “lost” across processes.
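For reference, the “shared state like Redis” path I skipped would look roughly like this: Socket.IO’s Redis adapter makes rooms and emits visible across processes, so multiple realtime copies can coexist. A minimal sketch, assuming a `REDIS_URL` you’d still have to provision and pay for:

```js
// Sketch: sharing Socket.IO rooms/emits across processes via Redis (the path I did NOT take)
import { createServer } from 'http';
import { Server } from 'socket.io';
import { createClient } from 'redis';
import { createAdapter } from '@socket.io/redis-adapter';

const httpServer = createServer();
const io = new Server(httpServer);

const pubClient = createClient({ url: process.env.REDIS_URL }); // assumed env var
const subClient = pubClient.duplicate();
await Promise.all([pubClient.connect(), subClient.connect()]);

io.adapter(createAdapter(pubClient, subClient)); // emits from any process now reach all sockets

httpServer.listen(process.env.PORT || 8080);
```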
The architecture that fixed it
I kept Cloud Run for stateless business logic and moved sockets to one always-on process on GCE. Pub/Sub bridges the two worlds.
```
[ Frontend ] --Socket.IO--> [ Realtime (GCE, 1 small VM) ]
                                  ▲                    │
                                  │ sockets            │ subscribes
                                  │                    ▼
                     Google Pub/Sub (topics: chat.out, chat.in)
                                  ▲                    │ publishes
                                  │                    ▼
              [ Cloud Run Backend (CRUD/DB/notifications) ]
```
- **Backend (Cloud Run)**
  - DB writes, unread counts, notifications, auth.
  - Publishes client-visible events to `chat.out`.
  - Subscribes to `chat.in` for commands (e.g., mark read).
- **Realtime (GCE, single small VM)**
  - Socket.IO only (rooms, presence, emits).
  - Subscribes to `chat.out` and emits to sockets.
  - Publishes to `chat.in` for DB work the backend must do.
This removes the cross-instance room problem: all sockets live in one process.
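The Pub/Sub plumbing itself is just two topics and one pull subscription per consumer. A rough one-time provisioning sketch with the Node client; only `chat.out.realtime` appears in my config, so the backend-side subscription name `chat.in.backend` is a placeholder:

```js
// One-time Pub/Sub provisioning sketch using @google-cloud/pubsub
import { PubSub } from '@google-cloud/pubsub';

const pubsub = new PubSub({ projectId: process.env.PUBSUB_PROJECT_ID });

const [chatOut] = await pubsub.createTopic('chat.out');
const [chatIn] = await pubsub.createTopic('chat.in');

// Realtime VM pulls client-visible events from chat.out
await chatOut.createSubscription('chat.out.realtime');
// Backend pulls commands from chat.in (subscription name is illustrative)
await chatIn.createSubscription('chat.in.backend');
```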
Minimal wiring (what I actually changed)
1) Backend is publish-only (no direct emits)
Everything the client should see goes through Pub/Sub:
// Backend (Cloud Run) on message created / reaction toggled / unread updated
await pubSubService.publish(process.env.CHAT_OUT_TOPIC || 'chat.out', {
type: 'receiveMessage', // or 'messageReadUpdate' | 'roomUpdated' | 'reactionRemoved'
roomId,
userId, // optional for per-user broadcasts
payload: {...} // what the client expects
});
2) Realtime (GCE) subscribes and emits
// Realtime (GCE) boot
await pubSubService.subscribe(process.env.CHAT_OUT_SUB || 'chat.out.realtime', (msg) => {
const { type, roomId, userId, payload } = msg || {};
if (type === 'receiveMessage' && roomId) io.to(roomId).emit('receiveMessage', payload);
else if (type === 'messageReadUpdate' && roomId) io.to(roomId).emit('messageReadUpdate', payload);
else if (type === 'roomUpdated' && userId) io.to(`user:${userId}`).emit('roomUpdated', payload);
});
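In these snippets, `pubSubService` is a thin wrapper around the official `@google-cloud/pubsub` client. A minimal sketch of such a wrapper (simplified; real error handling and flow control may differ):

```js
// Minimal pubSubService sketch: JSON publish + streaming-pull subscribe with ack/nack
import { PubSub } from '@google-cloud/pubsub';

const pubsub = new PubSub({ projectId: process.env.PUBSUB_PROJECT_ID });

export const pubSubService = {
  async publish(topicName, data) {
    // publishMessage({ json }) serializes the payload for us
    return pubsub.topic(topicName).publishMessage({ json: data });
  },
  async subscribe(subscriptionName, handler) {
    const subscription = pubsub.subscription(subscriptionName);
    subscription.on('message', async (message) => {
      try {
        await handler(JSON.parse(message.data.toString()));
        message.ack();
      } catch (err) {
        console.error('handler failed, message will be redelivered', err);
        message.nack();
      }
    });
    subscription.on('error', (err) => console.error('subscription error', err));
  },
};
```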
3) Realtime → Backend for DB work
// Realtime publishes commands back
await pubSubService.publish(process.env.CHAT_IN_TOPIC || 'chat.in', {
type: 'markMessageRead',
roomId, readerId, senderId,
timestamp: new Date().toISOString()
});
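The receiving end on the backend isn’t shown above; a minimal sketch of how the Cloud Run service might consume `chat.in` (the subscription name `chat.in.backend` and the `messageService.markMessagesRead` helper are placeholders for whatever your DB layer exposes). Keeping this handler idempotent matters because Pub/Sub delivery is at-least-once:

```js
// Backend (Cloud Run) boot: consume commands coming back from the realtime VM
await pubSubService.subscribe(process.env.CHAT_IN_SUB || 'chat.in.backend', async (msg) => {
  const { type, roomId, readerId, senderId } = msg || {};
  if (type === 'markMessageRead' && roomId && readerId) {
    // Idempotent by design: marking an already-read message as read is a no-op,
    // so at-least-once redelivery from Pub/Sub is harmless here.
    await messageService.markMessagesRead({ roomId, readerId, senderId });
  }
});
```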
4) Add a join ACK to kill the first-emit race
Server:
// NestJS gateway handlers; @MessageBody/@ConnectedSocket come from '@nestjs/websockets'
@SubscribeMessage('joinRoom')
handleJoinRoom(@MessageBody() { roomId, userId }, @ConnectedSocket() client: Socket) {
  client.join(roomId);
  client.data = { roomId, userId };
  return { ok: true }; // returned value is delivered to the client's ACK callback
}

@SubscribeMessage('joinUser')
handleJoinUser(@MessageBody() { userId }, @ConnectedSocket() client: Socket) {
  client.join(`user:${userId}`);
  client.data.userId = userId;
  return { ok: true };
}
Client:
socket.emit('joinRoom', { roomId, userId }, (ack) => {
if (!ack?.ok) console.warn('joinRoom not ACKed');
});
socket.emit('joinUser', { userId }, (ack) => {
if (!ack?.ok) console.warn('joinUser not ACKed');
});
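One small extension of the above that’s worth making explicit: Socket.IO fires `connect` on every reconnect too, and the VM’s in-memory rooms are gone after a restart, so run the joins inside the `connect` handler rather than once at startup:

```js
// Rejoin rooms on every (re)connect so a VM restart or network blip doesn't leave the socket roomless
socket.on('connect', () => {
  socket.emit('joinRoom', { roomId, userId }, (ack) => {
    if (!ack?.ok) console.warn('joinRoom not ACKed');
  });
  socket.emit('joinUser', { userId }, (ack) => {
    if (!ack?.ok) console.warn('joinUser not ACKed');
  });
});
```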
GCE setup (cheap + HTTPS)
- VM: `e2-micro` (1 vCPU / 1 GB) was enough for my load; `e2-small` is comfy.
- Static IP + Caddy for HTTPS:

`/opt/realtime/docker-compose.yml`
version: "3.9"
services:
realtime:
image: gcr.io/<PROJECT>/realtime:latest
environment:
NODE_ENV: production
PORT: "8080"
PUBSUB_PROJECT_ID: <PROJECT>
CHAT_IN_TOPIC: chat.in
CHAT_OUT_TOPIC: chat.out
CHAT_OUT_SUB: chat.out.realtime
restart: unless-stopped
caddy:
image: caddy:2-alpine
ports: ["80:80","443:443"]
volumes:
- /opt/realtime/Caddyfile:/etc/caddy/Caddyfile
depends_on: [realtime]
restart: unless-stopped
/opt/realtime/Caddyfile
realtime.yourdomain.com {
reverse_proxy realtime:8080
}
Point `realtime.yourdomain.com` to the VM’s static IP (I used Onamae.com DNS). Caddy auto-issues Let’s Encrypt certificates and proxies WebSockets out of the box.
Dev tip: you can Stop the VM when idle. You’ll still pay for the disk + static IP, but not CPU/RAM.
Capacity & tuning
For mostly-idle chat:
- e2-micro: ~1–2k concurrent sockets safely (I’d cap around 2k).
- e2-small: ~3–5k+ safely.
Hardening:
- Increase FD limits (`LimitNOFILE`) if you push many connections.
- Socket.IO: `transports: ['websocket']`, sensible `pingInterval`/`pingTimeout` (see the sketch below).
- Monitor event-loop lag and CPU; bump to e2-small if sustained >70% CPU or lag >100–200 ms.
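For concreteness, the Socket.IO server options behind those bullets (plus the CORS lock-down from the checklist further down) look roughly like this; the numbers and the `APP_ORIGIN` env var are illustrative, not tuned values from my setup:

```js
// Sketch of a hardened Socket.IO server configuration
import { createServer } from 'http';
import { Server } from 'socket.io';

const httpServer = createServer();
const io = new Server(httpServer, {
  transports: ['websocket'],                // skip HTTP long-polling entirely
  pingInterval: 25000,                      // how often the server pings (ms)
  pingTimeout: 20000,                       // how long to wait for a pong before dropping the socket
  cors: { origin: process.env.APP_ORIGIN }, // lock CORS to your app origin
});

httpServer.listen(process.env.PORT || 8080);
```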
What I learned (a.k.a. the useful mistakes)
- **Cloud Run ≠ stateful socket host.** You can run WebSockets there, but instance/revision churn breaks in-memory room state unless you add a shared adapter (Redis) and manage rollouts.
- **“Prolonged sessions” don’t beat platform semantics.** Keeping sockets open longer doesn’t stop Cloud Run from swapping revisions or scaling.
- **Two containers didn’t fix cross-process state.** Splitting backend/realtime clarified concerns, but emits were still lost across processes without shared state.
- **Min-instances helps warmness, not correctness.** It raises baseline cost and doesn’t solve the multi-process room problem.
- **One always-on realtime process is a huge simplifier.** A tiny GCE VM is cheap and eliminates revision/scale races. CRUD remains serverless on Cloud Run.
- **Pub/Sub is the right boundary.** At-least-once + unordered is fine with idempotent handlers. It decouples backend logic from realtime emits.
- **Join ACK matters.** A tiny handshake prevents the “first emit before join” race.
“Do this and it will work” checklist
- Backend (Cloud Run) publishes to `chat.out`; consumes `chat.in`
- Realtime (GCE) consumes `chat.out`; publishes `chat.in`
- No direct `.emit()` in backend, publish only
- `joinRoom`/`joinUser` return `{ ok: true }`; client waits for ACK
- Socket.IO: `transports: ['websocket']`, CORS locked to your app origin
- HTTPS via Caddy on the VM (ports 80/443 open)
- Single subscriber process (no duplicate subscribers)
Final Note
I went from “one Cloud Run with sockets” → “two containers (backend + realtime) on Cloud Run” → “Cloud Run + one small GCE + Pub/Sub.”
It’s cheaper, simpler, and finally RELIABLE!!! If your emits are flaky on Cloud Run or you’re scaling sockets, try this split. For me, it turned a frustrating rabbit hole into a boring (in the best way) distributed system.