You're shipping an AI chat product. The LLM streams ~40 tokens/sec per user.
50,000 concurrent users on launch day. Browser clients only. Tokens flow one way: server → user.
Your team meets to pick the transport. Everyone shows up with a strong opinion.
Here's the setup:
• Frontend: React in the browser
• Backend: Python (FastAPI) behind an ALB
• Payload: UTF-8 text tokens, ~5–20 bytes each
• Direction: server pushes, client just renders
• Reconnects must be invisible (mobile networks drop constantly)
The team lead says "WebSockets, obviously." The platform engineer pushes back. What do you ship?
A) WebSockets — the default for "real-time," full-duplex, every chat app uses it.
B) Server-Sent Events (SSE) — one-way HTTP stream, native browser EventSource, auto-reconnect built in.
C) gRPC server streaming — HTTP/2, binary frames, backpressure handled for you.
D) Long polling — boring, battle-tested, works through every proxy on earth.
Three of these are real production patterns. One is what most teams default to and quietly regret six months in.
Pick one — A, B, C, or D — and tell me why. I'll drop the full breakdown in the comments (including why the most popular answer is the senior engineer trap).
If your team is about to argue this exact tradeoff, send them this post before the meeting. Save yourself a whiteboard session.
Drop your answer 👇
Top comments (4)
Why B wins (SSE):
The traffic is one-way. Server pushes tokens, client renders. That's it. The client never sends a token back mid-stream; user input goes through a separate POST. So you're paying for full-duplex on a one-way problem.
SSE is built for exactly this shape. One long-lived HTTP connection, text/event-stream, server writes, browser's native EventSource reads. No protocol upgrade. No new framing. No new auth path.
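To show how little framing is involved, here's a minimal sketch of the text/event-stream wire format in plain Python. The name format_sse_event is hypothetical (not a FastAPI or library function); in a real app you'd write these strings into a streaming response served with Content-Type: text/event-stream.

```python
from typing import Optional


def format_sse_event(data: str, event_id: Optional[str] = None) -> str:
    """Serialize one event using text/event-stream framing.

    Hypothetical helper for illustration; not part of any library.
    """
    lines = []
    if event_id is not None:
        # The browser remembers the most recent id it received.
        lines.append(f"id: {event_id}")
    # A payload containing newlines is split across several data: fields;
    # the browser joins them back together with "\n".
    for part in data.split("\n"):
        lines.append(f"data: {part}")
    # A blank line terminates the event.
    return "\n".join(lines) + "\n\n"
```

A token "Hello" with id 7 goes out as an id line, a data line, and a blank line. That's the whole protocol for this use case.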
The killer feature: automatic reconnect with Last-Event-ID. Browser drops, mobile switches from WiFi to LTE, and EventSource reconnects on its own. As long as the server tagged each event with an id, the browser sends the last id it saw in a Last-Event-ID header and you replay from there. With WebSockets, you write that logic yourself, and you write it wrong the first three times.
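The browser only handles its half of that. The server still has to keep recent events around and resume from the id it's given. Here's a minimal in-memory sketch of the server side, assuming sequential integer ids per stream. ReplayBuffer and its methods are hypothetical names, and a multi-instance deployment would need shared storage rather than a per-process deque.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple


class ReplayBuffer:
    """Keeps the most recent events of one stream so a reconnect can resume.

    Hypothetical helper for illustration; ids are sequential integers.
    """

    def __init__(self, max_events: int = 1024) -> None:
        # Oldest events fall off once max_events is reached.
        self._events: Deque[Tuple[int, str]] = deque(maxlen=max_events)
        self._next_id = 1

    def append(self, token: str) -> int:
        """Store a token and return the id to send in the event's id: field."""
        event_id = self._next_id
        self._events.append((event_id, token))
        self._next_id += 1
        return event_id

    def events_after(self, last_event_id: Optional[str]) -> List[Tuple[int, str]]:
        """Return buffered events the client has not seen yet.

        last_event_id is the raw Last-Event-ID header value,
        or None on a first connection.
        """
        try:
            last_seen = int(last_event_id) if last_event_id else 0
        except ValueError:
            # Unparseable header: replay everything still buffered.
            last_seen = 0
        return [(eid, tok) for eid, tok in self._events if eid > last_seen]
```

If the client reconnects with Last-Event-ID: 2 after four tokens were buffered, events_after("2") hands back events 3 and 4 and the stream continues from there.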
OpenAI's streaming API? SSE. Anthropic's streaming API? SSE.
Why A is the trap (WebSockets):
The word "chat" tricks people. User-to-user chat (Slack, WhatsApp) is bidirectional. An AI chat app is a unidirectional token stream with a separate POST for the prompt. Different shape, different transport.
The cost at 50K connections:
• Sticky sessions on the ALB if your client library falls back to HTTP polling (Socket.IO does)
• Hand-rolled reconnect + replay logic
• Heartbeats and ping/pong to detect dead connections
• Higher per-connection memory for buffers you'll never use
Every one of those turns into an on-call page.
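To make two of those bullets concrete, here's a rough sketch of pieces you end up owning with WebSockets: a reconnect backoff schedule and a dead-connection check driven by ping/pong. Every name and default here (0.5 s base, 30 s cap, 15 s ping interval, 2 missed pings) is an illustrative assumption, not a recommendation from any library.

```python
import random
from typing import Callable


def reconnect_delay(
    attempt: int,
    base: float = 0.5,
    cap: float = 30.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Exponential backoff with full jitter for a 0-based reconnect attempt.

    The ceiling doubles each attempt up to `cap`; the actual delay is a
    random fraction of it so 50K clients don't all reconnect at once.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def is_connection_dead(
    last_pong_at: float,
    now: float,
    ping_interval: float = 15.0,
    missed_allowed: int = 2,
) -> bool:
    """True once more than `missed_allowed` ping intervals pass with no pong."""
    return (now - last_pong_at) > ping_interval * missed_allowed
```

None of this is hard on its own. The point is that it's code you write, test, and get paged for, on top of the replay logic.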
Why C is wrong (gRPC streaming):
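Browsers can't speak native gRPC, because page JavaScript has no access to the HTTP/2 framing and trailers it relies on. You'd need gRPC-Web behind a translating proxy such as Envoy, which supports server streaming but not client or bidirectional streaming, and adds a proxy hop. You're now operating Envoy and debugging Protobuf frames in the network tab, all to ship UTF-8 tokens. Great for service-to-service. Wrong tool for browsers.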
Why D is wrong (Long polling):
At 40 tokens/sec with one token per poll response, every token costs a full HTTP request: 50K users × 40 req/sec = 2M RPS just to render text. It's a valid fallback when SSE/WS are blocked. It's not your primary transport in 2026.
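The back-of-envelope math, as a small sketch. long_poll_rps is a hypothetical helper, and tokens_per_response is an assumption I'm adding so you can see what batching would buy you: fewer requests, paid for with extra latency per batch.

```python
def long_poll_rps(
    users: int,
    tokens_per_sec: float,
    tokens_per_response: float = 1.0,
) -> float:
    """Aggregate requests per second when tokens are delivered by long polling.

    tokens_per_response=1 is the worst case: one HTTP request per token.
    """
    return users * tokens_per_sec / tokens_per_response


def batching_delay_sec(tokens_per_sec: float, tokens_per_response: float) -> float:
    """Time the server waits to accumulate one batch of tokens."""
    return tokens_per_response / tokens_per_sec


worst_case = long_poll_rps(50_000, 40)                        # one token per response
batched = long_poll_rps(50_000, 40, tokens_per_response=10)   # ten tokens per response
wait = batching_delay_sec(40, 10)                             # seconds of added delay per batch
```

Batching ten tokens per response cuts the load to 200K RPS but makes text arrive in quarter-second chunks, so you trade the smooth streaming feel for a request count that is still far above one connection per user.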