Frontend System Design: WebSocket Architecture (Simple)
- 1. What Is This
- 2. Simple Slack Example
- 3. Functional Requirements (Simple)
- 4. Non Functional Requirements (Simple)
- 5. High Level Architecture
- 6. Simple Data Flow
- 7. Data Model (Simple)
- 8. State Management (Simple)
- 9. API Design (Frontend POV)
- 10. Caching, Rendering, and Performance
- 11. Security and Reliability
- 12. Edge Cases
- 13. WSS Token and Handshake (ASCII Diagrams)
- 14. WebSocket Internals and Scaling
- 14.1 How WebSocket Works Internally (Simple)
- 14.2 Internal Components (Backend Side)
- 14.3 Scaling Stages (1 -> 100s -> Millions)
- 14.4 How Traffic Is Handled Safely
- 14.5 Practical Capacity Metrics to Track
- 14.6 Deep Theory: Kafka, Routing Table, and Consumers
- 14.7 Ordering Policy (What Is Guaranteed and Where)
- 14.8 Observability, SLOs, and Alerts
- 14.9 Testing Strategy and Validation Gates
- 15. Final Summary
1. What Is This
WebSocket in simple words:
- It is one live connection between browser and server.
- You keep it open and receive updates instantly.
- You do not need to refresh page again and again.
Reference system in this document:
- Slack-like chat app with channels, DMs, typing, and presence.
2. Simple Slack Example
Think like this:
- You open Slack web.
- App logs in and opens one live socket.
- App subscribes to channels you care about.
- New message appears instantly.
- You send message, it shows sending, then sent.
- If internet drops, app reconnects and catches up missed messages.
3. Functional Requirements (Simple)
3.1 Connect Once
- After login, open one WebSocket connection.
- Show clear status in UI: connecting, connected, reconnecting, offline.
3.2 Subscribe to Needed Channels Only
Client should subscribe only to relevant channels.
Where channel list comes from:
- Bootstrap API at startup.
- User membership data (joined channels, DMs).
- Current screen (open channel/thread).
Simple rule:
- Subscribed channels = allowed channels intersect currently needed channels.
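The simple rule above can be sketched as a small set intersection. This is only an illustration: the function name channelsToSubscribe and the channel ids are assumptions, not part of any real API.

```typescript
// Hypothetical helper: subscribed = (allowed channels) intersect (needed channels).
function channelsToSubscribe(allowed: string[], needed: string[]): string[] {
  const allowedSet = new Set<string>(allowed);
  const alreadyAdded = new Set<string>();
  const result: string[] = [];
  // Walk `needed` so the output keeps the order the UI asked for.
  for (const channel of needed) {
    if (allowedSet.has(channel) && !alreadyAdded.has(channel)) {
      alreadyAdded.add(channel);
      result.push(channel);
    }
  }
  return result;
}

// The open screen needs two channels, but the user is allowed in only one.
const toSubscribe = channelsToSubscribe(
  ["channel:engineering", "dm:u_42"],
  ["channel:engineering", "channel:secret"]
);
// toSubscribe is ["channel:engineering"]
```

The server must still check permission on every subscribe. This client-side filter only avoids sending requests that would be rejected.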
3.3 Receive and Process Events
- message.created -> update thread and sidebar preview.
- presence.updated -> update online dot.
- Ignore duplicate events using eventId.
- Keep ordering using sequence.
3.4 Send User Actions
- User sends message.
- UI adds optimistic bubble with sending status.
- Server ack confirms sent.
- On error, show retry.
3.5 Recover Automatically
- Detect disconnect.
- Reconnect with backoff.
- Resubscribe channels.
- Replay missed events from last sequence.
3.6 Multi Tab Behavior
- Prefer one leader tab to own socket.
- Other tabs receive same updates via BroadcastChannel/SharedWorker.
- If leader closes, another tab becomes leader.
Tab A (leader) BroadcastChannel Tab B (follower)
| | |
|==== owns WebSocket ======>| |
|<=== receives live events =| |
|--- publish updates ------>|--- deliver updates ------->|
| | |
|---- closes/crashes ----X | |
| |<-- election heartbeat -----|
| |--- Tab B becomes leader -->|
| |==== Tab B opens socket ===>|
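One possible election rule for the diagram above, as a minimal sketch. It assumes each tab broadcasts a heartbeat with its own id (for example over BroadcastChannel) and that every tab keeps the last time it heard from the others. The type TabHeartbeat, the function electLeader, and the timeout value are hypothetical.

```typescript
// Hypothetical record of the last heartbeat seen from each tab.
type TabHeartbeat = { tabId: string; lastSeenMs: number };

// Among tabs heard from within timeoutMs, the smallest tabId wins.
// Every tab applies the same rule, so they agree without a coordinator.
function electLeader(tabs: TabHeartbeat[], nowMs: number, timeoutMs: number): string | null {
  let leader: string | null = null;
  for (const tab of tabs) {
    const alive = nowMs - tab.lastSeenMs <= timeoutMs;
    if (alive && (leader === null || tab.tabId < leader)) {
      leader = tab.tabId;
    }
  }
  return leader;
}

// Tab A stopped sending heartbeats, so Tab B takes over.
const knownTabs: TabHeartbeat[] = [
  { tabId: "tab-a", lastSeenMs: 1000 },
  { tabId: "tab-b", lastSeenMs: 9500 },
  { tabId: "tab-c", lastSeenMs: 9800 },
];
const newLeader = electLeader(knownTabs, 10000, 3000); // "tab-b"
```

The tab whose id equals the result opens the socket; the others stay followers. Other election rules work too; this one is just easy to reason about.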
4. Non Functional Requirements (Simple)
- Fast: chat feels instant.
- Reliable: reconnect and recover automatically.
- Scalable: subscribe only to needed data.
- Secure: authenticated socket and channel authorization.
- Usable: user always sees connection status.
5. High Level Architecture
Simple layers:
- UI layer (channel list, message thread, composer)
- State layer (messages, unread, presence, connection)
- Realtime layer (socket, subscribe, reconnect, replay)
- API layer (bootstrap, snapshot, replay endpoints)
UI -> State -> Realtime -> API
6. Simple Data Flow
6.1 Initial Load
- User logs in.
- App calls bootstrap API.
- App opens WebSocket and authenticates.
- App subscribes to required channels.
- App requests missed events from replay cursor.
Browser App API WSS Gateway Replay API
| | | |
|--- login/session ---->| | |
|--- GET /bootstrap --->| | |
|<-- wsUrl/channels ----| | |
|--- POST /token ------>| | |
|<-- wsToken -----------| | |
|==================== open wss://...token ==============================>|
|-------------------- subscribe(channel, fromSeq) ----------------------->|
|--- GET /replay?fromSeq=cursor ----------------------------------------->|
|<-- missed events -------------------------------------------------------|
6.2 Sending a Message
- User types and sends message.
- Optimistic message appears immediately.
- Command goes over socket.
- Ack arrives.
- UI becomes final state.
User/UI Browser Socket WSS Gateway Storage
| | | |
|-- click send -------------->| | |
| add optimistic bubble | | |
| |--- message.send ------->| |
| | |--- persist --------->|
| | |<-- saved id --------|
| |<-- message.ack ---------| |
|<-- optimistic -> sent ------| | |
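The sending -> sent flow above can be sketched as pure state updates. This assumes the client puts a temporary clientId on each message and the server echoes it in message.ack together with the saved id; that field and the function names are assumptions for illustration.

```typescript
type DeliveryStatus = "sending" | "sent" | "failed";

type OutgoingMessage = {
  clientId: string;  // temporary id created in the browser (assumption)
  serverId?: string; // saved id returned by the server in message.ack
  text: string;
  status: DeliveryStatus;
};

// Step 1: show the bubble immediately, before the server answers.
function addOptimistic(list: OutgoingMessage[], clientId: string, text: string): OutgoingMessage[] {
  const optimistic: OutgoingMessage = { clientId: clientId, text: text, status: "sending" };
  return list.concat([optimistic]);
}

// Step 2a: message.ack arrived, so the bubble becomes final.
function applyAck(list: OutgoingMessage[], clientId: string, serverId: string): OutgoingMessage[] {
  return list.map((m): OutgoingMessage =>
    m.clientId === clientId ? { ...m, serverId: serverId, status: "sent" } : m
  );
}

// Step 2b: send failed or timed out, so the UI shows a retry option.
function applyError(list: OutgoingMessage[], clientId: string): OutgoingMessage[] {
  return list.map((m): OutgoingMessage =>
    m.clientId === clientId ? { ...m, status: "failed" } : m
  );
}
```

Each function returns a new list, so it fits a reducer-style store without extra work.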
6.3 On Network Drop
- Socket disconnects.
- Reconnect starts with backoff.
- After reconnect, resubscribe.
- Replay missed events.
Browser Client Network WSS Gateway Replay API
| | | |
|<====== live stream ======>| | |
|-------- X disconnect -----| | |
| wait 1s, 2s, 4s | | |
|================ reconnect wss ===================>| |
|---------------- resubscribe(fromSeq) ----------->| |
|---------------- GET /replay?fromSeq=last ------->|--------------------->|
|<--------------- missed events ------------------------------------------|
| UI catches up and continues live | |
7. Data Model (Simple)
Core entities:
- ConnectionState
- ChannelSubscription
- EventEnvelope
- ReplayCursor
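type EventEnvelope<T> = {
  eventId: string;   // unique id, used for dedupe
  sequence: number;  // per-channel order, used for gap detection and replay
  channel: string;   // for example "channel:engineering"
  type: string;      // for example "message.created"
  payload: T;        // event-specific body
};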
7.1 Event Schema Versioning and Compatibility
Use schema versioning from day 1 so old clients do not break when payload evolves.
Simple rules:
- Add schemaVersion in every event envelope.
- Prefer additive changes (new optional fields) over breaking changes.
- Never remove/rename required fields without version bump.
- Support at least N-1 client version in production rollout.
Example envelope:
{
  "eventId": "evt_123",
  "schemaVersion": 2,
  "channel": "channel:engineering",
  "type": "message.created",
  "payload": {
    "id": "m_1",
    "text": "hello",
    "mentions": []
  }
}
Deprecation flow:
- Introduce new optional fields.
- Ship clients that can read both old and new shape.
- Mark old fields deprecated in API contract.
- Remove old fields only after upgrade window closes.
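A client that reads both old and new shapes could look roughly like this. The sketch assumes mentions is the optional field added in a newer schema version; the type names and readMessageCreated are hypothetical.

```typescript
type IncomingEnvelope = {
  eventId: string;
  schemaVersion: number;
  type: string;
  payload: Record<string, unknown>;
};

type ChatMessage = { id: string; text: string; mentions: string[] };

// Reads message.created payloads with or without the optional mentions field.
// Unknown extra fields are ignored, which keeps additive changes safe.
function readMessageCreated(envelope: IncomingEnvelope): ChatMessage | null {
  if (envelope.type !== "message.created") return null;
  const id = envelope.payload["id"];
  const text = envelope.payload["text"];
  // Required fields must be present in every supported version.
  if (typeof id !== "string" || typeof text !== "string") return null;
  const mentions: string[] = [];
  const rawMentions = envelope.payload["mentions"];
  if (Array.isArray(rawMentions)) {
    for (const m of rawMentions) {
      if (typeof m === "string") mentions.push(m);
    }
  }
  return { id: id, text: text, mentions: mentions };
}
```

An old payload without mentions yields an empty list, so the rest of the UI never needs to branch on schemaVersion for this field.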
8. State Management (Simple)
Global state:
- socket status
- subscribed channels
- replay cursor
Feature state:
- messages
- unread counts
- presence
Local state:
- draft input
- panel open/close
9. API Design (Frontend POV)
Useful APIs:
- GET /realtime/bootstrap
- GET /snapshot/{channel}
- GET /replay/{channel}?fromSeq=...
- POST /realtime/token
How many APIs are used?
- Total core APIs: 4
- Always called on startup: 2
  - GET /realtime/bootstrap
  - POST /realtime/token
- Called only when needed: 2
  - GET /snapshot/{channel}
  - GET /replay/{channel}?fromSeq=...
Which API is called when?
- GET /realtime/bootstrap
  - Called right after login/page load.
  - Gives websocket URL, allowed channels, and cursor hints.
- POST /realtime/token
  - Called before opening wss://.
  - Gives short-lived websocket token.
- GET /snapshot/{channel}
  - Called when channel data is missing in cache.
- GET /replay/{channel}?fromSeq=...
  - Called after reconnect or sequence gap.
  - Gives only missed events.
Example subscribe message:
{
  "subscribe": {
    "channel": "channel:engineering",
    "fromSequence": 1234
  }
}
9.1 Delivery Guarantees and Failure Semantics
For realtime chat, this practical model is recommended:
- Transport delivery: at-least-once (duplicates are possible).
- UI behavior: effectively-once using eventId dedupe.
- Ordering: guaranteed per channel partition key, not globally across all channels.
What each means:
- At-most-once
  - No duplicates, but messages can be lost on failure.
- At-least-once
  - Retries can duplicate events, but loss risk is lower.
- Effectively-once at UI
  - Client dedupes with eventId and sequence checks.
  - User sees one final message even if network retries happened.
Failure semantics to document in API contract:
- Publisher timeout: client shows retry.
- Ack lost but write succeeded: retry may duplicate, dedupe must handle.
- Replay overlap: replay can include already seen events, client dedupe handles.
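Dedupe by eventId can be sketched with a bounded set, so memory does not grow forever. The class name SeenEventIds and the capacity are assumptions; pick a capacity larger than the biggest replay overlap you expect.

```typescript
// Remembers the most recent event ids, up to a fixed capacity.
class SeenEventIds {
  private ids = new Set<string>();
  private order: string[] = []; // oldest id first
  private capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  // Returns true the first time an id is seen, false for a duplicate.
  markSeen(eventId: string): boolean {
    if (this.ids.has(eventId)) return false;
    this.ids.add(eventId);
    this.order.push(eventId);
    if (this.order.length > this.capacity) {
      // Forget the oldest id to stay within the memory bound.
      const oldest = this.order.shift();
      if (oldest !== undefined) this.ids.delete(oldest);
    }
    return true;
  }
}

const seen = new SeenEventIds(1000);
const firstDelivery = seen.markSeen("evt_123");  // true: apply the event
const retryDelivery = seen.markSeen("evt_123");  // false: ignore the duplicate
```

The same check covers all three failure cases above: a retried publish, a lost ack, and a replay that overlaps live events.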
10. Caching, Rendering, and Performance
- Keep current chat state in memory.
- Store replay cursor in IndexedDB/local storage.
- Render only changed UI parts.
- Use virtualization for large message lists.
- Batch frequent events to avoid too many rerenders.
10.1 Connection Management Defaults (Practical Numbers)
Use concrete defaults so behavior is predictable:
- Heartbeat interval: 25s
- Heartbeat timeout: 10s
- Reconnect backoff: 1s, 2s, 4s, 8s, 16s (cap 30s)
- Reconnect jitter: plus/minus 20%
- Max reconnect attempts before offline banner: 8
- Max channels per socket (soft limit): 200
- Max inbound frame size: 64 KB
- Max outbound buffer per socket: 1 MB
These are starting points. Tune from production telemetry.
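The backoff and jitter defaults above can be sketched like this. The function names are hypothetical. Jitter is applied to the capped base delay, so a delay at the cap can vary between 24s and 36s; that choice is an assumption of this sketch.

```typescript
const BASE_DELAY_MS = 1000;            // first retry after about 1s
const MAX_DELAY_MS = 30000;            // cap 30s
const JITTER_RATIO = 0.2;              // plus/minus 20%
const MAX_ATTEMPTS_BEFORE_OFFLINE = 8; // then show the offline banner

// attempt is 0-based. random returns a number in [0, 1) and is injectable for tests.
function reconnectDelayMs(attempt: number, random: () => number = Math.random): number {
  const base = Math.min(BASE_DELAY_MS * Math.pow(2, attempt), MAX_DELAY_MS);
  const jitter = (random() * 2 - 1) * JITTER_RATIO; // in [-0.2, +0.2)
  return Math.round(base * (1 + jitter));
}

function shouldShowOfflineBanner(failedAttempts: number): boolean {
  return failedAttempts >= MAX_ATTEMPTS_BEFORE_OFFLINE;
}

// With jitter fixed at zero the sequence is 1s, 2s, 4s, 8s, 16s, then 30s.
const noJitter = () => 0.5;
const delays = [0, 1, 2, 3, 4, 5].map((attempt) => reconnectDelayMs(attempt, noJitter));
// delays is [1000, 2000, 4000, 8000, 16000, 30000]
```

Jitter matters most after a gateway restart, when many clients would otherwise retry at the same moment.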
10.2 Client Backpressure and Memory Policy
Client must avoid unbounded queues.
Recommended policy:
- Keep only bounded in-memory event queue (for example 5,000 events).
- Prioritize visible channel updates over background channels.
- Coalesce high-frequency event types (presence.updated, typing).
- If queue is near full:
  - Drop non-critical transient events.
  - Keep critical durable events (message.created, moderation actions).
- If queue overflows:
  - Trigger replay resync from last stable sequence.
  - Show subtle "syncing" UI state.
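A bounded queue with this policy might look like the sketch below. It simplifies "near full" to "full", and it coalesces transient events per channel and type. The class name, the event shape, and the needsResync flag are assumptions for illustration.

```typescript
type QueuedEvent = { eventId: string; channel: string; type: string; sequence: number };

// Transient event types that are safe to coalesce or drop.
const TRANSIENT_TYPES = ["presence.updated", "typing"];

function isTransient(type: string): boolean {
  return TRANSIENT_TYPES.indexOf(type) !== -1;
}

class BoundedEventQueue {
  private items: QueuedEvent[] = [];
  private capacity: number;
  needsResync = false; // set when a durable event had to be dropped

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  push(event: QueuedEvent): void {
    if (isTransient(event.type)) {
      // Coalesce: keep only the newest transient event per channel and type.
      const existing = this.items.findIndex(
        (e) => e.type === event.type && e.channel === event.channel
      );
      if (existing !== -1) {
        this.items[existing] = event;
        return;
      }
    }
    if (this.items.length >= this.capacity) {
      if (isTransient(event.type)) return; // full: drop the incoming transient event
      const victim = this.items.findIndex((e) => isTransient(e.type));
      if (victim === -1) {
        // Only durable events are queued: overflow, caller must replay.
        this.needsResync = true;
        return;
      }
      this.items.splice(victim, 1); // make room by dropping a transient event
    }
    this.items.push(event);
  }

  // Hands all queued events to the renderer and empties the queue.
  drain(): QueuedEvent[] {
    const out = this.items;
    this.items = [];
    return out;
  }
}
```

When needsResync becomes true, the client would request replay from its last stable sequence and show the "syncing" state until it catches up.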
11. Security and Reliability
Security:
- Use wss:// only.
- Validate auth token.
- Validate channel permission on server.
Reliability:
- Heartbeat ping/pong.
- Reconnect with backoff.
- Replay missed events.
- Dedupe by event id.
11.1 Auth Lifecycle After Connect (Revocation and Role Changes)
Authentication is not only a connect-time check.
Handle these runtime cases:
- Session revoked (logout from another device)
  - Server sends auth.revoked control event.
  - Client clears local auth and redirects to login.
- Permission changed while connected
  - Example: user removed from channel.
  - Next subscribe/publish must fail authorization.
  - Server can also push channel.access.revoked event.
- Token nearing expiry
  - Client fetches fresh websocket token before reconnect.
  - Avoid silent long-lived stale tokens.
Auth Service Gateway Client
| | |
|-- revoke user ----->| |
| |-- auth.revoked ---->|
| |<-- close/ack -------|
| | |
| | client clears state |
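On the client, the two control events can be handled as a small state update. This is a sketch: the channel field on channel.access.revoked and the SessionState shape are assumptions, since the real payloads depend on your server contract.

```typescript
type ControlEvent =
  | { type: "auth.revoked" }
  | { type: "channel.access.revoked"; channel: string };

type SessionState = { authenticated: boolean; subscribed: string[] };

function applyControlEvent(state: SessionState, event: ControlEvent): SessionState {
  if (event.type === "auth.revoked") {
    // Session is gone: drop everything. The caller then redirects to login.
    return { authenticated: false, subscribed: [] };
  }
  // channel.access.revoked: keep the session, remove only that channel.
  const revokedChannel = event.channel;
  return {
    authenticated: state.authenticated,
    subscribed: state.subscribed.filter((c) => c !== revokedChannel),
  };
}

const before: SessionState = {
  authenticated: true,
  subscribed: ["channel:engineering", "channel:design"],
};
const afterChannelRevoke = applyControlEvent(before, {
  type: "channel.access.revoked",
  channel: "channel:design",
});
// afterChannelRevoke.subscribed is ["channel:engineering"]
```

Cached messages for a revoked channel should also be cleared from memory and local storage, which this sketch leaves to the caller.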
12. Edge Cases
- Tab sleeps: reconnect and replay on resume.
- Duplicate delivery: ignore by eventId.
- Out-of-order delivery: fix using sequence.
- Multiple tabs: leader tab pattern.
12.1 Protocol Fallback Strategy (When WebSocket Is Blocked)
Some corporate networks/proxies block WebSocket upgrade.
Fallback order:
- Try WebSocket (wss://) first.
- If upgrade repeatedly fails, switch to SSE.
- If SSE also fails, switch to long-polling.
Feature degradation plan:
- WebSocket: full duplex (send/receive realtime).
- SSE: realtime receive is good; sends still go through HTTP.
- Long-poll: highest latency and cost; keep only essential updates.
Client rule:
- Persist current transport mode in memory only.
- Retry WebSocket periodically (for example every 5-10 minutes).
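The fallback order and the periodic WebSocket retry can be sketched as two pure functions. The threshold of 3 failures per mode and the 5 minute retry interval are assumptions chosen for illustration, as are all names.

```typescript
type TransportMode = "websocket" | "sse" | "long-poll";

const FALLBACK_ORDER: TransportMode[] = ["websocket", "sse", "long-poll"];
const MAX_FAILURES_PER_MODE = 3;            // assumption: what "repeatedly fails" means
const WS_RETRY_INTERVAL_MS = 5 * 60 * 1000; // retry WebSocket every 5 minutes

// Kept in memory only, so a page reload starts again with WebSocket.
type TransportState = { mode: TransportMode; failures: number; lastWsAttemptMs: number };

function onConnectFailure(state: TransportState): TransportState {
  const failures = state.failures + 1;
  const index = FALLBACK_ORDER.indexOf(state.mode);
  const canFallBack = index < FALLBACK_ORDER.length - 1;
  if (failures >= MAX_FAILURES_PER_MODE && canFallBack) {
    // Step down to the next transport and reset the failure counter.
    return { mode: FALLBACK_ORDER[index + 1], failures: 0, lastWsAttemptMs: state.lastWsAttemptMs };
  }
  return { mode: state.mode, failures: failures, lastWsAttemptMs: state.lastWsAttemptMs };
}

// True when a degraded client should try the WebSocket upgrade again.
function shouldRetryWebSocket(state: TransportState, nowMs: number): boolean {
  return state.mode !== "websocket" && nowMs - state.lastWsAttemptMs >= WS_RETRY_INTERVAL_MS;
}
```

The caller is expected to update lastWsAttemptMs each time it tries the WebSocket upgrade.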
13. WSS Token and Handshake (ASCII Diagrams)
Use this simple, secure pattern:
- Client gets short-lived websocket token using normal HTTPS API.
- Client opens wss:// connection with that token.
- Server validates token and upgrades HTTP -> WebSocket.
- After connect, client sends subscribe frames.
Important:
- Keep token short-lived (for example 1-5 minutes).
- Prefer ephemeral websocket token, not long-lived access token.
- Never hardcode real production token in frontend code.
13.1 Proper End-to-End ASCII Sequence
+---------+ +-------------------+ +-------------------+ +----------------+
| Browser | | App API | | Token API | | WSS Gateway |
+---------+ +-------------------+ +-------------------+ +----------------+
| | | |
| (A) user already logged in | |
|----------------------->| | |
| | | |
| (1) GET /realtime/bootstrap | |
|----------------------->| | |
|<-----------------------| 200 { wsUrl, allowedChannels, cursorHints } |
| | | |
| (2) POST /realtime/token | |
|----------------------------------->| | |
|<-----------------------------------| 200 { wsToken, exp } |
| | | |
| (3) GET /snapshot/{channel} (optional, cache miss) |
|----------------------->| | |
|<-----------------------| 200 { initial channel state } |
| | | |
| (4) Open WSS: wss://rt.example.com/ws?token=... |
|---------------------------------------------------------------------------> |
|<--------------------------------------------------------------------------- |
| 101 Switching Protocols (Upgrade success) |
|
| (5) WS frame: subscribe(channel:engineering, fromSeq:1234) |
|---------------------------------------------------------------------------> |
| (6) WS frame: connection.ack(conn_7f2a) |
|<--------------------------------------------------------------------------- |
|
| (7) On reconnect/gap -> GET /replay/{channel}?fromSeq=1234 |
|----------------------->| |
|<-----------------------| 200 { missedEvents:[...] } |
13.2 API Count + Call Pattern (ASCII)
Core APIs = 4
[Always]
1) GET /realtime/bootstrap
2) POST /realtime/token
[Conditional]
3) GET /snapshot/{channel}
4) GET /replay/{channel}?fromSeq=...
13.3 Token Fetch + Connect Flow
Browser Client API/Gateway
-------------- -----------
1) HTTPS login already done
2) POST /realtime/token ------------------> validate session
                                            (cookie/JWT)
3) <-------------------------- 200 { wsToken: "eyJ...abc", exp: 60s }
4) new WebSocket(
     "wss://rt.example.com/ws?token=eyJ...abc"
   )
5) HTTP Upgrade request -------------------> verify token + claims
                                            check exp, userId, tenantId
6) <-------------------------- 101 Switching Protocols (Upgrade success)
7) send subscribe frame -------------------> start streaming events
13.4 Handshake Headers (What Actually Happens)
GET /ws?token=eyJ...abc HTTP/1.1
Host: rt.example.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: x3JJHMbDL1EzLkh9GBhXDw==
Sec-WebSocket-Version: 13
Origin: https://app.example.com
HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: HSmrc0sMlYUkAGmm5OPpG2HaGWk=
13.5 First Messages After Handshake
Client -> server (subscribe):
{
  "type": "subscribe",
  "channel": "channel:engineering",
  "fromSequence": 1234
}
Server -> client (connection.ack):
{
  "type": "connection.ack",
  "connectionId": "conn_7f2a",
  "userId": "u_42"
}
13.6 Failure Cases
token expired -> server rejects upgrade or closes with auth code -> client fetches new wsToken -> reconnect
invalid channel subscribe -> server sends error frame -> client does not retry that channel
network break -> reconnect with backoff -> resubscribe -> replay missed events
Case 1: token expired
Client ---- open wss(token_old) ----> Gateway
Client <--- auth close / reject ----- Gateway
Client ---- POST /realtime/token ---> API
Client <--- new wsToken ------------- API
Client ---- open wss(token_new) ----> Gateway
Case 2: invalid subscribe
Client ---- subscribe(private:admin) -> Gateway
Client <--- error.not_authorized ----- Gateway
Client ---- stop retry for this chan -> (client rule)
Case 3: network break
Client ----X socket lost X----------- Gateway
Client ---- reconnect + subscribe ---> Gateway
Client ---- GET /replay?fromSeq ----> API
Client <--- missed events ----------- API
13.7 Token Mechanism in Simple Terms
Use this easy model:
- User authentication (normal login)
  - User logs in via password/SSO.
  - Server creates authenticated session (cookie or access token).
- WebSocket token exchange
  - Client calls POST /realtime/token using logged-in session.
  - Server returns short-lived wsToken.
- WSS handshake auth
  - Client opens wss://.../ws?token=wsToken.
  - Gateway validates token and returns 101 Switching Protocols.
  - Connection is now bound to that user identity.
- Authorization after connect
  - Auth says who the user is.
  - Authorization says which channels the user can subscribe to.
  - Server checks permissions on every subscribe request.
- Reconnect + token refresh
  - If socket drops and token expired, fetch new wsToken.
  - Reconnect, resubscribe, replay missed events.
Simple security checklist:
- Use only wss://.
- Keep websocket token very short-lived (for example 1-5 min).
- Do not expose long-lived auth token in socket URL.
- Do not log full tokens in frontend/backend logs.
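The "fetch a fresh token before reconnect" rule can be sketched as a freshness check plus URL building. The WsToken shape, the 10 second safety margin, and both function names are assumptions; the token query parameter follows the examples above.

```typescript
// Hypothetical client-side view of the token returned by POST /realtime/token.
type WsToken = { value: string; expiresAtMs: number };

const REFRESH_MARGIN_MS = 10000; // assumption: refresh if less than 10s of life is left

// True when the client should call POST /realtime/token before connecting.
function needsFreshToken(token: WsToken | null, nowMs: number): boolean {
  return token === null || token.expiresAtMs - nowMs <= REFRESH_MARGIN_MS;
}

// Appends the short-lived token to the websocket URL from bootstrap.
function buildSocketUrl(wsUrl: string, token: WsToken): string {
  const separator = wsUrl.indexOf("?") === -1 ? "?" : "&";
  return wsUrl + separator + "token=" + encodeURIComponent(token.value);
}

const sampleToken: WsToken = { value: "abc.def", expiresAtMs: 60000 };
const socketUrl = buildSocketUrl("wss://rt.example.com/ws", sampleToken);
// socketUrl is "wss://rt.example.com/ws?token=abc.def"
```

Because the token sits in the URL, it can end up in access logs. That is one more reason to keep it short-lived and separate from the long-lived session token.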
14. WebSocket Internals and Scaling
This section explains how WebSocket works internally and how systems scale from a few users to millions.
14.1 How WebSocket Works Internally (Simple)
Note on wss:// versus HTTP (there is no contradiction):
- In code, client starts with wss://... URL.
- Under the hood, browser sends an HTTP Upgrade request over TLS (Upgrade: websocket) and gets 101 Switching Protocols.
- After 101, connection switches from HTTP semantics to WebSocket frames.
Step by step:
- HTTP starts the connection
  - Client sends an HTTP request with Upgrade: websocket.
- Server upgrades protocol
  - Server returns 101 Switching Protocols.
  - Now it is a persistent full-duplex connection.
- Frames are exchanged
  - Data is sent as WebSocket frames (small message units with a few bytes of header), not full HTTP responses.
  - Client and server can both send anytime.
- Keep-alive and health
  - Ping/Pong frames check if connection is alive.
  - If heartbeat fails, client reconnects.
- Connection lifecycle
  - Connect -> authenticate -> subscribe -> stream events -> reconnect if broken.
14.2 Internal Components (Backend Side)
+----------------+ +----------------+ +-------------------+
| Clients (Web) | ---> | WSS Gateway | ---> | Event Router |
+----------------+ +----------------+ +-------------------+
| |
v v
+----------------+ +-------------------+
| Presence Store | | Message/Event Bus |
+----------------+ +-------------------+
| |
+-----------+-------------+
|
v
+---------------------+
| Channel Subscribers |
+---------------------+
What each does:
- WSS Gateway: accepts socket connections and does auth.
- Event Router: routes each event to correct channel subscribers.
- Presence Store: tracks online users.
- Event Bus: distributes events across multiple backend nodes.
14.3 Scaling Stages (1 -> 100s -> Millions)
Below is the practical journey most realtime systems follow.
Step 0: understand the core scaling problem
- HTTP requests are short-lived; WebSocket connections are long-lived.
- Each connected user consumes memory/file descriptors/CPU on one gateway.
- At high scale, fan-out (one message to many subscribers) becomes the biggest cost.
Stage A: 1 to 100 users (single node)
What architecture looks like:
- One gateway instance handles all sockets.
- Subscriptions are kept in memory map:
channelId -> [connectionIds]
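A minimal sketch of that in-memory map on a single gateway. The class and method names are hypothetical; a Set is used instead of an array so that a repeated subscribe does not create duplicates.

```typescript
// Single-node registry: channelId -> set of connectionIds.
class SubscriptionRegistry {
  private byChannel = new Map<string, Set<string>>();

  subscribe(channelId: string, connectionId: string): void {
    let members = this.byChannel.get(channelId);
    if (!members) {
      members = new Set<string>();
      this.byChannel.set(channelId, members);
    }
    members.add(connectionId);
  }

  unsubscribe(channelId: string, connectionId: string): void {
    const members = this.byChannel.get(channelId);
    if (!members) return;
    members.delete(connectionId);
    // Remove empty channels so the map does not leak memory.
    if (members.size === 0) this.byChannel.delete(channelId);
  }

  // Called when a socket closes: remove it from every channel.
  disconnect(connectionId: string): void {
    const channelIds = Array.from(this.byChannel.keys());
    for (const channelId of channelIds) {
      this.unsubscribe(channelId, connectionId);
    }
  }

  // Connections that should receive an event for this channel.
  subscribers(channelId: string): string[] {
    const members = this.byChannel.get(channelId);
    return members ? Array.from(members) : [];
  }
}
```

Note that disconnect scans all channels. With many channels, a reverse map (connectionId -> channels) avoids that scan.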
Why this works:
- Small connection count.
- Low fan-out pressure.
- Simple debugging and fast iteration.
Limits you will hit:
- Single point of failure.
- Restart drops all connections.
- No horizontal scaling.
Stage B: 100s to 10k users (multi-gateway)
What changes first:
- Run multiple gateway instances behind a load balancer.
Why multiple gateways are needed:
- One server cannot hold all sockets safely forever.
- You need rolling deploys without dropping everyone.
- You need failover if one node dies.
Why sticky sessions are used:
- WebSocket is a persistent TCP connection.
- After upgrade, connection stays bound to one gateway.
- Sticky routing keeps reconnects from bouncing unpredictably between nodes.
- This reduces repeated rehydration work (presence, channel maps, cache warmup).
Sticky session options:
- LB cookie affinity.
- Source-IP hash (less accurate behind NAT/mobile).
- Consistent hash on userId/tenantId when possible.
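One way to get hash-based stickiness is rendezvous (highest-random-weight) hashing, sketched below. It is a different technique from a consistent-hash ring but gives a similar property: removing one gateway only moves the users that were on that gateway. The hash function, gateway ids, and function names are assumptions, and real load balancers usually implement this for you.

```typescript
// FNV-1a style 32-bit hash over UTF-16 code units (enough for a sketch).
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Score every gateway for this user and pick the highest score.
// Ties are broken by gateway id so the result does not depend on list order.
function pickGateway(userId: string, gateways: string[]): string | null {
  let best: string | null = null;
  let bestScore = -1;
  for (const gateway of gateways) {
    const score = fnv1a(gateway + "|" + userId);
    const wins = score > bestScore || (score === bestScore && best !== null && gateway < best);
    if (wins) {
      best = gateway;
      bestScore = score;
    }
  }
  return best;
}

// The same user maps to the same gateway on every reconnect.
const stickyGateway = pickGateway("u_42", ["gw-1", "gw-2", "gw-3"]);
```

Hashing on tenantId instead of userId keeps a whole workspace on fewer nodes, at the cost of less even load.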
What must move out of memory now:
- Presence state (online/offline, last seen).
- Subscription metadata needed across nodes.
- Use shared store like Redis for cross-node coordination.
Stage C: 10k to millions (distributed realtime fabric)
What architecture becomes:
- Many gateway pools (often per region).
- Event bus/pub-sub backbone for cross-node fan-out.
- Channel partitioning so work is spread evenly.
Core patterns:
- Partitioning
  - Route channel events by partition key (workspaceId/channelId).
  - Keeps ordering per channel while distributing load.
- Fan-out pipeline
  - Producer publishes event once.
  - Bus/stream distributes to partitions.
  - Gateway workers push only to local subscribed sockets.
- Regionalization
  - Users connect to nearest region for lower latency.
  - Cross-region replication handles global rooms if needed.
- Autoscaling
  - Scale by connection count and outbound events/sec.
  - Also watch CPU, memory, and queue lag.
Hot connections and hot channels (important)
Hot connection means:
- A single client sends/receives too much traffic (bot, abuse, power stream).
Hot channel means:
- One channel/room has massive fan-out (for example all-hands channel).
How to handle hot traffic:
- Per-connection limits
  - Rate-limit publish and subscribe actions per user.
  - Apply token bucket/leaky bucket policies.
- Per-channel protection
  - Cap burst fan-out per time window.
  - Batch/coalesce updates for high-frequency channels.
- Shard heavy channels
  - Move very large channels to dedicated partition/gateway pool.
  - Isolate them from normal traffic.
- Priority lanes
  - Prioritize user-visible critical events over low-priority events.
  - Defer/drop non-critical updates under pressure.
- Slow consumer strategy
  - Bound per-socket buffer size.
  - If buffer exceeds threshold: drop low-priority events or force replay resync.
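The token bucket policy mentioned above can be sketched in a few lines. The clock is passed in so the behavior is easy to test; the capacity and refill rate in the example are illustrative, not recommended values.

```typescript
// Token bucket: allows short bursts up to `capacity`,
// then limits to `refillPerSecond` actions per second.
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private capacity: number;
  private refillPerSecond: number;

  constructor(capacity: number, refillPerSecond: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity; // start full so a new connection can burst
    this.lastRefillMs = nowMs;
  }

  // Returns true if the action (publish/subscribe) is allowed right now.
  tryConsume(nowMs: number): boolean {
    if (nowMs > this.lastRefillMs) {
      const elapsedSec = (nowMs - this.lastRefillMs) / 1000;
      this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSecond);
      this.lastRefillMs = nowMs;
    }
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// One bucket per connection: burst of 2 messages, then 1 message per second.
const publishLimit = new TokenBucket(2, 1, 0);
const burst = [publishLimit.tryConsume(0), publishLimit.tryConsume(0), publishLimit.tryConsume(0)];
// burst is [true, true, false]
```

A gateway would keep one bucket per connection or per user, and answer a rejected action with an error frame instead of silently dropping it.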
Step-by-step migration checklist
- Start single gateway and measure baseline.
- Add second gateway + load balancer.
- Introduce shared presence/subscription store.
- Add event bus for inter-node fan-out.
- Partition channels by stable key.
- Add autoscaling rules from real traffic metrics.
- Add hot-channel isolation and rate-limits.
- Run chaos/failover tests (node kill, region outage, reconnect storm).
Simple mental model:
- 1 node: easy and cheap.
- 10 nodes: shared state + sticky routing.
- 100+ nodes: partitioning + event bus + regional strategy.
14.4 How Traffic Is Handled Safely
- Backpressure
  - Do not render every event instantly on client.
  - Batch/coalesce frequent updates.
- Rate limiting
  - Limit publish/subscribe rate per user and per tenant.
- Message fan-out control
  - Popular channels can create burst traffic.
  - Use queue + worker fan-out and partitioning.
- Slow consumer handling
  - If a client cannot keep up, buffer with limits.
  - Drop old non-critical events or force resync via replay.
- Replay and recovery
  - Keep sequence numbers.
  - On reconnect, request only missed events (fromSeq).
14.5 Practical Capacity Metrics to Track
- Concurrent socket connections
- Messages in/out per second
- P95/P99 delivery latency
- Reconnect rate
- Replay request rate
- Dropped/failed message count
Simple rule:
- If reconnect rate and replay rate spike together, your system is under stress.
14.6 Deep Theory: Kafka, Routing Table, and Consumers
This subsection answers the exact question: when sender hits Node A, what happens if receiver is on Node B?
A) End-to-end cross-node flow (theory)
- Sender is connected to Gateway Node A.
- Receiver is connected to Gateway Node B.
- Sender sends frame to Node A.
- Node A validates auth + publish permission.
- Node A persists message (DB/log) and publishes one event to Kafka (key usually = channelId).
- A fanout/routing component reads that event and checks routing table.
- Routing table returns target gateway nodes for this channel (Node B, maybe Node A too).
- Delivery tasks are sent to those target nodes.
- Node B pushes frame to receiver socket(s).
- Receiver UI renders.
So yes: event bus (Kafka) is the bridge across nodes.
B) Why Kafka is used
Kafka is used for backend event distribution and durability, not for direct socket delivery.
Kafka gives:
- Durable event log (survives node restarts).
- Decoupling (Node A can publish quickly; delivery can happen asynchronously).
- Backpressure buffer (consumer lag can recover later).
- Partitioned scale (parallel processing).
- Replay by offsets.
Kafka does NOT know live sockets by itself.
- Socket-to-node mapping still comes from routing/subscription metadata.
C) Routing table: what it is, who updates it, who reads it
Typical routing table model:
- channelId -> set(gatewayNodeIds)
- gatewayNodeId -> set(connectionIds)
- connectionId -> userId/session/subscriptions
Who updates it:
- Gateway node updates local memory on connect/subscribe/unsubscribe/disconnect.
- Gateway node updates shared store (Redis/etc.) for cross-node visibility.
Who reads it:
- Fanout router/service reads it when a new message event is consumed.
- In smaller systems, Node A may read it directly.
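The first mapping (channelId -> gateway nodes) can be sketched in memory as below. In a real multi-node setup this state would live in a shared store such as Redis. The class name, method names, and the per-gateway subscriber count are assumptions for illustration.

```typescript
// channelId -> (gatewayNodeId -> number of local subscribers on that node).
// The count keeps a node as a target until its last subscriber leaves.
class ChannelRoutingTable {
  private counts = new Map<string, Map<string, number>>();

  // Called by a gateway when one of its sockets subscribes.
  addSubscriber(channelId: string, gatewayId: string): void {
    let gateways = this.counts.get(channelId);
    if (!gateways) {
      gateways = new Map<string, number>();
      this.counts.set(channelId, gateways);
    }
    gateways.set(gatewayId, (gateways.get(gatewayId) || 0) + 1);
  }

  // Called on unsubscribe or socket disconnect.
  removeSubscriber(channelId: string, gatewayId: string): void {
    const gateways = this.counts.get(channelId);
    if (!gateways) return;
    const remaining = (gateways.get(gatewayId) || 0) - 1;
    if (remaining > 0) gateways.set(gatewayId, remaining);
    else gateways.delete(gatewayId);
    if (gateways.size === 0) this.counts.delete(channelId);
  }

  // Called when a gateway is detected as dead (missed heartbeat/TTL).
  evictGateway(gatewayId: string): void {
    Array.from(this.counts.keys()).forEach((channelId) => {
      const gateways = this.counts.get(channelId);
      if (!gateways) return;
      gateways.delete(gatewayId);
      if (gateways.size === 0) this.counts.delete(channelId);
    });
  }

  // Read by the fanout service: which nodes need this channel's event?
  targetsFor(channelId: string): string[] {
    const gateways = this.counts.get(channelId);
    return gateways ? Array.from(gateways.keys()).sort() : [];
  }
}
```

The fanout service calls targetsFor once per consumed event and sends the delivery task only to those nodes.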
D) Consumer configuration patterns
Pattern 1: all gateways consume same topic (simple, wasteful)
- Every node reads many events and drops most if no local subscribers.
- Easy start, poor efficiency at scale.
Pattern 2: targeted fanout (recommended)
- Producer publishes once to channel-event topic.
- Fanout service consumer group reads events.
- Fanout resolves target nodes via routing table.
- Fanout writes to node-targeted streams/topics/queues.
- Each gateway consumes only its own delivery stream.
Sender on Node A
|
| 1) ws frame: message.send
v
+-----------+ 2) publish(channelId key) +----------------+
| Gateway A | -----------------------------------> | Kafka Topic(s) |
+-----------+ +----------------+
|
| 3) consume
v
+---------------+
| Fanout Router |
+---------------+
|
4) read routing table: channel -> nodes
|
+------------------------------+-----------------------------+
| |
v v
+-------------------+ +-------------------+
| Node B delivery Q | | Node A delivery Q |
+-------------------+ +-------------------+
| |
| 5) consume only own queue | (optional)
v v
+-----------+ +-----------+
| Gateway B | | Gateway A |
+-----------+ +-----------+
| |
| 6) push to local sockets | local subscriber(s)
v v
Receiver(s) Local receiver(s)
This reduces unnecessary cross-node work significantly.
E) Can Node A also be subscribed/consumer?
Yes, possible and common.
- If Node A also has local subscribers for that channel, it is also a valid target.
- It should receive delivery work only when it has matching subscribers.
Important distinction:
- "Node A can be a target" is good.
- "Every node consumes every event" is expensive at high scale.
F) Where load balancer fits
LB handles:
- Connection admission (TLS + handshake routing to a gateway).
- Reconnect placement policy (sticky/cookie/hash).
LB does not perform cross-node fanout for messages.
- Cross-node fanout is handled by Kafka + fanout service + routing table.
G) Failure and correctness notes
- If Node B is down, routing table must evict stale node entries (TTL/heartbeat).
- If consumer lags, Kafka offsets allow catch-up.
- If receiver is offline, message remains durable and is delivered via replay later.
- Ordering is guaranteed per partition key (so keying strategy matters).
14.7 Ordering Policy (What Is Guaranteed and Where)
Ordering rules should be explicit:
- Per-channel ordering: guaranteed when all events for a channel share one partition key.
- Cross-channel ordering: not guaranteed.
- Cross-region ordering: eventual, may be delayed/reordered.
- Client rendering rule: sort by sequence inside a channel.
- Gap rule: if nextSequence > lastSequence + 1, trigger replay; if nextSequence <= lastSequence, treat it as a duplicate and drop it.
Tie-break policy:
- Primary: sequence
- Secondary: server timestamp
- Final fallback: lexical eventId
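The gap rule and the tie-break policy can be sketched as two small functions. The sketch splits the "not equal to lastSequence + 1" case in two: a lower or equal sequence is a duplicate, a higher one is a gap. The serverTimestampMs field name is an assumption.

```typescript
type SequenceCheck = "duplicate" | "in_order" | "gap";

// Compare an incoming sequence with the last one applied for that channel.
function checkSequence(lastSequence: number, incoming: number): SequenceCheck {
  if (incoming <= lastSequence) return "duplicate"; // already applied: drop it
  if (incoming === lastSequence + 1) return "in_order"; // apply it
  return "gap"; // something is missing: trigger replay from lastSequence
}

type OrderedEvent = { sequence: number; serverTimestampMs: number; eventId: string };

// Comparator for Array.prototype.sort inside one channel:
// sequence, then server timestamp, then lexical eventId.
function compareEvents(a: OrderedEvent, b: OrderedEvent): number {
  if (a.sequence !== b.sequence) return a.sequence - b.sequence;
  if (a.serverTimestampMs !== b.serverTimestampMs) {
    return a.serverTimestampMs - b.serverTimestampMs;
  }
  if (a.eventId < b.eventId) return -1;
  if (a.eventId > b.eventId) return 1;
  return 0;
}

const unordered: OrderedEvent[] = [
  { sequence: 3, serverTimestampMs: 100, eventId: "evt_b" },
  { sequence: 1, serverTimestampMs: 90, eventId: "evt_x" },
  { sequence: 3, serverTimestampMs: 100, eventId: "evt_a" },
];
const ordered = unordered.slice().sort(compareEvents).map((e) => e.eventId);
// ordered is ["evt_x", "evt_a", "evt_b"]
```

With a healthy per-channel sequence the tie-breaks should rarely fire. They mainly keep rendering stable if two events ever share a sequence.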
14.8 Observability, SLOs, and Alerts
Track and alert with clear thresholds.
Suggested SLOs:
- Realtime delivery latency P99 < 2s
- Socket connect success rate > 99.9%
- Reconnect recovery under 30s for 99% sessions
- Replay success rate > 99.5%
Suggested alerts:
- Reconnect spike
  - Trigger: reconnect rate > 3x baseline for 5 min.
- Replay spike
  - Trigger: replay requests > 2x baseline for 10 min.
- Consumer lag
  - Trigger: lag exceeds threshold for 5 min.
- Error burst
  - Trigger: auth/subscribe errors exceed normal percentile.
Minimal runbook:
- Check regional gateway health.
- Check broker lag and partition hot spots.
- Check auth/token service latency.
- If needed, enable protective throttling and coalescing.
14.9 Testing Strategy and Validation Gates
Add repeatable tests before production rollout.
Core test suites:
- Functional
  - Connect, subscribe, publish, ack, replay, dedupe.
- Resilience
  - Node restart, broker restart, network partition, reconnect storm.
- Scale
  - Concurrent sockets, fan-out stress, hot-channel stress.
- Security
  - Expired token, revoked token, unauthorized subscribe/publish.
Pass criteria examples:
- No message loss in replay scenarios.
- Duplicate rate remains within accepted threshold and UI dedupe hides duplicates.
- P99 latency stays under SLO at target load.
- Recovery after single-node failure meets reconnect SLO.
15. Final Summary
If you explain WebSocket design in these 4 lines, it is enough for most interviews:
- Open one live socket after login.
- Subscribe only to relevant channels.
- Update UI instantly with optimistic + ack flow.
- Reconnect and replay missed events after disconnect.
More Details:
Get all articles related to system design
Hashtag: SystemDesignWithZeeshanAli
Git: https://github.com/ZeeshanAli-0704/front-end-system-design