DEV Community

shubham pandey (Connoisseur)


Designing WhatsApp / Chat System at Scale Deep Dive — Question by Question

Introduction

A chat application seems simple — send a message, receive a message. But at 2 billion users, WhatsApp hides some of the most complex distributed systems challenges in software engineering. This post walks through that complexity challenge by challenge, including the wrong turns and how to navigate out of them.


Challenge 1: The Naive Message Delivery Approach

Interview Question: Walk me through the basic flow of how you would get a message from one phone to another — and where does the first major challenge appear?

Initial Approach: User A sends a message, it goes to the WhatsApp server, and the server uses push notifications to deliver it to User B.

Why Push Notifications Alone Are Not Enough: Push notifications work perfectly when the app is closed — waking up the device and alerting the user. But when User B has WhatsApp actively open on their screen, push notifications add 100-500ms latency through FCM or SNS. In an active conversation that feels laggy and unnatural. WhatsApp delivers messages in under 100ms when both users are online.

Navigation: The key realization is that there are two distinct scenarios — app open and app closed — and they need different delivery mechanisms.

Solution: Hybrid delivery approach.

  • App is open and active — use WebSocket persistent connection for instant sub 100ms delivery
  • App is closed or in background — fall back to FCM or AWS SNS push notification

Key Insight: Push notifications and WebSockets solve different problems. WebSockets handle real time delivery for active users. Push notifications handle delivery for offline or background users. You need both.
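The routing decision itself is small. A minimal Python sketch of the hybrid approach (the `Channel` enum and `pick_channel` helper are illustrative names, not WhatsApp's actual code):

```python
from enum import Enum

class Channel(Enum):
    WEBSOCKET = "websocket"  # app open and active: sub-100ms delivery
    PUSH = "push"            # app closed or backgrounded: FCM / AWS SNS

def pick_channel(has_active_socket: bool) -> Channel:
    """Hybrid delivery: prefer the live WebSocket, fall back to push."""
    return Channel.WEBSOCKET if has_active_socket else Channel.PUSH
```

In practice the server knows `has_active_socket` from its connection registry: if the user's WebSocket is alive, deliver over it; otherwise hand the message to the push provider.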


Challenge 2: WebSocket Connections at Scale

Interview Question: WhatsApp has 2 billion users. Even 10% active simultaneously means 200 million open WebSocket connections. Assume a single server holds roughly 65,000 connections (in practice the ceiling comes from memory and file descriptors per connection, not the port count). How does Server 1 deliver a message to User B who is connected to Server 7?

Initial Approach: Each server handles its own connections but has no knowledge of where other users are connected.

Why It Fails: With thousands of WebSocket servers each holding a slice of connections, a message arriving at Server 1 has no way to reach User B on Server 7 without a routing mechanism.

Navigation: You need a centralized lookup that any server can query to find where any user is currently connected.

Solution: Redis lookup table mapping users to their WebSocket server.

  • User B connects to Server 7 — Redis stores UserB mapped to Server 7
  • User A sends message — arrives at Server 1
  • Server 1 queries Redis — finds User B on Server 7
  • Server 1 forwards message to Server 7
  • Server 7 delivers to User B via WebSocket instantly

Key Insight: Redis acts as a real time routing table for WebSocket connections. Every connection and disconnection updates this table so any server can route to any user in O(1).


Challenge 3: Message Durability and the ACK Pattern

Interview Question: User B temporarily loses internet connection for 30 seconds while a message is being delivered. What happens to the message and how does User B get it when they reconnect?

Navigation: The key insight is that delivery confirmation must be explicit — the server cannot assume a message was delivered just because it sent it. The client must acknowledge receipt.

Solution: ACK (Acknowledgement) pattern with database persistence.

  • Message delivered to User B — User B's app sends ACK back to server
  • Server receives ACK — marks message as delivered — no further action needed
  • No ACK received — server knows User B is offline
  • Server updates Redis — UserB marked as offline
  • Server stores message in database for retry
  • User B reconnects — server checks database — delivers all pending messages

This is exactly what WhatsApp's tick system means:

  • Single grey tick — message reached WhatsApp server
  • Double grey tick — message delivered to User B's device and ACK received
  • Double blue tick — User B has read the message

Key Insight: Explicit ACKs are the foundation of reliable message delivery. Never assume delivery succeeded without confirmation from the recipient.
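A minimal sketch of the server-side ACK logic, with an in-memory queue standing in for the database (function names are illustrative):

```python
from collections import defaultdict

# Stand-in for the durable message store; a real system uses a database.
pending: defaultdict[str, list[str]] = defaultdict(list)

def on_send(user_id: str, message: str, ack_received: bool) -> None:
    """Persist the message for retry unless the recipient explicitly ACKed."""
    if not ack_received:
        pending[user_id].append(message)  # no ACK: assume offline, store it

def on_reconnect(user_id: str) -> list[str]:
    """Deliver every message queued while the user was unreachable."""
    messages = pending[user_id]
    pending[user_id] = []
    return messages
```

The core rule is visible in `on_send`: the happy path (ACK received) requires no storage at all, while every unacknowledged message is persisted until the recipient comes back.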


Challenge 4: The Duplicate Message Problem

Interview Question: Server delivers message to User B. User B sends ACK but the ACK gets lost in the network. Server never receives ACK, assumes delivery failed, and retries. User B now sees the same message twice. How do you prevent duplicate messages?

Wrong Approach: Add ACK from server to client so both sides confirm. This creates an infinite loop — ACK for the ACK for the ACK.

Navigation: More ACKs do not solve duplicates. The solution is recognizing duplicates when they arrive rather than preventing retries entirely. Every message needs a globally unique identity so the receiver can detect and discard messages it has already seen.

Wrong Approach 2: Hash the message content for unique identification.

Why It Fails: If User A sends "Hello" twice, both messages produce identical hashes. The second legitimate message gets discarded as a duplicate.

Wrong Approach 3: Use timestamp for unique identification.

Why It Fails: Two messages sent in the same millisecond get identical timestamps. Clock skew between devices also causes ordering and uniqueness issues.

Solution: UUID (Universally Unique Identifier) generated server side for every message.

  • Server generates UUID for each message
  • UUID sent to User B along with message content
  • User B's app stores all received UUIDs locally with 30 day TTL
  • Duplicate arrives — app checks UUID — already seen — discard silently
  • TTL matches message retry window — after 30 days message is either delivered or dropped

On UUID collision probability — UUID is 128 bits with 340 undecillion possible values. You would need to generate 1 billion UUIDs per second for 100 years before expecting a single collision. Treat collision as practically impossible.

Key Insight: Idempotency via UUID is the standard solution to duplicate delivery in distributed messaging. The receiver, not the sender, is responsible for deduplication.


Challenge 5: Group Messaging Fan-out

Interview Question: User A sends a message in a group of 1024 members. Some are online on different servers, some are offline. How do you deliver one message to 1024 people simultaneously?

Wrong Approach: Fetch the WebSocket server location of each of the 1024 members from Redis and forward to each server synchronously while User A waits.

Why It Fails: WhatsApp handles 1 billion group messages per day. With average group size of 200 members that is 200 billion Redis lookups and 200 billion server to server forwarding calls per day — all happening synchronously while users wait for send confirmation.

Navigation: User A should never wait for 1024 individual deliveries to complete. This is the same async pattern used for Twitter hashtag processing — publish an event and let a background service handle the fan-out.

Solution: Kafka async fan-out via dedicated Group Service.

  • User A sends message — server publishes a single event to Kafka instantly
  • User A gets immediate send confirmation — never waits for 1024 deliveries
  • Group Service consumes from Kafka
  • Group Service fetches all 1024 member locations from Redis in one batch lookup
  • Delivers to online members via their WebSocket servers
  • Stores messages in database for offline members

Key Insight: Kafka decouples message sending from message delivery. The sender gets instant confirmation while fan-out happens asynchronously in the background at whatever pace the system can handle.


Challenge 6: Group Read Receipts

Interview Question: WhatsApp shows blue double tick in groups only when all 1024 members have read the message. How do you track per member read status efficiently across millions of messages?

Wrong Approach: Store a simple read counter per message.

Why It Fails: A counter tells you how many people have read but not who specifically has read. WhatsApp lets you tap a message to see exactly which members have read and which have not. A counter cannot provide this granularity.

Solution: Redis Set per message using UUID as key.

  • Key is the message UUID
  • Value is a Redis Set containing user IDs of members who have read

Operations:

  • Mark as read — SADD messageUUID UserB — O(1)
  • Check if specific member read — SISMEMBER messageUUID UserB — O(1) returns true or false
  • Check if everyone read for blue tick — SCARD messageUUID equals 1024 — O(1)

Memory Management — Two layer cleanup strategy:

  • Eager deletion — all 1024 members have read — delete Redis Set immediately, no point keeping it
  • TTL safety net — 30 day TTL catches messages that some members never read, orphaned entries automatically cleaned up

Key Insight: Redis Set gives O(1) membership checking and cardinality counting — perfect for tracking who has read a message. Eager deletion plus TTL prevents memory from growing unbounded.
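The three operations plus eager deletion fit in a few lines. A Python set stands in for the per-message Redis Set here; the equivalent Redis commands are noted in comments:

```python
# Stand-in for the per-message Redis Set: message UUID -> readers.
read_by: dict[str, set[str]] = {}

def mark_read(message_uuid: str, user_id: str) -> None:
    read_by.setdefault(message_uuid, set()).add(user_id)  # SADD, O(1)

def has_read(message_uuid: str, user_id: str) -> bool:
    return user_id in read_by.get(message_uuid, set())    # SISMEMBER, O(1)

def all_read(message_uuid: str, group_size: int) -> bool:
    """Blue-tick check: compare cardinality, eagerly delete when complete."""
    if len(read_by.get(message_uuid, set())) == group_size:  # SCARD, O(1)
        read_by.pop(message_uuid, None)  # eager deletion: everyone has read
        return True
    return False
```

The 30-day TTL safety net from the article would be an EXPIRE on each Set key in real Redis, catching messages that never reach full readership.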


Challenge 7: Online Status and the Stale Presence Problem

Interview Question: 2 billion users opening and closing WhatsApp constantly. Every open means online, every close means last seen with timestamp. But what happens when a phone crashes or loses internet without explicitly sending an offline signal? The server still shows the user as online forever.

This is called the stale presence problem.

Wrong Approach: Store online or offline status in Redis and update on app open and close.

Why It Fails: App crashes and network drops never trigger an explicit offline signal. Status gets stuck as online indefinitely.

Navigation: If you cannot rely on an explicit offline signal, you need a mechanism where online status automatically expires unless actively refreshed. TTL on the Redis entry solves this — but only if something keeps refreshing it while the user is genuinely online.

Solution: Heartbeat mechanism with short TTL.

  • User opens WhatsApp — Redis sets UserA to online with 30 second TTL
  • App sends heartbeat ping every 20 seconds — refreshes TTL
  • User closes app normally — app sends explicit offline signal — Redis updates to last seen with timestamp
  • App crashes or loses internet — heartbeats stop — TTL expires after 30 seconds — status automatically becomes offline

The 20 and 30 second rule:

  • Heartbeat interval 20 seconds — always refreshes before TTL expires
  • TTL 30 seconds — buffer for network hiccups but short enough to detect crashes quickly

Key Insight: Heartbeat plus TTL is the standard pattern for presence detection in distributed systems. Never rely on explicit disconnect signals alone — networks and devices are too unreliable.
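The heartbeat-plus-TTL pattern reduces to a timestamp comparison. This sketch passes `now` explicitly to keep it testable; in real Redis the expiry would be a SET with EX 30 and the check a simple key lookup (all names are illustrative):

```python
HEARTBEAT_INTERVAL = 20  # seconds between client pings
PRESENCE_TTL = 30        # presence entry lifetime; Redis EX in a real system

# Stand-in for Redis keys with TTL: user_id -> time of last heartbeat.
last_heartbeat: dict[str, float] = {}

def heartbeat(user_id: str, now: float) -> None:
    """Refresh the presence entry, like SET user online EX 30 in Redis."""
    last_heartbeat[user_id] = now

def is_online(user_id: str, now: float) -> bool:
    """Online only if a heartbeat arrived within the TTL window."""
    seen = last_heartbeat.get(user_id)
    return seen is not None and now - seen < PRESENCE_TTL
```

A crashed phone simply stops calling `heartbeat`, and `is_online` flips to False within 30 seconds with no explicit offline signal required.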


Full Architecture Summary

Real time messaging — WebSocket persistent connections for active users
Offline delivery — FCM and AWS SNS push notifications
Message routing — Redis lookup table mapping users to WebSocket servers
Message durability — Database persistence with explicit ACK pattern
Duplicate prevention — Server generated UUID with 30 day TTL
Group fan-out — Kafka async processing via dedicated Group Service
Group read receipts — Redis Set per message with eager deletion and TTL
Online presence — Redis heartbeat with 30 second TTL auto expiry


Final Thoughts

WhatsApp at 2 billion users is a masterclass in combining simple building blocks — WebSockets, Redis, Kafka, and a database — into a system that feels effortless to the end user. Every feature that seems trivial on the surface hides a distributed systems challenge underneath.

The recurring theme throughout this design is that reliability requires explicit confirmation at every step. ACKs for delivery, UUIDs for deduplication, heartbeats for presence — nothing is assumed, everything is verified.

Happy building. 🚀
