Case Study: Designing a Chat System (Meta / WhatsApp–Style)
This section answers a common follow-up interview request:
“Okay, now apply this thinking to a real problem.”
We will do exactly that — without jumping to tools or architectures first.
The goal is not to “design WhatsApp,” but to demonstrate how interviewers expect you to think.
The Interview Question (Realistic & Common)
“Design a chat system like WhatsApp.”
Variants of this question are commonly asked in interviews at companies such as:
- Meta
- Uber
- Amazon
- Stripe
Most candidates fail this question not because it’s hard, but because they start in the wrong place.
What Most Candidates Do (Wrong Start)
Typical opening:
- “We’ll use WebSockets”
- “We’ll use Kafka”
- “We’ll shard by user ID”
This skips reasoning.
A strong candidate pauses and applies the checklist.
Applying the First-Principles Checklist Live
We will apply the same five questions, in order, and show what problems naturally surface.
1. State
“Where does state live? When is it durable?”
Ask This Out Loud in the Interview
What information must the chat system remember for it to function correctly?
Identify Required State (No Design Yet)
- Users
- Conversations
- Messages
- Message delivery status
Now ask:
Which of this state must never be lost?
Answer:
- Messages (core product)
- Conversation membership
First-Principles Conclusion
- Messages must be persisted
- In-memory-only solutions are insufficient
What the Interviewer Sees
You identified correctness-critical state before touching architecture.
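To make "messages must be persisted" concrete, here is a minimal sketch, assuming a single SQLite table as a stand-in for whatever durable store a real system would use. The names `open_store`, `accept_message`, and the table layout are hypothetical, chosen only for illustration. The one idea it shows: the sender gets an ID back only after the write has committed.

```python
import sqlite3
import uuid


def open_store(path=":memory:"):
    """Open the message store. ':memory:' is for demos only and is NOT durable;
    pass a file path to get data that survives a process restart."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS messages (
               message_id      TEXT PRIMARY KEY,
               conversation_id TEXT NOT NULL,
               sender_id       TEXT NOT NULL,
               body            TEXT NOT NULL
           )"""
    )
    return conn


def accept_message(conn, conversation_id, sender_id, body):
    """Persist the message first, then return its ID as the acknowledgement."""
    message_id = str(uuid.uuid4())
    with conn:  # commits on success, rolls back if the insert raises
        conn.execute(
            "INSERT INTO messages VALUES (?, ?, ?, ?)",
            (message_id, conversation_id, sender_id, body),
        )
    return message_id
```

In an interview, the point is the ordering (commit, then acknowledge), not the choice of database.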
2. Time
“How long does each step take?”
Now we introduce time.
Break the Chat Flow
- User sends message
- Message is stored
- Message is delivered to recipient(s)
Ask:
Which of these must be fast?
- Sending a message → must feel instant
- Delivery → may be delayed (offline users)
Critical Question
Does the sender wait for delivery confirmation?
If yes:
- Latency depends on recipient availability
If no:
- Sending and delivery are time-decoupled
First-Principles Conclusion
- Message acceptance must be fast
- Delivery can happen later
This naturally introduces asynchrony, without naming any tools.
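The split between accepting and delivering can be sketched in a few lines. This is an illustrative toy, not a real server: `ChatServer`, its in-memory dictionaries, and the `online` map are all assumptions standing in for durable storage, a work queue, and live connections. What matters is that `send` never waits for the recipient.

```python
from collections import deque


class ChatServer:
    def __init__(self):
        self.stored = {}        # message_id -> (recipient, body); stand-in for durable storage
        self.pending = deque()  # message IDs still waiting to be delivered
        self.online = {}        # recipient -> list acting as that user's device inbox
        self._next_id = 0

    def send(self, recipient, body):
        """Accept path: store, enqueue, acknowledge. Never touches the recipient."""
        self._next_id += 1
        message_id = self._next_id
        self.stored[message_id] = (recipient, body)
        self.pending.append(message_id)
        return message_id  # the sender's acknowledgement

    def deliver_pending(self):
        """Delivery path: runs later and independently of send()."""
        still_waiting = deque()
        while self.pending:
            message_id = self.pending.popleft()
            recipient, body = self.stored[message_id]
            if recipient in self.online:
                self.online[recipient].append(body)
            else:
                still_waiting.append(message_id)  # recipient offline: try again later
        self.pending = still_waiting
```

If the recipient is offline, `send` still returns immediately and the message simply stays in `pending` until a later `deliver_pending` call finds them online.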
3. Failure
“What breaks independently?”
Now assume failures — explicitly.
Ask
What happens if the system crashes after accepting a message but before delivery?
Possible states:
- Message stored
- Recipient not notified yet
Now ask:
Can delivery be retried safely?
This surfaces a key invariant:
An accepted message must reach the recipient exactly once from the user's point of view: it must never be lost, and a retry must never make it appear twice.
Failure Scenarios Discovered
- Duplicate delivery
- Message loss
- Inconsistent delivery status
First-Principles Conclusion
- Message delivery must be idempotent
- Storage and delivery failures must be decoupled
The interviewer now sees you understand distributed failure, not just happy paths.
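One common way to make retries safe is to have the receiving side deduplicate by message ID. The sketch below assumes that approach; `Recipient`, `deliver_with_retry`, and the `lose_first_ack` flag (which simulates an acknowledgement lost in transit) are hypothetical names for illustration only.

```python
class Recipient:
    def __init__(self):
        self.seen_ids = set()  # IDs already applied on this device
        self.displayed = []    # what the user actually sees

    def receive(self, message_id, body):
        """Apply each message_id at most once, but acknowledge every attempt."""
        if message_id not in self.seen_ids:
            self.seen_ids.add(message_id)
            self.displayed.append(body)
        return True  # ack, even when the delivery was a duplicate


def deliver_with_retry(recipient, message_id, body, lose_first_ack=False, max_attempts=3):
    """Retry until an ack is observed; return the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        ack = recipient.receive(message_id, body)
        if lose_first_ack and attempt == 1:
            ack = False  # simulate the ack being lost on the way back
        if ack:
            return attempt
    raise RuntimeError("delivery was never acknowledged")
```

With a lost first acknowledgement the sender delivers twice, yet the recipient displays the message once. That is what "idempotent delivery" means in practice.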
4. Order
“What defines correct sequence?”
Now introduce multiple messages.
Ask
Does message order matter in a conversation?
Answer:
- Yes — chat messages must appear in order
Now ask the dangerous question:
Does the order in which messages arrive match the order in which they were sent?
In distributed systems:
- No guarantee
Messages can:
- Be processed by different servers
- Experience different delays
First-Principles Conclusion
- Ordering is part of correctness
- It must be explicitly modeled (e.g., sequence per conversation)
This is a senior-level insight, derived from questioning alone.
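A per-conversation sequence can be sketched as follows. This is one possible model, under the assumption that a single authority assigns sequence numbers for each conversation; `Sequencer` and `OrderedInbox` are hypothetical names. The inbox holds back any message that arrives early until the gap before it is filled.

```python
from collections import defaultdict


class Sequencer:
    """Hands out 1, 2, 3, ... independently for each conversation."""

    def __init__(self):
        self._last = defaultdict(int)

    def assign(self, conversation_id):
        self._last[conversation_id] += 1
        return self._last[conversation_id]


class OrderedInbox:
    """Client-side view of one conversation; displays strictly in sequence order."""

    def __init__(self):
        self.expected = 1    # next sequence number allowed on screen
        self.buffer = {}     # early arrivals, keyed by sequence number
        self.displayed = []

    def receive(self, seq, body):
        if seq < self.expected:
            return  # already displayed: ignore the duplicate
        self.buffer[seq] = body
        # Flush every message that is now contiguous with what was displayed.
        while self.expected in self.buffer:
            self.displayed.append(self.buffer.pop(self.expected))
            self.expected += 1
```

If message 2 arrives before message 1, nothing is shown until 1 arrives, and then both appear in the correct order.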
5. Scale
“What grows fastest under load?”
Now — and only now — do we talk about scale.
Ask
As usage grows, what increases fastest?
Likely answers:
- Number of messages
- Concurrent active connections
- Offline message backlog
Now ask:
What happens during spikes (e.g., group chats, viral events)?
You discover:
- Hot conversations
- Uneven load
- Memory pressure from live connections
First-Principles Conclusion
- The system must scale on messages, not users
- Load is not uniform
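A small simulation makes "load is not uniform" tangible. The numbers here are invented for illustration (1,000 quiet conversations and one viral group), and hashing the conversation ID to a shard is just one plausible partitioning scheme, not a recommendation.

```python
import zlib
from collections import Counter


def shard_for(conversation_id, num_shards):
    """Map a conversation to a shard with a stable hash (CRC32 here, for determinism)."""
    return zlib.crc32(conversation_id.encode("utf-8")) % num_shards


def messages_per_shard(traffic, num_shards):
    """traffic maps conversation_id -> message count; returns load per shard."""
    load = Counter({shard: 0 for shard in range(num_shards)})
    for conversation_id, message_count in traffic.items():
        load[shard_for(conversation_id, num_shards)] += message_count
    return load


# Hypothetical workload: many quiet conversations plus one hot group chat.
traffic = {f"conv-{i}": 10 for i in range(1000)}
traffic["viral-group"] = 50_000

load = messages_per_shard(traffic, num_shards=8)
hot_shard = shard_for("viral-group", 8)
```

Whichever shard owns `viral-group` carries at least 50,000 of the 60,000 messages, no matter how evenly the other conversations spread. Adding shards does not help a single hot key, which is why skew deserves its own discussion.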
What We Have Discovered (Before Any Design)
Without choosing any tools, we now know:
- Messages must be durable
- Sending and delivery must be decoupled
- Failures must not cause duplicates or loss
- Ordering is a correctness requirement
- Message volume, not user count, dominates scale
This is exactly what interviewers want to hear before you propose architecture.
What Comes Next (And Why It’s Easy Now)
Only after this reasoning does it make sense to talk about:
- Persistent storage
- Async delivery
- Streaming connections
- Partitioning strategies
At this point, architecture choices are obvious, not arbitrary.
Why This Approach Scores High in Interviews
Interviewers are evaluating:
- How you reason under ambiguity
- Whether you surface hidden constraints
- Whether you understand failure modes
They are not testing whether you know WhatsApp’s internals.
This method shows:
- Structured thinking
- Calm problem decomposition
- Senior-level judgment
Common Candidate Mistakes (Seen in This Question)
- Jumping to WebSockets without discussing durability
- Ignoring offline users
- Assuming message order “just works”
- Treating retries as harmless
- Talking about scale before correctness
Every one of these mistakes is prevented by the checklist.
Final Reinforcement: The Checklist (Again)
Use this verbatim in interviews:
- Where does state live? When is it durable?
- Which steps are fast vs slow?
- What can fail independently?
- What defines correct order?
- What grows fastest under load?
Final Mental Model
Strong candidates design systems.
Exceptional candidates design reasoning.