Most developers underestimate how hard messaging systems really are.
A basic chat demo with WebSockets is easy to build. A production-grade messaging platform like WhatsApp, Telegram, Discord, or Slack is a completely different engineering problem. The hard part is not rendering messages in the UI. The hard parts are keeping millions of persistent connections alive, delivering messages reliably, preserving ordering, handling offline sync, scaling group chats, surviving partial failures, and keeping latency low around the world.
In this post, we'll design a scalable real-time chat architecture and walk through the trade-offs behind modern messaging systems.
Why chat gets hard
At first glance, chat seems simple:
- send a message
- store it
- show it to the other user
That works for a demo. It falls apart fast in production.
Once you add real users, the system has to handle:
- persistent connections
- reconnects
- message ordering
- offline delivery
- read receipts
- typing indicators
- multi-device sync
- large group chats
- global latency
- partial failures
This is where chat stops being a UI feature and becomes a distributed systems problem.
Requirements
Before designing the system, we need clear requirements.
Functional requirements
Our chat system should support:
- One-to-one chats
- Group chats
- Real-time message delivery
- Read receipts
- Typing indicators
- Push notifications
- Media attachments
- Message history
- Multi-device synchronization
- Online and offline presence
Non-functional requirements
The system must also provide:
- Low latency
- High availability
- Horizontal scalability
- Fault tolerance
- Reliable delivery
- Event ordering
- Efficient storage
- Global distribution
At small scale, these are manageable. At millions of users, every one of them becomes a distributed systems concern.
High-level architecture
A modern chat platform is usually event-driven and distributed.
Clients
│
▼
Load Balancer
│
▼
WebSocket Gateway Cluster
│
├── Authentication Service
├── Presence Service
├── Chat Service
├── Notification Service
└── Media Service
│
▼
Kafka / Redis Streams
│
▼
Storage Layer
Each component has a separate responsibility. That separation is what makes the system scalable and easier to evolve.
Why polling fails
Many beginner chat apps start with polling:
GET /messages every 2 seconds
This works for a demo. It fails badly at scale.
Polling creates:
- massive request overhead
- unnecessary database reads
- increased latency
- battery drain on mobile devices
- poor real-time responsiveness
Now imagine 2 million connected users polling every 2 seconds. That becomes 1 million requests per second even when nobody is sending anything. Real chat systems avoid this by using persistent connections.
Why WebSockets became the default
WebSockets keep a bidirectional connection open between the client and the server.
That gives you:
- near real-time communication
- lower overhead
- reduced latency
- efficient server push
- better mobile performance
The client connects once:
Client ───── persistent connection ───── Server
After that, both sides can exchange events instantly. That is the foundation of modern messaging systems.
The hidden WebSocket cost
WebSockets solve one problem and introduce another.
A persistent connection is not free. Every connected user consumes:
- a TCP socket
- memory buffers
- heartbeat state
- authentication context
At 5 million concurrent users, this is no longer a normal API problem. It becomes a connection management problem.
That is why serious chat systems build specialized gateway infrastructure.
WebSocket gateway layer
The gateway layer is responsible for:
- maintaining persistent connections
- authenticating users
- routing events
- managing heartbeats
- detecting disconnects
Gateways should stay as stateless as possible. Stateless gateways are much easier to scale, replace, and recover after failures.
Authentication flow
A common flow looks like this:
- User logs in via HTTP.
- Backend issues a JWT token.
- Client opens a WebSocket connection.
- Token is validated during the handshake.
- Connection is associated with a user session.
Example:
wss://chat.example.com?token=JWT
After authentication, the gateway knows which user owns the connection.
Sending a message
Now let's look at the real delivery pipeline.
Suppose User A sends:
Hello
A simplified flow is:
- Client sends a message event.
- Gateway validates authentication.
- Chat service validates permissions.
- Message is persisted in durable storage.
- Event is published into Kafka.
- Recipient gateway receives the event.
- Message is pushed to the recipient.
- Delivery ACK is generated.
- Read receipt is generated later.
This looks simple on paper. In reality, every step can fail.
Why persistence comes first
A lot of beginners try this:
deliver → save later
That is dangerous.
If the server crashes before persistence, the recipient saw the message, but the database lost it. Now the system is inconsistent.
Production systems usually do:
persist → publish → deliver
Durability comes first.
Message IDs and ordering
Distributed systems do not guarantee ordering automatically.
Example:
- Message A
- Message B
Recipient receives:
- Message B
- Message A
Why does this happen? Because messages may travel through different gateway servers, queues, and network paths.
Common ordering strategies
Timestamp ordering
Simple, but unreliable. Clock drift breaks consistency.
Incremental sequence IDs
More reliable.
Example:
conversation_id: 42
messages:
1
2
3
4
This guarantees ordering inside a conversation. Most real systems only guarantee local ordering per chat. Global ordering across the entire platform is usually impossible at scale.
Database design
Messaging systems are write-heavy. A popular group chat can generate thousands of writes per second.
Typical tables:
usersconversationsconversation_membersmessagesmessage_statusattachments
The real challenge is partitioning.
Why a single database eventually fails
A single PostgreSQL instance works at the beginning. Over time, problems show up:
- write bottlenecks
- storage growth
- replication lag
- index size explosion
At scale, systems introduce sharding.
Sharding strategy
A common strategy is:
shard by conversation_id
Benefits:
- messages for the same chat stay colocated
- ordering becomes easier
- queries stay efficient
Bad shard keys create hotspots. For example, sharding by user_id can spread large group chats across multiple shards and make fan-out expensive.
Kafka and event streaming
Modern messaging systems are heavily event-driven.
Kafka is useful because it provides:
- durable event logs
- replayability
- partitioned scalability
- consumer groups
Instead of services calling each other directly:
Chat Service → Kafka → Consumers
Consumers may include:
- delivery service
- notification service
- analytics
- moderation
- push notifications
This decouples the system and makes failures easier to isolate.
Presence system
Presence is deceptively expensive.
Tracking online, offline, typing, and last_seen for millions of users creates a huge amount of event traffic. That is why most systems isolate presence into a dedicated service.
Presence implementation
A common architecture is:
Gateway → Redis → Presence Service
Gateways periodically send heartbeats.
Example:
PING every 30 seconds
If heartbeats stop, the user is marked offline.
Redis works well here because presence data is ephemeral. Not everything belongs in a relational database.
Typing indicators
Typing indicators look trivial. They are not.
Problems include:
- high event frequency
- noisy updates
- unnecessary fan-out
Most systems heavily throttle typing events.
Example:
User typing → emit once every 3 seconds
Without throttling, typing indicators can overload the infrastructure faster than messages.
Fan-out in group chats
Suppose a group contains 500,000 users.
One message may require 500,000 deliveries. This is called fan-out.
Large fan-out is one of the hardest messaging problems.
Fan-out strategies
Fan-out on write
The server precomputes deliveries immediately.
- fast reads
- expensive writes
Fan-out on read
Messages are stored once.
- cheaper writes
- more expensive reads
Different platforms choose different trade-offs.
Offline synchronization
Users disconnect constantly:
- mobile app closed
- network loss
- airplane mode
- battery saver
The system must synchronize missed events efficiently.
Typical approach:
fetch all events after last_sequence_id
Example:
last_seen_message = 10451
Server returns:
10452+
Incremental synchronization is critical. Full synchronization is too expensive.
Push notifications
Offline users still need notifications.
Pipeline:
message event
↓
notification service
↓
APNs / FCM
↓
mobile device
Push systems are eventually consistent. Notifications may arrive late, duplicated, or out of order. Clients need to tolerate that.
Multi-device sync
Modern users expect sync across:
- phone
- desktop
- browser
- tablet
Each device may keep its own session. The backend tracks:
user_iddevice_idconnection_idlast_sync_state
Events are usually delivered independently per device.
Reliability guarantees
Messaging systems need clear delivery semantics.
At-most-once
Fastest. Messages may disappear.
At-least-once
Reliable. Duplicates are possible.
Exactly-once
Extremely expensive and difficult in distributed systems.
Most production systems use:
at-least-once + idempotency
That is the practical choice.
Why exactly-once is mostly marketing
Exactly-once delivery across distributed infrastructure is incredibly hard.
Network failures create ambiguity:
- Did the recipient receive the message?
- Did the ACK get lost?
- Did the retry create a duplicate?
Sometimes the sender cannot know for sure.
That is why many systems rely on retries, deduplication, and idempotent consumers instead of true exactly-once guarantees.
Common failure scenarios
Real systems fail all the time.
Examples:
- gateway crashes
- Kafka partition unavailable
- Redis outage
- slow consumers
- duplicate events
- partial synchronization
- network splits
Reliable systems assume failure is normal.
Scaling strategies
As traffic grows, teams usually add:
- horizontal scaling for gateway nodes
- partitioned Kafka topics
- regional infrastructure
- CDN for media
- caching to reduce database pressure
The biggest bottleneck is usually not CPU.
At scale, bottlenecks are often:
- network throughput
- memory usage
- hot partitions
- connection limits
- disk I/O
- replication lag
Chat systems are infrastructure-heavy workloads.
What makes messaging hard
The frontend UI may look simple. Underneath is a distributed system balancing:
- consistency
- availability
- latency
- reliability
- scalability
Every architectural decision introduces trade-offs. You can optimize latency, durability, throughput, or cost, but rarely all at once.
That is why messaging systems remain one of the most interesting areas in system design.
Practical takeaway
If you are building chat, do not think only in terms of sockets and message lists. Think in terms of delivery guarantees, storage strategy, ordering, recovery, and fan-out.
A chat product becomes serious very quickly. The moment you need offline sync, presence, multi-device support, and reliable delivery, you are no longer building a UI feature. You are building a distributed messaging platform.
Top comments (0)