Damir Karimov

Posted on Jun 23 • Originally published at blog.damir-karimov.com

Designing a Real-Time Chat System at Scale

#systemdesign #backend #websockets #distributedsystems

Most developers underestimate how hard messaging systems really are.

A basic chat demo with WebSockets is easy to build. A production-grade messaging platform like WhatsApp, Telegram, Discord, or Slack is a completely different engineering problem. The hard part is not rendering messages in the UI. The hard parts are keeping millions of persistent connections alive, delivering messages reliably, preserving ordering, handling offline sync, scaling group chats, surviving partial failures, and keeping latency low around the world.

In this post, we'll design a scalable real-time chat architecture and walk through the trade-offs behind modern messaging systems.

Why chat gets hard

At first glance, chat seems simple:

send a message
store it
show it to the other user

That works for a demo. It falls apart fast in production.

Once you add real users, the system has to handle:

persistent connections
reconnects
message ordering
offline delivery
read receipts
typing indicators
multi-device sync
large group chats
global latency
partial failures

This is where chat stops being a UI feature and becomes a distributed systems problem.

Requirements

Before designing the system, we need clear requirements.

Functional requirements

Our chat system should support:

One-to-one chats
Group chats
Real-time message delivery
Read receipts
Typing indicators
Push notifications
Media attachments
Message history
Multi-device synchronization
Online and offline presence

Non-functional requirements

The system must also provide:

Low latency
High availability
Horizontal scalability
Fault tolerance
Reliable delivery
Event ordering
Efficient storage
Global distribution

At small scale, these are manageable. At millions of users, every one of them becomes a distributed systems concern.

High-level architecture

A modern chat platform is usually event-driven and distributed.

Clients
   │
   ▼
Load Balancer
   │
   ▼
WebSocket Gateway Cluster
   │
   ├── Authentication Service
   ├── Presence Service
   ├── Chat Service
   ├── Notification Service
   └── Media Service
          │
          ▼
    Kafka / Redis Streams
          │
          ▼
     Storage Layer

Each component has a separate responsibility. That separation is what makes the system scalable and easier to evolve.

Why polling fails

Many beginner chat apps start with polling:

GET /messages every 2 seconds

This works for a demo. It fails badly at scale.

Polling creates:

massive request overhead
unnecessary database reads
increased latency
battery drain on mobile devices
poor real-time responsiveness

Now imagine 2 million connected users polling every 2 seconds. That becomes 1 million requests per second even when nobody is sending anything. Real chat systems avoid this by using persistent connections.
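The arithmetic behind that figure is easy to check. A minimal sketch, where the function name is purely illustrative:

```python
def polling_requests_per_second(connected_users: int, poll_interval_seconds: float) -> float:
    """Average request rate produced by clients polling at a fixed interval."""
    return connected_users / poll_interval_seconds


rate = polling_requests_per_second(2_000_000, 2)
print(f"{rate:,.0f} requests/second")  # 1,000,000 requests/second
```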

Why WebSockets became the default

WebSockets keep a bidirectional connection open between the client and the server.

That gives you:

near real-time communication
lower overhead
reduced latency
efficient server push
better mobile performance

The client connects once:

Client ───── persistent connection ───── Server

After that, both sides can exchange events instantly. That is the foundation of modern messaging systems.

The hidden WebSocket cost

WebSockets solve one problem and introduce another.

A persistent connection is not free. Every connected user consumes:

a TCP socket
memory buffers
heartbeat state
authentication context

At 5 million concurrent users, this is no longer a normal API problem. It becomes a connection management problem.

That is why serious chat systems build specialized gateway infrastructure.

WebSocket gateway layer

The gateway layer is responsible for:

maintaining persistent connections
authenticating users
routing events
managing heartbeats
detecting disconnects

Gateways should stay as stateless as possible: they own the socket, but session and routing data should live in shared stores. Stateless gateways are much easier to scale, replace, and recover after failures.

Authentication flow

A common flow looks like this:

User logs in via HTTP.
Backend issues a JWT.
Client opens a WebSocket connection.
Token is validated during the handshake.
Connection is associated with a user session.

Example:

wss://chat.example.com?token=JWT

After authentication, the gateway knows which user owns the connection.
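The handshake check could look roughly like the sketch below. It assumes an HS256-signed token with `sub` and `exp` claims and a shared secret. None of that is specified above, and `issue_token`, `authenticate_handshake`, and `SECRET` are hypothetical names. A real deployment would normally use a maintained JWT library rather than hand-rolled verification.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional
from urllib.parse import parse_qs, urlparse

SECRET = b"demo-secret"  # assumption: HS256 with a secret shared by login and gateway


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


def _b64url_decode(part: str) -> bytes:
    # JWTs strip base64 padding, so restore it before decoding.
    return base64.urlsafe_b64decode(part + "=" * (-len(part) % 4))


def _sign(signing_input: str) -> bytes:
    return hmac.new(SECRET, signing_input.encode(), hashlib.sha256).digest()


def issue_token(user_id: str, ttl_seconds: int = 3600) -> str:
    """What the HTTP login endpoint would hand back to the client."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    claims = {"sub": user_id, "exp": int(time.time()) + ttl_seconds}
    payload = _b64url_encode(json.dumps(claims).encode())
    signature = _b64url_encode(_sign(f"{header}.{payload}"))
    return f"{header}.{payload}.{signature}"


def authenticate_handshake(url: str) -> Optional[str]:
    """Return the user id for a valid token, or None to reject the upgrade."""
    token = parse_qs(urlparse(url).query).get("token", [""])[0]
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        expected = _sign(f"{header_b64}.{payload_b64}")
        if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
            return None
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(payload_b64))
    except ValueError:  # wrong number of parts, bad base64, or bad JSON
        return None
    if header.get("alg") != "HS256" or claims.get("exp", 0) < time.time():
        return None
    return claims.get("sub")


url = "wss://chat.example.com?token=" + issue_token("user_a")
print(authenticate_handshake(url))  # user_a
```

The signature is checked before the payload is parsed, so claims are only read from tokens the backend actually issued.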

Sending a message

Now let's look at the real delivery pipeline.

Suppose User A sends:

Hello

A simplified flow is:

Client sends a message event.
Gateway validates authentication.
Chat service validates permissions.
Message is persisted in durable storage.
Event is published into Kafka.
Recipient gateway receives the event.
Message is pushed to the recipient.
Delivery ACK is generated.
Read receipt is generated later.

This looks simple on paper. In reality, every step can fail.

Why persistence comes first

A lot of beginners try this:

deliver → save later

That is dangerous.

If the server crashes after delivery but before persistence, the recipient has seen a message that the database never stored. Now the system is inconsistent.

Production systems usually do:

persist → publish → deliver

Durability comes first.
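A minimal in-memory sketch of that ordering is shown below. The `ChatPipeline` class is hypothetical: a plain list stands in for durable storage and a deque stands in for the Kafka topic.

```python
from collections import deque


class ChatPipeline:
    def __init__(self):
        self.store = []        # stands in for durable storage
        self.outbox = deque()  # stands in for the event log (e.g. a Kafka topic)

    def send(self, conversation_id, sender, text):
        message = {"conversation_id": conversation_id, "sender": sender, "text": text}
        self.store.append(message)   # 1. persist: the message now survives a crash
        self.outbox.append(message)  # 2. publish: only after the write succeeded
        return message

    def deliver_pending(self, push):
        """3. deliver: an event leaves the outbox only after push() returns."""
        delivered = 0
        while self.outbox:
            push(self.outbox[0])  # may raise if the recipient gateway is down
            self.outbox.popleft()
            delivered += 1
        return delivered


def failing_push(message):
    raise ConnectionError("recipient gateway unavailable")


pipeline = ChatPipeline()
pipeline.send(42, "user_a", "Hello")

try:
    pipeline.deliver_pending(failing_push)
except ConnectionError:
    pass  # nothing is lost: the message is stored and still queued

received = []
pipeline.deliver_pending(received.append)  # the retry succeeds
print(received[0]["text"])  # Hello
```

Because delivery happens last, a failure at that step leaves a stored message waiting for a retry instead of a delivered message with no record.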

Message IDs and ordering

Distributed systems do not guarantee ordering automatically.

Example:

Message A
Message B

Recipient receives:

Message B
Message A

Why does this happen? Because messages may travel through different gateway servers, queues, and network paths.

Common ordering strategies

Timestamp ordering

Simple, but unreliable. Clocks on different servers and devices drift, so two timestamps cannot be compared safely.

Incremental sequence IDs

More reliable. Each conversation keeps its own counter, and every new message takes the next value.

Example:

conversation_id: 42

messages:
1
2
3
4

This guarantees ordering inside a conversation. Most real systems only guarantee local ordering per chat. Global ordering across the entire platform is usually impractical at scale.
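Per-conversation sequence IDs can be sketched in a few lines. `SequenceAllocator` is a hypothetical name, and the counter here lives in process memory. A real system would need to allocate it somewhere durable, for example alongside the message write.

```python
import itertools
from collections import defaultdict


class SequenceAllocator:
    """Hands out 1, 2, 3, ... independently for each conversation."""

    def __init__(self):
        self._counters = defaultdict(lambda: itertools.count(1))

    def next_id(self, conversation_id):
        return next(self._counters[conversation_id])


allocator = SequenceAllocator()
message_a = {"seq": allocator.next_id(42), "text": "Message A"}
message_b = {"seq": allocator.next_id(42), "text": "Message B"}

# The network delivers B before A; the client restores order by sequence ID.
arrived = [message_b, message_a]
in_order = sorted(arrived, key=lambda m: m["seq"])
print([m["text"] for m in in_order])  # ['Message A', 'Message B']
```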

Database design

Messaging systems are write-heavy. A popular group chat can generate thousands of writes per second.

Typical tables:

users
conversations
conversation_members
messages
message_status
attachments

The real challenge is partitioning.
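To make the table list concrete, here is a reduced schema sketch using SQLite from the Python standard library. Every column name is an assumption, and `message_status` and `attachments` are left out for brevity. The point it illustrates is the composite key on `messages`, which keeps a conversation's history ordered by its sequence ID.

```python
import sqlite3

SCHEMA = """
CREATE TABLE users (
    user_id INTEGER PRIMARY KEY,
    display_name TEXT NOT NULL
);
CREATE TABLE conversations (
    conversation_id INTEGER PRIMARY KEY,
    title TEXT
);
CREATE TABLE conversation_members (
    conversation_id INTEGER NOT NULL REFERENCES conversations(conversation_id),
    user_id INTEGER NOT NULL REFERENCES users(user_id),
    PRIMARY KEY (conversation_id, user_id)
);
CREATE TABLE messages (
    conversation_id INTEGER NOT NULL REFERENCES conversations(conversation_id),
    seq INTEGER NOT NULL,  -- per-conversation sequence ID
    sender_id INTEGER NOT NULL REFERENCES users(user_id),
    body TEXT NOT NULL,
    PRIMARY KEY (conversation_id, seq)
);
"""

db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
db.execute("INSERT INTO users VALUES (1, 'User A')")
db.execute("INSERT INTO conversations VALUES (42, 'demo')")
db.executemany(
    "INSERT INTO messages VALUES (42, ?, 1, ?)",
    [(2, "second"), (1, "first")],  # inserted out of order on purpose
)

history = db.execute(
    "SELECT seq, body FROM messages WHERE conversation_id = ? ORDER BY seq",
    (42,),
).fetchall()
print(history)  # [(1, 'first'), (2, 'second')]
```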

Why a single database eventually fails

A single PostgreSQL instance works at the beginning. Over time, problems show up:

write bottlenecks
storage growth
replication lag
index size explosion

At scale, systems introduce sharding.

Sharding strategy

A common strategy is:

shard by conversation_id

Benefits:

messages for the same chat stay colocated
ordering becomes easier
queries stay efficient

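Bad shard keys create problems. For example, sharding by user_id scatters a group chat's messages across many shards, which makes history queries and fan-out expensive. Sharding by conversation_id has its own risk: one very busy chat can turn its shard into a hotspot.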
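Routing by conversation_id can be as small as a stable hash. In this sketch the shard count and the `shard_for` name are assumptions. Plain modulo routing also forces data movement whenever the shard count changes, which is why larger systems often use consistent hashing or a lookup table instead.

```python
import hashlib

NUM_SHARDS = 16  # assumption: a fixed number of database shards


def shard_for(conversation_id: int) -> int:
    """Map a conversation to a shard with a hash that is stable across processes."""
    digest = hashlib.sha256(str(conversation_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


# Every message in conversation 42 is routed to the same shard.
assert shard_for(42) == shard_for(42)
print(f"conversation 42 lives on shard {shard_for(42)}")
```

Python's built-in `hash()` is avoided here because string hashing is randomized per process, so different services could disagree about where a conversation lives.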

Kafka and event streaming

Modern messaging systems are heavily event-driven.

Kafka is useful because it provides:

durable event logs
replayability
partitioned scalability
consumer groups

Instead of services calling each other directly:

Chat Service → Kafka → Consumers

Consumers may include:

delivery service
notification service (push)
analytics
moderation

This decouples the system and makes failures easier to isolate.
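The log-plus-consumer-groups idea can be shown without a broker. The `EventLog` class below is a hypothetical in-memory stand-in, not Kafka's client API. It only shows that each group keeps its own offset and can replay.

```python
class EventLog:
    """Append-only log; each consumer group tracks its own read offset."""

    def __init__(self):
        self._events = []
        self._offsets = {}  # group name -> index of the next unread event

    def publish(self, event):
        self._events.append(event)

    def poll(self, group, max_events=100):
        start = self._offsets.get(group, 0)
        return self._events[start:start + max_events]

    def commit(self, group, processed):
        self._offsets[group] = self._offsets.get(group, 0) + processed

    def rewind(self, group, offset=0):
        self._offsets[group] = offset  # replay from an earlier position


log = EventLog()
log.publish({"type": "message_created", "conversation_id": 42, "seq": 1})

for group in ("delivery", "notifications", "analytics"):
    batch = log.poll(group)        # every group sees the same event
    log.commit(group, len(batch))  # and advances independently

log.rewind("analytics")            # analytics can reprocess history
print(len(log.poll("analytics")), len(log.poll("delivery")))  # 1 0
```

Committing only after processing is what gives each consumer at-least-once behavior: a crash between poll and commit means the batch is read again.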

Presence system

Presence is deceptively expensive.

Tracking online, offline, typing, and last_seen for millions of users creates a huge amount of event traffic. That is why most systems isolate presence into a dedicated service.

Presence implementation

A common architecture is:

Gateway → Redis → Presence Service

Clients send periodic heartbeats, and the gateway refreshes the user's presence entry in Redis.

Example:

PING every 30 seconds

If heartbeats stop, the user is marked offline.

Redis works well here because presence data is ephemeral. Not everything belongs in a relational database.
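Heartbeat-based presence boils down to a key with a time-to-live. The sketch below is an in-memory stand-in for what a Redis key with an expiry would do (for example `SET presence:user_a online EX 45`, where the key name is hypothetical). The 45-second TTL is an assumption, chosen to be longer than the 30-second heartbeat so one late PING does not flip a user offline.

```python
import time


class PresenceStore:
    def __init__(self, ttl_seconds=45.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock    # injectable so the example can control time
        self._expires_at = {}  # user_id -> time at which the user counts as offline

    def heartbeat(self, user_id):
        """Called by the gateway whenever a PING arrives for this user."""
        self._expires_at[user_id] = self._clock() + self._ttl

    def is_online(self, user_id):
        return self._expires_at.get(user_id, float("-inf")) > self._clock()


now = [0.0]  # fake clock for the demonstration
presence = PresenceStore(ttl_seconds=45.0, clock=lambda: now[0])

presence.heartbeat("user_a")
now[0] = 30.0
print(presence.is_online("user_a"))  # True: still inside the TTL
now[0] = 50.0
print(presence.is_online("user_a"))  # False: heartbeats stopped, entry expired
```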

Typing indicators

Typing indicators look trivial. They are not.

Problems include:

high event frequency
noisy updates
unnecessary fan-out

Most systems heavily throttle typing events.

Example:

User typing → emit once every 3 seconds

Without throttling, typing indicators can overload the infrastructure faster than messages.
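A throttle like this is a small amount of state per user and conversation. The following is a minimal sketch with a hypothetical `TypingThrottle` class. It could run on the client, the gateway, or both.

```python
import time


class TypingThrottle:
    def __init__(self, min_interval=3.0, clock=time.monotonic):
        self._min_interval = min_interval
        self._clock = clock
        self._last_emit = {}  # (user_id, conversation_id) -> time of last emitted event

    def should_emit(self, user_id, conversation_id):
        """Return True at most once per interval for each user and conversation."""
        key = (user_id, conversation_id)
        now = self._clock()
        last = self._last_emit.get(key)
        if last is not None and now - last < self._min_interval:
            return False  # drop: a typing event went out recently
        self._last_emit[key] = now
        return True


now = [0.0]  # fake clock for the demonstration
throttle = TypingThrottle(min_interval=3.0, clock=lambda: now[0])

emitted = 0
for keystroke_time in (0.0, 0.4, 0.9, 1.5, 3.1, 3.2, 6.5):
    now[0] = keystroke_time
    if throttle.should_emit("user_a", 42):
        emitted += 1
print(emitted)  # 3 events instead of 7
```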

Fan-out in group chats

Suppose a group contains 500,000 users.

One message may require 500,000 deliveries. This is called fan-out.

Large fan-out is one of the hardest messaging problems.

Fan-out strategies

Fan-out on write

The server copies or queues the message for every recipient at send time.

fast reads
expensive writes

Fan-out on read

Messages are stored once per conversation, and each recipient pulls them at read time.

cheaper writes
more expensive reads

Different platforms choose different trade-offs.
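The two strategies differ mainly in where the loop over members runs. The side-by-side sketch below uses plain dictionaries, and the group, member names, and function names are all hypothetical.

```python
from collections import defaultdict

members = {"team": ["alice", "bob", "carol"]}  # conversation -> member list

# Fan-out on write: one copy per recipient, made at send time.
inboxes = defaultdict(list)


def send_fanout_on_write(conversation_id, message):
    for user_id in members[conversation_id]:
        inboxes[user_id].append(message)  # N writes for N members


def read_inbox(user_id):
    return inboxes[user_id]  # a single cheap lookup


# Fan-out on read: one shared timeline per conversation.
timelines = defaultdict(list)


def send_fanout_on_read(conversation_id, message):
    timelines[conversation_id].append(message)  # one write, regardless of group size


def read_timelines(user_id):
    # The reader pays instead: it gathers from every conversation it belongs to.
    return [
        message
        for conversation_id, users in members.items()
        if user_id in users
        for message in timelines[conversation_id]
    ]


send_fanout_on_write("team", "hello")
send_fanout_on_read("team", "hello")
print(read_inbox("bob"), read_timelines("bob"))  # ['hello'] ['hello']
```

With a 500,000-member group, the first function performs 500,000 appends per message while the second performs one, which is why very large groups tend toward the read-side model.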

Offline synchronization

Users disconnect constantly:

mobile app closed
network loss
airplane mode
battery saver

The system must synchronize missed events efficiently.

Typical approach:

fetch all events after last_sequence_id

Example:

last_sequence_id = 10451

Server returns:

10452+

Incremental synchronization is critical. Full synchronization is too expensive.
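Incremental sync is essentially a range query over the per-conversation sequence. This sketch assumes events are appended in increasing sequence order. `ConversationLog` and the `limit` parameter are illustrative names, with `limit` added so a long-offline client catches up in pages.

```python
import bisect


class ConversationLog:
    def __init__(self):
        self._seqs = []    # sequence IDs in increasing order
        self._events = []  # events, parallel to _seqs

    def append(self, seq, event):
        # Assumption: the caller appends in increasing seq order.
        self._seqs.append(seq)
        self._events.append(event)

    def events_after(self, last_sequence_id, limit=100):
        """Return up to `limit` (seq, event) pairs with seq > last_sequence_id."""
        start = bisect.bisect_right(self._seqs, last_sequence_id)
        end = start + limit
        return list(zip(self._seqs[start:end], self._events[start:end]))


log = ConversationLog()
for seq in range(10450, 10455):
    log.append(seq, f"event {seq}")

missed = log.events_after(10451, limit=2)
print([seq for seq, _ in missed])  # [10452, 10453]
```

The client would repeat the call with the highest sequence ID it received until a page comes back shorter than `limit`.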

Push notifications

Offline users still need notifications.

Pipeline:

message event
   ↓
notification service
   ↓
APNs / FCM
   ↓
mobile device

Push delivery is best-effort. Notifications may arrive late, duplicated, or out of order. Clients need to tolerate that.

Multi-device sync

Modern users expect sync across:

phone
desktop
browser
tablet

Each device may keep its own session. The backend tracks:

user_id
device_id
connection_id
last_sync_state

Events are usually delivered independently per device.
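The tracked fields map naturally onto a small record per device. In the sketch below, treating `last_sync_state` as the highest sequence ID a device has received is an assumption, and `DeviceSession` and `deliver` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeviceSession:
    user_id: str
    device_id: str
    connection_id: Optional[str]  # None while the device is offline
    last_sync_state: int = 0      # assumption: highest sequence ID delivered so far


def deliver(sessions, user_id, seq, send):
    """Push an event to each online device; offline devices catch up on reconnect."""
    for session in sessions:
        if session.user_id != user_id or session.connection_id is None:
            continue
        send(session.connection_id, seq)
        session.last_sync_state = seq


phone = DeviceSession("user_a", "phone", connection_id="conn-1")
desktop = DeviceSession("user_a", "desktop", connection_id=None)

sent = []
deliver([phone, desktop], "user_a", 7, lambda conn, seq: sent.append((conn, seq)))
print(phone.last_sync_state, desktop.last_sync_state)  # 7 0
```

Because each device has its own cursor, the desktop can later request everything after 0 without affecting the phone.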

Reliability guarantees

Messaging systems need clear delivery semantics.

At-most-once

Fastest. Messages may disappear.

At-least-once

Reliable. Duplicates are possible.

Exactly-once

Extremely expensive and difficult in distributed systems.

Most production systems use:

at-least-once + idempotency

That is the practical choice.

Why exactly-once is mostly marketing

Exactly-once delivery across distributed infrastructure is incredibly hard.

Network failures create ambiguity:

Did the recipient receive the message?
Did the ACK get lost?
Did the retry create a duplicate?

Sometimes the sender cannot know for sure.

That is why many systems rely on retries, deduplication, and idempotent consumers instead of true exactly-once guarantees.
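At-least-once delivery plus idempotency can be illustrated with a consumer that remembers which message IDs it has already applied. `IdempotentInbox` is a hypothetical name, and the unbounded in-memory set is a simplification. A real consumer would bound it or persist it.

```python
class IdempotentInbox:
    def __init__(self):
        self._seen = set()  # message IDs already applied
        self.messages = []

    def handle(self, message_id, text):
        """Apply a message once; return False when it is a duplicate."""
        if message_id in self._seen:
            return False
        self._seen.add(message_id)
        self.messages.append(text)
        return True


inbox = IdempotentInbox()

# The sender never saw the ACK for the first attempt, so it retries.
first = inbox.handle("msg-1", "Hello")
retry = inbox.handle("msg-1", "Hello")

print(first, retry, inbox.messages)  # True False ['Hello']
```

The retry is harmless because handling the same ID twice leaves the inbox unchanged, so the sender can resend whenever it is unsure.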

Common failure scenarios

Real systems fail all the time.

Examples:

gateway crashes
Kafka partition unavailable
Redis outage
slow consumers
duplicate events
partial synchronization
network splits

Reliable systems assume failure is normal.

Scaling strategies

As traffic grows, teams usually add:

horizontal scaling for gateway nodes
partitioned Kafka topics
regional infrastructure
CDN for media
caching to reduce database pressure

The biggest bottleneck is usually not CPU.

At scale, bottlenecks are often:

network throughput
memory usage
hot partitions
connection limits
disk I/O
replication lag

Chat systems are infrastructure-heavy workloads.

What makes messaging hard

The frontend UI may look simple. Underneath is a distributed system balancing:

consistency
availability
latency
reliability
scalability

Every architectural decision introduces trade-offs. You can optimize latency, durability, throughput, or cost, but rarely all at once.

That is why messaging systems remain one of the most interesting areas in system design.

Practical takeaway

If you are building chat, do not think only in terms of sockets and message lists. Think in terms of delivery guarantees, storage strategy, ordering, recovery, and fan-out.

A chat product becomes serious very quickly. The moment you need offline sync, presence, multi-device support, and reliable delivery, you are no longer building a UI feature. You are building a distributed messaging platform.