DEV Community

Cover image for Designing a Real-Time Chat System at Scale
Damir Karimov
Damir Karimov

Posted on • Originally published at blog.damir-karimov.com

Designing a Real-Time Chat System at Scale

Most developers underestimate how hard messaging systems really are.

A basic chat demo with WebSockets is easy to build. A production-grade messaging platform like WhatsApp, Telegram, Discord, or Slack is a completely different engineering problem. The hard part is not rendering messages in the UI. The hard parts are keeping millions of persistent connections alive, delivering messages reliably, preserving ordering, handling offline sync, scaling group chats, surviving partial failures, and keeping latency low around the world.

In this post, we'll design a scalable real-time chat architecture and walk through the trade-offs behind modern messaging systems.


Why chat gets hard

At first glance, chat seems simple:

  • send a message
  • store it
  • show it to the other user

That works for a demo. It falls apart fast in production.

Once you add real users, the system has to handle:

  • persistent connections
  • reconnects
  • message ordering
  • offline delivery
  • read receipts
  • typing indicators
  • multi-device sync
  • large group chats
  • global latency
  • partial failures

This is where chat stops being a UI feature and becomes a distributed systems problem.


Requirements

Before designing the system, we need clear requirements.

Functional requirements

Our chat system should support:

  • One-to-one chats
  • Group chats
  • Real-time message delivery
  • Read receipts
  • Typing indicators
  • Push notifications
  • Media attachments
  • Message history
  • Multi-device synchronization
  • Online and offline presence

Non-functional requirements

The system must also provide:

  • Low latency
  • High availability
  • Horizontal scalability
  • Fault tolerance
  • Reliable delivery
  • Event ordering
  • Efficient storage
  • Global distribution

At small scale, these are manageable. At millions of users, every one of them becomes a distributed systems concern.


High-level architecture

A modern chat platform is usually event-driven and distributed.

Clients
   │
   ▼
Load Balancer
   │
   ▼
WebSocket Gateway Cluster
   │
   ├── Authentication Service
   ├── Presence Service
   ├── Chat Service
   ├── Notification Service
   └── Media Service
          │
          ▼
    Kafka / Redis Streams
          │
          ▼
     Storage Layer
Enter fullscreen mode Exit fullscreen mode

Each component has a separate responsibility. That separation is what makes the system scalable and easier to evolve.


Why polling fails

Many beginner chat apps start with polling:

GET /messages every 2 seconds
Enter fullscreen mode Exit fullscreen mode

This works for a demo. It fails badly at scale.

Polling creates:

  • massive request overhead
  • unnecessary database reads
  • increased latency
  • battery drain on mobile devices
  • poor real-time responsiveness

Now imagine 2 million connected users polling every 2 seconds. That becomes 1 million requests per second even when nobody is sending anything. Real chat systems avoid this by using persistent connections.


Why WebSockets became the default

WebSockets keep a bidirectional connection open between the client and the server.

That gives you:

  • near real-time communication
  • lower overhead
  • reduced latency
  • efficient server push
  • better mobile performance

The client connects once:

Client ───── persistent connection ───── Server
Enter fullscreen mode Exit fullscreen mode

After that, both sides can exchange events instantly. That is the foundation of modern messaging systems.

The hidden WebSocket cost

WebSockets solve one problem and introduce another.

A persistent connection is not free. Every connected user consumes:

  • a TCP socket
  • memory buffers
  • heartbeat state
  • authentication context

At 5 million concurrent users, this is no longer a normal API problem. It becomes a connection management problem.

That is why serious chat systems build specialized gateway infrastructure.


WebSocket gateway layer

The gateway layer is responsible for:

  • maintaining persistent connections
  • authenticating users
  • routing events
  • managing heartbeats
  • detecting disconnects

Gateways should stay as stateless as possible. Stateless gateways are much easier to scale, replace, and recover after failures.

Authentication flow

A common flow looks like this:

  1. User logs in via HTTP.
  2. Backend issues a JWT token.
  3. Client opens a WebSocket connection.
  4. Token is validated during the handshake.
  5. Connection is associated with a user session.

Example:

wss://chat.example.com?token=JWT
Enter fullscreen mode Exit fullscreen mode

After authentication, the gateway knows which user owns the connection.

Sending a message

Now let's look at the real delivery pipeline.

Suppose User A sends:

Hello
Enter fullscreen mode Exit fullscreen mode

A simplified flow is:

  1. Client sends a message event.
  2. Gateway validates authentication.
  3. Chat service validates permissions.
  4. Message is persisted in durable storage.
  5. Event is published into Kafka.
  6. Recipient gateway receives the event.
  7. Message is pushed to the recipient.
  8. Delivery ACK is generated.
  9. Read receipt is generated later.

This looks simple on paper. In reality, every step can fail.

Why persistence comes first

A lot of beginners try this:

deliver → save later
Enter fullscreen mode Exit fullscreen mode

That is dangerous.

If the server crashes before persistence, the recipient saw the message, but the database lost it. Now the system is inconsistent.

Production systems usually do:

persist → publish → deliver
Enter fullscreen mode Exit fullscreen mode

Durability comes first.


Message IDs and ordering

Distributed systems do not guarantee ordering automatically.

Example:

  • Message A
  • Message B

Recipient receives:

  • Message B
  • Message A

Why does this happen? Because messages may travel through different gateway servers, queues, and network paths.

Common ordering strategies

Timestamp ordering

Simple, but unreliable. Clock drift breaks consistency.

Incremental sequence IDs

More reliable.

Example:

conversation_id: 42

messages:
1
2
3
4
Enter fullscreen mode Exit fullscreen mode

This guarantees ordering inside a conversation. Most real systems only guarantee local ordering per chat. Global ordering across the entire platform is usually impossible at scale.


Database design

Messaging systems are write-heavy. A popular group chat can generate thousands of writes per second.

Typical tables:

  • users
  • conversations
  • conversation_members
  • messages
  • message_status
  • attachments

The real challenge is partitioning.

Why a single database eventually fails

A single PostgreSQL instance works at the beginning. Over time, problems show up:

  • write bottlenecks
  • storage growth
  • replication lag
  • index size explosion

At scale, systems introduce sharding.

Sharding strategy

A common strategy is:

shard by conversation_id
Enter fullscreen mode Exit fullscreen mode

Benefits:

  • messages for the same chat stay colocated
  • ordering becomes easier
  • queries stay efficient

Bad shard keys create hotspots. For example, sharding by user_id can spread large group chats across multiple shards and make fan-out expensive.


Kafka and event streaming

Modern messaging systems are heavily event-driven.

Kafka is useful because it provides:

  • durable event logs
  • replayability
  • partitioned scalability
  • consumer groups

Instead of services calling each other directly:

Chat Service → Kafka → Consumers
Enter fullscreen mode Exit fullscreen mode

Consumers may include:

  • delivery service
  • notification service
  • analytics
  • moderation
  • push notifications

This decouples the system and makes failures easier to isolate.


Presence system

Presence is deceptively expensive.

Tracking online, offline, typing, and last_seen for millions of users creates a huge amount of event traffic. That is why most systems isolate presence into a dedicated service.

Presence implementation

A common architecture is:

Gateway → Redis → Presence Service
Enter fullscreen mode Exit fullscreen mode

Gateways periodically send heartbeats.

Example:

PING every 30 seconds
Enter fullscreen mode Exit fullscreen mode

If heartbeats stop, the user is marked offline.

Redis works well here because presence data is ephemeral. Not everything belongs in a relational database.

Typing indicators

Typing indicators look trivial. They are not.

Problems include:

  • high event frequency
  • noisy updates
  • unnecessary fan-out

Most systems heavily throttle typing events.

Example:

User typing → emit once every 3 seconds
Enter fullscreen mode Exit fullscreen mode

Without throttling, typing indicators can overload the infrastructure faster than messages.

Fan-out in group chats

Suppose a group contains 500,000 users.

One message may require 500,000 deliveries. This is called fan-out.

Large fan-out is one of the hardest messaging problems.

Fan-out strategies

Fan-out on write

The server precomputes deliveries immediately.

  • fast reads
  • expensive writes

Fan-out on read

Messages are stored once.

  • cheaper writes
  • more expensive reads

Different platforms choose different trade-offs.


Offline synchronization

Users disconnect constantly:

  • mobile app closed
  • network loss
  • airplane mode
  • battery saver

The system must synchronize missed events efficiently.

Typical approach:

fetch all events after last_sequence_id
Enter fullscreen mode Exit fullscreen mode

Example:

last_seen_message = 10451
Enter fullscreen mode Exit fullscreen mode

Server returns:

10452+
Enter fullscreen mode Exit fullscreen mode

Incremental synchronization is critical. Full synchronization is too expensive.


Push notifications

Offline users still need notifications.

Pipeline:

message event
   ↓
notification service
   ↓
APNs / FCM
   ↓
mobile device
Enter fullscreen mode Exit fullscreen mode

Push systems are eventually consistent. Notifications may arrive late, duplicated, or out of order. Clients need to tolerate that.

Multi-device sync

Modern users expect sync across:

  • phone
  • desktop
  • browser
  • tablet

Each device may keep its own session. The backend tracks:

  • user_id
  • device_id
  • connection_id
  • last_sync_state

Events are usually delivered independently per device.


Reliability guarantees

Messaging systems need clear delivery semantics.

At-most-once

Fastest. Messages may disappear.

At-least-once

Reliable. Duplicates are possible.

Exactly-once

Extremely expensive and difficult in distributed systems.

Most production systems use:

at-least-once + idempotency
Enter fullscreen mode Exit fullscreen mode

That is the practical choice.


Why exactly-once is mostly marketing

Exactly-once delivery across distributed infrastructure is incredibly hard.

Network failures create ambiguity:

  • Did the recipient receive the message?
  • Did the ACK get lost?
  • Did the retry create a duplicate?

Sometimes the sender cannot know for sure.

That is why many systems rely on retries, deduplication, and idempotent consumers instead of true exactly-once guarantees.


Common failure scenarios

Real systems fail all the time.

Examples:

  • gateway crashes
  • Kafka partition unavailable
  • Redis outage
  • slow consumers
  • duplicate events
  • partial synchronization
  • network splits

Reliable systems assume failure is normal.


Scaling strategies

As traffic grows, teams usually add:

  • horizontal scaling for gateway nodes
  • partitioned Kafka topics
  • regional infrastructure
  • CDN for media
  • caching to reduce database pressure

The biggest bottleneck is usually not CPU.

At scale, bottlenecks are often:

  • network throughput
  • memory usage
  • hot partitions
  • connection limits
  • disk I/O
  • replication lag

Chat systems are infrastructure-heavy workloads.


What makes messaging hard

The frontend UI may look simple. Underneath is a distributed system balancing:

  • consistency
  • availability
  • latency
  • reliability
  • scalability

Every architectural decision introduces trade-offs. You can optimize latency, durability, throughput, or cost, but rarely all at once.

That is why messaging systems remain one of the most interesting areas in system design.


Practical takeaway

If you are building chat, do not think only in terms of sockets and message lists. Think in terms of delivery guarantees, storage strategy, ordering, recovery, and fan-out.

A chat product becomes serious very quickly. The moment you need offline sync, presence, multi-device support, and reliable delivery, you are no longer building a UI feature. You are building a distributed messaging platform.

Top comments (0)