How I built a horizontally scalable chat server in Go.
It's not about sending messages fast.
It's about doing three things at once:
— deliver in real time
— preserve order + durability
— scale without sticky sessions
Project: Signal-Flow
GitHub: https://github.com/abhiram-karanth-core/signal-flow
Demo: https://global-chat-app-web.vercel.app/
Here's how I broke it down.
The key insight: stop treating a chat message as one thing.
Every message has a lifecycle with very different requirements at each stage. Collapsing them into one system is where most chat servers break down at scale.
The lifecycle:
- User sends a message (Next.js)
- Go WebSocket server receives it
- Message made durable (Kafka)
- Message fanned out to other connected users (Redis)
- Message written to queryable history (Postgres)
Each step needs a different system. Each system does exactly one job.
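
For concreteness, here's a rough sketch of the message shape that flows through every one of those stages. The field names are my assumptions, not the repo's actual schema:

```go
package main

import "time"

// Message is an assumed shape for what travels through the pipeline.
// Field names are illustrative; the actual repo may differ.
type Message struct {
	ID        string    `json:"id"`         // assigned at ingestion, used for dedup on replay
	RoomID    string    `json:"room_id"`    // chat room / channel
	SenderID  string    `json:"sender_id"`  // authenticated user
	Body      string    `json:"body"`       // message text
	CreatedAt time.Time `json:"created_at"` // server-side timestamp
}
```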
The first big lesson: real-time delivery and durable storage have opposing needs.
Real-time = sub-millisecond latency
Durable = safety, ordering, replayability
Forcing one system to do both means compromising on both. So don't.
Kafka is the source of truth. Every message hits Kafka first — before fan-out, before anything.
But Kafka doesn't deliver messages to users. It exists so the system can survive failures and recover state.
Correctness lives here. Latency does not.
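
A minimal sketch of that ingestion rule, assuming the segmentio/kafka-go client and reusing the Message shape above (broker address, topic name, and function names are made up). The handler blocks on the Kafka write before anything else happens:

```go
package main

import (
	"context"
	"encoding/json"

	"github.com/segmentio/kafka-go"
)

// writer publishes every inbound message to the "messages" topic.
var writer = &kafka.Writer{
	Addr:         kafka.TCP("localhost:9092"),
	Topic:        "messages",
	RequiredAcks: kafka.RequireAll, // not "received" until the brokers have it
}

// ingest makes the message durable before any fan-out happens.
func ingest(ctx context.Context, msg Message) error {
	payload, err := json.Marshal(msg)
	if err != nil {
		return err
	}
	// Kafka first. If this write fails, the sender gets an error and can
	// retry; nothing downstream has seen the message yet.
	return writer.WriteMessages(ctx, kafka.Message{
		Key:   []byte(msg.RoomID), // informational here: the topic has one partition
		Value: payload,
	})
}
```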
Why a single Kafka partition?
Message ordering matters more than raw throughput in chat. Ordering is only guaranteed within a partition. Cross-partition ordering breaks chat semantics.
Single partition → ~5k msg/sec cap. But deterministic ordering, simpler consumers, clean replay.
Conscious trade-off. Not a limitation.
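
If you want to pin that down in code, one way is kafka-go's admin connection. A sketch only; the repo may provision the topic differently:

```go
package main

import (
	"log"

	"github.com/segmentio/kafka-go"
)

func main() {
	// Single-broker dev setup assumed; in a real cluster you'd dial the
	// controller broker before creating topics.
	conn, err := kafka.Dial("tcp", "localhost:9092")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// One partition: total ordering across the topic, lower throughput ceiling.
	if err := conn.CreateTopics(kafka.TopicConfig{
		Topic:             "messages",
		NumPartitions:     1,
		ReplicationFactor: 1,
	}); err != nil {
		log.Fatal(err)
	}
}
```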
Redis makes it feel instant.
It sits on the hot path — broadcasting messages to connected WebSocket clients and coordinating fan-out across multiple Go servers.
Redis is intentionally ephemeral. If it drops a message: Kafka still has it. Clients recover on reconnect. History stays correct.
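
A sketch of that fan-out using Redis pub/sub with github.com/redis/go-redis/v9 (the channel naming and the deliver callback are assumptions):

```go
package main

import (
	"context"
	"log"

	"github.com/redis/go-redis/v9"
)

var rdb = redis.NewClient(&redis.Options{Addr: "localhost:6379"})

// broadcast pushes a serialized message to every Go server subscribed to
// the room's channel. Fire-and-forget: durability already lives in Kafka.
func broadcast(ctx context.Context, roomID string, payload []byte) {
	if err := rdb.Publish(ctx, "room:"+roomID, payload).Err(); err != nil {
		log.Printf("redis publish failed (kafka still has the message): %v", err)
	}
}

// listen runs on every Go server; each server forwards what it hears to
// its own locally connected WebSocket clients via the deliver callback.
func listen(ctx context.Context, roomID string, deliver func([]byte)) {
	sub := rdb.Subscribe(ctx, "room:"+roomID)
	defer sub.Close()
	for msg := range sub.Channel() {
		deliver([]byte(msg.Payload))
	}
}
```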
Postgres makes it usable.
Kafka consumers write to Postgres asynchronously. This means:
— DB slowness never blocks ingestion
— State can be rebuilt from Kafka replay
— Frontend reloads always return consistent history
Postgres is the materialized view users actually interact with.
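
A sketch of that consumer, assuming kafka-go's Reader plus database/sql with the lib/pq driver. The table layout is invented, and the repo may use pgx or a different commit strategy:

```go
package main

import (
	"context"
	"database/sql"
	"encoding/json"
	"log"

	_ "github.com/lib/pq" // Postgres driver; an assumption for this sketch
	"github.com/segmentio/kafka-go"
)

func runHistoryConsumer(ctx context.Context, db *sql.DB) {
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"},
		Topic:   "messages",
		GroupID: "postgres-writer", // its own consumer group, its own offset
	})
	defer r.Close()

	for {
		// FetchMessage + CommitMessages = at-least-once: the offset only
		// advances once the row is safely in Postgres.
		m, err := r.FetchMessage(ctx)
		if err != nil {
			return
		}
		var msg Message
		if err := json.Unmarshal(m.Value, &msg); err != nil {
			log.Printf("bad payload, skipping: %v", err)
			r.CommitMessages(ctx, m)
			continue
		}
		if _, err := db.ExecContext(ctx,
			`INSERT INTO messages (id, room_id, sender_id, body, created_at)
			 VALUES ($1, $2, $3, $4, $5)
			 ON CONFLICT (id) DO NOTHING`, // replay-safe: duplicates are ignored
			msg.ID, msg.RoomID, msg.SenderID, msg.Body, msg.CreatedAt); err != nil {
			// Don't commit the offset; after a restart the message is re-read.
			log.Printf("insert failed: %v", err)
			return
		}
		r.CommitMessages(ctx, m)
	}
}
```

The ON CONFLICT clause is what makes replay cheap: rebuilding the table from Kafka just re-inserts the same IDs and the duplicates are dropped.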
The most important architectural decision: heavy decoupling between consumers.
Kafka → Redis (real-time fan-out)
Kafka → Postgres (durable writes)
Producer publishes once. Doesn't care who consumes. Each consumer fails and restarts independently.
No cascading failures.
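
In kafka-go terms, the decoupling is just two Readers on the same topic with different consumer groups; Kafka keeps a separate offset for each (the group names here are made up):

```go
package main

import "github.com/segmentio/kafka-go"

// Two independent consumers of the same topic. Each group tracks its own
// offset, so the fan-out path and the history path lag, crash, and recover
// without ever blocking each other.
var (
	fanoutReader = kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"},
		Topic:   "messages",
		GroupID: "redis-fanout",
	})
	historyReader = kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"},
		Topic:   "messages",
		GroupID: "postgres-writer",
	})
)
```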
Why it scales horizontally:
— Go servers are stateless (no sticky sessions)
— Real-time delivery is decoupled from durability
— Each component has one job
— Any component can rebuild from Kafka replay
Scaling becomes an ops concern. Not an app rewrite.
The frontend never sees any of this.
WebSockets → real-time updates via Redis
HTTP APIs → message history from Postgres
Kafka stays internal. Clean boundary.
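
That boundary is roughly two handlers: a WebSocket upgrade for live traffic, and a plain HTTP endpoint for history. A sketch assuming gorilla/websocket and net/http (the routes and wiring are made up):

```go
package main

import (
	"log"
	"net/http"

	"github.com/gorilla/websocket"
)

var upgrader = websocket.Upgrader{
	// Permissive origin check only for the sketch; lock this down in production.
	CheckOrigin: func(r *http.Request) bool { return true },
}

func main() {
	// Live path: WebSocket, fed by the Redis fan-out behind the scenes.
	http.HandleFunc("/ws", func(w http.ResponseWriter, r *http.Request) {
		conn, err := upgrader.Upgrade(w, r, nil)
		if err != nil {
			return
		}
		defer conn.Close()
		// ... register conn with the local hub, pump Redis messages to it
	})

	// History path: plain HTTP backed by Postgres. Kafka never leaks out.
	http.HandleFunc("/api/messages", func(w http.ResponseWriter, r *http.Request) {
		// ... query Postgres for the room's recent messages, write JSON
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```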
Don't build a chat server. Build three systems that each do one thing well:
Kafka → correctness
Redis → latency
Postgres → usability
Everything else is trade-offs.