Building a Scalable Chat System for Customer Support — System Design Deep Dive

Designing a chat system that handles millions of customer support conversations while maintaining sub-second response times isn't just about moving messages around—it's about architecting for scale, reliability, and seamless user experience under pressure.

The Challenge: Scale Meets Real-Time

When we set out to build a customer support chat system for a fintech platform serving 2M+ users, we faced unique constraints:

- Peak load: 50,000 concurrent conversations during market hours
- Agent efficiency: Route customers to specialized agents instantly
- Compliance: Financial regulations require message retention and audit trails
- Global reach: Sub-200ms latency across 15+ countries

Architecture Overview: Event-Driven Foundation
Our solution centers on an event-driven microservices architecture that separates concerns while maintaining real-time performance.

Core Components
Message Gateway (WebSocket + HTTP)

  • Handles 100K+ simultaneous WebSocket connections
  • Auto-scales on connection count, with an AWS ALB distributing traffic across gateway instances
  • Implements connection pooling and heartbeat mechanisms
  • Falls back to HTTP polling on unstable networks (heartbeat handling sketched below)
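
Connection health is the fiddly part of a gateway like this, so here is a minimal heartbeat sketch using the Node `ws` library. The port and interval are illustrative assumptions; the post doesn't show the gateway's actual code.

```typescript
import { WebSocketServer, WebSocket } from 'ws';

const HEARTBEAT_INTERVAL_MS = 30_000; // assumed interval, not from the post

interface TrackedSocket extends WebSocket {
  isAlive: boolean;
}

const wss = new WebSocketServer({ port: 8080 });

wss.on('connection', (socket) => {
  const ws = socket as TrackedSocket;
  ws.isAlive = true;
  // A pong from the client marks the connection healthy for another round.
  ws.on('pong', () => { ws.isAlive = true; });
});

// Ping every client on a timer; terminate sockets that never answered the
// previous ping so dead connections don't inflate the autoscaling metric.
setInterval(() => {
  for (const client of wss.clients) {
    const ws = client as TrackedSocket;
    if (!ws.isAlive) {
      ws.terminate();
      continue;
    }
    ws.isAlive = false;
    ws.ping();
  }
}, HEARTBEAT_INTERVAL_MS);
```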

Message Broker (Apache Kafka)

  • Partitioned by conversation_id for guaranteed message ordering
  • 3-node cluster with replication factor of 3
  • Handles 500K messages/second at peak
  • Enables event sourcing for complete conversation history
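
To make the ordering guarantee concrete, here is a hedged sketch of publishing a message keyed by conversation_id with kafkajs. The broker list and topic name are placeholders, not values from the real system.

```typescript
import { Kafka } from 'kafkajs';

// Broker list and topic name are placeholders.
const kafka = new Kafka({ clientId: 'chat-gateway', brokers: ['kafka-1:9092'] });
const producer = kafka.producer();

export async function start() {
  await producer.connect(); // connect once at service startup
}

export async function publishMessage(conversationId: string, payload: object) {
  // Kafka hashes the key to choose a partition, so every message in a
  // conversation lands on the same partition and is consumed in order.
  await producer.send({
    topic: 'chat-messages',
    messages: [{ key: conversationId, value: JSON.stringify(payload) }],
  });
}
```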

Routing Engine

  • Real-time agent availability tracking using Redis
  • ML-powered customer intent classification (80% accuracy)
  • Skill-based routing with fallback queues
  • Average routing time: 1.2 seconds
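
A minimal sketch of how skill-based routing with a fallback queue might look against Redis, using ioredis. The key naming scheme and the general pool are assumptions for illustration.

```typescript
import Redis from 'ioredis';

const redis = new Redis(); // assumes a local Redis; key names are illustrative

// Agents register into a set per skill when they go available, e.g.
// SADD agents:available:payments agent-17. SPOP claims one atomically,
// so two routers can never grab the same agent.
export async function routeToAgent(skill: string): Promise<string | null> {
  const specialist = await redis.spop(`agents:available:${skill}`);
  if (specialist) return specialist;
  // No specialist free: fall back to the general queue.
  return redis.spop('agents:available:general');
}
```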

Persistence Layer

  • MongoDB for conversation metadata and agent profiles
  • Cassandra for message storage (optimized for time-series queries)
  • Redis for session management and real-time state
  • S3 for file attachments with CDN distribution
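
For the Cassandra side, here is a sketch of a time-series-friendly access pattern. The schema in the comment is an assumption, chosen so that reading a conversation's latest messages is a single-partition query; the post doesn't specify the actual table design.

```typescript
import { Client, types } from 'cassandra-driver';

// Assumed schema (not from the post):
//   CREATE TABLE chat.messages (
//     conversation_id text, sent_at timeuuid, body text,
//     PRIMARY KEY (conversation_id, sent_at)
//   ) WITH CLUSTERING ORDER BY (sent_at DESC);
const cassandra = new Client({
  contactPoints: ['cassandra-1'],
  localDataCenter: 'dc1',
  keyspace: 'chat',
});

export async function saveMessage(conversationId: string, body: string) {
  await cassandra.execute(
    'INSERT INTO messages (conversation_id, sent_at, body) VALUES (?, ?, ?)',
    [conversationId, types.TimeUuid.now(), body],
    { prepare: true },
  );
}

export async function latestMessages(conversationId: string, limit = 50) {
  // Clustering order DESC means the newest messages come back first.
  const result = await cassandra.execute(
    'SELECT sent_at, body FROM messages WHERE conversation_id = ? LIMIT ?',
    [conversationId, limit],
    { prepare: true },
  );
  return result.rows;
}
```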

Scaling Strategies That Work
Horizontal Pod Autoscaling: WebSocket gateways auto-scale based on connection count, with custom metrics tracking connection density per pod.

Database Sharding: Messages partitioned by conversation_id hash, enabling parallel processing and preventing hot spots.
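
A toy illustration of the shard-selection idea: hash the conversation_id and take it modulo the shard count. FNV-1a and the shard count of 16 are arbitrary stand-ins; the post doesn't name the actual hash function.

```typescript
const SHARD_COUNT = 16; // assumed

// FNV-1a over the conversation id, kept in unsigned 32-bit range.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Every reader and writer computes the same shard for a conversation,
// so a conversation's messages never straddle shards.
export function shardFor(conversationId: string): number {
  return fnv1a(conversationId) % SHARD_COUNT;
}
```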

Caching Layers:

  • Agent status cached in Redis (30-second TTL)
  • Conversation context cached for quick agent handoffs
  • Message history cached for last 50 messages per conversation
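
Both cache patterns come down to a few lines with ioredis. The TTL and list length mirror the numbers above; the key names are assumptions.

```typescript
import Redis from 'ioredis';

const redis = new Redis();

// Agent status: a plain key that expires after 30 seconds, matching the
// TTL above. Stale statuses vanish on their own.
export async function cacheAgentStatus(agentId: string, status: string) {
  await redis.set(`agent:status:${agentId}`, status, 'EX', 30);
}

// Message history: push the newest message to the front, then trim the
// list so only the latest 50 entries survive.
export async function cacheRecentMessage(conversationId: string, message: string) {
  const key = `conv:recent:${conversationId}`;
  await redis.lpush(key, message);
  await redis.ltrim(key, 0, 49);
}
```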

Circuit Breakers: Hystrix prevents cascading failures when downstream services experience latency spikes; the pattern is sketched below.
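
Hystrix itself is a Java library, so for a Node-flavored stack here is a hand-rolled sketch of the same pattern. The thresholds are assumed values.

```typescript
type State = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

class CircuitBreaker<T> {
  private state: State = 'CLOSED';
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly action: () => Promise<T>,
    private readonly failureThreshold = 5,    // assumed: failures before opening
    private readonly resetTimeoutMs = 10_000, // assumed: how long to fail fast
  ) {}

  async call(): Promise<T> {
    if (this.state === 'OPEN') {
      if (Date.now() - this.openedAt < this.resetTimeoutMs) {
        throw new Error('circuit open: failing fast'); // shed load immediately
      }
      this.state = 'HALF_OPEN'; // allow one trial request through
    }
    try {
      const result = await this.action();
      this.state = 'CLOSED'; // trial (or normal call) succeeded: recover
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
        this.state = 'OPEN';
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}

// Usage (hypothetical downstream call):
// const profileLookup = new CircuitBreaker(() => fetchAgentProfile(agentId));
// await profileLookup.call();
```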

Handling the Hard Problems
Message Ordering: Kafka partitioning by conversation_id guarantees FIFO delivery within conversations while allowing parallel processing across different chats.

Agent Handoffs: When specialists need to join, we maintain conversation context in Redis, allowing seamless transfers without message loss or duplication.
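
A sketch of what such a handoff could look like if the context lives in a Redis hash; the key and field names are assumptions.

```typescript
import Redis from 'ioredis';

const redis = new Redis();

// Reassigning the agent is a single field write on the shared hash; the
// specialist then reads the same hash, so nothing is copied and nothing
// can be dropped in transit.
export async function handOff(conversationId: string, newAgentId: string) {
  const key = `conv:ctx:${conversationId}`;
  await redis.hset(key, 'agent_id', newAgentId);
  return redis.hgetall(key); // full context for the incoming specialist
}
```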

Offline Scenarios: Messages queue in Kafka when agents disconnect, with automatic replay when they reconnect. Customers receive delivery confirmations to manage expectations.
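
The replay behaviour falls out of Kafka's consumer-group offsets. Here is a sketch with kafkajs; the topic and group names are placeholders.

```typescript
import { Kafka } from 'kafkajs';

const kafka = new Kafka({ clientId: 'agent-consumer', brokers: ['kafka-1:9092'] });
// A stable groupId is what makes replay work: Kafka remembers the group's
// committed offset, so a reconnect resumes exactly where the agent left off.
const consumer = kafka.consumer({ groupId: 'agent-sessions' });

export async function resumeStream(
  onMessage: (conversationId: string, body: string) => void,
) {
  await consumer.connect();
  await consumer.subscribe({ topic: 'chat-messages', fromBeginning: false });
  await consumer.run({
    eachMessage: async ({ message }) => {
      onMessage(message.key?.toString() ?? '', message.value?.toString() ?? '');
    },
  });
}
```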

Global Distribution: Regional message gateways with cross-region Kafka mirroring ensure local latency while maintaining data consistency.

Performance Metrics That Matter
After 6 months in production:

  • **99.9% message delivery success rate**
  • **Average message latency: 120ms**
  • **System availability: 99.95%**
  • Agent efficiency improved 35% (faster context switching)
  • Customer satisfaction up 28% (reduced wait times)

Lessons Learned
Stateless services are your friend: Every component can be horizontally scaled without complex coordination.
Event sourcing pays dividends: Complete message history reconstruction from events proved invaluable for debugging and compliance audits.
Monitor connection health aggressively: WebSocket connections fail silently; active health checks and reconnection logic are essential.
Cache conversation context wisely: Agent productivity skyrockets when they have immediate access to customer history without database queries.

The biggest insight? Building for scale isn't just about handling more users—it's about maintaining responsiveness and reliability as complexity grows. Every architectural decision should optimize for both throughput and latency.
