Introduction
Scaling stateless HTTP applications is a well-understood problem: spin up more pods, put a load balancer in front, and use a round-robin algorithm. If one pod fails, the next request is simply routed to another healthy instance. However, real-time applications using WebSockets—specifically those built with Flask-SocketIO—break this paradigm fundamentally.
WebSockets rely on long-lived, stateful TCP connections. Once a client connects to a server process, that specific process holds the socket file descriptor and the in-memory context (rooms, session data) for that user. If you simply replicate a Flask-SocketIO container across ten pods in Kubernetes, clients will begin failing as soon as traffic is spread across them.
This failure occurs because the standard horizontal scaling model does not account for the dual requirements of the Socket.IO protocol: connection persistence during the handshake and distributed event propagation after the connection is established. To scale Flask-SocketIO effectively, we must move beyond the single-server mindset and implement a distributed architecture utilizing Kubernetes Ingress for session affinity and Redis as a pub/sub message bus.
The Stateful Problem: Why Round-Robin Fails
To understand why standard load balancing fails, we must look at the Socket.IO protocol negotiation. Unlike raw WebSockets, Socket.IO does not immediately establish a WebSocket connection. Instead, it typically begins with HTTP Long-Polling to ensure compatibility and robust connectivity through restrictive proxies.
The handshake sequence looks like this:
- Handshake Request: The client sends GET /socket.io/?EIO=4&transport=polling. The server responds with a Session ID (sid) plus the ping interval and timeout.
- Poll Request: The client sends GET /socket.io/?EIO=4&transport=polling&sid=....
- Upgrade Request: The client opens a new connection with an Upgrade: websocket header (carrying the same sid) to switch protocols.
In a round-robin Kubernetes environment without session affinity, the Handshake Request might route to Pod A, which generates a session ID (e.g., abc-123) and stores it in its local memory. The subsequent Poll Request might be routed by the Service to Pod B. Pod B has no record of session abc-123, so it rejects the request with a 400 Bad Request and the body {"code":1,"message":"Session ID unknown"}.
Even if the connection successfully upgrades to WebSocket (which locks the TCP connection to a single pod), the system remains broken for broadcasting. If User A is connected to Pod A and User B is connected to Pod B, and they are both in a chat room room_1, a message sent by User A will only exist inside Pod A's memory. Pod B will never know it needs to forward that message to User B.
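You can reproduce this failure from outside the cluster with two plain HTTP requests. Below is a minimal sketch using the requests library; socket.example.com stands in for your Ingress host, and the 400 appears only when the second request lands on a pod other than the one that issued the sid:

import json
import requests

BASE = "http://socket.example.com/socket.io/"  # placeholder Ingress host

# Step 1: open a new Engine.IO session (no sid yet).
r = requests.get(BASE, params={"EIO": "4", "transport": "polling"})
sid = json.loads(r.text[1:])["sid"]  # body is '0{...json...}'; strip the packet-type digit

# Step 2: poll with that sid. On the pod that issued it, this long-polls;
# on any other pod, it fails fast because the sid is unknown there.
r = requests.get(BASE, params={"EIO": "4", "transport": "polling", "sid": sid})
print(r.status_code, r.text)  # e.g. 400 {"code":1,"message":"Session ID unknown"}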
Sticky Sessions: Configuring Ingress-Nginx
The solution to the handshake failure is Session Affinity, commonly known as "Sticky Sessions." This ensures that once a client initiates a handshake with a specific pod, all subsequent requests from that client are routed to the exact same pod.
In Kubernetes, this is typically handled at the Ingress controller level rather than the Service level (which offers sessionAffinity: ClientIP, but this is often unreliable behind NATs). For ingress-nginx, the standard controller used in many clusters, stickiness is achieved via cookie-based affinity.
Configuration via Annotations
You must add specific annotations to your Ingress resource to inject a routing cookie.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: socketio-ingress
  annotations:
    # Enable cookie-based affinity
    nginx.ingress.kubernetes.io/affinity: "cookie"
    # Name of the cookie sent to the client
    nginx.ingress.kubernetes.io/session-cookie-name: "route"
    # Critical: use "persistent" mode to prevent rebalancing active sessions
    nginx.ingress.kubernetes.io/affinity-mode: "persistent"
    # Hash algorithm (sha1, md5, or index)
    nginx.ingress.kubernetes.io/session-cookie-hash: "sha1"
    # Duration (should match your Socket.IO ping timeout logic)
    nginx.ingress.kubernetes.io/session-cookie-expires: "172800"
    nginx.ingress.kubernetes.io/session-cookie-max-age: "172800"
spec:
  rules:
    - host: socket.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: socketio-service
                port:
                  number: 5000
The "Persistent" vs. "Balanced" Trap
A common configuration mistake is ignoring the affinity-mode. In the default "balanced" mode, Nginx may redistribute sessions when the number of pods scales up or down in order to even out the load. For stateless apps, this is fine; for WebSockets, it breaks the connection. Setting nginx.ingress.kubernetes.io/affinity-mode: "persistent" ensures that Nginx honors the cookie even if the pod distribution becomes uneven, preserving WebSocket connection stability at the cost of potential load imbalance.
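Once the Ingress is applied, the affinity is easy to verify from a client: the first response sets the routing cookie, and replaying it pins subsequent requests to one pod. A quick check, assuming the cookie name "route" configured above:

import requests

# A Session captures the Set-Cookie header from ingress-nginx and resends it.
s = requests.Session()
r = s.get("http://socket.example.com/socket.io/?EIO=4&transport=polling")
print(r.cookies.get("route"))  # hashed upstream identifier set by the controller
# Subsequent requests on this Session (including the WebSocket upgrade)
# carry the cookie and land on the same pod.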
The Redis Backplane: Distributed Event Propagation
Sticky sessions solve the connection problem, but they create a new isolation problem. Users are now siloed on different servers. To allow User A (on Pod 1) to send a message to User B (on Pod 2), we need a backplane—a mechanism to bridge the isolated memory spaces of the Flask processes.
Flask-SocketIO solves this using a Message Queue, with Redis being the most performant and common choice. This implements the Pub/Sub (Publish/Subscribe) pattern.
How It Works Internally
When you configure Flask-SocketIO with a Redis message queue:
- Subscription: Every Flask-SocketIO worker establishes a connection to Redis and subscribes to a specific channel (usually flask-socketio).
- Emission: When code in Pod A executes emit('chat', msg, room='lobby'), it does not loop through its own client list. Instead, it publishes a message to Redis saying "Send 'chat' to 'lobby'".
- Distribution: Redis pushes this message to all other subscribed Flask workers (Pod B, Pod C).
- Fan-Out: Pod B receives the Redis message, checks its own local memory for clients in 'lobby', and forwards the message to them over their open WebSocket connections.
This architecture decouples the origin of an event from the delivery of the event.
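A useful consequence of this decoupling: any process that can reach Redis can emit events, even one that serves no WebSocket traffic at all (a Celery worker, a cron job). Flask-SocketIO supports this with a write-only instance; a minimal sketch, assuming the same Redis URL used later in this article:

from flask_socketio import SocketIO

# Write-only instance: no Flask app, no server socket, just a Redis publisher.
external = SocketIO(message_queue='redis://redis-master:6379/0')

# Every subscribed pod receives this and forwards it to its local 'lobby' clients.
external.emit('chat', {'text': 'deploy finished'}, room='lobby', namespace='/')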
Installation: Setting Up Redis and Flask-SocketIO
Implementing this requires installing the Redis server (usually via a Helm chart in Kubernetes, like bitnami/redis) and configuring the Python application to use it.
1. Dependencies
You must install the Redis client library along with Flask-SocketIO.
pip install flask-socketio redis
Note: If you use eventlet or gevent for async workers, monkey patching must be applied before redis (or anything else that imports the socket module) is loaded. The standard redis-py client works well under both eventlet and gevent once patching is done correctly.
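In practice that means patching at the very top of the entry point, before any other import. A sketch of a hypothetical wsgi.py (the app module name is illustrative):

# wsgi.py -- patching must run before any other import pulls in the socket module.
import eventlet
eventlet.monkey_patch()

from app import app, socketio  # imported after patching, deliberately

if __name__ == '__main__':
    socketio.run(app, host='0.0.0.0', port=5000)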
2. Application Configuration
You pass the connection string to the SocketIO constructor. This is the only code change required to switch from a single-node memory store to a distributed Redis store.
from flask import Flask
from flask_socketio import SocketIO

app = Flask(__name__)

# The 'message_queue' argument enables the Redis backend.
# In Kubernetes, 'redis-master' is typically the service DNS name.
socketio = SocketIO(app,
                    message_queue='redis://redis-master:6379/0',
                    cors_allowed_origins="*")

@socketio.on('message')
def handle_message(data):
    # This emit is now broadcast via Redis to all pods
    socketio.emit('response', data)
Common Mistake: Do not use the client_manager argument manually unless you are customizing the underlying Engine.IO implementation. The message_queue argument is the high-level wrapper provided by Flask-SocketIO to configure the RedisManager automatically.
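To watch the fan-out end to end, connect a client through the Ingress and send a message; every pod forwards the broadcast to its own connected clients, whichever pod originally received it. A sketch using the python-socketio client package (installed separately with pip install "python-socketio[client]"):

import socketio  # python-socketio client package, not flask_socketio

sio = socketio.Client()

@sio.on('response')
def on_response(data):
    print('received:', data)

sio.connect('http://socket.example.com')
sio.send('hello from client A')  # dispatched to the 'message' handler server-side
sio.sleep(2)   # give the broadcast time to travel through Redis and back
sio.disconnect()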
Trade-offs: Latency and Bottlenecks
While this architecture enables horizontal scaling, it introduces specific engineering trade-offs that must be monitored.
Latency Introduction
In a single-node setup, an emit is a direct memory operation. In a distributed setup, every broadcast involves a network round-trip to Redis.
- Path: Client -> Pod A -> Redis -> Pod B -> Client.
- This adds single-digit millisecond latency (typically <5ms within a K8s cluster), which is negligible for chat apps but critical for high-frequency trading or real-time gaming.
The Redis Bottleneck
Redis is single-threaded. While extremely fast, it has a physical limit on throughput. In a Pub/Sub model, every message published to a channel is sent to all subscribers.
- Scenario: If you have 20 pods and send 1,000 messages/sec.
- Amplification: Redis must perform 20 * 1,000 = 20,000 pushes/sec.
- As you scale to hundreds of pods, the single Redis core becomes the limiting factor. If Redis saturates, message delivery latency spikes across the entire cluster.
To mitigate this at extreme scale (100k+ concurrent users), you cannot use a single Redis channel. You must implement sharding (using multiple Redis instances) or switch to a broker that supports partitioning more natively, though Flask-SocketIO's default adapter is designed primarily for Redis Pub/Sub.
Conclusion
Scaling Flask-SocketIO horizontally is a solved problem, but it requires strict adherence to a state-aware architecture. You cannot rely on the stateless scaling patterns used for REST APIs.
The Decision Framework:
- Single Server: Sufficient for < 1,000 concurrent users. No Redis or Sticky Sessions needed.
- Multi-Server (Production): Required for high availability and > 1,000 users.
  - Layer 1 (Ingress): MUST enable Sticky Sessions (Cookie Affinity) to ensure handshake completion.
  - Layer 2 (App): MUST configure Redis Message Queue to bridge isolated worker processes.
Production Checklist:
- Ingress configured with affinity: "cookie" and affinity-mode: "persistent".
- Redis deployed (preferably with persistence disabled for pure Pub/Sub performance, or enabled if using it for other storage).
- Flask-SocketIO initialized with message_queue='redis://...'.
- Gevent/Eventlet monkey patching applied at the very top of the application entry point.
By implementing this architecture, you transform Flask-SocketIO from a development toy into a robust, scalable real-time platform capable of handling tens of thousands of concurrent connections.