
Lalit Mishra

Performance Tuning: Linux Kernel Optimizations for 10k+ Connections

1. Introduction

In high-concurrency real-time architectures, the performance bottleneck inevitably shifts from the application layer to the operating system. A well-optimized Flask-SocketIO application running on Gevent or Eventlet is theoretically capable of handling tens of thousands of concurrent connections. However, in a default Linux environment, such an application will likely crash or stop accepting connections long before CPU or memory resources are saturated.

This plateau occurs because the Linux kernel, out of the box, is tuned for general-purpose computing, not for acting as a massive termination point for persistent TCP connections. For a WebSocket server, where connections are long-lived and stateful, resource exhaustion manifests in the form of file descriptor limits, ephemeral port starvation, and TCP stack congestion. This article outlines the specific kernel-level tuning required to scale Flask-SocketIO beyond the 10,000-connection barrier.

2. The 1024 Limit: Understanding File Descriptors

In Unix-like operating systems, "everything is a file." This includes TCP sockets. When a client connects to your server, the kernel allocates a file descriptor (FD) to represent that socket.

By default, most Linux distributions enforce a strict limit of 1024 open file descriptors per process. This is a legacy constraint. For a WebSocket server, this means that after roughly 1,000 concurrent users (plus a few descriptors for log files and shared libraries), the application will crash or throw OSError: [Errno 24] Too many open files.

Soft Limits vs. Hard Limits

The kernel distinguishes between the "soft limit" (the value actually enforced, which an unprivileged process may raise on its own up to the hard limit) and the "hard limit" (the absolute ceiling, which only root can raise).

To support 10k+ connections, you must increase these limits both at the system level and the process level (systemd/Supervisor).

Verification:

$ ulimit -n
1024

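ulimit -n only reports the limit of your current shell. To confirm what a running worker actually received, inspect /proc. The sketch below assumes the app runs under Gunicorn; substitute your own process name:

# Limits applied to the running worker (assumes a Gunicorn master process)
cat /proc/$(pgrep -f gunicorn | head -n 1)/limits | grep "open files"

# Number of descriptors that process currently holds
ls /proc/$(pgrep -f gunicorn | head -n 1)/fd | wc -l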

Remediation:
For a production system, this should be raised significantly, often to 65535 or higher.

  1. System-wide (/etc/security/limits.conf):
* soft nofile 65535
* hard nofile 65535

  2. Systemd service (/etc/systemd/system/app.service): systemd services do not read /etc/security/limits.conf (those are PAM limits), so you must define the limit explicitly in the unit file:

LimitNOFILE=65535

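A minimal sketch of the relevant unit section and the reload steps, assuming the service is named app.service:

# /etc/systemd/system/app.service (excerpt)
[Service]
LimitNOFILE=65535

# Reload unit files, restart, and confirm what systemd actually applied
sudo systemctl daemon-reload
sudo systemctl restart app
systemctl show app -p LimitNOFILE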

Figure: File Descriptor Bottleneck

3. Ephemeral Ports: The Redis Connection Bottleneck

While file descriptors limit incoming connections, ephemeral ports limit outgoing connections. This distinction is critical for Flask-SocketIO architectures that rely on a message broker like Redis.

When your Flask application connects to Redis (or Nginx connects to your upstream Flask Gunicorn workers), it opens a TCP socket. This outgoing connection requires a local IP address and a local port. The kernel assigns this local port from a predefined range known as the "ephemeral port range."

The Exhaustion Scenario

By default, this range is fairly narrow (e.g., 32768–60999), which yields only 60999 - 32768 + 1 = 28,232 usable ports. In high-throughput scenarios where the Flask app is constantly publishing to Redis or Nginx is proxying heavy traffic, the server can run out of available local ports.

Symptoms:

  • EADDRNOTAVAIL (Cannot assign requested address) errors in logs.
  • Sudden inability of the Flask app to talk to Redis, despite Redis being healthy.
  • Nginx returning 502 Bad Gateway because it cannot open a socket to the upstream.

Tuning:
Widen the range to nearly the full unprivileged span (1024–65535); ports below 1024 are reserved for privileged services.

# Check current range
sysctl net.ipv4.ip_local_port_range

# Expand the range immediately (runtime only, lost on reboot)
sysctl -w net.ipv4.ip_local_port_range="1024 65535"

# Persist the change by adding this line to /etc/sysctl.conf
net.ipv4.ip_local_port_range = 1024 65535

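To see how many local ports the outgoing leg is actually consuming, count established sockets toward the broker. This assumes Redis on its default port 6379:

# Established outgoing sockets to Redis (default port 6379)
ss -tn state established '( dport = :6379 )' | tail -n +2 | wc -l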

Figure: Ephemeral Port Exhaustion

4. TIME_WAIT Reuse: Optimizing TCP Stack Recycling

The most misunderstood aspect of TCP scaling is the TIME_WAIT state. When a TCP connection is closed, the side that initiated the close enters TIME_WAIT for a duration of 2 * MSL (Maximum Segment Lifetime); on Linux this interval is hard-coded at 60 seconds. This ensures that any delayed packets related to that connection are handled correctly and not attributed to a new connection on the same port.

In a high-churn environment (e.g., clients constantly refreshing pages or reconnecting), the server can accumulate tens of thousands of sockets in TIME_WAIT. These sockets consume system resources and, more critically, keep their (source IP, source port, destination IP, destination port) 4-tuple occupied, which blocks new outgoing connections from reusing those ephemeral ports.
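
A quick way to check whether TIME_WAIT accumulation is actually your problem:

# Count sockets currently parked in TIME_WAIT
ss -tn state time-wait | tail -n +2 | wc -l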

The tcp_tw_recycle Hazard

Historically, guides recommended enabling net.ipv4.tcp_tw_recycle. Do not do this. It was removed in Linux kernel 4.12 because it breaks connections for users behind NAT (Network Address Translation) by aggressively dropping packets with out-of-order timestamps.

The Correct Fix: tcp_tw_reuse

Instead, use net.ipv4.tcp_tw_reuse. This allows the kernel to reclaim a TIME_WAIT socket for a new outgoing connection if the new connection's timestamp is strictly greater than the last packet seen on the old connection. This is safe for most internal infrastructure (like Flask connecting to Redis).

Configuration (/etc/sysctl.conf):

# Allow reuse of sockets in TIME_WAIT state for new outgoing connections
# (relies on TCP timestamps, net.ipv4.tcp_timestamps = 1, which is the default)
net.ipv4.tcp_tw_reuse = 1

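Changes in /etc/sysctl.conf are not picked up until you reload them; apply and verify like this:

# Load everything in /etc/sysctl.conf without a reboot
sudo sysctl -p

# Confirm the values the kernel is actually using
sysctl net.ipv4.tcp_tw_reuse net.ipv4.ip_local_port_range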

Figure: TCP Connection Lifecycle

5. Load Testing: Simulating Reality with Artillery

Standard HTTP benchmarking tools like ab (Apache Bench) are useless for WebSockets. They measure request throughput (requests/sec), whereas the primary metric for WebSockets is concurrency (simultaneous open connections) and message latency.

To verify your kernel tuning, you must simulate sustained connection loads. Artillery and Locust are the industry standards here.

Testing Strategy

  1. Ramp Up: Don't connect 10k users instantly; an instant ramp looks like a SYN flood to the kernel (SYN cookies kick in) and can overflow the listen backlog. Ramp up over minutes.
  2. Sustain: Hold the connections open.
  3. Broadcast: While connections are held, trigger a broadcast event to measure the latency of the Redis backplane and the Nginx proxy buffering.

If your test fails at exactly 1024 users, you missed a file descriptor limit. If it fails around 28,000 connections, you likely hit the default ephemeral port range on the proxy or broker leg.
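
While the test runs, watch the kernel rather than just the client-side report; the walls described above show up clearly in these counters:

# Live view of socket usage during the ramp-up
watch -n 5 "ss -s"

# Evidence of an overly aggressive ramp: SYN cookies and listen queue overflows
dmesg | grep -i "syn flood"
netstat -s | grep -i "listen"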

6. Monitoring: Key Metrics in Grafana

Observability is the only way to confirm that kernel tuning is effective. When running high-concurrency workloads, you must monitor specific OS-level metrics.

1. File Descriptors (Open Files)

Track process_open_fds for the Gunicorn/uWSGI process. If this line flattens at a specific number (e.g., 1024 or 4096) while CPU is low, you have hit a hard limit.

2. TCP States

Monitor the count of sockets in each state (a quick command-line check follows the list), specifically:

  • ESTABLISHED: Should match your active user count.
  • TIME_WAIT: If this spikes to 30k+, you have a churn problem or need tcp_tw_reuse.
  • Allocated Sockets (sockstat): The total memory consumed by networking buffers.
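
A quick manual check of these numbers on the host itself:

# Socket counts per TCP state
ss -tan | awk 'NR > 1 {print $1}' | sort | uniq -c | sort -rn

# Kernel socket memory usage (the sockstat view referenced above)
cat /proc/net/sockstat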

3. Conntrack Table

If you are using iptables or Docker, the nf_conntrack table limits how many connections the firewall tracks. Use dmesg to check for "nf_conntrack: table full, dropping packet".

  • Tune: sysctl -w net.netfilter.nf_conntrack_max=131072
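
To see how close the table is to its ceiling before packets start dropping (this assumes the nf_conntrack module is loaded, which it is whenever iptables NAT or Docker networking is in use):

# Current tracked connections vs. the configured maximum
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

# Persist the higher ceiling across reboots
echo "net.netfilter.nf_conntrack_max = 131072" | sudo tee /etc/sysctl.d/99-conntrack.conf
sudo sysctl --system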

Figure: Grafana High-Concurrency Dashboard

7. Trade-offs and Production Considerations

Aggressive kernel tuning is not without risk. It removes safety rails designed to protect the system from resource exhaustion.

  1. Security: Expanding the ephemeral port range makes port scanning slightly easier, though this is negligible inside a private VPC.
  2. Stability: Setting file descriptor limits too high (e.g., millions) can allow a memory leak in the application to crash the entire server rather than just the process.
  3. Connection Tracking: Increasing nf_conntrack_max consumes kernel memory (RAM). Ensure your server has sufficient RAM to store the state of 100k+ tracked connections.

Golden Rule: Never apply sysctl settings blindly. Apply them via configuration management (Ansible/Terraform), document why they are needed, and validate with load testing.

8. Conclusion

Scaling Flask-SocketIO to 10,000+ connections is an exercise in systems engineering as much as software development. The default Linux configuration is conservative, designed for general desktop or low-traffic server usage. By systematically addressing the file descriptor limits (ulimit), expanding the ephemeral port range (ip_local_port_range), and optimizing TCP reuse (tcp_tw_reuse), you unlock the operating system's ability to support high-concurrency workloads.

Production Readiness Checklist:

  • [ ] ulimit -n ≥ 65535 for the Gunicorn process.
  • [ ] net.ipv4.ip_local_port_range set to 1024 65535.
  • [ ] net.ipv4.tcp_tw_reuse enabled (1).
  • [ ] net.ipv4.tcp_tw_recycle disabled (0) or absent.
  • [ ] nf_conntrack_max increased if using stateful firewalls.
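
For reference, the sysctl items from this checklist can live together in one drop-in file; a minimal sketch (the nofile limit is set separately via limits.conf and LimitNOFILE):

# /etc/sysctl.d/99-websocket-tuning.conf
net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_tw_reuse = 1
net.netfilter.nf_conntrack_max = 131072

# Apply with: sudo sysctl --system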
