With the rising demand for sync engines and real-time features, WebSockets have become a critical component of modern applications. At Compose, WebSockets form the backbone of our service, powering our backend SDKs that enable developers to deliver low-latency interactive applications with just backend code.
But scaling WebSockets has proven to be far more complex than we expected. Below are some of the most important lessons we've learned along the way.
Handle deployments gracefully
Users should never notice when deployments happen, so WebSocket connections need to persist across deployments. This is a delicate process that requires robust reconnection logic to deal with unexpected issues. At Compose, we achieve near-zero downtime by following these steps:
1. Spin up new servers.
2. Once the new servers are healthy, old servers begin returning 503 Service Unavailable responses to health checks.
3. After 4 consecutive 503 responses, the load balancer declares the old servers unhealthy and removes them from the pool. The load balancer runs health checks every 5 seconds, so this process takes up to 25 seconds.
4. Old servers send a custom WebSocket close message instructing clients to delay reconnection by a random interval to avoid a reconnection surge.
   - The custom close message lets clients show users a more accurate message during the ~10 second period where the client is disconnected.
   - The random delay helps prevent thundering herd issues where all clients reconnect at once. Clients also double the exponential backoff for deployment-related reconnections to account for unforeseen issues.
   - The close message is delayed by 20 seconds to account for the time it takes for the load balancer to shift traffic.
5. Once all clients disconnect, the old servers shut down completely.
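The client side of that close-and-reconnect step could look roughly like the sketch below. It is not our production code: the close code 4001, the base delay, the cap, and the jitter window are all assumed values chosen for illustration (RFC 6455 reserves the 4000-4999 range for private use, so any code in that range would do). The random source is a parameter so the behavior can be checked deterministically.

```typescript
// Hypothetical application-defined close code meaning "server is being redeployed".
const DEPLOY_CLOSE_CODE = 4001;

const BASE_DELAY_MS = 1000; // assumed base for exponential backoff
const MAX_DELAY_MS = 30000; // assumed cap on the backoff
const DEPLOY_JITTER_MS = 10000; // assumed window for the random delay

/**
 * Returns how long a client should wait before reconnect attempt `attempt`
 * (0-based), given the close code it received.
 */
function reconnectDelayMs(
  closeCode: number,
  attempt: number,
  random: () => number = Math.random
): number {
  const backoff = Math.min(BASE_DELAY_MS * 2 ** attempt, MAX_DELAY_MS);
  if (closeCode !== DEPLOY_CLOSE_CODE) {
    return backoff;
  }
  // Deployment-related close: double the backoff and add a random delay
  // so that clients do not all reconnect at the same moment.
  const jitter = Math.floor(random() * DEPLOY_JITTER_MS);
  return Math.min(backoff * 2, MAX_DELAY_MS) + jitter;
}
```

A client would call this from its close handler, for example `setTimeout(connect, reconnectDelayMs(event.code, attempt))`, where `connect` and `attempt` are whatever the client already uses to track reconnection state.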
If you're using a managed service like Render or Railway, take particular care to verify that client connections are transferred gracefully during deployments.
Many managed services that tout zero-downtime deployments will wait until all outstanding requests are processed before shutting down a server. Since WebSocket connections are persistent, this can lead to situations in which old servers are active for minutes or even hours after a deploy until the managed service forcibly terminates the process.
Establish a consistent message schema
While HTTP comes with built-in routing conventions (GET /user, POST /company, PUT /settings), WebSockets require developers to define their own schema for organizing messages.
At Compose, every WebSocket message starts with a fixed 2-byte type prefix for categorizing messages.
- It's space-efficient (only 2 bytes), while still scaling to 65,536 different types.
- It enables clients to reliably slice the type prefix from the message without affecting the rest of the data, since the prefix is always 2 bytes.
- It gives us a simple method for upgrading our APIs by versioning message types.
const MESSAGE_TYPE_TO_HEADER = {
  RENDER_UI: "aa",
  UPDATE_UI: "ab",
  SHOW_LOADING: "ac",
  RENDER_UI_V2: "ad",
  /* ... */
};
Additionally, we use delimiters to separate different fields inside the message, which is both faster to encode/decode and more memory-efficient than JSON.
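const DELIMITER = "|";

// Assumes field values never contain the delimiter; escape them otherwise.
function createDelimitedMessage(
  type: keyof typeof MESSAGE_TYPE_TO_HEADER,
  args: unknown[]
) {
  return [MESSAGE_TYPE_TO_HEADER[type], ...args].join(DELIMITER);
}

// Returns the 2-character header as `type`, plus the remaining fields.
function parseDelimitedMessage(message: string) {
  const [type, ...args] = message.split(DELIMITER);
  return { type, args };
}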
We're lucky that our backend and frontend are written in TypeScript, allowing us to share message schemas between the two and ensure that neither falls out of sync.
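As a rough illustration of what sharing the schema buys you, here is a minimal sketch of a shared module that slices the fixed-length prefix and dispatches to typed handlers. The names `readMessageType`, `Handlers`, and `dispatch` are hypothetical and only the four message types shown earlier are included. Because `Handlers` is derived from the header map, adding a new message type without a matching handler becomes a compile-time error on whichever side forgot it.

```typescript
// Shared by backend and frontend in this sketch (e.g. a common module).
const MESSAGE_TYPE_TO_HEADER = {
  RENDER_UI: "aa",
  UPDATE_UI: "ab",
  SHOW_LOADING: "ac",
  RENDER_UI_V2: "ad",
} as const;

type MessageType = keyof typeof MESSAGE_TYPE_TO_HEADER;

const HEADER_LENGTH = 2;

// Reverse lookup built once: header -> message type.
const HEADER_TO_MESSAGE_TYPE = new Map<string, MessageType>(
  (Object.keys(MESSAGE_TYPE_TO_HEADER) as MessageType[]).map(
    (type): [string, MessageType] => [MESSAGE_TYPE_TO_HEADER[type], type]
  )
);

// Slices the fixed-length prefix off the message and resolves it to a type.
// `body` is everything after the prefix, untouched.
function readMessageType(message: string): { type: MessageType; body: string } {
  const header = message.slice(0, HEADER_LENGTH);
  const type = HEADER_TO_MESSAGE_TYPE.get(header);
  if (type === undefined) {
    throw new Error(`Unknown message header: ${header}`);
  }
  return { type, body: message.slice(HEADER_LENGTH) };
}

// One handler per message type; a missing key fails type checking.
type Handlers = { [K in MessageType]: (body: string) => void };

function dispatch(message: string, handlers: Handlers): void {
  const { type, body } = readMessageType(message);
  handlers[type](body);
}
```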
Detect silent disconnects with heartbeats
Connections can drop unexpectedly without triggering a close event, leaving the client believing it's connected when it actually isn't. A robust heartbeat mechanism is essential for detecting these stale connections.
We send periodic ping/pong messages between client and server and reconnect in cases where the heartbeat isn't received within some interval.
Our server sends a ping message every 30 seconds and expects a pong response. If the client doesn't receive a ping within 45 seconds, it immediately drops the connection and tries to reconnect. Similarly, the server closes any connection from which it hasn't received a pong within 45 seconds.
By monitoring heartbeats on both ends, we detect and handle rare cases where the client-side network appears functional but the server never receives responses.
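The timeout bookkeeping can be kept separate from the socket itself, which makes it easy to use on both ends. The sketch below is one possible shape, not our actual implementation: `HeartbeatMonitor` is a hypothetical name, and only the 30 and 45 second values come from the description above. Timestamps are passed in explicitly so the logic does not depend on real timers.

```typescript
const PING_INTERVAL_MS = 30000; // how often the server sends a ping
const HEARTBEAT_TIMEOUT_MS = 45000; // silence longer than this means "stale"

// Tracks when the last heartbeat (ping on the client, pong on the server)
// was seen and reports whether the connection should be considered stale.
class HeartbeatMonitor {
  private lastSeenAt: number;
  private timeoutMs: number;

  constructor(timeoutMs: number, now: number) {
    this.timeoutMs = timeoutMs;
    this.lastSeenAt = now;
  }

  // Call whenever a heartbeat arrives.
  recordHeartbeat(now: number): void {
    this.lastSeenAt = now;
  }

  // True once more than `timeoutMs` has passed since the last heartbeat.
  isStale(now: number): boolean {
    return now - this.lastSeenAt > this.timeoutMs;
  }
}

// Wiring sketch (not executed here), assuming `socket` is an open connection:
//   const monitor = new HeartbeatMonitor(HEARTBEAT_TIMEOUT_MS, Date.now());
//   on each ping/pong received: monitor.recordHeartbeat(Date.now());
//   setInterval(() => {
//     if (monitor.isStale(Date.now())) socket.close();
//   }, 5000);
```

On the client, a stale result would trigger a reconnect; on the server, it would simply close the connection and free its resources.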
Have an HTTP fallback
WebSocket connections can be unexpectedly blocked, especially on restrictive public networks. To mitigate such issues, Compose uses server-sent events (SSE) as a fallback for receiving updates, while HTTP requests handle client-to-server communication.
Since SSE is HTTP-based, it's much less likely to be blocked, providing a reliable alternative in restricted environments. It also still achieves reasonably low latency, especially compared to short-polling solutions.
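For the server-to-client half of such a fallback, updates have to be written in the SSE wire format: one or more `data:` lines, an optional `event:` line, and a blank line to end the event. The helper below is a minimal sketch of that framing; `formatSseEvent` is a hypothetical name, and how the result gets written to the response (with a `Content-Type: text/event-stream` header) depends on your HTTP framework.

```typescript
// Serializes one update as a server-sent event.
// Multi-line payloads need a separate "data:" field per line.
function formatSseEvent(data: string, eventName?: string): string {
  const lines: string[] = [];
  if (eventName !== undefined) {
    lines.push(`event: ${eventName}`);
  }
  for (const line of data.split(/\r\n|\r|\n/)) {
    lines.push(`data: ${line}`);
  }
  // A blank line terminates the event.
  return lines.join("\n") + "\n\n";
}
```

In the browser, an `EventSource` delivers unnamed events to `onmessage`, while named events need `addEventListener` with the matching event name. Either way, `event.data` carries the same string the WebSocket path would have delivered, so the rest of the client can stay transport-agnostic.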
Concluding thoughts
There's a whole lot more to scaling WebSockets that we didn't cover here. For example:
- Lack of standard tooling: While most frameworks include built-in tools for rate limiting, data validation, and error handling, you'll generally have to implement these features on your own for WebSockets.
- Inability to cache responses: Edge networks make it easy to cache HTTP responses close to users, but there's no standard way to accomplish this with WebSockets.
- Per-message authentication: To guard against abuse, each message has to be checked as valid for that user before it's processed.
But regardless of the complexity, users expect modern applications to be fast, real-time, and collaborative. And, as of now, there's no better way to achieve that than WebSockets.
At Compose, WebSockets power the entire platform - from the database all the way to the main UI thread. Via our SDKs, developers can generate full web apps from their backend logic. Making sure those apps are fast and performant at scale requires WebSockets. If you're interested in learning more, check out our docs. It takes less than 5 minutes to install the SDK and build your first app.