
The Advanced Challenge of Load Balancing

Not all traffic can be arbitrarily routed.

Once a service matures, a single server may no longer be able to handle the entire workload. Three main considerations drive this: performance, availability, and cost.

To handle the growing traffic, we need to increase request-processing capacity, which can be achieved by either scaling up or scaling out.

Photo by Piret Ilver on Unsplash


However, machines fail, and top-specification hardware may be unaffordable. Spreading the workload evenly across multiple commodity workers is usually the more feasible solution.

This seems pretty natural… right?

Yes, but not quite.

The architecture works for most stateless APIs, but the newly added "spread" behavior introduces uncertainty for stateful interactions.


Any technique that makes servers store "state", so that they behave differently from one another, may conflict with the general load-balancing architecture above.

Sessions in a web service are a classic case.

Because the HTTP protocol is stateless by design, context information such as login status and the shopping cart is stored on the server side for a period of time, so that users experience continuity across operations.

It is just like ordering takeout in the real world: customers receive a number plate after ordering at the counter, and they pick up the meal with that plate when it is ready.

Photo by Brooke Cagle on Unsplash


Ordering a meal and picking it up are two independent steps for the customer, yet logically continuous: the counter already stores who they are and what they ordered, and that data can be retrieved with the number plate they present.
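The number-plate analogy maps directly to server-side sessions. Here is a minimal sketch in Python; the names (`sessions`, `create_session`) are illustrative and not from any particular framework:

```python
import secrets

# In-memory session store: the "counter" that keeps track of number plates.
sessions = {}

def create_session(user_data):
    """Issue a 'number plate': a random session id mapped to server-side state."""
    session_id = secrets.token_hex(16)
    sessions[session_id] = user_data      # e.g. login status, shopping cart
    return session_id                     # sent back to the client as a cookie

def get_session(session_id):
    """A later request presents the id; the server restores the context."""
    return sessions.get(session_id)

sid = create_session({"user": "alice", "cart": ["meal #42"]})
print(get_session(sid))                   # the counter remembers the order
```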

Sticky Sessions

Well, most of the time a number plate is only valid at the shop that issued it; we cannot take the plate we got at store A and ask store B for our meal.

The same principle applies to web services, because session data may not be shared between servers. There are two main directions to solve this problem:

  1. Share session data among all web servers via external storage (e.g. a Redis cluster)
  2. Force all requests from the same client to be dispatched to the same server.

The first is a bit more complex and out of scope for this story; I will focus on the second solution here.

Maintaining a mapping between clients and web servers lets us forward each client to the same server it connected to last time, the one currently holding its session context.

This can be achieved with various client-side identifiers, such as IP addresses and cookies. Many well-known load-balancing solutions offer this option through different approaches, including AWS NLB/ALB, GCP Cloud Load Balancer, and Envoy from the CNCF.
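As a rough illustration of the idea (not how any particular load balancer implements it), a client identifier can be hashed to pick a backend deterministically. The backend addresses here are made up:

```python
import hashlib

# Hypothetical backend pool; real load balancers (ALB, Envoy, ...) manage
# this mapping internally via cookies or IP hashing.
BACKENDS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]

def pick_backend(client_key: str) -> str:
    """Map a client identifier (IP address or cookie value) to a backend.
    The same client always lands on the same server, where its session lives."""
    digest = hashlib.sha256(client_key.encode()).digest()
    return BACKENDS[int.from_bytes(digest[:8], "big") % len(BACKENDS)]

# The mapping is stable across requests:
assert pick_backend("203.0.113.7") == pick_backend("203.0.113.7")
```

Note that plain modulo hashing reshuffles most clients when the pool size changes; consistent hashing is the usual remedy, but that is beyond this sketch.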

Photo by Benedetta Pacelli on Unsplash


However, enabling sticky sessions is equivalent to adding a hard rule that may conflict with traffic balancing. For example, when handling a vast number of client requests that share the same IP address, IP-based sticky sessions may be a poor choice, since they can put some servers under a heavy workload while others sit idle.


In modern software services, there are many situations that require real-time updates of information, such as stock market transactions, online games, and chat rooms.

A polling strategy, driven by periodic requests from the client side, is not economical: we pay the cost of a TCP connection for every request, and very often there is no new information to fetch.

WebSocket, with its full-duplex communication, is a solution worth trying. It allows the server side to actively push messages to the client side, effectively avoiding meaningless requests.
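A back-of-the-envelope comparison, with made-up numbers, shows where polling wastes effort:

```python
# Toy comparison: a client polls once per second for a minute,
# but only 5 updates actually happen in that window.
POLL_INTERVAL_S = 1
WINDOW_S = 60
REAL_UPDATES = 5

polling_requests = WINDOW_S // POLL_INTERVAL_S        # 60 round trips
empty_responses = polling_requests - REAL_UPDATES     # 55 carried no news

push_messages = REAL_UPDATES                          # push: one message per update

print(polling_requests, empty_responses, push_messages)  # 60 55 5
```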

However, a long-lived connection between the client and the server is maintained after the first request…

We are actually balancing the connections, not the workloads.

Any problem with this?
The main risk is that the workload cannot be properly distributed among multiple machines:

  • Different connections carry different workloads at the same time. On a chat server, a user with many friends may generate a much heavier workload than a user with none (so sad…).
  • The same connection carries different workloads at different times. An obvious example is a game server, especially a large open-world RPG. While a character browses and trades items in the market, the required data transmission may be small, but the workload can spike in an instant when many characters move and cast skills during a guild war.
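A quick simulation illustrates the first point: a balancer that spreads connections evenly can still spread work very unevenly. The Pareto-distributed per-connection load is an assumption chosen to mimic a few very busy users:

```python
import random

random.seed(42)

SERVERS = 4
CONNECTIONS = 1000

# Hypothetical per-connection load: most users are quiet, a few are very busy
# (the "user with many friends" case).
loads = [random.paretovariate(1.5) for _ in range(CONNECTIONS)]

# Round-robin assignment: every server gets exactly the same connection count.
work = [0.0] * SERVERS
for i, load in enumerate(loads):
    work[i % SERVERS] += load

print("connections per server:", CONNECTIONS // SERVERS)
print("work per server:", [round(w) for w in work])
# Connection counts are identical, but the work per server is not.
```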

Photo by Chanhee Lee on Unsplash


To avoid a single server carrying too much workload, there are two main directions to solve this problem:

  • Lower the metric threshold that triggers cluster scale-out
  • Reshape the Traffic by Reconnecting

Lower the metric threshold that triggers cluster scale-out

Suppose that, after a period of observation, we find that the workload of the same group of WebSocket connections during rush hours is about twice the usual level. We can then do some simple calculations:

  • Setting 70% resource usage as the scale-out trigger is very dangerous: ordinary traffic fluctuations can easily overload a server.
  • Setting 50% resource usage as the trigger can be a great choice: in our past experience, servers then stayed within their resources more than 99% of the time.
  • Setting 30% resource usage as the trigger is not very economical: more than half of the resources sit idle outside peak hours.
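The arithmetic behind these three choices can be written down explicitly. This sketch assumes the 2x peak multiplier from the example; `survives_peak` is an illustrative name:

```python
# Back-of-the-envelope check: if load can double at peak, the worst case
# just before scaling out is trigger * 2. It must not exceed 100% capacity.
PEAK_MULTIPLIER = 2.0

def survives_peak(trigger_pct: float) -> bool:
    return trigger_pct * PEAK_MULTIPLIER <= 100

for trigger in (70, 50, 30):
    verdict = "safe" if survives_peak(trigger) else "risky"
    print(f"scale out at {trigger}% -> {verdict}")
# 70% -> risky (140% worst case); 50% and 30% -> safe,
# but 30% leaves most capacity idle off-peak.
```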

This is just a simple case for the sake of explanation; the right thresholds depend heavily on the traffic shape and business nature of your service.

Photo by Marcus Castro on Unsplash


If the service only encounters traffic peaks during holidays, we can set 70% resource usage as the upper bound on weekdays and increase the trigger's sensitivity only before holidays.

Honestly, I don't recommend this practice as a long-term solution. Although it is relatively simple and takes effect immediately, it does not fundamentally solve the unbalanced workload; it just makes hardware costs grow faster.

Reshape the Traffic by Reconnecting

Another, more solid practice is to reshape all traffic through reconnection. To a certain extent it overcomes the balancing failure of long-lived connections, but it also brings new challenges for user experience and service resilience.

Every reconnection is not only an opportunity but also a risk.
The timing of reconnection is a critical issue; it directly affects both the effectiveness of load balancing and the user experience. Below are several commonly used strategies:

Reconnect Periodically

This is one of the most intuitive methods: with an appropriate time interval, it can almost guarantee effective workload balancing. Unfortunately, the brute force of this hard rule can devastate the client's experience of the service.
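A sketch of this strategy follows. The interval values are arbitrary, and jitter is added so that all clients do not reconnect at the same instant (a thundering herd):

```python
import random

BASE_INTERVAL_S = 30 * 60     # reconnect roughly every 30 minutes (arbitrary)
JITTER_S = 5 * 60             # +/- 5 minutes of randomness

def next_reconnect_delay() -> float:
    """Seconds until the client should drop and re-establish its connection."""
    return BASE_INTERVAL_S + random.uniform(-JITTER_S, JITTER_S)

delay = next_reconnect_delay()
assert BASE_INTERVAL_S - JITTER_S <= delay <= BASE_INTERVAL_S + JITTER_S
```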

Photo by JOHN TOWNER on Unsplash


Product managers can easily make a list of situations that should not be interrupted:

  • Users are shopping without a thought for budget control
  • Users are filling in the information required for a purchase order
  • Users are playing a player-versus-player (PvP) arena match in an e-sports game
  • Users are playing a card game with a time limit for each round, and the clock is counting down

There is no doubt that disturbing users when they expect fluent operation is terrible.

Choose the right occasion to reconnect

Since we can list many situations that are unsuitable for reconnecting, there must, conversely, be some that are suitable. More precisely, we can make good use of the moments when users accept, and even expect, to wait:

  • Starting to watch a new live stream in a social app
  • Requesting a massive amount of data or downloading a file
  • Teleporting between maps in an open-world RPG

When users are prepared to wait, they are less concerned about waiting a little longer, and they will not even notice that you are quietly reconnecting. Even if the reconnection unfortunately fails, it does not interrupt a continuous operation or cause much negative impact.
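One way to sketch this in code: the balancer flags a client for rebalancing, but the client only actually reconnects at the next "safe" event. `Client` and the event names are made up for illustration:

```python
# Events during which the user already expects to wait (illustrative names).
SAFE_EVENTS = {"map_teleport", "file_download", "stream_start"}

class Client:
    def __init__(self):
        self.reconnect_pending = False

    def request_rebalance(self):
        # The balancer asked us to move; do not drop the socket yet.
        self.reconnect_pending = True

    def on_event(self, event: str) -> str:
        # Piggyback the reconnect on a natural waiting moment.
        if self.reconnect_pending and event in SAFE_EVENTS:
            self.reconnect_pending = False
            return "reconnecting"
        return "keep-alive"

c = Client()
c.request_rebalance()
print(c.on_event("pvp_match"))     # keep-alive: never mid-fight
print(c.on_event("map_teleport"))  # reconnecting: the user is waiting anyway
```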

Wrap Up

A good user experience is often regarded as the holy grail, and the techniques and strategies required to realize a product's ambitions can be astonishing.

I hope that sharing my experience with load-balancing architectures and strategies helps you handle similar challenges in software engineering in the future :)
