ChunTing Wu

Scaling Web Service

This article is an internal training document for my team, explaining to newcomers which practices to apply and which aspects to consider when scaling a web service.

The scaling mentioned in this article is divided into several different levels.

  1. Read load
  2. Write load
  3. Data volume
  4. Task load
  5. User distribution

In addition, there is another level of scaling, feature scaling, that is not explained in detail in this article, but I will briefly describe the concept at the end.

Next, let's evolve our system step by step.

Bare Metal

All products start with a single machine. For a proof of concept, we take the simplest approach and put everything on the same box, perhaps the laptop at hand. Whether it's a physical machine, a virtual machine, or a container, a single machine with everything on it looks like this.

(Diagram: a single machine hosting both the API and the database)

Users can use this basic service either through the web or through a mobile device. This machine contains the APIs for business logic and a database for data storage. This is where all projects begin and is the easiest to implement.

Once the concept has been validated, the number of users starts to rise, easily exceeding what a single machine can handle. At this point, we move the service from a laptop to a professional server to verify that the project will keep growing and is not a flash in the pan. Upgrading the hardware specification is called vertical scaling, a.k.a. scale-up.

However, even for professional servers, where hardware can be easily upgraded and swapped out, there are limits. The limit here is not only a physical limit, but also a budget limit. For example, a 1TB SSD drive is more than twice as expensive as a 512GB drive. The curve of increasing hardware costs is exponential.

Therefore it is necessary to separate the components to make the cost more manageable, especially for the database, which is usually the most costly component.

Layered Architecture

(Diagram: the API and the database separated onto their own machines)

When we separate the database from the API, we are able to upgrade their hardware individually: instead of scaling up a single machine, we scale up individual components.

The database usually requires high-end hardware across the board, but the API does not. The main reason is that as users grow, the API needs a stronger CPU to handle more traffic, while the other resources are not as urgent. Upgrading CPUs runs into the same problem as hard drives: an exponentially increasing cost curve.

In other words, it is much more cost-effective to increase the number of CPUs than the size of each CPU. So, to handle the increased traffic, we usually adopt the strategy of increasing the number of API instances.

Horizontal Scaling

(Diagram: a load balancer distributing traffic across multiple API instances in front of a single database)

In order to scale the number of API instances smoothly, we need a new role to distribute all incoming traffic, called a load balancer. The load balancer assigns traffic based on its algorithm; common algorithms are RR (Round Robin) and LU (Least Used), but I suggest going for the simplest one, RR, because a complex algorithm puts extra load on the load balancer and turns it into another kind of bottleneck.
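
As a rough illustration, round robin simply cycles through the available API instances. Here is a minimal sketch, assuming a hypothetical `backends` list; a real load balancer such as Nginx or HAProxy does this for you.

```python
from itertools import cycle

# Hypothetical pool of API instances behind the load balancer.
backends = ["api-1:8000", "api-2:8000", "api-3:8000"]

# Round robin: hand out backends one after another, wrapping around.
rr = cycle(backends)

def pick_backend() -> str:
    """Return the next backend in round-robin order."""
    return next(rr)

# Each incoming request is simply forwarded to the next instance.
for request_id in range(6):
    print(request_id, "->", pick_backend())
```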

When usage changes, the number of API instances can be adjusted dynamically: increasing the number when usage goes up is called scale-out, and the opposite is called scale-in.

So far, the APIs can be scaled out to handle the increasing traffic, but as the diagram above shows, there is another bottleneck: the database.

When usage increases to a certain amount, a single database will not be able to handle it efficiently, resulting in an overall increase in response time. The result is a poor user experience and possibly even a failure of functionality. There are many articles on how page speed affects web user experience, so I won't go into detail here.

In order to solve the database bottleneck, we would also like to scale the database horizontally. However, unlike stateless APIs, databases are usually stateful and therefore cannot simply be scaled out.

Read Write Splitting

To enable the database to be scaled horizontally, a common practice is called Read/Write Splitting.

(Diagram: writes go to the primary database, which replicates to multiple read replicas)

First, we keep all writes on the same database entity to maintain the database's state. Reads, on the other hand, go to entities that can scale horizontally. Splitting reads and writes makes it easier to scale out the database.

When there are data updates, the primary is responsible for replicating the changes to each of the read entities. In this way, the read entities can be scaled out based on usage.

This is seen in several common databases, such as MySQL Replication and MongoDB ReplicaSet.
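
To make the idea concrete, here is a minimal sketch of how an application might route queries, assuming one primary connection and a pool of replica connections; the connection objects are placeholders, not a specific driver API.

```python
import random

class ReadWriteRouter:
    """Send writes to the primary, spread reads across replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary      # single writable entity
        self.replicas = replicas    # horizontally scalable read entities

    def execute_write(self, statement):
        # All writes go to the primary, which replicates to the replicas.
        return self.primary.execute(statement)

    def execute_read(self, statement):
        # Reads can go to any replica; pick one at random here.
        return random.choice(self.replicas).execute(statement)
```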

Nevertheless, there is a problem: to handle more reads, we have to create more replicas of the database, which is a challenge for the budget. Because of the high hardware specifications of databases, creating replicas is a very high price to pay.

Is there a way to support the traffic and save cost at the same time? Yes, caching.

Caching

There are several types of caching practices, the common ones are:

  1. Read-aside cache
  2. Content Delivery Network, CDN

There are advantages and disadvantages to both approaches, which will be briefly analyzed below.

Read-aside Cache

(Diagram: the API reads from a cache placed beside the database)

In addition to the original database, we place a cache beside it. The read flow is:

  1. first read the data from the cache.
  2. if the data does not exist in the cache, read it from the database instead.
  3. then write back to the cache.

This is the most common caching scenario. Depending on the nature of the data, a different time to live (TTL) is set for the cached data.
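
Here is a minimal read-aside sketch, using an in-memory dictionary with a TTL to stand in for a real cache such as Redis; `fetch_from_db` is a placeholder for the actual database query.

```python
import time

cache = {}          # key -> (value, expiry timestamp)
TTL_SECONDS = 60    # how long a cached entry stays valid

def fetch_from_db(key):
    # Placeholder for the real database read.
    return f"value-for-{key}"

def read(key):
    entry = cache.get(key)
    # 1. Try the cache first.
    if entry is not None and entry[1] > time.time():
        return entry[0]
    # 2. On a miss (or an expired entry), read from the database instead.
    value = fetch_from_db(key)
    # 3. Write the result back to the cache with a TTL.
    cache[key] = (value, time.time() + TTL_SECONDS)
    return value
```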

CDN

Another caching practice is CDN.

(Diagram: a CDN placed in front of the load balancer, replying to cached requests directly)

Instead of building a cache ourselves, a new component is placed in front of the load balancer. When a read request comes in, the CDN first determines whether it already has cached data based on the configured rules. If it does, it replies to the request directly. Otherwise, the request follows the original path, and the data is cached when the response passes back through the CDN so that it can be replied directly next time.

Compared with the read-aside cache, we can see from the diagram that there are fewer lines in the CDN approach. Fewer lines mean less complexity in the application implementation, making it easier to adopt. After all, the CDN only needs its rules configured, and the application does not need to be changed at all.


One potential problem with caching is data inconsistency. When data is updated in the database but the cache's TTL has not expired, the cached data is not updated, and the user may see stale results.

With a read-aside cache, when the API updates the database it can delete the corresponding cache entry at the same time, so the next read fetches the latest data. CDNs, on the other hand, have to be invalidated through the APIs provided by each vendor, which is more complicated in practice.
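
For the read-aside case, invalidation can be as simple as deleting the key whenever the API writes, so the next read repopulates the cache with fresh data. This continues the in-memory sketch above; `write_to_db` is a placeholder.

```python
def write_to_db(key, value):
    # Placeholder for the real database write.
    pass

def update(key, value):
    # Write the new value to the database first...
    write_to_db(key, value)
    # ...then drop the stale cache entry so the next read reloads it.
    cache.pop(key, None)
```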

Writing Bottlenecks

Caching and read/write splitting have been effective in handling the increasing amount of read requests, but as you can see from the above diagrams, there is still no effective way to deal with the large number of writes to the database. To solve the write bottleneck, there are many different mechanisms that can be applied, and two common approaches are listed below.

  1. Master-master replication
  2. Write through cache

These two approaches are orthogonal; they have completely different practical considerations and are difficult to compare with each other. Let's analyze each of them.

Master-master Replication

(Diagram: multiple database entities, each accepting both reads and writes and replicating to the others)

Although each API in the diagram above has a corresponding database entity, the APIs are not actually locked to a particular entity; it depends on the database implementation and configuration in use.

Compared with read/write splitting, both reads and writes can be executed on the same database entity. Writes on any entity are synchronized to the other entities.

A typical example of master-master replication is Cassandra, which is very scalable for writes and can support a very large number of simultaneous writes. MySQL also supports master-master replication, but one major problem is that its replication is performed asynchronously in the background, which sacrifices MySQL's most important feature: consistency.

If master-master replication is needed, PACELC must be taken into account.

Master-master replication is a technique that tolerates network partitions, so you can only choose between consistency and availability. MySQL can achieve consistent master-master replication with external frameworks such as Galera Cluster, which in return introduces high latency.

Therefore, when choosing master-master replication to improve write performance, it is important to consider whether the usage scenario is appropriate. If you find yourself forcing such a technique onto MySQL, it would be better to switch to another database, e.g. Cassandra.

However, changing the database is a huge effort. Is there a way to solve the write bottleneck without changing the database? Yes, through caching once again.

Write through Cache

(Diagram: the API writes to a cache, which batches writes into the database)

Before the API writes data into the database, it writes all the data into the cache. After a period of time, the contents of the cache are written into the database in batches. In this way, a large number of write operations can be turned into a small number of batch operations, which effectively reduces the load on the database.

The read operation can work like the read-aside cache, reading from the cache first and then the database, so that on the one hand we get the most recent updates and on the other hand we reduce the read load on the database.
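
A minimal sketch of the buffering idea follows, assuming a hypothetical `db_batch_write` that persists many rows at once; in production this role is usually played by a cache such as Redis plus a background flusher.

```python
import threading

write_buffer = {}               # key -> latest value waiting to be flushed
buffer_lock = threading.Lock()

def db_batch_write(items):
    # Placeholder for a single bulk write to the database.
    print(f"flushing {len(items)} rows in one batch")

def write(key, value):
    # The write is absorbed by the cache; the database is not touched yet.
    with buffer_lock:
        write_buffer[key] = value

def flush():
    # Run periodically: many small writes become one batch operation.
    global write_buffer
    with buffer_lock:
        pending, write_buffer = write_buffer, {}
    if pending:
        db_batch_write(pending)

write("user:1", {"name": "alice"})
write("user:2", {"name": "bob"})
flush()   # -> flushing 2 rows in one batch
```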

Write-through caching is a completely different design pattern from master-master replication: it scales database performance through architectural changes while keeping the original database. Although the application avoids the effort of changing databases, the write-through cache also introduces additional complexity.

We've covered how the entire web service deals with heavy traffic, from scale-up to scale-out, from API to database, but the story doesn't end here. We have overcome the traffic problem, but when the volume of data in the database is very large, the performance of the database, both in terms of reads and writes, is seriously affected. Given the large volume of data, even a few reads can take up a huge amount of resources, which leads to database performance problems.

Sharding

Since the amount of data is too large for a single database, the solution is to spread the data evenly across several database entities, which is the concept of sharding.

(Diagram: data spread across several database clusters by shard key)

Although only one database is shown in each of the above clusters, all three are actually database clusters, and the read/write splitting or master-master replication mentioned earlier can be applied within them. Furthermore, even though each API corresponds to a database cluster in the diagram, it does not mean that an API can only access a specific cluster. For example, API2 can also access Cluster3.

Sharding is the technique of dividing a large dataset into several smaller datasets. By using a pre-defined index such as a shard key (MongoDB) or partition key (Cassandra), the data is distributed to the corresponding database entities. Thus, from the application's point of view, a specific piece of data lives in a specific database entity, and if the data is spread evenly enough, the load on each individual entity is 1/N, N being the number of clusters.
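
A minimal sketch of routing by shard key, assuming `clusters` is a list of placeholder cluster handles and that hashing the key spreads data evenly; real systems such as MongoDB and Cassandra handle this routing for you.

```python
import hashlib

clusters = ["cluster-1", "cluster-2", "cluster-3"]   # placeholder handles

def pick_cluster(shard_key: str) -> str:
    """Map a shard key to one of N clusters, so each holds roughly 1/N of the data."""
    digest = hashlib.sha256(shard_key.encode()).hexdigest()
    return clusters[int(digest, 16) % len(clusters)]

print(pick_cluster("user-42"))   # the same key always lands on the same cluster
```

If many keys map to the same cluster, for example because the shard key has low cardinality, that cluster becomes the hot spot described below.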

Nevertheless, if the data is not spread out enough and is overly concentrated in one database, the point of sharding is lost. This is called a hot spot, and it not only causes performance to drop but also wastes money on redundant databases. As I mentioned earlier, databases are expensive, and it would be very wasteful to keep a redundant one around.

Once we have overcome the problems of scaling traffic and data volume, the next challenge is: what do we do when the tasks to be performed become heavy enough to affect the performance of the API and the database? The answer is to break up the tasks.

Messaging

When the tasks run by the API become large, the API's response time is significantly affected. To reduce the response time for large tasks, the most common approach is to perform synchronous tasks asynchronously, and sometimes even to divide large tasks into several smaller ones.

The asynchronous approach has actually been mentioned in many of my previous articles, that is, the Event-Driven Architecture.

(Diagram: the API publishes tasks to a message queue consumed by workers)

The API sends tasks to the message queue, which are executed by workers behind the queue. As the number of events grows, the number of workers can be scaled horizontally depending on the number of pending events.
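
A minimal in-process sketch of the idea, using Python's standard `queue` module to stand in for a real message broker such as RabbitMQ or Kafka: the API only enqueues the task and returns immediately, and workers can be scaled out by starting more consumers.

```python
import queue
import threading

tasks = queue.Queue()

def api_handler(payload):
    # The API enqueues the task and responds immediately.
    tasks.put(payload)
    return "accepted"

def worker():
    while True:
        payload = tasks.get()
        # Placeholder for the actual long-running task.
        print("processing", payload)
        tasks.task_done()

# Scale out by starting more workers as the backlog grows.
for _ in range(2):
    threading.Thread(target=worker, daemon=True).start()

api_handler({"job": "resize-image"})
tasks.join()   # wait until all pending tasks have been processed
```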

The details of the event-driven architecture are covered in my previous articles, so I won't dive into them here.

Finally, when the project has become successful, the users are all around the world. If there is only one data center, users in areas that are physically far away will notice the latency, which hurts user acceptance. So, how do we address the scalability of user distribution?

Edge Computing

Edge computing means placing the system as close to the user as possible for a good user experience.

(Diagram: three data centers, each serving users in its own region)

Therefore, for three distant regions, we can set up three different data centers to provide better efficiency for users in each region.

Nevertheless, when we need to perform data analysis, we will always need data from all three regions. From a data analysis perspective, we need one logically unified database, not three physically independent databases.

In other words, how do we keep the database as close to the user as possible and still have a unified entry point? There are several options.

  • DB sharding: This is a relatively simple approach in practice. Just use the region as the shard key and create a database shard in each region. There is still a unified entry point, which in the case of MongoDB is mongos.
  • DB master-master replication: Master-master replication also allows different database entities to be placed in different regions, but because of the physical distance the replication is not very efficient, so the synchronization lag is a potential problem.
  • Data ETL: Data is extracted from the regional databases, transformed, and loaded into a unified data store. This is the most common approach for data analysts, since it does not change the database structure of the original application, allows the data to be pre-processed, and lets analysts choose a data store they are familiar with (a minimal sketch follows this list).
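
A minimal ETL sketch, assuming hypothetical `extract_from_region`, `transform`, and `load_into_warehouse` helpers; a real pipeline would use the drivers of the regional databases and the analytics store.

```python
REGIONS = ["us", "eu", "apac"]

def extract_from_region(region):
    # Placeholder: query the regional database for the rows to analyze.
    return [{"region": region, "orders": 100}]

def transform(rows):
    # Placeholder: normalize schemas, convert currencies, de-duplicate, etc.
    return rows

def load_into_warehouse(rows):
    # Placeholder: bulk-insert into the unified analytics store.
    print(f"loaded {len(rows)} rows")

def run_etl():
    # Pull from every region into one logically unified store.
    for region in REGIONS:
        load_into_warehouse(transform(extract_from_region(region)))

run_etl()
```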

Conclusion

This article analyzes how to scale web services from various perspectives. From the beginning of the project, the API is horizontally scaled to handle the large number of incoming requests until the database becomes a bottleneck. The database cannot be scaled horizontally in such a simple way, so techniques such as read/write splitting and caching are used to reduce its load. If the dataset is very large, we still need to adopt sharding and other methods to split the dataset. Finally, if the users are worldwide, it is necessary to establish multiple data centers to physically distribute the workload.

However, while this article is focused on explaining infrastructure-related topics, there is another kind of service scaling that is actually very important, and that is the scaling of feature requirements.

As a project moves towards success, more and more feature requests will be made, so how do we respond quickly to feature scaling? The most common approach nowadays is microservices, but microservices have their own problems to face; in my previous article, I introduced the aspects of designing a microservice architecture that are worth considering.

This article serves as internal training for my team members; because the target audience is junior engineers, it does not go into too much detail on each topic. If you are interested in any of the topics, please let me know and I will analyze those techniques in depth.
