DEV Community


Posted on • Updated on

Assess how many Kafka servers are needed to face a scenario of 1 billion requests.

Requirements scenario analysis

Architecture design according to the production environment, in a scenario-driven way, such as what kind of scenario is encountered in the company and what kind of scale of a cluster needs to be set up.
Kafka clusters, HBase clusters, Hadoop clusters, cluster evaluation is similar.

Let's take Kafka for example:

E-commerce platform, 1 billion requests are sent to the Kafka cluster every day. According to the 80-20 rule, the general assessment of the problem is not big.

For example, if our Kafka cluster is facing approximately 1 billion requests a day, then based on the 24 hours of the day, there is not a lot of data between 12:00pm and 8:00 am on a typical day. Eighty percent of requests take another 16 hours to process.

Witch means we need to use 16 hours to process 800 million requests。

According to the actual situation, we have hot time every day, that is, the number of requests received during this time is the most, then according to the 80-20 rule, 16*0.2=3 hours processing 80% of the data of 800 million requests.

So we got a preliminary figure, which is that in three hours, we processed about 600 million requests

So let's just do a simple calculation of QPS during the peak period

600 million requests / 3 hours = 55,000/s
Enter fullscreen mode Exit fullscreen mode

For the size of a message, let's assume 50KB. Of course, 50KB does not mean the size of a message, as we usually combine multiple logs into one message. (Typically, the size of a log message is a few bytes.)

So we can get the following formula

1Billion Requests  50KB = 46T
Enter fullscreen mode Exit fullscreen mode

That means we need to store 46 terabytes of data per day.

Given that Kafka has replicas in it, we would normally set up two replicas: 46T * 2 = 92T

At the same time, the data in Kafka has a time period of retention. It is assumed that the data of the last 3 days is retained 92T * 3 days = 276T

Scenario summary

1 billion requests, 55,000 QPS at peak, 276T of data.

Evaluation of physical machine quantity

First, analyze whether you need a virtual machine or a physical machine

When we build clusters like Kafka, MySQL, Hadoop, we use physical machines in our production. Because of the performance requirements.

According to the above calculation, the peak number of requests needed to handle is 55,000 per second, in fact, one or two physical machines can withstand. Normally, when we evaluate a machine, we evaluate it at four times the peak.

If it's 4 times, maybe our cluster can afford about 200,000 QPS. That way our cluster is more secure.

About 5 physical machines, each capable of 40,000 requests per second

Scenario summary

1 billion requests, 55,000 QPS at peak, 276T of data. You need five physical machines.

The disk selection

The evaluation is based on the following two aspects: SSD solid-state drives, or ordinary mechanical drives?

SSD drives: better performance, but more expensive

SAS disk: some performance is not very good, but cheaper

The performance of SSD hard disk is better, which means its performance of random read and write is better. Suitable for clusters like MySQL. But its sequential write performance is similar to that of an SAS disk.

Kafka, as we understand it, is written in sequential write. So we can just use a normal mechanical hard disk.

Then need to estimate how many disks per server needs?

Five servers, a total of 276T is needed, and about 60T of data needs to be stored on each server.

Assume that the company's servers can be configured with 11 hard disks, each of disk is 7T.

11 * 7T = 77T
77T * 5 servers = 385T
Enter fullscreen mode Exit fullscreen mode

The estimate is that the disk usage is approximately 80% of the total disk volume

Scenario summary

a billion requests, 5 physical machines, 11 SAS hard disks with a capacity of 7T

Memory assessment

Evaluate how much memory is required.

We found that Kafka's process of reading and writing data is based on OS Cache. In other words, given that our OS cache is infinite, the entire Kafka operation is based on memory, and the performance must be very good.

But memory is limited

You need to reserve some resources for the OS cache

Kafka's core code is written in Scala, and the client code is written in Java. It's all JVM based. So we have to give some memory to the JVM. In Kafka's design, there are not many data structures in the JVM. So we don't need a lot of memory for JVM. As a rule of thumb, you can reserve 10GB of memory resources for the JVM.

Let's say the project of 1 billion requests that has 100 topics.

100 topic * 5 partition * 2 replicas = 1000 partition. A partition is a directory on the physical machine that contains a number of .log files.

.log is the file that stores the log data. By default, a.log file is 1 gigabyte in size.

We just need to keep 25% of the data that's currently in the latest .log file in memory.

250M * 1000 = 0.25G * 1000 = 250G of memory.
250GB of memory / 5 servers = 50GB of memory
50G + 10G = 60G memory
Enter fullscreen mode Exit fullscreen mode

Overall, 64 gigabytes of memory is about right. Because we have to reserve 4 gigabytes of memory for the OS. Of course, if you can get a server with 128 gigabytes of memory, that would be great.

At the beginning of the evaluation, we assumed that there would be 5 partitions in a topic. If it was a topic with a large amount of data, there might be 10 partitions. Then only need to follow the above process, and calculate.

Scenario summary

a billion requests, 5 physical machines, 11 SAS hard disks with a capacity of 7T, requires 64GB of memory (128GB is better)

CPU stress assessment

Evaluate how much CPU core is required per server (resources are limited). The evaluation is based on how many threads we have in our service to process.

Assess how many threads a Kafka server will have once it starts up.

Acceptor thread                    1
Processor threads                  6-9
threads handling the request       32
Threads for cleaning up
Leader partition -> follower partition, the threads to pull data
ISR mechanism (ID number), mechanism to check ISR list periodically
So, once a Kafka service is up and running, there are about a hundred threads.
Enter fullscreen mode Exit fullscreen mode

If the CPU has four cores, a few dozen threads will normally fill the CPU.

If the CPU core count is 8, it should easily be able to support dozens of threads.

If we have more than 100 threads, or something like 200, then 8 CPU cores are not going to work. The recommended number of CPU cores is 16. If possible, 32 CPU cores are the best.


For a Kafka cluster, a minimum of 16 CPU cores should be given, and 32 CPU cores would be better.

2cpu * 8 = 16 cpu core
4cpu * 8 = 32 cpu core
Enter fullscreen mode Exit fullscreen mode

Scenario summary

a billion requests, 5 physical machines, 11 SAS hard disks with a capacity of 7T, requires 64GB of memory (128GB is better), Requires 16 CPU cores (32 is better)

Network Requirements Assessment

Evaluate what kind of network card we need

We basically have two choices: usually either gigabit (1G/s) or 10 gigabit (10G/s) network card

At its peak, there were 55,000 requests pouring in per second, which is 5.5/5 = about 10,000 requests per server

We said earlier that 10,000 requests * 50KB = 488M, which is 488M data per second per server. The data must be duplicated, and synchronization between duplicates is also required by the network. 488 * 2 = 976M/s

Note: In many companies, a request is not as large as 50KB. In our company, it is because the host encapsulates the data at the production end and then combines multiple data together, so a request can be so big.

In general, the broadband of the network card is not up to the limit, if it is a gigabit network card, we can generally use is about 700M.

But if, at best, we use a ten-megabit network card, that's easy.

Top comments (0)