Welcome, ladies and gentlemen, to tonight's main event! In the left corner, we have the reigning champion - a highly scalable, highly performant, high-throughput technology known as "The Unstoppable Force" Kafka.
In the right corner, we have the challenger - a software engineer with about 5 years of professional experience, armed with terrible configuration choices and a questionable understanding of performance optimization. He goes by the nickname "The Immovable Object" Dimitar Ivanov.
What's going to happen when they clash in the middle of the ring?
Spoiler alert: I win. Very, very easily.
The goal of this exercise was mainly to satisfy my own curiosity and find out just how badly I can make Kafka perform, a technology that has become the holy grail of low-latency systems. I will mainly be focusing on consuming events, not producing them. This entire article consists of me tearing Kafka apart, so if you're interested in Kafka internals, this is the right place for you.
This is not for beginners; I'm assuming you already know what a producer, consumer, topic, and broker are.
Initial Kafka setup:
broker:
- 2 Kafka brokers managed by Zookeeper
topic:
- partitions: 10
- A Kafka topic is split into partitions; the idea is to distribute the data in the topic. These partitions give Kafka its scalability, as consumers aim to create consumer threads equal to the number of partitions and poll messages in parallel from different partitions. Imagine a partition as an append-only pipe of data: once a message is published, it cannot be changed.
- replication factor: 2
- This means that the leader partition will have its data replicated on the second broker. I'm doing this to make Kafka put resources into syncing the replica. By default the producer HAS to wait for the data to be replicated before continuing (acks=all, the default since Kafka 3.0).
producer:
- The producer is really just a loop over the desired number of messages; each iteration publishes a new message to Kafka.
It produces a message in String format, where the value is "Sending message COUNT at TIME", without any key or headers. The lack of a key means Kafka will spread the messages across partitions (classically round-robin; modern clients use a sticky partitioner that switches partitions per batch).
The producing logic is exposed via an API (sketched after this setup list):
/api/producer/{eventCount}
consumer:
- The consumer logic is literally just incrementing a counter and acknowledging that the message was consumed. Every 5K messages we print a log to track our progress.
- concurrency: 10 -> 10 consumer threads reading from 10 partitions
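To make the setup concrete, here is a minimal sketch of what the producer endpoint and the counting consumer could look like. This is my reconstruction assuming Spring Kafka, not the actual repo code: the class names, topic name, and group id are hypothetical, and the Acknowledgment parameter assumes the listener container factory is configured with MANUAL ack mode.

```java
import java.util.concurrent.atomic.AtomicLong;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.support.Acknowledgment;
import org.springframework.stereotype.Component;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
class ProducerController {

    private final KafkaTemplate<String, String> kafkaTemplate;

    ProducerController(KafkaTemplate<String, String> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    @PostMapping("/api/producer/{eventCount}")
    String produce(@PathVariable int eventCount) {
        long start = System.currentTimeMillis();
        for (int i = 1; i <= eventCount; i++) {
            // No key, no headers: the partitioner spreads messages across all 10 partitions.
            kafkaTemplate.send("benchmark-topic",
                    "Sending message " + i + " at " + System.currentTimeMillis());
        }
        long elapsed = System.currentTimeMillis() - start;
        return "Sent %d messages in %d ms".formatted(eventCount, elapsed);
    }
}

@Component
class CountingConsumer {

    private final AtomicLong counter = new AtomicLong();

    // concurrency = "10": one consumer thread per partition
    @KafkaListener(topics = "benchmark-topic", groupId = "benchmark-group", concurrency = "10")
    void consume(String message, Acknowledgment ack) {
        long count = counter.incrementAndGet();
        if (count % 5_000 == 0) {
            System.out.println("Processed " + count + " messages");
        }
        ack.acknowledge(); // mark the message as consumed
    }
}
```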
Stack:
- Java 25
- Spring 4.0.0-M3
- Zookeeper
- Kafka 3.4.0
- Kafka UI (Monitoring tool for Kafka)
- Yaak (For sending REST requests to Producer App to produce X amount of messages)
My Specs:
- Processor: Intel(R) Core(TM) i7-14700HX (2.10 GHz), 28 Threads
- RAM: 32 GB
- OS: Windows 11
Testing rules:
To start a test I will send a request to the producer app via /api/producer/{eventCount}. Before each test I will run 2 warmups, one with 1k messages and another with 5k messages. There will be 3 cases that I run:
1) 10k messages
2) 50k messages
3) 100k messages
Baseline:
- 10k messages
- producer: Sent 10000 messages in 147 ms (68027.21 msg/sec)
- consumer: Processed 10000 messages in 220 ms (45454.55 msg/sec)
- 50k messages
- producer: Sent 50000 messages in 117 ms (427350.43 msg/sec)
- consumer: Processed 50000 messages in 131 ms (381679.39 msg/sec)
- 100k messages
- producer: Sent 100000 messages in 193 ms (518134.72 msg/sec)
- consumer: Processed 100000 messages in 200 ms (500000.00 msg/sec)
Let's start breaking!
First iteration:
I wanted to focus on changes I can apply directly on the broker.
num.network.threads = 1
The number of threads per broker that are used for handling requests over the network, such as producing/fetching messages and updating metadata. A fantastic bottleneck. Default is 3.
num.io.threads = 1
The number of threads per broker that are used for handling disk I/O operations, mainly writing a message's bytes to disk when it's produced. Default is 8.
socket.send.buffer.bytes = 1024
This setting tells the Kafka broker to request the SO_SNDBUF (send buffer) size from the OS. Default is 102400 (100 kibibytes); with my config it's 1KB. SO_SNDBUF is the operating-system-level TCP socket send buffer, a staging area for outgoing network data. This introduces a bottleneck for consumption.
Flow: Broker -> [Broker Send Buffer] -> Network -> [Consumer Receive Buffer] -> Consumer
Flow with default value:
1. Kafka writes 50KB of data to send buffer
2. Buffer holds it temporarily
3. OS sends data over network in chunks
4. Kafka can keep writing while OS handles network
Current flow:
1. Kafka writes 50KB of data
2. Buffer fills up after 1KB
3. Kafka BLOCKS waiting for buffer to drain
4. OS sends 1KB, buffer empties
5. Kafka writes next 1KB
6. Repeat 50 times for 50KB
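As a side note, SO_SNDBUF isn't Kafka magic; it's a plain TCP socket option. A tiny JDK-only sketch (not Kafka code) showing that the configured value is merely a request the OS may adjust:

```java
import java.net.Socket;
import java.net.SocketException;

public class SendBufferDemo {
    public static void main(String[] args) throws SocketException {
        Socket socket = new Socket();
        socket.setSendBufferSize(1024); // ask the OS for a 1KB SO_SNDBUF, like the broker does
        // The OS treats the value as a hint (Linux, for example, may round it up),
        // so read back what was actually granted.
        System.out.println("Granted SO_SNDBUF: " + socket.getSendBufferSize() + " bytes");
        // setReceiveBufferSize()/getReceiveBufferSize() work the same way for SO_RCVBUF.
    }
}
```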
socket.receive.buffer.bytes = 1024
This is the size requested for the receive buffer (SO_RCVBUF) I mentioned above; again, the default is 102400 (100 kibibytes), the current one is 1KB.
Producer → [producer buffer] → Network → [1KB receive buffer] → Broker
socket.request.max.bytes = 262144
Kafka likes to batch requests; I'm limiting that by setting a cap on how big a request can be. Default is 104857600 (100 mebibytes).
log.segment.bytes = 262144
First let's explain what a segment is.
So we put data into a topic, which is split into partitions. Each partition is split into segments. When Kafka writes data, it can only do so to the last active segment; by default, once the active segment ages 7 days or reaches 1GB, Kafka creates a new one. The new segment becomes the active one, the previously active segment becomes inactive, and Kafka can then consider retention and delete data from it. This entire process of creating a new segment and making the old one inactive is called rolling.
Each segment has a log, which is literally a file where Kafka writes the bytes of messages. If you go to /var/lib/kafka/data/topic-name-partition-number on your broker, you can see the log files.
With this config I'm telling Kafka that "each segment file should only hold up to 256KB of data; after that, start a new segment." I chose 256KB because that is 1-2 of my messages.
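I'm applying the cap broker-wide, but the same limit can also be set for a single topic via the topic-level segment.bytes config. A hedged Admin API sketch, with the topic name and bootstrap address as placeholders:

```java
import java.util.List;
import java.util.Map;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class TinySegments {
    public static void main(String[] args) throws Exception {
        try (Admin admin = Admin.create(
                Map.of(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"))) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "benchmark-topic");
            // Cap each segment of this topic at 256KB, mirroring log.segment.bytes above.
            AlterConfigOp cap = new AlterConfigOp(
                    new ConfigEntry("segment.bytes", "262144"), AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(topic, List.of(cap))).all().get();
        }
    }
}
```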
log.roll.ms = 1
The maximum time before a new segment is rolled. This forces creation of a new segment every millisecond, as long as the active one has any data. Default is 168 hours (7 days).
log.flush.interval.messages = 1
Kafka brokers keep data in the page cache (the OS memory buffer) and don't necessarily write to disk immediately.
This parameter controls how often Kafka explicitly tells the OS to flush those in-memory pages to disk with an fsync() call. In other words, Kafka is forced to flush every single message to disk immediately after appending it: one fsync per message.
log.flush.interval.ms = 1
The maximum time in ms that a message in any topic is kept in memory before being flushed to disk. Default is 9223372036854775807 (Long.MAX_VALUE, i.e., effectively never).
log.retention.check.interval.ms = 1
The frequency in milliseconds that the log cleaner checks whether any log is eligible for deletion, default is 5 minutes. Just adding additional pressure and taking resources away from producing/consuming.
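For reference, all of the first-iteration overrides collected in one place. I'm assuming they're applied via server.properties; Docker images typically take the same keys as environment variables instead.

```properties
num.network.threads=1
num.io.threads=1
socket.send.buffer.bytes=1024
socket.receive.buffer.bytes=1024
socket.request.max.bytes=262144
log.segment.bytes=262144
log.roll.ms=1
log.flush.interval.messages=1
log.flush.interval.ms=1
log.retention.check.interval.ms=1
```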
First iteration results:
- 10k messages
- producer: Sent 10000 messages in 164 ms (60975.61 msg/sec)
- consumer: Processed 10000 messages in 405 ms (24691.36 msg/sec)
- 50k messages
- producer: Sent 50000 messages in 205 ms (243902.44 msg/sec)
- consumer: Processed 50000 messages in 2175 ms (22988.51 msg/sec)
- 100k messages
- producer: Sent 100000 messages in 230 ms (434782.61 msg/sec)
- consumer: Processed 100000 messages in 2505 ms (39920.16 msg/sec)
So far so good, we were able to slow down the consumer by:
- 10K: 1.8x slower
- 50K: 16.6x slower
- 100K: 12.5x slower
Second iteration:
Here I wanted to focus on Consumer level changes.
max.poll.records = 1
The max.poll.records configuration controls the maximum number of records that a Kafka consumer can retrieve in a single call to poll(). Default is 500.
fetch.max.bytes = 1
The maximum amount of data the server should return for a fetch request. Default is 52428800 (50 mebibytes).
receive.buffer.bytes = 1
This is the receive buffer for the consumer. Default is 65536 (64 kibibytes). Taking into consideration the socket.send.buffer.bytes set above, this is what the flow looks like:
Broker → [1KB send buffer] → Network → [1 byte receive buffer] → Consumer
enable.auto.commit = false
Every consumer, after processing a message, has to commit an offset to Kafka and say "I've read up to this message on this partition"; by default this happens automatically every 5 seconds, piggybacked on the consumer's own poll() calls.
With auto-commit disabled, offsets are no longer committed automatically in the background; instead the commit happens in the same thread that is processing the message. This is actually a safer way to commit offsets, as with the default you can get duplicate message consumption if the offset commit fails.
Before:
Poll 500 messages → Process all 500 → Continue processing
↓
(auto-commit inside poll())
commits offsets once every 5 seconds
Processing is NOT blocked by commits, as they happen asynchronously alongside polling.
Now:
Poll 1 message → Process 1 message → commit → Block until commit is complete
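Spring hides the poll loop, but under the hood the "Now" flow behaves like this raw-client sketch (illustrative only, not the author's code):

```java
import java.time.Duration;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

class BlockingCommitLoop {

    void run(KafkaConsumer<String, String> consumer) {
        while (true) {
            // With max.poll.records = 1 this returns at most one record.
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
            for (ConsumerRecord<String, String> record : records) {
                process(record);
                consumer.commitSync(); // blocks until the broker confirms the offset
            }
        }
    }

    void process(ConsumerRecord<String, String> record) {
        // increment a counter, log every 5K messages, etc.
    }
}
```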
concurrency = 1
The cherry on top of my evil plan. My initial idea was to make the message key the same for all messages, which would force all of them into a single partition and make consumption single-threaded, completely removing Kafka's parallelism. But it's actually more evil to have 1 consumer thread that has to context-switch between 10 partitions. What we have now is one consumer thread that can only fetch 1 record at a time, can do so byte by byte, and is blocked by a manual offset commit before it can get to the next message. The full consumer configuration is sketched below.
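Pulling the whole second iteration together, a sketch of the Spring Kafka listener container factory with every pessimization applied. The bean structure, group id, and bootstrap address are my assumptions:

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.DefaultKafkaConsumerFactory;
import org.springframework.kafka.listener.ContainerProperties;

@Configuration
class SlowConsumerConfig {

    @Bean
    ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory() {
        Map<String, Object> props = new HashMap<>();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "benchmark-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 1);       // one record per poll()
        props.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, 1);        // tiny fetch responses
        props.put(ConsumerConfig.RECEIVE_BUFFER_CONFIG, 1);         // 1-byte SO_RCVBUF request
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false); // commit in the processing thread

        var factory = new ConcurrentKafkaListenerContainerFactory<String, String>();
        factory.setConsumerFactory(new DefaultKafkaConsumerFactory<>(props));
        factory.setConcurrency(1); // one thread juggling all 10 partitions
        factory.getContainerProperties().setAckMode(ContainerProperties.AckMode.MANUAL);
        return factory;
    }
}
```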
Second iteration results:
- 10k messages
- producer: Sent 10000 messages in 54 ms (185185.19 msg/sec)
- consumer: Processed 10000 messages in 158950 ms (62.91 msg/sec)
- 50k messages
- producer: Sent 50000 messages in 224 ms (223214.29 msg/sec)
- consumer: Processed 50000 messages in 761824 ms (65.63 msg/sec)
- 100k messages
- producer: Sent 100000 messages in 216 ms (462962.96 msg/sec)
- consumer: Processed 100000 messages in 1531620 ms (65.29 msg/sec)
Overall:
- 10K: 45,454 → 62.91 = 722x slower
- 50K: 381,679 → 65.63 = 5,817x slower
- 100K: 500,000 → 65.29 = 7,658x slower
Conclusion
Can we make Kafka slower? Absolutely.
If you’d like to try it yourself, I’ve open-sourced the full experiment here.
Fork it, run it, and see how much pain you can inflict on your own cluster.
As for me — after winning this round against Kafka, I’ve already set my sights on the next opponent: Java.