<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: AutoMQ</title>
    <description>The latest articles on DEV Community by AutoMQ (@automq).</description>
    <link>https://dev.to/automq</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1413364%2F9c562138-e04b-4455-8dc3-09a925aff88f.jpg</url>
      <title>DEV Community: AutoMQ</title>
      <link>https://dev.to/automq</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/automq"/>
    <language>en</language>
    <item>
      <title>Introducing AutoMQ: a cloud-native replacement of Apache Kafka</title>
      <dc:creator>AutoMQ</dc:creator>
      <pubDate>Mon, 05 Aug 2024 07:40:14 +0000</pubDate>
      <link>https://dev.to/automq/introducing-automq-a-cloud-native-replacement-of-apache-kafka-kkc</link>
      <guid>https://dev.to/automq/introducing-automq-a-cloud-native-replacement-of-apache-kafka-kkc</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Author: Xinyu Zhou, AutoMQ CTO&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;AutoMQ&lt;/strong&gt; is a Kafka alternative designed with a cloud-first philosophy. AutoMQ innovatively redesigns the storage layer of Apache Kafka for the cloud, delivering a 10x cost reduction and a 100x increase in elasticity while remaining 100% compatible with Kafka, by separating persistence to EBS and S3. It also outperforms Apache Kafka: low latency, high throughput, low cost, and ease of use, all in one. The source code of AutoMQ's community edition is &lt;a href="https://github.com/AutoMQ/automq" rel="noopener noreferrer"&gt;available on GitHub&lt;/a&gt;, and you can deploy and test AutoMQ for free now.&lt;/p&gt;

&lt;h1&gt;
  
  
  The Growing &lt;strong&gt;&lt;a href="https://github.com/AutoMQ/automq" rel="noopener noreferrer"&gt;AutoMQ Community&lt;/a&gt;&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimage.automq.com%2F20240805bot%2F4n1vap.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimage.automq.com%2F20240805bot%2F4n1vap.png" alt="AutoMQ Github Community"&gt;&lt;/a&gt;&lt;br&gt;
The &lt;strong&gt;AutoMQ&lt;/strong&gt; community is a vibrant and diverse group of individuals and organizations committed to the growth and development of AutoMQ. As a source-available software on GitHub, AutoMQ has amassed an impressive following. With &lt;strong&gt;2900+ stargazers&lt;/strong&gt; and counting, the community's enthusiasm for our project is palpable.&lt;/p&gt;

&lt;p&gt;Our community's diversity and engagement are testaments to the broad appeal and applicability of AutoMQ. We're excited to continue fostering this dynamic community, driving innovation, and shaping the future of "Data in Motion" together.&lt;/p&gt;

&lt;h1&gt;
  
  
  The Evolution of the Streaming World
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimage.automq.com%2F20240805bot%2F9kcafz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimage.automq.com%2F20240805bot%2F9kcafz.png" alt="Streaming World Evolution"&gt;&lt;/a&gt;&lt;br&gt;
The stream storage industry has undergone a significant transformation over the past decade, marked by technical evolution and the emergence of innovative solutions.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Kafka is the Beginning&lt;/strong&gt;: Apache Kafka, born a decade ago, marked the beginning of a new era in stream storage. Kafka integrated advanced technologies of its era, such as the append-only log and the zero-copy technique, which dramatically enhanced data writing efficiency and throughput.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Commercial Leads Innovation&lt;/strong&gt;: As the industry matured, commercial opportunities began to surface. Companies like Confluent and Redpanda emerged, driving technical innovation in the Kafka ecosystem. Confluent introduced significant architectural innovations, namely KRaft and Tiered Storage, which streamlined the architecture and substantially reduced storage costs. Redpanda rewrote Kafka in C++ and replaced ISR with the Raft replication protocol to achieve lower tail latency. Both are based on a Shared-Nothing replication architecture and have adopted tiered storage optimizations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cloud Reshapes Architecture&lt;/strong&gt;: The advent of cloud-native technologies has further reshaped the stream storage industry. WarpStream rewrote Kafka in Go, with a storage layer built entirely on S3. It achieves a cloud-native elastic architecture at the cost of higher latency and is compatible with Kafka at the API protocol level. &lt;strong&gt;AutoMQ&lt;/strong&gt; innovatively redesigns and implements the storage layer of Apache Kafka for the cloud. While remaining 100% compatible with Kafka, it achieves a 10x cost reduction and a 100x elasticity improvement by separating persistence to EBS and S3, without sacrificing latency or throughput.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  Truly Cloud-Native Architecture of AutoMQ
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimage.automq.com%2F20240805bot%2Fkrundq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimage.automq.com%2F20240805bot%2Fkrundq.png" alt="AutoMQ's Cloud-Native Architecture"&gt;&lt;/a&gt;&lt;br&gt;
The cloud-native architecture of AutoMQ is a result of careful design decisions, innovative approaches, and the strategic use of cloud storage technologies. We aimed to create a system that could leverage the benefits of the cloud while overcoming the limitations of traditional stream storage solutions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Decoupling Durability to Cloud Storage
&lt;/h3&gt;

&lt;p&gt;The first step in realizing the cloud-native architecture of AutoMQ was to decouple durability to cloud storage. Unlike typical storage decoupling, which separates storage into a distributed, replicated storage system, decoupling durability goes a step further. With the former, you are left with two types of clusters to manage, as in Apache Pulsar, where you must operate both the broker cluster and the BookKeeper cluster.&lt;/p&gt;

&lt;p&gt;However, AutoMQ has taken a different route, opting to decouple durability to cloud storage, with S3 as the prime example. S3 already offers 99.999999999% durability, making it a reliable choice for this purpose. In the realm of cloud computing, merely decoupling storage is insufficient; durability itself must be decoupled to cloud storage.&lt;/p&gt;

&lt;p&gt;The essence of the Decoupling Durability architecture lies in its reliance on cloud storage for durability, eliminating the need for replication protocols such as Raft. This approach is gaining traction over the traditional Decoupling Storage architecture. Guided by this philosophy, we developed S3Stream, a stream storage library that combines the advantages of EBS and S3.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stateless Broker with S3Stream
&lt;/h3&gt;

&lt;p&gt;With S3Stream in place, we replaced the storage layer of the Apache Kafka broker, transforming it from a Shared-Nothing architecture to a Shared-Storage architecture, and in the process, making the Broker stateless. This is a significant shift, as it reduces the complexity of managing the system. In the AutoMQ architecture, the Broker is the only component. Once it becomes stateless, we can even deploy it using cost-effective Spot instances, further enhancing the cost-efficiency of the system.&lt;/p&gt;

&lt;h3&gt;
  
  
  Automate Everything for Elasticity
&lt;/h3&gt;

&lt;p&gt;The final step in realizing the cloud-native architecture of AutoMQ was to automate everything to achieve an elastic architecture. Once AutoMQ became stateless, it was straightforward to automate various aspects, such as auto-scaling and auto-balancing of traffic.&lt;/p&gt;

&lt;p&gt;We have two automated controllers that collect key metrics from the cluster. The auto-scaling controller monitors the load of the cluster and decides whether to scale in or scale out the cluster. The auto-balancing controller minimizes hot-spotting by dynamically reassigning partitions across the entire cluster. This level of automation is integral to the flexibility and scalability of AutoMQ, and it is also the inspiration behind its name.&lt;/p&gt;
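&lt;p&gt;As a toy illustration of the auto-balancing idea (this is not AutoMQ's actual controller logic; all names and numbers here are hypothetical), a greedy pass that moves the hottest broker's busiest partition to the least-loaded broker might look like this:&lt;/p&gt;

```python
# Hypothetical sketch of one auto-balancing pass: repeatedly move the
# hottest broker's busiest partition to the least-loaded broker, as
# long as doing so narrows the load gap. Illustrative only.
def rebalance(assignment):
    """assignment: {broker: {partition: traffic_mb_s}}; mutated in place.
    Returns the list of (partition, from_broker, to_broker) moves."""
    def load(broker):
        return sum(assignment[broker].values())

    moves = []
    while True:
        hot = max(assignment, key=load)
        cold = min(assignment, key=load)
        if not assignment[hot]:
            break
        part = max(assignment[hot], key=assignment[hot].get)
        traffic = assignment[hot][part]
        # Stop when the move would no longer reduce the spread.
        if load(hot) - traffic < load(cold):
            break
        del assignment[hot][part]
        assignment[cold][part] = traffic
        moves.append((part, hot, cold))
    return moves
```

&lt;p&gt;Because partitions in AutoMQ are backed by shared storage, each such move is cheap, which is what makes frequent automatic balancing practical.&lt;/p&gt;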

&lt;h1&gt;
  
  
  Moving Toward Multi-Cloud Native Architecture
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimage.automq.com%2F20240805bot%2Fucdddj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimage.automq.com%2F20240805bot%2Fucdddj.png" alt="Multi-Cloud Native Architecutre"&gt;&lt;/a&gt;&lt;br&gt;
As we move toward a multi-cloud native architecture, the need for a flexible and adaptable storage solution becomes critical. AutoMQ's shared storage design is an embodiment of this flexibility, designed to integrate seamlessly with a variety of cloud providers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Shared Storage: WAL Meets Object Storage
&lt;/h3&gt;

&lt;p&gt;At the heart of this design lies the concept of S3Stream, a shared stream storage repository. It is essentially composed of a shared Write-Ahead Log (WAL) and shared object storage.&lt;/p&gt;

&lt;p&gt;Data is first persistently written to the WAL and then uploaded to object storage in near real-time. The WAL does not provide data reading capabilities. Instead, it serves as a recovery mechanism in the event of a failure. Consumers read data directly from S3. To enhance performance, a memory cache is implemented for acceleration, which means that tailing-read consumers do not need to access object storage directly.&lt;/p&gt;
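&lt;p&gt;The write path described above can be modeled in a few lines. This is a simplified sketch for intuition only; the class and method names are invented and are not AutoMQ's API:&lt;/p&gt;

```python
# Toy model of the S3Stream write path: records are appended to a WAL
# (used only for crash recovery), buffered in a memory cache for
# tailing reads, and batch-uploaded to object storage in near real time.
class S3StreamModel:
    def __init__(self, batch_size=3):
        self.wal = []             # durable log, recovery only
        self.cache = []           # recent records, serves tailing reads
        self.object_storage = []  # uploaded batches (simulated S3)
        self.batch_size = batch_size
        self._pending = []

    def append(self, record):
        self.wal.append(record)    # 1. persist to the WAL first
        self.cache.append(record)  # 2. visible to tailing readers
        self._pending.append(record)
        if len(self._pending) >= self.batch_size:
            # 3. near-real-time batch upload to object storage
            self.object_storage.append(list(self._pending))
            self._pending.clear()

    def tail_read(self, n):
        # Tailing consumers hit the cache, never object storage.
        return self.cache[-n:]

    def catch_up_read(self):
        # Historical reads go directly to object storage.
        return [r for batch in self.object_storage for r in batch]

    def recover(self):
        # After a crash, replay the WAL records not yet uploaded.
        uploaded = sum(len(b) for b in self.object_storage)
        return self.wal[uploaded:]
```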

&lt;p&gt;This architecture of S3Stream is highly flexible due to the variety of mediums that can be used for the WAL. For instance, EBS, Regional EBS, S3, or even a combination of these can be used to form a Replication WAL. This flexibility is primarily due to the varying capabilities of cloud storage offered by different cloud providers. The aim is to pursue an architecture that is optimal across multiple cloud providers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Adapting Architecture to Different Cloud Providers
&lt;/h3&gt;

&lt;p&gt;The architecture of AutoMQ's shared storage model is designed to be adaptable to the specific capabilities of different cloud providers. The choice of architecture depends primarily on the specific features and services offered by each cloud provider.&lt;/p&gt;

&lt;p&gt;For instance, Azure, Google Cloud, and Alibaba Cloud all provide regional EBS. Given this feature, the best practice for these cloud providers is to use regional EBS as the WAL. This allows the system to tolerate zone failures, ensuring reliable and consistent performance.&lt;/p&gt;

&lt;p&gt;In contrast, AWS does not offer regional EBS. However, AWS does provide S3 Express One Zone, which boasts single-digit millisecond latency. Although this service is limited to a single availability zone, AutoMQ can still tolerate zone failures by using a replication WAL, in which data is written both to an S3 Express One Zone bucket and to an EBS volume.&lt;/p&gt;
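&lt;p&gt;A minimal sketch of such a replication WAL, under the assumption that a write is acknowledged only after every target (e.g. an S3 Express One Zone bucket and an EBS volume) has confirmed it; the names are hypothetical and this is not AutoMQ's implementation:&lt;/p&gt;

```python
# Illustrative dual-write WAL: a record is acknowledged only once all
# targets hold it, so losing any single zone does not lose data.
class ReplicationWAL:
    def __init__(self, targets):
        # targets maps a name to its simulated durable log,
        # e.g. {"s3-express": [], "ebs": []}
        self.targets = targets

    def append(self, record):
        for log in self.targets.values():
            log.append(record)  # write to every target
        # acknowledge only when every target confirms the record
        return all(log[-1] == record for log in self.targets.values())

    def recover(self, failed):
        # After a zone failure, recover from any surviving target.
        survivors = [name for name in self.targets if name != failed]
        return list(self.targets[survivors[0]])
```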

&lt;p&gt;In cases where you have access to a low-latency alternative to S3, or where your business can tolerate hundreds of milliseconds of latency, S3 itself can serve as the WAL. The entire architecture then relies solely on S3 for both the WAL and data storage; in other words, AutoMQ can easily provide a WarpStream-like architecture as well.&lt;/p&gt;

&lt;p&gt;By understanding and leveraging the unique features of each cloud provider, AutoMQ ensures optimal performance and reliability across a variety of cloud environments. This flexibility and adaptability are key to the success of a multi-cloud native architecture.&lt;/p&gt;

&lt;h1&gt;
  
  
  Performance Data and Benefits of AutoMQ
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimage.automq.com%2F20240805bot%2Fb93u98.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimage.automq.com%2F20240805bot%2Fb93u98.png" alt="AutoMQ's performance"&gt;&lt;/a&gt;&lt;br&gt;
To fully appreciate the capabilities and advantages of AutoMQ, let's take a look at some key benchmark data and performance metrics.&lt;/p&gt;

&lt;p&gt;The advantages of AutoMQ compared to Apache Kafka can be summarized as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;⚡ &lt;strong&gt;10x more cost-effective than Apache Kafka&lt;/strong&gt;: Auto-scaling, Spot instance support, and offloading storage to S3 together make AutoMQ 10x more cost-effective than Apache Kafka.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;👍 &lt;strong&gt;Easy to operate&lt;/strong&gt;: No need to manage the cluster's capacity yourself. The stateless Broker can autoscale in seconds. Forget data skew and hot/cold data contention; self-balancing fixes them all automatically.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;🚀 &lt;strong&gt;High performance&lt;/strong&gt;: Single-digit-millisecond latency with the same high throughput as Apache Kafka, but with much better catch-up read performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;😄 &lt;strong&gt;Easy to migrate&lt;/strong&gt;: 100% compatible with Apache Kafka, so you don't need to change anything you already have. Point your clients at the new bootstrap server endpoint and you are done.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  10x Cost Effective
&lt;/h3&gt;

&lt;p&gt;AutoMQ's innovative architecture brings unprecedented cost savings in the realm of data-intensive software. Its design focuses on optimizing both computational and storage resources, resulting in a cost advantage that's nearly tenfold compared to traditional solutions.&lt;/p&gt;

&lt;p&gt;The first major advantage comes from the optimization of EC2 resources. By eliminating data replication, AutoMQ removes the need for extra resources to manage replication traffic. Coupled with the platform's elasticity, which dynamically adjusts the cluster size in response to workload, this results in a dramatic reduction of EC2 resources: up to 90%.&lt;/p&gt;

&lt;p&gt;Furthermore, AutoMQ's stateless architecture allows the use of Spot instances. This strategy leads to a significant cost reduction, further enhancing computational resource savings.&lt;/p&gt;

&lt;p&gt;On the storage front, AutoMQ also shines. Instead of adhering to the traditional three-replication EBS storage, it utilizes a single-replica object storage model. This innovative approach reduces storage costs by as much as 90%.&lt;/p&gt;
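&lt;p&gt;As a rough back-of-the-envelope illustration of the storage argument (the unit prices below are placeholders for illustration only, not quoted AWS prices), compare three EBS replicas against a single S3 copy plus a small fixed-size EBS WAL:&lt;/p&gt;

```python
# Back-of-the-envelope storage cost comparison.
# Prices are hypothetical placeholders, not real AWS pricing.
EBS_PRICE_PER_GB = 0.08   # assumed $/GB-month for EBS
S3_PRICE_PER_GB = 0.023   # assumed $/GB-month for S3

def kafka_storage_cost(data_gb):
    # Traditional Kafka: three replicas, each on EBS.
    return 3 * data_gb * EBS_PRICE_PER_GB

def automq_storage_cost(data_gb, wal_gb=10):
    # AutoMQ: a single copy on S3 plus a small fixed-size EBS WAL.
    return data_gb * S3_PRICE_PER_GB + wal_gb * EBS_PRICE_PER_GB

data_gb = 10_000
saving = 1 - automq_storage_cost(data_gb) / kafka_storage_cost(data_gb)
```

&lt;p&gt;With these assumed prices the storage saving comes out on the order of 90%, in line with the figure above.&lt;/p&gt;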

&lt;p&gt;Our detailed cost comparison chart, based on real bill comparisons from stress testing on AWS, illustrates these savings. For more in-depth information, we invite you to access the complete report from our website.&lt;/p&gt;

&lt;h3&gt;
  
  
  Instant Elastic Efficiency
&lt;/h3&gt;

&lt;p&gt;AutoMQ's shared storage architecture greatly enhances operational efficiency. For example, reassigning partitions in AutoMQ no longer involves data replication and can be completed within seconds, unlike in Kafka where it could take up to several hours. Additionally, when it comes to cluster scaling, AutoMQ can balance the traffic of new nodes with the cluster in just about one minute by reassigning partitions in batches. In contrast, this process could take days with Kafka.&lt;/p&gt;
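&lt;p&gt;The contrast can be made concrete with a toy model (purely illustrative, not AutoMQ internals): in a shared-nothing design the reassignment time is bounded by copying the partition's bytes over the network, while in a shared-storage design it is essentially a metadata update:&lt;/p&gt;

```python
# Toy contrast between shared-nothing and shared-storage reassignment.
def reassign_shared_nothing(partition_bytes, network_mb_s):
    # Shared-nothing: all partition data must be copied to the new
    # broker, so the duration scales with the partition size.
    return partition_bytes / (network_mb_s * 1024 * 1024)

def reassign_shared_storage(ownership, partition, new_broker):
    # Shared-storage: flip the ownership metadata and reopen the
    # stream from object storage; no bytes are copied.
    ownership[partition] = new_broker
    return ownership
```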

&lt;h3&gt;
  
  
  100% Compatibility
&lt;/h3&gt;

&lt;p&gt;Perhaps one of the most important aspects of AutoMQ is its compatibility. We've replaced Kafka's storage layer with S3Stream while keeping all the code from the computation layer. This ensures that AutoMQ is fully compatible with Kafka's protocols and features. For instance, newer Apache Kafka features such as compacted topics, idempotent producers, and transactional messages are fully supported by AutoMQ.&lt;/p&gt;

&lt;p&gt;Furthermore, we replaced Kafka's storage layer through a very small aspect at the LogSegment level. This makes it very easy for us to synchronize code from the Kafka upstream, meaning we can readily merge new Apache Kafka features in the future. This is a significant advantage over solutions like WarpStream, where such compatibility and future-proofing can be a challenge.&lt;/p&gt;

&lt;p&gt;In summary, AutoMQ's flexible architecture, cost savings, operational efficiency, and compatibility make it a powerful solution for stream storage in the cloud.&lt;/p&gt;

&lt;h1&gt;
  
  
  Roadmap: Streaming Data to Data Lakes
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimage.automq.com%2F20240805bot%2Fftfe6k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimage.automq.com%2F20240805bot%2Fftfe6k.png" alt="AutoMQ's roadmap"&gt;&lt;/a&gt;&lt;br&gt;
In this final section, we outline our vision for the future of streaming data into data lakes, a critical aspect of our roadmap.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Shift Toward Shared Data
&lt;/h3&gt;

&lt;p&gt;We're witnessing a trend where all data-intensive software eventually stores data on object storage to leverage the benefits of shared storage. However, even with all data stored on object storage, there isn't a straightforward way to share data between different systems. This process typically requires Extract, Transform, Load (ETL) operations and data format conversions.&lt;/p&gt;

&lt;p&gt;We believe the transition from shared storage to shared data will be the next critical evolution in modern data technology. Table storage solutions like Delta Lake and Iceberg have unified the data format in the data lake, making this transition feasible.&lt;/p&gt;

&lt;h3&gt;
  
  
  From Stream to Lake: A Data Journey
&lt;/h3&gt;

&lt;p&gt;In the future, we envision data usage to be a seamless, interconnected process that maximizes data utility and operational efficiency.&lt;/p&gt;

&lt;p&gt;The journey begins with data generation. As data is produced in a streaming manner, it is immediately stored in stream storage. This continuous flow of information forms the foundation of our data landscape.&lt;/p&gt;

&lt;p&gt;Next, we unlock the real-time value of this data. Tools like Flink Jobs, Spark Jobs, or Kafka consumers dive into the data stream, extracting valuable insights on the fly through the Stream API. This step is crucial in keeping pace with the dynamic nature of the data.&lt;/p&gt;

&lt;p&gt;As the data matures and loses its freshness, the built-in Compactor in AutoMQ steps in. Quietly and transparently, it transforms the data into the Iceberg table format. This conversion process ensures the data remains accessible and usable even after it has passed its real-time relevance.&lt;/p&gt;

&lt;p&gt;Finally, we arrive at the stage of large-scale analysis. The entire big data technology stack can now access the converted data, using a zero ETL approach. This approach eliminates the need for additional data processing, allowing for direct, efficient analysis.&lt;/p&gt;

&lt;p&gt;In conclusion, as we continue to innovate and evolve, our goal remains the same: to provide a powerful, efficient, and cost-effective solution for stream storage in the cloud. By streamlining the process of streaming data to data lakes, we aim to further enhance the value and utility of big data for businesses.&lt;/p&gt;

&lt;h1&gt;
  
  
  Embracing the Future with AutoMQ
&lt;/h1&gt;

&lt;p&gt;AutoMQ, our cloud-native solution, is more than an alternative to existing technologies—it's a leap forward in the realm of data-intensive software. It promises cost savings, operational efficiency, and seamless compatibility.&lt;/p&gt;

&lt;p&gt;We envision a future where data effortlessly streams into data lakes, unlocking the potential of real-time generative AI. This approach will enhance the utility of big data, leading to more comprehensive analyses and insights.&lt;/p&gt;

&lt;p&gt;Finally, we invite you to join us on this journey and contribute to the evolution of AutoMQ. Visit our website to access the GitHub repository and join our Slack group for communication: &lt;a href="https://www.automq.com/" rel="noopener noreferrer"&gt;https://www.automq.com/&lt;/a&gt;. Let's shape the future of data together with AutoMQ.&lt;/p&gt;

&lt;h1&gt;
  
  
  References
&lt;/h1&gt;

&lt;p&gt;Here are some useful links to deepen your understanding of AutoMQ. Feel free to reach out if you have any queries.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;AutoMQ Website: &lt;a href="https://www.automq.com/" rel="noopener noreferrer"&gt;https://www.automq.com/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;AutoMQ Repository: &lt;a href="https://github.com/AutoMQ/automq" rel="noopener noreferrer"&gt;https://github.com/AutoMQ/automq&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;AutoMQ Architecture Overview: &lt;a href="https://docs.automq.com/automq/architecture/overview" rel="noopener noreferrer"&gt;https://docs.automq.com/automq/architecture/overview&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;AutoMQ S3Stream Overview: &lt;a href="https://docs.automq.com/automq/architecture/s3stream-shared-streaming-storage/overview" rel="noopener noreferrer"&gt;https://docs.automq.com/automq/architecture/s3stream-shared-streaming-storage/overview&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;AutoMQ Technical Advantages: &lt;a href="https://docs.automq.com/automq/architecture/technical-advantage/overview" rel="noopener noreferrer"&gt;https://docs.automq.com/automq/architecture/technical-advantage/overview&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;The Difference between AutoMQ and Kafka: &lt;a href="https://docs.automq.com/automq/what-is-automq/difference-with-apache-kafka" rel="noopener noreferrer"&gt;https://docs.automq.com/automq/what-is-automq/difference-with-apache-kafka&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;The Difference between AutoMQ and WarpStream: &lt;a href="https://docs.automq.com/automq/what-is-automq/difference-with-warpstream" rel="noopener noreferrer"&gt;https://docs.automq.com/automq/what-is-automq/difference-with-warpstream&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;The Difference between AutoMQ and Tiered Storage: &lt;a href="https://docs.automq.com/automq/what-is-automq/difference-with-tiered-storage" rel="noopener noreferrer"&gt;https://docs.automq.com/automq/what-is-automq/difference-with-tiered-storage&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;AutoMQ Customers: &lt;a href="https://www.automq.com/customer" rel="noopener noreferrer"&gt;https://www.automq.com/customer&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>opensource</category>
      <category>productivity</category>
      <category>cloud</category>
      <category>tooling</category>
    </item>
    <item>
      <title>How to Monitor AutoMQ Cluster using Guance Cloud</title>
      <dc:creator>AutoMQ</dc:creator>
      <pubDate>Mon, 29 Jul 2024 09:07:18 +0000</pubDate>
      <link>https://dev.to/automq/how-to-monitor-automq-cluster-using-guance-cloud-26l9</link>
      <guid>https://dev.to/automq/how-to-monitor-automq-cluster-using-guance-cloud-26l9</guid>
      <description>&lt;p&gt;&lt;strong&gt;Preface&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Guance Cloud&lt;/strong&gt;&lt;br&gt;
Guance Cloud [1] is a unified real-time monitoring application designed for cloud platforms, cloud-native environments, applications, and business-related needs. It integrates three main signals: metrics, logs, and traces, covering testing, prerelease, and production environments to achieve observability across the entire software development lifecycle. Through Guance Cloud, enterprises can build comprehensive application full-link observability, enhancing the transparency and controllability of the overall IT architecture.&lt;br&gt;
As a powerful data analysis platform, Guance Cloud includes several core modules such as the DataKit [2] unified data collector and the DataFlux Func data processing development platform.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvanuowwtlmwp73ykhvun.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvanuowwtlmwp73ykhvun.png" alt="Image description" width="800" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AutoMQ&lt;br&gt;
AutoMQ [3] is a next-generation Apache Kafka distribution redesigned based on cloud-native concepts. It provides up to 10 times the cost and elasticity advantages while maintaining 100% compatibility with the Apache Kafka protocol. Moreover, AutoMQ stores data entirely on S3, allowing it to quickly handle sudden traffic spikes during cluster expansion without the need for data replication. In contrast, Apache Kafka requires substantial bandwidth for partition data replication after scaling, making it difficult to manage sudden traffic surges. With features like automatic scaling, self-balancing, and automatic fault recovery, AutoMQ achieves a high degree of system autonomy, offering higher levels of availability without the need for manual intervention. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observability Interface of AutoMQ&lt;/strong&gt;&lt;br&gt;
Due to AutoMQ's full compatibility with Kafka and support for open Prometheus-based metrics collection ports, it can be integrated with Guance Cloud's data collection tool, DataKit. This enables users to monitor and manage the status of AutoMQ clusters conveniently. The Guance Cloud platform also supports user-defined aggregation and querying of metrics data. By utilizing the provided dashboard templates or custom dashboards, we can effectively compile various information about the AutoMQ cluster, such as common Topics, Brokers, Partitions, and Group statistics.&lt;br&gt;
Based on observable data from Metrics, we can also query the errors encountered during the operation of the AutoMQ cluster and various current system utilization metrics, such as JVM CPU usage, JVM heap usage, and cache size. These metrics can help us quickly identify and resolve issues when the cluster encounters anomalies, which is highly beneficial for system high availability and quick recovery. Next, I will introduce how to monitor the AutoMQ cluster status using the Observability Cloud Platform.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Steps to Integrate with the Observability Cloud&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Enable Metric Fetch Interface in AutoMQ&lt;/strong&gt;&lt;br&gt;
Refer to the AutoMQ documentation: Cluster Deployment | AutoMQ [4]. Before deployment and startup, add the following configuration parameters to enable the Prometheus metrics endpoint. Once the AutoMQ cluster is started with these parameters, each node additionally exposes an HTTP endpoint for fetching AutoMQ monitoring metrics in the Prometheus exposition format.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bin/kafka-server-start.sh ...\
--override  s3.telemetry.metrics.exporter.type=prometheus \
--override  s3.metrics.exporter.prom.host=0.0.0.0 \
--override  s3.metrics.exporter.prom.port=8890 \
....
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the AutoMQ monitoring metrics are enabled, you can fetch Prometheus format monitoring metrics from any node via HTTP protocol at the address: http://{node_ip}:8890. A sample response is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;....
kafka_request_time_mean_milliseconds{otel_scope_name="io.opentelemetry.jmx",type="DescribeDelegationToken"} 0.0 1720520709290
kafka_request_time_mean_milliseconds{otel_scope_name="io.opentelemetry.jmx",type="CreatePartitions"} 0.0 1720520709290
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
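&lt;p&gt;If you want to post-process such output yourself, the line structure is simple to parse. The following is a minimal sketch for lines in the Prometheus text exposition format; a real deployment would use a Prometheus client library, and the naive label split below assumes label values contain no commas:&lt;/p&gt;

```python
import re

# Minimal parser for one Prometheus text-format sample line:
#   metric_name{label="value",...} value [timestamp_ms]
LINE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
                  r'(?:\{(?P<labels>[^}]*)\})?\s+'
                  r'(?P<value>\S+)(?:\s+(?P<ts>\d+))?$')

def parse_metric(line):
    """Return (name, labels_dict, value) for a sample line, else None."""
    m = LINE.match(line.strip())
    if not m:
        return None
    labels = {}
    if m.group("labels"):
        # Naive split: assumes no commas inside label values.
        for pair in m.group("labels").split(","):
            key, val = pair.split("=", 1)
            labels[key] = val.strip('"')
    return m.group("name"), labels, float(m.group("value"))
```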



&lt;p&gt;For more information on metrics, refer to the official AutoMQ documentation: Metrics | AutoMQ [5].&lt;br&gt;
&lt;strong&gt;Install and Configure the DataKit Collection Tool&lt;/strong&gt;&lt;br&gt;
DataKit is an open-source data collection tool provided by the Observability Cloud that supports fetching Prometheus metrics. We can use DataKit to fetch monitoring data from AutoMQ and aggregate it into the Observability Cloud platform.&lt;br&gt;
&lt;strong&gt;Installation of DataKit Tool&lt;/strong&gt;&lt;br&gt;
For more details on installing DataKit, refer to the documentation: Host Installation - Observability Cloud Documentation [6].&lt;br&gt;
First, register for an Observability Cloud account and log in. Then, from the main interface, click on "Integration" on the left side and select "DataKit" at the top. You will see the DataKit installation command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DK_DATAWAY="https://openway.guance.com?token=&amp;lt;TOKEN&amp;gt;" bash -c "$(curl -L https://static.guance.com/datakit/install.sh)" 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Copy the above command and run the DataKit installation command on all nodes in the cluster to complete the installation.&lt;br&gt;
DataKit needs to be installed on all Brokers in the cluster that need to be monitored.&lt;/p&gt;

&lt;p&gt;After successfully executing the installation command, use the command &lt;code&gt;datakit monitor&lt;/code&gt; to verify whether DataKit was installed successfully.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7zb5e5ayvm01wvainkxd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7zb5e5ayvm01wvainkxd.png" alt="Image description" width="800" height="487"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AutoMQ Collector Configuration and Activation&lt;/strong&gt;&lt;br&gt;
In this section, we will configure the AutoMQ collector for DataKit on each data collection node. Navigate to the directory &lt;code&gt;/usr/local/datakit/conf.d/prom&lt;/code&gt; and create a collector configuration file named &lt;code&gt;prom.conf&lt;/code&gt;. The configuration includes the exposed metrics endpoint, the collector name, the prom instance name, and, importantly, the collection interval. Adjust the configuration on each server as needed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  [[inputs.prom]]

  urls = ["http://clientIP:8890/metrics"]   # clientIP 为你自己的服务器地址
  source = "AutoMQ"

  ## Keep Exist Metric Name
  ## If the keep_exist_metric_name is true, keep the raw value for field names.
  keep_exist_metric_name = true

  [inputs.prom.tags_rename]
    overwrite_exist_tags = true

  [inputs.prom.tags_rename.mapping]
    service_name = "job"
    service_instance_id = "instance"

  [inputs.prom.tags]
    component="AutoMQ"
  interval = "10s"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Monitoring the AutoMQ Cluster through Guance Cloud Dashboards&lt;/strong&gt;&lt;br&gt;
Guance Cloud has integrated AutoMQ and offers multiple default dashboards; see the Dashboard Example [7]. Below are some commonly used templates, with a brief introduction to their functionality:&lt;br&gt;
&lt;strong&gt;Cluster Monitoring&lt;/strong&gt;&lt;br&gt;
This primarily displays the number of active Brokers, the total number of Topics, the number of Partitions, and so on. You can also scope the query to a specific cluster by selecting its &lt;code&gt;Cluster_id&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0mh06wznkpr1mh2qeaw9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0mh06wznkpr1mh2qeaw9.png" alt="Image description" width="800" height="361"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By monitoring the state of the Kafka cluster, we can promptly detect and resolve potential issues, such as node failures, insufficient disk space, and network latency, to ensure the system remains controllable and stable.&lt;br&gt;
&lt;strong&gt;Broker Monitoring&lt;/strong&gt;&lt;br&gt;
The AutoMQ Broker dashboard on Guance Cloud describes various metrics for all Brokers, such as the number of connections, the number of partitions, the number of messages received per second (ops), and the input/output data volume per second, measured in bytes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fba7niyfuob0bjtnyhwwm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fba7niyfuob0bjtnyhwwm.png" alt="Image description" width="800" height="361"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Topic Monitoring&lt;/strong&gt;&lt;br&gt;
This section provides an overview of information for all Topics contained within all nodes. As mentioned above, you can specify and query Topic information under a specific node. These metrics mainly include the space occupied by each Topic, the number of messages received, and the Request Throughput, which indicates the ability to process requests per unit time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqm0szsm8gp9i97jbd9oz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqm0szsm8gp9i97jbd9oz.png" alt="Image description" width="800" height="361"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At this point, we have successfully monitored the status of the AutoMQ cluster with Guance Cloud; the data on the dashboards is produced by aggregating or querying the collected metrics.&lt;br&gt;
&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
In this article, we showed how to integrate the Guance Cloud platform with AutoMQ to monitor the status of an AutoMQ cluster. Advanced features such as custom alerts and custom data queries can be configured following the rules in the official documentation; experiment with them to find what suits your needs. We hope this article helps you integrate Guance Cloud with AutoMQ!&lt;br&gt;
&lt;strong&gt;References&lt;/strong&gt;&lt;br&gt;
[1] Guance Cloud: &lt;a href="https://docs.guance.com/getting-started/product-introduction/" rel="noopener noreferrer"&gt;https://docs.guance.com/getting-started/product-introduction/&lt;/a&gt;&lt;br&gt;
[2] DataKit: &lt;a href="https://docs.guance.com/datakit/" rel="noopener noreferrer"&gt;https://docs.guance.com/datakit/&lt;/a&gt;&lt;br&gt;
[3] AutoMQ: &lt;a href="https://www.automq.com" rel="noopener noreferrer"&gt;https://www.automq.com&lt;/a&gt;&lt;br&gt;
[4] Cluster Deployment of AutoMQ: &lt;a href="https://docs.automq.com/en/docs/automq-opensource/IyXrw3lHriVPdQkQLDvcPGQdnNh" rel="noopener noreferrer"&gt;https://docs.automq.com/en/docs/automq-opensource/IyXrw3lHriVPdQkQLDvcPGQdnNh&lt;/a&gt;&lt;br&gt;
[5] Host Installation - Guance Cloud Documentation: &lt;a href="https://docs.guance.com/datakit/datakit-install/" rel="noopener noreferrer"&gt;https://docs.guance.com/datakit/datakit-install/&lt;/a&gt;&lt;br&gt;
[6] Metrics | AutoMQ: &lt;a href="https://docs.automq.com/zh/docs/automq-opensource/ArHpwR9zsiLbqwkecNzcqOzXn4b" rel="noopener noreferrer"&gt;https://docs.automq.com/zh/docs/automq-opensource/ArHpwR9zsiLbqwkecNzcqOzXn4b&lt;/a&gt;&lt;br&gt;
[7] Dashboard Example: &lt;a href="https://console.guance.com/scene/dashboard/createDashboard?w=wksp_63b96920660e4962a07429b65ef163e7&amp;amp;lak=Scene" rel="noopener noreferrer"&gt;https://console.guance.com/scene/dashboard/createDashboard?w=wksp_63b96920660e4962a07429b65ef163e7&amp;amp;lak=Scene&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Challenges of Custom Cache Implementation in Netty-Based Streaming Systems: Memory Fragmentation and OOM Issues</title>
      <dc:creator>AutoMQ</dc:creator>
      <pubDate>Mon, 29 Jul 2024 08:59:00 +0000</pubDate>
      <link>https://dev.to/automq/challenges-of-custom-cache-implementation-in-netty-based-streaming-systems-memory-fragmentation-and-oom-issues-520c</link>
      <guid>https://dev.to/automq/challenges-of-custom-cache-implementation-in-netty-based-streaming-systems-memory-fragmentation-and-oom-issues-520c</guid>
      <description>&lt;p&gt;&lt;strong&gt;Preface&lt;/strong&gt;&lt;br&gt;
Kafka, as a stream processing platform, aims for end-to-end low latency in real-time stream computation and online business scenarios. In offline batch processing and peak shaving scenarios, it seeks high throughput for cold reads. Both scenarios require a well-designed data caching mechanism to support them. Apache Kafka stores data in local files and accesses them by mapping files into memory using mmap, naturally leveraging the operating system for file buffering, cache loading, and cache eviction.&lt;br&gt;
AutoMQ adopts a separation of storage and computation architecture, where storage is offloaded to object storage. With no local data files, it cannot directly use mmap for data caching like Apache Kafka. At this point, there are usually two approaches to cache data from object storage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The first approach is to download object storage files to local files and then read the local files using mmap. This approach is relatively simple to implement but requires additional disk space to cache data; depending on the required cache size and rate, it also means paying for disk capacity and IOPS, making it economically inefficient.&lt;/li&gt;
&lt;li&gt;The second approach is to cache data from object storage directly in memory, based on the data consumption characteristics of stream processing. This method is more complex to implement, essentially requiring a memory management system similar to an operating system's. The trade-off is that managing the memory cache oneself allows the best caching efficiency and cost-effectiveness for the business scenario.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To reduce operational complexity and holding costs, and to improve cache efficiency, AutoMQ ultimately chose the second approach: directly using memory for data caching.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AutoMQ Cache Design&lt;/strong&gt;&lt;br&gt;
Building directly on memory for data caching, AutoMQ designed two caching mechanisms for tail-read and cold-read scenarios based on their data access characteristics: LogCache and BlockCache.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgmv4md3667zjllk7gpl9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgmv4md3667zjllk7gpl9.png" alt="Image description" width="800" height="460"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;LogCache is designed for the tail read scenario. When data is uploaded to object storage, it is simultaneously cached in LogCache as a single RecordBatch. This allows hot data to be accessed directly from the cache, providing extremely low end-to-end latency. Compared to general-purpose OS cache designs, LogCache has the following two features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;FIFO: Given the characteristic of continuous access to new data in tail read scenarios, LogCache uses a First In, First Out eviction policy to ensure the availability of the cache for new data.&lt;/li&gt;
&lt;li&gt;Low Latency: LogCache has a dedicated cache space solely responsible for caching hot data, avoiding the problem of cold data reads affecting hot data consumption.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;BlockCache is designed for cold read scenarios. When the required data cannot be accessed in LogCache, it is read from BlockCache. Compared to LogCache, BlockCache has the following two distinctions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LRU: BlockCache uses the Least Recently Used eviction strategy, which offers better cache utilization in high fan-out cold read scenarios.&lt;/li&gt;
&lt;li&gt;High Throughput: Cold read scenarios focus on throughput; therefore, BlockCache reads and caches data in large chunks (~4MB) from object storage and uses a prefetching strategy to load data that is likely to be read next.&lt;/li&gt;
&lt;/ul&gt;
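&lt;p&gt;The two eviction policies above can be sketched with the JDK's &lt;code&gt;LinkedHashMap&lt;/code&gt;, which supports both insertion-order (FIFO) and access-order (LRU) eviction. This is an illustrative sketch only, not AutoMQ's actual cache implementation:&lt;/p&gt;

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Bounded cache: accessOrder=false gives FIFO eviction (LogCache-style),
// accessOrder=true gives LRU eviction (BlockCache-style).
class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    BoundedCache(int maxEntries, boolean lru) {
        super(16, 0.75f, lru);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries; // evict when over capacity
    }
}
```

&lt;p&gt;With a capacity of 2, the FIFO variant evicts the oldest insertion even if it was just read, while the LRU variant keeps recently read entries alive.&lt;/p&gt;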

&lt;p&gt;In Java programs, data can be cached in either on-heap or off-heap memory. To reduce JVM GC pressure, AutoMQ caches data in off-heap Direct Memory, and to make Direct Memory allocation efficient it employs the industry-standard Netty PooledByteBufAllocator to allocate and release memory from a pool.&lt;br&gt;
&lt;strong&gt;"The Incident" occurred.&lt;/strong&gt;&lt;br&gt;
The expectation was that by using Netty's PooledByteBufAllocator, AutoMQ could achieve efficient memory allocation speed through pooling, along with a well-honed memory allocation strategy to minimize overhead, providing peace of mind. However, during the performance testing of AutoMQ 1.0.0 RC, reality hit hard.&lt;br&gt;
AutoMQ was deployed on a 2C16G production model, with an off-heap memory limit set to 6GiB using -XX:MaxDirectMemorySize=6G. Memory allocation was set as 2GiB for LogCache + 1GiB for BlockCache + 1GiB for other small items, totaling ~4GiB, which is less than 6GiB. In theory, there was ample off-heap memory available. However, in practice, after running AutoMQ 1.0.0 RC for an extended period under various loads, an OutOfMemoryError (OOM) was encountered.&lt;/p&gt;

&lt;p&gt;Following the principle of suspecting one's own code before suspecting mature libraries and the operating system, the first suspicion upon seeing the exception was a missed &lt;code&gt;ByteBuf#release&lt;/code&gt; call somewhere in the code. Hence, the Netty leak detection level was set to &lt;code&gt;-Dio.netty.leakDetection.level=PARANOID&lt;/code&gt; to check whether any ByteBuf instances were being garbage collected without being released. After running for a while, no leak logs were found, ruling out missed releases.&lt;/p&gt;

&lt;p&gt;Next, the suspicion shifted to whether any part of the code was allocating more memory than expected. Netty's ByteBufAllocatorMetric only provides global memory usage statistics, and traditional memory allocation flame graphs only offer memory request amounts at specific times. What we needed was the memory usage of various types at a given moment. Therefore, AutoMQ consolidated ByteBuf allocation into a custom ByteBufAlloc factory class, using WrappedByteBuf to track memory requests and releases of various types. This allowed us to record the memory usage of different types at any given moment and also record Netty's actual memory usage, thereby providing insight into AutoMQ's overall and categorized memory usage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Buffer usage: 
ByteBufAllocMetric{allocatorMetric=PooledByteBufAllocatorMetric(usedDirectMemory: 2294284288; ...), // Physical Memory Size Allocated by Netty
allocatedMemory=1870424720, // Total Memory Size Requested By AutoMQ
1/write_record=1841299456, 11/block_cache=0, ..., // Detail Memory Size Requested By AutoMQ
pooled=true, direct=true} (com.automq.stream.s3.ByteBufAlloc)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After adding categorized memory statistics, it was found that the memory usage of various types was within the expected range. However, it was observed that there was a significant discrepancy between the memory requested by AutoMQ and the actual memory allocated by Netty. This discrepancy grew over time, sometimes even resulting in Netty's actual memory usage being twice that of AutoMQ's requested memory. This discrepancy was identified as memory fragmentation in memory allocation.&lt;/p&gt;
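&lt;p&gt;The per-category accounting idea behind these statistics can be sketched as follows (hypothetical class and method names; AutoMQ's actual &lt;code&gt;ByteBufAlloc&lt;/code&gt; wraps Netty &lt;code&gt;ByteBuf&lt;/code&gt; instances, which is not reproduced here):&lt;/p&gt;

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Tracks bytes currently in use per allocation category, so the total the
// application requested can be compared against the allocator's physical usage.
class MemoryUsageTracker {
    private final Map<String, LongAdder> usageByCategory = new ConcurrentHashMap<>();

    void onAllocate(String category, long bytes) {
        usageByCategory.computeIfAbsent(category, k -> new LongAdder()).add(bytes);
    }

    void onRelease(String category, long bytes) {
        usageByCategory.computeIfAbsent(category, k -> new LongAdder()).add(-bytes);
    }

    long requestedTotal() {
        return usageByCategory.values().stream().mapToLong(LongAdder::sum).sum();
    }

    // Fragmentation rate: fraction of the allocator's physical footprint
    // that the application did not actually ask for.
    static double fragmentationRate(long physicalBytes, long requestedBytes) {
        return 1.0 - (double) requestedBytes / physicalBytes;
    }
}
```

&lt;p&gt;Plugging in the figures from the log above (2294284288 bytes held by Netty against 1870424720 bytes requested) gives a fragmentation rate of about 18.5%, and that gap is what grew over time.&lt;/p&gt;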

&lt;p&gt;Ultimately, the cause of the OOM was identified as memory fragmentation in Netty's PooledByteBufAllocator. Having initially identified the problem, the next step was to understand why Netty had memory fragmentation and how AutoMQ could mitigate this issue.&lt;br&gt;
&lt;strong&gt;Netty Memory Fragmentation&lt;/strong&gt;&lt;br&gt;
First, let's explore the causes of Netty's memory fragmentation. Netty's memory fragmentation can be divided into internal fragmentation and external fragmentation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Internal Fragmentation: This type of fragmentation occurs due to size standardization alignment. For example, when you expect to allocate 1 byte, but the underlying system actually occupies 16 bytes, leading to an internal fragmentation waste of 15 bytes.&lt;/li&gt;
&lt;li&gt;External Fragmentation: Simply put, any fragmentation caused by factors other than internal fragmentation is considered external fragmentation. This usually results from memory layout fragmentation caused by allocation algorithms.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Internal and external fragmentation exhibit different behaviors in different versions of Netty. Below, we will briefly introduce the working mechanisms and causes of memory fragmentation for the Buddy Allocation Algorithm and the PageRun/PoolSubPage Allocation Algorithm, using Netty version 4.1.52 as a dividing line.&lt;br&gt;
&lt;strong&gt;Buddy Allocation Algorithm in Netty &amp;lt; 4.1.52&lt;/strong&gt;&lt;br&gt;
Netty versions prior to 4.1.52 use the Buddy Allocation Algorithm, which originates from jemalloc3. To improve memory allocation efficiency, Netty requests a contiguous chunk of memory (PoolChunk) from the operating system at once. When a ByteBuf is requested from the upper layer, this chunk of memory is logically divided and returned as needed. The default size of a PoolChunk is 16MB, which is logically divided into 2048 pages, each 8KB in size. The memory usage is represented by a complete binary tree.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwm5ynyyl2suk7e9s00k3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwm5ynyyl2suk7e9s00k3.png" alt="Image description" width="800" height="287"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each node in the complete binary tree uses one byte to represent the node's state (memoryMap):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The initial value represents the number of layers, with the status value == number of layers indicating that the node is completely idle.&lt;/li&gt;
&lt;li&gt;When the number of layers &amp;lt; status value &amp;lt; 12, it means that the node is partially used but still has remaining space.&lt;/li&gt;
&lt;li&gt;When the status value == 12, it means that the node has been fully allocated.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Memory allocation is divided into four types: Tiny [0, 512 bytes], Small (512 bytes, 8KB), Normal [8KB, 16MB], and Huge (16MB, Max). Tiny and Small are managed by PoolSubpage, Normal is managed by PoolChunk, and Huge is allocated directly.&lt;/p&gt;

&lt;p&gt;First, let's look at the allocation efficiency of small memory blocks. Tiny [0, 512 bytes] and Small (512 bytes, 8KB) divide a Page into equally sized logical blocks through PoolSubpage, with a bitmap marking the usage of these blocks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The basic unit of Tiny memory allocation is 16 bytes: a request of 50 bytes actually occupies 64 bytes, giving an internal fragmentation rate of 1 - 50 / 64 ≈ 21.9%.&lt;/li&gt;
&lt;li&gt;The basic unit of Small memory allocation is 1KB, meaning if the requested size is 1.5KB, 2KB are actually allocated, resulting in an internal fragmentation rate of 25%.&lt;/li&gt;
&lt;/ul&gt;
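&lt;p&gt;The internal fragmentation arithmetic for quantized allocation can be reproduced with a small helper (a sketch; here the rate is defined as the wasted fraction of the memory actually occupied, and the 16-byte and 1KB quanta follow the text above):&lt;/p&gt;

```java
// Internal fragmentation from rounding a request up to an allocation quantum.
final class Fragmentation {
    // Round request up to the next multiple of quantum.
    static long roundUp(long request, long quantum) {
        return ((request + quantum - 1) / quantum) * quantum;
    }

    // Wasted fraction of the memory actually occupied.
    static double internalRate(long request, long quantum) {
        long allocated = roundUp(request, quantum);
        return 1.0 - (double) request / allocated;
    }
}
```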

&lt;p&gt;Next, let's examine the allocation of medium-sized memory blocks, Normal [8KB, 16MB]. Suppose we request 2MB + 1KB = 2049KB from a completely idle PoolChunk:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;2049KB is normalized up to 4MB (the next power of two), targeting a free node at depth 2.&lt;/li&gt;
&lt;li&gt;Check the node at index=1: it is free, so descend into the left subtree.&lt;/li&gt;
&lt;li&gt;Check the node at index=2: it is free, so descend into the left subtree again.&lt;/li&gt;
&lt;li&gt;The node at index=4 is unallocated: mark its state as 12, then update each ancestor to the minimum of its children's states, so index=2 becomes min(12, 2) = 2 and index=1 becomes 1.&lt;/li&gt;
&lt;li&gt;Allocation completed.
From the allocation result, we can see that requesting 2049KB of memory actually marks 4MB as occupied, implying an internal fragmentation rate of 49.9%.&lt;/li&gt;
&lt;/ol&gt;
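&lt;p&gt;The allocation walk above can be sketched in a few lines of Java. This is a simplified model of the memoryMap tree, not Netty's actual &lt;code&gt;PoolChunk&lt;/code&gt; code, and it only covers allocating one node at a given depth:&lt;/p&gt;

```java
// Minimal sketch of the Buddy allocation walk: a complete binary tree over
// 2048 pages, where memoryMap[id] == depth(id) means "fully free" and 12
// means "fully used"; a parent's state is the minimum of its children's.
final class BuddySketch {
    static final int MAX_DEPTH = 11;          // 2^11 = 2048 pages of 8KB = 16MB
    static final byte UNUSABLE = 12;
    final byte[] memoryMap = new byte[1 << (MAX_DEPTH + 1)]; // ids 1..4095

    BuddySketch() {
        for (int d = 0; d <= MAX_DEPTH; d++)
            for (int id = 1 << d; id < 1 << (d + 1); id++)
                memoryMap[id] = (byte) d;     // initial state == own depth
    }

    // Allocate one node at the given depth; returns the node id, or -1 if
    // no sufficiently large free node remains in this chunk.
    int allocate(int depth) {
        if (memoryMap[1] > depth) return -1;
        int id = 1;
        for (int d = 0; d < depth; d++) {
            id <<= 1;                          // try the left child first
            if (memoryMap[id] > depth) id ^= 1; // else take the right child
        }
        memoryMap[id] = UNUSABLE;
        for (int p = id >> 1; p >= 1; p >>= 1) // propagate min of children up
            memoryMap[p] = (byte) Math.min(memoryMap[p << 1], memoryMap[(p << 1) + 1]);
        return id;
    }
}
```

&lt;p&gt;Allocating one depth-2 (4MB) node marks index=4 as used; a subsequent depth-0 (16MB) request then fails even though 12MB of the chunk remains free, which is exactly the external fragmentation described next.&lt;/p&gt;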

&lt;p&gt;Suppose another 9MB memory is requested. Although the previous PoolChunk still has 12MB of remaining space, due to the Buddy memory allocation algorithm, index=1 is partially occupied, requiring a new PoolChunk to allocate 9MB of memory. The resulting external fragmentation rate is 1 - (4MB + 9MB) / 32MB = 59.3%. The effective memory utilization rate, which is the required memory / actual underlying occupied memory, is only 34.3%.&lt;/p&gt;

&lt;p&gt;Furthermore, in scenarios of continuous allocation and release of variously sized memory blocks, even if the PoolChunk doesn't allocate a large space, it might be logically fragmented by scattered memory blocks, leading to increased external memory fragmentation. As shown in the figure below, although the upper-layer application ultimately retains only 4 * 8KB, it is no longer possible to request 4MB of memory from this PoolChunk.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb87eael7ldmwl5uxzifd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb87eael7ldmwl5uxzifd.png" alt="Image description" width="800" height="287"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PageRun/PoolSubpage Allocation Algorithm in Netty &amp;gt;= 4.1.52&lt;/strong&gt;&lt;br&gt;
Netty &amp;gt;= 4.1.52 adopts jemalloc4 to enhance memory allocation through the PageRun/PoolSubpage allocation strategy. Compared to the original Buddy allocation algorithm, it offers lower internal and external memory fragmentation rates for both small and large memory allocations.&lt;br&gt;
The PageRun/PoolSubpage allocation algorithm compared to the original Buddy allocation algorithm:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The default size of a Chunk has been reduced from 16MB to 4MB.&lt;/li&gt;
&lt;li&gt;The Chunk and Page concepts are retained, with the addition of the Run concept. A Run is a series of contiguous Pages used to allocate Normal (28KB to 4MB) medium-sized memory.&lt;/li&gt;
&lt;li&gt;Tiny and Small memory blocks are replaced with PoolSubpages, which can span multiple Pages, ranging from 16 bytes to 28KB, with a total of 38 basic allocation sizes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb0ylmfinlkaosn9dvze5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb0ylmfinlkaosn9dvze5.png" alt="Image description" width="800" height="157"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's first examine the efficiency of small memory block allocation with an example of requesting 1025 bytes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;First, 1025 bytes is rounded up to the nearest PoolSubpage size class, which is 1280 bytes.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sizeIdx2sizeTab=[16, 32, 48, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320, 384, 448, 512, 640, 768, 896, 1024, 1280, 1536, 1792, 2048, 2560, 3072, 3584, 4096, 5120, 6144, 7168, 8192, 10240, 12288, 14336, 16384, 20480, 24576, 28672, ...]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;Then, PoolChunk will determine that the PoolSubPage should contain 5 pages by finding the least common multiple of 1280 bytes and the page size of 8KB, which is 40KB.&lt;/li&gt;
&lt;li&gt;It allocates 5 contiguous pages from PoolChunk and tracks the allocated elements via bitmapIdx.&lt;/li&gt;
&lt;li&gt;At this point, the allocation is complete, resulting in an internal fragmentation rate of 1 - 1025 / 1280 = 19.9%.
Thanks to the finer granularity of PoolSubPage, which has been refined from 2 levels to 38 levels, the allocation efficiency of small memory blocks has been significantly improved.&lt;/li&gt;
&lt;/ol&gt;
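&lt;p&gt;The rounding and page-count arithmetic in the steps above can be checked mechanically (a sketch using the size table quoted earlier; the method names are illustrative, not Netty's):&lt;/p&gt;

```java
import java.util.Arrays;

// Size-class lookup and PoolSubpage sizing arithmetic (sketch).
final class SubpageMath {
    static final long PAGE = 8 * 1024;
    static final int[] SIZE_TAB = {16, 32, 48, 64, 80, 96, 112, 128, 160, 192,
        224, 256, 320, 384, 448, 512, 640, 768, 896, 1024, 1280, 1536, 1792,
        2048, 2560, 3072, 3584, 4096, 5120, 6144, 7168, 8192, 10240, 12288,
        14336, 16384, 20480, 24576, 28672};

    // Smallest size class that fits the request.
    static int roundToSizeClass(int request) {
        int i = Arrays.binarySearch(SIZE_TAB, request);
        return i >= 0 ? SIZE_TAB[i] : SIZE_TAB[-i - 1];
    }

    // Pages in the PoolSubpage: lcm(elemSize, pageSize) / pageSize.
    static long subpagePages(long elemSize) {
        return lcm(elemSize, PAGE) / PAGE;
    }

    static long lcm(long a, long b) { return a / gcd(a, b) * b; }
    static long gcd(long a, long b) { return b == 0 ? a : gcd(b, a % b); }
}
```

&lt;p&gt;For 1025 bytes this yields the 1280-byte size class and a 5-page (40KB) subpage, matching the walkthrough above.&lt;/p&gt;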

&lt;p&gt;Next, let's examine the allocation efficiency of medium-sized memory blocks, Normal (28KB, 4MB]. Suppose a request is made to allocate 2MB + 1KB = 2049KB of memory from a completely idle PoolChunk:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;After rounding up 2049KB to the nearest multiple of 8KB, it is determined that 257 pages are needed.&lt;/li&gt;
&lt;li&gt;PoolChunk finds a run that satisfies the size requirement: Run{offset=0, size=512}.&lt;/li&gt;
&lt;li&gt;PoolChunk splits the run into Run{offset=0, size=257} and Run{offset=257, size=255}. The first run is returned to the requester, while the second run is added to the free run list (runsAvail).&lt;/li&gt;
&lt;li&gt;At this point, the allocation is complete, and the internal fragmentation rate is 1 - 2049KB / (257 * 8K) = 0.3%;
Through the PageRun mechanism, Netty can control the memory waste of memory block allocation greater than 28KB, not exceeding 8KB, with an internal fragmentation rate of less than 22.2%.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Assuming an additional 1MB of memory is applied for, the PoolChunk continues to run the same logic, splitting Run{offset=257, size=255} into Run{offset=257, size=128} and Run{offset=385, size=127}. The former is returned to the upper layer, while the latter is added to the list of free Runs. At this point, the external fragmentation rate is 25%. If we were to follow the old Buddy algorithm, in a scenario where the size of the PoolChunk is 4MB, a new PoolChunk would need to be opened, resulting in an external fragmentation rate of 62.5%.&lt;/p&gt;
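&lt;p&gt;The run bookkeeping above reduces to simple arithmetic over (offset, size-in-pages) pairs; a minimal sketch with illustrative types, not Netty's internal representation:&lt;/p&gt;

```java
// A Run is a contiguous range of 8KB pages: [offset, offset + size).
final class Run {
    final int offset, size;
    Run(int offset, int size) { this.offset = offset; this.size = size; }
}

final class RunSplit {
    static final int PAGE_KB = 8;

    // Pages needed for a request, rounded up to whole pages.
    static int pagesFor(int requestKb) {
        return (requestKb + PAGE_KB - 1) / PAGE_KB;
    }

    // Split a free run: the head serves the request, the tail stays free.
    static Run[] split(Run free, int pages) {
        return new Run[]{ new Run(free.offset, pages),
                          new Run(free.offset + pages, free.size - pages) };
    }
}
```

&lt;p&gt;Splitting a 512-page chunk for the 2049KB (257-page) and then the 1MB (128-page) request leaves Run{offset=385, size=127}, i.e. 127 of 512 pages free.&lt;/p&gt;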

&lt;p&gt;Although the PageRun/PoolSubpage allocation algorithm has a lower internal and external memory fragmentation rate compared to the original Buddy allocation algorithm, it does not compact fragmented memory through Garbage Collection (GC) like the JVM does. This results in scenarios where memory blocks of various sizes are continuously allocated and released, leading to fragmented available runs within a PoolChunk. Over time, the memory fragmentation rate gradually increases, eventually causing an Out Of Memory (OOM) error.&lt;br&gt;
&lt;strong&gt;AutoMQ's Response&lt;/strong&gt;&lt;br&gt;
After introducing the Netty memory allocation mechanism and scenarios where memory fragmentation occurs, how does AutoMQ solve the memory fragmentation issue?&lt;/p&gt;

&lt;p&gt;LogCache adopts a first-in, first-out eviction policy to match the tail-read pattern of continuous access to new data, which means memory allocated at adjacent times is freed at adjacent times. AutoMQ exploits this with a strategy called ByteBufSeqAlloc:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ByteBufSeqAlloc requests a ChunkSize ByteBuf from Netty each time, so Netty never has to split a chunk, achieving zero external memory fragmentation;&lt;/li&gt;
&lt;li&gt;ByteBufSeqAlloc allocates memory through the underlying ByteBuf#retainSlice, slicing exact-size segments from the large contiguous block and thereby avoiding the internal fragmentation caused by size normalization;&lt;/li&gt;
&lt;li&gt;When releasing, adjacent slices are released together. Part of a chunk may still be in use and keep the whole chunk alive, but this waste occurs only once per chunk and is bounded by one ChunkSize.&lt;/li&gt;
&lt;/ul&gt;
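&lt;p&gt;A minimal sketch of the sequential-slicing idea, using the JDK's &lt;code&gt;ByteBuffer&lt;/code&gt; in place of Netty's &lt;code&gt;ByteBuf&lt;/code&gt; (AutoMQ's real &lt;code&gt;ByteBufSeqAlloc&lt;/code&gt; works on &lt;code&gt;ByteBuf#retainSlice&lt;/code&gt; with reference counting, which is omitted here):&lt;/p&gt;

```java
import java.nio.ByteBuffer;

// Hands out exact-size slices from one large chunk, front to back.
// Exact slicing avoids internal fragmentation; allocating the chunk
// whole avoids external fragmentation inside the allocator.
final class SeqAlloc {
    private final int chunkSize;
    private ByteBuffer chunk;

    SeqAlloc(int chunkSize) {
        this.chunkSize = chunkSize;
        this.chunk = ByteBuffer.allocate(chunkSize);
    }

    ByteBuffer alloc(int size) {
        if (size > chunkSize) throw new IllegalArgumentException("larger than chunk");
        if (chunk.remaining() < size) {
            // The tail of the old chunk is wasted, but at most once per chunk.
            chunk = ByteBuffer.allocate(chunkSize);
        }
        ByteBuffer slice = chunk.slice(); // view starting at current position
        slice.limit(size);                // trim the view to the exact request
        chunk.position(chunk.position() + size);
        return slice;
    }
}
```

&lt;p&gt;Every slice is exactly the requested size, so nothing is lost to size normalization; only the tail of a chunk can be wasted, and only once per chunk.&lt;/p&gt;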

&lt;p&gt;BlockCache is built for high-throughput cold reads, fetching large segments of data from object storage. AutoMQ's strategy is to cache large chunks of raw data from object storage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On-demand decoding: Data is decoded into specific RecordBatch only when queried, reducing the number of resident memory blocks and hence minimizing memory fragmentation.&lt;/li&gt;
&lt;li&gt;Structured splitting: In the future, large cache blocks can be split into structured 1MB memory blocks to avoid increasing memory fragmentation rates caused by continuous allocation and release of various sized memory blocks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flgtjtlind6exf8t7xout.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flgtjtlind6exf8t7xout.png" alt="Image description" width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In essence, both LogCache and BlockCache sidestep the fragmentation behavior of Netty's memory allocation strategy by making large, structured allocations tailored to each cache's access pattern. With this method, AutoMQ keeps the off-heap memory fragmentation rate below 35% across long-running scenarios such as tail reads, cold reads, and mixed message sizes, without encountering off-heap memory OOM issues.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fapqoazo8lqceqvamecx1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fapqoazo8lqceqvamecx1.png" alt="Image description" width="800" height="518"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Summary&lt;/strong&gt;&lt;br&gt;
Netty's PooledByteBufAllocator is not a silver bullet: account for the physical memory amplification caused by fragmentation and reserve a reasonable amount of JVM memory. If Netty is used only as a network-layer framework, the ByteBufs it allocates are short-lived, so the amplification from fragmentation stays small; upgrading to Netty 4.1.52 or later is still recommended for better memory allocation efficiency. If PooledByteBufAllocator is used for caching, allocate large blocks and slice them yourself to avoid Netty's memory fragmentation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reference Document:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://netty.io/wiki/reference-counted-objects.html" rel="noopener noreferrer"&gt;https://netty.io/wiki/reference-counted-objects.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://netty.io/news/2020/09/08/4-1-52-Final.html" rel="noopener noreferrer"&gt;https://netty.io/news/2020/09/08/4-1-52-Final.html&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>AutoMQ vs Kafka: An Independent In-Depth Evaluation and Comparison by Little Red Book</title>
      <dc:creator>AutoMQ</dc:creator>
      <pubDate>Mon, 29 Jul 2024 08:55:21 +0000</pubDate>
      <link>https://dev.to/automq/automq-vs-kafka-an-independent-in-depth-evaluation-and-comparison-by-little-red-book-kl0</link>
      <guid>https://dev.to/automq/automq-vs-kafka-an-independent-in-depth-evaluation-and-comparison-by-little-red-book-kl0</guid>
      <description>&lt;p&gt;Test Background: The current Xiaohongshu message engine team is deeply collaborating with The AutoMQ Team to promote community building and explore cutting-edge cloud-native messaging engine technologies. This article provides a comprehensive evaluation of AutoMQ based on the OpenMessaging framework. We welcome everyone to join the community and share their evaluation experiences.&lt;br&gt;
&lt;strong&gt;1. Testing Conclusion&lt;/strong&gt;&lt;br&gt;
This article primarily evaluates the performance comparison between the cloud-native messaging engine AutoMQ and Apache Kafka® (version 3.4).&lt;br&gt;
Testing Conclusion:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time Read/Write: At the same cluster size, AutoMQ's maximum read/write throughput is three times that of Apache Kafka, and its end-to-end latency is 1/13 that of Apache Kafka.&lt;/li&gt;
&lt;li&gt;Catch-up Read: At the same cluster size, AutoMQ's peak catch-up read throughput is twice that of Apache Kafka, and catch-up reads leave AutoMQ's write throughput and latency unaffected.&lt;/li&gt;
&lt;li&gt;Partition Reassignment: AutoMQ's partition reassignment completes in seconds on average, whereas Apache Kafka's takes minutes to hours.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Testing Configuration&lt;/strong&gt;&lt;br&gt;
The benchmark extends the Linux Foundation's OpenMessaging Benchmark, simulating real user scenarios with dynamic workloads.&lt;br&gt;
2.1 Configuration Parameters&lt;br&gt;
By default, AutoMQ forces data to be flushed to disk before responding, using the following configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
acks=all
flush.message=1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;AutoMQ ensures high data durability through EBS's underlying multi-replica mechanism, so no additional replication is needed at the Kafka layer.&lt;br&gt;
For Apache Kafka, version 3.6.0 is used; following Confluent's recommendation, &lt;code&gt;flush.message = 1&lt;/code&gt; is not set. Instead, three replicas with in-memory asynchronous flushing ensure data reliability (a data-center power outage may cause data loss), configured as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
acks=all
replicationFactor=3
min.insync.replicas=2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;2.2 Machine Specifications&lt;br&gt;
16 cores, maximum network bandwidth of 800 MB/s, configured with a cloud disk offering 150 MB/s of bandwidth.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Detailed Comparison&lt;/strong&gt;&lt;br&gt;
3.1 Real-time Read and Write Performance Comparison&lt;br&gt;
This test measures the performance and throughput limits of AutoMQ and Apache Kafka® under the same cluster size and different traffic scales. The test scenarios are as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Deploy 6 data nodes for each system and create a Topic with 100 partitions.&lt;/li&gt;
&lt;li&gt;Start with 1:1 read/write traffic at 100 MiB/s and 200 MiB/s (message size = 4 KB, batch size = 200 KB); then test both systems for their maximum throughput.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Load files: [tail-read-100mb.yaml], [tail-read-200mb.yaml], [tail-read-900mb.yaml]&lt;/p&gt;

&lt;p&gt;Extreme Throughput Send Latency:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhl4o9buqgzz26j505eer.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhl4o9buqgzz26j505eer.png" alt="Image description" width="800" height="599"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Extreme Throughput:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ymy9m91bvnljfzj5471.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ymy9m91bvnljfzj5471.png" alt="Image description" width="800" height="573"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Detailed Data on Send Duration and E2E Duration:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7w1jz4ntvq75uo3e8yvu.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7w1jz4ntvq75uo3e8yvu.jpeg" alt="Image description" width="800" height="391"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8flf3j89rk0dpjzy7qlb.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8flf3j89rk0dpjzy7qlb.jpeg" alt="Image description" width="800" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Analysis:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In a cluster of the same scale, AutoMQ's maximum throughput (870MB/s) is three times that of Apache Kafka (280MB/s).&lt;/li&gt;
&lt;li&gt;Under the same cluster scale and traffic (200 MiB/s), AutoMQ's P999 latency is 1/50th that of Apache Kafka, and the E2E latency is 1/13th that of Apache Kafka.&lt;/li&gt;
&lt;li&gt;Under the same cluster scale and traffic (200 MiB/s), AutoMQ's bandwidth usage is 1/3rd that of Apache Kafka.&lt;/li&gt;
&lt;/ol&gt;
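As a quick sanity check on the first point, the ratio of the two peak throughput figures quoted above can be computed directly (both numbers come from this test):

```shell
# Ratio of AutoMQ's peak throughput (870 MB/s) to Apache Kafka's (280 MB/s).
ratio=$(awk 'BEGIN { printf "%.1f", 870 / 280 }')
echo "AutoMQ peak throughput is ${ratio}x that of Apache Kafka"
```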

&lt;p&gt;3.2 Comparison of Catch-up Read Performance&lt;br&gt;
Catch-up reads are a common scenario in messaging and streaming systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For messaging, queues are typically used to decouple business processes and smooth out traffic peaks. Peak smoothing requires the message queue to retain upstream data so that the downstream can consume it at its own pace; the downstream is then catching up on cold data that is no longer in memory.&lt;/li&gt;
&lt;li&gt;For streaming, periodic batch-processing tasks need to scan and compute over data from several hours or even a day earlier.&lt;/li&gt;
&lt;li&gt;There are also failure scenarios: a consumer may be down for several hours and then come back online, or a bug in consumer logic may be fixed, requiring a replay of historical data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Catch-up reads are evaluated on two aspects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Catch-up speed: the faster the catch-up read, the sooner consumers recover from failures and the sooner batch jobs produce their analytical results.&lt;/li&gt;
&lt;li&gt;Read/write isolation: catch-up reads should minimize the impact on production throughput and latency.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This test measures the catch-up read performance of AutoMQ and Apache Kafka® at the same cluster scale. The test scenario is as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Deploy 6 data nodes for each system and create a Topic with 100 partitions.&lt;/li&gt;
&lt;li&gt;Continuously send data at a throughput of 300 MiB/s.&lt;/li&gt;
&lt;li&gt;After 1 TiB of data has been sent, start a consumer that consumes from the earliest offset.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Load file: [catch-up-read.yaml]&lt;/p&gt;

&lt;p&gt;Test Results:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffvq9h04z89ud0usa0i4x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffvq9h04z89ud0usa0i4x.png" alt="Image description" width="800" height="535"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1zyxp954prwanlysnl9i.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1zyxp954prwanlysnl9i.jpeg" alt="Image description" width="800" height="312"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Analysis&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Under the same cluster size, AutoMQ's catch-up read peak is twice that of Apache Kafka.&lt;/li&gt;
&lt;li&gt;During the catch-up read, AutoMQ's sending throughput was unaffected, with an average send latency increase of approximately 0.4 ms. In contrast, Apache Kafka's sending throughput decreased by 10%, and the average send latency surged to 900 ms. This is because Apache Kafka reads from the disk during catch-up reads and does not perform IO isolation, occupying the cloud disk's read-write bandwidth. This reduces the write bandwidth, leading to a drop in sending throughput. Moreover, reading cold data from the disk contaminates the page cache, further increasing write latency. In comparison, AutoMQ separates reads and writes, utilizing object storage for reads during catch-up, which does not consume disk read-write bandwidth and hence does not affect sending throughput and latency.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3.3 Partition Reassignment Capability Comparison&lt;/strong&gt;&lt;br&gt;
This test measures the time and impact of reassigning a partition with 30 GiB of data to a node that does not currently have a replica of the partition, under a scenario with regular send and consume traffic. The specific test scenario is as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;2 brokers, with the following setup:

&lt;ul&gt;
&lt;li&gt;1 single-partition single-replica Topic A, continuously reading and writing at a throughput of 40 MiB/s.&lt;/li&gt;
&lt;li&gt;1 four-partition single-replica Topic B, continuously reading and writing at a throughput of 10 MiB/s as background traffic.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;After 10 minutes, migrate the only partition of Topic A to another node, with a migration throughput limit of 100 MiB/s.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Load file: [partition-reassign.yaml]&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkypbflz6kdv5iz2lbauu.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkypbflz6kdv5iz2lbauu.jpeg" alt="Image description" width="742" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analysis&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AutoMQ partition migration only requires uploading the data buffered on EBS to S3 before the partition can be safely opened on the new node. Typically, 500 MiB of data can be uploaded within 2 to 5 seconds, and the migration time does not depend on the partition's data volume; the average migration takes around 2 seconds. During migration, AutoMQ returns the NOT_LEADER_OR_FOLLOWER error code to clients. Once migration completes, the client refreshes its Topic routing table and internally retries against the new node, so the send latency for that partition rises briefly and then returns to normal.&lt;/li&gt;
&lt;li&gt;Apache Kafka® partition reassignment requires copying the partition's replicas to new nodes. While copying historical data, it must also keep up with newly written data. The reassignment duration is calculated as partition data size / (reassignment throughput limit - partition write throughput). In actual production environments, partition reassignment typically takes hours. In this test, reassigning a 30 GiB partition took 15 minutes. Besides the long reassignment duration, Apache Kafka® reassignment necessitates reading cold data from the disk. Even with throttle settings, it can still cause page cache contention, leading to latency spikes and affecting service quality.&lt;/li&gt;
&lt;/ul&gt;
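The reassignment-duration formula above can be checked against this test's numbers (30 GiB partition, 100 MiB/s reassignment limit, 40 MiB/s partition write throughput). Note this is a lower bound, since it ignores cold-read contention; the observed time in this test was longer:

```shell
# Lower-bound estimate of Apache Kafka reassignment time:
#   duration = partition size / (reassignment limit - write throughput)
size_mib=$((30 * 1024))        # 30 GiB partition, in MiB
limit_mibps=100                # reassignment throughput limit
write_mibps=40                 # ongoing partition write throughput
duration_s=$((size_mib / (limit_mibps - write_mibps)))
echo "estimated reassignment time: ${duration_s}s (~$((duration_s / 60)) min)"
```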

</description>
    </item>
    <item>
      <title>Use Kafdrop to Manage AutoMQ</title>
      <dc:creator>AutoMQ</dc:creator>
      <pubDate>Mon, 29 Jul 2024 08:46:09 +0000</pubDate>
      <link>https://dev.to/automq/use-kafdrop-to-manage-automq-4ln1</link>
      <guid>https://dev.to/automq/use-kafdrop-to-manage-automq-4ln1</guid>
      <description>&lt;p&gt;&lt;strong&gt;Preface&lt;/strong&gt;&lt;br&gt;
Kafdrop [1] is a simple, intuitive, and powerful web UI tool designed for Kafka. It allows developers and administrators to easily view and manage key metadata of Kafka clusters, including Topics, partitions, Consumer Groups, and their offsets. By providing a user-friendly interface, Kafdrop greatly simplifies the monitoring and management of Kafka clusters, enabling users to quickly obtain cluster status information without relying on complex command-line tools.&lt;/p&gt;

&lt;p&gt;Thanks to AutoMQ's full compatibility with Kafka, it can seamlessly integrate with Kafdrop. By utilizing Kafdrop, AutoMQ users can also benefit from an intuitive user interface for real-time monitoring of Kafka cluster status, including Topics, partitions, Consumer Groups, and their offsets. This monitoring capability not only enhances problem diagnosis efficiency but also helps optimize cluster performance and resource utilization.&lt;br&gt;
This tutorial will teach you how to start the Kafdrop service and integrate it with an AutoMQ cluster to monitor and manage the cluster state.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftxbtmjgmdddfqpjnnxi0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftxbtmjgmdddfqpjnnxi0.png" alt="Image description" width="800" height="522"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kafdrop environment: an AutoMQ cluster, JDK 17, and Maven 3.6.3 or above.&lt;/li&gt;
&lt;li&gt;Kafdrop can be run from a JAR file, with Docker, or in Kubernetes; refer to the official documentation [3].&lt;/li&gt;
&lt;li&gt;Prepare 5 hosts to deploy the AutoMQ cluster. Linux amd64 hosts with 2 CPUs and 16 GB of RAM are recommended, along with two virtual storage volumes.&lt;/li&gt;
&lt;li&gt;Download the latest official binary package from AutoMQ GitHub Releases to install AutoMQ.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Below, we first set up the AutoMQ cluster and then start Kafdrop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Install and start the AutoMQ cluster&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Step 1: Generate the S3 URL&lt;/strong&gt;&lt;br&gt;
AutoMQ provides the automq-kafka-admin.sh tool to start AutoMQ quickly. Simply provide an S3 URL containing the required S3 endpoint and authentication information to start AutoMQ with a single command, without manually generating cluster IDs or formatting storage.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;### 命令行使用示例
bin/automq-kafka-admin.sh generate-s3-url \
--s3-access-key=xxx \
--s3-secret-key=yyy \
--s3-region=cn-northwest-1 \
--s3-endpoint=s3.cn-northwest-1.amazonaws.com.cn \
--s3-data-bucket=automq-data \
--s3-ops-bucket=automq-ops
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Note: You need to pre-configure an AWS S3 bucket. If you encounter errors, please ensure the parameters and format are correct.&lt;br&gt;
&lt;strong&gt;Output Result&lt;/strong&gt;&lt;br&gt;
After executing this command, the process will automatically proceed through the following stages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Based on the provided accessKey and secretKey, test S3's core features to verify compatibility between AutoMQ and the S3 endpoint.&lt;/li&gt;
&lt;li&gt;Generate the s3url based on identity information and access point information.&lt;/li&gt;
&lt;li&gt;Obtain the startup command for AutoMQ from the s3url. In the command, replace --controller-list and --broker-list with the actual CONTROLLER and BROKER hosts to be deployed.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Example output:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;############  Ping s3 ########################

[ OK ] Write s3 object
[ OK ] Read s3 object
[ OK ] Delete s3 object
[ OK ] Write s3 object
[ OK ] Upload s3 multipart object
[ OK ] Read s3 multipart object
[ OK ] Delete s3 object
############  String of s3url ################

Your s3url is:

s3://s3.cn-northwest-1.amazonaws.com.cn?s3-access-key=xxx&amp;amp;s3-secret-key=yyy&amp;amp;s3-region=cn-northwest-1&amp;amp;s3-endpoint-protocol=https&amp;amp;s3-data-bucket=automq-data&amp;amp;s3-path-style=false&amp;amp;s3-ops-bucket=automq-ops&amp;amp;cluster-id=40ErA_nGQ_qNPDz0uodTEA


############  Usage of s3url  ################
To start AutoMQ, generate the start commandline using s3url.
bin/automq-kafka-admin.sh generate-start-command \
--s3-url="s3://s3.cn-northwest-1.amazonaws.com.cn?s3-access-key=XXX&amp;amp;s3-secret-key=YYY&amp;amp;s3-region=cn-northwest-1&amp;amp;s3-endpoint-protocol=https&amp;amp;s3-data-bucket=automq-data&amp;amp;s3-path-style=false&amp;amp;s3-ops-bucket=automq-ops&amp;amp;cluster-id=40ErA_nGQ_qNPDz0uodTEA" \
--controller-list="192.168.0.1:9093;192.168.0.2:9093;192.168.0.3:9093"  \
--broker-list="192.168.0.4:9092;192.168.0.5:9092"

TIPS: Please replace the controller-list and broker-list with your actual IP addresses.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Step 2: Generate a list of startup commands&lt;/strong&gt;&lt;br&gt;
Replace the --controller-list and --broker-list in the commands generated in the previous step with your host information. Specifically, substitute them with the IP addresses of the 3 CONTROLLERs and 2 BROKERs mentioned in the environment setup, and use the default ports 9092 and 9093.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bin/automq-kafka-admin.sh generate-start-command \
--s3-url="s3://s3.cn-northwest-1.amazonaws.com.cn?s3-access-key=XXX&amp;amp;s3-secret-key=YYY&amp;amp;s3-region=cn-northwest-1&amp;amp;s3-endpoint-protocol=https&amp;amp;s3-data-bucket=automq-data&amp;amp;s3-path-style=false&amp;amp;s3-ops-bucket=automq-ops&amp;amp;cluster-id=40ErA_nGQ_qNPDz0uodTEA" \
--controller-list="192.168.0.1:9093;192.168.0.2:9093;192.168.0.3:9093"  \
--broker-list="192.168.0.4:9092;192.168.0.5:9092"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output Result&lt;/strong&gt;&lt;br&gt;
Upon executing the command, a startup command for AutoMQ will be generated.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;############  Start Commandline ##############
To start an AutoMQ Kafka server, please navigate to the directory where your AutoMQ tgz file is located and run the following command.

Before running the command, make sure that Java 17 is installed on your host. You can verify the Java version by executing 'java -version'.

bin/kafka-server-start.sh --s3-url="s3://s3.cn-northwest-1.amazonaws.com.cn?s3-access-key=XXX&amp;amp;s3-secret-key=YYY&amp;amp;s3-region=cn-northwest-1&amp;amp;s3-endpoint-protocol=https&amp;amp;s3-data-bucket=automq-data&amp;amp;s3-path-style=false&amp;amp;s3-ops-bucket=automq-ops&amp;amp;cluster-id=40ErA_nGQ_qNPDz0uodTEA" --override process.roles=broker,controller --override node.id=0 --override controller.quorum.voters=0@192.168.0.1:9093,1@192.168.0.2:9093,2@192.168.0.3:9093 --override listeners=PLAINTEXT://192.168.0.1:9092,CONTROLLER://192.168.0.1:9093 --override advertised.listeners=PLAINTEXT://192.168.0.1:9092

bin/kafka-server-start.sh --s3-url="s3://s3.cn-northwest-1.amazonaws.com.cn?s3-access-key=XXX&amp;amp;s3-secret-key=YYY&amp;amp;s3-region=cn-northwest-1&amp;amp;s3-endpoint-protocol=https&amp;amp;s3-data-bucket=automq-data&amp;amp;s3-path-style=false&amp;amp;s3-ops-bucket=automq-ops&amp;amp;cluster-id=40ErA_nGQ_qNPDz0uodTEA" --override process.roles=broker,controller --override node.id=1 --override controller.quorum.voters=0@192.168.0.1:9093,1@192.168.0.2:9093,2@192.168.0.3:9093 --override listeners=PLAINTEXT://192.168.0.2:9092,CONTROLLER://192.168.0.2:9093 --override advertised.listeners=PLAINTEXT://192.168.0.2:9092

bin/kafka-server-start.sh --s3-url="s3://s3.cn-northwest-1.amazonaws.com.cn?s3-access-key=XXX&amp;amp;s3-secret-key=YYY&amp;amp;s3-region=cn-northwest-1&amp;amp;s3-endpoint-protocol=https&amp;amp;s3-data-bucket=automq-data&amp;amp;s3-path-style=false&amp;amp;s3-ops-bucket=automq-ops&amp;amp;cluster-id=40ErA_nGQ_qNPDz0uodTEA" --override process.roles=broker,controller --override node.id=2 --override controller.quorum.voters=0@192.168.0.1:9093,1@192.168.0.2:9093,2@192.168.0.3:9093 --override listeners=PLAINTEXT://192.168.0.3:9092,CONTROLLER://192.168.0.3:9093 --override advertised.listeners=PLAINTEXT://192.168.0.3:9092

bin/kafka-server-start.sh --s3-url="s3://s3.cn-northwest-1.amazonaws.com.cn?s3-access-key=XXX&amp;amp;s3-secret-key=YYY&amp;amp;s3-region=cn-northwest-1&amp;amp;s3-endpoint-protocol=https&amp;amp;s3-data-bucket=automq-data&amp;amp;s3-path-style=false&amp;amp;s3-ops-bucket=automq-ops&amp;amp;cluster-id=40ErA_nGQ_qNPDz0uodTEA" --override process.roles=broker --override node.id=3 --override controller.quorum.voters=0@192.168.0.1:9093,1@192.168.0.2:9093,2@192.168.0.3:9093 --override listeners=PLAINTEXT://192.168.0.4:9092 --override advertised.listeners=PLAINTEXT://192.168.0.4:9092

bin/kafka-server-start.sh --s3-url="s3://s3.cn-northwest-1.amazonaws.com.cn?s3-access-key=XXX&amp;amp;s3-secret-key=YYY&amp;amp;s3-region=cn-northwest-1&amp;amp;s3-endpoint-protocol=https&amp;amp;s3-data-bucket=automq-data&amp;amp;s3-path-style=false&amp;amp;s3-ops-bucket=automq-ops&amp;amp;cluster-id=40ErA_nGQ_qNPDz0uodTEA" --override process.roles=broker --override node.id=4 --override controller.quorum.voters=0@192.168.0.1:9093,1@192.168.0.2:9093,2@192.168.0.3:9093 --override listeners=PLAINTEXT://192.168.0.5:9092 --override advertised.listeners=PLAINTEXT://192.168.0.5:9092


TIPS: Start controllers first and then the brokers.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: The node.id is automatically generated starting from 0 by default.&lt;br&gt;
&lt;strong&gt;Step 3: Start AutoMQ&lt;/strong&gt;&lt;br&gt;
To start the cluster, sequentially execute the list of commands generated in the previous step on the pre-specified CONTROLLER or BROKER hosts. For instance, to start the first CONTROLLER process on 192.168.0.1, execute the first command template from the generated startup command list.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bin/kafka-server-start.sh --s3-url="s3://s3.cn-northwest-1.amazonaws.com.cn?s3-access-key=XXX&amp;amp;s3-secret-key=YYY&amp;amp;s3-region=cn-northwest-1&amp;amp;s3-endpoint-protocol=https&amp;amp;s3-data-bucket=automq-data&amp;amp;s3-path-style=false&amp;amp;s3-ops-bucket=automq-ops&amp;amp;cluster-id=40ErA_nGQ_qNPDz0uodTEA" --override process.roles=broker,controller --override node.id=0 --override controller.quorum.voters=0@192.168.0.1:9093,1@192.168.0.2:9093,2@192.168.0.3:9093 --override listeners=PLAINTEXT://192.168.0.1:9092,CONTROLLER://192.168.0.1:9093 --override advertised.listeners=PLAINTEXT://192.168.0.1:9092
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Run in the background&lt;/strong&gt;&lt;br&gt;
To run a node in the background, append the following to the end of its start command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;command &amp;gt; /dev/null 2&amp;gt;&amp;amp;1 &amp;amp;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
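To make the pattern concrete, here is a minimal demonstration; `sleep 1` stands in for the actual `bin/kafka-server-start.sh ...` invocation, which needs a real cluster environment:

```shell
# Run a long-lived command in the background, discarding its output.
# 'sleep 1' is a placeholder for the kafka-server-start.sh command line.
sleep 1 > /dev/null 2>&1 &
bg_pid=$!
echo "background process started with PID ${bg_pid}"
```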



&lt;p&gt;&lt;strong&gt;Start the Kafdrop service&lt;/strong&gt;&lt;br&gt;
Having set up the AutoMQ cluster and obtained the addresses and ports of all broker nodes, we can now start the Kafdrop service.&lt;br&gt;
Note: Ensure that the host running Kafdrop can reach the AutoMQ cluster; otherwise, connections will time out.&lt;br&gt;
In this example, the Kafdrop service is started from the JAR package. The steps are as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pull the Kafdrop repository source code: Kafdrop GitHub [4]
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/obsidiandynamics/kafdrop.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;Use Maven to locally compile and package Kafdrop to generate the JAR file. Execute the following in the root directory:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mvn clean compile package
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;Start the service, specifying the addresses and ports of the AutoMQ cluster brokers:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;java --add-opens=java.base/sun.nio.ch=ALL-UNNAMED \
    -jar target/kafdrop-&amp;lt;version&amp;gt;.jar \
    --kafka.brokerConnect=&amp;lt;host:port,host:port&amp;gt;,...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol&gt;
&lt;li&gt;Replace &lt;code&gt;kafdrop-&amp;lt;version&amp;gt;.jar&lt;/code&gt; with the specific version, such as &lt;code&gt;kafdrop-4.0.2-SNAPSHOT.jar&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--kafka.brokerConnect=&amp;lt;host:port,host:port&amp;gt;&lt;/code&gt; specifies the hosts and ports of the broker nodes in the cluster.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The console startup output is as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq722th9kr1j2rx0udgm9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq722th9kr1j2rx0udgm9.png" alt="Image description" width="800" height="279"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If not specified, kafka.brokerConnect defaults to localhost:9092.&lt;/p&gt;

&lt;p&gt;Note: Starting from Kafdrop 3.10.0, a ZooKeeper connection is no longer required. All necessary cluster information is retrieved through the Kafka management API.&lt;br&gt;
Open your browser and navigate to &lt;a href="http://localhost:9000" rel="noopener noreferrer"&gt;http://localhost:9000&lt;/a&gt;. You can override the port by adding the following configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--server.port=&amp;lt;port&amp;gt; --management.server.port=&amp;lt;port&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
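If you prefer not to build the JAR yourself, Kafdrop's official README also documents running the prebuilt Docker image. A sketch, assuming Docker is installed and reusing this tutorial's example broker addresses (replace them with your own):

```shell
# Run Kafdrop from the official Docker image; KAFKA_BROKERCONNECT lists the
# AutoMQ brokers, and the UI is exposed on port 9000.
docker run -d --rm -p 9000:9000 \
    -e KAFKA_BROKERCONNECT=192.168.0.4:9092,192.168.0.5:9092 \
    obsidiandynamics/kafdrop
```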



&lt;p&gt;&lt;strong&gt;Final effect&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Complete interface&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1uwmf71tcw4qahslouqa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1uwmf71tcw4qahslouqa.png" alt="Image description" width="800" height="522"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Displays the number of partitions, number of topics, and other cluster state information.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Creating a Topic&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkewe0q7kf9jmqedmh3yt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkewe0q7kf9jmqedmh3yt.png" alt="Image description" width="800" height="519"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Detailed Broker Node Information&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzpogg5irxajdqdw1kg0c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzpogg5irxajdqdw1kg0c.png" alt="Image description" width="800" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;Detailed Topic Information&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa5ktvsoz19o6zg5a2c25.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa5ktvsoz19o6zg5a2c25.png" alt="Image description" width="800" height="459"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="5"&gt;
&lt;li&gt;Message Information Under the Topic&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi71f3qpydg4jns0q8e5r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi71f3qpydg4jns0q8e5r.png" alt="Image description" width="800" height="434"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Summary&lt;/strong&gt;&lt;br&gt;
Through this tutorial, we explored the key features and functionalities of Kafdrop, as well as the methods to integrate it with AutoMQ clusters. We demonstrated how to easily monitor and manage AutoMQ clusters. The use of Kafdrop not only helps teams better understand and control their data flow but also enhances development and operational efficiency, ensuring a highly efficient and stable data processing workflow. We hope this tutorial provides you with valuable insights and assistance when using Kafdrop with AutoMQ clusters.&lt;br&gt;
&lt;strong&gt;References&lt;/strong&gt;&lt;br&gt;
[1] Kafdrop: &lt;a href="https://github.com/obsidiandynamics/kafdrop" rel="noopener noreferrer"&gt;https://github.com/obsidiandynamics/kafdrop&lt;/a&gt;&lt;br&gt;
[2] AutoMQ: &lt;a href="https://www.automq.com/zh" rel="noopener noreferrer"&gt;https://www.automq.com/zh&lt;/a&gt;&lt;br&gt;
[3] Kafdrop Deployment: &lt;a href="https://github.com/obsidiandynamics/kafdrop/blob/master/README.md#getting-started" rel="noopener noreferrer"&gt;https://github.com/obsidiandynamics/kafdrop/blob/master/README.md#getting-started&lt;/a&gt;&lt;br&gt;
[4] Kafdrop project repository: &lt;a href="https://github.com/obsidiandynamics/kafdrop" rel="noopener noreferrer"&gt;https://github.com/obsidiandynamics/kafdrop&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Delving deep into the adoption of Alibaba Cloud's cloud-native technologies in AutoMQ</title>
      <dc:creator>AutoMQ</dc:creator>
      <pubDate>Mon, 29 Jul 2024 08:38:50 +0000</pubDate>
      <link>https://dev.to/automq/delving-deep-into-the-adoption-of-alibaba-clouds-cloud-native-technologies-in-automq-2j17</link>
      <guid>https://dev.to/automq/delving-deep-into-the-adoption-of-alibaba-clouds-cloud-native-technologies-in-automq-2j17</guid>
      <description>&lt;p&gt;Author information: Zhou Xinyu, Co-founder &amp;amp; CTO of AutoMQ&lt;/p&gt;

&lt;p&gt;Introduction: AutoMQ[1] is a groundbreaking cloud-native Kafka built on a shared storage architecture. By utilizing its compute-storage separation and integrating deeply with Alibaba Cloud's robust and advanced services such as Object Storage OSS, Block Storage ESSD, Elastic Scaling ESS, and Spot Instances, AutoMQ provides a cost advantage ten times greater than Apache Kafka while offering automatic scalability.&lt;/p&gt;

&lt;p&gt;Leading the charge toward the cloud-native era, our mission at Alibaba Cloud and AutoMQ is to enhance our customers' capabilities in the cloud-based business landscape. As the industry evolves, we've observed that many products hastily claim to be cloud-native without fundamentally embracing cloud computing capabilities. Merely supporting deployment on Kubernetes does not suffice. True cloud-native products must exploit the full potential, elasticity, and scalability of cloud computing, thereby achieving significant cost and efficiency benefits.&lt;br&gt;
Today, we delve into how Alibaba Cloud leverages cloud-native technologies with AutoMQ, addressing practical challenges effectively.&lt;br&gt;
&lt;strong&gt;Object Storage OSS&lt;/strong&gt;&lt;br&gt;
With data increasingly migrating to the cloud, object storage has emerged as the primary storage solution for big data and data lake ecosystems. The shift from file APIs to object APIs is becoming prevalent, especially as stream data, often handled by Kafka, increasingly flows into these data lakes.&lt;br&gt;
AutoMQ has developed the S3Stream[1] stream storage library on top of object storage, which enables efficient reading and ingestion of stream data via the Object API. By adopting a storage-compute separation architecture, it integrates Apache Kafka's storage layer with object storage, fully capitalizing on the technical and cost advantages of shared storage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The standard zone-redundant (in-city) storage price of OSS is 0.12 yuan/GiB/month, more than eight times cheaper than ESSD PL1 at 1 yuan/GiB/month. Moreover, OSS inherently provides multi-zone availability and durability. Without the need for additional data replication, it reduces costs by a factor of 25 compared to the conventional cloud-disk-based three-replica architecture.&lt;/li&gt;
&lt;li&gt;In contrast to a Shared-Nothing architecture, the shared storage model achieves a true separation of storage and compute, decoupling data from compute nodes. Consequently, when AutoMQ undertakes partition reassignment, it avoids data replication, facilitating true second-level lossless partition reassignments. This feature supports AutoMQ's capability for real-time self-balancing and rapid horizontal scaling of nodes.&lt;/li&gt;
&lt;/ul&gt;
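As a quick sanity check, the 25x figure follows directly from the quoted list prices. The sketch below uses only the numbers cited in this article; actual prices vary by region and over time:

```java
// Back-of-the-envelope check of the storage-cost figures quoted above
// (OSS zone-redundant: 0.12 yuan/GiB/month; ESSD PL1: 1 yuan/GiB/month).
class StorageCost {
    static final double OSS_PER_GIB_MONTH = 0.12;     // price quoted in the text
    static final double ESSD_PL1_PER_GIB_MONTH = 1.0; // price quoted in the text
    static final int KAFKA_REPLICAS = 3;              // conventional three-replica layout

    // Cost ratio of a three-replica ESSD layout vs. one logical copy on OSS.
    // OSS replicates internally, so no extra application-level copies are needed.
    static double tripleReplicaVsOss() {
        return (KAFKA_REPLICAS * ESSD_PL1_PER_GIB_MONTH) / OSS_PER_GIB_MONTH; // ~25
    }
}
```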

&lt;p&gt;While AutoMQ has effectively harnessed OSS for cost and architectural benefits, this represents merely the beginning. The adoption of shared storage is set to spur a wave of technical and product innovations at AutoMQ.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Disaster Recovery: As a fundamental aspect of software infrastructure, the greatest concern is the failure of a cluster to continue delivering services or the inability to restore data following a cluster failure. Potential issues include software bugs and catastrophic data center-level disasters. Thanks to shared storage and a straightforward metadata snapshot system, it is feasible to shut down a compromised cluster and restart it as a new cluster using the data stored on OSS to resume operations.&lt;/li&gt;
&lt;li&gt;Cross-Region Disaster Recovery: OSS provides near real-time replication across different regions. Companies don't have to establish their own cross-regional networks or set up costly data connectivity clusters. Paired with the previously mentioned disaster recovery technology, this enables straightforward, code-free implementation of cross-region disaster recovery strategies.&lt;/li&gt;
&lt;li&gt;Shared Read-Only Copies: High fan-out is a critical business use case for consuming streaming data. In a data-driven company, a single data item might be accessed by dozens of subscribers. The original cluster is unable to manage the increased load. With OSS, it is possible to create read-only copies directly from OSS without data duplication, offering scalable high fan-out capabilities.&lt;/li&gt;
&lt;li&gt;Zero ETL: Modern data technology frameworks rely on object storage. When data resides in a common storage pool and possesses a level of self-description, data silos can be dismantled at a minimal cost without the necessity to construct ETL pipelines. Various analytical tools or computing engines can access shared data from multiple sources.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On the other hand, incorporating stream data into lakes completes the modern data stack, laying the groundwork for the Stream-Lake architecture. This is the source of the vast creative potential behind Confluent's TableFlow[2]. Data is produced and stored in stream formats, which align with the nature of continuously generated and evolving information in the real world. Real-time data must be in stream form, enabling stream computing frameworks to extract more immediate value. Eventually, as data ages, it transitions to table formats like Iceberg[3] for broader scale data analysis. From a lifecycle perspective, the move from streams to tables naturally matches the data's progression from high frequency to low frequency, from hot to cold, and constructing a stream-table integrated data technology stack on object storage represents a forward-looking trend.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Block Storage ESSD&lt;/strong&gt;&lt;br&gt;
If ECS is still regarded as a physical server, cloud disk ESSD faces a similar predicament. Users generally harbor two misconceptions about ESSD:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Comparing ESSD to local disks, they are concerned about data durability, apprehensive that problems typical of physical disks such as faulty disks or bad sectors might persist.&lt;/li&gt;
&lt;li&gt;It's commonly believed that ESSD is a cloud disk, which leads to assumptions of poor remote write performance, uncontrollable latency, and jitter.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, ESSD is supported by a robust distributed file system, utilizing triple-replica technology that ensures nine nines of data durability. Users are insulated from errors in physical storage media, as the system automatically detects and corrects faults across millions of physical disks.&lt;br&gt;
Moreover, ESSD functions as shared storage. In the event of ECS failures, ESSD volumes can be mounted on other nodes to continue providing read and write services. In this respect, ESSD, similar to OSS, is shared storage and not a stateful local disk, which is a key reason why AutoMQ is touted as a stateless data software.&lt;br&gt;
From a performance standpoint, ESSD benefits from combined software and hardware enhancements, including offloading the ESSD client to the Shenlong MOC[5] for hardware acceleration. It employs a high-performance proprietary network protocol and a congestion control algorithm based on RDMA technology, bypassing the traditional TCP stack to meet the low-latency and low packet loss requirements of data centers. These improvements ensure stable IOPS and throughput performance, as well as highly scalable storage capacity.&lt;br&gt;
AutoMQ employs ESSD innovatively in three ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, reliability separation, by fully utilizing the multi-replica technology of ESSD to circumvent the need for replication mechanisms such as Raft or ISR at the application layer, significantly reducing storage costs and network replication bandwidth.&lt;/li&gt;
&lt;li&gt;Second, using ESSD as a WAL: data is written cyclically to the ESSD as a raw device with Direct I/O, and read only for recovery in failure scenarios. The shared nature of ESSD makes AutoMQ's WAL a remote, shareable WAL that any node in the cluster can take over and recover.&lt;/li&gt;
&lt;li&gt;Finally, a cloud service-oriented billing design, where ESSD provides at least approximately 100 MiB/s throughput and about 1800 IOPS for any volume size. AutoMQ requires only a minimal ESSD volume as the WAL disk, such as a 2GiB ESSD PL0 volume, costing just 1 yuan per month to deliver the aforementioned performance. For enhanced storage performance on a single machine, simply combine multiple small-spec WAL disks for linear expansion.&lt;/li&gt;
&lt;/ul&gt;
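The WAL usage described above can be pictured as a fixed-size circular log. The sketch below is an illustrative simplification, not AutoMQ's actual S3Stream implementation: a production WAL writes to the raw block device with Direct I/O and aligned records, whereas this sketch uses an ordinary file:

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.util.zip.CRC32;

// Simplified circular WAL: records are appended sequentially and wrap to the
// start when the fixed-size device is full. A plain file stands in for the
// raw block device here.
class CircularWal {
    private final RandomAccessFile file;
    private final long capacity;
    private long writeOffset = 0;

    CircularWal(File path, long capacity) throws Exception {
        this.file = new RandomAccessFile(path, "rw");
        this.file.setLength(capacity);
        this.capacity = capacity;
    }

    // Append one record as [length][crc32][payload]; returns its offset.
    long append(byte[] payload) throws Exception {
        long recordSize = 8 + payload.length;
        if (writeOffset + recordSize > capacity) {
            writeOffset = 0;                 // wrap around, overwriting old records
        }
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        long at = writeOffset;
        file.seek(at);
        file.writeInt(payload.length);
        file.writeInt((int) crc.getValue());
        file.write(payload);
        file.getFD().sync();                 // durable before acknowledging the producer
        writeOffset += recordSize;
        return at;
    }

    // Read one record back, verifying its checksum (used during recovery).
    byte[] read(long at) throws Exception {
        file.seek(at);
        int length = file.readInt();
        int storedCrc = file.readInt();
        byte[] payload = new byte[length];
        file.readFully(payload);
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        if ((int) crc.getValue() != storedCrc) {
            throw new IllegalStateException("corrupt record at offset " + at);
        }
        return payload;
    }
}
```

Recovery then amounts to scanning records forward from the last checkpoint and verifying each checksum before replaying it.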

&lt;p&gt;ESSD and OSS offer distinctly different storage characteristics. ESSD provides high performance, low latency, and high IOPS, albeit at a higher cost. AutoMQ, however, has developed a cost-effective approach to utilizing ESSD. OSS is not ideal for environments requiring high IOPS as it charges per IO operation, yet it offers economical storage with virtually unlimited scalability in both throughput and capacity. As primary storage, OSS delivers high throughput, low cost, high availability, and limitless scalability; ESSD provides durable, highly available, low-latency storage ideal for storing WAL, and its virtualized nature allows for requesting very small storage capacities. AutoMQ's proprietary streaming library, S3Stream[1], cleverly merges the benefits of both ESSD and OSS shared storage, achieving low-latency, high-throughput, low-cost, and unlimited capacity streaming storage.&lt;/p&gt;
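The division of labor described here can be sketched as a tiered writer: every record is first made durable on the low-latency WAL (the ESSD role) and can then be acknowledged, while records are batched into large objects for upload to object storage (the OSS role). The Wal interface and the in-memory object-store map below are illustrative stand-ins, not AutoMQ's API:

```java
import java.io.ByteArrayOutputStream;
import java.util.HashMap;

// Illustrative two-tier write path: WAL for latency, object storage for cost.
class TieredWriter {
    interface Wal { void append(byte[] record); }    // low-latency durability tier

    static final int UPLOAD_THRESHOLD = 16 * 1024 * 1024; // batch into ~16 MiB objects

    private final Wal wal;
    private final HashMap objects = new HashMap();   // object-store stand-in
    private final ByteArrayOutputStream batch = new ByteArrayOutputStream();
    private int uploaded = 0;

    TieredWriter(Wal wal) { this.wal = wal; }

    void write(byte[] record) throws Exception {
        wal.append(record);        // durable on the WAL; producer can be acked now
        batch.write(record);       // buffer for the high-throughput path
        if (batch.size() >= UPLOAD_THRESHOLD) {
            flush();
        }
    }

    void flush() {                 // upload one large object, then reset the batch
        objects.put("segment-" + uploaded, batch.toByteArray());
        uploaded++;
        batch.reset();
    }

    int uploadedObjects() { return uploaded; }
}
```

Batching into large objects keeps the per-request charges of object storage low while the WAL absorbs the latency-sensitive path.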

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F006ipmw5fps9cpoxan9z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F006ipmw5fps9cpoxan9z.png" alt="Image description" width="800" height="515"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multiple mounting and NVMe protocol&lt;/strong&gt;&lt;br&gt;
Although ESSD is shared storage, it functions as a block device. To efficiently share ESSD, additional storage technology support is necessary, specifically multiple mounting and the NVMe PR protocol.&lt;br&gt;
Cloud disks natively support remounting on another node after being detached, but when the original node encounters issues such as an ECS hang, the time needed to detach the cloud disk becomes unpredictable. ESSD's multiple mounting capability therefore makes it feasible to mount the disk directly on another ECS node without detaching it first.&lt;br&gt;
Taking the AutoMQ Failover process as an example, when a Broker node is identified as a Failed Broker, its cloud disk is multiply mounted to a healthy Broker for data recovery. Before commencing the actual Recovery process, it's crucial to ensure that the original node has ceased writing. AutoMQ utilizes the NVMe protocol's PR lock for IO Fencing on the original node.&lt;br&gt;
Both these processes are millisecond-level operations, effectively transforming ESSD into shared storage within the AutoMQ framework.&lt;br&gt;
&lt;strong&gt;Regional ESSD&lt;/strong&gt;&lt;br&gt;
While ESSD typically uses a multi-replica architecture, these replicas are often confined to a single AZ, limiting ESSD's ability to handle AZ-level failures. Regional ESSD[6] is designed to solve this problem: by spreading the underlying replicas across multiple AZs with strongly consistent read-write technology, it can withstand single-AZ failures.&lt;br&gt;
In terms of shared mounting, it supports cross-AZ mounting within a region and multi-AZ shared mounting, with preemptive IO Fencing and NVMe PR lock as forms of IO Fencing. Regional ESSD, offered by major international cloud providers, is also soon to be launched on Alibaba Cloud. This product enables AutoMQ to handle single AZ failures at a very low cost, satisfying the requirements of scenarios that demand higher availability.&lt;/p&gt;
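The failover flow described above (multi-attach the WAL volume, fence the failed writer via the NVMe PR lock, then recover) can be sketched as follows. Every type and method here is a hypothetical stand-in for the cloud block-storage and NVMe APIs, not a real SDK:

```java
import java.util.ArrayDeque;

// Hypothetical sketch of the AutoMQ failover sequence; Volume and its
// methods model cloud/NVMe behavior and are NOT real SDK calls.
class FailoverSketch {
    static class Volume {
        String prHolder;                                   // broker holding the PR (write) lock
        final ArrayDeque unflushedWal = new ArrayDeque();  // records not yet on object storage

        void attach(String broker) {
            // multi-attach: no need to detach the (possibly hung) failed node first
        }
        void prPreempt(String broker) {
            prHolder = broker;                             // IO-fences the previous writer
        }
    }

    // Returns the number of WAL records recovered and landed on object storage.
    static int failover(Volume walVolume, String healthyBroker) {
        walVolume.attach(healthyBroker);       // 1. mount the WAL volume on a healthy broker
        walVolume.prPreempt(healthyBroker);    // 2. ensure the failed broker can no longer write
        int recovered = 0;
        while (!walVolume.unflushedWal.isEmpty()) {
            walVolume.unflushedWal.poll();     // 3. replay unflushed records and upload them
            recovered++;
        }
        return recovered;
    }
}
```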

&lt;p&gt;&lt;strong&gt;Elastic Cloud Server (ECS)&lt;/strong&gt;&lt;br&gt;
Over the past decade, businesses have increasingly adopted a Rehost strategy to transition to the cloud, primarily replacing traditional on-premise physical servers with cloud-based servers like ECS. A key distinction between ECS and on-premise servers is the SLA services that ECS offers. Leveraging virtualization technologies, ECS addresses many hardware and software failures typical of physical servers. For those failures that cannot be avoided, cloud servers can quickly recover on a new physical server, substantially reducing downtime and limiting disruption to business operations.&lt;/p&gt;

&lt;p&gt;Alibaba Cloud offers a 99.975% SLA for individual ECS instances. Operating a service on a single ECS node can ensure an availability of 99.9%, making it well-suited for production environments and fulfilling the availability demands of numerous services. For example, running a single-node AutoMQ cluster on an ECS setup with 2 CPUs and 16GB of RAM can achieve this level of availability and provide a write capacity of 80MiB/s, all while keeping costs low.&lt;br&gt;
Since its development, AutoMQ has been designed to operate on ECS as a cloud service rather than as a physical server. Should an ECS failure occur, the system depends on the quick recovery features of the ECS node, such as automatic reassignment and restart. AutoMQ initiates proactive failover only after detecting several missed heartbeats from a node, considering two primary factors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the event of physical hardware or kernel failures, ECS can recover within seconds. AutoMQ therefore relies on the swift recovery capabilities of ECS to manage such issues, while avoiding overly sensitive failover mechanisms that might trigger unnecessary disaster recovery efforts.&lt;/li&gt;
&lt;li&gt;AutoMQ's failover mechanisms are activated only in the event of ECS failures, network partitions, or even availability-zone-level failures, taking advantage of the features provided by ESSD and OSS for proactive disaster recovery.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Elastic Scaling Service (ESS)&lt;/strong&gt;&lt;br&gt;
In March 2024, AutoMQ was officially launched on the Alibaba Cloud marketplace through a collaborative release with Alibaba Cloud. From the general availability of AutoMQ's core features to its swift listing on the Alibaba Cloud marketplace, two key products played a pivotal role: Alibaba Cloud Compute Nest, which ensures standardized delivery processes for service providers, and Elastic Scaling Service (ESS). Although AutoMQ's architecture naturally supports elastic scaling, providing these capabilities seamlessly presents challenges[4]; AutoMQ leverages ESS to simplify the final delivery steps.&lt;br&gt;
AutoMQ chose ESS over Kubernetes for its public cloud deployment for several reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AutoMQ's initial deployment model is BYOC, which simplifies dependencies and eliminates the need for each user to set up a Kubernetes cluster when installing AutoMQ.&lt;/li&gt;
&lt;li&gt;ESS offers configuration management, automatic scaling, scheduled scaling, instance management, multi-AZ deployment, and health checks, all akin to the core deployment features of Kubernetes. We view ESS as a streamlined version of Kubernetes at the IaaS layer.&lt;/li&gt;
&lt;li&gt;Later sections explore AutoMQ's use of multiple mounting, Regional ESSD, and other advanced features provided by cloud vendors, which Kubernetes may not immediately support. Using IaaS-layer APIs rather than Kubernetes APIs is akin to the difference between C++ and Java: native functionality must first be exposed at the Kubernetes level before it can be used effectively there.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Certainly, Kubernetes is an exceptional platform, and we plan to support deployments on Kubernetes in the future, particularly in private cloud scenarios, to abstract away many of the differences at the IaaS layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spot Instances&lt;/strong&gt;&lt;br&gt;
Elastic capacity is not free for cloud providers: they incur considerable holding costs to offer adequate elasticity, which often results in an excess of unused computing resources. These resources are made available as spot instances, which function just like regular ECS instances but can cost up to 90% less than standard pay-as-you-go rates.&lt;br&gt;
Unlike regular pay-as-you-go instances, spot instance pricing varies with market supply and demand. For example, if demand for computing power drops overnight, prices typically fall, adding a temporal dimension to spot pricing and encouraging different workloads to run at their optimal times. AutoMQ, for instance, conducts extensive testing overnight on spot instances, drastically cutting testing costs.&lt;br&gt;
Another characteristic of spot instances is that they can be interrupted and reclaimed at any time, which poses a high barrier to adoption. However, AutoMQ's compute-storage separation architecture ensures that broker nodes keep no local state, enabling them to handle spot reclamation gracefully. The diagram below illustrates how AutoMQ recovers the WAL via the ESSD API when a spot instance is reclaimed. Through this approach, AutoMQ achieves a tenfold cost reduction, with spot instances playing a significant role in lowering compute expenses.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkdf3313p5pazgr6hrsfd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkdf3313p5pazgr6hrsfd.png" alt="Image description" width="800" height="694"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Closing Remarks&lt;/strong&gt;&lt;br&gt;
Today, much of the foundational software that supported the rapid growth of big data and the internet was developed a decade ago. Yet software crafted for IDC environments does not deliver high efficiency or low costs in today's mature cloud computing landscape. Thus, there is a significant push to redesign foundational software for the cloud, including observability storage, TP and AP databases, and data lake software. As an essential piece of stream storage software within the big data ecosystem, Kafka occupies a crucial position, representing 10% to 20% of IT spending in data-centric enterprises. Redesigning Kafka with cloud-native features is vital in today's climate of cost reduction. AutoMQ leverages deep cloud integration and cloud-native capabilities to reengineer Apache Kafka®, achieving a tenfold cost advantage. Compared with Kafka, AutoMQ's shared storage architecture has drastically improved operational metrics such as partition reassignment, dynamic node scaling, and traffic self-balancing.&lt;br&gt;
Cloud computing has heralded a new era, and embracing a cloud-native approach ensures no regrets in transitioning to the cloud. We are convinced that all foundational software should be reengineered based on cloud architectures to fully capitalize on its benefits.&lt;br&gt;
&lt;strong&gt;References&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open-source cloud-native version of Kafka — AutoMQ: &lt;a href="https://github.com/AutoMQ/automq" rel="noopener noreferrer"&gt;https://github.com/AutoMQ/automq&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Confluent recently introduced Tableflow, merging streaming and analytical computing: &lt;a href="https://www.confluent.io/blog/introducing-tableflow/" rel="noopener noreferrer"&gt;https://www.confluent.io/blog/introducing-tableflow/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Official site of the open table format Iceberg: &lt;a href="https://iceberg.apache.org/" rel="noopener noreferrer"&gt;https://iceberg.apache.org/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Why is it difficult to fully utilize the elasticity of public clouds? &lt;a href="https://www.infoq.cn/article/tugbtfhemdiqlxm1x63y" rel="noopener noreferrer"&gt;https://www.infoq.cn/article/tugbtfhemdiqlxm1x63y&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Alibaba Cloud's in-house developed "Shenlong Architecture": &lt;a href="https://developer.aliyun.com/article/743920" rel="noopener noreferrer"&gt;https://developer.aliyun.com/article/743920&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Announced at the 2023 Yunqi Conference, the Regional ESSD: &lt;a href="https://developer.aliyun.com/article/1390447" rel="noopener noreferrer"&gt;https://developer.aliyun.com/article/1390447&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>Understanding Kafka Producer</title>
      <dc:creator>AutoMQ</dc:creator>
      <pubDate>Mon, 29 Jul 2024 08:36:27 +0000</pubDate>
      <link>https://dev.to/automq/understanding-kafka-producer-5b45</link>
      <guid>https://dev.to/automq/understanding-kafka-producer-5b45</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;br&gt;
Today, we present an in-depth analysis of the Kafka Producer (based on Apache Kafka 3.7[2]). Given the extensive nature of the topic, the article is divided into two parts: the first covers the usage and principles of the Kafka Producer, while the second will discuss its implementation details and common issues.&lt;br&gt;
&lt;strong&gt;Usage&lt;/strong&gt;&lt;br&gt;
Before we dive into the specifics of the Kafka Producer implementation, let's first understand how to utilize it. Here's the example code for sending a message to a specified topic using Kafka Producer:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

// Configure and create a Producer
Properties kafkaProps = new Properties();
kafkaProps.put("bootstrap.servers", "localhost:9092");
kafkaProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
kafkaProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
Producer&amp;lt;String, String&amp;gt; producer = new KafkaProducer&amp;lt;&amp;gt;(kafkaProps);

// Send a message to the specified topic
ProducerRecord&amp;lt;String, String&amp;gt; record = new ProducerRecord&amp;lt;&amp;gt;("my-topic", "my-key", "my-value");
producer.send(record, (metadata, exception) -&amp;gt; {
    if (exception != null) {
        // Send failed
        exception.printStackTrace();
    } else {
        // Send succeeded
        System.out.println("Record sent to partition " + metadata.partition() + " with offset " + metadata.offset());
    }
});

// Close the Producer and release resources
producer.close();
producer.close();


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Subsequently, the primary interfaces of Kafka Producer are outlined.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

public class ProducerRecord&amp;lt;K, V&amp;gt; {
    private final String topic;
    private final Integer partition;
    private final Headers headers;
    private final K key;
    private final V value;
    private final Long timestamp;
}

public interface Callback {
    void onCompletion(RecordMetadata metadata, Exception exception);
}

public interface Producer&amp;lt;K, V&amp;gt; {
    // ...
    Future&amp;lt;RecordMetadata&amp;gt; send(ProducerRecord&amp;lt;K, V&amp;gt; record);
    Future&amp;lt;RecordMetadata&amp;gt; send(ProducerRecord&amp;lt;K, V&amp;gt; record, Callback callback);
    void flush();
    void close();
    // ...
}


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Note: The Producer interface also includes several transaction-related interfaces, such as beginTransaction, commitTransaction, etc., which have been discussed in another article and will not be addressed here.&lt;br&gt;
&lt;strong&gt;ProducerRecord&lt;/strong&gt;&lt;br&gt;
A message sent by the Producer possesses the following properties&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;topic: Required. Specifies the topic to which the record is sent&lt;/li&gt;
&lt;li&gt;partition: Optional. Indicates the sequence number of the partition to which the record is sent (zero-indexed). When unspecified, a partition is chosen using either the specified Partitioner or the default BuiltInPartitioner (details provided below)&lt;/li&gt;
&lt;li&gt;headers: Optional. User-defined additional key-value pair information&lt;/li&gt;
&lt;li&gt;key: Optional. The key value of the message&lt;/li&gt;
&lt;li&gt;value: Optional. The content of the message&lt;/li&gt;
&lt;li&gt;timestamp: Optional. The timestamp of the message, determined by the following logic

&lt;ul&gt;
&lt;li&gt;If the topic's &lt;code&gt;message.timestamp.type&lt;/code&gt; configuration is "CreateTime": the user-provided timestamp is used if present; otherwise, it defaults to the message's creation time, roughly when the send method is invoked&lt;/li&gt;
&lt;li&gt;If the topic's &lt;code&gt;message.timestamp.type&lt;/code&gt; is set to "LogAppendTime": the broker's write time is used, irrespective of any user-specified timestamp&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Callback&lt;/strong&gt;&lt;br&gt;
The callback is invoked after message acknowledgment; potential exceptions include:&lt;/p&gt;

&lt;ul&gt;

&lt;li&gt;Non-retriable

&lt;ul&gt;
&lt;li&gt;InvalidTopicException: The topic name is invalid, e.g., too long, empty, or containing prohibited characters.&lt;/li&gt;
&lt;li&gt;OffsetMetadataTooLarge: The metadata string passed to Producer#sendOffsetsToTransaction is too long (limited by offset.metadata.max.bytes, default 4 KiB).&lt;/li&gt;
&lt;li&gt;RecordBatchTooLargeException: The size of the sent batch exceeds either the maximum message size (broker configuration message.max.bytes or topic configuration max.message.bytes, default 1 MiB + 12 B) or the segment size (broker configuration &lt;code&gt;log.segment.bytes&lt;/code&gt; or topic configuration &lt;code&gt;segment.bytes&lt;/code&gt;, default 1 GiB). Note: this error may only occur with older clients.&lt;/li&gt;
&lt;li&gt;RecordTooLargeException: The size of a single message exceeds the maximum producer request size (producer configuration max.request.size, default 1 MiB), the producer buffer size (producer configuration buffer.memory, default 32 MiB), or the maximum allowed message size (broker configuration message.max.bytes or topic configuration max.message.bytes, default 1 MiB + 12 B).&lt;/li&gt;
&lt;li&gt;TopicAuthorizationException, ClusterAuthorizationException: Authorization failed.&lt;/li&gt;
&lt;li&gt;UnknownProducerIdException: In transactional requests, the PID has expired or the records associated with the PID have expired.&lt;/li&gt;
&lt;li&gt;InvalidProducerEpochException: In transactional requests, the epoch is invalid.&lt;/li&gt;
&lt;li&gt;UnknownServerException: An unspecified server-side error occurred.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;

&lt;li&gt;Retriable

&lt;ul&gt;
&lt;li&gt;CorruptRecordException: CRC check failed, typically because of a network error.&lt;/li&gt;
&lt;li&gt;InvalidMetadataException: The client-side metadata is outdated.&lt;/li&gt;
&lt;li&gt;UnknownTopicOrPartitionException: The topic or partition does not exist, potentially due to expired metadata.&lt;/li&gt;
&lt;li&gt;NotLeaderOrFollowerException: The requested broker is not the leader, possibly due to an ongoing leader election.&lt;/li&gt;
&lt;li&gt;FencedLeaderEpochException: The leader epoch in the request is outdated, potentially caused by a slow metadata refresh.&lt;/li&gt;
&lt;li&gt;NotEnoughReplicasException, NotEnoughReplicasAfterAppendException: Insufficient in-sync replicas (broker configuration min.insync.replicas or the topic configuration of the same name, default 1). Note that NotEnoughReplicasAfterAppendException occurs after a record is written, so producer retries may lead to duplicate data.&lt;/li&gt;
&lt;li&gt;TimeoutException: Processing timed out, for one of two possible reasons: a synchronous delay (e.g., the producer buffer is full, or fetching metadata times out), or an asynchronous one (e.g., throttling prevents the producer from sending, or a broker fails to respond promptly).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Producer#send&lt;/strong&gt;&lt;br&gt;
Initiates sending a message asynchronously and, if provided, invokes the Callback upon acknowledgment. Callbacks for requests sent to the same partition are guaranteed to execute in the order they were initiated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Producer#flush&lt;/strong&gt;&lt;br&gt;
Marks all messages in the producer's buffer as ready to send immediately, and blocks the current thread until all previously dispatched messages are acknowledged.&lt;br&gt;
Note: only the calling thread is blocked; other threads may continue sending messages, although the completion time of messages sent after the flush is not guaranteed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Producer#close&lt;/strong&gt;&lt;br&gt;
Shuts down the producer and blocks until all buffered messages are sent. Note:&lt;/p&gt;

&lt;ul&gt;

&lt;li&gt;Invoking close within a Callback shuts down the producer immediately&lt;/li&gt;
&lt;li&gt;Any send call still in its synchronous phase (fetching metadata, waiting for memory allocation) is terminated immediately and throws a KafkaException&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Core Components&lt;/strong&gt;&lt;br&gt;
Next, we discuss the implementation of the Kafka Producer, which consists of the following core components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ProducerMetadata &amp;amp; Metadata: Caches and refreshes the metadata needed on the producer side, including all metadata of the Kafka cluster, such as broker addresses, the distribution of partitions across topics, and leader and follower information.&lt;/li&gt;
&lt;li&gt;RecordAccumulator: Manages the producer's buffer, grouping messages into RecordBatches by partition, time (linger.ms), and size (batch.size), and holding them for dispatch.&lt;/li&gt;
&lt;li&gt;Sender: Runs a daemon thread named "kafka-producer-network-thread | {client.id}" that sends Produce requests and processes Produce responses, handling timeouts, errors, and retries.&lt;/li&gt;
&lt;li&gt;TransactionManager: Implements idempotence and transactions, which involves assigning sequence numbers, handling message loss and reordering, and managing transaction state.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Sending Process&lt;/strong&gt;&lt;br&gt;
The message sending process is depicted in the diagram below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsufu5fzelu4zygx7lsbo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsufu5fzelu4zygx7lsbo.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The process is divided into the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Refreshing Metadata&lt;/li&gt;
&lt;li&gt;Serialize the message using the specified Serializer&lt;/li&gt;
&lt;li&gt;Select the target partition for sending the message using either a user-specified Partitioner or the BuiltInPartitioner&lt;/li&gt;
&lt;li&gt;Insert the message into the RecordAccumulator for batching&lt;/li&gt;
&lt;li&gt;Sender asynchronously retrieves the sendable batch from the RecordAccumulator (grouped by node), registers a callback, and sends&lt;/li&gt;
&lt;li&gt;Sender handles the response, and based on the situation, returns results, exceptions, or retries
The following sections will detail each of these steps
&lt;strong&gt;Refreshing Metadata&lt;/strong&gt;
ProducerMetadata is tasked with caching and updating the metadata needed on the producer side, ensuring a comprehensive view of all the topics required for the producer. It will&lt;/li&gt;
&lt;li&gt;Add topics in the following scenarios

&lt;ul&gt;
&lt;li&gt;When sending a message, if the specified topic is not present in the cached metadata&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Remove topics in the following scenarios

&lt;ul&gt;
&lt;li&gt;When it is determined that the metadata for a topic has been inactive for a continuous period defined by metadata.max.idle.ms&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Refresh metadata in the following scenarios:

&lt;ul&gt;
&lt;li&gt;When sending a message, the specified partition is not present in the cached metadata (this occurs when the number of partitions in a topic increases)&lt;/li&gt;
&lt;li&gt;When sending a Message, the leader of the specified partition is unknown&lt;/li&gt;
&lt;li&gt;After sending a message, an InvalidMetadataException response is received&lt;/li&gt;
&lt;li&gt;When metadata has not been refreshed for longer than &lt;code&gt;metadata.max.age.ms&lt;/code&gt;
Associated configurations include&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;metadata.max.idle.ms
The cache timeout for topic metadata, specifically, if no messages are sent to a certain topic within the specified duration, the metadata for that topic will expire; the default setting is 5 minutes.&lt;/li&gt;
&lt;li&gt;metadata.max.age.ms
Mandatory metadata refresh interval, specifically, if metadata is not refreshed within the specified duration, an update is forcibly initiated; the default setting is 5 minutes.
&lt;strong&gt;Partition selection&lt;/strong&gt;
KIP-794[3] addressed the Sticky Partitioner's tendency, in earlier versions, to distribute messages unevenly across brokers by introducing a new Uniform Sticky Partitioner (now the default built-in Partitioner). When records carry no key, it distributes more messages to faster brokers.
When selecting partitions, there are two scenarios:&lt;/li&gt;
&lt;li&gt;If a user specifies a Partitioner, then that Partitioner is used to select the partition&lt;/li&gt;
&lt;li&gt;If not, the default BuiltInPartitioner is used

&lt;ul&gt;
&lt;li&gt;If a record key is set, a unique partition is determined based on the hash value of the key&lt;/li&gt;
&lt;li&gt;Records with the same key are consistently assigned to the same partition.&lt;/li&gt;
&lt;li&gt;However, this consistency is not maintained when the number of partitions within a topic is altered, as the same key may not be assigned to the original partition.&lt;/li&gt;
&lt;li&gt;If no key is specified, or if the &lt;code&gt;partitioner.ignore.keys&lt;/code&gt; is set to "true", Kafka defaults to sending more messages to faster brokers.
Associated configurations include&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;partitioner.class
The class name for the partition selector can be customized by users to meet specific requirements.

&lt;ul&gt;
&lt;li&gt;DefaultPartitioner and UniformStickyPartitioner: These "sticky" partitioners allocate messages sequentially to each partition, filling one partition before moving to the next. However, this implementation proved problematic, as it tends to overload slower brokers, and both partitioners have been deprecated.&lt;/li&gt;
&lt;li&gt;RoundRobinPartitioner: This partitioner disregards the record key and distributes messages evenly across all partitions in a cyclic manner. It is important to note that it can lead to uneven message distribution when initiating new batches.
It is advisable to either use the built-in partitioner or develop a custom one to suit your needs.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;partitioner.adaptive.partitioning.enable
The decision to determine the number of messages sent based on broker speed can be enabled or disabled; if disabled, partitions are chosen at random. This is only applicable when the partitioner.class is not set, with the default setting being "true".&lt;/li&gt;
&lt;li&gt;partitioner.availability.timeout.ms
This setting is effective only when partitioner.adaptive.partitioning.enable is set to "true". Should the time lapse between accumulating a batch of messages for a specific broker and sending them exceed this setting, the system will halt message allocation to that broker; a setting of 0 indicates that this feature is disabled. This applies only when the partitioner.class is not configured, with the default set to 0.&lt;/li&gt;
&lt;li&gt;partitioner.ignore.keys
When set to "false", the partition is determined by the hash of the key; when set to "true", the key is ignored during partition selection. This setting applies only if partitioner.class is not set. The default is "false".
&lt;strong&gt;Message Batching&lt;/strong&gt;
In the RecordAccumulator, batches to be sent are organized by partition. Key methods include:&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

public RecordAppendResult append(String topic,
                                 int partition,
                                 long timestamp,
                                 byte[] key,
                                 byte[] value,
                                 Header[] headers,
                                 AppendCallbacks callbacks,
                                 long maxTimeToBlock,
                                 boolean abortOnNewBatch,
                                 long nowMs,
                                 Cluster cluster) throws InterruptedException;

public ReadyCheckResult ready(Metadata metadata, long nowMs);

public Map&amp;lt;Integer, List&amp;lt;ProducerBatch&amp;gt;&amp;gt; drain(Metadata metadata, Set&amp;lt;Node&amp;gt; nodes, int maxSize, long now);


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
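&lt;p&gt;To build intuition for how these three methods cooperate, the following is a deliberately simplified, self-contained model of an accumulator (batch.size and linger.ms are modeled as plain constructor arguments; the class and method names are illustrative, and this is not Kafka's actual RecordAccumulator):&lt;/p&gt;

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Deliberately simplified model of the accumulator: records are appended to
// the newest batch; a batch becomes "ready" once it reaches batchSize bytes
// or has waited longer than lingerMs; drain hands ready batches to a sender.
class MiniAccumulator {
    static class Batch {
        final long createdMs;
        int bytes;
        Batch(long createdMs) { this.createdMs = createdMs; }
    }

    private final int batchSize;   // models batch.size (bytes)
    private final long lingerMs;   // models linger.ms
    private final Deque<Batch> batches = new ArrayDeque<>();

    MiniAccumulator(int batchSize, long lingerMs) {
        this.batchSize = batchSize;
        this.lingerMs = lingerMs;
    }

    // append: add a record, opening a new batch when the current one is full
    void append(int recordBytes, long nowMs) {
        Batch last = batches.peekLast();
        if (last == null || last.bytes + recordBytes > batchSize) {
            last = new Batch(nowMs);
            batches.addLast(last);
        }
        last.bytes += recordBytes;
    }

    private boolean sendable(Batch b, long nowMs) {
        return b.bytes >= batchSize || nowMs - b.createdMs >= lingerMs;
    }

    // ready: is any batch full, or has any batch lingered past lingerMs?
    boolean ready(long nowMs) {
        for (Batch b : batches) if (sendable(b, nowMs)) return true;
        return false;
    }

    // drain: remove and return the batches that are currently sendable
    List<Batch> drain(long nowMs) {
        List<Batch> out = new ArrayList<>();
        while (!batches.isEmpty() && sendable(batches.peekFirst(), nowMs))
            out.add(batches.pollFirst());
        return out;
    }
}
```

&lt;p&gt;As in the real accumulator, a batch becomes sendable in this toy model either because it filled up (batch.size) or because it waited long enough (linger.ms).&lt;/p&gt;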

&lt;ul&gt;
&lt;li&gt;append: Adds a message to the buffer, registers a future, and returns it. This future completes either when the message is successfully sent or if it fails.&lt;/li&gt;
&lt;li&gt;ready: Identifies nodes that have messages prepared for dispatch. Scenarios include:

&lt;ul&gt;
&lt;li&gt;A batch of messages has reached the batch.size&lt;/li&gt;
&lt;li&gt;Messages have been batched continuously for longer than linger.ms&lt;/li&gt;
&lt;li&gt;The memory allocated to the producer has been exhausted; specifically, the total size of the messages in the buffer has exceeded buffer.memory&lt;/li&gt;
&lt;li&gt;The batch requiring a retry has already been delayed for at least retry.backoff.ms&lt;/li&gt;
&lt;li&gt;The user called &lt;code&gt;Producer#flush&lt;/code&gt; to ensure message delivery.&lt;/li&gt;
&lt;li&gt;The producer is in the process of shutting down.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Drain: For each node, go through every partition, pulling the earliest batch from each partition (if present), until the message size hits the &lt;code&gt;max.request.size&lt;/code&gt;, or all partitions have been checked.
Associated configurations include&lt;/li&gt;

&lt;li&gt;linger.ms
The maximum time each batch waits for more records before being sent; the default is 0.
It's important to note that setting it to 0 doesn't eliminate batching; instead, it means there's no delay before sending. To completely disable batching, you should set &lt;code&gt;batch.size&lt;/code&gt; to 0 or 1.
Increasing this value will

&lt;ul&gt;
&lt;li&gt;Increase throughput (as the overhead of sending each message is reduced and the benefits of compression are enhanced)&lt;/li&gt;
&lt;li&gt;Slightly increase latency&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;batch.size
Determines the maximum size of each batch; the default is 16 KiB.
Setting this value to 0 (effectively the same as setting it to 1) disables batching, so each batch contains only one message.
When a single message exceeds batch.size, it is sent as a standalone batch.
Increasing this value will

&lt;ul&gt;
&lt;li&gt;Boost throughput&lt;/li&gt;
&lt;li&gt;Use more memory (each new batch creation allocates a memory block of batch.size)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;max.in.flight.requests.per.connection
The maximum number of requests a producer may send to each broker without awaiting a response; the default is 5.&lt;/li&gt;

&lt;li&gt;max.request.size
The maximum size of a single request, which is also the upper limit for an individual message; the default is 1 MiB.
Be aware that the broker configuration message.max.bytes and the topic configuration max.message.bytes also bound the maximum message size.
&lt;strong&gt;Timeout Handling&lt;/strong&gt;
The Kafka Producer offers various timeout-related settings to manage the allowable duration for each phase of the message delivery process, as detailed below:&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxoyox5y1vgtof3nvg4wl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxoyox5y1vgtof3nvg4wl.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These settings include&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;buffer.memory
The maximum capacity of the producer buffer; the default is 32 MiB.
When this buffer is full, it will block and wait for up to max.block.ms before it issues an error.&lt;/li&gt;
&lt;li&gt;max.block.ms
By default, when the send method is invoked, the current thread may be blocked for a maximum of 60 seconds.
It encompasses

&lt;ul&gt;
&lt;li&gt;Time required to retrieve metadata&lt;/li&gt;
&lt;li&gt;Time incurred waiting when the producer buffer is at capacity
Excluding&lt;/li&gt;
&lt;li&gt;Duration required to serialize the message&lt;/li&gt;
&lt;li&gt;Time spent by the Partitioner to determine a partition&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;request.timeout.ms
Maximum duration from sending a request to receiving a response, typically 30s.&lt;/li&gt;

&lt;li&gt;delivery.timeout.ms
The entire duration of asynchronous message delivery, namely, from the moment the send method returns to the activation of the Callback. The default is set at 120s.
It encompasses

&lt;ul&gt;
&lt;li&gt;the time required for batching by the producer&lt;/li&gt;
&lt;li&gt;Sending a request to the broker and waiting for a response&lt;/li&gt;
&lt;li&gt;The time for each retry
Its value should be no less than linger.ms + request.timeout.ms.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;retries
The maximum number of retries, by default, is set to Integer.MAX_VALUE.&lt;/li&gt;

&lt;li&gt;retry.backoff.ms and retry.backoff.max.ms
These settings govern the exponential backoff strategy for retries following a send failure: each retry attempt begins with a wait time of retry.backoff.ms, which increases exponentially by a factor of 2, plus an additional 20% jitter, capped at retry.backoff.max.ms. The default values are 100ms and 1000ms, respectively.
&lt;strong&gt;Summary&lt;/strong&gt;
Our project, AutoMQ[1], is committed to developing the next-generation, cloud-native Apache Kafka® system, specifically designed to tackle the cost and elasticity challenges of traditional Kafka. As devoted supporters and active contributors to the Kafka ecosystem, we persist in delivering top-tier Kafka technical content to enthusiasts. In our previous article, we covered the functionality of Kafka Producers and the fundamental principles behind their implementation; our upcoming article will delve into further implementation details and address typical challenges associated with Kafka Producers. Keep an eye out for more updates.&lt;/li&gt;

&lt;/ul&gt;
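&lt;p&gt;The retry backoff policy described above can be sketched in a few lines (an illustration of the documented behavior, with doubling, up to 20% jitter, and a cap; the class and method names are ours, and this is not Kafka's internal implementation):&lt;/p&gt;

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch of the documented retry backoff policy: the wait starts at
// retry.backoff.ms, doubles with each attempt, gains up to 20% random
// jitter, and is capped at retry.backoff.max.ms.
class RetryBackoff {
    static long waitMs(long backoffMs, long maxMs, int attempt) {
        double base = Math.min(backoffMs * Math.pow(2, attempt), maxMs);
        double jittered = base * (1.0 + ThreadLocalRandom.current().nextDouble(0.2));
        return (long) Math.min(jittered, maxMs);
    }
}
```

&lt;p&gt;With the defaults (100ms and 1000ms), the first retry waits roughly 100 to 120 ms, and later retries converge on the 1000 ms cap.&lt;/p&gt;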

&lt;p&gt;&lt;strong&gt;References&lt;/strong&gt;&lt;br&gt;
[1] AutoMQ: &lt;a href="https://github.com/AutoMQ/automq" rel="noopener noreferrer"&gt;https://github.com/AutoMQ/automq&lt;/a&gt;&lt;br&gt;
[2] Kafka 3.7: &lt;a href="https://github.com/apache/kafka/releases/tag/3.7.0" rel="noopener noreferrer"&gt;https://github.com/apache/kafka/releases/tag/3.7.0&lt;/a&gt;&lt;br&gt;
[3] KIP-794: &lt;a href="https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner" rel="noopener noreferrer"&gt;https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>AutoMQ Integration with Redpanda Console</title>
      <dc:creator>AutoMQ</dc:creator>
      <pubDate>Mon, 29 Jul 2024 08:26:04 +0000</pubDate>
      <link>https://dev.to/automq/automq-integration-with-redpanda-console-419p</link>
      <guid>https://dev.to/automq/automq-integration-with-redpanda-console-419p</guid>
      <description>&lt;p&gt;&lt;strong&gt;Managing Kafka/AutoMQ clusters more conveniently with Kafka Web UI&lt;/strong&gt;&lt;br&gt;
With the rapid development of big data technology, Kafka, as a high-throughput, low-latency distributed messaging system, has become a core component of real-time data processing in enterprises. However, managing and monitoring Kafka clusters is not an easy task. Traditional command-line tools and scripts, although powerful, are complex and not intuitive for developers and operations personnel. To address these challenges, Kafka Web UI emerged, providing users with a more convenient and efficient way to manage Kafka clusters.&lt;/p&gt;

&lt;p&gt;Over more than a decade of development, Apache Kafka® has accumulated a very rich ecosystem. As the successor to Apache Kafka®, AutoMQ can fully leverage the products in its ecosystem due to its complete compatibility with Kafka. AutoMQ Business Edition offers a very powerful control plane. If you are using AutoMQ, you can also manage AutoMQ clusters with products like Kafdrop and Redpanda Console [1].&lt;/p&gt;

&lt;p&gt;Today, we will share how to monitor the state of AutoMQ clusters using Redpanda Console [1] to enhance system maintainability and stability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Integrating AutoMQ with Redpanda Console&lt;/strong&gt;&lt;br&gt;
Redpanda Console is a Kafka Web UI interface provided by Redpanda [2] for monitoring and managing Redpanda or Kafka clusters. It offers an intuitive user interface where users can easily view cluster states, monitor performance metrics, and manage topics and partitions. This console is designed to simplify the daily operations of data flow systems, enabling users to more effectively maintain and monitor their clusters.&lt;br&gt;
Thanks to AutoMQ's complete compatibility with Kafka, it can be seamlessly integrated with Redpanda Console. By utilizing Redpanda Console, AutoMQ users can also benefit from an intuitive user interface to monitor the real-time status of AutoMQ clusters, including topics, partitions, consumer groups, and their offsets. This monitoring capability not only enhances the efficiency of issue diagnosis but also helps optimize cluster performance and resource utilization.&lt;br&gt;
This tutorial will teach you how to start the Redpanda Console service and use it in conjunction with AutoMQ clusters to achieve cluster state monitoring and management.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fio6n48nwljigpoobtkor.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fio6n48nwljigpoobtkor.png" alt="Image description" width="800" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deploy AutoMQ Cluster&lt;/li&gt;
&lt;li&gt;Prepare Redpanda Console Environment
&lt;strong&gt;Deploy AutoMQ Cluster&lt;/strong&gt;
Please refer to the official AutoMQ documentation: Cluster Deployment | AutoMQ [3].
&lt;strong&gt;Deploy Redpanda Console&lt;/strong&gt;
There are two ways to deploy Redpanda Console: Docker deployment and release version deployment. Docker deployment is simpler, and if you want to quickly and easily experience the integration of AutoMQ and Redpanda Console, it is recommended to choose Docker for deployment. If you have special requirements, such as login authentication, SASL authentication, TLS configuration, and log level settings, you can opt for the release version deployment. Below, I will introduce both configuration methods separately.
&lt;strong&gt;Docker Deployment&lt;/strong&gt;
Redpanda Console can be deployed via Docker, as referenced in Quick Start [4]. In this process, after setting up the AutoMQ cluster, you will know all the addresses and ports of the Broker nodes that are listening. Therefore, you can establish a connection between Redpanda Console and the AutoMQ cluster by specifying the KAFKA_BROKERS parameter in the Docker startup command. The Docker container startup command is as follows:
&lt;code&gt;docker run -p 8080:8080 -e KAFKA_BROKERS=192.168.0.4:9092,192.168.0.5:9092,192.168.0.6:9092 docker.redpanda.com/redpandadata/console:latest&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;-p 8080:8080: Specify the port mapping for accessing the Redpanda Console service.&lt;/li&gt;
&lt;li&gt;KAFKA_BROKERS: This should be specified as the Broker addresses of your AutoMQ cluster.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Release Deployment&lt;/strong&gt;&lt;br&gt;
You need to download and extract the suitable version from the GitHub Releases page of Redpanda Console: Release Redpanda Console [5], into a specified folder, such as /opt. The command is as follows:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ubuntu Linux
cd /opt
sudo curl -L -o redpanda_console.tar.gz https://github.com/redpanda-data/console/releases/download/v2.6.0/redpanda_console_2.6.0_linux_amd64.tar.gz

# unzip, get redpanda_console
sudo tar -xzf redpanda_console.tar.gz

# create the config directory
sudo mkdir -p /etc/redpanda

# write the config
sudo vim /etc/redpanda/redpanda-console-config.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;An example of the content of the redpanda-console-config.yaml configuration file is as follows:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kafka:
  # Brokers is a list of bootstrap servers with
  # port (for example "localhost:9092").
  brokers:
    - broker-0.mycompany.com:19092
    - broker-1.mycompany.com:19092
    - broker-2.mycompany.com:19092
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Note: Please ensure that the server where you are installing Redpanda Console can access the servers where the Broker nodes specified in the configuration file are located.&lt;br&gt;
For more detailed settings, you can refer to the official documentation: Redpanda Console Configuration [6]. After completing the configuration, set an environment variable so that the Redpanda Console executable can locate the configuration file, then start Redpanda Console:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# set env
export CONFIG_FILEPATH="/etc/redpanda/redpanda-console-config.yaml"

# run console from /opt
./redpanda-console
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You will get the following results:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{"level":"info","ts":"2024-07-10T09:52:52.958+0800","msg":"started Redpanda Console","version":"2.6.0","built_at":"1717083695"}
{"level":"info","ts":"2024-07-10T09:52:52.963+0800","msg":"connecting to Kafka seed brokers, trying to fetch cluster metadata"}
{"level":"info","ts":"2024-07-10T09:52:54.780+0800","msg":"successfully connected to kafka cluster","advertised_broker_count":1,"topic_count":2,"controller_id":0,"kafka_version":"at least v3.6"}
{"level":"info","ts":"2024-07-10T09:53:05.620+0800","msg":"Server listening on address","address":"[::]:8080","port":8080}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Access the console page&lt;/strong&gt;&lt;br&gt;
After completing the above deployment operations, you can access the console service by entering the address (e.g., http://{console_ip}:8080) in the browser. The display effect is as follows:&lt;br&gt;
&lt;strong&gt;Cluster Overview&lt;/strong&gt;&lt;br&gt;
The Cluster Overview page provides users with a macro perspective, displaying core information of the AutoMQ cluster. This includes but is not limited to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cluster Health Status: Shows the current health status of the cluster, aiding users in quickly identifying issues.&lt;/li&gt;
&lt;li&gt;Storage Usage: Displays the data storage usage within the cluster, making it easier for users to manage and plan their storage.&lt;/li&gt;
&lt;li&gt;Version Information: Shows the current version of the AutoMQ cluster, facilitating tracking and upgrades.&lt;/li&gt;
&lt;li&gt;Number of Online Brokers: Displays the real-time number of online Brokers, which is a critical metric.&lt;/li&gt;
&lt;li&gt;Number of Topics and Replicas: Provides information on the number of Topics and Replicas, helping users understand the scale of the cluster and data replication status.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0shhh4y044xzeghfx6tl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0shhh4y044xzeghfx6tl.png" alt="Image description" width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Monitoring the cluster status is crucial for ensuring the stability and performance of the messaging queue system. By monitoring the real-time status of the cluster, storage usage, version information, the number of online Brokers, as well as the number of Topics and Replicas, operations personnel can quickly identify and resolve potential issues, preventing system failures from impacting business operations. Additionally, these metrics aid in capacity planning and resource management, ensuring the system can handle future data growth. Moreover, knowing the cluster's version information helps users to perform timely software upgrades, leveraging the latest features and security fixes, thereby enhancing overall system reliability and efficiency.&lt;br&gt;
&lt;strong&gt;Topic Overview&lt;/strong&gt;&lt;br&gt;
On the Topic list page, users can see a list of all Topics in the current AutoMQ cluster, including key information for each Topic, such as the number of partitions and replication strategy. Users can quickly browse and manage Topics through this page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuq21plfys3gte4tvil3d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuq21plfys3gte4tvil3d.png" alt="Image description" width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Topic Details&lt;/strong&gt;&lt;br&gt;
After clicking on a specific Topic, users will be taken to the detailed page of that Topic, where they can explore and manage various aspects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Message List: Browse and search messages within the Topic, which is very useful for message tracking and debugging.&lt;/li&gt;
&lt;li&gt;Consumer Information: Displays information about the consumers and Consumer Groups currently subscribed to the Topic, facilitating monitoring of consumption status.&lt;/li&gt;
&lt;li&gt;Partition Status: Shows detailed information for each partition, including key metrics such as Leader and ISR.&lt;/li&gt;
&lt;li&gt;Configuration Information: Lists the configuration details of the Topic, supporting modifications to optimize performance or behavior.&lt;/li&gt;
&lt;li&gt;ACL (Access Control List): Manage access permissions for the Topic to ensure data security.
Additionally, Redpanda Console supports users in manually creating and publishing messages, which is highly valuable for testing or injecting messages in specific scenarios.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwe8tztm7ul1zve4p2zoo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwe8tztm7ul1zve4p2zoo.png" alt="Image description" width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Monitoring Topic details allows us to gain deep insights into the operation of the message queue. By browsing the message list, we can track and debug messages, monitor consumer information to assess consumption status, understand partition states to ensure data distribution and high availability, manage configuration information to optimize performance, and set access controls to ensure data security. These features help in timely identification and resolution of issues, thereby improving the overall efficiency and reliability of the system.&lt;br&gt;
&lt;strong&gt;Summary&lt;/strong&gt;&lt;br&gt;
This article introduces the integration process of Redpanda Console with AutoMQ, demonstrating how this powerful tool simplifies and enhances the management of AutoMQ clusters. It is hoped that this article provides practical references for users aiming to improve the efficiency and functionality of message queue management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;References&lt;/strong&gt;&lt;br&gt;
[1] Redpanda Console: &lt;a href="https://redpanda.com/redpanda-console-kafka-ui" rel="noopener noreferrer"&gt;https://redpanda.com/redpanda-console-kafka-ui&lt;/a&gt;&lt;br&gt;
[2] Redpanda: &lt;a href="https://redpanda.com/" rel="noopener noreferrer"&gt;https://redpanda.com/&lt;/a&gt;&lt;br&gt;
[3] Cluster Deployment of AutoMQ: &lt;a href="https://docs.automq.com/zh/docs/automq-opensource/IyXrw3lHriVPdQkQLDvcPGQdnNh" rel="noopener noreferrer"&gt;https://docs.automq.com/zh/docs/automq-opensource/IyXrw3lHriVPdQkQLDvcPGQdnNh&lt;/a&gt;&lt;br&gt;
[4] Quick Start: &lt;a href="https://github.com/redpanda-data/console?tab=readme-ov-file#quick-start" rel="noopener noreferrer"&gt;https://github.com/redpanda-data/console?tab=readme-ov-file#quick-start&lt;/a&gt;&lt;br&gt;
[5] Release Redpanda Console: &lt;a href="https://github.com/redpanda-data/console/releases/tag/v2.6.0" rel="noopener noreferrer"&gt;https://github.com/redpanda-data/console/releases/tag/v2.6.0&lt;/a&gt;&lt;br&gt;
[6] Redpanda Console Configuration: &lt;a href="https://docs.redpanda.com/current/reference/console/config/#example-redpanda-console-configuration-file" rel="noopener noreferrer"&gt;https://docs.redpanda.com/current/reference/console/config/#example-redpanda-console-configuration-file&lt;/a&gt;&lt;br&gt;
[7] Kafdrop Github:  &lt;a href="https://github.com/obsidiandynamics/kafdrop" rel="noopener noreferrer"&gt;https://github.com/obsidiandynamics/kafdrop&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Implementing Kafka to Run on S3 with a Hundred Lines of Code</title>
      <dc:creator>AutoMQ</dc:creator>
      <pubDate>Mon, 29 Jul 2024 08:21:38 +0000</pubDate>
      <link>https://dev.to/automq/implementing-kafka-to-run-on-s3-with-a-hundred-lines-of-code-14bi</link>
      <guid>https://dev.to/automq/implementing-kafka-to-run-on-s3-with-a-hundred-lines-of-code-14bi</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;br&gt;
Yes, you read that correctly. AutoMQ[1] currently supports being fully built on object storage like S3. You can refer to the quick start guide[3] to get started immediately. AutoMQ, with its existing stream storage engine, achieves features that other competitors pride themselves on by extending the top-level WAL abstraction with minimal code, enabling the entire stream system to be built on object storage like S3. Notably, we have made this part of the source code fully open, allowing developers to use the S3Stream[2] stream storage engine to easily deploy a Kafka service entirely on object storage in their environment, with extremely low storage costs and operational complexity.&lt;/p&gt;

&lt;p&gt;The core stream storage engine of AutoMQ can achieve this capability so effortlessly due to its excellent top-level abstraction around WAL and shared storage architecture design. It is precisely based on this excellent top-level abstraction that we have implemented the highly innovative S3Stream[2] stream storage engine. In this article, we will share the design details of AutoMQ's shared stream storage engine, the underlying considerations, and the evolution process. After reading the previous content, you will truly understand why we say that only a hundred lines of code are needed to run Kafka on S3.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Embarking from the Shared Storage Architecture&lt;/strong&gt;&lt;br&gt;
Over a decade ago, Kafka emerged in an era where Internet Data Centers (IDC) were the primary scenarios. At that time, compute and storage resources were typically tightly coupled, forming an integrated Share-Nothing architecture. This architecture was highly effective in the physical data center environments of that period. However, as Public Cloud technology matured, the limitations of this architecture became apparent. The tightly coupled compute-storage nature of the Share-Nothing architecture made it impossible to decouple the storage layer completely and offload capabilities such as durability and high availability to cloud storage services. This also meant that the Share-Nothing architecture could not leverage the technical and cost benefits of scalable cloud storage services. Furthermore, the integrated compute-storage architecture made Kafka lack elasticity and difficult to scale. When adjusting Kafka cluster capacity, it involves substantial data replication, which affects the efficiency of capacity adjustments and impacts normal read and write requests during this period.&lt;/p&gt;

&lt;p&gt;AutoMQ is committed to fully leveraging the advantages of the cloud, adhering to a Cloud-First philosophy. Through a shared storage architecture, AutoMQ decouples data durability and offloads it to mature cloud storage services like S3 and EBS, thereby fully exploiting the potential of these cloud storage services. Problems such as lack of elasticity, high costs, and complex operations associated with Kafka due to the Share-Nothing architecture are resolved under AutoMQ's new shared storage architecture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6jp8kxzagy4oeyjdr7s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6jp8kxzagy4oeyjdr7s.png" alt="Image description" width="800" height="285"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stream Storage Top-Level Abstraction: Shared WAL + Shared Object&lt;/strong&gt;&lt;br&gt;
The core of AutoMQ's shared storage architecture is Shared WAL plus Shared Object. Under this abstraction we can have various implementations: the Shared WAL abstraction lets us retarget the WAL to any shared storage medium and enjoy the advantages each medium brings. Readers familiar with software engineering know that every design involves trade-offs, and different shared storage media shift those trade-offs in different directions; AutoMQ's top-level Shared WAL abstraction lets it adapt to these shifts, retargeting the Shared WAL freely to any shared storage service, or even combining several. Shared Object is built primarily on mature cloud object storage, enjoying extremely low storage costs and the scalability of those services. As the S3 API becomes the de facto standard object storage protocol, Shared Object also lets AutoMQ adapt to various object storage services and offer multi-cloud storage solutions. Shared WAL, in turn, can target low-latency storage media such as EBS and S3E1Z, providing users with low-latency stream services.&lt;/p&gt;
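&lt;p&gt;As a rough illustration of this top-level abstraction, the sketch below (in Python, with entirely hypothetical names; AutoMQ's real engine is written in Java) shows how a single WAL interface can be retargeted to different shared storage backends while caller code stays unchanged:&lt;/p&gt;

```python
from abc import ABC, abstractmethod

# Illustrative sketch of the "Shared WAL" idea: one top-level WAL
# abstraction with interchangeable storage backends. All names and
# behaviors here are hypothetical, not AutoMQ's actual API.
class SharedWAL(ABC):
    @abstractmethod
    def append(self, record: bytes) -> int:
        """Durably persist a record and return its logical offset."""

class EBSBackedWAL(SharedWAL):
    """Low-latency backend: ack right after a (simulated) volume flush."""
    def __init__(self) -> None:
        self._log = []
    def append(self, record: bytes) -> int:
        self._log.append(record)          # stands in for a raw-device write
        return len(self._log) - 1

class S3BackedWAL(SharedWAL):
    """Object-storage backend: buffer records into batch WAL objects."""
    def __init__(self, batch_size: int = 2) -> None:
        self._batch_size = batch_size
        self._batch, self._objects = [], []
        self._offset = 0
    def append(self, record: bytes) -> int:
        self._batch.append(record)
        if len(self._batch) >= self._batch_size:
            self._objects.append(self._batch)   # simulated PutObject of a batch
            self._batch = []
        offset = self._offset
        self._offset += 1
        return offset

def run_workload(wal: SharedWAL):
    # Caller code is identical for every backend -- the point of the
    # shared abstraction is that only the medium changes.
    return [wal.append(f"msg-{i}".encode()) for i in range(4)]

assert run_workload(EBSBackedWAL()) == run_workload(S3BackedWAL()) == [0, 1, 2, 3]
```

&lt;p&gt;Swapping the backend changes the latency and cost profile, but not the engine code built on top of the abstraction.&lt;/p&gt;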

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchvu95fvi01mu3irhql8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchvu95fvi01mu3irhql8.png" alt="Image description" width="800" height="1229"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnai211ne64dnynx945co.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnai211ne64dnynx945co.png" alt="Image description" width="800" height="380"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Best Shared WAL Implementation in the Cloud: EBS WAL&lt;/strong&gt;&lt;br&gt;
WAL was originally used in relational databases to provide atomicity and durability for writes. With the maturity of cloud storage services like S3 and EBS, pairing a WAL on low-latency storage with asynchronous writes to low-cost storage like S3 balances latency against cost. AutoMQ is the first in the streaming domain to build its WAL on a shared storage architecture, fully harnessing the strengths of different cloud storage services. We believe the EBS WAL implementation is the best fit for a cloud stream storage engine: it combines the low latency and high durability of EBS with the low cost of object storage, and through careful design it also mitigates EBS's main drawback, its relatively high price.&lt;/p&gt;

&lt;p&gt;The following diagram illustrates the core implementation process of EBS WAL:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The Producer writes data to EBS WAL through the S3Stream stream storage engine. Once the data is successfully persisted to disk, a success response is immediately returned to the client, fully leveraging the low-latency and high-durability characteristics of EBS.&lt;/li&gt;
&lt;li&gt;Consumers can read newly written data directly from the cache.&lt;/li&gt;
&lt;li&gt;Cached data is asynchronously batch-written to S3 in parallel; once uploaded, the corresponding cache entries are invalidated.&lt;/li&gt;
&lt;li&gt;When consumers read historical data, the reads are served directly from object storage.&lt;/li&gt;
&lt;/ol&gt;
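&lt;p&gt;The four steps above can be modeled with a toy in-memory sketch (Python for brevity; every "storage" here is simulated and all names are illustrative):&lt;/p&gt;

```python
# Toy model of the four-step EBS WAL data path described above.
class EbsWalBroker:
    def __init__(self):
        self.wal = []        # EBS-backed write-ahead log
        self.cache = {}      # offset -> record, for tailing consumers
        self.s3 = {}         # offset -> record, long-term object storage

    def produce(self, record):
        # Step 1: persist to the WAL, then ack immediately.
        offset = len(self.wal)
        self.wal.append(record)
        self.cache[offset] = record
        return offset

    def upload_to_s3(self):
        # Step 3: asynchronously batch-upload cached data, after which
        # the cache entries become invalid and are dropped.
        self.s3.update(self.cache)
        self.cache.clear()

    def fetch(self, offset):
        # Step 2: tail reads hit the cache; Step 4: historical reads
        # fall through to object storage.
        return self.cache.get(offset, self.s3.get(offset))

broker = EbsWalBroker()
off = broker.produce(b"hello")
assert broker.fetch(off) == b"hello"      # served from cache
broker.upload_to_s3()
assert off not in broker.cache            # cache entry invalidated
assert broker.fetch(off) == b"hello"      # served from object storage
```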

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz4n88xjwka9juxh2zgio.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz4n88xjwka9juxh2zgio.png" alt="Image description" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A common misconception is confusing the Shared WAL built on EBS with Kafka’s tiered storage. The primary way to distinguish between them is to check whether the compute node broker is entirely stateless. For tiered storage implementations by Confluent and Aiven, their brokers are still stateful. Kafka's tiered storage requires the last log segment of its partition to be on the local disk, hence their local storage data is tightly coupled with the compute layer brokers. However, AutoMQ’s EBS WAL implementation does not have this limitation. When a broker node crashes, other healthy broker nodes can take over the EBS volume within milliseconds via Multi Attach, write the small fixed-size WAL data (usually 500MB) to S3, and then delete the volume.&lt;/p&gt;
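&lt;p&gt;A minimal sketch of that takeover sequence, under stated assumptions (all objects are simulated; names are illustrative, not AutoMQ's actual code):&lt;/p&gt;

```python
# Sketch of the failover described above: a healthy broker multi-attaches
# the failed broker's volume, drains the small WAL remainder (typically
# around 500 MB) to S3, then deletes the volume.
def recover(volume, s3, healthy_broker):
    healthy_broker["attached"].append(volume["id"])   # Multi-Attach takeover
    s3.extend(volume["wal"])                          # drain remaining WAL to S3
    volume["wal"].clear()
    healthy_broker["attached"].remove(volume["id"])   # detach ...
    volume["deleted"] = True                          # ... and delete the volume
    return s3

vol = {"id": "vol-1", "wal": [b"r1", b"r2"], "deleted": False}
s3 = []
recover(vol, s3, {"attached": []})
assert s3 == [b"r1", b"r2"] and vol["deleted"]
```

&lt;p&gt;Because the WAL remainder is small and bounded, this takeover completes quickly and leaves the broker fleet fully stateless.&lt;/p&gt;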

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdenjh0lenb08525iyx0g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdenjh0lenb08525iyx0g.png" alt="Image description" width="800" height="557"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Natural Evolution of Shared WAL: S3 WAL&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;S3 WAL is the natural evolution of the Shared WAL storage architecture. AutoMQ currently supports building the entire storage layer on S3, which is a specific implementation of Shared WAL. This WAL implementation built directly on S3 is what we refer to as S3 WAL. Thanks to the top-level abstraction of Shared WAL and the foundational implementation of EBS WAL, the core processes of S3 WAL are identical to those of EBS WAL. Therefore, the AutoMQ Team was able to support the implementation of S3 WAL within just a few weeks.&lt;/p&gt;

&lt;p&gt;Implementing S3 WAL is a natural evolution of AutoMQ's Shared WAL architecture and expands AutoMQ's capability boundaries. With S3 WAL, all user data is written to object storage, which increases latency somewhat compared with EBS WAL; in exchange, the architecture becomes more streamlined and efficient because it depends on fewer services. On cloud providers such as AWS whose EBS volumes cannot be attached across AZs, and in private IDC scenarios using self-built object storage such as MinIO, the S3 WAL architecture provides stronger cross-AZ availability guarantees and greater flexibility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;S3 WAL Benchmark&lt;/strong&gt;&lt;br&gt;
AutoMQ has significantly optimized the performance of S3 WAL, especially its latency. In our test scenarios, the average S3 WAL append latency is 168ms, with P99 at 296ms.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu93r0ij36ikkmd6hvf2r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu93r0ij36ikkmd6hvf2r.png" alt="Image description" width="800" height="237"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Kafka Produce request processing latency averages 170ms, with P99 at 346ms.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg0w7twjmq2y37w9klbnk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg0w7twjmq2y37w9klbnk.png" alt="Image description" width="800" height="133"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Average send latency is 230ms, with P99 at 489ms.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3y4nfso9v4lwjoyddp9l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3y4nfso9v4lwjoyddp9l.png" alt="Image description" width="800" height="261"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How AutoMQ Achieves S3 WAL with Hundreds of Lines of Code&lt;/strong&gt;&lt;br&gt;
In AutoMQ's GitHub repository you can find the core stream storage module, S3Stream[2]. &lt;code&gt;com.automq.stream.s3.wal.WriteAheadLog&lt;/code&gt; holds the top-level WAL abstraction, while the implementation class &lt;code&gt;ObjectWALService&lt;/code&gt; contains the hundred-odd lines of S3 WAL implementation code. In other words, those hundred-odd lines, combined with the existing EBS WAL infrastructure, were enough to build AutoMQ entirely on S3.&lt;/p&gt;

&lt;p&gt;Of course, "a hundred lines of code" does not mean you could write just a hundred lines from scratch and run Kafka on S3; that figure is only the visible surface. The key lies in thoroughly understanding AutoMQ's WAL-based shared storage architecture. Within this framework, whether you target fully S3-based shared storage or some other shared storage medium in the future, the approach remains the same. Shared WAL is one of the core components of AutoMQ's architecture, and because the code is organized around its top-level abstraction, the Shared WAL implementation can be retargeted to any other shared storage medium. In practice, when implementing a new shared storage WAL on AutoMQ, most of the workload and complexity has already been absorbed by the underlying architecture; you only need to focus on writing and reading the WAL efficiently on the target medium. Because the stream storage engine has already paved the way, once you fully understand Shared WAL and the S3Stream engine, implementing a fully S3-based S3 WAL really is about as simple as writing a hundred lines of code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Summary&lt;/strong&gt;&lt;br&gt;
By walking through its reasoning and evolution, this article has presented the core idea behind AutoMQ's storage design: a shared storage architecture built on Shared WAL. Going forward, AutoMQ will continue to refine the stream storage engine on top of this abstraction, building an even more capable Kafka stream service. An S3E1Z WAL will also be released in the near future, so stay tuned.&lt;/p&gt;

&lt;p&gt;References&lt;br&gt;
[1] AutoMQ: &lt;a href="https://github.com/AutoMQ/automq" rel="noopener noreferrer"&gt;https://github.com/AutoMQ/automq&lt;/a&gt;&lt;br&gt;
[2] S3Stream: &lt;a href="https://github.com/AutoMQ/automq/tree/main/s3stream" rel="noopener noreferrer"&gt;https://github.com/AutoMQ/automq/tree/main/s3stream&lt;/a&gt;&lt;br&gt;
[3] Direct S3 Cluster Deployment:  &lt;a href="https://docs.automq.com/automq/getting-started/deploy-direct-s3-cluster" rel="noopener noreferrer"&gt;https://docs.automq.com/automq/getting-started/deploy-direct-s3-cluster&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Industry Standard for Cloud Instance Initialization: Cloud-Init</title>
      <dc:creator>AutoMQ</dc:creator>
      <pubDate>Fri, 07 Jun 2024 06:11:27 +0000</pubDate>
      <link>https://dev.to/automq/industry-standard-for-cloud-instance-initialization-cloud-init-5b52</link>
      <guid>https://dev.to/automq/industry-standard-for-cloud-instance-initialization-cloud-init-5b52</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Cloud-Init[1] is the industry-standard tool for initializing cloud instances across multiple platforms. It is endorsed by all leading public cloud providers and is ideal for configuring private cloud infrastructures and bare-metal environments. At boot-up, Cloud-Init detects its cloud environment, accesses any provided metadata, and initializes the system. This process may include setting up network and storage configurations, establishing SSH access keys, among other system settings. Following this, Cloud-Init processes any additional user or vendor data supplied to the instance. Whether you're creating custom Linux deployment images or launching new Linux servers, Cloud-Init is pivotal for automating and streamlining these processes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Current Context: Cloud-Init's Ubiquity Across Cloud Platforms
&lt;/h2&gt;

&lt;p&gt;Cloud-Init has become the industry standard for initializing virtual machines in the cloud computing sector, with widespread use across all major cloud platforms. An examination of the data sources that Cloud-Init supports shows its extensive compatibility, catering to numerous cloud service providers like AWS (Amazon Web Services), Azure (Microsoft Cloud), and Alibaba Cloud, as well as various private cloud and container virtualization solutions including CloudStack, OpenNebula, OpenStack, and LXD. This broad adoption highlights Cloud-Init's essential role in automating cloud infrastructure deployments across an array of platforms and services.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Amazon EC2&lt;/li&gt;
&lt;li&gt;Alibaba cloud (AliYun)&lt;/li&gt;
&lt;li&gt;Azure&lt;/li&gt;
&lt;li&gt;Google Compute Engine&lt;/li&gt;
&lt;li&gt;LXD&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Objective: What Issues Does Cloud-Init Address?
&lt;/h2&gt;

&lt;p&gt;Cloud-Init primarily addresses the need for rapid and automated configuration and startup of cloud instances, to efficiently adapt to the dynamic demands of the cloud computing environment. This tool was initially designed to simplify the initialization process of cloud instances. Since its inception as an open-source project, Cloud-Init has quickly gained widespread recognition and has become a standard feature supported by nearly all major cloud service providers, including Amazon Web Services, Google Cloud Platform, and Microsoft Azure.&lt;br&gt;
&lt;strong&gt;Challenges in Cloud Computing Deployment&lt;/strong&gt;&lt;br&gt;
In the early days of cloud computing, setting up and configuring virtual machines was time-consuming and complex, especially at large scale with many dependent software installations. Pre-configured system images enabled rapid deployment, but as computing needs diversified and architectures grew more complex, this approach proved increasingly inflexible and inefficient. Operations staff had to configure each instance manually, setting up networks, storage, SSH keys, software packages, and various other system aspects, which both increased the workload and raised the likelihood of errors.&lt;br&gt;
&lt;strong&gt;Cloud-Init's Solution&lt;/strong&gt;&lt;br&gt;
Cloud-Init emerged to address this pain point. It allows users to automatically execute a series of customized configuration tasks at the first startup of a cloud instance, such as setting hostnames, network configurations, user management, and software package installations, significantly simplifying the deployment and management of cloud instances. By using Cloud-Init, users can customize startup scripts and configuration files for cloud instances, achieving a truly "configure once, run anywhere" capability, which greatly enhances the deployment efficiency and flexibility of cloud resources.&lt;br&gt;
During the startup process of cloud instances, Cloud-Init is responsible for identifying the cloud environment in which it operates and accordingly initializing the system. This means that at first startup, the cloud instance is automatically configured with network settings, storage, SSH keys, software packages, and other various system settings, without the need for additional manual intervention.&lt;br&gt;
The core value of Cloud-Init lies in providing a seamless bridge for the startup and connection of cloud instances, ensuring that the instances function as expected. For users of cloud services, Cloud-Init offers a first-time startup configuration management solution that does not require installation. For cloud providers, it offers instance settings that can be integrated with their cloud services.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxh3j0tjr65zjz8goegck.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxh3j0tjr65zjz8goegck.png" alt="Image description" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Features and Use Cases of Cloud-Init
&lt;/h2&gt;

&lt;p&gt;Cloud-Init provides a suite of capabilities designed for automated configuration and management across diverse cloud computing platforms. These features enable robust support for automated deployments and management in cloud settings, greatly improving the flexibility and efficiency of configuring cloud resources.&lt;br&gt;
&lt;strong&gt;Common use cases for Cloud-Init&lt;/strong&gt;&lt;br&gt;
Cloud-Init is routinely employed to carry out custom initialization tasks prior to the actual startup of application processes. Typical initialization tasks include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Setting up the hostname&lt;/li&gt;
&lt;li&gt;Adding SSH keys&lt;/li&gt;
&lt;li&gt;Executing a script on the first boot&lt;/li&gt;
&lt;li&gt;Formatting and mounting a data disk&lt;/li&gt;
&lt;li&gt;Launching an Ansible playbook&lt;/li&gt;
&lt;li&gt;Installing a DEB/RPM package&lt;/li&gt;
&lt;/ul&gt;
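&lt;p&gt;For reference, a minimal user-data file covering several of these tasks might look as follows (all values are placeholders, not from any real deployment):&lt;/p&gt;

```yaml
#cloud-config
# Minimal illustrative user data: hostname, SSH key, package install,
# and a first-boot command. Key values below are placeholders.
hostname: demo-node-01
ssh_authorized_keys:
  - ssh-ed25519 AAAA... user@example
packages:
  - htop
runcmd:
  - echo "first boot finished" >> /var/log/first-boot.log
```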

&lt;p&gt;Our project, AutoMQ[2], is a cloud-native Kafka implementation that leverages cloud infrastructure. On platforms like AWS, AutoMQ utilizes ASG and EC2 for operations when not deploying via Kubernetes. Before initiating AutoMQ, several preparatory steps and configurations are required. Here is the Cloud-Init script content from the Enterprise Edition of AutoMQ, detailing the key initialization steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Initialize the systemd service files.&lt;/li&gt;
&lt;li&gt;Configure the AWS CLI to authenticate via the EC2 instance profile (IAM role), ensuring proper access to additional cloud services.&lt;/li&gt;
&lt;li&gt;Set up the necessary environment variables for AutoMQ.&lt;/li&gt;
&lt;li&gt;Launch the AutoMQ systemd service using a script.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
#cloud-config

write_files:
  - path: /etc/systemd/system/kafka.service
    permissions: '0644'
    owner: root:root
    content: |
      // ignore some code...


  - path: /opt/automq/scripts/run.info
    permissions: '0644'
    owner: root:root
    content: |
      role=
      wal.path=
      init.finish=

runcmd:

    // ignore some code....

    echo "Start getting the meta and wal volume ids" &amp;gt; ${AUTOMQ_HOME}/scripts/automq-server.log
    region_id=$(curl -s http://169.254.169.254/latest/meta-data/placement/region)

    aws configure set default.region ${region_id} --profile ec2RamRoleProfile
    aws configure set credential_source Ec2InstanceMetadata --profile ec2RamRoleProfile
    aws configure set role_arn #{AUTOMQ_INSTANCE_PROFILE} --profile ec2RamRoleProfile

    instance_id=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)


  - |
    echo "AUTOMQ_ENABLE_LOCAL_CONFIG=#{AUTOMQ_ENABLE_LOCAL_CONFIG}" &amp;gt;&amp;gt; ${AUTOMQ_HOME}/scripts/env.info
    // ignore some code....


  - |
    echo "export AUTOMQ_NODE_ROLE='#{AUTOMQ_NODE_ROLE}'" &amp;gt;&amp;gt; /etc/bashrc
    // ignore some code....

    source /etc/bashrc

  - sh ${AUTOMQ_HOME}/scripts/automq-server.sh up --s3url="#{AUTOMQ_S3URL}" &amp;gt;&amp;gt; ${AUTOMQ_HOME}/scripts/automq-server.log 2&amp;gt;&amp;amp;1 &amp;amp;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: This userdata content is incomplete and is for illustrative purposes only; it requires integration with other AutoMQ scripts and Enterprise Edition code to be fully operational.&lt;br&gt;
&lt;strong&gt;Why choose Cloud-Init when I have Docker or Kubernetes?&lt;/strong&gt;&lt;br&gt;
When you think about environment setup, Docker and Kubernetes likely come to mind, but you don't actually have to choose: even with Docker or Kubernetes, you still need to install and configure their components on your machines, which is precisely where Cloud-Init comes into play. They simply offer different levels of abstraction over the runtime environment and are not mutually exclusive. Think of Cloud-Init as, essentially, the Dockerfile of the VM world.&lt;/p&gt;

&lt;h2&gt;
  
  
  How does Cloud-Init work?
&lt;/h2&gt;

&lt;p&gt;The process is broken down into two primary phases, taking place early in the boot process (local boot stage) and thereafter.&lt;br&gt;
&lt;strong&gt;Early Boot Stage&lt;/strong&gt;&lt;br&gt;
In the local boot stage, before the network configuration kicks in, Cloud-Init primarily carries out the following tasks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identify data sources: It determines the data source of the running instance by examining built-in hardware values. Data sources are the wellsprings of all configuration data.&lt;/li&gt;
&lt;li&gt;Fetch configuration data: After pinpointing the data source, Cloud-Init pulls configuration data from it. This data provides Cloud-Init with directives on the actions to take, which may encompass instance metadata (like machine ID, hostname, and network settings), vendor data, and user data (userdata). Vendor data comes from cloud providers, and user data (userdata) is usually implemented following network configurations.&lt;/li&gt;
&lt;li&gt;Write network configuration: Cloud-Init writes network configurations and sets up DNS, preparing the system for network services at startup.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Late Startup Phase&lt;/strong&gt;&lt;br&gt;
Following the network configuration, during the subsequent startup phase, Cloud-Init executes non-critical configuration tasks using vendor data and user data (userdata) to tailor the running instance. Specific tasks include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Configuration Management: Cloud-Init interfaces with management tools such as Puppet, Ansible, or Chef to apply intricate configurations and ensure the system remains current.&lt;/li&gt;
&lt;li&gt;Software Installation: At this juncture, Cloud-Init installs necessary software and performs updates to guarantee that the system is fully operational and up-to-date.&lt;/li&gt;
&lt;li&gt;User Accounts: Cloud-Init manages the creation and modification of user accounts, sets default passwords, and configures permissions accordingly.&lt;/li&gt;
&lt;li&gt;Execute User Scripts: Cloud-Init executes custom scripts included in the user data, facilitating the installation of additional software, the application of security measures, and more. It also injects SSH keys into the instance's authorized_keys file to enable secure remote access.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Subdivision of the Startup Phase&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detect: Use the platform identification tool ds-identify to ascertain the platform on which the instance operates.&lt;/li&gt;
&lt;li&gt;Local: Functions under Cloud-Init-local.service, chiefly responsible for detecting "local" data sources and setting up network configurations.&lt;/li&gt;
&lt;li&gt;Network: Operates under Cloud-Init.service, which necessitates all configured networks to be active and processes user data.&lt;/li&gt;
&lt;li&gt;Config: Runs under cloud-config.service, executing configuration-only modules, such as runcmd.&lt;/li&gt;
&lt;li&gt;Final: Performs under cloud-final.service, marking the conclusion of the boot sequence, where user-defined scripts are executed.&lt;/li&gt;
&lt;/ul&gt;
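&lt;p&gt;The five stages above run strictly in sequence; the small sketch below (Python, illustrative only) simply encodes that ordering with the service units named in the list:&lt;/p&gt;

```python
# The five cloud-init boot stages, modeled as an ordered pipeline.
# Service-unit names follow the article; the "work" is a stand-in.
STAGES = [
    ("detect", "ds-identify"),
    ("local", "cloud-init-local.service"),
    ("network", "cloud-init.service"),
    ("config", "cloud-config.service"),
    ("final", "cloud-final.service"),
]

def boot(log):
    # Each stage runs only after the previous one has completed.
    for stage, unit in STAGES:
        log.append(f"{stage}:{unit}")
    return log

log = boot([])
assert log[0] == "detect:ds-identify"
assert log[-1] == "final:cloud-final.service"
```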

&lt;h2&gt;
  
  
  Differences and workflows between Cloud-Init and other tools
&lt;/h2&gt;

&lt;p&gt;While Cloud-Init, Packer, and Ansible are all automation tools used in deployment and configuration, they vary in their functionality, positioning, and workflows.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cloud-Init is primarily designed for the initial boot and configuration stages of cloud instances.&lt;/li&gt;
&lt;li&gt;Packer specializes in creating immutable machine images that can be reused across various platforms.&lt;/li&gt;
&lt;li&gt;Ansible serves as a more comprehensive tool for configuration management and application deployment, ideal for automating system setups and deploying applications.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While there is some functional overlap, using these tools in tandem can enhance and streamline automation during different phases of deployment and management.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;This article offers an in-depth look at the functionalities and use cases of Cloud-Init, highlighting its differences from other deployment automation tools. We hope you find this information useful.&lt;/p&gt;

&lt;p&gt;AutoMQ[2] is committed to advancing messaging and streaming systems into the cloud-native era. Our goal is to fully utilize mature, scalable cloud services to unlock the full potential of the cloud. Understanding the features, pricing, and principles of various cloud services thoroughly is essential. Moving forward, we will continue to share insights on cloud technology, striving to be your go-to cloud expert and helping everyone maximize the benefits of cloud services.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;[1] Cloud-Init:  &lt;a href="https://github.com/canonical/Cloud-Init" rel="noopener noreferrer"&gt;https://github.com/canonical/Cloud-Init&lt;/a&gt;&lt;br&gt;
[2] AutoMQ: &lt;a href="https://github.com/AutoMQ/automq" rel="noopener noreferrer"&gt;https://github.com/AutoMQ/automq&lt;/a&gt;&lt;br&gt;
[3] Introduction to Cloud-Init: &lt;a href="https://cloudinit.readthedocs.io/en/latest/explanation/introduction.html#how-does-Cloud-Init-work" rel="noopener noreferrer"&gt;https://cloudinit.readthedocs.io/en/latest/explanation/introduction.html#how-does-Cloud-Init-work&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>AutoMQ Automated Streaming System Continuous Testing Platform Technical Insider</title>
      <dc:creator>AutoMQ</dc:creator>
      <pubDate>Fri, 07 Jun 2024 03:57:34 +0000</pubDate>
      <link>https://dev.to/automq/automq-automated-streaming-system-continuous-testing-platform-technical-insider-2c0f</link>
      <guid>https://dev.to/automq/automq-automated-streaming-system-continuous-testing-platform-technical-insider-2c0f</guid>
      <description>&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;AutoMQ[1], as a streaming system, is widely used in critical customer operations that demand high reliability. Consequently, a simulated, long-term testing environment that replicates real-world production scenarios is essential to ensure the viability of SLAs. This level of assurance is critical for the confidence in releasing new versions and for client adoption. With this objective, we created an automated, continuous testing platform for streaming systems, named Marathon. Before rolling out the Marathon framework, we established three key design principles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scalable: The platform must accommodate the growth of test cases and deployment modes as the system under test evolves&lt;/li&gt;
&lt;li&gt;Observable: Being a testing platform, encountering bugs is expected. Thus, robust debugging tools are essential for pinpointing and resolving root causes&lt;/li&gt;
&lt;li&gt;Cost-effective: Given the fluctuating traffic patterns in test scenarios, resource consumption should dynamically adjust according to traffic changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These three principles guided subsequent technology choices and architectural decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architectural Overview
&lt;/h2&gt;

&lt;p&gt;Let’s begin with an overview of the architecture diagram&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5zxep1ehubllc98kjnn4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5zxep1ehubllc98kjnn4.png" alt="Image description" width="800" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Marathon project's Controller, Worker, and the AutoMQ Enterprise Edition control plane are all integrated within Kubernetes (K8S):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Controller: Interacts with the AutoMQ Enterprise Edition control plane within the same VPC to oversee the creation, modification, and deletion of Kafka clusters, while also coordinating test tasks and managing the quantity and configuration of Workers&lt;/li&gt;
&lt;li&gt;Worker: Operates Kafka clients to generate the necessary workload for tasks and is also tasked with reporting observability data and performing client-side SLA assessments&lt;/li&gt;
&lt;li&gt;AutoMQ Enterprise Edition control plane: Delivers a comprehensive set of productized features for the data plane, including cluster lifecycle management, observability, security auditing, and cluster reassignment. Marathon predominantly leverages its OpenAPI related to cluster lifecycle management to create, modify, and destroy clusters, facilitating the execution of the entire testing process&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The architecture of the Controller and Worker is designed as a distributed system: the Controller functions much like a K8S Operator, dynamically adjusting the number and configuration of Workers via a tuning loop to match task demands, while Workers are fully stateless and report events to the Controller, which manages the corresponding actions. This gives the architecture remarkable flexibility and supports the scalability demands of tasks. Moreover, the lightweight, adaptable Workers can scale dynamically and even run on Spot instances[2], considerably lowering operational costs and making ultra-large-scale elastic tasks feasible.&lt;/p&gt;
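&lt;p&gt;The operator-style tuning loop can be sketched as follows (Python, hypothetical names; the real Controller reconciles against Kubernetes resources rather than an in-memory list):&lt;/p&gt;

```python
# Sketch of the Controller's reconcile loop: each pass compares the
# desired Worker count (derived from the task) with the observed fleet
# and converges toward it.
def reconcile(desired, workers):
    while len(workers) != desired:
        if len(workers) > desired:
            workers.pop()                              # scale in; Workers are stateless
        else:
            workers.append(f"worker-{len(workers)}")   # scale out (e.g. on Spot instances)
    return workers

fleet = reconcile(3, [])
assert len(fleet) == 3
fleet = reconcile(1, fleet)        # traffic drops, so the fleet shrinks
assert fleet == ["worker-0"]
```

&lt;p&gt;Because Workers hold no state, scaling in is just removing instances; nothing needs to be drained or migrated.&lt;/p&gt;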

&lt;h2&gt;
  
  
  Technical Details
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Running the Controller&lt;br&gt;
Startup process&lt;/strong&gt;&lt;br&gt;
The Controller is designed for resource management and task orchestration, initiating several resource managers at the outset:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Service Discovery: Monitors the operational status of Workers&lt;/li&gt;
&lt;li&gt;Event Bus: Acts as the communication conduit with Workers&lt;/li&gt;
&lt;li&gt;Alert Service: Alerts administrators to events requiring immediate attention&lt;/li&gt;
&lt;li&gt;Kafka Cluster Manager: Oversees the status of Kafka clusters; tracks Kafka release updates and manages upgrades&lt;/li&gt;
&lt;li&gt;Signal Processor: Detects SIGTERM to begin the termination process, reclaiming any resources created&lt;/li&gt;
&lt;/ul&gt;
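&lt;p&gt;The startup and teardown of these resource managers follows a common pattern: start them in order, and on termination (e.g. SIGTERM) stop them in reverse order so every created resource is reclaimed. A minimal sketch of that pattern, with all names assumed:&lt;/p&gt;

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Illustrative sketch (names assumed): managers are started in declaration order and
// stopped in reverse order on shutdown, so later managers can rely on earlier ones.
public class ManagerRunner {
    private final Deque<String> started = new ArrayDeque<>();
    private final List<String> events = new ArrayList<>();

    public void start(String... managers) {
        for (String m : managers) {
            started.push(m);              // remember for reverse-order shutdown
            events.add("start:" + m);
        }
    }

    public void shutdown() {              // would be wired to SIGTERM handling
        while (!started.isEmpty()) events.add("stop:" + started.pop());
    }

    public List<String> events() { return events; }
}
```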

&lt;p&gt;The Controller accommodates various types of Kafka clusters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Existing Kafka clusters: Rapidly confirms the functionality of designated clusters&lt;/li&gt;
&lt;li&gt;Managed Kafka clusters: The Controller oversees the entire lifecycle of these clusters, leveraging the control plane capabilities of AutoMQ for creation and destruction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Task cycles&lt;/strong&gt;&lt;br&gt;
The Controller uses a mechanism akin to a K8s Operator, dynamically adjusting the number and configuration of Workers based on task requirements during a tuning cycle. Each task corresponds to a test scenario: tasks are programmed to send and receive messages from Kafka, constructing various traffic models for black-box testing.&lt;br&gt;
Each task is divided into four stages, executed sequentially within the same thread:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Resource creation&lt;/li&gt;
&lt;li&gt;Warm-up&lt;/li&gt;
&lt;li&gt;Running task load&lt;/li&gt;
&lt;li&gt;Resource recovery&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Marathon framework provides a comprehensive set of utility classes designed to streamline task creation. These include functionalities for generating Kafka topics, managing consumer backlogs, adjusting Worker traffic, monitoring specific events, and injecting faults into Kafka clusters. Paired with Workers, these tools facilitate the simulation of traffic at any scale and enable testing of unique scenarios, such as large-scale cold reads or the deliberate shutdown of a Kafka node to assess data integrity.&lt;br&gt;
Coding tasks offers the flexibility to craft specific scenarios, with the sole restriction of avoiding non-interruptible blocking operations. If a Worker's Spot instance is reclaimed, the Controller intervenes to interrupt the task thread, reclaim resources, and retry the task as needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Managing Workers&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Creation and service discovery of Workers&lt;/strong&gt;&lt;br&gt;
Conducting stress tests on a Kafka cluster can demand bandwidths exceeding tens of GB/s, clearly surpassing the capabilities of a single machine, so a distributed design becomes imperative. The first step is determining how to locate newly created Workers and communicate with them. Since we manage the system with Kubernetes (K8s), it is natural to employ K8s mechanisms for service discovery.&lt;/p&gt;
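&lt;p&gt;The interrupt-and-retry behavior described above, where the Controller interrupts a task thread whose Spot instance was reclaimed and reruns the task, can be sketched as follows. This is an illustrative model, not Marathon's implementation; names and timeouts are assumptions:&lt;/p&gt;

```java
// Illustrative sketch (names and timeouts assumed): the Controller runs a task on
// its own thread; if the task does not finish (e.g. a Worker's Spot instance was
// reclaimed), it interrupts the thread and retries the task from the beginning.
public class RetryingRunner {
    // Returns the attempt number that succeeded, or -1 if all attempts failed.
    public static int runWithRetry(Runnable task, int maxAttempts) throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Thread t = new Thread(task);
            t.start();
            t.join(1000);                      // bounded wait for the sketch
            if (!t.isAlive()) return attempt;  // task finished within the bound
            t.interrupt();                     // abort the stuck attempt...
            t.join();                          // ...and wait for it to exit
        }
        return -1;                             // gave up after maxAttempts
    }
}
```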

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgb6mvc12ssi4x05q1qmg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgb6mvc12ssi4x05q1qmg.png" alt="Image description" width="800" height="505"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We conceptualize a collection of identically configured Workers as a Worker Deployment, aligning with the Deployment model in K8s. Each Worker functions as a Pod within this Deployment. Creating Workers through the Controller is comparable to deploying a Deployment to the API Server and awaiting the activation of all Pods, as illustrated in Steps 1 and 2. K8s nodes scale appropriately, provisioning the necessary Spot instance virtual machines.&lt;br&gt;
Upon initialization, each Worker generates a Configmap that catalogs the events of interest, initially concentrating on initialization events (Step 3). The Controller monitors for newly created Configmaps using the K8s Watch API (Step 4), subsequently dispatching initialization events containing configurations to these Workers (Step 5).&lt;br&gt;
This completes the service discovery and initialization process for Workers. Workers then update their Configmaps to subscribe to additional events of interest. This mechanism of service discovery empowers the Controller with the dynamic ability to create Workers, setting the groundwork for the event bus outlined in the subsequent section.&lt;br&gt;
&lt;strong&gt;Event Bus&lt;/strong&gt;&lt;br&gt;
Leveraging the service discovery mechanism discussed previously, the Controller now identifies the service addresses of each Worker (combining Pod IP and port) and the events these Workers are interested in (such as subscribing to Configmap changes), allowing the Controller to push events directly to specific Workers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb3pigh1nb529rqqh9bn1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb3pigh1nb529rqqh9bn1.png" alt="Image description" width="800" height="411"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Numerous RPC frameworks are available, and Marathon has opted for Vert.x. It supports the traditional request-reply communication model as well as the multi-receiver publish-subscribe model, which proves invaluable in scenarios where multiple nodes must acknowledge an event (illustrated in the figure for the Adjust throughput command).&lt;br&gt;
&lt;strong&gt;Spot Instance Application&lt;/strong&gt;&lt;br&gt;
As deduced from the preceding sections, Workers can be dynamically generated as needed by tasks, and commands to execute tasks on Workers can also be dispatched through the event bus (as illustrated in the figure for the Initialize new worker command). Essentially, Workers are stateless and can be rapidly created or destroyed, making the utilization of Spot Instances viable (the Controller, utilizing minimal resources, can operate on a smaller-scale Reserved Instance).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhqe42hgzgg0b23cihmvz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhqe42hgzgg0b23cihmvz.png" alt="Image description" width="702" height="690"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Controller employs Kubernetes' Watch API to monitor the status of Pods, pausing and restarting the current task upon detecting an unexpected termination of a Pod. This enables prompt detection and mitigation of task impacts during the reclamation of Spot Instances. Spot Instances, derived from the excess capacity of cloud providers, offer significant cost savings compared to Reserved Instances. By leveraging Spot Instances, Marathon can drastically cut the costs of executing tasks with lower stability demands over prolonged periods.&lt;br&gt;
&lt;strong&gt;Test Scenarios&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Scenario Description and Resource Management.&lt;/strong&gt;&lt;br&gt;
Marathon test scenarios are defined in code by inheriting from an abstract class, specifying the test case configuration, and implementing its lifecycle methods. Here are some of the existing test scenarios:&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ky1ma2wznwggma0gjys.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ky1ma2wznwggma0gjys.png" alt="Image description" width="800" height="341"&gt;&lt;/a&gt;&lt;br&gt;
Test case configurations utilize generics. Taking CatchUpReadTask as an example, the class is declared as&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;public class CatchUpReadTask extends AbstractTask&amp;lt;CatchUpReadTaskConfig&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The related configuration class, CatchUpReadTaskConfig, defines the parameters needed to execute this task, which users can set dynamically.&lt;/p&gt;

&lt;p&gt;Each task scenario is characterized through the implementation of the following lifecycle methods to simulate a specific traffic pattern:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7fnutxfqswe1ju31s8bu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7fnutxfqswe1ju31s8bu.png" alt="Image description" width="712" height="612"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;prepare: Establish the necessary resources for the task&lt;/li&gt;
&lt;li&gt;warmup: Ready the Worker and the cluster for testing&lt;/li&gt;
&lt;li&gt;workload: Generate the task workload&lt;/li&gt;
&lt;li&gt;cleanup: Remove the resources established for the task&lt;/li&gt;
&lt;/ul&gt;
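&lt;p&gt;The four lifecycle methods above can be sketched as an abstract Runnable that executes them in order. The real Marathon classes are not public; this is an assumed shape, and running cleanup in a finally block is a design choice of this sketch:&lt;/p&gt;

```java
// Sketch of the four-stage task lifecycle described above. Method names follow the
// article's description; the classes themselves are illustrative.
public abstract class LifecycleTask implements Runnable {
    protected abstract void prepare();   // create resources (topics, Workers)
    protected abstract void warmup();    // ready the Worker and the cluster
    protected abstract void workload();  // generate the task's traffic
    protected abstract void cleanup();   // reclaim the created resources

    @Override
    public void run() {
        prepare();
        try {
            warmup();
            workload();
        } finally {
            cleanup();   // in this sketch, cleanup runs even if the workload fails
        }
    }
}
```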

&lt;p&gt;Taking CatchUpReadTask as an example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcfzr48mxgel4u1azqoaz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcfzr48mxgel4u1azqoaz.png" alt="Image description" width="776" height="1280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Workload stage is the key differentiator among various task scenarios, where the CatchUpReadTask needs to build an appropriate backlog volume and then ensure it can be consumed within 5 minutes. For ChaosTask, the approach shifts to terminating a node and verifying that its partitions can be reassigned to other nodes within 1 minute. To cater to the diverse requirements of these tasks, the Marathon framework offers a toolkit for crafting test scenarios, as illustrated in the figure above:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;KafkaUtils: Create/Delete Topic (a resource type within Kafka clusters)&lt;/li&gt;
&lt;li&gt;WorkerDeployment: Create Worker&lt;/li&gt;
&lt;li&gt;ThroughputChecker: Continuously monitor whether the throughput meets the expected standards&lt;/li&gt;
&lt;li&gt;AwaitUtils: Confirm that the piled-up messages can be consumed within five minutes&lt;/li&gt;
&lt;/ul&gt;
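&lt;p&gt;As an example of the toolkit, an AwaitUtils-style helper can be sketched as a poll-until-deadline loop ("the backlog must be consumed within five minutes"). The real utility's signature is not public, so this is an assumption:&lt;/p&gt;

```java
import java.util.function.BooleanSupplier;

// Hedged sketch of an AwaitUtils-style helper: poll a condition until it holds or
// the deadline passes. The real Marathon utility may differ; this is illustrative.
public class AwaitUtils {
    public static boolean awaitUntil(BooleanSupplier condition, long timeoutMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (condition.getAsBoolean()) return true;  // condition met in time
            Thread.sleep(10);                           // poll interval
        }
        return condition.getAsBoolean();                // final check at the deadline
    }
}
```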

&lt;p&gt;&lt;strong&gt;Task Orchestration&lt;/strong&gt;&lt;br&gt;
With a variety of implementations of AbstractTask, a wide range of testing scenarios is possible. Orchestrating different task stages and even distinct tasks is essential for the Controller to execute the aforementioned scenarios.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8e7brn4tphbk3q90ynhl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8e7brn4tphbk3q90ynhl.png" alt="Image description" width="764" height="1138"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Exploring additional methods in AbstractTask reveals its inheritance from the Runnable interface. By overriding the run method, it sequentially executes the lifecycle stages: prepare, warmup, workload, and cleanup, enabling the Task to be assigned to a thread for execution.&lt;br&gt;
Upon initialization, the Controller sets up a task loop, constructs the required Task objects based on user specifications, and activates them by invoking the start method to launch a new thread for each task. The Controller then employs the join method to await the completion of each Task's lifecycle before moving on to the next one. This cycle repeats to continuously verify the stability of the system under test.&lt;br&gt;
In the event of unrecoverable errors (such as Spot instances being reclaimed) or when operational commands are manually executed to interrupt the task, the Controller calls the interrupt method on the current Task to halt the thread and stop the task. The task loop then handles resource recovery, proceeds with the next task, or pauses, awaiting further instructions based on the situation.&lt;br&gt;
&lt;strong&gt;Assertions, Observability, and Alerts&lt;br&gt;
Assertions&lt;/strong&gt;&lt;br&gt;
The framework categorizes assertions based on the type of metrics detected into the following groups:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Client-side assertions: message continuity assertions and transaction isolation level assertions&lt;/li&gt;
&lt;li&gt;Server-side state assertions: traffic threshold assertions and load balancing assertions&lt;/li&gt;
&lt;li&gt;Time-based assertions: backlog accumulation duration assertions, task timeout verifications, and more&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the standard assertion rules are insufficient, the Checker interface can be implemented to tailor custom assertions as needed.&lt;br&gt;
&lt;strong&gt;Observability&lt;/strong&gt;&lt;br&gt;
Building a robust system necessitates essential observability tools; without them, monitoring is reduced to passively observing alerts. The Marathon framework efficiently collects runtime data from Controllers and Workers, and non-intrusively captures observability data from the systems under test. Utilizing Grafana's visualization tools, one can easily examine metrics, logs, profiling, and other observability data.&lt;br&gt;
&lt;strong&gt;Metrics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpt1icgqpsqfubnesncea.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpt1icgqpsqfubnesncea.png" alt="Image description" width="800" height="438"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Log&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fljiz238w76fg38y0ba3d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fljiz238w76fg38y0ba3d.png" alt="Image description" width="800" height="686"&gt;&lt;/a&gt;&lt;strong&gt;Profiling&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ytb164pesdlu8hjgqcp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ytb164pesdlu8hjgqcp.png" alt="Image description" width="800" height="417"&gt;&lt;/a&gt;&lt;strong&gt;Alerts&lt;/strong&gt;&lt;br&gt;
In an event-driven architecture, unsatisfied assertions trigger specific events with varying severity levels. Alerts are issued for those events that require immediate attention from operational staff and are sent to the OnCall group for assessment. Combined with observability data, this approach enables quick and accurate issue identification, allows preemptive action to address and mitigate potential customer-facing risks, and facilitates ongoing performance optimization.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftqa5yng9w1n22z5q25jq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftqa5yng9w1n22z5q25jq.png" alt="Image description" width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion and Future Outlook
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Focus on spot instances, Kubernetes, and stateless applications&lt;/strong&gt;&lt;br&gt;
Reflecting on our three design principles (scalability, observability, and cost-efficiency), it is critical that the Marathon framework addresses operations right from the start:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How can we build resilient loads for various task scenarios?&lt;/li&gt;
&lt;li&gt;Considering the different resource demands of these loads, is it possible for the underlying machine resources to dynamically scale accordingly?&lt;/li&gt;
&lt;li&gt;Costs are categorized into usage costs and operational costs.

&lt;ul&gt;
&lt;li&gt;In terms of usage costs, how can we quickly create and dismantle resources to reduce barriers for users?&lt;/li&gt;
&lt;li&gt;As for operational costs, how can we efficiently construct the required loads using the fewest resources possible?&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Marathon leverages Spot instances, K8s, and stateless Workers to address the problem, each representing the infrastructure layer, operational management layer, and application layer respectively.&lt;br&gt;
Given the demand for both flexibility and cost-efficiency, Spot instances in the cloud are the obvious choice, priced at just 10% of what comparable Reserved instances cost. However, Spot instances introduce challenges, particularly the unpredictability of instance termination, which presents a significant architectural hurdle for applications. For Marathon, however, this is less of a concern as tasks can be rerun as needed.&lt;br&gt;
The most straightforward design strategy is essentially no design: Marathon focuses on scenario description and task orchestration, leaving the scheduling responsibilities to K8s. Marathon concentrates on determining the necessary workload size and the required number of cores per workload unit; the elasticity of the underlying resources is managed by K8s, starting with an initial application for a Spot instance node group and then focusing on the logic of the testing scenario.&lt;br&gt;
Nonetheless, the capability to utilize the benefits of Spot instances and K8s hinges on the application being stateless; otherwise, managing state persistence and reassignment becomes essential. This consideration is crucial in the design of the Worker module.&lt;br&gt;
&lt;strong&gt;Generalization of testing scenarios&lt;/strong&gt;&lt;br&gt;
Marathon exhibits excellent abstraction in many of its modules, including service discovery, task scheduling, and load generation, all of which are readily adaptable to other contexts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Service discovery: Currently based on APIs provided by the K8s API server, the data structure is abstracted into Node and Registration. Node represents the address and port of a Worker node, while Registration records the events of interest to each Worker. Thus, any shared storage capable of holding these two data structures can back service discovery, whether MySQL or Redis.&lt;/li&gt;
&lt;li&gt;Task scheduling: Workers are currently packaged as Docker images and deployed via K8s Deployment. Alternatively, they could be packaged as AMIs for direct launch on EC2 via cloud interfaces, or deployed using tools such as Vagrant and Ansible.&lt;/li&gt;
&lt;li&gt;Load generation: Currently, Marathon incorporates a Kafka workload for each Worker, which primarily involves deploying a specific number of Kafka clients to send and receive messages as dictated by the Controller's settings. Replacing the Kafka clients with RocketMQ clients or HTTP clients can be accomplished with minimal effort.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thanks to its robust abstractions, Marathon's dependencies on external systems are modular and pluggable. Consequently, it functions not only as a continuous reliability testing platform for Kafka, but can also be seamlessly adapted to assess any distributed system, whether cloud-based or on-premises.&lt;/p&gt;
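&lt;p&gt;The Node/Registration abstraction described above can be sketched as a small registry; any shared store able to hold this mapping could back it. All names here are illustrative, not Marathon's actual types:&lt;/p&gt;

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Sketch of the abstraction described above: service discovery reduces to two data
// structures, Node (a Worker's address and port) and Registration (the events it
// subscribes to). Any shared store that can hold this map (K8s ConfigMaps today,
// MySQL or Redis alternatively) can back the registry. Names are illustrative.
public class WorkerRegistry {
    record Node(String address, int port) {}
    record Registration(Set<String> subscribedEvents) {}

    static Registration registration(String... events) {
        return new Registration(Set.of(events));
    }

    private final Map<Node, Registration> registry = new HashMap<>();

    public void register(Node node, Registration reg) { registry.put(node, reg); }

    // How many Workers would receive this event if it were pushed now.
    public long receiverCount(String event) {
        return registry.values().stream()
                .filter(r -> r.subscribedEvents().contains(event)).count();
    }
}
```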

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;[1] AutoMQ: &lt;a href="https://github.com/AutoMQ/automq" rel="noopener noreferrer"&gt;https://github.com/AutoMQ/automq&lt;/a&gt;&lt;br&gt;
[2] Spot Instance: &lt;a href="https://docs.aws.amazon.com/zh_cn/AWSEC2/latest/UserGuide/using-spot-instances.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/zh_cn/AWSEC2/latest/UserGuide/using-spot-instances.html&lt;/a&gt;&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>kafka</category>
    </item>
    <item>
      <title>ZhongAn Insurance's Wang Kai Analyzes Kafka Network Communication</title>
      <dc:creator>AutoMQ</dc:creator>
      <pubDate>Fri, 07 Jun 2024 03:22:11 +0000</pubDate>
      <link>https://dev.to/automq/zhongan-insurances-wang-kai-analyzes-kafka-network-communication-1p6i</link>
      <guid>https://dev.to/automq/zhongan-insurances-wang-kai-analyzes-kafka-network-communication-1p6i</guid>
      <description>&lt;p&gt;Author: Kai Wang, Java Development Expert at ZhongAn Online Insurance Basic Platform&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Today, we explore the core workflow of network communication in Kafka, specifically focusing on Apache Kafka 3.7[2]. This discussion also includes insights into the increasingly popular AutoMQ, highlighting its network communication optimizations and enhancements derived from Kafka.&lt;/p&gt;

&lt;h2&gt;
  
  
  I. How to Construct a Basic Request and Handle Responses
&lt;/h2&gt;

&lt;p&gt;As a message queue, network communication essentially involves two key aspects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Communication between message producers and the message queue server (in Kafka, this involves producers "pushing" messages to the queue)&lt;/li&gt;
&lt;li&gt;Communication between message consumers and the message queue server (in Kafka, this involves consumers "pulling" messages from the queue)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy3zy26u15w7nxqsm9ikc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy3zy26u15w7nxqsm9ikc.png" alt="Image description" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This diagram primarily illustrates the process from message dispatch to response reception.&lt;br&gt;
Client:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;KafkaProducer initializes the Sender thread&lt;/li&gt;
&lt;li&gt;The Sender thread retrieves batched data from the RecordAccumulator (for detailed client-side sending, see &lt;a href="https://mp.weixin.qq.com/s/J2_O1l81duknfdFvHuBWxw" rel="noopener noreferrer"&gt;https://mp.weixin.qq.com/s/J2_O1l81duknfdFvHuBWxw&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;The Sender thread employs the NetworkClient to check the connection status and initiates a connection if necessary&lt;/li&gt;
&lt;li&gt;The Sender thread invokes the NetworkClient's doSend method to transmit data to the KafkaChannel&lt;/li&gt;
&lt;li&gt;The Sender thread calls the NetworkClient's poll method for actual data transmission&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Server:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;KafkaServer initializes SocketServer, dataPlaneRequestProcessor (KafkaApis), and dataPlaneRequestHandlerPool&lt;/li&gt;
&lt;li&gt;SocketServer sets up the RequestChannel and dataPlaneAcceptor&lt;/li&gt;
&lt;li&gt;The dataPlaneAcceptor takes charge of accepting connections and delegating them to the appropriate Processor&lt;/li&gt;
&lt;li&gt;The Processor thread pulls tasks from the newConnections queue and handles the ready I/O events:

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;configureNewConnections()&lt;/code&gt;: Establish new connections&lt;/li&gt;
&lt;li&gt;&lt;code&gt;processNewResponses()&lt;/code&gt;: Dispatch Responses and enqueue them in the inflightResponses temporary queue&lt;/li&gt;
&lt;li&gt;&lt;code&gt;poll()&lt;/code&gt;: Execute NIO polling to retrieve ready I/O operations on the respective SocketChannel&lt;/li&gt;
&lt;li&gt;&lt;code&gt;processCompletedReceives()&lt;/code&gt;: Enqueue received Requests in the RequestChannel queue&lt;/li&gt;
&lt;li&gt;&lt;code&gt;processCompletedSends()&lt;/code&gt;: Run callback logic for Responses in the temporary Response queue&lt;/li&gt;
&lt;li&gt;&lt;code&gt;processDisconnected()&lt;/code&gt;: Handle connections that were disconnected due to send failures&lt;/li&gt;
&lt;li&gt;&lt;code&gt;closeExcessConnections()&lt;/code&gt;: Terminate connections that exceed quota limits&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;The KafkaRequestHandler retrieves ready events from the RequestChannel and assigns them to the appropriate KafkaApis method for processing&lt;/li&gt;
&lt;li&gt;After processing by KafkaApis, the response is returned to the RequestChannel&lt;/li&gt;
&lt;li&gt;The Processor thread then delivers the response to the client&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This completes a full cycle of message transmission in Kafka, encompassing both client and server processing steps.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ⅱ. Kafka Network Communication
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Server-side Communication Thread Model&lt;/strong&gt;&lt;br&gt;
Unlike RocketMQ, which relies on Netty for efficient network communication, Kafka uses Java NIO to implement a master-slave Reactor pattern for network communication (for further information, see &lt;a href="https://jenkov.com/tutorials/java-nio/overview.html" rel="noopener noreferrer"&gt;https://jenkov.com/tutorials/java-nio/overview.html&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm5h28w945lgjyb9s0awt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm5h28w945lgjyb9s0awt.png" alt="Image description" width="800" height="660"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Both DataPlaneAcceptor and ControlPlaneAcceptor are subclasses of Acceptor, a thread class that implements the Runnable interface. The primary function of an Acceptor is to listen for and accept connections between Clients and Brokers, set up transmission channels (SocketChannel), and delegate them to Processors using a polling mechanism. Additionally, a RequestChannel (backed by an ArrayBlockingQueue) connects the Processors and Handlers. The MainReactor (Acceptor) solely manages the OP_ACCEPT event; once detected, it forwards the SocketChannel to a SubReactor (Processor). Each Processor operates with its own Selector; the SubReactor listens for and processes the remaining events, ultimately directing the actual requests to the KafkaRequestHandlerPool.&lt;/p&gt;
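&lt;p&gt;The Processor-to-Handler handoff through the RequestChannel can be sketched with an ArrayBlockingQueue, matching the queue-backed design described above. This is a simplified illustration, not Kafka's actual classes:&lt;/p&gt;

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Minimal sketch of the Processor -> RequestChannel -> Handler handoff described
// above: the RequestChannel is backed by an ArrayBlockingQueue, so network threads
// enqueue parsed requests and handler threads dequeue them for KafkaApis.
public class MiniRequestChannel {
    private final BlockingQueue<String> requestQueue = new ArrayBlockingQueue<>(500);

    // Called by a Processor (network thread) after a complete receive.
    public void sendRequest(String request) throws InterruptedException {
        requestQueue.put(request);   // blocks when the queue is full (backpressure)
    }

    // Called by a KafkaRequestHandler; times out so the handler can check shutdown.
    public String receiveRequest(long timeoutMillis) throws InterruptedException {
        return requestQueue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
    }
}
```

The timed `receiveRequest` mirrors the `requestChannel.receiveRequest(300)` call seen in the handler loop later in this article.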

&lt;p&gt;&lt;strong&gt;2. Initialization of the main components in the thread model&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkswzz8j0r8z82kxmirvx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkswzz8j0r8z82kxmirvx.png" alt="Image description" width="800" height="545"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The diagram illustrates that during broker startup, the KafkaServer's startup method is invoked (assuming it operates in ZooKeeper mode).&lt;br&gt;
The startup method primarily establishes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;KafkaApis handlers: creating dataPlaneRequestProcessor and controlPlaneRequestProcessor&lt;/li&gt;
&lt;li&gt;KafkaRequestHandlerPool: forming dataPlaneRequestHandlerPool and controlPlaneRequestHandlerPool&lt;/li&gt;
&lt;li&gt;Initialization of socketServer&lt;/li&gt;
&lt;li&gt;Establishment of controlPlaneAcceptorAndProcessor and dataPlaneAcceptorAndProcessor&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Additionally, an important step not depicted in the diagram but included in the startup method is thread startup: enableRequestProcessing is invoked on the initialized socketServer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Addition and Removal of Processors&lt;/strong&gt;&lt;br&gt;
1. Addition&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Processors are added when the broker starts&lt;/li&gt;
&lt;li&gt;Actively adjusting num.network.threads increases the number of processing threads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;2. Startup&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Processors start when the broker launches the Acceptor&lt;/li&gt;
&lt;li&gt;New processing threads created during an adjustment are actively started&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;3. Removal and destruction&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Broker shutdown&lt;/li&gt;
&lt;li&gt;Actively adjusting num.network.threads down eliminates excess threads and closes them&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5h1j66ysquy2d69onwod.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5h1j66ysquy2d69onwod.png" alt="Image description" width="800" height="632"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. KafkaRequestHandlerPool and KafkaRequestHandler&lt;/strong&gt;&lt;br&gt;
  &lt;strong&gt;1.KafkaRequestHandlerPool&lt;/strong&gt;&lt;br&gt;
The primary location for processing Kafka requests, this is a request handling thread pool tasked with creating, maintaining, managing, and dismantling its associated request handling threads.&lt;br&gt;
  &lt;strong&gt;2.KafkaRequestHandler&lt;/strong&gt;&lt;br&gt;
The actual class for business request handling threads, where each request handling thread instance is tasked with retrieving request objects from the SocketServer's RequestChannel queue and processing them.&lt;br&gt;
Below is the method body processed by KafkaRequestHandler:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def run(): Unit = {
  threadRequestChannel.set(requestChannel)
  while (!stopped) {
    // We use a single meter for aggregate idle percentage for the thread pool.
    // Since meter is calculated as total_recorded_value / time_window and
    // time_window is independent of the number of threads, each recorded idle
    // time should be discounted by # threads.
    val startSelectTime = time.nanoseconds
    // Fetch the next pending request from the request queue
    val req = requestChannel.receiveRequest(300)
    val endTime = time.nanoseconds
    val idleTime = endTime - startSelectTime
    aggregateIdleMeter.mark(idleTime / totalHandlerThreads.get)

    req match {
      case RequestChannel.ShutdownRequest =&amp;gt;
        debug(s"Kafka request handler $id on broker $brokerId received shut down command")
        completeShutdown()
        return

      case callback: RequestChannel.CallbackRequest =&amp;gt;
        val originalRequest = callback.originalRequest
        try {

          // If we've already executed a callback for this request, reset the times and subtract the callback time from the 
          // new dequeue time. This will allow calculation of multiple callback times.
          // Otherwise, set dequeue time to now.
          if (originalRequest.callbackRequestDequeueTimeNanos.isDefined) {
            val prevCallbacksTimeNanos = originalRequest.callbackRequestCompleteTimeNanos.getOrElse(0L) - originalRequest.callbackRequestDequeueTimeNanos.getOrElse(0L)
            originalRequest.callbackRequestCompleteTimeNanos = None
            originalRequest.callbackRequestDequeueTimeNanos = Some(time.nanoseconds() - prevCallbacksTimeNanos)
          } else {
            originalRequest.callbackRequestDequeueTimeNanos = Some(time.nanoseconds())
          }

          threadCurrentRequest.set(originalRequest)
          callback.fun(requestLocal)
        } catch {
          case e: FatalExitError =&amp;gt;
            completeShutdown()
            Exit.exit(e.statusCode)
          case e: Throwable =&amp;gt; error("Exception when handling request", e)
        } finally {
          // When handling requests, we try to complete actions after, so we should try to do so here as well.
          apis.tryCompleteActions()
          if (originalRequest.callbackRequestCompleteTimeNanos.isEmpty)
            originalRequest.callbackRequestCompleteTimeNanos = Some(time.nanoseconds())
          threadCurrentRequest.remove()
        }
     // In the normal case, KafkaApis.handle executes the corresponding processing logic
      case request: RequestChannel.Request =&amp;gt;
        try {
          request.requestDequeueTimeNanos = endTime
          trace(s"Kafka request handler $id on broker $brokerId handling request $request")
          threadCurrentRequest.set(request)
          apis.handle(request, requestLocal)
        } catch {
          case e: FatalExitError =&amp;gt;
            completeShutdown()
            Exit.exit(e.statusCode)
          case e: Throwable =&amp;gt; error("Exception when handling request", e)
        } finally {
          threadCurrentRequest.remove()
          request.releaseBuffer()
        }

      case RequestChannel.WakeupRequest =&amp;gt; 
        // We should handle this in receiveRequest by polling callbackQueue.
        warn("Received a wakeup request outside of typical usage.")

      case null =&amp;gt; // continue
    }
  }
  completeShutdown()
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, the apis.handle(request, requestLocal) call hands the request off to KafkaApis for processing.&lt;/p&gt;
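&lt;p&gt;The polling loop above can be reduced to a few lines. The following Python sketch (names invented for illustration, not Kafka's code) mirrors the structure of run(): poll the channel with a timeout, loop again on an empty poll, exit on a shutdown sentinel, and hand normal requests to a handler:&lt;/p&gt;

```python
import queue, threading

SHUTDOWN = object()  # sentinel, playing the role of RequestChannel.ShutdownRequest

def handler_loop(request_channel, handled):
    """Simplified analogue of KafkaRequestHandler.run."""
    while True:
        try:
            # Mirrors requestChannel.receiveRequest(300): block at most 300 ms.
            req = request_channel.get(timeout=0.3)
        except queue.Empty:
            continue          # corresponds to the "case null" branch
        if req is SHUTDOWN:
            return            # corresponds to RequestChannel.ShutdownRequest
        handled.append(req)   # stands in for apis.handle(request, requestLocal)

channel = queue.Queue()
handled = []
t = threading.Thread(target=handler_loop, args=(channel, handled))
t.start()
for r in ("req-1", "req-2"):
    channel.put(r)
channel.put(SHUTDOWN)
t.join()
print(handled)  # ['req-1', 'req-2']
```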

&lt;h2&gt;
  
  
  III. Unified Request Handling Dispatch
&lt;/h2&gt;

&lt;p&gt;The primary business-processing class in Kafka is KafkaApis: every request that passes through the network and handler threads ultimately converges on its handle method, which dispatches it by api key.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;override def handle(request: RequestChannel.Request, requestLocal: RequestLocal): Unit = {
  def handleError(e: Throwable): Unit = {
    error(s"Unexpected error handling request ${request.requestDesc(true)} " +
      s"with context ${request.context}", e)
    requestHelper.handleError(request, e)
  }

  try {
    trace(s"Handling request:${request.requestDesc(true)} from connection ${request.context.connectionId};" +
      s"securityProtocol:${request.context.securityProtocol},principal:${request.context.principal}")

    if (!apiVersionManager.isApiEnabled(request.header.apiKey, request.header.apiVersion)) {
      // The socket server will reject APIs which are not exposed in this scope and close the connection
      // before handing them to the request handler, so this path should not be exercised in practice
      throw new IllegalStateException(s"API ${request.header.apiKey} with version ${request.header.apiVersion} is not enabled")
    }

    request.header.apiKey match {
      case ApiKeys.PRODUCE =&amp;gt; handleProduceRequest(request, requestLocal)
      case ApiKeys.FETCH =&amp;gt; handleFetchRequest(request)
      case ApiKeys.LIST_OFFSETS =&amp;gt; handleListOffsetRequest(request)
      case ApiKeys.METADATA =&amp;gt; handleTopicMetadataRequest(request)
      case ApiKeys.LEADER_AND_ISR =&amp;gt; handleLeaderAndIsrRequest(request)
      case ApiKeys.STOP_REPLICA =&amp;gt; handleStopReplicaRequest(request)
      case ApiKeys.UPDATE_METADATA =&amp;gt; handleUpdateMetadataRequest(request, requestLocal)
      case ApiKeys.CONTROLLED_SHUTDOWN =&amp;gt; handleControlledShutdownRequest(request)
      case ApiKeys.OFFSET_COMMIT =&amp;gt; handleOffsetCommitRequest(request, requestLocal).exceptionally(handleError)
      case ApiKeys.OFFSET_FETCH =&amp;gt; handleOffsetFetchRequest(request).exceptionally(handleError)
      case ApiKeys.FIND_COORDINATOR =&amp;gt; handleFindCoordinatorRequest(request)
      case ApiKeys.JOIN_GROUP =&amp;gt; handleJoinGroupRequest(request, requestLocal).exceptionally(handleError)
      case ApiKeys.HEARTBEAT =&amp;gt; handleHeartbeatRequest(request).exceptionally(handleError)
      case ApiKeys.LEAVE_GROUP =&amp;gt; handleLeaveGroupRequest(request).exceptionally(handleError)
      case ApiKeys.SYNC_GROUP =&amp;gt; handleSyncGroupRequest(request, requestLocal).exceptionally(handleError)
      case ApiKeys.DESCRIBE_GROUPS =&amp;gt; handleDescribeGroupsRequest(request).exceptionally(handleError)
      case ApiKeys.LIST_GROUPS =&amp;gt; handleListGroupsRequest(request).exceptionally(handleError)
      case ApiKeys.SASL_HANDSHAKE =&amp;gt; handleSaslHandshakeRequest(request)
      case ApiKeys.API_VERSIONS =&amp;gt; handleApiVersionsRequest(request)
      case ApiKeys.CREATE_TOPICS =&amp;gt; maybeForwardToController(request, handleCreateTopicsRequest)
      case ApiKeys.DELETE_TOPICS =&amp;gt; maybeForwardToController(request, handleDeleteTopicsRequest)
      case ApiKeys.DELETE_RECORDS =&amp;gt; handleDeleteRecordsRequest(request)
      case ApiKeys.INIT_PRODUCER_ID =&amp;gt; handleInitProducerIdRequest(request, requestLocal)
      case ApiKeys.OFFSET_FOR_LEADER_EPOCH =&amp;gt; handleOffsetForLeaderEpochRequest(request)
      case ApiKeys.ADD_PARTITIONS_TO_TXN =&amp;gt; handleAddPartitionsToTxnRequest(request, requestLocal)
      case ApiKeys.ADD_OFFSETS_TO_TXN =&amp;gt; handleAddOffsetsToTxnRequest(request, requestLocal)
      case ApiKeys.END_TXN =&amp;gt; handleEndTxnRequest(request, requestLocal)
      case ApiKeys.WRITE_TXN_MARKERS =&amp;gt; handleWriteTxnMarkersRequest(request, requestLocal)
      case ApiKeys.TXN_OFFSET_COMMIT =&amp;gt; handleTxnOffsetCommitRequest(request, requestLocal).exceptionally(handleError)
      case ApiKeys.DESCRIBE_ACLS =&amp;gt; handleDescribeAcls(request)
      case ApiKeys.CREATE_ACLS =&amp;gt; maybeForwardToController(request, handleCreateAcls)
      case ApiKeys.DELETE_ACLS =&amp;gt; maybeForwardToController(request, handleDeleteAcls)
      case ApiKeys.ALTER_CONFIGS =&amp;gt; handleAlterConfigsRequest(request)
      case ApiKeys.DESCRIBE_CONFIGS =&amp;gt; handleDescribeConfigsRequest(request)
      case ApiKeys.ALTER_REPLICA_LOG_DIRS =&amp;gt; handleAlterReplicaLogDirsRequest(request)
      case ApiKeys.DESCRIBE_LOG_DIRS =&amp;gt; handleDescribeLogDirsRequest(request)
      case ApiKeys.SASL_AUTHENTICATE =&amp;gt; handleSaslAuthenticateRequest(request)
      case ApiKeys.CREATE_PARTITIONS =&amp;gt; maybeForwardToController(request, handleCreatePartitionsRequest)
      // Create, renew and expire DelegationTokens must first validate that the connection
      // itself is not authenticated with a delegation token before maybeForwardToController.
      case ApiKeys.CREATE_DELEGATION_TOKEN =&amp;gt; handleCreateTokenRequest(request)
      case ApiKeys.RENEW_DELEGATION_TOKEN =&amp;gt; handleRenewTokenRequest(request)
      case ApiKeys.EXPIRE_DELEGATION_TOKEN =&amp;gt; handleExpireTokenRequest(request)
      case ApiKeys.DESCRIBE_DELEGATION_TOKEN =&amp;gt; handleDescribeTokensRequest(request)
      case ApiKeys.DELETE_GROUPS =&amp;gt; handleDeleteGroupsRequest(request, requestLocal).exceptionally(handleError)
      case ApiKeys.ELECT_LEADERS =&amp;gt; maybeForwardToController(request, handleElectLeaders)
      case ApiKeys.INCREMENTAL_ALTER_CONFIGS =&amp;gt; handleIncrementalAlterConfigsRequest(request)
      case ApiKeys.ALTER_PARTITION_REASSIGNMENTS =&amp;gt; maybeForwardToController(request, handleAlterPartitionReassignmentsRequest)
      case ApiKeys.LIST_PARTITION_REASSIGNMENTS =&amp;gt; maybeForwardToController(request, handleListPartitionReassignmentsRequest)
      case ApiKeys.OFFSET_DELETE =&amp;gt; handleOffsetDeleteRequest(request, requestLocal).exceptionally(handleError)
      case ApiKeys.DESCRIBE_CLIENT_QUOTAS =&amp;gt; handleDescribeClientQuotasRequest(request)
      case ApiKeys.ALTER_CLIENT_QUOTAS =&amp;gt; maybeForwardToController(request, handleAlterClientQuotasRequest)
      case ApiKeys.DESCRIBE_USER_SCRAM_CREDENTIALS =&amp;gt; handleDescribeUserScramCredentialsRequest(request)
      case ApiKeys.ALTER_USER_SCRAM_CREDENTIALS =&amp;gt; maybeForwardToController(request, handleAlterUserScramCredentialsRequest)
      case ApiKeys.ALTER_PARTITION =&amp;gt; handleAlterPartitionRequest(request)
      case ApiKeys.UPDATE_FEATURES =&amp;gt; maybeForwardToController(request, handleUpdateFeatures)
      case ApiKeys.ENVELOPE =&amp;gt; handleEnvelope(request, requestLocal)
      case ApiKeys.DESCRIBE_CLUSTER =&amp;gt; handleDescribeCluster(request)
      case ApiKeys.DESCRIBE_PRODUCERS =&amp;gt; handleDescribeProducersRequest(request)
      case ApiKeys.UNREGISTER_BROKER =&amp;gt; forwardToControllerOrFail(request)
      case ApiKeys.DESCRIBE_TRANSACTIONS =&amp;gt; handleDescribeTransactionsRequest(request)
      case ApiKeys.LIST_TRANSACTIONS =&amp;gt; handleListTransactionsRequest(request)
      case ApiKeys.ALLOCATE_PRODUCER_IDS =&amp;gt; handleAllocateProducerIdsRequest(request)
      case ApiKeys.DESCRIBE_QUORUM =&amp;gt; forwardToControllerOrFail(request)
      case ApiKeys.CONSUMER_GROUP_HEARTBEAT =&amp;gt; handleConsumerGroupHeartbeat(request).exceptionally(handleError)
      case ApiKeys.CONSUMER_GROUP_DESCRIBE =&amp;gt; handleConsumerGroupDescribe(request).exceptionally(handleError)
      case ApiKeys.GET_TELEMETRY_SUBSCRIPTIONS =&amp;gt; handleGetTelemetrySubscriptionsRequest(request)
      case ApiKeys.PUSH_TELEMETRY =&amp;gt; handlePushTelemetryRequest(request)
      case ApiKeys.LIST_CLIENT_METRICS_RESOURCES =&amp;gt; handleListClientMetricsResources(request)
      case _ =&amp;gt; throw new IllegalStateException(s"No handler for request api key ${request.header.apiKey}")
    }
  } catch {
    case e: FatalExitError =&amp;gt; throw e
    case e: Throwable =&amp;gt; handleError(e)
  } finally {
    // try to complete delayed action. In order to avoid conflicting locking, the actions to complete delayed requests
    // are kept in a queue. We add the logic to check the ReplicaManager queue at the end of KafkaApis.handle() and the
    // expiration thread for certain delayed operations (e.g. DelayedJoin)
    // Delayed fetches are also completed by ReplicaFetcherThread.
    replicaManager.tryCompleteActions()
    // The local completion time may be set while processing the request. Only record it if it's unset.
    if (request.apiLocalCompleteTimeNanos &amp;lt; 0)
      request.apiLocalCompleteTimeNanos = time.nanoseconds
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From the dispatch code above, the key components are easy to identify: the ReplicaManager, which manages replicas; the GroupCoordinator, which oversees consumer groups; the KafkaController, which runs the Controller logic; and the entry points behind the most frequently used client operations, KafkaProducer.send (producing messages) and KafkaConsumer.poll (consuming messages).&lt;/p&gt;
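&lt;p&gt;The giant match expression is, in effect, a dispatch table keyed by api key. Here is a minimal Python sketch of the same pattern (the request shape and handler names are invented for illustration, not Kafka's API):&lt;/p&gt;

```python
def handle_produce(request):
    return f"produced {request['payload']}"

def handle_fetch(request):
    return f"fetched {request['payload']}"

# Dispatch table standing in for the request.header.apiKey match block in KafkaApis.handle.
HANDLERS = {
    "PRODUCE": handle_produce,
    "FETCH": handle_fetch,
}

def handle(request):
    handler = HANDLERS.get(request["api_key"])
    if handler is None:
        # Mirrors the "case _" branch raising IllegalStateException.
        raise ValueError(f"No handler for request api key {request['api_key']}")
    return handler(request)

print(handle({"api_key": "PRODUCE", "payload": "msg-1"}))  # produced msg-1
```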

&lt;h2&gt;
  
  
  IV. AutoMQ Thread Model
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Optimization of Processing Threads&lt;/strong&gt;&lt;br&gt;
AutoMQ, drawing inspiration from the CPU pipeline, refines Kafka's processing model into a pipeline mode, striking a balance between sequentiality and efficiency.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sequentiality: Each TCP connection is tied to a single thread, with one network thread dedicated to request parsing and one RequestHandler thread responsible for processing the business logic;&lt;/li&gt;
&lt;li&gt;Efficiency: The stages are pipelined, allowing a network thread to parse MSG2 immediately after finishing MSG1, without waiting for MSG1’s persistence. Similarly, once the RequestHandler completes verification and sequencing of MSG1, it can start processing MSG2 right away. To further improve persistence efficiency, AutoMQ groups data into batches for disk storage.&lt;/li&gt;
&lt;/ul&gt;
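&lt;p&gt;The pipelining idea can be illustrated with two stages connected by a queue: the parsing stage never waits for the processing stage to finish a message before picking up the next one. A simplified Python sketch follows (the stage bodies are placeholders, not AutoMQ's actual code):&lt;/p&gt;

```python
import queue, threading

# Stage hand-off queue: the network thread parses, the handler thread processes;
# each stage moves on to the next message as soon as it hands off the current one.
parsed_q = queue.Queue()
done = []
DONE = object()

def network_thread(raw_msgs):
    for raw in raw_msgs:
        parsed_q.put(raw.upper())  # "parse" MSG n, then immediately take the next
    parsed_q.put(DONE)

def handler_thread():
    while True:
        msg = parsed_q.get()
        if msg is DONE:
            return
        done.append(msg)  # "verify, sequence and persist" the parsed message

t1 = threading.Thread(target=network_thread, args=(["msg1", "msg2", "msg3"],))
t2 = threading.Thread(target=handler_thread)
t1.start(); t2.start(); t1.join(); t2.join()
print(done)  # ['MSG1', 'MSG2', 'MSG3']
```

&lt;p&gt;Because each stage is single-threaded per connection, ordering is preserved even though the two stages overlap in time.&lt;/p&gt;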

&lt;p&gt;&lt;strong&gt;2. Optimization of the RequestChannel&lt;/strong&gt;&lt;br&gt;
AutoMQ has redesigned the RequestChannel into a multi-queue architecture, allowing requests from the same connection to be consistently directed to the same queue and handled by a specific KafkaRequestHandler, thus ensuring orderly processing during the verification and sequencing stages.&lt;br&gt;
Each queue is directly linked to a particular KafkaRequestHandler, maintaining a one-to-one relationship.&lt;br&gt;
After the Processor decodes the request, it assigns it to a specific queue based on the hash(channelId) % N formula.&lt;/p&gt;
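&lt;p&gt;The routing rule is easy to demonstrate. The sketch below uses crc32 as a stand-in stable hash (AutoMQ's actual hash function is not specified here) to show that every request from one connection lands, in order, on a single queue:&lt;/p&gt;

```python
import zlib

N = 4  # number of request queues, one per KafkaRequestHandler
queues = [[] for _ in range(N)]

def route(channel_id, request):
    # A stable hash of the channelId keeps every request from one connection
    # on one queue, which is what preserves per-connection ordering.
    idx = zlib.crc32(channel_id.encode()) % N
    queues[idx].append(request)
    return idx

for i in range(3):
    route("conn-A", f"A-req-{i}")

idx = route("conn-A", "A-req-3")
print(queues[idx])  # ['A-req-0', 'A-req-1', 'A-req-2', 'A-req-3']
```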

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;[1] AutoMQ: &lt;a href="https://github.com/AutoMQ/automq" rel="noopener noreferrer"&gt;https://github.com/AutoMQ/automq&lt;/a&gt;&lt;br&gt;
[2] Kafka 3.7: &lt;a href="https://github.com/apache/kafka/releases/tag/3.7.0" rel="noopener noreferrer"&gt;https://github.com/apache/kafka/releases/tag/3.7.0&lt;/a&gt;&lt;br&gt;
[3] JAVANIO: &lt;a href="https://jenkov.com/tutorials/java-nio/overview.html" rel="noopener noreferrer"&gt;https://jenkov.com/tutorials/java-nio/overview.html&lt;/a&gt;&lt;br&gt;
[4] AutoMQ Thread Optimization: &lt;a href="https://mp.weixin.qq.com/s/kDZJgUnMoc5K8jTuV08OJw" rel="noopener noreferrer"&gt;https://mp.weixin.qq.com/s/kDZJgUnMoc5K8jTuV08OJw&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>javascript</category>
    </item>
  </channel>
</rss>
