
Kitex/Hertz Empowers LLMs: A Retrospective of Key Features on Its Third Anniversary

By Yang Rui from CloudWeGo Team

GitHub: https://github.com/cloudwego

It has been three years since CloudWeGo's open-source journey began. Adhering to the principle of internal and external consistency, we have continuously iterated on our open-source repositories, releasing features that serve ByteDance internally to the outside world. From 2023 to 2024, Kitex/Hertz focused on LLM back-end services, user experience, and performance, supporting the rapid development of new business scenarios while continuously improving usability and performance. Meanwhile, Kitex/Hertz has been widely adopted by external enterprises and has attracted numerous external developers, continuously enriching the CloudWeGo ecosystem.


This article summarizes the presentation "Kitex/Hertz Empowers LLMs: A Retrospective of Key Features on The Third Anniversary". It introduces the significant features of Kitex/Hertz over the past year, aiming to help enterprise users and community developers apply Kitex/Hertz more effectively when building microservices systems in their projects.

Enhanced Streaming Capabilities to Support LLMs

With the rapid development of LLMs and ByteDance's AI applications, streaming communication has emerged as the primary communication mode for LLM application services. To better support business growth, we have optimized streaming communication in microservices in terms of stability, engineering practices, and performance over the past year.

Previous Streaming Capabilities of Kitex/Hertz

Both Kitex and Hertz support streaming scenarios. Kitex supports gRPC, with better performance than the official gRPC implementation and functional alignment with it. Hertz supports HTTP Chunked Transfer Encoding and WebSocket. However, these capabilities were insufficient to support the rapid internal development of LLMs at ByteDance, for several reasons:

  • More SSE Applications on Clients

Before multimodal models emerged, LLM applications were mainly text-based dialogue scenarios, often using the SSE protocol to return server results to clients in real time. Text push scenarios are relatively simple, requiring only a straightforward, browser-friendly protocol.

  • The Burden of Transitioning from Thrift to Protobuf

Although gRPC (Protobuf) is commonly used for streaming communication in RPC scenarios, and Kitex also supports gRPC, ByteDance's server-side services primarily use Thrift IDLs. Developers are more familiar with Thrift, and few services use the gRPC protocol. As the demand for streaming grows, we need to reduce the cognitive burden on developers during the transition, based on these internal realities. Additionally, a broad shift toward Protobuf-defined services is not conducive to unified IDL/interface management.

  • Lack of Engineering Practices

Compared to the PingPong model of one-send-one-receive, streaming communication adds complexity in service governance and engineering practices. The industry lacks accumulated engineering practices for streaming communication. Streaming interfaces can be easily misused, affecting service stability. From an observability perspective, there is no definition for streaming monitoring.

Streaming Capabilities – SSE/Thrift Streaming

Hertz SSE

SSE (Server-Sent Events) is based on the HTTP protocol and supports unidirectional data push from the server to the client. Its advantages are simplicity, ease of use, and developer friendliness, making it suitable for text transmission and meeting the basic communication needs of text dialogue models. Compared to WebSocket, SSE is lighter; for text-based dialogue LLM applications, the server only needs to push data to the client without handling the complexity of bidirectional communication. (In voice dialogue scenarios, WebSocket, which is also browser-friendly, is more suitable.) SSE can define different event types, and the client can process data based on the event type; in LLM applications, this can be used to distinguish different kinds of response data (e.g., partial outputs, error messages, status updates).
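For reference, Hertz provides SSE support through the hertz-contrib/sse extension. Below is a minimal server-side sketch that pushes incremental text chunks, in the spirit of an LLM returning partial outputs; the route, event name, and payloads are illustrative, and the NewStream/Publish API reflects the extension at the time of writing:

```go
package main

import (
	"context"
	"net/http"
	"time"

	"github.com/cloudwego/hertz/pkg/app"
	"github.com/cloudwego/hertz/pkg/app/server"
	"github.com/hertz-contrib/sse"
)

func main() {
	h := server.Default(server.WithHostPorts("127.0.0.1:8080"))
	h.GET("/chat", func(ctx context.Context, c *app.RequestContext) {
		c.SetStatusCode(http.StatusOK)
		s := sse.NewStream(c)
		// Push partial outputs as separate events; the client reassembles them.
		for _, chunk := range []string{"Hello", ", ", "world", "!"} {
			_ = s.Publish(&sse.Event{Event: "message", Data: []byte(chunk)})
			time.Sleep(100 * time.Millisecond)
		}
	})
	h.Spin()
}
```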


However, SSE is not well suited to server-to-server communication, for several reasons: the server side has high computational and transmission performance requirements, so an inefficient text protocol is a poor fit; JSON is simple but ill-suited to complex server-side interaction scenarios, where strongly typed RPC is preferred; and some cases require bidirectional streaming communication.

Therefore, considering ByteDance's internal needs, we chose to support Thrift Streaming.

Kitex Thrift Streaming

Streaming communication is used not only in LLM applications but also in other business scenarios. For example, Douyin Search uses RPC streaming of results to improve performance: during the video packaging stage, it retrieves information related to recalled video IDs, bundling multiple docs (e.g., 10) in one request and returning the first completed package. In the Lark People data export scenario, data is retrieved concurrently; if all data had to be filled into an Excel sheet before returning, excessive data could lead to OOM (Out of Memory) and cause the process to terminate abnormally. Enhancing streaming capabilities therefore not only supports the rapid development of LLMs but also meets the development needs of other business scenarios.

Although Kitex supports gRPC, we recommend using Thrift internally. Supporting diverse protocols meets varied needs, but it is best for a company to establish a single best practice to minimize the burden of choice on developers; the toolchain can then also provide better support.

Streaming Protocols

Within ByteDance, traffic control for streaming protocols mainly relies on the Service Mesh. However, to ship quickly without requiring Service Mesh support for a new protocol, Kitex first supported Thrift Streaming on top of gRPC (HTTP/2). Since the official gRPC protocol specification allows extending the content-type, the implementation follows gRPC's RPC communication specification while replacing Protobuf encoding with Thrift encoding.


Thrift over gRPC began its Alpha at ByteDance in December 2023 and was officially released in Kitex v0.9.0 in March 2024. It is now widely used internally, with usage instructions available on the official website.

  • Pros:

    • Service Mesh Compatibility: Based on HTTP/2 transport, no separate support is required from Service Mesh.
    • Low Support Cost: The decoding type is explicitly determined based on SubContentType (an extension supported by the gRPC protocol specification).
  • Cons:

    • High Resource Consumption: Flow control and dynamic windows introduce additional overhead.
    • Significant Latency Impact: Flow control can significantly degrade latency with heavier traffic or larger packets, requiring users to adjust WindowSize.
    • Difficult Troubleshooting: Increased complexity also raises the difficulty of troubleshooting.

Thrift over gRPC can be quickly implemented. However, from the perspectives of performance and troubleshooting, we have developed a Streaming protocol (Streaming over TTHeader) to simplify streaming communication. It is currently under internal debugging and trials, with an expected release in November-December 2024.

How to Define Streaming in Thrift

Users familiar with Thrift know that native Apache Thrift does not support the definition of streaming interfaces. Adding new keywords would make other Thrift parsing tools, including IDE plugins, incompatible. Therefore, defining streaming types for Thrift's RPC methods through annotations ensures parsing compatibility:

  • streaming.mode="bidirectional": Bidirectional Streaming

  • streaming.mode="client": Client Streaming

  • streaming.mode="server": Server Streaming

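As a minimal sketch, streaming methods defined with these annotations might look as follows (the namespace, service, struct, and method names are purely illustrative):

```thrift
namespace go demo

struct Request  { 1: required string query }
struct Response { 1: required string answer }

service ChatService {
    // Server streaming: the server pushes multiple Responses for one Request.
    Response Chat(1: Request req) (streaming.mode="server")
    // Bidirectional streaming: both sides may send multiple messages.
    Response Echo(1: Request req) (streaming.mode="bidirectional")
}
```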

Both the currently supported Thrift Streaming over gRPC and the upcoming Thrift Streaming over TTHeader use this method to define streaming methods. The client-side will provide options to specify which Streaming protocol to use, while the server-side will support multiple protocols through protocol detection.
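Regardless of the underlying protocol, the generated client exposes a gRPC-style stream object. The sketch below shows the typical Recv loop for a server-streaming method; the interface here is a simplified stand-in for what Kitex actually generates from the IDL above, so the concrete types and constructors will differ:

```go
package demo

import (
	"fmt"
	"io"
)

// Response mirrors the IDL struct sketched above.
type Response struct{ Answer string }

// serverStream is a simplified stand-in for the stream type Kitex generates
// for a server-streaming method.
type serverStream interface {
	Recv() (*Response, error)
}

// consume reads messages until the server closes the stream.
func consume(st serverStream) error {
	for {
		msg, err := st.Recv()
		if err == io.EOF {
			return nil // server finished pushing
		}
		if err != nil {
			return err
		}
		fmt.Println(msg.Answer)
	}
}
```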

Generalized Streaming Invocation

If SSE is used for streaming communication on clients and Thrift Streaming is used on servers, how does the overall communication from clients to servers work?

(Image: SSE <-> Thrift Streaming conversion architecture)

Taking the internal text dialogue model as an example, the traffic undergoes protocol conversion after passing through the API gateway, and the server uses the Server Streaming type to push data to the client.

An important capability here is protocol conversion. Additionally, load testing and interface testing platforms need to dynamically construct request data to test server-side services.

Users of Kitex know that Kitex provides generalized invocation for the Thrift protocol, primarily to support such general-purpose services. Since internal microservices were previously mainly Thrift PingPong services, Kitex provided generalized invocation for Map, JSON, and HTTP data types, as well as binary generalized invocation for traffic forwarding.
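For reference, a minimal sketch of the existing JSON generalized invocation for PingPong calls is shown below (the IDL path, service name, address, and method are illustrative); the streaming variants described next follow the same idea but expose per-streaming-type interfaces:

```go
package main

import (
	"context"
	"fmt"

	"github.com/cloudwego/kitex/client"
	"github.com/cloudwego/kitex/client/genericclient"
	"github.com/cloudwego/kitex/pkg/generic"
)

func main() {
	// Parse the IDL at runtime; no generated code is required.
	p, err := generic.NewThriftFileProvider("./echo.thrift")
	if err != nil {
		panic(err)
	}
	g, err := generic.JSONThriftGeneric(p)
	if err != nil {
		panic(err)
	}
	cli, err := genericclient.NewClient("example.echo.service", g,
		client.WithHostPorts("127.0.0.1:8888"))
	if err != nil {
		panic(err)
	}
	// Both the request and the response are plain JSON strings.
	resp, err := cli.GenericCall(context.Background(), "Echo", `{"message": "hello"}`)
	if err != nil {
		panic(err)
	}
	fmt.Println(resp)
}
```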

Therefore, for streaming interfaces, Kitex has added support for generalized streaming invocation. Compared to PingPong generalized interfaces, generalized streaming requires separate interfaces for the three streaming types.

  • PingPong/Unary Generalized Invocation Interface

(Image: PingPong/Unary generalized invocation interface definition)

  • Streaming Generalized Invocation Interface

(Image: streaming generalized invocation interface definitions)

Currently, support for the mainstream JSON data type is complete, and other data types will be supported based on business needs in the future. (Since the Kitex Streaming v2 interface is yet to be released, and to avoid affecting the user experience of generalized streaming, this support has not been officially announced, but the functionality is ready. Users can visit the generalized invocation section on the official website for English documents.)


User Experience of Streaming Capability

We have now covered the basics of streaming and introduced the streaming capabilities that Kitex/Hertz supported in the past, has newly added, and will soon release. But do developers who have worked on streaming interfaces, including with other frameworks such as the official gRPC, know how to use streaming interfaces properly and how to locate issues when they arise?

Within ByteDance, as streaming services have evolved, we've noticed a significant increase in feedback issues. On one hand, compared to Thrift PingPong, our support at the basic capability level is still incomplete. On the other hand, developing streaming interfaces requires a deep understanding of proper usage; otherwise, misuse can easily lead to problems.

Therefore, in 2024 we initiated a Streaming Optimization Project, sorted through the various issues, and optimized them one by one. In terms of user experience, some issues stem from the streaming interface definitions themselves; after comprehensive consideration, we decided to shed this historical burden and release the Streaming v2 interface. Below are some of the existing issues and the ongoing optimizations. Since it is difficult to enforce proper usage of streaming interfaces solely at the framework level, we will also release usage specifications and best practices for streaming interfaces to help users develop high-quality streaming services. If you have better suggestions for streaming usage, we welcome your feedback!

(Image: existing streaming issues and ongoing optimizations)

Taking streaming observability as an example: previously, streaming interface monitoring was not defined separately and reused PingPong reporting, so only stream-level information was reported and per-message Recv/Send monitoring was missing. Therefore, when supporting Thrift Streaming, StreamSend and StreamRecv events were added, with the framework recording when each event occurred and the size of the data transmitted. For enterprise users with custom Tracer reporting, it is only necessary to implement the rpcinfo.StreamEventReporter interface; Kitex calls it after each Recv and Send execution, providing access to the event information. Below is the Trace information for Send/Recv within a stream.

(Image: Trace information for Send/Recv events within a stream)
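As a rough sketch, a custom tracer that consumes these stream events could look like the following; the ReportStreamEvent signature and the stats.StreamRecv/StreamSend event names reflect my reading of the Kitex rpcinfo/stats packages and should be verified against the version you use:

```go
package tracing

import (
	"context"
	"log"

	"github.com/cloudwego/kitex/pkg/rpcinfo"
	"github.com/cloudwego/kitex/pkg/stats"
)

// streamTracer is a regular Kitex tracer that additionally receives
// per-message stream events.
type streamTracer struct{}

// Start and Finish implement the stats.Tracer interface.
func (t *streamTracer) Start(ctx context.Context) context.Context { return ctx }
func (t *streamTracer) Finish(ctx context.Context)                {}

// ReportStreamEvent is invoked after each Recv/Send on a stream.
func (t *streamTracer) ReportStreamEvent(ctx context.Context, ri rpcinfo.RPCInfo, event rpcinfo.Event) {
	switch event.Event() {
	case stats.StreamRecv:
		log.Printf("stream recv on %s at %v", ri.To().Method(), event.Time())
	case stats.StreamSend:
		log.Printf("stream send on %s at %v", ri.To().Method(), event.Time())
	}
}
```

The tracer is then registered like any other tracer, e.g. via client.WithTracer or server.WithTracer.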

Review of New Features, User Experience/Performance Improvements

While specialized support and optimization for streaming capabilities have been conducted over the past year, we have also provided other new features to meet user needs, enhance user experience, and continue to improve framework performance.

New Features – Thrift/gRPC Multi-Services

The official gRPC framework supports multiple services, but previous versions of Kitex did not, mainly to stay aligned with Thrift usage. Thrift's limitation is that supporting multiple services would introduce protocol-incompatible changes, impacting users. Within ByteDance, the TTHeader protocol is widely used, so we decided to transmit the IDL Service Name via TTHeader to solve the problem of Thrift not supporting multiple services.

Kitex v0.9.0 officially supports registering multiple IDL Services within one Server, including Thrift and Protobuf. Thrift provides true multi-service functionality at the protocol level based on TTHeader, while being compatible with the old CombineService.

Here is a brief introduction to Combine Service. Kitex previously provided a pseudo-multi-service feature, Combine Service, to address the issue of excessively large IDLs (which lead to large generated code and slow compilation). It allows the server to split one IDL Service into multiple ones, but requires that these IDL Services have no methods with the same name (since the protocol did not support multiple services, method routing could not be done). Ultimately, Kitex merges the multiple IDL Services back into one Service, hence the name CombineService.

With Kitex's new multi-service support, the server can not only register multiple IDL Services but also provide Thrift and Protobuf interfaces simultaneously. For example, a service using Kitex gRPC (Protobuf) that wants to switch to Thrift Streaming while remaining compatible with old interface traffic can expose both types of IDL interfaces during the transition.

Below is an example of registering multiple services on the server:

(Image: example of registering multiple services on one Kitex server)
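A minimal sketch of such a server, assuming two generated service packages (the import paths, package, handler, and service names are illustrative):

```go
package main

import (
	"log"

	"github.com/cloudwego/kitex/server"

	// Generated packages for the two IDL services; names are illustrative.
	servicea "example.com/demo/kitex_gen/demo/servicea"
	serviceb "example.com/demo/kitex_gen/demo/serviceb"
)

// ServiceAImpl and ServiceBImpl implement the handler interfaces generated
// from the two IDL services (method bodies omitted here).
type ServiceAImpl struct{}
type ServiceBImpl struct{}

func main() {
	svr := server.NewServer()
	// Register both IDL services on the same server instance.
	if err := servicea.RegisterService(svr, new(ServiceAImpl)); err != nil {
		log.Fatal(err)
	}
	if err := serviceb.RegisterService(svr, new(ServiceBImpl)); err != nil {
		log.Fatal(err)
	}
	if err := svr.Run(); err != nil {
		log.Fatal(err)
	}
}
```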

New Features – Mixed Retry

Kitex previously provided two retry policies: Failure Retry and Backup Request. Failure Retry can improve the success rate (enhancing service SLAs), but most retried failures are timeouts, so it increases latency; Backup Request can reduce request latency, but a failed response terminates the retries.

In internal practice, businesses generally want to combine both retries in order to:

  • Optimize the overall retry latency of Failure Retry

  • Improve the request success rate of Backup Request

Therefore, Kitex v0.11.0 introduces Mixed Retry, a hybrid policy combining the Failure Retry and Backup Request functions.
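A minimal client-side configuration sketch is shown below, using the same parameters as the scenario that follows; the retry.NewMixedPolicy and client.WithMixedRetry names reflect my reading of the v0.11.0 retry API and should be checked against the official retry documentation (the generated echoservice package is illustrative):

```go
package main

import (
	"time"

	"github.com/cloudwego/kitex/client"
	"github.com/cloudwego/kitex/pkg/retry"

	// Generated client package; the import path is illustrative.
	"example.com/demo/kitex_gen/demo/echoservice"
)

func newEchoClient() (echoservice.Client, error) {
	// Send a backup request if no response arrives within 200ms,
	// and retry at most 2 times on failure.
	mp := retry.NewMixedPolicy(200)
	mp.WithMaxRetryTimes(2)

	return echoservice.NewClient("example.echo.service",
		client.WithRPCTimeout(time.Second), // RPCTimeout of 1000ms
		client.WithMixedRetry(mp),
	)
}
```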

To facilitate understanding the differences between the three retries, here is a scenario: assume the first request takes 1200ms, the second request takes 900ms, with RPCTimeout configured to 1000ms, MaxRetryTimes to 2, and BackupDelay to 200ms.

Comparing the results of the three retries:

(Image: timeline comparison of the three retry policies under the scenario above)

  • Mixed Retry: Success, cost 1100ms (the backup request sent at 200ms completes at 200ms + 900ms = 1100ms; under Mixed Retry, the first request's timeout does not terminate the call)

  • Failure Retry: Success, cost 1900ms (the first request times out at 1000ms, then the retry completes after another 900ms)

  • Backup Request: Failure, cost 1000ms (the first request times out at 1000ms before the backup, sent at 200ms, could finish at 1100ms; a failed return terminates the retries)

User Experience - Frugal & FastCodec (Thrift)

Both Frugal and FastCodec (Thrift) are high-performance Thrift serialization implementations provided by Kitex. Frugal's advantage over FastCodec is that it requires no code generation, significantly alleviating the issue of excessively large generated output.

But two drawbacks remain:

  1. Both Frugal and FastCodec decoding rely on packets with headers. If a Thrift Buffered packet is received, decoding falls back to the Apache Codec. Users need to be clear about which protocol they receive; otherwise, using Frugal cannot completely eliminate code generation.

  2. Frugal is implemented with a JIT. x86 support is complete, but ARM only has a fallback strategy with poor performance.

To address the protocol binding issue, the new version supports SkipDecode. Test results show that SkipDecode + FastCodec still outperforms the Apache Thrift Codec.
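A sketch of enabling this on a client, assuming the codec option names in the Kitex thrift codec package (FastRead/FastWrite plus an EnableSkipDecoder flag); verify the exact constants against the version you use:

```go
package rpcconf

import (
	"github.com/cloudwego/kitex/client"
	"github.com/cloudwego/kitex/pkg/remote/codec/thrift"
)

// withSkipDecoder enables FastCodec together with the skip decoder, so that
// Thrift Buffered packets no longer force a fallback to the Apache codec.
var withSkipDecoder = client.WithPayloadCodec(
	thrift.NewThriftCodecWithConfig(thrift.FastRead | thrift.FastWrite | thrift.EnableSkipDecoder),
)
```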

For the Frugal ARM issue, a new reflection-based implementation is provided, eliminating the need for separate support per architecture. Although it uses reflection, it bypasses reflection's type checks to achieve higher performance; test results are even slightly better than the JIT implementation.

(Image: Frugal reflection vs. JIT benchmark results)

User Experience - Output Reduction and Generation Speed Optimization

Large generated output, slow code generation, and slow compilation are significant pain points for services with long iteration histories within ByteDance. Therefore, Kitex provides various optimizations to reduce output size and improve generation speed.

  • IDL Trimming

A complex IDL with a long iteration history contains many obsolete struct definitions, and cleaning them up by hand adds to the development burden. The trimming tool generates code based only on the struct definitions required by the RPC methods; users can also specify which methods to generate code for. According to pilot projects in large ByteDance repositories, generation time is halved and output size is reduced by over 60%.

Usage: $ kitex -module xx -thrift trim_idl xxxx.thrift

Example effect: In the example below, the trimming tool deleted 60,000 unused structs and 530,000 fields.

(Image: IDL trimming results)

  • no_fmt Speedup

After code generation, the output is formatted by default to improve readability, but users rarely read generated code. Disabling formatting via the no_fmt option improves generation speed.

Usage: $ kitex -module xx -thrift no_fmt xxxx.thrift

Effect: The P90 generation time for a certain platform within ByteDance decreased from 80s to 20s.

  • Removing Unnecessary Codes from Kitex

Kitex defaults to generating the full Apache Thrift code, but in reality, only the Codec part is used in fallback scenarios, and the rest of the code is not needed.

Therefore, Kitex v0.10.0 defaults to removing the Thrift Processor and can remove all Apache Thrift code via parameter specification.


Usage: $ kitex -module xxx -thrift no_default_serdes xxx.thrift

Effect: Output size is reduced by about 50%+.

  • Frugal Slim Extreme Reduction

Usage: $ kitex -thrift frugal_tag,template=slim -service p.s.m idl/api.thrift (the slim template uses Frugal for Thrift serialization).

Effect: Output size is reduced by about 90%.

User Experience - kitexcall

Although RPC calls are simpler and more convenient than HTTP, they are less convenient to test: a tool must first generate code, and request data must then be constructed. The testing platforms mentioned earlier use generalized invocation to construct request data without relying on generated code, but the cost of using generalized invocation is not low; users must first understand how to use it and how to construct the data for each method.

To improve testing convenience, a standalone command-line tool, kitexcall, is provided on top of Kitex JSON generalized invocation, allowing users to issue Thrift requests with JSON data. (This feature was contributed by the community; thanks!)

Usage: $ kitexcall -idl-path echo.thrift -m echo -d '{"message": "hello"}' -e 127.0.0.1:8888

Future optimization plans:

  • Graphical interface for more convenient testing

  • Support for gRPC testing

  • No need to specify IDL, using server reflection to obtain IDL information

Performance Optimization – Thrift On-Demand Serialization

As business iteration makes IDL definitions increasingly complex, services in production may only need some of the fields, yet all of them must be serialized and transmitted, introducing additional performance overhead. To address this, Kitex supports on-demand serialization for Thrift.

Borrowing the FieldMask concept from Protobuf, Kitex provides a Thrift FieldMask feature that lets users select which fields to encode, reducing serialization and transmission overhead.

For example, below, only the Foo field is encoded and returned, ignoring the Bar field:

(Image: example where only the Foo field is encoded, ignoring Bar)

The user still constructs the Bar data, but with the FieldMask specifying the Foo field, the framework only encodes Foo:

(Image: constructing the response with a FieldMask so that only Foo is encoded)

The required fields can also be specified by the peer; for specific usage, see the on-demand serialization documentation on the official website.

Performance Optimization – Thrift Memory Allocation Optimization

Kitex continuously monitors RPC performance. In the current context of high cost pressure, we keep digging for further optimizations. The routine hot-path optimizations have all been done, so the remaining ones are less conventional; v0.10.0 released new optimizations focusing on memory allocation and GC.

  • Span Cache: Optimizes String/Binary decoding costs:

    • Pre-allocates memory, reducing mallocgc calls
    • Reduces the actual number of generated objects -> lower GC costs
  • Centralized memory allocation for container fields

    • Similarly, changes from allocating memory separately for each element to allocating it centrally for the whole container

Span Cache can reduce CPU usage but increases memory usage. To avoid impacting services with small memory specifications, it is not enabled by default and must be turned on explicitly:

(Image: enabling the Span Cache option)

Effect: Under extreme testing, throughput is increased by about 10%, and latency is reduced by about 30%.

Memory Analysis Tool

The received objects in RPC/HTTP are constructed, memory-allocated, and value-assigned by the framework before being returned to the user. However, if the user's code holds onto these objects indefinitely, it can lead to memory leaks. While pprof heap can indicate where memory is allocated, it cannot show where references are made. So, how do we determine who is referencing a Go object?

In fact, GC scans and marks objects, capturing reference relationships. By combining this with variable names and type information, we can analyze the referencing situation of objects. Leveraging Delve, we have developed the goref object reference analysis tool, which was open-sourced in July (github.com/cloudwego/goref). This addresses the limitation of Go's native tools in analyzing memory references, aiding Go developers in quickly identifying memory leaks and enhancing the Go tooling ecosystem.
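The typical workflow, per the goref README (the grf command and the grf.out output file reflect the README at the time of writing; check the repository for current usage):

Install: $ go install github.com/cloudwego/goref/cmd/grf@latest

Attach to a running process and dump reference data: $ grf attach <pid>

Visualize the result: $ go tool pprof -http=:5079 ./grf.out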

For instance, the pprof heap profile shown in the following image reveals that the currently referenced objects are primarily allocated within FastRead (Kitex's deserialization code). It is normal for decoding to allocate memory to construct data, but this flame graph offers limited help in troubleshooting, since the allocation site is often not where the memory leak originates.

(Image: pprof heap profile showing allocations concentrated in FastRead)

However, using the goref tool yields the following result: mockCache holds an RPC Resp, preventing memory from being released. The issue is immediately apparent.

(Image: goref output showing mockCache holding the RPC Resp)

Conclusion and Outlook

Conclusion

Enhancing Streaming Capabilities to Support LLMs

  • Streaming capabilities provided by Kitex/Hertz: gRPC, HTTP 1.1 Chunked, WebSocket, SSE, Thrift Streaming

  • SSE <-> Thrift Streaming

  • Generalized streaming invocations

  • Streaming capability optimizations to enhance user experience and engineering practices

Review of New Features, User Experience, and Performance Improvements

  • New Features: Thrift/gRPC multi-service support, Mixed Retry

  • User Experience: Frugal/FastCodec, streamlined outputs, generation speed optimizations, kitexcall

  • Performance Optimization: Thrift on-demand serialization, memory allocation improvement

  • Memory Analysis Tool: goref

Outlook

In the coming year, we will continue to enhance streaming capabilities and optimize the user experience. We will provide usage guidelines for streaming interfaces to help users better develop their streaming services:

  • Release Kitex Streaming v2 interface to address historical issues

  • Release TTHeader Streaming for better performance

  • Engineering practices: graceful shutdown, retries, timeout control

  • Release streaming-related specifications: error handling, interface usage guidelines

Furthermore, we will consider strengthening the streaming ecosystem, such as enriching generalized streaming invocations and providing more gateway-friendly support:

  • SSE <-> Thrift Streaming (HTTP/2 and TTHeader Streaming)

  • WebSocket <-> Thrift Streaming (HTTP/2 and TTHeader Streaming)

  • Binary and Map generalized invocations for Streaming

Special announcement: Kitex plans to gradually remove Apache Thrift-generated code in future versions. Due to incompatible changes in Apache Thrift v0.14, Kitex has been forced to stay locked to Apache Thrift v0.13. To resolve this, Kitex will eliminate its dependency on Apache Thrift.
