Demystifying Confluent's Schema Registry Wire Format

Steven Jenkins De Haro

Within Kafka-based architectures, message serialization is more than just a format choice; it defines the contract between producers and consumers. When that contract is paired with a schema registry, it gains features like efficient schema evolution, versioning, and compatibility checks. However, these benefits are only fully realized when both sides of the data exchange use the registry. When a client does not or cannot use the schema registry, or when the integration with Kafka happens via a REST API, there are additional byte-level considerations to take into account.

To address these scenarios, this article provides a detailed breakdown of Confluent's Schema Registry wire format and how to use it without integrating with a schema registry. In addition, it will demonstrate how to inspect and decode Kafka messages using tools like xxd, dd, and low-level byte parsing for clearer insight into how data flows through Kafka topics.


A brief overview of Avro, JSON, and Protobuf

Apache Avro, JSON, and Protobuf formats are widely used data serialization technologies that serve slightly different purposes and offer contrasting trade-offs in efficiency, readability, and interoperability.

Avro is a row-oriented, binary format that can optionally embed schemas with the data (or reference them externally), enabling fast, compact, and strongly typed data exchange. It's well-suited for high-throughput pipelines and big data applications due to its small payload size and efficient encoding. JSON, in contrast, is a human-readable, text-based format that prioritizes simplicity and interoperability, ideal for web APIs and configuration files, though its verbosity results in larger payloads and slower parsing compared to binary formats. Protobuf (Protocol Buffers), developed by Google, is a compact binary format similar to Avro but uses numeric field tags instead of field names. Schemas are defined in .proto files that are later compiled into code, providing efficient encoding and making it a popular choice for microservices, APIs, and inter-service communication.

Below is a simple comparison of the test record { "message": "Hello World!" } to illustrate the size differences between formats:

πŸ“Š Size Comparison

| Format   | Raw Size | Base64 Size | Base64 Encoded                           | Raw Payload Meta |
| -------- | -------- | ----------- | ---------------------------------------- | ---------------- |
| JSON     | 29 B     | 40 B        | eyAibWVzc2FnZSI6ICJIZWxsbyBXb3JsZCEiIH0= | N/A              |
| Protobuf | 14 B     | 20 B        | CgxIZWxsbyBXb3JsZCE=                     | 0x0a 0x0c        |
| Avro     | 13 B     | 20 B        | GEhlbGxvIFdvcmxkIQ==                     | 0x18             |

πŸ“ Note: For the Avro length prefix, the string value is encoded as a long using ZigZag encoding ( lengthΒ inΒ bytesΓ—2\text{length in bytes} Γ— 2 ), and then written as a variable-length integer (varint/LEB128). The same encoding is applied to all signed numeric types to make negative numbers efficient by using fewer bytes to represent them. See Data Serialization and Deserialization for more information.

As can be seen above, Avro is the most compact in this example, mainly because the schema, including the field names, is referenced externally. Had the schema been embedded, the message would have been around 159 raw bytes in size. Protobuf, on the other hand, encodes each field in the payload using a tag encoded with the Protobuf wire format, which contains the field number and wire type. For length-delimited types, like strings or embedded messages, the tag is immediately followed by a length prefix and then the serialized data. This structure makes Protobuf messages partially self-describing, allowing known fields to be parsed in a pinch without the compiled .proto files. With all this in mind, it's now time to discuss the wire format to understand where schemas and payloads fit in.
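
As a quick illustration of that tag encoding, the 0x0a 0x0c meta bytes from the Protobuf row can be derived with plain arithmetic (a sketch, no Protobuf library involved):

int fieldNumber = 1; // "message" is field number 1 in the schema shown later.
int wireType = 2;    // 2 = length-delimited (strings, bytes, embedded messages).

int tag = (fieldNumber << 3) | wireType; // (1 << 3) | 2 = 0x0a
int length = "Hello World!".getBytes(java.nio.charset.StandardCharsets.UTF_8).length; // 12 = 0x0c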


What is Confluent's wire format?

The wire format defines a fixed binary layout for serialized messages that will be sent over the network. It's not tied to Kafka itself, but rather to how Confluent's Schema Registry, and compatible systems, encode metadata alongside the actual payload. In other words, instead of producing only a raw Avro, JSON, or Protobuf payload into a topic, an additional header structure is prepended to the payload. This header tells the consumer which schema to use and, in the case of Protobuf, which top-level message within that schema. The payload can then be deserialized correctly regardless of the consumer's language or platform, as long as it has access to the Schema Registry and the corresponding schema ID.

The following is a high-level view of the wire format:

πŸ’Ύ For Avro and JSON payloads:

+----------------------+-------------------+
| Byte Offset          | Content           |
+----------------------+-------------------+
| 0                    | Magic byte (0x00) |
| 1 – 4                | Schema ID (int32) |
| 5 – end              | Avro/JSON payload |
+----------------------+-------------------+

πŸ’Ύ For Protobuf payloads:

+----------------------+----------------------------------------+
| Byte Offset          | Content                                |
+----------------------+----------------------------------------+
| 0                    | Magic byte (0x00)                      |
| 1 – 4                | Schema ID (int32)                      |
| 5 – (5+N-1)          | Protobuf message index (varint)        |
| (5+N) – end          | Protobuf payload (field tags + values) |
+----------------------+----------------------------------------+

Within the wire format, the first byte is called the magic byte (0x00). It simply indicates that the wire format is in use; without it, a consumer may throw a serialization exception about an unknown magic byte when given a raw payload. This is followed by 4 bytes that store the schema ID from the schema registry as a big-endian signed 32-bit integer. For Protobuf payloads, the next section is the message index, which identifies which top-level message found in the associated .proto file is encoded. It is also varint-encoded and collapses to a single 0x00 byte in the common case where the first (or only) message is used. The final section stores the actual raw payload.
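
Putting the layout into code, the following minimal Java sketch walks the header of the Protobuf example shown later in this article (readVarint is a hypothetical helper; Confluent's documentation describes the message indexes as ZigZag-encoded varints, which makes no difference for the common single 0x00 case):

byte[] msg = { 0x00, 0x00, 0x00, 0x00, 0x01, 0x00 /* , Protobuf payload bytes... */ };

ByteBuffer buf = ByteBuffer.wrap(msg); // Big-endian by default, matching the wire format.
byte magic = buf.get();                // 0x00 -> wire format in use.
int schemaId = buf.getInt();           // 1 -> big-endian int32 schema ID.
long msgIndex = readVarint(buf);       // 0x00 -> first top-level message.

// Minimal varint (LEB128) reader; apply ZigZag decoding on top where required.
static long readVarint(ByteBuffer buf) {
    long value = 0;
    int shift = 0;
    byte b;
    do {
        b = buf.get();
        value |= (long) (b & 0x7F) << shift;
        shift += 7;
    } while ((b & 0x80) != 0);
    return value;
}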


Debugging the wire format

It's useful to understand how to debug a Kafka message to validate that the structure/format of the data being exchanged meets expectations, for example, to identify a message that uses the wire format versus one that does not. For this, the xxd CLI tool will be used, as it comes preinstalled on most operating systems and is available on Windows via Git Bash.

Prepare the source data

To begin, an Avro sample message using the wire format is needed. Use one of the following options to generate or dump an existing sample:

βš™οΈ Option 1: Generate

printf '\x00\x00\x00\x00\x01\x18\x48\x65\x6c\x6c\x6f\x20\x57\x6f\x72\x6c\x64\x21' > avro-message.bin

The generated message above is associated with the following dummy schema, registered with a schema ID of 1:

{
  "type": "record",
  "name": "Greeting",
  "namespace": "com.example.messages",
  "fields": [
    { "name": "message", "type": "string" }
  ]
}
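
For reference, the 13-byte Avro payload inside that message could be produced with the Apache Avro Java library along these lines (a sketch assuming schemaJson holds the schema above as a string):

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

...

Schema schema = new Schema.Parser().parse(schemaJson);

GenericRecord record = new GenericData.Record(schema);
record.put("message", "Hello World!");

ByteArrayOutputStream out = new ByteArrayOutputStream();
BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
encoder.flush();

byte[] avroPayload = out.toByteArray(); // 13 bytes: 0x18 + "Hello World!" in UTF-8.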

βš™οΈ Option 2: Kafka CLI

# Depending on package, it might be './bin/kafka-console-consumer.sh ...'.
/bin/kafka-console-consumer \
     --bootstrap-server <BOOTSTRAP_URL>:<PORT> \
     --topic <YOUR_TOPIC> \
     --from-beginning \
     --max-messages 1 \
     | tr -d '\n' > avro-message.bin

πŸ“ Note: The last part of the command removes the 0x0a byte that may appear at the end due to a new line introduced by the command.

βš™οΈ Option 3: kcat CLI (formerly known as kafkacat < 1.7.0)

# Alternative 'docker run -it --rm edenhill/kafkacat:1.6.0 kafkacat -b ...'.
kcat -b <BOOTSTRAP_URL> \
  -t <YOUR_TOPIC> \
  -C -c 1 -o beginning -e \
  > avro-message.bin

Analyze the message contents

With the avro-message.bin now in hand, it's time to debug its contents. Execute the following command to view the hex representation of the data:

xxd -g 1 -c 16 ./avro-message.bin

πŸ”³ Output

00000000: 00 00 00 00 01 18 48 65 6c 6c 6f 20 57 6f 72 6c  ......Hello Worl
00000010: 64 21                                            d!

πŸ“ Note: Depending on the shell used, the meta sections may be color-coded to help identify the sections more easily πŸš€.

πŸ” Avro Breakdown

                   Confluent's Wire Format
 ____________________________|_____________________________
|                                        Avro              |
|                       __________________|________________|
|                      |                                   |
[ 00 ] [ 00 00 00 01 ] [ 18 ] [ "Hello World!" UTF-8 bytes ]
  ↑           ↑          ↑                  ↑
magic B  schema id   zigzag len       payload data

In the above breakdown, the last two sections are specific to the Avro format and together make up the raw payload data. The wire format is the combination of these and the first two sections. This is the main difference between producing raw payload data and data that can be associated with a particular schema in the Confluent Schema Registry.
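
As a quick sanity check, the 0x18 prefix can be decoded back into the string length with one line of ZigZag arithmetic:

int encoded = 0x18;                            // Single-byte varint from the dump.
int length = (encoded >>> 1) ^ -(encoded & 1); // ZigZag decode -> 12 bytes.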

For brevity, let's quickly compare a Protobuf dump against the Avro dump from earlier by running the following command:

printf '\x00\x00\x00\x00\x01\x00\x0a\x0c\x48\x65\x6c\x6c\x6f\x20\x57\x6f\x72\x6c\x64\x21' | xxd -g 1 -c 16

The generated message above is associated with the following dummy schema, registered with a schema ID of 1:

syntax = "proto3";

package com.example.messages;

message Greeting {
  string message = 1;
}

πŸ”³ Output

00000000: 00 00 00 00 01 00 0a 0c 48 65 6c 6c 6f 20 57 6f  ........Hello Wo
00000010: 72 6c 64 21                                      rld!

πŸ“ Note: If using a wrapper type like google.protobuf.StringValue from google/protobuf/wrappers.proto instead of string or optional string for presence detection with primitive fields, and for more consistent behavior, the output will be slightly different. This is because the wrapper type adds an inner message structure with a single field called value. The end result would be 00 00 00 00 01 00 [0a 0e] 0a 0c 48 65 6c 6c 6f 20 57 6f 72 6c 64 21, where bytes 0x0a and 0x0e for the outer tag and length are prepended to the expected raw payload in this example.

πŸ” Protobuf Breakdown

                         Confluent's Wire Format
 __________________________________|_____________________________________
|                                                Protobuf                |
|                              _____________________|____________________|
|                             |                                          |
[ 00 ] [ 00 00 00 01 ] [ 00 ] [ 0a ] [ 0c ] [ "Hello World!" UTF-8 bytes ]
  ↑           ↑          ↑      ↑      ↑                  ↑
magic B  schema id  msg index  tag   length         payload data

The structure is similar to the previous Avro one, except the wire format now includes the Protobuf-only section that was previously omitted: the message index. Also, notice that Protobuf's length prefix doesn't use ZigZag encoding.

Debugging challenge

As a small challenge, see if you can spot the differences between the following wire-formatted message carrying a JSON payload and the previous examples:

printf 'AAAAAAF7ICJtZXNzYWdlIjogIkhlbGxvIFdvcmxkISIgfQ==' | base64 -d | xxd -g 1 -c 16

Alternatively, use the longer byte approach:

printf '\x00\x00\x00\x00\x01\x7b\x20\x22\x6d\x65\x73\x73\x61\x67\x65\x22\x3a\x20\x22\x48\x65\x6c\x6c\x6f\x20\x57\x6f\x72\x6c\x64\x21\x22\x20\x7d' | xxd -g 1 -c 16

The generated message above is associated with the following dummy schema, registered with a schema ID of 1:

{
  "$id": "https://example.com/schemas/com.example.messages.Greeting.json",
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "Greeting",
  "type": "object",
  "properties": {
    "message": {
      "type": "string"
    }
  },
  "required": ["message"],
  "additionalProperties": false
}

Additional commands

Finally, here are a few more commands to play with that could come in handy. The commands are mostly for Avro and JSON-based messages, but they can be adapted for Protobuf as well.

πŸ“š PARSED VIEW

xxd -p ./avro-message.bin | sed -E 's/^(.{2})(.{8})(.*)$/Magic:\1 SchemaID:\2 Payload:\3/'

πŸ”³ Output

Magic:00 SchemaID:00000001 Payload:1848656c6c6f20576f726c6421

πŸ“š SECTIONAL VIEWS

# Magic byte (offset 0, 1 byte)
dd if=avro-message.bin bs=1 count=1 status=none | xxd -g 1
# Schema ID (offsets 1-4, 4 bytes)
dd if=avro-message.bin bs=1 skip=1 count=4 status=none | xxd -g 4
# Payload (offset 5 to end)
dd if=avro-message.bin bs=1 skip=5 status=none | xxd -g 1 -c 16

πŸ”³ Output

00000000: 00
00000000: 00000001
00000000: 18 48 65 6c 6c 6f 20 57 6f 72 6c 64 21           .Hello World!

πŸ“š PARSE CONTINUOUS HEX

# Avoids having to add \x before each byte, just remove spaces.
printf '00000000011848656c6c6f20576f726c6421' | xxd -r -p | xxd -g 1 -c 16

πŸ”³ Output

00000000: 00 00 00 00 01 18 48 65 6c 6c 6f 20 57 6f 72 6c  ......Hello Worl
00000010: 64 21                                            d!

Third-party wire format compatibility

The following tables provide a general overview of schema registries and tools that are compatible with Confluent's wire format. This information is based on first-hand testing and/or what could be found in the documentation at the time of writing.

πŸ—‚οΈ Schema Registries

| Name                  | Wire Format Support | Special Considerations    |
| --------------------- | ------------------- | ------------------------- |
| Azure Schema Registry | ❌ No               | Uses a different approach |
| Apicurio Registry     | βœ… Yes, Automatic   |                           |
| Karapace              | βœ… Yes, Automatic   |                           |

β˜• Tools/Services

| Name                    | Wire Format Support | Special Considerations      |
| ----------------------- | ------------------- | --------------------------- |
| Kafka CLI               | ⚠️ Yes, via Addon   | Add Confluent serializers   |
| kcat CLI / kafkacat CLI | ❌ No, Manual only  | Supports working with bytes |
| Strimzi Bridge REST API | ❌ No, Manual only  |                             |
| Strimzi Connect         | βœ… Yes, Automatic   |                             |
| AKHQ                    | βœ… Yes, Automatic   |                             |
| Kafka UI / Kafbat UI    | βœ… Yes, Automatic   |                             |
| Vector                  | ❌ No, Manual only  | Avoid value transformations |

πŸ“ Note: The list assumes clients are using the correct serializers/converters to integrate with Kafka and the schema registry.


Manually handling the wire format

This section puts theory into practice by showing how a native or REST-based Kafka client can produce or consume wire-formatted messages without directly integrating with a compatible schema registry. The goal is to avoid breaking the other clients in the data exchange that do use one. The examples below are framework-agnostic and generic enough to apply to any type of client, though some adaptation will be needed for the target implementation.

Producer code snippet

The code snippet below shows how a producer can manually prepend the required wire format bytes to an Avro-based payload before sending it to a Kafka topic.

Greeting greeting = new Greeting("Hello World!");

...

// Values from earlier in this example: the wire format indicator and the
// registry-assigned schema ID.
byte magicByte = 0x00;
int schemaId = 1;

// Prepend the wire format.
ByteBuffer buffer = ByteBuffer.allocate(1 + 4 + avroPayload.length);
buffer.put(magicByte);
buffer.putInt(schemaId);
buffer.put(avroPayload);

// Ready for sending as a native-based Kafka client.
byte[] message = buffer.array();
// Ready for sending as a REST-based Kafka client.
String base64Message = Base64.getEncoder().encodeToString(message);

...

System.out.println("Avro Payload Length: " + avroPayload.length);
System.out.println("Message Length: " + message.length);
System.out.println("Message (base64): " + base64Message);
System.out.println("Message (hex): " + hexMessage);

πŸ€– View full code on GitHub

πŸ”³ Output

Avro Payload Length: 13
Message Length: 18
Message (base64): AAAAAAEYSGVsbG8gV29ybGQh
Message (hex): 00 00 00 00 01 18 48 65 6c 6c 6f 20 57 6f 72 6c 64 21

The actual sending of the message will depend on the type of producer client being used. For example, a native Kafka client will wrap the message bytes using ProducerRecord<String, byte[]> record = new ProducerRecord<>("my-topic", message); before using producer.send(record); to send it. On the other hand, a REST-based client will use the base64 encoded version of the message to POST it via the endpoint /topics/{topic-name} using the Content-Type: application/vnd.kafka.binary.v2+json header.
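
As a rough sketch of that REST-based call using Java's built-in HTTP client (the proxy URL and topic name are illustrative), the request could look like this:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

...

// Wrap the base64-encoded message in the v2 binary envelope.
String body = "{\"records\":[{\"value\":\"" + base64Message + "\"}]}";

HttpRequest request = HttpRequest.newBuilder()
    .uri(URI.create("http://localhost:8082/topics/my-topic")) // Illustrative endpoint.
    .header("Content-Type", "application/vnd.kafka.binary.v2+json")
    .POST(HttpRequest.BodyPublishers.ofString(body))
    .build();

// Throws IOException/InterruptedException; handle as appropriate.
HttpResponse<String> response = HttpClient.newHttpClient()
    .send(request, HttpResponse.BodyHandlers.ofString());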

Consumer code snippet

Similar to the producer, this consumer code snippet reverses the wire format encoding process to extract the payload so that it can be deserialized.

byte[] message = Base64.getDecoder().decode("AAAAAAEYSGVsbG8gV29ybGQh");

ByteBuffer buffer = ByteBuffer.wrap(message);
byte magicByte = buffer.get();   // Position goes from 0 -> 1.
int schemaId = buffer.getInt();  // Reads bytes [1..4], position goes from 1 -> 5.

int remainingBytes = buffer.remaining();
byte[] avroPayload = new byte[remainingBytes];
buffer.get(avroPayload); // Copies the remaining bytes into the array.

...

Greeting greeting = reader.read(null, decoder);

System.out.println("Magic: " + magicByte);
System.out.println("Schema ID: " + schemaId);
System.out.println("Payload [Message Field]: " + greeting.getMessage());

πŸ€– View full code on GitHub

πŸ”³ Output

Magic: 0
Schema ID: 1
Payload [Message Field]: Hello World!
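
For completeness, the elided reader and decoder setup could look like the following sketch, assuming an Avro-generated Greeting class and the Apache Avro Java library:

import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.specific.SpecificDatumReader;

...

DatumReader<Greeting> reader = new SpecificDatumReader<>(Greeting.class);
BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(avroPayload, null);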

In practice, a native consumer will use an annotation like @KafkaListener(topics = "my-topic") on a method like public void consume(ConsumerRecord<String, byte[]> record) {} to consume messages. REST-based clients will need to poll the /consumers/{consumer-group}/instances/{instance}/records endpoint with GET, using the Accept: application/vnd.kafka.binary.v2+json header, to consume messages as base64-encoded data.


Best practices & tips

Ideally, Kafka clients should use libraries that natively integrate with Confluent's Schema Registry or compatible alternatives when exchanging schema-backed data. If this is not possible, then the following list can help avoid some pains when developing solutions that do not have this integration as an option:

  • Prefer schema-aware producers in production where possible, as this will be the safest option for consumers.
  • Consumers should check for the magic byte (message.length > 5 && message[0] == 0x00) to accept or reject a message. Alternatively, use this check to accept both wire-formatted messages and ones that are not, since most of the code for processing a message is the same.
  • Consumers should also consider using the schema ID for manually validating a message after requesting the schema with a schema registry client.
  • For native clients, ensure not to accidentally use Confluent's KafkaAvro*, KafkaJsonSchema*, or KafkaProtobuf* based Serializers/Deserializers in clients not using the schema registry, since these libraries depend on one. Instead, use the ByteArraySerializer and ByteArrayDeserializer for this scenario.
  • For native clients, be explicit about what serializers/deserializers are being used. Frameworks like Quarkus support automatic selection, however, it may default to a String-based one for these types of messages.
  • Handle the manual serialization/deserialization logic in a separate class using standard interfaces like Deserializer<T> to make it more reusable and portable.
  • If there are limitations or complexities in handling the wire format manually, create a bridging application or adapter to handle this task on behalf of the clients that need it. This should be preferred over requiring all clients to disable their schema registry integration.

Conclusion

In conclusion, understanding Confluent's wire format simplifies producing, consuming, and debugging schema-backed messages even when a client does not integrate with a schema registry. Remember the role each component plays in a data exchange: Kafka transports the bytes, the Schema Registry defines how those bytes are structured, and serializers/deserializers implement the wire format convention. Use byte-level tools and simple checks (magic byte and schema ID) to validate and diagnose messages across heterogeneous clients. Using clients that natively integrate with the schema registry is ideal, but when this is not possible, prepend or strip the wire format header consistently to preserve compatibility. Applying these practices will go a long way towards reducing schema-related surprises that may creep into the solution later on.


