If you’re new to Kafka, you’ve probably asked yourself:
“When a producer sends a record, what does Kafka actually store on disk? Is it JSON? Avro? Hex? Bytes??”
I recently went down this rabbit hole myself — so here’s a breakdown of what really happens behind the scenes.
First Principles: Kafka Stores Bytes
Let’s clear the air first:
Kafka doesn’t know or care about JSON, Avro, Protobuf, POJOs, or unicorns 🦄. It only deals in byte arrays (`byte[]`).
- Producer sends: `byte[] key`, `byte[] value`
- Broker writes to the log: `byte[]`
- Consumer reads: `byte[]`
Everything else (JSON, Avro, Protobuf, String) is just a serialization format layered on top.
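To make that concrete, here’s a minimal producer sketch (assuming a local broker at `localhost:9092` and a hypothetical `users` topic) that hands Kafka nothing but byte arrays:

```java
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class RawBytesProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer", ByteArraySerializer.class.getName());
        props.put("value.serializer", ByteArraySerializer.class.getName());

        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            byte[] key = "user-1".getBytes(StandardCharsets.UTF_8);
            byte[] value = "{\"name\":\"Lucky\",\"age\":30}".getBytes(StandardCharsets.UTF_8);
            // Kafka never inspects these bytes; the broker appends them to the log as-is.
            producer.send(new ProducerRecord<>("users", key, value));
        }
    }
}
```

Swap `ByteArraySerializer` for `StringSerializer` or Confluent’s `KafkaAvroSerializer` and you’ve only changed who turns objects into bytes; the broker still sees `byte[]`.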
Serialization vs Encoding
Two words that get thrown around a lot:
- Serialization = turning an in-memory object (like a Java POJO) into a storable/transmittable format.
- Encoding = mapping characters into bytes (e.g., UTF-8 for text).
👉 JSON uses both: serialize to text, then encode the text as UTF-8 bytes. Avro serializes straight to binary, with no separate text-encoding step.
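A tiny sketch of the encoding half in isolation; no object structure involved, just characters to bytes:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingDemo {
    public static void main(String[] args) {
        // Encoding only: map characters to bytes via UTF-8.
        byte[] bytes = "Lucky".getBytes(StandardCharsets.UTF_8);
        // Prints [76, 117, 99, 107, 121], i.e. 0x4C 0x75 0x63 0x6B 0x79
        System.out.println(Arrays.toString(bytes));
    }
}
```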
How Is Data Actually Stored?
Case 1: JSON in Kafka
Imagine a Java POJO: User{name="Lucky", age=30}
Steps
- Serialize to a JSON string: `{"name":"Lucky","age":30}`
- Encode the string as UTF-8 → bytes: `[0x7B, 0x22, 0x6E, 0x61, ...]` (the same bytes as a hex dump: `7B 22 6E 61 6D 65 22 3A 22 4C ...`)
- Producer sends that `byte[]` to Kafka.
- Kafka stores the raw bytes in its log.
When you run `xxd` on the Kafka log segment file, you’ll see that hex dump inside Kafka’s record-batch framing.
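In code, the two steps look like this; a sketch assuming Jackson (`jackson-databind` 2.12+ for record support) is on the classpath, with the `User` record purely illustrative:

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import java.nio.charset.StandardCharsets;

public class JsonSerializationDemo {
    record User(String name, int age) {}

    public static void main(String[] args) throws Exception {
        // Step 1: serialize the POJO to a JSON string.
        String json = new ObjectMapper().writeValueAsString(new User("Lucky", 30));
        // Step 2: encode the string as UTF-8; this byte[] is what the producer sends.
        byte[] value = json.getBytes(StandardCharsets.UTF_8);
        for (byte b : value) {
            System.out.printf("%02X ", b & 0xFF); // 7B 22 6E 61 6D 65 22 3A ...
        }
    }
}
```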
👉️ Takeaway
JSON → text → UTF-8 encoding → bytes → Kafka.
It’s human-readable but bulky and slower to parse.
Case 2: Avro in Kafka
Same POJO: User{name="Lucky", age=30}
Steps
- Serialize directly to Avro binary format: field values are encoded as compact bytes following the Avro schema.
- Example output (simplified): `[0x0A, 0x4C, 0x75, 0x63, 0x6B, 0x79, 0x3C]` (`0x0A` is the zigzag-encoded length of "Lucky", then its five UTF-8 bytes, then `0x3C` is the zigzag-encoded age 30).
- Producer sends the `byte[]` to Kafka.
- Kafka stores the raw bytes in its log.
There is no intermediate “string” step for the record (string fields are still UTF-8 bytes inside the binary encoding, but the record as a whole is never turned into text).
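And in code; a sketch using Avro’s `GenericRecord` API, assuming `org.apache.avro:avro` is on the classpath:

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroSerializationDemo {
    public static void main(String[] args) throws Exception {
        String schemaJson = "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Lucky");
        user.put("age", 30);

        // Serialize straight to Avro binary: no JSON text representation in between.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(user, encoder);
        encoder.flush();

        for (byte b : out.toByteArray()) {
            System.out.printf("%02X ", b & 0xFF); // 0A 4C 75 63 6B 79 3C
        }
    }
}
```

Note how the output carries no field names: a reader needs the same schema to make sense of those seven bytes, which is why Avro is usually paired with a schema registry.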
👉️ Takeaway
Avro → compact binary → bytes → Kafka.
It’s efficient, schema-driven, and faster to deserialize.
JSON vs Avro: Side-by-Side
| Step | JSON | Avro |
|---|---|---|
| Serialization | POJO → JSON text | POJO → Avro binary |
| Encoding | UTF-8 needed (text → bytes) | Not needed (already binary) |
| Size | Larger (e.g., `{"name":"Lucky"}`) | Smaller, compact |
| Readability | Human-readable | Not human-readable |
| Schema enforcement | Optional | Strict (via Avro schema) |
Wrap-Up
When people ask “Does Kafka store JSON or Avro?”, the real answer is:
👉 Neither. Kafka stores raw bytes.
👉 JSON/Avro/Protobuf are just contracts between producer and consumer.
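The consumer side of that contract, mirroring the producer sketch above (same assumed broker and topic):

```java
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class RawBytesConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("group.id", "demo");
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer", ByteArrayDeserializer.class.getName());
        props.put("value.deserializer", ByteArrayDeserializer.class.getName());

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("users"));
            for (ConsumerRecord<byte[], byte[]> rec : consumer.poll(Duration.ofSeconds(5))) {
                // Kafka hands back the same opaque bytes; decoding them as UTF-8 JSON
                // is our choice, i.e. the contract with the producer, not Kafka's.
                System.out.println(new String(rec.value(), StandardCharsets.UTF_8));
            }
        }
    }
}
```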
So next time you see a Kafka hex dump like:
`7B 22 6E 61 6D 65 22 3A 22 4C 75 63 6B 79 22 7D`
(which decodes to `{"name":"Lucky"}`), you’ll know: that’s just your data, serialized and encoded, resting in Kafka’s log waiting for the next consumer.