If you’re new to Kafka, you’ve probably asked yourself:
“When a producer sends a record, what does Kafka actually store on disk? Is it JSON? Avro? Hex? Bytes??”
I recently went down this rabbit hole myself — so here’s a breakdown of what really happens behind the scenes.
First Principles: Kafka Stores Bytes
Let’s clear the air first:
Kafka doesn’t know or care about JSON, Avro, Protobuf, POJOs, or unicorns 🦄. It only deals in byte arrays (`byte[]`).
- Producer sends: `byte[] key`, `byte[] value`
- Broker writes to the log: `byte[]`
- Consumer reads: `byte[]`
Everything else (JSON, Avro, Protobuf, String) is just a serialization format layered on top.
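To make that concrete, here’s a minimal producer sketch (assuming a local broker at `localhost:9092` and a hypothetical `users` topic) that hands Kafka nothing but byte arrays:

```java
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class RawBytesProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer", ByteArraySerializer.class.getName());
        props.put("value.serializer", ByteArraySerializer.class.getName());

        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            byte[] key = "user-1".getBytes(StandardCharsets.UTF_8);
            byte[] value = "{\"name\":\"Lucky\",\"age\":30}".getBytes(StandardCharsets.UTF_8);
            // Kafka never inspects these bytes; the broker appends them to the log as-is.
            producer.send(new ProducerRecord<>("users", key, value));
        }
    }
}
```

Swap `ByteArraySerializer` for `StringSerializer` or Confluent’s `KafkaAvroSerializer` and you’ve only changed who turns objects into bytes; the broker still sees `byte[]`.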
Serialization vs Encoding
Two words that get thrown around a lot:
- Serialization = turning an in-memory object (like a Java POJO) into a storable/transmittable format.
- Encoding = mapping characters into bytes (e.g., UTF-8 for text).
👉 JSON uses both: serialize to text, then encode the text as UTF-8 bytes. Avro serializes straight to binary, with no separate text-encoding step.
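A tiny sketch of the encoding half in isolation; no object structure involved, just characters to bytes:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingDemo {
    public static void main(String[] args) {
        // Encoding only: map characters to bytes via UTF-8.
        byte[] bytes = "Lucky".getBytes(StandardCharsets.UTF_8);
        // Prints [76, 117, 99, 107, 121], i.e. 0x4C 0x75 0x63 0x6B 0x79
        System.out.println(Arrays.toString(bytes));
    }
}
```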
How Is Data Actually Stored?
Case 1: JSON in Kafka
Imagine a Java POJO: User{name="Lucky", age=30}
Steps
- Serialize to a JSON string: `{"name":"Lucky","age":30}`
- Encode the string as UTF-8 → bytes: `[0x7B, 0x22, 0x6E, 0x61, ...]` (the same bytes as a hex dump: `7B 22 6E 61 6D 65 22 3A 22 4C ...`)
- Producer sends that `byte[]` to Kafka.
- Kafka stores the raw bytes in its log.
When you run `xxd` on the Kafka log segment file, you’ll see that hex dump inside Kafka’s record-batch framing.
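In code, the two steps look like this; a sketch assuming Jackson (`jackson-databind` 2.12+ for record support) is on the classpath, with the `User` record purely illustrative:

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import java.nio.charset.StandardCharsets;

public class JsonSerializationDemo {
    record User(String name, int age) {}

    public static void main(String[] args) throws Exception {
        // Step 1: serialize the POJO to a JSON string.
        String json = new ObjectMapper().writeValueAsString(new User("Lucky", 30));
        // Step 2: encode the string as UTF-8; this byte[] is what the producer sends.
        byte[] value = json.getBytes(StandardCharsets.UTF_8);
        for (byte b : value) {
            System.out.printf("%02X ", b & 0xFF); // 7B 22 6E 61 6D 65 22 3A ...
        }
    }
}
```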
👉️ Takeaway
JSON → text → UTF-8 encoding → bytes → Kafka.
It’s human-readable but bulky and slower to parse.
Case 2: Avro in Kafka
Same POJO: User{name="Lucky", age=30}
Steps
- Serialize directly to Avro binary format: field values are encoded as compact bytes following the Avro schema.
- Example output (simplified): `[0x0A, 0x4C, 0x75, 0x63, 0x6B, 0x79, 0x3C]` (`0x0A` is the zigzag-encoded length of "Lucky", then its five UTF-8 bytes, then `0x3C` is the zigzag-encoded age 30).
- Producer sends the `byte[]` to Kafka.
- Kafka stores the raw bytes in its log.
There is no intermediate “string” step for the record (string fields are still UTF-8 bytes inside the binary encoding, but the record as a whole is never turned into text).
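And in code; a sketch using Avro’s `GenericRecord` API, assuming `org.apache.avro:avro` is on the classpath:

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroSerializationDemo {
    public static void main(String[] args) throws Exception {
        String schemaJson = "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Lucky");
        user.put("age", 30);

        // Serialize straight to Avro binary: no JSON text representation in between.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(user, encoder);
        encoder.flush();

        for (byte b : out.toByteArray()) {
            System.out.printf("%02X ", b & 0xFF); // 0A 4C 75 63 6B 79 3C
        }
    }
}
```

Note how the output carries no field names: a reader needs the same schema to make sense of those seven bytes, which is why Avro is usually paired with a schema registry.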
👉️ Takeaway
Avro → compact binary → bytes → Kafka.
It’s efficient, schema-driven, and faster to deserialize.
JSON vs Avro: Side-by-Side
| Step | JSON | Avro |
|---|---|---|
| Serialization | POJO → JSON text | POJO → Avro binary |
| Encoding | UTF-8 needed (text → bytes) | Not needed (already binary) |
| Size | Larger (e.g., `{"name":"Lucky"}`) | Smaller, compact |
| Readability | Human-readable | Not human-readable |
| Schema enforcement | Optional | Strict (via Avro schema) |
Wrap-Up
When people ask “Does Kafka store JSON or Avro?”, the real answer is:
👉 Neither. Kafka stores raw bytes.
👉 JSON/Avro/Protobuf are just contracts between producer and consumer.
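The consumer side of that contract, mirroring the producer sketch above (same assumed broker and topic):

```java
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class RawBytesConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("group.id", "demo");
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer", ByteArrayDeserializer.class.getName());
        props.put("value.deserializer", ByteArrayDeserializer.class.getName());

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("users"));
            for (ConsumerRecord<byte[], byte[]> rec : consumer.poll(Duration.ofSeconds(5))) {
                // Kafka hands back the same opaque bytes; decoding them as UTF-8 JSON
                // is our choice, i.e. the contract with the producer, not Kafka's.
                System.out.println(new String(rec.value(), StandardCharsets.UTF_8));
            }
        }
    }
}
```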
So next time you see a Kafka hex dump like:
`7B 22 6E 61 6D 65 22 3A 22 4C 75 63 6B 79 22 7D`
(which decodes to `{"name":"Lucky"}`), you’ll know: that’s just your data, serialized and encoded, resting in Kafka’s log waiting for the next consumer.