Laxman Patel

How Data is Stored in Kafka: JSON vs Avro Explained

If you’re new to Kafka, you’ve probably asked yourself:

“When a producer sends a record, what does Kafka actually store on disk? Is it JSON? Avro? Hex? Bytes??”

I recently went down this rabbit hole myself — so here’s a breakdown of what really happens behind the scenes.

First Principles: Kafka Stores Bytes

Let’s clear the air first:

Kafka doesn’t know or care about JSON, Avro, Protobuf, POJOs, or unicorns 🦄. It only deals in byte arrays (byte[]).

  • Producer sends: byte[] key, byte[] value
  • Broker writes to log: byte[]
  • Consumer reads: byte[]

Everything else (JSON, Avro, Protobuf, String) is just a serialization format layered on top.
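
To make that concrete, here's a minimal producer sketch using the ByteArraySerializer that ships with the Kafka Java client (the broker address and topic name are placeholders):

import java.nio.charset.StandardCharsets;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class RawBytesProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        // Hand Kafka byte[] on both key and value; no format implied.
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");

        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            byte[] value = "hello".getBytes(StandardCharsets.UTF_8);
            // The broker never inspects these bytes; it just appends them to its log.
            producer.send(new ProducerRecord<>("demo-topic", value));
        }
    }
}

Swap in a different serializer class and the same send() call ships JSON, Avro, or anything else; the broker-side code path doesn't change.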


Serialization vs Encoding

Two words that get thrown around a lot:

  • Serialization = turning an in-memory object (like a Java POJO) into a storable/transmittable format.
  • Encoding = mapping characters into bytes (e.g., UTF-8 for text).

👉 JSON uses both: serialize the object to text, then encode that text as UTF-8 bytes. Avro serializes straight to bytes, with no separate text-encoding step.
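
To see the encoding half in isolation, here's a throwaway Java sketch:

import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        // Encoding only: map characters to bytes (UTF-8 gives one byte per ASCII char).
        for (byte b : "Lucky".getBytes(StandardCharsets.UTF_8)) {
            System.out.printf("0x%02X ", b); // 0x4C 0x75 0x63 0x6B 0x79
        }
        // Serialization is the step before this: flattening an object's fields
        // (name, age, ...) into some format, e.g. JSON text or Avro binary.
    }
}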


How Data Is Actually Stored

(Diagram: JSON vs Avro paths into Kafka)

Case 1: JSON in Kafka

Imagine a Java POJO: User{name="Lucky", age=30}

Steps

  • Serialize to JSON string (one compact line, which is what the hex below shows)

{"name":"Lucky","age":30}
  • Encode string in UTF-8 → bytes

[0x7B, 0x22, 0x6E, 0x61, ...]

(In hex: 7B 22 6E 61 6D 65 22 3A 22 4C ..., which spells out the start of {"name":"L)

  • Producer sends the above byte[] to Kafka.
  • Kafka stores raw bytes in its log

When you run xxd on a Kafka log segment file, you’ll spot that hex sequence embedded in the record batch.
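
Here's a quick sketch of those steps with Jackson's ObjectMapper (from jackson-databind), whose writeValueAsBytes does the serialize-then-UTF-8-encode in one call; the User class is the hypothetical POJO from above:

import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonToBytes {
    // Hypothetical POJO matching the example above.
    public static class User {
        public String name = "Lucky";
        public int age = 30;
    }

    public static void main(String[] args) throws Exception {
        // Serialize to JSON text and UTF-8 encode it in one step.
        byte[] payload = new ObjectMapper().writeValueAsBytes(new User());
        for (byte b : payload) {
            System.out.printf("%02X ", b); // 7B 22 6E 61 6D 65 22 3A 22 4C ...
        }
        // These are exactly the bytes a producer would hand to Kafka.
    }
}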

👉️ Takeaway

JSON → text → UTF-8 encoding → bytes → Kafka.
It’s human-readable but bulky and slower to parse.

Case 2: Avro in Kafka

Same POJO: User{name="Lucky", age=30}

Steps

  • Serialize directly to Avro binary format

    • Field values are encoded as compact bytes following the Avro schema; field names live in the schema, not the payload.
    • Example output (simplified): [0x0A, 0x4C, 0x75, 0x63, 0x6B, 0x79, 0x3C], where 0x0A is the zigzag-encoded string length 5, the next five bytes are the UTF-8 text of "Lucky", and 0x3C is the zigzag-encoded 30.
  • Producer sends byte[] to Kafka

  • Kafka stores raw bytes in its log

No intermediate “string” or UTF-8 step.
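
Here's a sketch using the Apache Avro library's GenericRecord API, so you can print those seven bytes yourself:

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroToBytes {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
              + "{\"name\":\"name\",\"type\":\"string\"},"
              + "{\"name\":\"age\",\"type\":\"int\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Lucky");
        user.put("age", 30);

        // Straight to binary: field names stay in the schema, not the payload.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(user, encoder);
        encoder.flush();

        for (byte b : out.toByteArray()) {
            System.out.printf("%02X ", b); // 0A 4C 75 63 6B 79 3C
        }
    }
}

(One caveat: if you use Confluent's Avro serializer with a schema registry, it prepends a magic byte and a 4-byte schema ID, so a real log dump starts with five extra bytes. That's the Confluent wire format, not core Kafka.)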

👉️ Takeaway

Avro → compact binary → bytes → Kafka.
It’s efficient, schema-driven, and faster to deserialize.


JSON vs Avro: Side-by-Side

Step                 JSON                              Avro
Serialization        POJO → JSON text                  POJO → Avro binary
Encoding             UTF-8 needed (text → bytes)       Not needed (already binary)
Size                 Larger (e.g., {"name":"Lucky"})   Smaller, compact
Readability          Human-readable                    Not human-readable
Schema enforcement   Optional                          Strict (via Avro schema)

Wrap-Up

When people ask “Does Kafka store JSON or Avro?”, the real answer is:

👉 Neither. Kafka stores raw bytes.
👉 JSON/Avro/Protobuf are just contracts between producer and consumer.
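
Here's the consumer side of that contract, sketched with the stock ByteArrayDeserializer (broker address, topic, and group id are placeholders). Notice that deciding the bytes are UTF-8 JSON happens entirely in our code, not in Kafka:

import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ContractConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "demo-group");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("demo-topic"));
            for (ConsumerRecord<byte[], byte[]> record :
                    consumer.poll(Duration.ofSeconds(1))) {
                // Kafka handed us bytes; we decide they are UTF-8 JSON.
                System.out.println(new String(record.value(), StandardCharsets.UTF_8));
            }
        }
    }
}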

So next time you see a Kafka hex dump like:

7B 22 6E 61 6D 65 22 3A 22 4C 75 63 6B 79 22 7D

…you’ll know: that’s just your data, serialized then encoded, resting in Kafka’s logs waiting for the next consumer.
