Elad Eldor

Posted on • Originally published at Medium

Tune Compression Before Changing the Schema

Most JSON → Avro migrations are justified by bandwidth savings. In practice, tuning batching and compression often captures most of that saving before the schema is touched.

Migrating your data schema from JSON to Avro or Protobuf is a common way to reduce the data transfer bill. That is sound in theory, but in practice the compression codec determines the ROI of any payload optimization. By fixing the compression codec before changing the schema, you reduce the bytes flowing through the produce path, replication, and every consumer fetch that crosses AZ boundaries.

Why batching changes compression

zstd is much more effective than snappy or lz4 at exploiting repetition across a batch. With zstd, field names repeated across many messages are effectively reduced to a small set of dictionary tokens. This difference drives most of the observed gains.

Batch depth matters as much as the codec choice. On a generic structured JSON message shape, increasing batch size alone, with no codec or schema change, can significantly reduce the compressed payload size.

At batch size=1, all three codecs perform similarly. At batch size=1,000, zstd-1 is often ~2x smaller than snappy on the same data.

This holds across real production message shapes. In benchmarks with 1,000-message batches across three datasets, the “batch win” column on the right shows how much zstd-1 improves from batch size 1 to 1,000 across the datasets.

Fix the batching first. The batch win grows with message size because larger messages tend to contain more repeated structure for zstd to collapse into shared dictionary entries. Without sufficient batch depth, the codec switch won’t deliver its full effect.

This effect shows up in any system that batches data (Kafka, ingestion pipelines, bulk HTTP APIs). In request/response RPC paths where messages are sent individually, the effect is much smaller.
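The batch-depth effect can be reproduced with a small experiment. The sketch below is an illustration under stated assumptions, not the benchmark behind the numbers above: it uses `zlib` from the Python standard library as a stand-in for zstd, and the message shape produced by `make_messages` is hypothetical. Absolute sizes will differ from a real zstd-1 run, but the trend (less compressed output per message as the batch grows) should be visible.

```python
import json
import random
import zlib


def make_messages(n, seed=0):
    """Build n hypothetical JSON event messages that share the same field names."""
    rng = random.Random(seed)
    return [
        json.dumps({
            "user_id": "%032x" % rng.getrandbits(128),
            "event_type": rng.choice(["click", "view", "purchase", "scroll"]),
            "amount_cents": rng.randrange(1_000_000),
            "timestamp_ms": 1_700_000_000_000 + i * 37,
        }).encode("utf-8")
        for i in range(n)
    ]


def compressed_size_per_message(messages, batch_size):
    """Compress messages in batches; return average compressed bytes per message."""
    total = 0
    for start in range(0, len(messages), batch_size):
        batch = b"\n".join(messages[start:start + batch_size])
        # zlib is only a stand-in here; the article's measurements use zstd.
        total += len(zlib.compress(batch))
    return total / len(messages)


messages = make_messages(1000)
raw = sum(len(m) for m in messages) / len(messages)
for batch_size in (1, 10, 100, 1000):
    size = compressed_size_per_message(messages, batch_size)
    print(f"batch={batch_size:>4}  raw={raw:.0f} B/msg  compressed={size:.1f} B/msg")
```

At batch size 1 the compressor sees each field name only once, so there is nothing to deduplicate; at larger batch sizes the same names recur and are encoded as short references to earlier data.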

Tuning the Zstd Level

Kafka deployments commonly run with zstd level 3 by default. I benchmarked levels 1–9 on a 1,000-message structured JSON batch.

Level 3 produces output 1% larger than level 1 and runs slower. Level 9 produces output that is still larger than level 1 while consuming 5x more CPU time. Higher levels search for complex patterns that don’t exist in regular JSON schemas, while level 1 finds the dominant patterns immediately.
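A level sweep like this is easy to repeat on your own payloads. The harness below is a minimal sketch, assuming any compressor that can be called as `compress(data, level)`; it defaults to `zlib.compress` only because that ships with Python, so its output says nothing about zstd itself. To test zstd, you would pass in a thin wrapper around whichever zstd binding you use. The `sweep_levels` name and the sample batch are hypothetical.

```python
import json
import time
import zlib


def sweep_levels(payload, levels, compress=zlib.compress):
    """Return a list of (level, compressed_bytes, cpu_seconds), one per level."""
    results = []
    for level in levels:
        start = time.process_time()
        out = compress(payload, level)
        elapsed = time.process_time() - start
        results.append((level, len(out), elapsed))
    return results


# Hypothetical 1,000-message batch of structured JSON.
batch = b"\n".join(
    json.dumps(
        {"id": i, "status": "ok", "region": "eu-west-1", "latency_ms": i % 250}
    ).encode("utf-8")
    for i in range(1000)
)

results = sweep_levels(batch, range(1, 10))
baseline = results[0][1]  # compressed size at level 1
for level, size, cpu in results:
    print(
        f"level={level}  size={size} B  "
        f"relative to level 1: {size / baseline:.2%}  cpu={cpu * 1e3:.2f} ms"
    )
```

A single call per level is noisy for CPU timing; for a real decision, repeat each level many times on production-shaped batches and compare medians.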

What Avro actually saves on a well-batched topic

The wire-savings argument for Avro is usually calculated against uncompressed data, which is where the 63.5% figure comes from. The benchmark below compares JSON and Avro across compression modes with a 1,000-message batch size:

On uncompressed data, Avro is ~60% smaller than JSON. However, on a well-batched zstd-1 topic, the saving drops to ~20% because zstd already compresses most field name repetition into dictionary tokens.
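The shrinking saving can be illustrated without an Avro library. The sketch below rests on two loudly stated assumptions: a positional JSON array (values only, in schema order) is used as a rough stand-in for a schema-based format, and `zlib` stands in for zstd. It is not Avro's binary encoding, and the record shape is hypothetical, so the percentages will not match the benchmark. It only shows the direction of the effect: removing field names saves far less once a batch compressor has already deduplicated them.

```python
import json
import random
import zlib

FIELDS = ("user_id", "event_type", "amount_cents", "timestamp_ms")


def make_records(n, seed=0):
    """Build n hypothetical records as tuples in FIELDS order."""
    rng = random.Random(seed)
    return [
        (
            "%032x" % rng.getrandbits(128),
            rng.choice(["click", "view", "purchase", "scroll"]),
            rng.randrange(1_000_000),
            1_700_000_000_000 + rng.randrange(10_000_000),
        )
        for _ in range(n)
    ]


def encode_keyed(records):
    """JSON objects: every message repeats every field name."""
    return b"\n".join(
        json.dumps(dict(zip(FIELDS, r))).encode("utf-8") for r in records
    )


def encode_positional(records):
    """Values only, in schema order (stand-in for a schema-based format)."""
    return b"\n".join(json.dumps(list(r)).encode("utf-8") for r in records)


def saving(keyed, positional):
    """Fraction of bytes saved by the positional encoding."""
    return 1 - len(positional) / len(keyed)


records = make_records(1000)
keyed, positional = encode_keyed(records), encode_positional(records)

raw_saving = saving(keyed, positional)
compressed_saving = saving(zlib.compress(keyed), zlib.compress(positional))

print(f"saving on uncompressed batch: {raw_saving:.1%}")
print(f"saving on compressed batch:   {compressed_saving:.1%}")
```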

Decompression latency is also important

While a batch is compressed once at the producer, it is decompressed once per consumer group. On a topic with 5x fan-out, the decompression cost is paid five times for every batch that was compressed once. For low-latency systems, consider the end-to-end latency: compression time + wire delay + decompression time. Also consider the CPU time consumers spend decompressing, since a codec that reduces the amount of data sent over the wire can increase CPU usage on every consumer that reads it.
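The two costs above are simple sums, written out here as a small worked example. The `CodecCost` type and all numbers are made up for illustration; substitute measurements from your own producers and consumers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodecCost:
    """Per-batch timings in milliseconds (hypothetical values)."""
    compress_ms: float
    decompress_ms: float
    wire_ms: float


def end_to_end_latency_ms(cost):
    """Latency seen by one consumer: compress + wire + decompress."""
    return cost.compress_ms + cost.wire_ms + cost.decompress_ms


def total_cpu_ms(cost, consumer_groups):
    """CPU spent on one batch: compressed once, decompressed once per group."""
    return cost.compress_ms + consumer_groups * cost.decompress_ms


example = CodecCost(compress_ms=2.0, decompress_ms=0.5, wire_ms=4.0)
print(end_to_end_latency_ms(example))            # 6.5
print(total_cpu_ms(example, consumer_groups=5))  # 4.5
```

With five consumer groups, decompression in this example costs more CPU in total (2.5 ms) than compression (2.0 ms), even though a single decompression is four times cheaper.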

The recommended tuning order:

  • Increase the batch size to improve the compression ratio: batch.size, linger.ms
  • Tune the fetch size to increase consumer throughput: fetch.min.bytes, fetch.max.wait.ms
  • Tune your compression codec: compression.type=zstd, compression.zstd.level=1

After these changes, check the data transfer cost and the storage usage on your Kafka brokers. Both should drop before any schema changes.
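As a concrete starting point, the settings from the tuning order can be collected into producer and consumer configuration. The sketch below keeps them in plain Python dictionaries and renders them as `key=value` properties lines. The property names are the ones listed above; the numeric values are illustrative assumptions rather than recommendations, and `to_properties` is a hypothetical helper. Whether `compression.zstd.level` is accepted depends on your Kafka version, so check your client and broker documentation before relying on it.

```python
# Illustrative values only; tune them against your own latency budget.
producer_config = {
    "batch.size": 262144,         # bytes per partition batch (256 KiB)
    "linger.ms": 50,              # wait up to 50 ms for a batch to fill
    "compression.type": "zstd",
    "compression.zstd.level": 1,  # availability depends on Kafka version
}

consumer_config = {
    "fetch.min.bytes": 1048576,   # wait for at least 1 MiB per fetch...
    "fetch.max.wait.ms": 500,     # ...but no longer than 500 ms
}


def to_properties(config):
    """Render a config dict as Java-style properties lines."""
    return "\n".join(f"{key}={value}" for key, value in config.items())


print("# producer")
print(to_properties(producer_config))
print("# consumer")
print(to_properties(consumer_config))
```

Larger `linger.ms` and `fetch.min.bytes` values trade latency for batch depth, so the right numbers depend on how much delay each pipeline can tolerate.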

When you change the schema, prioritize Kafka topics by throughput × fan-out. A 10-consumer topic at moderate throughput outranks a massive topic with a single consumer.
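The ranking rule is a one-line sort. The topic names and numbers below are hypothetical and chosen to mirror the example in the text: a moderate-throughput topic with ten consumer groups against a much larger topic with a single consumer.

```python
# Hypothetical topics: throughput in MB/s and number of consumer groups.
topics = [
    {"name": "orders", "throughput_mb_s": 50, "consumer_groups": 10},
    {"name": "clickstream", "throughput_mb_s": 300, "consumer_groups": 1},
    {"name": "audit", "throughput_mb_s": 5, "consumer_groups": 3},
]


def migration_priority(topic):
    """Score a topic by throughput x fan-out."""
    return topic["throughput_mb_s"] * topic["consumer_groups"]


ranked = sorted(topics, key=migration_priority, reverse=True)
for topic in ranked:
    print(f"{topic['name']:<12} score={migration_priority(topic)}")
```

Here `orders` scores 500 and ranks ahead of `clickstream` at 300, because every byte produced to it is fetched ten times.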

Part 4 of the Kafka Network Cost series. Part 1: cross-AZ topology and fan-out attribution. Part 2: Kafka’s Real Compression Problem Is Batch Depth. Part 3: The Cheapest Kafka Consumer Is One That Doesn’t Read From Kafka.
