DEV Community: Elad Eldor

Tune Compression Before Changing the Schema

Elad Eldor — Wed, 10 Jun 2026 13:04:22 +0000

Most JSON → Avro migrations are justified by bandwidth savings. In practice, tuning batching and compression often removes most of that gain before the schema is touched.

The usual advice for cutting the data transfer bill is to migrate your payloads from JSON to Avro or Protobuf. That sounds right in theory, but in practice the compression codec determines the ROI of any payload optimization. Fixing the compression codec before changing the schema reduces the bytes propagating through the produce path, replication, and every consumer fetch that crosses an AZ boundary.

Why batching changes compression

zstd is much more effective than snappy and lz4 at exploiting repetition across a batch. With zstd, field names repeated across many messages are effectively reduced to a small set of dictionary tokens. This difference drives most of the observed gains.

Batch depth matters as much as the codec choice. On a generic structured JSON message shape, increasing batch size alone, with no codec or schema change, can significantly reduce the compressed payload size.

At batch size=1, all three codecs perform similarly. At batch size=1,000, zstd-1 is often ~2x smaller than snappy on the same data.
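The batch-depth effect can be reproduced with a small sketch. It is not the benchmark behind the numbers above: it uses Python's standard-library zlib at level 1 as a stand-in for zstd-1, and the message shape and the helper names (make_messages, compressed_bytes_per_message) are made up for illustration. Only the trend across batch sizes is meant to carry over.

```python
import json
import zlib


def make_messages(count):
    """Build `count` structured JSON messages that all share the same field names."""
    return [
        json.dumps({
            "event_id": i,
            "user_id": 1000 + (i * 7919) % 5000,
            "event_type": ["click", "view", "purchase"][i % 3],
            "timestamp_ms": 1_700_000_000_000 + i * 137,
            "region": ["us-east-1", "eu-west-1"][i % 2],
        }).encode("utf-8")
        for i in range(count)
    ]


def compressed_bytes_per_message(messages, batch_size):
    """Compress the messages in batches of `batch_size`; return average compressed bytes per message."""
    total = 0
    for start in range(0, len(messages), batch_size):
        batch = b"\n".join(messages[start:start + batch_size])
        # zlib level 1 is an assumption standing in for zstd-1
        total += len(zlib.compress(batch, 1))
    return total / len(messages)


messages = make_messages(1000)
for depth in (1, 10, 100, 1000):
    print(depth, round(compressed_bytes_per_message(messages, depth), 1))
```

The per-message compressed size should fall as the batch deepens, because the compressor can reference field names it has already seen earlier in the same batch.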

This holds across real production message shapes. In benchmarks with 1,000-message batches across three datasets, the “batch win” column on the right shows how much zstd-1 improves from batch size 1 to 1,000 across the datasets.

Fix the batching first — the batch win grows with message size because larger messages tend to contain more repeated structure for zstd to collapse into shared dictionary entries. Without sufficient batch depth, the codec switch won’t deliver its full effect.

This effect shows up in any system that batches data (Kafka, ingestion pipelines, bulk HTTP APIs). In request/response RPC paths where messages are sent individually, the effect is much smaller.

Tuning the Zstd Level

Kafka deployments commonly run with zstd level 3 by default. I benchmarked on a 1,000-message structured JSON batch across levels 1–9.

Level 3 produces output 1% larger than level 1 and runs slower. Level 9 produces output that is still larger than level 1 while consuming 5x more CPU time. Higher levels search for complex patterns that don’t exist in regular JSON schemas, while level 1 finds the dominant patterns immediately.

What Avro actually saves on a well-batched topic

The wire savings argument for Avro is usually calculated against uncompressed data, and that’s where the 63.5% figure comes from. The benchmark below compares JSON and Avro across compression modes with a 1,000-message batch size:

On uncompressed data, Avro saves ~60% of the data compared to JSON. However on a well-batched zstd-1 topic, the saving drops to ~20% because zstd already compresses most field name repetition into dictionary tokens.
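Why the saving shrinks can be illustrated without an Avro library. The sketch below is a rough stand-in, not the benchmark above: it approximates a schema-based encoding by dropping field names and emitting values in a fixed order, and it uses standard-library zlib at level 1 in place of zstd-1. The record shape and function names are hypothetical.

```python
import json
import zlib

FIELDS = ("event_id", "user_id", "event_type", "timestamp_ms", "region")


def make_records(count):
    """Build `count` records with identical field names and varying values."""
    return [
        {
            "event_id": i,
            "user_id": 1000 + (i * 7919) % 5000,
            "event_type": ["click", "view", "purchase"][i % 3],
            "timestamp_ms": 1_700_000_000_000 + i * 137,
            "region": ["us-east-1", "eu-west-1"][i % 2],
        }
        for i in range(count)
    ]


def encode_with_names(records):
    """JSON objects: every record repeats every field name."""
    return b"\n".join(json.dumps(r).encode("utf-8") for r in records)


def encode_without_names(records):
    """Values only, in FIELDS order (a crude stand-in for a schema-based format)."""
    return b"\n".join(json.dumps([r[f] for f in FIELDS]).encode("utf-8") for r in records)


def saving(baseline, candidate):
    """Fraction of bytes saved by `candidate` relative to `baseline`."""
    return 1 - len(candidate) / len(baseline)


records = make_records(1000)
with_names = encode_with_names(records)
without_names = encode_without_names(records)

raw_saving = saving(with_names, without_names)
compressed_saving = saving(zlib.compress(with_names, 1), zlib.compress(without_names, 1))
print(f"saving on raw bytes:        {raw_saving:.0%}")
print(f"saving on compressed bytes: {compressed_saving:.0%}")
```

On raw bytes, removing field names removes a large share of each record. After batch compression the field names were already cheap, so removing them saves much less.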

Decompression latency is also important

A batch is compressed once at the producer, but it is decompressed once per consumer group. On a topic with 5x fan-out, the decompression cost is paid five times for every batch that was compressed once. For low-latency systems, consider the end-to-end latency: compression time + wire delay + decompression time. Also consider the CPU time consumers spend on decompression, since a codec that reduces bytes on the wire can increase CPU usage on the consumers that decompress them.
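The two quantities in this section can be written down as a small model. This is a sketch only: the function names are hypothetical and the millisecond values are placeholders, not measurements.

```python
def end_to_end_latency_ms(compress_ms, wire_ms, decompress_ms):
    """Latency seen by one consumer: compression + wire delay + decompression."""
    return compress_ms + wire_ms + decompress_ms


def codec_cpu_ms_per_batch(compress_ms, decompress_ms, fan_out):
    """Fleet-wide codec CPU per batch: compressed once, decompressed once per consumer group."""
    return compress_ms + fan_out * decompress_ms


# Placeholder numbers for illustration only
print(end_to_end_latency_ms(compress_ms=2.0, wire_ms=1.0, decompress_ms=0.5))
print(codec_cpu_ms_per_batch(compress_ms=2.0, decompress_ms=0.5, fan_out=5))
```

The second function is the reason decompression speed deserves more weight as fan-out grows: its term is the only one multiplied by the number of consumer groups.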

The recommended tuning order:

Increase the batch size in order to increase compression rate: batch.size, linger.ms
Tune the fetch size in order to increase consumer throughput: fetch.min.bytes, fetch.max.wait.ms
Tune your compression codec: compression.type=zstd, compression.zstd.level=1
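As a rough illustration of the three steps above, the settings can be collected into producer and consumer override maps. The property names come from the list; the numeric values for batch.size, linger.ms, fetch.min.bytes, and fetch.max.wait.ms are illustrative assumptions rather than recommendations, and compression.zstd.level is only honored by client versions that support it.

```python
# Step 1 and step 3: producer-side batching and codec settings
PRODUCER_OVERRIDES = {
    "batch.size": 131072,          # 128 KB, assumed value
    "linger.ms": 20,               # assumed value
    "compression.type": "zstd",
    "compression.zstd.level": 1,
}

# Step 2: consumer-side fetch settings
CONSUMER_OVERRIDES = {
    "fetch.min.bytes": 524288,     # 512 KB, assumed value
    "fetch.max.wait.ms": 500,      # assumed value
}


def to_properties(overrides):
    """Render a config map as key=value lines, one per property."""
    return "\n".join(f"{key}={value}" for key, value in overrides.items())


print(to_properties(PRODUCER_OVERRIDES))
print(to_properties(CONSUMER_OVERRIDES))
```

Start from values like these, then adjust them against the measured batch fill and fetch size on each topic.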

After these changes, check the data transfer cost and the storage usage on your Kafka brokers. Both should drop before any schema changes.

When you change the schema, prioritize Kafka topics by throughput × fan-out. A 10-consumer topic at moderate throughput outranks a massive topic with a single consumer.

Part 4 of the Kafka Network Cost series. Part 1: cross-AZ topology and fan-out attribution. Part 2: Kafka’s Real Compression Problem Is Batch Depth. Part 3: The Cheapest Kafka Consumer Is One That Doesn’t Read From Kafka.

Kafka's Real Compression Problem Is Batch Depth

Elad Eldor — Thu, 21 May 2026 22:09:32 +0000

Kafka compression waste is usually a batch depth problem, not a codec problem. Better batching improves producer compression, which reduces consumer CPU and cross-AZ cost downstream.

In one production deployment, changing batch sizing and linger settings cut the consumer fleet in half and moved compression savings from under 10% to over 50% - with no codec change. The cause wasn't the codec. It was batch depth.

Why batch depth controls what the codec sees

Kafka producers compress batches, not individual messages. The compression codec sees whatever the producer has accumulated by the time it flushes. linger.ms sets how long the producer waits to accumulate records. batch.size caps how large that accumulation can grow.

Both settings are conservative by default. When per-producer throughput is low - because traffic is light, or because it's spread across too many producer instances - the linger window closes before much data has arrived.

That matters because compression ratio is a function of (1) how much data the codec can see at once and (2) how much redundancy exists across that data. A compressor working on a single JSON record finds repetition only within that record. Working on a hundred records from the same schema, it finds the same field names, the same value patterns, and the same structural redundancy repeated across every record.

At shallow batch depth, the compressor has only one record's worth of redundancy to work with; at depth, it gets a qualitatively different input. This batch shape problem doesn't stay at the producer.

Small producer batches create a consumer CPU tax

When producer batches are small, the broker stores small compressed record batches. Consumers fetching from that topic receive small responses, so to get more data they issue more fetch requests to Kafka brokers.

Each fetch request carries fixed overhead: a network round trip, broker-side processing, client-side dispatch, metadata handling, bookkeeping. When responses are small, that overhead is paid repeatedly on little data. The consumer fleet burns CPU on round-trip mechanics rather than on processing records.

In one production deployment, a high-throughput topic had batch.size at 16KB (the default) and fetch.min.bytes at 1 byte (also the default). Tuning batch.size to 80KB and fetch.min.bytes to 512KB cut the consumer fleet from 60 to 30 pods. Per-pod CPU increased by roughly 30%, but the fleet was processing the same volume of data with half the pods - it had stopped spending the majority of its time on fetch overhead. Compression savings on the same topic improved from under 10% to over 50% with no codec change.

The overhead is fixed per fetch. What changes is how much data it buys you.
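A back-of-the-envelope model makes the amortization visible. The throughput and the per-fetch overhead below are assumed placeholder values, and the function names are hypothetical; only the ratio between the two cases matters.

```python
def fetches_per_second(throughput_bytes_per_s, avg_fetch_bytes):
    """Number of fetch requests needed per second to sustain a given throughput."""
    return throughput_bytes_per_s / avg_fetch_bytes


def overhead_cpu_cores(throughput_bytes_per_s, avg_fetch_bytes, overhead_s_per_fetch):
    """CPU cores spent purely on per-fetch fixed overhead."""
    return fetches_per_second(throughput_bytes_per_s, avg_fetch_bytes) * overhead_s_per_fetch


THROUGHPUT = 50 * 1024 * 1024   # 50 MiB/s per consumer, assumed
OVERHEAD = 100e-6               # 100 microseconds of CPU per fetch, assumed

small = overhead_cpu_cores(THROUGHPUT, 16 * 1024, OVERHEAD)    # 16 KiB responses
large = overhead_cpu_cores(THROUGHPUT, 512 * 1024, OVERHEAD)   # 512 KiB responses
print(f"16 KiB fetches:  {small:.3f} cores on overhead")
print(f"512 KiB fetches: {large:.3f} cores on overhead")
```

Because the overhead per fetch is constant in this model, the cost scales inversely with response size: 32x larger responses mean 32x fewer fetches for the same data.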

The producer's batch decision bills every consumer group

In cloud deployments, data crossing availability zone boundaries is billed per byte - producer-to-broker, inter-broker replication, and broker-to-consumer are all billable paths. Batch depth affects all three paths simultaneously:

Smaller wire size from better compression reduces the bytes in the producer-to-broker path.
Replication copies those same bytes, so smaller compressed batches reduce replication traffic proportionally.
Every consumer group fetches its own copy of those bytes - fan-out multiplies the savings across every downstream reader automatically.

A meaningful reduction in compressed batch size propagates through producer ingress, replication, and every consumer fan-out stream.

The prioritization rule follows directly: throughput × fan-out. A 20% wire-size reduction on a topic with 8× fan-out matters more than a 50% reduction on a topic with 1× fan-out. The highest ROI comes from fixing the topics where the multiplier is largest.
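The comparison in the prioritization rule can be checked with a few lines of arithmetic. This sketch counts only the broker-to-consumer path and assumes both topics have the same producer throughput (100 GB/hour is a placeholder); the function names are hypothetical.

```python
def priority_score(throughput_gb_per_hour, fan_out):
    """Ranking key for a topic: throughput multiplied by fan-out."""
    return throughput_gb_per_hour * fan_out


def consumer_path_gb_saved(throughput_gb_per_hour, fan_out, wire_reduction):
    """GB/hour removed from the broker-to-consumer path by a given wire-size reduction."""
    return priority_score(throughput_gb_per_hour, fan_out) * wire_reduction


# Same assumed throughput, different fan-out and reduction
high_fan_out = consumer_path_gb_saved(100, fan_out=8, wire_reduction=0.20)
low_fan_out = consumer_path_gb_saved(100, fan_out=1, wire_reduction=0.50)
print(f"20% reduction at 8x fan-out saves {high_fan_out:.0f} GB/hour")
print(f"50% reduction at 1x fan-out saves {low_fan_out:.0f} GB/hour")
```

Including the producer and replication paths changes the absolute numbers but not the ordering for this pair of topics.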

Diagnosing the problem

The following queries use metric names common to the standard JMX exporter - verify names against your specific client library and exporter version before relying on them.

Batch fill rate:
kafka_producer_batch_size_avg / kafka_producer_batch_size_max

Values consistently below 0.3 indicate that batches are flushing before they are meaningfully filled.

Compression ratio by topic:
avg_over_time(kafka_producer_compression_rate_avg[5m])

This metric reports the ratio of compressed to uncompressed size - lower is better. A value near 1.0 means the codec is doing nothing. On a zstd-configured producer with structured data, sustained values well below 1.0 are achievable with proper batch depth - if you're seeing values near 1.0 consistently, batches are too shallow.

Consumer fetch size:
avg_over_time(kafka_consumer_fetch_size_avg[5m])

Consistently small values indicate consumers are issuing many small fetches - a downstream symptom of small producer batches.

These three metrics, read together, identify whether the problem is at the producer (batch fill), at the codec (compression rate), or propagated to the consumer (fetch size). They also identify which topics to fix first: sort by bytes_out_per_sec × consumer_group_count.
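Reading the three metrics together and ranking topics can be sketched as follows. The 0.3 fill threshold comes from the text; the compression-ratio and fetch-size thresholds, the TopicMetrics type, and the sample values are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class TopicMetrics:
    topic: str
    batch_size_avg: float       # bytes
    batch_size_max: float       # bytes
    compression_ratio: float    # compressed / uncompressed, lower is better
    fetch_size_avg: float       # bytes
    bytes_out_per_sec: float
    consumer_group_count: int


def diagnose(m, fill_floor=0.3, ratio_ceiling=0.9, fetch_floor=64 * 1024):
    """Return the symptoms a topic shows; ratio_ceiling and fetch_floor are assumed thresholds."""
    findings = []
    if m.batch_size_avg / m.batch_size_max < fill_floor:
        findings.append("producer: batches flush before filling")
    if m.compression_ratio > ratio_ceiling:
        findings.append("codec: compression is doing little")
    if m.fetch_size_avg < fetch_floor:
        findings.append("consumer: many small fetches")
    return findings


def rank_topics(metrics):
    """Sort topics by bytes_out_per_sec x consumer_group_count, highest first."""
    return sorted(
        metrics,
        key=lambda m: m.bytes_out_per_sec * m.consumer_group_count,
        reverse=True,
    )


# Made-up sample values
topics = [
    TopicMetrics("orders", 4_000, 16_384, 0.95, 8_000, 50e6, 2),
    TopicMetrics("clicks", 60_000, 65_536, 0.40, 400_000, 20e6, 8),
]
for m in rank_topics(topics):
    print(m.topic, diagnose(m))
```

In this sample, "clicks" ranks first because of its fan-out even though it is already healthy, while "orders" shows all three symptoms. The ranking and the diagnosis answer different questions and are best read side by side.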

What to fix, in order

For each prioritized topic:

Batch depth: Increase linger.ms to 20-50ms. This adds up to that much latency to every message, since a record can wait for the full window before its batch is flushed. On latency-sensitive paths - fraud detection, ad bidding, synchronous request-reply over Kafka - this is unacceptable. Apply only where end-to-end latency tolerance is measured in seconds, not milliseconds.

Increase batch.size to 64-256KB depending on message size and throughput, and measure batch fill rate before and after.

One constraint before raising batch.size: Kafka producers allocate memory pools per partition from a shared buffer.memory budget (default 32MB). On a producer writing to many partitions simultaneously, large batch.size values can exhaust this budget under load, causing blocked send() calls or client-side exceptions. Check partition count per producer instance and raise buffer.memory proportionally before making the change.
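The pre-check described above is simple arithmetic. The sketch below assumes the worst case of one full batch in flight per partition; the 1.5x headroom factor and the function names are hypothetical choices, not Kafka defaults.

```python
DEFAULT_BUFFER_MEMORY = 32 * 1024 * 1024  # 32MB default, as stated above


def worst_case_batch_bytes(partitions, batch_size):
    """Memory needed if every partition holds one full batch at the same time."""
    return partitions * batch_size


def fits_in_buffer(partitions, batch_size, buffer_memory=DEFAULT_BUFFER_MEMORY):
    """True if the worst case fits inside the producer's buffer.memory budget."""
    return worst_case_batch_bytes(partitions, batch_size) <= buffer_memory


def suggested_buffer_memory(partitions, batch_size, headroom=1.5,
                            buffer_memory=DEFAULT_BUFFER_MEMORY):
    """Scale buffer.memory with the worst case; `headroom` is an assumed safety factor."""
    needed = int(worst_case_batch_bytes(partitions, batch_size) * headroom)
    return max(buffer_memory, needed)


# A producer writing to 200 partitions, before and after raising batch.size
print(fits_in_buffer(200, 16 * 1024))            # default 16KB batches
print(fits_in_buffer(200, 256 * 1024))           # 256KB batches
print(suggested_buffer_memory(200, 256 * 1024))  # bytes
```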

Codec: Switch to compression.type=zstd with compression.zstd.level=1, not zstd-3. If the topic is already on zstd, check the level - the Kafka default is not optimal for structured data.

Consumer fetch settings: Align fetch.min.bytes and fetch.max.wait.ms with the new batch sizes. Without this, consumers issue small fetches against larger broker batches, negating part of the gain.

Broker disk usage drops as a side effect - Kafka stores compressed record batches on disk, so whatever reduces wire size reduces storage without additional work.

Closing

Kafka compression waste is usually a batch depth problem. Once the batch is deep enough, the codec does its job; until then, the producer is starving it of useful input.

This is part 2 of the Kafka Network Cost series. Part 1: Kafka Compute Is Cheap. Network Is Not. Part 3: Fix the Codec Before You Touch the Schema. Part 4: the S3 indirection pattern for analytical consumers.

Keywords: Kafka batch tuning, Kafka compression zstd, linger.ms batch.size optimization, Kafka producer tuning, cross-AZ network cost, fetch.min.bytes.

Kafka Compute Is Cheap. Network Is Not

Elad Eldor — Wed, 20 May 2026 14:28:42 +0000

Cross-AZ network transfer often costs more than compute. Here's why it's invisible and what to do about it.

Your most expensive Kafka topic probably isn't the one with the most data. It's the one with the most consumers, because cross-AZ network transfer often costs more than compute in real Kafka deployments - sometimes by 5–10x when fan-out is high and pod placement is unlucky.

While the Data Transfer cost shows up in cloud billing, the line items don't point back to Kafka topics. The AWS CUR (Cost and Usage Report) shows EC2 / Data Transfer, the Kafka dashboard shows producer and consumer metrics, and nobody looks at both at once. That gap is why Data Transfer cost persists at companies that are otherwise rigorous about infrastructure spend.

This article is about this hidden cost and what to do about it, and it's relevant when:

Kafka brokers span multiple AZs (Availability Zones)
Producers and Consumers run in different AZs than the Brokers
You run Kafka on AWS or GCP (Azure doesn't charge for cross-AZ networking)

Quick Diagnostic

If you already have bytes_in and bytes_out metrics per topic, you can estimate fan-out.
For a topic with 200 GB/hour in and 600 GB/hour out at RF=2:

Producer throughput ≈ 200 ÷ 2 = 100 GB/hour
Replication outbound ≈ 100 GB/hour
Consumer outbound ≈ 600 − 100 = 500 GB/hour
Fan-out ≈ 500 ÷ 100 = 5x

That estimate is enough to rank topics by cost impact. It's not exact enough for chargeback, because compression, retries, and internal broker traffic can distort the numbers.
If bytes_out is much larger than bytes_in, the gap is usually fan-out.
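The same estimate as a function, following the arithmetic of the worked example. It assumes, as the example does, that bytes_in and bytes_out are cluster-wide and include replication traffic. Extending the replication term to (RF − 1) copies for RF values other than 2 is an assumption of this sketch.

```python
def estimate_fan_out(bytes_in, bytes_out, rf):
    """Estimate consumer fan-out from cluster-wide bytes_in/bytes_out and the replication factor."""
    producer = bytes_in / rf              # original producer writes
    replication = producer * (rf - 1)     # follower fetches served by leaders
    consumer_out = bytes_out - replication
    return consumer_out / producer


# The worked example: 200 GB/hour in, 600 GB/hour out, RF=2
print(estimate_fan_out(bytes_in=200, bytes_out=600, rf=2))
```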

How Data Transfer Billing Works

In AWS CUR, AZ-to-AZ traffic in the same region appears under DataTransfer-Regional-Bytes. On AWS, this is typically about $0.01 per GB (before discount) for data leaving an AZ within a region. GCP is similar, but exact rates vary by region and discount agreement.
This means a single GB can be charged multiple times as it moves through Kafka:

producer to leader broker
leader to follower broker
broker to each consumer group

Kafka also generates extra bidirectional traffic from fetches, acknowledgments, heartbeats, and recovery activity, so the effective cost of a topic is usually a bit higher than the raw payload size suggests.

The Three Paths

Kafka traffic has three cost-bearing paths.

Producer to broker: Producers write to partition leaders. If the producer pod and the leader live in different AZs, that write crosses an AZ boundary. Producers must reach the leader, so this cannot be avoided by configuration alone.
Replication between brokers: Leaders replicate to followers. RF=2 copies each write once. RF=3 copies it twice. In a multi-AZ cluster, replication is part of durability.
Broker to consumer: Consumers fetch data from brokers. Each consumer group reads the topic independently, so this path scales with fan-out. The mental model is:

billable transfers = 1 + (RF − 1) + fan-out

This is a worst-case upper bound, but it is a useful one. It explains why a topic with many consumers can cost far more than a topic with more writes.
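The model translates directly into a small cost sketch. The $0.01 per GB rate is the approximate AWS figure quoted earlier; the function names and the sample volumes are hypothetical, and the result is the worst case in which every transfer crosses an AZ boundary.

```python
def billable_transfers(rf, fan_out):
    """Worst-case cross-AZ copies per byte produced: producer + replication + consumers."""
    return 1 + (rf - 1) + fan_out


def worst_case_cost_usd(gb_produced, rf, fan_out, usd_per_gb=0.01):
    """Upper bound on Data Transfer cost if every transfer crosses an AZ boundary."""
    return gb_produced * billable_transfers(rf, fan_out) * usd_per_gb


# Same produced volume, one consumer group versus five
print(billable_transfers(rf=3, fan_out=1), worst_case_cost_usd(1000, rf=3, fan_out=1))
print(billable_transfers(rf=3, fan_out=5), worst_case_cost_usd(1000, rf=3, fan_out=5))
```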

Kafka transfers compressed batches, so Data Transfer cost is based on bytes on the wire, not logical message size - better compression and batching reduce every term in the model. Consumer-side filtering doesn't reduce network cost, since Kafka still ships the full records - filtering saves CPU, not bandwidth.

Fan-Out Drives Cost

The main surprise is that consumer count often matters more than producer throughput - a topic with one consumer group has far less cross-AZ cost than the same topic with five consumer groups, even when producer traffic is identical. The extra cost comes entirely from the broker-to-consumer path.

Fan-out is often understated in steady-state measurements. Rebalances, pod restarts, backfills, and offset resets can replay old data and temporarily amplify Data Transfer cost. That's why optimization should target throughput×fan-out instead of throughput alone.

Placement Matters

Compute optimization and network cost optimization pull in opposite directions, since Kubernetes autoscalers usually optimize for compute, not Kafka topology. When pods are rescheduled, they land wherever capacity is available, not wherever Kafka brokers happen to be.

That matters because a pod in a non-broker AZ pays extra on every Kafka interaction:

Producer-heavy services are affected on writes
Consumer-heavy services are affected on fetches
Mixed services pay on both

In a three-AZ cluster with brokers in two of them, a randomly placed pod has a baseline ~67% chance of sitting in a different AZ than any given partition leader (and a ~33% chance of landing in an AZ with no broker at all). K8s autoscalers can push that higher since they bin-pack into whatever AZ has spare capacity, so in practice the effective cross-AZ exposure for consumer pods can run 73–90%+ on some clusters.
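The baseline figure can be derived under a uniform-placement assumption: the pod is equally likely to land in any AZ, independently of where brokers run. The sketch computes two related probabilities, and the ~67% baseline corresponds to the second one. The function names are hypothetical.

```python
from fractions import Fraction


def p_outside_all_broker_azs(total_azs, broker_azs):
    """Chance that a uniformly placed pod lands in an AZ that hosts no broker."""
    return Fraction(total_azs - broker_azs, total_azs)


def p_different_az_than_leader(total_azs):
    """Chance that a uniformly placed pod is not in the AZ of one specific partition leader."""
    return 1 - Fraction(1, total_azs)


print(float(p_outside_all_broker_azs(total_azs=3, broker_azs=2)))
print(float(p_different_az_than_leader(total_azs=3)))
```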

RF=3 Can Be Cheaper Than RF=2

Looking only at storage and the replication path, RF=2 is cheaper than RF=3. Counter-intuitively, that is not always true for total network cost. On high fan-out topics, RF=3 can reduce cross-AZ consumer traffic because each AZ has a replica available for local reads. The extra replication cost is fixed per write, while the read-side savings scale with fan-out.

This requires client.rack on consumers, broker.rack on brokers with a rack-aware replica selector (replica.selector.class) enabled, a rack-aware assignor, and reasonably balanced AZ distribution.

KIP-392 enables consumers to fetch from the closest replica when rack-aware selection is configured. KIP-881 improves rack-aware consumer assignment. KIP-405 is different: it moves cold log segments to remote storage, which reduces storage cost but doesn't remove broker-mediated Data Transfer cost.
If those conditions are met, RF=3 can lower total cross-AZ traffic even though it stores more data. On read-heavy topics, that can make it cheaper overall.
The right question isn't "is RF=3 more expensive?" It's "which costs more on this topic: extra replication or repeated cross-AZ reads?"
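That question can be framed as a small model. The sketch assumes three AZs, one replica per AZ up to RF, closest-replica fetching working as intended, and each consumer group's pods spread evenly across AZs. It ignores the producer path, which is the same for both RF values. The function name is hypothetical.

```python
def cross_az_copies_per_gb(rf, fan_out, total_azs=3):
    """Cross-AZ copies per GB produced, excluding the producer-to-leader path."""
    replication = rf - 1                          # each follower sits in another AZ
    azs_with_replica = min(rf, total_azs)
    remote_fraction = 1 - azs_with_replica / total_azs   # consumers with no local replica
    return replication + fan_out * remote_fraction


for fan_out in (1, 3, 6, 10):
    rf2 = cross_az_copies_per_gb(2, fan_out)
    rf3 = cross_az_copies_per_gb(3, fan_out)
    print(f"fan-out {fan_out}: RF=2 -> {rf2:.2f}, RF=3 -> {rf3:.2f}")
```

Under these assumptions the two options break even at a fan-out of about 3. Below that, the extra replica costs more than it saves; above it, local reads win.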

What To Do With This

The biggest savings usually lie within these steps:

Start with the top throughput topics
Derive fan-out from bytes_in and bytes_out
Sort topics by throughput × fan-out
Map pod distribution against broker AZs
Audit client.rack on all consumers
Revisit RF on high fan-out topics
Check for AZ mismatches across clusters

A few things worth keeping in mind as you work through that list:

Services with significant pod presence outside broker AZs are paying a topology tax in the form of Data Transfer cost
RF=3 with proper rack configuration may be cheaper than RF=2 on read-heavy topics
The conventional Data Transfer cost ranking assumes single-consumer topics and doesn't generalize to high fan-out ones
Compression is an important lever - better batching and better codecs reduce bytes on the wire, and that lowers cross-AZ cost directly
For non-real-time consumers, another option is to remove them from Kafka entirely and serve them from S3 instead - one Kafka consumer writes to S3, and many analytical readers consume from S3 over a VPC gateway endpoint. That avoids Kafka fan-out for workloads that can tolerate seconds to minutes of latency

Closing

Kafka often looks expensive because the bill is driven by network topology, consumer fan-out, and placement - not just EC2 compute.

The fix is usually the same:

measure fan-out
align placement with broker topology where possible
improve compression
reconsider RF or S3 indirection where read traffic dominates

The cost is rarely where people first look. Once you can see the fan-out, the leverage is obvious.

This is Part 1 of the Kafka Network Cost series. Part 2: Kafka's Real Compression Problem Is Batch Depth. Part 3: Fix the Codec Before You Touch the Schema. Part 4: The Cheapest Kafka Consumer Is One That Doesn't Read From Kafka. Part 5: The S3 GET Limit Nobody Plans For

Keywords: Kafka cross-AZ cost, AWS DataTransfer-Regional-Bytes, Kafka network optimization, fan-out cost model, Kafka VPC topology, KIP-392, KIP-881, KIP-405.