Kafka Cost Optimization Starts with Usage

#kafka #dataengineering #devops #architecture

I sit in a lot of Kafka reviews: vendors, instances, replication, tiered storage, advanced features like fetch-from-follower, networking, partitions, best practices, and so on. Most discussions are driven by tech alone and skip the big picture: how this beautiful infra is actually being used.

Unpopular opinion: most of your Kafka cost is not due to infrastructure, it's due to a usage problem.

Where the cost actually comes from

Vendor calculators are hard to compare because each one bakes in different assumptions: replication multipliers, disk class, compression ratio, and whether tiered storage is billed at the replicated rate or the actual S3 rate. The price you see is almost never what you pay.

RF=3 multiplies the per-GB price by 3 everywhere. And tiered storage is often still billed at the replicated rate even though only one copy lives in S3. You're paying the RF=3 rate for data Kafka no longer replicates.
Cross-VPC, in-region traffic between your account and the vendor's lands on your cloud bill, roughly 1c/GB each way depending on the path.
Without fetch-from-follower, most consumer fetches cross AZ boundaries. With leaders and consumers balanced across three AZs, a partition's leader sits in the consumer's own AZ only a third of the time, so ~2/3 of consumer reads go cross-AZ.
Compression is often just... off.

With zstd at sane batch sizes, JSON-ish logs and metrics commonly compress 8–10x:

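# producer
compression.type=zstd
# 64 KB (keep comments on their own line: a trailing comment becomes part of the value in a properties file)
batch.size=65536
linger.ms=20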

Going from 5x to 10x halves your stored bytes and halves the replication bytes flowing inside the cluster. At RF=3 every byte is stored three times and shipped to two followers, so the ratio matters.
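
To put numbers on that, here is a minimal Python sketch. The 1,000 GB/day volume is a made-up figure, and the model is deliberately simple: storage counts RF copies, and leader-to-follower traffic counts RF-1 copies. It ignores retention, tiered storage, and batch overhead.

```python
def stored_gb(raw_gb: float, compression_ratio: float, replication_factor: int = 3) -> float:
    """Bytes on broker disks: compressed size times the number of replicas."""
    return raw_gb / compression_ratio * replication_factor


def replication_traffic_gb(raw_gb: float, compression_ratio: float, replication_factor: int = 3) -> float:
    """Bytes the leader ships to followers: one compressed copy per follower."""
    return raw_gb / compression_ratio * (replication_factor - 1)


RAW_GB_PER_DAY = 1000.0  # hypothetical uncompressed producer volume

for ratio in (1, 5, 10):
    print(
        f"{ratio:>2}x  stored={stored_gb(RAW_GB_PER_DAY, ratio):7.1f} GB/day  "
        f"replicated={replication_traffic_gb(RAW_GB_PER_DAY, ratio):7.1f} GB/day"
    )
```

Swap in your own daily volume and the ratio you actually measure on your topics; the point is only that both columns scale with 1/ratio.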

And fetch-from-follower, available since Kafka 2.4, is a broker + consumer config away. Same-AZ traffic inside your VPC is free on AWS, so no cross-AZ tax:

# broker
replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector
broker.rack=us-east-1a

# consumer: set to the AZ the consumer runs in, spelled exactly like the brokers' broker.rack values
client.rack=us-east-1a
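
A rough way to size the cross-AZ tax, as a Python sketch. The assumptions are mine: leaders and consumers spread evenly across AZs, every AZ holding a replica of each partition (RF=3 over three AZs with rack-aware placement), and an effective $0.02/GB for cross-AZ transfer (about 1c billed in each direction). The 100 TB/month read volume is hypothetical.

```python
def cross_az_read_fraction(num_azs: int, fetch_from_follower: bool) -> float:
    """Expected share of consumer fetch bytes that leave the consumer's AZ.

    Assumes leaders and consumers are spread evenly across AZs and that
    every AZ holds a replica of each partition.
    """
    if fetch_from_follower:
        return 0.0  # each consumer can read from a replica in its own AZ
    return (num_azs - 1) / num_azs


def monthly_cross_az_cost(read_gb: float, fraction: float, usd_per_gb: float = 0.02) -> float:
    """Transfer cost for the cross-AZ share of consumer reads.

    usd_per_gb is an assumption: roughly $0.01/GB billed in each direction.
    """
    return read_gb * fraction * usd_per_gb


READ_GB_PER_MONTH = 100_000  # hypothetical: 100 TB of consumer reads per month

for enabled in (False, True):
    share = cross_az_read_fraction(3, enabled)
    cost = monthly_cross_az_cost(READ_GB_PER_MONTH, share)
    print(f"fetch_from_follower={enabled}: {share:.0%} cross-AZ, ${cost:,.2f}/month")
```

Check the rate against your own bill before quoting the result to anyone; the fraction is the part that generalizes.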

Do all of it: fetch-from-follower, tiered storage, compression enforcement, partition right-sizing, BYOC to apply your existing cloud discount, single-AZ topics where you can tolerate it. But notice that it's still infrastructure tuning. Let's go up.

Cost is a stack, not a line item

When you tune anything in Kafka, you think in layers, bottom-up: hardware, JVM, broker config, producer/consumer tuning, topic design, application code. Same for cost:

Cloud infrastructure: instance types, AZ placement, networking, BYOC negotiation. At big contract sizes, negotiated networking discounts can hit 90%, but only if traffic flows through your account.
Broker & protocol tuning: compression, retention, RF, fetch-from-follower, tiered storage, partition count. Easy, they're config changes.
Architecture: diskless topics, Iceberg topics, single-AZ topics, proxies between clients and brokers, virtual clusters for multi-tenancy and non-prod consolidation.
Usage: fan-out, governance, discovery, self-service.

Payoff often goes up as you go higher, because the higher layers demand more systems thinking. Everyone's comfortable arguing about gp2 vs gp3 (EBS volume types on AWS). Almost nobody asks, "why are 40% of these partitions doing nothing?"

Speaking of which: did you know that most clusters carry 40–70% partition waste? On managed Kafka with per-partition-hour billing, that waste sits right on the invoice. On self-managed, you hit the ~4,000–6,000 partition-replicas-per-broker ceiling sooner (RF=3 turns 100k partitions into 300k replicas to host and track). KRaft raises the ceiling, but it doesn't make the waste free.
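
The replica arithmetic is worth doing once for your own cluster. A minimal Python sketch, using the 100k-partition example and the 4,000 to 6,000 replicas-per-broker range from above; the function names are mine, and the result is only the broker count forced by partition density, not by throughput or disk.

```python
import math


def replica_count(partitions: int, replication_factor: int = 3) -> int:
    """Partition replicas the cluster has to host and track."""
    return partitions * replication_factor


def min_brokers_for_replicas(
    partitions: int, replication_factor: int = 3, replicas_per_broker: int = 4000
) -> int:
    """Smallest broker count that keeps every broker under the replica ceiling."""
    return math.ceil(replica_count(partitions, replication_factor) / replicas_per_broker)


PARTITIONS = 100_000

for waste_share in (0.0, 0.4, 0.7):
    active = round(PARTITIONS * (1 - waste_share))  # partitions left after cleanup
    print(
        f"waste={waste_share:.0%}  active partitions={active:>7,}  "
        f"replicas={replica_count(active):>7,}  "
        f"brokers needed at 4,000/broker={min_brokers_for_replicas(active)}"
    )
```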

Fan-out is the whole point of Kafka

Kafka exists so that one byte written can be read by N independent consumers, decoupled in time, with zero coordination back to the producer. That's the log abstraction's reason to live.

Do you measure your average fan-out? If it's 1, you probably shouldn't be running Kafka at all: you're paying for a distributed log to do a point-to-point queue's job. LinkedIn famously ran at ~5.4: the same bytes, written once, read on average 5.4 times by independent teams.

Cluster cost stays roughly flat while consumers grow, so cost per business outcome falls as more use cases read existing topics:

cost_per_use_case = cluster_cost / fan_out

fan-out 1  ->  $X      (one team carries the whole bill)
fan-out 3  ->  $X / 3
fan-out 5  ->  $X / 5  (same hardware, five outcomes)
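
If you have never measured it, a first approximation is bytes read by consumers divided by bytes written by producers, per topic and for the cluster as a whole. Here is a Python sketch under that assumption. The topic names and byte counts are invented, and `bytes_out` is meant to exclude inter-broker replication traffic, which you would pull from your own metrics.

```python
from dataclasses import dataclass


@dataclass
class TopicTraffic:
    topic: str
    bytes_in: float   # written by producers
    bytes_out: float  # fetched by consumers, replication excluded


def fan_out(t: TopicTraffic) -> float:
    """How many times each written byte is read, for one topic."""
    return t.bytes_out / t.bytes_in if t.bytes_in else 0.0


def average_fan_out(topics: list) -> float:
    """Volume-weighted fan-out across the cluster."""
    total_in = sum(t.bytes_in for t in topics)
    total_out = sum(t.bytes_out for t in topics)
    return total_out / total_in if total_in else 0.0


def cost_per_use_case(cluster_cost: float, fan_out_value: float) -> float:
    """cost_per_use_case = cluster_cost / fan_out, as in the formula above."""
    if fan_out_value <= 0:
        raise ValueError("fan-out must be positive: nothing is reading this data")
    return cluster_cost / fan_out_value


# Hypothetical sample: byte counts over the same time window.
sample = [
    TopicTraffic("orders", bytes_in=100, bytes_out=500),
    TopicTraffic("clicks", bytes_in=300, bytes_out=300),
    TopicTraffic("audit", bytes_in=100, bytes_out=0),
]

unread = [t.topic for t in sample if t.bytes_out == 0]
print(f"average fan-out: {average_fan_out(sample):.2f}")
print(f"topics nobody reads: {unread}")
```

The volume-weighted average is a choice, not the only definition: counting distinct consumer groups per topic answers a slightly different question and is worth tracking too.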

"Is our Kafka usage growing?" is the wrong question. More business use-cases reading existing data is the best money you'll ever spend. Duplicated topics, created because nobody could find the existing one, are pure waste: more storage, more replication, more pipelines, all because discovery and ownership are missing.

The same goes for partitions: people over-provision because nobody knows how to size them, and Kafka won't let you reduce a topic's partition count after the fact. You have to recreate the topic, and any change in partition count changes which partition a key maps to, which breaks per-key ordering. The only way to surface that waste is chargeback at the team-and-topic level. You can't optimize what you can't attribute.
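
Chargeback does not need to be sophisticated to be useful. A minimal Python sketch that splits a monthly cluster bill across teams and topics in proportion to a single cost driver. Here the driver is stored GB, which is my assumption; partition count or bytes in would work the same way. All team names, topic names, and numbers are invented.

```python
from collections import defaultdict


def chargeback(cluster_cost: float, usage: list) -> dict:
    """Split cluster_cost across (team, topic) rows by share of the cost driver.

    usage rows are (team, topic, driver_value); driver_value is stored GB here.
    Returns {team: {topic: dollars}}.
    """
    total = sum(value for _, _, value in usage)
    if total <= 0:
        raise ValueError("no usage to attribute")
    bill = defaultdict(dict)
    for team, topic, value in usage:
        bill[team][topic] = round(cluster_cost * value / total, 2)
    return dict(bill)


# Hypothetical month: $10,000 cluster bill, usage in stored GB.
usage = [
    ("payments", "orders", 500),
    ("payments", "refunds", 100),
    ("growth", "clicks", 300),
    ("unowned", "legacy-export", 100),  # no owner found: that is a finding in itself
]

bill = chargeback(10_000.0, usage)
for team, topics in bill.items():
    print(f"{team:<10} ${sum(topics.values()):>9,.2f}  {topics}")
```

The "unowned" bucket is the interesting row: it is the dollar figure for data nobody will put their name on.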

"A third of our traffic, we know what it has to do with, but we don't know exactly what they're doing."

That's the usage layer leaking. It costs money, and nobody can fix it because nobody knows how, nobody knows where to look, or nobody owns it. It's not an infra problem; it's governance, discovery, and self-service.

Cost optimization is everybody's concern and nobody's objective. Teams over-provision because "what if we need it later?" and "what if it breaks when we touch it?" are rational fears. "It's expensive" is not a business case. What works is showing the waste, the annual dollar number, and the effort to reclaim it, with a name next to it.

2026: Where to spend your effort

Most deployments I see have way more headroom in the usage layer than the infra layer: topics nobody reads, partitions nobody needs, teams who'd benefit from streaming but find it too painful to onboard.

There's a funny industry reflex here too. We chase the next architectural shiny thing (diskless, Iceberg topics, single-AZ) before we've answered the boring questions: who's using this, for what, and why aren't more teams using it?

My actual recommendation:

1. Do the infrastructure pass once. Instance types, AZ placement, BYOC.
2. Do the config pass once. Compression, retention, partition right-sizing, fetch-from-follower, tiered storage.
3. Spend the rest of the year on the usage layer. Fan-out, ownership, discovery, chargeback, self-service.

Steps 1 and 2 are a sprint. Step 3 is the marathon.

If you want to see where your usage layer is leaking, Conduktor's field engineering team does a free Kafka cost analysis: they'll map cost back to teams and topics and show you where the payoff sits. And if you just want to keep reading, Why Kafka Costs Keep Rising and the partition waste deep-dive are good next stops.

What's your average fan-out? If you don't know it off the top of your head, that's probably where I'd start.