<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Elad Eldor</title>
    <description>The latest articles on DEV Community by Elad Eldor (@eeldor).</description>
    <link>https://dev.to/eeldor</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3939934%2F2931f13c-b15c-4d4e-942e-bfb467f64827.png</url>
      <title>DEV Community: Elad Eldor</title>
      <link>https://dev.to/eeldor</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/eeldor"/>
    <language>en</language>
    <item>
      <title>Kafka Compute Is Cheap. Network Is Not</title>
      <dc:creator>Elad Eldor</dc:creator>
      <pubDate>Wed, 20 May 2026 14:28:42 +0000</pubDate>
      <link>https://dev.to/eeldor/kafka-compute-is-cheap-network-is-not-2bdh</link>
      <guid>https://dev.to/eeldor/kafka-compute-is-cheap-network-is-not-2bdh</guid>
      <description>&lt;h3&gt;
  
  
  Cross-AZ network transfer often costs more than compute. Here's why it's invisible and what to do about it
&lt;/h3&gt;

&lt;p&gt;Your most expensive Kafka topic probably isn't the one with the most data. It's the one with the most consumers, because cross-AZ network transfer often costs more than compute in real Kafka deployments - sometimes by 5–10x when fan-out is high and pod placement is unlucky.&lt;/p&gt;

&lt;p&gt;Data Transfer cost does show up in cloud billing, but the line items don't point back to Kafka topics. The AWS CUR (Cost and Usage Report) shows EC2 / Data Transfer, the Kafka dashboard shows producer and consumer metrics, and nobody looks at both at once. That gap is why Data Transfer cost persists at companies that are otherwise rigorous about infrastructure spend.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkqgwb5ztabh71m3ki3az.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkqgwb5ztabh71m3ki3az.png" alt=" " width="799" height="291"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This article is about this hidden cost and what to do about it, and it's relevant when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kafka brokers span multiple AZs (Availability Zones)&lt;/li&gt;
&lt;li&gt;Producers and consumers run in different AZs than the brokers&lt;/li&gt;
&lt;li&gt;You run Kafka on AWS or GCP (Azure doesn't charge for cross-AZ traffic)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Quick Diagnostic
&lt;/h3&gt;

&lt;p&gt;If you already have bytes_in and bytes_out metrics per topic, you can estimate fan-out.&lt;br&gt;
For a topic with 200 GB/hour in and 600 GB/hour out at RF=2, assuming both counters include replication traffic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Producer throughput ≈ 200 ÷ 2 = 100 GB/hour
Replication outbound ≈ 100 GB/hour
Consumer outbound ≈ 600 − 100 = 500 GB/hour
Fan-out ≈ 500 ÷ 100 = 5x
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That estimate is enough to rank topics by cost impact. It's not exact enough for chargeback, because compression, retries, and internal broker traffic can distort the numbers.&lt;br&gt;
If bytes_out is much larger than bytes_in, the gap is usually fan-out.&lt;/p&gt;
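
&lt;p&gt;The same arithmetic as a small Python sketch. The function and field names are illustrative, and it assumes, as the worked example does, that bytes_in includes replica writes and bytes_out includes replication fetches. If your metrics exclude replication, the split is different.&lt;/p&gt;

```python
def estimate_fan_out(bytes_in_gb: float, bytes_out_gb: float, rf: int) -> dict:
    """Rough per-topic fan-out from byte counters over the same time window.

    Assumption: both counters include replication traffic, as in the
    worked example (bytes_in = producer writes + replica writes).
    """
    if not (rf >= 1 and bytes_in_gb > 0):
        raise ValueError("rf must be at least 1 and bytes_in_gb must be positive")
    producer = bytes_in_gb / rf            # each produced byte is written rf times
    replication_out = producer * (rf - 1)  # leader serves rf - 1 follower copies
    consumer_out = bytes_out_gb - replication_out
    return {
        "producer_gb": producer,
        "replication_out_gb": replication_out,
        "consumer_out_gb": consumer_out,
        "fan_out": consumer_out / producer,
    }


estimate = estimate_fan_out(bytes_in_gb=200, bytes_out_gb=600, rf=2)
print(estimate["fan_out"])  # 5.0
```

&lt;p&gt;Run it per topic over the same time window for both counters, and treat the result as a ranking signal rather than a billing figure.&lt;/p&gt;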

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5riy95rs48uvvv7cpuut.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5riy95rs48uvvv7cpuut.png" alt=" " width="800" height="269"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How Data Transfer Billing Works
&lt;/h3&gt;

&lt;p&gt;In AWS CUR, AZ-to-AZ traffic in the same region appears under DataTransfer-Regional-Bytes. On AWS, this is typically about $0.01 per GB in each direction (before discount), so a GB that crosses an AZ boundary is billed on both the sending and the receiving side. GCP is similar, but exact rates vary by region and discount agreement.&lt;br&gt;
Because every hop is billed, a single GB can be charged multiple times as it moves through Kafka:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;producer to leader broker&lt;/li&gt;
&lt;li&gt;leader to follower broker&lt;/li&gt;
&lt;li&gt;broker to each consumer group&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Kafka also generates extra bidirectional traffic from fetches, acknowledgments, heartbeats, and recovery activity, so the effective cost of a topic is usually a bit higher than the raw payload size suggests.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Three Paths
&lt;/h3&gt;

&lt;p&gt;Kafka traffic has three cost-bearing paths.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Producer to broker:&lt;/strong&gt; Producers write to partition leaders. If the producer pod and the leader live in different AZs, that write crosses an AZ boundary. Producers must reach the leader, so this cannot be avoided by configuration alone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replication between brokers:&lt;/strong&gt; Leaders replicate to followers. RF=2 copies each write once. RF=3 copies it twice. In a multi-AZ cluster, replication is part of durability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Broker to consumer:&lt;/strong&gt; Consumers fetch data from brokers. Each consumer group reads the topic independently, so this path scales with fan-out.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The mental model is:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;billable transfers=1+(RF−1)+fan-out

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo5v74i7xeqqkg4h3t0op.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo5v74i7xeqqkg4h3t0op.png" alt=" " width="800" height="675"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is a worst-case upper bound, but it is a useful one. It explains why a topic with many consumers can cost far more than a topic with more writes.&lt;/p&gt;
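
&lt;p&gt;As a minimal Python sketch of that upper bound: the function names are made up for illustration, and the usd_per_gb default is only an assumed placeholder rate, so substitute whatever your bill actually shows.&lt;/p&gt;

```python
def billable_transfers(rf: int, fan_out: float) -> float:
    """Worst case: every hop crosses an AZ boundary."""
    producer_hop = 1           # producer to leader
    replication_hops = rf - 1  # leader to each follower
    return producer_hop + replication_hops + fan_out


def worst_case_cost_usd(produced_gb: float, rf: int, fan_out: float,
                        usd_per_gb: float = 0.01) -> float:
    # usd_per_gb is an assumed rate; substitute your negotiated price.
    return produced_gb * billable_transfers(rf, fan_out) * usd_per_gb


print(billable_transfers(rf=2, fan_out=5))  # 7
print(billable_transfers(rf=3, fan_out=1))  # 4
```

&lt;p&gt;With RF=2 and five consumer groups, each produced GB can be billed up to seven times, and only one of those seven is the producer write.&lt;/p&gt;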

&lt;p&gt;Kafka transfers compressed batches, so Data Transfer cost is based on bytes on the wire, not logical message size. Better compression and batching therefore reduce every term in the model. Consumer-side filtering doesn't reduce network cost, since Kafka still ships the full records: filtering saves CPU, not bandwidth.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fan-Out Drives Cost
&lt;/h3&gt;

&lt;p&gt;The main surprise is that the number of consumer groups often matters more than producer throughput. A topic with one consumer group has far less cross-AZ cost than the same topic with five consumer groups, even when producer traffic is identical. The extra cost comes entirely from the broker-to-consumer path.&lt;/p&gt;

&lt;p&gt;Fan-out is often understated in steady-state measurements. Rebalances, pod restarts, backfills, and offset resets can replay old data and temporarily amplify Data Transfer cost. That's why optimization should target &lt;em&gt;throughput × fan-out&lt;/em&gt; instead of throughput alone.&lt;/p&gt;
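
&lt;p&gt;A short Python sketch of ranking by throughput × fan-out. The topic names and numbers are hypothetical, and the class and function names are illustrative rather than part of any Kafka tooling.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class TopicStats:
    name: str
    producer_gb_per_hour: float
    fan_out: float

    @property
    def cost_weight(self) -> float:
        # throughput x fan-out: the read-side volume that can cross AZs
        return self.producer_gb_per_hour * self.fan_out


def rank_by_cost_weight(topics):
    """Highest throughput x fan-out first."""
    return sorted(topics, key=lambda t: t.cost_weight, reverse=True)


# Hypothetical topics, for illustration only.
topics = [
    TopicStats("clickstream", producer_gb_per_hour=400, fan_out=1),
    TopicStats("orders", producer_gb_per_hour=100, fan_out=5),
    TopicStats("audit-log", producer_gb_per_hour=50, fan_out=2),
]
for topic in rank_by_cost_weight(topics):
    print(topic.name, topic.cost_weight)
```

&lt;p&gt;In this made-up data, orders ranks first with a weight of 500 even though clickstream carries four times the writes.&lt;/p&gt;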

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzn5bt3mn3rbfa0lhg8uu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzn5bt3mn3rbfa0lhg8uu.png" alt=" " width="800" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Placement Matters
&lt;/h3&gt;

&lt;p&gt;Compute optimization and network cost optimization pull in opposite directions, since Kubernetes autoscalers usually optimize for compute, not Kafka topology. When pods are rescheduled, they land wherever capacity is available, not wherever Kafka brokers happen to be.&lt;/p&gt;

&lt;p&gt;That matters because a pod in a non-broker AZ pays extra on every Kafka interaction:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Producer-heavy services are affected on writes&lt;/li&gt;
&lt;li&gt;Consumer-heavy services are affected on fetches&lt;/li&gt;
&lt;li&gt;Mixed services pay on both&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In a three-AZ cluster with brokers in two of them, a randomly placed pod has a baseline ~33% chance of landing in the AZ with no brokers at all, and a ~67% chance of sitting in a different AZ than any given partition leader. K8s autoscalers can push that higher since they bin-pack into whatever AZ has spare capacity, so in practice the effective cross-AZ exposure for consumer pods can run 73–90%+ on some clusters.&lt;/p&gt;
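
&lt;p&gt;One way to arrive at the ~67% baseline is to count how often a pod and the partition leader it talks to end up in different AZs. The Python sketch below assumes pods are placed uniformly across all AZs and leaders are spread uniformly across the broker AZs; both are simplifications, and the function names are illustrative.&lt;/p&gt;

```python
from fractions import Fraction


def p_outside_broker_azs(total_azs: int, broker_azs: int) -> Fraction:
    """Chance a uniformly placed pod lands in an AZ that has no broker."""
    return 1 - Fraction(broker_azs, total_azs)


def p_cross_az_to_leader(total_azs: int, broker_azs: int) -> Fraction:
    """Chance a pod and a given partition leader sit in different AZs."""
    # Same AZ only if the pod lands in a broker AZ AND the leader is in that AZ.
    p_same = Fraction(broker_azs, total_azs) * Fraction(1, broker_azs)
    return 1 - p_same


print(p_outside_broker_azs(total_azs=3, broker_azs=2))  # 1/3
print(p_cross_az_to_leader(total_azs=3, broker_azs=2))  # 2/3
```

&lt;p&gt;Under these assumptions the two-in-three figure depends only on the number of AZs the pods can land in, which is why skewed scheduling pushes it higher.&lt;/p&gt;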

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc97qklqggmw7uazyj8is.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc97qklqggmw7uazyj8is.png" alt=" " width="800" height="494"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  RF=3 Can Be Cheaper Than RF=2
&lt;/h3&gt;

&lt;p&gt;Looking only at storage and the replication path, RF=2 is cheaper than RF=3. Counter-intuitively, that isn't always true for total network cost. On high fan-out topics, RF=3 can reduce cross-AZ consumer traffic because each AZ has a replica available for local reads. The extra replication cost is fixed per write, while the read-side savings scale with fan-out.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq6eikjovpcnlhrcfb4gk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq6eikjovpcnlhrcfb4gk.png" alt=" " width="800" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Local reads require broker.rack and a rack-aware replica.selector.class on the brokers, client.rack on consumers, a rack-aware assignor, and reasonably balanced AZ distribution.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cwiki.apache.org/confluence/display/KAFKA/KIP-392%3A+Allow+consumers+to+fetch+from+closest+replica" rel="noopener noreferrer"&gt;KIP-392&lt;/a&gt; enables consumers to fetch from the closest replica when rack-aware selection is configured. &lt;a href="https://cwiki.apache.org/confluence/display/KAFKA/KIP-881:+Rack-aware+Partition+Assignment+for+Kafka+Consumers" rel="noopener noreferrer"&gt;KIP-881&lt;/a&gt; improves rack-aware consumer assignment. &lt;a href="https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage" rel="noopener noreferrer"&gt;KIP-405&lt;/a&gt; is different: it moves cold log segments to remote storage, which reduces storage cost but doesn't remove broker-mediated Data Transfer cost.&lt;br&gt;
If those conditions are met, RF=3 can lower total cross-AZ traffic even though it stores more data. On read-heavy topics, that can make it cheaper overall.&lt;br&gt;
The right question isn't "is RF=3 more expensive?" but "which costs more on this topic: extra replication or repeated cross-AZ reads?"&lt;/p&gt;
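
&lt;p&gt;The trade-off can be sketched in Python as cross-AZ GB per GB produced. This is a simplified model, not a billing formula: it assumes three AZs, one replica per AZ up to RF, consumers spread evenly across AZs, and it leaves out the producer hop because that is the same for both settings. The function and parameter names are illustrative.&lt;/p&gt;

```python
def cross_az_gb_per_gb_produced(rf: int, fan_out: float, total_azs: int = 3,
                                local_reads: bool = True) -> float:
    """Replication plus consumer reads that cross an AZ, per GB produced."""
    replication = rf - 1                # every follower copy crosses an AZ
    replica_azs = min(rf, total_azs)    # assumes one replica per AZ
    if local_reads:
        # Consumers read locally whenever their AZ holds a replica.
        cross_read_fraction = 1 - replica_azs / total_azs
    else:
        # Reads go to the leader, which is local for 1 in total_azs consumers.
        cross_read_fraction = 1 - 1 / total_azs
    return replication + fan_out * cross_read_fraction


print(round(cross_az_gb_per_gb_produced(rf=2, fan_out=5), 2))  # 2.67
print(round(cross_az_gb_per_gb_produced(rf=3, fan_out=5), 2))  # 2.0
```

&lt;p&gt;Under these assumptions the break-even sits at a fan-out of 3. Below that, the extra replica costs more than it saves, and with local_reads=False RF=3 is always the more expensive option.&lt;/p&gt;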

&lt;h3&gt;
  
  
  What To Do With This
&lt;/h3&gt;

&lt;p&gt;The biggest savings usually come from these steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start with the top throughput topics&lt;/li&gt;
&lt;li&gt;Derive fan-out from bytes_in and bytes_out&lt;/li&gt;
&lt;li&gt;Sort topics by throughput × fan-out&lt;/li&gt;
&lt;li&gt;Map pod distribution against broker AZs&lt;/li&gt;
&lt;li&gt;Audit client.rack on all consumers&lt;/li&gt;
&lt;li&gt;Revisit RF on high fan-out topics&lt;/li&gt;
&lt;li&gt;Check for AZ mismatches across clusters&lt;/li&gt;
&lt;/ul&gt;
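
&lt;p&gt;The client.rack audit in that list can start as a simple check over whatever consumer configs you can collect. The Python sketch below assumes you have already gathered each service's consumer properties into a dict; the service names and AZ IDs are hypothetical, and how you collect the configs is up to your deployment tooling.&lt;/p&gt;

```python
def audit_client_rack(consumer_configs: dict, broker_azs: set) -> dict:
    """Return services whose client.rack is missing or matches no broker AZ."""
    findings = {}
    for service, config in consumer_configs.items():
        rack = config.get("client.rack")
        if not rack:
            findings[service] = "client.rack is not set"
        elif rack not in broker_azs:
            findings[service] = f"client.rack={rack} has no broker in that AZ"
    return findings


# Hypothetical services and AZ IDs, for illustration only.
configs = {
    "billing": {"group.id": "billing", "client.rack": "use1-az1"},
    "search-indexer": {"group.id": "search-indexer"},
    "fraud": {"group.id": "fraud", "client.rack": "use1-az4"},
}
print(audit_client_rack(configs, broker_azs={"use1-az1", "use1-az2"}))
```

&lt;p&gt;A missing client.rack means the consumer always reads from the leader, while a value that matches no broker AZ points at a placement problem rather than a config problem.&lt;/p&gt;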

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnqs2835ohojc66ipe25t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnqs2835ohojc66ipe25t.png" alt=" " width="800" height="535"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A few things worth keeping in mind as you work through that list:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Services with significant pod presence outside broker AZs are paying a topology tax in the form of Data Transfer cost&lt;/li&gt;
&lt;li&gt;RF=3 with proper rack configuration may be cheaper than RF=2 on read-heavy topics&lt;/li&gt;
&lt;li&gt;The conventional Data Transfer cost ranking assumes single-consumer topics and doesn't generalize to high fan-out ones&lt;/li&gt;
&lt;li&gt;Compression is an important lever - better batching and better codecs reduce bytes on the wire, and that lowers cross-AZ cost directly&lt;/li&gt;
&lt;li&gt;For non-real-time consumers, another option is to remove them from Kafka entirely and serve them from S3 instead - one Kafka consumer writes to S3, and many analytical readers consume from S3 over a VPC gateway endpoint. That avoids Kafka fan-out for workloads that can tolerate seconds to minutes of latency&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Closing
&lt;/h3&gt;

&lt;p&gt;Kafka often looks expensive because the bill is driven by network topology, consumer fan-out, and placement - not just EC2 compute.&lt;/p&gt;

&lt;p&gt;The fix is usually the same:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;measure fan-out&lt;/li&gt;
&lt;li&gt;align placement with broker topology where possible&lt;/li&gt;
&lt;li&gt;improve compression&lt;/li&gt;
&lt;li&gt;reconsider RF or S3 indirection where read traffic dominates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The cost is rarely where people first look. Once you can see the fan-out, the leverage is obvious.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This is Part 1 of the Kafka Network Cost series. Part 2: Kafka's Real Compression Problem Is Batch Depth. Part 3: Fix the Codec Before You Touch the Schema. Part 4: The Cheapest Kafka Consumer Is One That Doesn't Read From Kafka. Part 5: The S3 GET Limit Nobody Plans For&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Keywords: Kafka cross-AZ cost, AWS DataTransfer-Regional-Bytes, Kafka network optimization, fan-out cost model, Kafka VPC topology, KIP-392, KIP-881, KIP-405.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>aws</category>
      <category>infrastructure</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
