<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alexander Alten</title>
    <description>The latest articles on DEV Community by Alexander Alten (@novatechflow).</description>
    <link>https://dev.to/novatechflow</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3672851%2F8fbb013f-e3bb-467b-acec-a661b4fe9151.jpeg</url>
      <title>DEV Community: Alexander Alten</title>
      <link>https://dev.to/novatechflow</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/novatechflow"/>
    <language>en</language>
    <item>
      <title>What Breaks When Kafka Meets Iceberg at Scale</title>
      <dc:creator>Alexander Alten</dc:creator>
      <pubDate>Sat, 24 Jan 2026 10:29:26 +0000</pubDate>
      <link>https://dev.to/novatechflow/what-breaks-when-kafka-meets-iceberg-at-scale-2e82</link>
      <guid>https://dev.to/novatechflow/what-breaks-when-kafka-meets-iceberg-at-scale-2e82</guid>
      <description>&lt;p&gt;I work on &lt;a href="https://kafscale.io" rel="noopener noreferrer"&gt;KafScale&lt;/a&gt;, as disclaimer. But I've also spent time in GitHub issues for Kafka Connect, Flink, and Hudi trying to understand why Kafka-to-Iceberg pipelines break in production. I wrote a &lt;a href="https://www.scalytics.io/blog/47-github-issues-that-explain-your-iceberg-latency" rel="noopener noreferrer"&gt;longer version of this on our company blog&lt;/a&gt; with more detail on each failure mode.&lt;/p&gt;

&lt;p&gt;The marketing makes it look simple. The reality is different.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fth1qz6ar7tlvrlfenrom.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fth1qz6ar7tlvrlfenrom.png" alt="The streaming integration tax" width="800" height="628"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;The Problem&lt;/h2&gt;

&lt;p&gt;Every data team eventually wants the same thing: streaming data in queryable tables. Kafka handles the streaming. Iceberg won the table format war. Getting data from one to the other should be straightforward.&lt;/p&gt;

&lt;p&gt;It's not.&lt;/p&gt;

&lt;p&gt;Search GitHub for "Iceberg sink" and you'll find 344 open issues. "Kafka Connect Iceberg" adds 89 more. "Flink Iceberg checkpoint" brings 127. "Hudi streaming" pulls up over 1,200.&lt;/p&gt;

&lt;p&gt;I read through enough of them to see the same failures repeating.&lt;/p&gt;




&lt;h2&gt;What Actually Breaks&lt;/h2&gt;

&lt;h3&gt;Kafka Connect Iceberg Sink&lt;/h3&gt;

&lt;p&gt;The connector looks straightforward in demos. Production is different.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Silent coordinator failures.&lt;/strong&gt; A mismatched consumer group ID makes the connector &lt;a href="https://github.com/apache/iceberg/issues/12610" rel="noopener noreferrer"&gt;fail silently&lt;/a&gt;. No error messages. Data flows in, nothing comes out. You find out three hours later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dual offset tracking.&lt;/strong&gt; Offsets stored in &lt;a href="https://github.com/databricks/iceberg-kafka-connect" rel="noopener noreferrer"&gt;two different consumer groups&lt;/a&gt;. Reset one, forget the other, lose data or duplicate it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Schema evolution crashes.&lt;/strong&gt; Drop a column, recreate it with a different type. &lt;a href="https://github.com/getindata/kafka-connect-iceberg-sink" rel="noopener noreferrer"&gt;Connector crashes&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Version hell.&lt;/strong&gt; The Avro converter &lt;a href="https://github.com/apache/iceberg/issues/12571" rel="noopener noreferrer"&gt;wants Avro 1.11.4&lt;/a&gt;; Iceberg 1.8.1 ships with 1.12.0. ClassNotFoundException at startup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Timeout storms.&lt;/strong&gt; Under load: &lt;code&gt;TimeoutException: Timeout expired after 60000ms while awaiting TxnOffsetCommitHandler&lt;/code&gt;. &lt;a href="https://github.com/apache/iceberg/issues/13457" rel="noopener noreferrer"&gt;Task killed&lt;/a&gt;. Manual intervention required.&lt;/p&gt;

&lt;h3&gt;Flink + Iceberg&lt;/h3&gt;

&lt;p&gt;Flink is the standard answer. It comes with its own problems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Small file apocalypse.&lt;/strong&gt; Frequent checkpoints create &lt;a href="https://github.com/apache/iceberg/issues/7568" rel="noopener noreferrer"&gt;thousands of KB-sized files&lt;/a&gt;. Query performance collapses. Metadata overhead explodes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compaction conflicts.&lt;/strong&gt; Compact the same partition your streaming job writes to. Get &lt;a href="https://github.com/apache/iceberg/issues/9089" rel="noopener noreferrer"&gt;write failures or corruption&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Checkpoint ghost commits.&lt;/strong&gt; Checkpoints complete but metadata files don't update. Tencent &lt;a href="https://github.com/apache/iceberg/issues/4557" rel="noopener noreferrer"&gt;built a custom operator&lt;/a&gt; because the default "will be invalid."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recovery failures.&lt;/strong&gt; FileNotFoundException on checkpoint recovery. No automatic fix.&lt;/p&gt;

&lt;h3&gt;Hudi&lt;/h3&gt;

&lt;p&gt;Similar story.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;30-minute latency.&lt;/strong&gt; Kafka to Spark to Hudi to AWS. A pipeline that should take seconds takes &lt;a href="https://github.com/apache/hudi/issues/11118" rel="noopener noreferrer"&gt;half an hour&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Upgrade breakage.&lt;/strong&gt; Version 0.12.1 to 0.13.0 &lt;a href="https://github.com/apache/hudi/issues/8890" rel="noopener noreferrer"&gt;breaks second micro-batch&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connection pool exhaustion.&lt;/strong&gt; Metadata service enabled, &lt;a href="https://github.com/apache/hudi/issues/8191" rel="noopener noreferrer"&gt;HTTP connections leak&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;The Cost&lt;/h2&gt;

&lt;p&gt;A 1 GiB/s streaming pipeline writing to Iceberg through connectors can cost &lt;a href="https://aiven.io/blog/why-dont-apache-kafka-and-iceberg-get-along" rel="noopener noreferrer"&gt;$3.4 million annually&lt;/a&gt; in duplicate storage and transfer fees.&lt;/p&gt;

&lt;p&gt;Data gets written to Kafka, copied to a connector, transformed, written to S3, then registered in a catalog. Four hops. Four failure points. Four cost centers.&lt;/p&gt;
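&lt;p&gt;The scale of that tax is easy to sanity-check with back-of-envelope arithmetic. The hop count comes from the paragraph above; everything else here is an illustrative assumption, not a reproduction of Aiven's cost model:&lt;/p&gt;

```python
# Back-of-envelope volume for a 1 GiB/s pipeline. Illustrative only:
# the $3.4M figure cited above is Aiven's, and this does not reproduce
# their cost model.
SECONDS_PER_YEAR = 365 * 24 * 3600

rate_gib_per_s = 1.0
yearly_pib = rate_gib_per_s * SECONDS_PER_YEAR / 1024 ** 2  # GiB to PiB
print(f"{yearly_pib:.1f} PiB ingested per year")  # about 30 PiB

# Every hop that stores or transfers the data again (broker, connector,
# S3 object, catalog-managed copy) multiplies the exposure.
hops = 4
print(f"up to {hops}x storage/transfer exposure across {hops} hops")
```

&lt;p&gt;At tens of PiB per year, even a fraction of a cent per GB per extra copy adds up to seven figures.&lt;/p&gt;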




&lt;h2&gt;The Options&lt;/h2&gt;

&lt;h3&gt;1. Kafka Connect Iceberg Sink&lt;/h3&gt;

&lt;p&gt;Works if you have simple schemas, moderate throughput, and an ops team that knows Connect.&lt;/p&gt;

&lt;p&gt;Doesn't work if you have schema evolution, high throughput, or need reliability without manual intervention.&lt;/p&gt;

&lt;h3&gt;2. Flink&lt;/h3&gt;

&lt;p&gt;Works if you have Flink expertise and can tune checkpoints, manage compaction separately, and handle the small file problem.&lt;/p&gt;

&lt;p&gt;Doesn't work if you want something simple or don't have Flink ops experience.&lt;/p&gt;

&lt;h3&gt;3. Confluent Tableflow&lt;/h3&gt;

&lt;p&gt;Works if you're on Confluent Cloud and topics have schemas.&lt;/p&gt;

&lt;p&gt;Doesn't work for &lt;a href="https://docs.confluent.io/cloud/current/topics/tableflow/overview.html" rel="noopener noreferrer"&gt;topics without schemas&lt;/a&gt;, self-managed Kafka, or external catalog sync.&lt;/p&gt;

&lt;p&gt;Upsert mode has limits: 30B unique keys, 20K events/sec under 6B rows. Additional charges coming 2026.&lt;/p&gt;

&lt;h3&gt;4. Storage-Native Architecture&lt;/h3&gt;

&lt;p&gt;This is the approach I took with KafScale.&lt;/p&gt;

&lt;p&gt;Write streaming data directly to S3 in a format analytical tools can read. No connector layer. No broker involvement for reads.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://kafscale.io" rel="noopener noreferrer"&gt;Iceberg Processor&lt;/a&gt; reads .kfs segments from S3, converts to Parquet, writes to Iceberg tables. Works with Unity Catalog, Polaris, AWS Glue.&lt;/p&gt;

&lt;p&gt;Zero broker load for analytical workloads.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kafscale.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;IcebergProcessor&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;events-to-iceberg&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;topics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;events&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;s3&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;bucket&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kafscale-data&lt;/span&gt;
      &lt;span class="na"&gt;prefix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;segments/&lt;/span&gt;
  &lt;span class="na"&gt;sink&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;catalog&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;unity&lt;/span&gt;
      &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://workspace.cloud.databricks.com/api/2.1/unity-catalog/iceberg&lt;/span&gt;
      &lt;span class="na"&gt;warehouse&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/Volumes/main/default/warehouse&lt;/span&gt;
    &lt;span class="na"&gt;database&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;analytics&lt;/span&gt;
  &lt;span class="na"&gt;processing&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;parallelism&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4&lt;/span&gt;
    &lt;span class="na"&gt;commitIntervalSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tradeoff: you're coupled to the .kfs format, not just the Kafka protocol. But the format is &lt;a href="https://kafscale.io" rel="noopener noreferrer"&gt;documented and public&lt;/a&gt;, and it's more stable than trying to keep Kafka Connect and Flink versions aligned.&lt;/p&gt;




&lt;h2&gt;When to Use What&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Simple schemas, low throughput, Connect expertise?&lt;/strong&gt; Kafka Connect Iceberg Sink.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Complex transformations, Flink team on staff?&lt;/strong&gt; Flink.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Confluent Cloud, schema registry everywhere?&lt;/strong&gt; Tableflow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Want to skip the connector layer entirely?&lt;/strong&gt; KafScale or similar storage-native approach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Need transactions?&lt;/strong&gt; Flink or Tableflow. Not KafScale.&lt;/p&gt;




&lt;h2&gt;Why I Built Another Thing&lt;/h2&gt;

&lt;p&gt;I kept seeing the same pattern: teams build Kafka-to-Iceberg pipelines, hit one of these issues, spend weeks debugging, then either add more infrastructure or accept the latency.&lt;/p&gt;

&lt;p&gt;The connector model assumes brokers are the only way to access data. That made sense when storage was expensive. S3 at $0.02/GB/month changed the math.&lt;/p&gt;

&lt;p&gt;If your storage format is documented, processors can read directly from S3. No broker load. No connector framework. Kubernetes pods that scale independently.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://kafscale.io" rel="noopener noreferrer"&gt;.kfs format&lt;/a&gt; is public. Build your own processors if you want. The Iceberg Processor is just the one we needed first.&lt;/p&gt;




&lt;h2&gt;Links&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.scalytics.io/blog/47-github-issues-that-explain-your-iceberg-latency" rel="noopener noreferrer"&gt;Full writeup on Scalytics blog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kafscale.io" rel="noopener noreferrer"&gt;KafScale Iceberg Processor&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberg.apache.org/docs/latest/kafka-connect/" rel="noopener noreferrer"&gt;Apache Iceberg Kafka Connect Sink&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.confluent.io/cloud/current/topics/tableflow/overview.html" rel="noopener noreferrer"&gt;Confluent Tableflow Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aiven.io/blog/why-dont-apache-kafka-and-iceberg-get-along" rel="noopener noreferrer"&gt;Aiven: Why Kafka and Iceberg Don't Get Along&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kafka</category>
      <category>dataengineering</category>
      <category>iceberg</category>
      <category>devops</category>
    </item>
    <item>
      <title>SQL on Kafka Data Does Not Require a Streaming Engine</title>
      <dc:creator>Alexander Alten</dc:creator>
      <pubDate>Wed, 14 Jan 2026 07:55:00 +0000</pubDate>
      <link>https://dev.to/novatechflow/sql-on-kafka-data-does-not-require-a-streaming-engine-3kfe</link>
      <guid>https://dev.to/novatechflow/sql-on-kafka-data-does-not-require-a-streaming-engine-3kfe</guid>
      <description>&lt;p&gt;Stream processing engines solved a real problem: continuous computation over unbounded data. Flink, ksqlDB, and Kafka Streams gave teams a way to run SQL-like queries against event streams without writing custom consumers.&lt;/p&gt;

&lt;p&gt;The operational cost of that solution is widely acknowledged. Confluent's own documentation notes that Flink "poses difficulties with deployment and cluster operations, such as tuning performance or resolving checkpoint failures" and that "organizations using Flink tend to require teams of experts dedicated to developing and maintaining it."&lt;/p&gt;

&lt;p&gt;For a large share of the questions teams ask their Kafka data, a simpler architecture exists: SQL on immutable segments in object storage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ahgbacjjt30lnc0pf0n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ahgbacjjt30lnc0pf0n.png" alt="Most teams need a streaming interface, not a streaming engine." width="800" height="354"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;What engineers actually ask Kafka&lt;/h3&gt;

&lt;p&gt;In production debugging sessions and ops reviews, the questions are repetitive:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is in this topic right now?&lt;/li&gt;
&lt;li&gt;What happened around an incident window?&lt;/li&gt;
&lt;li&gt;Where is the message with this key?&lt;/li&gt;
&lt;li&gt;Are all partitions still producing data?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are not streaming problems. They are bounded lookups over historical data. They run once, terminate, and do not need windows, watermarks, checkpoints, or state recovery.&lt;/p&gt;

&lt;h3&gt;Kafka data is already structured for this&lt;/h3&gt;

&lt;p&gt;Kafka does not persist records individually. It appends them to log segments and rolls those segments by size or time. Each partition is an ordered sequence of records; once a segment is closed, it is immutable.&lt;/p&gt;

&lt;p&gt;Kafka also maintains sparse indexes so readers can seek by offset and timestamp efficiently. Each segment file is accompanied by lightweight offset and timestamp indexes that allow consumers to seek directly to specific message positions without scanning entire files.&lt;/p&gt;
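&lt;p&gt;The lookup those sparse indexes enable can be sketched in a few lines. This illustrates the idea (find the rightmost index entry at or below the target, then scan forward), not Kafka's actual &lt;code&gt;.index&lt;/code&gt; file format:&lt;/p&gt;

```python
import bisect

# Sparse offset index sketch: (relative_offset, byte_position) pairs, one
# entry per indexed interval. This mirrors the idea behind Kafka's .index
# files, not their on-disk format.
index = [(0, 0), (100, 4096), (200, 8192), (300, 12288)]
offsets = [entry[0] for entry in index]

def seek(target_offset):
    """Byte position to start scanning from to reach target_offset."""
    # Rightmost index entry at or below the target.
    i = bisect.bisect_right(offsets, target_offset) - 1
    return index[max(i, 0)][1]

print(seek(250))  # 8192: jump there, then read forward to offset 250
```

&lt;p&gt;Because the index is sparse, it stays small; the cost is a short forward scan from the indexed position.&lt;/p&gt;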

&lt;p&gt;Retention deletes whole segments. Compaction rewrites segments. This means Kafka data is already organized like a SQL-on-files dataset. The only difference is where the files live.&lt;/p&gt;

&lt;p&gt;Since Kafka 3.6.0, tiered storage allows these segments to live in object storage like S3. As of Kafka 3.9.0, this feature is production-ready. Durability is now decoupled from compute without changing the data model.&lt;/p&gt;

&lt;h3&gt;The streaming engine "tax"&lt;/h3&gt;

&lt;p&gt;Streaming engines pay for capabilities most queries never use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Distributed state backends&lt;/li&gt;
&lt;li&gt;Coordinated checkpoints&lt;/li&gt;
&lt;li&gt;Watermark tracking&lt;/li&gt;
&lt;li&gt;Long-running cluster operations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That cost is justified for continuous aggregation, joins, and real-time inference.&lt;/p&gt;

&lt;p&gt;It is wasted for "show me the last 10 messages".&lt;/p&gt;

&lt;p&gt;Production experience confirms this. Riskified migrated from ksqlDB to Flink, noting that ksqlDB's strict limitations on evolving schemas made it impractical for real-world production use cases and that operational complexity required fighting the system more than working with it.&lt;/p&gt;

&lt;p&gt;The scale mismatch is also documented. Vendor surveys from Confluent and Redpanda show that approximately 56% of all Kafka clusters run at or below 1 MB/s. Most Kafka usage is small-data, yet teams pay big-data operational costs.&lt;/p&gt;

&lt;h3&gt;SQL on immutable segments&lt;/h3&gt;

&lt;p&gt;If Kafka data lives as immutable segments with sparse indexes, querying it looks like any other SQL-on-files workload.&lt;/p&gt;

&lt;p&gt;The query planner:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Resolves the topic to segment files&lt;/li&gt;
&lt;li&gt;Filters by timestamp or offset metadata&lt;/li&gt;
&lt;li&gt;Reads only relevant segments&lt;/li&gt;
&lt;li&gt;Applies predicates and returns results&lt;/li&gt;
&lt;/ul&gt;
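&lt;p&gt;The planner steps above can be sketched as a plain function over segment metadata. The dict shape here is hypothetical, purely for illustration:&lt;/p&gt;

```python
# Illustrative bounded-query planner over segment metadata. The segment
# dict shape is hypothetical, not a real on-disk format.
segments = [
    {"path": "seg-000", "min_ts": 100, "max_ts": 199,
     "records": [{"ts": 150, "key": "order-a"}]},
    {"path": "seg-001", "min_ts": 200, "max_ts": 299,
     "records": [{"ts": 250, "key": "order-b"}]},
    {"path": "seg-002", "min_ts": 300, "max_ts": 399,
     "records": [{"ts": 350, "key": "order-a"}]},
]

def query(lo, hi, predicate):
    # Prune: keep only segments whose [min_ts, max_ts] overlaps [lo, hi].
    relevant = [s for s in segments
                if s["max_ts"] >= lo and hi >= s["min_ts"]]
    # Scan the survivors, applying time bounds and the row predicate.
    return [rec for seg in relevant for rec in seg["records"]
            if rec["ts"] >= lo and hi >= rec["ts"] and predicate(rec)]

print(query(100, 299, lambda r: r["key"] == "order-a"))  # the ts=150 record
```

&lt;p&gt;Segment pruning is the whole trick: the time bounds eliminate files before a single byte is read.&lt;/p&gt;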

&lt;p&gt;No consumer groups. No offset commits. No streaming job lifecycle.&lt;/p&gt;

&lt;p&gt;Expose Kafka-native fields as columns and the common queries become trivial:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;TAIL&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="s1"&gt;'2026-01-08 09:00'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="s1"&gt;'2026-01-08 09:05'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'order-12345'&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;interval&lt;/span&gt; &lt;span class="s1"&gt;'24 hours'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is not stream processing. It is indexed file access with SQL semantics.&lt;/p&gt;

&lt;h3&gt;Latency, realistically&lt;/h3&gt;

&lt;p&gt;Yes, object storage is slower than broker-local disk: remote reads carry noticeably higher latency than local block storage.&lt;/p&gt;

&lt;p&gt;That is fine.&lt;/p&gt;

&lt;p&gt;Most of these queries are debugging and ops workflows. Waiting one or two seconds is acceptable. Waiting minutes to deploy or restart a streaming job is not.&lt;/p&gt;

&lt;p&gt;If you need sub-second continuous results, use a streaming engine. That boundary is clear.&lt;/p&gt;

&lt;h3&gt;Cost visibility beats hidden complexity&lt;/h3&gt;

&lt;p&gt;The real risk with SQL on object storage is unbounded scans. Object storage pricing is calculated based on the amount of data stored and the number of API calls made.&lt;/p&gt;

&lt;p&gt;The solution is not more infrastructure. It is transparency.&lt;/p&gt;

&lt;p&gt;Every query should show:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How many segments will be read&lt;/li&gt;
&lt;li&gt;How many bytes will be scanned&lt;/li&gt;
&lt;li&gt;The estimated request cost&lt;/li&gt;
&lt;/ul&gt;
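&lt;p&gt;A pre-flight report like that is a few lines of arithmetic. The GET price below is an assumed ballpark figure, not a quoted rate; plug in your provider's pricing:&lt;/p&gt;

```python
# Pre-flight scan report for a bounded query. The GET price is an assumed
# ballpark figure, not a quoted rate.
GET_COST_PER_1K_USD = 0.0004

def scan_report(num_segments, bytes_per_segment):
    total_bytes = num_segments * bytes_per_segment
    return {
        "segments": num_segments,
        "gib_scanned": round(total_bytes / 1024 ** 3, 2),
        "request_cost_usd": round(num_segments / 1000 * GET_COST_PER_1K_USD, 6),
    }

# 1,200 segments of 256 MiB each: 300 GiB scanned for well under a cent
# in request fees. The surprise comes from unbounded scans, not bounded ones.
report = scan_report(num_segments=1200, bytes_per_segment=256 * 1024 ** 2)
print(report)
```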

&lt;p&gt;Queries without time bounds should require explicit opt-in.&lt;/p&gt;

&lt;p&gt;This keeps cost a conscious decision instead of a surprise.&lt;/p&gt;

&lt;h3&gt;Where streaming engines still belong&lt;/h3&gt;

&lt;p&gt;Streaming engines are still the right tool for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Continuous aggregations&lt;/li&gt;
&lt;li&gt;Joins over live streams&lt;/li&gt;
&lt;li&gt;Real-time scoring&lt;/li&gt;
&lt;li&gt;Exactly-once outputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most Kafka interactions are not those.&lt;/p&gt;

&lt;p&gt;They are lookups and inspections that were forced into streaming infrastructure because no better interface existed.&lt;/p&gt;

&lt;p&gt;Once Kafka data is durable as immutable segments, SQL becomes the simpler tool.&lt;/p&gt;




&lt;h3&gt;The takeaway&lt;/h3&gt;

&lt;p&gt;Most teams do not need a streaming engine to answer Kafka questions.&lt;/p&gt;

&lt;p&gt;They need a clean, bounded way to query immutable data.&lt;/p&gt;

&lt;p&gt;SQL on Kafka segments does exactly that.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Read a more detailed post at &lt;a href="https://www.novatechflow.com/2026/01/sql-on-streaming-data-does-not-require.html" rel="noopener noreferrer"&gt;https://www.novatechflow.com/2026/01/sql-on-streaming-data-does-not-require.html&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>sql</category>
      <category>dataengineering</category>
      <category>eventdriven</category>
    </item>
    <item>
      <title>S3-Native Kafka Alternatives: What's Actually Different</title>
      <dc:creator>Alexander Alten</dc:creator>
      <pubDate>Fri, 02 Jan 2026 15:03:01 +0000</pubDate>
      <link>https://dev.to/novatechflow/s3-native-kafka-alternatives-whats-actually-different-11d2</link>
      <guid>https://dev.to/novatechflow/s3-native-kafka-alternatives-whats-actually-different-11d2</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fon9fbzlvet5sea9t37gn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fon9fbzlvet5sea9t37gn.png" alt="KafScale stateless processor architecture" width="800" height="384"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I work on KafScale, so take this with the appropriate grain of salt. But I've also spent time looking at WarpStream, AutoMQ, and Bufstream, and the marketing pages don't tell you what you actually need to know.&lt;/p&gt;

&lt;p&gt;They all store data in S3. They all claim to be cheaper than Kafka. Here's what's actually different.&lt;/p&gt;




&lt;h2&gt;WarpStream&lt;/h2&gt;

&lt;p&gt;Confluent bought them in September 2024. The agents run in your VPC, but metadata and coordination run in Confluent's cloud.&lt;/p&gt;

&lt;p&gt;Latency is 400-600ms p99. That's the cost of writing directly to S3 with no local buffer.&lt;/p&gt;

&lt;p&gt;If you're already a Confluent shop and want S3 pricing without running infrastructure, this makes sense. If you don't want a cloud dependency, look elsewhere.&lt;/p&gt;




&lt;h2&gt;AutoMQ&lt;/h2&gt;

&lt;p&gt;Fork of Kafka with a new storage layer. Uses EBS as a write-ahead log, then tiers to S3.&lt;/p&gt;

&lt;p&gt;Latency is around 10ms p99 because of the EBS buffer. That's close to real Kafka.&lt;/p&gt;

&lt;p&gt;The catch: you're still managing EBS volumes. It's simpler than Kafka, but it's not stateless. The license has also changed over time (BSL until May 2025), so read the current terms if you're building a platform.&lt;/p&gt;




&lt;h2&gt;Bufstream&lt;/h2&gt;

&lt;p&gt;From the Buf/Protobuf people. S3 for storage, PostgreSQL for metadata. Native Iceberg output.&lt;/p&gt;

&lt;p&gt;Latency around 500ms p99. Similar to WarpStream.&lt;/p&gt;

&lt;p&gt;If you're building on Iceberg and want Kafka-compatible ingestion, this is purpose-built for that. If you're not in the lakehouse world, the PostgreSQL dependency is extra infrastructure for no benefit.&lt;/p&gt;




&lt;h2&gt;KafScale&lt;/h2&gt;

&lt;p&gt;Stateless Go brokers, S3 for storage, etcd for coordination. Apache 2.0 license.&lt;/p&gt;

&lt;p&gt;Latency around 400ms p99. Same ballpark as WarpStream.&lt;/p&gt;

&lt;p&gt;No transactions. No compacted topics. If you need those, use something else.&lt;/p&gt;

&lt;p&gt;What's different: the segment format is documented and open. You can write processors that read directly from S3 without hitting brokers. That matters if you have analytical workloads (batch replay, Iceberg materialization, AI agents pulling context) that you want to keep separate from your streaming traffic.&lt;/p&gt;

&lt;p&gt;The tradeoff is coupling. Your processors depend on the &lt;code&gt;.kfs&lt;/code&gt; format, not just the Kafka protocol.&lt;/p&gt;




&lt;h2&gt;When to use what&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Need low latency (&amp;lt;100ms)?&lt;/strong&gt; AutoMQ or stick with Kafka/Redpanda.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Want managed S3 streaming with Confluent ecosystem?&lt;/strong&gt; WarpStream.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Building on Iceberg?&lt;/strong&gt; Bufstream.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Want Apache 2.0 license and direct S3 reads?&lt;/strong&gt; KafScale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Need transactions?&lt;/strong&gt; Not KafScale. Kafka or Bufstream.&lt;/p&gt;




&lt;h2&gt;The latency thing&lt;/h2&gt;

&lt;p&gt;400-500ms is fine for most workloads. Log aggregation, ETL, async events, audit trails. If you're honest about your actual requirements, you probably don't need 10ms.&lt;/p&gt;

&lt;p&gt;But if you do need it, the pure S3 options won't work for you. AutoMQ with EBS is the compromise.&lt;/p&gt;




&lt;h2&gt;The license thing&lt;/h2&gt;

&lt;p&gt;WarpStream is proprietary (Confluent). AutoMQ was BSL until May 2025[1]. Bufstream is proprietary. KafScale is Apache 2.0.&lt;/p&gt;

&lt;p&gt;If you care about this, you already know why. If you don't, it probably won't matter until it does.&lt;/p&gt;




&lt;h2&gt;Why I built another one&lt;/h2&gt;

&lt;p&gt;After looking at all of these, I still wrote KafScale. Here's why.&lt;/p&gt;

&lt;p&gt;WarpStream got the architecture right: stateless brokers, S3 storage, no disk ops. But it's proprietary and now owned by Confluent. I wanted that architecture without the dependency.&lt;/p&gt;

&lt;p&gt;More importantly, I wanted processors that bypass brokers entirely. Kubernetes pods that read directly from S3, process historical data, write to Iceberg, feed AI agents. No connector framework. No fighting for broker resources. Just pods and object storage.&lt;/p&gt;

&lt;p&gt;Someone pointed out that this makes processors "fat clients" coupled to the storage format. Fair. But Kafka's message format has had three versions in 15 years. V2 has been stable since 2017. The entire ecosystem depends on it not changing. That's a bet I'm willing to make.&lt;/p&gt;

&lt;p&gt;The alternative is routing everything through brokers. Then your batch replay jobs compete with your real-time consumers. Your AI training pipeline spikes latency for everyone. That's the problem I was trying to solve.&lt;/p&gt;

&lt;p&gt;Open format. Open license. Processors that scale independently from brokers. That's the gap none of the others filled.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;More on the architecture: &lt;a href="https://www.scalytics.io/blog/streaming-data-becomes-storage-native" rel="noopener noreferrer"&gt;Streaming Data Becomes Storage-Native&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;[1] Was BSL until May 2025. Changed for Strimzi compatibility to support K8s rollouts. No community announcement. &lt;/p&gt;

</description>
      <category>kafka</category>
      <category>dataengineering</category>
      <category>opensource</category>
      <category>devops</category>
    </item>
    <item>
      <title>Data Processing Does Not Belong in the Message Broker</title>
      <dc:creator>Alexander Alten</dc:creator>
      <pubDate>Mon, 29 Dec 2025 14:10:08 +0000</pubDate>
      <link>https://dev.to/novatechflow/data-processing-does-not-belong-in-the-message-broker-54mn</link>
      <guid>https://dev.to/novatechflow/data-processing-does-not-belong-in-the-message-broker-54mn</guid>
      <description>&lt;p&gt;Kafka changed the industry by making event streaming practical at scale.&lt;/p&gt;

&lt;p&gt;Over time, people started pushing data processing into the streaming platform itself. Kafka Streams, ksqlDB, broker-side transforms. It looks convenient on paper. In production, it often turns into operational friction.&lt;/p&gt;

&lt;p&gt;Incidents, benchmarks, and vendor documentation all point to the same conclusion: &lt;strong&gt;data processing does not belong in the streaming platform.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;State recovery does not scale&lt;/h2&gt;

&lt;p&gt;Kafka Streams restores state by replaying changelog topics. There is no checkpointing mechanism. Recovery time grows with state size.&lt;/p&gt;

&lt;p&gt;One publicly documented incident: a state store restored from offset 0 through more than 2.8 million records, taking over two minutes. The producer transaction timeout was one minute. The application entered an ERROR state with no automatic recovery.&lt;/p&gt;

&lt;p&gt;Practitioners regularly raise this problem when running Kafka Streams at scale:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://stackoverflow.com/questions/48277466/kafka-streams-state-store-restore-time" rel="noopener noreferrer"&gt;Kafka Streams state restore discussion (Stack Overflow)&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.confluent.io/platform/current/streams/architecture.html#state-store-recovery" rel="noopener noreferrer"&gt;Kafka Streams state recovery explanation (Confluent docs)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Recovery by replay means restart time is proportional to how much state you accumulated. Once the state grows beyond "small", recovery becomes part of your availability risk.&lt;/p&gt;
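&lt;p&gt;That proportionality is easy to put numbers on. A rough model; the restore rate is an assumed figure, while the 2.8 million records and the one-minute timeout come from the incident above:&lt;/p&gt;

```python
# Replay-based restore: restart time grows linearly with changelog size.
# The restore rate is an assumption; the 2.8M records and one-minute
# transaction timeout come from the incident described above.
TIMEOUT_S = 60.0
RESTORE_RATE = 20_000.0  # records/second, assumed

def restore_seconds(changelog_records):
    return changelog_records / RESTORE_RATE

for records in (500_000, 2_800_000, 10_000_000):
    t = restore_seconds(records)
    status = "ok" if TIMEOUT_S >= t else "exceeds producer txn timeout"
    print(f"{records:>10,} records: {t:6.0f}s ({status})")
```

&lt;p&gt;At the assumed rate, 2.8 million records take 140 seconds, consistent with the "more than two minutes" in the incident, and well past the one-minute timeout.&lt;/p&gt;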

&lt;p&gt;Processing engines took a different approach years ago. They checkpoint state and restore from snapshots instead of replaying everything from the beginning. That difference shows up the first time you actually need to recover under load.&lt;/p&gt;
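&lt;p&gt;A toy model makes the difference concrete. This is plain Python, not Kafka Streams or Flink code; the record counts and the checkpoint interval are illustrative numbers, not measured values:&lt;/p&gt;

```python
# Replay-based recovery (changelog style): every record since offset 0
# must be reapplied, so recovery work grows with total state size.
def replay_recovery_records(changelog_len: int) -> int:
    return changelog_len

# Checkpoint-based recovery (snapshot style): restore the last snapshot,
# then replay only the records written since that checkpoint.
def checkpoint_recovery_records(changelog_len: int, checkpoint_interval: int) -> int:
    return changelog_len % checkpoint_interval

print(replay_recovery_records(2_837_500))               # 2837500 records to reapply
print(checkpoint_recovery_records(2_837_500, 100_000))  # 37500 records to reapply
```

&lt;p&gt;The first number keeps growing with your state; the second is bounded by the checkpoint interval, which is why snapshot-based engines recover in roughly constant time.&lt;/p&gt;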




&lt;h2&gt;Exactly-once is more limited than it sounds&lt;/h2&gt;

&lt;p&gt;Kafka's exactly-once semantics apply inside Kafka. Spring's official documentation states it clearly: the read and process steps are still at-least-once.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.spring.io/spring-kafka/reference/html/#exactly-once" rel="noopener noreferrer"&gt;Spring Kafka exactly-once semantics&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As soon as you write to a database, call an external service, or touch anything outside Kafka, duplicate handling is your problem again.&lt;/p&gt;
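&lt;p&gt;The standard mitigation is an idempotent sink: identify each record by a stable key so a redelivered record is applied at most once. A minimal sketch in plain Python (no Kafka client; the record shape and key choice are hypothetical, and a real sink would use a unique index or upsert in the target database):&lt;/p&gt;

```python
class IdempotentSink:
    """Absorbs at-least-once redeliveries by tracking applied record keys."""

    def __init__(self):
        self.applied = set()   # in production: a unique constraint in the sink DB
        self.rows = []

    def write(self, key, value) -> bool:
        # key must uniquely identify the record, e.g. (topic, partition, offset)
        if key in self.applied:
            return False       # duplicate delivery after a retry or rebalance: skip
        self.applied.add(key)
        self.rows.append(value)
        return True

sink = IdempotentSink()
sink.write(("orders", 0, 42), {"amount": 10})
sink.write(("orders", 0, 42), {"amount": 10})  # redelivered: applied only once
print(len(sink.rows))  # 1
```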

&lt;p&gt;Kafka also documents the scaling problems this creates. Before Kafka 2.5, exactly-once required one transactional producer per input partition. At scale, that meant thousands of producers, each with its own buffers, threads, and network connections.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://cwiki.apache.org/confluence/display/KAFKA/KIP-447%3A+Producer+scalability+for+exactly+once+semantics" rel="noopener noreferrer"&gt;KIP-447: Producer scalability for EOS&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Kafka explicitly calls this an architecture that does not scale well as partition counts increase.&lt;/p&gt;
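&lt;p&gt;Back-of-the-envelope arithmetic shows why. The producer's &lt;code&gt;buffer.memory&lt;/code&gt; default is 32 MiB, so with one transactional producer per input partition, buffer memory alone scales linearly with partition count (the partition count below is an illustrative figure):&lt;/p&gt;

```python
BUFFER_MEMORY = 33_554_432  # Kafka producer buffer.memory default: 32 MiB

def producer_buffer_bytes(input_partitions: int) -> int:
    # Pre-KIP-447 model: one transactional producer per input partition.
    return input_partitions * BUFFER_MEMORY

# 1,000 input partitions -> ~31 GiB of producer buffers,
# before counting threads, sockets, or any actual data.
print(producer_buffer_bytes(1_000) / 2**30)
```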




&lt;h2&gt;ksqlDB made the limits obvious&lt;/h2&gt;

&lt;p&gt;Riskified published their migration story in 2025. Schema evolution in ksqlDB did not automatically include new fields. Fixing it required dropping and recreating streams, disrupting offsets and production pipelines. Shared clusters made recovery unpredictable.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/big-data/unlock-self-serve-streaming-sql-with-amazon-managed-service-for-apache-flink/" rel="noopener noreferrer"&gt;Riskified migration case (AWS Big Data Blog)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;ksqlDB was not sustainable for their production workloads, so they moved to Flink.&lt;/p&gt;

&lt;p&gt;Confluent's own documentation backs this up. Push queries create continuous consumers. Pull queries create burst consumers. Both add load that is hard to predict and can affect other workloads in the same cluster.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.confluent.io/platform/current/ksqldb/concepts/queries.html" rel="noopener noreferrer"&gt;ksqlDB query types and resource usage&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;Even vendors draw a boundary&lt;/h2&gt;

&lt;p&gt;Redpanda's Data Transforms documentation is explicit. Transforms are limited to single-message operations. No joins. No aggregations. No external access. A small number of output topics. At-least-once semantics only.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.redpanda.com/current/develop/data-transforms/" rel="noopener noreferrer"&gt;Redpanda Data Transforms limitations&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For anything more complex, their recommendation is to use a dedicated processing engine like Apache Flink.&lt;/p&gt;
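&lt;p&gt;The boundary is easy to state in code: a single-message transform is a pure function from one record to zero or more output records, with no state carried across calls. A hedged sketch in plain Python (not the Redpanda transform API; the field names are made up):&lt;/p&gt;

```python
def redact_and_filter(record: dict) -> list:
    # Single-message transform: the decision uses only this one record.
    # This is the kind of logic broker-side transforms allow.
    if record.get("type") == "debug":
        return []                  # filter: emit nothing
    out = dict(record)
    out.pop("email", None)         # reshape: drop a PII field
    return [out]

print(redact_and_filter({"type": "debug"}))                      # []
print(redact_and_filter({"type": "order", "id": 1, "email": "x"}))

# A join or aggregation needs state shared across records (e.g. a count
# per user), which is exactly what this model excludes -- that logic
# belongs in a dedicated engine like Flink.
```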

&lt;p&gt;Confluent acquired Immerok, a managed Flink provider, and is integrating Flink into its cloud offering. That move acknowledges what the architecture already tells you: serious stream processing requires a different execution model than a Kafka-native library.&lt;/p&gt;




&lt;h2&gt;The architectural issue&lt;/h2&gt;

&lt;p&gt;Streaming platforms are built for durable logs, ordering guarantees, fan-out, and backpressure.&lt;/p&gt;

&lt;p&gt;They are not built to be stateful compute engines with fast recovery, checkpoint coordination, or complex query runtimes with strong resource isolation.&lt;/p&gt;

&lt;p&gt;Once transport and processing are coupled, scaling, recovery, and cost are coupled too. You cannot scale processing without scaling brokers. You cannot tune recovery independently. Compute costs get buried inside your Kafka bill.&lt;/p&gt;




&lt;h2&gt;What works in practice&lt;/h2&gt;

&lt;p&gt;Separating concerns works.&lt;/p&gt;

&lt;p&gt;Kafka or Redpanda handle transport. A dedicated processing engine handles state, checkpoints, and complex logic. When pipelines span multiple engines, something like &lt;a href="https://wayang.apache.org/" rel="noopener noreferrer"&gt;Apache Wayang&lt;/a&gt; can orchestrate across them.&lt;/p&gt;

&lt;p&gt;Lightweight transformations inside the streaming layer still make sense. Filtering, format normalization, and simple enrichment cover many cases.&lt;/p&gt;

&lt;p&gt;Core business logic with state, joins, and external writes does not.&lt;/p&gt;
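&lt;p&gt;"Format normalization" in the streaming layer can be as small as this sketch, which rewrites an epoch-milliseconds timestamp into ISO-8601 before records reach downstream consumers (plain Python; the field name &lt;code&gt;ts&lt;/code&gt; is a made-up example):&lt;/p&gt;

```python
from datetime import datetime, timezone

def normalize(record: dict) -> dict:
    # Stateless, per-record normalization: safe to run in the streaming layer.
    out = dict(record)
    if isinstance(out.get("ts"), int):
        out["ts"] = datetime.fromtimestamp(out["ts"] / 1000, tz=timezone.utc).isoformat()
    return out

print(normalize({"event": "click", "ts": 1735689600000}))
# {'event': 'click', 'ts': '2025-01-01T00:00:00+00:00'}
```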

&lt;p&gt;&lt;strong&gt;If you are running joins, aggregations, or external writes inside your streaming platform today, what happens when you need to recover?&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;I published the full version with vendor documentation quotes, Kafka KIPs, migration case studies, and architecture diagrams on my blog:&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://www.novatechflow.com/2025/12/data-processing-does-not-belong-in.html" rel="noopener noreferrer"&gt;Data Processing Does Not Belong in the Message Broker&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>dataengineering</category>
      <category>data</category>
    </item>
    <item>
      <title>Stateless Kafka-compatible brokers backed by object storage, k8s native</title>
      <dc:creator>Alexander Alten</dc:creator>
      <pubDate>Sun, 21 Dec 2025 16:04:15 +0000</pubDate>
      <link>https://dev.to/novatechflow/stateless-kafka-compatible-brokers-backed-by-object-storage-k8s-native-4lo1</link>
      <guid>https://dev.to/novatechflow/stateless-kafka-compatible-brokers-backed-by-object-storage-k8s-native-4lo1</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fox2ytyzduzsuzh7b3dwq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fox2ytyzduzsuzh7b3dwq.png" alt=" " width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Running Kafka-style systems on Kubernetes is possible, but it often feels like fighting the model rather than working with it.&lt;/p&gt;

&lt;p&gt;Common operational pain points include stateful brokers, local disks, tight coupling between compute and storage, painful scaling events, and recovery paths that are harder than they should be. This becomes especially visible when clusters grow, traffic patterns fluctuate, or upgrades are frequent.&lt;/p&gt;

&lt;p&gt;We started experimenting with an alternative design that keeps Kafka protocol compatibility but changes the underlying assumptions. &lt;/p&gt;

&lt;p&gt;The core ideas are simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Brokers are stateless and disposable&lt;/li&gt;
&lt;li&gt;Message segments are stored in object storage (S3 or compatible)&lt;/li&gt;
&lt;li&gt;Scaling brokers becomes a compute concern, not a data migration problem&lt;/li&gt;
&lt;li&gt;Retention and durability are handled by object storage lifecycle policies&lt;/li&gt;
&lt;/ol&gt;
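&lt;p&gt;Points 1 and 2 can be sketched as brokers that only translate between the protocol and immutable segment objects. This toy Python model is illustrative only; the object-key layout is invented, not KafScale's actual format:&lt;/p&gt;

```python
class ObjectStore:
    """Stand-in for S3 or a compatible store."""

    def __init__(self):
        self.objects = {}

    def put(self, key, data):
        self.objects[key] = data

    def get(self, key):
        return self.objects[key]

class StatelessBroker:
    """Holds no local log: all durable state lives in the object store."""

    def __init__(self, store: ObjectStore):
        self.store = store

    def _key(self, topic, partition, base_offset):
        # Hypothetical layout: topic/partition/zero-padded base offset
        return f"{topic}/{partition}/{base_offset:020d}.segment"

    def append_segment(self, topic, partition, base_offset, records):
        self.store.put(self._key(topic, partition, base_offset), records)

    def read_segment(self, topic, partition, base_offset):
        return self.store.get(self._key(topic, partition, base_offset))

store = ObjectStore()
# Two broker instances share the store, so either can serve any partition;
# replacing a broker pod is a pure compute operation, with no data migration.
b1, b2 = StatelessBroker(store), StatelessBroker(store)
b1.append_segment("events", 0, 0, [b"r0", b"r1"])
print(b2.read_segment("events", 0, 0))  # [b'r0', b'r1']
```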

&lt;p&gt;KafScale had its initial release yesterday and is not meant to replace Kafka everywhere. It’s a DevOps-friendly drop-in designed with minimal ops in mind. &lt;/p&gt;

&lt;p&gt;We’ve been building this as an open-source project, licensed under Apache 2.0 and designed to be fully self-hosted.&lt;/p&gt;

&lt;p&gt;Repository and technical details:&lt;br&gt;
&lt;a href="https://github.com/novatechflow/kafscale" rel="noopener noreferrer"&gt;https://github.com/novatechflow/kafscale&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Architecture and Docs:&lt;br&gt;
&lt;a href="https://kafscale.io" rel="noopener noreferrer"&gt;https://kafscale.io&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At this stage, the most valuable thing for us is feedback from people who operate streaming systems in production: where this model makes sense, where it breaks down, and what tradeoffs matter most in practice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;:&lt;br&gt;
A deeper architectural and historical analysis of stateless Kafka-compatible brokers is now available here:&lt;br&gt;
&lt;a href="https://www.novatechflow.com/2025/12/kafka-on-object-storage-was-inevitable.html" rel="noopener noreferrer"&gt;https://www.novatechflow.com/2025/12/kafka-on-object-storage-was-inevitable.html&lt;/a&gt;&lt;/p&gt;

</description>
      <category>go</category>
      <category>kubernetes</category>
      <category>opensource</category>
      <category>kafka</category>
    </item>
  </channel>
</rss>
