Kafka changed the industry by making event streaming practical at scale.
Over time, people started pushing data processing into the streaming platform itself. Kafka Streams, ksqlDB, broker-side transforms. It looks convenient on paper. In production, it often turns into operational friction.
Incidents, migration case studies, and vendor documentation all point to the same conclusion: data processing does not belong in the streaming platform.
State recovery does not scale
Kafka Streams restores state by replaying changelog topics. There is no snapshot-based checkpointing: an instance that loses its local state has to replay the changelog from the beginning, so recovery time grows with state size.
One publicly documented incident: a state store restoring from offset 0 through more than 2.8 million records took over two minutes, while the producer transaction timeout was one minute. The application entered an ERROR state with no automatic recovery.
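You can watch this replay happen yourself. Kafka Streams exposes a StateRestoreListener that reports the offset range being replayed on startup. A minimal sketch, with illustrative class and log messages:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.streams.processor.StateRestoreListener;

// Logs how long changelog replay takes when a Kafka Streams instance starts up.
public class RestoreTimingListener implements StateRestoreListener {
    private final Map<TopicPartition, Long> startTimes = new ConcurrentHashMap<>();

    @Override
    public void onRestoreStart(TopicPartition partition, String storeName,
                               long startingOffset, long endingOffset) {
        startTimes.put(partition, System.currentTimeMillis());
        System.out.printf("Restoring %s %s: replaying offsets %d..%d%n",
                storeName, partition, startingOffset, endingOffset);
    }

    @Override
    public void onBatchRestored(TopicPartition partition, String storeName,
                                long batchEndOffset, long numRestored) {
        // Called per restored batch; progress is proportional to accumulated state.
    }

    @Override
    public void onRestoreEnd(TopicPartition partition, String storeName,
                             long totalRestored) {
        long start = startTimes.getOrDefault(partition, System.currentTimeMillis());
        System.out.printf("Restored %d records for %s in %d ms%n",
                totalRestored, storeName, System.currentTimeMillis() - start);
    }
}

// Register before starting the topology:
// kafkaStreams.setGlobalStateRestoreListener(new RestoreTimingListener());
```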
Practitioners regularly raise this problem when running Kafka Streams at scale:
- Kafka Streams state restore discussion (Stack Overflow)
- Kafka Streams state recovery explanation (Confluent docs)
Recovery by replay means restart time is proportional to how much state you accumulated. Once the state grows beyond "small", recovery becomes part of your availability risk.
Dedicated processing engines such as Flink took a different approach years ago. They checkpoint state and restore from snapshots instead of replaying everything from the beginning. That difference shows up the first time you actually need to recover under load.
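For contrast, here is roughly what that looks like in Flink: state is snapshotted on an interval and recovery restores the latest checkpoint instead of replaying from offset 0. A sketch, assuming the RocksDB state backend and a placeholder checkpoint path:

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointedJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot state every 60 seconds; on failure Flink restores the latest
        // snapshot instead of replaying the whole history.
        env.enableCheckpointing(60_000);
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);

        // Incremental RocksDB snapshots keep checkpoint size bounded as state grows.
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true));
        env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/checkpoints"); // placeholder

        // Placeholder pipeline so the sketch is complete; real stateful logic goes here.
        env.fromSequence(0, 1_000_000).print();

        env.execute("checkpointed-job");
    }
}
```

Recovery time is now bounded by snapshot size and restore bandwidth, not by how far back the changelog goes.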
Exactly-once is more limited than it sounds
Kafka's exactly-once semantics apply to read-process-write cycles that stay inside Kafka. Spring's official documentation states it clearly: the read and process steps are still at-least-once.
As soon as you write to a database, call an external service, or touch anything outside Kafka, duplicate handling is your problem again.
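In practice that usually means making the external write idempotent yourself. A minimal sketch, assuming a Postgres table with a unique key on (topic, partition, offset); the table and column names are made up:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;

public class IdempotentSink {

    // Redeliveries from at-least-once processing become no-ops at the database,
    // because the (topic, kafka_partition, record_offset) key already exists.
    static void write(Connection connection, ConsumerRecords<String, String> records)
            throws SQLException {
        String sql = "INSERT INTO processed_events (topic, kafka_partition, record_offset, payload) "
                   + "VALUES (?, ?, ?, ?) ON CONFLICT DO NOTHING";
        try (PreparedStatement ps = connection.prepareStatement(sql)) {
            for (ConsumerRecord<String, String> record : records) {
                ps.setString(1, record.topic());
                ps.setInt(2, record.partition());
                ps.setLong(3, record.offset());
                ps.setString(4, record.value());
                ps.addBatch();
            }
            ps.executeBatch();
        }
        // Commit consumer offsets only after the external write has succeeded.
    }
}
```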
Kafka's own design documents describe the scaling problems of the original implementation. Before Kafka 2.5, exactly-once required one transactional producer per input partition. At scale, that meant thousands of producers, each with its own buffers, threads, and network connections.
Kafka explicitly calls this an architecture that does not scale well as partition counts increase.
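The fix (KIP-447) surfaces in Kafka Streams as a different processing guarantee: one transactional producer per stream thread instead of one per input partition. A configuration sketch; the application id and bootstrap servers are placeholders, and only the guarantee setting matters here:

```java
import java.util.Properties;

import org.apache.kafka.streams.StreamsConfig;

public class EosConfig {
    static Properties streamsProperties() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "example-app");   // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092"); // placeholder

        // Requires brokers on 2.5+; uses one transactional producer per stream thread
        // rather than one per input partition (the older "exactly_once" mode).
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
        return props;
    }
}
```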
ksqlDB made the limits obvious
Riskified published their migration story in 2025. Schema evolution in ksqlDB did not automatically pick up new fields. Fixing it meant dropping and recreating streams, which disrupted offsets and production pipelines, and shared clusters made recovery unpredictable.
Their conclusion: ksqlDB was not sustainable for production. They moved to Flink.
Confluent's own documentation backs this up. Push queries create continuous consumers. Pull queries create burst consumers. Both add load that is hard to predict and can affect other workloads in the same cluster.
Even vendors draw a boundary
Redpanda's Data Transforms documentation is explicit. Transforms are limited to single-message operations. No joins. No aggregations. No external access. A small number of output topics. At-least-once semantics only.
For anything more complex, their recommendation is to use a dedicated processing engine like Apache Flink.
Confluent acquired Immerok, a managed Flink provider, and is integrating Flink into its cloud offering. That move acknowledges what the architecture already tells you: serious stream processing requires a different execution model than a Kafka-native library.
The architectural issue
Streaming platforms are built for durable logs, ordering guarantees, fan-out, and backpressure.
They are not built to be stateful compute engines with fast recovery, checkpoint coordination, or complex query runtimes with strong resource isolation.
Once transport and processing are coupled, scaling, recovery, and cost are coupled too. You cannot scale processing without scaling brokers. You cannot tune recovery independently. Compute costs get buried inside your Kafka bill.
What works in practice
Separating concerns works.
Kafka or Redpanda handles transport. A dedicated processing engine handles state, checkpoints, and complex logic. When pipelines span multiple engines, something like Apache Wayang can orchestrate across them.
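As a concrete sketch of that split, here is a Flink job that treats Kafka purely as transport; the broker address, topic, and group id are placeholders:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TransportVsProcessing {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000); // recovery handled by the engine, not by full replay

        // Kafka is only the pipe: durable log in, durable log out.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("orders")
                .setGroupId("order-processing")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        DataStream<String> orders =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "orders-source");

        // Stateful logic (joins, aggregations, external writes) lives here, in the engine,
        // where it can be scaled and recovered independently of the brokers.
        orders.print();

        env.execute("transport-vs-processing");
    }
}
```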
Lightweight transformations inside the streaming layer still make sense. Filtering, format normalization, and simple enrichment cover many cases.
Core business logic with state, joins, and external writes does not.
If you are running joins, aggregations, or external writes inside your streaming platform today, what happens when you need to recover?
I published the full version with vendor documentation quotes, Kafka KIPs, migration case studies, and architecture diagrams on my blog: