Alexander Alten

SQL on Kafka Data Does Not Require a Streaming Engine

Stream processing engines solved a real problem: continuous computation over unbounded data. Flink, ksqlDB, and Kafka Streams gave teams a way to run SQL-like queries against event streams without writing custom consumers.

The operational cost of that solution is widely acknowledged. Confluent's own documentation notes that Flink "poses difficulties with deployment and cluster operations, such as tuning performance or resolving checkpoint failures" and that "organizations using Flink tend to require teams of experts dedicated to developing and maintaining it."

For a large share of the questions teams ask their Kafka data, a simpler architecture exists: SQL on immutable segments in object storage.

Most teams need a streaming interface, not a streaming engine.


What engineers actually ask Kafka

In production debugging sessions and ops reviews, the questions are repetitive:

  • What is in this topic right now?
  • What happened around an incident window?
  • Where is the message with this key?
  • Are all partitions still producing data?

These are not streaming problems. They are bounded lookups over historical data. They run once, terminate, and do not need windows, watermarks, checkpoints, or state recovery.
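
Even a plain consumer answers them with a seek, a bounded scan, and an exit. A minimal sketch using kafka-python; the topic, partition, broker address, and incident window are illustrative:

# Bounded incident-window lookup with a plain consumer: seek, scan, terminate.
# Topic, partition, bootstrap server, and the window itself are illustrative.
from datetime import datetime, timezone
from kafka import KafkaConsumer, TopicPartition

tp = TopicPartition("orders", 0)
consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
consumer.assign([tp])

start_ms = int(datetime(2026, 1, 8, 9, 0, tzinfo=timezone.utc).timestamp() * 1000)
end_ms = start_ms + 5 * 60 * 1000  # five-minute incident window

start = consumer.offsets_for_times({tp: start_ms})[tp]   # first offset at/after start_ms
end_offset = consumer.end_offsets([tp])[tp]               # current log end offset

if start is not None and start.offset < end_offset:
    consumer.seek(tp, start.offset)
    done = False
    while not done:
        for record in consumer.poll(timeout_ms=1000).get(tp, []):
            if record.timestamp > end_ms:
                done = True
                break
            print(record.offset, record.key, record.value)
        if consumer.position(tp) >= end_offset:
            done = True

consumer.close()

The point is not that this code is pleasant to maintain, only that it terminates: nothing about the question requires a long-running job.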

Kafka data is already structured for this

Kafka does not persist records individually. It appends them to log segments and rolls those segments by size or time. Each partition is an ordered sequence of records, and once a segment is closed it is never modified again.
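
On disk, a partition is simply a directory of segment files named by base offset, each with companion index files. A typical layout (the exact file names depend on the offsets involved):

orders-0/
  00000000000000000000.log        closed segment
  00000000000000000000.index      sparse offset index
  00000000000000000000.timeindex  sparse timestamp index
  00000000000003456789.log        active segment still being appended to
  00000000000003456789.index
  00000000000003456789.timeindex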

Kafka also maintains sparse indexes alongside each segment: a lightweight offset index and a timestamp index that let readers seek directly to a position in the file instead of scanning it from the start.
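
The lookup those indexes enable is easy to sketch: a sorted list of (offset, byte position) entries plus a binary search to find where in the segment to start reading. A toy illustration, not Kafka's actual on-disk format or code:

import bisect

# Toy sparse offset index: one (offset, file_position) entry every few KB of log
# data, sorted by offset. Kafka's real index is a packed binary file; this only
# illustrates the lookup.
index = [(0, 0), (120, 16384), (250, 32768), (380, 49152)]

def seek_position(target_offset: int) -> int:
    """Return the byte position at which to start scanning for target_offset."""
    offsets = [entry[0] for entry in index]
    i = bisect.bisect_right(offsets, target_offset) - 1
    return index[max(i, 0)][1]

# Scanning for offset 300 starts at byte 32768, not at byte 0.
print(seek_position(300))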

Retention deletes whole segments. Compaction rewrites segments. This means Kafka data is already organized like a SQL-on-files dataset. The only difference is where the files live.

Since Kafka 3.6.0, tiered storage (KIP-405) has allowed these segments to live in object storage such as S3, and as of Kafka 3.9.0 the feature is production-ready. Durability is now decoupled from compute without changing the data model.

The streaming engine "tax"

With a streaming engine, you pay for capabilities most queries never use:

  • Distributed state backends
  • Coordinated checkpoints
  • Watermark tracking
  • Long-running cluster operations

That cost is justified for continuous aggregation, joins, and real-time inference.

It is wasted for "show me the last 10 messages".

Production experience confirms this. Riskified migrated from ksqlDB to Flink, noting that ksqlDB's strict limitations on evolving schemas made it impractical for real-world production use cases and that operational complexity required fighting the system more than working with it.

The scale mismatch is also documented. Vendor surveys from Confluent and Redpanda show that approximately 56% of all Kafka clusters run at or below 1 MB/s. Most Kafka usage is small-data, yet teams pay big-data operational costs.

SQL on immutable segments

If Kafka data lives as immutable segments with sparse indexes, querying it looks like any other SQL-on-files workload.

The query planner:

  • Resolves the topic to segment files
  • Filters by timestamp or offset metadata
  • Reads only relevant segments
  • Applies predicates and returns results

No consumer groups. No offset commits. No streaming job lifecycle.
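
A toy version of the pruning step, assuming the planner has per-segment metadata such as the object key, base offset, and min/max timestamps; the structures and numbers below are illustrative, not any particular engine's API:

from dataclasses import dataclass

@dataclass
class Segment:
    path: str          # e.g. an object key in S3
    base_offset: int
    min_ts: int        # milliseconds
    max_ts: int

def plan(segments: list[Segment], start_ts: int, end_ts: int) -> list[Segment]:
    """Keep only segments whose timestamp range overlaps the query window."""
    return [s for s in segments if s.max_ts >= start_ts and s.min_ts <= end_ts]

segments = [
    Segment("orders-0/00000000000000000000.log", 0,      1_000,  4_999),
    Segment("orders-0/00000000000000005000.log", 5_000,  5_000,  9_999),
    Segment("orders-0/00000000000000010000.log", 10_000, 10_000, 14_999),
]

# A query window covering 6s-8s touches one segment; the other two are never fetched.
print([s.path for s in plan(segments, 6_000, 8_000)])

The same idea extends to offset predicates: segments whose offset range falls outside the WHERE clause are skipped before any bytes are fetched.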

Expose Kafka-native fields (key, timestamp, offset, partition) as columns and the common queries become trivial:

SELECT * FROM orders TAIL 10;

SELECT * FROM orders
WHERE ts BETWEEN '2026-01-08 09:00' AND '2026-01-08 09:05';

SELECT * FROM orders
WHERE key = 'order-12345'
AND ts >= now() - interval '24 hours';

This is not stream processing. It is indexed file access with SQL semantics.

Latency, realistically

Yes, object storage is slower than broker-local disk: every read is a network round trip to remote storage rather than a hit on local block storage or the page cache.

That is fine.

Most of these queries are debugging and ops workflows. Waiting one or two seconds is acceptable. Waiting minutes to deploy or restart a streaming job is not.

If you need sub-second continuous results, use a streaming engine. That boundary is clear.

Cost visibility beats hidden complexity

The real risk with SQL on object storage is unbounded scans. Object storage is billed on the amount of data stored and on the number of API requests made, so every segment a query touches has a price.

The solution is not more infrastructure. It is transparency.

Every query should show:

  • How many segments will be read
  • How many bytes will be scanned
  • The estimated request cost

Queries without time bounds should require explicit opt-in.

This keeps cost a conscious decision instead of a surprise.
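
A sketch of what that pre-flight report could look like; the pricing constants and the opt-in flag are placeholders, not any provider's real rates or any product's actual interface:

# Pre-flight cost report for a planned query. Pricing constants are placeholders;
# substitute your provider's actual per-request and per-GB rates.
GET_COST_PER_1000 = 0.0004   # assumed, USD per 1,000 GET requests
SCAN_COST_PER_GB = 0.002     # assumed, USD per GB scanned

def report(segment_sizes_bytes: list[int], time_bounded: bool, allow_full_scan: bool = False) -> None:
    # Unbounded queries must be an explicit, conscious choice.
    if not time_bounded and not allow_full_scan:
        raise ValueError("Unbounded query: add a time range or pass allow_full_scan=True")
    total_bytes = sum(segment_sizes_bytes)
    requests = len(segment_sizes_bytes)
    cost = requests / 1000 * GET_COST_PER_1000 + total_bytes / 1e9 * SCAN_COST_PER_GB
    print(f"segments: {requests}, bytes scanned: {total_bytes:,}, est. cost: ${cost:.6f}")

# Two segments of 64 MiB and 48 MiB, inside an explicit time window.
report([64 * 2**20, 48 * 2**20], time_bounded=True)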

Where streaming engines still belong

Streaming engines are still the right tool for:

  • Continuous aggregations
  • Joins over live streams
  • Real-time scoring
  • Exactly-once outputs

Most Kafka interactions are not those.

They are lookups and inspections that were forced into streaming infrastructure because no better interface existed.

Once Kafka data is durable as immutable segments, SQL becomes the simpler tool.


The takeaway

Most teams do not need a streaming engine to answer Kafka questions.

They need a clean, bounded way to query immutable data.

SQL on Kafka segments does exactly that.


Read a more detailed version of this post at https://www.novatechflow.com/2026/01/sql-on-streaming-data-does-not-require.html
