The future of Kafka is diskless topics + native Apache Iceberg / Delta Lake integration
Intro
Ursa is a relatively new engine that's being fitted into Kafka. This is done through a minimally invasive fork that will be open-sourced in a few months.
Here's why one should care:
- introduces Diskless Topics to Kafka — the ones that WarpStream pioneered, with 10x lower infra costs, simpler operations, instant scalability, etc.
- is Iceberg/Delta Lake-native — stores the topic data in an open-table format without additional copies or ETL jobs. One copy means lower costs, easier governance and a lower risk of drift/issues.
- is a minimally-invasive fork of Kafka — it only extends the codebase with an additional storage path and keeps the rest untouched. This means upgrades to it (and back) should be seamless
- supports adaptable topic profiles — because Diskless Topics are enabled with the flip of a switch (not a cluster migration like some others), you get less operational sprawl in your org with both topic types in the same cluster!
🧵 Now allow me to explain in detail, with a quick backstory:
2023: The Rise of Diskless Kafka
https://news.ycombinator.com/item?id=37036291
Three years ago, WarpStream launched with an innovative and elegant spin on the classic Apache Kafka architecture. Instead of brokers replicating events between each other and persisting them to their local disks, they wrote the events directly to S3.
It sounds simple, but this architecture achieved two very important things:
- it made the brokers stateless and leaderless. This unlocked significantly simpler scaling, load balancing and operations.
- it lowered infrastructure costs by 10x, all by hacking the cloud pricing model. AWS charges you a ton for cross-AZ traffic and EBS storage. By comparison, S3 gives you the same durability/availability guarantees at an enormous discount.
it costs from $2.1M–$3.51M a year: 2/3rds of produce traffic inevitably crosses zones, all replication traffic inevitably crosses zones, and 2/3rds of consume traffic crosses zones unless KIP-392 (follower fetching) is enabled — hence either $3.51M/yr or $2.1M/yr
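To make numbers like these concrete, here's a back-of-the-envelope sketch. The ~$0.02/GB effective cross-AZ rate, the 1 GB/s sustained workload and the consumer fanout of one are all my assumptions, not figures from the caption, so treat the output as illustrative only:

```python
# Back-of-the-envelope cross-AZ transfer cost for classic Kafka (RF=3, 3 AZs).
# Assumptions: ~$0.02/GB effective cross-AZ rate, 1 GB/s workload, fanout of 1.
GB_PER_S = 1.0
CROSS_AZ_USD_PER_GB = 0.02
SECONDS_PER_YEAR = 365 * 24 * 3600

produce = (2 / 3) * GB_PER_S   # 2/3rds of producers hit a leader in another AZ
replication = 2 * GB_PER_S     # each byte is copied to the two other AZs
consume = (2 / 3) * GB_PER_S   # avoidable with KIP-392 follower fetching

def yearly_cost(gb_per_s: float) -> float:
    """Yearly cross-AZ bill for a given sustained transfer rate."""
    return gb_per_s * CROSS_AZ_USD_PER_GB * SECONDS_PER_YEAR

with_consume = yearly_cost(produce + replication + consume)
without_consume = yearly_cost(produce + replication)
print(f"consumers cross zones:   ${with_consume:,.0f}/yr")
print(f"with follower fetching:  ${without_consume:,.0f}/yr")
```

Even under these rough assumptions, replication alone dominates the bill — which is exactly the traffic a direct-to-S3 design eliminates.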
⚠️ NOTE: When I say "infrastructure costs", I specifically refer to the underlying cloud infrastructure costs. This doesn't mean that the vendor's solution lowers your total cost by 10x — they have to make a profit after all. Read such statements carefully to separate marketing from fact (as I've been harping about for a while).
I even released a Kafka cost calculator to help you reason about your costs at the time (because the others were so biased):
https://2minutestreaming.com/tools/apache-kafka-calculator
The trade-off this architecture came with was that of 10x-20x higher latency. A regular Kafka event may take no more than 50–100ms of p99 end-to-end latency, whereas a Diskless Kafka event may take 1000–2000ms.
💡 end-to-end latency — measures the time from when an event is published by a producer application to when it is read by a consumer application. This is the latency metric Kafka users care about; the rest is marketing fluff.
The reason for this latency is
[1] — batching: you have to batch requests to avoid racking up costs from too many S3 PUT/GET requests
[2] — S3 latency: it's in the hundreds of ms
It can be optimized and tuned somewhat, but realistically speaking you can't get much lower than a second of p99 e2e without quickly eliminating your cost savings (due to the batching).
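To illustrate why the batching is non-negotiable, here's a rough sketch of yearly S3 PUT request costs at different flush intervals. The ~$0.005 per 1,000 PUTs list price and the 6-broker cluster size are assumptions for the sake of the example:

```python
# Why diskless engines batch: S3 bills per PUT request (~$0.005/1,000 PUTs,
# an assumed list price), so flushing tiny buffers very often dominates cost.
PUT_USD = 0.005 / 1000
BROKERS = 6  # hypothetical cluster size
SECONDS_PER_YEAR = 365 * 24 * 3600

def put_cost_per_year(flush_interval_s: float) -> float:
    """Yearly PUT bill if every broker flushes one object per interval."""
    puts_per_year = (1 / flush_interval_s) * BROKERS * SECONDS_PER_YEAR
    return puts_per_year * PUT_USD

for interval in (0.005, 0.05, 0.2):  # 5 ms, 50 ms, 200 ms
    print(f"flush every {interval * 1000:>5.0f} ms -> "
          f"${put_cost_per_year(interval):>10,.0f}/yr")
```

Flushing 40x more often costs 40x more in requests, which is why sub-second end-to-end latency quickly eats the savings.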
The key realization from WarpStream was that many Kafka use-cases are not latency sensitive. They're happy to receive their events 10–20x slower if that means significant operational ease and cost savings.
It didn't take long for everybody to follow suit.
2024–26: Diskless Everywhere
The next few years were the most active the space had ever been:
- Confluent bought WarpStream for $220M (after just 13 months of operation)
- Aiven released the first open-source implementation of this architecture and took initiative to get this merged in OSS Apache Kafka (KIP-1150)
- IBM bought Confluent for $11B
- virtually every Kafka vendor released some form of similar architecture, including Ursa
competition caught up quickly
Today, this architecture is everywhere. Thanks to Aiven, it's coming to the open-source Apache Kafka project as well (realistically in 1–2 years).
There is one other massive trend I want to draw your attention to before getting into Ursa:
2024: The $1–2B Iceberg
A big acquisition came out of nowhere in the summer of 2024. Databricks acquired Tabular — a ~30 person company by the creators of Apache Iceberg — for $1–2 Billion.
Apache Iceberg is nothing more than a storage format for big data.
It was created at Netflix to solve their problems with the Hive storage format which was used in their data lake, namely:
- make changes to tables atomic with serializable isolation
- support many concurrent writers
- offer native cloud object store support
- fewer gotchas (renaming a Parquet column in Hive used to break tons of stuff)
Alternative formats are Delta Lake (by Databricks), Apache Hudi, and Apache Paimon. All of these are called Open Table Formats (OTF).
It was clear that Iceberg had the most momentum out of them all, coming out of a neutral third-party (Netflix).
Every query engine was quickly adding support for Apache Iceberg. Star counts are only a minor signal, yet Iceberg's still grew faster than the alternatives' (e.g. Delta Lake)
But the format quickly grew into a much grander vision. The ensuing open table format war was precisely around what the name says — openness. The idea was that:
Storage and compute products should be largely interchangeable and easily swapped, by using open standards like Iceberg.
Open table formats give you the following benefits:
- zero copy data stack — share your database's storage by storing it in one place but using it from different query engines (avoids copying data and the insane network costs associated with that). This is a much bigger proposition that we won't dive into, but it essentially promises you both a data warehouse and data lake with the same storage layer. This was coined the data lakehouse (in classic industry jargon).
- openness — avoid lock-in by being able to easily swap query engines. Also increases the interoperability between tools (they all connect to the Iceberg table), therefore allowing you to use the right tool for the job with minimum effort.
an example architecture with one source of truth dataset — an Iceberg table in S3
The acquisition was a big deal at the time and only supercharged the space's momentum.
https://linkedin.com/posts/stanislavkozlovski_apacheiceberg-share-7209936821386964992-JnT_
As of writing, open table formats are supported everywhere, yet still somewhat rough around the edges.
Their catalog dependency is still being iterated on, and the core formats are actively evolving, along with performance improvements.
Their adoption is growing just as the industry works hard to streamline their use. The topic of our article is a direct example of this in action 👇
Ursa
the award-winning 2025 VLDB industry paper: https://vldb.org/pvldb/vol18/p5184-guo.pdf
Ursa is a diskless log storage engine with native open-table format support.
Diskless Topics
It's designed to support diskless topics — the type that write data to S3 directly in order to save on costs and simplify operations.
a high-level example of the architecture. Ursa's Coordination service is Oxia.
How Writes Work
I won't go too in-depth into this since the code isn't public, but here's what I could piece together:
1) the producer sends records (belonging to all partitions) to brokers in its local AZ (leaders don't exist in diskless)
2) the broker buffers the records for up to 200ms or 4 MB (configurable, whichever comes first)
3) the broker sorts all records by topic-partition id and flushes this mixed-partition data to S3 as one row-based object. The data remains cached in the broker to serve future reads from memory
4) the broker updates each partition's offset & data reference pointer in Oxia (their metadata store)
5) the broker returns the response to the producer
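Since the code isn't public, here's a minimal in-memory sketch of the write path described above. S3 and Oxia are stand-in dicts, and every name and threshold is hypothetical — this only mirrors the described steps, not Ursa's actual implementation:

```python
# Minimal in-memory sketch of a diskless write path (all names hypothetical).
import time
from collections import defaultdict

FLUSH_BYTES = 4 * 1024 * 1024  # 4 MB size threshold
FLUSH_MS = 200                 # 200 ms time threshold

class Broker:
    def __init__(self, object_store: dict, metadata_store: dict):
        self.object_store = object_store  # stands in for S3
        self.metadata = metadata_store    # stands in for Oxia
        self.buffer = []                  # (topic_partition, record) pairs
        self.buffered_bytes = 0
        self.last_flush = time.monotonic()

    def produce(self, topic_partition: str, record: bytes):
        # buffer records from all partitions together
        self.buffer.append((topic_partition, record))
        self.buffered_bytes += len(record)
        if (self.buffered_bytes >= FLUSH_BYTES
                or (time.monotonic() - self.last_flush) * 1000 >= FLUSH_MS):
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        # sort by partition, write one mixed-partition object (data persistence)
        self.buffer.sort(key=lambda kv: kv[0])
        object_key = f"obj-{len(self.object_store)}"
        self.object_store[object_key] = list(self.buffer)
        # assign offsets per partition in the metadata store (metadata persistence)
        counts = defaultdict(int)
        for tp, _ in self.buffer:
            counts[tp] += 1
        for tp, n in counts.items():
            entry = self.metadata.setdefault(tp, {"next_offset": 0, "objects": []})
            entry["objects"].append((object_key, entry["next_offset"], n))
            entry["next_offset"] += n
        self.buffer, self.buffered_bytes = [], 0
        self.last_flush = time.monotonic()
```

Note how offsets only exist after the metadata commit — any broker can flush data for any partition, and the metadata store is the single serialization point.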
The key trick with this architecture is that data persistence is separated from metadata persistence.
In step 3), the data is durably persisted — but its metadata is still unassigned (e.g. what offsets do the records have? None yet). In step 4), the metadata is assigned in a strongly-consistent manner.
This allows many brokers to concurrently handle the same partition — it's the metadata store that serializes the offset assignments. Because metadata is an inherently much smaller load, this scales fine.
Ursa, in particular, uses the CNCF Oxia project for its metadata store. It's one they designed themselves and custom-fit for this type of problem. It makes interesting trade-offs compared to etcd/ZooKeeper.
The key trade-off with this diskless approach is higher latency.
- Such a write can take 10x longer (1–2s p99) than a regular Kafka write.
- But, it can also be 10x cheaper than a regular Kafka write as the storage is in S3 (10–20x cheaper to store than in replicated EBS) and there are no cross-AZ network fees (again, these can be very high as you need to replicate between EBS drives)
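The storage side of that claim is simple arithmetic. The prices below are assumed us-east-1 list prices (EBS gp3 at ~$0.08/GB-month, S3 at ~$0.023/GB-month), so the exact ratio will vary by region and configuration:

```python
# Rough replicated-EBS vs. S3 storage cost comparison.
# Assumed list prices: EBS gp3 ~$0.08/GB-month, S3 standard ~$0.023/GB-month.
EBS_GB_MONTH = 0.08
S3_GB_MONTH = 0.023
REPLICAS = 3  # classic Kafka provisions the data on 3 EBS volumes

classic = EBS_GB_MONTH * REPLICAS  # you pay for every replica's disk
diskless = S3_GB_MONTH             # S3 replicates internally at one price
print(f"classic:  ${classic:.3f}/GB-month")
print(f"diskless: ${diskless:.3f}/GB-month")
print(f"ratio:    {classic / diskless:.1f}x")
```

And this still flatters EBS, since provisioned volumes also carry headroom you pay for whether you use it or not.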
How Compaction Works
The careful reader will notice that, over time, the system will accumulate a lot of small 4 MB files. Those files contain small amounts of data for many partitions — e.g. 200 KB per partition. This is optimized for writes, but not for reads that may want to ingest a few MB from a single partition.
An asynchronous compaction process fixes this by re-ordering the data by partition.
The special thing about Ursa is that at this step, the data format is converted into columnar Parquet inside an open table format (Iceberg or Delta).
Here's how it roughly works:
a high-level visualization of the end-to-end write+compaction path, using Delta Lake as the table format
1) A Compaction Manager inspects the per-partition metadata and creates a "plan" of what needs to be compacted. The plan consists of many compaction tasks, saved in Oxia.
2) Compaction Workers process those tasks. They read data for a single partition from many mixed-partition S3 files, convert it to columnar Parquet and store it back to S3. They persist their progress in Oxia.
3) The Compaction Manager waits to batch up many such newly-written Parquet files (e.g. 100–1000), and then commits them to the Iceberg/Delta catalog. This is likely done to reduce the number of table snapshots in Iceberg.
4) The old metadata and now-defunct mixed-partition S3 objects get garbage collected later.
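Here's a sketch of what the Compaction Manager's planning step could look like. The 64 MB threshold and all names are hypothetical, and a real worker would go on to rewrite the selected objects as Parquet — this only shows the task-selection logic:

```python
# Hypothetical sketch of compaction planning: group per-partition references
# to small mixed-partition objects into one compaction task per partition.
from collections import defaultdict

TASK_THRESHOLD = 64 * 1024 * 1024  # compact once a partition has ~64 MB pending

def plan_compaction(object_refs):
    """object_refs: list of (partition, object_key, num_bytes) tuples."""
    pending = defaultdict(list)
    for partition, key, nbytes in object_refs:
        pending[partition].append((key, nbytes))
    tasks = []
    for partition, refs in pending.items():
        if sum(n for _, n in refs) >= TASK_THRESHOLD:
            # a worker would read these objects, keep only this partition's
            # rows, and rewrite them as one columnar Parquet file
            tasks.append({"partition": partition,
                          "inputs": [k for k, _ in refs]})
    return tasks
```

Note that one mixed-partition object can appear as an input to many tasks, one per partition it holds data for — which is why the old objects can only be garbage collected after every referencing task completes.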
Critically, after step 3), the Kafka data is entirely visible from any Iceberg-compatible query engine. This can come as soon as a few minutes after the data is first written.
After that step, the only reference of the records exists in an open table format. We call this process "zero-copy" because the data is not copied/duplicated, as opposed to how it would be by using a traditional ETL approach like Kafka Connect.
Most plainly said — vanilla Kafka treats your data as opaque bytes. Ursa treats it as a table.
StreamNative, the company behind Ursa, calls this distinction "the Lakestream architecture".
The Lakestream
A lakestream architecture is simply one that treats event streams (i.e Kafka logs) as a first-class lakehouse primitive. In other words — one that integrates Kafka with Open Table Formats natively.
the ingested data is very quickly visible to Iceberg/Delta supportive query engines
This vision is significant because it integrates your ingestion layer (Kafka) even more natively with your query layer (ClickHouse, Snowflake, etc.) while separating both through a clean storage layer (Iceberg/S3).
🗣️ Why am I talking about it now?
Ursa isn't brand new. It's been out since 2024. Here's why it grabbed my attention now.
First, a quick primer on the company. Ursa is developed by StreamNative — a company founded by the creators of Apache Pulsar.
While they aren't necessarily a "hot" startup, these guys have some serious engineering firepower behind them. They:
- architected Apache Pulsar
- were the first in streaming to separate compute from storage
- built a highly-scalable ZooKeeper/etcd replacement (Oxia, Apache-licensed and in the CNCF incubator)
Sidenote: While Pulsar never got as popular as Kafka, it was frequently hailed as technically better, mainly due to its superior architecture that gave it more topics (up to a million), multi-tenancy and better operability. Here is a good in-depth 2018 piece on the matter.
As biased as I am as "the Kafka guy"… I don't disagree.
It's a great example of how a system doesn't win only because of technical merits or number of features. The ecosystem's maturity, network effect and commercial support maturity matter a ton. While the Pulsar project started internally at Yahoo around the same time Kafka started at LinkedIn… it was only open-sourced in late 2016 and graduated as a top-level Apache project in 2018 — a full 6 years after Kafka.
In any case, StreamNative surprised me when they announced "We are a Kafka Company Too":
my twitter post about it (also on youtube, find me there)
Essentially, there they announced that:
- they had officially forked Kafka and extended its storage layer to support Ursa (called Ursa-for-Kafka, or UFK)
- they committed to open-sourcing UFK
- they defined their lakestream architecture and vision
This is different from their previous solution, which was presumably a proxy-level re-implementation of the Kafka protocol with an Ursa backend.
Because the fork extends just the storage layer to bring in a new type of topic, this is regular Kafka, with its regular backend, plus an opt-in Ursa backend.
While "The Lakestream" may initially sound like marketing slop (trust me I'm very allergic to these), I find it captures the convergence of these two trends in data engineering very well:
[1] Iceberg gaining ground as the open storage format
[2] Diskless (direct-to-S3) workloads gaining ground as the ingestion layer for it
the trends
As for Ursa-for-Kafka, StreamNative committed to open-sourcing it in 9 months' time (on April 10, so we expect it ~January 10 2027).
In the meanwhile, they open-sourced the formally-verified leaderless log protocol that backs Ursa. I spent a few days working with it to see how usable it is alongside Codex and the result impressed me:
👉 One-shotting a Diskless Kafka in Python — Using StreamNative Ursa's leaderless log protocol
🤔 How does Ursa compare to others?
The Kafka space is very cutthroat — it's not short of worthy competitors at all. I'll save you the full 20+ product comparison and concisely zoom in on two areas where Ursa stands out to me:
- Open Table Format integration
- Adaptable Topics
Whether you need these options enough to value their benefits, and whether you can tolerate the trade-offs that come with them, is a separate question that I won't get into.
1) Lakehouse Native Competitors
Everybody in the Kafka space has some form of an open-table-format sink service today. The ones that have integrated it natively into their product are much fewer — just:
- Bufstream
- Ursa
First-class integration in the critical path matters because it tips the scales in two subtle ways.
[1] The SLA guarantees and their quality are usually much better.
It's not a connector bolted on the side that may or may not be polished and monitored. It's something that's a core part of the product.
[2] The cost is also lower. It's not an extra feature the vendor gets to upsell, it's the product.
In Ursa's and Bufstream's cases, the cost is even lower because the data isn't duplicated. In these "zero-copy" designs, the Kafka data is the lake's data; otherwise a copy is made, which forces you to pay for 2x the storage.
2) Adaptable Topics
Similarly, a lot of products in the Kafka space have some form of low-cost, high-latency "diskless" topic option. They usually lack in one of two ways:
- they don't give you an option to have fast topics with their products at all
- they let you have fast topics but force you to deploy another (classic Kafka) cluster for it
Forcing users to split workloads by cluster is very cumbersome as it splinters the security model, configurations, observability, etc.
It also tends to cost more in overhead (number of brokers) and require more human effort to manage.
The best solution is one which lets you host different types of topics inside the same cluster:
a high-level visual of UFK supporting both fast, regularly-replicated Kafka topics and the new low-cost Diskless, Lakestream topics
In the 20+ product Kafka market, only three solutions adapt to different workloads inside the same cluster:
- Aiven Inkless
- StreamNative Ursa for Kafka
- Redpanda 26.1
A comparison of all adaptable engines. Note that Aiven is shipping this type of topic flexibility into OSS Apache Kafka too via KIP-1150, but realistically-speaking it's 1–2 years away due to the OSS process.
🎬 Conclusion
Ursa-for-Kafka is a positive development in the event-streaming space, solidifying:
- the cost deflation trend of Kafka (diskless topics)
- the unification with analytical workloads via open table formats
- the network effect of Kafka's protocol and its ecosystem
If you found this informative, please consider sharing. It takes 5 seconds to do, while researching and writing this takes me 10+ hours.
Thanks for reading :)