RisingWave Labs

Originally published at risingwave.com

Stream Processing Systems in 2025: RisingWave, Flink, Spark Streaming, and What's Ahead

Stream processing isn’t a new technology. In fact, the concept has been studied for at least 23 years! The first academic paper I came across dates back to 2002, just two years before the publication of the famous MapReduce paper.

The history of stream processing and batch processing systems.

Pioneering companies like StreamBase (now part of TIBCO) began commercializing the technology on Wall Street in the 2000s. However, only in recent years have companies begun commercializing stream processing in the cloud. RisingWave was introduced in early 2021; Confluent acquired Immerok and began commercializing Apache Flink in 2023. Databricks also announced Project Lightspeed, a new version of Spark Streaming, to compete in the data streaming space. Additionally, several startups have emerged, either building around existing open-source systems or developing their own solutions.

With so many data vendors—large and small—operating in this space, it’s fascinating to observe that most are converging on similar goals and approaches. In this article, I’ll share my predictions for stream processing systems in 2025 from an engineer’s perspective.

Disclaimer: I am affiliated with RisingWave. However, I strive to remain as neutral as possible and focus solely on the technological aspects, avoiding commercial bias. If I’ve overlooked something or made an inaccurate statement, please feel free to reach out and let me know.

Betting on the “S3 as the Primary Storage” Architecture

The rise of AWS S3 as a reliable, cost-effective storage service and the success of Snowflake have solidified S3’s position as a cornerstone of modern data infrastructure. Over time, data systems have increasingly transitioned to S3-based architectures, and startups are pushing the boundaries by building innovative systems entirely on S3.

Streaming systems are now exploring similar possibilities. RisingWave, as far as I know, is the first stream processing system purpose-built with S3 as its primary storage layer. Development began in 2021, and after four years of iteration, it has significantly evolved. Recently, Alibaba announced their plans to introduce storage-compute separation in Flink 2.0, drawing from internal best practices. While storage-compute separation aligns with the broader trends in distributed systems, implementing it for stream processing presents unique engineering challenges.

Unlike batch processing systems like Snowflake, stream processing systems are inherently stateful. They require constant access to internal states for incremental computation. Adopting storage-compute separation effectively means moving these states to S3. On the surface, this approach seems promising. S3’s lower storage costs compared to local memory and disk, combined with its scalability, make it appealing for handling large stateful operations such as joins that can easily cause out-of-memory errors. However, the reality is far from straightforward.
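To make the statefulness concrete, here is a minimal sketch of a keyed running aggregation: the operator must keep per-key state to emit updated results on every event. The class and event shape are illustrative assumptions, not any engine’s real API.

```python
# Why stream processors are stateful: a keyed running aggregation must hold
# per-key state so each new event can be folded in incrementally, instead of
# recomputing over all history.

class RunningSum:
    """Maintains a per-key sum, updated one event at a time."""

    def __init__(self):
        self.state = {}  # key -> running sum (the "internal state")

    def process(self, key, value):
        # Every event reads and writes state; losing this state would
        # force a full recomputation from the beginning of the stream.
        self.state[key] = self.state.get(key, 0) + value
        return key, self.state[key]

agg = RunningSum()
for event in [("user_a", 5), ("user_b", 3), ("user_a", 2)]:
    agg.process(*event)

print(agg.state)  # {'user_a': 7, 'user_b': 3}
```

Moving this dictionary to S3 is essentially what storage-compute separation means here: every `process` call would then risk a remote round-trip.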

The primary obstacle is S3’s latency. While it excels in durability and scalability, its access times are orders of magnitude slower than local storage, which is a critical limitation for low-latency stream processing workloads. Additionally, frequent interactions with S3 can result in significant access costs, eroding the cost benefits of using it as a storage layer. To make matters more complex, handling the performance hit from S3 often requires sophisticated caching strategies. Without these optimizations, production workloads can suffer from severe performance degradation and unmanageable expenses.

A cache miss triggers a fetch from S3, adding 200–300ms latency.

AWS S3 pricing.

By 2025, I expect that many stream processing systems will adopt S3 as a foundational component of their architectures. However, building an efficient system around S3 will demand heavy engineering investments. Techniques like hybrid storage models—where frequently accessed data resides in local storage or memory—and advanced caching mechanisms will be indispensable. The shift to storage-compute separation marks a pivotal moment for stream processing, but realizing its potential will hinge on solving the performance and cost challenges that come with it.
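The hybrid-storage idea above can be sketched in a few lines: hot state lives in a small in-memory LRU cache, and only misses go to the slow object store. `FakeS3`, the capacity, and the fetch counter are assumptions for illustration, not a real S3 client.

```python
# A hedged sketch of a hybrid state store: an in-memory LRU cache fronting a
# slow, per-request-billed object store (stand-in for S3). Cache hits avoid
# both the latency and the request cost of a remote fetch.

from collections import OrderedDict

class FakeS3:
    """Stand-in for an object store with high per-request latency and cost."""
    def __init__(self, data):
        self.data = data
        self.fetches = 0  # round-trip counter: a proxy for latency and cost

    def get(self, key):
        self.fetches += 1
        return self.data[key]

class CachedStateStore:
    def __init__(self, backing, capacity):
        self.backing = backing
        self.capacity = capacity
        self.cache = OrderedDict()  # LRU order: oldest entry first

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)       # hit: no remote round-trip
            return self.cache[key]
        value = self.backing.get(key)         # miss: slow, billed fetch
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict least recently used
        return value

s3 = FakeS3({"k1": "v1", "k2": "v2"})
store = CachedStateStore(s3, capacity=1)
store.get("k1"); store.get("k1")  # second read served from cache
print(s3.fetches)  # 1
```

Real systems layer this further (memory, local disk, then S3) and add prefetching, but the trade-off is the same: every avoided miss saves both hundreds of milliseconds and a per-request charge.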

Taking a Bite of Kafka's Lunch

Whenever we discuss event streaming, Kafka inevitably enters the conversation. As the de facto standard for event streaming, Kafka is widely used as a data pipeline to move data between systems. However, Kafka is not the only tool capable of facilitating data movement. Products like Fivetran, Airbyte, and other SaaS offerings provide user-friendly tools for data ingestion, expanding the options available to engineers.

Despite Kafka’s popularity, its computational capabilities are limited. This creates a need for stream processing systems to handle real-time data transformations, including joins, aggregations, filtering, and projections. The challenge arises from managing two separate systems: one for data ingestion and another for stream processing. Maintaining such a dual setup is resource-intensive, increasing development complexity and operational costs.

In response to this inefficiency, stream processing systems are increasingly integrating data ingestion capabilities. Notably, systems like RisingWave, Apache Flink, and Apache Spark Streaming now support direct consumption of CDC (Change Data Capture) data from upstream sources such as Postgres, MySQL, and MongoDB. This eliminates the necessity of Kafka as an intermediary, reducing architectural overhead and streamlining workflows.

Modern stream processing systems offer direct connections to both upstream and downstream systems.
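Consuming CDC directly boils down to applying a stream of insert/update/delete change events to maintain a queryable copy of the upstream table. The sketch below shows that core loop; the event shape is an assumption for illustration, not a real CDC wire format like Debezium’s.

```python
# Simplified sketch of direct CDC consumption: fold change events from an
# upstream database into a locally maintained table, with no Kafka topic in
# between.

def apply_cdc(table, event):
    """Apply one change event to the materialized table in place."""
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        table[key] = event["row"]
    elif op == "delete":
        table.pop(key, None)  # tolerate deletes for unseen keys
    return table

events = [
    {"op": "insert", "key": 1, "row": {"name": "alice"}},
    {"op": "update", "key": 1, "row": {"name": "alicia"}},
    {"op": "insert", "key": 2, "row": {"name": "bob"}},
    {"op": "delete", "key": 2},
]

table = {}
for e in events:
    apply_cdc(table, e)

print(table)  # {1: {'name': 'alicia'}}
```

In a real engine the same changelog also flows into downstream operators (joins, aggregations), which is what makes end-to-end incremental computation possible.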

Looking ahead to 2025, will stream processing systems compete directly with event streaming platforms like Kafka? The short answer is: not entirely. While there will be overlaps in functionality, stream processing systems are unlikely to fully replace Kafka. Kafka’s broad range of use cases—many of which extend beyond what stream processing systems are designed to handle—ensures its continued relevance in the data ecosystem.

Embracing Data Lake

2024 was undoubtedly the year of the data lake. Databricks made waves by acquiring Tabular, a company founded by Iceberg’s original creators, signaling a significant endorsement of Iceberg’s potential. Simultaneously, Snowflake introduced Polaris, its Iceberg-based catalog offering. Prominent query engine vendors like Starburst and Dremio have aligned their support around Polaris, indicating a shift toward unified standards.

To stay relevant in modern data engineering, nearly all data streaming vendors have announced their integration with Iceberg. For instance, Confluent unveiled Tableflow, a product enabling direct ingestion of Kafka data into Iceberg format. Similarly, Redpanda launched a comparable service for streaming data into data lakes. StreamNative’s Ursa Engine is yet another example of this growing trend.

Iceberg ecosystem.

When it comes to stream processing systems, Iceberg support varies across vendors. Databricks, which oversees Spark Streaming, focuses on Delta Lake. Apache Flink, heavily influenced by Alibaba’s contributions, promotes Paimon, an alternative to Iceberg. RisingWave, on the other hand, fully embraces Iceberg. Rather than focusing solely on one table format, RisingWave aims to support various catalog services, including AWS Glue Catalog, Polaris, and Unity Catalog.

However, the convergence of data streaming and data lakes is about more than just data ingestion. There is a growing demand for incremental computation over data lakes, as evidenced by Databricks’ Delta Live Tables feature. Interestingly, because Iceberg has yet to fully support CDC (Change Data Capture), no system currently offers efficient incremental computation on Iceberg. That said, this gap may soon close—Iceberg spec v3 is on the horizon, and the competition in this space is just heating up.

Optimizing Query Serving

If you’ve been following the stream processing domain for a while, you’ll notice a significant trend: most stream processing systems are now building their own storage engines. For instance, RisingWave is a streaming database with built-in capabilities for storing and serving data by default. Similarly, Flink recently introduced Fluss and Paimon to enhance its serving capabilities. Databricks’ Delta Live Tables, likely built on Spark Streaming, allows users to directly serve data, highlighting a broader industry trend.

Why are all these stream processing systems moving toward integrating storage and serving? The answer lies in simplifying architecture. Traditionally, stream processing systems were designed to process data, while the storage and serving layers were managed by separate systems. However, maintaining multiple systems for a single application introduces significant operational overhead, driving up both complexity and costs. By consolidating ingestion, processing, and serving layers into one system, stream processing platforms enable smoother data flows, reduce maintenance burdens, and accelerate application development timelines. Developers can now build and deploy applications in months rather than years.

This shift also addresses a critical pain point: the cost and complexity of managing too many moving parts in a system. When a single platform handles data ingestion, stateful processing, and real-time serving, the benefits include improved efficiency, lower latency, and reduced costs. As a result, modern stream processing systems are embracing this holistic approach to provide robust storage and serving capabilities alongside their processing strengths.
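The consolidation described above can be sketched as one object that ingests events, incrementally maintains a materialized view, and answers point queries, instead of wiring an ingestion system, a processor, and a serving database together. This is an illustrative toy, not any vendor’s architecture.

```python
# Illustrative-only sketch of the "one system" idea: ingestion, incremental
# processing, and serving collapsed into a single component.

class StreamingDB:
    def __init__(self):
        self.counts = {}  # materialized view, kept fresh on every ingest

    def ingest(self, event):
        # Ingestion + processing: fold the event into the view immediately.
        key = event["page"]
        self.counts[key] = self.counts.get(key, 0) + 1

    def query(self, key):
        # Serving: answer directly from the always-fresh view.
        return self.counts.get(key, 0)

db = StreamingDB()
for e in [{"page": "/home"}, {"page": "/docs"}, {"page": "/home"}]:
    db.ingest(e)

print(db.query("/home"))  # 2
```

The design point is that there is no separate ETL hop between processing and serving: a query always reflects every event ingested so far.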

People prefer to deal with fewer moving parts—ideally, just one!

Looking forward, we can expect continued innovation in this space as systems evolve to meet growing demands for scalability, performance, and simplicity in real-time data applications.

Buzz around AI

AI has become the focal point of nearly every conversation in the tech world, and stream processing systems are no exception. Many event streaming and data systems are developing features to stay relevant in this AI-driven landscape. One emerging pattern involves directly ingesting data from various sources, leveraging embedding services to convert raw data into vectors, and then using vector databases to enable vector search. This trend has gained so much traction that even AWS now offers a solution to support this workflow.
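The ingest-embed-search pattern above can be sketched end to end. Everything here is a toy stand-in: `embed` fakes an embedding with letter frequencies (a real pipeline would call an embedding service), and the index is brute-force cosine similarity rather than a vector database.

```python
# Toy sketch of the pattern: ingest raw text, convert it to vectors via an
# embedding function, and serve nearest-neighbor search over the vectors.

import math

def embed(text):
    """Fake 26-dim 'embedding' from letter frequencies; illustrative only."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class VectorIndex:
    def __init__(self):
        self.items = []  # (text, vector) pairs

    def add(self, text):
        self.items.append((text, embed(text)))

    def search(self, query, k=1):
        # Brute-force scan; a real vector database would use an ANN index.
        qv = embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(qv, it[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

idx = VectorIndex()
for doc in ["bitcoin rally continues", "new pasta recipe", "ethereum gas fees drop"]:
    idx.add(doc)

print(idx.search("bitcoin price"))  # ['bitcoin rally continues']
```

In a streaming setting, `add` would be driven by the event stream itself, so the index stays current without batch re-indexing.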

The demand for such capabilities is clear. For example, one of the hottest crypto companies, Kaito, ingests high-volume real-time data from X, performs sentiment analysis, and generates actionable insights for traders using RisingWave. The sentiment analysis is powered by LLMs. However, a critical limitation of LLMs today is their latency, which often ranges between 100–200 ms. This makes them unsuitable for latency-sensitive domains like ad targeting or product recommendations, where traditional ML models still dominate.

Real-time sentiment analysis in Kaito.

What does real-time AI look like in the future? With the advancements in LLMs, more developers are exploring ways to integrate AI-driven mechanisms into their applications. Real-time feature engineering will remain a cornerstone of these efforts, enabling applications to process and act on data dynamically. The synergy between AI and stream processing is still in its early stages, but it is poised to define the next wave of innovation in real-time data applications.

Conclusion

If I were to summarize the trend for stream processing systems in 2025 in two words, they would be: lakehouse and AI. It’s evident that every major stream processing system is converging on Iceberg and exploring their role in AI integration. Companies that adapt to these trends will not only remain competitive but will also thrive in the ever-expanding world of real-time, data-intensive applications.
