<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: DeltaStream</title>
    <description>The latest articles on DEV Community by DeltaStream (@deltastream).</description>
    <link>https://dev.to/deltastream</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F7066%2Fca4056eb-322f-417a-95a9-8f375c652b86.png</url>
      <title>DEV Community: DeltaStream</title>
      <link>https://dev.to/deltastream</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/deltastream"/>
    <language>en</language>
    <item>
      <title>Enable “Shift-Left” with Apache Kafka and Iceberg</title>
      <dc:creator>kelsey-deltastream</dc:creator>
      <pubDate>Tue, 22 Apr 2025 17:25:42 +0000</pubDate>
      <link>https://dev.to/deltastream/enable-shift-left-with-apache-kafka-and-iceberg-5f0i</link>
      <guid>https://dev.to/deltastream/enable-shift-left-with-apache-kafka-and-iceberg-5f0i</guid>
      <description>&lt;p&gt;In the past few years, the &lt;a href="https://iceberg.apache.org/" rel="noopener noreferrer"&gt;Apache Iceberg&lt;/a&gt; table format has become the 800-pound gorilla in the data space. DeltaStream supports reading and writing Iceberg using either the AWS Glue catalog or the &lt;a href="https://polaris.apache.org/" rel="noopener noreferrer"&gt;Apache Polaris (incubating)&lt;/a&gt; catalog. This blog walks you through a data scenario in which data in &lt;a href="https://kafka.apache.org/" rel="noopener noreferrer"&gt;Apache Kafka&lt;/a&gt; topics are read, filtered, and enriched with data from another Kafka topic, then written to Iceberg and queried from DeltaStream. &lt;/p&gt;

&lt;h2&gt;
  
  
  Writing Data Tables to Iceberg
&lt;/h2&gt;

&lt;p&gt;When you sign up for a &lt;a href="https://console.deltastream.io/" rel="noopener noreferrer"&gt;DeltaStream&lt;/a&gt; demo, you’re provided with a demo Kafka cluster called &lt;em&gt;trial_store&lt;/em&gt;. If you are unfamiliar with DeltaStream, this &lt;a href="https://deltastream.storylane.io/share/yybzjzjt7fce" rel="noopener noreferrer"&gt;short interactive demo&lt;/a&gt; walks you through it. For the Iceberg catalog, we’ll use the REST catalog with Snowflake’s managed Polaris implementation; the docs can be found &lt;a href="https://other-docs.snowflake.com/opencatalog/tutorials/open-catalog-gs" rel="noopener noreferrer"&gt;here&lt;/a&gt;. AWS Glue is also supported, as is any other REST catalog implementation.&lt;/p&gt;

&lt;p&gt;For this example, we’ll use two of the topics in that cluster and follow the first part of the &lt;a href="https://docs.deltastream.io/getting-started/trial-quick-start" rel="noopener noreferrer"&gt;Quick Start&lt;/a&gt; guide, but we’ll enhance it to write two new tables of processed data to Iceberg.&lt;/p&gt;

&lt;p&gt;In our demo, we will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read the &lt;em&gt;users&lt;/em&gt; Kafka topic&lt;/li&gt;
&lt;li&gt;Read the &lt;em&gt;pageviews&lt;/em&gt; Kafka topic&lt;/li&gt;
&lt;li&gt;Enrich the pageviews with user information&lt;/li&gt;
&lt;li&gt;Write two tables of processed data to Iceberg:

&lt;ul&gt;
&lt;li&gt;Pageviews per city per minute&lt;/li&gt;
&lt;li&gt;Pageviews per user per hour&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Our queries then find the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The top 3 cities with the highest pageviews per hour&lt;/li&gt;
&lt;li&gt;The top 5 users with the highest pageviews per hour&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffktqugreqthrstlpiaqd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffktqugreqthrstlpiaqd.png" width="800" height="535"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Streaming Lakehouse with DeltaStream Fusion
&lt;/h2&gt;

&lt;p&gt;First, let’s define the data we are working with. Our Kafka topic &lt;em&gt;pageviews&lt;/em&gt; looks like this:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
{
  "key": {
    "userid": "User_5"
  },
  "value": {
    "viewtime": 1742335218442,
    "userid": "User_5",
    "pageid": "Page_67"
  }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That means we need to create a DeltaStream Stream object over that topic. We do this with a CREATE STREAM statement:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
CREATE STREAM pageviews (
  viewtime BIGINT,
  userid VARCHAR,
  pageid VARCHAR
) WITH (
  'topic'='pageviews',
  'value.format'='JSON'
);

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That sets up our &lt;em&gt;pageviews&lt;/em&gt; topic so we can access it via SQL. Next, let’s look at the data structure in our &lt;em&gt;users&lt;/em&gt; topic. Note that a &lt;em&gt;userid&lt;/em&gt; field will tie the two streams together.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
{
  "key": {
    "userid": "User_2"
  },
  "value": {
    "registertime": 1742597129390,
    "userid": "User_2",
    "regionid": "Region_9",
    "gender": "OTHER",
    "interests": [
      "News",
      "Movies"
    ],
    "contactinfo": {
      "phone": "6503889999",
      "city": "Palo Alto",
      "state": "CA",
      "zipcode": "94301"
    }
  }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we define a DeltaStream Changelog object for our &lt;em&gt;users&lt;/em&gt; topic named &lt;em&gt;users_log&lt;/em&gt;. Note that we’ve created an array from &lt;em&gt;interests&lt;/em&gt; and a struct from the &lt;em&gt;contactinfo&lt;/em&gt;. Also note that &lt;em&gt;state&lt;/em&gt; is a reserved word in DeltaStream, so it is enclosed in quotes. Our command looks like this:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
CREATE CHANGELOG users_log (
  registertime BIGINT,
  userid VARCHAR,
  regionid VARCHAR,
  gender VARCHAR,
  interests ARRAY&amp;lt;VARCHAR&amp;gt;,
  contactinfo STRUCT&amp;lt;phone VARCHAR, city VARCHAR, "state" VARCHAR, zipcode VARCHAR&amp;gt;,
  PRIMARY KEY(userid)
) WITH (
  'topic'='users',
  'key.format'='json',
  'key.type'='STRUCT&amp;lt;userid VARCHAR&amp;gt;',
  'value.format'='json'
);

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we join our STREAM and CHANGELOG into a new enriched Kafka topic, defined as a DeltaStream STREAM object named &lt;em&gt;csas_enriched_pv&lt;/em&gt;. This combines the user data with the pageview data; the enriched stream is what we will write to our Iceberg tables.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
CREATE STREAM csas_enriched_pv AS
SELECT
  TO_TIMESTAMP_LTZ(viewtime, 3) AS viewtime,
  p.userid AS userid,
  pageid,
  TO_TIMESTAMP_LTZ(registertime, 3) AS registertime,
  regionid,
  gender,
  interests,
  contactinfo -&amp;gt; city AS user_city,
  contactinfo -&amp;gt; "state" AS user_state
FROM pageviews p
  JOIN users_log u ON u.userid = p.userid;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is what that data looks like for reference:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
{
  "key": {
    "userid": "User_5"
  },
  "value": {
    "viewtime": "2025-03-24 21:18:23.526Z",
    "userid": "User_5",
    "pageid": "Page_22",
    "registertime": "2025-03-24 21:18:23.232Z",
    "regionid": "Region_1",
    "gender": "MALE",
    "interests": [
      "News",
      "Movies"
    ],
    "user_city": "Irvine",
    "user_state": "CA"
  }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that the Kafka topic &lt;em&gt;csas_enriched_pv&lt;/em&gt; is available, the fun part begins. The next statement creates an Iceberg table for our “Pageviews per city per minute” scenario. If you are unfamiliar with configuring Iceberg in DeltaStream, this &lt;a href="https://deltastream.storylane.io/share/znxqkq1cg5t7" rel="noopener noreferrer"&gt;short interactive demo&lt;/a&gt; can walk you through it.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
CREATE TABLE pv_per_city_per_minute WITH(
  'iceberg.rest.catalog.namespace.name'='sgns',
  'iceberg.rest.catalog.table.name'='pv_per_city_per_minute'
)
AS SELECT
  user_city,
  count(pageid) AS viewcount,
  window_start,
  window_end
FROM TUMBLE(csas_enriched_pv, size 1 minutes)
GROUP BY user_city, window_start, window_end;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s break down what is happening here for those unfamiliar:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;CREATE TABLE pv_per_city_per_minute: Creates a new table named “pv_per_city_per_minute”&lt;/li&gt;
&lt;li&gt;WITH(…): Contains table properties or configurations. Further explanation is below.&lt;/li&gt;
&lt;li&gt;SELECT user_city, count(pageid) AS viewcount, window_start, window_end:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Selects the user’s city&lt;/li&gt;
&lt;li&gt;Counts the number of page IDs and names this count “viewcount”&lt;/li&gt;
&lt;li&gt;Includes the start and end times of each window interval&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="4"&gt;
&lt;li&gt;TUMBLE(csas_enriched_pv, size 1 minutes): Applies a tumbling window function to the “csas_enriched_pv” table with a window size of 1 minute. A tumbling window divides data into non-overlapping, fixed-size time intervals.&lt;/li&gt;
&lt;li&gt;GROUP BY user_city, window_start, window_end: Groups the results by city and time window, so the count is calculated separately for each city within each one-minute interval.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;One possible point of confusion is where that ‘iceberg.rest.catalog.namespace.name’ value comes from. Iceberg supports an arbitrary number of namespaces for organizing tables in a catalog. Our data store that maps to the Iceberg catalog is named &lt;em&gt;sgir&lt;/em&gt;; right-clicking on it brings up a dialog that includes &lt;em&gt;add&lt;/em&gt;, which prompts for a name. This becomes your new namespace; in our case, it is &lt;em&gt;sgns&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx4nracqg792tfqk2bckb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx4nracqg792tfqk2bckb.png" width="367" height="354"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To further explain the WITH parameters: here, ‘iceberg.rest.catalog.table.name’ uses the same value as the name in the CREATE TABLE statement. It can be set to a different value to alias the table name you’ll use in your queries to the table name in Iceberg – for example, if you want to shorten the name or organize your queries in a particular fashion. In our case, though, we’re keeping the values the same.&lt;/p&gt;

&lt;p&gt;With that query running, let’s launch our second query to generate our table for “Pageviews per user per hour.” What we’re doing here is very similar, but we’ve changed our TUMBLE window to 1 hour and are using &lt;em&gt;userid&lt;/em&gt; instead of &lt;em&gt;user_city&lt;/em&gt;.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
CREATE TABLE pv_per_user_per_hour WITH(
  'iceberg.rest.catalog.namespace.name'='sgns',
  'iceberg.rest.catalog.table.name'='pv_per_user_per_hour'
)
AS SELECT
  userid,
  COUNT(pageid) AS viewcount,
  window_start,
  window_end
FROM TUMBLE(csas_enriched_pv, SIZE 1 HOUR)
GROUP BY userid, window_start, window_end;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without leaving DeltaStream, we can query the Iceberg table from our “Pageviews per city per minute” scenario directly in the same workspace, with no need to move to another tool.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
SELECT user_city, HOUR(window_start) AS hour_value, SUM(viewcount) AS total
FROM pv_per_city_per_minute
GROUP BY user_city, HOUR(window_start)
ORDER BY total DESC
LIMIT 3;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftqbh4uch8zd60oryuhm7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftqbh4uch8zd60oryuhm7.png" width="736" height="409"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, we perform our second query to get the results from our Iceberg table for our top 5 users per hour.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
SELECT userid, window_start, SUM(viewcount) AS total
FROM pv_per_user_per_hour
GROUP BY userid, window_start
ORDER BY total DESC
LIMIT 5;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Like our first scenario, our second scenario, “Pageviews per user per hour,” runs entirely within DeltaStream. Since this data is now in Iceberg, you can also use any other compatible compute engine or BI tool without any lock-in.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdsxiw75nlutqi1qqa1yd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdsxiw75nlutqi1qqa1yd.png" width="705" height="490"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Shift-left and So Much More
&lt;/h2&gt;

&lt;p&gt;We’ve just walked through a classic “shift-left” scenario: we moved enrichment and filtering into the streaming layer and landed the results in the popular Iceberg table format on AWS, making them available to query from any compatible compute engine. We’ve reduced latency by not waiting for a batch process to move the data through a medallion architecture, and we’ve reduced costs by eliminating that transformation compute cost and additional storage. We can even query those Iceberg tables from DeltaStream to do additional joins and enhancements and write them back out to Iceberg if we want. This is just the tip of the, um, Iceberg when it comes to what is possible with DeltaStream.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.deltastream.io/enable-shift-left-with-apache-kafka-and-iceberg/" rel="noopener noreferrer"&gt;Enable “Shift-Left” with Apache Kafka and Iceberg&lt;/a&gt; appeared first on &lt;a href="https://www.deltastream.io" rel="noopener noreferrer"&gt;DeltaStream&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>usecase</category>
    </item>
    <item>
      <title>A Guide to Stateless vs. Stateful Stream Processing</title>
      <dc:creator>kelsey-deltastream</dc:creator>
      <pubDate>Wed, 02 Apr 2025 17:30:30 +0000</pubDate>
      <link>https://dev.to/deltastream/a-guide-to-stateless-vs-stateful-stream-processing-fdl</link>
      <guid>https://dev.to/deltastream/a-guide-to-stateless-vs-stateful-stream-processing-fdl</guid>
<description>&lt;p&gt;Stream processing has become a cornerstone of modern data architectures, enabling real-time analytics, event-driven applications, and continuous data pipelines. From tracking user activity on websites to monitoring IoT devices or processing financial transactions, stream processing systems allow organizations to handle data as it arrives. However, not all stream processing is created equal. Two fundamental paradigms—stateless and stateful stream processing—offer distinct approaches to managing data streams, each with unique strengths, trade-offs, and use cases. In this blog, we’ll explore the technical differences, dive into their implementations, provide examples, and touch on when each approach is most applicable.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Stream Processing?
&lt;/h2&gt;

&lt;p&gt;First, a little background. Stream processing differs from traditional batch processing by operating on continuous, unbounded flows of data—think logs, sensor readings, or social media updates arriving in real time. Frameworks like &lt;a href="https://kafka.apache.org/documentation/streams/" rel="noopener noreferrer"&gt;Apache Kafka Streams&lt;/a&gt;, &lt;a href="https://flink.apache.org/" rel="noopener noreferrer"&gt;Apache Flink&lt;/a&gt;, and &lt;a href="https://spark.apache.org/streaming/" rel="noopener noreferrer"&gt;Spark Streaming&lt;/a&gt; provide the infrastructure to efficiently ingest, transform, and analyze these streams. The key distinction between stateless and stateful processing lies in how these systems manage information across events.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stateless Stream Processing: Simplicity in Motion
&lt;/h2&gt;

&lt;p&gt;Stateless stream processing treats each event as an isolated entity, processing it without retaining memory of prior events. When an event arrives, the system applies predefined logic based solely on its content and then moves on.&lt;/p&gt;

&lt;h4&gt;
  
  
  How Stateless Works:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Input&lt;/strong&gt; A stream of events, e.g., &lt;code&gt;{user_id: 123, action: "click", timestamp: 2025-03-13T10:00:00}&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Processing&lt;/strong&gt; Apply a transformation or filter, such as “if action == ‘click’, increment a counter” or “convert timestamp to local time.”&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output&lt;/strong&gt; Emit the result (e.g., a metric or enriched event) without referencing historical data.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Technical Characteristics:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No State Storage&lt;/strong&gt; Stateless systems don’t maintain persistent storage or in-memory context, reducing resource overhead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt; Since events are independent, stateless processing scales horizontally easily. You can distribute the workload across nodes without synchronization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fault Tolerance&lt;/strong&gt; Recovery is straightforward—lost events can often be replayed without affecting correctness, assuming idempotency (i.e., processing the same event twice yields the same result).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt; Processing is typically low-latency due to minimal overhead.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example Use Case&lt;/strong&gt; Consider a real-time clickstream filter that identifies and forwards “purchase” events from an e-commerce site. Each event is evaluated independently: if the action is “purchase,” it’s sent downstream; otherwise, it’s discarded. No historical context is needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implementation Example (Kafka Streams):&lt;/strong&gt;&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
StreamsBuilder builder = new StreamsBuilder();
KStream&amp;lt;String, String&amp;gt; clicks = builder.stream("clicks-topic");
clicks.filter((key, value) -&amp;gt; value.contains("\"action\":\"purchase\""))
      .to("purchases-topic");

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Trade-Offs:&lt;/strong&gt; Stateless processing is lightweight and simple but limited. It can’t handle use cases requiring aggregation, pattern detection, or temporal relationships—like counting clicks per user over an hour—because it lacks memory of past events.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stateful Stream Processing: Memory and Context
&lt;/h2&gt;

&lt;p&gt;Stateful stream processing, in contrast, maintains state across events, enabling complex computations that depend on historical data. The system tracks information—like running totals, user sessions, or windowed aggregates—in memory or persistent storage.&lt;/p&gt;

&lt;h4&gt;
  
  
  How Stateful Works:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Input&lt;/strong&gt; The same stream, e.g., &lt;code&gt;{user_id: 123, action: "click", timestamp: 2025-03-13T10:00:00}.&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Processing&lt;/strong&gt; Update a state store, e.g., “increment user 123’s click count in a 1-hour window.”&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output&lt;/strong&gt; Emit results based on the updated state, e.g., “user 123 has 5 clicks this hour.”&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Technical Characteristics:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;State Management&lt;/strong&gt; requires mechanisms to track state, such as key-value stores (e.g., RocksDB in Flink) or in-memory caches.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt; More complex due to state partitioning and consistency requirements. Keys (e.g., user_id) are often used to shard state across nodes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fault Tolerance&lt;/strong&gt; State must be checkpointed or replicated to recover from failures, adding overhead but ensuring correctness (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt; Higher than stateless due to state access and updates, though optimizations like caching mitigate this.&lt;/li&gt;
&lt;/ul&gt;
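
&lt;p&gt;To make the checkpointing point above concrete, here is a minimal sketch using Flink’s DataStream API; the 60-second interval is an arbitrary, illustrative choice rather than a recommendation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Enable periodic checkpoints so keyed state can be restored after a failure.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60000, CheckpointingMode.EXACTLY_ONCE); // snapshot state every 60s
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;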

&lt;p&gt;&lt;strong&gt;Example Use Case:&lt;/strong&gt; A fraud detection system that flags users with more than 10 transactions in a 5-minute window. This requires tracking per-user transaction counts over time—a stateful operation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implementation Example (Flink):&lt;/strong&gt;&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
DataStream&amp;lt;Transaction&amp;gt; stream = env.addSource(new TransactionSource());
KeyedStream&amp;lt;Transaction, String&amp;gt; keyed = stream.keyBy(t -&amp;gt; t.userId);
keyed.window(TumblingEventTimeWindows.of(Time.minutes(5)))
     .aggregate(new CountAggregate())
     .filter(count -&amp;gt; count &amp;gt;= 10)
     .addSink(new AlertSink());

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Trade-Offs:&lt;/strong&gt; Stateful processing is powerful but resource-intensive. Managing state increases memory and storage demands, and fault tolerance requires sophisticated checkpointing or logging (e.g., Kafka’s changelog topics). It’s also prone to issues like state bloat (e.g., from old windows that are never expired) if not properly managed.&lt;/p&gt;
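
&lt;p&gt;As a sketch of one common way to keep that state from growing without bound, Flink lets you attach a time-to-live to state descriptors. The one-hour TTL and the "clickCount" state name below are illustrative assumptions, not a prescribed configuration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;

// Expire each key's state one hour after it was last written,
// so old windows and idle sessions don't accumulate forever.
StateTtlConfig ttlConfig = StateTtlConfig.newBuilder(Time.hours(1))
    .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
    .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
    .build();

ValueStateDescriptor&amp;lt;Long&amp;gt; clickCount = new ValueStateDescriptor&amp;lt;&amp;gt;("clickCount", Long.class);
clickCount.enableTimeToLive(ttlConfig);
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;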

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fppwjpjf32eq1xfxvv81x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fppwjpjf32eq1xfxvv81x.png" width="775" height="775"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use Stateless vs. Stateful?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stateless:&lt;/strong&gt; Opt for stateless processing when your application involves simple transformations, filtering, or enrichment without historical context. It’s ideal for lightweight, high-throughput pipelines where speed and simplicity matter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stateful:&lt;/strong&gt; Choose stateful processing for analytics requiring memory—aggregations (e.g., running averages), sessionization, or pattern matching. It’s essential when the “why” behind an event depends on “what came before.”&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Wrap-up
&lt;/h2&gt;

&lt;p&gt;Stateless and stateful stream processing serve complementary roles in the streaming ecosystem. Stateless processing offers simplicity, scalability, and low latency for independent event handling, making it a go-to for straightforward transformations. In contrast, stateful processing, though more complex and resource-heavy, provides more advanced capabilities like time-based aggregations and contextual analysis, which are critical for real-time insights. Choosing between them depends on your use case: stateless for speed and simplicity, stateful for depth and memory. Modern frameworks often support both, allowing hybrid pipelines where stateless filters feed into stateful aggregators. Understanding their mechanics empowers you to design efficient, purpose-built streaming systems.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.deltastream.io/a-guide-to-stateless-vs-stateful-stream-processing/" rel="noopener noreferrer"&gt;A Guide to Stateless vs. Stateful Stream Processing&lt;/a&gt; appeared first on &lt;a href="https://www.deltastream.io" rel="noopener noreferrer"&gt;DeltaStream&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>streamprocessing</category>
      <category>apacheflink</category>
    </item>
    <item>
      <title>An Overview of Shift Left Architecture</title>
      <dc:creator>kelsey-deltastream</dc:creator>
      <pubDate>Wed, 26 Mar 2025 15:59:39 +0000</pubDate>
      <link>https://dev.to/deltastream/an-overview-of-shift-left-architecture-4mob</link>
      <guid>https://dev.to/deltastream/an-overview-of-shift-left-architecture-4mob</guid>
<description>&lt;p&gt;Consumer expectations for speed of service have only increased since the dawn of the information age. The ability to process information quickly and cost-effectively is no longer a luxury; it’s a necessity. Businesses across industries are racing to extract value from their data in real-time, and a transformative approach known as “shift left” is gaining traction. With streaming technologies, organizations can move data processing earlier in the pipeline to slash storage and compute costs, cut latency, and simplify operations. Let’s dive into what shift left means, why it’s a game-changer, and how it can reshape your data strategy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Streaming Data: The Backbone of Modern Systems
&lt;/h2&gt;

&lt;p&gt;Streaming data is ubiquitous in today’s tech ecosystem. From mobile apps to IoT ecosystems, real-time processing powers everything from convenience to security. Consider the scale of this trend: Uber runs over 2,500 Apache Flink jobs to keep ride-sharing seamless; Netflix manages a staggering 16,000 Flink jobs internally; Epic Games tracks real-time gaming metrics; Samsung’s SmartThings platform analyzes device usage on the fly; and Palo Alto Networks leverages streaming for instant threat detection. These examples highlight a clear truth: &lt;a href="https://www.deltastream.io/batch-vs-stream-processing-how-to-choose/" rel="noopener noreferrer"&gt;batch processing alone can’t keep pace&lt;/a&gt; with the demands of modern applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Traditional ELT Approach: A Reliable but Rigid Standard
&lt;/h2&gt;

&lt;p&gt;Historically, organizations have leaned on &lt;a href="https://www.deltastream.io/what-is-streaming-etl-and-how-does-it-differ-from-batch-etl/" rel="noopener noreferrer"&gt;Extract, Load, Transform (ELT) pipelines&lt;/a&gt; to manage their data. In this model, raw data is ingested into &lt;a href="https://www.deltastream.io/data-warehouse-vs-data-lake-vs-data-lakehouse-whats-the-difference/" rel="noopener noreferrer"&gt;data warehouses or lakehouses&lt;/a&gt; and then transformed for downstream use. Many adopt the “medallion architecture” to structure this process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Bronze&lt;/strong&gt; Raw, unprocessed data lands here.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Silver&lt;/strong&gt; Data is cleansed, filtered, and standardized.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gold&lt;/strong&gt; Aggregations and business-ready datasets are produced.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This approach has been a staple thanks to the maturity of batch processing tools and its straightforward design. However, ELT’s limitations are glaring as data volumes grow and real-time needs intensify.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pain Points of ELT
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;High Latency&lt;/strong&gt; Batch jobs run on fixed hourly, daily, or worse schedules, leaving a gap between data generation and actionability. For time-sensitive use cases, this delay is a dealbreaker.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operational Complexity&lt;/strong&gt; When pipelines fail, partial executions can leave a mess. Restarting often requires manual cleanup, draining engineering resources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost Inefficiency&lt;/strong&gt; Batch processing recomputes entire datasets, even if only a fraction has changed. This redundant computation unnecessarily inflates compute costs.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Shift Left: Processing Data in Flight
&lt;/h2&gt;

&lt;p&gt;Enter the shift left paradigm. Instead of deferring transformations to the warehouse, this approach uses streaming technologies—like Apache Flink—to process data as it flows through the pipeline. By shifting computation upstream, organizations can tackle data closer to its source, unlocking dramatic improvements.&lt;/p&gt;

&lt;h4&gt;
  
  
  Why Shift Left Wins
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Reduced Latency&lt;/strong&gt; Processing shrinks from hours or minutes to seconds—or even sub-seconds—making data available almost instantly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lower Costs&lt;/strong&gt; Incremental processing computes only what’s new, avoiding the waste of rehashing unchanged data. Storage costs also drop, since data is filtered before it lands and no redundant copies are kept.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simplified Operations&lt;/strong&gt; Continuous streams eliminate the need for intricate scheduling and orchestration, reducing operational overhead.&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  A Real-World Win
&lt;/h4&gt;

&lt;p&gt;Consider a company running batch pipelines in a data warehouse, costing $11,000 monthly. After shifting left to streaming, their warehouse bill dropped to $2,500. Even factoring in streaming infrastructure costs, they halved their total spend—while slashing latency from 30 minutes to seconds. This isn’t an outlier; it’s a glimpse of shift left’s potential.&lt;/p&gt;

&lt;h4&gt;
  
  
  Bridging the Expertise Gap
&lt;/h4&gt;

&lt;p&gt;Streaming historically demanded deep expertise—think custom Flink jobs or Kafka integrations. That barrier is crumbling. Platforms like DeltaStream are democratizing stream processing with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Serverless Options&lt;/strong&gt; No need to manage clusters or nodes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automated Operations&lt;/strong&gt; Fault tolerance and scaling are handled behind the scenes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SQL-Friendly Interfaces&lt;/strong&gt; Define transformations with familiar syntax, not arcane code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability Guarantees&lt;/strong&gt; Exactly-once processing ensures data integrity without extra effort.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This shift makes streaming viable for teams without PhDs in distributed systems.&lt;/p&gt;

&lt;h4&gt;
  
  
  Transitioning Made Simple
&lt;/h4&gt;

&lt;p&gt;Adopting shift left doesn’t mean scrapping your existing work. If your batch pipelines use SQL, you’re in luck: those statements can often be repurposed for streaming with minor tweaks. This means you can:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Preserve your business logic.&lt;/li&gt;
&lt;li&gt;Stick with SQL-based workflows your team already knows.&lt;/li&gt;
&lt;li&gt;See instant latency and cost benefits.&lt;/li&gt;
&lt;li&gt;Skip the headache of managing streaming infrastructure.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For example, a batch query aggregating hourly sales could pivot to a streaming windowed aggregation with near-identical syntax—same logic, faster results.&lt;/p&gt;
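
&lt;p&gt;As a sketch of what that pivot might look like (the &lt;em&gt;sales&lt;/em&gt; table, its columns, and the exact dialect are hypothetical; the streaming version uses a DeltaStream-style TUMBLE window):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
-- Batch: recompute hourly totals on a schedule
SELECT product_id,
       DATE_TRUNC('hour', sale_time) AS sale_hour,
       SUM(amount) AS total_sales
FROM sales
GROUP BY product_id, DATE_TRUNC('hour', sale_time);

-- Streaming: the same aggregation as a continuous one-hour tumbling window
CREATE TABLE hourly_sales AS
SELECT product_id,
       SUM(amount) AS total_sales,
       window_start,
       window_end
FROM TUMBLE(sales, size 1 hour)
GROUP BY product_id, window_start, window_end;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;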

&lt;h2&gt;
  
  
  The Future Is Streaming
&lt;/h2&gt;

&lt;p&gt;Shifting left isn’t just an optimization, it’s a strategic evolution. As data grows and real-time demands escalate, clinging to batch processing risks falling behind. Thanks to accessible tools and platforms, what was once the domain of tech giants like Netflix or Uber is now within reach for organizations of all sizes. The numbers speak for themselves: lower costs, sub-second insights, and leaner operations. For competitive businesses, shifting left may soon transition from a smart move to a survival imperative. Ready to rethink your pipelines? Take a look at our on-demand webinar for more, &lt;a href="https://www.deltastream.io/webinar/shift-left-lower-cost-reduce-latency-of-your-data-pipelines/" rel="noopener noreferrer"&gt;Shift Left: Lower Cost &amp;amp; Reduce Latency of your Data Pipelines&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.deltastream.io/shift-left-architecture-an-overview/" rel="noopener noreferrer"&gt;An Overview of Shift Left Architecture&lt;/a&gt; appeared first on &lt;a href="https://www.deltastream.io" rel="noopener noreferrer"&gt;DeltaStream&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>technology</category>
      <category>shiftleft</category>
    </item>
    <item>
      <title>The Flink 2.0 Official Release is Stream Processing All Grown Up</title>
      <dc:creator>kelsey-deltastream</dc:creator>
      <pubDate>Mon, 24 Mar 2025 16:20:05 +0000</pubDate>
      <link>https://dev.to/deltastream/the-flink-20-official-release-is-stream-processing-all-grown-up-3mi4</link>
      <guid>https://dev.to/deltastream/the-flink-20-official-release-is-stream-processing-all-grown-up-3mi4</guid>
      <description>&lt;p&gt;The Apache Flink crew dropped version 2.0.0 on March 24, 2025, and it’s the kind of update that makes you sit up and pay attention. I wrote about what was &lt;a href="https://www.deltastream.io/whats-coming-in-apache-flink-2-0/" rel="noopener noreferrer"&gt;likely coming to Flink 2.0&lt;/a&gt; back in November, and the announcement doesn’t disappoint. This isn’t some minor patch cobbled together over a weekend—165 people chipped in over two years, hammering out 25 Flink Improvement Proposals and squashing 369 bugs. It’s the first big leap since Flink 1.0 landed back in 2016, and as someone who’s been in the data weeds for more years than I care to remember, I’m here to tell you it’s a release that feels less like hype and more like a toolset finally catching up to reality. Let’s dig into the details.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Backdrop: Data’s Moving Fast, and We’re Still Playing Catch-Up
&lt;/h2&gt;

&lt;p&gt;Nine years ago, Flink 1.0 showed up when batch jobs were still the default, and streaming was the quirky sidekick. Fast forward to 2025, and the game’s flipped; real-time isn’t optional; it’s necessary. Whether it’s &lt;a href="https://www.deltastream.io/real-time-anomaly-detection-with-sensor-data/" rel="noopener noreferrer"&gt;tracking sensor pings&lt;/a&gt; from a factory floor or keeping an AI chatbot from spitting out stale answers, data’s got to move at the speed of now. The problem is that most streaming setups still feel like they’re held together with duct tape and optimism, costing a fortune and tripping over themselves when the load spikes. With Flink 2.0, this all becomes more manageable. &lt;/p&gt;

&lt;h2&gt;
  
  
  What’s Cooking in Flink 2.0
&lt;/h2&gt;

&lt;p&gt;The official rundown’s got plenty of details, but I’m not here to parrot the press release. Here’s my take on what matters:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;State Goes Remote: Less Baggage, More Breathing Room&lt;/strong&gt;
Flink’s new trick of shoving state management off to remote storage is a quiet killer. No more tying compute and state together like they’re stuck in a toxic relationship; now they’re free to do their own thing. With some asynchronous magic and a nod to stuff like ForStDB, it’s built to scale without choking, especially if you’re running on Kubernetes or some other cloud playground. This feels like a lifeline for anyone who’s watched a pipeline buckle under big state.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Materialized Tables: Less Babysitting, More Doing&lt;/strong&gt;
Ever tried explaining watermarks to a new hire without their eyes glazing over? Flink’s Materialized Tables promise to deal with the details. You toss in a query and a freshness rule, and it figures out the rest: the schema, refreshes, and all the grunt work. That means you can build a pipeline that works for batch and streaming relatively easily. Practical, not flashy. (A sketch of the syntax follows this list.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Paimon Integration: Expanded Lakehouse Support&lt;/strong&gt;
The &lt;a href="https://paimon.apache.org/" rel="noopener noreferrer"&gt;Apache Paimon&lt;/a&gt; support was interesting to see. I’ve been curious about what might happen in that space for a while now. &lt;a href="https://hackernoon.com/what-the-heck-is-apache-paimon" rel="noopener noreferrer"&gt;I wrote about it&lt;/a&gt; in late 2023. The focus is on the concept of the Streaming Lakehouse. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI Nod: Feeding the Future&lt;/strong&gt;
They hint at AI and large language models with a “strong foundation” line but don’t expect a manual yet. My guess is that Flink is betting on being the real-time engine for fraud alerts or LLM-driven apps that need fresh data to stay sharp, which just makes sense. Flink CDC 3.3 introduced support for OpenAI chat and embedding models, so keep an eye on those developments.&lt;/li&gt;
&lt;/ol&gt;
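
&lt;p&gt;For a feel of what that looks like in practice, here is a minimal sketch of the Materialized Table SQL introduced by FLIP-435; the table name, query, and freshness interval are hypothetical:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
-- Declare the result and how fresh it must stay; Flink derives the schema
-- and manages the refresh pipeline (batch or streaming) behind the scenes.
CREATE MATERIALIZED TABLE orders_per_customer
FRESHNESS = INTERVAL '1' MINUTE
AS SELECT customer_id, COUNT(*) AS order_cnt
   FROM orders
   GROUP BY customer_id;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;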

&lt;h2&gt;
  
  
  Why I’m Excited for this Flink Release
&lt;/h2&gt;

&lt;p&gt;Flink 2.0 doesn’t feel like it’s chasing trends; it’s tackling the stuff that keeps engineers up at night. Compared to Kafka Streams, which is lean but light on heavy lifting, or Spark Streaming, which still leans on micro-batches like it’s 2015, Flink can handle the nitty-gritty of event-by-event processing. This release doubles down with better cloud smarts and focuses on keeping costs sane. It’s not about throwing more hardware at the problem; it’s about working more innovatively, and that’s a win for anyone who’s ever had to justify a budget.&lt;/p&gt;

&lt;p&gt;The usability updates really can’t be overstated. Stream processing can be a beast to learn, but those Materialized Tables and cleaner abstractions mean you don’t need to be a guru to get started. It’s still Flink—powerful as ever—but it’s not gatekeeping anymore.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Rough Edges: Change Hurts
&lt;/h2&gt;

&lt;p&gt;Fair warning: This isn’t a plug-and-play upgrade if you’re cozy on Flink 1.x. Old APIs like DataSet are deprecated, and Scala’s legacy bits got the boot. Migration’s going to sting if your setup’s crusty. But honestly? That’s the price of progress. They’re trimming the fat to keep Flink lean and mean; dealing with the pain now will provide many years of stability.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line: Flink 2.0 is Worth a Look
&lt;/h2&gt;

&lt;p&gt;Flink 2.0 isn’t here to reinvent the wheel but to make the wheel roll more smoothly. It’s a solid, no-nonsense upgrade that fits the chaos of 2025’s data demands: fast, scalable, and less of a pain to run. The community’s poured real effort into this, and it shows. Get all the details from the Flink team in their &lt;a href="https://flink.apache.org/2025/03/24/apache-flink-2.0.0-a-new-era-of-real-time-data-processing/" rel="noopener noreferrer"&gt;announcement&lt;/a&gt; and then start planning for your updates. Or take a look at &lt;a href="https://console.deltastream.io/?" rel="noopener noreferrer"&gt;DeltaStream&lt;/a&gt; if you’re interested in all the functionality of Flink, but without the required knowledge and infrastructure.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.deltastream.io/the-flink-2-0-official-release-is-stream-processing-all-grown-up/" rel="noopener noreferrer"&gt;The Flink 2.0 Official Release is Stream Processing All Grown Up&lt;/a&gt; appeared first on &lt;a href="https://www.deltastream.io" rel="noopener noreferrer"&gt;DeltaStream&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>technology</category>
      <category>apacheflink</category>
    </item>
    <item>
      <title>The Top Four Trends Driving Organizations from Batch to Streaming Analytics</title>
      <dc:creator>kelsey-deltastream</dc:creator>
      <pubDate>Thu, 20 Mar 2025 15:43:49 +0000</pubDate>
      <link>https://dev.to/deltastream/the-top-four-trends-driving-organizations-from-batch-to-streaming-analytics-3p72</link>
      <guid>https://dev.to/deltastream/the-top-four-trends-driving-organizations-from-batch-to-streaming-analytics-3p72</guid>
      <description>&lt;p&gt;Over the past decade, the way businesses handle data has fundamentally changed. Organizations that once relied on batch processing to analyze data at scheduled intervals are now moving toward &lt;strong&gt;streaming analytics&lt;/strong&gt; —where data is processed in real-time. While early adopters of streaming technologies were primarily large tech companies like Netflix, Apple, and DoorDash, today, businesses of all sizes are embracing streaming analytics to make faster, more informed decisions.&lt;/p&gt;

&lt;p&gt;But what’s driving this shift? Below, we explore the key trends pushing organizations toward streaming analytics and highlight the most common use cases where it’s making a significant impact.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Rising Customer Expectations for Real-Time Insights
&lt;/h2&gt;

&lt;p&gt;“74% of IT leaders report that streaming data enhances customer experiences, and 73% say it enables faster decision-making.” Source: &lt;a href="https://venturebeat.com/data-infrastructure/early-adopters-of-data-streaming-see-up-to-10x-returns-survey-finds/#:~:text=Notably%2C%20a%20large%20number%20of,improving%20customer%20experience%2C%20automating%20internal" rel="noopener noreferrer"&gt;VentureBeat&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Modern consumers expect instant interactions. Businesses that rely on batch-processed analytics struggle to keep up with customer demands for instant responses. Streaming analytics allows companies to react in real-time, improving customer satisfaction and competitive advantage. &lt;/p&gt;

&lt;h4&gt;
  
  
  Example Use Cases:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;E-commerce:&lt;/strong&gt; Dynamic pricing and personalized recommendations based on real-time browsing behavior.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AdTech:&lt;/strong&gt; Updates ad bids dynamically based on audience engagement.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gaming:&lt;/strong&gt; Tailors in-game rewards based on real-time player activity.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  2. Enterprise-Ready Solutions Make Streaming More Accessible
&lt;/h2&gt;

&lt;p&gt;“The streaming analytics market is projected to grow at a CAGR of 26% from 2024 to 2032, reaching $176.29B.” Source: &lt;a href="https://www.gminsights.com/industry-analysis/streaming-analytics-market" rel="noopener noreferrer"&gt;GMInsights&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Previously, streaming analytics required specialized expertise and was considered too complex and costly for most organizations. Today, the rise of streaming ETL and continuous data integration–combined with cloud-native solutions such as Google Dataflow, RedPanda, Confluent, and DeltaStream–is lowering the barrier to adoption. These platforms provide enterprise-friendly managed solutions that eliminate operational overhead, allowing businesses to implement streaming analytics without needing large in-house engineering teams. &lt;/p&gt;

&lt;h4&gt;
  
  
  Example Use Cases:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Warehousing:&lt;/strong&gt; Ingests and updates analytics data in real time, ensuring dashboards reflect the latest insights.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IoT Platforms:&lt;/strong&gt; Aggregates and processes sensor data instantly for real-time monitoring and automation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Financial Services:&lt;/strong&gt; Streams transactions into risk analytics models to detect fraud as it happens.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  3. The Rise of LLMs and the Need for Fresh, Real-Time Data
&lt;/h2&gt;

&lt;p&gt;“ AI and ML adoption are driving &lt;strong&gt;a 40% increase in real-time data workloads.&lt;/strong&gt; ” Source: &lt;a href="https://www.infoq.com/" rel="noopener noreferrer"&gt;InfoQ&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The rapid adoption of LLMs has shifted the focus from model capabilities to data freshness and uniqueness. Foundational models are becoming increasingly commoditized, and organizations can no longer rely on model performance alone for differentiation. Instead, real-time access to fresh, proprietary data determines accuracy, relevance, and competitive advantage.&lt;/p&gt;

&lt;p&gt;The recent partnership between Confluent and Databricks highlights this growing demand for real-time data in AI workloads. Yet, stream processing remains a critical gap—organizations need ways to transform, enrich, and prepare real-time data before feeding it into RAG pipelines and other AI-driven applications to ensure accuracy and relevance.&lt;/p&gt;

&lt;h4&gt;
  
  
  Example Use Cases:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real-Time Feature Engineering:&lt;/strong&gt; Continuously transforms raw data streams into structured features for AI models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;News &amp;amp; Financial Analytics:&lt;/strong&gt; Filters, enriches, and feeds LLMs with the latest market trends and breaking news.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conversational AI &amp;amp; Chatbots:&lt;/strong&gt; Incorporates real-time business data, technical support, and events to improve AI-driven interactions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  4. Regulations are Driving Real-Time Monitoring Needs
&lt;/h2&gt;

&lt;p&gt;“ On November 12, 2024, the UK’s Financial Conduct Authority (FCA) fined Metro Bank £16.7 million for failing to monitor 60 million transactions worth £51 billion in real time, a direct violation of Anti-Money Laundering (AML) regulations.” Source: &lt;a href="https://www.fca.org.uk/news/press-releases/fca-fines-metro-bank-16m-financial-crime-failings" rel="noopener noreferrer"&gt;FCA&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Industries with strict compliance requirements are now mandated to monitor and report data events in real-time. Whether it’s fraud detection in banking, patient data security in healthcare, or GDPR compliance in data privacy, organizations must implement streaming analytics to meet these regulatory requirements. Real-time monitoring ensures businesses can detect anomalies instantly and prevent costly compliance violations.&lt;/p&gt;

&lt;h4&gt;
  
  
  Example Use Cases:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Banking:&lt;/strong&gt; Anti-money laundering (AML) compliance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Telecom:&lt;/strong&gt; Real-time call monitoring for regulatory audits.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Government:&lt;/strong&gt; Cybersecurity and national security threat detection.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion: Streaming Analytics is No Longer Optional
&lt;/h2&gt;

&lt;p&gt;What was once a niche technology for highly technical organizations is now a necessity for businesses across industries. The push toward &lt;strong&gt;real-time analytics&lt;/strong&gt; is being fueled by customer expectations, technological advancements, AI adoption, regulatory requirements, and competitive pressures.&lt;/p&gt;

&lt;p&gt;Whether businesses are looking to prevent fraud, optimize supply chains, or personalize customer experiences, the ability to analyze data in motion is now a crucial part of modern data strategies.&lt;/p&gt;

&lt;p&gt;For organizations still relying on batch processing, it is time to evaluate how streaming analytics can transform their data-driven decision-making. The future is real-time. Will you be ready?&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.deltastream.io/the-top-four-trends-driving-organizations-from-batch-to-streaming-analytics/" rel="noopener noreferrer"&gt;The Top Four Trends Driving Organizations from Batch to Streaming Analytics&lt;/a&gt; appeared first on &lt;a href="https://www.deltastream.io" rel="noopener noreferrer"&gt;DeltaStream&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>technology</category>
      <category>batchdata</category>
      <category>streamingdata</category>
      <category>realtimedata</category>
    </item>
    <item>
      <title>A Guide to the Top Stream Processing Frameworks</title>
      <dc:creator>kelsey-deltastream</dc:creator>
      <pubDate>Mon, 27 Jan 2025 21:34:33 +0000</pubDate>
      <link>https://dev.to/deltastream/a-guide-to-the-top-stream-processing-frameworks-3pg6</link>
      <guid>https://dev.to/deltastream/a-guide-to-the-top-stream-processing-frameworks-3pg6</guid>
      <description>&lt;p&gt;Every second, billions of data points pulse through the digital arteries of modern business. A credit card swipe, a sensor reading from a wind farm, or stock trades on Wall Street  – each signal holds potential value, but only if you can catch it at the right moment. Stream processing frameworks enable organizations to process and analyze massive streams of data with low latency. This blog explores some of the most popular stream processing frameworks available today, highlighting their features, advantages, and use cases. These frameworks form the backbone of many real-time applications, enabling businesses to derive meaningful insights from ever-flowing torrents of data&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Stream Processing?
&lt;/h2&gt;

&lt;p&gt;Stream processing refers to the practice of processing data incrementally as it is generated rather than waiting for the entire dataset to be collected. This allows systems to respond to events or changes in real-time, making it invaluable for time-sensitive applications. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fraud detection in banking:&lt;/strong&gt; &lt;a href="https://www.deltastream.io/enhancing-fraud-detection-with-puppygraph-and-deltastream/" rel="noopener noreferrer"&gt;Transactions can be analyzed in real-time&lt;/a&gt; for suspicious activity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;E-commerce recommendations:&lt;/strong&gt; Streaming data from user interactions can be used to offer instant product recommendations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IoT monitoring:&lt;/strong&gt; &lt;a href="https://www.deltastream.io/stream-processing-for-iot-data/" rel="noopener noreferrer"&gt;Data from IoT devices can be processed continuously&lt;/a&gt; for system updates or alerts.&lt;/li&gt;
&lt;/ul&gt;
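
&lt;p&gt;To make this concrete, here is a minimal sketch of a continuous SQL query in the style of streaming SQL engines such as DeltaStream or Flink SQL. The stream and column names (card_transactions, txn_id, and so on) are hypothetical placeholders, not taken from any specific product example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- A continuous query: it never terminates, and each new event on the
-- input stream is evaluated the moment it arrives.
-- (Sketch only; the stream and column names are illustrative.)
CREATE STREAM suspicious_txns AS
SELECT txn_id, card_id, amount, txn_time
FROM card_transactions
WHERE amount &amp;gt; 10000;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The key difference from a batch report is that the result is itself a stream: downstream systems subscribe to suspicious_txns instead of re-running a query on a schedule.&lt;/p&gt;
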
&lt;p&gt;Stream processing frameworks enable developers to build, deploy, and scale real-time applications. Let’s examine some of the most popular ones.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Kafka Streams
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Overview:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kafka.apache.org/documentation/streams/" rel="noopener noreferrer"&gt;Apache Kafka Streams&lt;/a&gt;, an extension of &lt;a href="https://kafka.apache.org/" rel="noopener noreferrer"&gt;Apache Kafka&lt;/a&gt;, is a lightweight library for building applications and microservices. It provides a robust API for processing data streams directly from Kafka topics and writing the results back to other Kafka topics or external systems. The API only supports JVM languages, including Java and Scala.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It is fully integrated with Apache Kafka, making it a seamless choice for Kafka users.&lt;/li&gt;
&lt;li&gt;Provides stateful processing with the ability to maintain in-memory state stores.&lt;/li&gt;
&lt;li&gt;Scalable and fault-tolerant architecture.&lt;/li&gt;
&lt;li&gt;Built-in support for windowing operations and event-time processing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time event monitoring and processing.&lt;/li&gt;
&lt;li&gt;Building distributed stream processing applications.&lt;/li&gt;
&lt;li&gt;Log aggregation and analytics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Kafka Streams is ideal for developers already using Kafka for message brokering, as it eliminates the need for additional stream processing infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Flink
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Overview:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;a href="https://www.deltastream.io/why-apache-flink-is-the-industry-gold-standard/" rel="noopener noreferrer"&gt;Apache Flink is a highly versatile and scalable stream processing framework&lt;/a&gt; that excels at handling unbounded data streams. It offers powerful features for stateful processing, event-time semantics, and exactly-once guarantees.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Support for both batch and stream processing in a unified architecture.&lt;/li&gt;
&lt;li&gt;Event-time processing: Handles out-of-order events using watermarks.&lt;/li&gt;
&lt;li&gt;High fault tolerance with distributed state management.&lt;/li&gt;
&lt;li&gt;Integration with popular tools such as Apache Kafka, Apache Cassandra, and HDFS.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complex event processing in IoT applications.&lt;/li&gt;
&lt;li&gt;Fraud detection and risk assessment in finance.&lt;/li&gt;
&lt;li&gt;Real-time analytics for social media platforms.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Apache Flink is particularly suited for applications requiring low-latency processing, high throughput, and robust state management.&lt;/p&gt;
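
&lt;p&gt;To illustrate the event-time and watermark support mentioned above, here is a brief Flink SQL sketch. The table name, columns, and connector settings are hypothetical placeholders rather than a complete job definition:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Declare an event-time column and a watermark that tolerates events
-- arriving up to five seconds out of order.
CREATE TABLE pageviews (
  user_id STRING,
  url STRING,
  ts TIMESTAMP(3),
  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'pageviews',
  'properties.bootstrap.servers' = 'localhost:9092',
  'format' = 'json'
);

-- Count pageviews per URL over one-minute tumbling windows; a window
-- closes once the watermark passes its end time.
SELECT window_start, url, COUNT(*) AS views
FROM TABLE(TUMBLE(TABLE pageviews, DESCRIPTOR(ts), INTERVAL '1' MINUTE))
GROUP BY window_start, window_end, url;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because Flink treats bounded data as a special case of a stream, the same query can also run over a historical dataset as a batch job, which is what the unified architecture above refers to.&lt;/p&gt;
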

&lt;h2&gt;
  
  
  Apache Spark Streaming
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Overview:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;a href="https://spark.apache.org/streaming/" rel="noopener noreferrer"&gt;Apache Spark Streaming&lt;/a&gt; extends Apache Spark’s batch processing capabilities to real-time data streams. Its micro-batch architecture processes streaming data in small, fixed intervals, making it easy to build real-time applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Micro-batch processing: Processes streams in discrete intervals for near-real-time results.&lt;/li&gt;
&lt;li&gt;High integration with the larger Spark ecosystem, including MLlib, GraphX, and Spark SQL.&lt;/li&gt;
&lt;li&gt;Scalable and fault-tolerant architecture.&lt;/li&gt;
&lt;li&gt;Compatible with popular data sources like Kafka, HDFS, and Amazon S3.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Live dashboards and analytics.&lt;/li&gt;
&lt;li&gt;Real-time sentiment analysis for social media.&lt;/li&gt;
&lt;li&gt;Log processing and monitoring for large-scale systems.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While its micro-batch approach results in slightly higher latency compared to true stream processing frameworks like Flink, Spark Streaming is still a popular choice due to its ease of use and integration with the Spark ecosystem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Storm
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Overview:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;a href="https://storm.apache.org/" rel="noopener noreferrer"&gt;Apache Storm&lt;/a&gt; is one of the pioneers in the field of distributed stream processing. Known for its simplicity and low latency, Storm is a reliable choice for real-time processing of high-velocity data streams.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tuple-based processing: Processes data streams as tuples in real time.&lt;/li&gt;
&lt;li&gt;High fault tolerance with automatic recovery of failed components.&lt;/li&gt;
&lt;li&gt;Horizontal scalability and support for a wide range of programming languages.&lt;/li&gt;
&lt;li&gt;Simple architecture with “spouts” (data sources) and “bolts” (data processors).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time event processing for online gaming.&lt;/li&gt;
&lt;li&gt;Fraud detection in financial transactions.&lt;/li&gt;
&lt;li&gt;Processing sensor data in IoT systems.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Although Apache Storm has been largely overtaken by newer frameworks like Flink and Kafka Streams, it remains an option for applications where low latency and simplicity are key priorities. It is being actively maintained and updated, with version 2.7.1 released in November 2024.&lt;/p&gt;

&lt;h2&gt;
  
  
  Google Dataflow
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Overview:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;a href="https://cloud.google.com/products/dataflow" rel="noopener noreferrer"&gt;Google Dataflow&lt;/a&gt; is a fully managed, cloud-based stream processing service. It is built on the Apache Beam model, which provides a unified API for batch and stream processing and enables portability across different execution engines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unified programming model for batch and stream processing.&lt;/li&gt;
&lt;li&gt;Integration with Google Cloud services like BigQuery, Pub/Sub, and Cloud Storage.&lt;/li&gt;
&lt;li&gt;Automatic scaling and resource management.&lt;/li&gt;
&lt;li&gt;Support for windowing and event-time processing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time analytics pipelines in cloud-native applications.&lt;/li&gt;
&lt;li&gt;Data enrichment and transformation for machine learning workflows.&lt;/li&gt;
&lt;li&gt;Monitoring and alerting systems.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Google Dataflow is best for businesses already operating in the Google Cloud ecosystem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Amazon Kinesis
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Overview:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;a href="https://aws.amazon.com/kinesis/" rel="noopener noreferrer"&gt;Amazon Kinesis&lt;/a&gt; is a cloud-native stream processing platform provided by AWS. It simplifies streaming data ingestion, processing, and analysis in real-time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fully managed service with automatic scaling.&lt;/li&gt;
&lt;li&gt;Supports custom application development using the Kinesis Data Streams API.&lt;/li&gt;
&lt;li&gt;Integration with AWS services such as Lambda, S3, and Redshift.&lt;/li&gt;
&lt;li&gt;Built-in analytics capabilities with Kinesis Data Analytics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time clickstream analysis for e-commerce platforms.&lt;/li&gt;
&lt;li&gt;IoT telemetry data processing.&lt;/li&gt;
&lt;li&gt;Monitoring application logs and metrics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Amazon Kinesis can be the most sensible option for a company already using AWS services, as it offers a quick way to start. &lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing the Right Stream Processing Framework
&lt;/h2&gt;

&lt;p&gt;The choice of a stream processing framework depends on your specific requirements, such as latency tolerance, scalability needs, ease of integration, and existing technology stack. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you’re heavily invested in Kafka, Kafka Streams is a likely fit.&lt;/li&gt;
&lt;li&gt;Apache Flink is an excellent choice for low-latency, high-throughput applications and works with a wide array of data repository types.&lt;/li&gt;
&lt;li&gt;Organizations with expertise in the cloud can benefit from managed services like Google Dataflow or Amazon Kinesis.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Stream processing frameworks are essential for extracting real-time insights from dynamic data streams. The frameworks mentioned above – Apache Kafka Streams, Flink, Spark Streaming, Storm, Google Dataflow, and Amazon Kinesis – each have unique strengths and ideal use cases. By selecting the right tool for your needs, you can unlock the full potential of real-time data processing, powering next-generation applications and services.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.deltastream.io/a-guide-to-the-top-stream-processing-frameworks/" rel="noopener noreferrer"&gt;A Guide to the Top Stream Processing Frameworks&lt;/a&gt; appeared first on &lt;a href="https://www.deltastream.io" rel="noopener noreferrer"&gt;DeltaStream&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>technology</category>
      <category>streamprocessing</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Enhancing Fraud Detection with PuppyGraph and DeltaStream</title>
      <dc:creator>kelsey-deltastream</dc:creator>
      <pubDate>Tue, 17 Dec 2024 16:39:21 +0000</pubDate>
      <link>https://dev.to/deltastream/enhancing-fraud-detection-with-puppygraph-and-deltastream-27po</link>
      <guid>https://dev.to/deltastream/enhancing-fraud-detection-with-puppygraph-and-deltastream-27po</guid>
      <description>&lt;h1&gt;Enhancing Fraud Detection with PuppyGraph and DeltaStream&lt;/h1&gt;

&lt;p&gt;The banking and finance industry has been one of the biggest beneficiaries of digital advancements. Many technological innovations find practical applications in finance, providing convenience and efficiency that can set institutions apart in a competitive market. However, this ease and accessibility have also led to increased fraud, particularly in credit card transactions, which remain a growing concern for consumers and financial institutions.&lt;/p&gt;

&lt;p&gt;Traditional fraud detection systems rely on rule-based methods that struggle in real-time scenarios. These outdated approaches are often reactive, identifying fraud only after it occurs. Without real-time capabilities or advanced reasoning, they fail to match fraudsters’ rapidly evolving tactics. A more proactive and sophisticated solution is essential to combat this threat effectively.&lt;/p&gt;

&lt;p&gt;This is where graph analytics and real-time stream processing come into play. Combining &lt;a href="http://puppygraph.com" rel="noopener noreferrer"&gt;PuppyGraph&lt;/a&gt;, the first and only graph query engine, with &lt;a href="https://www.deltastream.io/" rel="noopener noreferrer"&gt;DeltaStream&lt;/a&gt;, a stream processing engine powered by Apache Flink, enables institutions to improve fraud detection accuracy and efficiency, including real-time capabilities. In this blog post, we’ll explore the challenges of modern fraud detection and the advantages of using graph analytics and real-time processing. We will also provide a step-by-step guide to building a fraud detection system with PuppyGraph and DeltaStream. &lt;/p&gt;

&lt;p&gt;Let’s start by examining the challenges of modern fraud detection.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Fraud Detection Challenges
&lt;/h2&gt;

&lt;p&gt;Credit card fraud has always been a game of cat and mouse. Even before the rise of digital processing and online transactions, fraudsters found ways to exploit vulnerabilities. With the widespread adoption of technology, fraud has only intensified, creating a constantly evolving fraud landscape that is increasingly difficult to navigate. Key challenges in modern fraud detection include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Volume: Daily credit card transactions are too vast to review and identify suspicious activity manually. Automation is critical to sorting through all that data and identifying anomalies.&lt;/li&gt;
&lt;li&gt;Complexities: Fraudulent activity often involves complex patterns and relationships that traditional rule-based systems can’t detect. For example, fraudsters may use stolen credit card information to make a series of small transactions before a large one or use multiple cards in different locations in a short period.&lt;/li&gt;
&lt;li&gt;Real-time: The sooner fraud is detected, the less financial loss there will be. Real-time analysis is crucial in detecting and preventing transactions as they happen, especially when fraud can be committed at scale in seconds.&lt;/li&gt;
&lt;li&gt;Agility: Fraudsters will adapt to new security measures. Fraud detection systems must be agile, even learning as they go, to keep up with the evolving threats and tactics.&lt;/li&gt;
&lt;li&gt;False positives: While catching fraudulent transactions is essential, it’s equally important to avoid flagging legitimate transactions as fraud. False positives can frustrate customers, especially when a card is automatically locked out due to legitimate purchases. As a consequence, they can adversely affect revenue.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To tackle these challenges, businesses require a solution that processes large volumes of data in real-time, identifies complex patterns, and evolves with new fraud tactics. Graph analytics and real-time stream processing are essential components of such a system. By mapping and analyzing transaction networks, businesses can more effectively detect anomalies in customer behavior and identify potentially fraudulent transactions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Leveraging Graph Analytics for Fraud Detection
&lt;/h2&gt;

&lt;p&gt;Traditional fraud detection methods analyze individual transactions in isolation. This can miss connections and patterns that emerge when we examine the bigger picture. Graph analytics allows us to visualize and analyze transactions as a network of connected things.&lt;/p&gt;

&lt;p&gt;Think of it like a social network. Each customer, credit card, merchant, and device becomes a node in the graph, and each transaction connects those nodes. We can find hidden patterns and anomalies that indicate fraud by looking at the relationships between nodes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvh49a5jr9ae8c5zccbn6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvh49a5jr9ae8c5zccbn6.png" width="800" height="496"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Figure: an example schema for fraud detection use case&lt;/p&gt;

&lt;p&gt;Here’s how graph analytics can be applied to fraud detection:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Finding suspicious connections: Graph algorithms can discover unusual patterns of connections between entities. For example, if the same person uses multiple credit cards in different locations in a short period or a single card is used to buy from a group of merchants known for fraud, those connections will appear in the graph and be flagged as suspicious.&lt;/li&gt;
&lt;li&gt;Uncovering fraud rings: Fraudsters often work within the same circles, using multiple identities and accounts to carry out scams. Graph analytics can find those complex networks of people and their connections, helping to identify and potentially break up entire fraud rings.&lt;/li&gt;
&lt;li&gt;Surfacing identity theft: When a stolen credit card is used, the spending patterns will generally be quite different from the cardholder’s normal behavior. By looking at the historical and current transactions within a graph, you can see sudden changes in spending habits, locations, and types of purchases that may indicate identity theft.&lt;/li&gt;
&lt;li&gt;Predicting future fraud: Graph analytics can predict future fraud by looking at historical data and the patterns that precede a fraudulent transaction. By predicting fraud before it happens, businesses can take action to prevent it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Of course, all of these benefits are extremely helpful. However, the biggest hurdle to realizing them is the complexity of implementing a graph database. Let’s look at some of those challenges and how PuppyGraph can help users avoid them entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges of Implementing and Running Graph Databases
&lt;/h2&gt;

&lt;p&gt;As shown, graph databases can be an excellent tool for fraud detection. So why aren’t they used more frequently? This usually boils down to implementing and managing them, which can be complex for those unfamiliar with the technology. The hurdles that come with implementing a graph database can far outweigh the benefits for some businesses, even stopping them from adopting this technology altogether. Here are some of the issues generally faced by companies implementing graph databases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cost: Traditional relational databases have been the norm for decades, and many organizations have invested heavily in their infrastructure. Switching to a graph database or even running a proof of concept requires a significant upfront investment in new software, hardware, and training. &lt;/li&gt;
&lt;li&gt;Implementing ETL: Extracting, transforming, and loading (ETL) data into a graph database can be tricky and time-consuming. Data needs to be restructured to fit into a graph model, which requires knowledge of the underlying data to be moved over and how to represent these entities and relationships within a graph model. This requires specific skills and adds to the implementation time and cost, meaning the benefits may be delayed.&lt;/li&gt;
&lt;li&gt;Bridging the skills gap: Graph databases require a different data modeling and querying approach from traditional databases. In addition to the previous point regarding ETL, finding people with the skills to manage, maintain, and query the data within a graph database can also be challenging. Without these skills, graph technology adoption is mostly dead in the water.&lt;/li&gt;
&lt;li&gt;Integration challenges: Integrating a graph database with existing systems and applications is complex. This usually involves taking the output from graph queries and mapping it into downstream systems, which requires careful planning and execution. Getting data to flow smoothly and remain compatible across different systems takes significant effort.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These challenges highlight the need for solutions that make graph database adoption and management more accessible. A graph query engine like PuppyGraph addresses these issues by enabling teams to integrate their data and query it as a graph in minutes without the complexity of ETL processes or the need to set up a traditional graph database. Let’s look at how PuppyGraph helps teams become graph-enabled without ETL or the need for a graph database.&lt;/p&gt;

&lt;h2&gt;
  
  
  How PuppyGraph Solves Graph Database Challenges
&lt;/h2&gt;

&lt;p&gt;PuppyGraph is built to tackle the challenges that often hinder graph database adoption. By rethinking graph analytics, PuppyGraph removes many entry barriers, opening up graph capabilities to more teams than otherwise possible. Here’s how PuppyGraph addresses many of the hurdles mentioned above:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zero-ETL: One of PuppyGraph’s most significant advantages is connecting directly to your existing data warehouses and data lakes—no more complex and time-consuming ETL. There is no need to restructure data or create separate graph databases. Simply connect the graph query engine directly to your SQL data store and start querying your data as a graph in minutes.&lt;/li&gt;
&lt;li&gt;Cost: PuppyGraph reduces the expenses of graph analytics by using your existing data infrastructure. There is no need to invest in new database infrastructure or software and no ongoing maintenance costs of traditional graph databases. Eliminating the ETL process significantly reduces the engineering effort required to build and maintain fragile data pipelines, saving time and resources.&lt;/li&gt;
&lt;li&gt;Reduced learning curve: Traditional graph databases often require users to master complex graph query languages for every operation, including basic data manipulation. PuppyGraph simplifies this by functioning as a graph query engine that operates alongside your existing SQL query engine using the same data. You can continue using familiar SQL tools for data preparation, aggregation, and management. When more complex queries suited to graph analytics arise, PuppyGraph handles them seamlessly. This approach saves time and allows teams to reserve graph query languages specifically for graph traversal tasks, reducing the learning curve and broadening access to graph analytics.&lt;/li&gt;
&lt;li&gt;Multi-query language support: Engineers can continue to use their existing SQL skills and platform, allowing them to leverage graph querying when needed. The platform offers many ways to build graph queries, including Gremlin and Cypher support, so your existing team can quickly adopt and use graph technology.&lt;/li&gt;
&lt;li&gt;Effortless scaling: PuppyGraph’s architecture separates compute and storage so it can easily handle petabytes of data. By leveraging their underlying SQL storage, teams can effortlessly scale their compute as required. You can focus on extracting value from your data without scaling headaches.&lt;/li&gt;
&lt;li&gt;Fast deployment: With PuppyGraph, you can deploy and start querying your data as a graph in 10 minutes. There are no long setup processes or complex configurations. Fast deployment means you can start seeing the benefits of graph analytics and speed up your fraud detection.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short, PuppyGraph removes the traditional barriers to graph adoption so more institutions can use graph analytics for fraud detection use cases. By simplifying, reducing costs, and empowering existing teams with effortless graph adoption, PuppyGraph makes graph technology accessible for all teams and organizations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-Time Fraud Prevention with DeltaStream
&lt;/h2&gt;

&lt;p&gt;Speed is key in the fight against fraud, and responsiveness is crucial to preventing or minimizing the impact of an attack. Systems and processes that act on events with minimal latency can mean the difference between successful and unsuccessful cyber attacks. DeltaStream empowers businesses to analyze and respond to suspicious transactions in real-time, minimizing losses and preventing further damage.&lt;/p&gt;

&lt;p&gt;Why Real-Time Matters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Immediate Response: Rapid incident response means security and data teams can detect, isolate, and trigger mitigation protocols, minimizing their vulnerability window faster than ever. With real-time data and sub-second latency, the Mean Time to Detect (MTTD) and Mean Time to Respond (MTTR) can be significantly reduced.&lt;/li&gt;
&lt;li&gt;Proactive Prevention: Data and security teams can identify behavior patterns as they emerge and implement mitigation tactics. Real-time allows for continuous monitoring of system health and security with predictive models. &lt;/li&gt;
&lt;li&gt;Improved Accuracy: Real-time data provides a more accurate view of customer behavior for precise detection. Threats are more complex than ever and often involve multi-stage attack patterns; streaming data aids in identifying these complex and ever-evolving threat tactics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;DeltaStream’s Key Features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Speed: Increase the speed of your data processing and your team’s ability to create data applications. Reduce latency and cost by shifting your data transformations out of your warehouse and into DeltaStream. Data teams can also quickly write queries in SQL to create analytics pipelines with no other complex languages to learn.&lt;/li&gt;
&lt;li&gt;Team Focus: Eliminate maintenance tasks with our continually optimizing Flink operator. Your team isn’t focused on infrastructure, meaning they can focus on building and strengthening pipelines.&lt;/li&gt;
&lt;li&gt;Unified View: An organization’s data rarely comes from just one source. Process streaming data from multiple sources in real-time to get a complete picture of activities. This means transaction data, user behavior, and other relevant signals can be analyzed together as they occur.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By combining PuppyGraph’s graph analytics with DeltaStream’s real-time processing, businesses can create a dynamic fraud detection system that stays ahead of evolving threats.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step-by-Step tutorial: DeltaStream and PuppyGraph
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fabjouil6c9a1xgp9az9r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fabjouil6c9a1xgp9az9r.png" width="800" height="442"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this tutorial, we go through the high-level steps of integrating DeltaStream and PuppyGraph. &lt;/p&gt;

&lt;p&gt;The detailed steps are available at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/puppygraph/puppygraph-getting-started/blob/main/integration-demos/deltastream-demo/delta_stream_databricks.md" rel="noopener noreferrer"&gt;Integrating DeltaStream with Databricks and Querying Data Using PuppyGraph&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/puppygraph/puppygraph-getting-started/blob/main/integration-demos/deltastream-demo/delta_stream_snowflake.md" rel="noopener noreferrer"&gt;Integrating DeltaStream with Snowflake and Querying Data Using PuppyGraph&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Starting a Kafka Cluster
&lt;/h3&gt;

&lt;p&gt;We start a Kafka Server as the data input. (Later in the tutorial, we’ll send financial data through Kafka.)&lt;/p&gt;

&lt;p&gt;We create topics for financial data like this:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
1. 
bin/kafka-topics.sh --create --topic kafka-Account --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Setting up DeltaStream
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Connecting to Kafka
&lt;/h4&gt;

&lt;p&gt;Log in to the DeltaStream console. Then, navigate to &lt;strong&gt;Resources&lt;/strong&gt; and add a Kafka Store – for example, kafka_demo – with the Kafka Cluster parameters we created in the previous step.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fldio5jrkh9qpfebebv5b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fldio5jrkh9qpfebebv5b.png" width="800" height="281"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, in the &lt;strong&gt;Workspace&lt;/strong&gt;, create a DeltaStream database – for example: kafka_db&lt;br&gt;&lt;br&gt;
After that, we use DeltaStream SQL to create &lt;em&gt;streams&lt;/em&gt; for the Kafka &lt;em&gt;topics&lt;/em&gt; we created in the previous step. The &lt;em&gt;stream&lt;/em&gt; describes the topic’s physical layout so it can be easily referenced with SQL. Here is an example of one of the streams we create in DeltaStream for a Kafka topic. Once we declare the streams, we can build streaming data pipelines to transform, enrich, aggregate, and prepare streaming data for analysis in PuppyGraph. First, we’ll define the &lt;em&gt;account_stream&lt;/em&gt; from the &lt;em&gt;kafka-Account&lt;/em&gt; topic.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
1. 
CREATE STREAM account\_stream (
2. 
"label" STRING,
3. 
"accountId" BIGINT,
4. 
"createTime" STRING,
5. 
"isBlocked" BOOLEAN,
6. 
"accoutType" STRING,
7. 
"nickname" STRING,
8. 
"phonenum" STRING,
9. 
"email" STRING,
10. 
"freqLoginType" STRING,
11. 
"lastLoginTime" STRING,
12. 
"accountLevel" STRING
13. 
) WITH (
14. 
'topic' = 'kafka-Account',
15. 
'value.format' = 'JSON'
16. 
);

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we’ll define the &lt;em&gt;accountrepayloan_stream&lt;/em&gt; from the &lt;em&gt;kafka-AccountRepayLoan&lt;/em&gt; topic:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
1. 
CREATE STREAM accountrepayloan\_stream (
2. 
"label" STRING,
3. 
"accountrepayloandid" BIGINT,
4. 
"loanId" BIGINT,
5. 
"amount" DOUBLE,
6. 
"createTime" STRING
7. 
) WITH (
8. 
'topic' = 'kafka-AccountRepayLoan',
9. 
'value.format' = 'JSON'
10. 
);

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And finally, we’ll show the &lt;em&gt;accounttransferaccount_stream&lt;/em&gt; from the &lt;em&gt;kafka-AccountTransferAccount&lt;/em&gt; topic. You’ll note there are both a &lt;em&gt;fromid&lt;/em&gt; and a &lt;em&gt;toid&lt;/em&gt; that link to the &lt;em&gt;loanId&lt;/em&gt;. This allows us to enrich data in the account payment stream with account information from the account_stream and combine it with the account transfer stream.&lt;/p&gt;

&lt;p&gt;With DeltaStream, this can then easily be written out as a more succinct and enriched stream of data to our destination, such as Snowflake or Databricks. We combine data from three streams with just the information we want, preparing the data in real-time from multiple streaming sources, which we then graph using PuppyGraph.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
1. 
CREATE STREAM accounttransferaccount\_stream (
2. 
"label" VARCHAR,
3. 
"accounttransferaccountid", BIGINT,
4. 
"fromd" BIGINT,
5. 
"toid" BIGINT,
6. 
"amount" DOUBLE,
7. 
"createTime" STRING,
8. 
"ordernum" BIGINT,
9. 
"comment" VARCHAR,
10. 
"paytype" VARCHAR,
11. 
"goodstype" VARCHAR
12. 
) WITH (
13. 
'topic' = 'kafka-AccountTransferAccount',
14. 
'value.format' = 'JSON'
15. 
);

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frr9qm3by7e4yxvsipb1t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frr9qm3by7e4yxvsipb1t.png" width="800" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Adding a Store for Integration
&lt;/h4&gt;

&lt;p&gt;PuppyGraph will connect to the stores and allow querying as a graph.&lt;/p&gt;

&lt;p&gt;Once our data is ready in the desired format, we can write streaming SQL queries in DeltaStream that continuously write the results to the desired storage. In this case, we use DeltaStream’s native integration with Snowflake or Databricks, which is where PuppyGraph will read from. Here is an example of continuously writing data into a table in Snowflake or Databricks from DeltaStream:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
1. 
CREATE TABLE ds\_account
2. 
WITH
3. 
(
4. 
'store' = '&amp;lt;store\_name&amp;gt;'
5. 
&amp;lt;Storage parameters&amp;gt;
6. 
) AS
7. 
SELECT \* FROM account\_stream;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;For Databricks integration, refer to the &lt;a href="https://docs.deltastream.io/tutorials/integrations/setting-up-and-integrating-databricks-with-your-organization" rel="noopener noreferrer"&gt;Databricks integration documentation&lt;/a&gt; for detailed steps.&lt;/li&gt;
&lt;li&gt;For Snowflake integration, refer to &lt;a href="https://docs.deltastream.io/tutorials/integrations/setting-up-and-integrating-snowflake-with-your-organization" rel="noopener noreferrer"&gt;Snowflake integration documentation&lt;/a&gt; for detailed steps.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Starting data processing
&lt;/h3&gt;

&lt;p&gt;Now, you can start a Kafka Producer to send the financial JSON data to Kafka. For example, to send account data, run:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
1. 
kafka-console-producer.sh --broker-list localhost:9092 --topic kafka-Account &amp;lt; json\_data/Account.json

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;DeltaStream will process the data, and then we will query it as a graph.&lt;/p&gt;

&lt;h3&gt;
  
  
  Query your data as a graph
&lt;/h3&gt;

&lt;p&gt;You can start PuppyGraph using Docker. Then upload the Graph schema, and that’s it! You can now query the financial data as a graph as DeltaStream processes it.&lt;/p&gt;

&lt;p&gt;Start PuppyGraph using the following command:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
1. 
docker run -p 8081:8081 -p 8182:8182 -p 7687:7687 \
2. 
-e DATAACCESS\_DATA\_CACHE\_STRATEGY=adaptive \
3. 
-e &amp;lt;STORAGE PARAMETERS&amp;gt; \
4. 
--name puppy --rm -itd puppygraph/puppygraph:stable

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Log into the PuppyGraph Web UI at &lt;a href="http://localhost:8081" rel="noopener noreferrer"&gt;http://localhost:8081&lt;/a&gt; with the following credentials:&lt;/p&gt;

&lt;p&gt;Username: puppygraph&lt;/p&gt;

&lt;p&gt;Password: puppygraph123&lt;/p&gt;

&lt;p&gt;Upload the schema: select the file schema_.json in the Upload Graph Schema JSON section and click &lt;strong&gt;Upload&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F16thau0bc5pyzf6ig7z7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F16thau0bc5pyzf6ig7z7.png" width="800" height="564"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Navigate to the &lt;strong&gt;Query&lt;/strong&gt; panel on the left side. The &lt;strong&gt;Gremlin Query&lt;/strong&gt; tab offers an interactive environment for querying the graph using Gremlin. For example, to query the accounts owned by a specific company and the transaction records of these accounts, you can run:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
1. 
g.V("Company[237]")
2. 
 .outE('CompanyOwnAccount').inV()
3. 
 .outE('AccountTransferAccount').inV()
4. 
 .path()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7pq8fxhq571p0bl8dsgc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7pq8fxhq571p0bl8dsgc.png" width="800" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;As this blog post explores, traditional fraud detection methods simply can’t keep pace with today’s sophisticated criminals. Real-time analysis and the ability to identify complex patterns are critical. By combining the power of graph analytics with real-time stream processing, businesses can gain a significant advantage against fraudsters.&lt;/p&gt;

&lt;p&gt;PuppyGraph and DeltaStream offer robust and accessible solutions for building real-time dynamic fraud detection systems. We’ve seen how PuppyGraph unlocks hidden relationships and how DeltaStream analyzes real-time data to quickly and accurately identify and prevent fraudulent activity. Ready to take control and build a future-proof, graph-enabled fraud detection system? Try PuppyGraph and DeltaStream today. Visit &lt;a href="https://www.puppygraph.com/" rel="noopener noreferrer"&gt;PuppyGraph&lt;/a&gt; and &lt;a href="https://console.deltastream.io/" rel="noopener noreferrer"&gt;DeltaStream&lt;/a&gt; to get started!&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.deltastream.io/enhancing-fraud-detection-with-puppygraph-and-deltastream/" rel="noopener noreferrer"&gt;Enhancing Fraud Detection with PuppyGraph and DeltaStream&lt;/a&gt; appeared first on &lt;a href="https://www.deltastream.io" rel="noopener noreferrer"&gt;DeltaStream&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>streamprocessing</category>
      <category>puppygraph</category>
      <category>apacheflink</category>
    </item>
    <item>
      <title>What’s Coming in Apache Flink 2.0?</title>
      <dc:creator>kelsey-deltastream</dc:creator>
      <pubDate>Wed, 13 Nov 2024 01:10:44 +0000</pubDate>
      <link>https://dev.to/deltastream/whats-coming-in-apache-flink-20-15i6</link>
      <guid>https://dev.to/deltastream/whats-coming-in-apache-flink-20-15i6</guid>
      <description>&lt;h1&gt;What’s Coming in Apache Flink 2.0?&lt;/h1&gt;

&lt;p&gt;As &lt;a href="https://www.deltastream.io/why-apache-flink-is-the-industry-gold-standard/" rel="noopener noreferrer"&gt;champions for Apache Flink&lt;/a&gt;, we are excited for the 2.0 release and all that it will bring. &lt;a href="https://flink.apache.org/" rel="noopener noreferrer"&gt;Apache Flink&lt;/a&gt; 1.0 was released in 2016, and while we don’t have an exact release date, it looks like 2.0 will be released in late 2024/early 2025. &lt;a href="https://flink.apache.org/2024/08/02/announcing-the-release-of-apache-flink-1.20/" rel="noopener noreferrer"&gt;Version 1.2&lt;/a&gt; was just released in August 2024. Version 2.0 is set to be a major milestone release, marking a significant evolution in the stream processing framework. This blog runs down some of the key features and changes coming in &lt;a href="https://flink.apache.org/2024/10/23/preview-release-of-apache-flink-2.0/" rel="noopener noreferrer"&gt;Flink 2.0&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Disaggregated State Storage and Management
&lt;/h2&gt;

&lt;p&gt;One of the most exciting features of Flink 2.0 is the introduction of disaggregated state storage and management. It will utilize a &lt;a href="https://www.techtarget.com/searchstorage/definition/distributed-file-system-DFS" rel="noopener noreferrer"&gt;Distributed File System (DFS)&lt;/a&gt; as the primary storage for state data. This architecture separates compute and storage resources, addressing key scalability and performance needs for large-scale, cloud-native data processing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Core Advantages of Disaggregated State Storage
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Improved Scalability&lt;/strong&gt;
By decoupling storage from compute resources, Flink can manage massive datasets—into the hundreds of terabytes—without being constrained by local storage. This separation enables efficient scaling in containerized and cloud environments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enhanced Recovery and Rescaling&lt;/strong&gt;
The new architecture supports faster state recovery on job restarts, efficient fault tolerance, and quicker job rescaling with minimal downtime. Key components include shareable checkpoints and LazyRestore for on-demand state recovery.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimized I/O Performance&lt;/strong&gt;
Flink 2.0 uses asynchronous execution and grouped remote state access to minimize the latency impact of remote storage. A hybrid caching mechanism can improve cache efficiency, providing up to 80% better throughput than traditional file-level caching.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improved Batch Processing&lt;/strong&gt;
Disaggregated state storage enhances batch processing by better handling large state data and integrating batch and stream processing tasks, making Flink more versatile across diverse workloads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic Resource Management&lt;/strong&gt;
The architecture enables flexible resource allocation, minimizing CPU and network usage spikes during maintenance tasks like compaction and cleanup.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  API and Configuration Changes
&lt;/h2&gt;

&lt;p&gt;Several API and configuration changes will be introduced, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Removal of deprecated APIs, including the DataSet API and Scala versions of DataStream and DataSet APIs&lt;/li&gt;
&lt;li&gt;Deprecation of the legacy SinkFunction API in favor of the Unified Sink API&lt;/li&gt;
&lt;li&gt;Overhaul of the configuration layer, enhancing user-friendliness and maintainability&lt;/li&gt;
&lt;li&gt;Introduction of new abstractions such as Materialized Tables, added in v1.20 and further enhanced in v2.0&lt;/li&gt;
&lt;li&gt;Updates to configuration options, including proper type usage (e.g., Duration, Enum, Int)&lt;/li&gt;
&lt;/ul&gt;
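
&lt;p&gt;To make the “proper type usage” point concrete, here is a minimal sketch using Flink’s existing ConfigOptions builder, which already supports strongly typed options such as Duration. The option key &lt;code&gt;my.connector.flush-interval&lt;/code&gt; is hypothetical and used purely for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
import java.time.Duration;

import org.apache.flink.configuration.ConfigOption;
import org.apache.flink.configuration.ConfigOptions;
import org.apache.flink.configuration.Configuration;

public class TypedConfigSketch {

    // A Duration-typed option instead of a raw string such as "30 s".
    // The key name is hypothetical, purely for illustration.
    static final ConfigOption&amp;lt;Duration&amp;gt; FLUSH_INTERVAL =
            ConfigOptions.key("my.connector.flush-interval")
                    .durationType()
                    .defaultValue(Duration.ofSeconds(30))
                    .withDescription("How often buffered records are flushed.");

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set(FLUSH_INTERVAL, Duration.ofSeconds(5));

        // Reads back a strongly typed Duration, with no string parsing needed.
        Duration flushInterval = conf.get(FLUSH_INTERVAL);
        System.out.println("Flush interval: " + flushInterval);
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;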

&lt;h2&gt;
  
  
  Modernization and Unification
&lt;/h2&gt;

&lt;p&gt;Flink 2.0 aims to further unify batch and stream processing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Modernization of legacy components, such as replacing the legacy &lt;a href="https://nightlies.apache.org/flink/flink-docs-release-1.20/api/java/org/apache/flink/streaming/api/functions/sink/SinkFunction.html" rel="noopener noreferrer"&gt;SinkFunction&lt;/a&gt; with the new &lt;a href="https://nightlies.apache.org/flink/flink-docs-release-1.20/api/java/org/apache/flink/api/connector/sink2/Sink.html" rel="noopener noreferrer"&gt;Unified Sink API&lt;/a&gt; (a minimal sketch follows this list)&lt;/li&gt;
&lt;li&gt;Enhanced features that combine batch and stream processing seamlessly&lt;/li&gt;
&lt;li&gt;Improvements to Adaptive Batch Execution for optimizing logical and physical plans&lt;/li&gt;
&lt;/ul&gt;
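
&lt;p&gt;To show what the migration looks like in practice, here is a minimal sketch, assuming the Flink 1.20 DataStream API and the flink-connector-files dependency: instead of the legacy addSink(new SomeSinkFunction()), a unified sink such as FileSink is attached with sinkTo(...). The output path and the in-memory elements are placeholders.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class UnifiedSinkSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream&amp;lt;String&amp;gt; lines = env.fromElements("a", "b", "c");

        // Unified Sink API: sinks are attached with sinkTo(...) rather than the
        // legacy addSink(SinkFunction) call.
        FileSink&amp;lt;String&amp;gt; fileSink = FileSink
                .forRowFormat(new Path("/tmp/flink-out"), new SimpleStringEncoder&amp;lt;String&amp;gt;())
                .build();

        lines.sinkTo(fileSink);
        env.execute("unified-sink-sketch");
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;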

&lt;h2&gt;
  
  
  Performance Improvements
&lt;/h2&gt;

&lt;p&gt;The community is working on making Flink’s performance on bounded streams (batch use cases) competitive with dedicated batch processors, which can further simplify your data processing stack. Key optimizations include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dynamic Partition Pruning (DPP) to minimize I/O costs&lt;/li&gt;
&lt;li&gt;Runtime Filter to reduce I/O and shuffle costs&lt;/li&gt;
&lt;li&gt;Operator Fusion CodeGen to improve query execution performance&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Cloud-Native Focus
&lt;/h2&gt;

&lt;p&gt;Flink 2.0 is being designed with cloud-native architectures in mind:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Improved efficiency in containerized environments&lt;/li&gt;
&lt;li&gt;Better scalability for large state sizes&lt;/li&gt;
&lt;li&gt;More efficient fault tolerance and faster rescaling&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Summary of Flink 2.0
&lt;/h2&gt;

&lt;p&gt;This is an exciting time for Apache Flink. Version 2.0 represents a significant leap forward in unified batch and stream processing, focusing on cloud-native architectures, improved performance, and streamlined APIs. These changes aim to address the evolving needs of data-driven applications and set new standards for what’s possible in data processing. &lt;a href="https://www.deltastream.io/building-upon-apache-flink-for-better-stream-processing/" rel="noopener noreferrer"&gt;DeltaStream is proudly powered by Apache Flink&lt;/a&gt;, making it easy to start running Flink in minutes. &lt;a href="https://console.deltastream.io/?" rel="noopener noreferrer"&gt;Get a free trial of DeltaStream&lt;/a&gt; and see for yourself.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.deltastream.io/whats-coming-in-apache-flink-2-0/" rel="noopener noreferrer"&gt;What’s Coming in Apache Flink 2.0?&lt;/a&gt; appeared first on &lt;a href="https://www.deltastream.io" rel="noopener noreferrer"&gt;DeltaStream&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>technology</category>
    </item>
    <item>
      <title>Open Sourcing our Snowflake Connector for Apache Flink</title>
      <dc:creator>kelsey-deltastream</dc:creator>
      <pubDate>Wed, 06 Nov 2024 20:10:48 +0000</pubDate>
      <link>https://dev.to/deltastream/open-sourcing-our-snowflake-connector-for-apache-flink-12ob</link>
      <guid>https://dev.to/deltastream/open-sourcing-our-snowflake-connector-for-apache-flink-12ob</guid>
      <description>&lt;h4&gt;
  
  
  November 2024 Updates:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Supports a wider range of Apache Flink environments, including Managed Service for Apache Flink and BigQuery Engine for Apache Flink, with support for Java 11 and 17.&lt;/li&gt;
&lt;li&gt;Fixes an issue affecting &lt;a href="https://community.snowflake.com/s/article/faq-2023-client-driver-deprecation-for-GCP-customers" rel="noopener noreferrer"&gt;compatibility with Google Cloud Projects&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Upgraded to Apache Flink 1.19.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/deltastreaminc/flink-connector-snowflake/releases/tag/1.1.0-1.19" rel="noopener noreferrer"&gt;See full release details&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At &lt;a href="https://www.deltastream.io/?&amp;amp;" rel="noopener noreferrer"&gt;DeltaStream&lt;/a&gt;, our mission is to bring a serverless and unified view of all streams to make stream processing possible for any product use case. By using &lt;a href="https://flink.apache.org/" rel="noopener noreferrer"&gt;Apache Flink&lt;/a&gt; as our underlying processing engine, we can leverage its rich connector ecosystem to connect to many different data systems, breaking down the barriers of siloed data. As we mentioned in our &lt;a href="https://www.deltastream.io/building-upon-apache-flink-for-better-stream-processing/?&amp;amp;" rel="noopener noreferrer"&gt;Building Upon Apache Flink for Better Stream Processing&lt;/a&gt; article, at DeltaStream, using Apache Flink is about more than adopting robust software with a good track record. Using Flink has allowed us to iterate faster on improvements or issues that arise from solving the latest and greatest data engineering challenges. However, one connector that was missing until today was the Snowflake connector.&lt;/p&gt;

&lt;p&gt;Today, in our efforts to make solving data challenges possible, we are open sourcing our Apache Flink sink connector built for writing data to &lt;a href="https://www.snowflake.com/en/" rel="noopener noreferrer"&gt;Snowflake&lt;/a&gt;. This connector has already provided DeltaStream with native integration between other sources of data and Snowflake. This also aligns well with our vision of providing a unified view over all data, and we want to open this project up for public use and contribution so that others in the Flink community can benefit from this connector as well.&lt;/p&gt;

&lt;p&gt;The open-source repository will be open for contributions, suggestions, or discussions. In this article, we touch on some of the highlights of this new Flink connector.&lt;/p&gt;

&lt;h2&gt;
  
  
  Utilizing the Snowflake Sink
&lt;/h2&gt;

&lt;p&gt;The Flink connector uses the latest Flink &lt;a href="https://nightlies.apache.org/flink/flink-docs-stable/api/java/org/apache/flink/api/connector/sink2/Sink.html" rel="noopener noreferrer"&gt;Sink&lt;/a&gt; and &lt;a href="https://nightlies.apache.org/flink/flink-docs-stable/api/java/org/apache/flink/api/connector/sink2/SinkWriter.html" rel="noopener noreferrer"&gt;SinkWriter&lt;/a&gt; interfaces to build a Snowflake sink connector and write data to a configurable Snowflake table, respectively:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkgra2z25ers7chvaxe68.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkgra2z25ers7chvaxe68.png" width="800" height="650"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Diagram 1: Each SnowflakeSinkWriter inserts rows into the Snowflake table using its own dedicated ingest channel&lt;/p&gt;

&lt;p&gt;The Snowflake sink connector can be configured with a parallelism of more than 1, where each task relies on the order of data it receives from its upstream &lt;a href="https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/dev/datastream/operators/overview/" rel="noopener noreferrer"&gt;operator&lt;/a&gt;. For example, the following shows how data can be written with a parallelism of 3:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
// Attach the Snowflake sink; each of the three parallel sink tasks
// gets its own SnowflakeSinkWriter (and ingest channel).
dataStream.sinkTo(snowflakeSink).setParallelism(3);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Diagram 1&lt;/em&gt; shows the flow of data between TaskManager(s) and the destination Snowflake table. The diagram is heavily simplified to focus on the concrete &lt;code&gt;SnowflakeSinkWriter&amp;lt;InputT&amp;gt;&lt;/code&gt;, and it shows that each sink task connects to its Snowflake table using a dedicated &lt;a href="https://javadoc.io/doc/net.snowflake/snowflake-ingest-sdk/2.0.3/net/snowflake/ingest/streaming/SnowflakeStreamingIngestChannel.html" rel="noopener noreferrer"&gt;SnowflakeStreamingIngestChannel&lt;/a&gt; from Snowpipe Streaming APIs.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;SnowflakeSink&amp;lt;InputT&amp;gt;&lt;/code&gt; is also shipped with a generic &lt;code&gt;SnowflakeRowSerializationSchema&amp;lt;T&amp;gt;&lt;/code&gt; interface that allows each implementation of the sink to provide its own concrete serialization to a Snowflake row of &lt;code&gt;Map&amp;lt;String, Object&amp;gt;&lt;/code&gt; based on a given use case.&lt;/p&gt;
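
&lt;p&gt;As an illustration, a row serializer for a hypothetical order stream might look like the sketch below. The interface name comes from the connector, but the method signature and the Order type are assumptions made purely for illustration; check the repository for the actual contract.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
import java.util.HashMap;
import java.util.Map;

// Hedged sketch: the serialize(...) signature and the Order type are assumptions
// for illustration; consult the connector repository for the real interface.
public class OrderRowSerializationSchema
        implements SnowflakeRowSerializationSchema&amp;lt;Order&amp;gt; {

    public Map&amp;lt;String, Object&amp;gt; serialize(Order order) {
        // Column names become map keys; values must map to Snowflake column types.
        Map&amp;lt;String, Object&amp;gt; row = new HashMap&amp;lt;&amp;gt;();
        row.put("ORDER_ID", order.id);
        row.put("AMOUNT", order.amount);
        row.put("CURRENCY", order.currency);
        return row;
    }
}

// Hypothetical input record used only for this sketch.
class Order {
    String id;
    double amount;
    String currency;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;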

&lt;h2&gt;
  
  
  Write Records At Least Once
&lt;/h2&gt;

&lt;p&gt;The first version of the Snowflake sink can write data into Snowflake tables with the delivery guarantee of NONE or AT_LEAST_ONCE, using AT_LEAST_ONCE by default. Supporting EXACTLY_ONCE semantics is a goal for a future version of this connector.&lt;/p&gt;

&lt;p&gt;The sink writes data into its destination table after buffering records for a fixed time interval. This buffering time interval is also bounded by &lt;a href="https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/dev/datastream/fault-tolerance/checkpointing/" rel="noopener noreferrer"&gt;Flink’s checkpointing&lt;/a&gt; interval, which is configured as part of the &lt;a href="https://nightlies.apache.org/flink/flink-docs-stable/api/java/org/apache/flink/streaming/api/environment/StreamExecutionEnvironment.html" rel="noopener noreferrer"&gt;StreamExecutionEnvironment&lt;/a&gt;. In other words, if Flink’s checkpointing interval and buffering time are configured to be different values, then records are flushed at the shorter of the two intervals:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(100L);
…
SnowflakeSink&amp;lt;Map&amp;lt;String, Object&amp;gt;&amp;gt; sf_sink = SnowflakeSink.&amp;lt;Map&amp;lt;String, Object&amp;gt;&amp;gt;builder()
    .bufferTimeMillis(1000L)
    …
    .build(jobId);
env.fromSequence(1, 10).map(new SfRowMapFunction()).sinkTo(sf_sink);
env.execute();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;/div&gt;



&lt;p&gt;In this example, the checkpointing interval is set to 100 milliseconds, and the buffering interval is configured as 1 second.  This tells the Flink job to flush the records at least every 100 milliseconds, i.e., on every checkpoint.&lt;/p&gt;

&lt;p&gt;Read more about &lt;a href="https://docs.snowflake.com/en/user-guide/data-load-snowpipe-streaming-recommendation" rel="noopener noreferrer"&gt;Snowpipe Streaming best practices&lt;/a&gt; in the Snowflake documentation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Flink Community, to Infinity and Beyond
&lt;/h2&gt;

&lt;p&gt;We are very excited about the opportunity to contribute our Snowflake connector to the Flink community. We’re hoping this connector will add more value to the rich connector ecosystem of Flink that’s powering many data application use cases. If you want to check out the connector for yourself, head over to &lt;a href="https://github.com/deltastreaminc/flink-connector-snowflake" rel="noopener noreferrer"&gt;the GitHub repository&lt;/a&gt;. Or if you want to learn more about DeltaStream’s integration with Snowflake, read &lt;a href="https://www.deltastream.io/integrating-deltastream-and-snowflake/?&amp;amp;" rel="noopener noreferrer"&gt;our Snowflake integration blog&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.deltastream.io/open-sourcing-our-snowflake-connector-for-apache-flink/" rel="noopener noreferrer"&gt;Open Sourcing our Snowflake Connector for Apache Flink&lt;/a&gt; appeared first on &lt;a href="https://www.deltastream.io" rel="noopener noreferrer"&gt;DeltaStream&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>news</category>
    </item>
    <item>
      <title>A Guide to Standard SQL vs. Streaming SQL: Why Do We Need Both?</title>
      <dc:creator>kelsey-deltastream</dc:creator>
      <pubDate>Tue, 29 Oct 2024 17:41:46 +0000</pubDate>
      <link>https://dev.to/deltastream/a-guide-to-standard-sql-vs-streaming-sql-why-do-we-need-both-342b</link>
      <guid>https://dev.to/deltastream/a-guide-to-standard-sql-vs-streaming-sql-why-do-we-need-both-342b</guid>
      <description>&lt;h2&gt;
  
  
  Understanding the Differences Between Standard SQL and Streaming SQL
&lt;/h2&gt;

&lt;p&gt;SQL has long been a foundational tool for querying databases. Traditional SQL queries are typically run against static, historical data, generating a snapshot of results at a single point in time. However, the rise of real-time data processing, driven by applications like IoT, financial transactions, security monitoring and intrusion detection, and social media, has led to the evolution of &lt;strong&gt;Streaming SQL&lt;/strong&gt;. This variant extends traditional SQL capabilities, offering features specifically designed for real-time, continuous data streams. &lt;/p&gt;

&lt;h2&gt;
  
  
  Standard SQL and Streaming SQL Key Differences
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Point-in-Time vs. Continuous Queries
&lt;/h3&gt;

&lt;p&gt;In standard SQL, queries are typically run once and return results based on a snapshot of data. For instance, when you query a traditional database to get the sum of all sales, it reflects only the state of data up until the moment of the query.&lt;/p&gt;

&lt;p&gt;In contrast, &lt;strong&gt;Streaming SQL&lt;/strong&gt; works with data that continuously flows in, updating queries in real-time. The same query can be run in streaming SQL, but instead of receiving a one-time result, the query is maintained in a &lt;strong&gt;&lt;a href="https://www.deltastream.io/all-about-streaming-data-mesh/" rel="noopener noreferrer"&gt;materialized view&lt;/a&gt;&lt;/strong&gt; that updates as new data arrives. This is especially useful for use cases like dashboards or monitoring systems, where the data needs to stay current.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Real-Time Processing with Window Functions
&lt;/h3&gt;

&lt;p&gt;Streaming SQL introduces &lt;strong&gt;window functions&lt;/strong&gt;, allowing users to segment a data stream into windows for aggregation or analysis. For example, a &lt;strong&gt;tumbling window&lt;/strong&gt; is a fixed-length window (such as one minute) that collects data for aggregation over that time frame. In contrast, a &lt;strong&gt;hopping window&lt;/strong&gt; has a fixed size but advances (hops) by a specified interval; when the hop is shorter than the window, consecutive windows overlap. For instance, to calculate the current inventory over the last two minutes, refreshed every minute, the window size would be two minutes and the hop size one minute.&lt;/p&gt;

&lt;p&gt;Windowing in traditional SQL is static and backward-looking, whereas in streaming SQL, real-time streams are processed continuously, updating aggregations within the described window.&lt;/p&gt;
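
&lt;p&gt;As a concrete sketch of that inventory example, the snippet below uses Flink SQL’s windowing table-valued functions from Java (assuming the Flink Table API dependencies). The table name, its columns, and the datagen source are made up for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class HoppingWindowSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Hypothetical inventory-change stream backed by the datagen connector.
        tEnv.executeSql(
                "CREATE TABLE inventory_events (" +
                "  item_id STRING," +
                "  quantity_change INT," +
                "  event_time TIMESTAMP(3)," +
                "  WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND" +
                ") WITH ('connector' = 'datagen', 'rows-per-second' = '5')");

        // Two-minute windows that advance every minute: each result covers the
        // last two minutes and is refreshed once per minute.
        tEnv.executeSql(
                "SELECT window_start, window_end, SUM(quantity_change) AS net_change " +
                "FROM TABLE(HOP(TABLE inventory_events, DESCRIPTOR(event_time)," +
                "  INTERVAL '1' MINUTE, INTERVAL '2' MINUTE)) " +
                "GROUP BY window_start, window_end")
            .print();
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;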

&lt;h3&gt;
  
  
  3. Watermarks for Late Data Handling
&lt;/h3&gt;

&lt;p&gt;In streaming environments, data can arrive late or out of order. To manage this, Streaming SQL introduces &lt;strong&gt;watermarks&lt;/strong&gt;. A watermark marks the point in time up to which the system expects to have received data. For instance, if events can arrive up to a minute late, a watermark that trails event time by one minute ensures those late events are still assigned to the correct window, making streaming SQL robust for real-world, unpredictable data flows. Conventional SQL has no ability or need to address this scenario.&lt;/p&gt;
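
&lt;p&gt;In Flink SQL, for example, that tolerance is declared directly in the table definition. The sketch below is a minimal illustration; the table, its columns, and the datagen source are hypothetical, and a real deployment would typically read from a system like Kafka.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class WatermarkSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // The WATERMARK clause lets events arrive up to one minute late and still
        // be assigned to the window their event time belongs to.
        tEnv.executeSql(
                "CREATE TABLE clicks (" +
                "  user_id STRING," +
                "  url STRING," +
                "  event_time TIMESTAMP(3)," +
                "  WATERMARK FOR event_time AS event_time - INTERVAL '1' MINUTE" +
                ") WITH ('connector' = 'datagen')");
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;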

&lt;h3&gt;
  
  
  4. Continuous Materialization
&lt;/h3&gt;

&lt;p&gt;One of the unique aspects of Streaming SQL is the ability to &lt;strong&gt;materialize views&lt;/strong&gt; incrementally. Unlike traditional databases that recompute queries when data changes, streaming SQL continuously maintains these views as new data flows in. This approach dramatically improves performance for real-time analytics by avoiding expensive re-computations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use Cases for Streaming SQL
&lt;/h2&gt;

&lt;p&gt;The rise of streaming SQL has been a game-changer across industries. Common applications include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time analytics dashboards, such as stock trading platforms or retail systems where quick insights are needed to make rapid decisions.&lt;/li&gt;
&lt;li&gt;Event-driven applications where alerts and automations are triggered by real-time data, such as &lt;a href="https://www.deltastream.io/detecting-suspicious-login-activity-with-stream-processing/" rel="noopener noreferrer"&gt;fraud detection&lt;/a&gt; or &lt;a href="https://www.deltastream.io/stream-processing-for-iot-data/" rel="noopener noreferrer"&gt;IoT sensor monitoring&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Real-time customer personalization, where user actions or preferences update in real-time to deliver timely recommendations.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;While &lt;strong&gt;Standard SQL&lt;/strong&gt; excels in querying static, historical datasets, &lt;strong&gt;Streaming SQL&lt;/strong&gt; is optimized for real-time data streams, offering powerful features like window functions, watermarks, and materialized views. These advancements handle fast-changing data with low latency, offering immediate insights and automation. This July 2023 &lt;a href="https://www.datanami.com/2023/07/12/yes-real-time-streaming-data-is-still-growing/" rel="noopener noreferrer"&gt;article at Datanami&lt;/a&gt; pegged growth in streaming adoption at 177% over the previous 12 months. As more industries rely on real-time decision-making, streaming SQL is becoming a critical tool for modern data infrastructures.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.deltastream.io/a-guide-to-standard-sql-vs-streaming-sql/" rel="noopener noreferrer"&gt;A Guide to Standard SQL vs. Streaming SQL: Why Do We Need Both?&lt;/a&gt; appeared first on &lt;a href="https://www.deltastream.io" rel="noopener noreferrer"&gt;DeltaStream&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>technology</category>
    </item>
    <item>
      <title>Democratizing Data with All-in-One Streaming Solutions</title>
      <dc:creator>kelsey-deltastream</dc:creator>
      <pubDate>Wed, 23 Oct 2024 20:15:18 +0000</pubDate>
      <link>https://dev.to/deltastream/democratizing-data-with-all-in-one-streaming-solutions-8ba</link>
      <guid>https://dev.to/deltastream/democratizing-data-with-all-in-one-streaming-solutions-8ba</guid>
      <description>&lt;p&gt;In today’s fast-paced data landscape, &lt;a href="https://www.deltastream.io/key-components-of-a-modern-data-stack/" rel="noopener noreferrer"&gt;organizations must maximize efficiency&lt;/a&gt;, enhance collaboration, and maintain data quality. An all-in-one streaming data solution offers a single, integrated platform for real-time data processing, which simplifies operations, reduces costs, and makes advanced tools accessible across teams. &lt;/p&gt;

&lt;p&gt;This blog explores the benefits of such solutions and their role in promoting a democratized data culture.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Key Benefits of All-in-One Streaming Data Solutions&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Streamlined Learning Curve&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;All-in-one platforms simplify adoption by providing a single interface, unlike traditional setups requiring expertise in multiple tools and languages. This accelerates adoption and facilitates collaboration across teams.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Consolidated Toolset&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;By merging data integration, processing, and visualization into a unified system, these platforms eliminate the need to manage multiple applications. Teams can perform tasks like joins, filtering, and creating materialized views within one environment, improving workflow efficiency.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Simplified Language Support&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Most all-in-one platforms use a common language, such as SQL, for all data operations. This reduces the need for proficiency in multiple languages, streamlines processes, and enables easier collaboration between team members.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Enhanced Security and Compliance&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;With centralized security controls, these platforms simplify the enforcement of compliance standards like GDPR and HIPAA. Fewer components reduce vulnerabilities, providing a more secure data environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Cost Savings&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Managing multiple tools leads to increased costs, both in licensing and staffing. An all-in-one solution consolidates these tools, reducing expenses and providing long-term cost stability.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Improved Data Quality&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Using a single platform for all data operations—collection, transformation, streaming, and analysis—minimizes errors and ensures consistent validation, resulting in more accurate and reliable insights.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Centralized Platform for Unified Operations&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;An all-in-one solution enables teams to handle all aspects of data processing on one platform, from combining datasets to filtering large volumes of data and creating materialized views for real-time access. This integrated approach reduces errors and boosts operational efficiency.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Single Interface for Event Streams&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;These platforms provide a &lt;a href="https://www.deltastream.io/data-unification-in-real-time-stream-processing/" rel="noopener noreferrer"&gt;single interface to access and work with event streams&lt;/a&gt;, regardless of location or device. This consistent access allows teams to monitor and manage streams globally, facilitating seamless data handling across distributed environments.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Breaking Down Silos&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;All-in-one platforms promote collaboration by breaking down data silos, enabling cross-functional teams to work with shared data in real-time. Whether in marketing, sales, engineering, or product development, everyone has access to the same data streams, facilitating collaboration and maximizing the value of data.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Democratized Data Access and Collaboration&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Centralized Data Access&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In traditional environments, only a few technical users control critical data pipelines. An all-in-one solution &lt;a href="https://www.deltastream.io/secure-sharing-for-real-time-data-streams/" rel="noopener noreferrer"&gt;democratizes data by giving all team members access&lt;/a&gt; to the same tools, empowering them to make data-driven decisions regardless of technical expertise.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Simplified Data Analysis&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;These platforms provide intuitive tools for querying and visualizing data, allowing less technically sophisticated users to engage in data analysis. This extends the role of data across the organization, improving decision-making and fostering collaboration.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Cross-Functional Collaboration&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The integration of all tools into a single platform enhances collaboration across functions. Teams from different departments can work together more efficiently, aligning on data-driven strategies without needing to navigate disparate systems or contend with inconsistent access, where some people can use tools A and B while others can use only tools C and D.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Reduced Effort&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;With only one platform to learn, teams experience reduced effort and cognitive load, freeing up more time to focus on deriving insights rather than managing multiple tools. This ease of use encourages widespread adoption and enhances overall productivity.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Scalability and Flexibility&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;All-in-one solutions are designed for scalability, enabling organizations to grow without constantly adopting new tools or overhauling systems. Whether increasing data streams or integrating new sources, these platforms scale effortlessly with business needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Is this the promise of Data Mesh? All-in-one streaming data solutions are revolutionizing how organizations handle real-time data. By consolidating tools, simplifying workflows, and fostering collaboration, these platforms democratize data access while maintaining data quality and operational efficiency. Whether you’re a small team seeking streamlined processes or a large enterprise focused on scalability, the benefits of an all-in-one solution are clear. Investing in such platforms is a strategic move to unlock the full potential of real-time data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgkl2sggyqgr6pqqma6yo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgkl2sggyqgr6pqqma6yo.png" width="800" height="312"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;DeltaStream can be part of your toolbox, supporting the shift-left paradigm for operational efficiency. If you’re interested in giving it a try, &lt;a href="https://console.deltastream.io/" rel="noopener noreferrer"&gt;sign up for a free trial&lt;/a&gt; or &lt;a href="https://www.deltastream.io/contact-us/" rel="noopener noreferrer"&gt;contact us for a demo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.deltastream.io/democratizing-data-with-all-in-one-streaming-solutions/" rel="noopener noreferrer"&gt;Democratizing Data with All-in-One Streaming Solutions&lt;/a&gt; appeared first on &lt;a href="https://www.deltastream.io" rel="noopener noreferrer"&gt;DeltaStream&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>technology</category>
    </item>
    <item>
      <title>Streaming Analytics vs. Real-time Analytics: Key Differences to Know</title>
      <dc:creator>kelsey-deltastream</dc:creator>
      <pubDate>Tue, 01 Oct 2024 20:25:35 +0000</pubDate>
      <link>https://dev.to/deltastream/streaming-analytics-vs-real-time-analytics-key-differences-to-know-42p0</link>
      <guid>https://dev.to/deltastream/streaming-analytics-vs-real-time-analytics-key-differences-to-know-42p0</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Businesses rely heavily on timely insights to make informed decisions in today’s data-driven world. Two key approaches that enable organizations to derive value from their data as it is generated are &lt;strong&gt;streaming analytics&lt;/strong&gt; and &lt;strong&gt;real-time analytics&lt;/strong&gt;. While both terms are often used interchangeably, they differ in their operation and the types of use cases they address. This blog post will delve into the core differences between streaming and real-time analytics, their respective architectures, and practical applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Defining Streaming and Real-Time Analytics
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Streaming Analytics:&lt;/strong&gt; Streaming analytics refers to analyzing and acting on data as it flows into the system continuously. Data is processed in real-time as it is ingested, typically in small, unbounded batches or event streams. These streams come from various sources like IoT devices, log files, and social media, with the analytics system making decisions or generating insights from the live data.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-Time Analytics:&lt;/strong&gt; Real-time analytics, while similar in time sensitivity, typically involves processing a dataset or query with minimal latency. It quickly processes data to provide near-instantaneous insights, although the data is often stored or batched before it is analyzed. Real-time analytics operates in response to queries where results are expected from data as it enters the system, such as personalized advertising. There are typically two types:&lt;br&gt;&lt;br&gt;
&lt;strong&gt;On-demand:&lt;/strong&gt; Provides analytic results only when a query is submitted.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Continuous:&lt;/strong&gt; Proactively sends alerts or triggers responses in other systems as the data is generated.&lt;/p&gt;

&lt;h2&gt;
  
  
  Differences in Data Ingestion and Processing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Streaming Analytics:&lt;/strong&gt; In streaming analytics, data is processed in motion. As the data arrives in the system, it is immediately ingested and analyzed. The focus is on processing and analyzing the continuous flow of data, often in a windowed manner, to derive immediate actions from the data stream. This involves handling large volumes of unbounded, real-time data flows.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; A fraud detection system in a bank continuously monitors transactions. The moment &lt;a href="https://www.deltastream.io/detecting-suspicious-login-activity-with-stream-processing/" rel="noopener noreferrer"&gt;suspicious activity is detected from a stream of transaction data&lt;/a&gt;, the system flags or blocks the transaction in real time.  &lt;/p&gt;
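
&lt;p&gt;A minimal sketch of that pattern with Flink’s DataStream API is shown below; the Transaction type, the in-memory elements, and the flat amount threshold are assumptions purely for illustration, standing in for a Kafka source and a real detection rule.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FraudFlaggingSketch {

    // Hypothetical transaction record, used only for this sketch.
    public static class Transaction {
        public String accountId;
        public double amount;

        public Transaction() {}

        public Transaction(String accountId, double amount) {
            this.accountId = accountId;
            this.amount = amount;
        }

        public String toString() {
            return accountId + ": " + amount;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // In production this stream would come from a source like Kafka; a few
        // in-memory elements keep the sketch runnable on its own.
        DataStream&amp;lt;Transaction&amp;gt; transactions = env.fromElements(
                new Transaction("acct-1", 42.10),
                new Transaction("acct-2", 18_500.00),
                new Transaction("acct-3", 7.99));

        // Flag transactions over an arbitrary threshold the moment they arrive.
        transactions
                .filter(txn -&amp;gt; txn.amount &amp;gt; 10_000)
                .print();

        env.execute("fraud-flagging-sketch");
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;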

&lt;p&gt;&lt;strong&gt;Real-Time Analytics:&lt;/strong&gt; While real-time analytics also deals with fast-moving data, it focuses on responding to queries in real time. The data might already reside in databases, and the system retrieves and processes it almost instantaneously when requested. This method is often less continuous than streaming analytics, but it’s still geared towards low-latency responses.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; A dashboard monitoring a retail chain’s sales might be refreshed every minute to reflect the latest sales data. Even though the updates are frequent, the data comes from a batched set that is processed in real time rather than directly from an event stream.&lt;/p&gt;

&lt;h2&gt;
  
  
  Latency and Time Sensitivity Distinctions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Streaming Analytics:&lt;/strong&gt; Streaming analytics systems are designed to handle extremely low latency, as the focus is on processing data instantly as it arrives. This is critical in situations where immediate insights are required, like automated decision-making in fraud detection, predictive maintenance, or dynamic pricing. Streaming analytics typically involves sub-second latency, allowing for almost instantaneous actions based on data.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-Time Analytics:&lt;/strong&gt; Real-time analytics also aims for low latency, but the data may be processed in slightly larger windows (seconds or minutes). The insights provided by real-time analytics are often near real-time, and acceptable latency can range from milliseconds to a few seconds, depending on the system’s requirements. Real-time analytics may involve batch processing, where the data is aggregated and processed as needed, rather than on a continuous stream.&lt;/p&gt;

&lt;h2&gt;
  
  
  Contrasting Architecture and Tools
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Streaming Analytics:&lt;/strong&gt; The architecture for streaming analytics is built around continuous data flows. The tools and platforms used for streaming analytics—such as &lt;a href="https://kafka.apache.org/" rel="noopener noreferrer"&gt;Apache Kafka&lt;/a&gt;, &lt;a href="https://flink.apache.org/" rel="noopener noreferrer"&gt;Apache Flink&lt;/a&gt;, and &lt;a href="https://storm.apache.org/" rel="noopener noreferrer"&gt;Apache Storm&lt;/a&gt;—are designed to support data streams and perform calculations on the fly. The architecture involves source systems that generate continuous streams of events, a processing engine that can handle this real-time input, and sinks that store or act on the processed data.  &lt;/p&gt;

&lt;p&gt;Streaming analytics systems often incorporate concepts like &lt;strong&gt;event-driven architecture&lt;/strong&gt; and &lt;strong&gt;micro-batching&lt;/strong&gt;, where data is split into tiny batches to be processed almost instantaneously. The key focus is on scalability and the ability to handle high-throughput streams with very low latency.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-Time Analytics:&lt;/strong&gt; Real-time analytics architecture is often centered around fast querying and low-latency data retrieval from storage. Systems like &lt;a href="https://pinot.apache.org/" rel="noopener noreferrer"&gt;Apache Pinot&lt;/a&gt;, &lt;a href="https://druid.apache.org/" rel="noopener noreferrer"&gt;Apache Druid&lt;/a&gt;, and in-memory databases like &lt;a href="https://memcached.org/" rel="noopener noreferrer"&gt;Memcached&lt;/a&gt; are frequently used to achieve real-time query performance. Data is often ingested in bursts, cleaned, stored, and queried using systems optimized for low-latency access, such as in-memory or columnar databases.  &lt;/p&gt;

&lt;p&gt;While they can handle streaming data, real-time analytics systems usually aggregate and store data first, making them well suited for reporting and dashboarding, where up-to-the-second freshness is not always critical but near-real-time results are still required.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjaenq2iidcxo9v27z731.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjaenq2iidcxo9v27z731.png" width="800" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Streaming and Real-time Analytics Use Cases
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Streaming Analytics:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;IoT Sensor Monitoring:&lt;/strong&gt; Where devices continuously generate data, &lt;a href="https://www.deltastream.io/stream-processing-for-iot-data/" rel="noopener noreferrer"&gt;analytics systems monitor this data in real time&lt;/a&gt; to detect anomalies or trigger automated responses.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Stock Market and High-Frequency Trading:&lt;/strong&gt; In financial markets, price data, transaction volumes, and other metrics must be processed in real time to make split-second trading decisions.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Social Media Monitoring:&lt;/strong&gt; For businesses that rely on sentiment analysis or real-time social media engagement, streaming analytics helps gauge public reaction instantly, allowing businesses to respond immediately.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-Time Analytics:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Customer Personalization:&lt;/strong&gt; In e-commerce, real-time analytics helps provide personalized recommendations by processing customer interaction data stored in databases, delivering insights in near real-time during customer sessions.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Operational Dashboards:&lt;/strong&gt; Many organizations utilize real-time analytics for internal monitoring, where data on sales, system health, or customer interactions is processed quickly but not instantaneously, such as refreshing every minute.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Dynamic Pricing:&lt;/strong&gt; Real-time analytics can be used to adjust pricing based on historical sales and demand data that is processed every few minutes or hours.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges with Streaming and Real-time Analytics
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Streaming Analytics:&lt;/strong&gt; One of the main challenges is dealing with the constant flow of high-velocity data. Ensuring data consistency, scaling infrastructure to handle bursts in data streams, and maintaining sub-second latency requires sophisticated engineering solutions. Another challenge is managing “event time” versus “processing time,” where events arrive out of order or late.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-Time Analytics:&lt;/strong&gt; Real-time analytics faces the challenge of balancing query performance with data freshness. Storing and retrieving large volumes of data with low latency is difficult without optimized database architectures. Additionally, ensuring that the data queried reflects the most recent information without overwhelming the system requires careful tuning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;While both streaming and real-time analytics offer rapid data processing and insights, they serve different purposes depending on the specific use case. Streaming analytics excels in environments where decisions must be made instantly on data as it arrives, making it ideal for real-time monitoring and automated responses. Real-time analytics, on the other hand, offers low-latency querying for decision-making where instantaneous data streams aren’t necessary but timely responses are critical.  &lt;/p&gt;

&lt;p&gt;If your use case requires sub-second latency, consider technologies like DeltaStream. It both handles Streaming Analytics and acts as a Streaming Database, supporting the shift-left paradigm for operational efficiency. If you’re interested in giving it a try, &lt;a href="https://console.deltastream.io/" rel="noopener noreferrer"&gt;sign up for a free trial&lt;/a&gt; or &lt;a href="https://www.deltastream.io/contact-us/?&amp;amp;" rel="noopener noreferrer"&gt;contact us for a demo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.deltastream.io/streaming-analytics-real-time-analytics-key-differences/" rel="noopener noreferrer"&gt;Streaming Analytics vs. Real-time Analytics: Key Differences to Know&lt;/a&gt; appeared first on &lt;a href="https://www.deltastream.io" rel="noopener noreferrer"&gt;DeltaStream&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>technology</category>
    </item>
  </channel>
</rss>
