Alvin Lee

Posted on May 10, 2023 • Edited on May 24, 2023 • Originally published at hackernoon.com

Efficient Enterprise Data Solutions with Stream Processing

#datascience #go #streamprocessing #data

Enterprise data solutions can quickly become expensive. NetApp reports that organizations see 30% data growth every 12 months and risk seeing their solutions fail if this growth isn't managed. Gartner echoes the concerns, stating that data solutions costs tend to be an afterthought, not addressed until they're already a problem.

Being a data-driven organization, you already know this and understand the critical need for cost optimization in your data solutions.

So what can you do? In this article, I'll look at the factors that increase costs, what you can do to contain them, and how stream processing can be a major source of savings. Through it all, I'll help you create an effective plan that leads to operational efficiencies in data solutions.

Why are data solutions expensive?

First, let's look at the factors that contribute to the increasing costs of data solutions. You probably already face many, if not all, of these situations:

Increasing volume of data: Data is being generated at an accelerated rate. And the cost of storing and managing that data has also increased.
A myriad of data sources and their associated complexity: Disparate data sources make data integration a herculean task that leads to increased data quality checks and management costs.
Frequently updated technology: Inevitable technological advancements lead to frequent upgrades in hardware and software, resulting in higher costs.
Data security and privacy concerns: Not only do we have to worry about how to store and manage data, but also how to protect it from bad actors. Data privacy is critical, demanding hefty investments in encryption, security, and maintenance.
Data consolidation: Many organizations hoard a tremendous amount of data before consulting their data practitioners and experts. This leads to unnecessary storage costs, incompatible systems, and inefficient and unscalable solutions.
Lack of skilled data professionals: According to Bloomberg, the big data market will grow to $229.4 billion by 2025. With that growth comes the need for specialized - and expensive - skills, such as Apache Kafka, Apache Flink, Hadoop, Docker, containerization, DevOps, and others.

How can enterprises reduce their data solution costs?

What can enterprises do to keep these costs down? There are many strategies, including:

Data compression and archiving (reduce data size and archive unused data)
Data partitioning (make it easier and faster to process data)
Data caching (keep frequently used data in high-speed storage)
On-demand autoscaling
Data governance measures (ensure accurate, complete, and well-managed data)
Efficient data movement (improve routing from cloud and on-prem in order to speed insights)
Cloud-agnostic deployment (reduce vendor lock-in, optimize costs)
Using a multi-cloud approach

Let's add one more to this list, the solution that we'll look at in more detail in this article: stream processing.

What is stream processing?

Stream processing, as the name suggests, is a data management strategy that involves ingesting (and acting on) continuous data streams - such as the user's clickstream journey, sensor data, or sentiment from a social media feed - in real time.

By using stream processing, organizations can greatly reduce their data solutions costs and see an increased ROI over common frameworks for managing data, such as Splunk, Apache Spark, and others.

Stream processing solves many of the issues encountered with other data solutions, including rising costs, by giving you:

Reduced storage costs: You handle the data in real time.
Real-time scalability: Stream processing systems can handle large-scale data using a distributed computing architecture.
Preventing the downstream flow of bad data: Stream processing enables real-time data validation, transformation, and filtering at the source. Faster decision-making to identify high-impact opportunities: Real-time analysis gives you real-time analytics - and real-time insights.
Team efficiency through automation: Automated workflows in stream processing platforms reduce complexity by abstracting the low-level details of data processing. This enables your data team to focus on data analytics, visualization, and reporting.

Manage stream processing effectively

Stream processing is powerful and opens many use cases, but managing the implementation is no easy task! Let's look at some challenges associated with implementing stream processing from scratch.

Storage and computing resources

Anomalies in streaming data can be difficult and time-consuming to detect and address without specialized tools.

Continuous delivery

Deploying updates to streaming applications can be challenging without disrupting the production version of the application.

Database integration

Integrating data from different databases can be complex and time-consuming for developers.
Handling schema management and scalability frequently becomes cumbersome for developers when integrating databases.
Migrating data from one database to another can be a costly and time-consuming exercise, requiring additional infrastructure.

Developer productivity

Developing streaming applications requires your developer to invest a significant amount of time in managing the infrastructure and configuration. Such efforts increase with the complex data processing logic, taking more time away from actual application development.

We can solve these issues by using a stream processing framework, such as Meroxa's Turbine. A stream processing framework connects to an upstream resource to stream data in real-time, receive the data, process it, then send it on to a destination.

Meroxa has a good write-up on what this looks like at a high level. If we were to look more closely at how we might use Go to interact with Turbine, we could look at code from this example application:

func (a App) Run(v turbine.Turbine) error {
    source, err := v.Resources("demopg")
    if err != nil {
        return err
    }
    // a collection of records, which can't be inspected directly
    records, err := source.Records("user_activity", nil)
    if err != nil {
        return err
    }
    // second return is dead-letter queue
    result := v.Process(records, Anonymize{})

    dest, err := v.Resources("s3")
    if err != nil {
        return err
    }
    err = dest.Write(result, "data-app-archive")
    if err != nil {
        return err
    }

    return nil
}

func (f Anonymize) Process(records []turbine.Record) []turbine.Record {
    for i, r := range records {
        hashedEmail := consistentHash(r.Payload.Get("email").(string))
        err := r.Payload.Set("email", hashedEmail)
        if err != nil {
            log.Println("error setting value: ", err)
            break
        }
        records[i] = r
        }
    return records
}

func consistentHash(s string) string {
    h := md5.Sum([]byte(s))
    return hex.EncodeToString(h[:])
}

In the above code, we see the following steps in action:

Create an upstream source (named source), from a PostgreSQL database (named demopg).
Fetch records (from the user_activity table) from that upstream source.
Call v.Process, which performs the stream processing. This process iterates through the list of records and overwrites the email of each record with an encoded hash.
Create a downstream destination (named dest), using AWS S3.
Write the resulting stream processed records to the destination.

As we can see, you only need a little bit of code for Turbine to process new records and stream them to the destination.

Using a framework such as Turbine for stream processing brings several benefits, including:

It reduces the need for storage and computing resources by using stream processing instead of batch processing.
It provides a powerful mechanism to identify anomalous behavior from large volumes of streaming data, assisting with a quick fix using Novelty Detector, a highly specialized real-time anomaly detection tool.
Because continuous delivery is a big challenge for streaming applications, Turbine streaming data apps offer Feature Branch Deploys, allowing you to deploy the branch whenever it is ready - without impacting the production version of the application.
It integrates any source database to any destination database by leveraging Change Data Capture (CDC) that receives real-time streams and publishes them downstream. The Meroxa platform handles data transformation or processing logic and orchestration through the Turbine app, sparing the developers from worrisome schema management or scalability issues. Ultimately, this allows developers to focus their time and effort on your core business needs. It also reduces the cost of migration associated with additional infrastructure.
Its code-first integration is aimed at helping developers to focus on building applications, leading to faster development while also reducing the infrastructure typically needed to support stream processing.

Conclusion

Data-driven organizations are betting high on data assets to fuel their transformation journeys. But they need better - and more cost-efficient - ways to manage the fast-growing datasets.

Stream processing provides the necessary benefits: improved data quality, faster decision-making, and more. But it requires the right plan and the right stream processing platform to handle the data efficiently. By following the tips in this article, you should be well on your way!

Top comments (2)

crivic • May 12 '23

Great write up!
The link to the Meroxa article about stream processing is not working (outdated or non-public resource maybe?)

Alvin Lee • May 24 '23

Thank you for the note and the heads up! It looks like Meroxa has recently shifted around some of its pages. I have updated the "good write-up" link with the updated URL.