DEV Community: Zander

Data Parallel, Task Parallel, and Agent Actor Architectures

Zander — Thu, 13 Jul 2023 19:26:35 +0000

Introduction:

In the rapidly evolving world of data processing, understanding the various architectural approaches is pivotal to choosing the right tools for your specific needs. The three dominant architectures that have emerged—data parallel, task parallel, and agent actor—each offer unique strengths that cater to different types of data workloads.

Data parallel architectures shine when large datasets need to be processed in parallel. This model divides data into smaller chunks, each processed independently but in the same manner on different workers or nodes. Apache Spark, a well-known data processing framework, uses this architecture. Spark's resilience, capacity for handling vast amounts of data, and ability to perform complex transformations make it a favorite in big data landscapes. Bytewax also follows this model with the same transformations happening on each worker, but on different data.

On the other hand, task parallel architectures, as exemplified by Apache Flink and Dask, focus on executing different tasks concurrently across distributed systems. This approach is particularly effective for workflows with a wide variety of tasks that can be performed independently or have complex dependencies. Flink's stream-first philosophy provides robustness for real-time processing tasks, while Dask's flexibility makes it a great choice for parallel computing tasks in Python environments.

Finally, the agent actor architecture, the foundation for Ray, presents a flexible and robust model for handling complex, stateful, and concurrent computations. In this model, "actors" encapsulate state and behavior, communicating through message passing. Ray's ability to scale from a single node to a large cluster makes it a popular choice for machine learning tasks.

As we delve deeper into these architectures in the following sections, we will explore their pros and cons, use cases, and the unique features offered by Spark, Flink, Dask, Ray, and Bytewax. By understanding these architectures, you'll be better equipped to select the right framework for your next data processing venture. Stay tuned!

Data Parallel Architectures

Data parallelism is a form of parallelization that distributes the data across different nodes, which operate independently of each other. Each node applies the same operation on its allocated subset of data. This approach is particularly effective when dealing with large datasets where the task can be divided and executed simultaneously, reducing computational time significantly.

The Mechanism

In data parallel architectures, the dataset is split into smaller, more manageable chunks, or partitions. Each partition is processed independently by separate tasks running the same operation. This distribution is done in a way that each task operates on a different core or processor, enabling high-level parallel computation.
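
The split-and-apply mechanism can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the helper names (partition, sum_of_squares, data_parallel_sum_of_squares) are made up for this example, and threads stand in for the separate cores or nodes a real framework would use.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, num_partitions):
    """Split data into num_partitions roughly equal, contiguous chunks."""
    size, extra = divmod(len(data), num_partitions)
    chunks, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """The single operation that every worker applies to its own chunk."""
    return sum(x * x for x in chunk)


def data_parallel_sum_of_squares(data, num_workers=4):
    chunks = partition(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(sum_of_squares, chunks))
    # Combining the partial results is the only cross-worker step.
    return sum(partials)


if __name__ == "__main__":
    print(data_parallel_sum_of_squares(list(range(1_000))))
```

Every worker runs the same function; only the chunk it receives differs. The final sum is the aggregation step where results from the workers have to be brought back together.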

Advantages

Scalability: Data parallel architectures are designed to handle large volumes of data. As data grows, you can simply add more nodes to the system to maintain performance.
Performance: The ability to perform computations in parallel leads to significant speedups, particularly for large datasets and computationally intensive operations. Because data moves between workers less often, there can be an additional performance gain.
Simplicity: Since the same operation is applied to each partition, this model is relatively simple to understand and implement.

Disadvantages

Communication Overhead: The nodes need to communicate with each other to synchronize and aggregate results, which can add overhead, particularly for large numbers of nodes.
Limited Use Cases: Data parallelism works best when the same operation can be applied to all data partitions. It's less suitable for tasks that require complex interdependencies or shared state across tasks. Spark shows that this limitation is not absolute, though.

Best Use Cases

Data parallel architectures excel in situations where large volumes of data need to be processed quickly and in a similar manner. Some of the best use cases include:

Batch Processing: In scenarios where large amounts of data need to be processed all at once, data parallel architectures shine. This is a common use case in big data analytics, where massive datasets are processed in batch jobs.
Machine Learning: Many machine learning algorithms, especially those that involve matrix operations, can be easily parallelized. For instance, when training a neural network, each worker can compute gradients on its own slice of a mini-batch, and the gradients are then combined to update the shared weights, making data parallelism a great fit.
Highly Partitioned Input and Output: Data parallel frameworks excel when the input and output are partitioned in such a way that the workers can evenly match the partitions and redistribution of the data is limited.
Stream Processing: The data parallelism approach is well suited to stream processing where the same operation is happening to data in real-time.

Apache Spark, a notable data parallel framework, is widely used in big data analytics for tasks like ETL (Extract, Transform, Load), predictive analytics, and data mining. It's particularly known for its ability to perform complex data transformations and aggregations across large datasets.

Bytewax is known for its ability to handle large continuous streams of data and do complex transformations on them in real-time.

As we continue our exploration into the different data processing architectures, we'll see how other approaches handle tasks that might not be as suitable for data parallel processing.

Task Parallel Architectures: Unlocking Concurrent Processing

Task parallelism, also known as function parallelism, is an architectural approach that focuses on distributing tasks—rather than data—across different processing units. Each of these tasks can be a separate function or a method operating on different data or performing different computations. This type of parallelism is a great fit for problems where different operations can be performed concurrently on the same or different data.

The Mechanism

In a task parallel model, the focus is on concurrent execution of many different tasks that are part of a larger computation. These tasks can be independent, or they can have defined dependencies and need to be executed in a certain order. The tasks are scheduled and dispatched to different processors in the system, enabling parallel execution.
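
A minimal sketch of this idea, using only the standard library, is a small task graph in which different functions run as separate tasks and some tasks wait on others. The functions and numbers here are hypothetical and exist only to show the dependency structure; this is not how Flink or Dask schedule work internally.

```python
from concurrent.futures import ThreadPoolExecutor


def load_orders():
    return [120, 80, 200]      # hypothetical order amounts


def load_refunds():
    return [20, 5]             # hypothetical refund amounts


def total(values):
    return sum(values)


def net_revenue(order_total, refund_total):
    return order_total - refund_total


def run_task_graph():
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Two independent tasks: the pool is free to run them concurrently.
        orders = pool.submit(load_orders)
        refunds = pool.submit(load_refunds)
        # Each of these tasks depends on exactly one upstream task.
        order_total = pool.submit(total, orders.result())
        refund_total = pool.submit(total, refunds.result())
        # The final step depends on both branches.
        return net_revenue(order_total.result(), refund_total.result())


if __name__ == "__main__":
    print(run_task_graph())
```

With the sample numbers, run_task_graph() returns 375. Unlike the data parallel case, the tasks are not copies of one operation: loading, totalling, and combining are different functions tied together by their dependencies.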

Advantages

Diverse Workloads: Task parallel architectures excel in scenarios where the problem can be broken down into a variety of tasks that can be executed in parallel.
Flexibility: Since tasks don't necessarily need to operate on the same data or perform the same operation, this model offers a high level of flexibility.
Efficiency: Task parallelism can lead to improved resource utilization, as tasks can be scheduled to keep all processors busy.

Disadvantages

Complexity: Managing and scheduling tasks, especially when there are dependencies, can add complexity to the system.
Inter-task Communication: Tasks often need to communicate with each other to synchronize or to pass data, which can lead to overhead and can be a challenge for performance.

Best Use Cases

Task parallel architectures are best suited to problems that can be broken down into discrete tasks that can run concurrently. This includes:

Complex Computations: Scenarios where a complex problem can be broken down into a number of separate tasks, such as simulations or optimization problems, are a good fit for task parallel architectures.
Real-Time Processing On Diverse Datasets: Task parallel architectures are often used in systems that require real-time processing and low latency, such as stream processing systems.

Apache Flink is an excellent example of a system that uses a task parallel architecture. Flink is designed for stream processing, where real-time results are of utmost importance. It breaks down stream processing into a number of tasks that can be executed in parallel, providing low-latency and high-throughput processing of data streams.

Similarly, Dask is a flexible library for parallel computing in Python that uses task scheduling for complex computations. Dask allows you to parallelize and distribute computation by breaking it down into smaller tasks, making it a popular choice for tasks that go beyond the capabilities of typical data parallel tools.

In the next section, we'll explore the agent actor model, a different approach to managing concurrency and state that opens up new possibilities for parallel computation.

Agent Actor Architectures: Pioneering Concurrent Computations

Agent actor architectures introduce a fundamentally different approach to handling parallel computations, particularly for problems that involve complex, stateful computations. This approach builds on task parallelism with the addition of an actor. An actor is a computational entity that, in response to a message it receives, can concurrently: make local decisions, create more actors, send more messages, and determine how to respond to the next message received. In that sense, agents resemble the workers of a task-distributed or function-distributed system, with the addition of their own state.

The Mechanism

In the agent actor model, actors are the universal primitives of concurrent computation. Upon receiving a message, an actor can change its local state, send messages to other actors, or create new actors. Actors encapsulate their state, avoiding common pitfalls of multithreaded programming such as race conditions. Actor systems are inherently message-driven and can be distributed across many nodes, making them highly scalable.
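
To make the mailbox idea concrete, here is a toy actor built from a thread and a queue. It is a sketch under stated assumptions, not Ray's API: the CounterActor class, its message names, and the count_with_actor helper are all invented for this example.

```python
import queue
import threading


class CounterActor:
    """A tiny actor: private state, a mailbox, and one message loop."""

    def __init__(self):
        self._count = 0                      # state only the actor's loop touches
        self._mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, message, reply_to=None):
        """Enqueue a message; callers never touch the state directly."""
        self._mailbox.put((message, reply_to))

    def _run(self):
        # Messages are handled one at a time, so no lock is needed.
        while True:
            message, reply_to = self._mailbox.get()
            if message == "increment":
                self._count += 1
            elif message == "get":
                reply_to.put(self._count)
            elif message == "stop":
                break

    def join(self):
        self._thread.join()


def count_with_actor(num_senders=4, increments_each=100):
    actor = CounterActor()

    def sender():
        for _ in range(increments_each):
            actor.send("increment")

    threads = [threading.Thread(target=sender) for _ in range(num_senders)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    reply = queue.Queue()
    actor.send("get", reply_to=reply)
    result = reply.get()
    actor.send("stop")
    actor.join()
    return result
```

Calling count_with_actor() returns 400: four threads send 100 increments each, and because the actor applies them sequentially from its mailbox, none are lost even though no lock guards the counter.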

Advantages

Concurrent State Management: Actors provide a safe way to handle mutable state in a concurrent system. Since each actor processes messages sequentially and has isolated state, there is no need for locks or other synchronization mechanisms.
Scalability: Actor systems are inherently distributed and can easily scale out across many nodes.
Fault Tolerance: Actor systems can be designed to be resilient with self-healing capabilities. If an actor fails, it can be restarted, and messages it was processing can be redirected to other actors.

Disadvantages

Complexity: Building systems with the actor model can be more complex than traditional paradigms due to the asynchronous and distributed nature of actors.
Message Overhead: Communication between actors is done with messages, which can lead to overhead, especially in systems with a large number of actors.

Best Use Cases

Agent actor architectures are best suited for problems that involve complex, stateful computations and require high levels of concurrency. This includes:

Real-time Systems: The actor model is well suited for real-time systems where you need to process high volumes of data concurrently, such as trading systems or real-time analytics.
Distributed Systems: The actor model can be a good fit for building distributed systems where you need to manage state across multiple nodes, like IoT systems or multiplayer online games.

Ray is an example of a system that employs the actor model. It was designed to scale Python applications from a single node to a large cluster, and it's commonly used for machine learning tasks, which often require complex, stateful computations.

As we've seen, the landscape of data processing architectures is rich and diverse, with each model offering unique strengths and potential challenges. Whether it's data parallel, task parallel, or agent actor, the choice of architecture will depend largely on the nature of the data workload and the specific requirements of the system you're building.

Reasoning about Streaming vs Batch with a Case Study from GitHub

Zander — Thu, 15 Jun 2023 20:26:32 +0000

If you prefer videos, check out Zander's talk at Data Council 2023, "When to Move from Batch to Streaming and how to do it without hiring an entirely new team".

The world of data processing is undergoing a significant shift, moving towards real-time processing. Despite an increase in understanding that shifting workloads to real-time can increase ROI and lower costs, there isn't consensus in the industry around how to best transition workloads to real-time and what the best tools are for different types of real-time workloads. While traditional analytical tools such as data warehouses, business intelligence layers, and metrics are widely accepted and understood, the concept of real-time data processing and the technologies that enable it are not as widely recognized or agreed upon.

In this post, we aim to demystify real-time data processing, discussing its relevance within an organization, the different types of real-time workloads, and some real-world examples from my time at GitHub. But first, let's clarify some definitions.

Understanding Real-Time and Stream Processing

What is Real-Time? "Real-Time" refers to anything that is perceived to happen in real time by a human - an admittedly fuzzy definition. Quantitatively, this usually refers to processes that happen in the sub-second realm. Interestingly, based on this definition, real-time data processing can actually occur with both batch and stream processing technologies depending on the end-to-end latency.

Stream processing refers to processing a single datum at a time, flowing in a continuous stream, while batch processing is when you gather a batch of data and process it all at once. By reducing the size of the batch progressively, we can edge closer to real-time processing. This is precisely what technologies like Spark's structured streaming do with micro-batches.
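
The difference between the two styles, and the way shrinking the batch approaches streaming, can be sketched with plain Python generators. The function names are hypothetical, and this is not how Spark implements micro-batches; it only illustrates the processing granularity.

```python
from itertools import islice


def process_stream(events, handle):
    """Stream processing: handle each datum as soon as it arrives."""
    for event in events:
        yield handle(event)


def process_micro_batches(events, batch_size, handle_batch):
    """Micro-batching: collect batch_size items, then process them together."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        yield handle_batch(batch)
```

For example, list(process_micro_batches(range(5), 2, sum)) produces [1, 5, 4], one result per batch, while list(process_stream(range(5), lambda x: x * 2)) produces [0, 2, 4, 6, 8], one result per item. Setting batch_size to 1 makes the micro-batch version emit a result for every single item, which is the sense in which smaller batches edge closer to stream processing.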

Now that we have some definitions out of the way, let's dive into real-time processing and the different types of real-time workloads.

The Relevance of Real-Time Data Processing

In our day-to-day lives, we're constantly receiving and processing information in real-time. Consider driving a car - an activity that requires processing multiple inputs and making decisions in real-time. If we were to approach driving in a batch processing manner, waiting to gather information for a duration and then trying to forecast the next 15 seconds, it would likely end in disaster. As another example you can imagine a sport like basketball. For each moment in time, the players are receiving tens or hundreds of inputs and they are reacting to them in real-time. If we also imagined a non-real-time version of this, it might not be so exciting to watch or play as each player waited for a certain number of seconds while receiving input and then tried to react to those inputs.

These examples help to highlight why we might choose to process things in real-time. In the context of driving, we're making decisions that could potentially be a matter of life or death. And in our basketball example, the real-time processing elevates the user experience. However, while these examples provide some understanding, they don't necessarily help us generalize the concept.

Types of Real-Time Workloads

We can broadly categorize real-time processing into two types of workloads: analytical and operational.

Analytical workloads

Analytical workloads require low latency, freshness, and the capability of retrieval at scale. Real-time analytical workloads must be queryable. A good example of this is LinkedIn's profile view notification. When you click into the profile view notification, you're taken to a page that shows your profile views history, all the way up to the most recent data. This demonstrates the freshness of data and the ability to query it as you can filter and interact with the data, querying the freshest data.

Another example of a real-time analytical workload is an Instacart order. When you place an order on Instacart, you can go into your order and see the updated Estimated Time of Arrival (ETA). This is another instance of an analytical real-time workload where the user is interacting with analytical data in real-time.

Operational workloads

On the other hand, operational workloads require low latency and freshness, but they also need to be reactive. This means that some of the decision-making or business logic is embedded inline in the system. For example, in a streaming use case, this would be inside the stream processor. The data is received, transformed, and then a decision is made in an online fashion. Bytewax is a great example of a framework that can be used to make real-time decisions for operational workloads.
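
As a rough sketch of what "inline" means, the loop below receives each event, updates some state, and emits a decision in the same step. The rule (flag an amount more than three times the account's running average) and all names are hypothetical and chosen only for illustration; this is plain Python, not Bytewax code.

```python
from collections import defaultdict


def decide_inline(transactions, ratio=3.0):
    """Yield (account, amount, decision) as each transaction arrives."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for account, amount in transactions:
        average = totals[account] / counts[account] if counts[account] else None
        # The decision is made here, inside the processing loop,
        # before the next event is read.
        if average is not None and amount > ratio * average:
            decision = "review"
        else:
            decision = "allow"
        totals[account] += amount
        counts[account] += 1
        yield account, amount, decision
```

Feeding it [("a", 10.0), ("a", 12.0), ("a", 100.0), ("b", 500.0)] yields the decisions allow, allow, review, allow: the third event is flagged as soon as it is seen, and the first event for account "b" is allowed because there is no history to compare against.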

A good example of operational processing is fraud detection. The fraud detection system takes all the inputs in real time and makes a decision about them without a human in the loop. It then makes a decision on what to do and communicates with the user to confirm if its suspicion of fraud is correct.

Another example in financial markets is high-frequency trading. The software system consumes inputs from a variety of different data sources, processes them in real time, and then makes a decision whether to buy or sell. The speed of making that decision is a key factor in this context.

Analytical vs Operational

One more aspect I wanted to touch on here is the difference between having a human in the loop versus having a machine in the loop. If we look at different examples across analytical and operational workloads, there's a concept of the human being more involved in the analytical and less or not involved in the operational.

To summarize, if there is a situation where you believe there's value to be derived and there's a human in the loop, there's probably a subset of tools within the real-time space that fall under the analytical workload. If you're building something like an algorithmic trading system, where you believe that there's no requirement for a human in the loop, you're more likely to fall under the operational category, and you should look at tools, like Bytewax, that support operational processing.

Case Studies: GitHub's Real-Time Data Processing Decisions

Let's make things more concrete by discussing a couple of case studies involving decisions we made at GitHub concerning real-time data processing.

Trending Repositories and Developers: A Batch Processing Approach

The team I was a part of at GitHub was responsible for several data products that were featured on github.com, including Trending Repositories and Trending Developers. These features were located on the GitHub Explore page and aimed to identify trending repositories and developers based on a variety of metrics, such as stars, forks, and views. We had access to this data in real time through a streaming platform (Kafka) managed by another team.

Although we had the capacity to implement these as real-time features, we decided against it. Our team primarily consisted of data scientists and machine learning engineers who hadn't worked with streaming platforms or stream processors before. Moreover, these features were new products, and we didn't know how impactful they would be or whether users would find them valuable and engage with them repeatedly.

Instead of implementing these features to use real-time data, we decided to process this data in a batch format. We would run nightly queries against Presto, where the data landed from Kafka, then store the processed data in a MySQL database for retrieval from github.com. These features were not real-time workloads, but they could have been. If it was determined they would be useful as real-time data products, they would serve as excellent examples of analytical use cases.

Star Spam Detection: A Real-Time Processing Solution

Another task we undertook was star spam detection. The concept of "stars" on GitHub repositories is used as a proxy to gauge the health and utility of the project. If we were unable to detect star spammers, it would degrade the platform's value for users, potentially leading to a downward spiral in the platform's overall value.

We decided to tackle this problem in a real-time manner to limit the exposure of the users and potential degradation of the platform from star spam. The data was available in Kafka and could therefore be consumed as it was available. Based on certain criteria, users could be flagged as spammers and then action taken. Once a user was flagged, they were submitted for human review to decide on what the next steps should be. This is an excellent example of operational processing.

The Impact of Real-Time Processing Decisions

The point is that the decision to implement real-time processing can have significant impacts on the return on investment for a project, and this correlation should be carefully considered. If we had decided to make the trending feature real-time, it would have been even more necessary to maintain the platform's value by detecting star spam as close to real-time as possible.

The value of data often degrades over time, and while it's usually depicted as a sharp decline (see graph on the left), most projects or data tend to follow more of an S-curve (on the right). After a certain point, the return or value of the data caps out with respect to latency. In these case studies, neither project saw an exponential increase in return on investment as latency was reduced, and we were able to tackle star spammers on a timescale of hours rather than milliseconds. This demonstrates that not all data projects need to move towards zero latency to provide significant value.

If you are interested in moving some of your workloads to real-time and are not sure where to start, please reach out to us in our Slack channel. We would be happy to help you figure out if the value is there and where to start.

Real-Time Anomaly Detection Visualization with Bytewax and Rerun

Zander — Thu, 13 Apr 2023 22:39:53 +0000

Rerun's open sourcing in February marked a significant step for those looking for accessible yet potent Python visualization libraries. Why is visualization important? Companies like Scale.ai, Weights & Biases, and Hugging Face have streamlined deep learning by addressing dataset labeling, experiment tracking, and pre-trained models. However, a void still exists in rapid data capture and visualization.

Many companies develop in-house data visualization solutions but often end up with suboptimal tools due to high development costs. Moreover, Python visualization on streaming data is a problem that is not solved well either, leading to JavaScript-based solutions in notebooks. Rerun leverages a Python interface into a high-performance Rust visualization engine (much like Bytewax!) that makes it dead easy to analyze streaming data.

In this blog post, we will explore how to use Bytewax and Rerun to visualize real-time streaming data in Python and create a real-time anomaly detection visualization. We chose anomaly detection, a.k.a. outlier detection, because it is a critical component in numerous applications, such as cybersecurity, fraud detection, and monitoring of industrial processes. Visualizing these anomalies in real time can aid in quickly identifying potential issues and taking necessary actions to mitigate them.

For those eager to dive in, check out our end-to-end Python solution on our GitHub. Don't forget to star Bytewax!

Overview

Here is what we'll cover:

We will navigate the code and briefly discuss top-level entities
Then we will discuss each step of the dataflow in greater detail: initialization of our dataflow, input source, stateful anomaly detection, data visualization & output, and how to spawn a cluster
Finally, we will learn how to run it and see the beautiful visualization, all in Python <3
As a bonus, we will think about other use cases

Let's go!

Set up your environment

This blog post is based on the following versions of Bytewax and Rerun:

bytewax==0.15.1
rerun-sdk==0.4.0

Rerun and Bytewax are installable as

pip install rerun-sdk
pip install bytewax

Follow Bytewax for updates as we are baking a new version that will ease the development of data streaming apps in Python further.

Code

The solution is relatively compact, so we copy the entire code example here. Please feel free to skip this big chunk if it looks overwhelming; we will discuss each function later.

import random
# pip install rerun-sdk
import rerun as rr

from time import sleep
from datetime import datetime

from bytewax.dataflow import Dataflow
from bytewax.execution import spawn_cluster
from bytewax.inputs import ManualInputConfig, distribute
from bytewax.outputs import ManualOutputConfig

rr.init("metrics")
rr.spawn()

start = datetime.now()

def input_builder(worker_index, worker_count, resume_state):
    assert resume_state is None
    keys = ["1", "2", "3", "4", "5", "6"]
    # Materialize this worker's share so it can be iterated on every pass.
    this_workers_keys = list(distribute(keys, worker_index, worker_count))

    for _ in range(1000):
        # Emit only this worker's keys so several workers do not duplicate data.
        for key in this_workers_keys:
            value = random.randrange(0, 10)
            if random.random() > 0.9:
                value *= 2.0
            yield None, (key, (key, value, (datetime.now() - start).total_seconds()))
            sleep(random.random() / 10.0)

class ZTestDetector:
    """Anomaly detector.

    Use with a call to flow.stateful_map().

    Looks at how many standard deviations the current item is away
    from the mean (Z-score) of the last 10 items. Mark as anomalous if
    over the threshold specified.
    """

    def __init__(self, threshold_z):
        self.threshold_z = threshold_z

        self.last_10 = []
        self.mu = None
        self.sigma = None

    def _push(self, value):
        self.last_10.insert(0, value)
        del self.last_10[10:]

    def _recalc_stats(self):
        last_len = len(self.last_10)
        self.mu = sum(self.last_10) / last_len
        sigma_sq = sum((value - self.mu) ** 2 for value in self.last_10) / last_len
        self.sigma = sigma_sq**0.5

    def push(self, key__value__t):
        key, value, t = key__value__t
        is_anomalous = False
        # Compare against the statistics of the previous values only.
        if self.mu and self.sigma:
            is_anomalous = abs(value - self.mu) / self.sigma > self.threshold_z

        self._push(value)
        self._recalc_stats()
        rr.log_scalar(f"temp_{key}/data", value, color=[155, 155, 155])
        if is_anomalous:
            rr.log_point(f"3dpoint/anomaly/{key}", [t, value, float(key) * 10], radius=0.3, color=[255,100,100])
            rr.log_scalar(
                f"temp_{key}/data/anomaly",
                value,
                scattered=True,
                radius=3.0,
                color=[255, 100, 100],
            )
        else:
            rr.log_point(f"3dpoint/data/{key}", [t, value, float(key) * 10], radius=0.1)

        return self, (value, self.mu, self.sigma, is_anomalous)

def output_builder(worker_index, worker_count):
    def inspector(input):
        metric, (value, mu, sigma, is_anomalous) = input
        print(
            f"{metric}: "
            f"value = {value}, "
            f"mu = {mu:.2f}, "
            f"sigma = {sigma:.2f}, "
            f"{is_anomalous}"
        )

    return inspector

if __name__ == '__main__':
    flow = Dataflow()
    flow.input("input", ManualInputConfig(input_builder))
    # ("metric", value)
    flow.stateful_map("AnomalyDetector", lambda: ZTestDetector(2.0), ZTestDetector.push)
    # ("metric", (value, mu, sigma, is_anomalous))
    flow.capture(ManualOutputConfig(output_builder))
    spawn_cluster(flow)

The provided code demonstrates how to create a real-time anomaly detection pipeline using Bytewax and Rerun. Let's break down the essential components of this code:

input_builder: This function generates random metrics simulating real-world data streams. It generates data points with a small chance of having an anomaly (values doubled).
ZTestDetector: This class implements an anomaly detector using the Z-score method. It maintains the mean and standard deviation of the last 10 values and marks a value as anomalous if its Z-score is greater than a specified threshold.
output_builder: This function is used to define the output behavior for the data pipeline. In this case, it prints the metric name, value, mean, standard deviation, and whether the value is anomalous.
Dataflow: The main part of the code constructs the dataflow using Bytewax, connecting input_builder, ZTestDetector, and the output builder.
Rerun visualization: The Rerun visualization is integrated into the ZTestDetector class. The rr.log_scalar and rr.log_point functions are used to plot the data points and their corresponding anomaly status.

Now, with an understanding of the code's main components, let's discuss how the visualization is created step by step.

Building the Dataflow

To create a dataflow pipeline, you need to:

Initialize a new dataflow with flow = Dataflow().
Define the input source using flow.input("input", ManualInputConfig(input_builder)).
Apply the stateful anomaly detector using flow.stateful_map("AnomalyDetector", lambda: ZTestDetector(2.0), ZTestDetector.push).
Configure the output behavior with flow.capture(ManualOutputConfig(output_builder)).
Finally, spawn a cluster to execute the dataflow with spawn_cluster(flow). Passing proc_count, for example spawn_cluster(flow, proc_count=3), runs it on several worker processes.

The resulting dataflow reads the randomly generated metric values from input_builder, passes them through the ZTestDetector for anomaly detection, and outputs the results using the output_builder function. Let's clarify the details for each step.

`input_builder` function

The input_builder function serves as an alternative input source for the dataflow pipeline, generating random metric values in a distributed manner across multiple workers. It accepts three parameters: worker_index, worker_count, and resume_state.

def input_builder(worker_index, worker_count, resume_state):
    assert resume_state is None
    keys = ["1", "2", "3", "4", "5", "6"]
    # Materialize this worker's share so it can be iterated on every pass.
    this_workers_keys = list(distribute(keys, worker_index, worker_count))

    for _ in range(1000):
        # Emit only this worker's keys so several workers do not duplicate data.
        for key in this_workers_keys:
            value = random.randrange(0, 10)
            if random.random() > 0.9:
                value *= 2.0
            yield None, (key, (key, value, (datetime.now() - start).total_seconds()))
            sleep(random.random() / 10.0)

worker_index: The index of the current worker in the dataflow pipeline.
worker_count: The total number of workers in the dataflow pipeline.
resume_state: The state of the input source from which to resume. In this case, it is asserted to be None, indicating that the input source does not support resuming from a previous state.

Here's a step-by-step description of the input_builder function:

Assert that resume_state is None.
Define a list of keys representing the metrics.
Distribute the keys among the workers using the distribute function imported from bytewax.inputs. The distributed keys for the current worker are assigned to this_workers_keys.
Iterate 1,000 times and, for each iteration, iterate through the list of keys:
- Generate a random integer from 0 to 9 (random.randrange(0, 10) excludes the upper bound).
- With a 10% probability, double the value to simulate an anomaly.
- Yield a 2-tuple. The first element is None: in this version of Bytewax that slot carries the state needed to resume the input, and this source keeps none. The second element is the item itself, (key, (key, value, elapsed)), where elapsed is the number of seconds since the module-level start timestamp.
- Introduce a sleep time between each generated value to simulate real-time data generation.
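
To picture how a list of keys might be shared between workers, here is a round-robin sketch. It is an assumption made for illustration, and the helper name distribute_round_robin is hypothetical; the actual bytewax.inputs.distribute function may assign elements differently.

```python
def distribute_round_robin(elements, index, count):
    """Return the elements a given worker would own under round-robin assignment."""
    assert 0 <= index < count, f"index must be in [0, {count}); got {index}"
    return [x for i, x in enumerate(elements) if i % count == index]


if __name__ == "__main__":
    keys = ["1", "2", "3", "4", "5", "6"]
    for worker_index in range(3):
        print(worker_index, distribute_round_robin(keys, worker_index, 3))
```

With three workers, worker 0 would own keys "1" and "4", worker 1 keys "2" and "5", and worker 2 keys "3" and "6". Every key belongs to exactly one worker, so each metric is generated once no matter how many workers run.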

The input_builder function is used in the dataflow as the input source with the following line of code:

flow.input("input", ManualInputConfig(input_builder))

This line tells the dataflow to use the input_builder function to generate the input data for the pipeline.

`ZTestDetector` Class

The ZTestDetector class is an anomaly detector that uses the Z-score method to identify whether a data point is anomalous or not. The Z-score is the number of standard deviations a data point is from the mean of a dataset. If a data point's Z-score is higher than a specified threshold, it is considered anomalous.

The class has the following methods:

__init__(self, threshold_z): The constructor initializes the ZTestDetector with a threshold Z-score value. It also initializes the last 10 values list (self.last_10), mean (self.mu), and standard deviation (self.sigma).
_push(self, value): This private method is used to update the list of last 10 values with the new value. It inserts the new value at the beginning of the list and removes the oldest value, maintaining the list length at 10.
_recalc_stats(self): This private method recalculates the mean and standard deviation based on the current values in the self.last_10 list.
push(self, key__value__t): This public method takes a tuple containing a key, a value, and a timestamp as input. It calculates the Z-score for the value, updates the last 10 values list, and recalculates the mean and standard deviation. It also logs the data point and its anomaly status using Rerun's visualization functions. Finally, it returns the updated instance of the ZTestDetector class and a tuple containing the value, mean, standard deviation, and anomaly status.
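To make the method descriptions above concrete, here is a minimal sketch of the same rolling Z-score idea with the Rerun logging left out. The class name RollingZScore, the window parameter, and the simplified push(value) signature are assumptions for illustration; this is not the article's ZTestDetector.

```python
import math


class RollingZScore:
    """Hypothetical, logging-free stand-in for the detector described above."""

    def __init__(self, threshold_z, window=10):
        self.threshold_z = threshold_z
        self.window = window
        self.values = []  # most recent value first
        self.mu = None
        self.sigma = None

    def push(self, value):
        # Score the new value against the statistics of the previous window.
        is_anomalous = False
        if self.mu is not None and self.sigma:
            is_anomalous = abs(value - self.mu) / self.sigma > self.threshold_z
        # Then fold the value into the window and refresh the statistics.
        self.values.insert(0, value)
        del self.values[self.window:]
        n = len(self.values)
        self.mu = sum(self.values) / n
        self.sigma = math.sqrt(sum((v - self.mu) ** 2 for v in self.values) / n)
        # Return (state, output), the shape a stateful map step expects.
        return self, (value, self.mu, self.sigma, is_anomalous)


detector = RollingZScore(threshold_z=2.0)
results = [detector.push(v)[1] for v in [4, 6, 4, 6, 4, 6, 4, 6, 20]]
flags = [is_anomalous for (_, _, _, is_anomalous) in results]
```

In this run only the final value, 20, is flagged, since it sits far outside the spread of the alternating 4s and 6s that precede it.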

The ZTestDetector class is used in the dataflow pipeline as a stateful map with the following code:

flow.stateful_map("AnomalyDetector", lambda: ZTestDetector(2.0), ZTestDetector.push)

This line tells the dataflow to apply the ZTestDetector with a Z-score threshold of 2.0 and use the push method to process the data points.

Visualizing Anomalies

To visualize the anomalies, the ZTestDetector class logs the data points and their corresponding anomaly status using Rerun's visualization functions. Specifically, rr.log_scalar is used to plot a scalar value, while rr.log_point is used to plot 3D points.

The following code snippet shows how the visualization is created:

rr.log_scalar(f"temp_{key}/data", value, color=[155, 155, 155])
if is_anomalous:
    rr.log_point(f"3dpoint/anomaly/{key}", [t, value, float(key) * 10], radius=0.3, color=[255,100,100])
    rr.log_scalar(
        f"temp_{key}/data/anomaly",
        value,
        scattered=True,
        radius=3.0,
        color=[255, 100, 100],
    )
else:
    rr.log_point(f"3dpoint/data/{key}", [t, value, float(key) * 10], radius=0.1)

Here, we first log a scalar value representing the metric. Then, depending on whether the value is anomalous, we log a 3D point with a different radius and color. Anomalous points are logged in red with a larger radius, while non-anomalous points are logged with a smaller radius.

`output_builder` Function

The output_builder function is used to define the output behavior for the data pipeline. In this specific example, it is responsible for printing the metric name, value, mean, standard deviation, and whether the value is anomalous. The function takes two arguments: worker_index and worker_count. These arguments help the function understand the index of the worker and the total number of workers in the dataflow pipeline.

Here's the definition of the output_builder function:

def output_builder(worker_index, worker_count):
    def inspector(input):
        metric, (value, mu, sigma, is_anomalous) = input
        print(
            f"{metric}: "
            f"value = {value}, "
            f"mu = {mu:.2f}, "
            f"sigma = {sigma:.2f}, "
            f"{is_anomalous}"
        )

    return inspector

This function is a higher-order function, which means it returns another function, here called inspector. The inspector function unpacks each output tuple and prints it. The output builder is later passed to the dataflow when configuring the output behavior:

flow.capture(ManualOutputConfig(output_builder))
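The builder-returns-a-function pattern is easy to try outside a dataflow. The sketch below is a variation for illustration only: the sink parameter is an assumption added so the formatted lines can be collected in a list instead of printed, and it is not part of the builder shown above.

```python
def output_builder(worker_index, worker_count, sink=print):
    # `sink` is a hypothetical extra parameter used here for demonstration.
    def inspector(item):
        metric, (value, mu, sigma, is_anomalous) = item
        sink(
            f"{metric}: value = {value}, mu = {mu:.2f}, "
            f"sigma = {sigma:.2f}, {is_anomalous}"
        )

    return inspector


lines = []
inspect = output_builder(0, 1, sink=lines.append)  # built once per worker
inspect(("3", (18.0, 4.5, 2.87, True)))            # called once per item
```

After this runs, lines holds a single formatted string for metric "3".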

Running the Dataflow

Bytewax can run as a single process or across multiple processes. This dataflow has been written to scale across multiple processes, but we will start by running it as a single process with the spawn_cluster function from Bytewax's execution module.

spawn_cluster(flow)

To increase the parallelism, we would pass a larger process count as an argument, for example spawn_cluster(flow, proc_count=3).

To run the provided code we can simply run it as a Python script, but first we need to install the dependencies.

Create a new file in the same directory as dataflow.py and name it requirements.txt. Add the following content to the requirements.txt file:

bytewax==0.15.1
rerun-sdk==0.4.0

Open a terminal in the directory containing the requirements.txt and dataflow.py files.

Install the dependencies using the following command:

pip install -r requirements.txt

And run the dataflow!

python dataflow.py

Expanding the Use Case

While the provided code serves as a basic example of real-time anomaly detection, you can expand this pipeline to accommodate more complex scenarios. For example:

Incorporate real-world data sources: Replace the input_builder function with a custom input source that reads data from a real-world source, such as IoT sensors, log files, or streaming APIs.
Implement more sophisticated anomaly detection techniques: You can replace the ZTestDetector class with other stateful anomaly detection methods, such as moving average, exponential smoothing, or machine learning-based approaches.
Customize the visualization: Enhance the Rerun visualization by adding more data dimensions, adjusting the color schemes, or modifying the plot styles to better suit your needs.
Integrate with alerting and monitoring systems: Instead of simply printing the anomaly results, you can integrate the pipeline with alerting or monitoring systems to notify the appropriate stakeholders when an anomaly is detected.
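As one example of swapping in a different stateful detector, the sketch below replaces the fixed 10-value window with exponentially weighted estimates of the mean and variance. The class name EwmaDetector and the alpha parameter are assumptions for illustration. It keeps the same push contract (take a (key, value, t) tuple, return the state and a result tuple) so it could stand in for the Z-test detector in a stateful map step, minus the Rerun logging.

```python
import math


class EwmaDetector:
    """Hypothetical detector based on exponential smoothing."""

    def __init__(self, threshold_z, alpha=0.3):
        self.threshold_z = threshold_z
        self.alpha = alpha  # weight given to the newest value
        self.mu = None
        self.var = 0.0

    def push(self, key__value__t):
        key, value, t = key__value__t
        is_anomalous = False
        if self.mu is None:
            # The first value only seeds the running mean.
            self.mu = float(value)
        else:
            sigma = math.sqrt(self.var)
            if sigma > 0:
                is_anomalous = abs(value - self.mu) / sigma > self.threshold_z
            # Incremental update of the exponentially weighted mean and variance.
            diff = value - self.mu
            incr = self.alpha * diff
            self.mu += incr
            self.var = (1 - self.alpha) * (self.var + diff * incr)
        return self, (value, self.mu, math.sqrt(self.var), is_anomalous)


detector = EwmaDetector(threshold_z=2.0, alpha=0.5)
stream = [4, 6, 4, 6, 4, 6, 20]
flags = [detector.push(("1", v, float(i)))[1][3] for i, v in enumerate(stream)]
```

A larger alpha makes the detector adapt faster to level shifts, at the cost of forgetting older behavior sooner.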

By customizing and extending the dataflow pipeline, you can create a powerful real-time anomaly detection and visualization solution tailored to your specific use case. The combination of Bytewax and Rerun offers a versatile and scalable foundation for building real-time data processing and visualization systems.

Conclusion

This blog post has demonstrated how to use Bytewax and Rerun to create a real-time anomaly detection visualization. By building a dataflow pipeline with Bytewax and integrating Rerun's powerful visualization capabilities, we can monitor and identify anomalies in our data as they occur.

Using Language Models in a Streaming Context to Understand Financial Markets

Zander — Thu, 16 Mar 2023 08:09:14 +0000

For those who are eager to dive into the code, it's available:

bytewax / news-analyzer

Analyze financial news in real-time with Machine Learning

See FinancialNewsAnalysis.ipynb

Effective analysis of news is crucial for understanding the world, especially when it comes to financial markets. Being able to quickly identify significant events, such as a major corporation being hacked and sensitive customer data being compromised, can enable you to respond rapidly and either capitalize on opportunities or minimize losses. In this blog post, we'll delve into how Bytewax and large language models can be leveraged to analyze financial news in real time, providing you with the ability to respond to breaking news more effectively. We need to answer at least three questions to implement our little project successfully:

Where do we get the data?
How do we analyze it?
How do we access the data source and perform analysis in real-time?

Data Source

For the data source used in this demo, we will use the Alpaca news API, which provides websocket access to news articles from Benzinga. To set up an account and create an API key and secret, you can follow the Alpaca documentation. You can use any websocket as a data source. A future follow-up will look at how we can build our own real-time news aggregation pipeline for analysis.

Content Analysis

We're going to leverage Large Language Models (LLMs) to analyze news articles, and the first place that comes to mind when looking for LLMs is Hugging Face. Hugging Face is a company that provides a hub where researchers can release models and datasets, which other researchers and developers can then use via hosted model endpoints and the Transformers library. First, we need to perform sentiment analysis on the headline, which can quickly provide valuable insights. For this, we'll use a fine-tuned BERT model called FinancialBERT. Then we will summarize the content of the article with a fine-tuned BART model. Both can be found on huggingface.co. We will also cover how to use the Transformers library to run the models.

Real-Time Data Processing with Bytewax

If you are not familiar with Bytewax, it is a stateful stream processor that can be used to analyze data in real time, with support for stateful operators like windowing and aggregation. Bytewax is especially suitable for workflows that leverage the Python ecosystem of tools, from data crunching tools like Pandas to machine learning-focused tools like Hugging Face Transformers. It also supports a variety of data sources, including websockets.

Let's get started analyzing the news in real-time. First things first! Dependencies:

!pip install bytewax transformers torch sentencepiece websocket-client

Constructing Our Dataflow

A Bytewax dataflow is a sequence of steps that transform data from an input source and then write it to an output. At each step an operator is used to control the flow of data; whether it should be filtered, aggregated or accumulated. Developers writing dataflows will write Python code that will do the data transformation at each step.

Input

To begin the dataflow, we'll create an input using the Alpaca websocket, which we'll use to subscribe to articles on multiple tickers. It's important to note that you'll require an Alpaca API key and secret, and it's recommended to store them as environment variables.

import os
import json

from bytewax.dataflow import Dataflow
from bytewax.inputs import ManualInputConfig, distribute

from websocket import create_connection

API_KEY = os.getenv("API_KEY")
API_SECRET = os.getenv("API_SECRET")

ticker_list = ["*"]


def input_builder(worker_index, worker_count, resume_state):
    state = resume_state or None
    worker_tickers = list(distribute(ticker_list, worker_index, worker_count))
    print({"subscribing to": worker_tickers})

    def news_input(worker_tickers, state):
        ws = create_connection("wss://stream.data.alpaca.markets/v1beta1/news")
        print(ws.recv())
        ws.send(json.dumps({"action":"auth","key":f"{API_KEY}","secret":f"{API_SECRET}"}))
        print(ws.recv())
        ws.send(json.dumps({"action":"subscribe","news":worker_tickers}))
        print(ws.recv())

        while True:
            # To run without the API, comment out the ws.recv() line below and
            # assign `articles` a hard-coded list of article dicts instead, in
            # the same shape as the sample payload shown after this code block.
            articles = json.loads(ws.recv())
            for article in articles:
                yield state, (article["source"], article)

    return news_input(worker_tickers, state)


flow = Dataflow()
flow.input("inp", ManualInputConfig(input_builder))

The resulting data returned from the news API looks like the JSON shown here.

[{"T":"n","id":31248067,"headline":"Tesla Vehicles Could Be Banned From Leaving During A Hurricane In This State","summary":"A lawmaker in one American state could make it hard for owners of electric vehicles to get out of the state in the event of a hurricane. Here’s the potential law and why it’s important.","author":"Chris Katje","created_at":"2023-03-07T22:58:40Z","updated_at":"2023-03-07T22:58:40Z","url":"https://www.benzinga.com/news/23/03/31248067/tesla-vehicles-could-be-banned-from-leaving-during-a-hurricane-in-this-state","content":"\u003cp\u003eA lawmaker in one American state could make it hard for owners of electric vehicles ... ertical Integration After Investor Day, But Want More From Elon Musk: \u0026#39;Long On Vision, Short On Specifics\u0026#39;\u003c/a\u003e\u003c/em\u003e\u003c/p\u003e\r\n\r\n\u003cp\u003e\u003cem\u003ePhoto:\u0026nbsp;\u003ca href=\"https://www.shutterstock.com/g/hsaduraphotos\"\u003eHenryk Sadura\u003c/a\u003e\u0026nbsp;via Shutterstock\u003c/em\u003e\u003c/p\u003e\r\n\r\n\u003cp\u003e\u003cbr /\u003e\r\n\u0026nbsp;\u003c/p\u003e\r\n","symbols":["TSLA"],"source":"benzinga"}]

We will use this in the next steps in our dataflow to analyze the sentiment and provide a summary.

Managing Duplicates and Updates

When working with news stories from RSS/Atom feeds or news APIs, it's common to receive duplicates as they're created and then updated. To prevent these duplicates from being analyzed multiple times and incurring additional overhead of running ML models on the same story, we'll use the Bytewax operator stateful_map to create a simplified storage layer. We'll store a list of unique identifiers for each news article we encounter. If an article has been seen before, we'll mark it as an update. Otherwise, we'll add the article's ID to the stateful object. To filter out the updates and avoid reclassifying and summarizing them, we'll use the filter operator. Think of this process as the equivalent of checking a database for a unique ID.

def update_articles(articles, news):
    if news['id'] in articles:
        news['update'] = True
    else:
        articles.append(news['id'])
        news['update'] = False
    return articles, news

flow.stateful_map("source_articles", lambda: list(), update_articles)

flow.filter(lambda x: not x[1]['update'])
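The list-based state above works, but every membership check scans the whole list. A set is one possible alternative for long-running streams. The sketch below is a variation for illustration, not the article's code, and it can be exercised without Bytewax by calling the mapper directly; the sample headline is a placeholder.

```python
def update_articles(seen_ids, news):
    # Same contract as the mapper above: take (state, value), return (state, value).
    is_update = news["id"] in seen_ids
    seen_ids.add(news["id"])
    return seen_ids, {**news, "update": is_update}


seen = set()  # in a dataflow the builder would create this, e.g. lambda: set()
seen, first = update_articles(seen, {"id": 31248067, "headline": "sample headline"})
seen, second = update_articles(seen, {"id": 31248067, "headline": "sample headline"})
fresh = [n for n in (first, second) if not n["update"]]
```

The first sighting of an ID passes the filter; the second is marked as an update and dropped.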

Sentiment Analysis

Sentiment analysis is the next step in our process. Our approach involves using a fine-tuned Hugging Face model to analyze the article's headline sentiment. We will be leveraging a BERT model for this purpose. BERT, which stands for Bidirectional Encoder Representations from Transformers, was developed by Google. For a detailed understanding of how this model operates and was trained, you can refer to the model card on Hugging Face or the accompanying research paper. Since we want to analyze each news article independently, the sentiment classification will take place in a map operator. Despite the extensive research that goes into designing novel model architectures and creating training datasets, implementing sentiment analysis is remarkably straightforward. Note that if you're following along in a notebook, the model will take some time to download initially.

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline, AutoModelForSeq2SeqLM

sent_tokenizer = AutoTokenizer.from_pretrained("ahmedrachid/FinancialBERT-Sentiment-Analysis")
sent_model = AutoModelForSequenceClassification.from_pretrained("ahmedrachid/FinancialBERT-Sentiment-Analysis")
sent_nlp = pipeline("sentiment-analysis", model=sent_model, tokenizer=sent_tokenizer)

def sentiment_analysis(ticker__news):
    ticker, news = ticker__news
    sentiment = sent_nlp([news["headline"]])
    news['sentiment'] = sentiment[0]
    print(sentiment[0])
    return (ticker, news)

flow.map(sentiment_analysis)

Article Summarization

After analyzing the article sentiment, we will use a BART (Bidirectional and Auto-Regressive Transformers) model to summarize its content. BART pairs a bidirectional encoder in the style of Google's BERT with an autoregressive decoder in the style of OpenAI's GPT. Despite the significant effort that goes into creating the model, implementing it with the Hugging Face Transformers library is relatively easy. We can build a summarization pipeline and apply it in a map step. To obtain better results, we also add an extra step to this map process that cleans the text before summarizing it.

import re

# Let's create a summarization pipeline
sum_tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
sum_model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")
summarizer = pipeline("summarization", tokenizer=sum_tokenizer, model=sum_model)
tag_re = re.compile(r'(<!--.*?-->|<[^>]*>)')

def summarize(ticker__news):
    ticker, news = ticker__news
    article = news['content']
    article_no_tags = tag_re.sub('', article)
    article_no_tags = article_no_tags.replace("\r", "").replace("\n", "")
    summary = summarizer(article_no_tags, max_length=130, min_length=30, do_sample=False)
    news['bart_summary'] = summary[0]['summary_text']
    print(f"bart summary:{summary[0]['summary_text']}")
    return (ticker, news)

flow.map(summarize)
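The cleaning step can be tried on its own, without loading the model. The sketch below is one possible variation: the function name clean_article is made up for illustration, and decoding HTML entities with html.unescape is an addition that the code above does not perform. The version above removes line breaks outright, while this sketch collapses all whitespace to single spaces, which is a design choice of the sketch.

```python
import html
import re

TAG_RE = re.compile(r"(<!--.*?-->|<[^>]*>)")


def clean_article(raw_html):
    """Hypothetical helper: strip tags, decode entities, normalize whitespace."""
    text = TAG_RE.sub("", raw_html)   # drop HTML tags and comments
    text = html.unescape(text)        # turn &rsquo;, &nbsp;, etc. into characters
    return " ".join(text.split())     # collapse runs of whitespace


sample = "<p>Here&rsquo;s the <strong>potential law</strong>.</p>\r\n\r\n<p>Why it&rsquo;s important.</p>"
cleaned = clean_article(sample)
```

The cleaned string could then be passed to the summarizer in place of article_no_tags.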

Output

With our news analyzed, we can add a capture step to output the modified news object and then run our dataflow. In this instance we write the output to stdout so we can easily view it, but in a production system we could write the results to a downstream Kafka topic or database for further analysis.

If you are following along in a notebook, remember that you have to be authenticated for this to work and will need to set your Alpaca API key and secret.

from bytewax.execution import run_main
from bytewax.outputs import StdOutputConfig

flow.capture(StdOutputConfig())

if __name__ == '__main__':
    run_main(flow)

Wrapping up

While our example is simplified, it showcases the power of Bytewax and Hugging Face's language models. We can easily analyze financial news articles in real time, identify significant events, and make informed decisions. Using the Alpaca news API as our data source, we were able to construct a dataflow that deduplicates stories, classifies the sentiment of each headline, and summarizes the content of each article.

The ease of implementation through Python native Bytewax and the Hugging Face Transformers library makes it accessible for data engineers and researchers to utilize these state-of-the-art language models in their own projects. We hope this blog post serves as a useful guide for anyone looking to leverage real-time news analysis in their financial decision-making process.

How to Analyze Cryptocurrency Orders with Python

Zander — Thu, 16 Jun 2022 16:37:17 +0000

This post was originally published on the Bytewax blog

In this example we are going to walk through how you can maintain a limit order book in real-time in Python. Take advantage of the ability to accrue state and parallelize input with Bytewax, an open source Python library for stateful streaming.

We are going to:

Use websockets to connect to an exchange (Coinbase)
Set up an order book using a snapshot
Update the order book in real-time
Use algorithms to trade crypto and profit. Just kidding, this is left as an exercise for the reader.

To start off, we are going to digress into some concepts around markets, exchanges and orders.

Concepts

Order Book

A Limit Order Book, or just Order Book, is a record of all limit orders that are made. A limit order is an order to buy (bid) or sell (ask) an asset for a given price. This could facilitate the exchange of dollars for shares or, as in our case, they could be orders to exchange cryptocurrencies. On exchanges, the limit order book is constantly changing as orders are placed every fraction of a second. The order book can give a trader insight into the market, whether they are looking to determine liquidity, to create liquidity, design a trading algorithm or maybe determine when bad actors are trying to manipulate the market.

Bid and Ask

In the order book, the ask price is the lowest price that a seller is willing to sell at and the bid price is the highest price that a buyer is willing to buy at. A limit order is different than a market order in that the limit order can be placed with generally 4 dimensions, the direction (buy/sell), the size, the price and the duration (time to expire). A market order, in comparison, has 2 dimensions, the direction and the size. It is up to the exchange to fill a market order and it is filled via what is available in the order book.
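As a small illustration of these definitions, the sketch below keeps a toy order book as two price-to-size dictionaries and reads off the bid, the ask, and the spread between them. All prices and sizes are made up for the example.

```python
# Toy order book: price -> size. The numbers are invented for illustration.
bids = {10101.10: 0.45, 10100.00: 1.20, 10099.50: 3.00}
asks = {10102.55: 0.58, 10103.00: 2.00, 10105.25: 0.75}

best_bid = max(bids)  # highest price a buyer is willing to pay
best_ask = min(asks)  # lowest price a seller is willing to accept
spread = best_ask - best_bid

print(f"bid {best_bid} x {bids[best_bid]}, ask {best_ask} x {asks[best_ask]}")
print(f"spread {spread:.2f}")
```

A market order to buy would be filled starting from the best ask, and a market order to sell starting from the best bid.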

Level 2 Data

An exchange will generally offer a few different tiers of information that traders can subscribe to. These can be quite expensive for some exchanges, but luckily for us, most crypto exchanges provide access to the various levels for free! For maintaining our order book we are going to need at least level2 order information. This gives us granularity of each limit order so that we can adjust the book in real-time. Without the ability to maintain our own order book, the snapshot would get almost instantly out of date and we would not be able to act with perfect information. We would have to download the order book every millisecond, or faster, to stay up to date and since the order book can be quite large, this isn't really feasible.

Alright, let's get started!

Inputs & Outputs

We are going to eventually create a cluster of dataflows where we could have multiple currency pairs running in parallel on different workers. In order to follow this approach, we will use the spawn_cluster method of kicking off our dataflow. To start, we will build a websocket input function that will use the Coinbase Pro websocket URL (wss://ws-feed.pro.coinbase.com) and the Python websocket library to create a connection. Once connected, we can send a message to the websocket subscribing to product_ids (pairs of currencies, such as BTC-USD in this example) and channels (level2 order book data). Finally, since we know there will be some sort of acknowledgement message, we can grab that with ws.recv() and print it out. In this example we are assuming we receive our data in order, and we are going to assign a monotonically increasing epoch to each new message we receive. In Bytewax, an epoch is assigned to data, and Bytewax is then instructed to move that data through the dataflow as far as possible with Emit. At some point, we would consider that epoch complete if there was not any additional data to be received. At that point we would instruct Bytewax to advance to the next epoch with AdvanceTo, which leads to completion of the dataflow processing for that epoch. This allows for flexibility, so you could span an epoch over a time window and complete the processing at the end of the window. For more details on how this works, check the epoch documentation.

import json

from websocket import create_connection


PRODUCT_IDS = ['BTC-USD','ETH-USD']

def ws_input(product_ids):
    ws = create_connection("wss://ws-feed.pro.coinbase.com")
    ws.send(
        json.dumps(
            {
                "type": "subscribe",
                "product_ids": product_ids,
                "channels": ["level2"],
            }
        )
    )
    print(ws.recv())
    epoch = 0
    while True:
        yield Emit(ws.recv())
        epoch += 1
        yield AdvanceTo(epoch)

Now that we have our websocket-based data generator built, we will write an input builder for our dataflow. The input builder is called on each worker and the function will have information about the worker_index and the total number of workers (worker_count). In this case we are designing our input builder to handle multiple workers and multiple currency pairs, so that we can parallelize the input. So we will distribute the currency pairs with the logic in the code below. It should be noted that if you run more than one worker with only one currency pair, the other workers will not be used.

from bytewax import Dataflow, inputs, spawn_cluster

def input_builder(worker_index, worker_count):
    # Round-robin the pairs across workers so that every pair is assigned
    # exactly once, even when the number of pairs is not a multiple of
    # worker_count.
    product_ids = PRODUCT_IDS[worker_index::worker_count]
    return ws_input(product_ids)
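To see how currency pairs land on workers, here is a small standalone sketch that uses a round-robin slice. The helper name pairs_for_worker is made up for illustration, and the third pair is a placeholder added only so the split is uneven.

```python
PRODUCT_IDS = ["BTC-USD", "ETH-USD", "SOL-USD"]  # third pair is a demo placeholder


def pairs_for_worker(product_ids, worker_index, worker_count):
    # Worker i takes items i, i + worker_count, i + 2 * worker_count, ...
    return product_ids[worker_index::worker_count]


# Three pairs over two workers: no pair is dropped, worker 0 takes two.
assignments = {i: pairs_for_worker(PRODUCT_IDS, i, 2) for i in range(2)}

# One pair over two workers: the second worker has nothing to subscribe to.
idle = pairs_for_worker(["BTC-USD"], 1, 2)
```

The second case is the situation described above, where running more workers than there are pairs leaves the extra workers without input.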

Now that we have our input builder finished, we can create our output builder. The output builder is used in conjunction with spawn_cluster. When the dataflow uses the capture operator, the output_builder is called on each worker with information about that worker, and the function it returns then receives each captured item in the format (epoch, data). For this example we keep it simple and just print the result of our dataflow to the terminal.

def output_builder(worker_index, worker_count):
    return print

Building Our Dataflow

Before we get to the exciting part of our order book dataflow we need to prep the data. We initially receive some JSON formatted text, so we will first deserialize the JSON we are receiving from the websocket into a dictionary. Once deserialized, we can reformat the data to be a tuple of the shape (product_id, data). This will permit us to aggregate by the product_id as our key in the next step.

def key_on_product(data):
    return(data['product_id'],data)

flow = Dataflow()
flow.map(json.loads)
# {'type': 'l2update', 'product_id': 'BTC-USD', 'changes': [['buy', '36905.39', '0.00334873']], 'time': '2022-05-05T17:25:09.072519Z'}
flow.map(key_on_product)
# ('BTC-USD', {'type': 'l2update', 'product_id': 'BTC-USD', 'changes': [['buy', '36905.39', '0.00334873']], 'time': '2022-05-05T17:25:09.072519Z'})

Now for the exciting part. The code below is what we are using to:

Construct the order book as two dictionaries, one for asks and one for bids.
Assign a value to the ask and bid price.
For each new order, update the order book and then update the bid and ask prices where required.
Return the bid and ask prices, their respective sizes, and the spread (the difference between the two prices).

The data from the Coinbase Pro websocket begins with a snapshot of the current order book in the format:

{
  "type": "snapshot",
  "product_id": "BTC-USD",
  "bids": [["10101.10", "0.45054140"]...],
  "asks": [["10102.55", "0.57753524"]...]
}

and then each additional message is an update with a new limit order of the format:

{
  "type": "l2update",
  "product_id": "BTC-USD",
  "time": "2019-08-14T20:42:27.265Z",
  "changes": [
    [
      "buy",
      "10101.80000000",
      "0.162567"
    ]
  ]
}

To maintain an order book in real time, we first need to construct an object to hold the orders from the snapshot and then update that object with each additional update. This is a good use case for the stateful_map operator, which can aggregate by key over many epochs. The stateful_map operator aggregates data with a function (the mapper) into an object that you define. The object must be defined via a builder, because stateful_map calls this builder to create a new object for each new key it receives. The mapper must return the updated object, along with the value to emit downstream, so that the state is carried forward.
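
The builder and mapper contract can be illustrated without Bytewax. The function emulate_stateful_map below is a hypothetical stand-in written for this explanation, not the library's implementation, and it ignores epochs and distribution across workers. It only shows that one state object is built per key and that the mapper returns the updated state together with the value to emit.

```python
def emulate_stateful_map(items, builder, mapper):
    """Apply mapper to (key, value) pairs, keeping one state object per key."""
    states = {}
    for key, value in items:
        if key not in states:
            states[key] = builder(key)  # first time we see this key
        states[key], output = mapper(states[key], value)
        yield key, output

def running_total(total, value):
    total += value
    return total, total  # (updated state, emitted value)

items = [("BTC-USD", 1), ("ETH-USD", 10), ("BTC-USD", 2)]
results = list(emulate_stateful_map(items, lambda key: 0, running_total))
print(results)
```

Each key keeps its own running total, which is the same isolation the order book relies on when several currency pairs flow through one dataflow.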

Below we have the code for the OrderBook object, which has a bids and an asks dictionary. These are first used to create the order book from the snapshot, and once it is created we can obtain the first bid price and ask price. The bid price is the highest buy order placed and the ask price is the lowest sell order placed. Once we have determined the bid and ask prices, we will be able to calculate the spread and track that as well.

class OrderBook:
    def __init__(self):
        # if using Python < 3.7 need to use OrderedDict here
        self.bids = {}
        self.asks = {}

    def update(self, data):
        if self.bids == {}:
            self.bids = {float(price):float(size) for price, size in data['bids']}
            # The bid_price is the highest priced buy limit order.
            # since the bids are in order, the first item of our newly constructed bids
            # will have our bid price, so we can track the best bid
            self.bid_price = next(iter(self.bids))
        if self.asks == {}:
            self.asks = {float(price):float(size) for price, size in data['asks']}
            # The ask price is the lowest priced sell limit order.
            # since the asks are in order, the first item of our newly constructed 
            # asks will be our ask price, so we can track the best ask
            self.ask_price = next(iter(self.asks))

With our snapshot processed, for each new message we receive from the websocket we can update the order book, the bid and ask prices, and the spread. When an order is filled or cancelled, the update we receive looks like 'changes': [['buy', '36905.39', '0.00000000']]. An update of size '0.00000000' means that price level should be removed from our book, which may in turn change our bid or ask price. The code below checks whether the level should be removed and otherwise updates it. If a level was removed, it checks whether the bid or ask price needs to be modified.

        else:
            # We receive a list of lists here. Normally it holds only one change,
            # but it could hold more than one, so each change is handled in the loop.
            for update in data['changes']:
                price = float(update[1])
                size = float(update[2])
                if update[0] == 'sell':
                    # first check if the size is zero and needs to be removed
                    if size == 0.0:
                        try:
                            del self.asks[price]
                            # if it was the ask price removed,
                            # update with new ask price
                            if price <= self.ask_price:
                                self.ask_price = min(self.asks.keys())
                        except KeyError:
                            # price level was not in the book, nothing to remove
                            pass
                    else:
                        self.asks[price] = size
                        if price < self.ask_price:
                            self.ask_price = price
                elif update[0] == 'buy':
                    # first check if the size is zero and needs to be removed
                    if size == 0.0:
                        try:
                            del self.bids[price]
                            # if it was the bid price removed,
                            # update with new bid price
                            if price >= self.bid_price:
                                self.bid_price = max(self.bids.keys())
                        except KeyError:
                            # price level was not in the book, nothing to remove
                            pass
                    else:
                        self.bids[price] = size
                        if price > self.bid_price:
                            self.bid_price = price
        return self, {
            'bid': self.bid_price,
            'bid_size': self.bids[self.bid_price],
            'ask': self.ask_price,
            'ask_size': self.asks[self.ask_price],
            'spread': self.ask_price - self.bid_price,
        }

flow.stateful_map(lambda key: OrderBook(), OrderBook.update)
# if using bytewax>0.9.0 --> flow.stateful_map("order_book", lambda key: OrderBook(), OrderBook.update)
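
The removal rule can be seen in isolation with plain dictionaries. This is a simplified sketch rather than the OrderBook class itself: apply_change is a hypothetical helper, the prices are made up, and the best bid and ask are recomputed with max and min each time instead of being tracked incrementally.

```python
def apply_change(levels, price, size):
    """Set or remove one price level; a size of 0 means the level is gone."""
    if size == 0.0:
        levels.pop(price, None)  # ignore levels we never had
    else:
        levels[price] = size

bids = {100.0: 1.5, 99.5: 2.0}
asks = {100.5: 0.7, 101.0: 3.0}

apply_change(bids, 100.0, 0.0)   # best bid filled or cancelled
apply_change(asks, 100.25, 0.4)  # a new, better ask arrives

best_bid, best_ask = max(bids), min(asks)
spread = best_ask - best_bid
print(best_bid, best_ask, spread)
```

Recomputing max and min on every update costs more than tracking the best prices as the class does, but it makes the effect of each change easy to inspect.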

Finishing it up, for fun we can filter for a spread greater than 0.01% of the ask price and then capture the output. Maybe we can profit off of this spread... or maybe not.

The capture operator is designed to use the output builder function that we defined earlier. In this case it will print out to our terminal.

flow.filter(lambda x: x[-1]['spread'] / x[-1]['ask'] > 0.0001)
flow.capture()
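
The filter predicate can be checked by hand. In this sketch wide_spread is a hypothetical named version of the lambda above, the first item uses figures from the sample output further down, and the second item is made up to show a spread that does not pass.

```python
def wide_spread(item, threshold=0.0001):
    """True when the spread is more than `threshold` of the ask price."""
    key, book = item
    return book['spread'] / book['ask'] > threshold

# The dict shape follows what OrderBook.update returns.
sample = ('BTC-USD', {'bid': 38590.1, 'ask': 38596.73, 'spread': 6.63})
tight = ('BTC-USD', {'bid': 38590.1, 'ask': 38590.11, 'spread': 0.01})

print(wide_spread(sample), wide_spread(tight))
```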

Bytewax provides a few different entry points for executing a dataflow. In this example we are using bytewax.spawn_cluster, which allows us to run dataflows in parallel on threads and processes.

if __name__ == "__main__":
    spawn_cluster(flow, input_builder, output_builder, **parse.cluster_args())

That's it, let's run it and verify our output:

python orderbook.py
# for multiple workers --> python orderbook.py -w 2

Eventually, once the spread exceeds the filter threshold, we will see some output similar to what is below.

{"type":"subscriptions","channels":[{"name":"level2","product_ids":["BTC-USD"]}]}
(1046, ('BTC-USD', (38590.1, 0.00945844, 38596.73, 0.01347429, 6.630000000004657)))
(1047, ('BTC-USD', (38590.1, 0.00945844, 38597.13, 0.02591147, 7.029999999998836)))
(1048, ('BTC-USD', (38590.1, 0.00945844, 38597.13, 0.02591147, 7.029999999998836)))

That's it!

We would love to see if you can build on this example. Feel free to share what you've built in our community Slack channel.
